reset history

There is about four weeks' worth of history, the interesting parts of which I've documented in `CONTRIBUTING.md`. I'm now throwing this history away because there is a lot of messing with data files in there that bloats the repo unnecessarily, and this is my last chance to get rid of that bloat before other people start pulling it.
2025-07-03 06:16:01 -07:00 · 2021-04-18 14:13:28 +02:00 · 2021-04-18 14:13:28 +02:00 · 5ceeec3acc
commit 5ceeec3acc
26 changed files with 195162 additions and 0 deletions
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -0,0 +1,153 @@
+# Contributing
+
+Issues and PRs are welcome, but I cannot give any specific guarantees about how fast I'll get to them. Some specific remarks:
+
+* If the copy of JMdict is outdated and you need a newer version, open an issue and I'll make a new release with a newer
+  copy for you. Please don't send a PR for this; I have no way to verify the diff and will do the import myself anyway.
+
+# Explanations
+
+## Payload structure
+
+The obvious idea would be to have `build.rs` generate a bunch of code like this...
+
+```rust
+// many fields elided for brevity
+static ENTRIES: &[Entry] = &[
+  Entry {
+    sequence_number: 1000150,
+    kanji_elements: &[
+      KanjiElement {
+        text: "ＲＳ２３２ケーブル",
+      },
+    ],
+    reading_elements: &[
+      ReadingElement {
+        text: "アールエスにさんにケーブル",
+      },
+    ],
+    senses: &[
+      Sense {
+        parts_of_speech: &[
+          PartOfSpeech::Noun,
+        ],
+        glosses: &[
+          Gloss {
+            text: "rs232 cable",
+          },
+        ],
+      },
+    ],
+  },
+  ...
+];
+```
+
+...and just `include!()` it into the main binary. The problem with this is that each `&[T]` or `&str` is its own
+relocatable object that the linker has to deal with, so compile times, link times and binary size are absurdly high.
+I initially optimized this by putting all the strings into one giant string, somewhat like this:
+
+```rust
+// This actually comes from an include_str!().
+static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable...";
+
+static ENTRIES: &[EntryRepr] = &[
+  ...,
+  EntryRepr {
+    sequence_number: 1000150,
+    kanji_elements: &[
+      KanjiElementRepr {
+        text: StringRef { start: 0, end: 27 },
+      },
+    ],
+    reading_elements: &[
+      ReadingElementRepr {
+        text: StringRef { start: 27, end: 66 },
+      },
+    ],
+    senses: &[
+      SenseRepr {
+        parts_of_speech: &[
+          PartOfSpeech::Noun,
+        ],
+        glosses: &[
+          GlossRepr {
+            text: StringRef { start: 66, end: 77 },
+          },
+        ],
+      },
+    ],
+  },
+  ...
+];
+```
+
+This helps with the `&str` objects, but there are still the various nested `&[T]` slices. I applied the same technique
+to those as well:
+
+```rust
+static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable...";
+
+static ALL_K_ELE: &[KanjiElementRepr] = &[
+  KanjiElementRepr {
+    text: StringRef { start: 0, end: 27 },
+  },
+  ...
+];
+
+static ALL_R_ELE: &[ReadingElementRepr] = &[
+  ReadingElementRepr {
+    text: StringRef { start: 27, end: 66 },
+  },
+  ...
+];
+
+static ALL_POS: &[PartOfSpeech] = &[
+  PartOfSpeech::Noun,
+  ...
+];
+
+static ALL_GLOSSES: &[GlossRepr] = &[
+  GlossRepr {
+    text: StringRef { start: 66, end: 77 },
+  },
+  ...
+];
+
+static ALL_SENSES: &[SenseRepr] = &[
+  SenseRepr {
+    parts_of_speech: ArrayRef { start: 0, end: 1 },
+    glosses: ArrayRef { start: 0, end: 1 },
+  },
+  ...
+];
+
+static ALL_ENTRIES: &[EntryRepr] = &[
+  EntryRepr {
+    kanji_elements: ArrayRef { start: 0, end: 1 },
+    reading_elements: ArrayRef { start: 0, end: 1 },
+    senses: ArrayRef { start: 0, end: 1 },
+  },
+  ...
+];
+```
+
+With this and the previous sample, you can see that it's not `Entry` anymore, but `EntryRepr` instead, since those
+`StringRef` and `ArrayRef` instances need to be resolved into the things they point to at the API boundary. That's why
+the actual exposed types use iterators instead of slice refs for everything: to provide a point where this mapping can
+take place.
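
The mapping described in this paragraph can be sketched as follows. This is a minimal illustration, not the crate's actual code: the `resolve` method, the `repr` field, and the `first_sense` helper are hypothetical names, and only glosses are modeled. The accessor returns an iterator rather than a slice because each `GlossRepr` has to be converted into a `Gloss` on the way out.

```rust
// Miniature stand-ins for the generated tables shown above (assumed shapes).
static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable";

#[derive(Clone, Copy)]
struct StringRef { start: u32, end: u32 }

#[derive(Clone, Copy)]
struct ArrayRef { start: u32, end: u32 }

#[derive(Clone, Copy)]
struct GlossRepr { text: StringRef }

#[derive(Clone, Copy)]
struct SenseRepr { glosses: ArrayRef }

static ALL_GLOSSES: &[GlossRepr] = &[GlossRepr { text: StringRef { start: 66, end: 77 } }];
static ALL_SENSES: &[SenseRepr] = &[SenseRepr { glosses: ArrayRef { start: 0, end: 1 } }];

impl StringRef {
    // Hypothetical helper: turn byte offsets into the string they point to.
    fn resolve(self) -> &'static str {
        &ALL_TEXT[self.start as usize..self.end as usize]
    }
}

/// Public type: holds resolved data, no offsets.
pub struct Gloss { pub text: &'static str }

/// Public type: wraps the internal representation.
pub struct Sense { repr: SenseRepr }

impl Sense {
    /// Resolves the `ArrayRef` to a sub-slice, then maps each element lazily.
    pub fn glosses(&self) -> impl Iterator<Item = Gloss> {
        let range = self.repr.glosses;
        ALL_GLOSSES[range.start as usize..range.end as usize]
            .iter()
            .map(|g| Gloss { text: g.text.resolve() })
    }
}

/// Hypothetical entry point for this sketch only.
pub fn first_sense() -> Sense {
    Sense { repr: ALL_SENSES[0] }
}
```

Calling `first_sense().glosses()` yields a single `Gloss` whose `text` is `"rs232 cable"`. Note that the offsets are byte offsets into the UTF-8 text, which is why the nine-character kanji element ends at 27 (three bytes per character).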
+
+The structure as described above produces binaries of reasonable size, but because all that generated code needs to be
+parsed by the compiler, compile times are still frustratingly slow (on the order of minutes for a full build). And
+what's worse, the compiler uses so much working memory that my desktop PC with 16 GiB of RAM went OOM trying to compile
+it.
+
+To avoid the need for parsing generated code altogether, I finally replaced all `&[TRepr]` arrays with a single
+`static ALL_DATA: &[u32]` that gets imported from a binary file via `include_bytes!()`. `u32` was chosen because it is
+large enough to index into all relevant structures (both `ALL_TEXT` and `ALL_DATA`). I could have encoded enum variants
+as `u16`, but for now, I preferred the simplicity of having everything in one place and accepted the slight inefficiency
+in encoding.
+
+Besides `ALL_TEXT` and `ALL_DATA`, there is one final structure, `static ALL_ENTRY_OFFSETS: &[u32]`, which, as an
+entry point into the self-referencing structure of `ALL_DATA`, provides the offsets into `ALL_DATA` where entries are
+located.
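
To make the interplay of the three structures concrete, here is a rough sketch. The record layout is invented for illustration (`[sequence_number, gloss_count, (text_start, text_end)...]`); the text above does not specify the real encoding, and the real tables come from `include_str!()` and `include_bytes!()` rather than inline literals. `words_from_le_bytes` is likewise a hypothetical helper showing one safe way to get `u32` words out of raw bytes, assuming little-endian storage.

```rust
// Stand-in for the text blob.
static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable";

// Stand-in for the binary payload, using the invented layout:
// [sequence_number, gloss_count, (text_start, text_end)...]
static ALL_DATA: &[u32] = &[1000150, 1, 66, 77];

// Offsets into ALL_DATA at which each entry record begins.
static ALL_ENTRY_OFFSETS: &[u32] = &[0];

/// An entry is just a position in ALL_DATA; fields are decoded on demand.
pub struct Entry { offset: usize }

impl Entry {
    pub fn sequence_number(&self) -> u32 {
        ALL_DATA[self.offset]
    }

    /// Reads the gloss count, then resolves each (start, end) pair into text.
    pub fn glosses(&self) -> impl Iterator<Item = &'static str> {
        let count = ALL_DATA[self.offset + 1] as usize;
        let base = self.offset + 2;
        (0..count).map(move |i| {
            let start = ALL_DATA[base + 2 * i] as usize;
            let end = ALL_DATA[base + 2 * i + 1] as usize;
            &ALL_TEXT[start..end]
        })
    }
}

/// Walks the offset table to enumerate all entries.
pub fn entries() -> impl Iterator<Item = Entry> {
    ALL_ENTRY_OFFSETS.iter().map(|&o| Entry { offset: o as usize })
}

/// Decodes little-endian bytes into u32 words; trailing bytes that do not
/// fill a whole word are ignored.
pub fn words_from_le_bytes(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Because nothing here is generated Rust source, the compiler only ever sees a handful of accessor functions, regardless of how many entries the payload holds.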