# Contributing

Issues and PRs are welcome, but I cannot give any specific guarantees about how fast I'll get to them. Some specific remarks:

* If the copy of JMdict is outdated and you need a newer version, open an issue and I'll make a new release with a newer copy for you.
  Please don't send a PR for this; I have no way to verify the diff and will do the import myself anyway.

# Explanations

## Payload structure

The obvious idea would be to have `build.rs` generate a bunch of code like this...

```rust
// many fields elided for brevity
static ENTRIES: &[Entry] = &[
    Entry {
        sequence_number: 1000150,
        kanji_elements: &[
            KanjiElement {
                text: "ＲＳ２３２ケーブル",
            },
        ],
        reading_elements: &[
            ReadingElement {
                text: "アールエスにさんにケーブル",
            },
        ],
        senses: &[
            Sense {
                parts_of_speech: &[
                    PartOfSpeech::Noun,
                ],
                glosses: &[
                    Gloss {
                        text: "rs232 cable",
                    },
                ],
            },
        ],
    },
    ...
];
```

...and just `include!()` it into the main binary. The problem with this is that each `&[T]` or `&str` is its own relocatable object that the linker has to deal with, so compile times, link times and binary size are absurdly high. I initially optimized this by putting all the strings into one giant string, somewhat like this:

```rust
// This actually comes from an include_str!().
static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable...";

static ENTRIES: &[EntryRepr] = &[
    ...,
    EntryRepr {
        sequence_number: 1000150,
        kanji_elements: &[
            KanjiElementRepr {
                text: StringRef { start: 0, end: 27 },
            },
        ],
        reading_elements: &[
            ReadingElementRepr {
                text: StringRef { start: 27, end: 66 },
            },
        ],
        senses: &[
            SenseRepr {
                parts_of_speech: &[
                    PartOfSpeech::Noun,
                ],
                glosses: &[
                    GlossRepr {
                        text: StringRef { start: 66, end: 77 },
                    },
                ],
            },
        ],
    },
    ...
];
```

This helps with the `&str` objects, but there are still the various cascaded `&[T]` slices.
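
To make the `StringRef` idea concrete, here is a minimal sketch of how such a reference might be resolved against the shared string. The `resolve` method and the `sample_strings` helper are hypothetical names used for illustration, not necessarily the crate's actual API. The sample also assumes that the headword is spelled with fullwidth `ＲＳ２３２`, which is what makes it 27 bytes long in UTF-8:

```rust
/// Byte range into the shared text blob (fields mirror the sample above).
#[derive(Clone, Copy)]
struct StringRef {
    start: u32,
    end: u32,
}

// Stand-in for the blob that would normally come from include_str!().
// Assumption: the headword uses fullwidth letters and digits (3 bytes each).
static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable";

impl StringRef {
    /// Hypothetical helper: turns the byte range back into a `&'static str`.
    fn resolve(self) -> &'static str {
        &ALL_TEXT[self.start as usize..self.end as usize]
    }
}

/// Resolves the three references from the sample entry:
/// kanji element, reading element and gloss, in that order.
fn sample_strings() -> [&'static str; 3] {
    [
        StringRef { start: 0, end: 27 }.resolve(),
        StringRef { start: 27, end: 66 }.resolve(),
        StringRef { start: 66, end: 77 }.resolve(),
    ]
}
```

Note that `start` and `end` are byte offsets, not character counts, so slicing like this panics if a range does not fall on UTF-8 character boundaries.
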
I applied the same technique to those slices as well:

```rust
static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable...";

static ALL_K_ELE: &[KanjiElementRepr] = &[
    KanjiElementRepr {
        text: StringRef { start: 0, end: 27 },
    },
    ...
];

static ALL_R_ELE: &[ReadingElementRepr] = &[
    ReadingElementRepr {
        text: StringRef { start: 27, end: 66 },
    },
    ...
];

static ALL_POS: &[PartOfSpeech] = &[
    PartOfSpeech::Noun,
    ...
];

static ALL_GLOSSES: &[GlossRepr] = &[
    GlossRepr {
        text: StringRef { start: 66, end: 77 },
    },
    ...
];

static ALL_SENSES: &[SenseRepr] = &[
    SenseRepr {
        parts_of_speech: ArrayRef { start: 0, end: 1 },
        glosses: ArrayRef { start: 0, end: 1 },
    },
    ...
];

static ALL_ENTRIES: &[EntryRepr] = &[
    EntryRepr {
        kanji_elements: ArrayRef { start: 0, end: 1 },
        reading_elements: ArrayRef { start: 0, end: 1 },
        senses: ArrayRef { start: 0, end: 1 },
    },
    ...
];
```

With this and the previous sample, you can see that it's not `Entry` anymore, but `EntryRepr` instead, since those `StringRef` and `ArrayRef` instances need to be resolved into the things they point to at the API boundary. That's why the actual exposed types use iterators instead of slice refs for everything: to provide a point where this mapping can take place.

The structure as described above produces binaries of reasonable size, but because all that generated code needs to be parsed by the compiler, compile times are still frustratingly slow (on the order of minutes for a full build). And what's worse, the compiler uses so much working memory that my desktop PC with 16 GiB of RAM went OOM trying to compile it.

To avoid the need for parsing generated code altogether, I finally replaced all `&[TRepr]` arrays with a single `static ALL_DATA: &[u32]` that gets imported from a binary file via `include_bytes!()`. `u32` was chosen because it is large enough to index into all relevant structures (both `ALL_TEXT` and `ALL_DATA`).
I could have encoded enum variants as `u16`, but for now, I preferred the simplicity of having everything in one place and accepted the slight inefficiency in encoding.

Besides `ALL_TEXT` and `ALL_DATA`, there is one final structure, `static ALL_ENTRY_OFFSETS: &[u32]`, which, as an entrypoint into the self-referencing structure of `ALL_DATA`, provides the offsets into `ALL_DATA` where entries are located.
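
As a rough illustration of how these three structures could fit together, the sketch below walks a miniature data set. The record layout inside `ALL_DATA`, the `Entry` accessor type, the `entries` function and the `words_from_le_bytes` helper are all made up for this example; the real encoding carries many more fields and may be organized quite differently.

```rust
// Miniature stand-ins for the three blobs. Assumption: the headword uses
// fullwidth letters and digits, so it occupies bytes 0..27 of the text.
static ALL_TEXT: &str = "ＲＳ２３２ケーブルアールエスにさんにケーブルrs232 cable";

// Invented layout, for illustration only:
//   words 0..2  one string record: (text start, text end)
//   words 2..5  one entry record: (sequence number, then the start..end
//               range of its string records inside ALL_DATA itself)
static ALL_DATA: &[u32] = &[0, 27, 1000150, 0, 2];

// Index into ALL_DATA at which each entry record begins.
static ALL_ENTRY_OFFSETS: &[u32] = &[2];

/// One safe way to turn a byte blob (as produced by include_bytes!()) into
/// u32 words, assuming little-endian encoding. Trailing bytes that do not
/// fill a whole word are ignored.
fn words_from_le_bytes(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Lightweight handle that only remembers where its record starts.
struct Entry {
    offset: usize,
}

impl Entry {
    fn sequence_number(&self) -> u32 {
        ALL_DATA[self.offset]
    }

    /// Resolves the nested range lazily, which is why an iterator is
    /// returned instead of a slice.
    fn kanji_texts(&self) -> impl Iterator<Item = &'static str> {
        let start = ALL_DATA[self.offset + 1] as usize;
        let end = ALL_DATA[self.offset + 2] as usize;
        ALL_DATA[start..end]
            .chunks_exact(2)
            .map(|r| &ALL_TEXT[r[0] as usize..r[1] as usize])
    }
}

/// Entry point: one handle per offset listed in ALL_ENTRY_OFFSETS.
fn entries() -> impl Iterator<Item = Entry> {
    ALL_ENTRY_OFFSETS.iter().map(|&o| Entry { offset: o as usize })
}
```

In this layout nothing is decoded up front: each accessor reads a few words from `ALL_DATA` when it is called, and the offsets table is the only place that knows where records begin.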