mirror of https://github.com/ElnuDev/rust-jmdict
# `data/`

We cannot put JMdict into Git as a single file because it is over 100 MiB in size, and GitHub rejects files that big.
I don't want to use LFS because JMdict is still a text file and thus delta-compresses really well if you let Git do
its job. Therefore we split the file into chunks of roughly 1000 entries each.

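As a rough illustration of the chunking idea, here is a minimal sketch that cuts a list of entries into consecutive
runs and assigns each run a file name. The `entries-NNN.json` pattern is inferred from the `entries-999.json` file
mentioned further down; `CHUNK_SIZE`, `plan_chunks` and the three-digit padding are assumptions made for this example,
and the actual import may well group entries by a different rule (for example by entry ID).

```rust
/// Hypothetical chunk size, taken from "roughly 1000 entries each".
const CHUNK_SIZE: usize = 1000;

/// Splits `entries` into consecutive chunks and pairs each chunk with a
/// file name of the assumed form `entries-NNN.json`.
fn plan_chunks<T>(entries: &[T]) -> Vec<(String, &[T])> {
    entries
        .chunks(CHUNK_SIZE)
        .enumerate()
        .map(|(index, chunk)| (format!("entries-{:03}.json", index), chunk))
        .collect()
}
```

Because every chunk is a small text file of its own, an upstream edit to one entry only touches one file, which is
what keeps the diffs small.
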
Since we're pre-processing anyway, we also convert from XML to JSON. The original XML file uses a lot of memory
when parsed as a whole, and parsing it in pieces is finicky because we need to carry over the DTD into each chunk, if
only for the entity definitions. The JSON files in this directory, on the other hand, do not contain any magical
entities, so each entry trivially parses on its own. It also turns out that parsing JSON is much quicker than parsing
XML, which has a significant impact on the build time of the whole crate.

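To see why the entity definitions matter, consider a toy expander for `&name;` references. This is only a sketch and
not how this crate or any XML library works internally: `expand_entities` is a made-up helper, it ignores the
predefined XML entities such as `&amp;`, and the entity table is passed in by hand where a real parser would read it
from the DTD.

```rust
use std::collections::HashMap;

/// Replaces every `&name;` reference in `text` with its definition from
/// `entities`. Fails if a reference is unterminated or undefined.
fn expand_entities(text: &str, entities: &HashMap<&str, &str>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find('&') {
        // Copy everything before the reference unchanged.
        out.push_str(&rest[..start]);
        let after = &rest[start + 1..];
        let end = after
            .find(';')
            .ok_or_else(|| "unterminated entity reference".to_string())?;
        let name = &after[..end];
        let value = entities
            .get(name)
            .ok_or_else(|| format!("undefined entity: {}", name))?;
        out.push_str(value);
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

With a table that defines `n`, a fragment like `<pos>&n;</pos>` expands fine; with an empty table, which is what a
chunk without the DTD amounts to, the same fragment is an error. The JSON files avoid the problem by storing plain
values.
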
## Import workflow

To update the JMdict copy in this directory, run `make import JMDICT_PATH=/path/to/JMdict`. Check the `git diff`
afterwards; it should usually only show changes in the few places where upstream edited JMdict entries.

## Export workflow

We cannot bundle the data files with the crates when publishing because crates.io imposes a 10 MiB limit on crates.
`make export` therefore packs the data files into a compressed bundle. The output file appears in this directory as
`entrypack-YYYY-MM-DD.json.gz`, with the date taken from JMdict's own modification timestamp in
`entries-999.json`.

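The naming scheme can be sketched as follows. `entrypack_file_name` is a hypothetical helper, not part of the
Makefile or the crates; it assumes the date has already been read out of `entries-999.json` as a `YYYY-MM-DD` string
and only checks the shape of that string, not whether it is a valid calendar date.

```rust
/// Builds the bundle file name from a date in `YYYY-MM-DD` form.
/// Returns `None` if the input does not have that shape.
fn entrypack_file_name(date: &str) -> Option<String> {
    let parts: Vec<&str> = date.split('-').collect();
    let well_formed = parts.len() == 3
        && parts[0].len() == 4
        && parts[1].len() == 2
        && parts[2].len() == 2
        && parts.iter().all(|part| part.chars().all(|c| c.is_ascii_digit()));
    if well_formed {
        Some(format!("entrypack-{}.json.gz", date))
    } else {
        None
    }
}
```

Deriving the name from JMdict's own timestamp, not from the day the export ran, means that exporting the same
upstream snapshot twice yields the same file name.
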
This file can then be copied to its web server location, currently <http://dl.xyrillian.de/jmdict/>, which is under
the control of [@majewsky](https://github.com/majewsky). Finally, update the constants at the top of
`jmdict-traverse/src/file.rs` to refer to the new file.