Add data processing post

main
Elnu 2 years ago
parent d23858057b
commit e567e49c46

@ -0,0 +1,83 @@
---
title: "The Joy of Data Processing"
date: 2022-07-27
tags:
- programming
- japanese
---
For a project I'm working on (I'll make a post about it once it's done), I needed a large list of Japanese words, with the requirements being that the words be short and written without kanji (in hiragana only). Ideally, they should also be simple words that the average Japanese learner would know, and the list must be in a machine-readable format that I can use in JavaScript.
Luckily, I found something that matches these criteria: a [JLPT N5 vocabulary list from JLPT Matome](https://www.jlptmatome.com/jlpt-n5-vocabulary-list/) that covers 549 words, which should hopefully be more than enough for my needs.
However, the list is in the form of HTML tables, so it's going to need some work. In addition, the pronunciation is provided only in romaji, so we'll need to convert it to hiragana.
This is where the joy of data processing and web scraping comes in: refining a data source step by step until it's something you can work with. This post is going to be less of a tutorial and more of a walk-through of a process that I found fun.
### Extracting the raw data
The first step is to convert the tables to a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values). If you're unfamiliar with CSV, it's essentially the simplest way of storing spreadsheet data in a file. Each line corresponds to a row and holds a comma-separated list of values, one for each column (hence the name, a *Comma-Separated Values* file).
There are various command-line applications and browser extensions that convert HTML to CSV, but what I ended up using is [this website](https://www.convertcsv.com/html-table-to-csv.htm). Using a website for conversions is cringe, I know, but when you're doing a one-off thing and don't need to write any scripts, utility websites are often the easiest way to get the job done.
It's pretty handy, and it does the URL fetching for you. The only annoyance was that the source was split across 11 pages, so I had to convert each page separately and then put the results together, but once that was done I had a nice CSV file:
```CSV
1,あげる,ageru,to give
2,朝,asa,morning
3,封筒,fuutou,envelope
4,冬,fuyu,winter
5,五,go,five
...
```
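For the curious, here is a rough sketch of the kind of conversion such a tool performs. It is not how the website actually works; `tableToCsv` and the sample markup are made up for illustration, and the regex approach assumes a simple, well-formed table (a proper HTML parser would be the safer choice for arbitrary pages).

```javascript
// Naive HTML-table-to-CSV sketch (illustrative only, not the website's code).
function tableToCsv(html) {
  const rows = [];
  // Match each <tr>...</tr> block, then each <td>/<th> cell inside it.
  for (const [, rowHtml] of html.matchAll(/<tr\b[^>]*>([\s\S]*?)<\/tr>/gi)) {
    const cells = [];
    for (const [, cellHtml] of rowHtml.matchAll(/<t[dh]\b[^>]*>([\s\S]*?)<\/t[dh]>/gi)) {
      // Drop nested tags (links, bold, ...) and surrounding whitespace.
      const text = cellHtml.replace(/<[^>]+>/g, "").trim();
      // Quote cells that contain a comma, a double quote, or a newline.
      cells.push(/[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text);
    }
    if (cells.length > 0) rows.push(cells.join(","));
  }
  return rows.join("\n");
}

// Hypothetical sample markup, not copied from the actual page.
const sampleHtml = `
<table>
  <tr><td>1</td><td>あげる</td><td>ageru</td><td>to give</td></tr>
  <tr><td>2</td><td>朝</td><td>asa</td><td>morning</td></tr>
</table>`;

console.log(tableToCsv(sampleHtml));
```

Running this on each page and joining the results with newlines would give one combined file in the format shown above.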
Now, this is more data than we need. I only need the third column with the pronunciations (we'll convert these into hiragana later), so you can remove the first, second, and fourth columns in spreadsheet software such as LibreOffice and then re-export a single-column CSV file with just the romaji. Once that's done, we have a text file that is just a list of pronunciations:
```CSV
ageru
asa
fuutou
fuyu
go
...
```
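If you'd rather script this step than open a spreadsheet, a minimal sketch could look like the following. `extractColumn` is a hypothetical helper, and it assumes the first three columns never contain commas or quoted fields, which holds for the sample rows above.

```javascript
// Keep only one column of a simple CSV.
// Assumes no quoted fields or embedded commas before the wanted column.
function extractColumn(csvText, columnIndex) {
  return csvText
    .split(/\r?\n/)                       // one entry per row
    .filter((line) => line.trim() !== "") // skip blank lines
    .map((line) => line.split(",")[columnIndex]);
}

const csv = "1,あげる,ageru,to give\n2,朝,asa,morning\n3,封筒,fuutou,envelope\n";

// Column index 2 is the third column (the romaji).
console.log(extractColumn(csv, 2).join("\n"));
```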
### Converting to hiragana
After a bit of research, I found [koozaki/romaji-conv](https://github.com/koozaki/romaji-conv), an [npm package](https://www.npmjs.com/package/@koozaki/romaji-conv) that does exactly what I need. There's a [web-based demo](https://romaji-conv.koozaki.com/) that you can use if you want to try it out or do a quick conversion, but it also has a CLI (command-line interface), which is what we'll use.
Assuming you already have Node.js and npm installed, you can install it globally with the following command (`i` is an alias for `install`, and `-g` installs the package globally on your system instead of into a particular project's `node_modules`):
```SH
npm i -g @koozaki/romaji-conv
```
We can use the following command to run romaji-conv on each line of our romaji file, `romaji.csv`, and write the resulting hiragana to a new file, `hiragana.csv`. I wasn't familiar with `xargs` before, but I found this solution thanks to [this Stack Overflow answer](https://stackoverflow.com/a/29836986).
```SH
# Run romaji-conv once per input line (-L1) and collect the output in hiragana.csv
cat romaji.csv | xargs -L1 romaji-conv > hiragana.csv
```
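To give a feel for what happens to each line, here is a toy converter that does a greedy longest-match lookup against a tiny table. This is only an illustration and not romaji-conv's implementation: the `KANA` table and `toHiragana` function are made up for this post and cover just the sample words, whereas the real package handles the full syllabary and its edge cases.

```javascript
// Toy romaji-to-hiragana table covering only the sample words in this post.
const KANA = new Map([
  ["a", "あ"], ["u", "う"], ["sa", "さ"], ["ge", "げ"], ["go", "ご"],
  ["to", "と"], ["fu", "ふ"], ["yu", "ゆ"], ["ru", "る"],
]);

function toHiragana(romaji) {
  let out = "";
  let i = 0;
  while (i < romaji.length) {
    let matched = false;
    // Try the longest chunk first (3 letters, then 2, then 1).
    for (let len = 3; len >= 1 && !matched; len--) {
      const chunk = romaji.slice(i, i + len);
      if (KANA.has(chunk)) {
        out += KANA.get(chunk);
        i += chunk.length;
        matched = true;
      }
    }
    // Unknown letters are passed through unchanged.
    if (!matched) {
      out += romaji[i];
      i += 1;
    }
  }
  return out;
}

// Convert line by line, like the xargs pipeline does.
const romajiLines = ["ageru", "asa", "fuutou", "fuyu", "go"];
console.log(romajiLines.map(toHiragana).join("\n"));
```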
### Converting into a JSON list
The final step is to convert this simple text file list into a JSON list/array that we can put directly into our JavaScript for use, in a format such as the following:
```JSON
["あげる", "あさ", "ふうとう", "ふゆ", "ご", ...]
```
To do this, all we need is a simple find-and-replace in any text editor: replace each newline (`\n`) with `", "`, then add the opening `["` and the closing `"]`.
We're done!
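The same step can also be scripted. This is a minimal sketch rather than what I actually did; `linesToJson` is a hypothetical helper that also drops blank lines, so a trailing newline in the file doesn't produce an empty entry.

```javascript
// Turn a newline-separated word list into a JSON array string.
function linesToJson(text) {
  const words = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((word) => word !== ""); // ignore blank lines, e.g. a trailing newline
  return JSON.stringify(words);
}

const hiraganaText = "あげる\nあさ\nふうとう\nふゆ\nご\n";
console.log(linesToJson(hiraganaText));
```

For a real file, the input text could come from something like `fs.readFileSync("hiragana.csv", "utf8")`.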
### Closing thoughts
The things I've done here are not in any way groundbreaking by themselves. I could have done some things more elegantly, and honestly none of this is anything to write home about. However, I wrote this post anyway because I wanted to show the power of step-by-step data processing and manipulation, and how, with a bit of time, one can transform data sources into whatever form is needed.
If you see data in tables or some other form online, don't give up! Getting it into the form you need isn't going to be as hard or time-consuming as you think.
If you're interested in Japanese text manipulation, please do check out Kojiro Ozaki's romaji-conv on GitHub and give it a star! :star: It's super handy and easy to use, and is painfully underrated at only 8 stars (including mine) at the time of writing.
See you in the next post!
じゃーね!