#+TITLE: Furigana in Markdown Using Regular Expressions
#+DATE: 2022-01-23
#+FILETAGS: japanese
#+DESCRIPTION: The Markdown spec doesn’t include support for furigana, more commonly known as ruby text, which can be troublesome if you want to write content in Japanese. In this tutorial, learn how to add furigana support to Hugo using regular expressions!
#+begin_quote
TL;DR: [[#creating-the-regular-expression][Here]]
#+end_quote
* Background
:PROPERTIES:
:CUSTOM_ID: background
:END:
As I was building +the current+ /a previous/ version of this website, I came across an issue. In the previous version of the site which I made with [[https://nuxtjs.org/][Nuxt.js]], a [[https://vuejs.org/][Vue.js]]-based JavaScript web development framework similar to the [[https://reactjs.org/][React]]-based [[https://nextjs.org/][Next.js]], I used [[https://www.npmjs.com/package/markdown-it][=markdown-it=]] for rendering my Markdown content. The great thing about =markdown-it= was how extensible it is: there are a vast number of available npm packages that extend its functionality beyond the base [[https://commonmark.org/][CommonMark]] specification.
One of the packages I used was [[https://www.npmjs.com/package/furigana-markdown-it][=furigana-markdown-it=]], which enabled [[https://en.wikipedia.org/wiki/Furigana][furigana]], or more widely known as [[https://en.wikipedia.org/wiki/Ruby_character][ruby characters]], which are reading information written beside logographic characters in East Asian languages. For example, in Japanese the reading for 猫, cat, is ねこ, and can be written with furigana as [猫]{ねこ}. The syntax for this was the main text in square parentheses, =[猫]=, followed by the furigana in curly brackets, {ねこ}, was quite convenient, and I wanted to be able to do the same thing in the new [[https://gohugo.io/][Hugo]] site.
Instead of =markdown-it=, Hugo by default uses [[https://github.com/yuin/goldmark][Goldmark]], a Markdown renderer written in Go (a language I don't know), and while extensible, I really didn't want to go through the effort to learn Go, figure out how to make a Goldmark extension, and get it working in Hugo. After looking into it some more, it turns out that in Hugo's templating there is a [[https://gohugo.io/functions/replacere/][=replaceRE=]] function that lets you find and replace content intelligently using regular expressions.
* Creating the regular expression
:PROPERTIES:
:CUSTOM_ID: creating-the-regular-expression
:END:
After watching [[https://www.youtube.com/watch?v=rhzKDrUiJVk][this]] useful tutorial by Web Dev Simplified on regular expressions to get an idea of how they work, I managed to create this regular expression:
#+begin_example
\[([^\]]*)\]{([^\}]*)}
#+end_example
The first section, =\[([^\]]*)\]= creates a capturing group =()= around character surrounded by square brackets =[]=. The square brackets are escaped by the proceeding backslash (=\[=, =\]=) to make sure they aren't interpreted as regex syntax characters. Inside of the capturing group is =[^\]]*=. The negated set =[^\]]= means any character that isn't a right square bracket =]=, and the asterisk =*= repeats the previous token zero or more times in a row. In other words, the capturing group will end as soon as a right square bracket =]= is detected.
The second section, ={([^\}]*)=, is basically the same thing as the first, except the capturing group is surrounded by curly brackets ={}=. Again, they are escaped to make sure they aren't being interpreted as regex syntax characters.
You can test out the regular expression and see a breakdown of how all the parts of it work on [[https://regexr.com/6dspa][RegExr]], a super useful tool for building and testing regular expressions.
* Ruby text HTML syntax
:PROPERTIES:
:CUSTOM_ID: ruby-text-html-syntax
:END:
The HTML syntax for furigana/ruby text is as follows. For more information, see the [[https://developer.mozilla.org/en-US/docs/Web/HTML/Element/ruby][MDN documentation]].
#+begin_src html
猫
#+end_src
For me, all of my ruby text is going to be in Japanese, so I've added the attribute =lang="ja"= to ensure that the Japanese (not Chinese) character variants are rendered. For some characters, the way they are written in Japan and China slightly differs. For example, for the Unicode character [[https://en.wiktionary.org/wiki/直][U+76F4]], it is rendered as the Chinese variant, 直, by default but as 直 when the language is explicitly specified to be Japanese, despite them being /the exact same Unicode character code./
* Adding the regular expression to Hugo
:PROPERTIES:
:CUSTOM_ID: adding-the-regular-expression-to-hugo
:END:
In Hugo templates, one can display the rendered Markdown content of a given page using ={{ .Content }}=. What we need to do is pass =.Content= into the aforementioned =replaceRE= function. Since Hugo by default escapes HTML syntax, we need to then pipe everything into the =safeHTML= function. The =$1= and =$2= are placeholders for the first and second capture groups in the regular expression, respectively.
#+begin_example
{{ replaceRE `\[([^\]]*)\]{([^\}]*)}` `$1` .Content | safeHTML }}
#+end_example
All one needs to do now to get furigana rendering on all of one's page types is replace ={{ .Content }}= with this on all of their templates! To prevent code duplication, I put all of this into a [[https://gohugo.io/templates/partials/][partial template]].
* Conclusion
:PROPERTIES:
:CUSTOM_ID: conclusion
:END:
I hope you found this first blog post on this site helpful! If you're going to have any Japanese content in your site, being able to write ruby text in your markup is a must. While this tutorial was targeted toward Hugo, you can do this in *any static site generator that supports regular expressions in templates.*