Why CJK Support in Rust Is Hard

DEV Community: rust (kent-tokyo)

Most Rust developers don't think about CJK until they need it. Then they discover that embedding Japanese text in a PDF, building a search index over Chinese content, or normalizing Korean input involves a stack of interlocking problems that Latin-script tooling simply never had to solve.

This post breaks down why CJK is genuinely hard — not just "different" — and where the Rust ecosystem still has gaps.

The Scale Problem

The first thing that surprises developers: a full CJK font file is enormous.

A Latin font like Inter Regular is around 300 KB. A full Japanese font, say Noto Sans CJK JP, is over 15 MB. That's because Unicode's CJK Unified Ideographs blocks (the base block plus its extensions) together define over 92,000 characters, and a production font has to cover tens of thousands of them.

For most use cases you don't need anywhere near that many glyphs. If you're generating a PDF invoice with a customer name and address, you might use 50 distinct CJK characters. But a naive approach embeds the entire font, making a simple document balloon to 15 MB.

The solution is font subsetting: extract only the glyphs actually used, rebuild a minimal font binary, and embed that. It sounds straightforward. It isn't.
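The first step of that pipeline is simply knowing which characters the document uses. Here is a minimal sketch using only the standard library; `used_chars` is a hypothetical helper name, not part of any crate.

```rust
use std::collections::BTreeSet;

/// Collect the distinct characters that appear in a document's text runs.
/// A subsetter only needs to keep glyphs for these characters.
fn used_chars(texts: &[&str]) -> BTreeSet<char> {
    texts.iter().flat_map(|t| t.chars()).collect()
}
```

Calling `used_chars(&["東京都", "京都府"])` yields four distinct characters, which is the whole glyph budget for that text.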

The Three Hard Problems

1. Font Subsetting for CJK

Subsetting a Latin font is well-understood. Subsetting a CJK font involves:

Glyph ID remapping. A font maps Unicode code points to internal Glyph IDs (GIDs). After subsetting, the GID space is compacted: GID 0 stays reserved for the mandatory .notdef glyph, and the 50 glyphs you kept get new GIDs from 1 to 50. Every reference to the old GIDs in the font binary and in your document needs to be updated.
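As a rough illustration, the compaction step can be sketched like this. `remap_gids` is a hypothetical helper, and the sketch assumes GID 0 (.notdef) is always retained. A real subsetter also has to pull in the component glyphs of composite glyphs, which this ignores.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Build an old-GID -> new-GID map for a subset.
/// GID 0 (.notdef) is always kept and stays at 0; the remaining
/// glyphs are renumbered densely in ascending order of their old GID.
fn remap_gids(used: &BTreeSet<u16>) -> BTreeMap<u16, u16> {
    let mut map = BTreeMap::new();
    map.insert(0u16, 0u16);
    for old in used.iter().copied().filter(|&g| g != 0) {
        // map.len() is the next free GID; it never exceeds u16::MAX here.
        let new = map.len() as u16;
        map.insert(old, new);
    }
    map
}
```

Keeping the new GIDs in ascending order of the old ones is a design choice, not a requirement. It just makes the output deterministic and easy to diff.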

cmap table reconstruction. The font's cmap table (lowercase, not to be confused with the PDF CMap discussed below) maps Unicode → GID. After subsetting, this table must be rebuilt to reflect the new GID assignments. Get this wrong and the font renders garbage or fails to load entirely.
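Conceptually, the rebuild is a join between the old character map and the GID remapping. The sketch below works on in-memory maps only and assumes both were already parsed out of the font. `rebuild_cmap` is a hypothetical name, and serializing the result back into a binary cmap subtable is a separate job.

```rust
use std::collections::BTreeMap;

/// Rewrite a char -> old-GID map into a char -> new-GID map.
/// Characters whose glyph was dropped by the subsetter are removed.
fn rebuild_cmap(
    old_cmap: &BTreeMap<char, u16>,
    gid_map: &BTreeMap<u16, u16>,
) -> BTreeMap<char, u16> {
    old_cmap
        .iter()
        .filter_map(|(&ch, old_gid)| gid_map.get(old_gid).map(|&new_gid| (ch, new_gid)))
        .collect()
}
```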

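Advance width recalculation. Fonts store per-glyph advance widths (how far the cursor moves after each character). After GID remapping, the width table must be reindexed. In PDF specifically, a CIDFont carries its widths in the /W array (with /DW as the default), not in the /Widths array used by simple fonts. Its entries are keyed by CID, which equals the new GID when the CIDToGIDMap is Identity, so they must match the new GIDs exactly. A mismatch causes text spacing to break in subtle, hard-to-debug ways.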
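Here is one way the PDF side of this could look: a sketch that turns advances indexed by new GID into the body of a CIDFont /W array. It assumes an Identity CIDToGIDMap (so CID equals new GID) and skips glyphs that match the default width, which for CJK is most of them. The function names are hypothetical.

```rust
/// Scale a font-unit advance to PDF glyph space (1000 units per em), rounded.
fn to_pdf_width(advance: u16, units_per_em: u16) -> u32 {
    (advance as u32 * 1000 + units_per_em as u32 / 2) / units_per_em as u32
}

/// Append one "start [w1 w2 ...]" entry to the array body.
fn push_run(out: &mut String, start: usize, run: &[u32]) {
    let widths: Vec<String> = run.iter().map(|w| w.to_string()).collect();
    if out.len() > 1 {
        out.push(' ');
    }
    out.push_str(&format!("{} [{}]", start, widths.join(" ")));
}

/// Build a /W array from advances indexed by new GID.
/// Glyphs whose width equals `default_width` (the /DW value) are omitted.
fn build_w_array(advances: &[u16], units_per_em: u16, default_width: u32) -> String {
    let mut out = String::from("[");
    let mut run: Vec<u32> = Vec::new();
    let mut run_start = 0usize;
    for (gid, &adv) in advances.iter().enumerate() {
        let w = to_pdf_width(adv, units_per_em);
        if w != default_width {
            if run.is_empty() {
                run_start = gid;
            }
            run.push(w);
        } else if !run.is_empty() {
            push_run(&mut out, run_start, &run);
            run.clear();
        }
    }
    if !run.is_empty() {
        push_run(&mut out, run_start, &run);
    }
    out.push(']');
    out
}
```

With a default width of 1000, a subset that is mostly full-width ideographs produces a very short array, because only the Latin and punctuation glyphs need explicit entries.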

Type0/CIDFont object graph. PDF represents CJK fonts as a two-level structure: a Type0 (composite) font wrapping a CIDFont. The Type0 font references the ToUnicode CMap and its descendant CIDFont, and the CIDFont references a font descriptor that in turn points to the embedded font stream. Building this object graph correctly requires understanding the PDF spec at a level most developers would rather avoid.
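To make the shape of that graph concrete, here is a sketch that prints the three dictionaries as raw PDF objects. The object numbers, the struct, and the function names are all made up for illustration. The font descriptor deliberately omits its required metric entries (/Flags, /FontBBox, /ItalicAngle, /Ascent, /Descent, /CapHeight, /StemV), and the two stream objects are only referenced, not written.

```rust
/// Object numbers for the pieces of one composite font (hypothetical layout).
struct FontObjectIds {
    type0: u32,
    cid_font: u32,
    descriptor: u32,
    font_file: u32,
    to_unicode: u32,
}

/// Top level: the Type0 font points at the CIDFont and the ToUnicode CMap.
fn type0_font_dict(ids: &FontObjectIds, base_font: &str) -> String {
    format!(
        "{} 0 obj\n<< /Type /Font /Subtype /Type0 /BaseFont /{}\n\
         /Encoding /Identity-H /DescendantFonts [{} 0 R]\n\
         /ToUnicode {} 0 R >>\nendobj\n",
        ids.type0, base_font, ids.cid_font, ids.to_unicode
    )
}

/// Second level: the CIDFont carries widths and points at the descriptor.
/// CIDFontType2 means TrueType outlines; `w_array` is a prebuilt /W body.
fn cid_font_dict(ids: &FontObjectIds, base_font: &str, w_array: &str) -> String {
    format!(
        "{} 0 obj\n<< /Type /Font /Subtype /CIDFontType2 /BaseFont /{}\n\
         /CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >>\n\
         /FontDescriptor {} 0 R /CIDToGIDMap /Identity\n\
         /DW 1000 /W {} >>\nendobj\n",
        ids.cid_font, base_font, ids.descriptor, w_array
    )
}

/// Leaf: the descriptor points at the embedded (subset) font stream.
/// Required metric entries are omitted in this sketch.
fn font_descriptor_dict(ids: &FontObjectIds, base_font: &str) -> String {
    format!(
        "{} 0 obj\n<< /Type /FontDescriptor /FontName /{}\n\
         /FontFile2 {} 0 R >>\nendobj\n",
        ids.descriptor, base_font, ids.font_file
    )
}
```

For a subset font, the base font name conventionally carries a six-letter tag prefix, for example `AAAAAA+SomeFont`. A CFF-flavored font would use /CIDFontType0 and /FontFile3 instead.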

In pure Rust, the allsorts crate handles TTF subsetting. It works well for TrueType fonts. OpenType CFF fonts (.otf files with PostScript outlines) are more complex and allsorts coverage is incomplete — this is a known gap in the Rust ecosystem.

2. ToUnicode CMap Generation

PDF separates rendering (which glyph to draw) from semantics (what Unicode character that glyph represents). Rendering uses GIDs. Semantics are stored in a separate stream called the ToUnicode CMap.

Without a ToUnicode CMap:

Copy-pasting text from a PDF produces garbage characters or empty strings
Search within the PDF doesn't find CJK text
Screen readers can't read the document

The CMap is a PostScript-like stream that maps character codes to Unicode code points. With the Identity-H encoding and an Identity CIDToGIDMap, which is the usual setup for subset CJK fonts, those codes are the GIDs. For CJK fonts with thousands of glyphs, generating this stream correctly, with proper range compression for consecutive code points, requires care. A naive one-entry-per-glyph approach technically works but produces unnecessarily large streams.
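Here is a sketch of the range compression, assuming two-byte codes that equal the new GIDs. It merges runs where the GID and the code point both advance by one, and conservatively breaks a run whenever the low byte of either side wraps, since a bfrange increments only the last byte of the destination. Blocks are capped at 100 entries, as the CMap format requires. All names are hypothetical.

```rust
const CMAP_HEADER: &str = "/CIDInit /ProcSet findresource begin\n\
12 dict begin\n\
begincmap\n\
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n\
/CMapName /Adobe-Identity-UCS def\n\
/CMapType 2 def\n\
1 begincodespacerange\n\
<0000> <FFFF>\n\
endcodespacerange\n";

const CMAP_FOOTER: &str = "endcmap\n\
CMapName currentdict /CMap defineresource pop\n\
end\n\
end\n";

/// Hex-encode a char as UTF-16BE (surrogate pairs for non-BMP characters).
fn utf16_hex(ch: char) -> String {
    let mut buf = [0u16; 2];
    ch.encode_utf16(&mut buf).iter().map(|u| format!("{:04X}", u)).collect()
}

/// Group (gid, char) pairs, sorted by gid, into (first, last, first_char) runs.
fn bf_ranges(pairs: &[(u16, char)]) -> Vec<(u16, u16, char)> {
    let mut ranges: Vec<(u16, u16, char)> = Vec::new();
    for &(gid, ch) in pairs {
        if let Some(last) = ranges.last_mut() {
            let len = (last.1 - last.0) as u32 + 1;
            let contiguous = gid as u32 == last.1 as u32 + 1
                && ch as u32 == last.2 as u32 + len
                && (gid & 0xFF) != 0 // source low byte did not wrap
                && ((ch as u32) & 0xFF) != 0 // destination low byte did not wrap
                && (ch as u32) < 0x10000; // keep surrogate pairs as single entries
            if contiguous {
                last.1 = gid;
                continue;
            }
        }
        ranges.push((gid, gid, ch));
    }
    ranges
}

/// Render a complete ToUnicode CMap stream body.
fn to_unicode_cmap(pairs: &[(u16, char)]) -> String {
    let mut out = String::from(CMAP_HEADER);
    for chunk in bf_ranges(pairs).chunks(100) {
        out.push_str(&format!("{} beginbfrange\n", chunk.len()));
        for &(lo, hi, ch) in chunk {
            out.push_str(&format!("<{:04X}> <{:04X}> <{}>\n", lo, hi, utf16_hex(ch)));
        }
        out.push_str("endbfrange\n");
    }
    out.push_str(CMAP_FOOTER);
    out
}
```

How much this saves depends on GID assignment. If the subsetter hands out new GIDs in code point order, adjacent kana and punctuation collapse into a few ranges. Scattered kanji mostly stay as single entries.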

3. Normalization and Variant Characters

CJK text has an encoding problem that Latin scripts largely don't: the same logical character can have multiple valid representations.

Unicode normalization forms (NFC, NFD, NFKC, NFKD) affect how composed characters are stored. Japanese text in particular mixes hiragana, katakana, kanji, and Latin characters, each with its own normalization quirks. Fullwidth ASCII (Ａ, Ｂ, Ｃ) and halfwidth katakana (ｱ, ｲ, ｳ) are only compatibility-equivalent to their standard forms, not canonically equivalent: NFKC folds them together, while NFC leaves them distinct.
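To show what that folding does without pulling in a crate, here is a hand-rolled sketch covering only one slice of NFKC: fullwidth ASCII and the ideographic space. It is an illustration, not a replacement for a real normalizer. Halfwidth katakana, for instance, needs a lookup table plus voiced-mark composition and is left untouched here. `fold_fullwidth_ascii` is a hypothetical name.

```rust
/// Fold fullwidth ASCII variants (U+FF01..=U+FF5E) and the ideographic
/// space (U+3000) to their ASCII counterparts, as NFKC would.
/// Everything else passes through unchanged.
fn fold_fullwidth_ascii(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // The fullwidth block mirrors ASCII at a fixed offset of 0xFEE0.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            '\u{3000}' => ' ',
            _ => c,
        })
        .collect()
}
```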

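CJK Compatibility Ideographs (U+F900–U+FAFF) exist so that legacy encodings can round-trip through Unicode. They have canonical singleton decompositions, so every normalization form, including NFC, replaces them: U+FA30 becomes U+4FAE (侮). The squared forms in the separate CJK Compatibility block (U+3300–U+33FF) behave differently: U+3314 (㌔) is compatibility-equivalent to U+30AD U+30ED (キロ), so NFKC expands it and NFC leaves it alone. Depending on whether and how you normalize before indexing, the same string might or might not match a query.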

Variation selectors add another layer. Han unification means a single code point can cover several glyph shapes. Most regional differences (Japanese vs. Korean forms) are left to the font, while simplified and traditional Chinese characters usually have separate code points. Where a specific shape has to be pinned down in plain text, Unicode uses Variation Selectors: invisible code points that follow a base character to select a specific glyph. 葛 followed by VS17 (U+E0100) selects a specific variant used in place names. Normalization does not remove these selectors, so a text search that isn't VS-aware will fail to match these strings.
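If your index should treat variants as the same character, one option is to strip the selectors before indexing and before querying. A minimal sketch with hypothetical function names follows. Note that it also removes VS15/VS16, which control emoji presentation, so it is only appropriate where that loss is acceptable.

```rust
/// True for VS1..VS16 (U+FE00..=U+FE0F) and VS17..VS256 (U+E0100..=U+E01EF).
fn is_variation_selector(c: char) -> bool {
    matches!(c, '\u{FE00}'..='\u{FE0F}' | '\u{E0100}'..='\u{E01EF}')
}

/// Drop all variation selectors so that base characters compare equal.
fn strip_variation_selectors(s: &str) -> String {
    s.chars().filter(|c| !is_variation_selector(*c)).collect()
}
```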

For fuzzy matching over CJK content, you need to decide which of these equivalences to collapse before indexing. The right answer depends on the use case: a legal document system probably wants exact glyph matching; a general search index probably wants NFKC normalization.

The Legacy Encoding Problem

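Modern CJK text is Unicode, but a significant amount of real-world content is still encoded in legacy formats such as Shift_JIS, EUC-JP, and ISO-2022-JP for Japanese, GBK, GB18030, and Big5 for Chinese, and EUC-KR for Korean.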

Converting these to Unicode isn't just a lookup table — legacy CJK encodings have overlapping code spaces, vendor extensions, and edge cases that differ between Windows, macOS, and Linux implementations.

The encoding_rs crate (originally written for Firefox) is the authoritative pure Rust implementation of the WHATWG Encoding Standard and handles most of these correctly. This is one area where the Rust ecosystem is actually in good shape.
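To see why byte-level shortcuts fail on these encodings, consider Shift_JIS, where the trail byte of a two-byte character can be 0x5C, the same value as an ASCII backslash. 表 is encoded as 0x95 0x5C, for example. The sketch below is std-only and purely illustrative, with hypothetical function names. It walks the bytes with lead-byte awareness to find only the standalone 0x5C bytes. It is not a decoder and not a substitute for encoding_rs.

```rust
/// True if `b` starts a two-byte sequence in Shift_JIS.
fn is_sjis_lead(b: u8) -> bool {
    matches!(b, 0x81..=0x9F | 0xE0..=0xFC)
}

/// Byte offsets where 0x5C is a standalone single-byte character
/// rather than the trail byte of a two-byte character.
fn standalone_0x5c_offsets(bytes: &[u8]) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        if is_sjis_lead(bytes[i]) && i + 1 < bytes.len() {
            // Skip the lead byte and its trail byte together.
            i += 2;
        } else {
            if bytes[i] == 0x5C {
                hits.push(i);
            }
            i += 1;
        }
    }
    hits
}
```

A naive scan for 0x5C over the bytes of 表 followed by a real backslash finds two matches. The lead-byte-aware walk finds one. Path handling and escaping code that ignores this corrupts text in exactly the cases that are hardest to spot in testing.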

Why Everything Still Leans on C

The elephant in the room: most production CJK text processing still depends on C or C++ libraries.

HarfBuzz — text shaping (converting Unicode to positioned glyphs) — is C++. For CJK, shaping is relatively simple compared to Arabic or Indic scripts (no complex ligatures or bidirectional reordering), but HarfBuzz is still the de facto standard.

FreeType — font rasterization — is C. If you're rendering CJK text to a bitmap, you're almost certainly using FreeType bindings.

ICU (International Components for Unicode), which covers normalization, collation, and locale-aware string comparison, is C++. The ICU4X project is a ground-up Rust implementation developed under the Unicode Consortium, and it's making solid progress, but it's not yet a drop-in replacement for all ICU use cases.

The consequence for Rust developers: if you need CJK support and reach for crates that wrap these C libraries, you usually lose straightforward WASM compatibility, you complicate cross-compilation, and you add a build-time dependency on the system libraries or vendored C sources.

The Current State of Pure Rust CJK

Here's an honest assessment of the pure Rust ecosystem for CJK work:

The gaps are real. Text shaping in particular is a hard open problem for pure Rust — for simple CJK rendering you can get away without a full shaper, but for mixed CJK/Latin text with proper kerning and ligatures, you eventually need something HarfBuzz-level.

What This Means in Practice

If you're building something that needs to handle CJK text in Rust:

For encoding conversion, use encoding_rs. Don't roll your own.
For normalization, use unicode-normalization and decide up front which form you want. For search, NFKC is usually the right default.
For PDF with CJK, the pure Rust path exists but requires understanding the subsetting pipeline. Wrapping pdfium or a C-based library is currently the easier path if WASM compatibility isn't a requirement.
For fuzzy search over CJK, normalization before indexing matters more than the search algorithm itself.
For text rendering, if you can accept a C dependency, HarfBuzz + FreeType is the proven path. Pure Rust rendering is possible for simple cases.

Closing Thought

CJK support isn't a single feature — it's a stack of problems that compound. The good news is that the Rust ecosystem is making real progress on each layer. The bad news is that each layer requires understanding the layer below it, which is why CJK support tends to be either "works perfectly" or "completely broken" with little middle ground.

If you're working on any of these problems — subsetting, normalization, collation, shaping — I'd love to compare notes.

I've run into most of these problems while building harumi, a pure Rust PDF library with CJK font subsetting. The gaps in the table above are the ones I've personally hit.
