Unicode of Death 2.0



Picking Apart the Crashing iOS String

Posted by Manish Goregaokar | Feb 15th, 2018 12:00 am

https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/


So there’s yet another iOS text crash, where just looking at a particular string crashes iOS. Basically, if you put this string in any system text box (and other places), it crashes that process. I’ve been testing it by copy-pasting characters into Spotlight so I don’t end up crashing my browser.

The original sequence is U+0C1C U+0C4D U+0C1E U+200C U+0C3E, which is a sequence of Telugu characters: the consonant ja (జ), a virama ( ్ ), the consonant nya (ఞ), a zero-width non-joiner, and the vowel aa ( ా).
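To see exactly what is in the string without rendering it, you can list the code points by name. Here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for this illustration, and the string is written with escapes so the snippet itself never contains the literal text.

```python
import unicodedata

# The original crashing sequence, written as escapes rather than literal text.
CRASH_TELUGU = "\u0C1C\u0C4D\u0C1E\u200C\u0C3E"

def describe(text):
    """Return one 'U+XXXX NAME' line per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

for line in describe(CRASH_TELUGU):
    print(line)
```

The output should list ja, the virama, nya, the zero-width non-joiner, and the vowel sign aa, in that order.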

I was pretty interested in what made this sequence “special”, and started investigating.

So first when looking into this, I thought that the <ja, virama, nya> sequence was the culprit. That sequence forms a special ligature in many Indic scripts (ज्ञ in Devanagari) which is often considered a letter of its own. However, the ligature for Telugu doesn’t seem very “special”.

Also, from some experimentation, this bug seemed to occur for any pair of Telugu consonants with a vowel, as long as the vowel is not ై (ai). Huh.

The ZWNJ must be doing something weird, then. <consonant, virama, consonant, vowel> is a pretty common sequence in any Indic script, but a ZWNJ before a vowel isn’t very useful for most scripts (except for Bengali and Oriya, but I’ll get to that).

And then I saw that there was a sequence in Bengali that also crashed.

The sequence is U+09B8 U+09CD U+09B0 U+200C U+09C1, which is the consonant “so” (স), a virama ( ্ ), the consonant “ro” (র), a ZWNJ, and vowel u ( ু).



Before we get too into this, let’s first take a little detour to learn how Indic scripts work:


Indic scripts and consonant clusters

Indic scripts are abugidas, which means that their “letters” are consonants to which you can attach diacritics to change the vowel. By default, consonants have a base vowel. So, for example, क is “kuh” (kə, often transcribed as “ka”), but I can change the vowel to make it के (the “ka” in “okay”) or का (“kaa”, like “car”).

Usually, the default vowel is the ə sound, though not always (in Bengali it’s more of an o sound).

Because of the “default” vowel, you need a way to combine consonants. For example, if you wished to write the word “ski”, you can’t write it as स + की (sa + ki = “saki”), you must write it as स्की. What’s happened here is that the स got its vowel “killed”, and got tacked on to the की to form a consonant cluster ligature.

You can also write this as स्‌की . That little tail you see on the स is known as a “virama”; it basically means “remove this vowel”. Explicit viramas are sometimes used when there’s no easy way to form a ligature, e.g. in ङ्‌ठ because there is no simple way to ligatureify ङ into ठ. Some scripts also prefer explicit viramas, e.g. “ski” in Malayalam is written as സ്കീ, where the little crescent is the explicit virama.

In Unicode, the virama character is always used to form a consonant cluster. So स्की is written as <स, ्, क, ी>, or <sa, virama, ka, i>. If the font supports the cluster, it will show up as a ligature, otherwise it will use an explicit virama.
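As a rough illustration of that encoding, here is a small Python sketch that builds “ski” both ways. The helper `conjunct` and its `explicit_virama` flag are names invented for this example. The only difference between the ligature request and the explicit-virama request is a ZWNJ placed right after the virama.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

def conjunct(first, second, explicit_virama=False):
    """Join two consonants into a cluster; optionally ask for a visible virama."""
    return first + VIRAMA + (ZWNJ if explicit_virama else "") + second

SA, KA, VOWEL_II = "\u0938", "\u0915", "\u0940"

ski = conjunct(SA, KA) + VOWEL_II                                 # font may draw a ligature
ski_explicit = conjunct(SA, KA, explicit_virama=True) + VOWEL_II  # virama stays visible

print([f"U+{ord(c):04X}" for c in ski])
print([f"U+{ord(c):04X}" for c in ski_explicit])
```

Whether the first form actually renders as a ligature is up to the font; the code point sequence only expresses the request.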

For Devanagari and Bengali, usually, in a consonant cluster the first consonant is munged a bit and the second consonant stays intact. There are exceptions – sometimes they’ll form an entirely new glyph (क + ष = क्ष), and sometimes both glyphs will change (ड + ड = ड्ड, द + म = द्म, द + ब = द्ब). Those last ones should look like this in conjunct form:


Investigating the Bengali case

Now, interestingly, unlike the Telugu crash, the Bengali crash seemed to only occur when the second consonant is র (“ro”). However, I can trigger it for any choice of the first consonant or vowel, except when the vowel is ো (o) or ৌ (au).

Now, র is an interesting consonant in some Indic scripts, including Devanagari. In Devanagari, it looks like र (“ra”). However, it does all kinds of things when forming a cluster. If you’re having it precede another consonant in a cluster, it forms a little feather-like stroke, like in र्क (rka). In Marathi, that stroke can also look like a tusk, as in र्‍क. As a suffix consonant, it can provide a little “extra leg”, as in क्र (kra). For letters without a vertical stroke, like ठ (tha), it does this caret-like thing, ठ्र (thra).

Basically, while most consonants retain some of their form when put inside a cluster, र does not. And a more special thing about र is that this happens even when र is the second consonant in a cluster – as I mentioned before, for most consonant clusters the second consonant stays intact. While there are exceptions, they are usually specific to the cluster; it is only र for which this happens for all clusters.

It’s similar in Bengali: র as the second consonant adds a tentacle-like thing on the existing consonant. For example, প + র (po + ro) gives প্র (pro).

But it’s not just র that does this in Bengali; the consonant “jo” does as well. প + য (po + jo) forms প্য (pjo), and the য is transformed into a wavy line called a “jophola”.

So I tried it with য, and it turns out that the Bengali crash occurs for য as well! So the general Bengali case is <consonant, virama, র OR য, ZWNJ, vowel>, where the vowel is not ো or ৌ.


Suffix-joining consonants

So we’re getting close, here. At least for Bengali, it occurs when the second consonant is such that it often combines with the first consonant without modifying its form much.

In fact, this is the case for Telugu as well! Consonant clusters in Telugu are usually formed by preserving the original consonant, and tacking the second consonant on below!

For example, the original crashy string contains the cluster జ + ఞ, which looks like జ్ఞ. The first letter isn’t really modified, but the second is.

From this, we can guess that it will also occur for Devanagari with र. Indeed it does! U+0915 U+094D U+0930 U+200C U+093E, that is, <क, ्, र, zwnj, ा> (< ka, virama, ra, zwnj, aa >) is one such crashing sequence.

But this isn’t really the whole story, is it? For example, the crash does occur for “kro” + zwnj + vowel in Bengali, and in “kro” (ক্র = ক + র = ko + ro) the resultant cluster involves the munging of both the prefix and suffix. But the crash doesn’t occur for द्ब or ड्ड. It seems to be specific to the letter, not the nature of the cluster.

Digging deeper, the reason is that for many fonts (presumably the ones in use), these consonants form “suffix joining consonants”[1] (a term I made up) when preceded by a virama. This seems to correspond to the pstf OpenType feature, as well as vatu.

For example, the sequence virama + क gives ्क, i.e. it renders a virama with a placeholder followed by a क.

But, for र, virama + र renders ्र, which for me looks like this:

In fact, this is the case for the other consonants as well. For me, ्र ্র ্য ్ఞ ్క (Devanagari virama-ra, Bengali virama-ro, Bengali virama-jo, Telugu virama-nya, Telugu virama-ka) all render as “suffix joining consonants”:

(This is true for all Telugu consonants, not just the ones listed).

An interesting bit is that the crash does not occur for <र, virama, र, zwnj, vowel>, because र-virama-र uses the prefix-joining form of the first र (र्र). The same occurs for র with itself or ৰ or য. Because the virama is “stickier” to the left in these cases, it doesn’t cause a crash. (h/t hackbunny for discovering this using a script to enumerate all cases.)
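An enumeration like that is easy to sketch. This is not hackbunny’s actual script, just a guess at the shape of one in Python. The Devanagari ranges are my assumption (the contiguous blocks U+0915..U+0939 for consonants and U+093E..U+094C for vowel signs), and each candidate would still have to be fed to a vulnerable device by some harness that is not shown here.

```python
import itertools
import unicodedata

ZWNJ = "\u200C"
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def assigned(first, last):
    """Characters in [first, last] that have a Unicode name (skips unassigned slots)."""
    return [chr(cp) for cp in range(first, last + 1) if unicodedata.name(chr(cp), "")]

# Assumed ranges: Devanagari consonants KA..HA and dependent vowel signs AA..AU.
CONSONANTS = assigned(0x0915, 0x0939)
VOWEL_SIGNS = assigned(0x093E, 0x094C)

def candidates():
    """Yield every <consonant, virama, consonant, ZWNJ, vowel sign> string."""
    for c1, c2, v in itertools.product(CONSONANTS, CONSONANTS, VOWEL_SIGNS):
        yield c1 + VIRAMA + c2 + ZWNJ + v

all_candidates = list(candidates())
print(len(all_candidates), "candidate strings")
```

The same loop works for Bengali or Telugu by swapping in that script’s virama and ranges.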

Kannada also has “suffix joining consonants”, but for some reason I cannot trigger the crash with it. Ya in Gurmukhi is also suffix-joining.


The ZWNJ

The ZWNJ is curious. The crash doesn’t happen without it, but as I mentioned before a ZWNJ before a vowel doesn’t really do anything for most Indic scripts. In Indic scripts, a ZWNJ can be used to explicitly force a virama if used after the virama (I used it to write स्‌की in this post), however that’s not how it’s being used here.

In Bengali and Oriya specifically, a ZWNJ can be used to force a different vowel form when used before a vowel (e.g. রু vs র‌ু), however this bug seems to apply to vowels for which there is only one form, and this bug also applies to other scripts where this isn’t the case anyway.
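The difference between these ZWNJ positions can be made concrete with a few lines of Python. This is only a sketch: the function name `zwnj_roles` and the three labels are invented here, and it identifies viramas and vowel signs by their Unicode character names, which is a shortcut rather than a proper script analysis.

```python
import unicodedata

ZWNJ = "\u200C"

def zwnj_roles(text):
    """Classify each ZWNJ in text by the characters on either side of it."""
    roles = []
    for i, ch in enumerate(text):
        if ch != ZWNJ:
            continue
        before = unicodedata.name(text[i - 1], "") if i > 0 else ""
        after = unicodedata.name(text[i + 1], "") if i + 1 < len(text) else ""
        if before.endswith("SIGN VIRAMA"):
            roles.append("after virama")        # forces an explicit virama
        elif "VOWEL SIGN" in after:
            roles.append("before vowel sign")   # the position used in the crash
        else:
            roles.append("other")
    return roles

print(zwnj_roles("\u0938\u094D\u200C\u0915\u0940"))   # "ski" with an explicit virama
print(zwnj_roles("\u09B0\u200C\u09C1"))               # Bengali ro + ZWNJ + u
print(zwnj_roles("\u0C1C\u0C4D\u0C1E\u200C\u0C3E"))   # the original Telugu string
```

The crashing strings fall in the second category, the same position as the Bengali vowel-form use, even in scripts where that position has no function.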

The exception vowels are interesting. They’re basically all vowels that are made up of two glyph components. Philippe Verdy points out:

And why this bug does not occur with some vowels is because these are vowels in two parts, that are first decomposed into two separate glyphs reordered in the buffer of glyphs, while other vowels do not need this prior mapping and keep their initial direct mapping from their codepoints in fonts, which means that this has to do with the way the ZWNJ looks for the glyphs of the vowels in the glyphs buffer and not in the initial codepoints buffer: there’s some desynchronization, and more probably an uninitialized data field (for the lookup made in handling ZWNJ) if no vowel decomposition was done (the same data field is correctly initialized when it is the first consonant which takes an alternate form before a virama, like in most Indic consonant clusters, because a glyph buffer is created).
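To make that hypothesis easier to picture, here is a deliberately tiny Python model of it. Everything in it is an assumption made for illustration: the function `shape_toy`, the `guard_missing_buffer` flag, and the glyph labels are invented, and none of this is Apple’s code or a claim about how the real renderer is structured. It only shows how a buffer that is created as a side effect of vowel decomposition can be missing when the ZWNJ handling goes looking for it.

```python
ZWNJ = 0x200C
TWO_PART_VOWELS = {0x0C48, 0x09CB, 0x09CC}  # the exception vowels from the post

def label(cp):
    return f"U+{cp:04X}"

def shape_toy(codepoints, guard_missing_buffer):
    """Return the glyph label chosen for the final vowel of one cluster."""
    glyph_buffer = None  # hypothetical: allocated only when a vowel is decomposed
    for i, cp in enumerate(codepoints):
        if cp in TWO_PART_VOWELS:
            # Splitting the vowel into two glyphs forces the buffer to exist.
            glyph_buffer = [label(c) for c in codepoints[:i]]
            glyph_buffer += [label(cp) + ".part1", label(cp) + ".part2"]
    if ZWNJ not in codepoints:
        return label(codepoints[-1])  # fast path: direct code point lookup
    if glyph_buffer is None and guard_missing_buffer:
        # The check that, under this theory, is missing in the buggy path.
        glyph_buffer = [label(c) for c in codepoints]
    return glyph_buffer[-1]  # raises TypeError if the buffer was never built

crash = [0x0C1C, 0x0C4D, 0x0C1E, 0x200C, 0x0C3E]   # vowel aa: single glyph
spared = [0x0C1C, 0x0C4D, 0x0C1E, 0x200C, 0x0C48]  # vowel ai: two parts

print(shape_toy(spared, guard_missing_buffer=False))
print(shape_toy(crash, guard_missing_buffer=True))
try:
    shape_toy(crash, guard_missing_buffer=False)
except TypeError as exc:
    print("toy 'crash':", exc)
```

In the toy, the two-part vowel is spared only because decomposing it happens to build the buffer first, which is the shape of the explanation quoted above.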


Generalizing

So, ultimately, the full set of cases that cause the crash is:

Any sequence <consonant1, virama, consonant2, ZWNJ, vowel> in Devanagari, Bengali, and Telugu, where:

  • consonant2 is suffix-joining (pstf/vatu) – i.e. र, র, য, and all Telugu consonants
  • consonant1 is not a reph-forming letter like र/র (or a variant, like ৰ)
  • vowel does not have two glyph components, i.e. it is not ై, ো, or ৌ
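Those three rules translate fairly directly into a checker. The Python below is a sketch under stated assumptions: the code point ranges are each script’s main consonant and vowel-sign blocks as I understand them, the table `SCRIPTS` and the function `find_crash_pattern` are names invented here, and the rule sets contain only the letters listed above. It describes the pattern; it says nothing about whether a given OS version is affected.

```python
ZWNJ = 0x200C

# Ranges are simplifying assumptions; unassigned slots inside them are harmless.
SCRIPTS = {
    "Devanagari": {
        "consonants": set(range(0x0915, 0x093A)),
        "vowel_signs": set(range(0x093E, 0x094D)),
        "virama": 0x094D,
        "suffix_joining": {0x0930},            # ra
        "reph_forming": {0x0930},              # ra
        "two_part_vowels": set(),
    },
    "Bengali": {
        "consonants": set(range(0x0995, 0x09BA)) | {0x09F0},
        "vowel_signs": set(range(0x09BE, 0x09CD)),
        "virama": 0x09CD,
        "suffix_joining": {0x09B0, 0x09AF},    # ro, jo
        "reph_forming": {0x09B0, 0x09F0},      # ro and its variant
        "two_part_vowels": {0x09CB, 0x09CC},   # o, au
    },
    "Telugu": {
        "consonants": set(range(0x0C15, 0x0C3A)),
        "vowel_signs": set(range(0x0C3E, 0x0C4D)),
        "virama": 0x0C4D,
        "suffix_joining": set(range(0x0C15, 0x0C3A)),  # all Telugu consonants
        "reph_forming": set(),                 # none listed in the post
        "two_part_vowels": {0x0C48},           # ai
    },
}

def find_crash_pattern(text):
    """Return (index, script) for each 5-code-point window matching the rules."""
    cps = [ord(ch) for ch in text]
    hits = []
    for i in range(len(cps) - 4):
        c1, virama, c2, joiner, vowel = cps[i:i + 5]
        for script, s in SCRIPTS.items():
            if (
                c1 in s["consonants"] and c1 not in s["reph_forming"]
                and virama == s["virama"]
                and c2 in s["suffix_joining"]
                and joiner == ZWNJ
                and vowel in s["vowel_signs"]
                and vowel not in s["two_part_vowels"]
            ):
                hits.append((i, script))
    return hits

print(find_crash_pattern("\u0C1C\u0C4D\u0C1E\u200C\u0C3E"))  # original Telugu string
print(find_crash_pattern("\u0930\u094D\u0930\u200C\u093E"))  # ra + ra: reph, no match
```

Keeping the rules in a per-script table makes it easy to add a script such as Kannada later and compare what the pattern predicts with what actually happens.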

This leaves one question open:

Why doesn’t it apply to Kannada? Or, for that matter, Khmer, which has a similar virama-like thing called a “coeng”.


Are these valid strings?

A recurring question I’m getting is whether these strings are valid in the language, or Unicode gibberish like Zalgo text. Breaking it down:

  • All of the rendered glyphs are valid. The original Telugu one is the root of the word for “knowledge” (and I’ve taken to calling this bug “forbidden knowledge” for that reason).
  • In Telugu and Devanagari, there is no functional use of the ZWNJ as used before a vowel. It should not be there, and one would not expect it in typical text.
  • In Bengali (also Oriya), putting a ZWNJ before some vowels prevents them from ligatureifying, and this is mentioned in the Unicode spec. However, it seems rare for native speakers to use this.
  • In all of these scripts, putting a ZWNJ after viramas can be used to force an explicit virama over a ligature. That is not the position in which the ZWNJ is used here, but it gives a hint that this might have been a mistype. Doing this is also rare, at least for Devanagari (and I believe for the other two scripts as well).
  • Android has an explicit key for ZWNJ on its keyboards for these languages[2], right next to the spacebar. iOS has this as well on the long-press of the virama key. Very easy to mistype, at least for Android.

So while the crashing strings are usually invalid, and when not, very rare, they are easy enough to mistype.

An example by @FakeUnicode was the string “For/k” (or “Foŕk”, if accents were easier to type). A slash isn’t something you’d normally type there, and the produced string is gibberish, but it’s easy enough to type by accident.

Except of course that the mistake in “For/k”/“Foŕk” is visually obvious and would be fixed; this isn’t the case for most of the crashing strings.


Conclusion

I don’t have a definitive explanation for what’s going on here, and I’d love to see what people think, but my current guess is that the “affinity” of the virama to the left instead of the right confuses the algorithm that handles ZWNJs after viramas into thinking the ZWNJ applies to the virama (it doesn’t, there’s a consonant in between), and this leads to some numbers not matching up and causing a buffer overflow or something. Philippe’s diagnosis of the vowel situation matches up with this.

An interesting thing is that I can cause this crash to happen more reliably in browsers by clicking on the string.

Additionally, sometimes it actually renders in Spotlight for a split second before crashing, which means that either the crash isn’t deterministic, or it occurs in some process after rendering. I’m not sure what to think of either. Looking at the backtraces, the crash seems to occur in different places, so it’s likely that it’s memory corruption that gets uncovered later.

I’d love to hear if folks have further insight into this.

Update: Philippe on the Unicode mailing list has an interesting theory

Yes, I could attach a debugger to the crashing process and investigate that instead, but that’s no fun 😂


  1. Philippe Verdy points out that these may be called “phala forms”, at least for Bengali.
  2. I don’t think the Android keyboard needs this key; the keyboard seems very much a dump of “what does this Unicode block let us do”, and includes things like Sindhi-specific or Kashmiri-specific characters for the Marathi keyboard as well as extremely archaic characters, whilst neglecting more common things like the eyelash reph (which doesn’t have its own code point but is a special Unicode sequence; native speakers should not be expected to be aware of this sequence).




Unicode of Death 2.0

https://www.unicode.org/mail-arch/unicode-ml/y2018-m02/0103.html

From: Philippe Verdy via Unicode <unicode_at_unicode.org> 

Date: Sun, 18 Feb 2018 01:30:09 +0100


My opinion about this bug is that Apple's text renderer dynamically
allocates a glyphs buffer only when needed (lazily), but a test is missing
for the lazy construction of this buffer (which is not needed for most
texts not needing glyph substitutions or reordering when a single accessor
from the code point can find the glyph data directly by lookup in font
tables) and this is causing a null pointer exception at run time.

The bug occurs effectively when processing the vowel that occurs after the
ZWNJ, if the code assumes that there's a glyphs buffer already constructed
for the cluster, in order to place the vowel over the correct glyph (which
may have been reordered in that buffer).

Microsoft's text renderer and other engines do not delay the
construction of the glyphs buffer, which can be reused for processing the
rest of the text, provided it is correctly reset after processing a cluster.


2018-02-17 21:54 GMT+01:00 Manish Goregaokar <manish_at_mozilla.com>: 

> Heh, I wasn't aware of the word "phala-form", though that seems
> Bengali-specific?
>
> Interesting observation about the vowel glyphs, I'll mention this in the
> post. Initially I missed this because I hadn't realized that the bengali o
> vowel crashed (which made me discount this).
>
> Thanks!
>
> -Manish
>
> On Sat, Feb 17, 2018 at 12:22 PM, Philippe Verdy <verdy_p_at_wanadoo.fr>
> wrote:

>> I would have liked that your invented term of "left-joining consonants"
>> took the usual name "phala forms" (to represent RA or JA/JO after a virama,
>> generally named "raphala" or "japhala/jophala").
>>
>> And why this bug does not occur with some vowels is because these are
>> vowels in two parts, that are first decomposed into two separate glyphs
>> reordered in the buffer of glyphs, while other vowels do not need this
>> prior mapping and keep their initial direct mapping from their codepoints
>> in fonts, which means that this has to do with the way the ZWNJ looks for the
>> glyphs of the vowels in the glyphs buffer and not in the initial codepoints
>> buffer: there's some desynchronization, and more probably an uninitialized
>> data field (for the lookup made in handling ZWNJ) if no vowel decomposition
>> was done (the same data field is correctly initialized when it is the first
>> consonant which takes an alternate form before a virama, like in most
>> Indic consonant clusters, because a glyph buffer is created).
>>
>> Now we have some hints about why the bug does not occur in Kannada or
>> Khmer: a glyph buffer is always created, but there was some shortcut made
>> in Devanagari, Bengali, and Telugu to allow processing clusters faster
>> without having to always create a glyphs buffer (to allow reordering glyphs
>> before positioning them), and working directly on the codepoints streams.
>>
>> So it seems related to the fact that OpenType fonts do not need to
>> include rules for glyph substitution, but the PHALA forms are represented
>> without any glyph substitution, by mapping directly the phala forms in a
>> separate table for the consonants. Because there's been no code-to-glyph
>> substitution, the glyph buffer is not created, but then when processing the
>> ZWNJ, it looks for data in a glyph buffer that has still not been initialized
>> (and this is specific to the renderers implemented by Apple in iOS and
>> MacOS). This bug does not occur if another text rendering engine is used
>> (e.g. in non-Apple web browsers).

>> 

>> 

>> 2018-02-16 19:44 GMT+01:00 Manish Goregaokar <manish_at_mozilla.com>:
>>
>>> FWIW I dissected the crashing strings, it's basically all <consonant,
>>> virama, consonant, zwnj, vowel> sequences in Telugu, Bengali, Devanagari
>>> where the consonant is suffix-joining (ra in Devanagari, jo and ro in
>>> Bengali, and all Telugu consonants), the vowel is not Bengali au or o /
>>> Telugu ai, and if the second consonant is ra/ro the first one is not also
>>> ra/ro (or ro-with-line-through-it).
>>>
>>> https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/
>>>
>>> -Manish
>>>
>>> On Thu, Feb 15, 2018 at 10:58 AM, Philippe Verdy via Unicode <
>>> unicode_at_unicode.org> wrote:

>>> 

>>>> That's probably not a bug of Unicode but of MacOS/iOS text renderers
>>>> with some fonts using advanced composition feature.
>>>>
>>>> Similar bugs could as well affect the new advanced features added in Windows
>>>> or Android to support multicolored emojis, variable fonts, contextual glyph
>>>> transforms, style variants, or more font formats (not just OpenType); the
>>>> bug may also be in the graphic renderer (incorrect clipping when drawing
>>>> the glyph into the glyph cache, with buffer overflows possibly caused by
>>>> incorrectly computed splines), and it could be in the display driver (or in
>>>> the hardware accelerator having some limitations on the complexity of
>>>> multipolygons to fill and to antialias), causing some infinite recursion
>>>> loop, or too deep recursion exhausting the stack limit.
>>>>
>>>> Finally the bug could be in the OpenType hinting engine moving some
>>>> points outside the clipping area (the math theory may say that such
>>>> placement of a point outside the clipping area may be impossible, but
>>>> various mathematical simplifications and shortcuts are used to simplify or
>>>> accelerate the rendering, at the price of some quirks). Even the SVG
>>>> standard (in constant evolution) could be affected as well in its
>>>> implementation.
>>>>
>>>> There are tons of possible bugs here.
>>>>
>>>> 2018-02-15 18:21 GMT+01:00 James Kass via Unicode <unicode_at_unicode.org>:

>>>> 

>>>>> This article:
>>>>> https://techcrunch.com/2018/02/15/iphone-text-bomb-ios-mac-crash-apple/?ncid=mobilenavtrend
>>>>>
>>>>> The single Unicode symbol referred to in the article results from a
>>>>> string of Telugu characters. The article doesn't list or display the
>>>>> characters, so Mac users can visit the above link. A link in one of
>>>>> the comments leads to a page which does display the characters.

