How fast can you validate UTF-8 strings in JavaScript?
When you recover textual content from the disk or from the network, you may expect it to be a Unicode string in UTF-8: it is the most common format. Unfortunately, not all sequences of bytes are valid UTF-8, and accepting invalid UTF-8 without validating it is a security risk.
How might you validate a UTF-8 string in a JavaScript runtime?
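In the examples that follow, file_content stands for the raw bytes (a Buffer or Uint8Array) rather than a decoded string. A minimal sketch of how you might obtain it, assuming a file named data.txt:

import { readFileSync } from "node:fs";

// Read the raw bytes; do not pass an encoding, or you would get a string instead of a buffer.
const file_content = readFileSync("data.txt");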
You might use the valid-8 module:
import valid8 from "valid-8";
if(!valid8(file_content)) { console.log("not UTF-8"); }
Another common approach relies on the fact that TextDecoder, when created with the fatal option, throws an exception on invalid input:
new TextDecoder("utf8", { fatal: true }).decode(file_content)
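The decode call above throws a TypeError on invalid input rather than returning a boolean, so in practice you would wrap it. A minimal sketch of one way to turn the exception into a true/false check:

// Returns true if the bytes are valid UTF-8, false otherwise.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

if (!isValidUtf8(file_content)) { console.log("not UTF-8"); }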
Or you might use the isUtf8 function, which is part of Node.js and Bun:
import { isUtf8 } from "node:buffer";
if(!isUtf8(file_content)) { console.log("not UTF-8"); }
How do they compare? Using Node.js 20 on a Linux server (Intel Ice Lake), I get the following speeds with three files representative of different languages. The Latin file is just ASCII. My benchmark is available.
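The linked benchmark is the authoritative one; purely as an illustration, here is a hypothetical micro-benchmark along the same lines, timing each approach on the same buffer (it assumes the valid-8 package is installed and that data.txt exists):

import { readFileSync } from "node:fs";
import { isUtf8 } from "node:buffer";
import valid8 from "valid-8";

const file_content = readFileSync("data.txt");
const decoder = new TextDecoder("utf8", { fatal: true });

// Crude timing loop: run each validator 100 times and report the elapsed time.
function timeIt(name, fn) {
  const start = performance.now();
  for (let i = 0; i < 100; i++) { fn(); }
  console.log(name, (performance.now() - start).toFixed(1), "ms for 100 runs");
}

timeIt("valid-8", () => valid8(file_content));
timeIt("TextDecoder", () => { try { decoder.decode(file_content); } catch {} });
timeIt("isUtf8", () => isUtf8(file_content));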
The current isUtf8 function in Node.js was implemented by Yagiz Nizipli. It uses the simdutf library underneath.