Computers store every piece of text using a “character encoding,” which assigns a number to each character. For example, the byte 61 stands for ‘a’ and 62 stands for ‘b’ in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, C1 could mean any of ¡, Ё, Ą, Ħ, ‘, ”, or parts of thousands of characters, from æ to 品. If you brought a file from one machine to another, it could come out as gobbledygook.
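To make that ambiguity concrete, here is a minimal Python sketch (ours, not from the original post) that decodes the same byte, C1, under a few legacy encodings; each yields a different character:

    raw = bytes([0xC1])
    # The same byte decodes differently under each legacy encoding
    # (for example, latin-1 yields Á while mac-roman yields ¡).
    for enc in ("latin-1", "koi8-r", "mac-roman", "cp866"):
        print(f"{enc}: {raw.decode(enc)}")

    # ASCII itself is unambiguous, but only covers bytes 00-7F:
    print(bytes([0x61]).decode("ascii"))  # -> 'a'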
Unicode was invented to solve that problem: to encode all human languages, from Chinese (中文) to Russian (русский) to Arabic (العربية), and emoji symbols as well; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn’t even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted; little did anyone realize at the time how vital each would be for the other. Today, people can easily share documents on the web, no matter what their language.
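A rough illustration in Python (ours, for scale) of where some characters fall relative to ASCII’s 128 slots:

    print(hex(ord("a")))   # 0x61: fits within ASCII's 128 code points
    print(hex(ord("“")))   # 0x201c: a curly quote, beyond ASCII's range
    print(hex(ord("中")))  # 0x4e2d: a CJK ideograph
    print(chr(0x10FFFF))   # U+10FFFF, the highest of Unicode's 1,114,112 code points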
Every January, we look at the percentage of webpages in our index that are in different encodings. Here’s what our data looks like with the latest figures*:
*Your mileage may vary: these figures may differ somewhat from what other search engines find. The graph lumps encodings together by script. We detect the encoding for each webpage; the ASCII pages contain only ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data.
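The “ASCII” bucket, for instance, can be identified with a check along these lines (a hypothetical sketch; is_ascii_page is our name, not part of any actual pipeline):

    def is_ascii_page(raw: bytes) -> bool:
        # A page counts as ASCII if every byte is in the 0x00-0x7F range.
        return all(b < 0x80 for b in raw)

    print(is_ascii_page(b"plain text"))           # True
    print(is_ascii_page("中文".encode("utf-8")))  # False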
As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006. Notice that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). The more documents that are in Unicode, the less likely you are to see mangled characters (what the Japanese call mojibake) when you’re surfing the web.
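Mojibake is easy to reproduce; a minimal Python sketch, decoding UTF-8 bytes under the wrong encoding:

    text = "日本語"  # "Japanese", three CJK characters
    utf8_bytes = text.encode("utf-8")
    # Misinterpreting the UTF-8 bytes as windows-1252 mangles the text:
    print(utf8_bytes.decode("windows-1252"))  # prints "æ—¥æœ¬èªž"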
We’ve long used Unicode as the internal format for all the text Google searches and processes: any other encoding is first converted to Unicode. Version 6.1 was just released with over 110,000 characters; soon we’ll be updating to that version and to Unicode’s locale data from CLDR 21 (both via ICU). The continued rise in the use of Unicode makes it much simpler to do the processing for the many languages that we cover. Without it, maintaining our unified index would be nearly impossible; it’d be a bit like not being able to convert between the hundreds of currencies in the world. Commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in nearly any language.
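A hypothetical sketch of that “convert everything to Unicode first” step (the function name and the fallback order are our assumptions, not Google’s actual pipeline):

    def to_unicode(raw: bytes, declared: str | None = None) -> str:
        # Try the page's declared encoding first, then UTF-8;
        # fall back to latin-1, which accepts any byte sequence.
        candidates = ([declared] if declared else []) + ["utf-8"]
        for enc in candidates:
            try:
                return raw.decode(enc)
            except (UnicodeDecodeError, LookupError):
                continue
        return raw.decode("latin-1")

    print(to_unicode("résumé".encode("utf-8")))               # résumé
    print(to_unicode("résumé".encode("latin-1"), "latin-1"))  # résumé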
Posted by Mark Davis, International Software Architect