Hi guys!
Quick midterm update from me! Just as a little refresher, my name is Anqi and I’ve been working on the Transliteration of Search Results project this summer.
Time has really flown by and we are at the halfway point! Here I hope to give a summary of the work that's been done so far, as well as what I hope to accomplish during the next part of the summer!
The goal is to return the transliterated name as an extra field in search results, with a proof of concept shown here! In this proof of concept the results look the same as before, with the addition of one transliterated name field.
{
"place_id":100067,
"licence":"Data © OpenStreetMap contributors, ODbL 1.0. http://osm.org/copyright",
"osm_type":"way",
"osm_id":1307932969,
"place_rank":30,
"category":"amenity",
"type":"school",
"importance":-0.004452559995167061,
"addresstype":"amenity",
"name":"丹东市第六中学",
"display_name":"丹东市第六中学, 七纬路, 站前街道, 丹东市, 振兴区, 118000, 中国",
"transliterated_name":"Dan Dong Shi Di Liu Zhong Xue, Qi Wei Lu, Zhan Qian Jie Dao, Dan Dong Shi, Zhen Xing Qu, 118000, Zhong Guo",
"bbox":[124.3804784,40.1271951,124.3830593,40.1292045],
"geometry":{
"type": "Point",
"coordinates": [124.38176886493923, 40.12819985]
},
The general flow can be visualized as below.

One additional comment: one big issue that we have run into so far is the difference between a script and a language. The script refers to the way text is written, e.g. Cantonese and the Mandarin spoken in Taiwan share the same script, the way English and French share the same script. Language, however, primarily refers to how the text is spoken. This matters because some languages that share a script pronounce its characters in broadly the same way, e.g. English and French, while others do not, such as Cantonese and Mandarin. This is especially important for transliteration, as the whole point is to preserve the pronunciation of the original text.
For the transliteration process, we first take the result and detect what country it is from. Using Nominatim's internal list of countries, we are able to extract the languages spoken in that country. However, this does not yet work for the autonomous regions Hong Kong and Macau, as they are not recognized as countries. Further work on this (and regionalization in general) is therefore needed, which I hope to tackle later. This would also allow us to venture further into dialects and region-specific pronunciations in the future.
Similar to Nominatim's current logic, if there is only one language spoken in the result country, we take that to be the language of the result. If not, the language is left undefined and no conclusions can be made. At this stage, if there is a single result language and the user knows it, we know that no transliteration is necessary and our new transliterated name field will just be the default name.
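To make this first stage concrete, here is a minimal sketch. It is not the project's actual code: the `COUNTRY_LANGUAGES` table and both function names are hypothetical stand-ins for Nominatim's internal country list and whatever helpers the real implementation uses.

```python
from typing import Dict, List, Optional

# Hypothetical excerpt of a country -> languages table. The real data comes
# from Nominatim's internal list of countries and is not reproduced here.
COUNTRY_LANGUAGES: Dict[str, List[str]] = {
    "de": ["de"],
    "cn": ["zh"],
    "ch": ["de", "fr", "it", "rm"],
}


def result_language(country_code: str) -> Optional[str]:
    """Return the result language only if the country has exactly one."""
    languages = COUNTRY_LANGUAGES.get(country_code.lower(), [])
    return languages[0] if len(languages) == 1 else None


def user_knows_result_language(country_code: str,
                               user_languages: List[str]) -> bool:
    """True when stage one can already conclude no transliteration is needed."""
    language = result_language(country_code)
    return language is not None and language in user_languages
```

With this table, a German-speaking user looking at a result in Germany stops here, while any result in Switzerland moves on to the next stage because its language is undefined.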
If the user does not know the language, or if there are multiple languages spoken in that country, we then move on to the second stage: localization. From the name tags of a result, which form a dictionary of name variants, we try to return the best matching name together with an identifier for the language it is in. This is slightly different from how display names are currently returned, because of this second return value. With this new function, we also set a new field that tells us whether the name is in a user-understandable language. This is important because localization previously just assigned a name, and if no valid names were present, it would silently fall back to the country default.
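A rough sketch of what localization with a second return value could look like is below. The function name `localize`, the `name:<lang>` lookup, and the English name in the example are all assumptions made for illustration, not Nominatim's real API or real OSM data.

```python
from typing import Dict, List, Optional, Tuple


def localize(names: Dict[str, str],
             user_languages: List[str]) -> Tuple[str, Optional[str]]:
    """Return (name, language).

    The language is None when no tagged variant matches and the default
    name had to be used, i.e. the name may not be user-understandable.
    """
    for language in user_languages:  # ordered by user preference
        variant = names.get(f"name:{language}")
        if variant:
            return variant, language
    return names.get("name", ""), None


# Hypothetical name tags; the "name:en" value is invented for this example.
names = {"name": "丹东市第六中学", "name:en": "Example School"}
```

Here `localize(names, ["fr", "en"])` finds the English variant, while `localize(names, ["fr"])` falls back to the default name and reports `None` as the language, which is the signal the later stages rely on.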
After this, we know that the result is not in a user-understandable script if both of the following hold:
- The user does not know the result country's language, or there are multiple languages spoken in that country
and
- No name tag is available in a language the user understands, so the address component still has its default value.
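The two conditions above can be combined into a single check. This is only a sketch: the function and parameter names are hypothetical, and `localized_language` is assumed to be `None` whenever localization had to fall back to the default name.

```python
from typing import List, Optional


def needs_transliteration(result_language: Optional[str],
                          localized_language: Optional[str],
                          user_languages: List[str]) -> bool:
    """Both conditions must hold before transliteration is attempted."""
    # Condition 1: the country's language is undefined (several languages)
    # or it is not one the user knows.
    language_unknown = (result_language is None
                        or result_language not in user_languages)
    # Condition 2: localization found no name the user understands.
    no_readable_name = localized_language is None
    return language_unknown and no_readable_name
```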
From here, we can finally proceed with transliteration. For transliteration to any Latin-based script, we use the unidecode library. We can detect whether the user knows any Latin-based script with a newly created list of languages, in which I have included all two-letter ISO 639 codes (which can be found here) as well as the tag yue, for Cantonese. This list acts as a dictionary that tells us whether a language has a Latin-based writing system, and it also contains the full name of the language and a sample excerpt of the script, if available. The transliteration function goes through the list of user languages in order, targeting the user's most preferred languages first. This means that if a user has English, Chinese, Arabic as their list of languages, transliteration will always be to a Latin-based script, because English comes first.
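As an illustration of the lookup, here is a small sketch. The `LANGUAGES` excerpt and its field names are assumptions about how such a list might be laid out, and only the Latin-script check is shown, so the block stays free of external dependencies.

```python
from typing import Dict, List, Optional

# Hypothetical excerpt of the language list; the field names are assumptions.
LANGUAGES: Dict[str, Dict[str, object]] = {
    "en": {"name": "English", "latin": True},
    "fr": {"name": "French", "latin": True},
    "zh": {"name": "Chinese", "latin": False},
    "ar": {"name": "Arabic", "latin": False},
    "yue": {"name": "Cantonese", "latin": False},
}


def first_latin_language(user_languages: List[str]) -> Optional[str]:
    """Return the user's most preferred Latin-script language, if any."""
    for language in user_languages:  # ordered by user preference
        entry = LANGUAGES.get(language)
        if entry is not None and entry["latin"]:
            return language
    return None
```

When this returns a language, the Latin step itself would be a call such as `unidecode.unidecode(name)` from the unidecode package.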
You might be wondering: what about transliteration to a non-Latin-based script? Well, due to the variety of tones, characters, and inflections in many non-Latin languages, transliterating to their script is actually very difficult. For example, using Chinese to transliterate even the word "Normal" might result in 10 different variations. However, we do want to provide support for this, so we are creating a pluggable interface for future development. Therefore, for every language the user knows, not only do we check if the script is Latin-based, but we also check if we have a way of transliterating to that language. Currently, we have examples of this for Cantonese, Mandarin in Traditional script, and Mandarin in Simplified script.
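One way such a pluggable interface could be shaped is a registry of per-language functions. Everything below is a hypothetical sketch: the registry, the decorator, and the placeholder bodies are invented for illustration, and the Latin step is a stand-in rather than a real unidecode call.

```python
from typing import Callable, Dict, List, Optional

Transliterator = Callable[[str], str]

# Hypothetical registry: language code -> function transliterating into it.
TRANSLITERATORS: Dict[str, Transliterator] = {}

# Stand-in for the full list of Latin-script languages.
LATIN_LANGUAGES = {"en", "fr", "de"}


def register(language: str) -> Callable[[Transliterator], Transliterator]:
    """Decorator that plugs a transliterator in for one target language."""
    def decorator(func: Transliterator) -> Transliterator:
        TRANSLITERATORS[language] = func
        return func
    return decorator


def to_latin(text: str) -> str:
    # Placeholder for the unidecode-based Latin step.
    return f"<latin:{text}>"


@register("yue")
def to_cantonese(text: str) -> str:
    # Placeholder body; a real plugin would map sounds to characters.
    return f"<yue:{text}>"


def transliterate(text: str, user_languages: List[str]) -> Optional[str]:
    """Use the first user language that is Latin-based or has a plugin."""
    for language in user_languages:  # ordered by user preference
        if language in LATIN_LANGUAGES:
            return to_latin(text)
        plugin = TRANSLITERATORS.get(language)
        if plugin is not None:
            return plugin(text)
    return None
```

The appeal of a registry like this is that supporting a new target language only means registering one more function, while the selection loop stays untouched.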
The final part of this is understanding what languages the user actually knows. This is taken from the browser information, similar to what Nominatim currently does. However, if you have worked with browser language codes before, you know that it is normally not just a nice two-letter code; the browser also adds region-specific language identifiers such as en-US and en-CA. To preserve regional locales but remove general duplicates, we first preserve the ordering of the 'importance' of each language to the user according to its weighting, then, upon the first instance of a non-generalized language code, add the generalized form directly after it. This also allows us to normalize languages when needed, which is especially important for Chinese. Chinese spans Simplified Chinese script in a spoken Mandarin form, Traditional Chinese script in a spoken Mandarin form, and Traditional Chinese script in a spoken Cantonese form, and its identifiers can leave it unclear what the user actually understands. Therefore, in the case of ambiguity, the largest number of languages will be added, which means that while zh-tw will only map to zh-Hant, zh could map to any of zh-Hans, zh-Hant, or yue. The normalization code can be found here.
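The ordering and normalization rules can be sketched roughly as follows. This is not the linked normalization code: the function names are made up, the `CHINESE_VARIANTS` table only contains the two mappings mentioned above, and replacing a mapped code (rather than also adding its generalized form) is my assumption.

```python
from typing import Dict, List

# Only the two mappings described in the text; anything else is left as-is.
CHINESE_VARIANTS: Dict[str, List[str]] = {
    "zh": ["zh-Hans", "zh-Hant", "yue"],
    "zh-tw": ["zh-Hant"],
}


def parse_accept_language(header: str) -> List[str]:
    """Return language codes ordered by descending q weight (stable)."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        code, _, params = part.strip().partition(";")
        if not code.strip():
            continue
        weight = 1.0  # a missing q value means full weight
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0
        weighted.append((-weight, position, code.strip()))
    return [code for _, _, code in sorted(weighted)]


def normalize(codes: List[str]) -> List[str]:
    """Add generalized forms, expand ambiguous codes, drop duplicates."""
    result: List[str] = []
    seen = set()

    def add(code: str) -> None:
        # An ambiguous code is replaced by every variant it could mean.
        for item in CHINESE_VARIANTS.get(code.lower(), [code]):
            if item.lower() not in seen:
                seen.add(item.lower())
                result.append(item)

    for code in codes:
        if code.lower() in CHINESE_VARIANTS:
            add(code)
            continue
        add(code)
        base = code.split("-")[0]
        if base != code:
            add(base)  # generalized form directly after the regional one
    return result
```

For the header `en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7` this yields `["en-US", "en", "zh-Hant", "zh-Hans", "yue"]`: the regional English code keeps its place, the plain `en` duplicate disappears, and the ambiguous `zh` contributes only the variants not already present.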
Thank you guys so much for reading! Please leave any feedback if you have any!