Language Scribing Overview
Conformance with the Web Content Accessibility Guidelines (WCAG) at Level AA requires that ebooks mark where the language changes within a book (Success Criterion 3.1.2, Language of Parts). This markup helps screen readers and other assistive technology read the content without jarring, incorrect pronunciation. “Proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text” are all exempt from this requirement.
Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) can frequently be identified using the Scribe Language Styles setting in the Digital Hub. Languages that use the Latin alphabet often need to be identified through manual actions and editorial judgment.
The Well-Formed Document Workflow includes methods to mark these language shifts in different stages of a project. Ideally, this action takes place when preparing a manuscript in Word. In a .docx file, languages can be marked by manually creating a new Word paragraph or character style with a name that combines an ScML style and an established language code.
- Pattern in Word:
[ScML Style]@lang=[Language Code]
- Example Style Name:
lang-i@lang=es
Note: While most language tagging will occur on the character style level, if entire paragraphs use a different language, the scribing can be applied on the paragraph level.
These styles can be created and applied to scribed manuscripts by using the SAI’s Add Language Style tool. This may be done during the Word Scribing procedure or at a designated time during or following the copyedit, before a publication moves on to the production stages. When creating new or revised ebooks from files that have already been produced, language styles may be added to the ScML file.
Language codes generally consist of two or three letters, determined by the BCP-47 standard. See Language Codes for a list of many common languages and how to find a corresponding code. If a language has no corresponding code, Scribe recommends applying lang or lang-i to this content with no additional code. Made-up languages, as may be found in works of science fiction or fantasy, should not be scribed.
The metadata (language codes) and language styles can be added in a Word document, a sam file, an ScML file, or an InDesign document. At whatever stage it is added, this metadata will travel through the Well-Formed Document Workflow.
This example shows how the metadata for Spanish-language italic text could be identified in Word and carried through to sam, ScML, and InDesign. In each environment, the formatting of the style name is slightly different.
- In Word:
lang-i@lang=es
- In sam/ScML:
<lang-i lang="es">
- In InDesign:
lang-i-language-es
Note: Hyphenated language codes, including region subtags (en-US, en-GB), are not completely supported throughout the WFDW. The language codes must be entirely lowercase. If this level of specificity is required, region subtags can be added at the ScML stage before converting to ebook.
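As an illustration only, the three naming patterns shown above can be derived from one another mechanically. The following Python sketch is not part of the SAI or the Digital Hub; the function name convert_word_style and the validation pattern are assumptions made for this example. It accepts only lowercase two- or three-letter codes, in keeping with the note on hyphenated codes.

```python
import re

# Assumed shape of a Word language style name: "<base style>@lang=<code>".
# Only lowercase two- or three-letter codes are accepted (no region subtags).
WORD_STYLE = re.compile(r"^(?P<base>[A-Za-z][\w-]*)@lang=(?P<code>[a-z]{2,3})$")


def convert_word_style(name):
    """Split a Word style name and render the sam/ScML and InDesign forms."""
    match = WORD_STYLE.match(name)
    if match is None:
        raise ValueError(f"not a lowercase, unhyphenated language style: {name!r}")
    base, code = match.group("base"), match.group("code")
    return {
        "word": f"{base}@lang={code}",
        "scml": f'<{base} lang="{code}">',
        "indesign": f"{base}-language-{code}",
    }


styles = convert_word_style("lang-i@lang=es")
print(styles["scml"])      # <lang-i lang="es">
print(styles["indesign"])  # lang-i-language-es
```

A name such as lang-i@lang=en-US is rejected by this sketch, which mirrors the recommendation to add region subtags at the ScML stage instead.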
Note: Even if language styles are applied at the manuscript stage, later changes to content (adding indexes and praise pages and applying alterations) mean that language scribing needs attention throughout the workflow. For example, the scribing of foreign-language terms in indexes should match how they are scribed in the body text.
Methods of Finding Languages to be Scribed
Whether starting with a scribed Word file, a sam file, or an ScML file, the following methods can be used to determine what content will require language styles to be applied. If starting with a .docx file, process the file to .sam in order to run the listed regular expressions.
- Use the Book Topic and TOC as a Guide: Take a broad view of what to expect based on the subject matter of the publication.
- Review the special characters list in the Digital Hub for languages that fall outside the Latin alphabet. Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) should already be well served by the Scribe Language Styles setting in the Digital Hub. When using this setting, the results should be reviewed to confirm all language blocks have been identified correctly by this automated process.
- Spelling and Grammar: Use the spelling and grammar features in programs like Microsoft Word.
- Use Sublime Check 2 and skim the list of titles to see if there is any widespread use of a language.
- AI Tools: As of 2026, Scribe does not endorse the use of artificial intelligence to perform any actions within the WFDW, and the official procedure presented here does not provide any recommendations for prompts or methods to interact with AI services. However, users may choose for themselves to prompt AI to flag terms for a human to review.
- Review Character Styles, Paragraph Styles, Book-Specific Styles, and Text in Quotation Marks: Use the searches listed in the Language Scribing in sam, ScML, or InDesign section.
Determining Languages
Many terms and phrases cannot be identified programmatically, particularly due to the use of a common alphabet. Therefore, a key aspect of language scribing is the need for a human to review terms and decide what action should be taken.
When determining what language a term or phrase is, exceptions abound, and borrowed terms or loanwords may fall into a gray area in which people may come to different conclusions about how to approach them. Terms like “déjà vu” or “rendezvous” are commonly accepted as English when used in an English context; if surrounded by French text, however, they would reasonably be included in the French language tagging.
Consider the following when encountering terms and phrases:
- When using the Scribe Language Styles setting in the Digital Hub, some Asian scripts may get identified with different subtags by the Digital Hub, even when it’s clear from the context that the same language is being used for the entire block of text. For example, a single line of text could have both the general Chinese language code (“zh”) and the Taiwanese subtag (“zh-TW”) applied to different words. In these cases, the correct tag needs to be identified and the language markup needs to be cleaned up.
- Some Unicode blocks may be used in more than one language. Examples include the CJK Unified Ideographs used in Chinese, Japanese, and Korean; Arabic, Persian, and others; Hebrew and Yiddish; and Cyrillic languages. If using the Scribe Language Styles setting, copy the languages into a new file with the following search and review the list for anything that does not match what is known about the book:
Find:
lang="[^"]+"
- See Language Scribing in sam, ScML, or InDesign for a list of searches to help find languages that use the Latin alphabet.
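The lang attribute search in the list above can also be run outside a text editor. The following Python sketch, offered as one possible approach, tallies every language code found so that unexpected codes (such as a mix of zh and zh-TW) stand out. The sample string is invented for illustration and is not taken from a real sam or ScML file.

```python
import re
from collections import Counter

# The same search as above, with the code itself captured in a group.
LANG_ATTR = re.compile(r'lang="([^"]+)"')


def tally_languages(text):
    """Count how often each language code appears in lang attributes."""
    return Counter(LANG_ATTR.findall(text))


# Hypothetical sam/ScML fragment, invented for this example.
sample = (
    '<p>He said <lang-i lang="es">hola</lang-i> and '
    '<lang lang="zh">你好</lang> <lang lang="zh-TW">世界</lang>.</p>'
)

for code, count in sorted(tally_languages(sample).items()):
    print(code, count)
```

Any code in the tally that does not match what is known about the book is a candidate for cleanup.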
Scribe Recommendations
- If there is any doubt about whether a language tag should be applied, default to applying the style. Even if the term does not require tagging, including the tag should not harm the file, and it can take less time to scribe the content than to research or debate the issue.
- Scribe book, movie, and publication titles when they are in other languages.
Loanwords
In many cases, terms with foreign origins have been adopted into the English language. When making a decision about whether to apply the language scribing, the following factors may be considered.
- If the term is well known from English usage, it can be considered English. Foods like “nacho” or “taco” are examples of Spanish words that are commonly used and accepted as English when used in an English-language context.
- If a term appears in the Merriam-Webster Collegiate Dictionary without being marked specifically as a foreign term, it can be considered English.
- If a term has an English-language Wikipedia page, this may be indicative of it being a technical term that is understood across languages.
If the determination is unclear, default to applying a language style.
When to Scribe Language Styles
Language styles can be applied at any time within the WFDW. Consider the following to determine a work plan that will be most efficient, with language choices made by the appropriate person.
- Scribing stage. If there are relatively few instances, it is recommended to apply the language styles during the scribing stage.
- Copyediting stage. If the language scribing will be more involved and has the potential to have terms changed, added, or removed during author review, it is recommended to schedule the language scribing for after all other editorial considerations have been handled, before proceeding to production stages.
- Print production stage. It is not recommended to do extensive language scribing during typesetting and page proof stages. However, as content is added and alterations are applied, language scribing should be included and maintained.
- Ebook production stage. For ebooks being produced from typeset files, language scribing can be included as part of the ScML preparation steps. Language scribing should be handled within the ScML file before processing it to ePub format.
Language Scribing in Word
Use the SAI or SAI Lite to add and apply language styles in Word.
- Use the Add Language Style tool to create the necessary language styles for the project in Word.
- Per the scribing procedure for all projects, load ScML styles into the document.
- Use the drop-down menu in Load ScML Styles to select .
- Select the base style to use. Select from the drop-down menu or click the button to use the style of selected text.
- Enter the language code. Common languages can be selected from the drop-down menu.
- The Resulting style field will display the name of the style being added.
- Click to apply the style to selected text or to add the style without applying it.
- At the designated time (during the initial scribing or while copyediting), review all italic text for phrases that need to have language metadata applied. Apply the language styles created.
- Apply additional language styles as needed (e.g., apply lang@lang=es to Spanish text using the default paragraph font, or gt@lang=es to Spanish-language glossary terms).
Note: Scribe recommends using lang-i instead of i for any italic bibliography (rf) text that needs language scribing. The bibliography tools will convert any i@lang=[Language Code] bibliography text to lang-i@lang=[Language Code].
Language Scribing in sam, ScML, or InDesign
If the live file is in a production stage, the language styles can be added in a .sam file or in InDesign. Once the language styles have been identified, add them to the point document with the appropriate formatting.
- In sam/ScML:
<lang-i lang="es">
- In InDesign:
lang-i-language-es
The searches presented here can be used as a starting point for finding languages in books based on the character or paragraph styles applied to them as well as word and letter patterns that are unique to a particular language.
Review the results before changing or replacing any styles.
Character Styles
Review italic terms, various “-i” styles, and various lang terms in a new Sublime file.
Find:
<i>[^<]+</i>|<[^>]+-b?i>[^<]+</[^>]+-b?i>|<lang[^>]*>[^<]+</lang[^>]*>|<[^>]*lang[^>]*>[^<]+</[^>]*>
Copy into a new file, permute unique lines, remove English text, and delete proper names. Add lang attributes as needed to the original file.
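The copy-and-deduplicate step can be sketched in Python as an alternative to doing it by hand in Sublime. This is a minimal sketch, not an official Scribe tool: the function name unique_matches, the sample text, and the style name bib-i are hypothetical. The regular expression is the Character Styles search given above, unchanged.

```python
import re

# The Character Styles search from this section, split across two lines.
CHAR_STYLE = re.compile(
    r"<i>[^<]+</i>|<[^>]+-b?i>[^<]+</[^>]+-b?i>"
    r"|<lang[^>]*>[^<]+</lang[^>]*>|<[^>]*lang[^>]*>[^<]+</[^>]*>"
)


def unique_matches(text):
    """Return each distinct match once, in order of first appearance."""
    return list(dict.fromkeys(CHAR_STYLE.findall(text)))


# Hypothetical fragment; "bib-i" stands in for any "-i" style.
sample = (
    "<p>A <i>mesa</i> and a <i>mesa</i>, then <bib-i>El País</bib-i> "
    'and <lang-i lang="fr">merci</lang-i>.</p>'
)

for hit in unique_matches(sample):
    print(hit)
```

The resulting list still needs human review to remove English text and proper names before any lang attributes are added.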
Paragraph Styles
Review block quote (bq) and senseline (sl) paragraphs as common places to identify if there are full paragraphs in another language. Check other paragraph styles as needed.
Find:
<[^>"]*(bq|sl)[^>]*>[^\n]+
Copy into a new file, turn off word wrap, and skim for non-English text.
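For files that are easier to process with a script, the same paragraph search can be run in Python. The sketch below assumes, as the search itself does, that each paragraph sits on a single line; the sample paragraphs are invented for illustration.

```python
import re

# The Paragraph Styles search from this section, unchanged.
PARAGRAPH = re.compile(r'<[^>"]*(bq|sl)[^>]*>[^\n]+')


def candidate_paragraphs(text):
    """Return every block quote or senseline paragraph as one string."""
    # finditer is used because the pattern has a group; group(0) is the full match.
    return [m.group(0) for m in PARAGRAPH.finditer(text)]


# Hypothetical fragment with one paragraph per line.
sample = (
    "<p>Plain English paragraph.</p>\n"
    "<bq>En un lugar de la Mancha.</bq>\n"
    "<sl>Dans le vieux parc solitaire</sl>\n"
)

for line in candidate_paragraphs(sample):
    print(line)
```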
Book-Specific Styles
Certain books such as Bibles or language/grammar books may have additional paragraph or character styles that are being used to identify languages. Review additional content for languages based on the type of publication.
Text in Quotation Marks
Publications with a significant amount of dialogue or other content in quotation marks may make reviewing all quotes unfeasible. If it is determined that it will be beneficial, use this search to find all instances of text in quotation marks and review the results.
Find:
(“[^\n”]*”)
Language Names
Use the following searches to identify languages that may be referenced specifically as well as certain text patterns associated with particular languages.
Search:
\b(Aari|Abanyom|Abaza|Abkhaz|Abkhazian|Abujmaria|Acehnese|Adele|Adyghe|Afar|Afrikaans|Afro-Seminole Creole|Aimaq|Barbari|Aini|Ainu|Akan|Akawaio|Aklanon|Albanian|Aleut|Algonquin|Alsatian|Altay|Alutor|Amharic|Anda|Amdang|Ancient Meitei|Angika|Anyin|Ao|A-Pucikwar|Arabic|Aragonese|Aramaic|'Are'are|Argobba|Aromanian|Macedo-Romanian|Armenian|Arvanitic|Ashkun|Asi|Assamese|Assyrian Neo-Aramaic|Asturian|Ateso|Teso|A'Tong|'Auhelawa|Auslan|Austro-Bavarian|Avar|Avestan|Awadhi|Aymara|Azerbaijani|Badaga|Badeshi|Bahnar|Balinese|Balochi|Balti|Bambara|Bamanankan|Banjar|Banyumasan|Bartangi|Basaa|Bashkardi|Bashkir|Basque|Batak Karo|Batak Toba|Batak Simalungun|Bats|Beja|Belarusian|Belhare|Berta|Bemba|Bengali|Bezhta|Betawi|Bete|Bhili|Bhojpuri|Bijil Neo-Aramaic|Bikol|Bikya|Furu|Bissa|Blackfoot|Boholano|Bohtan Neo-Aramaic|Bonan|Paoan|Bororo|Bodo|Bosnian|Brahui|Breton|Bua|Buginese|Bukusu|Bulgarian|Bunjevac|Burmese|Burushaski|Buryat|Caddo|Cahuilla|Caluyanon|Caluyanun|Cantonese|Catalan|Cayuga|Cebuano|Chabacano|Chavacano|Chaga|Kichagga|Chakma|Chamorro|Chaouia|Tachawit|Chechen|Chenchu|Chenoua|Cherokee|Cheyenne|Chhattisgarhi|Chickasaw|Chintang|Chhintang|Chilcotin|Chinese|Chiricahua|Mescalero-Chiricahua Apache|Chichewa|Nyanja|Chipewyan|Chittagonian|Choctaw|Chorasmian|Khwarezmian|Chukchi|Chukot|Chulym|Church Slavonic|Chuukese|Trukese|Chuvash|Cocoma|Cocama|Cocopa|Coeur d’Alene|Comanche|Comorian|Cornish|Corsican|Cree|Crimean Tatar|Crimean Turkish|Croatian|Csángó|Cuneiform|Cuyonon|Czech|Dagbani|Dahlik|Dalecarlian|Dameli|Danish|Dargin|Dakota|Dari|Dari-Persian|Daur|Dagur|Dena'ina|Tanaina|Dhatki|Dhivehi|Maldivian|Dida|Dioula|Jula|Dogri|Dogrib|Tli Cho|Dolgan|Domaaki|Dumaki|Dongxiang|Santa|Duala|Dungan|Dutch|Dzhidi|Judeo-Persian|Dzongkha|Eastern Yugur|Edo|Efik|Esan|Egyptian Arabic|Egyptian Hieroglyphs|Ekoti|Enets|Yenisey 
Samoyed|English|Erzya|Esperanto|Estonian|Evenk|Evenki|Ewe|Extremaduran|Faroese|Fang|Fijian|Filipino|Finnish|Flemish|Fon|Franco-Provençal|Arpitan|French|Frisian|Friulian|Fula|Fulfulde|Fulani|Fur|Ga|Gadaba|Gagauz|Galician|Gallo|Gan|Ganda|Gangte|Garhwali|Gayo|Gen|Gẽ|Mina|Georgian|German|Gikuyu|Kikuyu|Gilbertese|Kiribati|Gileki|Goaria|Gondi|Gorani|Gurani|Gowro|Gawar-Bati|Gowari|Narsati|Greek|Guaraní|Guinea-Bissau Creole|Gujarati|Gula Iro|Kulaal|Gullah|Sea Island Creole English|Gusii|Gwichʼin|Hadza|Haida|Haitian Creole|Hakka|Hän|Harari|Harauti|Harsusi|Haryanavi|Harzani|Hausa|Havasupai|Upland Yuman|Hawaiian|Hazaragi|Hebrew|Herero|Hértevin|Hiligaynon|Hindi|Hinukh|Hiri Motu|Hixkaryana|Hmong|Ho|Hobyót|Hopi|Hulaulá|Hungarian|Hunsrik|Hutterite German|Ibibio|Iban|Ibanag|Icelandic|Ido|Ifè|Igbo|Biafra|Ikalanga|Kalanga|Ili Turki|Ilokano|Ilocano|Inari Sami|Indonesian|Ingrian|Izhorian|Ingush|Interlingua|Inuktitut|Inupiaq|Inuvialuktun|Iraqw|Irish|Irish Gaelic|Irish|Irula|Isan|Northeastern Thai|Ishkashimi|Ishkashmi|Istro-Romanian|Italian|Itelmen|Kamchadal|Jacaltec|Jakalteko|Jalaa|Jamaican Patois|Japanese|Jaqaru|Jarai|Javanese|Jen|Jewish Babylonian Aramaic|Jibbali|Shehri|Jicarilla Apache|Juang|Jurchen|Kabardian|Kabyle|Kachin|Jingpo|Kalaallisut|Greenlandic|Kalami|Gawri|Dirwali|Kalasha|Kalmyk|Oirat|Kalto|Nahali|Kamtapuri|Rangpuri|Rajbongshi|Kankanai|Kankanaey|Kannada|Kaonde|Chikaonde|Kapampangan|Karachay-Balkar|Karagas|Karaim|Karakalpak|Karelian|Karenni|Kashmiri|Kashubian|Kazakh|Kerek|Ket|Khakas|Khalaj|Kham|Sheshi|Khandeshi|Khanty|Ostyak|Khasi|Khitan|Khmer|Khmu|Khowar|Kildin Sami|Kimatuumbi|Kinaray-a|Hiraya|Kinyarwanda|Kirombo|Kirundi|Kivunjo|Klallam|Clallam|Klingon|Kodava Takk|Kodagu|Coorgi|Kohistani|Khili|Kolami|Komi|Komi-Zyrian|Konkani|Kongo|Kikongo|Koraga|Korandje|Korean|Korku|Korowai|Korwa|Koryak|Kosraean|Kota|Koyra Chiini|Western Songhay|Koy Sanjaq Surat|Koya|Krymchak|Judeo-Crimean 
Tatar|Krio|Kujarge|Kui|Kumauni|Kumyk|Kumzari|ǃKung|Kurdish|Kurukh|Kurux|Kusunda|Kutenai|Kootenay|Ktunaxa|Kwanyama|Ovambo|Kxoe|Kyrgyz|Kirghiz|Láadan|Laal|Ladakhi|Ladin|Ladino|Judeo-Spanish|Laki|Lakota|Lakhota|Teton|Lambadi|Lamani|Banjari|Lao|Laotian|Larestani|Latin|Latvian|Laverent|Laz|Lazuri|Leonese|Lepcha|Lemerig|Lezgi|Agul|Ligbi|Ligby|Ligurian|Limbu|Limburgish|Lingala|Lipan Apache|Lisan al-Dawat|Lishana Deni|Lishanid Noshan|Lishana Didan|Lithuanian|Livonian|Liv|Lombard|Lotha|Low German|Low Saxon|Lower Sorbian|Lozi|Silozi|Ludic|Ludian|Lunda|Chilunda|Luo|Luri|Lushootseed|Lusoga|Soga|Luvale|Luwati|Luxembourgish|Lycian|Lydian|Macedonian|Magadhi|Maguindanao|Maithili|Makasar|Makhuwa|Makua|Makhuwa-Meetto|Malagasy|Malay|Malayalam|Maltese|Malto|Sauria Paharia|Malvi|Malavi|Ujjaini|Mam|Manchu|Mandaic|Mandarin|Mandinka|Mansi|Vogul|Manx|Manyika|Maori|Mapudungun|Mapuche|Maranao|Marathi|Mari|Cheremis|Marquesan|Marshallese|Ebon|Masaba|Masbatenyo|Minasbate|Maya|Mazandarani|Tabari|Meänkieli|Tornedalen Finnish|Megleno-Romanian|Megrelian|Mingrelian|Mehri|Mahri|Meitei|Manipuri|Meithei|Menominee|Mentawai|Meroitic|Mescalero Apache|Meru|Kimeru|Michif|Mikasuki|Miccosukee|Mi'kmaq|Micmac|Minangkabau|Mirandese|Mobilian Jargon|Moghol|Mohawk|Moksha|Molengue|Mon|Mongolian|Mono|Mono|Mono|Montagnais|Montenegrin|Motu|Muher|Mundari|Munji|Muria|Nafaanra|Nagarchal|Nahuatl|Nama|Nanai|Nauruan|Navajo|Navaho|Ndau|Southeast Shona|Ndebele|Ndonga|Neapolitan|Negidal|Nepal Bhasa|Newari|Nepali|Nihali|Nahali|Nganasan|Tavgi|Ngumba|Nheengatu|Geral|Modern Tupí|Nias|Niellim|Nigerian Pidgin|Nisenan|Niuean|Niue|Nivkh|Gilyak|Nogai|Norfuk|Norfolk|Pitcairn-Norfolk|Norman-French|Northern Sami|Northern Sotho|Sepedi|Northern Yukaghir|Norwegian|Bokmål|Nynorsk|Riksmål|Nuer|Nurt|Nuxálk|Bella Coola|Nyabwa|Nyah Kur|Nyangumarta|Nyoro|Nǀu|Occitan|Provençal|Ojibwe|Ojibwa|Chippewa|Okinawan|Olonets Karelian|Liv|Livvi|Omagua|Ongota|Odia|Ormuri|Oroch|Orok|Oromo|Afaan Oromoo|Ossetic|Ossetian|Old East Slavic|Old Russian|Oostfräisk|East 
Frisian Low Saxon|Old Prussian|Oshimbalantu|Odia|Padaung|Páez|Nasa Yuwe|Palauan|Palawa_kani|Pangasinan|Pa'O|Papiamento|Papiamentu|Parachi|Parya|Pashto|Pushto|Pashtu|Pennsylvania Dutch|Pennsylvania German|Persian|farsi|Phalura|Phuthi|Pig Latin|Picard|Pirahã|Plautdietsch|Mennonite Low German|Polish|Portuguese|Pradhan|Pardhan|Puelche|Puma|Punjabi|Panjabi|Pwo Karen|Palestinian Arabic|Pascenda|Pashandah|Phat Thai|Q’eqchi’|Qashqai|Ghashghai|Quechua|Qui|Rajasthani|Ratagnon|Datagnon|Latagnun|Réunion Creole|Bourbonnais|Romagnol|Romanian|Romansh|Rhaeto-Romance|Romany|Romblomanon|Rotokas|Runyankole|Nyankore|Russian|Ruthenian|Rusyn|Carpathian|Sabaean|Sadri|Salar|Samoan|Sandawe|Sango|Sanskrit|Santali|Saramaccan|Sardinian|Sarikoli|Saurashtra|Sourashtra|Savara|Savi|Sawai|Scots|Ulster Scots|Hiberno-Scots|Ullans|Scots Gaelic|Scottish Gaelic|Gaidhlig|Gaelic|Selkup|Ostyak Samoyed|Semnani|Senaya|Serbian|Serbo-Croatian|Sesotho|Seto|Setu|Seychellois Creole|S'gaw Karen|Shimaore|Shina|Shona|Shor|Shoshoni|Shughni|Shumashti|Shuswap|Sicilian|Sidamo|Sika|Silesian|Silt'e|Selti|East Gurage|Sindhi|Sinhalese|Sioux|Sivandi|Skolt Sami|Slavey|Slovak|Slovene|Slovenian|Soddo|Kistane|Somali|Sonjo|Temi|Sonsorolese|Sonsorol|Soqotri|Sora|Sorbian, Lower|Sorbian, Upper|Sourashtra|Southern Sami|South Estonian|Southern Yukaghir|Tundra Yukaghir|Spanish|Sranan Tongo|St'at'imcets|Lillooet|Sucite|Sìcìté Sénoufo|Suba|Sundanese|Supyire|Supyire Senoufo|Surigaonon|Susu|Svan|Swahili|Swati|Swazi|Siswati|Seswati|Swedish|Syriac|Tabasaran|Tabassaran|Tachelhit|Tagalog|Tahitian|Tajik|Takestani|Talysh|Tamil|Tamasheq|Tamazight|Tanacross|Tangut|Tarifit|Rifi|Riff Berber|Tat|Tati|Tatar|Tausug|Tehuelche|Telugu|Tetum|Tepehua|Tepehuán|Thai|Tharu|Tibetan|Tigre|Xasa|Tigrinya|Timbisha|Panamint|Tiv|Tlingit|Tobian|Toda|Tok Pisin|Tokelauan|Tonga|Tongan|Torwali|Turvali|Tregami|Tsat|Tsez|Dido|Tshiluba|Luba-Kasai|Luba-Lulua|Tsonga|Tswana|Setswana|Tu|Monguor|Tuareg|Tamasheq|Tulu|Tumbuka|Tupiniquim|Turkish|Turkmen|Turoyo|Tuvaluan|Tuvan 
Tuvin|Tyvan|Udihe|Ude|Udege|Udmurt|Votyak|Ukrainian|Ukwuani-Aboh-Ndoni|Ulch|Olcha|Unserdeutsch|Rabaul Creole German|Upper Sorbian|Urdu|Uripiv|Urum|Ute|Uyghur|Uigur|Uzbek|Vafsi|Valencian|Vasi-vari|Prasuni|Venda|Tshivenda|Venetian|Veps|Vietnamese|Volapük|Võro|Votic|Votian|Waddar|Waigali|Kalasha-Ala|Waima|Roro|Wakhi|Walloon|Waray-Waray|Binisaya|Washo|Welsh|Western Frisian|Western Neo-Aramaic|Wolaytta|Wolane|Silt'e|Wolof|Wu|Xhosa|Xiang|Xibe|Sibo|Xipaya|Xóõ|Yaaku|Yaeyama|Yaghnobi|Yakut|Yankunytjatjara|Yanomami|Yanyuwa|Yapese|Yaqui|Yauma|Yavapai|Yazdi|Yazgulyam|Yazgulami|Yemenite Hebrew|Yeni|Yevanic|Yi|Yiddish|Yidgha|Yogur|Yoghur|Sarï Uyghur|Yellow Uyghur|Mongolic|Yokutsan|Yonaguni|Yoruba|Yucatec Maya|Yuchi|Yugur|Yughur|Sarïgh Uyghur|Yellow Uyghur|Turkic|Yukaghir|Yupik|Yurats|Yurok|Záparo|Zapotec|Zazaki|Zulu|Zuñi|Zuni|Zway|Zay)\b
Spanish
Find:
(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)
Find:
(“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)
Replace with:
<lang lang="es">\1</lang>
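To preview what the quoted-text find and replace would do before running it on a live file, the same pattern can be applied in Python. This is a sketch under the assumption that quotations use curly quotation marks, as the search requires; the sample sentence is invented.

```python
import re

# The quoted-text Spanish search from this section, split across two lines.
SPANISH_QUOTE = re.compile(
    r"(“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios"
    r"|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)"
)


def tag_spanish_quotes(text):
    """Wrap each quotation containing a Spanish marker in a lang element."""
    return SPANISH_QUOTE.sub(r'<lang lang="es">\1</lang>', text)


sample = "She asked, “¿Dónde está la biblioteca?” and he said, “Over there.”"
print(tag_spanish_quotes(sample))
# She asked, <lang lang="es">“¿Dónde está la biblioteca?”</lang> and he said, “Over there.”
```

The markers are heuristics: an English quotation that mentions a name containing "del" would also be wrapped, which is why the results need review before any replacement is kept.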
French
Find:
\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b
German
Find:
\b(?:für|und)\b
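One way to decide which of these language searches is worth running in full is to count marker hits per language first. The Python sketch below does that for the French and German searches above; the dictionary name MARKERS, the function name marker_hits, and the sample text are assumptions for illustration.

```python
import re

# The French and German searches from this section, keyed by language code.
MARKERS = {
    "fr": re.compile(r"\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b"),
    "de": re.compile(r"\b(?:für|und)\b"),
}


def marker_hits(text):
    """Count the marker words found for each language."""
    return {code: len(pattern.findall(text)) for code, pattern in MARKERS.items()}


# Hypothetical passage invented for this example.
sample = "“Je pense que c’est pour vous,” she said. The sign read Kaffee und Kuchen."
print(marker_hits(sample))  # {'fr': 3, 'de': 1}
```

A high count only suggests that a language is present; each hit still has to be reviewed in context.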
Language Codes
Some of the most common language codes are listed here as well as on the BCP-47 Wikipedia page.
The subtag lookup tool can be used to find thousands of additional language codes.
| Language | Code |
|---|---|
| Afrikaans | af |
| Albanian | sq |
| Amharic | am |
| Arabic | ar |
| Armenian | hy |
| Assamese | as |
| Azerbaijani | az |
| Bashkir | ba |
| Basque | eu |
| Belarusian | be |
| Bengali | bn |
| Bosnian | bs |
| Breton | br |
| Bulgarian | bg |
| Burmese | my |
| Catalan | ca |
| Central Kurdish | ckb |
| Chinese | zh |
| Corsican | co |
| Croatian | hr |
| Czech | cs |
| Danish | da |
| Dari | prs |
| Divehi | dv |
| Dutch | nl |
| English | en |
| Estonian | et |
| Faroese | fo |
| Filipino | fil |
| Finnish | fi |
| French | fr |
| Frisian | fy |
| Galician | gl |
| Georgian | ka |
| German | de |
| Gilbertese | gil |
| Greek | el |
| Greenlandic | kl |
| Gujarati | gu |
| Hausa | ha |
| Hebrew | he |
| Hindi | hi |
| Hungarian | hu |
| Icelandic | is |
| Igbo | ig |
| Indonesian | id |
| Inuktitut | iu |
| Irish | ga |
| Italian | it |
| Japanese | ja |
| Kʼicheʼ | quc |
| Kannada | kn |
| Kazakh | kk |
| Khmer | km |
| Kinyarwanda | rw |
| Kiswahili | sw |
| Konkani | kok |
| Korean | ko |
| Kurdish | ku |
| Kyrgyz | ky |
| Lao | lo |
| Latvian | lv |
| Lithuanian | lt |
| Lower Sorbian | dsb |
| Luxembourgish | lb |
| Macedonian | mk |
| Malay | ms |
| Malayalam | ml |
| Maltese | mt |
| Maori | mi |
| Mapudungun | arn |
| Marathi | mr |
| Mohawk | moh |
| Mongolian | mn |
| Moroccan Arabic | ary |
| Nepali | ne |
| Norwegian (Bokmål) | nb |
| Norwegian (Nynorsk) | nn |
| Norwegian | no |
| Occitan | oc |
| Odia | or |
| Papiamento | pap |
| Pashto | ps |
| Persian | fa |
| Polish | pl |
| Portuguese | pt |
| Punjabi | pa |
| Quechua | qu |
| Romanian | ro |
| Romansh | rm |
| Russian | ru |
| Sami (Inari) | smn |
| Sami (Lule) | smj |
| Sami (Northern) | se |
| Sami (Skolt) | sms |
| Sami (Southern) | sma |
| Sanskrit | sa |
| Scottish Gaelic | gd |
| Serbian | sr |
| Sesotho | st |
| Sinhala | si |
| Slovak | sk |
| Slovenian | sl |
| Spanish | es |
| Swedish | sv |
| Swiss German | gsw |
| Syriac | syc |
| Tagalog | tl |
| Tajik | tg |
| Tamazight | tzm |
| Tamil | ta |
| Tatar | tt |
| Telugu | te |
| Thai | th |
| Tibetan | bo |
| Tswana | tn |
| Turkish | tr |
| Turkmen | tk |
| Ukrainian | uk |
| Upper Sorbian | hsb |
| Urdu | ur |
| Uyghur | ug |
| Uzbek | uz |
| Vietnamese | vi |
| Welsh | cy |
| Wolof | wo |
| Xhosa | xh |
| Yakut | sah |
| Yi | ii |
| Yoruba | yo |
| Zulu | zu |
Note: Bold text indicates commonly used languages.
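A table like this one can double as a simple check on the codes used in a file. The following Python sketch holds a handful of rows copied from the table and reports any code that is not among them; the names LANGUAGE_CODES and unknown_codes are hypothetical, and a real project would extend the dictionary with every code it expects.

```python
# A few rows copied from the table above; extend as the project requires.
LANGUAGE_CODES = {
    "Chinese": "zh",
    "French": "fr",
    "German": "de",
    "Greek": "el",
    "Hebrew": "he",
    "Spanish": "es",
}


def unknown_codes(codes):
    """Return the codes that are not in LANGUAGE_CODES, sorted and deduplicated."""
    known = set(LANGUAGE_CODES.values())
    return sorted(set(codes) - known)


# "sp" and "gr" are plausible slips for Spanish (es) and Greek (el).
print(unknown_codes(["es", "sp", "fr", "gr"]))  # ['gr', 'sp']
```

A code reported here is not necessarily wrong; it may simply be a valid code that has not yet been added to the dictionary.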
Language Scribing QC Checklist
Quality control for language scribing is a collaborative confirmation of decisions that may not have a definitively right or wrong answer. While the formatting of the language tags must follow particular patterns in environments like Word, InDesign, sam, or ScML, the choices of which terms should be tagged may vary from person to person.
QC should assess the content for the following:
- Terms that are not tagged that should be
- Terms that are tagged that should not be
- Terms that have the wrong language applied to them
Use a sam or ScML file to search for terms to review.
Tagged Paragraph Styles
Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.
Find:
^ *<[^>]*lang[^\n]+
Tagged Character Styles
Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.
Find:
<[^>]*lang[^>]*>[^<]*<[^>]*>
Untagged Terms
In a copy of the file, remove the content that has language tags applied to it.
Find:
^ *<[^>]*lang[^\n]+
Replace with: (nothing; leave the replacement empty)
Find:
<[^>]*lang[^>]*>[^<]*<[^>]*>
Replace with: (nothing; leave the replacement empty)
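The two removals can also be scripted so that the copy is produced in one pass. The Python sketch below applies the same two searches in the same order; the function name strip_tagged and the sample text are assumptions, and the script should be run only on a copy of the file.

```python
import re

# The two searches above. MULTILINE lets ^ match at the start of every line.
TAGGED_PARAGRAPH = re.compile(r"^ *<[^>]*lang[^\n]+", re.MULTILINE)
TAGGED_CHARACTERS = re.compile(r"<[^>]*lang[^>]*>[^<]*<[^>]*>")


def strip_tagged(text):
    """Delete language-tagged paragraphs first, then tagged character spans."""
    text = TAGGED_PARAGRAPH.sub("", text)
    return TAGGED_CHARACTERS.sub("", text)


# Hypothetical fragment: one tagged paragraph, one tagged term, one untagged term.
sample = (
    '<bq lang="fr">Il pleure dans mon coeur.</bq>\n'
    '<p>She whispered <lang-i lang="es">gracias</lang-i> and then adiós.</p>\n'
)
print(strip_tagged(sample))
```

In this sample the untagged word "adiós" survives the removal, which is exactly the kind of term the follow-up review is meant to catch.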
With this content deleted, repeat some of the searches and techniques used to find foreign languages. (Use spellcheck selectively to help skim through results.)
Review the following:
- Text in italics
- Text in quotation marks
- Text marked by spellcheck
- Terms containing the special characters listed in the Digital Hub stats
- Words in the body of the book that may indicate certain languages