Language Scribing

Language Scribing Overview

Level AA of the Web Content Accessibility Guidelines (WCAG) requires that ebooks mark where the language shifts within a book (Success Criterion 3.1.2, Language of Parts). This markup helps screen readers and other assistive technology read the content without jarring, incorrect pronunciation. “Proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text” are all exempt from this requirement.

Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) can frequently be identified using the Scribe Language Styles setting in the Digital Hub. Languages that use the Latin alphabet often need to be identified through manual actions and editorial judgment.

The Well-Formed Document Workflow (WFDW) includes methods to mark these language shifts at different stages of a project. Ideally, this action takes place when preparing a manuscript in Word. In a .docx file, languages can be marked by manually creating a new Word paragraph or character style with a name that combines an ScML style and an established language code.

  • Pattern in Word: [ScML Style]@lang=[Language Code]
  • Example Style Name: lang-i@lang=es

Note: While most language tagging will occur at the character style level, the scribing can be applied at the paragraph level when entire paragraphs use a different language.

These styles can be created and applied to scribed manuscripts by using the SAI’s Add Language Style tool. This may be done during the Word Scribing procedure or at a designated time during or following the copyedit, before a publication moves on to the production stages. When creating new or revised ebooks from files that have already been produced, language styles may be added to the ScML file.

Language codes generally consist of two or three letters, determined by the BCP-47 standard. See Language Codes for a list of many common languages and how to find a corresponding code. If a language has no corresponding code, Scribe recommends applying lang or lang-i to this content with no additional code. Made-up languages, as may be found in works of science fiction or fantasy, should not be scribed.

The metadata (language codes) and language styles can be added in a Word document, a sam file, an ScML file, or an InDesign document. At whatever stage it is added, this metadata will travel through the Well-Formed Document Workflow.

This example shows how the metadata for Spanish-language italic text could be identified in Word and carried through to sam, ScML, and InDesign. In each environment, the formatting of the style name is slightly different.

  • In Word: lang-i@lang=es
  • In sam/ScML: <lang-i lang="es">
  • In InDesign: lang-i-language-es
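
The three naming patterns above can be sketched in Python. This is only an illustration and is not part of the SAI or the Digital Hub: the function names are hypothetical, and the InDesign pattern is generalized from the single lang-i-language-es example, which is an assumption.

```python
import re

# Pattern taken from the Word convention: [ScML Style]@lang=[Language Code]
WORD_STYLE = re.compile(r"^(?P<style>[A-Za-z0-9-]+)@lang=(?P<code>[a-z]{2,3})$")


def parse_word_style(name: str) -> tuple[str, str]:
    """Split a Word style name such as 'lang-i@lang=es' into (style, code)."""
    match = WORD_STYLE.match(name)
    if match is None:
        raise ValueError(f"not a language style name: {name!r}")
    return match.group("style"), match.group("code")


def to_scml_tag(name: str) -> str:
    """Build the opening sam/ScML tag for a Word language style name."""
    style, code = parse_word_style(name)
    return f'<{style} lang="{code}">'


def to_indesign_style(name: str) -> str:
    """Build the InDesign style name (assumed pattern: style-language-code)."""
    style, code = parse_word_style(name)
    return f"{style}-language-{code}"
```

For example, to_scml_tag("lang-i@lang=es") gives <lang-i lang="es"> and to_indesign_style("lang-i@lang=es") gives lang-i-language-es, matching the list above.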

Note: Hyphenated language codes, including region subtags (en-US, en-GB), are not completely supported throughout the WFDW. The language codes must be entirely lowercase. If this level of specificity is required, region subtags can be added at the ScML stage before converting to ebook.
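
As a rough illustration of this note, a short Python check can flag codes that may not travel cleanly through the workflow. The function name and the returned messages are hypothetical, and the two-or-three-letter rule is a simplification of BCP 47 based on the description above.

```python
import re

# Simplified rule from the text: two or three lowercase letters, no subtags.
SIMPLE_CODE = re.compile(r"[a-z]{2,3}")


def check_language_code(code: str) -> str:
    """Classify a language code against the constraints described in the note."""
    if SIMPLE_CODE.fullmatch(code):
        return "ok"
    if "-" in code:
        # Region subtags such as en-US are deferred to the ScML stage.
        return "hyphenated: add at the ScML stage"
    if code != code.lower():
        return "not lowercase"
    return "unexpected format"
```

A check like this only tests the shape of a code. It does not confirm that the code is registered; use the subtag lookup tool for that.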

Note: Even if language styles are applied at the manuscript stage, later changes to content (adding indexes and praise pages, applying alterations) mean that language scribing needs attention throughout the workflow. For example, the scribing of foreign-language terms in indexes should match how they are scribed in the body text.

Methods of Finding Languages to be Scribed

Whether starting with a scribed Word file, a sam file, or an ScML file, the following methods can be used to determine what content will require language styles to be applied. If starting with a .docx file, process the file to .sam in order to run the listed regular expressions.

  • Use the Book Topic and TOC as a Guide: Take a broad view of what to expect based on the subject matter of the publication.
  • Review the special characters list in the Digital Hub for languages that fall outside the Latin alphabet. Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) should already be well served by the Scribe Language Styles setting in the Digital Hub. When using this setting, the results should be reviewed to confirm all language blocks have been identified correctly by this automated process.
  • Spelling and Grammar: Use the spelling and grammar features in programs like Microsoft Word.
  • Use Sublime Check 2 and skim the list of titles to see if there is any widespread use of a language.
  • AI Tools: As of 2026, Scribe does not endorse the use of artificial intelligence to perform any actions within the WFDW, and the official procedure presented here does not provide any recommendations for prompts or methods to interact with AI services. However, users may choose for themselves to prompt AI to flag terms for a human to review.
  • Review Character Styles, Paragraph Styles, Book-Specific Styles, and Text in Quotation Marks: Use the searches listed in the Language Scribing in sam, ScML, or InDesign section.

Determining Languages

Many terms and phrases cannot be identified programmatically, particularly due to the use of a common alphabet. Therefore, a key aspect of language scribing is the need for a human to review terms and decide what action should be taken.

When determining what language a term or phrase is, exceptions abound, and borrowed terms or loanwords may fall into a gray area in which people may come to different conclusions about how to approach them. Terms like “déjà vu” or “rendezvous” are commonly accepted as English when used in an English context; if surrounded by French text, however, they would reasonably be included in the French language tagging.

Consider the following when encountering terms and phrases:

  • When using the Scribe Language Styles setting in the Digital Hub, some Asian scripts may get identified with different subtags by the Digital Hub, even when it is clear from the context that the same language is being used for the entire block of text. For example, a single line of text could have both the general Chinese language code (“zh”) and the code with the Taiwan region subtag (“zh-TW”) applied to different words. In these cases, the correct tag needs to be identified and the language markup needs to be cleaned up.
  • Some Unicode blocks are used by more than one language. Examples include the CJK Unified Ideographs (Chinese, Japanese, and Korean), the Arabic script (Arabic, Persian, and others), the Hebrew script (Hebrew and Yiddish), and the Cyrillic script (shared by several languages). If using the Scribe Language Styles setting, use the following search to copy the language attributes into a new file, and review the list for anything that does not match what is known about the book:

Find: lang="[^"]+"
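
The same search can be run outside the editor. The following Python sketch tallies each language code found in a sam or ScML file so that unexpected codes (for example, a stray zh-TW among zh) stand out. The function name is hypothetical, and a capture group has been added to the search above so that only the code is collected.

```python
import re
from collections import Counter

# The search from the text, with a capture group around the code itself.
LANG_ATTR = re.compile(r'lang="([^"]+)"')


def count_language_codes(scml_text: str) -> Counter:
    """Count how often each lang attribute value appears in the text."""
    return Counter(LANG_ATTR.findall(scml_text))
```

Calling count_language_codes(text).most_common() lists the codes from most to least frequent. Codes that appear only once or twice are often the ones worth checking first.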

Scribe Recommendations

  • If there is any doubt about whether a language tag should be applied, default to applying the style. Even if the term does not require tagging, including the tag should not harm the file, and scribing the content can take less time than researching or debating the issue.
  • Scribe book, movie, and publication titles when they are in other languages.

Loanwords

In many cases, terms with foreign origins have been adopted into the English language. When making a decision about whether to apply the language scribing, the following factors may be considered.

  • If the term is well known from English usage, it can be considered English. Foods like “nacho” or “taco” are examples of Spanish words that are commonly used and accepted as English when used in an English-language context.
  • If a term appears in the Merriam-Webster Collegiate Dictionary without being marked specifically as a foreign term, it can be considered English.
  • If a term has an English-language Wikipedia page, this may be indicative of it being a technical term that is understood across languages.

If the determination is unclear, default to applying a language style.

When to Scribe Language Styles

Language styles can be applied at any time within the WFDW. Consider the following to determine a work plan that will be most efficient, with language choices made by the appropriate person.

  • Scribing stage. If there are relatively few instances, it is recommended to apply the language styles during the scribing stage.
  • Copyediting stage. If the language scribing will be more involved and has the potential to have terms changed, added, or removed during author review, it is recommended to schedule the language scribing for after all other editorial considerations have been handled, before proceeding to production stages.
  • Print production stage. It is not recommended to do extensive language scribing during typesetting and page proof stages. However, as content is added and alterations are applied, language scribing should be included and maintained.
  • Ebook production stage. For ebooks being produced from typeset files, language scribing can be included as part of the ScML preparation steps. Language scribing should be handled within the ScML file before processing it to ePub format.

Language Scribing in Word

Use the SAI or SAI Lite to add and apply language styles in Word.

  1. Use the Add Language Style tool to create the necessary language styles for the project in Word.
    • Per the scribing procedure for all projects, load ScML styles into the document.
    • Use the drop-down menu in Load ScML Styles to select Add Language Style.
    • Select the base style to use. Select from the drop-down menu or click the Get base style from selection button to use the style of selected text.
    • Enter the language code. Common languages can be selected from the drop-down menu.
    • The Resulting style field will display the name of the style being added.
    • Click Apply style to selection to apply the style to selected text or Add style to document to add the style without applying it.
  2. At the designated time (during the initial scribing or while copyediting), review all italic text for phrases that need to have language metadata applied. Apply the language styles created.
  3. Apply additional language styles as needed (e.g., apply lang@lang=es to Spanish text using the default paragraph font, or gt@lang=es to Spanish-language glossary terms).

Note: Scribe recommends using lang-i instead of i for any italic bibliography (rf) text that needs language scribing. The bibliography tools will convert any i@lang=[Language Code] bibliography text to lang-i@lang=[Language Code].

Language Scribing in sam, ScML, or InDesign

If the live file is in a production stage, the language styles can be added in a .sam file or in InDesign. Once each language style has been identified, add it to the document with the appropriate formatting.

  • In sam/ScML: <lang-i lang="es">
  • In InDesign: lang-i-language-es

The searches presented here can be used as a starting point for finding languages in books based on the character or paragraph styles applied to them as well as word and letter patterns that are unique to a particular language.

Review the results before changing or replacing any styles.

Character Styles

Review italic terms, various “-i” styles, and various lang terms in a new Sublime file.

Find: <i>[^<]+</i>|<[^>]+-b?i>[^<]+</[^>]+-b?i>|<lang[^>]*>[^<]+</lang[^>]*>|<[^>]*lang[^>]*>[^<]+</[^>]*>

Copy into a new file, permute unique lines, remove English text, and delete proper names. Add lang attributes as needed to the original file.
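
The find, copy, and permute-unique steps can also be approximated in a script. This Python sketch uses the search above unchanged and returns each distinct match once, in the order first seen. The function name is hypothetical, and removing English text and proper names remains a manual review step.

```python
import re

# The character-style search from the text, split across lines for readability.
CHAR_STYLE = re.compile(
    r"<i>[^<]+</i>"
    r"|<[^>]+-b?i>[^<]+</[^>]+-b?i>"
    r"|<lang[^>]*>[^<]+</lang[^>]*>"
    r"|<[^>]*lang[^>]*>[^<]+</[^>]*>"
)


def unique_candidates(sam_text: str) -> list[str]:
    """Return each distinct styled span once, preserving first-seen order."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(CHAR_STYLE.findall(sam_text)))
```

Printing the result one item per line gives a list comparable to the new Sublime file described above.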

Paragraph Styles

Review block quote (bq) and senseline (sl) paragraphs as common places to identify if there are full paragraphs in another language. Check other paragraph styles as needed.

Find: <[^>"]*(bq|sl)[^>]*>[^\n]+

Copy into a new file, turn off word wrap, and skim for non-English text.

Book-Specific Styles

Certain books such as Bibles or language/grammar books may have additional paragraph or character styles that are being used to identify languages. Review additional content for languages based on the type of publication.

Text in Quotation Marks

In publications with a significant amount of dialogue or other content in quotation marks, reviewing every quote may be unfeasible. When reviewing quoted text is worthwhile, use this search to find all instances of text in quotation marks and review the results.

Find: (“[^\n”]*”)

Language Names

Use the following searches to identify languages that may be referenced specifically as well as certain text patterns associated with particular languages.

Spanish

Find: (?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)

Find: (“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)
Replace with: <lang lang="es">\1</lang>
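
For reference, the find-and-replace pair above can be expressed with Python's re module. The function name is hypothetical, and the pattern is copied from the search above without changes. Because the Spanish cues are short (for example, " de " or " y "), false positives are likely, so the output should be reviewed rather than accepted blindly.

```python
import re

# The quoted-Spanish search from the text, split across lines for readability.
SPANISH_QUOTE = re.compile(
    r"(“[^\n”]*"
    r"(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)"
    r"[^\n”]*”)"
)


def tag_spanish_quotes(text: str) -> str:
    """Wrap quoted passages that contain a Spanish cue in a lang element."""
    return SPANISH_QUOTE.sub(r'<lang lang="es">\1</lang>', text)
```

Running this on a copy of the file and comparing it with the original is one way to review every proposed tag before changing the live file.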

French

Find: \b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b

German

Find: \b(?:für|und)\b
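
The French and German searches can be combined into one pass that reports which lines contain a cue word. This Python sketch is an assumption about how such a review list might be built. The function name and the HINTS table are hypothetical, and the patterns are copied from the searches above.

```python
import re

# Cue-word searches from the text, keyed by the language code they suggest.
HINTS = {
    "fr": re.compile(r"\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b"),
    "de": re.compile(r"\b(?:für|und)\b"),
}


def lines_with_hints(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, language code, line) for every line with a cue."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for code, pattern in HINTS.items():
            if pattern.search(line):
                hits.append((number, code, line))
    return hits
```

The result is a list for a human to review. A cue word only suggests a language and does not confirm it.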

Language Codes

Some of the most common language codes are listed here as well as on the BCP-47 Wikipedia page.

The subtag lookup tool can be used to find thousands of additional language codes.

Language Code
Afrikaans af
Albanian sq
Amharic am
Arabic ar
Armenian hy
Assamese as
Azerbaijani az
Bashkir ba
Basque eu
Belarusian be
Bengali bn
Bosnian bs
Breton br
Bulgarian bg
Burmese my
Catalan ca
Central Kurdish ckb
Chinese zh
Corsican co
Croatian hr
Czech cs
Danish da
Dari prs
Divehi dv
Dutch nl
English en
Estonian et
Faroese fo
Filipino fil
Finnish fi
French fr
Frisian fy
Galician gl
Georgian ka
German de
Gilbertese gil
Greek el
Greenlandic kl
Gujarati gu
Hausa ha
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Igbo ig
Indonesian id
Inuktitut iu
Irish ga
Italian it
Japanese ja
Kʼicheʼ quc
Kannada kn
Kazakh kk
Khmer km
Kinyarwanda rw
Kiswahili sw
Konkani kok
Korean ko
Kurdish ku
Kyrgyz ky
Lao lo
Latvian lv
Lithuanian lt
Lower Sorbian dsb
Luxembourgish lb
Macedonian mk
Malay ms
Malayalam ml
Maltese mt
Maori mi
Mapudungun arn
Marathi mr
Mohawk moh
Mongolian mn
Moroccan Arabic ary
Nepali ne
Norwegian (Bokmål) nb
Norwegian (Nynorsk) nn
Norwegian no
Occitan oc
Odia or
Papiamento pap
Pashto ps
Persian fa
Polish pl
Portuguese pt
Punjabi pa
Quechua qu
Romanian ro
Romansh rm
Russian ru
Sami (Inari) smn
Sami (Lule) smj
Sami (Northern) se
Sami (Skolt) sms
Sami (Southern) sma
Sanskrit sa
Scottish Gaelic gd
Serbian sr
Sesotho st
Sinhala si
Slovak sk
Slovenian sl
Spanish es
Swedish sv
Swiss German gsw
Syriac syc
Tagalog tl
Tajik tg
Tamazight tzm
Tamil ta
Tatar tt
Telugu te
Thai th
Tibetan bo
Tswana tn
Turkish tr
Turkmen tk
Ukrainian uk
Upper Sorbian hsb
Urdu ur
Uyghur ug
Uzbek uz
Vietnamese vi
Welsh cy
Wolof wo
Xhosa xh
Yakut sah
Yi ii
Yoruba yo
Zulu zu

Note: Bold text indicates commonly used languages.

Language Scribing QC Checklist

Quality control steps for language scribing represent a collaboration and confirmation of decisions that may not have a definitive right or wrong aspect. While the formatting of the language tags must follow particular patterns in environments like Word, InDesign, sam, or ScML, the choices of which terms should be tagged may vary from person to person.

QC should assess the content for the following:

  • Terms that are not tagged that should be
  • Terms that are tagged that should not be
  • Terms that have the wrong language applied to them

Use a sam or ScML file to search for terms to review.

Tagged Paragraph Styles

Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.

Find: ^ *<[^>]*lang[^\n]+

Tagged Character Styles

Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.

Find: <[^>]*lang[^>]*>[^<]*<[^>]*>
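
To check for terms with the wrong language applied, it can help to group the tagged terms by language code. The Python sketch below uses a variation of the search above that captures the lang value and the tagged text. The variation and the function name are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Variation of the search above: capture the lang value and the tagged text.
TAGGED = re.compile(r'<[^>]*lang="([^"]+)"[^>]*>([^<]*)<[^>]*>')


def terms_by_language(scml_text: str) -> dict[str, list[str]]:
    """Group the text of each tagged span under its language code."""
    grouped = defaultdict(list)
    for code, term in TAGGED.findall(scml_text):
        grouped[code].append(term)
    return dict(grouped)
```

Reading the terms for one code at a time makes an out-of-place term (for instance, a French phrase filed under es) easier to spot.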

Untagged Terms

In a copy of the file, remove the content that has language tags applied to it.

Find: ^ *<[^>]*lang[^\n]+
Replace with: NOTHING

Find: <[^>]*lang[^>]*>[^<]*<[^>]*>
Replace with: NOTHING
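
The two replacements above can be applied to a copy of the text in a script. This Python sketch uses the same two searches. The function name is hypothetical, and re.MULTILINE is set so that ^ matches at the start of every line, as it does in the editor search.

```python
import re

# The two searches from the text; each match is replaced with nothing.
TAGGED_PARAGRAPH = re.compile(r"^ *<[^>]*lang[^\n]+", re.MULTILINE)
TAGGED_SPAN = re.compile(r"<[^>]*lang[^>]*>[^<]*<[^>]*>")


def strip_tagged_content(scml_text: str) -> str:
    """Remove tagged paragraphs first, then tagged character-style spans."""
    without_paragraphs = TAGGED_PARAGRAPH.sub("", scml_text)
    return TAGGED_SPAN.sub("", without_paragraphs)
```

The function returns a new string and leaves its input unchanged, so the original file is only affected if the result is written back over it. Save the output under a different name.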

With this content deleted, repeat some of the searches and techniques used to find foreign languages. (Use spellcheck selectively to help skim through results.)

Review the following:

  • Text in italics
  • Text in quotation marks
  • Text marked by spellcheck
  • Terms containing the special characters listed in the Digital Hub stats
  • Words in the body of the book that may indicate certain languages