Language Scribing Overview
Conformance with Web Content Accessibility Guidelines (WCAG) Level AA requires that ebooks mark where the language changes within a book. This markup helps screen readers and other assistive technology read the content without jarring, incorrect pronunciation. “Proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text” are all exempt from this requirement.
Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) can frequently be identified using the Scribe Language Styles setting in the Digital Hub. Languages that use the Latin alphabet often need to be identified through manual actions and editorial judgment.
The Well-Formed Document Workflow (WFDW) includes methods to mark these language shifts at different stages of a project. Ideally, this action takes place when preparing a manuscript in Word. In a .docx file, languages can be marked by manually creating a new Word paragraph or character style with a name that combines an ScML style and an established language code.
- Pattern in Word:
[ScML Style]@lang=[Language Code]
- Example Style Name:
lang-i@lang=es
Note: While most language tagging will occur on the character style level, if entire paragraphs use a different language, the scribing can be applied on the paragraph level.
These styles can be created and applied to scribed manuscripts by using the SAI’s Add Language Style tool. This may be done during the Word Scribing procedure or at a designated time during or following the copyedit, before a publication moves on to the production stages. When creating new or revised ebooks from files that have already been produced, language styles may be added to the ScML file.
Language codes generally consist of two or three letters, determined by the BCP-47 standard. See Language Codes for a list of many common languages and how to find a corresponding code. If a language has no corresponding code, Scribe recommends applying lang or lang-i to this content with no additional code.
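As a rough illustration of the "two or three letters" shape described above, the following Python sketch checks whether a string looks like a simple language code. The function name is hypothetical and not part of any Scribe tool. The check assumes lowercase codes with no region subtag, and it tests the shape only. It does not confirm that the code exists in the BCP-47 registry.

```python
import re

# Assumption: a "simple" code is two or three lowercase ASCII letters with no
# region subtag. This is a shape check only, not a registry lookup.
SIMPLE_LANGUAGE_CODE = re.compile(r"[a-z]{2,3}")


def is_simple_language_code(code: str) -> bool:
    """Return True if code looks like a two- or three-letter lowercase code."""
    return SIMPLE_LANGUAGE_CODE.fullmatch(code) is not None


if __name__ == "__main__":
    for candidate in ["es", "ckb", "en-US", "ES"]:
        print(candidate, is_simple_language_code(candidate))
```

A code that fails this check (for example, one with a region subtag) is not necessarily wrong. It may simply need to be handled at the ScML stage.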
The metadata (language codes) and language styles can be added in a Word document, a sam file, an ScML file, or an InDesign document. At whatever stage it is added, this metadata will travel through the Well-Formed Document Workflow.
This example shows how the metadata for Spanish-language italic text could be identified in Word and carried through to sam, ScML, and InDesign. In each environment, the formatting of the style name is slightly different.
- In Word:
lang-i@lang=es
- In sam/ScML:
<lang-i lang="es">
- In InDesign:
lang-i-language-es
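The three formats above can be sketched as a small Python helper. This is an illustration only: the function name is hypothetical, and the patterns are generalized from the single Spanish italic example, so other base styles may not follow them exactly.

```python
def language_style_names(base_style: str, language_code: str) -> dict:
    """Return the style name for each environment.

    Assumption: every base style follows the same pattern as the
    lang-i / es example in this section.
    """
    return {
        "word": f"{base_style}@lang={language_code}",
        "sam_scml": f'<{base_style} lang="{language_code}">',
        "indesign": f"{base_style}-language-{language_code}",
    }


if __name__ == "__main__":
    for environment, name in language_style_names("lang-i", "es").items():
        print(f"{environment}: {name}")
```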
Note: Hyphenated language codes, including region subtags (en-US, en-GB), are not completely supported throughout the WFDW. The language codes must be entirely lowercase. If this level of specificity is required, region subtags can be added at the ScML stage before converting to ebook.
Note: Even if language styles are applied at the manuscript stage, later changes to content (adding indexes or praise pages, applying alterations) mean that language scribing needs attention throughout the workflow. For example, the scribing of foreign-language terms in indexes should match how they are scribed in the body text.
Vetting Content for Language Considerations
Methods of Finding Languages to be Scribed
Whether starting with a scribed Word file, a sam file, or an ScML file, the following methods can be used to determine what content will require language styles to be applied. If starting with a .docx file, process the file to .sam in order to run the listed regular expressions.
- Use the Book Topic and TOC as a Guide: Take a broad view of what to expect based on the subject matter of the publication.
- Review the special characters list in the Digital Hub for languages that fall outside the Latin alphabet. Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) should already be well served by the Scribe Language Styles setting in the Digital Hub. When using this setting, the results should be reviewed to confirm all language blocks have been identified correctly by this automated process.
- Spelling and Grammar: Use the spelling and grammar features in programs like Microsoft Word.
- Use Sublime Check 2 and skim the list of titles to see if there is any widespread use of a language.
- AI Tools: Scribe does not endorse the use of artificial intelligence to perform any actions within the WFDW. However, prompting AI to flag terms for a human to review may be beneficial in certain circumstances.
- Review character styles: Review italic terms, various “-i” styles, and various lang terms in a new Sublime file.
Find:
<i>[^<]+</i>|<[^>]+-b?i>[^<]+</[^>]+-b?i>|<lang[^>]*>[^<]+</lang[^>]*>|<[^>]*lang[^>]*>[^<]+</[^>]*>
Copy into a new file, permute unique lines, remove English text, and delete proper names. Add lang attributes as needed to the original file.
- Review Paragraph Styles: Review block quote (bq) and senseline (sl) paragraphs as common places to identify if there are full paragraphs in another language. Check other paragraph styles as needed.
Find:
<[^>"]*(bq|sl)[^>]*>[^\n]+
Copy into a new file, turn off word wrap, and skim for non-English text.
- Review Book-Specific Styles: Certain books such as Bibles or language/grammar books may have additional paragraph or character styles that are being used to identify languages. Review additional content for languages based on the type of publication.
- Review Text in Quotation Marks
Find:
(“[^\n”]*”)
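The character-style search in the list above can also be run outside a text editor. The following Python sketch applies the same expression to the text of a .sam or ScML file and returns each distinct result once, which approximates the "copy into a new file, permute unique lines" step. The function name and the sample markup are hypothetical. Removing English text and proper names still requires human review.

```python
import re

# The same expression as the "Review character styles" search, split across
# lines for readability.
CHARACTER_STYLE_PATTERN = re.compile(
    r"<i>[^<]+</i>"
    r"|<[^>]+-b?i>[^<]+</[^>]+-b?i>"
    r"|<lang[^>]*>[^<]+</lang[^>]*>"
    r"|<[^>]*lang[^>]*>[^<]+</[^>]*>"
)


def unique_styled_runs(text: str) -> list:
    """Return each distinct styled run once, sorted for easier skimming."""
    return sorted(set(CHARACTER_STYLE_PATTERN.findall(text)))


if __name__ == "__main__":
    # Hypothetical sample markup, not taken from a real project file.
    sample = (
        "<p>She said <i>hola</i> and <i>hola</i> again, "
        'then <lang-i lang="fr">merci</lang-i>.</p>'
    )
    for run in unique_styled_runs(sample):
        print(run)
```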
Determining Languages
WCAG requires that the human language of each passage or phrase in the content can be programmatically determined, except for proper names, technical terms, words of indeterminate language, and words or phrases that have become part of the vernacular of the immediately surrounding text.
When determining what language a term or phrase is, exceptions abound, and borrowed terms or loanwords may fall into a gray area in which people may come to different conclusions about how to approach them. Terms like “déjà vu” or “rendezvous” are commonly accepted as English when used in an English context; if surrounded by French text, however, they would reasonably be included in the French language tagging.
Consider the following when encountering terms and phrases:
- When using the Scribe Language Styles setting in the Digital Hub, some Asian scripts may get identified with different subtags by the Digital Hub, even when it’s clear from the context that the same language is being used for the entire block of text. For example, a single line of text could have both the general Chinese language code (“zh”) and the Taiwanese subtag (“zh-TW”) applied to different words. In these cases, the correct tag needs to be identified and the language markup needs to be cleaned up.
- Some Unicode blocks may be used in more than one language, such as the CJK Unified Ideographs used in Chinese, Japanese, and Korean. If using Scribe Language Styles, copy the languages into a new file with the following search and review the list for anything that does not match what is known about the book:
Find:
lang="[^"]+"
- See Language Scribing in sam, ScML, or InDesign for a list of searches to help find languages that use the Latin alphabet.
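To review which language codes appear in a file and how often, the lang attribute search above can be turned into a tally. This Python sketch assumes that attributes use straight double quotes, as in the `<lang-i lang="es">` example. The function name and sample markup are hypothetical.

```python
import re
from collections import Counter

# Assumption: lang attributes are written with straight double quotes.
LANG_ATTRIBUTE = re.compile(r'lang="([^"]+)"')


def count_language_codes(text: str) -> Counter:
    """Tally every lang attribute value found in the text."""
    return Counter(LANG_ATTRIBUTE.findall(text))


if __name__ == "__main__":
    # Hypothetical sample with mixed Chinese tags in one line.
    sample = (
        '<lang lang="zh">a</lang> <lang lang="zh-TW">b</lang> '
        '<lang lang="zh">c</lang>'
    )
    for code, count in count_language_codes(sample).most_common():
        print(code, count)
```

A code that appears only a handful of times alongside a closely related code (such as zh-TW next to zh) may be a candidate for cleanup, but the decision depends on what is known about the book.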
Scribe Recommendations
- If there is any doubt about whether a language tag should be applied, default to applying the style. Even if the term does not require tagging, including the tag should not harm the file, and it can take less time to scribe the content than to research or debate the issue.
- Scribe book, movie, and publication titles when they are in other languages.
When to Scribe Language Styles
Language styles can be applied at any time within the WFDW. Consider the following to determine a work plan that will be most efficient, with language choices made by the appropriate person.
- Scribing stage. If there are relatively few instances, it is recommended to apply the language styles during the scribing stage.
- Copyediting stage. If the language scribing will be more involved and has the potential to have terms changed, added, or removed during author review, it is recommended to schedule the language scribing for after all other editorial considerations have been handled, before proceeding to production stages.
- Print production stage. It is not recommended to do extensive language scribing during typesetting and page proof stages. However, as content is added and alterations are applied, language scribing should be included and maintained.
- Ebook production stage. For ebooks being produced from typeset files, language scribing can be included as part of the ScML preparation steps. Language scribing should be handled within the ScML file before processing it to ePub format.
Language Scribing in Word
Use the SAI or SAI Lite to add and apply language styles in Word.
- Use the Add Language Style tool to create the necessary language styles for the project in Word.
- Per the scribing procedure for all projects, load ScML styles into the document.
- Use the drop-down menu in Load ScML Styles to select .
- Select the base style to use. Select from the drop-down menu or click the button to use the style of selected text.
- Enter the language code. Common languages can be selected from the drop-down menu.
- The Resulting style field will display the name of the style being added.
- Click to apply the style to selected text or to add the style without applying it.
- At the designated time (during the initial scribing or while copyediting), review all italic text for phrases that need to have language metadata applied. Apply the language styles created.
- Apply additional language styles as needed (e.g., apply lang@lang=es to Spanish text using the default paragraph font, or gt@lang=es to Spanish-language glossary terms).
Note: Scribe recommends using lang-i instead of i for any italic bibliography (rf) text that needs language scribing. The bibliography tools will convert any i@lang=[Language Code] bibliography text to lang-i@lang=[Language Code].
Language Scribing in sam, ScML, or InDesign
If the live file is in a production stage, the language styles can be added in a .sam file or in InDesign. Once the language styles have been identified, add them to the point document with the appropriate formatting.
- In sam/ScML:
<lang-i lang="es">
- In InDesign:
lang-i-language-es
Words and letter patterns that are unique to a language can indicate the presence of a language in a book. The searches presented here can be used as a starting point for finding select languages in books.
Review the results before changing or replacing any styles.
General
Find:
\b(French|Spanish|German|Italian)\b
Spanish
Find:
(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)
Find:
(“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)
Replace with:
<lang lang="es">\1</lang>
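The quoted-text search and replacement for Spanish can be sketched in Python as follows. The function names and sample sentence are hypothetical. The sketch lists candidates separately from the replacement so that results can be reviewed first, because these word patterns are heuristics and may match text that is not Spanish.

```python
import re

# The same expression as the quoted-text Spanish search, split for readability.
SPANISH_QUOTE = re.compile(
    r"(“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios"
    r"|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)"
)


def find_spanish_quotes(text: str) -> list:
    """List quoted passages that match the heuristic, for human review."""
    return SPANISH_QUOTE.findall(text)


def wrap_spanish_quotes(text: str) -> str:
    """Wrap every matching quoted passage in a Spanish lang element."""
    return SPANISH_QUOTE.sub(r'<lang lang="es">\1</lang>', text)


if __name__ == "__main__":
    sample = "He said, “¿Dónde está la casa?” and left."
    print(find_spanish_quotes(sample))
    print(wrap_spanish_quotes(sample))
```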
French
Find:
\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b
German
Find:
\b(?:für|und)\b
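To get a quick sense of which languages might be present before searching one at a time, the French and German expressions above can be counted together. This Python sketch is illustrative only: the dictionary, function name, and sample sentence are hypothetical, and a high count is a prompt for review rather than proof that a language is present.

```python
import re

# The French and German searches from this section, keyed by language code.
LANGUAGE_HINTS = {
    "fr": re.compile(r"\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b"),
    "de": re.compile(r"\b(?:für|und)\b"),
}


def hint_counts(text: str) -> dict:
    """Count how many times each language's hint words appear in the text."""
    return {
        code: len(pattern.findall(text))
        for code, pattern in LANGUAGE_HINTS.items()
    }


if __name__ == "__main__":
    sample = "Je pense que c’est vrai, mais Brot und Butter für alle."
    print(hint_counts(sample))
```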
Language Codes
Some of the most common language codes are listed here as well as on the BCP-47 Wikipedia page.
The subtag lookup tool can be used to find thousands of additional language codes.
| Language | Code |
|---|---|
| Afrikaans | af |
| Albanian | sq |
| Amharic | am |
| Arabic | ar |
| Armenian | hy |
| Assamese | as |
| Azerbaijani | az |
| Bashkir | ba |
| Basque | eu |
| Belarusian | be |
| Bengali | bn |
| Bosnian | bs |
| Breton | br |
| Bulgarian | bg |
| Burmese | my |
| Catalan | ca |
| Central Kurdish | ckb |
| Chinese | zh |
| Corsican | co |
| Croatian | hr |
| Czech | cs |
| Danish | da |
| Dari | prs |
| Divehi | dv |
| Dutch | nl |
| English | en |
| Estonian | et |
| Faroese | fo |
| Filipino | fil |
| Finnish | fi |
| French | fr |
| Frisian | fy |
| Galician | gl |
| Georgian | ka |
| German | de |
| Gilbertese | gil |
| Greek | el |
| Greenlandic | kl |
| Gujarati | gu |
| Hausa | ha |
| Hebrew | he |
| Hindi | hi |
| Hungarian | hu |
| Icelandic | is |
| Igbo | ig |
| Indonesian | id |
| Inuktitut | iu |
| Irish | ga |
| Italian | it |
| Japanese | ja |
| Kʼicheʼ | quc |
| Kannada | kn |
| Kazakh | kk |
| Khmer | km |
| Kinyarwanda | rw |
| Kiswahili | sw |
| Konkani | kok |
| Korean | ko |
| Kurdish | ku |
| Kyrgyz | ky |
| Lao | lo |
| Latvian | lv |
| Lithuanian | lt |
| Lower Sorbian | dsb |
| Luxembourgish | lb |
| Macedonian | mk |
| Malay | ms |
| Malayalam | ml |
| Maltese | mt |
| Maori | mi |
| Mapudungun | arn |
| Marathi | mr |
| Mohawk | moh |
| Mongolian | mn |
| Moroccan Arabic | ary |
| Nepali | ne |
| Norwegian (Bokmål) | nb |
| Norwegian (Nynorsk) | nn |
| Norwegian | no |
| Occitan | oc |
| Odia | or |
| Papiamento | pap |
| Pashto | ps |
| Persian | fa |
| Polish | pl |
| Portuguese | pt |
| Punjabi | pa |
| Quechua | qu |
| Romanian | ro |
| Romansh | rm |
| Russian | ru |
| Sami (Inari) | smn |
| Sami (Lule) | smj |
| Sami (Northern) | se |
| Sami (Skolt) | sms |
| Sami (Southern) | sma |
| Sanskrit | sa |
| Scottish Gaelic | gd |
| Serbian | sr |
| Sesotho | st |
| Sinhala | si |
| Slovak | sk |
| Slovenian | sl |
| Spanish | es |
| Swedish | sv |
| Swiss German | gsw |
| Syriac | syc |
| Tagalog | tl |
| Tajik | tg |
| Tamazight | tzm |
| Tamil | ta |
| Tatar | tt |
| Telugu | te |
| Thai | th |
| Tibetan | bo |
| Tswana | tn |
| Turkish | tr |
| Turkmen | tk |
| Ukrainian | uk |
| Upper Sorbian | hsb |
| Urdu | ur |
| Uyghur | ug |
| Uzbek | uz |
| Vietnamese | vi |
| Welsh | cy |
| Wolof | wo |
| Xhosa | xh |
| Yakut | sah |
| Yi | ii |
| Yoruba | yo |
| Zulu | zu |
Note: Bold text indicates commonly used languages.