IDTT to sam

Use the following procedure to convert InDesign Tagged Text (IDTT) that has been extracted from an InDesign file to a Scribe Abbreviated Markup (sam) file.

Only use this procedure to convert IDTT to .sam for files that have been typeset outside of the Well-Formed Document Workflow (WFDW).

If a book has been typeset using the WFDW, use the Export .sam from InDesign process.

References/Prerequisites

References and prerequisites are listed here.

Procedure

Merge Files

Merge all IDTT files into a single text file with a .sam extension.

Place content in order.

Note: In some cases, it may be most efficient to place content into approximate locations and then review the content for final order later in the process, after all extraneous tags have been removed.
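The merge step can also be scripted. The following is a minimal sketch, not part of the documented workflow: the function name merge_idtt, the file names, and the UTF-8 encoding are assumptions, and the caller supplies the files in the desired order.

```python
from pathlib import Path


def merge_idtt(parts, out_path, encoding="utf-8"):
    """Concatenate IDTT files, in the order given, into one .sam file.

    The encoding is an assumption; adjust it to match the extracted files.
    """
    chunks = []
    for part in parts:
        # Text mode translates \r\n and \r line endings to \n on read.
        text = Path(part).read_text(encoding=encoding)
        # Make sure each file ends with a newline before the next one starts.
        if not text.endswith("\n"):
            text += "\n"
        chunks.append(text)
    out = Path(out_path)
    out.write_text("".join(chunks), encoding=encoding)
    return out
```

For example, merge_idtt(["01_front.txt", "02_chapter1.txt"], "book.sam") would write both (hypothetical) files, in that order, to book.sam.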

Remove Metadata

Delete file setup information.

Find: <(ASCII|Version|Define)[^\n]*\n
Replace with: NOTHING

Find: <FILENAME[^\n]*\n
Replace with: NOTHING
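As an illustration, the two searches above can be run with Python's re module. This is a sketch only; the sample text is invented for the example and real setup lines will differ.

```python
import re

SETUP_LINE = re.compile(r"<(ASCII|Version|Define)[^\n]*\n")
FILENAME_LINE = re.compile(r"<FILENAME[^\n]*\n")


def remove_metadata(text):
    """Delete file setup lines, then FILENAME lines."""
    text = SETUP_LINE.sub("", text)
    return FILENAME_LINE.sub("", text)


# Invented sample; only the last line is content to keep.
sample = (
    "<ASCII-WIN>\n"
    "<Version:13><FeatureSet:InDesign-Roman>\n"
    "<DefineParaStyle:Body>\n"
    "<FILENAME:chapter01>\n"
    "<ParaStyle:Body>First paragraph.\n"
)
cleaned = remove_metadata(sample)
```

Neither pattern is anchored to the start of a line, so a match that begins mid-line deletes everything from that point to the end of the line. Review the results before replacing all.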

Control Characters

Remove all control characters (e.g., ESC, BEL, BS). These will have a shaded background.

Find: [^ -~\n\t]
Replace with: NOTHING
or: SPACE
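The search above keeps only printable ASCII (space through tilde), newlines, and tabs. A minimal Python sketch of the same search follows; the function name is an assumption. Note that the character class also matches any non-ASCII character, such as an accented letter typed directly into the file.

```python
import re

# Anything that is not printable ASCII, a newline, or a tab.
NON_PRINTABLE = re.compile(r"[^ -~\n\t]")


def strip_control_characters(text, replacement=""):
    """Remove matches (the default) or swap each one for `replacement`."""
    return NON_PRINTABLE.sub(replacement, text)


dirty = "Chapter\x1b One\x07\nBody\x08text\n"
removed = strip_control_characters(dirty)       # Replace with: NOTHING
spaced = strip_control_characters(dirty, " ")   # or: SPACE
```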

Named Entities

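Replace ampersands and escaped less-than and greater-than characters (\< and \>) with named entities. Run the ampersand search first so that the ampersands in the new entities are not replaced again.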

Find: &
Replace with: &amp;

Find: \\<
Replace with: &lt;

Find: \\>
Replace with: &gt;
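The order of these three searches matters: if the ampersand search ran last, it would also change the & in &lt; and &gt;. A short Python sketch of the same sequence (the function name is an assumption):

```python
import re


def escape_named_entities(text):
    """Apply the three searches in a safe order."""
    # Ampersands first, so the "&" in "&lt;" and "&gt;" is not escaped again.
    text = text.replace("&", "&amp;")
    # The pattern \\< matches a backslash followed by "<".
    text = re.sub(r"\\<", "&lt;", text)
    text = re.sub(r"\\>", "&gt;", text)
    return text


sample = r"<ParaStyle:Body>If a \< b & b \> c"
escaped = escape_named_entities(sample)
```

Tags such as <ParaStyle:Body> are left alone because their angle brackets are not preceded by a backslash.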

Line Breaks

Note: Some of the following searches may need to be run again at different stages during conversion.

Add Placeholder Style Name

Add “nostyle” as a placeholder paragraph style name.

Find: (<ParaStyle:)(>)
Replace with: \1nostyle\2

Place Paragraphs on New Lines

Find: ([^\n])(<ParaStyle:)
Replace with: \1\n\2

Remove Empty Lines

Repeat the following until there are no more results:

Find: \n\n
Replace with: \n

Move Closing Tags to the Ends of Lines

Find: \n(<[^>]*:>)
Replace with: \1
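The four line-break searches can be chained as in the following sketch (the function name is an assumption). The placeholder step comes first on purpose: once every empty <ParaStyle:> tag carries "nostyle", the closing-tag search no longer mistakes it for a closing tag.

```python
import re


def normalize_line_breaks(text):
    """Run the line-break searches in the order listed above."""
    # Give unnamed paragraph styles the "nostyle" placeholder.
    text = re.sub(r"(<ParaStyle:)(>)", r"\1nostyle\2", text)
    # Start every paragraph on its own line.
    text = re.sub(r"([^\n])(<ParaStyle:)", r"\1\n\2", text)
    # Collapse empty lines, repeating until there are no more results.
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    # Pull closing tags such as <cTypeface:> up to the end of the previous line.
    return re.sub(r"\n(<[^>]*:>)", r"\1", text)
```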

Remove Unnecessary InDesign Tags

Note: The following searches may be modified if a tag can help identify where an ScML style should be applied. For example, “Skew” may be useful for identifying italics, or “TextAlignment” may indicate poetry. Do not delete any tag that may contain vital style information until the appropriate ScML style has been applied.

Remove unnecessary character rendering tags.

Find: <c[^>]*(Leading|Kerning|Tracking|Spacing|Size|Ligatures|OTF|Skew|Language|Baseline)[^>]*>
Replace with: NOTHING

Find: <c(Bouten|Kent(en)|Shatai|Tatech?u|Tsume|Wari(chu)|Hindi|StrokeGradient|NextXChars)([^>]*>)
Replace with: NOTHING

Remove unnecessary paragraph rendering tags.

Find: <p[^>]*(Space|TabRuler|KeepwithNext|Auto|Hyphen)[^>]*>
Replace with: NOTHING

Remove unnecessary paragraph styles representing blank lines.

Find: ^<ParaStyle:[^>]*>[ \t]*\n
Replace with: NOTHING

Remove unnecessary hyperlink tags.

Find: <Hyperlink:=(<[^>]*>)*>
Replace with: NOTHING

Remove unnecessary text alignment tags.

Find: <pTextAlignment[^>]*>
Replace with: NOTHING
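Several of the removal searches above can be applied in one pass, as sketched below. The list holds only a subset of the patterns, and the names are assumptions; drop any pattern whose tag still carries style information that has not yet been converted. In Python, the ^ in the blank-paragraph search needs the re.MULTILINE flag to match at the start of each line.

```python
import re

# A subset of the searches above, in the order they appear.
UNNEEDED = [
    re.compile(
        r"<c[^>]*(Leading|Kerning|Tracking|Spacing|Size|Ligatures|OTF"
        r"|Skew|Language|Baseline)[^>]*>"
    ),
    re.compile(r"<p[^>]*(Space|TabRuler|KeepwithNext|Auto|Hyphen)[^>]*>"),
    re.compile(r"^<ParaStyle:[^>]*>[ \t]*\n", re.MULTILINE),
    re.compile(r"<pTextAlignment[^>]*>"),
]


def remove_unneeded_tags(text, patterns=UNNEEDED):
    """Delete every match of each pattern, one pattern at a time."""
    for pattern in patterns:
        text = pattern.sub("", text)
    return text
```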

Convert Characters to Unicode Entities

Search for the following and determine the best “replace” option based on context.

Typesetting Spaces and Manual Breaks

Search for typesetting spaces.

Find: <0x200[0-9A-F]>
Replace with: NOTHING
or: SPACE

Search for soft hyphens.

Find: <0x00AD>
Replace with: NOTHING

Search for manual line breaks.

Find: <0x000A>
Replace with: \n
or: SPACE
or: NOTHING

Entity Format

Change remaining characters to hexadecimal entity format.

Find: <0(x[A-F0-9]+)>
Replace with: &#\1;

Characters in hexadecimal entity format will be converted to their corresponding Unicode characters when processed to other file formats through the Digital Hub.
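The sequence above can be sketched in Python as follows. The function name and the default choices (a space for typesetting spaces and manual line breaks) are assumptions; in practice the best replacement depends on context. The html.unescape call at the end is only a preview of the character each entity stands for and is not how the Digital Hub performs the conversion.

```python
import html
import re


def convert_unicode_tags(text, space_replacement=" ", break_replacement=" "):
    """Convert <0xNNNN> codes; the defaults are one choice among several."""
    # Typesetting spaces and related characters (U+2000 to U+200F).
    text = re.sub(r"<0x200[0-9A-F]>", space_replacement, text)
    # Soft hyphens are removed.
    text = re.sub(r"<0x00AD>", "", text)
    # Manual line breaks.
    text = re.sub(r"<0x000A>", break_replacement, text)
    # Everything that remains becomes a hexadecimal entity.
    return re.sub(r"<0(x[A-F0-9]+)>", r"&#\1;", text)


sample = "caf<0x00E9><0x2009><0x2014><0x2009>co<0x00AD>operate<0x000A>now"
converted = convert_unicode_tags(sample)
preview = html.unescape(converted)
```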

Convert Character Styles

Construct searches based on what is found in order to apply the appropriate ScML character styles.

Note: The same rendering may be applied to elements that require different ScML styles.

Note: In some cases, the content between two tags may be complex and not fit the regular expressions listed. As needed, remove an unnecessary pair of tags with two separate searches: one for the opening tag and one for the closing tag.

<cPosition

Find: <cPosition

Example:

Find: <cPosition:Superscript>([A-Za-z\d\- \.\&\#;]+)<cPosition:>
Replace with: <enref>\1</enref>
or: <fnref>\1</fnref>
or: <sup>\1</sup>
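Because the right ScML style depends on context, a scripted version of this search would take the target tag as a parameter. The sketch below is an illustration with an assumed function name. It spells out the letter range as A-Za-z, since the range A-z also matches a few punctuation characters that sit between the upper- and lowercase letters in ASCII.

```python
import re

SUPERSCRIPT = re.compile(
    r"<cPosition:Superscript>([A-Za-z\d\- \.\&\#;]+)<cPosition:>"
)


def convert_superscript(text, scml_tag="sup"):
    """Wrap superscripted text in the chosen ScML tag (enref, fnref, or sup)."""
    return SUPERSCRIPT.sub(rf"<{scml_tag}>\1</{scml_tag}>", text)


note = convert_superscript("See note.<cPosition:Superscript>12<cPosition:>", "fnref")
power = convert_superscript("x<cPosition:Superscript>2<cPosition:>")
```

A single call applies one tag to every match, so text that mixes note references and true superscripts needs to be handled match by match.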

<c

Find: <c

Example:

Find: <cTypeface:Italic>([^<]*)<cTypeface:>
Replace with: <i>\1</i>

Page IDs

Convert page IDs to self-closing tags.

Single page IDs:

Find: <CharStyle:page>\{~\?~PG: @([\da-z]+)@\}<CharStyle:>
Replace with: <page id="p\1"/>

Adjacent page IDs:

Find: <CharStyle:page>\{~\?~PG: @([\da-z]+)@\}\{~\?~PG: @([\da-z]+)@\}<CharStyle:>
Replace with: <page id="p\1"/><page id="p\2"/>

Search for any remaining page IDs.

Find: \{~\?~PG:
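The page ID searches translate directly to Python. In this sketch the function names are assumptions; remaining_page_ids mirrors the final search by listing any lines that still contain an unconverted page ID.

```python
import re

SINGLE = re.compile(r"<CharStyle:page>\{~\?~PG: @([\da-z]+)@\}<CharStyle:>")
ADJACENT = re.compile(
    r"<CharStyle:page>\{~\?~PG: @([\da-z]+)@\}\{~\?~PG: @([\da-z]+)@\}<CharStyle:>"
)


def convert_page_ids(text):
    """Convert single and adjacent page IDs to self-closing page tags."""
    text = SINGLE.sub(r'<page id="p\1"/>', text)
    return ADJACENT.sub(r'<page id="p\1"/><page id="p\2"/>', text)


def remaining_page_ids(text):
    """Return the lines that still contain a page ID marker."""
    return [line for line in text.splitlines() if "{~?~PG:" in line]
```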

<CharStyle:

Find: <CharStyle:

Example:

Find: <CharStyle:Italic>([^<]*)<CharStyle:>
Replace with: <i>\1</i>
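Searches of this shape (an opening tag, text with no other tags, a closing tag) differ only in the tag names, so they can share one helper. The sketch below uses an assumed function name, convert_pair, and works for both the <cTypeface:Italic> and <CharStyle:Italic> examples.

```python
import re


def convert_pair(text, opening, closing, scml_tag):
    """Replace opening...closing InDesign tags with an ScML element."""
    # re.escape keeps characters in the tag names from acting as regex syntax.
    pattern = re.compile(re.escape(opening) + r"([^<]*)" + re.escape(closing))
    return pattern.sub(rf"<{scml_tag}>\1</{scml_tag}>", text)


styled = convert_pair(
    "A <CharStyle:Italic>fine<CharStyle:> day",
    "<CharStyle:Italic>", "<CharStyle:>", "i",
)
```

Because the captured text may not contain <, a span that already holds another tag is left unchanged. Those are the complex cases that need to be handled by hand.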

<cCase:

Find: <cCase:Small Caps>([^<]*)<cCase:>
Replace with: <sm>\1</sm>

Find: <cCase:All Caps>([^<]*)<cCase:>
Replace with: \U\1
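The \U replacement is an editor feature and is not available in every regex engine. Python's re module, for example, does not support it, so a scripted version would uppercase the captured text in a replacement function, as in this sketch (the function name is an assumption):

```python
import re


def convert_case_tags(text):
    """Convert small caps to <sm> and all caps to literal capitals."""
    text = re.sub(r"<cCase:Small Caps>([^<]*)<cCase:>", r"<sm>\1</sm>", text)
    # No \U in Python replacement strings, so uppercase in a function.
    return re.sub(
        r"<cCase:All Caps>([^<]*)<cCase:>",
        lambda match: match.group(1).upper(),
        text,
    )


converted = convert_case_tags(
    "<cCase:Small Caps>nasa<cCase:> <cCase:All Caps>Part One<cCase:>"
)
```

Uppercasing also affects any entity inside the span (for example, &amp; would become &AMP;), so check all-caps spans that contain entities.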

Remove Stray Closing Character Style Tags

After ScML character styles have been applied, remove any remaining closing character style tags.

Find: <cTypeface:>|<CharStyle:>
Replace with: NOTHING

Convert Paragraph Styles

Construct searches based on what is found in order to apply the appropriate ScML paragraph styles.

At this time, do not scribe spacing variations (f, l, s, or o) unless the existing styles in the file provide a 1-to-1 correspondence. Identify only the structural aspects of the paragraphs. Articulation can be added when converting the .sam file to .scml at a later stage by enabling the Articulate Spacing Distinctions setting in the Digital Hub.

<Para

Find: <Para

Example:

Find: <ParaStyle:Chapter Title>([^\n]*)$
Replace with: <ct>\1</ct>
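When a file has many paragraph styles, the searches can be driven from a lookup table. The following sketch is an assumption-laden illustration: only the Chapter Title to ct pairing comes from the example above, and any further entries would need to be worked out from the styles actually found in the file.

```python
import re

# InDesign paragraph style name -> ScML tag. Extend per file.
PARAGRAPH_STYLES = {"Chapter Title": "ct"}


def convert_paragraph_styles(text, mapping=PARAGRAPH_STYLES):
    """Wrap each mapped paragraph in its ScML tag."""
    for indesign_name, scml_tag in mapping.items():
        # re.MULTILINE lets $ match at the end of every line.
        pattern = re.compile(
            rf"<ParaStyle:{re.escape(indesign_name)}>([^\n]*)$", re.MULTILINE
        )
        text = pattern.sub(rf"<{scml_tag}>\1</{scml_tag}>", text)
    return text
```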

Remaining InDesign Tags

Search for remaining InDesign tags. Replace the tag with the appropriate ScML style or delete it.

Find: <[^>]*:[^>]*>
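To see which InDesign tags remain and how often each occurs, the same search can feed a tally. This is a sketch with an assumed function name; it relies on the fact that ScML tags at this stage contain no colon.

```python
import re
from collections import Counter

INDESIGN_TAG = re.compile(r"<[^>]*:[^>]*>")


def count_remaining_tags(text):
    """Tally each distinct remaining InDesign tag."""
    return Counter(INDESIGN_TAG.findall(text))


sample = (
    "<ct>Title</ct>\n"
    "<ParaStyle:nostyle>Text <cColor:Red>red<cColor:>\n"
    "<ParaStyle:nostyle>More\n"
)
counts = count_remaining_tags(sample)
```

Calling counts.most_common() lists the most frequent tags first, which can help decide which search to construct next.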

Images

Place callouts for images in the appropriate locations.

<fig><img src="imagename.jpg"/></fig>

Note: If a logo image is part of the title page, scribe it as bkpub (or bkpub1, if necessary) rather than fig. If a logo image is part of the copyright page, scribe it as crtf (or a different crt style, if necessary) rather than fig.

Structure Indicators

If required, place structure indicators.

Example:

<structure>{~?~ST: begin chapter}</structure>

and

<structure>{~?~ST: end chapter}</structure>

sam Tags and Validation

Add sam Tags and DOCTYPE Declaration

Add the following text to the beginning of the file.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://www.scribeproduction.com/datafiles/dtd/scml.css" type="text/css"?>
<!DOCTYPE sam PUBLIC "-//Scribe Inc.//DTD sam v1.3.0//EN" "http://scml.scribenet.com/dtds/current/sam.dtd">
<sam>

Add the following text to the end of the file.

</sam>
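If the conversion is scripted, the opening and closing text can be added in the same pass. A minimal sketch, using the declarations exactly as given above (the function and constant names are assumptions):

```python
SAM_PROLOG = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<?xml-stylesheet href="http://www.scribeproduction.com/datafiles/dtd/scml.css" type="text/css"?>\n'
    '<!DOCTYPE sam PUBLIC "-//Scribe Inc.//DTD sam v1.3.0//EN" '
    '"http://scml.scribenet.com/dtds/current/sam.dtd">\n'
    "<sam>\n"
)
SAM_CLOSE = "</sam>\n"


def wrap_sam(body):
    """Add the XML declaration, DOCTYPE, and sam tags around the body."""
    if not body.endswith("\n"):
        body += "\n"
    return SAM_PROLOG + body + SAM_CLOSE


wrapped = wrap_sam("<ct>Title</ct>")
```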

Validation and QC

Validate the file.

Note: To validate, set up Sublime Text as indicated here and use the validation options under Build > XML: DTD Validation. Upload the file to the Digital Hub and address the errors it lists.
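Before running full validation, a quick well-formedness check can catch unclosed or mismatched tags. The sketch below uses Python's standard library and an assumed function name. It does not validate against the sam DTD, so it supplements the validation step rather than replacing it.

```python
import xml.etree.ElementTree as ET


def check_well_formed(sam_text):
    """Return None if the text parses as XML, else the parser's message."""
    try:
        ET.fromstring(sam_text.encode("utf-8"))
    except ET.ParseError as error:
        return str(error)
    return None


good = '<?xml version="1.0" encoding="UTF-8"?>\n<sam>\n<ct>Title &amp; more</ct>\n</sam>\n'
bad = "<sam><ct>Title</sam>"
```

Here check_well_formed(good) returns None, while check_well_formed(bad) returns a message that includes the line and column of the mismatched tag.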

Once the file is valid, review the file using the .sam QC Checklist.

Note: The text checks in the Regular Expressions Resource work best with Unicode characters in place, rather than hexadecimal entities. To change hexadecimal entities to single Unicode characters, process the .sam file to .docx in the Digital Hub. This is also an opportunity to refine the file. Review the file using the Scribing QC Checklist and process the file back to .sam to perform text checks.

If the final output is to be a Word file, apply changes directly into the .docx file or process the corrected .sam file to .docx.