Use the following procedure to convert InDesign Tagged Text (IDTT) that has been extracted from an InDesign file to a Scribe Abbreviated Markup (.sam) file.
Only use this procedure to convert IDTT to .sam for files that have been typeset outside of the Well-Formed Document Workflow (WFDW).
If a book has been typeset using the WFDW, use the Export .sam from InDesign process.
References/Prerequisites
References and prerequisites are listed here.
Procedure
Merge Files
Merge all IDTT files into a single text file with a .sam extension.
Place content in order.
Note: In some cases, it may be most efficient to place content into approximate locations and then review the content for final order later in the process, after all extraneous tags have been removed.
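Note: The merge can also be scripted. The following is a minimal Python sketch, not part of the documented workflow. The function names and the assumption that every export shares one encoding are illustrative only.

```python
from pathlib import Path


def merge_texts(texts):
    """Join file contents in the given order, ending each piece with a newline."""
    return "".join(t if t.endswith("\n") else t + "\n" for t in texts)


def merge_idtt_files(paths, out_path, encoding="utf-8"):
    """Read the IDTT exports in the order given and write one .sam file."""
    # Assumption: all exports use the same encoding; change it if they do not.
    texts = [Path(p).read_text(encoding=encoding) for p in paths]
    Path(out_path).write_text(merge_texts(texts), encoding=encoding)
```

The order of the paths passed in determines the order of the content, so approximate placement can still be corrected by hand afterward.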
Remove Metadata
Delete file setup information.
Find:
<(ASCII|Version|Define)[^\n]*\n
Replace with:NOTHING
Find:
<FILENAME[^\n]*\n
Replace with:NOTHING
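Note: As an illustration, the two searches above can be applied with Python's re module. This is a sketch only. The sample text is invented, and the function name is not part of any Scribe tool.

```python
import re

# Same patterns as the two Find steps above.
SETUP_LINES = re.compile(r"<(ASCII|Version|Define)[^\n]*\n")
FILENAME_LINES = re.compile(r"<FILENAME[^\n]*\n")


def remove_metadata(text):
    """Delete file setup lines, then FILENAME lines."""
    text = SETUP_LINES.sub("", text)
    return FILENAME_LINES.sub("", text)


# Invented sample resembling the top of an IDTT export.
sample = (
    "<ASCII-WIN>\n"
    "<Version:13.1>\n"
    "<DefineParaStyle:Body=<Nextstyle:Body>>\n"
    "<ParaStyle:Body>First paragraph.\n"
)
cleaned = remove_metadata(sample)
```

Because the patterns are not anchored to the start of a line, a match that begins mid-line removes everything from that point to the end of the line. Review the results before replacing all.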
Control Characters
Remove all control characters (e.g., ESC, BEL, BS). These will have a shaded background.
Find:
[^ -~\n\t]
Replace with:NOTHING
or:SPACE
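Note: The character class above matches anything that is not printable ASCII, a tab, or a newline. A short Python sketch (the function name is illustrative) shows the effect, including on characters that are not control characters:

```python
import re

# Same pattern as the Find step above.
NOT_PRINTABLE_ASCII = re.compile(r"[^ -~\n\t]")


def strip_control(text, replacement=""):
    """Replace every match with NOTHING (default) or with a chosen string."""
    return NOT_PRINTABLE_ASCII.sub(replacement, text)


removed = strip_control("a\x1bb\x07c\n")   # ESC and BEL are deleted
spaced = strip_control("a\x08b", " ")      # BS is replaced with a space
accented = strip_control("caf\u00e9")      # the accented letter is also deleted
```

If the file contains literal non-ASCII characters rather than tags such as <0x00E9>, this search deletes them as well, so check what the search finds before replacing.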
Named Entities
Replace ampersand, less than, and greater than characters with named entities.
Find:
&
Replace with:&amp;
Find:
\\<
Replace with:&lt;
Find:
\\>
Replace with:&gt;
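Note: The order of these three searches matters. The ampersand must be replaced first, or the ampersands in the newly added entities would be escaped a second time. A minimal Python sketch, with an illustrative function name:

```python
import re


def escape_entities(text):
    """Escape ampersands first, then the backslash-escaped angle brackets."""
    text = text.replace("&", "&amp;")
    # In the tagged text, literal angle brackets appear as \< and \>.
    text = re.sub(r"\\<", "&lt;", text)
    text = re.sub(r"\\>", "&gt;", text)
    return text


escaped = escape_entities(r"A & B \< C \> D")
```

Unescaped angle brackets, which delimit InDesign tags, are left alone.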
Line Breaks
Note: Some of the following searches may need to be run again at different stages during conversion.
Add Placeholder Style Name
Add “nostyle” as a placeholder paragraph style name.
Find:
(<ParaStyle:)(>)
Replace with:\1nostyle\2
Place Paragraphs on New Lines
Find:
([^\n])(<ParaStyle:)
Replace with:\1\n\2
Remove Empty Lines
Repeat the following until there are no more results:
Find:
\n\n
Replace with:\n
Move Closing Tags to the Ends of Lines
Find:
\n(<[^>]*:>)
Replace with:\1
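Note: The four line break steps can be sketched in Python as follows. The function name and sample text are illustrative. The placeholder step runs first so that an empty <ParaStyle:> tag is not mistaken for a closing tag in the last step.

```python
import re


def normalize_line_breaks(text):
    # 1. Add "nostyle" as a placeholder paragraph style name.
    text = re.sub(r"(<ParaStyle:)(>)", r"\1nostyle\2", text)
    # 2. Start every paragraph on a new line.
    text = re.sub(r"([^\n])(<ParaStyle:)", r"\1\n\2", text)
    # 3. Remove empty lines; repeat until there are no more results.
    while "\n\n" in text:
        text = text.replace("\n\n", "\n")
    # 4. Move closing tags (ending in ":>") up to the end of the previous line.
    text = re.sub(r"\n(<[^>]*:>)", r"\1", text)
    return text


sample = (
    "<ParaStyle:Title>One<ParaStyle:>Two\n\n\n"
    "<ParaStyle:Body><cTypeface:Italic>Three\n<cTypeface:>\n"
)
normalized = normalize_line_breaks(sample)
```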
Remove Unnecessary InDesign Tags
Note: The following searches may be modified if an aspect can be used to determine where an ScML style should be used. For example, “Skew” may be useful for identifying italics, or “TextAlignment” may indicate poetry. Do not delete any tag that may contain vital style information until the appropriate ScML style has been applied.
Remove unnecessary character rendering tags.
Find:
<c[^>]*(Leading|Kerning|Tracking|Spacing|Size|Ligatures|OTF|Skew|Language|Baseline)[^>]*>
Replace with:NOTHING
Find:
<c(Bouten|Kent(en)|Shatai|Tatech?u|Tsume|Wari(chu)|Hindi|StrokeGradient|NextXChars)([^>]*>)
Replace with:NOTHING
Remove unnecessary paragraph rendering tags.
Find:
<p[^>]*(Space|TabRuler|KeepwithNext|Auto|Hyphen)[^>]*>
Replace with:NOTHING
Remove unnecessary paragraph styles representing blank lines.
Find:
^<ParaStyle:[^>]*>[ \t]*\n
Replace with:NOTHING
Remove unnecessary hyperlink tags.
Find:
<Hyperlink:=(<[^>]*>)*>
Replace with:NOTHING
Remove unnecessary text alignment tags.
Find:
<pTextAlignment[^>]*>
Replace with:NOTHING
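Note: Because some of these tags may carry style information, it can help to list what a search would delete before deleting it. The following Python sketch does this for the character rendering search. The audit and remove helpers and the sample text are illustrative, not part of the documented procedure.

```python
import re
from collections import Counter

# Same pattern as the first character rendering search above.
CHAR_RENDERING = re.compile(
    r"<c[^>]*(Leading|Kerning|Tracking|Spacing|Size|Ligatures|OTF|Skew"
    r"|Language|Baseline)[^>]*>"
)


def audit(pattern, text):
    """Count each distinct tag the pattern would delete, for review."""
    return Counter(m.group(0) for m in pattern.finditer(text))


def remove(pattern, text):
    """Delete every match of the pattern."""
    return pattern.sub("", text)


sample = "<ParaStyle:Body><cTracking:10>Some <cSkew:12>slanted<cSkew:> text<cTracking:>\n"
found = audit(CHAR_RENDERING, sample)
cleaned = remove(CHAR_RENDERING, sample)
```

In this sample the audit shows a Skew tag, which may indicate italics. Apply the appropriate ScML style first if so, then remove the tags.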
Convert Characters to Unicode Entities
Search for the following and determine the best “replace” option based on context.
Typesetting Spaces and Manual Breaks
Search for typesetting spaces.
Find:
<0x200[0-9A-F]>
Replace with:NOTHING
or:SPACE
Search for soft hyphens.
Find:
<0x00AD>
Replace with:NOTHING
Search for manual line breaks.
Find:
<0x000A>
Replace with:\n
or:SPACE
or:NOTHING
Entity Format
Change remaining characters to hexadecimal entity format.
Find:
<0(x[A-F0-9]+)>
Replace with:&#\1;
Characters in hexadecimal entity format will be converted to their corresponding Unicode characters when processed to other file formats through the Digital Hub.
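Note: The character searches in this section can be sketched in Python as below. Applying a single replacement everywhere is a simplification made for the example, since the best replacement depends on context. The function names and default replacements are assumptions.

```python
import html
import re


def convert_special_characters(text, space_replacement=" ", break_replacement=" "):
    # Typesetting spaces: <0x2000> through <0x200F>.
    text = re.sub(r"<0x200[0-9A-F]>", space_replacement, text)
    # Soft hyphens are deleted.
    text = text.replace("<0x00AD>", "")
    # Manual line breaks become a space here; a newline or nothing may fit better.
    text = text.replace("<0x000A>", break_replacement)
    # Remaining characters become hexadecimal entities, e.g. <0x2014> -> &#x2014;
    return re.sub(r"<0(x[A-F0-9]+)>", r"&#\1;", text)


def preview(text):
    """Show the characters that the hexadecimal entities stand for."""
    return html.unescape(text)


converted = convert_special_characters(
    "soft<0x00AD>hyphen<0x2009><0x2014><0x2009>dash<0x000A>next"
)
```

The preview helper is only for checking results locally. The file itself keeps the entities.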
Convert Character Styles
Construct searches based on what is found in order to apply the appropriate ScML character styles.
Note: The same rendering may be applied to elements that require different ScML styles.
Note: In some cases, the content between two tags may be complex and not fit the regular expressions listed. As needed, consider deleting unnecessary opening and closing tags using two separate searches to simply remove the opening tag and then the closing tag.
<cPosition
Find:
<cPosition
Example:
Find:
<cPosition:Superscript>([A-Za-z\d\- \.\&\#;]+)<cPosition:>
Replace with:<enref>\1</enref>
or:<fnref>\1</fnref>
or:<sup>\1</sup>
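Note: Choosing among enref, fnref, and sup depends on what the superscript represents. The Python sketch below uses a hypothetical rule for illustration only: superscripts made up entirely of digits are treated as note references, and everything else becomes sup. Review the matches before relying on a rule like this.

```python
import re

SUPERSCRIPT = re.compile(r"<cPosition:Superscript>([A-Za-z\d\- .&#;]+)<cPosition:>")


def convert_superscripts(text, note_tag="fnref"):
    """Tag all-digit superscripts as notes (assumed rule); tag the rest as sup."""
    def replace(match):
        content = match.group(1)
        tag = note_tag if content.isdigit() else "sup"
        return f"<{tag}>{content}</{tag}>"
    return SUPERSCRIPT.sub(replace, text)


sample = "Text<cPosition:Superscript>12<cPosition:> and 2<cPosition:Superscript>nd<cPosition:>"
footnotes = convert_superscripts(sample)
endnotes = convert_superscripts(sample, note_tag="enref")
```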
<c
Find:
<c
Example:
Find:
<cTypeface:Italic>([^<]*)<cTypeface:>
Replace with:<i>\1</i>
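Note: The italic example only matches when no other tag sits between the opening and closing tags. The Python sketch below (function name illustrative) applies the example and reports the typeface tags that were left behind, which are the complex cases to handle by hand.

```python
import re

ITALIC = re.compile(r"<cTypeface:Italic>([^<]*)<cTypeface:>")


def convert_italics(text):
    """Return the converted text and any typeface tags that did not match."""
    converted = ITALIC.sub(r"<i>\1</i>", text)
    leftovers = re.findall(r"<cTypeface:[^>]*>", converted)
    return converted, leftovers


sample = (
    "A <cTypeface:Italic>simple<cTypeface:> case and a "
    "<cTypeface:Italic>nested<cPosition:Superscript>1<cPosition:> one<cTypeface:>."
)
converted, leftovers = convert_italics(sample)
```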
Page IDs
Convert page IDs to self-closing tags.
Single page IDs:
Find:
<CharStyle:page>\{~\?~PG: @([\da-z]+)@\}<CharStyle:>
Replace with:<page id="p\1"/>
Adjacent page IDs:
Find:
<CharStyle:page>\{~\?~PG: @([\da-z]+)@\}\{~\?~PG: @([\da-z]+)@\}<CharStyle:>
Replace with:<page id="p\1"/><page id="p\2"/>
Search for any remaining page IDs.
Find:
\{~\?~PG:
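Note: The single and adjacent page ID searches can be generalized to any number of adjacent page IDs inside one character style. The Python sketch below is one way to do that. The function name is illustrative, and the count it returns corresponds to the final search for remaining page IDs.

```python
import re

# One or more page IDs wrapped in the "page" character style.
PAGE_RUN = re.compile(r"<CharStyle:page>((?:\{~\?~PG: @[\da-z]+@\})+)<CharStyle:>")
PAGE_ID = re.compile(r"\{~\?~PG: @([\da-z]+)@\}")


def convert_page_ids(text):
    """Convert styled page IDs to self-closing tags and count what remains."""
    def replace(match):
        ids = PAGE_ID.findall(match.group(1))
        return "".join(f'<page id="p{page}"/>' for page in ids)
    converted = PAGE_RUN.sub(replace, text)
    return converted, converted.count("{~?~PG:")


sample = (
    "a<CharStyle:page>{~?~PG: @12@}<CharStyle:>b"
    "<CharStyle:page>{~?~PG: @xiv@}{~?~PG: @xv@}<CharStyle:>c {~?~PG: @9@}"
)
converted, remaining = convert_page_ids(sample)
```

Page IDs outside the page character style, like the last one in the sample, are not converted and are counted as remaining.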
<CharStyle:
Find:
<CharStyle:
Example:
Find:
<CharStyle:Italic>([^<]*)<CharStyle:>
Replace with:<i>\1</i>
<cCase:
Find:
<cCase:Small Caps>([^<]*)<cCase:>
Replace with:<sm>\1</sm>
Find:
<cCase:All Caps>([^<]*)<cCase:>
Replace with:\U\1
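Note: The \U replacement is editor syntax and is not available in every regex tool. Python's re module, for example, does not support it. The sketch below uses a replacement function instead. It also leaves entities untouched, on the assumption that entities are already in place at this stage, because uppercasing &amp; or the x in &#x00E9; would break them. The function names are illustrative.

```python
import re

ENTITY = re.compile(r"(&#?[A-Za-z0-9]+;)")


def upper_outside_entities(text):
    """Uppercase the text but leave entities exactly as they are."""
    parts = ENTITY.split(text)  # odd-numbered parts are the entities
    return "".join(p if i % 2 else p.upper() for i, p in enumerate(parts))


def convert_case_tags(text):
    text = re.sub(r"<cCase:Small Caps>([^<]*)<cCase:>", r"<sm>\1</sm>", text)
    return re.sub(
        r"<cCase:All Caps>([^<]*)<cCase:>",
        lambda m: upper_outside_entities(m.group(1)),
        text,
    )


converted = convert_case_tags(
    "<cCase:Small Caps>nasa<cCase:> and <cCase:All Caps>caf&#x00E9; &amp; tea<cCase:>"
)
```

The character that an entity represents is not uppercased by this sketch, so accented letters inside All Caps text still need a manual check.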
Remove Stray Closing Character Style Tags
After ScML character styles have been applied, remove any remaining closing character style tags.
Find:
<cTypeface:>|<CharStyle:>
Replace with:NOTHING
Convert Paragraph Styles
Construct searches based on what is found in order to apply the appropriate ScML paragraph styles.
At this time, do not scribe spacing variations (f, l, s, or o) unless the existing styles in the file provide a 1-to-1 correspondence. Identify only the structural aspects of the paragraphs. Articulation can be added when converting the .sam file to .scml at a later stage by enabling the Articulate Spacing Distinctions setting in the Digital Hub.
<Para
Find:
<Para
Example:
Find:
<ParaStyle:Chapter Title>([^\n]*)$
Replace with:<ct>\1</ct>
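Note: When a file uses many paragraph styles, a lookup table can apply several conversions at once and list the style names that still need a decision. The Python sketch below is illustrative. Only the Chapter Title to ct pairing comes from the example above, and the rest of any mapping must be built from the styles found in the file.

```python
import re
from collections import Counter

# ^ and $ match at line boundaries only because of re.MULTILINE.
PARAGRAPH = re.compile(r"^<ParaStyle:([^>]*)>([^\n]*)$", re.MULTILINE)


def convert_paragraphs(text, style_map):
    """Wrap mapped paragraphs in their ScML tag; report unmapped style names."""
    unmapped = Counter()

    def replace(match):
        name, content = match.group(1), match.group(2)
        tag = style_map.get(name)
        if tag is None:
            unmapped[name] += 1
            return match.group(0)  # leave the line unchanged for review
        return f"<{tag}>{content}</{tag}>"

    return PARAGRAPH.sub(replace, text), unmapped


sample = "<ParaStyle:Chapter Title>The Beginning\n<ParaStyle:Body>It began.\n"
converted, unmapped = convert_paragraphs(sample, {"Chapter Title": "ct"})
```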
Remaining InDesign Tags
Search for remaining InDesign tags. Replace the tag with the appropriate ScML style or delete it.
Find:
<[^>]*:[^>]*>
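Note: To see at a glance which InDesign tags remain, the search above can be turned into a tally. This Python sketch is illustrative only. The pattern matches any tag containing a colon, so tags with URLs in their attributes would also be listed.

```python
import re
from collections import Counter


def remaining_indesign_tags(text):
    """Count each distinct tag that still contains a colon."""
    return Counter(re.findall(r"<[^>]*:[^>]*>", text))


sample = "<ct>Title</ct>\n<ParaStyle:Body>Text <cColor:Red>red<cColor:>\n"
tally = remaining_indesign_tags(sample)
```

An empty tally means the search has no more results.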
Images
Place callouts for images in the appropriate locations.
<fig><img src="imagename.jpg"/></fig>
Note: If a logo image is part of the title page, scribe it as bkpub (or bkpub1, if necessary) rather than fig. If a logo image is part of the copyright page, scribe it as crtf (or a different crt style, if necessary) rather than fig.
Structure Indicators
If required, place structure indicators.
Example:
<structure>{~?~ST: begin chapter}</structure>
and
<structure>{~?~ST: end chapter}</structure>
sam Tags and Validation
Add sam Tags and DOCTYPE Declaration
Add the following text to the beginning of the file.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://www.scribeproduction.com/datafiles/dtd/scml.css" type="text/css"?>
<!DOCTYPE sam PUBLIC "-//Scribe Inc.//DTD sam v1.3.0//EN" "http://scml.scribenet.com/dtds/current/sam.dtd">
<sam>
Add the following text to the end of the file.
</sam>
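Note: The wrapper text can be added with a short script, and a quick well-formedness check can catch unclosed tags before validation. The Python sketch below is illustrative. It checks only that the XML is well formed and does not validate against the sam DTD, so it does not replace the validation step.

```python
import xml.etree.ElementTree as ET

SAM_HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<?xml-stylesheet href="http://www.scribeproduction.com/datafiles/dtd/scml.css" type="text/css"?>\n'
    '<!DOCTYPE sam PUBLIC "-//Scribe Inc.//DTD sam v1.3.0//EN" '
    '"http://scml.scribenet.com/dtds/current/sam.dtd">\n'
    "<sam>\n"
)


def wrap_sam(body):
    """Add the declaration lines and the opening and closing sam tags."""
    return SAM_HEADER + body.rstrip("\n") + "\n</sam>\n"


def is_well_formed(document):
    """Return True if the document parses as XML (no DTD validation)."""
    try:
        ET.fromstring(document.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


good = wrap_sam("<ct>Title</ct>\n")
bad = wrap_sam("<ct>Title\n")
```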
Validation and QC
Validate the file.
Note: To validate, set up Sublime Text as indicated here and use the validation options under . Upload the file to the Digital Hub and address the errors it lists.
Once the file is valid, review the file using the .sam QC Checklist.
Note: The text checks in the Regular Expressions Resource work best with Unicode characters in place, rather than hexadecimal entities. To change hexadecimal entities to single Unicode characters, process the .sam file to .docx in the Digital Hub. This is also an opportunity to refine the file. Review the file using the Scribing QC Checklist and process the file back to .sam to perform text checks.
If the final output is to be a Word file, apply changes directly into the .docx file or process the corrected .sam file to .docx.