Joe English
Last updated: Sunday 27 June 1999, 15:31 PDT
RTF, or Rich Text Format, is Microsoft's interchange format for word processing files. RATFINK is a set of Tcl routines for creating RTF files, and a Cost interface for converting SGML to RTF.
RTF is also the basis for Windows Help (WINHELP) format. RATFINK does not currently contain any direct support for WINHELP.
rtflib.tcl contains the low-level utility routines. This file is not Cost-specific and may be used in any Tcl script. RTF.spec contains the extra Cost commands for SGML conversion.
To use RATFINK with Cost, create a translation script that does the following:
NOTE -- The last three steps may be done in the main procedure.
Then run
sgmls sgmldecl yourdoc.sgml | costsh -S yourscript.spec > output.rtf
to convert yourdoc.sgml to RTF.
# Define stylesheet:
rtf:paraStyle body "Body Text" {
    Font Roman
    FontSize 10pt
    LeftIndent 0.5in
}
rtf:paraStyle heading "Heading 1" {
    Font Sans
    FontSize 14pt
    Bold 1
    LeftIndent 0pt
}

# Specify processing for each element type:
specification rtfSpec {
    {element P} {
        rtf para
        paraStyle body
    }
    {element H1} {
        rtf para
        paraStyle heading
    }
}

# Main routine:
proc main {} {
    rtf:start
    rtf:convert rtfSpec
    rtf:end
}
Lexically, RTF files are a simple stream of text, control words, and groups. All text is seven-bit ASCII. Control words are lowercase alphabetic tokens beginning with a backslash, followed by an optional integer parameter, and terminated by a space or a non-alphanumeric character. Groups are nested data enclosed in curly braces.
Semantically, things start to get complicated.
RTF files start with a header, which contains a font table, color table, stylesheet, and other metainformation. The header is followed by the document data. Everything in the document is a paragraph, except for the stuff that isn't. The stuff that isn't a paragraph includes
Actually it's all very messy and you shouldn't have to worry about it too much except to note that every block of displayed data is considered a paragraph, and that RTF has a ``flat'' structure: sections, paragraphs, and table rows do not nest.
NOTE -- Groups on the other hand do nest, and group boundaries can cross section, paragraph, row, and cell boundaries, but don't worry about that either.
Formatting is determined by stylesheet entries. There are three types of stylesheet entries in RTF: character styles, paragraph styles, and section styles. These are defined with the commands rtf:charStyle, rtf:paraStyle, and rtf:sectStyle, respectively. All three commands have the same syntax:
rtf:xxxStyle id "description" [ -basedon styleid ] { param value param value ... }
Each stylesheet entry has a symbolic id by which it is referenced in other commands, and a textual description, which is inserted into the RTF file as the style name.
TIP -- Some RTF style names are interpreted specially by various word processors; see 2.6. ``Special style names'', below.
If -basedon is specified, then all of the style properties listed in the named styleid are copied to the style being defined. Parameters in the new style override those in the base style. styleid must be the id of a previously-defined style of the same type (paragraph, character, or section).
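The inheritance rule can be pictured with a few lines of plain Tcl. This is only a sketch of the semantics described above, not code from rtflib.tcl: the resolveStyle helper is hypothetical, and it ignores the special cases for flag parameters and tab stops.

```tcl
# Sketch of the -basedon semantics only; RATFINK's own bookkeeping may differ.
# resolveStyle is a hypothetical helper, not part of rtflib.tcl.
proc resolveStyle {baseParams newParams} {
    # Start from the base style's pairs; pairs in the new style win.
    return [dict merge $baseParams $newParams]
}

set body    {Font Roman FontSize 10pt LeftIndent 0.5in}
set heading [resolveStyle $body {Font Sans FontSize 14pt Bold 1}]

# heading keeps LeftIndent from body; Font and FontSize are overridden.
puts [dict get $heading LeftIndent]
puts [dict get $heading Font]
```

The example requires Tcl 8.5 or later for the dict command.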
Style properties are specified as a list of param-value pairs. Parameters map (more or less) onto RTF control words. Parameter values take one of several forms, depending on the parameter. Boolean parameters specify true/false values; the value must be 1 or 0. Flag parameters are like booleans, but the parameter can only be turned on (flag parameters correspond to properties that are off by default and are turned on by the presence of an RTF control word). Dimension parameters specify a length as a number followed by a unit name, e.g., 12pt, 0.5in; see 2.1.1. ``Dimensions'' for more details. Other parameters may be integers, a list of enumerated values, or some other type as described below.
Most RTF control words expect lengths to be specified in twips; others expect them to be specified in half-points. There are 20 twips to a point, and there are 72 points to an inch.
NOTE -- There are several definitions of a ``point''; RTF uses the conventional DTP definition of exactly 72 points to the inch.
Since twips and half-points are not a very convenient way to specify lengths, RATFINK allows you to specify dimensions in terms of other units and converts to twips or half-points as appropriate.
Dimension specifications are decimal numbers, with an optional minus sign and fractional part, followed immediately by one of the following units:
NOTE -- Due to roundoff errors in converting millimeters to twips, the metric units are unreliable. Also, different versions of Word seem to use different rounding conventions. It's best to avoid cm and mm if possible.
For example,
rtf:paraStyle body "Body Text" { FontSize 10pt LineSpacing 12pt LeftMargin 1in FirstIndent 0.5in }
The rtf:cvTwips and rtf:cvHalfpts functions convert dimension specifications to twips and half-points, respectively.
rtf:cvTwips dimension
Returns the value of dimension in twips, rounded to the nearest integer.
rtf:cvHalfpts dimension
The same as rtf:cvTwips, but returns the value in half-points.
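The arithmetic behind these conversions is small enough to sketch in plain Tcl. The procedures below are illustrative stand-ins, not the rtflib.tcl implementations; the names toPoints, toTwips, and toHalfPoints are hypothetical, and the unit set (pt, in, cm, mm) is assumed from the units mentioned in this section.

```tcl
# Sketch only: not the rtflib.tcl implementation.
# Assumed units: pt, in, cm, mm.
proc toPoints {dimension} {
    if {![regexp {^(-?[0-9]+(?:\.[0-9]+)?)(pt|in|cm|mm)$} $dimension -> number unit]} {
        error "bad dimension \"$dimension\""
    }
    # scan avoids Tcl treating a leading zero as octal.
    scan $number %f value
    switch -- $unit {
        pt { set factor 1.0 }
        in { set factor 72.0 }
        cm { set factor [expr {72.0 / 2.54}] }
        mm { set factor [expr {72.0 / 25.4}] }
    }
    return [expr {$value * $factor}]
}

# 20 twips to the point, rounded to the nearest integer.
proc toTwips {dimension} {
    return [expr {round([toPoints $dimension] * 20)}]
}

# 2 half-points to the point, rounded to the nearest integer.
proc toHalfPoints {dimension} {
    return [expr {round([toPoints $dimension] * 2)}]
}

puts [toTwips 0.5in]      ;# 720
puts [toHalfPoints 10pt]  ;# 20
```

The metric cases show where rounding enters: 1cm is about 566.93 twips, so it cannot be represented exactly.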
NOTE -- At present, the available fonts are hard-coded and may not be changed.
The available fonts are:
Every paragraph style may include its own set of tab stops. This is specified by the TabStops paragraph style property.
Tab stops are specified as a list of tabspecs; each tabspec is a list consisting of a dimension followed by any of the following property specifications:
Tab stops may also be defined with the rtf:tabStops command:
rtf:tabStops name { tabspec ... }
Defines and assigns to name a set of tab stops. name may be referenced by a TabStops parameter in subsequent paragraph style definitions.
For example,
rtf:tabStops normaltabs { 80pt 160pt 240pt 320pt 400pt 480pt 560pt 640pt 720pt }
rtf:tabStops threepart { "3in Align Center" "6in Align Right" }
rtf:paraStyle header "Page header" { TabStops threepart }
rtf:tabStops toctabs { "6in Align Left Leaders Dot" }
The TabStops paragraph style property may not be overridden: tab stops in a paragraph style are added to those in the base paragraph style.
NOTE -- The RATFINK syntax for tab stops is messy and counterintuitive, and will probably change...
RATFINK predefines the rule styles thin, thick, and double. Additional rule and border styles may be defined with the rtf:ruleStyle command. Like rtf:tabStops, this defines a symbolic name for a particular rule style that is referenced as the value of other stylesheet parameters.
rtf:ruleStyle name { param value ... }
Defines a rule style named name. Allowable parameters are:
NOTE -- It is unclear what the \brdrhair (Style Hairline) control word means, since the thickness of the rule is actually determined by the \brdrw (Thickness) control word.
RATFINK makes sure that all the control words are output in the correct order.
Character styles are used for regions of text within a paragraph. All of the character style parameters are also valid paragraph style parameters.
rtf:charStyle id "description" [ -basedon styleid ] { param value ... }
Available parameters are:
NOTE -- The ``shadow'' and ``outline'' character formatting properties have been omitted on aesthetic grounds. Considering how Word typically renders ``small caps,'' that should probably be avoided as well.
NOTE -- Apparently it is not possible in RTF to specify double word underline.
rtf:paraStyle id "description" [ -basedon styleid ] { param value ... }
Defines a paragraph style, which may be referenced by id in a later call to rtf:startPara.
Available parameters are:
In addition, all of the character formatting style attributes (see 2.2. ``Character styles'') may be specified for a paragraph style.
TIP -- To achieve ``hanging indentation'', use a negative value for FirstIndent.
NOTE -- If a series of successive paragraphs specify the same set of borders, the borders are drawn around the group as a whole unless the InnerBorders flag is specified.
rtf:sectStyle id "description" [ -basedon styleid ] { param value ... }
Defines a section style, which may be referenced by id in a later call to rtf:startSection.
NOTE -- It is unclear how this affects the interpretation of the page size and margin control words.
NOTE -- It is unclear whether these specify distance to the top or the bottom of the header and footer.
The rtf:documentFormat command specifies formatting properties for the document as a whole. This command is optional; the default values for these parameters should be sufficient unless you need finer control over the layout.
rtf:documentFormat { param value ... }
Available document formatting parameters are:
NOTE -- The RTF spec says only that this control word ``switches margin definitions on left and right pages,'' which is ambiguous. By experimentation, LeftMargin corresponds to the ``inner'' margin and RightMargin corresponds to the ``outer'' margin, at least in Word for Windows 95 Version 7.
NOTE -- It is unclear how the top, bottom, left, and right margins are interpreted if landscape orientation is specified.
See also 3.7. ``Footnotes''.
Word uses the paragraph style names Heading 1, Heading 2, etc., as the source text for building a table of contents. Use these names as the description for heading entries in the text to facilitate automatic TOC generation.
Word applies the paragraph styles TOC 1, TOC 2, etc., to entries in automatically-generated tables of contents. If you include definitions for these styles in the stylesheet, Word will use them to format the table of contents.
Call rtf:start after all declarations and before writing any output. Call rtf:end at the end of processing.
rtf:start
Begins the top-level RTF group, emits the style sheet and other header information, and sets any document-wide formatting properties specified by rtf:documentFormat.
rtf:end
Closes the top-level RTF group. Must be called at the end of processing.
The basic unit of text in RTF is the paragraph. In RTF, a paragraph is any block of displayed text -- including section headings, list items, and table cell entries -- not necessarily a conventional paragraph.
rtf:startPara styleid
# generate paragraph text...
rtf:endPara
rtf:startPara and rtf:endPara delimit the start and end of paragraphs. styleid is the name of a paragraph style defined with rtf:paraStyle. Since paragraphs do not nest, rtf:endPara is optional.
rtf:startPhrase styleid
# ...
rtf:endPhrase
Use rtf:startPhrase and rtf:endPhrase to apply special formatting to text within a paragraph. styleid is the name of a character style defined with rtf:charStyle.
rtf:endPhrase is not optional. Phrase boundaries must not cross paragraph boundaries. (Actually RTF doesn't care if they do, but this confuses RATFINK).
RTF documents may optionally be broken into sections.
rtf:startSection styleid
rtf:endSection
styleid is a section style declared with rtf:sectStyle. Since sections do not nest in RTF, rtf:endSection is optional.
rtf:text "text"
Writes text to the output file, escaping backslashes and braces so they are not interpreted as RTF markup.
rtf:text makes sure that the output is inside a paragraph. If not, it starts a new paragraph and issues a warning.
rtf:text also replaces sequences of two consecutive hyphens with an en-dash, three hyphens with an em-dash, two backquotes (`) with a left double quote, and two apostrophes (') with a right double quote.
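The escaping and substitutions can be approximated in one pass with string map. This is a sketch of the behavior described above, not the rtf:text source; the escapeText name is hypothetical. The control words \emdash, \endash, \ldblquote, and \rdblquote are the standard RTF ones for these characters.

```tcl
# Sketch only: approximates the text handling described for rtf:text.
proc escapeText {text} {
    # string map scans once, left to right, and tries keys in list order,
    # so "---" must come before "--".
    set map [list \
        "\\"  "\\\\" \
        "\{"  "\\\{" \
        "\}"  "\\\}" \
        "---" "\\emdash " \
        "--"  "\\endash " \
        "``"  "\\ldblquote " \
        "''"  "\\rdblquote "]
    return [string map $map $text]
}

puts [escapeText "pages 3--5 of {draft}"]
# prints: pages 3\endash 5 of \{draft\}
```

The space after each control word is the delimiter that terminates it; an RTF reader does not treat it as text.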
rtf:insert data
Inserts data into the current paragraph verbatim, leaving backslashes and braces as-is.
rtf:write "data"
rtf:write inserts data into the output verbatim. data may contain RTF control codes.
NOTE -- Be very careful when using rtf:write to generate RTF commands directly.
The rtf:special command inserts a special character into the output stream, ensuring that the output is currently inside a paragraph.
The global array rtfSpecial maps symbolic character names to the corresponding RTF control words.
rtf:special name
rtf:insert $rtfSpecial(name)
name is one of the following symbolic names:
NOTE -- The $rtfSpecial array may also be referenced in prefix and suffix parameters in Cost specifications, for example.
rtf:tab
rtf:lineBreak
rtf:pageBreak
rtf:columnBreak
These commands generate a ``hard'' tab, line break, page break, and column break control word, respectively. rtf:tab and rtf:lineBreak may only be used inside a paragraph.
RTF destination groups are used for text that does not appear in the main flow; e.g., page headers or footnotes. The rtf:divert command starts a new destination group.
rtf:divert destination
# generate data for destination...
rtf:undivert
See 3.6. ``Page headers and footers'' and 3.7. ``Footnotes'' for more information.
Header and footer text is specified in destination groups. There are several different destinations related to headers and footers; which ones are applicable depends on various document and section style properties.
The Header and Footer destination groups specify the default header and footer, respectively. LeftHeader, LeftFooter, RightHeader and RightFooter specify the header and footer for left (verso) and right (recto) pages; these are only applicable if the TwoSide document formatting property is set. FirstPageHeader and FirstPageFooter specify the header and footer for the first page of the section; these are only applicable if the HasTitlePage section formatting property is specified for the section.
Headers and footers should be specified immediately after the call to rtf:startSection. If a particular applicable header or footer is not specified, then it is inherited from the previous section.
Headers and footers contain ordinary paragraph text.
The PageNumber special character may be useful in headers and footers.
Footnotes are generated with a Footnote destination group.
rtf:special FootnoteNumber
rtf:divert Footnote
# generate footnote text ...
rtf:undivert
Footnotes are ``anchored'' to the character that immediately precedes the destination group. Use the FootnoteNumber special character to obtain automatically-numbered footnotes.
The following document-wide formatting properties affect how footnotes are formatted; they may be specified with the rtf:documentFormat command prior to the start of output.
If FootnoteLocation is not specified, footnotes appear at the bottom of each page.
NOTE -- RTF also has ``alternate'' footnotes, used to put both footnotes and endnotes in the document. RATFINK does not support alternate footnotes.
RTF allows sections of text to be defined as a bookmark. It is not clear from the RTF specification what this feature does; presumably it is used by word processing software.
rtf:startBookmark name
rtf:endBookmark name
The bookmark name may be any character data. rtf:startBookmark must be followed by a matching rtf:endBookmark; bookmarks may overlap, however.
NOTE -- Table support is still in beta. This interface has not been very well tested or debugged, and is subject to change.
In RTF, a table is a consecutive series of rows, each of which contains a series of cells. Cells contain either a series of one or more paragraphs or inline text.
RTF has no explicit control words for the beginning and end of a table. Instead, tables are specified as a sequence of rows. Cell properties (sizes and rules) are specified all at once at the beginning of each row, followed by the cells themselves.
RATFINK makes the following simplifying assumptions:
rtf:startTable
    ( -numcols n | -abswidths "w1 w2 ... wn" | -relwidths "w1 w2 ... wn" )
    [ -width dimension ]
    [ -frame rulestyle ]
    [ -rowsep rulestyle ]
    [ -colsep rulestyle ]
    [ -align (Left|Right|Center) ]
# ...
rtf:endTable
rtf:startTable begins a table. Exactly one of -numcols, -relwidths, or -abswidths must be specified to define the number of columns in the table.
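In RTF, each cell in a row is sized by giving the position of its right edge in twips (the \cellx control word). The sketch below shows one plausible way relative widths could be turned into such edges. It assumes that relative widths are shares of a total table width, which this document does not state; the cellEdges helper is hypothetical and is not part of rtflib.tcl.

```tcl
# Sketch only: assumes relative widths are proportional shares of the
# total table width. cellEdges is a hypothetical helper.
proc cellEdges {relwidths totalTwips} {
    set sum 0
    foreach w $relwidths {
        set sum [expr {$sum + $w}]
    }
    set edges {}
    set acc 0
    foreach w $relwidths {
        # Scale the running total, not each width, so rounding
        # errors do not accumulate across columns.
        set acc [expr {$acc + $w}]
        lappend edges [expr {round(double($acc) * $totalTwips / $sum)}]
    }
    return $edges
}

# Three columns in a 1:2:1 ratio across 6 inches (8640 twips).
puts [cellEdges {1 2 1} 8640]   ;# 2160 6480 8640
```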
The other options are:
rtf:startRow
    [ -colspans "s1 s2 ... sm" ]
    [ -toprule rulestyle ]
    [ -botrule rulestyle ]
    [ -colsep rulestyle ]
    [ -rowheight dimension ]
# ...
rtf:endRow
rtf:startRow begins a new table row.
rtf:endRow and rtf:endCell are optional.
rtf:startCell [ paraStyle ]
...
rtf:endCell
Begins a new cell. Cells can contain inline text, or a series of paragraphs.
rtf:endCell is optional. It marks the end of the current cell.
A field in RTF is a hook for specifying program-specific commands to the word processor reading the RTF file. Fields contain two parts: a field instruction, which is the actual command; and an optional field result, which holds the results of processing the field. (The field result may be used to provide default text in case the application does not understand how to process the field instruction.)
There are two ways to insert fields with RATFINK:
rtf:insertField {instruction} [ "result text..." ]
Inserts a field instruction. instruction is any character data; backslashes and other special characters will be escaped before writing to the output. The second parameter is optional; if supplied it will be used as the text of the field result.
rtf:startField "field instruction"
# ... generate field result ...
rtf:endField
Like rtf:insertField, except the field result may contain arbitrary RTF text instead of a simple character string.
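For orientation, a field in raw RTF is a \field group containing a \fldinst destination and, optionally, a \fldrslt group. The sketch below builds that text directly in plain Tcl. It is not the rtflib.tcl code, and the fieldGroup name is hypothetical; in a real script, rtf:insertField does this work for you.

```tcl
# Sketch only: builds the raw RTF for a field as a string.
proc fieldGroup {instruction {result ""}} {
    # Escape backslashes and braces so they are read as text, not markup.
    set esc [list "\\" "\\\\" "\{" "\\\{" "\}" "\\\}"]
    set rtf "\{\\field\{\\*\\fldinst [string map $esc $instruction]\}"
    if {$result ne ""} {
        append rtf "\{\\fldrslt [string map $esc $result]\}"
    }
    append rtf "\}"
    return $rtf
}

puts [fieldGroup PAGE 1]
# prints: {\field{\*\fldinst PAGE}{\fldrslt 1}}
```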
NOTE -- Many field instructions have optional parameters that are specified with sequences beginning with a backslash. Note that these are not RTF control words (except for \fldalt, but that's too horrifying to get into...).
The available field instructions vary from application to application; check the documentation for the program in question.
The commands described in the previous section and defined in rtflib.tcl are all general-purpose Tcl utilities for creating RTF, and may be used independently of Cost or SGML. The file RTF.spec is a high-level Cost script to assist in converting SGML to RTF.
rtf:convert specname
To specify the processing for a particular DTD, define a Cost specification supplying RATFINK processing parameters for each element type, then call the rtf:convert command with the name of your specification. (This is in addition to defining a stylesheet and document formatting properties as described above.)
NOTE -- A Cost specification maps document nodes to parameters based on queries; see the Cost reference manual for full details.
For example:
specification rtfSpec {
    {element P} {
        rtf para
        paraStyle body
    }
    {elements "DFN EM"} {
        rtf phrase
        charStyle hp0
    }
    {elements "UL OL"} {
        rtf #IMPLIED
    }
    {element LI} {
        rtf para
        paraStyle litem
    }
    {element LI in UL} {
        prefix {$rtfSpecial(Bullet)$rtfSpecial(Tab)}
    }
    {element LI in OL} {
        prefix {[childNumber].$rtfSpecial(Tab)}
    }
    {element PRE} {
        rtf linespecific
        paraStyle verbatim
    }
    {element H1} {
        rtf para
        paraStyle heading1
    }
    {element H2} {
        rtf para
        paraStyle heading2
    }
    ... etc.
}
rtf:convert rtfSpec
TIP -- rtf:convert is reentrant and may be called recursively, possibly with a different specification, for complex processing.
There is one mandatory parameter for every element: rtf. This specifies one of the following ``architectural forms'':
The optional parameters startAction and endAction are valid for every element. They specify Tcl code to execute at the start and end of the element, respectively. The code is evaluated at global scope.
Tcl variable- and command- replacement is performed on the charStyle, paraStyle, and sectStyle parameters.
NOTE -- The RTF translation routine prints a warning if there is no rtf parameter specified for an element.
The following parameters may be specified for any element node, and may be used to specify automatically-generated text:
Tcl variable- and command- replacement is performed on these parameters with the subst command. The result is inserted directly into the output file and may contain RTF control words. You should use the rtf:Escape command if the value might contain data that looks like an RTF control instruction; for example,
{element IMG} { rtf #IMPLIED prefix {[rtf:Escape [q attval ALT]]} }
Paragraphs do not nest in RTF, but they do in many SGML applications. For example, it is often legal to include
For para-form elements, the optional continuedStyle parameter names a paragraph style for subsequent blocks of text that are part of a logical paragraph in the SGML document but are treated as separate paragraphs in RTF.
Normally, record-ends are converted to spaces. If rtf linespecific is specified for an element, then record-ends are processed as hard line breaks. The element is formatted as a single paragraph, and the paraStyle parameter also applies.
If rtf section is specified for an element, RATFINK starts a new section with rtf:startSection at the beginning of the element. (It does not call rtf:endSection at the end of the element, since in general sections may nest in an SGML document while they may not in RTF; keep this in mind.)
The startAction and endAction parameters are also evaluated for section-form elements. These parameters contain arbitrary Tcl code, evaluated at top-level (global) scope. (startAction may be used to generate headers and footers, for example.)
The processing of data entities and data entity references is controlled by the content parameter. This is evaluated as Tcl code at global scope.
{dataent withdcn EPS} {
    content {
        rtf:insertField "INCLUDEPICTURE \"[query sysid]\"" {Picture goes here...}
    }
}
The following meta-DTD describes the basic structure of the RATFINK output process:
<!-- Meta-DTD for RATFINK RTF conversion -->
<!doctype ratfink [
<!element ratfink - - (section+|para+)>
<!element section - O (headings?, para+)>
<!attlist section
    sectStyle CDATA #REQUIRED >
<!element para - O (#PCDATA|phrase)*>
<!attlist para
    paraStyle CDATA #REQUIRED >
<!element phrase - - (#PCDATA)>
<!attlist phrase
    charStyle CDATA #REQUIRED >
<!entity % headings "(fphead|fpfoot|head|foot|lhead|lfoot|rhead|rfoot)">
<!element headings O O (
    fphead? & fpfoot? &
    ((head? & foot?) | (lhead? & lfoot? & rhead? & rfoot?)) ) >
<!element %headings; - - (para+ | (#PCDATA|phrase)*)>
]>
Note that this DTD is not actually used by RATFINK; it is for descriptive purposes only.
Conceptually, the mapping from source document elements onto architectural forms is determined by the rtf parameter, which specifies the result element type; other parameters correspond to result attributes. The %headings; architectural forms do not correspond to source elements; they are instead generated by the application.
There is currently no way to set formatting properties for a section, paragraph, or phrase without defining a stylesheet entry.
Handling nested lists and other such things is more difficult than it ought to be.
RATFINK does not always output control words in the order prescribed by the RTF syntax productions (but neither does Microsoft Word, for what it's worth...)
There is no support for pictures, drawing objects, embedded objects, or other features.
NOTE -- I'd really like to support bitmapped images, but the RTF spec is extremely unhelpful on this point.
Does not handle context-sensitive style information very well. For example, if the DTD allows bulleted lists inside regular paragraphs and inside notes, and the desired formatting is to set regular paragraphs in a roman font and notes in a sans-serif font, then there must be distinct RTF paragraph styles for lists inside notes and lists inside regular paragraphs.
If a style overrides a parameter in its base style, the corresponding control word will be emitted more than once. Since the later setting takes precedence, this usually makes no difference, but it means that ``flag'' control words cannot be turned off if they are turned on in a base style, and that tab stops in a base paragraph style may not be cleared.
I've done my best to make sure that this library only generates legal RTF (as far as my understanding of the specification goes), but it is still possible in certain obscure circumstances for the output to crash Word.
RTF has control words to embed table of contents entries, and index entries; however, there are no control words to build a table of contents or index. Consequently, I haven't bothered to support these features in RATFINK.
NOTE -- With Word for Windows 95 Version 7 you can do these things with field instructions, so they may be useful after all.
RTF supports automatic numbering of lists and headings, but not very well. For example, if you include a paragraph inside a list item Word resets the counter for the next item; and if you have two numbered lists in a row with no intervening paragraphs there is no way to restart the list numbers at 1. Consequently, I haven't bothered to support these features either.
Many SGML applications assume an application convention whereby multiple spaces are equivalent to a single space. Many text formatting utilities (TeX, n/troff, Scribe, etc.) work this way, but RTF does not: all spaces are significant. RATFINK does not do anything to compress multiple spaces by default; however you can do some tricks with short reference maps to take care of this in the parser.
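A different and cruder option than short reference maps is to squeeze whitespace in Tcl before the text is written. The sketch below is an assumption about how one might do that, not a RATFINK feature; the squeezeSpaces name is hypothetical. It only sees one string at a time, so it cannot merge spaces that straddle two separate chunks of character data, and it should not be applied to line-specific content.

```tcl
# Sketch only: collapse each run of spaces, tabs, and record-ends
# into a single space before passing the text to rtf:text.
proc squeezeSpaces {text} {
    return [regsub -all {[ \t\r\n]+} $text " "]
}

puts [squeezeSpaces "too    many\n\tspaces"]
# prints: too many spaces
```

The example requires Tcl 8.5 or later, where regsub returns the result when no variable name is given.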
The output of RATFINK has been extensively tested with Microsoft Word for Windows 95 Version 7, and to some extent with Microsoft Word for Macintosh Version 5.1a. I have no idea how well it will work, if at all, with other word processors; chances are good that there will be differences in other applications' interpretations of the specification.
It takes a lot of work to get any sort of decent typography out of Word.
Information about Cost can be found at http://www.flightlab.com/cost/.
Another tool for converting SGML to RTF is JADE, James Clark's amazing DSSSL engine. See http://www.jclark.com/ for details. This program works under Win32 and most Unix variants.
The RTF format is defined in the Application Note GC0165, available on the Microsoft FTP server at ftp://ftp.microsoft.com/Softlib/mslfiles/gc0165.exe.
NOTE -- The 1.4 RTF spec also includes a sample RTF reader program, but it isn't very good; Paul Dubois' RTF tools are a better bet.
The Microsoft material is only supplied as a self-extracting DOS executable. If you don't have a DOS system available, you're not completely out of luck: the Info-ZIP project UNZIP utility runs on just about every system imaginable and is able to unpack this format. See ftp://ftp.uunet.net/pub/archiving/zip/ and elsewhere (ask Archie; it's widely mirrored). You'll still need a copy of Microsoft Word to read the RTF specification, though.
Information about WINHELP may be found in the Windows Help Authoring Toolkit at ftp://ftp.microsoft.com/Softlib/mslfiles/what6.exe, in the Usenet newsgroup comp.os.ms-windows.programmer.winhelp, and in its related FAQ.
WARNING -- Winhelp is not for the faint of heart or weak of stomach. RTF is pretty messed up, but Winhelp is a complete abomination.
To parse RTF files, check out Paul Dubois' excellent RTF tools at http://www.primate.wisc.edu/software/RTF/. See also rtftohtml at http://www.sunpack.com/RTF/.
OSF's ``Rainmaker'' software can convert RTF documents to the Rainbow SGML document type; see ftp://ftp.ebt.com/pub/nv/dtd/rainbow/
Many thanks to Boris Tobotras for testing, bugfixes, and enhancements.