|
NAME="GENERATOR"
CONTENT="Modular DocBook HTML Stylesheet Version 1.76b+
">Chapter 8. The Record ModelThe Zebra system is designed to support a wide range of data management applications. The system can be configured to handle virtually any kind of structured data. Each record in the system is associated with a record schema which lends context to the data elements of the record. Any number of record schemas can coexist in the system. Although it may be wise to use only a single schema within one database, the system poses no such restrictions. The record model described in this chapter applies to the fundamental, structured record type grs, introduced in the Section called Record Types in Chapter 5. Records pass through three different states during processing in the system.
Local RepresentationAs mentioned earlier, Zebra places few restrictions on the type of data that you can index and manage. Generally, whatever the form of the data, it is parsed by an input filter specific to that format, and turned into an internal structure that Zebra knows how to handle. This process takes place whenever the record is accessed - for indexing and retrieval. The RecordType parameter in the zebra.cfg file, or the -t option to the indexer tells Zebra how to process input records. Two basic types of processing are available - raw text and structured data. Raw text is just that, and it is selected by providing the argument text to Zebra. Structured records are all handled internally using the basic mechanisms described in the subsequent sections. Zebra can read structured records in many different formats. How this is done is governed by additional parameters after the "grs" keyword, separated by "." characters. Four basic subtypes to the grs type are currently available:
Canonical Input FormatAlthough input data can take any form, it is sometimes useful to describe the record processing capabilities of the system in terms of a single, canonical input format that gives access to the full spectrum of structure and flexibility in the system. In Zebra, this canonical format is an "SGML-like" syntax. To use the canonical format specify grs.sgml as the record type. Consider a record describing an information resource (such a record is sometimes known as a locator record). It might contain a field describing the distributor of the information resource, which might in turn be partitioned into various fields providing details about the distributor, like this:
The keywords surrounded by <...> are tags, while the sections of text in between are the data elements. A data element is characterized by its location in the tree that is made up by the nested elements. Each element is terminated by a closing tag - beginning with </, and containing the same symbolic tag-name as the corresponding opening tag. The general closing tag - </> - terminates the element started by the last opening tag. The structuring of elements is significant. The element Telephone, for instance, may be indexed and presented to the client differently, depending on whether it appears inside the Distributor element, or some other, structured data element such a Supplier element. Record RootThe first tag in a record describes the root node of the tree that makes up the total record. In the canonical input format, the root tag should contain the name of the schema that lends context to the elements of the record (see the Section called Internal Representation). The following is a GILS record that contains only a single element (strictly speaking, that makes it an illegal GILS record, since the GILS profile includes several mandatory elements - Zebra does not validate the contents of a record against the Z39.50 profile, however - it merely attempts to match up elements of a local representation with the given schema):
VariantsZebra allows you to provide individual data elements in a number of variant forms. Examples of variant forms are textual data elements which might appear in different languages, and images which may appear in different formats or layouts. The variant system in Zebra is essentially a representation of the variant mechanism of Z39.50-1995. The following is an example of a title element which occurs in two different languages.
The syntax of the variant element is <var class type value>. The available values for the class and type fields are given by the variant set that is associated with the current schema (see the Section called The Variant Set (.var) Files). Variant elements are terminated by the general end-tag </>, by the variant end-tag </var>, by the appearance of another variant tag with the same class and value settings, or by the appearance of another, normal tag. In other words, the end-tags for the variants used in the example above could have been omitted. Variant elements can be nested. The element
Associates two variant components to the variant list for the title element. Given the nesting rules described above, we could write
The title element above comes in two variants. Both have the IANA body type "text/plain", but one is in English, and the other in Danish. The client, using the element selection mechanism of Z39.50, can retrieve information about the available variant forms of data elements, or it can select specific variants based on the requirements of the end-user. Input FiltersIn order to handle general input formats, Zebra allows the operator to define filters which read individual records in their native format and produce an internal representation that the system can work with. Input filters are ASCII files, generally with the suffix .flt. The system looks for the files in the directories given in the profilePath setting in the zebra.cfg files. The record type for the filter is grs.regx.filter-filename (fundamental type grs, file read type regx, argument filter-filename). Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The expressions are evaluated against the contents of the input record, and the actions normally contribute to the generation of an internal representation of the record. An expression can be either of the following:
An action is surrounded by curly braces ({...}), and consists of a sequence of statements. Statements may be separated by newlines or semicolons (;). Within actions, the strings that matched the expressions immediately preceding the action can be referred to as $0, $1, $2, etc. The available statements are:
The following input filter reads a Usenet news file, producing a record in the WAIS schema. Note that the body of a news posting is separated from the list of headers by a blank line (or rather a sequence of two newline characters.
If Zebra is compiled with support for Tcl (Tool Command Language) enabled, the statements described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation mechanisms for modifying the elements of a record. Tcl is a popular scripting environment, with several tutorials available both online and in hardcopy. |