Introduction to XML
Daniel V. Pitti
Institute for Advanced Technology in the Humanities
University of Virginia
The World Wide Web Consortium (W3C) released EXtensible Markup Language (XML) in 1998. Despite what its name suggests, XML is not in itself a markup language. One cannot simply apply XML to a letter, a novel, an article, a software manual, or a finding aid. XML is, in fact, a markup language meta-standard, or in simpler words, a standard for constructing markup languages. XML provides a syntax and meta-language for naming and representing the logical relations of the textual components of documents.
One can think of XML as a set of formal rules for defining specific markup languages for individual kinds of documents. Using these formal rules, a community sharing a particular kind of document can collaboratively create a markup language specific to a shared type of document.
Specific markup languages written in compliance with formal XML requirements are called Document Type Definitions (DTDs). For example, the Association of American Publishers has developed a set of three DTDs: one for books, one for journals, and one for journal articles. A consortium of software developers and producers has developed a DTD for computer manuals called DocBook. The Library of Congress has developed a DTD for USMARC. The archival community has defined a DTD for encoding finding aids, Encoded Archival Description (EAD).
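To make the idea of a DTD concrete, the following is a minimal sketch of a markup language for letters, written here in Python so that it can be run. The element names (letter, date, salutation, para, signature) and the sample text are invented for illustration and do not come from any of the DTDs named above. Note that Python's standard-library parser is non-validating: it reads the declarations but checks only that the document is well-formed.

```python
import xml.etree.ElementTree as ET

# A hypothetical, deliberately tiny markup language for letters, declared as
# a DTD in the document's internal subset. All names are illustrative only.
LETTER = """<?xml version="1.0"?>
<!DOCTYPE letter [
  <!ELEMENT letter (date, salutation, para+, signature)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT salutation (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
  <!ELEMENT signature (#PCDATA)>
]>
<letter>
  <date>12 March 1998</date>
  <salutation>Dear Colleague,</salutation>
  <para>The finding aid is enclosed.</para>
  <signature>A. Archivist</signature>
</letter>"""

# The standard-library parser does not validate against the DTD; it only
# confirms that the document is well-formed and builds a tree from it.
root = ET.fromstring(LETTER)
child_names = [child.tag for child in root]
print(root.tag, child_names)
```

The declarations state which components a letter may contain and in what order; a validating parser would reject a letter that, for example, lacked a signature.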
DTDs shared and followed by a community are themselves standards. The Association of American Publishers DTD is registered as ANSI/NISO Z39.59-1988, and after substantial revision, has been approved as an international standard, ISO 12083. The Society of American Archivists has designated EAD a standard for the archival community in the United States.
XML is thus very general and abstract. It exists formally over and above individual markup languages for specific document classes. It is also a standard, which is to say, a formal set of conventions in the public domain. XML is not owned by and thus not dependent on any hardware or software producer. That XML is a standard offers its users reasonable assurance that the information represented using it will not become obsolete because of hardware and software changes.
The formality of XML has very important implications. Because XML syntax and rules are formal and precise, it is possible to write software that can be relatively easily adjusted to work with any compliant DTD. Typically, XML software products have features that allow users to adapt their functionality to specific DTDs. As a result, the market driving XML software development is in principle everyone with a document. This is very different from community-specific encoding schemes such as Machine-Readable Cataloging (MARC) that are restricted to a relatively small number of users.
The XML market thus includes virtually everyone. Many government agencies are using XML, including the Department of Defense, Department of Energy, Internal Revenue Service, and Library of Congress. A wide variety of industries are also employing XML, including software producers; airline, automobile, and tractor manufacturers; print and electronic publishers; and pharmaceutical and medical companies. The academic community also has a number of important initiatives underway, among them the Text Encoding Initiative, an international project to provide encoding standards to support linguistic and literary research. Most recently, there are a number of initiatives for developing DTDs to facilitate electronic commerce on the Internet and a wide variety of scientific and academic research. In fact, while XML arose in the document community, it is now widely used in a variety of database applications, for representing a wide variety of non-textual data, for machine-to-machine communication, and for many other applications.
In order to understand why XML has generated such broad interest from both users and developers, it is useful to consider the nature of markup and what kind of markup XML promotes. In an article now considered by many to be a classic presentation of document markup theory, James Coombs, Allen Renear, and Steven DeRose distinguished six kinds of markup, three of which I would like to discuss briefly: procedural, descriptive, and referential.
Procedural Markup. In the last few years, through the use of word processing systems, we have become familiar with procedural markup. Procedural markup consists of processing instructions to the computer. It tells the computer what to do with specified components of the text. For example, the title of a major section might have instructions that tell the printer to center the text, use a font of a certain size, and perhaps print it in bold italics. Most procedural markup is characterized by being paper-directed, that is, it tells the printer how to put the text on paper. If you want to do anything else with the text, the markup is not of much help. If you want to search for the initials "XML" in the machine-readable version of a book, but only where they occur in a chapter or section title, the procedural markup provides no assistance. Nor does it help if you want to display the text on a computer screen, since paper presentation and monitor presentation are quite different. Finally, procedural markup is generally characterized by a further limitation: until recently, all procedural markup has been proprietary. This means, for example, that documents created in WordPerfect® cannot be processed flawlessly in Microsoft® Word and vice versa. Each word processing software package uses its own markup. In this environment, the future of the document is tied to the future of the software.
Descriptive Markup. A second type of markup mentioned by Coombs, Renear, and DeRose is descriptive or declarative markup. With descriptive markup, we arrive at the form of markup recommended by XML. Descriptive markup identifies the logical components of documents. While procedural markup specifies a particular procedure to be applied to a document component, descriptive markup indicates what the component is. Examples are chapter, chapter title, section, paragraph, author, publisher, and cataloging-in-publication data. None of these gives any indication of what procedures are to be applied to these components. But if you know the document's components, then you can have processors do whatever you want with them. Descriptive markup liberates the document for multiple uses. It is possible, for example, to use one and the same source document to produce printed, electronic, Braille, and voice-synthesized versions, and, for good measure, to produce HTML and flat ASCII. EXtensible Stylesheet Language Transformations (XSLT) is an XML-based initiative for providing a standard language for transforming XML-encoded documents. The fact that descriptive markup can be used in so many different ways is one of its important characteristics. It escapes the single-use trap of procedural markup.
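The multiple-use point can be illustrated with a small sketch. The chapter, title, and para element names and the two output functions below are assumptions made for the example, and the code uses Python's standard library in place of XSLT simply to stay self-contained. One descriptively encoded source yields both an HTML version and a flat ASCII version, because each processor decides for itself what to do with a title or a paragraph.

```python
import xml.etree.ElementTree as ET
from html import escape

# One descriptively marked-up source; the tags say what each part is,
# not how it should look. Element names are hypothetical.
SOURCE = """<chapter>
  <title>Introduction to XML</title>
  <para>XML is a standard for constructing markup languages.</para>
  <para>Descriptive markup says what a component is.</para>
</chapter>"""


def to_html(chapter):
    """Render the chapter for a Web browser."""
    parts = ["<h1>%s</h1>" % escape(chapter.findtext("title"))]
    parts += ["<p>%s</p>" % escape(p.text) for p in chapter.findall("para")]
    return "\n".join(parts)


def to_plain_text(chapter):
    """Render the same chapter as flat text with an underlined title."""
    title = chapter.findtext("title")
    lines = [title.upper(), "=" * len(title)]
    lines += [p.text for p in chapter.findall("para")]
    return "\n".join(lines)


chapter = ET.fromstring(SOURCE)
html_version = to_html(chapter)
text_version = to_plain_text(chapter)
print(html_version)
print(text_version)
```

Neither output function required any change to the source document, which is the sense in which descriptive markup avoids tying the text to a single procedure.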
It is useful to distinguish two kinds of descriptive markup: structural and nominal. Descriptive structural markup identifies document components and their logical relationship. Structural elements are components that you usually want to present visually in some distinct manner. Examples are chapter titles, paragraphs, block quotes, and the like. Descriptive nominal markup, as you might expect, identifies named entities, both concrete and abstract. Examples are corporate names, personal names, topical subjects, genres, and geographic names. While you may want to visually present these names online or on paper in some particular manner, you usually want to index them in particular ways, to use them to provide access to the source or subject matter of the document.
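A brief sketch may help show how the two kinds of descriptive markup are used differently. It assumes a hypothetical tag set in which title is a structural element and persname is a nominal one; the sample sentences merely restate points made elsewhere in this introduction. The first function restricts a search to chapter titles, and the second builds a simple index of personal names.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding: <title> is structural, <persname> is nominal.
SOURCE = """<book>
  <chapter>
    <title>XML and SGML</title>
    <para><persname>Jon Bosak</persname> identified three areas in which
    HTML is wanting.</para>
    <para><persname>James Clark</persname> developed the SP suite.</para>
  </chapter>
  <chapter>
    <title>Markup Theory</title>
    <para>XML is mentioned here only in passing, as is
    <persname>James Clark</persname>.</para>
  </chapter>
</book>"""


def titles_mentioning(book, term):
    """Return chapter titles that contain term, ignoring body text."""
    return [t.text for t in book.iter("title") if term in t.text]


def name_index(book):
    """Map each tagged personal name to the chapter titles where it occurs."""
    index = {}
    for chapter in book.iter("chapter"):
        title = chapter.findtext("title")
        for name in chapter.iter("persname"):
            titles = index.setdefault(name.text, [])
            if title not in titles:
                titles.append(title)
    return index


book = ET.fromstring(SOURCE)
print(titles_mentioning(book, "XML"))
print(name_index(book))
```

The second chapter mentions XML in its text but not in its title, so the title search correctly passes over it; this is the kind of query that procedural markup cannot support.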
Referential Markup. As its name suggests, referential markup refers to information that is not present. It is markup in the third person, so to speak. There are different kinds of referential markup and different ways one might use it, but I would like to focus on the kind that enables something about which most of you have heard, and perhaps with which many of you have some experience, namely, hypertext and hypermedia. In addition to supporting text, XML also makes provision for using text to refer to other text, and to refer to other kinds of digital information derived from the full array of native formats: photographs (color as well as black and white); sound motion pictures; drawings; paintings; audio recordings; three-dimensional objects of all kinds, shapes, and sizes; maps; manuscripts; typescripts; printed pages; mathematical data; financial data; diagrams; musical notation; choreographic notation; and anything else open to being digitally captured and rendered in some useful form. It is possible not only to refer to or point at this other digital information from within XML-based documents, but also to control the notation information needed to launch the devices necessary for rendering the various objects into humanly intelligible forms. It is thus possible to use electronic text to control and manage extra-XML information objects of all kinds, as well as to provide access to and navigation through them. XML Linking Language (XLink) is an XML-based initiative to standardize hypertext linking.
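As a rough illustration of referential markup, the sketch below encodes two simple XLink links and lists the resources they point to. The findingaid and ref element names and the two file paths are hypothetical; the xlink:type and xlink:href attributes and the namespace URI are those defined by XLink.

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the XLink specification.
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical fragment: one link to another text, one to an image.
SOURCE = """<findingaid xmlns:xlink="http://www.w3.org/1999/xlink">
  <para>See the <ref xlink:type="simple"
  xlink:href="letters/letter-001.xml">first letter</ref> and the
  <ref xlink:type="simple" xlink:href="images/portrait.jpg">portrait</ref>.</para>
</findingaid>"""

root = ET.fromstring(SOURCE)

# ElementTree spells namespaced attribute names as "{namespace-uri}local".
targets = [
    el.get("{%s}href" % XLINK)
    for el in root.iter("ref")
    if el.get("{%s}type" % XLINK) == "simple"
]
print(targets)
```

The text itself contains neither the letter nor the portrait; the markup records only where they are to be found, and it is left to the rendering software to retrieve and display them.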
SGML, HTML and the Emergence of XML
XML is based on Standard Generalized Markup Language (SGML). The International Organization for Standardization (ISO) published SGML as a standard in 1986 (ISO 8879). Many government, industry, and academic agencies and institutions enthusiastically embraced SGML, though it did not enjoy widespread success because of its complexity. While SGML largely adhered to the fundamental rules of computer grammars, the rules that provide the underpinnings of computer programming languages and programs, it did not fully adhere. Because of this failure to comply fully, programmers found it difficult to work with SGML.
By the mid-1990s, HyperText Markup Language (HTML) had become an SGML-compliant DTD that was enjoying enormous success as the encoding standard underpinning the World Wide Web. As a specific application of SGML, the HTML DTD limits itself to simple procedural encoding dedicated to online display and hypermedia linking. HTML hardwires a small set of procedurally oriented tags. Constraining the set of elements made it feasible to build applications that made life relatively easy for authors and Web publishers. Because of HTML’s relative ease of use and its support of online display of documents, it enjoyed wide success. In fact, it is difficult to imagine the emergence of the Internet as the important economic, social, and cultural phenomenon it has become without HTML.
The small, closed tag set, however, came at a price: HTML has extremely limited functionality. Jon Bosak has identified three areas in which HTML is wanting: extensibility, structure, and validation. SGML is strong in all of these areas, but its strength, like HTML's weakness, comes at a price: SGML is complicated for both application developers and the users of the applications.
Many users of the Internet wanted the ability to support much more than display. They wanted to support sharing of information across computers and among people, and to support sophisticated indexing, navigation, and processing of information. HTML was an inadequate foundation for these complex tasks. While SGML provided the necessary semantic and structural foundation, it was too difficult for many developers to write programs to perform the tasks. The W3C's XML Working Group decided that what was needed was a simplified version of SGML, a version that would be easier to process and use. The result of the XML Working Group's work is EXtensible Markup Language, released as a W3C recommendation in 1998.
Since the release of XML in 1998, many other standards in support of the use of XML have been developed or are under development, in particular, EXtensible Stylesheet Language Transformations (XSLT) and XML Linking Language (XLink). XML seems well on its way to attracting many new communities of users, and with them, many developers of XML software.
For additional information about SGML, see http://www.oasis-open.org/cover/
XML (and SGML) Software Tools
The best source of information on XML tools is: http://www.xmlsoftware.com/. Many of the tools work with both XML and SGML.
Parsers: Validation of DTDs and documents.
Key to the use of XML is a parser or validator. Essentially, parsers are aware of the formal requirements of the XML meta-language and syntax, and they use this awareness to do three very important things.
First, a parser can read the DTD itself, and make sure that it formally adheres to the standard. It reads all of the element, attribute, and entity declarations to make sure that they are compliant with the specifications in the standard. If naming conventions and syntax are used incorrectly, the parser will report errors or warnings.
Second, once a parser has read the DTD and finds it valid, it can read an encoded document and validate that all of the encoding meets the specifications in the DTD. Note that a parser will not check the validity of the textual content, but only the XML tags, attributes, and other markup.
Finally, the parser outputs the document in a form that other SGML software can use.
All SGML- and XML-compliant software uses recognized parsers! Software that does not is not compliant!
There are free SGML parsers available for download by FTP. See: http://www.oasis-open.org/cover/publicSW.html#parsers
The best available parser is NSGMLS. NSGMLS is part of a suite of related tools developed by James Clark called SP: http://www.jclark.com/sp/index.htm. All of Clark's SP suite of software is now open source. It is available at http://openjade.sourceforge.net/. Note that OpenSP is distributed as part of OpenJade.
There are many different parsers available for XML. There are two kinds of parsers: validating and non-validating parsers. Non-validating parsers check to make sure XML documents are well-formed, but do not check to make sure documents adhere to a DTD. Validating parsers, like SGML parsers, make sure the DTD is valid and that the document adheres to the rules in it.
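The behavior of a non-validating parser can be sketched with the parser in Python's standard library, which checks well-formedness only. The is_well_formed function and the two sample documents are illustrative. Checking a document against a DTD would require a validating parser, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; no DTD validation is attempted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags: well-formed.
GOOD = "<note><to>Reader</to><body>Tags are properly nested.</body></note>"

# The <to> element is closed by </body>: not well-formed.
BAD = "<note><to>Reader</body></to></note>"

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

A well-formed document may still be invalid: the first sample would pass this check even if its DTD required elements that it lacks.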
ElCel has a free, good standalone XML parser that is available for Linux, Solaris, and Windows. It is available at http://www.elcel.com/index.html.
Converters and Transformers: moving into and out of SGML and XML.
The most challenging task is converting print text into machine-readable, encoded form. While this can be done on-site, the preferred method is generally to contract with a text conversion vendor.
For existing machine-readable texts, either in word processing formats or database formats, there are several possible machine-assisted methods.
Custom scripts written in Perl, Python, and other scripting languages are frequently used to convert texts. It is also possible to use macro programs in word processing programs such as WordPerfect to mark up text based on formatting clues. Most database products have export features that will assist in the creation of encoded texts.
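A conversion script of this kind might look something like the following sketch. It assumes a very simple formatting convention, chosen only for the example: blocks of text are separated by blank lines, and a block set entirely in capitals is a section title. The document, section, title, and para element names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample input relying on formatting clues alone (an assumed convention).
PLAIN = """INTRODUCTION

XML is a standard for
constructing markup languages.

PARSERS

A parser checks the markup."""


def text_to_xml(text):
    """Turn blank-line-separated blocks into section, title, and para elements."""
    root = ET.Element("document")
    section = None
    for block in text.split("\n\n"):
        block = " ".join(block.split())  # join wrapped lines, trim whitespace
        if not block:
            continue
        if block.isupper():
            # An all-capitals block is taken to open a new section.
            section = ET.SubElement(root, "section")
            ET.SubElement(section, "title").text = block.title()
        else:
            parent = section if section is not None else root
            ET.SubElement(parent, "para").text = block
    return root


doc = text_to_xml(PLAIN)
xml_output = ET.tostring(doc, encoding="unicode")
print(xml_output)
```

Real texts are rarely this regular, so scripts of this sort usually get the encoding most of the way there and leave the remainder to be corrected by hand in an editor.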
There are also several applications devoted to converting and transforming documents from various non-SGML or XML formats into SGML and XML. See http://www.xmlsoftware.com/convert.html.
James Clark's SX is widely used for SGML to XML conversions. It is included in the distribution of SP (see above).
Authoring and Editing: writing and maintaining SGML and XML documents.
SGML and XML authoring and editing tools are used for writing and maintaining documents. While many SGML and XML authoring and editing tools behave very much like word processing programs, with WYSIWYG interfaces, spell checkers, and the like, they also have special features to facilitate creating and maintaining valid documents.
The best SGML and XML authoring and editing tools have real-time parsers, which is to say, parsers that compel the author to use only DTD-compliant elements. In addition, good authoring tools provide mechanisms for automatically adding the tags using function keys or point-and-click menus.
There are free as well as commercial editing tools. See http://www.xmlsoftware.com/editors.html for a current list of available software.
Of the commercial packages, SoftQuad’s XMetaL is the most reasonably priced product of good quality. See xmetal.com.
Browsers and Electronic Publishing: rendering and indexing SGML- and XML-based documents.
While it is possible to publish SGML on the Web, most sites now choose to publish XML, even if they are maintaining documents in SGML.
There are a number of different methods that can be used for publishing XML on the Web. Internet Explorer 6.x offers support for direct rendering of XML using either XSLT or Cascading Style Sheets (or a combination of the two using XSLT). XML can also be converted to HTML prior to being published as HTML, or published using server-side applications that convert XML to HTML on the fly. Saxon is a popular tool for the latter two approaches: http://saxon.sourceforge.net/. A variety of other tools also are available: http://www.xmlsoftware.com/xslt.html.
XML indexing software continues to be expensive and the choices limited, though both the prices and the choices have improved since the release of XML in 1998.
TextML is relatively inexpensive, and is available in a "lite" version that supports indexing up to 1000 documents. It is only available on Windows NT and 2000: http://www.ixiasoft.com/products/textmlserver/index.asp.
Tamino (Software AG) is a much more expensive product: http://www.softwareag.com/tamino/
Finally, Xpat is a very fast search engine. It is based on OpenText's Pat, version 5.0. The Digital Library EXtension Service (University of Michigan) purchased OpenText's source code in 1999 and continues to develop it; in particular, they are promising to add support for Unicode: http://www.dlxs.org/
Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. "Markup Systems and the Future of Scholarly Text Processing." Communications of the Association for Computing Machinery 30 (11): 933-947.
Bosak, Jon. "XML, Java, and the Future of the Web." http://sunsite.unc.edu/pub/sun_info/standards/xml/why/xmlapps.htm