Introduction to XML
by
Daniel V. Pitti
Institute for Advanced Technology in the Humanities
University of Virginia
2002
Introduction
The World Wide Web Consortium (W3C) released Extensible Markup Language (XML) in 1998.
Despite what its name suggests, XML is not in itself a markup language.
One cannot simply apply XML to a letter, a novel, an
article, a software manual, or a finding aid.
XML, in fact, is a markup language meta-standard, or in simpler
words, a standard for constructing markup languages. XML provides a
syntax and meta-language for naming and representing the logical relations of
the textual components of documents.
One can think of XML as a
set of formal rules for defining specific markup languages for individual kinds
of documents. Using these formal rules, a community sharing a particular kind
of document can collaboratively create a markup language specific to a shared
type of document.
Specific markup languages
written in compliance with formal XML requirements are called Document Type
Definitions (DTDs). For example, the
Association of American Publishers has developed a set of three DTDs: one for
books, one for journals, and one for journal articles. A consortium of software
developers and producers has developed a DTD for computer manuals called
DocBook. The Library of Congress has developed a DTD for USMARC. The archival
community has defined a DTD for encoding finding aids, Encoded Archival
Description (EAD).
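As a rough illustration of what such a definition looks like, the sketch below carries a very small, invented DTD for a memo inside a document and uses Python's standard expat bindings to list the element types it declares. The memo vocabulary and the function name are hypothetical; they are not drawn from any of the DTDs named above.

```python
import xml.parsers.expat

# A hypothetical three-element DTD, carried in the document's internal subset.
MEMO = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
]>
<memo>
  <title>Staff meeting</title>
  <para>The meeting is moved to Friday.</para>
</memo>"""


def declared_elements(document):
    """Return the element type names declared in the document's DTD subset."""
    names = []

    def on_element_decl(name, model):
        # expat calls this once for each <!ELEMENT ...> declaration it reads.
        names.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(document, True)
    return names


print(declared_elements(MEMO))
```

Each declaration names a component and states what it may contain: here a memo must hold one title followed by one or more paragraphs.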
DTDs shared and followed by
a community are themselves standards. The Association of American Publishers
DTD is registered as ANSI/NISO Z39.59-1988 and, after substantial revision, has
been approved as an international standard, ISO 12083. The Society of American
Archivists has designated EAD a standard for the archival community in the
United States.
XML is thus very general and
abstract. It exists formally over and above individual markup languages for
specific document classes. It is also a standard, which is to say, a formal set
of conventions in the public domain. XML is not owned by and thus not dependent
on any hardware or software producer. That XML is a standard offers its users
reasonable assurance that the information represented using it will not become
obsolete because of hardware and software changes.
The formality of XML has
very important implications. Because XML syntax and rules are formal and
precise, it is possible to write software that can be relatively easily
adjusted to work with any compliant DTD. Typically, XML software products have
features that allow users to adapt their functionality to specific DTDs. As a
result, the market driving XML software development is in principle everyone
with a document. This is very different from community-specific encoding
schemes such as Machine-Readable Cataloging (MARC) that are restricted to a
relatively small number of users.
The XML market thus includes
virtually everyone. Many government agencies are using XML, including the
Department of Defense, Department of Energy, Internal Revenue Service, and
Library of Congress. A wide variety of industries are also employing XML,
including software producers; airline, automobile, and tractor manufacturers;
print and electronic publishers; and pharmaceutical and medical companies. The
academic community also has a number of important initiatives underway,
including, among others, the Text Encoding Initiative, an international
project to provide encoding standards to support linguistic and literary research.
Most recently, there are a number of initiatives for developing DTDs to
facilitate electronic commerce on the Internet and a wide variety of scientific
and academic research. In fact, while XML arose in the document community, it
is now widely used in a variety of database applications, for representing a
wide variety of non-textual data, for machine-to-machine communication, and
many other applications.[1]
In order to understand why
XML has generated such broad interest from both users and developers, it is
useful to consider the nature of markup and what kind of markup XML promotes.
In an article now considered by many to be a classic presentation of document
markup theory, James Coombs, Allen Renear, and Steven DeRose distinguished six
kinds of markup, three of which I would like to discuss briefly: procedural,
descriptive, and referential[2].
Procedural Markup. In the last few years, through the use of word
processing systems, we have become familiar with procedural markup. Procedural
markup consists of processing instructions to the computer. It tells the
computer what to do with specified components of the text. For example, the
title of a major section might have instructions that tell the printer to
center the text, use a font of a certain size, and perhaps print it in bold
italics. Most procedural markup is characterized by being paper directed, that
is, it tells the printer how to put the text on paper. If you want to do
anything else with the text, the markup is not of much help. If you want to
search for the initials "XML" in the machine-readable version of a
book, but only where it occurs as a chapter or section title, the procedural
markup provides no assistance. Nor does it help if you want to display the text
on a computer screen, since paper presentation and monitor presentation are
quite different. Finally, procedural markup is generally characterized by a
further limitation: until recently, all procedural markup has been proprietary.
This means, for example, that documents created in WordPerfect® cannot be
processed flawlessly in Microsoft® Word and vice versa. Each word processing
software package uses its own markup. In this environment, the future of the
document is tied to the future of the software.
Descriptive Markup. A second type of markup mentioned by Coombs,
Renear, and DeRose is descriptive or declarative markup. With descriptive
markup, we arrive at the form of markup recommended by XML. Descriptive markup
identifies the logical components of documents. While procedural markup
specifies a particular procedure to be applied to a document component,
descriptive markup indicates what the component is. Examples are
chapter, chapter title, section, paragraph, author, publisher, and
cataloging-in-publication data. None of these gives any indication of what
procedures are to be applied to these components. But if you know the document's
components, then you can have processors do whatever you want with them.
Descriptive markup liberates the document for multiple uses. It is possible,
for example, to use one and the same source document to produce printed,
electronic, Braille, and voice-synthesized versions, and, for good measure, to
produce HTML and flat ASCII. Extensible Stylesheet Language Transformations
(XSLT) is an XML-based initiative for providing a standard language for
transforming XML-encoded documents.[3]
The fact that descriptive markup can be used in so many different ways is one
of its important characteristics. It escapes the single use trap of procedural
markup.
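A minimal sketch of this idea, assuming an invented book vocabulary (book, chapter, title, para) and using Python's standard ElementTree module rather than XSLT: the same descriptively marked-up source is rendered as HTML and as flat text, and is also searched for a word only where it occurs in a chapter title. The function names are hypothetical.

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: the tags say what each component is,
# not how it should look.
SOURCE = """<book>
  <chapter>
    <title>Why XML Matters</title>
    <para>Markup can describe structure.</para>
  </chapter>
  <chapter>
    <title>Procedural Markup</title>
    <para>Some people confuse XML with HTML.</para>
  </chapter>
</book>"""


def to_html(book):
    """Render chapter titles as headings and paragraphs as <p> elements."""
    parts = []
    for chapter in book.iter("chapter"):
        parts.append("<h1>%s</h1>" % escape(chapter.findtext("title")))
        for para in chapter.iter("para"):
            parts.append("<p>%s</p>" % escape(para.text))
    return "\n".join(parts)


def to_plain_text(book):
    """Render the same source as flat text with upper-case titles."""
    lines = []
    for chapter in book.iter("chapter"):
        lines.append(chapter.findtext("title").upper())
        for para in chapter.iter("para"):
            lines.append(para.text)
    return "\n".join(lines)


def titles_mentioning(book, word):
    """Find a word, but only where it occurs in a chapter title."""
    return [t.text for t in book.iter("title") if word in t.text]


book = ET.fromstring(SOURCE)
print(to_html(book))
print(to_plain_text(book))
print(titles_mentioning(book, "XML"))
```

The second chapter mentions XML in a paragraph but not in its title, so the title search passes over it. That distinction is not available when the markup records only fonts and centering.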
It is useful to distinguish
two kinds of descriptive markup: structural and nominal. Descriptive structural
markup identifies document components and their logical relationship.
Structural elements are components that you usually want to present visually in
some distinct manner. Examples are chapter titles, paragraphs, block quotes,
and the like. Descriptive nominal markup, as you might expect, identifies named
entities, both concrete and abstract. Examples are corporate names, personal
names, topical subjects, genres, and geographic names. While you may want to
visually present these names online or on paper in some particular manner, you
usually want to index them in particular ways, to use them to provide access to
the source or subject matter of the document.
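The difference can be sketched in a few lines of Python. The example assumes an invented letter in which para is a structural element and persname and geogname are nominal elements; the names, places, and the build_index helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: <para> is structural markup, while <persname>
# and <geogname> are nominal markup. All names and places are invented.
SOURCE = """<letter>
  <para>Written by <persname>Mary Smith</persname> at
  <geogname>Richmond</geogname>.</para>
  <para>Addressed to <persname>John Brown</persname> in
  <geogname>Boston</geogname>.</para>
</letter>"""


def build_index(root, nominal_tags):
    """Group the text of each nominal element under its tag name."""
    index = {tag: [] for tag in nominal_tags}
    for element in root.iter():
        if element.tag in index:
            index[element.tag].append(element.text)
    # Sort the entries so each list reads like an index.
    return {tag: sorted(names) for tag, names in index.items()}


index = build_index(ET.fromstring(SOURCE), ["persname", "geogname"])
print(index)
```

The structural elements would drive display, while the nominal elements feed the index.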
Referential Markup. As its name suggests, referential markup refers to
information that is not present. It is markup in the third person, so to speak.
There are different kinds and ways that one might use referential markup, but I
would like to focus on the kind of referential markup that enables something
about which most of you have heard, and perhaps with which many of you have
some experience, namely, hypertext and hypermedia. In addition to supporting
text, XML also has provisions for using text to refer to other text, and
to refer to other kinds of digital information derived from the full array of
native formats: photographs (color as well as black and white); sound motion
pictures; drawings; paintings; audio recordings; three dimensional objects of
all kinds, shapes, and sizes; maps; manuscripts; typescripts; printed pages;
mathematical data; financial data; diagrams; musical notation; choreographic
notation; and anything else open to being digitally captured and rendered in
some useful form. It is possible not only to refer to or point at this other
digital information from within XML based documents, but also to control the
notation information needed to launch the devices necessary for rendering the
various objects into humanly intelligible forms. It is thus possible to use
electronic text to control and manage extra-XML information objects of all
kinds, as well as to provide access to and navigation through them. XML Linking
Language (XLink) is an XML-based initiative to standardize hypertext linking.[4]
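As a small sketch of referential markup, the following Python fragment reads an invented document whose ref elements carry XLink simple-link attributes and lists what each one points to. The guide and ref element names and the file names are hypothetical; the namespace name is the one defined by the XLink specification.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"  # namespace name defined by XLink

# Hypothetical document: the text refers to an image and a sound file
# that are stored outside the XML itself.
SOURCE = """<guide xmlns:xlink="http://www.w3.org/1999/xlink">
  <para>See the <ref xlink:type="simple" xlink:href="map.jpg">map</ref>
  and the <ref xlink:type="simple" xlink:href="interview.wav">recording</ref>.</para>
</guide>"""


def simple_links(root):
    """Return (label, target) pairs for every element carrying xlink:href."""
    href = "{%s}href" % XLINK
    return [(el.text, el.get(href)) for el in root.iter() if el.get(href)]


links = simple_links(ET.fromstring(SOURCE))
print(links)
```

A rendering application would use each target, together with what it knows about the target's format, to decide how to display or play the object.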
SGML, HTML, and the Emergence of XML
XML is based on Standard
Generalized Markup Language (SGML). The International Organization for Standardization
(ISO) published SGML as a standard in 1986 (ISO 8879). Many government,
industry, and academic agencies and institutions enthusiastically embraced
SGML, though it did not enjoy widespread success because of its complexity.
While SGML largely adhered to fundamental rules of computer grammars, the rules
that provide the underpinnings of computer programming languages and programs,
it did not fully adhere. Because of the failure to fully comply, programmers
found it difficult to work with SGML.
By the mid-1990s, HyperText
Markup Language (HTML) had become an SGML compliant DTD that was enjoying
enormous success as the encoding standard underpinning the World Wide Web.[5]
As a specific application of SGML, the HTML DTD limits itself to simple procedural
encoding dedicated to online display and hypermedia linking. HTML
hardwires a small set of procedurally oriented tags. Constraining the set of
elements made it feasible to build applications that made life relatively easy
for authors and Web publishers. Because of HTML’s relative ease of use and its
support of online display of documents, it enjoyed wide success. In fact, it is
difficult to imagine the emergence of the Internet as the important economic,
social, and cultural phenomenon it has become without HTML.
The small, closed tag set,
however, came at a price: HTML has extremely limited functionality. Jon Bosak
has identified three areas in which HTML is wanting: extensibility, structure,
and validation.[6]
SGML is strong in all of these areas, but its strength, like HTML's weakness,
comes at a price: SGML is complicated for both application developers and the
users of the applications.
Many users of the Internet
wanted the ability to support much more than display. They wanted to support
sharing of information across computers and among people, and to support
sophisticated indexing, navigation, and processing of information. HTML was
an inadequate foundation for these complex tasks. While SGML
provided the necessary semantic and structural foundation, it was too difficult
for many developers to write programs to perform the tasks. The W3C's XML Working Group decided that
what was needed was a simplified version of SGML, a version that would be
easier to process and use. The result of the XML Working Group's work is Extensible
Markup Language, released as a W3C Recommendation in 1998.
Since the release of XML in
1998, many other standards in support of the use of XML have been developed or
are under development, in particular, Extensible Stylesheet Language
Transformations (XSLT) and XML Linking Language (XLink).[7]
XML seems well on its way to attracting many new communities of users, and with
them, many developers of XML software.
For additional information
about SGML, see http://www.oasis-open.org/cover/
XML (and SGML) Software Tools
The best source of information
on XML tools is: http://www.xmlsoftware.com/.
Many of the tools work with both XML and SGML.
Parsers: Validation of DTDs and documents.
Key
to the use of XML is a parser or validator. Essentially, parsers are aware of
the formal requirements of the XML meta-language and syntax, and they use this
awareness to do three very important things.
First,
a parser can read the DTD itself, and make sure that it formally adheres to the
standard. It reads all of the element, attribute, and entity declarations to
make sure that they are compliant with the specifications in the standard. If
naming conventions and syntax are used incorrectly, the parser will report
errors or warnings.
Second,
once a parser has read the DTD and finds it valid, it can read an encoded
document and validate that all of the encoding meets the specifications in the
DTD. Note that a parser will not check the validity of the textual content, but
only the XML tags, attributes, and other markup.
Finally, the parser outputs the document in a form that other SGML or XML software can use.
All SGML and XML compliant software uses recognized parsers! If it does not, it
is not compliant!
There are free SGML parsers available for download. See:
http://www.oasis-open.org/cover/publicSW.html#parsers
The
best available parser is NSGMLS. NSGMLS is part of a suite of related
tools developed by James Clark called SP: http://www.jclark.com/sp/index.htm.
All of Clark's SP suite of software is now open source. It is available at http://openjade.sourceforge.net/
Note that OpenSP is distributed as part of OpenJade.
There
are many different parsers available for XML. There are two kinds of parsers:
validating and non-validating parsers. Non-validating parsers check to make
sure XML documents are well-formed, but do not check to make sure documents
adhere to a DTD. Validating parsers, like SGML parsers, make sure the DTD is
valid and that the document adheres to the rules in it.
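The distinction can be seen with the parser in Python's standard library, which is non-validating: it reports well-formedness errors but does not check a document against a DTD. The sketch below is illustrative only; the sample strings and the is_well_formed helper are invented.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


GOOD = "<memo><title>Hi</title></memo>"
# The end tags overlap, so this is not well-formed XML.
OVERLAPPING = "<memo><title>Hi</memo></title>"
# Well-formed, but it would fail validation against a DTD that requires
# a <title> inside <memo> and declares no <banana> element.
UNDECLARED = "<memo><banana/></memo>"

for sample in (GOOD, OVERLAPPING, UNDECLARED):
    print(sample, is_well_formed(sample))
```

The third sample passes here even though a validating parser working from a memo DTD would reject it. That gap is exactly what validation closes.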
ElCel
has a free, good standalone XML parser that is available for Linux, Solaris,
and Windows. It is available at http://www.elcel.com/index.html.
Converters and Transformers: moving into and out of SGML and XML.
The most challenging task is converting print text into machine-readable, encoded
form. While this can be done on-site, the preferred method is generally to contract
with a text conversion vendor.
For
existing machine-readable texts, either in word processing formats or database formats,
there are several possible machine-assisted methods.
Custom-created scripts using Perl, Python, and other scripting languages are
frequently used to convert texts. It is also possible to use macro programs in
word processing programs such as WordPerfect to mark up text based on
formatting clues. Most database products have export features that will assist
in the creation of encoded texts.
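A conversion script of this kind might look like the following sketch, which assumes a database export in comma-separated form and wraps each row in XML elements named after the column headings. The sample records, the records and record element names, and the records_to_xml function are all invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical database export; in practice this would be read from a file.
EXPORT = (
    "title,author,year\n"
    "Collected Letters,Mary Smith,1901\n"
    "Field Notes,John Brown,1923\n"
)


def records_to_xml(csv_text):
    """Wrap each exported row in a <record> with one child per column."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            # Assumes every column heading is a legal XML element name.
            ET.SubElement(record, field).text = value
    return root


root = records_to_xml(EXPORT)
print(ET.tostring(root, encoding="unicode"))
```

A real conversion would also have to map the column names onto the elements of the target DTD and validate the result with a parser.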
There
are also several applications devoted to converting and transforming documents
from various non-SGML or XML formats into SGML and XML. See http://www.xmlsoftware.com/convert.html.
James
Clark's SX is widely used for SGML to XML conversions. It is included in the
distribution of SP (see above).
Authoring and Editing: writing and maintaining SGML and XML documents.
SGML
and XML authoring and editing tools are used for writing and maintaining
documents. While many SGML and XML authoring and editing tools behave very much
like word processing programs, with WYSIWYG interfaces, spell checkers, and the
like, they also have special features to facilitate creating and maintaining
valid documents.
The
best SGML and XML authoring and editing tools have real time parsers, which is
to say, parsers that compel the author to use only DTD compliant elements. In
addition, good authoring tools provide mechanisms for automatically adding the
tags using function keys or point-and-click menus.
There
are free as well as commercial editing tools. See http://www.xmlsoftware.com/editors.html
for a current list of available software.
Of the commercial packages, the most reasonably priced one of good quality is SoftQuad's
XMetaL (see xmetal.com).
Browsers and Electronic Publishing: rendering and indexing SGML- and XML-based documents.
While
it is possible to publish SGML on the Web, most sites now choose to publish
XML, even if they are maintaining documents in SGML.
There
are a number of different methods that can be used for publishing XML on the
Web. Internet Explorer 6.x offers support for direct rendering of XML using
either XSLT or Cascading Style Sheets (or a combination of the two using
XSLT). XML can also be converted to
HTML prior to being published as HTML, or published using server-side
applications that convert XML to HTML on the fly. Saxon is a popular tool for the latter two approaches: http://saxon.sourceforge.net/. A
variety of other tools also are available: http://www.xmlsoftware.com/xslt.html.
XML indexing software continues to be expensive and the choices limited, though both
the prices and the choices have improved since the release of XML in 1998.
TextML
is relatively inexpensive, and is available in a "lite" version that
supports indexing up to 1000 documents. It is only available on Windows NT and
2000: http://www.ixiasoft.com/products/textmlserver/index.asp.
Tamino
(Software AG) is a much more expensive product: http://www.softwareag.com/tamino/
Finally, Xpat is a very fast search engine. It is based on
OpenText's Pat, version 5.0. The Digital Library eXtension Service (University
of Michigan) purchased OpenText's source code in 1999 and continues to develop
it; in particular, they are promising to add support for Unicode: http://www.dlxs.org/
[1] See the list of current XML applications and initiatives on The XML Cover Pages (http://www.oasis-open.org/cover/siteIndex.html).
[2] Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. "Markup Systems and the Future of Scholarly Text Processing." Communications of the Association for Computing Machinery 30 (11): 933-947.
[3] See XSL Transformations (XSLT) (http://www.w3.org/TR/xslt)
[4] See XML Linking Language (XLink) (http://www.w3.org/TR/xlink/).
[5] HTML is now available as both an SGML and XML DTD. See the W3C's HyperText Markup Language for additional information (http://www.w3.org/MarkUp/).
[6] See Jon Bosak, "XML, Java, and the future of the Web" (http://sunsite.unc.edu/pub/sun_info/standards/xml/why/xmlapps.htm).
[7] For other W3C XML initiatives, see Extensible Markup Language (XML) (http://www.w3.org/XML/).