Eighth Annual
Mid-South Instructional Technology Conference
Teaching, Learning, & Technology
The Challenge Continues

March 30-April 1, 2003

2003 Conference Proceedings

XML: A Beginner's Guide

By: Robert Hallis
Track 3 - Shaping a Transformative Learning Environment
Interest: General :: Lecture/Presentation :: Level: Beginner

Abstract

XML is one of the more recent resources delivered through the web, and its impact on information professionals is sure to be profound. Whereas HTML provides instructions for displaying the information contained in a file, XML provides a syntax within which one can distinguish the various types of information within a file. This presentation will provide an introduction to the language, its uses, and the future implications of this technology for the information profession.

Proceeding

Extensible Markup Language (XML) provides an extremely versatile environment for exchanging information. Coupled with related specifications, such as XPath, XLink and XPointer, XML Schema and the Extensible Style-sheet Language (XSL), and with supporting technologies such as programming scripts, databases and compatible applications, this markup language offers powerful opportunities to maximize the exchangeability of information. Its impact can easily be seen in the number of books published on the topic, and in the extent to which XML has been incorporated into such applications as MS Office, Quark, and many others. There are enough “how to” books already published [the bibliography lists the dozen I’ve worked with out of the hundreds currently available], and an hour is too short a time to get heavily into the coding. This guide will therefore concentrate on a conceptual understanding of the structure of an XML file and of how its elements are tagged. We will also briefly consider the growing interest in structuring data within an XML format, and examine an example currently being developed in the Harmon Computer Commons at Central Missouri State University.

XML is a recently developed subset of SGML, the Standard Generalized Markup Language. In fact, it just turned five years old in February 2003. Simply stated, XML does nothing. Rather, it identifies the structural content of a document. Once the structure is established, however, this markup language provides greater latitude than HTML in defining the display characteristics of a document, facilitates the manipulation of data through programming scripts, and provides greater flexibility in moving between XML documents and sections of documents.

There are several parts to the extensible language. XML refers to the document that uses extensible markup language to structure its contents. It is a tagged language, which means that data are enclosed between tags created by the author. Rather than using a set of predefined tags, as HTML does, the author determines the tag structure best suited to the information, and defines these elements, as well as their relations and attributes, in a Document Type Definition. The XML file can either contain or call two supporting files: the Document Type Definition, which explains the structural elements and their relations; and formatting instructions, which are coded either as a cascading style sheet or in the Extensible Style-sheet Language.

The structure of XML documents provides a greater degree of accuracy when identifying parts of a document, linking between parts of a document, and referring to parts of a document.  These functions are found in XPaths, XLinks and XPointers.  They use the tags and relations of the XML file to provide paths through which one can move forward or backward through a document, return specific nodes of data, or incorporate the data of one or more XML files through referencing these tags.  XSchemas are used to ensure the data contained within an XML file conforms to specific guidelines.  Given the time constraints this afternoon, we will concentrate on structuring and tagging the data within an XML file.
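To make the idea of a path through the tags concrete, here is a minimal sketch in Python, whose standard library supports a small subset of XPath. The sample data is hypothetical; the tag names follow the memo example used throughout this guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the tag names follow the memo example in this guide.
SAMPLE = """<companymemo>
  <memo>
    <to>All Staff</to>
    <from>Library Director</from>
    <re>Workshop schedule</re>
    <date><month>March</month><day>30</day><year>2003</year></date>
    <message>Workshops begin next week.</message>
  </memo>
  <memo>
    <to>Reference Desk</to>
    <from>Systems Office</from>
    <re>Printer maintenance</re>
    <date><month>April</month><day>1</day><year>2003</year></date>
    <message>The printers will be offline Tuesday.</message>
  </memo>
</companymemo>"""

root = ET.fromstring(SAMPLE)

# Path expression: every <re> element that is a child of a <memo>.
subjects = [node.text for node in root.findall("./memo/re")]

# Path expression with a predicate: memos whose <to> child has the given text.
to_reference = root.findall("./memo[to='Reference Desk']")

print(subjects)
print(len(to_reference))
```

The path expressions rely only on the tag names and their nesting, which is why a well-structured file is so much easier to address than a formatted page.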

XML files contain a structural representation of the content of a document.  So the first step involves distinguishing the appearance of a document from its structural relations.  We will be using a typical memo throughout this discussion as an example of a simple document. [Fig A]


Fig A.

Looking at the “paper” document, one sees formatted fonts, within a particular color scheme, placed within a particular arrangement on a document.  The structure of this document, however, is quite different.  [Fig B]


Fig B

This memo contains several distinct elements: who should receive the information, who sent it, what it is about, the date on which it was written, and the content of the memo. These elements stand in a particular relation. The informational content is in a child relation to the parent memo, and the month, day and year are in a child relation to the parent element date. Each of these elements may carry particular attributes, which are predefined qualities, or be free to take on any content. In either case, this memo is an instance of a template we may call companymemo. As a paper document, there may be hundreds of these generated in a day. As an XML document, each individual instance of a memo is a data node within the file tree companymemo.

These elements, relations and attributes are reflected in the tagged structure of the XML syntax.

A markup language is characterized by beginning and ending each informational unit with a tag. An opening tag consists of a less-than symbol, the name of the tag and a greater-than symbol. A closing tag places a forward slash before the tag name to indicate that it closes the tag pair. Figure C illustrates how a ‘memo’ tag would appear. So every instance of a memo would begin with a <memo> tag and end with a </memo> tag, and every element within that memo would be properly nested within this structure.

Fig C

<memo> MEMO</memo>

Looking at the XML representation of these documents, we see that each collection of elements, in their relationships, is added to the root structure of companymemo. These elements, relations and attributes are reflected in the tagged structure of the XML syntax in a parent-child relationship. Each memo has five child elements, and the date element has three children. Each element of the XML file tree has a corresponding tag, which is identified in the Document Type Definition. Tags, however, need to be nested. So the XML file for this document would nest each child element within the parent structure. [Fig D]


Fig D

Consequently, after opening the tag ‘companymemo,’ ‘memo’ would be opened. Tags ‘to,’ ‘from,’ and ‘re’ would each be opened and closed. The ‘date’ tag would be opened, but would not close before the ‘month,’ ‘day,’ and ‘year’ tags were opened and closed. Then the ‘date’ tag would be closed. The ‘message’ tag would be opened and closed, and then the ‘memo’ tag would be closed, ending that node on the ‘companymemo’ file tree. The structure would be repeated for each memo generated. A file following all the XML rules is said to be well formed.
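The nesting just described can be sketched as follows. This is an illustration under stated assumptions rather than a copy of Figure D: the memo text is invented, and only the tag names come from the discussion above. The Python parser is used simply to confirm that the file is well formed and to print its tree.

```python
import xml.etree.ElementTree as ET

# Invented memo content; the tag names and nesting follow the description above.
MEMO_XML = """<companymemo>
  <memo>
    <to>All Staff</to>
    <from>Library Director</from>
    <re>Workshop schedule</re>
    <date>
      <month>March</month>
      <day>30</day>
      <year>2003</year>
    </date>
    <message>Workshops begin next week.</message>
  </memo>
</companymemo>"""


def outline(element, depth=0):
    """Return the tag names of the tree, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(MEMO_XML)  # raises ParseError if the text is not well formed
tree_outline = outline(root)
print("\n".join(tree_outline))
```

The printed outline shows ‘memo’ under ‘companymemo,’ five children under ‘memo,’ and three children under ‘date,’ which is the same tree the prose walks through.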

The basic syntax of XML follows several simple rules. Content is structured. All tags are closed and properly nested. Tag names are case sensitive. Tag names may not contain spaces. White space within the content is preserved, not ignored. All attribute values are in quotes. XML is human readable to the extent that one uses tags that denote their contents. Again, one of the nice things about XML is that one makes up the tags to denote the informational content of the document.
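These rules can be tried out with any XML parser. The following is a small sketch using the parser in Python's standard library; the fragments are invented for illustration, and each one breaks a single rule.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Invented fragments: the first obeys the rules, each of the others breaks one.
checks = {
    "closed and nested": "<memo><to>Staff</to></memo>",
    "unclosed tag": "<memo><to>Staff</memo>",
    "improper nesting": "<memo><to><from>Staff</to></from></memo>",
    "case mismatch": "<memo><To>Staff</to></memo>",
    "space in tag name": "<memo><sent to>Staff</sent to></memo>",
    "unquoted attribute": "<memo priority=Urgent>Staff</memo>",
}

results = {label: is_well_formed(text) for label, text in checks.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'rejected'}")

# White space inside an element is kept exactly as written.
padded = ET.fromstring("<to>  All Staff  </to>").text
print(repr(padded))
```

Only the first fragment is accepted. A parser does not try to repair a file the way a browser repairs sloppy HTML; it simply refuses anything that is not well formed.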

Comparing a paper document with conventional HTML and XML exhibits many of the defining characteristics of XML. [Fig. E]


Fig. E

This memo contains several distinct types of information: who should receive it, who sent it, when it was sent, a descriptive title, and the information itself. This information is then arranged on a page [through a word processor or typewriter]. Turning this document into a web document using HTML involves declaring the markup language used, html, and specifying how the text should be displayed within a browser using a standard set of tags. The title of the page is contained within the <title> tag, and the body of the document designates that one line should be displayed as a header, while carriage returns need to be inserted at appropriate points throughout the rest of the document. The body tag is then closed, and the html tag is closed. By contrast, the XML document identifies a style sheet and a document type definition, and encloses each piece of data within a tag that describes the informational content of that datum. In other words, HTML tells the browser how the information contained within its tags is to be displayed, while XML describes the content and structure of the information. Through the use of a document definition, the elements within the XML file are validated, and through a style sheet, the browser is instructed how each type of data should appear. A closer examination of XML reveals how this works.

An XML document consists of a well-formed file that generally references or contains information about the elements it contains, usually in the form of a Document Type Definition (DTD), and about how it should be displayed, usually through CSS or XSL coding. This coding can be contained in the document, or the files containing it can be referenced from within the XML document. Consequently, three types of information are used to display an XML document. Extensible links, whether paths, links or pointers, permit one to reference and move between elements of an XML file, and Schemas provide a means by which one can restrict the data contained within a tag in order to ensure its compatibility with other applications or database requirements.

This is a display of an XML document that calls separate files for the Document Type Definition and a Cascading Style Sheet. Examining each document separately permits one to see how they interact. The second and third lines of the XML file [Fig D] contain the path names to the two supporting files. The Cascading Style Sheet identifies how the font of each element is to be formatted and displayed: <?xml-stylesheet href="companymemo.css" type="text/css"?>  The DTD identifies each element, their relationships and any attributes they may have: <!DOCTYPE companymemo SYSTEM "companymemo.dtd">  Examining each more closely shows how one influences the other.

The document type definition identifies the elements and their relations, as well as any permissible attributes. It provides a means by which people can agree on the markup standards used in an XML file. The contents of a DTD may be included within the XML file, as an internal DTD, or saved in a separate file and referenced by the XML file. Each element is declared within an <!ELEMENT element_name    > tag, and the relations are denoted by enclosing the children within parentheses. In figure F, one sees that the root element, companymemo, has one child, memo, and the element date has three children: month, day and year. The ‘+’ after ‘memo’ indicates that there may be one or more instances of ‘memo’ within the root element ‘companymemo.’ Each element that contains data identifies the type of that data. Here #PCDATA denotes that the element contains parsed character data. One also sees that each parent element lists its child elements.


Fig. F

A well-formed document follows correct XML syntax. A valid document adheres to the structure prescribed in the Document Type Definition. The structure outlined in figure B is defined in the DTD, figure F. The element ‘memo’ has five children: ‘to,’ ‘from,’ ‘re,’ ‘date,’ and ‘message,’ while the parent ‘date’ has three children: ‘month,’ ‘day,’ and ‘year.’ Elements can occur once, many times or not at all, and this needs to be set up in the DTD. Attributes are also listed in the DTD.
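The difference between well formed and valid can be sketched in a few lines. Python's standard library does not validate against a DTD, so the check below is a hand-written stand-in that mirrors the structure described above. The DTD shown in the comment is an assumed reconstruction from the prose, not the author's Figure F, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# A DTD along the lines described in the text (assumed reconstruction):
#   <!ELEMENT companymemo (memo+)>
#   <!ELEMENT memo (to, from, re, date, message)>
#   <!ELEMENT date (month, day, year)>
#   <!ELEMENT to (#PCDATA)>      ... and likewise for the other leaf elements
CHILDREN = {
    "memo": ["to", "from", "re", "date", "message"],
    "date": ["month", "day", "year"],
}


def structure_errors(root):
    """List the places where the tree departs from the structure above."""
    errors = []
    if root.tag != "companymemo":
        errors.append(f"root is <{root.tag}>, expected <companymemo>")
    memos = list(root)
    # (memo+) means one or more memo children and nothing else.
    if not memos or any(child.tag != "memo" for child in memos):
        errors.append("<companymemo> must contain one or more <memo> elements only")
    for element in root.iter():
        expected = CHILDREN.get(element.tag)
        if expected is None:
            continue  # no child list declared for this element in the sketch
        found = [child.tag for child in element]
        if found != expected:
            errors.append(f"<{element.tag}> has children {found}, expected {expected}")
    return errors


valid = ET.fromstring(
    "<companymemo><memo><to>A</to><from>B</from><re>C</re>"
    "<date><month>March</month><day>30</day><year>2003</year></date>"
    "<message>D</message></memo></companymemo>"
)
# Well formed, but the <date> element lacks its <year> child.
invalid = ET.fromstring(
    "<companymemo><memo><to>A</to><from>B</from><re>C</re>"
    "<date><month>March</month><day>30</day></date>"
    "<message>D</message></memo></companymemo>"
)

print(structure_errors(valid))
print(structure_errors(invalid))
```

Both documents parse without complaint, so both are well formed, but only the first matches the declared structure. A validating parser or XML editor performs this comparison against the actual DTD.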

An attribute declaration identifies the element name, the attributes associated with that element, the type of data and the usage. It follows the convention <!ATTLIST element_name attribute_name type default_value>. If we wanted to designate whether the memo was internal, or what its priority was, appropriate attributes could easily be set up.

Figure G illustrates the line added below the element memo in the DTD. It lists an attribute ‘destination,’ which can be either internal or external, and an attribute ‘priority,’ which can be FYI, Normal or Urgent. The ‘destination’ attribute is required, and the ‘priority’ attribute is optional. Thus the portion of code at the bottom of Figure G denotes that this instance of ‘memo’ has the attributes of ‘internal’ and ‘FYI.’


Fig. G

There are many more aspects to using attributes within a DTD, but the conceptual point here is that they may affect how one structures data within the XML file, and how the data may be manipulated. The tag ‘date’ in figure D has three children: ‘month,’ ‘day,’ and ‘year.’ This could instead be arranged so that the element ‘date’ has three attributes, month, day and year, and each attribute is required and chosen from a list of appropriate values. This would be coded like figure H.

Fig. H

<date month="September" day="22" year="2004" />
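As a rough sketch of how a program sees attributes, the following Python reads the ‘destination’ and ‘priority’ attributes described above and the attribute form of ‘date’ from figure H. The rule table and the function name are hypothetical; a validating parser would take the same rules from the ATTLIST declarations in the DTD.

```python
import xml.etree.ElementTree as ET

# Rules as described in the text: 'destination' is required, 'priority' is optional,
# and each takes its value from a fixed list.
MEMO_ATTRIBUTES = {
    "destination": {"required": True, "values": {"internal", "external"}},
    "priority": {"required": False, "values": {"FYI", "Normal", "Urgent"}},
}


def attribute_errors(memo):
    """Check a <memo> element's attributes against the rules above."""
    errors = []
    for name, rule in MEMO_ATTRIBUTES.items():
        value = memo.get(name)  # None when the attribute is absent
        if value is None:
            if rule["required"]:
                errors.append(f"missing required attribute '{name}'")
        elif value not in rule["values"]:
            errors.append(f"'{name}' may not be '{value}'")
    return errors


memo = ET.fromstring(
    '<memo destination="internal" priority="FYI">'
    '<date month="September" day="22" year="2004" />'
    "</memo>"
)
date = memo.find("date")

print(attribute_errors(memo))
print(date.attrib)
print(date.get("year"))
```

Whether month, day and year are child elements or attributes, the same information is present. The choice affects how one addresses it: as children it is reached by a path, as attributes it is read from the ‘date’ element itself.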

XML files need additional coding in order to display their data. Cascading Style Sheets (CSS) define the font, size, color and placement of each element. One cannot add text that is not in the XML file, but one can define the manner in which the text is formatted. Figure I is an example of an appropriate CSS file that would produce the following display.


Figure I.

The CSS identifies the font, size, color and any indentation that is to be used for each element. Here the heading and label data needed to be added to the XML file because they cannot be added within the CSS. In this example, the heading is to be displayed in a block format, in Arial or Helvetica, at 18 points, in the hexadecimal color code for green. The from, to, re and message elements share a common formatting. There are many more aspects to using CSS to display the contents of an XML file, but the conceptual point here is that additional tags may have to be used in order to include text.

Extensible Style-sheet Language reads the data from an XML file into an HTML page.  Each node of the XML file is displayed with the same formatting.  Text can be added in the document. As the browser parses the XML document, it loops through the XSL for each instance of a particular node within the XML file.  Figure J is an XSL that produces the following display of the data contained in companymemo. 


Fig J

The XSL is identified in the XML prolog with the following code: <?xml-stylesheet type="text/xsl" href="companymemo.xsl" ?>  Once the file is called, the XML data is inserted into an HTML format using some special tags. <xsl:for-each select="nodepath"> is used to identify the node from which individual data elements will be selected, and each individual element is selected through a <xsl:value-of select="fieldname" … tag. When the last element in the node is selected, the stylesheet goes to the next node, repeating the cycle until all the nodes have been processed with the same set of instructions.
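The loop described above can be imitated in a few lines of Python. This is an analogue of what an XSL stylesheet does, not XSL itself: the `for` loop plays the role of xsl:for-each, and each `findtext` call plays the role of xsl:value-of. The memo text, the HTML layout and the function name are all invented for illustration, and the date is left out for brevity.

```python
import xml.etree.ElementTree as ET
from html import escape

# Invented sample data using the memo tags from this guide (date omitted for brevity).
MEMOS = """<companymemo>
  <memo>
    <to>All Staff</to><from>Library Director</from><re>Workshop schedule</re>
    <message>Workshops begin next week.</message>
  </memo>
  <memo>
    <to>Reference Desk</to><from>Systems Office</from><re>Printer maintenance</re>
    <message>The printers will be offline Tuesday.</message>
  </memo>
</companymemo>"""


def render(root):
    """Build an HTML page by applying one template to every memo node."""
    parts = ["<html><body>"]
    for memo in root.findall("memo"):          # plays the role of xsl:for-each
        to = escape(memo.findtext("to"))       # plays the role of xsl:value-of
        sender = escape(memo.findtext("from"))
        subject = escape(memo.findtext("re"))
        message = escape(memo.findtext("message"))
        parts.append(f"<h2>{subject}</h2>")
        # The labels "To:" and "From:" are supplied by the template, not the XML file.
        parts.append(f"<p>To: {to}<br/>From: {sender}</p>")
        parts.append(f"<p>{message}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


page = render(ET.fromstring(MEMOS))
print(page)
```

Every memo passes through the same template, so every node is displayed with the same formatting, and the label text lives in the template rather than in the data.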

There are a number of XML editors, just as there are a number of HTML editors. Many of them check for well-formed XML files as well as validating them against a DTD. The more sophisticated editors enable one to work with all of the extensible languages, and facilitate the incorporation of programming scripts. Dreamweaver MX and XML Spy are two powerful editors. A simple but freely downloadable editor is Microsoft’s XML Notepad. It enables one to define elements and attributes, and saves the resulting XML document. [Fig K]


Fig. K.

It is available at http://www.webattack.com/get/xmlnotepad.shtml.

XML is well suited to work with several types of data. It works well with data in complex forms, and with data structures in which the fields are large or complex. Because it is slower to process than data held in relational databases, speed is a consideration, and searching is not as quick. Because it handles data as character strings, passing numerical data, such as coordinates, in a machine-readable fashion may be problematic. It is, however, good for data warehousing or archiving, and can facilitate moving data across different platforms or applications. It also provides a scalable option when data needs grow.
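The point about character strings is easy to demonstrate. In this minimal Python sketch, with an invented date, the values come back from the parser as text and must be converted before any arithmetic is done.

```python
import xml.etree.ElementTree as ET

date = ET.fromstring("<date><month>March</month><day>30</day><year>2003</year></date>")

raw_day = date.findtext("day")
raw_year = date.findtext("year")

# The parser hands back strings, so "+" joins the text rather than adding numbers.
joined = raw_day + raw_year

# Converting explicitly gives a number the program can compute with.
next_day = int(raw_day) + 1

print(type(raw_day).__name__)
print(joined)
print(next_day)
```

The conversion step is the receiving program's responsibility, which is one reason a Schema that restricts what a tag may contain is useful when data is exchanged between applications.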

The Library at Central Missouri State University offers about 30 different workshops, three times a semester. Information about each workshop includes its title, learning objectives, a brief description of the activities, and who leads the workshop. We would like to provide this information in both web and print form, in a variety of formats. We also want to give clients some flexibility in working with this information while keeping server traffic to a minimum. [Fig. L]


Fig L

Individual fields may be long and contain formatted text.  We would like to recycle this information, once encoded, in a variety of formats.  We also want to provide the client with a great deal of flexibility in manipulating the data without overwhelming him or her.

I began by outlining the structure of the information, and created a simple Access database within which I could quickly edit information and export it into an XML environment. One of the Network Administrators provided some Visual Basic and JavaScript to enable sorting and filtering, and I built a prototype that provides the client with a great deal of control over the display of information. Workshops can be sorted by type or instructor, and clicking on any individual workshop displays all the information about that particular class. [Fig. M]


Fig M

By using a variety of XSLs and XPaths, general information can be exported to supporting web pages, print handouts, and other publicity media. Because the client receives all the data on the initial request, he or she is working with a dataset on the desktop and is not repeatedly querying our server to re-sort, filter or display the information.
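The idea of loading the data once and then sorting and filtering it locally can be sketched as follows. The prototype described above used Visual Basic and JavaScript in the browser; this Python version is only a stand-in for the same idea, and the element names, workshop titles and instructor names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical workshop records; element names and values are illustrative only.
WORKSHOPS = """<workshops>
  <workshop>
    <title>Searching Databases</title>
    <type>Research</type>
    <instructor>Smith</instructor>
  </workshop>
  <workshop>
    <title>Citing Sources</title>
    <type>Writing</type>
    <instructor>Jones</instructor>
  </workshop>
  <workshop>
    <title>Evaluating Web Sites</title>
    <type>Research</type>
    <instructor>Adams</instructor>
  </workshop>
</workshops>"""


def load(xml_text):
    """Parse the file once and keep the records in memory, as the client would."""
    root = ET.fromstring(xml_text)
    return [{field.tag: field.text for field in w} for w in root.findall("workshop")]


def sort_by(records, field):
    """Return the records ordered by the given field."""
    return sorted(records, key=lambda record: record[field])


def filter_by(records, field, value):
    """Return only the records whose field equals the given value."""
    return [record for record in records if record[field] == value]


records = load(WORKSHOPS)  # the single "request"
by_instructor = [r["title"] for r in sort_by(records, "instructor")]
research_only = [r["title"] for r in filter_by(records, "type", "Research")]

print(by_instructor)
print(research_only)
```

After the single call to `load`, every sort and filter works on the copy already held in memory, which is what keeps repeated requests away from the server.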

In conclusion, XML provides a structure to the file that simplifies working with the data.  It is a language designed for the internet.  It is flexible, open, widespread and platform neutral.  It supports searching and uses simple browsers.  On the other hand, searching is slow and perhaps inefficient and it moves away from centralized DB models.  Nevertheless, it may be ideally suited for the data you need to work with.

[from Ronald Schmelzer, Senior Analyst.  The Pros and Cons of XML.  ZapThink Research, 2001.

http://www.zapthink.com/reports/proscons.html    2/5/03]

Bibliography

Harold, Elliotte Rusty.  XML Bible.  New York: Hungry Minds, 2001.

Ladd, Eric; O’Donnell, Jim; Morgan, Mike and Watt, Andrew H.  Platinum Edition Using XHTML, XML and Java 2.  Indianapolis, IN.: Que Corporation, 2001.

Phillips, Lee Anne.  Special Edition: Using XML.  Indianapolis, IN.: Que, 2000.

Navarro, Ann; White, Chuck and Burman, Linda.  Mastering XML.  San Francisco: Sybex, 2000.

Pitts, Natanya.  XML Black Book, 2nd ed.  Scottsdale, AZ.: The Coriolis Group, LLC., 2001.

Tittel, Ed; and Boumphrey, Frank.  XML for Dummies.  Foster City, CA.: IDG Books Worldwide, 2000.

Schmelzer, Ronald.  The Pros and Cons of XML.  ZapThink, 2001.  [PDF downloaded from www.zapthink.com/reports/proscons.html on 2/5/03]

Web Sites

www.w3schools.com

xmlpitstop.com