Introduction to Extensible Markup Language (XML)


Consider the following text document publicly available on the Web:

Email:
From: Adrian Giurca, giurca@tu-cottbus.de
To: mihaiug@inf.ucv.ro
Subject:Where is your paper?
Where is the paper you promised me last week?

It is not clear how a machine could reliably extract the author's name from the content of such an email:

  • Using some heuristics? ("the string "To:" following the ...")
  • Writing a specific parser for this kind of data? (What happens if someone changes the layout of the text?)
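
The fragility of the heuristic approach can be sketched in Python. The function name extract_author and the regular expression are assumptions made for this illustration only; they show one possible heuristic, not a recommended parser.

```python
import re

EMAIL_TEXT = """Email:
From: Adrian Giurca, giurca@tu-cottbus.de
To: mihaiug@inf.ucv.ro
Subject:Where is your paper?
Where is the paper you promised me last week?
"""

def extract_author(text):
    """Heuristic (hypothetical): the name sits between 'From:' and the first comma."""
    match = re.search(r"^From:\s*([^,\n]+),", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None

author = extract_author(EMAIL_TEXT)
print(author)  # Adrian Giurca

# A small change in the layout (the label 'Sender:' instead of 'From:')
# is enough to defeat the heuristic.
changed = EMAIL_TEXT.replace("From:", "Sender:")
author_after_change = extract_author(changed)
print(author_after_change)  # None
```

The heuristic works only as long as every email follows exactly the same textual layout, which nothing in the plain text guarantees.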



First Example

An XML document consists of a prolog (everything before the root element) and a root element, containing a number of other elements that may have attributes and further sub-elements.

<?xml version="1.0" encoding="UTF-16"?>
<!DOCTYPE email SYSTEM "myEmail.dtd">
<email>
 <head>
  <from name="Adrian Giurca" address="giurca@tu-cottbus.de"/>
   <to name="Mihai Gabroveanu" address="mihaiug@inf.ucv.ro"/>
   <subject>Where is your paper?</subject>
 </head>
 <body>
  Where is the paper you promised me last week?
 </body>
</email>

The example above has a prolog of two lines. The first line is the XML declaration, which states that the document uses XML version 1.0 and the character encoding UTF-16. The second line is a document type declaration, which refers to the DTD stored in myEmail.dtd.

The root element, which contains the entire content of the document, is <email>.
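
As a minimal sketch, the document can be read with Python's standard xml.etree.ElementTree module, which parses XML but does not validate it against a DTD. The prolog is left out of the string below to keep the sketch short, and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The email document from the example, without its prolog.
EMAIL_XML = """<email>
 <head>
  <from name="Adrian Giurca" address="giurca@tu-cottbus.de"/>
  <to name="Mihai Gabroveanu" address="mihaiug@inf.ucv.ro"/>
  <subject>Where is your paper?</subject>
 </head>
 <body>
  Where is the paper you promised me last week?
 </body>
</email>"""

root = ET.fromstring(EMAIL_XML)
print(root.tag)  # email

# The author name is addressed by structure, not by guessing at the text layout.
author = root.find("head/from").get("name")
print(author)  # Adrian Giurca
```

Unlike the plain-text email, the markup tells the program where the author name is, so no heuristic is needed.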

XML Elements

In the example above, email, head, from, to, subject and body are XML elements. An element has a start tag (such as <head>) and an end tag (such as </head>). Everything between the start tag and the end tag is the content of the element.

The order of elements in an XML markup is significant.

An XML element may contain other elements

In the example above, the element head contains the element from as a sub-element.
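
A short Python sketch, using the standard xml.etree.ElementTree module, shows that the sub-elements of head are available in document order. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

HEAD_XML = """<head>
 <from name="Adrian Giurca" address="giurca@tu-cottbus.de"/>
 <to name="Mihai Gabroveanu" address="mihaiug@inf.ucv.ro"/>
 <subject>Where is your paper?</subject>
</head>"""

head = ET.fromstring(HEAD_XML)

# Iterating over an element yields its sub-elements in document order.
child_tags = [child.tag for child in head]
print(child_tags)  # ['from', 'to', 'subject']

# The content of a sub-element is reached through its parent.
subject_text = head.find("subject").text
print(subject_text)  # Where is your paper?
```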

An XML element may contain attributes

For example, the element from contains the attribute address with the value giurca@tu-cottbus.de:

<from name="Adrian Giurca" address="giurca@tu-cottbus.de"/>

While the order of elements is significant, the order of attributes is not. However, no element may contain two attributes which:

  • have identical names, or
  • have qualified names with the same local part and with prefixes which have been bound to namespace names that are identical.
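
Both properties of attributes can be observed with a small Python sketch based on the standard xml.etree.ElementTree module. The second name value ("A. Giurca") is made up for the illustration.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<from name="Adrian Giurca" address="giurca@tu-cottbus.de"/>')
b = ET.fromstring('<from address="giurca@tu-cottbus.de" name="Adrian Giurca"/>')

# Attributes are looked up by name.
address = a.get("address")
print(address)  # giurca@tu-cottbus.de

# Swapping the attribute order does not change the set of attributes.
same_attributes = a.attrib == b.attrib
print(same_attributes)  # True

# Two attributes with the same name make the document ill-formed,
# so the parser rejects it.
try:
    ET.fromstring('<from name="Adrian Giurca" name="A. Giurca"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```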

XML Elements can be empty

For example, the element from in the example above is an empty element. It is written as a single empty-element tag and has no separate end tag.

<from name="Adrian Giurca" address="giurca@tu-cottbus.de"/>

This is equivalent to

<from name="Adrian Giurca" address="giurca@tu-cottbus.de"></from>

but only if there is no white space between the start tag and the end tag, because any white space there becomes content of the element. The empty-element tag is the preferred way to encode an empty element.
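
The difference can be checked with a small Python sketch using the standard xml.etree.ElementTree module, which reports the content of an element without text as None. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<from name="Adrian Giurca"/>')
long_form = ET.fromstring('<from name="Adrian Giurca"></from>')
with_space = ET.fromstring('<from name="Adrian Giurca"> </from>')

# Both encodings of the empty element have no content at all.
print(short_form.text)  # None
print(long_form.text)   # None

# A single blank between the tags is content, so the element is no longer empty.
print(repr(with_space.text))  # ' '
```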

XML Documents as Ordered Labeled Trees

  • Root
    • email
      • head
        • from
          • name="Adrian Giurca"
          • address="giurca@tu-cottbus.de"
        • to
          • name="Mihai Gabroveanu"
          • address="mihaiug@inf.ucv.ro"
        • subject Where is your paper?
      • body Where is the paper you promised me last week?
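
A tree view of this kind can be produced by a short recursive walk. The following Python sketch uses the standard xml.etree.ElementTree module; the function name outline and the indentation scheme are assumptions made for the illustration, and the prolog of the document is left out.

```python
import xml.etree.ElementTree as ET

EMAIL_XML = """<email>
 <head>
  <from name="Adrian Giurca" address="giurca@tu-cottbus.de"/>
  <to name="Mihai Gabroveanu" address="mihaiug@inf.ucv.ro"/>
  <subject>Where is your paper?</subject>
 </head>
 <body>
  Where is the paper you promised me last week?
 </body>
</email>"""

def outline(element, depth=0):
    """Return one indented line per element and per attribute (hypothetical helper)."""
    text = (element.text or "").strip()
    label = element.tag + (" " + text if text else "")
    lines = ["  " * depth + label]
    for name, value in element.attrib.items():
        lines.append("  " * (depth + 1) + f'{name}="{value}"')
    for child in element:  # sub-elements are visited in document order
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(EMAIL_XML))
print("\n".join(tree_lines))
```

Each element becomes a labeled node, its attributes and sub-elements become its children, and the order of the sub-elements follows the order in the document.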

Well-formed XML Documents

An XML document is called well-formed if it satisfies a number of syntactic conditions, including:

  1. There must be exactly one root element.
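  2. Each element has a start tag and a matching end tag, or is written as a single empty-element tag.
  3. Elements are properly nested and tags do not overlap; for example, <author><name>Tim Berners Lee</author></name> is not allowed.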
  4. Attribute names are unique within the scope of an element.
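
These conditions can be tested with any XML parser, since a conforming parser must reject input that is not well-formed. The following Python sketch relies on the standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "properly nested": "<author><name>Tim Berners Lee</name></author>",
    "two root elements": "<email/><email/>",
    "missing end tag": "<email><head></email>",
    "overlapping tags": "<author><name>Tim Berners Lee</author></name>",
    "duplicate attribute": '<from name="a" name="b"/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample is accepted; each of the others violates one of the four conditions listed above.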

Valid XML Documents

An XML document is called valid if:

  1. it is well-formed,
  2. it refers to a grammar (DTD or XML Schema), and
  3. it conforms to that grammar.


The Advantages of the XML Format

  1. The data is self-describing. The markup describes the meaning of the content through:
    1. content markup (such as <subject>),
    2. content relationships implied by nested markup (such as subject nested inside head, which is nested inside email), and
    3. integrity constraints, which restrict the admissible text values and content structures.
  2. The data can be authored and viewed with standard tools.
  3. You can create different views of the same data using style sheets.

