"I have a hard time arguing that anything in XML is generically useful any more except for the basic syntax, which lets us apply some very handy low-level tools like parsers and XSLT. The rest (XLink, schemas, etc.) has been a pointless trip into complexity."
Simon St.Laurent
"My own experience is that having Prolog, Scheme, and Haskell available it'll take a gun pointed at my head or an extremely large bribe to make me use XSLT for anything."
Richard A. O'Keefe
xml.pl is a module for parsing XML with Prolog, which provides Prolog applications with a simple 'Document Value Model' interface to XML documents. It has been used successfully in a number of applications.
It supports a subset of XML suitable for XML Data and Worldwide Web applications. It is neither as strict nor as comprehensive as the XML 1.0 Specification mandates.
It is not as strict because, while the specification must eliminate ambiguities, not all errors need to be regarded as faults, and some reasonable examples of real XML usage would have to be rejected if they were.
It is not as comprehensive because, where the XML specification makes provision for more or less complete DTDs to be provided as part of a document, xml.pl actions the local definition of ENTITIES only. Other DTD extensions are treated as commentary.
The code, and a small Windows application which embodies it, have been placed into the public domain to encourage the use of Prolog with XML.
I hope that they will be useful to you, but they are not supported, and they are provided without any warranty of any kind.
Three predicates are exported by the module: xml_parse/[2,3], xml_subterm/2 and xml_pp/1.
xml_parse( {+Controls}, +?Chars, ?+Document ) parses Chars, a list of character codes, to/from a data structure of the form xml( <attributes>, <content> ), where:
<attributes> is a list of <name>=<char data> attributes from the (possibly implicit) XML signature of the document.
<content> is a (possibly empty) list comprising occurrences of:
pcdata(<char data>)
comment(<char data>)
namespace(<URI>,<prefix>,<element>)
element(<tag>, <attributes>, <content>): <tag>..</tag> encloses <content> or <tag /> if empty.
instructions(<name>, <char data>): <?<name> <char data>?>
cdata(<char data>): <![CDATA[<char data>]]>
doctype(<tag>, <doctype id>)
The conversions are not completely symmetrical in that weaker XML is accepted than can be generated. Specifically, in-bound (Chars -> Document) parsing does not require strictly well-formed XML. If Chars does not represent well-formed XML, Document is instantiated to the term malformed( <attributes>, <content> ).
The <content> of a malformed/2 structure can include:
unparsed( <char data> )
out_of_context( <tag> ): <tag> is not closed
in addition to the parsed-term types.
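Because a failed in-bound parse yields a malformed/2 term instead of simply failing, a caller usually has to inspect the result. The following is a minimal sketch that relies only on the term shapes described above. parse_problem/2 and its helpers are hypothetical names, not part of xml.pl, and the sketch assumes that the problem markers may appear at any depth of the content.

```prolog
% parse_problem( +Document, -Problem )
% Hypothetical helper: enumerates, on backtracking, each unparsed/1 or
% out_of_context/1 term found in the content of a malformed/2 document.
% It fails for a well-formed xml/2 document.
parse_problem( malformed(_Attributes, Content), Problem ) :-
    content_problem( Content, Problem ).

% Try each node of a content list in document order.
content_problem( [Node|_], Problem ) :-
    node_problem( Node, Problem ).
content_problem( [_|Nodes], Problem ) :-
    content_problem( Nodes, Problem ).

% A node is either a problem marker itself or may contain one.
node_problem( unparsed(Chars), unparsed(Chars) ).
node_problem( out_of_context(Tag), out_of_context(Tag) ).
node_problem( element(_Tag, _Attributes, Content), Problem ) :-
    content_problem( Content, Problem ).
node_problem( namespace(_URI, _Prefix, Element), Problem ) :-
    node_problem( Element, Problem ).
```

A caller could collect every problem with findall( P, parse_problem(Document, P), Problems ) after xml_parse/2 has returned.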
Out-bound (Document -> Chars) parsing does require that Document defines well-formed XML. If an error is detected, a 'domain' exception is raised.
The domain exception will attempt to identify the particular sub-term in error, and will list the ancestor elements of the sub-term in error as <tag>{(id)}* terms - where <id> is the value of any attribute named id.
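Since generation can raise an exception, a caller may want to trap it. This is a sketch only: document_chars/2 is a hypothetical wrapper, the library(xml) load path is an assumption that differs between Prolog systems, and nothing is assumed about the shape of the exception term beyond the fact that it can be caught.

```prolog
:- use_module( library(xml) ).   % assumption: the load path varies by Prolog

% document_chars( +Document, -Result )
% Hypothetical wrapper around out-bound parsing. Result is ok(Chars) if
% the document was generated, or error(Exception) holding whatever term
% xml_parse/2 raised.
document_chars( Document, Result ) :-
    catch(
        ( xml_parse( Chars, Document ), Result = ok(Chars) ),
        Exception,
        Result = error(Exception)
    ).
```

If xml_parse/2 fails without raising anything, document_chars/2 fails too.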
At this release, the Controls applying to in-bound (Chars -> Document) parsing are:
extended_characters(<bool>)
format(<bool>)
remove_attribute_prefixes(<bool>)
allow_ampersand(<bool>)
For out-bound (Document -> Chars) parsing, the only available option is:
format(<bool>)
In the above:
<tag>, <name>, <URI>: atoms.
<char data>: a "string" (list of character codes).
<doctype id>: one of public(<char data>, <char data>), public(<char data>, <char data>, <dtd literals>), system(<char data>), system(<char data>, <dtd literals>), local or local(<dtd literals>).
<dtd literals>: a list of dtd_literal(<char data>) terms - e.g. attribute-list declarations.
<bool>: true or false.
xml_subterm( +XMLTerm, ?Subterm ) unifies Subterm with a sub-term of XMLTerm. This can be especially useful when trying to test or retrieve a deeply-nested subterm from a document, as demonstrated in this example program. Note that XMLTerm is a sub-term of itself.
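As a short illustration of xml_subterm/2 retrieving a nested node, the sketch below looks up the radius of any circle element in a parsed document. circle_radius/2 is a hypothetical predicate, the library(xml) load path is an assumption that differs between Prolog systems, and the attribute value is assumed to hold nothing but the digits of a number.

```prolog
:- use_module( library(xml) ).     % assumption: the load path varies by Prolog
:- use_module( library(lists) ).   % member/2, where it is not built in

% circle_radius( +Document, -Radius )
% Hypothetical example: on backtracking, yields the numeric value of the
% r attribute of each circle element found at any depth of Document.
circle_radius( Document, Radius ) :-
    xml_subterm( Document, element(circle, Attributes, _Content) ),
    member( r=RadiusCodes, Attributes ),
    number_codes( Radius, RadiusCodes ).
```

number_codes/2 raises a syntax error if the attribute value is not a number, so a more defensive version would wrap that call in catch/3.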
xml_pp( +XMLDocument ) "pretty prints" XMLDocument on the current output stream.
The module is available from this site, and is supplied as a library with the following Prologs:
It has been adapted for the Logtalk Open source object-oriented extension to Prolog by Paulo Moura. (See the folder "contributions/xml_parser" from release 2.29.1.)
It is available in the ECLiPSe Constraint Programming System, as a third-party library.
It has been ported to B-Prolog by Neng-Fa Zhou.
It has been adapted for SICStus Prolog by Mats Carlsson.
It is included in Quintus Prolog Release 3.5.
As of version 3.0, I have made it more compatible with LPA Prolog.
The xml/2 data structure has some useful properties.
Using a native Prolog representation of XML, in which terms represent document 'nodes', makes the parser reusable for any XML application. In effect, xml.pl encapsulates the application-independent tasks of document parsing and generation, which is essential where documents have components from more than one Namespace.
The Prolog term representing a document has the same structure as the document itself, which makes the correspondence between the literal representation of the Prolog term and the XML source readily apparent.
For example, this simple SVG image:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN" "http://www.w3.org/.../svg10.dtd"
[
<!ENTITY redblue "fill: red; stroke: blue; stroke-width: 1">
]>
<svg xmlns="http://www.w3.org/2000/svg" width="500" height="500">
<circle cx=" 25 " cy=" 25 " r=" 24 " style="&redblue;"/>
</svg>
... translates into this Prolog term:
xml( [version="1.0", standalone="no"],
[
doctype( svg, public( "-//W3C//DTD SVG 1.0//EN", "http://www.w3.org/.../svg10.dtd" ) ),
namespace( 'http://www.w3.org/2000/svg', "",
element( svg,
[width="500", height="500"],
[
element( circle,
[cx="25", cy="25", r="24", style="fill: red; stroke: blue; stroke-width: 1"],
[] )
] )
)
] ).
Each type of node in an XML document is represented by a different Prolog functor, while data (PCDATA, CDATA and Attribute Values) are left as "strings" (lists of character codes).
The use of distinct functors for mark-up structures enables the efficient recursive traversal of a document, while leaving the data as strings facilitates the application-specific parsing of data content (aka Micro-parsing).
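To make the idea of micro-parsing concrete, here is a minimal sketch of a DCG that splits a style value such as "fill: red; stroke: blue; stroke-width: 1" into Name-Value pairs. All predicate names are hypothetical and not part of xml.pl. The grammar is a deliberate simplification: it assumes names and values contain neither ':' nor ';', and it leaves any trailing spaces in a value untouched.

```prolog
% style_properties( +Chars, -Properties )
% Hypothetical micro-parser: Chars is a list of character codes and
% Properties is a list of Name-Value pairs, where Name is an atom and
% Value is a list of character codes.
style_properties( Chars, Properties ) :-
    once( phrase( properties(Properties), Chars ) ).

properties( [Name-Value|Rest] ) -->
    spaces,
    plain_codes( NameCodes ),
    { NameCodes = [_|_], atom_codes( Name, NameCodes ) },
    [58],                              % 58 is ':'
    spaces,
    plain_codes( Value ),
    more_properties( Rest ).

more_properties( Rest ) --> [59], properties( Rest ).   % 59 is ';'
more_properties( [] )   --> [59], spaces.               % trailing ';'
more_properties( [] )   --> [].

% Skip any number of space characters (code 32).
spaces --> [32], spaces.
spaces --> [].

% A run of codes other than ':' and ';'.
plain_codes( [Code|Codes] ) -->
    [Code],
    { Code =\= 58, Code =\= 59 },
    plain_codes( Codes ).
plain_codes( [] ) --> [].
```

Applied to the style attribute of the circle element above, this yields the names fill, stroke and 'stroke-width', each paired with its value as a list of character codes.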
For example, to turn every CDATA node into a PCDATA node with tabs expanded into spaces:
cdata_to_pcdata( cdata(CharsWithTabs), pcdata(CharsWithSpaces) ) :-
tab_expansion( CharsWithTabs, CharsWithSpaces ).
cdata_to_pcdata( xml(Attributes, Content1), xml(Attributes, Content2) ) :-
cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( namespace(URI,Pfx,Content1), namespace(URI,Pfx,Content2) ) :-
cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( element(Name,Atts,Content1), element(Name,Atts,Content2) ) :-
cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( [], [] ).
cdata_to_pcdata( [H1|T1], [H2|T2] ) :-
cdata_to_pcdata( H1, H2 ),
cdata_to_pcdata( T1, T2 ).
cdata_to_pcdata( pcdata(Chars), pcdata(Chars) ).
cdata_to_pcdata( comment(Chars), comment(Chars) ).
cdata_to_pcdata( instructions(Name, Chars), instructions(Name, Chars) ).
cdata_to_pcdata( doctype(Tag, DoctypeId), doctype(Tag, DoctypeId) ).
The above uses no 'cuts', but will not create any choice points with ground input.
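tab_expansion/2 is not defined in the example. One possible definition is sketched below, under the simplifying assumption that every tab becomes exactly four spaces regardless of column position. A real implementation might instead expand to the next tab stop.

```prolog
% tab_expansion( +CharsWithTabs, -CharsWithSpaces )
% Sketch: replaces each horizontal tab (code 9) with four spaces
% (code 32) and copies every other character code unchanged.
tab_expansion( [], [] ).
tab_expansion( [Code|Codes], Expanded ) :-
    (   Code =:= 9
    ->  Expanded = [32,32,32,32|Rest]
    ;   Expanded = [Code|Rest]
    ),
    tab_expansion( Codes, Rest ).
```

The if-then-else keeps the predicate free of cuts and of leftover choice points, in keeping with the traversal code.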
The resolution of entity references and the decomposition of the document into distinct nodes means that the calling application is not concerned with the occasionally messy syntax of XML documents.
For example, the clean separation of namespace nodes means that Namespaces, which are useful in combining specifications developed separately, have similar usefulness in combining applications developed separately.
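One way an application might exploit namespace nodes is to find out which namespace governs each element, so that it can hand elements to the code responsible for that namespace. The sketch below uses hypothetical predicate names and assumes that an element with no namespace/3 node of its own belongs to the nearest enclosing one. Elements outside any namespace node are reported with the atom none, a convention of this sketch only.

```prolog
% element_namespace( +Node, -Tag, -URI )
% Hypothetical helper: enumerates, on backtracking, each element Tag in
% Node together with the URI of its nearest enclosing namespace/3 node.
element_namespace( Node, Tag, URI ) :-
    node_namespace( Node, none, Tag, URI ).

node_namespace( xml(_Attributes, Content), Current, Tag, URI ) :-
    content_namespace( Content, Current, Tag, URI ).
node_namespace( namespace(New, _Prefix, Element), _Current, Tag, URI ) :-
    node_namespace( Element, New, Tag, URI ).
node_namespace( element(Tag, _Attributes, _Content), URI, Tag, URI ).
node_namespace( element(_Tag, _Attributes, Content), Current, Tag, URI ) :-
    content_namespace( Content, Current, Tag, URI ).

% Try each node of a content list in document order.
content_namespace( [Node|_], Current, Tag, URI ) :-
    node_namespace( Node, Current, Tag, URI ).
content_namespace( [_|Nodes], Current, Tag, URI ) :-
    content_namespace( Nodes, Current, Tag, URI ).
```

For the SVG term shown earlier, both svg and circle are reported with the URI 'http://www.w3.org/2000/svg'.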
The source code is available here. Although it is unsupported, please feel free to e-mail queries and suggestions. I will respond as time allows.
An example program is available to illustrate one of the ways that the code can be used.