I’m studying YAML as a replacement for XML, but I’m experiencing difficulties when dealing with elements containing free-flowing text with embedded elements. For instance, the following XML document:
<text>
This is an example text, spanning multiple lines, and it has embedded elements
like <a p="value">this</a> and <b>this</b>. There is also a list:
<quote>
<text>The text of the quote, spanning multiple lines, and it has
embedded elements like <c p="value">this</c> and <b>this</b></text>
<author>The Author of this quote</author>
</quote>
Text continues here.
</text>
I don’t know how to convert the embedded elements to YAML. My understanding is that the above XML document segment translates to something like this (except for the embedded elements):
text: >
  This is an example text, spanning multiple lines, and it has embedded
  elements like <a p="value">this</a> and <b>this</b>. There is also a
  list:
  quote:
    text: >
      The text of the quote, spanning multiple lines,
      and it has embedded elements like <c p="value">this</c>
      and <b>this</b>
    author: The Author of this quote
  Text continues here.
Also, is indentation not needed in some places?
If an XML/HTML/SGML parser for some programming language X parses your kind of input generically (instead of generating abstracted objects), each tag is normally translated to the mapping construct of X. The child tags and string elements become a sequence construct of X, because their order matters, and the attributes of the tag, if there are any, go into a special first element of that sequence.
Such a hierarchy is a perfect match for YAML¹:
text:
- |-
  This is an example text, spanning multiple lines, and it has embedded elements
  like
- a:
  - .attribute:
      p: value
  - this
- and
- b: this
- '. There is also a list:'
- quote:
  - text:
    - |-
      The text of the quote, spanning multiple lines, and it has
      embedded elements like
    - c:
      - .attribute:
          p: value
      - this
    - and
    - b: this
  - author: The Author of this quote
- Text continues here.
The literal block has been used here to preserve newlines in the original data, but folded could be used as well.
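The translation described above (tags as mappings, children as ordered sequences, attributes as a special first element) can be sketched in Python with the standard library alone. This is only a minimal illustration and not how the `yaml from-html` command is implemented: the helper names `clean` and `element_to_node` are made up for this example, and unlike the literal blocks above, the sketch collapses all whitespace inside text instead of preserving newlines.

```python
import xml.etree.ElementTree as ET

ATTR_KEY = ".attribute"  # same marker key as in the YAML above


def clean(text):
    """Collapse runs of whitespace; return None if nothing is left."""
    if text is None:
        return None
    collapsed = " ".join(text.split())
    return collapsed or None


def element_to_node(elem):
    """Turn an element into {tag: children}, keeping document order."""
    children = []
    if elem.attrib:
        # attributes become the special first element of the sequence
        children.append({ATTR_KEY: dict(elem.attrib)})
    lead = clean(elem.text)  # text before the first child element
    if lead:
        children.append(lead)
    for child in elem:
        children.append(element_to_node(child))
        tail = clean(child.tail)  # text that follows the child's end tag
        if tail:
            children.append(tail)
    # a tag holding a single string collapses to a scalar, as in "b: this"
    if len(children) == 1 and isinstance(children[0], str):
        return {elem.tag: children[0]}
    return {elem.tag: children}


# shortened, hypothetical variant of the XML from the question
xml_source = """<text>Some text with <a p="value">this</a> and <b>this</b>.
<quote><author>The Author</author></quote>
Text continues here.</text>"""

tree = element_to_node(ET.fromstring(xml_source))
print(tree)
```

The resulting nested dicts and lists can be handed to any YAML dumper. The sequence is what keeps the text fragments and embedded tags in their original order, which a plain mapping of tag names could not do once the same tag occurs twice.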
This YAML can be used to regenerate the original XML/HTML/SGML structure, except for some stripped and collapsed white-space information, which normally does not affect e.g. the rendering of HTML. The above YAML is not "the" representation of XML/HTML/SGML in YAML, just one of the possible ways of doing it.
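Going back from such a structure to XML could look like the following sketch. It assumes the data has already been loaded from YAML into plain dicts, lists and strings (the `node` value below is a hand-written, shortened example), and `node_to_element` is a hypothetical helper, not part of any library. Because the surrounding white-space was stripped, the code has to guess where spaces belong: it puts one space between text and a neighbouring tag, except before punctuation. That guess will not always match the original document.

```python
import xml.etree.ElementTree as ET

ATTR_KEY = ".attribute"  # an XML tag name cannot start with ".", so no clash
PUNCT = ".,;:!?"


def node_to_element(node):
    """Rebuild an Element from a single-key {tag: children} mapping."""
    (tag, children), = node.items()
    if isinstance(children, str):
        children = [children]  # scalar form, as in "b: this"
    elem = ET.Element(tag)
    last = None  # most recently appended child element
    for i, item in enumerate(children):
        if isinstance(item, dict) and ATTR_KEY in item:
            elem.attrib.update(item[ATTR_KEY])
        elif isinstance(item, dict):
            last = node_to_element(item)
            elem.append(last)
        else:
            # guess the stripped white-space around the text fragment
            before = " " if last is not None and item[:1] not in PUNCT else ""
            after = " " if i + 1 < len(children) else ""
            piece = before + item + after
            if last is None:
                elem.text = (elem.text or "") + piece
            else:
                last.tail = (last.tail or "") + piece
    return elem


# hand-written structure in the shape of the YAML above
node = {
    "text": [
        "Some text with",
        {"a": [{ATTR_KEY: {"p": "value"}}, "this"]},
        "and",
        {"b": "this"},
        ".",
        {"quote": [{"author": "The Author"}]},
        "Text continues here.",
    ]
}

xml_text = ET.tostring(node_to_element(node), encoding="unicode")
print(xml_text)
```

Text that comes before the first child tag is stored in the element's `text`, and text after a child tag in that child's `tail`; this is how `xml.etree.ElementTree` models mixed content.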
As for your second question:
is indentation not needed in some places?
As you can see from the output, indentation is not always needed: the sequence elements under the key text: are not indented. But if you outdent, you always end a collection (sequence or mapping).
¹ This is the output of yaml from-html --no-body input.xml, with input.xml containing your XML document. yaml is a command that is part of my Python package ruamel.yaml.
The YAML website describes YAML in the following terms:
YAML: YAML Ain't Markup Language
What It Is: YAML is a human friendly data serialization standard for all programming languages.
You probably don’t want YAML for this. YAML is meant for saving data to files for later retrieval, basically as a form of serialization. Its major benefit over XML in this regard is that it lacks the clutter of XML’s angle brackets and end tags, so it is easy for humans to read and edit. XML, on the other hand, is already designed to work with an HTML-like syntax; its major purpose is as a markup language.
You could look at it this way:
- Use YAML when you want to save text or objects to a file.
- Use XML when you want to provide markup for documents.