EDML defines an XML syntax for declaring internal and external general parsed entities. It makes no provision for unparsed entities or for parameter entities. The scope provides a means for definition of boilerplate (text alone or text with markup), mnemonic aliases for numeric character references, and the traditional SGML inclusion mechanism. EDML may be incorporated into an XML schema language, incorporated into or referenced from a particular schema, or referenced from an XML instance. In order for it to be truly useful in instances and schemas, parsers would have to add support for it. It is possible that a preprocessor could be defined which would have largely equivalent effects.
This is an initial draft. Comments solicited. Future status (submission to standards bodies and the like) is not clear.
This draft has been submitted for discussion to the xml-dev mailing list.
1. Introduction
2. EDML Namespace
3. Syntax
3.1 The entities Element
3.1.1 Internal Collections and Compilations
3.1.1.1 The uri attribute
3.1.1.2 The canonical attribute
3.1.1.3 The version attribute
3.1.1.4 Children of the entities element
3.1.2 References to Entity Collections
3.1.2.1 The system attribute
3.1.2.2 The public attribute
3.1.2.3 Additional requirements for an entities element as reference
3.2 The entity Element
3.2.1 Internal Parsed General Entities
3.2.1.1 The name attribute
3.2.1.2 Children of the entity element
3.2.1.3 XML 1.0 implications
3.2.2 External Parsed General Entities
3.2.2.1 The name attribute
3.2.2.2 The system attribute
3.2.2.3 The public attribute
3.2.2.4 Additional requirements for an entities element as reference
3.2.2.5 XML 1.0 implications
3.3 Alternative Syntax
4. Usage
4.1 Extending Schema Languages with EDML
4.2 The EDML Processing Instruction
4.3 Other Forms of EDML Embedding
4.3.1 Document-Scope Embedding
4.3.2 Element-Scope Embedding
4.4 Priority of Definition
5. Security Considerations
A. Examples (Non-Normative)
A.1 Example: Standalone Entity Definitions Collections
A.2 Example: Boilerplate or Subdocument Definition
A.3 Example: Overriding Imported Definitions
A.4 Example: Including Entity Definitions in a Schema
A.5 Example: Including Entity Definitions Using Processing Instructions
B. Schema for Entity Definition Markup Language (Non-Normative)
C. Changelog (Non-Normative)
1. Introduction
The Entity Definition Markup Language began life as a subset of the doctype markup language, an XML transformation of DTDs. In working on the latter, it soon became clear that the task, while perhaps important, was extremely large, and the temptation to add things and leave things out was difficult to resist. On the other hand the subset language for defining entities proved quite tractable, and by its nature (because of its use of XML syntax), seemed to the author elegant and useful.
EDML defines XML 1.0 entities (applicability to XML 1.1 is left for those who need it). Because it uses XML syntax, it inherently enforces certain well-formedness and validity constraints that must be tested by other means when entities are defined in a DTD. For the same reason, it can be relatively easily incorporated into existing XML schema languages, such as W3C XML Schema Definition Language or RELAX NG (normal form).
EDML does not permit definition of all the entity types defined in XML 1.0. Specifically, EDML provides a means of defining internal and external parsed entities. Unparsed entities are out of scope (together with notations). Parameter entities, used only in DTDs, are out of scope. Effectively, EDML provides a means of defining that subset of entities which might be considered "macros" in another language--a means of substituting for a commonly repeated sequence, or difficult-to-type sequence, a simple text representation which the parser can replace with the defined substitution text or tree. Also, EDML is stronger on external parsed entities than internal ones; it cannot provide a replacement for the internal DTD subset without abandoning its advantages or drastically modifying XML.
2. EDML Namespace
The official URI for EDML is http://www.talsever.org/namespaces/edml
This URL is subject to change in future revisions.
3. Syntax
There are three formats in which entities may appear:
In an entity collection (root element: entities)
As a standalone entity marked as such and internally named (root element: entity)
As a standalone entity with no EDML content, merely a well-formed external entity, named on import (any root element, or multiple roots, or none)
Both of the first two types may be embedded into other XML dialects (typically schema languages). The third type does not reap the benefit of XML containment that enforces well-formedness for the other two types.
3.1 The entities Element
The container for all entity definitions is the entities element. This container element may directly contain entity elements, and may contain entities elements that point to external definitions (but not internal ones). An entities definition either defines an internal collection of entities, defines a compilation of external entities with internal entities, or references an external collection of entity definitions.
3.1.1 Internal Collections and Compilations
An entities element that defines an internal collection or compilation MAY contain up to three optional attributes, and MUST contain at least one entity or entities child element.
3.1.1.1 The uri attribute
The optional uri attribute provides an identifier for this entity collection. If present, it MUST NOT be empty, and MUST NOT be relative (that is, it MUST be an absolute URI). It may also be a Formal Public Identifier, particularly if the collection is a conversion from a DTD definitions collection (see examples, below).
3.1.1.2 The canonical attribute
The optional canonical attribute provides a URL (it's pointless to use a URN) that gives the canonical location of the latest version of this collection of entities. If present, it MUST NOT be empty, and MUST be absolute.
3.1.1.3 The version attribute
The optional version attribute is a string identifying the version. This specification defines no further semantics for its value.
3.1.1.4 Children of the entities element
If any of the uri, canonical, or version attributes exist, the element MUST contain children, and MUST NOT contain the system or public attributes.
An internal entities element MUST contain at least one child, which must be an external entities element, or an internal or external entity element. It may contain any number of each of these types of element, which are to be processed in strict order. Note that it is fairly pointless for the element to contain only a single external entities element, but the grammar permits this.
If an entities element contains any entity or entities children, it MUST NOT contain the system or public attributes.
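The attribute and child rules above can be checked mechanically. The following is a minimal sketch in Python, not part of EDML itself; the function name check_entities and the wording of the messages are illustrative assumptions. The uri attribute is deliberately left unchecked, because it may hold a Formal Public Identifier rather than a URI.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

EDML_NS = "http://www.talsever.org/namespaces/edml"
ENTITY = f"{{{EDML_NS}}}entity"
ENTITIES = f"{{{EDML_NS}}}entities"


def check_entities(elem):
    """Return a list of rule violations for one EDML entities element.

    Hypothetical helper: it covers only the rules stated for the
    entities element and does not descend into child elements.
    """
    problems = []
    children = [c for c in elem if c.tag in (ENTITY, ENTITIES)]
    collection_attrs = [a for a in ("uri", "canonical", "version") if a in elem.attrib]
    location_attrs = [a for a in ("system", "public") if a in elem.attrib]

    # Internal collections and external references are mutually exclusive.
    if (children or collection_attrs) and location_attrs:
        problems.append("an internal collection must not carry system or public")
    if collection_attrs and not children:
        problems.append("uri, canonical and version require at least one child")
    if not children and "system" not in elem.attrib:
        problems.append("needs either children or a system attribute")
    # "Absolute" is approximated here by requiring a URI scheme.
    if "canonical" in elem.attrib and not urlparse(elem.attrib["canonical"]).scheme:
        problems.append("canonical must be a non-empty absolute URL")
    if elem.attrib.get("system") == "":
        problems.append("system must not be the empty string")
    return problems


collection = ET.fromstring(
    f'<entities xmlns="{EDML_NS}" version="0.3">'
    '<entity name="aacute">&#225;</entity></entities>'
)
print(check_entities(collection))
```

An element that mixes the two forms, such as one carrying both version and system but no children, is reported with two violations by this sketch.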
3.1.2 References to Entity Collections
An entities element that points at an external collection of entity definitions contains a single, required attribute, system, and may contain an optional attribute, public.
3.1.2.1 The system attribute
The system attribute MUST be a URI, MUST NOT be the empty URI (the empty string), and MUST resolve to an XML document or fragment with a root element of entities. The referenced entities element MUST NOT be a reference to an external collection (that is, it must be an internal collection, although it may contain additional external references).
3.2 The entity Element
Every entity is defined using an entity element. Entities may be referenced (from inside an entities element or some other XML dialect that permits it) independently.
3.2.1 Internal Parsed General Entities
The basic building block of an entity definition is the internal parsed general entity. It is defined using the entity element, which contains a single required attribute, name, and may contain any child content (the replacement content).
3.2.1.1 The name attribute
The name attribute MUST conform to the rules for an entity name. This name may be composed with the & character and the ; character to make an entity reference in documents. When it appears so in a document, the content of the entity element replaces the entity reference.
3.2.1.2 Children of the entity element
The entity element, if it has only a name attribute, MUST have content, which may be elements or text. If included elements are in a namespace, the namespace MUST be declared. Note that namespace declarations MAY be part of an entity definition. Any entities referenced in the content of this entity MUST appear earlier in the document (directly or by inclusion) or the replacement text will contain a skipped entity. Any content, including text and mixed content, is permitted, so long as it is well-formed (the XML sine qua non).
3.2.1.3 XML 1.0 implications
Note: because the content of an entity element is XML, all internal entities defined in this manner are by definition well-formed per XML 1.0 section 4.3.2. Because an entity is not defined until its closing tag has been read, recursion is implicitly disallowed (both direct and indirect), satisfying the No Recursion WFC in section 4.1. This specification provides an alternative means of declaring an entity, to satisfy the Entity Declared WFC and VC.
3.2.2 External Parsed General Entities
A single entity definition may be referenced externally. In this case, the entity element contains up to three attributes. The name attribute is optional. The system attribute, a URI (corresponding to the SYSTEM pseudo-attribute of a doctype declaration), is required. The public attribute (which contains an FPI) is optional.
3.2.2.1 The name attribute
The optional name attribute MUST conform to the rules for an entity name. If it is not present, the URL of the system attribute MUST resolve to an XML document or fragment with entity as its root, and conforming to the content model for internal general parsed entities (references may not be chained). If the name attribute is present, the content of the resolved document or fragment may be any well-formed XML. However, if it is an entity element, then the name supplied in the importing entity definition overrides the name of the imported definition (that is, this provides a renaming and copying mechanism).
3.2.2.2 The system attribute
The system attribute MUST conform to the rules for URIs, and additionally MUST NOT be the empty URI (the empty string). Relative URIs are to be resolved relative to the document, unless it has an XML Base URI defined, in which case relative URIs are resolved relative to the base URI. Cataloging systems may supplant the usual resolution rules, of course. The resolved entity MUST point at an XML document or fragment. The content MUST be well-formed. If the target is an entity element, then parsing should verify it; it is permitted (if the name attribute is present) for the target to be any well-formed XML fragment.
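The resolution rule for relative system URIs can be illustrated with Python's standard urljoin. This is only a sketch: the function name resolve_system and the document name spec.xml are assumptions made for the example, and catalog lookup is ignored entirely.

```python
from urllib.parse import urljoin


def resolve_system(system, document_uri, xml_base=None):
    """Resolve a system attribute value to an absolute URI.

    Hypothetical helper: xml:base, when given, takes precedence over
    the document URI, as described for the system attribute.
    """
    if system == "":
        raise ValueError("system MUST NOT be the empty string")
    # An xml:base value may itself be relative to the document URI.
    base = urljoin(document_uri, xml_base) if xml_base else document_uri
    return urljoin(base, system)


doc = "http://www.talsever.org/edml/spec.xml"  # assumed document location
print(resolve_system("header.xml", doc))
print(resolve_system("/entities/latin1.edml", doc))
print(resolve_system("header.xml", "http://www.talsever.org/other/doc.xml",
                     xml_base="http://www.talsever.org/edml/"))
```

Note that a base URI naming a directory needs its trailing slash: resolving header.xml against http://www.talsever.org/edml (no slash) yields http://www.talsever.org/header.xml, because the last path segment of the base is replaced.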
3.2.2.3 The public attribute
The public attribute, if it exists, MUST conform to the rules for a Formal Public Identifier. Cataloging systems may make use of FPIs for resolution.
3.3 Alternative Syntax
There is already an alternative syntax for EDML; it's called DTD, and it's more widely supported than EDML is or is likely to be. If a non-XML syntax is to be used, better to use that one than to invent another. DTD has the advantage of providing the internal subset. EDML is designed to integrate smoothly with existing DTD definitions of entities, but not to provide a substitute for the internal subset.
That is, while it is possible (in theory, if not yet in practice) to attach entity definitions to a schema, to attach them to an instance, and to override them in an instance, all of these techniques require external files, unlike the internal subset, which is defined inside the instance document.
4. Usage
For EDML to be useful, it must be possible to refer to EDML definitions in existing documents. There are four possible ways to do this:
Embedding in the schema for a class of documents
Inclusion of files in the prologue of an instance document
Embedding in an instance document
Embedding in the prologue of an instance document
Embedding in a schema is feasible, either directly or by import. Inclusion into an instance document in the prologue is also feasible, although it uses a technique (processing instructions) which many XML gurus find distasteful. It is similar, however, to the use of PIs for stylesheets and the like. Embedding in an instance document is problematic, primarily due to the scoping issues that it raises and the limitations that it is necessarily under. Embedding in the prologue of an instance document requires either that XML documents be redefined to permit multiple roots, or that an alternative, non-XML syntax for EDML be defined (thereby losing many of the benefits of EDML). We have already dismissed an alternative syntax for EDML (see above).
4.1 Extending Schema Languages with EDML
For entity definition collections that provide character replacement (such as the LATIN-1 entity definition collection), it is easy and sensible to incorporate the entity definition collection into the schema for the language (XHTML, for instance). This is easily enough accomplished by placing the entity definitions early in the schema document for the dialect being defined. Both internal and external definitions are possible.
The drawback to doing this, at present, is that it doesn't work. XML parsers must be enhanced to include support for EDML embedded in a schema. It may be possible to define, for instance, a resolver in SAX that can expand the entity definitions in some fashion, but it is not clear that this is even feasible (more study is needed).
4.2 The EDML Processing Instruction
It is often necessary to repeat blocks of markup in a single document, or in a number of related instance documents produced by a single organization. This calls for the ability to include a reference to the entity definition collection in the prologue of an instance document.
This technique is also useful when creating large documents. The document can be broken into sections, each representing a smaller subtree. In this case, the usual inclusion mechanism uses the style of import which references a single entity, effectively supplying the entity name for the referenced file. This permits each subtree to be independently developed and validated.
Note: it might be necessary to define what happens if an imported entity of this type contains other entity imports, or a doctype declaration, or an internal subset.
A processing instruction is defined for this purpose. Actually, there are two. One is called 'entities' and the other is 'entity', and both contain, as their content, a single URI. The entity processing instruction may also contain an entity name. These correspond directly with the external entities and entity elements defined above. That is, an entities processing instruction imports a document containing, as its root, an internal entities element. An entity processing instruction which lacks a name imports a document containing, as its root, an internal entity element. An entity processing instruction which has a name MAY import a document containing, as its root, an internal entity element, in which case the name in the processing instruction replaces the name found in the name attribute. An entity processing instruction that has a name may also import a document or fragment which is merely well-formed XML; the name provided is the name of the entity, which has the content of the document or fragment as its content.
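As a rough illustration of how a consumer might recognise these processing instructions, the sketch below uses Python's standard xml.sax module. It assumes, following the examples in Appendix A.5, that the optional entity name precedes the URI and that the two are separated by whitespace; the names parse_edml_pi, ImportCollector and collect_imports are hypothetical.

```python
import xml.sax


def parse_edml_pi(target, data):
    """Return (kind, name, uri) for an EDML PI, or None for any other PI."""
    parts = data.split()
    if target == "entities" and len(parts) == 1:
        return ("entities", None, parts[0])
    if target == "entity" and len(parts) == 1:
        return ("entity", None, parts[0])       # name comes from the target file
    if target == "entity" and len(parts) == 2:
        return ("entity", parts[0], parts[1])   # name supplied by the PI
    return None


class ImportCollector(xml.sax.ContentHandler):
    """Collects EDML imports in document order; it does not fetch anything."""

    def __init__(self):
        super().__init__()
        self.imports = []

    def processingInstruction(self, target, data):
        parsed = parse_edml_pi(target, data)
        if parsed is not None:
            self.imports.append(parsed)


def collect_imports(xml_bytes):
    handler = ImportCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.imports


sample = b"""<?xml version="1.0"?>
<?entity Aacute http://www.talsever.org/entities/Aacute.xml ?>
<?entity http://www.talsever.org/entities/Oacute.edml ?>
<?entities http://www.talsever.org/entities/latin1.edml ?>
<spec/>"""
for item in collect_imports(sample):
    print(item)
```

Keeping the imports in document order matters, because the priority rules depend on which definition is seen first.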
As with inclusion in schema languages, the major problem with this technique is that it doesn't currently work with any shipping XML parser. In this case, however, the "macro-replacement" character of EDML is an advantage; by placing a filter into a SAX processing stream, the XML can be modified by an intelligent EDML filter. This filter would need to receive processing instruction notifications, would have to resolve these PIs, and would then perform replacement of entities as they are encountered in the stream before the rest of the parser saw them. The drawback of this technique is that it does not integrate with the use of internal DTD subsets; it would potentially expand entities wrongly, ignoring the overrides in the internal subset.
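The filter described above works on a SAX event stream. A cruder alternative, closer to the preprocessor mentioned in the abstract, is to rewrite entity references in the raw text before any parser sees it. The sketch below takes that text-level approach, which is a different technique from the SAX filter, and makes several simplifying assumptions: it performs a single pass, it does not recognise comments or CDATA sections, and the name pattern is narrower than the XML Name production. The function name expand_references is hypothetical.

```python
import re

# The five entities predefined by XML 1.0 are never touched.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Simplified name pattern; numeric references such as &#169; do not match.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")


def expand_references(text, definitions):
    """Replace &name; with its definition, leaving unknown names alone."""
    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in definitions:
            return match.group(0)
        return definitions[name]
    return ENTITY_REF.sub(replace, text)


defs = {"copyright": "<p>Copyright Talsever</p>"}
print(expand_references("<body>&copyright; &amp; &unknown; &#169;</body>", defs))
```

Like the SAX filter, this approach knows nothing about an internal DTD subset, so it shares the drawback of potentially expanding entities that the instance meant to override.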
4.3 Other Forms of EDML Embedding
It is possible to imagine a schema that includes the schema for EDML. This would, potentially, allow instance documents to define a sort of internal subset inside the document element.
A minor drawback of this technique, shared with including EDML in the schema and using processing instructions, is that it doesn't work. A more significant drawback is that it probably shouldn't work.
4.3.1 Document-Scope Embedding
The largest question that arises when embedding inside the document element of an instance document is what the scope of the entity definitions ought to be. One solution is to state that entity definitions are valid from the point of definition to the end of the document.
This scoping mechanism has two drawbacks. First, if the principle of "first defined has priority" is maintained, then it will surprise users accustomed to the internal DTD subset, because their attempted overrides won't work. Second, for large documents, this scoping mechanism breaks expectations. Customized parsers for large documents may read only a portion of a document; permitting document scope for entity definitions breaks this technique irretrievably.
4.3.2 Element-Scope Embedding
Another possibility for scoping entities defined inside the document element is to re-use the pattern of namespace declarations (or the xml:lang attribute). Sort of. Unfortunately, entities aren't declared in attributes. So it goes. Anyway. The scope of the definition would extend through the scope of the containing element.
This scoping mechanism has drawbacks, as well. First, it naturally follows that overrides happen in narrowest scope. This is the reverse of the normal principle ("first seen") for entities, and would be very likely to unreasonably complicate implementations. Second, it is highly counter-intuitive to those who currently use entities--it is easy to imagine the frustration of a user defining entities in /html/head who can't understand why they don't work inside /html/body, for instance. Mostly, it is likely to add enormously to the weight of the machinery necessary for processing entities; instead of knowing what the entities are when the document element is encountered, the parser must be ready for additional definitions, and even redefinitions.
4.4 Priority of Definition
Several techniques for incorporating EDML definitions into an instance document (directly or indirectly) have been outlined above. Given these various techniques, it is reasonable to expect them to be combined. What happens, then, when a particular entity is defined multiple times?
The rule for DTDs is simple: the first definition rules. Since the internal subset is fully processed before the external subset is loaded (even though the external subset is identified before the internal subset is complete), this allows an instance to override external definitions, a powerful feature.
The same rule is applied, as a principle, to EDML definitions. When a document contains an internal subset, its entity definitions (if any) are processed first. If processing instructions exist in the document pointing to external general parsed entity definitions (see above), they are processed next. Any attempt to redefine a previously defined entity is ignored. If the schema language includes a facility for identifying the schema within the instance, it is logically processed next, on encountering the document element. This rule is extended to those languages (such as RELAX NG) which do not provide a facility to identify the governing schema inside an instance document: effectively, this means that entity definitions defined by the internal subset, then entity definitions defined using the entity inclusion processing instructions, override the entities defined in a schema for a particular XML dialect, regardless of the schema language used to define the dialect.
Entities defined using document-scope or element-scope rules cannot override existing definitions. As already noted, this may make these scopes less than useful in practice. They can only be used to define previously-undefined entities. It is recommended, therefore, that neither document-scope nor element-scope rules be deployed.
Note that schema languages that support EDML may override imports, by using an internal entities element which contains first the overrides, then the imports. This will not resolve issues for very large, overlapping, conflicting entity definition collections, but it does help a bit.
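The "first definition rules" principle amounts to a merge in which later sources can only add names, never replace them. A minimal sketch in Python follows; the function name merge_definitions and the sample values are illustrative assumptions, and each source is simply an ordered list of (name, replacement) pairs.

```python
def merge_definitions(*sources):
    """Merge entity definitions so that the first definition of a name wins.

    Sources are given in priority order: internal subset first, then
    processing-instruction imports, then schema-supplied definitions.
    """
    table = {}
    for source in sources:
        for name, value in source:
            # setdefault keeps an existing entry, so redefinitions are ignored.
            table.setdefault(name, value)
    return table


internal_subset = [("Aacute", "from the internal subset")]
pi_imports = [("Aacute", "Ah, a cutie!"), ("Oacute", "Oh, a cutie!")]
schema_defs = [("Oacute", "\u00d3"), ("aacute", "\u00e1")]

table = merge_definitions(internal_subset, pi_imports, schema_defs)
for name, value in table.items():
    print(name, "=", value)
```

The same function models an entities element that lists its overrides before its imports: the overrides are simply an earlier source.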
5. Security Considerations
The most obvious risk is the Billion Laughs attack. This specification requires that previously-seen URIs be ignored, which ameliorates the attack, but not by much.
URIs in submitted documents may be designed to produce particular errors, in order to allow an attacker to probe the network topology of a target. The tradeoff is between enhanced security and impoverished error messages.
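The requirement to ignore previously-seen URIs can be sketched as a loader that records every collection URI it visits. The example below is an assumption-laden illustration in Python: fetch is a hypothetical callable standing in for retrieval and parsing, returning a list of ("entity", (name, value)) and ("entities", uri) items in document order.

```python
def load_collection(uri, fetch, seen=None):
    """Collect (name, value) pairs from a collection and its imports.

    A URI that has already been visited is ignored, so import cycles
    terminate. This does not bound the size of the expanded text.
    """
    if seen is None:
        seen = set()
    if uri in seen:
        return []
    seen.add(uri)
    definitions = []
    for kind, payload in fetch(uri):
        if kind == "entity":
            definitions.append(payload)
        elif kind == "entities":
            definitions.extend(load_collection(payload, fetch, seen))
    return definitions


# Two hypothetical collections that import each other.
collections = {
    "a.edml": [("entity", ("x", "1")), ("entities", "b.edml")],
    "b.edml": [("entities", "a.edml"), ("entity", ("y", "2"))],
}
print(load_collection("a.edml", collections.__getitem__))
```

As the text says, this helps only a little: a single collection with deeply nested references can still expand enormously, so an implementation would likely also want a cap on total expansion size.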
A. Examples (Non-Normative)
This section contains some examples of entity definition with EDML, from fairly simple character replacement, to boilerplate and subdocument inclusion, to overrides, incorporating entity definitions into a schema, and importing entity definitions into a document using processing instructions.
A.1 Example: Standalone Entity Definitions Collections
For our first trick, ladies and gentlemen, the Latin-1 entity definitions, converted to EDML.
<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml"
          uri="ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"
          canonical="http://www.talsever.org/entities/latin1.edml"
          version="0.3">
  <!--
    This version converted from:
    Copyright (C) 2001, 2002 Organization for the Advancement of Structured
    Information Standards (OASIS).
    Permission to use, copy, modify and distribute this entity set and its
    accompanying documentation for any purpose and without fee is hereby
    granted in perpetuity, provided that the above copyright notice and this
    paragraph appear in all copies. The copyright holders make no
    representation about the suitability of the entities for any purpose.
    It is provided "as is" without expressed or implied warranty.
  -->
  <entity name="aacute">á</entity><!-- LATIN SMALL LETTER A WITH ACUTE -->
  <entity name="Aacute">Á</entity><!-- LATIN CAPITAL LETTER A WITH ACUTE -->
  <entity name="acirc">â</entity><!-- LATIN SMALL LETTER A WITH CIRCUMFLEX -->
  <entity name="Acirc">Â</entity><!-- LATIN CAPITAL LETTER A WITH CIRCUMFLEX -->
  <entity name="agrave">à</entity><!-- LATIN SMALL LETTER A WITH GRAVE -->
  <entity name="Agrave">À</entity><!-- LATIN CAPITAL LETTER A WITH GRAVE -->
  <entity name="aring">å</entity><!-- LATIN SMALL LETTER A WITH RING ABOVE -->
  <entity name="Aring">Å</entity><!-- LATIN CAPITAL LETTER A WITH RING ABOVE -->
  <entity name="atilde">ã</entity><!-- LATIN SMALL LETTER A WITH TILDE -->
  <entity name="Atilde">Ã</entity><!-- LATIN CAPITAL LETTER A WITH TILDE -->
  <entity name="auml">ä</entity><!-- LATIN SMALL LETTER A WITH DIAERESIS -->
  <entity name="Auml">Ä</entity><!-- LATIN CAPITAL LETTER A WITH DIAERESIS -->
  <entity name="aelig">æ</entity><!-- LATIN SMALL LETTER AE -->
  <entity name="AElig">Æ</entity><!-- LATIN CAPITAL LETTER AE -->
  <entity name="ccedil">ç</entity><!-- LATIN SMALL LETTER C WITH CEDILLA -->
  <entity name="Ccedil">Ç</entity><!-- LATIN CAPITAL LETTER C WITH CEDILLA -->
  <entity name="eth">ð</entity><!-- LATIN SMALL LETTER ETH -->
  <entity name="ETH">Ð</entity><!-- LATIN CAPITAL LETTER ETH -->
  <entity name="eacute">é</entity><!-- LATIN SMALL LETTER E WITH ACUTE -->
  <entity name="Eacute">É</entity><!-- LATIN CAPITAL LETTER E WITH ACUTE -->
  <entity name="ecirc">ê</entity><!-- LATIN SMALL LETTER E WITH CIRCUMFLEX -->
  <entity name="Ecirc">Ê</entity><!-- LATIN CAPITAL LETTER E WITH CIRCUMFLEX -->
  <entity name="egrave">è</entity><!-- LATIN SMALL LETTER E WITH GRAVE -->
  <entity name="Egrave">È</entity><!-- LATIN CAPITAL LETTER E WITH GRAVE -->
  <entity name="euml">ë</entity><!-- LATIN SMALL LETTER E WITH DIAERESIS -->
  <entity name="Euml">Ë</entity><!-- LATIN CAPITAL LETTER E WITH DIAERESIS -->
  <entity name="iacute">í</entity><!-- LATIN SMALL LETTER I WITH ACUTE -->
  <entity name="Iacute">Í</entity><!-- LATIN CAPITAL LETTER I WITH ACUTE -->
  <entity name="icirc">î</entity><!-- LATIN SMALL LETTER I WITH CIRCUMFLEX -->
  <entity name="Icirc">Î</entity><!-- LATIN CAPITAL LETTER I WITH CIRCUMFLEX -->
  <entity name="igrave">ì</entity><!-- LATIN SMALL LETTER I WITH GRAVE -->
  <entity name="Igrave">Ì</entity><!-- LATIN CAPITAL LETTER I WITH GRAVE -->
  <entity name="iuml">ï</entity><!-- LATIN SMALL LETTER I WITH DIAERESIS -->
  <entity name="Iuml">Ï</entity><!-- LATIN CAPITAL LETTER I WITH DIAERESIS -->
  <entity name="ntilde">ñ</entity><!-- LATIN SMALL LETTER N WITH TILDE -->
  <entity name="Ntilde">Ñ</entity><!-- LATIN CAPITAL LETTER N WITH TILDE -->
  <entity name="oacute">ó</entity><!-- LATIN SMALL LETTER O WITH ACUTE -->
  <entity name="Oacute">Ó</entity><!-- LATIN CAPITAL LETTER O WITH ACUTE -->
  <entity name="ocirc">ô</entity><!-- LATIN SMALL LETTER O WITH CIRCUMFLEX -->
  <entity name="Ocirc">Ô</entity><!-- LATIN CAPITAL LETTER O WITH CIRCUMFLEX -->
  <entity name="ograve">ò</entity><!-- LATIN SMALL LETTER O WITH GRAVE -->
  <entity name="Ograve">Ò</entity><!-- LATIN CAPITAL LETTER O WITH GRAVE -->
  <entity name="oslash">ø</entity><!-- LATIN SMALL LETTER O WITH STROKE -->
  <entity name="Oslash">Ø</entity><!-- LATIN CAPITAL LETTER O WITH STROKE -->
  <entity name="otilde">õ</entity><!-- LATIN SMALL LETTER O WITH TILDE -->
  <entity name="Otilde">Õ</entity><!-- LATIN CAPITAL LETTER O WITH TILDE -->
  <entity name="ouml">ö</entity><!-- LATIN SMALL LETTER O WITH DIAERESIS -->
  <entity name="Ouml">Ö</entity><!-- LATIN CAPITAL LETTER O WITH DIAERESIS -->
  <entity name="szlig">ß</entity><!-- LATIN SMALL LETTER SHARP S -->
  <entity name="thorn">þ</entity><!-- LATIN SMALL LETTER THORN -->
  <entity name="THORN">Þ</entity><!-- LATIN CAPITAL LETTER THORN -->
  <entity name="uacute">ú</entity><!-- LATIN SMALL LETTER U WITH ACUTE -->
  <entity name="Uacute">Ú</entity><!-- LATIN CAPITAL LETTER U WITH ACUTE -->
  <entity name="ucirc">û</entity><!-- LATIN SMALL LETTER U WITH CIRCUMFLEX -->
  <entity name="Ucirc">Û</entity><!-- LATIN CAPITAL LETTER U WITH CIRCUMFLEX -->
  <entity name="ugrave">ù</entity><!-- LATIN SMALL LETTER U WITH GRAVE -->
  <entity name="Ugrave">Ù</entity><!-- LATIN CAPITAL LETTER U WITH GRAVE -->
  <entity name="uuml">ü</entity><!-- LATIN SMALL LETTER U WITH DIAERESIS -->
  <entity name="Uuml">Ü</entity><!-- LATIN CAPITAL LETTER U WITH DIAERESIS -->
  <entity name="yacute">ý</entity><!-- LATIN SMALL LETTER Y WITH ACUTE -->
  <entity name="Yacute">Ý</entity><!-- LATIN CAPITAL LETTER Y WITH ACUTE -->
  <entity name="yuml">ÿ</entity><!-- LATIN SMALL LETTER Y WITH DIAERESIS -->
</entities>
Note that this conversion was largely mechanical (specification authors are due to be replaced by a small shell script ...).
A.2 Example: Boilerplate or Subdocument Definition
A common requirement for organizations producing XML documents is a standard copyright notice, disclaimer, or other weasel-wording. Local guidelines probably suggest where this must appear, but the actual content is probably defined externally, once. Here's a simple copyright example.
<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml"
          uri="http://www.talsever.org/boilerplate/copyright"
          canonical="http://www.talsever.org/boilerplate/copyright.xml">
  <entity name="copyright"><p>Copyright © <a href="http://www.talsever.org/">Talsever</a>, All Rights Reserved.</p>
  </entity>
</entities>
The above form is perfectly standalone, but may not be the best solution. Here's an alternative. First, define the XML content in a file by itself.
<?xml encoding="utf-8"?>
<p>Copyright © <a href="http://www.talsever.org/">Talsever</a>, All Rights Reserved.</p>
Then import it, using the schema technique or the processing instruction technique. We will demonstrate this, below. For purposes of demonstration, we assert that the above fragment may be found at http://www.talsever.org/boilerplate/copyright.xhtml.
Any XML entity (well-formed fragment) may be assigned an entity name in this fashion. For purposes of discussion later on, imagine that we have also written three sections of a specification, which are each located at http://www.talsever.org/edml/, in the files header.xml, normative.xml, and informative.xml.
A.3 Example: Overriding Imported Definitions
When vocabularies containing entities are mixed, it is often necessary to manually override certain definitions. For instance, it may be that a set of Greek language entity definitions used by classical document authors could collide with a set of symbol entity definitions used in markup of mathematics (in a treatise on classical Greek geometry, perhaps?). As entities are generally given short, memorable names, in a global namespace, increasing use of entities leads to increasing likelihood of collisions. For another instance, imagine that the previous example had called the entity "copy" rather than "copyright", and an instance document also imported the ISO Numeric entities. For this example, though, we'll be frivolous.
<?xml version="1.0" encoding="utf-8"?>
<entities xmlns="http://www.talsever.org/namespaces/edml"
          uri="http://www.talsever.org/boilerplate/cuties"
          canonical="http://www.talsever.org/boilerplate/cuties.xml">
  <entity name="Aacute">Ah, a cutie!</entity>
  <entity name="Oacute">Oh, a cutie!</entity>
  <entities system="http://www.talsever.org/entities/latin1.edml" />
</entities>
Because of the priority rules, an import of this collection into a schema or instance document results in the definition of the &Aacute; and &Oacute; overrides, plus the rest of the Latin 1 set.
For purposes of demonstration later, imagine that the following is located in a file at http://www.talsever.org/entities/Aacute.xml:
<?xml encoding="UTF-8"?>
Ah, a cutie!
And the following is asserted to be located in a file at http://www.talsever.org/entities/Oacute.edml:
<?xml version="1.0" encoding="UTF-8"?>
<entity xmlns="http://www.talsever.org/namespaces/edml" name="Oacute">Oh, a cutie!</entity>
A.4 Example: Including Entity Definitions in a Schema
On to actually using the defined entities. One of the proposed methods is to incorporate the entities into a schema for a particular XML dialect. Having started with the Latin 1 entities, used by XHTML, and knowing that the XHTML working group is currently engaged in creating RELAX NG schemas for XHTML 2.0, here is how to add EDML entity definitions.
First, add a section to the end of the driver (http://www.w3.org/2002/06/xhtml2):
<div>
  <x:h2>Entities module</x:h2>
  <include href="entities.rng"/>
</div>
Next, define the entities grammar:
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:x="http://www.w3.org/1999/xhtml"
         xmlns:e="http://www.talsever.org/namespaces/edml">
  <x:h1>Entity Definitions Collections Module</x:h1>
  <div>
    <x:h2>Core Entity Definitions Collection</x:h2>
    <e:entities system="ISOamsa.edml" />
    <e:entities system="ISOamsb.edml" />
    <e:entities system="ISOamsc.edml" />
    <e:entities system="ISOamsn.edml" />
    <e:entities system="ISOamso.edml" />
    <e:entities system="ISOamsr.edml" />
    <e:entities system="ISObox.edml" />
    <e:entities system="ISOcyr1.edml" />
    <e:entities system="ISOcyr2.edml" />
    <e:entities system="ISOdia.edml" />
    <e:entities system="ISOgrk1.edml" />
    <e:entities system="ISOgrk2.edml" />
    <e:entities system="ISOgrk3.edml" />
    <e:entities system="ISOgrk4.edml" />
    <e:entities system="ISOlat1.edml" />
    <e:entities system="ISOlat2.edml" />
    <e:entities system="ISOnum.edml" />
    <e:entities system="ISOpub.edml" />
    <e:entities system="ISOtech.edml" />
  </div>
  <!-- other collections might be defined as well -->
</grammar>
Finally, convert existing entity collections (ISOxyzzy.ent) to EDML format (ISOxyzzy.edml).
Note that the relative ease of defining things in this fashion is one of the attractions of EDML. The drawback, as previously mentioned, is that no current RNG or XML processor understands that these are entity definitions, so it's rather a trophy wife at the moment.
A.5 Example: Including Entity Definitions Using Processing Instructions
And, for our final act, an actual, albeit trivial document that uses processing instructions to import previously defined examples, with some silly overrides.
<?xml version="1.0" ?>
<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.2//EN"
  "http://www.w3.org/2002/xmlspec/dtd/2.2/xmlspec.dtd">
<!-- redefine Aacute -->
<?entity Aacute http://www.talsever.org/entities/Aacute.xml ?>
<!-- redefine Oacute -->
<?entity http://www.talsever.org/entities/Oacute.edml ?>
<!-- redefine oacute using the content of the Oacute entity -->
<?entity oacute http://www.talsever.org/entities/Oacute.edml ?>
<!-- import all of latin 1 *except* the already defined cuties -->
<?entities http://www.talsever.org/entities/latin1.edml ?>
<!-- import the three parts of the document -->
<?entity header http://www.talsever.org/edml/header.xml ?>
<?entity normative http://www.talsever.org/edml/normative.xml ?>
<?entity informative http://www.talsever.org/edml/informative.xml ?>
<spec w3c-doctype="other" other-doctype="random-noise" status="int-review ">
  &header;
  &normative;
  &informative;
</spec>
Supposing that the document base URI is http://www.talsever.org/edml/ (the trailing slash matters: without it, header.xml would resolve to http://www.talsever.org/header.xml), then the following variant is possible:
<?xml version="1.0" ?>
<!DOCTYPE spec PUBLIC "-//W3C//DTD Specification V2.2//EN"
  "http://www.w3.org/2002/xmlspec/dtd/2.2/xmlspec.dtd">
<?entity Aacute /entities/Aacute.xml ?>
<?entity /entities/Oacute.edml ?>
<?entity oacute /entities/Oacute.edml ?>
<?entities /entities/latin1.edml ?>
<?entity header header.xml ?>
<?entity normative normative.xml ?>
<?entity informative informative.xml ?>
<spec w3c-doctype="other" other-doctype="random-noise" status="int-review">
  &header;
  &normative;
  &informative;
</spec>
Note, in the foregoing examples, that the character entities are not actually used in the document. Que sera, sera.
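The relative references in the second example resolve against the document base URI in the usual RFC 3986 manner. The sketch below uses Python's standard urllib.parse.urljoin to show how references of the two kinds used there would resolve, and how much depends on whether the base URI ends in a slash. The helper name resolve_all is hypothetical and exists only for this illustration.

```python
from urllib.parse import urljoin


def resolve_all(base, references):
    """Map each relative reference to its absolute form against base."""
    return {ref: urljoin(base, ref) for ref in references}


if __name__ == "__main__":
    refs = ["/entities/Aacute.xml", "header.xml"]

    # Base URI ending in a slash: header.xml stays under /edml/
    for ref, resolved in resolve_all("http://www.talsever.org/edml/", refs).items():
        print(ref, "->", resolved)
    # /entities/Aacute.xml -> http://www.talsever.org/entities/Aacute.xml
    # header.xml -> http://www.talsever.org/edml/header.xml

    # Base URI without the slash: the last path segment is replaced
    print(urljoin("http://www.talsever.org/edml", "header.xml"))
    # http://www.talsever.org/header.xml
```

References that begin with a slash are unaffected by the difference, which is why only the header, normative, and informative imports are sensitive to it.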
The following schema parses (via trang), but could contain errors. If the text differs from the schema, then the text rules.
<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.talsever.org/namespaces/edml"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start>
    <choice>
      <element name="entities">
        <ref name="entities-internal" />
      </element>
      <element name="entity">
        <ref name="entity-internal" />
      </element>
    </choice>
  </start>
  <define name="entities-content">
    <choice>
      <ref name="location-attributes" />
      <ref name="entities-internal" />
    </choice>
  </define>
  <define name="entities-internal">
    <ref name="entities-attributes" />
    <oneOrMore>
      <choice>
        <element name="entities">
          <ref name="entities-content" />
        </element>
        <element name="entity">
          <ref name="entity-content" />
        </element>
      </choice>
    </oneOrMore>
  </define>
  <define name="entity-content">
    <choice>
      <ref name="entity-internal" />
      <group>
        <ref name="location-attributes" />
        <optional>
          <ref name="name-attribute" />
        </optional>
      </group>
    </choice>
  </define>
  <define name="entity-internal">
    <ref name="name-attribute" />
    <ref name="any-mixed" />
  </define>
  <define name="any-mixed">
    <mixed>
      <!-- zeroOrMore, so that text-only content is allowed and the recursion terminates -->
      <zeroOrMore>
        <element>
          <anyName>
            <except>
              <nsName ns="http://www.talsever.org/namespaces/edml" />
            </except>
          </anyName>
          <ref name="any-mixed" />
        </element>
      </zeroOrMore>
    </mixed>
  </define>
  <define name="location-attributes">
    <attribute name="system">
      <data type="anyURI" />
    </attribute>
    <optional>
      <attribute name="public" />
    </optional>
    <!-- always empty if this attribute set is present -->
    <empty />
  </define>
  <define name="entities-attributes">
    <optional>
      <attribute name="uri">
        <data type="anyURI" />
      </attribute>
    </optional>
    <optional>
      <attribute name="canonical">
        <data type="anyURI" />
      </attribute>
    </optional>
    <optional>
      <attribute name="version" />
    </optional>
  </define>
  <define name="name-attribute">
    <attribute name="name">
      <data type="NCName" />
    </attribute>
  </define>
</grammar>
The same schema in RELAX NG compact syntax:
default namespace edml = "http://www.talsever.org/namespaces/edml"

start =
  element entities { entities-internal }
  | element entity { entity-internal }
entities-content = location-attributes | entities-internal
entities-internal =
  entities-attributes,
  (element entities { entities-content }
   | element entity { entity-content })+
entity-content =
  entity-internal
  | (location-attributes, name-attribute?)
entity-internal = name-attribute, any-mixed
# zero or more, so that text-only content is allowed and the recursion terminates
any-mixed = mixed { element * - edml:* { any-mixed }* }
location-attributes =
  attribute system { xsd:anyURI },
  attribute public { text }?,
  # always empty if this attribute set is present
  empty
entities-attributes =
  attribute uri { xsd:anyURI }?,
  attribute canonical { xsd:anyURI }?,
  attribute version { text }?
name-attribute = attribute name { xsd:NCName }
2004 Apr 25: Modified the semantics of the entity element and processing instruction, so that entity may be used standalone. Enabled renaming of entities on import. Revised the schemas (again).
2004 Apr 25: Added more divisions in the syntax section, to break up long blocks of text and identify each piece. Improved (well, arguably so, anyway, everyone's a critic!) the examples.