Output and Compare Canonical XML

Last modified by superadmin on 2018-01-12 21:33

Public source: http://java-eim.googlecode.com/svn/trunk/java-eim-demo-xmlsamples
Local source: /home/student/workspace/java-eim-demo-xmlsamples
Technologies: Java (J2SDK 1.6), Ant and Maven2 build tools, JDeveloper IDE, Xerces XML library, simple edit and command-line utilities
Estimated time: 15 minutes

Since many different files can represent the same XML data structure, Canonical XML is sometimes used to check whether two XML files contain the same information, to create digital signatures and message digests for XML documents, and to better understand certain details of XML syntax.

General Description and Scope

A few simple XML documents are provided. Some of them might have DTDs. A small program or script reads these documents and writes them to another directory in "canonical" form. One can then run the "diff" utility to compare them, i.e. to see whether they contain the same information. The examples illustrate the following syntax features:

  1. XML files may have different encodings; in encodings other than UTF-8 (e.g. "windows-1257" = "Baltic Windows", "windows-1251" = "Cyrillic Windows", or "ISO-8859-1" = "Latin Western"), characters that the encoding cannot represent directly may need numeric character references (Unicode escapes). (See sample1a.xml and sample1b.xml in directory java-eim-demo-xmlsamples/src/test/resources/xml-in/)
  2. Attribute order is immaterial, and XML attributes may get default values
  3. The information in an XML document does not depend on which namespace prefixes are chosen for elements and attributes
  4. Special characters may be written as escapes or placed in CDATA sections
  5. XML entities may be defined and used
  6. Depending on the situation, whitespace is either ignored, or normalized, or preserved
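
To make features 2 and 4 concrete, here is a minimal sketch that writes two documents which differ byte by byte but describe the same element, attributes, and text. The file names pair-a.xml and pair-b.xml and their contents are hypothetical; they are not among the provided samples. A canonicalizer that follows the W3C Canonical XML recommendation should produce identical output for both; whether the tool used in this exercise does so is for you to verify.

```shell
#!/bin/sh
# write_pair DIR: create two hypothetical documents (not the provided samples).
# pair-a.xml uses a CDATA section and lists "lang" before "id";
# pair-b.xml uses the &amp; escape and lists "id" before "lang".
write_pair() {
    cat > "$1/pair-a.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<note lang="en" id="n1"><![CDATA[Tom & Jerry]]></note>
EOF
    cat > "$1/pair-b.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<note id="n1" lang="en">Tom &amp; Jerry</note>
EOF
}

# Example use (commented out so that sourcing this file has no side effects):
#   write_pair /tmp
#   cmp /tmp/pair-a.xml /tmp/pair-b.xml    # reports the first differing byte
```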

Provided resources

  • Run a script to make files canonical. 
  • Use a text editor (JDeveloper, "vi", mc's "Edit" function, or "Kate") to edit XML files whenever necessary. 
  • Run the provided command-line script to compare the files (it uses "diff" utility). 


Brief description: In this exercise you run a command-line script to canonize XML. If necessary, edit some of the files so that they become equal to the given file.

  • Open a console (click the black icon on the taskbar) and go to the root directory of the project java-eim-demo-xmlsamples, i.e. /home/student/workspace/java-eim-demo-xmlsamples
  • Run the script canonize.sh. (Or just inspect the 10 provided sample files manually; they are very short.) If there are any XML parse errors, correct them so that the DOM structures of both XML file versions "sampleNa" and "sampleNb" (e.g. src/test/resources/xml-in/sample1a.xml and src/test/resources/xml-in/sample1b.xml) are equal, even if the files themselves are not equal when compared byte by byte. (You may need "vi" or a more capable editor to change the files. To learn more about the "vi" editor, see the attached file "vi Cheat Sheet".)
  • Run the script to compare the files. Note for which samples the files "sampleNa" and "sampleNb" differ. 
cd /home/student/workspace/java-eim-demo-xmlsamples/src/test/resources/xml-out/
diff sample1a.xml sample1b.xml
diff sample2a.xml sample2b.xml
diff sample3a.xml sample3b.xml
diff sample4a.html sample4b.html
diff sample5a.xml sample5b.xml
  • Answer these questions about the XML canonization process implemented by "java dom.Writer -c infile > outfile":
      a. Does the canonization replace non-Unicode encodings with UTF-8?
      b. Does it remove the processing instruction?
      c. Are attributes declared #FIXED (if the XML has an attached DTD) inserted into the document?
      d. Are default values for attributes (for example, of enumerated type) inserted into the document if the XML has an attached DTD? What happens to attributes declared #IMPLIED, which have no default value?
      e. Are the CDATA sections replaced (and all special characters replaced with their escape sequences)?
      f. How are the "&" characters represented in canonical files?
      g. Is whitespace normalized in XML free text (inside elements) or inside attribute values (cf. http://www.w3.org/TR/xml-c14n#Example-WhitespaceInContent)?
      h. Are the attribute names of each element sorted in some standard order?
      i. How are empty elements represented after canonization, i.e. does the line break element look like "<br />", "<br/>", or "<br></br>"?
  • Notice that in order to answer these questions, you may need to edit some of the samples and do a couple of experiments (e.g. with the "canonical" empty elements). In order to do this, add some XML stuff (e.g. "<experiment><br /><br/><br></br></experiment>") to one of the sample files and see what happens after canonization.
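
The experiment and comparison steps above can be sketched as a few shell functions. This is only an illustration under stated assumptions: the function names and the separate file experiment.xml are hypothetical (the exercise itself suggests adding the fragment to an existing sample), and canonize_one assumes that Xerces and its sample classes, which provide dom.Writer, are already on the CLASSPATH. compare_pairs generalizes the individual diff commands by pairing every "sampleNa" file with its "sampleNb" sibling.

```shell
#!/bin/sh
# write_experiment FILE: create a tiny document containing the three
# spellings of an empty element mentioned in the exercise.
write_experiment() {
    cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<experiment><br /><br/><br></br></experiment>
EOF
}

# canonize_one IN OUT: run the canonicalizer named in the exercise.
# Assumption: dom.Writer (from the Xerces samples) is on the CLASSPATH.
canonize_one() {
    java dom.Writer -c "$1" > "$2"
}

# compare_pairs DIR: diff each sampleNa.* against sampleNb.* in DIR.
# Prints one line per pair; returns 1 if any pair differs or is incomplete.
compare_pairs() {
    dir=$1
    rc=0
    for a in "$dir"/sample*a.*; do
        [ -e "$a" ] || continue        # the glob matched nothing
        name=${a##*/}                  # e.g. sample1a.xml
        stem=${name%a.*}               # e.g. sample1
        ext=${name##*.}                # e.g. xml
        b="$dir/${stem}b.$ext"
        if [ ! -e "$b" ]; then
            echo "MISSING   $b"
            rc=1
        elif diff "$a" "$b" > /dev/null; then
            echo "SAME      $stem"
        else
            echo "DIFFERENT $stem"
            rc=1
        fi
    done
    return $rc
}

# Example use from the project root (commented out; paths follow the text):
#   write_experiment src/test/resources/xml-in/experiment.xml
#   canonize_one src/test/resources/xml-in/experiment.xml \
#                src/test/resources/xml-out/experiment.xml
#   compare_pairs src/test/resources/xml-out
```

The variables inside the functions are not local, since plain "sh" has no portable way to declare them; in a longer script you may want to choose less generic names.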


The provided file canonize.sh is obviously wrong. It has all the problems that distinguish Windows BAT scripts from Linux "sh" scripts: 

  1. It does not have execute permission: when trying to run "./canonize.sh", one gets "Permission denied" (this can be resolved with the "chmod 770 canonize.sh" command, which grants read, write, and execute permission to the owner and group). 
  2. It has Windows-style (not UNIX-style) line breaks, i.e. each line ends with two bytes (carriage return + line feed, "\r\n") rather than one byte ("\n").
  3. It lacks the first line indicating the scripting language (i.e. "#! /bin/sh -e").
  4. It uses "set CLASSPATH=..." instead of the correct Unix/Linux command "export CLASSPATH=..." to define an environment variable. 
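
The four repairs listed above could be automated roughly as follows. This is a sketch, not the corrected canonize.sh from the attachment: the function name fix_script is hypothetical, it removes every carriage return in the file (not only those at line ends), and it does not touch other Windows conventions such as ";" as the classpath separator, which you would still have to review by hand.

```shell
#!/bin/sh
# fix_script FILE: repair the four problems described above, in place.
fix_script() {
    f=$1
    tmp=$f.fixed.$$
    # Problems 2 and 4: drop carriage returns, then turn a leading
    # "set NAME=" into "export NAME=".
    tr -d '\r' < "$f" |
        sed 's/^set \([A-Za-z_][A-Za-z0-9_]*=\)/export \1/' > "$tmp"
    # Problem 3: prepend the interpreter line unless one is already present.
    first=
    IFS= read -r first < "$tmp" || true
    case $first in
        '#!'*) cat "$tmp" > "$f" ;;
        *)     { echo '#! /bin/sh -e'; cat "$tmp"; } > "$f" ;;
    esac
    rm -f "$tmp"
    # Problem 1: let the owner execute the script.
    chmod u+x "$f"
}

# Example use (commented out so that sourcing this file changes nothing):
#   fix_script canonize.sh && ./canonize.sh
```

Here "chmod u+x" adds execute permission for the owner only and leaves the other permission bits unchanged, which is a narrower change than "chmod 770".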

Please see the attachment on the Wiki page for a correct canonize.sh file. Thanks to Jevgenijs Goreliks for suggesting these corrections!

Created by Kalvis Apsītis on 2007-10-13 14:03
This wiki is licensed under a Creative Commons 2.0 license