XMLCatalogBuilder is here!

If you’re like us, working a lot with XML files that follow XML Schema definitions from 3rd parties, you will often have cursed the people who publish libraries of XML Schema definitions. Most of them simply bundle everything that their schemas <import> or otherwise reference into their package. Hence, you end up with several copies of everything floating around your hard drive. There is no one true way of publishing XML Schema packages, but there is one true way to use them: XML Catalog.

XML Catalog is a way of giving your editor and your validation engine a lookup table of identifiers (system IDs, namespaces, etc.) and locations (e.g. on the local file system or on the web). So when the editor or validator encounters a namespace, system ID, or similar that it hasn’t seen before, it consults the XML Catalog to find the location of the defining entity (DTD or XML Schema). That would be cool, wouldn’t it? Why “would”? Well, because the metadata definition publishers have put an obstacle between us and paradise: @schemaLocation. Sometimes it says “./myschema.xsd”, but sometimes it also says more threatening things like “C:\\My Documents\\…”. So we should be using @schemaLocation if and only if everything else has failed. Luckily, most XML tools provide a setting for this. In my tool, it’s called this:
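To make the “lookup table” idea concrete, here is a minimal sketch in Python of how a tool might read a catalogue and resolve an identifier. The element names (`uri`, `system`) and the catalogue namespace come from the OASIS XML Catalogs standard; the example namespace, system ID, and file paths are made up for illustration, and real resolvers support many more entry types than this.

```python
import xml.etree.ElementTree as ET

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical catalogue: the identifiers and paths are invented for this example.
CATALOG_XML = f"""<?xml version="1.0"?>
<catalog xmlns="{CATALOG_NS}">
  <uri name="http://example.com/ns/demo" uri="Example/Demo 1.0/demo.xsd"/>
  <system systemId="http://example.com/dtd/demo.dtd" uri="Example/Demo 1.0/demo.dtd"/>
</catalog>
"""


def load_catalog(xml_text):
    """Return a dict mapping identifiers (URI names, system IDs) to locations."""
    root = ET.fromstring(xml_text)
    table = {}
    for entry in root.findall(f"{{{CATALOG_NS}}}uri"):
        table[entry.get("name")] = entry.get("uri")
    for entry in root.findall(f"{{{CATALOG_NS}}}system"):
        table[entry.get("systemId")] = entry.get("uri")
    return table


def resolve(table, identifier):
    """Look up an identifier; None means the catalogue has no entry for it."""
    return table.get(identifier)


if __name__ == "__main__":
    table = load_catalog(CATALOG_XML)
    print(resolve(table, "http://example.com/ns/demo"))
```

A `None` result is the point at which a tool would fall back to other hints, such as @schemaLocation.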

“If Process namespaces through URI mappings for XML Schema is selected then the target namespace of the imported XML Schemas is resolved through the uri mappings. The namespace is taken into account only when the schema specified in the schemaLocation attribute was not resolved successfully.”

This means that @schemaLocation will only be used if all else has failed. Cool, that’s what we want.

Now that you’ve switched your XML tool to “auto-pilot”, how should you organise the 3rd party packages that you will be using for authoring or validating schemas and instance documents? We suggest a straightforward, 3-level directory tree. The top level is a single directory, which is the root of the tree (e.g. “XML Schemas”). Within that, there is one directory for each source of metadata definitions (e.g. “W3C” or “MPEG”), and below that one directory per package and version (e.g. “MPEG-7 2008”, “MPEG-7 2015”, etc.). Into these package/version directories, you simply extract the respective ZIP (or whatever form the package was delivered in).
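If you want to check that your tree really follows this root/source/package layout, a few lines of Python will list what is there. This is only a sketch: the function name `list_packages` is our own invention for this example, and it simply treats every directory at the second level as a package.

```python
from pathlib import Path


def list_packages(root):
    """Yield (source, package) name pairs for a root/source/package layout."""
    root = Path(root)
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for package_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
            yield source_dir.name, package_dir.name


if __name__ == "__main__":
    # "XML Schemas" is the example root directory name used above.
    for source, package in list_packages("XML Schemas"):
        print(f"{source}: {package}")
```

Files lying directly in the root or in a source directory are ignored, so stray downloads do not show up as packages.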

So far so good, but how is your XML editing/validation tool going to find all these files? XML Catalog, fine. But who will create the catalogue files? If you are using a couple of serious packages, there can easily be hundreds of files that need to be listed in the catalogue. You won’t manage that by manual editing. And how do you handle additions and new versions? Sounds like a full-time job, doesn’t it?

But do not despair! We are here to save you. We had the exact same problem, and we created a solution for it. It’s called XMLCatalogBuilder, and we’re hosting it on GitHub. Grab the Perl script from there, arrange your XML metadata definitions as described, and run the script on the tree. Whenever something changes (e.g. you add or remove something), just run it again and all catalogues will be regenerated in place. No fuss, no muss. Happy XML-ing!
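For the curious, the general idea of generating a catalogue from a directory tree can be sketched in a few lines of Python. To be clear, this is not the XMLCatalogBuilder Perl script and makes no claim about how that script works internally: it is a simplified illustration that assumes one `uri` entry per schema file, keyed by the schema’s targetNamespace, and returns a single catalogue as a string instead of writing catalogues in place. The function name `build_catalog` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from urllib.parse import quote

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(root):
    """Scan root for *.xsd files and return catalogue XML as a string.

    Illustrative only: maps each schema's targetNamespace to its path
    relative to root. Schemas without a targetNamespace are skipped.
    """
    root = Path(root)
    ET.register_namespace("", CATALOG_NS)  # serialise without a prefix
    catalog = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for xsd in sorted(root.rglob("*.xsd")):
        try:
            namespace = ET.parse(str(xsd)).getroot().get("targetNamespace")
        except ET.ParseError:
            continue  # not well-formed XML, leave it out
        if namespace:
            # Percent-encode the path so directory names with spaces stay valid URIs.
            location = quote(xsd.relative_to(root).as_posix())
            ET.SubElement(catalog, f"{{{CATALOG_NS}}}uri",
                          name=namespace, uri=location)
    return ET.tostring(catalog, encoding="unicode")


if __name__ == "__main__":
    # "XML Schemas" is the example root directory name used above.
    print(build_catalog("XML Schemas"))
```

One known gap in this sketch: when several files share a target namespace, it emits one entry per file, and a resolver will typically only use the first match. A real generator has to decide which file is the entry point for each namespace.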