XÖV: Using the wrong tool for the job

18 Mar, 2023

I am currently working on tooling for a German e-gov standard called XDatenfelder. It is part of a larger collection of standards called XÖV. As the technical landscape within the German government is extremely diverse, the goal of XÖV is to standardize the exchange of data between programs from different vendors. In theory, I think this is a great idea. In practice, it turns out to be very frustrating to work with. And that is because the X in XÖV stands for XML.

To be clear, this is not a post about why XML is a bad technology. Every tool looks bad when used to solve the wrong problem. Unfortunately, this is exactly what has happened a lot in XÖV. When I said that the project aims to standardize the exchange of data, I really meant just about any kind of data. And regardless of what that data actually looks like, you are already locked into XML as the transport medium, which turns out to be a bad choice for most of the data defined by the standards.

Structured Data

The most common use case for XML in XÖV is as a transport format for structured messages. But unlike notable alternatives such as JSON or Protocol Buffers, the mapping between XML and a struct (or class, dict, object or whatever is used in your programming language of choice for structuring data) is ambiguous. For example, it is not obvious whether a value belongs in an attribute or in a child element, or whether a single child element represents a scalar or a list with one entry. This is because XML was not designed as a serialization target for structured data; it was designed as a markup language for text documents. Parsing XML into a workable data structure is therefore always that extra bit more tedious than it should be.
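To make the ambiguity a bit more concrete, here is a minimal sketch in Python. The `person` messages and the `naive_to_dict` helper are made up for illustration and have nothing to do with any actual XÖV schema; the point is only that a generic XML-to-dict mapping has to guess, while the JSON equivalent does not.

```python
import json
import xml.etree.ElementTree as ET

# Two hypothetical messages following the same made-up schema.
ONE_PHONE = "<person id='1'><name>Ada</name><phone>123</phone></person>"
TWO_PHONES = (
    "<person id='1'><name>Ada</name>"
    "<phone>123</phone><phone>456</phone></person>"
)


def naive_to_dict(element):
    """Naive mapping: attributes and children share one dict,
    and a tag only becomes a list once it appears a second time."""
    result = dict(element.attrib)
    for child in element:
        value = child.text if len(child) == 0 else naive_to_dict(child)
        if child.tag in result:
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


one = naive_to_dict(ET.fromstring(ONE_PHONE))
two = naive_to_dict(ET.fromstring(TWO_PHONES))

# Without consulting the schema, the shape of "phone" depends on the data:
# a plain string in the first message, a list in the second.
print(one)
print(two)

# The JSON counterpart carries the structure (and the number type) itself.
as_json = json.loads('{"id": 1, "name": "Ada", "phones": ["123"]}')
print(as_json)
```

A real implementation would of course use the schema to decide which elements are lists and which strings are actually numbers. But that is exactly the extra work that a format designed for structured data does not ask for.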

There is another important area in which the origin of XML shows: handling whitespace. The following XML fragment was copied from an example included in the standard XDatenfelder v3.0.0:

<xdf:freitextRegel>Es muss entweder Feldgruppe G60000086 "Anschrift Inland Straßenanschrift" oder Feldgruppe
                G60000087 "Anschrift Inland Postfachanschrift" befüllt werden.
</xdf:freitextRegel>

Ignoring the weird indentation, this should look fine to anyone who has ever worked with HTML. However, according to the standard, the content of <xdf:freitextRegel> has the type xs:string. This means all of the whitespace should be left as-is, so the line breaks and the long indentation in front of "G60000087" are part of the string. This is probably not intended. And just collapsing the whitespace regardless of the actual type is not an option: the tag sometimes contains pre-formatted text which you actually want to use as-is.
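Here is a small sketch of what a parser hands back for this fragment, using Python's built-in `xml.etree.ElementTree`. The namespace URI is a placeholder I made up so that the fragment parses on its own; the real one is defined by the standard. The `collapsed` variant imitates the whitespace handling of a type like xs:token, which is not what the schema asks for here.

```python
import re
import xml.etree.ElementTree as ET

# The fragment from above. The xmlns:xdf value is a placeholder, not the
# namespace used by XDatenfelder.
DOCUMENT = (
    '<xdf:freitextRegel xmlns:xdf="urn:example:xdf">'
    'Es muss entweder Feldgruppe G60000086 "Anschrift Inland Straßenanschrift" oder Feldgruppe\n'
    '                G60000087 "Anschrift Inland Postfachanschrift" befüllt werden.\n'
    '</xdf:freitextRegel>'
)

# What xs:string means: the text is taken exactly as written.
raw = ET.fromstring(DOCUMENT).text

# What the author probably had in mind: runs of XML whitespace collapsed
# into single spaces, with leading and trailing spaces removed.
collapsed = re.sub(r"[ \t\r\n]+", " ", raw).strip(" ")

print(repr(raw))
print(repr(collapsed))
```

The first `repr` shows the embedded line break, the indentation and the trailing newline; the second one shows the single-line sentence. The catch remains that the consumer cannot tell from the data alone which of the two was meant.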

Even though the badly formatted text is technically an error made by the author of the example, I am having a hard time putting all of the blame on them, because I can easily see myself making the exact same mistake. It is possible to use XML correctly as a data exchange format; it is just more error-prone and less intuitive than the available alternatives.

Files

Another standard in XÖV, meant for safely transferring arbitrary data between actors, is called OSCI. In my opinion there are several design flaws in OSCI, but at the center of them is the use of XML as the transport format. One important use case of OSCI is to allow sending and receiving large files. To avoid the headache of transferring arbitrary files using XML, the standard chose to use SOAP, an XML-based messaging protocol with support for binary attachments. Still, the solution had severe performance problems when working with large files. So the standard added extra complexity for splitting the payload into multiple chunks and reassembling them at the end. The frustrating part is that this problem is already solved in HTTP. But since SOAP is transport-independent, the standard does not require the use of HTTP either, but rather markets this "flexibility" as an advantage. As a result, all the nice features of HTTP are effectively unusable when working with OSCI. So instead of just using a well-defined HTTP endpoint which can efficiently receive and process a stream-friendly binary data format, you now have to create multiple XML-based SOAP messages with a binary attachment after manually splitting the payload into several chunks. Great.
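For comparison, this is roughly what the sending side of a plain HTTP upload can look like with nothing but Python's standard library. It is only a sketch under my own assumptions: the `upload` function, its `host` and `path` parameters and the choice of `PUT` are hypothetical and not taken from OSCI or any other standard.

```python
import http.client
import io


def iter_chunks(fileobj, chunk_size=64 * 1024):
    """Yield the content in bounded pieces, so the file never has to be
    held in memory as a whole."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk


def upload(host, path, fileobj):
    """Stream a file to a hypothetical HTTPS endpoint (host and path are
    placeholders) and return the response status code."""
    connection = http.client.HTTPSConnection(host)
    try:
        # Given an iterable body and no Content-Length header, http.client
        # sends the request with chunked transfer encoding (Python 3.6+).
        connection.request(
            "PUT",
            path,
            body=iter_chunks(fileobj),
            headers={"Content-Type": "application/octet-stream"},
        )
        return connection.getresponse().status
    finally:
        connection.close()


# Local demonstration of the chunking only; no network access involved.
payload = io.BytesIO(b"x" * 150_000)
sizes = [len(chunk) for chunk in iter_chunks(payload)]
print(sizes)
```

The splitting and reassembling still happens, but on the wire, handled by the HTTP library on either side, instead of being part of the message format every implementer has to get right.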

Standards Documents

Even though the documents are written in my native tongue, I have never had such a hard time grasping the contents of a software-related standard. The main reason for that is the heavy reliance on XSD within the documents themselves. But XSD is simply not an acceptable alternative to human-readable definitions. I understand the motivation to provide XSD files along with the standard to simplify certain workflows. But standards should not be written for machines, they should be written for humans. And again, it is technically possible to get all the information from the XSD files. This, however, follows the common theme of making the life of a developer working with the standard that extra bit more difficult than it could be.

What's Next?

To be clear, I am very new to the German e-gov landscape and there is almost certainly important information I do not know about. But hopefully this outside perspective can also be valuable. And from this point of view, the whole "let's use XML everywhere, regardless of the use case" approach seems extremely dogmatic and ill-advised. Not only is XML not the right format for the very common case of working with structured data, but locking into any fixed data format for all possible use cases will always lead to sub-optimal results.

So in the long run, the standardization process must become more flexible to allow picking the right tool for the job on a case-by-case basis. Yes, this will lead to people needing to learn new tools once in a while and maybe even to some amount of duplicated work here and there. But in my experience, this up-front effort is always limited and short-term. Working with the wrong technology - even though completely avoidable from the start - is costly for the whole lifetime of the products built on top of it.