About This Blog

Hi, I'm Ben Pryor. This blog contains my thoughts about general software engineering topics, and occasionally specifics that I find interesting. If you see something here that sparks your interest, please feel free to comment on a post or send me an email at ben at benpryor.com.

21 April 2006 - 15:54

XSD to SAX Parser Generator

Motivation

It’s always a nice feeling to end the work week by doing something cool. Every software job has the fun, exciting parts that you look forward to, and the boring parts that no one wants but still have to be done. Something I try to do is to save at least one cool thing each week for Friday afternoons - leaving for the weekend on a high note always seems to make the next Monday feel better.

So today I made some time to do something I thought was pretty cool: a code generator that creates a SAX-based XML parser in Java. The input to the code generator is an XML Schema (XSD) file that describes the XML format, and the output is a bunch of Java classes that do typed XML parsing based on the schema.

For a feature I’m currently in the middle of, I needed to parse some XML that has scant documentation. Luckily for me an XML Schema existed…

Why Code Generation?

I could have used some third party XML databinding library (like Castor), but that didn’t feel like a good fit. I wanted a simple and fast parser, and didn’t want to pay the tax of yet another dependency for an ancillary need (my project already has dependencies to the tune of an 18MB download :)).

Of course, I could have just written a SAX parser by hand. In fact, I did write one by hand at first. As I was writing it, I kept thinking about how much of the code was very similar, and how the code was prone to typos. The particular schema I’m parsing is kind of complex, and I wanted to be certain my parser was 100% correct.
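To make the repetition concrete, here is a minimal sketch of what a hand-written SAX handler tends to look like. It is not the parser from this project: the `<order>` format, the `OrderHandler` class, and every element and attribute name are invented for illustration. Only the SAX calls themselves (`DefaultHandler`, `SAXParserFactory`, `Attributes.getValue`) are the standard JDK API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical hand-written handler for an invented <order> document.
class OrderHandler extends DefaultHandler {
    String orderId;
    String customer;
    final List<String> items = new ArrayList<>();

    private final StringBuilder text = new StringBuilder();
    private String currentSku;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting character data for this element
        // One branch per element: the same shape repeated, with only the
        // string literals changing. A misspelled literal fails silently.
        if ("order".equals(qName)) {
            orderId = attrs.getValue("id");
        } else if ("item".equals(qName)) {
            currentSku = attrs.getValue("sku");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("customer".equals(qName)) {
            customer = text.toString().trim();
        } else if ("item".equals(qName)) {
            items.add(currentSku + "=" + text.toString().trim());
        }
    }

    // Convenience entry point; wraps checked exceptions to keep callers short.
    static OrderHandler parse(String xml) {
        try {
            OrderHandler handler = new OrderHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

With only three elements the branches are manageable, but a schema with dozens of types multiplies them, and each one is another string literal that the compiler cannot check.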

So I decided to write a code generator instead. The generated parser wouldn’t have any typos, and I could spend my time writing interesting code instead of copy-and-pasting similar blocks of SAX parser code. Also, if my project needs to generate parsers for other schemas in the future (which is likely on this project), it’ll be simple - the code generator is generalized and doesn’t make many assumptions about the input XSD.

Technical Description

The code generator is pretty simple. The generator itself uses SAX to parse the XSD and build an in-memory representation of the types in the schema (xs:complexType and xs:simpleType) and the relationships between the types. The type model is then fed to the codegen engine, which uses the Jakarta Velocity templating engine to create the output Java files.
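As a rough sketch of the first stage, the handler below reads an XSD with SAX and records the named types and which types each complex type refers to. It is a simplified stand-in, not the generator's actual model: the `SchemaModelBuilder` class and its fields are hypothetical, and it assumes only named, top-level types matter (anonymous nested types, `ref` attributes, and extensions are ignored).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified type model: named types and their element references.
class SchemaModelBuilder extends DefaultHandler {
    static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // complex type name -> declared types of its directly contained elements
    final Map<String, List<String>> complexTypes = new LinkedHashMap<>();
    final List<String> simpleTypes = new ArrayList<>();

    private String currentType; // named complex type being read, if any
    private int complexDepth;   // nesting depth of xs:complexType elements

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (!XSD_NS.equals(uri)) {
            return; // ignore anything outside the XML Schema namespace
        }
        String name = attrs.getValue("", "name");
        if ("complexType".equals(localName)) {
            complexDepth++;
            if (complexDepth == 1 && name != null) {
                currentType = name;
                complexTypes.put(name, new ArrayList<>());
            }
        } else if ("simpleType".equals(localName) && name != null) {
            simpleTypes.add(name);
        } else if ("element".equals(localName) && complexDepth == 1 && currentType != null) {
            String type = attrs.getValue("", "type");
            if (type != null) {
                complexTypes.get(currentType).add(type);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (XSD_NS.equals(uri) && "complexType".equals(localName)) {
            complexDepth--;
            if (complexDepth == 0) {
                currentType = null;
            }
        }
    }

    static SchemaModelBuilder parse(String xsd) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on the xs: namespace
            SchemaModelBuilder builder = new SchemaModelBuilder();
            factory.newSAXParser().parse(new InputSource(new StringReader(xsd)), builder);
            return builder;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema", e);
        }
    }
}
```

A model along these lines is what a template engine would then iterate over, one output class per entry.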

This was the first time I’ve used Velocity, and it worked really well. When I’ve done code generators before, I’ve either done ad-hoc templates or embedded code as strings inside the generator classes. Using Velocity allowed me to cleanly separate the codegen engine and the templates themselves, and the final product is much more understandable than other codegen layers I’ve written before. Velocity is very easy to learn and use - highly recommended.

The code generator uses 5 different templates. 3 of them are for representing the XML data as Java objects - there’s a template for first-class Java objects that correspond to xs:complexType types, a template that creates a base class for the complex types, and a template for the xs:simpleTypes that I wanted to represent as classes. Usually the xs:simpleTypes can be modeled as primitives, but occasionally (a type restricted by xs:enumeration facets, for instance) it makes more sense to have them be classes. The remaining 2 templates cover the parser class itself and a class containing constants that appear in the XML (like element and attribute names).
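For a sense of what two of these outputs might look like, here is a hedged sketch of a constants class and a class for an enumerated simple type. The names (`XmlNames`, `OrderStatus`, the `open`/`closed` values) are invented, and using a Java `enum` is an assumption of this sketch rather than a description of what the templates actually emit.

```java
// Hypothetical output of the constants template: names that appear in the XML.
final class XmlNames {
    static final String ELEM_ORDER = "order";
    static final String ATTR_STATUS = "status";

    private XmlNames() {
        // constants only, never instantiated
    }
}

// Hypothetical output of the simple-type template for a type restricted to
// the enumeration values "open" and "closed".
enum OrderStatus {
    OPEN("open"),
    CLOSED("closed");

    private final String xmlValue;

    OrderStatus(String xmlValue) {
        this.xmlValue = xmlValue;
    }

    String xmlValue() {
        return xmlValue;
    }

    // Maps the text found in the document to a typed value, rejecting
    // anything the schema does not allow.
    static OrderStatus fromXml(String value) {
        for (OrderStatus status : values()) {
            if (status.xmlValue.equals(value)) {
                return status;
            }
        }
        throw new IllegalArgumentException("Unknown status: " + value);
    }
}
```

The parser class can then refer to `XmlNames.ATTR_STATUS` instead of repeating string literals, and hand back `OrderStatus.fromXml(...)` instead of a raw string.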

An interesting implementation note - I originally tried writing the codegen with very fine-grained templates. That design used basically one template for each repeated code block. This didn’t work out well - a lot of the templating logic ended up in the code generator instead of in the templates. Once I moved to coarse-grained templates (a template per class), the code generator got a lot cleaner. This required using things like Velocity #foreach constructs inside the templates.

The most complicated part of the generator is the part that models the XML schema types and the relationships between them. I spent the longest amount of time on this, but once it was right everything else fell into place.

Conclusion

It feels great to feed an XSD into my code generator and see the parser code get spit out. If you want something in between XML databinding and writing parsers by hand, code generation of parsers is a good way to go.


18 April 2006 - 10:52

UML

UML (Unified Modeling Language) is a commonly used modeling tool in many software development organizations. UML has been around for about a decade (read about the history of UML), and it’s enjoyed relative popularity during that time. UML is often misapplied in software projects, but when used appropriately, it can be a valuable tool.

What UML is Good At

UML is best used as a partial description of an object-oriented model. For reasons I’ll explore further below, using UML to completely document every aspect of a model, down to the last class and method, is at best a waste of time and at worst damaging to a project. UML is great for concisely capturing a portion of a model, bringing a coworker up to speed on a design, or making quick throwaway sketches.

I often use UML to help capture some model knowledge that I don’t want to forget. Perhaps there’s a complex interaction between a set of classes, or an inheritance hierarchy that’s just a little too deep and not easily understandable. By diagramming out a small portion of the model, I may be able to capture enough model knowledge to save time when I come back to the model in the future. The diagram serves to jog my memory about the model, and this task is what I use UML for most often.

A short, simple UML diagram is also one of the best tools for collaboration between team members. Say a colleague comes into my office to discuss a design. I could spend 30 minutes or more in a verbose, verbal discussion of the design. Alternatively, we could go through the code together and spend a lot of time coming up to speed on the design. However, by drawing a UML diagram on my whiteboard, I can convey the same information in a more concise form. This saves time, and allows my colleague and me to get right to the issue at hand instead of spending too much time on background details.

It’s helpful to think of UML as a common language between you and your coworkers. Having a team that can quickly share information and designs in visual terms that everyone understands and can comment on is very valuable. Informal design reviews are common for many teams, and the ability to represent a design in common terms makes design reviews flow much more smoothly. In this usage, UML is a lot like design patterns - it gives a team a common lexicon in order to express concepts that everyone is already familiar with.

UML also excels at “back of the napkin” type sketches. These kinds of diagrams are sloppy, imprecise, and usually thrown away quickly. They’re used to quickly jot down a few thoughts about a design or model, or perhaps to prototype a design. Often by spending 5 minutes or less making such a sketch, I can spot problems with a design that wouldn’t have been apparent right away if I had jumped directly into an implementation.

What UML Isn’t So Good At

In college, I got my introduction to UML through a professor who knew the modeling language quite well. We used a nimble little software tool called Rational Rose :). Although I’m going to use my college experience as an example of a misapplication of UML, I want to stress that my professor was actually quite a good teacher of UML (and software engineering in general), and I learned many modeling concepts that I still use on a daily basis.

The idea was that you would do a big up front design before you wrote a line of code. Rational Rose was used to create a very large, very complex class diagram that contained each class, each method, and all interaction between classes. We were taught to iterate on this design diagram for a long period of time (relative to the implementation phase), refining the UML model as we received feedback. Eventually, the model was supposed to reach a state of completeness, at which time you could perform the simple task of translating the model into code.

During my last year of college, all of the graduating software engineers worked on a senior project together. There were about 20 of us in the class, and we had two semesters to complete our software project. The entire first semester was to be spent creating a UML model of our project. The class was split into groups, and each group was responsible for a functional area of the application. By the end of the first semester, we had a combined class diagram with a few hundred classes that showed every aspect of the classes and their relationships. The idea was that we’d come back for the second semester, crank our huge design through the implementation machine, and wind up with our finished application.

If you’ve ever been in a situation like this, I’m sure you already know where the story is going. As soon as we tried implementing our design, we ran into lots and lots of those small annoyances called “implementation details”. We realized that our model was imprecise and left out lots of important aspects. Often we hadn’t thought of these aspects, but sometimes the model simply wasn’t capable of expressing them. We also had integration and performance problems. Our modeling didn’t do a good job of capturing interactions at layer boundaries, and the model left a lot to implementation details there. Since our chosen architecture had existed only on paper for the entire design phase, we didn’t realize that it had some poor performance characteristics until we started implementing the application.

We eventually finished the project with moderate success, but in order to do so we had to cut a lot of ties to the model. The original expectation was that we would finish with a working implementation and a UML model that accurately described it, but the end result was a working implementation and a model that wasn’t really very close to it. I have a feeling that our result was not at all uncommon among projects done using a similar methodology.

Big Design Up Front

A big UML design up front often seems to be used as an attempt to shorten the implementation phase of a software project. The idea is that by concentrating all of the creativity and developer experience into a diagram, the diagram can then be put through a mindless machine of sorts and useful code will come out the other end.

In practice, this simply doesn’t work. UML is not a replacement for implementation. When properly used, it can supplement implementation, but UML is not an alternate form of code. UML and code have separate roles, and they each do certain things well. Code is unambiguous (sometimes wrong, but always precise) and always up to date. A UML diagram is neither of these.

The up-to-date issue is something I think a lot of teams run into. Assuming that you’ve done a big up front UML design and are now implementing it, what do you do when your implementation must deviate from the model? Obviously this can happen for a number of reasons: perhaps the model neglected to take something into account, or the implementation revealed additional aspects that needed modeling, or the design had to change due to external needs. In any case, it’s rare for the UML model to stay current with the implementation. On teams that keep the UML up to date, I imagine they invest many man-hours in updating the UML model.

A certain class of UML tools called round-trip UML tools are supposed to help here. I don’t have much personal experience with these kinds of tools so I’m not going to say much about them. The idea behind them is that a UML diagram and implementation are just both different “views” of the same design, and the tool allows modification of one view of the design to change the other view. I’ve certainly heard lots of second-hand evidence that these tools simply don’t work as advertised.

Other Problems with UML

In his book Domain-Driven Design, Eric Evans makes a point about the inadequacy of UML to capture the big picture. UML is very good at capturing a representation of objects. This representation can be very precise, showing you exactly what objects are like and what object relationships are like. But UML is not good at capturing the meaning of a design - the design’s intention, and what the objects are meant to do.

Think about giving a very precise, detailed description of a foreign object to a child who had never seen the object before. The description could be very accurate, but accuracy would not be meaningful without relevancy. UML diagrams can answer what questions, and sometimes can answer how questions. But often, the more important questions are why questions. A UML diagram can’t tell you why a particular class is modeled the way it is, or what tradeoffs were considered in order to arrive at a design.

UML is not even always good at representation. One problem is that once you start getting beyond simple relationships, class diagrams can become exceedingly complex. There are lots of UML class diagram symbols that I don’t know, don’t want to, and have never needed to use. There are also a lot of representational concepts that UML just kind of punts on. You can tell what these concepts are by looking for << this kind of text >> in a UML class diagram. In a complicated class diagram you’ll see these all over the place, and it’s really just a regression to a textual description of a concept inside a diagram.

Something I’ve noticed that tells a lot about the misapplication of UML is that large UML diagrams almost always require some text to accompany them to explain things. If UML really captured large designs well, the diagram would stand alone and wouldn’t require a supplement in order for it to have meaning. Of course, any diagram or model always has context, and the context needs to be explained, but what I’m referring to goes beyond context and into interpretation.

A UML Tool I Like

Even though I’ve had a lot to say about the misapplication of UML, I don’t want to throw the baby out with the bathwater. Just because UML is often misused isn’t a reason for avoiding it completely. Design patterns are misused more often than UML, but it would be foolish to refuse to ever use a design pattern because of widespread misapplication.

The UML tool that I use most often is Violet. It’s nice because it’s so simple that it requires no training or reading of manuals. It’s also cross platform, so I can share Violet diagrams with my coworkers on other platforms. It allows me to quickly get the job done and the tool stays out of the way.

A tool like Violet is not going to scale to doing large diagrams, and it doesn’t have any sort of reverse engineering, code generation, or round-trip tooling built in. But for simple diagramming, the kind that UML is really useful for, it’s hard to beat.
