algernon @ BalaBit

Footprints of a tiny mouse, a hacker


Emulating, fun and profit

Sunday, February 27, 2011 @ 05:02 PM Author: Gergely Nagy

As part of my afmongodb driver, I wrote a MongoDB client library, and this time I started to experiment with test-driven development. There is still room for improvement: my test suite is not complete enough, and my documentation is not yet at the level I want it to be. Still, I learned a few lessons in the process, and I’d like to share some of them.

First, a reasonably complete test suite is a godsend, and I mean that. Even better is when one writes the tests first, along with the documentation, and the code afterwards. The test suite caught a lot of bugs for me, ranging from endianness bugs, through abusing va_list in ways it wasn’t meant to be used, to simple coding errors that would have resulted in a bad implementation of a spec. But a test suite alone is not what I want to talk about today, especially since the suite I wrote for libmongo-client still has a lot to improve. What I want to talk about is the importance of testing one’s code on multiple architectures.

During the past few days, I’ve been preparing for the first release of the library, and due to the nature of the MongoDB wire protocol (it’s little-endian), I wanted to test it on a big-endian system too. So off I went and installed Debian/PowerPC in QEMU, and ran the test suite there, which, to my surprise, revealed a couple of endianness-related bugs. Even though I went to great lengths to ensure my code was endian-safe, there were still a few cases where it wasn’t, and the test suite caught them, but only when run on a very different system.
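To make the endianness point concrete, here is a minimal sketch of one way to read and write a 32-bit little-endian value so that the result does not depend on the host byte order. The function names le32_decode and le32_encode are made up for this illustration; they are not taken from libmongo-client.

```c
#include <stdint.h>

/* Hypothetical helper: decode a 32-bit little-endian integer from a
 * byte buffer. It assembles the value byte by byte instead of casting
 * the buffer to an int32_t pointer, so it behaves the same on big- and
 * little-endian hosts. */
int32_t le32_decode(const uint8_t *buf)
{
    uint32_t v = (uint32_t)buf[0]
               | ((uint32_t)buf[1] << 8)
               | ((uint32_t)buf[2] << 16)
               | ((uint32_t)buf[3] << 24);

    /* Converting an out-of-range value to int32_t is
     * implementation-defined; on two's complement machines it yields
     * the expected negative number. */
    return (int32_t)v;
}

/* Hypothetical helper: encode a 32-bit integer as four little-endian
 * bytes, least significant byte first. */
void le32_encode(uint8_t *buf, int32_t value)
{
    uint32_t v = (uint32_t)value;

    buf[0] = (uint8_t)(v & 0xff);
    buf[1] = (uint8_t)((v >> 8) & 0xff);
    buf[2] = (uint8_t)((v >> 16) & 0xff);
    buf[3] = (uint8_t)((v >> 24) & 0xff);
}
```

The tempting shortcut, copying the four bytes straight into an int32_t, gives the right answer on x86 and the wrong one on PowerPC, which is the kind of mistake that only shows up when the tests run on a big-endian machine.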

After the endianness bugs were hunted down and fixed, I had another idea: what about testing on a little-endian, but non-x86 architecture? At first, I wanted to try mipsel, for various reasons (almost a decade ago, I had the pleasure of working with a mipsel system, and found it very neat at the time; a well-written book about the architecture internals only reinforced that), but I ran into a few issues, namely that I would’ve needed firmware, which I didn’t have at hand. So instead of hunting one down (the only MIPS hardware I have at home is a trusty old router with very limited firmware that does not have any easy remote access apart from HTTP), I opted to find another suitable architecture that QEMU can emulate, and ended up with armel.

Now that was another interesting experience: the installation went fine, and there are plenty of HOWTOs on the net, but the test suite revealed another kind of bug, one that was a lot harder to find and fix than the endianness ones. This too was found by the test suite: there were no compiler warnings or anything of the sort, and the example applications worked perfectly as well.

“What was the bug?” one may ask, and I’ll tell you. I had a function that took a variable number of arguments, terminated by a zero value. I used this to build BSON objects in cases where most of the contents are known at compile time. One of the test cases built a BSON object using this API, and compared the result to the same document built with the traditional API. On x86, x86-64 and powerpc, this test case ran perfectly, but on armel, it bailed out with errors. However, there were other functions in my code that took a variable number of arguments, and they worked just fine.
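For readers who have not met this pattern, the following is a rough sketch of what such a pair of APIs and the comparison test could look like. It is a toy and not real BSON: struct doc, the T_* constants and all the function names are hypothetical, and the byte layout is invented for the example.

```c
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical element types; 0 is reserved as the terminator. */
enum { T_END = 0, T_INT = 1, T_STR = 2 };

/* Toy document: a flat byte buffer (not the BSON layout). */
struct doc {
    char data[256];
    size_t len;
};

/* Append n bytes, silently dropping anything that would overflow. */
static void put(struct doc *d, const void *src, size_t n)
{
    if (d->len + n <= sizeof d->data) {
        memcpy(d->data + d->len, src, n);
        d->len += n;
    }
}

/* "Traditional" API: one call per element. */
void doc_append_int(struct doc *d, const char *key, int value)
{
    char tag = T_INT;
    put(d, &tag, 1);
    put(d, key, strlen(key) + 1);
    put(d, &value, sizeof value);
}

void doc_append_str(struct doc *d, const char *key, const char *value)
{
    char tag = T_STR;
    put(d, &tag, 1);
    put(d, key, strlen(key) + 1);
    put(d, value, strlen(value) + 1);
}

/* Variadic API: (type, key, value) triples, terminated by T_END. */
void doc_build(struct doc *d, ...)
{
    va_list ap;
    int type;

    d->len = 0;
    va_start(ap, d);
    while ((type = va_arg(ap, int)) != T_END) {
        const char *key = va_arg(ap, const char *);

        if (type == T_INT)
            doc_append_int(d, key, va_arg(ap, int));
        else if (type == T_STR)
            doc_append_str(d, key, va_arg(ap, const char *));
        else
            break; /* unknown type: stop instead of misreading arguments */
    }
    va_end(ap);
}
```

A test in the spirit of the one described would build one document with doc_append_int() and doc_append_str(), build a second with a single doc_build(&b, T_INT, "answer", 42, T_STR, "name", "mouse", T_END) call, and then check that the lengths match and memcmp() reports identical bytes.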

Digging into the test case with GDB proved to be a surprising experience too, for a multitude of reasons, and prompted me to read up a bit on the ARM architecture (it’s interesting, by the way), but for a long time, I couldn’t figure out where the problem lay: my stack was just full of garbage the moment I entered the function, and it stayed that way. As it turns out, the problem was that I passed a va_list to another function, which called va_arg() on it. According to POSIX:

The object ap may be passed as an argument to another function; if that function invokes the va_arg() macro with parameter ap, the value of ap in the calling function is unspecified and shall be passed to the va_end() macro prior to any further reference to ap.

And that is exactly what I did: I kept using ap after the called function returned, having never read this line in the documentation before. As it turned out, of the four architectures I compiled on, three worked the way I expected, and I could keep calling va_arg() after the called function returned. Not on ARM, however. All I got back was garbage, and that’s what the test suite caught, but I needed an armel system to catch it.
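One portable way around this is to hand the helper a pointer to the va_list instead of the va_list itself; the C standard explicitly allows the caller to keep using the list afterwards in that case. The sketch below shows the idea with a trivial example. The names next_int and sum_until_zero are invented for the illustration, and this is not necessarily how the library itself was fixed.

```c
#include <stdarg.h>

/* Hypothetical helper: consume exactly one int from the caller's list.
 *
 * The non-portable variant would be declared as
 *     static int next_int(va_list ap)
 * with the caller continuing to use its own ap afterwards. That leaves
 * the caller's ap unspecified. Taking a pointer instead means the
 * caller's list itself is advanced. */
static int next_int(va_list *ap)
{
    return va_arg(*ap, int);
}

/* Sum ints until a 0 terminator, fetching every value via the helper. */
int sum_until_zero(int first, ...)
{
    va_list ap;
    int total = first;
    int v;

    if (first == 0)
        return 0;

    va_start(ap, first);
    while ((v = next_int(&ap)) != 0)
        total += v;
    va_end(ap);

    return total;
}
```

With this, sum_until_zero(1, 2, 3, 0) adds up 1, 2 and 3 on every architecture, not only on the ones where the by-value version happens to work. If the helper should only peek at the arguments without consuming them, giving it its own copy made with va_copy() (and ended with va_end()) is the other standard option.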

The lesson learned?

However complete your test suite may be, you can still have code that works only by accident, so testing on different architectures can help a lot. QEMU, along with all the tools built around it, is an awesome tool in a developer’s toolbox that can greatly aid in writing correct and portable code. And on a Debian system, it’s very easy to set up a build environment, thanks to tools like sbuild and dput. I can now compile and test packages on four different architectures, without the need to figure out how buildbot works.