Sunday, September 23, 2007

Yet Another Reason I Love C#

Sorry for the lack of updates recently. I took over a project in July that had several challenges and consumed my time for the past couple of months. Not that I was spending every waking hour at work, but after putting in 10-12 hour days coding and dealing with the responsibilities of being a team lead (schedules, status meetings, interfacing with the customer, and all of that other fun stuff), the thought of doing more coding for fun wasn't high on my list of priorities.

Despite the workload, it was, for the most part, an interesting project — the kind that makes me enjoy programming for a living. It was challenging and stimulating work that built on my company's existing experience developing embedded sensor networking products. But there was one challenge in particular that I could have done without.

Without going into too much detail, this is our second generation of sensor networking products, and this version is still fairly young. The first generation was written in Ada and was very much tied to a specific platform and architecture. Now that there are a number of readily available and inexpensive embedded platforms capable of running a variety of operating systems, including Windows CE and even embedded Linux, our goal with the second generation was to make the product more readily portable. We also needed to keep it very extensible, since we market it to a variety of customers (multiple branches and offices within the military, state and local governments, and even commercial customers), all of whom have slightly different needs and uses for networked sensors.

So, because of those requirements, we decided to switch from Ada to C++ and to develop a basic framework that provides the core functionality needed by all of our customers. The framework was designed to have a "common" set of code that was completely OS- and hardware-independent, along with a platform-specific middleware layer. Individual projects are also intended to be completely platform-independent and merely link in the version of the framework for the platform they run on.

All of that reflects sound software engineering principles and looks great on paper as a way to get around the many problems with writing truly portable C++ code. However, late in the project we discovered the software crashed whenever one particular type of message was sent from our sensor networking device. I took a look at it and found that in the section of code that was crashing, variables on the stack were getting changed, seemingly at random.

I immediately assumed it was a threading problem, and that another thread had a rogue pointer or perhaps a stack overrun. I asked a couple of our senior engineers to look at those possibilities, but they were unable to track it down. Just a few weeks before delivery, I was concerned we might not be able to deliver a working product. So we had a brainstorming session to think of all of the possibilities. We had proven it wasn't a stack overrun, and it was too regular and repeatable to be a thread timing problem. Because the crash itself happened inside the WinCE STL string library, there was some discussion of it being a problem with that library. However, the fact that we used STL strings throughout our code without seeing this problem anywhere else suggested that wasn't likely.

We zeroed in on the fact that the function that was crashing seemed to be laying out the message structure in memory differently than the functions that passed in the message, which were all in different compilation units. With that information, it took me only an hour more of stepping through the code with the debugger to find the problem. The crashing function was using a definition of time_t that was only 32 bits wide. All of the other compilation units correctly used a 64-bit version of time_t, since we don't want our product to become obsolete in 2038.
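To make the failure mode concrete, here is a minimal sketch of what a layout mismatch like this looks like. It is not the project's code: the struct names, the sensor_id field, and the helper functions are all hypothetical, and fixed-width integers stand in for the two competing definitions of time_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical message as most compilation units see it:
// the timestamp is 64 bits wide (stands in for a 64-bit time_t).
struct MessageWide {
    std::int64_t timestamp;
    std::int32_t sensor_id;
};

// The "same" message as seen by the one compilation unit that
// picked up a 32-bit time_t (again, a stand-in type).
struct MessageNarrow {
    std::int32_t timestamp;
    std::int32_t sensor_id;
};

// Interprets the bytes of a wide message using the narrow layout,
// which is effectively what a function compiled against the wrong
// definition ends up doing.
std::int32_t read_sensor_id_as_narrow(const MessageWide& msg) {
    MessageNarrow view;
    std::memcpy(&view, &msg, sizeof(view));
    return view.sensor_id;
}

// Shows that the narrow reader does not see the sensor_id that was
// actually stored; it sees part of the timestamp instead.
void demonstrate_mismatch() {
    // Both 32-bit halves of the timestamp are identical, so the
    // result does not depend on the machine's byte order.
    MessageWide msg{0x1111111111111111LL, 42};
    assert(read_sensor_id_as_narrow(msg) != 42);
}
```

Both structs compile cleanly and nothing warns at link time, because each compilation unit is internally consistent. The disagreement only shows up at run time, when one side reads or writes a field at an offset the other side does not expect.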

How did this happen? The most immediate reason is that one particular header file in the supposedly portable section of the framework code included a Windows header file that defined time_t as being 32-bit. OS-specific header files were supposed to be relegated to the middleware layer. That said, because of its C legacy, C++ makes it far too easy for problems like this to occur. Between conditional compilation and the difficulty of determining exactly which "standard" header a particular file is picking up, different compilation units can all too easily end up with different definitions of supposedly "standard" types.
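One common way to guard against this class of problem is to keep time_t out of shared message structures altogether and use an explicitly sized type instead. The sketch below illustrates that general technique under stated assumptions; it is not what our framework does, and the framework namespace, Timestamp alias, SensorMessage struct, and conversion helpers are all hypothetical names.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <ctime>

namespace framework {  // hypothetical namespace for the portable layer

// Seconds since the epoch, always 64 bits wide regardless of which
// definition of time_t a given compilation unit happens to see.
using Timestamp = std::int64_t;

// Tripwire: fails to compile if someone later changes the alias
// to a narrower type.
static_assert(sizeof(Timestamp) * CHAR_BIT == 64,
              "Timestamp must be exactly 64 bits wide");

// Shared message layout no longer depends on time_t at all.
struct SensorMessage {
    Timestamp timestamp;
    std::int32_t sensor_id;
};

// Widening from the platform's time_t happens at the boundary.
inline Timestamp to_timestamp(std::time_t t) {
    return static_cast<Timestamp>(t);
}

// Narrowing back is checked (in debug builds) so a value that does
// not fit in this platform's time_t is caught at the boundary.
inline std::time_t to_time_t(Timestamp ts) {
    const std::time_t t = static_cast<std::time_t>(ts);
    assert(static_cast<Timestamp>(t) == ts &&
           "timestamp does not fit in this platform's time_t");
    return t;
}

}  // namespace framework
```

With this arrangement a stray OS header can still change what time_t means inside one file, but it can no longer change the layout of a structure that crosses compilation-unit boundaries.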

C# and Java (and Ada) have their own problems, but this kind of pulling-out-your-hair problem simply can't happen in those languages. A significant amount of experienced engineers' time was spent tracking down a problem that shouldn't have been able to happen in the first place.

Anyway, sorry to ramble on about a non-game-programming related problem. The good news is that we had a very successful sell-off demonstration of our product last Friday and the customer is very happy. Which means my working hours will return to normal and I can hopefully get back to finishing off Snowball Fight and participating in the XNA community again.