A Tale of Two Bugs

We recently released a closed pre-alpha of The Machinery. As a result, we of course discovered a few new bugs. In this blog post, I’ll take a look at the two hairiest ones.

In general, I find that it is hard to find good information on debugging. You’re just expected to “pick it up as you go”. But it is a complicated topic and it’s hard to be a good programmer if you’re not good at debugging. I hope this post helps.

The ghost in the network

The first bug was actually discovered just before we released the alpha. We noticed that collaborative editing didn’t work between mine and Tobias’ machine. The editing session would start up and synchronize the scene, but on certain actions after that — such as moving or cloning an entity — the editor would crash.

The weird thing about this is that collaboration worked locally. I.e., if either me or Tobias started up two editors on one of our machines, those two editors could collaborate without any issues at all. What’s weird about that is that to the software there should be no difference between a local and a remote session. A local session is just a session that connects on 127.0.0.1 and a remote session is a session that connects to another IP. The difference just exists in the socket layer, so what could possibly be going on?

We spent some time just verifying the facts. This is always a good idea when something seems really weird. But yes, remote sessions were crashing and local sessions worked, consistently.

I investigated the code around the crash on the client side. The collaboration code is based on synchronizing objects with properties. (In The Truth, which I’ve talked about before.) Whenever a property on an object is changed, the change is serialized and sent to the other collaborators in the session. The crash happened when the client got a message from the server that a set property of an object had changed. The problem was that according to the data from the server, the set had billions of elements. Since the client was unable to allocate the data for this, the application crashed. So it would seem like the data from the server got corrupted somehow.

Our next idea was that maybe it was a Debug/Release difference. Turns out Tobias was running the Debug version of the engine and I was running the Release version. Maybe the layout of some struct was different in Debug or Release causing serialization differences? Or maybe we had uninitialized memory somewhere? But no, when we switched to running the same version we still had the crash.

Since there seemed to be data corruption somewhere, our next step was to try to figure out where. Maybe there was a bug in our network layer when packages were split over multiple send and recv calls. Or maybe the network itself was corrupting packages, wifi had been spotty due to a bad router. Furthermore, if we could show that there was no network corruption we knew that the problem would have to be in the server serialization code, since the data was alread bad when it got to the client.

To check that, I started logging MD5 checksums of all packages as they entered and left the network layer of our application. Turns out the checksums all matched, so there were no bugs in the network layer code.

At this point I knew that the problem must be in serialization/deserialization, so I decided to step through the code as a bad packet was being serialized on the server and then step through the deserialization of the same packet on the client.

This gave me the first real clue to the bug. I noticed that an object property was being serialized as a reference (one of the data types in The Truth), but on the client side it was deserialized as a set. Thus, the object ID of the reference was interpreted as the element count of the set, which caused the bug.

On closer examination, this was because the client and the server did not agree on the object type. The index of the type (which was what was sent over the wire) was the same. But that index had different meanings on the server and the client. The type table looked something like this:

Type table on client and server

There were only a few types in the list that had switched places. So collaboration worked fine until one of these types was sent, then it would crash.

When I created the collaboration system I knew that I would eventually have to deal with situations where the type table was different between the server and the client. This can happen, for example, if the client and server uses different plugins and some of these plugins introduce new types.

However, I put this off as something to do “later”. After all, for now, all users will be running with the same set of plugins (the ones we include in the alpha). Also, I’m not 100 % sure what collaboration even means when plugins differ. One of the users can create objects that the other users have no idea what they are. Does this even make sense?

So I’d found the bug, but I still wanted to figure out why our type lists differed. After all, both me and Tobias were running with the same set of DLLs — the default DLLs that we distribute with the alpha. Was there something non-deterministic in the loading of the DLLs that caused them to be loaded in a different order on different machines?

As we dug into this, we noticed it was actually something else. At one point during the development of the pre-alpha we decided to rename our sample_plugin.dll to tesselated_plane_plugin.dll. Tobias had deleted his bin directory after the rename, while I had not. So I had both the sample_plugin.dll and the tesselated_plane_plugin.dll, while he had only the latter one. Since plugins are loaded in alphabetical order, this caused the plugin to load and register its types earlier on my system, causing them to appear earlier in the list.

Once I deleted the sample_plugin.dll from my bin directory, collaboration worked as expected.

In the end, turns out there was nothing to fix for this bug. Of course, we eventually need to fix the problem where users can’t collaborate if their plugins don’t match, but it’s too big of a fix for the alpha. And users will only run into this bug if they write their own plugins that define new types and they run collaboration sessions between different machines and they only put these plugins on some of the machines.

At the end of a debugging session like this, I like to reflect back to see what can be done to prevent similar bugs in the future, and/or to make them easier to debug.

As for the debugging session itself, I think it went pretty well. The important things when debugging are:

  • Verify your hypotheses. If you think something is true, find a way of proving or disproving it (by logging, single-stepping, etc).

  • Find ways of getting more information. How can you increase your knowledge about the bug?

When debugging, you really need to think like a scientist and keep track (in your head or on paper) of the things you know, the things you believe (but don’t know for sure) and then find “experiments” that can add to what you know, and confirm or rule out the things you believe.

As for how to prevent bugs like this in the future, it is clear that I made a pretty big mistake when writing the original collaboration code. I was very aware that the code would only work (for now) if the type lists matched between the client and the server. So this was a pre-requisite, but I didn’t put that pre-requisite into the code.

If I had just added an assert verifying this pre-requisite, the assert would have triggered immediately and we would have known right away what was going on and not have to spend two hours debugging it.

Adding an assert is easy, we can just hash the entire type table and send that checksum as part of the initial synchronization message. If the checksums don’t match between client and server, we print an error.

So the lesson we can learn from this is clear:

  • If your code is assuming something — especially if you know that there are situations where that assumption might not be true — put that assumption into the code in the form of an assert.

The curiously recursing child entities

The second bug was discovered by one of our users. Luckily it had a 100 % reproduction case, but the reproduction steps were kind of complicated. First, you needed to open one of our more complicated projects, then open a big entity with hundreds of sub-entities and make that entity a child of itself. Finally, create a new entity, and make the complex entity a child of that entity.

Having an entity be a child of itself is kind of a weird setup. It happens when you use an entity as a prototype and place an instance of that prototype in the entity itself. (Note: In our system, all entities can act as prototypes, there is no distinction between an “entity” and a “prototype”.) Normally, a situation like that would lead to infinite recursion. However, we have a rule that in cases of self-reference, we only expand the first level in the runtime. So if a room is a child of itself, you only see that first room child, you don’t see the child of that room. (You can force another level of expansion by overriding the instance locally, but that’s another story.)

An entity that includes a smaller copy of itself, expanded to level 4.

The reason we allow entities to be children of themselves is not that we think it is a super useful feature, it is just that it is nicer to allow it than to throw a big error box saying “ILLEGAL OPERATION” in the user’s face.

When I reproduced the bug I could see that the crash happened in a piece of code that enumerated the entities in the scene. That code went into infinite recursion which smashed the stack.

Thus, it seemed pretty likely that there was a problem somewhere with how we handled these recursive entities. Probably I’d missed something that made us still go into infinite recursion.

My next step was to try to create a simpler test case. With the hundreds of objects in the initial scene, it was kind of hard to tell what was going on. I figured that with a simpler scene, I could just step through the code and it would be easier to understand the bug.

However, when I tried to simplify the problem, I could no longer reproduce it. I created a fresh entity and added the same kind of recursion that was in the original repro case — no crash. I then started with the original repro case and tried to delete a bunch of sub-entities to make it simpler — no crash anymore. Weird.

Again, I figured I needed more information, so I started to add logging to the application. In the code that was smashing the stack I added logging of object names, so I could see what was going on. I noticed something weird.

According to the log, the room entity was a child of the cellar_door entity. And the room had a props entity, which was the parent of the cellar_door. This was causing the loop. However, this was weird for two reasons.

First, in the runtime there wasn’t supposed to be any loops at all. When entities are spawned into the runtime, each instance gets it’s own identity. So the original room is a different entity, from the room that is a child of that room (even though they come from the same resource). And these two rooms are the only ones that should exist, since we only instantiate the first level of recursion.

Second, the room was never a child of the cellar_door. The room was a child of itself, so how does it show up as a child of the cellar_door?

Clearly, something must be wrong with how children are being tracked. In The Machinery this is handled by a special component — the Owner Component. It allows entities to own other entities. Its main purpose is to let us recurse through all entities in the scene, and to make sure that child entities are destroyed when their parent is destroyed.

I decided the next logical step was to try to find out when the problem happens. I wrote a little snippet of code that looked for loops in the owner component. Then I made sure that whenever someone called add_children() on the owner component, this little snippet would run.

I figured that at some point, somebody must call add_children() with a bad entity creating a loop in the parent-child graph and with this test I would detect that.

But curiously, my loop check never triggered before the stack smashed. Huh? There was obviously a loop when the stack smashed, which means the loop wasn’t caused by a call to add_children(), but something else.

I decided to cast a wider net and sprinkled the code with loop checks. After some bisecting I could determine that the loop was introduced after a call to destroy_entity(). destroy_entity() calls destroy_all_children() on the owning component, which looks something like this (simplified):

void destroy_children(tm_owner_component_manager_o *man, tm_owner_component_t *c)
{
    for (uint32_t i=0; i<c->num_children; ++i)
        tm_entity_api->destroy_entity(man->ctx, c->children[i]);
    free_children(man, c);
}

I found the entity that had the loop. It was always the same one, so I could set a conditional breakpoint on its ID and step through the destroy code for that entity. However, when I stepped through, everything looked fine. The child entities were destroyed and the free_children() call set c->num_children to 0. However, a little later, in the code where I checked for loops, c->num_children of that same entity was not zero. Huh? That should not be possible.

After some investigation I found that the c pointer to the component had changed. get_component() returned a different value — i.e., the component had been moved in memory. This explained how c->num_children could be different.

We allow component data to move in memory for performance reasons (we want to pack it as tightly as possible, so update loops can run as fast as possible). But the object’s data should be preserved when it is moved. How come, in this case, the data was different in the new location? Did we have a memory overwrite bug? Was there a bug in the copying code?

I put a breakpoint in the code that did the move and saw that it was being called from destroy_entity(), which in turn was being called from destroy_children(). That makes sense, it is when an entity is destroyed that we need to compact the component data (move another entity into the “hole” left by the destroyed entity). But this means we are moving component data while we are processing it. What if the data that c points to is moved? That means we are not pointing to the right data anymore, and we will free some other entity’s children. All hell can break loose.

This is a kind of bug I’ve run into many many times before (but I apparently still keep making). I’ve come to think of it as the callback problem. It can happen whenever a lower level system calls out to a higher level system in one of its functions. I.e., in code like this:

void do_something(object_t *o, ...)
{
    uint32_t n = o->count;
    do_stuff(o, n);
    callback(...);
    do_more_stuff(o, n);
}

The problem is that as you are writing this code you don’t really know what the callback is going to do. The callback might change the very object you are working on, which means your cached local variables (n in the example above) are no longer valid. The callback might even delete the object outright, or destroy the whole system where the object lives. In some cases I’ve run into, the callback called out to a script code, so truly anything could happen.

When you write code you consciously or subconsciously assume that certain things don’t change in the middle of the function you are writing and callbacks to higher-level system can violate those assumptions. (Calls to lower-level systems are usually not a problem, because lower-level systems don’t usually mutate the objects of higher-level systems.)

How can the callback problem be fixed? One way is to delay all callbacks to the end of the function. Something like this:

void destroy_children(tm_owner_component_manager_o *man, tm_owner_component_t *c)
{
    TM_INIT_TEMP_ALLOCATOR(ta);
    
    const uint32_t n = c->num_children;
    const size_t bytes = n * sizeof(entity_t);
    entity_t *children = tm_temp_alloc(ta, bytes);
    memcpy(children, c->children, bytes);
    free_children(man, c);
    
    // Don't use c after this point.
    c = NULL;
    
    for (uint32_t i=0; i<n; ++i)    
        tm_entity_api->destroy_entity(man->ctx, children[i]);
        
    TM_SHUTDOWN_TEMP_ALLOCATOR(ta);
}

Here, we copy all the data that we need out of c into temporary variables before calling the callback. This means that even if the callback messes up c, or even deletes it, we’re still good. For good measure, we set c to NULL to make sure that we don’t use it after we’ve started calling the callbacks.

This approach works and I’ve used it successfully in the past to fix bugs like this, but to be honest, it’s not great. It’s complex, doesn’t look nice, requires extra memory allocations and is easy to mess up. What if you forget to copy out some of the object data? Or what if the callback deletes man, in that case we will still have an error, since we’re using man->ctx after the callback? What if ctx gets deleted, can we even make the callback at that point?

These days, I view code like this as a code smell and prefer to move the callback out of the low level system completely, to avoid these kind of errors. In this case, I decided that there is no need for the owner component to have a destroy_children() function at all. Instead, the owner component can just expose a function for extracting all the children and then a higher level system (the entity context) can be responsible for extracting the children and calling destroy_entity() on them. Since destroy_entity() is implemented by the entity context, this avoids the issue of calling out to a higher level system.

Reflecting back, the debug session went pretty well. I wished I had started to log all the entity names a bit sooner, that would have made me find the bug faster.

An interesting thing to note is that even though I was pretty certain that the bug was caused by recursive entities, it turns out in the end that the bug had nothing at all to do with that. Any sufficiently complicated entity setup could have triggered the bug, whether it had recursive entities or not. This is again a good warning about not being too set in your preconceived notions.

As for avoiding this bug in the future, the conclusion is clear:

  • Avoid callbacks to higher level system. They may have unexpected side effects.

I should certainly know this by now, and yet I keep running into this bug.