Advice From The Swamp

Wednesday, April 18, 2007

Software Efficiency And Optimization (Part 3)

In the first two parts of this series, I discussed the difference between optimizing a design versus optimizing implementation, and showed how to use prototyping to benchmark key algorithms. This part will look at how to use a profiler to find performance bottlenecks.

The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.

So when do you do it? Only when you know you have a problem, and only when you know what the problem is. Profilers help you find these problems.

Argghhh! Why is my game so slow?

So you've finally gotten your game to the point that it is basically playable with most of the features you want implemented, and now you are hoping to get some friends to help you start beta testing it. But when you run it, the animation is incredibly jerky. Because you had been bragging to your friends about the game and want them to see it as soon as possible, you start going through the code in your mind, frantically thinking about what could be causing the slowdowns.

"I've heard foreach is very slow," you think, so you quickly do a search through your files and replace every one of your foreach statements with equivalent for statements, kicking yourself for not having done this in the first place. After fifteen minutes or so, and several recompiles because you made some mistake or another while translating the code — "Whoops, I declared my index variable to be 'index', but I called it 'i' when I tried to increment it." ... a couple minutes later ... "Dammit, that array only has ten elements but I told it to loop through eleven." — you finally get it running again... only to be dismayed to discover that it made no difference.

"Hmmm... I have an awful lot of matrices I'm passing around by value. Copying all of that data at every function call must take an awful lot of time." So you go through all of the code again looking for every method that takes a matrix and changing them to pass by reference. After many, many more recompiles than were needed during the change from for to foreach — "Oh, come on, you mean I have to add the ref keyword both when I declare the method and when I call it?!?" — you finally get it back to the point where you can run it again. Unfortunately, now it doesn't work at all. You run it and objects start jumping around the screen for no apparent reason. Then you remember that some of your methods took advantage of the fact that a matrix was being passed by value and changed its value locally. Now that it is being passed by reference, that change is being made in the original object, which is not at all what you intended.

If you are lucky (or smart), you stop there and take a step back. If you're not lucky (or you're just foolish), you think that you'll be able to fix it and you start going into each method trying to figure out how to kludge your way around the problem on a case-by-case basis. Maybe you'll even succeed and get the game back to its runnable state that you had it in hours earlier. Maybe you're even lucky enough to have not introduced yet more bugs in the process. But in all likelihood, you haven't fixed the problem. And now you have code that does not run any better than when you started, but is harder to understand, harder to debug, and has kludges in it to get around problems that you introduced trying to fix the wrong problem.

Your time will be much better spent finding out where your problems are before you try to fix them.

Using NProf

Execution profilers are tools that help you find where your code is spending most of its time. There are two types of execution profilers. Event-based profilers use actual events, such as function calls, and trace the exact execution path as the program runs. Statistical profilers work by interrupting the program at frequent intervals to determine what section of code is currently running. Statistical profilers are less intrusive and therefore provide more accurate measure of how much time is spent in each function.

There are some very nice professional statistical profilers available for .NET development, including the ANTS Profiler and dotTrace. Both provide lots of great features, but both cost hundreds of dollars. Since one of the great aspects of XNA development is that Microsoft has made the Game Studio Express tools available for free, I'll show you how to keep using free tools by concentrating on the Open Source NProf profiler.

As with many (most?) Open Source tools, the main things missing from NProf are documentation and support. Once you get past that, though, it's a great tool. It doesn't provide the cool graphs and other bells-and-whistles of its $300+ commercial counterparts, but it lets us find out how much time is being spent in each of our functions. Ultimately, that's all we care about.

You can download the latest version of NProf (0.10 as of this writing) here. When you run it, you get the following (rather unhelpful) screen:

You might think that you would want to start by selecting File->New, which is the convention for Windows programs. Sadly, that does nothing. (Actually, it clears the current application so you can load a new one. But since you haven't loaded one at all yet, doing this now doesn't change the screen or give you any feedback.)

To select an application for profiling, press the Browse button. Then find a .NET executable for debugging. Note that when profiling XNA games, you will have to select a Windows executable — currently there is no mechanism for profiling Xbox 360 games.

Your XNA game is unlikely to take command line arguments, but on the off-chance that you have written it to support them, you can enter them in the Arguments text box.

Once you have selected a program to profile and entered any command line arguments, select Project->Start or press F5 to start the program. Your program will start running. Play through your game and then exit the program. When you exit, the NProf window will look something like this:

This is a sample run from my Space Invasion game. The two views shown in the top and bottom panels list every method that was executed when you played the game. The top view lets you expand each method to see what methods it called. The bottom view lets you expand each method to see what methods called it. In both views, the Time column shows what percentage of the program's total execution time was spent in each method. The methods are sorted by the percentage of the program's execution time that was spent inside that method.

Note that that includes the cumulative time spent inside every method that method called. So any time you run NProf on an XNA game, the first method listed will be System.Threading.ExecutionContext.Run and it will be shown as taking 100% of the time. That's because everything else in the game is ultimately called by that method, so every time NProf took an execution sample, it was in System.Threading.ExecutionContext.Run.

Looking down a bit further, you'll notice your main program (in this case, SpaceInvasion.Program.Main) and below that the XNA Framework's Run method (Microsoft.Xna.Framework.Game.Run). Continuing down even further, you'll finally find Microsoft.Xna.Framework.Game.Tick. In a fixed-step game, this will be called every TargetElapsedTime period (defaulting to 60 times per second).

If you expand that method in the Callees panel, you'll see that it calls three methods, Microsoft.Xna.Framework.Game.DrawFrame, SpaceInvasion.InvadersGame.Update, and Microsoft.Xna.Framework.GameClock.Step. The latter took only 0.01% of the total execution time, so we can safely ignore it. The SpaceInvasion.InvadersGame.Update method took only about 2% of the total execution time. And Microsoft.Xna.Framework.Game.DrawFrame took over 90% of the execution time!

Does that mean the program took over 15 milliseconds to draw each frame for a simple Space Invaders game with no more than 60 2D objects on screen at any time? Not at all. The time spent in the DrawFrame method includes the time spent waiting to synchronize with the vertical trace on the GraphicsDeviceManager.

For now, don't worry too much about how much time is spent resolving the back buffer to the GPU. (I'll talk very briefly about GPU issues at the end.) Instead, look at how much time is spent inside your program's Update and Draw methods. In this particular run of Space Invasion, these are approximately 3.4% for SpaceInvasion.InvadersGame.Update and 1.6% for SpaceInvasion.InvadersGame.Draw.

Since we know that Space Invasion is running at the default fixed-step time period of 60 frames per second, then the amount of time spent in Microsoft.Xna.Framework.Game.Tick each frame is 16.7 milliseconds. So we might be tempted to try to calculate the amount of time spent in each method by multiplying the percentage of time spent in each by 16.7.

However, this isn't accurate because, as we noted above, only 93.3% of the program's execution time was spent in the Microsoft.Xna.Framework.Game.Tick method. And, of course, I've already mentioned that even statistical profilers like NProf add some amount of overhead and error. While you could get out your calculator and start dividing the time spent in the Update and Draw methods by 0.933, there is yet another catch. The time shown is the total time for the entire program, including the time spent on the title screen when we weren't actually playing the game.

Of course, there are still ways around this if you are determined to get exact numbers for the average time spent in each method for each frame. But it's really not worth it. Don't make the mistake of thinking you can use a profiler to find out exactly how much time each method takes, or even that it's helpful to do so. Relative time is what matters.

Even looking at relative time, though, we probably don't care about how much time was spent updating and drawing the title screen. We want to know how much time was spent moving our sprites and performing collision detection while playing the game. Fortunately, Space Invasion is set up so that each screen is handled by a different class. I highly recommend you do this in your games too. I've seen some games where the Game's Update method is a huge switch statement, or, even worse, a set of if/then/else statements that determine which game screen is currently displayed. The whole point of object oriented programming is to allow the creation of specialized classes that handle specific functions. Providing a separate class to handle each screen's Update and Draw methods will make your programs easier to understand, easier to debug, and, as we're about to see, easier to profile.

Since Update is (not surprisingly) where we are spending the majority of our time, we'll drill down into it to see how it's spending that time. Expanding the SpaceInvasion.PlayScreen.UpdateGame method, we see the following:

Perhaps not surprisingly, when playing the game, the majority of the time is spent in our inefficient collision detection code. But, as we can see, the percentage of time spent in any of the Space Invasion methods is insignificant compared to the time spent in Microsoft.Xna.Framework.Game.Tick as a whole. Not exactly a shock since the game clearly runs quite smoothly, with no animation or control glitches. But it may be surprising to see just how little time is spent even in the collision detection code, which we knew in advance was the most inefficient algorithm possible. I was certainly surprised.

The important thing here is to see how to use this data to find where your performance problems are when you do have them. If the game was running slowly, knowing that over 75% of SpaceInvasion.PlayScreen.UpdateGame's time was being spent in the HandleCollisions method would tell us immediately where to focus our optimization time. (For those wondering where I got 75%, it's the amount of time spent in HandleCollisions as a percentage of the amount of time spent in UpdateGame.)

What it wouldn't tell us, however, is how best to optimize that code. This is where going back to the lessons in part 2 of this series is helpful. You might be able to optimize the code by making some simple implementation optimizations, such as passing large value objects by reference. As long as such changes are localized to small sections of code, and documented with good comments so you can remember later why you wrote the code that way, this is perfectly acceptable.

However, if you are way out of the bag and need significant code changes that may lead to buggy and hard-to-maintain code, you are better off going back to the drawing board and implementing a completely new algorithm. The only way to know for sure is to go back to prototyping and benchmarking some alternatives, in a small test program that doesn't require changing your main game code until you know what those changes need to be.

Other Potential Performance Bottlenecks

Execution profilers are great for helping find CPU bottlenecks, but there are other potential sources of performance problems. In particular, inefficient use of the GPU and/or memory can cause problems as well. If you are having performance issues but profiling shows that you are not significant amounts of time in your Update and Draw methods, then in all likelihood the GPU or memory is to blame.

GPU issues are caused either by making too many draw calls to the GPU or by sending it too much data each frame. Modern 3D graphics cards are pipelined so they are most efficient when they can perform many operations on a single source of data (i.e., a single texture). Therefore, minimizing the number of times textures are transferred to the GPU and minimizing the number of actual GPU draw calls go a long ways towards making the most efficient use of the GPU. The best tool for investigating GPU issues is PIX, part of the DirectX SDK from Microsoft.

Memory issues are mostly caused by not understanding how the CLR allocates memory and performs garbage collection. (Memory issues can also be caused by simply having too much data in memory at a given time, but most computers have at least 512MB if not a full gigabyte of memory these days, so that's less of an issue than it once was. Of course, that doesn't mean that you should feel free to load 500+MB of game data and expect to be able to keep it all in memory at one time...)

The important thing to keep in mind is that in C#, reference types, meaning all actual classes, are dynamically allocated on the heap. They will remain there until they are no longer referenced by any other object, at which point they become eligible for removal by the garbage collector. Value types, however, which include all of the basic types (int, float, bool, etc.) as well as structs, go on the stack and are removed when they go out of scope. Therefore, they never cause garbage collection issues (unless they are boxed). A key problem for new C# programmers, especially those that come from a C or C++, background, is that the new keyword in C# does not mean "allocate space for a new object on the heap", the way it does in those other languages. It just means "call this object's constructor". So, for example, the statement

Rectangle rect = new Rectangle(0, 0, 100, 200);

does not allocate an object on the heap, because Rectangle is a struct and, therefore, a value type. Variables of type Rectangle will never cause garbage collection issues, nor will any other struct types (again, unless they become boxed), despite the fact that to a C++ programmer the above syntax looks like you just dynamically allocated memory.

Another, even worse, common garbage collection confusion is that certain language constructs can create objects on the heap under the covers. This happens when value types become boxed, which happens whenever you try to use a value type where a reference type is expected. In these cases, the value type has to be "boxed" inside a reference type.

It also happens when the foreach statement creates an iterator for a reference type. This is the reason the foreach statement gets such bad press, because it can create iterators on the heap that will need to be garbage collected without it being obvious from the code that it is doing so (at least, to a programmer unaware of this issue). Using foreach with value types causes no such problems.

Memory and garbage collection issues are best discovered through the use of the CLR Profiler and the Remote Performance Monitor for the Xbox 360. The .NET Compact Framework team put together a great article on garbage collection in the Compact Framework and how to use the Remote Performance Monitor. Eli Tayrien's Cornflower Blue blog has two good articles on using the CLR Profiler and the Remote Performance Monitor tool to dispel the myth that foreach is evil.

The Bottom Line

Part 2 of this series showed how to use benchmarking to find out how long it takes to run a particular section of code. This article showed how to use a profiler to determine what percentage of time is being spent in each method during a full run of a complete program. By combining those two techniques, you can get a good handle on what algorithms are most appropriate while you are designing your game, and what sections of your code (if any) should be considered for optimization once you have written the game.

Hopefully what you have taken away from this series is 1) understanding the importance of optimizing your design before optimizing your implementation; 2) understanding the different types of performance problems that videogames can run into; and 3) understanding how to use various techniques to help you find your performance problems so you spend your time solving the right problem.

What Next

I'm trying to decide where to take this blog from here. I could continue focusing on software engineering as it applies to games. Or I could start taking in-depth looks at particular game algorithms and how to implement them for XNA — for example, I could show how I used the minimax algorithm with alpha-beta pruning for finding the best move in my Othello game, and how to implement it in XNA; or show how to implement the A* algorithm for pathfinding. Or I could start showcasing my Dream-Build-Play entry and how various parts are implemented. If you have any thoughts or preferences, please leave a comment.

Sunday, March 25, 2007

Software Efficiency And Optimization (Part 2)

In Part 1 of this series, I discussed the importance of understanding game design tradeoffs and introduced the concept of Big O notation for comparing the efficiency of different algorithms. However, as I said, Big O notation tells us nothing about how fast an algorithm will actually run, it just tells us how efficiently it scales when operating on greater numbers of objects.

"That's great", I hear you saying, "but how do I know when I'm doing my design whether a particular algorithm will meet my needs?"

Ah, well, there is really only one way — experience.

"Gee, thanks,", you reply. "How am I supposed to get the experience I need to design my game, if you say I shouldn't write it until after I've designed it?!?"

That's actually pretty simple. You need to prototype your approach and see how well it meets your requirements. If it doesn't meet your requirements, you need to either prototype different approaches until you find one that does, or scale back your requirements.

Prototyping and Benchmarking

A prototype is just a simplified version of software (in this case, your game), that focuses on one particular aspect of it. Prototypes are often used as proofs of concept to show that the software will be able to work as expected. Because prototypes don't have to deal with all of the details that the final software will, they can be written quickly so that, if necessary, different approaches can be evaluated.

Prototypes are frequently used to evaluate user interfaces so that customers can provide early feedback. That can be useful for game programming too, since if you can implement a working display and control scheme you may be able to find out what works and doesn't work before you get too far along in the actual implementation of the game. However, the use of prototypes that we are concerned with here is to determine whether an algorithm is fast enough for the game we want to write. To do that, we will want to benchmark it. Benchmarking is just the process of timing how long an algorithm takes to run.

Fortunately, the .NET Framework makes benchmarking very easy by providing the System.Debug.Stopwatch class.The Stopwatch class provides a Start and a Stop method. It keeps track of the total number of clock ticks that occur between calls to Start and Stop. Even better, like a real stopwatch, it keeps a running count of ticks between successive calls to Start and Stop. You can find out how much time has passed by querying its ElapsedTicks or ElapsedMilliseconds properties. A Reset method lets us reset the Stopwatch back to zero.

Lets put this to use in a simple example that merely tests how long it takes to move a sprite. For this, I'll use a simplified version of the Sprite class I wrote for my Space Invasion game. A quick overview of the class is provided below. You can download the full code for this benchmark example here.

public abstract class Sprite { public Vector2 Position { get; set; } public Color Color { get; set; } /// /// Sprite's collision rectangle in screen coordinates. /// public BoundingRectangle BoundingBox { get; } public Sprite( string imageName, BoundingRectangle boundingBox); public virtual void Initialize(); public virtual void LoadGraphicsContent( ContentManager content); public virtual void Update(GameTime time); public virtual void Draw(SpriteBatch spriteBatch); /// /// Tests for collision with another Sprite. If a /// collision occurs, it calls the Collide method for /// both Sprites. /// Returns true if images collide. /// public bool TestCollision(Sprite item); /// /// Called when the TestCollision method detects a /// collision with another Sprite. /// protected virtual void Collide( BoundingRectangle overlap, Sprite item); }

As you can see, it provides methods corresponding to all of the usual game engine methods — Initialize, LoadGraphicsContent, Update, and Draw. Additionally, it provides properties for getting and setting the position and color. For collision detection, it provides a TestCollision method that tests for collision with another Sprite and a virtual method, Collide, that gets called by TestCollision for both sprites if their BoundingBox values intersect.

Since it is abstract, it is intended to be used as a parent to other Sprite classes that will implement its behavior, so we will create our own TestSprite class. The TestSprite will generate a random starting position, directional movement vector, and speed (in pixels per second), as shown here:

public override void Initialize() { // Set starting position. Position = new Vector2( random.Next(screenWidth), random.Next(screenHeight)); // Create a random movement vector. direction.X = (float)random.NextDouble() * 2 - 1; direction.Y = (float)random.NextDouble() * 2 - 1; direction.Normalize(); // Determine random speed in pixels per second. speed = (float)random.NextDouble() * 300 + 150; }

Each frame, it will update its position based on its movement direction, speed, and the amount of time that has elapsed. It will also test to see if it has hit the edge of the screen, and deflect away from it:

public override void Update(GameTime time) { // Reset color back to white. Color = Microsoft.Xna.Framework.Graphics.Color.White; // Calculate movement vector. Vector2 move = (float)time.ElapsedGameTime.TotalSeconds * speed * direction; // Determine new position. UpdatePosition(move); } private void UpdatePosition(Vector2 move) { Position += move; if ((BoundingBox.Left < 0) || (BoundingBox.Right > screenWidth)) { direction.X = -direction.X; Position -= new Vector2(move.X, 0); } if ((BoundingBox.Top < 0) || (BoundingBox.Bottom > screenHeight)) { direction.Y = -direction.Y; Position -= new Vector2(0, move.Y); } }

Note that the screen collision test shown here is quite simple. An actual game may want to determine the actual point of intersection so it could deflect away from that point more realistically. If you did need that level of realism, you would probably want to go ahead and implement your strategy here so you could time it. However, all we are trying to prototype here is basic update time so this version is fine for our needs.

Note that the Update method does not test for collisions. We don't want individual sprites testing for collisions because to do so, our Sprite class would have to know about other game objects and we would be severely limiting our design options for collision testing. Any changes to our collision testing algorithm could, and likely would, impact our Sprite class. We want to avoid anything that limits future design changes, so we'll give our Sprite class the ability to test for collisions but require another part of our code to determine what objects should be tested.

We'll talk more about collision testing next. For now, let's see what it takes to time just moving our TestSprite around the screen. Inside our game, we'll create a TestSprite object and call its Initialize and LoadGraphicsContent methods at the appropriate places. And we'll create a SpriteBatch for our game and pass it to our TestSprite's Draw method inside our own Draw method. Now all we need is for our Update method to call our TestSprite's Update method each frame and use a Stopwatch to time it. To do that, we'll create a couple of helper methods that start and stop the Stopwatch, and have it print out the amount of time it takes for each update:

private Stopwatch updateTimer; private int updates = 0; private int framesPerSecond; private void StartTimer() { updateTimer.Start(); } private void StopTimer() { updateTimer.Stop(); updates++; // Show the results every five seconds. if (updates == 5 * framesPerSecond) { Debug.WriteLine( updates + " updates took " + updateTimer.ElapsedTicks + " ticks (" + updateTimer.ElapsedMilliseconds + " milliseconds)."); int msPerUpdate = (int)updateTimer.ElapsedMilliseconds / updates; Debug.WriteLine( "Each update took " + msPerUpdate + " milliseconds."); // Reset stopwatch. updates = 0; updateTimer.Reset(); } }

By putting calls to StartTimer and StopTimer around the calls to our sprite's Update method, we'll get a report of the average time each call takes.

300 updates took 34931 ticks (9 milliseconds). Each update took 0.03 milliseconds. 300 updates took 24445 ticks (6 milliseconds). Each update took 0.02 milliseconds. 300 updates took 23541 ticks (6 milliseconds). Each update took 0.02 milliseconds. 300 updates took 23583 ticks (6 milliseconds). Each update took 0.02 milliseconds. 300 updates took 23963 ticks (6 milliseconds). Each update took 0.02 milliseconds.

Not bad, each call took on average of 20 microseconds (on my development laptop — your results will vary). But notice that the very first set of updates took almost one and half times as long to run as the others. That's because the first time these methods are called, the JIT compiler is compiling the code and our Stopwatch is timing that as well. It's also possible, since this is a fairly small amount of code that is being called repeatedly, that some or all of it may be fitting in the cache, which will increase the speed of later calls.

These show some of the problems with benchmarking code. Another problem is that we are adding some time by using the Stopwatch itself. Thus benchmark times for prototype code can be used as a general guide, but can not be relied upon for exact values. In fact, exact values of the time it takes for functions to run are very hard to determine. Although intended only to describe quantum phenomenon, a variation of the Heisenberg Uncertainty Principle is at play here — the act of measuring something affects the thing being measured. We'll explore this more, and ways to work around it, in Part 3.

Benchmarking Collision Handling

Now let's expand our prototype to help us determine whether we can get away with a brute-force approach to collision detection. You can download the updated project for this section here.

First, let's look at the collision handling code that I already placed in TestSprite's Collide method. Remember that this gets called, for both sprites, whenever TestCollision determines that a collision between two sprites occurred. All it does is set the sprite's color to red.

protected override void Collide( BoundingRectangle overlap, Sprite item) { // Turn the sprite red to indicate collision. Color = Microsoft.Xna.Framework.Graphics.Color.Red; }

Let's give this a test by replacing our single TestSprite with an array of TestSprites. Every place we referenced TestSprite in the original code, we now have to loop through the array to handle all of our TestSprites. To make this a little easier to manage, we'll refactor our original sprite.Update() call in Update into a new UpdateSprites method that updates every sprite. And we'll add a new HandleCollisions method to our game to test for collisions. Finally, we'll change Update so that it only calls StartTimer and StopTimer around the call to HandleCollisions. The relevant sections look like this:

private TestSprite[] sprites = new TestSprite[10]; protected override void Update(GameTime gameTime) { if (Keyboard.GetState().IsKeyDown(Keys.Escape)) { this.Exit(); } UpdateSprites(gameTime); StartTimer(); HandleCollisions(); StopTimer(); base.Update(gameTime); } private void UpdateSprites(GameTime gameTime) { foreach (Sprite sprite in sprites) { sprite.Update(gameTime); } } private void HandleCollisions() { for (int i = 0; i < sprites.Length; i++) { for (int j = i + 1; j < sprites.Length; j++) { sprites[i].TestCollision(sprites[j]); } } }

Looking at that, you may wonder why I'm not using foreach for the HandleCollisions call. It's simply because with foreach we have no way of knowing what sprites we already tested. This algorithm tests every sprite against every other sprite exactly once.

So, what are the results? On my machine, with 10 sprites, I get the following:

300 updates took 48827 ticks (13 milliseconds). Each update took 0.04333333 milliseconds. 300 updates took 42466 ticks (11 milliseconds). Each update took 0.03666667 milliseconds. 300 updates took 42371 ticks (11 milliseconds). Each update took 0.03666667 milliseconds. 300 updates took 43086 ticks (12 milliseconds). Each update took 0.04 milliseconds. 300 updates took 43449 ticks (12 milliseconds). Each update took 0.04 milliseconds.

Wow. Handling collisions for ten sprites takes only twice as long as it did just to move one sprite. How could that be? It's partly due to the overhead of using the Stopwatch class and making function calls, and partly due to the fact that we are measuring very fast operations. Obviously, the closer you get to the resolution of the underlying timer, the more error you get in trying to time things.

We'll talk more about timer resolution later. Before we go on, notice also that the impact of the JIT compiler during our first set of updates is signicantly less, comparatively, than it was in our last example. This shows how effective JIT compilation is and why we don't need to worry about it impacting the performance of our game. We may take a performance hit the first time a section of code is run, but it's relatively miniscule to our overall performance.

Now let's see what happens when we increase the number of sprites to 100:

300 updates took 2079460 ticks (580 milliseconds). Each update took 1.933333 milliseconds. 300 updates took 2156954 ticks (602 milliseconds). Each update took 2.006667 milliseconds. 300 updates took 2138909 ticks (597 milliseconds). Each update took 1.99 milliseconds. 300 updates took 2150696 ticks (600 milliseconds). Each update took 2 milliseconds. 300 updates took 2169919 ticks (606 milliseconds). Each update took 2.02 milliseconds.

Whether you should be impressed or dismayed depends on how you want to use this collision handling algorithm. On the one hand, averaging 2 milliseconds per frame is still a miniscule part of our 16.7 millisecond frame time. If you are not planning on having more than a hundred sprites or so, this algorithm will suit your needs perfectly. But looking at the relative time difference per sprite gives a completely different perspective. It takes us 50 times as long to handle 10 times the number of sprites.

Hmmmm... what happens when we increase to 500? I urge you to run this code so you can see the results for yourself!

300 updates took 28266113 ticks (7896 milliseconds). Each update took 26.32 milliseconds. 300 updates took 28179606 ticks (7872 milliseconds). Each update took 26.24 milliseconds. 300 updates took 28291296 ticks (7903 milliseconds). Each update took 26.34333 milliseconds. 300 updates took 28199114 ticks (7877 milliseconds). Each update took 26.25667 milliseconds. 300 updates took 28182787 ticks (7873 milliseconds). Each update took 26.24333 milliseconds.

Another "Wow!", and this time there is no way to hide the dismay. The movement is noticeably jerky and we're clearly getting far less than our desired 60 frames per second! In fact, just the HandleCollisions call alone is taking almost twice our allotted 16.7 milliseconds per frame. Multiplying the number of objects by 5 increased our time by 13!

Of course, as I mentioned in Part 1, this is what we would expect from an O(n²) algorithm. The times aren't increasing exactly quadratically, due to overhead, but the rate of increase is clear.

Lessons

Does this mean we should never consider this algorithm? Hopefully at this point the answer is obvious. Many games can easily get away with only having on the order of a hundred or so objects active at a time, which we have clearly shown can be handled easily. The fact that the algorithm is trivial to implement and maintain makes it a no-brainer for a large number of games.

On the other hand, if you know you will need to have hundreds of objects, you'll need another solution. You have two options — optimize this algorithm or find a new one. Anyone who is experienced with code optimization will see several obvious ways to make both the algorithm and its implementation more efficient.

For starters, most games don't actually need to test every object against every other object. Taking my Space Invasion game as an example, I don't need to test Invaders for collision with other Invaders. In fact, it's almost crazy to do so. (But I do anyway!)

Another obvious optimization is that the Sprite class's BoundingBox property is adding the Sprite's current screen position to its internal BoundingRectangle every time TestCollision is called. This despite the fact that the Position changes only once or twice per frame. TestCollision, on the other hand, is called once for every other Sprite in the game. (Note that I did optimize this in my actual Space Invasion game by updating the BoundingBox in the Position property's setter. As we can see from these numbers, however, I would have gotten equally satisfactory results if I had used the less efficient approach shown here.)

In addition, the Sprite's TestCollision code is computing the actual intersection rectangle even though we're not using it here. We could easily save some time by not computing it. But we give ourselves more flexibility by going ahead and doing it. Remember that this is supposed to be a generic Sprite class that can be used for many games.

These suggestions don't even get into implementation optimizations, such as always passing our BoundingBoxes by reference instead of value; and providing direct access to member variables instead of accessing them through Properties. These are exactly the types of optimizations suggested by many efficiency proponents in the XNA forums. But these also make the code less readable, harder to debug and harder to maintain.

Since Space Invasion never has more than around 60 objects on screen at a time, the unoptimized brute force approach works just fine. And that's undoubtedly true for many other games as well. (In fact, as we'll see in Part 3, the entire Update loop for Space Invasion averages less than a millisecond per frame!) But what if your game does need more than 100 collideable objects? Shouldn't you make those optimizations so you can handle them?

The answer is... maybe. By making some of these optimizations, we can get this same brute force algorithm to handle 500 objects at a far more reasonable 6.4 milliseconds per frame. You can download the optimized project here

300 updates took 6682522 ticks (1866 milliseconds). Each update took 6.22 milliseconds. 300 updates took 7038462 ticks (1966 milliseconds). Each update took 6.553333 milliseconds. 300 updates took 7023610 ticks (1962 milliseconds). Each update took 6.54 milliseconds. 300 updates took 6718281 ticks (1876 milliseconds). Each update took 6.253334 milliseconds. 300 updates took 7136208 ticks (1993 milliseconds). Each update took 6.643333 milliseconds.

That's an impressive improvement and shows how signicantly performance can be optimized through these techniques. But the disadvantages mentioned above — less maintainable and less flexible code — shouldn't be ignored. And even if you do these sorts of implementation optimizations, keep in mind that this algorithm will still degrade exponentially as you add more objects. You may be able to move up from 100 to 500 objects, but it won't get you to 1000. At some point, you need to recognize that you need a different algorithm to efficiently handle more objects — such as one that partitions your game space, like quadtrees.

Finally, remember that 6.4 milliseconds is still 40% of your entire frame time. If you are maintaining on the order of a thousand or more objects at a time, other parts of your code are almost certainly also going to be difficult to manage at a reasonable frame rate. Is optimizing your collision detection the best use of your time? How do you know in advance which ones to optimize? Optimizing all of them as you go will surely take you longer to write, not to mention make your code more difficult to debug and maintain.

If benchmarking shows your algorithm has problems without implementation optimizations, you are probably better off with a different algorithm. Which isn't to say that these sorts of optimizations aren't still useful. But it's better to do these selectively, once you have your code implemented and have profiled it to find specific problem areas. We'll look at how to use a profiler to find these problem areas next week, in Part 3.

Friday, March 23, 2007

Software Efficiency and Optimization (Part 1)

Software efficiency and optimization is a frequently recurring theme in the XNA Forums. It invariably involves questions about the C# foreach construct, garbage collection, and whether value types should be passed by reference.

While those are all valid concerns, and anyone developing XNA games needs to be aware of these issues, it's clear from many of the posts that some XNA developers were blowing them out of proportion. Many posts have claimed that anyone developing XNA games should never use foreach, as well as suggestions to always pass Matrix structures by reference. This led XNA Framework developer Shawn Hargreaves to write a wonderful post about why it is more important to write clean, maintainable code than to worry overly much about performance, especially early in the development lifecycle.

As a professional programmer who has written software for NASA, commercial communication satellites, and the defense industry, I agree with Shawn almost completely. Two favorite sayings that were drilled into my head early in my career are:

Premature optimization is the root of all evil.

and

The first rule of program optimization: don't do it. The second rule: don't do it yet.

Yet the fact is, game software does have to be efficient. Games are a class of realtime software, meaning that it is not enough for them to produce the correct result, they must also complete within a fixed time window. In general, game developers shoot for a minimum of displaying 30 frames per second in order to produce smooth, glitch-free animations; and most prefer 60 frames per second. That means that all of the game calculations — getting player input, implementing enemy AI, moving objects, collision detection and handling, and drawing each frame — must be completed within 16.7 milliseconds! When you consider that most modern videogames have hundreds, or even thousands, of objects that have to be updated and drawn within that time period, it's no wonder that programmers feel they have to optimize every line of code.

However, from reading the many posts with questions about "the fastest way" to implement something, it is obvious that many XNA programmers aren't familiar with the tools and methods for determining when, where, how, or even if, they should optimize their code. The point of this article is to help you answer those very questions.

Design Versus Implementation

A common response by those who question, or even outright disagree, with the idea that optimizing code early is a bad idea, is to point out that it is far easier to change software early in its lifecycle than after it has been written. That is, of course, very true. But that is why it is important to understand the difference between design optimization and implementation optimization.

While designing a game (or any software), you must take into account the size and complexity of your game, and select the correct data structures and algorithms that can support it. A simple 2D shooter or platformer with no more than a hundred or so objects interacting at any given time can probably get away with a brute force approach for handling movement and collisions. Maintaining a simple list or array of objects and iterating through it each frame will most likely work fine, and will be very simple to implement and debug.

However, a more complex game world, with perhaps thousands of active objects, will need an efficient method of partitioning the game space to minimize the number of object interaction tests each frame. Similarly, games requiring detailed enemy AI will need to rely on algorithms that can produce "intelligent" actions as quickly as possible.

There are a large number of available resources that discuss game programming algorithms, including the use of quadtrees and octrees for partitioning the game world to minimize collision detection tests; the minimax algorithm with alpha-beta pruning for efficiently finding the "best" move in two player strategy games; and the A* algorithm for efficient pathfinding.

Discussion of the specifics of these algorithms is outside the scope of this article. The important thing to take from this is:

The selection of the appropriate data structures and algorithms during the design phase has a far greater impact on the eventual performance of your game than any implementation optimization you will make.

Why? Because your algorithms determine the maximum number of operations your game will have have to perform during each frame.

To demonstrate this point, imagine that for your first game you write a simple 2D shooter that relies on a brute force approach to collision detection. Every frame, you simply test every active object against every other active object to see if they intersect. Because you decide to have only a limited number of enemies active at a time, it works well and easily runs at 60 frames per second.

With that experience under your belt, you now want to write a second, far more ambitious game. This time you decide to write a Zelda-like adventure game with a large scrolling game board and hundreds of objects moving around it simultaneously. Using your existing code as a starting point, you get well into the game's implementation before you discover that the brute force approach that worked very well in your simple game doesn't work so well in this new game. In fact, you may be measuring screen draws in seconds per frame instead of frames per second!

The reason is that comparing every object against every other object is what is known as an O(n²) algorithm. That is, the number of operations that have to be performed is related to the square of the number of objects you are operating on. If you have ten objects in your game, you only have to perform a hundred tests to see if there are any collisions. If you have a hundred objects, you have to perform ten thousand tests. Which may still be possible on a modern PC if each test can be done quickly enough. But if you have five hundred — just five times as many as the last example — you will have to perform 250,000 collision tests. Even if each test took only 67 microseconds, you would stll be using the entire 16.7 milliseconds frame time (at 60 frames per second) just for collision detection. The point is, it doesn't matter how efficiently you implement that algorithm in code, its performance will still devolve exponentially with the number of objects in your game and will therefore be the single greatest limiting factor to the size of your game.

Big O Notation

Big O notation is a mathematical method of comparing algorithmic efficiency. It is sometimes derided by programmers because it says nothing about the actual time a particular implementation of an algorithm will take. However, it can be quite useful if applied appropriately to help compare algorithms at design time.

The following table shows the relative efficiency of the most common types of algorithms, in decreasing order of efficiency. Thus, an O(log n) algorithm is considered more efficient than an O(n) algorithm, which is more efficient than an O(n²) algorithm.

Efficiency	Mathematical Property	Comment
O(1)	Constant	The number of operations are always the same no matter how many objects are being operated on.
O(log n)	Logarithmic	The number of operations increases proportional to the logarithm of the number of objects being operated on.
O(n)	Linear	The number of operations increases proportional to the number of objects being operated on.
O(n²)	Quadratic	The number of operations increases proportional to the square of the number of objects being operated on.
O(n!)	Factorial	The number of operations increases proportional to the factorial of the number of objects being operated on.

Remember that the point here is to determine how well the algorithms scale to handle more objects. An O(n²) algorithm where each actual operation is performed very quickly, may be faster than an O(log n) algorithm for many values of n. However, the O(n²) algorithm will always have a lower limit to the maximum number of objects it can handle. The trick is to choose an algorithm that is easy to implement while being efficient enough for the number of objects you want to handle in your game.

For discussions of videogame related algorithms, I highly recommend the Game Programming Gems series. Various websites, including Gamasutra and GameDev.net also have many good articles on algorithms.

It's also important to understand the relative performance of the C# generic collection classes. MSDN has a good article on how to select the appropriate Collection class for your needs, as well as an article on when to use Generic Collections. (My answer to that question is to always prefer Generic Collections to their non-generic counterparts. They are just as easy to use, are almost always more efficient, and have less chance of impacting the garbage collector.)

What Next?

That's an awful lot of background and theory, and I haven't even started to talk about optimizing implementation performance. Of course, that's the point I am trying to make — understanding design tradeoffs in order to select the design that is most appropriate for your game is the single most important task you will face in writing an efficient game.

However, it doesn't end there, and you may well end up needing to optimize certain sections of your code as you go along. Part 2 of this series will look at the importance of prototyping various aspects of your game, as well as showing you tools for profiling and benchmarking your code in order to determine which, if any, sections need to be improved.