Sunday, March 25, 2007

Software Efficiency And Optimization (Part 2)

In Part 1 of this series, I discussed the importance of understanding game design tradeoffs and introduced the concept of Big O notation for comparing the efficiency of different algorithms. However, as I said, Big O notation tells us nothing about how fast an algorithm will actually run; it only tells us how efficiently it scales as the number of objects it operates on grows.

"That's great", I hear you saying, "but how do I know when I'm doing my design whether a particular algorithm will meet my needs?"

Ah, well, there is really only one way — experience.

"Gee, thanks,", you reply. "How am I supposed to get the experience I need to design my game, if you say I shouldn't write it until after I've designed it?!?"

That's actually pretty simple. You need to prototype your approach and see how well it meets your requirements. If it doesn't meet your requirements, you need to either prototype different approaches until you find one that does, or scale back your requirements.

Prototyping and Benchmarking

A prototype is just a simplified version of a piece of software (in this case, your game) that focuses on one particular aspect of it. Prototypes are often used as proofs of concept to show that the software will be able to work as expected. Because prototypes don't have to deal with all of the details that the final software will, they can be written quickly so that, if necessary, different approaches can be evaluated.

Prototypes are frequently used to evaluate user interfaces so that customers can provide early feedback. That can be useful for game programming too, since if you can implement a working display and control scheme you may be able to find out what works and doesn't work before you get too far along in the actual implementation of the game. However, the use of prototypes that we are concerned with here is to determine whether an algorithm is fast enough for the game we want to write. To do that, we will want to benchmark it. Benchmarking is just the process of timing how long an algorithm takes to run.

Fortunately, the .NET Framework makes benchmarking very easy by providing the System.Diagnostics.Stopwatch class. The Stopwatch class provides a Start and a Stop method. It keeps track of the total number of clock ticks that occur between calls to Start and Stop. Even better, like a real stopwatch, it keeps a running count of ticks across successive calls to Start and Stop. You can find out how much time has passed by querying its ElapsedTicks or ElapsedMilliseconds properties. A Reset method lets us reset the Stopwatch back to zero.
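To see the basic pattern, here is a minimal sketch of timing a single call. DoWork is just a hypothetical stand-in for whatever code you want to measure; both Stopwatch and Debug live in the System.Diagnostics namespace:

private void TimeOneCall()
{
    Stopwatch timer = new Stopwatch();

    timer.Start();
    DoWork();                 // hypothetical stand-in for the code being measured
    timer.Stop();

    Debug.WriteLine("DoWork took " + timer.ElapsedTicks + " ticks (" +
        timer.ElapsedMilliseconds + " milliseconds).");

    timer.Reset();            // back to zero for the next measurement
}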

Let's put this to use in a simple example that merely tests how long it takes to move a sprite. For this, I'll use a simplified version of the Sprite class I wrote for my Space Invasion game. A quick overview of the class is provided below. You can download the full code for this benchmark example here.

public abstract class Sprite
{
    public Vector2 Position { get; set; }

    public Color Color { get; set; }

    /// <summary>
    /// Sprite's collision rectangle in screen coordinates.
    /// </summary>
    public BoundingRectangle BoundingBox { get; }

    public Sprite(string imageName, BoundingRectangle boundingBox);

    public virtual void Initialize();

    public virtual void LoadGraphicsContent(ContentManager content);

    public virtual void Update(GameTime time);

    public virtual void Draw(SpriteBatch spriteBatch);

    /// <summary>
    /// Tests for collision with another Sprite. If a
    /// collision occurs, it calls the Collide method for
    /// both Sprites.
    /// Returns true if images collide.
    /// </summary>
    public bool TestCollision(Sprite item);

    /// <summary>
    /// Called when the TestCollision method detects a
    /// collision with another Sprite.
    /// </summary>
    protected virtual void Collide(BoundingRectangle overlap, Sprite item);
}

As you can see, it provides methods corresponding to all of the usual game engine methods — Initialize, LoadGraphicsContent, Update, and Draw. Additionally, it provides properties for getting and setting the position and color. For collision detection, it provides a TestCollision method that tests for collision with another Sprite and a virtual method, Collide, that gets called by TestCollision for both sprites if their BoundingBox values intersect.

Since it is abstract, it is intended to be used as a parent to other Sprite classes that will implement its behavior, so we will create our own TestSprite class. The TestSprite will generate a random starting position, directional movement vector, and speed (in pixels per second), as shown here:

public override void Initialize()
{
    // Set starting position.
    Position = new Vector2(
        random.Next(screenWidth),
        random.Next(screenHeight));

    // Create a random movement vector.
    direction.X = (float)random.NextDouble() * 2 - 1;
    direction.Y = (float)random.NextDouble() * 2 - 1;
    direction.Normalize();

    // Determine random speed in pixels per second.
    speed = (float)random.NextDouble() * 300 + 150;
}

Each frame, it will update its position based on its movement direction, speed, and the amount of time that has elapsed. It will also test to see if it has hit the edge of the screen, and deflect away from it:

public override void Update(GameTime time)
{
    // Reset color back to white.
    Color = Microsoft.Xna.Framework.Graphics.Color.White;

    // Calculate movement vector.
    Vector2 move = (float)time.ElapsedGameTime.TotalSeconds * speed * direction;

    // Determine new position.
    UpdatePosition(move);
}

private void UpdatePosition(Vector2 move)
{
    Position += move;

    if ((BoundingBox.Left < 0) ||
        (BoundingBox.Right > screenWidth))
    {
        direction.X = -direction.X;
        Position -= new Vector2(move.X, 0);
    }

    if ((BoundingBox.Top < 0) ||
        (BoundingBox.Bottom > screenHeight))
    {
        direction.Y = -direction.Y;
        Position -= new Vector2(0, move.Y);
    }
}

Note that the screen collision test shown here is quite simple. A real game might want to determine the actual point of intersection so it could deflect away from that point more realistically. If you did need that level of realism, you would probably want to go ahead and implement your strategy here so you could time it. However, all we are trying to prototype here is basic update time, so this version is fine for our needs.
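For what it's worth, if you did want that extra precision, a sketch might look something like this, pushing the sprite back exactly to the edge it crossed before reversing direction. UpdatePositionClamped is a hypothetical name, and this is not code from the sample project:

// A sketch of a slightly more precise response: move the sprite back
// exactly to the screen edge it crossed, then reverse its direction.
private void UpdatePositionClamped(Vector2 move)
{
    Position += move;

    if (BoundingBox.Left < 0)
    {
        Position += new Vector2(-BoundingBox.Left, 0);   // push back to the left edge
        direction.X = -direction.X;
    }
    else if (BoundingBox.Right > screenWidth)
    {
        Position -= new Vector2(BoundingBox.Right - screenWidth, 0);
        direction.X = -direction.X;
    }

    if (BoundingBox.Top < 0)
    {
        Position += new Vector2(0, -BoundingBox.Top);    // push back to the top edge
        direction.Y = -direction.Y;
    }
    else if (BoundingBox.Bottom > screenHeight)
    {
        Position -= new Vector2(0, BoundingBox.Bottom - screenHeight);
        direction.Y = -direction.Y;
    }
}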

Note that the Update method does not test for collisions. We don't want individual sprites testing for collisions because to do so, our Sprite class would have to know about other game objects and we would be severely limiting our design options for collision testing. Any changes to our collision testing algorithm could, and likely would, impact our Sprite class. We want to avoid anything that limits future design changes, so we'll give our Sprite class the ability to test for collisions but require another part of our code to determine what objects should be tested.

We'll talk more about collision testing next. For now, let's see what it takes to time just moving our TestSprite around the screen. Inside our game, we'll create a TestSprite object and call its Initialize and LoadGraphicsContent methods at the appropriate places. And we'll create a SpriteBatch for our game and pass it to our TestSprite's Draw method inside our own Draw method. Now all we need is for our Update method to call our TestSprite's Update method each frame and use a Stopwatch to time it. To do that, we'll create a couple of helper methods that start and stop the Stopwatch, and have it print out the amount of time it takes for each update:

private Stopwatch updateTimer = new Stopwatch();
private int updates = 0;
private int framesPerSecond = 60;   // XNA's default fixed update rate

private void StartTimer()
{
    updateTimer.Start();
}

private void StopTimer()
{
    updateTimer.Stop();
    updates++;

    // Show the results every five seconds.
    if (updates == 5 * framesPerSecond)
    {
        Debug.WriteLine(
            updates + " updates took " +
            updateTimer.ElapsedTicks + " ticks (" +
            updateTimer.ElapsedMilliseconds + " milliseconds).");

        float msPerUpdate =
            (float)updateTimer.ElapsedMilliseconds / updates;
        Debug.WriteLine(
            "Each update took " + msPerUpdate + " milliseconds.");

        // Reset stopwatch.
        updates = 0;
        updateTimer.Reset();
    }
}

By putting calls to StartTimer and StopTimer around the calls to our sprite's Update method, we'll get a report of the average time each call takes.
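Putting it all together, the game's Update method for this first benchmark looks something like this sketch (sprite is the single TestSprite created above):

protected override void Update(GameTime gameTime)
{
    if (Keyboard.GetState().IsKeyDown(Keys.Escape))
    {
        this.Exit();
    }

    // Time just the sprite's Update call.
    StartTimer();
    sprite.Update(gameTime);
    StopTimer();

    base.Update(gameTime);
}

And here is the report it produces: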

300 updates took 34931 ticks (9 milliseconds).
Each update took 0.03 milliseconds.
300 updates took 24445 ticks (6 milliseconds).
Each update took 0.02 milliseconds.
300 updates took 23541 ticks (6 milliseconds).
Each update took 0.02 milliseconds.
300 updates took 23583 ticks (6 milliseconds).
Each update took 0.02 milliseconds.
300 updates took 23963 ticks (6 milliseconds).
Each update took 0.02 milliseconds.

Not bad: each call took an average of 20 microseconds (on my development laptop — your results will vary). But notice that the very first set of updates took almost one and a half times as long to run as the others. That's because the first time these methods are called, the JIT compiler is compiling the code, and our Stopwatch is timing that as well. It's also possible, since this is a fairly small amount of code being called repeatedly, that some or all of it fits in the cache, which speeds up later calls.

These results show some of the problems with benchmarking code. Another problem is that we add some overhead just by using the Stopwatch itself. Benchmark times for prototype code can therefore be used as a general guide, but cannot be relied upon for exact values. In fact, exact values for the time a function takes to run are very hard to determine. A variation of the Heisenberg Uncertainty Principle, originally intended only to describe quantum phenomena, is at play here — the act of measuring something affects the thing being measured. We'll explore this more, and ways to work around it, in Part 3.
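In the meantime, one quick sanity check is to measure the measurement itself: time an empty Start/Stop pair many times and average the result. Whatever number you get is roughly the noise floor of your benchmark. A sketch:

// A sketch: estimate the overhead of the Stopwatch itself by timing
// nothing at all, many times, and averaging the result.
Stopwatch overheadTimer = new Stopwatch();
const int iterations = 100000;

for (int i = 0; i < iterations; i++)
{
    overheadTimer.Start();
    overheadTimer.Stop();
}

Debug.WriteLine("Average Start/Stop overhead: " +
    ((float)overheadTimer.ElapsedTicks / iterations) + " ticks per pair.");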

Benchmarking Collision Handling

Now let's expand our prototype to help us determine whether we can get away with a brute-force approach to collision detection. You can download the updated project for this section here.

First, let's look at the collision handling code that I already placed in TestSprite's Collide method. Remember that this gets called, for both sprites, whenever TestCollision determines that a collision between two sprites occurred. All it does is set the sprite's color to red.

protected override void Collide(BoundingRectangle overlap, Sprite item)
{
    // Turn the sprite red to indicate collision.
    Color = Microsoft.Xna.Framework.Graphics.Color.Red;
}

Let's give this a test by replacing our single TestSprite with an array of TestSprites. Every place we referenced TestSprite in the original code, we now have to loop through the array to handle all of our TestSprites. To make this a little easier to manage, we'll refactor our original sprite.Update() call in Update into a new UpdateSprites method that updates every sprite. And we'll add a new HandleCollisions method to our game to test for collisions. Finally, we'll change Update so that it only calls StartTimer and StopTimer around the call to HandleCollisions. The relevant sections look like this:

private TestSprite[] sprites = new TestSprite[10];

protected override void Update(GameTime gameTime)
{
    if (Keyboard.GetState().IsKeyDown(Keys.Escape))
    {
        this.Exit();
    }

    UpdateSprites(gameTime);

    StartTimer();
    HandleCollisions();
    StopTimer();

    base.Update(gameTime);
}

private void UpdateSprites(GameTime gameTime)
{
    foreach (Sprite sprite in sprites)
    {
        sprite.Update(gameTime);
    }
}

private void HandleCollisions()
{
    for (int i = 0; i < sprites.Length; i++)
    {
        for (int j = i + 1; j < sprites.Length; j++)
        {
            sprites[i].TestCollision(sprites[j]);
        }
    }
}

Looking at that, you may wonder why I'm not using foreach for the HandleCollisions loops. It's simply because with foreach we have no way of knowing which sprites we've already tested, so every pair would end up being tested twice. By starting the inner loop at i + 1, this algorithm tests every pair of sprites exactly once.

So, what are the results? On my machine, with 10 sprites, I get the following:

300 updates took 48827 ticks (13 milliseconds).
Each update took 0.04333333 milliseconds.
300 updates took 42466 ticks (11 milliseconds).
Each update took 0.03666667 milliseconds.
300 updates took 42371 ticks (11 milliseconds).
Each update took 0.03666667 milliseconds.
300 updates took 43086 ticks (12 milliseconds).
Each update took 0.04 milliseconds.
300 updates took 43449 ticks (12 milliseconds).
Each update took 0.04 milliseconds.

Wow. Handling collisions for ten sprites takes only twice as long as it did just to move one sprite. How could that be? It's partly due to the overhead of using the Stopwatch class and making function calls, and partly due to the fact that we are measuring very fast operations. Obviously, the closer you get to the resolution of the underlying timer, the more error you get in trying to time things.

We'll talk more about timer resolution later. Before we go on, notice also that the impact of the JIT compiler during our first set of updates is significantly smaller, comparatively, than it was in our last example. This shows how effective JIT compilation is and why we don't need to worry about it impacting the performance of our game. We may take a performance hit the first time a section of code runs, but it's minuscule relative to our overall performance.

Now let's see what happens when we increase the number of sprites to 100:

300 updates took 2079460 ticks (580 milliseconds).
Each update took 1.933333 milliseconds.
300 updates took 2156954 ticks (602 milliseconds).
Each update took 2.006667 milliseconds.
300 updates took 2138909 ticks (597 milliseconds).
Each update took 1.99 milliseconds.
300 updates took 2150696 ticks (600 milliseconds).
Each update took 2 milliseconds.
300 updates took 2169919 ticks (606 milliseconds).
Each update took 2.02 milliseconds.

Whether you should be impressed or dismayed depends on how you want to use this collision handling algorithm. On the one hand, averaging 2 milliseconds per frame is still a minuscule part of our 16.7 millisecond frame time. If you are not planning on having more than a hundred sprites or so, this algorithm will suit your needs perfectly. But looking at the relative time difference per sprite gives a completely different perspective. It takes us 50 times as long to handle 10 times the number of sprites.

Hmmmm... what happens when we increase to 500? I urge you to run this code so you can see the results for yourself!

300 updates took 28266113 ticks (7896 milliseconds).
Each update took 26.32 milliseconds.
300 updates took 28179606 ticks (7872 milliseconds).
Each update took 26.24 milliseconds.
300 updates took 28291296 ticks (7903 milliseconds).
Each update took 26.34333 milliseconds.
300 updates took 28199114 ticks (7877 milliseconds).
Each update took 26.25667 milliseconds.
300 updates took 28182787 ticks (7873 milliseconds).
Each update took 26.24333 milliseconds.

Another "Wow!", and this time there is no way to hide the dismay. The movement is noticeably jerky and we're clearly getting far less than our desired 60 frames per second! In fact, just the HandleCollisions call alone is taking almost twice our allotted 16.7 milliseconds per frame. Multiplying the number of objects by 5 increased our time by 13!

Of course, as I mentioned in Part 1, this is what we would expect from an O(n²) algorithm. With n sprites there are n(n-1)/2 pairs to test: 45 for 10 sprites, 4,950 for 100, and 124,750 for 500. The times aren't increasing exactly quadratically, due to fixed overhead, but the rate of increase is clear.

Lessons

Does this mean we should never consider this algorithm? Hopefully at this point the answer is obvious. Many games can easily get away with only having on the order of a hundred or so objects active at a time, which we have clearly shown can be handled easily. The fact that the algorithm is trivial to implement and maintain makes it a no-brainer for a large number of games.

On the other hand, if you know you will need to have hundreds of objects, you'll need another solution. You have two options — optimize this algorithm or find a new one. Anyone who is experienced with code optimization will see several obvious ways to make both the algorithm and its implementation more efficient.

For starters, most games don't actually need to test every object against every other object. Taking my Space Invasion game as an example, I don't need to test Invaders for collision with other Invaders. In fact, it's almost crazy to do so. (But I do anyway!)
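One simple way to express that is to give each sprite some notion of a collision group and skip pairs that can never interact. The CollisionGroup property below is hypothetical, not part of the Sprite class shown earlier; this is just a sketch of the idea:

// CollisionGroup is hypothetical: each sprite declares a group, and pairs
// in the same group (e.g. Invader vs. Invader) are never tested.
private void HandleCollisions()
{
    for (int i = 0; i < sprites.Length; i++)
    {
        for (int j = i + 1; j < sprites.Length; j++)
        {
            if (sprites[i].CollisionGroup != sprites[j].CollisionGroup)
            {
                sprites[i].TestCollision(sprites[j]);
            }
        }
    }
}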

Another obvious optimization is that the Sprite class's BoundingBox property adds the Sprite's current screen position to its internal BoundingRectangle every time TestCollision is called, even though the Position changes only once or twice per frame, while TestCollision is called once for every other Sprite in the game. (Note that I did optimize this in my actual Space Invasion game by updating the BoundingBox in the Position property's setter. As we can see from these numbers, however, I would have gotten equally satisfactory results with the less efficient approach shown here.)
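The shape of that optimization is the classic caching pattern: rebuild the screen-space rectangle in the Position setter and have the BoundingBox getter simply return the cached value. The sketch below is not the actual Space Invasion code; the field names and the BoundingRectangle constructor it assumes are illustrative only:

private Vector2 position;
private BoundingRectangle relativeBounds;   // bounds relative to the sprite's origin
private BoundingRectangle boundingBox;      // cached screen-space bounds

public Vector2 Position
{
    get { return position; }
    set
    {
        position = value;
        // Rebuild the cached screen-space rectangle here, once or twice per
        // frame, instead of on every BoundingBox access. The constructor
        // arguments shown are an assumption about BoundingRectangle.
        boundingBox = new BoundingRectangle(
            position.X + relativeBounds.Left,
            position.Y + relativeBounds.Top,
            relativeBounds.Width,
            relativeBounds.Height);
    }
}

public BoundingRectangle BoundingBox
{
    get { return boundingBox; }   // just return the cached value
}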

In addition, the Sprite's TestCollision code is computing the actual intersection rectangle even though we're not using it here. We could easily save some time by not computing it. But we give ourselves more flexibility by going ahead and doing it. Remember that this is supposed to be a generic Sprite class that can be used for many games.

These suggestions don't even get into implementation optimizations, such as always passing our BoundingBoxes by reference instead of by value, or providing direct access to member variables instead of going through properties. These are exactly the types of optimizations suggested by many efficiency proponents in the XNA forums. But they also make the code less readable, harder to debug, and harder to maintain.
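For reference, this is the kind of change being suggested, using a standalone intersection test as an example. The ref/out version mirrors the pattern XNA uses for its own math types, but this particular method is hypothetical and assumes BoundingRectangle is a struct exposing Left, Right, Top, and Bottom:

// By-value: the entire BoundingRectangle is copied onto the stack each call.
public static bool Intersects(BoundingRectangle a, BoundingRectangle b)
{
    return (a.Left < b.Right) && (b.Left < a.Right) &&
           (a.Top < b.Bottom) && (b.Top < a.Bottom);
}

// By-reference: only references are passed, avoiding the copies, but every
// call site now has to say "ref", which is noisier and easier to get wrong.
public static void Intersects(ref BoundingRectangle a, ref BoundingRectangle b,
                              out bool result)
{
    result = (a.Left < b.Right) && (b.Left < a.Right) &&
             (a.Top < b.Bottom) && (b.Top < a.Bottom);
}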

Since Space Invasion never has more than around 60 objects on screen at a time, the unoptimized brute force approach works just fine. And that's undoubtedly true for many other games as well. (In fact, as we'll see in Part 3, the entire Update loop for Space Invasion averages less than a millisecond per frame!) But what if your game does need more than 100 collideable objects? Shouldn't you make those optimizations so you can handle them?

The answer is... maybe. By making some of these optimizations, we can get this same brute force algorithm to handle 500 objects at a far more reasonable 6.4 milliseconds per frame. You can download the optimized project here.

300 updates took 6682522 ticks (1866 milliseconds).
Each update took 6.22 milliseconds.
300 updates took 7038462 ticks (1966 milliseconds).
Each update took 6.553333 milliseconds.
300 updates took 7023610 ticks (1962 milliseconds).
Each update took 6.54 milliseconds.
300 updates took 6718281 ticks (1876 milliseconds).
Each update took 6.253334 milliseconds.
300 updates took 7136208 ticks (1993 milliseconds).
Each update took 6.643333 milliseconds.

That's an impressive improvement and shows how significantly performance can be improved through these techniques. But the disadvantages mentioned above — less maintainable and less flexible code — shouldn't be ignored. And even if you do make these sorts of implementation optimizations, keep in mind that this algorithm still degrades quadratically as you add more objects. You may be able to move up from 100 to 500 objects, but it won't get you to 1000. At some point you need to recognize that you need a different algorithm to efficiently handle more objects — such as one that partitions your game space, like a quadtree.
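To give a flavor of what partitioning the game space means, here is a very rough sketch of the simplest version of the idea, a uniform grid; a quadtree refines it by subdividing only where objects actually cluster. This is illustrative only, not part of the sample project, and it uses Dictionary and List from System.Collections.Generic:

// Illustrative only: a uniform grid "broad phase". Each sprite is dropped
// into the cell containing its position, and only sprites sharing a cell
// are tested against each other. A real implementation also has to deal
// with sprites that straddle cell boundaries (by inserting them into every
// cell they overlap, or by checking neighboring cells).
private const int CellSize = 64;   // pixels per cell
private Dictionary<long, List<Sprite>> grid = new Dictionary<long, List<Sprite>>();

private static long CellKey(float x, float y)
{
    int cellX = (int)(x / CellSize);
    int cellY = (int)(y / CellSize);
    return ((long)cellX << 32) ^ (uint)cellY;   // pack the two indices into one key
}

private void HandleCollisionsWithGrid()
{
    grid.Clear();

    foreach (Sprite sprite in sprites)
    {
        long key = CellKey(sprite.Position.X, sprite.Position.Y);
        List<Sprite> cell;
        if (!grid.TryGetValue(key, out cell))
        {
            cell = new List<Sprite>();
            grid.Add(key, cell);
        }
        cell.Add(sprite);
    }

    // Brute force within each cell; each cell only holds a handful of sprites.
    foreach (List<Sprite> cell in grid.Values)
    {
        for (int i = 0; i < cell.Count; i++)
        {
            for (int j = i + 1; j < cell.Count; j++)
            {
                cell[i].TestCollision(cell[j]);
            }
        }
    }
}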

Finally, remember that 6.4 milliseconds is still 40% of your entire frame time. If you are maintaining on the order of a thousand or more objects at a time, other parts of your code are almost certainly going to have trouble keeping up a reasonable frame rate as well. Is optimizing your collision detection the best use of your time? How do you know in advance which parts to optimize? Optimizing all of them as you go will surely take you longer to write, not to mention make your code more difficult to debug and maintain.

If benchmarking shows your algorithm has problems without implementation optimizations, you are probably better off with a different algorithm. Which isn't to say that these sorts of optimizations aren't still useful. But it's better to do these selectively, once you have your code implemented and have profiled it to find specific problem areas. We'll look at how to use a profiler to find these problem areas next week, in Part 3.

Friday, March 23, 2007

Software Efficiency and Optimization (Part 1)

Software efficiency and optimization is a frequently recurring theme in the XNA Forums. It invariably involves questions about the C# foreach construct, garbage collection, and whether value types should be passed by reference.

While those are all valid concerns, and anyone developing XNA games needs to be aware of these issues, it's clear from many of the posts that some XNA developers are blowing them out of proportion. Many posts claim that anyone developing XNA games should never use foreach, or insist that Matrix structures should always be passed by reference. This led XNA Framework developer Shawn Hargreaves to write a wonderful post about why it is more important to write clean, maintainable code than to worry overly much about performance, especially early in the development lifecycle.

As a professional programmer who has written software for NASA, commercial communication satellites, and the defense industry, I agree with Shawn almost completely. Two favorite sayings that were drilled into my head early in my career are:

Premature optimization is the root of all evil.

and

The first rule of program optimization: don't do it. The second rule: don't do it yet.

Yet the fact is, game software does have to be efficient. Games are a class of realtime software, meaning that it is not enough for them to produce the correct result; they must also complete within a fixed time window. In general, game developers shoot for displaying a minimum of 30 frames per second in order to produce smooth, glitch-free animation, and most prefer 60 frames per second. That means that all of the game calculations — getting player input, implementing enemy AI, moving objects, collision detection and handling, and drawing each frame — must be completed within 16.7 milliseconds! When you consider that most modern videogames have hundreds, or even thousands, of objects that have to be updated and drawn within that time period, it's no wonder that programmers feel they have to optimize every line of code.

However, from reading the many posts with questions about "the fastest way" to implement something, it is obvious that many XNA programmers aren't familiar with the tools and methods for determining when, where, how, or even if, they should optimize their code. The point of this article is to help you answer those very questions.

Design Versus Implementation

A common response from those who question, or even outright disagree with, the idea that optimizing code early is a bad idea is to point out that it is far easier to change software early in its lifecycle than after it has been written. That is, of course, very true. But that is why it is important to understand the difference between design optimization and implementation optimization.

While designing a game (or any software), you must take into account the size and complexity of your game, and select the correct data structures and algorithms that can support it. A simple 2D shooter or platformer with no more than a hundred or so objects interacting at any given time can probably get away with a brute force approach for handling movement and collisions. Maintaining a simple list or array of objects and iterating through it each frame will most likely work fine, and will be very simple to implement and debug.
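In code terms, that brute force design is nothing more complicated than a flat list walked once per frame, along the lines of this sketch (the GameObject type and its methods are hypothetical placeholders, not code from an actual game):

// One flat list of objects, walked once per frame for updates and once
// more for collision tests.
private List<GameObject> objects = new List<GameObject>();

public void UpdateWorld(GameTime gameTime)
{
    foreach (GameObject obj in objects)
    {
        obj.Update(gameTime);
    }

    for (int i = 0; i < objects.Count; i++)
    {
        for (int j = i + 1; j < objects.Count; j++)
        {
            objects[i].TestCollision(objects[j]);
        }
    }
}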

However, a more complex game world, with perhaps thousands of active objects, will need an efficient method of partitioning the game space to minimize the number of object interaction tests each frame. Similarly, games requiring detailed enemy AI will need to rely on algorithms that can produce "intelligent" actions as quickly as possible.

There are a large number of available resources that discuss game programming algorithms, including the use of quadtrees and octrees for partitioning the game world to minimize collision detection tests; the minimax algorithm with alpha-beta pruning for efficiently finding the "best" move in two player strategy games; and the A* algorithm for efficient pathfinding.

Discussion of the specifics of these algorithms is outside the scope of this article. The important thing to take from this is:

The selection of the appropriate data structures and algorithms during the design phase has a far greater impact on the eventual performance of your game than any implementation optimization you will make.

Why? Because your algorithms determine the maximum number of operations your game will have to perform during each frame.

To demonstrate this point, imagine that for your first game you write a simple 2D shooter that relies on a brute force approach to collision detection. Every frame, you simply test every active object against every other active object to see if they intersect. Because you decide to have only a limited number of enemies active at a time, it works well and easily runs at 60 frames per second.

With that experience under your belt, you now want to write a second, far more ambitious game. This time you decide to write a Zelda-like adventure game with a large scrolling game board and hundreds of objects moving around it simultaneously. Using your existing code as a starting point, you get well into the game's implementation before you discover that the brute force approach that worked very well in your simple game doesn't work so well in this new game. In fact, you may be measuring screen draws in seconds per frame instead of frames per second!

The reason is that comparing every object against every other object is what is known as an O(n²) algorithm. That is, the number of operations that have to be performed is related to the square of the number of objects you are operating on. If you have ten objects in your game, you only have to perform a hundred tests to see if there are any collisions. If you have a hundred objects, you have to perform ten thousand tests, which may still be possible on a modern PC if each test can be done quickly enough. But if you have five hundred — just five times as many as the last example — you will have to perform 250,000 collision tests. Even if each test took only 67 nanoseconds, you would still be using the entire 16.7 millisecond frame time (at 60 frames per second) just for collision detection. The point is, it doesn't matter how efficiently you implement that algorithm in code; its performance will still degrade quadratically with the number of objects in your game, and it will therefore be the single greatest limiting factor to the size of your game.

Big O Notation

Big O notation is a mathematical method of comparing algorithmic efficiency. It is sometimes derided by programmers because it says nothing about the actual time a particular implementation of an algorithm will take. However, it can be quite useful if applied appropriately to help compare algorithms at design time.

The following table shows the relative efficiency of the most common types of algorithms, in decreasing order of efficiency. Thus, an O(log n) algorithm is considered more efficient than an O(n) algorithm, which is more efficient than an O(n²) algorithm.

Efficiency | Mathematical Property | Comment
O(1) | Constant | The number of operations is always the same, no matter how many objects are being operated on.
O(log n) | Logarithmic | The number of operations increases in proportion to the logarithm of the number of objects being operated on.
O(n) | Linear | The number of operations increases in proportion to the number of objects being operated on.
O(n²) | Quadratic | The number of operations increases in proportion to the square of the number of objects being operated on.
O(n!) | Factorial | The number of operations increases in proportion to the factorial of the number of objects being operated on.

Remember that the point here is to determine how well the algorithms scale to handle more objects. An O(n²) algorithm where each individual operation is performed very quickly may be faster than an O(log n) algorithm for many values of n. However, the O(n²) algorithm will always have a lower ceiling on the maximum number of objects it can handle. The trick is to choose an algorithm that is easy to implement while being efficient enough for the number of objects you want to handle in your game.

For discussions of videogame related algorithms, I highly recommend the Game Programming Gems series. Various websites, including Gamasutra and GameDev.net also have many good articles on algorithms.

It's also important to understand the relative performance of the C# generic collection classes. MSDN has a good article on how to select the appropriate Collection class for your needs, as well as an article on when to use Generic Collections. (My answer to that question is to always prefer Generic Collections to their non-generic counterparts. They are just as easy to use, are almost always more efficient, and have less chance of impacting the garbage collector.)
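The garbage collection point deserves a quick illustration. The old non-generic collections store everything as object, so adding a value type boxes it, creating a small heap allocation the garbage collector eventually has to clean up; the generic versions store value types directly. A minimal comparison (ArrayList lives in System.Collections, List<T> in System.Collections.Generic):

static void CollectionComparison()
{
    ArrayList oldList = new ArrayList();
    oldList.Add(42);                 // boxes the int: a heap allocation the GC must eventually collect
    int a = (int)oldList[0];         // unbox + cast to get the value back

    List<int> newList = new List<int>();
    newList.Add(42);                 // stored directly: no boxing, no per-element allocation
    int b = newList[0];              // no cast needed
}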

What Next?

That's an awful lot of background and theory, and I haven't even started to talk about optimizing implementation performance. Of course, that's the point I am trying to make — understanding design tradeoffs in order to select the design that is most appropriate for your game is the single most important task you will face in writing an efficient game.

However, it doesn't end there, and you may well end up needing to optimize certain sections of your code as you go along. Part 2 of this series will look at the importance of prototyping various aspects of your game, as well as showing you tools for profiling and benchmarking your code in order to determine which, if any, sections need to be improved.