Paper presented at GDC 2003
- Introduction
- Problem Background
- Scope of Automation within TSO
- Technical Approach Summary
- High level requirements
- System Architecture
- Test Client: Design and Implementation
- Presentation Layer
- SimScript
- Results Analyzer
- Esper
- Uses of Automated Testing within TSO
- Load Testing
- Regression Testing
- Monkey Tests
- Development : Pre-Checkin Regression & Sandbox Testing
- Other Potential Uses
- Implementation Notes
- Algorithmic test generation
- Lessons Learned
- What's Reusable?
- esper
- The Testing Harness
- Design Reuse
- Avatar Scripting
- Conclusions
- Acknowledgements
1 Introduction
This paper describes the design and implementation of the automated testing system developed in support of EA's large scale Persistent State World, The Sims Online.
Our architecture and implementation are described in terms of the classes of testing addressed, including a brief discussion of alternate approaches. The integration of automated testing with the day-to-day development environment is also described. We conclude with 'lessons learned' across implementation and early use, a reuse analysis, and future extensions.
1.1 Problem Background
The development and fielding of massively multiplayer Persistent State Worlds has proven to be difficult. The distributed nature and scale of the target system increases the complexity of the implementation, while also incurring a high testing cost. Stability and scalability at launch are highly desirable: the major cost drivers of customer service calls, operational costs and the player experience are all heavily influenced by these factors. The Sims Online (TSO) has devoted considerable effort to increase test coverage via automation, aimed at stabilizing the game well in advance of launch. These tools are also aimed at increasing the efficiency of initial development and early load testing at projected peak conditions. All tools are intended to carry over into Operations as long-term aids in the maintenance and extension of the product.
1.2 Scope of Automation within TSO
Automation is only used for certain aspects of testing, in particular, testing tasks involving highly repetitive actions, and/or testing tasks requiring large numbers of connected, interacting clients, and/or synchronized series of actions across multiple clients. Load testing and regression are natural fits, with a set of minor extensions that can support code development, game tuning or marketing. Some sub-systems were explicitly excluded due to technical complexity, rate of change, and isolated impact of failures. For example, a GUI clipping artifact impacts only a few clients, will likely be easier to catch via manual testing, and is not generally affected by scale. However, a defect in the client connection process is mission-critical, roadblocks all players, and is likely to be affected by the load on the city. Note the difference from testing single-player games. The complexity of TSO's underlying distributed system simply does not exist in a SP game, whereas clean drawing is extremely high priority in a SP title.
1.3 Technical Approach Summary
A subset of the mainline game client code is used to assemble a test client, in which the GUI is mimicked via a script-driven control system. Test clients attach as normal to a candidate server cluster. Events to and from the server are also processed as normal: a test client differs from a normal client only in terms of which source generates an event. To support this abstraction, a Presentation Layer was inserted between the TSO client's GUI and the supporting client-side portions of game code at the semantic layer, i.e. the abstract game function operated by the user, not the syntactic series of mouse / key inputs used to express the operation. Our scripting system, SimScript, is then attached to the same layer. User play sessions may thus be expressed as a series of scripted actions, such as "Create Avatar", "Buy House", or "Use Object".
The test client may be configured to support all forms of TSO system testing: load, regression, and development. Test clients are created and controlled via a family of controllers, each of which configures the test client in terms of runtime components, input scripts, and desired outputs. Client output is collected and analyzed by a similar family of collectors.
2 High level requirements
Automated testing goals in general are:
- Lower the manpower costs associated with regressing PSWs
- Increase the stability of the game to enhance our player experience
- Load test the system to ensure satisfactory response times, and to evaluate operational costs
- Generate synchronized, multi-threaded inputs to support complex regression cases and new feature development.
3 System Architecture
4 Test Client: Design and Implementation
Using the Composable Units, any form of Client may be assembled. A Unit may be Shipping or Non-Shipping. To allow simple removal of testing code prior to ship, all test-related code resides in a non-Shipping Unit (loaded as an optional .dll).
Two broad categories of Units exist: Event Generators and Viewing Systems.
An Event Generator uses the Strategy pattern to select events. Strategies range from algorithmic (Random / Ordered) to specific, ordered lists of events (Scripted / Recorded). The game's actual GUI may be simply considered as a separate implementation of an Event Generator.
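The Strategy arrangement described above might look roughly like the following Python sketch. It is illustrative only: the class and function names are assumptions, not TSO's actual C++ Units, and events are represented as plain command strings.

```python
import random
from typing import Callable, List, Sequence

# A strategy receives the events currently available and returns the ones to fire, in order.
Strategy = Callable[[Sequence[str]], List[str]]


def ordered_strategy(events: Sequence[str]) -> List[str]:
    """Algorithmic: fire every available event once, in a stable order."""
    return sorted(events)


def make_random_strategy(seed: int, count: int) -> Strategy:
    """Algorithmic: fire `count` randomly chosen events; seeded so a run can be repeated."""
    rng = random.Random(seed)

    def choose(events: Sequence[str]) -> List[str]:
        pool = list(events)
        return [rng.choice(pool) for _ in range(count)]

    return choose


def make_scripted_strategy(script: Sequence[str]) -> Strategy:
    """Specific: replay a fixed, ordered list regardless of what is available."""
    return lambda _events: list(script)


class EventGenerator:
    """Delegates event selection to whichever strategy it was configured with."""

    def __init__(self, strategy: Strategy) -> None:
        self.strategy = strategy

    def generate(self, available: Sequence[str]) -> List[str]:
        return self.strategy(available)


available = ["use_object chair sit", "chat hello", "place_object chair 10 10"]
ordered = EventGenerator(ordered_strategy).generate(available)
scripted = EventGenerator(make_scripted_strategy(["chat hello"])).generate(available)
randomized = EventGenerator(make_random_strategy(seed=7, count=5)).generate(available)
```

Under this reading, the game's GUI is one more strategy: it selects events according to what the player clicks.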
Viewing Systems observe the internal game state, transforming the raw data into different views. Current Views are the game's GUI and the Results Analyzer. The Results Analyzer is used in turn to view, test and log specific game state conditions.
4.1 Presentation Layer
An important abstraction is that of the Presentation Layer. Infrastructure control points in the code were scattered, and often tightly tangled with the view system. We decided to refactor the code to increase testability, inserting the Presentation Layer between the GUI and the game's (client-side) infrastructure components. All Event Generators then use this interface, including the GUI. Drawing from both the Middleware school of thought and the Command pattern, the Presentation Layer has been quite useful. More and more common functionality has been exposed to the Presentation Layer for control via the scripting system.
Test code is more stable when it ties into the game client's infrastructure in the same manner as the GUI. Exposing those control points as primitives by definition mimics our User Interface, essentially providing a command line interface to our client. Scripts themselves become ordered sequences of user actions, facilitating scripting of GUI-oriented test suites, bug reproduction documentation, and sample player sessions.
A natural recording / instrumentation point is also thus provided. Play sessions are recorded directly to SimScript for easy replay and editing, simply by recording the UI Commands as they pass through the Presentation Layer. Capturing the semantics of the inputs provides excellent portability across builds.
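As a rough illustration of why the Presentation Layer is a natural recording point, the Python sketch below routes every user-level command through one call and records it in script form as it passes. The `PresentationLayer` class, its methods, and the stand-in handlers are assumptions made for the example; the real layer is C++ inside the TSO client.

```python
from typing import Callable, Dict, List


class PresentationLayer:
    """Single entry point for user-level commands, whoever generates them."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., None]] = {}
        self.recording: List[str] = []

    def register(self, name: str, handler: Callable[..., None]) -> None:
        self._handlers[name] = handler

    def execute(self, name: str, *args: str) -> None:
        # Record the semantic command, not the mouse / key input that produced it.
        self.recording.append(" ".join((name,) + args))
        self._handlers[name](*args)


# Stand-ins for client-side game code.
placed = []
layer = PresentationLayer()
layer.register("place_object", lambda kind, x, y: placed.append((kind, int(x), int(y))))
layer.register("chat", lambda *words: None)

# The GUI and a script both go through the same call.
layer.execute("place_object", "chair", "10", "10")  # e.g. from a GUI click handler
layer.execute("chat", "Hi.", "I'm", "in")           # e.g. from a script line
```

Because `recording` holds command names and arguments rather than raw input or internal data structures, it can be replayed against a later build as long as the commands themselves still exist.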
From an instrumentation viewpoint, we may also easily capture user data from our play sessions and analyze player behaviour. New load test scenarios can then be built simply by generating events of each type at the same frequency as our Beta users generated them.
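One way to turn captured play-session data into a load scenario is weighted sampling over the observed event counts. The sketch below assumes the captured data is simply a list of event names; the event names and the `generate_scenario` helper are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def generate_scenario(observed: Sequence[str], length: int, seed: int) -> List[str]:
    """Draw `length` events so each type appears at roughly its observed frequency."""
    counts = Counter(observed)
    names = sorted(counts)
    weights = [counts[name] for name in names]
    rng = random.Random(seed)
    return rng.choices(names, weights=weights, k=length)


# Stand-in for events captured from Beta play sessions.
observed_events = ["chat"] * 6 + ["use_object"] * 3 + ["buy_house"] * 1
scenario = generate_scenario(observed_events, length=10000, seed=2003)
mix = Counter(scenario)
```

A generated scenario preserves the overall event mix but not the ordering or timing of real sessions, so it complements recorded scripts rather than replacing them.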
5 SimScript
The Test Client exposes a series of control and view primitives to TSO testing scripts. Control primitives generate internal game "Interaction Requests", such as "UseObject Pinata:Bash", or "BuildWall 20,20:30,30".
Similarly, a series of view primitives allow monitoring and logging of various internal Game State variables. Examples include "WaitUntil Pinata:Broken", "WaitUntil reading_skill:100" and "TraceAndLogCurrentPosition".
The aggregate interface of the Control primitives and the Analyzer primitives is called SimScript. SimScript primitives may be called directly via the in-game cheat system, or invoked from a textual script file.
SimScript is an extremely simple scripting system. As it is intended to mimic or record a series of user actions, it supports no conditional statements, loop constructs, or arithmetic operations. Stored procedures and const-style parameters are used to support reusable scripts for common functionality across multiple tests.
Two basic flow control statements exist in SimScript: WaitFor and WaitUntil. These provide the ability to simulate the time gaps between mimicked user actions ("wait_for 5 seconds", "wait_until reading_skill:100"), to block until a process enters a particular state ("wait_until client_state:in_a_house"), and to synchronize actions across the distributed system ("wait_until avatar_two:arrives"). WaitUntil commands simply block script execution until the condition is met, or a timeout value is exceeded.
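The blocking-with-timeout behaviour of WaitUntil can be sketched in a few lines of Python. This version polls a condition, which is a simplification chosen for brevity: TSO's own wait_until is attached to game state through Lurkers rather than polled. The function name and parameters are assumptions.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float, poll_s: float = 0.01) -> bool:
    """Block until condition() is true; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Stand-in game state: each poll "studies" a little, raising the skill by 25.
state = {"reading_skill": 0}


def skill_reached() -> bool:
    state["reading_skill"] += 25
    return state["reading_skill"] >= 100


met = wait_until(skill_reached, timeout_s=1.0, poll_s=0.0)
timed_out = wait_until(lambda: False, timeout_s=0.05)
```

Returning a flag instead of raising lets the calling script decide whether a timeout is a test failure or merely something to log.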
SimScript uses simple text files for its scripts and log files. Both are designed for readability to aid non-engineers in writing scripts and interpreting logs. Since it is based on the UI for The Sims, by definition all team members understand the basic operation of SimScript primitives, albeit not the specific operations of the scripting system.
A test integration of Python and SimScript has been successfully prototyped, providing optional access to a full scripting control system simply by loading a Python.dll.
Table 1.1 : Sample SimScript
# this script brings an avatar directly into a testable condition inside the game: a quick skill test is then performed
wait_until game_state selectasim
pick_avatar $alpha_chimp # Green Primitives are User Actions
wait_until game_state inlot # Black Primitives are Control Actions
chat Hi. I'm in and running.
log_message Testing object placement
log_objects
place_object chair 10 10
log_objects
# invoke a command on a different client
remote_command $monkey_bo. use_object chair sit
# and do some skill increase for self
set_data avatar reading_skill 0
use_object bookshelf read
wait_until avatar reading_skill 100
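For readers who want to experiment with scripts of this shape, the Python sketch below tokenizes a script into (primitive, arguments) pairs, dropping comments and substituting `$name` parameters. The grammar is inferred from the sample above and is an assumption; the real SimScript reader is the game's cheat file reader, and the parameter value used here is made up.

```python
from typing import Dict, List, Tuple


def parse_script(text: str, params: Dict[str, str]) -> List[Tuple[str, List[str]]]:
    """Turn script text into (primitive, args) pairs, one per non-empty line."""
    commands = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        # Replace $name with its parameter value; leave unknown names untouched.
        words = [params.get(w[1:], w) if w.startswith("$") else w for w in line.split()]
        commands.append((words[0], words[1:]))
    return commands


sample = """
# bring an avatar into the game
wait_until game_state selectasim
pick_avatar $alpha_chimp   # user action
place_object chair 10 10
"""
commands = parse_script(sample, {"alpha_chimp": "AlphaChimp"})
```

Note that this naive comment handling would truncate a chat message containing `#`; a real reader would need a rule for that case.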
6 Results Analyzer
The exact data items to be monitored and logged differ considerably across types of testing and individual objects. The Results Analyzer provides a series of primitives that allow scripts to specify the exact conditions they are testing for.
For Load Testing, the test client is configured (via script) to log pass/fail status and response time on key user-level transactions.
For Regression, each script determines which internal game states need to be logged, and at what points in the test. For example, the avatar's money should be logged before and after purchasing an object, or the avatar's position should be recorded after issuing a 'walk' command.
For Development, internal infrastructure data may be exposed for logging. Examples include logging the simulator's event queue before and after an event is posted, or logging what objects the client thinks are owned by any given avatar versus what objects the server thinks are owned by that avatar.
The Results Analyzer functionality is implemented via a Lurker, which attaches Observers to a particular location in the game's State. A Lurker continually snoops on a given location. Registered handlers are triggered on any change to a Data Item, evaluating a predicate to determine if an action should be taken. Conditional logging is thus supported: a Data Item may be logged when any change occurs, or when a specific change occurs. Similarly, wait_until() is implemented in terms of Lurkers, and may thus be attached to any portion of the internal game state.
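The observer-plus-predicate arrangement can be sketched as follows. This is a minimal Python illustration under assumed names (`GameState`, `attach`, dotted keys such as `avatar.money`); the actual Lurker is part of the C++ client and its interface is not given in this paper.

```python
from typing import Any, Callable, Dict, List, Tuple

Predicate = Callable[[Any, Any], bool]
Action = Callable[[str, Any, Any], None]


class GameState:
    """Key/value state that notifies attached observers when a value changes."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._lurkers: Dict[str, List[Tuple[Predicate, Action]]] = {}

    def attach(self, key: str, predicate: Predicate, action: Action) -> None:
        self._lurkers.setdefault(key, []).append((predicate, action))

    def set(self, key: str, value: Any) -> None:
        old = self._data.get(key)
        self._data[key] = value
        if old == value:
            return  # no change, nothing to report
        for predicate, action in self._lurkers.get(key, []):
            if predicate(old, value):
                action(key, old, value)


log: List[str] = []
record = lambda key, old, new: log.append(f"{key}: {old} -> {new}")

state = GameState()
state.attach("avatar.money", lambda old, new: True, record)            # log any change
state.attach("pinata.state", lambda old, new: new == "broken", record)  # log one specific change

state.set("avatar.money", 0)
state.set("pinata.state", "intact")
state.set("pinata.state", "broken")
state.set("avatar.money", 50)
state.set("avatar.money", 50)  # unchanged value: no log entry
```

A wait_until built on the same mechanism would attach a predicate and release the blocked script from the action, instead of appending to a log.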
7 Esper
Extracted from Andre Norton's Beast Master, an "esper" is one with ESP. Esper thus comprises the sub-systems that observe and control the internal workings of our game's (distributed) mind. An extensible Command system allows simple registration of new Commands against any game component, regardless of location in the distributed system. Similarly, a Data Accessor Factory provides a flexible, data driven mapping system for internal game state for any process in the system. Set Accessors allow the game to be placed in a specific test condition, while Get Accessors may be attached to the WaitUntil construct, or the logging system. For example, avatar maps to an X byte object, and skill maps to the Nth element. Such names are used as the data targets for SimScript primitives:
set_data avatar skill 10
use_object pinata bash
set_data avatar money 0
wait_until pinata broken
log_message Money should go up after breaking the pinata
log_data avatar money
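A data-driven name mapping of the kind described above might be sketched like this in Python. The mapping file format (object name, field name, element index), the `DataAccessors` class, and the list standing in for the raw avatar object are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping file: object name, field name, element index.
MAPPING_TEXT = """
avatar skill 0
avatar money 1
"""


def load_mapping(text: str) -> Dict[Tuple[str, str], int]:
    """Read 'object field index' lines into a lookup table."""
    mapping: Dict[Tuple[str, str], int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            mapping[(parts[0], parts[1])] = int(parts[2])
    return mapping


class DataAccessors:
    """Get/Set access to raw game objects through names from the mapping."""

    def __init__(self, mapping: Dict[Tuple[str, str], int], objects: Dict[str, List[int]]) -> None:
        self._mapping = mapping
        self._objects = objects

    def get_data(self, obj: str, field: str) -> int:
        return self._objects[obj][self._mapping[(obj, field)]]

    def set_data(self, obj: str, field: str, value: int) -> None:
        self._objects[obj][self._mapping[(obj, field)]] = value


raw_avatar = [0, 0]  # stand-in for the real in-memory avatar object
accessors = DataAccessors(load_mapping(MAPPING_TEXT), {"avatar": raw_avatar})
accessors.set_data("avatar", "skill", 10)
accessors.set_data("avatar", "money", 0)
```

Keeping the name-to-location table in a text file means a change in object layout needs only a mapping edit, not a script change.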
A Remote Command mechanism is used to control remote processes from a single process, using the internal game transport layer to carry any Command to any game process. Thus we can synchronize the actions of multiple clients, examine and/or change the internal state of any process, and test for correctness at any point in the system, all via a single script or console. Examples:
remote_command server_simulator log_on_change action_queue # state of server
log_action_queue; use_object chair; log_action_queue # state of client
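The routing idea behind Remote Command can be shown with an in-process stand-in for the transport layer. Everything here is a simplified assumption: the `GameProcess` and `Transport` classes, the process names, and the two registered commands. In TSO the commands travel over the game's own network transport.

```python
from typing import Callable, Dict, List


class GameProcess:
    """Any process in the system that accepts textual commands."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.action_queue: List[str] = []
        self.log: List[str] = []
        self._commands: Dict[str, Callable[..., None]] = {
            "use_object": self._use_object,
            "log_action_queue": self._log_action_queue,
        }

    def _use_object(self, *args: str) -> None:
        self.action_queue.append(" ".join(args))

    def _log_action_queue(self) -> None:
        self.log.append("action_queue: " + ", ".join(self.action_queue))

    def run(self, line: str) -> None:
        name, *args = line.split()
        self._commands[name](*args)


class Transport:
    """Stand-in for the game's transport layer: routes a command line by process name."""

    def __init__(self) -> None:
        self._processes: Dict[str, GameProcess] = {}

    def connect(self, process: GameProcess) -> None:
        self._processes[process.name] = process

    def remote_command(self, target: str, line: str) -> None:
        self._processes[target].run(line)


transport = Transport()
client_one = GameProcess("client_one")
client_two = GameProcess("client_two")
transport.connect(client_one)
transport.connect(client_two)

# A single script, running on client_one, drives and inspects client_two as well.
client_one.run("use_object bookshelf read")
transport.remote_command("client_two", "use_object chair sit")
transport.remote_command("client_two", "log_action_queue")
```

Because a remote command is just a command line plus a target name, any primitive available locally becomes available against any connected process.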
Esper may thus be used for input/output controls for regression testing, complex inputs for a developer test, customizable debugging information, and customizable controls. Generalizing esper into a distributed debugging tool with synchronized remote log viewing was an unexpected bonus of the remote command system.
8 Uses of Automated Testing within TSO
8.1 Load Testing
The game as a whole is both large and complex; thousands of independent, interacting processes are distributed across the (brittle and hostile) Internet. We need to be able to generate realistic load on a continual basis to harden our system before ship. A series of test clients is created, each executing independently, controlled via a scripted series of individual User Interface actions, or an event generation algorithm. The system collects and aggregates data from the clients and the server cluster. System startup/shutdown is fully automated, as is the metrics collection.
System load testing is controlled via LoadRunner: a commercially available load generation system. An integration bridge was developed to hook LoadRunner into a TSO test client. LoadRunner is capable of controlling hundreds or thousands of simulated users against a candidate server cluster. Simple pass/fail/timing reports on selected transactions are sent back to LoadRunner. The test client is also configured to run without a GUI (NullView). The greatly reduced memory footprint allows many more test clients per load generation box.
While a pure transaction-bot is lighter weight than even a NullView client, we chose the heavier client to more closely mimic actual client behaviour and increase shared code across our testing systems. Given our shifting implementation and intricate interdependencies between TSO control protocols, a pure transaction-bot would have required constant maintenance, if in fact an accurate model could have been derived.
8.2 Regression Testing
The complexity of a distributed system entails a higher degree of new work breaking old work. Further, stability is an important customer requirement, thus we need to be able to continually validate the behaviour of each game feature. A single test client, or multiple clients in a single test, follows a unique script that exercises all required aspects of the specified game feature. The same script also specifies what game state needs to be logged for regression. Once a correct implementation has been observed and blessed by the normal QA process, the feature is auto-regressed nightly, comparing current output with the (saved) blessed output for that feature. A small set of additional regression tests was run against performance, charting key performance metrics each week. These tests turned out to be largely irrelevant in the early stages of the project, as they did not help uncover and prevent several roadblocking stability / scalability issues (see Section 8.4: Pre-Checkin Regression). Finally, a small set of tests was built explicitly to regress against discovered defects. While very useful, defect regression was rarely used except against recurring critical path roadblocks. Increased usability of the scripting system, full feature coverage and process changes in the defect handling process are required to broaden the coverage of such tests.
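The nightly comparison against blessed output amounts to a text diff of two logs. The sketch below uses Python's standard `difflib` for that comparison; it is a stand-in for illustration, not TSO's own comparison tool, and the log lines are invented.

```python
import difflib
from typing import List


def regress(blessed: str, current: str) -> List[str]:
    """Return unified-diff lines between blessed and current logs; empty means pass."""
    return list(
        difflib.unified_diff(
            blessed.splitlines(),
            current.splitlines(),
            fromfile="blessed",
            tofile="current",
            lineterm="",
        )
    )


blessed_log = "money 0\nuse_object pinata bash\nmoney 50"
good_run = "money 0\nuse_object pinata bash\nmoney 50"
bad_run = "money 0\nuse_object pinata bash\nmoney 0"

passing = regress(blessed_log, good_run)
failing = regress(blessed_log, bad_run)
```

An empty result is a pass; any diff lines are handed to a person for review, which is why the size and stability of the logged output matter so much.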
False positives are defined as the condition where the Difference Engine erroneously reports a failure. False positives can be generated by changing in-game text messages, or recording non-deterministic data, such as the precise number of seconds for an avatar's skill to increase one level. False positives can be avoided by specializing the logging within each script, strictly limiting the logged data to that required for that specific test. Note also the impact on log clarity: the cost of reviewing a log must remain low, thus the amount of information within the log must be kept small, focused and readable. Further reduction in the false positive rate was achieved via continual monitoring and tuning the log output to eliminate irrelevant testing data and/or timing variances, and/or changing the semantics of the logged data.
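A simple form of the log focusing described above is to keep only the lines a test cares about and mask values known to vary between runs. In this Python sketch the log line format, the `focus_log` helper, and the `<duration>` placeholder are assumptions for illustration.

```python
import re
from typing import Iterable, List, Set

# Matches values such as "12.7 seconds" or "13 seconds".
_DURATION = re.compile(r"\b\d+(\.\d+)?\s*seconds?\b")


def focus_log(lines: Iterable[str], watched: Set[str]) -> List[str]:
    """Keep only lines whose first word is watched, masking run-to-run timing values."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] in watched:
            kept.append(_DURATION.sub("<duration>", line))
    return kept


run_a = ["money 50", "chat Welcome!", "skill reading 1 after 12.7 seconds"]
run_b = ["money 50", "chat Welcome back!", "skill reading 1 after 13 seconds"]

focused_a = focus_log(run_a, {"money", "skill"})
focused_b = focus_log(run_b, {"money", "skill"})
```

The two runs differ in chat text and timing, yet their focused logs are identical, so a comparison between them reports no failure.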
Note that a corollary to False Positives exists: the False Negative. This is defined as the condition where the test driver erroneously reports "no errors found" when, in fact, an error did occur. Thus a continual tradeoff occurs in setting logs for each test case: too much data may report errors when none occurred, whereas not enough data logged may fail to detect an actual error. We found it valuable to keep the logs very light in initial fielding, looking primarily for crash and hang defects, adding additional testing constraints after the game stabilized.
8.3 Monkey Tests
A series of basic functionality tests are run against the daily build. Monkey tests exercise specific game features, in isolation. Each test is run with many repetitions, testing for non-deterministic timing errors in protocols and/or game logic. Critical Path Monkeys run hourly, serving two purposes. First, the health of the system is continually monitored. Second, server components are under constant exercise, with a constantly growing set of data. Many system failures were found to occur at a higher rate under load, thus we keep all server components under a light, continual load.
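The repetition at the heart of a monkey test can be sketched as a loop that runs one feature test many times and collects every failing iteration. The `grind` helper and the deliberately flaky stand-in test are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def grind(test: Callable[[int], bool], repetitions: int) -> List[Tuple[int, str]]:
    """Run one feature test repeatedly; return (iteration, reason) for each failure."""
    failures = []
    for i in range(repetitions):
        try:
            ok = test(i)
        except Exception as exc:  # a crash in one run must not stop the remaining runs
            failures.append((i, repr(exc)))
        else:
            if not ok:
                failures.append((i, "returned failure"))
    return failures


def flaky_enter_lot(iteration: int) -> bool:
    """Stand-in for a feature with a rare timing error: fails once every 50 runs."""
    return iteration % 50 != 49


failures = grind(flaky_enter_lot, repetitions=200)
```

A single run of this stand-in test would pass 98% of the time; only repetition makes the intermittent failure visible.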
8.4 Development : Pre-Checkin Regression & Sandbox Testing
Programmers run a short automated regression test before checking in, as a supplement to their normal testing. This "sniff test" regresses a broad but not deep feature set, simply intended to ensure the basic features of the game were not accidentally broken as a side effect of feature development. Other configurations of the client are also run to ensure correctness. The additional cost per checkin is on the order of 10 minutes, whereas the cost per fix (post-developer) is significantly higher. Catching errors as early as possible in the development cycle saves considerable manpower.
Engineers also have full access to the monkey test suite, allowing repeated stress testing of a feature under development. Complex series of (coordinated) inputs may easily be generated and synchronized via script or console.
Sandbox Testing was added later in the project as a more complete form of Pre-Checkin testing, and proved quite valuable. A full QA Manual Smoke test was done for each Sandbox, in conjunction with a full set of automated tests and a set of focused tests against the specific changes contained within that Sandbox.
8.5 Other Potential Uses
The automated testing system is currently restricted to just test functions, but it is capable of supporting several non-test functions in the future.
Tuning: Verifying the avatar skill increase rates under varying conditions is quite onerous, involving many sessions with a stop watch and a large coffee cup. After the major work to support load and regression testing is complete, it becomes trivial to use the automation gear to support the game tuning process.
Live team: In addition to having a full testing / regression system, aspects of the customer service problem are addressable. A Customer Service system could be implemented via the Remote Command system and the Data Accessor's Set/Get abilities, allowing CSRs to troubleshoot and repair problems at runtime.
Marketing: It is difficult to do effective demos in a virtual world: variability makes it hard to lay out interesting events precisely, and large numbers of clients are needed to make the virtual world look populated. Automated scripts greatly ease both problems. Note also that a complex demo script may be added to the nightly regressions: imagine a demo that always worked...
9 Implementation Notes
The amount of new C++ code required to support our testing system was surprisingly small. An existing testing service was used to hook our logic into the game client, and the Presentation Layer simply exposes existing game functionality as a testable interface. The basis for the scripting system itself was found in existing code: the cheat system. Easily extensible and equipped with a basic "cheat file reader", the cheat system was already used to modify internal game state. Data Mapping is primarily data-driven as well. Python scripts are heavily used as Controllers and Collectors. Our entire testing infrastructure is really nothing more than a series of Python wrappers around a cheat system on steroids.
We invested heavily in our Python testing harness, to excellent effect. A GUI was added, allowing new test sessions to be built, configured and scheduled in seconds. A Report Summarizer captures logs from one to dozens of test clients, filtering out known defects, and aggregating / summarizing the remaining results. Reports are automatically entered into the MonkeyWatcher Database and posted on a public Dashboard website, with automated email when required. One to dozens of clients can be created and controlled within a single test session. Test Sessions can be run sequentially (for validation) with a Test Looper, or in parallel with the Monkey Grinder option.
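The core of a report summarizer, aggregating client logs while filtering known defects, might look like the sketch below. The `PASS` / `FAIL` line format, the `summarize` function, and the report fields are assumptions; the paper does not give the actual log or report formats.

```python
from collections import Counter
from typing import Dict, List, Set


def summarize(client_logs: Dict[str, List[str]], known_defects: Set[str]) -> Dict[str, object]:
    """Aggregate failures across clients, setting aside those already known."""
    new_failures: Counter = Counter()
    filtered = 0
    for lines in client_logs.values():
        for line in lines:
            status, _, detail = line.partition(" ")
            if status != "FAIL":
                continue
            if detail in known_defects:
                filtered += 1
            else:
                new_failures[detail] += 1
    return {
        "clients": len(client_logs),
        "known_filtered": filtered,
        "new_failures": dict(new_failures),
    }


logs = {
    "client_01": ["PASS enter_lot", "FAIL use_object pinata bash"],
    "client_02": ["PASS enter_lot", "FAIL chat", "FAIL use_object pinata bash"],
}
report = summarize(logs, known_defects={"chat"})
```

Counting the filtered entries, rather than silently dropping them, keeps it visible how much of the noise is due to defects that are already tracked.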
9.1 Algorithmic test generation
Many of our test scripts are also very lightweight in implementation. Scripting by hand of all conditions is not realistic, nor desired. An extensible series of algorithms is thus used to generate events wherever possible. Random and Deterministic Strategies create 'scripts' on the fly.
In particular, TestAll is an extremely powerful algorithm. TestAll walks the list of all objects currently in a Lot, building a list of all possible interactions. It then executes each action, using the Deterministic Strategy. Game semantics may then be exploited to force a broad system test, by placing the objects in differing terrain / house configurations.
For example, by placing one object on top of a hill, another on a house's second floor, we not only test the operations of that game object, but we also ensure that the routing, terrain and second-story modules all get exercised in regression. A second test house may contain game objects behind locked doors, and objects placed in unusable positions. By then running the same TestAll on this second house, we ensure that avatar permissions and object slot attachment code is also exercised.
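In outline, TestAll enumerates every interaction offered by every object in a lot and executes each one in a fixed order. The Python sketch below assumes a lot can be represented as a mapping from object name to interaction names, and uses a stand-in `execute` callback in place of the real game client.

```python
from typing import Callable, Dict, List, Tuple


def run_test_all(
    lot: Dict[str, List[str]], execute: Callable[[str, str], bool]
) -> List[Tuple[str, str, bool]]:
    """Execute every interaction on every object, in a deterministic order."""
    interactions = [(obj, action) for obj in sorted(lot) for action in sorted(lot[obj])]
    return [(obj, action, execute(obj, action)) for obj, action in interactions]


# Stand-in lot and executor: the bookshelf sits behind a locked door.
house = {"chair": ["sit"], "bookshelf": ["read"], "pinata": ["bash"]}
unreachable = {"bookshelf"}


def fake_execute(obj: str, action: str) -> bool:
    return obj not in unreachable


results = run_test_all(house, fake_execute)
not_completed = [(obj, action) for obj, action, ok in results if not ok]
```

The same function runs unchanged against any lot, so coverage is broadened by building new test houses rather than by writing new scripts.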
Using TestAll as our first approach was very successful. We had 260 game objects under regression and about 40% of the game's supporting infrastructure under automated regression in a month.
10 Lessons Learned
Get your regression tool in place early. Once a feature works, it should never be allowed to stop working.
Senior management support: Introducing automated testing tools into a project changes how people work. This is never done quickly, or easily. Changes such as TSO's mandatory pre-checkin regression testing required solid, perhaps even aggressive, commitment from management: even minor non-compliance would have defeated the purpose of the test.
Basic infrastructure tests should precede specific, scripted feature testing. If you can't connect to a server, nothing else matters. Infrastructure defects must be flagged and killed at all costs: they slow all developers down. Further, they can seriously impact the testing system itself. Identify your Critical Path elements, and protect them.
Design for testability: During design, it quickly became apparent that our system was not very susceptible to automated testing. A key design decision was refactoring the code to increase its testability. Inserting a Presentation Layer to isolate the GUI from the game's infrastructure was very important. Having the testing system and the UI attach at the same points of the infrastructure enhanced both validity and stability. New code changes were given a design requirement of "able to be tested via automation".
Recorders and Regression don't mix: Two recorders were constructed for TSO, each using differing data capture / replay points. The most successful recorder was used by QA and development to record complex, multi-process crash defects for eventual replay and repair. However, owing to their detailed nature, the recording tapes were not generally portable across builds, i.e., recordings could only be played back on the build from which they were recorded. This was due to the high churn rate of our code base: any change in a key data structure, network packet, DB response or code logic shift could invalidate a saved recording. Thus the very nature of a good debugging tool precluded its use as a regression tool. However, recording was shown to be a useful tool, thus we continue to support the functionality by recording directly to SimScript as UI events pass through the Presentation Layer.
Workflow Automation / Noise Reduction: Monitoring, maintaining and extending hundreds of automated test cases can quickly get out of hand. Tools to help with the information management are essential, as are tools to quickly add, suspend or change the current set of tests attached to the game. A 'drag and drop' directory is used to store N tests or N logs. Preprocessors then configure a test client based on scripts currently in the active directory. Similarly, Data Collectors pick up all logs stored in the directory, and produce summary results from the available logs. In particular, it is very important to get report summarization and noise reduction down to an (automated) art form. Information overload is a serious problem, as is conveying accurate testing information out to the team as a whole. Filtering out meaningless results is critical to success.
Unit Tests: Unit testing is an extremely useful development tool. Unfortunately, it also requires, well, units. After years of shifts in development, The Sims code base had grown quite intertwined: few C++ modules were found that could be easily extracted for traditional unit testing. Objects built via Edith (TSO's content scripting system) are much more encapsulated in nature, and thus are relatively simple to test individually, but do require loading the game itself as the test harness. Some of the newly developed sub-systems were unit tested while under construction: the developers involved reported very positive experiences. Finally, monkey tests themselves may be considered a poor man's unit test, again using the game as a whole as a test harness for a particular feature.
Full-Scale Load Testing: Running constant load tests with thousands of active clients proved to be very effective. Many defects that only occurred at scale were eliminated prior to launch, and significant server performance improvements were driven by direct observation of load tests.
Pre-Checkin Regression: The Sniff Test also proved to be very effective. Testing at later stages in the development process was too late: if something broken was checked into Mainline, the entire team suffered. Automated regression tests further down the development pipeline were also severely impacted: content regression scripts were valueless if critical_path primitives such as enter_lot() were unreliable.
11 What's Reusable?
11.1 esper
Our control infrastructure meets many of the same requirements as a distributed system monitoring / debugging tool. TSO is in the process of extending the esper system into a full-fledged debugging tool. The beta version of this tool, espermon, was tacked on to the testing system with less than 3 days of effort.
11.2 The Testing Harness
Applying the automated testing system to other games, single or multi-player, should not be difficult. The supporting infrastructure is primarily composed of Python modules. The esper generic command registration system allows any title to use the full scripting system simply by exposing key control points to the Presentation Layer and integrating the TSOTest.dll, while the Data Mapping Service requires registering a few new Data Accessors, and writing a data mapping (text) file. The LoadRunner middleware is fully reusable, allowing a simple integration of any game client to LoadRunner.
11.3 Design Reuse
Physical code reuse aside, some of the design abstractions used here should be applicable to many game systems. Extending a game's cheat system into a scripting system / testing harness, using the Presentation Layer abstraction to expose key game functionality to a scripting system for automated testing, algorithmic test generation, and a configurable test client should all be applicable to many single-player or multi-player games.
11.4 Avatar Scripting
You'd think there would be some way you could make a game feature out of an easy-to-use multi-avatar scripting system: it's fun with a capital F?
12 Conclusions
The use of automation in the testing of large scale gaming systems appears very promising. Based on early results, TSO is continuing to expand the use of automated testing into every major aspect of the software development process. Integrating regression tests of critical game features early in the development cycle caused a significant reduction in the defect recurrence rate, and in side-effect defects. Daily load testing uncovered many defects and scaling issues in advance of launch, greatly improving system stability and scalability.
From a design perspective, the Presentation Layer / Command / Nullview Client approach provides a stable, extensible basis for both game and test code development, and the simple remote command implementation supports synchronized, multi-threaded test inputs.
Testing the simple game objects algorithmically and the complex game operations via scripts looks like a winning approach. A great deal of the game logic was placed under automated regression very cheaply.
13 Acknowledgements
The design and implementation of TSO's testing system has been very much a team effort, with the design evolving as we progressed.
Chris Yates was the executive sponsor, and was continually involved in requirements analysis and architecture, as well as leading the load testing itself.
Darrin West and Greg Kearney provided valuable insights across many design discussions, not to mention some critical jumpstarts with the code itself.
Steve Keller (on loan from EA.com) headed the LoadRunner integration design and implementation.
TSO's automation team consists of: Larry Mellon, Chris Kosmakos, Moe Hendawi, Jeff Marshall, Joel Tablante, Minkz Ngo and (formerly) Derek Shaw. Thanks in particular to ChrisK for the Lurker's detailed design: all right, so we did need the extra extensibility?