Every game developer has experienced the painful loop: play the game, find a bug, fix the bug, play the game again to verify the fix, discover a different bug that the fix introduced. Repeat until the deadline forces you to ship whatever you have.
QA is the bottleneck that never gets enough resources. Large studios dedicate entire teams to it and still ship bugs. Small studios and solo developers do it themselves, splitting attention between creating the game and breaking it. The result is predictable — indie games ship with bugs that would have been caught with more testing time, because testing time is the one thing indie developers never have enough of.
Automated testing has existed in software development for decades, but game testing is harder to automate than web app testing. Games have visual output that is difficult to validate programmatically. Game logic involves physics, AI behavior, and player interaction that create emergent states. "Correct" in a game is often subjective — a physics glitch might be a bug or a feature depending on context.
MCP-based automation does not solve all of these problems, but it makes a meaningful dent in the most mechanical aspects of QA: traversal testing, regression detection, performance profiling, and systematic edge case coverage. This post describes how to build an automated playtesting pipeline using MCP servers, what it can realistically catch, and what it cannot.
What Automated MCP Testing Can and Cannot Do
What It Can Do
Traversal testing. An AI agent can navigate your levels systematically, verifying that the player can reach every intended location and cannot reach unintended ones. Stuck spots, invisible walls with gaps, unreachable collectibles, and navigation mesh holes can be detected automatically.
Performance profiling. The agent can execute standardized scenarios — walking a specific path, entering a specific area, triggering a specific event — while capturing frame times, memory usage, draw call counts, and other metrics. Running these scenarios on every build creates a performance regression history that catches degradation early.
State validation. After performing actions, the agent can verify that game state matches expectations. Did the quest update after the objective was completed? Is the item in the inventory after being picked up? Does the save file contain the correct data after saving? These deterministic checks are well-suited to automation.
Visual regression detection. The agent can capture screenshots at predetermined camera positions and compare them against reference images. Significant differences — a missing texture, a broken shader, a displaced mesh — are flagged for human review. This catches visual bugs that slip through code reviews because they only manifest in specific rendering conditions.
Systematic edge case coverage. The agent can execute scenarios that human testers find tedious and tend to skip: opening and closing the inventory 100 times in rapid succession, saving and loading 50 times consecutively, interacting with every object in a room in every possible order. These mechanical stress tests reliably uncover state management bugs.
What It Cannot Do
Subjective quality assessment. An AI agent cannot tell you whether a jump feels satisfying, whether the lighting creates the right mood, or whether a boss fight is fair. Game feel, aesthetic quality, and design balance require human judgment.
Creative bug discovery. Human testers discover bugs by doing unexpected things — combining abilities in unusual ways, exploiting geometry creatively, interacting with systems in orders the designers did not anticipate. AI agents follow patterns. They can be programmed to try unexpected combinations, but they lack the creative mischief that makes human testers effective at finding the weirdest bugs.
Context-dependent correctness. Some game states are bugs in one context and features in another. A character clipping through a wall is a bug. A character clipping through a wall during a specific cutscene transition that resolves correctly is acceptable. The agent cannot make this distinction without explicit rules for every context.
First-time player experience. Automated tests cannot tell you whether the tutorial is confusing, whether the difficulty curve is appropriate, or whether the story beats land emotionally. These require actual human players experiencing the game for the first time.
Be honest about these limitations with yourself and your team. Automated QA supplements human testing — it does not replace it.
The Pipeline Architecture
Components
A complete automated playtesting pipeline has four components:
1. The MCP Server. The Unreal MCP Server or Godot MCP Server, depending on your engine. This provides the AI agent with tools to interact with the running editor — spawning actors, reading properties, capturing screenshots, executing console commands, and querying game state.
2. The AI Agent. Claude, GPT, or another LLM connected to the MCP server. The agent interprets test specifications, executes them through MCP tools, observes results, and generates reports. The agent's role is translating human-readable test descriptions into sequences of MCP tool calls.
3. The Test Specification. A structured document describing what to test, how to test it, and what constitutes a pass or fail. Test specifications should be human-readable (so developers can write and review them) and structured enough for the AI agent to interpret consistently.
4. The Report Generator. A system that collects test results, organizes them by severity and category, and produces reports that developers can act on. This can be as simple as a markdown file or as sophisticated as integration with your issue tracker.
Connecting the Components
The typical flow:
- Developer writes or updates test specifications
- The pipeline launches the game in the editor (or a packaged test build)
- The AI agent reads the test specifications
- The agent connects to the MCP server and executes test steps
- Results are captured — screenshots, metrics, state snapshots, pass/fail for each test
- The report generator compiles results into an actionable report
- Developer reviews the report and addresses failures
This entire flow can run overnight. You push your changes at the end of the day, the pipeline runs tests while you sleep, and you have a report waiting when you start work the next morning.
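As a sketch, the orchestration layer can be as small as a loop that hands each specification to the agent and collects pass/fail results. Here is a minimal Python version; `execute_spec` stands in for the agent-plus-MCP round trip and is a hypothetical interface, not part of any real MCP server:

```python
from dataclasses import dataclass, field

@dataclass
class TestResult:
    name: str
    passed: bool
    evidence: list = field(default_factory=list)  # screenshot paths, metrics, log excerpts

def run_pipeline(specs, execute_spec):
    """Hand every spec to the agent, collect results, separate out failures.

    `execute_spec` is the assumed agent interface: it takes one spec
    and returns a TestResult after driving the game through MCP tools.
    """
    results = [execute_spec(spec) for spec in specs]
    failures = [r for r in results if not r.passed]
    return results, failures

# Usage with a stubbed executor (a real one would call the agent):
specs = ["Traversal - Tutorial", "Inventory - Basic Pickup"]
results, failures = run_pipeline(
    specs, lambda s: TestResult(name=s, passed=("Traversal" in s))
)
```

The real version adds launching the game, retries, and report generation, but the shape stays the same: specs in, results out.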
Building the Traversal Test Suite
Traversal testing is the most immediately valuable automated test type for most games. It answers the fundamental question: can the player move through the game as intended?
Defining Traversal Points
Start by defining a set of waypoints that represent the player's intended path through each level. These can be marked in the editor using simple actors (empty actors with descriptive names work fine) or defined in a data table with world coordinates.
For each waypoint, define:
- Position and name: Where the point is and what it represents ("Tutorial_Room_Exit", "Boss_Arena_Entrance")
- Expected reachability: Can the player reach this point from the previous one using normal movement?
- Maximum traversal time: How long should it take to walk between consecutive waypoints? If the agent takes significantly longer, it likely got stuck somewhere
- Required state: Does the player need a specific item, ability, or quest state to reach this point? (A locked door that requires a key, for instance)
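In code form, a waypoint definition might look like the following sketch; the field names and the example path are illustrative, not an engine API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Waypoint:
    name: str                              # e.g. "Tutorial_Room_Exit"
    position: tuple                        # world coordinates (x, y, z)
    max_traversal_secs: float              # time budget from the previous waypoint
    required_state: Optional[str] = None   # e.g. an item or quest flag, if any

# An illustrative intended path through a level:
TUTORIAL_PATH = [
    Waypoint("Tutorial_Spawn", (0, 0, 100), max_traversal_secs=0),
    Waypoint("Tutorial_Room_Exit", (1200, 0, 100), max_traversal_secs=30),
    Waypoint("Boss_Arena_Entrance", (4500, 800, 100), max_traversal_secs=90,
             required_state="quest_tutorial_complete"),
]
```

Whether you store this as a data table in the engine or a file alongside the test specs matters less than keeping it versioned with the level it describes.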
Running the Traversal
The AI agent navigates between waypoints using the engine's navigation system. In Unreal, this means:
- Query the navigation mesh for a path between the current position and the next waypoint
- Move the player character along the path
- Monitor for stuck conditions (position not changing despite movement input)
- Record the actual path taken and the time elapsed
- Log any collision events, stuck detections, or pathfinding failures
Using the Unreal MCP Server, the agent can execute movement commands, query actor positions, and inspect navigation mesh state. It can also spawn debug visualization (path lines, waypoint markers) to make the test results visually reviewable.
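The stuck-condition check in particular is easy to express. A minimal sketch, assuming the agent samples the player position at a fixed interval while movement input is held (the progress threshold is a per-game assumption):

```python
import math

def is_stuck(position_samples, min_progress=5.0):
    """Flag a stuck condition: movement input active but position barely changing.

    `position_samples` is a list of (x, y, z) tuples captured at a fixed
    interval. If total displacement over the window is under `min_progress`
    world units, the character is likely wedged against geometry.
    """
    if len(position_samples) < 2:
        return False
    displacement = math.dist(position_samples[0], position_samples[-1])
    return displacement < min_progress

# A character oscillating against an invisible wall:
samples = [(100, 0, 0), (101, 0, 0), (100.5, 0, 0), (101, 0, 0)]
print(is_stuck(samples))  # True
```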
Detecting Traversal Issues
Common traversal issues the agent can detect:
Stuck spots. The player position stops changing while movement input is active. The agent records the position, takes a screenshot, and logs the stuck location. Over multiple test runs, you build a heat map of stuck-prone areas.
Navigation mesh gaps. The pathfinding system returns no valid path between waypoints that should be connected. This indicates missing or broken navigation mesh coverage.
Timing anomalies. A traversal segment that normally takes 30 seconds suddenly takes 90 seconds. The player is not stuck, but something is impeding movement — perhaps a physics object blocking a corridor, a door that is not opening, or a newly added piece of geometry creating an unintended obstacle.
Unreachable collectibles. If your test points include collectible locations, the agent can verify that each one is reachable. A collectible that was placed correctly but became unreachable due to a geometry change will be flagged.
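The heat map mentioned above can be built by binning stuck positions from all test runs into grid cells, so recurring trouble spots rank above one-off flukes. A minimal sketch (the cell size is a tunable assumption):

```python
from collections import Counter

def stuck_heatmap(stuck_positions, cell_size=500):
    """Bin stuck locations into 2D grid cells, most-frequent cells first."""
    cells = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y, _z in stuck_positions
    )
    return cells.most_common()

# Three runs got stuck near the same doorway, one elsewhere:
runs = [(120, 340, 0), (130, 350, 0), (125, 344, 0), (4000, 900, 0)]
print(stuck_heatmap(runs))  # [((0, 0), 3), ((8, 1), 1)]
```

The top-ranked cells are where to spend level-fixing time first.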
Building the Performance Test Suite
Standardized Performance Scenarios
Define a set of scenarios that exercise different performance profiles:
Scene complexity test. Navigate to the most visually dense area of each level and capture frame times. This tests the rendering pipeline under maximum load. Record draw calls, triangle count, and memory usage alongside frame times.
Particle stress test. Trigger the maximum expected number of simultaneous particle systems and monitor frame times. Spawn 50 enemies, trigger an explosion near all of them, and see if the particle budget holds.
AI load test. Activate the maximum number of simultaneous AI characters and monitor both frame times and AI behavior correctness. Under load, AI should degrade gracefully (slower decision-making) rather than catastrophically (NPCs teleporting, pathfinding failures).
Streaming test. Move through the world at maximum speed and monitor for hitches caused by level streaming, texture streaming, or asset loading. Record hitch frequency, duration, and the assets being loaded during each hitch.
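For the streaming test, hitch detection reduces to scanning the captured frame-time series for spikes. A sketch, with the frame budget and spike factor as assumptions to tune per project:

```python
def find_hitches(frame_times_ms, budget_ms=16.7, spike_factor=3.0):
    """Return (frame_index, duration_ms) for frames far over budget.

    A hitch here is any frame longer than `spike_factor` times the target
    budget (16.7 ms assumes a 60 fps target). Both thresholds are
    assumptions, not engine constants.
    """
    threshold = budget_ms * spike_factor
    return [(i, t) for i, t in enumerate(frame_times_ms) if t > threshold]

# A streaming hitch at frame 3:
print(find_hitches([16.4, 16.8, 17.1, 180.0, 16.5]))  # [(3, 180.0)]
```

Correlating each flagged index with the assets being loaded at that moment turns a raw spike into an actionable bug report.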
Performance Regression Detection
The value of performance testing compounds over time. Each test run produces a data point. Over weeks and months, you build a performance history that makes regressions immediately visible.
Baseline establishment. Run the performance suite on a known-good build and record all metrics. These become your baselines.
Threshold definition. Define acceptable deviation from baselines. A 5% frame time increase might be acceptable (within measurement noise). A 20% increase is a clear regression that needs investigation.
Regression reports. When a test exceeds thresholds, the report includes: which scenario regressed, by how much, which build introduced the regression (if you test on every commit or every daily build), and what changed between the passing and failing builds.
The MCP server captures metrics by executing console commands (`stat fps`, `stat unit`, `stat memory`) and reading the results. The AI agent interprets the metrics and compares them against baselines, generating natural-language reports that describe the regression in human-readable terms.
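The threshold logic can be sketched in a few lines; the metric names are illustrative, and the percentages mirror the rule of thumb just described (under 5% is noise, over 20% is a regression):

```python
def check_regressions(baseline, current, warn_pct=5.0, fail_pct=20.0):
    """Compare current metrics against baselines (lower is better for all)."""
    report = {}
    for metric, base in baseline.items():
        delta_pct = (current[metric] - base) / base * 100
        if delta_pct > fail_pct:
            report[metric] = ("REGRESSION", round(delta_pct, 1))
        elif delta_pct > warn_pct:
            report[metric] = ("warn", round(delta_pct, 1))
        else:
            report[metric] = ("ok", round(delta_pct, 1))
    return report

baseline = {"frame_time_ms": 14.0, "memory_mb": 2100}
current = {"frame_time_ms": 17.5, "memory_mb": 2140}
print(check_regressions(baseline, current))
# frame_time_ms is up 25% -> flagged; memory_mb is up ~1.9% -> ok
```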
Building the State Validation Suite
Gameplay Action Verification
State validation tests execute a game action and verify that the expected state change occurred:
Inventory tests. Pick up an item, verify it appears in the inventory with the correct count and properties. Drop an item, verify it is removed. Stack items, verify the count updates correctly. Attempt to exceed inventory capacity, verify the overflow behavior is correct.
Quest tests. Complete a quest objective, verify the quest state updates. Complete all objectives, verify the quest completes. Abandon a quest, verify the state reverts correctly. Complete quests out of order, verify dependencies are enforced.
Combat tests. Deal damage to an enemy, verify health decreases by the expected amount. Apply a status effect, verify it activates with the correct duration. Kill an enemy, verify it drops the expected loot. Test damage against armor, verify the damage reduction formula is correct.
Save/load tests. Save the game in a specific state. Modify the game state (move to a different location, use items, progress quests). Load the save. Verify that all state has been restored correctly — position, inventory, quest progress, world state.
The Specification Format
A practical test specification format looks like this:
Test: Inventory - Basic Pickup
Setup: Spawn player at (0, 0, 100). Spawn HealthPotion at (100, 0, 100).
Action: Move player to HealthPotion location. Execute interact action.
Verify: Player inventory contains 1 HealthPotion. HealthPotion actor no longer exists in world.
Test: Inventory - Stack Overflow
Setup: Spawn player. Add 99 HealthPotions to inventory (max stack: 99).
Action: Spawn HealthPotion at player location. Execute interact action.
Verify: Inventory still contains 99 HealthPotions. HealthPotion actor still exists in world.
The AI agent reads these specifications, translates each step into MCP tool calls, executes them, and reports results. The natural language format makes test specifications easy for developers to write and review without learning a testing framework.
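A thin parser can split this Test/Setup/Action/Verify format into sections before handing each sentence to the agent for translation into tool calls. A minimal sketch; a production version would tolerate messier input:

```python
def parse_spec(text):
    """Split a plain-text spec into {section: [sentences]}."""
    spec = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        # Sentences are period-separated; each becomes one step for the agent.
        spec[key.strip().lower()] = [s.strip() for s in value.split(".") if s.strip()]
    return spec

spec = parse_spec("""
Test: Inventory - Basic Pickup
Setup: Spawn player at (0, 0, 100). Spawn HealthPotion at (100, 0, 100).
Action: Move player to HealthPotion location. Execute interact action.
Verify: Player inventory contains 1 HealthPotion. HealthPotion actor no longer exists in world.
""")
print(spec["verify"])
```

Keeping the structure this simple is deliberate: the agent does the semantic interpretation, so the parser only needs to separate setup from verification.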
Crash Grouping and Anomaly Detection
Automated Crash Analysis
When the game crashes during automated testing, the pipeline captures:
- The test that was executing when the crash occurred
- The last N MCP operations before the crash
- The game's log output leading up to the crash
- A screenshot from the last successful frame
- The crash dump (if available)
The AI agent can analyze crash dumps and log files to categorize crashes. Common crash categories:
- Null pointer access: An object was destroyed or never created when the code expected it to exist
- Out of memory: A resource leak accumulated until memory was exhausted
- Infinite loop / hang: The game stopped responding without a formal crash
- Assert failure: A sanity check in the code detected an invalid state
Crashes that share the same call stack are grouped together. If the same crash reproduces across multiple test runs, it is highly likely to be a real bug rather than a transient issue.
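Call-stack grouping can be sketched by fingerprinting the top few frames; the frame names below are illustrative, and `top_n` is a tunable assumption (grouping on the top frames rather than the whole stack tolerates minor variation deeper down):

```python
import hashlib
from collections import defaultdict

def crash_signature(call_stack, top_n=5):
    """Fingerprint a crash by its topmost stack frames."""
    frames = "|".join(call_stack[:top_n])
    return hashlib.sha1(frames.encode()).hexdigest()[:12]

def group_crashes(crashes):
    """Bucket crash records by signature so duplicates collapse into one group."""
    groups = defaultdict(list)
    for crash in crashes:
        groups[crash_signature(crash["stack"])].append(crash)
    return groups

# Two different tests hitting the same underlying bug:
crashes = [
    {"test": "Traversal_Lvl2", "stack": ["UInventory::AddItem", "APickup::OnOverlap"]},
    {"test": "Inventory_Stack", "stack": ["UInventory::AddItem", "APickup::OnOverlap"]},
]
groups = group_crashes(crashes)
print([len(g) for g in groups.values()])  # [2] -- one group of two occurrences
```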
Anomaly Detection
Beyond crashes, the agent can detect anomalies — situations that are not crashes but indicate something is wrong:
Metric spikes. A sudden jump in memory usage, draw calls, or frame time that does not correspond to a known expensive operation.
State inconsistencies. The player has an item that was never picked up. An enemy is alive but has negative health. A quest is marked complete but has incomplete objectives. These logical contradictions indicate state management bugs.
Visual anomalies. Through screenshot comparison, the agent can detect: missing textures (magenta/checkerboard patterns), Z-fighting (flickering surfaces), broken shadows (shadow casting where there should be none), and LOD popping (visible mesh swaps at level-of-detail distance transitions).
Audio anomalies. The agent can monitor the audio engine for errors — sounds that fail to play, audio channels that are exhausted, sounds that play at incorrect volumes or from incorrect positions.
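The state-inconsistency checks above are just invariants evaluated against a state snapshot. A sketch with hypothetical field names (your save or debug-dump schema will differ):

```python
def check_invariants(state):
    """Return human-readable descriptions of logical contradictions in game state."""
    violations = []
    for enemy in state["enemies"]:
        if enemy["alive"] and enemy["health"] <= 0:
            violations.append(f"enemy {enemy['id']} alive with health {enemy['health']}")
    for quest in state["quests"]:
        if quest["complete"] and not all(quest["objectives"]):
            violations.append(f"quest {quest['id']} complete with open objectives")
    return violations

# A snapshot containing both contradictions described above:
state = {
    "enemies": [{"id": "orc_1", "alive": True, "health": -5}],
    "quests": [{"id": "q_tutorial", "complete": True, "objectives": [True, False]}],
}
print(check_invariants(state))
```

Each invariant you add encodes a rule your game logic is supposed to guarantee, which is exactly what makes violations worth flagging.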
Generating Actionable Reports
Report Structure
The automated testing report should be organized for quick triage:
Critical Issues (Crashes and Blockers)
- Crashes with reproduction steps
- Progression blockers (player cannot advance past a point)
- Data loss bugs (save corruption, inventory loss)
High Priority (Regressions and Performance)
- Performance regressions with before/after metrics
- Functionality that worked in the previous build but fails now
- Visual regressions
Medium Priority (Functional Issues)
- Failed state validation tests
- Traversal issues (stuck spots, unreachable areas)
- Edge case failures
Low Priority (Cosmetic and Minor)
- Visual anomalies that do not affect gameplay
- Minor timing discrepancies
- Warnings from the engine log
Each issue includes: a description of the problem, reproduction steps (the test specification that triggered it), evidence (screenshots, metrics, log excerpts), and the build or commit that introduced the issue (if regression testing identifies it).
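A report generator following this severity ordering can be a short script; the issue fields below are hypothetical stand-ins for whatever your pipeline records:

```python
SEVERITY_ORDER = ["critical", "high", "medium", "low"]

def render_report(issues):
    """Render issues grouped by severity, most urgent first, as markdown."""
    lines = ["# Nightly Playtest Report", ""]
    for sev in SEVERITY_ORDER:
        bucket = [i for i in issues if i["severity"] == sev]
        if not bucket:
            continue
        lines.append(f"## {sev.title()} ({len(bucket)})")
        for issue in bucket:
            lines.append(f"- {issue['title']} (test: {issue['test']})")
        lines.append("")
    return "\n".join(lines)

issues = [
    {"severity": "medium", "title": "Stuck spot near Boss_Arena_Entrance",
     "test": "Traversal_Lvl3"},
    {"severity": "critical", "title": "Crash on save during combat",
     "test": "SaveLoad_Combat"},
]
print(render_report(issues))
```

The same structure maps directly onto issue-tracker fields if you later format for import instead of markdown.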
Integration with Issue Trackers
For studios using Jira, Linear, GitHub Issues, or similar tools, the report can be formatted for direct import. The AI agent can generate issue descriptions with the correct fields filled in, ready for a developer to review and assign.
This step should not be fully automated. Automatically creating tickets for every test failure generates noise. Instead, the report should be reviewed by a developer who decides which failures warrant tickets and which are false positives, known issues, or acceptable.
A Practical Implementation Plan
Week 1: Basic Traversal Testing
Start small. Define 10-20 waypoints in your most-played level. Set up the MCP server connection and write a simple traversal test specification. Run it manually (you initiate the test, the agent executes it, you watch). Get comfortable with the workflow before automating it.
Week 2: Performance Baselines
Run performance scenarios on your current build. Record the baselines. Set reasonable thresholds. Run the performance suite on subsequent builds and verify that the regression detection works correctly — intentionally introduce a performance problem and confirm that the system flags it.
Week 3: State Validation
Write test specifications for your most important gameplay systems. Start with inventory and save/load, since these are the most commonly buggy systems and the most suitable for automated verification.
Week 4: Overnight Automation
Set up the full pipeline to run overnight. Use your CI/CD system (GitHub Actions, Jenkins, or even a simple cron job) to trigger the pipeline after each day's commits. Review the morning report and refine the tests based on false positives and missed issues.
Ongoing: Expand and Refine
Every time you fix a bug that was found by a human tester, ask: could automated testing have caught this? If yes, write a test specification for it. Over time, your automated test suite grows to cover the issues your project is most prone to.
Limitations and Honest Assessment
Automated MCP testing will not replace your human QA process. It is a supplement, not a replacement. Here is what we have found in practice:
It catches about 30-40% of the bugs that human testers catch. The bugs it catches are the mechanical, reproducible, state-management bugs that are tedious for humans to test but easy for automation to verify. The bugs it misses are the creative, context-dependent, feel-based bugs that require human judgment.
False positive rate starts high and decreases over time. Expect 20-30% false positives in the first few weeks as you tune thresholds and refine test specifications. After a month of iteration, false positives typically drop to 5-10%.
The biggest value is regression detection. Catching new bugs is useful. Catching old bugs that come back is more useful, because regressions are the most frustrating and preventable category of bugs. Automated testing excels at this because it runs the same tests on every build with perfect consistency.
Setup cost is real but amortized quickly. Building the initial pipeline takes 2-4 days of focused work. Writing the initial test suite takes another 2-3 days. After that, the ongoing cost is incremental — adding a few tests when you add new features, and reviewing reports each morning. For any project that will be in development for more than a few months, the investment pays off.
If you are a solo developer working on a project with more than three months of development ahead, automated testing is worth the setup cost. If you are in a game jam or a very short-term project, the overhead is not justified.
Closing Thoughts
The dream of fully automated QA that tests your game while you sleep is partially achievable today. The mechanical, deterministic, state-based aspects of game testing can be automated effectively with MCP-based pipelines. The creative, subjective, feel-based aspects cannot.
The practical approach is layered testing: automated tests handle the tedious, systematic coverage that humans skip or rush. Human testers focus on creative exploration, feel assessment, and first-time-player experience. Together, they produce better coverage than either approach alone.
Start small. Prove the value on your most bug-prone systems. Expand gradually. And do not expect perfection — expect incremental improvement in your bug detection rate, your regression catch rate, and your morning coffee routine (reading a test report is more pleasant than manually retesting yesterday's fixes).