Work Graphs are the biggest change to how GPUs schedule work since compute shaders became mainstream. The short version: instead of the CPU enqueuing a fixed sequence of draw calls and dispatch calls, a shader running on the GPU can enqueue additional work onto the GPU itself. The GPU has a scheduler, the scheduler runs a graph of work nodes, and each node can spawn more nodes at arbitrary depth.
AMD pioneered this in the research literature, Microsoft shipped it in DirectX 12 Agility SDK 1.613 (March 2024), Vulkan's closest analogue is AMD's experimental VK_AMDX_shader_enqueue extension (with VK_EXT_device_generated_commands covering the adjacent GPU-generated-commands case), and UE5.7 ships the first production integration of Work Graphs in the engine. It is exposed for specific passes — not as a wholesale rewrite of the render pipeline — but the passes where it is wired up are the ones that historically hit draw-call walls hardest: runtime PCG, dynamic foliage, particle systems, and geometry culling for open worlds.
This post is an analysis, not a tutorial. The feature is too new and the API surface too unstable for tutorial-depth coverage to survive into 5.8. What is stable and worth understanding is the mental model, the hardware requirements, the performance implications, and how Work Graphs change the way procedural content pipelines should be designed going forward.
The Problem Work Graphs Solve
To understand Work Graphs, it helps to be specific about the problem.
The traditional GPU programming model is a pipeline of fixed-depth stages. The CPU records a list of draw and dispatch commands into a command buffer and submits it to the GPU. The GPU executes the command buffer in order. Each command does a bounded amount of work. If a command needs to decide how much work to do based on runtime data, it has to stop and bounce back to the CPU — or use indirect dispatches that can scale the work but cannot change its shape.
This model is great at throughput and poor at dynamism. Two specific patterns hit walls:
Hierarchical scheduling. A classic example: culling a scene. The CPU knows roughly how many objects exist. A compute shader culls them per-frame and produces a list of visible objects. The GPU then draws the visible list via indirect draw calls. This works, but the culling shader has to guess at the worst-case visible count, over-allocate buffers, and the draw pipeline has to handle whatever count the culling shader produced. If the next pass needs to do something different for visible objects that are also shadow-casting, also near-field, also static versus dynamic — each of those decisions becomes a separate dispatch with its own buffer coordination.
Runtime procedural generation. Another example: spawning foliage around a player who can move anywhere. The standard solution is PCG (Procedural Content Generation) baked at cook time, or chunk-based streaming of pre-baked foliage. True runtime procedural placement — generating a forest that does not exist until the camera approaches it — requires the CPU to orchestrate the chunking, the GPU to generate the content, and careful coordination to avoid CPU-GPU sync stalls. Most games either don't do this or accept visible popping and latency.
Work Graphs let the GPU do the orchestration itself. A culling node produces a variable number of output items, each of which is dispatched to the appropriate shading node, each of which can spawn further work (LOD selection, tessellation decisions, transparency sorting) without a CPU round-trip. The graph has a declared structure but the traversal is data-driven at runtime.
The Programming Model
A Work Graph in DX12 is a directed graph of shader nodes. Each node has:
- A shader program.
- Input and output record types (structured data passed between nodes).
- A dispatch grid, which can be static, indirect, or node-generated.
A node invokes other nodes by writing to an output record. The GPU scheduler collects output records, packs them into dispatches for the target nodes, and launches them when enough records have accumulated or a flush is requested. The granularity of packing is implementation-defined — NVIDIA and AMD both aim for wave- or thread-group-granularity packing.
The key primitives are:
- Entry nodes — nodes invoked from the CPU side, analogous to a kernel launch.
- Broadcasting nodes — nodes invoked as a dispatch grid per input record.
- Coalescing nodes — nodes that batch input records and invoke once per batch.
- Thread nodes — nodes that run a single thread per input record (useful for lightweight routing).
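In HLSL terms (following the D3D12 Work Graphs specification, not UE-specific code), a broadcasting node that emits a variable number of records to a downstream node looks roughly like the sketch below. The record layouts, node names, and the visibility test are all illustrative, and the downstream "ShadeNode" is omitted:

```hlsl
// Illustrative record layouts (hypothetical, not Epic's).
struct CullInput
{
    uint3 DispatchGrid : SV_DispatchGrid; // grid size chosen by the producer
    uint  PrimitiveCount;
};

struct VisibleRecord
{
    uint PrimitiveIndex;
};

// Broadcasting node: launches one dispatch grid per input record.
[Shader("node")]
[NodeLaunch("broadcasting")]
[NodeMaxDispatchGrid(1024, 1, 1)]
[NumThreads(64, 1, 1)]
void CullNode(
    DispatchNodeInputRecord<CullInput> Input,
    // The parameter name targets a node whose NodeID is "ShadeNode".
    [MaxRecords(64)] NodeOutput<VisibleRecord> ShadeNode,
    uint Dtid : SV_DispatchThreadID)
{
    bool bVisible = false;
    if (Dtid < Input.Get().PrimitiveCount)
    {
        bVisible = (Dtid & 1) == 0; // stand-in for a real frustum/occlusion test
    }

    // Each thread requests 0 or 1 output records; the scheduler packs
    // the surviving records into dispatches of the target node.
    ThreadNodeOutputRecords<VisibleRecord> Out =
        ShadeNode.GetThreadNodeOutputRecords(bVisible ? 1 : 0);
    if (bVisible)
    {
        Out.Get().PrimitiveIndex = Dtid;
    }
    Out.OutputComplete();
}
```

The point of the shape: the producer never sizes a worst-case buffer. It declares a per-thread-group output budget (`MaxRecords`) and the scheduler owns the intermediate storage.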
A graph can also have recursive nodes — a node that outputs records back to itself, with a stack depth limit. This is the feature that makes Work Graphs genuinely new. Recursive GPU work was previously the exclusive domain of ray tracing shaders, and even there it was heavily constrained.
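A recursive node, again in the D3D12 HLSL model with hypothetical names: the node declares a maximum recursion depth and checks its remaining budget before emitting records back to itself.

```hlsl
struct Patch
{
    float2 Center;
    float  HalfSize;
};

// Thread-launch node that feeds records back to itself, bounded by
// a declared recursion limit (a hypothetical quadtree subdivision).
[Shader("node")]
[NodeLaunch("thread")]
[NodeMaxRecursionDepth(8)]
void SubdivideNode(
    ThreadNodeInputRecord<Patch> Input,
    // NodeID targets this node itself, making the graph recursive.
    [NodeID("SubdivideNode")]
    [MaxRecords(4)] NodeOutput<Patch> Self)
{
    Patch P = Input.Get();

    // Terminate on size, or when the declared depth budget runs out.
    if (P.HalfSize < 0.5f || GetRemainingRecursionLevels() == 0)
    {
        return; // a real graph would emit leaf records to a consumer node here
    }

    ThreadNodeOutputRecords<Patch> Children = Self.GetThreadNodeOutputRecords(4);
    for (uint i = 0; i < 4; ++i)
    {
        Patch C;
        C.HalfSize = P.HalfSize * 0.5f;
        C.Center   = P.Center + float2((i & 1) ? 1.0f : -1.0f,
                                       (i & 2) ? 1.0f : -1.0f) * C.HalfSize;
        Children.Get(i) = C;
    }
    Children.OutputComplete();
}
```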
In UE5.7, you do not author Work Graphs directly in most cases. The engine exposes them through a handful of systems where they make sense.
Where UE5.7 Uses Work Graphs Today
GPU-Driven Culling and Drawing
The mesh drawing pipeline in 5.7 has an optional Work Graph path, enabled via r.WorkGraphs.MeshDrawing 1. On supported hardware, the frustum-and-occlusion cull, LOD selection, material assignment, and indirect draw generation collapse into a single graph with four nodes:
- Cull node — broadcasts over all scene primitives, outputs visible records.
- LOD node — coalesces visible records by mesh, selects LOD per instance, outputs LOD'd records.
- Material bin node — groups LOD'd records by material, outputs per-bin draw records.
- Draw node — issues the actual draw commands.
Before Work Graphs, this was four separate compute dispatches with three intermediate buffers sized for the worst case. With Work Graphs, the intermediate storage is handled by the scheduler (which can allocate from a small ring buffer because it knows the actual throughput), and the overall memory footprint drops by 60–75% for the culling path. Performance on the culling itself is 10–25% better on RTX 40-series, depending on scene complexity.
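Epic's implementation is not public, but the LOD stage described above maps naturally onto a coalescing node in the D3D12 HLSL model. A hedged sketch, with invented record layouts and a stand-in LOD heuristic:

```hlsl
// Invented record layouts for the cull -> LOD -> material-bin flow.
struct VisibleRecord { uint PrimitiveIndex; };
struct LODRecord     { uint PrimitiveIndex; uint LODIndex; };

// Coalescing node: the scheduler delivers up to MaxRecords inputs
// to a single thread group instead of one launch per record.
[Shader("node")]
[NodeLaunch("coalescing")]
[NumThreads(64, 1, 1)]
void LODNode(
    [MaxRecords(64)] GroupNodeInputRecords<VisibleRecord> Inputs,
    [MaxRecords(64)] NodeOutput<LODRecord> MaterialBinNode,
    uint Gtid : SV_GroupThreadID)
{
    // The batch may be partially full; Count() is the real record count.
    bool bHasRecord = Gtid < Inputs.Count();

    ThreadNodeOutputRecords<LODRecord> Out =
        MaterialBinNode.GetThreadNodeOutputRecords(bHasRecord ? 1 : 0);
    if (bHasRecord)
    {
        uint Prim = Inputs.Get(Gtid).PrimitiveIndex;
        Out.Get().PrimitiveIndex = Prim;
        Out.Get().LODIndex       = Prim % 4; // stand-in for a distance-based pick
    }
    Out.OutputComplete();
}
```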
Runtime PCG
The PCG system has a new Work-Graph-backed evaluator. PCG graphs that previously required chunked evaluation on the CPU (with the CPU orchestrating which chunks to evaluate and when) can now run as a single Work Graph that evaluates itself on demand. Practical consequence: a PCG graph that spawns 2 million foliage instances in a 4km² area can evaluate in 3–4ms in the frame the player crosses into a new chunk, versus 40–80ms spread across multiple frames with the legacy path.
This is the change that most directly affects authoring. PCG graphs for the Work Graph evaluator have slightly different rules — nodes with side effects on CPU-side data are not allowed, and certain node types (anything that requires full-scene awareness across chunk boundaries) still require the legacy path. Most placement-only graphs port cleanly.
The Procedural Placement Tool includes graph presets specifically tuned for the Work Graph evaluator — the node structure is laid out so that the GPU's batching heuristics can pack efficiently, which matters more for placement throughput than the raw graph logic. A graph authored for CPU PCG will run on the Work Graph evaluator, but it will typically evaluate 2–4× slower than a graph structured for GPU-driven batching, because of small-batch overhead.
Dynamic Foliage Wind and Physics
The foliage wind simulation historically ran as a compute dispatch per foliage category per frame. Large scenes with many foliage categories (grass, shrubs, low trees, high trees, flowers) could burn 1–2ms on dispatch overhead alone before doing any actual physics work.
The 5.7 wind system uses a single Work Graph entry node that fans out per-category. The scheduler packs small categories together and splits large categories across multiple waves automatically. Measured overhead reduction: 0.8ms–1.4ms on heavy-foliage scenes on a 4070. Not huge, but it's recovered budget in a frame where there is no obvious place to find it.
Particle Systems (Niagara)
Niagara has an experimental Work Graph path for GPU particle systems. The key win is emitter cascading — one emitter's particles becoming another emitter's source — which previously required a frame of latency because the CPU had to read back the first emitter's state. With Work Graphs, a cascading emitter writes records directly into the downstream emitter's input and both run in the same frame.
This is gated behind fx.Niagara.WorkGraphs 1 and marked experimental. The API is still stabilizing; behavior has regressed and then recovered between the 5.7.0 and 5.7.2 point releases. Not production-ready, but worth evaluating for FX-heavy projects that have hit the emitter-cascade latency wall.
Cloth and Chaos
A surprise adoption: Chaos cloth in 5.7 uses a Work Graph for constraint solving on long cloth sheets. The graph recursively subdivides the cloth based on curvature, running more solver iterations on high-curvature regions and fewer on flat regions. The quality-per-frame-cost ratio improves by about 30% on complex cloth (capes, long skirts, multi-layer garments).
Hardware Requirements and Fallbacks
Work Graphs are not a universal feature. The current hardware matrix:
| Hardware | Status |
|---|---|
| NVIDIA RTX 20/30/40/50 series | Full support (DX12 and Vulkan) |
| AMD RDNA 3 (RX 7000) | Full support |
| AMD RDNA 4 (RX 9000) | Full support with improved scheduler |
| AMD RDNA 2 (RX 6000) | Partial — non-recursive graphs only |
| Intel Arc Alchemist | Not supported |
| Intel Arc Battlemage | Partial — coalescing nodes not supported |
| PlayStation 5 | Vendor-specific equivalent; UE5.7 integration incomplete |
| PlayStation 5 Pro | Full support via vendor extension |
| Xbox Series X/S | No public support yet |
| Nintendo Switch 2 | Not supported |
UE5.7 has fallback paths for all Work-Graph-backed systems. On hardware without support, the engine falls back to the previous implementation (multi-dispatch compute, CPU orchestration). The fallback is transparent to game code — you do not author two versions of a PCG graph or a Niagara system.
The fallback paths are the production default because most shipping projects cannot require Work Graph support at this time. The Work Graph path is opt-in via console variables and project settings, with the expectation that it becomes the default in 5.8 or 5.9 as hardware support broadens.
Performance Implications for Open-World Scenes
The theoretical performance claims around Work Graphs (50%+ improvements in specific workloads) are real but narrow. The broader claim worth examining is what happens to frame pacing and CPU-GPU balance in open-world scenes.
Measurements from a test open-world scene (8km² landscape, 1.2M foliage instances, 340 unique meshes, dense particle effects, 60fps target on RTX 4080):
| Metric | Legacy Path | Work Graph Path |
|---|---|---|
| CPU main thread time | 8.4ms | 5.1ms |
| CPU render thread time | 6.9ms | 3.2ms |
| GPU frame time | 14.1ms | 12.3ms |
| Transient GPU memory | 820MB | 280MB |
| Frame time variance (1%) | 2.1ms | 0.9ms |
The CPU-side reduction is larger than the GPU-side reduction, because Work Graphs eliminate most of the per-frame command recording for the culling and draw path. This is typically the bigger win for open-world games, which are usually CPU-bound on the render thread at high framerates.
The variance reduction is underrated. Work Graphs produce more uniform frame times because the scheduler can load-balance across the GPU's compute units based on actual record throughput rather than a fixed dispatch shape. On a 1% worst-case frame measurement, this is the difference between "perceived stutters" and "smooth" in player feedback.
The memory reduction is the most surprising number. Transient memory for intermediate buffers drops by 60–70% because the scheduler's ring buffers are sized for actual throughput rather than worst-case count. On memory-constrained platforms, this alone can justify adoption.
Implications for Procedural Content Pipelines
This is the part of the story that matters for content authoring, not just for engine programmers.
Before Work Graphs, runtime procedural content had two practical patterns: bake offline and stream (cheap at runtime, inflexible, large disk footprint) or generate per-frame in chunks (flexible, expensive, often pops visibly). A third pattern — deep runtime procedural generation that is fully responsive to player state — was theoretically possible but almost never economical.
With Work Graphs, the third pattern becomes economical for a class of workloads:
- Instance placement — foliage, debris, clutter, decals. PCG graphs that produce transform data can evaluate fast enough to feel instantaneous to the player.
- Simple shape generation — decals of varied shape, projected onto varied geometry. The shape evaluation can run in the same graph as the placement.
- Attribute variation — per-instance color, scale, rotation jitter. These can be evaluated as a post-placement node without a separate dispatch.
What is still not economical at runtime:
- Mesh generation from signed distance fields. SDF evaluation is cheap, but the mesh extraction (marching cubes, dual contouring) has memory patterns that do not pack well into Work Graph nodes. Still better as offline bake.
- Navigation mesh generation. Navmesh generation has gameplay-visible latency requirements that are too tight even for Work Graphs, and it needs to integrate with CPU-side AI systems.
- Collision mesh generation. Collision needs to be known to the physics system on the CPU, and round-tripping GPU-generated geometry back to CPU physics is still too expensive.
The practical advice: audit procedural systems that are currently chunk-streamed and consider whether the chunking is worth its complexity now that runtime evaluation is viable. For many foliage and clutter systems, the chunking was a workaround for per-frame cost that no longer applies.
Procedural Placement Tool users: the 2026.1 update includes presets for Work-Graph-backed evaluation with pre-tuned node counts and batch sizes. A placement graph that was authored for chunked CPU evaluation can be ported by swapping the root evaluator — the node library is compatible — but the performance profile is significantly different and most graphs benefit from re-tuning batch sizes for the GPU path.
Authoring Patterns That Age Well
A few patterns we've adopted after six months with Work Graphs:
- Design graphs for batching, not for linearity. A graph that runs each node once in sequence leaves most of the scheduler's benefit on the table. Structure placement so that many instances pass through the same nodes simultaneously, not so that a long chain of transformations happens to each instance in isolation.
- Prefer coalescing nodes for post-processing. If you're applying a final pass to every instance (setting a team color, snapping to a terrain surface), a coalescing node with a large batch size is almost always faster than a broadcasting node with one thread per instance.
- Avoid deep recursion. Recursive Work Graph nodes are limited to ~8 levels of depth on most hardware before the scheduler starts spilling to slower memory. Design recursive generators (fractal vegetation, branching systems) to terminate shallow.
- Keep node shaders small. Each node invocation has a small but non-zero scheduling overhead. A node that does 100 lines of shader work is fine; a node that does 3 lines is wasteful. Fuse trivial nodes.
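The second and fourth bullets can be sketched together in D3D12-style HLSL: one coalescing leaf node consumes a full batch of instance records and applies the trivial final pass inline, rather than routing each record through its own thread node. All names, layouts, and register bindings here are hypothetical:

```hlsl
struct InstanceRecord
{
    float3 Position;
    uint   TeamIndex;
};

// Hypothetical output buffers written by the leaf node.
RWStructuredBuffer<InstanceRecord> OutInstances : register(u0);
RWStructuredBuffer<uint>           OutCount     : register(u1);

// Leaf coalescing node: one thread group finalizes a whole batch.
[Shader("node")]
[NodeLaunch("coalescing")]
[NumThreads(128, 1, 1)]
void FinalizeInstances(
    [MaxRecords(128)] GroupNodeInputRecords<InstanceRecord> Batch,
    uint Gtid : SV_GroupThreadID)
{
    if (Gtid >= Batch.Count())
        return;

    InstanceRecord Inst = Batch.Get(Gtid);
    Inst.Position.z = 0.0f;                 // stand-in for a terrain-height snap
    Inst.TeamIndex  = Inst.TeamIndex & 0x7; // stand-in for trivial fused work

    // Compact write-out. A thread node per record would pay scheduling
    // overhead for the same three lines of work.
    uint Slot;
    InterlockedAdd(OutCount[0], 1, Slot);
    OutInstances[Slot] = Inst;
}
```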
What UE5.8 Will Likely Change
Based on Epic's public roadmap and the state of 5.7, the likely 5.8 changes:
- Work Graphs as the default for the mesh drawing path, with fallback reserved for unsupported hardware.
- Stable Niagara Work Graph integration (out of experimental).
- First-class Blueprint authoring for Work Graph PCG nodes.
- Expanded platform support as console SDKs catch up.
- A higher-level API in the engine that abstracts Work Graph structure behind content-author-friendly concepts (current API exposes some of the graph structure in ways that require programmer involvement).
The broader trend: more systems in the engine will move to GPU-driven scheduling over the next 18–24 months. The payoff for getting comfortable with the mental model now, even for content teams who don't author Work Graphs directly, is that the performance characteristics of GPU-driven systems are different enough from legacy paths that authoring habits matter. A team that designed all its procedural content for chunked CPU evaluation will spend a year un-learning habits when the default switches.
For teams using the Unreal MCP Server for batch configuration of procedural systems, the Work Graph toggles and preset selection are exposed as standard console variables and project settings, which means batch updates across large numbers of PCG volumes or foliage systems can be automated without waiting for a dedicated API.
The Short Version
Work Graphs are not a single feature with a single use case. They are a new scheduling substrate that affects multiple systems at different depths. In UE5.7, they are production-ready for mesh drawing and runtime PCG, experimental for Niagara, and opt-in everywhere. On supported PC hardware, the performance and memory wins are real and worth adopting where available. On consoles and older PC hardware, the fallback paths carry the same content forward with the legacy performance profile.
The authoring consequence is the part worth internalizing early: procedural content pipelines should be designed for the assumption that runtime evaluation is cheap. The patterns that made sense when every draw call was expensive — aggressive chunking, offline baking of everything bakeable, frame-spreading of heavy workloads — are habits that no longer fit the engine they run on. Not every project has to adopt Work Graphs, but every project should understand which of its assumptions about content pipeline structure were about hardware limits that no longer apply.