Dev Log - The New Multithreading Framework
Hello, Engineers! We're excited to share that development of Dyson Sphere Program has been progressing steadily over the past few months. Every line of code and every new idea reflects our team's hard work and dedication. We hope this brings even more surprises and improvements to your gameplay experience!

(Vehicle System: Activated!)

Bad News: CPU is maxing out

During development and ongoing maintenance, we've increasingly run up against our performance ceilings. Implementing a vehicle system would introduce thousands of physics-enabled components, something the current architecture simply can't sustain.

Back in pre-blueprint days, we assumed "1k Universe Matrix/minute" factories would push hardware to its limits. Yet your creativity shattered our expectations: for some players, 10k Universe Matrix was just the entry-level challenge. Though we quickly rolled out a multithreading system and spent years optimizing it, players kept pushing their PCs to the absolute limit, with pioneers achieving 100k and even 1M Universe Matrix! Clearly, it was time for a serious performance boost.

After a thorough review of the existing code structure, we found that the multithreading system still had massive optimization potential. So our recent focus has been a complete overhaul of Dyson Sphere Program's multithreading framework, paving the way for the vehicle system's future development.

(A performance snapshot from a 100k Matrix save. Logic frame time for the entire production line hits 80 ms.)

Multithreading in DSP

Let's briefly cover some multithreading basics: why DSP uses it, and why we're rebuilding the system.

Take the production cycle of an Assembler as an example. Ignoring logistics, its logic can be broken into three phases:

1. Power Demand Calculation: The Assembler's power needs vary based on whether it's lacking materials, blocked by output, or mid-production.
2. Grid Load Analysis: The power system sums the supply capability of all generators, compares it against total consumption, and determines the grid's power supply ratio.
3. Production Progress: Based on the grid's load and factors like resource availability and Proliferator coating, the production increment for that frame is calculated.

Individually, these calculations are trivial: each Assembler might only take a few hundred to a few thousand nanoseconds. But scale that up to the tens or hundreds of thousands of Assemblers in a late-game save, and suddenly the processor can be stuck processing them sequentially for milliseconds, tanking your frame rate. A simplified sketch of one such logic tick is shown below.
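To make those three phases concrete, here is a minimal single-threaded sketch of one logic tick. Everything in it, from the type names to the wattage numbers, is an illustrative stand-in rather than DSP's actual code.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical constants, not DSP's real values.
constexpr float kIdlePower         = 10.0f;        // demand while stalled
constexpr float kWorkPower         = 270.0f;       // demand while crafting
constexpr float kBaseSpeedPerFrame = 1.0f / 60.0f; // progress per tick

struct Assembler {
    bool  lackingMaterials = false;
    bool  outputBlocked    = false;
    float powerDemand      = 0.0f;  // watts requested this frame
    float progress         = 0.0f;  // 0..1 crafting progress
};

struct PowerGrid {
    float supplyCapacity = 0.0f;    // summed from generators elsewhere
    float supplyRatio    = 1.0f;    // supply / demand, clamped to 1
};

void LogicTick(std::vector<Assembler>& assemblers, PowerGrid& grid) {
    // Phase 1: every Assembler computes its own power demand.
    float totalDemand = 0.0f;
    for (auto& a : assemblers) {
        bool working  = !a.lackingMaterials && !a.outputBlocked;
        a.powerDemand = working ? kWorkPower : kIdlePower;
        totalDemand  += a.powerDemand;
    }

    // Phase 2: the grid compares total supply against total demand.
    grid.supplyRatio = (totalDemand > 0.0f)
        ? std::min(1.0f, grid.supplyCapacity / totalDemand)
        : 1.0f;

    // Phase 3: production advances, scaled by the grid's supply ratio.
    for (auto& a : assemblers) {
        if (!a.lackingMaterials && !a.outputBlocked)
            a.progress += kBaseSpeedPerFrame * grid.supplyRatio;
    }
}
```

Run sequentially over hundreds of thousands of Assemblers, loops like these are exactly where the milliseconds go.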
(This sea of Assemblers runs smoothly thanks to relentless optimization.)

Luckily, most modern CPUs have multiple cores, allowing them to perform calculations in parallel. If your CPU has eight cores and you split the workload evenly, each core does less work, reducing the overall time needed. But here's the catch: not every Assembler takes the same amount of time to process. Differences in core performance, background tasks, and OS scheduling mean threads rarely finish together; you're always waiting on the slowest one. So even with 8 cores, you won't get an 8x speedup.

So, next stop: wizard mode. Okay, jokes aside. Let's get real about multithreading's challenges. When multiple CPU cores work in parallel, you inevitably run into issues like memory constraints, shared data access, false sharing, and context switching. For instance, when multiple threads need to read or modify the same data, a communication mechanism must be introduced to ensure data integrity. This mechanism not only adds overhead but also forces one thread to wait for another to finish.

There are also timing dependencies to deal with. Let's go back to the three-stage Assembler example. Before Stage 2 (grid load calculation) can run, all Assemblers must have completed Stage 1 (power demand update); otherwise, the grid could be working with outdated data from the previous frame.

To address this, DSP's multithreading system breaks each game frame's logic into multiple stages, separating out the heavy workloads. We then identify which stages are order-independent. For example, when an Assembler calculates its own power demand for the current frame, the result doesn't depend on the power demand of any other building. That means we can safely run these calculations in parallel across multiple threads. The two sketches below illustrate the idea: first splitting an order-independent stage across threads, then synchronizing before a dependent stage begins.
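Here is what splitting Phase 1 across worker threads can look like, reusing the types from the earlier sketch. Spawning std::thread objects every stage is purely illustrative; a production framework would keep a persistent worker pool. The shared totalDemand sum is deliberately left out: accumulating into shared data from many threads invites exactly the contention and false sharing mentioned above, so it would typically be handled with per-thread partial sums.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Phase 1 for one contiguous chunk of the Assembler array. No Assembler
// reads another's data, so chunks can run concurrently without locks.
void UpdatePowerDemandRange(std::vector<Assembler>& assemblers,
                            size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) {
        auto& a = assemblers[i];
        bool working  = !a.lackingMaterials && !a.outputBlocked;
        a.powerDemand = working ? kWorkPower : kIdlePower;
    }
}

void ParallelPowerDemand(std::vector<Assembler>& assemblers,
                         unsigned workerCount) {
    std::vector<std::thread> workers;
    size_t chunk = (assemblers.size() + workerCount - 1) / workerCount;
    for (unsigned t = 0; t < workerCount; ++t) {
        size_t begin = t * chunk;
        size_t end   = std::min(assemblers.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back(UpdatePowerDemandRange, std::ref(assemblers),
                             begin, end);
    }
    // The stage only finishes when the slowest worker does.
    for (auto& w : workers) w.join();
}
```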
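And here is one generic way to enforce the ordering constraint between dependent stages in a persistent worker pool: a generation-counting barrier built on atomics. Again, this is a common pattern sketched under our own naming, not DSP's actual synchronization code.

```cpp
#include <atomic>

class StageBarrier {
public:
    explicit StageBarrier(int workerCount)
        : workerCount_(workerCount), remaining_(workerCount) {}

    // Each worker calls this when it finishes the current stage.
    void ArriveAndWait() {
        int gen = generation_.load(std::memory_order_acquire);
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arrival: reset the count and release all waiters.
            remaining_.store(workerCount_, std::memory_order_relaxed);
            generation_.fetch_add(1, std::memory_order_release);
        } else {
            // Spin until the generation advances. Busy-waiting burns a
            // little CPU but wakes up far faster than a blocking lock;
            // a real implementation might also pause or yield here.
            while (generation_.load(std::memory_order_acquire) == gen) {}
        }
    }

private:
    const int        workerCount_;
    std::atomic<int> remaining_;
    std::atomic<int> generation_{0};
};

// Worker loop sketch:
//   UpdatePowerDemandRange(assemblers, begin, end); // Stage 1 chunk
//   barrier.ArriveAndWait(); // grid load may now read every demand value
//   /* ... Stage 2, Stage 3 ... */
```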
What Went Wrong with the Old System

Our old multithreading system was, frankly, showing its age. Its execution efficiency was mediocre at best, and its design made it difficult to schedule a variety of multithreaded tasks. Every multithreaded stage came with a heavy synchronization cost, and as the game added more complex content, the logic workload per frame steadily grew. Converting any single logic block to multithreaded processing often brought only marginal performance gains while greatly increasing code maintenance difficulty.

To better understand which parts of the logic were eating up CPU time, and exactly where the old system was falling short, we built a custom performance profiler. Below is an example taken from the old framework:

(Thread performance breakdown in the old system)

In this chart, each row represents a thread, and the X-axis shows time. Different logic tasks or entities are shown in different colors. The white bars show the runtime of each sorter logic block in its assigned thread. The red bar above them represents the total time spent on sorter tasks in that frame: around 3.6 ms. Meanwhile, the entire logic frame took about 22 ms.

(The red box marks the total time from sorter start to sorter completion.)

Zooming in, we can spot some clear issues. Most noticeably, threads don't start or end their work at the same time; execution is staggered and uncoordinated.

(Here, threads 1, 2, and 5 finish first; only then do threads 3, 4, and 6 begin their work.)

There are many possible reasons for this behavior. Sometimes the system needs to run other programs, and some of those processes may be high-priority, consuming CPU resources and preventing the game's logic from fully utilizing all available cores. Or a particular thread may be running a long, time-consuming segment of logic. In such cases, the operating system might detect a low number of active threads and, seeing that some cores are idle, shut a few down for power-saving reasons, further reducing multithreading efficiency.

In short, OS-level automatic scheduling of threads and cores is a black box, and it often leaves available cores unused. The issue isn't as simple as "16 cores being used as 15, so performance drops by 1/16." In reality, if even one thread falls behind for reasons like those above, every other thread has to wait for it to finish, dragging down overall performance.

Take the chart below, for example. The actual CPU task execution time (shown in white) may account for less than two-thirds of the total available processing window.

(The yellow areas highlight significant zones of CPU underutilization.)

Even when scheduling isn't the issue, the chart clearly shows that different threads take vastly different amounts of time to complete the same type of task. In fact, even when none of the threads started late, the fastest thread might still finish in half the time of the slowest one.

Now look at the transition between processing stages. There's a visible gap between the end of one stage and the start of the next. This happens because the old system simply used blocking locks to coordinate stage transitions. These locks can introduce as much as 50 microseconds of overhead, which is quite significant at this level of performance optimization.

The New Multithreading System Has Arrived!

To maximize CPU utilization, we scrapped the old framework and built a new multithreading system and logic pipeline from scratch. In the brand-new system, every core is pushed to its full potential. Here's a performance snapshot from the new system as of the time of writing:

The white sorter bars are now tightly packed, and their start and end times are nearly identical. Beautiful! On the same save, the sorter time cost dropped to ~2.4 ms, and total logic time fell from 22 ms to 11.7 ms: an 88% improvement in logical frame efficiency (logic frames only, not overall frame time). That's better than upgrading from a 14400F to a 14900K CPU!

Here's a breakdown of why performance improved so dramatically:

1. Custom Core Binding: In the old multithreading framework, threads weren't bound to specific CPU cores; the OS assigned cores through opaque scheduling mechanisms, often leading to inefficient core utilization. Now players can manually bind threads to specific cores, preventing these "unexpected operations" by the system scheduler. A sketch of what such binding can look like follows below.

(Zoomed-in comparison shows new framework
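As an illustration, here is a minimal sketch of binding worker threads to specific logical cores on Windows using SetThreadAffinityMask. The simple worker-to-core mapping is an assumption made for the example, not DSP's actual assignment scheme.

```cpp
#include <windows.h>
#include <thread>
#include <vector>

// Restrict the calling thread to a single logical core. Each set bit in
// the affinity mask marks a core the thread is allowed to run on.
void PinCurrentThreadToCore(unsigned core) {
    DWORD_PTR mask = DWORD_PTR(1) << core;
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

void StartPinnedWorkers(unsigned workerCount) {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i) {
        workers.emplace_back([i] {
            PinCurrentThreadToCore(i); // worker i stays on logical core i
            // ... run this worker's share of each logic stage ...
        });
    }
    for (auto& w : workers) w.join();
}
```

Pinning prevents the OS from migrating a worker mid-frame, which helps keep caches warm and makes per-thread timings far more predictable.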