Dev Log - The New Multithreading Framework
Hello, Engineers! We're excited to share that development of Dyson Sphere Program has been progressing steadily over the past few months. Every line of code and every new idea reflects our team's hard work and dedication. We hope this brings even more surprises and improvements to your gameplay experience!

(Vehicle System: Activated!)

Bad News: CPU is maxing out

During development and ongoing maintenance, we've increasingly run up against our performance ceilings. Implementing a vehicle system would introduce thousands of physics-enabled components, something the current architecture simply can't sustain.

Back in pre-blueprint days, we assumed "1k Universe Matrix/minute" factories would push hardware to its limits. Yet your creativity shattered our expectations: for some players, 10k Universe Matrix was just the entry-level challenge. Though we quickly rolled out a multithreading system and spent years optimizing it, players kept pushing their PCs to the absolute limit, with pioneers achieving 100k and even 1M Universe Matrix! Clearly, it was time for a serious performance boost.

After a thorough review of the existing code structure, we found that the multithreading system still had massive optimization potential. So our recent focus has been a complete overhaul of Dyson Sphere Program's multithreading framework, paving the way for the vehicle system's future development.

(A performance snapshot from a 100k Matrix save. Logic frame time for the entire production line hits 80 ms.)

Multithreading in DSP

Let's briefly cover some multithreading basics: why DSP uses it, and why we're rebuilding the system.

Take the production cycle of an Assembler as an example. Ignoring logistics, its logic can be broken into three phases:

1. Power Demand Calculation: The Assembler's power needs vary based on whether it's lacking materials, blocked by output, or mid-production.
2. Grid Load Analysis: The power system sums the supply capability of all generators, compares it against total consumption, and determines the grid's power supply ratio.
3. Production Progress: Based on the grid's load and factors like resource availability and Proliferator coating, the production increment for that frame is calculated.

Individually, these calculations are trivial: each Assembler might only take a few hundred to a few thousand nanoseconds. But scale that up to the tens or hundreds of thousands of Assemblers in a late-game save, and suddenly the processor can be stuck processing them sequentially for milliseconds, tanking your frame rate. A simplified sketch of one such logic tick is shown below.
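To make those three phases concrete, here is a minimal single-threaded sketch of one logic tick. Everything in it, from the type names to the wattage numbers, is an illustrative stand-in rather than DSP's actual code.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical constants, not DSP's real values.
constexpr float kIdlePower         = 10.0f;        // demand while stalled
constexpr float kWorkPower         = 270.0f;       // demand while crafting
constexpr float kBaseSpeedPerFrame = 1.0f / 60.0f; // progress per tick

struct Assembler {
    bool  lackingMaterials = false;
    bool  outputBlocked    = false;
    float powerDemand      = 0.0f;  // watts requested this frame
    float progress         = 0.0f;  // 0..1 crafting progress
};

struct PowerGrid {
    float supplyCapacity = 0.0f;    // summed from generators elsewhere
    float supplyRatio    = 1.0f;    // supply / demand, clamped to 1
};

void LogicTick(std::vector<Assembler>& assemblers, PowerGrid& grid) {
    // Phase 1: every Assembler computes its own power demand.
    float totalDemand = 0.0f;
    for (auto& a : assemblers) {
        bool working  = !a.lackingMaterials && !a.outputBlocked;
        a.powerDemand = working ? kWorkPower : kIdlePower;
        totalDemand  += a.powerDemand;
    }

    // Phase 2: the grid compares total supply against total demand.
    grid.supplyRatio = (totalDemand > 0.0f)
        ? std::min(1.0f, grid.supplyCapacity / totalDemand)
        : 1.0f;

    // Phase 3: production advances, scaled by the grid's supply ratio.
    for (auto& a : assemblers) {
        if (!a.lackingMaterials && !a.outputBlocked)
            a.progress += kBaseSpeedPerFrame * grid.supplyRatio;
    }
}
```

Run sequentially over hundreds of thousands of Assemblers, loops like these are exactly where the milliseconds go.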
(This sea of Assemblers runs smoothly thanks to relentless optimization.)

Luckily, most modern CPUs have multiple cores, allowing them to perform calculations in parallel. If your CPU has eight cores and you split the workload evenly, each core does less work, reducing the overall time needed. But here's the catch: not every Assembler takes the same amount of time to process. Differences in core performance, background tasks, and OS scheduling mean threads rarely finish together; you're always waiting on the slowest one. So even with 8 cores, you won't get an 8x speedup.

So, next stop: wizard mode. Okay, jokes aside. Let's get real about multithreading's challenges. When multiple CPU cores work in parallel, you inevitably run into issues like memory constraints, shared data access, false sharing, and context switching. For instance, when multiple threads need to read or modify the same data, a communication mechanism must be introduced to ensure data integrity. This mechanism not only adds overhead but also forces one thread to wait for another to finish.

There are also timing dependencies to deal with. Let's go back to the three-stage Assembler example. Before Stage 2 (grid load calculation) can run, all Assemblers must have completed Stage 1 (power demand update); otherwise, the grid could be working with outdated data from the previous frame.

To address this, DSP's multithreading system breaks each game frame's logic into multiple stages, separating out the heavy workloads. We then identify which stages are order-independent. For example, when an Assembler calculates its own power demand for the current frame, the result doesn't depend on the power demand of any other building. That means we can safely run these calculations in parallel across multiple threads. The two sketches below illustrate the idea: first splitting an order-independent stage across threads, then synchronizing before a dependent stage begins.
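Here is what splitting Phase 1 across worker threads can look like, reusing the types from the earlier sketch. Spawning std::thread objects every stage is purely illustrative; a production framework would keep a persistent worker pool. The shared totalDemand sum is deliberately left out: accumulating into shared data from many threads invites exactly the contention and false sharing mentioned above, so it would typically be handled with per-thread partial sums.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Phase 1 for one contiguous chunk of the Assembler array. No Assembler
// reads another's data, so chunks can run concurrently without locks.
void UpdatePowerDemandRange(std::vector<Assembler>& assemblers,
                            size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) {
        auto& a = assemblers[i];
        bool working  = !a.lackingMaterials && !a.outputBlocked;
        a.powerDemand = working ? kWorkPower : kIdlePower;
    }
}

void ParallelPowerDemand(std::vector<Assembler>& assemblers,
                         unsigned workerCount) {
    std::vector<std::thread> workers;
    size_t chunk = (assemblers.size() + workerCount - 1) / workerCount;
    for (unsigned t = 0; t < workerCount; ++t) {
        size_t begin = t * chunk;
        size_t end   = std::min(assemblers.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back(UpdatePowerDemandRange, std::ref(assemblers),
                             begin, end);
    }
    // The stage only finishes when the slowest worker does.
    for (auto& w : workers) w.join();
}
```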
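And here is one generic way to enforce the ordering constraint between dependent stages in a persistent worker pool: a generation-counting barrier built on atomics. Again, this is a common pattern sketched under our own naming, not DSP's actual synchronization code.

```cpp
#include <atomic>

class StageBarrier {
public:
    explicit StageBarrier(int workerCount)
        : workerCount_(workerCount), remaining_(workerCount) {}

    // Each worker calls this when it finishes the current stage.
    void ArriveAndWait() {
        int gen = generation_.load(std::memory_order_acquire);
        if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last arrival: reset the count and release all waiters.
            remaining_.store(workerCount_, std::memory_order_relaxed);
            generation_.fetch_add(1, std::memory_order_release);
        } else {
            // Spin until the generation advances. Busy-waiting burns a
            // little CPU but wakes up far faster than a blocking lock;
            // a real implementation might also pause or yield here.
            while (generation_.load(std::memory_order_acquire) == gen) {}
        }
    }

private:
    const int        workerCount_;
    std::atomic<int> remaining_;
    std::atomic<int> generation_{0};
};

// Worker loop sketch:
//   UpdatePowerDemandRange(assemblers, begin, end); // Stage 1 chunk
//   barrier.ArriveAndWait(); // grid load may now read every demand value
//   /* ... Stage 2, Stage 3 ... */
```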
What Went Wrong with the Old System

Our old multithreading system was, frankly, showing its age. Its execution efficiency was mediocre at best, and its design made it difficult to schedule a variety of multithreaded tasks. Every multithreaded stage came with a heavy synchronization cost, and as the game added more complex content, the logic workload per frame steadily grew. Converting any single logic block to multithreaded processing often brought only marginal performance gains while greatly increasing code maintenance difficulty.

To better understand which parts of the logic were eating up CPU time, and exactly where the old system was falling short, we built a custom performance profiler. Below is an example taken from the old framework:

(Thread performance breakdown in the old system)

In this chart, each row represents a thread, and the X-axis shows time. Different logic tasks or entities are shown in different colors. The white bars show the runtime of each sorter logic block in its assigned thread. The red bar above them represents the total time spent on sorter tasks in that frame: around 3.6 ms. Meanwhile, the entire logic frame took about 22 ms.

(The red box marks the total time from sorter start to sorter completion.)

Zooming in, we can spot some clear issues. Most noticeably, threads don't start or end their work at the same time; execution is staggered and uncoordinated.

(Here, threads 1, 2, and 5 finish first; only then do threads 3, 4, and 6 begin their work.)

There are many possible reasons for this behavior. Sometimes the system needs to run other programs, and some of those processes may be high-priority, consuming CPU resources and preventing the game's logic from fully utilizing all available cores. Or a particular thread may be running a long, time-consuming segment of logic. In such cases, the operating system might detect a low number of active threads and, seeing that some cores are idle, shut a few down for power-saving reasons, further reducing multithreading efficiency.

In short, OS-level automatic scheduling of threads and cores is a black box, and it often leaves available cores unused. The issue isn't as simple as "16 cores being used as 15, so performance drops by 1/16." In reality, if even one thread falls behind for reasons like those above, every other thread has to wait for it to finish, dragging down overall performance.

Take the chart below, for example. The actual CPU task execution time (shown in white) may account for less than two-thirds of the total available processing window.

(The yellow areas highlight significant zones of CPU underutilization.)

Even when scheduling isn't the issue, the chart clearly shows that different threads take vastly different amounts of time to complete the same type of task. In fact, even when none of the threads started late, the fastest thread might still finish in half the time of the slowest one.

Now look at the transition between processing stages. There's a visible gap between the end of one stage and the start of the next. This happens because the old system simply used blocking locks to coordinate stage transitions. These locks can introduce as much as 50 microseconds of overhead, which is quite significant at this level of performance optimization.

The New Multithreading System Has Arrived!

To maximize CPU utilization, we scrapped the old framework and built a new multithreading system and logic pipeline from scratch. In the brand-new system, every core is pushed to its full potential. Here's a performance snapshot from the new system as of the time of writing:

The white sorter bars are now tightly packed, and their start and end times are nearly identical. Beautiful! On the same save, the sorter time cost dropped to ~2.4 ms, and total logic time fell from 22 ms to 11.7 ms: an 88% improvement in logical frame efficiency (logic frames only, not overall frame time). That's better than upgrading from a 14400F to a 14900K CPU!

Here's a breakdown of why performance improved so dramatically:

1. Custom Core Binding: In the old multithreading framework, threads weren't bound to specific CPU cores; the OS assigned cores through opaque scheduling mechanisms, often leading to inefficient core utilization. Now players can manually bind threads to specific cores, preventing these "unexpected operations" by the system scheduler. A sketch of what such binding can look like follows below.

(Zoomed-in comparison shows new framework
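As an illustration, here is a minimal sketch of binding worker threads to specific logical cores on Windows using SetThreadAffinityMask. The simple worker-to-core mapping is an assumption made for the example, not DSP's actual assignment scheme.

```cpp
#include <windows.h>
#include <thread>
#include <vector>

// Restrict the calling thread to a single logical core. Each set bit in
// the affinity mask marks a core the thread is allowed to run on.
void PinCurrentThreadToCore(unsigned core) {
    DWORD_PTR mask = DWORD_PTR(1) << core;
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

void StartPinnedWorkers(unsigned workerCount) {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i) {
        workers.emplace_back([i] {
            PinCurrentThreadToCore(i); // worker i stays on logical core i
            // ... run this worker's share of each logic stage ...
        });
    }
    for (auto& w : workers) w.join();
}
```

Pinning prevents the OS from migrating a worker mid-frame, which helps keep caches warm and makes per-thread timings far more predictable.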