Agent-Zero wrote on Mar 12, 2017, 00:39:
seems to imply that a single Ryzen processor appears to act like a dual-CPU system, is that correct? That's bizarre, because I figured that multi-core CPUs were already acting like that to some degree, but I suppose it must be different in terms of how the system buses work with that
Having now read the article in full, the CCX and Infinity Fabric architecture do indeed make a Ryzen chip behave as though it were two 4-core CPUs on a single die. While that may sound like a distinction without a difference, it's actually quite important: the benchmarks clearly show substantially more overhead for a context switch across the Infinity Fabric than for a context switch within a CCX.
As you probably already know, a process is a program that has been loaded into memory and is in a runnable state. More importantly, a process has its own virtual address space, i.e. each memory address the program references is not a location in physical memory but rather a mapping to the actual physical address. The OS also maintains, for each process, data structures recording its state, i.e. the contents of the various CPU registers at a given moment of execution, among other things, allowing the process to be stopped and restarted with the process itself none the wiser.

However, if a process needs to communicate with another process, the virtualized address spaces mean the two are effectively partitioned from one another, and while there are several different ways to handle inter-process communication, most of them involve one or more processes making system calls and blocking (waiting) on them, which slows down processing dramatically. It would be far faster if you could just share a virtual address space, and that's what threads do: they are mini-processes that maintain separate execution states but share their process's address space, allowing rapid in-memory communication without high-overhead I/O calls, and making a switch between threads of the same process far cheaper than a full process context switch (there's no address space to swap out). The downside of threads is that the shared address space makes it extremely easy to write code whose results depend on the order in which the CPU schedules the threads (a race condition), which leads to all sorts of subtle and hard-to-detect bugs.
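To make the shared-address-space point concrete, here's a minimal sketch (mine, not from the article): two threads communicate simply by touching the same `counter` object in memory. The read-modify-write on that counter is not atomic, which is exactly the race condition described above; the lock is what makes the result deterministic.

```python
# Toy illustration: two threads sharing one address space. Both threads see
# the same `counter`, so they communicate through memory directly -- but
# the read-modify-write in the loop is not atomic, so without the lock,
# updates can be lost (the race condition described above).
import threading

ITERATIONS = 100_000
counter = 0
lock = threading.Lock()

def safe_increment():
    global counter
    for _ in range(ITERATIONS):
        with lock:            # serialize the read-modify-write
            counter += 1

threads = [threading.Thread(target=safe_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock the result is deterministic; remove it and the final count
# may come up short on some runs.
print(counter)  # 200000
```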
What this means for our discussion is that it's possible to have parallel execution, or at least a facsimile of it, even on a single-core CPU. The OS lets one process/thread run either until it uses up a fixed time slice (quantum) or blocks on I/O, and then the OS saves the state of the process and swaps in another one to let it run. Since CPUs execute literally billions of instructions per second, this all happens so fast that to us slow humans, it appears like the computer is doing multiple things at once. Add more cores, and you really can run more processes/threads at once. With a Hyper-Threaded CPU (Intel's term; the general technique is simultaneous multithreading, or SMT), each physical core holds the register state for two threads in hardware and feeds both into its shared execution units, so moving between those two hardware contexts costs essentially nothing. The CPU therefore presents itself to the OS as having x physical cores, each composed of y (usually two) virtual cores.
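The quantum-and-swap dance above can be sketched as a toy round-robin scheduler (my sketch, nothing like how a real kernel is written): each "process" is a generator whose paused frame plays the role of the saved register state; the scheduler runs one for a fixed quantum of steps, saves it, and swaps in the next, producing the illusion of parallelism on one core.

```python
# Toy round-robin scheduler: run each task for QUANTUM units of work,
# then "save its context" (the generator's paused frame) and requeue it.
from collections import deque

def make_task(name, steps):
    """A 'process' that performs `steps` units of work, yielding after each."""
    for i in range(steps):
        yield f"{name}:{i}"

QUANTUM = 2  # units of work per time slice
ready_queue = deque([make_task("A", 3), make_task("B", 3)])
trace = []

while ready_queue:
    task = ready_queue.popleft()       # "restore" the saved context
    finished = False
    for _ in range(QUANTUM):
        try:
            trace.append(next(task))   # run one unit of work
        except StopIteration:
            finished = True            # task is done; don't requeue it
            break
    if not finished:
        ready_queue.append(task)       # quantum used up: save and requeue

# Execution interleaves: A:0 A:1 | B:0 B:1 | A:2 | B:2
print(trace)
```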
The hypothesis was that Ryzen wasn't correctly reporting to Windows which of its virtual cores belong to the same physical core, and that the NT kernel therefore wasn't scheduling threads to the right virtual cores to leverage AMD's equivalent of Hyper-Threading. But the benchmarks show that isn't the case: Windows understands which virtual cores belong to which physical core; it's just that the Ryzen architecture has a heavy context switch penalty across the two CCXs. A patch to Windows will allow it to optimize for avoiding context switches across the CCXs, but that still won't change the fact that any context switch across CCXs will be much slower than a context switch within a CCX.
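What such a scheduler optimization might look like, as a toy model (the topology mapping and the heuristic here are entirely invented for illustration; real CPU enumeration works differently): given which CCX each virtual core sits on, a CCX-aware scheduler prefers to keep a migrating thread inside the CCX it last ran on, crossing the Infinity Fabric only when it must.

```python
# Invented-for-illustration topology: 16 virtual cores, the first 8 on
# CCX 0, the rest on CCX 1 (real enumeration differs; this is a toy model).
CCX_OF = {vcore: 0 if vcore < 8 else 1 for vcore in range(16)}

def pick_core(last_core, idle_cores):
    """Prefer an idle virtual core on the CCX the thread last ran on;
    only cross the Infinity Fabric when the home CCX has nothing free."""
    home = CCX_OF[last_core]
    same_ccx = [c for c in idle_cores if CCX_OF[c] == home]
    return min(same_ccx) if same_ccx else min(idle_cores)

# Thread last ran on vcore 2 (CCX 0): stay on CCX 0 whenever possible.
print(pick_core(2, idle_cores={5, 9, 12}))  # 5  (same CCX)
print(pick_core(2, idle_cores={9, 12}))     # 9  (forced cross-CCX)
```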
Why does this happen? Without knowing the specifics of the Ryzen architecture, it has to come down to caching. Memory varies in its speed, its volatility, and its price. Accordingly, on the CPU die itself you have several levels of expensive-to-manufacture but extremely fast memory that loses its state on power down (volatile); then you have main memory, which isn't as fast and still loses its state on power down, but is much cheaper than on-die cache; and finally you have persistent storage in the form of SSDs and HDDs, which are very slow to access relative to the other tiers, but make up for it by being very inexpensive.
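Why a small, fast cache pays off at all can be shown with a toy cache model (mine, not Ryzen's actual replacement policy): a hot loop over a tiny working set hits the cache almost every time, while a streaming pass over a large range misses constantly and has to go out to the slower tier.

```python
# Toy LRU cache model: repeatedly touched addresses stay resident in a
# small, fast cache; a streaming pass over a large range misses every time.
from collections import OrderedDict

class ToyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)           # LRU: mark recently used
        else:
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)     # evict least recently used

hot = ToyCache(capacity=4)
for _ in range(100):          # hot loop over a tiny working set: fits in cache
    for addr in range(4):
        hot.access(addr)

cold = ToyCache(capacity=4)
for addr in range(400):       # streaming pass: working set dwarfs the cache
    cold.access(addr)

print(hot.hits, hot.misses)    # 396 4   -- almost everything hits
print(cold.hits, cold.misses)  # 0 400   -- everything misses
```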
The trick is to keep the code and data most likely to be needed, and needed repeatedly, in the fastest levels of cache, so as to optimize the CPU's instruction retrieval speed (it can't execute what's still in transit over the various buses). The cache hardware does this automatically by retaining recently used lines, and the OS benefits directly: beyond frequently executed procedures and frequently retrieved data, the kernel's process bookkeeping structures are touched so often that they tend to stay resident in some level of on-die cache, which minimizes the time necessary to context switch between processes. Again, without knowing the specifics of the Ryzen architecture, if the two CCXs maintain separate, likely L3, caches, then any context switch across CCXs loses the warm-cache optimization that the shared cache within a CCX provides.
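A back-of-the-envelope cost model makes the consequence concrete (the nanosecond figures are invented; only the large intra-vs-cross gap is the point): charge a small cost for a context switch that stays within a CCX and a much larger one for a switch that crosses the Infinity Fabric, then compare a CCX-blind migration pattern with a CCX-aware one.

```python
# Hypothetical switch costs -- the numbers are invented, only the gap matters.
INTRA_CCX_NS = 30    # switch stays within a CCX (shared L3 stays warm)
CROSS_CCX_NS = 140   # switch crosses the Infinity Fabric

# Toy topology: 8 physical cores, 4 per CCX.
CCX_OF = {core: 0 if core < 4 else 1 for core in range(8)}

def schedule_cost(placements):
    """Total modeled switch cost for a thread migrated through `placements`."""
    cost = 0
    for prev, nxt in zip(placements, placements[1:]):
        cost += INTRA_CCX_NS if CCX_OF[prev] == CCX_OF[nxt] else CROSS_CCX_NS
    return cost

blind = [0, 5, 1, 6, 2, 7]   # ping-pongs across CCXs on every switch
aware = [0, 1, 2, 3, 0, 1]   # keeps the thread inside CCX 0

print(schedule_cost(blind))  # 700
print(schedule_cost(aware))  # 150
```

Same amount of work, same number of switches; only the placement differs, which is exactly the kind of improvement a scheduler patch can deliver.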
So where does this leave us? A patch to Windows can make the NT kernel's scheduling algorithm aware of Ryzen's context switch penalty across CCXs, and allow it to be smarter, but the penalty will still be baked into the Ryzen architecture no matter what Microsoft, or any other OS developer, does.
tl;dr Operating systems benefit from shared L3 caching to minimize the overhead of swapping processes and threads across the cores of a multicore CPU. Since Ryzen partitions its cores into two CCXs with separate caches, no such benefit is available for context switches between CCXs. Windows can be patched to minimize cross-CCX context switches, but when such a switch is unavoidable, there's nothing Windows can do but incur the extra overhead baked into Ryzen's architecture.