# Final Report **Yuyang Guo** (yuyangg) and **Isaac Lim** (idl) *** ## Links [Final Presentation Slides](files/slides.pdf) [Github Repo](https://github.com/isaaclimdc/cachemulator) *** ## Summary Cachemulator is a coherent multi-processor cache emulator that uses a **snooping**-based cache coherent protocol. We have 2 goals: - Compare the performance of the MSI / MESI / MESIF protocols - Provide cache usage information including bus traffic, hits/misses/evictions, total *ticks* required, highly contended cache lines, false sharing, etc. At the parallelism competition, we will present the charts and statistics output by our system, running on both a simple example program (that demonstrates false sharing), and a larger, more *real-world* example. All these will be presented as run on the GHC 3000 machines. *** ## Background ### Cache coherence In the Introduction to Computer Systems class, we were required to write a cache simulator in *CacheLab*. This simulator was very basic, and did not account for multi-core parallelism. Cachemulator is a multi-processor version of this, with the goal of exhibiting **simulated** empirical evidence of the performance of various coherence protocols. Cache coherence refers to the consistency of data in the same address, across each processor's local cache. When a single address is shared by processors and processors *A* and *B* are reading from it, if processor *C* wants to write a new value into that address, the local copies in *A* and *B*'s caches need to be updated. There are 3 protocols grouped under the term *snooping*, namely MSI, MESI and MESIF; all these protocols ensure this invariant. **Snooping** protocols require any processor that wants to interact with an address to first broadcast *to all other processors*, the fact that it needs to gain exclusive access to that address. More specifically, MSI, MESI and MESIF are slightly different ways to implement this snooping protocol, in terms of the states that a cache line can be in. MESIF has advantages over MESI, and MESI over MSI; this is what we seek to investigate with Cachemulator. ### Inputs and outputs Cachemulator takes as input a user's executable file and some secondary configuration information, and processes these into statistics and metrics on the executable's cache performance. A feasible output here is to programmatically plot charts using these metrics. ### Workload While the programs taken in are (supposedly) parallel in their implementation, (either using pthreads, OpenMP, Open MPI, etc.), our emulator is inherently sequential in its execution. It has no inherent dependendies in its execution except for the need to sequentially iterate through the input memory trace. Because the memory trace is assumedly parallel, our simulation must also simulate parallelism in its memory accesses. *** ## Approach ### The Main System As can be seen below, our system consists of many C++ objects encapsulated in each other. Each class name is prefixed by "CM" (for *C*ache*m*ulator) for better namespacing. The largest entity is `CMComp`, which basically simulates the entire *computer*, from the perspective of the cache. The computer then has as many `CMProc` member instances as the largest thread ID reported in the memory trace fed in. Each processor has a fixed number of `CMSet`s, which in turn have a fixed number of `CMLine`s, reflecting as closely as possible the configuration of real-world caches. At this lowest level, each cache line keeps track of its own `tag`, `blockOffset`, and `dirty` bit, etc. We don't actually store data in the cache because we realized this was not necessary to ensure correctness and help with the analysis we intended to perform. ![Layout](files/cachemulator_layout.png "Layout") ### Helper Classes - `CMAddr`: Encapsulates each line of the input memory trace, computing the address' `tag`, `blockOffset`, etc. - `CMConfig`: Contains all the simulation constants, such as the times for cache hit delays (4 cycles), memory access delays (100 cycles), etc. - `CMJob`: We manage each request made by the `CMProc`s and granted by the `CMBusCtrlr` using a `CMJob`. This encapsulates things such as the number of *ticks* remaining until this job is no longer delayed and can be attended to. This class primarily helps to maintain the scheduling of requests made across the shared **atomic** bus that we have. - `CMSharing`: Keeps track of sharing data, accumulating data as requests are made over the bus. This object allows us to differentiate **false sharing** over true sharing, and to pinpoint what the **highly-contended** lines in the execution are. ### Test Harness We wrote a python script that helped us keep track of and maintain correctness of our cache as we built it. The script runs the cache on a range of *handcrafted* memory traces specifically engineered to exploit noncoherent caches, so that if something was not functioning correctly, we would very easily be able to detect that our battery of tests fail. This test verifies metrics such as: - The expected output sequence of hits/misses/evictions, such as `<HIT, HIT, MISS, MISS, EVICT, HIT>` - The expected bus traffic sequence, such as `<BusRd, BusRd, BusRdX, BusUpg, BusWr>` - The states that we expect each address' cache line to be in, such as `<SHARED, MODIFIED, MODIFIED, EXCLUSIVE, SHARED>` ### User Workflow In this section we discuss what we envisioned (and thus built) Cachemulator to look and perform like from a coder's perspective. Between the input and the output steps as described above, here is what we do to make it all work: 1. Given the user's executable, we use [Pin](https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool), Intel's dynamic binary instrumentation tool, to generate a **multicore** memory access trace. - The output memory trace has format just like the diagram above, and denotes whether that instruction is a read or a write, followed by the address accessed, then the thread ID of the processor responsible for it. ``` R 0x7fff9012 0 R 0x7fff9014 1 R 0x7fff9018 2 R 0x7fff9020 0 W 0x7fff9036 0 ``` - We wrote our own **pintool** called `pinatrace.cpp` in order to output this information. See the "Optimizations" section for more details about the challenges we faced here. 2. We then take the memory trace generated and step through it using our emulator. With each memory access, the cache is accessed accordingly, and depending on a range of factors (presence in the cache, its current coherence state, whether it is dirty or not, whether other processors have it in Modified state, etc.), we output the resulting state and make the necessary modifications. It is in this stage that we maintain cache coherence through the simulated atomic bus. 3. The emulator in the previous step outputs the final metrics and statistics having run the *entire* memory trace, such as number of hits/misses/evictions, bus traffic, instances of false sharing and highly-contended lines, etc. We then take these numbers and plot them on a wide range of charts using a Python script. The hope is that with these charts, the coder is able to visually observe their program's **cache health**, such as whether it causes a lot of invalidations, and so on. ### Technologies Used We wrote Cachemulator in vanilla C++, using just the standard library data structures and libraries, with the exception of our standalone pintool, which was written using Intel's `PIN` and `RTN` (routine) APIs in order to extract the memory trace. No machine in particular was targeted, but we mainly worked on the GHC 3000 machines for convenience. This code is fairly portable and can function on any machine that can run Pin. Note that as previously mentioned, because we have an inherently sequential simulator and are just *simulating* parallel code, the question of mapping the problem to our target parallel machine does not apply. ### Development Process We began by spending a significant amount of time designing the object-oriented system, in order to make it **as modular as possible** from the ground up, from the very beginning. We feel that we achieved this thoroughly well, as can be seen from the diagram above, by how suitably divided each entity in the emulator is. After the planning stage, we basically proceeded to implement Cachemulator layer by layer by layer, incrementally building up its complexity and capability. We were very careful to continue testing using our test harness at each stage of development, so as to ensure that nothing broke as a result of the current changes (which were usually fairly destructive and far-reaching). An example of the stage-by-stage changes we made is that only after we had a working MSI implementation did we extend it to support the MESI protocol, then MESIF after that. The challenge with maintaining correctness after each stage is that each stage usually overwrite nearly every single file in our codebase, and so bugs remnant from the previous stage would just accumulate and manifest only when we least expect them to. In addition, as previously mentioned, correctness tests (traces) had to be manually written, in order to expose an error in the memory access behavior if coherence was not ensured. It is worth noting here that we originally intended to start out with one of our 15-213 Cachelab code, but soon realized that though on the surface, this sounds like simply an extension of it, the code would be substantially different. So we decided not to do so, and just start from scratch; looking back, that turned out to be an *excellent* decision. ### Optimizations, Iterations, and Failed Attemps Here are some points about the many attempts that we made during the development process of this project. Some are failed attempts, some improved iterations over previous attempts, and some merely optimizations and tweaks to working code. #### Generating memory traces We wrote our own **pintool** called `pinatrace.cpp` in order to output the necessary information in each memory trace. We initially just output every single memory access in the given executable, but soon realized that this was a terrible idea. Through some experimentation, even for the trivial program: ``` #include <stdlib.h> int main() { return 0; } ``` the trace generated was roughly 50,000 lines long, and caused our system to continuously segfault. A real-life program like BFS generates a trace file 400MB large, which is even more infeasible to be passed into our emulator. So what we had to do is to make `pinatrace` more intelligently identify only the **important user functions** in the given executable, ignoring extraneous `libc` calls, etc., which turned out to be the majority of the large file. We also took this opportunity to ignore accesses to stack variables, and leave only heap accesses in the resulting trace file, since the former isn't particularly interesting to look at and won't contribute to the analysis. As a result of this tweaking (which took a significant amount of time because we had to familiarize ourselves with Pin's massive and slightly eccentric API), we are able to extract **very finely-tuned** memory traces from the executable using `pin`. #### Parsing trace files Some memory traces we got back from Pin were still very large, especially for more complicated programs such as BFS. What we initially tried to do was to preprocess the input trace file into a `CMTest` object that simply held a queue of all the requests in that file, and distribute them accordingly to the 'owning' processors. However, this caused these vectors to grow too large, and prevented the emulator from running to completion. We fixed this by first splitting the input trace file into `P` indivdual temporary trace files, where `P` is the number of processors, and file `p` had only the addresses accessed by processor `p`. Then on each tick, which is a single processor cycle, each `CMProc` would just read from its local queue of requests, keeping this queue small. If the queue size dropped below some threshold, then we would read the next, say 100, lines from its trace file. While a lot more complicated to implement, this **streaming** technique very effectively solved the problem. #### Bus race conditions with `CMJob` Although Cachemulator is an inherently sequential system, the simulation of a **shared atomic bus** gave rise to many *pseudo*-race conditions. For example, while processor 1 simulates stalling on a memory fetch (which we modelled to take 100 cycles), its current `CMJob` would simply have a counter that started at 100 and decremented on each processor `tick()`, during which it just does nothing. If during this period, processor 2 swoops in and modifies the address that processor 1 was originally requesting for, processor 1 would need to make the necessary modifications to its current job in order to avoid reading erroneous data when its stall is complete. This is an example of such *pseudo*-race conditions that we had to deal with while building Cachemulator. #### Attempt to store data in `CMLine` As previously mentioned, we needed a way to check that our emulator was actually preserving coherence after each memory access. The challenge here was to figure out how exactly this can be ensured, and the first idea that came to mind was to augment each cache line (`CMLine`) with a data field (this is unlike the 213 Cachelab simulator that didn't store any data). This would enable us to see if, say upon each read after another processor's write, the data read is coherent with the other write. However, we ran into a handful of complications with this implementation. The first issue here was that we would then need Pin to provide not only the address it was accessing, but also the data at that address, and include this data in the memory trace. This appeared to be a very hard thing to do, and we didn't make progress here. Secondly, we realized that in order to *truly* maintain coherence with the inclusion of a data field, we would need to perform actual flushes to memory, instead of just *simulating* a 100 cycle stall. We realized that this and other instances would severely complicate our code, and so we decided not to do this. Instead, we resolved to just comparing the output shout sequence to the reference that we handcrafted. #### Compiling optimizations A minor detail is that we use `-O0` with `g++` in our `Makefile` to make sure that we get the exact memory trace as represented in the source code, and that these aren't *cleverly* optimized out by `g++`. #### Wrapper shell script To conclude our project, we wrote a script `cachemulator.sh` that performs the 3-step process previously mentioned, to take a user from inputting his/her executable file, to outputting charts comparing all the different metrics supported by Cachemulator. *** ## Results ### False sharing To motivate this problem, [here is a simple example](falsesharing.html) of **false sharing** in action. We ran both the versions with and without false sharing through our system and obtained the following graphs, plotting the two versions with the count of each different metric. These tests were run on `ghc35.ghc.andrew.cmu.edu`, which has 6 cores and Hyper-Threading support. ![Misses](files/fs_misses.png "Misses") ![Busshouts](files/fs_busshouts.png "Bus shouts") ![Invalidations](files/fs_invalidations.png "Invalidations") ![Contention](files/fs_contention.png "Contention") ![Cycles](files/fs_cycles.png "Cycles") Cachemulator provides very clear-cut results here of the effects that false sharing has on all these metrics. We consider this to be a **great success** for out emulator, because to an inexperienced coder, [this code](falsesharing.html) looks perfectly correct and maximally efficient, since no *true sharing* is taking place. Our emulator thus successfully exhibits the very nasty effects of false sharing, and in this situation, would warn the user: *"Hey, your code has terrible cache performance! Do something about it!"* However, something here that we are not able to explain, or perhaps is a bug in our system, is the fact that for most of these charts, the MESIF protocol seems to have the highest count across the board for the false sharing implementation. MESIF is known to be useful for its reduction in bus traffic for example, but the "Bus Shouts" chart clearly shows MESIF with the highest traffic count. A topic of further study is why this is so. Is this an accurate representation of the MESIF protocol and how it performs under false sharing? ### OpenMP BFS Although the previous code is a highly useful pedagogical tool to illustrate what false sharing is and the very clearly negative effects it has on unsuspecting code, let's take a look at a more complicated, **real-world** example: our **top-down** BFS OpenMP code from [Assignment 3](http://15418.courses.cs.cmu.edu/spring2014/article/6). This executable was run on a graph called `tiny.graph`, which does not guarantee spatial locality, also on `ghc35.ghc.andrew.cmu.edu`. First of all, it is cool to be able to see even OpenMP code working with our emulator, thanks to Pin's *dynamic instrumentation*, instead of just the usual *pthreads*. #### Performance It is interesting to see in the figure below, the increasing pattern in the vertical bars for each metric. ![Perf](files/bfs_perf.png "Perf") Observe in the figure below that with the MESIF protocol, the total number of tick counts required to run the trace to completion is substantially lower than with MSI and MESI. The question of why this is so was incidentally answered by our own emulator with the figure below it. One of the things we learned in the process of implementing Cachemulator is exactly how MESIF works. It takes advantage of cache-to-cache transfers to save unnecessary BusRdX shouts in order to upgrade a line to Modified state if all other sharers are in the Shared state. As modelled by our system, cache-to-cache transfer delays are only about 30 cycles, compared to 100 cycles for memory stalls (we couldn't find exact data for the c2c delays, so we made an estimate). Thus, this results in an overall decrease in execution time, just from using the MESIF protocol. ![Ticks](files/bfs_ticks.png "Ticks") ![C2C](files/bfs_c2c.png "C2C") #### Bus traffic The following charts plot what percentage of overall bus traffic is due to BusRdX versus BusRd. The horizontal axis is the number of ticks, grouped by 100s in order to laterally compress the graph. The vertical axis is a percentage of total bus traffic. As can be seen, the bus is always busy, and there is an inverse relationship between the BusRdX and BusRd, since we have an atomic bus and so no two requests can be serviced at the same time. ![MSI](files/bfs_MSI.png "MSI") ![MESI](files/bfs_MESI.png "MESI") ![MESIF](files/bfs_MESIF.png "MESIF") ## Concluding Remarks Cache performance is a huge factor of any program's performance, and yet it is one of the most difficult features to reason about just by looking at the program's sourcecode, especially as the code gets more complex. Cachemulator aims to combat this issue plaguing programmers everywhere, by giving them an accessible way to evaluate their program's cache performance. The emulator does away with complicated processes and simply reports the most important few metrics. An area of improvement that can be made to Cachemulator is apart from simply reporting that *there are* lines that are highly contended, or that there are *many counts* of invalidations, it would be **awesome** if it could report *which* lines are highly contended, and *which* lines are being invalidated the most. This would undoubtedly be a programmer's strong new weapon! If you are interested in reading the sourcecode of Cachemulator, or are interested in contributing to the project, please visit [its repo on Github](https://github.com/isaaclimdc/cachemulator)!