# Final Report
**Yuyang Guo** (yuyangg) and **Isaac Lim** (idl)
***
## Links
[Final Presentation Slides](files/slides.pdf)
[Github Repo](https://github.com/isaaclimdc/cachemulator)
***
## Summary
Cachemulator is a coherent multi-processor cache emulator that uses a
**snooping**-based cache coherent protocol. We have 2 goals:
- Compare the performance of the MSI / MESI / MESIF protocols
- Provide cache usage information including bus traffic, hits/misses/evictions,
total *ticks* required, highly contended cache lines, false sharing, etc.
At the parallelism competition, we will present the charts and statistics
output by our system, running on both a simple example program (that demonstrates
false sharing), and a larger, more *real-world* example. All these will be
presented as run on the GHC 3000 machines.
***
## Background
### Cache coherence
In the Introduction to Computer Systems class, we were required to write a
cache simulator in *CacheLab*. This simulator was very basic, and did not
account for multi-core parallelism. Cachemulator is a multi-processor version
of this, with the goal of exhibiting **simulated** empirical evidence of the
performance of various coherence protocols.
Cache coherence refers to the consistency of data in the same address, across
each processor's local cache. When a single address is shared by processors
and processors *A* and *B* are reading from it, if processor *C* wants to
write a new value into that address, the local copies in *A* and *B*'s caches
need to be updated. There are 3 protocols grouped under the term *snooping*,
namely MSI, MESI and MESIF; all these protocols ensure this invariant.
**Snooping** protocols require any processor that wants to interact with an
address to first broadcast *to all other processors*, the fact that it needs
to gain exclusive access to that address. More specifically, MSI, MESI and MESIF
are slightly different ways to implement this snooping protocol, in terms of the
states that a cache line can be in. MESIF has advantages over MESI, and MESI
over MSI; this is what we seek to investigate with Cachemulator.
### Inputs and outputs
Cachemulator takes as input a user's executable file and some secondary
configuration information, and processes these into statistics and metrics on
the executable's cache performance. A feasible output here is to programmatically
plot charts using these metrics.
### Workload
While the programs taken in are (supposedly) parallel in their implementation,
(either using pthreads, OpenMP, Open MPI, etc.), our emulator is inherently
sequential in its execution. It has no inherent dependendies in its execution
except for the need to sequentially iterate through the input memory trace.
Because the memory trace is assumedly parallel, our simulation must also
simulate parallelism in its memory accesses.
***
## Approach
### The Main System
As can be seen below, our system consists of many C++ objects encapsulated in
each other. Each class name is prefixed by "CM" (for *C*ache*m*ulator) for
better namespacing. The largest entity is `CMComp`, which basically simulates
the entire *computer*, from the perspective of the cache. The computer then has
as many `CMProc` member instances as the largest thread ID reported in the
memory trace fed in. Each processor has a fixed number of `CMSet`s, which in
turn have a fixed number of `CMLine`s, reflecting as closely as possible the
configuration of real-world caches. At this lowest level, each cache line keeps
track of its own `tag`, `blockOffset`, and `dirty` bit, etc. We don't actually
store data in the cache because we realized this was not necessary to ensure
correctness and help with the analysis we intended to perform.

### Helper Classes
- `CMAddr`: Encapsulates each line of the input memory trace, computing the
address' `tag`, `blockOffset`, etc.
- `CMConfig`: Contains all the simulation constants, such as the times for
cache hit delays (4 cycles), memory access delays (100 cycles), etc.
- `CMJob`: We manage each request made by the `CMProc`s and granted by the
`CMBusCtrlr` using a `CMJob`. This encapsulates things such as the number of
*ticks* remaining until this job is no longer delayed and can be attended to.
This class primarily helps to maintain the scheduling of requests made across
the shared **atomic** bus that we have.
- `CMSharing`: Keeps track of sharing data, accumulating data as requests are
made over the bus. This object allows us to differentiate **false sharing**
over true sharing, and to pinpoint what the **highly-contended** lines in
the execution are.
### Test Harness
We wrote a python script that helped us keep track of and maintain correctness
of our cache as we built it. The script runs the cache on a range of *handcrafted*
memory traces specifically engineered to exploit noncoherent caches, so that
if something was not functioning correctly, we would very easily be able to
detect that our battery of tests fail. This test verifies metrics such as:
- The expected output sequence of hits/misses/evictions, such as ``
- The expected bus traffic sequence, such as ``
- The states that we expect each address' cache line to be in, such as ``
### User Workflow
In this section we discuss what we envisioned (and thus built) Cachemulator to
look and perform like from a coder's perspective. Between the input and the
output steps as described above, here is what we do to make it all work:
1. Given the user's executable, we use [Pin](https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool), Intel's dynamic binary instrumentation tool, to generate a **multicore**
memory access trace.
- The output memory trace has format just like the diagram above, and denotes
whether that instruction is a read or a write, followed by the address accessed,
then the thread ID of the processor responsible for it.
```
R 0x7fff9012 0
R 0x7fff9014 1
R 0x7fff9018 2
R 0x7fff9020 0
W 0x7fff9036 0
```
- We wrote our own **pintool** called `pinatrace.cpp` in order to output this
information. See the "Optimizations" section for more details about the
challenges we faced here.
2. We then take the memory trace generated and step through it using our
emulator. With each memory access, the cache is accessed accordingly, and
depending on a range of factors (presence in the cache, its current coherence
state, whether it is dirty or not, whether other processors have it in Modified
state, etc.), we output the resulting state and make the necessary modifications.
It is in this stage that we maintain cache coherence through the simulated
atomic bus.
3. The emulator in the previous step outputs the final metrics and statistics
having run the *entire* memory trace, such as number of hits/misses/evictions,
bus traffic, instances of false sharing and highly-contended lines, etc. We then
take these numbers and plot them on a wide range of charts using a Python script.
The hope is that with these charts, the coder is able to visually observe their
program's **cache health**, such as whether it causes a lot of invalidations,
and so on.
### Technologies Used
We wrote Cachemulator in vanilla C++, using just the standard library data
structures and libraries, with the exception of our standalone pintool, which
was written using Intel's `PIN` and `RTN` (routine) APIs in order to extract
the memory trace.
No machine in particular was targeted, but we mainly worked on the GHC 3000
machines for convenience. This code is fairly portable and can function on any
machine that can run Pin. Note that as previously mentioned, because we have
an inherently sequential simulator and are just *simulating* parallel code,
the question of mapping the problem to our target parallel machine does not
apply.
### Development Process
We began by spending a significant amount of time designing the object-oriented
system, in order to make it **as modular as possible** from the ground up,
from the very beginning. We feel that we achieved this thoroughly well, as
can be seen from the diagram above, by how suitably divided each entity in
the emulator is.
After the planning stage, we basically proceeded to implement Cachemulator
layer by layer by layer, incrementally building up its complexity and
capability. We were very careful to continue testing using our test harness
at each stage of development, so as to ensure that nothing broke as a result
of the current changes (which were usually fairly destructive and far-reaching).
An example of the stage-by-stage changes we made is that only after we had a
working MSI implementation did we extend it to support the MESI protocol, then
MESIF after that.
The challenge with maintaining correctness after each stage is that each stage
usually overwrite nearly every single file in our codebase, and so bugs remnant
from the previous stage would just accumulate and manifest only when we least
expect them to. In addition, as previously mentioned, correctness tests (traces)
had to be manually written, in order to expose an error in the memory access
behavior if coherence was not ensured.
It is worth noting here that we originally intended to start out with one of
our 15-213 Cachelab code, but soon realized that though on the surface, this
sounds like simply an extension of it, the code would be substantially
different. So we decided not to do so, and just start from scratch; looking
back, that turned out to be an *excellent* decision.
### Optimizations, Iterations, and Failed Attemps
Here are some points about the many attempts that we made during the
development process of this project. Some are failed attempts, some improved
iterations over previous attempts, and some merely optimizations and tweaks to
working code.
#### Generating memory traces
We wrote our own **pintool** called `pinatrace.cpp` in order to output the
necessary information in each memory trace. We initially just output every
single memory access in the given executable, but soon realized that this was
a terrible idea. Through some experimentation, even for the trivial program:
```
#include
int main() {
return 0;
}
```
the trace generated was roughly 50,000 lines long, and caused our system to
continuously segfault. A real-life program like BFS generates a trace file
400MB large, which is even more infeasible to be passed into our emulator.
So what we had to do is to make `pinatrace` more intelligently identify only
the **important user functions** in the given executable, ignoring extraneous
`libc` calls, etc., which turned out to be the majority of the large file.
We also took this opportunity to ignore accesses to stack variables, and leave
only heap accesses in the resulting trace file, since the former isn't
particularly interesting to look at and won't contribute to the analysis.
As a result of this tweaking (which took a significant amount of time because
we had to familiarize ourselves with Pin's massive and slightly eccentric API),
we are able to extract **very finely-tuned** memory traces from the executable
using `pin`.
#### Parsing trace files
Some memory traces we got back from Pin were still very large, especially for
more complicated programs such as BFS. What we initially tried to do was to
preprocess the input trace file into a `CMTest` object that simply held a
queue of all the requests in that file, and distribute them accordingly to the
'owning' processors. However, this caused these vectors to grow too large, and
prevented the emulator from running to completion.
We fixed this by first splitting the input trace file into `P` indivdual
temporary trace files, where `P` is the number of processors, and file `p`
had only the addresses accessed by processor `p`. Then on each tick, which is
a single processor cycle, each `CMProc` would just read from its local queue
of requests, keeping this queue small. If the queue size dropped below some
threshold, then we would read the next, say 100, lines from its trace file.
While a lot more complicated to implement, this **streaming** technique very
effectively solved the problem.
#### Bus race conditions with `CMJob`
Although Cachemulator is an inherently sequential system, the simulation of
a **shared atomic bus** gave rise to many *pseudo*-race conditions. For example,
while processor 1 simulates stalling on a memory fetch (which we modelled to
take 100 cycles), its current `CMJob` would simply have a counter that started
at 100 and decremented on each processor `tick()`, during which it just does
nothing. If during this period, processor 2 swoops in and modifies the address
that processor 1 was originally requesting for, processor 1 would need to make
the necessary modifications to its current job in order to avoid reading
erroneous data when its stall is complete. This is an example of such
*pseudo*-race conditions that we had to deal with while building Cachemulator.
#### Attempt to store data in `CMLine`
As previously mentioned, we needed a way to check that our emulator was actually
preserving coherence after each memory access. The challenge here was to figure
out how exactly this can be ensured, and the first idea that came to mind was
to augment each cache line (`CMLine`) with a data field (this is unlike the
213 Cachelab simulator that didn't store any data). This would enable us to
see if, say upon each read after another processor's write, the data read is
coherent with the other write.
However, we ran into a handful of complications with this implementation.
The first issue here was that we would then need Pin to provide not
only the address it was accessing, but also the data at that address, and
include this data in the memory trace. This appeared to be a very hard thing
to do, and we didn't make progress here. Secondly, we realized that in order
to *truly* maintain coherence with the inclusion of a data field, we would
need to perform actual flushes to memory, instead of just *simulating* a 100
cycle stall. We realized that this and other instances would severely
complicate our code, and so we decided not to do this. Instead, we resolved to
just comparing the output shout sequence to the reference that we handcrafted.
#### Compiling optimizations
A minor detail is that we use `-O0` with `g++` in our `Makefile` to make sure
that we get the exact memory trace as represented in the source code, and that
these aren't *cleverly* optimized out by `g++`.
#### Wrapper shell script
To conclude our project, we wrote a script `cachemulator.sh` that performs
the 3-step process previously mentioned, to take a user from inputting his/her
executable file, to outputting charts comparing all the different metrics
supported by Cachemulator.
***
## Results
### False sharing
To motivate this problem, [here is a simple example](falsesharing.html) of
**false sharing** in action. We ran both the versions with and without false
sharing through our system and obtained the following graphs, plotting the
two versions with the count of each different metric. These tests were run
on `ghc35.ghc.andrew.cmu.edu`, which has 6 cores and Hyper-Threading support.





Cachemulator provides very clear-cut results here of the effects that false
sharing has on all these metrics. We consider this to be a **great success**
for out emulator, because to an inexperienced coder, [this code](falsesharing.html)
looks perfectly correct and maximally efficient, since no *true sharing* is
taking place. Our emulator thus successfully exhibits the very nasty effects of
false sharing, and in this situation, would warn the user: *"Hey, your code
has terrible cache performance! Do something about it!"*
However, something here that we are not able to explain, or perhaps is a bug
in our system, is the fact that for most of these charts, the MESIF protocol
seems to have the highest count across the board for the false sharing
implementation. MESIF is known to be useful for its reduction in bus traffic
for example, but the "Bus Shouts" chart clearly shows MESIF with the highest
traffic count. A topic of further study is why this is so. Is this an accurate
representation of the MESIF protocol and how it performs under false sharing?
### OpenMP BFS
Although the previous code is a highly useful pedagogical tool to illustrate
what false sharing is and the very clearly negative effects it has on
unsuspecting code, let's take a look at a more complicated, **real-world**
example: our **top-down** BFS OpenMP code from [Assignment 3](http://15418.courses.cs.cmu.edu/spring2014/article/6).
This executable was run on a graph called `tiny.graph`, which does not
guarantee spatial locality, also on `ghc35.ghc.andrew.cmu.edu`. First of all,
it is cool to be able to see even OpenMP code working with our emulator,
thanks to Pin's *dynamic instrumentation*, instead of just the usual *pthreads*.
#### Performance
It is interesting to see in the figure below, the increasing pattern in the
vertical bars for each metric.

Observe in the figure below that with the MESIF protocol, the total number of
tick counts required to run the trace to completion is substantially lower
than with MSI and MESI. The question of why this is so was incidentally
answered by our own emulator with the figure below it. One of the things we
learned in the process of implementing Cachemulator is exactly how MESIF works.
It takes advantage of cache-to-cache transfers to save unnecessary BusRdX
shouts in order to upgrade a line to Modified state if all other sharers are
in the Shared state. As modelled by our system, cache-to-cache transfer delays
are only about 30 cycles, compared to 100 cycles for memory stalls (we couldn't
find exact data for the c2c delays, so we made an estimate). Thus, this results
in an overall decrease in execution time, just from using the MESIF protocol.


#### Bus traffic
The following charts plot what percentage of overall bus traffic is due to
BusRdX versus BusRd. The horizontal axis is the number of ticks, grouped by
100s in order to laterally compress the graph. The vertical axis is a percentage
of total bus traffic. As can be seen, the bus is always busy, and there is an
inverse relationship between the BusRdX and BusRd, since we have an atomic
bus and so no two requests can be serviced at the same time.



## Concluding Remarks
Cache performance is a huge factor of any program's performance, and yet it
is one of the most difficult features to reason about just by looking at
the program's sourcecode, especially as the code gets more complex. Cachemulator
aims to combat this issue plaguing programmers everywhere, by giving them an
accessible way to evaluate their program's cache performance. The emulator does
away with complicated processes and simply reports the most important few
metrics.
An area of improvement that can be made to Cachemulator is apart from simply
reporting that *there are* lines that are highly contended, or that there are
*many counts* of invalidations, it would be **awesome** if it could report
*which* lines are highly contended, and *which* lines are being invalidated
the most. This would undoubtedly be a programmer's strong new weapon!
If you are interested in reading the sourcecode of Cachemulator, or are
interested in contributing to the project, please visit
[its repo on Github](https://github.com/isaaclimdc/cachemulator)!