=====
Lab 6
=====

.. contents:: Contents
   :depth: 2
Deadlines
---------
- Tasks 2, 3: 2024-03-31
- Tasks 4, 5: 2024-04-07
Objective
---------
In this lab, you will optimize existing code to improve its cache performance.
Logistics
---------

As in the prior lab, we will use castor.ece.ubc.ca to evaluate your submissions, so your code must work there; in particular, note that memory reference counts might be different on different machines. As usual, all submissions are made by committing your code and pushing the commits to the main branch of the assignment repo on GitHub before the deadline for the relevant task(s).
As with all work in CPEN 212, everything you submit must be original and written by you from scratch after the lab has been released. You may not refer to any code other than what is found in the textbooks and lectures, and you must not share code with anyone at any time. See the academic integrity policy for details.
Be sure that you are submitting only working code. If your code does not compile, does not link, or crashes, you will receive no credit for the relevant task. In addition, your optimized code must produce the same results as the original code and may not run longer than the unoptimized baseline; you will receive no credit for “optimized” code that produces different results or runs too long.
Make sure you review the cache optimization lectures and understand how caches work, how they support reuse, and how the in-lecture examples reduce miss counts, including the role associativity plays.
Task 1 is a tutorial; it is not worth any marks, but you will have a hard time with the remaining tasks if you're not able to achieve the task 1 objectives.
Coding
------

Generated templates
~~~~~~~~~~~~~~~~~~~

For tasks 2 to 5, you will start from code generated for you by::

    ~cpen212/Public/lab6/lab6gen
Be sure to move the generated files to the appropriate directories in the lab repository.
Constraints
~~~~~~~~~~~

Some constraints you must obey when writing code:

- When compiling your code, we will only use a single C file in the corresponding task directory: ``task2.c`` for Task 2, ``task3.c`` for Task 3, and so on. Any other files you add will be ignored even if you try to ``#include`` them.
- You may not define any global symbols (functions or variables) in your code.
- You may not rename any directories, functions, or variables provided in the templates.
- You may not alter any types or qualifiers in the existing declarations.
- You may not change the dimensions of any matrices in the template code.
- Your code may not cause any side effects other than those performed by the reference code.
- You may not use any names that start with a double underscore (e.g., ``__foo``).
- Your code must be in C (specifically the dialect used by default by the globally-installed ``gcc`` on ``castor``).
- Your code must not require linking against any libraries other than the usual ``libc`` (which is linked against by default when compiling C).
- Your code must compile and run without errors or warnings. If we can't compile your code, it produces compiler warnings, or we can't link or run the resulting object files, you will receive no credit.
- Your optimized code may not take longer to run than the original template.
If you violate these rules, we will likely not be able to compile and/or properly test your code, and you will receive no credit for the relevant task(s).
We will compile and link your file using the compiler options from the original Task 1 ``Makefile``, except that we will also define ``NDEBUG`` to ignore any ``assert`` statements. Be sure to use the same command; otherwise, you might end up generating extra memory references that we would not see when marking your submission. Similarly, we will test your code with all matrices in all tasks allocated on 4096-byte boundaries, again as in Task 1; you will want to do the same or we might not be able to reproduce your results.
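If you allocate matrices in your own test driver, one way to reproduce this 4096-byte alignment is ``posix_memalign``. Here is a minimal sketch, assuming square ``N``-by-``N`` matrices of ``double`` elements (the dimensions and element type in your generated templates may differ)::

    #include <stdlib.h>

    #define N 500  /* assumption: substitute your template's dimension */

    int alloc_matrices(void)
    {
        double (*a)[N], (*b)[N], (*out)[N];

        /* posix_memalign returns 0 on success and fills in a pointer to
           storage aligned to the requested 4096-byte boundary. */
        if (posix_memalign((void **)&a,   4096, sizeof(double[N][N])) != 0 ||
            posix_memalign((void **)&b,   4096, sizeof(double[N][N])) != 0 ||
            posix_memalign((void **)&out, 4096, sizeof(double[N][N])) != 0)
            return -1;

        /* ... run the multiplication, then free(a); free(b); free(out); */
        return 0;
    }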
The cache model
---------------

To complete this lab, you will need to measure the number of cache hits and cache misses caused by a program that you run. Because caches on real hardware are shared among processes (and users), we will instead use a cache emulator called ``cachegrind`` that runs your code and reports cache performance as if it were the only active process.
Cache configuration
~~~~~~~~~~~~~~~~~~~

The cache hierarchy we will use for all tasks in this lab consists of two levels:

- a first-level (L1) cache with 4KB (4096 bytes) of data capacity, 4-way set-associative, and
- a last-level (LL) cache with 16KB (16384 bytes) of data capacity, 8-way set-associative.

Both levels have 64-byte cache lines. The caches are inclusive, so when a cache line is brought into the L1 cache, it is also installed in the last-level cache. (The cache hierarchy modelled by ``cachegrind`` also includes an instruction cache, but in this lab we will ignore instruction fetch misses and focus only on data cache misses.)
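It is worth working out the geometry these parameters imply; assuming the usual modulo set indexing (which is what ``cachegrind`` models)::

    sets = capacity / (line size × associativity)

    L1: 4096 / (64 × 4)  = 16 sets,  set index = (address / 64) mod 16
    LL: 16384 / (64 × 8) = 32 sets,  set index = (address / 64) mod 32

Knowing the set counts will matter later: lines whose addresses differ by a multiple of 1024 bytes (L1) or 2048 bytes (LL) compete for the same set.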
Performance metrics
~~~~~~~~~~~~~~~~~~~

To model cache performance, we will measure the number of data cache misses at each level, and combine them into an overall memory access cost as::

    cost = L1 data cache misses + 10 × last-level data cache misses
with cache hits and instruction accesses not included. This models the roughly order-of-magnitude difference in latency between hitting and missing in the LLC in a real system. Concretely, an access that hits in the L1 adds nothing to the cost; one that misses in the L1 but hits in the LL cache adds 1; and one that misses in both caches adds 11.
Observe that a single LLC hit saves more cost than a single L1 hit; this is realistic, and has consequences for the optimization strategy.
You might wonder why we're measuring miss counts rather than miss rates as in lecture. We used the latter in lecture because it's a bit easier to reason about them abstractly; however, miss rates are not good measurements because they can be arbitrarily lowered by increasing the total number of memory references.
Task 1
------

In the ``task1`` directory, you will find C files that implement and drive a matrix multiplication function, similar to what we've seen in lecture. The ``Makefile`` there will build the full program for you.
Running cachegrind
~~~~~~~~~~~~~~~~~~

You can run ``cachegrind`` on ``mxmult`` like this::

    valgrind -v --tool=cachegrind --D1=4096,4,64 --LL=16384,8,64 --cachegrind-out-file=stats.raw mxmult
    cg_annotate --auto=yes --show-percs=no stats.raw
The first step (``valgrind``) runs ``mxmult`` and collects cache statistics in a ``stats.raw`` file. It also displays some cache statistics to standard output, but we will be interested in the more detailed statistics we can get in the second step.

The second step (``cg_annotate``) takes this ``stats.raw`` file and produces a table with some more detailed information::
    --------------------------------------------------------------------------------
               Ir I1mr ILmr          Dr        D1mr        DLmr          Dw D1mw DLmw
    --------------------------------------------------------------------------------
    1,126,784,985  363  361 375,007,058 250,250,352 250,250,239 125,001,508  128  121  PROGRAM TOTALS
where ``D1mr`` and ``D1mw`` are read and write misses in the L1 data cache, and ``DLmr`` and ``DLmw`` are the read and write misses in the LL data cache.
To feed these into our cost equation, we would first add the reads and writes for each level, and then plug the totals into the equation; in this case, we have 250,250,352 + 128 = 250,250,480 L1 data misses and 250,250,239 + 121 = 250,250,360 LL data misses, so::

    cost = 250,250,480 + 10 × 250,250,360 = 2,752,754,080
Below the program totals you can see a breakdown of the misses by source code line.
You will find that most of the misses are attributed to one line. This makes sense, because that is where ``out``, ``a``, and ``b`` are accessed, but it would be more useful to know the accesses to each array separately. To do this, you can edit the C file so that the reference to each array is on a separate line; when you then rebuild ``mxmult`` and re-generate the cache statistics, you will find that you now know the number of accesses to each array.
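For example, a minimal sketch of this split, assuming the template uses the familiar triple loop over ``out``, ``a``, and ``b`` with ``double`` elements (your generated code may differ)::

    for (int r = 0; r < N; ++r)
        for (int c = 0; c < N; ++c)
            for (int i = 0; i < N; ++i) {
                double av = a[r][i];   /* misses on a are attributed here   */
                double bv = b[i][c];   /* misses on b are attributed here   */
                out[r][c] += av * bv;  /* misses on out are attributed here */
            }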
The operation space
~~~~~~~~~~~~~~~~~~~
Before going further, review the matrix multiplication example in lecture, and make sure you understand exactly what is going on.
Once you've done that, let's try to analyze the same example from a different angle. Here is the pseudocode::
    for (r = 0; r < N; ++r)
        for (c = 0; c < N; ++c)
            for (i = 0; i < N; ++i)
                z[r,c] += x[r,i] * y[i,c]
First, let's think about what this entire computation really means. The core is the multiply-and-accumulate (``+=``) operation, and this ``+=`` is nested within three levels of loops with a bound of ``N``. To complete the operation, therefore, we need to perform N×N×N ``+=`` operations on the correct data elements.

We can visualize this 3D loop as a cube, where each dimension is one of the loop indices, and each point inside the cube corresponds to a single ``+=`` operation:
.. image:: figures/op-space.svg
This is called the operation space.
Observe that projecting the ``+=`` operation onto the walls of the cube gets you the elements of each matrix you need for this operation. In fact, the entire front/back projection in this picture is the entire ``z`` matrix, the right/left projection is the ``x`` matrix, and the top/bottom projection is the ``y`` matrix.
Now let's try to visualize tiling in the operation space. We're really just reorganizing how we traverse this space so that we process one 3D tile before another — we have to do all the ``+=`` operations exactly once anyway to preserve correctness. For example, we might choose to process tile 1 before tile 2 below:
.. image:: figures/tiling.svg
How does this support reuse? Following the intuition from the first diagram, the projections of each tile onto the three dimensions tell you which part of each matrix you need for finishing the operations in the tile. So from the diagram you can see that when you move from tile 1 to tile 2, you change the slices of the ``z`` and ``y`` matrices you need, but you reuse the same slice of the ``x`` matrix. This is where the temporal reuse due to tiling comes from.
The tiling diagram also makes it clear that each tile does not fully compute its projection of ``z`` — each element will be a partial sum, and the final value will only appear once all of the tiles this element intersects have been processed.
If you choose a different order in which to execute the tiles in this example, you can reuse a different matrix; for example, if you traverse the space depthwise you'll get the reuse of ``z`` we saw in lecture.
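To make this concrete, here is a minimal sketch (in C) of a cubical operation-space tiling of the pseudocode above, assuming ``N`` is a multiple of a hypothetical tile size ``T``; choosing ``T`` and handling ragged edges are left open::

    /* The outer loops pick a T×T×T tile; the inner loops perform every
       += within that tile. Each (r, c, i) point is still visited exactly
       once, so the result is unchanged. */
    for (int rr = 0; rr < N; rr += T)
        for (int cc = 0; cc < N; cc += T)
            for (int ii = 0; ii < N; ii += T)
                for (int r = rr; r < rr + T; ++r)
                    for (int c = cc; c < cc + T; ++c)
                        for (int i = ii; i < ii + T; ++i)
                            z[r][c] += x[r][i] * y[i][c];

With this particular loop order, moving from one ``ii`` tile to the next reuses the same slice of ``z``; reordering the three outer loops changes which matrix's slice is reused, exactly as in the diagrams.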
You can make two more interesting observations here. First, correctness does not require the operation-space tiles here to be cubical; you could make them a different shape as long as you traverse each point in the space exactly once. In fact, they don't even have to be regularly shaped with straight “walls,” as long as each ``+=`` is still done exactly once.
Secondly, you can actually visualize how cache lines fit in this picture; these would essentially look like short stripes on the “walls.” In the lecture example, they will all be aligned because each row is an exact multiple of the cache line size, but that is not necessarily true in general.
To double-check your understanding,

- visualize how the operation space is traversed in the pseudocode above and the lecture examples, ignoring spatial locality;
- now add cache line boundaries to your visualization, determine the best traversal order(s) for spatial locality, and cross-check with the lecture slides; and
- determine what kind of tile shapes would ensure that the final value of an element of ``z`` is fully computed within a single tile.
You should now have a pretty good intuition for matrix multiplication, so you can move on to optimizing some code.
Optimizing matrix multiplication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Your mission here is to optimize ``mxmult.c``, in two steps:

1. Minimize the raw L1 data miss counts to get some initial intuition:

   a. first, optimize for spatial reuse by changing the order in which the matrices are traversed, and
   b. then, optimize for temporal reuse by tiling the matrices and the overall operation space.

2. Minimize the cost function defined in `Performance metrics`_:

   a. first, optimize for LLC reuse alone by adjusting the tile dimensions to match the LLC geometry, and
   b. then, add another level of tiling to match the L1 geometry.
For the first part, you should be able to get on the order of 3,220,000 L1 data misses as a reasonable target. For the second, a total cost on the order of 28,250,000 would be reasonable.
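When sizing tiles, a useful back-of-the-envelope constraint is that all the data a tile touches should fit in the target cache level. As a purely hypothetical example, if a tile touched one T×T slice of each of three matrices with 8-byte elements, fitting in the 4KB L1 would require::

    3 × T² × 8 ≤ 4096   ⇒   T ≤ 13

In practice the element sizes and slice shapes depend on your code, and limited associativity can force smaller tiles than raw capacity suggests.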
Hints
~~~~~

- Make sure you understand the cache optimizations covered in lecture, including tiling choices. Also make sure you understand how the set index is selected and how associativity works.
- It's a good idea to add running the cache simulator to the ``Makefile``, so that you can just run ``make`` after making code modifications and get the output. It might also be useful to redirect the output of ``cg_annotate`` to a file so that you can browse through it in an editor.
- It is imperative that your optimized function compute the same values as the original. Therefore you should probably make a renamed copy of the original implementation in your test driver code, and add some code to compare the output of your optimized function to that of the original.
- It would also be wise to randomize the contents of ``a``, ``b``, and ``out`` before calling the functions and comparing the results, so that you don't rely on some accidental initialization values (e.g., all zero) that might mask an otherwise incorrect implementation; a sketch of such a check appears after these hints.
- It really helps to attribute misses to specific matrices.
- When you implement an optimization, you might want to double-check that the address stream for each matrix follows the pattern you expect. You can always temporarily print the addresses of each array you're accessing, as long as you take this instrumentation out before submitting your final version.
- The caches (especially the L1) are small enough for you to draw all the sets and ways, and simulate a sequence of accesses on your diagram. This is a great way to understand the exact hit and miss patterns that result from a given address access sequence.
- If you want to know the order in which the arrays are accessed, you can compile the C to assembly (the ``-S`` option to ``cc``) and examine the instructions. In a real CPU these can execute somewhat out of order, although this is not very likely with this code; in any case, ``cachegrind`` will simulate their effects in program order.
- Don't forget to include the write misses in your calculations. In the default implementation, you shouldn't see very many, but it's possible to create “optimizations” that result in much higher write miss counts.
- When optimizing for the total cost function, it might be necessary to sacrifice some L1 hits to achieve fewer LLC misses. This is often reasonable because missing in the LLC is so much slower than missing in the L1, but be sure to check that whatever tradeoff you're making is actually worth it when considering our specific cost equation.
- It's also possible to increase the total number of memory references but still decrease the miss count / cost. In our cost model, L1 cache hits cost nothing.
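Here is a minimal sketch of such a comparison harness, with hypothetical names: ``mxmult_ref`` is your renamed copy of the original implementation, ``mxmult_opt`` is the version under test, and ``N`` and the ``double`` element type stand in for whatever your template uses. Note that ``memcmp`` demands bit-identical results, which is achievable as long as your optimization preserves each output element's accumulation order::

    #include <stdlib.h>
    #include <string.h>

    #define N 500  /* assumption: substitute your template's dimension */

    void mxmult_ref(double out[N][N], double a[N][N], double b[N][N]);
    void mxmult_opt(double out[N][N], double a[N][N], double b[N][N]);

    static void randomize(double m[N][N])
    {
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                m[r][c] = (double)rand() / RAND_MAX; /* avoid all-zero data */
    }

    /* Returns nonzero iff the two implementations agree on random inputs. */
    static int results_match(double a[N][N], double b[N][N],
                             double ref[N][N], double opt[N][N])
    {
        randomize(a);
        randomize(b);
        randomize(ref);                         /* randomize out as well... */
        memcpy(opt, ref, sizeof(double[N][N])); /* ...identically for both  */
        mxmult_ref(ref, a, b);
        mxmult_opt(opt, a, b);
        return memcmp(ref, opt, sizeof(double[N][N])) == 0;
    }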
Deliverables
~~~~~~~~~~~~
No deliverables for this task.
Task 2
------

Optimize the function in the ``task2.c`` file obtained per `Generated templates`_ to minimize L1 data cache misses (including both reads and writes). To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.
To receive full credit, you will need to reach a near-optimal miss count. Credit for miss counts short of that will be interpolated so that improvements closer to the optimal miss count (which are harder to achieve) are worth more than improvements close to the reference miss count (which are relatively easy).
In ``ANALYSIS.txt``, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.
Hints
~~~~~

- Re-read the hints for Task 1.
- Try to visualize the operation space as shown under `The operation space`_; think about everything you can reuse for various traversal strategies and how much space everything would take.
- Don't forget about associativity (see the example below).
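As a hypothetical illustration of why associativity matters: the L1 has 16 sets of 64-byte lines, so addresses that are a multiple of 16 × 64 = 1024 bytes apart (with the same offset within the line) map to the same set. If a matrix row happened to be exactly 4096 bytes, walking down a column would touch the same L1 set on every row, and only 4 of those lines (the associativity) could be cached at once, no matter how small your tile is.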
Deliverables
~~~~~~~~~~~~

In ``task2``:

- optimized ``task2.c``
- filled-out ``ANALYSIS.txt``
Task 3
------

Optimize the function in the ``task3.c`` file obtained per `Generated templates`_ to minimize the total cost defined in `Performance metrics`_. The template is the same as in Task 2, so you can build on that. To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.
To receive full credit, you will need to reach a near-optimal cost value. Credit for costs short of that will be interpolated so that improvements closer to the optimal cost (which are harder to achieve) are worth more than improvements close to the reference implementation's cost (which are relatively easy).
In ``ANALYSIS.txt``, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.
Deliverables
~~~~~~~~~~~~

In ``task3``:

- optimized ``task3.c``
- filled-out ``ANALYSIS.txt``
Task 4
------

Optimize the function in the ``task4.c`` file obtained per `Generated templates`_ to minimize L1 data cache misses (including both reads and writes). To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.
To receive full credit, you will need to reach a near-optimal miss count. Credit for miss counts short of that will be interpolated so that improvements closer to the optimal miss count (which are harder to achieve) are worth more than improvements close to the reference miss count (which are relatively easy).
In ``ANALYSIS.txt``, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.
Hints
~~~~~

- Drawing the operation space on paper is a bit harder than before because it now has more dimensions, but the principles covered under `The operation space`_ are the same.
- Don't forget about associativity and how it relates to what you want to keep cached.
Deliverables
~~~~~~~~~~~~

In ``task4``:

- optimized ``task4.c``
- filled-out ``ANALYSIS.txt``
Task 5
------

Optimize the function in the ``task5.c`` file obtained per `Generated templates`_ to minimize the total cost defined in `Performance metrics`_. The template is the same as in Task 4, so you can build on that. To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.
To receive full credit, you will need to reach a near-optimal cost value. Credit for costs short of that will be interpolated so that improvements closer to the optimal cost (which are harder to achieve) are worth more than improvements close to the reference implementation's cost (which are relatively easy).
In ``ANALYSIS.txt``, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.
Deliverables
~~~~~~~~~~~~

In ``task5``:

- optimized ``task5.c``
- filled-out ``ANALYSIS.txt``
Marks
-----
To earn marks, you must commit and push each task to the GitHub repo before the deadline for that task.
Remember that CPEN 212 labs are individual, so you must complete all tasks by yourself; see the academic integrity policy for details.
- Task 2: 3 marks
- Task 3: 2 marks
- Task 4: 3 marks
- Task 5: 2 marks