
CPEN 212 Computing Systems II - Lab 6: Cache Performance
==========================================================


.. contents:: Contents
   :depth: 2

Deadlines

  • Tasks 2, 3: 2024-03-31
  • Tasks 4, 5: 2024-04-07

Objective

In this lab, you will optimize existing code to improve its cache performance.

Logistics

As in the prior lab, we will use castor.ece.ubc.ca to evaluate your submissions, so your code must work there; in particular, note that memory reference counts might be different on different machines. As usual, all submissions are made by committing your code and pushing the commits to the main branch of the assignment repo on GitHub before the deadline for the relevant task(s).

As with all work in CPEN 212, everything you submit must be original and written by you from scratch after the lab has been released. You may not refer to any code other than what is found in the textbooks and lectures, and you must not share code with anyone at any time. See the academic integrity policy for details.

Be sure that you are submitting only working code. If your code does not compile, does not link, or crashes, you will receive no credit for the relevant task. In addition, your optimized code must produce the same results as the original code and may not run longer than the unoptimized baseline; you will receive no credit for “optimized” code that produces different results or runs too long.

Make sure you review the cache optimization lectures and understand how caches work, how they support reuse, and how the in-lecture examples reduce miss counts, including the role associativity plays.

Task 1 is a tutorial; it is not worth any marks, but you will have a hard time with the remaining tasks if you're not able to achieve the task 1 objectives.

Coding

Generated templates

For tasks 2 to 5, you will start from code generated for you by::

    ~cpen212/Public/lab6/lab6gen

Be sure to move the generated files to the appropriate directories in the lab repository.

Constraints

Some constraints you must obey when writing code:

  • When compiling your code, we will only use a single C file in the corresponding task directory: task2.c for Task 2, task3.c for Task 3, and so on. Any other files you add will be ignored even if you try to #include them.

  • You may not define any global symbols (functions or variables) in your code.

  • You may not rename any directories, functions, or variables provided in the templates.

  • You may not alter any types or qualifiers in the existing declarations.

  • You may not change the dimensions of any matrices in the template code.

  • Your code may not cause any side effects other than those performed by the reference code.

  • You may not use any names that start with a double underscore (e.g., __foo).

  • Your code must be in C (specifically the dialect used by default by the globally-installed gcc on castor).

  • Your code must not require linking against any libraries other than the usual libc (which is linked against by default when compiling C).

  • Your code must compile and run without errors or warnings. If we can't compile your code, it produces compiler warnings, or we can't link with or run with the resulting object files, you will receive no credit.

  • Your optimized code may not take longer to run than the original template.

If you violate these rules, we will likely not be able to compile and/or properly test your code, and you will receive no credit for the relevant task(s).

We will compile and link your file using the compiler options from the original Task 1 Makefile, except that we will also define NDEBUG to ignore any assert statements. Be sure to use the same command; otherwise, you might end up generating extra memory references that we would not see when marking your submission. Similarly, we will test your code with all matrices in all tasks allocated on 4096-byte boundaries, again as in Task 1; you will want to do the same or we might not be able to reproduce your results.

The cache model

To complete this lab, you will need to measure the number of cache hits and cache misses caused by a program that you run. Because caches on real hardware are shared among processes (and users), we will instead use a cache emulator called cachegrind that runs your code and reports cache performance as if it were the only active process.

Cache configuration

The cache hierarchy we will use for all tasks in this lab consists of two levels:

  • a first-level (L1) cache with 4KB (4096 bytes) of data capacity, 4-way set-associative

  • a last-level (LL) cache with 16KB (16384 bytes) of data capacity, 8-way set-associative

Both levels have 64-byte cache lines. The caches are inclusive, so when a cache line is brought into the L1 cache, it is also installed in the last-level cache. (The cache hierarchy modelled by cachegrind also includes an instruction cache, but in this lab we will ignore instruction fetch misses and only focus on data cache misses.)

Performance metrics

To model cache performance, we will measure the number of data cache misses at each level, and combine the overall memory access cost as::

    cost = L1 data cache misses + 10 × last-level data cache misses

with cache hits and instruction accesses not included. This models the roughly order-of-magnitude difference in latency between hitting and missing in the LLC in a real system.

Observe that a single LLC hit saves more cost than a single L1 hit; this is realistic, and has consequences for the optimization strategy.

You might wonder why we're measuring miss counts rather than miss rates as in lecture. We used the latter in lecture because it's a bit easier to reason about them abstractly; however, miss rates are not good measurements because they can be arbitrarily lowered by increasing the total number of memory references.

Task 1

In the task1 directory, you will find C files that implement and drive a matrix multiplication function, similar to what we've seen in lecture. The Makefile there will build the full program for you.

Running cachegrind

You can run cachegrind on mxmult like this::

    valgrind -v --tool=cachegrind --D1=4096,4,64 --LL=16384,8,64 --cachegrind-out-file=stats.raw mxmult
    cg_annotate --auto=yes --show-percs=no stats.raw

The first step (valgrind) runs mxmult and collects cache statistics in a stats.raw file. It also displays some cache statistics to standard output, but we will be interested in the more detailed statistics we can get in the second step.

Running the second step (cg_annotate) takes this stats.raw file and produces a table with some more detailed information::

    --------------------------------------------------------------------------------
    Ir            I1mr ILmr Dr          D1mr        DLmr        Dw          D1mw DLmw
    --------------------------------------------------------------------------------
    1,126,784,985  363  361 375,007,058 250,250,352 250,250,239 125,001,508  128  121  PROGRAM TOTALS

where D1mr and D1mw are read and write misses in the L1 data cache and DLmr and DLmw are the read and write misses in the LL data cache.

To feed these into our cost equation, we would first add the reads and writes for each level, and then plug the totals into the equation; in this case, we have::

    cost = 250,250,480 + 10 × 250,250,360 = 2,752,754,080

Below the program totals you can see a breakdown of the misses by source code line.

You will find that most of the misses are attributed to one line. This makes sense, because that is where out, a, and b are accessed, but it would be more useful to know the accesses to each array separately. To do this, you can edit the C file so that the reference to each array is on a separate line; when you then rebuild mxmult and re-generate the cache statistics, you will find that you now know the number of accesses to each array.

The operation space

Before going further, review the matrix multiplication example in lecture, and make sure you understand exactly what is going on.

Once you've done that, let's try to analyze the same example from a different angle. Here is the pseudocode::

    for (r = 0; r < N; ++r)
        for (c = 0; c < N; ++c)
            for (i = 0; i < N; ++i)
                z[r,c] += x[r,i] * y[i,c]

First, let's think about what this entire computation really means. The core is the multiply-and-accumulate (+=) operation, and this += is nested within three levels of loops with a bound of N. To complete the operation, therefore, we need to perform N×N×N += operations on the correct data elements.

We can visualize this 3D loop as a cube, where each dimension is one of the loop indices, and each point inside the cube corresponds to a single += operation:

.. image:: figures/op-space.svg

This is called the operation space.

Observe that projecting the += operation onto the walls of the cube gets you the elements of each matrix you need for this operation. In fact, the front/back projection in this picture is the entire z matrix, the right/left projection is the x matrix, and the top/bottom projection is the y matrix.

Now let's try to visualize tiling in the operation space. We're really just reorganizing how we traverse this space so that we process one 3D tile before another — we have to do all the += operations exactly once anyway to preserve correctness. For example, we might choose to process tile 1 before tile 2 below:

.. image:: figures/tiling.svg

How does this support reuse? Following the intuition from the first diagram, the projections of each tile onto the three dimensions tell you which part of each matrix you need for finishing the operations in the tile. So from the diagram you can see that when you move from tile 1 to tile 2, you change the slices of the z and y matrices you need, but you reuse the same slice of the x matrix. This is where the temporal reuse due to tiling comes from.

The tiling diagram also makes it clear that each tile does not fully compute its projection of z — each element will be a partial sum, and the final value will only appear once all of the tiles this element intersects have been processed.

If you choose a different order in which to execute the tiles in this example, you can reuse a different matrix; for example, if you traverse the space depthwise you'll get the reuse of z we saw in lecture.

You can make two more interesting observations here. First, correctness does not require the operation-space tiles here to be cubical; you could make them a different shape as long as you traverse each point in the space exactly once. In fact, they don't even have to be regularly shaped with straight “walls,” as long as each += is still done exactly once.

Secondly, you can actually visualize how cachelines fit in this picture; these would essentially look like short stripes on the “walls.” In the lecture example, they will all be aligned because each row is an exact multiple of the cacheline size, but that is not necessarily true in general.

To double-check your understanding,

  • visualize how the operation space is traversed in the pseudocode above and the lecture examples, ignoring spatial locality;

  • now add cache line boundaries to your visualization, determine the best traversal order(s) for spatial locality, and cross-check with the lecture slides;

  • determine what kind of tile shapes would ensure that the final value of an element of z is fully computed within a single tile.

You should now have a pretty good intuition for matrix multiplication, so you can move on to optimizing some code.

Optimizing matrix multiplication

Your mission here is to optimize mxmult.c, in two steps:

  1. Minimize the raw L1 data miss counts to get some initial intuition:

    a. first, optimize for spatial reuse by changing the order in which the matrices are traversed, and

    b. then, optimize for temporal reuse by tiling the matrices and the overall operation space.

  2. Minimize the cost function defined in Performance metrics_:

    a. first, optimize for LLC reuse alone by adjusting the tile dimensions to match the LLC geometry, and

    b. then, add another level of tiling to match the L1 geometry.

For the first part, you should be able to get on the order of 3,220,000 L1 data misses as a reasonable target. For the second, a total cost on the order of 28,250,000 would be reasonable.

Hints

  • Make sure you understand the cache optimizations covered in lecture, including tiling choices. Also make sure you understand how the set index is selected and how associativity works.

  • It's a good idea to add running the cache simulator to the Makefile, so that you can just run make after making code modifications and get the output. It might also be useful to redirect the output of cg_annotate to a file so that you can browse through it in an editor.

  • It is imperative that your optimized function compute the same values as the original. Therefore you should probably make a renamed copy of the original implementation in your test driver code, and add some code to compare the output of your optimized function to that of the original.

  • It would also be wise to randomize the contents of a, b, and out before calling the functions and comparing the results, so that you don't rely on some accidental initialization values (e.g., all zero) that might mask an otherwise incorrect implementation.

  • It really helps to attribute misses to specific matrices.

  • When you implement an optimization, you might want to double-check that the address stream for each matrix follows the pattern you expect. You can always temporarily print the addresses of each array you're accessing, as long as you take this instrumentation out before submitting your final version.

  • The caches — especially the L1 — are small enough for you to draw all the sets and ways, and simulate a sequence of accesses on your diagram. This is a great way to understand the exact hit and miss patterns that result from a given address access sequence.

  • If you want to know the order in which the arrays are accessed, you can compile the C to assembly (the -S option to cc) and examine the instructions. In a real CPU these can execute somewhat out of order, although this is not very likely with this code; in any case, cachegrind will simulate their effects in program order.

  • Don't forget to include the write misses in your calculations. In the default implementation, you shouldn't see very many, but it's possible to create “optimizations” that result in much higher write miss counts.

  • When optimizing for the total cost function, it might be necessary to sacrifice some L1 hits to achieve fewer LLC misses. This is often reasonable because missing in the LLC is so much slower than missing in the L1, but be sure to check that whatever tradeoff you're making is actually worth it when considering our specific cost equation.

  • It's also possible to increase the total number of memory references but still decrease the miss count / cost. In our cost model, L1 cache hits cost nothing.

Deliverables

No deliverables for this task.

Task 2

Optimize the function in the task2.c file obtained per Generated templates_ to minimize L1 data cache misses (including both reads and writes). To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.

To receive full credit, you will need to reach a near-optimal miss count. Higher miss counts will be interpolated so that improvements closer to the optimal miss count (which are harder to achieve) are worth more than improvements close to the reference miss count (which are relatively easy).

In ANALYSIS.txt, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.

Hints

  • Re-read the hints for Task 1

  • Try to visualize the operation space as shown under The operation space_; think about everything you can reuse for various traversal strategies and how much space everything would take

  • Don't forget about associativity

Deliverables

In task2:

Task 3

Optimize the function in the task3.c file obtained per Generated templates_ to minimize the total cost defined in Performance metrics_. The template is the same as in Task 2, so you can build on that. To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.

To receive full credit, you will need to reach a near-optimal cost. Higher costs will be interpolated so that improvements closer to the optimal cost (which are harder to achieve) are worth more than improvements close to the reference implementation's cost (which are relatively easy).

In ANALYSIS.txt, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.

Deliverables

In task3:

Task 4

Optimize the function in the task4.c file obtained per Generated templates_ to minimize L1 data cache misses (including both reads and writes). To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.

To receive full credit, you will need to reach a near-optimal miss count. Higher miss counts will be interpolated so that improvements closer to the optimal miss count (which are harder to achieve) are worth more than improvements close to the reference miss count (which are relatively easy).

In ANALYSIS.txt, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.

Hints

  • Drawing the operation space on paper is a bit harder than before because it now has more dimensions, but the principles covered under The operation space_ are the same

  • Don't forget about associativity and how it relates to what you want to keep cached

Deliverables

In task4:

Task 5

Optimize the function in the task5.c file obtained per Generated templates_ to minimize the total cost defined in Performance metrics_. The template is the same as in Task 4, so you can build on that. To receive any credit, your optimized function must produce the same outputs as the reference, and must apply all optimizations we learned about.

To receive full credit, you will need to reach a near-optimal cost. Higher costs will be interpolated so that improvements closer to the optimal cost (which are harder to achieve) are worth more than improvements close to the reference implementation's cost (which are relatively easy).

In ANALYSIS.txt, explain your optimization strategy: what you aimed to reuse in each stage of optimization and why, and which code transformations correspond to each instance of reuse. Include relevant calculations (e.g., tile sizes). Code submitted without a corresponding analysis will not receive credit regardless of performance.

Deliverables

In task5:

Marks

To earn marks, you must commit and push each task to the GitHub repo before the deadline for that task.

Remember that CPEN 212 labs are individual, so you must complete all tasks by yourself; see the academic integrity policy for details.

  • Task 2: 3 marks
  • Task 3: 2 marks
  • Task 4: 3 marks
  • Task 5: 2 marks
