KIT308 Multicore Architecture and Programming
Assignment 2 — SIMD
Aims of the Assignment
The purpose of this assignment is to give you experience at writing a program using SIMD programming techniques. This assignment will give you an opportunity to demonstrate your understanding of: the where there is one there is man y approach (WTIO TIM); structures of arr ays; the use of SIMD for calculation; and the tr anslation of conditional statements into SIMD .
Assignment Submission
Your assignment is to be submitted electronically via MyL O and should contain: A .zip (or .r ar) containing a Visual Studio Solution containing a project for each attempted stage of the assignment (in the format provided in the downloadable materials pro vided below). A document containing: A table of timing information comparing the pro vided multi-threaded base code, the pro vided base code with SIMD sphere intersections, and the Stages 1–4 implementation on each scene file (all running with the maximum threads nativ ely supported b y your CPU). An analysis of the abo ve timing data. You do not need to (and shouldn't) submit executables, temporary object files, or images. In particular , you must delete the ".vs" diretory before submission as it just Visual Studio tempor ary files and 100s of MBs. Do not howev er delete the "Scenes" folder or the "Outputs" folder (but do delete the images within this one).
Task/Topic Marks
-
Conversion of Distance Calculation to Where There is One there is Many (WTIOTIM) 10% Correct implementation of WTIO TIM approach to isTriangleIntersected function to calculate distance to nearest object (in Intersection.cpp ), and con version of its usage in the objectIntersection function (also in Intersection.cpp )5% Correct implementation of WTIO TIM approach to create a short -circuiting isTriangleIntersected function to return (as soon as possible) whether any triangle intersects (in Intersection.cpp ), and con version of its usage in the isInShadow function (in Lighting.cpp )5%
-
Conversion of Scene Objects to SoA 15% Correct declar ations of data structures for SoA SIMD form of T riangle container data (don't delete the AoS v ersions), and SoA SIMD form of data uses minimal (sensible) amount of memory5% Correct code to con vert AoS container to dynamically declared SoA structures (con version should happen after the scene has been loaded)5% Correct con version of isTriangleIntersected functions to use SoA cop y of T riangle data 5%
-
AVX SIMD Conversion of Triangle Intersection Calculation 55%
- A. SIMD Conversion of short-circuiting isTriangleIntersected function 40%
- Correct con version of scalar inputs to SIMD 5%
- Correct and efficient intersection point t0 calculation 5%
- Correct and efficient (e.g. no use of if -statements) handling of cases where no intersection occurs due to r ay being
- parallel(ish) with triangle (i.e. first if -statement)5%
- Correct and efficient (e.g. no use of if -statements) handling of cases where no intersection occurs due to point being
- outside of triangle (i.e. second and third if -statements)5%
- Correct and efficient (e.g. no use of if -statements) handling of cases where intersection does occur (i.e. fourth if -
- statement)5%
- Correct and efficient (e.g. no use of if -statements, for -loops, or scalar code) calculation of intersection data 5%
- Correct declar ation of loop constants before loop 5%
- No remaining scalar code (ex cept WTIO TIM for -loop) and correct final output of SIMD code (including cases where
- number of triangles isn't divisible b y 8)5% B. SIMD Conversion of non-short-circuiting isTriangleIntersected function 15% Correct (i.e. corresponding to scalar minimum) and efficient (e.g. no use of if -statements) SIMD calculation of closest intersection (i.e. t) (NOTE: full marks can be gained for this component if SIMD is correctly used in the main loop , but a scalar loop is required to determine the final minimum from a SIMD v ector of minimums)5% Correct (i.e. corresponding to scalar index) and efficient (e.g. no use of if -statements) SIMD calculation of corresponding triangle index (i.e. index ) (NOTE: full marks can be gained for this component if SIMD is correctly used in the main loop , but a scalar loop is required to determine the final index from a SIMD v ector of index es)5% Correct (i.e. corresponding to scalar v alues) and efficient (e.g. no use of for -statements) SIMD ("horiz ontal") calculation to determine closest intersection and index (i.e. t and index )5%
-
AVX SIMD Conversion and General Optimisation of Lighting Calculation 30% (NOTE: this section will require SoA declar ations and con version from AoS for Light container data) Correct SIMD con version of applySpecular function (i.e. so it handles 8 lights at once) 5% Correct and efficient (i.e. common terms identified) inlining of applyCheckboard , applyCircles , and applyWood functions within applyDiffuse function5% Correct and efficient (e.g. no use of switch-statements) SIMD con version of applyDiffuse function (i.e. so it handles 8 lights at once)5% Correct SIMD con version of applyLighting function (NO TE: calling isInShadow should use a scalar loop , as it's not possible to use SIMD here)5% Correct determination and declar ation of all loop constants before main loop in applyLighting function (NO TE: this requires either inlining of one or more subfunctions or splitting these function(s) into multiple parts)10%
Documentation 10%
Outputs showing timing information for the base assignment code, base code with SIMD sphere intersections, and Stages 1–4 (with the maximum threads nativ ely supported b y your CPU) on all applicable scene files5%Marking This assignment will be mark ed out of 100 (NO TE: there are 120 marks a vailable, but it's only possible to receiv e a maximum of 100). NOTE: if y our code for a particular stage does not correctly produce the expected output images, y ou will only be able to receiv e a maximum of half marks for that stage — see below for more details. A perfect solution of Stages 1–3 with timings and analysis would obtain 88%. The following is the breakdown of marks: Analysis of data (comparisons across the base code, base code with SIMD sphere intersections, and Stages 1–4) 5%
Penalties
Failure to comply with submission instructions (eg. no co ver sheet, incorrect submission of files, abnormal solution/project structure, etc.)-10% Poor progr amming st yle (eg. insufficient / poor comments, poor v ariable names, poor indenting, obfuscated code without documentation, compiler w arnings, etc.)up to -20% Lateness (-20% for up to 24 hours, -50% for up to 7 da ys, -100% after 7 da ys) up to -100% Failure to use the techniques outlined in this unit's lectures and tutorials, resulting in a wildly different solution, using un-taught libr aries (eg. an external libr ary for SIMD , etc.), un-taught techniques, or a v astly different rendering implementationup to -100%
Marking and Correct Images
As SIMD code is v ery very difficult to debug — as grows exponentially harder as the amount of it grows — there will be limited opportunit y in the marking process to determine where exactly mistak es ha ve been made, and ev en less chance of being able to provide fix es to those mistak es. As a result, if your code for a stage does not produce the expected image, then you will only be eligible for half marks for that stage of the assignment. In order to work within this constr aint, y ou should attempt tr anslations into SIMD in v ery small steps (i.e. a single line at a time, or even a partial line at a time). The lectures and liv e-coding sessions will demonstr ate this approach to SIMD tr anslation. Y ou will lik ely receiv e more marks for a partially complete SIMD tr anslation than a fully complete tr anslation that doesn't work 100%.
Stage 4
As this stage ma y require the use of less accur ate appro ximations to some mathematical functions, the concept of "correct" image will be somewhat relax ed from ha ving to be an exact match as specified b y ImageMagick — an acceptable visual match will lik ely be enough may be considered as correct enough if the ImageMagick comparison v alues are v ery small, and the cause of the difference can't be easily corrected b y the mark er. For reference, the solution to Stage 4 of this assignment (at the time of writing this specification document) only differs b y: a single channel being –1 different, on only one pix el, on only one image of the 15 in the testing set. Programming Style
This assignment is not focussed on progr amming st yle (although it is concerned with efficient code), but y ou should endea vour to follow good progr amming pr actices. Y ou should, for example: comment y our code; use sensible v ariables names; use correct and consistent indenting; and internally document (with comments) an y notable design decisions.
The Assignment Task
You are to modif y the (square-based) multithreaded implementation of a "simple" r aytracer from the first assignment to tak e adv antage of SIMD instructions. As time is tight for this assignment y ou only need to SIMD-if y the triangle/r ay intersection code and the lighting code. This requires changes across multiple files. To aid y ou in this con version, the sphere/r ay intersection code has already (mostly) been tr anslated into SIMD code (b y following the techniques required to do stages 1–3 of the assignment). Note that this pro vided example still has a scalar loop for determining the final v alues in the non-short -circuiting v ersion of isSphereIntersected and that part of Stage 3B is to create a similar function for triangle intersections that does NO T have this limitation (i.e. ha ving a similar for loop would not get y ou all of the Stage 3B marks)
From the pro vided (square-based) multithreaded r aytracer implementation, for Stages 1–3 of the assignment y ou will create multiple subsequent v ersions that modif y the triangle implementation as follows:
- Rewritten functions for the triangle/r ay intersection tests in a where there is one there is man y approach.
- Translation of arr ay-of-structures (AoS) with structures-of -arrays (SoA) for the triangle container .
- Optimisation of the triangle/r ay intersection test to tak e adv antage of SIMD .
For Stage 4, y ou will follow a similar pattern, but for lights and the lighting calculations.
Implementation — SIMD Distance (Stages 1–3)
The following section describes in detail the steps required to complete Stages 1–3 of the assignment. If y ou complete these steps, only then should y ou attempt Stage 4 (as y ou will lik ely find it more difficult).
- Where there is One there is Many This stage in volves replacing the isTriangleIntersected function with two new v ersions that use the where there is one there is man y paradigm.
In order to complete this step y ou will need to: Write two functions that tak e the entire triangles container (or at least a pointer/reference) as an argument and will perform whatev er action took place at the calling site. There are two v ersions as the isInShadow function (in Lighting.cpp ) uses a short - circuited approach, exiting as soon as an y triangle collision is disco vered, and the objectIntersection function (in Intersection.cpp ) finds the closest triangle intersected with (which means it must examine them all). A version of isTriangleIntersected that returns a boolean depending on whether or not a triangle intersects with the r ay (i.e. stopping as soon as one is found). A version of isTriangleIntersected that returns a boolean (as abo ve), updates its t parameter to be the closest triangle that intersects with the r ay, and updates its index parameter to specif y the index of the closest triangle.
- Structures of Arrays
This stage in volves modif ying the Scene struct (Scene.h) to store a duplication of the triangle data via structure of arr ays rather than the currently existing arr ay of structures (currently in the triangleContainer variable).
In order to complete this step y ou will need to: Create a SoA cop y of the data that is stored in triangleContainer in the Scene struct. NOTE: this means the progr am has two copies of the triangle data, the original one in AoS form and a second one in SoA form. As this assignment predominately only requires changes to isTriangleIntersected (and the sites of the calls to this function) the rest of the code will still use the original AoS v ersion of the data. Fill this SoA cop y of the triangle data after the Scene struct has been loaded (dynamically allocating memory for the SoA representation). Rewrite the isTriangleIntersected function to mak e use of the SoA. Hints: The best place to do the con version is immediately after loading the scene (i.e. in Main.cpp ). Triangle (and Sphere ) is defined in SceneObjects.h . Both the AoS and SoA v ersions of the sphere containers are defined in the Scene struct in Scene.h .
You should not need to attempt to modif y Scene.cpp to complete this step . At times this code update ma y require con versions to and/or from normal structs (such as Vector and Point ) to the equiv alently stored data in the SoA.
This stage in volves con verting the short -circuiting v ersion of the isTriangleIntersected function from Stage 2 into a SIMD implementation. In order to complete this step y ou will need to: Convert man y of the input par ameters (or parts of them) into SIMD-ready v alues (e.g. the r ay starting location and direction need to be con verted into a SIMD format).
Step through the SoA in chunks proportional to the number of v alue stored in each SIMD v ariable (i.e. 8). Convert the calculation to use SIMD . Care must be tak en to correctly deal with the v arious conditional expressions in the loop . There are four if -statements that must be con verted to SIMD code via the use of appropriate masking statements. Carefully deal with the situation where the triangle count isn't equally divisible b y the number in each SIMD calculation (e.g. 9 triangle with 4 in each SIMD v alue). NO TE: this will lik ely be handled b y your AoS to SoA con version from Stage 2, but if it isn't you'll need to write special masking code to handle this situation.
3B. SIMD Conversion of isTriangleIntersected function
This stage in volves con verting the non-short -circuiting v ersion of the isTriangleIntersected function from Stage 2 into a SIMD implementation. Most of this con version is identical to step 3, and the code can just be copied from that function. It does howev er require the addition of: New v ariables to k eep tr ack of the closest intersection found, and which triangle index it w as found at. Consolidation of the v alues calculated using SIMD into a single scalar return v alue. This should be done after the loop finishes.
Hints: You do not w ant to return the minimum index v alue (i.e. the smallest of all the indices in a SIMD v ector), y ou want to return the index corresponding to the minimum distance found (i.e. the index in the same lane as the minimum distance). General Hints / T ips The techniques required to complete each stage rely hea vily on work done in the SIMD tutorials and the step-b y-step approach shown in lectures — refer to them often. When implementing the SIMD stage, it's best to SIMD-if y small portions of code at a time (e.g. ev en a single line is often a lot) and then perform the rest of the calculation through the existing scalar code, or ev en compare the result of the SIMD v ersion to the existing scalar code. Again for SIMD , it's helpful to render scenes at a greatly reduced siz e for testing, with an equally small block siz e, one sample per pixel, and with one thread (e.g. try using ‑path .\ ‑ input Scenes/allmaterials.txt ‑ size 8 8 ‑ blockSize 8 ‑ samples 1 ‑ threads 1 ). It may even be helpful to fix the number of r ays cast ( MAX_RAYS_CAST ) to be 1, so that reflection doesn't occur . This doesn't produce a very nice image, but images can be easily compared visually for problems (the example is only 64 pix els) and printf statements can be used to v erify the correctness of calculations without outputting a huge amount of data (the use of debuggers is somewhat perilous in a multithreaded en vironment, but that shouldn't be an issue here). When con verting conditional statements to masks it can be helpful to do this with scalars first (although remember that C/C++ implementations use 1 rather than 0xFFFFFFFF for true, so the calculations ma y be different). Write functions to output all the elements in a SIMD v alue for easier printf debugging (I am not a fan of how these t ypes are shown in the debugger — unless they ha ve changed recently). Implementation — SIMD Lighting (Stage 4) The following section describes in detail the steps required to complete Stage 4 of the assignment. Y ou should only attempt this step if you ha ve fully completed (and tested) Stages 1–3 (as much of this stage is lik ely much harder to complete). This stage in volves rewriting the applyLighting function and a number of functions it calls. In order to complete this step y ou will need to: Create SoA declar ations and con versions from AoS for Light container data (in a similar w ay that this w as done for spheres and triangles). Convert the applySpecular function to SIMD (i.e. perform this for 8 lights and produce 8 colours). Create a normal non- SIMD applyDiffuse function that has applyCheckboard , applyCircles , and applyWood functions manually inlined before attempting the con version to SIMD . Convert this new applyDiffuse function to SIMD (i.e. perform this for 8 lights and produce 8 colours). Convert the applyLighting function to SIMD (i.e. handle multiple lights simultaneously). NOTE: the call to isInShadow is not suitable for SIMD con version, so a simple for loop that repeatedly calls the function and stores each result in turn in a SIMD v ector is acceptable. Determine what parts of within the main loop in applyLighting are the same regardless of where the light is, and mo ve these calculations outside the loop (i.e. so they are only calculated once). This step in volves thinking about the loop itself and the functions that it calls.
Missing Mathematical Functions
There are no definitions of a few needed mathematical functions in the A VX/AVX2 instruction sets: The powf function, as used in the applySpecular function in Lighting.cpp . The sinf and cosf functions, as used in the applyWood function in Texture.cpp . Definitions for these functions can either be sourced from: The SVML api — if y our processor / compiler supports them (e.g. _mm256_sin_ps ). By writing y our own function that just calls the scalar function (e.g. sinf) 8 times for each slot in the v ector. Hints / T ips All the hints/tips from the stages 1–3 apply here (i.e. complete this in small steps, use 8x8 images as tests, use a lot of printf , etc.). You ha ve to change complex conditional control structures (e.g. continues, switch statements, etc.) into SIMD code. It's easier to do this using scalar code first, and then check y our work, before attempting SIMD con version. You ma y find it helpful to define some helper classes similar to Vector8 to aid with the completion of this step , e.g. Colour8 (and maybe ev en Ray8). Also in a similar fashion to Vector8 , defining useful oper ators can help with tr anslation and code readabilit y. These stages are much more work than Stages 1–3 and they are worth less marks. Decide if y ou think it's worth it. Timing TestAverage Time of 5 Runs (Milliseconds)
Documentation
When completing an y of the stages of the assignment y ou should pro vide: timing information for each scene file for the a verage time (to 1 decimal place) tak en over 5 runs for a render using a thread count supported b y your the number of logical processors in y our system and a block siz e of 16. See a later section for an example format for this timing table; and an explanation of the results (e.g. wh y there's no difference between the performance of stages 1, 2, 3, and 4 ( NOTE: this is a made up example and isn't necessarily what to expect ), or wh y a particular t ype of implementation works well (or poorly) on a particular scene, etc.). This explanation should be with respect to the CPU on the system on which y ou ran the tests, and y ou should discuss how the architectur al features of the CPU explain the results. Tests / T iming The following table lists all the tests that y our code needs to gener ate correctly at each stage. It also shows the timing tests that need to be performed in order to fully complete the documentation section of the assignment. Fully completing this tests ma y take up to an hour (with the 5 required runs) on some hardw are, so plan y our time accordingly . In order to confirm y our images match the images created b y the base v ersion of the assignment code, it's strongly recommended y ou use an image comparison tool. F or part of the marking for this, Image Magick will be used (as it w as in Assignment 1). The timing tests need to be run in the "F astRelease" build configur ation. The "R elease" mode uses a "strict" floating-point model setting to (attempt to) ensure that the SIMD con version doesn't introduce calculation inconsistencies with the base code, and the "F astRelease" allows for inaccur acies in order to prioritise speed. Accordingly , only "R elease" should be used for the comparison tests with ImageMagick (and its this v ersion that will be used b y the mark er to perform these tests) — "F astRelease" will show a v arying amount of difference depending on the test.
The following tests will be run on y our code for each scene file. Y ou also might find the 8x8 tests useful for y our SIMD code con version:
Provided Materials The materials pro vided with this assignment contain: the source code of the base multi-threaded v ersion of the r aytracer (i.e. a solution to Stage 3 of Assignment 1); the source code of the base multi-threaded v ersion of the r aytracer with a (partial) SIMD implementation of the R ay–Sphere calculation; a set of scene files to be supplied to the progr am; a set of reference image files created from the base v ersion of the progr am; six batch files that will run all 5 timing tests for the base code, the base code with SIMD sphere intersections, and each of the four stages; and a further set of six batch files that will run comparison tests (assuming Image Magick is installed) for the base code, the base code with SIMD sphere intersections, and each of the four stages. Download the materials as an ZIP file. Source Code The pro vided MSVC solution contains 6 projects. RayT racerAss2 The pro vided code consists of 21 source files. Raytracing logic: Main.cpp : this file contains the main function which reads the supplied scene file, begins the r aytracing, and writes the output PNG file. The main render loop is also in this file. Raytracing.cpp and Raytracing.h : these files contain the core r ay trace function, functions for handling of reflection and refraction, and a function for capturing data to be stored in the Intersection datastructure. Intersection.h and Intersection.cpp : these files define a datastructure for capturing relev ant information at the point of intersection between a r ay and a scene object and functions for testing for individual r ay-object collisions and r ay-scene collisions. Lighting.h and Lighting.cpp : these files pro vide functions to apply a lighting calculation at a single intersection point. Texturing.h and Texturing.cpp : these files pro vide functions for the reading points from 3D procedur al textures. Constants.h : this header pro vide constant definitions used in the r aytracing. Basic types : Primitives.h : this header contains definitions for points, v ector, and r ays. It also pro vides functions and o verloaded operators for performing calculations with v ectors and points. SceneObjects.h : this header file pro vides definitions for scene objects (ie. materials, lights, spheres, and triangles). Colour.h : this header defines a datastructure for representing colours (with each colour component represented as a float ) and simple oper ations on colours, including con versions to/from the standard BGR pix el format. Scene definition : Scene.h : this header file contains the datastructure to represent a scene and a single function that initialises this datastructure from a file. The scene datastructure itself consists of properties of the scene and lists of the v arious scene objects as described abo ve.