Computer Architecture 2024 Spring
Final Project Part 2
Overview
Tutorial
● Gem5 Introduction
● Environment Setup
Projects
● Part 1 (5%)
○ Write a C++ program to analyze the specification of the L1 data cache.
● Part 2 (5%)
○ Given the hardware specifications, try to get the best performance for a more complicated program.
Project 2
Description
In this project, we will use a computer system with a two-level cache. Your task is to write a ViT (Vision Transformer) in C++ and optimize it. You can see more details of the system specification on the next page.
System Specifications
● ISA: X86
● CPU: TimingSimpleCPU (no pipeline, CPU stalls on every memory request)
● Caches
|          | I cache size | I cache associativity | D cache size | D cache associativity | Policy | Block size |
| L1 cache | 16KB         | 8                     | 16KB         | 4                     | LRU    | 32B        |
| L2 cache | –            | –                     | 1MB          | 16                    | LRU    | 32B        |
* L1 I cache and L1 D cache connect to the same L2 cache
● Memory size: 8192MB
ViT(Vision Transformer) – Transformer Overview
● A basic transformer block consists of
○ Layer Normalization
○ MultiHead Self-Attention (MHSA)
○ Feed Forward Network (FFN)
○ Residual connection (Add)
● You only need to focus on how to implement the function in the red box
● If you only want to complete the project without understanding the full ViT algorithm, you can skip the sections marked in red
ViT(Vision Transformer) – Image Pre-processing
● Normalize, resize to (300,300,3), and center crop to (224,224,3)
ViT(Vision Transformer) – Patch Encoder
● In this project, we use a Conv2D as the Patch Encoder, with kernel_size = (16,16), stride = (16,16), and output_channel = 768
● (224,224,3) -> (14,14, 16*16*3) -> (196, 768)
ViT(Vision Transformer) – Class Token
● Now we have 196 tokens and each token has 768 features
● In order to record global information, we concatenate one learnable class token with the 196 tokens
● (196,768) -> (197,768)
ViT(Vision Transformer) – Position Embedding
● Add the learnable position information to the patch embedding
● (197,768) + position_embedding(197,768) -> (197,768)
ViT(Vision Transformer) – Layer Normalization
● Normalize each token
● You need to normalize with the layer normalization formula, applied to each token over its C features
(T = # of tokens, C = embedded dimension)
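As a reference for the layernorm testbench, here is a minimal sketch of per-token layer normalization over a flat [T*C] row-major buffer. The gamma/beta (scale/shift) parameters and the epsilon value are assumptions, so match the argument order and shapes used in the provided ./layer code.

#include <cmath>

// Minimal layer-norm sketch: normalize each of the T tokens over its C
// features, then apply a learnable scale (gamma) and shift (beta).
// gamma/beta and eps are assumptions -- match the provided layer interface.
void layernorm(const float* in, float* out, const float* gamma,
               const float* beta, int T, int C, float eps = 1e-5f) {
    for (int t = 0; t < T; ++t) {
        const float* x = in + t * C;
        float* y = out + t * C;

        float mean = 0.0f;
        for (int c = 0; c < C; ++c) mean += x[c];
        mean /= C;

        float var = 0.0f;
        for (int c = 0; c < C; ++c) {
            float d = x[c] - mean;
            var += d * d;
        }
        var /= C;

        float inv_std = 1.0f / std::sqrt(var + eps);
        for (int c = 0; c < C; ++c)
            y[c] = (x[c] - mean) * inv_std * gamma[c] + beta[c];
    }
}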
ViT(Vision Transformer) – MultiHead Self Attention (1)
● Wk, Wq, Wv ∈ R^(C×C)
● bq, bk, bv ∈ R^C
● Wo ∈ R^(C×C)
● bo ∈ R^C
[Figure: X -> Input Linear Projection (Wq, Wk, Wv, bq, bk, bv) -> split into heads -> Attention per head -> merge heads -> Output Linear Projection (Wo, bo) -> Y]
ViT(Vision Transformer) – MultiHead Self Attention (2)
● Get Q, K, V ∈ R^(T×(NH*H)) after the input linear projection
● Split Q, K, V into Q1, Q2, ..., QNH; K1, K2, ..., KNH; V1, V2, ..., VNH ∈ R^(T×H)
Linear projection: Q = X·Wq^T + bq, K = X·Wk^T + bk, V = X·Wv^T + bv
(T = # of tokens, C = embedded dimension, H = hidden dimension, NH = # of heads, C = H * NH)
[Figure: linear projection and split into heads]
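A minimal sketch of the input linear projection, assuming a row-major [T*C] input and a packed [T*3*C] qkv buffer with the q, k, and v vectors stored contiguously per token; the buffer layout and function signature are assumptions, so follow the provided ./layer code.

// Sketch of Q = X*Wq^T + bq (and likewise for K, V), writing into a packed
// qkv buffer of shape [T*3*C] with q, k, v contiguous per token.
// Layout and signature are assumptions -- match the provided interface.
void qkv_projection(const float* X,                    // [T*C]
                    const float* Wq, const float* bq,  // [C*C], [C]
                    const float* Wk, const float* bk,
                    const float* Wv, const float* bv,
                    float* qkv,                        // [T*3*C]
                    int T, int C) {
    const float* W[3] = {Wq, Wk, Wv};
    const float* b[3] = {bq, bk, bv};
    for (int t = 0; t < T; ++t) {
        for (int m = 0; m < 3; ++m) {        // 0 = q, 1 = k, 2 = v
            float* out = qkv + t * 3 * C + m * C;
            for (int o = 0; o < C; ++o) {
                float acc = b[m][o];
                for (int i = 0; i < C; ++i)
                    acc += X[t * C + i] * W[m][o * C + i];  // X * W^T
                out[o] = acc;
            }
        }
    }
}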
ViT(Vision Transformer) – MultiHead Self Attention (2)
● For each head i, compute Si = Qi·Ki^T / sqrt(H) ∈ R^(T×T)
● Pi = Softmax(Si) ∈ R^(T×T), where Softmax is a row-wise function
● Oi = Pi·Vi ∈ R^(T×H)
[Figure: Qi, Ki -> matrix multiplication and scale -> Softmax -> matrix multiplication with Vi -> Oi]
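A minimal sketch of one attention head: Si = Qi·Ki^T / sqrt(H), a row-wise softmax, and Oi = Pi·Vi. The pointers Qh, Kh, Vh, Oh are assumed to be contiguous [T*H] slices for a single head; in the real code they must be indexed out of the packed buffers according to the provided layout.

#include <cmath>
#include <vector>

// One attention head: S = Q K^T / sqrt(H), P = row-wise softmax(S), O = P V.
// Qh/Kh/Vh/Oh are [T*H] slices for a single head (an assumption -- the real
// buffers are packed, so index them as the provided layer code expects).
void attention_head(const float* Qh, const float* Kh, const float* Vh,
                    float* Oh, int T, int H) {
    const float scale = 1.0f / std::sqrt((float)H);
    std::vector<float> row(T);
    for (int i = 0; i < T; ++i) {
        // scores of query i against every key j, scaled by 1/sqrt(H)
        float maxv = -1e30f;
        for (int j = 0; j < T; ++j) {
            float s = 0.0f;
            for (int k = 0; k < H; ++k)
                s += Qh[i * H + k] * Kh[j * H + k];
            row[j] = s * scale;
            if (row[j] > maxv) maxv = row[j];
        }
        // numerically stable row-wise softmax
        float sum = 0.0f;
        for (int j = 0; j < T; ++j) {
            row[j] = std::exp(row[j] - maxv);
            sum += row[j];
        }
        for (int j = 0; j < T; ++j) row[j] /= sum;
        // O_i = P_i * V
        for (int k = 0; k < H; ++k) {
            float acc = 0.0f;
            for (int j = 0; j < T; ++j)
                acc += row[j] * Vh[j * H + k];
            Oh[i * H + k] = acc;
        }
    }
}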
ViT(Vision Transformer) – MultiHead Self Attention (3)
● Oi ∈ R^(T×H), O = [O1, O2, ..., ONH] ∈ R^(T×C)
● output = O·Wo^T + bo
(T = # of tokens, C = embedded dimension, H = hidden dimension, NH = # of heads)
[Figure: merge heads and output linear projection]
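A minimal sketch of the output projection output = O·Wo^T + bo, after the NH head outputs have been concatenated back into a [T*C] buffer (C = NH*H); the signature is an assumption.

// Output projection after merging heads: output = O * Wo^T + bo, where O is
// [T*C] with the head outputs concatenated along the feature dimension.
// Signature is an assumption -- match the provided layer code.
void output_projection(const float* O, const float* Wo, const float* bo,
                       float* out, int T, int C) {
    for (int t = 0; t < T; ++t)
        for (int o = 0; o < C; ++o) {
            float acc = bo[o];
            for (int i = 0; i < C; ++i)
                acc += O[t * C + i] * Wo[o * C + i];
            out[t * C + o] = acc;
        }
}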
ViT(Vision Transformer) – Feed Forward Network
● Apply an input linear projection from C to OC features, then GeLU, then an output linear projection back from OC to C features
● (T, C) -> (T, OC) -> GeLU -> (T, OC) -> (T, C)
(T = # of tokens, C = embedded dimension, OC = hidden dimension)
[Figure: Input Linear Projection -> GeLU -> Output Linear Projection]
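A minimal sketch of the feed forward network: an input projection from C to OC, GeLU, then an output projection back to C. The weight shapes ([OC*C] and [C*OC]), the signature, and the erf-based GeLU used here are assumptions; follow the provided weights and the gelu layer you implement.

#include <cmath>

// Exact (erf-based) GeLU, used here only for illustration; the project may
// expect the tanh approximation instead (see the GeLU slide).
static float gelu(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// FFN sketch: hidden = GeLU(X*W1^T + b1), out = hidden*W2^T + b2.
// W1 is [OC*C], W2 is [C*OC]; shapes and signature are assumptions.
void feedforward(const float* X, const float* W1, const float* b1,
                 const float* W2, const float* b2,
                 float* hidden /*[T*OC]*/, float* out /*[T*C]*/,
                 int T, int C, int OC) {
    for (int t = 0; t < T; ++t) {
        for (int o = 0; o < OC; ++o) {
            float acc = b1[o];
            for (int i = 0; i < C; ++i)
                acc += X[t * C + i] * W1[o * C + i];
            hidden[t * OC + o] = gelu(acc);
        }
        for (int o = 0; o < C; ++o) {
            float acc = b2[o];
            for (int i = 0; i < OC; ++i)
                acc += hidden[t * OC + i] * W2[o * OC + i];
            out[t * C + o] = acc;
        }
    }
}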
ViT(Vision Transformer) – GeLU
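The original slide shows the GeLU formula and curve. Below is a sketch of the widely used tanh approximation of GeLU; whether gelu_tb expects this approximation or the exact erf-based form is an assumption, so verify against the testbench.

#include <cmath>

// Tanh approximation of GeLU:
// GeLU(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
// Whether the testbench expects this form or the exact erf-based GeLU is an
// assumption -- verify against gelu_tb.
float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2/pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}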
ViT(Vision Transformer) – Classifier
● Contains a Linear layer to transform 768 features into 200 classes
○ (197, 768) -> (197, 200)
● Only the first token (class token) is used for the prediction
○ (197, 200) -> (1, 200)
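In the project this step is run in Python (classifier.py); the sketch below only illustrates the computation: a linear layer applied to the class token (first row), followed by argmax. The [num_classes * C] weight layout is an assumption.

// Classifier sketch: logits = class_token * W^T + b, then argmax.
// Applying the linear layer only to the first row is equivalent to applying
// it to all 197 tokens and then keeping row 0. W layout is an assumption.
int classify(const float* tokens /*[T*C]*/, const float* W, const float* b,
             int C, int num_classes) {
    const float* cls = tokens;          // first token (class token)
    int best = 0;
    float best_logit = -1e30f;
    for (int o = 0; o < num_classes; ++o) {
        float logit = b[o];
        for (int i = 0; i < C; ++i)
            logit += cls[i] * W[o * C + i];
        if (logit > best_logit) { best_logit = logit; best = o; }
    }
    return best;
}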
ViT(Vision Transformer) – Work Flow
[Workflow diagram: Load_weight -> m5_dump_init -> Pre-processing -> Embedder -> Transformer x12 (layernorm -> MHSA -> residual -> layernorm -> FFN -> residual) -> layernorm -> Classifier -> Argmax -> prediction (e.g. Black footed Albatross) -> m5_dump_stat; MHSA = matmul + attention, FFN = matmul + gelu + matmul]
● Testbench commands for each stage:
$ make layernorm_tb
$ make matmul_tb
$ make gelu_tb
$ make MHSA_tb
$ make feedforward_tb
$ make transformer_tb
$ run_all.sh
ViT(Vision Transformer) – Shape of array
● layernorm input/output: [T*C]
● MHSA input/output/o: [T*C]
● MHSA qkv: [T*3*C] (a q, k, and v vector of length C for every token)
● feedforward input/output: [T*C]
● feedforward gelu: [T*OC]
(arrays are stored token by token: token 1, token 2, ..., token T)
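A small sketch of how these flat arrays can be indexed, assuming row-major storage with tokens as rows; the packed qkv layout (q, k, v contiguous per token) is an assumption inferred from the diagram, so confirm it against the provided code.

// Row-major indexing for the flat buffers described above.
// input/output: [T*C]   -> element (t, c) at t*C + c
// gelu buffer:  [T*OC]  -> element (t, o) at t*OC + o
// MHSA qkv:     [T*3*C] -> assumed layout: q, k, v contiguous per token
inline float& at(float* buf, int t, int c, int C) { return buf[t * C + c]; }

inline float& qkv_at(float* qkv, int t, int which /*0=q,1=k,2=v*/, int c, int C) {
    return qkv[t * 3 * C + which * C + c];
}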
Common problem
● Segmentation fault
○ Ensure that you are not accessing an invalid memory address
○ If the segmentation fault comes from a stack overflow (e.g. large local arrays), enter the command $ ulimit -s unlimited
All you have to do is
● Download TA’s Gem5 image
○ docker pull yenzu/ca_final_part2:2024
● Write the C++ code in the ./layer folder, understanding the algorithms described above
○ make clean
○ make <layer>_tb
○ ./<layer>_tb
All you have to do is
● Ensure that the ViT successfully classifies the bird
○ python3 embedder.py --image_path images/Black_Footed_Albatross_0001_796111.jpg --embedder_path weights/embedder.pth --output_path embedded_image.bin
○ g++ -static main.cpp layer/*.cpp -o process
○ ./process
○ python3 run_model.py --input_path result.bin --output_path torch_pred.bin --model_path weights/model.pth
○ python3 classifier.py --prediction_path torch_pred.bin --classifier_path weights/classifier.pth
○ After running the above commands, you will get the top-5 prediction.
● Evaluate the performance of the part of the ViT consisting of layernorm + MHSA + residual
○ The simulation needs about 3.5 hours to finish
○ Check stat.txt
Grading Policy
● (50%) Verification
○ (10%) matmul_tb
○ (10%) layernorm_tb
○ (10%) gelu_tb
○ (10%) MHSA_tb
○ (10%) transformer_tb
● (50%) Performance
○ max(sigmoid((27.74 - student latency) / student latency) * 70, 50)
● You will get 0 performance points if your design is not verified.
Submission
● Please submit your code on E3 before 23:59 on June 20, 2024.
● Format
○ Code: please put your code in a folder named FP2_team<ID>_code and compress it into a zip file.
● Late submission is not allowed.
● Plagiarism is forbidden, otherwise you will get 0 points!!!
FP2_team<ID>_code folder
● You should include the following files
○ matmul.cpp
○ layernorm.cpp
○ gelu.cpp
○ attention.cpp
○ residual.cpp