Project 3, APS1070 Fall 2023
PCA [11 marks]
Deadline: Nov 17th, 23:00
Academic Integrity
This project is individual - it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).
Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.
In this project we work on a temperature dataset that reports the average earth surface temperature for different cities for each month over the years 1992-2006.
Please fill out the following:
- Name:
- Student number:
How to submit (HTML + IPYNB)
- Download your notebook: File -> Download .ipynb
- Click on the Files icon on the far left menu of Colab.
- Select & upload the `.ipynb` file you just downloaded, and then obtain its path (right click). You might need to hit the Refresh button before your file shows up.
- Execute the following in a Colab cell:
  %%shell
  jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
- An HTML version of your notebook will appear in the files, so you can download it.
- Submit both `HTML` and `IPYNB` files on Quercus for grading.
Part 1: Getting started with GitHub [1 Mark + 1 Mark Git Submission]
This first part of the project assignment is to be completed independently from Parts 2 - 5. In this part you will complete some coding tasks and submit your results on GitHub. To access this part of the assignment and upload your answers, you will need to use GitHub. Please complete the following step-by-step instructions:
0. Create a GitHub account and install git for Windows or Mac.
1. Open this link: https://classroom.github.com/a/BWpQKQJt to create your assignment repository in GitHub. You should get a link similar to:
   https://github.com/APS1070-UofT/project-3-part-1-*********
   This is your private repository, where you get the questions for this part and upload your answers. Copy this link to the text box below to be graded for this part.
2. Open `Git Bash`, the app you downloaded in step 0, and set your email and username with:
   git config --global user.email "<your-GitHub-email>"
   git config --global user.name "<your-GitHub-username>"
3. Create a folder for the course on your computer and `cd` to it. `cd` means Change Directory. For example, on a Windows machine with a folder at "C:\aps1070":
   cd c:/aps1070
4. Get your assignment using the link you got in step 1:
   git clone https://github.com/APS1070-UofT/project-3-part-1-*********
5. A new folder should be created in your directory similar to:
   C:\aps1070\project-3-part-1-********
   This folder has an `ipynb` notebook which you need to manually upload to Colab and answer its questions.
6. After you have finished working on this notebook, download the notebook from Colab and move it to the directory in step 5.
7. Replace the old notebook with the new one that has your answers. Make sure your completed notebook has the same name as the original notebook you downloaded.
8. To submit your work, run:
   cd <your assignment folder>
   git add F23_Project_3_Part_1_git.ipynb
   git commit -m "Final Submission"
   git push
If you have any problem pushing your work to GitHub, you can try one of the following commands:
   git push --force
   git push origin HEAD:main
Make sure your submission is ready for grading. Open the private repository link in your browser and make sure you can see your final submission with your latest changes there. Only you and the teaching team can open that link.
Part 2: Applying PCA [2 Marks]
- Compute the covariance matrix of the dataframe. Hint: The dimensions of your covariance matrix should be (180, 180). [0.25]
- Write a function `get_sorted_eigen(df_cov)` that gets the covariance matrix of dataframe `df` (from step 1), and returns sorted eigenvalues and eigenvectors using `np.linalg.eigh`. [0.25]
- Show the effectiveness of your principal components in covering the variance of the dataset with a `scree plot`. [0.25]
- How many PCs do you need to cover 99% of the dataset's variance? [0.25]
- Plot the first 16 principal components (eigenvectors) as a time series (16 subplots, on the x-axis you have dates and on the y-axis you have the value of the PC element). [0.5]
- Compare the first two PCs with the rest of them. Do you see any difference in their trend? [0.5]
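As a rough illustration of the first few steps above, here is a minimal sketch on synthetic data. It assumes the data is arranged with one row per city and one column per month (which is what gives a 180 x 180 covariance matrix) and that it has already been standardized; the random array, the standardization step, and the variable names other than `get_sorted_eigen` are placeholders, not part of the handout.

```python
import numpy as np


def get_sorted_eigen(df_cov):
    """Return eigenvalues in descending order and the matching eigenvectors (as columns)."""
    eigenvalues, eigenvectors = np.linalg.eigh(np.asarray(df_cov))
    order = np.argsort(eigenvalues)[::-1]  # eigh returns eigenvalues in ascending order
    return eigenvalues[order], eigenvectors[:, order]


# Synthetic stand-in for the real dataset: 50 "cities" x 180 "months" (assumption).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 180))
data_std = (data - data.mean(axis=0)) / data.std(axis=0)

# Covariance between columns (months), so the result is (180, 180).
cov = np.cov(data_std, rowvar=False)
eigenvalues, eigenvectors = get_sorted_eigen(cov)

# Cumulative share of the total variance covered by the first k components.
explained = np.cumsum(eigenvalues) / np.sum(eigenvalues)

# Smallest k whose cumulative share reaches 99%.
n_components_99 = int(np.argmax(explained >= 0.99) + 1)
print(cov.shape, n_components_99)
```

With a pandas dataframe, `df.cov()` computes the same column-wise covariance. The `explained` array is the quantity a scree plot would typically show.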
### YOUR CODE HERE ###
Part 3: Data reconstruction [3 Marks]
Create a function that:
- Accepts a city and the original dataset as inputs.
- Calls useful functions that you designed in previous parts to compute eigenvectors and eigenvalues.
Plots 4 figures:
- The original time-series for the specified city. [0.5]
- The incremental reconstruction of the original (not standardized) time-series for the specified city in a single plot. [1.5]
You should show at least 5 curves in a figure for incremental reconstruction. For example, you can pick the following (or any other combination that you think is reasonable):
- Reconstruction with only PC1
- Reconstruction with both PC1 and PC2
- Reconstruction with PC1 to PC4 (First 4 PCs)
- Reconstruction with PC1 to PC8 (First 8 PCs)
- Reconstruction with PC1 to PC16 (First 16 PCs)
Hint: you need to compute the reconstruction for the standardized time-series first, and then scale it back to the original (non-standardized) form using the StandardScaler `inverse_transform` method.
- The residual error for your best reconstruction with respect to the original time-series. [0.5]
  - Hint: You are plotting the error that we have for reconstructing each month `(df - df_reconstructed)`. On the x-axis, you have dates, and on the y-axis, the residual error.
- The RMSE of the reconstruction as a function of the number of included components (x-axis is the number of components and y-axis is the RMSE). Sweep x-axis from 1 to 10 (this part is independent from part 3.2.) [1]
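The reconstruction, residual, and RMSE steps above can be sketched as follows. This is only an illustration on synthetic data, with plotting left out. It assumes one row per city and one column per month; the standardization is done by hand with NumPy (the multiply-by-std, add-mean step plays the role that StandardScaler's `inverse_transform` would play), and `reconstruct`, `city_index`, and `rmse_by_k` are hypothetical names.

```python
import numpy as np

# Synthetic stand-in for the dataset: 40 "cities" (rows) x 24 "months" (columns).
rng = np.random.default_rng(1)
raw = rng.normal(loc=15.0, scale=5.0, size=(40, 24))

# Standardize each column; keep mean and std to undo the scaling later.
mean, std = raw.mean(axis=0), raw.std(axis=0)
standardized = (raw - mean) / std

# Eigenvectors of the covariance matrix, sorted by descending eigenvalue.
cov = np.cov(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]


def reconstruct(row_std, n_components):
    """Project one standardized row onto the first n PCs, then undo the scaling."""
    basis = eigenvectors[:, :n_components]  # (n_features, n_components)
    scores = row_std @ basis                # coordinates in PC space
    approx_std = scores @ basis.T           # back to standardized feature space
    return approx_std * std + mean          # back to original units


city_index = 0  # hypothetical choice of city
original = raw[city_index]

# RMSE (in original units) as a function of the number of components, 1 to 10.
rmse_by_k = []
for k in range(1, 11):
    approx = reconstruct(standardized[city_index], k)
    rmse_by_k.append(float(np.sqrt(np.mean((original - approx) ** 2))))

# Residual error per month for the reconstruction with 10 components.
residual = original - reconstruct(standardized[city_index], 10)
```

Using every component reproduces the original row up to floating-point error, which is a handy sanity check before plotting the incremental curves.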
Part 4: SVD [2 Marks]
Modify your code in part 3 to use SVD instead of PCA for extracting the eigenvectors. [1]
Explain if standardization or covariance computation is required for this part. Repeat part 3 and compare your PCA and SVD results. Write a function to make this comparison [0.5], and comment on the results. [0.5]
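One way to see how the two routes relate is the small check below, a sketch on synthetic, column-centered data rather than a full answer to this part. It relies on the standard identity that, for a centered matrix with n rows, the covariance eigenvalues equal the squared singular values divided by n - 1, and that the right singular vectors match the eigenvectors up to sign. The function name `compare_pca_svd` and all variable names are hypothetical.

```python
import numpy as np

# Synthetic stand-in: 30 samples x 12 features, centered column-wise.
rng = np.random.default_rng(2)
data = rng.normal(size=(30, 12))
centered = data - data.mean(axis=0)

# PCA route: eigendecomposition of the covariance matrix.
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# SVD route: decompose the centered data directly (no covariance matrix).
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
svd_eigvals = singular_values ** 2 / (centered.shape[0] - 1)


def compare_pca_svd(eigvals, eigvecs, svd_eigvals, vt):
    """Check that both routes give the same variances and the same directions."""
    same_values = bool(np.allclose(eigvals, svd_eigvals))
    # Directions may differ by sign, so compare absolute cosines column by column.
    cosines = np.abs(np.sum(eigvecs * vt.T, axis=0))
    same_directions = bool(np.allclose(cosines, 1.0))
    return same_values, same_directions


print(compare_pca_svd(eigvals, eigvecs, svd_eigvals, vt))
```

Because a component and its negation span the same direction, reconstructions built from either set of vectors should coincide even when individual signs differ.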
### YOUR CODE HERE ###
Part 5: Let's collect another dataset! [2 Marks]
Create another dataset similar to the one provided in your handout using the raw information on average temperatures per state (not city) provided here. [1]
You need to manipulate the data to organize it in the desired format (i.e., the same format that was used in previous parts). If there is a missing value for the average temperature of a particular state at a given date, make sure to remove that date completely from the dataset, even if the data for that specific date exists for other states.
Upload your new dataset (in CSV format) to your colab notebook and repeat part 4. When analyzing the states, you may use `Jilin`, `Nunavut`, `Rio Grande Do Norte`, `Louisiana`, and `Tasmania`. [1]
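The reshaping and missing-value rule described above could look roughly like this in pandas. The sketch uses a tiny made-up sample in a long format (one row per state and date); the column names `dt`, `AverageTemperature`, and `State` are assumptions about the raw file, and the target layout of one row per state and one column per date is inferred from the earlier parts.

```python
import io

import pandas as pd

# Tiny made-up sample; the real raw file and its column names may differ.
raw_csv = io.StringIO(
    "dt,AverageTemperature,State\n"
    "2000-01-01,1.0,A\n"
    "2000-01-01,2.0,B\n"
    "2000-02-01,,A\n"
    "2000-02-01,3.0,B\n"
    "2000-03-01,4.0,A\n"
    "2000-03-01,5.0,B\n"
)
long_df = pd.read_csv(raw_csv)

# One row per state, one column per date.
wide_df = long_df.pivot_table(
    index="State", columns="dt", values="AverageTemperature"
)

# Drop every date (column) that is missing for at least one state.
wide_df = wide_df.dropna(axis=1, how="any")
print(wide_df)
```

Here the date 2000-02-01 is removed because state A has no value for it, even though state B does. The result can then be written out with `wide_df.to_csv(...)` before uploading it to the notebook.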
The code below helps you to upload your new CSV file to your colab session.
# load train.csv to Google Colab
from google.colab import files
uploaded = files.upload()
### YOUR CODE HERE ###
References
Understanding PCA and SVD:
https://towardsdatascience.com/pca-and-svd-explained-with-numpy-5d13b0d2a4d8
https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.8-Singular-Value-Decomposition/
PCA:
Snippets from: https://plot.ly/ipython-notebooks/principal-component-analysis/
https://www.value-at-risk.net/principal-component-analysis/
Temperature Data: