Lab 6&7&8
Professor Julien Maitre, Ph.D. Winter 2023
This lab is graded out of 100 points and counts for 20% of the final grade for this course. You must form groups of 7 or 8 people to complete this lab.
1. General Description
The general objective of this lab is to manipulate and process data. This is the step that precedes applying machine learning algorithms for knowledge extraction or classification. This lab will therefore help you learn the Python programming language and its data science libraries (e.g., NumPy, Pandas, Matplotlib, SciPy). In addition, you will produce a scientific report that analyzes and explores the data and describes the processing steps you applied. The scientific report is limited to 10 pages. The names of all students in the group should appear on the first page.
2. Formalities
The deadline for submitting your work is March 16th, 2023, at 11:59 p.m. (China time). After this deadline, a penalty of 10% per day of delay will apply. Email me a WeTransfer link to the scientific report and the code.
3. What is expected?
The scientific report should include:
• the description of your dataset
  o For example:
    ▪ what are the variables?
    ▪ the meaning of each variable;
    ▪ the number of instances;
    ▪ the number of classes (if applicable);
    ▪ the values (e.g., min-max interval) that each variable can take.
• data checking and pre-processing
  o For example:
    ▪ how many missing values does your dataset have?
    ▪ what method(s) did you use to handle these missing values?
    ▪ a summary of the number of instances per class (if applicable);
    ▪ the statistics (e.g., mean, variance, standard deviation) for each variable.
  o In this part, you could use data visualization tools.
• a statistical study of the data, with analyses/interpretations of this statistical study
  o For example:
    ▪ statistical study of each variable with respect to each class;
    ▪ statistical study of each variable with respect to each other variable;
    ▪ hypothesis tests;
    ▪ correlation between two or more variables;
    ▪ Chi-square tests.
  o In this part, do not hesitate to use data visualization tools.
• a conclusion of the study
  o Summarize the essential information from your data analysis/exploration. What did you learn about the data, or thanks to the data?
• a general conclusion
  o Summarize what you appreciated, what you learned, and what you appreciated less in this lab.
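As an illustration only, some of the checks listed above could be sketched in Python as follows. The toy DataFrame, its column names (age, income, gender, label), the choice of median imputation, and the choice of Welch's t-test are all assumptions made for this sketch; they are not requirements of the lab, and your own dataset will call for its own choices.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame({
    "age":    [23, 35, np.nan, 41, 29, 52, 47, 31],
    "income": [30.0, 52.5, 41.0, np.nan, 38.0, 75.0, 66.0, 45.0],
    "gender": ["F", "M", "F", "M", "F", "M", "M", "F"],
    "label":  ["no", "yes", "no", "yes", "no", "yes", "yes", "no"],
})

# --- Description of the dataset ---
n_instances, n_variables = df.shape
numeric_cols = list(df.select_dtypes(include="number").columns)
value_ranges = df[numeric_cols].agg(["min", "max"])  # min-max interval per variable
n_classes = df["label"].nunique()

# --- Data checking and pre-processing ---
missing_per_variable = df.isna().sum()
# One possible strategy (an assumption): replace missing numeric values
# with the median of the corresponding column.
df_filled = df.fillna(df[numeric_cols].median())
instances_per_class = df_filled["label"].value_counts()
summary = df_filled[numeric_cols].agg(["mean", "var", "std"])

# --- Statistical study ---
# Statistics of each numeric variable with respect to each class.
per_class_stats = df_filled.groupby("label")[numeric_cols].agg(["mean", "std"])
# Pearson correlation between the numeric variables.
correlation = df_filled[numeric_cols].corr()

# Hypothesis test: does mean income differ between the two classes?
# Welch's t-test (does not assume equal variances).
income_yes = df_filled.loc[df_filled["label"] == "yes", "income"]
income_no = df_filled.loc[df_filled["label"] == "no", "income"]
t_stat, t_p = stats.ttest_ind(income_yes, income_no, equal_var=False)

# Chi-square test of independence between two categorical variables.
contingency = pd.crosstab(df_filled["gender"], df_filled["label"])
chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency)

print(f"{n_instances} instances, {n_variables} variables, {n_classes} classes")
print(missing_per_variable)
print(summary)
print(correlation)
print(f"t-test p-value: {t_p:.4f}, chi-square p-value: {chi2_p:.4f}")
```

With only eight rows, the p-values printed here carry no real meaning; the sketch only shows which calls could produce the numbers discussed in the report. Plots (for example with Matplotlib) would complement these tables.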
4. Details
Regarding the dataset, there is only one restriction: the number of variables should be more than 7 and fewer than 12. I also recommend that you select a dataset that has classes. If the dataset has more than 12 variables, you can remove variables to reach the maximum number. Finally, search the Web for a dataset in a field that interests you, for more "fun" (e.g., bioinformatics, marketing, commerce, etc.). Here is a sample of web links that provide access to datasets: