Assessment Overview
Over the course of this module you have been introduced to a range of techniques that may be used for programming a big data project. This assessment allows you to pull together these techniques in a realistic scenario to complete a big data analysis project. Below is a realistic project scenario. By using the techniques presented during class you are to carry out the project and write a final project report for your client.
Project Scenario
You have been approached by a client who analyses atmospheric science and climate model data. They have developed a new analysis technique, but it takes too long to run for them to use it. They have asked you to investigate the use of big data techniques to reduce the processing time.
They have a large volume of data to process, and the analysis needs to be repeated frequently. They have the following basic requirements:
1. Current analysis time is approximately 2.5 hours to analyse the climate model output data for a 1-hour time period.
2. The data for a single day of model output is approximately 250 MB. However, they have over 100 years’ worth of data to analyse, making a total of over 9 TB.
3. Each day, they need to analyse the new data set for that day, so they wish to complete the analysis of the data for a 24-hour period (25 data sets) in under 2 hours.
4. It is not possible to hold all of this data in memory at one time, so the new process should load only 1 hour of data for processing at a time. If parallel processing is used, 1 hour of data per worker can be loaded as needed.
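The figures in these requirements imply a target speed-up that can be worked out directly. The short Python sketch below does that arithmetic. The variable names are illustrative, and it assumes that the 2.5-hour figure applies equally to each of the 25 data sets and that a year has 365 days (leap days ignored).

```python
import math

HOURS_PER_DATASET = 2.5   # requirement 1: analysis time for one 1-hour data set
DATASETS_PER_DAY = 25     # requirement 3: data sets in one 24-hour period
TARGET_HOURS = 2.0        # requirement 3: daily analysis must finish in under this
MB_PER_DAY = 250          # requirement 2: approximate size of one day of output
YEARS_OF_DATA = 100       # requirement 2: "over 100 years"

# Time to analyse one day of data one data set after another.
sequential_hours = HOURS_PER_DATASET * DATASETS_PER_DAY

# Factor by which the sequential time must shrink to reach the target.
required_speedup = sequential_hours / TARGET_HOURS

# Assumption: ideal (linear) scaling. Real scaling is usually worse.
ideal_workers = math.ceil(required_speedup)

# Archive size in decimal units (1 TB = 1,000,000 MB).
archive_tb = MB_PER_DAY * 365 * YEARS_OF_DATA / 1_000_000

print(f"Sequential time per day of data: {sequential_hours} h")
print(f"Required speed-up: {required_speedup}")
print(f"Workers needed with ideal scaling: {ideal_workers}")
print(f"Approximate archive size: {archive_tb} TB")
```

Under these assumptions sequential processing takes 62.5 hours per day of data, so a speed-up of more than 31.25 is needed, which means at least 32 workers even with perfect scaling. This in turn assumes that the work inside each 1-hour data set can itself be divided among workers. If whole data sets were the smallest unit of work, each would still take about 2.5 hours.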
You have been tasked with investigating the use of parallel processing to achieve the analysis speed required, with the following expectations:
1. Test and compare the processing speed of sequential and parallel processing.
2. Extrapolate your findings to indicate the number of processors required to achieve the target processing time.
3. Test how your code responds to common errors, e.g. data that is text instead of numeric, use of NaN in the data as an error code.
4. Run automated tests that allow your client to set the tests running and return later to see the results, without user intervention.
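One possible shape for the timing comparison in expectations 1 and 4 is sketched below in Python, using only the standard library. It is not the client's analysis code: `placeholder_analysis` is a hypothetical stand-in for it, and the function names, chunk sizes and worker counts are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def time_sequential(task, items):
    """Run task on each item in turn and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for item in items:
        task(item)
    return time.perf_counter() - start


def time_parallel(task, items, n_workers, executor_cls=ProcessPoolExecutor):
    """Run task over items on n_workers workers and return elapsed seconds.

    The measurement includes the cost of starting and stopping the pool.
    """
    start = time.perf_counter()
    with executor_cls(max_workers=n_workers) as pool:
        # list() forces every result to be collected before the clock stops.
        list(pool.map(task, items))
    return time.perf_counter() - start


def compare(task, items, worker_counts, executor_cls=ProcessPoolExecutor):
    """Return (sequential_seconds, {n_workers: parallel_seconds})."""
    sequential = time_sequential(task, items)
    parallel = {n: time_parallel(task, items, n, executor_cls) for n in worker_counts}
    return sequential, parallel


def placeholder_analysis(chunk):
    """Hypothetical CPU-bound stand-in for the client's analysis code."""
    return sum(i * i for i in range(chunk))


if __name__ == "__main__":
    chunks = [200_000] * 8  # eight equal pieces of dummy work
    sequential, parallel = compare(placeholder_analysis, chunks, [1, 2, 4])
    print(f"sequential: {sequential:.3f} s")
    for workers, seconds in parallel.items():
        print(f"{workers} worker(s): {seconds:.3f} s")
```

Because the script runs from start to finish without prompting, it can be left running unattended and its printed timings redirected to a file for later inspection. The timings it returns are also the raw material for the comparison plots.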
The data has been provided by the European Centre for Medium-Range Weather Forecasts (ECMWF).
Project Deliverables
Your project should deliver the following:
1. Working code that demonstrates:
a. Loading of only the data required for the processing taking place
b. Sequential processing of the data
c. Parallel processing of the data
d. Plots of the comparisons between sequential processing and parallel processing with different numbers of workers
e. Automated testing of your code to deal with pre-defined data error types.
2. A formal project report for your client covering:
a. Comparisons between parallel and sequential data processing
b. Estimated number of processors required to achieve the goal of processing 24-hours of data in under 2 hours.
c. Testing the code to see how it deals with:
i. Text instead of numeric values
ii. NaN values indicating data errors.
iii. Note: it is not necessary to solve these problems to pass, but you should be able to suggest methods of dealing with these problems so code will not crash.
d. A summary of the evidence generated during your project and how it helps you arrive at your conclusions
e. Recommendations
f. References
g. Appendices containing:
i. Code flow charts
ii. Gantt chart for your project
iii. Logbook
iv. Specification items
3. VIVA / presentation. You will be expected to present your work in a formal presentation / VIVA. Details of this can be found in the VIVA assessment brief.
This assessment brief covers only parts 1 and 2. The assessment brief for part 3, VIVA, is found in a separate document.
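Deliverable 2b asks for an estimate of the number of processors required. One way to extrapolate from measured timings is sketched below in Python. It uses Amdahl's law with a serial fraction estimated by the Karp-Flatt metric. This model is an assumption introduced here for illustration, not a method prescribed by this brief, and the function names are hypothetical.

```python
import math


def serial_fraction(t1, tn, n):
    """Estimate the non-parallelisable fraction of the work (Karp-Flatt metric).

    t1: measured time on one worker; tn: measured time on n workers (n > 1).
    """
    speedup = t1 / tn
    return (1 / speedup - 1 / n) / (1 - 1 / n)


def predicted_time(t1, n, s):
    """Amdahl's law: predicted time on n workers for serial fraction s."""
    return t1 * (s + (1 - s) / n)


def workers_needed(t1, target, s):
    """Smallest whole n with predicted_time(t1, n, s) < target, or None."""
    budget = target / t1 - s  # share of t1 left over for the parallel part
    if budget <= 0:
        return None  # the serial part alone already uses up the target
    # Strictly under the target, so step past the exact break-even point.
    return math.floor((1 - s) / budget) + 1


if __name__ == "__main__":
    t1 = 62.5  # hours: 25 data sets at 2.5 hours each, from the scenario
    for s in (0.0, 0.02, 0.05):
        print(f"serial fraction {s}: workers needed = {workers_needed(t1, 2.0, s)}")
```

With the scenario figures, a serial fraction of zero gives 32 workers and a serial fraction of 2% gives 82. Under this model no number of workers reaches the 2-hour target once the serial fraction is 3.2% (2 / 62.5) or more, which is why the measured serial fraction matters as much as the worker count.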
Additional Information
1. You will be provided with NetCDF data files:
a. One complete, correct data file
b. One file containing instrument errors, recorded as NaN.
c. One file containing a data storage error, where the numerical values have been saved as text.
2. You are provided with code files for the analysis technique. You should not edit these files in any way. You are required to run the analysis, for timing purposes, but are not expected to analyse, display, report on, or deal with the results of the analysis in any way.
3. You are expected to define your project by means of a list of 5 SMART specification items. These should be included in an appendix.
4. You are expected to plan the work required for this project and provide a complete Gantt chart, including identifying the critical path. This should be included in an appendix.
5. This is a formal report, and appropriate formal grammar and language are expected. Where this is not the case, a penalty of up to 10% may be applied to the marks for the report structure. For help with formal writing, please contact the Centre for Academic Writing.
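The two faulty data files described in item 1 above suggest a pre-processing check that reports problems instead of letting the analysis crash. The Python sketch below is a minimal illustration under stated assumptions: plain Python lists stand in for one hour of data, the function names are hypothetical, and how NaN or text values actually appear after reading a NetCDF file depends on the library used to read it.

```python
import math


def classify_values(values):
    """Count usable numbers, NaN entries and non-numeric entries."""
    counts = {"numeric": 0, "nan": 0, "non_numeric": 0}
    for value in values:
        try:
            is_nan = math.isnan(value)
        except TypeError:
            # Not a real number, for example text stored in place of a value.
            counts["non_numeric"] += 1
            continue
        counts["nan" if is_nan else "numeric"] += 1
    return counts


def check_hour(values):
    """Return (ok, message) without raising, so a batch run can log and continue."""
    counts = classify_values(values)
    if counts["non_numeric"]:
        return False, f"{counts['non_numeric']} non-numeric value(s) found"
    if counts["nan"]:
        return False, f"{counts['nan']} NaN value(s) found"
    return True, "all values numeric"


if __name__ == "__main__":
    samples = {
        "correct": [1.5, 2.0, 3.25],
        "instrument error": [1.5, float("nan"), 3.25],
        "storage error": ["1.5", "2.0", "3.25"],
    }
    for name, values in samples.items():
        ok, message = check_hour(values)
        print(f"{name}: {'PASS' if ok else 'FAIL'} ({message})")
```

Returning a status and message, instead of raising an exception, is one design choice that lets an unattended run record each failure and move on to the next data set.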