You should use the Weka data mining package, which is installed in all the School's laboratories. Weka is also available for download from http://www.cs.waikato.ac.nz/~ml/weka/.
Supporting material on how to use Weka is available on Moodle.
The primary aims of this individual data mining assignment are to give you the opportunity to:
- Explain, apply and evaluate principles of data mining techniques and algorithms;
- Perform experiments on real-world data and analyse the results;
- Demonstrate your ability to communicate by producing a technical report of your findings.
The report should contain the following:
Describe the task you were given, the data you received and the requirements of the finished system. Define any terminology that you will use in the report (for example, model, variable and task).
b) Data Summary
List the variables that you found in the file provided by the company (available on Moodle week 15). For each one, say whether it is nominal or numeric, continuous or discrete and whether or not it is of use in building the solution. Explain your decisions.
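As a rough illustration of the nominal/numeric and discrete/continuous distinction, the sketch below classifies a column from its raw string values. It is not part of Weka; the function name `summarise_column` and the threshold `max_discrete_levels` are assumptions made for this example, and the rule used (integer-valued with few distinct values means discrete) is only a heuristic that you should check against your own understanding of each variable.

```python
def summarise_column(values, max_discrete_levels=10):
    """Classify one column of raw strings as nominal or numeric.

    Empty strings and "?" (the missing-value marker in ARFF files)
    are counted as missing. The discrete/continuous rule is a heuristic.
    """
    present = [v.strip() for v in values
               if v is not None and v.strip() not in ("", "?")]
    missing = len(values) - len(present)
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # At least one value is not a number, so treat the column as nominal.
        return {"kind": "nominal", "levels": len(set(present)), "missing": missing}
    distinct = len(set(numbers))
    all_integers = all(n.is_integer() for n in numbers)
    subtype = ("discrete"
               if all_integers and distinct <= max_discrete_levels
               else "continuous")
    return {"kind": "numeric", "subtype": subtype,
            "levels": distinct, "missing": missing}


# Hypothetical columns, invented for illustration only.
print(summarise_column(["yes", "no", "yes", "?"]))
print(summarise_column(["1", "2", "2", "3"]))
print(summarise_column(["1.5", "2.25", ""]))
```

A numeric code can still be nominal in meaning (for example, a branch identifier), so a summary like this supports your decisions but does not replace the explanation the report asks for.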
c) Data Preparation
Describe what you did with the data prior to the modelling process. Show histograms of the data before and after any pre-processing that you carried out. If you corrected any mis-typed entries in the data, report what you changed.
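The before-and-after comparison can be sketched outside Weka as follows. This is a minimal example under stated assumptions: the values in `raw` and the `corrections` table are invented for illustration, and `text_histogram` is a hypothetical helper, not a Weka feature.

```python
from collections import Counter


def text_histogram(values, width=40):
    """Return a plain-text bar chart of value frequencies, sorted by label."""
    counts = Counter(values)
    peak = max(counts.values())
    lines = []
    for label, count in sorted(counts.items()):
        # Scale each bar relative to the most frequent value.
        bar = "#" * max(1, round(width * count / peak))
        lines.append(f"{label:<12} {bar} {count}")
    return "\n".join(lines)


# Hypothetical raw entries with inconsistent case, stray spaces and a typo.
raw = ["Yes", "yes", "no", "No ", "yes", "no", "yse"]

# Hypothetical correction table: record every change like this in the report.
corrections = {"yse": "yes"}

# Normalise whitespace and case, then apply the recorded corrections.
cleaned = [corrections.get(v.strip().lower(), v.strip().lower()) for v in raw]

print("Before:")
print(text_histogram(raw))
print("After:")
print(text_histogram(cleaned))
```

Keeping the corrections in an explicit table makes it straightforward to list exactly what was changed.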
You must use two different techniques and build models with both: pick a suitable tree-building algorithm and one other suitable algorithm of your choice. Justify your selection. Describe the different methods you used and the results that you got. Give a brief technical description of the techniques and the way the models are represented. Include one diagram showing the structure of each type of model that you build. Describe what parameters may be changed and what effect this has.
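As background for the technical description of a tree-building algorithm, the sketch below computes entropy and information gain, the split criterion used by ID3 (C4.5 uses a normalised variant, the gain ratio). It is a simplified illustration, not the implementation of any particular Weka classifier; the attribute name `employed` and the class labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy from splitting on one nominal attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: the attribute separates the two classes perfectly.
rows = [{"employed": "yes"}, {"employed": "yes"},
        {"employed": "no"}, {"employed": "no"}]
labels = ["repaid", "repaid", "default", "default"]
print(information_gain(rows, labels, "employed"))
```

An attribute with higher information gain produces purer subsets, which is why a tree learner of this kind would place it nearer the root.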
If you varied the parameters of a model, show how this affected the results. Describe how you split the data for training and testing purposes. Be methodical and record each result. This stage is a little like scientific research: you are carrying out experiments in your search for the best solution. Once you have a solution, show how you verified its robustness.
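One common way to split data for training and testing is k-fold cross-validation. The sketch below shows the idea on row indices only; it is a minimal, unstratified version written for illustration, and the function name `k_fold_indices` and its `seed` parameter are assumptions of this example rather than anything provided by Weka.

```python
import random


def k_fold_indices(n_rows, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation.

    Rows are shuffled once with a fixed seed so that the experiment
    can be repeated exactly. Each row appears in exactly one test fold.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Record the size of every split, as you would record every result.
for fold_number, (train, test) in enumerate(k_fold_indices(25, k=5), start=1):
    print(fold_number, len(train), len(test))
```

Fixing and reporting the seed is one simple way to make each recorded result reproducible; repeating the run with different seeds gives some indication of how robust a solution is.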
For the two techniques, report on their comparative ability to predict a defaulted loan, and on how easy it would be for the insurance company to understand the model and the reasons behind each prediction it makes.
e) Results and Errors
Analyse and describe the level of accuracy the model achieves and the errors your model makes. Show a confusion matrix for each model. Are there any areas of the data where it performs worse than in others? Show a lift curve or an ROC curve for the decision as to whether or not a loan will be repaid.
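To make the terms concrete, the sketch below builds a two-class confusion matrix and the points of an ROC curve from a list of predictions. It is an illustration under stated assumptions: the class labels `default` and `repaid`, the scores and the function names are invented for this example, and `roc_points` expects at least one example of each class.

```python
def confusion_matrix(actual, predicted, positive="default"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def roc_points(actual, scores, positive="default"):
    """Return (false positive rate, true positive rate) pairs.

    `scores` holds the model's estimated probability of the positive
    class. Lowering the decision threshold past each distinct score
    adds one point to the curve.
    """
    n_pos = sum(1 for a in actual if a == positive)
    n_neg = len(actual) - n_pos
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last row of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical outcomes and model scores.
actual = ["default", "repaid", "default", "repaid"]
scores = [0.9, 0.8, 0.4, 0.1]
predicted = ["default" if s >= 0.5 else "repaid" for s in scores]

print(confusion_matrix(actual, predicted))
print(roc_points(actual, scores))
```

The confusion matrix describes performance at one threshold (0.5 here), whereas the ROC curve shows the trade-off across all thresholds, which is why the two are reported together.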
Summarise the results of your experiments and what you have learnt.
Submission of Deliverables
Each individual will submit one hard paper copy of their report (25%) to the Coursework
You do not need to submit the models that you built, just the report. There is no word limit on the report; just write what you need to provide the required information clearly and concisely. You can assume that the client has a good technical understanding of data mining and statistics, so do not shy away from technical terms in your report. Where you use them, however, explain what they mean in plain language too.
You may be required to give a live demonstration of your work to the assessors of this coursework, should it be deemed necessary.
Marking Scheme for report (out of 100%)
Data Summary 10%
Data Preparation 10%
Results and Errors 20%