The first meeting will be the first lecture in week 1. There will be no second lecture and no lab class in week 1.
I'll be doing attendance sign-ups so please attend all lecture and lab sessions (or let me know if there's a problem).
Lab sessions will be used by students to work on the course assignments and to discuss issues from the lectures.
Good introductions to the area include the following.
In 1984, an animal behaviour researcher studying the social behaviour of primates was killed in a car accident while returning from a field trip. In his belongings, police discovered several listings of data but the notebooks, assumed to contain details about the origins of the data, were all burned. The longest of the listings is reproduced at ex1-data.txt. This is in tabulated form and it is generally believed that each line records the values of several variables, followed by a value which is either 1 or 0. This is believed to be a classification value of some sort but the nature of the class has never been determined. The final 100 lines in the listing are missing this classification value.
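As a starting point, the listing could be split into labelled and unlabelled cases along the following lines. This is only a sketch. It assumes the columns are separated by whitespace and that the number of variables per line (`numVariables`) has been found by inspecting the file; neither is stated above. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: holds the labelled (training) and unlabelled (test) cases. */
class DataListing {
    final List<double[]> trainInputs = new ArrayList<>();
    final List<Integer> trainLabels = new ArrayList<>();
    final List<double[]> testInputs = new ArrayList<>();

    /**
     * Assumption: whitespace-separated numeric columns, and a labelled line
     * has exactly one more column (the 0/1 class) than an unlabelled line.
     */
    static DataListing parse(List<String> lines, int numVariables) {
        DataListing data = new DataListing();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            if (tokens.length != numVariables && tokens.length != numVariables + 1) {
                throw new IllegalArgumentException("Unexpected column count: " + line);
            }
            double[] inputs = new double[numVariables];
            for (int i = 0; i < numVariables; i++) {
                inputs[i] = Double.parseDouble(tokens[i]);
            }
            if (tokens.length == numVariables + 1) {
                // Labelled case: the last column is the classification value.
                data.trainInputs.add(inputs);
                data.trainLabels.add(Integer.parseInt(tokens[numVariables]));
            } else {
                // Unlabelled case: one of the cases to be predicted.
                data.testInputs.add(inputs);
            }
        }
        return data;
    }
}
```

The lines themselves could be read with `Files.readAllLines(Paths.get("ex1-data.txt"))`. If parsing is correct, `testInputs` should end up holding the final 100 cases.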
The exercise involves (1) analysing the properties of the data, (2) implementing (in Java or a similar language, not MATLAB!) and then using one or more data-mining methods to derive predicted classification values for the final 100 cases, (3) submitting the predictions by email and (4) presenting the results of the experiment in a report.
Marks for the assessment will be awarded using the following scheme.
The full report should be handed in as indicated. However, the predicted output values should be sent separately to me (C.Thornton@sussex.ac.uk) as a text attachment (i.e., a `.txt' file) before the submission deadline for the report. The email message should have the subject label `DM exercise: Joe Bloggs', except using your own first name and last name. The attachment should be a listing in which each line contains the predicted classification value for one of the 100 test cases.
The output values must be in the same top-to-bottom order as that used in the original data file. There should be no blank lines and no tabs in the file. In the case of multiple submissions, the final submission will be taken as definitive.
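One way to produce and check a file in this format is sketched below. The class and method names are hypothetical. The check treats a single newline at the end of the last line as acceptable, which is an assumption about how `no blank lines' should be read.

```java
import java.util.List;

/** Sketch: builds and checks the text of a predictions attachment. */
class PredictionFile {
    /** Formats predictions one per line, in the order given. */
    static String format(List<Integer> predictions) {
        StringBuilder sb = new StringBuilder();
        for (int p : predictions) {
            if (p != 0 && p != 1) {
                throw new IllegalArgumentException("Class value must be 0 or 1: " + p);
            }
            sb.append(p).append('\n');
        }
        return sb.toString();
    }

    /** Returns true if the text has exactly expectedCount lines, each "0" or "1". */
    static boolean isValid(String text, int expectedCount) {
        String[] lines = text.split("\n", -1);
        int n = lines.length;
        // A newline after the last value leaves one empty trailing element.
        if (n > 0 && lines[n - 1].isEmpty()) {
            n--;
        }
        if (n != expectedCount) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            // Rejects blank lines, tabs, spaces and anything other than 0 or 1.
            if (!lines[i].equals("0") && !lines[i].equals("1")) {
                return false;
            }
        }
        return true;
    }
}
```

The text returned by `format` could then be written out with `Files.write(Paths.get("predictions.txt"), text.getBytes(StandardCharsets.US_ASCII))`, where the file name is just an example. For this exercise `isValid(text, 100)` should return true before the file is sent.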
Please take careful note of the requirements. In particular
General guidelines for the writing of program reports are below. However, the section requirements above should be treated as definitive.
Usual penalties apply for late submissions.
Students resitting the course should make sure they implement a different method this time around.
1. INTRODUCTION
This should be a general introduction to the theory, techniques and general area of work that the program relates to.
2. ANALYSIS OF THE TRAINING DATA
There is no `one right answer' for this section. Your analysis could be fairly formal, involving statistical analyses of correlation etc., possibly making use of Excel scattergrams etc. Or it could be more informal and rely on visual inspection of the data and the making of intelligent assumptions about the task from which the data were obtained.
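If you take the more formal route, one simple statistic is the Pearson correlation between each input variable and the 0/1 classification value. The sketch below is one possible implementation and is not a required part of the analysis. The class name is made up, and the class values are assumed to have been converted to doubles.

```java
/** Sketch: Pearson correlation between two equal-length columns of values. */
class Correlation {
    /**
     * Returns a value between -1 and 1, or NaN if either column is constant
     * (zero variance), in which case the correlation is undefined.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("Columns must be non-empty and of equal length");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of products of deviations
        double sxx = 0.0; // sum of squared deviations of x
        double syy = 0.0; // sum of squared deviations of y
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

A value near zero only rules out a linear relationship with the class. A variable may still be informative in combination with others, so this should be treated as a first look rather than a verdict.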
3. RESULTS
This is where you present the results of your study, e.g., levels of training and cross-validation error achieved by your method(s). You may also want to show some sample predicted outputs. You may want to talk about how your program(s) deal with the anomalous cases in the data. If you've used the kNN method and employed cross-validation to determine the best value of k, this is where you would present a graph relating values of k to cross-validation error.
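For the kNN case, the numbers behind such a graph could be produced roughly as follows. This is a minimal sketch, not a model solution. It assumes Euclidean distance, majority voting with ties going to class 0, and leave-one-out cross-validation; other choices (k-fold cross-validation, scaled variables, weighted votes) are equally legitimate. All names are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch: kNN classification with leave-one-out cross-validation over k. */
class KnnCrossValidation {
    /**
     * Predicts 0 or 1 for the query by majority vote among its k nearest cases.
     * The case at index 'exclude' is never used as a neighbour (pass -1 for none).
     */
    static int predict(double[][] inputs, int[] labels, double[] query, int k, int exclude) {
        int n = inputs.length;
        final double[] dist = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
            double sum = 0.0;
            for (int j = 0; j < query.length; j++) {
                double d = inputs[i][j] - query[j];
                sum += d * d; // squared distance gives the same ordering
            }
            dist[i] = (i == exclude) ? Double.POSITIVE_INFINITY : sum;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> dist[i]));

        boolean excluding = exclude >= 0 && exclude < n;
        int used = Math.min(k, excluding ? n - 1 : n);
        int ones = 0;
        for (int r = 0; r < used; r++) {
            ones += labels[order[r]];
        }
        return (2 * ones > used) ? 1 : 0; // ties go to class 0
    }

    /** Fraction of training cases misclassified when each is held out in turn. */
    static double looError(double[][] inputs, int[] labels, int k) {
        int wrong = 0;
        for (int i = 0; i < inputs.length; i++) {
            if (predict(inputs, labels, inputs[i], k, i) != labels[i]) {
                wrong++;
            }
        }
        return (double) wrong / inputs.length;
    }

    /** Returns the k in 1..maxK with the lowest error (smallest k on ties). */
    static int bestK(double[][] inputs, int[] labels, int maxK) {
        int best = 1;
        double bestErr = Double.MAX_VALUE;
        for (int k = 1; k <= maxK; k++) {
            double err = looError(inputs, labels, k);
            if (err < bestErr) {
                bestErr = err;
                best = k;
            }
        }
        return best;
    }
}
```

Printing `looError` for each k gives the points for the graph. The final predictions would then come from calling `predict` on each test case with the chosen k and `exclude` set to -1.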
4. ILLUSTRATION - Illustrations of your program(s) working. This could be based on an edited output listing, with annotations to show what is going on. If the program uses a GUI, it could be based on a sequence of annotated screenshots. (If necessary `screenshots' can be drawn by hand.) Edit out repetitive chunks from any program output and edit in explanatory comments. Annotate the most significant bits of the output to draw attention to them.
5. SYSTEM OVERVIEW - Explain precisely HOW the program works.
You should discuss the main steps the program goes through to achieve its results. You do not need to describe every line of the program. You should give an overview of what sorts of things are represented, how they are represented, where input comes from, how it is transformed, where output goes to, etc. The overview should refer to a copy of the program attached as appendix-1 (see below). In particular, make sure you describe what kinds of objects are represented, and how you represent them (e.g. using lists, or vectors, or numerical values for variables or whatever). Explain what problems your program had to solve, and how it solved them. Use figures and diagrams productively so as to back up the text. Do not rely on an automatic documentation facility (e.g., javadoc) to write the system overview for you.
When writing the overview, don't write as if you are communicating with a tutor. Equally don't write as if for a novice. Try to write for an audience consisting of fellow students who may be a week or two behind you in their understanding. When describing what a method does, always adhere to the principles of modularity, i.e., start by stating what sorts of inputs the method takes, and what sorts of output it produces. Give an explanation of the relation between input and output, and one or two simple examples to illustrate. If there are no inputs, or no outputs, then say so.
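To make the input/output style of description concrete, here is how it might look for a small helper method. The method is a made-up example and not something your program is required to contain.

```java
/** Sketch: a helper documented in terms of its inputs and outputs. */
class Distance {
    /**
     * Input: two arrays of doubles of equal length, each holding the
     * variable values of one case.
     * Output: a single non-negative double.
     * Relation: the output is the Euclidean (straight-line) distance
     * between the two cases.
     * Examples: euclidean([0, 0], [3, 4]) returns 5.0;
     * euclidean([1, 2], [1, 2]) returns 0.0.
     */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Cases must have the same number of variables");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

In the overview itself the same four points (input, output, relation, examples) would be written out as ordinary prose rather than left as a code comment.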
6. CONCLUSIONS
This is where you should assess what you have achieved and discuss any limitations or problems. It's also the place to talk about potential future developments. Don't be afraid to criticise your own program. If you have read about similar programs, or related programs, it may be appropriate to include some comparisons. What lessons, if any, have been learnt? Were your goals achieved? Is there anything you now think you should have done differently?
7. SELF-ASSESSMENT
This is where you present a formal evaluation of the whole submission in terms of the specified marking criteria. For each criterion, note how well you think you've done (and why) and give yourself a mark.
8. BIBLIOGRAPHY
List books, articles, web pages, files etc. considered to be relevant.
9. CODE APPENDIX - The whole program, with comments explaining what the classes represent, what the methods do, what the global variables (if any) are for, etc.
Hand, D., Mannila, H. and Smyth, P. (2001). PRINCIPLES OF DATA MINING. Cambridge, Mass.: MIT Press.
Tan, P., Steinbach, M. and Kumar, V. (2006). INTRODUCTION TO DATA MINING. Boston: Pearson/Addison Wesley.
Weiss, S. and Indurkhya, N. (1998). PREDICTIVE DATA MINING: A PRACTICAL GUIDE. San Francisco: Morgan Kaufmann Publishers Inc.