Machine Learning - Course handbook

Chris Thornton



Introduction

This course teaches the theory and practice of machine learning using a mixture of lectures, labs, demos and workshop sessions. Assessment is based on one programming assignment and an unseen exam. Most, but not all, of the syllabus material is included in the online lecture notes (accessed via the `pr' links below).


Lectures

Week 1
lec01 Introduction pr sl

Week 2
lec02 Clump structure pr sl
lec03 Clump-oriented methods pr sl
lab: modify k-means program so that it terminates

Week 3
lec04 Clump-methods misc pr sl
Catch-up
lab: modify k-means program so that value of k can be passed in

Week 4
lec05 Rectangle structure pr sl
lec06 Rectangle-oriented methods pr sl
lab: do the first two exercises at end of lec06

Week 5
lec07 Rectangle-methods misc pr sl
Catch-up
lab: do remaining exercises at end of lec06

Week 6
lec08 Area structure pr sl
lec09 Area-oriented methods pr sl
lab: work on assignment program

Week 7
lec10 Area-methods misc pr sl
Catch-up
lab: work on assignment program

Week 8
lec11 Relational structure pr sl
lec12 Relational methods pr sl
lab: work on assignment program

Week 9
lec13 Relational-methods misc pr sl
Catch-up

Week 10
Revision
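
The labs in weeks 2 and 3 concern a k-means program that terminates and that accepts the value of k as a parameter. As a rough illustration only (this is not the program used in the labs), the sketch below assumes squared Euclidean distance, takes the first k points as the initial centroids, and stops when a full pass changes no assignment or when a hypothetical `maxIterations' cap is reached. The class and method names are invented for this example.

```java
import java.util.Arrays;

// Illustrative sketch only; not the k-means program used in the labs.
class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns, for each point, the index (0..k-1) of the cluster it ends up in.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int dims = points[0].length;
        // Assumption: the first k points serve as the initial centroids.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            // Termination test: stop once a full pass changes nothing.
            if (!changed) {
                break;
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {9.0, 9.0}, {1.2, 0.8}, {8.8, 9.1}, {0.9, 1.1}};
        // k is taken from the command line when given, otherwise defaults to 2.
        int k = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        System.out.println(Arrays.toString(cluster(points, k, 100)));
    }
}
```

The iteration cap is a safety net: the no-change test is what normally ends the loop.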

Session times and locations

See the Sussex Direct timetable for session times. There will be two lectures and one lab session per week.

The first meeting will be the first lecture in week 1. There will be no second lecture and no lab class in week 1.

I'll be doing attendance sign-ups so please attend all lecture and lab sessions (or let me know if there's a problem).

Lab sessions will be used by students to work on the course assignments and to discuss issues from the lectures.


Course text

There is no single course text. The section headers in the lecture notes define the syllabus.

Good introductions to the area include the following.

Web sites

See also online materials for the Tan et al. text.


Programming assignment

This must be submitted in hard copy to the office (or wherever the office designates) by 4pm on the Thursday of week 8 (ideally sometime between 2pm and 4pm). The task is to apply data-mining methods to the so-called `primate-factors' dataset. The background is as follows.

In 1984, an animal behaviour researcher studying the social behaviour of primates was killed in a car accident while returning from a field trip. In his belongings, police discovered several listings of data but the notebooks, assumed to contain details about the origins of the data, were all burned. The longest of the listings is reproduced at ex1-data.txt. This is in tabulated form and it is generally believed that each line records the values of several variables, followed by a value which is either 1 or 0. This is believed to be a classification value of some sort but the nature of the class has never been determined. The final 100 lines in the listing are missing this classification value.
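
A minimal sketch of reading a listing of this kind is given below. It rests on assumptions the description above does not settle: that values are separated by whitespace, and that the number of variables per line is known and passed in as `numVariables'. A line with one extra field is treated as carrying a classification value; a line without it gets the placeholder label -1. All names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the separator and column count are assumptions.
class DataLoaderSketch {

    // One case from the listing: its variable values and its class label.
    static class Case {
        final double[] values;
        final int label; // 1 or 0, or -1 when the line has no classification value

        Case(double[] values, int label) {
            this.values = values;
            this.label = label;
        }
    }

    // Parses one whitespace-separated line holding numVariables values,
    // optionally followed by a classification value.
    static Case parseLine(String line, int numVariables) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != numVariables && fields.length != numVariables + 1) {
            throw new IllegalArgumentException("unexpected number of fields: " + fields.length);
        }
        double[] values = new double[numVariables];
        for (int i = 0; i < numVariables; i++) {
            values[i] = Double.parseDouble(fields[i]);
        }
        int label = -1;
        if (fields.length == numVariables + 1) {
            label = (int) Double.parseDouble(fields[numVariables]);
        }
        return new Case(values, label);
    }

    // Parses every non-blank line, preserving the original order.
    static List<Case> parseAll(List<String> lines, int numVariables) {
        List<Case> cases = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                cases.add(parseLine(line, numVariables));
            }
        }
        return cases;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java DataLoaderSketch <file> <numVariables>");
            return;
        }
        List<Case> cases = parseAll(Files.readAllLines(Paths.get(args[0])),
                Integer.parseInt(args[1]));
        int unlabelled = 0;
        for (Case c : cases) {
            if (c.label == -1) {
                unlabelled++;
            }
        }
        System.out.println(cases.size() + " cases, " + unlabelled
                + " without a classification value");
    }
}
```

Keeping the cases in file order matters later, because the predictions have to be reported in the same order.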

The exercise involves (1) analysing the properties of the data, (2) implementing one or more data-mining methods (using Java or a similar language, not MATLAB!) and using them to derive predicted classification values for the final 100 cases, (3) submitting the predictions by email and (4) presenting the results of the experiment in a report.

Marks for the assessment will be awarded using the following scheme.

Note that your report should have a general introduction, an introduction to the method used, an analysis of the training data, and a section on results, plus supporting materials (e.g., the code). There should also be a section on limitations and a self-assessment (in terms of the given credit factors). You will get zero credit for any component which is missing from your report, so it's a good idea to have one section of your report per component. The page requirements are a guide only and are based on use of a normal-sized print font (e.g., 11 point).

The full report should be handed in as indicated. However, the predicted output values should be sent separately to me (C.Thornton@sussex.ac.uk) as a text attachment (i.e., a `.txt' file) sometime before the submission deadline for the report. The email message should have the subject label `DM exercise: Joe Bloggs', except using your own first name and last name. The attachment should be a listing in which each line contains the predicted classification value for one of the 100 test cases.

The output values must be in the same top-to-bottom order as that used in the original data file. There should be no blank lines and no tabs in the file. In the case of multiple submissions, the final submission will be taken as definitive.
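
One way to guard against format slips is to generate the attachment from the program itself. The sketch below assumes each prediction is written as a bare 0 or 1 and that no newline follows the last value; neither point is settled by the requirements above. The file name `predictions.txt' and the class name are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch only; the exact expected file layout is an assumption.
class PredictionFileSketch {

    static final int EXPECTED_CASES = 100; // number of unclassified test cases

    // Builds the file content: one prediction per line, in the order given,
    // with no blank lines and no tabs.
    static String format(int[] predictions) {
        if (predictions.length != EXPECTED_CASES) {
            throw new IllegalArgumentException(
                    "expected " + EXPECTED_CASES + " predictions, got " + predictions.length);
        }
        StringBuilder content = new StringBuilder();
        for (int i = 0; i < predictions.length; i++) {
            if (predictions[i] != 0 && predictions[i] != 1) {
                throw new IllegalArgumentException("prediction " + i + " is not 0 or 1");
            }
            if (i > 0) {
                content.append('\n');
            }
            content.append(predictions[i]);
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder predictions (all zero); a real run would use a classifier's output.
        int[] predictions = new int[EXPECTED_CASES];
        String fileName = args.length > 0 ? args[0] : "predictions.txt";
        Files.write(Paths.get(fileName), format(predictions).getBytes(StandardCharsets.US_ASCII));
    }
}
```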

Please take careful note of the requirements. In particular, if your submission is not in exactly the right format (including the subject line), it will not be processed by the software and you may end up with a zero mark for this component.

General guidelines for the writing of program reports are below. However, the section requirements above should be treated as definitive.

Usual penalties apply for late submissions.

Students resitting the course should make sure they implement a different method this time around.


Guidelines for producing the report

These are general guidelines for writing the kind of report that is asked for here.

1. INTRODUCTION

This should be a general introduction to the theory, techniques and general area of work that the program relates to.

2. ANALYSIS OF THE TRAINING DATA

There is no `one right answer' for this section. Your analysis could be fairly formal, involving statistical analyses of correlation etc., possibly making use of Excel scattergrams etc. Or it could be more informal and rely on visual inspection of the data and the making of intelligent assumptions about the task from which the data were obtained.
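
For the more formal route, one common measure is the Pearson correlation coefficient between a single variable and the classification value. The sketch below is a plain implementation of the standard formula; the class and method names are invented here, and it is only one of many analyses that could be used.

```java
// Illustrative sketch only: Pearson correlation between two equal-length arrays.
class CorrelationSketch {

    // Returns a value between -1 and 1; the result is NaN if either array is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length, at least 2");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sumXY = 0.0; // sum of products of deviations
        double sumXX = 0.0; // sum of squared deviations of x
        double sumYY = 0.0; // sum of squared deviations of y
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumXX += dx * dx;
            sumYY += dy * dy;
        }
        return sumXY / Math.sqrt(sumXX * sumYY);
    }

    public static void main(String[] args) {
        double[] variable = {1.0, 2.0, 3.0, 4.0};
        double[] classValue = {0.0, 0.0, 1.0, 1.0};
        System.out.println(pearson(variable, classValue));
    }
}
```

A coefficient near zero does not rule out a relationship; it only indicates the absence of a linear one.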

3. RESULTS

This is where you present the results of your study, e.g., levels of training and cross-validation error achieved by your method(s). You may also want to show some sample predicted outputs. You may want to talk about how your program(s) deal with the anomalous cases in the data. If you've used the kNN method and employed cross-validation to determine the best value of k, this is where you would present a graph relating values of k to cross-validation error.
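
To make the kNN case concrete, the sketch below computes leave-one-out cross-validation error for a given k, which is one way of producing the numbers behind such a graph. It assumes squared Euclidean distance, class values of 0 and 1, and ties resolved in favour of class 0; these choices, and all the names, are illustrative rather than prescribed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: kNN with leave-one-out cross-validation.
class KnnCrossValidationSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Predicts the class (0 or 1) of query by majority vote among its k nearest
    // training cases, ignoring the case at index skip (use -1 to ignore none).
    static int classify(double[][] x, int[] y, double[] query, int k, int skip) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            if (i != skip) {
                candidates.add(i);
            }
        }
        candidates.sort(Comparator.comparingDouble((Integer i) -> squaredDistance(x[i], query)));
        int neighbours = Math.min(k, candidates.size());
        int votesForOne = 0;
        for (int n = 0; n < neighbours; n++) {
            votesForOne += y[candidates.get(n)];
        }
        return (2 * votesForOne > neighbours) ? 1 : 0; // ties go to class 0
    }

    // Fraction of cases misclassified when each is predicted from all the others.
    static double leaveOneOutError(double[][] x, int[] y, int k) {
        int errors = 0;
        for (int i = 0; i < x.length; i++) {
            if (classify(x, y, x[i], k, i) != y[i]) {
                errors++;
            }
        }
        return (double) errors / x.length;
    }

    public static void main(String[] args) {
        double[][] x = {{0}, {1}, {2}, {10}, {11}, {12}};
        int[] y = {0, 0, 0, 1, 1, 1};
        // One line per candidate k: the raw material for a k-versus-error graph.
        for (int k = 1; k <= 5; k += 2) {
            System.out.println("k=" + k + " error=" + leaveOneOutError(x, y, k));
        }
    }
}
```

Odd values of k avoid ties when there are only two classes.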

4. ILLUSTRATION - Illustrations of your program(s) working. This could be based on an edited output listing, with annotations to show what is going on. If the program uses a GUI, it could be based on a sequence of annotated screenshots. (If necessary, `screenshots' can be drawn by hand.) Edit out repetitive chunks from any program output and edit in explanatory comments. Annotate the most significant bits of the output to draw attention to them.

5. SYSTEM OVERVIEW - Explain precisely HOW the program works.

You should discuss the main steps the program goes through to achieve its results. You do not need to describe every line of the program. You should give an overview of what sorts of things are represented, how they are represented, where input comes from, how it is transformed, where output goes to, etc. The overview should refer to a copy of the program attached as appendix-1 (see below). In particular, make sure you describe what kinds of objects are represented, and how you represent them (e.g., using lists, or vectors, or numerical values for variables or whatever). Explain what problems your program had to solve, and how it solved them. Use figures and diagrams productively so as to back up the text. Do not rely on an automatic documentation facility (e.g., javadoc) to write the system overview for you.

When writing the overview, don't write as if you are communicating with a tutor. Equally don't write as if for a novice. Try to write for an audience consisting of fellow students who may be a week or two behind you in their understanding. When describing what a method does, always adhere to the principles of modularity, i.e., start by stating what sorts of inputs the method takes, and what sorts of output it produces. Give an explanation of the relation between input and output, and one or two simple examples to illustrate. If there are no inputs, or no outputs, then say so.
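
As a small illustration of this input-first style, the hypothetical method below is described by its inputs, its output, the relation between them, and two examples. The method itself is invented for the purpose and is not part of any course code.

```java
// Illustrative sketch only: a method described in terms of inputs and outputs.
class MethodDescriptionSketch {

    /**
     * Input:    an array of class labels, each 0 or 1 (for example, the labels
     *           of the nearest neighbours of a test case).
     * Output:   the label that occurs most often in the array; 0 if the two
     *           labels occur equally often or the array is empty.
     * Examples: {1, 0, 1} gives 1; {0, 1} gives 0.
     */
    static int majorityLabel(int[] labels) {
        int ones = 0;
        for (int label : labels) {
            ones += label;
        }
        return (2 * ones > labels.length) ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(majorityLabel(new int[]{1, 0, 1}));
    }
}
```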

6. CONCLUSIONS

This is where you should assess what you have achieved and discuss any limitations or problems. It's also the place to talk about potential future developments. Don't be afraid to criticise your own program. If you have read about similar programs, or related programs, it may be appropriate to include some comparisons. What lessons, if any, have been learnt? Were your goals achieved? Is there anything you now think you should have done differently?

7. SELF-ASSESSMENT

This is where you present a formal evaluation of the whole submission in terms of the specified marking criteria. For each criterion, note how well you think you've done (and why) and give yourself a mark.

8. BIBLIOGRAPHY

List books, articles, web pages, files etc. considered to be relevant.

9. CODE APPENDIX - The whole program, with comments explaining what the classes represent, what the methods do, what the global variables (if any) are for, etc.


References

Dunham, M. (2003). DATA MINING: INTRODUCTORY AND ADVANCED TOPICS. New Jersey: Pearson Education, Inc.

Hand, D., Mannila, H. and Smyth, P. (2001). PRINCIPLES OF DATA MINING. Cambridge, Mass.: MIT Press.

Tan, P., Steinbach, M. and Kumar, V. (2006). INTRODUCTION TO DATA MINING. Boston: Pearson/Addison Wesley.

Weiss, S. and Indurkhya, N. (1998). PREDICTIVE DATA MINING: A PRACTICAL GUIDE. San Francisco: Morgan Kaufmann Publishers Inc.


Page created on: Mon Jan 11 10:47:54 GMT 2010
Feedback to Chris Thornton