Applied Machine Learning

Machine Learning - as the name suggests - is considered to be the art of "teaching" computers to perform a specific task. These tasks vary: in the beginning - the 1950s - the general opinion was that the task is to infer a function that maps a multi-dimensional vector of Boolean values to a single Boolean value.

The general viewpoint in those days was that all a computer has to do is mimic human thinking, whereby one draws a conclusion about an observation from a set of logical assessments.

With the advance of computer technology, research specialised in specific tasks. Early specialisations included robotics, pattern recognition, and computer vision.

Machine learning today is regarded as the art of building models and algorithms for specific data, i.e. of exploiting domain-specific knowledge in order to obtain faster and more capable algorithms.

Rather than focusing on the mathematical details of the presented algorithms, I will try to present a general approach to solving problems with machine learning: the working principles and the canonical methods of model-driven data mining.

The exam mark for this course is composed of:

40%: Presentation given about a topic chosen in the first 6 weeks of the semester (about 30 minutes long, maximum 40 slides).
20%: Oral examination based on the topics of the lectures and the seminars presented by the students.
40%: Solution of the laboratory examples (presenting the solutions to the practical problems is compulsory).

Randomness, random variables.
Generating random variables - correlated ones and uncorrelated ones.
Generative models of data. Modelling digit maps.
Principal component analysis. Displaying the principal components of faces.
Independent components, analysis using ICA.
Bayesian methods.
Kernel methods: Support Vector machines.
Bayesian methods.
Gaussian process models.

Seminars should be about 30 minutes long. They should be written in English and contain enough definitions for an unfamiliar reader to understand most of the notation.

Past seminar topics/presentations

0. Sampling from a correlated Gaussian random variable.

The aim is to understand randomness and its use in data analysis and generative models. Example MATLAB file: gauss.m.
Your task is to obtain samples from a THREE-dimensional Gaussian that is constrained mostly to two directions; a description is found in the M-file.
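
One common way to draw correlated Gaussian samples is to multiply independent standard-normal draws by a Cholesky factor of the covariance matrix. The sketch below is written in Python rather than MATLAB and does not reproduce gauss.m; the covariance matrix, the sample count and all function names are illustrative assumptions.

```python
import math
import random

def cholesky(cov):
    """Return lower-triangular L with L * L^T == cov (cov symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def sample_gaussian(mean, cov, n_samples, rng):
    """Draw n_samples points from N(mean, cov) as x = mean + L z with z ~ N(0, I)."""
    L = cholesky(cov)
    d = len(mean)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x = [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        samples.append(x)
    return samples

# Assumed covariance: the first two coordinates are correlated and have large
# variance, the third has a tiny variance, so the cloud is almost flat.
COV = [[4.0, 1.5, 0.0],
       [1.5, 2.0, 0.0],
       [0.0, 0.0, 0.01]]
MEAN = [0.0, 0.0, 0.0]

samples = sample_gaussian(MEAN, COV, 5000, random.Random(0))
```

Rotating the samples afterwards would place the flat direction along an arbitrary axis instead of the third coordinate.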

1. Finding the eigen-images of centred USPS digits.

The USPS database (usps.mat) is a collection of bitmaps representing handwritten digits. There is a test and a training set available in the MATLAB-file.
Your task is to collect all digits "8" from the database (both train and test) and to visualise the first three eigen-images of the digit "8" subset. Once the eigen-decomposition is done, you can use vis.m to visualise the results. Collect the results in a document.
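
The eigen-images are the leading eigenvectors of the covariance of the centred images. As a rough illustration, the sketch below finds them by power iteration with deflation in plain Python; it does not read usps.mat or use vis.m, and the 2x2 toy "images", the directions D1 and D2 and all function names are assumptions made for the example.

```python
import math
import random

def centre(rows):
    """Subtract the mean image from every flattened image (one image per row)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows], mean

def top_eigenvectors(centred, k, n_iter=200, seed=0):
    """Leading k eigenvectors of the sample covariance, by power iteration with deflation."""
    rng = random.Random(seed)
    d = len(centred[0])
    found = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            # Apply the unnormalised covariance X^T X to v without forming it.
            proj = [sum(r[j] * v[j] for j in range(d)) for r in centred]
            v = [sum(proj[i] * centred[i][j] for i in range(len(centred)))
                 for j in range(d)]
            # Remove the components along eigenvectors that were already found.
            for u in found:
                dot = sum(a * b for a, b in zip(u, v))
                v = [a - dot * b for a, b in zip(v, u)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        found.append(v)
    return found

# Toy stand-in for the digit bitmaps: 4-pixel images that vary strongly
# along D1 and weakly along D2 around a constant mean image.
rng = random.Random(0)
D1 = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0, 0.0]
D2 = [0.0, 0.0, 1 / math.sqrt(2), -1 / math.sqrt(2)]
images = []
for _ in range(200):
    a, b = rng.gauss(0.0, 3.0), rng.gauss(0.0, 1.0)
    images.append([0.5 + a * p + b * q for p, q in zip(D1, D2)])

centred, mean_image = centre(images)
eigen_images = top_eigenvectors(centred, k=2)
```

For real bitmaps each eigenvector would be reshaped to the image size before display; the sign of an eigenvector is arbitrary.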

2. Independent Component Analysis of Recordings

Independent Component Analysis (ICA) is a recent technique to separate sources based on their statistical independence. Using ICA one can separate sources "blindly" having only their mixtures available.
A good example of blind source separation (BSS) is the separation of speakers in a room: assume that there are $k$ speakers - we call them sources - $s_1,...,s_k$, and $k$ microphones - called mixtures: $x_1,...,x_k$. We know that the mixtures are a linear combination of the sources, each mixture with a different ratio of combining the sources; we represent the weight of source $s_j$ in mixture $x_i$ with $A_{ij}$.
Your task is to write a program to separate (or find) the sources given the mixtures in the MATLAB files

A template of the solution is in decompose.m; it includes samples of useful MATLAB commands. You are advised to use the FastICA MATLAB package (original URL: www.cis.hut.fi/projects/ica/fastica).
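
Written out, the linear mixing described above takes the following form. The sample index $t$ and the unmixing matrix $W$ are notation introduced here for illustration; they do not appear in the task description.

```latex
x_i(t) = \sum_{j=1}^{k} A_{ij}\, s_j(t), \qquad i = 1,\dots,k,
\qquad\text{i.e.}\qquad \mathbf{x}(t) = A\,\mathbf{s}(t).

\hat{\mathbf{s}}(t) = W\,\mathbf{x}(t), \qquad W \approx A^{-1}
\ \text{(up to a permutation and rescaling of the rows).}
```

Because only the mixtures are observed, the order and the scale of the recovered sources cannot be determined, which is why the estimate is only defined up to permutation and scaling.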

3. Independent components of natural images.

In 1996 an article appeared in Nature that presented the results of a sparse coding method, closely related to ICA, on a collection of natural images: it reported that the learned filters are similar to the receptive fields of simple cells in the visual cortex.
The task is to reproduce the experiments of Olshausen and Field:

Import the collection of images (if using MATLAB, use imread) and transform them into gray-scale images. Images are available in the images directory;
Choose dim, the side length of the square patches (a suggested starting value is 10 ...);
Extract patches and store them as the columns of a data matrix. The patches may overlap, i.e. select the top (or bottom) left corner of each patch randomly from a randomly selected image;
Apply the functions from the FastICA package (see above) to find the independent components;
Visualise the results. Compare them with the results presented in the book of Hyvärinen et al. listed in the Literature.
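
The patch-extraction step above can be sketched as follows. This is a plain-Python illustration, not the MATLAB solution: images are assumed to be already gray-scale and stored as nested lists, and the function name, the synthetic images and the patch count are hypothetical.

```python
import random

def extract_patches(images, dim, n_patches, rng):
    """Return a (dim*dim) x n_patches data matrix; each column is one flattened patch."""
    columns = []
    for _ in range(n_patches):
        img = rng.choice(images)                      # randomly selected image
        height, width = len(img), len(img[0])
        top = rng.randrange(height - dim + 1)         # random top-left corner,
        left = rng.randrange(width - dim + 1)         # so patches may overlap
        patch = [img[top + r][left + c] for r in range(dim) for c in range(dim)]
        columns.append(patch)
    # Transpose so that the patches are stored column-wise.
    return [list(row) for row in zip(*columns)]

# Two synthetic gray-scale "images" of different sizes, used only for the demo.
image_a = [[(r * 10 + c) % 256 for c in range(10)] for r in range(8)]
image_b = [[(r * c) % 256 for c in range(6)] for r in range(6)]

DIM = 3
N_PATCHES = 50
data = extract_patches([image_a, image_b], DIM, N_PATCHES, random.Random(0))
```

The resulting matrix has one row per pixel position and one column per patch, which is the column format the task asks for.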

4. Finding clusters in data

You will find artificial three-dimensional data in the following file: d_em.txt. The data has been generated from a number of clusters. Using the NETLAB package, fit Gaussian mixture models with different numbers of components and find the most appropriate one. An excellent reference is Bishop (2006), chapter 9, pp. 423-439.
Visualise the results. You may start with the code in the matlab file gmm_solve.m.
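
To make the model-comparison idea concrete, the sketch below fits one-dimensional Gaussian mixtures by EM and compares them with the Bayesian information criterion (BIC). It is a simplification under stated assumptions: the data are synthetic and one-dimensional rather than read from d_em.txt, NETLAB is not used, BIC is only one possible selection criterion, and the names fit_gmm and bic are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def fit_gmm(data, k, n_iter=100, min_var=1e-3):
    """EM for a 1-D mixture of k Gaussians; returns weights, means, variances, log-likelihood."""
    n = len(data)
    srt = sorted(data)
    means = [srt[(2 * j + 1) * n // (2 * k)] for j in range(k)]  # spread over quantiles
    overall = sum(data) / n
    var0 = sum((x - overall) ** 2 for x in data) / n
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)   # guard against empty components
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var)           # variance floor avoids collapse
    log_lik = sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data)
    return weights, means, variances, log_lik

# Synthetic data with two well-separated clusters.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(10.0, 1.0) for _ in range(150)]

bic = {}
for k in (1, 2, 3):
    _, _, _, log_lik = fit_gmm(data, k)
    n_params = 3 * k - 1                               # k means, k variances, k-1 weights
    bic[k] = -2.0 * log_lik + n_params * math.log(len(data))
    print(k, round(bic[k], 1))
```

A lower BIC indicates a better trade-off between fit and model size; comparing the likelihood on held-out data is an alternative.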

5. Exploring the Boston Housing Data.

The Boston Housing data is locally available from this LINK. One should select models to fit the data: from linear regression to quadratic, etc.
An example file - presented during the lectures - showing how to solve the linear system $y = w \cdot x + w_0$ is LIN_SOL.M.
The task is to analyse the Boston housing data; the problem is detailed in the MATLAB file above:

Construct a set of features from the Boston data, like the bias term - $x^0$ - or product terms of different orders - $x_{i_1} \cdots x_{i_k}$ - and build the matrix of derived feature values;
Associate a coefficient with each feature;
Find the optimal values of the coefficients;
Compute the errors -- here you should consider the computation of the training and test errors;
Visualise system performance - the test and training errors - against the number of parameters the linear system has.
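
The steps above can be sketched on a small synthetic problem. The example below is an illustration only: it uses a single input variable and powers of it as features instead of the Boston data, solves the normal equations with a hand-written elimination routine, and all function names and the data-generating polynomial are assumptions.

```python
import random

def design_matrix(xs, degree):
    """One row per input; columns are the derived features x^0, x^1, ..., x^degree."""
    return [[x ** p for p in range(degree + 1)] for x in xs]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

def fit(xs, ys, degree):
    """Least-squares coefficients from the normal equations (Phi^T Phi) w = Phi^T y."""
    Phi = design_matrix(xs, degree)
    m = degree + 1
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(m)]
    return solve(A, b)

def mse(w, xs, ys):
    """Mean squared error of the polynomial with coefficients w."""
    preds = [sum(wi * x ** p for p, wi in enumerate(w)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

rng = random.Random(0)

def make_data(n):
    # Assumed ground truth: a noisy quadratic.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x - 3.0 * x * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys

train_x, train_y = make_data(60)
test_x, test_y = make_data(60)

errors = []  # (number of parameters, training error, test error)
for degree in range(5):
    w = fit(train_x, train_y, degree)
    errors.append((degree + 1, mse(w, train_x, train_y), mse(w, test_x, test_y)))
    print(errors[-1])
```

Plotting the second and third entries of each tuple against the first gives the training and test error curves asked for in the last step.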

6. Bayesian Analysis of regression

In the lectures we presented the analysis of a coin throw. We established that we can encode prior knowledge into our decision process. In the coin experiment we assumed that we believed in the fairness of the coin - and encoded this belief in a prior distribution over the ratio of heads to tails. In Bayesian (linear) regression we believe that there is a (linear) relation between the observed input/output pairs; we encode this belief in a Gaussian prior distribution over the parameters of the hyperplane.
The task is to devise an algorithm that updates our beliefs about the hyperplane parameters. A template is available in the M-file bayes_reg.m.
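
One standard way to update such beliefs, for a Gaussian prior and Gaussian observation noise, is to accumulate the posterior precision one observation at a time. The sketch below does this for the line $y = w_0 + w_1 x$; it does not reproduce bayes_reg.m, and the prior precision ALPHA, the noise precision BETA, the noise-free toy data and the function names are assumptions.

```python
def make_prior(alpha):
    """Zero-mean Gaussian prior with precision alpha on (w_0, w_1).

    The belief is stored as (P, b): precision matrix P and vector b = P * mean.
    """
    return [[alpha, 0.0], [0.0, alpha]], [0.0, 0.0]

def update(belief, x, y, beta):
    """Fold one observation (x, y) with noise precision beta into the belief."""
    P, b = belief
    phi = [1.0, x]                                    # features: bias term and input
    P_new = [[P[i][j] + beta * phi[i] * phi[j] for j in range(2)] for i in range(2)]
    b_new = [b[i] + beta * y * phi[i] for i in range(2)]
    return P_new, b_new

def posterior(belief):
    """Posterior mean and covariance, obtained by inverting the 2x2 precision matrix."""
    P, b = belief
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    S = [[P[1][1] / det, -P[0][1] / det],
         [-P[1][0] / det, P[0][0] / det]]
    mean = [S[0][0] * b[0] + S[0][1] * b[1],
            S[1][0] * b[0] + S[1][1] * b[1]]
    return mean, S

ALPHA = 1.0      # assumed prior precision
BETA = 100.0     # assumed noise precision (noise standard deviation 0.1)

belief = make_prior(ALPHA)
for i in range(-10, 11):
    x = i / 10.0
    y = 1.0 + 2.0 * x            # toy targets from the line w_0 = 1, w_1 = 2
    belief = update(belief, x, y, BETA)

mean, cov = posterior(belief)
print(mean)
```

As more pairs are observed the posterior mean moves from the prior mean towards the parameters of the generating line, and the posterior variances shrink.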

7. Example of a kernel algorithm

The popular support vector classification algorithm belongs to the family of kernel algorithms. These algorithms are linear - but in a space that is different from the space of the inputs. Therefore, one needs to project the inputs from the data-set to the space of features -- as done in analysing the Boston housing data-set. The projection is then replaced with the kernel function, and the solution to the classification algorithm is written in terms of the kernel function.
Use a kernel method to build a classification system for the FACES data-base (freely available from: http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html).
You should load the training data-set and train a decision system to classify the previously unseen test data-set.
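
To illustrate how a linear algorithm is rewritten in terms of a kernel, the sketch below implements a kernel perceptron, which is a simpler relative of the support vector machine and not the SVM itself. It uses a four-point XOR toy problem instead of the FACES data; the Gaussian (RBF) kernel, its width gamma and the function names are assumptions.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two points."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def train_kernel_perceptron(X, y, kernel, epochs=100):
    """Return dual coefficients alpha; alpha[i] counts the mistakes made on example i."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            # The decision value uses the data only through kernel evaluations.
            score = sum(a * yj * kernel(xj, xi) for a, xj, yj in zip(alpha, X, y))
            if yi * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x, X, y, alpha, kernel):
    """Classify x as +1 or -1 with the trained dual coefficients."""
    score = sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, X, y))
    return 1 if score > 0 else -1

# XOR-like toy problem: not linearly separable in the input space.
X_train = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y_train = [1, 1, -1, -1]

alpha = train_kernel_perceptron(X_train, y_train, rbf_kernel)
train_predictions = [predict(x, X_train, y_train, alpha, rbf_kernel) for x in X_train]
```

No explicit feature projection is computed: swapping rbf_kernel for another kernel function changes the feature space without changing the algorithm.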

Slides used during lectures:

MODELLING LECTURES and
GAUSSIAN PROCESS lectures.

Michael A. Arbib (ed.): The Handbook of Brain Theory and Neural Networks, MIT Press (2002).
(link sent in mail)
Pierre Baldi and Soeren Brunak: Bioinformatics: the Machine Learning Approach, MIT Press (1998).
This book contains useful references to an interesting application of machine learning methods. (link sent in mail)
Christopher M. Bishop: Pattern Recognition and Machine Learning, Springer-Verlag (2006).
(link sent in mail)
Dana H. Ballard, Christopher M. Brown: Computer Vision, Prentice-Hall (1982).
(download from homepage)
Thomas M. Cover, Joy A. Thomas: Elements of Information Theory, Wiley and Sons (2006).
A good book on topics related to information theory. (link sent in mail)
Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag (2009).
Another "classic textbook", excellent explanations. (book download link)
Aapo Hyvärinen, Jarmo Hurri, and Patrik O. Hoyer: Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, Springer-Verlag (2009).
A nice presentation of the computational vision and the ICA. (preprint version available)
Thomas Mitchell: Machine Learning, McGraw-Hill (1997).
The classic textbook in Machine Learning. (link sent in mail)
Andrew Webb: Statistical Pattern Recognition, Wiley and Sons (2002).
(link sent in mail)

Wikipedia definition of Machine Learning
Journal of Machine Learning Research is a freely available ISI journal. Most of the articles are relevant.
Machine Learning is a well-known journal. Available online only for subscribers/subscribed institutions.
Bioinformatics is a journal focusing on analysing biological data.
Freeware code that implements a Blind Source Separation algorithm: www.cis.hut.fi/projects/ica/fastica
