# Applied Machine Learning

## Syllabus

The course presents various aspects of recent applications of machine learning. Accordingly, it does not focus on the mathematical details of the various optimisation techniques; rather, it emphasises the applicability of the algorithms to different types of collected data.

## Short presentation

Machine learning - as the name suggests - is considered to be the art of "teaching" computers to perform a specific task. These tasks vary: in the beginning - the 1950s - the general opinion was that the task was to infer a function that transforms a multi-dimensional vector of Boolean values into a single Boolean value.
Finding a function that takes Boolean inputs and also produces a Boolean output is itself rather hard. To see this, assume that the input $x$ of a function $f\left(x\right)$ is $D$-dimensional. In this case the number of all functions we can create - functions that transform the input set into a single logical value - is
$$2^{2^D}$$
and it is easy to see that this number quickly becomes too large: for a moderate input dimension of $D=5$ there are already $2^{32} = 4{,}294{,}967{,}296$ possible functions. Selecting the one to use is a difficult task.
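The count above is easy to verify directly; a minimal Python sketch:

```python
def num_boolean_functions(d: int) -> int:
    """Number of distinct Boolean functions f: {0,1}^d -> {0,1}.

    There are 2**d possible inputs, and each input can independently be
    mapped to 0 or 1, giving 2**(2**d) functions in total.
    """
    return 2 ** (2 ** d)

# The count explodes doubly exponentially with the input dimension.
for d in range(1, 6):
    print(d, num_boolean_functions(d))
```

For $D = 5$ this prints the $4{,}294{,}967{,}296$ quoted above.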
The general viewpoint in those days was that all a computer had to do was mimic human thinking, whereby from a set of logical assessments one is required to draw a conclusion related to the observation.
In those days - the late fifties - there was a general conjecture that "by the end of the eighties" there would be no problems left regarding the modelling of the way humans think or make inferences. Later on, researchers realised that there is an obvious obstacle, the combinatorial explosion, which prohibits advances in this direction. Whilst the search for implementing expert systems - programs/functions that take a set of logical values and output a single logical value - has not been abandoned, the focus has moved to other, more realistic targets.
With the advance of computer technology, research specialised into specific tasks. Early specialisations included robotics, pattern recognition, and computer vision.
Today's robotics and pattern recognition consider special data that mainly arise in the respective areas. In the first case a general command unit is required to follow a desired path or to make an action plan; the problem is therefore to find an optimal route to the destination - optimal with respect to time, energy, etc. For pattern recognition the task is to classify different computer images as belonging to one category or another. Computer vision also has specific problems such as stereo recognition or real-time 3D modelling.
Machine learning today is regarded as the art of building models and building algorithms for specific data, i.e. to exploit specific domain-related knowledge in order to have faster and more capable algorithms.
Data mining is the process of using historical databases to improve subsequent decision making. For example, banks now analyse historical data to help decide which future loan applicants are likely to be creditworthy. Hospitals now analyse historical data to help decide which new patients are likely to respond best to which treatments. As the volume of on-line data grows each year, this niche for machine learning is bound to grow in importance.
Rather than focusing on the mathematical details of the presented algorithms, I will try to present a general approach to solving problems with machine learning: the working principles and canonical methods of model-driven data mining.

## Examination

The final grade for this course is composed as follows:

- 40% - Presentation on a topic chosen in the first 6 weeks of the semester (about 30 minutes long, maximum 40 slides).
- 20% - Oral examination based on the topics of the lectures and the seminars presented by the students.
- 40% - Solution of the laboratory examples (presenting the solutions to the practical problems is compulsory).

## Lectures summary

1. Randomness, random variables.
2. Generating random variables - correlated ones and uncorrelated ones.
3. Generative models of data. Modelling digit maps.
4. Principal component analysis. Displaying the principal components of faces.
5. Independent components; analysis using ICA.
6. Bayesian methods.
7. Kernel methods: Support Vector machines.
8. Bayesian methods.
9. Gaussian process models.

## Seminar topics

Seminars should be about 30 minutes long. They should be written in English and contain enough definitions that an unfamiliar reader understands most of the notation.

## Practicals

0. Sampling from a correlated Gaussian random variable.
The aim is to understand randomness and its use in data analysis and generative models. Example MATLAB file: gauss.m.
Your task is to obtain samples from a three-dimensional Gaussian that is constrained mostly to two directions; the description is found in the M-file.
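The course template is gauss.m; as a language-agnostic illustration of the same idea, here is a minimal Python/NumPy sketch with a made-up covariance (two large variances, one tiny one, so the samples lie mostly in a plane - the exact setup in gauss.m may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariance: large variance along two directions, tiny along
# the third, so samples are constrained mostly to a 2-D plane.
eigvals = np.array([4.0, 2.0, 0.01])              # spread along each axis
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthonormal axes
cov = Q @ np.diag(eigvals) @ Q.T

# Sample via the Cholesky factor: x = L z with z ~ N(0, I) has cov(x) = L L^T.
L = np.linalg.cholesky(cov)
samples = (L @ rng.standard_normal((3, 5000))).T  # shape (5000, 3)

# The empirical covariance should be close to the target.
print(np.round(np.cov(samples.T), 2))
```

The Cholesky trick is the standard way to turn uncorrelated standard-normal draws into correlated ones with a prescribed covariance.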
1. Finding the eigen-images of centred USPS digits.
The USPS database (usps.mat) is a collection of bitmaps representing handwritten digits. There is a test and a training set available in the MATLAB-file.
Your task is to collect all digits "8" from the database (both train and test) and to visualise the first three eigen-images of the digit "8" subset. Once the eigen-decomposition is done, you can use vis.m to visualise the results. Collect the results in a document.
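The usps.mat data is not reproduced here; the eigen-image computation itself can be sketched in Python/NumPy on stand-in data (one flattened bitmap per row, with random values standing in for the digit "8" images):

```python
import numpy as np

# Stand-in for the digit "8" subset: one 16x16 bitmap per row, flattened
# to a 256-vector (random values here; load usps.mat in the real task).
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 256))

# Centre the data; the principal directions are then the right singular
# vectors of the centred data matrix (equivalently, the eigenvectors of
# the sample covariance).
X_centred = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centred, full_matrices=False)

eigen_images = Vt[:3]      # first three eigen-images, one per row
print(eigen_images.shape)  # each row reshapes to a 16x16 image for display
```

In the real exercise each row of `eigen_images` is reshaped back to the bitmap size and displayed with vis.m.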
2. Independent Component Analysis of Recordings
Independent Component Analysis (ICA) is a recent technique to separate sources based on their statistical independence. Using ICA one can separate sources "blindly", having only their mixtures available.
A good example of blind source separation (BSS) is the separation of speakers in a room: assume that there are $k$ speakers - we call them sources - $s_1,\ldots,s_k$, and $k$ microphones - called mixtures: $x_1,\ldots,x_k$. We know that each mixture is a linear combination of the sources, each mixture with a different ratio of combining the sources; we represent the weight of source $s_j$ in mixture $x_i$ by $A_{ij}$.
Your task is to write a program to separate (or find) the sources given the mixtures in the MATLAB files. A template of the solution is in decompose.m, which includes samples of the useful MATLAB commands. You are advised to use the FastICA MATLAB package (original URL: www.cis.hut.fi/projects/ica/fastica).
3. Independent components of natural images.
In 1996 an article appeared in Nature that presented the results of the ICA method on a collection of natural images: it claimed that the filtering matrices are similar to the receptive fields present in the brain.
The task is to reproduce the experiments of Olshausen and Field:
1. Import the collection of images (if using MATLAB, use imread) and transform them into grey-scale images. Images are available in the images directory;
2. Choose dim, the size of the square patches (a suggestion is to start with 10);
3. Extract patches and store them in a data matrix in column format. The extraction may be overlapping, i.e. select the top (or bottom) left corner of a patch randomly from a randomly selected image;
4. Apply the functions from the FastICA package (see above) to find the independent components;
5. Visualise the results. Compare with the results presented in the book of Hyvärinen et al. from the Literature.
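The patch-extraction step (step 3) can be sketched as follows in Python/NumPy; the images here are random stand-ins for the grey-scale images in the images directory:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the grey-scale images in the images directory.
images = [rng.random((64, 64)) for _ in range(5)]

dim = 10            # patch side length, as suggested
n_patches = 1000

# Pick a random image, then a random top-left corner inside it, and store
# each dim-by-dim patch as one column of the data matrix.
data = np.empty((dim * dim, n_patches))
for j in range(n_patches):
    img = images[rng.integers(len(images))]
    r = rng.integers(img.shape[0] - dim + 1)
    c = rng.integers(img.shape[1] - dim + 1)
    data[:, j] = img[r:r + dim, c:c + dim].ravel()

print(data.shape)   # (100, 1000): patches in column format, ready for ICA
```

The resulting matrix is in the column format that the FastICA routines expect.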
4. Finding clusters in data
You will find artificial three-dimensional data in the following file: d_em.txt. The data has been generated from a number of clusters. Using the NETLAB package, test Gaussian mixture models with different numbers of components and find the most appropriate one. An excellent reference is Bishop (2006), chapter 9, pp. 423-439.
Visualise the results. You may start with the code in the MATLAB file gmm_solve.m.
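The course uses NETLAB; the model-selection idea can be sketched in Python with scikit-learn, using made-up three-dimensional data from three clusters (d_em.txt is not reproduced here) and the BIC criterion to pick the number of components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in 3-D data: three well-separated Gaussian clusters.
rng = np.random.default_rng(4)
centres = np.array([[0., 0., 0.], [5., 5., 0.], [0., 5., 5.]])
X = np.vstack([c + rng.standard_normal((200, 3)) for c in centres])

# Fit mixtures with 1..6 components; the Bayesian Information Criterion
# trades off fit against the number of parameters.
bics = []
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))

best_k = int(np.argmin(bics)) + 1   # number of components with lowest BIC
print(best_k)
```

On these synthetic clusters the lowest BIC is reached at the true number of components.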
5. Exploring the Boston Housing Data.
The Boston Housing data is locally available from this LINK. One should select models to fit the data: from linear regression to quadratic, etc.
An example file - presented during the lectures - showing how to solve the linear system $y = w \cdot x + w_0$ is LIN_SOL.M.
The task is to analyse the Boston housing data, the problem is detailed in the MATLAB-file above:
• Construct a set of features from the Boston data, such as the bias term - $x^0$ - or product terms of different orders - $x_{i_1} \cdots x_{i_k}$ - and build the matrix of derived feature values;
• Associate a coefficient to each feature;
• Find the optimal values of the coefficients;
• Compute the errors -- here you should consider the computation of the training and test errors;
• Visualise system performance - the test and training errors - against the number of parameters the linear system has.
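The steps above can be sketched in Python/NumPy on stand-in one-dimensional data (the Boston data has 13 input features, loaded from the file linked above): build the derived-feature matrix, solve for the coefficients by least squares, and compare training and test errors across model sizes:

```python
import numpy as np

# Stand-in regression data: a quadratic relation plus noise.
rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, 200)
y = 1.0 + 0.5 * x - 0.3 * x**2 + 0.1 * rng.standard_normal(200)

x_tr, y_tr = x[:150], y[:150]       # training split
x_te, y_te = x[150:], y[150:]       # test split

def design_matrix(x, degree):
    # Columns x^0 (the bias term), x^1, ..., x^degree: derived features.
    return np.vander(x, degree + 1, increasing=True)

for degree in (1, 2, 5):
    Phi = design_matrix(x_tr, degree)
    w, *_ = np.linalg.lstsq(Phi, y_tr, rcond=None)   # optimal coefficients
    err_tr = np.mean((Phi @ w - y_tr) ** 2)
    err_te = np.mean((design_matrix(x_te, degree) @ w - y_te) ** 2)
    print(degree, round(err_tr, 4), round(err_te, 4))
```

Plotting the two errors against the number of parameters shows the usual picture: training error decreases monotonically while test error eventually rises as the model overfits.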
6. Bayesian Analysis of regression
In the lectures we presented the analysis of a coin throw. We established that we can encode prior knowledge into our decision process. In the coin experiment we assumed that we believed in the fairness of the coin - and encoded this belief in a prior distribution over the ratio of heads to tails. In Bayesian (linear) regression we believe that there is a (linear) relation between the observed input/output pairs, and we encode this belief in a Gaussian prior distribution over the parameters of the hyperplane.
The task is to devise an algorithm that updates our beliefs about the hyperplane parameters. A template is available in the M-file bayes_reg.m.
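The belief update has a closed form for a Gaussian prior and Gaussian noise; a Python/NumPy sketch on made-up data (bayes_reg.m supplies its own data and constants, and the precisions here are illustrative choices):

```python
import numpy as np

# Made-up data from a line y = w0 + w1*x with Gaussian noise.
rng = np.random.default_rng(6)
w_true = np.array([0.5, -0.3])            # [w0, w1]
x = rng.uniform(-1, 1, 50)
Phi = np.c_[np.ones_like(x), x]           # design matrix with a bias column
y = Phi @ w_true + 0.2 * rng.standard_normal(50)

alpha = 2.0                               # prior precision (illustrative)
beta = 25.0                               # noise precision, 1 / 0.2**2

# With prior N(0, alpha^-1 I) and Gaussian likelihood, the posterior over
# the parameters is Gaussian with
#   S_N^-1 = alpha I + beta Phi^T Phi   and   m_N = beta S_N Phi^T y
# (see Bishop 2006, Section 3.3).
S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ y)

print(np.round(m_N, 2))                   # posterior mean over [w0, w1]
```

The posterior mean shrinks towards the prior mean (zero) when data is scarce and approaches the least-squares solution as observations accumulate.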
7. Example of a kernel algorithm
The popular support vector classification algorithm belongs to the family of kernel algorithms. These algorithms are linear - but in a space that is different from the space of the inputs. Therefore, one needs to project the inputs from the data set to the space of features - as done when analysing the Boston housing data set. The projection is then replaced with the kernel function, and the solution to the classification algorithm is written in terms of the kernel function.
Use a kernel method to build a classification system for the FACES data-base (freely available from: http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html).
You should load the training data set and train a decision system to classify the previously unseen test data set.
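The train-then-classify workflow can be sketched in Python with scikit-learn's support vector classifier; two synthetic Gaussian classes stand in for the face/non-face images of the CBCL data set:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in for the FACES data: two classes of 10-dimensional points.
rng = np.random.default_rng(7)
X_train = np.vstack([rng.standard_normal((100, 10)) - 1,
                     rng.standard_normal((100, 10)) + 1])
y_train = np.array([0] * 100 + [1] * 100)
X_test = np.vstack([rng.standard_normal((50, 10)) - 1,
                    rng.standard_normal((50, 10)) + 1])
y_test = np.array([0] * 50 + [1] * 50)

# An RBF kernel: the classifier is linear in the implicit feature space,
# and the solution is expressed entirely through kernel evaluations.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on previously unseen data
```

With the real data set, the rows of `X_train` would be the flattened training images and `y_train` the face/non-face labels.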
Slides used during lectures: