Today’s paper discusses a few things that machine learning practitioners and researchers should know before getting their hands dirty. It focuses on classification tasks, describing common pitfalls and key concepts and answering frequently asked questions.
Which learning algorithm to use?
Selecting a machine learning model is not a single-step process. A learning algorithm can be broken down into three components:
- Representation: Choose a family of models (classifiers, in this case) that can effectively represent the input data.
- Evaluation: Use a metric (objective function) to tell which classifier performs best.
- Optimisation: Find a method to search for the classifier that scores best on the evaluation metric.
As the table shows, there are many options to choose from at every step; however, not every combination makes sense.
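To make the three components concrete, here is a minimal sketch of one sensible combination: a linear model squashed through a sigmoid (representation), log loss (evaluation) and batch gradient descent (optimisation). The toy dataset, the learning rate and the function names are assumptions made for illustration; they do not come from the paper.

```python
import math
import random

# Representation: a linear model whose output is squashed to a probability.
def predict_proba(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Evaluation: average log loss over a labelled dataset (lower is better).
def log_loss(weights, bias, X, y):
    eps = 1e-12  # keeps the logarithm away from log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(weights, bias, xi), eps), 1 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

# Optimisation: batch gradient descent on the log loss.
def fit(X, y, lr=0.5, epochs=200):
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

# Toy data (assumed for this sketch): class 1 when x0 + x1 > 0.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if a + b > 0 else 0 for a, b in X]

loss_before = log_loss([0.0, 0.0], 0.0, X, y)
weights, bias = fit(X, y)
loss_after = log_loss(weights, bias, X, y)
print(f"log loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Swapping any one piece (say, a different loss or a different search procedure) gives a different learner, which is why the components are worth thinking about separately.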
Always aim for models that can generalise
Train algorithms that perform well on new data, because it is unlikely that the given dataset captures every possible observation. To quantify a model’s ability to generalise, split your dataset into a training set and a test set, set the latter aside and use only the former to train the model. An algorithm that performs exceptionally well on the training set but makes poor predictions on the test set suffers from overfitting. Decomposing the generalisation error into bias and variance makes it easier to understand what overfitting is. Bias is a classifier’s tendency to consistently make the same mistakes, which leads to underfitting, while variance is its tendency to learn random patterns in the training data regardless of the real signal, which leads to overfitting.
Both cases (underfitting and overfitting) seem a bit daunting, but fear not. Cross-validation can be used to detect and combat overfitting, while regularisation can reduce the variance of a classifier. All in all, there is no free lunch in machine learning: no method performs best in every situation, so one should experiment to find out what works.
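A small sketch can show what overfitting looks like in numbers. It assumes a made-up one-dimensional dataset with 20% label noise and uses a 1-nearest-neighbour classifier, which memorises the training set. The dataset, the noise level and the helper names are all illustrative choices, not something taken from the paper.

```python
import random

rng = random.Random(42)

def make_point():
    # True rule: class 1 when x > 0.5; 20% of the labels are flipped as noise.
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.2:
        label = 1 - label
    return x, label

data = [make_point() for _ in range(400)]

# Hold out the last quarter as a test set; it is never used for fitting.
train, test = data[:300], data[300:]

def one_nn_predict(x, memory):
    # 1-nearest-neighbour: copy the label of the closest memorised point.
    return min(memory, key=lambda p: abs(p[0] - x))[1]

def accuracy(memory, points):
    hits = sum(one_nn_predict(x, memory) == label for x, label in points)
    return hits / len(points)

def k_fold_accuracy(points, k=5):
    # Every fold is scored by a model that never saw it during fitting.
    folds = [points[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(rest, held_out))
    return sum(scores) / k

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
cv_acc = k_fold_accuracy(train)     # estimated using the training set only
test_acc = accuracy(train, test)    # measured on the held-out data

print(f"train: {train_acc:.2f}  cross-validation: {cv_acc:.2f}  test: {test_acc:.2f}")
```

The training accuracy is perfect because the classifier has memorised the noise along with the signal. The cross-validated score is the more honest estimate, and it is available without touching the test set.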
More features do not always lead to better models
The goal in a classification task is to find a decision boundary (a hyperplane, in the case of a linear classifier) that accurately separates the classes. Initially, adding features improves the performance of the classifier; however, it also spreads the observations across a higher-dimensional space (one dimension per feature included in the model), so a training set of fixed size covers that space ever more sparsely. Consequently, it becomes harder for the classifier to find patterns in the given examples. This is the curse of dimensionality. There are some ways to tackle it, such as using PCA or t-SNE, which project the feature set onto a lower-dimensional space.
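One symptom of the curse of dimensionality is that distances stop being informative: in high dimensions, the nearest and the farthest neighbour of a point end up almost equally far away. The sketch below measures this on uniformly random points. The point counts, the seed and the function name are arbitrary choices made for illustration.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    # Ratio of the smallest to the largest distance from a random query point
    # to a cloud of random points in the unit hypercube of the given dimension.
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    distances = [math.dist(query, p) for p in points]
    return min(distances) / max(distances)

ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"{dim:>4} dimensions: nearest/farthest = {ratio:.2f}")
```

A ratio close to 1 means that "similar" and "dissimilar" examples are nearly indistinguishable by distance, which hurts any method that relies on similarity.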
More data or better algorithms?
Try simple classifiers first and more sophisticated learners afterwards. One reason is that the payoff from using more complex models could be small, while they require longer training times and more memory. Moreover, if additional data collection is an option, it is recommended to increase the size of the training set and do some feature engineering before turning to complex classifiers.
There is power in numbers
Why use a single classifier when you can combine many of them? Model ensembling can be split into three techniques: bagging, boosting and stacking.
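To give a feel for the first of these, here is a minimal sketch of bagging: each model is trained on a bootstrap resample of the training set, and predictions are combined by majority vote. The base learner (a one-threshold decision stump), the toy data and the number of models are assumptions made for this example.

```python
import random
from collections import Counter

rng = random.Random(1)

def make_point():
    # True rule: class 1 when x > 0.5; 10% of the labels are flipped as noise.
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    return x, label

train = [make_point() for _ in range(200)]

def fit_stump(points):
    # Pick the threshold with the fewest training errors
    # for the rule "predict 1 when x > threshold".
    best_threshold, best_errors = 0.0, len(points) + 1
    for threshold, _ in points:
        errors = sum(int(x > threshold) != label for x, label in points)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bootstrap(points):
    # Sample with replacement, keeping the size of the original set.
    return [rng.choice(points) for _ in points]

# An odd number of models avoids tied votes in a two-class problem.
stumps = [fit_stump(bootstrap(train)) for _ in range(25)]

def bagged_predict(x):
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]

print(bagged_predict(0.05), bagged_predict(0.95))
```

Each stump sees a slightly different dataset and lands on a slightly different threshold; the vote averages out those individual quirks.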
Bagging generates random variations of the training set by resampling, learns a classifier on each variation and then produces an output by taking a vote among the trained models. Boosting assigns weights to the training examples so that each new classifier focuses mostly on the examples that the previous classifiers labelled wrongly. Finally, stacking feeds the outputs of individual classifiers as inputs to another layer of models.
Correlation does not imply causation
Correlation means that two events tend to happen together; however, it does not imply that one causes the other. For this reason, we should treat correlations as signals that require further investigation and avoid jumping to conclusions.
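A common way for this to happen is a hidden third factor that drives both variables. The sketch below simulates such a case with synthetic data: `x` and `y` never influence each other, yet they are strongly correlated because both depend on `z`. The variables, the noise levels and the sample size are invented for illustration.

```python
import random
import statistics

rng = random.Random(7)
n = 2000

# Hypothetical setup: a hidden factor z drives both x and y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + 0.3 * rng.gauss(0, 1) for zi in z]
y = [zi + 0.3 * rng.gauss(0, 1) for zi in z]

def pearson(a, b):
    # Pearson correlation coefficient of two equally long sequences.
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

raw = pearson(x, y)

# Remove the contribution of z (its coefficient is 1 by construction)
# and correlate what is left of each variable.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
adjusted = pearson(x_resid, y_resid)

print(f"correlation of x and y: {raw:.2f}")
print(f"after accounting for z: {adjusted:.2f}")
```

In real data the hidden factor is usually not observed, which is exactly why a strong correlation calls for further investigation rather than a causal claim.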
Images taken from:
- Overfitting: http://mlwiki.org/index.php/Overfitting
- Spurious correlation: http://www.tylervigen.com/spurious-correlations