# Practical tutorial on autoencoders for nonlinear feature fusion (Part 3)


The previous posts on autoencoders described their structure, possible variations and which of them are useful in feature fusion. This last post on the topic compares them with other techniques and summarises what a developer should consider when designing a model.

Generally, feature fusion algorithms can be supervised or unsupervised, convex or nonconvex and linear or nonlinear. Autoencoders are unsupervised, nonconvex and nonlinear models. Here, we will compare AEs with linear and nonlinear techniques.

## Linear models

### Principal Component Analysis (PCA)

PCA transforms a set of correlated variables into a collection of linearly uncorrelated features called principal components. The first principal component has the maximum variance, the second has the maximum possible variance while being uncorrelated with (orthogonal to) the first, the third has the maximum possible variance while being uncorrelated with the first and second, and so on. PCA produces good results when its assumptions are met. An AE with linear activations and a squared-error loss learns the subspace spanned by the principal components of a dataset. AEs can therefore be seen as a generalisation of PCA, since they can also represent nonlinear features.
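To make the properties above concrete, here is a minimal NumPy sketch of PCA computed through the singular value decomposition. The toy data, the `pca` helper and all variable names are our own assumptions for illustration; they are not taken from the earlier posts.

```python
import numpy as np

def pca(X, n_components):
    """PCA of a (samples x features) matrix via SVD.

    Returns the principal directions, the projected data (scores)
    and the variance explained by each component.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    scores = X_centered @ components.T
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return components, scores, explained_variance

rng = np.random.default_rng(0)
# Toy data (an assumption): 3 correlated variables built from 2 underlying signals.
signals = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
mixing = np.array([[1.0, 0.5, 0.2], [0.3, 1.0, 0.7]])
X = signals @ mixing + 0.05 * rng.normal(size=(500, 3))

components, scores, explained_variance = pca(X, n_components=3)
# Covariance of the projected data: diagonal, because the components are uncorrelated.
score_cov = np.cov(scores, rowvar=False)
```

Keeping only the first `k` rows of `components` gives the subspace that a linear AE with `k` hidden units and a squared-error loss recovers at its optimum.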

### Factor Analysis (FA)

Factor analysis is similar to PCA but tackles the feature representation problem differently. It assumes that there is a set of latent variables that can produce the observed features if combined linearly. FA is quite similar to variational AEs, since both attempt to find a latent variable space that describes the training features.
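The generative assumption behind FA can be sketched by sampling from it. In the snippet below the loading matrix `W`, the noise variances `psi` and the sizes are hypothetical values picked only for illustration, and the latent variables and noise are assumed to be Gaussian, as in the standard formulation of FA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_factors = 20000, 5, 2

# Hypothetical loading matrix and per-feature noise variances (illustration only).
W = rng.normal(size=(n_features, n_factors))
psi = rng.uniform(0.1, 0.5, size=n_features)

# Generative model: x = W z + noise, with z ~ N(0, I) and noise ~ N(0, diag(psi)).
z = rng.normal(size=(n_samples, n_factors))
noise = rng.normal(size=(n_samples, n_features)) * np.sqrt(psi)
X = z @ W.T + noise

# Under this model the covariance of the observed features is W W^T + diag(psi).
model_cov = W @ W.T + np.diag(psi)
sample_cov = np.cov(X, rowvar=False)
max_abs_gap = np.abs(sample_cov - model_cov).max()
```

Fitting FA is the reverse task: given only `X`, estimate `W` and `psi`. The sketch only shows that a linear combination of a few latent variables plus noise reproduces the covariance structure of the observed features.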

### Linear Discriminant Analysis (LDA)

LDA (not to be confused with Latent Dirichlet Allocation) is a supervised method that finds linear combinations of features that maximise the class separation. Furthermore, LDA assumes that the classes are normally distributed and share the same covariance (homoscedasticity). It differs considerably from autoencoders, which are unsupervised and therefore might not separate the classes successfully.
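The difference between a supervised and an unsupervised projection can be seen in a small two-class sketch. It computes the Fisher/LDA direction in closed form and compares it with the first principal direction. The toy classes, the helper names and the separation score are assumptions made for this example.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Two-class Fisher/LDA direction, obtained by solving Sw w = (mu1 - mu0)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class scatter; pooling reflects the shared-covariance assumption.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

def first_principal_direction(X):
    """Direction of maximum variance, ignoring the class labels."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[0]

def class_separation(w, X0, X1):
    """Fisher criterion: squared distance between projected means over projected variances."""
    p0, p1 = X0 @ w, X1 @ w
    return (p1.mean() - p0.mean()) ** 2 / (p0.var() + p1.var())

rng = np.random.default_rng(0)
# Toy classes (an assumption): same covariance, means shifted along the low-variance axis.
shared_cov = np.array([[3.0, 2.5], [2.5, 3.0]])
X0 = rng.multivariate_normal([0.0, 0.0], shared_cov, size=300)
X1 = rng.multivariate_normal([1.5, -1.5], shared_cov, size=300)

w_lda = fisher_direction(X0, X1)
w_pca = first_principal_direction(np.vstack([X0, X1]))
sep_lda = class_separation(w_lda, X0, X1)
sep_pca = class_separation(w_pca, X0, X1)
```

On data like this, the label-free direction follows the largest variance and the classes overlap along it, while the supervised direction keeps them apart. An unsupervised AE faces the same limitation as PCA here, because it never sees the labels.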

## Nonlinear approaches

### Kernel PCA

Kernel PCA is an extension of PCA that uses kernels to extract nonlinear combinations of the variables. It produces output similar to that of basic autoencoders; however, AEs are easier to train because determining a kernel’s suitability can be tricky.
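As a rough illustration, kernel PCA with an RBF kernel can be written in a few lines of NumPy. The choice of kernel, the value of `gamma` and the toy data are assumptions of this sketch, and having to pick them by hand is exactly the tricky part mentioned above.

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma):
    """Project the training points onto the leading kernel principal components."""
    # Pairwise squared Euclidean distances, then the RBF kernel matrix.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))
    # Centre the kernel matrix, which centres the data in the implicit feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order, so pick the largest ones.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    idx = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # Training-point projections: eigenvectors scaled by the square root of the eigenvalues.
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0)), eigvals

rng = np.random.default_rng(0)
# Toy data (an assumption): two noisy concentric circles.
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 0.3, 1.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X += 0.02 * rng.normal(size=X.shape)

Z, eigvals = rbf_kernel_pca(X, n_components=2, gamma=5.0)
```

Note that the function only embeds the training points. Changing `gamma` or swapping the kernel can give a very different embedding, whereas an AE learns its nonlinearity from the data.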

### Manifold learning

**Multidimensional Scaling (MDS)** finds new coordinates for the training features in a lower dimensional space, while maintaining their relative distances. It does this by computing the pairwise distances among the points and transforming them to Euclidean space. MDS does nonlinear feature fusion differently from AEs, which generally do not directly take into account distances among pairs of samples, and instead optimize a global measure of fitness. However, the objective function of an AE can be combined with that of MDS in order to produce a nonlinear embedding which considers pairwise distances among points.
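The distance-preserving idea can be sketched with classical (metric) MDS, which is only one of several MDS variants and is our choice for this example. The data and function names below are assumptions: points that really live on a 2-D plane are rotated into 5-D, and the embedding is recovered from the pairwise distances alone.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components):
    """Classical MDS: double-centre the squared distances and eigendecompose."""
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)  # centring matrix
    B = -0.5 * J @ (D ** 2) @ J               # Gram matrix of the centred points
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

rng = np.random.default_rng(0)
# Toy data (an assumption): a 2-D point cloud rotated into 5-D space.
plane_points = rng.normal(size=(50, 2))
rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
X = np.hstack([plane_points, np.zeros((50, 3))]) @ rotation

D = pairwise_distances(X)
Y = classical_mds(D, n_components=2)
embedding_error = np.abs(pairwise_distances(Y) - D).max()
```

The input to `classical_mds` is the distance matrix `D`, not the features themselves, which is the key contrast with an AE that works on the features and minimises a reconstruction error.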

**Isomap**, which is an extension of MDS, and **Locally Linear Embedding** are both similar to contractive AEs since they attempt to preserve the locality of the data in their representations. However, the advantage of using AEs is that, once trained, they can project new instances onto the latent space, whereas these methods in their basic form do not provide a mapping for unseen points.
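The reason an AE can handle new instances is that its encoder is an explicit parametric function. Below is a deliberately tiny sketch of this: a one-hidden-unit AE trained with plain gradient descent in NumPy. The architecture, learning rate, toy data and helper names are all assumptions chosen to keep the example short, not a recommended setup.

```python
import numpy as np

def make_data(n, rng):
    # Toy data (an assumption): points near a 1-D curve embedded in 3-D.
    t = rng.uniform(-1.0, 1.0, size=(n, 1))
    return np.hstack([t, t ** 2, np.sin(2.0 * t)]) + 0.01 * rng.normal(size=(n, 3))

def init_params(n_in, n_hidden, rng):
    return {
        "W_enc": 0.5 * rng.normal(size=(n_in, n_hidden)), "b_enc": np.zeros(n_hidden),
        "W_dec": 0.5 * rng.normal(size=(n_hidden, n_in)), "b_dec": np.zeros(n_in),
    }

def encode(params, X):
    return np.tanh(X @ params["W_enc"] + params["b_enc"])

def decode(params, H):
    return H @ params["W_dec"] + params["b_dec"]

def mse(params, X):
    return np.mean((decode(params, encode(params, X)) - X) ** 2)

def train(params, X, lr=0.1, n_steps=2000):
    """Full-batch gradient descent on the mean squared reconstruction error."""
    for _ in range(n_steps):
        H = encode(params, X)
        err = decode(params, H) - X
        grad_out = 2.0 * err / err.size                       # d loss / d reconstruction
        grad_pre = (grad_out @ params["W_dec"].T) * (1.0 - H ** 2)  # back through tanh
        grads = {
            "W_dec": H.T @ grad_out, "b_dec": grad_out.sum(axis=0),
            "W_enc": X.T @ grad_pre, "b_enc": grad_pre.sum(axis=0),
        }
        for name in params:
            params[name] -= lr * grads[name]
    return params

rng = np.random.default_rng(0)
X_train, X_new = make_data(400, rng), make_data(5, rng)
params = init_params(n_in=3, n_hidden=1, rng=rng)
initial_loss = mse(params, X_train)
train(params, X_train)
final_loss = mse(params, X_train)

# The trained encoder is just a function, so unseen instances need no retraining.
new_codes = encode(params, X_new)
```

With Isomap or LLE in their basic form, embedding `X_new` would require recomputing the embedding or using an additional out-of-sample extension.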

## Other applications of autoencoders

**Classification**: Reducing or transforming the training data in order to achieve better performance in a classifier.

**Data compression**: Training AEs for specific types of data to learn efficient compressions.

**Detection of abnormal patterns**: Identification of discordant instances by analyzing generated encodings.

**Hashing**: Summarizing input data onto a binary vector for faster search.

**Visualization**: Projecting data onto 2 or 3 dimensions with an AE for graphical representation.
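As an example of one of these applications, the hashing idea can be sketched as follows. The encodings here are random stand-ins for the output of a trained encoder with sigmoid units, and the threshold, code length and function names are assumptions made for illustration.

```python
import numpy as np

def binary_hash(codes, threshold=0.5):
    """Binarise real-valued encodings (assumed to lie in [0, 1], e.g. sigmoid outputs)."""
    return (codes > threshold).astype(np.uint8)

def hamming_search(query_bits, database_bits, k=3):
    """Return the indices and Hamming distances of the k closest binary codes."""
    distances = (database_bits != query_bits).sum(axis=1)
    order = np.argsort(distances, kind="stable")[:k]
    return order, distances[order]

rng = np.random.default_rng(0)
# Stand-in for encoder outputs (hypothetical): 1000 items, 16 sigmoid activations each.
codes = rng.uniform(size=(1000, 16))
database_bits = binary_hash(codes)

# Search with the code of item 42 as the query.
query_bits = database_bits[42]
nearest, nearest_dist = hamming_search(query_bits, database_bits)
```

Comparing short binary vectors is much cheaper than comparing the original inputs, which is where the faster search comes from.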

## Design choices

As the figure below shows, there are various design choices that a developer can make in order to build a robust and useful autoencoder.

**Architecture**. An AE’s structure, and especially the number of hidden units in the encoding layer, is quite important as it determines the encoded output of the model. If the length of the encoding is very small relative to the number of original variables, training a deep stacked AE should be considered.

**Activation and loss functions**. The activation functions applied within each layer have to be chosen according to the loss function that will be optimized. As discussed in the first post, sigmoid, tanh and SELU are the recommended choices for AEs, while mean squared error and cross entropy are commonly used as loss functions.

**Regularizations**. Various regularizations may be applied to improve the resulting encoding. Adding weight decay or creating a sparse AE is often a good idea, while a contraction regularization may be valuable if the data forms a lower-dimensional manifold.
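To show how these choices fit together, the sketch below writes the two losses and three regularizers as plain NumPy functions and combines them for a single forward pass. Sigmoid units, the KL-divergence form of the sparsity penalty, and all coefficients and names are assumptions of this example; in practice they are hyperparameters to tune.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mse_loss(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def cross_entropy_loss(x, x_hat, eps=1e-12):
    """Binary cross entropy; assumes inputs and reconstructions lie in [0, 1]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -np.mean(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def weight_decay(weights, lam):
    """L2 penalty on all weight matrices."""
    return lam * sum(np.sum(W ** 2) for W in weights)

def sparsity_penalty(hidden, rho=0.05, beta=1.0, eps=1e-12):
    """KL divergence between a target activation rho and each unit's mean activation."""
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return beta * kl.sum()

def contractive_penalty(hidden, W_enc, lam):
    """Squared Frobenius norm of the encoder Jacobian for sigmoid units, averaged over samples."""
    unit_slopes_sq = (hidden * (1.0 - hidden)) ** 2   # (samples, hidden units)
    column_norms_sq = (W_enc ** 2).sum(axis=0)        # (hidden units,)
    return lam * np.mean(unit_slopes_sq @ column_norms_sq)

rng = np.random.default_rng(0)
X = rng.uniform(size=(32, 8))                # toy inputs scaled to [0, 1]
W_enc = 0.1 * rng.normal(size=(8, 3))        # hypothetical 8 -> 3 -> 8 architecture
W_dec = 0.1 * rng.normal(size=(3, 8))

H = sigmoid(X @ W_enc)
X_hat = sigmoid(H @ W_dec)

total_loss = (
    cross_entropy_loss(X, X_hat)
    + weight_decay([W_enc, W_dec], lam=1e-4)
    + sparsity_penalty(H, rho=0.05, beta=0.1)
)
```

The sigmoid output layer is paired with cross entropy here because both assume values in [0, 1]. With unbounded inputs, a linear output layer and `mse_loss` would be the more natural pairing. The contractive term is left out of `total_loss` but could be added in the same way.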

That is it for autoencoders. I hope you enjoyed the three-part review!

**Part 1**: Autoencoders 101

**Part 2**: Autoencoders for feature learning

#### Images:

**PCA vs LDA**: Linear Discriminant Analysis, Sebastian Raschka blog