# Practical tutorial on autoencoders for nonlinear feature fusion (Part 2)


Last week’s post provided an overview of autoencoders (AEs) and showed the different structures they can take, depending on the task. This week’s post discusses how AEs can be used for feature fusion.

## Model training

AEs are trained with backpropagation, and the weights and biases are optimised with algorithms such as stochastic gradient descent (SGD), AdaGrad, RMSProp, Adam or L-BFGS. The overall goal is to minimise a loss function, usually the mean squared error (MSE) between the input and its reconstruction. Like any other machine learning model, AEs can overfit the training data, so a regularisation term, such as weight decay, is usually added to the loss function. Moreover, the generalisation of autoencoders can be improved by tying the weights, that is, constraining each decoder weight matrix to be the transpose of the corresponding encoder matrix, a technique that leads to models with fewer parameters and faster training times.
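To make the training objective concrete, here is a minimal NumPy sketch of a shallow AE trained with plain full-batch gradient descent on MSE plus weight decay. It assumes a sigmoid encoder, a linear decoder and tied weights (the decoder reuses the transposed encoder matrix, which is one reading of the symmetric restriction mentioned above). The function names, layer sizes, learning rate and decay strength are all illustrative choices, not values from this tutorial.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, W, b, c, weight_decay=1e-4):
    """MSE + weight decay for a tied-weight AE, with hand-derived gradients."""
    h = sigmoid(x @ W + b)            # encoding, shape (n, k)
    x_hat = h @ W.T + c               # decoding reuses W transposed (tied weights)
    err = x_hat - x
    loss = np.mean(err ** 2) + weight_decay * np.sum(W ** 2)

    G = 2.0 * err / err.size          # d(MSE) / d(x_hat)
    A = (G @ W) * h * (1.0 - h)       # gradient at the encoder pre-activation
    # W receives gradient from both its decoder and its encoder role.
    grad_W = G.T @ h + x.T @ A + 2.0 * weight_decay * W
    return loss, grad_W, A.sum(axis=0), G.sum(axis=0)

def train(x, n_hidden=3, lr=0.5, epochs=200, seed=0):
    """Full-batch gradient descent; returns parameters and the loss history."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(x.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    c = np.zeros(x.shape[1])
    history = []
    for _ in range(epochs):
        loss, grad_W, grad_b, grad_c = loss_and_grads(x, W, b, c)
        W -= lr * grad_W
        b -= lr * grad_b
        c -= lr * grad_c
        history.append(loss)
    return W, b, c, history
```

Because the weights are tied, the model stores a single matrix `W` instead of separate encoder and decoder matrices, which is where the parameter saving comes from. In practice one of the optimisers listed above would normally replace the fixed-step update used here.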

## Stacking AEs

AEs can be stacked to obtain better feature representations. This is done by chaining shallow autoencoders and training them layer by layer. In the forward pass, the initial input is fed to the first hidden layer, whose output is passed to the second hidden layer, and so on. Each successive layer up to the encoding is trained in the same way. Then, the AE is unrolled: the remaining layers (the decoding phase) are added symmetrically, and their weight matrices are the transposes of those of the corresponding encoding layers. Finally, the network’s loss function is calculated and the weights are updated during the backward pass of backpropagation.
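The layer-by-layer procedure can be sketched as follows. This is a simplified illustration in NumPy, assuming sigmoid encoders, linear decoders, tied weights and plain gradient descent; the helper names (`train_shallow_ae`, `pretrain_stack`, `encode`, `decode`) and all hyperparameters are hypothetical. The final fine-tuning pass over the unrolled network is left out for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_shallow_ae(x, n_hidden, lr=0.5, epochs=200, seed=0):
    """Train one tied-weight shallow AE on x; return (W, enc_bias, dec_bias)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(x.shape[1], n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(x.shape[1])
    for _ in range(epochs):
        h = sigmoid(x @ W + b)
        G = 2.0 * (h @ W.T + c - x) / x.size   # gradient of MSE w.r.t. output
        A = (G @ W) * h * (1.0 - h)            # gradient at encoder pre-activation
        W -= lr * (G.T @ h + x.T @ A)
        b -= lr * A.sum(axis=0)
        c -= lr * G.sum(axis=0)
    return W, b, c

def pretrain_stack(x, layer_sizes):
    """Greedy pretraining: each shallow AE is trained on the previous codes."""
    layers, codes = [], x
    for i, n_hidden in enumerate(layer_sizes):
        W, b, c = train_shallow_ae(codes, n_hidden, seed=i)
        layers.append((W, b, c))
        codes = sigmoid(codes @ W + b)         # input for the next shallow AE
    return layers

def encode(x, layers):
    for W, b, _ in layers:
        x = sigmoid(x @ W + b)
    return x

def decode(code, layers):
    """Unrolled decoder: transposed encoder weights, applied in reverse order."""
    for W, _, c in reversed(layers):
        code = code @ W.T + c
    return code
```

Calling `decode(encode(x, layers), layers)` runs the full unrolled network, whose reconstruction error would then drive the final backpropagation step.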

Stacked autoencoders are quite useful because they capture hierarchical structures hidden in the input. The first layer of a stacked autoencoder tends to learn first-order features of the raw input, such as edges in an image. The second layer would learn second-order features corresponding to patterns appearing in the first-order features. Lastly, higher layers of the stacked autoencoder tend to learn even higher-order features.

## Sparse AEs

In general, sparsity means that most values for a given sample are zero, or close to it. In the context of AEs, the hidden units in the middle layer often activate for most training samples, which can lead to overfitting. For this reason, it is advisable to lower their firing rate so that they only activate for a small fraction of the inputs. This is done by imposing a sparsity constraint. In practice, **sparsity is an additional penalty term in the loss function**, and the firing threshold depends on the activation function being used.
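The text does not say which penalty is used. One common choice, assumed here, is the Kullback-Leibler divergence between a small target activation rate `rho` and the average activation of each hidden unit, which suits sigmoid activations in the range (0, 1). The names `kl_sparsity_penalty`, `sparse_loss`, `rho` and `beta` are illustrative.

```python
import numpy as np

def kl_sparsity_penalty(h, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || mean activation of the unit).

    h: hidden activations in (0, 1), shape (n_samples, n_hidden).
    """
    rho_hat = np.clip(h.mean(axis=0), eps, 1.0 - eps)   # average firing rate per unit
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return kl.sum()

def sparse_loss(x, x_hat, h, beta=3.0, rho=0.05):
    """Reconstruction MSE plus the weighted sparsity penalty."""
    return np.mean((x_hat - x) ** 2) + beta * kl_sparsity_penalty(h, rho)
```

The penalty is zero when every unit fires at exactly the target rate and grows as the average activations move away from it, so `beta` controls how strongly sparsity is traded against reconstruction quality.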

## Contractive AEs

A simple AE might be able to learn a low-dimensional representation of the input space, but its robustness to noisy features is not guaranteed. Contractive autoencoders learn representations that are robust to small changes in the input. As with most autoencoder variations, this is done by adding a penalty term to the cost function, one that penalises the sensitivity of the encoding to small perturbations of the training examples.
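A usual way to measure this sensitivity, assumed here since the text does not give a formula, is the squared Frobenius norm of the Jacobian of the encoding with respect to the input. For a sigmoid encoder `h = sigmoid(x @ W + b)` the Jacobian has a closed form, so the penalty can be computed without automatic differentiation. The function name and argument shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Mean over samples of the squared Frobenius norm of dh/dx.

    x: (n_samples, n_inputs), W: (n_inputs, n_hidden), b: (n_hidden,).
    For a sigmoid encoder, dh_j/dx_i = h_j * (1 - h_j) * W[i, j].
    """
    h = sigmoid(x @ W + b)
    dh_sq = (h * (1.0 - h)) ** 2          # squared sigmoid slope per unit
    w_sq = np.sum(W ** 2, axis=0)         # squared norm of each unit's weights
    return np.mean(dh_sq @ w_sq)
```

This value would be multiplied by a hyperparameter and added to the reconstruction loss. Units that are saturated (slope close to zero) or have small weights contribute little, which is exactly the insensitivity the penalty rewards.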

## Denoising AEs

Denoising autoencoders (DAEs) share the goal of contractive AEs, but achieve it in a different way. DAEs try to generate robust feature representations by reconstructing corrupted observations. They have the same structure as other versions of the algorithm; the main difference is in the training phase, where the input is partially destroyed. In detail, a fixed number of input values in each training example is chosen at random and set to zero. After encoding and decoding, the DAE’s reconstructions are compared to the original, uncorrupted training set, so the model learns to predict the missing values.
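The corruption step can be sketched in a few lines of NumPy. This is an illustration only: `mask_inputs`, `denoising_loss` and the `reconstruct` callable (standing in for any encoder plus decoder) are hypothetical names, and the 30% corruption level is an arbitrary choice.

```python
import numpy as np

def mask_inputs(x, fraction=0.3, rng=None):
    """Set a fixed number of randomly chosen values in each example to zero."""
    rng = np.random.default_rng() if rng is None else rng
    n_features = x.shape[1]
    n_zero = int(round(fraction * n_features))
    corrupted = x.copy()                  # the clean data stays untouched
    for row in corrupted:                 # each row is a view into `corrupted`
        idx = rng.choice(n_features, size=n_zero, replace=False)
        row[idx] = 0.0
    return corrupted

def denoising_loss(x, reconstruct, fraction=0.3, rng=None):
    """MSE between the clean input and the reconstruction of its corrupted copy."""
    x_hat = reconstruct(mask_inputs(x, fraction, rng))
    return np.mean((x_hat - x) ** 2)
```

The key detail is that the network sees the corrupted copy while the loss is computed against the clean `x`. Drawing a fresh mask at every epoch keeps the model from memorising one particular corruption pattern.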

An advantage of DAEs is that they do not need further regularisation, which reduces the number of hyperparameters. Moreover, DAEs can be used in a stacked fashion, as described above. Other types of noise can also be used, such as Gaussian noise or salt-and-pepper noise, which sets randomly chosen input values to the maximum or the minimum according to a uniform distribution.
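The two alternative corruptions mentioned here might look as follows. This is a hedged sketch: the function names, the noise level `sigma` and the corruption `fraction` are assumptions, and the second function implements what is usually called salt-and-pepper noise, using the minimum and maximum of the data as the two extreme values.

```python
import numpy as np

def gaussian_noise(x, sigma=0.1, rng=None):
    """Additive zero-mean Gaussian noise with standard deviation sigma."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(scale=sigma, size=x.shape)

def salt_and_pepper(x, fraction=0.1, rng=None):
    """Set roughly `fraction` of the values to the data maximum or minimum."""
    rng = np.random.default_rng() if rng is None else rng
    corrupted = x.copy()
    hit = rng.random(x.shape) < fraction     # which values get corrupted
    to_max = rng.random(x.shape) < 0.5       # fair coin: maximum or minimum
    corrupted[hit & to_max] = x.max()
    corrupted[hit & ~to_max] = x.min()
    return corrupted
```

Either function can replace zero-masking as the corruption step of a DAE; the reconstruction is still compared against the clean input.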

Finally, some domain-specific autoencoders are worth mentioning:

**Convolutional AE**: Standard AEs do not explicitly consider the 2-dimensional structure of image data. Convolutional AEs solve this by using convolutional layers instead of fully connected ones.

**LSTM AE**: Standard AEs are not designed to model sequential data, but LSTM AEs achieve this by using Long Short-Term Memory (LSTM) units as the encoder and decoder of the network. The encoder LSTM reads a sequence and compresses it into a fixed-size representation, from which the decoder attempts to reconstruct the original sequence in reverse order. This is particularly useful for video data!

**Adversarial AE**: Adversarial networks were the trend of 2017 and have influenced AEs too. Adversarial AEs model the encoding by imposing a prior distribution, then training a standard AE and, concurrently, a discriminative network that tries to distinguish encodings from samples drawn from the imposed prior. Since the generator (the encoder) is also trained to fool the discriminator, encodings will tend to follow the imposed distribution. As a result, adversarial AEs can also generate new meaningful samples.

**Part 1**: Autoencoders 101

**Part 3**: Comparing AEs to other feature fusion techniques