1 Introduction

In order to recognize data (speech, images…), a neural network must be trained. There are many different algorithms for the training process, the two big categories are supervised and unsupervised learning. In supervised learning, the output data of the network is analyzed and compared to reference data. If the output does not look like the reference data, the weights of the neural network are adjusted until the output resembles the reference data. In unsupervised learning, the network does not have reference data. It optimizes itself so that all input data resembles defined criteria (e.g. clusters the data into different categories).

Nowadays, a neural network is mostly trained in two steps: Pre-training and Fine-tuning. Both of those steps can be performed supervised or unsupervised. This article explains those two steps by using algorithms that are state of the art for Deep Belief Networks, which consist of several Restricted Boltzmann Machines that are stacked together.

1.1 Pre-training – Greedy layer wise training

In this step, the weights of the neural network are initialized. This is done by the greedy layer wise algorithm, which optimizes one layer of the Deep Belief Network at a time. It is an unsupervised algorithm, as it has been shown that unsupervised pre-training leads to much better results than supervised pre-training or no pre-training [1]. The reason why this works much better is not known, but several hypothesis exist. For example that the unsupervised initialization gives a better generalization of the data compared to random initialization.

The general idea of the algorithm is to optimize one layer of the network at a time, starting at the input layer. Each layer learns a higher-level representation of the layer below.

In order for the greedy layer-wise training to work, an unsupervised optimization algorithm must be used. For RBMs, the Contrastive Divergence (CD, see below) algorithm is often used. The greedy layer-wise training algorithm can be described in the following steps:

1. Train the first layer of the network with an unsupervised training algorithm.
2. Freeze the parameters of the trained layer so that they cannot change.
3. Train the next layer by using the output data of the first layer.
4. Repeat steps two and three until all layers are trained.

As this procedure optimizes one layer at a time, it does, of course, not lead to the best possible optimization.

Figure 1: Greedy layerwise algorithm

1.2 Fine-tuning

After all weights of the network are initialized, it can be fine-tuned for better results. This can be done by several algorithms, for example supervised Back-propagation [4] or the unsupervised wake-sleep-algorithm [2]. Here, the wake-sleep algorithm is described.

1.2.1 Wake-sleep-algorithm

The algorithm has two phases: In the wake phase, the DBN builds a representation of the input vector. The first hidden layer produces a representation using the input layer, and the second hidden layer builds a representation using the first hidden layer as input and so on. For this, the recognition weights are used (the weights going from bottom to top). After a total representation of the input layer is built, the generative weights (weights going from top to bottom) are updated. This is done to maximize the probability of getting the original input data if the data is passed down from the top layer. In the sleep phase, this process is reversed: The recognition weights are updated using the generative weights.

Figure 2: wake sleep algorithm

2. Contrastive Divergence

The Contrastive Divergence algorithm is a training method for the Restricted Boltzmann Machine. It was proposed by Geoffrey Hinton in 2006 and led to the advancements of speech recognition known today. As shown in the main article about Restricted Boltzmann Machine, the average negative log-likelihood has to be minimized:

$\frac{1}{T}\sum_t-\log&space;p(\mathbf{x}^{(t)})$

This is done by using the stochastic gradient descent:

$\frac{\partial-\log&space;p(\mathbf{x}^{(t)})}{\partial&space;\theta}&space;=&space;\underbrace{E_\mathbf{h}\left&space;[&space;\frac{\partial&space;E(\mathbf{x}^{(t)},\mathbf{h})}{\partial&space;\theta}\bigg&space;\vert&space;\mathbf{x}^{(t)}\right&space;]}_{positive\:phase}&space;-&space;\underbrace{E_{\mathbf{x},\mathbf{h}}\left&space;[&space;\frac{\partial&space;E(\mathbf{x},\mathbf{h})}{\partial&space;\theta}\right&space;]}_{negative\:phase}$

Where theta is the parameter of the RBM that is optimized. In this formula, the negative phase is hard to compute (because it depends on $Z$, which is intractable). In order to calculate the negative phase, it has to be approximated.

2.1 General Idea of Contrastive Divergence

The idea is to replace the negative phase in the stochastic gradient descent by a point estimate at $\tilde{x}$. $\tilde{x}$ is obtained by Gibbs sampling of real data. For an article about Gibbs sampling see: Wikipedia. After calculation the point estimate, the wanted parameter can be calculated by calculating the derivations of the terms in the formula above.

2.2 Description of Contrastive Divergence

Figure 3: Contrastive Divergence: Gibbs Sampling

In the figure it can be seen that the hidden units are sampled (Gibbs sampling) by using the conditional probability $p(\mathbf{h}|\mathbf{v})$ using training data. After that, the visible layer is reconstructed by sampling from $p(\mathbf{v}|\mathbf{h})$. This procedure is alternated for $k$ steps. After those $k$ steps, $\tilde{x}$ is calculated. The negative phase can now be replaced by this estimate. For training a DBN, a $k$ of 1 is often used and sufficient.

2.3 Parameter update

After $\tilde{x}$ has been computed, the wanted parameter can be updated. This is done by computing the derivations in the formula of the stochastic gradient (see above). Those derivations yield the following [3]:

Derivation of $\frac{\partial&space;E(\mathbf{x},\mathbf{h})}{\partial&space;\theta}$ for $\theta&space;=&space;W_{jk}$

$\frac{\partial&space;E(\mathbf{x},\mathbf{h})}{\partial&space;W_{jk}}&space;=&space;\frac{\partial}{\partial&space;W_{jk}}&space;\bigg&space;(&space;-&space;\sum_{jk}&space;W_{jk}h{j}x_{k}&space;-&space;\sum_k&space;c_k&space;x_k&space;-&space;\sum_j&space;b_jh_j&space;\bigg)&space;\newline&space;=&space;-&space;\frac{\partial}{\partial&space;W_{jk}}&space;\sum_{jk}&space;W_{jk}h_jx_k&space;\newline&space;=&space;-h_jx_k$

In scalar form:

$\nabla&space;\mathbf{w}E(\mathbf{x},\mathbf{h})&space;=&space;-\mathbf{hx}^\intercal$

Derivation of $E_h&space;\bigg[&space;\frac{\partial&space;E(\mathbf{x},\mathbf{h})}{\partial&space;\theta}&space;\bigg|&space;\mathbf{x}&space;\bigg&space;]$ for $\theta&space;=&space;W_{jk}$

$E_h&space;\bigg[&space;\frac{\partial&space;E(\mathbf{x},\mathbf{h})}{\partial&space;W_{jk}}&space;\bigg|&space;\mathbf{x}\bigg]&space;=&space;E_h&space;\bigg[&space;-h_jx_k\bigg|&space;\mathbf{x}&space;\bigg]&space;=&space;\sum_{h_j&space;\in&space;\left&space;\{&space;0,1\right&space;\}}&space;-h_jx_kp(h_j|\mathbf{x})&space;\newline&space;=&space;-x_kp(h_j=1|\mathbf{x})$

In scalar form:

$E_h[\nabla&space;\mathbf{w}E(\mathbf{x},\mathbf{h})|\mathbf{x}]=-\mathbf{h}(\mathbf{x})\mathbf{x}^\intercal$

Where

$\mathbf{h}(\mathbf{x})&space;\mathop{=}\limits^{def}&space;\textrm{sigm}(\mathbf{b}+\mathbf{Wx})$

Now W can be updated:

$\mathbf{W}\Leftarrow&space;\mathbf{W}-&space;\alpha&space;\big(\nabla&space;\mathbf{w}&space;\log&space;p(\mathbf{x}^{(t)})&space;\big)&space;=&space;\mathbf{W}&space;+&space;\alpha&space;\big(&space;\mathbf{h}(\mathbf{x}^{(t)})&space;\mathbf{x}^{(t)\strut^{\intercal}}&space;-\mathbf{h}(\mathbf{\tilde{x}})\mathbf{\tilde{x}\strut^\intercal}&space;\big)$

In this example $W$ has been updated. For the other parameters, $b$ and $c$, the same equations can be used which leads to the following equations:

$\mathbf{b}\Leftarrow&space;\mathbf{b}&space;+&space;\alpha&space;\big(&space;\mathbf{h}(\mathbf{x}^{(t)})&space;-\mathbf{h}(\mathbf{\tilde{x}})&space;\big)$

$\mathbf{c}\Leftarrow&space;\mathbf{c}&space;+&space;\alpha&space;\big(&space;\mathbf{x}^{(t)}&space;-\tilde{\mathbf{x}}&space;\big)$

This procedure is repeated until a stopping criteria is met, for example that the value of the parameter does not change significantly anymore.

References

[1] Erhan, D. (2010). Why does unsupervised pre-training help deep learning?. The Journal of Machine Learning Research, 11, 625-660.

[2] Hinton, G. E. (1995). The" wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214), 1158-1161.

[3] Marlin, B. M (2010). Inductive principles for restricted Boltzmann machine learning. In International Conference on Artificial Intelligence and Statistics, pages 509-516.

[4] Rumelhart, D. E. (1988). Learning representations by back-propagating errors. Cognitive modeling, .