# Introduction

Recurrent Neural Networks (RNNs) are another class of Deep Neural Networks. The difference between recurrent and feed-forward nets is laid out in our Introduction article. In short, while feed-forward networks only contain directed weighted connections pointing from the input layer toward the output layer, recurrent nets can also contain connections back to previous layers (or to other units of the same layer). This makes it possible to include cycles and to model sequences of input data.

In practical terms, RNNs are realized by adding 'memory blocks'; this way, the states of previous cycles (i.e. the computed values of units) can be taken into account at arbitrary later positions.

However, training RNNs has proven to be more difficult than training usual DNNs. Hochreiter [2] as well as Bengio et al. [1] described the problems that arise with standard algorithms like back-propagation, referred to as vanishing and exploding gradients: the error signal carrying long-term information (from earlier cycles) either grows exponentially, overshadowing short-term information, or decays to zero.

# Backpropagation Through Time for RNNs

To understand this problem, note that an RNN which takes n steps in the past into account to compute its present output can be seen as a DNN with n hidden layers. So if a high number of past steps has to be included (e.g. to model speech with a length of n input windows / MFCCs etc.), the further back in time a weight has to be adjusted, the less 'sense' a propagated change to this weight makes with respect to the error function.
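The unrolling argument can be illustrated with a minimal scalar RNN sketch (the function name and weight values here are ours, chosen for illustration, not taken from the cited papers): processing n input steps stacks n nonlinearities, structurally the same as a feed-forward net with n hidden layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_unrolled(inputs, w_in, w_rec):
    """Scalar RNN: h(t) = sigmoid(w_in * x(t) + w_rec * h(t-1)).
    Unrolling over len(inputs) time steps applies the nonlinearity
    once per step, just like a feed-forward net with that many
    hidden layers of width one."""
    h = 0.0
    for x in inputs:
        h = sigmoid(w_in * x + w_rec * h)
    return h

# Three input steps -> three stacked nonlinearities; a gradient with
# respect to the first input must pass back through all of them.
print(rnn_unrolled([0.5, -0.2, 0.9], w_in=1.0, w_rec=0.8))
```

Adjusting a weight that influenced the first step therefore requires propagating an error signal through every intermediate nonlinearity, which is exactly where the gradient problems below arise.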

The exact behaviour can be seen by taking a closer look at the backpropagation algorithm. Suppose we want to compute how an error signal from a unit $u$ is propagated back to a unit $v$. The error signal at a certain time step $t$ basically describes in which way the weights should be adjusted to (hopefully) improve the match between the input and the desired output of the network. The article Training Neural Networks describes this procedure, known as gradient descent.

As carried out by Hochreiter [2], the error propagation from unit $u$ to unit $v$, whose temporal 'distance' to unit $u$ is a total of $q$ steps, can be written as follows:

$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \sum\limits^{n}_{l_1=1} \cdots \sum\limits^{n}_{l_{q-1}=1} \prod\limits^{q}_{m=1} \sigma'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}} \qquad (1)$

The important thing to take away is the value of the factors inside the product, namely the differentiated activation function $\sigma'$ of the node in question multiplied by the corresponding weight $w_{l_m l_{m-1}}$. If

$|\sigma'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}| > 1, \qquad (2)$

the gradient will 'explode', i.e. rise exponentially per step in time. If

$|\sigma'_{l_m}(net_{l_m}(t-m))\, w_{l_m l_{m-1}}| < 1, \qquad (3)$

it will decay and vanish. The global consequence of an exploding gradient can be seen e.g. in a leap in the gradient descent that skips over a (local or global) minimum; this can, for instance, lead to an oscillation around a minimum. A vanishing gradient shrinks the step size and therefore does not lead to a solution in any meaningful time.
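Conditions (2) and (3) can be made concrete with a small numeric sketch. For the logistic sigmoid the derivative is at most 0.25 (at net input 0); assuming, purely for illustration, the same factor along the whole path of length q:

```python
import math

def dsigmoid(x):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def gradient_factor_product(w, q, net=0.0):
    """Product of q identical factors sigma'(net) * w from equation (1),
    i.e. the contribution of a single path back through q time steps."""
    factor = dsigmoid(net) * w
    return factor ** q

# |sigma' * w| < 1: the error signal vanishes over q = 20 steps
print(gradient_factor_product(w=2.0, q=20))   # (0.25 * 2)^20 = 0.5^20 ~ 9.5e-7
# |sigma' * w| > 1: the error signal explodes
print(gradient_factor_product(w=6.0, q=20))   # (0.25 * 6)^20 = 1.5^20 ~ 3.3e3
```

Even a modest deviation of the per-step factor from 1 thus produces an exponentially vanishing or exploding error signal after a few dozen steps.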

# Long Short-Term Memory

## Layout

Figure 1: Standard LSTM-architecture: input-layer, LSTM-layer, output-layer. From [3]

To alleviate this problem, Hochreiter et al. [3] proposed an architecture for RNNs called Long Short-Term Memory. Figure 1 shows an example of an RNN with the proposed layout; it consists only of the input and output layers and a single, rather involved LSTM layer.

The basic principle is that error signals get stored inside a memory block called the Constant Error Carousel (CEC). This is simply a neuron with a connection to itself of weight 1, so that its value remains constant. This way the error signals can be propagated back in time without losing their validity. Special gate units 'watch' over those memory cells and can decide at each step,

• in which way, or whether at all, the CEC will be altered by an input (or input pattern, if one or more hidden layers have been inserted above); hence called the input gate,
• in which way the CEC can influence the output layer, called the output gate,
• or whether the CEC will be cleared, by simply setting the weight of the CEC self-connection to zero; this is called a forget gate.

All gates but the input gate are optional; e.g. Figure 1 does not include forget gates. The problem with this configuration is that only the input gates can alter the CEC, and they depend solely on the input pattern. So there is no guarantee that a specific CEC will stop influencing the output at a desirable time.
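The gating idea can be sketched as a single memory block with scalar weights (a deliberately simplified, hypothetical sketch: the weight names are ours, and unlike a full implementation the gates here see only the current input, with no biases or recurrent connections into the gates; the optional forget gate is included):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, cec, w):
    """One step of a single LSTM memory block (simplified sketch).
    `cec` is the Constant Error Carousel state; `w` is a dict of
    scalar weights, one per gate plus one for the cell input."""
    i = sigmoid(w["in"] * x)       # input gate: should the input alter the CEC?
    f = sigmoid(w["forget"] * x)   # forget gate: keep or clear the CEC
    o = sigmoid(w["out"] * x)      # output gate: expose the CEC or not
    g = math.tanh(w["cell"] * x)   # candidate cell input
    cec = f * cec + i * g          # self-connection of weight 1, scaled by f
    y = o * math.tanh(cec)
    return y, cec

# Usage: carry the CEC state across a short input sequence
state = 0.0
weights = {"in": 1.0, "forget": 1.0, "out": 1.0, "cell": 1.0}
for x in [1.0, 0.5, -0.3]:
    y, state = lstm_step(x, state, weights)
```

With the forget gate fixed open (f = 1) the update reduces to the pure CEC of [3]: the state is carried forward unchanged except for gated input, which is what lets error signals travel back in time undistorted.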

## Learning

Due to this special error-signal functionality, learning LSTMs is actually quicker than usual backpropagation. Errors computed for the LSTM elements, i.e. the gates and the CECs, are not propagated back in time, since only the CEC itself transports this information, which in this case is simply a constant, so it also does not consume any computing power 'down the line'. Once an error signal's corresponding input time has been reached (by propagating back in time), it is truncated after modifying the corresponding input weights.

The computational complexity has therefore been determined to be O(1) per weight, or O(|w|) for the whole LSTM model [3].

Sak et al. [5] used a slightly modified LSTM architecture for a large-scale ASR application. They found that while the reduced backpropagation algorithm suited to LSTMs scales linearly with the number of weights, this of course holds for only one time step. To capture significant temporal dependencies in those large networks, training is still computationally expensive.

## Performance

We will only take a look at the speech recognition performance measurements made in [5], as they represent the most recent state-of-the-art results with LSTMs. Unfortunately, none of the standard tests like TIMIT was run (see Benchmarks: Comparison of different architectures on TIMIT and large vocabulary tasks). But the scope of the article in question was large-scale ASR systems with a huge amount of labeled data available, namely "anonymized" Google voice search data with a correspondingly large variety of voice characteristics.

Figure 3: Output of 126 context-independent phone states. From [5]

Figure 3 shows the training of an HMM using simple context-independent phones as states. The interesting thing to note here are the inconsistencies of simple RNNs compared to LSTMs; they exhibit the problems mentioned earlier due to vanishing and exploding gradients. Those problems, which occur even after a surprisingly large amount of training data, are not present at all with LSTMs; not to mention the generally better performance of LSTMs, which holds against classical DNNs as well.

Figure 4: Output of 8000 context-dependent phone-states (i.e. triphones). From [5]

Figure 4 shows how different LSTM configurations stack up against DNNs in large-output scenarios. Note that 2048 temporal dependencies, each the length of a frame (approx. 10 ms), improve the WER by a great margin compared to half that number.

# Conclusion

In conclusion, LSTMs, and recurrent neural networks in general, offer a lot of potential and are able to surpass standard DNNs, as all of the architectures we explored are.

But the most interesting property of recurrent neural networks is their temporal modeling capability. It opens up the possibility of replacing the HMMs usually used for modeling speech dynamics. Boulanger-Lewandowski et al. [4] did exactly that and managed to improve phone accuracies on TIMIT and Switchboard tasks (see Benchmarks: Comparison of different architectures on TIMIT and large vocabulary tasks).

A combination of these techniques, i.e. LSTMs together with an output layer used as an HMM replacement, could probably further improve the overall performance.

# References

[1] Y. Bengio, P. Simard, P. Frasconi: "Learning Long-Term Dependencies with Gradient Descent is Difficult", IEEE Trans. Neural Networks, vol. 5, pp. 157 - 166, Mar. 1994

[2] S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen" (German; "Studies on Dynamic Neural Networks"), Diploma thesis, TU München, Jun. 1991

[3] S. Hochreiter, J. Schmidhuber, "Long Short-Term Memory", Neural Computation, vol. 9, pp. 1735 - 1780, Nov. 1997

[4] N. Boulanger-Lewandowski, J. Droppo, M. Seltzer, D. Yu: "Phone Sequence Modeling with Recurrent Neural Networks", Proc. Acoustics, Speech and Signal Processing (ICASSP), May 2014

[5] H. Sak, A. Senior, F. Beaufays: "Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition", Proc. Interspeech, Sep. 2014