# General Idea

One of the major problems of Deep Neural Networks has always been the difficult learning procedure. The computational power involved in (supervised) training of a neural network consisting of more than one hidden layer increases exponentially with each additional layer. Therefore it is a NP-hard problem, also called non-convex.

To circumvent this limitation, Deng and Yu introduced Deep Convex Networks in 2011 [1] with an architecture slightly different to the original approach of Deep Neural Networks. Even though DCNs or Deep Stacking Networks (DSNs) are considered to be a deep architecture, they actually consist of small sub-nets with just one hidden layer.

An example can be seen in figure 1; each colour indicates a sub-net, also referred to as module. The output of any sub-net can be copied to higher layers to create a convoluted architecture. Note that each sub-net can be learned individually and in a parallel manner, as no input layer depends on the output of a lower-level sub-net. This allows for high parallelism in learning and therefore the usage of e.g. GPUs.

Figure 1: The layout of a DSN. From [2, pp. 253]

Each sub-net consists of the weight matrices $\textbf{U}$ and $\textbf{W}$, connected by a hidden layer $\textbf{h}$. If $\sigma$ is the corresponding, nonlinear activation function, the matrix of hidden units $\textbf{H}$ can be derived from the input layer units $\textbf{X}$:

$\textbf{H}&space;=&space;\sigma&space;(\textbf{W}^\textbf{T}\textbf{X}),\hfill(1)$

as already discussed in the Introduction to Artificial Neural Networks. As activation function in this very case, $\sigma$ was chosen to

$\sigma&space;=&space;\frac{1}{1&space;+&space;e^{-x}}.\hfill(2)$

After that, $\textbf{U}$ can easily be learned by minimizing the squared Frobenius norm:

$\underset{\textbf{U}^T}{\textrm{min}}f&space;=&space;||\textbf{U}^T\textbf{H}&space;-&space;\textbf{O}||^2_F,\hfill(3)$

which is a composition of convex functions and therefore convex.

Even though the DSN has a rather simplistic architecture, it has been shown that its performance in speech recognition tasks can outperform other types of discriminative models.

# Generalisation: Tensor Deep Stacking Network

Later, Hutchinson et al. generalised the DCN to the Tensor Deep Stacking Network (T-DSN) [3]. The main feature of T-DSNs is the layout of the hidden layers. It is possible to split the hidden layer to operate two hidden layers in parallel, much like the STC feature generation used for Long-Span Temporal Patterns. They are then joined together again by a three-dimensional tensor $U$ instead of the two-dimensional weight matrix $\bf{U}$ (equation 3). This can be seen in figure 2.

Figure 2: T-DSN architecture. From [3]

The convex properties of the STN still apply for the T-DSN. The possibility of concatenating input data (or the output of more T-DSNs) at various points gives a high degree of modeling power and also room for further optimizations.

# Limitations

As mentioned before, the weight matrix $\bf{U}$ connecting the hidden layers with the output layer can be learned in a convex manner. This however, does not hold true for $\bf{W}$. The whole learning procedure is therefore non-convex.

Deng et al. as a result updated the T-DSN to its kernel version, another generalization hence called Kernel-DSN or K-DSN [4]. It creates a mapping function from input to output layer while eliminating the need to explicitly calculate values for hidden units.

The basic principle relies on the so-called kernel trick. To 'describe' the relation between input- and output-layer, a function of a higher dimension than input- and output-layer may be needed. A good explanation of this problem can be found here. In our case, this high-dimensional mapping function is shaped by the hidden layer.

The K-DSN managed to significantly outperform the methods above.

# References

[1] L. Deng, D. Yu, "Deep Convex Net: A Scalable Architecture for Speech Pattern Classification", Interspeech, Aug. 2011

[2] L. Deng, D. Yu, "Deep Learning: Methods and Applications", Foundations and Trends in Signal Processing, vol. 7, 197 - 387, Jun. 2014

[3] B. Hutchinson, L. Deng, D. Yu, "A Deep Architecture with Bilinear Modeling of Hidden Representations: Applications to Phonetic Recognition", Proc. Acoustics, Speech and Signal Processing (ICASSP), Mar. 2012

[4] L. Deng, G. Tur, X. He, D. Hakkani-Tur, "Use of Kernel Deep Convex Networks and End-to-End Learning for Spoken Language Understanding", Proc. IEEE Workshop on Spoken Language Technologies, Dec. 2012