# 1. Introduction

Restricted Boltzmann Machines (RBMs) are a variant of neural networks that can classify input data after being trained by a learning algorithm. In essence, an RBM learns the probability distribution of a data set during the training phase and can then classify unknown data using this distribution. The idea itself is not new: RBMs were first proposed in 1986, but they only recently attracted the interest of scientists after Geoffrey Hinton and his colleagues invented fast learning algorithms that make use of them [2].

RBMs are the building blocks of Deep Belief Networks (DBNs). In these networks, several RBMs are stacked to form multiple layers. Each layer learns features from the features extracted by the layer beneath it, resulting in an increasingly high-level representation of the data.

In contrast to deterministic networks, an RBM is a stochastic network: its neurons are not activated deterministically by thresholds, but stochastically according to probabilities.

An RBM consists of visible units $x$ and hidden units $h$. Each visible unit is connected to every hidden unit and vice versa, forming a bipartite undirected graph. Each connection has an associated weight $w$, a parameter that is optimized during training. In addition, each visible and hidden unit carries a bias term, $c$ or $b$ respectively. The visible units represent the observable data, while the hidden units capture the dependencies between the observed variables.
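As a concrete sketch of this structure, the parameters can be held in one weight matrix and two bias vectors (the unit counts below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 6, 3   # arbitrary sizes for illustration

# One weight per (hidden, visible) pair: every visible unit is
# connected to every hidden unit, so W is a dense matrix.
W = rng.normal(0.0, 0.01, size=(n_hidden, n_visible))
b = np.zeros(n_hidden)       # hidden bias terms
c = np.zeros(n_visible)      # visible bias terms

print(W.shape)  # (3, 6)
```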

In order to get the probability distribution of the input data, an energy function has to be defined.

# 1.1 Definition

RBMs are often described as an Energy-Based Model, which means that a scalar energy is associated with each configuration of the variables of interest. For RBMs, the goal is to assign low energies to trained configurations (i.e. vectors that should be recognized) and high energies to all other configurations. The energy function of the RBM can be written as follows [6]:

$E(\mathbf{x},\mathbf{h}) = -\sum_j\sum_k W_{j,k} h_j x_k - \sum_k c_k x_k - \sum_j b_j h_j$

where $W$ holds the weights of the connections between $x$ and $h$, and $c$ and $b$ are the bias vectors of the visible and hidden layer, respectively. To obtain a probability distribution from this energy function, the exponential function is used:

$p(\mathbf{x},\mathbf{h}) = \exp(-E(\mathbf{x},\mathbf{h}))/Z$

where $Z$ is the partition function: the sum of $\exp(-E(\mathbf{x},\mathbf{h}))$ over all possible configurations of $x$ and $h$. Because $x$ and $h$ are binary in this example, the number of configurations grows exponentially with the number of units, so $Z$ is intractable to compute exactly (although there are methods to approximate it). The formula shows that real data (which the neural network should recognize) leads to high probabilities, while other data leads to low probabilities.
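A minimal sketch of the energy function and a brute-force computation of $Z$ makes the intractability concrete; enumerating every configuration is only feasible for tiny models like the one assumed here:

```python
import numpy as np
from itertools import product

def energy(x, h, W, b, c):
    """E(x, h) = -h^T W x - c^T x - b^T h for binary vectors x and h."""
    return -h @ W @ x - c @ x - b @ h

def partition(W, b, c):
    """Z: sum of exp(-E(x, h)) over every binary configuration.

    The sum has 2**(n_visible + n_hidden) terms, which is why Z is
    intractable for realistically sized RBMs.
    """
    n_hidden, n_visible = W.shape
    return sum(
        np.exp(-energy(np.array(x), np.array(h), W, b, c))
        for x in product([0, 1], repeat=n_visible)
        for h in product([0, 1], repeat=n_hidden)
    )

def p(x, h, W, b, c):
    """Joint probability p(x, h) = exp(-E(x, h)) / Z."""
    return np.exp(-energy(x, h, W, b, c)) / partition(W, b, c)
```

With all parameters zero, every configuration has energy 0, so $Z$ is simply the number of configurations, e.g. $2^{2+2}=16$ for two visible and two hidden units.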

# 1.2 Probabilities

The conditional distributions of $h$ given $x$ (and vice versa) do not depend on $Z$ and factorize into a product:

$p(\mathbf{h}|\mathbf{x})=\prod_j p(h_j|\mathbf{x})$

Because $h$ and $x$ are binary, the probability of each hidden unit being active can be written as:

$p(h_j=1|\mathbf{x}) = \frac{1}{1+\exp(-(b_j+\mathbf{W}_{j.} \mathbf{x}))}$

So calculating the probability of each $h_j$ is simple. Because the Restricted Boltzmann Machine is symmetric (an undirected graph), the same holds in the other direction:

$p(\mathbf{x}|\mathbf{h})= \prod_k p(x_k|\mathbf{h})$

and

$p(x_k=1|\mathbf{h})= \frac{1}{1+\exp(-(c_k+\mathbf{h}^\intercal \mathbf{W}_{.k}))}$
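Both conditionals are simply sigmoids of an affine function of the conditioning layer, and each whole vector of probabilities can be computed at once. A sketch, with shapes as in the formulas above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, b):
    """Vector of p(h_j = 1 | x) = sigmoid(b_j + W_{j.} x) for all j."""
    return sigmoid(b + W @ x)

def p_x_given_h(h, W, c):
    """Vector of p(x_k = 1 | h) = sigmoid(c_k + h^T W_{.k}) for all k."""
    return sigmoid(c + W.T @ h)
```

Note that with zero weights and biases every unit is active with probability 0.5, as the sigmoid of 0 is 0.5.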

The unconditional probability $p(\mathbf{x})$ can be written as:

$p(\mathbf{x})= \sum_{\mathbf{h}\in \left \{ 0,1\right \}^H} p(\mathbf{x},\mathbf{h}) = \sum_{\mathbf{h}\in \left \{ 0,1\right \}^H}\exp(-E(\mathbf{x},\mathbf{h}))/Z \\ =\exp(\mathbf{c}^\intercal \mathbf{x} + \sum_{j=1}^{H}\log(1+\exp(b_j + \mathbf{W}_{j.}\mathbf{x})))/Z \\ = \exp(\mathbf{c}^\intercal \mathbf{x} + \sum_{j=1}^{H}\textrm{softplus}(b_j + \mathbf{W}_{j.}\mathbf{x}))/Z$

From the formula for $p(\mathbf{x})$ it can be seen that, in order to assign a high probability to the input data $x$, the corresponding parameters $c$, $b$ and $W$ have to be fitted to $x$ during training.
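The term inside the exponential (often called the negative free energy) can be computed without $Z$; only the normalization is intractable. A sketch of this unnormalized quantity:

```python
import numpy as np

def unnormalized_log_p(x, W, b, c):
    """c^T x + sum_j softplus(b_j + W_{j.} x), i.e. log p(x) + log Z.

    np.logaddexp(0, z) computes log(1 + exp(z)) = softplus(z) in a
    numerically stable way.
    """
    return c @ x + np.sum(np.logaddexp(0.0, b + W @ x))
```

Because $\log Z$ is a constant shared by all inputs, this quantity already suffices to compare the probabilities of different input vectors.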

# 1.3 Training objective

In order to obtain the parameters of the RBM, it has to be trained. As shown in the last section, the probability of real input data has to be maximized, which is equivalent to minimizing the average negative log-likelihood over the $T$ training examples:

$\frac{1}{T}\sum_t-\log p(\mathbf{x}^{(t)})$

This is done with stochastic gradient descent; the gradient of the negative log-likelihood decomposes into two terms:

$\frac{\partial-\log p(\mathbf{x}^{(t)})}{\partial \theta} = E_\mathbf{h}\left[ \frac{\partial E(\mathbf{x}^{(t)},\mathbf{h})}{\partial \theta}\bigg \vert \mathbf{x}^{(t)}\right] - E_{\mathbf{x},\mathbf{h}}\left[ \frac{\partial E(\mathbf{x},\mathbf{h})}{\partial \theta}\right]$

where $\theta$ is the parameter of the RBM that is being optimized ($W$, $b$ or $c$). The positive phase

$E_\mathbf{h}\left[ \frac{\partial E(\mathbf{x}^{(t)},\mathbf{h})}{\partial \theta}\bigg \vert \mathbf{x}^{(t)}\right]$

of the formula is easily calculated, but the negative phase

$E_{\mathbf{x},\mathbf{h}}\left[ \frac{\partial E(\mathbf{x},\mathbf{h})}{\partial \theta}\right]$

depends on $Z$ and therefore has to be approximated. The Contrastive Divergence algorithm can be used for this approximation [3].
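Contrastive Divergence replaces the negative phase with statistics from a short Gibbs chain started at the data. The following is an illustrative sketch of one CD-1 update on a single binary input vector; real implementations work on mini-batches over many epochs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, lr=0.1, rng=None):
    """One CD-1 parameter update on a single binary input vector x."""
    if rng is None:
        rng = np.random.default_rng()
    # Positive phase: hidden activation probabilities given the data.
    ph0 = sigmoid(b + W @ x)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    # Negative phase: one Gibbs step to obtain a reconstruction of x.
    px1 = sigmoid(c + W.T @ h0)
    x1 = (rng.random(px1.shape) < px1).astype(float)
    ph1 = sigmoid(b + W @ x1)
    # Update: positive statistics minus approximated negative statistics.
    W = W + lr * (np.outer(ph0, x) - np.outer(ph1, x1))
    b = b + lr * (ph0 - ph1)
    c = c + lr * (x - x1)
    return W, b, c
```

Using the hidden probabilities rather than sampled states in the update terms is a common variance-reduction choice; the learning rate and the single Gibbs step (CD-1) are the simplest possible settings.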

# 2. Example

To understand more clearly how an RBM works, the following video can serve as an example. In it, digits from the MNIST database [5] are presented to a Restricted Boltzmann Machine. The RBM is thus given an image recognition task, which is not the topic of this website; voice (or voice features) could be used just as well, but images are better suited for visualization. The video shows the input digit on the left and the RBM's reconstruction on the right.

The video starts with random weights; as can be seen quite clearly, the output is only noise and not usable. Next, a pre-trained RBM is used. In this case, the regenerated output is quite good, even though the Restricted Boltzmann Machine sometimes confuses digits that look alike (e.g. 2 and 7). In the last part, the weights have been fine-tuned with the back-propagation algorithm, yielding good results with only occasional ambiguities.