In speech recognition, training is an important step that determines the parameters of the acoustic model. However, it can introduce unnecessary complexity. Dynamic time warping (DTW), a template-based decoding method, makes it possible to decode the feature vectors of a word sequence without any training.

1. Motivation

In automatic speech recognition, a speech signal is represented as a temporal sequence of feature vectors, which is compared with every template word in a vocabulary using a distance metric. To find the template word that corresponds to the sequence, two main problems need to be solved: the endpoints of the word are not known exactly, and the speech may be distorted. Furthermore, the speed and acceleration of speech can vary between utterances of the same word [1]. Dynamic time warping was introduced to deal with these problems.

 

2. Dynamic time warping algorithm

In the DTW algorithm, the feature vector sequence of a speech signal and that of a template word are aligned by warping the time axis iteratively until an optimal match between the two sequences is found. As shown in figure 1, the two sequences are A(i) (i = 1...n) and B(j) (j = 1...m). So, how does the algorithm work in detail?

Figure 1: The path for the best alignment between the two discrete feature vectors 

2.1 Obtaining the optimal path mathematically

As shown in figure 1, to get the best alignment between the two discrete feature vector sequences (their lengths may differ slightly), an optimal path P = p1, …, ps, …, pk (a warping function) through the grid needs to be found.

The optimal path is the path that minimizes the time-normalized distance D(A, B) between the two sequences A and B, which is given by:

$$D(A, B) = \min_{P} \left[ \frac{\sum_{s=1}^{k} d(p_s)\, w_s}{\sum_{s=1}^{k} w_s} \right]$$

where ps = (is, js) is a point along the path, d(ps) is the distance between A(is) and B(js) at this point, and ws > 0 is a weight [2].
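To make the definition concrete, the following minimal sketch evaluates the time-normalized cost of one given path, that is, the weighted sum of d(ps) * ws divided by the sum of the weights ws. It rests on several assumptions that are not in the source: features are plain numbers, d is the absolute difference, indices are 0-based, and weights default to 1. The function name `time_normalized_distance` is purely illustrative.

```python
def time_normalized_distance(a, b, path, weights=None):
    """Cost of ONE given warping path: sum(d * w) / sum(w).

    a, b    : sequences of scalar features (assumption: 1-D features)
    path    : list of (i, j) index pairs, 0-based
    weights : one positive weight per path point (assumption: 1.0 if omitted)
    """
    if weights is None:
        weights = [1.0] * len(path)
    weighted_sum = 0.0
    for (i, j), w in zip(path, weights):
        # Local distance d(p_s); absolute difference is an assumed metric.
        weighted_sum += abs(a[i] - b[j]) * w
    return weighted_sum / sum(weights)


# Hypothetical toy data: the diagonal path through two short sequences.
cost = time_normalized_distance([0, 4], [1, 2], [(0, 0), (1, 1)])
```

DTW itself searches for the path that minimizes this quantity; the sketch only scores a path that is already known.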

2.2 Constraints to the optimal path

The number of possible paths grows exponentially with the lengths of A and B. Thus, five constraints on the path are necessary.

1. Monotonicity condition: The indices i and j never decrease along the path.

2. Continuity condition: At each step along the path, i and j each increase by at most one.

3. Boundary condition: As shown in figure 1, the path starts at the bottom left and ends at the top right.

4. Warping window condition: As shown in figure 1, the path never strays too far from the diagonal.

5. Slope constraint: The path cannot be too shallow or too steep [2].
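The first four constraints can be checked mechanically for a candidate path, as in the sketch below. It assumes 0-based indices, a path given as a list of (i, j) pairs, and a hypothetical `window` parameter that bounds |i - j|. The slope constraint is left out because the text gives no concrete bounds for it.

```python
def satisfies_constraints(path, n, m, window):
    """Check monotonicity, continuity, boundary and warping window.

    path   : list of (i, j) pairs, 0-based
    n, m   : lengths of the two sequences
    window : maximum allowed |i - j| (hypothetical parameter)
    """
    # Boundary condition: start at bottom left, end at top right.
    if path[0] != (0, 0) or path[-1] != (n - 1, m - 1):
        return False
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        di, dj = i1 - i0, j1 - j0
        # Monotonicity condition: indices never decrease.
        if di < 0 or dj < 0:
            return False
        # Continuity condition: each index advances by at most one.
        if di > 1 or dj > 1:
            return False
        # Assumption: every step must advance at least one index.
        if di == 0 and dj == 0:
            return False
    # Warping window condition: stay close to the diagonal.
    return all(abs(i - j) <= window for i, j in path)


ok = satisfies_constraints([(0, 0), (1, 1), (1, 2), (2, 3)], n=3, m=4, window=1)
```

In practice these conditions are usually built into the search itself, so that invalid paths are never generated, rather than checked afterwards.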

2.3 Tracing-back

An advantage of the DTW algorithm is that it keeps track of the accumulated cost of the best path to every grid point it reaches. Sometimes two different ways of extending the path to a point give the same cost, so several candidates for the optimal path exist during the calculation. For this reason, the optimal path is recovered by tracing back from the endpoint once it has been reached.
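The accumulated-cost table and the tracing-back step can be sketched as follows. This is a simplified illustration, not the exact formulation of [2]: it assumes scalar features, absolute difference as the local distance, unit weights without normalization, three allowed moves (diagonal, vertical, horizontal), and no window or slope constraint. Ties are broken in favor of the diagonal move, which is an arbitrary choice.

```python
def dtw(a, b):
    """Return (accumulated cost, optimal path) for sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest accumulated cost of any path from (0, 0) to (i, j)
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(a[i] - b[j])  # assumed local distance
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best_prev = min(
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
            )
            cost[i][j] = d + best_prev

    # Tracing back: walk from the endpoint to the start, always moving
    # to the cheapest predecessor (diagonal first on ties).
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((cost[i - 1][j], (i - 1, j)))
        if j > 0:
            candidates.append((cost[i][j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    path.reverse()
    return cost[n - 1][m - 1], path


# Hypothetical toy sequences: b repeats the middle value of a.
total, path = dtw([1, 2, 3], [1, 2, 2, 3])
```

For these toy sequences the total cost is 0 and the path is (0, 0), (1, 1), (1, 2), (2, 3): the second element of the first sequence is matched twice to absorb the repeated value.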

2.4 The complexity of the DTW algorithm

Suppose the feature vector sequence of a speech signal has length m and the vocabulary contains t template words with an average length of n. Then the complexity of the DTW algorithm is O(m*n*t). With optimizations such as preselecting the most likely template words before running DTW, the complexity can be reduced considerably.
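The O(m*n*t) bound follows directly from the structure of a template-based recognizer: one m-by-n table is filled for each of the t templates. The sketch below illustrates this under simplifying assumptions (scalar features, absolute difference as local distance, no normalization or path constraints). The names `dtw_distance` and `recognize` and the toy vocabulary are hypothetical.

```python
def dtw_distance(a, b):
    """Accumulated DTW cost between a and b, keeping only two table rows."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # virtual start cell before the first elements
    for x in a:  # m iterations
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):  # n iterations
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def recognize(signal, templates):
    """Return the template word with the smallest DTW distance.

    templates: dict mapping each word to its feature sequence (t entries),
    so the total work is about m * n * t local distance evaluations.
    """
    return min(templates, key=lambda word: dtw_distance(signal, templates[word]))


# Hypothetical toy vocabulary with scalar "features".
templates = {"yes": [1, 2, 3], "no": [5, 5, 1]}
word = recognize([1, 2, 2, 3], templates)
```

A preselection step as mentioned above would shrink the `templates` dictionary before `recognize` is called, which reduces the factor t.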

 

3. The application of dynamic time warping

Any kind of data that can be converted into temporal sequences can be processed with DTW [3]. Besides automatic speech recognition, it has also been applied to online signature recognition.

In speech recognition, there can be several different feature vector sequences for the same word or phrase. To find out which feature vectors are unreliable and to improve their joint likelihood, the sequences are first aligned with the least distortion using the multi pattern dynamic time warping (MPDTW) algorithm, and the HMM (see Introduction to Hidden Markov Models) evolution of the feature vectors is then tracked along the optimum MPDTW path [4]. Figure 2 gives an example of an MPDTW path for three utterances, where each axis corresponds to one utterance.

Figure 2: Example MPDTW path for 3 utterances 

 

4. References

[1] "Dynamische Programmierung in der Spracherkennung", Damir Culjat, FU-Berlin

[2] "Dynamic time warping algorithm for gene expression time series", PPT, Elena Tsiporkova

[3] "Dynamic time warping", Wikipedia, http://en.wikipedia.org/wiki/Dynamic_time_warping

[4] "Forward/Backward algorithms for joint multi pattern speech recognition", Nishanth Ulhas Nair and T.V. Sreenivas, Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore

