Speech Recognition Using DSP Processor [Case Study]

A Case Study On Speech Recognition Using DSP Processor

Speech processing is the study of speech signals and of the methods used to process them, and it is a prerequisite for speech recognition. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals.


Speech processing can be divided into the following categories:

  •  Speech recognition: which deals with the analysis of the linguistic content of a speech signal.
  •  Speaker recognition: where the aim is to recognize the identity of the speaker.
  •  Speech synthesis: the artificial synthesis of speech, which usually means computer-generated speech.

Here we discuss the implementation of a speaker recognition system on the TMS320C6713 DSP processor.



 Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves.

This technique makes it possible to use the speaker’s voice to verify their identity and control access to services such as voice dialing, banking by telephone, information services, voice mail, security control for confidential information areas, and remote access to computers.

For example, users have to speak a PIN (Personal Identification Number) in order to gain access to the laboratory door, or users have to speak their credit card number over the telephone line to verify their identity. By checking the voice characteristics of the input utterance, using an automatic speaker recognition system similar to the one that we will describe, the system is able to add an extra level of security.

Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.

At the highest level, every speaker recognition system contains two main modules: feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching is the procedure that identifies the unknown speaker by comparing the features extracted from their voice input with those from a set of known speakers. We will discuss each module in detail in later sections.

Figure 1. Basic structures of speaker identification and speaker verification systems.

All speaker recognition systems operate in two distinct phases. The first is referred to as the enrolment or training phase, while the second is referred to as the operational or testing phase.

In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker.

In speaker verification systems, a speaker-specific threshold is also computed from the training samples. In the testing phase, the input speech is matched against the stored reference model(s) and a recognition decision is made.

The sections above give only a brief overview of speaker recognition. If you are already familiar with it, you can skip them and concentrate on the sections below.


Speech Feature Extraction in Speech Recognition


The purpose of this module is to convert the speech waveform, using digital signal processing (DSP) tools, to a set of features (at a considerably lower information rate) for further analysis.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (on the order of 0.2 seconds or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.

Figure 2. An example of a speech signal.

A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others.

MFCC is perhaps the best known and most popular of these, and it is the one described in this case study.

MFCCs are based on the known variation of the human ear’s critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.

Mel-frequency cepstrum coefficients processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which cover most of the energy of the sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible than the speech waveforms themselves to variations in the speech signal.


Figure 3. Block diagram of the MFCC processor.

Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N - M samples, and so on. This process continues until all the speech is accounted for within one or more frames. Typical values are N = 256 (about 25.6 msec of speech at a 10 kHz sampling rate, and a power of two, which suits the fast radix-2 FFT) and M = 100.
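
The frame blocking step can be sketched in a few lines of Python. This is a minimal illustration rather than the code that runs on the DSP: the function name `frame_blocking` is hypothetical, and zero-padding the last frame so that every sample is covered is an assumption, since the text does not say how the tail of the signal is handled.

```python
def frame_blocking(signal, frame_len=256, hop=100):
    """Split `signal` into overlapping frames of `frame_len` samples.

    Consecutive frames start `hop` samples apart (M in the text), so they
    overlap by frame_len - hop samples. The final frame is zero-padded
    (an assumption) so that every input sample falls inside some frame.
    """
    if not 0 < hop < frame_len:
        raise ValueError("require 0 < hop < frame_len")
    if len(signal) == 0:
        return []
    frames = []
    start = 0
    while True:
        frame = list(signal[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))   # pad the tail if needed
        frames.append(frame)
        if start + frame_len >= len(signal):        # all samples covered
            break
        start += hop
    return frames


samples = list(range(1000))        # stand-in for 1000 speech samples
frames = frame_blocking(samples)   # N = 256, M = 100
```

With N = 256 and M = 100, each frame shares its last 156 samples with the start of the next one.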


Windowing

The next processing step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of the frame. The idea is to minimize spectral distortion by using the window to taper the signal toward zero at both ends of each frame. If we define the window as w(n), 0 ≤ n ≤ N - 1, where N is the number of samples in each frame, and x(n) is the frame before windowing, then the result of windowing is the signal

$$y(n) = x(n)\, w(n), \qquad 0 \le n \le N - 1$$


Typically the Hamming window is used, which has the form:

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N - 1}\right), \qquad 0 \le n \le N - 1$$
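
As an illustration, the Hamming window and its application to one frame can be written as follows. This is a sketch using only the Python standard library; the helper names `hamming_window` and `apply_window` are hypothetical, and the coefficients 0.54 and 0.46 are the standard Hamming values.

```python
import math


def hamming_window(n_samples):
    """Return the Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame, window):
    """Multiply a frame by the window sample by sample: y(n) = x(n) * w(n)."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [x * w for x, w in zip(frame, window)]


window = hamming_window(256)
windowed = apply_window([1.0] * 256, window)   # a constant frame, for illustration
```

The window weights fall to 0.08 at both ends of the frame rather than exactly to zero, which is characteristic of the Hamming window.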





Fast Fourier Transform (FFT)

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT), which is defined on the set of N samples {x_n} as follows:

$$X_k = \sum_{n=0}^{N-1} x_n\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N - 1$$
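
To make the relationship between the DFT and the FFT concrete, here is a small sketch in plain Python. `dft` evaluates the definition directly, while `fft_radix2` is a textbook recursive radix-2 decimation-in-time FFT that needs a power-of-two length (such as N = 256). Both function names are hypothetical, and a real DSP implementation would normally use an optimized library routine rather than code like this.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) evaluation of X_k = sum_n x_n * exp(-j*2*pi*k*n/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])          # FFT of even-indexed samples
    odd = fft_radix2(x[1::2])           # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# A 32-sample cosine with exactly 4 cycles per frame, for illustration.
frame = [math.cos(2.0 * math.pi * 4 * n / 32) for n in range(32)]
spectrum = fft_radix2(frame)
# Power spectrum over the non-redundant half (bins 0 .. N/2) of a real signal.
power = [abs(c) ** 2 for c in spectrum[:len(frame) // 2 + 1]]
```

For a real-valued frame only bins 0 to N/2 carry independent information, which is why the power spectrum above keeps just that half.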

Mel-frequency Wrapping

As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the ‘mel’ scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz.
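
The text does not give a formula for the mel scale. One widely used approximation, assumed here and not taken from this case study, is mel(f) = 2595 log10(1 + f / 700), which is close to linear at low frequencies and logarithmic at high frequencies. A minimal sketch with hypothetical helper names:

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mels using the common 2595*log10(1 + f/700) approximation."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


for f in (100.0, 500.0, 1000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1))
```

Under this approximation a 1000 Hz tone maps to almost exactly 1000 mels, while higher frequencies are increasingly compressed.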

Figure 4. A mel-spaced filter bank.

One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale (see Figure 4). Each filter in the bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The number of mel spectrum coefficients, K, is typically chosen as 20. The filter bank is applied in the frequency domain, so it simply amounts to applying the triangle-shaped windows of Figure 4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain.
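
A triangular mel filter bank of this kind can be sketched as below. Several details are assumptions rather than facts from the text: the 2595 log10(1 + f/700) mel approximation, the 10 kHz sampling rate, filters that peak at 1.0, and the helper names themselves.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)   # assumed mel formula


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters=20, fft_size=256, sample_rate=10000.0):
    """Build K triangular filters whose edges are equally spaced in mels.

    Returns one list of weights per filter, each covering FFT bins 0 .. N/2.
    """
    num_bins = fft_size // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # K + 2 edge frequencies: filter m spans edges m-1 .. m+1, peak at edge m.
    edges_hz = [mel_to_hz(max_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = sample_rate / fft_size
    bank = []
    for m in range(1, num_filters + 1):
        low, centre, high = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        weights = []
        for k in range(num_bins):
            f = k * bin_hz
            if low < f <= centre:
                weights.append((f - low) / (centre - low))      # rising edge
            elif centre < f < high:
                weights.append((high - f) / (high - centre))    # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mel_spectrum(power, bank):
    """Apply the filter bank to a power spectrum: one value per filter."""
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]


bank = mel_filterbank()
flat = mel_spectrum([1.0] * 129, bank)   # flat power spectrum, for illustration
```

Each output value is a weighted sum of neighbouring power-spectrum bins, which matches the "overlapping histogram bin" picture in the text.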



Cepstrum

In this final step, we convert the log mel spectrum back to the time domain. The result is called the mel-frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame. Because the mel spectrum coefficients (and so their logarithms) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that result from the last step as $\tilde{S}_k$, k = 1, 2, ..., K, we can calculate the MFCCs, $\tilde{c}_n$, as

$$\tilde{c}_n = \sum_{k=1}^{K} \left(\log \tilde{S}_k\right) \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \qquad n = 1, 2, \ldots, K$$

Note that we exclude the first component, $\tilde{c}_0$, from the DCT since it represents the mean value of the log mel spectrum (that is, the overall signal level), which carries little speaker-specific information.
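
The DCT step can be sketched as follows, starting from the K mel power spectrum values of one frame. The function name `mfcc_from_mel_spectrum` is hypothetical, the natural logarithm is assumed (the text only says "log"), and the small floor that guards against log(0) is also an assumption.

```python
import math


def mfcc_from_mel_spectrum(mel_power, num_coeffs=None):
    """Compute c_n = sum_k log(S_k) * cos(n * (k - 1/2) * pi / K), n = 1..num_coeffs.

    The n = 0 term is skipped on purpose, matching the note above.
    """
    K = len(mel_power)
    if num_coeffs is None:
        num_coeffs = K
    # Floor very small values so the logarithm stays finite (an assumption).
    log_s = [math.log(max(s, 1e-12)) for s in mel_power]
    return [sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / K)
                for k in range(1, K + 1))
            for n in range(1, num_coeffs + 1)]


quiet = mfcc_from_mel_spectrum([1.0, 2.0, 3.0, 4.0])
loud = mfcc_from_mel_spectrum([10.0, 20.0, 30.0, 40.0])   # same shape, 10x the power
```

Scaling the whole spectrum by a constant leaves these coefficients unchanged, because the constant only shifts the mean of the log spectrum, which is exactly the component that was left out.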

This concludes the case study on speech recognition using a DSP processor.


