Speech Recognition Using FPGA Technology

My friends David and Kanwen and I implemented a speech recognition system on an FPGA development board (Altera DE2 Board) for the Design Project course at McGill (ECSE 494). We did this in two steps: first we wrote a prototype of the algorithm in MATLAB (I may port it to Octave), and then we wrote the hardware description for the FPGA.

MATLAB Prototype

Inspired by the algorithm described on a site from the University of Toronto, we wrote two MATLAB scripts: train.m and recogniz.m.

train.m deals with the training phase, in which many versions of a sound (a spoken word, for instance) are input and averaged in the frequency domain, thus generating the sound’s “reference fingerprint”.
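The training step can be sketched roughly as follows. This is not the original train.m: it is written in Python for illustration, it assumes the averaging is done on the magnitude of each frequency bin, and it uses a naive DFT instead of the 1024-point FFTs so that it needs no libraries. The names magnitude_spectrum and reference_fingerprint are made up for this example.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitude of the DFT of `samples` (naive O(n^2) version, not an FFT)."""
    n = len(samples)
    spectrum = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


def reference_fingerprint(recordings):
    """Average the magnitude spectra of several equal-length recordings.

    Assumption: the fingerprint is the per-bin mean of the magnitudes.
    """
    spectra = [magnitude_spectrum(r) for r in recordings]
    count = len(spectra)
    return [sum(bin_values) / count for bin_values in zip(*spectra)]


# Two tiny 4-sample "recordings" of the same sound.
fingerprint = reference_fingerprint([[1, 0, 0, 0], [1, 1, 1, 1]])
print(fingerprint)
```

Averaging several recordings smooths out the differences between repetitions of the same word, which is why more training samples should give a more stable fingerprint.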

recogniz.m deals with the recognition phase, where a sound is input, translated to the frequency domain (i.e. its fingerprint is generated), and compared to the reference fingerprint by computing the Euclidean distance between them (as if both fingerprints were vectors).
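The comparison itself is a short formula. Here is a minimal Python sketch of it (not the original recogniz.m); the function names and the acceptance threshold in is_recognized are assumptions for illustration, and the threshold would have to be calibrated for a given setup.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two fingerprints treated as vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_recognized(fingerprint, reference, threshold):
    """Hypothetical decision rule: accept when the distance is small enough."""
    return euclidean_distance(fingerprint, reference) <= threshold


print(euclidean_distance([0, 0], [3, 4]))
print(is_recognized([1.0, 2.0], [1.0, 2.5], threshold=1.0))
```

A distance of zero means the two fingerprints are identical, which is what happens when the same file is used for both training and recognition.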

Both scripts need to detect the beginning of the sound (i.e. tell when the spoken word begins). They do so by averaging two adjacent groups of 1024 sound samples (in the time domain) and computing the difference between the averages. So, if there is a sudden increase in the sound’s amplitude, the difference will be significant and the sound is assumed to start after that sudden increase. The sound’s length is fixed to 1.024 s (see the picture below for more details).
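A rough Python sketch of this detector is shown next. It is an illustration, not the original script: it assumes the average is taken over absolute sample values, that the window moves one whole group at a time, and that the threshold is passed in by the caller. The names block_average and find_onset are made up.

```python
def block_average(samples, start, size):
    """Mean absolute amplitude of `size` samples beginning at `start`."""
    block = samples[start:start + size]
    return sum(abs(x) for x in block) / size


def find_onset(samples, threshold, size=1024):
    """Return the index where the sound is assumed to start, or None.

    Compares each group of `size` samples with the group that follows it
    and reports the start of the first group that is louder by more than
    `threshold`.
    """
    start = 0
    while start + 2 * size <= len(samples):
        quiet = block_average(samples, start, size)
        loud = block_average(samples, start + size, size)
        if loud - quiet > threshold:
            return start + size
        start += size
    return None


# Toy signal: silence followed by a loud burst, using groups of 4 samples.
signal = [0] * 8 + [10, -10] * 4
print(find_onset(signal, threshold=5, size=4))
```

Because the threshold is fixed, a recording that is too quiet never produces a large enough difference and no onset is found.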

Note that the scripts use 16-bit WAV files sampled at 22050 Hz as input (this is the default Windows Sound Recorder output; I could not record in Linux because the mic did not want to work). The sound input is downsampled and quantized to get it down to 8 bits/sample at 5 kHz for processing.
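The conversion can be sketched in Python as follows. This is a simplified illustration under several assumptions: samples are signed 16-bit integers, downsampling simply picks the nearest earlier input sample (no anti-aliasing filter), and quantization keeps the top 8 bits. The function names are made up, and the original scripts may do this differently.

```python
def downsample(samples, rate_in, rate_out):
    """Keep one input sample per output period (no low-pass filtering)."""
    n_out = len(samples) * rate_out // rate_in
    return [samples[i * rate_in // rate_out] for i in range(n_out)]


def quantize_to_8_bits(samples):
    """Reduce signed 16-bit samples to signed 8-bit by dropping the low byte."""
    return [x >> 8 for x in samples]


# One second of a ramp at 22050 Hz becomes 5000 samples at 5 kHz.
one_second = list(range(22050))
reduced = downsample(one_second, 22050, 5000)
print(len(reduced))
print(quantize_to_8_bits([32767, -32768, 0, 256]))
```

Since 22050 / 5000 is not a whole number, the picked samples are not evenly spaced in the input; a more careful version would filter and interpolate.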

Also, you might encounter problems if the sound file is too short (it should last for more than 1.1 s), or if its volume level is too low (this happens because the detector threshold is fixed).

Hardware Implementation

Once we had played enough with the MATLAB prototype parameters, we mapped the algorithm into combinational logic and finite state machines (FSM) by breaking it down into independent modules.

For more details about the hardware implementation and the project in general you can read the full project report. You may also want to see the slides for a presentation we did (below).

Unfortunately, I cannot post the project files (i.e. VHDL code).

Here is a little video demo, enjoy:

Note that all the documentation for this project was done using the excellent OpenOffice.org.


- jesska
- February 15th, 2010
Hello Mr. Carlos, I have not understood this phrase exists in the frequency content
”Since the length of a word is 1.024 s and the sound is sampled at 5 kHz, five 1024-points FFTs are required to fully characterize a single word.” plzzz explain me that.

best regards.
- jesska
- February 17th, 2010
answer me plzzzz
- Carlitos
- February 17th, 2010
@jesska
Sorry for the delay. This is quite simple: let us say a word (an actual sound) lasts for 1.024 seconds, and that the sound is sampled at 5 kHz (5,000 times per second). This means that for a 1.024 s sound there will be 1.024 × 5,000 = 5,120 samples. So, in order to compute the FFT on the whole sound we need a 5,120-point FFT, or five 1024-point FFTs (5 × 1,024 = 5,120).

I hope this is clearer.
- jesska
- February 17th, 2010
Mr.carlos plzzzzzz can you give me just the code of block that compute the distance I’m really need it very much. please sir send me only that.
Here is my email if you want to send it to me: [email protected]
sorry for such questions and thank you for your help.
best regards.
- Carlitos
- February 17th, 2010
@jesska
As I said before: I do not have the code any more. It is in some obscure backup who knows where (that’s how organized I am). Also, I’m sure you can do it yourself. It is a simple mathematical formula (Euclidean distance).

Good Luck!
- Carlitos
- February 17th, 2010
@LEO
For convenience solely.
- jesska
- February 17th, 2010
ok thank you very much Mr.carlos and sorry for such questions
- Carlitos
- February 17th, 2010
@jesska
Don’t be sorry.

BTW, I will gladly answer any specific question you may have. Of course this applies to all my readers.
- sujith
- February 23rd, 2010
Hello Carlos,
Wats the major advantage in processing Speech in FPGA rather DSP ?
Which compiler u gonna implement here in your project?
- HARINI NELLUTLA
- February 28th, 2010
I Just want the brief idea about the performance of speech recognition process using VHDL implementation.
- Asma
- March 1st, 2010
hi carlos,
I read your report of your project on speech recognition.
I would like if possible for me to understand the working principle of inputs / outputs of the block Memory Batch operator, and especially “start_addr”, “end_addr” and “done”.
thank you.
- Rushil
- March 6th, 2010
Hi,
I have read your project, it is good.I just wanted to know that threshold value remains 0.05 only when you implemented it in fpga or it changes.
- Rushil
- March 6th, 2010
Hi,
Consider that one person trains the kit. So, the FFT values of his speech will be stored in training mode. Now if during recognition the speaker is a different person, then what will be the accuracy of recognition?
- Carlitos
- March 8th, 2010
@Rushil
The threshold must be calibrated to your particular setup.
- Carlitos
- March 8th, 2010
@Rushil
The recognition is only accurate with the same user saying the same word in the same situation.
- Asmae
- March 9th, 2010
hi carlos,
I read your report of your project on speech recognition.
I would like if possible for me to understand the working principle of inputs / outputs of the block Memory Batch operator, and especially “start_addr”, “end_addr” and “done”.
thank you.
- jaswar
- March 12th, 2010
Hi carlos,
Is the sound length exactly 1.024 sec, or can it be greater than that, for input to the training phase of speech recognition using MATLAB? I gave an input of length 9 sec using a microphone. After running the program it gives the error INDEX EXCEEDS MATRIX DIMENSION.
- vasun
- March 19th, 2010
hi carlos,
speech recognition already present in operating sysems like vista.what is the importance of this project.because it hardware cost more.please clear my doubt……..
i also done vhdl code for this project.please reply……
- Carlitos
- March 19th, 2010
@vasun
Speech recognition performed in hardware is different from its software counterpart in the sense that it can be integrated in a single chip and work without a computer. Also, the aim of this project was mainly to combine various disciplines learned in the electrical engineering curriculum in a single capstone project.
- Thiru
- March 21st, 2010
hi carlitos,
I read your project….i like it…and myself i want do the same as my final year project…

i can understand the matlab codes…i ‘m new to Matlab and Quartus II …

could u explain the steps from Matlab to QuartusII…
- Thiru
- March 21st, 2010
hai,sir
i’m from india..i already posted a comment..but i didn’t mentioned clearly about my doubts.

i can understand the matlab programs..and i gave one wave file as input & i got the output as distance 0,word is recognized!…

could u please guide me…wat can i do next sir…then after this i have to move to Quartus ii or else here itself steps are there sir…

then sir with the help your documents…i sucessfully created Block design files for memory controller,distance module and mux….

so please guide sir…wat are all steps i can do next…for your Kind Attention my
email:[email protected]
- Thiru
- March 26th, 2010
could u please…reply sir
- Carlitos
- March 26th, 2010
@Thiru
I suggest you make sure you understand the algorithm in general. Then the implementation should be straightforward. It is very important to understand all the basic concepts before you worry about the details of the algorithm.
- rushil
- April 18th, 2010
Hi,
Your project is good.But I didn’t understand the need of using FFT. Why was FFT used?
Can’t we recognize speech without using FFT? What will be the effect of not using FFT?
- enthu
- October 16th, 2010
Can you please explain the reason for your selection of the Altera DE2 board among the numerous boards (e.g. Virtex) available on the market?
- Quick Facts
- October 29th, 2010
Best you should make changes to the webpage title Speech Recognition Using FPGA Technology | Carlitos’ Contraptions to more specific for your subject you make. I loved the blog post nevertheless.
- VIkas Billa
- November 9th, 2010
Hello sir

I did this project as my mini poject in M.Tech.

Duration:Jan to May 2010…

I tried this project on the DE2 Board. It was partially executed.

The problems are

1.FFT code is not synthesized.
2.FIngerprint is continously changing.

We showed our project to an NXP representative and he appreciated it.

I will send u the code if u send me ur mail id

i wanna work on it.

Regards
Vikas Billa
Harini Nellutla
- kaz
- November 13th, 2010
is it possible for me to view someone who has done the verilog coding for this project ?
thank you
- fester
- February 1st, 2011
hi carlitos,

i need some help regarding this project.i will come straight to the point.
its the sound fetcher module,what are the 2 enables ENABLE_shiftreg and ENABLE_8bitFF
used for?bcoz i had gone through quartushelp and i found out that for the D_FFs if omitted
the clock_enable input is default to 1.
one more question,the aim of the downsampler is to sample the data down from 48 to 5Khz
so i jst made the pulse10000 to act like a 5Khz clock and directly fed it to the clock of the downsampler D_FF.
(i put both the enables of the quantizer and downsampler to ’1′ or high.and the enable of the shift_reg to be enabled for the 1st 23 bits on the BCLK for the left channel of the ADCLRCK )
am i thinking wrong,i do not need any code,i jst want ur advice.any help will be greatly appreciated.
(i am using a DE2 board)

thanking you,
fester.
- badro
- April 23rd, 2011
hi Mr carlos,
please can you explain me what mean master block and slave block
best regards.
- stepheny
- May 21st, 2011
Hi Mr Carlos..
When I run the train.m file, it shows:

??? Index exceeds matrix dimensions.
Error in ==> train at 99
s = xq(ptr:int32(ptr+l*sf/F)); % Store the detected sound in ‘s’.

Can u please explain the mistake that I made?
Thanx in advance ..
Best Regards
- emker
- August 11th, 2011
Hello Mr. Carlos,
Are you really sampling the sound at 5 KHz?
At 5 KHz you can have the highest frequency of 2500 Hz.
Isn’t it too low? (The second formant can come up to 3500 Hz, and the fifth formant can come up to 5000 Hz. If you save only low part of spectrum, below 2500, the voice sounds become hardly distinguishable.)
- tini
- October 22nd, 2012
this project used VHDL code…..
can u give me the vhdl code
