
Speech Recognition Using FPGA Technology

My friends David and Kanwen and I implemented a speech recognition system on an FPGA development board (Altera DE2 board) for the Design Project course at McGill (ECSE 494). We did this in two steps: first we wrote a prototype of the algorithm in MATLAB (I may port it to Octave), and then we wrote the hardware description for the FPGA.

MATLAB Prototype

Inspired by the algorithm described in a site from the University of Toronto, we wrote two MATLAB scripts: train.m and recogniz.m.

train.m handles the training phase, in which several recordings of a sound (a spoken word, for instance) are input and averaged in the frequency domain, generating the sound’s “reference fingerprint”.
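To make the training step concrete, here is a minimal Python sketch of the averaging. It is not the original train.m: the function names are made up for illustration, a naive DFT stands in for the FFT, and I am assuming the fingerprint is a magnitude spectrum that gets averaged bin by bin.

```python
import cmath


def magnitude_spectrum(samples):
    """Naive DFT magnitudes (O(n^2)); a stand-in for a real FFT routine."""
    n = len(samples)
    spectrum = []
    for k in range(n):
        # Correlate the signal with the k-th complex exponential.
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(samples))
        spectrum.append(abs(acc))
    return spectrum


def average_fingerprints(fingerprints):
    """Bin-by-bin mean of equally long spectra (assumed training rule)."""
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    length = len(fingerprints[0])
    if any(len(f) != length for f in fingerprints):
        raise ValueError("all fingerprints must have the same length")
    count = len(fingerprints)
    return [sum(bin_values) / count for bin_values in zip(*fingerprints)]


def train(recordings):
    """Build a reference fingerprint from several recordings of one word."""
    return average_fingerprints([magnitude_spectrum(r) for r in recordings])
```

Each recording passed to `train` is assumed to be already trimmed to the same number of samples.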

recogniz.m handles the recognition phase, where a sound is input, translated to the frequency domain (i.e. its fingerprint is generated), and compared to the reference fingerprint by computing the Euclidean distance between them (as if both fingerprints were vectors).
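The comparison itself is just the usual vector distance. A minimal Python sketch follows; `closest_word` is a hypothetical helper for picking among several trained words and is not something described in the project.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two fingerprints treated as vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_word(fingerprint, references):
    """Hypothetical helper: name of the reference nearest to `fingerprint`.

    `references` maps a word to its reference fingerprint.
    """
    return min(references,
               key=lambda word: euclidean_distance(fingerprint, references[word]))
```

A smaller distance means the input looks more like the reference. How small is small enough depends on the setup and has to be tuned.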

Both scripts need to detect the beginning of the sound (i.e. tell when the spoken word begins). They do so by averaging two adjacent groups of 1024 sound samples (in the time domain) and computing the difference between the averages. So, if there is a sudden increase in the sound’s amplitude, the difference will be significant and the sound is assumed to start after that sudden increase. The sound’s length is fixed to 1.024 s (see the picture below for more details).
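The detector can be sketched in a few lines of Python. This is a guess at the details rather than the original script: I am assuming samples normalized to the range -1 to 1, averages taken over absolute amplitudes, a window that advances one group at a time, and a placeholder threshold of 0.05 that has to be calibrated.

```python
def find_onset(samples, block=1024, threshold=0.05):
    """Index where the sound is assumed to start, or None if no jump is found.

    Compares the mean absolute amplitude of two adjacent groups of `block`
    samples; the onset is placed at the start of the louder second group.
    """
    def mean_abs(chunk):
        return sum(abs(s) for s in chunk) / len(chunk)

    # Step through the signal one group at a time (an assumption).
    for start in range(0, len(samples) - 2 * block + 1, block):
        before = mean_abs(samples[start:start + block])
        after = mean_abs(samples[start + block:start + 2 * block])
        if after - before > threshold:
            return start + block
    return None
```

Because the threshold is fixed, a quiet recording may never produce a large enough jump, in which case the function returns None.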

Note that the scripts take 16-bit WAV files sampled at 22050 Hz as input (this is the default Windows Sound Recorder output; I could not record in Linux because the mic did not want to work). The input is downsampled and quantized to 8 bits per sample at 5 kHz for processing.
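As a rough illustration of that preprocessing, here is a Python sketch. It is an assumption about one simple way to do it, not the original code: it picks the nearest earlier input sample for each output sample (no anti-aliasing filter) and keeps the top 8 bits of each signed 16-bit sample.

```python
def downsample(samples, src_rate=22050, dst_rate=5000):
    """Crude rate conversion: pick the input sample at each output instant.

    No low-pass filter is applied, so content above dst_rate / 2 will alias.
    """
    n_out = len(samples) * dst_rate // src_rate
    return [samples[i * src_rate // dst_rate] for i in range(n_out)]


def quantize_to_8bit(samples):
    """Reduce signed 16-bit samples to signed 8-bit by dropping the low byte."""
    return [s >> 8 for s in samples]


def preprocess(samples_16bit):
    """22050 Hz / 16-bit input -> 5 kHz / 8-bit samples."""
    return quantize_to_8bit(downsample(samples_16bit))
```

A proper resampler would low-pass filter before discarding samples. The sketch only shows the change in rate and bit depth.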

Also, you might encounter problems if the sound file is too short (it should last more than 1.1 s) or if its volume is too low (this happens because the detector threshold is fixed).

Hardware Implementation

Once we had played enough with the MATLAB prototype parameters, we mapped the algorithm into combinational logic and finite state machines (FSM) by breaking it down into independent modules.

For more details about the hardware implementation and the project in general you can read the full project report. You may also want to see the slides for a presentation we did (below).

Unfortunately, I cannot post the project files (i.e. VHDL code).

Here is a little video demo, enjoy:

Note that all the documentation for this project was done using the very excellent OpenOffice.org.

Categories: FPGA, Project
  1. jesska
    February 15th, 2010 at 18:35 | #1

    Hello Mr. Carlos, I have not understood this phrase exists in the frequency content
    ”Since the length of a word is 1.024 s and the sound is sampled at 5 kHz, five 1024-points FFTs are required to fully characterize a single word.” plzzz explain me that.

    best regards.

  2. jesska
    February 17th, 2010 at 17:49 | #2

    answer me plzzzz

  3. February 17th, 2010 at 19:34 | #3

    @jesska
    Sorry for the delay. This is quite simple: let us say a word (an actual sound) lasts for 1.024 seconds, and that the sound is sampled at 5 kHz (5 thousand times per second). This means that a 1.024 s sound contains roughly 5000 samples (5120, to be exact). So, to compute the FFT of the whole word we need a roughly 5000-point FFT, or five 1024-point FFTs (5 × 1024 = 5120).

    I hope this is clearer.

  4. jesska
    February 17th, 2010 at 21:37 | #4

    Mr.carlos plzzzzzz can you give me just the code of block that compute the distance I’m really need it very much. please sir send me only that.
    Here is my email if you want to send it to me. style2006_b@hotmail.com
    sorry for such questions and thank you for your help.
    best regards.

  5. February 17th, 2010 at 21:41 | #5

    @jesska
    As I said before: I do not have the code any more. It is in some obscure backup who knows where (that’s how organized I am). Also, I’m sure you can do it yourself. It is a simple mathematical formula (Euclidean distance).

    Good Luck!

  6. February 17th, 2010 at 21:43 | #6

    @LEO
    For convenience solely.

  7. jesska
    February 17th, 2010 at 21:48 | #7

    ok thank you very much Mr.carlos and sorry for such questions

  8. February 17th, 2010 at 21:57 | #8

    @jesska
    Don’t be sorry.

    BTW, I will gladly answer any specific question you may have. Of course this applies to all my readers.

  9. sujith
    February 23rd, 2010 at 08:51 | #9

    Hello Carlos,
    Wats the major advantage in processing Speech in FPGA rather DSP ?
    Which compiler u gonna implement here in your project?

  10. HARINI NELLUTLA
    February 28th, 2010 at 03:43 | #10

    I Just want the brief idea about the performance of speech recognition process using VHDL implementation.

  11. Asma
    March 1st, 2010 at 16:39 | #11

    hi carlos,
    I read your report of your project on speech recognition.
    I would like if possible for me to understand the working principle of inputs / outputs of the block Memory Batch operator, and especially “start_addr”, “end_addr” and “done”.
    thank you.

  12. Rushil
    March 6th, 2010 at 17:43 | #12

    Hi,
    I have read your project, it is good.I just wanted to know that threshold value remains 0.05 only when you implemented it in fpga or it changes.

  13. Rushil
    March 6th, 2010 at 17:47 | #13

    Hi,
    Considery that one person trains the kit.So, FFT values of his speech will be stored in training mode.Now if during recognition speaker is a different person, then what will be the accuracy of recognition?

  14. March 8th, 2010 at 23:43 | #14

    @Rushil
    The threshold must be calibrated to your particular setup.

  15. March 8th, 2010 at 23:44 | #15

    @Rushil
    The recognition is only accurate with the same user saying the same word in the same situation.

  16. Asmae
    March 9th, 2010 at 08:11 | #16

    hi carlos,
    I read your report of your project on speech recognition.
    I would like if possible for me to understand the working principle of inputs / outputs of the block Memory Batch operator, and especially “start_addr”, “end_addr” and “done”.
    thank you.
