Wednesday 4 May 2016

Speech Recognition

Here are some options for speech recognition engines:
  1. Pocketsphinx - A version of Sphinx that can be used in embedded systems (e.g., based on an ARM processor).
    • Pros: Under active development, incorporating features such as fixed-point arithmetic and efficient algorithms for GMM computation. All the processing takes place on the Raspberry Pi, so it can be used offline, and it supports real-time speech recognition.
    • Cons: It is complicated for beginners to set up and understand. For me, it was too inaccurate for my application. Because all the processing takes place on the Raspberry Pi, it is also a bit slow.
    • Installation instructions:
      1. Download the latest stable versions of Sphinxbase and Pocketsphinx:
        $ wget http://sourceforge.net/projects/cmusphinx/files/sphinxbase/0.8/sphinxbase-0.8.tar.gz
        $ wget http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/0.8/pocketsphinx-0.8.tar.gz
        
      2. Extract the downloaded files:
        $ tar -zxvf pocketsphinx-0.8.tar.gz; rm -rf pocketsphinx-0.8.tar.gz
        $ tar -zxvf sphinxbase-0.8.tar.gz; rm -rf sphinxbase-0.8.tar.gz
        
      3. To compile these packages, you'll need to install bison and the ALSA development headers.
        NOTE: It is important that the ALSA headers be installed before you build Sphinxbase. Otherwise, Sphinxbase will not use ALSA. It also appears that ALSA will not be used if PulseAudio is installed (a bad thing for developers like me).
        $ sudo apt-get install bison libasound2-dev
        
      4. cd into the Sphinxbase directory and type the following commands:
        $ ./configure --enable-fixed
        $ sudo make
        $ sudo make install
        
      5. cd into the Pocketsphinx directory and type the following commands:
        $ ./configure
        $ sudo make
        $ sudo make install
        
      6. Test out Pocketsphinx by running:
        $ src/programs/pocketsphinx_continuous -samprate 48000 
        
        If you want to tweak it, I recommend you read some information on the CMUSphinx Wiki.
  2. libsprec - A speech recognition library developed by H2CO3 (with a few contributions of my own, mostly bug fixes).
    • Pros: It uses the Google Speech API, making it more accurate. The code is easier to understand (in my opinion).
    • Cons: It has dependencies on other libraries that H2CO3 has developed (such as libjsonz). Development is spotty. It uses the Google Speech API, meaning processing doesn't take place on the Raspberry Pi itself and an internet connection is required. It also requires one small modification to the source code before compilation to work properly on the Raspberry Pi.
    • Installation instructions:
      1. Install libflac, libogg, and libcurl:
        $ sudo apt-get install libcurl4-openssl-dev libogg-dev libflac-dev
        
      2. Download the most recent version of libsprec
        $ wget https://github.com/H2CO3/libsprec/archive/master.zip
        
      3. Unzip the downloaded package:
        $ unzip master.zip; rm -rf master.zip
        
        You should now have a folder named libsprec-master in your current directory.
      4. Download the most recent version of libjsonz:
        $ wget https://github.com/H2CO3/libjsonz/archive/master.zip
        
      5. Unzip the downloaded package:
        $ unzip master.zip; rm -rf master.zip
        
        You should now have a folder named libjsonz-master in your current directory.
      6. cd into the libjsonz-master directory, compile, and install:
        $ cd libjsonz-master
        $ mv Makefile.linux Makefile
        $ make
        $ sudo make install
        
      7. cd out of the libjsonz-master directory and into the libsprec-master/src directory. Edit line 227:
        err = snd_pcm_open(&handle, "pulse", SND_PCM_STREAM_CAPTURE, 0);
        
        We need this to say:
        err = snd_pcm_open(&handle, "plughw:1,0", SND_PCM_STREAM_CAPTURE, 0);
        
        This is so that the program will use ALSA to point to the USB microphone.
      8. Compile and install:
        $ mv Makefile.linux Makefile
        $ make
        $ sudo make install
        
      9. You can now use the library in your own applications. Look in the example folder in libsprec-master for examples.
  3. Julius - A high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.
    • Pros: It can perform almost real-time speech recognition on the Raspberry Pi itself. Standard speech model formats are adopted to cope with other free modeling toolkits.
    • Cons: Spotty development; its last update was over a year ago. Its recognition is also too inaccurate and slow for my usage. It also has a long installation time.
    • Installation instructions:
      1. There are a few packages that we need to install to get the system working properly:
        $ sudo apt-get install alsa-tools alsa-oss flex zlib1g-dev libc-bin libc-dev-bin python-pexpect libasound2 libasound2-dev cvs
        
      2. Download Julius from the CVS source:
        $ cvs -z3 -d:pserver:anonymous@cvs.sourceforge.jp:/cvsroot/julius co julius4
        
      3. Set the compiler flags by the environment variables:
        $ export CFLAGS="-O2 -mcpu=arm1176jzf-s -mfpu=vfp -mfloat-abi=hard -pipe -fomit-frame-pointer"
        
      4. cd into the folder julius4 and type the following commands
        $ ./configure --with-mictype=alsa
        $ sudo make
        $ sudo make install
        
      5. Julius needs an environment variable called ALSADEV to tell it which device to use for a microphone:
        $ export ALSADEV="plughw:1,0"
        
      6. Download a free acoustic model for Julius to use. Once you have downloaded it, cd into the directory and run:
        $ julius -input mic -C julius.jconf
        
        After that you should be able to begin speech input.
  4. Roll your own library - For my specific project, I chose to build my own speech recognition library that records audio from a USB microphone using ALSA via PortAudio, stores it in a FLAC file via libsndfile, and sends it off to Google for processing. Google then sends back a nicely packed JSON file that I process to get what I said to my Raspberry Pi.
    • Pros: I control everything (which I like). I learn a lot (which I like).
    • Cons: It's a lot of work. Also, some people may argue that I'm not actually doing any processing on the Raspberry Pi with this speech recognition library. I know that. Google can process my data much more accurately than I can right now. I'm working on building an accurate offline speech recognition option.
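The last stage of that roll-your-own pipeline is pulling the transcript out of the JSON reply. As a rough sketch of what that looks like in Python: note that the field names here ("result", "alternative", "transcript") are an assumption about the response layout, since Google's unofficial Speech API has changed over time and has no official spec.

```python
import json

def extract_transcript(raw):
    # Parse the JSON reply and return the top transcript, or None if
    # the service returned no result. The "result"/"alternative"/
    # "transcript" keys are assumed, not guaranteed.
    data = json.loads(raw)
    results = data.get("result", [])
    if not results:
        return None
    alternatives = results[0].get("alternative", [])
    if not alternatives:
        return None
    return alternatives[0].get("transcript")

# A reply shaped like the assumed format above:
reply = '{"result": [{"alternative": [{"transcript": "hello pi"}]}]}'
print(extract_transcript(reply))  # hello pi
```

The point is only the shape of the processing; the recording and HTTP steps are where the real work is.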

Speech Synthesis

Here are some options for speech synthesis engines:
  1. tritium - A free, premium-quality speech synthesis engine written completely in C (and developed by yours truly).
    • Pros: Extremely portable (no dependencies besides CMake to build), extremely small (smallest one that I could find), easy to build.
    • Cons: The speech output itself can be inaccurate at times. Support for a wide variety of languages is lacking, as I am the sole developer right now with little free time, but this is one of the future goals of the project. Also, as of right now only a library is produced when the project is compiled, with no usable/testable executable.
  2. eSpeak - A compact open source software speech synthesizer for Linux, Windows, and other platforms.
    • Pros: It uses a formant synthesis method, providing many spoken languages in a small size. It is also very accurate and easy to understand. I originally used this in my project, but because of the cons I had to switch to another speech synthesis engine.
    • Cons: It has some strange dependencies on X11, causing it to sometimes stutter. The library is also considerably large compared to others.
    • Installation instructions:
      1. Install the eSpeak software:
        $ sudo apt-get install espeak
        
      2. To say what you want in eSpeak:
        $ espeak "Hello world"
        
        To read from a file in eSpeak:
        $ espeak -f <file>
        
  3. Festival - A general multi-lingual speech synthesis system.
    • Pros: It is designed to support multiple spoken languages. It can use the Festvox project which aims to make the building of new synthetic voices more systematic and better documented, making it possible for anyone to build a new voice.
    • Cons: It is written in C++ (more of a con to me specifically). It also has a larger code base, so it would be hard for me to understand and port the code.
    • Installation instructions:
      1. Install the Festival software:
        $ sudo apt-get install festival festival-freebsoft-utils
        
      2. To run Festival, pipe it the text or file you want it to read:
        $ echo "Hello world" | festival --tts
        
  4. Flite - A small run-time speech synthesis engine derived from Festival and the Festvox project.
    • Pros: Under constant development at Carnegie Mellon University. Very small engine compared to others. It also has a smaller code base, so it is easier to go through. It has almost no dependencies (a huge pro for me, and another reason I decided to use this engine in my project).
    • Cons: The speech output itself is not always accurate. The speech has a very metallic, non-human sound (more than the other engines). It doesn't support very many languages.
    • Installation instructions:
      1. Install the Flite software:
        $ sudo apt-get install flite
        
      2. To run Flite:
        $ flite -t "text that you want flite to say"
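If you want to drive one of these synthesizers from your own program, the simplest route is to spawn the command-line tool. As a minimal sketch, these helpers just build the argument lists for the commands shown above (the engine names and flags come from the steps above; the helper names are my own):

```python
def espeak_cmd(text):
    # Argument list for eSpeak, equivalent to: espeak "text"
    return ["espeak", text]

def flite_cmd(text):
    # Argument list for Flite's text mode, equivalent to: flite -t "text"
    return ["flite", "-t", text]

# To actually speak, hand one of these lists to subprocess, e.g.:
#   import subprocess
#   subprocess.call(flite_cmd("Hello world"))
print(flite_cmd("Hello world"))
```

Passing a list (rather than a single shell string) avoids quoting problems when the text to speak contains spaces or special characters.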