This is a lengthy and very dry post, but it provides detailed instructions for building and installing SphinxBase and PocketSphinx, and for generating a pronunciation dictionary and a language model, all so that speech recognition can run directly on the Raspberry Pi, without network access. Don’t expect it to be as fast as Google’s recognizer, though …
Creating the RASPBIAN boot MicroSD
Starting with the current RASPBIAN (Debian Wheezy) image, the creation of a bootable MicroSD Card is a well understood and well documented process.
Uncompressing the zip (again, there is no better tool than The Unarchiver, if you are on a Mac) reveals 2015-02-16-raspbian-wheezy.img.
With the MicroSD (inside an SD-Card adapter – no less than 8GB) inserted into the Mac, I run the df -h command in Terminal, to find out how to address the card. Today, it showed up as
/dev/disk4s1 56Mi 14Mi 42Mi 26% 512 0 100% /Volumes/boot
, which means I run something like this to put the boot image onto the MicroSD:

sudo diskutil unmount /dev/disk4s1
sudo dd bs=1m if=/Users/wolf/Downloads/2015-02-16-raspbian-wheezy.img of=/dev/rdisk4
… after a few minutes, once the 3.28 GB have been written onto the card, I execute:
sync
sudo diskutil eject /dev/rdisk4
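Note that dd targets the raw whole-disk device (/dev/rdisk4), not the mounted partition (/dev/disk4s1) that df reported; the raw device is considerably faster to write. Deriving one name from the other is plain string manipulation. Here is a small helper as a hedged sketch (the raw_disk name and the example devices are mine, not from any Apple tool):

```shell
# raw_disk: turn the partition name df reports (e.g. /dev/disk4s1)
# into the raw whole-disk device (/dev/rdisk4) that dd should target.
raw_disk() {
  local part="$1"               # e.g. /dev/disk4s1
  local disk="${part%s[0-9]*}"  # strip the slice suffix -> /dev/disk4
  echo "/dev/r${disk#/dev/}"    # add the raw prefix    -> /dev/rdisk4
}

raw_disk /dev/disk4s1   # prints /dev/rdisk4
```

Always double-check the disk number against the df -h output first; dd happily overwrites whatever device it is given.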
Customizing the OS
Once booted,

sudo raspi-config

allows the customization of the OS: time zone, keyboard, and other settings are adjusted to closely match its environment. I usually start (the Pi is already connected to the internet via an Ethernet cable) with
- updating the raspi-config
- expanding the filesystem
- internationalization: un-check en-GB, check en-US.UTF-8 UTF-8
- internationalization: timezone ..
- internationalization: keyboard: change to English US
- setting the hostname to translator (there are too many Raspberry Pis on my home network to leave it at the default)
- making sure SSH is enabled
- force audio out on the 3.5mm headphone jack
Microphone
Given the sparse analog-to-digital support provided by the Raspberry Pi, probably the best and easiest way to connect a decent mic to the device is a USB microphone. I happen to have an older Logitech USB mic, which works perfectly fine with the Pi.
After a reboot and now with the microphone connected, let’s get started ..
ssh pi@translator
with the default password ‘raspberry’ gets me in from everywhere on my local network.

cat /proc/asound/cards
returns
0 [ALSA ]: bcm2835 - bcm2835 ALSA
bcm2835 ALSA
1 [AK5370 ]: USB-Audio - AK5370
AKM AK5370 at usb-bcm2708_usb-1.2, full speed
showing that the microphone is visible, along with its USB connection details.
Next, I edit alsa-base.conf to load snd-usb-audio like so:
sudo nano /etc/modprobe.d/alsa-base.conf
Edit
options snd-usb-audio index=-2
to
options snd-usb-audio index=0
and after a
sudo reboot
, cat /proc/asound/cards looks like this:
0 [AK5370 ]: USB-Audio - AK5370
AKM AK5370 at usb-bcm2708_usb-1.2, full speed
1 [ALSA ]: bcm2835 - bcm2835 ALSA
bcm2835 ALSA
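Since a kernel or ALSA update can revert the card order, a quick scripted check doesn’t hurt. The snippet below is my own sketch (the function and file names are made up); it succeeds when the first card in a /proc/asound/cards style listing is the USB device:

```shell
# usb_is_default: true when the first card in a /proc/asound/cards
# style listing is a USB-Audio device (i.e. the mic got index 0).
usb_is_default() { head -n 1 "$1" | grep -q 'USB-Audio'; }

# On the Pi itself: usb_is_default /proc/asound/cards
# Demo against a saved copy of the listing shown above:
printf '0 [AK5370 ]: USB-Audio - AK5370\n1 [ALSA ]: bcm2835 - bcm2835 ALSA\n' > cards.txt
usb_is_default cards.txt && echo "USB mic is card 0"
```

If the check fails after an update, revisit /etc/modprobe.d/alsa-base.conf as described above.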
Recording – Playback – Test
Before worrying about Speech Recognition and Speech Synthesis, let’s make sure that the basic recording and audio playback works.
Again, I have a USB microphone connected to the Pi, as well as a speaker, using the 3.5mm audio jack.
Installing build tools and required libraries
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install bison
sudo apt-get install libasound2-dev
sudo apt-get install swig
sudo apt-get install python-dev
sudo apt-get install mplayer
sudo reboot
/etc/asound.conf
sudo nano /etc/asound.conf
and enter something like this:

pcm.usb {
    type hw
    card AK5370
}
pcm.internal {
    type hw
    card ALSA
}
pcm.!default {
    type asym
    playback.pcm {
        type plug
        slave.pcm "internal"
    }
    capture.pcm {
        type plug
        slave.pcm "usb"
    }
}
ctl.!default {
    type asym
    playback.pcm {
        type plug
        slave.pcm "internal"
    }
    capture.pcm {
        type plug
        slave.pcm "usb"
    }
}
Recording
The current recording settings can be looked at with:
amixer -c 0 sget 'Mic',0
and for me that looks something like this:
Simple mixer control 'Mic',0
  Capabilities: cvolume cvolume-joined cswitch cswitch-joined penum
  Capture channels: Mono
  Limits: Capture 0 - 78
  Mono: Capture 68 [87%] [10.00dB] [on]
alsamixer -c 0
can be used to increase the capture levels. After an increase, it looks like this:

...
Mono: Capture 68 [87%] [10.00dB] [on]
Playback
The current playback settings can be looked at with:
amixer -c 1

alsamixer -c 1

can be used to increase the volume. After an increase,

amixer -c 1

looks like this:
Simple mixer control 'PCM',0
  Capabilities: pvolume pvolume-joined pswitch pswitch-joined penum
  Playback channels: Mono
  Limits: Playback -10239 - 400
  Mono: Playback -685 [90%] [-6.85dB] [on]
Test Recording and Playback
With the mic switched on ..
arecord -D plughw:0,0 -f cd ./test.wav
.. use Control-C to stop the recording.

aplay ./test.wav
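arecord also accepts -d <seconds> to stop on its own instead of waiting for Control-C. And since -f cd means 44100 Hz, 16-bit, stereo, the amount of audio data a clip should contain is easy to predict, which allows for a quick sanity check (the cd_bytes helper is my own sketch, not part of ALSA):

```shell
# -f cd records 44100 frames/s * 2 channels * 2 bytes = 176400 bytes
# of audio per second; cd_bytes predicts the payload of an N-second clip.
cd_bytes() { echo $(( $1 * 44100 * 2 * 2 )); }

cd_bytes 5   # prints 882000 (a WAV file adds a 44-byte header on top)
# Record a fixed 5-second clip and compare against that number:
#   arecord -D plughw:0,0 -f cd -d 5 ./test.wav
```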
With recording and playback working, let’s get into the really cool stuff, on-device speech recognition.
Speech Recognition Toolkit
CMU Sphinx a.k.a. PocketSphinx
Currently, PocketSphinx 5 pre-alpha (2015-02-15) is the most recent version. Its prerequisites are the build tools and libraries that were already installed above.
Building Sphinxbase
cd ~/
wget http://sourceforge.net/projects/cmusphinx/files/sphinxbase/5prealpha/sphinxbase-5prealpha.tar.gz
tar -zxvf ./sphinxbase-5prealpha.tar.gz
cd ./sphinxbase-5prealpha
./configure --enable-fixed
make clean all
make check
sudo make install
Building PocketSphinx
cd ~/
wget http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/5prealpha/pocketsphinx-5prealpha.tar.gz
tar -zxvf pocketsphinx-5prealpha.tar.gz
cd ./pocketsphinx-5prealpha
./configure
make clean all
make check
sudo make install
Creating a Language Model
Create a text file containing a list of the words and sentences we want to be recognized.
For instance ..
Okay Pi
Open Garage
Start Translator
Shutdown
What is the weather in Ramona
What is the time
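Creating that file on the Pi itself is easiest with a here-document; corpus.txt is just an assumed name, lmtool accepts any plain-text file with one phrase per line:

```shell
# Write the six phrases, one per line, into corpus.txt for upload.
cat > corpus.txt <<'EOF'
Okay Pi
Open Garage
Start Translator
Shutdown
What is the weather in Ramona
What is the time
EOF

wc -l < corpus.txt   # prints 6
```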
Upload the text file here: http://www.speech.cs.cmu.edu/tools/lmtool-new.html
and then download the generated Pronunciation Dictionary and Language Model
For the text file mentioned above, this is what the tool generates:
Pronunciation Dictionary
GARAGE  G ER AA ZH
IN  IH N
IS  IH Z
OKAY  OW K EY
OPEN  OW P AH N
PI  P AY
RAMONA  R AH M OW N AH
SHUTDOWN  SH AH T D AW N
START  S T AA R T
THE  DH AH
THE(2)  DH IY
TIME  T AY M
TRANSLATOR  T R AE N S L EY T ER
TRANSLATOR(2)  T R AE N Z L EY T ER
WEATHER  W EH DH ER
WHAT  W AH T
WHAT(2)  HH W AH T
Language Model
Language model created by QuickLM on Thu Mar 26 00:23:34 EDT 2015
Copyright (c) 1996-2010 Carnegie Mellon University and Alexander I. Rudnicky

The model is in standard ARPA format, designed by Doug Paul while he was at MITRE.

The code that was used to produce this language model is available in Open Source.
Please visit http://www.speech.cs.cmu.edu/tools/ for more information

The (fixed) discount mass is 0.5. The backoffs are computed using the ratio method.
This model based on a corpus of 6 sentences and 16 words

\data\
ngram 1=16
ngram 2=20
ngram 3=15

\1-grams:
-0.9853 </s> -0.3010
-0.9853 <s> -0.2536
-1.7634 GARAGE -0.2536
-1.7634 IN -0.2935
-1.4624 IS -0.2858
-1.7634 OKAY -0.2935
-1.7634 OPEN -0.2935
-1.7634 PI -0.2536
-1.7634 RAMONA -0.2536
-1.7634 SHUTDOWN -0.2536
-1.7634 START -0.2935
-1.4624 THE -0.2858
-1.7634 TIME -0.2536
-1.7634 TRANSLATOR -0.2536
-1.7634 WEATHER -0.2935
-1.4624 WHAT -0.2858

\2-grams:
-1.0792 <s> OKAY 0.0000
-1.0792 <s> OPEN 0.0000
-1.0792 <s> SHUTDOWN 0.0000
-1.0792 <s> START 0.0000
-0.7782 <s> WHAT 0.0000
-0.3010 GARAGE </s> -0.3010
-0.3010 IN RAMONA 0.0000
-0.3010 IS THE 0.0000
-0.3010 OKAY PI 0.0000
-0.3010 OPEN GARAGE 0.0000
-0.3010 PI </s> -0.3010
-0.3010 RAMONA </s> -0.3010
-0.3010 SHUTDOWN </s> -0.3010
-0.3010 START TRANSLATOR 0.0000
-0.6021 THE TIME 0.0000
-0.6021 THE WEATHER 0.0000
-0.3010 TIME </s> -0.3010
-0.3010 TRANSLATOR </s> -0.3010
-0.3010 WEATHER IN 0.0000
-0.3010 WHAT IS 0.0000

\3-grams:
-0.3010 <s> OKAY PI
-0.3010 <s> OPEN GARAGE
-0.3010 <s> SHUTDOWN </s>
-0.3010 <s> START TRANSLATOR
-0.3010 <s> WHAT IS
-0.3010 IN RAMONA </s>
-0.6021 IS THE TIME
-0.6021 IS THE WEATHER
-0.3010 OKAY PI </s>
-0.3010 OPEN GARAGE </s>
-0.3010 START TRANSLATOR </s>
-0.3010 THE TIME </s>
-0.3010 THE WEATHER IN
-0.3010 WEATHER IN RAMONA
-0.3010 WHAT IS THE

\end\
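For reading the model: each line holds a base-10 log probability first, then the n-gram, and optionally a backoff weight. A value of -0.3010, for instance, is log10(0.5), so the bigram "GARAGE </s>" gets probability one half. A one-liner confirms the conversion:

```shell
# ARPA files store log10 probabilities; convert one back to a probability:
awk 'BEGIN { printf "%.2f\n", 10^(-0.3010) }'   # prints 0.50
```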
Looking carefully, you’ll see that the Sphinx knowledge base generator provides links to the just-generated files, which makes it super convenient to pull them down to the Pi. For me, it generated a base set with the name 3199:
wget http://www.speech.cs.cmu.edu/tools/product/1427343814_14328/3199.dic
wget http://www.speech.cs.cmu.edu/tools/product/1427343814_14328/3199.lm
Running Speech-recognition locally on the Raspberry Pi
Finally everything is in place: SphinxBase and PocketSphinx have been built and installed, and a pronunciation dictionary and a language model have been created and stored locally.
During the build process, acoustic model files for the English language were deployed here: /usr/local/share/pocketsphinx/model/en-us/en-us
.. time to try out the recognizer:
cd ~/
export LD_LIBRARY_PATH=/usr/local/lib
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
pocketsphinx_continuous -hmm /usr/local/share/pocketsphinx/model/en-us/en-us -lm 3199.lm -dict 3199.dic -samprate 16000/8000/48000 -inmic yes
Output
READY….
Listening…
…
Listening…
…
INFO: ps_lattice.c(1380): Bestpath score: -7682
INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(:285:334) = -403763
INFO: ps_lattice.c(1441): Joint P(O,S) = -426231 P(S|O) = -22468
INFO: ngram_search.c(874): bestpath 0.01 CPU 0.003 xRT
INFO: ngram_search.c(877): bestpath 0.01 wall 0.002 xRT
OPEN GARAGE
READY….
Listening…
Live Demo
This video shows the recognizer running in keyword spotting mode, using the dictionary and model mentioned above:
The purpose is to provide some indication of the recognition speed that can be expected, running PocketSphinx on the Raspberry Pi 2.
pocketsphinx_continuous -lm 3199.lm -dict 3199.dic -keyphrase "OKAY PI" -kws_threshold 1e-20 -inmic yes