My dream of a piece of software that you could simply talk to and get things done started more than 10 years ago, when I
was still a young M.Sc student who imagined performing common tasks on my computer through the same kind of natural
interaction you see between Dave and HAL 9000 in
2001: A Space Odyssey. Together with a friend I developed
Voxifera way back in 2008. The software worked well enough for basic tasks, as long as I was the one providing the
voice commands and the list of custom commands stayed below 10 items; but in recent years Google and Amazon have gone
way beyond what an M.Sc student alone could do with fast Fourier transforms and Markov models.
When years later I started building Platypush, I still dreamed of the
same voice interface, leveraging the new technologies, while not being caged by the interactions natively provided by
those commercial assistants. My goal was still to talk to my assistant and get it to do whatever I wanted to, regardless
of the skills/integrations supported by the product, regardless of whichever answer its AI was intended to provide for
that phrase. And, most of all, my goal was to have all the business logic of the actions to run on my own device(s), not
on someone else’s cloud. I feel like by now that goal has been mostly accomplished (assistant technology with 100%
flexibility when it comes to phrase patterns and custom actions), and today I’d like to show you how to set up your own
Google Assistant on steroids with a Raspberry Pi, a microphone and Platypush. I’ll also show how to run your custom
hotword detection models through the Snowboy integration, for those who want greater flexibility in how to summon their
digital butler beyond the boring “Ok Google” formula, or who aren’t happy with the idea of having Google constantly
listen to everything that is said in the room. For those who are unfamiliar with
Platypush, I suggest
reading my previous article on
what it is, what it can do, why I built it and how to get started with it.
Context and expectations
First, a bit of context around the current state of the assistant integration (and the state of the available assistant APIs/SDKs in general).
My initial goal was to have a voice assistant that could:
Continuously listen through an audio device for a specific audio pattern or phrase and process the subsequent voice
requests.
Support multiple models for the hotword, so that multiple phrases could be used to trigger a request process, and
optionally one could even associate a different assistant language to each hotword.
Support conversation start/end actions even without hotword detection — something like “start listening when I press
a button or when I get close to a distance sensor”.
Provide the possibility to configure a list of custom phrases or patterns (ideally
through regular expressions) that, when matched, would run a
custom pre-configured task or list of tasks on the executing device, or on any device connected through it.
If a phrase doesn’t match any of those pre-configured patterns, then the assistant would go on and process the
request in the default way (e.g. rely on Google’s “how’s the weather?” or “what’s on my calendar?” standard response).
Basically, I needed an assistant SDK or API that could easily be wrapped into a library or tiny module — one that could listen for hotwords, start/stop conversations programmatically, and return the detected phrase directly to my business logic whenever speech was recognized.
I eventually decided to develop the integration with the Google Assistant and ignore Alexa because:
Alexa’s original sample app for developers was a relatively heavy
piece of software that relied on a Java backend and a Node.js web service.
In the meantime Amazon has pulled the plug on that original project.
The sample app has been replaced by the Amazon AVS (Alexa Voice Service),
which is a C++ service aimed mostly at commercial applications and doesn’t provide a decent quickstart for custom
Python integrations.
There are few Python examples for the Alexa SDK,
but they focus on how to develop a skill. I’m not interested in building a skill that runs on Amazon’s servers — I’m
interested in detecting hotwords and raw speech on any device, and the SDK should let me do whatever I want with that.
I eventually opted for
the Google Assistant library, but that
has recently been deprecated with short notice, and
there’s an ongoing discussion about what the future alternatives will be. However, the voice integration with Platypush
still works, and whichever new SDK/API Google releases in the near future, I’ll make sure that it’s still
supported. The two options currently provided are:
If you’re running Platypush on an x86/x86_64 machine or on a Raspberry Pi earlier than the model 4 (except for the
Raspberry Pi Zero, since it’s based on ARMv6 and the Assistant library wasn’t compiled for it), you can still use
the assistant library — even though it’s not guaranteed to work against future builds of the libc, given the
deprecated status of the library.
Otherwise, you can use the Snowboy integration for hotword detection together with Platypush’s wrapper around the
Google push-to-talk sample for conversation support.
In this article we’ll see how to get started with both configurations.
Installation and configuration
First things first: in order to get your assistant working you’ll need:
An x86/x86_64/ARM device/OS compatible with Platypush and either the Google Assistant library or Snowboy (tested on
most of the Raspberry Pi models, Banana Pis and Odroid, and on ASUS Tinkerboard).
A microphone. Literally any Linux-compatible microphone would work.
I’ll also assume that you have already installed Platypush on your device — the instructions are provided on
the Github page, on
the wiki and in
my previous article.
Follow these steps to get the assistant running:
Install the required dependencies:
# Install the Google Assistant library dependencies
# (it won't work on Raspberry Pi Zero and ARMv6 architectures)
[sudo] pip install 'platypush[google-assistant-legacy]'

# To run just the Google Assistant speech detection and use
# Snowboy for hotword detection
[sudo] pip install 'platypush[google-assistant]'
Follow these steps
to create and configure a new project in the Google Console and download the required credentials
files.
Generate your user’s credentials file to connect the assistant to your Google account:
Open the prompted URL in your browser, log in with your Google account if needed and then enter the prompted
authorization code in the terminal.
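At the time of writing this step was typically done through google-oauthlib-tool. The exact scope and file names may differ depending on your SDK version, so treat the following as a sketch:

```shell
# Install the OAuth helper tool
pip install --upgrade 'google-auth-oauthlib[tool]'

# Generate scoped credentials for the Assistant SDK.
# client_secret.json is the credentials file previously
# downloaded from the Google Console.
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --save --headless --client-secrets /path/to/client_secret.json
```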
The above steps are common both for the Assistant library and the Snowboy+push-to-talk configurations. Let’s now tackle
how to get things working with the Assistant library, provided that it still works on your device.
Google Assistant library
Enable the Google Assistant backend (to listen to the hotword) and plugin (to programmatically start/stop
conversations in your custom actions) in your Platypush configuration file (by default
~/.config/platypush/config.yaml):
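A minimal configuration along these lines should do (the exact section names may vary by Platypush version; empty sections enable the integration with its default settings):

```yaml
backend.assistant.google:
assistant.google:
```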
Refer to the official documentation to check the additional initialization parameters and actions provided by the
assistant backend and
plugin.
Restart Platypush and keep an eye on the output to check that everything is alright. Oh, and also double check that
your microphone is not muted.
Just say “OK Google” or “Hey Google”. The basic assistant should work out of the box.
Snowboy + Google Assistant library
Follow the steps in the next section if the Assistant library doesn’t work on your device (in most cases you’ll see a
segmentation fault caused by a mismatching libc version when you try to import it), or if you want more options for
supported hotwords, or if you don’t like the idea of having Google constantly listen to all of your conversations just
to detect when you say the hotword.
# Install the Snowboy dependencies
[sudo] pip install 'platypush[hotword]'
Go to the Snowboy home page, register/login and then select the hotword model(s) you like.
You’ll notice that before downloading a model you’ll be asked to provide three voice samples of yourself saying the
hotword — a good way to keep voice models free while getting everyone to help improve them.
Configure the Snowboy backend and the Google push-to-talk plugin in your Platypush configuration. Example:
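The snippet below is a sketch based on the parameters described next; the model name, file paths and button values are placeholders you should adapt to your setup:

```yaml
backend.assistant.snowboy:
  # Microphone gain (1.0 for a 100% gain)
  audio_gain: 1.0
  models:
    # Name of the model - anything you like
    ok_google:
      # Path to the model file downloaded from the Snowboy website
      voice_model_file: ~/models/OK Google.pmdl
      assistant_plugin: assistant.google.pushtotalk
      assistant_language: en-US
      # Optional WAV file played when a conversation starts
      detect_sound: ~/sounds/conversation_start.wav
      sensitivity: 0.4

assistant.google.pushtotalk:
  language: en-US
```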
Tweak audio_gain to adjust the gain of your microphone (1.0 for a 100% gain).
models will contain a key-value list of the voice models that you want to use.
For each model you’ll have to specify its voice_model_file (downloaded from the Snowboy website), which
assistant_plugin will be used (assistant.google.pushtotalk in this case), the assistant_language code, i.e. the
selected language for the assistant conversation when that hotword is detected (default: en-US), an optional
detect_sound, a WAV file that will be played when a conversation starts, and the sensitivity of that model, between 0
and 1 — with 0 meaning no sensitivity and 1 very high sensitivity (tweak it to your own needs, but be aware that a
value higher than 0.5 might trigger more false positives).
The assistant.google.pushtotalk plugin configuration only requires the default assistant language to be used.
Refer to the official documentation for extra initialization parameters and methods provided by the
Snowboy backend and the
push-to-talk plugin.
Restart Platypush and check the logs for any errors, then say your hotword. If everything went well, an assistant
conversation will be started when the hotword is detected.
Create custom events on speech detected
So now that you’ve got the basic features of the assistant up and running, it’s time to customize the configuration and
leverage the versatility of Platypush to get your assistant to run whatever you like when you say whatever
phrase you like. You can create event hooks for any of the events triggered by the assistant — among those,
SpeechRecognizedEvent, ConversationStartEvent, HotwordDetectedEvent, TimerEndEvent etc., and those hooks can run
anything that has a Platypush plugin. Let’s see an example to turn on your Philips Hue lights when you say “turn on the
lights”:
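Assuming the light.hue plugin is configured, a hook along these lines (following the same format as the other hooks in this article) would do:

```yaml
event.hook.AssistantTurnOnLights:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn on (the)? lights"
  then:
    - action: light.hue.on
```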
You’ll also notice that the answer of the assistant is suppressed if the detected phrase matches an existing rule, but
if you still want the assistant to speak a custom phrase you can use the tts or tts.google plugins:
event.hook.AssistantTurnOnLightsAnimation:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn on (the)? animation"
  then:
    - action: light.hue.animate
      args:
        animation: color_transition
        transition_seconds: 0.25
    - action: tts.say
      args:
        text: Enjoy the light show
You can also programmatically start a conversation without using the hotword to trigger the assistant. For example, this
is a rule that triggers the assistant whenever you press a Flic button:
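A sketch of such a rule follows. The event type and button address here are illustrative; check the documentation of the Flic backend for the exact event attributes:

```yaml
event.hook.FlicButtonStartConversation:
  if:
    type: platypush.message.event.button.flic.FlicButtonEvent
    btn_addr: 00:11:22:33:44:55
  then:
    - action: assistant.google.start_conversation
```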
Additional win: if you have configured the HTTP backend and you have access to the web panel or the dashboard then
you’ll notice that the status of the conversation will also appear on the web page as a modal dialog, where you’ll see
when a hotword has been detected, the recognized speech and the transcript of the assistant response.
That’s all you need to know to customize your assistant — now you can for instance write rules that would blink your
lights when an assistant timer ends, or programmatically play your favourite playlist on mpd/mopidy when you say a
particular phrase, or handle a home made multi-room music setup with Snapcast+platypush through voice commands. As long
as there’s a platypush plugin to do what you want to do, you can do it already.
Using Google Assistant basic features ("how's the weather?") with the "OK Google" hotword (in English)
Triggering a conversation in Italian when I say the "computer" hotword instead
Support for custom responses through the Text-to-Speech plugin
Control the music through custom hooks that leverage mopidy as a backend (and synchronize music with devices in other rooms through the Snapcast plugin)
Trigger a conversation without hotword - in this case I defined a hook that starts a conversation when something approaches a distance sensor on my Raspberry Pi
Take pictures from a camera on another Raspberry Pi and preview them on the screen through platypush's camera plugins, and send them to mobile devices through the Pushbullet or AutoRemote plugins
All the conversations and responses are visually shown on the platypush web dashboard
Those who have been following my blog or have used Platypush for a while probably know that I've put quite some effort into getting voice assistants right over the past few years.
I built my first (very primitive) voice assistant that used DCT+Markov models back in 2008, when the concept was still pretty much a science fiction novelty.
Then I wrote an article in 2019 and one in 2020 on how to use several voice integrations in Platypush to create custom voice assistants.
Everyone in those pictures is now dead
Quite a few things have changed in this industry niche since I wrote my previous article. Most of the solutions that I covered back in the day, unfortunately, are gone in one way or another:
The assistant.snowboy integration is gone because unfortunately Snowboy is gone. For a while you could still run the Snowboy code with models that either you had previously downloaded from their website or trained yourself, but my latest experience proved to be quite unfruitful - it's been more than 4 years since the last commit on Snowboy, and it's hard to get the code to even run.
The assistant.alexa integration is also gone, as Amazon has stopped maintaining the AVS SDK. And I have literally no clue of what Amazon's plans with the development of Alexa skills are (if there are any plans at all).
The stt.deepspeech integration is also gone: the project hasn't seen a commit in 3 years and I even struggled to get the latest code to run. Given the current financial situation at Mozilla, and the fact that they're trying to cut as much as possible on what they don't consider part of their core product, it's very unlikely that DeepSpeech will be revived any time soon.
The assistant.google integration is still there, but I can't make promises on how long it can be maintained. It uses the google-assistant-library, which was deprecated in 2019. Google replaced it with Conversational Actions, which were also deprecated last year. Insert here your joke about Google building products with the shelf life of a summer hit.
The tts.mimic3 integration, a text-to-speech integration based on mimic3, part of the Mycroft initiative, is still there, but only because it's still possible to spin up a Docker image that runs mimic3. The whole Mycroft project, however, is now defunct, and the story of how it went bankrupt is a sad tale about the power that patent trolls hold over startups. The Mycroft initiative seems to have been picked up by the community, though, and things seem to be moving in the space of fully open source and on-device voice models. I'll definitely be looking with interest at what happens in that space, but the project still seems a bit too immature to justify an investment into a new Platypush integration.
But not all hope is lost
assistant.google
assistant.google may be relying on a dead library, but it's not dead (yet). The code still works, but you're a bit constrained on the hardware side - the assistant library only supports x86_64 and ARMv7 (namely, only Raspberry Pi 3 and 4). No ARM64 (i.e. no Raspberry Pi 5), and even running it on other ARMv7-compatible devices has proved to be a challenge in some cases. Given the state of the library, it's safe to say that it'll never be supported on other platforms, but if you want to run your assistant on a device that is still supported then it should still work fine.
However, I had to do a few dirty packaging tricks to ensure that the assistant library code doesn't break badly on newer versions of Python. That code hasn't been touched in 5 years and it's starting to rot: it depends on ancient and deprecated Python libraries like enum34, and it needs some hammering to work without breaking the whole Python environment in the process.
For now, pip install 'platypush[assistant.google]' should do all the dirty work and get all of your assistant dependencies installed. But I can't promise I can maintain that code forever.
assistant.picovoice
Picovoice has been a nice surprise in an industry niche where all the products that were available just 4 years ago are now dead.
I described some of their products in my previous articles, and I even built a couple of stt.picovoice.* plugins for Platypush back in the day, but I didn't really put much effort into them.
Their business model seemed a bit weird - along the lines of "you can test our products on x86_64; if you need an ARM build you should contact us as a business partner". And the quality of their products was also a bit disappointing compared to other mainstream offerings.
I'm glad to see that the situation has changed quite a bit now. They still have a "sign up with a business email" model, but at least now you can just sign up on their website and start using their products rather than sending emails around. And I'm also quite impressed to see the progress on their website. You can now train hotword models, customize speech-to-text models and build your own intent rules directly from their website - a feature that was also available in the beloved Snowboy and that went missing from any major product offerings out there after Snowboy was gone. I feel like the quality of their models has also greatly improved compared to the last time I checked them - predictions are still slower than the Google Assistant, definitely less accurate with non-native accents, but the gap with the Google Assistant when it comes to native accents isn't very wide.
assistant.openai
OpenAI has filled many gaps left by all the casualties in the voice assistants market. Platypush now provides a new assistant.openai plugin that stitches together several of their APIs to provide a voice assistant experience that honestly feels much more natural than anything I've tried in all these years.
Let's explore how to use these integrations to build our on-device voice assistant with custom rules.
Feature comparison
As some of you may know, voice assistants often aren't monolithic products. Unless explicitly designed as all-in-one packages (like the google-assistant-library), voice assistant integrations in Platypush are usually built on top of four distinct APIs:
Hotword detection: This is the component that continuously listens on your microphone until you say "Ok Google", "Alexa" or any other wake-up word used to start a conversation. Since it's a continuously listening component that needs to make decisions fast, and it only has to recognize one word (or, in a few cases, 3-4 more at most), it usually doesn't need a full language model. It needs small models, often no more than a couple of MBs.
Speech-to-text (STT): This is the component that will capture audio from the microphone and use some API to transcribe it to text.
Response engine: Once you have the transcription of what the user said, you need to feed it to some model that will generate some human-like response for the question.
Text-to-speech (TTS): Once you have your AI response rendered as a text string, you need a text-to-speech model to speak it out loud on your speakers or headphones.
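The four stages above can be pictured as a simple pipeline. Here is a minimal sketch of the data flow; every component is a dummy stand-in, not a real API - actual integrations would wrap a real hotword model, STT/TTS engine and response service:

```python
# Sketch of the four-stage voice assistant pipeline described above.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Assistant:
    detect_hotword: Callable[[bytes], bool]   # tiny, always-on model
    speech_to_text: Callable[[bytes], str]    # transcribes the request
    get_response: Callable[[str], str]        # generates the answer
    text_to_speech: Callable[[str], bytes]    # renders the answer as audio

    def process(self, audio: bytes) -> Optional[bytes]:
        # Stage 1: the cheap hotword check gates the expensive stages
        if not self.detect_hotword(audio):
            return None
        # Stages 2-4: transcribe, answer, speak
        text = self.speech_to_text(audio)
        answer = self.get_response(text)
        return self.text_to_speech(answer)


# Dummy components, just to show the data flow
assistant = Assistant(
    detect_hotword=lambda audio: audio.startswith(b"OK "),
    speech_to_text=lambda audio: audio.decode()[3:],
    get_response=lambda text: f"You said: {text}",
    text_to_speech=lambda text: text.encode(),
)

print(assistant.process(b"OK turn on the lights"))
```

The point of the split is that each stage can come from a different vendor - which is exactly how the Picovoice and OpenAI integrations below are stitched together.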
On top of these basic building blocks for a voice assistant, some integrations may also provide two extra features.
Speech-to-intent
In this mode, the user's prompt, instead of being transcribed directly to text, is transcribed into a structured intent that can be more easily processed by a downstream integration with no need for extra text parsing, regular expressions etc.
For instance, a voice command like "turn off the bedroom lights" could be translated into an intent such as:
{
  "intent": "lights_ctrl",
  "slots": {
    "state": "off",
    "lights": "bedroom"
  }
}
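A downstream hook can then switch on the intent name and slots instead of parsing free text. A hypothetical dispatcher (the handler names are illustrative; the intent payload mirrors the example above):

```python
# Dispatch a structured intent to a handler - no regex parsing needed.
def handle_lights(slots: dict) -> str:
    return f"Turning {slots['state']} the {slots['lights']} lights"


# Map each intent name to its handler
HANDLERS = {"lights_ctrl": handle_lights}


def on_intent(intent: dict) -> str:
    handler = HANDLERS.get(intent["intent"])
    if not handler:
        return "Sorry, I didn't understand that"
    return handler(intent["slots"])


result = on_intent({
    "intent": "lights_ctrl",
    "slots": {"state": "off", "lights": "bedroom"},
})
print(result)  # Turning off the bedroom lights
```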
Offline speech-to-text
a.k.a. offline text transcriptions. Some assistant integrations may offer the ability to pass an audio file and transcribe its content as text.
Features summary
This table summarizes how the assistant integrations available in Platypush compare when it comes to what I would call the foundational blocks:
Plugin               Hotword  STT  AI responses  TTS
assistant.google     ✅        ✅   ✅            ✅
assistant.openai     ❌        ✅   ✅            ✅
assistant.picovoice  ✅        ✅   ❌            ✅
And this is how they compare in terms of extra features:
Plugin               Intents  Offline STT
assistant.google     ❌        ❌
assistant.openai     ❌        ✅
assistant.picovoice  ✅        ✅
Let's see a few configuration examples to better understand the pros and cons of each of these integrations.
Configuration
Hardware requirements
A computer, a Raspberry Pi, an old tablet, or anything in between, as long as it can run Python. At least 1GB of RAM is advised for a smooth audio processing experience.
A microphone.
Speaker/headphones.
Installation notes
Platypush 1.0.0 has recently been released, and it comes with new installation procedures.
There's now official support for several package managers, a better Docker installation process, and more powerful ways to install plugins - via pip extras, Web interface, Docker and virtual environments.
The optional dependencies for any Platypush plugin can be installed via pip extras in the simplest case:
$ pip install 'platypush[plugin1,plugin2,...]'
For example, if you want to install Platypush with the dependencies for assistant.openai and assistant.picovoice:
$ pip install 'platypush[assistant.openai,assistant.picovoice]'
Some plugins however may require extra system dependencies that are not available via pip - for instance, both the OpenAI and Picovoice integrations require the ffmpeg binary to be installed, as it is used for audio conversion and exporting purposes. You can check the plugins documentation for any system dependencies required by some integrations, or install them automatically through the Web interface or the platydock command for Docker containers.
A note on the hooks
All the custom actions in this article are built through event hooks triggered by SpeechRecognizedEvent (or IntentRecognizedEvent for intents). When an intent event is triggered, or a speech event with a condition on a phrase, the assistant integrations in Platypush will prevent the default assistant response. That's to avoid cases where e.g. you say "turn off the lights", your hook takes care of running the actual action, while your voice assistant fetches a response from Google or ChatGPT along the lines of "sorry, I can't control your lights".
If you want to render a custom response from an event hook, you can do so by calling event.assistant.render_response(text), and it will be spoken using the available text-to-speech integration.
If you want to disable this behaviour, and you want the default assistant response to always be rendered, even if it matches a hook with a phrase or an intent, you can do so by setting the stop_conversation_on_speech_match parameter to false in your assistant plugin configuration.
Text-to-speech
Each of the available assistant plugins has its own default tts plugin associated:
assistant.google: tts, but tts.google is also available. The difference is that tts uses the (unofficial) Google Translate frontend API - it requires no extra configuration, but besides setting the input language it isn't very configurable. tts.google on the other hand uses the Google Cloud Text-to-Speech API. It is much more versatile, but it requires an extra API to be enabled on your Google project and an extra credentials file.
assistant.openai: tts.openai, which leverages the OpenAI text-to-speech API.
assistant.picovoice: tts.picovoice, which uses the (still experimental, at the time of writing) Picovoice Orca engine.
Any text rendered via assistant*.render_response will be rendered using the associated TTS plugin. You can however customize it by setting tts_plugin on your assistant plugin configuration - e.g. you can render responses from the OpenAI assistant through the Google or Picovoice engine, or the other way around.
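For example, to render the responses of the Picovoice assistant through the OpenAI voice (assuming both plugins are configured):

```yaml
assistant.picovoice:
  # ... other parameters ...
  tts_plugin: tts.openai
```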
tts plugins also expose a say action that can be called outside of an assistant context to render custom text at runtime - for example, from other event hooks, procedures, cronjobs or API calls. For example:
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
  "type": "request",
  "action": "tts.openai.say",
  "args": {
    "text": "What a wonderful day!"
  }
}' http://localhost:8008/execute
assistant.google
Plugin documentation
pip installation: pip install 'platypush[assistant.google]'
This is the oldest voice integration in Platypush - and one of the use-cases that actually motivated me into forking the previous project into what is now Platypush.
As mentioned in the previous section, this integration is built on top of a deprecated library (with no available alternatives) that just so happens to still work with a bit of hammering on x86_64 and Raspberry Pi 3/4.
Personally it's the voice assistant I still use on most of my devices, but it's definitely not guaranteed that it will keep working in the future.
Once you have installed Platypush with the dependencies for this integration, you can configure it through these steps:
Create a new project on the Google developers console and generate a new set of credentials for it.
Download the credentials secrets as JSON.
Generate scoped credentials from your secrets.json.
Configure the integration in your config.yaml for Platypush (see the configuration page for more details):

assistant.google:
  # Default: ~/.config/google-oauthlib-tool/credentials.json
  # or /credentials/google/assistant.json
  credentials_file: /path/to/credentials.json

  # Default: no sound is played when "Ok Google" is detected
  conversation_start_sound: /path/to/sound.mp3
Restart the service, say "Ok Google" or "Hey Google" while the microphone is active, and everything should work out of the box.
You can now start creating event hooks to execute your custom voice commands. For example, if you configured a lights plugin (e.g. light.hue) and a music plugin (e.g. music.mopidy), you can start building voice commands like these:
# Content of e.g. /path/to/config_yaml/scripts/assistant.py
from platypush import run, when
from platypush.events.assistant import (
    ConversationStartEvent, SpeechRecognizedEvent
)

light_plugin = "light.hue"
music_plugin = "music.mopidy"


@when(ConversationStartEvent)
def pause_music_when_conversation_starts():
    run(f"{music_plugin}.pause_if_playing")


# Note: (limited) support for regular expressions on `phrase`.
# This hook will match any phrase containing either "turn on the lights"
# or "turn on lights"
@when(SpeechRecognizedEvent, phrase="turn on (the)? lights")
def lights_on_command():
    run(f"{light_plugin}.on")
    # Or, with arguments:
    # run(f"{light_plugin}.on", groups=["Bedroom"])


@when(SpeechRecognizedEvent, phrase="turn off (the)? lights")
def lights_off_command():
    run(f"{light_plugin}.off")


@when(SpeechRecognizedEvent, phrase="play (the)? music")
def play_music_command():
    run(f"{music_plugin}.play")


@when(SpeechRecognizedEvent, phrase="stop (the)? music")
def stop_music_command():
    run(f"{music_plugin}.stop")
Or, via YAML:
# Add to your config.yaml, or to one of the files included in it
event.hook.pause_music_when_conversation_starts:
  if:
    type: platypush.message.event.assistant.ConversationStartEvent
  then:
    - action: music.mopidy.pause_if_playing

event.hook.lights_on_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn on (the)? lights"
  then:
    - action: light.hue.on
      # args:
      #   groups:
      #     - Bedroom

event.hook.lights_off_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "turn off (the)? lights"
  then:
    - action: light.hue.off

event.hook.play_music_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "play (the)? music"
  then:
    - action: music.mopidy.play

event.hook.stop_music_command:
  if:
    type: platypush.message.event.assistant.SpeechRecognizedEvent
    phrase: "stop (the)? music"
  then:
    - action: music.mopidy.stop
Parameters are also supported on the phrase event argument through the ${} template construct. For example:
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent
@when(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
def on_play_track_command(
    event: SpeechRecognizedEvent, title: str, artist: str
):
    results = run(
        "music.mopidy.search",
        filter={"title": title, "artist": artist}
    )

    if not results:
        event.assistant.render_response(f"Couldn't find {title} by {artist}")
        return

    run("music.mopidy.play", resource=results[0]["uri"])
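Under the hood such templates behave much like named capture groups. A rough re-implementation of the matching logic (a sketch for illustration, not Platypush's actual code):

```python
import re


def phrase_to_regex(template: str) -> "re.Pattern":
    # Turn 'play ${title} by ${artist}' into a regex with named groups,
    # e.g. '^play (?P<title>.+?) by (?P<artist>.+?)$'
    pattern = re.sub(r"\$\{(\w+)\}", r"(?P<\1>.+?)", template)
    return re.compile(f"^{pattern}$", re.IGNORECASE)


match = phrase_to_regex("play ${title} by ${artist}").match(
    "play Hey Jude by The Beatles"
)
print(match.groupdict())  # {'title': 'Hey Jude', 'artist': 'The Beatles'}
```

The extracted groups are then passed to the hook as keyword arguments, which is why on_play_track_command above receives title and artist directly.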
Pros

👍 Very fast and robust API.
👍 Easy to install and configure.
👍 It comes with almost all the features of a voice assistant installed on Google hardware - except some actions native to Android-based devices and video/display features. This means that features such as timers, alarms, weather forecast, setting the volume or controlling Chromecasts on the same network are all supported out of the box.
👍 It connects to your Google account (this can be configured from your Google settings), so things like location-based suggestions and calendar events are available. Support for custom actions and devices configured in your Google Home app is also available out of the box, although I haven't tested it in a while.
👍 Good multi-language support. In most cases the assistant seems quite capable of understanding questions in multiple languages and responding in the input language without any further configuration.

Cons

👎 Based on a deprecated API that could break at any moment.
👎 Limited hardware support (only x86_64 and RPi 3/4).
👎 Not possible to configure the hotword - only "Ok/Hey Google" is available.
👎 Not possible to configure the output voice - it can only use the stock Google Assistant voice.
👎 No support for intents - something similar was available (albeit tricky to configure) through the Actions SDK, but that has also been abandoned by Google.
👎 Not very modular. Both assistant.picovoice and assistant.openai have been built by stitching together different independent APIs, so those plugins are quite modular: you can choose for instance to run only the hotword engine of assistant.picovoice, which in turn will trigger the conversation engine of assistant.openai, and maybe use tts.google to render the responses. By contrast, given the relatively monolithic nature of google-assistant-library, which runs the whole service locally, if your instance runs assistant.google then it can't run other assistant plugins.
assistant.picovoice
Plugin documentation
pip installation: pip install 'platypush[assistant.picovoice]'
The assistant.picovoice integration is available from Platypush 1.0.0.
Previous versions had some outdated stt.picovoice.* plugins for the individual products, but they weren't properly tested, and they weren't combined into a single integration implementing the Platypush assistant API.
This integration is built on top of the voice products developed by Picovoice. These include:
Porcupine: a fast and customizable engine for hotword/wake-word detection. It can be enabled by setting hotword_enabled to true in the assistant.picovoice plugin configuration.
Cheetah: a speech-to-text engine optimized for real-time transcriptions. It can be enabled by setting stt_enabled to true in the assistant.picovoice plugin configuration.
Leopard: a speech-to-text engine optimized for offline transcriptions of audio files.
Rhino: a speech-to-intent engine.
Orca: a text-to-speech engine.
You can get your personal access key by signing up at the Picovoice console. You may be asked to submit a reason for using the service (feel free to mention a personal Platypush integration), and you will receive your personal access key.
If prompted to select the products you want to use, make sure to select the ones from the Picovoice suite that you want to use with the assistant.picovoice plugin.
A basic plugin configuration would look like this:
assistant.picovoice:
  access_key: YOUR_ACCESS_KEY

  # Keywords that the assistant should listen for
  keywords:
    - alexa
    - computer
    - ok google

  # Paths to custom keyword files
  # keyword_paths:
  #   - ~/.local/share/picovoice/keywords/linux/custom_linux.ppn

  # Enable/disable the hotword engine
  hotword_enabled: true

  # Enable the STT engine
  stt_enabled: true

  # conversation_start_sound: ...

  # Path to a custom model to be used for speech-to-text
  # speech_model_path: ~/.local/share/picovoice/models/cheetah/custom-en.pv

  # Path to an intent model. At least one custom intent model is required if
  # you want to enable intent detection.
  # intent_model_path: ~/.local/share/picovoice/models/rhino/custom-en-x86.rhn
Hotword detection
If enabled through the hotword_enabled parameter (default: True), the assistant will listen for a specific wake word before starting the speech-to-text or intent recognition engines. You can specify custom models for your hotword (e.g. on the same device you may use "Alexa" to trigger the speech-to-text engine in English, "Computer" to trigger the speech-to-text engine in Italian, and "Ok Google" to trigger the intent recognition engine).
You can also create your custom hotword models using the Porcupine console.
If hotword_enabled is set to True, you must also specify the keywords parameter with the list of keywords that you want to listen for, and optionally the keyword_paths parameter with the paths to any custom hotword models that you want to use. If hotword_enabled is set to False, then the assistant won't start listening for speech after the plugin is started, and you will need to programmatically start the conversation by calling the assistant.picovoice.start_conversation action.
When a wake-word is detected, the assistant will emit a HotwordDetectedEvent that you can use to build your custom logic.
By default, the assistant will start listening for speech after the hotword if either stt_enabled or intent_model_path are set. If you don't want the assistant to start listening for speech after the hotword is detected (for example because you want to build your custom response flows, or trigger the speech detection using different models depending on the hotword that is used, or because you just want to detect hotwords but not speech), then you can also set the start_conversation_on_hotword parameter to false. If that is the case, then you can programmatically start the conversation by calling the assistant.picovoice.start_conversation method in your event hooks:
from platypush import when
from platypush.message.event.assistant import HotwordDetectedEvent

# Start a conversation using the Italian language model when the
# "Buongiorno" hotword is detected
@when(HotwordDetectedEvent, hotword='Buongiorno')
def on_it_hotword_detected(event: HotwordDetectedEvent):
    event.assistant.start_conversation(model_file='path/to/it.pv')
Speech-to-text
If you want to build your custom STT hooks, the approach is the same as the one seen for the assistant.google plugin - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.
Speech-to-intent
Intents are structured actions parsed from unstructured human-readable text.
Unlike with hotword and speech-to-text detection, you need to provide a custom model for intent detection. You can create your custom model using the Rhino console.
When an intent is detected, the assistant will emit an IntentRecognizedEvent and you can build your custom hooks on it.
For example, you can build a model to control groups of smart lights by defining the following slots on the Rhino console:
device_state: The new state of the device (e.g. with on or off as supported values)
room: The name of the room associated with the group of lights to be controlled (e.g. living room, kitchen, bedroom)
You can then define a lights_ctrl intent with the following expressions:
"turn $device_state:state the lights" "turn $device_state:state the $room:room lights" "turn the lights $device_state:state" "turn the $room:room lights $device_state:state" "turn $room:room lights $device_state:state"
This intent will match any of the following phrases:
"turn on the lights" "turn off the lights" "turn the lights on" "turn the lights off" "turn on the living room lights" "turn off the living room lights" "turn the living room lights on" "turn the living room lights off"
And it will extract any slots matched in the phrase into the IntentRecognizedEvent.
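Conceptually, each expression is a template whose slots get bound to the matching parts of the utterance. A rough plain-Python illustration of the idea (this is not how Rhino works internally - Rhino uses a trained speech-to-intent model, not regexes - and compile_expression is a hypothetical helper):

```python
import re

def compile_expression(expr: str) -> re.Pattern:
    """
    Convert a Rhino-style expression like
    "turn $device_state:state the $room:room lights"
    into a regex with a named capture group per slot.
    """
    pattern = re.escape(expr)
    # Replace the escaped "$slot_type:slot_name" placeholders with named groups
    pattern = re.sub(r"\\\$\w+:(\w+)", r"(?P<\1>.+?)", pattern)
    return re.compile(f"^{pattern}$", re.IGNORECASE)

expr = compile_expression("turn $device_state:state the $room:room lights")
m = expr.match("turn off the living room lights")
assert m and m.groupdict() == {"state": "off", "room": "living room"}
```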
Train the model, download the context file, and pass its path via the intent_model_path parameter.
You can then register a hook to listen to a specific intent:
from platypush import when, run
from platypush.events.assistant import IntentRecognizedEvent

@when(IntentRecognizedEvent, intent='lights_ctrl', slots={'state': 'on'})
def on_turn_on_lights(event: IntentRecognizedEvent):
    room = event.slots.get('room')
    if room:
        run("light.hue.on", groups=[room])
    else:
        run("light.hue.on")
Note that if both stt_enabled and intent_model_path are set, then both the speech-to-text and intent recognition engines will run in parallel when a conversation is started.
The intent engine is usually faster, as it has a smaller set of intents to match and doesn't have to run a full speech-to-text transcription. This means that, if an utterance matches both a speech-to-text phrase and an intent, the IntentRecognizedEvent is emitted (and not a SpeechRecognizedEvent).
This may not always be the case, though. So, if you want to use the intent detection engine together with speech detection, it may be good practice to also provide a fallback SpeechRecognizedEvent hook to catch the text if the speech is not recognized as an intent:
from platypush import when, run
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='turn ${state} (the)? ${room} lights?')
def on_lights_command(event: SpeechRecognizedEvent, state: str, room: str, **context):
    # Unlike the intent hook above, this hook matches both "on" and "off",
    # so pick the action based on the extracted state
    action = "light.hue.on" if state == "on" else "light.hue.off"
    if room:
        run(action, groups=[room])
    else:
        run(action)
Text-to-speech and response management
The text-to-speech engine, based on Orca, is provided by the tts.picovoice plugin.
However, the Picovoice integration won't provide you with automatic AI-generated responses for your queries. That's because Picovoice doesn't seem to offer (yet) any products for conversational assistants, either voice-based or text-based.
You can however leverage the render_response action to render some text as speech in response to a user command, and that in turn will leverage the Picovoice TTS plugin to render the response.
For example, the following snippet provides a hook that:
Listens for SpeechRecognizedEvent.
Matches the phrase against a list of predefined commands that shouldn't require an AI-generated response.
Has a fallback logic that leverages openai.get_response to generate a response through a ChatGPT model and render it as audio.
Also, note that any text rendered over the render_response action that ends with a question mark will automatically trigger a follow-up - i.e. the assistant will wait for the user to answer its question.
import re

from platypush import run, when
from platypush.message.event.assistant import SpeechRecognizedEvent

def play_music(event, **kwargs):
    run("music.mopidy.play")

def stop_music(event, **kwargs):
    run("music.mopidy.stop")

def ai_assist(event: SpeechRecognizedEvent, **kwargs):
    response = run("openai.get_response", prompt=event.phrase)
    if not response:
        return

    run("assistant.picovoice.render_response", text=response)

# List of commands to match, as pairs of regex patterns and the
# corresponding actions
hooks = (
    (re.compile(r"play (the )?music", re.IGNORECASE), play_music),
    (re.compile(r"stop (the )?music", re.IGNORECASE), stop_music),
    # ...
    # Fallback to the AI assistant
    (re.compile(r".*"), ai_assist),
)

@when(SpeechRecognizedEvent)
def on_speech_recognized(event, **kwargs):
    for pattern, command in hooks:
        if pattern.search(event.phrase):
            run("logger.info", msg=f"Running voice command: {command.__name__}")
            command(event, **kwargs)
            break
Offline speech-to-text
An assistant.picovoice.transcribe action is provided for offline transcriptions of audio files, using the Leopard models.
You can easily call it from your procedures, hooks or through the API:
$ curl -XPOST -H "Authorization: Bearer $TOKEN" -d '
{
"type": "request",
"action": "assistant.picovoice.transcribe",
"args": {
"audio_file": "/path/to/some/speech.mp3"
}
}' http://localhost:8008/execute
{
"transcription": "This is a test",
"words": [
{
"word": "this",
"start": 0.06400000303983688,
"end": 0.19200000166893005,
"confidence": 0.9626294374465942
},
{
"word": "is",
"start": 0.2879999876022339,
"end": 0.35199999809265137,
"confidence": 0.9781675934791565
},
{
"word": "a",
"start": 0.41600000858306885,
"end": 0.41600000858306885,
"confidence": 0.9764975309371948
},
{
"word": "test",
"start": 0.5120000243186951,
"end": 0.8320000171661377,
"confidence": 0.9511580467224121
}
]
}
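The words array lends itself to simple post-processing. For example, a plain-Python sketch (confident_words is a hypothetical helper, and the threshold is arbitrary) that keeps only the words above a confidence level from a response like the one above:

```python
def confident_words(response: dict, min_confidence: float = 0.9) -> list:
    """Return the transcribed words whose confidence meets the threshold."""
    return [
        w["word"]
        for w in response.get("words", [])
        if w["confidence"] >= min_confidence
    ]

# Abridged version of the transcribe response shown above
response = {
    "transcription": "This is a test",
    "words": [
        {"word": "this", "start": 0.064, "end": 0.192, "confidence": 0.96},
        {"word": "is", "start": 0.288, "end": 0.352, "confidence": 0.98},
        {"word": "a", "start": 0.416, "end": 0.416, "confidence": 0.98},
        {"word": "test", "start": 0.512, "end": 0.832, "confidence": 0.95},
    ],
}

assert confident_words(response) == ["this", "is", "a", "test"]
assert confident_words(response, min_confidence=0.97) == ["is", "a"]
```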
Pros
👍 The Picovoice integration is extremely configurable. assistant.picovoice stitches together five independent products developed by a small company specialized in voice products for developers. As such, Picovoice may be the best option if you have custom use-cases. You can pick which features you need (hotword, speech-to-text, speech-to-intent, text-to-speech...) and you have plenty of flexibility in building your integrations.
👍 Runs (or seems to run) (mostly) on device. This is something that we can't say about the other two integrations discussed in this article. If keeping your voice interactions 100% hidden from Google's or Microsoft's eyes is a priority, then Picovoice may be your best bet.
👍 Rich features. It uses different models for different purposes - for example, Cheetah models are optimized for real-time speech detection, while Leopard is optimized for offline transcription. Moreover, Picovoice is the only integration among those analyzed in this article to support speech-to-intent.
👍 It's very easy to build new models or customize existing ones. Picovoice has a powerful developers console that allows you to easily create hotword models, tweak the priority of some words in voice models, and create custom intent models.
Cons
👎 The business model is still a bit weird. It's better than the earlier "write us an email with your business case and we'll reach back to you", but it still requires you to sign up with a business email and write a couple of lines on what you want to build with their products. It feels like their focus is on a B2B approach rather than "open up and let the community build stuff", and that seems to create unnecessary friction.
👎 No native conversational features. At the time of writing, Picovoice doesn't offer products that generate AI responses given voice or text prompts. This means that, if you want AI-generated responses to your queries, you'll have to call e.g. openai.get_response(prompt) directly in your hooks for SpeechRecognizedEvent, and render the responses through assistant.picovoice.render_response. This makes assistant.picovoice alone a better fit for cases where you mostly want to create voice command hooks rather than have general-purpose conversations.
👎 Speech-to-text, at least on my machine, is slower than the other two integrations, and the accuracy with non-native accents is also much lower.
👎 Limited support for languages other than English. At the time of writing, hotword detection with Porcupine seems to be in relatively good shape, with support for 16 languages. However, both speech-to-text and text-to-speech only support English at the moment.
👎 Some APIs are still quite unstable. The Orca text-to-speech API, for example, doesn't even support text that includes digits or some punctuation characters - at least not at the time of writing. The Platypush integration fills the gap with workarounds that e.g. convert digits into words and strip unsupported punctuation, but you definitely get the feeling that some parts of their products are still work in progress.
assistant.openai Plugin documentation pip installation: pip install 'platypush[assistant.openai]'
This integration has been released in Platypush 1.0.7.
It uses the following OpenAI APIs:
/audio/transcriptions for speech-to-text. At the time of writing the default model is whisper-1. It can be configured through the model setting on the assistant.openai plugin configuration. See the OpenAI documentation for a list of available models.
/chat/completions to get AI-generated responses using a GPT model. At the time of writing the default is gpt-3.5-turbo, but it can be configured through the model setting on the openai plugin configuration. See the OpenAI documentation for a list of supported models.
/audio/speech for text-to-speech. At the time of writing the default model is tts-1 and the default voice is nova. They can be configured through the model and voice settings respectively on the tts.openai plugin. See the OpenAI documentation for a list of available models and voices.
You will need an OpenAI API key associated with your account.
A basic configuration would look like this:
openai:
  api_key: YOUR_OPENAI_API_KEY  # Required
  # conversation_start_sound: ...
  # model: ...
  # context: ...
  # context_expiry: ...
  # max_tokens: ...

assistant.openai:
  # model: ...
  # tts_plugin: some.other.tts.plugin

tts.openai:
  # model: ...
  # voice: ...
If you want to build your custom hooks on speech events, the approach is the same as for the other assistant plugins - create an event hook on SpeechRecognizedEvent with a given exact phrase, regex or template.
Hotword support
OpenAI doesn't provide an API for hotword detection, nor a small model for offline detection.
This means that, if no other assistant plugins with stand-alone hotword support are configured (only assistant.picovoice for now), a conversation can only be triggered by calling the assistant.openai.start_conversation action.
If you want hotword support, then the best bet is to add assistant.picovoice to your configuration too - but make sure to only enable hotword detection and not speech detection, which will be delegated to assistant.openai via event hook:
assistant.picovoice:
  access_key: ...
  keywords:
    - computer
  hotword_enabled: true
  stt_enabled: false
  # conversation_start_sound: ...
Then create a hook that listens for HotwordDetectedEvent and calls assistant.openai.start_conversation:
from platypush import run, when
from platypush.events.assistant import HotwordDetectedEvent

@when(HotwordDetectedEvent, hotword="computer")
def on_hotword_detected():
    run("assistant.openai.start_conversation")
Conversation contexts
The most powerful feature offered by the OpenAI assistant is the fact that it leverages the conversation contexts provided by the OpenAI API.
This means two things:
Your assistant can be initialized/tuned with a static context. It is possible to provide some initialization context that fine-tunes how the assistant will behave (e.g. what kind of tone/language/approach it will have when generating responses), as well as initialize it with some predefined knowledge in the form of hypothetical past conversations. Example:

openai:
  # ...
  context:
    # `system` can be used to initialize the context for the expected tone
    # and language in the assistant responses
    - role: system
      content: >
        You are a voice assistant that responds to user queries using
        references to Lovecraftian lore.

    # `user`/`assistant` interactions can be used to initialize the
    # conversation context with previous knowledge. `user` is used to
    # emulate previous user questions, and `assistant` models the
    # expected response.
    - role: user
      content: What is a telephone?
    - role: assistant
      content: >
        A Cthulhuian device that allows you to communicate with
        otherworldly beings. It is said that the first telephone was
        created by the Great Old Ones themselves, and that it is a
        gateway to the void beyond the stars.
If you now start Platypush and ask a question like "how does it work?", the voice assistant may give a response along the lines of:
The telephone functions by harnessing the eldritch energies of the cosmos to transmit vibrations through the ether, allowing communication across vast distances with entities from beyond the veil. Its operation is shrouded in mystery, for it relies on arcane principles incomprehensible to mortal minds.
Note that:
The style of the response is consistent with that initialized in the context through system roles.
Even though a question like "how does it work?" is not very specific, the assistant treats the user/assistant entries given in the context as if they were the latest conversation prompts. Thus it realizes that "it", in this context, probably means "the telephone".
The assistant has a runtime context. It will remember recent conversations for a given amount of time (configurable through the context_expiry setting on the openai plugin configuration). So, even without explicit context initialization in the openai plugin, the plugin will remember the last interactions for (by default) 10 minutes. So if you ask "who wrote the Divine Comedy?", and a few seconds later you ask "where was its writer from?", you may get a response like "Florence, Italy" - i.e. the assistant realizes that "the writer", in this context, likely means "the writer of the work that I was asked about in the previous interaction" and returns pertinent information.
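Conceptually, the runtime context behaves like a rolling window of timestamped messages that get dropped once they're older than context_expiry. A minimal plain-Python sketch of the idea (a hypothetical class, not the plugin's actual implementation):

```python
import time

class ConversationContext:
    """Rolling conversation context with time-based expiry."""

    def __init__(self, expiry: float = 600):  # default expiry: 10 minutes
        self.expiry = expiry
        self._messages = []  # list of (timestamp, role, content) tuples

    def add(self, role: str, content: str):
        self._messages.append((time.time(), role, content))

    def messages(self) -> list:
        """Drop expired messages and return the rest, oldest first."""
        threshold = time.time() - self.expiry
        self._messages = [m for m in self._messages if m[0] >= threshold]
        return [
            {"role": role, "content": content}
            for _, role, content in self._messages
        ]

ctx = ConversationContext(expiry=600)
ctx.add("user", "Who wrote the Divine Comedy?")
ctx.add("assistant", "Dante Alighieri.")
# A follow-up like "where was its writer from?" would now be sent to the
# chat API together with the two messages above, so "its writer" resolves.
assert [m["role"] for m in ctx.messages()] == ["user", "assistant"]
```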
Pros
👍 Speech detection quality. The OpenAI speech-to-text features are the best among the available assistant integrations. The transcribe API has so far got my non-native English accent right nearly 100% of the time (Google comes close to 90%, while Picovoice trails quite a bit behind). And it even detects the speech of my young kid - something the Google Assistant library has always failed to do right.
👍 Text-to-speech quality. The voice models used by OpenAI sound much more natural and human than those of both Google and Picovoice. Google's and Picovoice's TTS models are actually already quite solid, but OpenAI outclasses them when it comes to voice modulation, inflections and sentiment. The result sounds intimidatingly realistic.
👍 AI responses quality. While the scope of the Google Assistant is somewhat limited by what people expected from voice assistants until a few years ago (control some devices and gadgets, find my phone, tell me the news/weather, do basic Google searches...), usually without much room for follow-ups, assistant.openai will basically render voice responses as if you were typing them directly to ChatGPT. While Google would often respond with a "sorry, I don't understand" or "sorry, I can't help with that", the OpenAI assistant is more likely to expose its reasoning, ask follow-up questions to refine its understanding and, in general, create a much more realistic conversation.
👍 Contexts. They are an extremely powerful way to initialize your assistant and customize it to speak the way you want, and know the kind of things that you want it to know. Cross-conversation contexts with configurable expiry also make it more natural to ask something, get an answer, and then ask another question about the same topic a few seconds later, without having to reintroduce the assistant to the whole context.
👍 Offline transcriptions available through the openai.transcribe action.
👍 Multi-language support seems to work great out of the box. Ask something to the assistant in any language, and it'll give you a response in that language.
👍 Configurable voices and models.
Cons
👎 The full pack of features is only available if you have an API key associated with a paid OpenAI account.
👎 No hotword support. It relies on assistant.picovoice for hotword detection.
👎 No intents support.
👎 No native support for weather forecast, alarms, timers, integrations with other services/devices nor other features available out of the box with the Google Assistant. You can always create hooks for them though.
Weather forecast example
Both the OpenAI and Picovoice integrations lack some features available out of the box on the Google Assistant - weather forecast, news playback, timers etc. - as they rely on voice-only APIs that by default don't connect to other services.
However Platypush provides many plugins to fill those gaps, and those features can be implemented with custom event hooks.
Let's see for example how to build a simple hook that delivers the weather forecast for the next 24 hours whenever the assistant gets a phrase that contains the "weather today" string.
You'll need to enable a weather plugin in Platypush - weather.openweathermap will be used in this example. Configuration:
weather.openweathermap:
  token: OPENWEATHERMAP_API_KEY
  location: London,GB
Then drop a script named e.g. weather.py in the Platypush scripts directory (default: /scripts) with the following content:
from datetime import datetime
from textwrap import dedent
from time import time

from platypush import run, when
from platypush.events.assistant import SpeechRecognizedEvent

@when(SpeechRecognizedEvent, phrase='weather today')
def weather_forecast(event: SpeechRecognizedEvent):
    limit = time() + 24 * 60 * 60  # 24 hours from now
    forecast = [
        weather
        for weather in run("weather.openweathermap.get_forecast")
        if datetime.fromisoformat(weather["time"]).timestamp() < limit
    ]

    min_temp = round(min(weather["temperature"] for weather in forecast))
    max_temp = round(max(weather["temperature"] for weather in forecast))
    max_wind_gust = round(
        max(weather["wind_gust"] for weather in forecast) * 3.6
    )
    summaries = [weather["summary"] for weather in forecast]
    most_common_summary = max(summaries, key=summaries.count)
    avg_cloud_cover = round(
        sum(weather["cloud_cover"] for weather in forecast) / len(forecast)
    )

    event.assistant.render_response(
        dedent(
            f"""
            The forecast for today is: {most_common_summary}, with
            a minimum of {min_temp} and a maximum of {max_temp}
            degrees, wind gust of {max_wind_gust} km/h, and an
            average cloud cover of {avg_cloud_cover}%.
            """
        )
    )
This script will work with any of the available voice assistants.
You can also implement something similar for news playback, for example using the rss plugin to get the latest items in your subscribed feeds. Or to create custom alarms using the alarm plugin, or a timer using the utils.set_timeout action.
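For instance, the logic of a voice timer mostly boils down to extracting a duration from the recognized phrase; the result can then be passed to the utils.set_timeout action from a SpeechRecognizedEvent hook. A rough plain-Python sketch of the parsing step (parse_timer_seconds is a hypothetical helper name):

```python
import re

def parse_timer_seconds(phrase: str):
    """
    Extract a duration in seconds from phrases like
    "set a timer for 5 minutes" or "set a timer for 30 seconds".
    Returns None if no duration is found.
    """
    m = re.search(r"(\d+)\s*(second|minute|hour)s?", phrase, re.IGNORECASE)
    if not m:
        return None

    value = int(m.group(1))
    multiplier = {"second": 1, "minute": 60, "hour": 3600}[m.group(2).lower()]
    return value * multiplier

assert parse_timer_seconds("set a timer for 5 minutes") == 300
assert parse_timer_seconds("set a timer for 30 seconds") == 30
assert parse_timer_seconds("play some music") is None
```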
Conclusions
The past few years have seen a lot of things happen in the voice industry. Many products have gone out of market, been deprecated or sunset, but not all hope is lost. The OpenAI and Picovoice products, especially when combined together, can still provide a good out-of-the-box voice assistant experience. And the OpenAI products have also raised the bar on what to expect from an AI-based assistant.
I wish that there were still some fully open and on-device alternatives out there, now that Mycroft, Snowboy and DeepSpeech are all gone. OpenAI and Google provide the best voice experience as of now, but of course they come with trade-offs - namely the great amount of data points you feed to these cloud-based services. Picovoice is somewhat of a trade-off, as it runs at least partly on-device, but its business model is still a bit fuzzy, and it's not clear whether they intend their products to be used by the wider public or whether it's mostly B2B.
I'll keep an eye, however, on what comes from the ashes of Mycroft in the form of the OpenConversational project, and I'll probably keep you up to date when there is a new integration to share.
I've picked up some development on Picovoice recently, as I'm rewriting some Platypush integrations that haven't been touched in a long time (and Picovoice is among them).
I originally worked with their APIs about 4-5 years ago, when I did some research on STT engines for Platypush.
Back then I kind of overlooked Picovoice. It wasn't very well documented, the APIs were a bit clunky, and their business model was based on a weird "send us an email with your use-case and we'll reach back to you" (definitely not the kind of thing you'd want other users to reuse with their own accounts and keys).
Eventually I did just enough work to get the basics to work, and then both my article 1 and article 2 on voice assistants focused more on other solutions - namely Google Assistant, Alexa, Snowboy, Mozilla DeepSpeech and Mycroft's models.
A couple of years down the line:
Snowboy is dead.
Mycroft is dead.
Mozilla DeepSpeech isn't officially dead, but it hasn't seen a commit in 3 years.
Amazon's AVS APIs have become clunky, and it's basically impossible to run any logic outside of Amazon's cloud.
The Google Assistant library has been deprecated without a replacement. It still works on Platypush after I hammered it a lot (especially when it comes to its dependencies from 5-6 years ago), but it only works on x86_64 and Raspberry Pi 3/4 (not aarch64).
So I was like "ok, let's give Picovoice another try". And I must say that I'm impressed by what I've seen. The documentation has improved a lot. The APIs are much more polished. They also have a Web console that you can use to train your hotword models and intents logic - no coding involved, similar to what Snowboy used to have. The business model is still a bit weird, but at least now you can sign up from a Web form (and still explain what you want to use Picovoice products for), and you immediately get an access key to start playing on any platform. And the product isn't fully open-source either (only the API bindings are). But at first glance it seems that most of the processing (if not all, with the exception of authentication) happens on-device - and that's a big selling point.
Most of all, the hotword models are really good. After a bit of plumbing with sounddevice, I've managed to implement real-time hotword detection in Platypush that works really well.
The accuracy is comparable to the Google Assistant's, while supporting many more hotwords and being completely offline. Latency is very low, and the CPU usage is minimal even on a Raspberry Pi 4.
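Porcupine processes 16-bit, 16 kHz mono PCM in fixed-size frames (512 samples per frame by default), so the plumbing mostly consists of capturing audio (e.g. through sounddevice) and chopping the stream into frames for the engine's process() method. A simplified, pure-Python sketch of the frame-splitting step (pcm_frames is a hypothetical helper, not Platypush's actual code):

```python
import struct

def pcm_frames(raw: bytes, frame_length: int = 512):
    """
    Split a raw 16-bit little-endian mono PCM stream into fixed-size
    frames of `frame_length` samples, like those expected by Porcupine's
    process() method. Trailing samples that don't fill a frame are dropped.
    """
    bytes_per_frame = frame_length * 2  # 2 bytes per 16-bit sample
    for offset in range(0, len(raw) - bytes_per_frame + 1, bytes_per_frame):
        chunk = raw[offset:offset + bytes_per_frame]
        yield struct.unpack(f"<{frame_length}h", chunk)

# Example: 3 full frames of silence plus a partial trailing frame
silence = b"\x00\x00" * (512 * 3 + 100)
frames = list(pcm_frames(silence))
assert len(frames) == 3
assert all(len(f) == 512 for f in frames)
```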
I also like the modular architecture of the project. You can use single components (Porcupine for hotword detection, Cheetah for speech detection from stream, Leopard for speech transcription, Rhino for intent parsing...) in order to customize your assistant with the features that you want.
I'm now putting together a new Picovoice integration for Platypush that, rather than having separate integrations for hotword detection and STT, wires everything together, enables intent detection and provides TTS rendering too (depending on the current state of Picovoice's TTS products).
I'll write a new blog article when ready. In the meantime, you can follow the progress on the Picovoice branch.
I wrote an article a while ago that describes how to make your own Google-based voice assistant using just a RaspberryPi, Platypush, a speaker and a microphone.
It also showed how to make your own custom hotword model that triggers the assistant if you don’t want to say “Ok Google”, or if you want distinct hotwords to trigger different assistants in different languages. It also showed how to hook your own custom logic and scripts when certain phrases are recognized, without writing any code.
Since I wrote that article, a few things have changed:
When I wrote the article, Platypush only supported the Google Assistant as a voice back end. In the meantime, I've worked on supporting Alexa as well. Feel free to use the assistant.echo integration in Platypush if you're an Alexa fan, but bear in mind that it's more limited than the existing Google Assistant based options, because of limitations in AVS (Amazon Voice Service). For example, it provides neither the transcript of the detected text - which means it's not possible to build custom hooks on recognized phrases - nor the transcript of the rendered response, since AVS mostly works with audio files as input and provides audio as output. It can also experience some minor audio glitches, at least on RaspberryPi.
Although deprecated, a new release of the Google Assistant Library has been made available to fix the segmentation fault issue on RaspberryPi 4. I've buzzed the developers often over the past year and I'm glad that it's been done! It's good news because the Assistant library has the best engine for hotword detection I've seen. No other SDK I've tried - Snowboy, DeepSpeech, or PicoVoice - comes close to the native "Ok Google" hotword detection accuracy and performance. The news isn't all good, however: the library is still deprecated, with no alternative currently on the horizon. The new release was mostly made in response to user requests to fix things on the new RaspberryPi. But at least one of the best options out there to build a voice assistant will still work for a while. Those interested in building a custom voice assistant that acts 100% like a native Google Assistant can read my previous article.
In the meantime, the shaky situation of the official voice assistant SDK has motivated me to research more state-of-the-art alternatives. I've been a long-time fan of Snowboy, which has a well-supported platypush integration, and I've used it as a hotword engine to trigger other assistant integrations for a long time. However, when it comes to accuracy in real-time scenarios, even its best models aren't that satisfactory. I've also experimented with Mozilla DeepSpeech and PicoVoice products for voice detection, and built integrations for them in Platypush. In this article, I'll try to provide a comprehensive overview of what's currently possible with DIY voice assistants and a comparison of the integrations I've built.
EDIT January 2021: Unfortunately, as of Dec 31st, 2020, Snowboy has been officially shut down. The GitHub repository is still there, and you can still clone it and either use the example models provided under resources/models, train a model using the Python API, or use any of your previously trained models. However, the repo is no longer maintained, and the website that could be used to browse and generate user models is no longer available. It's really a shame - the user models provided by Snowboy were usually quite far from perfect, but it was a great example of a crowd-trained open-source project, and it just shows how difficult it is to keep such projects alive without anybody funding the time the developers invest in them. Anyway, most of the Snowboy examples reported in this article will still work if you download and install the code from the repo.
The Case for DIY Voice Assistants
Why would anyone bother to build their own voice assistant when cheap Google or Alexa assistants can be found anywhere? Despite how pervasive these products have become, I decided to power my whole house with several DIY assistants for a number of reasons:
Privacy. The easiest one to guess! I’m not sure if a microphone in the house, active 24/7, connected to a private company through the internet is a proportionate price to pay for between five and ten interactions a day to toggle the lightbulbs, turn on the thermostat, or play a Spotify playlist. I’ve built the voice assistant integrations in platypush with the goal of giving people the option of voice-enabled services without sending all of the daily voice interactions over a privately-owned channel through a privately-owned box.
Compatibility. A Google Assistant device will only work with devices that support Google Assistant. The same goes for Alexa-powered devices. Some devices may lose some of their voice-enabled capabilities — either temporarily, depending on the availability of the cloud connections, or permanently, because of hardware or software deprecation or other commercial factors. My dream voice assistant works natively with any device, as long as it has an SDK or API to interact with, and does not depend on business decisions.
Flexibility. Even when a device works with your assistant, you’re still bound to the features that have been agreed and implemented by the two parties. Implementing more complex routines over voice commands is usually tricky. In most cases, it involves creating code that will run on the cloud (either in the form of Actions or Lambdas, or IFTTT rules), not in your own network, which limits the actual possibilities. My dream assistant must have the ability to run whichever logic I want on whichever device I want, using whichever custom shortcut I want (even with regex matching), regardless of the complexity. I also aimed to build an assistant that can provide multiple services (Google, Alexa, Siri etc.) in multiple languages on the same device, simply by using different hotwords.
Hardware constraints. I’ve never understood the case for selling plastic boxes that embed a microphone and a speaker in order to enter the world of voice services. That was a good way to showcase the idea, but after a couple of years of experiments it’s probably time for the industry to provide a voice assistant experience that can run on any device, as long as it has a microphone and a controller unit that can run code. Just as with compatibility, there should be no case for Google-compatible or Alexa-compatible devices: any device should be compatible with any assistant, as long as it has a way to communicate with the outside world, and the logic to control that device should be able to run on the same network that the device belongs to.
Cloud vs. local processing. Most of the commercial voice assistants operate by regularly capturing streams of audio, scanning for the hotword in the audio chunks through their cloud-provided services, and opening another connection to their cloud services once the hotword is detected, to parse the speech and to provide the response. In some cases, even the hotword detection is, at least partly, run in the cloud. In other words, most of the voice assistants are dumb terminals intended to communicate with cloud providers that actually do most of the job, and they exchange a huge amount of information over the internet in order to operate. This may be sensible when your targets are low-power devices that operate within a fast network and you don’t need much flexibility. But if you can afford to process the audio on a more capable CPU, or if you want to operate on devices with limited connectivity, or if you want to do things that you usually can’t do with off-the-shelf solutions, you may want to process as much of the load as possible on your own device. I understand the case for a cloud-oriented approach when it comes to voice assistants but, regardless of the technology, we should always be provided with a choice between decentralized and centralized computing. My dream assistant must have the ability to run the hotword and speech detection logic either on-device or on-cloud, depending on the use case and depending on the user’s preference.
Scalability. If I need a new voice assistant in another room or house, I just grab a RaspberryPi, flash a copy of my assistant-powered OS image to the SD card, plug in a microphone and a speaker, and it’s done, without having to buy a new plastic box. If I need a voice-powered music speaker, I just plug an existing speaker into a RaspberryPi. If I need a voice-powered display, I just plug an existing display into a RaspberryPi. If I need a voice-powered switch, I just write a rule for controlling it on voice command directly on my RaspberryPi, without having to worry about whether it’s supported in my Google Home or Alexa app. Any device should be given the possibility of becoming a smart device.
Overview of the voice assistant integrations
A voice assistant usually consists of two components:
An audio recorder that captures frames from an audio input device.
A speech engine that keeps track of the current context.
There are then two main categories of speech engines: hotword detectors, which scan the audio input for the presence of specific hotwords (like “Ok Google” or “Alexa”), and speech detectors, which instead do proper speech-to-text transcription using acoustic and language models. As you can imagine, continuously running full speech detection has a far higher overhead than hotword detection alone, which only has to compare the captured audio against a usually short list of stored hotword models. Then there are speech-to-intent engines, like PicoVoice’s Rhino: instead of providing a text transcription as output, these provide a structured breakdown of the speech intent. For example, if you say “Can I have a small double-shot espresso with a lot of sugar and some milk”, they may return something like {"type": "espresso", "size": "small", "numberOfShots": 2, "sugar": "a lot", "milk": "some"}.
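To make the division of labour concrete, here’s a toy Python sketch of that two-stage pipeline. Plain strings stand in for audio frames and the “detectors” are trivial string matchers, so this only illustrates the control flow (cheap always-on gate, expensive engine invoked on demand), not a real acoustic model:

```python
# Conceptual sketch of the hotword -> speech-engine pipeline described
# above. Strings stand in for transcribed audio frames so the control
# flow is easy to follow; real engines work on raw PCM frames.

HOTWORDS = {"ok google", "alexa"}

def detect_hotword(frame: str) -> bool:
    """Cheap always-on check that only knows a handful of fixed phrases."""
    return frame.strip().lower() in HOTWORDS

def transcribe(frame: str) -> str:
    """Stand-in for an expensive full speech-to-text engine."""
    return frame.strip()

def process(frames):
    """Run the cheap hotword gate on every frame; invoke the expensive
    speech engine only on the frame that follows a detected hotword."""
    transcripts = []
    awaiting_command = False
    for frame in frames:
        if awaiting_command:
            transcripts.append(transcribe(frame))
            awaiting_command = False
        elif detect_hotword(frame):
            awaiting_command = True
    return transcripts

print(process(["noise", "Ok Google", "turn on the lights", "noise"]))
# → ['turn on the lights']
```

The point of the split is exactly the overhead trade-off described above: the gate runs on every frame, while the costly transcription runs only after the gate fires.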
In Platypush, I’ve built integrations to provide users with a wide choice when it comes to speech-to-text processors and engines. Let’s go through some of the available integrations, and evaluate their pros and cons.
Native Google Assistant library

Integrations: assistant.google plugin (to programmatically start/stop conversations) and assistant.google backend (for continuous hotword detection).

Configuration
Create a Google project and download the credentials.json file from the Google developers console.
Install the google-oauthlib-tool:
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
Authenticate to use the assistant-sdk-prototype scope:

export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
Install Platypush with the HTTP backend and Google Assistant library support:

[sudo] pip install 'platypush[http,google-assistant-legacy]'
Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True

backend.assistant.google:
    enabled: True

assistant.google:
    enabled: True
Start Platypush, say “Ok Google” and enjoy your assistant. On the web panel at http://your-rpi:8008 you should be able to see your voice interactions in real time.

Features

Hotword detection: YES (“Ok Google” or “Hey Google”).
Speech detection: YES (once the hotword is detected).
Detection runs locally: NO (hotword detection [seems to] run locally, but once it's detected a channel is opened with Google's servers for the interaction).

Pros
It implements most of the features that you’d find in any Google Assistant product. That includes native support for timers, calendars, customized responses on the basis of your profile and location, native integration with the devices configured in your Google Home, and so on. For more complex features, you’ll have to write your own custom Platypush hooks on e.g. speech-detected or conversation start/end events.
Both hotword detection and speech detection are rock solid, as they rely on the Google cloud capabilities.
Good performance even on older RaspberryPi models (though the library isn’t available for the Zero model or other ARMv6-based devices), because most of the processing actually happens in the cloud. The audio processing thread takes around 2–3% of the CPU on a RaspberryPi 4.
Cons
The Google Assistant library used as a backend by the integration has been deprecated by Google. It still works on most of the devices I’ve tried, as long as the latest version is used, but keep in mind that it’s no longer maintained by Google and it could break in the future. Unfortunately, I’m still waiting for an official alternative.
If your main goal is to operate voice-enabled services within a secure environment with no processing happening on someone else’s cloud, then this is not your best option. The assistant library makes your computer behave more or less like a full Google Assistant device, including capturing audio and sending it to Google servers for processing and, potentially, review.
Google Assistant Push-To-Talk Integration

Integrations: assistant.google.pushtotalk plugin.

Configuration
Create a Google project and download the credentials.json file from the Google developers console.
Install the google-oauthlib-tool:
[sudo] pip install --upgrade 'google-auth-oauthlib[tool]'
Authenticate to use the assistant-sdk-prototype scope:

export CREDENTIALS_FILE=~/.config/google-oauthlib-tool/credentials.json
google-oauthlib-tool --scope https://www.googleapis.com/auth/assistant-sdk-prototype \
    --scope https://www.googleapis.com/auth/gcm \
    --save --headless --client-secrets $CREDENTIALS_FILE
Install Platypush with the HTTP backend and Google Assistant SDK support:

[sudo] pip install 'platypush[http,google-assistant]'
Create or add the lines to ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True

assistant.google.pushtotalk:
    language: en-US
Start Platypush. Unlike the native Google library integration, the push-to-talk plugin doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

curl -XPOST \
    -H "Authorization: Bearer $PP_TOKEN" \
    -H 'Content-Type: application/json' -d '
{
    "type": "request",
    "action": "assistant.google.pushtotalk.start_conversation"
}' http://your-rpi:8008/execute
Features
Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy, DeepSpeech or PicoVoice to trigger or stop the assistant).
Speech detection: YES.
Detection runs locally: NO (you can customize the hotword engine and how to trigger the assistant, but once a conversation is started a channel is opened with Google servers).
Pros
It implements many of the features you’d find in any Google Assistant product out there, even though hotword detection isn’t available and some of the features currently available on the assistant library aren’t provided (like timers or alarms).
Rock-solid speech detection, using the same speech model used by Google Assistant products.
Relatively good performance even on older RaspberryPi models. It’s also available for the ARMv6 architecture, which makes it suitable for RaspberryPi Zero and other low-power devices as well. With no hotword engine running, it uses resources only when you call start_conversation.
It provides the benefits of the Google Assistant speech engine with no need to have a 24/7 open connection between your mic and Google’s servers. The connection is only opened upon start_conversation. This makes it a good option if privacy is a concern, or if you want to build more flexible assistants that can be triggered through different hotword engines (or even build assistants that are triggered in different languages depending on the hotword that you use), or assistants that aren’t triggered by a hotword at all — for example, you can call start_conversation upon button press, motion sensor event or web call.
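As a sketch of that last point, here’s how a Python script (e.g. one running in a GPIO button callback or a motion-sensor hook) could trigger start_conversation through the same HTTP API shown in the curl example above. The host name, port and token are the same placeholders as in the curl snippet:

```python
import json
import urllib.request

def build_request(action: str, **args) -> dict:
    """Build a Platypush request message like the one in the curl example."""
    msg = {"type": "request", "action": action}
    if args:
        msg["args"] = args
    return msg

def execute(host: str, token: str, action: str, **args) -> dict:
    """POST a request to the Platypush HTTP API and return the JSON response."""
    req = urllib.request.Request(
        f"http://{host}:8008/execute",
        data=json.dumps(build_request(action, **args)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. trigger the assistant when a (hypothetical) button is pressed:
# execute("your-rpi", PP_TOKEN, "assistant.google.pushtotalk.start_conversation")
```

The same pattern works for any other Platypush action, including the assistant.echo and stt.deepspeech calls shown later in this article.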
Cons
I’ve built this integration after Google deprecated the Assistant library without providing any official alternative, by refactoring the poorly refined code provided by Google in its samples (pushtotalk.py) and making a proper plugin out of it. It works, but keep in mind that it’s based on some ugly code that’s waiting to be replaced by Google.
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
Alexa Integration

Integrations: assistant.echo plugin.

Configuration

Install Platypush with the HTTP backend and Alexa support:

[sudo] pip install 'platypush[http,alexa]'
Run alexa-auth. It will start a local web server on your machine on http://your-rpi:3000. Open it in your browser and authenticate with your Amazon account. A credentials file should be generated under ~/.avs.json.
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:
backend.http:
    enabled: True

assistant.echo:
    enabled: True
Start Platypush. The Alexa integration doesn’t come with a hotword detection engine. You can initiate or end conversations programmatically through e.g. Platypush event hooks, procedures, or through the HTTP API:

curl -XPOST \
    -H "Authorization: Bearer $PP_TOKEN" \
    -H 'Content-Type: application/json' -d '
{
    "type": "request",
    "action": "assistant.echo.start_conversation"
}' http://your-rpi:8008/execute
Features
Hotword detection: NO (call start_conversation or stop_conversation from your logic or from the context of a hotword integration like Snowboy or PicoVoice to trigger or stop the assistant).
Speech detection: YES (although limited: transcription of the processed audio won’t be provided).
Detection runs locally: NO.
Pros
It implements many of the features that you’d find in any Alexa product out there, even though hotword detection isn’t available. Also, the support for skills or media control may be limited.
Good speech detection capabilities, although inferior to the Google Assistant when it comes to accuracy.
Good performance even on low-power devices. No hotword engine running means it uses resources only when you call start_conversation.
It provides some of the benefits of an Alexa device but with no need for a 24/7 open connection between your mic and Amazon’s servers. The connection is only opened upon start_conversation.
Cons
The situation is extremely fragmented when it comes to Alexa voice SDKs. Amazon eventually re-released the AVS (Alexa Voice Service), mostly with commercial uses in mind, but its features are still quite limited compared to the Google Assistant products. The biggest limitation is that the AVS works on raw audio input and spits back raw audio responses. This means that a text transcription, either of the request or of the response, won’t be available, which limits what you can build with it. For example, you won’t be able to capture custom requests through event hooks.
No hotword support. You’ll have to hook it up to Snowboy, PicoVoice or DeepSpeech if you want hotword support.
Snowboy Integration

Integrations: assistant.snowboy backend.

Configuration

Install Platypush with the HTTP backend and Snowboy support:

[sudo] pip install 'platypush[http,snowboy]'
Choose your hotword model(s). Some are available under SNOWBOY_INSTALL_DIR/resources/models. Otherwise, you can train or download models from the Snowboy website.
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the assistant integration:

backend.http:
    enabled: True

backend.assistant.snowboy:
    audio_gain: 1.2
    models:
        # Trigger the Google assistant in Italian when I say "computer"
        computer:
            voice_model_file: ~/models/computer.umdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: it-IT
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger the Google assistant in English when I say "OK Google"
        ok_google:
            voice_model_file: ~/models/OK Google.pmdl
            assistant_plugin: assistant.google.pushtotalk
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.4

        # Trigger Alexa when I say "Alexa"
        alexa:
            voice_model_file: ~/models/Alexa.pmdl
            assistant_plugin: assistant.echo
            assistant_language: en-US
            detect_sound: ~/sounds/bell.wav
            sensitivity: 0.5
Start Platypush. Say the hotword associated with one of your models, and check in the logs that the HotwordDetectedEvent is triggered and that, if there’s an assistant plugin associated with the hotword, the corresponding assistant is started.

Features

Hotword detection: YES.
Speech detection: NO.
Detection runs locally: YES.

Pros
I've been an early fan and supporter of the Snowboy project. I really like the idea of crowd-powered machine learning. You can download any hotword models for free from their website, provided that you record three audio samples of you saying that word in order to help improve the model. You can also create your custom hotword model, and if enough people are interested in using it then they’ll contribute with their samples, and the model will become more robust over time. I believe that more machine learning projects out there could really benefit from this “use it for free as long as you help improve the model” paradigm.
Platypush was an early supporter of Snowboy, so its integration is well-supported and extensively documented. You can natively configure custom assistant plugins to be executed when a certain hotword is detected, making it easy to make a multi-language and multi-hotword voice assistant.
Good performance, even on low-power devices. I’ve used Snowboy in combination with the Google Assistant push-to-talk integration for a while on single-core RaspberryPi Zero devices, and the CPU usage from hotword processing never exceeded 20–25%.
The hotword detection runs locally, on models that are downloaded locally. That means no need for a network connection to run and no data exchanged with any cloud.
Cons

Even though the idea of crowd-powered voice models is definitely interesting and has plenty of potential to scale up, the most popular models on their website have been trained with at most 2000 samples. And (sadly, as well as expectedly) most of those voice samples belong to white, young-adult males, which makes many of these models perform quite poorly on speech recorded from individuals who don’t fit within that category (and also on people who aren’t native English speakers).

Mozilla DeepSpeech

Integrations: stt.deepspeech plugin and stt.deepspeech backend (for continuous detection).

Configuration

Install Platypush with the HTTP backend and Mozilla DeepSpeech support. Take note of the version of DeepSpeech that gets installed:

[sudo] pip install 'platypush[http,deepspeech]'
Download the Tensorflow model files for the version of DeepSpeech that has been installed. This may take a while depending on your connection:

export MODELS_DIR=~/models
export DEEPSPEECH_VERSION=0.6.1
wget https://github.com/mozilla/DeepSpeech/releases/download/v$DEEPSPEECH_VERSION/deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
tar xvf deepspeech-$DEEPSPEECH_VERSION-models.tar.gz
x deepspeech-0.6.1-models/
x deepspeech-0.6.1-models/lm.binary
x deepspeech-0.6.1-models/output_graph.pbmm
x deepspeech-0.6.1-models/output_graph.pb
x deepspeech-0.6.1-models/trie
x deepspeech-0.6.1-models/output_graph.tflite
mv deepspeech-$DEEPSPEECH_VERSION-models $MODELS_DIR
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the DeepSpeech integration:

backend.http:
    enabled: True

stt.deepspeech:
    model_file: ~/models/output_graph.pbmm
    lm_file: ~/models/lm.binary
    trie_file: ~/models/trie

    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

    conversation_timeout: 5

backend.stt.deepspeech:
    enabled: True
Start Platypush. Speech detection will start running on startup. SpeechDetectedEvents will be triggered when you talk. HotwordDetectedEvents will be triggered when you say one of the configured hotwords. ConversationDetectedEvents will be triggered when you say something after a hotword, with the speech provided as an argument. You can also disable continuous detection and only start it programmatically by calling stt.deepspeech.start_detection and stt.deepspeech.stop_detection. You can also use the plugin to perform offline speech transcription from audio files:

curl -XPOST \
-H "Authorization: Bearer $PP_TOKEN" \
-H 'Content-Type: application/json' -d '
{
"type":"request",
"action":"stt.deepspeech.detect",
"args": {
"audio_file": "~/audio.wav"
}
}' http://your-rpi:8008/execute
# Example response
{
"type":"response",
"target":"http",
"response": {
"errors":[],
"output": {
"speech": "This is a test"
}
}
}
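If you want to play with the engine outside of Platypush, a minimal offline-transcription sketch against the deepspeech 0.6.x Python API could look roughly like this. The model file names match the archive extracted above; the beam width and the LM alpha/beta weights are the values used in Mozilla’s 0.6 examples, and the API changed in later DeepSpeech releases, so treat this as a version-specific sketch:

```python
import array
import os
import wave

def read_wav_mono16(path: str) -> array.array:
    """Read a 16-bit mono WAV file into an array of PCM samples.
    DeepSpeech 0.6 expects 16 kHz, 16-bit, mono audio."""
    with wave.open(path, "rb") as f:
        assert f.getnchannels() == 1 and f.getsampwidth() == 2
        return array.array("h", f.readframes(f.getnframes()))

def transcribe(audio_file: str, models_dir: str = "~/models") -> str:
    """Offline transcription with the DeepSpeech 0.6.x Python API."""
    import numpy as np           # the SWIG bindings expect an int16 ndarray
    from deepspeech import Model  # pip install deepspeech==0.6.1

    models_dir = os.path.expanduser(models_dir)
    model = Model(os.path.join(models_dir, "output_graph.pbmm"), 500)
    # Hook up the language model for better accuracy (0.6.x API; the
    # alpha/beta weights are the defaults used in Mozilla's examples)
    model.enableDecoderWithLM(
        os.path.join(models_dir, "lm.binary"),
        os.path.join(models_dir, "trie"),
        0.75, 1.85,
    )
    pcm = np.array(read_wav_mono16(os.path.expanduser(audio_file)),
                   dtype=np.int16)
    return model.stt(pcm)
```

This is essentially what the stt.deepspeech.detect action above does for you behind the HTTP API.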
Features

Hotword detection: YES.
Speech detection: YES.
Detection runs locally: YES.

Pros
I’ve been honestly impressed by the features of DeepSpeech and the progress they’ve made starting from version 0.6.0. Mozilla made it easy to run both hotword and speech detection on-device with no need for any third-party services or network connection. The full codebase is open-source and the Tensorflow voice and language models are also very good. It’s amazing that they’ve released the whole thing to the community for free, and it also means that you can easily extend the Tensorflow model by training it with your own samples.
Speech-to-text transcription of audio files can be a very useful feature.
Cons
DeepSpeech is quite demanding when it comes to CPU resources. It will run OK on a laptop or on a RaspberryPi 4, although in my tests speech detection took 100% of one core even on a RaspberryPi 4. It may be too resource-intensive to run on less powerful machines.
DeepSpeech has a bit more delay than other solutions. The engineers at Mozilla have worked a lot to make the model as small and performant as possible, and they claim to have achieved real-time performance on a RaspberryPi 4. In reality, all of my tests show between 2 and 4 seconds of delay between speech capture and detection.
DeepSpeech is relatively good at detecting speech, but not at interpreting the semantic context (that’s something where Google still wins hands down). If you say “this is a test,” the model may actually capture “these is a test.” “This” and “these” do indeed sound almost the same in English, but the Google assistant has a better semantic engine to detect the right interpretation of such ambiguous cases. DeepSpeech works quite well for speech-to-text transcription purposes but, in such ambiguous cases, it lacks some semantic context.
Even though it’s possible to use DeepSpeech from Platypush as a hotword detection engine, keep in mind that that’s not how the engine is intended to be used. Hotword engines usually run against smaller and more performant models intended to detect only one or a few words, not against a full-featured language model. The best usage of DeepSpeech is probably either offline text transcription, or pairing it with another hotword integration and leaving the speech detection part to DeepSpeech.
PicoVoice
PicoVoice is a very promising company that has released several products for performing voice detection on-device. Among them:
Porcupine, a hotword engine.
Leopard, a speech-to-text offline transcription engine.
Cheetah, a speech-to-text engine for real-time applications.
Rhino, a speech-to-intent engine.
So far, Platypush provides integrations with Porcupine and Cheetah.
Integrations
Hotword engine: stt.picovoice.hotword plugin and stt.picovoice.hotword backend (for continuous detection).
Speech engine: stt.picovoice.speech plugin and stt.picovoice.speech backend (for continuous detection).
Configuration

Install Platypush with the HTTP backend and the PicoVoice hotword and/or speech integrations:

[sudo] pip install 'platypush[http,picovoice-hotword,picovoice-speech]'
Create or add the lines to your ~/.config/platypush/config.yaml to enable the webserver and the PicoVoice integrations:

stt.picovoice.hotword:
    # Custom list of hotwords
    hotwords:
        - computer
        - alexa
        - hello

# Enable continuous hotword detection
backend.stt.picovoice.hotword:
    enabled: True

# Enable continuous speech detection
# backend.stt.picovoice.speech:
#     enabled: True

# Or start speech detection when a hotword is detected
event.hook.OnHotwordDetected:
    if:
        type: platypush.message.event.stt.HotwordDetectedEvent
    then:
        # Start a timer that stops the detection in 10 seconds
        - action: utils.set_timeout
          args:
              seconds: 10
              name: StopSpeechDetection
              actions:
                  - action: stt.picovoice.speech.stop_detection

        - action: stt.picovoice.speech.start_detection
Start Platypush and enjoy your on-device voice assistant.

Features

Hotword detection: YES.
Speech detection: YES.
Detection runs locally: YES.

Pros

When it comes to on-device voice engines, PicoVoice products are probably the best solution out there. Their hotword engine is far more accurate than Snowboy and it manages to be even less CPU-intensive. Their speech engine has much less delay than DeepSpeech and it’s also much less power-hungry: it will still run well and with low latency even on older models of RaspberryPi.

Cons
While PicoVoice provides Python SDKs, their native libraries are closed source. It means that I couldn’t dig much into how they’ve solved the problem.
Their hotword engine (Porcupine) can be installed and run free of charge for personal use on any device, but if you want to expand the set of keywords provided by default, or add more samples to train the existing models, then you’ll have to go for a commercial license. Their speech engine (Cheetah) instead can only be installed and run free of charge for personal use on Linux on x86_64 architecture. Any other architecture or operating system, as well as any chance to extend the model or use a different model, is only possible through a commercial license. While I understand their point and their business model, I’d have been super-happy to just pay for a license through a more friendly process, instead of relying on the old-fashioned “contact us for a commercial license/we’ll reach back to you” paradigm.
Cheetah’s speech engine still suffers from some of the same issues as DeepSpeech when it comes to semantic context/intent detection. The “this/these” ambiguity happens here as well. These problems can be partially solved by Rhino, PicoVoice’s speech-to-intent engine, which provides a structured representation of the speech intent instead of a letter-by-letter transcription, but I haven’t yet worked on integrating Rhino into Platypush.
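For reference, a bare-bones Porcupine hotword loop in Python could look like the sketch below. The pvporcupine package and its create()/process() calls reflect PicoVoice’s public Python bindings rather than Platypush internals, and depending on the package version you may also need an access key, so treat the exact API as an assumption:

```python
def frames(pcm, frame_length):
    """Split a PCM sample buffer into the fixed-size frames that the
    hotword engine consumes; any trailing partial frame is dropped."""
    for i in range(0, len(pcm) - frame_length + 1, frame_length):
        yield pcm[i:i + frame_length]

def detect(pcm) -> bool:
    """Scan a PCM buffer for the 'computer' keyword with Porcupine."""
    import pvporcupine  # pip install pvporcupine (assumed API)
    porcupine = pvporcupine.create(keywords=["computer"])
    try:
        for frame in frames(pcm, porcupine.frame_length):
            # process() returns the index of the matched keyword,
            # or a negative value if no keyword was detected
            if porcupine.process(frame) >= 0:
                return True
        return False
    finally:
        porcupine.delete()
```

A real deployment would feed frames from a live microphone stream (e.g. via pyaudio) rather than a pre-recorded buffer, which is what the Platypush backend does for you.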
Conclusions
The democratization of voice technology has long been dreamed about, and it’s finally (slowly) coming. The situation out there is still quite fragmented though and some commercial SDKs may still get deprecated with short notice or no notice at all. But at least some solutions are emerging to bring speech detection to all devices.
I’ve built integrations in Platypush for all of these services because I believe that it’s up to users, not to businesses, to decide how people should use and benefit from voice technology. Moreover, having so many voice integrations in the same product — and especially having voice integrations that expose all the same API and generate the same events — makes it very easy to write assistant-agnostic logic, and really decouple the tasks of speech recognition from the business logic that can be run by voice commands.
Check out my previous article to learn how to write your own custom hooks in Platypush on speech detection, hotword detection and speech start/stop events.
To summarize my findings so far:
Use the native Google Assistant integration if you want to have a full Google experience, and if you’re ok with Google servers processing your audio and the possibility that somewhere in the future the deprecated Google Assistant library won’t work anymore.
Use the Google push-to-talk integration if you only want to have the assistant, without hotword detection, or you want your assistant to be triggered by alternative hotwords.
Use the Alexa integration if you already have an Amazon-powered ecosystem and you’re ok with having less flexibility when it comes to custom hooks because of the unavailability of speech transcript features in the AVS.
Use Snowboy if you want to use a flexible, open-source and crowd-powered engine for hotword detection that runs on-device and/or use multiple assistants at the same time through different hotword models, even if the models may not be that accurate.
Use Mozilla DeepSpeech if you want a fully on-device open-source engine powered by a robust Tensorflow model, even if it takes more CPU load and a bit more latency.
Use PicoVoice solutions if you want a full voice solution that runs on-device and is both accurate and performant, even though you’ll need a commercial license to use it on some devices or to extend/change the model.
Let me know your thoughts on these solutions and your experience with these integrations!
Today’s abundance of music streaming services has created lots of opportunities to listen to whichever music you like wherever you like, but there’s a huge fragmentation problem that hasn’t been tackled seriously by the music tech industry yet.
Spotify allows you to find and discover a lot of tunes, but not all the music is there.
You may want to integrate your collection of mp3s into your Spotify collection, but that is simply not an option.
You may have favourite tracks that are only available on SoundCloud, or albums that aren’t on Spotify but that you purchased in the past on Google Music or iTunes, and there’s no way to have them all in one place: each of these solutions comes with its separate app.
You may want to integrate your favourite online radios or podcasts into your music app, but that’s, again, not an option — TuneIn, Podcast Addict or Google Podcasts are distinct apps.
You may want to easily stream your playlists to any speaker or pair of headphones you own, but that’s not as easy as it sounds. You may have to replace your existing speakers with expensive solutions (like Sonos or Bose) to enjoy a proper multi-room setup. Apps like Spotify come with their own solutions (e.g. Spotify Connect), but only a limited number of devices are supported and, again, it only works as long as you stream Spotify content from the Spotify app.
There have been commercial solutions that tried to tackle this fragmentation problem and let you stream music from any app to any speaker without having to replace your audio system, but the situation isn’t that bright now that Google has discontinued Chromecast Audio support, and AirPlay works well only as long as you stay within an Apple ecosystem.
As of today the problem “how do I play whichever piece of music I like, from whichever service I like, on whichever device I like, all in one interface, without having to install 10 different apps” is still largely unsolved if you rely on commercial solutions.
Luckily, we’ve got plenty of open source software around that comes to the rescue. It requires a bit more work than just downloading an app and logging in, but the rewards are priceless.
One music server to rule them all
Mopidy is one of the best open source solutions around when it comes to integrating multiple music services under one single interface. It’s entirely written in Python, it’s (almost) 100% compatible with MPD, a music protocol that has been around since 2003 and comes with lots of compatible clients (command-line, web-based, mobile apps etc.), and there are countless plugins that let Mopidy integrate with any kind of music service around.
It’s relatively easy to install mopidy on a RaspberryPi and turn it into a powerful music centre.
Add the Mopidy repository to your apt lists and install the base package:

wget -q -O - https://apt.mopidy.com/mopidy.gpg | sudo apt-key add -
# Run the following if you're running Raspbian/Debian Buster
sudo wget -q -O /etc/apt/sources.list.d/mopidy.list https://apt.mopidy.com/buster.list
# Run the following command if you're running Raspbian/Debian Stretch
sudo wget -q -O /etc/apt/sources.list.d/mopidy.list https://apt.mopidy.com/stretch.list
# Update the repositories and install mopidy
sudo apt-get update
sudo apt-get install mopidy mopidy-mpd
Install any additional extensions for the music services you like:

# Spotify support
sudo apt-get install mopidy-spotify
# Local files support
sudo apt-get install mopidy-local
# Dirble support
sudo apt-get install mopidy-dirble
# Podcast support
sudo apt-get install mopidy-podcast mopidy-podcast-gpodder mopidy-podcast-itunes
# Last.FM scrobbling support
sudo apt-get install mopidy-scrobbler
# Soma.FM support
sudo apt-get install mopidy-somafm
# Soundcloud support
sudo apt-get install mopidy-soundcloud
# TuneIn support
sudo apt-get install mopidy-tunein
# YouTube support
sudo apt-get install mopidy-youtube
And there are even more extensions available for Mopidy - you may want to take a look on their website to get an overview of the compatible services.
Head to the pages of those extensions to find out whether they need some extra dependencies or extra configuration to be added to your ~/.config/mopidy/mopidy.conf file.
You may also want to make sure that the HTTP module is enabled in your mopidy configuration, since most of the web frontends (including Platypush) rely on it to interact with mopidy over websockets:
[http]
enabled = true
hostname = 0.0.0.0
port = 6680
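With the HTTP module enabled, Mopidy also exposes a JSON-RPC 2.0 endpoint at /mopidy/rpc, which is what the web clients use under the hood. Here’s a minimal Python sketch; the method names follow Mopidy’s core API (e.g. core.playback.get_state), and the host is a placeholder:

```python
import json
import urllib.request

def rpc_payload(method: str, request_id: int = 1, **params) -> dict:
    """Build a JSON-RPC 2.0 message for Mopidy's /mopidy/rpc endpoint."""
    msg = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params:
        msg["params"] = params
    return msg

def mopidy_call(host: str, method: str, **params):
    """POST the call to Mopidy over HTTP and return the JSON-RPC result."""
    req = urllib.request.Request(
        f"http://{host}:6680/mopidy/rpc",
        data=json.dumps(rpc_payload(method, **params)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("result")

# e.g. mopidy_call("your-rpi", "core.playback.get_state")
```

This is handy for scripting playback from cron jobs or home-automation hooks without going through an MPD client at all.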
There is also a wide range of frontend clients available for Mopidy, from command-line MPD clients to full-blown web clients that do a very good job replicating the UI of some of the most popular music apps around. Let's have a quick overview of my favourite solutions to interact with Mopidy:
netcat/telnet. Mopidy is compatible with the MPD protocol, and once started it will listen by default on port 6600 for MPD commands. It's relatively straightforward to explore your library or control the playback without installing another client - not the most user-friendly way, but very handy if you want to write some scripts:

$ echo status | nc localhost 6600
OK MPD 0.19.0
volume: 100
repeat: 0
random: 1
single: 0
consume: 0
playlist: 3
playlistlength: 1489
xfade: 0
state: stop
song: 513
songid: 560
nextsong: 173
nextsongid: 220
OK
$ echo currentsong | nc localhost 6600
OK MPD 0.19.0
file: spotify:track:218UgZapIcNRP9f38C5cMp
Time: 365
Artist: 3rd Force
Album: Vital Force
Title: Echoes Of A Dream
Date: 1997
Track: 6
Pos: 513
Id: 560
AlbumArtist: 3rd Force
X-AlbumUri: spotify:album:3mSCVZabNB0rUmpYgPkDuV
OK
$ echo play | nc localhost 6600
OK MPD 0.19.0
OK
$ echo stop | nc localhost 6600
OK MPD 0.19.0
OK
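If you want to consume these responses from a script, the key: value lines are easy to turn into a dictionary. A minimal standalone parser (plain sockets/stdlib only, no Platypush or MPD library involved), fed with a trimmed version of the currentsong output above:

```python
def parse_mpd_response(raw):
    """Parse the 'key: value' lines of an MPD response into a dict.

    Skips the 'OK MPD x.y.z' greeting and the terminating OK/ACK
    status lines.
    """
    result = {}
    for line in raw.splitlines():
        if line.startswith(("OK", "ACK")):
            continue
        key, sep, value = line.partition(": ")
        if sep:
            result[key] = value
    return result

sample = """OK MPD 0.19.0
file: spotify:track:218UgZapIcNRP9f38C5cMp
Artist: 3rd Force
Title: Echoes Of A Dream
OK"""

song = parse_mpd_response(sample)
# song["Artist"] == "3rd Force"
```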
mpc is a tiny command-line utility that makes it a bit easier to interact with MPD/Mopidy instances for scripting purposes without handling low-level protocol messages:

sudo apt-get install mpc
mpc help # To see available commands
mpc play # Play the music
ncmpcpp is an ncurses-based terminal client that I've been using for more than a decade - probably one of the lightest yet most versatile music clients I've seen around, and definitely my favourite:

sudo apt-get install ncmpcpp
ncmpcpp screenshots
mopidy-iris is probably one of the most user-friendly, well-maintained and feature-rich Mopidy clients around. It's compatible with desktop, tablet and mobile, and it comes with a UI that successfully mimics those of many popular music apps:

sudo apt-get install mopidy-iris
After installing it head to http://your-raspberry:6680 and select Iris as the web interface.
iris interface
Many more compatible clients are available (just check the list of extensions), from minimal ones to feature-rich ones such as MusicBox, up to a dedicated party-mode client optimized for multiple users. And, being compatible with MPD, Mopidy should also work out of the box with all the MPD clients out there - the list includes a Mopidy Mobile app and several MPD apps for Android and iOS.
Hook Mopidy to Platypush
You can connect Platypush to Mopidy. That provides you with one more UI for interacting with your instance (embedded in the Platypush web panel), and it opens a world of possibilities when it comes to automating music interactions.
Install Platypush with the HTTP and MPD dependencies:

[sudo] pip install 'platypush[http,mpd]'
Enable the MPD/Mopidy plugin and backend in your Platypush configuration file:

music.mpd:
    host: localhost
    port: 6600

backend.music.mopidy:
    host: localhost
A backend.music.mpd is also provided, but if you use Mopidy instead of a bare MPD server it's advisable to use backend.music.mopidy instead - the former polls the server for updates at regular intervals, while the Mopidy-specific backend listens for events continuously over the provided websocket interface.
Restart Platypush and head to http://your-raspberry:8008. You should see a new tab for Mopidy — yet another web interface to interact with the server.
Platypush MPD interface
Before proceeding on how to automate the interaction with your new music server, let's see how to turn Mopidy into a full multi-room music server with Snapcast.
Multi-room setup
The ability to synchronize music across multiple rooms and devices is a great feature of a modern smart home. However, most of the commercial solutions available today (like Sonos or Bose) are expensive, and in most cases they require you to replace your existing speakers with theirs. Luckily, it's relatively easy to set up a multi-room experience with multiple RaspberryPis without having to change your speakers. Let's see how.
Install Snapcast by following the instructions on their Github page
Create an /etc/default/snapserver file on the machine(s) where you’re running your Mopidy instance(s) with the following content:
USER_OPTS="--user snapserver:snapserver --stream=pipe:///tmp/snapfifo?name=mopidy&codec=pcm --codec=pcm"
SNAPSERVER_OPTS=""
In the example above we’ll use a PCM lossless codec for streaming the music, and we’ll be using /tmp/snapfifo as a file queue where Mopidy will push its audio stream.
Start snapserver on your Mopidy machine(s) by simply running the executable, and optionally add it to your startup configuration.
Configure the [audio] section of your mopidy.conf file to stream to the Snapcast FIFO (note: with this configuration Mopidy will only stream to the new file and not to your speakers, you'll need to run snapclient to play the audio):
[audio]
mixer = software
mixer_volume = 100
output = audioconvert ! audio/x-raw,rate=48000,channels=2,format=S16LE ! wavenc ! filesink location=/tmp/snapfifo
The audio.output setting of Mopidy is actually a very flexible way of building GStreamer pipelines to redirect and transform the audio however you like. In this example I'm transforming the audio to stereo WAV at 48 kHz, which may be perfect if you're after a true lossless audio experience, but it may cause some glitches if your network isn't very stable (we're basically passing uncompressed audio around). It's possible to encode and compress the stream by applying e.g. an MP3 or OGG encoder to the pipeline, but for some reason this makes the GStreamer pipeline very unstable (the bug has been open for a couple of years and the Mopidy developers are still scratching their heads over why it happens), so the lossless stream option may be the only one that works for now.
Create an /etc/default/snapclient file on all the machines that will connect to your Snapserver, including the Mopidy machine itself if you want it to play music directly too (as opposed to using it just as a music backend):

START_SNAPCLIENT=true
USER_OPTS="--user snapclient:audio"
Start snapclient on the machines that will be connecting to your Mopidy instance. The command will be snapclient -h localhost on the machine that runs mopidy itself and snapclient -h remote-host-ip-or-name on the other machines. You can run as many snapclient instances on a host as the servers you want to connect it to.
Enable the Snapcast backend and plugin in the Platypush configuration of each of the machines that will be running the client or the server:
backend.music.snapcast:
    hosts:
        - server1
        - server2
        - server3

music.snapcast:
    host: default-server-ip-or-name
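For the curious, the plugin talks to snapserver over Snapcast's JSON-RPC control interface (TCP port 1705 by default). This is roughly what a volume-change request looks like on the wire - method and parameter names follow the Snapcast control API and may vary between versions, and the client id is a placeholder:

```python
import json

def snapcast_set_volume(client_id, percent, muted=False, request_id=1):
    """Build a Client.SetVolume request for Snapcast's control interface.

    Snapserver accepts newline-terminated JSON-RPC messages on its
    control port (1705 by default).
    """
    return json.dumps({
        "id": request_id,
        "jsonrpc": "2.0",
        "method": "Client.SetVolume",
        "params": {
            "id": client_id,
            "volume": {"muted": muted, "percent": percent},
        },
    }) + "\r\n"

# Example: set a (hypothetical) 'bedroom' client to 90% volume.
request = snapcast_set_volume("bedroom", 90)
```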
Restart Platypush and head to the web panel on port 8008. You should see a new tab for Snapcast, identified by the speaker icon. From here you can control which stream will be playing on which host, you can create streaming groups, change the volume etc.
Platypush Snapcast interface
You can also install an Android app to control your multi-room setup, even though the app only allows you to control one server at a time. The app also lets you play the audio streams directly on your smartphone.
Snapcast app interface
If you use Iris as a web interface to Mopidy you can now head to settings and enable the Snapcast plugin. A speaker icon will appear in the bottom bar, and you’ll be able to control your music setup from there as well.
Time to enjoy your low-cost but powerful multi-room music setup!
Build your remote to control the music
All of us have some unused infrared remote collecting dust somewhere in the living room. In this section I’ll show how to turn it into a universal remote for controlling your music (and not only) with some Platypush automation. You’ll need the following:
An infrared receiver — they’re usually very cheap. Any of them will do, even though I personally used this model.
An Arduino or Arduino-compatible device (or an ESP8266, or a RaspberryPi Pico, or any other microcontroller, although the code may differ). Most of the infrared sensors around communicate over an analog interface, but the RaspberryPi doesn't come with an ADC. The solution is to plug in an Arduino over USB and let it monitor for changes in the detected infrared signal.
A breadboard.
Once you’ve got all the hardware you can set up your receiver:
Plug the infrared receiver to GND and Vcc, and the data PIN to e.g. the Arduino PIN 2, as shown in the figure below:
Arduino IR sensor connection
Download and install the Arduino IRremote library.
Prepare a sketch that reads the data from the infrared receiver PIN and writes it over serial interface as a JSON:
#include <IRremote.h>

// When a signal with all bits set to 1 is received it
// usually means that the previously pressed key is still
// being pressed, until a signal with all bits set to
// zero is received.
#define IR_REPEAT 0xFFFFFFFF

const int RECV_PIN = 2;
IRrecv irrecv(RECV_PIN);
decode_results results;
// unsigned long, not int: IR codes are 32-bit, while int
// is only 16-bit on AVR boards.
unsigned long latest_value = 0;

void setup() {
  Serial.begin(9600);
  irrecv.enableIRIn();
  irrecv.blink13(true);
}

void send_value(unsigned long value) {
  // Quote the hex value so the output is valid JSON,
  // e.g. {"ir":"4b34d827"}
  Serial.print("{\"ir\":\"");
  Serial.print(value, HEX);
  Serial.println("\"}");
}

void loop() {
  if (irrecv.decode(&results)) {
    if (results.value == IR_REPEAT && latest_value != 0) {
      send_value(latest_value);
    } else if (results.value && results.value != latest_value) {
      send_value(results.value);
    }

    latest_value = results.value;
    irrecv.resume();
  }
}
Compile the sketch and upload it to the Arduino.
Open the Arduino serial monitor and verify that you see the JSON string when you press a key on the remote.
Enable the serial plugin and backend in your platypush configuration:
serial:
    device: /dev/ttyUSB0

backend.sensor.serial:
    enabled: True
Restart Platypush, check the output and press a key on your remote. You should see an event in the logs that looks like this:

INFO|platypush|Received event: {"type": "event", "target": "hostname", "origin": "hostname", "args": {"type": "platypush.message.event.sensor.SensorDataChangeEvent", "data": {"ir": "4b34d827"}}}
Take note of the hexadecimal code reported in the event - that's the decoded data associated with that specific remote button. Then add an event hook to run the desired actions when a certain button is pressed:

from platypush.config import Config
from platypush.event.hook import hook
from platypush.utils import run
from platypush.message.event.sensor import SensorDataChangeEvent

@hook(SensorDataChangeEvent)
def on_remote_key_press(event, **context):
    ir_code = event.data.get('ir')
    if not ir_code:
        return

    # Playback control logic
    if ir_code == 'code1':
        run('music.mpd.play')
    elif ir_code == 'code2':
        run('music.mpd.pause')
    elif ir_code == 'code3':
        run('music.mpd.stop')
    elif ir_code == 'code5':
        run('music.mpd.previous')
    elif ir_code == 'code6':
        run('music.mpd.next')
    # ...
    # Multi-room setup logic
    elif ir_code == 'code7':
        # Un-mute the stream on another host
        run('music.snapcast.mute', host=Config.get('device_id'),
            client='some-client', mute=False)
    elif ir_code == 'code8':
        # Mute the stream on another host
        run('music.snapcast.mute', host=Config.get('device_id'),
            client='some-client', mute=True)
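As the number of mapped buttons grows, the if/elif chain above can also be written as a lookup table, which is easier to extend and keeps the hook itself short. A sketch using the same placeholder codes ('code1', 'code2', ... stand for the hex values you noted down):

```python
# Map IR codes (placeholders) to Platypush actions and their arguments.
IR_ACTIONS = {
    'code1': ('music.mpd.play', {}),
    'code2': ('music.mpd.pause', {}),
    'code3': ('music.mpd.stop', {}),
    'code5': ('music.mpd.previous', {}),
    'code6': ('music.mpd.next', {}),
}

def action_for(ir_code):
    """Return the (action, args) pair for an IR code, or None if unmapped."""
    return IR_ACTIONS.get(ir_code)

# Inside the event hook you would then do something like:
#   entry = action_for(ir_code)
#   if entry:
#       run(entry[0], **entry[1])
```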
Congratulations, you’ve just built your own customizable and universal music remote!
Voice assistant integration
A smart music setup isn’t really complete without a voice assistant integration. I’ve covered in a previous article how to set up platypush to turn your device into a full-featured Google Assistant. If you’ve managed to get your assistant up and running, you can add some rules to control your music, play specific content, or synchronize your audio stream to another room. Let’s see a couple of examples:
from platypush.config import Config
from platypush.event.hook import hook
from platypush.utils import run
from platypush.message.event.assistant import SpeechRecognizedEvent

@hook(SpeechRecognizedEvent, phrase='play (the)? music')
def on_music_play(*args, **context):
    run('music.mpd.play')

@hook(SpeechRecognizedEvent, phrase='stop (the)? music')
def on_music_pause(*args, **context):
    run('music.mpd.stop')

@hook(SpeechRecognizedEvent, phrase='play (the)? radio')
def on_play_radio(*args, **context):
    run('music.mpd.play', resource='tunein:station:s13606')

@hook(SpeechRecognizedEvent, phrase='play playlist ${name}')
def on_play_playlist(event, name=None, **context):
    run('music.mpd.load', resource=name)

@hook(SpeechRecognizedEvent, phrase='play ${title} by ${artist}')
def search_and_play_song(event, title=None, artist=None, **context):
    results = run('music.mpd.search', artist=artist, title=title)
    if results:
        run('music.mpd.play', resource=results[0]['file'])

@hook(SpeechRecognizedEvent, phrase='play (the)? music to (the)? bedroom')
def sync_music_to_bedroom(event, **context):
    run('music.snapcast.mute', host=Config.get('device_id'), client='bedroom', mute=False)
    run('music.snapcast.volume', host=Config.get('device_id'), client='bedroom', volume=90)
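To give an idea of how the ${...} phrase templates bind to hook arguments, here's a deliberately simplified re-implementation of the matching step - illustrative only, Platypush's actual matcher is more sophisticated:

```python
import re

# Matches '${name}'-style placeholders in a phrase template.
_PLACEHOLDER = re.compile(r'\$\{(\w+)\}')

def match_phrase(template, text):
    """Match text against a template, returning captured params or None.

    Each ${name} placeholder becomes a named capture group; the rest
    of the template is matched literally (case-insensitively).
    """
    pattern = ''
    pos = 0
    for m in _PLACEHOLDER.finditer(template):
        pattern += re.escape(template[pos:m.start()])
        pattern += '(?P<%s>.+)' % m.group(1)
        pos = m.end()
    pattern += re.escape(template[pos:])

    match = re.fullmatch(pattern, text, re.IGNORECASE)
    return match.groupdict() if match else None

params = match_phrase('play ${title} by ${artist}',
                      'play Echoes of a Dream by 3rd Force')
# params == {'title': 'Echoes of a Dream', 'artist': '3rd Force'}
```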
Conclusions
The current situation when it comes to music streaming and multi-room setups in a home automation environment is still extremely fragmented. Each commercial solution out there seems more interested in building its own walled garden, and a proper multi-room setup usually comes at a high cost and, in most cases, won't be compatible with your existing speakers. With the ingredients provided in this article you should be able to work around most of these limitations and:
Set up your multi-service music player controllable by any interface you like
Set up your multi-room configuration that makes it possible to add a new room by simply adding one more RaspberryPi
Use any existing infrared remote to control the music
Integrate custom music actions into a voice assistant
With these foundations in place, the only limit to what you can do with your new music setup is your own imagination!