I’ve taken some time off over the Christmas break and I’m now dedicating a whole work day per week to this project. Today has been somewhat slow but also good for the progress of the project, at least for someone who has never built any sort of speech bot before. As I said in my first blog post – Blog 1 – I am using these posts as a partial reflection on the progress I am making towards creating Frankenstein 2.0.
Today was the day to really tackle the interactive side of the bot: both the ‘input’ (Speech to Text) and the ‘output’ (Text to Speech), which I’ve now got *very almost* working together before finishing at past 7pm. The main issue has been getting them to talk to one another, and it doesn’t help that my Python is somewhat rusty – so I’ll tackle that properly when I’m not so tired. So here’s an example of the bot speaking back earlier today:
Text to Speech
This is what I tackled first today, and I decided to use the pyttsx3 Python package to do so, as it is relatively good and doesn’t require access to some sort of Google-esque API. It does mean that things sound rather robotic, but this is editable if I wished to really go for it. In this case, I have chosen one of the default ‘voices’ that comes with it – the en-scottish voice – which is incidentally one of the clearest, and almost strangely fitting given where this will be shown: in Dundee, Scotland.
As it is open-source and crucially works offline, I think this is a big plus for a project that seeks to avoid cloud-based services! It was relatively simple to get going after some minor tweaks here and there.
Speech to Text
This caused me rather more problems, which I eventually got over after much playing around with my ‘environment’. I have decided to go for PocketSphinx, an open-source toolkit for speech recognition. This is going to make my life so much easier – it isn’t the best yet, but there are possibilities for tweaking it (as long as you have access to data) – crucially not spoken-word data, but data based on common connections in language – another distinction and discussion for some other time. I’ve not had a lot of time to work out what’s going on in the background here, as I was more interested in getting something working.
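For the curious, the listening side can be sketched like this – using the pocketsphinx package’s `LiveSpeech` in keyphrase mode with its bundled US-English model. The trigger word and threshold below are placeholder assumptions to tune, not my final values:

```python
def listen(trigger="frankenstein"):
    """Stream microphone audio through PocketSphinx in keyphrase mode,
    yielding the decoded text each time the trigger word is heard.
    The trigger word and kws_threshold are assumptions to tune."""
    from pocketsphinx import LiveSpeech  # imported here – needs a mic + model

    speech = LiveSpeech(keyphrase=trigger, kws_threshold=1e-20)
    for phrase in speech:  # blocks, producing one hit per detection
        yield str(phrase)  # the hypothesis text containing the trigger
```

Looping over `listen()` then gives you a stream of detections to react to; lowering `kws_threshold` makes detection stricter, raising it makes it more trigger-happy.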
However, and crucially, this means that I can now identify trigger words (at least I’ve tested this) for Frankenstein to listen out for and then respond to. As ‘Frankenstein’ is a distinctive word (along with some others I’ve used), this is a good start… just getting the bot to respond and interact (without making sense when responding to questions! That’s a different day’s work!).
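The trigger-word check itself needs nothing fancy – a minimal sketch, where the trigger set here is purely illustrative (I’m not listing my actual words):

```python
# Hypothetical trigger set for illustration – swap in whatever
# distinctive words the bot should actually listen out for.
TRIGGERS = {"frankenstein", "monster"}

def find_trigger(hypothesis, triggers=TRIGGERS):
    """Return the first trigger word present in the recognised text,
    or None if nothing matched. Case and trailing punctuation are ignored."""
    for word in hypothesis.lower().split():
        cleaned = word.strip(".,!?")
        if cleaned in triggers:
            return cleaned
    return None
```

So a hypothesis like “hello Frankenstein” matches, while ordinary chatter falls through to None and the bot stays quiet.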
So overall, it doesn’t sound like all that much – but I feel like I’ve nearly cracked the basic ‘interaction’ side of the bot – and that makes me feel a lot better. I may edit this later, as I can barely see the screen anymore, so it’s not had a great edit!