I am going to be doing a series of blog posts as part of a reflection on a project I am working on with Hayden Lorimer and Pip Thornton, both at the University of Edinburgh. This is a Human Data Interaction-funded project (see more information here), “Spoken Word Data: Artistic Interventions in the Everyday Spaces of Digital Capitalism.”
I am involved in the development of “Frankenstein 2.0”: I will be developing a ‘chatbot’ that speaks from limited texts by authors ranging from Audre Lorde, Karl Marx and Hannah Arendt to J.K. Rowling and more. Here, I am going to start detailing how I’ve been thinking about approaching this, and also offer insights into the critique I’m developing alongside the wider project. This chatbot, as you can likely infer from the project title, deals with ‘spoken data’ – so it will run through the ‘hardware’ shown in the tweet above, in order to begin “the task of monstrous assembly”.
My main motivation is to see whether it is possible to design and implement (with my somewhat limited knowledge and skill) a speaking chatbot that can respond to questions using only the insights from a text by the author concerned. Some of my more technically-minded friends and colleagues will point out that this will produce terrible responses – as there is so little training data from which to develop a sense of meaning and of how to respond. However, my critique, as such, is to question whether someone with moderate computational literacy has the capacity to create something that is not tied to the ‘monopoly’ providers – or, as McKenzie Wark might say, to not engage in the ‘vectorialist’ politics of mass data surveillance and manipulation.
This means I will seek to do this, as far as possible, with ‘open’ datasets and tools (and not ‘open’ in the sense that they benefit, or are provided by, the large tech giants of today – such as IBM, Microsoft, Amazon, Google and so on). I’m already finding it a major challenge to avoid participating in such practices while keeping a project like this possible for one person to do.
Below, I have sketched out in my notepad what I think I could do, and for the rest of this blog I will detail some of the challenges I think I’ll face and some very open questions for Frankenstein 2.0. I’m also starting off with a book on Natural Language Processing and Python (I’ve assumed that Python is the way to go with this, due to the large number of tools available and my familiarity with the language).
So, this may seem like a bit of a start but I’m now going to detail how I think the different pieces might fit together:
The project has bought a Sennheiser EPOS SP 20 ML conference device that allows for both microphone and speaker capability. I’m unsure what the firmware looks like on this. I may explore further, I may not. It is “Designed and engineered in Denmark – Made in China 中国制造”. I don’t see any problems with this (see more here).
I have a speaker and I can connect to audio from it fairly well. I have even practiced some speech recognition (purely for testing) through Google’s API via Python’s SpeechRecognition package. This comes from the official repository for Python – PyPI – which isn’t owned by the big corporations, though AWS, Google and others sponsor it. I think that is enough of a separation for now – but it shows how integrated they are – and I will keep using it for sanity’s sake. I’m thinking of using some of the ‘offline’ models available – such as Snowboy hotword detection (but even that is owned by one of the big corporations – Baidu, I believe). This is going to take some thinking. Otherwise I’m going to have to train my own model – which could really restrict the interaction. More thinking required.
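As a concrete sketch of what that testing looks like, the listening step with SpeechRecognition might go something like the below. The `recognize_google` call is the cloud-backed route I would eventually like to replace with an offline model, and the wake-word helper is a hypothetical addition of my own, not part of the library.

```python
# Sketch of the listening step: capture one utterance, transcribe it, and
# check for a wake word before handing the text on. Assumes the
# SpeechRecognition and PyAudio packages are installed; the Google
# recogniser sends audio off-device, which is exactly what I want to avoid.
try:
    import speech_recognition as sr
except ImportError:
    sr = None  # lets the pure-Python helper below be used on its own

def is_wake_word(transcript: str, hotword: str = "frankenstein") -> bool:
    """Hypothetical helper: does the transcript contain our wake word?"""
    return hotword in transcript.lower().split()

def listen_once() -> str:
    """Record one utterance from the microphone and return its transcript."""
    recogniser = sr.Recognizer()
    with sr.Microphone() as source:
        recogniser.adjust_for_ambient_noise(source)  # calibrate to the room
        audio = recogniser.listen(source)
    return recogniser.recognize_google(audio)  # cloud call, to be replaced

if __name__ == "__main__":
    heard = listen_once()
    if is_wake_word(heard):
        print("Wake word heard:", heard)
```

Swapping `recognize_google` for an offline recogniser would keep the rest of this structure intact, which is partly why I am happy to use it while testing.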
Pushing it all back through
I’m very worried that this process will take an age to pass through the different routines – so a lot of my time will be spent thinking about how to make it work ‘fast enough’ without too much computation being required. We have got used to sending this data off to be processed somewhere else – but in this project it must stay local – which introduces so many challenges in a world where ubiquitous sharing is now deemed essential and isolation is bad. There are some Python packages that can speak a ‘voice’ back, so you can listen to the response – you can use the OS defaults – which in Ubuntu, I think, means a choice of ‘male’ or ‘female’…
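One candidate for that voice-back step is pyttsx3, which wraps the OS speech engine (eSpeak on Ubuntu) and so stays entirely local. A minimal sketch, with a small voice-picking helper of my own (the `preferred` matching logic is an assumption, not anything the library provides):

```python
# Sketch of local text-to-speech via pyttsx3, which drives the OS engine
# (eSpeak on Ubuntu) so nothing leaves the machine. The voice-picking
# helper is mine; it works on any objects carrying .id and .name fields.
try:
    import pyttsx3
except ImportError:
    pyttsx3 = None  # keeps the pure helper usable without the package

def pick_voice(voices, preferred: str):
    """Return the id of the first voice whose name contains `preferred`,
    falling back to the first voice on offer (the OS default)."""
    for voice in voices:
        if preferred.lower() in voice.name.lower():
            return voice.id
    return voices[0].id if voices else None

def speak(text: str, preferred: str = "english") -> None:
    engine = pyttsx3.init()
    engine.setProperty("voice", pick_voice(engine.getProperty("voices"), preferred))
    engine.say(text)
    engine.runAndWait()  # blocks until the utterance has been spoken
```

The blocking `runAndWait` call is part of my speed worry: recognition, intent, response and speech all run in sequence, so every stage has to be ‘fast enough’.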
I have created a ‘Frankenstein’ virtual machine to experiment and play with – to do this I am using Ubuntu, and I am using Emacs to write everything. I have started playing around with some Python (it’s a fresh install, so I have downloaded essentials such as ‘pip’, plus packages like PyAudio and SpeechRecognition to make sure things are working). However, here is where we start encountering interesting questions…
Some APIs provide intent support – so that they can extract what the person asking the chatbot means by their question. This is something I am going to have to build myself, unfortunately – and that’s where the book will come in super handy. It’s the reason I bought it! Once I know what people are asking – and what they mean – I can then delve into the ‘other side’ of the project.
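Lacking a ready-made intent API, a first pass could be as crude as keyword matching over a hand-written table of intents. The intent names and keywords below are placeholders of my own invention; the NLP book should help me graduate to something statistical later.

```python
# Bare-bones intent extraction: score each hand-labelled intent by how
# many of its keywords appear in the question, and return the best match.
# The intent names and keyword sets are illustrative placeholders only.
INTENTS = {
    "ask_opinion": {"think", "believe", "opinion", "feel"},
    "ask_definition": {"what", "mean", "meaning", "define"},
    "ask_biography": {"who", "life", "born", "lived"},
}

def extract_intent(question: str) -> str:
    words = set(question.lower().replace("?", "").split())
    scores = {name: len(words & keywords) for name, keywords in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

For example, `extract_intent("What does alienation mean?")` hits two of the ‘definition’ keywords, while anything with no keyword hits falls through to `"unknown"` – which is roughly the point at which the real NLP work would begin.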
Creating an Author Voice
This is trickier than it first sounds – and there are many different ways of doing it. I’ve not wholly decided which one I’m going for yet – I think some natural language processing (similar to what Google did here some years back) alongside some deep neural networks. Then again, it may turn out I go for something much simpler. But the aim is to give each author a synthetic voice and let them speak, in response to a question. This is actually the trickiest bit of the project, I think, as it means I need to get my author to respond to a meaning by meaning something itself. Any help here would be a godsend. I know it’s possible – I just can’t commit a year to doing it! But remember we have such limited data (often a single book) that the result may be nonsensical (I hope not, but it may be).
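The ‘something much simpler’ might be a retrieval baseline: split the author’s one book into sentences, then answer a question with whichever sentence shares the most words with it, weighting rarer words more heavily. This is my own sketch of that fallback, not the neural approach, shown here with a toy two-sentence ‘corpus’:

```python
# Retrieval baseline for an "author voice": answer a question with the
# sentence from the author's text that overlaps it most, weighting shared
# words by rarity (a rough inverse-document-frequency) so that common
# words like "the" don't dominate. Plain Python, standard library only.
import re
from collections import Counter

def tokenise(text: str) -> list:
    return re.findall(r"[a-z']+", text.lower())

def build_index(corpus_text: str):
    """Split the text into sentences and count, for each word, how many
    sentences it appears in."""
    sentences = re.split(r"(?<=[.!?])\s+", corpus_text.strip())
    doc_freq = Counter()
    for s in sentences:
        doc_freq.update(set(tokenise(s)))
    return sentences, doc_freq

def respond(question: str, sentences, doc_freq) -> str:
    """Return the sentence with the highest rarity-weighted word overlap."""
    q_words = set(tokenise(question))
    def score(sentence):
        overlap = q_words & set(tokenise(sentence))
        return sum(1.0 / doc_freq[w] for w in overlap)
    return max(sentences, key=score)
```

On a single book this will often parrot rather than ‘mean’, which is exactly the worry above – but it gives the authors something to say while I work out whether the deep-learning route is feasible on so little data.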
Some of you may think I’m crazy to try to do this – and perhaps I am – but I think it is a good experiment to work through. I’ll be working on one ‘manageable’ task at a time, because otherwise it will feel never-ending. I’m going to be using this blog as a place of reflection and a draft of what’s working / what’s not. I think I’ll start with the interaction with the computer and speech recognition, so that I get a sense that there is a system working.
As Hayden said in the tweet above, in more-than-human ways, myself, computation, others come together…
“…and so from these humble and unassuming parts they began the task of monstrous assembly…”