AIs are quite good at working in chat interfaces — that is, in responding to requests in text, on a screen.
But they are not great at holding conversations in normal life.
If you connect them to a voice interface, you see that they don’t talk the way you would expect a person to. So they also don’t talk the way you would want from a robot that is supposed to be easy to interact with in normal life.
The most obvious problem is that they talk like textbooks, saying too much at once. They also miss the normal cues that indicate whether a person is pausing or done talking, and they do not always handle interruptions smoothly. Developers of voice AI systems are actively working on these problems. Pipecat, for instance, uses a custom end-of-turn model to detect when a human has finished speaking.
But beyond the problems of voice AI, there are additional problems that appear only with an embodied voice AI, one that is expected to function in the social environment of the physical world. These are the problems of social performance and awareness of the social environment.
^1: You might find such topics addressed analytically, rather than as engineering problems to be solved, in linguistics, psychology, sociology, and theater arts, under headings like conversation analysis, interactional linguistics, discourse analysis, linguistic pragmatics, and voice and body training.
But that’s just one of many, many problems. Some problems appear immediately, as soon as you move to a voice-based interface; these problems of voice AI are pretty well explored.
But it doesn’t need to be this way. To make a truly conversational AI system, a system which supports social performance, I think we only need to make some design and implementation progress in three broad areas. Here is how I understand them.
Also, as they speak, they do not seek acknowledgment that the listener understands what they’ve said.
Social perceptual layer
Traditional robotics work focuses on perceiving the physical environment, but to navigate the social environment you need a different set of perceptual primitives. To notice what they are, consider all the faculties you employ to navigate a party where you know some people, meet new people, and maintain a conversation.
You need:
voice detection, voice recognition, and diarization
In other words, at a party, you notice voices easily (voice detection). If you hear a friend’s voice, you recognize who it is (voice recognition). In a conversation with many people speaking, you easily discriminate who said what (diarization).
organic voice enrollment
When you are introduced to a new person, you do not ask them to repeat a test phrase three times. Rather, you learn their voice immediately (organic voice enrollment) when they introduce themselves.
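As a minimal sketch of what organic enrollment might look like in code, assume some upstream model (not shown, and not specified in this post) turns each utterance into a fixed-length voice embedding. The `VoiceRegistry` class, its method names, and the 0.8 threshold are all hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoiceRegistry:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical; depends on the embedding model
        self.known = {}             # name -> reference embedding

    def identify(self, embedding):
        """Return the best-matching known speaker, or None if nobody is close enough."""
        best_name, best_sim = None, self.threshold
        for name, ref in self.known.items():
            sim = cosine(embedding, ref)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        return best_name

    def hear_introduction(self, name, embedding):
        """Enroll a speaker from the single utterance in which they introduced themselves."""
        if self.identify(embedding) is None:
            self.known[name] = list(embedding)
```

The point of the design is that enrollment happens as a side effect of an ordinary introduction, with no separate test-phrase step.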
voice isolation
You can do all this pretty easily, even if there are a lot of other people talking, even if there is music on, even if there is a television playing in the background (voice isolation).
Part of why we do voice isolation and diarization so well is that we rely on having two ears to discriminate speakers based on the physical location of sound. AIs should have multiple microphones so that they can do the same thing.
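To make the two-ears idea concrete, here is a rough sketch of estimating a sound's direction from the arrival-time difference between two microphones. It uses brute-force cross-correlation and a far-field approximation; the function names, the microphone spacing, and the sample rate in the usage note are illustrative assumptions, and a real system would need to cope with noise and reverberation.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate, in air at room temperature


def best_lag(left, right, max_lag):
    """Return the lag (in samples) at which `right` best matches `left`.

    A positive lag means the sound reached the right microphone later.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):
                score += sample * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(lag_samples, sample_rate_hz, mic_spacing_m):
    """Convert a lag to an angle: 0 is straight ahead, positive is toward the left mic."""
    delay_s = lag_samples / sample_rate_hz
    # Clamp so rounding error cannot push asin() outside its domain.
    ratio = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND_M_S / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

For example, a lag of 3 samples at 16 kHz with microphones 0.2 m apart works out to a source a little under 19 degrees off center, on the left microphone's side.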
emotional valence
Finally, it should be easy to detect whether a given utterance is louder or quieter than usual, or emotionally animated.
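One simple way to approximate "louder or quieter than usual" is to compare each utterance's level against a running baseline for that speaker. The sketch below does only that; the `LoudnessTracker` name, the history length, and the 1.5 ratio are assumptions for illustration, and detecting emotional animation would need much more than loudness.

```python
import math
from collections import deque


def rms(samples):
    """Root-mean-square level of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class LoudnessTracker:
    """Labels each utterance relative to one speaker's recent levels."""

    def __init__(self, history=20, ratio=1.5):
        self.levels = deque(maxlen=history)  # recent RMS levels
        self.ratio = ratio                   # hypothetical threshold; needs tuning

    def label(self, samples):
        level = rms(samples)
        if not self.levels:
            verdict = "usual"  # nothing to compare against yet
        else:
            baseline = sum(self.levels) / len(self.levels)
            if level > baseline * self.ratio:
                verdict = "louder"
            elif level < baseline / self.ratio:
                verdict = "quieter"
            else:
                verdict = "usual"
        self.levels.append(level)
        return verdict
```

The labels could then be attached to the transcript as affect markers.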
Again, the key is being able to focus easily on a particular voice even amid many background voices, music, or a television. We do this quite naturally.
- exclusion of background noise, music, and TV;
- face detection and face recognition (with organic enrollment);
- head pose detection, gaze detection, direction of audio, echo cancellation, and emotional affect detection.
Ultimately this system amounts to:
- input:
  - multiple incoming streams of information, generated from dedicated perception models
- output:
  - stream of text (with diarization and affect markers)
  - data structure summarizing the current social situation
    - who is looking at whom (especially at the agent)
  - subscribable events marking notable changes in the current social situation
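The outputs listed above could be sketched roughly as follows. This is only one possible shape, under my own assumptions: the class names, the field names, and the event names (`person_appeared`, `gaze_changed`) are hypothetical, and the perception models that would feed `update_gaze` are not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Utterance:
    speaker: str                  # label assigned by diarization
    text: str
    affect: Optional[str] = None  # e.g. "louder", "animated"


@dataclass
class SocialSituation:
    present: List[str] = field(default_factory=list)
    gaze: Dict[str, str] = field(default_factory=dict)  # who -> whom they look at

    def looking_at(self, target: str) -> List[str]:
        return sorted(p for p, t in self.gaze.items() if t == target)


class SocialPerception:
    def __init__(self):
        self.situation = SocialSituation()
        self._subscribers: List[Callable[[str, dict], None]] = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def _emit(self, name, payload):
        for callback in self._subscribers:
            callback(name, payload)

    def update_gaze(self, person, target):
        previous = self.situation.gaze.get(person)
        if person not in self.situation.present:
            self.situation.present.append(person)
            self._emit("person_appeared", {"person": person})
        self.situation.gaze[person] = target
        if previous != target:  # only notable changes become events
            self._emit("gaze_changed", {"person": person, "target": target})
```

A conversational layer could poll `situation.looking_at("agent")` when deciding whether it is being addressed, and subscribe to events rather than re-reading the whole state on every frame.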
AI conversation
Yeah, there’s a lot going on here. I see 3 big problems I have not seen solved elsewhere:
- Actually prompting/tuning models for good conversational performance, once they have that information.
- tracking current topic or topics
- not request -> response
- listen -> interrupt
- listen -> stay quiet
- As a matter of implementation, understanding the right patterns for coordinating multiple models with different latency/cost properties, to realize that performance with good latency and strong capability where needed.
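As one illustration of the coordination problem, here is a small sketch in which a fast, cheap model covers for a slower, more capable one when the slow model misses a latency budget. Both models are stand-in coroutines with made-up delays, and `respond`, `say`, and `budget_s` are hypothetical names; this shows one pattern among many, not a recommendation.

```python
import asyncio


async def fast_model(utterance):
    # Stand-in for a small, low-latency model (hypothetical).
    await asyncio.sleep(0.01)
    return "Mm-hm."


async def slow_model(utterance):
    # Stand-in for a larger, slower, more capable model (hypothetical).
    await asyncio.sleep(0.2)
    return f"Here is a fuller reply to: {utterance}"


async def respond(utterance, say, budget_s=0.05):
    """Start both models; use the fast one only if the slow one misses the budget."""
    slow = asyncio.create_task(slow_model(utterance))
    fast = asyncio.create_task(fast_model(utterance))
    done, _ = await asyncio.wait({slow}, timeout=budget_s)
    if slow in done:
        fast.cancel()      # the capable model was quick enough on its own
    else:
        say(await fast)    # fill the gap with a short acknowledgment
    say(await slow)
```

Calling `asyncio.run(respond("hello", print))` with these stand-in delays speaks the short acknowledgment first and the fuller reply afterward.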
Evals should include context defining SOCIAL ROLE. Not everything is a request.
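A sketch of what such an eval case might look like, assuming the behavior under test is reduced to a choice among three actions. The `EvalCase` fields, the action names, and the `always_respond` baseline are all hypothetical and only meant to show how social role becomes part of the test input.

```python
from dataclasses import dataclass
from typing import Callable, List

ACTIONS = {"respond", "stay_quiet", "interrupt"}


@dataclass
class EvalCase:
    social_role: str       # e.g. "bystander at a dinner table"
    transcript: List[str]  # diarized lines, "speaker: text"
    expected_action: str   # one of ACTIONS


def score(policy: Callable[[str, List[str]], str], cases: List[EvalCase]) -> float:
    """Fraction of cases where the policy picks the expected action."""
    for case in cases:
        if case.expected_action not in ACTIONS:
            raise ValueError(f"unknown action: {case.expected_action}")
    hits = sum(policy(c.social_role, c.transcript) == c.expected_action for c in cases)
    return hits / len(cases)


def always_respond(role, transcript):
    # Baseline that treats every utterance as a request.
    return "respond"
```

A request-response baseline like `always_respond` does well only on cases where the agent is directly addressed, which is exactly the gap such evals would expose.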
Will a megamodel do all of this?
End-to-end (e2e) models are good at what they are trained on.
Is there training data which encompasses all of the behaviors associated with the modalities above?