While voice user interface agents (VUIs) like Siri and Alexa are now commonplace, their designers are still striving to make them sound more natural and conversational. But what does ‘natural’ mean for a human-agent conversation? A group of researchers from UBC Computer Science have investigated what designers mean by ‘naturalness’ and whether VUIs will one day be indistinguishable from a human voice.
The research was conducted by lead author and UBC Computer Science alumna Yelim Kim; co-author Dr. Dongwook Yoon, an assistant professor and member of the UBC Language Sciences initiative; and co-author Dr. Joanna McGrenere, also a professor at UBC Computer Science. A former master's student, Mohi Reza, also contributed to the paper. They will present their paper, ‘Designers Characterize Naturalness in Voice User Interfaces: Their Goals, Practices, and Challenges’, in May 2021 at the ACM Conference on Human Factors in Computing Systems (CHI).
In a recent interview, Yelim explained how VUIs could be better designed to talk with humans in a more natural dialogue. “In our study, we interviewed VUI designers, and found twelve distinct ways in which they characterize ‘naturalness’. We then classified those into three categories we identified as: Core, Social, and Transactional,” Yelim said. “Some of these elements include human-like aspects, such as conveying appropriate patterns of stress, pauses or intonations that have meaning. We refer to those as Core. Designers also want agents to converse with users in a socially appropriate manner,” she said. “We call these design elements Social. One example might be using a serious tone of voice when delivering negative news, like 'traffic is bad'.”
Yelim went on to describe what they deemed ‘Transactional’ elements in design. These aspects help a user get what they need done, including being proactive by leading a conversation about a given task, and providing helpful suggestions to the user.
Yelim and Dongwook also discovered that designers identify ‘beyond-human’ characteristics, such as completing tasks or accessing information much more quickly than a human could.
The challenges in mimicking human dialogue
Dr. Dongwook Yoon said, “The primary goal of task-oriented applications is to help users with their tasks efficiently. However, our study revealed seven major challenges in designing ‘naturalness’. When designers wanted to add characteristics of social conversation, such as expressing sympathy and maintaining an intriguing persona in order to enhance naturalness, the downside is that dialogues get longer and conflict with being efficient.”
Another major challenge was making the agent’s voice more expressive than a monotonous 'robotic-sounding' voice. The current tool for achieving expressivity, Speech Synthesis Markup Language (SSML), has limited support for changing how a voice agent sounds, and it is overly time-consuming to use. Based on the study findings, the researchers concluded there is a need for more detailed design guidelines and innovative language tool support to solve these challenges.
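For readers unfamiliar with the markup, here is a minimal sketch of what hand-authoring expressivity can look like, written in Python. It is an illustration only, not taken from the study: the function name `build_ssml` and the sample sentences are hypothetical, and using a slower rate and lower pitch to stand in for a 'serious' tone is an assumption. The `speak`, `break` and `prosody` elements are part of the W3C SSML standard, but individual speech services may require extra attributes on `speak` or support only a subset of values.

```python
import xml.etree.ElementTree as ET


def build_ssml(lead_in: str, news: str) -> str:
    """Return an SSML string that reads `news` slowly and at a lower pitch.

    Hypothetical helper for illustration; not from the paper.
    """
    speak = ET.Element("speak")
    speak.text = lead_in + " "
    # Insert a short pause before the negative news.
    ET.SubElement(speak, "break", {"time": "300ms"})
    # Assumption: slower, lower delivery approximates a 'serious' tone.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
    prosody.text = news
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Here is your commute update.", "Traffic is bad.")
print(ssml)
```

Even this one-sentence case needs several tags chosen by hand, which gives a sense of why designers in the study found the approach time-consuming.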
Some other major challenges that were identified in the study include:
- Writing for spoken language is difficult, with written text often sounding less natural or containing too much information for a spoken conversation
- Handling varied or unexpected user inputs, as well as conversational context, is difficult
- Existing VUI guidelines lack concrete recommendations on how to design for ‘naturalness’
Human or robot: can you hear the difference?
“At the 2018 Google I/O conference, Google showcased its voice assistant, ‘Duplex’, by having it call a hair salon and successfully make an appointment,” Yelim said. “It was a demo, but because Duplex talked to staff so naturally, it was almost indistinguishable from a human voice. Consequently, some people expressed their concerns through popular media about the potential risk of this new technology.”
Yelim believes there will be a day when it's hard to distinguish between voice agents and humans. “Voice agents require less cognitive load and are relatively easy to use, so they are very helpful for multi-tasking and for people who find it hard to learn new technologies. However, of course there are also risks including the potential for abuse and deception.”
Natural conversation includes many contextual elements (e.g., social, cultural), so to converse naturally, voice assistants will inevitably collect varied information from people, including emotional states inferred from factors such as voice pitch.
Yelim emphasizes the importance of privacy: “Being transparent about what information gets collected through VUIs and how it is managed is very important. Users should have control over the information they share.”