How many ways are there to describe the image that accompanies this text? Do you think you'd be able to do it in a way that I could appreciate it?
–
Conversational interfaces come in two main formats: unimodal, which use a single communication channel (voice only or text only), and multimodal, which combine both. Each has its own strengths and adapts differently to the context surrounding the user.
Text-only interfaces are discreet and allow for easy information review. This is key for user privacy. In a public environment, like a bus or shared office, interacting through text lets the user keep their conversation private, without exposing the content of their queries or the system's responses to third parties. However, information input is usually slower because it requires typing. Additionally, for complex tasks or those involving large amounts of text, the cognitive load can be high.
On the other hand, voice-only interfaces offer speed and the possibility of hands-free operation. But they come with important limitations in certain scenarios. Spoken information is inherently transitory, which makes it difficult to review and retain, especially when dealing with complex or extensive data. Imagine a friend telling you the details of their latest trip; the oral story is pleasant, but if they want to show you the beauty of a landscape or the detail of a building, it's natural for them to show you a photograph. The image complements and enriches the story. Moreover, the picture a word paints in the listener's mind may differ from what the speaker actually means. Similarly, the voice interface, although fast, often hinders deep comprehension and later reference. That's why teachers rely on slides and graphics when explaining a lesson. Additionally, these interfaces are susceptible to automatic speech recognition (ASR) errors and are affected by environmental noise, so they may struggle with different speakers, accents, or noisy environments. Finally, they're not suitable for people with hearing or speech disabilities, and they offer little discretion, which can be a significant inconvenience in public places or situations requiring confidentiality. User context is fundamental here: speaking aloud in a crowded space may be inappropriate or impossible, forcing the user to seek alternatives.
This is where multimodal voice-text-image interfaces demonstrate their true potential. They offer flexibility by allowing users to choose the most convenient modality according to the task or context. They reduce cognitive load by distributing information between auditory and visual processing. This is crucial for complex tasks, where the user can rely on text to review details while using voice for quick commands. They also improve accuracy by allowing voice and text to reinforce each other, helping to disambiguate commands or queries. Additionally, they offer greater expressive and contextual richness, combining the tone and emotion of voice with the visual and formatting capabilities of text. Finally, they significantly expand accessibility, as they cater to a broader range of needs by allowing the combination or choice of modes. This multimodal flexibility allows the user to navigate through different scenarios of their daily life, adapting the interaction to environmental conditions and their own privacy preferences.
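To make the idea concrete, the adaptation described above can be sketched as a small rule set. This is a minimal illustration, not a real framework: the function name, the context flags, and the rules themselves are hypothetical choices for this example.

```python
def choose_modalities(task_complexity: str, is_public: bool, hands_free: bool,
                      hearing_impaired: bool = False) -> set:
    """Pick output channels for one interaction turn.

    Hypothetical rule set illustrating how a multimodal assistant
    might adapt output to the user's context; not a real API.
    """
    modalities = set()

    # Privacy and accessibility: public spaces and hearing
    # impairments both rule out audible output.
    if is_public or hearing_impaired:
        modalities.add("text")
    else:
        # Voice is fast and keeps the user's hands free.
        modalities.add("voice")

    # Complex tasks benefit from persistent, reviewable output.
    if task_complexity == "high":
        modalities.add("text")
        modalities.add("image")  # e.g. a chart or photo supporting the answer

    return modalities
```

For example, a complex query asked on a crowded bus (`choose_modalities("high", is_public=True, hands_free=False)`) would yield text plus an image and suppress voice, while a simple hands-free request at home would come back by voice. The point is not this particular rule set but that the interface, rather than the user, absorbs the work of adapting to context.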
–
It's not voice or text. It's not transcribing what someone tells you, nor having text read aloud to you. It's integrating both channels, each contributing something different to improve the experience. Just like a teacher showing a slide to their students, or a friend showing a photo from their latest trip.
–
Follow me on LinkedIn to stay updated on new posts.
–
