Increasing Accuracy of Auto-Captions for Virtual Meetings

by Sheila Serup, MBA


With the increasing use of virtual meetings, live auto-captioning enables accessibility for participants with hearing loss.

However, the captions displayed on video-conference platforms are still not always correct.

To better understand why errors occur in translating speech to auto-captions, I turned to an expert in the field of Linguistics.

Dr. Ben Tucker, with the University of Alberta’s Linguistics department, observes that speech recognition technology still struggles with transcribing voices to text.

“When the input voice is highly accented, the system will have a hard time. It also struggles with background noise.”

He notes that “co-articulation is possibly part of the problem but is likely only a very small part as most modern speech recognition systems have been trained to deal with it.”

Co-articulation, as defined by the Cambridge Dictionary, occurs when the pronunciation of a sound in a word is affected by the sounds before and after it. (For example, in the words “can” and “ham,” the vowel ‘a’ comes before the nasal consonants /n/ and /m/, so it is pronounced with a nasal quality that it would not have in a word like “cat.”)

“There are lots of other factors that will play a role as well, and there are simple solutions such as getting a microphone close to the speaker’s mouth and reducing background noise,” Dr. Tucker says.

He recommends using headphones to reduce noise coming from the environment and looking directly into the camera when talking. Any documents or prompts being referred to onscreen should be positioned near the camera, so the person speaking keeps looking into the camera.

“Users can slow it down to help listeners,” notes Dr. Tucker. “Reading lips and reading captions – the task is very difficult as the user is multitasking.”

Dr. Tucker, who is also a Mercator Fellow in Quantitative Linguistics at the University of Tübingen in Germany, notes there is speaker-dependent software such as Dragon NaturallySpeaking.

“These are trained to work on your voice.” Once familiar with a speaker’s voice, the software will transcribe that speaker’s speech to captions.

“The problem is that it will only work for your voice.” It will create accurate captions of your speech for other participants. There are also transcription services that integrate with Zoom to provide transcripts during and after the meeting.

Before joining a video call, Dr. Tucker suggests spending a few minutes adjusting your position and equipment for an optimal presentation. Practising with headphones and microphones also enhances everyone’s audio-visual experience.

Small tips such as these will increase the accuracy of speaker-independent auto-captioning in video-conference meetings. As current trends indicate, virtual meetings are becoming the norm for all of us.