IIIT Hyderabad develops StethoSpeech to give voice to the speech-impaired

Update: 2024-10-21 07:47 GMT
IIIT Hyderabad (Image credit: www.iiit.ac.in)
Hyderabad: Researchers at IIIT Hyderabad have developed a novel machine learning (ML) model that enables individuals with speech impairments to generate intelligible speech. Leveraging the minimalist design of a wireless stethoscope, the innovation converts behind-the-ear vibrations—inaudible whispers—into spoken output.

The research team, led by Neil Shah, a TCS researcher and PhD student at the Centre for Visual Information Technology (CVIT), along with Neha Sahipjohn and Vishal Tambrahalli, was supervised by Dr. Ramanathan Subramanian and Prof. Vineet Gandhi. Their work is published under the title "StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin."

StethoSpeech is built on a silent speech interface (SSI) that translates non-audible whispers into vocal output. SSI refers to communication methods that capture speech-related cues without producing sound. “Lip reading is one of the most basic SSI techniques,” explained Neil Shah. Other approaches, such as Ultrasound Tongue Imaging and Electromagnetic Articulography, are highly invasive and fail to operate in real-time scenarios.

In contrast, the IIIT Hyderabad team’s solution relies on a stethoscope placed behind the ear, which captures subtle vibrations known as Non-Audible Murmurs (NAM). Prof. Gandhi explained that the team curated NAM datasets in various environments, ranging from everyday office noise to loud concerts. Participants murmured text while the resulting vibrations were recorded. This data was then used to train the ML model to produce clear speech output.
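
In practice, this pairing of murmur recordings with target speech makes training a supervised task. The short Python sketch below shows how such paired examples might be organised; the directory layout, metadata fields, and the assumption that each murmur is paired with a target speech recording are illustrative, not the team's published setup.

    from dataclasses import dataclass
    from pathlib import Path

    @dataclass
    class NAMExample:
        nam_path: Path      # behind-the-ear murmur recording (model input)
        speech_path: Path   # target speech for the same sentence (pairing is an assumption)
        noise_setting: str  # recording environment, e.g. "quiet_office" or "concert"
        text: str           # the sentence that was murmured

    def load_pairs(root: Path) -> list[NAMExample]:
        """Pair each murmur recording with its target speech and metadata."""
        pairs = []
        for row in (root / "metadata.tsv").read_text().splitlines():
            utt_id, noise, text = row.split("\t")
            pairs.append(NAMExample(
                nam_path=root / "nam" / f"{utt_id}.wav",
                speech_path=root / "speech" / f"{utt_id}.wav",
                noise_setting=noise,
                text=text,
            ))
        return pairs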

The stethoscope transmits NAM data to a mobile phone via Bluetooth, where it is converted into vocalized speech in less than 0.3 seconds, even while the user is moving. “Our model works in a ‘zero-shot’ setting, meaning it can generate speech even for individuals whose data was not used during training,” said Neil Shah.
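
Under the hood, this behaves like a streaming pipeline: short buffers of murmur audio arrive over Bluetooth, pass through the model, and the synthesized speech is played back within a fixed latency budget. The sketch below illustrates that loop; the murmur_to_speech function is a hypothetical stand-in for the trained model, not the actual implementation.

    import time

    LATENCY_BUDGET_S = 0.3   # latency figure quoted by the researchers

    def murmur_to_speech(chunk: bytes) -> bytes:
        """Hypothetical stand-in for the trained model; returns synthesized audio."""
        return chunk

    def stream(chunks):
        """Convert incoming murmur buffers (e.g. received over Bluetooth) to speech."""
        for chunk in chunks:
            start = time.perf_counter()
            audio = murmur_to_speech(chunk)
            elapsed = time.perf_counter() - start
            if elapsed > LATENCY_BUDGET_S:
                print(f"warning: chunk took {elapsed:.3f}s, over the 0.3 s budget")
            yield audio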

The system also offers customization options, allowing users to select accents—such as English with a South Indian inflection—and the gender of the voice. With just four hours of murmuring data, a personalized speech model can be created for any individual.
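
Conceptually, such customization amounts to conditioning the speech generator on a few control inputs. The sketch below shows what that conditioning interface could look like; the accent and gender options and the VoiceProfile type are illustrative assumptions, not a published API.

    from dataclasses import dataclass

    ACCENTS = {"south_indian_english", "neutral_english"}   # illustrative options
    GENDERS = {"female", "male"}

    @dataclass(frozen=True)
    class VoiceProfile:
        """User-chosen voice settings passed to the generator as conditioning."""
        accent: str
        gender: str

        def __post_init__(self):
            if self.accent not in ACCENTS or self.gender not in GENDERS:
                raise ValueError("unsupported accent or gender option")

    def synthesize(nam_audio: bytes, profile: VoiceProfile) -> bytes:
        """Generate speech from murmur audio, conditioned on the chosen profile."""
        # A real system would map the profile to accent and speaker embeddings
        # fed to the generator alongside the murmur features.
        return nam_audio   # placeholder output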

Unlike conventional ML algorithms that convert text directly into speech, the IIIT Hyderabad team developed a speech-to-speech system to better mimic natural human learning. “Humans learn to speak by interacting with sounds, not text,” said Prof. Gandhi, explaining that their model maps the murmured audio directly to speech rather than routing it through an intermediate text representation.
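
The contrast can be made concrete by comparing the two pipeline shapes: a cascaded system transcribes the murmur to text and then runs text-to-speech, while a direct system maps murmur audio straight to speech. The placeholder functions below only illustrate that structural difference and do not reflect the team's code.

    # Placeholder stand-ins so the sketch runs; each would be a learned model.
    def recognize(audio: bytes) -> str:
        return "transcribed text"          # ASR stand-in

    def text_to_speech(text: str) -> bytes:
        return text.encode()               # TTS stand-in

    def murmur_to_speech(audio: bytes) -> bytes:
        return audio                       # direct NAM-to-speech stand-in

    def cascaded_pipeline(nam_audio: bytes) -> bytes:
        """Murmur -> text -> speech: the conventional, text-mediated route."""
        return text_to_speech(recognize(nam_audio))

    def direct_pipeline(nam_audio: bytes) -> bytes:
        """Murmur -> speech: the direct speech-to-speech route, with no text step."""
        return murmur_to_speech(nam_audio)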

The potential applications of this technology are vast. In high-noise environments, such as concerts, where normal speech is difficult to understand, the system could facilitate communication. It also opens doors for discreet messaging, such as in security operations.

“Our work departs from earlier studies that assumed clean speech data was needed to train models,” added Prof. Gandhi. “Since speech-impaired individuals do not produce standard speech, we focused on building a model that can generate high-quality output without relying on clean speech samples.”

With StethoSpeech, the research team has not only provided a solution for the speech-impaired but also demonstrated the potential of ML-driven speech generation across various real-world scenarios.

