By Gao Ge
Ming Li, left, demonstrates one of the systems developed by researchers and students at Duke Kunshan’s SMIIP Lab
A speech recognition system designed by researchers at Duke Kunshan defeated strong competition to win two awards at Interspeech 2020, the world’s largest conference on speech processing and its applications.
The conference attracted more than 1,950 participants, including researchers from Microsoft, Amazon, Carnegie Mellon University and Oxford University, and accepted 1,022 research papers on speech signal processing, spoken language processing and other fields.
Scientists, analysts and students presented and discussed their latest breakthroughs in spoken language processing, while research teams also took part in challenges across nine technical areas.
A team from Duke Kunshan’s Speech and Multimodal Intelligent Information Processing (SMIIP) Lab led by Ming Li, associate professor of electrical and computer engineering, excelled in two tasks in the second phase of the Fearless Steps Challenge.
The first phase of the challenge, at Interspeech 2019, led to the digitization, recovery and diarization of 19,000 hours of original analog audio data, and the development of algorithms to extract meaningful information from this multichannel naturalistic data resource.
For phase two, which focused on the development of single-channel supervised learning strategies, the team from Duke Kunshan proposed a system integrating a residual network (ResNet) with long short-term memory (LSTM). Compared with the LSTM network used as the baseline, the SMIIP Lab’s system more effectively captured continuous temporal information in speech at the back end, reducing the minimum detection cost to 62 percent of the baseline figure.
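For readers curious about the architecture, the sketch below shows, in PyTorch, how a ResNet front end can feed an LSTM back end for frame-level speech activity detection. It is a minimal illustration of the general design, not the SMIIP Lab’s published system; the layer sizes, feature dimensions and class names are all assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One residual block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: the "residual" in ResNet

class ResNetLSTMDetector(nn.Module):
    """Hypothetical ResNet front end + LSTM back end for speech activity detection."""
    def __init__(self, n_mels=64, channels=32, lstm_hidden=128):
        super().__init__()
        # Front end: convolutional layers learn local time-frequency patterns.
        self.front = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            ResidualBlock(channels),
            ResidualBlock(channels),
        )
        # Back end: a bidirectional LSTM models long-range temporal context.
        self.lstm = nn.LSTM(channels * n_mels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * lstm_hidden, 1)  # per-frame speech/non-speech score

    def forward(self, feats):
        # feats: (batch, time, n_mels) log-mel spectrogram features
        x = feats.unsqueeze(1)                          # (batch, 1, time, n_mels)
        x = self.front(x)                               # (batch, channels, time, n_mels)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # one vector per frame
        x, _ = self.lstm(x)                             # (batch, time, 2 * lstm_hidden)
        return torch.sigmoid(self.head(x)).squeeze(-1)  # (batch, time) probabilities

model = ResNetLSTMDetector()
scores = model(torch.randn(2, 300, 64))  # two utterances, 300 frames each
print(scores.shape)                      # torch.Size([2, 300])
```

The residual blocks capture local time-frequency structure, while the recurrent back end supplies the long-range temporal context the article credits for capturing continuous information in speech.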
The system took first place in the speaker identification task and third place in the speech activity detection task, beating teams from Germany’s Paderborn University and mobile phone company Vivo, among others.
Speech processing encompasses several areas. Analyzing a piece of audio requires not only the ability to convert speech into text but also technology capable of distinguishing between different speakers and separating human voices from background noise.
Interspeech, organized annually by the International Speech Communication Association, is one of the most important conferences in the field of computer speech signal processing. Initially planned to take place in Shanghai from Oct. 25 to 29, the 2020 conference moved entirely online due to Covid-19.
‘Planning for Interspeech 2020 was particularly challenging due to the Covid-19 pandemic,’ said Li. ‘However, thanks to the excellent work of the organizing committee, we were able to meet online with global researchers on speech sciences and technology, enjoying a great opportunity to share our findings.’
Sharing research, resources
This year’s conference accepted two research papers from Duke Kunshan. One, on speaker diarization and produced in collaboration with Sun Yat-sen University, proposed two enhanced methods based on the self-attention mechanism for speaker embedding extraction. The other, on source separation, proposed an attention-based neural network operating in the spectrogram domain for target speaker separation, mimicking the ‘cocktail party effect’: the human ability to focus on a single talker while filtering out all other stimuli.
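The self-attention idea behind the speaker embedding paper can be illustrated with self-attentive pooling, a widely used technique in which frame-level encoder outputs are weighted by learned attention scores before being pooled into a single utterance-level embedding. The sketch below is a generic illustration under assumed dimensions, not the published method.

```python
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Pool frame-level features into one utterance-level speaker embedding
    using learned attention weights (dimensions here are assumptions)."""
    def __init__(self, feat_dim=256, attn_dim=128):
        super().__init__()
        self.w = nn.Linear(feat_dim, attn_dim)       # projects each frame
        self.v = nn.Linear(attn_dim, 1, bias=False)  # scores each frame

    def forward(self, frames):
        # frames: (batch, time, feat_dim) outputs of a frame-level encoder
        scores = self.v(torch.tanh(self.w(frames)))  # (batch, time, 1)
        weights = torch.softmax(scores, dim=1)       # attention over frames
        return (weights * frames).sum(dim=1)         # (batch, feat_dim) embedding

pool = SelfAttentivePooling()
embedding = pool(torch.randn(4, 200, 256))  # four utterances, 200 frames each
print(embedding.shape)                      # torch.Size([4, 256])
```

Frames that attention judges more speaker-discriminative contribute more to the final embedding than a plain average would allow, which is the general motivation for attention-based embedding extraction.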
Li was also part of the committee that developed the research tasks and baseline system for the Far-Field Speaker Verification Challenge. Far-field verification refers to the ability to interact with a machine using natural human language from a distance of one to 10 meters. The technology has many applications, including smart home appliances, meeting transcription and onboard navigation. Improving performance remains an ongoing challenge, with the main complications being distant voice pickup, multiple simultaneous speakers and background noise.
A research team at Duke Kunshan led by Li is working on far-field speaker verification technologies based on deep learning. In 2019, the university partnered with Montage Technology, a Kunshan-based provider of chip-based solutions for cloud computing and artificial intelligence, to jointly research embedded system validation, field-programmable gate array (FPGA) architecture design and fixed-point solutions built on the lab’s high-performance far-field algorithms. The research findings were published at Interspeech 2020.
In addition to sharing research, 10 undergraduate students from Duke Kunshan volunteered at this year’s conference. Among them were Xuchen Gong ’22 and Huangrui Chu ’22, who reviewed videos, helped speakers rehearse and provided other support.
‘I am very pleased to see that in addition to our research teams, Duke Kunshan undergraduates also participated in this prominent international conference,’ Li said. ‘I look forward to more students showing their best work at leading academic events.’