
There is a reason why typing 'Why is speech to text' in Google's search bar results in recommendations such as 'Why is speech to text not working' or 'Why is speech to text so bad'.

The reason is pretty straightforward: no matter how sophisticated the artificial intelligence, recognizing human speech is tricky. A few years ago, for example, Google launched a highly anticipated new pair of wireless earbuds, complete with a real-time translation service that, in theory, could recognize speech in one language, translate the words into the destination language on the user's phone, and then read out the new sentence. It quickly became obvious that the technology was struggling to recognize speakers' words if they attempted to submit complicated sentences, or if they had an accent.

"There are many different challenges when it comes to language," says Liang. "Spoken language has tremendous amounts of variation. There are so many different accents, even within a single country like the US, and at the same time a lot of words have a similar pronunciation. And then new words are being invented every day, as well as acronyms, company names and other new terminology."

Another issue is noise: the loud AC in Liang's conference room makes it harder for the algorithm to accurately pick up on his words during the call, broken as they are by the sound of fans spinning. Dodgy internet connections also mean speakers' voices can cut off, fade away, or break up – all of which can get in the way of the technology's accuracy.

A mix of long-trained, deep-learning models and big data explains Otter.ai's encouraging capabilities, argues Liang. By considering the context of an entire sentence, rather than working on a word-by-word basis, the AI can make more accurate decisions. The algorithm is capable of considering the sentence as a whole and predicting what the correct output might be, based on previous datasets of speech.
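Otter.ai's actual models aren't described in detail here, but the general idea of scoring a whole sentence against previously seen language, rather than committing to each word in isolation, can be sketched in a few lines of Python. Everything below (the tiny corpus, the two candidate transcripts and their acoustic scores) is invented purely for illustration; a production system would rely on large neural models trained on far bigger speech datasets.

```python
from collections import Counter, defaultdict
from math import log

# A tiny stand-in for the "previous datasets of speech" a real system learns from.
corpus = [
    "it is hard to recognize speech",
    "it is hard to recognize words",
    "speech is hard to recognize",
]

# Count word-pair (bigram) frequencies, with <s> marking the start of a sentence.
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, word in zip(words, words[1:]):
        bigrams[prev][word] += 1

def sentence_score(sentence):
    """Log-probability of a whole sentence under the bigram model,
    with add-one smoothing so unseen word pairs are penalised, not impossible."""
    words = ["<s>"] + sentence.split()
    vocab = {w for counter in bigrams.values() for w in counter} | set(words)
    score = 0.0
    for prev, word in zip(words, words[1:]):
        counts = bigrams[prev]
        score += log((counts[word] + 1) / (sum(counts.values()) + len(vocab)))
    return score

# Two candidate transcripts that sound nearly identical; the made-up acoustic
# log-scores alone slightly favour the wrong one.
candidates = {
    "it is hard to recognize speech": -4.1,
    "it is hard to wreck a nice beach": -4.0,
}

# Word-by-word decoding keeps whichever candidate the acoustics prefer.
best_acoustic = max(candidates, key=candidates.get)

# Sentence-level decoding also asks how plausible the whole sentence is.
best_contextual = max(candidates, key=lambda s: candidates[s] + sentence_score(s))

print("acoustics only:", best_acoustic)    # -> "it is hard to wreck a nice beach"
print("with context: ", best_contextual)   # -> "it is hard to recognize speech"
```

Running the sketch shows the acoustics-only decoder picking the implausible sentence, while the whole-sentence score flips the decision, which is the effect Liang describes.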

Similar methods have sparked the interest of the industry's biggest players, with IBM now offering a cloud-based, highly accurate speech-to-text platform as part of Watson's services, while Amazon Transcribe offers an API for automatic speech recognition.
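For readers who want to experiment with one of those services, Amazon Transcribe can be driven from Python through the boto3 SDK. The sketch below assumes valid AWS credentials are configured; the job name and the S3 location of the audio file are placeholders. It submits an audio file and polls until the transcript is ready.

```python
import time

import boto3

transcribe = boto3.client("transcribe")  # uses your configured AWS credentials and region

job_name = "meeting-demo-job"  # placeholder job name
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": "s3://example-bucket/meeting.mp3"},  # placeholder audio location
    MediaFormat="mp3",
    LanguageCode="en-US",
)

# Transcription runs asynchronously, so poll until the job succeeds or fails.
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(10)

if status == "COMPLETED":
    # The completed job points at a JSON file containing the full transcript.
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
else:
    print("Transcription failed:", job["TranscriptionJob"].get("FailureReason"))
```

The returned URI points to a JSON document holding the full transcript along with word-level timestamps and confidence scores.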
