Voice Recognition Challenge - Ditto Transcripts vs Microsoft

Voice recognition has been one of the “next big things” in tech again and again over the last 30 years. Each time it’s mentioned, we hear that it’s getting more accurate, faster, able to block background noise, and other big claims that aim to make us give it yet another chance. And still it has not come close to replacing human transcriptionists — especially if you require 99%+ accuracy in a timely manner without having to make extensive edits yourself.

The largest voice recognition company in the United States has 7,000 employees in India, who are employed to edit their voice recognition software’s output 24 hours a day, 7 days a week. For software they claim is so good, what does this tell you? If you’re in doubt, check out the press release above. What are they doing with all those overseas folks?

However, if you ask around, consumers are still not convinced that the technology is reliable. In fact, despite yearly claims of improved reliability, a recent survey from J.D. Power showed that consumers’ frustration with the product is still quite high: “voice recognition is still the number one problem that we see,” says Renee Stephens, J.D. Power’s vice president of U.S. automotive quality.

Recently, we’ve seen the biggest claim yet in voice recognition by Microsoft. Regarding improvements to their technology, they went as far as to proclaim that their recent voice recognition A.I. has progressed to the point that it is better and more accurate than humans.

Who can forget this epic Microsoft voice recognition failure at Dreamforce just last year?

You can view the hilarious two-minute clip here.

We simply don’t buy it! Where’s the actual proof voice recognition has caught up to human transcription?

The key thing to point out is that a claim is all it is. They don’t back it up with any specific information, examples, or videos proving it; instead, they simply say they fine-tuned a pre-existing A.I. that has been known to be faulty. And if a document needs to be verbatim, meaning it includes speech elements like ums and ahs, things get even harder for computers.

Regardless of technological advancements, humans will always be more reliable when it comes to transcribing audio than a machine. Language conveys emotions and implications impossible to understand without a lifetime of lived experience.

There are many factors that lead to this. Let’s start with the basic one: every person has a somewhat unique way of speaking, whether it’s cadence, slang, or the rate at which they talk. This can be problematic for machines, as they can’t always sort out slang. Add to this basic inaudibility: some people don’t talk as clearly, and that leads to errors in itself, before even factoring in accents or slang.

One of the main transcription barriers is something a system can’t reliably recognize, yet everyone says it at least a dozen times a day: the words “uh”, “um”, or “ah”. These filler words can come up at any moment in a conversation, interview, or research session. These three fillers alone can cause huge problems, and not being able to distinguish between them is a major red flag for endorsing A.I. transcription.

“I think all of this guarantees that a perfect speech recognizer that just listens in like a human will not be achieved in a reasonable time. You and I will probably not see that,” says Gerald Friedland, leader of the diarization project at the International Computer Science Institute (ICSI). Speech recognition has been researched and experimented with since the ’80s, but it still isn’t as accurate by itself as one would hope for a tech solution that’s been worked on for so long.

Humans understand context and can use it to fill in gaps when there’s background noise or an unclear word in the audio file, whereas machines rely only on voice recognition to determine what the word or phrase should be. Humans intuitively understand when and what people are clarifying in their speech, when a pause means someone is trying to think, and when a pause means someone is ‘erasing’ something they’ve just said.

Technology versus humans in the workplace is a battle that has been foreseen and written about for many years. Being aware of that, we must take into account that there is always a chance that technology can malfunction or get tampered with.

An error in a medical transcription can cause the wrong diagnosis to be made. Errors made in legal transcription can cause evidence to be misinterpreted, which could contribute to an unfavorable outcome. Mistakes in law enforcement transcription could lead to the wrong person being arrested or a criminal being let go by mistake.

When accuracy is a must, a trained human transcriber is always a safer bet. Transcription takes a lot of effort to learn. Professionals in the industry are exactly that, trained: they are experts in the field with knowledge of the complex words and context of the trade they have specialized in.

Even global tech giants have to substantiate their claims, right?

Well, we’re here with actual data that can back up our claims of speed and accuracy from years of servicing clients.

We’re wondering if Microsoft might like to take us on in a head-to-head challenge: their machines versus our humans. Come on, Bill and Satya, what do you have to lose?

Ben Walker
Founder and CEO
Ditto Transcripts
