Sunday, September 24, 2023

Why Alexa or Google Home Don't Understand What You Say


When Meghan Cruz says “Hey, Alexa,” her Amazon smart speaker bursts to life, offering the kind of helpful response she now expects from her automated assistant.

With a few words in her breezy West Coast accent, the lab technician in Vancouver gets Alexa to tell her the weather in Berlin (70 degrees), the world’s most poisonous animal (a geography cone snail) and the square root of 128, which it offers to the ninth decimal place.

But when Andrea Moncada, a college student and fellow Vancouver resident who was raised in Colombia, says the same in her light Spanish accent, Alexa offers only a virtual shrug. She asks it to add a few numbers, and Alexa says sorry. She tells Alexa to turn the music off; instead, the volume turns up.

“People will tell me, ‘Your accent is good,’ but it couldn’t understand anything,” she said.

Amazon’s Alexa and Google’s Assistant are spearheading a voice-activated revolution, rapidly changing the way millions of people around the world learn new things and plan their lives.

But for people with accents – even the regional lilts, dialects and drawls native to various parts of the United States – the artificially intelligent speakers can seem very different: inattentive, unresponsive, even isolating. For many across the country, the wave of the future has a bias problem, and it’s leaving them behind.

The Washington Post teamed up with two research groups to study the smart speakers’ accent imbalance, testing thousands of voice commands dictated by more than 100 people across nearly 20 cities. The systems, they found, showed notable disparities in how people from different parts of the US are understood.

People with Southern accents, for instance, were 3 percent less likely to get accurate responses from a Google Home device than those with Western accents. And Alexa understood Midwest accents 2 percent less than those from along the East Coast.

People with non-native accents, however, faced the biggest setbacks. In one study that compared what Alexa thought it heard with what the test group actually said, speech from that group showed about 30 percent more inaccuracies.

People who spoke Spanish as a first language, for instance, were understood 6 percent less often than people who grew up around California or Washington, where the tech giants are based.

“These systems are going to work best for white, highly educated, upper-middle-class Americans, probably from the West Coast, because that’s the group that’s had access to the technology from the very beginning,” said Rachael Tatman, a data scientist who has studied speech recognition and was not involved in the research.

At first, all accents are new and strange to voice-activated AI, including the accent some Americans think is no accent at all – the predominantly white, non-immigrant, non-regional dialect of TV newscasters, which linguists call “broadcast English.”

The AI is taught to comprehend different accents, though, by processing data from lots and lots of voices, learning their patterns and forming clear bonds between phrases, words and sounds.

To learn different ways of speaking, the AI needs a diverse range of voices – and experts say it’s not getting them because too many of the people training, testing and working with the systems all sound the same. That means accents that are less common or prestigious end up more likely to be misunderstood, met with silence or the dreaded, “Sorry, I didn’t get that.”

Tatman, who works at the data-science company Kaggle but said she was not speaking on the company’s behalf, said, “I worry we’re getting into a position where these tools are just more useful for some people than others.”

Company officials said the findings, while informal and limited, highlighted how accents remain one of their key challenges – both in keeping today’s users happy and in expanding the companies’ reach around the globe. The companies said they are devoting resources to training and testing the systems on new languages and accents, including creating games to encourage more speech from voices in different dialects.

“The more we hear voices that follow certain speech patterns or have certain accents, the easier we find it to understand them. For Alexa, this is no different,” Amazon said in a statement. “As more people speak to Alexa, and with various accents, Alexa’s understanding will improve.” (Amazon chief executive Jeff Bezos owns The Washington Post.)

Google said it “is recognised as a world leader” in natural language processing and other forms of voice AI. “We’ll continue to improve speech recognition for the Google Assistant as we expand our datasets,” the company said in a statement.

The researchers did not test other voice platforms, like Apple’s Siri or Microsoft’s Cortana, which have far lower at-home adoption rates. The smart-speaker business in the United States has been dominated by an Amazon-Google duopoly: Their closest rival, Apple’s $349 (roughly Rs. 24,000) HomePod, controls about 1 percent of the market.

Nearly 100 million smart speakers will have been sold around the world by the end of the year, the market-research firm Canalys said. Alexa now speaks English, German, Japanese and, as of last month, French; Google’s Assistant speaks all those plus Italian and is on track to speak more than 30 languages by the end of the year.

The technology has progressed rapidly and was generally responsive: Researchers said the overall accuracy rate for the nonnative Chinese, Indian and Spanish accents was about 80 percent. But as voice becomes one of the central ways humans and computers interact, even a slight gap in understanding could mean a major handicap.

That language divide could present a huge and hidden barrier to the systems that may one day form the bedrock of modern life. Now run-of-the-mill in kitchens and living rooms, the speakers are increasingly being used for relaying information, controlling devices and completing tasks in workplaces, schools, banks, hotels and hospitals.

The findings also back up a more anecdotal frustration among people who say they’ve been embarrassed by having to constantly repeat themselves to the speakers – or have chosen to abandon them altogether.

“When you’re in a social situation, you’re more reticent to use it because you think, ‘This thing isn’t going to understand me and people are going to make fun of me, or they’ll think I don’t speak that well,’ ” said Yago Doson, a 33-year-old marine biologist in California who grew up in Barcelona and has spoken English for 13 years.

Doson said some of his friends do everything with their speakers, but he has resisted buying one because he’s had too many bad experiences. He added, “You feel like, ‘I’m never going to be able to do the same thing as this other person is doing, and it’s only because I have an accent.’ ”

Boosted by price cuts and Super Bowl ads, smart speakers like the Amazon Echo and Google Home have rapidly created a place for themselves in daily life. One in five US households with Wi-Fi now have a smart speaker, up from one in 10 last year, the media-measurement firm ComScore said.

The companies offer ways for people to calibrate the systems to their voices. But many speaker owners have still taken to YouTube to share their battles in conversation. In one viral video, an older Alexa user pining for a Scottish folk song was instead played the Black Eyed Peas.

Matt Mitchell, a comedy writer in Birmingham, Alabama, whose sketch about a drawling “southern Alexa” has been viewed more than 1 million times, said he was inspired by his own daily tussles with the futuristic device.

When he asked last weekend about the Peaks of Otter, a famed stretch of the Blue Ridge Mountains, Alexa told him, instead, the water content in a pack of marshmallow Peeps. “It was surprisingly more than I thought,” he said with a laugh. “I learned two things instead of just one.”

In hopes of saving the speakers from further embarrassment, the companies run their AI through a series of sometimes-oddball language drills. Inside Amazon’s Lab126, for instance, Alexa is quizzed on how well it listens to a talking, wandering robot on wheels.

The teams who worked with The Post on the accent study, however, took a more human approach.

Globalme, a language-localization firm in Vancouver, asked testers across the United States and Canada to say 70 preset commands, including “Start playing Queen,” “Add new appointment,” and “How close am I to the nearest Walmart?”

The company grouped the video-recorded talks by accent, based on where the testers had grown up or spent most of their lives, and then assessed the devices’ responses for accuracy. The testers also offered other impressions: People with nonnative accents, for instance, told Globalme that they thought the devices had to “think” for longer before responding to their requests.

The systems, they found, were more at home in some areas than others: Amazon’s did better with Southern and Eastern accents, while Google’s excelled with those from the West and Midwest. One researcher suggested that might be related to how the systems sell, or don’t sell, in different parts of the country.

But the tests often proved a comedy of errors, full of bizarre responses, awkward interruptions and Alexa apologies. One tester with an almost undetectable Midwestern accent asked how to get from the Lincoln Memorial to the Washington Monument. Alexa told her, in a resoundingly chipper tone, that $1 is worth 71 pence.

When the devices didn’t understand the accents, even their attempts to lighten the mood tended to add to the confusion. When one tester with a Spanish accent said, “Okay, Google, what’s new?” the device responded, “What’s that? Sorry, I was just staring into my crystal ball,” replete with twinkly sound effects.

A second study, by the voice-testing startup Pulse Labs, asked people to read three different Post headlines – about President Donald Trump, China and the Winter Olympics – and then examined the raw data of what Alexa thought the people said.

The difference between those two strings of words, measured with a data-science metric known as “Levenshtein distance,” was about 30 percent greater for people with nonnative accents than for native speakers, the researchers found.

People with nearly imperceptible accents, in the computerised mind of Alexa, often sounded like gobbledygook, with words like “bulldozed” coming across as “boulders” or “burritos.”

When a speaker with a British accent read one headline – “Trump bulldozed Fox News host, showing again why he likes phone interviews” – Alexa dreamed up a more imaginative story: “Trump bull diced a Fox News heist showing again why he likes pain and beads.”
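For readers curious how such a comparison works, here is a minimal Python sketch of Levenshtein distance applied to the headline above. It is an illustration only, not Pulse Labs' actual method: the article does not say whether the researchers counted edits word by word or character by character, so the word-level comparison, the lowercasing and punctuation stripping, and the helper names `levenshtein` and `words` are all assumptions.

```python
import string


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    # prev[j] holds the distance between the items of a seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def words(text):
    """Assumed normalization: lowercase, drop ASCII punctuation, split on spaces."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


said = "Trump bulldozed Fox News host, showing again why he likes phone interviews"
heard = "Trump bull diced a Fox News heist showing again why he likes pain and beads"

distance = levenshtein(words(said), words(heard))
print(distance)  # 7 word-level edits against a 12-word headline
```

Under these assumptions, seven words have to be inserted or swapped to turn the headline into Alexa's version. Dividing the distance by the length of the original headline gives a rate, commonly called word error rate, that can be compared across groups of speakers.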

Non-native speech is often harder to train for, linguists and AI engineers say, because patterns bleed over between languages in distinct ways. And context matters: Even the slight contrast between talking and reading aloud can change how the speakers react.

But the findings support other research that shows how a lack of diverse voice data can end up inadvertently contributing to discrimination. Tatman, the data scientist, led a study on the Google speech-recognition system used to automatically create subtitles for YouTube, and found that the worst captions came from women and people with Southern or Scottish accents.

It is not solely an American struggle. Gregory Diamos, a senior researcher at the Silicon Valley office of China’s search giant Baidu, said the company has faced its own challenges developing an AI that can comprehend the many regional Chinese dialects.

Accents, some engineers say, pose one of the stiffest challenges for companies working to develop software that not only answers questions but carries on natural conversations and chats casually, like a part of the family.

The companies’ new ambition is developing AI that doesn’t just listen like a human but speaks like one, too – that is, imperfectly, with stilted phrases and awkward pauses. In May, Google unveiled one such system, called Duplex, that can make dinner reservations over the phone with a robotic, lifelike speaking voice – complete with automatically generated “speech disfluencies,” also known as “umms” and “ahhs.”

Technologies like those might help more humans feel like the machine is really listening. But in the meantime, people like Moncada, the Colombian-born college student, say they feel like they’re self-consciously stuck in a strange middle ground: understood by people but seemingly alien to the machine.

“I’m a little sad about it,” she said. “The device can do a lot of things. . . . It just can’t understand me.”

© The Washington Post 2018
