AI companies are prioritizing flash over substance, says Surge AI's CEO.
"I'm worried that instead of building AI that can really advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we're optimizing for AI slop instead," Edwin Chen said in an episode of "Lenny's" podcast published on Sunday.
"We're basically teaching our models to chase dopamine instead of truth," he added.
Chen founded the AI training startup Surge in 2020 after working at Twitter, Google, and Meta. Surge runs the gig platform Data Annotation, which says it pays a million freelancers to train AI models. Surge competes with data-labeling startups like Scale AI and Mercor and counts Anthropic as a customer.
On Sunday's podcast, Chen said that companies are prioritizing AI slop because of industry leaderboards.
"Right now, the industry is driven by these horrible leaderboards like LMArena," he said, referring to a popular online leaderboard where people can vote on which AI response is better.
"They're not carefully reading or fact-checking," he said. "They're skimming these responses for two seconds and picking whatever looks flashiest."
He added: "It's really optimizing your models for the types of people who buy tabloids at the grocery store."
Still, the Surge CEO said that AI labs have to pay attention to these leaderboards because they can be asked about their rankings during sales meetings.
Like Chen, research scientists have criticized benchmarks for overvaluing superficial traits.
In a March blog post, Dean Valentine, the cofounder and CEO of AI security startup ZeroPath, said that "Recent AI model progress feels mostly like bullshit."
Valentine said that he and his team had been evaluating the performance of various models claiming to have "some kind of improvement" since the release of Anthropic's 3.5 Sonnet in June 2024. None of the new models his team tried had made a "significant difference" in his company's internal benchmarks or in developers' abilities to find new bugs, he said.
They may have been "more fun to talk to," but they were "not reflective of economic usefulness or generality."
In a February paper titled "Can we trust AI Benchmarks?" researchers at the European Commission's Joint Research Centre concluded that major issues exist in today's evaluation approach.
The researchers said benchmarking is "fundamentally shaped by cultural, commercial and competitive dynamics that often prioritize state-of-the-art performance at the expense of broader societal concerns."
Companies have also come under fire for "gaming" these benchmarks.
In April, Meta released two new models in its Llama family that it said delivered "better results" than comparably sized models from Google and French AI lab Mistral. It then faced accusations that it had gamed a benchmark.
LMArena said that Meta "should have made it clearer" that it had submitted a version of Llama 4 Maverick that had been "customized" to perform better for its testing format.
"Meta's interpretation of our policy did not match what we expect from model providers," LMArena said in an X post.
