Anthropic aims to fix one of the biggest problems in AI right now

Hot on the heels of the announcement that its Claude 3.5 Sonnet large language model beat out other leading models, including GPT-4o and Llama-400B, AI startup Anthropic announced Monday that it plans to launch a new program to fund the development of independent, third-party benchmark tests against which to evaluate its upcoming models.

The company says it will pay third-party developers to create benchmarks that can “effectively measure advanced capabilities in AI models.”

“Our investment in these evaluations is intended to elevate the entire field of AI safety, providing valuable tools that benefit the whole ecosystem,” Anthropic wrote in a Monday blog post. “Developing high-quality, safety-relevant evaluations remains challenging, and the demand is outpacing the supply.”

The company wants submitted benchmarks to help measure the relative “safety level” of an AI based on a number of factors, including how well it resists attempts to coerce responses that pose risks in cybersecurity; chemical, biological, radiological, and nuclear (CBRN) threats; misalignment; social manipulation; and other areas of national security. Anthropic is also looking for benchmarks to help evaluate models’ advanced capabilities and is willing to fund the “development of tens of thousands of new evaluation questions and end-to-end tasks that would challenge even graduate students.” Such tests would essentially measure a model’s ability to synthesize knowledge from a variety of sources, to refuse cleverly worded malicious user requests, and to respond in multiple languages.

Anthropic is looking for “sufficiently difficult,” high-volume tasks that can involve as many as “thousands” of testers across a diverse set of test formats and that help inform the company’s “realistic and safety-relevant” threat modeling efforts. Interested developers are welcome to submit their proposals to the company, which plans to evaluate them on a rolling basis.

Andrew Tarantola
Former Computing Writer
Andrew Tarantola is a journalist with more than a decade reporting on emerging technologies ranging from robotics and machine…