O1 Benchmarks - Search News

19h

Testing The Limits: Three Ways AI Benchmarks Are Evolving

When it comes to real-world evaluation, appropriate benchmarks need to be carefully selected to match the context of AI ...

This new AI benchmark measures how much models lie

Researchers behind the MASK benchmark found that more knowledge doesn't mean more 'moral virtue.' See which model lies the ...

8d on MSN

Chatbots Are Cheating on Their Benchmark Tests

These are important questions, and they’re nearly impossible to answer because the tests that measure AI progress are not ...

10d on MSN

People are using Super Mario to benchmark AI now

Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.

What does “PhD-level” AI mean? OpenAI’s rumored $20,000 agent plan explained.

The AI industry has a new buzzword: "PhD-level AI." According to a report from The Information, OpenAI may be planning to ...

13d

“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its ...

19d on MSN

Did xAI lie about Grok 3’s benchmarks?

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also ...

Analytics Insight 3d

AI Showdown: Alibaba’s QWQ-32B vs. DeepSeek R1 vs. O1 Mini

Alibaba’s QWQ-32B is a 32-billion-parameter AI designed for mathematical reasoning and coding. Unlike massive models, it ...

Yahoo Finance 19d

Did xAI lie about Grok 3's benchmarks?

Debates over AI benchmarks — and how they're reported ... Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok ...

Yahoo Finance 10d

People are using Super Mario to benchmark AI now

Thought Pokémon was a tough benchmark for AI ... Interestingly, the lab found that reasoning models like OpenAI's o1, which "think" through problems step by step to arrive at solutions, performed ...
