OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied


A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a quarter of the questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next-best model managed to answer only around 2% of FrontierMath problems correctly.

“All offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.

Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

That does not mean OpenAI lied. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath were not informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.




