A high schooler built a website that lets you challenge AI models to a Minecraft build-off

As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that's Minecraft, the Microsoft-owned sandbox-building game.

The website Minecraft Benchmark (or MC-Bench) pits AI models against each other in head-to-head challenges to respond to prompts with Minecraft creations. Users can vote on which model did a better job, and only after voting can they see which AI made each Minecraft build.

Image credits: Minecraft Benchmark
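To make the head-to-head voting concrete, here is a minimal sketch of how blind pairwise votes could be turned into a ranking. It assumes a plain win-rate tally; the article does not say how MC-Bench actually scores models, and the `leaderboard` function and the model names are hypothetical.

```python
from collections import defaultdict


def leaderboard(votes):
    """Rank models by win rate from (winner, loser) vote pairs.

    Assumption: a simple win-rate tally, not MC-Bench's real scoring method.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    # Sort by win rate, highest first; ties are broken by model name.
    return sorted(
        ((model, wins[model] / games[model]) for model in games),
        key=lambda row: (-row[1], row[0]),
    )


# Hypothetical votes: each tuple records which build the user preferred.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for model, rate in leaderboard(votes):
    print(f"{model}: {rate:.2f}")
```

Model names are revealed only after a vote is cast, so a tally like this reflects preferences for the builds themselves rather than for a brand.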

For Singh, the 12th-grade student who built MC-Bench, the value of Minecraft isn't so much the game itself but the familiarity people have with it; after all, it is the best-selling video game of all time. Even people who haven't played the game can still evaluate which blocky build is better realized.

“Minecraft allows people to see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”

MC-Bench currently lists eight people as volunteer contributors. According to MC-Bench's website, Anthropic, Google, OpenAI, and Alibaba have subsidized the project's use of their products to run benchmark prompts, but the companies are not otherwise affiliated.

“Currently we are just doing simple builds to reflect on how far we've come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,” Singh said.

Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is notoriously tricky.

Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they're trained, models are naturally gifted at certain narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.

Put simply, it's hard to glean what it means that OpenAI's GPT-4 can score in the 88th percentile on the LSAT but cannot discern how many Rs are in the word “strawberry.” Anthropic's Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.

MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, such as “Frosty the Snowman” or “a charming tropical beach hut.”
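As an illustration of what “writing code to create a build” can look like, here is a minimal sketch in which a build is just a mapping from grid coordinates to block names. The `build_snowman` function and the dictionary format are hypothetical; the article does not describe the interface MC-Bench actually gives to models.

```python
def build_snowman(base=(0, 0, 0)):
    """Return a {(x, y, z): block_name} map for a three-block snowman.

    Hypothetical format for illustration only, not MC-Bench's real API.
    """
    x, y, z = base
    return {
        (x, y, z): "snow_block",          # body, bottom
        (x, y + 1, z): "snow_block",      # body, middle
        (x, y + 2, z): "carved_pumpkin",  # head
    }


# Print the build from the ground up.
for position, block in sorted(build_snowman().items(), key=lambda item: item[0][1]):
    print(position, block)
```

A real submission would be far more elaborate, but the idea is the same: the model's output is a program, and the structure voters see is whatever that program places in the world.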

But it's easier for most MC-Bench users to evaluate whether a snowman looks better than to dig into code, which gives the project wider appeal, and thus the potential to collect more data about which models consistently score better.

Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they're a strong signal, though.

“The current leaderboard reflects quite closely my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they're heading in the right direction.”
