
Tencent improves testing creative AI models with new benchmark
Judging AI-generated code the way a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
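The build-and-run step can be sketched as follows. This is a minimal illustration only, assuming the generated artifact is a runnable script: it isolates the run in a throwaway working directory with a timeout, while real sandboxing (containers, syscall filtering) is assumed to happen at a lower layer. The function name `run_artifact` is hypothetical, not from the paper.

```python
import os
import subprocess
import sys
import tempfile

def run_artifact(code: str, timeout_s: int = 10) -> tuple[int, str]:
    """Run a generated artifact in an isolated temp directory.

    A sketch: confines file side-effects to a throwaway workdir and
    bounds runtime with a timeout. Real isolation would add containers
    or seccomp on top of this.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "artifact.py")
        with open(path, "w") as f:
            f.write(code)
        proc = subprocess.run(
            [sys.executable, path],
            cwd=workdir,                 # keep writes inside the sandbox dir
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode, proc.stdout

# toy generated artifact
rc, out = run_artifact("print('hello')")
```

The temp directory is deleted when the context manager exits, so nothing the artifact writes survives the run.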
To see how the code behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic visual feedback.
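The idea of sampling an artifact over time, rather than only at the end, can be sketched like this. Here `render` is a stand-in (an assumption on my part) for a headless-browser screenshot call such as Playwright's `page.screenshot()`; the toy renderer below just mutates a counter so the frames differ.

```python
import time

def capture_series(render, n_frames: int = 5, interval_s: float = 0.01) -> list:
    """Capture several snapshots of a dynamic artifact over time.

    Sampling multiple frames lets a judge inspect animations and state
    changes, not just the final rendered state.
    """
    frames = []
    for _ in range(n_frames):
        frames.append(render())      # in a real system: take a screenshot
        time.sleep(interval_s)       # let the artifact's state advance
    return frames

# toy "animated" artifact: state changes between frames
state = {"tick": 0}
def render() -> str:
    state["tick"] += 1
    return f"frame:{state['tick']}"

frames = capture_series(render, n_frames=3)
```

Comparing consecutive frames is enough to detect that something changed after, say, a simulated button click.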
Finally, it hands over all this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM), which acts as a judge.
This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring covers functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
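The checklist aggregation might look like the sketch below. The metric names and the equal-weight averaging are my assumptions for illustration; the article only states that ten metrics are used and names functionality, user experience, and aesthetics among them.

```python
# Hypothetical metric names -- the source names only a few of the ten.
METRICS = [
    "functionality", "robustness", "interactivity", "responsiveness",
    "layout", "color", "typography", "accessibility",
    "user_experience", "aesthetics",
]

def score_task(judge_scores: dict[str, float]) -> float:
    """Combine the judge's per-metric checklist scores (0-10 each)
    into one task score by simple averaging (an assumption; the real
    aggregation may be weighted per task)."""
    missing = [m for m in METRICS if m not in judge_scores]
    if missing:
        raise ValueError(f"judge returned no score for: {missing}")
    return sum(judge_scores[m] for m in METRICS) / len(METRICS)

example = {m: 8.0 for m in METRICS}
example["functionality"] = 10.0
overall = score_task(example)  # (9 * 8.0 + 10.0) / 10 = 8.2
```

Requiring every metric to be present keeps a judge from silently skipping part of the checklist.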
The big question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a massive jump from older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
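One common way to quantify consistency between two rankings is pairwise agreement: the fraction of model pairs that both rankings order the same way. The sketch below uses that definition as an assumption; the benchmark's exact consistency metric is not specified in this article.

```python
from itertools import combinations

def ranking_consistency(rank_a: list[str], rank_b: list[str]) -> float:
    """Fraction of model pairs ordered identically by two rankings.

    A pairwise-agreement sketch of a 'consistency' score; both lists
    must contain the same models, best first.
    """
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    agree = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs
    )
    return agree / len(pairs)

# identical rankings agree on every pair
full = ranking_consistency(["m1", "m2", "m3"], ["m1", "m2", "m3"])
# swapping one adjacent pair breaks 1 of the 3 pairwise orderings
partial = ranking_consistency(["m1", "m2", "m3"], ["m2", "m1", "m3"])
```

Under this definition, a 94.4% score means the automated judge and the human arena disagree on the ordering of roughly 1 in 18 model pairs.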
https://www.artificialintelligence-news.com/