
PrezEval: Benchmarking AI agents on professional slides

April 6, 2026 · 7 min read

Goal

How well can an AI agent reproduce professional consulting slides from visual guidance?

After building Folio, we came to believe that our approach produces far better results than the alternatives.

But let’s put numbers on that.

PrezEval is a benchmark that measures exactly this. Given a target slide image and the original source presentation (with the correct layout pre-selected), an agent must edit the slide to match the target as closely as possible. A vision-language model then scores the result by comparing structure, content, hierarchy, and styling.

This task is deceptively hard. Real consulting slides are dense, precise artifacts: a misaligned chart legend, a missing axis label, or a wrong color in a heatmap cell all count as failures. The benchmark tests not just whether an agent can write text to a slide, but whether it can handle charts, tables, custom shapes, multi-column layouts, and brand-specific styling, all at once.

Benchmark Building

Source material

We curated 61 slides from 10 professional presentation decks spanning major consulting and advisory firms: McKinsey, Bain, BCG, PwC, EY, and Deloitte, as well as law firms Cleary Gottlieb and Mattos Filho. These are real-world decks covering topics from healthcare economics to energy transitions to consumer privacy regulation.

The slides were selected to maximize visual complexity and diversity of elements. Here is what the benchmark contains:

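| Element | Slides | Share |
|---|---|---|
| Charts (bar, line, pie, combo…) | 33 | 54% |
| Multi-column layouts | 24 | 39% |
| Logos and icons | 17* | 28% |
| Tables | 14 | 23% |
| Dense text layouts | 13 | 21% |
| Complex diagrams / timelines | 8 | 13% |
| Maps | 5 | 8% |
| Custom composite shapes | 3 | 5% |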

*Counting only substantive illustrative icons, not company logos (which appear on ~45 slides).
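As a quick sanity check on the table above, the sketch below recomputes each share as the slide count divided by the 61 slides in the benchmark. The counts are copied from the table; the names `ELEMENT_COUNTS` and `share_percent` are illustrative and not part of PrezEval.

```python
TOTAL_SLIDES = 61

# Slide counts per element type, copied from the table above.
ELEMENT_COUNTS = {
    "Charts": 33,
    "Multi-column layouts": 24,
    "Logos and icons": 17,
    "Tables": 14,
    "Dense text layouts": 13,
    "Complex diagrams / timelines": 8,
    "Maps": 5,
    "Custom composite shapes": 3,
}


def share_percent(count: int, total: int = TOTAL_SLIDES) -> int:
    """Share of the benchmark, rounded to a whole percent."""
    return round(100 * count / total)


shares = {name: share_percent(n) for name, n in ELEMENT_COUNTS.items()}
total_tags = sum(ELEMENT_COUNTS.values())
```

The counts add up to 117, well above 61, so the categories overlap: a single slide can contain several element types, which is why the shares sum to more than 100%.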

What makes it hard

Task setup

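For each of the 61 tasks, the agent receives the inputs described above: the target slide image and the original source presentation, with the correct layout pre-selected.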

The agent then edits the slide through tool calls. The final result is rendered as a PNG and scored by a vision-language model evaluator, which rates each result on an integer scale from 1 to 5; research on LLM-as-a-judge setups suggests that a compact integer scale gives the closest agreement between human and LLM ratings. We then convert ratings to a 0-100% score for readability.
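The exact conversion is not spelled out here. A linear mapping is the simplest reading, and it matches the per-task scores of 50%, 75%, and 100% quoted later in this post, so the sketch below assumes it. The function names are illustrative, not taken from the PrezEval code.

```python
def rating_to_percent(rating: int) -> float:
    """Map an integer rating in 1..5 linearly onto 0..100 (assumed mapping)."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return (rating - 1) / 4 * 100


def mean_score(ratings: list[int]) -> float:
    """Average percentage score over a list of per-task ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    return sum(rating_to_percent(r) for r in ratings) / len(ratings)
```

Under this assumption a rating of 3 maps to 50%, a 4 to 75%, and a 5 to 100%, so a config's headline score is the mean of its per-task percentages.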

Results

We compared four configurations:

| Config | Score | Time (min:sec) | Steps |
|---|---|---|---|
| Folio Max | 70.0% | 2:19 | 5.5 |
| Folio Medium | 66.8% | 2:44 | 5.2 |
| Folio Fast | 43.0% | 1:32 | 13.7 |
| Claude for PowerPoint (Sonnet 4.6) | 46.9% | 16:03 | 25.5 |

Folio Max leads at 70.0%, outscoring Claude for PowerPoint, the only non-Folio agent, by 23 points (a 49% relative lead). It reaches that score in 138.9s per task, roughly 7x faster than Claude for PowerPoint (962.8s, just over 16 minutes), and in far fewer steps (5.5 vs 25.5). On score versus time, Folio Max and Folio Fast together trace the Pareto frontier: one anchors the high-accuracy end, the other the low-latency end. Claude for PowerPoint is dominated, landing both slower and less accurate than Folio Max.
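The frontier claim can be checked directly from the results table. The sketch below treats each config as a (score, seconds) pair and keeps the ones that no other config beats on both axes. It uses the exact per-task times quoted in the text where they are given and the table's min:sec values otherwise; the `Config` type and function names are illustrative.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    score: float    # percent, higher is better
    seconds: float  # per task, lower is better


CONFIGS = [
    Config("Folio Max", 70.0, 138.9),
    Config("Folio Medium", 66.8, 164.0),   # 2:44 from the table
    Config("Folio Fast", 43.0, 92.0),      # 1:32 from the table
    Config("Claude for PowerPoint", 46.9, 962.8),
]


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.score >= b.score and a.seconds <= b.seconds
    better = a.score > b.score or a.seconds < b.seconds
    return no_worse and better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Configs that no other config dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


frontier = [c.name for c in pareto_frontier(CONFIGS)]
relative_lead = (70.0 - 46.9) / 46.9  # Folio Max vs Claude for PowerPoint
speedup = 962.8 / 138.9
```

Here `frontier` comes out as Folio Max and Folio Fast, `relative_lead` is about 0.49, and `speedup` is about 6.9. On these two axes alone Folio Medium is also dominated by Folio Max; cost, where Medium is slightly cheaper, is not modeled in this sketch.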

Folio Medium scores 66.8%, about 3 points behind Max, at a slightly lower cost, though it takes slightly longer per task (2:44 vs 2:19). It reasons less but lands almost as many reproductions cleanly.

Folio Fast is the budget, low-latency tier: at $0.21 per task it costs roughly 5x less than Folio Max and 9x less than Claude for PowerPoint, and at about 1.5 minutes per slide it is the fastest config in the field. The trade-off is fidelity (43.0%) and more exploratory actions (13.7 steps), since it runs on a smaller, cheaper model.

Claude for PowerPoint (Sonnet 4.6) scores 46.9%, but pays heavily for it: 962.8s per task (roughly 7x slower than Folio Max), 25.5 steps, $1.80, and 5 of 61 tasks left incomplete. It edits raw OOXML directly, which is both slow and error-prone compared to Folio’s structured approach.

Score breakdown by content type

Breaking down scores by what the slide contains reveals clear patterns:

| Content type | Folio Medium | Claude for PPT |
|---|---|---|
| Charts | 68.5% | 42.3% |
| Dense text | 66.7% | 50.0% |
| Diagrams | 66.1% | 43.8% |
| Tables | 65.9% | 47.7% |
| Maps | 43.8% | 41.7% |
| Overall | 66.8% | 46.9% |

The striking thing is how consistent Folio Medium has become: it sits in a tight 65-69% band across charts, dense text, diagrams, and tables. Charts, which make up over half the benchmark and used to be the category that dragged scores down, are now Folio’s strongest category. Maps are the one remaining weak spot, and they are hard for everyone: the two agents score almost the same there (43.8% vs 41.7%). Claude for PowerPoint trails in every category, with the widest gaps on charts (26 points) and diagrams (22 points).
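The per-category gaps follow directly from the breakdown table. A minimal sketch, with the scores copied from that table and illustrative names:

```python
# (Folio Medium, Claude for PowerPoint) scores in percent, from the table.
CATEGORY_SCORES = {
    "Charts": (68.5, 42.3),
    "Dense text": (66.7, 50.0),
    "Diagrams": (66.1, 43.8),
    "Tables": (65.9, 47.7),
    "Maps": (43.8, 41.7),
}


def category_gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Folio Medium minus Claude for PowerPoint, in percentage points."""
    return {name: round(folio - claude, 1) for name, (folio, claude) in scores.items()}


def widest_gaps(scores: dict[str, tuple[float, float]], n: int = 2) -> list[str]:
    """Names of the n categories with the largest gap, largest first."""
    gaps = category_gaps(scores)
    return sorted(gaps, key=gaps.get, reverse=True)[:n]


gaps = category_gaps(CATEGORY_SCORES)
top_two = widest_gaps(CATEGORY_SCORES)
```

The gaps come out at 26.2 points for charts, 22.3 for diagrams, 18.2 for tables, 16.7 for dense text, and 2.1 for maps.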

Where Folio excels

Folio handles the full range of consulting-slide elements: formatted legal text, multi-section layouts with colored boxes, table-of-contents pages, multi-column icon layouts, data charts, and tables. Folio Max scores 75% or higher on 47 of the 61 slides (and a perfect 100% on a couple), while Claude for PowerPoint typically lags well behind. The gap is most visible on chart-heavy and diagram-heavy slides, exactly the dense, structured content that fills real decks.

Where Folio still has room to grow

Maps remain the clearest opportunity: closing that gap alone would lift the overall score meaningfully. Beyond that, the residual 50%-scoring tasks are mostly near-misses on dense charts and composite shapes, where the structure is right but styling or alignment is slightly off.

Speed, which used to be a concern, is now a strength: Folio Max completes a slide in about 2.3 minutes and Folio Fast in 1.5, against nearly 16 minutes for Claude for PowerPoint. We will keep pushing latency down so that working with Folio feels like a streaming continuation of your work rather than a tennis game where you wait for the ball to come back.

All results, including per-task generated vs. reference images and evaluator critiques, are available in the PrezEval repository.