Post by @markgritter

In reply to

@markgritter@mathstodon.xyz

Software Engineer at Thirdlaw. Previously co-founded Tintri, on Vault team at HashiCorp, founding engineer at Akita Software, Principal Engineer at Postman. Big nerd. he/him

mathstodon.xyz

Mark Gritter

@markgritter@mathstodon.xyz · Apr 14, 2026

This seems vaguely familiar, but I don't think I've read it yet: https://arxiv.org/abs/2603.09678v1

"We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier"
