This seems vaguely familiar, but I don't think I've read it yet: https://arxiv.org/abs/2603.09678v1

"We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier"