Drawing Conclusions: Representation or Reasoning in New Yorker Caption Matching

Abstract

Understanding why a New Yorker cartoon caption fits an image is a demanding test of multimodal intelligence: a model must not only see the cartoon but get the joke. In the contest’s five-way matching task, the input is a cartoon, a human-written scene description, and five candidate captions, and the output is the caption that belongs with the cartoon. The best published model reaches only about 62% on this task, while expert humans reach about 94%.

This project asks whether that gap can be closed by stronger representations alone, or whether it reflects a need for candidate-level reasoning. Across CLIP baselines, fine-tuning, larger embedding models, and a gated cross-attention method that enriches image and caption embeddings with the scene description, no representation-focused method reliably breaks the 61–62% band — and a knowledge-rich encoder reaches that ceiling with no training at all. Frontier vision–language models approach human performance, but they change scale, encoder, scoring, and reasoning at once, so they are treated as an upper-bound probe rather than a controlled ablation.

Approach

The study uses the matching configuration of the New Yorker Caption Contest benchmark: each cartoon is paired with five candidate captions, and the task is to pick the contest winner. Every cartoon also comes with a human scene annotation used by a from-description protocol. Because the contest produces one cartoon per week, the split is built from only ~700 unique cartoons — a hard, small-data regime where every trained variant overfits.

The central architecture is a dual gated cross-attention model on a shared CLIP ViT-L/14 backbone: image patches and each candidate caption are separately enriched by cross-attending to the scene description, each behind its own learned gate, before temperature-scaled cosine scoring and a five-way softmax. Training regimes are stated and studied explicitly — zero-shot cosine, partial fine-tuning, frozen-backbone, and full fine-tune — to separate the contribution of architecture from that of fine-tuning. A provider-agnostic frontier-evaluation harness scores GPT-5, Claude Opus, and Gemini zero-shot on the same items, with two leakage controls (a blank control that hides the cartoon, and a shuffle control that pairs each item with the wrong cartoon) to test whether frontier accuracy reflects genuine visual grounding rather than memorization.

What it found

Representation-side choices saturate near the ceiling: scale, resolution, scoring granularity, and a knowledge-rich encoder all leave matching accuracy around 61%, and the best trained model — an early-projection cross-attention variant — reaches 60.4%, within noise of the published 62.3% reference. Reasoning-capable frontier models clear the wall, approaching the ~94% human level, with the leakage controls confirming they depend on seeing the right cartoon. The gap that no representation-focused method could close, these systems close almost entirely — evidence that getting machines in on the joke takes more than better representations.

Approach	5-way matching acc.
Random chance	20.0%
Description + caption (DistilBERT, text-only)	48.7%
CLIP cosine, image + caption (zero-shot)	48.7%
CLIP cosine, image + caption (fine-tuned)	58.5%
Dual cross-attention enrichment	59.3%
Early projection — best trained model	60.4%
Gemini 3.5 Flash — zero-shot, reasoning	93.6%
Expert humans	~94%