💬 Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

1 University of Michigan 2 GrowAI
*Indicates Equal Contribution

TL;DR: We show that current VLMs exhibit significant pragmatic deficiencies in referring expression generation compared to humans, frequently violating Gricean maxims.

Abstract

Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication. However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset RefOI of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

Data Collection Interfaces

Qualitative Analyses

We present qualitative examples illustrating the pragmatic differences between human and model-generated referring expressions. These examples highlight typical failure modes such as overly verbose descriptions, irrelevant attribute mentions, and insufficient disambiguation. The comparisons also reveal how humans tend to prefer concise, spatially grounded expressions, particularly in spoken settings—an aspect current VLMs often overlook.

Experiments

Main Results: Discrepancy Between Automatic Metrics and Human Judgement

Model Instr. BLEU-1 BLEU-4 ROUGE-1 ROUGE-L METEOR CIDEr SPICE BERT CLIP REC Human Irrel%
LLaVA-7B Dft. 13.27 1.60 18.09 16.30 19.29 2.10 10.50 85.51 79.02 17.28 39.46 87.30
LLaVA-7B Brf. 28.74 6.05 36.46 35.50 19.15 10.80 24.59 89.02 70.72 13.58 30.57 41.95
LLaVA-13B Dft. 8.17 1.07 11.98 10.94 16.89 0.77 7.92 84.61 79.85 15.27 46.40 91.85
LLaVA-13B Brf. 28.96 5.81 36.44 35.64 20.13 8.14 21.63 88.42 72.99 15.33 32.53 49.65
LLaVA-34B Dft. 6.29 0.78 9.82 9.11 16.15 0.07 7.61 84.39 79.86 16.21 46.53 92.90
LLaVA-34B Brf. 28.55 6.38 32.99 31.67 20.48 9.60 16.50 88.50 74.95 17.22 36.77 56.11
XComposer Dft. 5.25 0.65 8.38 7.81 14.58 3.10 6.37 84.11 79.86 18.56 52.19 92.81
XComposer Brf. 13.59 2.17 17.77 16.69 19.95 5.52 10.63 85.52 79.66 18.36 51.65 80.36
MiniCPM-V Dft. 6.38 0.67 9.86 8.78 15.28 0.05 6.30 84.29 80.38 19.10 45.12 92.97
MiniCPM-V Brf. 16.03 3.15 19.56 18.19 18.77 6.36 11.16 86.29 78.55 17.15 45.79 72.87
GLaMM Dft. 15.01 3.32 16.69 16.29 11.49 9.08 3.90 86.42 58.26 3.70 3.84 74.68
GLaMM Brf. 18.46 4.45 20.92 20.46 14.18 10.48 4.44 86.65 58.60 3.77 4.85 70.52
CogVLM Dft. 31.13 8.70 33.89 32.32 23.50 41.62 24.09 89.78 66.54 15.97 26.67 26.39
CogVLM Brf. 31.39 8.69 34.70 32.94 24.87 41.41 24.74 90.00 69.15 18.06 33.53 29.88
GPT-4o Dft. 7.47 0.85 11.61 10.43 17.39 0.03 7.21 84.57 80.81 21.65 59.80 89.81
GPT-4o Brf. 25.30 5.78 28.76 27.36 19.02 8.17 15.31 88.11 76.58 19.03 51.72 52.75
Human Spk. 66.18 22.58 70.15 66.45 48.28 112.04 42.35 93.89 71.60 30.46 92.20 9.15
Human Wrt. - - - - - - - - 70.43 30.06 89.29 7.29

We compare model performance under two Instr. (Instruction) settings: a Dft. (Default) prompt and a Brf. (Brief) prompt. All model predictions are evaluated against the Human Wrt. (Written) expressions as reference texts, and the Human Spk. (Spoken) expressions are likewise evaluated against the written ones. Irrel% denotes the percentage of irrelevant words in the referring expressions of examples evaluated as successful.
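As a rough illustration of this reference-based setup, the sketch below scores a model-generated expression against a human-written reference with off-the-shelf BLEU and ROUGE implementations (nltk and rouge-score). It is a minimal approximation, not the paper's exact evaluation toolchain, and the example strings are invented.

```python
# Minimal sketch of reference-based scoring against a human-written expression.
# Uses nltk and rouge-score; the paper's exact evaluation toolchain may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score_against_written(prediction: str, written_reference: str) -> dict:
    pred_tokens = prediction.lower().split()
    ref_tokens = written_reference.lower().split()
    smooth = SmoothingFunction().method1
    bleu1 = sentence_bleu([ref_tokens], pred_tokens,
                          weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    bleu4 = sentence_bleu([ref_tokens], pred_tokens,
                          weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(
        written_reference, prediction)
    return {
        "BLEU-1": bleu1,
        "BLEU-4": bleu4,
        "ROUGE-1": rouge["rouge1"].fmeasure,
        "ROUGE-L": rouge["rougeL"].fmeasure,
    }

# Example: a brief model output scored against a human-written reference.
print(score_against_written("the largest cookie", "the big cookie on the left plate"))
```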

While human-generated expressions achieve over 90% accuracy in identifying objects, all VLMs fall short by a wide margin. GPT-4o performs best among the tested models, yet still lags behind humans. Automatic metrics such as BLEU and CIDEr show poor correlation with human judgment, frequently ranking verbose models higher. Even listener-based scores (REC) fail to consistently match human preferences, indicating that existing metrics do not capture pragmatic competence effectively.
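One way to make this mismatch concrete is to correlate per-example metric scores with binary human success judgments. The snippet below is a minimal sketch using SciPy with illustrative numbers, not values from the paper.

```python
# Sketch: how strongly does an automatic metric track binary human success judgments?
# The lists are illustrative per-example values, not numbers from the paper.
from scipy.stats import spearmanr

metric_scores = [0.31, 0.12, 0.55, 0.08, 0.47]   # e.g., per-example CIDEr
human_success = [1, 0, 1, 1, 0]                  # 1 = human listener found the right object

rho, p_value = spearmanr(metric_scores, human_success)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```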

Main Results Breakdown: Failures in Uniqueness and Identifiability

Model Instr. Human REC Agree Wrong% Multi.% No-Mat% COCO No-COCO ΔAcc Coocc. No-Coocc. ΔAcc
(Columns 3-5 form the Listener Compare group, 6-8 the Error Breakdown, 9-11 the Class Breakdown, and 12-14 the Class Co-occurrence breakdown.)
LLaVA-7B Dft. 39.46 17.28 65.23 14.62 40.40 5.52 41.26 37.65 -3.61 18.63 81.50 -62.87
LLaVA-7B Brf. 30.57 13.58 72.02 10.23 52.26 6.94 31.18 29.96 -1.22 10.37 71.34 -60.97
LLaVA-13B Dft. 46.40 15.27 61.80 26.26 26.20 1.14 45.70 47.10 1.40 28.80 81.91 -53.11
LLaVA-13B Brf. 32.53 15.33 70.01 10.30 56.63 0.54 33.47 31.58 -1.89 10.67 76.63 -65.96
LLaVA-34B Dft. 46.53 16.21 59.31 18.72 31.52 3.23 48.25 44.80 -3.45 29.41 81.10 -51.69
LLaVA-34B Brf. 36.77 17.22 65.57 7.34 51.45 4.44 38.04 35.49 -2.55 15.11 80.59 -65.48
XComposer Dft. 52.19 18.56 59.12 20.20 24.92 2.69 56.05 48.31 -7.74 37.56 81.70 -44.14
XComposer Brf. 51.65 18.36 58.78 14.28 31.45 2.62 55.78 47.50 -8.28 35.55 84.15 -48.60
MiniCPM-V Dft. 45.12 19.10 63.42 15.75 34.55 4.58 47.98 42.24 -5.74 26.49 82.72 -56.23
MiniCPM-V Brf. 45.79 17.15 60.66 12.19 38.99 3.03 49.46 42.11 -7.35 26.99 83.74 -56.75
GLaMM Dft. 3.84 3.70 95.02 7.33 15.29 73.54 4.30 3.37 -0.93 1.31 8.94 -7.63
GLaMM Brf. 4.85 3.77 93.95 8.49 14.07 72.59 4.30 5.40 1.10 1.31 11.99 -10.68
CogVLM Dft. 26.67 15.97 68.65 2.89 47.34 23.10 27.96 25.37 -2.59 13.39 53.46 -40.07
CogVLM Brf. 33.53 18.06 61.59 2.96 52.53 10.98 34.81 32.25 -2.56 16.72 67.48 -50.76
GPT-4o Dft. 59.80 21.65 53.67 11.98 24.04 4.18 63.31 56.28 -7.03 48.14 83.33 -35.19
GPT-4o Brf. 51.72 19.03 56.76 10.97 31.52 5.79 54.84 48.58 -6.26 37.36 80.69 -43.33
Human Spk. 92.20 30.46 35.04 6.93 0.74 0.13 92.07 92.58 0.51 91.74 93.50 -1.76
Human Wrt. 89.29 30.06 36.18 7.68 2.36 0.67 89.52 89.07 -0.45 88.31 91.26 -2.95

Listener Compare: Human is the accuracy judged by human listeners, REC is the accuracy judged by an automatic listener (CogVLM-Grounding), and Agree is the agreement between the two listeners.
Error Breakdown: the percentages of three error types. Wrong% refers to a failed guess, Multi.% to multiple potential matches, and No-Mat% to cases where no object can be located.
Class Breakdown: the accuracy on COCO-class objects versus non-COCO-class objects; ΔAcc is the accuracy difference between the two categories (No-COCO minus COCO).
Class Co-occurrence: the accuracy on Coocc. images (containing more than one object of the same class) versus No-Coocc. images (containing only one object of its class); ΔAcc is the accuracy difference between the two categories (Coocc. minus No-Coocc.). A sketch of these breakdown computations follows below.
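The breakdowns above can be reproduced from per-example evaluation records as in the sketch below; the field names (success, is_coco, has_cooccurrence) are illustrative assumptions rather than the dataset's actual schema.

```python
# Sketch of the accuracy breakdowns reported above, given per-example records.
# The field names (success, is_coco, has_cooccurrence) are illustrative assumptions.
def accuracy(examples):
    return 100.0 * sum(e["success"] for e in examples) / max(len(examples), 1)

def breakdown(examples):
    coco = [e for e in examples if e["is_coco"]]
    non_coco = [e for e in examples if not e["is_coco"]]
    coocc = [e for e in examples if e["has_cooccurrence"]]
    no_coocc = [e for e in examples if not e["has_cooccurrence"]]
    return {
        "COCO": accuracy(coco),
        "No-COCO": accuracy(non_coco),
        "ΔAcc (class)": accuracy(non_coco) - accuracy(coco),
        "Coocc.": accuracy(coocc),
        "No-Coocc.": accuracy(no_coocc),
        "ΔAcc (co-occurrence)": accuracy(coocc) - accuracy(no_coocc),
    }

examples = [
    {"success": True, "is_coco": True, "has_cooccurrence": False},
    {"success": False, "is_coco": False, "has_cooccurrence": True},
]
print(breakdown(examples))
```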

The breakdown reveals that many failures stem from referential ambiguity, particularly when multiple similar objects appear in the same scene. Models often fail to generate uniquely identifying expressions under such conditions. Performance also drops significantly for non-COCO classes, suggesting dataset bias and poor generalization beyond common benchmarks. All models struggle with images containing co-occurring objects, indicating limited ability to leverage distinctive features for disambiguation.

Further Analyses and Discussions

Misalignment to Human Pragmatic Preferences

While humans heavily rely on spatial cues in referring expressions, VLMs often favor combinations of visual attributes such as shape and color. This divergence reveals that VLMs may not follow human pragmatic preferences when multiple minimal descriptions are available—violating Gricean maxims of Relation and Quantity.

To isolate this effect, we design a synthetic dataset in which each referent can be uniquely described by any one of four independent features: size, color, shape, or position. We collect human responses and compare them with the preferences implied by VLM likelihoods.
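A minimal sketch of this comparison is given below, assuming access to some way of scoring an expression's likelihood under a VLM. The function vlm_loglikelihood is a hypothetical placeholder (e.g., summed token log-probabilities of the expression given the image), not an API from any particular model, and the candidate strings are invented.

```python
# Sketch of comparing human choices with VLM preferences over minimal descriptions.
# vlm_loglikelihood is a hypothetical stand-in for scoring a candidate expression
# under a VLM; replace it with a real scoring call.
import math
import random

def vlm_loglikelihood(image_path: str, expression: str) -> float:
    # Placeholder: returns a random score instead of a real VLM log-probability.
    return random.uniform(-20.0, -5.0)

def vlm_preference(image_path: str, candidates: dict) -> dict:
    """Normalize candidate log-likelihoods into a preference distribution (softmax)."""
    scores = {f: vlm_loglikelihood(image_path, expr) for f, expr in candidates.items()}
    z = max(scores.values())
    exp_scores = {f: math.exp(s - z) for f, s in scores.items()}
    total = sum(exp_scores.values())
    return {f: v / total for f, v in exp_scores.items()}

candidates = {
    "size": "the big one",
    "color": "the red one",
    "shape": "the round one",
    "position": "the one on the left",
}
# Compare this distribution with the frequencies of features chosen by human annotators.
print(vlm_preference("scene.png", candidates))
```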

Results show that humans are highly sensitive to visual saliency and prefer the most contextually informative cue. VLMs, in contrast, display flatter preference distributions and less discriminative usage of features—highlighting a lack of pragmatic grounding.

Why Is Current Automatic Evaluation Unreliable?

Standard metrics such as BLEU and METEOR, and even model-based metrics like CLIPScore, fail to capture key pragmatic properties. For instance, brief yet sufficient expressions (e.g., "cookie") are heavily penalized by BLEU's brevity penalty, and paraphrased spatial constructions are unfairly punished by METEOR's fragmentation penalty.
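The toy example below makes the BLEU issue concrete: a brief but sufficient expression receives a near-zero BLEU-1 score against a verbose reference, largely because of the brevity penalty. It is computed with nltk and the strings are illustrative, not drawn from the dataset.

```python
# Tiny illustration of how BLEU penalizes a brief but sufficient expression.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the large chocolate chip cookie on the left side of the plate".split()
brief = "cookie".split()
verbose = "the large chocolate chip cookie sitting on the left side of the white plate".split()

smooth = SmoothingFunction().method1
for name, hyp in [("brief", brief), ("verbose", verbose)]:
    score = sentence_bleu([reference], hyp, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    print(f"{name}: BLEU-1 = {score:.5f}")  # brief scores near zero despite being adequate
```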

Model-based similarity metrics also blur pragmatic distinctions. For example, "largest cookie" and "cookie" may score similarly under CLIPScore despite large differences in informativeness. Listener-based metrics further compound the issue by reinforcing biases toward salient objects.
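The snippet below sketches a CLIPScore-style check using the standard Hugging Face CLIP checkpoint; the image path is a placeholder. The cosine similarities of "the largest cookie" and "cookie" against the same image are often close, even though only one of the two is uniquely identifying.

```python
# Sketch of a CLIPScore-style comparison between two captions and one image.
# "cookies.jpg" is a placeholder path; supply any image with multiple cookies.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cookies.jpg").convert("RGB")
texts = ["the largest cookie", "cookie"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (text_emb @ image_emb.T).squeeze(-1)

for t, s in zip(texts, cosine.tolist()):
    print(f"{t!r}: cosine similarity {s:.3f}")
```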

These issues highlight the urgent need for pragmatically aware evaluation frameworks that reflect human-like judgments.

Evaluation Metrics Comparison
A case study illustrating why automatic metrics, including heuristic measures and neural listener models, fail to accurately capture the pragmatic performance of REG.

Recommended Use of Our Dataset

The RefOI dataset is designed for fine-grained REG/REC analysis. It distinguishes between COCO and non-COCO classes, and between scenes with single vs. multiple distractors of the same class.

We encourage users to leverage these distinctions for deeper insights and invite community contributions to expand non-COCO annotations.
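If the dataset is loaded via the Hugging Face datasets library, slicing along these distinctions could look roughly like the sketch below. The repository id and column names are placeholders, so consult the dataset card for the actual schema before use.

```python
# Hypothetical sketch of slicing RefOI along the distinctions described above.
# The repository id and column names are placeholders, not the real schema.
from datasets import load_dataset

dataset = load_dataset("path/to/RefOI", split="test")  # placeholder repo id

coco_only = dataset.filter(lambda ex: ex["is_coco_class"])              # placeholder column
with_distractors = dataset.filter(lambda ex: ex["num_same_class"] > 1)  # placeholder column

print(len(coco_only), len(with_distractors))
```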

BibTeX


@misc{ma2025visionlanguagemodelspragmaticallycompetent,
  title={Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation}, 
  author={Ziqiao Ma and Jing Ding and Xuejun Zhang and Dezhi Luo and Jiahe Ding and Sihan Xu and Yuchen Huang and Run Peng and Joyce Chai},
  year={2025},
  eprint={2504.16060},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.16060}, 
}