💬 Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

1 University of Michigan 2 GrowAI
*Indicates Equal Contribution

TL;DR: We show that current VLMs exhibit significant pragmatic deficiencies in referring expression generation compared to humans, frequently violating Gricean maxims.

Abstract

Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication. However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset RefOI of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

Data Collection Interfaces

Qualitative Analyses

We present qualitative examples illustrating the pragmatic differences between human and model-generated referring expressions. These examples highlight typical failure modes such as overly verbose descriptions, irrelevant attribute mentions, and insufficient disambiguation. The comparisons also reveal how humans tend to prefer concise, spatially grounded expressions, particularly in spoken settings—an aspect current VLMs often overlook.

Experiments

Main Results: Discrepancy Between Automatic Metrics and Human Judgement

Model Instr. BLEU-1 BLEU-4 ROUGE-1 ROUGE-L METEOR CIDEr SPICE BERT CLIP REC Human Irrel%
LLaVA-7B Dft. 13.27 1.60 18.09 16.30 19.29 2.10 10.50 85.51 79.02 17.28 39.46 87.30
LLaVA-7B Brf. 28.74 6.05 36.46 35.50 19.15 10.80 24.59 89.02 70.72 13.58 30.57 41.95
LLaVA-13B Dft. 8.17 1.07 11.98 10.94 16.89 0.77 7.92 84.61 79.85 15.27 46.40 91.85
LLaVA-13B Brf. 28.96 5.81 36.44 35.64 20.13 8.14 21.63 88.42 72.99 15.33 32.53 49.65
LLaVA-34B Dft. 6.29 0.78 9.82 9.11 16.15 0.07 7.61 84.39 79.86 16.21 46.53 92.90
LLaVA-34B Brf. 28.55 6.38 32.99 31.67 20.48 9.60 16.50 88.50 74.95 17.22 36.77 56.11
XComposer Dft. 5.25 0.65 8.38 7.81 14.58 3.10 6.37 84.11 79.86 18.56 52.19 92.81
XComposer Brf. 13.59 2.17 17.77 16.69 19.95 5.52 10.63 85.52 79.66 18.36 51.65 80.36
MiniCPM-V Dft. 6.38 0.67 9.86 8.78 15.28 0.05 6.30 84.29 80.38 19.10 45.12 92.97
MiniCPM-V Brf. 16.03 3.15 19.56 18.19 18.77 6.36 11.16 86.29 78.55 17.15 45.79 72.87
GLaMM Dft. 15.01 3.32 16.69 16.29 11.49 9.08 3.90 86.42 58.26 3.70 3.84 74.68
GLaMM Brf. 18.46 4.45 20.92 20.46 14.18 10.48 4.44 86.65 58.60 3.77 4.85 70.52
CogVLM Dft. 31.13 8.70 33.89 32.32 23.50 41.62 24.09 89.78 66.54 15.97 26.67 26.39
CogVLM Brf. 31.39 8.69 34.70 32.94 24.87 41.41 24.74 90.00 69.15 18.06 33.53 29.88
GPT-4o Dft. 7.47 0.85 11.61 10.43 17.39 0.03 7.21 84.57 80.81 21.65 59.80 89.81
GPT-4o Brf. 25.30 5.78 28.76 27.36 19.02 8.17 15.31 88.11 76.58 19.03 51.72 52.75
Human Spk. 66.18 22.58 70.15 66.45 48.28 112.04 42.35 93.89 71.60 30.46 92.20 9.15
Human Wrt. - - - - - - - - 70.43 30.06 89.29 7.29

We compare model performance under two Instr. (Instruction) settings: a Dft. (Default) prompt and a Brf. (Brief) prompt. All model predictions are evaluated against the Human Wrt. (Written) expressions as reference texts, and the Human Spk. (Spoken) expressions are likewise evaluated against the written ones. Irrel% denotes the percentage of irrelevant words in the referring expressions of examples evaluated as successful.
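As a rough illustration of this reference-based setup, the sketch below scores a model-generated expression against a human-written reference with off-the-shelf BLEU and ROUGE implementations (nltk and rouge-score). It is a minimal approximation, not the paper's exact evaluation toolchain, and the example strings are invented.

```python
# Minimal sketch of reference-based scoring against a human-written expression.
# Uses nltk and rouge-score; the paper's exact evaluation toolchain may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def score_against_written(prediction: str, written_reference: str) -> dict:
    pred_tokens = prediction.lower().split()
    ref_tokens = written_reference.lower().split()
    smooth = SmoothingFunction().method1
    bleu1 = sentence_bleu([ref_tokens], pred_tokens,
                          weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    bleu4 = sentence_bleu([ref_tokens], pred_tokens,
                          weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
    rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(
        written_reference, prediction)
    return {
        "BLEU-1": bleu1,
        "BLEU-4": bleu4,
        "ROUGE-1": rouge["rouge1"].fmeasure,
        "ROUGE-L": rouge["rougeL"].fmeasure,
    }

# Example: a brief model output scored against a human-written reference.
print(score_against_written("the largest cookie", "the big cookie on the left plate"))
```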

While human-generated expressions achieve over 90% accuracy in identifying objects, all VLMs fall short by a wide margin. GPT-4o performs best among the tested models, yet still lags behind humans. Automatic metrics such as BLEU and CIDEr show poor correlation with human judgment, frequently ranking verbose models higher. Even listener-based scores (REC) fail to consistently match human preferences, indicating that existing metrics do not capture pragmatic competence effectively.
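One way to make this mismatch concrete is to correlate per-example metric scores with binary human success judgments. The snippet below is a minimal sketch using SciPy with illustrative numbers, not values from the paper.

```python
# Sketch: how strongly does an automatic metric track binary human success judgments?
# The lists are illustrative per-example values, not numbers from the paper.
from scipy.stats import spearmanr

metric_scores = [0.31, 0.12, 0.55, 0.08, 0.47]   # e.g., per-example CIDEr
human_success = [1, 0, 1, 1, 0]                  # 1 = human listener found the right object

rho, p_value = spearmanr(metric_scores, human_success)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```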

Main Results Breakdown: Failures in Uniqueness and Identifiability

Model Instr. Human REC Agree Wrong% Multi.% No-Mat% COCO No-COCO ΔAcc Coocc. No-Coocc. ΔAcc
(Columns 3-5 form the Listener Compare group, 6-8 the Error Breakdown, 9-11 the Class Breakdown, and 12-14 the Class Co-occurrence breakdown.)
LLaVA-7B Dft. 39.46 17.28 65.23 14.62 40.40 5.52 41.26 37.65 -3.61 18.63 81.50 -62.87
LLaVA-7B Brf. 30.57 13.58 72.02 10.23 52.26 6.94 31.18 29.96 -1.22 10.37 71.34 -60.97
LLaVA-13B Dft. 46.40 15.27 61.80 26.26 26.20 1.14 45.70 47.10 1.40 28.80 81.91 -53.11
LLaVA-13B Brf. 32.53 15.33 70.01 10.30 56.63 0.54 33.47 31.58 -1.89 10.67 76.63 -65.96
LLaVA-34B Dft. 46.53 16.21 59.31 18.72 31.52 3.23 48.25 44.80 -3.45 29.41 81.10 -51.69
LLaVA-34B Brf. 36.77 17.22 65.57 7.34 51.45 4.44 38.04 35.49 -2.55 15.11 80.59 -65.48
XComposer Dft. 52.19 18.56 59.12 20.20 24.92 2.69 56.05 48.31 -7.74 37.56 81.70 -44.14
XComposer Brf. 51.65 18.36 58.78 14.28 31.45 2.62 55.78 47.50 -8.28 35.55 84.15 -48.60
MiniCPM-V Dft. 45.12 19.10 63.42 15.75 34.55 4.58 47.98 42.24 -5.74 26.49 82.72 -56.23
MiniCPM-V Brf. 45.79 17.15 60.66 12.19 38.99 3.03 49.46 42.11 -7.35 26.99 83.74 -56.75
GLaMM Dft. 3.84 3.70 95.02 7.33 15.29 73.54 4.30 3.37 -0.93 1.31 8.94 -7.63
GLaMM Brf. 4.85 3.77 93.95 8.49 14.07 72.59 4.30 5.40 1.10 1.31 11.99 -10.68
CogVLM Dft. 26.67 15.97 68.65 2.89 47.34 23.10 27.96 25.37 -2.59 13.39 53.46 -40.07
CogVLM Brf. 33.53 18.06 61.59 2.96 52.53 10.98 34.81 32.25 -2.56 16.72 67.48 -50.76
GPT-4o Dft. 59.80 21.65 53.67 11.98 24.04 4.18 63.31 56.28 -7.03 48.14 83.33 -35.19
GPT-4o Brf. 51.72 19.03 56.76 10.97 31.52 5.79 54.84 48.58 -6.26 37.36 80.69 -43.33
Human Spk. 92.20 30.46 35.04 6.93 0.74 0.13 92.07 92.58 0.51 91.74 93.50 -1.76
Human Wrt. 89.29 30.06 36.18 7.68 2.36 0.67 89.52 89.07 -0.45 88.31 91.26 -2.95

Listener Compare: Human is the accuracy judged by human listeners, REC is the accuracy judged by an automatic listener (CogVLM-Grounding), and Agree is the agreement between the two listeners.
Error Breakdown: the percentages of three error types. Wrong% refers to a failed guess, Multi.% to multiple potential matches, and No-Mat% to cases where no object can be located.
Class Breakdown: the accuracy on COCO-class objects versus non-COCO-class objects; ΔAcc is the accuracy difference between the two categories (No-COCO minus COCO).
Class Co-occurrence: the accuracy on Coocc. images (containing more than one object of the same class) versus No-Coocc. images (containing only one object of its class); ΔAcc is the accuracy difference between the two categories (Coocc. minus No-Coocc.). A sketch of these breakdown computations follows below.
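The breakdowns above can be reproduced from per-example evaluation records as in the sketch below; the field names (success, is_coco, has_cooccurrence) are illustrative assumptions rather than the dataset's actual schema.

```python
# Sketch of the accuracy breakdowns reported above, given per-example records.
# The field names (success, is_coco, has_cooccurrence) are illustrative assumptions.
def accuracy(examples):
    return 100.0 * sum(e["success"] for e in examples) / max(len(examples), 1)

def breakdown(examples):
    coco = [e for e in examples if e["is_coco"]]
    non_coco = [e for e in examples if not e["is_coco"]]
    coocc = [e for e in examples if e["has_cooccurrence"]]
    no_coocc = [e for e in examples if not e["has_cooccurrence"]]
    return {
        "COCO": accuracy(coco),
        "No-COCO": accuracy(non_coco),
        "ΔAcc (class)": accuracy(non_coco) - accuracy(coco),
        "Coocc.": accuracy(coocc),
        "No-Coocc.": accuracy(no_coocc),
        "ΔAcc (co-occurrence)": accuracy(coocc) - accuracy(no_coocc),
    }

examples = [
    {"success": True, "is_coco": True, "has_cooccurrence": False},
    {"success": False, "is_coco": False, "has_cooccurrence": True},
]
print(breakdown(examples))
```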

The breakdown reveals that many failures stem from referential ambiguity, particularly when multiple similar objects appear in the same scene. Models often fail to generate uniquely identifying expressions under such conditions. Performance also drops significantly for non-COCO classes, suggesting dataset bias and poor generalization beyond common benchmarks. All models struggle with images containing co-occurring objects, indicating limited ability to leverage distinctive features for disambiguation.

Further Analyses and Discussions

Misalignment to Human Pragmatic Preferences

While humans heavily rely on spatial cues in referring expressions, VLMs often favor combinations of visual attributes such as shape and color. This divergence reveals that VLMs may not follow human pragmatic preferences when multiple minimal descriptions are available—violating Gricean maxims of Relation and Quantity.

To isolate this effect, we design a synthetic dataset in which each referent can be uniquely described by any one of four independent features: size, color, shape, or position. We collect human responses and compare them with the preferences implied by VLM likelihoods.
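A minimal sketch of this comparison is given below, assuming access to some way of scoring an expression's likelihood under a VLM. The function vlm_loglikelihood is a hypothetical placeholder (e.g., summed token log-probabilities of the expression given the image), not an API from any particular model, and the candidate strings are invented.

```python
# Sketch of comparing human choices with VLM preferences over minimal descriptions.
# vlm_loglikelihood is a hypothetical stand-in for scoring a candidate expression
# under a VLM; replace it with a real scoring call.
import math
import random

def vlm_loglikelihood(image_path: str, expression: str) -> float:
    # Placeholder: returns a random score instead of a real VLM log-probability.
    return random.uniform(-20.0, -5.0)

def vlm_preference(image_path: str, candidates: dict) -> dict:
    """Normalize candidate log-likelihoods into a preference distribution (softmax)."""
    scores = {f: vlm_loglikelihood(image_path, expr) for f, expr in candidates.items()}
    z = max(scores.values())
    exp_scores = {f: math.exp(s - z) for f, s in scores.items()}
    total = sum(exp_scores.values())
    return {f: v / total for f, v in exp_scores.items()}

candidates = {
    "size": "the big one",
    "color": "the red one",
    "shape": "the round one",
    "position": "the one on the left",
}
# Compare this distribution with the frequencies of features chosen by human annotators.
print(vlm_preference("scene.png", candidates))
```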

Results show that humans are highly sensitive to visual saliency and prefer the most contextually informative cue. VLMs, in contrast, display flatter preference distributions and less discriminative usage of features—highlighting a lack of pragmatic grounding.

Why Is Current Automatic Evaluation Unreliable?

Standard metrics such as BLEU and METEOR, and even model-based metrics like CLIPScore, fail to capture key pragmatic properties. For instance, brief yet sufficient expressions (e.g., "cookie") are heavily penalized by BLEU's brevity penalty, and paraphrased spatial constructions are unfairly punished by METEOR's fragmentation penalty.
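The toy example below makes the BLEU issue concrete: a brief but sufficient expression receives a near-zero BLEU-1 score against a verbose reference, largely because of the brevity penalty. It is computed with nltk and the strings are illustrative, not drawn from the dataset.

```python
# Tiny illustration of how BLEU penalizes a brief but sufficient expression.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the large chocolate chip cookie on the left side of the plate".split()
brief = "cookie".split()
verbose = "the large chocolate chip cookie sitting on the left side of the white plate".split()

smooth = SmoothingFunction().method1
for name, hyp in [("brief", brief), ("verbose", verbose)]:
    score = sentence_bleu([reference], hyp, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    print(f"{name}: BLEU-1 = {score:.5f}")  # brief scores near zero despite being adequate
```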

Model-based similarity metrics also blur pragmatic distinctions. For example, "largest cookie" and "cookie" may score similarly under CLIPScore despite large differences in informativeness. Listener-based metrics further compound the issue by reinforcing biases toward salient objects.
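The snippet below sketches a CLIPScore-style check using the standard Hugging Face CLIP checkpoint; the image path is a placeholder. The cosine similarities of "the largest cookie" and "cookie" against the same image are often close, even though only one of the two is uniquely identifying.

```python
# Sketch of a CLIPScore-style comparison between two captions and one image.
# "cookies.jpg" is a placeholder path; supply any image with multiple cookies.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cookies.jpg").convert("RGB")
texts = ["the largest cookie", "cookie"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (text_emb @ image_emb.T).squeeze(-1)

for t, s in zip(texts, cosine.tolist()):
    print(f"{t!r}: cosine similarity {s:.3f}")
```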

These issues highlight the urgent need for pragmatically aware evaluation frameworks that reflect human-like judgments.

Evaluation Metrics Comparison
A case study illustrating why automatic metrics, including heuristic measures and neural listener models, fail to accurately capture the pragmatic performance of REG.

Recommended Use of Our Dataset

The RefOI dataset is designed for fine-grained REG/REC analysis. It distinguishes between COCO and non-COCO classes, and between scenes with single vs. multiple distractors of the same class.

We encourage users to leverage these distinctions for deeper insights and invite community contributions to expand non-COCO annotations.
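If the dataset is loaded via the Hugging Face datasets library, slicing along these distinctions could look roughly like the sketch below. The repository id and column names are placeholders, so consult the dataset card for the actual schema before use.

```python
# Hypothetical sketch of slicing RefOI along the distinctions described above.
# The repository id and column names are placeholders, not the real schema.
from datasets import load_dataset

dataset = load_dataset("path/to/RefOI", split="test")  # placeholder repo id

coco_only = dataset.filter(lambda ex: ex["is_coco_class"])              # placeholder column
with_distractors = dataset.filter(lambda ex: ex["num_same_class"] > 1)  # placeholder column

print(len(coco_only), len(with_distractors))
```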

BibTeX


@misc{ma2025visionlanguagemodelspragmaticallycompetent,
  title={Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation}, 
  author={Ziqiao Ma and Jing Ding and Xuejun Zhang and Dezhi Luo and Jiahe Ding and Sihan Xu and Yuchen Huang and Run Peng and Joyce Chai},
  year={2025},
  eprint={2504.16060},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.16060}, 
}