Neural captioners are typically trained to mimic human image descriptions, without any awareness of the function that such descriptions might have, leading to biased and vague captions. In this talk, I’ll empirically show that adding a self-supervised objective to a neural captioner helps to recover a plain, visually descriptive language that is more informative about image contents. In particular, I’ll describe experiments where we take an out-of-the-box neural captioner, and we finetune it with a discriminative objective. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify the target image among a set of candidates.
In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, the discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner trained only with supervised learning and without finetuning. I’ll further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla captions or ground-truth captions for human subjects tasked with an image discrimination task. If time allows, I’ll conclude the talk by drawing a connection between our work and reinforcement learning from human feedback (RLHF), a recently introduced method powering models like ChatGPT and InstructGPT.