less than 1 minute read

LLaVAC: Fine-tuning LLaVA as a Multimodal Sentiment Classifier

Abstract

We introduce LLaVAC, a method for constructing a classifier for multimodal sentiment analysis. This classifier is capable of classifying both text and image modalities by performing fine-tuning on the Large Language and Vision Assistant (LLaVA). In this work, we design a prompt to consider unimodal and multimodal labels and fine-tune LLaVA for classifying multimodal sentiment labels by generating predicted labels. Our method outperforms baselines by up to 7.31% in accuracy and by 8.76% in weighted-F1 in the MVSA-Single dataset across three dataset processing procedures.

llavac-model-figure

Further reading:

  • Click here to download the PDF version of this work.