Using GPT-3.5-Turbo for Multimodal Sentiment Analysis

21 Apr 2024


(1) Jingjing Wang, School of Computing, Clemson University, Clemson, South Carolina, USA;

(2) Joshua Luo, The Westminster Schools, Atlanta, Georgia, USA;

(3) Grace Yang, South Windsor High School, South Windsor, Connecticut, USA;

(4) Allen Hong, D.W. Daniel High School, Clemson, South Carolina, USA;

(5) Feng Luo, School of Computing, Clemson University, Clemson, South Carolina, USA.


Related Work


A. Large Language Model Prompting

The remarkable success of LLMs, such as GPT and its variants, has also brought prompt-based learning to the fore. Prompt-based learning has found applications in various natural language processing (NLP) tasks, such as sentiment classification and natural language inference. Prompting vision-language models for computer vision tasks is another area that has started to gain attention [13], [23].

However, the existing body of work in prompt-based learning has primarily focused on unimodal tasks, with limited research delving into multimodal tasks [22], [30]. A few exceptions, such as [15], attempted multimodal tasks by utilizing the GPT-3 model [18] for science question answering, where each question has an image context, a text context, or both. However, these attempts have their own limitations: models like GPT-3 are less accurate and impose restrictive input-length constraints, limiting their performance and utility.

GPT-3.5-Turbo, which performs much better than GPT-3, provides an alternative solution. It presents a practical and efficient route for prompt-based learning in multimodal tasks. Our research therefore focuses on leveraging the capabilities of GPT-3.5-Turbo for classifying sentiments in memes, a task that requires the simultaneous processing of visual and textual data.
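To make the prompting setup concrete, the sketch below shows one way a meme could be presented to GPT-3.5-Turbo for sentiment classification: the meme's OCR'd text and a textual description of its image are packed into a chat-style prompt. The function name, field names, and prompt wording are illustrative assumptions, not the exact prompts used in this work.

```python
# Hypothetical sketch: build a chat-completions message list that asks a
# GPT-3.5-Turbo-style model to classify a meme's sentiment from its
# OCR'd text plus an image description. Prompt wording is illustrative.

def build_meme_prompt(ocr_text: str, image_description: str) -> list[dict]:
    """Return a messages list in the chat-completions format."""
    system = (
        "You are a sentiment classifier for memes. "
        "Given the meme's text and a description of its image, "
        "answer with exactly one label: positive, negative, or neutral."
    )
    user = (
        f"Meme text: {ocr_text}\n"
        f"Image description: {image_description}\n"
        "Label:"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_meme_prompt(
    ocr_text="When the WiFi finally reconnects",
    image_description="A person raising both arms in celebration",
)
# The request would then go to the chat completions endpoint, e.g.
# client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
```

Because GPT-3.5-Turbo accepts only text, the image must first be converted to text (OCR plus a captioning model), which is what makes the prompt above possible at all.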

Additionally, research indicates that when equipped with a range of external NLP tools [16], [17], Large Language Models (LLMs) can serve as effective action planners, selecting and utilizing tools for problem-solving. This suggests that these models can be extended to more complex multimodal scenarios involving both reasoning and action, thereby enhancing their capabilities.
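The "action planner" idea can be sketched minimally: the model emits an action string naming a tool, and a thin harness parses it and dispatches to the corresponding function. The tool names, stub implementations, and `name[argument]` action format below are assumptions for illustration, not APIs from the cited works.

```python
# Minimal illustrative sketch of an LLM-as-planner loop: the model's output
# names a tool, and the harness dispatches to it. Tools here are stubs.

def ocr_tool(image_path: str) -> str:
    # Stub standing in for a real OCR system.
    return f"<text extracted from {image_path}>"

def caption_tool(image_path: str) -> str:
    # Stub standing in for a real image-captioning model.
    return f"<caption for {image_path}>"

TOOLS = {"ocr": ocr_tool, "caption": caption_tool}

def dispatch(model_action: str, image_path: str) -> str:
    """Parse an action like 'ocr[meme.png]' and run the matching tool."""
    name = model_action.split("[", 1)[0].strip()
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](image_path)

observation = dispatch("caption[meme.png]", "meme.png")
# The observation would be fed back into the model's context for the
# next reasoning step, alternating reasoning and action.
```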

The GPT series has played a pivotal role in promoting prompt-based learning. We follow many of its core concepts in our work, but our focus is different. While most studies apply prompting to extract knowledge from pre-trained models, our goal is to employ these techniques directly on downstream tasks. Specifically, we explore their application to the sentiment classification of memes, a complex and nuanced multimodal task.

B. Hateful Meme Detection and Memotion Analysis Navigation

Detecting hateful content in memes is a difficult task. The subtle undertones of humor and sarcasm often interwoven within memes, coupled with the potential mismatch between image and text, present formidable obstacles to traditional text-based hate speech detection approaches. This challenge has been the focal point of several studies seeking viable solutions [25], [26], [31].

The Hateful Memes Challenge initiated by Facebook stands as a pioneering effort in this sphere, inviting researchers worldwide to develop models capable of identifying hate speech in multimodal meme content. The dataset provided by Facebook, consisting of 10,000 memes manually annotated for hate speech, has proven to be an invaluable resource propelling research in this domain.

Mirroring these endeavors, a model combining multitask learning was proposed to detect hateful and offensive content in memes. A crucial component of this model is its ability to help the GPT model interpret the images and perform hateful/non-hateful classification, thereby contributing to a better comprehension of hateful memes.

Despite these extensive efforts, hateful meme detection remains intricate and largely unsolved, owing to the nuanced interaction of language and imagery inherent in meme content. Ongoing research points to the need for more sophisticated models that can manage these complexities and accurately pinpoint harmful or offensive content within memes.

Memotion analysis sits at a captivating intersection of computer vision, natural language processing, and cognitive science. Its main aim is to unpack the intricate layers of meaning, sentiment, and emotion encapsulated within memes, an endeavor for which many researchers have sought to build computational models [27], [28].

A noteworthy stride in this area was SemEval 2020 Task 8: Multimodal Memotion Analysis [9], which tasked models with predicting sentiment, humor, sarcasm, and offensiveness in memes. Taking a similar direction, another researcher built an automatic meme generator capable of tailoring its output to a targeted sentiment [19].

In this project, we propose a similar framework for hateful meme detection with more detailed prompts, utilizing multimodal sentiment analysis and sarcasm detection to dissect the visual and textual cues present in memes. The insights gathered from our work shed light on the intricate facets of memotion analysis, showcasing both the potential and the limitations of AI in this field.

This paper is available on arxiv under CC 4.0 license.