1. Introduction
Visual-language tasks typically require models to comprehend and integrate features from different modalities for knowledge reasoning. In practical applications such as intelligent service robotics, visual-language tasks play a crucial role [1,2]. Visual question answering (VQA), one of the tasks within visual-language understanding, aims to answer textual questions based on provided images. An ideal VQA model should be able to comprehend and reason over both image and textual data.
However, recent research [3] indicates that many VQA methods tend to rely on superficial correlations between questions and answers, neglecting to extract accurate visual information from images to answer questions. As illustrated in Figure 1, there is often a notable discrepancy in the distribution of answers between the training and test sets of a VQA dataset. Furthermore, as highlighted in [4], similar bias issues arise from the visual modality. These biases, prevalent in the current VQA domain, mainly involve inherent language distribution biases between training and test sets, as well as incorrect visual-grounding shortcuts caused by salient visual regions [5,6].
Currently, prominent techniques to address biases involve integration-based [3], counterfactual-based [7,8], and contrastive-learning-based [9] methodologies. The integration-based approach lessens bias effects by jointly training two models, where one captures shallow or spurious associations, allowing the main model to concentrate on harder instances. Counterfactual-based techniques support training by producing counterfactual samples and supplementary visual annotations. Contrastive-learning-based techniques amplify the contribution of the question by generating negative image-question pairs from irrelevant images in the training data.
However, certain studies [10] have observed that the effectiveness improvement of some methods stems not from reasonable visual grounding but from an undisclosed regularization effect. Current methods focus on modeling dataset biases to mitigate their influence during de-biasing but overlook the model's ability for modal understanding and inference. Therefore, our approach aims to enhance the model's ability to comprehend multimodal information. We incorporate collaborative learning into multimodal training [11] to address bias issues in VQA and reduce their impact.
We frame VQA's bias problem as a scenario in which two parallel modalities participate in training, but one modality is absent or fails to fulfill its intended function. For instance, the question-answer shortcut bias refers to the model relying solely on the shortcut between the question and the answer, disregarding the relevant visual region; visual bias is the analogous phenomenon on the image side. Prior research has demonstrated that such models effectively omit visual modal information during reasoning: despite modal feature fusion, the model still disregards the image content in its final prediction and instead relies on the bias to answer the question.
Inspired by the concept of collaborative learning, we present an intuitive multimodal training approach to enhance the model's comprehension of visual and textual features. Our approach leverages the modalities to reinforce one another during training, thereby mutually aiding the training process. As illustrated in Figure 2, conventional VQA methods are susceptible to language and visual biases during training; when confronted with biased questions, they usually answer directly under the influence of the bias. In our CoD-VQA approach, the model first identifies the possible bias in the current example and its type. A collaborative-learning scheme then integrates both modalities equally so that they support each other's training. This enables the model to exploit multimodal knowledge thoroughly when making predictions and thus reduces bias.
In our experiments, we feed single visual or textual modal information to the model to obtain unimodal predictions. We then compare these unimodal predictions with the ground-truth answers to identify the missing modality in collaborative learning. Finally, we re-represent and integrate the missing modality to alleviate bias and enhance its participation in prediction.
Overall, we propose a collaborative-training de-biasing method. In our approach, the bias arising from the visual and textual modalities is viewed as a single underlying problem, leading us to propose a new fusion technique that tackles this problem by focusing on modal characteristics and adopting the principle of collaborative learning. During the training phase, we assess the parallel visual and textual modal data, identifying the "scarce" and "rich" modalities, and we artificially augment the role of the "scarce" modality to increase the model's awareness of its presence and reduce bias.
3. Methods
Figure 3 depicts an overview of our CoD-VQA, in which we consider the relationships among the visual modality, the textual modality, language bias, and visual bias to obtain more accurate modal representations and better model comprehension.
We examine the impacts of visual bias and language bias separately within the model. During training, we dynamically analyze sample pairs to identify the 'missing' modality, helping the model acknowledge and understand that modality, and we increase its participation in the model to remove the bias. The approach incorporates a 'bias detector' to identify the bias present. Once the bias type is determined, the model identifies the 'missing' modality and incorporates it as a 'third' modality in the overall modal fusion. During actual training, the fusion process depicted on the right side of Figure 3 occurs on only one side.
3.1. Definition of Bias
For the VQA task, conventional methods typically treat it as a multi-class classification problem. Given the training triplets $(v_i, q_i, a_i)$, in which $v_i \in V$ is the $i$-th image, $q_i \in Q$ the question, and $a_i \in A$ the answer, the primary aim is to train a mapping $f: V \times Q \rightarrow [0,1]^{|A|}$ that accurately allocates responses across the answer set $A$. When a classification layer with fixed parameters is given only one modality (either visual or textual) as input, the model still predicts an answer distribution. In our testing, we found that the model maintains a certain level of accuracy when provided solely with visual or question features, particularly when using UpDn as the baseline model. Following the insights of [4], we integrate the concept of visual bias into our approach and define it in Equation (1):
$$b_v = c_v(v_i), \tag{1}$$
where $b_v$ denotes the distribution of answers under visual bias, $v_i$ denotes the image, and $c_v$ denotes the vision-only classification network. We consider $b_v$ as the bias on the image side of the model.
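For concreteness, the following is a minimal sketch of such a vision-only bias branch (Equation (1)). It is our illustration rather than the paper's released implementation; the feature dimensions, pooling choice, and answer-vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class VisualBiasBranch(nn.Module):
    """Vision-only classifier c_v that maps pooled image features to an
    answer distribution, i.e., b_v = c_v(v) in Equation (1)."""
    def __init__(self, img_dim=2048, hidden_dim=1024, num_answers=3129):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(img_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, v):
        # v: (batch, num_regions, img_dim) region features, e.g., from UpDn
        pooled = v.mean(dim=1)          # simple mean pooling over regions
        return self.classifier(pooled)  # unnormalized answer logits b_v
```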
In summary, our approach considers visual bias as a complementary aspect of the bias problem in VQA and treats it, together with linguistic bias, as a multimodal collaborative de-biasing problem to be solved.
3.2. Multimodal Collaborative Learning in VQA
In this section, we focus on the concepts related to collaborative learning.
In multimodal scenarios, especially when modal resources are limited, it becomes crucial to accurately represent modal information as well as the multimodal knowledge required during model inference. Collaborative learning aims to utilize the knowledge of a relatively resource-rich modality to assist in modeling a resource-poor modality. Methods based on this concept can improve the representation performance of both multimodal and unimodal data. According to the type of training resources involved, collaborative learning methods can be categorized into three types:
Parallel data methods: With parallel data, the observations of one modality in the dataset are required to be directly associated with the observations of another modality. For example, in a video-audio dataset, the video and audio samples must come from the same speaker.
Nonparallel data methods: With nonparallel data, methods do not require a direct correlation between modalities; instead, they usually achieve co-learning through overlap at the category level. For example, in OK-VQA, the multimodal dataset is combined with out-of-domain knowledge from Wikipedia to improve the generalization of question answering.
Hybrid data methods: With hybrid data, different modalities are connected to each other through a shared modality. For example, in multilingual image captioning, the image modality always matches the caption in each language and serves as the intermediate modality that establishes correspondences between the different languages.
Overall, collaborative learning aims to utilize complementary information across modalities so that one modality can influence the other, thus creating better multimodal fusion models. In the next section, we will further describe how collaborative learning can be combined with VQA.
3.3. CoD-VQA
In VQA, the question-image pairs used for training tend to be strongly correlated: entity words in the questions typically have corresponding detection regions in the images. This suggests that the image and the question share the same semantics and that the model can make correct predictions only when the semantics of the two are unified. In our approach, we view the visual and language modalities as independent of each other, and all existing bias problems can be regarded as being caused by the model ignoring the role of one particular modality during prediction.
In this context, when the semantics of the data are relatively simple, the semantics shared by the modalities can be represented by a single modality, and the model can answer correctly based on that single modality alone. When the semantics require combining image and text, if the semantics of one modality are lost, then even a correct answer does not indicate that the model understands the multimodal knowledge. Therefore, we regard image and text as parallel data under collaborative learning: they are directly related in training, unified through semantics, and assist each other by supplementing complementary information.
For example, under the linguistic-bias condition, we can assume that the model obtains the answer directly from the question while ignoring the visual modality. In collaborative-learning terms, the textual modality acts as the "rich" modality and the visual modality as the "scarce" modality for this training instance. Similarly, under the visual-shortcut condition, the textual modality can be regarded as the "scarce" modality and the visual modality as the "rich" modality. The CoD-VQA algorithm consists of three steps (a minimal sketch of one training step follows the list):
Bias prediction: Single-branch predictions are made for the image $v_i$ and the question $q_i$ of each training instance to obtain the unimodal predictions $b_v$ and $b_q$.
Modality selection: Based on $b_v$ and $b_q$ obtained in the previous step, binary cross-entropy is computed against the ground-truth label to obtain the corresponding bias losses for the two modalities. Then, according to the magnitude of the losses and the output of the bias detector, we determine which of the image and text modalities is the "scarce" modality.
Modal fusion: After determining which modality is "scarce", we fix the "rich" modality and use modal fusion to obtain a new modal representation, which enhances the participation of the "scarce" modality in the joint representation.
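The sketch below illustrates how these three steps can be combined in one training step. It is purely illustrative: the module and variable names (visual_branch, question_branch, bias_detector, fuse, remap, fuse_scarce, classifier) are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cod_vqa_step(model, v, q, a):
    """One illustrative CoD-VQA training step (hypothetical interface).
    v: image features, q: question features, a: soft answer targets."""
    # 1) Bias prediction: unimodal branches give b_v and b_q.
    b_v = model.visual_branch(v)
    b_q = model.question_branch(q)

    # 2) Modality selection: the modality with the larger unimodal loss is
    #    treated as "scarce"; the bias detector scores the "rich" prediction.
    loss_v = F.binary_cross_entropy_with_logits(b_v, a)
    loss_q = F.binary_cross_entropy_with_logits(b_q, a)
    scarce, rich_pred = ('v', b_q) if loss_v > loss_q else ('q', b_v)
    biased = model.bias_detector(rich_pred).mean() > 0.5  # sample biased?

    # 3) Modal fusion: re-map the scarce modality and fuse it with the
    #    joint representation before the final prediction.
    joint = model.fuse(v, q)
    if biased:
        extra = model.remap(v if scarce == 'v' else q)   # Phi(x_s)
        joint = model.fuse_scarce(joint, extra)          # Equation (5)
    logits = model.classifier(joint)
    return F.binary_cross_entropy_with_logits(logits, a)
```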
3.4. Reducing Bias
In this section, we describe how we apply collaborative learning to the VQA de-biasing problem, with reference to Algorithm 1.
Algorithm 1: CoD-VQA
3.4.1. Bias Prediction
Similar to previous research, the most straightforward way to capture VQA bias is to train a model that accepts only one modal input and to use it as a bias-capturing branch in the overall model. Specifically, the unimodal bias branch can be represented as Equation (2):
$$b_m = c_m(m), \quad m \in \{v, q\}, \tag{2}$$
where $b_m$ ($m \in \{v, q\}$) denotes the bias under the unimodal branch, $c_m$ ($m \in \{v, q\}$) denotes the classification prediction layer used to obtain the prediction results, $v$ denotes the image, and $q$ denotes the question.
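Mirroring the vision-only branch sketched in Section 3.1, a question-only bias branch can be sketched as follows. Again, this is a hypothetical sketch: the encoder choice, vocabulary size, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class QuestionBiasBranch(nn.Module):
    """Question-only classifier c_q that maps an encoded question to an
    answer distribution, i.e., b_q = c_q(q) in Equation (2)."""
    def __init__(self, vocab_size=20000, emb_dim=300, hidden_dim=1024,
                 num_answers=3129):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, q_tokens):
        # q_tokens: (batch, seq_len) integer word indices
        emb = self.embedding(q_tokens)
        _, h = self.encoder(emb)               # h: (1, batch, hidden_dim)
        return self.classifier(h.squeeze(0))   # unnormalized logits b_q
```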
3.4.2. Selecting the “Scarce” Modality
The key step in our approach is to determine which modality is missing under the bias condition; we define this missing modality as the "scarce" modality. To be clear, the bias problem can be explained in terms of a missing modality that arises while the model processes biased samples, so it is not reasonable to designate the "scarce" modality during training by an artificial, fixed definition. Therefore, in our approach, we utilize the bias predictions defined in the previous subsection to assist the judgment. Specifically, after obtaining the unimodal biases, we calculate the cross-entropy loss between them and the correct answers and determine which modality should be treated as "scarce" according to the magnitude of the resulting losses. The specific process is defined in Algorithm 1 as Equations (3) and (4):
$$\mathcal{L}_m = \mathrm{BCE}(b_m, a), \quad m \in \{v, q\}, \tag{3}$$
$$s_m = \begin{cases} 1, & \text{if } \mathcal{L}_m > \mathcal{L}_{m'},\ m' \neq m,\ \text{and } D(b_{m'}) = 1, \\ 0, & \text{otherwise}, \end{cases} \tag{4}$$
where $\mathcal{L}_m$ ($m \in \{v, q\}$) denotes the loss of the corresponding single-branch bias after cross-entropy computation with the true answer, and $s_m$ ($m \in \{v, q\}$) indicates the "scarce" modality identified by the method, with an initial value of 0. $D$ represents the bias detection classifier, which detects whether the prediction of the current unimodal branch should be considered biased. In our approach, we intuitively determine the "rich" modality by comparing the losses corresponding to the biases: for a biased sample pair, one unimodal prediction is usually already close to the true result, which corresponds to a lower loss, and the other modality can then be considered the "scarce" modality. However, this reasoning assumes that all training samples are biased, whereas in reality not all samples are biased, and the presence of bias in a sample does not always have a negative impact on training. Therefore, we introduce the $D$ classifier as a bias detector to determine the degree of bias in the current sample.
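The following sketch shows one possible form of the bias detector $D$ together with the selection rule of Equations (3) and (4). The detector architecture, its input (unimodal answer logits), and the decision threshold are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasDetector(nn.Module):
    """Small classifier D that scores how biased a unimodal prediction is."""
    def __init__(self, num_answers=3129, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_answers, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, b_m):
        # b_m: unimodal answer logits; returns a bias probability in [0, 1]
        return torch.sigmoid(self.net(b_m))

def select_scarce_modality(b_v, b_q, a, detector, threshold=0.5):
    """Equations (3)-(4): compare unimodal BCE losses and consult D."""
    loss_v = F.binary_cross_entropy_with_logits(b_v, a)
    loss_q = F.binary_cross_entropy_with_logits(b_q, a)
    s_v = int(loss_v.item() > loss_q.item()
              and detector(b_q).mean().item() > threshold)
    s_q = int(loss_q.item() > loss_v.item()
              and detector(b_v).mean().item() > threshold)
    return s_v, s_q  # 1 marks the "scarce" modality
```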
3.4.3. Modality Fusion
After identifying the "scarce" modality, we re-map and fuse the modalities. Inspired by the work of CF-VQA [15], we consider the bias induced by each single modality and its direct impact on the model to be mutually independent. Consequently, we re-map the features of the "scarce" modality and fuse them with the original modality, as represented in Equation (5):
$$h = \mathrm{Fus}\big(x,\ \Phi(x_s)\big), \tag{5}$$
where $h$ denotes the newly fused mixed modality, $x$ denotes the original modal representation, $x_s$ denotes the feature of the "scarce" modality, and $\Phi$ represents the mapping layer used for feature handling. The mapping layer employs a conventional fully connected neural network (FCNet) comprising two standard linear layers stacked sequentially.
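A minimal sketch of the mapping layer and fusion in Equation (5) is given below. The two-linear-layer structure follows the FCNet description above, while the element-wise product used as the fusion operator and the feature dimension are our assumptions.

```python
import torch
import torch.nn as nn

class FCNet(nn.Module):
    """Mapping layer Phi: two stacked linear layers, as described above."""
    def __init__(self, in_dim=1024, out_dim=1024):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

class ScarceFusion(nn.Module):
    """Re-maps the 'scarce' modality feature and fuses it with the original
    representation (element-wise product chosen for illustration)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.phi = FCNet(dim, dim)

    def forward(self, x, scarce_feat):
        return x * self.phi(scarce_feat)  # h = Fus(x, Phi(x_s))
```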
During the training process, we adopt a two-stage approach to update the different phases of the algorithm, as illustrated in Figure 4. In the first training stage, the model determines the "scarce" and "rich" modalities based on the modeled biases and the bias detector, and it updates the relevant parameters of the bias detector. In the second training stage, based on the identified modalities, we perform a new round of modality fusion so that the model can recognize and predict from different modality sources, and we update the classification layers used for prediction.
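One way to realize these two stages is to update disjoint parameter groups per stage, as sketched below. The submodule names follow the hypothetical interface used in the sketch of Section 3.3 and remain assumptions; how the bias detector is supervised is not shown here.

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze parameter groups for the two training stages.
    Submodule names (bias_detector, remap, fuse_scarce, classifier)
    are illustrative assumptions."""
    for p in model.parameters():
        p.requires_grad = False
    if stage == 1:
        # Stage 1: only the bias detector is updated.
        for p in model.bias_detector.parameters():
            p.requires_grad = True
    else:
        # Stage 2: only the fusion and prediction layers are updated.
        for module in (model.remap, model.fuse_scarce, model.classifier):
            for p in module.parameters():
                p.requires_grad = True
```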
5. Discussion and Conclusions
Visual question answering (VQA) has emerged as a key task within multimodal research, marking a foundational step toward the realization of true artificial intelligence entities. This study explores modal fusion methods in VQA contexts [28,29,30] and suggests that similar approaches could be beneficial for other multimodal tasks, such as image captioning, especially in identifying biases. Long-tail distributions of answers in datasets and biases due to missing modal information in images represent unavoidable challenges in VQA development. Unlike prior work, this study addresses a fundamental issue in multimodal tasks: the model's comprehension across different modalities, highlighting the necessity of overcoming dataset limitations and bias to fully capture multimodal interactions.
In this paper, we have introduced a de-biasing model for VQA based on multimodal collaborative training. Our approach considers image and text features in VQA as equally important modalities and employs the concept of collaborative learning to assist each other in training, mitigating bias issues from a modality feature perspective. Specifically, within the de-biasing process, we defined symmetrical language and visual biases, categorizing the reasons behind biases as attentional deficits of modality information during model predictions. Subsequently, we further utilized the concept of collaborative learning to define the missing “scarce” modality during training. By leveraging mutual assistance among modalities in training, we aimed to achieve better modal fusion and feature representation, thereby addressing bias issues. Our extensive experiments conducted on benchmark datasets, VQA-CP v2 and VQA v2, and the novel de-biased dataset VQA-VS, demonstrate the effectiveness of our CoD-VQA method in tackling bias-related problems.
In summary, we have developed a multimodal collaborative de-biasing algorithm that, while adopting a modal fusion approach to bias mitigation, still faces certain limitations. Primarily, the dataset on which the algorithm is based does not encompass all real-world scenarios, leading to challenges in generalizing to “unusual” questions in practical contexts. Moreover, the algorithm’s effectiveness hinges on precise bias detection and the modality fusion’s performance under biased conditions. Given the complexity and variability of real-world scenarios, the model may not capture these nuances effectively. Future work could focus on enhancing dynamic bias detection and modal fusion techniques to ensure broader robustness in VQA applications.