Multimodal Sentiment Analysis (MuSe) 2020 is a challenge-based workshop focusing on the tasks of sentiment recognition, as well as topic engagement and trustworthiness detection, by more comprehensively bridging the audio-visual and language modalities. The purpose of the Multimodal Sentiment Analysis in Real-life media Challenge and Workshop (MuSe) is to bring together communities from different disciplines: mainly, the audio-visual emotion recognition community (signal-based) and the sentiment analysis community (symbol-based). In numerous competitions in recent years, we have observed a recurring pattern: members of the first group - mostly rooted in the field of Affective (& Behavioural) Computing and specialised in intelligent signal processing - focus on one or both of the audio and vision modalities in order to predict the continuous-valued valence and arousal dimensions of emotion (the circumplex model of affect), while often disregarding the potential contribution of textual information. The second group - rooted in the field of Sentiment (& Opinion) Mining and specialised in NLP methods for symbolic information analysis - leverages the text modality and focuses on predicting only discrete sentiment label categories. Recently, however, the two communities have begun to converge, and both are currently strongly influenced by deep learning methods.
One of the goals of MuSe is to provide, at this timely moment, a challenge that attracts both communities equally and encourages a fusion of the demonstrated advantages of each. Ideally, we aim for participation that strives towards unified approaches applicable to what we perceive as synergistic tasks that nevertheless arose from different academic traditions: on the one hand, complex, dimensional emotion annotations that reflect a broad variety of emotions, grounded in the psychological and social sciences relating to the expression of behaviour; on the other, the linking of emotions to topics (context), entities, or aspects, as is common in sentiment analysis.
A second motivation of MuSe is to compare the merits of each of the three core modalities (audio, visual, and textual cues), as well as various approaches to multimodal fusion, under well-defined and strictly comparable conditions. We believe this will help establish the extent to which a fusion of approaches is possible and beneficial, and advance emotion recognition systems towards handling fully naturalistic (in-the-wild) behaviour in large volumes of realistic, user-generated data - the new generation of data used for real-world multimedia affect and sentiment analysis.
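For time-continuous emotion dimensions such as arousal and valence, a metric commonly used to establish such strictly comparable conditions in challenges of this kind is the concordance correlation coefficient (CCC), which penalises both poor correlation and shifts in mean or scale between prediction and gold standard. A minimal sketch (the function name and interface are illustrative, not the challenge's official evaluation script):

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D sequences.

    Equals 1 only for perfect agreement; unlike Pearson correlation,
    it also penalises differences in mean and variance.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

Note that a constant offset between prediction and target lowers the CCC even though the Pearson correlation would remain 1.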
We plan to achieve the above within MuSe 2020 through three separate sub-challenges:
Multimodal Sentiment in-the-Wild Sub-challenge (MuSe-Wild): participants predict the level of the emotional dimensions arousal and valence in a time-continuous manner from audio-visual recordings. Timestamps enabling modality alignment and fusion at the word, sentence, and utterance level, as well as several acoustic and visual annotations (e.g., facial action units, gestures) and pre-computed features, are provided.
Multimodal Emotion-Target Sub-challenge (MuSe-Topic): participants predict 10-class domain-specific speaker topics (e.g., performance, comfort, safety) and 3-class emotions on conversational topics. In addition to the features above, a domain-specific visual entity recognition and localisation model is provided, enabling easy, domain-specific use of visual features.
Multimodal Trustworthiness Sub-challenge (MuSe-Trust): participants predict the level of trustworthiness of user-generated audio-visual content in a sequential manner, and are encouraged to explore the relationship between trustworthiness and the emotional dimensions (arousal and valence) in depth and at large scale.
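The word-level alignment mentioned for MuSe-Wild can be sketched as follows: given per-word start and end timestamps, frame-level acoustic (or visual) features are pooled over each word's time span, yielding one vector per word that can then be fused (e.g., concatenated) with word embeddings. All names and the choice of mean pooling here are illustrative assumptions, not the challenge's prescribed pipeline:

```python
import numpy as np

def align_frames_to_words(frame_feats, frame_times, word_spans):
    """Mean-pool frame-level features over each word's (start, end) interval.

    frame_feats: (n_frames, dim) array of per-frame features
    frame_times: (n_frames,) array of frame timestamps in seconds
    word_spans:  list of (start, end) tuples, one per word
    Returns an (n_words, dim) array; words with no frames get zeros.
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    frame_times = np.asarray(frame_times, dtype=float)
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:
            aligned.append(np.zeros(frame_feats.shape[1]))
    return np.stack(aligned)
```

Word-level fused features could then be formed by, e.g., `np.concatenate([acoustic_words, word_embeddings], axis=1)` before feeding them to a sequence model.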