Data: MuSe-CAR

A description of the data packages, partitioning, and features is available here.


A single dataset is used for all tasks in MuSe 2020, reducing complexity and making the tasks more appealing to participants. We introduce the novel dataset MuSe-CAR, which covers the range of aforementioned desiderata. MuSe-CAR is a large multimodal dataset gathered in-the-wild to further the understanding of multimodal sentiment analysis, e.g., the emotional engagement that takes place during product reviews (here, automobile reviews), where a sentiment is linked to a topic or entity.

The estimated age range of the professional, semi-professional (influencer), and casual reviewers is from the mid-20s to the late 50s. Most are native English speakers from the UK or the US, while a small minority are non-native yet fluent English speakers. MuSe-CAR was designed to be of high voice and video quality, as both informative social-media video content and everyday recording devices have improved in recent years. This enables robust learning even with a high degree of novel, in-the-wild characteristics, for example as related to:

  • Video: Shot size (a mix of close-up, medium, and long shots), face angle (side, eye, low, high), camera motion (free, free but stable, free but unstable, switching, e.g., zoom, and fixed), reviewer visibility (full body, half body, face only, and hands only), highly varying backgrounds, and people interacting with objects (car parts).

  • Audio: Ambient noises (car noises, music), narrator and host diarisation, diverse microphone types, and speaker locations.

  • Text: Colloquialisms and domain-specific terms.

To avoid a large number of purely objective reviews, selectors rated the videos on a scale from 0 (emotionless) to 5 (very emotional), and we filtered out all videos with a score below 4 before annotation. In total, 300 videos (> 35 hours) from 70+ English-speaking reviewers have been annotated with the consent of the creators, resulting in a rich, high-quality user-generated multimedia database.
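As an illustrative sketch of this selection step, the pre-annotation filter could look as follows. The metadata structure and field names here are assumptions for illustration, not the organisers' actual tooling:

```python
# Hypothetical pre-annotation filter: keep only videos that the
# selectors rated sufficiently emotional (scale 0-5, threshold 4).
# The dict layout is an illustrative assumption.

def filter_emotional_videos(videos, min_score=4):
    """Return only the videos whose selector rating is >= min_score."""
    return [v for v in videos if v["emotionality"] >= min_score]

candidates = [
    {"id": "vid_001", "emotionality": 5},
    {"id": "vid_002", "emotionality": 2},  # mostly objective -> dropped
    {"id": "vid_003", "emotionality": 4},
]

kept = filter_emotional_videos(candidates)
print([v["id"] for v in kept])  # ['vid_001', 'vid_003']
```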

Each recording has been annotated in three continuous dimensions, each by 5 independent annotators: emotional arousal and valence (hence reflecting sentiment) according to Russell's theory, and, as a novel aspect, trustworthiness. Trustworthiness reflects how balanced, honest, and knowledgeable the viewer perceives the information given by the reviewer on a particular topic to be; a link between valence, excitement, and perceived trustworthiness has recently been discovered, although trustworthiness has never before been utilised or predicted using machine learning. In our data, we have observed that online opinions are slightly shifted towards the positive side of the valence and arousal spectra. However, nearly all reviewers also talk about negative aspects, which leads to a more balanced dataset in terms of labels. Additional contributions of MuSe-CAR include the categorical labelling of emotional engagement with speaker topics, such as comfort, safety, interior, and performance.
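Since each dimension is rated by 5 independent annotators, the individual continuous traces must be fused into a single gold-standard signal per recording. The sketch below uses a plain per-time-step mean; the values and the fusion method are illustrative assumptions (challenge baselines may instead weight annotators by their agreement, e.g., evaluator-weighted averaging):

```python
# Fuse continuous annotation traces (e.g., valence over time) from
# several annotators into one gold-standard signal via a simple mean.
# A real pipeline might weight annotators by inter-rater agreement.

def fuse_traces(traces):
    """traces: list of equal-length per-annotator value lists."""
    n = len(traces)
    length = len(traces[0])
    assert all(len(t) == length for t in traces), "traces must align in time"
    return [sum(t[i] for t in traces) / n for i in range(length)]

# Five annotators, four time steps (values in [-1, 1], made up)
annotations = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, 0.2, 0.4, 0.4],
    [0.2, 0.2, 0.2, 0.5],
    [0.1, 0.3, 0.3, 0.3],
    [0.1, 0.1, 0.3, 0.4],
]

gold = fuse_traces(annotations)
print([round(v, 2) for v in gold])  # [0.1, 0.2, 0.3, 0.4]
```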