MuSe Guidelines

Data packages

The organisers provide the following data for each Sub-challenge:

    • audiovisual recordings

    • baseline features

    • labels and metadata

Training, development, and test data/feature partitions are available; the test labels will only be made available after the end of the challenge. Data access can be requested here.

The total duration of data for each Sub-challenge varies, as further pre-processing was applied to retain only the most informative data. For MuSe-Wild and MuSe-Trust, all parts with an active voice or a visible face are included. For MuSe-Wild and MuSe-Topic, we excluded video segments not related to the product (e.g. advertisements) to minimise the distortion these could cause on the task objectives. More specifically, for MuSe-Topic we only included sections that have an active voice based on the sentence transcriptions (see the baseline paper for more information).

Data partitioning

For the MuSe 2020 Challenge, the data has been partitioned into training, development, and test sets, taking into account aspects such as emotional ratings, speakers, and duration (see the baseline paper for more information). Participants must adhere to the given definition of the training, development, and test sets.

Recordings are provided for all partitions, but labels are not available for the test partition and must be inferred automatically. No manual intervention of any kind on the test partition is permitted; test results must be solely the result of a fully automatic process without any human intervention.
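As a minimal sketch of how the fixed partitioning can be respected in code, the following assumes a hypothetical metadata file (metadata/partition.csv with columns id and partition); the actual file names and columns are those shipped with the data packages.

    import pandas as pd

    # Hypothetical partition file: one row per recording id and its assigned split.
    # The actual file name and column names are those of the released data packages.
    partition = pd.read_csv("metadata/partition.csv")

    train_ids = partition.loc[partition["partition"] == "train", "id"].tolist()
    devel_ids = partition.loc[partition["partition"] == "devel", "id"].tolist()
    test_ids = partition.loc[partition["partition"] == "test", "id"].tolist()

    # Models are trained on train_ids and tuned on devel_ids; test_ids are only
    # used to generate the fully automatic predictions that are submitted.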

Baseline features

We provide a wide range of model-ready feature sets, extracted from the audiovisual recordings with open-source toolboxes. They span multiple levels of representation:

    • video:

        • unsupervised deep representations (Xception)

        • low-level face descriptors (VGGFace)

        • facial action units (OpenFace)

        • 2D facial landmarks (OpenFace)

        • 3D facial landmarks (OpenFace)

        • head pose (OpenFace)

        • pose (OpenPose)

        • localisation representation of 30 car parts (GoCAR)

    • text:

        • conventional word embeddings (FastText)

        • contextual word embeddings (ALBERT)

    • audio:

        • low-level descriptors (openSMILE)

        • extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS; a minimal extraction sketch follows below)

        • deep spectrum representations (Deep Spectrum)

We also provide aligned features (see the baseline paper for more information).
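The acoustic features above are provided ready to use; purely for orientation, here is a minimal sketch of how comparable eGeMAPS functionals can be extracted with the open-source opensmile Python package. The package, feature-set variant (eGeMAPSv02), and file path are assumptions and need not match the organisers' exact configuration.

    import opensmile

    # eGeMAPS functionals via the open-source opensmile package (audEERING).
    # Illustrative only: not necessarily the exact configuration of the
    # official baseline features.
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,
        feature_level=opensmile.FeatureLevel.Functionals,
    )

    # Returns a pandas DataFrame with one row of 88 functionals for the file.
    features = smile.process_file("example.wav")  # placeholder path
    print(features.shape)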

Baseline scripts

The baselines can be reproduced using the scripts in the GitHub repository.

External systems

  • Participants may use the raw recordings as well as pre-processed features (e.g. scene, background, audio, or body pose features) in addition to the provided information.

  • Participants are free to use external data for training in addition to the MuSe data. However, any such use must be reproducible (e.g. the features) and clearly described in the accompanying paper. Participants are also free to use any commercial or academic feature extractors, pre-trained networks, and libraries (see the sketch below for an example).
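As an example of using an academic pre-trained network, the sketch below extracts contextual word embeddings with a pre-trained ALBERT model through the Hugging Face transformers library; the checkpoint (albert-base-v2) and the example sentence are illustrative choices, not the organisers' setup.

    import torch
    from transformers import AlbertModel, AlbertTokenizer

    # Illustrative checkpoint; the ALBERT variant used for the baseline may differ.
    tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
    model = AlbertModel.from_pretrained("albert-base-v2")
    model.eval()

    sentence = "The steering feels precise on winding roads."  # placeholder text
    inputs = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Token-level contextual embeddings of shape (1, num_tokens, hidden_size).
    token_embeddings = outputs.last_hidden_state
    print(token_embeddings.shape)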

Submit test predictions

The number of submissions of test results is limited to five trials per team and Sub-challenge. The predicted labels for the test sets must follow the original format of the label files provided in the training and development set directory (aggregated).
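As a minimal sketch of assembling a submission file, assuming a simple id/prediction layout; the authoritative column names and file structure are those of the label files shipped with the training and development partitions.

    import pandas as pd

    # Placeholder values; in practice these come from the fully automatic pipeline.
    test_segment_ids = ["id_001", "id_002", "id_003"]
    model_outputs = [0.12, -0.35, 0.78]

    # Hypothetical layout: mirror the format of the provided train/devel label files.
    predictions = pd.DataFrame({"id": test_segment_ids, "prediction": model_outputs})
    predictions.to_csv("test_predictions.csv", index=False)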

Baseline paper

A paper introducing the MuSe 2020 challenge and baseline models.