On 31 May at 14:15 Fatemeh Noroozi will defend his doctoral thesis „Multimodal Emotion Recognition BasedHuman-Robot Interaction Enhancement“.
Associate Professor, Gholamreza Anbarjafari, PhD
Professor Alvo Aabloo, PhD
Professor, PhD Javier Serrano García, Universitat Autònoma de Barcelona
Automatic multimodal emotion recognition is a fundamental subject of interest in affective computing. Its main applications are in human-computer interaction. The systems developed for the foregoing purpose consider combinations of different modalities, based on vocal and visual cues. This thesis takes the foregoing modalities into account, in order to develop an automatic multimodal emotion recognition system. More specifically, it takes advantage of the information extracted from speech and face signals. From speech signals, Mel-frequency cepstral coefficients, filter-bank energies and prosodic features are extracted. Moreover, two different strategies are considered for analyzing the facial data. First, facial landmarks' geometric relations, i.e. distances and angles, are computed. Second, we summarize each emotional video into a reduced set of key-frames. Then they are taught to visually discriminate between the emotions. In order to do so, a convolutional neural network is applied to the key-frames summarizing the videos. Afterward, the output confidence values of all the classifiers from both of the modalities are used to define a new feature space. Lastly, the latter values are learned for the final emotion label prediction, in a late fusion. The experiments are conducted on the SAVEE, Polish, Serbian, eNTERFACE'05 and RML datasets. The results show significant performance improvements by the proposed system in comparison to the existing alternatives, defining the current state-of-the-art on all the datasets. Additionally, we provide a review of emotional body gesture recognition systems proposed in the literature. The aim of the foregoing part is to help figure out possible future research directions for enhancing the performance of the proposed system. More clearly, we imply that incorporating data representing gestures, which constitute another major component of the visual modality, can result in a more efficient framework.