Abstract
We propose a method for emotion recognition through emotion-dependent speech
recognition using Wav2vec 2.0. Our method achieved a significant improvement
over most previously reported results on IEMOCAP, a benchmark emotion dataset.
Different types of phonetic units are employed and compared in terms of
accuracy and robustness of emotion recognition within and across datasets and
languages. Models of phonemes, broad phonetic classes, and syllables all
significantly outperform the utterance model, demonstrating that phonetic units
are helpful and should be incorporated into speech emotion recognition. The best
performance is obtained with broad phonetic classes. Further research is needed to
investigate the optimal set of broad phonetic classes for the task of emotion
recognition. Finally, we found that Wav2vec 2.0 can be fine-tuned to recognize
phonetic units that are coarser-grained or larger than phonemes, such as broad
phonetic classes and syllables.