1. Illium, S., Müller, R., Sedlmeier, A., and Linnhoff-Popien, C. 2020. Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms. Proc. Interspeech 2020, 2052–2056.

This study investigates the efficacy of various data augmentation techniques applied directly to mel-spectrogram representations of audio data for improving classification performance. The specific task addressed is the detection of surgical mask usage based on human speech signals, a relevant problem in paralinguistics and audio analysis.

We systematically evaluated the impact of data augmentation when training Convolutional Neural Networks (CNNs) for this binary classification task. The input to the networks consisted of mel-spectrograms derived from voice samples. The effectiveness of augmentation strategies (such as frequency masking, time masking, or combined approaches like SpecAugment) was assessed across four different CNN architectures.

Examples of mel-spectrograms of speech with and without a surgical mask
Mel-spectrogram representations of speech signals used as input for CNNs.


The core finding of this research is that applying appropriate data augmentation directly to the spectrogram inputs significantly enhances the performance and generalization capabilities of the CNN models for surgical mask detection. The augmented models demonstrated improved accuracy, robustness, and notably surpassed many established benchmark results from the relevant ComParE (Computational Paralinguistics Challenge) tasks. This highlights the importance of data augmentation as a crucial component in building effective deep learning models for audio classification, particularly when dealing with limited or variable datasets. For a detailed description of the methods and results, please refer to [Illium et al. 2020].

Diagrams illustrating the different CNN architectures tested
Overview of the different Convolutional Neural Network architectures evaluated.