Mel-Vision Transformer

Reference

Illium, S., Müller, R., Sedlmeier, A., and Linnhoff-Popien, C. 2021. Visual Transformers for Primates Classification and Covid Detection. Proc. Interspeech 2021, 451–455.

Approach

This work applies the vision transformer to mel-spectrograms of audio recordings, adding mel-based data augmentation and sample weighting. On the ComParE21 challenge, the resulting model surpasses many of the single-model baselines. The authors also introduce overlapping vertical patching and analyze how different parameter configurations affect performance, which indicates that the model adapts well to audio processing tasks. [Illium et al. 2021]
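To make the idea of overlapping vertical patching concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a "vertical patch" is a strip covering the full mel axis and a fixed number of time frames, and that consecutive strips overlap because the hop is smaller than the strip width. The function name `overlapping_vertical_patches` and the parameters `patch_width` and `hop` are hypothetical and chosen only for illustration.

```python
import numpy as np


def overlapping_vertical_patches(spec, patch_width=8, hop=4):
    """Split a (n_mels, n_frames) spectrogram into flattened full-height strips.

    Assumption: each patch spans all mel bins and `patch_width` frames;
    strips overlap whenever hop < patch_width.
    """
    if patch_width <= 0 or hop <= 0:
        raise ValueError("patch_width and hop must be positive")
    n_mels, n_frames = spec.shape
    if n_frames < patch_width:
        raise ValueError("spectrogram is shorter than one patch")

    # All windows along the time axis: (n_mels, n_frames - patch_width + 1, patch_width)
    windows = np.lib.stride_tricks.sliding_window_view(spec, patch_width, axis=1)
    # Keep every `hop`-th window: (n_mels, n_patches, patch_width)
    strips = windows[:, ::hop, :]
    # One row per patch, flattened to n_mels * patch_width values
    return strips.transpose(1, 0, 2).reshape(strips.shape[1], -1)


# Toy input standing in for a mel-spectrogram with 64 mel bins and 32 frames
spec = np.arange(64 * 32, dtype=np.float32).reshape(64, 32)
tokens = overlapping_vertical_patches(spec, patch_width=8, hop=4)
print(tokens.shape)
```

Each row of `tokens` would then be passed through a linear projection to form the transformer's input sequence. With `hop` equal to `patch_width` the strips no longer overlap, which gives a simple way to compare the two patching schemes under this sketch's assumptions.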