This was my bachelor's (undergraduate) thesis from 2009.
Abstract. In this paper, we describe audio ``texture'' features based on the Short Time Fourier Transform (STFT). We use these features in combination with three popular machine learning algorithms to classify spoken voice segments of the popular Electronic Dance Music radio show ``A State of Trance'', which is produced by the current world number 1 DJ, Armin van Buuren.
The aim was to distinguish accurately when Armin van Buuren was talking, regardless of background silence, music or other voices (sung or spoken).
We achieved strong empirical results, which could be further improved with some basic domain-specific heuristics or compromises on the feature parameters. SVM and Bayesian Logistic Regression produced particularly encouraging results, both yielding ~98% overall classification accuracy and ~99% F-score on the speech class on the model with the most verbose feature set. SVM, however, provided the most robust performance across several feature variations, significantly outperforming the others when the feature set was less verbose.
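For readers unfamiliar with the two metrics quoted in the abstract, here is a minimal sketch of how overall accuracy and a per-class F-score can be computed from predicted labels. It assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the function name and the label strings `"speech"` and `"other"` are hypothetical and not taken from the thesis.

```python
def accuracy_and_f_score(y_true, y_pred, positive="speech"):
    """Return (overall accuracy, F1 score for the `positive` class).

    Assumption: "F-score" means F1. The label names are illustrative only.
    """
    pairs = list(zip(y_true, y_pred))
    # Confusion counts with `positive` treated as the class of interest.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)
```

Because speech and non-speech segments are unlikely to be equally frequent in a radio show, the per-class F-score can differ noticeably from the overall accuracy, which is one reason to report both.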
Spectrogram of human speech over music.
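A spectrogram like this one is the magnitude of the STFT, and the ``texture'' features mentioned in the abstract summarise how that magnitude behaves over a longer window of frames. The sketch below illustrates the general idea using NumPy only. The specific per-frame features (spectral centroid, rolloff and flux), the 85% rolloff threshold, the frame sizes and all function names are assumptions chosen for illustration; the abstract does not state which features or parameters the thesis actually used.

```python
import numpy as np


def stft_magnitude(signal, frame_len=512, hop=256):
    """Magnitude STFT with a Hann window, shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def texture_features(mag, sample_rate, texture_len=40):
    """Mean and standard deviation of per-frame features over texture windows.

    Assumed per-frame features: spectral centroid, 85% rolloff, spectral flux.
    Returns an array of shape (n_windows, 6).
    """
    n_bins = mag.shape[1]
    freqs = np.fft.rfftfreq(2 * (n_bins - 1), d=1.0 / sample_rate)

    # Spectral centroid: magnitude-weighted mean frequency of each frame.
    total = mag.sum(axis=1) + 1e-12
    centroid = (mag * freqs).sum(axis=1) / total

    # Rolloff: lowest frequency below which 85% of the magnitude lies.
    cumulative = np.cumsum(mag, axis=1)
    rolloff_idx = (cumulative >= 0.85 * cumulative[:, -1:]).argmax(axis=1)
    rolloff = freqs[rolloff_idx]

    # Flux: Euclidean distance between consecutive magnitude spectra.
    flux = np.concatenate([[0.0], np.sqrt((np.diff(mag, axis=0) ** 2).sum(axis=1))])

    per_frame = np.column_stack([centroid, rolloff, flux])

    # Group frames into non-overlapping texture windows and summarise each one.
    n_windows = per_frame.shape[0] // texture_len
    blocks = per_frame[: n_windows * texture_len].reshape(n_windows, texture_len, -1)
    return np.hstack([blocks.mean(axis=1), blocks.std(axis=1)])
```

Each row of the result describes one texture window and could be passed to a classifier such as an SVM. Using more per-frame features or additional summary statistics is one way to make the feature set more verbose.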