Manuscript received November 15, 2022; revised November 29, 2022; accepted December 30, 2022.
Abstract—The objective of speech enhancement (SE) is to alleviate various types of distortion (noise, channel effects, reverberation, etc.) in received speech signals and thereby improve their perceptual quality and intelligibility. SE techniques are essential in speech-related online education and learning applications and devices. Thanks to the rapid development of deep neural network (DNN) techniques, various DNN-based SE methods have been proposed. They usually outperform conventional statistics-based SE methods in non-stationary environments. These DNN-based SE methods can be further divided into mapping-based and masking-based approaches. In particular, masking-based methods have attracted more attention in recent years.
This study focuses on improving a well-known masking-based method: the ideal ratio mask (IRM). We propose modifying the spectrogram of the input utterances used to train the IRM network in order to improve its speech enhancement performance. For each utterance, the magnitude spectrogram is first raised to a particular power (exponentiated) and then used to create various speech features, including Mel-frequency cepstral coefficients (MFCC) and gammatone-frequency features (GF). We feed these features to the deep network for IRM estimation. The exponentiation of the magnitude spectrogram is believed to highlight the speech portion of an utterance; the exponentiated spectrogram is therefore expected to benefit the subsequent speech feature representation used to train the deep neural network for IRM estimation. We conduct a series of evaluation experiments on a subset of the TIMIT database. The utterances in the training and test sets are corrupted by factory noise at a signal-to-noise ratio (SNR) of -2 dB. We use the Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) as the speech enhancement metrics.
The preliminary results reveal that, compared with the IRM from the original spectrogram, the new IRM created with the exponentiated spectrogram provides the test utterances with superior perceptual quality and intelligibility scores.
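As a rough illustration of the processing chain described in the abstract, the following sketch mixes a signal with noise at -2 dB SNR, computes magnitude spectrograms, raises the noisy magnitude spectrogram to a power, and builds an IRM training target. Several things here are assumptions rather than details taken from the paper: the exponent value `power=1.5`, the frame length and hop size, the synthetic sine and white-noise signals that stand in for TIMIT speech and factory noise, and the IRM definition itself, which follows one commonly used form, sqrt(S^2 / (S^2 + N^2)). The MFCC and GF feature extraction that would follow the exponentiation step is omitted.

```python
import numpy as np


def magnitude_spectrogram(x, frame_len=512, hop=256):
    """Magnitude STFT with a Hann window; returns (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def exponentiate(mag, power):
    """Raise a (non-negative) magnitude spectrogram to the given power."""
    return mag ** power


def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """Assumed IRM form: sqrt(S^2 / (S^2 + N^2)), with values in [0, 1]."""
    s2 = clean_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))


def scale_noise_to_snr(clean, noise, snr_db):
    """Scale the noise so that the clean-to-noise power ratio equals snr_db."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return noise * gain


# Synthetic stand-ins (assumptions): a sine for speech, white noise for factory noise.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noise = scale_noise_to_snr(clean, rng.standard_normal(sr), snr_db=-2.0)
noisy = clean + noise

# The network input would be derived from the exponentiated noisy spectrogram.
noisy_mag = magnitude_spectrogram(noisy)
noisy_mag_exp = exponentiate(noisy_mag, power=1.5)  # hypothetical exponent

# Training target: IRM computed from the separate clean and noise spectrograms.
irm = ideal_ratio_mask(magnitude_spectrogram(clean), magnitude_spectrogram(noise))
```

In this sketch only the input representation is exponentiated, while the mask target is computed from the unmodified clean and noise spectrograms; whether the paper applies the exponent to the target as well is not stated in the abstract.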
Index Terms—Speech enhancement, exponentiated spectrogram, ideal ratio mask
Jeih-Weih Hung, Chi-En Dai, Ping-Chen Wu, and Che-Wei Liao are with the Department of Electrical Engineering, National Chi Nan University, Taiwan.
*Correspondence: firstname.lastname@example.org (J.W.H.)
Cite: Jeih-Weih Hung, Chi-En Dai, Ping-Chen Wu, and Che-Wei Liao, "Employing the Exponentiated Magnitude Spectrogram in the Deep Learning-Based Mask Estimation for Speech Enhancement," International Journal of Machine Learning, vol. 13, no. 3, pp. 104-108, 2023.
Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).