TY - GEN
T1 - Interpretable binaural ratio for visually guided binaural audio generation
AU - Zheng, Tao
AU - Verma, Sunny
AU - Liu, Wei
PY - 2022
Y1 - 2022
N2 - Video and audio streams are essential and mutually complementary in multimedia immersive application scenarios. Recent studies have explored the application of deep neural networks to multimedia production, e.g., visually guided generation of binaural audio, where Difference Mask (DM) is the predominant strategy in the state-of-the-art (SOTA) work. However, this strategy is not interpretable and requires adding the ground truth output as the input, limiting applicability. Besides, the generated audio has a relatively low spatial sensation. This paper aims to develop an interpretable and robust approach to visually guided binaural audio generation. Specifically, we generalize a concept and new strategy from Difference Mask, named Binaural Ratio, to interpret its binaural property relevant to the Inter-aural Time Difference (ITD) and Inter-aural Level Difference (ILD). In the new strategy, the model input can be natural and arbitrary mono audio instead of the direct sum of left and right audio, i.e., the ground truth output. Moreover, we identify that one reason for the low spatial sensation is the bias toward mono. Thus, we tackle it by designing new network variants to learn the Binaural Ratio robustly. Experiments show that our proposed approach significantly outperforms the SOTA methods in both objective and subjective evaluation metrics.
UR - https://www.scopus.com/pages/publications/85140787217
U2 - 10.1109/IJCNN55064.2022.9892951
DO - 10.1109/IJCNN55064.2022.9892951
M3 - Conference proceeding contribution
AN - SCOPUS:85140787217
SN - 9781665495264
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2022 International Joint Conference on Neural Networks (IJCNN)
PB - Institute of Electrical and Electronics Engineers (IEEE)
CY - Piscataway, NJ
T2 - 2022 International Joint Conference on Neural Networks, IJCNN 2022
Y2 - 18 July 2022 through 23 July 2022
ER -