A Time-Domain Pitch-Aware Speech Separation Method

王凯; 李鸣鹤; 黄志华; 黄浩

doi:10.13568/j.cnki.651094.651316.2021.01.07.0002


Updated: 2026-01-21

- Journal of Xinjiang University (Natural Science Edition in Chinese and English), Vol. 39, Issue 2, Pages 182-188 (2022)
- Author affiliation: School of Information Science and Engineering, Xinjiang University (新疆大学信息科学与工程学院)
- DOI: 10.13568/j.cnki.651094.651316.2021.01.07.0002
  CLC: TN912.3
- Published: 2022
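[1] 王凯, 李鸣鹤, 黄志华, et al. A time-domain pitch-aware speech separation method[J]. Journal of Xinjiang University (Natural Science Edition in Chinese and English), 2022, 39(02): 182-188. DOI: 10.13568/j.cnki.651094.651316.2021.01.07.0002.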


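Abstract (translated from the Chinese)

Conventional single-channel speech separation methods take only the mixture as input and separate it to obtain the target speaker's speech. Recent work has shown that injecting pre-estimated pitch information into the original mixture signal improves separation, but this approach was first applied in the time-frequency domain. In recent years, time-domain speech separation methods have been shown to outperform earlier time-frequency-domain methods. Motivated by this, this paper proposes a time-domain speech separation method with auxiliary pitch. The method first feeds the time-domain signal into a pre-separation module to generate pre-separated speech and extracts pitch from that speech; the extracted pitch is then concatenated with the original mixture and used as the input to a post-separation module for a second separation pass. Different pitch extraction methods and training strategies are evaluated. The separation experiments show that, when training the post-separation module, the best performance is obtained by first training an ideal separation network on the mixture fused with ideal pitch, then using RAPT to extract estimated pitch from the pre-separated sources, injecting it into the mixture, and fine-tuning the ideal separation network. This yields a 0.5 dB improvement over the Conv-TasNet baseline. The result shows that explicitly injecting auxiliary pitch information is effective not only in time-frequency-domain speech separation but also in time-domain speech separation.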

Abstract

In most speech separation methods, only the mixture is used as the input. The pitch-aware architecture, which was originally applied in the time-frequency (T-F) domain, injects pitch information into the original mixture to improve the separation result. Because time-domain speech separation has achieved much better performance than T-F domain separation, we investigate the effectiveness of auxiliary pitch information in time-domain speech separation. First, a pre-separation module is trained to generate pre-separated sources, from which pitches are extracted. The extracted pitches are then spliced with the original mixture as the input to a post-separation module. We evaluate different pitch trackers and training strategies. For training the post-separation module, pre-training on ideal pitches and then fine-tuning on estimated pitches extracted from the pre-separated sources with RAPT gives the best result, a 0.5 dB improvement over the Conv-TasNet baseline. This indicates that auxiliary pitch information, which has been shown to be effective in T-F domain speech separation, is also applicable to time-domain speech separation.
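The splicing step described in the abstract can be pictured with a small sketch. The abstract does not say how the pitch tracks are aligned or combined with the waveform, so everything below is an assumption made for illustration: frame-level F0 values in Hz (0 for unvoiced frames) are repeated up to the sample rate, scaled by a hypothetical `f0_max`, and stacked with the mixture as extra input channels. The function names `upsample_pitch` and `build_post_separation_input` are hypothetical and are not taken from the paper.

```python
import numpy as np


def upsample_pitch(f0_frames: np.ndarray, hop: int, n_samples: int) -> np.ndarray:
    """Repeat each frame-level F0 value `hop` times, then pad or trim to n_samples."""
    f0 = np.repeat(f0_frames, hop)
    if f0.shape[0] < n_samples:
        # Hold the last frame value (0.0 if there are no frames at all).
        last = f0[-1] if f0.size else 0.0
        f0 = np.concatenate([f0, np.full(n_samples - f0.shape[0], last)])
    return f0[:n_samples]


def build_post_separation_input(mixture, f0_tracks, hop, f0_max=500.0):
    """Stack the mixture with one normalized pitch channel per pre-separated source.

    Assumed layout (not from the paper): output shape is (1 + n_sources, n_samples),
    with row 0 holding the mixture waveform and the remaining rows holding F0 / f0_max
    clipped to [0, 1].
    """
    mixture = np.asarray(mixture, dtype=np.float32)
    n_samples = mixture.shape[0]
    channels = [mixture]
    for f0 in f0_tracks:
        up = upsample_pitch(np.asarray(f0, dtype=np.float32), hop, n_samples)
        channels.append(np.clip(up / f0_max, 0.0, 1.0).astype(np.float32))
    return np.stack(channels, axis=0)


# Toy usage: a silent 1000-sample "mixture" and two 3-frame pitch tracks.
mix = np.zeros(1000, dtype=np.float32)
x = build_post_separation_input(
    mix, [[100.0, 0.0, 250.0], [500.0, 600.0, 0.0]], hop=400
)
```

In a full system the pitch tracks would come from a tracker such as RAPT run on the pre-separated sources, and `x` would be fed to the post-separation network; both of those parts are outside this sketch.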


