TECHNICAL PROGRAM SCHEDULE
- 11/22/2023
Each poster ID has the form [Poster Session]-[Physical/Virtual][Poster Number]-[Topic], for example 1-P4-SLP-ST, where
[Poster Session] is the number of the poster session (1 to 6),
[Physical/Virtual] is P for a physical poster or V for a virtual poster,
[Poster Number] is the number of the poster within the session,
[Topic] is the technical area of the work, as follows:
Topic ID | Technical Area |
ASR | 01. Automatic speech recognition |
ASR-TM | 01. Automatic speech recognition -> Training methods |
ASR-MA | 01. Automatic speech recognition -> Model architectures |
ASR-RB | 01. Automatic speech recognition -> Robustness |
ASR-SM | 01. Automatic speech recognition -> Streaming models |
SLP | 02. Spoken language processing |
SLP-SLU | 02. Spoken language processing -> Spoken language understanding |
SLP-ST | 02. Spoken language processing -> Speech translation |
SLP-SDS | 02. Spoken language processing -> Spoken dialog systems |
SES | 03. Speech enhancement and separation |
ANA | 04. Speech analysis |
SLR | 05. Speaker and language recognition |
DIA | 06. Speaker diarization |
TLP | 07. Text-only language processing |
MMP | 08. Multimodal speech processing |
MLP | 09. Multilingual processing |
EMR | 10. Emotion recognition and paralinguistics |
TTS | 11. Speech synthesis and spoken language generation |
RES | 12. Resources (new corpora, toolkits, evaluation metrics, etc.) |
MLS | 13. Machine learning for speech applications |
SS01 | SS01. Audio visual speech enhancement challenge 2 |
SS02 | SS02. ML-SUPERB |
SS03 | SS03. Model adaptation for low resource ASR for Indian languages |
SS04 | SS04. Multi-channel multi-party meeting transcription |
SS06 | SS06. Singing voice conversion challenge |
SS07 | SS07. VoiceMOS challenge |
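The ID scheme described above can be checked mechanically. The following is a minimal Python sketch, not an official tool of the conference: the names `POSTER_ID`, `PosterId`, and `parse_poster_id` are hypothetical, and the pattern assumes that the session number is 1 to 6, that `P` and `V` stand for physical and virtual posters, and that a topic is an upper-case code with at most one sub-topic suffix (as in the Topic ID table). Demonstration IDs such as D-P1-DEMO are outside this pattern and are rejected.

```python
import re
from typing import NamedTuple

# Assumed pattern: <session 1-6>-<P or V><poster number>-<topic>[-<sub-topic>]
POSTER_ID = re.compile(
    r"^(?P<session>[1-6])-(?P<mode>[PV])(?P<number>\d+)"
    r"-(?P<topic>[A-Z0-9]+(?:-[A-Z]+)?)$"
)


class PosterId(NamedTuple):
    session: int   # poster session number, 1 to 6
    mode: str      # "Physical" or "Virtual"
    number: int    # position of the poster within the session
    topic: str     # topic ID, e.g. "ASR-TM" or "SS03"


def parse_poster_id(poster_id: str) -> PosterId:
    """Split a poster ID into its parts; raise ValueError if it does not fit."""
    match = POSTER_ID.match(poster_id.strip())
    if match is None:
        raise ValueError(f"not a recognised poster ID: {poster_id!r}")
    mode = "Physical" if match["mode"] == "P" else "Virtual"
    return PosterId(int(match["session"]), mode, int(match["number"]), match["topic"])


if __name__ == "__main__":
    print(parse_poster_id("1-P4-SLP-ST"))
    print(parse_poster_id("2-V3-ASR"))
```

Returning a named tuple keeps the parts addressable by name, so a listing can be grouped by session or by topic without re-parsing the ID string.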
Poster Session 1
12.17.2023 / 10:30-12:30
Chair: Thomas Hain, Hao Tang
Poster ID | Paper Title | Paper ID |
Physical | ||
1-P1-SLP | SLM: Bridging the Thin Gap Between Speech and Text Foundational Models | 423 |
1-P2-SLP | Deriving Translational Acoustic Sub-Word Embeddings | 340 |
1-P3-SLP | Summarize while Translating: Universal Model with Parallel Decoding for Summarization and Translation | 415 |
1-P4-SLP-ST | Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments | 174 |
1-P5-SLP-ST | Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach | 207 |
1-P6-SLP-ST | Enhancing Expressivity Transfer in Textless Speech-to-Speech Translation | 367 |
1-P7-SLP-ST | A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability | 342 |
1-P8-SLP-SDS | Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking | 47 |
1-P9-SLP-SLU | Simulation of Teacher-Learner Interaction in English Language Pronunciation Learning | 318 |
1-P10-SLP-SLU | Whisper-SLU: Extending a Pretrained Speech-to-Text Transformer for Low Resource Spoken Language Understanding | 78 |
1-P11-SLP-SLU | Towards a Unified End-to-End Language Understanding System for Speech and Text Inputs | 299 |
1-P12-SLP-SLU | Few-Shot Spoken Language Understanding via Joint Speech-Text Models | 414 |
1-P13-TLP | Pareto Efficiency of Learning-Forgetting Trade-Off in Neural Language Model Adaptation | 66 |
1-P14-TLP | Adversarial Augmentation for Adapter Learning | 245 |
1-P15-TLP | Enabling Noisy Label Usage for Out-of-Airspace Data in Read-Back Error Detection | 362 |
1-P16-TLP | Enhancing Task-Oriented Dialogues with Chitchat: A Comparative Study Based on Lexical Diversity and Divergence | 172 |
1-P17-MLS | Preserving Phonemic Distinctions for Ordinal Regression: A Novel Loss Function for Automatic Pronunciation Assessment | 110 |
1-P18-MLS | Reducing the Cost of Spoof Detection Labeling Using Mixed-Strategy Active Learning and Pretrained Models | 370 |
1-P19-MLS | Joint Audio and Speech Understanding | 383 |
1-P20-MLS | Variational Gaussian Process Data Uncertainty | 22 |
1-P21-MLS | Towards Matching Phones and Speech Representations | 99 |
1-P22-MLS | FedCPC: An Effective Federated Contrastive Learning Method for Privacy Preserving Early-Stage Alzheimer’s Speech Detection | 201 |
1-P23-MLS | Joint Energy-Based Model for Robust Speech Classification System Against Dirty-Label Backdoor Poisoning Attacks | 381 |
1-P24-MLS | Clustering Unsupervised Representations as Defense Against Poisoning Attacks on Speech Commands Classification System | 324 |
1-P25-MLS | Can We Use Speaker Embeddings on Spontaneous Speech Obtained from Medical Conversations to Predict Intelligibility? | 51 |
Virtual | ||
1-V1-SLP-SLU | Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding | 11 |
1-V2-SLP-SLU | Generalized Zero-Shot Audio-to-Intent Classification | 336 |
1-V3-TLP | Prompt Pool Based Class-Incremental Continual Learning for Dialog State Tracking | 252 |
1-V4-MLS | Robust Logarithmic Champernowne Algorithm for Feedback Cancellation in Hearing Aids | 344 |
1-V5-MLS | Brouhaha: Multi-Task Training for Voice Activity Detection, Speech-to-Noise Ratio, and C50 Room Acoustics Estimation | 223 |
1-V6-MLS | Multitask Learning Model with Text and Speech Representation for Fine-Grained Speech Scoring | 327 |
Poster Session 2
12.17.2023 / 16:00-18:00
Chair: Frank Seide, Andreas Stolcke
Poster ID | Paper Title | Paper ID |
Physical | ||
2-P1-ASR | Knowledge Distillation from Offline to Streaming Transducer: Toward Accurate and Fast Streaming Model by Matching Alignments | 303 |
2-P2-ASR | Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition | 96 |
2-P3-ASR | Can Unpaired Textual Data Replace Synthetic Speech in ASR Model Adaptation? | 109 |
2-P4-ASR | Acoustic Model Fusion for End-to-End Speech Recognition | 417 |
2-P5-ASR | Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition | 48 |
2-P6-ASR | Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data | 146 |
2-P7-ASR | Exploring Data Augmentation in Bias Mitigation Against Non-Native-Accented Speech | 185 |
2-P8-ASR | The Role of Feature Correlation on Quantized Neural Networks | 58 |
2-P9-ASR | The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning | 85 |
2-P10-ASR | Mask-Conformer: Augmenting Conformer with Mask-Predict Decoder | 103 |
2-P11-ASR | Efficient Cascaded Streaming ASR System via Frame Rate Reduction | 112 |
2-P12-ASR | Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference | 155 |
2-P13-ASR | Speaker Adaptation for End-to-End Speech Recognition Systems in Noisy Environments | 228 |
2-P14-ASR | End-to-End Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis | 235 |
2-P15-ASR | Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-Task Speech Recognition | 239 |
2-P16-ASR | Cross-Modal Alignment with Optimal Transport for CTC-Based ASR | 263 |
2-P17-ASR | Zero-Shot Domain-Sensitive Speech Recognition with Prompt-Conditioning Fine-Tuning | 413 |
2-P18-ASR | Adapting Pretrained Speech Model for Mandarin Lyrics Transcription and Alignment | 275 |
2-P19-ASR | Ending The Blind Flight: Analyzing The Impact of Acoustic And Lexical Factors on Wav2Vec 2.0 in Air-Traffic Control | 320 |
2-P20-ASR | A Token-Wise Beam Search Algorithm for RNN-T | 177 |
2-P21-ASR | GPU-Accelerated WFST Beam Search Decoder for CTC-Based Speech Recognition | 238 |
2-P22-ASR | Two-Pass Endpoint Detection for Speech Recognition | 375 |
2-P23-SS03 | Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge | 368 |
2-P24-SS03 | Gated Multi Encoders and Multitask Objectives for Dialectal Speech Recognition in Indian Languages | 473 |
6-P5-SLR | MASR: Multi-Label Aware Speech Representation Learning | 399 |
Virtual | ||
2-V1-ASR | Contextual Spelling Correction with Large Language Models | 82 |
2-V2-ASR | Spike-Triggered Contextual Biasing for End-to-End Mandarin Speech Recognition | 287 |
2-V3-ASR | U2-KWS: Unified Two-Pass Open-Vocabulary Keyword Spotting with Keyword Bias | 294 |
2-V4-ASR | CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-Based Speech Recognition | 248 |
2-V5-ASR | Domain Adaptation by Data Distribution Matching via Submodularity for Speech Recognition | 420 |
2-V6-ASR | Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR | 382 |
Poster Session 3
12.18.2023 / 10:30-12:30
Chair: Gil Keren, Xin Lei
Poster ID | Paper Title | Paper ID |
Physical | ||
3-P1-ASR | Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation | 425 |
3-P2-ASR-MA | Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition | 332 |
3-P3-ASR-MA | ED-CEC: Improving Rare Word Recognition Using ASR Post-Processing Based on Error Detection and Context-Aware Error Correction | 105 |
3-P4-ASR-MA | Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation | 246 |
3-P5-ASR-MA | Multi Transcription-Style Speech Transcription Using Attention-Based Encoder-Decoder Model | 130 |
3-P6-ASR-MA | SA-Paraformer: Non-Autoregressive End-to-End Speaker-Attributed ASR | 316 |
3-P7-ASR-MA | LV-CTC: Non-autoregressive ASR with CTC and Latent Variable Models | 59 |
3-P8-ASR-MA | Discriminative Speech Recognition Rescoring with Pre-trained Language Models | 392 |
3-P9-ASR-RB | Locality Enhanced Dynamic Biasing and Sampling Strategies for Contextual ASR | 76 |
3-P10-ASR-RB | FAT-HuBERT: Front-End Adaptive Training of Hidden-Unit BERT for Distortion-Invariant Robust Speech Recognition | 90 |
3-P11-ASR-RB | Pseudo-Label Based Supervised Contrastive Loss for Robust Speech Representations | 406 |
3-P12-ASR-SM | Hierarchical Attention-Based Contextual Biasing for Personalized Speech Recognition Using Neural Transducers | 347 |
3-P13-ASR-TM | Low-Rank Adaptation of Neural Language Model Rescoring for Speech Recognition | 26 |
3-P14-ASR-TM | Generative ASR Error Correction with Large Language Models | 171 |
3-P15-ASR-TM | MelHuBERT: A Simplified HuBERT on Mel Spectrograms | 183 |
3-P16-ASR-TM | Improved Long-Form Speech Recognition by Jointly Modeling the Primary and Non-Primary Speakers | 379 |
3-P17-ASR-TM | Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning | 108 |
3-P18-ASR-TM | Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition | 111 |
3-P19-ASR-TM | AWMC: Online Test-Time Adaptation without Mode Collapse for Continual Adaptation | 190 |
3-P20-ASR-TM | Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model | 265 |
3-P21-ASR-TM | Consistency Based Unsupervised Self-Training for ASR Personalisation | 295 |
3-P22-ASR-TM | Joint Federated Learning and Personalization for On-Device ASR | 181 |
3-P23-ASR-TM | Efficient Text-Only Domain Adaptation for CTC-Based ASR | 274 |
3-P24-MLP | Building High-Accuracy Multilingual ASR with Gated Language Experts and Curriculum Training | 354 |
3-P26-MLP | MUST: A Multilingual Student-Teacher Learning Approach for Low-Resource Speech Recognition | 440 |
3-P27-MLP | On Decoder-Only Architecture for Speech-to-Text and Large Language Model Integration | 301 |
3-V5-ASR-TM | On The Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition | 361 |
3-V6-ASR-TM | End-to-End Training of a Neural HMM with Label and Transition Probabilities | 53 |
3-V7-ASR-TM | Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers | 337 |
Virtual | ||
3-V1-ASR-MA | Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition | 456 |
3-V2-ASR-MA | Improving Large-Scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer | 73 |
3-V3-ASR-MA | BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition | 412 |
3-V4-ASR-RB | No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation | 385 |
3-V8-MLP | LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-Switching ASR | 175 |
3-P25-MLP | Improving Multilingual and Code-switching ASR using Large Language Model Generated Text | 65 |
Poster Session 4
12.18.2023 / 16:00-18:00
Chair: Rohit Prabhavalkar, Xugang Lu
Poster ID | Paper Title | Paper ID |
Physical | ||
4-P1-SES | LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models | 12 |
4-P2-SES | Towards Robust Packet Loss Concealment System with ASR-Guided Representations | 290 |
4-P3-SES | Deep Learning for Joint Acoustic Echo and Acoustic Howling Suppression in Hybrid Meetings | 140 |
4-P4-SES | On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments | 142 |
4-P5-SES | A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction | 373 |
4-P6-SES | NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network for Acoustic Echo Cancellation and Speech Enhancement | 143 |
4-P7-SES | Toward Universal Speech Enhancement for Diverse Input Conditions | 244 |
4-P8-SES | Exploring Time-Frequency Domain Target Speaker Extraction for Causal and Non-Causal Processing | 380 |
4-P9-SES | Improving Speech Enhancement Using Audio Tagging Knowledge from Pre-Trained Representations and Multi-Task Learning | 411 |
4-P10-ANA | Paraconsistent Feature Analysis for the Competency Evaluation of Voice Impersonation | 302 |
4-P11-ANA | Not All Errors Are Created Equal: Evaluating the Impact of Model and Speaker Factors on ASR Outcomes in Clinical Populations | 83 |
4-P12-ANA | Detection of Vowel Errors in Children's Speech Using Synthetic Phonetic Transcripts | 308 |
4-P13-ANA | Transferring Speech-Generic and Depression-Specific Knowledge for Alzheimer's Disease Detection | 343 |
4-P14-ANA | Spectral Tilt May Have a Smaller Impact on the Intelligibility of Speech in Noise | 445 |
4-P16-ANA | MiniSUPERB: Lightweight Benchmark for Self-Supervised Speech Models | 395 |
4-P17-MMP | Cross-Modal learning for CTC-Based ASR: Leveraging CTC-BERTScore and Sequence-Level Training | 323 |
4-P18-MMP | Parameter-Efficient Cross-Language Transfer Learning for a Language-Modular Audiovisual Speech Recognition | 333 |
4-P19-MMP | FLAP: Fast Language-Audio Pre-Training | 360 |
4-P20-MMP | Improving Audiovisual Active Speaker Detection in Egocentric Recordings with the Data-Efficient Image Transformer | 387 |
4-P21-MMP | Audio-Visual Neural Syntax Acquisition | 409 |
4-P22-MLS | NeuralKalman: A Learnable Kalman Filter for Acoustic Echo Cancellation | 134 |
4-P23-SS01 | Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction | 170 |
Virtual | ||
4-V1-SES | An Exploration of Task-Decoupling on Two-Stage Neural Post Filter for Real-Time Personalized Acoustic Echo Cancellation | 424 |
4-V2-SES | MBTFNet: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement | 50 |
4-V3-SES | VSANet: Real-Time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention | 119 |
4-V4-SES | Magnitude-and-Phase-Aware Speech Enhancement with Parallel Sequence Modeling | 224 |
4-V5-ANA | Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis | 433 |
4-V6-MMP | Acoustics-Text Dual-Modal Joint Representation Learning for Cover Song Identification | 97 |
4-V7-MMP | Boosting Modality Representation with Pre-Trained Models and Multi-Task Training for Multimodal Sentiment Analysis | 272 |
4-V8-SS04 | PP-MeT: A Real-World Personalized Prompt Based Meeting Transcription System | 220 |
4-V9-SS04 | The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0): A Benchmark for Speaker-Attributed ASR | 422 |
4-P15-ANA | Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection | 394 |
Poster Session 5
12.19.2023 / 10:30-12:30
Chair: Berrak Sisman, Tomoki Toda
Poster ID | Paper Title | Paper ID |
Physical | ||
5-P1-TTS | Using Joint Training Speaker Encoder with Consistency Loss to Achieve Cross-Lingual Voice Conversion and Expressive Voice Conversion | 127 |
5-P2-TTS | QuickVC: A Lightweight VITS-Based Any-to-Many Voice Conversion Model Using iSTFT for Faster Conversion | 128 |
5-P3-TTS | Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization | 123 |
5-P4-TTS | Invert-Classify: Recovering Discrete Prosody Inputs for Text-to-Speech | 312 |
5-P5-TTS | Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction | 9 |
5-P6-TTS | Toward General-Purpose Text-Instruction-Guided Voice Conversion | 204 |
5-P7-TTS | Improving Severity Preservation of Healthy-to-Pathological Voice Conversion with Global Style Tokens | 233 |
5-P8-TTS | PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models | 271 |
5-P9-TTS | Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-Supervised Setting | 276 |
5-P10-TTS | E3 TTS: Easy End-to-End Diffusion-Based Text to Speech | 352 |
5-P11-TTS | WaveNeXt: ConvNeXt-Based Fast Neural Vocoder without iSTFT Layer | 441 |
5-P12-TTS | Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing Aided Alignments | 453 |
5-P13-TTS | Zero-Shot Singing Voice Synthesis from Musical Score | 268 |
5-P14-TTS | Diffusion-Based Mel-Spectrogram Enhancement For Personalized Speech Synthesis with Found Data | 165 |
5-P15-SS06 | The Singing Voice Conversion Challenge 2023 | 64 |
5-P16-SS06 | A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023 | 403 |
5-P17-SS07 | LE-SSL-MOS: Self-Supervised Learning MOS Prediction with Listener Enhancement | 192 |
5-P18-SS07 | The VoiceMOS Challenge 2023: Zero-Shot Subjective Speech Quality Prediction for Multiple Domains | 258 |
Virtual | ||
5-V1-TTS | CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers | 27 |
5-V2-TTS | HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-Form TTS | 430 |
5-V3-TTS | SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation | 436 |
5-V4-TTS | PromptSpeaker: Speaker Generation Based on Text Descriptions | 429 |
5-V5-TTS | BiSinger: Bilingual Singing Voice Synthesis | 36 |
5-V6-SS06 | VITS-Based Singing Voice Conversion Leveraging Whisper and Multi-Scale F0 Modeling | 435 |
5-V7-SS06 | VITS-Based Singing Voice Conversion System with DSPGAN Post-Processing for SVCC2023 | 478 |
5-V8-SS07 | SQAT-LD: Speech Quality Assessment Transformer Utilizing Listener Dependent Modeling for Zero-Shot Out-of-Domain MOS Prediction | 166 |
5-V9-SS07 | KAQ: A Non-Intrusive Stacking Framework for Mean Opinion Score Prediction with Multi-Task Learning | 313 |
Poster Session 6
12.19.2023 / 16:00-18:00
Chair: Carlos Busso, Nicholas Cummins
Poster ID | Paper Title | Paper ID |
Physical | ||
6-P1-SLR | Meta-Learning Framework for End-to-End Imposter Identification in Unseen Speaker Recognition | 124 |
6-P2-SLR | Model-Based Fairness Metric for Speaker Verification | 253 |
6-P3-SLR | Generative Linguistic Representation for Spoken Language Identification | 284 |
6-P4-SLR | ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings | 326 |
6-P5-SLR | MASR: Multi-Label Aware Speech Representation Learning | 399 |
6-P6-SLR | Extending Self-Distilled Self-Supervised Learning for Semi-Supervised Speaker Verification | 404 |
6-P7-DIA | Robust End-to-End Diarization with Domain Adaptive Training and Multi-Task Learning | 77 |
6-P8-DIA | Transformer Attractors for Robust and Efficient End-to-End Neural Diarization | 304 |
6-P9-DIA | Semi-Supervised Multi-Channel Speaker Diarization with Cross-Channel Attention | 468 |
6-P11-EMR | Active Learning Based Fine-Tuning Framework for Speech Emotion Recognition | 30 |
6-P12-EMR | Identifying People with Mild Cognitive Impairment at Risk of Developing Dementia Using Speech Analysis | 32 |
6-P13-EMR | Robust Recognition of Speaker Emotion with Difference Feature Extraction Using a Few Enrollment Utterances | 44 |
6-P14-EMR | Improved Multi-modal Emotion Recognition using Squeeze-and-Excitation Block in Cross-Modal Attention | 72 |
6-P15-EMR | Detecting Speech Abnormalities with a Perceiver-Based Sequence Classifier That Leverages a Universal Speech Model | 80 |
6-P16-EMR | Combining Relative and Absolute Learning Formulations to Predict Emotional Attributes From Speech | 144 |
6-P17-EMR | Speech Emotion Diarization: Which Emotion Appears When? | 93 |
6-P18-RES | RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain | 101 |
6-P19-RES | ESPNet-SUMM: Introducing a Novel Large Dataset, Toolkit, and a Cross-Corpora Evaluation of Speech Summarization Systems | 145 |
6-P20-RES | Transcribing And Aligning Conversational Speech: A Hybrid Pipeline Applied to French Conversations | 198 |
6-P21-RES | Study on the Correlation between Objective Evaluations and Subjective Speech Quality and Intelligibility | 264 |
6-P22-RES | Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-Based Control | 282 |
6-P23-RES | LibriSpeech-PC: Benchmark For Evaluation of Punctuation And Capitalization Capabilities of End-to-End ASR Models | 330 |
6-P24-RES | TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for PyTorch | 339 |
6-P25-RES | YODAS: YouTube-Oriented Dataset for Audio and Speech | 391 |
6-P26-RES | H_Eval: A New Hybrid Evaluation Metric for Automatic Speech Recognition Tasks | 447 |
6-P27-SS02 | Thai-Dialect: Low Resource Thai Dialectal Speech to Text Corpora | 135 |
6-P28-SS02 | Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning | 156 |
6-P29-SS02 | Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond | 163 |
6-P30-SS02 | Evaluating Self-Supervised Speech Models on a Taiwanese Hokkien Corpus | 331 |
6-P31-SS02 | Leveraging The Multilingual Indonesian Ethnic Languages Dataset in Self-Supervised Model for Low-Resource ASR Task | 426 |
Virtual | ||
6-PV1-SLR | Haha-Pod: An Attempt for Laughter-Based Non-Verbal Speaker Verification | 211 |
6-PV2-SLR | CAMSAT: Augmentation Mix and Self-Augmented Training Clustering for Self-Supervised Speaker Recognition | 243 |
6-PV3-SLR | VoiceExtender: Short-Utterance Text-Independent Speaker Verification with Guided Diffusion Model | 266 |
6-PV4-RES | Wiki-En-ASR-Adapt: Large-Scale Synthetic Dataset for English ASR Customization | 54 |
6-P10-MMP | AV-data2vec: Self-Supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations | 206 |
Demonstration Session
12.20.2023 / 10:30-11:30
Prof. Carlos Busso
Poster ID | Paper Title |
Physical | |
D-P1-DEMO | Towards Streaming Speech-to-Avatar Synthesis |
D-P2-DEMO | NYCUKA: A Self-Disclosure Mental Health Spoken Dialogue System |