Diphthong Synthesis

Diphthong production entails a swift transition of the vocal tract configuration from one vowel posture to another within a short time frame. Accurately modeling these dynamic articulatory movements is crucial for natural-sounding speech synthesis and for the clinical assessment of voice quality. Numerous vocal tract acoustic models have been proposed in the literature; however, most focus on static tract geometries for producing isolated vowel sounds. Only recently have efforts been made to address dynamic articulatory configurations. In this work, we introduce a two-dimensional (2D) dynamic vocal tract model that employs the Immersed Boundary Method (IBM) to synthesize diphthongs.

Method

Speech production involves continuous movement of articulators, such as the jaw, lips, and tongue, to shape the vocal tract geometry and its area function (as shown below) and produce sound. Capturing these dynamic changes accurately is particularly important for modeling diphthongs, which require smooth interpolation between distinct vowel configurations. Although high-fidelity Finite Element (FE) models (Arnela et al., 2019) can simulate these transitions with great precision, their high computational cost makes them impractical for real-time or large-scale applications. Alternatively, acoustic wave modeling approaches that employ regular domain discretization, such as the Digital Waveguide Mesh (DWM) and the Finite-Difference Time-Domain (FDTD) method, are computationally lightweight. For instance, Gully et al. (2017) applied a heterogeneous DWM, converting 2D/3D rectilinear meshes into admittance maps of airway and tissue properties and then interpolating between vowel-specific admittance maps to synthesize diphthongs. However, regular grid discretization of the computational domain forces the complex vocal tract boundaries to be approximated in a staircased manner. We propose a dynamic 2D vocal tract model that combines a 2D FDTD scheme with a unique IBM originally developed for large-scale virtual and room acoustic simulations (Bilbao, 2022).
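The smooth interpolation between vowel configurations mentioned above can be sketched as follows. This is an illustrative example, not the authors' implementation: the area values are hypothetical, and the raised-cosine transition schedule is one plausible choice for smooth articulator motion.

```python
import numpy as np

def interpolate_area_functions(area_start, area_end, n_steps):
    """Smoothly interpolate between two vocal tract area functions.

    area_start, area_end: 1D arrays of cross-sectional areas (cm^2),
    sampled at the same tube sections. Returns an (n_steps, n_sections)
    array giving the tract shape at each step of the transition.
    """
    area_start = np.asarray(area_start, dtype=float)
    area_end = np.asarray(area_end, dtype=float)
    # Raised-cosine schedule: articulator motion starts and stops smoothly
    # instead of jumping between postures at a constant rate.
    t = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, n_steps)))
    return area_start[None, :] + t[:, None] * (area_end - area_start)[None, :]

# Hypothetical (not measured) area values for two vowel postures:
a_vowel1 = np.array([2.6, 1.9, 1.1, 0.9, 1.6, 3.2, 4.8, 5.0])
a_vowel2 = np.array([0.8, 1.2, 2.4, 3.6, 4.2, 3.0, 1.4, 0.9])
frames = interpolate_area_functions(a_vowel1, a_vowel2, n_steps=100)
```

Each row of `frames` is then a vocal tract shape for one time step of the diphthong transition; the endpoints reproduce the two vowel postures exactly.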

Our wave solver (2D IBM) frames the IBM within a 2D FDTD scheme, using a dual set of discrete forcing terms in both the continuity and motion equations. As in classic IB methods, the acoustic wave equations are solved on a fixed Eulerian grid, while the vocal tract's sagittal contours are approximated by a fixed set of Lagrangian points acting as the immersed boundary, eliminating the need for staircasing.
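To make the Eulerian side of this scheme concrete, the sketch below shows a minimal 2D FDTD update on a staggered grid, with comments marking where the dual IB forcing terms would enter. This is a textbook leapfrog acoustic solver, not the authors' solver: the grid size, spacing, and excitation are illustrative assumptions, and the IB forcing itself is omitted.

```python
import numpy as np

c, rho = 343.0, 1.2          # sound speed (m/s), air density (kg/m^3)
dx = 5e-3                    # grid spacing (m), assumed for illustration
dt = dx / (c * np.sqrt(2))   # time step at the 2D CFL stability limit
nx, ny, n_steps = 64, 64, 200

p = np.zeros((nx, ny))       # pressure on the fixed Eulerian grid
vx = np.zeros((nx + 1, ny))  # staggered x-velocity components
vy = np.zeros((nx, ny + 1))  # staggered y-velocity components

p[nx // 2, ny // 2] = 1.0    # impulsive excitation at the grid center

for _ in range(n_steps):
    # Motion (momentum) equation update; one IB forcing term, spread from
    # the Lagrangian boundary points, would be added on the right here.
    vx[1:-1, :] -= (dt / (rho * dx)) * (p[1:, :] - p[:-1, :])
    vy[:, 1:-1] -= (dt / (rho * dx)) * (p[:, 1:] - p[:, :-1])
    # Continuity equation update; the dual IB forcing term enters here.
    p -= rho * c**2 * (dt / dx) * (
        (vx[1:, :] - vx[:-1, :]) + (vy[:, 1:] - vy[:, :-1])
    )
```

In the full model, the forcing terms couple this Eulerian update to the Lagrangian boundary points tracing the sagittal tract contour, so the boundary never has to be snapped to grid cells.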

Dynamic vocal tract area function transition

Synthesis Results

Synthesis of [ɔɪ] as in "boy"

2D IBM:
3D DWM:
Recorded:

Synthesis of [aʊ] as in "now"

2D IBM:
3D DWM:
Recorded:

Synthesis of [eɪ] as in "day"

2D IBM:
3D DWM:
Recorded:

References

[1] (Arnela et al., 2019) MRI-based vocal tract representations for the three-dimensional finite element synthesis of diphthongs.
[2] (Gully et al., 2017) Diphthong synthesis using the dynamic 3D digital waveguide mesh.
[3] (Bilbao, 2022) Immersed boundary methods in wave-based virtual acoustics.