One exciting thing I worked on very early on was creating my own custom voice with Microsoft's voice font feature for text to speech. In the years since, we have seen some amazing advancements in neural text to speech, which gives much better output.

When I found out neural custom voices were available, I couldn't resist trying one out. I was eager to see how accurate and natural the results would be compared to my previous attempts, which, although recognisable, were not quite there.

For those who are unfamiliar, Microsoft Neural TTS is a cloud-based service that aims to provide highly realistic and expressive speech synthesis using deep neural networks. Voice Fonts, on the other hand, allow users to create custom voices by training the TTS engine on their own voice recordings.

To begin my experiment, I signed up for Microsoft Azure Cognitive Services and followed the provided guidelines for recording my 50 utterances. The process was relatively straightforward, with clear instructions on how to maintain consistent pitch, tone, and pacing throughout the recordings. The utterances are recorded right in the portal, which is super easy. I did use a much better quality microphone than in my attempts several years ago. The training is pretty quick, but before you can use the font you need to record a short statement confirming that you agree to have your voice replicated.

The moment of truth arrived as I typed a sample text into the Neural TTS interface and selected my custom voice. Listen to the results yourself:



[Audio file, 0:28: An excerpt from Mary Shelley's Frankenstein]
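
The portal handles voice selection through a text box and a dropdown, but the same request can also be described in SSML (Speech Synthesis Markup Language), the W3C markup that TTS services accept. The snippet below is only a minimal sketch of what such a document could look like, not what was actually run for the clip above. The function `build_ssml` and the voice name `MyCustomVoiceNeural` are hypothetical placeholders; a real custom voice would have whatever name it was given when it was trained.

```python
from xml.sax.saxutils import escape, quoteattr

# Standard SSML namespace defined by the W3C specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, voice_name: str, lang: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document that selects one voice.

    `voice_name` is whatever the voice was called at training time;
    the value used in the demo below is a made-up placeholder.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        # escape() protects characters such as & and < in the input text.
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    # Hypothetical voice name, for illustration only.
    print(build_ssml("Hello from my custom voice.", "MyCustomVoiceNeural"))
```

This only builds the markup. Sending it to the service would additionally need a Speech resource key and the details of the deployed voice, which are left out here.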


This quick experiment has left me eager to explore further and see how much more realistic the Voice Font can become with additional recordings on the professional version, although that needs up to 1000 of them. I'm also looking forward to seeing how Microsoft and other companies continue to push the boundaries of TTS technology in the coming years. It looks like a similar feature will be built natively into the next Apple iOS release later this year. The future of personalized, expressive speech synthesis is here, and it'll be interesting to see exactly how it gets used... for good and bad.