#84 Nvidia's AI-powered speech synthesis model

Technology never stands still, and in this post on the "In the IT World" blog, readers and guests will learn about a text-to-speech technology that Nvidia is developing with the help of artificial intelligence. It is one more step into a digital world in which the human voice, as in the famous fairy tale "The Wolf and the Seven Kids", is imitated by computational algorithms.

We hope you find it an enjoyable and useful read!

NVIDIA has developed an AI model that imitates a live human voice

A group of Nvidia researchers is looking at the practical application of artificial intelligence from a new angle. They are exploring the technology's potential in speech synthesis, and they already have impressive results to show.

At the Interspeech 2021 conference, one of Nvidia's research teams presented an artificial intelligence (AI) model called RAD-TTS. The model reads text aloud with striking naturalness and fluency, and it also lets a speaker take part in a conversation using a converted voice.

The synthetic voice can be tuned by parameters such as gender, speed, rhythm, and even the energy of the speaker's intonation. The speech model also includes a training and refinement mode, which is activated when the user dictates text in their own natural voice.

The results are markedly better than those of comparable technologies presented so far. The company has showcased the technology in several videos from its I AM AI series. In these videos the narration is read by a synthetic voice generated with Nvidia's artificial intelligence. The researchers took particular care to match the voice as closely as possible to the style and tone of each video.

“Using the interface we offer, a video producer can record themselves reading the video script and then use the artificial intelligence model to convert their speech into, for example, the voice of a narrator. Starting from this baseline narration, the producer can then direct the AI much as a director guides a voice actor, tuning the synthetic voice to emphasize specific words and changing the pace of the narration to match the overall tone of the video,” Nvidia explained.
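To make the idea of "directing" a synthetic voice more concrete, here is a minimal toy sketch in Python. It is not the RAD-TTS or NeMo API. The `WordProsody` class and the `plan_from_script`, `emphasize`, and `set_tempo` helpers are hypothetical names invented for this illustration, and the sketch assumes that emphasis and tempo can be described as simple per-word scale factors for duration and energy. A real model works with far richer acoustic features.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class WordProsody:
    """Hypothetical per-word controls; not part of RAD-TTS or NeMo."""
    word: str
    duration_scale: float = 1.0  # >1.0 stretches the word (slower), <1.0 shortens it
    energy_scale: float = 1.0    # >1.0 makes the word louder / more emphatic


def plan_from_script(script: str) -> List[WordProsody]:
    """Start with neutral controls for every whitespace-separated word."""
    return [WordProsody(w) for w in script.split()]


def emphasize(plan: List[WordProsody], target: str,
              energy_gain: float = 1.3, stretch: float = 1.2) -> List[WordProsody]:
    """Highlight every occurrence of `target` (case-insensitive, ignoring
    leading and trailing punctuation) by raising its energy and duration."""
    key = target.lower()
    return [
        replace(p,
                energy_scale=p.energy_scale * energy_gain,
                duration_scale=p.duration_scale * stretch)
        if p.word.strip(".,!?;:").lower() == key else p
        for p in plan
    ]


def set_tempo(plan: List[WordProsody], factor: float) -> List[WordProsody]:
    """Change the overall pace: a factor of 2.0 means twice as fast,
    so every duration scale is halved."""
    if factor <= 0:
        raise ValueError("tempo factor must be positive")
    return [replace(p, duration_scale=p.duration_scale / factor) for p in plan]


# Usage: stress the word "AI" and speed up the whole narration a little.
plan = plan_from_script("I am AI, and I can speak.")
plan = emphasize(plan, "AI")
plan = set_tempo(plan, 1.25)
for p in plan:
    print(f"{p.word:8s} duration x{p.duration_scale:.2f} energy x{p.energy_scale:.2f}")
```

In this sketch the "director" never touches the audio itself. They only edit a small plan of per-word adjustments, which a synthesis model would then be expected to follow.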

The company points out that the model it has created is optimized to run on Nvidia's own GPUs. At this stage of the research, anyone can try out RAD-TTS using NeMo, a Python toolkit for conversational AI that is available through Nvidia's NGC hub. Some of the models offered there have already been trained on tens of thousands of hours of varied audio data.

The world is changing faster and faster. This post records one more sign that, step by step, fairy-tale plots are coming true in our lives. It only remains to understand whether those tales were meant to inspire or, on the contrary, to warn. And it is not the Pied Piper of Hamelin or the big bad wolf standing on our doorstep, but a wonderful future that opens up new opportunities and prospects for humanity.

#technologies