Google Cloud Text-To-Speech Powered by DeepMind WaveNet Technology (googleblog.com)
412 points by pseudobry on March 27, 2018 | 118 comments


Google also announced today their Tacotron engine, which features new prosody modeling for speech generation. It allows them to generate speech that mimics personal intonation, accents, and rhythm, effectively capturing an individual's "expression" in their speech.

HN discussion here: https://news.ycombinator.com/item?id=16691197


I find it amusing that here we have all the corporatey buzzwords - "DeepMind WaveNet Technology" - but the other thing is called Tacotron.


If they had named themselves DeepMind WaveNet Blockchain, they could have acquired Googlebet* by now.

* that's what I call Google/Alphabet when it doesn't matter which side of the tax-avoiding entity I'm referring to.


goophabet?


Alphoogle™


Bless you


They also announced their new fluerosinth-ai amplitudinal speech to text deep-cloudmind product, although I can't find a reference link...


For any Google devs lurking out there, it doesn't seem to work at all in Firefox on Windows. It looks like it has something to do with custom web components with the following message:

ReferenceError: customElements is not defined

Also apparently some assertion errors with webcomponents (minified so line numbers not useful).


Doesn't work in Safari. Doesn't work in Firefox. Doesn't work in Edge.

I don't want to be unreasonable, but Google used to at least generally support the idea of the open web. There are a bunch of different UAs out there; while I accept it's more challenging to support some than others, it doesn't seem unreasonable to expect that a product launch from a large-scale web company should at the very least give us an error message.

The web is deteriorating right in front of us, and a big contributor to that is Google's continued failure to realise that the web isn't all-Chrome, all the time. The attitude displayed here—a minor thing when compared to the overall problem—has strengthened my resolve to avoid Chrome at all costs.


> I don't want to be unreasonable, but Google used to at least generally support the idea of the open web.

This is a Machine Learning product; I don't think anyone at Google, at least on this team, is trying to "get you" or destroy the open web or something. This isn't even a case of Google using something non-standard -- Web Components are part of the standard; you can even see them on Mozilla's MDN [0]. Firefox, Safari, Edge, et al. simply haven't implemented them yet (or the support hasn't landed in stable). Is that somehow also Chrome's fault?

Filing a bug report is good, but ranting on HN about how this is a sign of Google trying to steal the open Internet is at best unnecessary, and absolutely unreasonable.

Coincidentally, I'm working on an application that uses complicated SVG with CSS animations, and I've spent a ton of time optimizing it. I'd never tested it outside of Chrome before today. To my surprise, while everything runs fast in Chrome, in Safari it's bearable, and in FF it's simply too laggy to use. Now, I probably won't ever get to fix the performance issues in FF and Safari, simply because I don't have the time. Am I also out there trying to destroy the open web? Maybe I'm just bad, not evil.

[0]: https://developer.mozilla.org/en-US/docs/Web/Web_Components


No, I don’t agree with you.

This is a basic audio playback widget on the web. The web is nominally an open platform. Whichever team built this decided to build it in a way that it would only support Google’s own web browser. It’s not unreasonable to expect a massive web company to build cross-browser support for their user-facing demos, especially in cases where there is obviously no reason that it needs to be incompatible.

Like I said, I appreciate that building cross-browser is not always possible. The difference between you and Google is that you aren’t one of the largest companies in the world, and you don’t publish your own browser.


True, but clearly the marketing page of a new service for a major cloud vendor should be accessible to as many people as possible. There's no real reason something as simple as an audio player for playing samples shouldn't work on all browsers.


Come on now, it's a cloud machine learning product, you can access it using purely open source tools all running on an open source operating system if you like. No need to pronounce doom because the demo doesn't work in your browser.


That'd be a reasonable response if Google didn't have a history of deliberately making things incompatible with other browsers for no discernible reason.

https://mashable.com/2013/01/05/google-maps-windows-phone


> making things incompatible with other browsers for no discernible reason

The reason is simple, Chrome is Google's bitch just like IE was Microsoft's bitch.


I appreciate the quality of the offering, and indeed I'm excited because I am currently building a product that uses Google's synthesis!

But it's the lack of care displayed here that's really disappointing. It's like nobody involved with this demo went "Hey, maybe people aren't using Chrome and we should at least not totally fail when that happens". As a mostly front-end dev at a small company, I put a lot of time into making sure that things work cross-platform, even if I don't always get it right. It's disappointing that Google's culture doesn't rate that work highly enough.


The solution is simple: Get as many people as you can to migrate to Firefox!


Then it works in two browsers: the most popular one and their own. Ideally there would be such a large number of in-use standards-compliant browsers that companies had to comply.


The problem isn't so much that browser vendors don't adhere to standards, it's that they make their own additional features and then make webpages that work with those. Having standards won't do anything for that, unless the standard says "you must not implement extra functionality".


There are four or five out there. I would expect the average modern web app to be compatible with Chrome, Safari, Firefox, Edge, and possibly IE11 if that’s appropriate for the particular use case.


Just needs a WebComponents polyfill.

Other browser vendors don't support WebComponents out of the box (yet).


Google used to at least generally support the idea of the open web

Yes, but history teaches us that they are terrible at product management. I mean, they could at least just call it "Chrome TTS" or something until they put the whole thing together.


The demo might only work in Chrome, but the API is just a standard REST API. Most people will be using this API on the server side to generate an MP3 which is streamed to the user.
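
For example, a minimal server-side sketch of that flow (assuming the v1 REST endpoint and an API key; the voice name here is just an illustrative example):

    import base64
    import requests

    API_KEY = "..."  # a GCP API key with the Text-to-Speech API enabled
    URL = "https://texttospeech.googleapis.com/v1/text:synthesize"

    body = {
        "input": {"text": "Hello from Cloud Text-to-Speech."},
        "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-A"},
        "audioConfig": {"audioEncoding": "MP3"},
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=body)
    resp.raise_for_status()

    # The API returns base64-encoded audio inside a JSON envelope.
    with open("hello.mp3", "wb") as f:
        f.write(base64.b64decode(resp.json()["audioContent"]))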

Chrome actually has its own TTS engine (as does FF, Safari, etc), though the quality obviously isn't as good.

https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynth...

(I work for GCP)


As I understand it, Chrome does indeed use the Google cloud platform to do its speech synthesis behind the scenes – I suppose we'll see improvements in the quality of those voices soon, too.


That's not entirely fair – it does work over relatively standard protocols behind the scenes. I'm just annoyed at the lack of care displayed here.


GCP person here. We are aware and are working on a fix!

(Should be as "simple" as a polyfill for webcomponents, but I don't want to put words in the team's mouth)


Update: Should be fixed now!


That's probably because custom elements aren't supported in Firefox yet, unless you set "dom.webcomponents.enabled" and "dom.webcomponents.customelements.enable" to true. It's supposed to be fully enabled in Firefox 60/61 according to MDN [0].

[0]: https://developer.mozilla.org/en-US/docs/Web/Web_Components/...


Best viewed with Chrome Explorer™ 5.0, optimized for 1024x768 truecolor screens.


Sure, but shouldn't they either be using polyfills, or not using code that is only supported by one browser? I understand how a project like Allo Web would want to use the latest streaming API tech, but a documentation/product-launch page doesn't need crazy new web tech.


They should be using the polyfills, but it looks like someone missed the mark and forgot to add them.


They didn't even test on anything other than Chrome, because they don't care. That's the problem. Not that it was missed, but that they didn't even test in Firefox.


> something from google not working on firefox.

Call me paranoid all you want, but that is their plan, just like it was Microsoft's plan for nothing on IE4/6 to work on Netscape. Heck, a third of my corporate sites already have DHTML popups saying "Use Chrome" when I reach them with Firefox.


Does anyone have a GitHub project for epub -> mp3 using this service yet (for automatic audiobook generation)? May make it myself if I have time but curious if anyone already has set it up.

EDIT: this is almost exactly their sample application (https://github.com/GoogleCloudPlatform/python-docs-samples/t...). Was able to get it working with epubs using pypandoc within the hour. Now just need to make it upload to Overcast...

EDIT 2: Can now convert epubs directly to mp3s on Overcast. Yay!
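
In case it helps anyone else, the rough shape of the pipeline (a sketch, not my exact code; assumes pandoc/pypandoc are installed and reuses the same kind of REST call shown elsewhere in the thread):

    import base64
    import pypandoc
    import requests

    API_KEY = "..."
    URL = "https://texttospeech.googleapis.com/v1/text:synthesize"

    def synthesize(text):
        body = {
            "input": {"text": text},
            "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-A"},
            "audioConfig": {"audioEncoding": "MP3"},
        }
        resp = requests.post(URL, params={"key": API_KEY}, json=body)
        resp.raise_for_status()
        return base64.b64decode(resp.json()["audioContent"])

    # epub -> plain text via pandoc
    text = pypandoc.convert_file("book.epub", "plain")

    # The API caps request size (~5000 chars at the time of writing), so
    # synthesize paragraph-sized chunks; concatenated MP3 frames still play.
    with open("book.mp3", "wb") as out:
        for para in (p.strip() for p in text.split("\n\n")):
            if para:
                out.write(synthesize(para[:4500]))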


Shameless plug: https://auditus.cc

Uses Amazon Polly


Is your code on github?


the code was hacked together during a lecture so it's not very clean or robust, but here's the gist if you're trying to build something similar: https://gist.github.com/madebyollin/508930c86fa12e1a70e32d91...

(overcast uploading not shown–that's a separate script using mechanize)


The average English word is 4.5 characters and the average English speaker speaks 110-150 words per minute. This means that at $16/1M characters, we can generate speech at a cost of between $28.57 and $39/hr. Per Google's post, WaveNet now costs 50ms of TPU time per 1s of speech generated, meaning that at 100% utilization a TPU can generate somewhere between $571.40 and $780 worth of speech per hour. Google's TPUs can be rented (by third parties) at $6.50/hr. That's some sweet, sweet margin.


I think your math is wrong by a factor of 60. Under your assumptions, one hour of speech is equivalent to 30-40 thousand characters, costing between $0.48 and $0.65. That translates to revenue of $9.50-$12.96 per hour per TPU.


I double-checked; you are correct:

    16 * (4.5 * 110 * 60) / 1M = $0.475/hr

    16 * (4.5 * 150 * 60) / 1M = $0.648/hr
If you multiply by the number of 50ms intervals in one second (20), you do get $9.50-$12.96.
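
As a sanity check (a sketch using the 4.5 chars/word and 110-150 wpm assumptions from upthread):

    # Cost per hour of speech, and revenue per TPU-hour, under the
    # assumptions above: $16/1M chars, 50ms of TPU time per 1s of speech.
    PRICE_PER_CHAR = 16 / 1_000_000
    CHARS_PER_WORD = 4.5
    REALTIME_FACTOR = 20  # 1s of speech / 50ms of TPU time

    for wpm in (110, 150):
        cost = CHARS_PER_WORD * wpm * 60 * PRICE_PER_CHAR
        print(f"{wpm} wpm: ${cost:.3f}/hr of speech, "
              f"${cost * REALTIME_FACTOR:.2f} revenue per TPU-hour")
    # 110 wpm: $0.475/hr of speech, $9.50 revenue per TPU-hour
    # 150 wpm: $0.648/hr of speech, $12.96 revenue per TPU-hour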


This is correct. Apologies!


You're right, I decided to switch from seconds to hours as the base unit halfway through writing the comment, and screwed up the conversion. I was accounting for seconds when I was already in minutes.


Free SaaS business model: transform text by translating each word to its shortest homophone, and take a cut of the cost savings :)


Google does a remarkably good job with pronunciation. I'm consistently impressed by it.

I was messing around with the ancient VBA text-to-speech system. If most TTS systems sucked as much as that one, you could also make a SaaS business out of finding "typos" that make the word sound correct when pronounced.


Nice analysis! That's the rub with doing TCO analysis: how much value do you put on a higher order of manageability?

You should also include a workload-volatility component to be entirely fair. Your analysis assumes it's entirely steady state.

(Work at g)


"the cloud" is just a better DRM.


How so? They don't sell the model itself; they sell 'tickets' that allow you to take a picture of it. That is not DRM.


Yes, that’s what the OP is saying. Consumers fought against DRM so the business model became “never give them the software and you’ll never need to force DRM on them.”


Except that no companies actually wanted the software in the first place, they wanted the service.

[Edited: made it clear I was talking about companies]


Maybe you did, but I’d much rather have things like the Nest camera or text-to-speech as my own software. However, it’s impossible to get the same quality for the same amount spent this way, since vendors prefer the service business model.


Edited, made it clear I was talking about companies.

But for most consumers, the same applies. Most people don't want to own records, or even own music. They want to listen to music.

They don't want to own cars, they want to get to places.

> However, it’s impossible to get the same quality for the same amount spent this way, since vendors prefer the service business model.

Not just a matter of preference, but economies of scale. Most consumers don't want to install and run their own local CCTV service, so the market left for those who do is too small to achieve the same economies of scale as the service market.


The machines used to be the thing of value, but now it's the software.


> The machines used to be the thing of value, but now it's the software.

Not exactly. The machines were never the value per se; the work they could do was the value.

If you can buy the work without buying the machine, it is significantly better.


Yes, but if you could get it for free, that would be even better. It's a fantastic business model to charge a premium for something that costs almost nothing.


> Yes, but if you could get it for free, that would be even better. It's a fantastic business model to charge a premium for something that costs almost nothing.

Costs almost nothing? How much do you think was invested in research and development?

But hey, feel free to open a company, invest billions of your own money in research and then give everything away for free with no revenue whatsoever. Sounds like a great business model!


It does not cost me anything if someone else runs my software; I can still charge for it. The research and development is already paid for. The business plan goes something like this: 1) Buy software 2) Sell copies of it ... But it will not work well if other people start to make copies of it, so you make it a SaaS instead, having the software run in "the cloud" where only you can copy it.


> The research and development is already paid for.

That doesn't even make any sense. They aren't paid for; they are investments with an expectation of future profits, paid for with shareholders' money.


Investors will be happy as long as the margin on each additional copy sold is above zero, i.e. you make a profit. How much they are willing to invest depends on how large the market is.


That's a very cynical interpretation.

Even if Google wanted to sell web-search software, there would be very few customers, since it would cost billions of dollars to buy the machines to run it; you'd need to build DCs, get power, etc.

And what's wrong with services anyway?


Google has infrastructure to serve billions of requests per second; that is not needed if you only serve yourself. Now imagine if everyone scraped the web as often as Google does. :P

Your mobile phone can run some billions of instructions per second. Now imagine a server case, or a rack. Compute power is very dense. And very cheap.

The entire wikipedia (text only) can be stored on a micro-sim. eg a few mm.


You probably meant micro-SD. And the amount of text content contained in Wikipedia is an atomically sized drop in the Internet's ocean.


Here's a simple Python script that will fetch some sample audio using the request on the demo page and save it in a file:

https://www.pastery.net/nujfhw/

I have no idea what the rate limits are, so please don't abuse it. I wrote it because the demo didn't work in Firefox and I wanted to play around with it more extensively.


Thanks, pretty neat script; works nicely!


Having an English text but setting the language to another one like German or French is hilarious.

You get e.g. ze Dscherman aczent or de frensch onehe.


I decided to do the opposite: French text with the WaveNet English voice. Pretty funny too.


Also feeding it text from the Anguish Languish[1] corpus is pretty good.

[1] https://www.crockford.com/wrrrld/anguish.html


I am ashamed to admit I spent far too much time doing exactly that. The Japanese one really cracked me up.


The US English synthesized version is truly remarkable. Borderline scarily good.

The fact that the preview only seems to work on Chrome (and silently breaks everywhere else) is not cool, though.


Am I wrong in thinking that the cost of generating (realistic-sounding, learned-model) speech on commodity hardware will be near-zero soon, largely negating the value of a SaaS?

I've been waiting a long time for decent sounding open source TTS software for narrating books to me, and now with deep learning it's either here or very near here, and the hardware is going to keep getting more performant at the same price. I guess that will be very appealing to businesses relying on TTS (e.g. call centers and phone robots and mobile apps with TTS, etc)


Some companies never even consider hiring people to build and maintain this themselves; they prefer to pay for a subscription like Google Cloud.


What open source TTS is as good as Google Cloud?


check the "kate" samples: https://github.com/Kyubyong/speaker_adapted_tts

This is with 1 minute of audio and 10 minutes of training, which is crazy to me. Maybe it's not "as good", but it's very good, and free, and it will get better, faster, and cheaper quickly?


Partially off-topic, but one of the things I find as a native English-speaking American is that British female accents (probably more specifically accents that are close to a "BBC accent") sound better to me. That's definitely true with Polly. I don't know if it's because flaws aren't quite as obvious to me or just that I like the accent better in general so I'm more willing to overlook them.


"... the hardware is going to keep getting more performant at the same price"

That is the hope but there are no guarantees. Perhaps specialized hardware can pick up where Moore's Law has tapered off.


I don't mean general purpose hardware, I mean specifically neural hardware, whether that means GPUs (as we know them), GPUs with special hardware like the "tensor cores", TPUs, "neuromorphic chips" whatever that is, FPGAs becoming part of average computers, or something else.

There's no end in sight to the improvement of neural hardware, not like the wall x86 CPUs have hit anyway.


As someone who struggles greatly with the written word, I'm so thankful to see this. For the last year or so I've poked around every few months to see if they'd opened this up more generally. I'd be more than happy to pay $30-60/mth (more if it had Spritz) for the ability to have high-quality, high-speed text to speech for my emails, documents, and news articles I'd like to consume.


Interesting! I'd love to see a thorough comparison with the Amazon Polly service...

https://aws.amazon.com/polly/

Polly is priced at $4 per million characters and the Google WaveNet voices are $16 (compared with the Google non-WaveNet voices, which are also $4).

After listening to a few samples from each service, the voice quality and prosody modeling seem roughly on par between Polly and WaveNet, or at least the differences I heard didn't seem to justify a 4x price multiplier.

But I'd love to hear an informed opinion from someone with more expertise...


A lot of voice generation is cost-center work (call centers outsourced to the cheapest location) with short sentences. I doubt that industry would pay a 4x price multiplier for that use case.

So in fact WaveNet competes more with voiceover work and new use cases such as voice assistants. Still, I don't hear that much difference there today, but maybe WaveNet will reach human level sooner than the other models.


To me Polly is way behind WaveNet when it comes to realism. Polly is robotic, WN is fluid.


I for one welcome our new wavenet telemarketing overlords...


I set the text to

"I'm sorry Dave. I'm afraid I can't do that."

to be prepared for what's coming ...


It's very good. The voices remind me of speech from real-life people with accents. It's good enough for voiceovers where previously real-life voices would have been too expensive. I would say that it's better than Amazon's Polly when used to read long passages of text.


I don't know. They're good but they still sound robotic. For me, they work for applications where I sort of expect/accept that I'll get computer-generated speech anyway. But I wouldn't use them as a general substitute for a human speaking, even someone like me who doesn't exactly have a radio voice.


It's getting more and more human sounding. Take a look at this research (also from Google): https://research.googleblog.com/2018/03/expressive-speech-sy...


It absolutely is. But I'm looking at it from a perspective of whether I could put a daily or weekly podcast out there using one of these TTS services and I come out with a resounding no (today).


I have not seen any mention of licensing and whether you can cache and replay voice responses. Amazon Polly specifically allows caching.


Based on the pricing of $16 per 1 million characters (roughly equal to a 400-500 page book), doesn't this severely threaten the voiceover marketplace? I just priced a human voiceover on VoiceBunny.com for a 400-page book and got an average turnaround time of 90 days and a $15K cost, vs. WaveNet's $16 and only 30 minutes of computational time. That sounds like an interesting disruptor to me.
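
For what it's worth, a quick back-of-the-envelope check on that "400-500 pages per $16" figure (the 4.5 chars/word average is from upthread; the words-per-page values are my assumption):

    # How many book pages does $16 (1M characters) buy?
    chars_per_word = 4.5 + 1  # average word plus trailing space
    words = 1_000_000 / chars_per_word  # ~182k words
    for words_per_page in (350, 400):  # assumed page density
        print(f"{words_per_page} words/page -> "
              f"{words / words_per_page:.0f} pages")
    # ~455-520 pages, so the estimate above is in the right ballpark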


It does, if people are willing to listen to the voice for 10+ hours.

I could listen to this voice for a while, but the voice needs more emotion in it before it could be actually useful for long text.


Tacotron (also by Google) looks promising in this area https://google.github.io/tacotron/publications/global_style_...


I wish they had some beautiful voices, not some of the most generic-sounding men and women.



They're working on it! Check out the samples here:

https://research.googleblog.com/2018/03/expressive-speech-sy...

The last set is specifically interesting for your wish.


Meryl Streep should be worried right about now! (I heard that last set of renders... whoa!)


This is MY VOICE as rendered by lyrebird.ai - https://lyrebird.ai/g/hFj87pbl - So, tech is out there trying to replicate the actual CHARACTER of individual human voices.


We need a virtual Barry White.


Imagine teaching these voices to sing. Something like DeepMind WaveNet Song Generator.

You upload your music to the cloud, set some parameters (genre, tempo, emotion, etc) and a bunch of lyrics and the thing will spit out awesome vocals for you.


That's a billion dollar idea, I wonder who'll do it, maybe you!


Quick, someone remake Translation Party using Speech-to-Text-to-Speech-to-Text-to-Speech-ad-infinitum

https://cloud.google.com/text-to-speech/docs/quickstart https://cloud.google.com/speech/docs/sync-recognize
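
A sketch of the loop, assuming the v1 endpoints from those quickstarts and an API key (LINEAR16 WAV should round-trip between the two APIs, though I haven't battle-tested it):

    import requests

    KEY = {"key": "..."}  # GCP API key
    TTS = "https://texttospeech.googleapis.com/v1/text:synthesize"
    STT = "https://speech.googleapis.com/v1/speech:recognize"

    def speak(text):
        body = {
            "input": {"text": text},
            "voice": {"languageCode": "en-US"},
            "audioConfig": {"audioEncoding": "LINEAR16",
                            "sampleRateHertz": 16000},
        }
        # base64-encoded WAV audio
        return requests.post(TTS, params=KEY, json=body).json()["audioContent"]

    def listen(audio_b64):
        body = {
            "config": {"encoding": "LINEAR16", "sampleRateHertz": 16000,
                       "languageCode": "en-US"},
            "audio": {"content": audio_b64},
        }
        results = requests.post(STT, params=KEY, json=body).json()["results"]
        return results[0]["alternatives"][0]["transcript"]

    text = "Is this the real life? Is this just fantasy?"
    for i in range(10):
        text = listen(speak(text))  # speech -> text -> speech, ad infinitum
        print(i, text)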



This is great, but there remain very difficult problems to be solved. The prosody generated by this is fairly generic and not informed by a true understanding of the text. Consider this sentence:

I have plans to leave.

If you stress the word "plans", the sentence means that the speaker is not necessarily intending to actually leave. However, when the stress is on "leave", the speaker definitely intends to leave. A human reader can easily infer the correct meaning from context but text-to-speech systems can't because they don't have any systematic understanding of the things being talked about and the social pragmatics of the discourse. As long as these issues aren't solved, text-to-speech systems will make mistakes. These mistakes will be easy to spot in some cases but can also have catastrophic consequences in other cases: "I have plans to bomb North Korea."
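
Agreed that inferring the stress automatically is the hard part, but it's worth noting you can at least hand-author it with SSML today (a sketch; assumes the engine honors <emphasis>, which varies by vendor):

    # The two readings of the example above, disambiguated by hand.
    ssml_hedged = ("<speak>I have <emphasis level='strong'>plans</emphasis>"
                   " to leave.</speak>")
    ssml_definite = ("<speak>I have plans to"
                     " <emphasis level='strong'>leave</emphasis>.</speak>")
    # Sent as {"input": {"ssml": ...}} instead of {"input": {"text": ...}}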



This is really cool, thanks for the link, but it solves a different problem.


I've been using Amazon Polly for a few months to make videos for language learners, and I find the English voices powered by WaveNet slightly better than Amazon's, but the default Japanese sounds much worse. Anyway, the pricing and platform are almost the same as Amazon's, so I definitely need to add another interface for this TTS into my app. You can listen to Amazon Polly voices in the video I made: https://www.youtube.com/watch?v=ysMp0k4oR5c


I picked 3 random paragraphs from a random article on a local online news site.

The voices did sound quite natural and "news-readery"; however, the one issue I did find was with pauses between words.

With the example phrase "He bought himself a boat and then took it to his house", you often expect a small pause after the word "boat".

I was able to fix it manually by adding some commas and full stops; however, the AI was not able to pick up on those pauses naturally.

It sounded like someone was rushing through the speech instead of stopping occasionally to "take a breath".
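
Rather than faking it with punctuation, SSML's <break> tag is the intended fix (a sketch; the timing value is arbitrary):

    # Forcing the pause after "boat" without abusing commas.
    ssml = ("<speak>He bought himself a boat<break time='300ms'/>"
            " and then took it to his house.</speak>")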


The demo is available at https://cloud.google.com/text-to-speech/

Requires Chrome.


Is it just me, or would a demo really make this posting much more interesting?

Edit: There is one, on the actual Google Cloud Text-To-Speech page, so a few clicks in and you'll get one.


Doesn't show up on Firefox or Edge. Only a blank space where I assume the demo should be. Console suggests some sort of Polymer/WebComponents error.


Is there any API for generating speech that sounds like Google Now's assistant? The quality of that is much, much better than this new service.


Yes, I use this:

https://github.com/pndurette/gTTS

Very simple and has the Google Assistant voice.
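
Usage is about as short as it gets (a sketch against gTTS's API as of recent versions):

    from gtts import gTTS

    # gTTS uses the same voice as Google Translate / the Assistant
    tts = gTTS("Hello there, this is the Assistant-style voice.", lang="en")
    tts.save("hello.mp3")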


Hmm, that seems much worse than Google Assistant as well. I think my mistake was that I had selected "Basic" instead of "WaveNet" for the voices (WaveNet is only available for US English). WaveNet is much better.


Are there any voice samples?


You can try it yourself here; just make sure to select English (United States) and Voice type: WaveNet, as the other languages are not yet using the WaveNet system: https://cloud.google.com/text-to-speech/


It is fun to mismatch voices/languages to hear some hilariously stereotypical accents


If you switch between "Basic" and "WaveNet" you can hear the difference noted in this announcement; it is noticeably better.


Try changing the pitch by more than ±10 as well; it breaks up quite amusingly.


I had an idea this morning for a personalised "podcast" that could read out e.g. the weather in your area, any new and important emails, the headlines and first paragraph of top stories from your favourite sources and notifications from social media.

I think this is the missing thing that was needed to make this viable.


> I had an idea this morning for a personalised "podcast" that could read out e.g. the weather in your area, any new and important emails, the headlines and first paragraph of top stories from your favourite sources and notifications from social media.

Google Assistant already has all the pieces of that (maybe not all the social media connections one might want, I haven't looked much at that), and the ability to string them together.


Try this:

Hey Google, Tell me about my Day.

and

Hey Google, Tell me the news.

There's a way you can get the news added to your daily briefing (the first trigger), but I can't remember how now.



