Google also announced today their Tacotron engine, which features new prosody modeling for speech generation. It allows them to generate speech that mimics personal intonation, accents, and rhythm, effectively capturing an individual's "expression" in their speech.
For any Google devs lurking out there, it doesn't seem to work at all in Firefox on Windows. It looks like it has something to do with custom web components, given the following message:
ReferenceError: customElements is not defined
There are also apparently some assertion errors from the web components code (it's minified, so the line numbers aren't useful).
Doesn't work in Safari. Doesn't work in Firefox. Doesn't work in Edge.
I don't want to be unreasonable, but Google used to at least generally support the idea of the open web. There are a bunch of different UAs out there; while I accept it's more challenging to support some than others, it doesn't seem unreasonable to expect that a product launch from a large-scale web company would, at the very least, give us an error message.
The web is deteriorating right in front of us, and a big contributor to that is Google's continued failure to realise that the web isn't all-Chrome, all the time. The attitude displayed here—a minor thing when compared to the overall problem—has strengthened my resolve to avoid Chrome at all costs.
> I don't want to be unreasonable, but Google used to at least generally support the idea of the open web.
This is a machine learning product; I don't think anyone at Google, at least on this team, is trying to "get you" or destroy the open web or something. This isn't even a case of Google using something non-standard -- Web Components are part of the standard, you can even see it on Mozilla's MDN [0]. Firefox, Safari, Edge, et al. simply haven't implemented it yet (or it hasn't landed in stable). Is that somehow also Chrome's fault?
Filing a bug report is good, but ranting on HN about how this is a sign of Google trying to steal the open Internet is at best unnecessary, and absolutely unreasonable.
Coincidentally, I'm working on an application that uses complicated SVG with CSS animations, and I've spent a ton of time optimizing it. I'd never tested it outside of Chrome before today. To my surprise, while everything runs fast in Chrome, in Safari it's bearable, but in FF it's simply too laggy to use. Now, I probably won't ever get to fix the performance issues in FF and Safari, simply because I don't have the time. Am I also out there trying to destroy the open web? Maybe I'm just bad, not evil.
This is a basic audio playback widget on the web. The web is nominally an open platform. Whichever team built this decided to build it in a way that it would only support Google’s own web browser. It’s not unreasonable to expect a massive web company to build cross-browser support for their user-facing demos, especially in cases where there is obviously no reason that it needs to be incompatible.
Like I said, I appreciate that building cross-browser is not always possible. The difference between you and Google is that you aren’t one of the largest companies in the world, and you don’t publish your own browser.
True, but clearly the marketing page for a major cloud vendor's feature service should be accessible to as many people as possible. There's no real reason something as simple as an audio player for samples shouldn't work well in all browsers.
Come on now, it's a cloud machine learning product, you can access it using purely open source tools all running on an open source operating system if you like. No need to pronounce doom because the demo doesn't work in your browser.
That'd be a reasonable response if Google didn't have a history of deliberately making things incompatible with other browsers for no discernible reason.
I appreciate the quality of the offering, and indeed I'm excited because I am currently building a product that uses Google's synthesis!
But it's the lack of care displayed here that's really disappointing. It's like nobody involved with this demo went "Hey, maybe people aren't using Chrome and we should at least not totally fail when that happens". It's disappointing; as a mostly front-end dev at a small company, I put a lot of time into making sure that things work cross-platform, even if I don't always get it right. It's disappointing that Google's culture doesn't rate that highly enough.
Then it works in two browsers: the most popular one and their own. Ideally there would be such a large number of in-use standards-compliant browsers that companies had to comply.
The problem isn't so much that browser vendors don't adhere to standards, it's that they make their own additional features and then make webpages that work with those. Having standards won't do anything for that, unless the standard says "you must not implement extra functionality".
There are four or five out there. I would expect the average modern web app to be compatible with Chrome, Safari, Firefox, Edge, and possibly IE11 if that’s appropriate for the particular use case.
> Google used to at least generally support the idea of the open web
Yes, but history teaches us that they are terrible at product management. I mean, they could at least just call it "Chrome TTS" or something until they put the whole thing together.
The demo might only work in Chrome, but the API is just a standard REST API. Most people will be using this API on the server side to generate an MP3 which is streamed to the user.
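For the curious, here's a minimal server-side sketch of that flow in Node. The request/response shape follows the public v1 REST docs (input/voice/audioConfig, base64 audioContent in the reply); the API key handling and the voice name are assumptions for illustration:

    // Minimal sketch: synthesize text server-side and write out an MP3.
    // Assumes an API key in GOOGLE_API_KEY; request/response shape per the
    // public v1 REST API (input/voice/audioConfig, base64 audioContent).
    const fs = require('fs');

    async function synthesize(text) {
      const res = await fetch(
        'https://texttospeech.googleapis.com/v1/text:synthesize?key=' +
          process.env.GOOGLE_API_KEY,
        {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({
            input: { text },
            voice: { languageCode: 'en-US', name: 'en-US-Wavenet-A' },
            audioConfig: { audioEncoding: 'MP3' },
          }),
        }
      );
      const { audioContent } = await res.json(); // base64-encoded MP3
      return Buffer.from(audioContent, 'base64');
    }

    synthesize('Hello from WaveNet.').then((mp3) => fs.writeFileSync('out.mp3', mp3));

Streaming the buffer to the user instead of writing a file is the same call with a different sink.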
Chrome actually has its own TTS engine (as does FF, Safari, etc), though the quality obviously isn't as good.
As I understand it, Chrome does indeed use the Google cloud platform to do its speech synthesis behind the scenes – I suppose we'll see improvements in the quality of those voices soon, too.
That's probably because custom elements aren't supported in Firefox yet, unless you set "dom.webcomponents.enabled" and "dom.webcomponents.customelements.enable" to true. It's supposed to be fully enabled in Firefox 60/61 according to MDN [0].
Sure, but shouldn't they be either using polyfills, or avoiding code that is only supported by one browser? I understand why a project like Allo Web would want to use the latest streaming API tech, but a documentation/product launch page doesn't need crazy new web tech.
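To be concrete, the guard is tiny. A sketch of the kind of check the demo could have shipped (the polyfill URL and the two hook functions are hypothetical placeholders, just to show the shape):

    // Feature-detect custom elements before using them; load a polyfill or
    // fall back to a plain <audio> element. The polyfill URL and the two
    // functions below are hypothetical placeholders, not Google's code.
    if (window.customElements === undefined) {
      const script = document.createElement('script');
      script.src = 'https://unpkg.com/@webcomponents/custom-elements'; // illustrative
      script.onload = initPlayerWidget;        // hypothetical demo init
      script.onerror = showBasicAudioFallback; // hypothetical <audio> fallback
      document.head.appendChild(script);
    } else {
      initPlayerWidget();
    }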
They didn't even test on anything other than Chrome, because they don't care. That's the problem. Not that it was missed, but that they didn't even test in Firefox.
Call me paranoid all you want, but that is their plan, just like it was Microsoft's plan for nothing on IE4/6 to work on Netscape. Heck, a third of my corporate sites have DHTML popups saying "Use Chrome" when I reach them with Firefox already.
Does anyone have a GitHub project for epub -> mp3 using this service yet (for automatic audiobook generation)? May make it myself if I have time but curious if anyone already has set it up.
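In case it helps anyone get started, a rough Node sketch of the pipeline, assuming a synthesize() helper like the REST call elsewhere in the thread and a hypothetical extractChapters() (real epub parsing left to a library):

    // Hypothetical epub -> mp3 pipeline: pull chapter text, split it under
    // the per-request size limit, synthesize each chunk, concatenate.
    // extractChapters() and synthesize() are assumed helpers, not a library.
    const fs = require('fs');

    const CHUNK_LIMIT = 4500; // stay safely under the per-request limit

    function splitIntoChunks(text, limit = CHUNK_LIMIT) {
      const chunks = [];
      let current = '';
      for (const sentence of text.split(/(?<=[.!?])\s+/)) {
        if (current && (current + ' ' + sentence).length > limit) {
          chunks.push(current);
          current = sentence;
        } else {
          current = current ? current + ' ' + sentence : sentence;
        }
      }
      if (current) chunks.push(current);
      return chunks;
    }

    async function bookToMp3(epubPath, outPath) {
      const parts = [];
      for (const chapter of await extractChapters(epubPath)) { // hypothetical
        for (const chunk of splitIntoChunks(chapter)) {
          parts.push(await synthesize(chunk)); // see the REST sketch upthread
        }
      }
      // MP3 frames can generally be concatenated directly for simple players.
      fs.writeFileSync(outPath, Buffer.concat(parts));
    }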
The average English word is 4.5 characters and the average English speaker speaks 110-150 words per minute. This means that at $16/1m characters, we can generate speech at a cost between $28.57-39/hr. Per Google's post, WaveNet now costs 50ms of TPU time per 1s of speech generated, meaning, at 100% utilization, a TPU can generate somewhere between $571.40-780/hr. Google's TPUs can be deployed (by third parties) at $6.50/hr. That's some sweet sweet margin.
I think your math is wrong by a factor of 60. Under your assumptions, one hour of speech is equivalent to 30-40 thousand characters, costing between $0.48-$0.65. That translates to revenue of $9.50-$12.96/hour per TPU.
You're right, I decided to switch from seconds as base unit to hours as base unit half way through writing the comment, and screwed up the conversion. I was accounting for seconds when I was already in minutes.
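For anyone following along, the corrected arithmetic in one place:

    // Corrected back-of-envelope numbers from the thread.
    const charsPerWord = 4.5;
    const pricePerChar = 16 / 1e6;   // $16 per million characters
    const realtimeFactor = 1 / 0.05; // 50ms of TPU time per 1s of speech = 20x

    for (const wpm of [110, 150]) {
      const charsPerHour = wpm * 60 * charsPerWord;             // 29,700 - 40,500
      const revenuePerSpeechHour = charsPerHour * pricePerChar; // $0.48 - $0.65
      const revenuePerTpuHour = revenuePerSpeechHour * realtimeFactor; // $9.50 - $12.96
      console.log(wpm, revenuePerSpeechHour.toFixed(2), revenuePerTpuHour.toFixed(2));
    }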
Google does a remarkably good job with pronunciation. I'm consistently impressed by it.
I was messing around with the ancient VBA text-to-speech system. If most TTS systems sucked as much as that one, you could also make a SaaS business out of finding "typos" that make a word sound correct when pronounced.
Yes, that’s what the OP is saying. Consumers fought against DRM so the business model became “never give them the software and you’ll never need to force DRM on them.”
Maybe you did, but I’d much rather have things like the Nest camera or text-to-speech as my own software. However, it’s impossible to get the same quality for the same amount spent this way, since vendors prefer the service business model.
Edited, made it clear I was talking about companies.
But for most consumers, the same applies. Most people don't want to own records, or even own music. They want to listen to music.
They don't want to own cars, they want to get to places.
> However, it’s impossible to get the same quality for the same amount spent this way, since vendors prefer the service business model.
Not just a matter of preference, but economies of scale. Most consumers don't want to install and run their own local CCTV service, so the market left for those who do is too small to achieve the same economies of scale as the service market.
Yes, but if you could get it for free, that would be even better. It's a fantastic business model to charge a premium for something that costs almost nothing.
> Yes, but if you could get it for free, that would be even better. It's a fantastic business model to charge a premium for something that costs almost nothing.
Cost almost nothing? How much do you think was invested in research and development?
But hey, feel free to open a company, invest billions of your own money in research and then give everything away for free with no revenue whatsoever. Sounds like a great business model!
It does not cost me anything if someone else runs my software; I can still charge for it. The research and development is already paid for. The business plan goes something like this: 1) Buy software 2) Sell copies of it ... But that won't work well if other people start making copies of it, so you make it a SaaS instead, having the software run in "the cloud" where only you can copy it.
Investors will be happy as long as the margin on each additional copy sold is above zero, i.e. you make a profit. How much they are willing to invest depends on how large the market is.
Even if Google wanted to sell web-search software, there would be very few customers, since it would cost billions of dollars to buy the machines to run it; you'd need to build data centers, get power, etc.
Google has the infrastructure to serve billions of requests per second; that isn't needed if you only serve yourself. Now imagine if everyone scraped the web as often as Google does. :P
Your mobile phone can run some billions of instructions per second. Now imagine a server case, or a rack. Compute power is very dense. And very cheap.
The entire text of Wikipedia can be stored on something the size of a micro-SIM, i.e. a few mm across.
I have no idea what the rate limits are, so please don't abuse it. I wrote it because the demo didn't work in Firefox and I wanted to play around with it more extensively.
Am I wrong in thinking that the cost of generating (realistic-sounding, learned-model) speech on commodity hardware will be near-zero soon, largely negating the value of a SaaS?
I've been waiting a long time for decent-sounding open source TTS software to narrate books to me, and now with deep learning it's either here or very nearly here, and the hardware is going to keep getting more performant at the same price. I guess that will be very appealing to businesses relying on TTS (e.g. call centers, phone robots, mobile apps with TTS, etc.)
This is with 1 minute of audio and 10 minutes of training, which is crazy to me. Maybe it's not "as good", but it's very good, and free, and it will get better, faster, and cheaper quickly?
Partially off-topic, but one of the things I find as a native English-speaking American is that British female accents (probably more specifically accents that are close to a "BBC accent") sound better to me. That's definitely true with Polly. I don't know if it's because flaws aren't quite as obvious to me or just that I like the accent better in general so I'm more willing to overlook them.
I don't mean general purpose hardware, I mean specifically neural hardware, whether that means GPUs (as we know them), GPUs with special hardware like the "tensor cores", TPUs, "neuromorphic chips" whatever that is, FPGAs becoming part of average computers, or something else.
There's no end in sight to the improvement of neural hardware, not like the wall x86 CPUs have hit anyway.
As someone who struggles greatly with the written word, I'm so thankful to see this. For the last year or so I've poked around every few months to see if they'd opened this up more generally. I'd be more than happy to pay $30-60/month (more if it had Spritz) for the ability to have high-quality, high-speed text-to-speech for my emails, documents, and the news articles I'd like to consume.
Polly is priced at $4 per million characters and the Google WaveNet voices are $16 (compared with the Google non-WaveNet voices, which are also $4).
After listening to a few samples from each service, the voice quality and prosody modeling seem roughly on par between Polly and WaveNet, or at least the differences I heard didn't seem to justify a 4x price multiplier.
But I'd love to hear an informed opinion from someone with more expertise...
A lot of voice generation is a cost center (call centers outsourced to the cheapest location) with short sentences. I doubt that industry would pay a 4x price multiplier for that use case.
So in fact WaveNet competes more with voiceover and new use cases such as voice assistants. Still, I don't hear that much of a difference there today, but maybe WaveNet will improve to human level sooner than the other models.
It's very good. The voices remind me of real people speaking with accents. It's good enough for voiceovers where previously real voices would have been too expensive. I would say it's better than Amazon's Polly when used to read long passages of text.
I don't know. They're good but they still sound robotic. For me, they work for applications where I sort of expect/accept that I'll get computer-generated speech anyway. But I wouldn't use them as a general substitute for a human speaking, even someone like me who doesn't exactly have a radio voice.
It absolutely is. But I'm looking at it from a perspective of whether I could put a daily or weekly podcast out there using one of these TTS services and I come out with a resounding no (today).
Based on the pricing of $16 per 1 million characters (roughly equal to a 400-500 page book), doesn't this severely threaten the voiceover marketplace? I just priced a human voiceover on VoiceBunny.com for a 400-page book and got an average turnaround time of 90 days at a $15K cost, vs. WaveNet's $16 cost and only 30 minutes of computational time. That sounds like an interesting disruptor to me.
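The back-of-envelope for the book, assuming roughly 2,000 characters per printed page and ~130 words per minute of narration (both assumptions, not figures from the pricing page):

    // Rough check of the book math; chars-per-page and wpm are assumptions.
    const pages = 400;
    const charsPerPage = 2000;                   // assumed typical book page
    const chars = pages * charsPerPage;          // 800,000 characters
    const wavenetCost = (chars * 16) / 1e6;      // ~$12.80 at $16/1M chars
    const hoursOfAudio = chars / 4.5 / 130 / 60; // ~22.8 hours at 130 wpm
    console.log(wavenetCost.toFixed(2), hoursOfAudio.toFixed(1));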
This is MY VOICE as rendered by lyrebird.ai - https://lyrebird.ai/g/hFj87pbl - So, tech is out there trying to replicate the actual CHARACTER of individual human voices.
Imagine teaching these voices to sing. Something like DeepMind WaveNet Song Generator.
You upload your music to the cloud, set some parameters (genre, tempo, emotion, etc) and a bunch of lyrics and the thing will spit out awesome vocals for you.
This is great, but there remain very difficult problems to be solved. The prosody generated by this is fairly generic and not informed by a true understanding of the text. Consider this sentence:
I have plans to leave.
If you stress the word "plans", the sentence means that the speaker is not necessarily intending to actually leave. However, when the stress is on "leave", the speaker definitely intends to leave. A human reader can easily infer the correct meaning from context but text-to-speech systems can't because they don't have any systematic understanding of the things being talked about and the social pragmatics of the discourse. As long as these issues aren't solved, text-to-speech systems will make mistakes. These mistakes will be easy to spot in some cases but can also have catastrophic consequences in other cases: "I have plans to bomb North Korea."
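The stress can at least be marked by hand: SSML has an emphasis tag, and the synthesize request accepts SSML input instead of plain text (assuming the engine honors the tag; support varies by engine). Something like:

    // The two readings, disambiguated manually with SSML <emphasis>.
    // Whether a given engine renders the stress faithfully is another matter.
    const stressPlans = {
      input: { ssml: '<speak>I have <emphasis level="strong">plans</emphasis> to leave.</speak>' },
    };
    const stressLeave = {
      input: { ssml: '<speak>I have plans to <emphasis level="strong">leave</emphasis>.</speak>' },
    };

But of course, choosing which word to stress is exactly the part that requires understanding, so this only moves the problem back to the author.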
I've been using Amazon Polly for a few months to make videos for language learners. I find the English voices powered by WaveNet slightly better than Amazon's, but the default Japanese sounds much worse. Anyway, the pricing and platform are almost the same as Amazon's, so I definitely need to add another interface for this TTS into my app. You can listen to Amazon Polly voices in the video I made: https://www.youtube.com/watch?v=ysMp0k4oR5c
Hmm, that seems much worse than Google Assistant as well. I think my mistake was that I had selected "Basic" instead of "WaveNet" for the voices (because it's only available for US English). WaveNet is much better.
You can try it yourself here, just make sure to select English (United States) and Voicetype: Wavenet, as the other languages are not yet using the Wavenet system: https://cloud.google.com/text-to-speech/
I had an idea this morning for a personalised "podcast" that could read out e.g. the weather in your area, any new and important emails, the headlines and first paragraph of top stories from your favourite sources and notifications from social media.
I think this is the missing thing that was needed to make this viable.
> I had an idea this morning for a personalised "podcast" that could read out e.g. the weather in your area, any new and important emails, the headlines and first paragraph of top stories from your favourite sources and notifications from social media.
Google Assistant already has all the pieces of that (maybe not all the social media connections one might want, I haven't looked much at that), and the ability to string them together.
HN discussion here: https://news.ycombinator.com/item?id=16691197