The Silicon Valley AI Lab is Baidu's US-based research group, started a bit more than two years ago by Andrew Ng and Adam Coates. The mission of SVAIL is to build hard AI technologies that let us impact hundreds of millions of users.
We work on deep learning for speech and language; systems research to drive scalability of deep learning models; and new product development to bring research success to end users.
We are hiring for lots of roles in all three of these areas. The above link has the full list, but I'd like to draw particular attention to our need for software engineers (the "Software Engineer - AI Product" role). There is a huge opportunity to be an early member of a newly-formed team responsible for building the next generation of AI-enabled products. No prior experience in machine learning or AI necessary -- if you are a strong engineer, we feel confident we can teach the needed ML.
Apply at the link above, or email eloise@baidu.com if you have questions (or ask right here). Thanks!
Both Kaldi[1] and CMU Sphinx[2] are high-quality open source speech systems. I know for a fact that Kaldi includes support for DNN acoustic models (I'm less familiar with Sphinx).
Thanks, appreciated, but my dear lord -- without a PhD in AI systems, these things are a bit beyond what most users, me included, would casually play around with. It'd be great if this tech made it into a Dragon NaturallySpeaking-style end product for private use.
Mostly this, though it's not so black-and-white. The paper discusses results from a DNN-HMM system (Maas et al., using Kaldi) trained on 2k hours, and it does provide a small generalization improvement over 300 hours.
Much of the excitement about deep learning -- which we see as well in DeepSpeech -- is that these models continue to improve as we provide more training data. It's not obvious a priori that results will keep getting better after thousands of hours of speech. We're excited to keep advancing that frontier.
That comparison is even weirder: they compare a system trained on 2,000 hours of acoustic data mismatched with the test data against their own system, which was trained on 300 hours of matched data in addition to the same 2,000 hours of mismatched acoustic data.
Hi Jerome, those are great results! We got an email this morning from someone else on the Watson team pointing out that we didn't include the latest IBM number -- we'll be sure to update the results in the next version of the paper (three cheers for arXiv).
Of course, we openly say in the paper that we don't have the best result on the easy subset of Hub5'00 (we had it as 11.5%). We're more interested in advancing the state of the art on challenging, noisy, varied speech. Of course we'll be working to push the SWB number down too :)
The team is already working on seeing what we get with CH. We'll let you know where we land. But your results are definitely impressive. We love to see new published innovation in the field. Kudos to the team!
Why 5 hidden layers? Why are the first 3 non-recurrent? How did you decide how wide to make the internal layers? Are there some organizing principles behind these design decisions, or is it just trial and error?
As in many things, it's a combination of both. For example:
- We wanted no more than one recurrent layer, as it's a big bottleneck to parallelization.
- The recurrent layer should go "higher" in the network: it's more effective at propagating long-range context over the network's learned feature representation than over the raw input values.
Other decisions are guided by a combination of trial+error and intuition. We started on much smaller datasets which can give you a feel for the bias/variance tradeoff as a function of the number of layers, the layer sizes, and other hyperparameters.
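For intuition, here's a schematic sketch of the layer arrangement described above -- three per-timestep feedforward layers followed by a single recurrent layer -- with all sizes invented purely for illustration; this is not DeepSpeech's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical sizes, chosen only for illustration.
T, n_in, n_h, n_out = 8, 40, 64, 29   # timesteps, input feats, hidden, chars

Ws = [rng.normal(0, 0.1, (n_in if i == 0 else n_h, n_h)) for i in range(3)]
W_rec = rng.normal(0, 0.1, (n_h, n_h))   # recurrent weights
W_out = rng.normal(0, 0.1, (n_h, n_out))

x = rng.normal(0, 1, (T, n_in))          # a fake utterance

# Three non-recurrent layers: applied independently per timestep,
# so all T frames can be processed in parallel.
h = x
for w in Ws:
    h = relu(h @ w)

# One recurrent layer near the top: each step depends on the previous
# one, which is the serial bottleneck mentioned above.
state = np.zeros(n_h)
states = []
for t in range(T):
    state = relu(h[t] + state @ W_rec)
    states.append(state)

logits = np.stack(states) @ W_out        # per-timestep character scores
print(logits.shape)                      # (8, 29)
```

Note how the lower layers operate on each frame independently (and so parallelize freely), while the single recurrent loop forces a serial pass over timesteps.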
Any chance of releasing the training data you used? Also what are the plans with DeepSpeech? Just for use by baidu or will it be released as open source or a developer api service?
For a single utterance, it's fast enough that we can produce results in real time. Of course, building a production system for millions of users might require just a bit more engineering work...
The first two classes (106A and 106B) are very technical, though I'd hesitate to let anyone who hadn't taken classes beyond them work on a piece of software I had control over.
I think a good metric for classes is the final assignment, since it captures "how far" the class goes. For reference, the final assignments are:
106A --- It varies, but has recently been a text-based "Adventure"-style game (in Java) that requires tracking the map, player state, various objects and their capabilities, etc. I think there might also be a small graphical component.
106B/X --- Again, it varies, but the best assignment (in my view) is a BASIC interpreter that implements both a REPL and stored programs. All C++, it's about building a big list of abstract expressions of different types (assignment, etc) that can be executed by walking that list (taking advantage of dynamic dispatch) and tracking global program state. It's a nice intersection of data structures and (very simple) recursive descent parsing.
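The course assignment is in C++, but here's a minimal Python sketch of the same idea: statement objects of different types held in a list, executed via dynamic dispatch against shared global state. (All names here are illustrative, not from the actual assignment.)

```python
# Toy sketch of the interpreter structure: each statement type knows
# how to execute itself against a shared environment.

class Statement:
    def execute(self, env):
        raise NotImplementedError

class Assign(Statement):
    def __init__(self, name, value):
        self.name, self.value = name, value
    def execute(self, env):
        env[self.name] = self.value

class Print(Statement):
    def __init__(self, name):
        self.name = name
    def execute(self, env):
        print(env[self.name])

# A "stored program": an ordered list of statements plus global state.
program = [Assign("x", 2), Assign("y", 40), Print("x")]
env = {}
for stmt in program:
    stmt.execute(env)   # dynamic dispatch picks the right execute()

print(env["x"] + env["y"])  # 42
```

The real assignment layers a recursive-descent expression parser and a REPL on top of this skeleton, but the execution model is the same: walk the list, dispatch on type, mutate global state.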
107 --- A heap allocator to implement malloc(), realloc(), and free(), written on top of mmap(). A fantastic assignment.
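To give a flavor of the bookkeeping involved, here's a toy first-fit allocator sketched in Python over a bytearray; the real assignment is in C on mmap'd memory, with pointer arithmetic, block headers, and coalescing that this sketch omits:

```python
# Toy first-fit allocator: the "heap" is a bytearray, and the free list
# tracks (offset, size) pairs of unused regions.

HEAP_SIZE = 1 << 12
heap = bytearray(HEAP_SIZE)
free_list = [(0, HEAP_SIZE)]

def my_malloc(size):
    for i, (off, blk) in enumerate(free_list):
        if blk >= size:                              # first fit
            if blk == size:
                free_list.pop(i)                     # exact fit
            else:
                free_list[i] = (off + size, blk - size)  # split block
            return off
    return None                                      # out of memory

def my_free(off, size):
    free_list.append((off, size))   # no coalescing in this toy version

a = my_malloc(64)     # offset 0
b = my_malloc(128)    # offset 64
my_free(a, 64)        # block at 0 returns to the free list
c = my_malloc(32)     # first fit: carved from the big block at 192
print(a, b, c)        # 0 64 192
```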
It's not exactly paradise. Because classes are only ten weeks total (slightly less in spring), you can't exactly spend the first week or two shopping around without doing non-trivial work for every single class you're shopping --- an approach which doesn't scale, to say the least. There's still a sprint the first few days. But still, it's nice.
I cannot understand why it is that so many obviously very intelligent people decide that we need another computer vision-based startup. Because the unfortunate truth is that computer vision (right now) doesn't work.
Let me qualify that. From the academic / research point of view, there have been a collection of real successes in computer vision in, say, the last ten years. But my sense is that what counts as a research success is a long way from what counts as a practical business success.
For example, the best generic object detector at the moment is probably Felzenszwalb's, using deformable parts-based models[1]. And it's just not that good: on the latest PASCAL object detection challenge, its mean average precision is only ~30%.
Scott Brown, the interviewee, sets Vicarious apart by highlighting the fact that their system will be neurobiologically inspired. But the idea of learning hierarchical systems that mimic the brain's visual processing system is hardly new, and the jury is still out on whether these systems can do better than the "hand-coded" systems like Felzenszwalb's. As a random example, see [2].
Like.com showed you can build a business that uses computer vision in some way. But as Brown snarks, they "use a big bag of different heuristics to figure out the image." For the time being, that seems to be the only way to get computer vision to work in practice.
Well, his argument is that well-funded, very intelligent people are trying like hell at computer vision and not succeeding. That's not a good sign - you'd prefer a space that has hitherto been overlooked by smart people with lots of money.
I think he's arguing that computer vision is a research subject - most startups are doing known things (in the sense of "this has been successfully done before") or at most development ("this has been successfully done before - in the lab").
And yet facial-recognition is now freely available to consumers (Picasa, Facebook etc), our phones have blink detection, 3D motion detection and tracking is available to consumers for ~$100 (Kinect).
I'm not familiar with the PASCAL object detection challenge, but I just had a quick look. It's hard - if I understand it correctly, classifiers had to categorize photos as containing 5 types of objects from the 1000 leaf nodes of http://www.image-net.org/challenges/LSVRC/2010/browse-synset.... (based on the description from http://www.image-net.org/challenges/LSVRC/2010/pascal_ilsvrc...). I'm having trouble understanding the scoring scheme (how is flat cost calculated?), but based on this I'm quite impressed.
There are many different actual tasks that technically are PASCAL challenges, but when people say "PASCAL VOC challenge" (Visual Object Classes), they typically mean either the _classification_ or _detection_ challenge:
Classification: For each of the twenty classes, predicting presence/absence of an example of that class in the test image.
Detection: Predicting the bounding box and label of each object from the twenty target classes in the test image.
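For concreteness: in the detection challenge, a predicted box is scored as correct when its intersection-over-union (IoU) with a same-class ground-truth box exceeds 0.5. A minimal IoU computation for axis-aligned boxes in (x1, y1, x2, y2) form:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping in a 5x10 strip: 50 / (100 + 100 - 50)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.3333...
```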
> I cannot understand why it is that so many obviously very intelligent people decide that we need another computer vision-based startup. Because the unfortunate truth is that computer vision (right now) doesn't work.
This seems like a really good reason to create another computer vision-based startup.
No, it seems like a very good reason to take useful/promising but improperly commercialized research and turn it into a product. A startup rarely has enough runway to do the scientific research needed to solve a problem like this.
It is certainly rare, but Numenta has been doing it for the past 6 years, and for several years before that at the Redwood Neuroscience Institute from which it spun off. In doing so, Numenta undoubtedly stands on the foundation of significant progress in academia, but still has to do a fair bit of what one might call "research engineering."
Basic R&D is a cost that successful businesses--especially small ones--tend to externalize. Microsoft does a bit, Bell Labs used to do more; but you just don't start a FTL spaceship company before basic research has established a coherent theory of FTL travel.