Why do you think transformers won't be key to building a level 4+ self-driving AI? It seems to me that vision-capable multi-modal transformers could be the missing piece: they can understand what is happening in the world in a deductive way.
A vision transformer is capable of predicting that a running child is likely to follow that rolling soccer ball. It is capable of deducing that a particular situation looks dangerous or unusual and that it should slow down or stay away from danger, in ways that the previous crop of AI could not.
IMO, the only thing currently preventing transformers from changing everything is the large amount of compute required to run them. It's not currently possible to imagine GPT-4V running on an embedded computer inside a car. Maybe AI ASIC-type chips will solve that issue, maybe edge computing and 5G will finally find their use case... Let's wait and see, but I would bet that transformers will find their way into many places and change the world in many more ways than bringing us chatbots.
I think we've found repeatedly in self-driving that it's not enough to solve the problem in the normal case. You need an AI model that has good behaviors in the edge cases. For the most part it won't matter how good the vision models get: you're going to need similar models that can make the same predictions from LIDAR signals, because the technology needs to work when the vision model goes crazy over the reflectivity of a surface or other such edge cases where it completely misunderstands where an object is.
I don't quite agree on this one. While I think Musk's choice to go vision-only when he did was foolish because it made his product worse, his main point is not wrong: humans do drive well using mostly vision. Assuming you can replicate the thought process of a human driver using AI, I don't see why you could not create a self-driving car using only vision.
That's also where I would expect transformers, or another AI architecture with reasoning capabilities, to shine: the fact that they can reason about what is about to happen would allow them to handle edge cases much better than relying on dumb sensors.
For a human, it would be very difficult to drive a car just by looking at sensor data. The only vehicles I can think of where we do that are submarines. Sensor data is good for classical AI, but I don't think it will handle edge cases well.
To be a reasonable self-driving system, it should be able to decide to slow down and maintain a larger safety margin because it judges the car in front to be driving erratically (e.g., due to driver impairment). Only an AI that can reason about what is going on can do that.
Sure, but humans do a lot more with vision than just convolutions. So maybe we need to wait for AI research to invent new techniques as revolutionary and as impactful as convolutions, to the point where it's believable that AI models can handle the range of exceptions humans handle. Humans are very good at learning from small data, whereas AI tends to be pretty terrible at one-shot learning by comparison. That's going to continue being hugely relevant for edge cases. We've seen many examples now where a self-driving car crashes because too much sunlight distorted its perception of where objects are. We can either bury our heads in the sand and pretend AI models work like humans and need the exact same inputs humans do, or we can admit there are limitations to the technology and act accordingly.
I also think "dumb sensors" is unfair: there are neural network solutions for processing LIDAR data, so we are talking about a similar level of intelligence applied to both sensors.
> For a human, it would be very difficult to drive a car just by looking at sensor data.
What is vision if not sensor data?? Our brains have evolved to efficiently process and interpret image data. I don't see why from-scratch neural network architectures should ever be limited to the same highly specific input type.
Can’t argue with this logic; more data points certainly help. I was arguing about vision vs. LIDAR, and vision + LIDAR is certainly better than vision alone.
Bandwidth alone isn’t what prevents 5G from serving this sort of application, at least in the USA. Coverage maps tell the story: coverage is generally spotty away from major roads. Cell coverage isn’t a fixable problem in the near term, because every solution intersects with negative political externalities (anti-vax sentiment, NIMBYism, etc.); if you can get people vaccinated for measles consistently again, then we can talk.
It all needs to be onboard. That’s where money should be going.
If you plan on letting llava-v1.5-7b drive your car, please stay away from me.
More seriously, for safety-critical applications, LLMs have some serious limitations (most obviously hallucinations). Still, I believe they could work in automotive applications assuming two things: higher output quality (better than the current state of the art) and very high throughput (hundreds or even thousands of tokens/s or more), allowing you to brute-force the problem and run many inferences per second.
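A minimal sketch of what I mean by brute-forcing, assuming a hypothetical `query_model` callable standing in for whatever on-board vision-language model ends up being used: sample the same assessment several times per frame and only act on an answer that a clear majority of samples agrees on.

```python
from collections import Counter

def assess_scene(query_model, frame, n_samples=5, min_agreement=0.8):
    """Ask the on-board model the same question several times and only
    trust an answer that a clear majority of samples agrees on."""
    prompt = "Is the scene ahead hazardous? Answer yes or no."
    answers = [query_model(frame, prompt) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    if count / n_samples >= min_agreement:
        return answer       # consistent across samples -> usable
    return "uncertain"      # disagreement -> treat as a hallucination risk
```

This only works if the model is fast enough to be sampled several times within a single control cycle, which is exactly why the token throughput matters.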
Could you combine the existing real-time driving model with input from the LLM, either as an enhancement to allow understanding of unusual situations, or as a cross-check?
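One possible shape for that cross-check, purely illustrative (the names `planner_speed_mps` and `llm_risk` are my own): keep the classical real-time planner authoritative, and let the LLM only make the behaviour more conservative, never less.

```python
def arbitrate(planner_speed_mps: float, llm_risk: str) -> float:
    """Combine the classical planner's target speed with the LLM's risk label.
    The LLM can only reduce speed (a cross-check), never raise it, so a
    hallucinated 'all clear' cannot override the real-time stack."""
    caps = {"high": 0.5, "medium": 0.8}  # fraction of planner speed to keep
    return planner_speed_mps * caps.get(llm_risk, 1.0)

# Planner says 20 m/s, LLM flags the car ahead as driving erratically:
# arbitrate(20.0, "high") -> 10.0
```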
I wasn't intending to say it would be useful today, but to push back against what I understood to be an argument that, once we do have a model we'd trust, it won't be possible to run it in-car. I think it absolutely would be. The massive GPU compute requirements apply to training, not inference, especially as we discover that quantization is surprisingly effective.
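Back-of-envelope numbers, under my own assumption of a 7B-parameter model and counting weights only (no activations or KV cache), show why quantized in-car inference doesn't look crazy:

```python
# Weight-only memory footprint for a 7B-parameter model at different precisions
# (ignores activations and KV cache, so treat it as a lower bound).
params = 7e9
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```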