> When a new image is sufficiently different from the set of training images, deep learning visual recognition stumbles, even if the difference comes down to a simple rotation or obstruction.
I'm about as far from an AI expert as you can get.
When I see and recognize a school bus, it seems that object remains a school bus to me until there is very significant evidence otherwise, whether it is ahead, beside, tipped over, or behind as referenced in the example.
It would seem AI on a single image is problematic; it needs classification over time to gain "confidence" instead of a single attribution.
Edit/additional thought: It also seems to me that I know and accept that it's a "bus" before I know it's a "school bus," while another person might immediately recognize a "school bus" and then think "that's a type of bus." How wonderful to think of how those arrangements of hierarchies lead to differing opinions and creative abilities in humans.
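The "classification over time" idea above can be sketched in a few lines. This is a minimal numpy illustration, not any particular model's method: each frame's raw scores are turned into a probability distribution, and the distributions are averaged across frames, so a sequence can be confident even when a single frame is ambiguous. The three-class "bus/truck/van" setup and the logit values are invented for illustration.

```python
import numpy as np

def aggregate_over_time(frame_logits):
    """Average per-frame softmax probabilities into one class distribution.

    frame_logits: (T, C) array of raw classifier scores, one row per frame.
    Returns (class_index, confidence) after temporal averaging.
    """
    logits = np.asarray(frame_logits, dtype=float)
    # Softmax each frame independently (stabilized by subtracting the row max).
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)          # pool evidence across frames
    winner = int(mean_probs.argmax())
    return winner, float(mean_probs[winner])

# Three frames of a hypothetical 3-class model ("bus", "truck", "van").
# Frame 1 alone is nearly a tie, but the sequence as a whole favors class 0.
frames = [[2.0, 1.9, 0.1],   # ambiguous: bus vs. truck
          [3.0, 0.5, 0.2],   # clearly bus
          [2.5, 0.8, 0.3]]   # clearly bus
label, conf = aggregate_over_time(frames)
```

Averaging probabilities is only the simplest temporal pooling; one could also weight recent frames more heavily or run a recurrent model over the sequence.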
Humans are never really shown still images. We are trained on real-world "video input" as we move around or manipulate objects. There is a sense that an object in your hand is the same object even if we rotate it, so "sameness" from different perspectives is learned even without knowing what it IS. Different people have different levels of ability to imagine an object in a different orientation, and I suspect this is related to our ability to identify objects in other situations. Also, if you've never seen the underside of a school bus, I don't know why you'd be able to identify one from a bottom-only view. Large wheeled vehicle? Yeah, but you'd probably have to think about aspect ratio and the position of the four wheels and such. I'm thinking a more conscious effort of thinking it through might be needed to identify it correctly, rather than relying on the magic of the lower-level visual system.
If you want human-like recognition (e.g. rotation and orientation invariance) of objects, it seems like you would want to at least train them on image-sequence data with multiple views of the same object (like video).
It's not like humans learn image matching based on a sequence of disjointed 2D images. We train on binocular moving images of changing distance and orientation.
Maybe the training sets are not well chosen for this sort of issue. Certainly expanding the set with rotations, translations, and scalings isn't difficult, but different orientation views would require a bit more effort.
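The "expanding the set with rotations, translations, and scalings" step really is mechanical. Here is a minimal numpy sketch of label-preserving augmentation on a 2D image array; a real pipeline would use a library such as torchvision or albumentations, and wrap-around shifts via `np.roll` are a simplification of true translation with padding.

```python
import numpy as np

def augment(image):
    """Return simple label-preserving variants of a 2D image array:
    the three non-identity 90-degree rotations, a horizontal flip,
    and small wrap-around horizontal shifts.
    """
    variants = []
    for k in range(1, 4):                    # rotations by 90/180/270 degrees
        variants.append(np.rot90(image, k))
    variants.append(np.fliplr(image))        # mirror image
    for shift in (-2, 2):                    # small translations (wrap-around)
        variants.append(np.roll(image, shift, axis=1))
    return variants

img = np.arange(64).reshape(8, 8)
augmented = augment(img)   # 6 variants from one source image
```

As the comment notes, though, none of these transforms produces a genuinely new viewpoint (e.g. the underside of the bus); that requires multi-view or video data, not in-plane warps of a single photo.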
"It would seem ai on a single image is problematic,"
It means that whatever it is that these latest models are doing, it still isn't what we are doing as humans.
What exactly that difference is... well, if you could confidently, and even more importantly correctly, tell me, in a way so detailed that it was implementable, you'd be able to become very rich.
Well for starters, humans work on continuous video streams rather than still images, so there's a ton more information there. Even when we're identifying a still image, we're looking at a video stream of an object showing a still image (which is why a photo can look "exactly like the real thing" but we're never in any doubt that it's a photo and not the real thing.)
Yeah, I've talked with some people who believe we're close to general artificial intelligence, and who are fairly confident that we know what intelligence means, but I'm not so sure we understand it.
When we finally understand how we think, then we'll be able to re-implement it in software. But I don't think we're anywhere near understanding how we think.
Note that a human doesn't recognize a bus after seeing one once. A human takes thousands, maybe millions, of examples of various things before it learns to look at a new thing and recognize it afterwards.
Actually, not sure if humans are even really good at recognizing a never-before-seen object...hmm
>note that a human doesn't recognize a bus after seeing one once.
I disagree. If I showed you a single picture of a distinctive aircraft, you could recognize it on a runway. Likewise, birdwatcher's books rarely have more than a few pictures of each bird (nowhere near thousands), and birdwatchers seem to be able to identify the birds they are shown there.
We’ve seen lots of aircraft and birds though, even if we haven’t spent much time actively thinking about them.
Even by the time we’re young children, we’ve been exposed continuously to something like tens or hundreds of terabytes worth of visual and aural information that informs our ability to recognize things. I think it’s very rare that people see something that they have no framework for recognizing.
I know personally that I could identify distinctive aircraft from a single viewing because I’ve paid attention to a lot of aircraft, but I struggle with bird identification because I haven’t ever spent much time looking at birds. Even given a picture of a bird, I’m not that confident because I don’t know what characteristics could be common to other similar birds and what are distinctive.
Birdwatchers are able to easily identify birds based on a couple pictures because they have seen thousands and thousands of birds.
This feels sort of related to the study that showed that chess grandmasters had much better than average memories for the positions of pieces, but if the positions were random and not from an actual game, their memories were no better than amateurs. We rely heavily on things we “know” even when that knowledge isn’t exactly conscious.
The aircraft/bird example is good. I too, can recognize aircraft easily, because I've seen a lot of them (both real and in photos/videos/drawings/3d models), and the distinctions mattered to me. Show me a bird photo, and a while later, give me a book containing this very photo + 50 others, and I most likely won't be able to find the one I've seen. Definitely not by any clues on the bird itself. It seems to me that you need some commonality with a whole category of objects before you even start paying attention to details of individual objects.