This is probably because it can’t actually understand images; it relies on other services to process them. So it can run an image search to guess what a picture shows, or use OCR to extract text, but it’ll fail on tasks that depend on the idiosyncrasies of a particular image.