> With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.
It will answer questions about images and/or videos; it's open-source and competes with some of GPT-4's advanced features. It's extremely poorly explained on the GitHub page because it's trying to attract interest from other AI researchers with ~256 buzzwords. Great if you already know what it is, extremely unhelpful if you don't.
It seems quite good; ironic that the GitHub landing page communicates the idea so poorly.
Does it? I have tried to read the README and I still can't figure out what it does. There's also so much random stuff smashed into the README that just trying to figure out where to get started...reading...is an exercise in frustration.
The Title:
> "Video-LLaVA: Learning United Visual Representation by Alignment Before Projection"
I know all of those words, but I don't understand what they mean in that order or context. Let's move on.
Next up is a bunch of links to other pages, other projects, and news about the project. Let's skip all that.
Finally we get to something called "Highlights":
> "Video-LLaVA exhibits remarkable interactive capabilities between images and videos, despite the absence of image-video pairs in the dataset."
OK, so now I know that it does something with images and videos, although I am not sure what that something is. I still don't know what it IS, though. Is it an application? An LLM?
Continuing on...
> Simple baseline, learning united visual representation by alignment before projection
> With the binding of unified visual representations to the language feature space, we enable an LLM to perform visual reasoning capabilities on both images and videos simultaneously.
> High performance, complementary learning with video and image
> Extensive experiments demonstrate the complementarity of modalities, showcasing significant superiority when compared to models specifically designed for either images or videos.
Seriously...what? Did a (bad) LLM write those sentences or am I just an idiot?
Then there's a picture, a demo video, some installation and basic CLI usage commands (hey, now I finally know it's a Python tool!), API info, and more random stuff.
Honestly I have attempted to read through this README several times and I still don't really know what I'm looking at.
I agree, the README is not really understandable if you're not into AI research techno-babble. Adding even one sentence aimed at normal people would have helped.
To answer your question: it's a model that you can give images and videos, which you can then interact with via an LLM (ask questions, have it describe them, process them further, etc.). It can "see" them, basically.
It's the same capability as GPT-4V (ChatGPT's "upload image" feature), except that ChatGPT only accepts images, not videos.
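In practice it looks roughly like any other multimodal model. Here's a rough, untested sketch of how I'd expect to query it through the Hugging Face transformers integration; the class names and the LanguageBind/Video-LLaVA-7B-hf checkpoint id are from memory, so treat them as assumptions and defer to the repo's own CLI instructions:

```python
# Untested sketch: asking Video-LLaVA a question about a short clip via the
# Hugging Face transformers integration (class and checkpoint names assumed).
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"  # assumed checkpoint name
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)

# Stand-in for real footage: 8 RGB frames as a (frames, H, W, C) uint8 array.
# In practice you'd decode frames from an actual file (e.g. with PyAV or decord).
video = np.zeros((8, 224, 224, 3), dtype=np.uint8)

prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=video, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

Swap the dummy array for real decoded frames and you have the "give it a video, ask it questions" workflow the README is dancing around.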
> Honestly I have attempted to read through this README several times and I still don't really know what I'm looking at.
Sounds like you've attempted to watch the demo video zero times, though, because even without reading the README in detail I could tell what the project does just by watching the demo.