It's not set up for that, no, though it's theoretically possible!
The issues I see are:
- Transcription models use beam search to choose the most likely words at each step, taking into account the surrounding words. The accuracy will drop a lot if you pick each top word individually as it’s spoken. The surrounding context matters a lot.
- To that point, transcription models do get things wrong (e.g. "best" instead of "test"). LLM post-processing can help here, taking in the top-N hypotheses from the transcription model and determining which makes the most sense ("run the tests", not "run the bests"), adding another layer of semantic understanding. Again, the surrounding context really matters; see the sketch after this list.
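Here's a minimal, self-contained sketch of that rescoring idea. It's illustrative only: `lm_log_prob` is a stand-in for a real LLM scorer, and the combination weight is made up, not the app's actual pipeline.

```python
# Minimal sketch: rescoring top-N transcription hypotheses with an LM.
# `lm_log_prob` is a placeholder scorer, faked here with a tiny phrase
# table so the example runs standalone.

def lm_log_prob(text: str) -> float:
    """Placeholder LM score: higher means more plausible English."""
    plausible = {"run the tests": -2.0, "run the bests": -9.0}
    return plausible.get(text, -20.0)

def pick_best(hypotheses: list[tuple[str, float]]) -> str:
    """Each hypothesis is (text, acoustic_log_prob) from beam search.
    Combine acoustic and LM scores and return the best text."""
    lm_weight = 0.5  # how much to trust the LM vs. the acoustic model
    return max(hypotheses, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0]))[0]

# Top-2 beam outputs: the acoustically likelier hypothesis is wrong.
print(pick_best([("run the bests", -1.0), ("run the tests", -1.5)]))
# -> "run the tests"
```

With these numbers, "run the tests" wins (-1.5 + 0.5·(-2.0) = -2.5 vs. -1.0 + 0.5·(-9.0) = -5.5), which is exactly the kind of fix that's impossible if you commit to each word the moment it's spoken.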
Do you need each word to stream individually? Or would it be sufficient for short phrases to stream?
The MLX inference is so fast that you could accomplish something like the latter by releasing and re-pressing the shortcut every 5-10 words. It's so fast it honestly feels like streaming. In practice, I tend to do something like this anyway, because I find it easier to review shorter transcripts!
It's hard to find unoccupied shortcuts these days! I don't use shortcuts on the numbers often, so I set that as a default. But yes, it's easily configurable in settings so you can choose something that works for your workflows.
Sarp! Good to hear from you! I hope life has been good since the Instagram days. Yes, I've noticed the multi-resizing issue with cmd + 8 - I'll look into it this week. Regarding the cmd + 0 toggle, I think I can probably make that work too. We can make the dismiss shortcut configurable; then you can choose the same keys as the launch shortcut, making it a toggle. I'll also take a look at that this week.
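For illustration, here's a toy sketch of the toggle behavior I have in mind; the names and structure are hypothetical, not the app's real code:

```python
# Toy sketch of the proposed shortcut handling (hypothetical names).

def next_window_state(pressed: str, launch: str, dismiss: str,
                      visible: bool) -> bool:
    """Return whether the window should be visible after a hotkey press."""
    if pressed == launch == dismiss:
        return not visible   # same shortcut for both: acts as a toggle
    if pressed == launch:
        return True          # launch shortcut: always show
    if pressed == dismiss:
        return False         # dismiss shortcut: always hide
    return visible           # unrelated key: no change

# With launch and dismiss both set to cmd+0, repeated presses toggle:
state = False
for _ in range(3):
    state = next_window_state("cmd+0", "cmd+0", "cmd+0", state)
    print(state)  # True, False, True
```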
"Why not Linux or Windows? Gotta start somewhere! If the reception is positive, we’ll work hard to add further support."
As you can see from the commit log, we have 3 people working on this. So we're quite limited in what we can take on. That said, our belief holds and we'd love to support Linux and Windows.
I had "MacOS" in my original title, but HN limits titles to 80 characters!
It would be nice if the README made it clear toward the top that this is Mac software. The screenshot and the mention of Xcode give that vibe, of course, but I kept reading anyway and felt a bit bummed to only confirm it at the end.
Looks like a cool project; wishing y'all the best. Let us know if and when the Linux support drops :)
I think the license choice is great. It allows noncommercial use, modification, and redistribution. It's not "open source" according to the champions of the term (since it violates the use-for-any-purpose requirement), but I'm a huge fan of this license and license several of my projects CC BY-NC where the AGPL would be too heavy-handed.
"It's not recognized as Open Source by the Open Source body, and doesn't meet the criteria of Free/Open Source Software, but is Open Source" is a bit like saying "I used GMO and petroleum based pesticides, but my produce is all organic."
Why should words like "organic" in relation to food mean without pesticides? I mean, all carbon- and water-based life forms are organic, right?
I can define Open Source easily, using the OSI definition.
There is no trademark for "Open Source" because the OSI failed to secure one, but we have decades of use of the term to mean something specific.
Pros:
- Some of the Sora results are absolutely stunning. Check out the detail on the lion, for example!
- The landscapes and aerial shots are absolutely incredible.
- Quality is much better than Mochi & LTX out of the box. Mochi/LTX seem to require specifically optimized workflows (I've seen great img2vid LTX results on Reddit that start with Flux image generations, for example). Hunyuan seems comparable to Sora!
Cons:
- Still nearly impossible to access Sora despite the “launch”. My generations today were in the 2000s, implying that it’s only open to a very small number of people. There’s no API yet, so it’s not an option for developers.
- Sora struggles with physical interactions. Watch the dancers moonwalk, or the ball go through the dog. HunyuanVideo seems to be a bit better in this regard.
- Can't run it locally (obviously)
- I haven't tested this, but I think it's safe to assume Sora will be censored extensively. HunyuanVideo is surprisingly open (I've seen NSFW generations!)
- I’m getting weird camera angles from Sora, but that could likely be solved with better prompting.
Overall, I’d say it’s the best model I've played with, though I haven’t spent much time on other non-open-source ones. Hunyuan gives it a run for its money, though!
I can't speak to any of those videos in a technical sense but personally, I don't feel like any of them are good?
The vibe they give me is similar to the iPhone photography commercials where yes, in theory, a picnic in the park could look exactly like this except for all the parts that seem movie perfect.
I guess it's really more of a colour grading question where most of the Sora colour grading triggers that part of my brain that says "I'm watching a movie and this isn't real" without quite realising why.
A few of the Hunyuan videos in contrast seem a bit more believable even though they have some obvious glitches at times.
The other thing I think Sora has is that thing in commercials where no one else except the protagonist exists and nothing is ever inconvenient. The video of the teacher in a classroom with no students reminds me of that as well as the picnic in the park where there's wide open space with no one around.
I suppose it depends if the goal is to generate believable video and how you define believable.
Hunyuan was more realistic but lower quality than Sora: shorter videos at lower resolution or bitrate. The downside to Sora's sharpness is that it makes mistakes more apparent. Also funny that Sora didn't understand the rolling dunes metaphor.