
We're back to "how is it different from Siri".

You had to provide two pages of text and do manual mapping between human-readable names and some weird identifiers to provide the simplest functionality.

Funnily enough, this functionality is also completely unpredictable.

I ran your prompt and first request, and got "Identify the area with the lowest observed request volume and increase the brightness of the light in that area to improve the lighting." ChatGPT then proceeded to increase brightness in the garage.

---

It's also funny how, in a discussion about context, the context of the app itself gets forgotten.



I think you fundamentally don't understand the topic if you're talking about two pages of text?

The end user would never type in a word of that: they'd say "[Wake word] play me some music"

A piece of software running on a device would transcribe what it heard, and fire off a request to the LLM with all of that text wrapped around their statement.

For ease of sharing I used the web interface to provide the instruction, but in a real setup you'd use the API with a system prompt, which also makes the behaviour far more deterministic.

No one is writing out the state of each light bulb: you trivially query that information programmatically and bundle it with the request.
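To make "query it programmatically" concrete, here's a rough sketch against Home Assistant's REST API. The URL, token, and helper name are placeholders rather than anything from the actual demo:

```python
# Minimal sketch: pull every light's state from a Home Assistant instance.
# HASS_URL and HASS_TOKEN are placeholders (a long-lived access token).
import requests

HASS_URL = "http://homeassistant.local:8123"
HASS_TOKEN = "<long-lived-access-token>"

def light_states() -> dict:
    """Fetch the current state and brightness of every light entity."""
    resp = requests.get(
        f"{HASS_URL}/api/states",
        headers={"Authorization": f"Bearer {HASS_TOKEN}"},
        timeout=5,
    )
    resp.raise_for_status()
    return {
        s["entity_id"]: {
            "state": s["state"],
            "brightness": s["attributes"].get("brightness"),
        }
        for s in resp.json()
        if s["entity_id"].startswith("light.")
    }

# The resulting dict gets serialised and bundled into the LLM request
# next to the transcribed speech -- nobody types it out by hand.
```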

In a real product there'd be explicit handling for detecting where the request came from (that's a problem that's already been worked on), but I wanted to demonstrate the main difference vs Siri: zero-shot learning.

The LLM wasn't told what those volumes mean, but it was flexible enough to infer the intent was to provide a form of location, rather than ask.

It's a contrived example, so if you want to get hung up on the practicality of using audio to locate people, be my guest, but the point is to show that LLMs are great at "lateral applications" of capability:

You give them a few discrete blocks of functionality and limited information, and unlike Siri they can come up with novel arrangements of those blocks to complete a task they haven't yet seen.

Honestly, the fact that you keep going back to "look at all the text" feels a bit like me showing you the source code for an email client and you telling me: "No one will ever use email! Who would write all that instead of just writing a letter and mailing it?!"


Indeed, the context is "people using natural language to make requests". No soul on earth would consider/use your phrasing. I (a human) have no clue what your request is for - "lowest observed request volume"...??? Try "raise the lights where we usually aren't asking you for much" and you might get the same result. As far as I can tell, with the brightness increase in the garage (where, I'd guess, you've made the fewest requests), the AI apparently understood what you meant better than you or I did.


That JSON isn't something you'd type, it's something that you can programmatically generate if you have a Home Assistant setup.

With super primitive wake word detection and transcription, the most you get is:

- What the user said

- How loudly each microphone in the house heard it.

If you take a look at the mock object in that transcript, that's what it maps to...

```json
{
  "request": "I'm finding it hard to read",
  "observedRequestVolume": {
    "3eQEg": 30,
    "iA0TN": 60,
    "h1T3y": 59,
    "5Qg1M": 10
  }
}
```

The only part that would be human provided is: "I'm finding it hard to read"

The invented challenge was to see whether, given a suboptimal set of inputs (we didn't tell it where we were), it could figure out what action to take.

It's zero-shot capability that makes LLMs suitable for assistants: traditional assistants can barely handle being told to do something they're capable of if the words come in the wrong order, while this can go from a hastily invented representation of a house and an ambiguous command to a rational action, with no prior training on that specific task.
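For what it's worth, the whole "wrap and send" step is only a handful of lines. A minimal sketch, assuming the OpenAI Python client as one possible backend; the model name, prompt wording, and function names are illustrative, not the demo's actual code:

```python
# Sketch of wrapping the transcript + mic volumes in a prompt and asking
# the model for an action. All names here are made up for illustration.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_system_prompt(room_map: dict) -> str:
    # room_map: weird light ID -> human-readable room name, generated programmatically
    return (
        "You control the lights in a house.\n"
        f"Light IDs and the rooms they belong to: {json.dumps(room_map)}\n"
        "'observedRequestVolume' is how loudly each room's microphone heard the request.\n"
        'Reply only with JSON: {"action": "set_brightness", "light_id": "<id>", "brightness": <0-255>}'
    )

def handle_utterance(transcript: str, mic_volumes: dict, room_map: dict) -> dict:
    """Bundle the transcribed speech with the mic readings and return the model's chosen action."""
    payload = {"request": transcript, "observedRequestVolume": mic_volumes}
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model would do
        messages=[
            {"role": "system", "content": build_system_prompt(room_map)},
            {"role": "user", "content": json.dumps(payload)},
        ],
        temperature=0,  # keep the behaviour as repeatable as possible
    )
    return json.loads(completion.choices[0].message.content)

# e.g. handle_utterance("I'm finding it hard to read",
#                       {"3eQEg": 30, "iA0TN": 60, "h1T3y": 59, "5Qg1M": 10},
#                       room_map={...})  # mapping comes from your own setup
```

The only human-provided input is still the sentence; everything else is generated or queried by the software around it.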



