ej88 · 2026-05-14T18:09:54 1778782194

i think the thread shows:

- most people's perception of art is heavily affected by the framing (and to a lot of people ai = bad, and so they start seeing technical issues with it that could /never/ be made by Monet despite it being a Monet)

- but I think the critique here is more: even if someone recreated a Monet stroke-for-stroke, what's the value of this copy? I think the artist's personal life and context around the painting adds so much more to it compared to just being a pretty painting (perhaps this is the single most important part of what makes a painting interesting and valuable)

ej88 · 2026-05-12T16:56:02 1778604962

SWE-Bench Pro has a public and a private test set, where the private eval is from proprietary codebases only

ej88 · 2026-05-12T16:54:33 1778604873

This is cool!

I used to work on post-training & evals. it's really hard to make a good eval set and catch all forms of reward hacking. Excited to see more from poolside!

ej88 · 2026-05-12T16:26:49 1778603209

An omni model seems very useful for real-time human-computer interaction, off the top of my head:

- Voice assistants

- Customer experience

- Gaming

- Meeting assistants

- Real-time coach or user assistant for using software

- Translation

- Real-time work on a computer controlled by voice (frontend / mobile dev, CAD, 3D modeling, etc)

Traditionally a lot of these use cases with LLM agents are higher latency because the model needs to wait for the speaker to finish, then decide to call a tool or respond - if they call a tool they need to process the tool result and decide if they want to call a tool or respond, etc...
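To put rough numbers on the turn-based loop described above, here is a minimal sketch. The `Step` type, the `turn_latency_ms` helper, and every millisecond value are made up for illustration; they are not measurements from any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str         # "tool" or "respond"
    model_ms: int     # time the model takes to decide (hypothetical)
    tool_ms: int = 0  # time the tool call itself takes (hypothetical)


def turn_latency_ms(endpoint_wait_ms: int, steps: List[Step]) -> int:
    """Delay between the user finishing speaking and the agent starting to reply."""
    total = endpoint_wait_ms  # waiting to be sure the speaker is done
    for step in steps:
        total += step.model_ms  # model decides: call a tool or respond
        if step.kind == "respond":
            return total
        total += step.tool_ms  # tool runs, result goes back to the model
    raise ValueError("turn never ended with a response")


# Two tool calls, then a response (all numbers are assumptions).
steps = [Step("tool", 400, 600), Step("tool", 400, 300), Step("respond", 400)]
print(turn_latency_ms(500, steps))  # 2600
```

Every extra tool round trip adds a model decision plus a tool call on top of the initial wait, which is where the latency in the turn-based setup comes from.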

darajava · 2026-05-12T18:33:20 1778610800

I'm not saying an omni model isn't useful for HCI - essentially my problem is that these demos seem to be highlighting the model's ability to interrupt the user (which is almost always not a good thing), its ability to keep time (which should be a non-issue really), and it showcases these using fairly lame use-cases.

ej88 · 2026-05-12T18:35:01 1778610901

Ya, the demos were pretty contrived (feels like a running theme amongst the labs...)

ej88 · 2026-05-12T15:47:55 1778600875

i would argue it's the opposite

farming hit a ceiling because of demand

software today is heavily, heavily constrained by supply. demand is basically infinite for actually good software that solves problems people have (and people always have problems).

ej88 · 2026-05-05T21:39:08 1778017148

"She rejected several applicants with PhDs and engineering backgrounds, reasoning that their level of education could not compensate for a lack of hands-on specialty coffee experience."

This is depressing.

p1necone · 2026-05-05T22:13:40 1778019220

This seems pretty reasonable to me?

If I was hiring a single new staff member in an already staffed cafe (and I trust the existing staff to be good mentors), sure, hire anyone, train them up.

But if I'm hiring the first handful of employees, especially if I'm trying to make good coffee and run a smooth operation, I'd want someone with some experience already - their PhD doesn't really tell me anything about their ability to work in a cafe. This goes doubly so when I'm some ethereal AI that isn't going to be working alongside them.

There's no such thing as "unskilled labor".

xandrius · 2026-05-05T22:23:14 1778019794

Btw, you'd be surprised how incapable some people are of doing menial tasks the higher you go up the academic ladder.

And it makes total sense: most people with PhDs were not the ones who loved tinkering with stuff, fixing motorbikes, etc. They stayed inside and either liked books, computers or something akin. (not everyone ofc)

Aurornis · 2026-05-05T22:29:22 1778020162

Experience isn't a hierarchy. Having a PhD doesn't make someone good at tasks they've never done before.

This ignores the real reason that over-qualified people are often skipped for jobs: They are never interested in staying at that job. It's always something temporary until they find the job they really want, which could happen in days, weeks, or months. They probably won't give 2 weeks' notice because they don't care about their references in the retail industry, meaning you're emergency short-staffed and have to repeat the hiring process all over again.

JohnMakin · 2026-05-05T21:51:15 1778017875

Don't worry, it's fiction.

DavidVoid · 2026-05-05T22:09:58 1778018998

I don't think so. The cafe is a real place and it's owned by the company mentioned in the article. It was in the local news the other week [1].

If you're going to do an experiment like this, then Stockholm is a good place to do it, since the bureaucracy here is very digitalized.

[1]: https://www.mitti.se/nyheter/ai-driver-eget-kafe-i-vasastan-...

JohnMakin · 2026-05-05T22:18:51 1778019531

Yes, it is literally a place, I wasn't saying it wasn't. The fiction is that this is pure PR fluff of what is actually going on, a human/dev team is prodding this thing in ways to "manage" the employees. This was pointed out in their last PR stunt:

https://news.ycombinator.com/item?id=47794391

So yes, it is a type of fiction. They also have every incentive to hype this up, given what their company does. I really wish people had more skepticism and critical thought with these things, it isn't actually good at all for the AI space and its future success.

sureMan6 · 2026-05-05T22:09:39 1778018979

I've lived in Sweden, it's real

JohnMakin · 2026-05-05T22:19:07 1778019547

Yes, it is literally a place, I wasn't saying it wasn't. The fiction is that this is pure PR fluff of what is actually going on, a human/dev team is prodding this thing in ways to "manage" the employees. This was pointed out in their last PR stunt:

https://news.ycombinator.com/item?id=47794391

So yes, it is a type of fiction. They also have every incentive to hype this up, given what their company does. I really wish people had more skepticism and critical thought with these things, it isn't actually good at all for the AI space and its future success.

ej88 · 2026-05-04T21:28:55 1777930135

adding some context as someone who works in this space

1. most people (average, non-tech people) reach for the phone to call in for easily solvable problems. Plus, if the agent is integrated deep enough & has tools to interact with crms, you can raise the ceiling on the types of problems it can solve.

You're trying to avoid the bad customer experience of human 1 reading off their script, then they transfer you to some other department who may or may not know how to solve your problem, and the entire interaction cost the company way more than the value created, so the company is disincentivized to help customers.

2. All the companies in this space start with the outsourced BPO market for cx (multi billion market still) but the next market is going to be in revenue generation and churn prevention at scale, i.e. how do you proactively avoid customer issues, how do you upsell and generate revenue instead of reducing cost, how do you keep customers happy?

3. On the contrary, I think more companies will pivot to outcome-based pricing: it's so much more measurable than seat-based and protects margins better than usage-based. Plus, cx is one of the few industries with very well-known metrics

4. Kind of? Most companies in this space don't use native voice models which are noticeably dumber, they use transcription + a stronger text model + TTS. The majority of customers can be handled with the latest SOTA text model and you need smart context engineering to handle the long tail of more complicated asks
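The transcription + text model + TTS setup from point 4 can be sketched roughly as below. The function names (`handle_turn`, `fake_transcribe`, `fake_reply`, `fake_tts`) are hypothetical and the three stages are stubs, not calls to any real speech or LLM service; a real deployment would plug vendor clients into the same three slots.

```python
from typing import Callable


def handle_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one caller turn through the three cascaded stages."""
    text = transcribe(audio)      # speech -> text
    reply = generate_reply(text)  # text model decides what to say
    return synthesize(reply)      # text -> speech


# Stub stages so the sketch runs without any external service (assumptions).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reply(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


out = handle_turn(b"where is my order", fake_transcribe, fake_reply, fake_tts)
print(out)  # b'You said: where is my order'
```

Because the text model sits behind a plain function boundary, it can be swapped for a stronger or faster one without touching the speech stages.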

woeirua · 2026-05-04T23:31:47 1777937507

1 & 2 are totally dependent on the company being willing to let their agents do things that they haven’t traditionally let humans do. For example, issue refunds, or do things that cost money but generate good will. I am skeptical that companies will be OK with their agents doing those things of their own volition.

3. Cool so the user didn’t indicate if they were satisfied. What then?

4. You can’t use a SOTA model right now for reasoning, there’s too much latency for a conversation. So you’re either using an older, but significantly less capable model, or you’re paying out the nose for fast mode. If the former then you can’t trust the agent to do the right thing (see points 1&2). If the latter, there’s no cost savings over a human. So which is it?

ej88 · 2026-05-05T03:21:08 1777951268

1&2 are already happening, these startups take on brand liability and trust to do so

3 depends on how companies want to measure it, but lack of user submitting satisfaction score is not a good thing

you can use a model w/o reasoning, + use various tricks to simulate low latency

woeirua · 2026-05-05T16:51:30 1777999890

At the end of the day the company is going to audit what the agent has done. If the agent issues too many refunds that's a major red flag for the company providing the agent and likely results in the contract being terminated. I don't see how anyone can underwrite what agents are going to do today given that they're still so susceptible to prompt injection.

You didn't address my concern, non-reasoning models are so, so variable in their output.

ej88 · 2026-05-05T18:05:41 1778004341

1. part of the moat is their guardrails and obviously they are audited and tracked. there are agents issuing refunds and more at scale right now so not sure where the skepticism comes from.. you're free to try and jailbreak them

2. another part of the value prop of these companies is figuring out how to construct the proper harness to take advantage of the lower latency of faster models while shoring up the weaker intelligence, how you blend deterministic and non-deterministic behaviors, compliance etc.

it's a hard problem which is why f500 is willing to pay up

woeirua · 2026-05-06T01:44:52 1778031892

I’m curious where you see models like Codex-Spark in this problem? I know they’re too expensive and availability is too limited right now, but in a few years…

maxdo · 2026-05-05T00:16:01 1777940161

Yes you could, not everything needs to be real time. Anyway, you sometimes listen to the music for 30 mins plus

ej88 · 2026-05-04T18:28:20 1777919300

ai skeptic fanfic evolves in fascinating ways every day

svnt · 2026-05-04T20:29:08 1777926548

This isn’t specific to AI; this is just the dark-arts startup valuation playbook. It's the AI extension of gaming the metric “what is the ratio of ‘active’ accounts to validated human DAUs?”

Lionga · 2026-05-04T19:32:49 1777923169

just wait until you read the ai "optimist" fanfic

ej88 · 2026-05-04T19:44:39 1777923879

true. we'll see how many ai cos become profit printers a few years from now

ej88 · 2026-05-04T18:24:09 1777919049

he's board chair of OpenAI, ex co-CEO of Salesforce, ex CTO of Facebook, and can get a meeting with any exec in the F500...

their moat is distribution

colesantiago · 2026-05-04T22:10:05 1777932605

> their moat is distribution

It is trust.

Everyone in the valley knows Bret Taylor and will back any project he does, even if the product has no distribution.

The same way everyone in the valley knows Naval Ravikant for example, angels and VCs will back any project he does even if his product has no distribution.

svnt · 2026-05-04T20:25:18 1777926318

Is that really a moat though or something like a firehose of gasoline?

JumpCrisscross · 2026-05-04T22:32:17 1777933937

> Is that really a moat though or something like a firehose of gasoline?

It's a moat from a defensive perspective. It's a firehose from an offensive one. Outside state capture, most moats are both.

ej88 · 2026-05-04T21:21:38 1777929698

it's a moat vs. other startups and it carried them to multi-B valuation

obviously the product needs to deliver and nrr needs to be good in the long run

ej88 · 2026-05-04T18:21:37 1777918897

ime it's very implementation dependent

but even a simple impl to answer questions can knock out like 50% of callers who are tech-illiterate at 100x cheaper cost, it's just strictly better economics and better for those customers
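Rough arithmetic for the claim above, as a sketch: the 50% deflection and the 100x cost ratio come from the comment, while the $10 per human-handled call and the `blended_cost` helper are assumptions made up for illustration.

```python
def blended_cost(human_cost: float, deflect_rate: float, cost_ratio: float) -> float:
    """Average cost per call when a share of calls is handled by the cheaper agent."""
    human_share = (1 - deflect_rate) * human_cost
    agent_share = deflect_rate * (human_cost / cost_ratio)
    return human_share + agent_share


human_cost = 10.0  # hypothetical dollars per human-handled call
avg = blended_cost(human_cost, deflect_rate=0.5, cost_ratio=100)
savings = 1 - avg / human_cost

print(round(avg, 2))      # average cost per call
print(round(savings, 3))  # fraction saved versus all-human handling
```

Under these assumptions the average cost per call drops by just under half, since the agent-handled half costs almost nothing relative to the human-handled half.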