More

joozio · 2026-04-01T14:18:40 1775053120

Terminal Benchmark: https://www.tbench.ai/leaderboard/terminal-bench/2.0

joozio · 2026-03-31T17:05:35 1774976735

I run a Claude Code agent 24/7 on a Mac Mini. After a few months my morning routine was gone and I was reviewing agent output at midnight. Built this to teach it boundaries.

The interesting part ended up being the error registry. Agents fail silently way more than you'd expect. Same error repeats 50 times burning tokens before you notice.

Zero dependencies, Python stdlib only. Would love feedback on what's missing.

joozio · 2026-03-24T13:13:01 1774357981

I think not yet, but Anthropic is trying to.

joozio · 2026-03-20T22:20:04 1774045204

I am user of Arc(even now!) and I really love it. Feels different in a good way. I can see it's PoC, but have you touched backend even in concept?

johndamaia · 2026-03-21T07:19:22 1774077562

Yes, starting to explore the backend now. It's quite overwhelming in a fun way :)

joozio · 2026-03-16T16:06:50 1773677210

Seperate space, more power and also Local small LLMs. Also 24/7 :)

joozio · 2026-03-12T23:06:24 1773356784

Anthropic formalized their enterprise partnership program today. Key partners include Accenture (30k professionals trained on Claude), Cognizant (350k associates with Claude access), Deloitte, and Infosys.

What makes this interesting: Claude is the only frontier model available across all three major cloud platforms simultaneously (AWS Bedrock, Azure, Google Cloud). The Partner Network is how they convert that platform coverage into an actual distribution advantage.

The $100M goes toward: certification programs (Claude Certified Architect launched today), dedicated Applied AI engineers for customer engagements, sales playbooks, and co-marketing.

This signals a shift from "try our model" to "we'll help you deploy org-wide." Curious if others are seeing this pattern play out in their enterprise conversations -- is this differentiated from what OpenAI/Google are doing, or is the whole industry moving this direction?

joozio · 2026-03-06T11:58:09 1772798289

Nope - written by me.

joozio · 2026-02-08T20:34:56 1770582896

Thanks! The investment angle is interesting — I hadn't thought about it that way, but it makes sense. If you're seeing the gap firsthand, you have an information edge most investors don't.

What strikes me most is how different the conversation is depending on where you are. Reddit investment subs, Twitter AI circles, and actual workplaces — three completely different realities about the same technology.

I think the key thing that's hard to convey to non-users is the compounding effect. Once you hit a certain depth, every new tool or workflow multiplies what you already know. My neighbor who codes with Gemini is one "aha moment" away from a completely different relationship with AI — but that moment hasn't happened yet for most people.

The gap you're betting on seems real to me. Whether it closes in months or years is the interesting question.

aurareturn · 2026-02-09T08:42:38 1770626558

There is a huge lag in Wallstreet action and AI advancements. The core decision makers in Wallstreet aren't using AI like how we are.

I think because we are software devs, we see the potential must earlier. I'm leveraging this information for investments.

joozio · 2026-02-06T15:16:52 1770391012

Haven't benchmarked pre-processing approaches yet, but that's a natural next step. Right now the test page targets raw agent behavior — no middleware. A comparison between raw vs sanitized pipelines against the same attacks would be really useful. The multi-layer attack (#10) would probably be the hardest to strip cleanly since it combines structural hiding with social engineering in the visible text.

joozio · 2026-02-06T15:15:55 1770390955

It's working -> your agents scored A+, which means they resisted all 10 injection attempts. That's a great result. The tool detects when canary phrases leak into the response. If nothing leaked, you get a clean score. Not all models are this resilient though - we've seen results ranging from A+ to C depending on the model and even the language used.