What does the LLM completion entail? Are you talking about a sequence of prompts with files / MCP servers? Could you share a bit more, because I have spent some time with this and have something that might be precisely what you are asking for...
This is awesome. Love seeing more teams investing early in observability and evals instead of treating them as an afterthought.
Your setup (LLM-assessed complexity, semantic success metrics, tool-level telemetry) hits what a lot of orgs miss: tying evaluation and observability together. Most teams stop at traces and latency, but without semantic evals you can’t really explain or improve behavior.
We’ve seen the same pattern across production agent systems: once you layer in LLM-as-judge evals, distributed tracing, and data quality signals, debugging turns from “black box” to “explainable system.” That’s when scaling becomes viable.
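For anyone curious what the LLM-as-judge layer looks like concretely, here is a minimal sketch. This is not any particular team's implementation; the `llm` callable, the 1–5 rubric, and the score parsing are all illustrative assumptions:

```python
from typing import Callable

JUDGE_PROMPT = """You are grading an AI agent's answer.
Task: {task}
Answer: {answer}
Score the answer from 1 (useless) to 5 (fully correct and helpful).
Reply with the number only."""

def judge(llm: Callable[[str], str], task: str, answer: str) -> int:
    """Ask a judge model to score an agent answer on a 1-5 rubric."""
    reply = llm(JUDGE_PROMPT.format(task=task, answer=answer))
    try:
        score = int(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0  # unparseable judge output counts as a failed eval
    return max(1, min(5, score))

# Usage: plug in any chat-completion call as `llm`; a stub works for testing.
if __name__ == "__main__":
    fake_llm = lambda prompt: "4"
    print(judge(fake_llm, "Summarize the outage report", "The outage was DNS."))
```

The value shows up once these scores are logged alongside traces: a failing tool call and a low judge score on the same request is what turns the black box into something explainable.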
Would love to hear how you’re handling drift or regression detection across those metrics. With CoAgent, we’ve been exploring automated L2–L4 eval loops (semantic, behavioral, business-value levels) and it’s been eye-opening.
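On regression detection specifically, the simplest approach we've found useful is comparing a recent window of eval scores against a frozen baseline distribution. A rough sketch; the sigma threshold and window sizes are illustrative assumptions, not tuned values:

```python
from statistics import mean, stdev

def regression_detected(baseline: list[float], recent: list[float],
                        sigma: float = 2.0) -> bool:
    """Flag a regression when the recent mean eval score drops more than
    `sigma` standard deviations below the baseline mean."""
    if len(baseline) < 2 or not recent:
        return False  # not enough data to judge
    mu, sd = mean(baseline), stdev(baseline)
    return mean(recent) < mu - sigma * sd

# Example: last week's judge scores vs. the most recent runs.
baseline = [4.1, 4.3, 4.0, 4.2, 4.4, 4.1]
recent = [3.2, 3.4, 3.1, 3.3]
print(regression_detected(baseline, recent))  # True -> investigate
```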
We have had folks over the years asking us about Kafka wire compatibility. We had a project three years ago which we archived. I think we have a case for reviving it in the near future.
Fluvio is a streaming transport, and we built Stateful DataFlow on top of it for stream processing.
Arroyo is SQL-first stream processing. Fluvio is a streaming transport that can send data to Arroyo, and there is an integration between the two.
Stateful DataFlow and Arroyo are similar in their stream-processing pattern and their use of Apache Arrow.
The interfaces are different. Fluvio and Stateful DataFlow's SQL support uses the same dialect as the columnar SQL supported by Polars. The Fluvio and Stateful DataFlow paradigm is more intricate and more expressive, and the platform is broader and deeper.
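For a feel of that dialect, here is what Polars' flavor of columnar SQL looks like. This is a standalone Polars example for illustration only; the table and column names are made up, and this is not a Stateful DataFlow program:

```python
import polars as pl

# A toy events table; in a streaming setting this data would arrive as a stream.
events = pl.DataFrame({
    "region": ["us", "us", "eu", "eu"],
    "amount": [10.0, 25.0, 7.5, 12.5],
})

# Register the frame and run a query in Polars' SQL dialect.
ctx = pl.SQLContext(events=events)
totals = ctx.execute(
    "SELECT region, SUM(amount) AS total FROM events GROUP BY region",
    eager=True,
)
print(totals)
```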
Agree with you 100%. We are working on more elaborate benchmarking on bare-metal instances. This was just an initial run to exercise the benchmarking tool, which is usable by all Fluvio users.
We will do a full setup and run benchmarks comparing Kafka, Pulsar, and Redpanda using a real dataset on bare-metal servers soon.
Really cool project! Looking forward to trying this out. I have been using the Copilot extensions with local docs for RAG augmentation. This seems to be a step up.
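For context, the local-docs setup I mean is just basic retrieval: index the doc chunks, pull the closest ones for a query, and prepend them to the prompt. A minimal sketch, with a bag-of-words cosine similarity standing in for real embeddings (everything here is illustrative):

```python
from collections import Counter
from math import sqrt

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts; a stand-in for real embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k doc chunks closest to the query."""
    return sorted(docs, key=lambda d: similarity(query, d), reverse=True)[:k]

docs = [
    "the produce command lets you send records to a topic",
    "the consumer API supports offsets and partitions",
    "modules can transform records in-flight",
]
query = "how do I send records"
context = "\n".join(retrieve(query, docs))
prompt = f"Use these docs:\n{context}\n\nQuestion: {query}"
print(prompt)
```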
That's a really good suggestion. I am trying to build some tutorial videos now. I feel the same way about listicles, but they get a lot of impressions. People seem to love lists.