This is great, thank you for sharing. I work on these APIs at OpenAI, and it's a surprise to me that it still works reasonably well at 2x/3x speed; on the other hand, for phone channels we get 8 kHz audio that is upsampled to 24 kHz for the model and it still works well. Note there's probably a measurable decrease in transcription accuracy that worsens as you deviate from 1x speed. Also, we really need to support bigger/longer file uploads :)
I kind of want to take a more proper poke at this but focus more on summarization accuracy over word-for-word accuracy, though I see the value in both.
I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?
Maybe I'll try three approaches:
- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)
- A "variance within the modal" test running it multiple times against the same audio, tracking how much it varies between runs
- An LLM analysis assessing if the primary points from a talk were captured and summarized at 1x, 2x, 3x, 4x runs (I think this is far more useful and interesting)
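For the variance test, here's a minimal sketch of what I have in mind (the file name and model choice are placeholders, and SequenceMatcher is just a crude stand-in for a proper word-error-rate metric):

```python
# Transcribe the same file several times and measure pairwise similarity between runs.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from difflib import SequenceMatcher
from itertools import combinations

from openai import OpenAI

client = OpenAI()

def transcribe(path: str) -> str:
    with open(path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

runs = [transcribe("talk_1x.mp3") for _ in range(3)]

# Character-level similarity is a rough proxy for run-to-run variance;
# word error rate against a reference transcript would be the stricter metric.
for (i, a), (j, b) in combinations(enumerate(runs), 2):
    print(f"run {i} vs run {j}: similarity {SequenceMatcher(None, a, b).ratio():.3f}")
```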
Quick feedback: it would be cool to research this internally and maybe find a sweet spot in the speed multiplier where the loss is minimal. This pre-processing is quite cheap and could bring down the API price eventually.
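For reference, the pre-processing step can be as simple as an ffmpeg atempo pass (a rough sketch, assuming ffmpeg is on the PATH; file names are placeholders):

```python
# Speed up audio before sending it to the transcription API.
# Older ffmpeg builds cap a single atempo filter at 2.0, so chain filters for higher factors.
import subprocess

def speed_up(src: str, dst: str, factor: float) -> None:
    filt = f"atempo={factor}" if factor <= 2.0 else f"atempo=2.0,atempo={factor / 2.0}"
    subprocess.run(["ffmpeg", "-y", "-i", src, "-filter:a", filt, dst], check=True)

speed_up("talk.mp3", "talk_2x.mp3", 2.0)
speed_up("talk.mp3", "talk_3x.mp3", 3.0)
```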
Heyo, I work on the realtime api, this is a very cool app!
With transcription I would recommend trying out "gpt-4o-transcribe" or "gpt-4o-mini-transcribe" models, which will be more accurate than "whisper-1". On any model you can set the language parameter, see docs here: https://platform.openai.com/docs/api-reference/realtime-clie.... This doesn't guarantee ordering relative to the rest of the response, but the idea is to optimize for conversational-feeling latency. Hope this is helpful.
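For the plain transcription endpoint that would look roughly like this (a sketch using the Python SDK; the file path is a placeholder):

```python
# Transcribe with one of the newer models and an explicit language hint.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("recording.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" / "whisper-1"
        file=f,
        language="en",              # ISO-639-1 code; helps accuracy when the language is known
    )

print(transcript.text)
```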
Do you work from the command line? Butterfish is a project I wrote for myself to use AI prompting seamlessly, directly from the shell. I hope it's useful to others, give it a try and send feedback!
> Within Butterfish Shell you can send a ChatGPT prompt by just starting a command with a capital letter, for example:
This is a dangerous assumption. Not all commands are lowercase. Interaction with an external service should be a deliberate, discrete action on the user's part.
I like that a lot! It would be awesome if the client running on goal mode had capabilities to request some search engine API + do some crawling. Imagine getting the info out of up to date github issues or directly from AWS docs.
I've experimented with it, the reason I haven't yet added it is that I want deployment to be seamless, and it's not trivial to ship a binary that would (without extra fuss or configuration) efficiently support Metal and CUDA, plus download the models in a graceful way. This is of course possible, but still hard, and not clear if it's the right place to spend energy. I'm curious how you think about it - is your primary desire to work offline or avoid sending data to OpenAI? Or both?
The latter mostly. It's also free, uncensored, and can never disappear from under me.
FWIW, from my understanding llama.cpp is pretty easy to integrate and is reasonably fast for being API agnostic. Ollama embeds it, for example. No pressure, just pointing it out :)
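For what it's worth, even the Python bindings make local inference pretty approachable (a sketch assuming llama-cpp-python is installed and a GGUF model is already on disk; the model path is a placeholder):

```python
# Minimal local-inference sketch with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="./models/model.Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what `grep -r` does in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```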
Really like the design of these tools so that you can easily pipe between them; this is a good way to make things composable. Also really cool to see all of the other CLI tools folks have posted here, lots that I wasn't aware of.
I've been experimenting with CLI/LLM tools and found my favorite approach is to make the LLM constantly accessible in my shell. The way I do this is to add a transparent wrapper around whatever your shell is (bash, zsh, etc.), send commands that start with capital letters to ChatGPT, and manage a history of local commands and GPT responses. This means you can ask questions about a command's output, autocomplete based on ChatGPT suggestions, etc.
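The routing idea itself is tiny; here's a toy illustration (not how Butterfish is actually implemented, and the model name and prompt handling are placeholders):

```python
# Toy sketch: lines starting with a capital letter go to the model, everything else
# to the shell, and both sides share a rolling history so you can ask about prior output.
import subprocess

from openai import OpenAI  # assumes OPENAI_API_KEY is set

client = OpenAI()
history = []  # alternating shell output and chat messages used as context

while True:
    try:
        line = input("> ").strip()
    except EOFError:
        break
    if not line:
        continue
    if line[0].isupper():
        # Capitalized input is treated as a prompt, with recent history as context.
        history.append({"role": "user", "content": line})
        reply = client.chat.completions.create(model="gpt-4o-mini", messages=history[-20:])
        answer = reply.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        print(answer)
    else:
        # Everything else runs as a normal shell command; capture output for context.
        result = subprocess.run(line, shell=True, capture_output=True, text=True)
        print(result.stdout, end="")
        history.append({"role": "user", "content": f"$ {line}\n{result.stdout[-2000:]}"})
```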
- An unmentioned alternative to this pricing is that GCP has a deal with Cloudflare that gives you a 50% discount to what is now called Premium pricing for traffic that egresses GCP through Cloudflare. This is cheaper for Google because GCP and Cloudflare have a peering arrangement. Of course, you also have to pay Cloudflare for bandwidth.
- This announcement is actually a small price cut compared to existing network egress prices for the 1-10 TiB/month and 150+ TiB/month buckets.
- The biggest advantage of using private networks is often client latency, since packets avoid points of congestion on the open internet. They don't really highlight this, instead showing a chart of throughput to a single client, which only matters for a subset of GCP customers. The throughput chart is also a little bit deceptive because of the y-axis they've chosen.
- Other important things to consider if you're optimizing a website for latency are the CDN and where SSL negotiation takes place. For a single small HTTPS request, doing SSL negotiation on the network edge can make a pretty big latency difference (rough illustration with made-up numbers after this list).
- Interesting number: Google capex (excluding other Alphabet capex) in both 2015 and 2016 was around $10B, at least part of that going to the networking tech discussed in the post. I expect they're continuing to invest in this space.
- A common trend with GCP products is moving away from flat-rate pricing models to models which incentivize users in ways that reflect underlying costs. For example, BigQuery users are priced per-query, which is uncommon for analytical databases. It's possible that network pricing could reflect that in the future. For example, there is probably more slack network capacity at 3am than 8am.
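Back-of-the-envelope illustration of the edge-SSL point above, with made-up round-trip times, just to show the shape of the saving:

```python
# Illustrative numbers only: a single small HTTPS request needs several round trips
# (TCP handshake + TLS handshake) before the request itself is even sent.
rtt_origin_ms = 80   # client <-> distant origin
rtt_edge_ms = 15     # client <-> nearby edge POP
handshake_rtts = 3   # TCP + TLS 1.2, roughly; TLS 1.3 and session resumption cut this down

# Terminating TLS at the origin: every handshake round trip pays the long RTT.
origin_ms = (handshake_rtts + 1) * rtt_origin_ms

# Terminating TLS at the edge: handshakes pay the short RTT; the request itself
# still travels edge -> origin over the provider's backbone (roughly one long RTT).
edge_ms = handshake_rtts * rtt_edge_ms + rtt_origin_ms

print(f"origin-terminated: ~{origin_ms} ms, edge-terminated: ~{edge_ms} ms")
```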
I like your thinking, but one minor clarification. BigQuery has actually introduced Flat Rate [0] (a year ago) and Committed Use Discounts [1] (Amazon RIs are similar), since that's kind of what some enterprises want. These are optional and flexible.
I personally still hold that pay-per-use pricing is the cloud native approach [2], the most cost-efficient, and the most customer-friendly. However, it's unfamiliar and hard to predict, so starting out on Flat Rate pricing as a first step makes sense.
(I work at Google and was part of the team that introduced BQ Flat Rate.)
The problem with bundling is that it stops reflecting underlying costs and creates incentives for customers that skew the customer population.
Contrived example of this: since most HDD workloads are IOPS-bound, you decide to sell IOPS bundles and give space away for free. Before long, all your customers are backup companies that have low IOPS and high space usage. Your service runs at a loss, and customers are doing a nice price arbitrage on top of it.
Same goes for all aspects of computing platforms for sale: CPUs, RAM, Networking, HDDs, SSDs, GPUs.
Two additional problems are bin packing and provisioning: you need to sell things in such quantities and ratios that you can actually utilize your hardware configurations effectively. You need to order and design your hardware in a flexible manner to be able to adapt to changing ratios of component needs as customer demand changes.
So it's easier to run "pay for what you use plus profit" pricing, but some customers don't like it due to perceived complexity and potential unpredictability.
We've been running it for about a year to power Quizlet - overall things have been good and we're happy. AWS and GCP are complicated enough that they're tough to compare holistically, but on most of the things we care about we find GCP to be equivalent or better (sometimes significantly) than AWS. It really does have better networking and disk technology, and the pricing is much better. Here's the analysis we did: https://quizlet.com/blog/whats-the-best-cloud-probably-gcp.