The headline reads very oddly; I was left wondering how GPT-4 got significantly less intelligent. What it actually means is that the cost of GPT-4-level intelligence has dropped 1000x in 18 months.
I thought the same thing. I am wondering if custom silicon can bring the cost even lower? I am reminded of when people used FPGAs to mine crypto and then moved to ASICs, and the performance was dramatically better.
Lmsys is not a measure of intelligence. It's a measure of human preference. People prefer correct answers (assuming they are qualified to identify the correct one), but they also prefer answers formatted nicely for reading, for example, which has nothing to do with "intelligence". That is why "reasoning" models, which often do better on benchmarks, do not necessarily do correspondingly well on lmsys.
I continue to be baffled by this kind of sentiment. I realize that today's LLMs aren't perfect. But, meanwhile, Claude 3.5 Sonnet and ChatGPT o3-mini-high are out there doing the jobs of multiple junior devs for just a few dollars per month.
What accounts for this massive gap in people's experiences using this technology?
I mostly try making automated developer scripts: they take a project description as input, have the LLM write a script to implement it, execute the script, return the output to the LLM, and ask it whether the project is complete or it wants to submit an updated version.
The sample project description I've been using asks it to get the public IP, location, weather conditions, and 5 regional news headlines, using free public APIs or feeds.
Even with the project description telling it to use free, public, keyless APIs, half the time it will immediately throw in a paid API and a line like weather_api_key = # insert your API key here. The script runs, the API returns a key error, we hand that back to the LLM, it says "the error is because the API key is missing, let me try again", and then it submits the exact same code, even though the line about using free APIs is still in context. (wttr.in is the public weather API it knows how to use, if you remind it specifically that wttr.in exists...)
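For reference, here is roughly the kind of script I'm hoping to get back. The specific endpoints are just my picks for free, keyless services (ipify for the IP, ipinfo's unauthenticated tier for location, wttr.in for weather, a Google News geo RSS feed for headlines); any keyless equivalents would do:

    import requests
    import xml.etree.ElementTree as ET

    # Public IP: api.ipify.org returns the bare address as plain text.
    ip = requests.get("https://api.ipify.org", timeout=10).text.strip()

    # Rough location from the IP: ipinfo.io's unauthenticated tier is keyless but rate-limited.
    loc = requests.get(f"https://ipinfo.io/{ip}/json", timeout=10).json()
    city = loc.get("city", "")

    # Current conditions: wttr.in's format=3 gives a one-line summary, no key needed.
    weather = requests.get(f"https://wttr.in/{city}?format=3", timeout=10).text.strip()

    # Five regional headlines from a keyless RSS feed (Google News geo section is one option).
    rss = requests.get(f"https://news.google.com/rss/headlines/section/geo/{city}", timeout=10)
    headlines = [item.findtext("title") for item in ET.fromstring(rss.content).iter("item")][:5]

    print(f"IP: {ip} ({city}, {loc.get('region', '')})")
    print(f"Weather: {weather}")
    for h in headlines:
        print(f"- {h}")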
It will also invent indexes when trying to extract the response data.
I have tried using 4o, 4o-mini, and Mistral Small 24B 2501.
I get that I still need to do the work of prompt engineering if I want impressive results. I have lots of text telling the LLM: you are a developer, you will be given a project description, you will write a script, we will run the script and return the output to you, you will judge whether the output fully achieves the project description, and if not, you will include an updated full script to try again. Overall it works, but it certainly makes plenty of junior-dev mistakes too.
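For anyone curious, a stripped-down version of that loop looks something like the sketch below. call_llm is a stand-in for whatever chat-completions client you use, and treating a reply with no code block as "done" is a simplification of the judging step:

    import subprocess

    SYSTEM = ("You are a developer. You will be given a project description. "
              "Write a single Python script that implements it. We will run the "
              "script and return its output to you. Judge whether the output "
              "fully achieves the project description; if it does not, reply "
              "with an updated full script to try again.")

    def call_llm(messages):
        # Placeholder: swap in your actual client (OpenAI, Mistral, a local model, ...).
        raise NotImplementedError

    def extract_script(reply):
        # Naive: take the body of the first fenced code block, else the whole reply.
        if "```" in reply:
            block = reply.split("```", 2)[1]
            return block.split("\n", 1)[1] if "\n" in block else block
        return reply

    def run_project(description, max_rounds=5):
        messages = [{"role": "system", "content": SYSTEM},
                    {"role": "user", "content": description}]
        output = ""
        for _ in range(max_rounds):
            reply = call_llm(messages)
            if "```" not in reply:  # no new script: treat as "project complete"
                break
            script = extract_script(reply)
            result = subprocess.run(["python", "-c", script],
                                    capture_output=True, text=True, timeout=120)
            output = result.stdout + result.stderr
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user",
                 "content": "Script output:\n" + output +
                            "\nDoes this fully achieve the project description? "
                            "If not, include an updated full script."},
            ]
        return output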
We need a way to impart, at runtime, the kind of knowledge accumulation and instruction following that happens with fine-tuning, because knowledge that is "baked in" and knowledge that is merely provided in context are clearly in very different classes.
Go ahead and produce any close, comprehensive study of what the LLMs are actually doing. I'll show you how that study either systematically ignores key aspects of the work, ignores reliability, ignores generalizability, or glosses over important failures.
People are making excuses for the bad work of LLMs, because they are excited, in love, and smell money.
Well, I'm working on a project now where we would absolutely have needed to recruit at least one additional entry-level dev if we didn't instead have an LLM to generate a few thousand lines of Python to orchestrate some basic DB interactions. It's not complicated stuff, but it is time-consuming.
I'd rate its output as not great, but adequate for the purpose, so long as it's robustly supervised. (Much like an inexperienced human.)
I don't know if this anecdote qualifies as the "close analysis" you're looking for, but it seems instructive to me since it allows me to say with absolute certainty that an LLM is doing the job of at least one person. (And I'm confident it would be more than one if I were working on a bigger project.)