Here are my notes and guesses on the stories, in case people here find them interesting. Like some others in the blog post comments, I got 6/8 right:
1.) probably human, low on style but a solid twist (CORRECT)
2.) interesting imagery but some continuity issues, maybe AI (INCORRECT)
3.) more a scene than a story, highly confident it's AI given the style (CORRECT)
4.) style could go either way, maybe human given some successful characterization (INCORRECT)
5.) I like the style but it's probably AI; the metaphors are too dense and there are very minor continuity errors (CORRECT)
6.) some genuinely funny stuff and good world building, almost certainly human (CORRECT)
7.) probably AI prompted to go for humor, some minor continuity issues (CORRECT)
8.) nicely subverted expectations, probably human (CORRECT)
My personal ranking for scores (again blind to author) was:
6 (human); 8 (human); 4 (AI); 1 (human) and 5 (AI) -- tied; 2 (human); 3 and 7 (AI) -- tied
So for me the two best stories were human and the two worst were AI. That said, I read a lot of flash fiction, and none of these stories really approached good flash imo. I've also done some of my own experiments, and AI can do much better than what is posted above for flash if given more sophisticated prompting.
I was surprised at the result, and even more surprised when I read that one of the authors who did the test got 4 out of 5 wrong, and rated 2 of the AI stories highly.
Looking at my notes, I got one wrong: story 5. I didn't know what the "name" was supposed to be, assumed it was something widely known in culture that brings about the end times which I simply hadn't heard of, and so marked it as Human because of the supposed reference to shared cultural knowledge. All the AI-written stories I rated at either 1 or 2 points, while the lowest Human-written story got 3 and the highest (Story 1) got 5.
It makes me wonder if we are over-estimating the skill an author has when reading based on their demonstrated skill when writing.
IOW, according to my notes/performance, the AI stories were easy to spot and correlated with low scores anyway, while the author(s), who actually produced the stuff I rated highly, rated highly the stories I had scored low.
The only one I was fairly sure was human was #6, and that was the only one I kinda enjoyed. In any case, as someone who reads a good deal, I agree. I didn't think any of the stories was particularly great (not enough to bother ranking them, beyond favourite) so I don't care all that much about the result.
> AI can do much better than what is posted above for flash if given more sophisticated prompting.
How sophisticated, compared to just writing the thing yourself?
I enjoy writing so a system like this would never replace that for me. But for someone who doesn't enjoy writing (or maybe can't generate work that meets their bar in the Ira Glass sense of taste) I think this kind of setup works okay for generating flash even with today's models.
Could you expand on your point re more sophisticated prompting?
I have found it hard to replicate high quality human-written prose and was a bit surprised by the results of this test. To me, AI fiction (and most AI writing in general) has a certain “smell” that becomes obvious after enough exposure to it. And yet I scored worse than you did on the test, so what do I know…
For flash you can get much better results by asking the system to first generate a detailed scaffold. Here's an example of some metadata you might try to generate before actually writing the story:
- genres the story should fit into
- POV of the story
- high-level structure of the story
- list of characters in the story along with significant details
- themes and topics present in the story
- detailed style notes
From there you have a second prompt to generate a story that follows those details. You can also generate many candidates and have another model instance rate the stories based on both general literary criteria and how well they fit the prompt, then read only the best.
This has produced some work I've been reasonably impressed by, though it's not at the level of the best human flash writers.
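For concreteness, here's a rough sketch of that two-stage setup in Python. Everything in it is a stand-in: call_llm is a placeholder for whatever model API you use, and the prompts and 1-10 rubric are just illustrative, not anything from the test above.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: swap in whatever LLM API you're actually using."""
    raise NotImplementedError

def generate_scaffold(premise: str) -> str:
    # Stage 1: plan before prose -- genres, POV, structure, characters, themes, style notes.
    return call_llm(
        "Plan a piece of flash fiction (under 1000 words) from this premise:\n"
        f"{premise}\n\n"
        "List: genres, POV, high-level structure, characters with significant details, "
        "themes and topics, and detailed style notes."
    )

def write_story(scaffold: str) -> str:
    # Stage 2: write the story, following the scaffold closely.
    return call_llm("Write the flash fiction described by this scaffold, "
                    "following every detail of it:\n" + scaffold)

def score_story(premise: str, story: str) -> float:
    # A separate model instance rates literary quality and fit to the prompt (1-10).
    reply = call_llm(
        "Rate this flash fiction from 1 to 10 for literary quality and for how well it "
        f"fits the premise '{premise}'. Reply with a single number only.\n\n{story}"
    )
    return float(reply.strip())

def best_of_n(premise: str, n: int = 8) -> str:
    # Generate several candidates and surface only the top-scoring one.
    candidates = [write_story(generate_scaffold(premise)) for _ in range(n)]
    return max(candidates, key=lambda story: score_story(premise, story))
```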
Also, one easy way to get stuff that completely avoids the "smell" you're talking about is to give specific guidance on style and perspective (e.g., GPT-5 Thinking can do "literary stream-of-consciousness 1st person teenage perspective" reasonably well and will not sound at all like typical model writing).
I had similar results, and story 4 is so trope heavy I wonder if it’s just an amalgamation of similar stories. The human stories all felt original, where none of the AI ones did.
I'm not sure I agree that the human stories felt original. I was pretty unimpressed with all of the stories except maybe 6, and even that one dealt in some common tropes. 5 had fewer tropes than 6 (and maybe as a result received the highest average scores from his readers), but I could tell from the style that it was AI.
It is true that there isn't that much literary stuff that breaks through, and the stuff that does is usually somewhat crossover (e.g., All the Light We Cannot See in 2015 or Song of Achilles in 2021) but it exists. These two books are shelved under literary codes (though also historical). Song of Achilles in particular is beautifully written and a personal favorite of mine, at least among books published in recent years.
Then there are other works like Little Fires Everywhere and The Midnight Library that I might not consider super literary but are nonetheless often shelved that way by book shops or libraries (e.g., https://lightsailed.com/catalog/book/the-midnight-library-a-... the lit fic code is FIC019000).
I was really surprised that Ferrante's Neapolitan series, the best example (I would have thought) of recent work with both high literary acclaim and popular appeal, did not actually make the top 10 list for any year.
Yeah, and looking through the lists makes one suspect that there's a problem of incommensurate measurements... There's a lot of 'Very Hungry Caterpillar' in the recent lists, but I'm unsure whether children's books were even in the running in the 1960s. Or else there's been a revolution in buying books for children since the '60s, which, honestly, I wouldn't be sad about...
Yeah, it seems likely the underlying task here (one reasoning step away) was: replace as many fp32 operations as possible in this kernel with fp16. I'm not sure exactly how challenging a port like that is, but intuitively it seems a bit less impressive.
Maybe this intuition is wrong, but if so it would be great for the work to address it explicitly!
I looked at the softmax kernel and the cast that it does from a float* to a float4* is extremely brittle -- it's trivial to break by offsetting the input slightly.
Very likely a kernel for a standard library could not employ such a trick that relies on alignment of input pointers. Certainly not without a fallback.
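To spell out why an offset breaks it: a float4 load assumes the pointer is 16-byte aligned, and a tensor that starts partway into a buffer generally isn't. A quick NumPy illustration of the pointer arithmetic (the kernel itself is CUDA; this only shows the alignment math):

```python
import numpy as np

x = np.zeros(1024, dtype=np.float32)
print(x.ctypes.data % 16)      # typically 0: the base allocation is 16-byte aligned
print(x[1:].ctypes.data % 16)  # 4: starting one float into the buffer, so a 16-byte
                               # float4 load from this pointer would be misaligned
```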
I do a lot of ML work too and recently gave NixOS a try. It's actually not too hard to just use conda/miniconda/micromamba to manage python environments as you would on any other Linux system, with just a few lines of configuration: pretty much just add micromamba to your configuration.nix plus a few lines of config for nix-ld. Many other python/ML projects are set up to use Docker, and that's another easy option.
I don't have the time or desire to switch all my python/ML work over to a more conventional Nix setup, and haven't really had any issues so far.
This technique doesn't actually use RL at all! There’s no policy-gradient training, value function, or self-play RL loop like in AlphaZero/AlphaTensor/AlphaDev.
As far as I can tell, the weights of the LLM are not modified. They do some kind of candidate selection via an evolutionary algorithm over the LLM prompts, which the LLM then remixes. This process then iterates like a typical evolutionary algorithm.
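Roughly, my reading of the loop is something like the sketch below. This is just how I understand it, not the authors' actual code; llm_propose and evaluate are placeholders.

```python
import random

def llm_propose(parent_program: str, score: float) -> str:
    """Placeholder: prompt a frozen LLM to rewrite/remix the parent candidate."""
    raise NotImplementedError

def evaluate(program: str) -> float:
    """Placeholder: compile/run the candidate (e.g. a kernel) and return a fitness score."""
    raise NotImplementedError

def evolve(seed: str, generations: int = 200, population_size: int = 20) -> str:
    # The LLM weights are never updated; the model only acts as a mutation operator.
    population = [(seed, evaluate(seed))]
    for _ in range(generations):
        # Pick a parent with a bias toward higher-scoring candidates (tournament selection).
        parent, parent_score = max(random.sample(population, k=min(3, len(population))),
                                   key=lambda pair: pair[1])
        child = llm_propose(parent, parent_score)
        population.append((child, evaluate(child)))
        # Keep only the fittest candidates, as in a standard evolutionary algorithm.
        population.sort(key=lambda pair: pair[1], reverse=True)
        population = population[:population_size]
    return population[0][0]
```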
I also use GPT-4 for explaining the meaning of sentences in more detail (as in JimDabell’s comment). Often my questions are like “how would a native speaker say this colloquially?”, and I’ve found it really valuable to be able to have a back-and-forth on why something works the way it does.
I sort of similarly used LLMs and speech synthesis tools to make a prototype that could generate short (<10min) podcasts in Mandarin on any topic I specified. Being interesting is less important in a language learning context, though it's notable that I haven't used the tool much and prefer listening to Mandarin audiobooks and real human podcasts, perhaps because they are more interesting.
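The prototype was roughly the shape of the sketch below; call_llm and synthesize_speech are placeholders for whichever LLM and TTS services you plug in, not real APIs.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError

def synthesize_speech(text: str, language: str = "zh-CN") -> bytes:
    """Placeholder for a text-to-speech service that returns audio bytes."""
    raise NotImplementedError

def make_episode(topic: str, out_path: str = "episode.mp3") -> None:
    # Generate a short Mandarin script on the requested topic, then voice it.
    script = call_llm(
        "Write a conversational Mandarin podcast script, roughly 8 minutes when read aloud, "
        f"on the topic: {topic}. Use natural, colloquial phrasing suitable for learners."
    )
    with open(out_path, "wb") as f:
        f.write(synthesize_speech(script))
```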
I spent a summer at the Santa Fe Institute in 2010 and can say that McCarthy certainly spent a lot of time there. I'm not sure he was a copyeditor, exactly, but SFI is a nice place to hang out for any sort of creative work -- beautiful building and landscape, and very open and collaborative atmosphere.
The institute is on the top of a big hill and I'll always remember how he gave me a lift one day as I was walking up.
The immune system is extremely complicated, but very broadly: CD8+ T-cells (also known as killer T-cells) kill infected cells directly, whereas CD4+ T-cells (also known as helper T-cells) release signals that guide many aspects of the immune response, including activating CD8+ cells.