Since no one else seems to have pointed this out - the OP seems to have misunderstood the output of the 'time' command.
$ time ./wc-avx2 < bible-100.txt
82113300
real 0m0.395s
user 0m0.196s
sys 0m0.117s
"System" time is the amount of CPU time spent in the kernel on behalf of your process, or at least a fairly good guess at that. (e.g. it can be hard to account for time spent in interrupt handlers) With an old hard drive you would probably still see about 117ms of system time for ext4, disk interrupts, etc. but real time would have been much longer.
$ time ./optimized < bible-100.txt > /dev/null
real 0m1.525s
user 0m1.477s
sys 0m0.048s
Here we're bottlenecked on CPU time - 1.477s + 0.048s = 1.525s. The CPU is busy for every millisecond of real time, either in user space or in the kernel.
In the faster wc-avx2 case:
real 0m0.395s
user 0m0.196s
sys 0m0.117s
0.196 + 0.117 = 0.313, so we used 313ms of CPU time but the entire command took 395ms, with the CPU idle for 82ms.
In other words: yes, the author managed to beat the speed of the disk subsystem, with two caveats:
1. not by much - similar attention to tuning I/O parameters might improve I/O performance quite a bit.
2. the I/O path is itself CPU-bound. Those 117ms (about 37% of all CPU cycles) are spent entirely in disk I/O and file system kernel code; even if both the disk and your user code were infinitely fast, the command would still take 117ms. (but those I/O tweaks might reduce that number)
Note that the numbers for the slower ./optimized run are with a warm cache, showing 48ms of system time - in that case only the ext4 code has to run in the kernel, since the data is already cached in memory. In the cold-cache case the disk driver code has to run as well, for a total of 117ms.
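If you want to see the same user/sys split from inside a program rather than from time(1), here's a minimal sketch using POSIX getrusage() - the loop counts and the use of /dev/zero are arbitrary, just cheap ways to generate some of each kind of time:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* burn user-space CPU: counted as "user" time */
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++)
        x += i;

    /* burn kernel CPU: reads of /dev/zero are counted as "sys" time */
    char buf[4096];
    FILE *f = fopen("/dev/zero", "rb");
    for (int i = 0; f != NULL && i < 100000; i++)
        fread(buf, 1, sizeof buf, f);
    if (f != NULL)
        fclose(f);

    /* print the same split that time(1) reports */
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("user %ld.%06lds  sys %ld.%06lds\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}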
> was in unsafe code, and related to interop with C
1) "interop with C" is part of the fundamental requirements specification for any code running in the Linux kernel. If Rust can't handle that safely (not Rust "safe", but safely), it isn't appropriate for the job.
2) I believe the problem was related to the fact that Rust can't implement a doubly-linked list in safe code. That is a fundamental limitation, and again it matters when the requirement for the task is to interface with data structures implemented as doubly-linked lists.
No matter how good a language is, if it doesn't have support for floating point types, it's not a good language for implementing math libraries. For most applications, the inability to safely express doubly-linked lists and the difficulty of interfacing with C aren't fundamental problems - just don't use doubly-linked lists or interface with C code. (Well, you still have to call system libraries, but those are slow-moving APIs that can be wrapped by Rust experts.) For this particular example, however, C interop and doubly-linked lists are fundamental parts of the problem the code has to solve.
If Rust is no less safe than C in such a regard, then what benefit does Rust provide that C could not? I am genuinely curious, because OS development is not my forte. I assume the justification for adopting Rust must rest on more than 'newer = better', right?
The issue is unrelated to expressing linked lists; it's related to race conditions in the kernel, which is one of the hardest areas to get right.
This could have happened with no linked lists whatsoever. Kernel locks are notoriously difficult, even for Linus and other extremely experienced kernel devs.
I love Rust, but C does make it a lot easier to build certain kinds of container types. E.g., intrusive lists are trivial in C but very awkward in Rust. Even if you use unsafe, Rust's noalias requirement can make a lot of code much harder to implement correctly. I've concluded for myself (after writing a lot of code and a lot of soul searching) that the best way to implement certain data structures in Rust is quite different from how you would do the same thing in C. I don't think this is a bad thing - they're different languages, and of course the best way to solve a problem in languages X and Y will differ.
And safe abstractions mean this stuff usually only matters if you're implementing new, complex collection types - an ECS, B-tree, or Fenwick tree. Most code can just use the standard collection types (Vec, HashMap, etc.) and never think about any of this.
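For anyone who hasn't seen it, here's a minimal sketch of the intrusive-list pattern in C - the link node is embedded in the payload struct and recovered with a container_of macro, in the style of the kernel's list_head (this is a standalone toy, not actual kernel code):

#include <stddef.h>
#include <stdio.h>

/* the list node lives inside the payload struct ("intrusive") */
struct list_head {
    struct list_head *next, *prev;
};

/* recover the payload from a pointer to its embedded node */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct task {
    int id;
    struct list_head node;   /* embedded link - no separate allocation */
};

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *h, struct list_head *n)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

int main(void)
{
    struct list_head head;
    struct task a = { .id = 1 }, b = { .id = 2 };

    list_init(&head);
    list_add(&head, &a.node);
    list_add(&head, &b.node);

    /* walk the links and recover each containing task */
    for (struct list_head *p = head.next; p != &head; p = p->next)
        printf("task %d\n", container_of(p, struct task, node)->id);
    return 0;
}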
> "Trellis coded modulation got this rate up to 50 kilobaud by the 1990s"
Not quite, and an interesting story that fits these engineering maxims better than you might think.
An analog channel with the bandwidth and SNR characteristics of a landline phone line has (IIRC) a Shannon capacity of 30-something kbit/s. V.34 closely approached that, using trellis coded modulation plus basically every other coding and equalization mechanism known at the time to reach 33.6 kbit/s on a good day.
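For reference, the Shannon-Hartley arithmetic, assuming a typical ~3.1 kHz voice channel and an SNR around 35 dB (both figures vary from line to line):

$$C = B \log_2(1 + S/N) \approx 3100 \cdot \log_2(1 + 10^{3.5}) \approx 3100 \cdot 11.6 \approx 36\ \text{kbit/s}$$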
But... by the 80s or so the phone system was only analog for the "last mile" to the home. The rest of the system was digital, sending 8-bit samples (using logarithmic mu-law encoding) at a sampling rate of 8000 samples/s, and if you had a bunch of phone lines coming into a facility you could get those lines delivered over a digital T1 link.
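As an aside, that mu-law companding step is small enough to sketch. This is a textbook-style G.711 encoder (the bias/clip constants follow the usual reference implementations; real telco gear used lookup tables):

#include <stdint.h>

#define ULAW_BIAS 0x84    /* 132: offset so the exponent search below works */
#define ULAW_CLIP 32635   /* maximum magnitude before companding */

/* compress one 16-bit linear PCM sample into an 8-bit mu-law code */
uint8_t linear_to_ulaw(int16_t pcm)
{
    int s = pcm;                 /* widen first so negating -32768 is safe */
    int sign = 0;
    if (s < 0) { s = -s; sign = 0x80; }
    if (s > ULAW_CLIP) s = ULAW_CLIP;
    s += ULAW_BIAS;

    /* segment number (exponent) = position of the highest set bit */
    int exponent = 7;
    for (int mask = 0x4000; (s & mask) == 0 && exponent > 0; mask >>= 1)
        exponent--;

    int mantissa = (s >> (exponent + 3)) & 0x0F;
    return (uint8_t)~(sign | (exponent << 4) | mantissa);  /* codes are stored inverted */
}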
Eventually someone realized that if your ISP-side modem directly outputs digital audio, the downstream channel capacity is significantly higher - in theory the limit is probably 64000 bit/s, i.e. the bit rate of the digital link, although V.90 could only achieve about 56000 bit/s in theory, and more like 53 kbit/s in practice. (In particular, the FCC limited the total signal power, which means not every possible pattern of those 64000 bits in a second of audio was allowable.)
I worked with modem modulation folks when I was a co-op student in the mid-80s. They had spent their lives thinking about the world in terms of analog channels, and it took some serious out-of-the-box thinking on someone's part to realize that the channel was no longer analog, and that you could take advantage of that.
A few years later those same folks all ended up working on cable modems, and it was back to the purely analog world again.
> if your ISP-side modem directly outputs digital audio, the downstream channel capacity is significantly higher
But why is it higher? It's still an analog channel (the last mile from the ISP to your house), right? Doesn't it get filtered? So isn't it still subject to the Shannon-Nyquist limit?
Here's an ASCII drawing of which parts are digital vs analog as I understood your explanation:
Rest of world <---digital---> Telco <---digital---> ISPmodem <---analog---> HomeModem
Suppose you're saying that the link between the ISPmodem and the HomeModem is a bare unfiltered copper wire. In that case, I have a different question: couldn't you send data at megabits per second over a mile-long copper wire without using modems at all (using just UARTs)?
> Couldn't you send data at megabits per second over a mile-long copper wire
Yes, but you need the bare copper wire without signaling. We operated a local ISP in the '90s and did exactly that by ordering so-called "alarm circuits" from the telco (with no dial tone) and placing a copper T1 CSU on each end. We marketed it as "metro T1" and undercut traditional T1 pricing by a huge margin, with great success in the surrounding downtown area.
Yes, the actual bandwidth of the last-mile analog line was much, much higher - hence why we eventually got 8 Mbit/s ADSL or 24 Mbit/s ADSL2+ running across it, or even 50-300 Mbit/s with VDSL in really ideal conditions.
Though the actual available bandwidth was very dependent on distance. People would lease dedicated pairs for high bandwidth across town (or, according to a random guy I talked to at a cafe, just pirate an unused pair that happened to run between their two buildings). But once you start talking about links between towns, the 32 kbit/s you could get from the digital trunk lines was almost always more than what you could get on a raw analog line over the same distance.
Traditionally both the ISP and you pay for analog phone lines from the telco. The telco uses digital internally (remember, you and your ISP probably aren't on the same exchange), which puts a hard limit on the data rate - no trick can get more bits through than the digital part of the call carries.
If you (as the ISP) buy enough lines you can get them delivered in digital format. A T1 is designed to carry 24 simultaneous phone calls, acting virtually as a bundle of 24 analog phone cables. So the obvious next stage was to have a modem that can handle 24 simultaneous connections on one cable.
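For anyone keeping score, the standard DS1 framing arithmetic behind the T1 rate:

$$24 \times 64\ \text{kbit/s} + 8\ \text{kbit/s (framing)} = 1544\ \text{kbit/s} = 1.544\ \text{Mbit/s}$$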
Now you have ISP_modem ←A×24→ ISP_muxer ←D×24→ Telco ←→ Telco ←A→ User
The ISP's modem generates analog signals for up to 24 simultaneous incoming calls, which pass into a multiplexer that connects 24 analog lines to a T1 line, and from there go through the telco digitally to the users. The maximum bandwidth is still as before - the modem has to generate an analog signal that will still be receivable at the other end after A/D and D/A conversion. Even though the digital segment can carry 56 kbps, the maximum achievable rate through this digital-bottlenecked analog call was found to be 33.6 kbps.
But the industry had an idea: by convincing the telco to install the modems in the user's exchange, the analog portion would run only between the telco and the user, with no digital segment in the middle of it, and therefore wouldn't be bottlenecked the same way. The same digital backhaul from the ISP through the telco was still used, but instead of carrying a digitized analog modem signal (and the degradation that digitization causes), it carried your actual internet traffic bits, up to 56 kbps. The analog signal was generated on the user's side of the telco and never had to fit within 56 kbps when digitized.
Pedantically, the digital circuits are 64 kbps, but one bit in some bytes is overwritten for call status signaling ("robbed-bit" signaling), which is fine for voice. The ISP equipment can't predict which bytes have a bit overwritten (and it could be more than one if there are several hops), so it just used 7 bits of each byte.
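Which is where the 56 comes from:

$$8000\ \text{samples/s} \times 8\ \text{bits} = 64\ \text{kbit/s}, \qquad 8000 \times 7 = 56\ \text{kbit/s}$$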
No, it’s more like HomeModem ←A→ Exchange1 ←D→ Exchange2 ←A→ ISPModem. The digital parts were all inside the telco’s networks that connect the exchanges to each other.
> Couldn't you send data at megabits per second over a mile-long copper wire without using modems at all (using just UARTs)?
No. The exchange is sampling the analog signal coming in over your phone line at 8 kHz and 8 bits per sample. They just designed modems that sent digital data over that analog link in a way that would line up exactly with how the exchange samples it.
> An analog channel with the bandwidth and SNR characteristics of a landline phone line has (IIRC) a Shannon capacity of 30-something kbit/s,
The problem is that the ADC in the telephone exchange is not clocked coherently with your modem. You're paying various companding, quantization (in both time and level), and other losses - it is not a nice linear channel.
An end-to-end analog connection could be capable of significantly more - while there's noise, there's no "8 kHz cliff." A 45 dB SNR over 0-4 kHz gives a Shannon capacity of roughly 60 kbit/s, and there's not exactly a 4 kHz cliff even with loading coils present.
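Checking that figure (45 dB is a power ratio of about 31600):

$$C = 4000 \cdot \log_2(1 + 10^{4.5}) \approx 4000 \cdot 14.9 \approx 60\ \text{kbit/s}$$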
> Not quite, and an interesting story that fits these engineering maxims better than you might think.
Huh? The state of the art for dial-up modems in the 1990s was indeed 56 kbps, achieved with various encoding and compression hacks in both domains, including trellis coding.
Every time I see this video, I feel a strange tenderness for the new generations watching it.
They do not really understand how bad Oracle used to be. This is us, old combat veterans, sitting by the fire, describing unspeakable battles to the youth...knowing full well that they think we are exaggerating. :-)
And the most disturbing part is the realization that the Frankenstein monster itself, Larry Ellison, is still out there. Still roaming free. Still very much alive... An eternal, terrifying, lawnmower wielding zombie of enterprise software and government corrupting rent extraction.
We shouldn’t anthropomorphize any billionaires. They’re not even people at that point, just destructive aliens who undemocratically ruin everyone’s good time.
We need confiscatory taxation for a better future.
It's kind of like saying we should only focus on the #1 mass-murdering dictator in the world: while many of them are actively slaying people, let's just focus on #1 for now.
No, we can have many targets. People who hoard money for their own benefit to the detriment of society and the population at large are all "destructive aliens who undemocratically ruin everyone's good time," to borrow the parent commenter's words. If just 10% of them were slightly less evil and egoistic, it would lead to huge improvements, at only a slight cost to their own lifestyles. That they don't is a stain on the legacy of humanity.
I've got to say, there's nothing more infuriating than standing in front of a register while everyone behind the counter is busy working on online orders that won't get picked up for quite a while, as evidenced by their repeatedly calling out names for the online orders waiting forlornly at the end of the counter.
(This is at a campus Dunkies where there's no drive-through, and I have a hard deadline to start my lecture. If there's no line at the register and I've got five minutes before class starts in a room down the hall, it shouldn't take a logistical genius to get me a regular coffee in time for class.)
Actually that's a really common use - I've bought a half dozen or so Dell rack-mount servers in the last five years, and work with folks who buy orders of magnitude more, and we all spec RAID0 SATA boot drives. If SATA goes away, I think you'll find low-capacity SAS drives filling that niche.
I highly doubt you'll find M.2 drives filling that niche. 2.5" drives can also be replaced without opening the machine, which is a major win - every time you pull the machine out on its rails and pop the top is another opportunity for cables to come loose or other things to go wrong.
M.2 boot drives for servers have been popular for years. There's a whole product segment of server boot drives that are relatively low capacity, sometimes even using the consumer form factor (80mm long instead of 110mm) but still including power loss protection. Marvell even made a hardware RAID0/1 controller for NVMe specifically to handle this use case. Nobody's adding a SAS HBA to a server that didn't already need one, and nobody's making any cheap low-port-count SAS HBAs.
Anything from the x4x generation onward (i.e. 14th-gen PowerEdge and later) has M.2 BOSS support, and in 2026 you shouldn't buy anything older than 14th gen. But yes, cheap SSDs serve well as ESXi boot drives.
The storage markets I can think of, off the top of my head:
1. individual computers
2. hobbyist NAS, which may cross over at the high end into the pro audio/video market
3. enterprise
4. cloud
#1 is all NVMe. It's dominated by laptops, and desktops (which are still 30% or so of shipments) are probably at the high end of the performance range.
#2 isn't a big market, and takes what they can get. Like #3, most of them can just plug in SAS drives instead of SATA.
#3 - there's an enterprise market for capacity drives with a lower per-device cost overhead than NVMe - it's surprisingly expensive to build a box that will hold dozens of NVMe drives - but SAS is twice as fast as SATA (12 Gb/s vs 6 Gb/s), and you can re-use the adapters and mechanicals you're already using for SATA. (Pretty much every non-motherboard SATA adapter is already a SAS/SATA adapter, and has been for a decade.)
#4 - cloud uses capacity HDDs and both performance and capacity NVMe. They probably buy >50% of the HDD capacity sold today; I'm not sure what share of the SSD market they buy. The vendors produce whatever the big cloud providers want; I assume this announcement means SATA SSDs aren't on their list.
I would guess that SATA will stay on the market for a long time in two forms:
- crap SSDs, for the die-hards on HN and other places :-)
- HDDs, because they don't need the higher SAS transfer rate for the foreseeable future, and for the drive vendor it's probably just a different firmware load on the same silicon.
I agree hobbyist NAS is a niche, but it's very useful: less noise, lower electricity bills, and not that much space - e.g. if you can find 3x Samsung 870 QVO drives at 8TB, you can have a super solid 16TB NAS with redundancy (or 24TB without). Not to mention it's compact; an ITX-sized PC can do quite a lot of work.
Back in the days of MapQuest there was a (usually) very good site called mapsonus.com, but it evidently had one of the ferries that crossed Boston Harbor in its map as a zero-distance link.
Since it was offline route planning rather than live navigation, the bug was obvious, although a bit frustrating - you had to put in multiple waypoints to make it forget its urge to send you on the ferry to Hull when you were trying to get to parts of the South Shore.
In Boston it's a very frequent occurrence to be driving in the Central Artery Tunnel and have your map software think you're on the surface, or vice versa, or to be on a highway overpass and again have it think you're on a surface road that is inaccessible from your location. You get used to it.
This seems like an entirely different level of craziness, though.
Yeah, I remember that back in the early 2000s, before phones had turn-by-turn navigation, there were PDAs that did it, and it was common for the software to just ask whether you were driving on the surface road or an elevated viaduct.