Sesse__'s comments

Endgame tablebases don't take threefold repetition into account; if they did, you would basically have to be able to exclude arbitrary positions from the tree, which seems intractable. The 50-move rule is respected by the Syzygy tablebases, though with the concession that they do not generally give the fewest possible moves to mate (they would rather delay the mate than delay a pawn push or a capture).

Here's an example (adapted from the URL below): https://syzygy-tables.info/?fen=3R4/5R2/8/8/8/1K6/8/4k3_w_-_... — if you asked pretty much any player, even a child, how to win this, they'd show the staircase mate starting with Re7+ (mate in 4). If you asked a computer or the older Nalimov tablebases, it would say Kc2! (mate in 2). However, if you ask the Syzygy tablebases, they would argue that this is not optimal if we are extremely close to the 50-move rule, so the safest and thus best move is Rf2!! which forces Black to capture the rook on the next turn (they have no other legal moves), resetting the counter and giving a mate in 18.
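
For anyone who wants to poke at this themselves, here's a minimal sketch using the python-chess library's chess.syzygy module. The tablebase path is a placeholder, and the trailing FEN fields (en passant, move counters) are filled in with assumed defaults, since they're not part of the link above.

  import chess
  import chess.syzygy

  # Placeholder path; point this at a directory with downloaded Syzygy files.
  with chess.syzygy.open_tablebase("./syzygy") as tablebase:
      # The position from the link; the trailing "- 0 1" is assumed.
      board = chess.Board("3R4/5R2/8/8/8/1K6/8/4k3 w - - 0 1")
      # WDL: win/draw/loss from the side to move's perspective.
      print("WDL:", tablebase.probe_wdl(board))
      # DTZ: distance to zeroing (a capture or pawn push), which is what
      # Syzygy optimizes for instead of distance to mate.
      print("DTZ:", tablebase.probe_dtz(board))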

There was a set of experimental DTM50 tablebases made at some point (though not made public); they store the shortest mate for every possible value of the 50-move counter in any position. See https://galen.xyz/egtb50/ for some discussion.


Here's an actual constructed game that is presumably as long as possible (with explanation): https://www.reddit.com/r/chess/comments/168qmk6/longest_poss...

Is that a new rule? I was under the impression that it had been the case for a very long time that if you ran out of time but there was no possible sequence of moves leading to you being checkmated, it was a draw instead. (Meaning, of course, that having more pieces could be a disadvantage in such situations, which feels a bit unfair. E.g., KvKB is a draw, but KPvKB can lead to a mate if both sides cooperate, and thus would be a time loss for White even if Black would never win in practical play.)

That's not new, but how it formally works has changed. There used to be a number of explicitly enumerated cases (i.e., bare king, and king with a minor piece); now the rule instead just says that there must exist a sequence of moves leading to mate. Some positions, even with pawns (imagine a completely closed position with only pawns and kings), wouldn't have been automatically drawn under the previous system but now would be. I think USCF rules, unlike FIDE's, still have the enumerated cases?

The difference is extremely minor and has almost no strategic implications, it's just an interesting corner case.


The oldest rules on FIDE's pages are the ones for “before 2014”. They state:

  The game is drawn when a position has arisen in which neither player can checkmate the opponent’s king with any series of legal moves. The game is said to end in a ‘dead position’. This immediately ends the game, provided that the move producing the position was legal. (See Article 9.6)
And 9.6 just states:

  The game is drawn when a position is reached from which a checkmate cannot occur by any possible series of legal moves. This immediately ends the game, provided that the move producing this position was legal.
And similarly 6.9, which governs loss on time:

  Except where one of the Articles: 5.1.a, 5.1.b, 5.2.a, 5.2.b, 5.2.c applies, if a player does not complete the prescribed number of moves in the allotted time, the game is lost by the player. However, the game is drawn, if the position is such that the opponent cannot checkmate the player’s king by any possible series of legal moves.
So it's at least ten years old, but possibly quite a bit older. I know I have a copy of the 1984 rules (or possibly even older) somewhere on paper, but then I'd have to go into the attic :-)

Does it depend on Elo as well?

No. How could it possibly depend on Elo?

Well, it can depend on the referee's discretion, and the referee can evaluate whether a position is obviously a draw or not.

Something at high Elo may obviously be a draw, like KRPPP vs KRPP, or KRN vs KR, but not necessarily at lower Elo.


What if the players are both much higher-rated than the arbiter?

Basically, once you've lost on time, you've given up the right to any sort of agency, and thus the Elo doesn't matter. The rules charitably give you a rating of minus infinity and allow you to try to salvage half a point with that.


I just updated the article. I did use Python's insufficient-material detection, in addition to the ability to claim a draw (threefold repetition and the 50-move rule). I think the "75-move rule" that doesn't require a player to claim it is one of the more recent rule changes.
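
For reference, and assuming the library in question is python-chess: the automatic draws and the claimable ones are separate methods, roughly like this (a sketch, not the article's code).

  import chess

  board = chess.Board()
  # ... play some moves here ...

  # Automatic draws (no claim needed):
  print(board.is_insufficient_material())  # enumerated material combinations
  print(board.is_seventyfive_moves())      # the newer 75-move rule
  print(board.is_fivefold_repetition())    # automatic repetition draw

  # Claimable draws (a player has to call them):
  print(board.can_claim_threefold_repetition())
  print(board.can_claim_fifty_moves())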

bullseye and bookworm have versions too old to be vulnerable, it seems.

Oh, that's interesting: it indeed shows "not affected" in the second table on the link I pasted, but before that, in the first table, it says "Status // Fixed / Fixed".

I never paid attention to the fact that one table had "Fixed" and the other "Not affected" for the same "Not affected" package.


Well, it depends. I vividly remember removing 200 small SQLite queries from a routing algorithm in a mobile app (by moving the metadata into a small in-memory data store instead) and roughly doubling its speed. :-) It was a pretty easy call after seeing sqlite3_step being the top CPU user by a large margin.
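
The original was C++, but the shape of the fix is language-independent. A sketch of the idea using Python's stdlib sqlite3, with made-up table and column names:

  import sqlite3

  conn = sqlite3.connect("metadata.db")  # hypothetical database

  # Before: one tiny query per lookup, paying query overhead every time.
  def edge_cost_slow(edge_id):
      row = conn.execute("SELECT cost FROM edges WHERE id = ?", (edge_id,)).fetchone()
      return row[0]

  # After: pull the small metadata table into an in-memory dict once, then do
  # plain hash lookups inside the hot loop.
  EDGE_COSTS = dict(conn.execute("SELECT id, cost FROM edges"))

  def edge_cost_fast(edge_id):
      return EDGE_COSTS[edge_id]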

Yeah, en/decoding results and parameters from and to JS types is also quite the time waster.

This was from C++, so the encoding wasn't really a factor.

> I mean, cap'n'proto is written by the same person who created protobuf

Notably, Protobuf 2, a rewrite of Protobuf 1. Protobuf 1 was created by Sanjay Ghemawat, I believe.


Google loves to reinvent shit because they didn't understand it. And to get promo. In this case, ASN.1. And protobufs are so inefficient that they drive up latency and datacenter costs, so they were a step backwards. Good job, Sanjay.

Really dismissive and ignorant take from a bystander. Back it up by delivering something that does better, instead of shouting with a pitchfork for no reason.

This bystander has been using protobufs for more than ten years. I'm not sure what I need to deliver since ASN.1, Cap'n Proto and Flatbuffers are all more efficient and exist already. ASN.1 was on the scene in 1984 and was already more efficient than protobufs.

Protobuf has far better ergonomics than ASN.1. ASN.1 is an overcomplicated design-by-committee mess. Backwards compatibility in particular is much harder.

I don't doubt your experience, but with X.509 having evolved substantially, and ASN.1 on billions (if not tens of billions) of devices, in practice it seems OK. And it was formally verified early.

ASN.1 on billions of devices doesn’t make it less of an anti-ergonomic, design-by-committee piece of crap. Unless your goal is to be binary-compatible with these devices, one should be free to explore alternatives.

By all means, keep using it, but it might be worth figuring out why other people don’t. Hint: it’s not because they’re more stupid than you or are looking to get promoted by big G.

(Personally, I like the ideas and binary encoding behind Cap'n Proto more than all the alternatives.)


One of the advantages of protobuf I never see anyone highlight is how neat and well-designed the wire format is, in terms of backward/forward compatibility and the low-level stuff you can do with it. Very useful when building big and demanding systems over time.
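
To make that concrete: every field on the wire is a varint tag encoding (field number, wire type), followed by a payload whose length is determined by the wire type alone, so a decoder can walk past fields it has no schema for. That is where most of the forward compatibility comes from. A rough sketch of the field-walking loop (my own illustration, ignoring groups, packed repeated fields and zigzag encoding):

  def read_varint(buf, pos):
      """Base-128 varint: 7 bits per byte, MSB set on all but the last byte."""
      result, shift = 0, 0
      while True:
          b = buf[pos]
          pos += 1
          result |= (b & 0x7F) << shift
          if not (b & 0x80):
              return result, pos
          shift += 7

  def iter_fields(buf):
      """Yield (field_number, wire_type, raw_value) for each field in a message.
      Unknown field numbers need no schema: the wire type alone says how many
      bytes to consume, so old readers can skip fields added by newer writers."""
      pos = 0
      while pos < len(buf):
          key, pos = read_varint(buf, pos)
          field_number, wire_type = key >> 3, key & 0x07
          if wire_type == 0:      # varint
              value, pos = read_varint(buf, pos)
          elif wire_type == 1:    # fixed 64-bit
              value, pos = buf[pos:pos + 8], pos + 8
          elif wire_type == 2:    # length-delimited (strings, bytes, sub-messages)
              length, pos = read_varint(buf, pos)
              value, pos = buf[pos:pos + length], pos + length
          elif wire_type == 5:    # fixed 32-bit
              value, pos = buf[pos:pos + 4], pos + 4
          else:
              raise ValueError(f"unsupported wire type {wire_type}")
          yield field_number, wire_type, value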

For high-performance and critical stuff, SBE is much more suitable, but it doesn't have as good a schema-evolution story as protobuf.


lol are you accusing Sanjay of creating Protobuf to get promoted?

As far as I can see, these classes of filters (including xor filters) have some practical issues for many applications: they can become full (refuse new entries altogether), and they need to know all the elements up front (no incremental inserts). Is there anything more modern than Bloom filters that doesn't have these restrictions?

I'm especially fond of tiny filters; a well-placed 32- or 64-bit Bloom filter can be surprisingly effective in the right context!
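
A sketch of what such a tiny filter can look like, with the whole filter living in a single 64-bit integer and two probe positions derived from one hash (the choice of hash function here is arbitrary):

  import hashlib

  class TinyBloom:
      """A 64-bit Bloom filter with k=2 probes, packed into a single integer."""

      def __init__(self):
          self.bits = 0

      def _probes(self, item):
          h = int.from_bytes(
              hashlib.blake2b(item.encode(), digest_size=8).digest(), "little")
          # Derive two 6-bit probe positions from one 64-bit hash.
          return (h & 63), ((h >> 6) & 63)

      def add(self, item):
          for p in self._probes(item):
              self.bits |= 1 << p

      def might_contain(self, item):
          return all(self.bits & (1 << p) for p in self._probes(item))

It saturates quickly, of course, but for a handful of hot keys the whole filter fits in a register and a lookup is a couple of ALU ops.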


There are many variants. It really depends on what features you need. Cuckoo filters were mentioned. If you want to add and remove entries and regular counting Bloom filters are not good enough, I would look at the "succinct counting blocked Bloom filter" [1]: they only need about twice the space of regular Bloom filters. Sure, cuckoo filters need less memory, but they can fail at basically any time, while these cannot.

Tiny filters: Some time ago I worked on tiny statistics [2]. This includes some 64-bit HyperLogLog implementations; some use linear counting, which is basically a 64-bit Bloom filter, until some limit, and only then switch to HyperLogLog. This is great for distinct counts of columns in databases (cardinality estimation). This project also includes 64-bit approximate counts and histograms.

[1] https://github.com/FastFilter/fastfilter_java/blob/master/fa... [2] https://github.com/thomasmueller/tinyStats
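
As an aside, the linear-counting part is simple enough to sketch in a few lines: set bits like a one-hash Bloom filter, then estimate the distinct count from the fraction of bits that are still zero. (This is just the classic formula, not the tinyStats implementation.)

  import hashlib
  import math

  M = 64          # one machine word, as in the 64-bit variants mentioned above
  bits = 0

  def add(item):
      global bits
      h = int.from_bytes(
          hashlib.blake2b(item.encode(), digest_size=8).digest(), "little")
      bits |= 1 << (h % M)

  def estimate():
      zero = M - bin(bits).count("1")
      if zero == 0:
          return float("inf")  # saturated; this is where you'd switch to HyperLogLog
      return -M * math.log(zero / M)

  for i in range(40):
      add(f"key-{i}")
  print(estimate())  # should land in the neighborhood of 40 for this tiny example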


FWIW, I found https://github.com/FastFilter/fastfilter_java/issues/28 a pretty good explanation of what's going on in the succinct counting blocked Bloom filters. (I'm not sure if the blocking is needed for the normal Bloom filter part, though, i.e., all bits don't necessarily need to fall within the same block, no? But the counts are stored separately for each block.)

Yes, non-blocked is also possible. This would need a bit less space, and would be a bit slower. The counts > 1 (per bit that is set) are stored separately, yes.

> This would need a bit less space, and would be a bit slower.

I guess that is because the count storage update is really slow, right, so it's better to have one operation than two (or however many set bits there are)? At least the linked code seems to process it one bit at a time when updating, and without some sort of “rank of n-th set bit” operation (which would accelerate the “select” version fairly well), I'm not sure it could be made much faster than that either.

Edit: I see https://github.com/FastFilter/fastfilter_java/blob/master/fa... tries to make that operation in O(1), but to be honest, with this many operations, the cure almost looks worse than the disease. :-) Haven't benchmarked, though.


Yes, the count storage update is a bit slow. It could be implemented as a background thread, so that it is not blocking. It depends on the use case whether this is a problem. The 'blocked' variant might be faster to optimize, I assume, if this is needed.

> It could be implemented as a background thread, so that it is not blocking.

How would you realistically do this without creating more overhead in the thread communication than what you're saving? I've never heard of offloading a 30–40 cycle operation to a thread before. Typically sending an atomic across CPUs is what, 200 cycles? (That's assuming you have the thread just sitting there in some kind of pool; firing up a new one is much, much more expensive. It also assumes you never need to hear back from it, so wouldn't work well for a decrease where you need to know the count.)


Right, it would be wasteful if it's done for each entry separately. But that is not needed. (I should probably write an article about this and a better implementation).

The "succinct counting (blocked) Bloom filters" have two components: the regular Bloom filter for querying, and the count storage. The count storage is not "strictly" needed for querying: you will never get a false _negative_ if increments / decrements in the count storage are deferred. So updating the count storage can be done asynchronously. Both increments and decrements, if you want. So these can be buffered, e.g. 100 increments / decrements at a time. Sure, the false-positive rate is slightly higher than needed if the decrements are deferred, but with a large filter this effect is small.

So that means you can buffer increments / decrements in the count storage. You still want to do the "or" operations in the Bloom filter part synchronously, so that you don't get false negatives. And then, it is no longer just 30-40 cycles, but 3000-4000 cycles, or 30'000-40'000 cycles, or so. I understand this would not be trivial to implement, but also not very complex. I never really had a real-world use case so far, so I didn't work on a full implementation yet.
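
A toy, single-threaded sketch of that buffering idea (this is not the fastfilter implementation; a real version would hand the batch to another thread under a lock and use a compact count encoding rather than a Counter):

  from collections import Counter

  class DeferredCountingBloom:
      """Bit array updated synchronously (so queries never see a false
      negative); per-bit counts are buffered and applied in batches."""

      def __init__(self, m_bits=1024, k_hashes=3, batch_size=100):
          self.m, self.k = m_bits, k_hashes
          self.bits = bytearray(m_bits)   # the query-facing Bloom filter
          self.counts = Counter()         # per-bit reference counts
          self.pending = []               # buffered +1 / -1 count updates
          self.batch_size = batch_size

      def _positions(self, item):
          return [hash((item, i)) % self.m for i in range(self.k)]

      def add(self, item):
          for p in self._positions(item):
              self.bits[p] = 1                # synchronous "or": no false negatives
              self.pending.append((p, +1))    # count update deferred
          self._maybe_flush()

      def remove(self, item):
          for p in self._positions(item):
              self.pending.append((p, -1))    # bit is only cleared once the
          self._maybe_flush()                 # flushed count reaches zero

      def might_contain(self, item):
          return all(self.bits[p] for p in self._positions(item))

      def _maybe_flush(self):
          if len(self.pending) >= self.batch_size:
              self.flush()

      def flush(self):
          touched = set()
          for p, delta in self.pending:
              self.counts[p] += delta
              touched.add(p)
          self.pending.clear()
          for p in touched:
              if self.counts[p] <= 0:         # last reference gone
                  self.bits[p] = 0
                  del self.counts[p]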


Sure, you can batch. But that means your decrements must be _entirely_ buffered, right? Because you cannot remove anything from the main filter until you know its count is zero, and you don't know its count until you've processed all increments. And you need to do that under a lock, or you'll have a race where you could have a false negative in the main filter for a short while.

Yes exactly!

Cuckoo filters outperform Bloom filters and allow dynamic insertion and deletion (unlike Bloom filters, which only allow insertion). The trade-off is that insertion can fail if the table is too full, and then you would need to expand it or store those entries some other way to avoid a false negative.

Bloom filters also become full.

As it fills up, the false-positive rate goes up. Once the false-positive rate reaches the threshold of unacceptability, the Bloom filter is full, and you can no longer insert into it.

That most interfaces still let you do something that looks like an insert is an interface failure, not a Bloom filter feature.

If you find this controversial and want to reply "I don't have a threshold of unacceptability", I'll counter that a false-positive rate of 100% will be reached eventually. And if you still find that acceptable, you can trivially modify any probabilistic filter to "never become full" by replacing the "is full" error condition with setting a flag that all future queries should return a false positive.
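
Concretely, the usual approximation for the false-positive rate of an m-bit filter with k hash functions after n insertions is (1 - e^(-kn/m))^k, which creeps toward 1 as n grows:

  import math

  def bloom_fpr(m_bits, k_hashes, n_inserted):
      """Classic approximation of a Bloom filter's false-positive rate."""
      return (1.0 - math.exp(-k_hashes * n_inserted / m_bits)) ** k_hashes

  # The rate climbs toward 1.0 as insertions pile up; "full" is wherever it
  # crosses your own threshold of unacceptability.
  for n in (100, 1_000, 10_000, 100_000):
      print(n, bloom_fpr(m_bits=8192, k_hashes=4, n_inserted=n))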


Counting Quotient Filters (CQF) and Mixed-Counters Quotient Filters (MQF) are more efficient than Bloom Filters, and can count, making them highly suited for applications like rate-limiting at scale.

https://link.springer.com/content/pdf/10.1186/s12859-021-039...


TBH that's not the hard part about it. N is the number of tables, and for most real queries, N < 20, and even 2^N (clique join, which almost never happens in practice) would be tractable if you didn't have so many other things entering the mix. Most queries are closer to chain joins, which dynamic programming can plan in only O(N³) time. (Obviously as N grows, you'll need to add some heuristics to prune out “obviously” bad plans. There are many different approaches.)

The really hard problem is estimating the cost of each plan once you've generated it, which necessarily must happen by some sort of heuristics combined with statistics. In particular: If you want to join A, B and C, the cost of (A JOIN B) JOIN C versus A JOIN (B JOIN C) can differ by many orders of magnitude, depending on the size of the intermediate products. And both the cost effects and any misestimation tend to compound through the plan as you add more tables and predicates.
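
A back-of-the-envelope sketch of that effect, with made-up cardinalities and the textbook independence assumption for selectivities:

  # Hypothetical base-table sizes and join-predicate selectivities.
  card = {"A": 1_000_000, "B": 1_000_000, "C": 10}
  sel = {frozenset("AB"): 1e-4, frozenset("BC"): 1e-5}

  def est_join(left_rows, right_rows, selectivity):
      """Textbook estimate: |L| * |R| * selectivity of the join predicate."""
      return left_rows * right_rows * selectivity

  # Plan 1: (A JOIN B) JOIN C. The A-B intermediate has about 100 million rows.
  ab = est_join(card["A"], card["B"], sel[frozenset("AB")])

  # Plan 2: A JOIN (B JOIN C). The B-C intermediate has about 100 rows.
  bc = est_join(card["B"], card["C"], sel[frozenset("BC")])

  print(f"(A JOIN B) intermediate: {ab:,.0f} rows")
  print(f"(B JOIN C) intermediate: {bc:,.0f} rows")
  # Both plans produce the same final result, but the amount of intermediate
  # work differs by roughly six orders of magnitude, and a misestimated
  # selectivity shifts these numbers just as dramatically.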


This post certainly has too much heuristic fiddling! Instead of a coherent framework, it takes a bunch of second-rate heuristics and tries to use… well, all of them. “Generate at most ten plans of this and one of that”? It also has pages and pages talking about the easier parts, for some reason (like maintaining maps, or that a Cartesian product and an inner join are basically the same thing), and things that are just wrong (like “prefer antijoins”, which is bad in most databases since they are less-reorderable than almost any other join; not that you usually have much of a choice in choosing the join type in the first place).

There _are_ tons of corner cases that you need to address since there are some super-hard problems in there (in particular, robust cardinality estimation of join outputs is a problem so hard that most of academia barely wants to touch it, despite its huge importance), but it doesn't need to be this bad.


Can join cardinality be tackled with cogroup, and by not expanding the rows until the final write?

I don't know what cogroup is, sorry.

More generally, there are algorithms for multi-way joins (with some theoretical guarantees), but they tend to perform worse in practice than just a set of binary joins with a good implementation.


Yeah it's pretty obscure, sorry.

It's called cogroup in Spark and similar architectures.

It does a group-by to convert data into the format (key_col_1, ... key_col_n) -> [(other_col_1, ... other_col_n), ...]

This is useful and ergonomic in itself for lots of use-cases. A lot of Spark and similar pipelines do this just to make things easier to manipulate.

It's also especially useful if you cogroup each side before a join, which gives you the key column and two arrays of matching rows, one for each side of the join.

A quick search says it's called "group join" in academia. I'm sure I've bumped into it under another name in other DB engines, but I can't remember it right now.

One advantage of this is that it uses bounded memory. It doesn't actually iterate over the Cartesian product of non-unique keys. In fact, the whole join can be done on pointers into the sides of the join, rather than shuffling and writing the values themselves.

My understanding is that a lot of big data distributed query engines do this, at least in mixer nodes. Then the discussion becomes how late they actually expand the product - are they able to communicate the cogrouped format to the next step in the plan or must they flatten it? Etc.

(In SQL big data engines you sometimes do this optimisation explicitly, e.g. by doing SELECT key, ARRAY_AGG(value) FROM ... on each side before the join. But things are nicer when it happens transparently under the hood and users get the speedup without the boilerplate, the brittleness, and the fear that it becomes a deoptimisation when circumstances change in the future.)
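
For the curious, a minimal PySpark RDD sketch of the shape described above (names and data are made up):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("cogroup-sketch").getOrCreate()
  sc = spark.sparkContext

  left = sc.parallelize([("k1", "a1"), ("k1", "a2"), ("k2", "b1")])
  right = sc.parallelize([("k1", "x1"), ("k1", "x2"), ("k3", "y1")])

  # cogroup yields (key, (left_values, right_values)) per key, without ever
  # materializing the cartesian product of matching rows.
  for key, (lvals, rvals) in left.cogroup(right).collect():
      print(key, sorted(lvals), sorted(rvals))

  # A plain join would instead expand k1 into 2 x 2 = 4 rows right away:
  # left.join(right) -> ("k1", ("a1", "x1")), ("k1", ("a1", "x2")), ...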


Group join in academia generally refers to having GROUP BY and one join in the same operation (since it's common to have aggregation and at least one join on the same attribute(s)). But just making a hash table on each side doesn't really do anything in itself (although making it on _one_ side is the typical start of a classic hash join); in particular, once you want to join on different keys, you have to regroup.

Even with AVX512, memory arguments used in most instructions (those that are not explicitly unaligned loads) need to be aligned, no? E.g., for vaddps zmm0, zmm0, [rdi] (saving a register and an instruction over vmovups + vaddps reg, reg, reg), rdi must be suitably aligned.

Apart from that, there indeed hasn't been a real unaligned (non-atomic) penalty on Intel since Nehalem or something. Although there definitely is an extra cost for crossing a page, and I would assume also a smaller one for crossing a cache line—which is quite relevant when your ops are the same size as one!


With the older microarchitectures there was a large penalty for crossing a cache line with AVX-512. In some cases, the performance could be worse than AVX2!

In older microarchitectures like Ice Lake it was pretty bad, so you wanted to avoid unaligned loads if you could. This penalty has rapidly shrunk across subsequent generations of microarchitectures. The penalty is still there but on recent microarchitectures it is small enough that the unaligned case often isn't a showstopper.

The main reason to use aligned loads in code is to denote cases where you expect the address to always be aligned, i.e., it should blow up if it isn't. Forcing alignment still makes sense if you want predictable, maximum performance, but it isn't strictly necessary for good performance on recent hardware in the way it used to be.


AVX doesn't require alignment of any memory operands, with the exception of the specific aligned-load instructions. So you/the compiler are free to use the reg,mem form interchangeably with unaligned data.

The penalty on modern machines is an extra cycle of latency and, when crossing a cache line, half the throughput (an unaligned AVX-512 access always crosses a cache line, since the registers are cache-line sized!). These are pretty mild penalties given what you gain! So while it's true that peak L1 cache performance is gained when everything is aligned... the blocker is elsewhere for most real code.


> AVX doesn't require alignment of any memory operands, with the exception of the specific load aligned instruction.

Hah, TIL. Too used to SSE, I guess. (My main target platform is, unfortunately, still limited to SSE3, not even SSSE3.)


