Exploring Ethereum 2: The Curious Case of the Invisible Fork
By Adrian Sutton
It’s like something out of a Sherlock novel – the doors are all locked, windows barred and yet somehow the jewels have been stolen. Or in this case, every block created forms a single chain and yet somehow the validator client has wound up on the wrong fork. Fortunately Sherlock, or rather Cem Ozer, was on the case. This is the story behind those very annoying “Produced invalid attestation” errors that Teku would sometimes generate.
The blocks involved look a lot like:
Slot | 28 | 29 | 30 | 31 | 32 |
---|---|---|---|---|---|
Block Root | 0x01 | 0x02 | 0x03 | 0x04 | 0x05 |
Parent Root | 0x00 | 0x01 | 0x02 | 0x03 | 0x04 |
There are no other blocks floating around. This looks like a chain working perfectly with no forks. However, timing matters in ETH2 and there’s an invisible fork hidden in there. Slot 32 is the start of a new epoch, so the validator needs to calculate the duties it should perform, but in this case the block for slot 32 arrived late, so the validator calculated its duties based on:
Slot | 28 | 29 | 30 | 31 | 32 |
---|---|---|---|---|---|
Block Root | 0x01 | 0x02 | 0x03 | 0x04 | <empty> |
Parent Root | 0x00 | 0x01 | 0x02 | 0x03 | |
While in ETH1 that would just mean you’re behind head, in ETH2 whether a block exists or not effectively creates a fork. The block for a slot contributes to the randao value, so all the committee shuffling and duty scheduling based on the randao changes depending on whether the slot is empty or not.
So when the validator calculates its duties based on slot 32 being empty, it gets a different set of duties than it would if the block had already arrived. The net result is that attestation signatures appear invalid, because the validator an attestation comes from is derived from the shuffling rather than stated explicitly in the attestation.
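To make the effect of an empty slot concrete, here is a toy model in Python. It is only a sketch: `mix_in`, `randao_mix` and `toy_shuffle` are hypothetical helpers, the reveals are made-up byte strings, and a seeded `random.Random` stands in for the real committee shuffling. The one idea it borrows from the protocol is that each block's randao reveal is hashed and XORed into a running mix, while an empty slot contributes nothing.

```python
import hashlib
import random


def mix_in(mix: bytes, randao_reveal: bytes) -> bytes:
    """XOR the hash of a block's randao reveal into the running mix."""
    digest = hashlib.sha256(randao_reveal).digest()
    return bytes(a ^ b for a, b in zip(mix, digest))


def randao_mix(reveals_by_slot):
    """Fold every block's reveal into the mix; None marks an empty slot."""
    mix = bytes(32)
    for reveal in reveals_by_slot:
        if reveal is not None:
            mix = mix_in(mix, reveal)
    return mix


def toy_shuffle(validators, seed: bytes):
    """Stand-in for committee shuffling: any seeded permutation will do here."""
    shuffled = list(validators)
    random.Random(seed).shuffle(shuffled)
    return shuffled


validators = list(range(64))

# Slots 28..32 with hypothetical reveals; slot 32 is either empty or filled.
slot_32_empty = [b"r28", b"r29", b"r30", b"r31", None]
slot_32_filled = [b"r28", b"r29", b"r30", b"r31", b"r32"]

seed_empty = randao_mix(slot_32_empty)
seed_filled = randao_mix(slot_32_filled)

duties_empty = toy_shuffle(validators, seed_empty)
duties_filled = toy_shuffle(validators, seed_filled)

print("same seed:", seed_empty == seed_filled)
print("same duties:", duties_empty == duties_filled)
```

Both views contain exactly the same blocks for slots 28 to 31, yet the two seeds differ, so the validator ordering derived from them differs too. That is the invisible fork.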
Later when the block for slot 32 does arrive, the beacon chain considers it as just an extension of the current fork, so doesn’t tell the validator client to recalculate duties. An epoch later when those scheduled duties actually happen, they’re still scheduled as if that slot was empty and so the signatures appear invalid.
Cem’s fix is elementary (as most things are once you understand the problem) – the beacon chain node needs to fire a reorg event when a previously empty slot is filled, which causes the validator client to recalculate its duties.
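The shape of that fix can be sketched in a few lines of Python. This is not Teku's code: `HeadTracker`, `slot_passed_without_block` and `new_head` are hypothetical names, and the real node tracks far more than a set of slot numbers. The sketch only shows the rule that a head block landing in a slot previously treated as empty is reported as a reorg, even though it extends the same chain.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class HeadTracker:
    # Callback invoked with the slot whose late block changed the chain view.
    on_reorg: Callable[[int], None]
    # Slots the node has already processed as if no block existed.
    assumed_empty: Set[int] = field(default_factory=set)

    def slot_passed_without_block(self, slot: int) -> None:
        """Record that duties may have been calculated with this slot empty."""
        self.assumed_empty.add(slot)

    def new_head(self, block_slot: int) -> None:
        """Handle a new head block; a late block in an 'empty' slot is a reorg."""
        if block_slot in self.assumed_empty:
            self.assumed_empty.discard(block_slot)
            self.on_reorg(block_slot)


reorgs = []
tracker = HeadTracker(on_reorg=reorgs.append)

tracker.new_head(31)                   # arrives on time, nothing to report
tracker.slot_passed_without_block(32)  # epoch starts, slot 32 still empty
tracker.new_head(32)                   # late block fills the slot
print(reorgs)
```

In this sketch the validator client would hang its duty recalculation off `on_reorg`, so the late block for slot 32 triggers it while the on-time block for slot 31 does not.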
So case closed? Not quite… We’d actually already thought of this potential problem and the validator client was already listening for block imported events. When a block was imported, any duties two or more epochs ahead were invalidated (at the start of an epoch, you can safely calculate duties for that epoch and the one after, so the blocks in epoch 3 only affect duties for epoch 5 and later). That should have caught this case – why was the problem still happening?
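As a rough illustration of that invalidation rule, the following Python sketch drops cached duties from two epochs after the imported block onwards. The cache layout and the `invalidate_duties` helper are assumptions made for the example, not the validator client's real data structures; the only protocol fact relied on is that an epoch is 32 slots.

```python
SLOTS_PER_EPOCH = 32


def epoch_of(slot: int) -> int:
    """Epoch that contains the given slot."""
    return slot // SLOTS_PER_EPOCH


def invalidate_duties(cached_duties: dict, imported_block_slot: int) -> dict:
    """Keep duties a new block cannot affect; drop those two or more epochs ahead."""
    first_affected_epoch = epoch_of(imported_block_slot) + 2
    return {
        epoch: duties
        for epoch, duties in cached_duties.items()
        if epoch < first_affected_epoch
    }


# Hypothetical cache keyed by epoch; the values are placeholder duty lists.
cache = {3: ["attest"], 4: ["attest"], 5: ["attest"], 6: ["propose"]}

# A block imported in epoch 3 (slot 3 * 32 + 5) leaves epochs 3 and 4 untouched.
remaining = invalidate_duties(cache, 3 * SLOTS_PER_EPOCH + 5)
print(sorted(remaining))
```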
It turns out that while the duties were correctly invalidated when the block was imported, block import isn’t what updates the best block – running the fork choice algorithm does. The validator client wound up recalculating the duties before fork choice had run to consider the new block, and thus recalculated them based on the slot still being empty. With Cem’s fix in place we’ll be able to remove this first attempt at a fix and only invalidate on reorg events.
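The ordering problem is easy to reproduce in a toy model. In the Python sketch below, `ToyBeaconNode` and its event names are hypothetical, and "fork choice" is reduced to picking the highest imported slot, which is nothing like the real algorithm. What it does show is that a listener reacting to the import event still sees the old head, while a listener reacting to the reorg event sees the new one.

```python
class ToyBeaconNode:
    def __init__(self, head_slot: int):
        self.head_slot = head_slot
        self.pending = []  # imported blocks not yet considered by fork choice
        self.listeners = {"block_imported": [], "reorg": []}

    def subscribe(self, event: str, callback) -> None:
        self.listeners[event].append(callback)

    def _emit(self, event: str) -> None:
        for callback in self.listeners[event]:
            callback(self)

    def import_block(self, slot: int) -> None:
        """Store the block and notify listeners; the head is NOT updated here."""
        self.pending.append(slot)
        self._emit("block_imported")

    def run_fork_choice(self) -> None:
        """Grossly simplified fork choice: the highest imported slot wins."""
        best = max(self.pending + [self.head_slot])
        self.pending.clear()
        if best != self.head_slot:
            self.head_slot = best
            self._emit("reorg")


head_seen_on_import = []
head_seen_on_reorg = []

node = ToyBeaconNode(head_slot=31)
node.subscribe("block_imported", lambda n: head_seen_on_import.append(n.head_slot))
node.subscribe("reorg", lambda n: head_seen_on_reorg.append(n.head_slot))

node.import_block(32)   # listener fires while the head is still slot 31
node.run_fork_choice()  # head moves to slot 32, then the reorg listener fires

print(head_seen_on_import, head_seen_on_reorg)
```

A validator client that recalculates duties from the import listener would be working from the stale head, which matches the behaviour described above.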