Understanding Attestation Misses
By Adrian Sutton
The process of producing attestations and getting them included in the chain has become more complex post-merge. Combined with a few client issues causing more missed attestations than normal, lots of people are struggling to understand what’s causing those misses. So let’s dig into the process involved and how to identify where problems are occurring.
There are a number of steps required to get an attestation included on the chain. My old video explaining this still covers the details of the journey well - the various optimisations I talk about there have long since been implemented but the process is still the same. In short, the validator needs to produce the attestation and publish it to the attestation subnet gossip channel, an aggregator needs to include it in an aggregation and publish it to the aggregates gossip channel, and then it needs to be picked up by a block producer and packed into a block.
Attestations are more likely to be included in aggregates and blocks if they are published on time and match the majority of other attestations. Attestations that are different can’t be aggregated so they’re much less likely to be included in aggregates (the aggregator would have to produce an attestation that matches yours) and they take up 1 of the 128 attestations that can be in a block but pay less than better aggregated attestations.
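As a rough sketch of why matching matters, here is a simplified, hypothetical model of aggregation - two attestations can only be merged when their attestation data is identical and their signer bitfields don’t overlap. (Real clients also aggregate the BLS signatures and the real AttestationData carries more fields; both are omitted here.)

```python
from dataclasses import dataclass

# Simplified stand-ins for the real types: actual AttestationData also
# includes the committee index and source/target checkpoints.
@dataclass(frozen=True)
class AttestationData:
    slot: int
    head_root: str
    target_root: str

@dataclass
class Attestation:
    data: AttestationData
    aggregation_bits: int  # bit i set => committee member i signed

def try_aggregate(a: Attestation, b: Attestation):
    """Merge two attestations, or return None if they can't be aggregated."""
    if a.data != b.data:
        return None  # a different vote can never be aggregated with ours
    if a.aggregation_bits & b.aggregation_bits:
        return None  # overlapping signers can't be naively combined
    # Real clients also aggregate the BLS signatures at this point.
    return Attestation(a.data, a.aggregation_bits | b.aggregation_bits)

data = AttestationData(slot=3963003, head_root="0xb4ca", target_root="0xaa01")
merged = try_aggregate(Attestation(data, 0b0001), Attestation(data, 0b0100))
assert merged is not None and merged.aggregation_bits == 0b0101
```

An attestation that disagrees on any field of the data falls into the first case, which is why minority attestations mostly travel the network unaggregated.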
Since attestations attest to the current state of the chain, the way to ensure your attestation matches the majority is to ensure you’re following the chain well. That’s where most of the post-merge issues have been - blocks taking too long to import, leading to less accurate attestations which are then less likely to be included. So let’s look at some metrics to follow so we can work out what’s happening.
Key Indicators of Attestation Performance
Often people just look at the “Attestation Effectiveness” metric reported by beaconcha.in, but that’s not a great metric to use. Firstly, it bundles every possible measure of attestations, some within your control and some not, into a single metric. Secondly, it tends to be far too volatile, with a single delayed attestation causing a very large drop in the effectiveness metric, distorting the result. As a result, it tends to make your validator performance look worse than it is and doesn’t give you any useful information to act on.
So let’s look at some more specific and informative metrics we can use instead.
Firstly, for the overall view, look at the percentage of attestation rewards earned. While that write up is pre-Altair, the metrics on the Teku Dashboard have been updated to show the right values even with the new Altair rules. Look at the “Attestation Rewards Earned” line on the “Attestation Performance” graph in the top left of the dashboard. This will tell you quite accurately how well you’re doing in terms of total rewards, but it still includes factors outside of your control and won’t help identify where problems are occurring.
To identify where problems are occurring we need to dig a bit deeper. Each epoch, Teku prints a summary of attestation performance to the logs like:
Attestation performance: epoch 148933, expected 16, produced 16, included 16 (100%), distance 1 / 1.00 / 1, correct target 16 (100%), correct head 16 (100%)
This is an example of perfect attestation performance - we expected 16 attestations, 16 were included, the distance had a minimum of 1, average of 1.00 and maximum of 1 (the distance numbers are min / avg / max in the output) and 100% of attestations had the correct target and head. One thing to note is that attestation performance is reported 2 epochs after the attestations are produced to give them time to actually be included on chain. The epoch reported in this line tells you which epoch the attestations being reported on are from.
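If you want to track these summaries programmatically, a line like the one above can be pulled apart with a small script. This is just a sketch, not official Teku tooling, and the exact log format may change between versions:

```python
import re

# Matches the per-epoch attestation performance summary line.
PATTERN = re.compile(
    r"Attestation performance: epoch (?P<epoch>\d+), "
    r"expected (?P<expected>\d+), produced (?P<produced>\d+), "
    r"included (?P<included>\d+) \((?P<included_pct>\d+)%\), "
    r"distance (?P<dist_min>\d+) / (?P<dist_avg>[\d.]+) / (?P<dist_max>\d+), "
    r"correct target (?P<target>\d+) \(\d+%\), correct head (?P<head>\d+) \(\d+%\)"
)

def parse_performance(line: str):
    """Return the summary values as numbers, or None if the line doesn't match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # Remember: this line describes attestations from `epoch`, but is
    # logged two epochs later, during epoch + 2.
    return {k: (float(v) if "." in v else int(v)) for k, v in m.groupdict().items()}

line = ("Attestation performance: epoch 148933, expected 16, produced 16, "
        "included 16 (100%), distance 1 / 1.00 / 1, "
        "correct target 16 (100%), correct head 16 (100%)")
stats = parse_performance(line)
assert stats["epoch"] == 148933 and stats["dist_avg"] == 1.0
```

From here you could alert whenever `included` drops below `expected`, or track the average distance over time.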
Each of these values are also available as metrics and the Teku Dashboard uses them to create the “Attestation Performance” graph. That provides a good way to quickly see how your validators have performed over time and get a better overview rather than fixating on a single epoch that wasn’t ideal.
Each active validator should produce one attestation per epoch, so the expected value reported should be the same as the number of active validators you’re running. If it’s less than that, you probably haven’t loaded some of your validator keys and they’ll likely be missing all attestations. It’s pretty rare that expected isn’t what we expect though.
If the produced value is less than the expected value then something prevented your node from producing attestations at all. To find out what, you’ll need to scroll back up in your validator client logs to the epoch this performance report is for - remember that it will be 2 epochs ago. We’re looking for a log that shows the result of the attestation duty. When the attestation is published successfully it will show something like:
Validator *** Published attestation Count: 176, Slot: 3963003, Root: b4ca6d61be7f54f7ccc6055d0f37f122943e8313dbcfe49513c9d4ef50bbc870
The Count field is the number of local validators that produced this attestation (this example is from our Görli testnet node - sadly we don’t have that many real-money validators).
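To know how far back to scroll, the epoch in the performance report maps directly to a range of slots, and each log line above includes its slot. A small helper, assuming the mainnet constant of 32 slots per epoch:

```python
SLOTS_PER_EPOCH = 32  # mainnet constant; the same on most testnets

def slots_for_epoch(epoch: int):
    """Return the (first, last) slot of the given epoch - the slot range
    to search the validator client logs for when investigating a report."""
    first = epoch * SLOTS_PER_EPOCH
    return first, first + SLOTS_PER_EPOCH - 1

# The epoch 148933 report above covers these slots:
assert slots_for_epoch(148933) == (4765856, 4765887)
```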
When an attestation fails to be produced the log will show something like:
Validator *** Failed to produce attestation Slot: 4726848 Validator: d278fc2 java.lang.IllegalArgumentException: Cannot create attestation for future slot. Requested 4726848 but current slot is 4726847 at tech.pegasys.teku.validator.coordinator.ValidatorApiHandler.createAttestationData(ValidatorApiHandler.java:324) at jdk.internal.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at tech.pegasys.teku.infrastructure.events.DirectEventDeliverer.executeMethod(DirectEventDeliverer.java:74) at tech.pegasys.teku.infrastructure.events.DirectEventDeliverer.deliverToWithResponse(DirectEventDeliverer.java:67) at tech.pegasys.teku.infrastructure.events.AsyncEventDeliverer.lambda$deliverToWithResponse$1(AsyncEventDeliverer.java:80) at tech.pegasys.teku.infrastructure.events.AsyncEventDeliverer$QueueReader.deliverNextEvent(AsyncEventDeliverer.java:125) at tech.pegasys.teku.infrastructure.events.AsyncEventDeliverer$QueueReader.run(AsyncEventDeliverer.java:116) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:833)
The specific reason the attestation failed can vary a lot. In this case the beacon node wasn’t keeping up for some reason, which would require further investigation into Teku and its performance. One common source of failures is the beacon node or execution client being out of sync at the time, which appears as a 503 response code from the beacon node when using the external validator client.
We can look at the “Produced” line on the “Attestation Performance” graph of the standard Teku dashboard to see the percentage of expected attestations that were produced over time.
If the attestation was produced, the next thing to check is that it was actually produced on time. If you find the Published attestation log line, you can compare the timestamp of that log message to the time the attestation’s slot started. You can use Slot Finder to find the start time of the slot. Attestations are due to be published 4 seconds into the slot. Anywhere from the start of the slot up to about 4.5 seconds after is fine.
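The slot-to-time conversion Slot Finder performs is simple enough to do yourself. This sketch assumes mainnet constants (genesis at Unix time 1606824023, 12-second slots); other networks have different genesis times:

```python
from datetime import datetime, timezone

GENESIS_TIME = 1606824023    # mainnet genesis: 2020-12-01 12:00:23 UTC
SECONDS_PER_SLOT = 12
ATTESTATION_DUE_SECONDS = 4  # attestations are due 4 seconds into the slot

def slot_times(slot: int):
    """Return (slot start, attestation deadline) as Unix timestamps."""
    start = GENESIS_TIME + slot * SECONDS_PER_SLOT
    return start, start + ATTESTATION_DUE_SECONDS

start, due = slot_times(4765916)
print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat())
```

Compare the `Published attestation` log timestamp against the deadline this returns for its slot.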
You can also use the
validator_attestation_publication_delay metric to track publication times. The Teku Detailed dashboard includes graphs of this under the
Validator Timings section.
Remember that neither logs nor metrics can identify when your system clock is incorrect, because the timings they’re using are from the system clock too. Make sure you’re running ntpd or chrony and that they report the clock as in sync.
Correct Head Vote
If the attestation was published on time, we need to start checking whether it matched what the majority of other nodes produced. There isn’t a simple way to do this directly, but generally if the head block our attestation votes for turns out to be correct, we will almost certainly have agreed with the majority of other validators. The correct head 16 (100%) part of the attestation performance line shows how many attestations produced had the right head block. If that’s at 100% and the attestations were all published on time, there isn’t really much more your node can do.
Having some attestations with incorrect head votes may mean your node is too slow importing blocks. Note though that block producers are sometimes slow in publishing a block. These late blocks sometimes mean that the majority of validators get the head vote “wrong”, so it’s not necessarily a problem with your node when head votes aren’t at 100%. Even if it is your node that’s slow, we need to work out if the problem is in the beacon node or the execution client. Block timing logs can help us with that.
To dig deeper we need to enable some extra timing metrics in Teku by adding the
--Xmetrics-block-timing-tracking-enabled option. This does two things. Firstly, when a block finishes importing more than 4 seconds into a slot (after attestations are due), Teku will now log a Late Block Import line which includes a breakdown of the time taken at each stage of processing the block (albeit very Teku-developer oriented). Secondly, it enables the beacon_block_import_delay_counter metric which exposes that breakdown as metrics. Generally, for any slot where the head vote is incorrect, there will be a late block import that caused it. We just need to work out what caused the delay.
An example late block log looks like:
Late Block Import *** Block: c2b911533a8f8d5e699d1a334e0576d2b9aa4caa726bde8b827548b579b47c68 (4765916) proposer 6230 arrival 3475ms, pre-state_retrieved +5ms, processed +185ms, execution_payload_result_received +1436ms, begin_importing +0ms, transaction_prepared +0ms, transaction_committed +0ms, completed +21ms
The first potential source of delay is that the block just didn’t get to us in time. The
arrival timing shows how much time after the start of the slot the block was first received by your node. In the example above, that was 3475ms which is quite slow, but did get to us before we needed to create an attestation 4 seconds into the slot. Delays in arrival are almost always caused by the block producer being slow to produce the block. It is however possible that the block was published on time but took a long time to be gossiped to your node. If you’re seeing late arrival for most blocks, there’s likely an issue with your node - either the system clock is wrong, your network is having issues or you may have reduced the number of peers too far.
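Reading the stage timings takes a little care: arrival is measured from the start of the slot, while the + values are each a delta from the previous stage. A small sketch (assuming that format holds in your Teku version) that converts them into a cumulative timeline:

```python
import re

# Each stage appears as "<name> <ms>ms" or "<name> +<ms>ms" in the log line.
STAGE = re.compile(r"([\w-]+) \+?(\d+)ms")

def import_timeline(line: str):
    """Return {stage: ms since slot start} by accumulating the deltas."""
    elapsed, timeline = 0, {}
    for name, ms in STAGE.findall(line):
        elapsed += int(ms)
        timeline[name] = elapsed
    return timeline

line = ("Late Block Import *** Block: c2b911533a8f8d5e699d1a334e0576d2"
        "b9aa4caa726bde8b827548b579b47c68 (4765916) proposer 6230 "
        "arrival 3475ms, pre-state_retrieved +5ms, processed +185ms, "
        "execution_payload_result_received +1436ms, begin_importing +0ms, "
        "transaction_prepared +0ms, transaction_committed +0ms, completed +21ms")
timeline = import_timeline(line)
assert timeline["arrival"] == 3475
assert timeline["completed"] == 3475 + 5 + 185 + 1436 + 21  # 5122ms into the slot
```

Here the import only completed 5122ms into the slot, past the 4000ms attestation deadline, which is exactly why the line was logged.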
Execution Client Processing
Post-merge, importing a block involves both the consensus and execution clients. The time Teku spends waiting for the execution client to finish processing the block is reported in the
execution_payload_result_received value. In this case it was 1436ms, which would have been OK if the block hadn’t been received so late, but still isn’t ideal. Under 2 seconds is probably fine most of the time, but under 1 second would be better. Execution clients will keep working on optimisations to reduce this time, so it’s worth keeping up to date with the latest version of your client.
Note that prior to Teku 22.9.1 this entry didn’t exist and the execution client time was just counted as part of the processed value.
The other values are all various aspects of the processing Teku needs to do. pre-state_retrieved and processed are part of applying the state transition when processing the block. begin_importing, transaction_prepared and transaction_committed record the time taken in various parts of storing the new block to disk. Finally completed reports the final details of things like updating the fork choice records and so on.
Prior to Teku 22.9.1, the transaction_committed stage was a common source of delays as it updated the actual LevelDB database on disk. The disk update is now asynchronous, so unless the disk is exceptionally slow this value is generally only 0 or 1ms.
All these metrics let us get an understanding of where time was spent or where failures occurred. If your node is processing blocks quickly, publishing attestations on time and the system clock is accurate, there’s probably very little you can do to improve things - the occasional delayed or missed attestation isn’t unusual and isn’t really worth worrying about.
Otherwise these metrics and logs should give a fairly clear indication of which component is causing problems, so you can focus investigations there and get help as needed.