Report on September 13, 2024 Operation Incident
Introduction
During block creation, validators (or more precisely, collators) check various limits after each step. These limits include the maximum gas used, the lt (logical time) range, the generation time, and the size of the block. This ensures that blocks remain easy for other validators to relay, validate, and apply.
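As a rough illustration of this per-step checking (a simplified sketch with hypothetical names and fields, not the actual collator code), the decision to keep adding operations to a candidate block can be thought of as:

    // Hypothetical sketch of per-step limit checking during collation.
    // Structure, names, and fields are illustrative only.
    struct BlockLimits {
      long long max_gas;        // total gas allowed in the block
      long long max_lt_delta;   // allowed logical-time (lt) range
      double    max_seconds;    // wall-clock budget for collation
      long long max_size_bytes; // limit on the estimated serialized size
    };

    struct BlockState {
      long long gas_used = 0;
      long long lt_delta = 0;
      double    elapsed_seconds = 0;
      long long estimated_size = 0; // must never be below the real size
    };

    // Returns true if the collator may append one more operation.
    bool within_limits(const BlockState& st, const BlockLimits& lim) {
      return st.gas_used        <= lim.max_gas
          && st.lt_delta        <= lim.max_lt_delta
          && st.elapsed_seconds <= lim.max_seconds
          && st.estimated_size  <= lim.max_size_bytes;
    }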
However, due to cell deduplication, calculating the exact size of a block is challenging. Instead, an estimation of the size is used, which should always be equal to or greater than the actual size of the block.
It was found that the estimation function contained a bug that underestimated the size of the block.
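To illustrate why overestimation is the safe direction, consider a simplified model of block serialization (cell layout, hashing, and traversal are far more involved in the real serializer; the code below is only a sketch). Because identical cells are stored once, the exact size requires deduplication, while a naive count that treats every reference as a new cell always yields an upper bound:

    #include <cstdint>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Simplified cell: a payload plus references to child cells.
    struct Cell {
      std::string data;              // serialized payload of this cell
      std::vector<const Cell*> refs; // child cells (may be shared)
      std::string hash;              // identifier of the cell contents
    };

    // Exact size: each distinct cell is stored once thanks to deduplication.
    uint64_t exact_size(const Cell* root, std::unordered_set<std::string>& seen) {
      if (!seen.insert(root->hash).second) return 0; // already counted
      uint64_t total = root->data.size();
      for (const Cell* c : root->refs) total += exact_size(c, seen);
      return total;
    }

    // Conservative estimate: count every reference as if it were a new cell.
    // This never undercounts, so estimate >= exact size always holds.
    uint64_t estimated_size(const Cell* root) {
      uint64_t total = root->data.size();
      for (const Cell* c : root->refs) total += estimated_size(c);
      return total;
    }

An estimator built along these lines can only err on the side of producing smaller blocks; the bug in question broke this invariant in the opposite direction.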
Chronology
On September 13 at 17:15 UTC, during the creation of block (0,9000000000000000,45640676), validators of the network created block candidates with a size of 2.3 MB. Since this exceeded the hard limit for a block (2 MB), these candidates were rejected by the network. As a result, shard 0:9000000000000000 stalled.
During the investigation, it was revealed that the issue was related to overly large blocks. The team began working in two parallel directions: first, searching for incorrect estimations, and second, finding a workaround for block generation even under size underestimation.
About an hour later, due to the collation time limit, validators included fewer operations in the block and successfully created a candidate under the limit. At 18:27 UTC, block (0,9000000000000000,45640676) was accepted by the network, and the shard resumed work.
Around that time, a special patch that decreased the block size limit during collation by half was issued and deployed to a few validators. The idea was that if the issue with block generation recurred, some validators, due to stricter limits, would be able to generate a block even with incorrect estimates.
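The intent of that workaround can be sketched as follows (illustrative only; the actual patch adjusted the collator's configured limits rather than adding new code like this):

    // Illustrative sketch of the workaround: collate against a tightened
    // soft limit so that even a 2x size underestimation would still yield
    // a candidate below the 2 MB hard limit enforced by validators.
    constexpr long long kHardLimitBytes = 2ll * 1024 * 1024; // network hard limit
    constexpr long long kSafetyFactor   = 2;                 // assumed worst-case error

    constexpr long long kSoftLimitBytes = kHardLimitBytes / kSafetyFactor;

    bool can_add_more(long long estimated_block_size) {
      // Stop adding operations once the (possibly underestimated) size
      // reaches the tightened soft limit.
      return estimated_block_size < kSoftLimitBytes;
    }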
Around 22:31 UTC, the issue indeed recurred and turned out to be more severe: even with the block limits halved, blocks created by validators reached 3.4 MB. Further limit reductions (including lower limits on block size and lt) didn't help. Shard 0:8800000000000000 stalled for about 11 hours. During this period, four more blocks were created in that shard, with large delays between them.
When shard 0:8800000000000000 stalled, the neighboring shards continued to work normally, processing transactions and creating new messages. However, all outgoing messages to the stalled shard were queued, since they could not be processed. This meant that all chains of transactions passing through the problematic shard were delayed. Since contract addresses are distributed roughly at random across shards, the longer the chain, the higher the chance that it would pass through the stalled shard. As a result, many simple operations such as sending TON or jettons worked normally, while more complex interactions such as DEX swaps and lending transactions stopped.
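A back-of-the-envelope model shows why longer chains were more exposed. Assuming addresses are spread roughly uniformly across n shards (the shard count below is purely illustrative and not the actual configuration at the time), the probability that a chain of a given length touches one particular stalled shard grows quickly with the chain length:

    #include <cmath>
    #include <cstdio>

    // If each hop in a transaction chain lands in a shard chosen roughly
    // uniformly out of n_shards, the probability that a chain of `hops`
    // hops touches one particular (stalled) shard is 1 - (1 - 1/n_shards)^hops.
    double stall_probability(int n_shards, int hops) {
      return 1.0 - std::pow(1.0 - 1.0 / n_shards, hops);
    }

    int main() {
      // Example with 16 shards: a 1-hop transfer is affected ~6% of the time,
      // while a 6-hop route (e.g., a multi-step swap) is affected ~32% of the time.
      std::printf("1 hop:  %.2f\n", stall_probability(16, 1));
      std::printf("6 hops: %.2f\n", stall_probability(16, 6));
      return 0;
    }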
Around 10:00 UTC on September 14, the estimation issue, related to underestimating the size of state updates when messages are moved between queues, was found and patched. An updated version was issued. Once updated validators entered the shard's validator set, at 10:24 UTC, shard 0:8800000000000000 resumed work. At that moment, the queue of incoming messages to the shard contained about 150,000 messages; they were processed over the next hour, and the network returned to normal operation.
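Conceptually (a hypothetical sketch based only on the description above, not the actual fix), the missing contribution can be pictured as an extra term in the size estimate: moving a message between queues rewrites parts of the message queues in the shard state, and the new cells of that state update also end up in the block.

    // Hypothetical sketch: the block size estimate must include not only the
    // cells produced by transactions, but also the cells created when the
    // shard state's message queues are updated. Names are illustrative.
    struct SizeEstimator {
      long long estimated_bytes = 0;

      void account_transaction(long long tx_cells_bytes) {
        estimated_bytes += tx_cells_bytes;
      }

      // Previously undercounted contribution: enqueueing/dequeuing a message
      // rewrites parts of the message queues, and those new cells are part
      // of the block's state update.
      void account_queue_move(long long queue_update_bytes) {
        estimated_bytes += queue_update_bytes;
      }
    };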
Further Investigation
While the issue appears to be fully resolved, there are still open questions that need to be answered. The incorrect estimation of block size was triggered by a highload_v3 wallet sending batch jetton transfers. Such operations are quite routine and have been processed many times a day over the past six months. Additionally, operation under huge queues (millions of messages) was thoroughly tested in private networks, and these conditions were also replicated on mainnet. Thus, we are investigating why the issue did not occur earlier, to better understand the conditions that led to it.
P.S. We are very grateful to our colleagues from the Tonwhales and Tonstakers teams, who worked with us throughout the night and provided invaluable assistance with logs, monitoring, and update testing.