Synchronization Instability Report

Core

Summary

On April 9, the network switched to new consensus parameters with faster block generation and reached the expected target metrics.

Between April 13 and April 23, three partially overlapping issues were observed: a roughly 4% reduction in masterchain block production rate, elevated lite-server error 651 responses, and synchronization lag in the public overlay that, at peak, exceeded 10 minutes. These issues had clear technical impact, but user-visible impact remained limited. Wallets, explorers, and other services continued to operate normally from the end-user perspective and broadly preserved their usual response times, although internal retry volumes increased.

The investigation showed that these symptoms did not come from a single root cause. The masterchain block-rate reduction was caused by stricter per-IP flood control interacting with common validator deployment patterns. Error 651 was caused by a combination of stale-detection logic that was no longer calibrated for sub-second block intervals and genuine lag caused by degraded block delivery in the public overlay. The synchronization lag itself was primarily caused by degradation of public-overlay broadcasts due to a rapid influx of non-functional peers, and was further amplified by some alternative TON node implementations that did not forward block copies to every overlay they were expected to serve.

By the evening of April 23, synchronization was largely restored. Masterchain block production also returned close to nominal values. The incident led to software fixes, operator outreach, and temporary network-level mitigations. It also reinforced an architectural conclusion: the public overlay should be treated as a best-effort distribution layer, while workloads that require reliable low-latency delivery should use fixed-membership custom/private overlays.

Timeline

April 9

The network switched to new consensus parameters with faster block generation and reached the target block-production metrics.

April 13

The masterchain block production rate dropped by approximately 4%. Investigation started.

April 20, midday

Problems with block delivery in the public overlay became visible. Reports of error 651 increased, and synchronization lag grew to more than 10 minutes in some cases.

April 20, evening

An update was released to address the local conditions that were triggering error 651.

April 21

The masterchain block production rate began to recover toward nominal values.

April 22

Synchronization lag decreased to roughly 2 to 5 minutes.

April 23, evening

Synchronization was largely restored. By this point, the masterchain block production rate was almost fully recovered.

User Impact

At the UX level, the incident had limited visible impact. Wallets, explorers, and downstream services continued to work and broadly maintained normal response times. The main effect was internal: services had to perform more retries and absorb more variability in block arrival and synchronization.

Technical Findings

1. Masterchain block-rate reduction

The approximately 4% reduction in masterchain block production rate was caused by the interaction between the April 9 update and real-world validator deployment patterns.

The April 9 update introduced stricter flood-control rules, including tighter limits on the number of connections allowed from a single IP address. At the same time, two deployment patterns are common among operators. Some run multiple validators and route part of the traffic between them through service or internal addresses. Others place multiple validators behind NAT, so that several validators share one external IP. Growth in validator participation increased the number of deployments with these characteristics.

Under the new flood-control rules, these deployment patterns could trigger mutual rate-limiting and cross-bans between otherwise valid nodes. This reduced effective connectivity and lowered the masterchain block production rate.
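
The effect described above can be illustrated with a minimal sketch. It assumes a simple cap on concurrent connections per source IP. The class name, the cap value, and the addresses (taken from documentation ranges) are hypothetical and do not reflect the actual flood-control implementation.

```python
from collections import defaultdict


class PerIpLimiter:
    """Hypothetical per-IP connection cap (not the real node code)."""

    def __init__(self, max_per_ip):
        self.max_per_ip = max_per_ip
        self.active = defaultdict(int)  # source IP -> open connections

    def try_connect(self, source_ip):
        # Reject once the source IP has reached its cap.
        if self.active[source_ip] >= self.max_per_ip:
            return False
        self.active[source_ip] += 1
        return True


def accepted_validators(limiter, validators):
    """validators: list of (name, external_ip). Returns accepted names."""
    return [name for name, ip in validators if limiter.try_connect(ip)]


# Four validators share one NAT address; one has its own address.
validators = [
    ("v1", "203.0.113.10"),
    ("v2", "203.0.113.10"),
    ("v3", "203.0.113.10"),
    ("v4", "203.0.113.10"),
    ("v5", "198.51.100.7"),
]
accepted = accepted_validators(PerIpLimiter(max_per_ip=2), validators)
print(accepted)
```

In this toy setup, two of the four validators behind the shared address are refused even though all of them are legitimate, which is the kind of lost connectivity the report attributes to shared-IP deployments.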

Between April 21 and April 23, most affected validators were notified. Large validators were asked to upgrade to a version that takes these deployment patterns into account. This was the primary driver of recovery in block production.

2. Error 651 on lite servers

Error 651 had two distinct causes.

First, the local logic used to decide whether a lite server was too far behind was tuned for historical block intervals in the 2.5 to 5 second range. That logic relied on the average time between blocks. After block generation time dropped to roughly 400 ms, the same thresholds became too aggressive: a node that was only hundreds of milliseconds behind the network could classify itself as stale and start returning error 651. The update released on the evening of April 20 corrected this behavior.
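
A minimal sketch of how such a check can become too sensitive is shown below. It assumes the tolerance was effectively expressed as a number of average block intervals. That assumption, the function names, and all numeric values are illustrative only. The second function is one possible recalibration (a wall-clock floor), not a description of the fix that was actually shipped.

```python
def is_stale_by_block_count(lag_seconds, avg_block_interval,
                            max_blocks_behind=2):
    """Assumed legacy-style check: tolerance scales with block interval."""
    return lag_seconds / avg_block_interval > max_blocks_behind


def is_stale_with_floor(lag_seconds, avg_block_interval,
                        max_blocks_behind=2, min_tolerance_seconds=5.0):
    """Hypothetical recalibration: never tolerate less than a fixed
    wall-clock lag, however short the block interval becomes."""
    tolerance = max(max_blocks_behind * avg_block_interval,
                    min_tolerance_seconds)
    return lag_seconds > tolerance


lag = 0.9  # node is 900 ms behind the network

old_interval = is_stale_by_block_count(lag, avg_block_interval=2.5)
new_interval = is_stale_by_block_count(lag, avg_block_interval=0.4)
with_floor = is_stale_with_floor(lag, avg_block_interval=0.4)

print(old_interval, new_interval, with_floor)
```

With the same 900 ms lag, the block-count check passes at a 2.5 s interval but reports the node as stale at a 400 ms interval, while the variant with a wall-clock floor does not.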

Second, some lite servers were genuinely behind because block delivery through the public overlay had degraded. In other words, error 651 was not a single failure mode. It increased both because local stale-detection became too sensitive and because some nodes were in fact lagging.

3. Synchronization lag and public-overlay degradation

The synchronization issues were caused primarily by degradation of public-overlay block broadcasts.

Around midday on April 20, the public overlay began to accumulate a large number of non-functional peers, with new non-functional peers appearing several times per minute. The most likely explanation is an external-message interception service that was attempting to maximize message intake by continuously spawning additional peers. A plausible motivation for such a service is latency-sensitive external-message capture ("catching cheap NFT sales first"). Because those nodes were not participating correctly in overlay operations, they were quickly banned, and new ones appeared in their place.

This attribution is a working hypothesis and has not been formally proven. The degradation mechanism itself, however, was directly observable: overlays contained hundreds of dead peers, effective connectivity in the public overlay dropped, blocks reached nodes less reliably via broadcast, and nodes had to fall back to explicit block downloads more often. That fallback path also became less efficient, for two reasons: many of the peers queried for missing blocks were themselves non-functional, and downloading blocks one at a time is inherently slow when the network produces roughly 2.5 blocks per second. The combined effect was a significant increase in synchronization lag.
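
The fallback bottleneck can be made concrete with a rough model. The sketch below assumes that blocks are fetched strictly one at a time, that each attempt picks a peer at random, that a live peer answers after a fixed round-trip time, and that a dead peer costs a fixed timeout. The round-trip time, timeout, and dead-peer fractions are hypothetical. Only the 2.5 blocks per second production rate comes from the report.

```python
def sequential_fetch_rate(rtt_seconds, dead_fraction, timeout_seconds):
    """Expected blocks per second when fetching one block at a time.

    The expected number of failed attempts before a success is
    dead_fraction / (1 - dead_fraction); each failure costs one timeout.
    """
    if not 0.0 <= dead_fraction < 1.0:
        raise ValueError("dead_fraction must be in [0, 1)")
    expected_failures = dead_fraction / (1.0 - dead_fraction)
    expected_time = rtt_seconds + expected_failures * timeout_seconds
    return 1.0 / expected_time


PRODUCTION_RATE = 2.5  # blocks per second, from the report

healthy = sequential_fetch_rate(0.2, dead_fraction=0.0, timeout_seconds=1.0)
degraded = sequential_fetch_rate(0.2, dead_fraction=0.5, timeout_seconds=1.0)

for label, rate in (("healthy", healthy), ("degraded", degraded)):
    trend = "catches up" if rate > PRODUCTION_RATE else "falls further behind"
    print(f"{label}: {rate:.2f} blocks/s, node {trend}")
```

Under these assumed numbers, a node with only live peers fetches faster than the network produces blocks, while a node whose peer set is half dead fetches well below 2.5 blocks per second, so its lag keeps growing until broadcast delivery recovers.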

4. Incomplete block forwarding in some alternative node implementations

The investigation also found that some alternative TON node implementations were not sending block copies to all overlays where they were expected to do so. This reduced dissemination redundancy and lowered the number of nodes that received each block through broadcast. While this was not the sole cause of the incident, it made block propagation less resilient during the period of public-overlay degradation.

Mitigation Steps

A two-layer mitigation approach was used to address the public-overlay delivery problem.

On April 22, additional nodes were deployed to inject redundant broadcasts into the network. This materially increased the probability that at least one broadcast would reach a given lite server.
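
The benefit of redundant broadcasts can be estimated with a simple calculation. The sketch below assumes that each broadcast reaches a given lite server independently with the same probability, which is a simplification. The probability value used is illustrative, not a measurement from the incident.

```python
def delivery_probability(per_broadcast_probability, broadcasts):
    """Probability that at least one of several independent broadcasts
    reaches a given node: 1 - (1 - p) ** k."""
    return 1.0 - (1.0 - per_broadcast_probability) ** broadcasts


p = 0.5  # assumed chance that a single broadcast arrives
single = delivery_probability(p, 1)
redundant = delivery_probability(p, 4)

print(single, redundant)
```

Under the independence assumption, the probability of missing a block falls geometrically with the number of redundant broadcasts, which is why a modest number of extra injection nodes can have a visible effect.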

On April 23, additional rebroadcast nodes were deployed to send blocks directly to hundreds of known well-behaving lite servers. This reduced synchronization lag for honest nodes to near zero in most cases.

In parallel, the stale-detection fix released on April 20 reduced false 651 errors, and outreach to affected validators between April 21 and April 23 improved network connectivity and restored masterchain block production.

Current Status

By the evening of April 23, synchronization was largely restored. Masterchain block production had returned close to nominal values, and lite-server lag had dropped substantially. The network remained operational throughout the incident, and end-user services continued to function.

Long-Term Operating Model

This incident exposed two different classes of constraints, and they require different responses.

The first class is software and deployment sensitivity. This includes flood-control interaction with shared-IP validator setups and stale-detection thresholds that were no longer calibrated for much faster block production. These issues are addressable through code changes, version upgrades, and operator guidance.

The second class is structural. The public overlay is an open-membership system. If an unlimited number of nodes can join without admission control, then an unlimited number of dishonest or non-functional nodes can also join. Protocol and implementation improvements can increase resilience and raise the cost of abuse, but they cannot turn an open overlay into a hard-guarantee delivery channel under unbounded Sybil pressure.

For that reason, the public overlay should be treated as a best-effort distribution layer. Operators who need reliable low-latency delivery for private lite servers should connect those servers through custom/private overlays with one or more validators. In such overlays, membership is fixed and Sybil-style admission attacks are structurally excluded.
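
The structural difference between the two membership models can be shown with a toy sketch. The classes, peer identifiers, and counts below are hypothetical and are not the custom overlay tooling or its configuration format. They only illustrate why a fixed member list cannot be diluted by newly created peers.

```python
class OpenOverlay:
    """Toy open-membership overlay: any peer may join."""

    def __init__(self):
        self.peers = set()

    def join(self, peer_id):
        self.peers.add(peer_id)
        return True


class FixedOverlay:
    """Toy fixed-membership overlay: only pre-agreed members may join."""

    def __init__(self, members):
        self.members = frozenset(members)
        self.peers = set()

    def join(self, peer_id):
        if peer_id not in self.members:
            return False
        self.peers.add(peer_id)
        return True


def functional_fraction(overlay, functional_ids):
    """Share of the overlay's peers that actually relay blocks."""
    if not overlay.peers:
        return 0.0
    return len(overlay.peers & functional_ids) / len(overlay.peers)


honest = {"validator-1", "liteserver-1", "liteserver-2"}
sybils = {f"sybil-{i}" for i in range(97)}

open_overlay = OpenOverlay()
fixed_overlay = FixedOverlay(members=honest)
for peer in honest | sybils:
    open_overlay.join(peer)
    fixed_overlay.join(peer)

open_share = functional_fraction(open_overlay, honest)
fixed_share = functional_fraction(fixed_overlay, honest)
print(open_share, fixed_share)
```

In the open overlay the functional share of peers shrinks as non-functional peers join, while the fixed overlay keeps the same membership regardless of how many outside peers attempt to connect.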

The tooling for custom overlays is already available and documented. Toncenter also plans to provide a managed paid solution that removes the need to arrange validator connectivity directly.

The current centralized rebroadcast solution is a useful interim operational layer and can work well under current conditions. It should not be treated as a permanent protocol-level guarantee. Under sustained deliberate attacks, its effectiveness will degrade, which is why the long-term reliability model remains fixed-membership overlays.

Conclusion

The incident was not the result of a single defect. It was the combined effect of stricter connection control interacting with common validator topologies, stale-detection logic that was no longer appropriate for sub-second block intervals, degradation of the public overlay caused by a rapid influx of non-functional peers, and reduced dissemination redundancy in some alternative node implementations.

The immediate issues have largely been mitigated, and the network has recovered close to nominal behavior. The main operational lesson is straightforward: open public overlays are useful as scalable best-effort distribution layers, but reliable low-latency block delivery for critical private infrastructure should be built on fixed-membership overlays.