Sui Validator Migration Procedure
In early 2024, code updates landed in Sui that restrict when and how validators can be migrated to different servers. Many Sui node operators are not aware of this change and may not be prepared for a scenario that requires them to migrate their validator. This article provides Sui operators with a set of procedures to follow in the event that they need to migrate a Sui validator.
What Changed?
We are not exactly sure when this specific code change was made or which version it shipped in. We simply happened to be one of the first validators to run into the restriction during a migration, so we know the change landed some time in Q1 2024.
The code change added restrictions to prevent transaction signature equivocation (otherwise known as double signing). In other words, the old (pre-migration) validator node cannot participate in consensus within the same epoch as the new (post-migration) validator node.
Theoretically, a node operator can dodge these restrictions by copying the entire db folder from the old validator (including authorities_db and consensus_db) onto the new validator.
In reality, this is extremely difficult, if not impossible. The DB folder can be very large, and transferring it to the new node takes a significant amount of time. Depending on how long the transfer takes, the data may no longer be consistent with what is on chain, and the validator will fail to sync. For reference, we ran migration tests on our 50 Gbps LAN and found DB state migration to have mixed results.
It is our assessment that most node operators run in cloud providers, so they would likely be migrating DB state over the internet at speeds between 1 Gbps and 10 Gbps, which makes the problem worse.
Planned Migration Procedure
To minimize downtime during a planned migration, stop both the validator and the hot spare exactly when the epoch rolls over. The best way to do this is to run both servers' sui-node processes with the flag --run-with-range-epoch <Epoch>.
For example, if you use epoch 360, both servers will stop at the end of epoch 360. Be sure your service is not set to restart. Then start the hot spare with the validator keys, config, and DNS, and it should start signing. To minimize downtime further, you can automate this process with a script that checks for when the service stops and then reconfigures the hot spare as the validator.
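The stop-then-promote handoff can be scripted along these lines. This is a rough sketch: the service name, file paths, and promotion steps are assumptions for illustration, and the polling helper takes its status command as a parameter so the logic is easy to test:

```shell
# Sketch: poll until the sui-node service stops (because it was started with
# --run-with-range-epoch <Epoch> and is NOT set to auto-restart), then promote
# the hot spare to validator duty.

wait_until_stopped() {
  # $1: a command that prints "active" while the node is still running,
  #     e.g. "systemctl is-active sui-node"
  while [ "$($1)" = "active" ]; do
    sleep 10
  done
}

promote_hot_spare() {
  # Hypothetical promotion steps: swap in the validator config and restart.
  cp /opt/sui/standby/validator.yaml /opt/sui/config/validator.yaml
  sudo systemctl restart sui-node
  # Remember to repoint the validator's DNS record at this host as well.
}

# Usage (on the hot spare):
#   wait_until_stopped "systemctl is-active sui-node"
#   promote_hot_spare
```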
Unplanned Migration Procedure
If the validator crashed within epoch N, it's recommended to wait until epoch N+1 before rejoining from another server that doesn't have the validator database.
If you can access the validator's drive and copy the database to the hot spare, you should be able to join the current epoch, assuming the database wasn't corrupted by the hardware failure. But since you are experiencing a catastrophic outage, you probably can't access that drive anyway. 🫠
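One way to enforce the "wait for epoch N+1" rule is to compare the crash epoch against the chain's current epoch before bringing the replacement up. A sketch follows; the JSON-RPC method name is taken from Sui's API docs (verify it against your node version), and the fullnode URL is a placeholder:

```shell
current_epoch() {
  # $1: a Sui fullnode RPC URL, e.g. https://fullnode.mainnet.sui.io:443
  curl -s "$1" -X POST -H 'Content-Type: application/json' \
    -d '{"jsonrpc":"2.0","id":1,"method":"suix_getLatestSuiSystemState","params":[]}' \
    | jq -r '.result.epoch'
}

safe_to_rejoin() {
  # $1: the epoch the old validator crashed in, $2: the current epoch
  [ "$2" -gt "$1" ]
}

# Usage (hypothetical):
#   if safe_to_rejoin 360 "$(current_epoch https://fullnode.mainnet.sui.io:443)"; then
#     sudo systemctl start sui-node
#   fi
```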
Rollover Crash Edge Case
If the validator crashed in the middle of an epoch rollover (this is unlikely, but it happened to us once), it may be possible for the hot spare to join the current epoch. If you are able to join the current epoch, the validator should look fine and begin creating certificates again.
Watch closely during the next epoch, as your validator may crash again. Most likely there is some residual corruption in consensus state, since the validator was unable to complete its end-of-epoch consensus tasks when it crashed mid-rollover the first time.
That corruption will cause your validator to fail again at the next epoch boundary. Delete the database (including consensus), download another snapshot, and start the node back up; it should be stable going forward. At least three other node operators have reported this problem, and in each case the node was fine after the second restore.
Here is a link to Sui’s snapshot documentation:
Database Snapshots | Sui Documentation
Remember: if you crash on the rollover, there is a chance you can get a hot spare to participate in that new epoch. It will probably crash on the following epoch due to corrupted data, so be prepared to download a snapshot. It's not ideal, but it is better than being down for an entire epoch.
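The wipe-and-restore step might look like the following sketch. The snapshot download command and its flags come from Sui's snapshot documentation linked above, so double-check them against your release; all paths are hypothetical:

```shell
reset_db() {
  # $1: the sui db root containing authorities_db and consensus_db
  rm -rf "$1/authorities_db" "$1/consensus_db"
}

# Usage (hypothetical paths; verify sui-tool flags against the linked docs):
#   sudo systemctl stop sui-node
#   reset_db /opt/sui/db
#   sui-tool download-db-snapshot --latest --network mainnet \
#     --path /opt/sui/db/authorities_db --no-sign-request
#   sudo systemctl start sui-node
```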
Test Results
The following is a list of validator migration tests that we have run, the steps taken, and results. Note that a key assumption is that if the migration is unplanned, the data is not accessible on the node due to some catastrophic hardware or network issue.
Leave a comment if you have run your own tests and have different results.
Hot Spare Migration
To upgrade a hot spare node to a validator, copy the keys and validator.yaml to the hot spare node, update the hot spare node's binary service, update the validator DNS to point to the hot spare's IP, and restart the hot spare in validator mode.
Pay close attention to the timing of when you perform the hot spare migration. Refer to the Planned and Unplanned Migration processes above for details on timing.
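Before restarting the spare in validator mode, it is worth sanity-checking that every piece of key material and config actually made it over. A small sketch; the directory and file names are assumptions based on a typical Sui validator layout:

```shell
required_files_present() {
  # $1: config directory; remaining args: file names that must exist there
  dir=$1; shift
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
}

# Usage (hypothetical directory and file names):
#   required_files_present /opt/sui/config \
#     validator.yaml protocol.key worker.key network.key \
#     && sudo systemctl restart sui-node
```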