Executing update statements on a production database is always a big challenge. I have seen three different prominent incidents due to update problems. In each one of the incidents, multiple factors lead to the incident. We could have avoided these problems by a couple of preventive actions.
Before Running Update Statements
One of the causes for update statement incidents is overconfidence or lack of meaningful process. Engineers can take it for granted that they figured out how to do the update. Nevertheless, updates may work 99 times. At one last time, it might fail and fail miserably. Hence, it’s better to be prepared and put together a process to avoid such nightmares. Here is a gentle procedure.
- Send a code review request for every update statement/script.
- We can avoid inefficiencies by getting a code review.
- A second eye can catch an obvious mistake.
- Get a database dump and run the query on a test database.
- Sample the database when the tables are way too big.
- You can correct problems further by running against the test database.
While Running Update Statements
There are a couple of things that we could do to avoid such update problems. Here are some rules I use for production databases.
- Run a select query on the update and confirm the counts as expected before running the original update. There are three reasons for this.
- The number of updates can take down the database because of row-level lock. If the database has multiple replicas, it becomes even more necessary.
- The select query gives a sense of running time for the update query. If select is slow, it would be an order of magnitude slower for the update.
- Lastly, see if the update query meets our expectations.
- Always run an update query with transactions on a production database. Start with BEGIN and end with a COMMIT.
- After running the update query in a transaction, put a couple of validation queries. ROLLBACK if it’s needed.
- Pair up with another engineer while running the critical update statements.
Proactive Actions on Databases
We can expect engineers to do the right thing. Nevertheless, shit happens. One might forget a step or get distracted. We need to be ready to deal with such scenarios. Here are the things we could do.
- Keep a backup.
- You can get 5 minutes backups to daily backups depending on the database.
- Get a backup of a backup. Write the contents to some file system or object storage.
- Store backups in separate accounts. If you use one cloud or on-premise database, use a different cloud provider for a secondary backup to avoid downtime issues.
- Prepare a runbook.
- Have steps documented for database updates.
- Play runbooks every once in a while to test if the steps in the runbook still work.