{"id":3093,"date":"2020-06-08T14:48:51","date_gmt":"2020-06-08T09:18:51","guid":{"rendered":"https:\/\/rzpwp.blog\/?p=3093"},"modified":"2020-07-20T03:24:28","modified_gmt":"2020-07-19T21:54:28","slug":"day-of-rds-multi-az-failover","status":"publish","type":"post","link":"https:\/\/razorpay.com\/blog\/day-of-rds-multi-az-failover\/","title":{"rendered":"The Day of the RDS Multi-AZ Failover"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">On a fateful Friday evening in December 2019, when a few of us were looking forward to packing our bags and going home, we got an alert from our internal monitoring tool that the system had started throwing unusually high numbers of 5xx errors.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The SRE team quickly realized that one of our main applications (called \u201cAPI\u201d) was not able to connect to its RDS (MySQL) database. By the time we could make any sense of the issue, the application had come back up automatically and the alerts had stopped.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Looking at the RDS logs, we realized that the master instance had gone through a <\/span><i><span style=\"font-weight: 400;\">Multi-AZ failover<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">According to the SLAs provided by AWS, whenever an instance marked as multi-AZ goes through a failure (a network failure, disk failure, etc.), AWS automatically shifts the traffic to its standby running in a separate AZ in the same AWS region. 
The failover can take anywhere between 60 and 120 seconds, which is why our master instance automatically came back up after around 110 seconds and the application started working without any manual intervention.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Replication failure<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The API master instance has a set of 5 replica instances which serve different read workloads across various applications.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3131\" src=\"http:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/one.png\" alt=\"\" width=\"1032\" height=\"438\" srcset=\"https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/one.png 1032w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/one-300x127.png 300w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/one-1024x435.png 1024w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/one-768x326.png 768w\" sizes=\"auto, (max-width: 1032px) 100vw, 1032px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">While the application had stopped throwing errors and started working, we received another set of alerts stating that replication on all the replicas had failed.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">All the replicas displayed a <\/span><i><span style=\"font-weight: 400;\">duplicate-key<\/span><\/i><span style=\"font-weight: 400;\"> error message. We immediately shifted all the traffic going to these replica instances to the master instance so that the application would not serve stale or incorrect data to the users.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The drawback of moving all the traffic to the master was that all the heavier SELECT queries were also moved to the master, and the CPU load on the master instance increased by 50%. 
Hence, our immediate move was to recreate all the 5 replicas so that we could move the replica load back as soon as possible.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The replica creation process internally creates a snapshot of the current data from the master instance and then starts a DB instance from that snapshot. The very first snapshot from a particular machine captures the entire data up to that point, but subsequent snapshots are incremental in nature. In the past, we had noticed that these incremental snapshots take around 15-20 minutes for the API database.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While taking the snapshot, we experienced another reality check: the process was taking much longer than usual. After an hour or so, with the snapshot creation still in progress, we contacted AWS tech support to find out why.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">AWS tech support informed us that since the master instance had gone through a multi-AZ failover, the old master had been replaced with a new machine, which is routine. Since this was the very first snapshot from the new machine, RDS was taking a full snapshot of the data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">So, we had no option but to wait for the snapshot to finish and keep monitoring the master instance in the meantime. 
We waited six hours for the snapshot to complete and only then were we able to create the replicas and redirect the traffic back to them.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once the replicas were in place, we assumed that the worst was over, and finally called it a night.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Data loss<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The next day, on our follow-up calls with RDS tech support, we were told that replication does not usually crash during a multi-AZ master failover, and that there had to be more to the incident than met the eye.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is when we started looking into why the replication had crashed. After matching the database against the application trace logs for the time around the incident, we found that a few records were present in the trace logs but not in the database. We realised that we had lost some data at the time of the failover.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For a fintech company, losing transactional data means losing money and the trust of our customers. We began digging into the binary logs for that time frame and matching them against the data in the store. We finally figured out that the RDS database was missing 5 seconds of data. Right after these 5 seconds, the 5xx errors had started appearing in our application logs.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Luckily, we could extract the exact queries from the binary logs, go through the sequence of events in the application trace logs and, after an 8-hour marathon meeting, correct the data stored in RDS.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">How Multi-AZ replication works<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">It was time to investigate why we fell into this situation in the first place. 
To solve the puzzle, we had to find answers to the following questions:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">How does RDS Multi-AZ replication work?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">What steps does RDS take at the time of a multi-AZ failover?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Why was the data missing from the database?<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Why did the replication crash?<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">We got on dozens of calls with RDS solution architects over the next week, and were finally able to connect all the dots.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3132\" src=\"http:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/two.png\" alt=\"\" width=\"1132\" height=\"752\" srcset=\"https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/two.png 1132w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/two-300x199.png 300w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/two-1024x680.png 1024w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/two-768x510.png 768w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/two-270x180.png 270w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/two-370x245.png 370w\" sizes=\"auto, (max-width: 1132px) 100vw, 1132px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">In a Multi-AZ setup, the RDS endpoint points to a primary instance. Another machine in a separate AZ is reserved as the standby, in case the master instance goes down.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The MySQL process on the standby instance is kept shut down; the replication happens between the two EBS volumes. 
That is, as soon as data is written to the primary EBS volume, it is synchronously duplicated to the standby EBS volume. This way, RDS ensures that any data written to the primary EBS volume is always present on the standby; hence, there should be no data loss in the case of a failover.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">At the time of a failover, RDS goes through a number of steps to ensure that the traffic is moved over to the standby machine in an orderly manner. These steps are listed below:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The primary MySQL instance goes through a cut-over (its networking is stopped). The client application goes down.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">MySQL on the standby machine is started.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The RDS endpoint switches to the standby machine (the new primary).<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The application starts connecting to the standby machine. 
The client application is back up now.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">The old primary machine goes through a host change (a hard reboot).<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">EBS sync starts from the new primary instance to the new standby instance.<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">On paper, the failover process looks foolproof: the replicas should never have hit any duplicate-key errors, and we should never have lost any data.\u00a0<\/span><\/p>\n<p><i><span style=\"font-weight: 400;\">So, what went wrong?<\/span><\/i><\/p>\n<h2><span style=\"font-weight: 400;\">Incorrect configuration variable<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">We found a MySQL configuration parameter, innodb_flush_log_at_trx_commit, which is critical for a seamless failover.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">InnoDB data changes are first written to a transaction log buffer which resides in memory. This data is flushed to the EBS disk based on the setting of innodb_flush_log_at_trx_commit.\u00a0<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">If the variable is set to 0, logs are written and flushed to the disk once every second. Transactions for which logs have not been flushed to the disk can be lost in case of a crash.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">If the variable is set to 1, logs are written and flushed to the disk after every transaction commit. This is the default RDS setting and is required for full ACID compliance.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">If the variable is set to 2, logs are written after each transaction commit, but flushed to disk only once every second. 
Transactions for which logs have not been flushed to the disk can be lost in case of a crash.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">For full ACID compliance, the variable must be set to 1. In our case, however, we had set it to 2. This means that even though the logs were written after every commit, they were not flushed to the disk immediately.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">After learning about this variable, everything suddenly became crystal clear. Since we had set it to 2, the data was committed on the master instance but not yet flushed to the primary EBS volume. Hence, the standby (the new primary) never received this data, which is why we could not find it in the master instance after the failover.<\/span><\/p>\n<p><i><span style=\"font-weight: 400;\">But why did the replicas fail? And why was the data found in the binary logs?<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400;\">There is another variable, sync_binlog, which, when set to 1, syncs the binary log to disk at every commit. As we had correctly set it to 1, the data was written to the binary logs and the replicas were able to read it. The replicas then ran those DML queries and stayed in sync with the old master.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-3133\" src=\"http:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/three.png\" alt=\"\" width=\"647\" height=\"639\" srcset=\"https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/three.png 647w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/06\/three-300x296.png 300w\" sizes=\"auto, (max-width: 647px) 100vw, 647px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Let\u2019s say the auto-increment value of one of the tables was X. The application inserted a new row which got the auto-increment id X+1. This value, X+1, reached the replica, but not the standby machine. 
So, when the application failed over to the standby machine, it inserted a new row which again got the auto-increment id X+1. On reaching the replica, this insert threw the duplicate-key error and crashed the replication.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We went back to our old snapshots (incidentally, we had kept snapshots of the old replicas before deleting them) and were able to confirm that the lost data had been present in the replicas.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Once our theory was proven, we immediately changed the value of innodb_flush_log_at_trx_commit from 2 to 1 on the master instance, closing the final loop.<\/span><\/p>\n<h2><span style=\"font-weight: 400;\">Final thoughts<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In retrospect, we\u2019re glad that we dug deeper into the incident and reached the root of the problem. The incident showed us that we had always been vulnerable to data loss because of a single incorrectly set configuration variable.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The silver lining, however, is that we learnt a lot about how RDS manages the Multi-AZ setup and its failovers. And, of course, we gained an interesting tale to tell you all!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>On a fateful Friday night, we at Razorpay set out to understand how the RDS Multi-AZ replication works. 
<\/p>\n","protected":false},"author":46,"featured_media":3103,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[69],"tags":[53,57],"class_list":{"0":"post-3093","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-razorpay-stories","8":"tag-razorpay-stories","9":"tag-technology"},"_links":{"self":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts\/3093","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/users\/46"}],"replies":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/comments?post=3093"}],"version-history":[{"count":0,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts\/3093\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/media\/3103"}],"wp:attachment":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/media?parent=3093"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/categories?post=3093"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/tags?post=3093"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}