{"id":948,"date":"2018-07-10T18:12:37","date_gmt":"2018-07-10T12:42:37","guid":{"rendered":"https:\/\/rzpwp.blog\/?p=948"},"modified":"2024-02-12T15:28:28","modified_gmt":"2024-02-12T09:58:28","slug":"secret-management-razorpay","status":"publish","type":"post","link":"https:\/\/razorpay.com\/blog\/secret-management-razorpay\/","title":{"rendered":"How We Do Secret Management at Razorpay"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">As a <a href=\"https:\/\/razorpay.com\/blog\/payment-processor\/\">payment processor<\/a>, we deal with many secrets &#8211; Encryption Keys, database configurations, application secrets, signing certificates etc. Most of these secrets are required by a specific service (say the Razorpay dashboard) to do routine tasks (such as connecting to the database).<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><strong>Secret Management<\/strong> is how you make sure that the specific service (and only that specific service) gets access to the correct (and latest) secrets. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is mostly a non-problem when you are a small startup, but as we\u2019ve grown from a small startup managing just a couple of servers, to managing large Kubernetes clusters, the way we store\/use secrets has changed considerably. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">Over time we&#8217;ve switched through various approaches in how we store and ship these secrets to our services. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">Secret Management is a common orchestration problem and has multiple different solutions. 
This blog post walks you through Razorpay\u2019s Secret Journey: how we\u2019ve tried out various solutions over time and what benefits each of them brought us.<\/span><\/p>\n<h2><b>Stage 1: Ansible Vault<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">We started out with all of the secrets being stored in a common <\/span><a href=\"https:\/\/docs.ansible.com\/ansible\/2.5\/user_guide\/vault.html\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Ansible Vault file<\/span><\/a><span style=\"font-weight: 400;\">. Ansible is part of our DevOps tooling and is used to configure servers. This vault file was used in automated Ansible runs, which would run on the live servers using Ansible-ssh.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, this resulted in our CI pipeline having access to our production servers, which we weren&#8217;t comfortable with. Ansible Vault also did not permit any granularity in secret access &#8211; everyone with access to the vault key had access to all the secrets.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To mitigate the CI-access issue, we moved to <\/span><a href=\"https:\/\/www.packer.io\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">HashiCorp Packer<\/span><\/a><span style=\"font-weight: 400;\">, imitating the <\/span><a href=\"http:\/\/techblog.netflix.com\/2016\/03\/how-we-build-code-at-netflix.html\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Netflix model of infrastructure deployments<\/span><\/a><span style=\"font-weight: 400;\">:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Spin up a new base VM in EC2.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Run ansible-ssh on the instance against the correct role.<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Push the final image to Amazon as an AMI 
(Amazon Machine Image).<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">This is a very common infrastructure setup (<\/span><a href=\"https:\/\/www.packer.io\/docs\/provisioners\/ansible.html\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Ansible+Packer<\/span><\/a><span style=\"font-weight: 400;\">) and works reasonably well.<\/span><\/p>\n<h2><b>Stage 2: Credstash<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">To get more granular control over our bakes, we switched to <\/span><a href=\"https:\/\/github.com\/fugue\/credstash\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Credstash<\/span><\/a><span style=\"font-weight: 400;\">. Credstash is a well-established project (written in Python) for storing secrets safely in AWS. It does the following:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Uses <\/span><a href=\"https:\/\/aws.amazon.com\/kms\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Amazon KMS<\/span><\/a><span style=\"font-weight: 400;\"> to encrypt\/decrypt secrets<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Uses <\/span><a href=\"https:\/\/aws.amazon.com\/dynamodb\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Amazon DynamoDB<\/span><\/a><span style=\"font-weight: 400;\"> to store the encrypted secrets<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Supports a few nifty extras such as secret versioning<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Since we continued to use Ansible, <\/span><a href=\"https:\/\/docs.ansible.com\/ansible\/2.5\/plugins\/lookup\/credstash.html\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Ansible&#8217;s Credstash lookup plugin<\/span><\/a><span style=\"font-weight: 400;\"> was an easy replacement for Ansible Vault. 
It allows us to use:<\/span><\/p>\n<pre><code class=\"language-js\">lookup('credstash', 'super_secret')<\/code><\/pre>\n<p><span style=\"font-weight: 400;\">inside the Ansible Jinja templates. We managed access using AWS IAM roles granted only to the Packer instances (we called these &#8220;baker instances&#8221;).<\/span><\/p>\n<h2><b>Stage 3: Alohomora<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">While Credstash served us well, our development velocity suffered because the bake process was slow. Each layer on our Ansible build system took anywhere between 10 and 45 minutes to run, which led us to look for faster alternatives.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Since we were pretty happy with Credstash as our vetted secure storage method, we decided to take a leaf out of <\/span><a href=\"https:\/\/www.infoq.com\/news\/2014\/03\/etsy-deploy-50-times-a-day\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Etsy&#8217;s book and try out &#8220;configuration deployments&#8221;<\/span><\/a><span style=\"font-weight: 400;\">. The basic idea is to put configuration updates on the same footing as your regular deployments &#8211; fast, easy, and accessible. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">We\u2019d already been using AWS CodeDeploy to deploy our codebase and decided to merge the two approaches.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Instead of splitting deployments into two categories (which is what Etsy does), we decided to make some changes to our CodeDeploy infrastructure. Because of our earlier use of Ansible Vault and the subsequent switch to Credstash, most of our applications relied on secrets being readable from specific files. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">We worked around this problem by writing a small wrapper on Credstash called Alohomora. 
It does the following:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Fetches secrets from a specific DynamoDB table using Credstash<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Writes them to disk using a <\/span><a href=\"http:\/\/jinja.pocoo.org\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Jinja template<\/span><\/a><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">The Jinja template is shipped alongside our codebase, and lets developers know exactly what secrets are exposed to the application. We run Alohomora as part of our deployment:<\/span><\/p>\n<pre><code class=\"language-js\">alohomora cast --env $CODEDEPLOY_GROUP --app $CODEDEPLOY_APP secrets.j2 license.j2<\/code><\/pre>\n<p><span style=\"font-weight: 400;\">The extra variables <\/span><span style=\"font-weight: 400;\">($CODEDEPLOY_*) <\/span><span style=\"font-weight: 400;\">are exposed by AWS <\/span><a href=\"https:\/\/aws.amazon.com\/codedeploy\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">CodeDeploy<\/span><\/a><span style=\"font-weight: 400;\"> and let Alohomora decide which table to read the secrets from (it standardizes on a naming scheme of <\/span><code class=\"language-js\">credstash-$env-$app<\/code><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If a secret is missing from the DynamoDB table, the deployment fails with an error message: we would rather fail a deployment than let it go through with a missing secret.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We\u2019re open sourcing Alohomora alongside this blog post; check it out at <\/span><a href=\"https:\/\/github.com\/razorpay\/alohomora\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">https:\/\/github.com\/razorpay\/alohomora<\/span><\/a><span style=\"font-weight: 400;\">. 
It has been a great enabler of faster configuration deploys at Razorpay, and we hope it can be of help to other companies pushing secret updates regularly to their applications.<\/span><\/p>\n<h2><b>Stage 4: Kubestash<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Our DevOps team bet on Kubernetes early on. We were running production code on our in-house Kubernetes cluster by Q3 2017. The Alohomora deployment script was moved to the entry point of our Docker images, and the IAM roles were maintained using <\/span><a href=\"https:\/\/github.com\/jtblin\/kube2iam\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">kube2iam<\/span><\/a><span style=\"font-weight: 400;\"> (we&#8217;ve since switched to <\/span><a href=\"https:\/\/github.com\/uswitch\/kiam\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Kiam<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Alohomora, while working decently in a Kubernetes infra, wasn&#8217;t Kubernetes-native. As such, it gave us issues with:<\/span><\/p>\n<p><b><i>Resource Quotas<\/i><\/b><b>: <\/b><span style=\"font-weight: 400;\">We saw CPU spikes in the application during the deployment when Alohomora ran. As a result, we had to set higher resource quotas on the applications than the services themselves were using.<\/span><\/p>\n<p><b><i>Python<\/i><\/b><b>:<\/b><span style=\"font-weight: 400;\"> Alohomora was written with Ubuntu 16.04 based deployments in mind and supported Python 2.7. We started facing issues with Python dependencies in services that used Python themselves. We&#8217;d have faced this issue with our Ubuntu setup as well, but running on Docker exacerbated it.<\/span><\/p>\n<p><b><i>Not Kubernetes First<\/i><\/b><b>:<\/b><span style=\"font-weight: 400;\"> Kubernetes already provides a secret management solution in Kubernetes Secrets. It supports both file-based and environment-variable-based secrets. 
Running Alohomora and fetching secrets from Credstash felt like an alien solution in the Kubernetes world.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We found a solution in another small Credstash wrapper called <a href=\"https:\/\/github.com\/af-inet\/kubestash\" rel=\"nofollow noopener\" target=\"_blank\">Kubestash<\/a> &#8211; a small command-line application to sync your Credstash secrets to Kubernetes. We&#8217;ve since <a href=\"https:\/\/github.com\/af-inet\/kubestash\/pull\/8\" rel=\"nofollow noopener\" target=\"_blank\">contributed patches<\/a> to Kubestash that support our specific workflow and allow for cluster-level syncs.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This allows us to store our secrets using Credstash and know that they will get pushed automatically to our Kubernetes cluster using Kubestash. The primary command that we use is Kubestash Daemonall, which syncs a complete DynamoDB table against a Kubernetes cluster. 
We run this as a single pod deployment in our cluster.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">One caveat to keep in mind when using Kubernetes Secrets is to make sure that your <\/span><a href=\"https:\/\/kubernetes.io\/docs\/tasks\/administer-cluster\/encrypt-data\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">etcd store is encrypted<\/span><\/a><span style=\"font-weight: 400;\">; otherwise, etcd will store all your secrets on disk, unencrypted.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can find more details in the Kubestash documentation at <\/span><a href=\"https:\/\/github.com\/af-inet\/kubestash\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">https:\/\/github.com\/af-inet\/kubestash<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h2><b>Alternatives<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">If you&#8217;re reading this, there are several other alternatives now available that you might want to consider before picking a solution:<\/span><\/p>\n<h3><strong><a href=\"https:\/\/lyft.github.io\/confidant\/\" rel=\"nofollow noopener\" target=\"_blank\">Confidant by Lyft<\/a><\/strong><span style=\"font-weight: 400;\"><strong>:<\/strong> We didn&#8217;t try this out since it was released after we&#8217;d switched over to Credstash, but it is fairly similar in scope (KMS for encryption + DynamoDB for storage). It also features a web UI where users can update secrets.<\/span><\/h3>\n<h3><strong><a href=\"https:\/\/kubernetes.io\/docs\/concepts\/configuration\/secret\/\" rel=\"nofollow noopener\" target=\"_blank\">Just Kubernetes Secrets<\/a><\/strong><span style=\"font-weight: 400;\"><strong>:<\/strong> If you&#8217;re running on a managed Kubernetes cluster, this is a very good solution that you should consider. 
In our case, we wanted something other than etcd to be our primary secret store, which is why we went with Kubestash (it lets us keep DynamoDB as the primary store).<\/span><\/h3>\n<h3><strong><a href=\"https:\/\/docs.aws.amazon.com\/systems-manager\/latest\/userguide\/systems-manager-paramstore.html\" rel=\"nofollow noopener\" target=\"_blank\">AWS Parameter Store<\/a><\/strong><span style=\"font-weight: 400;\"><strong>:<\/strong> The AWS Parameter Store allows you to store arbitrary key\/value pairs and grant access using IAM roles. There are some wrappers (similar to Credstash) that use Parameter Store instead of DynamoDB.<\/span><\/h3>\n<h3><strong><a href=\"https:\/\/aws.amazon.com\/blogs\/aws\/aws-secrets-manager-store-distribute-and-rotate-credentials-securely\/\" rel=\"nofollow noopener\" target=\"_blank\">AWS Secrets Manager<\/a><\/strong><span style=\"font-weight: 400;\"><strong>:<\/strong> Announced earlier this year, this is a slightly costlier solution that allows for secret versioning and automated secret rotation using Lambda functions. We might consider this if it supports a native Kubernetes integration (which might show up with <\/span><a href=\"https:\/\/aws.amazon.com\/eks\/\" rel=\"nofollow noopener\" target=\"_blank\"><span style=\"font-weight: 400;\">Amazon EKS<\/span><\/a><span style=\"font-weight: 400;\"> perhaps?).<\/span><\/h3>\n<hr \/>\n<p><em><span style=\"font-weight: 400;\">Interested in automating things and helping us scale the most robust payments platform in India? We&#8217;re looking for Infrastructure Engineers at Razorpay! Check out the job postings at <a href=\"https:\/\/razorpay.com\/jobs?utm_source=blog&amp;utm_medium=blog&amp;utm_campaign=hiring_campaign&amp;utm_content=readers\">https:\/\/razorpay.com\/jobs<\/a><\/span><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Secret Management is key to data safety in payment gateways. 
Here&#8217;s our journey of finding the perfect secret keeper &#8211; from Ansible to Kubernetes and more.<\/p>\n","protected":false},"author":5,"featured_media":960,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[69],"tags":[53,57],"class_list":{"0":"post-948","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-razorpay-stories","8":"tag-razorpay-stories","9":"tag-technology"},"_links":{"self":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts\/948","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/comments?post=948"}],"version-history":[{"count":0,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts\/948\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/media\/960"}],"wp:attachment":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/media?parent=948"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/categories?post=948"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/tags?post=948"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}