Who Terraforms Terraform? Inside HashiCorp Operations Engineering

2017-09-20

My HashiConf 2017 keynote, representing my work as Director of Infrastructure & Platform.

Abstract

Paul Hinze explains how HashiCorp’s internal operations team uses the HashiCorp stack. While showing exactly how the company “eats our own dog food,” he takes a look at how we view open source, stakeholders, storytelling, and – more philosophically – the nature of life itself.

Paul describes how open-source software like HashiCorp’s gets developed, and how no two customers are alike.

Paul concludes with a list of future ideas the team is actively working on.

Video

Transcript

Today I’m gonna talk with you a little bit about what it’s like to run infrastructure for HashiCorp, and sort of the experience of doing so. You’re also going to get a look at our infrastructure that’s undergirding a lot of our other services that we run. So, let’s do it.

I’m Paul Hinze, that’s my title, that’s my Twitter, and I just talked about who I was, so you know. I’m starting this out with some flavor: I have some overarching themes that I want to stick in your head first, so that you have them in the back of your head as we start digging into the details here.

So I’ve been thinking about this a lot lately, software as narrative. The process of constructing software as the process of discovering the details about a problem and its potential solutions. I think this is just a really great lens with which to look at software engineering in general, and I think it’s really started me thinking about the idea that systems are stories, problems are stories, the communities of humans around these systems are all just attempting to build stories together and get more consistent about stories.

Another angle you can take on this concept is the classic distinction between a software engineer and a senior software engineer. The software engineer says, “Is this the way we should do this?” and the senior software engineer says, “Are we even solving the right problem here? Should this system even exist?” Those are the kinds of things I think are really valuable to keep in mind as we talk about some of this stuff.

The other one I want to give you up front is one of my favorite things about the ethos behind HashiCorp. So hashi in Japanese means bridge. It also means chopsticks, but we take bridge, and as Jessica, the manager of the Terraform Enterprise team, will tell you, chopsticks are just a bridge from food to your mouth. So it all makes sense. This is one of my favorite core tenets of the way that we operate. It’s very easy when you’re in infrastructure tooling, and honestly when you’re a vendor trying to help people solve problems in general, to say: first forget everything you know about X, accept all of my premises, and then you can start to reap the benefits of my solution.

That’s never been HashiCorp’s M.O. The principles of the Tao of HashiCorp are simple, composable tools that provide a bridge from your current state into a slightly better state. I think that’s so core to the way that we solve problems, it’s really been a key to our success, and you’ll see it play out in the stories that we tell here.

So, if you’re going to understand the infrastructure behind our products, you have to understand how our products have evolved. It’s like thinking about software as a narrative: it’s something that’s always changing. The system is always improving, the system continues to react to failure and get stronger, hopefully. So, this is what I want to walk through. I’ve been around for a while, so we can talk about all the products and how they fit in here.

So, in the beginning there was Vagrant, and it was good. I think … so Mitchell’s in college, Mitchell has this open source project and he gets wild success because he is solving such a key problem in a really elegant way. He spends two years developing Vagrant, getting more interest in it, decides to found HashiCorp, the company, with Armon. At this point, HashiCorp, when it was founded was just sitting on an open source project.

So, at the beginning of the next year the company releases its second open source project, Packer. Now, this is again a company, but a company that is sitting on a bunch of open source projects, and the same year, a couple of months later, the first commercial product is out. Now we’re a software company, now people are paying us money to build software. Notably this software is downloadable software. This is a plug-in for Vagrant that you pay money for and then you download it on your computer. So now we’re a downloadable software company. Then less than a year later we also became a services company.

So, Vagrant Cloud was introduced with Vagrant 1.5, along with this big refactor to be able to share boxes in a very consistent way, and now we’re also a SaaS company. So, we have downloadable software, some SaaS, still a very tiny shop. 2014 is such a great year because Consul also comes out this year and Terraform comes out this year, so we’re playing out this vision that Mitchell and Armon had very early on. One of the most incredible things to me about this company is the amount of technical vision that is applied to our overall arc by our founders. It’s really inspiring, it’s one of the things that makes me really happy to be here.

So, again, these are open source projects. We’re building out this portfolio of open source projects. One of the benefits of open source projects is they require very little infrastructure. GitHub’s got this down for us, we don’t have to do anything. So we have one piece of downloadable software and one service, and then late 2014, huge year: two big open source releases, and we introduced Atlas.

Atlas is an application, or was I should say, an application delivery pipeline. Now what does that mean? The first time I asked Mitchell how we were going to make this a sustainable business, his metaphor was always: we’re toolmakers, we make tools, and the root-level tools, if you’re thinking about building a house, are always going to be free. So, the hammer, the saw, the wood, you can always get those for free.

But, in a commercial relationship, people want things that are a little more stitched together. So when people come to us and just want to buy a house, that is, they want a more mature relationship with us, that’s where we draw the line. We have, as has been mentioned in the earlier keynotes, been developing this pretty robust framework for how we decide what goes into Enterprise products. But at this time, in 2014, our thesis statement was: unify the open source projects. The open source projects are these discrete pieces of an overall vision. So, maybe the best way to upgrade to a commercial relationship with people is to stitch them all together.

So Atlas is an application delivery pipeline, a flagship SaaS. So this one is interesting, this wasn’t in the original plan. Armon mentioned this in his opening keynote, right? HashiCorp Vault came out of original designs for Atlas and the need for secrets management, and through conversations with our customers and our community members it ended up becoming a first-class project, because people agreed that it was just so important and such an unsolved problem.

Again, 2015, huge year. Nomad comes out this year too. Now we’ve kind of rounded out our open source product fleet. Again, from an infrastructure perspective, we are not adding much overhead; from a community perspective, plenty. So, we keep releasing these new projects, but that doesn’t mean … we were also iterating hard on Atlas throughout this time. So we were adding a ton of features to Atlas. At this point it has a full Packer and Terraform CI pipeline as well as a bunch of Consul-related features.

So, the lesson we were learning from our customers during this time was that they really didn’t want an entire unified platform. People really were invested in a subset of our products, and they wanted kind of super versions of those products. Maybe two of them, maybe three of them, but not an entire application delivery pipeline.

It turns out that being modular and composable, as in the open source products, was something customers wanted out of the Enterprise products as well. So, what we started to do is we took Atlas. Now, Atlas is actually the same SaaS that was Vagrant Cloud. Vagrant Cloud sort of shifted the infrastructure underneath it, and it was the same thing that hosted Atlas; we’ll get to see it in a second. So we took it and basically just divided it up from a user interface perspective. We said: same code base, same distributed system, same SaaS, but we’re actually going to start calling out that these are the Packer features, the Terraform features and the Consul features. We’ll call them the respective Enterprise products, but we’ll keep the same application. We got a lot more traction with this. It made a lot more sense to customers, they could pick and choose what they purchased, and it worked really well.

What didn’t work well was delivery. I’m sure y’all have been in the situation where you start to try and deliver multiple products on the same platform, the same distributed system. So we had all these different facets of our system, but it was all still one system, and it was a team trying to hack on the same distributed system. So, we started having a lot of trouble just stepping on each other’s toes.

Specifically, the Consul features were a very interesting challenge for us because, unlike Packer and Terraform, which are provision-time operations, Consul is a run-time service. So to provide Enterprise features like a nice UI and some alerting, we needed connections from customer Consul clusters up to our SaaS. We had a subsystem to do that, but it was very tenuous. We had a lot of persistent TCP connections from customer Consul servers, which nobody on the infrastructure side was very happy about.

So, the time comes for Vault Enterprise. We want to deliver some Enterprise functionality for Vault, and we said to ourselves, do we add this into this distributed system where we’re now using this faceted approach? Based on the lessons we were learning, we decided: wait a second, we’re really good at distributing Go binaries. Can we rethink the Enterprise functionality we wanted for Vault as just something we could build into a Go binary, and lose all of the distant connections required in the model we were using for Consul at the time? So that’s what we did. Vault Enterprise was announced, and it was a downloadable binary with additional functionality.

That worked great. We used our existing build infrastructure with a couple of tweaks, and things, from an infrastructure perspective, made much more sense to us and our customers. This worked so well that we back-applied it to Consul Enterprise. So we took some of the functionality in what was Atlas and we re-thought it as a downloadable binary, which is what it is today. This is all happening in 2016 and 2017. So, over the course of the last year, you’ve seen the system, and the infrastructure under that system, change quite a bit. The more recent shift that we’ve made is that what remains useful as a service is called Terraform Enterprise. Now this is a system that runs Packer and Terraform on your behalf and provides really nice VCS workflows for you to do pull-request-based reviews, all sorts of nice functionality around the Terraform and Packer workflows.

We also extracted Vagrant Cloud within the past year into its own separate service once again. So, as opposed to baking it in with Atlas, now Vagrant has its own infrastructure. Extracting the Consul features and clarifying around Terraform Enterprise gave us the clarity to start tightening the system for more efficient delivery.

The other interesting angle, and this is one of the most interesting, is that within the last year we’ve also started offering Terraform Enterprise both as SaaS and as a private installation. Running a single multi-tenant SaaS alongside single-tenant installations in distant lands creates some really interesting infrastructure problems.

So, if you zoom out and take a look at this history, what do you see? You see a lot of work, a lot of releases, a lot of really excellent software, but also a lot of change, a lot of lessons and a lot of iteration. I think there are two key themes across it that I want to highlight. One is this distinction: we are toolmakers, and we know that tools can be distributed either as downloadable software or as services, and we were trying to figure that out, trying to figure out what functionality makes the most sense as a piece of downloadable software and what makes more sense as a service. The other is us participating in the feedback loops about our software with our customers, with our users, with our open source communities, with ourselves.

You’ve heard me say several times over the course of this mini-timeline that these are lessons we learned from our own operation of the software and from people using the software, and we learned and we changed.

Now, the moment you’ve all been waiting for: here I have on the next slide our infrastructure. We haven’t talked about this publicly before. What does HashiCorp’s own infrastructure look like inside? I’ll bet you’re wondering, and I have it for you right here on the next slide. Are you ready?

There it is, I bet you weren’t expecting that. It’s pretty exciting, I’ll bet you’ve never even seen this anywhere. This is a novel infrastructure and it’s also very detailed, as you can see. Now all of this is true as of today, this is absolutely true. But half of it didn’t exist when we started, and adding in these new technologies as they came out was a challenge for the infrastructure team, which also had to deliver working software. You know, we had to keep Atlas working even as we were introducing Nomad.

So, you’ll see that come in as we dig into the nitty-gritty. The other thing about this: I really love this view, but I think it can cause your eyes to glaze over a little bit. So, one of the things I want to show you is that the infrastructure that we’re running at HashiCorp is very similar to the infrastructure that you’re running in your shops. We’re not doing anything special, I’ll show you.

Atlas, now Terraform Enterprise, at its core has a Rails app in it. When we started it, we kind of analyzed the Go web ecosystem and said it’s not ready. We had done a lot of pioneering in the Go tooling ecosystem; a lot of mitchellh and armon libraries exist in the Go community because they were early into Go.

We looked at the web ecosystem and said we need to get a product out the door. We had expertise in Rails, we knew Rails was a safe bet, and that’s what the core of the web application that sits inside Terraform Enterprise is. Now that’s served by a cluster of Go services, about a dozen of them. I’ll actually put them all up on a slide in a second. And it stores its data in Postgres and S3 primarily.

So, nothing exciting there, and I think that’s one of the things that has been so funny to me about people who want to hear how HashiCorp does things. It’s like we have really great sort of marketing materials about the tools that we make and the way that we use them. But at the end of the day we’re not this church of perfect software. We are doing real work here. We have a group inside of HashiCorp that is using our software, but it’s using it to do just a normal pretty standard web application.

So, if you take a look at the pipelines, how do we actually get work done? They’re also going to look very familiar. For our machine images, we store Packer configs in GitHub. We actually use Puppet to provision our images, to install the base level of software. This Puppet step was actually baking in the applications until Nomad came around.

So, we were doing something like image-based rolling, pretty straightforward, nothing special. When we started, Terraform didn’t exist, so we did start with CloudFormation. It was interesting: until terraform import was a thing, we actually had some old resources kind of hidden in the corner of the infrastructure that we began importing slowly, by manually hand-editing the JSON of the Terraform state file. We said, you know, this should probably be a feature.

So, we use Terraform, with those configs stored in GitHub, to take those images and launch them into EC2 instances. Auto scaling groups of instances are behind ELBs. I’m not going to show you the AWS diagram because it is that boring, you can picture it yourself. One of the things that has been really exciting about the private installations is that it’s forced us to be more transparent about our architecture. So, it’s a little bit low on the slide, but the Terraform Enterprise modules repository is our deliverable for private Terraform Enterprise. We just deliver Terraform config and machine images shared with your account. In that repository is a bunch of documentation about the internals of Terraform Enterprise. This is the whole thing, so it’s not that huge of an application. This is a data flow diagram, so it’s focused on the way that data is pushed through the system.

I’m not going to go through each of them individually. All the documentation is up online, walking through each service and what it’s responsible for. Generally speaking, what you see here on the left-hand side is the Packer and Terraform build pipeline. Up in the middle you have the web front end and core of the application, and then on the right-hand side the part of the application that is responsible for ingress from various VCS providers. So there are two things I really like about this view.

One, it is the one we use internally. I was considering making it prettier for the slides and I was like, no, let’s use the real one. Two, this is represented as code. So, a plug for Mermaid, which is like Graphviz but more like Markdown. This is about 50 lines of text that generate this view, and it shows that this system is actively under development. So, these on the right-hand side with the brackets are actually services we’re in the process of changing out.

The bottom one we’re actually done changing, and the other one we’re planning to change. So it’s us learning the microservices lesson that you over-factor your services, when actually those three can probably just be one. They spend most of their time talking to each other, so maybe we don’t need them to be so micro.

So, running this infrastructure, and again, that relatively simple application is the sum total of what was once Vagrant Cloud, became Atlas and is now tightening up as Terraform Enterprise.

So, I’d like to spend the remainder of this talking about what it’s like to run that infrastructure within HashiCorp, and that’s where things get really interesting because of our positioning. Our team at HashiCorp sits at the table right next to (we’re a remote company, so it’s not a physical table, but a digital table) the teams that are building these tools. That gives us a really interesting ability to have some deep conversations about the way that the software works and the way that each of our tools is used day to day.

But we’re not the only ones at that table. In fact, we have an interesting positioning, but it is not terribly unique. The open source communities are still central to the way that the teams think about how to iterate on the software and make it better, and then our customers, our commercial relationships, give us insight as to how to make the products better. I want to show you how that works because I think this process is the process that everybody here is participating in right now, and will continue to participate in going out into the world beyond HashiCorp.

Let’s take a look at an example or two. So, this is something we said within the last year. We were saying: Consul servers, we’re having trouble automating around Consul server recycling. It’s kind of tricky, you have to wait until the overall Consul cluster is okay before continuing on. We were having to do a lot of manual work when we were recycling our Consul servers that we didn’t want to have to do. It was working fine, but it was a little bit more tedious than we wanted it to be. This is feedback that was consistent to the Consul team from all three of these kinds of groups. This is feedback that the GitHub community was giving and that our commercial partners were giving as well.
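
A minimal sketch of the wait-then-continue loop this kind of recycling needs, written in Python for illustration only. The `replace_server` and `cluster_healthy` callables are hypothetical stand-ins for whatever tooling actually replaces a server and checks cluster health; none of these names come from the talk or from Consul itself.

```python
import time
from typing import Callable, Iterable, List


def _wait_until_healthy(
    cluster_healthy: Callable[[], bool],
    poll_seconds: float,
    max_polls: int,
    sleep: Callable[[float], None],
) -> None:
    """Poll the (hypothetical) health check until it passes or we give up."""
    for _ in range(max_polls):
        if cluster_healthy():
            return
        sleep(poll_seconds)
    raise TimeoutError("cluster did not become healthy in time")


def recycle_servers(
    servers: Iterable[str],
    replace_server: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
    poll_seconds: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Replace servers one at a time, waiting for cluster health between each."""
    recycled: List[str] = []
    for server in servers:
        # Never take a server out while the cluster is still recovering.
        _wait_until_healthy(cluster_healthy, poll_seconds, max_polls, sleep)
        replace_server(server)
        recycled.append(server)
    # Confirm the cluster settled after the final replacement.
    _wait_until_healthy(cluster_healthy, poll_seconds, max_polls, sleep)
    return recycled
```

The tedious part is the waiting between each replacement, which is the step that was being done by hand.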

So, Consul releases Consul Autopilot. Just a great bundle of features that are operator focused and that make these kinds of operations that we’ve talked about much easier. So this is an example of pure feedback and response, and it’s great.

We’ve had an extra special relationship with the Nomad team because we were early into the community. We had incentive to adopt it and we have gotten really great features out of that collaboration. So, including things like nomad plan: how do I know what happens when I execute nomad run? They came back with nomad plan, which is a great feature, with great symmetry with our other tools.

Template support. So we spent the first iteration of our Nomad installation embedding the consul-template binary and one-shotting it ahead of our service run. So it would be like: consul-template, snag all the config, drop it on the local disk, and then go from there. Nomad pulls that in, and we lose a whole step of our automation.

Similarly we have a deploy bot that we lovingly call Waffles. Waffles deploys our services day to day, and we designed that bot in concert with the Nomad team, such that when job deployments landed, it was using the same sort of mental model around deployments.

So, we haven’t done this yet, but we can basically take Waffles and take three quarters of it out and just change it into Nomad API calls. It’s really great. Then, of course, the closest relationship we have is that with Terraform Enterprise itself. I don’t want to tell you exactly when we broke the circular dependency of running Terraform Enterprise to manage Terraform Enterprise. It is broken now. It did exist for a time, and we had workarounds where you could run Terraform locally under duress. But now when you run it, it’s a private Terraform Enterprise for ourselves that manages the SaaS. So, yeah, the feedback cycle here is very tight. We work with this team every day to make that product better.

But here’s my point: these three groups are valuable because they are distinct. If I could give you one thing to take away from 30 minutes of hearing about HashiCorp operations, it’s that the best practices that we’re all searching for are being actively created by the entire community together, right now, and the best practices only come from all of us because we are different. All the time I get asked, “Well, tell me exactly how HashiCorp operates, because that’s how I’m going to operate my shop.” I often have to respond that I can tell you how a high-growth startup operates; that’s the sector we’re in right now. The answers that our teams figured out for ourselves are just not going to be the right answers for you. You saw that in the history: the narrative requires all of the characters involved in order to successfully evolve.

So, if you take a look at sort of these three groups, the GitHub community is the broad base. This is a bunch of practitioners, you get a lot of different kind of use cases, people push the edge cases and it’s really great. The customer relationships traditionally for us have pushed scale, so they say okay, that’s great, I wanna do that for 1200 teams globally, and the metaphor that you’ve created for this particular feature doesn’t make sense at that scale. Our sort of operations team, the sort of unique flavor we’re able to bring is the day in and day out.

So, we’ll get that feedback from others too, but we can really give that peripheral vision to the teams, which is like: here is what it feels like, here’s what it looks like for a team to be operating on these tools, with these tools, every day. So, not only is each of these groups distinctly required to make the tools great, I think each of these groups communicating with each other is also where the best practices emerge.

A lot of the way we write Terraform config today, we would not write that Terraform config today without our conversations that we’ve had with customers who are showing us how they’re laying their modules out, and we’re like oh that’s great, that’s a great idea. Our team is actually not that big for the benefit to matter, to split it up in that way. But, that is a really great move and we’re gonna move in that direction.

So, for instance, to be concrete: right now we still generally do one Terraform workspace for each environment. We’re probably gonna move to splitting that out into much more of a tiered approach as our application gets bigger and as we get more people hanging out in the infrastructure.

So, on that note, I just want to give you a bunch of ideas that we’re actively thinking about, to show you that our team inside of HashiCorp is dealing with the same problems that you’re dealing with day in and day out. One of the most common types of conversations that I have with people in the community is, they say, “Hey, how would you solve this problem? This is a friction that we’re seeing with Nomad or with Vault or with Terraform.” So often, my answer has the shape of: oh yeah, we have that too. It’s really frustrating, we’re working around it this way, but you know, we’re in a larger conversation on how to fix this.

So, this is just a bunch of little examples to get your brain active as you go back out into the community, so that you know you can keep having these conversations, and that the community will keep having these conversations. Right now, in our Terraform config, we are still doing the Terraform-based ASG blue-green, where you have two ASGs behind an ELB and you change numbers to deploy things.
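
A minimal sketch of that number-changing, in Python for illustration only. It assumes two groups named `blue` and `green` behind one load balancer, with exactly one of them live at a time; the function name and the two-step sequence are assumptions for this sketch, not the team’s actual Terraform config.

```python
from typing import Dict, List


def blue_green_steps(capacities: Dict[str, int]) -> List[Dict[str, int]]:
    """Return the capacity settings to apply, in order, for one deploy."""
    blue, green = capacities["blue"], capacities["green"]
    if (blue > 0) == (green > 0):
        raise ValueError("expected exactly one live group")
    size = max(blue, green)
    live, idle = ("blue", "green") if blue > 0 else ("green", "blue")
    return [
        # Step 1: bring up the idle group (with the new image) alongside the live one.
        {live: size, idle: size},
        # Step 2: once the new group is serving traffic, scale the old group to zero.
        {live: 0, idle: size},
    ]
```

The next deploy depends on which group the previous deploy left live, so the operator has to carry that state from one deploy to the next.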

This has been working perfectly fine, but it gets tedious and it’s stateful. It’s sort of naturally stateful, so we’re really moving towards this concept of an unattended instance roll, which, I think, if you ask around, there are people in this audience who have implemented. We’re focusing on our Nomad clients at first. One of our SREs is working on this actively this week, trying to figure out how we can get the notification of an instance coming down so we can safely Nomad drain before the instance actually gets terminated. So, that’s something we’re working on. We’re very excited because it will make rolling every flavor of machine totally automatable; we won’t even have to think about it.
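
A minimal sketch of the drain-before-terminate flow described here, in Python for illustration only. The three callables (`drain_node`, `running_allocations`, `complete_termination`) are hypothetical hooks standing in for the real Nomad and cloud-provider calls; the talk does not say how the team wires this up.

```python
import time
from typing import Callable


def handle_termination_notice(
    instance_id: str,
    drain_node: Callable[[str], None],
    running_allocations: Callable[[str], int],
    complete_termination: Callable[[str], None],
    poll_seconds: float = 10.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain a node before its instance goes away.

    Returns True if the node was empty before termination was allowed to
    proceed, and False if we gave up waiting and let it proceed anyway.
    """
    # Stop new work landing on the node and ask existing work to move.
    drain_node(instance_id)
    drained = False
    for _ in range(max_polls):
        if running_allocations(instance_id) == 0:
            drained = True
            break
        sleep(poll_seconds)
    # Either way, release the instance so the roll is never stuck forever.
    complete_termination(instance_id)
    return drained
```

Letting termination proceed after a timeout is a design choice made for this sketch; a stricter version could raise instead and leave the instance for a human to look at.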

We don’t actually manage our … we have two K/V spaces, the Vault K/V and the Consul K/V, that we don’t manage with Terraform because the Terraform providers didn’t exist when we first implemented them. So we have some pretty sweet Bash and JSON. We still manage them as code, but we manage them in a sort of file-system-mirroring style.

So, the key problem with these kinds of applications of Terraform is access. Terraform needs direct access to whatever API it’s going to drive. From a Terraform Enterprise perspective you have that single point of Terraform execution, and so there are a lot of different ways to slice this problem of how you execute Terraform close to the API. Do you try to tunnel the network in, so that Terraform can talk to the Vault cluster? We have customers that have solutions for this. We haven’t figured out which one we want to both use for ourselves and try to work with the teams to get better support for, but this is something there’s a lot of active conversation about within HashiCorp.

So we’re still on ELBs. You’ve heard me mention ELBs several times, and it’s one of those things that’s been good enough that we haven’t been motivated to make it better. We use Consul heavily for all of its other features, but for load balancing itself, we’re just sort of living in the ELB world, which is fine. But we really want to get into something that’s Consul-aware, so we can more quickly shunt traffic around and not have to deal with the clunkiness of ELBs changing traffic.

Finally, this is one we’re really excited about. So excited that we now use this as a joke, as a solution to every problem. We’re like, can we use Nomad dispatch to just solve this problem? It’s like, no, I’m just sick, Nomad dispatch isn’t going to do anything. But can we Nomad dispatch some medicine? So Nomad dispatch, if you’ve never heard of it, is sort of like the serverless primitive that works within Nomad. You can partially define a job, less a couple of inputs. So essentially it enables you to do just-in-time execution of your workloads, and it really is a transformative concept that has a lot of potential. So you saw RabbitMQ in the middle of our build pipeline. We want to re-think that build pipeline in the context of Nomad dispatch, because what we’re doing with RabbitMQ right now is basically naively scheduling work. It’s like, hey, check it out, we have a scheduler right here.
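
A minimal sketch of the “partially define a job, less a couple of inputs” idea, in Python for illustration only. `ParameterizedJob`, its fields and the ID format are hypothetical names for this sketch; it models the concept and is not Nomad’s actual API or job specification.

```python
import itertools
from typing import Dict, Iterable


class ParameterizedJob:
    """A job definition that is complete except for a few named inputs."""

    def __init__(self, name: str, command: str, required_inputs: Iterable[str]) -> None:
        self.name = name
        self.command = command
        self.required_inputs = tuple(required_inputs)
        self._sequence = itertools.count()

    def dispatch(self, inputs: Dict[str, str]) -> Dict[str, object]:
        """Fill in the missing inputs and return one concrete job instance."""
        missing = [key for key in self.required_inputs if key not in inputs]
        if missing:
            raise ValueError(f"missing required inputs: {missing}")
        return {
            # Each dispatch gets its own ID so runs can be tracked separately.
            "id": f"{self.name}/dispatch-{next(self._sequence)}",
            "command": self.command,
            "inputs": dict(inputs),
        }


# Usage: define the build job once, then dispatch it per incoming build request.
build = ParameterizedJob("packer-build", "run-build", ["config_url"])
first_run = build.dispatch({"config_url": "https://example.invalid/a.json"})
```

Under this model, each message that would have gone onto a queue becomes a dispatch call, and the scheduler decides where the work runs.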

So, that’s one of the things that is really exciting, I think, about adopting HashiCorp’s software internally too: it gives you these primitives that are just there to help you solve your problems. It’s like, oh, we can actually uniquely solve this problem with Consul or with Nomad or with Vault.

So, that’s just a sampling of the things that we’re thinking about. So, I’ve always wanted to give a talk where there’s a question in the title and the answer is “it was inside of you all along,” and I’ve done it. But I do really mean this. You have all been doing this actively for the last two days, and I really hope that you continue to do this as you go back to your day jobs. Continue to interact with the community, continue to interact with each other and continue to make our tools better by sharing information and by providing feedback. We’ll keep listening and we’ll keep participating with you, and we’ll make all of this continually better.

Thank you.
