Supporting Speed with Core Principles: The Ethos Infrastructure Team
Douglas Land, Director of Reliability Engineering
At Ethos, the primary objective of the Infrastructure team is to empower everyone in the company to build a world-class web product. This covers anything that affects the company’s ability to build, test, and deploy software safely and reliably: the choice of programming language, libraries, architecture, database server, and cloud API provider, as well as monitoring, notification, and security choices. While the Infrastructure team doesn’t own all of these decisions, it is involved in the discussions and often plays a key role in productizing them internally in ways that improve development velocity.
We tackle this objective by following our core principles:
- Keep things operationally simple
- Buy before build
- Follow a self-service first model
Keeping things operationally simple can mean making trade-offs, such as using technologies that are potentially more complicated to run from an infrastructure perspective but are easier for development teams to use. Somewhat conversely, this principle also pushes us to remove “magic,” or black box functionality, from our infrastructure, so developers can have the clearest possible understanding of how and why things work as they do. The buy before build principle dictates that we don’t create a new tool or process unless it adds significant business value for the company. Lastly, following a self-service first model means self-service should always be possible, but we’re here to help. We want to empower development teams to move as fast as they can without getting in their way. We also don’t expect them to become experts in all things cloud, so we’re here to support them with their infrastructure needs.
The Infrastructure team takes buy before build to heart, and we vet each new tool or process against a common set of criteria:
- Do we really need this, or does an existing tool or process already cover it?
- Is there something out there that already does this well and is reasonably priced?
- If we need to run something, how resilient and simple to run is it?
- Does it involve a new process or technology we’re not already familiar with?
- What are the hidden costs of building and running something ourselves?
For instance, given the above considerations, the size of the team, the size of the company, and the volume of work at hand, it became clear that using managed cloud services for Kubernetes and databases is a better value proposition for the company than running those ourselves. We’ve taken a similar approach to CI/CD and monitoring, moving away from a legacy of fragile and error-prone build systems running on Jenkins and custom monitors to GitHub Actions and Datadog. This has allowed us to focus our efforts on making our platform as usable as possible instead of spending all our time maintaining custom code, services, and processes.
We on the Infrastructure team pride ourselves on doing the most we can to enable developer self-service while still maintaining a safe and compliant environment. We start by using LocalStack and Docker to give developers a development environment that is free of external dependencies and relatively free of “magic,” where they have complete control and access. This cuts out a lot of the complexity in workstation setup, debugging, and access that can be present in a distributed or multi-tenant development environment. Once a developer is ready to move a service beyond development, we leverage GitOps to provide a clear path through our other environments, complete with audit trails, reviews, and an approval process. If raw cloud resources are needed, they can be provisioned with Terraform through Atlantis. If a developer is adding a new service to Kubernetes, they can generate a new Kustomize template and tell Argo CD to start managing it. There are a few custom projects that tie some of this together, but the majority of components are open source, thoroughly documented by the community, and extensively used, so we can leverage others’ experiences and fixes downstream (and yes, we also contribute back to these projects).
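To make the "generate a template and hand it to Argo CD" step more concrete, here is a minimal sketch in Python of what such a generator could look like. It is not our actual tooling: the function names, the repository URL, the `services/<name>/overlays/<env>` layout, and the `default` project are all assumptions made for illustration. Only the `apiVersion`, `kind`, and top-level field names follow the public Kustomize and Argo CD manifest formats.

```python
import json


def kustomization(namespace: str) -> dict:
    """Build a minimal kustomization.yaml body for one service overlay.

    The resource file names are illustrative placeholders.
    """
    return {
        "apiVersion": "kustomize.config.k8s.io/v1beta1",
        "kind": "Kustomization",
        "namespace": namespace,
        "resources": ["deployment.yaml", "service.yaml"],
    }


def argocd_application(service: str, namespace: str, repo_url: str, env: str) -> dict:
    """Build an Argo CD Application that points at the service's overlay.

    The repo layout (services/<name>/overlays/<env>) is a hypothetical
    convention, not a description of any real repository.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{service}-{env}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": f"services/{service}/overlays/{env}",
                "targetRevision": "HEAD",
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so the output can be committed as a manifest as-is.
    app = argocd_application(
        "billing", "billing", "https://example.com/infra/deployments.git", "staging"
    )
    print(json.dumps(kustomization("billing"), indent=2))
    print(json.dumps(app, indent=2))
```

The point of the sketch is that everything a developer produces is a plain file in Git, so the review, approval, and audit trail come from the pull request rather than from a bespoke deployment tool.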
The Infrastructure team has carried the same philosophy into our observability layer. We use Datadog for its single-pane view of logs, events, tracing, and metrics. This empowers our developers to instrument their applications as they’d like, adding new metrics, creating views and dashboards, and generating monitors without having to go through anyone else. As a team, we try to simplify the lives of our developers whenever possible, leveraging Kubernetes operators to do things like automatically generate standard monitors for new services whenever they’re brought up and notify teams of state changes. We also use pganalyze to provide the same level of self-service and observability for databases that we have at the service level with Datadog.
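The operator idea above boils down to a reconcile step: compare the monitors that should exist for the running services with the ones that already exist, and create only what is missing. The following is a rough, self-contained sketch of that step under stated assumptions. `MonitorSpec`, `STANDARD_MONITORS`, the monitor kinds, and the thresholds are all hypothetical, and the sketch deliberately does not call the Kubernetes or Datadog APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSpec:
    """A hypothetical, simplified description of one monitor."""
    service: str
    kind: str
    threshold: float
    notify: str


# Illustrative defaults only; real thresholds would be tuned per service.
STANDARD_MONITORS = {"error_rate": 0.05, "p95_latency_seconds": 1.0}


def standard_monitors(service: str, team_channel: str) -> list:
    """Return the standard set of monitors every new service should get."""
    return [
        MonitorSpec(service, kind, threshold, team_channel)
        for kind, threshold in STANDARD_MONITORS.items()
    ]


def reconcile(services: dict, existing: set) -> list:
    """Return the monitors that still need to be created.

    `services` maps a service name to the channel its team is notified in.
    Running this repeatedly converges: once every desired monitor exists,
    the result is an empty list.
    """
    desired = [
        spec
        for service, channel in sorted(services.items())
        for spec in standard_monitors(service, channel)
    ]
    return [spec for spec in desired if spec not in existing]


if __name__ == "__main__":
    to_create = reconcile({"billing": "#team-billing"}, set())
    for spec in to_create:
        print(spec)
```

In a real operator, the `services` mapping would come from watching cluster resources and `existing` from the monitoring provider; the comparison in the middle stays the same.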
All of this has allowed us to create a “week one” exercise, in which new developers create a service and push it through our environments to production more or less by themselves. This serves as a litmus test for the Infrastructure team that our processes and tooling are simple enough to use and understand, and it ensures the developer has all the access and tools they need to perform their job; think of it as CI for our entire toolchain.
There is, of course, some “magic” involved with Ethos infrastructure, as well as considerations like access constraints around environments. We often leverage ChatOps to allow access with highly defined guard rails. The Infrastructure team largely considers ChatOps a limited, distributed command line. It’s a nice way to ensure everyone is aware when someone is taking an action in an environment, as the commands are visible to everyone on the channel they’re run from. Another benefit of ChatOps is that it can be used when someone doesn’t have easy access to a computer, which can make for a better on-call experience. All ChatOps commands are limited to actions that are both non-destructive and idempotent, ensuring that even in the event of a mistake or compromised access, our systems remain safe. As a significant portion of the Infrastructure team’s code responsibilities involves “glue” between systems and infrastructure components, we’ve also started leveraging platforms like Benthos to offload a lot of the boilerplate and focus on the business logic side of requirements, reducing the overall footprint of “magic” involved in our systems.
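As an illustration of what "non-destructive and idempotent" guard rails can look like in practice, here is a small sketch of a chat command dispatcher. It is an assumption-laden toy rather than our bot: the `ChatOps` class, the `status` and `scale` commands, and the `MAX_REPLICAS` bound are all made up for the example, and state is held in memory instead of being sent to a cluster.

```python
MAX_REPLICAS = 10  # hypothetical upper bound enforced by the bot


class ChatOps:
    """Toy dispatcher: an allowlist of commands, each safe to repeat."""

    def __init__(self):
        self.replicas = {}   # desired replica count per service
        self.audit_log = []  # every request, accepted or not
        self._commands = {"status": self._status, "scale": self._scale}

    def handle(self, user: str, channel: str, text: str) -> str:
        # Record first, so even rejected requests are visible to the channel.
        self.audit_log.append((channel, user, text))
        parts = text.split()
        if not parts or parts[0] not in self._commands:
            return "unknown command; allowed: " + ", ".join(sorted(self._commands))
        try:
            return self._commands[parts[0]](parts[1:])
        except ValueError as exc:
            return f"rejected: {exc}"

    def _status(self, args: list) -> str:
        if len(args) != 1:
            raise ValueError("usage: status <service>")
        return f"{args[0]}: {self.replicas.get(args[0], 0)} replicas"

    def _scale(self, args: list) -> str:
        if len(args) != 2:
            raise ValueError("usage: scale <service> <count>")
        service, count = args[0], int(args[1])
        # Never scale to zero (destructive) or past the bound (runaway cost).
        if not 1 <= count <= MAX_REPLICAS:
            raise ValueError(f"count must be between 1 and {MAX_REPLICAS}")
        # Sets desired state rather than adding to it, so repeats are harmless.
        self.replicas[service] = count
        return f"{service}: desired replicas set to {count}"


if __name__ == "__main__":
    bot = ChatOps()
    print(bot.handle("alice", "#ops", "scale billing 3"))
    print(bot.handle("alice", "#ops", "scale billing 3"))  # same result
    print(bot.handle("alice", "#ops", "delete billing"))   # not on the allowlist
```

The design choice worth noting is that `scale` sets a target instead of applying a delta: a command expressed as "add two replicas" would not be safe to run twice, while "set replicas to three" is.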
The Infrastructure team migrated off a legacy infrastructure less than two years ago, so things are by no means perfect or complete, and we continue to improve them regularly. By sticking to our core principles, we continue to build a platform that is as easy to use and understand as possible, using common and well-documented tools with as little “magic” as possible, and to continually test that platform in the real world. At Ethos, we believe this is the best way to enable the business to move fast and build great things.
Douglas Land joined Ethos in December 2019 as the Director of Reliability Engineering. When Doug isn’t keeping production running at Ethos, he enjoys spending time with his kids and taking care of (too many) ducks. Interested in joining Doug’s team? Learn more about our career opportunities here.