Last night I was out with a dear friend who has been an engineering manager for a year now, and by two drinks in I was rattling off a long list of things I always say to newer engineering managers. Then I remembered: I should write a post! It's one of m...
It has been 8 years (on Sunday) since I pushed the first commit to Vagrant. Vagrant has grown to something more than I ever could’ve imagined sitting in my college dorm room on that day. Proud of the community and team that continues to carry the tor...
Announcing Consul Connect, a new feature built into Consul for secure service-to-service communication. Upgrade to the latest version of Consul, add a few lines of config, and you're done: mutual TLS between any two services. It's that easy.
Testing in Production, the safe way
- why test in prod when you can test in staging
- how to test in prod while minimizing risk
- how to test configuration changes in prod
- why proxies are your best friend
- what to monitor
- and ...
When I was getting ready to join Kickstarter as VP of Engineering, Chad Dickerson (who was the CEO of Etsy when I worked there) offered to send me a bunch of advice. Chad had been a CTO multiple times before being CEO; he knew that this executive-lev...
The postmortem for yesterday's GCP outage is up.
Root Cause - a bug in a feature that had been "dark launched" triggered by a config change
- safer feature flagging/dark launches
- config changes need to be canaried and gradually rol...
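One way to picture a canaried, gradual config rollout is a deterministic percentage gate. This is a minimal sketch, not the mechanism described in the postmortem: the function name `in_rollout`, the change name, and the host names are all hypothetical. Each unit (a host, a cell, a user) is hashed into a stable bucket, so raising the percentage only ever adds units and never reshuffles the ones already exposed.

```python
import hashlib


def in_rollout(unit_id: str, change_name: str, percent: float) -> bool:
    """Return True if this unit should receive the change at the given percentage.

    unit_id and change_name are hypothetical identifiers chosen for illustration.
    The hash makes the decision stable across calls and across processes.
    """
    digest = hashlib.sha256(f"{change_name}:{unit_id}".encode()).digest()
    # Map the first 8 bytes of the digest to a bucket in the range 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10000
    # percent is in 0..100, so percent * 100 is a threshold in 0..10000.
    return bucket < percent * 100


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(1000)]
    # Widen the rollout in stages: a small canary first, then everyone.
    for stage in (1, 5, 25, 100):
        exposed = sum(in_rollout(h, "new-config-v2", stage) for h in hosts)
        print(f"{stage}% stage: {exposed} of {len(hosts)} hosts")
```

Because the bucket depends only on the change name and the unit, a unit that is in the 5% stage is still in the 25% stage, which keeps the canary population consistent while the rollout widens.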
'Everyone wants infrastructure software to be free and continuously developed by highly skilled professional developers, but no one wants to pay for it. The economics of this situation are unsustainable and broken.'
- managing backward compatible APIs
- API evolution
- building an API for a maintenance/status page
- the awesomeness of Consul for driving dynamic configuration updates
In other words, why I feel jaded about APIs and excited about ...
Consul users, read this important security notice. Under specific configurations, Consul could be vulnerable to RCE and we’ve identified malware in the wild that specifically targets this. We’ve backported fixes and documented it here.
Wrote down my thoughts on why the quality of on-call is a direct reflection of a team/organization's engineering skills as well as culture/priorities.
As well as some thoughts on a more humane on-call culture.
I've begun to gather accessible (not behind paywalls) resources on Resilience Engineering in an attempt to further bridge the greater software engineering/operations worlds with the field.
Links and teaser excerpts included:
The dirty little secret about DevOps is that everybody talks about what it means for operations teams, and hardly anybody talks about what it means for software engineers. Which is possibly even more important!
Operations is not really a dedicated...
Lately I've been doing some career counseling for people off Twitter (long story). The central drama for many people goes something like this: “I'm a senior engineer, but I'm thinking about being a manager. I really like engineering, but I feel like ...
We are excited to finally have all our development, test, staging, and production environments managed with Terraform. There are many new features and improvements we have planned for 1Password, and it will be fun to review new infrastructure pull re...
Today, a fairly significant bug was found in runc that would allow an attacker to gain root-level code execution on the host from within a container (CVE-2019-5736).
If you're running on GKE with Ubuntu base nodes, please upgrade to the latest vers...
Cracking read from Airbnb on resilience engineering. So many good bits on load balancing, server side queueing with CoDel and Adaptive LIFO, back pressure, the need for client side rate limiting balancing *in addition to* server side, load shedding ...
We now host a demo Vault cluster with the UI. It's a real Vault cluster, go crazy! Fun fact: we use Sentinel policies to prevent some really bad behavior. Another fun fact: it's all running on Nomad, and we use periodic jobs to reset it every hour.
The service catalog sync functionality syncs Kubernetes services to the Consul catalog and vice versa. This enables cross-cluster or cross-platform service discovery using the native service discovery tooling expected.
Willy covers multiple aspects of observability using the HAProxy load balancer. He also tries to suggest the smallest set of very relevant metrics to watch in order to detect when something starts to go wrong, and immediately spot what, where and hel...
C Is Not a Low-level Language
After looking at root causes of Meltdown and Spectre, some C programmers continued to believe they were programming in a low-level language, when this hasn't been the case for decades.
HashiCorp Nomad provider for Virtual Kubelet connects your Kubernetes cluster
with a Nomad cluster by exposing the Nomad cluster as a node in Kubernetes. By
using the provider, pods that are scheduled on the virtual Nomad node
registered on Kubernetes ...
A few key SRE practices I strongly believe in:
🌟Do incident review (post-mortem) action items
🌟Use error budgets
🌟Dig into failures and learn from them
🌟Measure Availability & Durability
🌟Focus on business success metrics
I wrote about this he...
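To make the error budget idea concrete, here is a small worked sketch. The function names and the 30-day window are assumptions for illustration, not something taken from the linked post. An availability SLO of 99.9% over 30 days leaves 0.1% of 43,200 minutes, or about 43.2 minutes, of allowed unavailability.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo is a fraction, for example 0.999 for "three nines".
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of error budget left after the downtime observed so far.

    A negative result means the budget for the window is already spent.
    """
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    left = budget_remaining(0.999, downtime_minutes=30)
    print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
```

A team that tracks the remaining figure can use it as a shared signal: plenty of budget left suggests room to ship riskier changes, while a negative number suggests shifting effort toward reliability work.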
Here is some data: the last incident your company experienced lasted 54 minutes. What insight does this data reveal besides a) an incident happened, and b) it lasted 54 minutes (at least according to someone interpreting an event as an incident)? Wha...
Wrote about mental models and how decomposition of software layers should reflect this. Where I propose a “hierarchy of needs” of sorts for code.
With insights from Rob Pike, allspaw and more. Thanks, as ever, to mononcqc for reviewing a draft of...
I just learned about this amazing resource ⚡️ ⚡️
It's a repository of self-care resources for developers & others. This interactive flowchart that walks you through some self-care basics is especially cool.
8 months ago, I complained that it was difficult for customers to try GCPcloud Stackdriver
Today we’re announcing the Stackdriver sandbox - a one click demo environment to explore Stackdriver with production-like workloads, fault injection, and mo...
GitHub Universe 2018 was low-key revolutionary. Executives avoided silverback displays, but the underlying message was clear. Where a previous generation of executive screamed “developers developers developers”, GitHub simply rolled
When humans take the drug MDMA, versions of which are known as molly or ecstasy, they commonly feel very happy, extraverted, and particularly interested in physical touch. A group of scientists recently wondered whether this drug might have a similar...