Why a data-driven transformation requires a cultural shift
Over the past few years, online marketplace Etsy has made that transformation. Data and metrics now support the entire company's operations, says CTO Kellan Elliott-McCrea. At Etsy, there's a saying: "If it moves, graph it."
"And if it doesn't move, graph it anyway because it may make a run for it," Elliott-McCrea says.
About five years ago, Etsy initiated a relaunch of its site and a complete rearchitecting of its technology, in large part to improve its search capabilities.
Big but unique
Etsy is a huge online marketplace, but it is also unique in that many of the items for sale on Etsy are one-of-a-kind, something buyers have never seen before. Search relevance and recency are challenges for all online commerce, but the challenge is orders of magnitude bigger at Etsy. There are 29 million unique items available on Etsy, and search accounts for nearly 30 percent of all traffic and an even greater percentage of sales, Elliott-McCrea says.
Connecting the right buyers and the right sellers at the right time is "incredibly difficult," Elliott-McCrea says.
"We got into analytics trying to solve the search relevance problem," he adds. "Now it powers nearly everything we do."
At the time, the decision to undertake such a major change was not easy. Search relevance was an issue, but business was humming.
"It was a working business," Elliott-McCrea says. "It was a great community. It was making money. People loved it. But we weren't able to support change."
The site would experience frequent outages when IT delivered new features, and there were silos of data everywhere.
"It wasn't an environment set up to learn, discover and iterate," Elliott-McCrea says. "It wasn't a good deploy story. It was scary to change things."
Setting the IT organization up to "learn, discover and iterate" is exactly what Etsy set out to do. But to learn you've got to try new things. And that means making mistakes.
Elliott-McCrea notes that the traditional ways software developers gain confidence in their solutions are quality assurance and unit testing.
"Both of those are approaches for saying the product we built does the thing we think it's going to do and it doesn't break," he says.
But unit tests generally operate in a local environment, while Etsy is built on distributed systems. And traditional QA tends to slow down the development process.
"We wanted developers feeling that they owned the success or failure of their process," Elliott-McCrea says. "In a distributed system, one of the ways to gain confidence is metrics. I'm going to launch this feature to just one percent of users, just admins, just in the U.S. Did the graph move That's what drove the metrics-driven approach -- the ability to gain confidence in a distributed system with lots of iteration."
5 metrics to start with
After rearchitecting, Etsy started with five metrics:
But that was just the beginning. Today, Elliott-McCrea says, his team is adding about 300,000 new metrics a month.
The metrics are tiered. For instance, Etsy has about 20 tier 1 services with four to 20 metrics each. If an algorithm detects a variation in one of those metrics, it will trigger an immediate response. Tier 2 and tier 3 metrics are less urgent. One tier will trigger a call while another will trigger an email.
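A tiered scheme like this could be sketched roughly as follows. The article does not say which algorithm Etsy uses to detect a variation or which tier maps to which response, so the z-score check, the threshold of 3.0 and the tier-to-action mapping below are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical mapping of tier to response; the actual assignment
# at Etsy is not specified in the article.
ACTIONS = {1: "page", 2: "call", 3: "email"}


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard
    deviations away from the mean of recent `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def route_alert(tier, history, latest):
    """Return the response for an anomalous metric, or None."""
    if not is_anomalous(history, latest):
        return None
    return ACTIONS[tier]
```

The point of the sketch is the separation of concerns: detection is the same for every metric, while the urgency of the response depends only on the tier of the service that owns it.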
The metrics evolved, and continue to evolve, as Etsy learns, discovers and iterates.
"You have to have the culture right," Elliott-McCrea says. "First and foremost, what are we here to do We had five metrics when we started. We were pretty clear about who we were and who we were trying to serve. It's not about doing it all upfront. You have to have a process that encourages learning. You have share your mistakes, learnings, PSAs. If you don't have a process that focuses on that, you're probably not going to get to the right metrics."
A culture of learning from mistakes
At Etsy, that learning culture is based on openly sharing mistakes to understand what went wrong, how it went wrong and what indicators might have informed the person who made the mistake that something was going wrong.
"If you get a 500 error on Etsy, you get this great graphic of a woman knitting a sweater and it has a third arm and she's looking very confused," Elliott-McCrea says. "We give a three-arm sweater award to the person that makes the most spectacular failure in any given year. It takes a fair amount of skill and talent to do that. You can't do it by accident."
Etsy calls its process a Blameless Post Mortem. The idea comes out of Just Culture, a systems design theory that originated in the medical community as a way to limit medical malpractice.
"You assume best intentions," Elliott-McCrea, who has been a recipient of the three-armed sweater award himself, says. "You assume skillful actors and you try to get to the local rationality. I made a choice. I deployed that code to the home page. I had confidence when I did it. Something happened and now it's an infinite loop and I've taken down the website. What was it about what I was thinking at the time that gave me confidence There's no blame, no should haves, no editorial at any point. Here's what I did. Here's what I thought was going to happen."
The intent is to drill down to the root of the error and uncover metrics that can help ensure that particular error doesn't happen again.
"Then you have 30 days to fix the remediation items," Elliott-McCrea says.
The elimination of blame and punishment has created a new and better culture at Etsy, he says.
"One of the things people talk about when they get here is their surprise about how helpful everyone is," he says. "Everyone is on the same side. People aren't afraid to not know something or make a mistake."
The end result has been a dramatic improvement in mean time to detect and mean time to resolve issues, Elliott-McCrea adds.
Security doesn't have to be scary
Even Etsy's security team has taken this philosophy to heart.
"They're friendly," Elliott-McCrea says. "I think one of the huge successes that I see is you see people sending them emails: 'Hey, this weird thing happened to me. Is this normal' Most corporate security is scary. People are scared to talk to them. Here, people think, 'Our security team is friendly and open and I like them. I'm going to forward that along.' We get a lot of false positives, but that's OK."