Find out why we are calling for the death of configuration management in this very informative and important chat with Rob Hirschfeld. The shift in thinking about infrastructure at all layers is changing the way we do Ops…for good reason.
This fun chat expands on what we started talking about in episode 42: Spiraling Ops Debt, SRE Solutions and RackN chat with Rob Hirschfeld as we dive into the challenges and potential solutions for thinking and acting with the SRE approach. Big thanks to Rob Hirschfeld from RackN for sharing his thoughts and experiences from the field on this very exciting subject.
Join this exciting chat with Madhura Maskasky as we discuss the recent news from Platform9 as well as the challenges that are solved by delivering open source solutions as a service and what the next challenges are for today’s evolving enterprise IT organizations.
Join me for a bit of a personal story about a terrible travel day that gave me some good thoughts about the challenges of the 10X engineer. This also introduces something which I’ve called Technology Domestique. Hope you enjoy the story!
Welcome to the GC On-Demand Podcast, a show about people, about process, about technology, about community. It’s great conversations, with great technologists, about things that matter to you, that matter to all of us. Thanks for listening. Don’t forget, visit gcondemand.io for all the show notes. And with that, let’s get started.
Welcome everybody to the GC On-Demand Podcast. My name is Eric Wright, you may know me as @discoposse on Twitter, and of course I’m DiscoPosse in the Green Circle Community. We had a really interesting couple of weeks in the past here. As you’ll notice, I’ve had some challenges around getting the podcast up to date, and that’s because we’ve had a bit of a backlog going on. Number one, had a serious laptop failure, which resulted in losing a few different recordings. Unfortunately not everything was backed up. So hey, here’s the story for all the kids out there, make sure you back your stuff up. So Dropbox is the place to go in my case, and unfortunately, I missed a couple so we are going to go back and rerecord some of those. But in the meantime, I thought this is a great opportunity to take a little time to just get down and personal.
Had a very interesting travel situation that happened, and I wanted to talk about it, and how it relates as far as what happens when we think about 10X engineers. It’s probably a weird correlation that I pulled from it, but the goal that I had when I thought about what I want to talk about is, I’m going to tell you a very true story, it’s filled with challenge, it’s filled with a little humor, and it really is filled with a lot of lessons.
So the other day, I was actually trying to get from New York to Toronto. I’m actually getting ready to go to the Vancouver, or BC Regional VMUG. And by the time you download this, hopefully I’ve already been there, and if I’m lucky enough, I’ll have met you there. Now, what was interesting about this whole process was that I had to be able to get, just like I always do, on a six o’clock flight, and that six o’clock flight would get me to Toronto, 7:30 in the morning, nice and easy. And then the end result, I’m back in Toronto, get a whole day’s work done, get on a plane the next day, and off I go.
It would seem like it’s just that easy, except something happened along the way. There was some weather challenges. So my six o’clock flight ended up being canceled. Now this happens more often than one would admit, really. Now what I find is these commuter runs tend to get bumped all the time. So they’re usually delayed, and that’s all well and good. So what happened with this one was I actually got delayed so long even after the cancellation, they said, “The next flight we can get for you is Wednesday night.” Now Wednesday night, for those who don’t figure out the calendar already, is far past Wednesday morning, which is when I’m flying to Vancouver. So the end result was there’s no way that I was going to be able to get on my particular airline in order to get back to Toronto in time to catch that next flight the following day.
So luckily it turned out to be early enough in the morning that I was able to beat the rush, and I saw there was an opportunity to jump on another airline, which would have flown out of LaGuardia. So I originally had gone out of Newark, and for those who know, there’s three, what they call New York airports. Newark, which is actually New Jersey, it’s not really in New York, there’s LaGuardia, and then there’s JFK. So I saw this opportunity to jump on a 9:40 flight. It gives me lots of time to make sure I can get there, don’t have to worry about… Beat the rush hour. I’ve got… Timing is all good. It can only go right. At least, so I thought.
So turns out that the first flight was canceled because of weather, despite the sunny skies outside. It turns out there were some real weather problems a bit further up in the atmosphere. Seems like about 10,000 feet, things get a little bit weird. It was pretty muddy up there because we’d had a lot of thunderstorms, and there were a lot that were pending, so they decided to actually freeze up the airspace in between Toronto, New York, Boston, and all of this triangular area. So what happened was my 9:40 flight got delayed a little bit, not a huge amount, but enough that it’s just a little bit irritating. So about half an hour later than normal, I got onto the plane for my 9:40. So it’s now 10 after 10, no big deal, still got lots of time. I’m going to miss a couple of meetings, but I can email and we got that sorted out. So I managed to defer a couple of quick meetings, and then I realized, no problem, I’m still going to be on the ground. I have a meeting in Toronto, and I’m going to get there in time.
You see, this is where the fun happens though, because the adventure is not over yet. What happened next of course, was three and a half hours of sitting on the tarmac in order to figure out that, guess what? No planes were flying, regardless of what airline. Now, I hadn’t really thought this could happen. It was in the back of my mind that it was going to be more than just the one airline, but it was worth a shot anyways. Everything looked like it was going on time, and it turns out it wasn’t. It wasn’t at all. So the challenge that I was now facing is, here I am, about three and a half hours into a delayed flight, which kept getting delayed, and delayed, then we got rerouted. So we’re rerouted before we even leave the ground, which results in them saying, “Well, we’ve got to come back out of this holding a launch pattern. We’ve got to get back in because we have to refuel. Because not only do we have to refuel, we have to get more fuel in order to get us on this rerouted path we’re going to take.”
No big deal, I’m good with that. As were the 85 or so other folks that were on this plane. So the next step of course is to wait. We refuel, we’re happily waiting on the plane. Everyone’s snoozing, a few other people got up as early as I did. We get to the point where everything’s refueled, we go back into the pattern, we’re getting ready to take off and there we wait, and wait, and wait some more. And doing all this waiting, what ended up happening was then, of course, about 40 more minutes in, we start to feel the engine surge and they start to roll and I think, Oh thank goodness, here we are. We’re finally getting ready to launch. Or at least take off. I wish we would have launched. End result is that an overhead announcement comes. “Ladies and gentlemen, we’re sorry to announce that we have to go back to the gate to refuel, because we’ve been rerouted once again, and it requires us to get additional fuel.”
Now, I’m not sure what magical fuel tanks they haven’t filled up yet, or whether they’ve just strapped a couple onto the bottom of the plane to this point. I can’t even imagine why it is that we have to keep refueling, because I don’t seem to recall that we were using that much fuel just sitting there on the tarmac. But I’m not a pilot, so I’m not going to question it. We get on back to the gate, and then they let us know, “Well, you know what folks, if you want to get off the plane and you want to take a little walk around, then you’re free to do so. Just stay close to the gate and we’ll make announcements inside the gate area. But of course, if you want to stay on the plane, feel free to do so.” Which I did. So staying on the plane seemed like a good idea, because I figured just in case, the last thing I need is to miss the one flight that I’m actually physically sitting on, and end up missing it for that reason.
Well I’m about 20 minutes in, all of a sudden the next announcement comes. “Ladies and gentlemen, we’d like to let you know to please exit the plane. Take all of your carry on with you. We’re going to ask that while we’re refueling that everybody must wait inside.” I can’t imagine that this is a protocol, because we actually just refueled already while out on the tarmac with everybody on board, so I’m going to call Bravo Sierra on them using some kind of rule book as to making this get in. All of my spidey sense is going off, saying that this is a real problem, and something’s about to happen. But they assured us that everything was going to be okay. I give them the benefit of the doubt. Head inside, only to find out after about another 30 to 40 minutes in, they announced that our flight is in fact canceled. So here we are, cancellation number two for the day. And what ends up happening? Every flight out to Toronto was canceled through the entire day, and potentially the entire next day.
This leaves me in a real quandary, because I need to get back, I need to get out of New York in order to get to Toronto to get to Vancouver. I could probably try and get to Vancouver directly in two days, but goodness gracious, who knows what happens if the problem is in the Northeast region, that’s where I’m going to get caught. So at any rate, I decided to get the longest possible route, which is to take a bus. So I got on the Megabus. And for folks that have been in the East side of the country, you’ve probably been through Toronto, Montreal, Ottawa, New York, Boston, they have this neat little thing called the Megabus. It’s like Greyhound plus. They’ve got wifi and plugs, although the wifi didn’t work, nor did we find any of the plugs under most of the seats. And end result was we all sat there, a full busload in fact, as we prepared to board this 6:50 bus, which would take us from New York, down by the Hudson, over to Toronto. It’s a 12-hour ride, three half-hour stops.
You can imagine that this is pure desperation that’s driving. Most of the folks on the bus were actually on the planes that I had been waiting to get to, so I wasn’t alone in my need to get to Toronto. Except the fun part was, we still had a couple of hours to wait. In fact, we had three hours to wait. And if you haven’t ever taken the Megabus, there’s actually no shelters for the Megabus. See, the Megabus just parks along the side of the street down by the Hudson River on 34th between 11th and 12th, which is precisely in the middle of nowhere, uncovered. And guess what? That very same weather that keeps planes on the ground, well, it does a mighty fine job of soaking about 80 people who were sitting in that bus lineup waiting to get on this bus. So in a downpour that I would only describe as biblical at this point, because it was terrifying how much rain was coming down.
In fact, cars were stopping because it was unsafe to drive. And there we all were like wet rodents, getting poured down on. And then we finally get on the bus, and the bus is half an hour late, so we sat for half an hour longer in the rain. We get on the bus. The bus is blasting with air conditioning, so we’re fairly sure we’re all going to at least get sick, if not suffer pretty greatly during this whole trip. They managed to get all that sorted out and anyways, we made it back to Toronto safely. It was about an hour behind, and we got there. They got the heat working, and we all dried off slowly over the course of many hours, and we made it out scot-free. So the total travel time from the 3:40 wake up that I had in the morning, turned out to be about 29 to 30 hours of travel time.
It feels like I could have ridden a bike in that time, but anyways, what’s the point of the story? First of all, it was crazy fun. It was weird. At some point, you become so disturbingly angry about the fact that you’re trapped in this situation. There’s no way out, and I thought, well, at least if I get on something that’s got wheels, there’s a chance it’s going to make it out. Luckily it did, and while it’s a long way, I knew that I needed to get into town in order to catch this crazy next plane. All the while I’m unable to really do a lot of work, but then it hit me that maybe this is a bit of forced time. Maybe there’s a good reason. My wife was very good to remind me that there’s good reasons why some things happen, and maybe something else was avoided in order to get me onto this particular journey.
While I may not have been the dog, the cat and the bear, or whatever that crazy story was in the original Incredible Journey, it felt pretty incredible while I was experiencing it. I made it through okay, got a reasonable amount of fractured sleep during the process, and now I’m getting ready to go to Vancouver. The good thing about this is it taught me a neat lesson. In slowing down, I had a sudden realization that I was used to doing a lot in a very short period of time. Sometimes this happens, and it’s a bit of a self regulating thing. And this is the 10X problem. If you’ve heard about 10X engineers, I don’t even actually know what the origin is directly. There’s obviously a lot of talk around Google and a few big engineering shops, and their idea of creating these super productive, heroic, type of folks that are able to do amazing things.
I always tell people that if you have a 10X engineer, you find me a 10X engineer, and I’ll find a team of people wrapped around them who’s really tired of having to be compared against this 10X person. When you’re the 10X person, it’s even weirder, because sometimes you need to slow down a little bit. So suddenly you find yourself delivering at about 6X, and you’ve got people walking up to you saying, “Hey, is everything okay? I noticed that you’re slowing down a bit.” We’ve created this artificial floor now that you have to perform at a particular level or above. The danger in doing that, is it celebrates overwork, it celebrates stress, it celebrates these heroics.
Now, I’m not going to say that I don’t participate in a lot of it myself. I’m not going to say that I don’t celebrate it amongst a lot of folks in my team at work, and in the community and all over the place. I’ve always done it. But my goal was to at least be like maybe two and a half X. It seems like a good number. You can be 10X some days, but let’s level off here and there, and have it average out. I can only warn you out of this. What ends up happening was in the quest for heroics, I ended up very humbled, and having to spend 30 hours to do what should have taken an hour and a half to get done. That was a reminder that 10X is not a permanent condition, and we have to think about that as we slow down. And maybe you should force yourself into a situation to slow that thing down.
There are lots of really good authors out there that write about the 10X value, but also the dangers of going out too hard continuously. Steve Prefontaine, famous runner, of course Steve was famous for leading out of the gate, and what would happen is that he lost a significant number of his early races because he was eating the wind the entire time, and then he would basically trail off at the end because someone would be in his draft. And while you’re running, it doesn’t seem like you can get a whole lot of drafting done, but actually a surprising amount can be done. Now, obviously, he became better at it with time, but again, just because you can perform at 10X doesn’t mean you can perform at 10X all the time. Be careful, tread carefully. We only have one body, one soul and one family. We’ve got to make sure we take care of all of those three things.
And this is where I thought to myself, maybe being 10X isn’t the right idea. As a cycling fan. If you know, in cycling teams we have the leader, let’s just call them your 10X person. You’ve got the entire team that wraps around them, and they are what we call the domestiques. Domestiques effectively do domestic tasks that make sure that the leader’s protected. They’re kept from the wind, kept in a group, protected from falls, make sure we get their food for them, and make sure that we go out in front and basically ravage the front of the race in order to do some damage to some individuals on the other teams. And meanwhile, we have the rest of our team wrapped around our leader to give them lots of drafting, and make sure that we can save their legs for the big finish, or something.
So I would encourage people to become a technology domestique. Being a technology domestique means, taking time out to do something incredible, to be heroic, but in the service of somebody else. That’s why I do community. And community is a very powerful thing, because you can do something heroic and then hand it off to somebody, and they get the benefit from it. And in turn, of course, you do gain benefit.
This is the real value of community, and why I want to make sure that we all take a moment when you feel like you’re trying to do 10X delivery and you can’t stop for a second, find somebody else that you believe can help you through that, or that can take that lead for you. And what do you do? Go out on the front, eat the wind for a while for them. Become a technology domestique. I can’t tell you that it’s a perfect world. I can’t tell you that it always works. I can’t tell you there’s always somebody there to take the wind for you. But what I can tell you is that you’ll be there often enough, that the best team member is one who can be the domestique. And then who knows? You may have your time out in front, that everybody else takes the wind for you, and then you get to take the win.
So think about that as you head down some incredible journey that you feel is holding you back. You feel like you’re not getting enough done. I really honestly felt my reputation was on the line because I was canceling meetings, I was disappointing folks. I had become 10X what I really needed to be, dangerously so. That day off, or at least day on the road, it wasn’t a day off, got a lot of work done. But what it gave me was the ability to rethink things, again. We need to revisit this all the time. And I would encourage you to become a technology domestique, and if you haven’t got one and you don’t need to be one, then find one, because this is what the community’s for.
With that, I would encourage you, you want to see good technology domestiques? I like to say that I am one for the Virtual Design Master community. Also, the time we’re doing the Vancouver VMUG, I’ll be giving a containers conversation there for the keynote at lunch, which is going to be a lot of fun. But in fact, the very same day, on Thursday, June the 22nd at 8:00 PM Eastern time is the premiere of season five of Virtual Design Master, the one and only IT reality competition. And we’re looking for domestiques, and we are going to be that for everybody in that community.
So we’ve got a great group of folks which are joining us. We’ve got a great set of judges. So go to virtualdesignmaster.io, you can read about it there. You can get involved. You can send a shout out. Follow along on Twitter, it’s #virtualdesignmaster. Reach out to me, of course. I’m @discoposse on Twitter. Follow along with the community. Our creative team, which is @vmiss33, of course Melissa, and Angelo at @AngeloLuciani, L U C I A N I. Join us there. We will find ourselves some 10X folks, and we’ll try and give them the tools they need to be 10X for five weeks, and then hopefully we’ll give them a break and we’ll eat the wind for them.
With that, thank you everybody, and we’ll talk to you next week on the next GC On-Demand.
If you like what you heard here, and want to hear more, don’t forget to subscribe to the GC On-Demand Podcast. You can go to gcondemand.io, where you’ll find the links in order to catch us on iTunes, Stitcher, the Google Play store, and more. Go to gcondemand.io. Don’t forget to rate us in your podcaster of choice, and look for much, much more. Have a show idea? Tweet us, @gcOnDemand. Thanks for listening.
Join us as we continue our conversation with Randy Shoup with a focus on the people and process side of building out successful development and product teams. Randy shares tips and proven techniques that have helped him to build and support successfully distributed teams, and to reach goals that many of the product developers and product managers have been hoping to get to as we move further in the evolution of the technology ecosystem.
Randy Shoup joins the GCOD for a very informative and candid chat on development velocity, the DevOps goal, finding traction, and how he has led teams to success with a variety of products and platforms.
Randy is working on some very interesting things at StitchFix, and brings a storied career and set of experiences to the conversation. Big thanks to Randy for sharing what could be some of the best tips to finding success with a higher velocity of development including some of the tools and techniques that will help us all get there.
Welcome to the GC On-Demand podcast, a show about people, about process, about technology, about community. It’s great conversations with great technologists about things that matter to you, that matter to all of us. Thanks for listening. Don’t forget, visit gcondemand.io for all of the show notes. With that, let’s get started.
Welcome, everybody. It’s … Welcome back to the GC On-Demand. We’ve got an exciting time because there’s been so much growth going on with some of the conversations we’ve had here on the GC On-Demand and I’ve been lucky enough to really, really cover such a wide audience of folks that can come and talk to us about neat things that they have done in the industry, both from a product and a people perspective. Most importantly, as we’ve progressed as listeners throughout this, I think this is a perfect time that I get to introduce our special guest today.
With that, I’d like to welcome Randy Shoup to the show. Randy, you have a very exciting story in IT. So first of all, let’s get started by if you can introduce yourself, tell folks where we can find you online and we’re going to talk a little bit about you, Stitch Fix and a whole lot about DevOps and all things wrapped around it.
Excellent. Well thanks, Eric, thanks for having me, this is great. Yeah, so I’m Randy Shoup. You can find me on Twitter at @RandyShoup. I can also be found at Randy Shoup on LinkedIn and other places around the web. If you Google my name you’ll also see a bunch of presentations that I’ve given on various topics, DevOps, scalability, engineering culture. I’m pretty easily Googleable ’cause my name isn’t super common.
I’m currently VP of engineering at Stitch Fix here in San Francisco. I’ll talk a little bit later, I think, in the podcast, I hope, about the kinds of things that Stitch Fix does. I’ve been here for a year.
Earlier in my career, I was chief engineer at eBay for about six and a half years and I helped to build out eBay’s search infrastructure. I did a stint at Google running engineering for Google App Engine, so that’s Google’s platform as a service, like Heroku or other platforms you might be aware of.
I started my own little startup with a former eBay colleague and learned how difficult it is to do a startup.
I actually did a stint as the CTO of a gaming company here in San Francisco for a while, so I’ve done a bunch of different things.
You worked for a couple of little companies. We’re going to definitely bring you back ’cause we want to talk about that startup. That in itself would be a really exciting chat on its own. What I want to talk about today, let’s just start with the elusive topic that everybody seems to be chasing these days, or at least for the last little while, Randy, which is DevOps. The idea of creating these toolkits and these processes that wrap around what DevOps is, and I think that’s probably the best place to start. If you don’t mind, Randy, tell me in your mind what exactly is DevOps, what’s it trying to achieve and what are your thoughts on that side of the world?
Yeah, great. I think of DevOps as the modern approach to deploying and managing software, sort of in the world. We all know that the classic enterprisey approach to stuff is you have a bunch of people that build things, the developers, and you have another set of people that operate those things, the ops folks. What I like about the modern world is that we’re breaking down that, to my mind, artificial barrier there. One of the things we do at Stitch Fix and we also did at Google is we don’t have this huge wall where there are the people that write the code and then they throw it over the wall to the people that operate it and everybody hates the people that are on the other side of that wall.
Rather, the better way to deploy and manage software is to have everybody be able to do this stuff. The way that we did it at Google is that for the most part the initial time when somebody is running a service or running an application, it’s actually the developers, the same people that are writing the code that are operating it, so they’re the ones that carry the pager, they’re the ones that are responsible for it performing and being reliable in the real world.
We do that similar approach here at Stitch Fix. Stitch Fix, as I didn’t mention before, is an online clothing retailer. Our idea is we turn retail sort of upside down so rather than going to a traditional retailer, whether online or physical and going and choosing the things you like, you tell us the things that you like and we send you five items in a box that we think you’re going to enjoy. We do a whole bunch of data science associated with that, that’s maybe a whole another topic. In terms of the engineering that we do, we want to make sure that the engineers that we hire build and maintain the software themselves.
We have the same people that are on … The people that build a particular set of applications at Stitch Fix, the same exact engineers are the ones that write the software, they’re the ones that make sure that the software works correctly, they’re the ones that make sure that the software performs and they’re the ones that make sure that the software is operated. We don’t have a separate QA group, we don’t have a separate performance group, we don’t have a separate ops group, it’s all one set of engineers.
Why would we do that? It has this wonderful sort of synergistic effect by not throwing a thing over the wall to some other guy, I am now responsible for it and it means that I’m strongly incented to make sure that my thing is going to perform well and is going to work well in the real world. My incentives are definitely aligned, but also it actually makes it easier for me to do that job because once I know that I’ve set up the monitoring, I know how things are being deployed and how things are running in the infrastructure. There’s not this need for passing tickets back and forth or constant coordination back and forth where I understand half the problem and the ops guy understands the other half, if that makes sense. I, as engineer, am able to do it all and that is wonderful. Our engineers love it.
I love that approach. One of the recent podcast chats I had was with Rob Hirschfeld and we talked about the SRE and, of course, being of Google history you’ll know around this whole concept of the SRE. We’ll talk about that actually in a few minutes, but it’s funny, you talk about different things that your team does and I love this idea of like a singular, you’re responsible from end to end. When you’re doing stuff like this and creating processes and stuff wrapped around it, how much of the toolkit do you have available out of stuff that’s out there today or when you’re looking at starting something brand new, like what you’re doing with Stitch Fix, how much of everything in your workflow is out of the box versus what you have to develop yourself to kind of be individually mapped to how you do things?
Totally. It’s wonderful to work … I’ve been in the industry for a long time, since 1990, and 2017 is the best time to be a developer of any kind. Why? It’s because all these things that used to be available only to the eBays and the Googles and the Amazons of the world are now available to everybody. At Stitch Fix we run all of our infrastructure on AWS. Most of our applications that we build in the engineering team are hosted on Heroku, which maybe people will know also is on AWS. We’re in the process of actually migrating from having all of our applications on Heroku to using Amazon’s Elastic Container Service, so Docker, basically running Docker in the cloud. Happy to dive into any or all of these areas.
To answer your question about how much of the toolkit is available kind of off the shelf, it’s the vast majority of it. That’s what’s wonderful about 2017. We can go to a cloud provider like AWS and get as many machines as we can afford. We can spin them up in minutes or even seconds and that’s huge. Also out of the box from AWS and other cloud providers is the ability to monitor the stuff and control them through APIs. I don’t have to do a lot of jumping up and down and running around in order to be able to see what’s running and see how well it’s running.
One of the things that you sort of touched on, like Stitch Fix being new, which it totally is, so the company itself is only six years old and so that means that we have had the benefit of being able to start afresh in a modern kind of more Greenfield approach. We started with Heroku, so platform as a service, which has wonderful properties in terms of developer productivity and developer power, and we’ve also constructed our toolkit or our stack out of a bunch of different software as a service kind of elements. Obviously the platform as a service straightaway. We’re running on … We have leveraged databases as a service, so first in Heroku’s world and now in AWS’ world in the form of the relational database service that AWS offers.
We also use hosted Elasticsearch. We use hosted messaging in the form of CloudAMQP so that’s a message provider, basically a hosted RabbitMQ. We use hosted bug tracking. We use hosted paging in the form of PagerDuty. Hosted Slack, like … We do not have a single piece of hardware that we own ourselves other than our laptops and so nothing that a customer or an employee at Stitch Fix would use in their daily job is part of any physical … We don’t have any physical data center presence anywhere and that’s awesome. You could be like me too.
In the 2017 world, I would suggest that when you are able to start afresh you should first think of what can I do … How can I leverage something that somebody else is maintaining and sort of running for me and just pay them to do it and focus yourself on the thing that actually differentiates yourself, very starkly different. When I started in the industry even as a startup, let alone a big company, the first thing you have to do is find some data center space and buy some physical machines and get them sent in. It’s laughable. We all lived in that world and it’s laughable, not in like it was bad but just the sharp difference between what we had to go through to get computing power even 10 or 20 years ago and what you have to do now it’s just unbelievably different.
Yeah. It’s actually funny in talking with the founders of Turbonomic, we literally started as a startup in Yuri’s garage. They said there was a rack of servers and the neighbors were like, “What’s that noise, Yuri?” “Oh, I’ve got a startup in my garage.” You literally just need a broadband connection, that’s the only infrastructure you need, everything else is available as a service. You’ve kind of … You hit the whole stack there, Randy. I guess the other thing is, if you think about any single one of these things they’re almost all available for free to get started, too. You can really kick the tires on something, you’re not even getting into like multi-year, multi-month commitments. Pretty much most of these have a developer offering, which is up to a certain number of nodes for free. That’s crazy.
How does that feel to you to, like you said, when you started in the industry, if you’re coming out of school now, you could run a startup without even spending a single dollar, really, before you know that you’ve got to spend a little more time and get a couple more people on board. It’s a pretty cool time, right?
It’s a wonderful time. Yeah. Every year for the last five has been the best time to start a startup ever, you know what I mean? Exactly for that reason. Right, yeah you can start for free or as close to free as you can imagine. You could start a startup with a bill of zero or a hundred dollars or something like that. If, and only if, you hit it big, then is when you start paying and frankly that’s when you should start paying, right? Once you’ve kind of sorted out, I’ve got a business model, I’ve got product-market fit, now I’m ready to go and that’s when the bills really come. Yeah, it’s wonderfully empowering.
Yes, how does it feel? It constantly amazes me, to be frank because to remember back, we didn’t have any of this stuff. We didn’t have open source. We didn’t have cloud. We didn’t have all this stuff that’s sort of at our fingertips: Mobile devices, super computers that we carry in our pockets. It’s just an amazing time to be alive and it’s an amazing time to be a technologist, it’s just great.
I think every next thing is as exciting and as rapid on the innovation rate as the previous thing. The players are changing too, which is interesting, like you said, open source, it was Linux. When people said, “Oh, I use open source …” it’s like, “Oh, yeah so you run Linux servers?” No, you could run anything at any layer of your stack. That’s such a huge opportunity throughout, again, like the full stack that you talked about in these different platforms. Some of them obviously are AWS-specific like with RDS and whatnot, but they’re using core primitives that you could then port to another database platform as long as it’s using one standard style of database call, it’s all about the right abstraction.
When you think about the products that you’ve laid out in here, I like how you started with Heroku. How important was platform as a service to be the starting point for you versus building your own basic stack?
Yeah, thanks for asking it in that way. Yeah, it was hugely important and I can’t say enough great things about Heroku as a product and the Heroku organization. I say that this … Like I say, I used to run engineering for Google App Engine so I appreciate A) What a benefit platforms as a service are and B) I know how hard they are to run and all that. Yeah, Heroku was a huge benefit for Stitch Fix. When I joined Stitch Fix a year ago we had about 25 engineers on staff. We currently have about 75, so it’s grown about 3X in the time I’ve been here just over a year.
When I arrived, we had 25 engineers, all of whom did full stack Ruby on Rails development, none of whom did any aspect that you would call infrastructure or platform ops at all. That’s not because that’s not an important thing but because that’s what we were paying Heroku to do and that’s what Heroku was doing really well. The fact that we were able to go as far as we did on having essentially no investment in people or infrastructure from ourselves was hugely valuable, just an incredible force multiplier. At the same time, Heroku has a sweet spot and their goal is to make things easy and they do a great job at that. At the same time, as their customers, us in particular, get larger, we have requirements that are not the same, requirements that are sort of the next level up or the next level down or deeper, however you want to look at it, than those of the small and medium-sized businesses that are their bread and butter.
To be fair to them, the stuff that we want them to build is not stuff that they are going to build for anybody else other than their biggest customers. Does that make sense? Our motivation for moving off is not that there’s anything wrong with it but that we have stuff that, I’ll take a step back and like … I need this thing from you guys, you shouldn’t even build it. It’s not a thing you should … That’s not in your sweet spot for your standard set of customers. I know that ’cause I used to run one of your things, like App Engine, so I know where you’re coming from and you totally shouldn’t build it, but I need it and sorry.
What we are needing now and did not need before, even a year ago, what we need now is more transparency and more control over all the areas of the stack. We need more security in the form of Amazon security groups, VPCs. Happy to talk about why and the details of all these things, but all the richer sort of next lower level control over security that we get by running directly on an infrastructure as a service provider and we just want, like I say, we want a lot more sort of transparency in different areas of the stack.
I will say, and maybe this is the obvious next question: Should we have started where we are now before? Answer: No. In fact, answer: Hell no. When you are small and medium-sized and are in the proper scope of one of those platforms as a service you totally should be there. You should just totally take advantage of it and not build out things that are undifferentiated heavy lifting, to use Werner Vogels’ term. When, if you should be so lucky to be in the 1% of the 1% of the 1% of companies, that’s the time that you should step up and start taking it over yourselves. Does that make sense?
If you start with a complete self-built full stack, that’s like reading The Goal backwards and thinking you’ve done it right. You’ve just … Why not do something that removed those constraints early and then discover them and so effectively you’ve done it. You have now hit a point in the evolution of Stitch Fix and your team where you’re like, “Okay, our constraint is transparency, so how do we attack that constraint? Easy. We have to then break out the stack and this is the way in which we’re going to do it.” Like you said, if you had done that early you’d be months in just to get to the point where you’re like, “All right, perfect. We’re ready to put our first hunk of code into production now.” It’s like such a reversal of the whole purpose of high velocity SRE roles. It’s like every piece of code should go in.
I’m going to ask you this, and I don’t know how much you can share, but we always hear about commits, push to production. Gene Kim, and I love hearing Gene talk and he always talks about, “Yeah, these folks are doing like 2000 commits a day and every single one goes straight to production.” When you think of a Stitch Fix and so many other companies you’ve helped to advise and work with, Randy, what are real honest people out there doing in those first stages, rather than Netflix doing 17,000 commits a day or whatever it is? When you do those first stages of adopting this DevOps process, where do you find a lot of people really are versus the big guns stories that we hear about?
Yeah, sure. The answer is, it depends very much on where you come from. In the Stitch Fix case, we had that benefit of starting essentially Greenfield five years ago, right? We started on Heroku, which very naturally gives us continuous delivery. Three things are core to the way that we approach engineering. We do test-driven development, like we actually do it, so we actually write the tests before we write a feature. We do continuous delivery, so that came essentially for free by starting our code in GitHub and connecting that up through a CI pipeline out to Heroku, I’ll talk about that more in a moment. We practiced DevOps from when we started, so we believe that it’s the same team and the same individuals that should be owning the stack fully end-to-end.
When somebody’s starting out, if you are starting out as we were able to do afresh, go to a platform as a service and you get that continuous delivery stuff for free. For us, every commit that goes to our master branch runs all of those automated tests that we wrote in our TDD phase and all those things get packaged up into a deployable artifact, they get deployed to Heroku and there we go. We’re not doing 17,000 ’cause we only have 70 people that are writing code, but every one of our applications, and we have about 40 or 50 individual small applications, we tried very much not to write the monolithic application. Every one of those applications is being deployed multiple times a day.
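The commit-to-deploy flow described above can be sketched roughly as a chain of gated stages. This is only an illustration, not Stitch Fix’s actual pipeline: the stage functions `run_tests`, `package`, and `deploy` are hypothetical stand-ins, and the rule that a commit ID starting with "broken" fails its tests exists purely so the example has something to fail on.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[str], bool]]


def run_pipeline(commit: str, stages: List[Stage]) -> List[str]:
    """Run stages in order for one commit; stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed: List[str] = []
    for name, stage in stages:
        if not stage(commit):
            break  # a failing stage blocks everything after it
        completed.append(name)
    return completed


# Hypothetical stand-ins for the real steps (test run, packaging, deploy).
def run_tests(commit: str) -> bool:
    # Assumption for illustration only: "broken" commits fail their tests.
    return not commit.startswith("broken")


def package(commit: str) -> bool:
    return True


def deploy(commit: str) -> bool:
    return True


STAGES: List[Stage] = [
    ("test", run_tests),
    ("package", package),
    ("deploy", deploy),
]

if __name__ == "__main__":
    print(run_pipeline("abc123", STAGES))
    print(run_pipeline("broken-456", STAGES))
```

The design point is simply that the tests gate the deploy: a commit whose tests fail never reaches packaging, which is what makes deploying every commit to master safe enough to do many times a day.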
What are people really doing? I am really doing that separate. The other unasked question, or the other half of your question is: What if you’re not a Greenfield situation, what if you’re coming from a situation where it’s more traditional enterprisey. Then yeah, then you’ve got to take steps. It can be … I will come back and answer the question.
Ten years ago when I first started talking about, I was then at eBay, and started talking about eBay’s architecture, people were shocked and amazed that eBay released the whole site every two weeks. I would get up there and talk about, “Yeah eBay, we release the whole site every two weeks.” They’re like, “Oh my god, every two weeks, that’s amazing.” Imagine my doing that today. I get up and I talk about, “Yeah we’re Stitch Fix and we release the whole site every two weeks.” They’re like, “Two weeks? Oh my god.” That used to be impressively awesome and now it’s not so-, so yeah. There’s no shame in coming from, like hey people have successful businesses and they release things once a month or once every two weeks or once every day.
I am a strong believer in the Gene Kim, Jez Humble philosophy of the more that you … You get so much benefit out of shrinking that cycle time, shrinking the time from the idea I have to code that I write to it runs in production. Please just go do that. You’re not going to get from one month to one second in a day and anybody who tells you differently is selling you something probably. You can get from a month to a week and you can feel proud about that and then you can get from a week to a couple of days and you can feel proud about that. Obviously there’s lots of aspects of changing the culture and the development process, getting the tooling in place to do that stuff, but the wonderful thing is that this is a pretty well-paved path.
The fortunate situation is for companies that currently have a more enterprisey life cycle, a slower life cycle, it’s a pretty well-paved, lots of people have gone from there to here, if that makes sense. It’s not like you have to be the trailblazer and discover new ways of doing the thing, if that makes sense. It’s a kind of … I don’t say it’s an easy path but it’s a well-trodden path, it’s a paved path.
It’s really, like you said, the ranges and stages are important and that’s why I try and press upon folks out in the community when I talk to them, I said you don’t have to be eBay, sorry I always use … I use your former team as an example, or LinkedIn or PayPal or whatever it is, but if you’re enterprise and it’s taken you six months to get something in production, if you can do it in three, that’s like let’s drink, let’s party and celebrate that ’cause then we can do it in two, one, three weeks, then you can shrink those cycles but at least let’s just reduce that time to getting that code into production and make those processes better, which is pretty cool.
I’m going to ask you about …
Totally. Yeah, I mean if you can reduce the … I mean just to underline that a little bit, if you’re doing one release every six months and you can make it to three, now you’ve doubled your velocity and there’s no shame in that. That’s awesome because now you can either get double the amount of features and capabilities out to your customers and/or you can make them that much more reliable. Why? Because when I double the size of a thing it doesn’t make it only twice as complicated and twice as potentially bug-ridden, it’s four times or eight times. There’s an exponential, or at least geometric relationship between the amount of code I’m changing and the potential issues that I’m introducing. It’s not linear at all.
When you can go from the six month to the three month, you’ve made everything better, like noticeably. Then you get from the three month to the one month and the one month to the one week, yeah you will find as you do that you’ll just get better and better and your customers will thank you for it. Your customers will thank you for, “Hey, your releases are not only coming faster, they’re more reliable.” That is, people should definitely read The DevOps Handbook, which is Gene Kim and Jez Humble and John Willis and all those guys, Patrick Debois. As you get faster you also get better. It is not … It’s a wonderful thing where you’re not choosing between, “Oh, should I be fast or should I be good?”
The faster makes you gooder, you know? The faster you get the better you get, so yeah.
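One rough way to see why smaller releases are disproportionately safer is to count how many pairs of changes could interact in a single release. This is a sketch under an assumption that is not from the conversation: that risk tracks the number of pairwise interactions, n(n - 1)/2, among n changes. The function name and the change counts are made up for illustration.

```python
def pairwise_interactions(changes: int) -> int:
    """Number of distinct pairs among `changes` changes: n * (n - 1) / 2."""
    if changes < 0:
        raise ValueError("changes must be non-negative")
    return changes * (changes - 1) // 2


if __name__ == "__main__":
    # Illustrative release sizes only; not figures from any real team.
    for n in (100, 50, 10):
        print(n, "changes ->", pairwise_interactions(n), "possible pairs")
```

Under that assumption, a release of 100 changes has 4,950 possible pairs while a release of 50 has 1,225, so halving the release size cuts the possible interactions by roughly four times rather than two.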
I could literally take 12 hours and just cut it into hunks and make it a 48-part series with you, Randy. Before we finish up in this session, I do want to talk a little bit more about one piece of tooling that’s interesting. It’s not necessarily a single one I’m going to pick out, but you’ve talked about a couple of platforms and a couple of products in the stack that you’re using and there’s bound to be some proprietary stuff that you have to build and adapt to as part of that, like you talked about obviously AWS VPCs and other things. When you’re looking at a stack to choose to build on, we always hear about vendor lock-in and this concern about, oh you’re going to get locked in. Should we be as concerned as some people would think about mapping to a particular process or some infrastructure and how do you create the right abstraction to make sure that you’re as least locked in as possible?
Yeah. That’s a great question. I remember from Google App Engine, that was a lot of … We definitely heard that critique quite a lot and that’s a legitimate critique. Everything’s a trade-off. Lock-in is such a pejorative term and we take it as … Often you hear that and like, “Oh, we’re locked in and oh, that’s absolutely bad.” Let’s take a step back. Not true. The way to think about it is: What amount of effort, what benefit are you getting from using whatever, directly AWS or directly Heroku or directly App Engine or directly Oracle or like whatever you want to talk about being locked in. What benefit, in terms of speed, feature velocity in terms of reliability, ease of use, et cetera, what benefits do you get and then the cost that you pay, and it’s a real cost, of if we ever had to migrate off of this, how long would it take?
You can kind of sketch that out in order of magnitude. Would it take me one day, one week, one month, one year kind of thing, and then do the math for yourself. I will say for myself that I am, and I’ve started using AWS as a customer for several of my last places that I’ve been. That’s not a lock-in I’m worried about. Yeah, that’s true. We are definitely directly using AWS APIs for a bunch of our things. That is not a big concern for me. There you go. There are other … People have been, and I say this as a former Oracle employee, I’m not even going to beat around the bush. Oracle has soured people on … The experience of that sales cycle, et cetera, has totally soured people on a lock-in of that one. Oracle has not done itself a service by making that really difficult, making being a customer of Oracle a problem rather than a wonderful thing.
So far, to date, we’ll see when we look back on it in the future. So far to date AWS and Google and a bu-, they have not behaved in that way. Does that mean they could never behave in that way? Of course it doesn’t, but prices are going down, not up. People are being … There is more standardization, not less. I think the general trends are favorable and I think a lot of software and infrastructure vendors have seen what their brethren in the ’90s and the 2000s did and try not to do that again. I hope that makes sense.
I have a philosophical … Like I say, lock-in is a trade-off like anything else and you trade it off. If I can get to market three months faster because I just went directly against AWS, fine. If, on the other hand, I am worried, I personally am not worried but if somebody listening is worried about, “Yeah, I want to be able to switch on a dime between AWS and Google or whatever” Cool. All right. That’s a thing you should decide for yourself and there are ways you can insulate yourself from that.
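The "do the math for yourself" step can be sketched as a back-of-the-envelope calculation. This is a minimal sketch, not a method anyone in the conversation described: the function name, the idea of weighting the migration cost by a probability, and all the numbers are assumptions for illustration.

```python
def lock_in_net_weeks(
    weeks_saved: float,
    migration_weeks: float,
    migration_probability: float,
) -> float:
    """Expected net benefit, in weeks, of building directly on one provider.

    weeks_saved: time to market gained by using the provider's APIs directly.
    migration_weeks: order-of-magnitude estimate of the effort to move off.
    migration_probability: a guess (0.0 to 1.0) at ever having to migrate.
    """
    if not 0.0 <= migration_probability <= 1.0:
        raise ValueError("migration_probability must be between 0 and 1")
    return weeks_saved - migration_probability * migration_weeks


if __name__ == "__main__":
    # Assumed numbers: about three months saved, about a year to migrate.
    for probability in (0.0, 0.25, 0.5):
        net = lock_in_net_weeks(12, 48, probability)
        print(f"p={probability}: net {net} weeks")
```

A positive result suggests the speed gained outweighs the expected migration cost; a negative one suggests it may be worth investing in some insulation up front. The inputs only need to be right to an order of magnitude for the comparison to be useful.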
I think that’s a beautiful way to describe it because that’s exactly it, it’s defining the trade-off and … I think the one I always compare when people say, “I don’t want to be with a vendor where I’m locked in.” I’m like, “You’re locked in in so many things in your life. Some of it could call it marriage and it’s not a terrible thing, right?” I’m like, “If you think about it hard enough it’s lock-in. Do you feel locked in? Do you feel terrified that you’ve like …” No, you made a choice. Like I said, it’s a great way to describe it. We’ve covered a lot on toolkit stack. Like I said, we could spend another hour just going into some of the stuff.
You’ve got such a great story that you can tell on the things you’ve done, but what we’re going to do is we’re going to bring you back in the future again, Randy, very soon, and we’re going to continue this conversation because I want to talk to you a bit about the people side of this and that’s probably one of the more challenging stories that people want to hear how you succeeded at it.
To close out today, I’m going to ask you again if you want to just let folks again know where we can find you online and then thank you for talking about your stack, how your view of DevOps works, a great thought about lock-in and a little bit about Stitch Fix. Yeah, so where do we find you online, Randy, and then we’re going to bring you back again.
Yeah, sounds great. Thanks Eric. Randy Shoup, you can find me at @RandyShoup, all one word, on Twitter. You can also find me on LinkedIn and also Googling Randy Shoup, you know, the two words, digs up a lot of presentations and blog posts and interviews that I’ve done so that’s another way that people can find out some of the things that I think. Thanks again Eric, this was really a lot of fun.
Awesome. Thank you very much Randy. If you like what you heard here and want to hear much more, don’t forget to subscribe to the GC On-Demand podcast. You can go to gcondemand.io where you’ll find the links in order to catch us in iTunes, Stitcher, the Google Play store and more. Go to gcondemand.io. Don’t forget to rate us in your podcaster of choice and look for much, much more. Have a show idea? Tweet us @gcondemand. Thanks for listening.
We bring back Angelo Luciani to talk about the product and process approach to building out IT communities and influencer programs. With all of the ways available to host an IT community, what should community leaders be looking at to help bring their community together both online and in-person?
Patrick Melampy of 128 Technology joins us for a chat to discuss the challenges in the world of traditional networking. We go over the 128 Technology solution and why it is an interesting method to solve SDN challenges. Thanks to Patrick for taking the time to review their solution with us and how SDN in general is becoming the next important battleground for both networking customers and vendors.
Richard Arnold joins us for this informative chat on the process he recently went through with replatforming his personal blog and rebranding across social media platforms. You won’t want to miss a bit of this one as it really helps to pull together the key steps you should be thinking about as you look at bringing your personal brand into a new name and blog platform. Check out Richard’s fresh new site and make sure to listen in to this great conversation.
If you only read one blog this week, this is the one you should start with: Spiraling Ops Debt & the SRE coding imperative.
We chat with Rob Hirschfeld on the subject of spiraling operations debt and the SRE coding imperative, plus how he and the RackN team are helping to solve some of those challenges.