Episode 46 – DevOps, Microservices, Tools and Processes with Randy Shoup Part 1 of 2

Randy Shoup joins the GCOD for a very informative and candid chat on development velocity, the DevOps goal, finding traction, and how he has led teams to success with a variety of products and platforms.

Randy is working on some very interesting things at Stitch Fix, and brings a storied career and set of experiences to the conversation. Big thanks to Randy for sharing what could be some of the best tips for finding success with a higher velocity of development, including some of the tools and techniques that will help us all get there.

Welcome to the GC On-Demand podcast, a show about people, about process, about technology, about community. It’s great conversations with great technologists about things that matter to you, that matter to all of us. Thanks for listening. Don’t forget, visit gcondemand.io for all of the show notes. With that, let’s get started.

Welcome, everybody. It’s … Welcome back to the GC On-Demand. We’ve got an exciting time because there’s been so much growth going on with some of the conversations we’ve had here on the GC On-Demand and I’ve been lucky enough to really, really cover such a wide audience of folks that can come and talk to us about neat things that they have done in the industry, both from a product and a people perspective. Most importantly, as we’ve progressed as listeners throughout this, I think this is a perfect time that I get to introduce our special guest today.

With that, I’d like to welcome Randy Shoup to the show. Randy, you have a very exciting story in IT. So first of all, let’s get started by if you can introduce yourself, tell folks where we can find you online and we’re going to talk a little bit about you, Stitch Fix and a whole lot about DevOps and all things wrapped around it.

Excellent. Well thanks, Eric, thanks for having me, this is great. Yeah, so I’m Randy Shoup. You can find me on Twitter at @RandyShoup. I can also be found at Randy Shoup on LinkedIn and other places around the web. If you Google my name you’ll also see a bunch of presentations that I’ve given on various topics, DevOps, scalability, engineering culture. I’m pretty easily Googleable ’cause my name isn’t super common.

I’m currently VP of engineering at Stitch Fix here in San Francisco. I’ll talk a little bit later, I think, in the podcast, I hope, about the kinds of things that Stitch Fix does. I’ve been here for a year.

Earlier in my career, I was chief engineer at eBay for about six and a half years and I helped to build out eBay’s search infrastructure. I did a stint at Google running engineering for Google App Engine, so that’s Google’s platform as a service, like Heroku or other platforms you might be aware of.

I started my own little startup with a former eBay colleague and learned how difficult it is to do a startup.

I actually did a stint as the CTO of a gaming company here in San Francisco for a while, so I’ve done a bunch of different things.

You worked for a couple of little companies. We’re going to definitely bring you back ’cause we want to talk about that startup. That in itself would be a really exciting chat on its own. What I want to talk about today, let’s just start with the elusive topic that everybody seems to be chasing these days, or at least for the last little while, Randy, which is DevOps. The idea of creating these toolkits and these processes that wrap around what DevOps is, and I think that’s probably the best place to start. If you don’t mind, Randy, tell me in your mind what exactly is DevOps, what’s it trying to achieve and what are your thoughts on that side of the world?

Yeah, great. I think of DevOps as the modern approach to deploying and managing software, sort of in the world. We all know that the classic enterprisey approach to stuff is you have a bunch of people that build things, the developers, and you have another set of people that operate those things, the ops folks. What I like about the modern world is that we’re breaking down that, to my mind, artificial barrier there. One of the things we do at Stitch Fix and we also did at Google is we don’t have this huge wall where there are the people that write the code and then they throw it over the wall to the people that operate it and everybody hates the people that are on the other side of that wall.

Rather, the better way to deploy and manage software is to have everybody be able to do this stuff. The way that we did it at Google is that for the most part the initial time when somebody is running a service or running an application, it’s actually the developers, the same people that are writing the code that are operating it, so they’re the ones that carry the pager, they’re the ones that are responsible for it performing and being reliable in the real world.

We do that similar approach here at Stitch Fix. Stitch Fix, as I didn’t mention before, is an online clothing retailer. Our idea is we turn retail sort of upside down so rather than going to a traditional retailer, whether online or physical and going and choosing the things you like, you tell us the things that you like and we send you five items in a box that we think you’re going to enjoy. We do a whole bunch of data science associated with that, that’s maybe a whole another topic. In terms of the engineering that we do, we want to make sure that the engineers that we hire build and maintain the software themselves.

We have the same people that are on … The people that build a particular set of applications at Stitch Fix, the same exact engineers are the ones that write the software, they’re the ones that make sure that the software works correctly, they’re the ones that make sure that the software performs and they’re the ones that make sure that the software is operated. We don’t have a separate QA group, we don’t have a separate performance group, we don’t have a separate ops group, it’s all one set of engineers.

Why would we do that? It has this wonderful sort of synergistic effect by not throwing a thing over the wall to some other guy, I am now responsible for it and it means that I’m strongly incented to make sure that my thing is going to perform well and is going to work well in the real world. My incentives are definitely aligned, but also it actually makes it easier for me to do that job because once I know that I’ve set up the monitoring, I know how things are being deployed and how things are running in the infrastructure. There’s not this need for passing tickets back and forth or constant coordination back and forth where I understand half the problem and the ops guy understands the other half, if that makes sense. I, as engineer, am able to do it all and that is wonderful. Our engineers love it.

I love that approach. One of the recent podcast chats I had was with Rob Hirschfeld and we talked about the SRE and, of course, being of Google history you’ll know around this whole concept of the SRE. We’ll talk about that actually in a few minutes, but it’s funny, you talk about different things that your team does and I love this idea of like a singular, you’re responsible from end to end. When you’re doing stuff like this and creating processes and stuff wrapped around it, how much of the toolkit do you have available out of stuff that’s out there today or when you’re looking at starting something brand new, like what you’re doing with Stitch Fix, how much of everything in your workflow is out of the box versus what you have to develop yourself to kind of be individually mapped to how you do things?

Totally. It’s wonderful to work … I’ve been in the industry for a long time, since 1990, and 2017 is the best time to be a developer of any kind. Why? It’s because all these things that used to be available only to the eBays and the Googles and the Amazons of the world are now available to everybody. At Stitch Fix we run all of our infrastructure on AWS. Most of our applications that we build in the engineering team are hosted on Heroku, which maybe people will know also is on AWS. We’re in the process of actually migrating from having all of our applications on Heroku to using Amazon’s Elastic Container Service, so Docker, basically running Docker in the cloud. Happy to dive into any or all of these areas.

To answer your question about how much of the toolkit is available kind of off the shelf, it’s the vast majority of it. That’s what’s wonderful about 2017. We can go to a cloud provider like AWS and get as many machines as we can afford. We can spin them up in minutes or even seconds and that’s huge. Also out of the box from AWS and other cloud providers is the ability to monitor the stuff and control them through APIs. I don’t have to do a lot of jumping up and down and running around in order to be able to see what’s running and see how well it’s running.

One of the things that you sort of touched on, like Stitch Fix being new, which it totally is, so the company itself is only six years old and so that means that we have had the benefit of being able to start afresh in a modern kind of more greenfield approach. We started with Heroku, so platform as a service, which has wonderful properties in terms of developer productivity and developer power, and we’ve also constructed our toolkit or our stack out of a bunch of different software as a service kind of elements. Obviously the platform as a service straightaway. We’re running on … We have leveraged databases as a service, so first in Heroku’s world and now in AWS’s world in the form of the Relational Database Service that AWS offers.

We also use hosted Elasticsearch. We use hosted messaging in the form of CloudAMQP, so that’s a message provider, basically a hosted RabbitMQ. We use hosted bug tracking. We use hosted paging in the form of PagerDuty. Hosted Slack, like … We do not have a single piece of hardware that we own ourselves other than our laptops and so nothing that a customer or an employee at Stitch Fix would use in their daily job is part of any physical … We don’t have any physical data center presence anywhere and that’s awesome. You could be like me too.

In the 2017 world, I would suggest that when you are able to start afresh you should first think of what can I do … How can I leverage something that somebody else is maintaining and sort of running for me and just pay them to do it and focus yourself on the thing that actually differentiates yourself, very starkly different. When I started in the industry even as a startup, let alone a big company, the first thing you have to do is find some data center space and buy some physical machines and get them sent in. It’s laughable. We all lived in that world and it’s laughable, not in like it was bad but just the sharp difference between what we had to go through to get computing power even 10 or 20 years ago and what you have to do now it’s just unbelievably different.

Yeah. It’s actually funny in talking with the founders of Turbonomic, we literally started as a startup in Yuri’s garage. They said there was a rack of servers and the neighbors were like, “What’s that noise, Yuri?” “Oh, I’ve got a startup in my garage.” You literally just need a broadband connection, that’s the only infrastructure you need, everything else is available as a service. You’ve kind of … You hit the whole stack there, Randy. I guess the other thing is, if you think about any single one of these things they’re almost all available for free to get started, too. You can really kick the tires on something, you’re not even getting into like multi-year, multi-month commitments. You can pretty much most of these have a developer offering, which is up to a certain number of nodes for free. That’s crazy.

How does that feel to you to, like you said, when you started in the industry, if you’re coming out of school now, you could run a startup without even spending a single dollar, really, before you know that you’ve got to spend a little more time and get a couple more people on board. It’s a pretty cool time, right?

It’s a wonderful time. Yeah. Every year for the last five has been the best time to start a startup ever, you know what I mean? Exactly for that reason. Right, yeah you can start for free or as close to free as you can imagine. You could start a startup with a bill of zero or a hundred dollars or something like that. If, and only if, you hit it big, then is when you start paying and frankly that’s when you should start paying, right? Once you’ve kind of sorted out, I’ve got a business model, I’ve got product-market fit, now I’m ready to go and that’s when the bills really come. Yeah, it’s wonderfully empowering.

Yes, how does it feel? It constantly amazes me, to be frank because to remember back, we didn’t have any of this stuff. We didn’t have open source. We didn’t have cloud. We didn’t have all this stuff that’s sort of at our fingertips: Mobile devices, super computers that we carry in our pockets. It’s just an amazing time to be alive and it’s an amazing time to be a technologist, it’s just great.

I think every next thing is as exciting and as rapid on the innovation rate as the previous thing. The players are changing too, which is interesting, like you said, open source, it was Linux. When people said, “Oh, I use open source …” it’s like, “Oh, yeah so you run Linux servers?” No, you could run anything at any layer of your stack. That’s such a huge opportunity throughout, again, like the full stack that you talked about in these different platforms. Some of them obviously are AWS-specific like with RDS and whatnot, but they’re using core primitives that you could then port to another database platform as long as it’s using one standard style of database call, it’s all about the right abstraction.

When you think about the products that you’ve laid out in here, I like how you started with Heroku. How important was platform as a service to be the starting point for you versus building your own basic stack?

Yeah, thanks for asking it in that way. Yeah, it was hugely important and I can’t say enough great things about Heroku as a product and the Heroku organization. I say that this … Like I say, I used to run engineering for Google App Engine so I appreciate A) What a benefit platforms as a service are and B) I know how hard they are to run and all that. Yeah, Heroku was a huge benefit for Stitch Fix. When I joined Stitch Fix a year ago we had about 25 engineers on staff. We currently have about 75, so it’s grown about 3X in the time I’ve been here just over a year.

When I arrived, we had 25 engineers, all of whom did full stack Ruby on Rails development, none of whom did any aspect that you would call infrastructure or platform ops at all. That’s not because that’s not an important thing but because that’s what we were paying Heroku to do and that’s what Heroku was doing really well. The fact that we were able to go as far as we did on having essentially no investment in people or infrastructure from ourselves was hugely valuable, just an incredible force multiplier. At the same time, Heroku has a sweet spot and their goal is to make things easy and they do a great job at that. At the same time, as their customers, us in particular, get larger, we have requirements that are not the same requirements, that are sort of the next level up or the next level down or deeper, however you want to look at it, than those of the small and medium-sized businesses that are their bread and butter.

To be fair to them, the stuff that we want them to build is not stuff that they are going to build for anybody else other than their biggest customers. Does that make sense? Our motivation for moving off is not because there’s anything wrong with it but because we have stuff that I’ll take a step back and like … I need this thing from you guys, you shouldn’t even build it. It’s not a thing you should … That’s not in your sweet spot for your standard set of customers. I know that ’cause I used to run one of your things, like App Engine, so I know where you’re coming from and you totally shouldn’t build it, but I need it and sorry.

What we are needing now and did not need before, even a year ago, what we need now is more transparency and more control over all the areas of the stack. We need more security in the form of Amazon security groups, VPCs. Happy to talk about why and the details of all these things, but all the richer sort of next lower level control over security that we get by running directly on an infrastructure as a service provider and we just want, like I say, we want a lot more sort of transparency in different areas of the stack.

I will say, and maybe this is the obvious next question: Should we have started where we are now before? Answer: No. In fact, answer: Hell no. When you are small and medium-sized and are in the proper scope of one of those platforms as a service you totally should be there. You should just totally take advantage of it and not build out things that are undifferentiated heavy lifting, to use Werner Vogels’s term. When, if you should be so lucky to be in the 1% of the 1% of the 1% of companies, that’s the time that you should step up and start taking it over yourselves. Does that make sense?

If you start with a complete self-built full stack, that’s like reading The Goal backwards and thinking you’ve done it right. You’ve just … Why not do something that removed those constraints early and then discover them and so effectively you’ve done it. You have now hit a point in the evolution of Stitch Fix and your team where you’re like, “Okay, our constraint is transparency, so how do we attack that constraint? Easy. We have to then break out the stack and this is the way in which we’re going to do it.” Like you said, if you had done that early you’d be months in just to get to the point where you’re like, “All right, perfect. We’re ready to put our first hunk of code into production now.” It’s like such a reversal of the whole purpose of high velocity SRE roles. It’s like every piece of code should go in.

I’m going to ask you this, and I don’t know how much you can share, but we always hear about commits, push to production. Gene Kim, and I love hearing Gene talk and he always talks about, “Yeah, these folks are doing like 2000 commits a day and every single one goes straight to production.” When you think of a Stitch Fix and so many other companies you’ve helped to advise and work with, Randy, what are real honest people out there that are doing those first stages rather than Netflix doing 17,000 commits a day or whatever it is. When you do those first stages of adopting this DevOps process, where do you find a lot of people really are versus the big guns stories that we hear about?

Yeah, sure. The answer is, it depends very much on where you come from. In the Stitch Fix case, we had that benefit of starting essentially greenfield five years ago, right? We started on Heroku, which very naturally gives us continuous delivery. Three things that are core to the way that we approach engineering. We do test-driven development, like we actually do it, so we actually write the tests before we write a feature. We do continuous delivery, so that came essentially for free by starting with our code in GitHub and connecting that up through a CI pipeline out to Heroku, I’ll talk about that more in a moment. We practice DevOps, as we started with, so we believe that it’s the same team and the same individuals that should be owning the stack fully end-to-end.

When somebody’s starting out, if you are starting out as we were able to do afresh, go to a platform as a service and you get that continuous delivery stuff for free. For us, every commit that goes to our master branch runs all of those automated tests that we wrote in our TDD phase and all those things get packaged up into a deployable artifact, they get deployed to Heroku and there we go. We’re not doing 17,000 ’cause we only have 70 people that are writing code, but every one of our applications, and we have about 40 or 50 individual small applications, we tried very much not to write the monolithic application. Every one of those applications is being deployed multiple times a day.
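The commit-to-deploy flow Randy describes (tests run, an artifact is packaged, the artifact is deployed) can be sketched roughly as below. This is a minimal illustration only, not Stitch Fix's actual pipeline: the stage functions `run_tests`, `build_artifact`, and `deploy` are hypothetical stand-ins that always succeed, and a real setup would delegate these steps to a CI service rather than to hand-written Python.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Tuple[str, bool]]:
    """Run stages in order and stop at the first failure."""
    results = []
    for name, step in stages:
        ok = step()
        results.append((name, ok))
        if not ok:
            break  # a failed stage blocks every stage after it
    return results


# Hypothetical stand-ins for the real work; each always succeeds here.
def run_tests() -> bool:
    return True


def build_artifact() -> bool:
    return True


def deploy() -> bool:
    return True


PIPELINE: List[Stage] = [
    ("test", run_tests),
    ("package", build_artifact),
    ("deploy", deploy),
]

if __name__ == "__main__":
    for name, ok in run_pipeline(PIPELINE):
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

The point of the ordering is that a commit only reaches the deploy stage if the tests written up front pass, which is what makes deploying many times a day tolerable.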

What are people really doing? I am really doing that separate. The other unasked question, or the other half of your question is: What if you’re not in a greenfield situation, what if you’re coming from a situation where it’s more traditional enterprisey? Then yeah, then you’ve got to take steps. It can be … I will come back and answer the question.

Ten years ago when I first started talking about, I was then at eBay, and started talking about eBay’s architecture, people were shocked and amazed that eBay released the whole site every two weeks. I would get up there and talk about, “Yeah eBay, we release the whole site every two weeks.” They’re like, “Oh my god, every two weeks, that’s amazing.” Imagine my doing that today. I get up and I talk about, “Yeah we’re Stitch Fix and we release the whole site every two weeks.” They’re like, “Two weeks? Oh my god.” That used to be impressively awesome and now it’s not so-, so yeah. There’s no shame in coming from, like hey people have successful businesses and they release things once a month or once every two weeks or once every day.

I am a strong believer in the Gene Kim, Jez Humble philosophy of the more that you … You get so much benefit out of shrinking that cycle time, shrinking the time from the idea I have to code that I write to it runs in production. Please just go do that. You’re not going to get from one month to one second in a day and anybody who tells you differently is selling you something probably. You can get from a month to a week and you can feel proud about that and then you can get from a month to a couple of days and you can feel proud about that. Obviously there’s lots of aspects of changing the culture and the development process, getting the tooling in place to do that stuff, but the wonderful thing is that this is a pretty well-paved path.

The fortunate situation is for companies that currently have a more enterprisey life cycle, a slower life cycle, it’s a pretty well-paved, lots of people have gone from there to here, if that makes sense. It’s not like you have to be the trailblazer and discover new ways of doing the thing, if that makes sense. It’s a kind of … I don’t say it’s an easy path but it’s a well-trodden path, it’s a paved path.

It’s really, like you said, the ranges and stages are important and that’s why I try and press upon folks out in the community when I talk to them, I said you don’t have to be eBay, sorry I always use … I use your former team as an example, or LinkedIn or PayPal or whatever it is, but if you’re enterprise and it’s taken you six months to get something in production, if you can do it in three, that’s like let’s drink, let’s party and celebrate that ’cause then we can do it in two, one, three weeks, then you can shrink those cycles but at least let’s just reduce that time to getting that code into production and make those processes better, which is pretty cool.

I’m going to ask you about …

Totally. Yeah, I mean if you can reduce the … I mean just to underline that a little bit, if you’re doing one release every six months and you can make it to three, now you’ve doubled your velocity and there’s no shame in that. That’s awesome because now you can either get double the amount of features and capabilities out to your customers and/or you can make them that much more reliable. Why? Because when I double the size of a thing it doesn’t make it only twice as complicated and twice as potentially bug-ridden, it’s four times or eight times. There’s an exponential, or at least geometric relationship between the amount of code I’m changing and the potential issues that I’m introducing. It’s not linear at all.
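One simple way to see why risk grows faster than release size is to count how many pairs of changes could interact with each other. This is an assumed toy model, not something Randy states: it treats every pair of changes in a release as a potential interaction, so the count is n * (n - 1) / 2, and the function name `pairwise_interactions` is made up for illustration.

```python
def pairwise_interactions(changes: int) -> int:
    """Number of distinct pairs among `changes` changes: n * (n - 1) / 2."""
    return changes * (changes - 1) // 2


small_release = pairwise_interactions(10)  # 10 changes shipped together
large_release = pairwise_interactions(20)  # twice as many changes

# Doubling the batch size roughly quadruples the pairs that could interact.
print(small_release, large_release, round(large_release / small_release, 1))
```

Under this model, halving the release size cuts the potential interactions to roughly a quarter, which is in line with the "four times" figure mentioned above.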

When you can go from the six month to the three month, you’ve made everything better, like noticeably. Then you get from the three month to the one month and the one month to the one week, yeah you will find as you do that you’ll just get better and better and your customers will thank you for it. Your customers will thank you for, “Hey, your releases are not only coming faster, they’re more reliable.” That is, people should definitely read The DevOps Handbook, which is Gene Kim and Jez Humble and John Willis and all those guys, Patrick Debois. As you get faster you also get better. It is not … It’s a wonderful thing where you’re not choosing between, “Oh, should I be fast or should I be good?”

That’s right.

The faster makes you gooder, you know? The faster you get the better you get, so yeah.

I could literally take 12 hours and just cut it into hunks and make it a 48-part series with you, Randy. Before we finish up in this session, I do want to talk a little bit more about one piece of tooling that’s interesting. It’s not necessarily a single one I’m going to pick out, but you’ve talked about a couple of platforms and a couple of products in the stack that you’re using and there’s bound to be some proprietary stuff that you have to build and adapt to as part of that, like you talked about obviously AWS VPCs and other things. When you’re looking at a stack to choose to build on, we always hear about vendor lock-in and this concern about, oh you’re going to get locked in. Should we be as concerned as some people would think about mapping to a particular process or some infrastructure and how do you create the right abstraction to make sure that you’re as least locked in as possible?

Yeah. That’s a great question. I remember from Google App Engine, that was a lot of … We definitely heard that critique quite a lot and that’s a legitimate critique. Everything’s a trade-off. Lock-in is such a pejorative term and we take it as … Often you hear that and like, “Oh, we’re locked in and oh, that’s absolutely bad.” Let’s take a step back. Not true. The way to think about it is: What amount of effort, what benefit are you getting from using whatever, directly AWS or directly Heroku or directly App Engine or directly Oracle or like whatever you want to talk about being locked in. What benefit, in terms of speed, feature velocity in terms of reliability, ease of use, et cetera, what benefits do you get and then the cost that you pay, and it’s a real cost, of if we ever had to migrate off of this, how long would it take?

You can kind of sketch that out in order of magnitude. Would it take me one day, one week, one month, one year kind of thing, and then do the math for yourself. I will say for myself that I am, and I’ve started using AWS as a customer for several of my last places that I’ve been. That’s not a lock-in I’m worried about. Yeah, that’s true. We are definitely directly using AWS APIs for a bunch of our things. That is not a big concern for me. There you go. There are other … People have been, and I say this as a former Oracle employee, I’m not even going to beat around the bush. Oracle has soured people on … The experience of that sales cycle, et cetera, has totally soured people on a lock-in of that one. Oracle has not done itself a service by making that really difficult, making being a customer of Oracle a problem rather than a wonderful thing.

So far, to date, we’ll see when we look back on it in the future. So far to date AWS and Google and a bu-, they have not behaved in that way. Does that mean they could never behave in that way? Of course it doesn’t, but prices are going down, not up. People are being … There is more standardization, not less. I think the general trends are favorable and I think a lot of software and infrastructure vendors have seen what their brethren in the ’90s and the 2000s did and try not to do that again. I hope that makes sense.

Absolutely.

I have a philosophical … Like I say, lock-in is a trade-off like anything else and you trade it off. If I can get to market three months faster because I just went directly against AWS, fine. If, on the other hand, I am worried, I personally am not worried but if somebody listening is worried about, “Yeah, I want to be able to switch on a dime between AWS and Google or whatever” Cool. All right. That’s a thing you should decide for yourself and there are ways you can insulate yourself from that.
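The "do the math for yourself" step can be made concrete with a back-of-the-envelope calculation. This is only a sketch under stated assumptions: the function `lock_in_net_benefit` and its parameters are hypothetical, the three months saved comes from the example above, and the 12-month migration estimate and 10% chance of ever migrating are made-up numbers to be replaced with your own.

```python
def lock_in_net_benefit(
    months_saved: float,
    migration_months: float,
    migration_probability: float,
) -> float:
    """Months gained now minus the expected months spent migrating later."""
    if not 0.0 <= migration_probability <= 1.0:
        raise ValueError("migration_probability must be between 0 and 1")
    return months_saved - migration_probability * migration_months


# Assumed inputs: 3 months faster to market, a 12-month migration
# if it ever happens, and a 10% chance that it does.
net = lock_in_net_benefit(
    months_saved=3, migration_months=12, migration_probability=0.1
)
print(f"net benefit in months: {net:.1f}")
```

A positive result suggests building directly against the provider is worth it; a negative one suggests spending some effort on insulation first. The inputs are order-of-magnitude guesses, so the sign matters more than the exact value.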

I think that’s a beautiful way to describe it because that’s exactly it, it’s defining the trade-off and … I think the one I always compare when people say, “I don’t want to be with a vendor where I’m locked in.” I’m like, “You’re locked in in so many things in your life. Some would call it marriage and it’s not a terrible thing, right?” I’m like, “If you think about it hard enough it’s lock-in. Do you feel locked in? Do you feel terrified that you’ve like …” No, you made a choice. Like I said, it’s a great way to describe it. We’ve covered a lot on toolkit stack. Like I said, we could spend another hour just going into some of the stuff.

You’ve got such a great story that you can tell on the things you’ve done, but what we’re going to do is we’re going to bring you back in the future again, Randy, very soon, and we’re going to continue this conversation because I want to talk to you a bit about the people side of this and that’s probably one of the more challenging stories that people want to hear how you succeeded at it.

To close out today, I’m going to ask you again if you want to just let folks again know where we can find you online and then thank you for talking about your stack, how your view of DevOps works, a great thought about lock-in and a little bit about Stitch Fix. Yeah, so where do we find you online, Randy, and then we’re going to bring you back again.

Yeah, sounds great. Thanks Eric. Randy Shoup, you can find me at @RandyShoup, all one word, on Twitter. You can also find me on LinkedIn and also Googling Randy Shoup, you know, the two words, digs up a lot of presentations and blog posts and interviews that I’ve done so that’s another way that people can find out some of the things that I think. Thanks again Eric, this was really a lot of fun.

Awesome. Thank you very much Randy. If you like what you heard here and want to hear much more, don’t forget to subscribe to the GC On-Demand podcast. You can go to gcondemand.io where you’ll find the links in order to catch us in iTunes, Stitcher, the Google Play store and more. Go to gcondemand.io. Don’t forget to rate us in your podcaster of choice and look for much, much more. Have a show idea? Tweet us @gcondemand. Thanks for listening.
