Randy Shoup joins the GCOD for a very informative and candid chat on development velocity, the DevOps goal, finding traction, and how he has led teams to success with a variety of products and platforms.
Randy is working on some very interesting things at Stitch Fix, and brings a storied career and set of experiences to the conversation. Big thanks to Randy for sharing what could be some of the best tips for finding success with a higher velocity of development, including some of the tools and techniques that will help us all get there.
Welcome to the GC On-Demand podcast, a show about people, about process, about technology, about community. It’s great conversations with great technologists about things that matter to you, that matter to all of us. Thanks for listening. Don’t forget, visit gcondemand.io for all of the show notes. With that, let’s get started.
Welcome, everybody. It’s … Welcome back to the GC On-Demand. We’ve got an exciting time because there’s been so much growth going on with some of the conversations we’ve had here on the GC On-Demand and I’ve been lucky enough to really, really cover such a wide audience of folks that can come and talk to us about neat things that they have done in the industry, both from a product and a people perspective. Most importantly, as we’ve progressed as listeners throughout this, I think this is a perfect time that I get to introduce our special guest today.
With that, I’d like to welcome Randy Shoup to the show. Randy, you have a very exciting story in IT. So first of all, let’s get started by if you can introduce yourself, tell folks where we can find you online and we’re going to talk a little bit about you, Stitch Fix and a whole lot about DevOps and all things wrapped around it.
Excellent. Well thanks, Eric, thanks for having me, this is great. Yeah, so I’m Randy Shoup. You can find me on Twitter at @RandyShoup. I can also be found at Randy Shoup on LinkedIn and other places around the web. If you Google my name you’ll also see a bunch of presentations that I’ve given on various topics, DevOps, scalability, engineering culture. I’m pretty easily Googleable ’cause my name isn’t super common.
I’m currently VP of engineering at Stitch Fix here in San Francisco. I’ll talk a little bit later, I think, in the podcast, I hope, about the kinds of things that Stitch Fix does. I’ve been here for a year.
Earlier in my career, I was chief engineer at eBay for about six and a half years and I helped to build out eBay’s search infrastructure. I did a stint at Google running engineering for Google App Engine, so that’s Google’s platform as a service, like Heroku or other platforms you might be aware of.
I started my own little startup with a former eBay colleague and learned how difficult it is to do a startup.
I actually did a stint as the CTO of a gaming company here in San Francisco for a while, so I’ve done a bunch of different things.
You worked
for a couple of little companies. We’re going to definitely bring you back
’cause we want to talk about that startup. That in itself would be a really
exciting chat on its own. What I want to talk about today, let’s just start
with the elusive topic that everybody seems to be chasing these days, or at
least for the last little while, Randy, which is DevOps. The idea of creating
these toolkits and these processes that wrap around what DevOps is, and I think
that’s probably the best place to start. If you don’t mind, Randy, tell me in
your mind what exactly is DevOps, what’s it trying to achieve and what are your
thoughts on that side of the world?
Yeah, great.
I think of DevOps as the modern approach to deploying and managing software,
sort of in the world. We all know that the classic enterprisey approach to
stuff is you have a bunch of people that build things, the developers, and you
have another set of people that operate those things, the ops folks. What I
like about the modern world is that we’re breaking down that, to my mind,
artificial barrier there. One of the things we do at Stitch Fix and we also did
at Google is we don’t have this huge wall where there are the people that write
the code and then they throw it over the wall to the people that operate it and
everybody hates the people that are on the other side of that wall.
Rather, the
better way to deploy and manage software is to have everybody be able to do
this stuff. The way that we did it at Google is that for the most part the
initial time when somebody is running a service or running an application, it’s
actually the developers, the same people that are writing the code that are
operating it, so they’re the ones that carry the pager, they’re the ones that
are responsible for it performing and being reliable in the real world.
We do that
similar approach here at Stitch Fix. Stitch Fix, as I didn’t mention before, is
an online clothing retailer. Our idea is we turn retail sort of upside down so
rather than going to a traditional retailer, whether online or physical and
going and choosing the things you like, you tell us the things that you like
and we send you five items in a box that we think you’re going to enjoy. We do
a whole bunch of data science associated with that, that’s maybe a whole
another topic. In terms of the engineering that we do, we want to make sure
that the engineers that we hire build and maintain the software themselves.
We have the
same people that are on … The people that build a particular set of
applications at Stitch Fix, the same exact engineers are the ones that write
the software, they’re the ones that make sure that the software works
correctly, they’re the ones that make sure that the software performs and
they’re the ones that make sure that the software is operated. We don’t have a
separate QA group, we don’t have a separate performance group, we don’t have a
separate ops group, it’s all one set of engineers.
Why would we
do that? It has this wonderful sort of synergistic effect by not throwing a
thing over the wall to some other guy, I am now responsible for it and it means
that I’m strongly incented to make sure that my thing is going to perform well
and is going to work well in the real world. My incentives are definitely
aligned, but also it actually makes it easier for me to do that job because
once I know that I’ve set up the monitoring, I know how things are being
deployed and how things are running in the infrastructure. There’s not this
need for passing tickets back and forth or constant coordination back and forth
where I understand half the problem and the ops guy understands the other half,
if that makes sense. I, as engineer, am able to do it all and that is
wonderful. Our engineers love it.
I love that
approach. One of the recent podcast chats I had was with Rob Hirschfeld and we
talked about the SRE and, of course, being of Google history you’ll know around
this whole concept of the SRE. We’ll talk about that actually in a few minutes,
but it’s funny, you talk about different things that your team does and I love
this idea of like a singular, you’re responsible from end to end. When you’re
doing stuff like this and creating processes and stuff wrapped around it, how
much of the toolkit do you have available out of stuff that’s out there today
or when you’re looking at starting something brand new, like what you’re doing
with Stitch Fix, how much of everything in your workflow is out of the box
versus what you have to develop yourself to kind of be individually mapped to
how you do things?
Totally. It’s
wonderful to work … I’ve been in the industry for a long time, since 1990,
and 2017 is the best time to be a developer of any kind. Why? It’s because all
these things that used to be only the eBays and the Googles and the Amazons of
the world, is now available to everybody. At Stitch Fix we run all of our
infrastructure on AWS. Most of our applications that we build in the
engineering team are hosted on Heroku, which maybe people will know also is on
AWS. We’re in the process of actually migrating from having all of our
applications on Heroku to using Amazon’s Elastic Container Service, so Docker,
basically running Docker in the cloud. Happy to dive into any or all of these areas.
To answer
your question about how much of the toolkit is available kind of off the shelf,
it’s the vast majority of it. That’s what’s wonderful about 2017. We can go to
a cloud provider like AWS and get as many machines as we can afford. We can
spin them up in minutes or even seconds and that’s huge. Also out of the box
from AWS and other cloud providers is the ability to monitor the stuff and
control them through APIs. I don’t have to do a lot of jumping up and down and
running around in order to be able to see what’s running and see how well it’s
running.
One of the
things that you sort of touched on, like Stitch Fix being new, which it totally
is, so the company itself is only six years old and so that means that we have
had the benefit of being able to start afresh in a modern kind of more
greenfield approach. We started with Heroku, so platform as a service, which
has wonderful properties in terms of developer productivity and developer
power, and we’ve also constructed our toolkit or our stack out of a bunch of
different software as a service kind of elements. Obviously the platform as a
service straightaway. We’re running on … We have leveraged databases as a
service, so first in Heroku’s world and now in AWS’ world in the form of the
relational database service that AWS offers.
We also use
hosted Elasticsearch. We use hosted messaging in the form of CloudAMQP, so
that’s a message provider, basically a hosted RabbitMQ. We use hosted bug
tracking. We use hosted paging in the form of PagerDuty. Hosted Slack, like
… We do not have a single piece of hardware that we own ourselves other than
our laptops and so nothing that a customer or an employee at Stitch Fix would
use in their daily job is part of any physical … We don’t have any physical
data center presence anywhere and that’s awesome. You could be like me too.
In the 2017
world, I would suggest that when you are able to start afresh you should first
think of what can I do … How can I leverage something that somebody else is
maintaining and sort of running for me and just pay them to do it and focus
yourself on the thing that actually differentiates yourself, very starkly
different. When I started in the industry even as a startup, let alone a big
company, the first thing you have to do is find some data center space and buy
some physical machines and get them sent in. It’s laughable. We all lived in
that world and it’s laughable, not in like it was bad but just the sharp
difference between what we had to go through to get computing power even 10 or
20 years ago and what you have to do now it’s just unbelievably different.
Yeah. It’s
actually funny in talking with the founders of Turbonomic, we literally started
as a startup in Yuri’s garage. They said there was a rack of servers and the
neighbors were like, “What’s that noise, Yuri?” “Oh, I’ve got a
startup in my garage.” You literally just need a broadband connection,
that’s the only infrastructure you need, everything else is available as a
service. You’ve kind of … You hit the whole stack there, Randy. I guess the
other thing is, if you think about any single one of these things they’re
almost all available for free to get started, too. You can really kick the
tires on something, you’re not even getting into like multi-year, multi-month
commitments. You can pretty much most of these have a developer offering, which
is up to a certain number of nodes for free. That’s crazy.
How does that
feel to you to, like you said, when you started in the industry, if you’re
coming out of school now, you could run a startup without even spending a
single dollar, really, before you know that you’ve got to spend a little more
time and get a couple more people on board. It’s a pretty cool time, right?
It’s a
wonderful time. Yeah. Every year for the last five has been the best time to
start a startup ever, you know what I mean? Exactly for that reason. Right,
yeah you can start for free or as close to free as you can imagine. You could
start a startup with a bill of zero or a hundred dollars or something like
that. If, and only if, you hit it big, then is when you start paying and
frankly that’s when you should start paying, right? Once you’ve kind of sorted
out, I’ve got a business model, I’ve got product-market fit, now I’m ready to
go and that’s when the bills really come. Yeah, it’s wonderfully empowering.
Yes, how does
it feel? It constantly amazes me, to be frank because to remember back, we
didn’t have any of this stuff. We didn’t have open source. We didn’t have
cloud. We didn’t have all this stuff that’s sort of at our fingertips: Mobile
devices, super computers that we carry in our pockets. It’s just an amazing
time to be alive and it’s an amazing time to be a technologist, it’s just
great.
I think every
next thing is as exciting and as rapid on the innovation rate as the previous
thing. The players are changing too, which is interesting, like you said, open
source, it was Linux. When people said, “Oh, I use open source …”
it’s like, “Oh, yeah so you run Linux servers?” No, you could run
anything at any layer of your stack. That’s such a huge opportunity throughout,
again, like the full stack that you talked about in these different platforms.
Some of them obviously are AWS-specific like with RDS and whatnot, but they’re
using core primitives that you could then port to another database platform as
long as it’s using one standard style of database call, it’s all about the
right abstraction.
When you
think about the products that you’ve laid out in here, I like how you started
with Heroku. How important was platform as a service to be the starting point
for you versus building your own basic stack?
Yeah, thanks
for asking it in that way. Yeah, it was hugely important and I can’t say enough
great things about Heroku as a product and the Heroku organization. I say that
this … Like I say, I used to run engineering for Google App Engine so I
appreciate A) What a benefit platforms as a service are and B) I know how hard
they are to run and all that. Yeah, Heroku was a huge benefit for Stitch Fix.
When I joined Stitch Fix a year ago we had about 25 engineers on staff. We
currently have about 75, so it’s grown about 3X in the time I’ve been here just
over a year.
When I
arrived, we had 25 engineers, all of whom did full stack Ruby on Rails
development, none of whom did any aspect that you would call infrastructure or
platform ops at all. That’s not because that’s not an important thing but
because that’s what we were paying Heroku to do and that’s what Heroku was
doing really well. The fact that we were able to go as far as we did on having
essentially no investment in people or infrastructure from ourselves was hugely
valuable, just incredible force multiplier. At the same time, Heroku has a
sweet spot and their goal is to make things easy and they do a great job at
that. At the same time, there are a bunch of things that as their customers, us
in particular, get larger, we have requirements that are not the same
requirements that are sort of the next level up or the next level down or
deeper, however you want to look at it, in the sort of small and medium-sized
businesses that are their bread and butter.
To be fair to
them, the stuff that we want them to build is not stuff that they are going to
build for anybody else other than their biggest customers. Does that make
sense? That’s our motivation for moving off is not because there’s anything wrong
with it but because we have stuff that I’ll take a step back and like … I
need this thing from you guys, you shouldn’t even build it. It’s not a thing
you should … That’s not in your sweet spot for your standard set of customers.
I know that ’cause I used to run one of your things, like App Engine, so I know
where you’re coming from and you totally shouldn’t build it, but I need it and
sorry.
What we are
needing now and did not need before, even a year ago, what we need now is more
transparency and more control over all the areas of the stack. We need more
security in the form of Amazon security groups, VPCs. Happy to talk about why
and the details of all these things, but all the richer sort of next lower
level control over security that we get by running directly on an
infrastructure as a service provider and we just want, like I say, we want a
lot more sort of transparency in different areas of the stack.
I will say,
and maybe this is the obvious next question: Should we have started where we
are now before? Answer: No. In fact, answer: Hell no. When you are small and
medium-sized and are in the proper scope of one of those platforms as a
service, you totally should be there. You should just totally take advantage of
it and not build out things that are undifferentiated heavy lifting to use
Werner Vogels’s term. When, if you should be so lucky to be in
the 1% of the 1% of the 1% of companies, that’s the time that you should step
up and start taking it over yourselves. Does that make sense?
If you start
with a complete self-built full stack, that’s like reading The Goal backwards
and thinking you’ve done it right. You’ve just … Why not do something that
removed those constraints early and then discover them and so effectively
you’ve done it. You have now hit a point in the evolution of Stitch Fix and
your team where you’re like, “Okay, our constraint is transparency, so how
do we attack that constraint? Easy. We have to then break out the stack and
this is the way in which we’re going to do it.” Like you said, if you had
done that early you’d be months in just to get to the point where you’re like,
“All right, perfect. We’re ready to put our first hunk of code into
production now.” It’s like such a reversal of the whole purpose of high
velocity SRE roles. It’s like every piece of code should go in.
I’m going to
ask you this, and I don’t know how much you can share, but we always hear about
commits pushed to production. Gene Kim, and I love hearing Gene talk, and he
always talks about, “Yeah, these folks are doing like 2000 commits a day
and every single one goes straight to production.” When you think of a
Stitch Fix and so many other companies you’ve helped to advise and work with,
Randy what are real honest people out there that are doing those first stages
rather than Netflix doing 17,000 commits a day or whatever it is. When you do
those first stages of adopting this DevOps process, where do you find a lot of
people really are versus the big guns stories that we hear about?
Yeah, sure.
The answer is, it depends very much on where you come from. In the Stitch Fix
case, we had that benefit of starting essentially Greenfield five years ago,
right? We started on Heroku, which very naturally gives us continuous delivery.
Three things that are core to the way that we approach engineering. We do
test-driven development, like we actually do it, so we actually write the tests
before we write a feature. We do continuous delivery, so that came essentially
for free by starting with our code in GitHub and connecting that up
through a CI pipeline out to Heroku, I’ll talk about that more in a moment. We
practice DevOps, as we started with, so we believe that it’s the same team and
the same individuals that should be owning the stack full end-to-end.
When
somebody’s starting out, if you are starting out as we were able to do afresh,
go to a platform as a service and you get that continuous delivery stuff for
free. For us, every commit that goes to our master branch runs all of those
automated tests that we wrote in our TDD phase and all those things get
packaged up into a deployable artifact, they get deployed to Heroku and there
we go. We’re not doing 17,000 ’cause we only have 70 people that are writing
code, but every one of our applications, and we have about 40 or 50 individual
small applications, we tried very much not to write the monolithic application.
Every one of those applications is being deployed multiple times a day.
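A rough sketch of the commit-to-deploy flow Randy describes, written in Python for illustration. The names run_pipeline and PipelineResult, and the stand-in test, package, and deploy callables, are hypothetical; the conversation does not describe the actual Stitch Fix pipeline (GitHub, CI, Heroku) at this level of detail. The only idea shown is the gate: a commit is packaged and deployed if, and only if, every automated test passes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineResult:
    commit: str
    steps_run: List[str] = field(default_factory=list)
    deployed: bool = False


def run_pipeline(commit: str,
                 tests: List[Callable[[], bool]],
                 package: Callable[[str], str],
                 deploy: Callable[[str], None]) -> PipelineResult:
    """Run the tests, then package and deploy only if all of them pass."""
    result = PipelineResult(commit=commit)
    result.steps_run.append("test")
    # all() stops at the first failing test, so a red commit never gets packaged.
    if not all(test() for test in tests):
        return result
    result.steps_run.append("package")
    artifact = package(commit)
    result.steps_run.append("deploy")
    deploy(artifact)
    result.deployed = True
    return result


# Hypothetical stand-ins: a list plays the role of the production platform.
deployed_artifacts: List[str] = []

green = run_pipeline(
    "abc123",
    tests=[lambda: True, lambda: True],
    package=lambda c: f"app-{c}.tar.gz",
    deploy=deployed_artifacts.append,
)
red = run_pipeline(
    "def456",
    tests=[lambda: True, lambda: False],
    package=lambda c: f"app-{c}.tar.gz",
    deploy=deployed_artifacts.append,
)
print(green.deployed, red.deployed, deployed_artifacts)
```

In this sketch the failing commit stops after the test step, which is the property that makes deploying every commit to the master branch multiple times a day tolerable.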
What are
people really doing? I am really doing that separate. The other unasked
question, or the other half of your question is: What if you’re not a
Greenfield situation, what if you’re coming from a situation where it’s more
traditional enterprisey. Then yeah, then you’ve got to take steps. It can be
… I will come back and answer the question.
Ten years ago
when I first started talking about, I was then at eBay, and started talking
about eBay’s architecture, people were shocked and amazed that eBay released
the whole site every two weeks. I would get up there and talk about, “Yeah
eBay, we release the whole site every two weeks.” They’re like, “Oh
my god, every two weeks, that’s amazing.” Imagine my doing that today. I
get up and I talk about, “Yeah we’re Stitch Fix and we release the whole
site every two weeks.” They’re like, “Two weeks? Oh my god.”
That used to be impressively awesome and now it’s not so-, so yeah. There’s no
shame in coming from, like hey people have successful businesses and they
release things once a month or once every two weeks or once every day.
I am a strong
believer in the Gene Kim, Jez Humble philosophy of the more that you … You
get so much benefit out of shrinking that cycle time, shrinking the time from
the idea I have to code that I write to it runs in production. Please just go
do that. You’re not going to get from one month to one second in a day and
anybody who tells you differently is selling you something probably. You can
get from a month to a week and you can feel proud about that and then you can
get from a week to a couple of days and you can feel proud about that.
Obviously there’s lots of aspects of changing the culture and the development
process, getting the tooling in place to do that stuff, but the wonderful thing
is that this is a pretty well-paved path.
The fortunate
situation is for companies that currently have a more enterprisey life cycle, a
slower life cycle, it’s a pretty well-paved, lots of people have gone from
there to here, if that makes sense. It’s not like you have to be the
trailblazer and discover new ways of doing the thing, if that makes sense. It’s
a kind of … I don’t say it’s an easy path but it’s a well-trodden path, it’s
a paved path.
It’s really,
like you said, the ranges and stages are important and that’s why I try and
press upon folks out in the community when I talk to them, I said you don’t
have to be eBay, sorry I always use … I use your former team as an example,
or LinkedIn or PayPal or whatever it is, but if you’re enterprise and it’s
taken you six months to get something in production, if you can do it in three,
that’s like let’s drink, let’s party and celebrate that ’cause then we can do
it in two, one, three weeks, then you can shrink those cycles but at least
let’s just reduce that time to getting that code into production and make those
processes better, which is pretty cool.
I’m going to
ask you about …
Totally. Yeah,
I mean if you can reduce the … I mean just to underline that a little bit, if
you’re doing one release every six months and you can make it to three, now
you’ve doubled your velocity and there’s no shame in that. That’s awesome
because now you can either get double the amount of features and capabilities
out to your customers and/or you can make them that much more reliable. Why?
Because when I double the size of a thing it doesn’t make it only twice as
complicated and twice as potentially bug-ridden, it’s four times or eight
times. There’s an exponential, or at least geometric relationship between the
amount of code I’m changing and the potential issues that I’m introducing. It’s
not linear at all.
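One simple way to see why release risk is not linear in release size, assuming (purely for illustration, this is not a model Randy states) that potential problems scale with the number of pairs of changes that could interact. Under that assumption, doubling the changes in a release roughly quadruples the pairs, and splitting the same work across more releases shrinks the total. The function names and the figure of 120 changes are hypothetical.

```python
def pairwise_interactions(changes: int) -> int:
    """Distinct pairs among `changes` changes shipped together: n * (n - 1) / 2."""
    return changes * (changes - 1) // 2


def total_pairs(total_changes: int, releases: int) -> int:
    """Pairs summed over equal-sized releases (assumes the changes split evenly)."""
    per_release = total_changes // releases
    return releases * pairwise_interactions(per_release)


# The same 120 changes shipped as one, two, or six releases.
for releases in (1, 2, 6):
    print(releases, total_pairs(120, releases))
# prints:
# 1 7140
# 2 3540
# 6 1140
```

Going from one release to two halves the count of possible interactions in this toy model, which matches the direction of the claim, though real systems will not follow any single formula this neatly.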
When you can
go from the six month to the three month, you’ve made everything better, like
noticeably. Then you get from the three month to the one month and the one
month to the one week, yeah you will find as you do that you’ll just get better
and better and your customers will thank you for it. Your customers will thank
you for, “Hey, your releases are not only coming faster, they’re more
reliable.” That is, people should definitely read The DevOps Handbook,
which is Gene Kim and Jez Humble and John Willis and all those guys, Patrick
Debois. As you get faster you also get better. It is not … It’s a wonderful
thing where you’re not choosing between, “Oh, should I be fast or should I
be good?”
That’s right.
The faster
makes you gooder, you know? The faster you get the better you get, so yeah.
I could
literally take 12 hours and just cut it into hunks and make it a 48-part series
with you, Randy. Before we finish up in this session, I do want to talk a
little bit more about one piece of tooling that’s interesting. It’s not
necessarily a single one I’m going to pick out, but you’ve talked about a
couple of platforms and a couple of products in the stack that you’re using and
there’s bound to be some proprietary stuff that you have to build and adapt to
as part of that, like you talked about obviously AWS VPCs and other things.
When you’re looking at a stack to choose to build on, we always hear about
vendor lock-in and this concern about, oh you’re going to get locked in. Should
we be as concerned as some people would think about mapping to a particular
process or some infrastructure and how do you create the right abstraction to
make sure that you’re as least locked in as possible?
Yeah. That’s
a great question. I remember from Google App Engine, that was a lot of … We
definitely heard that critique quite a lot and that’s a legitimate critique.
Everything’s a trade-off. Lock-in is such a pejorative term and we take it as
… Often you hear that and like, “Oh, we’re locked in and oh, that’s
absolutely bad.” Let’s take a step back. Not true. The way to think about
it is: What amount of effort, what benefit are you getting from using whatever,
directly AWS or directly Heroku or directly App Engine or directly Oracle or
like whatever you want to talk about being locked in. What benefit, in terms of
speed, feature velocity in terms of reliability, ease of use, et cetera, what
benefits do you get and then the cost that you pay, and it’s a real cost, of if
we ever had to migrate off of this, how long would it take?
You can kind
of sketch that out in order of magnitude. Would it take me one day, one week,
one month, one year kind of thing, and then do the math for yourself. I will
say for myself that I am, and I’ve started using AWS as a customer for several
of my last places that I’ve been. That’s not a lock-in I’m worried about. Yeah,
that’s true. We are definitely directly using AWS APIs for a bunch of our
things. That is not a big concern for me. There you go. There are other …
People have been, and I say this as a former Oracle employee, I’m not even
going to beat around the bush. Oracle has soured people on … The experience
of that sales cycle, et cetera, has totally soured people on a lock-in of that
one. Oracle has not done itself a service by making that really difficult,
making being a customer of Oracle a problem rather than a wonderful thing.
So far, to
date, we’ll see when we look back on it in the future. So far to date AWS and
Google and a bu-, they have not behaved in that way. Does that mean they could
never behave in that way? Of course it doesn’t, but prices are going down, not
up. People are being … There is more standardization, not less. I think the
general trends are favorable and I think a lot of software and infrastructure
vendors have seen what their brethren in the ’90s and the 2000s did and try not
to do that again. I hope that makes sense.
Absolutely.
I have a
philosophical … Like I say, lock-in is a trade-off like anything else and you
trade it off. If I can get to market three months faster because I just went
directly against AWS, fine. If, on the other hand, I am worried, I personally
am not worried but if somebody listening is worried about, “Yeah, I want
to be able to switch on a dime between AWS and Google or whatever” Cool.
All right. That’s a thing you should decide for yourself and there are ways you
can insulate yourself from that.
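The back-of-the-envelope math Randy suggests for lock-in can be sketched as follows. This is a minimal illustration, not a method from the conversation: the class name LockInEstimate, the day counts assigned to each bucket, and especially the probability weighting are added assumptions. The buckets themselves (one day, one week, one month, one year) and the "three months faster to market" benefit come from the discussion.

```python
from dataclasses import dataclass

# Order-of-magnitude migration buckets from the conversation, in days (approximate).
MIGRATION_BUCKETS = {"day": 1, "week": 7, "month": 30, "year": 365}


@dataclass
class LockInEstimate:
    option: str
    days_saved_to_market: int      # benefit of building directly on the provider
    migration_bucket: str          # "day", "week", "month", or "year"
    migration_probability: float   # hypothetical guess, 0.0 to 1.0

    def expected_migration_days(self) -> float:
        """Migration cost weighted by how likely a migration is thought to be."""
        return MIGRATION_BUCKETS[self.migration_bucket] * self.migration_probability

    def net_days(self) -> float:
        """Positive means the speed benefit outweighs the expected exit cost."""
        return self.days_saved_to_market - self.expected_migration_days()


# Three months faster to market, a year-scale migration, judged 10% likely.
direct = LockInEstimate("direct provider APIs", 90, "year", 0.1)
print(direct.option, round(direct.net_days(), 1))
```

Someone who is more worried about switching providers would plug in a higher probability or a larger bucket and may reach the opposite answer, which is the point: it is a trade-off to decide for yourself.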
I think
that’s a beautiful way to describe it because that’s exactly it, it’s defining
the trade-off and … I think the one I always compare when people say, “I
don’t want to be with a vendor where I’m locked in.” I’m like,
“You’re locked in in so many things in your life. Some would call it
marriage and it’s not a terrible thing, right?” I’m like, “If you
think about it hard enough it’s lock-in. Do you feel locked in? Do you feel
terrified that you’ve like …” No, you made a choice. Like I said, it’s a
great way to describe it. We’ve covered a lot on toolkit stack. Like I said, we
could spend another hour just going into some of the stuff.
You’ve got
such a great story that you can tell on the things you’ve done, but what we’re
going to do is we’re going to bring you back in the future again, Randy, very soon,
and we’re going to continue this conversation because I want to talk to you a
bit about the people side of this and that’s probably one of the more
challenging stories that people want to hear how you succeeded at it.
To close out
today, I’m going to ask you again if you want to just let folks again know
where we can find you online and then thank you for talking about your stack,
how your view of DevOps works, a great thought about lock-in and a little bit
about Stitch Fix. Yeah, so where do we find you online, Randy, and then we’re
going to bring you back again.
Yeah, sounds
great. Thanks Eric. Randy Shoup, you can find me at @RandyShoup, all one word,
on Twitter. You can also find me on LinkedIn and also Googling Randy Shoup, you
know, the two words, digs up a lot of presentations and blog posts and
interviews that I’ve done so that’s another way that people can find out some
of the things that I think. Thanks again Eric, this was really a lot of fun.
Awesome. Thank
you very much Randy.
If you like what you heard here and want to hear
much more, don’t forget to subscribe to the GC On-Demand podcast. You can go to
gcondemand.io where you’ll find the links in order to catch us in iTunes, Stitcher,
the Google Play store and more. Go to gcondemand.io. Don’t forget to rate us in
your podcaster of choice and look for much, much more. Have a show idea? Tweet
us @gcondemand. Thanks for listening.