Ep 200 Rob Hirschfeld on RackN and Cracking the Code on Multi-Cloud and Metal Automation

Sponsored by our friends at Veeam Software! Make sure to click here and get the latest and greatest data protection platform for everything from containers to your cloud!

Sponsored by the Shift Group – Shift Group is turning athletes into sales professionals. Is your company looking to hire driven, competitive former athletes? Shift Group not only offers a large pool of diverse sales candidates from entry level to leadership – they help early stage companies in developing their hiring strategy, interview process and build strong sales cultures that attract the best talent for early stage companies.

Sponsored by the 4-Step Guide to Delivering Extraordinary Software Demos that Win Deals – Click here. Because we had such a good response, we have made the eBook and audiobook more accessible by offering it all for only $5.

Sponsored by Diabolical Coffee. Devilishly good coffee and diabolically awesome clothing

Does your startup need strategic technical content? The team at GTM Delta delivers SEO-optimized, compelling content that connects your company with technical users to help grow your credibility, and your pipeline.

Need Podcast gear? We are partnered up with Podcast Gear Pro to share tips, gear ideas and much more. Check it out at PodcastGearPro.com.

Rob Hirschfeld is CEO and co-founder of RackN, leaders in physical and hybrid DevOps software. He has been in the cloud and infrastructure space for nearly 15 years.

This is a special episode with Rob returning as the guest for his 4th podcast and for the commemorative 200th episode! We discuss how to unlock the power of multi-cloud automation, the challenge of human ops, and how we are finally reaching an “overnight success” of true bare-metal provisioning and multi-cloud automation and operations.

Follow Rob on Twitter here: https://twitter.com/zehicle

Check out the awesome work by RackN here: https://rackn.com

Subscribe and listen to the 2030.cloud podcast here: https://soundcloud.com/user-410091210

Transcript powered by Happy Scribe

Wow, that’s right. 200 episodes. You are listening to the 200th episode of the DiscoPosse Podcast. My name is Eric Wright. I’m your host and holy moly. This is really kind of crazy and awesome. I really just want to say a big, huge thank you to all of you who’ve listened and to all the amazing folks who make this podcast happen, including the amazing friends over at Veeam Software. So give a shout out to them and drop a visit. Go to vee.am/DiscoPosse. They’ve been fantastic supporters of me, my whole community of creators here.

So thank you to the Veeam team again, vee.am/DiscoPosse. Not just because they’re great. They actually have the best data protection platforms in the entire universe. That’s my opinion. So go check it out. And on top of that, if you want to celebrate 200 amazing podcasts, you’re going to need to stay awake. How do you do that? You drink diabolical coffee. That is because it’s the most devilishly good coffee and we’ve got the most diabolically awesome swag, including really cool stuff, which is coming up for the holidays.

So get on in. Some really cool slick mugs, they’re showing up there. So go to diabolicalcoffee.com. And one last amazing thing because not just your data needs to be protected, but your life, your data in transit. The best way to do that is to make sure you use the fine folks at ExpressVPN. I’ve been a fan of VPNs for a long time for a variety of things. First, functionally to protect your data in flight, in transit, wherever you go, because I travel a lot.

And on top of that, going one step further by making sure that you can do cool things like testing for different locations and locales and testing latency in your network when you’re doing web testing. I’m a big fan of doing that. So do that. Do that thing. Go to tryexpressvpn.com/DiscoPosse. Again, that’s tryexpressvpn.com/DiscoPosse. That’s it for the live reads for this one. And speaking of live reads, this is live and awesome. Well, it was live when I did it. I guess technically every recording is live when you do it.

But this is Rob Hirschfeld. Rob is a good friend. He’s also the founder of RackN, the inventor of Cloud. Oh, yeah. You’re going to hear about that story. So I think this is really worthwhile to jump on in. Thank you to the folks who do this thing and support this podcast. Make sure you share it. Click subscribe. Go to Rob’s site at RackN. Check out the 2030 Cloud podcast. Also fantastic. And with that, actually, the funny thing is it’s just the episode for yourself. There you go. Rob Hirschfeld on the DiscoPosse Podcast.

Hello, this is Rob Hirschfeld and you are listening to the DiscoPosse Podcast.

This is the fun part because I get to do the intro. You’ve actually done your voice for Binger before. I’ve been lucky enough, Rob. Now we’ve talked a few times on this and I wanted to have you on because this is super special for me. First of all, to thank you. You are one of the inspirations to why I do this. I kind of go back to sitting in Austin at OpenStack Summit and me with my crazy weird USB dual mic set up, just trying to put something together, and we got to first sort of meet and spend time there, actually at the summit.

And obviously we’ve run a lot of miles, both in the tech circuit and quite literally on the ground at these events. But this is the 200th. I had you on for my 100th episode, and this is the 200th episode. So that’s why it was perfect that we got a chance to put this together. So thank you for inspiring me both in business, in life. And of course, the podcast is the third piece of that. It’s been a wild ride.

You’ve been a valuable friend, and I’ve been enjoying. It’s fun, because with podcasting, you get to listen to people talk vicariously. And I love what you’ve been doing with the podcast and sort of where you take it like conversations you have.

I’ve been lucky enough to spend a lot of time with you. But for folks that are new to you, let’s have you do a reintroduction and I’ll tell people go back and catch. I think we’re at, like, four podcasts we’ve actually recorded together on my side and a couple on your side here and there as well. But let’s give them the full meal deal on Rob Hirschfeld.

It’s interesting because I’m about to celebrate 20 years of inventing the Cloud. That’s one of the claims to fame I sort of keep on the down-low, but Dave McCrory and I need to get out and tell people a little bit more about it. We started a company over 20 years ago now, where we were the first people doing virtualization in the data center at any reasonable scale, and we filed some patents on it that are about to expire. We won’t have to worry, but we never made any money from them.

They got locked up by startups and then the Quack acquisitions and things like that. But yeah, so I’ve been doing the data center automation and virtualization business for a long, long time. So it’s very true to the theme of what it means to do virtualization and data center operations at scale. Like you said, I got really involved. I was at Dell and got really involved in OpenStack at the time when everybody was worried that VMware was going to take over the Cloud and Amazon was a nuisance, not necessarily the Juggernaut that it’s become.

And then, well, believe it or not, seven years ago, RackN is now seven years old. We left Dell with this sort of idea that OpenStack was going to have trouble because there weren’t good operating paths, which is sort of what we’ve seen play out. This was pre-Kubernetes, like, I was involved in Kubernetes early on, and actually, I saw the same thing with Kubernetes and was concerned about the operational patterns, too. And so the theme sort of for me, career wise, and then RackN specifically, is that companies aren’t running infrastructure well.

RackN set out to say, all right, how do we help companies run infrastructure better? We always had this idea that “you’re not smart enough to run a data center” is amazing marketing from Amazon’s perspective. What’s crazy to me is that so many people in our industry just go along with it. The HPs and Dells turn around and are like, oh, well, I guess our customers are too stupid to use the gear that we sell them. And that’s always insulted me at this sort of foundational level.

Even the OpenStack stuff that we were doing always sort of got in the way of, like, oh, of course it’s going to be hard to operate. That sort of goes with the territory. And even with Kubernetes now, I was just listening to Brian Gracely with the Cloudcast, and he’s like, Well, Kubernetes is really hard and complex, and we accept that. And so it strikes me as a problem in our industry that we allow infrastructure to be so hard to operate. And we spend a lot of time talking about, like, needful complexity versus inherited complexity versus collaboration cost.

That’s, my bad. So we’re at a point now with RackN, sorry for the long intro, but we’re at a point with RackN, after seven years where we’re doing significant business, global scale operations, we’re breakeven profitable on the business, which is great for a startup and sort of seeing things working the right way. And now we actually have to tell people what we’re doing.

Yeah. You’ve got three more years and you’ll be an overnight success. The typical is the ten year mark where you’re suddenly like, ‘Why haven’t we seen this before? We’ve been here the whole time’. You should have seen us. We’ve been at every event, we’ve been contributing in code, we’ve been contributing in community. We’ve been contributing in our voice.

And there’s a perseverance that’s required to do this and a bootstrap on top of that. So that’s a big deal for people to do that.

It’s been crazy. I think some of it comes back to letting people catch up with your vision.

Yeah.

There’s definitely things that I’ve watched us do that make our vision more accessible. But I’ve also watched people catch up to the vision and that’s, I think a lot of times with startups, if you’re having trouble communicating the idea, it could be that you’re wrong or it could be that you’re ahead, right? I mean, that’s what my virtualization experience was. We knew VMs were going to be essential for running a data center in 2000, but we spent so much time telling people hey, these VM things are real, and you should use them, and they’re better than hardware infrastructure for this purpose, that by the time we’d won that battle, we lost the war from a startup perspective.

And talk about another bootstrapped example in the VM world, right? Literally VMware. I hadn’t even realized until not even that long ago that VMware was originally bootstrapped. They didn’t go get VC. I was like, what? But we look back on it now, and it’s kind of funny that just as a momentum that they have today, that everything started with sort of breaking the mold on human belief in technology viability and the trope of we can’t use virtual machines because we need hardware performance.

We can’t use the Cloud because we need data center protections and security and controls. We can’t use Kubernetes because our applications can’t live in any ephemeral environments. You show me a can’t, and I’ll show you a startup opportunity. It’s really wild to see this transition over to your point. The vision is there and the perseverance to maintain that vision and execute against it for long enough for the industry to finally understand that. Okay. Yeah. This is a thing, and it’s tough to find people. Erica Windisch is one of my favorite examples.

Erica has gotten to the 90-yard line of 100-yard dash, like five times in a row and then finally got to the finish line because for a variety of reasons, had never been able to see something to fruition. And she was able to do that with IOpipe and went to a successful exit. And I actually haven’t caught up with her in a long time. I should. Again, because she’s just such a fantastic person.

Yeah. I remember this at OpenStack Paris fighting an early Docker and saying, this is a big deal. You need to pay attention. And the struggle of being able to explain why something is important. And this is to me, part of my journey from being a technologist to being a CEO is understanding why and how to explain the business value of what you’re doing. Because as technologists, we all want to be like, this is shiny and pretty, and it makes this easier. And that’s enough of a reason. But it’s not enough, and we need to accept that just because something is better or easier or the new thing, it’s not necessarily what’s going to actually become a success.

That’s always a challenge for us. It’s taken us a long time to be better at expressing how much the complexity of what people are building is an actual problem. You run around in tech circles, and it’s like how things are so complex. I’m scared of the complexity. I’m worried about the complexity. I started doing this stuff about a year ago on a Jevons complexity paradox. If you’re not familiar with the Jevons paradox, it’s an org technology thing that we need to understand better: when you make something easier or cheaper, people use more of it. And so about a year ago, I was convinced that we have a complexity paradox going on where we’ve made it super easy to use cloud services or things like that. There’s no downside. There’s no apparent cost in that. But hiding that complexity has made everything much more complex, and complexity starts bubbling to the surface. And like the Amazon downtimes where one service fails and it cascades to their whole infrastructure, we see this pattern over and over and over again.

Or then you offload your services to a third party who uses the underlying services in Amazon. So you’re hosed anyways, right?

We are like one step away from Amazon going down because they had a third party that depended on a service that was in Microsoft that depended on a service that was in Google. And the Google service failed because the time got out of sync or the certificate. The certificate wasn’t updated when it was supposed to be updated.

Certificate. That’ll be what takes us all down. It won’t be DNS. It’ll be some goofball who didn’t set his calendar to renew an SSL Cert.

We can actually predict this with 100% certainty. It’s going to be an SSL cert that expires, that depends on a DNS entry where the person no longer has control of the DNS to do the record that’s necessary to create and renew the certificate. And so that’s going to be this cascading failure. But it’s totally conceivable that the clouds actually have interdependencies on each other that they don’t fully anticipate. And that should scare everybody. The challenge is being scared of the complexity of the problem and understanding the actual cost of that complexity and why somebody would, from a business perspective, pay money.
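The expiring-certificate failure described here is one of the few pieces of this problem that is easy to check mechanically. The following is a minimal sketch, not anything RackN ships: the function names and the 30-day warning window are assumptions for illustration, and it works on a certificate's `notAfter` string rather than connecting to a live host.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    not_after uses the format found in certificate metadata,
    e.g. "Jan 31 00:00:00 2022 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return (expires - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate is inside the (assumed) warning window."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    today = datetime(2022, 1, 1, tzinfo=timezone.utc)
    print(needs_renewal("Jan 31 00:00:00 2022 GMT", today))
```

A check like this only covers the expiry date. It says nothing about whether the DNS record needed for renewal is still under your control, which is the dependency called out above.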

But it’s really more simple. It’s really take action on the problem. This is what it always comes back to. If you’ve identified a problem, how do you motivate somebody to take action to fix the problem or to change direction or things like that? Right. And that’s super hard. People are busy.

We need to come up with assisted menu heuristic. This is the ability to relate to them that the problem that they’re creating by adding a DIY solution is actually greater than the value. And ROI on investing in, like, technical debt is just such a throwaway phrase that we attach to something. But it gives us a free pass to ignore what’s actually happening and identify it. And it’s sad because you and I talk all the time about this stuff and we see it in real environments, day in, day out where you just celebrate the heroics of complexity.

And some of it. I’m starting to think about terms like complexity budget. So, you know, I do this. We actually have 2 hours a week where we have people come together and talk about DevOps or the future. So this Cloud 2030 discussion group that we have that I started, like as a pandemic hallway track, and we’ve been going over a year, and then we turn them into podcasts so people can listen to them. But we.

Sorry, my dog is, hold on.

Let’s talk about that after. But like the fact that what 2030 Cloud is now versus how it began, that’s actually quite an interesting path you’ve taken.

It’s stunning because we have a dedicated core. And then people come in as they want to talk about topics, and we identify topics. And what’s amazing is when you get a group of people talking about the future and infrastructure. Also week to week to week. These themes emerge out of those discussions that are just stunning. Right. So we talk about complexity or coupling or the legal ramifications of jurisdictional changes that could impact how technology is formed. The threads here are crazy. And there are some things that are super impossible to talk about.

Like we tried to talk about networking. Networking always double clicks down into infrastructure or persons or technology or jurisdictions. Security is the same way. It’s super hard to sink into a simple security problem. And then the complexity comes back, comes in over and over and over again. And this idea of having a complexity budget and understanding what you’re doing. The point that you were making about the sysadmins and the technical debt, though, is that a lot of this is organizational bias towards siloed behavior, and it’s actually not just the organizations.

It’s actually the tools play to that, because that’s how you sell into market. So we are so used to operational silos, and then to sell a tool or a platform or product into an operational silo, you build tools that work for operational silos. One of the things that RackN’s done that I didn’t even realize we were walking into this trap is that we built tools that crossed operational silos, right. Because our goal, our customers’ goal was end to end operations. And I see this in conferences all the time.

You get people, the CEO or whoever is in charge of the conference. The big speaker stands up and says, I must have an end to end single pane of glass, one, one ring solution. Right. And you know, the Eye of Sauron flashes in the background, and everybody sort of watches and they’re like, yes, that’s what we want. And then they leave that session and they go talk about their siloed tools and how they’re not going to act, how the network team is the enemy, and we have to fix it without them.

And so we’ve created this interesting situation where it’s very clear that you want an end to end solution. You need zero-touch operations. For us, somebody’s wheeling a rack into a data center. Right. We do this for banks a lot, and we’re software. So the banks are doing it. We’re just making it possible. But you wheel a rack in to a server in country somewhere and they turn on that rack, and they want that event to turn into working productive equipment inside of an hour, and then they want it to be completely the same process that they use in every data center.

Right. Or if they need to reset the data center because they’re worried about ransomware or something like that, they can push a button, go get a coffee, and then come back and have the system all set, which sounds simple. But to do that, you’re actually talking about crossing, at a bank, 15 or 20 different organizational silos to get all that stuff to work. Right. And it’s a super hard problem, not because you can’t do all those things. It’s a super hard problem because each silo resists integrating with the other silos. It’s one of the things that made Cloud a big deal.

It’s like, oh, my developer can set up a network because the Amazon APIs have networking. My developer can set up a compute system. Yay. Doesn’t mean they’re doing it in ways the networking wanted.

Right. Yeah.

The thing that you think about from all those perspectives, though, is that we’ve incented the industry to build silo, silo, silo, silo and tools to do silos, and then we haven’t created the incentives to connect the dots. Right? I mean, DevOps conferences are full of people crying on each other’s shoulders about how misunderstood they are.

I’m sorry to be pejorative. I’m not trying to be pejorative about DevOps conferences, actually. The way it goes, it’s like we need to talk about the culture that would allow me to work with another team. And then they say that, and then they go in the next room and they’re like, these are all the reasons why I can’t work with the other team.

Right. You tell them that you’re an ops-focused person, and I pulled this thread the other day, and it had the precise effect that I thought it would. I actually said that your GitHub heatmap is actually a meritocracy, right. Because I meant it in the way that I’m often presented by people all the time, that if I’m doing infrastructure as code, and I’m dabbling, that the moment that I go to a DevOps conference and pull this thread. Pull that. It’s not, again, not talking negatively on the DevOps conferences, but the audience there, the community that’s there, GitHub heatmap is sort of like a great vendor T-shirt to them.

It’s a thing they wear proudly and a thing that they show off. And so when you get there and you don’t have that, you don’t necessarily have the skills to walk into the room that screams about inclusivity, and then you get shoved out the back because you didn’t write a Perl script, and you don’t know who somebody else was at one point in time. I feel that sort of battle, like Gartner at their recent event. They talk now about XOps, which was, I rarely see something that I find kind of cool about some of the Gartner stuff because they have to be careful and generic with a lot of things.

They’re talking about predicting ship building, which it’s a really tough thing to the level they’re working at. So they talk about XOps just like DevOps, AIOps, MLOps, ITOps, NetOps that each of these silo breaking methodologies has created its own silo, and we need a cross breaking silo create, like, we need an abstraction layer for the silos that have really been meant as abstraction layers to silos.

And this is actually a hat tip to Gartner because they’ve really been doing something that we think is a good description of this, and Thoughtworks has done it too, but they call it infrastructure pipeline or continuous infrastructure automation pipelines. We consider them automation pipelines. They’re actually showing all of these things fitting together, and it’s different than value stream mapping, which is similar. It’s like I need all my teams to work together and understand how I generate value. It’s important, but they’re actually elevating it to say all these silos need to be connected in the pipeline like a CI/CD pipeline, but for infrastructure. And we found that nomenclature incredibly helpful for this. The difference being that what we’ve been doing with RackN and Digital Rebar, our product, is we’ve actually built the infrastructure pipeline as a platform, whereas the.

There’s thunder going on in the background, you can probably see the lightning in the window.

You’re in the midst of a good Texas storm.

I got my UPS and I should be set, but definitely much needed rain.

But the idea here that I can run a workflow all the way across all these pieces as a platform is actually a critical thing. When Gartner shows that they’re like, and I’ve got 20 different tools I have to use to connect all these dots together. And the lift on that organization is super high, and the complexity that you create is super high. So we’re excited to see a name for it. The infrastructure pipelines concept, which people seem to sort of get intuitively.

Like, okay, I got CI/CD pipelines for code. They don’t really work that well for infrastructure. We can talk about GitOps and how that’s sort of this very narrow band of things, but it doesn’t really work for infrastructure. So I need a pipelining system that connects all these tools I’ve got for infrastructure.

It’s like Jenkins for your hardware. When you can give it a name and a relative example. I’ve totally stolen your infrastructure pipelines. When I talk about stuff through the stuff my team is doing at work because we’ve got the app pipeline, which people are totally they get like, it makes sense. There’s both application and infrastructure pipelines, and when it comes to doing things around decision automation and infrastructure automation, that’s where we’re seeing the more of it come into play, which is originally it was like, just do the thing like the hypervisor manager will be the layer that people work with, and so we’ll attack it there.

But we’re finding more and more is that no, they’re using some kind of a pipeline to manage that abstraction layer, and they’ve moved away and they realized the true control plane is the human control plane, which lives in pipeline, and pipeline is manifest it’s physical human run books that we’ve played out for all this time, and now we can actually relate it into product. And this is why I’m on team RackN. I’ve been for a long time on this.

Thank you. It’s interesting to us, and it’s useful to bring up the human run book piece of this because we do want this end to end component. And one of the things about the pipelines for us, because we’re a product company. So us building a platform that gave somebody a pipeline would be a pat on the back, but it’s not our objective. And actually, this is worth explaining. What we try to do is we want the pipelines we build to be reusable and standard. And I watched this, and this goes back to RackN formation history. We used to do it at the time with Chef, switched over to Ansible. Right.

And all those tools are great, really good, actually, but they aren’t designed for reuse. What we see in the industry is, and Terraform has the same thing in spades, it’s really a challenge. We see people using the tool in similar ways, but not with shareable components. Like you get a Terraform provider, but when people build like a plan to talk to a piece of infrastructure, those plans are not typically reusable. They’re not decomposable. Right. So you might have three teams using Terraform to interface with the same cloud, but doing it in different ways and nobody can audit it, nobody can check it.

It becomes really a problem. And that’s where the pipelines break down. You can’t build a pipeline easily if the things that you’re building the pipeline on top of don’t have a degree of standardized interconnect between them.

This is the one thing, just to stick there and pull on the Terraform piece, like even in their own docs, they’re very clear to tell you this is a bad idea. If you are doing data interplay between external systems, it’s not going to go well. You’re creating rigidity and things can change, and then your run book will no longer be valid. I respected that they put it in there, but like any good stuff you put in documentation, it’ll never be read, and people are still going to try and work on it.

And you and I have talked about this before, right? The pattern in Terraform is it is a single source of truth, and Terraform is easy to pick on in this case. They designed a tool that has a single source of truth embedded in it that assumes it can actually control the environment, which is handy if you have to build an environment. But infrastructure changes before and after your tool runs, and even in between the runs of your tool, the infrastructure changes. The idea that the state is controlled by Terraform is a failure at the pipeline level because pipelines are part of a flow, and so things happen before your tool operates. Things happen after your tool operates.

And so in building a pipeline, you have to have this idea of an incremental state and your state has to be adaptable. So if you’re messing with the infrastructure, you have to expect that something might change outside and you can take that information in and say, oh, look, I just learned this, and there’s a ton of cases, especially in configuration, where you build a cluster, and the keys for that cluster aren’t known until the cluster is built, right?
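The incremental-state idea can be sketched in a few lines. This is an illustrative toy and not Digital Rebar's data model: the stage names, the `observe` callback, and the token value are all hypothetical. The point it shows is that each stage merges what it learned into shared state, and the pipeline re-reads outside changes before every stage instead of assuming it owns the truth.

```python
from typing import Callable, Dict, Iterable

State = Dict[str, str]
Stage = Callable[[State], State]


def build_cluster(state: State) -> State:
    # The join token cannot be known before the cluster exists,
    # so this stage reports it as newly learned state.
    return {"cluster_id": "demo-cluster", "join_token": "token-generated-at-build"}


def join_nodes(state: State) -> State:
    # A later stage consumes the value discovered by an earlier one.
    return {"joined_with": state["join_token"]}


def run_pipeline(
    stages: Iterable[Stage],
    initial: State,
    observe: Callable[[], State] = lambda: {},
) -> State:
    state = dict(initial)
    for stage in stages:
        state.update(observe())     # take in changes made outside the pipeline
        state.update(stage(state))  # merge what this stage just learned
    return state


if __name__ == "__main__":
    final = run_pipeline([build_cluster, join_nodes], {"region": "lab"})
    print(final["joined_with"])
```

Contrast this with a tool that treats its own stored state as the single source of truth: here nothing breaks if `observe` reports a value the pipeline did not set itself.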

You might get a token or security or generate a certificate. That’s what makes Kubernetes so hard to install. It’s not Kubernetes. Kubernetes is a simple Go binary that could run under systemd with a ten-line install command. But what makes Kubernetes hard is the fact that you have to generate certificates, if you do it right, for every service that interacts with it, and then distribute the TLS infrastructure. That is actually what made the whole Kubernetes the Hard Way hard: the TLS infrastructure you had to build, not the binaries.

The binaries are the least of your concerns.

Yeah, communication between nodes is like the simplest possible thing. The scheduler out of the box does what it’s supposed to do. The hard part is actually creating a proper, secured, and operational infrastructure that’s resilient, too. Right.

That was the one thing I’m probably the only person who talks about Nomad who doesn’t have a hashicorp.com email address, and I’ve even got two Pluralsight courses on it, which are lightly attended just because it’s still early days with a lot of that stuff. But I’m banking that there’s more and more people are going to dig. I like that it has stuff that solves a lot of these problems. However, it just moves the problem goalpost a little bit to a different area.

At the end of the day for something like that, your development team or whatever is going to use a tool that should abstract out how the containers are operated. And so we see this, like when we use Terraform for our pipelines to do cloud provision because people are used to it. The cloud interfaces are actually pretty good, even though they’re heterogeneous. We deal with heterogeneous stuff pretty well because that’s what infrastructure is, but at the same time when we do it, we designed it in a way that doesn’t require Terraform to be the interface.

So if somebody says, oh, wait, I don’t want to use Terraform anymore, or HashiCorp becomes hostile and Terraform isn’t a good utility, we could switch. Because at the end of the day, it’s not whether you want to use Terraform or not, just like Nomad versus Kubernetes. Nobody cares, as long as your containers are running and schedulable. So the idea is you want to break it back into what that unit of work is that needs to be done at that phase in the pipeline. And then you can start substituting, which is exactly what CI/CD pipelines do.

It’s like. Yeah. Look, I started with code. I needed to deploy it, whatever you got. And then over time, you keep adding new things into the middle of the pipeline or you switch tools and you’re like, oh, here’s a better security scanner. I’m going to swap it. And nobody. The pipeline keeps going; you just swapped out a segment that does the job better. And that abstraction becomes a really useful thing for building all these systems. You have to have that connective tissue. You have to have a way to move state across a pipeline.
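Swapping one segment while the rest of the pipeline keeps going can be illustrated with a toy like the one below. Every name in it (`provision`, `basic_scan`, `better_scan`, `deploy`) is hypothetical; it only shows the shape of the idea, where each phase is a named slot with a fixed contract (state in, state out) and the tool that fills the slot can change.

```python
from typing import Callable, Dict

Stage = Callable[[dict], dict]


def provision(state: dict) -> dict:
    return {**state, "machine": "provisioned"}


def basic_scan(state: dict) -> dict:
    return {**state, "scanned_by": "basic_scan"}


def better_scan(state: dict) -> dict:
    return {**state, "scanned_by": "better_scan"}


def deploy(state: dict) -> dict:
    return {**state, "deployed": True}


def run(pipeline: Dict[str, Stage], state: dict) -> dict:
    # Dicts keep insertion order, so the slots run in the order they were declared.
    for stage in pipeline.values():
        state = stage(state)
    return state


pipeline: Dict[str, Stage] = {"provision": provision, "scan": basic_scan, "deploy": deploy}
before = run(pipeline, {})

# Swap the tool in one slot; its position and its neighbours are untouched.
pipeline["scan"] = better_scan
after = run(pipeline, {})
```

Because the slot is defined by the unit of work and not by the tool, replacing the scanner does not require touching provisioning or deployment.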

It’s been fascinating for us. Yeah.

The thing that I really want to pull out of this is you mentioned it. HashiCorp had to be the example, right. What if HashiCorp becomes hostile? And we always have this thing like, even Kubernetes. People are like, oh, there’s such a vast group of people worldwide who are supporting Kubernetes. How can they go sideways? One word, Docker, right. To the point now where we’re questioning whether it’s even viable to maintain now that Docker Desktop is licensed and it is entirely possible. Look, Mirantis was a good example, like the largest ever funding round in open source history, $100 million.

And I have not actually heard Mirantis mentioned, except in historical reference for quite a while. They’re doing stuff now. They were the Kubernetes company, and they were originally the OpenStack company. They’ve had to pivot and adjust, and the world has not necessarily been friendly for them. As a result, it’s tough. So Docker went through the same thing when you wrap a business around an open source product. And then there’s a divergence of belief systems in where it goes. We see that played out now, and now they have to make it commercially viable.

And so all of a sudden, we have to unattach, like, this is the AWS risk factor of in Open. So Kubernetes, no matter how large it is, I have to think about what’s the risk pattern. This is sort of the lock in myth in a way, but as a methodology I need to think about preparedness.

If 2020 hasn’t taught us anything about supply chains, then you’re not paying attention, right? We have learned about physical supply chains. We’ve learned, going back to SolarWinds, about software and virtual supply chains. These are absolutely critical things that companies should be considering in how they look at building their software. And innovation is part of that supply chain. One of the things that we talk about with the cost of complexity is that when you build systems that are very complex, they end up being tightly coupled or having unseen coupling.

And that coupling actually makes it harder to innovate. Right. We just literally talked about a CI/CD pipeline, where you swap out something that works better. I could easily see it, actually. It’s very pragmatic. So I’ll stick with Terraform: you use us to provision with Terraform. We build a template, you like our templates or use whatever Terraform. But you could come back and say, you know what? I’m not using the provider that you’re using. The version I have is further back because it hasn’t been tested.

There’s a new feature that I have to use in a Cloud that isn’t exposed in the provider yet because they lag. And so it is essential that your automation right, for us, the pipeline has an extension point that says, oh, wait a second. If I need to make a call to an Amazon API or a Cloud API or another tool that’s not factored in. I can add that into my pipelines without breaking other things. Right. And it’s subtle, but it’s so important. This took us a long time to realize and longer to get right is that even though I’m using a completely standard process, all of our cloud interfaces use the exact same pipeline, but all of them have extension points.

I actually just gave this talk in ADDO, and I wish I had more time to show it, but each cloud has its own layer of, oh, these are the things that I have to do to service that Cloud through Terraform. Same actions that I run in Terraform. But the way you do the work not just plan differences. Like for Linode, you have to open a firewall port for Google Cloud, and it doesn’t work. Right. So you have to SSH and Ansible to join the machine.

Each one has some wrinkle, and you can easily imagine: my company makes this additional call in Amazon that isn’t in a Terraform plan, or that I can’t put in a plan because the sequencing is wrong. And so you’re like, how do I add my unique wrinkle into that work? Normally you would fork it, you would have your own version of it, or you’d write a Bash script. What we worked out with the pipelines that has been game changing for us is that there are extension points in how pipelines are built.

It allows you to infrastructure as code wise, extend the pipeline. And then from that perspective, have a very narrowly defined, oh, here is where I have to open up network ports in Linode because they don’t have a firewall in place like Amazon does. Same inputs, different actions or slightly different paths. But I can go back and see exactly how it was different than the standard path. And then we do that, like for Linux installs or VMware installs, that pattern of standard with known extensions plays out in incredible ways.
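
As a rough illustration of "standard with known extensions", here is a Python sketch in which every cloud runs the same pipeline and a cloud-specific hook is attached at a named extension point. The registry, the `extend` decorator, the step names, and the port value are all assumptions made for the example. They are not Digital Rebar's or Terraform's actual API.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, str]
Hook = Callable[[Params, List[str]], None]

# Hypothetical registry of extensions, keyed by (cloud, extension point).
EXTENSIONS: Dict[Tuple[str, str], Hook] = {}


def extend(cloud: str, point: str) -> Callable[[Hook], Hook]:
    """Register a hook at a named extension point for one cloud."""
    def register(hook: Hook) -> Hook:
        EXTENSIONS[(cloud, point)] = hook
        return hook
    return register


@extend("linode", "after_provision")
def open_network_port(params: Params, actions: List[str]) -> None:
    # The Linode-only wrinkle from the conversation: open a network port.
    actions.append(f"open port {params['port']} on {params['name']}")


def run_standard_pipeline(cloud: str, params: Params) -> List[str]:
    """Run the same steps for every cloud, calling extensions where present."""
    actions: List[str] = []
    for step in ("provision", "join"):
        actions.append(f"{step} {params['name']}")
        hook = EXTENSIONS.get((cloud, f"after_{step}"))
        if hook is not None:
            hook(params, actions)
    return actions


params = {"name": "node1", "port": "22"}
linode_actions = run_standard_pipeline("linode", params)
amazon_actions = run_standard_pipeline("amazon", params)
print(linode_actions)
print(amazon_actions)
```

Because the difference lives in one registered hook rather than in a fork of the pipeline, it stays easy to see exactly where one cloud departs from the standard path.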

This is about protecting innovation.

Yeah. When it comes to drift management, and this is the other thing that we have to help them. Right? There’s provisioning. So stuff that’s particularly good at provisioning, and there’s stuff that’s particularly good at continuous configuration management and never the twain shall meet. This is part of the problem that we bump into. Now, where does drift management come into play now, in how you’re approaching this problem.

Drift management is tricky, and there’s a couple of ways that you can slice it. Are you thinking that the system is drifting out under the configuration, or are you thinking the actual?

First is the infrastructure itself moves with the right level of abstraction, the right level of change that can occur. I used to bump into this with just Terraform, like just a simple Cloud, a persistent Cloud workload, and all of a sudden for no real, particularly good reason. 22 days into me running my infrastructure, it gets reprovisioned because there is some drift, and Terraform sees it and says no, and it responds to my workload because it saw underlying drift in AWS, but I’m like, I wouldn’t even have noticed the workload was exactly the same.

But somewhere a host, an identifier, something changed. That was enough of a drift that it triggered a Terraform.

It could actually be a change in the provider that you’re using. That’s one of the reasons that you can now lock the provider, so you don’t get an updated provider that then interprets a value in a different way. The way we deal with that is that our state information is designed to be incrementally extended and incrementally updated. In very practical terms, we embrace PATCH as an API, as opposed to PUT, which means that we expect people to make changes to individual parameters or individual values in objects rather than expecting somebody to replace the whole value.
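
The difference between the two update styles can be sketched in a few lines of Python. This models the semantics on an in-memory dictionary only. The `put` and `patch` helpers and the machine fields are hypothetical and do not represent any real API.

```python
from typing import Dict

Store = Dict[str, Dict[str, str]]


def put(store: Store, key: str, obj: Dict[str, str]) -> None:
    # PUT semantics: replace the whole object with what the caller sent.
    store[key] = dict(obj)


def patch(store: Store, key: str, changes: Dict[str, str]) -> None:
    # PATCH semantics: change only the named values, keep everything else.
    store[key] = {**store[key], **changes}


def new_store() -> Store:
    # Hypothetical state for one machine.
    return {"machine-1": {"address": "10.0.0.5", "os": "ubuntu", "owner": "team-a"}}


put_store = new_store()
put(put_store, "machine-1", {"os": "debian"})

patch_store = new_store()
patch(patch_store, "machine-1", {"os": "debian"})

print(put_store["machine-1"])
print(patch_store["machine-1"])
```

With the replace-style update, a caller who only meant to change one value silently drops the address and owner. The incremental update keeps them.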

Anybody making changes to a Terraform state file, they’re doing it with tweezers, and they know they’re doing something dangerous and crazy, right? It’s a bomb defusal. Sometimes you have to do it, but you’re going to wear as much padding as you can. And so for us, we know state changes all the time. So from a drift perspective, we work toward idempotency and not doing bad things, and telling you, hey, this value isn’t what I expected. I’m going to stop and not try to fix it.

Rule number one with infrastructure, stop if something isn’t what you expect, don’t just keep going.

Works the same with fiber cables when you’re racking a server. If you feel resistance when you’re shoving the server back into the rack, you should probably stop and think about why there’s resistance.

We have this fight all the time, and actually we ended up adding retries in as a programmable option, which is nice, so I can be like, hey, this thing always fails. One retry and it fixes it. But by default, we don’t do retries, because if something didn’t go the way you planned then it’s wrong. Stop figure out what happened and fix it. And sometimes people are like, I don’t like that. We’re like, look, it’s much better to realize that it wasn’t what you expected. Fix it.
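
A small Python sketch of that default: stop on the first unexpected result, with retries as an explicit opt-in. The `run_task` helper, the `UnexpectedState` exception, and the flaky task are made up for illustration and are not the product's real interface.

```python
from typing import Callable


class UnexpectedState(Exception):
    """Raised when a value is not what the task expected."""


def run_task(task: Callable[[], str], retries: int = 0) -> str:
    # Default is zero retries: the first unexpected result stops the run.
    if retries < 0:
        raise ValueError("retries must be >= 0")
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except UnexpectedState:
            if attempt == attempts:
                raise  # Out of attempts: stop and let a human look at it.
    raise AssertionError("unreachable")


def make_flaky(failures: int) -> Callable[[], str]:
    # Hypothetical task that fails a fixed number of times, then succeeds.
    remaining = {"count": failures}

    def task() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise UnexpectedState("value is not what was expected")
        return "ok"

    return task


try:
    run_task(make_flaky(1))
    default_result = "ok"
except UnexpectedState as err:
    default_result = f"stopped: {err}"

retry_result = run_task(make_flaky(1), retries=1)
print(default_result)
print(retry_result)
```

Making the retry count a per-task argument keeps the "this thing always fails once" case possible without hiding real problems everywhere else.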

One further on that one, if you don’t mind, Rob. Timeouts are also one of the biggest areas of issues I’ve seen, with people that just manually blow out timeouts. Terraform is a great example. I’ll run exactly the same build. I fully automated an EKS cluster, and everybody said, why would you do that? It’s the simplest thing. Just use CloudFormation. Assume that I’m going to do it on Azure too with AKS, so I want to have a separate way. So I did it. Whether I’m self annihilating my belief in the world by doing this stuff all the time, I do it and I build it and it runs.

It takes like 17 minutes to have a complete EKS cluster. Fantastic. And then I go on a webinar and I go to do it. It takes 42 minutes, because just some weirdness inside Amazon takes longer. And then if one thing flips beyond five minutes or ten minutes or whatever the default timeout is in Terraform, the whole thing just fails. And now I can’t just pick it up where I was. I have to basically unwind it. But now there’s timeouts on the unwind because there’s this weird interdependencies.

So you end up with this weird sort of like ladder of dependencies. That time can change the ability for a dependency to exist or not exist. That’s the one that I’ve raw retry. But even within that, just the infrastructure could take longer for some unknown reason. Something won’t reply back in time, and then a perfectly working manifest will not work the next time.

Yeah. And it could be a dependency chain that you don’t actually have a real dependency on, or something that was misconfigured that’s never going to recover. What we did with infrastructure pipelines is we saw patterns like that, where you’re using a tool to do a whole bunch of stuff, and the tool is biased towards a single source of truth or very atomic actions. Ansible’s like this: you build these immense playbooks and you run them, and then they either work or they don’t.

I’m wondering if it’s impossible. What we have done is go the opposite direction. So when we build a pipeline, it actually decomposes into very small units. And a lot of times we’ll leave units in and just say this is a no-op because we know that in a different circumstance, you might want that in and you can turn it on later, or you can just make sure that it doesn’t impact the type of infrastructure you’re working with. That could be a whole hour conversation about how subtly and powerfully that standardization works. But what we do, because we end up running each component in what you described as a pipeline, is that the system would actually go in and say, oh, I’m running a cluster with 100 things in it.

Yay, the cluster or even multiple clusters are going to have their own management thread that you can track and see. And it’s a pipeline that’s doing its work. But it’s coordinating actions on separate pipelines running on the different pieces of infrastructure you pulled in. And then that actually. And this is one of the big things that’s coming in the next release that actually pulls in this concept of resource brokers, where instead of the cluster running the plan, the cluster actually talks to a system that is responsible for providing resources in a generic way.

So that becomes a generic abstraction point. And then that is actually what runs Terraform. You’ve got this place where with what you’ve been doing, you’re like, running a Terraform plan, and then it has to go to Amazon and build a whole bunch of resource and do all this stuff. And if someone gets stuck, that plan now is you’re locked there. And then the state for that plan is all of your infrastructure and unteasing that becomes like, all right, I got to unwind it and try the whole thing again.

What we’ve been doing is actually decomposing that into all the units, and then letting each unit be its own pipeline. And then that means that you could actually say, oh, I’m building a cluster. And here’s all the resources I got spun up. That’s great. And now here’s all the downstream work I have to do. And if something breaks in that one task, you might actually be able to fix that one task, reassert it, and then continue. And then the other things waiting for that to happen would get triggered when they’re supposed to trigger, which sounds more complex.
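
To make the "fix that one task, reassert it, and then continue" idea concrete, here is a minimal Python sketch. The unit names, the `done` list, and the missing join token are hypothetical. A real system would persist completion state and trigger waiting work, which this example only imitates with an in-memory list.

```python
from typing import Callable, Dict, List, Optional, Tuple

Unit = Tuple[str, Callable[[], None]]

calls: List[str] = []
config: Dict[str, Optional[str]] = {"join_token": None}


def create_network() -> None:
    calls.append("network")


def create_machines() -> None:
    calls.append("machines")


def join_cluster() -> None:
    calls.append("join")
    if config["join_token"] is None:
        raise RuntimeError("missing join token")


def run_units(units: List[Unit], done: List[str]) -> None:
    # Units already recorded in `done` are skipped, so a rerun resumes
    # at the unit that failed instead of unwinding everything.
    for name, action in units:
        if name in done:
            continue
        action()
        done.append(name)


units: List[Unit] = [
    ("network", create_network),
    ("machines", create_machines),
    ("join", join_cluster),
]
done: List[str] = []

try:
    run_units(units, done)
except RuntimeError as err:
    print(f"stopped: {err}")

config["join_token"] = "example-token"  # fix the one broken task
run_units(units, done)                  # reassert: only "join" runs again
print(calls)
```

The network and machine units run once. Only the failed join unit is repeated after the fix, which is the smaller blast radius described in the conversation.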

This is why complexity is so hard to describe. Pulling us a little bit full circle. Complexity is not bad. Everybody’s like, oh, I have too much complexity. I have to get rid of my complexity. I’m going to move everything to Amazon and just use their tools. Or I’m going to only buy from this one vendor. I’m going to use Terraform for all the provisioning. The Terraform doesn’t do some types of provisioning very well. And so they end up looking at it. And so what we’ve done is we’ve stepped back from and we started as a bare-metal automation company.

Complexity is not avoidable in bare-metal. You can’t say, hey, I don’t think I like RAID controllers anymore, you shouldn’t use them, I’m just going to buy giant SSDs and be done with all that. But the idea here is that you need to manage complexity. So there’s times when you decompose stuff into small units of work, because once the unit is a small unit, it’s reusable and you can track it. And if something changes, your blast radius for that change is small, so you’ve decoupled the actions.

You might have more moving parts, but they’re easier to manage as a unit. And this is the frame that we’ve really been helping people see. It’s not about eliminating complexity, it’s about managing structures, code. Go ahead.

I’m saying you’re introducing us the problem that we fail to talk about that. I see, because I, maybe decided to spend way too much time in business continuity, design and stuff. So I have a very systems thinking approach to all, like, always thinking about dependencies and interdependencies and lifecycle, including duration. Right. So what you’re creating effectively is long running ephemeral infrastructure. It’s the idea that you could rip and replace. However, we also know the pattern of consumption is not to use the stuff like ephemeral, like seconds long containers.

We do not, despite the ability to do so design applications and infrastructure to be treated like a bunch of cattle that we gun down in the field, apparently, which is whatever the reference we want to choose. Right. The reality is that I’ve got containers, I’ve got VM, I’ve got hardware that has to live much longer than what was originally anticipated to the point where things inside it. We’re looking for clean, deprecation options. You are creating the ability to have that long running yet ephemeral pattern so that you can ultimately get the best of both worlds.

So that when the time does come, and there is some kind of underlying drift or deprecation that needs to occur, you can look at it from the pipeline perspective, which is the right abstraction. The human abstraction is to treat it as a pipeline, and then life cycle and duration become variables that you apply to that pipeline.

And that’s what’s been powerful for us. Once we started thinking about things as these pipeline segments, it took me some mental lift because our CTO, he’d be like, no, you’re not thinking about pipelines. And I’m like, what do you mean? I get it, I get it. We keep taking me down the path further and further. And it is about the human understanding of how the pipelines work and the intent. The pipelines have intent and what constitutes a pipeline. When we talk about a pipeline, it really is like, oh, I need to build a cluster.

Okay, great. That cluster is composed of pipelines that need to build a Kubernetes worker or Kubernetes leader. And then the cluster’s job is to then connect all those things together. And so you end up with an intent, and then the intent gets pieced together out of other pieces. And then one of the things that’s fun is you actually end up with standard units in that process. So when you build the pipeline, the difference between the hardware and the virtual pipeline might be a whole bunch of stuff in the middle, but all the stuff at the end is the same, which is amazing.

So now you’re just like, okay, I got the standard, I’m just dropping it in and it’s going to work. And then that solves what we have been trying to solve for a long time, which is how do we stop reinventing the wheel every time we have to provision a server? Right?

Yeah.

For us, it matters because we want our customers to be able to repeat success across every one of our customers. It’s a big deal. Right now, we have a ton of VMware deployment stuff for banks, media, and hosting companies and telcos and stuff like that. So we’re doing a ton of this. But we’ve gotten to a point now where they’re all using the same pipeline. It doesn’t mean they’re using the same hardware or the same network or even the same version of VMware. All those things are extensible, but they’re using the same pipeline.

And so when VMware changes something or we improve something, that pipeline can be shipped to them as a new code unit. Their extensions are against known points, so they can reuse that. And we’re seeing the same thing coming up in the way we’re doing Terraform work and the way we’re doing Cloud interfaces. So for us, it’s a customer to customer thing. But inside of our customers, it’s a team to team thing, or a data center to data center, or a Cloud to Cloud fix.

So you can be like, wait a second. I’m going to build a pipeline and use that on Amazon. Right. And then you can say, well, I need to use that same pipeline on Google. We know where the deltas are, that reusability is really important. But then two teams can actually share the components that they can share. That’s the thinking that’s so hard in this, right. The tools are designed. We were talking about the Terraform ones. Terraform isn’t designed for people to share their plans. Even if you use Terraform Cloud or Terraform Enterprise, it’s managing the stuff better and letting a team work together.

But the idea of everybody in your company using the same plan, that’s where things get more interesting from our perspective.

You’ve actually created a pipeline marketplace. In effect, that innovation in one area allows you to feed it back and then share it with the rest of the community, which brings us back to perseverance, the seven year and beyond period. Right. Your vision is being realized now because you had this. What you needed to do is get people to come along for the ride. And then the network effect sort of begins to come in. It’s a really difficult thing, like customer one through ten, to get them to see that down the road.

And so there’s some stuff you don’t know, right? As you said.

This is a matter of laser focus because it’s been super hard from the start. My co-founder and I wanted to build a software company, not a consulting or service company. And what we heard really clearly is nobody feels like they’re improving their business by installing RAID and BIOS configuration and laying down operating systems. Like I said, this is something that the industry should just have working. It shouldn’t be a creative exercise at any company, and there’s no business value created by doing it in a creative way.

But that’s the way it’s been for the whole time I’ve been in the industry. And we could have taken our expertise in those areas, because we know more about RAID and BIOS configuration and PXE booting servers than really anybody, I’d stand up my team against anybody. But selling those hours would have done no good. And it made our journey as a company harder, but we walked away from, hey, can you just build something for me in my data center so that I can do this better? And we would come back and say, no, that’s not what we do.

We have a software platform and a product, and it does it this way. And that will benefit you if you adopt it. And we had plenty of customers. There was a $1 million account where we were basically like, we’re not going to patch your Cobbler infrastructure for you. We can’t pull the plug on it. It runs 100,000 servers and we’ll help you migrate it. But we’re not going to fix it for you, because fixing it would have entrenched you in this bad pattern. And, yeah, that was, from a startup perspective, being true to: we’re doing software that’s repeatable patterns that can become a marketplace and be shared. What we usually talk about is curated content.

That’s the value, rather than going up with people in parachutes into your data center and fixing it so that your 20 year old infrastructure designs can live another five years.

Something Cloud.

For you only. We saw this with the application development pattern with the team at Cloud Foundry. They said, let’s go in as a pattern development and coaching program. And so it’s far more consulting heavy. And as a result, how many times have you seen a BOSH implementation lately? Because they didn’t lead with products and then use consulting as a secondary revenue stream. In fact, the best thing you’ve done is said no. We could genuinely make money by putting consulting hours in and pulling together a SWAT team of people and growing this whole stable of consultants.

But what you’re doing is delaying the inevitable, and you’re empowering them to do things that are counter to the vision that you have to be able to do. End result, you survive, you persevere. And on the other side of it, people are like, this is it. It actually works, and it’s always worked. It’s just that now they’ve got social proof and customer proof, right? The NASCAR slide is now something that people can, okay, well, if Company X is doing it, then I better get on this train business value I almost wanted to do for any super technical startup founder.

I’m like, you almost want to say do a spoof like a B of A quarterly investor call. It’s never like Jamie Dimon getting on saying yes, this week we updated the RAID firmware on all of our servers on our private Cloud. And so it’s gone very well. We’ve got a strong group of folks that are working on it, like, now they’re talking about business outcomes that they’re doing, and then this stuff that has to happen, you got a choice of how you’re going to let it happen.

Are you going to let the Cloud drive you or are you going to create the Cloud and you’re delivering. This is what Alex Polvi talked about, like, GIFEE, right? You’ve done it.

Yeah, that’s right. It’s one of those slow, methodical things focusing on for us, customer autonomy at the end of the day, but, yeah, it’s hard. It is definitely a journey. It’s fun to watch customers pick it up, by the way and then see it spread virally inside of an organization, which we typically see that. Or we had a customer like, all your stuff was working great. We usually don’t have any trouble with any of your stuff, and they’re like, but we’re seeing something. And a couple of hours later, they’re like, oh, yeah, we had some configuration on our end, but you help them through that.

And the fun thing is when they’re autonomous in that perspective. But it’s the opposite of what a lot of people are doing right now. They’re all telling you to outsource. They’re all telling you to manage service. We’ll take over. We’ll run your data center for you. The hedvig of hey, if Kubernetes is too hard for you to understand, let us do that for you. It’s a good business model for people, right? Yay. But we saw this with OpenStack, and it was really bad. The idea that our software is too complex for somebody to learn how to use.

So just let us take it over. That’s our new business model as we’re going to keep it complex so that you don’t have to worry about it. The industry isn’t going to grow. That’s not a growth model for the industry, especially with edge and things like that coming in. Right. We should have the underlying hour on this of thinking through, what would it look like if we had small data centers in everybody’s house or in municipality? And what would it look like to make that stuff go?

That’s game changing all this cloud stuff. It’s great. It’s amazing. It’s powerful, and people should use the heck out of it. But at the end of the day, be careful about the autonomy that you’re losing, in a lot of cases without even realizing it.

True that. Tell you about my one close in complexity, and I don’t mean to make fun of the folks at Microsoft, because Microsoft Ignite, of course, is happening as we’re recording. This is actually fairly rapid that’s going to go live. I saw the Tweet and it had this thing. It was like Azure Arc deploying Kubernetes on vSphere. I was like, wow, it’s just a list of things that I would love to do as a science experiment, but nothing I would want to run in production. However, there’s a thing, so bless them for gluing together a lot of bits, but there’s a reason the patterns are out there.

In the end, one thing that we need to do is do Cloud as a practice, treat infrastructure as commodity. And like I said, it’s beautiful to see it realized in what you’re doing. And the cheat is that as we close up this part of the podcast, I get to get a real live demo with this stuff, but we should definitely get you out more and more. Now you’ve got such a fantastic audience as well. Cloud 2030 is amazing. It’s really wild to see how that’s continued to gain momentum.

And at first I remember telling people that I know Rob Hirschfeld. It didn’t take long because your reputation and the respect you’ve gained in the industry for asking the right questions when sometimes people are a little afraid to hear the answers, the fact that you’ve done it and people realize it’s for the solution, not just the guy that asks the questions.

You’ve just defined what Cloud 2030 is all about in very succinct terms. It’s asking questions where we’re sometimes afraid of what the answers will be.

And it’s great to see that more and more as I bump into folks, I say, yeah, this needs stuff in RackN. They’re like, oh, Rob Hirschfeld, right. Yeah. All right. The association is there and the respect is earned in what you’re doing, which is cool. So I’m glad that one day we’ll do some more work together in the world be, it would be neat to pair up on more stuff like this. It’s been great. So with that, Rob, what’s the best way if people do want to find out more, of course, about RackN, Rebar, all of the things? Cloud 2030 we’ll have links for folks that wanted to get signed up and how do they reach you?

I am very consistently Zehicle, Z-E-H-I-C-L-E, pretty much everywhere. It goes back to my electric car days. For some reason people don’t like Zs in handles, but I’ve been very happy with it. So you can find me on Twitter and everywhere. I’m very active on Twitter and that’s a great place to interact. RackN is rackn.com, and at this point that’s the best linkage point to get to everything Digital Rebar if you’re interested. And for Cloud 2030, the2030.cloud is the website, so you can catch up on episodes or see what the schedule is.

We stay about four weeks ahead if you want to share pick topics, but just drop in and it’s a discussion. It’s a hallway track. They’re just amazing.

Yeah.

That’s what we desperately need.

And the funny thing is, the people that you meet in that hallway, I’ve met them in other commercial opportunities. Now it’s hilarious to see that it really and truly is a small world. And this is why you see repeated voices come up. Then you see them on Twitter, and then you see them in other engagements. This is community, the real true community. This is not about patting ourselves on the back because we built one thing well. It is really about finding people that are in a community of practice.

We are practitioners of things. I’m not team OpenStack or team Kubernetes or team VMware. I am team people doing fantastic things with infrastructure and applications. And as a result, community truly transcends the ecosystem that we maybe were born in or lived in at the time. It’s kind of cool to see it all.

Yeah. After 20 years, I’ve seen these products come and go and come back again. Patterns and the people. And sadly, some of the problems that we solve haven’t changed too much.

It was the old joke, right? They said that every time, we’re building a better mousetrap. At least that used to be the design of building a company: build a better mousetrap. And there’s, like, more patents for mousetraps than there are for, like, anything else in the world. And in the end, you go to Home Depot or Lowe’s or wherever you happen to go, Home Hardware if you’re Canadian. Then what do you find? A slab of wood with a spring on it and a place to put cheese.

The most simple possible thing is really the best thing for it. But, hey, we’re going to create disaggregated hyper converged mouse trap infrastructure somewhere. And in the end, just grab a piece of wood.

With blockchain.

Exactly. Awesome. All right. There you go. Rob Hirschfeld, 200th. Thank you for celebrating 200 amazing and fun conversations that I hope to have many more. So I’m going to have you on for 300th. Just give me the heads up right now. So mark your calendar. However long it takes to get 300 more of these. We’re going to do this again.

I’ll be in my walker, and we’ll make it happen.

Right on.

Transcript powered by Happy Scribe
