Episodes

  • About Ian Sherman

    Ian Sherman is Head of Software at Formant, a company building cloud infrastructure for robotics. Prior to Formant, Ian led engineering teams at Google X and Bot & Dolly. The through line of his career has been tool building, for engineers and artists alike. He’s inspired by interdisciplinary collaboration of all types; currently this takes the form of applying patterns and practices from distributed systems operations to the relatively nascent field of robotics.


    Links
    https://formant.io
    Transcript

    Mike: This is the Real World DevOps Podcast and I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps, from the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find. This episode is sponsored by the lovely folks at Influx Data. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in.


    Personally I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database, Influx DB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization and Kapacitor for real time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of that out at Influxdata.com. My thanks to Influx Data for helping make this podcast possible.


    Mike: Robots. You apparently are working at some company that does observability for robots and I'm a little confused because like what in the world is this all about? Do robots actually need observability?


    Ian: Yeah. I work at a company called Formant. We are about a year and a half old and we're focused on a lot of problems in supporting robots, but observability for robotics specifically is very important to us. I think it's representative of the type of concern that hasn't historically been important in robotics, but is increasingly important as we ship robots to customers, deploy fleets of robots, deploy them in semi-structured environments, and generally see their numbers increase in the wild.


    Mike: These robots, are these like Johnny 5 style robots, or are they more like C3PO or The Terminator or Wall-E? Are these more Wall-E, or maybe even the really terrifying stuff that Boston Dynamics is putting out?


    Ian: Right. We like to maintain a flexible definition of a robot. I think that's maybe just a way of avoiding the definition question.


    Mike: I'm sure the robots in the singularity will be very happy about your loose definition.


    Ian: Yeah. The vast number of deployed robots in the world has traditionally been in the space of automotive manufacturing. That's where we see bolted-down work cells of high-payload, position-controlled, heavy metal robots performing assembly and welding and applications like that. But the fastest growing part of the robotics market is actually in service robotics and in the deployment of robotics into less structured environments. That's environments like logistics and warehousing, retail, agriculture. Robots in semi-structured environments is where we have started focusing.


    We do think that we have a lot to offer in industrial robotics as well, but that hasn't been our focus to date.


    Mike: I saw on your website there's this really interesting photo of a robot kind of strolling down the aisle at the grocery store.


    Ian: Mm-hmm (affirmative).


    Mike: Is that indicative of the kind of robots we're talking about primarily?


    Ian: It is. We may have a little bit of insight into the way things are going just from the customers we're talking to every day, and we are seeing more and more robots deployed into retail, for example. That's just what that image shows. The applications at the moment are typically in things like floor cleaning and inventory scanning. Those are the front-of-house applications that we see most often. Of course, in order fulfillment and logistics and warehousing, we see a lot of additional applications of robotics.


    Mike: Got you. I want to take a little tangent here and ask how in the world did you get into this? I don't think anyone comes out of school and says, “You know what I'm going to do? I'm going to build robots and observability.”


    Ian: I came to robotics through work at a company called Bot & Dolly about seven or eight years ago. It was focused on applying robotics to challenges in film and visual effects. I had an opportunity to get involved in novel applications of industrial robotics at that company. We were acquired into Google, and that was around the time that a number of robotics companies were acquired, including Boston Dynamics, which we mentioned. Inside Google, I had the chance to see how all of our peers were thinking about these problems. We ultimately left Google about a year and a half ago because we were excited to ship products. The timeline for that is...


    Mike: There's a very subtle danger there.


    Ian: The timeline is long at Google for shipping products, but the experience was really invaluable. Personally, I was already interested in the tools and infrastructure side of robotics. Through building tools to support these teams inside Google, and through seeing how people thought about problems like observability, software deployment, and configuration management in the context of robotics, it became clear that there's actually a huge opportunity to bring some of the best practices that have been developed over decades in the backend distributed systems world to the robotics world.


    That's where I find a lot of inspiration. The problem is similar enough that we have a lot to learn, but different enough that it does require some new thinking and some new technology.


    Mike: That's a really great segue into a really good question: what does it look like to do observability in robots? You mentioned all these tools and all these techniques that infrastructure people rely on every day, incident management and that sort of thing. How is that being applied in your work?


    Ian: The fundamental requirement of observability in robotics is really no different than it is in monitoring backend systems. We want to maintain visibility into the state of the system and use that information to allow humans to respond to changes in internal system state, and also automated systems to respond to those changes. But there are a few key differences. One is that the data types that are relevant to us in robotics are often different than they are in backend distributed systems. We have sensors generating a lot of data about the physical world, and those data types are often geometric or three-dimensional or media-based.


    The infrastructure and tooling to ingest and index and visualize that type of data is different. The workflows that we use to debug issues are different; they often require making sense of a lot of that geometric and visual data. Another difference is that centralizing data is often challenging from a field-deployed robot relative to a server in a data center. The availability of network resources is often unpredictable, and we need to have contingency plans in place for when that network is unavailable.


    Relative to an IoT application, there's sort of a different set of resources available to us at the edge as opposed to extremely constrained IoT devices that might be running on bare metal. We typically have access to an operating system. We might even have access to a GPU. That allows us to make different trade-offs in the system design to maintain observability into these remote machines.
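    To make that concrete, here's a minimal sketch of the store-and-forward pattern Ian is describing, where an edge agent buffers telemetry locally and drains the backlog when connectivity returns. The class and method names are invented for illustration, not Formant's actual API:

    ```python
    import collections
    import json
    import time

    class StoreAndForwardBuffer:
        """Buffer telemetry locally while the network is down, drain it later.

        A bounded deque drops the oldest samples first if an outage outlasts
        the buffer -- one of the trade-offs an edge device with a real
        filesystem and OS can afford to make deliberately.
        """

        def __init__(self, max_samples=10_000):
            self._buffer = collections.deque(maxlen=max_samples)

        def record(self, stream, value):
            """Queue a sample regardless of network state."""
            self._buffer.append({"stream": stream, "value": value, "ts": time.time()})

        def drain(self, send):
            """Try to upload everything buffered; requeue on failure."""
            while self._buffer:
                sample = self._buffer.popleft()
                try:
                    send(json.dumps(sample))
                except ConnectionError:
                    # The network dropped again: put the sample back and stop.
                    self._buffer.appendleft(sample)
                    break
    ```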


    Mike: It sounds like you're... due to the perhaps limited availability of network, or the unknown availability of network, and especially with robots out doing their thing in the field, you're probably pushing a lot of decisions and logic to the edge, to the robots themselves. Is that right?


    Ian: That's right. One thing we've learned over the course of building our product is that one of those decisions that's really important to our customers is actually decisions about what data is being centralized and when.


    Mike: Oh, that's interesting.


    Ian: Typically, in a backend monitoring setup, we define a set of metrics that are continuously pushed or pulled at a common rate. In the robotics world, we may care about different types of data around different events of interest, different resolutions of data at different times of day, or around, say, a particularly sensitive manipulation behavior. Giving our customers those levers to dynamically turn on and off what telemetry is being sent, and at what resolution, is something that I think is kind of an interesting problem to work on and specific to the robotics domain.
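    A hypothetical sketch of the kind of lever Ian describes: a per-stream telemetry policy that the cloud can update at runtime, so that, say, a sensitive manipulation behavior temporarily gets high-resolution data. The stream names and rates are made up for illustration:

    ```python
    import time

    # Default policy: stream name -> samples per second (0 disables a stream).
    policy = {"cpu": 1.0, "camera_front": 0.0, "gripper_force": 10.0}

    def apply_policy_update(update):
        """Merge a policy update pushed from the cloud."""
        policy.update(update)

    def should_sample(stream, last_sent_at, now=None):
        """Return True if this stream is due for another sample."""
        now = time.time() if now is None else now
        rate = policy.get(stream, 0.0)
        if rate <= 0.0:
            return False
        return (now - last_sent_at) >= 1.0 / rate

    # Around an event of interest, dial the front camera up to 5 Hz.
    apply_policy_update({"camera_front": 5.0})
    ```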


    Mike: Right, yeah. Absolutely. I'm imagining that perhaps some of the problems that you're running into are things like, in the example of the grocery store, a robot going down an aisle and hitting a spill in the middle of the aisle. What do you do about that? How does the robot even know that's a thing? That would be a really prime candidate for "we need to record this information because the robot needs to know how to handle this next time." How are you recording things like this? It's not just "Oh, the CPU is X now." It's much more visual.


    Ian: Yeah, so that fundamental limit of how the robot knows that something bad happened is a limit that we'll always have to confront. I think, similarly to backend systems, we often have to rely on second order, or sort of best guess, indications that something has gone wrong.


    In the case of the spill, it could be that we are seeing wheel slippage, which is something we can detect in the robot control stack, and that type of event for us might mean that the logs from the last 30 seconds are dumped and prioritized for upload to a centralized server.
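    As a sketch of that trigger pattern, assuming a simple in-memory window: keep the last 30 seconds of logs in a rolling buffer, and when a second-order signal like wheel slippage fires, flush the window for prioritized upload. The names are hypothetical:

    ```python
    import collections
    import time

    WINDOW_SECONDS = 30
    recent_logs = collections.deque()  # (timestamp, line) pairs

    def log(line, now=None):
        """Append a log line and evict anything older than the window."""
        now = time.time() if now is None else now
        recent_logs.append((now, line))
        while recent_logs and now - recent_logs[0][0] > WINDOW_SECONDS:
            recent_logs.popleft()

    def on_wheel_slippage(upload):
        """Second-order 'something went wrong' signal: dump the last 30s."""
        snapshot = [line for _, line in recent_logs]
        upload(snapshot)  # prioritized ahead of routine telemetry
    ```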


    Mike: Mm-hmm (affirmative). It occurs to me that you would have some granularity challenges, too. Let's say I have a web app and it's serving whatever customers. If it's having problems for four or five minutes, that's probably fine. People are going to be upset, but it's not the end of the world.


    If I have a robot spinning in circles for five minutes, someone is going to be really upset about that, which means that you have to be able to know about these problems within seconds, whereas in standard web operations, for us, it's more like minutes. Is that right?


    Ian: I think that's right, and I think it gets to some of the safety challenges that come with deploying these systems in the physical world alongside humans. That's really a system design problem that we cannot solve entirely on our end. It's really the responsibility of the application developer to make sure that there are sufficient layers of safety and local autonomy, that sort of system design that keeps people safe and hopefully keeps our customers happy.


    So the stakes of mistakes are high, but the challenge of observability into those mistakes is also high. That is what I think makes it a really interesting space to work in.


    Mike: I'm just imagining being in a grocery store, and being run over by one of these things. Like, that would be a very unpleasant experience.


    Ian: I agree. We like to make sure that our customers know that it is not our responsibility as an observability platform to prevent that from happening, but that would be the worst case scenario, I agree.


    Mike: I'm going to assume that you're not the first solution to ever come to market to solve this problem.


    Ian: There is a long legacy of SCADA systems that have been deployed in industrial control settings.


    Mike: Ooh, yes.


    Ian: And-


    Mike: Big fan of ICS.


    Ian: Okay, and anybody who has worked with them knows that they are a proven technology that meets a specific need for a specific set of users. Unfortunately, they don't really apply to this world of semi-structured robots wandering around retail stores.


    While we are not the first, I would say that we are part of the first wave of products that have emerged really just in the last year to address these concerns, and I think it's because, to date, robotics companies have built everything in-house. We're seeing a trend similar to what we saw 15 years ago in the web world, which is that there is a growing realization that not every part of the stack is central to a company's value proposition, and we're hoping to take some problems off people's plates.


    Mike: Yeah, I'm glad you went that direction. That was going to be my question. How have people been solving this to begin with before you came along? Sounds like they're just writing a bunch of stuff themselves and hoping for the best.


    Ian: Yes, that's what we see. It's extremely fragmented. I think the standardization that has happened in the robotics ecosystem that we're targeting has really been around solving problems of single agent autonomy, and for that, there are great open source tools out there, like the Robot Operating System, that have really gone a long way towards standardizing approaches to those problems.


    But when it comes to thinking about logs and monitoring and fleet management, it really has been extremely fragmented. One challenge is that the people who constitute robotics companies often come from a very deep robotics research background and don't have experience building and maintaining cloud infrastructure. As a result, we see a lot of avoidable mistakes being made. It's another place where we see some opportunity.


    Mike: I want to talk about your tech stack a bit. What's going on under the hood with all this? Are you using the same tools that operations engineers are going to recognize? Have you completely built stuff from scratch? What's going on there?


    Ian: That's a great question. We're trying to strike a balance between building what needs to be built, but not what doesn't. A great example is our approach to exposing business intelligence capability on top of the telemetry we have collected. The approach is not to try to build that in-house in any sense; we are leveraging the workflows that already exist for pushing data into a data lake and running business intelligence on top of that.


    On the other hand, building the monitoring that's required to monitor not just scalar metric data, but also streams of images and geometric data, is something that would be hard to ask of existing server monitoring tools, so that's an area where we have made investments. We have made investments in the functionality of the edge, in some of that dynamic instrumentation we were talking about.


    And we've made investments in some of the visualization, because obviously looking at 3D data is very different than looking at text logs.


    Mike: Yeah, I'm just trying to think about how I would solve that problem, and coming up with a whole lot of blanks. The time series database, oh, that's easy; it's not an impossible problem. But visualizing 3D data in a way that I could go back and look at it? That sounds tricky.


    Ian: Yeah, but it makes it fun though.


    Mike: For a bit, I was thinking that maybe you were just collecting a whole bunch of different data points and assembling them into an image, but it sounds like you're actually taking image snapshots from the edge and storing those?


    Ian: Yeah, so we can consume full or reduced resolution images, point clouds, maps, data types like that. The biggest challenge is really in making those trade-offs: when are the resources available to do that compression at the edge, when is the network available to centralize that data. That's why we've been really focused on the capabilities of the software running at the edge.


    Mike: Mm-hmm (affirmative). I imagine that you probably have somewhat limited retention on the robots themselves. Are you talking minutes, hours, days, months?


    Ian: Well, it definitely depends on the customer. We often see a tiered approach, as you would expect, where LIDAR data that might be publishing at a kilohertz, generating gigabytes per minute, has a very low retention period. Text data is obviously easier to keep around for a long time. But we do have the luxury of typically full SSD resources locally, and that does give us retention better than what you would get on a Raspberry Pi IoT device-
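    A minimal sketch of that tiered retention idea; the streams and retention periods below are invented examples, not real customer numbers:

    ```python
    # Hypothetical retention tiers: high-rate sensor data ages out quickly,
    # cheap text logs stick around much longer.
    RETENTION_SECONDS = {
        "lidar_points": 60 * 10,         # 10 minutes
        "camera_frames": 60 * 60 * 4,    # 4 hours
        "text_logs": 60 * 60 * 24 * 30,  # 30 days
    }

    def prune(samples, now):
        """Drop local samples whose retention period has expired.

        `samples` is an iterable of (stream, timestamp, payload) tuples.
        """
        return [
            (stream, ts, payload)
            for stream, ts, payload in samples
            if now - ts <= RETENTION_SECONDS.get(stream, 0)
        ]
    ```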


    Mike: Right, yeah. I want to talk about failures in these robots a little bit. It would seem to me, in my naïve understanding of robots, that you know everything that's on a robot. You know what's there, you know what isn't. It seems to me that you could predict all the different failures that could happen. But with our example of the spill on the floor, we clearly get into... well, maybe not. So what kinds of failures are common in these robots? I think you mentioned earlier that there are a whole lot of unknown unknowns that you're getting into as well. Can you talk more about those?


    Ian: Sure. I think we can reason pretty well about the internal state of the robot software, but where it gets challenging is that these robotic systems obviously include hardware components, and they are interacting with an external world that can be very hard to reason about.


    So the failure modes are really diverse. To the question of what types of failure modes we see often: a good example is that mobile robots often encounter mislocalization, low confidence about position in a map. This can be solved in a few ways. One approach that we see some companies taking is a shared autonomy approach, where there is actually support from a human operator in the case that a robot identifies itself as mislocalized. They can sort of help the robot get back on track.


    That is something I think is unique to robotics and a trend that we're seeing.
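    One plausible shape for that shared-autonomy handoff, sketched with invented thresholds and callback names: if SLAM pose confidence stays low for several consecutive updates, pause the robot and page a human operator:

    ```python
    CONFIDENCE_FLOOR = 0.3   # below this, treat the robot as mislocalized
    CONSECUTIVE_LIMIT = 5    # require several bad updates to avoid flapping

    _bad_updates = 0

    def on_pose_update(confidence, pause_robot, request_operator):
        """Escalate to a human on sustained low localization confidence."""
        global _bad_updates
        if confidence < CONFIDENCE_FLOOR:
            _bad_updates += 1
        else:
            _bad_updates = 0
        if _bad_updates >= CONSECUTIVE_LIMIT:
            pause_robot()       # stop moving until we know where we are
            request_operator()  # shared autonomy: ask a human to relocalize us
            _bad_updates = 0
    ```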


    Mike: Is this mislocalization failure like the equivalent of you thinking there is another stair on the stairway?


    Ian: Well, that might be hard to detect. I think it's more like moving around a retail environment for which a static map exists, but finding that the inventory manager actually moved a shelf overnight.


    Mike: Oh.


    Ian: And all of a sudden, my SLAM algorithm, which usually has a very high degree of confidence about my position on a map, is returning very low confidence about whether I'm on aisle 9 or aisle 10.


    Mike: It's like your significant other rearranging the living room when you're gone.


    Ian: Exactly. Yep.


    Mike: Or in the middle of the night.


    Ian: Yep.


    Mike: Okay, got it.


    Ian: That's one failure mode that we see. Another failure mode that is common is at the hardware layer. Again, this is typically pretty hard to predict except through sort of second order measurements in software. What we try to do to support that use case is to just give people good visibility into what hardware is deployed where. As anybody who's worked on fleet management problems knows, that's a tricky problem in itself, especially when you are swapping out components and doing repairs. But that's an area where we definitely see pain and opportunity for robotics.


    Mike: Right. How do you decide where the fault detection should take place? Sometimes it should happen on the robot, other times it should happen centrally. How do you decide which is which?


    Ian: It's tricky. That is pretty application specific, and as an infrastructure company, we don't know the answer to that as well as our customers do. I think that's where the domain expertise of people solving inventory scanning problems in retail really comes into play.


    You know, what we're trying to do is give people hooks in the right places to do that monitoring wherever in the stack makes sense.


    Mike: Do you ever advise your customers on what's possible? I imagine, when I worked in retail a million years ago, if I had a robot sitting there in front of me, I wouldn't even have been sure what I could do with it. I could imagine a few things, but I'm sure you, being experts in this particular area, can imagine all sorts of other things.


    Ian: Yeah, we definitely talk to customers and are happy to consult on some of that system design. But they're often very good at what they do, and when it comes to knowing their own hardware and their own local software stack, there is only so much value we can provide.


    Mike, I'm curious for you: how do you see observability and monitoring practices extending beyond backend distributed systems? I'm sure we're just one of a number of domains that is starting to sort of borrow and steal. Where else are you seeing this happen?


    Mike: Man, that is a fascinating topic. Listeners can go back to a previous episode; I believe it will be the episode with my friend Andrew Rodgers. Andrew and I discussed observability in manufacturing and in industrial control systems, SCADA, which we were talking about earlier, and his work now with building-scale and city-scale monitoring and observability, which is absolutely fascinating stuff.


    We made a comment that the manufacturing world was really great at coming up with the principles of process engineering, but they didn't have the technical ability to write the software to execute on those principles. Then at some point, software engineering and technical operations found these principles, and because we are experts at writing software, we started to do the execution. Now manufacturing is taking all that and applying it to themselves.


    Like, it was this really cool two-way street that's happened over the past 10 years or so-


    Ian: I would guess that includes some sensing of not just software systems, but a physical system as well.


    Mike: Yeah, so it's actually... not just software, it's actually almost entirely physical stuff. It's things like: I have a boiler, or I have a furnace, or hell, I have a road, and my road has sensors embedded in it. We're gathering all this data about environmental conditions and traffic and things like this, and shoving that into some software system that is going to do stuff and send that information back out into the physical world to change the physical world.


    The same technologies we've been talking about are being used here. They're making heavy use of Grafana, time series databases, Cassandra, and Kafka for streaming. All of this is standard web operations tooling that you would expect in any monitoring platform, but it's being used for these real-world physical interactions.


    I think that this is super cool because it's very similar to what's going on here where you are collecting this physical world data, shoving it into software that we would all know and recognize if we saw it, making decisions about it, and using that to change what's going on in the real world.


    Ian: Yeah, I think the control loop element of this is really interesting. In software, typically the control might involve horizontal autoscaling or traffic shifting or something like that. In the physical world, we get to actually move metal or plastic, or turn cameras, or reach out and touch things, which is pretty exciting.


    Mike: Yeah, in a lot of ways, the impact is bigger and the risk is also bigger.


    Ian: Yeah, true. Yep


    Mike: So, we're using software to control traffic lights. It doesn't take a genius to see that could potentially turn out very badly.


    Ian: Absolutely.


    Mike: Same with manufacturing furnaces. These are things that get to a bajillion degrees, and we're using software to monitor them, to see what the temperature is. Well, one of the things that a lot of people don't realize is that if a furnace is getting too hot, yes, that sucks, but it's not a huge issue. It's actually a furnace getting too cold that's a problem.


    If the furnace gets too cold, then everything in it solidifies. These furnaces haven't been shut off for 20-plus years, so when it solidifies, the furnace is done. It's time to replace it, and that's a million dollar investment, simply because the software that was paying attention to it didn't catch a failure in time.
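    Mike's furnace story is a good example of alerting on the dangerous direction of a metric. A minimal sketch, with entirely made-up temperatures:

    ```python
    SOLIDIFICATION_TEMP_C = 1200  # hypothetical: below this, contents solidify
    MARGIN_C = 100                # page well before the point of no return

    def check_furnace(temp_c, page):
        """Too hot is bad; too cold is catastrophic. Alert on the cold side."""
        if temp_c < SOLIDIFICATION_TEMP_C + MARGIN_C:
            page(f"CRITICAL: furnace at {temp_c}C, approaching solidification")
    ```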


    Ian: Yeah, I think for me that points at the deep domain knowledge that people who've been working in these industries for their entire careers bring to bear on the problem. Deep knowledge of furnace behavior is not something I can speak to, but it's really exciting to me when people with this expertise are empowered with software that lets them do things they've always wanted to do.


    I think when we talk in the robotics world about these different waves of robotics startups, the earlier robotics startups were founded by robotics Ph.D.s who knew they could build robots and were looking for a problem. I think we're starting to see robotics-


    Mike: Kind of like the AI at MIT. All the early AI companies were just people from the AI Lab.


    Ian: Yep, maybe looking for problems with a hammer looking for a nail.


    Mike: Right, yep.


    Ian: The most recent crop of robotics start-ups, the ones that I think have a much higher chance of success, are people coming in with some deep domain expertise and looking at robotics as just one tool among many that they might apply to solving a real problem.


    Mike: Right. Incidentally, for anyone that just heard that particular comment: that's the foundation of building a business.


    Ian: Yeah, you'd hope. Yeah.


    Mike: Yeah, having domain expertise in a certain area and then looking for ways to solve problems there is a much better way to go about life than coming in with a hammer looking for a nail.


    Ian: But, we know a lot of the latter for sure.


    Mike: Right, a whole bunch of us, man, we really know Python, let's go find problems we can use Python for. As it turns out, that doesn't usually go so well.


    Ian: I'm curious: when you see people applying monitoring and observability to these new domains, do you see people making mistakes that have already been made in that world? Or do you think people are benefiting from the recent developments in monitoring?


    Mike: I honestly think it's a bit of both. One of the challenges that I'm seeing is that you have all these people from manufacturing or building management, the two domains that are top of mind for me. You see all these people who have that deep domain expertise, and they're taking tooling and processes that we've come up with in software and applying them.


    They're able to apply them fairly well given their domain expertise. What they're actually missing is all the domain expertise from the software world, things like how to do effective alerting. It's not like every time a value passes a number, you page someone; we know that doesn't work. In software engineering, I think we're pretty well at the forefront of how to do that, and of the different anti-patterns there.
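    One of the anti-patterns Mike is alluding to is paging the instant a value crosses a line. A common fix is to require the condition to hold for some duration before paging anyone; a minimal sketch:

    ```python
    import time

    def sustained(condition, duration_s, poll_s=1.0):
        """Return True only if `condition()` stays true for `duration_s` seconds.

        Paging on a single sample above a threshold wakes people up for blips;
        requiring the condition to hold filters out transient noise.
        """
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if not condition():
                return False
            time.sleep(poll_s)
        return True
    ```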


    But then when you look at places like nursing and medicine, they're not there yet. They know it's a problem, but they don't have good solutions, and we kind of have solutions. Their environment is much higher stakes than ours is, so you can see why they are kind of hesitant to adopt some of the solutions we've come up with, which is very understandable.


    Ian: Yeah, the stakes are very high, and I'm very excited to see how that plays out.


    Mike: Yeah, me too. I think it's going both ways. Manufacturing, building management, they're clearly doing really cool stuff with the stuff we've built and designed. I think for the most part, the monitoring vendors outside of those domains are not even seeing what's going on. Like, I happen to know that Influx DB and Grafana and Cassandra and Kafka are being used in places that these companies don't even know they're being used in.


    That's super cool, but on the other hand, it's pretty shitty for everyone. It would be really helpful if we had more of a dialogue between these two groups, but-


    Ian: Yeah, for people who care about tooling and infrastructure for operations, there's just a ton of opportunity in some of the domains you just mentioned.


    Mike: Right. Most of the opportunity in the world is not within the software domain. It's in non-software domains, places where you wouldn't expect software to be.


    Ian: Yeah, I agree.


    Mike: Yeah, yeah. Robots. Well, now that we've come full circle, this has been an absolutely fascinating discussion. Thank you so much for joining us.


    Ian: I really enjoyed it. Thanks, Mike. This was fun.


    Mike: And to everyone else listening in, thank you for listening to the Real World DevOps podcast. If you want to stay up-to-date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is that you get your podcasts. I'll see you in the next episode.

  • About Aaron Sachs

    Aaron Sachs is a home brewer, banjo player, and also happens to like monitoring things. He helps make his customers look like monitoring badasses to their customers at Sensu, where he's a Customer Reliability Engineer.

    Links Referenced
    Sensu
    aaron.sachs.blog
    Twitter: @asachs01
    Transcript

    Mike: This is the Real World DevOps podcast and I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools, to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


    Mike: This episode is sponsored by the lovely folks at Influx Data. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database, Influx DB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to Influx Data for helping make this podcast possible.


    Mike: Hey Aaron, welcome to the show.


    Aaron: Hey Mike, thanks for having me on.


    Mike: So I want to start with a little bit of an origin story, because everyone loves a good origin story.


    Aaron: Oh yeah.


    Mike: You and I met over hot wings many, many, many years ago.


    Aaron: Hot wings, pizza and garlic knots. All good stuff.


    Mike: Right? Sadly the place is closed now, but we ended up going...


    Aaron: No.


    Mike: I know at the time I think you were working help desk support while studying for a communications degree?


    Aaron: Oh yeah, yeah. I was at the University of Tennessee's Office of Information Technology doing desktop support with a bunch of other students and, yeah, working on my communication studies degree. Yep.


    Mike: Yeah. So the thing I find interesting about all this is that you never intended to go into IT at all. Like you weren't planning to go into tech in any way. You were really planning to go into something communication related.


    Aaron: Yeah. Yeah. So I was planning on being a communication professor and had kind of aligned my career up to that point to do just that. And yeah... The whole reason I got out of that was that I was writing my thesis on how one determines a blogger to be credible, like what sort of behaviors they exhibit when they're writing a blog. And because that was nowhere near the specialty of the department I was in, there was just a lot of politics, and I got so fed up with it that I was like, I'm done. And I'm already working at the Office of Information Technology to do, you know, my grad assistantship, so why not just do IT?


    Mike: I remember you and I went out for beers, I want to say it was like a month after we met, and you asked me a question: "What if I stopped doing communications and went into tech? What would that look like?" And the immediate follow-up question was, "And can you help me?"


    Aaron: Yup. Yup. That's correct.


    Mike: Yeah. So, me not knowing what I was getting myself into: "Yeah, let's just totally change the course of someone's career over beer."


    Aaron: By the way, for the listeners out there, if you ever want to sucker Mike into something, get him real drunk and then ask him for help. Totally works.


    Mike: Yeah apparently. So…


    Aaron: Haha


    Mike: This kind of started an interesting thing. You and I weren't really friends initially; it was more that you were looking to me for help on how to change your career and how to get better at a thing you had not focused on at all. We actually became friends as a result of going through that whole thing, but it really developed into this informal mentorship situation.


    Aaron: Oh yeah. Yeah. I very fondly recall all of our Tuesday nights at Old City Java in Knoxville, just gorging ourselves on Meg's croissants and downing gratuitous amounts of caffeine.


    Mike: Right. And wishing for a whiteboard to further explain concepts. Right?


    Aaron: Right.


    Mike: So through all this, clearly you've done well. You're working for Sensu now as a Customer Reliability Engineer, helping companies improve their monitoring using Sensu. But one of your big focuses throughout your career, in the years after we worked together, has been helping other people improve their careers and their professional lives.


    Aaron: Absolutely. Yeah. And you're to blame for that, by the way.


    Mike: Yeah. And so I know you've given several talks on this and you've had several other mentees over the years as well. I believe you were involved in a pretty formalized program at Rackspace on that topic as well.


    Aaron: Yeah. I did a lot of mentorship with my team at Rackspace, so as I moved up in the admin ranks, I always took it upon myself to mentor the people that were coming in. And then that somehow turned into me mentoring folks that were in my wife Ashley's department. We actually started a meetup that would happen once a week at Rackspace, and these are folks who came in as customer service technicians. They knew just enough to spell DNS, and we grew them from that to actually progressing into career paths: one guy is, I think, a customer success manager for a startup in San Antonio, another guy is a systems engineer, still at Rackspace, and then I've got another great friend of mine who constantly blows me away, because she's taken this whole mentorship principle that we worked on at Rackspace and she's doing it with other people that she knows now. Her name's Elle; she works at, I think, Linux Academy or Jupiter Broadcasting. So yeah, it's been a crazy journey.


    Mike: Yeah, that sounds pretty awesome. So I want to dig into that. Let's back up and say what is mentorship?


    Aaron: So let's talk about what mentorship is not because I think that that's an even more useful way of discussing the concept of mentorship.


    Aaron: So mentorship, in my view, is not transactional. I mean, I came to you initially and thought it was a transaction: I'm going to go to Mike and Mike's going to help me. Which is fine, but that's not mentorship, right? Mentorship is about growth, and it's a two way street. So it's definitely not a, "I'm a senior, you're a junior, I'm going to mentor you." I think it's really easy to get mixed up in that sort of head space of thinking, well, "I'm older, I'm the more senior, more experienced person. I know more. Therefore you are learning from me."


    Aaron: In reality, I've learned a ton from the folks that I've mentored, folks who have forced me to learn things and to dig into different concepts.


    Mike: I remember the first time you asked me how DNS works. And I'm like, "Oh, this is a simple answer." And about five minutes into it, I realized, "Wait a minute, I don't actually understand DNS well enough."


    Aaron: Yeah, that definitely... Oh gosh, I had that same thing happen to me too when I was mentoring some folks, because that's a rabbit hole for sure.


    Mike: Right? So it is a two way street. You, as a mentor, learn from your mentees as well.


    Aaron: Right. So mentorship, if I were to sum it up in, let's see, the elevator pitch: mentorship is a relationship built around learning. That'd be a really simple way to put it, but there are all sorts of things that happen with mentorship. You could put some time bounds on it if you want to, you could put a goal around it. I've seen different organizations, like the Apache Foundation, that have a very specific, targeted way of doing mentorship. There are much more long term mentor-mentee relationships like yours and mine. I'm still learning from you, even though we're not meeting every week at Old City Java anymore. So yeah…


    Mike: Mentorship, does it work better with a formal or informal program? Do they work differently?


    Aaron: I think that depends entirely on what a mentee wants to get out of it. So if a mentee says, for example, "I want to learn to code X, I want to learn to code this thing in Go," or "I want mentorship through a specific project," it can be very formal and very targeted, and it can meet the need of what the mentee is going for.


    Aaron: I think you can also do the informal, right? Ours has been very informal and continues to be. I don't think you and I have ever sat down and said, "Well, I want to get this out of mentoring Aaron," or "I want to be on the receiving end of whatever from Mike." It's been more along the lines of: how can I learn, how can I grow, what phase of life am I in right now? Or what phase of my career am I in, and where do I need the help? And that, to me, seems pretty open-ended, right? Because it could start with "I need to learn Linux," which is the way that your and my relationship started, and it's progressed through learning all sorts of different business practices, or communicating more effectively, or organizing conferences. It's one thing after another, right? So I think, to answer the question, it just depends on what the mentee's goals are.


    Mike: Well, what about the mentor's goals? Presumably, there are times when someone sets out to become a mentor and starts looking for people to teach and help grow.


    Aaron: So let me back up to my time at Rackspace, when I was doing the mentorship initiative on our team.


    Aaron: There are people in the world who are not teachers. They're not gifted at it, they don't have that skill, and they don't really desire to have that skill. And so I think, if we're talking about mentor goals, you can immediately eliminate those people from the picture and say, "Okay, well, they're not going to be mentors. They don't want to be mentors. Whatever."


    Aaron: The people who want to be mentors, I think, kind of go through phases. That initial phase, at least in my journey, has been: I want to help people improve. That could be the first goal. I want to take whatever knowledge I have and impart it to somebody else. Cool. Great. Great place to start. I think from there it can become, like you're getting at, a mentor having a goal or a specific set of goals. So it could be: how do I teach better? Do I understand what I'm teaching? Right?


    Aaron: So that could be another thing where it's a little bit self-reflective or it's like, "Okay, well I want to mentor somebody who wants to know about DNS so that I can go down this rabbit hole and understand what's going on." That's a very valid goal.


    Aaron: In my reading of some of the literature on mentorship, and actually, one of the guys that I know here in our neighborhood did his entire doctoral work on reverse mentorship, I don't think there's been a whole lot of discussion around the mentor's goals. So it's actually really interesting that you bring it up.


    Mike: If I'm someone looking for a mentor, then presumably there are people who would be interested in being a mentor, and that means they might be looking for someone to mentor. Rather than just being open to the idea, they're actively looking for it.


    Aaron: Yeah. Yeah. And I remember, actually, gosh, days gone by, when you and I were in LOPSA. LOPSA had a mentorship program, and from what I recall of that program and some other programs, people would sign up and say, "I want to be a mentor in X. I want to mentor somebody through security concepts, or I want to mentor somebody in Python." Right? So if you're looking to learn Python, come talk to me. Or if you're looking to learn how to do pen testing or become a security professional, talk to me.


    Aaron: But that's always been kind of an implicit thing, rather than, I think, anybody talking from the mentor perspective about what your goals are going into being a mentor, you know what I mean?


    Mike: Mm-hmm, yup. I want to shift gears a little bit. We've been talking about mentorship, and the concept of teaching has been kind of central to this whole thing. Is there any big difference between training and mentorship?


    Aaron: Yes. Yes, I would say there certainly is.


    Mike: How do you see the difference?


    Aaron: So I do both. Professionally, I train other people on how to use Sensu. Training, to me, has a more transactional focus: I am the teacher, you are the student, I am imparting knowledge, you ask questions to gain more knowledge, but the goal is very much a one-to-many, transactional sort of act.


    Aaron: Whereas if I think about mentorship, there's a constant feedback loop there, and it's usually one-to-one or one-to-several.


    Mike: It's much more relational it sounds like.


    Aaron: Right, right. But let's take it a step further. There are commonalities there, right? So…


    Mike: Okay. Tell me more about that.


    Aaron: The folks who know how to teach well or train well can take those skills, because again, teaching and training can be very different from mentorship, but they can take those skills and kind of dovetail them into being a mentor. Right? The imparting of knowledge is just one thing in that journey. What I'm getting at here is you can be a great teacher and be a terrible mentor, because you don't understand relationships and you don't understand how the relationship is two way. But I don't think you can be a good mentor and not be a good teacher.


    Mike: Gotcha. So it sounds like teaching really is central to this entire concept. To be a good mentor, you have to be good at teaching.


    Aaron: Yup. I would say that's true.


    Mike: So how can I be a better mentee? When we talk about being a better mentor, it comes down to getting better at teaching and getting better at relationships. How can I be a better mentee?


    Aaron: So there are several thoughts floating around in my head. One of those thoughts is exactly that: get better at teaching and get better at relationships, because you as a mentee will definitely have something to offer your mentor as well as other mentees. So this is, like, gosh, I would hate to use a Star Wars analogy and be like... you know, master and student when it comes to the Sith, but it might somewhat fit. There's a chain there, right?


    Aaron: A mentor has a mentee, and that mentee at some point should be in a similar stage. Also, as a mentee, I think if you're going to be in a targeted, formal sort of mentoring relationship, have an idea of some goals that you want to accomplish.


    Aaron: So say I were to go to somebody and say, "Hey, I want you to teach me Python. I want you to mentor me in Python. I want to get better at Python."


    Aaron: Okay, cool. What do you want to know? Where are you at? Where are you in your journey? Because as a mentor, if I was mentoring somebody in Python, which, God forbid, because I don't know that I'd ever be able to code my way out of a paper bag, I would want to know: what specifically do you want to learn? Am I teaching you fundamentals? Cool. All right, I can start there; as a mentor, I've got a good reference point. Do you want to learn specifically about packaging your Python applications? Cool. We can talk about that. But if you go in and just say, "Well, I don't know. I just want to learn," and the mentor isn't prepared for that, I think it can set you up for failure.


    Aaron: I think, too, kind of establishing as a mentee some sort of regular cadence for meeting with your mentor is super important. And that goes for both the mentor and the mentee: establishing a cadence is important. You and I had our Tuesday nights at Java; it just ended up working for us. But I don't think the relationship works if you aren't regularly checking in. You know what I mean?


    Aaron: If I'm like, "Okay, well I want to learn about I want to learn about pen testing, right?" And I go to somebody, we start that initial you know mentorship relationship and then we never meet again. Well it does neither the mentor or the mentee any good.


    Mike: Right. You've just wasted time and effort on both sides.


    Aaron: Which is the last thing you want to do in that relationship, because that relationship can keep giving long after the engagement has ended, whether formal or informal.


    Aaron: And I think boundaries are another important thing, not just as a mentor but also on the mentee side. Being able to respect boundaries, right?


    Mike: Okay, tell me more about that.


    Aaron: So I could be a super eager mentee and want to learn all the things. I could be one of those types of people who hounds my mentor, and if my mentor is a working professional like most are, being constantly hounded is going to put a pretty big wedge in that relationship, because it's going to foster some resentment. You know, if I had somebody constantly pestering me, like, "Hey, I need help with this. Hey, I need help with this. Hey, I need you to do this for me. Hey, teach me this thing." Okay, slow down, hoss, let's take a second. I've got my day job, I've got a family, let's work on some boundaries. So making sure those are established pretty early on, and that you're both respectful of them, is another key thing. On the mentee's side, that's respecting boundaries; on the mentor's side, it's being able to set those boundaries.


    Mike: Do you have any thoughts on how to find a mentor?


    Aaron: There's an old Mr. Rogers quote about finding the helpers that I'm not going to quote because I can't do it justice. But that would be one thing I would say: if you are looking for a mentor professionally, start by looking around. Look at your company, look at the folks who are doing the teaching, look at the folks who want to impart knowledge or are looking for a good excuse to teach. That's usually a good indicator.


    Aaron: There are also great programs out there. LOPSA offered one, the Apache Foundation has offered some, and I think there's that talk at SCALE in 2015 that Jen Greenaway gave that went over several open source organizations that do just this; they offer some sort of mentorship program.


    Aaron: So that goes back to, as a mentee, having an idea of what you want to be mentored in, if it is that formalized, structured engagement. There are organizations out there that provide that. If it is more of the informal kind, I think it takes a little bit of patience, because you can't just walk up to a senior engineer and be like, "Hey, I want to be your mentee."


    Aaron: Because that senior is going to be like, "Uh, okay..." It's like having a puppy or a kitten off the street all of a sudden come up and start following you, and you're like, "Whoa, whoa, whoa, hold on. You're cute and all." Maybe don't say that to a mentee. Definitely don't say that to a mentee.


    Aaron: "I'm excited that you're enthusiastic about learning whatever but you know let's let's let's take a step back for a second." Because they may not be prepared for it, they may not be in a place where they can do that, but be patient and start paying attention.


    Aaron: So if I were to ask myself that question right now, like, "Hey, how would I find a mentor?" I would, again, look at the people at Sensu, look at the people who are teaching, look at the people who are very free with their knowledge. I would look for somebody who I respect, because the last thing you want is to be in a mentoring relationship with somebody who you have no respect for, or come to not respect.


    Aaron: And then from there, I wouldn't even make it a formal thing at first. I would do something like, "Hey, let's go have a cup of coffee, or have a FaceTime chat at some point. I'll hear your thoughts on whatever." And start from there.


    Aaron: The mentorship relationship is not something to rush into, because there's an element of trust that has to happen there. It is a relationship, right? Both people kind of have to agree on it. So being very choosy with who you pick, and how you pick them, is something to keep in the back of your mind. If you are looking for a mentor, do it wisely.


    Mike: Right. Take it slow. And crucially, something you said earlier that I think is worth reiterating here: come with a clear ask in mind. You can't go to someone with, "Hey, I want you to mentor me." And that's it. It can't possibly work that way.


    Mike: It has to be, "I want to learn this thing from you, this very specific thing." Otherwise it's kind of too open ended and no one really knows how to help.


    Aaron: I'll give you a great example.


    Aaron: So I just moved, and I know maybe a handful of people in Chattanooga. One night I'm out wearing some old vendor swag, actually old DigitalOcean swag from when I was there, and a guy comes up to me in Target and is like, "Hey, do they do cloud things?"


    Aaron: And I was like, "Why? That was not the question I was expecting in the middle of the baby aisle at target. But sure. Yeah they do cloud things. Why, what do you do?"


    Aaron: Over the course of the last couple of weeks, this guy and I have started meeting just to grab lunch, and in the course of that I found out, "Hey, I don't really like where I'm at." Kind of similar to how your and my relationship started, Mike.


    Aaron: And I was like, "Interesting, tell me more about that. What do you, what do you, what are you hoping to do? Like what do you want to do?"


    Aaron: And so we started a kind of informal mentor-mentee relationship around building his career. So it wasn't the explicit ask of "I need help with my career," because that would have been really weird in the middle of Target, asking somebody with vendor swag, but over the course of a couple of lunches he made it pretty clear: "I think I can do better than this."


    Aaron: "I agree, let's get stuff in order." Right. So it's, it's been an interesting last couple of weeks but yeah, I mean even as a mentee can, if you don't feel comfortable like flat saying, "I don't like where I'm at in my job, I'm not respected. I don't make the pay I deserve or think I deserve."


    Aaron: You can say, "I think I want to do something else. Can you help me?" And even that's a great place to start.


    Mike: Yeah. Because most people have been there before, and their immediate next question is going to be, "How can I help?" Most people actually want to help, especially if you are asking about something that they have knowledge of, and pretty much every senior person has quite a bit of knowledge about how they could have grown their career better. I think that one is a great place to start: how do I improve my career?


    Aaron: Oddly enough, I talk with mentees more about that than I do the technical side of things, to be honest.


    Mike: Yeah. Right.


    Aaron: So it's like, "Okay, well what does a hiring manager want to see on my resume? Should I use a green font on it?" Which everybody out there and will probably go, no, no, don't do it. But I mean it's happened, right?


    Aaron: "Do I need a chronological listing of all my accomplishments and all my past roles?" No, let's tweak things. Let's make sure that we can get you in front of hiring managers. Let's make sure that it's not just, "Hi, I'm here to interview," but let's kick some tail at it. Let's be prepared for the questions, so let's anticipate what hiring managers are going to ask and blow them out of the water. Those are more of the conversations I tend to have lately.


    Mike: If I'm someone listening to this episode and thinking, "I would really like help with a thing," whatever the topic is, and I want to, in time, find a mentor, develop that relationship, and get help around whatever this thing is, what can I do? What's my next step?


    Aaron: That's a great question. So one of my old mentees started this hashtag on Twitter. It's I-O-T-B-N, "it's okay to be new" is what it stands for, and she started using it as kind of a rallying point for folks who want to help as well as folks who need some help. So that's the most salient thing that comes to mind: I would tweet out, "Hey, I want to help," and include that hashtag in your tweets, if you are so inclined to tweet.


    Mike: And that works just as well for people who are looking for help, not just people looking to provide some help. Yeah.


    Aaron: Right, exactly. You're saying, like, "Hey, I'm new in the IT field and I need some help with JavaScript." Or, I keep saying security because there are a lot of security people who have been rallying around that hashtag and helping.


    Aaron: So yeah, if you want to learn security concepts, raise your digital hand on Twitter, include the hashtag, and say, "Hey, I could use some help with this."


    Aaron: Again, there's also the formal stuff. If you want to be a mentor, you could probably do a quick, easy Google search for technical mentorship or a technical mentoring program. What we could probably do is actually include some of those links that I think Jen had in her talk a while ago as part of the show notes, to say, "Hey, these are some places that are accepting mentors. So if you want to be a mentor, go here."


    Mike: One of the places I found that has been really helpful is meetup groups. Pretty much everyone who is speaking at a meetup is trying to become a better teacher, which makes them more inclined to want to help people who reach out to them.


    Aaron: That's a great point. Here's a big thing that's come up lately, since I'm helping organize a conference here in Chattanooga: if you want to help folks improve their talks, volunteer to review CFPs. I know DevOpsDays has tons of events around the world, so there are a ton of cities out there. If you want to help people get tighter on their communication and present their talks in a way that's going to resonate with the audience, volunteer to review a CFP. Or likewise, if you're a mentee, submit. The worst that anybody can say is no, and you can ask for feedback, right? And then say, "Hey, do you have anybody who's interested in helping mentor me on getting better at my talk?" That's a great way to both help in terms of mentorship as well as boost your career.


    Mike: Yeah, absolutely. And also, if you read a book by someone and you have questions, then just email the author.


    Aaron: Yeah. Oh yeah.


    Mike: Everyone is so afraid to email an author like, "Oh God, they're so busy. They're getting inundated with emails and other outreach all the time."


    Mike: The truth is that, no... No, we're not.


    Aaron: That reminds me of a tweet.


    Mike: Anyone that sends me an email about my book, I respond to every single one, because I love helping people. And no, authors don't actually hear from readers very often, and the whole reason we write books is to help people. We're absolutely thrilled to help.


    Aaron: Conversely, if the book author has a repo with code samples, please, please don't go into the repo and just start trashing the author's code.


    Mike: Yeah. Don't do that unless it is very blatantly wrong. We will feel a little bad about making blatant mistakes, but anything else?


    Aaron: Yeah. So I was going to say, I'm reminded of a tweet that I think Ian Coldwater put out the other day, where she was discussing how a lot of times the folks who follow her on Twitter are kind of scared to approach her and talk about what she's doing, the type of work that she's doing. And she basically echoed your sentiment, Mike, which was, "Come up and talk to me. If you see me at conferences, come up and talk to me. If you have questions, you want to learn, you want to grow, ask."


    Aaron: I mean, to your point earlier, most senior folks are going to be more than happy to help another person make that next jump, or next leap, in terms of either technical knowledge or something in their career. So yeah, don't feel like the number of followers you see on an author's Twitter page is indicative of how they're going to respond to you.


    Mike: Right. Well Aaron, this has been a fantastic chat. Thank you so much for all the advice you've given.


    Aaron: I hope it's useful.


    Mike: It absolutely is. Where can people find out more about you and your work?


    Aaron: Yeah, so you can find me on Twitter at @Asachs01, or I've got a blog that I'm starting to maintain a little bit more lately, and that's aaron.sachs.blog. So yeah.


    Mike: All right then. Well, thank you for coming on the show.


    Aaron: Well, Mike, thanks so much for having me. It's always a pleasure.


    Mike: Oh yes. And everyone else, thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com, on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


    Mike: This has been a HumblePod Production.


    Mike: Stay humble.







  • About Andrew Rogers: Andrew leads technical strategy and architecture development for ACE IoT Solutions. Andrew also leads the development of technical and research strategy at The Enterprise Center, a non-profit focused on developing the innovation ecosystem in Chattanooga, TN. When not bringing his extensive experience in Industrial Control Systems, Critical Infrastructure Controls, and Network Engineering to his professional endeavors, he can most commonly be found with a camera in his hand. A deep passion for photography takes him off the beaten path the world over, and serves as a convenient excuse for a variety of other means of enjoying nature, including hiking, biking, and most board sports. Andrew loves sharing his travels and photography, and keeps an Instagram account updated with his most recent adventures.

    Links Referenced
    LinkedIn | Twitter @acedrew | ACE IoT Solutions | Personal site

    Transcript

    Mike: This is the Real World DevOps podcast, and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, from authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


    Mike: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of these are available as open-source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to InfluxData for helping make this podcast possible.


    Mike: Hi folks, I'm Mike Julian, your host for the Real World DevOps podcast, and my guest this week is a friend of mine, Andrew Rogers. He's an expert in industrial control systems and co-founder of ACE IoT Solutions, where he helps companies improve visibility into their operations and energy systems. Welcome to the show.


    Andrew: Thanks.


    Mike: Now you live in one of my favorite cities ever, which is Chattanooga, Tennessee. I can hear the Chattanooga fans in the background just, "Yeah, this is awesome." So one of the coolest things that I think there is in Chattanooga, aside from just the gorgeous weather and great food, is the low cost municipal internet.


    Andrew: Yeah, it's pretty fantastic. It's certainly part of the reason why I moved to Chattanooga in the first place, and I think that it's had that effect on a lot of people over the years. So we have, sort of, an outsized technical community here based on the fact that it's easy to support a remote workforce when you have gigabit or 10 gigabit Internet available ubiquitously across the community.


    Mike: That was one of the things I never expected, because there are actually some pretty significant companies based out of Chattanooga entirely as a result of this highly available municipal Internet, like Bellhops, which is based there.


    Andrew: Yeah. So that company started based on another company in Chattanooga exiting and the founders starting a fund and looking for startups across the southeast that could benefit from the available broadband here. And it's a pretty big company now, well over 100 folks in their technical workforce, and they continue to grow and support the community. It's certainly been a really interesting success story for Chattanooga.


    Mike: You worked for a while as, I think, a technologist in residence for one of the startup incubators in the area.


    Andrew: So yeah, Chattanooga has a lot of really unique resources. You talked about one of those, the fiber broadband available over a 600 square mile area. One of the other things, which I think is partly due to the fiber but also just due to the type of community Chattanooga is, and the effort and interest in working collaboratively, is a 501(c)(3) nonprofit organization focused on entrepreneurship, helping startups both at the scale of coming up and building a high-growth startup, but also mom-and-pop shops, helping them grow and build sustainable businesses or get to the next step in their business plan. When the fiber was first launched in 2009, a group of folks in the community came together to say, "Hey, how do we get the most out of this? This is an awesome asset. We know that the future of our economy is going to be based on growing businesses in the area. How can we use this new asset to support that, not just in luring a big company into town, the traditional economic development scenario, but in growing companies locally?"


    Andrew: And so one of the things they decided to try was launching a startup accelerator. It wasn't a venture-funded accelerator; it was funded by this nonprofit. And it was focused on finding the companies around the country who had ideas that were only viable when ubiquitous broadband was available. That brought a lot of really interesting folks to Chattanooga. We did it for about four years, I think. GIGTANK still exists today. It's changed a little bit; it happens every summer. But one of the challenges with GIGTANK was that you bring in a company, and you get a group of really great mentors from the business community here in Chattanooga. You surround them with professional services companies who are eager to support growing a business in Chattanooga. Then you tell them that you're trying to build a high-growth startup, and your total addressable market is a million people in the US who actually have access to this high-speed connectivity.


    Andrew: So what tended to happen, unfortunately, is the businesses were viable, but the reliance on the broadband wasn't. I was involved heavily in 2012, 2013, 2014. Now, in 2019, we're sitting here, and you see fiber deployments happening around the country. Talk of 5G is running thick right now, and so those businesses actually have a lot more viability. But what did happen was, the businesses came for the fiber and kind of stuck around for Chattanooga. So we ended up accruing an amazing tech talent pool, really interesting entrepreneurs who have gone on to focus on other things. We even took a little stint at additive manufacturing, because we saw that digital manufacturing gave you the same opportunity that was the basis for deploying the fiber in the first place, which was making a smart grid that was truly smart, and making the trade of moving photons instead of electrons to achieve the same sorts of quality of life improvements, et cetera, that rural electrification had in the early 20th century.


    Andrew: We saw that same thing happening in manufacturing: with digital manufacturing, the capacity and need to move data, as much as or more than moving materials, is really important. It's an important use case for the fiber infrastructure, and it's part of the story of why companies like Volkswagen, and several major Volkswagen suppliers, have all moved to Chattanooga: that ability to support a digital manufacturing environment. We actually took a year of GIGTANK and focused on additive manufacturing and the digital manufacturing that grew out of it.


    Andrew: One of the real success stories out of that whole program is a company that 3D prints really unique architecture using some of the world's largest 3D printers, which they've built, designed, and engineered in-house. It's called Branch Technologies, and it's really cool. They were pioneers of the technique of printing in free space. Instead of laying up layers like you expect a normal FDM 3D printer to do, they actually print and stretch plastic in mid-air. It's really wild to watch, and it's really incredible to see the results.


    Mike: You started to talk about smart grids here, and this leads me to think about smart cities. I read, I want to say it was a video or an academic paper or something like that, that Chattanooga has embedded sensors in roadways and streetlights and all this stuff to measure various atmospheric and environmental conditions. And alongside that, they have something that I think is pretty unique, at least within the US: an open data policy for the government. Any data the government produces is automatically open; rather than having to, say, request it, you just get access to it, which has led to some pretty interesting possibilities.


    Andrew: Yeah. It's really fascinating. Chattanooga was a fairly early adopter of the open data movement and open data policies. I think we adopted the open data policy we have in 2016, and it doesn't say that all data is immediately open, but it does set out the principles by which government operates: that you should seek to keep data open when possible. There is certain private, sensitive data that can't be opened, unfortunately, but sometimes the aggregation of that data can be opened, or some sort of anonymized format of it. I talk to city leaders across the country about what the open data movement brought. There's a little bit of cost reduction, because you don't have to process Freedom of Information Act requests anymore. But there's also this internal organizational friction around sharing data between departments, and that is really expensive.


    Andrew: Every time Jane has to email Jill to get a copy of that spreadsheet again, it's time lost for both Jane and Jill. We see this in enterprises everywhere: these sorts of silos. When workflows cross silos, there's always some resistance there. You can put a tiny pipe between the two silos, which is Jane and Jill emailing each other once a week, or you can just try to break some of those silos down. And having an open data policy and an open data portal, the pipelines, the ETL processes that you have to put in place to enable that, actually turn out to help you move data between your existing departments, allowing access so people can just do their jobs faster and more efficiently. That's been an incredible win for Chattanooga and other cities that have adopted that policy.


    Andrew: In fact, Chattanooga, just this year, invested in an additional platform specifically for internal data. It was funny: for the data that is sensitive and couldn't be shared on the open data platform, they still didn't have tooling to support the same sort of streamlined data flows. So they actually invested in letting that happen across the enterprise, even for data sources that aren't necessarily public.


    Mike: In this paper I mentioned a bit ago, due to a lot of this environmental data being accessible, I want to say some climatologists decided to start pulling in this data and started modeling, "What would happen if there was, say, a major fire in this one particular area and wind conditions were such and such? How do we evacuate people?" Because they had all this data accessible to them, they came up with a way to evacuate the city along certain streets in real time based on the weather conditions at the time: where the plumes were going, all this sort of stuff. I'm like, "Wow, this is super cool." Yeah, that was really cool.


    Mike: You mentioned a bit ago that you're working on the smart grid stuff, and I know that you've started doing a lot of this work with smart cities. I imagine that a lot of this is more than just open data policies. There's probably a lot more that goes into it.


    Andrew: Yeah, for sure. I think there's a transition happening right now, especially in municipal government IT operations. Even so, 90% of the open data platforms and systems that you see deployed in cities now are still pretty much focused around batch data. It's enterprise data: billing, system-of-record data. It's transactional. It's the sort of data you expect in the general business logic of operating the large enterprise that most cities are. Given all the sensors that are being deployed now for air quality, for roadway conditions, for traffic signal integrations, all that data is streaming. That's where you see some really interesting development happening, and it's certainly stuff we're focused on, not just with the city of Chattanooga but with the University of Tennessee at Chattanooga as well, helping build out and pilot some technologies to handle this data. It's a lot more data than cities have had to deal with in the past, and to be valuable, it needs to be available in real time or near real time.


    Andrew: And that streaming system looks very similar to what we expect in industrial systems; it looks very similar to what our company does with building energy systems. I think what's really interesting, in the cross-tie with your audience working in the general technology operations world, is that the same systems that are powering logging, and time series metrics, and all these at-scale IT systems start to become relevant when you're dealing with the state of every traffic signal in the city every 100 milliseconds. That's where I think there's a lot of really interesting work to be done, and I think that's something Chattanooga is putting a stake in the ground to be a leader in, working between the university and the city itself.
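
    To make that concrete, here's a minimal sketch of what pushing a single traffic-signal reading into a time series database might look like. This is purely illustrative rather than the city's actual pipeline: the endpoint, database name, measurement, and tag values are all hypothetical, and it assumes an InfluxDB 1.x instance using its HTTP write endpoint and line protocol.

        import time

        import requests  # third-party HTTP client: pip install requests

        # Hypothetical endpoint and database; adjust for your own setup.
        INFLUX_URL = "http://localhost:8086/write"
        DATABASE = "city_telemetry"

        def write_signal_state(intersection, phase, vehicle_count):
            # InfluxDB line protocol: measurement,tags fields timestamp
            ts_ns = time.time_ns()
            line = (
                f"traffic_signal,intersection={intersection} "
                f'phase="{phase}",vehicle_count={vehicle_count}i {ts_ns}'
            )
            resp = requests.post(
                INFLUX_URL,
                params={"db": DATABASE, "precision": "ns"},
                data=line,
            )
            resp.raise_for_status()

        # One reading; at every-100-milliseconds rates you would batch many lines per POST.
        write_signal_state("mlk_and_market", "green", 7)

    At city scale, the batching, downsampling, and retention policies are exactly where the at-scale IT tooling Andrew mentions earns its keep.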


    Mike: You and I were having some coffee a couple of years ago, when you first started working on this stuff. You kind of mapped out this architecture you were building, and it was like Cassandra and Kafka, and this real-time streaming system. There was a time series database behind it, and a Grafana on the front, and I'm like, "Oh, that's cool. You're building a monitoring system?" "Well, no, but yes." "What are you monitoring? What are you building this for?" You're like, "Oh, it's a furnace, in a manufacturing facility." And I'm like, "Sorry, what?" The most interesting thing about all of this, and we've kind of been alluding to it this whole conversation, but to say it explicitly: the tools and the approaches that you're using to work with smart cities, work with industrial control systems, and monitor a building are the same technologies that we're using in DevOps to do monitoring of servers and applications. It's the same sort of stuff.


    Andrew: Absolutely. I mean, I think, and I actually have to credit you with this, we met many, many years ago, and I was kind of coming out of an industrial control critical infrastructure traditional approach where it's all vertically integrated, vendor-driven solution engineering or not engineering. Depends on how you-


    Mike: As the case may be.


    Andrew: As the case may be. And talking about some of the systems that you were using at the time, which we look at now and think, wow... It's amazing how fast eight or nine years goes in this industry.


    Mike: Yeah, I'm pretty sure that we were talking about how hot Graphite was.


    Andrew: Yeah, I'm pretty sure. I'm pretty sure. I was like, "Oh, you mean there's something different than RRDtool? What?" At the time I was working in an industry where SQL Server was where they put time series metrics. That works for some things, but it turns out you crash the on-prem server every time you try to look at more than a week of data, and you hit all of those sorts of issues you get trying to use relational databases for massive time series systems. We talked about it, and we talked about some solutions, and I actually implemented some of that stuff, and you said, "Hey, you should really go to this conference that I've found. It's really cool. It's called Monitorama." I've been back every year since, and all I tell everybody when I go is, "I just come here, learn what you all are doing, steal it and rip it off, and apply it to real systems."


    Andrew: The parallels are very real. A really striking thing for me, at the first Monitorama I attended, was that the keynote speaker was talking about all the lean manufacturing principles and how they could be applied to DevOps. And I realized where this overlay was and how the tables had turned. Manufacturing had done a really, really good job of defining the principles and really engineering out the processes by which this stuff could be done, but they had failed on the technology implementation, because that's not what they were, and that's not where their expertise was. So when you took those principles and put them in the hands of the systems engineers and software engineers who are building these things for software companies, boy, they could implement things that we in the manufacturing space could only dream about having.


    Andrew: It's sort of a circular system, where some of the principles that were developed 50, 60 years ago around process management, lean process management, got pushed off into the information technology space. But then they got encoded and codified in these tools that are so much better than what the industry had. And now a big part of what I do on a day-to-day basis is feeding those tools back to the real-world environment.


    Mike: You were helping me out with a project I was working on when I worked for Oak Ridge National Lab, where I had a bunch of solar panels and we were collecting information from them: metrics about performance of the panels, like how much sun there was that day, and the amount of power generated. We were shoving all of this into Graphite. It's the same system I was shoving all of my operating system and application metrics into. But for the industrial controls person I was working with, this was just completely blowing their mind that this was even a possibility.
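
    For listeners who haven't seen it, part of why Graphite crossed over so easily is how simple its plaintext protocol is. Here's a minimal sketch, with a hypothetical host and made-up metric names, of pushing a solar-panel reading the same way you'd push an OS or application metric:

        import socket
        import time

        # Hypothetical Graphite host; 2003 is the default plaintext-protocol port.
        GRAPHITE = ("graphite.example.com", 2003)

        def send_metric(path, value, timestamp=None):
            # Graphite plaintext protocol: "<metric.path> <value> <unix_timestamp>\n"
            timestamp = timestamp or int(time.time())
            line = f"{path} {value} {timestamp}\n"
            with socket.create_connection(GRAPHITE, timeout=5) as sock:
                sock.sendall(line.encode("ascii"))

        # Solar-panel metrics look no different from server metrics:
        send_metric("site1.array3.power_watts", 4281.5)
        send_metric("site1.array3.irradiance_wm2", 803.2)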


    Andrew: Right. Right. Yeah-


    Mike: I think it's absolutely incredible.


    Andrew: Yeah, I think so. I will say, especially four or five years ago, those tools weren't immediately applicable back to this space, because, in general, in manufacturing you need systems of record, and time series for operations monitoring don't tend to have those same sorts of consistency demands. I think what's really interesting is I've actually seen, over the last five years, a bigger and bigger push toward having strongly consistent monitoring systems that can give definite answers, because people are building business-value systems on top of monitoring. Again, it's just helping close some of those gaps, which is great for me, so keep doing it.


    Mike: All right. You gave a talk at, I want to say, one of Grafana's conferences in New York.


    Andrew: It was in Amsterdam. GrafanaCon in-


    Mike: There we go.


    Andrew: Amsterdam. Yeah. Yeah.


    Mike: About monitoring at building scale or city scale. Is that what-


    Andrew: Yeah. Monitoring buildings at city scale. Yeah, I've been involved-


    Mike: I think this gets to what you're currently working on, doesn't it?


    Andrew: Yeah, absolutely. So I've been doing consulting work for many years, as we've been talking about, in manufacturing, applying some of these monitoring technologies back into the manufacturing space. The NSF, the National Science Foundation, has a really weird term for this, but I actually think it sums this stuff up better than any other term I've heard, which is cyber-physical systems. My career has always been in cyber-physical systems. I didn't know what it was when I started my career. I didn't know what it was until about five years ago, but that's what they are, and at any time-


    Mike: Can you define that for us?


    Andrew: Yeah, yeah. A cyber-physical system is any time a software-defined system touches something that is hardware-defined and interacts with the real world in a way that is not just sending bits over a wire. That sums up-


    Mike: Oh yeah. How about an example there just so we're clear on what we're talking about?


    Andrew: That's a very general term. The example could be anything. I mean, a traffic light that is connected to Ethernet and sending data back to a central database, that's a cyber-physical system. Your thermostat, your Nest Thermostat, is a cyber-physical system. To some degree, maybe your mouse is a cyber-physical system. But at the end of the day, it all gets down to something that happens in the real world being turned into bits, or something that's happening in bits being turned into something in the real world. That's really what my career and, especially, my consultancy have centered on. And I got involved, brought in actually, to help bring some expertise in IT operations to a project with the City of Washington DC, deploying a large-scale monitoring system for their building operations.


    Andrew: So they had buildings across the city. They have a $40 billion real estate portfolio, and they spend about 100 million dollars a year on energy, so it's a big application. What they would find is they would make a strategic investment in a building that was performing poorly, which they could see because they pay the utility bills, and they would improve it by 20 or 30%. But month by month after they made that capital investment, they would see the building slide back toward its original benchmark, and they didn't have any insight as to why, because all they had was how much money they were paying for electricity. They had no insight into, well, what equipment is running? What set points are being applied? Who's turning on what, when? Which spaces are occupied? They scoped out this project to really define how they could collect all that operations information and provide it, and visualize it, in an accessible way for their staff from the top down.


    Andrew: They wanted their enterprise executives to be able to look and see, "Okay, how is this building being used?" They also wanted their technicians out in the field to say, "This doesn't look right. What's going on? I need to go investigate X." And so we started building this system. It's operational now. It has about 60 of the 400 buildings that DC owns connected, and it's fully deployed in about 20 of those buildings, where they've actually done all the accessory work to make sure that they understand what the data they're getting is. With it, they're saving around a million and a half dollars a year on energy, and most importantly, where they're making capital investments, they're seeing retention of the savings they gain. That project demonstrated to me and helped me understand a lot about that space and what that space needed, and I ended up starting a company focused on building out a cloud solution that provides that service to other large portfolio owners or even individual building owners.


    Mike: That's pretty cool. I imagine you probably used a lot of the same tricks and technologies that we were talking about earlier to do all that?


    Andrew: Yeah, absolutely. When I got involved in that project, the city of Washington DC had been engaged with the federal Department of Energy, and specifically working with Pacific Northwest National Lab, on a software platform called VOLTTRON. My business partner refuses to introduce it without saying, "Unfortunately called VOLTTRON." With two T's, just to avoid trademark issues. But it was a software platform that had been written by National Lab researchers to enable them to examine interesting and novel ways to control buildings, to control what they call distributed energy resources, which is everything from your server UPS, to the air conditioner, to hot water heaters, to the solar panels on your roof, to dedicated battery storage devices. And it's a pretty robust platform, but it was built by researchers, which... I think you have some experience with systems built by researchers-


    Mike: I do-


    Andrew: Which you may or may not have PTSD from.


    Mike: Plenty of that. Unfortunately, it's weird working for a national lab, because all of my experiences are stuff I can't talk about.


    Andrew: Well. Yeah.


    Mike: Department of Energy. Yeah.


    Andrew: Yeah. Anyway, what they really needed was someone with a view of the same things we've been talking about: what technologies are being developed and applied in the broader technology ecosystem, and how can those enhance what we've got here? One of those was back to the same challenge that I've found myself facing time and time again: how do you store time series data at scale efficiently, aggregate it easily, and provide robust analytics quickly? We did a lot of work on retooling and moving some things around in the platform to enable a little more effective, efficient deployment, and helped them justify to DOE moving the whole project to the Eclipse Foundation. Now, VOLTTRON is a project in the Eclipse Foundation's portfolio.


    Andrew: We are one of the only commercial companies out there offering services around VOLTTRON. We use it to support our cloud service offering, which is a basic building instrumentation tool. But we also support other companies who are doing interesting things with the platform, and support their use cases and help them develop robust technology processes around it.


    Mike: Did I hear right? Before we were talking, before we started recording, you mentioned something about storing this data through... Oh, shoot, I forget what it was. I think Kafka and S3?


    Andrew: Yeah. "Big data" we hear, and have heard, way too often, and I refuse to let anybody I work with call any data big data. I just say medium data, and then they hem and haw, and then they shut up-


    Mike: Of course.


    Andrew: But there's been a real transition, and I think I touched on this with the smart city use case. When big data became a buzzword, it was all about, okay, we have been collecting transaction data, or core business logic data, for 50 years. Now we have these vast repositories of data that are in 20 different formats. How do we get all this into a data lake, or whatever platform or technology you want to use, to get it where you can actually get value from it? But now it's: we have all that data, and we need to actually join it with streaming data that's coming in real time to get value from it. That's a very different challenge.


    Andrew: But obviously, stream processing is a big deal. It's taken off. Kafka obviously is a technology that's getting used a lot; it's getting a lot of traction. Confluent seems to be doing really well, which is exciting. Pulsar is another technology. But this also gets back to the system-of-record data. When I first heard about Kafka, it was being used to multiplex monitoring data in a way where, maybe, you weren't that careful about consistency, and it wasn't a big deal if your consumer offsets got moved around a little bit, et cetera. But now we're seeing really robust frameworks for processing that data and getting it into objects in something like S3, where you can move it easily, you can query across it, you can pre-aggregate it. So yeah, we're working a lot with those kinds of data systems now, getting real-time streaming data that's available in a Grafana dashboard, but also available for your data science teams to use with the existing data you may have in your enterprise.
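
    The pattern Andrew describes, consuming from Kafka and landing durable batches as S3 objects, can be sketched in a few lines. This is an illustration under assumptions, not his actual stack: the topic, brokers, and bucket names are made up, and it leans on the kafka-python and boto3 libraries.

        import json
        import time

        import boto3                     # AWS SDK: pip install boto3
        from kafka import KafkaConsumer  # pip install kafka-python

        # Hypothetical names throughout; adjust for your environment.
        consumer = KafkaConsumer(
            "building-telemetry",
            bootstrap_servers=["kafka:9092"],
            enable_auto_commit=False,  # commit manually after each batch lands
            value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        )
        s3 = boto3.client("s3")
        BUCKET = "telemetry-archive"
        BATCH_SIZE = 1000

        batch = []
        for message in consumer:
            batch.append(message.value)
            if len(batch) >= BATCH_SIZE:
                # Write one newline-delimited JSON object per batch, keyed by arrival time.
                key = f"telemetry/{int(time.time())}.ndjson"
                body = "\n".join(json.dumps(record) for record in batch)
                s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
                consumer.commit()  # offsets advance only once the batch is durable
                batch = []

    Committing offsets only after the S3 write gives at-least-once delivery, which is the kind of consistency story that starts to matter once business systems sit on top of the monitoring data.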


    Mike: Man, all of this is so incredibly interesting. I absolutely love the crossover here. The comment you made earlier, that the manufacturing world really came up with solid principles but wasn't equipped to do the execution, the implementation, on the software side. Coming from my background... I'm not from the manufacturing world, I'm from the systems world. I've always looked up to manufacturing as, "They've got their shit together," but really, we're doing the implementation, and now you're taking it all back, which is awesome. Everyone wins.


    Andrew: Yeah. I mean, don't get me wrong. I think manufacturing has done an incredible job, especially if you look at the mechanical systems that they developed these processes for. The defect rate in a modern vehicle, if you really just sit down and think about it, is mind-blowing. The fact that there's only a few recalls a year, those sorts of things. But I'm also the kind of person who sits down every once in a while and is like, "Man, 90 years ago the idea of flying across the country would have been completely strange to anyone." You're like, "I'm going to leave on Friday and be in Portland, across the country, four hours later." That was completely strange. And so it's not to crap on manufacturing-


    Mike: Of course not-


    Andrew: They've done incredible, incredible work. And if you look at those principles that people like Deming worked on, they're still incredibly relevant, which is kind of wild, but-


    Mike: What I love about all this is that the principles and implementations that each of us work with are being applied in ways that most people don't even think about. I never would have considered that you would be using the same technologies that I use to do completely different work.


    Andrew: Yeah. To be perfectly frank, I didn't either until I met you. So I might owe a lot of my career to you. This is...


    Mike: Thank you-


    Andrew: Kind of fun, kind of coming back full circle as well.


    Mike: Yeah. Well, this has been wonderful. Thank you so much for taking the time to chat.


    Andrew: Absolutely.


    Mike: Where can people find out more about you and your work?


    Andrew: Aceiotsolutions.com. We've got a blog that we try to keep updated with information about some of the projects we're working on and some of the technologies we're using. Please follow us there.


    Mike: All right then, and to everyone else listening, thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


    Speaker 3: This has been a HumblePod production. Stay humble.

  • About Jeremy Tangren: Jeremy Tangren is a Technical Program Manager specializing in infrastructure and vendor management. Throughout his 15+ years in IT program/project management, he has managed local- and global-scale multi-million dollar projects at companies like Facebook, Splunk, and Cisco. Jeremy is based in San Francisco, CA.

    Transcript

    Mike: This is the Real World DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


    This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to InfluxData for helping make this podcast possible.


    Hi, folks, I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is Jeremy Tangren. He's an expert on everyone's favorite topic to completely ignore: vendor management. He's been a technical program manager for companies such as Cisco, Facebook, Splunk, and a whole bunch of other really interesting places. Most interesting to me is, Jeremy, you and I have known each other for, God, what? Nineteen years now?


    Jeremy: Yeah, 19 going on 20. It's been a while.


    Mike: It's insane. Yeah, which is weird. We met each other on a Final Fantasy Forum when we were like 12.


    Jeremy: Thank you Final Fantasy Seven.


    Mike: Yes. Thank you. So you've been working a lot with vendors throughout most of your career and I know you kind of fell into it accidentally.


    Jeremy: Yeah, totally accidentally. It started when I was just a general IT guy and I was by myself. I needed more hands and more capabilities and I had to start bringing in vendors to get work done. It was a necessity.


    Mike: Now, I imagine most people listening to this are thinking, "Oh, my vendors, my vendors are awful." And I mean, vendors are kind of annoying to work with, aren't they?


    Jeremy: They can be challenging. I definitely get the perception that most folks view vendors as a necessary evil and they would avoid interacting with them at all costs if they could.


    Mike: That's been my experience as well. But I mean, there have definitely been a lot of vendors where I'm like, "I don't really want to work with them, but they get the job done that I need done and I don't want to do it myself." But at the same time, it doesn't really need to be that way.


    Jeremy: No, not at all. You don't have to have an adversarial relationship with your vendors. In fact, you should be looking for more of a partnership than anything else with a vendor.


    Mike: Tell me more about that. What do you mean by this partnership?


    Jeremy: Well, take for example, going back to my original IT guy explanation: I was by myself, but I had to have help. So when I brought in a vendor to install fiber or whatever it was, I had to work very closely with them to tell them where the work needed to be done, what my expectations were, all of the major details of the project. Then I had to work closely with them again during deployment, and then finally on close of the project. At no point during that process was there an opportunity for me to step away from them, just let them run, and expect things to go hunky-dory. And that's what most people tend to expect with vendors: "I pay them, I cut them a check, they get the job done, I don't think about it anymore." But there's just so much more than cutting a check that goes into working with a vendor.


    Mike: Okay. Let's dig into that. What else is there? Well, I mean, when I have a problem, I'm thinking, "I'm going to go find someone who is very good at this problem, some company, and I'm going to tell them what I need, cut a check, and walk away." It's like hiring an electrician or a plumber. They're just going to get the job done. I don't have to think about it. So is there more to it than that? What else goes into it?


    Jeremy: There's always more to it than that. When you're bringing in a vendor, they're specialists at whatever they do, the electrician, the plumber, but they still have to be told what it is that needs to be done. "Do you need to wire this entire house with three-phase power?" "Do you need to only fix this toilet?" They need to be given clear instructions and expectations or else they're just going to do what they think is right and that might not align with what your expectations are, so you have to work closely with them to get the right results.


    Mike: I imagine that gets pretty complicated when we're talking about a software and IT. I can just imagine hiring a data comm person or data comm company to come in and wire a building for Ethernet and how complicated that would be.


    Jeremy: Absolutely. You can't just bring in that cabling vendor and say, "Go." They don't know where all of the gotchas are in that building. They need a blueprint of the building. They need to be working with the facilities team. There's lots of cross-functional stuff that goes on just to wire a building with cables. So you have to have that partnership with the vendor and not just expect them to cowboy it up and take care of everything magically, or again, you'll end up with results that you weren't expecting or didn't desire.


    Mike: So on that note, the idea of hiring a vendor and just letting them loose seems like the thing that everyone does. And on that same note, people just don't think about their vendors, especially ones they're using long-term. I know for a long time I didn't, until you told me how wrong that was. So why are we doing that? Why are we just kind of ignoring our vendors?


    Jeremy: Because it's easy to do. I think that's what it really boils down to. Especially from an engineering standpoint, it's very easy to focus on what you're working on and what you need to get done versus the dependencies and interactions with other teams, vendors included. So it's very easy for an engineer or even a manager to forget that they're even working with a vendor. And when a vendor comes knocking and goes, "Hey, we're still doing x, y, z. Is it to your satisfaction?" the manager or engineer goes, "Oh, no, that's not right actually. You didn't do it the way I wanted," because they weren't partnering with and working with the vendor to guide them to a successful conclusion.


    Mike: Something you said there made me think about one particular thing that is often overlooked in vendor relationships, which is the role of the client. When I, the company, am hiring a vendor, I expect certain things from them, but to make a good relationship, they actually expect certain things from me too, beyond just payment. But it can be as simple as that. Right?


    Jeremy: Yeah, it can be.


    Mike: So I have contractors I work with for my company, and I follow the Patrick McKenzie rule of net 30 minutes on payment, and invariably they're like, "Oh, my God, you're an amazing client. The best client I've ever had," simply because I paid them early.


    Jeremy: Yeah. That's actually one of the biggest successes that I've found with vendors: when you establish these really solid relationships with them, even if you haven't paid them yet, they'll sometimes do work for you. You have to have that incredibly tight relationship and partnership, either on a personal level or on a business level, before the vendor will go above and beyond for you and do things like dedicate their entire engineering team to your project before you've cut a single check. So that really underlines the relationship between the two entities.


    Mike: I know that you've got vendors you've been working with for 10 plus years now.


    Jeremy: Yeah.


    Mike: For a lot of us, it's absolutely wild that we would have vendors that we continue to work with for that long.


    Jeremy: Some would call it favoritism. I've definitely seen that in some businesses. There was a company I worked for a number of years ago that required that we acquire three quotes for every piece of equipment or every vendor that we worked with. And I understand the motivation behind that, but very often that's just overhead that you don't need. And it creates an adversarial relationship between the company and the vendor, because as the company, you are now trying to squeeze the vendor for everything they've got in order to compete against these other quotes, and that's not really the right way to do it.


    Instead, if you've got a vendor that you trust and have worked with for many years, you know that you can rely on them to get your project done. Especially, given the interactions that you've had in the past, the relationships, partnerships that you've had, they trust you and you trust them. It goes both ways.


    Mike: I have heard that you have some fantastic stories about how this actually has played out in practice.


    Jeremy: Ah, just one or two.


    Mike: So one of my favorite stories is a data center on a boat.


    Jeremy: Yeah. So data centers on boats aren't generally a good idea. However, they happen.


    Mike: Tell me more.


    Jeremy: A number of years ago, a client brought me in to shore up their infrastructure and initially, the request was, "Hey, things are a bit unstable, we just need you to get things rolling again and bring up our uptime." So I came on site and I took a look at their server room after I asked to be escorted to it. So I was escorted through the boat downstairs into the engine room and then into another side room where all of the servers lived and I'm looking around in this small server room and things are functioning. It's a server room. And I asked a question and I said, "Where's the water line?" And they said, "Oh, it's about 10 feet over there," over there being above our heads.


    "So what you're telling me is that your entire server room that operates your entire company and brings in all of your revenue is sitting here on a boat under the waterline?" "Yeah, that's about right." "Oh, okay. Is this a concern for anyone?" And the response I was met with is, "It's a little bit of a concern, but we check the hull every year. It should be fine." "The Pacific Ocean, the entire Pacific Ocean is less than 20 feet away from all of your servers. What do you think about moving them?" And, of course, the next question was, "Well, where? Where do these go? They're super important to us. We need all of this hardware, all these services, what are we supposed to do about it?" My first answer was, "Get it off of this boat."


    Mike: Doesn't really matter after that, right?


    Jeremy: Yeah, anywhere off of the boat is an improvement, and it required interfacing with multiple vendors and consultants to scope out all of the various dependencies and hurdles in moving all of this company's services elsewhere. We ended up migrating them over to Amazon. So now the data center can no longer sink into the Pacific Ocean, and they have multi-region redundancy and resiliency. It's a good lesson in how working with good vendors can save you from a terrible, terrible outcome.


    Mike: Yeah, so how did all that work? I imagine you were not physically picking up these servers and running them to the nearest Amazon data center yourself?


    Jeremy: Oh, no, no. I mean I do my squats and lifts every once in a while, but servers are a bit much.


    Mike: So, you say you relied on a lot of vendors to get this done and a lot of people would be like, "I don't want to rely on a vendor for something that critical," but you did. So why is that? Why did you go that route instead of trying to do this internally? How did you find the vendors to do something so critical for the company? And how did you manage those relationships?


    Jeremy: Well, the core answer to your question is that the client didn't have the expertise and competencies necessary to do this kind of a migration. They had the engineers necessary to keep the lights on and keep everything going, but not much beyond that. So we really needed people with expertise in migrating services from standalone data centers to the Amazon cloud. And we needed someone to maintain these systems after they were migrated, because again, this company didn't have the expertise in this area, so somebody had to do it for them. So we ended up hiring one consultant to architect the Amazon migration, another consultant that actually performed it, and then an external support vendor to manage the entire infrastructure after it was deployed. And they still do that to this day.


    Mike: So how did you find these people that were that good and that you could trust with such a migration and what was the relationship building like on that?


    Jeremy: As with a lot of good vendor relations, I happened to have some good sources: suggestions from folks that I knew, to point me in the right direction toward people that I didn't already know. For example, for the vendor who was going to maintain all of the services after the migration, I didn't know who could do that right off the top of my head. So I had to reach out to my network and learn who would be reliable. And here it is again: the experiences that people have with vendors vary wildly, so you kind of have to take every story with a grain of salt when they tell you, "Oh, I had this terrible experience with such and such." So I was looking for good experiences with someone to maintain their infrastructure, and then continued to reach out to my network looking for the consultants to actually perform the move.


    Mike: Do you interview your vendors?


    Jeremy: I do actually.


    Mike: How do you interview a vendor?


    Jeremy: Not entirely dissimilarly from how you would interview an employee. You have to think about the long term with a vendor, and what the good and the bad with the vendor could be. So you have to plan for things to go well. You have to plan for things to go sideways. You have to plan for the year, maybe, that this will take and that you'll be involved with this vendor. So you have to really take these into consideration and talk with somebody in leadership, preferably in the C-suite, and not a salesperson. Every bad experience that I've ever had with a vendor has been because of a salesperson. Every time I've had a good experience, it's been because I engaged the CIO or the CEO or whoever at the top level and said, "This is what's going on. These are incredibly important requirements, and we need this done right."


    When you speak to the right players who understand what your needs are, they will make sure you're taken care of. Salespeople, not to discredit the entire realm of them, but a lot of times salespeople will say things that aren't true or will promise things that can't be done. I've had this happen to me, so I always follow up whatever a sales guy says with their C-suite. So yeah, definitely get an idea of how they function, and maybe who their other clients are, and maybe talk to some of their clients and see what their experiences have been. And again, take that with a grain of salt, understanding that maybe things weren't managed right over there, or maybe they were and the vendor was bad. So there's a whole lot of evaluation that goes into "we're bringing in a new vendor," and not just simply signing a check.


    Mike: You mentioned this idea of planning for things to go wrong with a vendor, and that seems kind of counterintuitive. When I'm thinking about working with a new vendor, I really don't want to be planning for things to go south, but you do. Why is that?


    Jeremy: Well, I'm a PM, for starters; a large part of my job is managing risk. The thing is, even though you may have established this partnership with this vendor, and everyone participating in this project or whatever it is has the very best of intentions, the company has the best of intentions, the vendor has the best of intentions, and everyone is trying to stay on the same page, at the end of the day, the vendor is not your employee. They don't have the same business goals, they don't have the same business direction, and somebody could change their direction in a way that unknowingly affects the project and causes things to go sideways on you.


    And this is really true for any project, vendor or not. You should always plan for a project to go sideways or just explode in your face, and for how you'll handle that failure. If you're working with a good vendor and there are large risks ahead, you should be planning with them on how to address those risks, or at least have mitigation plans should something go wrong. And again, a good vendor will be happy to work on those plans with you; they will support that and say, "Yes, we're going to do what it takes to make this successful." A bad vendor will just wave you away and say, "No, we've done this a thousand times, it's fine," and I've had exactly that happen and had a bad experience.


    Mike: Going back to partnering with vendors, there's been a clip floating around for a while now about how, if you're spending several million dollars a year with AWS, they are no longer a vendor, they're a partner, and you should treat them as such. The relationship is different. It is different than just the person I pay to mow my yard. So you were telling me before we started recording that while that is true, the size doesn't actually matter.


    Jeremy: That's absolutely correct. AWS, they're giant. I mean, absolutely tremendous, and yeah, you definitely want to consider them a partner if they are where all of your infrastructure lives, for example; they're super, incredibly important to your business. They're not simply a vendor at that point. But even smaller vendors, way, way smaller vendors, can have that impact on your business and that partnership with your business. A key example: I was working with a company a couple of years back. They had engaged a very major vendor to do a deployment for them, and that deployment was-


    Mike: So we're clear, what do you mean by deployment? Are we talking like software? Hardware?


    Jeremy: This was a hardware deployment.


    Mike: All right.


    Jeremy: A physical deployment into a number of offices, and they vastly overspent on this deployment, because the relationship between the client and the vendor at that time was, "I'm going to cut you a check and you're going to do this deployment." And the results were basically a completely failed deployment. In many, many of the locations that were deployed to, the systems did not work at all. And in the locations where they did work, the experience was so far below par that it was almost unusable.


    Mike: My first inclination when you tell me that is that the vendor's to blame, but I feel like you're about to tell me that wasn't true.


    Jeremy: Ha. You know exactly what I'm thinking. Yes. You could very easily say, "Yeah, it was the vendor's fault." They overspent on every single thing because they wanted the money, they wanted the cut of the profit. Or you can think about the situation that led up to this, and that situation was that the specific team that was responsible for these services engaged the vendor, gave them a very high-level, very basic scope, and basically cut them a blank check and said, "Go," expecting them to take care of the entire project end to end with no project management. No check-ins, just go do.


    Mike: That sounds like they essentially outsourced both authority and responsibility, but were expecting that everything would be perfect when it was done.


    Jeremy: You're absolutely correct. Sadly, the result of that was me coming in with my team, after the fact, and yanking out millions and millions of dollars' worth of hardware that was completely useless and replacing it with all new equipment through a new vendor. And this vendor was much, much smaller. We're talking about the difference between, say, Accenture and your five-person consulting company. So we're now dealing with this five-person or so company, and I'm working with them directly, very tightly, establishing what it is that we expect, the level of user experience we want. I dictated a large portion of the scope for these deployments, and so we went forward with the project. They got their check, the equipment came in, we did the deployments, and users came in that following Monday and were ecstatic that all their systems worked.


    Mike: Nice. It sounds like what really went into that was a lot of up-front planning: working with that vendor to set expectations and establish who is responsible for what. It sounds like a lot of ongoing management, actually.


    Jeremy: Yes, you're absolutely right. It deviates from the idea of "I'm cutting a check and walking away." It's 180 degrees different from that. It is now, "Someone else is cutting you a check somewhere down the line, but you and I, vendor, are working hand in hand to get this project done. I don't care who's cutting the check. We're in this together. As far as I'm concerned, we are on the same team for the duration of this project."


    Mike: And that's really what you mean by partnership with a vendor.


    Jeremy: Yes, absolutely.


    Mike: In the State of DevOps Report, Dr. Forsgren actually talks about this: outsourcing functional things is very tightly correlated with being a low-performing company. What you've done is, though you have outsourced the work, you haven't actually outsourced responsibility, and you're still treating the company you've outsourced the job to as part of the team. Legally speaking they are not, but functionally they are.


    Jeremy: Correct. In this case with that vendor, they called into my daily stand-ups every day during the course of the project.


    Mike: That's wild to think: I'm going to have my vendors on my stand-ups.


    Jeremy: Yeah. No one else would ever do that. But it was critical to the success of the project because they needed to know what was going on just as much as the rest of us.


    Mike: So on that note, having been in really large companies myself, there's this concept of internal customers too. And everything you're telling me here seems to say that there are situations where you could actually be the vendor and also have vendors of your own.


    Jeremy: You're absolutely right. I've been in a couple of different positions where you have entire service teams or infrastructure teams within companies that are vendors to the rest of the company. Maybe there's a cross-charge system, or contracts, or whatever it may be, but ultimately these teams are vendors to the rest of the company, providing them services.


    Mike: How do the relationships change on that?


    Jeremy: The relationships, honestly I feel should be very similar. If you're having good relationships with your external vendors, then you should be able to have a good relationship with your internal vendor. If it's the other way around, then you need to change how you're approaching one or the other set of vendors.


    Mike: Yeah, so on that note, I have one last question for you. Let's say that I now realize that my relationships with the vendors I support, or the vendors I work with, are not very good. What can I do about that? How can I turn that around and start to improve those relationships?


    Jeremy: First I would try and define what good looks like for you because you need an understanding of what good and bad look like for your vendor relationship and your circumstance.


    Mike: Do you have some suggestions for what that might be?


    Jeremy: Let's say, for example, you have a SaaS service and you're planning to scale. You're expecting to grow in the next six to 12 months, so you reach out to your vendor, and maybe they don't respond to your email for a couple of days. Okay. Email them again. Okay, you're still not hearing back from them. So this now sounds like kind of an unhealthy vendor relationship. How do you fix that?


    Mike: I've definitely had those.


    Jeremy: Yes. I'm sure a lot of folks have those, and really, one of the things that you have to have is empathy. You have to understand that whoever this individual is that you're reaching out to, they are servicing other clients as well. Almost 100% of the time they have more than just you on their plate, and they're trying to make everyone happy. They're trying to do a good job. So don't take your first couple of emails going unanswered and get angry at your vendor, because things happen. The next best thing that you should do is call them. Pick up the phone.


    Every time I've ever picked up the phone and spoken to my vendor, be it one of the engineers working on the project, the project manager for it, the manager over those folks, or even one of the C-suite, it doesn't matter: if I get someone on the phone and explain to them what is going on and how I'm trying to work with them, they will respond. They always answer the phone. They always call back. It gives them that sense of "someone needs me," whereas email doesn't have that urgency.


    So that's the first place I would start: if I have vendors who are a little less responsive, be proactive, engage them, and get them to be responsive. If your account manager is still unresponsive, not responding to however many emails, not responding to calls, maybe you need to escalate above them, because again, you're working with a vendor, not an individual. So just as you would inside your own company when you're not getting engagement from an individual, you escalate up the ladder. And what you'll find as you go further up the ladder is people who care more about the vendor-client relationship.


    It's very similar to any other company: as you move up the ladder, the people who work there care much, much more about the vision and the strategy. So engage those higher-level folks if you need to, and they will give you the attention that you need.


    Mike: That's all fantastic advice. Well, thank you so much for coming on the show. This has been wonderful.


    Jeremy: Well, thank you for having me, Mike. This has been a great chat.


    Mike: Where can people find out more about you and your work?


    Jeremy: You can find me on LinkedIn, jeremytangren, and I'm open for hire for folks who need me.


    Mike: Awesome. Yeah. On that note, Jeremy's a fantastic program manager. As of today, he's looking for a new company to join and kick some assets, so if you're looking for him, there you go. And to everyone else listening, thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


  • About Mary Thengvall: Mary Thengvall is a connector of people at heart, both personally and professionally. She loves digging into the strategy of how to build and foster developer communities and has been doing so for over 10 years. After building community programs at O’Reilly Media, Chef Software, and SparkPost, she’s now consulting for companies looking to build out a Developer Relations strategy. In addition to her work, she's known for being "the one with the dog," thanks to her ever-present medical alert service dog Ember. She's the author of the first book on Developer Relations: The Business Value of Developer Relations (© 2018, Apress).
    Mary is founder and co-host of Community Pulse, a podcast for Developer Relations professionals. She curates DevRel Weekly, a weekly newsletter that brings you a curated list of articles, job postings, and events every Thursday. She's also a founding member and "Benevolent Queen" of the DevRel Collective Slack team.
    Mary is also a member of Prompt, a non-profit that encourages people to openly talk about mental illness in tech. She speaks at various conferences and events about building and fostering technical communities as well as how to prevent burnout in yourself and your team.
    Links Referenced: Twitter: @mary_grace | Blog/Website: marygrace.community (or marythengvall.com) | Biz Website: persea-consulting.com | Newsletter: devrelweekly.com | Podcast: communitypulse.io

    Transcript

    Mike: This is the Real World DevOps Podcast, and I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools, to the organizers of amazing conferences. From the authors of great books to fantastic public speakers, I want to introduce you to the most interesting people I can find.


    This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools, Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of these are available as open-source, and as a hosted SaaS solution. You can check all of it out at Influxdata.com. My thanks to InfluxData for helping make this podcast possible.


    Hi, folks, I'm Mike Julian, your host of the Real World DevOps Podcast. My guest this week is Mary Thengvall. She's a long-time builder of communities for companies like O'Reilly Media and Chef, the author of the book on developer relations, The Business Value of Developer Relations, and she runs a podcast and a newsletter of her own. Now she's helping companies create community-building strategy through a new consulting firm, Persea Consulting. Welcome to the show.


    Mary: Thanks for having me, Mike, I appreciate it.


    Mike: So, I want to start off with a really foundational topic.


    Mary: Sure.


    Mike: What is community in this context?


    Mary: Absolutely. You start there, and I start there in most of my talks as well, just because I like to make sure that people have a shared definition. The definition that I always go back to with community, and it relates to developer relations as well, is this: community for me is a group of people who not only share common principles, but also develop and share practices that help individuals in the group thrive. So it's not just, "Hey, we're all on this platform." It's, "We're all on this platform, we have a specific goal in mind, and we're all helping each other."


    Mike: Okay. A lot of us are pretty familiar with, like, the Python community, the Ruby community, or the Chef community, and then there's kind of the non-tech stuff. Like, people who go to church have a church community, or we have the community of our neighborhoods. Are these the same kind of community?


    Mary: They have similarities.


    Mike: Okay.


    Mary: And it's that idea of, if you're in the Python community and someone reaches out on Twitter, or in a Slack team, or at PyCon, for instance, which just happened, and they say, "Hey, I need help with these things," you'll likely find other people around you who go, "Oh, I know how to do that. Let me help. I'm more than willing to." In neighborhoods, you have things like Nextdoor, where someone will post, "Hey, I lost my cat." Or, "I need help with my garden." Or, "Help me out in these other areas." And people will jump in and say, "Oh yeah, I'm more than willing to help out."


    In those ways they can be very similar. There are obviously nuances, with technical communities versus neighborhood communities, and in some cases they come together more easily than others. Neighborhoods have the advantage of being in the same physical location, which is always really nice. Technical communities usually have forums around a particular product, or we use Twitter hashtags to kind of see what other people are talking about. And there's a big difference to me between communities of people and community platforms.


    And I know we have a lot to cover today, so I won't spend too much time on this, but just kind of a TL;DR: the platforms are usually ways to bring people together to accomplish a specific purpose. For a Python community platform, for instance, you could have one for new Python developers, you could have one for experienced Python developers. You could have one for people who are willing to mentor folks, or people who are into Django, or whatever specific piece of Python lore you're interested in.


    Mike: When you're talking about these platforms, are you talking about the medium on which they operate, or are you talking about kind of segmentations of that community?


    Mary: A little bit of both, and those tend to go hand-in-hand. As you see larger communities, people who are interested in the same topics, segmenting off, you'll often find software platforms that spring up around them: you know, cool, the newbie Python people are hanging out on Twitter. The experienced Python people have a back-channel Slack team. The people who are willing to mentor are on LinkedIn. So they're using already existing platforms to facilitate conversations around particular topics.


    Mike: That seems like a bad thing.


    Mary: It can be. The hard thing about that is, if there's not a specific person really taking the time to lead those groups, or manage those groups, or build those groups of communities, things can devolve fairly quickly, with Twitter as we've seen, or with Reddit, or places like that. It's just hard to control when you don't own the platform or the conversations. But at the same time, if you're not owning the conversation as a company, then you aren't necessarily limiting people to your viewpoint. You're allowing people to talk and explore, and see what's going on elsewhere.


    Mike: Yeah. I can't tell you how many communities I've been a part of where the platform we were on disappeared, and the result was there was no more community.


    Mary: Right, right.


    Mike: It also disappeared.


    Mary: Right, and that's difficult, because-


    Mike: Oh yeah.


    Mary: People become so used to a particular platform, and so accustomed to how that platform works, and then suddenly, even if it's not that the whole platform disappears, some companies will move a community elsewhere and only half the community moves. And then that community might die off, because, "Well, I was really only here because Mike was here, and Mike didn't move to the new platform." But Mike was only there because Jony was there, and Jony didn't move to the new platform either. And so you can have this cascading effect of losing people, because the people who influence them decided not to move over to that new community space.


    Mike: I love all this, because it starts to get into the topic of community management. And we're not just talking about how to manage technical communities, we're really talking about management, maintenance, and leadership of community more broadly.


    Mary: Right.


    Mike: One of the things that's always been interesting to me is, communities are self-forming, but they're not really self-governing.


    Mary: Yes, I love that phrase.


    Mike: And yeah, and this is something that I think a lot of people take for granted. A community needs leaders.


    Mary: Right.


    Mike: With my social circle, there are certain people in the group that kind of take the lead on setting things up. And if they don't do it, then no one hangs out.


    Mary: Right.


    Mike: And it's not that we don't like each other, we do. It's just that someone has to take the lead in a community, whether formally or not.


    Mary: Right, right. You need someone to take that initiative to bring things together.


    Mike: Yeah. So on that note, one of the most interesting bits about community... I mean, while the management is difficult, and it is a very interesting problem: how do you build a community? How do you start from scratch?


    Mary: That's a question that I hear often, and the thing that I usually tell people is: don't start from scratch, because the people that you're looking for, the community that you're looking for, is already out there somewhere. So you might feel like you're starting from scratch, because you're having to go find those people, but don't just create a "community platform" on your website and expect people to show up, because they won't. This is not a "build it and they will come" situation in any way whatsoever. But if you think about it, you can really figure out: who is your core audience? Who are you looking for to contribute to this community? Whose insights are you going to seek? Who's talking about Chef? Who's talking about DevOps? Which is what I was handling at O'Reilly a lot.


    On these different topics, if you go find those influencers and figure out where they're hanging out, what websites they frequent, which blogs they follow, which Twitter accounts they look to for their information, then you can start gathering this list of influencers, this list of people who are saying interesting things. Keep an eye out for where they are, and then go to those same places. And once you go to that place, just sit back and observe for a while. Don't show up and go, "Hi, my name's Mary, and I have this great information for you, and I want to sell you one of these things," because everyone's just going to go, "Who are you? And why do I care that you're here? And why should I even listen to anything that you're saying, because I have no idea who you are?" But if you observe, and you get to know people, and you become a part of the communities that those people are already in, then you can start saying, "Hey, I heard you talking about this issue the other week, and we actually have a solution for that." Or, "I'd love to get more feedback, because we're talking about possibly working on a feature that does that, and we want to make sure that it fits your needs."


    And so as you're observing, as you're learning about this group and getting their input on things, then you can start to say, "Cool, thank you so much for your input. Thank you so much for your help. Also, we're building this community over here, would you like to be involved? Would you like to help us start this group of people talking about these particular topics?" And by that time, you should have been able to build up some trust. You'll have been able to build up some authenticity. The people there know that you care far more about their needs than you do about just selling them your product.


    Mike: You probably don't want to go to a community that's on the very same niche topic that you're trying to build a community around.


    Mary: Right.


    Mike: Like, if you're building an incident management platform, you want to have that sort of community there, but going to the PagerDuty community is probably a bad place to do it.


    Mary: Right, right.


    Mike: You should be there, but-


    Mary: Absolutely.


    Mike: It's not going to go well if you start trying to convince people to leave that community.


    Mary: Right, you don't want to go in and poach people.


    Mike: Right.


    Mary: And this was an interesting thing we ran into at Chef, where we actually actively encouraged people not to start a Chef meetup. We encouraged them to start a DevOps meetup.


    Mike: Okay, why did you do that?


    Mary: And the difference was... A couple of reasons. One, starting meetups is hard. Getting people to attend on a regular basis, and getting people to speak on a monthly basis, is difficult. Getting people to sponsor is difficult too. And so, by opening it up to, you know, "don't just talk about Chef, talk about DevOps," it widened the group of people we were able to reach through our sponsorship, because we would sponsor the meetups. We would pay for the meetup fees, which were a thing at the time. We would provide food. We would provide speakers, if we had speakers in the area. We'd help out a lot. But by making it a DevOps meetup instead of a Chef meetup, it allowed the organizers to have a broader community of people that they brought together. It also allowed them to bring in a variety of speakers, which made their jobs a lot easier.


    It allowed them to seek out other companies who were willing to sponsor, because otherwise, if it's a Chef meetup, it's, "Well, Chef will host. Why would you need someone else to host it at their building? Chef should provide that, right, because you're doing it on their behalf?"


    Mike: Mm-hmm (affirmative).


    Mary: So it made the jobs a lot easier for the meetup organizers, but it also allowed us, as Chef employees and as Chef community members, to engage with a broader community than just the people that we already knew. So it let us get to know other people who might have been using Puppet or Ansible or Salt, or any of those, which, sure, are competitors, but we're all trying to solve the same problems for our community. And so if we could work together, we were able to solve those problems that much better, and learn from each other what the best solutions might be.


    Mike: There's something you said there that caught my attention.


    Mary: Mm-hmm (affirmative).


    Mike: I have often found a failure mode in communities that I've been in, where you can focus a community in two different ways. You can focus it inward: "I am here to serve the people that are part of my group." Or you can focus it outward: "I am attracting new people to the community." And you can do both, but it's pretty hard. And focusing just inward is not a bad thing per se; you could totally do that. Focusing outward is also a good thing. But the choice changes how you approach your community itself.


    Mary: Right.


    Mike: And the kinds of people that are in your community.


    Mary: Right, and like you said, neither of those things is a bad thing in and of itself. But I think without having both directions, you're going to wind up missing something along the way, because if you're only focused outward, then you're going to miss what's going on internally and not be able to communicate properly to your audience. If you're only focused inward, then you're not bringing anything from the company back to the community. There's a mantra that I really like: "To the community, I represent the company. To the company, I represent the community, and I must have both of their interests in mind at all times."


    Mike: That's hard.


    Mary: Right, it's a really, really hard balance, but without being able to strike that balance, I can go talk to the community all I want, but I'm going to be talking at them, not with them, because I'm representing the company. And if I turn around and I'm talking to the company about the community, I can be advocating really, super hard for the community members internally, but I'm not going to be as equipped to turn around, go back to the community, and say, "Okay, here's what the company had to say, here's what we're doing as a result. Here's the direction that we're going in. Let me collect your feedback on those things and go back to the company again."


    And so you have to be able to balance both of those things. And this is a lot of where the business side, the business skills of DevRel, comes into play, because you need to be able to have technical conversations with the community members. You also need to be able to go back inward and talk to your stakeholders: "Here's where the community's at. Here's what they're saying. Here's what they're thinking. Let me translate those things into business speak for you, if you aren't a technical leader. Let me help you understand the context around that," and all of those different things. You need to be able to have all of those conversations, and to communicate in both directions.


    Mike: I think it's interesting that we talk about how building these communities from scratch isn't really building communities; the communities already exist. I'm thinking back to the communities that I've been a part of. One of my best friends, my roommate when I lived in San Francisco, we've known each other for 19 years. We only met in person a few years ago, and this entire time we'd met in an online community that had just been around for fricking ever. And if you talk to people even older than we are about their time on the BBSs of the '80s and '90s, you start to realize that communities have been around for fricking ever. So there's this criticism that I've seen rolling around about developer relations and developer advocacy, that it's this brand new thing, and like, "Well, how are we going to mature this thing?" You were mentioning before we started recording that that's not really true at all. So tell me more about that.


    Mary: Sure. It's interesting. You know, developer relations, DevRel, is a term that's really, really picked up in probably the last two years now, and gotten more popular. Developer advocates have been springing up all over the place, and people are becoming more aware of these terms. And so there are a lot more people asking, "Well, what is developer relations? What is a developer advocate? What does that mean?" And to me, developer relations is the industry term, the umbrella term, for developer advocates, for technical community managers, for technical writers in some cases. For all of these people who are working to build community with the technical audience and to make the developer experience better. But like you said, you know, we've had technical communities for decades now. We've had communities as far as religious communities, as far as local communities, as far as knitting and sewing and-


    Mike: The beginning of time.


    Mary: Exactly. Community building is not new. It is not new in any way whatsoever, and I'd argue even developer relations isn't new. The term is new, we now have an official name for this industry, but I mean, we've had open source community managers since open source was around in the '60s. This is nothing new, and so there are so many people who are sitting here going, "Oh, well, but this is a brand new industry and a brand new thing." And the titles are new, the roles are relatively new, and it's definitely growing in popularity, but when you boil it down-


    Mike: The formalization is new.


    Mary: Yeah, yeah. And I mean, when you boil it down, developer relations is community building for technical audiences. So it's frustrating to have so many people going, "Well, but it's this brand new thing, and we need brand new ideas around how to control it and how to manage it." When in reality, we don't need brand new ideas. We just need to standardize these new titles and, honestly, look back at how community managers in the past have generated success. How they've defined success, what the metrics are that they look for, because we can learn so, so much from the people who have come before us. Instead of just looking ahead and going, "Okay, new industry. We need new information around all of these things," it's, "No, no, actually, let's learn from the history."


    Mike: Right, like this is... I won't say this is a solved problem.


    Mary: No.


    Mike: Because we're never going to really solve the problem of community, but there's a lot of expertise floating around that is just not inside of tech.


    Mary: Right, right.


    Mike: I mean there is a ton of expertise inside of tech, but it seems to me that a lot of the gains to be made are not coming from inside of the technical world.


    Mary: Right, and the biggest thing, the one thing that I will say gives people a little bit of credibility when they say, "It's brand new, and it's super difficult," is that when you throw technology into the mix, people don't know what to do with it. Because community management, "Oh, okay, cool." We've got ambassador programs and influencer programs and all of that, and those always live under marketing. And you throw technical concepts into the mix, and suddenly people are sitting there going, "Oh well, but I'm an engineer, I'm not going to report to marketing. What I do is not marketing." And there are elements of marketing to what they now do, but they don't want marketing metrics. They don't want marketing goals, they want developer relations goals. And so there is, I will say, a lot of nuance around: what department do we report into? Do we go to marketing? Do we go to product? Do we go to engineering? Do we go to customer success?


    Mary: And I don't think that there's necessarily a wrong answer to that, unless you try and tell me that a developer relations team should go into sales. That is a hill I will die on every single day. We should not be in sales; it's one of the very few metrics we absolutely should not have.


    Mike: Yeah.


    Mary: But that being said, there's a lot that we can learn from other people who have been doing this for longer. There's a lot that we can take and read, and then apply, adding a technical element to it. And we don't just need community managers and people who can improve the experience on the website. We need technical people, right? We need developer advocates, because we need people that the community can relate to and who can speak to their direct problems.


    Mike: Let's also get into the value of this community. Like, why? I know why my neighborhood community is valuable to me.


    Mary: Right.


    Mike: I know why having a social circle of friends is valuable to me, why should my company want a community?


    Mary: Right. Well, if you think about it, there are a lot of similarities between why the neighborhood community is valuable, why the knitting community is valuable, why the photography community is valuable, and why it's valuable to the company. Like neighborhoods: if you have a community in your neighborhood, you're less likely to leave. If you have a community of people that you're heavily involved in photography with, you're more likely to keep developing your photography skills. And same with companies: for a lot of companies, one of the main goals behind building a community is to reduce churn. And it's not necessarily a sales metric, it's more, let's keep the people that we currently have on the platform, because the more you can reduce churn, the more you can keep people from leaving, the easier it is to keep growing instead of just replacing the people that you had yesterday.


    So that's a big one for companies: we want to make sure that the customers who are here stay. Part of that comes with creating a better experience for the customers that you have. Part of that is making sure that your communications are solid with the customers you're trying to gain, so they understand the value that they're going to be getting from your product. And the big part that developer relations plays in that is getting that feedback from the community back to the company. And that feedback might not necessarily even be feedback on the product, but just knowledge and awareness of, "Here are the issues that people are facing. Here are the things that I'm hearing from the greater DevOps community. Here's the new hot software that people are using and the problem that it's solving," right? And so communicating those general things back to the company allows the company to go, "Okay, cool, we have a very broad overview of what the greater technical community is doing in DevOps, and we can now use that information to make our product better fit their needs."


    And you'll see that in people beta testing software, or being an early user of a particular product, or things like that. As well as just figuring out, based on the feedback we get from customers: what are the topics that we should be talking about at conferences? What are the relevant podcasts that we can be putting out? All of those types of information that companies can use to build a better product, to reduce churn, but also to be more relevant to the community that they're serving.


    And ideally, you'd be a thought leader in these communities. One example that I always use is HubSpot. Back in the day, when I was doing PR and technical writing and some marketing stuff on the side, HubSpot was one of the super early CMS-type platforms, customer management platforms, but they also did content marketing and things like that. And so you'd have the product that you could buy into, which I have used occasionally, but have never been, like, an official customer of. But they had this fantastic blog that ran all of these super relevant articles about content management and customer retention and social media and all of these various things, 90% of which never actually mentioned the HubSpot product. But that was the go-to place when I needed to know about some new social media thing that popped up. I knew HubSpot was going to have something about it; let me go over there.


    So there are a lot of companies in tech these days that are starting to say, "Okay, we are the incident management company. We are publishing not only interesting information, but relevant information and principles that you should be following in the space." Knowing that if they can become the HubSpot of incident management, or APIs, or whatever it is, choose your topic, then as someone starts to learn more and more about that topic and needs a company or a product to fill a particular need, your company is going to be the first one they go to.


    Mike: Yeah, you're absolutely right. What about the second point: the value of communities to the members?


    Mary: Sure.


    Mike: We've talked a lot about the companies.


    Mary: Yeah, yeah, yeah. So, there's a lot of value there, and I think a lot of times people miss the value that they get out of it as a community member, because they're so focused on, "Well, DevRel is just sales," or, "DevRel is marketing, and I don't want to be a part of communities that are just going to sell to me." Which, note to companies: don't sell to these communities.


    But there's a huge amount of value back to the community members. You've got people that you get to know who are facing the same problems as you are. You've now got a fairly direct line of feedback to the company to say, "Hey, this thing isn't working." Or, "I would love to see this have this additional feature." Or, "It would be so much better for me in my day-to-day work if I could do these other things alongside your product." With a community that's built around a particular product, a community manager or a developer advocate who is invested in that community is also usually excellent at connecting you to other people within it. My friend Amy Hermes, who has led communities for years, calls it "being a technical cruise director." And it's this idea of, you know, we kind of stand at the back of the room, metaphorically, and make sure that people are having a good time, right? Everyone has someone to talk to. Everyone's involved in a conversation they're interested in. And when you see someone new enter the room, you get to know them and understand their needs as well as their interests.


    And then I can go, "Mike, let me introduce you to Corrie, who might also be interested in the platypus," or whatever your relevant interests are, so that you can go and have a good, interesting, insightful conversation with that person and get plugged into the community. And the community manager steps back and goes, "Cool, they're now connected, and I'll follow up with Mike later next week to make sure that he still feels connected, see how his conversation with Corrie was, and see if there's anyone else I can introduce him to." And that conversation continues, right? So the community manager benefits, because again, they're getting feedback from the community member that they can pass on to their company. But the community member benefits, because they're now connected, and getting more connected, within that community, and they have people they can rely on, people they can talk to, people they can relate to who are facing similar issues.


    Mike: We've touched on some of this a bit already, but what's hard to you about community? What are the unsolved, or as-yet-unsolved, challenges?


    Mary: Right, right. So one of the biggest ones that you always hear about when this question is asked is metrics. And it's more than just "What metrics do I track?", right? 'Cause there are hundreds of metrics that we could track if we really wanted to, and we can automate a lot of them these days. But more important is: which metrics do I track that will show I've been successful? And how do I know what success means? So it's not just, "Well, I gained 100 more followers on Twitter today." It's, "Well, that's great, but how many of them are bots? How many of them are people who actually care about what you're doing? How many of them are actually engaging with you? And is your team actually responsible for social media, or not?" So it's more than just metrics and whether or not you've been successful; it's defining what that success means internally to your business. And that's a problem with developer relations. That's an issue with community building. It's one of the few unsolved things that we're all still kind of trying to figure out: how do we do this?
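
    As a purely illustrative aside on the "we can automate a lot of them" point, here is a minimal Python sketch of pulling a few raw engagement counters from the public GitHub REST API. The repository name is hypothetical, and, as Mary argues, raw counts only become success metrics once you've decided what success means:

        # Minimal sketch: fetch raw community metrics for one repository via
        # the public GitHub REST API. Counts alone say nothing about success.
        import requests

        def fetch_repo_metrics(owner: str, repo: str) -> dict:
            """Return a handful of engagement counters for owner/repo."""
            resp = requests.get(
                f"https://api.github.com/repos/{owner}/{repo}",
                headers={"Accept": "application/vnd.github+json"},
                timeout=10,
            )
            resp.raise_for_status()
            data = resp.json()
            return {
                "stars": data["stargazers_count"],
                "forks": data["forks_count"],
                "open_issues": data["open_issues_count"],
                "watchers": data["subscribers_count"],
            }

        if __name__ == "__main__":
            # Hypothetical example repository.
            print(fetch_repo_metrics("chef", "chef"))

    Which of those counters, if any, map to "success" is exactly the business-goal question discussed next.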


    Mike: Do you have any suggestions there, like what might make a good definition of success for this?


    Mary: Absolutely. So, I can't give you a good definition of success that will fit for every business, 'cause it depends on the business goals.


    Mike: Sure.


    Mary: What I will say is that making sure your goals for your team always point back to the overarching company goals is key. Because if a stakeholder, a VP or a C-suite individual, comes to me and says, "Hey, what did your team do this quarter?", they aren't going to care about, "Well, you know, we published 12 blog posts, and we spoke at five conferences, and we attended 20 meetups, and we met this many people." "Okay, fine, but tell me what that means. What does that relate to? How do I know that that's success in your book, rather than just your daily to-do list?" And it's the difference between giving them that list of work output and saying, "We got feedback from community members. We were able to help with brand awareness metrics. We were able to help drive the product forward by communicating this feedback to the product team and the engineering team."


    And so, figuring out, maybe your company goal for that quarter is getting more customers on board. And like I said earlier, DevRel should never have a sales metric, but you can increase broader developer awareness of your product by speaking, by talking to folks on Twitter, by engaging in conversations on Reddit, or Hacker News, or Stack Overflow, or wherever you tend to be, whatever platform you choose to be on. You can help out in those various places. And if your goals always match up to the overarching company goals, then there's far less of a risk of your team being dissolved, or of someone saying, "I don't understand the value that you're providing. You shouldn't be here."


    Mike: Yeah, I remember when O'Reilly stopped their community efforts.


    Mary: Yeah.


    Mike: I was like, man, that's a real shame, because I was able to start writing my book because of one of those community people. So it seems to me that any definition of success that you have cannot be short-term; it has to be long-term.


    Mary: Right, yeah. And that's one of the biggest, hardest issues with developer relations right now, particularly when it comes to startups, because as we all know, startups are usually funded by VCs, and VCs want to see their return on investment now.


    Mike: Right.


    Mary: They don't care about the ROI a year from now, they care about, "Am I going to see my money? Am I going to get a good return on my money, now?" And so figuring out ways that you can show not only the long-term value but also the short-term value is key. Focusing on "Great, we're increasing developer awareness, and those people might later become customers of ours" is fantastic, but make sure that you keep track of all those points in between. You know, "Hey, I met this person at a conference and had a great conversation with them. They passed the conversation off to their manager, who passed it off to their manager, who moved to a completely different company, which is now a customer of ours because of that one conversation at a conference." It's convoluted, but you have to be able to draw those lines for the longer-term things, and also show returns on the short-term things, whether they're quick wins or demonstrating the value of the connections that you've made. I've started using... this could be a whole other podcast topic, but I've started using the term DevRel qualified leads.


    Mike: Okay.


    Mary: Because it's familiar to people in the business realm, because they're used to marketing qualified leads. And the easiest way to explain that term is: if you've ever attended a conference and gotten your badge scanned, you are officially a marketing qualified lead at a particular company. It's this idea of, they met you at a particular location, or you signed up for a white paper on their website, or you signed up for a trial offer on their website, and you are now in their system. And they track your progress, they track which websites you visit, they track which links you click on, all of these things to see where you land on the sales spectrum, whether they should pass you to sales yet or not. And I'm using DevRel qualified leads now to illustrate, you know, you just said you were able to start writing your book with O'Reilly because of the community team. So that is a success metric for the community team, right? They were introducing authors, introducing potential authors.


    When I worked there, it was a lot of talking to people who might be good speakers at our conferences, as well as finding people who might be potential authors, and finding people who might be interested in talking about what we were doing at various other events around the world. But these DevRel qualified leads could be... I mean, I might meet a developer when I'm out at a conference, and they might be a perfect fit for an open hire that we have, an open job req. And I can pass them off to recruiting, and I'm not responsible for whether or not they get the job, because I'm not the hiring manager. I'm not sitting in on all their interview conversations, but I can make that connection. I can pass off that lead to recruiting, and then recruiting is responsible for it from then on. Just like when marketing passes a lead to sales, it's sales' responsibility whether or not that lead actually becomes a customer.


    Mike: There's something else interesting that you did.


    Mary: Mm-hmm (affirmative).


    Mike: You went through Seth Godin's altMBA program.


    Mary: Yes, yes I did.


    Mike: I think this is interesting, because we were just talking about definitions of success, which are really business-oriented. And now here you have this altMBA certificate. What led to that?


    Mary: So I started my business in November of 2017, and had a lot of imposter syndrome as a lot of people do.


    Mike: Hey sure.


    Mary: And my biggest form of imposter syndrome was: suddenly I'm going to have to be writing business proposals and selling myself up the chain of command at various companies, talking to VPs and C-suites and things like that. Not that I had never had conversations with them before, 'cause I had, but this idea of making myself come across as professional and businesslike on paper was something that I struggled with.


    Mike: Yeah.


    Mary: You know, put me to-


    Mike: Someone to be taken seriously.


    Mary: Exactly, exactly. Especially with people who didn't know me. Like, put me in a room with folks and we can talk for 20 minutes, and more often than not people will go, "Oh, okay, you know what you're talking about; let's continue and see how you can help the company." But just pitching myself on paper was difficult. And so that was one of my primary reasons for getting involved in altMBA. For those of you who aren't familiar, it's this accelerated online MBA program, and they call it altMBA because it's an alternative MBA. It's not accredited, very intentionally, because what they aim to do is teach the tangible, tactical skills that you need to level up yourself and your career, without all of the sitting down in a classroom for two years and paying 20 grand to get an official MBA that looks great on paper but that you haven't really applied in your day-to-day life.


    Mike: Yeah, as it turns out an MBA is financially just one of the worst decisions you could possibly make.


    Mary: It really is, it really is, but the knowledge that I got from this altMBA program was huge.


    Mike: Right.


    Mary: And the confidence boost. And the biggest thing was just that they incorporate a lot of peer learning. And so you're learning from other people who are in completely different industries, right? In my cohort one week, there was another girl who owns her own business and does professional business coaching. There was someone who had just been promoted to CEO of a chain of auto repair shops. There was someone else who is a project manager for a non-profit. Completely different skill sets, completely different roles across the board, but just learning from each other, and learning from the different perspectives, was huge.


    And the other reason why I was really invested in this was that one of the patterns I've noticed in my 10-plus years of doing developer relations and community building is that so many of us, especially in developer relations, don't come from a business background. People are coming from a technical or developer background, or a product background, which has some more business-related skills. But we're coming into this with varied ideas of what it means to define success and how we do that. And it's difficult to sit down with a stakeholder in the company and say, "Here's how what my team did was a success for the company, and here's how we're driving the company goals forward," and things like that, if you've never had to do that. So we've got, especially in this past year, developers who are coming in brand new to these developer advocacy roles and are being asked, "Well, how do you define success?" And I'm generalizing here, so forgive me, I'm trying not to make a stereotype, but more often than not I'm seeing folks sit down and go, "Well, we finished our jobs. We checked all the boxes for our OKRs for this quarter. We were given these work outputs that we had to finish, these tasks that we had to complete by the end of the quarter, and those are done. So it was a successful quarter."


    And on an engineering team, that's a perfectly reasonable answer to give, and it's a true answer as far as whether you were successful or not this quarter. But when you come onto a team like developer relations, it's not just about the boxes you checked. It's also: how did you influence the company? How did you drive the company goals forward? How did your work directly impact the company's bottom line? And that's a difficult conversation to have if you've never been put in that situation.


    Mike: Oh yeah.


    Mary: And so part of my secondary goal for altMBA was figuring out: how do I pinpoint almost the TL;DR of an MBA, so that I can help other developer relations professionals have those conversations? And that's some of what I'm doing for clients as well, you know, one-on-one coaching with team members, one-on-one coaching with teams, and helping them figure out, "Okay, we did these things. How does that translate back to the company goals? How does that translate back to the team goals? How do I communicate that upward through the company, and in some cases to the board, to make sure that our team still exists next quarter?"


    Mike: That sounds like a fantastic idea.


    Mary: Mm-hmm (affirmative). It's difficult, but it's rewarding.


    Mike: Yes. Yeah, I'm sure. Well, this has been a wonderful conversation. Where can people find out more about you and your work?


    Mary: Sure. So I'm fairly active on Twitter; you can find me @Mary_Grace. My personal website is marygrace.community. You can also find out more about my business and the things that I partner with companies on at persea-consulting.com. And Persea is spelled P-E-R-S-E-A.


    Mike: And all those will also be in the show notes.


    Mary: Perfect.


    Mike: Well, thank you so much for coming on, this has been a pleasure.


    Mary: Absolutely. Thanks so much for having me, it's always great to chat.


    Mike: Thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.

  • About Kelly Shortridge: Kelly Shortridge is currently VP of Product Strategy at Capsule8. Kelly is known for research into the applications of behavioral economics to information security, which Kelly has presented at conferences internationally, including Black Hat, AusCERT, Hacktivity, Troopers, and ZeroNights. Most recently, Kelly was the Product Manager for Analytics at SecurityScorecard. Previously, Kelly was the Product Manager for cross-platform detection capabilities at BAE Systems Applied Intelligence, as well as co-founder and COO of IperLane, which was acquired. Prior to IperLane, Kelly was an investment banking analyst at Teneo Capital, covering the data security and analytics sectors. Kelly graduated from Vassar College with a B.A. in Economics and was awarded the Leo M. Prince Prize for Academic Achievement. In Kelly's spare time, she enjoys world-building, weight lifting, reading sci-fi novels, and playing open-world RPGs.

    Links Referenced: InfraGard | The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win | RSA Conference | Chaos Monkey

    Transcript

    Mike Julian: This is the Real World DevOps Podcast and I'm your host Mike Julian. I'm setting out to meet the world's most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences. From the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


    This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and as a hosted SaaS solution. You can check it all out at InfluxData.com.


    My thanks to InfluxData for helping make this podcast possible.


    Hi folks, I'm Mike Julian, your host for the Real World DevOps Podcast. My guest this week is Kelly Shortridge, the VP of Product at Capsule8 and an internationally known speaker on InfoSec topics.


    So Kelly, welcome to the show.


    Kelly Shortridge: Thank you so much for having me Mike.


    Mike Julian: You know, I was looking at your LinkedIn, and there was something that kind of stood out to me: your FINRA license. You started your career off in finance.


    Kelly Shortridge: That's true.


    Mike Julian: So what in the world happened there? How does that work?


    Kelly Shortridge: Yeah. How does that even happen? So, one, FINRA exams are very painful, so I didn't want to have to re-up those. But mostly, I started my career doing mergers and acquisitions covering information security companies. And while I quite liked the finance side, I noticed that security had a ton of opportunity, not just as far as vendors: the problem space is huge, it's very unsolved, and the incentive problems are enormous. So for someone like myself who had studied some behavioral economics, all of the messy incentive problems were kind of like catnip. I knew I had to go into the industry.


    Mike Julian: Yeah. The behavioral economics side, the study of incentives. That's absolutely fascinating to me because, especially in security and systems, everything we do is incentives-based.


    Kelly Shortridge: Yes.


    Mike Julian: And it's often incentives that we're not even paying attention to. Things we don't even think about. Like if you-


    Kelly Shortridge: That's definitely true.


    Mike Julian: Yeah. Like you make it hard to use two-factor and people aren't going to use two-factor.


    Kelly Shortridge: Yup. That's something that I feel like people outside of security understand immediately, but inside security they don't always understand: the fact that if you don't make something that integrates into workflows, people are going to bypass it. But you're absolutely right, there's such a web of incentives. On the one hand, you have things that are explicitly stated, you know, "security and your privacy are important," but then you have the more tacit goals and priorities, which are that, well, security's a cost center, and really what matters is being able to deliver on time and, you know, release at a certain cadence.


    So those tacit assumptions also create a bunch of incentive problems in InfoSec. But I always think that InfoSec teams, because they wish they were more relevant, try to be the culture of "no" and ram through really annoying technologies for people to use, just to show that they're still relevant.


    Mike Julian: Yeah, that hits home. I've seen that way too many times.


    Kelly Shortridge: Yeah. I think for most developers... it's not really a love-hate relationship, it's mostly just a hate relationship for the most part. Somewhat bi-directional.


    Mike Julian: What about security really drew you away from finance? Have you found there are good parallels?


    Kelly Shortridge: It's interesting. There are certainly parallels, particularly on the risk management side, particularly anything around risk centrality, because that's a huge part of financial systems, though that's not really what I did day to day. I think what's interesting in security is there's a huge lack of effective communication. Even when you go to conferences, you know, there will be some 0-day that's dropped or whatever, but it's often not communicated very well. And certainly when you look at enterprises, security priorities aren't really communicated well to the rest of the business. And a lot of investment banking is, quite frankly, effective communication.


    It's about quickly researching something, generally companies, and understanding it very deeply, to be able to talk about it and effectively persuade acquirers that a company is worth acquiring. So frankly, even the Excel and PowerPoint skills I learned along the way are really helpful in just being able to talk to people about security. You know, I can speak to CEOs, I can speak to board members, I can speak to DevOps people about security in a way that's still understandable, and that's what I feel like we're still missing a lot of in information security.


    Mike Julian: Yeah, that makes a ton of sense. I was reading one of your articles, I can't remember which one it is. I'll have to go find it and throw it into the show notes, but you made mention of attacks from nation states. And I thought that was pretty interesting. It definitely stood out to me. And perhaps I have a bit more exposure to that side of security than the average DevOps person might, just because I worked for the government for a while, so I've seen it. But for the vast majority of people who haven't, what is a nation state? Like, what are we talking about in a security context?


    Kelly Shortridge: A nation state, generally, in the context of an attacker, is a government-sponsored entity. In some cases it gets a bit blurry, particularly with China or Russia, where you have criminal groups that either lightly or strongly have the backing of the government, or at least the government looks the other way because it still benefits, you know, the government's goals. But in general it means a nation state attacker, and you'll also hear the term APT, which is an Advanced Persistent Threat. Part of the reason why they can be advanced and persistent is because they're well resourced and they're also well motivated. They have very stated goals, and they have actual budgets that can go towards pwning things.


    Mike Julian: This is actually a thing, like this is actually happening?


    Kelly Shortridge: It is actually happening, though the extent to which it's happening, particularly for the average organization, I think is a bit more dubious. I think by far and away, the script kiddie threat or the criminal group threat is far bigger.


And that's why, for example, I transitioned to looking at security through behavioral economics. There's a concept called prospect theory, and part of it is basically that people overweight small probabilities and underweight large probabilities. So, for example, you overweight the probability that a shark's going to eat you, and you underweight the probability that you'll be hit by a car or succumb to some other sort of car accident.


The same applies in security: people vastly underweight the probability that they'll succumb to phishing or some other somewhat stupid scripted attack, and they overweight the elaborate scenarios. Mossad is a classic example: Mossad's going to find out who your secretary is, install some special sort of pen that transmits an exploit over to his or her machine, and then that machine's going to exfiltrate data by fluctuating the power supply, and someone's hacked into the power plant to read that... All of that stuff is more fan fiction than anything else, except for national laboratories or governments. Maybe Fortune 10s.


So in general, people definitely like how sexy nation states are as a kind of attacker, because no one wants to be owned by a 12-year-old, right? That just feels bad.
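
For listeners who want to see that bias in numbers, here's a minimal sketch of the probability weighting function from prospect theory, assuming the standard Tversky and Kahneman form; the 0.61 parameter is their commonly cited estimate, not a figure from this conversation.

```python
# Prospect theory probability weighting (Tversky & Kahneman, 1992).
# People act on w(p) rather than the true probability p: small
# probabilities get overweighted, large ones underweighted.

def weight(p: float, gamma: float = 0.61) -> float:
    """w(p) = p^g / (p^g + (1 - p)^g)^(1/g)"""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

for p in (0.0001, 0.01, 0.5, 0.9, 0.99):
    print(f"true p = {p:<7} perceived w(p) = {weight(p):.4f}")

# w(0.0001) comes out around 0.0036 (the Mossad pen, overweighted ~36x),
# while w(0.9) comes out around 0.71 (the phishing email, underweighted).
```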


    Mike Julian: Right. That's just embarrassing.


    Kelly Shortridge: Exactly. Exactly. So I think that's why there's so much focus on nation states rather than kind of the quotidian threats.


    Mike Julian: I got to see James Mickens speak a while back. And one of his slides was... It basically just said, the threat model is Mossad or not Mossad.


    Kelly Shortridge: Absolutely. Yes.


    Mike Julian: If it's Mossad you're screwed anyway. It's like, don't bother, give up, you're done. If it's not Mossad well now we can have a conversation about what you can do.


Kelly Shortridge: Precisely. And that goes into really good threat modeling, in the sense that even with, I think it was APT 28, which most people know as the group that hacked the DNC: they tried phishing first. I think it was something like google-admin, but the "google" had two zeros, and maybe people should have spotted it. I tend not to blame users for those sorts of errors. But the point is that even these super sophisticated groups will absolutely try phishing, and absolutely will try these unsophisticated methods first, because if they don't have to blow something expensive like a zero-day vulnerability, which takes tons of time to research, perfect, and make reliable, they won't. They'll absolutely try low-hanging fruit.


    I think it's similar with a lot of developers, if they don't have to like reinvent the wheel and create something fancy, they won't. People tend to optimize for what's quick and what works.


Mike Julian: Right. Yeah. When I used to work for Oak Ridge National Lab, one of the first things that we were taught was: you don't plug in USB drives from outside of here, and be very careful about what links you click in an email. And the FBI came around through the InfraGard program and basically gave us this briefing that system administrators are at the core of who nation states are targeting, because they have tons of access and no one pays attention to them.


    Kelly Shortridge: Definitely.


    Mike Julian: That was kind of scary when I first heard it and then I realized, well actually most... what they're going to be doing is just coming in at the lowest level possible. Like here, plug in this USB drive.


    Kelly Shortridge: Exactly.


    Mike Julian: They're not going to be like, a beautiful woman in a bar trying to find me in this long 18 month process. That's just not going to happen.


    Kelly Shortridge: Yeah. If the USB stick in the parking lot works, you might as well try it and then after that you know it's some discounted thing on NewEgg in an e-mail, right? And then it escalates from there.


    Mike Julian: Yeah. Exactly. They're not starting at the top.


    Kelly Shortridge: Exactly. I think it completely defies human nature to think that they would start with the most expensive option rather than exhausting the rest of the options.


Mike Julian: The threat modeling here is really interesting to me, because if the attacker is going to be starting with the cheapest, perhaps most effective, option, that means how I'm thinking about my defense is also going to be very different. I'm not protecting against these really fantastical situations. I'm protecting against phishing links.


    Kelly Shortridge: Yes.


Mike Julian: When I'm trying to design some sort of security posture, and say I'm a DevOps engineer, I don't have any dedicated security staff, I don't have any specialists around me, what can I do?


Kelly Shortridge: One thing that I definitely recommend, and it's actually lucky because it's somewhat easy, is to just go through the 101 guides on how to hack web applications. Because whatever the 101 guide says is probably the attacker's minimum viable threat model, right?


It's the same thing with corporate security. Going through and trying to crack passwords, using some sort of dictionary, is probably step one. Obviously something like two-factor mitigates that. So what I've proposed before is the concept of decision trees, which I assume a lot of the audience will be pretty familiar with.


But if you're not: it's basically the idea that you start with the state of the world. You have some sort of attacker, and you have, let's say, an application that contains sensitive information. Obviously the attacker isn't going to care about the non-sensitive information. They'll probably go for, let's say, the credit card data.


Then you figure out: okay, what's the easiest way for the attacker to get there? I have this notion called YOLOsec, which is basically if you do nothing and just hope that security will happen out of the ether. That would be, for example, if you're storing the credentials in the database with no network segmentation, no data tokenization, and certainly no access control on it. That would be the YOLOsec option.


And so then you start to think about: okay, if we did absolutely nothing, what would be the easiest thing for the attacker to do? You can start eliminating those paths, forcing the attacker down the hardest paths possible, which eliminates a lot of the very common script kiddie threats. Then you move on to eliminating the common criminal group threat. And finally, once you get to the Mossad level, just don't worry about Mossad; they're going to find a way regardless. As long as you keep forcing attackers down that harder path, you're frankly going to eliminate yourself as a target.
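
To make the decision tree idea concrete, here's a rough sketch; the attack paths and attacker costs below are entirely hypothetical, picked only to show how each mitigation raises the cheapest remaining option and pushes opportunistic attackers toward giving up.

```python
# Hypothetical attack decision tree: each branch carries a rough
# "cost" to the attacker. A rational attacker takes the cheapest
# path that is still open.
attack_paths = {
    "plaintext creds, no access control": 1,    # the YOLOsec baseline
    "dictionary attack on passwords":     2,
    "phishing for credentials":           3,
    "exploit unpatched public CVE":       5,
    "buy access from a criminal group":   20,
    "burn a zero-day (Mossad tier)":      1000,
}

# Mitigations close off the cheap branches (tokenization plus access
# control, then two-factor auth).
mitigated = {
    "plaintext creds, no access control",
    "dictionary attack on passwords",
}

open_paths = {p: c for p, c in attack_paths.items() if p not in mitigated}
path, cost = min(open_paths.items(), key=lambda kv: kv[1])
print(f"attacker's cheapest remaining path: {path} (cost {cost})")
# Each mitigation raises that minimum cost; past some threshold, the
# script kiddies and most criminal groups move on to an easier target.
```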


    Mike Julian: That's fantastic advice and for whatever reason, and I'm kind of ashamed to admit this now, I never considered looking at the 101 attack guides to figure out how to set up my defense.


Kelly Shortridge: Yeah. I think if you want to role play a bit: imagine you're the stereotypical teenager in Eastern Europe. What would you do first? I assume a lot of your listeners are at technology companies, and you know of the technology company through an article in TechCrunch. You know they have sensitive customer communications, something like that. Okay, now you think: how am I going to hack them? You're probably going to go to the really stupid guides at first. So just look at those, and eliminate all of those stupid ways, right?


    Mike Julian: Yeah.


    Kelly Shortridge: Yeah.


    Mike Julian: How do you consider... you're absolutely right.


    Kelly Shortridge: Yeah.


    Mike Julian: So you and I were talking before we started recording, about... DevOps people and security people.


    Kelly Shortridge: Yes.


    Mike Julian: And you have opinions on this. Can you tell me more?


    Kelly Shortridge: I do and I'm a bit of a traitor in that I definitely empathize more with the DevOps side than the security side of things. But-


    Mike Julian: Why is that?


Kelly Shortridge: It's that way because I really dislike the notion I see a lot in security, which is, again, that culture of "no." There's this moral, almost missionary perspective that there's an abstract, perfect security archetype every company has to meet, and anyone who violates security is violating divine blessing or something like that. It's just overly serious to a certain extent, and there's this lack of self-awareness that security, most of the time, slows companies down. If security isn't working on behalf of the business, making sure the business can survive without choking out its workflows, then what is it doing? If it's hindering the business, you might as well not have deployed security at all, because frankly, the consequences of most breaches are reasonably minimal, particularly with the rise of cyber insurance, which means you'll get reimbursed for incidents.


So to me, DevOps at least understands: okay, we are supporting money-making activities, but we also face cost constraints. Security doesn't quite have that same level of self-awareness, and they certainly aren't making money for the business. So again, I empathize more with the people who seem to be supporting the business. And in my experience, when I talk to DevOps people about security, they're way more receptive than when I try to talk to security people about DevOps and what they can learn from it.


So that's part of why I side more with DevOps. But my thesis right now, which I've been harping on for at least a year, is that DevOps and security should actually be BFFs. They're frenemies, but they shouldn't be. There are obviously a bunch of cultural challenges, and a ton of scar tissue, but ultimately, with the rise of something like resilience engineering, you can extend the concept of "assume that things are going to fail" into the security context as well: there will be a breach. So really there's a lot of common ground that I think both teams, so to speak, don't realize exists.


    Mike Julian: Tell me more about those cultural challenges you mentioned.


Kelly Shortridge: There are a ton of cultural challenges. For one, security people tend to have the break-it mindset rather than the build-it mindset. And they tend to think that people who don't consider security first are stupid. So if you aren't beginning your design phase by architecting perfect security, a lot of times they'll just think that developers are fundamentally less intelligent. That's something I've legitimately heard. And I think that's stupid in itself, right? It's just-


    Mike Julian: Yeah that's awful.


Kelly Shortridge: Yeah. There are different priorities, obviously. On the DevOps side, there's also this notion of, what is it... fail fast, or fail faster, maybe I'm just quoting Silicon Valley at this point, of building things without necessarily any regard for security. That also isn't great, because security is still part of managing business risk. So fundamentally those mindsets are different, and frankly, the best security people I know are the ones who have developer experience.


Kelly Shortridge: And even on the DevOps side, the ones who tend to consider security tend to do better in organizations. I love the stat from the State of DevOps Report by Dr. Forsgren, which found that companies that resolve security incidents more quickly, and that bring security in sooner during the build phase, actually reduce their time to recovery. It actually benefits the business, and it benefits velocity, when you're considering security. It's just: are you doing it in the right way?


    Mike Julian: All right. Have you ever read The Phoenix Project?


    Kelly Shortridge: I have not.


    Mike Julian: Oh. It's a really great book by Gene Kim.


    Kelly Shortridge: Okay.


Mike Julian: And it opens by talking about this archetype of a security professional, this person, I think they named him John. It's a parable, a really great read, really enjoyable. But this character John does a whole bunch of stuff and just isn't telling anyone that he's doing it. And then the entire application completely crashes.


    Kelly Shortridge: I believe it.


Mike Julian: And then he starts blaming everyone else: oh well, none of you care about security, I'm the only one protecting this company, all the rest of you only care about making money, and we've got to save this.


    Kelly Shortridge: Exactly.


Mike Julian: And as it progresses, this security person changes their mindset over time, with the input and experience of talking to other people, and realizes: no, there are actually layers of security, and I may not see them all.


    Kelly Shortridge: Exactly.


    Mike Julian: So the example given in the book was the security person wanted to encrypt a field with CVVs or something like that, and it later came out that was completely unnecessary because finance had paper controls to handle it all.


    Kelly Shortridge: Okay.


Mike Julian: So it was a completely moot point, but in making this completely unnecessary change he actually broke the entire business, because he was only seeing things from his perspective.


    Kelly Shortridge: Yes. Yeah. I feel like half of my talks around conferences are about seeing things from other people's perspectives and teaching security people how to do that. So I'm not sure if the parable has been fully digested yet in InfoSec, but I think it's a good one.


    Mike Julian: Like hearing everything you're saying, I'm like, surely we've solved this by now. But apparently not.


Kelly Shortridge: No. One meme that I really hate, and if you hear security people telling you this, don't believe them, is that the pace of attacks and attackers' shifting methods is evolving so constantly that we could never keep up. That's just not true. For the most part, if you look at underlying techniques, they don't change that much. Even phishing was happening in the 90s. So fundamentally-


    Mike Julian: That's interesting-


Kelly Shortridge: Yeah, fundamentally things don't change all that much. Though I do think, on the positive side, some of the new technology around infrastructure is actually changing things for the better. But otherwise, if someone says they can't keep up with the pace of change, it means they don't have good underlying basics in place, and so they are just constantly reactive. And that's a huge problem in security: very few people are proactive and thinking in the more architectural, DevOps kind of view, rather than just: oh, there's a fire, must put it out. Okay, next fire, et cetera.


    Mike Julian: Right. So I want to get into that but I have one other question before we talk about that area. We've been talking about this terrible archetype of security people. Some of the people listening have security staff that they may not have the best of relationships with. What can they do to start to bridge that gap? You mentioned that DevOps and security should really be BFFs. How do we get there if we're not already?


Kelly Shortridge: One reason why I like resilience is all the commonalities it has with security. So I think even starting with a conversation like: okay, listen, we want to make sure our apps have really good uptime and aren't disrupted, because we have to be performant. What are some of the security benefits there? Is there a way we can collaborate so that, for example, part of that uptime work reduces the threat of denial of service? That's a goal both sides have in common. Looking at what you're working on, and where it could somehow apply to security, is a good first step.


There's also a bit of ego stroking and nudging, in a certain way: acknowledging, listen, we think security's really important, we're looking to implement X technology, and leading into the next discussion: what are some of the security benefits here? With service meshes, for example: is there a way this orchestration can actually reduce work for you? We know that you're super busy and you're putting out a ton of fires. Is there some way we can help automate this for you, because we're going to automate some of it for ourselves anyway?


    So I think those are the sort of olive branches I would recommend. It's kind of like you don't want to tell them they're unimportant.


    Mike Julian: Sure.


Kelly Shortridge: That's... right? That's their biggest fear, in a certain way. But I think sometimes they just fundamentally don't realize... they see new technology as something almost scary: it's another fire they have to put out, another threat model they have to create. So figuring out: okay, we're frankly going to do this regardless, what are the ways we can reduce work for them? For the most part, security people will be really receptive to that.


Mike Julian: I have found that telling a security team that thinks like that, "Hey, by the way, we're going to turn over all of the servers every couple of days, or every couple of minutes. We're just going to rotate the entire infrastructure, and also, by the way, we're going to consistently break our own infrastructure intentionally," you just see their heads explode. Like, "You can't do that!"


Kelly Shortridge: Yes. So here's my counter, because this is something I've been talking about constantly and will keep talking about. For something like what you just mentioned: remember back to the nation state and the APT. The P in that is persistence, and it turns out it's really hard to persist on something that's constantly rotating, right? So there are actually security benefits. You can tell them, and you can even drop my name, not that I'm super important: "Listen, I heard that it's really hard for an attacker to persist if our infrastructure is constantly rotating."


Another thing I've constantly mentioned is that Chaos Monkey is actually a really good security tool, not just a resilience tool, for that same reason: it reduces persistence. And to bring up service mesh again: I've personally only scratched the surface, but I promise you most security people know nothing about it. And I guarantee they don't know that it means they no longer have to manage individual blinky boxes, that they can actually deploy firewall rules and access control and things like that in a much friendlier manner.


So, and this is where I put the onus more on security to understand the technology than on DevOps to understand the threat models, it comes back to creating a really basic threat model. Go through those 101 things, and every time you're looking at new technology like you mentioned, think: okay, how would this stop the script kiddie? Where would this make things difficult for them? Presenting that to the security team is a really effective way to remind them: "Listen, this isn't scary. This is something that can actually help you."
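
As a concrete illustration of the rotation-versus-persistence point, here's a rough sketch of a rotation job written with boto3; the Auto Scaling group name is hypothetical, and in practice you'd more likely lean on an existing mechanism (Chaos Monkey itself, or an ASG maximum instance lifetime) than hand-roll this.

```python
# Sketch: terminate instances older than MAX_AGE, so any attacker
# foothold on a single box dies with the box. The Auto Scaling group
# launches a fresh replacement to keep capacity steady.
from datetime import datetime, timedelta, timezone

import boto3

MAX_AGE = timedelta(days=2)
ASG_NAME = "web-fleet"  # hypothetical Auto Scaling group name

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

groups = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"]
instance_ids = [i["InstanceId"] for g in groups for i in g["Instances"]]

for reservation in ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]:
    for inst in reservation["Instances"]:
        age = datetime.now(timezone.utc) - inst["LaunchTime"]
        if age > MAX_AGE:
            autoscaling.terminate_instance_in_auto_scaling_group(
                InstanceId=inst["InstanceId"],
                ShouldDecrementDesiredCapacity=False,  # ASG replaces it
            )
```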


    Mike Julian: Yeah. I mean it's hard to not be scared when you go into security conferences.


    Kelly Shortridge: Yes.


    Mike Julian: Until recently I lived in San Francisco, so... RSA Conference is there every year. And walking the floor of RSA it's hard to not think the world is burning and everything is awful.


Kelly Shortridge: It's so true. There's so much fear, uncertainty, and doubt in all of the marketing messaging. I personally hate it. I think it's totally unnecessary, but the security industry really tries to be inaccessible and sound scary, which totally hurts it, and then they complain about the fact that there's a security skills shortage. Well, you're presenting the industry like a nightmare; no wonder people don't want to join you.


    Mike Julian: Right.


Kelly Shortridge: So I definitely quibble with that. But remember what security fundamentally is about in this digital age: businesses inevitably have to operate in the digital world, and security is just making sure they can survive there. That's fundamentally what security is. There are digital risks, and how can we make sure the business can still survive, and ideally thrive, given those digital risks? Remove the nation state stuff and the 0-days and the FUD and all of that. When you make it that simple, I think it's a lot more accessible, and it starts to make a lot more sense what you should do strategically.


Mike Julian: If I'm looking at security products, like something to help me, should I just categorically ignore all the ones that use FUD in their marketing?


    Kelly Shortridge: I think you may be left with basically no products to be honest. I think-


    Mike Julian: Well that sucks.


Kelly Shortridge: Yeah, it does suck. It's incredibly difficult even for seasoned security professionals to navigate. People with 20 years of experience still talk to me about how difficult they find it just to figure out what companies are actually doing, particularly with the rise of AI and machine learning and everything. Vendors just hand-wave and say, "Oh, it's our crystal ball, don't worry about it." Which is helpful to no one.


So if you're a systems administrator or a DevOps person looking at security tools, the key thing, I would say, is to start with the workflows. Make sure you're not going to be adding undue work, because some products, like SIEMs, things like Splunk that ingest a bunch of data and help you manage alerts, can add 30 hours of work a month just to maintain them, right? Yeah.


    Mike Julian: Wow.


Kelly Shortridge: They're really difficult to implement, and that's often true across the board with security products; they're really difficult to maintain. So even starting with, "Okay, but what's the realistic cost of using your product on an ongoing basis?" will help you a lot, because security shelfware isn't going to help anyone.


The other thing is to look specifically at the vendor's site to see whether they at least acknowledge that that's even a pain point. Companies that are just hyping up the machine learning or the AI, and not talking about optimizing workflows or reducing manual effort, are the ones that in general aren't going to provide as much value. Either their products are going to sit there, or they're going to be so time consuming that you can't actually focus on more strategic projects.


    Mike Julian: It has been absolutely fantastic chatting with you.


    Kelly Shortridge: Thank you so much, yeah.


    Mike Julian: I've learned a ton. This is great.


Kelly Shortridge: Yes, definitely. And anyone listening, feel free to reach out to me, because I'm always looking at how we can, don't tell the security people, make security teams a little more obsolete and integrate security more into the DevOps process itself.


    Mike Julian: Well on that note, where can people find you?


Kelly Shortridge: Yes, so I have a website, swagitda.com. It's S-W-A-G-I-T-D-A, a finance joke for another time. It has both speaking and writing sections, which include blog posts, both long form and shorter, as well as some of the conference presentations I've given. 'Swagitda' is also where you can find me on Twitter if you want to reach out.


    Mike Julian: Well fantastic. Well thank you so much.


    Kelly Shortridge: Thank you so much Mike.


Mike Julian: And to the rest of you, thanks for listening to the Real World DevOps Podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts.


    Mike Julian: I'll see you in the next episode.






  • About Christine YenChristine delights in being a developer in a room full of ops folks. As a cofounder of Honeycomb.io, a tool for engineering teams to understand their production systems, she cares deeply about bridging the gap between devs and ops with technological and cultural improvements. Before Honeycomb, she built out an analytics product at Parse (bought by Facebook) and wrote software at a few now-defunct startups.

Links Referenced: https://www.honeycomb.io, https://www.heavybit.com/library/podcasts/o11ycast/

Transcript

Mike Julian: This is the Real World DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, to the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


    Mike Julian: This episode is sponsored by the lovely folks at Influx Data. If you're listening to this podcast you're probably also interested in better monitoring tools and that's where Influx comes in. Personally I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools, Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to Influx Data for helping to make this podcast possible.


Mike Julian: Hi folks. Welcome to another episode of the Real World DevOps podcast. I'm your host Mike Julian. My guest this week is someone I've been wanting to talk with for quite some time: Christine Yen, CEO and co-founder of Honeycomb, and previously an engineer at Parse. Welcome to the show.


    Christine Yen: Hello. Thanks for having me.


Mike Julian: I want to start this conversation off with what might sound like a really foundational question. What are we all talking about when we talk about observability? What do we mean?


Christine Yen: When I think about observability, and when I talk about observability, I like to frame it as the ability to ask questions of our systems. And the reason we use that word, rather than just saying "monitoring is asking questions about our systems," is that we really feel observability is about being more flexible and ad-hoc about asking those questions. Monitoring brings to mind defining very strict parameters within which to watch your systems, or thresholds, putting your systems in a jail cell and monitoring that behavior. Whereas we're saying: okay, our systems are going to do things, and they're not necessarily bad, but let's be able to understand what's happening and why. Let's observe and look at the data our systems are putting out, and think about how asking more free-form questions might change how you think about your systems and what to do with that data.


    Mike Julian: When you say asking questions what do you mean?


Christine Yen: When I say asking questions of my system, I mean being able to proactively investigate and dig deeper into data, rather than passively sitting back and looking at the answers I curated in the past. To illustrate this, and to compare observability a little more directly with monitoring, especially traditional monitoring: when we curate dashboards, what we're essentially doing is looking at sets of answers to questions we posed when we pulled those dashboards together. So if a dashboard has existed for six months, the graphs I'm looking at to answer a question like "what's going on in my system?" are answers to the questions I had in mind six months ago, when I tried to figure out what information I would need to decide whether my system was healthy. In contrast, an observability tool should let you ask: is my system healthy? What does healthy mean today? What do I care about today? And if I see some sort of anomaly in a graph, or I see something odd, I should be able to continue investigating that thread without losing track of where I am, or, again, relying on answers to past questions.
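
One way to picture the difference: if raw events are retained, today's question is just a new query, not a graph someone had to predict six months ago. A tiny sketch with hypothetical event data:

```python
# Hypothetical raw events, one per request, kept with their context.
events = [
    {"endpoint": "/checkout", "status": 500, "duration_ms": 1800, "app_id": "a42"},
    {"endpoint": "/checkout", "status": 200, "duration_ms": 90,   "app_id": "a17"},
    {"endpoint": "/search",   "status": 200, "duration_ms": 40,   "app_id": "a42"},
]

# A question nobody thought to put on a dashboard six months ago:
# which app is behind today's slow 500s?
suspects = {e["app_id"] for e in events
            if e["status"] == 500 and e["duration_ms"] > 1000}
print(suspects)  # {'a42'} -- answerable only because the raw events survived
```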


    Mike Julian: So does that mean that curating these dashboards to begin with is just the wrong way to go? Like is it just a bad idea?


    Christine Yen: I think dashboards can be useful, but I think that over use of them has led to a lot of really bad habits in our industry.


    Mike Julian: Yeah, tell me more about the bad habits there.


Christine Yen: An analogy I like to use is going to the doctor when you're not feeling well. A doctor looks at you and asks, "What doesn't feel well? Oh, it's your head. What kind of pain are you feeling in your head? Is it acute? Is it just a dull ache? Oh, it's acute. Where in your head?" They're asking progressively more detailed questions based on what they learned at each step along the way. Honestly, this parallels natural human problem solving. In contrast, the bad habits that dashboards lead us to build would be the equivalent of a doctor saying, "Oh, well, based on the charts from the last three times you visited, you broke your ankle and you skinned your knee." Pretend you go to the doctor with a skinned knee: "Oh okay, you broke your ankle last time, did you break your ankle again? No. Okay, did you... How's your knee doing?"


With dashboards, we have built up this belief that the answers to past questions we've asked are going to continue to be relevant today. And there's no guarantee that they are, especially for engineering teams that are staying on top of incidents, responding, and fixing things they've found along the way. You're going to keep running into new problems, new strange interactions between routine components. And you're going to need to ask new questions about what your systems are doing today.


Mike Julian: It seems like we have that same issue with alerting. I've started calling it a reflexive alerting strategy: "Oh God, we just had this problem. We'd better add a new alert so we catch it next time it happens." Well, how many times is that new alert going to fire? Probably never. You're probably never going to see that issue again. And dashboards are the same way. What you're describing, God, I've seen this a hundred billion times: someone curates a dashboard, and then it's, "Okay, now that we have this alert, the first thing is let's go look at the dashboards and see what went wrong." And, "Well, no, the graphs look fine." So, no problem, but clearly the site's down.


Christine Yen: Yeah, there's a term that we've been playing with: dashboard blindness. If it doesn't exist in the dashboard, it clearly hasn't happened, or it just can't exist, because people start to feel like, "Okay, we have so many dashboards, one of them must be showing something wrong if there's something going wrong in our system." But that's not always going to be the case. To expect that means that you have this unholy ability to predict the future of your system, and man, if people could really predict the future, I would do a lot more things than just build dashboards with that.


    Mike Julian: Right. Rather than just shit on dashboards forever what is a good use of a dashboard? Like presumably you have dashboards in your office somewhere?


Christine Yen: Yes. I think dashboards are great jumping-off points. And I actually very much feel like dashboards are a tool, they've just been overused. So I absolutely don't want to shit on dashboards, because they serve a purpose of providing a unified entry point. What are our KPIs? What are the things that matter the most? Great, let's agree on those. Let's go through the exercise of agreeing on those, because as much as we would like to think this is a technology problem that can be solved with tools, a lot of the time these sorts of things require humans and process to determine. So let's decide on KPIs, and let's put them up on a wall, but expect, and spread the understanding, that the wall is only going to tell us when to start paying attention. Dashboards themselves can't be our way of debugging, or our way of interacting with our systems.


    Mike Julian: Right. So in other words that dashboard it's going to tell you that something has gone wrong, but it won't tell you what?


    Christine Yen: Right.


Mike Julian: I think that's fantastic. And that actually mirrors a lot of the current advice around alerting strategy, too: find your SLIs, alert only on an SLI, not on these low-level system metrics.


Christine Yen: Yeah, I love watching this conversation evolve. At Monitorama 2018, something like three talks in a row were all about alert fatigue. And it's so real to see these engineering teams fall into this purely reactive mode of, "Okay, well, if this happened, this is how we will prevent it from happening again." Each postmortem just spins out more alerts and more dashboards. Inevitably your people end up in an unsustainable state with hundreds or thousands of dashboards to comb through. And then their problem isn't "how do I find out what's going on?" It's "how do I figure out which dashboard to look at?" Which, again, is looking at things from the wrong perspective. Dashboards tell you that something has happened; you need a tool that's flexible enough to follow your human problem-solving brain patterns to figure out what's actually wrong.


Mike Julian: Funny you mention Monitorama; there was a talk, I want to say 2016 maybe, I think it was Twitter, where they had this problem of alert overload, just constant alerts. So they decided, "You know what we're going to do? We're just going to delete them all. Done. We'll just start over." I'm like, "That's such a fantastic idea." People think that I'm insane when I recommend it, but hey, Twitter did it, so I'm sure it's fine.


Christine Yen: Yeah, I mean, drastic times call for drastic measures. It's funny, especially being in the vendor seat, talking to a lot of different engineering teams about their tools and how they solve problems with their production systems. There is definitely an element of a safety-blanket feeling. Right? "Okay, but we need all of our alerts. How will we know when anything is going wrong?" Or, "We need all of our alarms, for all time, at full resolution." And I get it. There are patterns folks get into, it's how they know how to solve their problems, and especially when things are on fire, it feels like you don't have time to step back and change your process: "No, this is what I'm doing to keep most of the fires under control." And I think this is why communities like yours and Monitorama's are so good; we have ways to share different techniques for addressing this, so that folks who are in the piles-and-piles-of-alerts hole can dig themselves out of it and start to find ways to address that.


Mike Julian: Yep, yep, completely agreed. So I want to take a few steps back and talk about monitoring. There's been a lot of discussion about how observability is not monitoring. Monitoring is, I guess, looking at things that we can predict, and feel free to correct me at any time here: we think through failure modes that could possibly happen, and design dashboards and alerts for those failure modes we can predict. Whereas what you were describing earlier, observability, is not that; it's for the things we can't predict. Therefore, we have to make the data explorable. Is that about right?


Christine Yen: That's about right. For anyone in the audience knee-jerking at that, I want to clarify: I really think of observability as a superset of monitoring. And the exercise of thinking through what might go wrong is still necessary. It's the equivalent of saying software developers should still write tests. You should still do the due diligence of: what might go wrong? What will be the signals when it goes wrong? What information will I need in order to address it once it does go wrong? All of these are still important parts of any release process. But instead of framing it as, "here's the signal, I'm going to preserve it as this one metric and immortalize it as the only way to know if something is going wrong," what Honeycomb would encourage you to do is take those signals, whatever metric or piece of metadata you'd want in order to identify that something is going wrong, and instead of immortalizing them, flattening them into pre-aggregated metrics, capture them as events. Maybe it does make sense to define an alert, or define a graph somewhere, so you can keep an eye on it. But instead of freezing the question you might ask, make sure you have the information available later, in case you want to ask a slightly different take on that question, or want a little flexibility down the road.
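
Here's a minimal sketch of that contrast, with hypothetical field names: a pre-aggregated metric freezes the question at write time, while a wide structured event keeps the raw material for questions you haven't thought of yet.

```python
import json
import time

# Pre-aggregated metric: the question is frozen at write time, and
# all of the surrounding context is thrown away.
# statsd.increment("checkout.errors")

# Wide structured event: one record per unit of work, carrying every
# signal you might later want to slice by.
def emit_event(**fields):
    print(json.dumps({"timestamp": time.time(), **fields}))

emit_event(
    endpoint="/checkout",
    status=500,
    duration_ms=1834,
    app_id="a42",
    sdk_version="2.3.1",
    feature_flag_new_cart=True,
)
# You can still define an alert or a graph over these events later,
# but you can also ask the slightly different question next month.
```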


    Mike Julian: So thinking through all the times that I've instrumented code, hasn't this always been possible?


    Christine Yen: It has. I would say not-


    Mike Julian: I feel a very large but coming on.


Christine Yen: I think that as engineers, we are taught to understand the constraints of the data store we're writing into when we write into it. We're taught to think about the type of data we're writing and the trade-offs, and traditionally the two kinds of data stores we've used, either a log store or a time series metrics store, have limitations that restrict the expressiveness of the metadata we can send. Talking specifically about things like high-cardinality data and time series metrics, we've just been conditioned that we can't send that sort of information over there. Or: okay, logs are just going to be read by human eyeballs and grep, so I'm not going to challenge myself to structure them, or to put analytical information potentially useful for analytic queries into my logs. The known trade-offs of the end tools have shaped our instrumentation habits. When instead, like you say, all of this should have been possible all along. We just haven't done it, because the end tools haven't supported the sort of high-level, flexible, analytical queries that we can and should be asking today.


    Mike Julian: Yeah, you used a word there that I want to call attention to because it's kind of the crux of all of this, which is high-cardinality. I have had the question come up many, many times of what in the world is it? And it's always couched in terms of like, "I think of myself as quite a smart person, but what the shit is high-cardinality?" It's one of those things of, I'm afraid to ask the question, because I should know this like everyone thinks I should. I know it because I had to go figure out what in the world everyone was talking about. So what is it? What are we talking about here?


Christine Yen: I'm glad you asked. This is also why, for the record, our marketing folks have tried to shy away from us using this term publicly: a lot of people don't know what it means, and they're afraid to ask. So thank you for asking.


    Mike Julian: But it's so core to everything we're talking about.


Christine Yen: So, at a very clinical level, high-cardinality describes a quality of data in which there are many, many unique values. Types of data that are high-cardinality are things like IP addresses or social security numbers, not that you would ever store those in your, in any data-


    Mike Julian: And if you were, please don't.


Christine Yen: Things that are lower cardinality are things like the species of the person issuing the request, or things like AWS instance type. Yes, there are a lot of them, but there are far fewer of them than there are IP addresses. And-


    Mike Julian: There's a known bound of that measured in maybe hundreds.


Christine Yen: Yeah. Yeah, and I think the reason we're talking about this term more, and it's coming up more, is that we are moving towards a more high-cardinality world in our infrastructure, in our systems. Ten or fifteen years ago, it was much more common to have a monolithic application on five servers, where when you needed to find out what was going wrong, you really only had five different places to look, or five different places to start looking. Now, instead of one monolith, we have maybe ten microservices spread across fifty containers, and then five hundred Kubernetes pods, all shuffling in and out over the course of a day. Even the basic question of "which process is struggling?" is much harder to answer now, because we have many more of these combinations of attributes, which produces a high-cardinality data problem. And I think that's something people are starting to experience more in their own lives, and that a lot of vendors and open source metrics projects are starting to recognize they also have to deal with, as an effect of the industry moving in this technical direction.
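
Back-of-the-envelope arithmetic with the numbers from the conversation shows why "which process is struggling?" got harder:

```python
# Then: one monolith on a handful of servers.
places_to_look_then = 5

# Now, using the numbers above:
services, containers, pods = 10, 50, 500

# Treating each layer as a separate place to look merely adds...
print(places_to_look_then, services + containers + pods)  # 5 vs 560

# ...but real questions slice across combinations of attributes
# (service x container x pod, plus SDK, endpoint, customer), and
# combinations multiply rather than add:
print(services * containers * pods)  # 250000
```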


Mike Julian: One of my favorite examples of this comes from my days running Graphite clusters, where the common advice was: don't include request IDs or user IDs in a metric name. And ten to one, if you're running Graphite, that's still common advice, because if you do, well, it explodes your Graphite server; the number of whisper files that get created is astronomical. So the end result is that we just don't do it. You just don't record that data. But what you're saying is: no, you actually do need that data. Not having it is hampering your exploration when you're trying to answer questions.


Christine Yen: Absolutely. And in this case, with request IDs or user IDs, again, there might be some folks in the audience saying, "Well, I'm Pinterest and I have the luxury of not having to worry about individual user IDs." Maybe, but I guarantee there are some high-cardinality attributes that you do care about, that are important for debugging. For us at Parse it was app ID. We were a platform, so we had tens, hundreds, eventually millions of unique apps all sending us data, and we needed to be able to distinguish: "Okay, this one app is doing something terrible. Let's blacklist him and go on with our day." And if it's not user ID, for some folks it might be shopping cart ID, or which Mongo instance it's talking to. Our infrastructure has gotten so much more complicated; there are so many more intersections of things. In the Graphite world, you would need to define so many individual metrics to figure out that a particular combination of SDK, on a particular node type, hitting a particular endpoint, for a particular class of user, was misbehaving. But more and more, that's our reality. And more and more, our tools need to support this very flexible combining of attributes.
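
Here's a sketch of that combinatorial problem with Graphite-style dotted metric names, using hypothetical attribute values: every combination becomes its own metric, and therefore its own whisper file, which is exactly why the old advice was to keep user and request IDs out.

```python
from itertools import product

# Hypothetical attributes you might want to slice by.
sdks       = ["ios_2_1", "android_1_9", "js_3_0"]
node_types = ["m5_large", "c5_xlarge"]
endpoints  = ["create", "query", "push"]
app_ids    = [f"app_{n}" for n in range(1000)]  # the high-cardinality one

# Graphite-style: one dotted name (one whisper file) per combination.
names = [".".join(("api", s, n, e, a, "p99"))
         for s, n, e, a in product(sdks, node_types, endpoints, app_ids)]

print(len(names))  # 18000 metrics for a single statistic...
print(names[0])    # api.ios_2_1.m5_large.create.app_0.p99
# ...and real platforms have millions of app IDs, not a thousand.
```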


Mike Julian: Right. Yeah, the more we build customer-facing applications, especially applications where the customer has free rein over what they're doing and what they're sending, like a public API, the more one customer using one version of the API, with one particular SDK, could cause everyone to have a very bad day. And if you're aggregating all of that, how are you going to find out it's them? All you see is that the service is sucking.


Christine Yen: 100%. Ultimately we're all moving towards multi-tenant platforms, and if not user-facing platforms, then often shared services inside larger companies. Your co-workers are your customers, and you still need to be able to distinguish that one team using 70% of your resources from everyone else.


    Mike Julian: Right. Yep. So it seems to me that there's kind of a certain level of scale and engineering maturity required before you can really begin to leverage these techniques. Is that actually true?


Christine Yen: I don't think there is. There's no "you must be this tall to ride" bar on the observability journey. There are steps along the way that let you use more and more of these techniques, but when I think about teams that are farther along their journey than others, it's often more of a mindset than anything technical. When I think of steps along the observability maturity model, and here Liz Fong-Jones, our new developer advocate, formerly of Google, is actually working on something along these lines for release, I think sometime in June, it's part tools, but it's also process and people. I think there are changes afoot in the industry in how people think about their systems, how people instrument, how people set up their systems to be observable, and those all factor into how effectively they're able to pick up some of these techniques and start running.


And choice of tooling is a catalyst for this. Ideally you have a tool that, sorry, Graphite, lets you capture the high-cardinality attributes that you want to, but that's only a piece. And I think we are in for a lot of really fun cultural conversations about what it means to have a data-driven culture. What it means to be grounded by what's actually happening in production when the things you're observing don't line up with what you expect.


    Mike Julian: All right. So you've given a lot of talks lately and over the past year or two about observability driven development, which sounds really cool. Can you tell us what it is?


Christine Yen: Yeah. Observability driven development, or, as I like to say, zooming out, observability and the development process, is a way of trying to bring the conversation about observability away from pure ops land, or pure SRE land, and into the part of the room where developers and engineers hang out. My background is much more as an engineer; my co-founder Charity comes much more from the ops side of the room. And we've really come to see observability as a bridge that allows and empowers software engineers to think more about production and really own their services.


And one of the things I've pressed on in these talks about how observability can benefit the development process is what a positive feedback loop it is to be looking at production, even way before I'm at the point of shipping code. There are so many spots along the development process, when you're figuring out what to build, or how to go about building it, what algorithm to choose, or "hey, I've written this, my tests passed, but I'm not totally sure whether it works," where, if developers gain this muscle of "let me check my theory against what's actually happening in production," people can ship better code faster and be a lot more confident in the code they're pushing out there in the first place.


My favorite example is from one of our customers, Geckoboard. They're obviously very much a data-driven culture; their primary business is providing dashboards for KPI metrics. They were telling me the other day about a project their PMs were running, and the PMs were the primary users here, not the engineers, where they ultimately had an incompletely specified problem to try and solve. And their PMs said, "Well, we could have the engineers go off and try to come up with a perfect solution, or we can come up with three possible approaches to solving this problem, run those experiments in production, capture the results in Honeycomb, and then actually look at what the data says about how these algorithms are performing." And the key here is that they're actually running it on their data. Right?


There's a realism to looking at real production data that is so much better than sitting around debating theoreticals, because they're able to say, "Okay, we've had these three implementations running in parallel, and looking at the data, this one seems like the clear winner. Great, let's move forward with this implementation." And they can feel confident it's going to continue behaving well, at least for the foreseeable future, as long as traffic remains the same.


Again, these are bad habits people have fallen into, right? Where devs say, "Okay, monitoring is something I need to add right before I ship, just so the ops folks will stay off my back when I tell them that everything is fine," or, "something the ops folks aren't going to look at in order to come yell at me." But that shouldn't be the only time we're thinking about instrumentation. That shouldn't be the only time we, and I'm speaking for software developers here, think about what will happen in production. At every stage: more and more people are using feature flags to release their code. Cool. You should be capturing those feature flags in your instrumentation. Alongside "hey, what does user X think about this thing we've feature-flagged them into?" you should be looking at: what does the performance of the system look like for folks who have that feature flag turned on or off? Are your monitoring metrics, observability tools, whatever, flexible enough to capture that? Isn't that just as interesting as the qualitative "does user X like this new feature?" It's got to be.


And there are so many things that are becoming part of the development process that observability tools should be tapping into and encouraging, in order to break down this wall between developers and operators. Because ultimately, as you said, more and more we're building user-facing systems, and at the end of the day the goal has to be delivering a great experience for those users.
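
A minimal sketch of what capturing feature flags in instrumentation might look like; the flag name, the rollout rule, and the simulated workload are all hypothetical stand-ins.

```python
import random
import statistics

def new_cart_enabled(user_id: str) -> bool:
    return hash(user_id) % 10 == 0  # stand-in for a 10% rollout rule

def do_work(flag_on: bool) -> float:
    return random.gauss(120 if flag_on else 90, 10)  # stand-in workload

# Tag every request event with the flag state at the time it was served.
def handle_request(user_id: str) -> dict:
    flag_on = new_cart_enabled(user_id)
    return {"user_id": user_id,
            "flag_new_cart": flag_on,
            "duration_ms": do_work(flag_on)}

events = [handle_request(f"user{i}") for i in range(1000)]

# Now the performance question is just a slice over the events.
for state in (True, False):
    durations = [e["duration_ms"] for e in events if e["flag_new_cart"] is state]
    print(f"flag_new_cart={state}: median {statistics.median(durations):.0f} ms "
          f"({len(durations)} requests)")
```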


    Mike Julian: Right. Yeah, we're all on the same team here.


    Christine Yen: We're all on the same team.


    Mike Julian: So let's say that I'm a listener to this show, but I don't use Honeycomb, I can't use Honeycomb for whatever reason, but I really like all of these ideas. I want more of this for me. How can I get started with it? Like are there ways I can implement this stuff with open source technologies?


Christine Yen: There are probably some. First, you want a data store that is flexible enough to support these operations. You should be looking for something that lets you capture all the bits of metadata you know are important to your business. For Parse, to use that example, that was things like app ID and the client's operating system version. Parse was a mobile backend-as-a-service, so we had a bunch of SDKs you could use to talk to our API. So when we were evaluating the quote-unquote health of that service, it was: which app is sending this traffic? Which SDKs are they using? Which endpoints are they hitting? Those mattered to our business, and they're also, incidentally, much easier for developers to map to code when talking about health or anomalies than traditional monitoring system metrics.


So identify those useful pieces of metadata, and make sure your tool can support any kind of interesting slicing along those pieces of metadata that you'll want. There might be some folks in the audience thinking, "Well, I can do this with my data science tools," and I don't know how many data scientists are in your listenership, but it's true, lots of data science tools can do that. I know that for our purposes, as an engineering team at Honeycomb, we care about real time, and that tends to disqualify many of the data science tools.


But I think that, more than tool choice, folks who are excited about observability, folks who are looking for the next step beyond monitoring, should really start looking at places in their development or release process where they're relying on intuition rather than data. Where else can we be validating our assumptions? Where else can we be checking our expectations against what is actually happening out there in the wild? That culture and process is really what the observability-driven concept is trying to get at: where can you more regularly, efficiently, and naturally look to production to inform development, in order to deliver a great experience for your customers?


    Mike Julian: Yeah, that's fantastic advice. This has been absolutely wonderful. Thank you so much for joining me. Where can people find out more about you and your work?


Christine Yen: The Honeycomb blog is a great place to find a mix of stories and more conceptual posts: honeycomb.io/blog. We actually also have our own podcast; it's called o11ycast, which you can find in Heavybit's library. And of course there's the Honeycomb Twitter feed, and we have a community Slack as well, for folks who just want to talk about observability and get a chance to play around.


Mike Julian: Yeah, awesome. As a parting story, I was on one of the first trials of Honeycomb, way back when it was still closed, and I can't remember where I read it, it might have been part of the in-app documentation, it might have been something Charity said on Twitter, but it was like, "Don't use Honeycomb for WordPress, that's not what we're built for." At the time I had about a 100-node WordPress cluster. So I'm like, "You know what, I'm going to use this for WordPress."


    Christine Yen: Awesome.


Mike Julian: I did actually find interesting things with it, which I found pretty hilarious.


    Christine Yen: Cool.


    Mike Julian: So there you go. I believe you do actually have a free trial as well now?


Christine Yen: We do. We have a free trial. We also have the community edition; it's a little bit smaller, but should be enough for folks to get a feel for what Honeycomb can offer. A note about the WordPress disclaimer: I'm glad you got value out of it, I think that's awesome. I would also say that a 100-node WordPress cluster is a whole lot more complicated than what we had in mind when we said that early on. The distinction we wanted to make was: if you have a simple system, maybe you don't need this much flexibility; maybe whatever you have set up is working fine. Because ultimately, as we've covered over the course of this podcast, observability involves changes not just to your tooling, but to how you work and how you think about your systems. That disclaimer was really to make sure folks were interested in investing a little bit in all of that.


    Mike Julian: Yeah. Yeah, absolutely.


    Christine Yen: I'm glad you overcame and tried it out.


    Mike Julian: All right. Well thank you so much for joining. It's been wonderful.


    Christine Yen: Thank you. This has been a lot of fun. I'm a big fan.


    Mike Julian: Wonderful.


    Christine Yen: Thanks.


    Mike Julian: And to everyone listening, thanks for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com, on iTunes, Google Play, or wherever it is you get your podcasts. I will see you in the next episode.


    VO: This has been a HumblePod production. Stay humble.

  • About James TurnbullJames Turnbull is originally from Australia but now lives in Brooklyn, NY. He likes wine, food, and cooking (in that order) and tattoos, books, and cats (in no particular order).
    He is a CTO in residence and leads startup advocacy at Microsoft. Prior to Microsoft, he was the founding CTO at Empatico. Before that, James was CTO at Kickstarter, VP of Engineering at Venmo, and in leadership roles at Docker and Puppet. He also had a long career in enterprise, working in banking, biotech, and e-commerce. James also chairs the O'Reilly Velocity conference series. In lieu of sleep, James has written eleven technical books, largely on infrastructure topics.

    Links Referenced: Twitter: @kartar, Turnbull Press. Transcript: Mike Julian: This is the real world DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps, from the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


    This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization and Kapacitor for real time streaming. All of these are available as open source and as a hosted SaaS solution. You can check all of it out at influxdata.com. My thanks to InfluxData for helping make this podcast possible.


    Hi folks, I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is James Turnbull. You probably know James from his seeming inability to stop writing technical books, such as Monitoring with Prometheus, The Art of Monitoring, The Terraform Book, and like a bajillion others. He has also worked for some pretty neat companies, like Puppet, Kickstarter, and Venmo, and now he works at Microsoft leading a team as CTO-in-residence. Welcome to the show, James.


    James Turnbull: Hi Mike.


    Mike Julian: I'm really curious like what is a CTO in residence?


    James Turnbull: I guess my primary mission is to make Microsoft relevant to startups, much the same way that Microsoft is shaping its relevancy towards the open source community. We're also interested in looking at other audiences that we've traditionally not been involved with, and so it's just one of those.


    Mike Julian: Gotcha. And you're just leading a team of people that are focused on that sort of stuff?


    James Turnbull: Yeah, so most of my team are people who've come from startups, and particularly from engineering management and leadership roles in startups. One of my colleagues was the CTO of SwiftKey, and another one, fairly famous, Duncan Davidson, who wrote Tomcat and Ant and has been around engineering management for a long time, and folks like that who really are here to help startups understand a bit more about how to grow and scale. And I think some of the big challenges startups have are actually not technology related at all. They're really about, you know, how do I build a recruiting process? You know, I had 10 engineers last week, I have 100 this week. How do we structure the team? So we've brought together a group of folks who have fairly deep experience in those sorts of problems for startups and have a deep empathy for the startup community.


    Mike Julian: Yeah that's quite the task ahead of you.


    James Turnbull: Yeah. Look, I think, I mean, Microsoft has traditionally been known as an enterprise software company. You know, a lot of startups are not sure of our relevance to them. I think increasingly we're seeing traction in a couple of areas. One is that obviously Azure is one of our focuses, and that cloud platform is looking more broadly at not just enterprise audiences but other groups. And secondly, a lot of startups ... Microsoft's deep in the middle of most of their customers. So particularly if you're a B2B startup or something like that and you're, you know, trying to sell into enterprise businesses, Microsoft has been doing that for 30 years. They have all the connections and account managers and sales folks and, you know, multimillion dollar relationships with some of the people you want to be customers with. We can provide you with, A, some of those connections, but also a lot of advice and expertise about how to sell to those customers.


    James Turnbull: And having worked at both Empatico and Docker, you know, a large part of my job was attempting to sell, as a fairly early employee at both, into big companies. You know, you can't walk in the door of a Wall Street financial firm if you're a 30-person startup in Portland, Oregon without having a pretty credible story. So I'm happy to help startups with some of that messaging and with understanding how to have some of those conversations.


    Mike Julian: Yeah. It's one of the interesting things about my own company. My clients are all these large companies too and I'm a two person company, but selling into a very large company is nothing like selling to a small company. Everything works differently. People think about their jobs differently.


    James Turnbull: Yeah.



    Mike Julian: Yeah. I think maybe my most favorite thing of everything you just said is this isn't your daddy's Microsoft. The Microsoft we all grew to know and hate is not today's Microsoft at all. Not by a stretch and that's just absolutely incredible to see that turnaround.


    James Turnbull: Yeah. I got a LinkedIn request from somebody yesterday and the message said, you know, I've read a bunch of your books and I've used a bunch of the different sorts of things you've worked on. I was really surprised to see you at Microsoft. And I was like, okay, this could end badly in the next couple of sentences, because I've certainly had a few people of my generation who remember the bad old days and "Linux is a cancer" and things like that. And he finished with, you know, it's really interesting to see companies grow and change. And I was like, wow, okay. I thought that was going to go really badly, but I think it's a fairly accurate reflection.


    Microsoft is aware of the fact that, pragmatically, that was not a good business position to be in. The world is changing. It's moving towards the cloud. You know, the stacks in people's companies, the way they manage things, the infrastructure, the software, things are changing. And I hesitate to say this conclusively, but I think open source won. You know, for certain values of "won," given recent events and discussions about large corporates and their contribution to open source, but as a technology choice, it's pretty clear to me that open source won. And I'm kind of a bit smug about that, to be honest.


    Mike Julian: Speaking of your books, I would just straight up say you're the one that got me into monitoring, and you kind of did this unknowingly. Like, we hadn't met until a couple of years ago, but in 2006, I guess it was, you released a book called Pro Nagios 2.0, and at the time I was working for a very small private school and someone said, hey, we've got these couple hundred printers and they keep going offline and, you know, we should probably know when that happens. So I'm like, well, I don't know how to solve this problem. So I started googling around and found this thing, Nagios, and then found, oh hey, there's a book on it. So I bought the book and I learned about monitoring that day. Pro Nagios 2.0 is what got me into that, and it kind of started my career, which was really cool. So thank you for that.


    James Turnbull: You're welcome. As I said to you before we started, I feel like apologizing too, because I can't remember anything that's in the book, I think that it's probably acting as a monitor stand for a lot of folks, and I'm pretty sure that my ideas about monitoring were very embryonic. But I really appreciate that. That's always exciting to hear, when someone actually is like, this was really helpful. Because none of us are ever going to be John Grisham, right? We aren't in this for the money, and those conversations where someone reached out and said, I read your book, it really helped, or even, I read your book, it didn't help and here's why, I'm like, that's awesome. I'm glad somebody got something out of it and had some feedback. And so yeah, anyone who's listening who has ever had the urge to tell me what's wrong with things or what went well in the book, my email is easy to find, feel free to shoot me an email. Always happy to chat.


    Mike Julian: Yeah, absolutely. Like one of the things that authors don't get very often is feedback, positive or negative. I really expected to get a lot of hate mail for the stuff I wrote in my book and it just didn't happen. I was very disappointed.


    James Turnbull: I was thinking about your book the other day when we exchanged some emails, and I think your timing was very good. I think people are waking up to the fact that monitoring was evolving, and I think that what you had to say was not only very timely, but for me it solidified a bunch of different ideas that I had that I'd previously only been able to talk about in the abstract. I would strongly recommend people read the first couple of chapters of your book.


    Mike Julian: Those were my favorite to write.


    James Turnbull: Yeah, they are one of the better summaries I've read of modern monitoring.


    Mike Julian: Well, thank you for that.


    James Turnbull: Yeah well it's a topic that I think you, I and about 200 people care about but we all do.


    Mike Julian: When I'd tell people, hey, I'm writing this book on monitoring, their first response was almost invariably, have you read The Art of Monitoring?


    James Turnbull: Oh dear.


    Mike Julian: Do you really think I would set out to write a book without having read every book on monitoring there is? Like, that I was somehow unaware of one of the most popular books out there? So I'm like, you know what? I'm going to have James do a blurb on the back of the book and solve that problem forever.


    James Turnbull: That was a good plan. I was looking through my Amazon history the other day when I was doing my taxes, and I can tell when I'm working on a new topic because I have literally bought every single book on that topic. Not just when I'm writing a book. I gave Rust a stab last year and didn't get a chance to do anything with it, and I decided the other day I'd give it a stab again, so I thought, oh, I'll buy a couple of books and see what they're like. And so I can see the pattern: here's some Rust books, and a few years ago here's some Go books, and a few years before that here's every book about monitoring, which is not actually a large portfolio, but there's enough around.


    Mike Julian: It is much smaller than people would think.


    James Turnbull: I bought a bunch of books around the same time on systems theory and stuff like that, because I was struggling to find adequate ways to talk about monitoring as a construct. And I realized that the maturity of the vast majority of conversations on any technical topic sits on a spectrum. At the very bottom of the pile there's the Hacker News comment thread, the worst case scenario. Then there's a few Stack Overflow answers. Then maybe a detailed blog post that explains how to use something. Then maybe somebody having an opinion about design or the language aspects of some language. And then there's the computer sciency end, where somebody's really thought about things and documented them. Monitoring is very heavily stuck at the blog post end of that spectrum.


    Mike Julian: Yeah, I completely agree. Like, when I was trying to find higher level thoughts on it, they're just not there. The level of conversation has started to shift in the past couple of years, and that's awesome. Like, I really want to update my book now, because all the stuff around observability coming out has changed the conversation dramatically. One of the interesting things is I never even used the word observability anywhere in my book.


    James Turnbull: Yeah.


    Mike Julian: Like, it just wasn't ... people weren't talking about it that way, so I didn't talk about it that way either.


    James Turnbull: Yeah, I was having this conversation with Baron Schwartz, who makes software for database observability, database monitoring. Baron is super smart and a much more computer sciencey person than I am. From the way of his thinking, I realized he was one of the handful of people who had taken commentary about monitoring and observability further than just scratching the surface. I thoroughly recommend a couple of short things he's written; his blog posts are really interesting reading, more high level and useful and overarching than anything I had previously seen.


    Mike Julian: So I want to shift topics a little bit. You and I were talking before we started recording about DevOps and I will start off with a very provocative statement. DevOps is dead. What do you think?


    James Turnbull: I think I agree. I was involved in the very early days. I was trying to look up, before this, when I wrote my first blog post on DevOps, and I think it was 2008 or 2009, and I think I went to the second DevOps Days. I didn't go to the first. And I take some responsibility for this because I worked for a company that sold a DevOps tool, but I think the first time a marketing person, A, categorized DevOps as being about tools and, B, used it as a somewhat abstract marketing rallying cry rather than a cultural statement, that's when the first knife was stuck into the entity, as it were. And I think, yeah, I would agree now.


    Mike Julian: Tell me more about that. Like what do you mean by that knife going in? Like is it really that marketing killed DevOps?


    James Turnbull: I mean, being honest here about marketing, I think it's probably a factor. To me, DevOps was almost nothing about tools; tools to me were enablers for folks doing DevOps things. To me, the big thing about DevOps, and the thing that really struck me when I first started thinking about the concept, is that I've been doing engineering things for 25 years now. I feel really old. And a significant part of the scars that I have from those experiences come from being on either side of the conversation, being the developer of a bit of software or the operator of a bit of software, where I've been in conflict with the other party because we either didn't talk about how they built it, or they didn't understand the environment that they were deploying it into.


    And most of those conversations happened at three o'clock in the morning on a conference call where a vendor is screaming at us because some mission critical piece of infrastructure is down and they're losing money. And to me that was a really scarring experience. And to me, DevOps was about solving that problem. It was about having conversations with the people we work with and going, you're building this thing, here's an idea of what it looks like in production. You know, and by the way, can we make sure that we do this, this, and this to ensure that we care about security and monitoring and, you know, backup and recovery or whatever it happens to be, and create that sort of bridge between those two disciplines, which really hasn't existed for most of my career.


    Mike Julian: Yeah, completely agree with all that. What do you think about SRE? Like, has that changed things too? To me, I feel like SRE is also kind of a marketing label.


    James Turnbull: Yeah. I mean, I know a lot of people in the Google SRE organization and I deeply respect the work that they've done. And there are definitely a lot of solid ideas in there; the SRE book is an example. Responses to the SRE book were very polarizing, let's put it that way. I quite liked ...


    Mike Julian: That's a very polite way of putting it.


    James Turnbull: I quite liked it and I thought it was really useful. What I'm really sad about was that it wasn't published in 2005.


    Mike Julian: Yeah.


    James Turnbull: When it would have been actually life changing to a bunch of people. A bunch of us who worked in the high volume, high value, web facing world. The SRE program at Google is a solid platform. Not everything applies to everybody, you know, the classic refrain of, you know, you're not Google. I think that needs to be reemphasized a few times. Not everybody has Borg. Like, there's definitely a flavor to it. But I can't deny the fact that part of the reason it was released was definitely as a marketing aid to Google's recruiting in the SRE organization, and there's nothing fundamentally wrong with that, but it needs to be acknowledged as one of the origins of that movement.


    Mike Julian: I observed a conversation that happened recently in a Slack where someone released a series of articles, fantastic articles, and they were referring to an important measurement as a KPI, and someone responded, just like, why didn't you call it an SLI? It's like, well, because an SLI is something that essentially Google came up with. We've been using the term KPI to mean an important metric for, I don't know, decades, and the term SLI is less than 10 years old.


    James Turnbull: Yeah. I mean, but, you know, this is one of those things, like every generation reinvents the past, right? You know, I'm going to say something controversial here. I look at the way Kubernetes is configured and I look at the sea of YAML files that I'm expected to poke my way through. There's some tooling around that, but I'm like, did we learn nothing from the horror that was configuration files? I mean, it could be worse. I was having the argument the other day, it could be worse, it could be XML, but I'm like, so I could be stabbed in the front and the back. But it feels weird to me that there's a bunch of lessons we haven't learnt and a bunch of things that we have reinvented the wheel about.


    So, you know, I'm not really fussed about the terminology people use. I'm not even fussed about people recognizing that there's a past history there, except to hopefully learn from it, as long as people take it on board and go, in the case of SLIs and KPIs, you have a customer, they have a measure of how successful they are, and that should mirror your measurement of the functionality of your infrastructure, or the thing you look after for them. And if they have that sentiment, I don't care what they call it, an SLA, an SLI, or a KPI. Yeah, I think debating about that is a funny one.
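
    Whatever the label, the measurement James describes usually reduces to the same arithmetic: the fraction of customer interactions that were good, compared against a target. A toy sketch, with every number invented for illustration:

```python
# An SLI (or KPI) as a ratio of good events to total events,
# checked against a target. All numbers are made up.
requests_total = 120_000
requests_good = 119_700   # e.g. succeeded within the latency threshold

sli = requests_good / requests_total   # 0.9975
slo_target = 0.999                     # the level promised to the customer

print(f"measured: {sli:.4%}, target: {slo_target:.1%}")
print("meeting target" if sli >= slo_target else "burning error budget")
```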


    Mike Julian: So you mentioned the sea of configuration files in Kubernetes, which is absolutely wild. Like, yes, it's like we didn't learn anything at all. That brings me to: do we even need DevOps anymore? Like, on one hand we're still making the mistakes that we used to make, on the other hand things are very different than they used to be.


    James Turnbull: Yeah. I think there is no yes or no answer to that. I think it's a bit more nuanced. Obviously there's a bunch of things we haven't learnt, and, you know, I overheard a conversation at a KubeCon a couple of years ago where two fairly 20-something looking engineers were talking about how it would be so much easier if there was some sort of templating system for configuration. It would make things so much easier if we could build templates and stuff. And I have no hair anymore, so I wasn't pulling my hair out, but I was mentally doing it. I thought, don't say anything James, you'll look like an old fart. Just turn around and walk away, go to the bar, have a quiet drink. But, so yeah, definitely we should learn from the things that came before us to make the experience of the people maintaining these systems at least as good as, if not better than, the experiences we had.
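
    For what it's worth, the templating those engineers wished for has long existed in many forms: ERB in the Puppet and Chef world, Jinja in Ansible's, Helm for Kubernetes itself. As a minimal sketch of the idea, here is Python's standard library string.Template rendering a Kubernetes-style manifest; the fields are illustrative, not a complete Deployment spec.

```python
# Render a Kubernetes-style manifest from a template using only
# the Python standard library. Fields are illustrative only.
from string import Template

manifest = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  replicas: $replicas
  template:
    spec:
      containers:
        - name: $name
          image: $image
""")

# One template, many environments: vary only the parameters.
print(manifest.substitute(name="web", replicas=3, image="nginx:1.25"))
```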


    But that being said, Kubernetes is an example of how far up the stack we've moved. You know, back in the day I spent a lot of time worrying about Linux kernel modules and package management systems and iptables and stuff like that. To a large extent those are not skills that are relevant to a lot of contemporary engineers who are working on, say, container based systems, because that's all black box to them. It's taken care of under the covers, for good or ill. You know, they're running a Kube cluster on top of a machine that they may not even maintain or may not even know anything about. So perhaps some of the problems we had in the past might not exist anymore. Perhaps. I don't know. I've never been a huge fan of black boxes either, so.


    Mike Julian: It seems to me that we have moved some problems around. Like, some of the problems are still there, we just don't see them anymore, or we've made them someone else's responsibility. Like, when I use, say, Amazon or Azure, pick your cloud provider of choice, I don't have to care about the network anymore. Except I kind of do, but it's an entire black box to me, so when something is kind of hinky, I can't really do anything about it anyways.


    James Turnbull: Yeah.


    Mike Julian: So there's that whole discipline of network engineering, where a lot of systems people who were also amateur network engineers are not anymore.


    James Turnbull: Yeah, I think the argument the cloud providers make, and I think it's a reasonable one, is that economies of scale apply not just to cost; they apply to stability and availability. In the vast majority of cases the 80/20 rule applies and you don't need to care about the fabric between your infrastructure. In the cases where you do, like, let's say I'm a high frequency trader or something like that, where I care about every picosecond between me and the pipe out of the building and me and the trading floor, yeah, maybe you're not running in the cloud, right? Maybe you're running on, you know, custom built high-performance machines with incredibly tuned kernels and network stacks. Does everybody else need that? Probably not. But you're right, it does make debugging more complex and potentially problematic.


    Mike Julian: So I think all that is pretty interesting. And there's also something you mentioned before we started recording about Puppet, Chef, config management in general being significantly less relevant than it used to be. I remember that at my last full time job, most of the work I did was writing Chef and Puppet. Like, it was a whole lot of orchestration and config management of, how do we build a system? And now, as far as I can tell, no one's really caring about that anymore. Like, the problems have moved further up the stack.


    James Turnbull: Yeah, I agree, and I've been saying this for a few years, and I think you see it if you look at some of the work that's come out of some companies, HashiCorp too to some extent. The important thing about configuration management is the lessons learnt about configuration management.


    Mike Julian: You know, right.


    James Turnbull: And the fact that the abstraction has moved up the technology stack should prompt us to ask, you know, how do we apply the lessons we learnt managing infrastructure level components to managing application and service level components? Orchestration is not a solved problem by any stretch of the imagination, and things like microservices make things considerably more complex.


    Mike Julian: Yep.


    James Turnbull: You know, obviously they're very flexible in many ways, but all of a sudden you have 300 little services that talk to each other via various ports and require different levels of security and AAA, and require different pieces of configuration. This is a non-trivial problem, and guess what, we've actually solved some of these non-trivial problems before, which is why some of these companies hopefully will reinvent themselves to be in that space, and I see a little bit of that happening now. We'll just see who survives, I guess.


    Mike Julian: Right. I have a few last questions for you. Of the bajillion books you've written, which ones have been your favorite? Like what was the most interesting one to write?


    James Turnbull: I think it's probably The Art of Monitoring. There's a lot wrong with that book and a good amount ...


    Mike Julian: As always.


    James Turnbull: I went into an obsessively deep hole and I wrote a 700 page book which is very focused on technology, using a technology stack to articulate what is effectively a change in thinking. And I did that because everyone I had a conversation with, I was like, who's going to buy a book about the theory of monitoring? But people might buy a book that has configuration files and technology and shows you how to do things; maybe that'll work better. And of course I realized that I would have written a much shorter book, and probably not have spent a year and a half of my life buried in complex configurations, if I had just written a theory of monitoring book, and it might've been quite timely.


    So yeah, I think it's probably my favorite book, but it's also my least favorite one too. I brushed over some topics that I probably should have covered in more detail, and I recently found a calculation error in one of my graphs that a Russian PhD student pointed out to me, and I was like, huh. It's not a big miscalculation, but it's enough that I felt embarrassed, and I went back to this guy admitting I'd done a calculation wrong, gotten a formula wrong. So there are moments where I'm like, how many people saw that and thought, what an idiot? And that's never good.


    Mike Julian: I had that happen recently with mine. My book is being translated into Japanese right now, which is super cool and the Japanese translators are thorough.


    James Turnbull: Okay.


    Mike Julian: They have found so many errors, so many typos, but some of them are calculation errors. They don't fundamentally change what I'm talking about or the illustration, but it's one of those, huh, I really did screw up an average calculation.


    James Turnbull: Yeah, I did that too… but yes, this Russian PhD student, he was hilarious. He's like, I just don't understand how you got this number. And there's no spreadsheet, there are no formulas in there. There's just a graph, and he's obviously smart enough to look at the graph and go, that's wrong. And he was very polite about it, but he genuinely thought I might've discovered a new branch of math, as opposed to me making a terrible mistake, which I thought was flattering and horrifying at the same time.
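
    For illustration only, and not the specific error from either book, here is one common way an average behind a monitoring graph goes wrong: averaging per-host averages without weighting by request count. A quick sketch:

```python
# Average-of-averages versus a properly weighted average.
# Numbers are invented to exaggerate the effect.
groups = [
    # (request_count, mean_latency_ms)
    (10_000, 20.0),   # busy host, fast
    (100, 500.0),     # idle host, slow
]

wrong = sum(mean for _, mean in groups) / len(groups)
right = sum(n * mean for n, mean in groups) / sum(n for n, _ in groups)

print(f"average of averages: {wrong:.1f} ms")  # 260.0 ms, misleading
print(f"weighted average:    {right:.1f} ms")  # ~24.8 ms, what users saw
```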


    Mike Julian: Absolutely. So I noticed there's seemingly a trend with how you write books. To me, looking from the outside, you start writing on a topic right about the time it hits mainstream. Whether that's true or not, that's how it seems to be. So what can we look forward to next from you? Are you thinking about writing any new books?


    James Turnbull: I am contemplating it. I've had a long dry spell of just writing bits and pieces internally. I'm writing a bunch of content for Microsoft right now. I'm thinking about writing something again, something technical. I feel like maybe service mesh is probably somewhere in the space that I'm interested in, but I don't see anything there yet that resonates with me as a solution I want to write about. But I think that's the space I'm going to watch. I would love to write a book about startup engineering practices and about [inaudible 00:29:59].


    Mike Julian: That could be fun.


    James Turnbull: But I think I've been beaten to it. Camille Fournier wrote The Manager's Path, which is, to me, every time I read it I'm like, I can't do any better than this. This is an awesome book.


    Mike Julian: It is a very good book.


    James Turnbull: And so I feel like that position has been taken, but I've had some thoughts about startups and things like that. Like, I think we're definitely in a different era, and there's definitely some lessons learned, particularly around topics like work life balance and ethics and diversity and inclusion, where an update to some of the seminal ideas about the way startups work might be welcomed.


    Mike Julian: Yeah. All that sounds great to me. Where can people find out more about you and your work?


    James Turnbull: Probably the easiest place is Twitter. I'm one of the dying generation that uses Twitter, so my Twitter handle is @kartar, and if you're interested in my books, turnbull.press is my grandiose imprint and lists all of the books and topics and so forth. That's probably the easiest way to find me.


    Mike Julian: James, thank you so much for coming on the show. It's been a pleasure to chat with you.


    James Turnbull: You too. Thanks so much for having me.


    Mike Julian: And thank you to everyone else listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at RealWorldDevOps.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


    Announcer: This has been a HumblePod production. Stay humble.




  • About VM BrasseurVM (aka Vicky) spent most of her twenty-plus years in the tech industry leading software development departments and teams, providing technical management and leadership consulting for small and medium businesses, and helping companies understand, use, release, and contribute to free and open source software in a way that's good for both their bottom line and for the community. Now, as the Director of Open Source Strategy for Juniper Networks, she leverages her nearly 30 years of free and open source software experience and a strong business background to help Juniper be successful through free and open source software.
    She is the author of Forge Your Future with Open Source, the first and only book to detail how to contribute to free and open source software projects. The book is published by The Pragmatic Programmers and is now available at https://fossforge.com.
    Vicky is a moderator and author for opensource.com, an author for Linux Journal, the former Vice President of the Open Source Initiative, and a frequent and popular speaker at free/open source conferences and events. She's the proud winner of the Perl White Camel Award (2014) and the O’Reilly Open Source Award (2016). She blogs about free/open source, business, and technical management at {anonymous => 'hash'};.

    Links: opensource.org, fossforge.com, anonymoushash.vmbrasseur.com, vmbrasseur.com, marythengvall.com, Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure

    TranscriptMike Julian: This is the Real World DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools, to the organizers of amazing conferences. From the authors of great books, to fantastic public speakers, I want to introduce you to the most interesting people I can find.


    This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast you're probably also interested in better monitoring tools, and that's where Influx comes in. Personally I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools. Telegraf for metrics collection from systems, Chronograf for visualization and Kapacitor for real time streaming. All of this is available as open source, and they also have a hosted commercial version, too. You can check all of this out at influxdata.com.


    Hi folks, I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is VM Brasseur, otherwise known as Vicky, an expert in open source strategy and the author of the book Forge Your Future with Open Source. She was previously the Vice President of the Open Source Initiative and is currently Director of Open Source Strategy at Juniper Networks. Well, Vicky, thanks for coming on the show.


    Vicky Brasseur: Well thanks for having me Mike, I'm very happy to be here.


    Mike Julian: I want to start with a seemingly simple question, but I have recently learned in the past half hour that this is more complex than it seems. What is open source?


    Vicky Brasseur: Yeah, can't imagine how you learned that. No, it's a question that a lot of folks in technology think they know the answer to, but unfortunately they're usually wrong. That's because they usually don't realize that there is a legitimate definition of what it means to be open source software. It is called the open source definition. It is maintained by the Open Source Initiative. If something does not adhere to each of those 10 points on the open source definition, it isn't really open source.


    Unfortunately people just sort of assume, well, if my source code is out there, it's open, right? Well, not really, because if you restrict it in any way, or if you don't put an appropriate license on it, then people don't know it's open source. If you just put your code out there without a license, for instance, it's all rights reserved. You have the copyright over that code, or your company does if you developed it for your company. It's all rights reserved as far as copyright, and no one else can use it unless you put a license on it; that's what the license does for you. Only an open source license, one that is approved by the Open Source Initiative, that's the only kind that you can be assured actually gives you all of the things that the open source definition guarantees.


    Mike Julian: What's really interesting about that is, there are always people that go around GitHub onto projects and say, "Hey, I noticed that you don't have a license, you should really have a license file." I'd always thought that that was just kind of an oversight, like, "Oh yeah, it's fine, it's totally open source. There's just no license file." What you're actually telling us is that, if you don't have that, if you haven't specified what license this is under, by default it's not open source. Like, it is "all rights reserved."


    Vicky Brasseur: It is, exactly. It is all rights reserved. The best you can call it is source available. You still retain all of the copyright over that, and therefore it is all rights reserved. You retain all rights to that code, no one can use that software at all unless you give them the rights to it. That means somebody could use your software and put themselves at legal risk by violating the copyright of your software and you. If you don't put a license on it, that's what they're doing. Therefore, they are at legal risk, they can get sued and if they are running a company and they're using your software, they can't really get acquired frankly if they are using software that is encumbered by somebody else's copyright. That's why it's so important for multiple reasons to make sure you have a license on there. It really takes care of all those legalities. It's a relatively short list of OSI approved licenses, you've got the Apache and the MIT and all your GPL flavors and LGPL and AGPL and yeah. There's a bunch of them and they cover a broad swath of things. If you just use one of them, you don't have to care about the legalities, somebody has already taken the time to figure that out for you.


    Professional lawyers have written these things, gotten them approved by OSI. You know they give you everything from the open source definition and you know it's legal. Just use it. It's pretty easy.


    Mike Julian: You just named off a whole bunch of different open source licenses. I'm always confused when I release a project, like, what should I license this under? Screw it, I'll go with MIT or Apache and call it a day, and I never really put any thought into it. There are a lot of these licenses, so presumably I should probably be putting more than two seconds of thought into which of them to use, if I'm even doing open source at all.


    Vicky Brasseur: It depends. I mean, if you're a business, you're going to put a great deal of thought into this, because you have specific business requirements and strategic needs for releasing that software at all. If you don't care, put GPLv3 on it, put MIT, just slap that on it and throw it over the wall. If you don't care, don't think about it. GPLv3, MIT, that's great. If you care about software freedom, if you care about the morality of allowing other people to look at and manipulate and redistribute your software, use a copyleft license, use GPLv3.


    If you really could not give two farts about that, then put MIT on it and just get it out there, but license it appropriately, otherwise you're screwed. If you really have a lot of other considerations, as far as some sort of patent concerns or something like that, that's when you need to take it to your lawyer and have them look at it and figure out strategically what makes the most sense. If you're just an individual, default to GPLv3, default to MIT, you should be fine.


    Mike Julian: It sounds like there's actually a whole lot more to open sourcing something than just slapping a license on anything you throw in GitHub.


    Vicky Brasseur: Yes.


    Mike Julian: Especially if I'm a company.


    Vicky Brasseur: So much more. I mean, even if you're an individual, it's very important that you do more than just slap a license on it. I know I've been saying for the past five minutes, just slap a license on it and move on. Unfortunately it is slightly more complicated than that, but not much. That's because most software is composed of multiple different pieces of code. You've got this module, that module, this library, that library. Now with open source, as you release it, someone doesn't just have to take your whole package and move on. If they want, they can cherry pick individual pieces of your code. They could just take one module if it does what they want, for instance.


    Now if all you do is slap a license file in that repository and then walk away, if someone just takes that one piece of code, then later on, when they're in a merger and acquisition situation for instance, that piece of code is going to be found. Nobody will know where it came from. You won't have some sort of trail showing, oh, I wrote this. You won't be able to prove it via version control and you won't have a license file. You won't know who wrote it, you won't know under what license you're using it, so you're going to be in a big bind, depending upon the software. You might have to completely re-architect to get that out of there, or rewrite it, or something like that. You can't show it's free of copyright encumbrance, because while it was originally open source, you don't know where it came from; you don't have that provenance.


    As you are releasing software, make it so much easier for everyone. At the very top of each file, and I know developers roll their eyes every time I say this, but come on people, we have tools now that can handle this, you can fold that up so you don't see it. At the top of your file, just have a commented out section, which is a simple copyright statement. Copyright Mike Julian 2019, done. Then underneath that you put licensed under GPLv3, say that. With those two lines at the top of every single file, if that file gets lifted out and used elsewhere, somebody will know under what rights they are allowed to use it and who wrote it.
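
    To make those two lines concrete, here is what the top of a file can look like, shown as Python and using an SPDX-style license identifier, which is one common convention; the name and year are illustrative.

```python
# Copyright 2019 Mike Julian
# SPDX-License-Identifier: GPL-3.0-or-later
#
# The two lines above are the per-file statement described here: who
# holds the copyright, and under which license the file may be used,
# so provenance survives even if this one file is copied out of the
# repository on its own.

def hello():
    """Everything below the header carries the license stated above."""
    print("hello, open source")

if __name__ == "__main__":
    hello()
```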


    They have their legal butt covered because you have put a copyright and a license statement in there. Then you have the full copyright file elsewhere in your repository. I have a nice big section on this in my book, on how to release your software as an individual. As a company, yeah, there are a lot more concerns. Personally, at Juniper, I don't want to release software if it has IP that we can make money off of. I need to talk to legal, I need to talk to the product teams. I need to figure out how to get this released appropriately, because just throwing it over the wall as an open source project, appropriately licensed, is one thing. My company is not going to get any benefit out of it if we don't treat the community properly, if we don't actually engage in it. All we're going to get is people looking at it and saying, "Yeah, hey look, Juniper released code, isn't that cool." There's a lot of benefit in that, don't get me wrong, but there is so much more benefit in building a community of users and of contributors. That can gain companies a great deal.


    Mike Julian: Yeah, that makes a lot of sense. I've definitely been in companies where they have a strong culture of working in the open source community. They have software they've open sourced and they're maintaining it. They're actually growing communities around it, and that brings them so much goodwill in the community. Not to mention it brings them more business, as well as recruiting. It's a huge recruiting magnet too.


    Vicky Brasseur: It's a massive recruiting magnet. Why would you not do this appropriately and build a community for recruiting alone?


    Mike Julian: Right.


    Vicky Brasseur: I used to run software engineering departments at the VP level in various companies. The amount of time and effort and money that goes into recruiting is spectacular. Depending upon whatever employment firm is putting up the study, it can be anywhere from 150 to 250 percent or more of that person's salary. That's how much it costs to replace them. A, you want to manage appropriately so they don't leave in the first place, so you don't take that 150 to 250% hit on their salary. Also, you want to make it as quick and easy as possible to get the right person in there.


    Now if I'm using open source software strategically within my company to build my products, and I'm releasing software appropriately and engaging with all of these communities in an authentic way, then what I am doing is meeting a lot of people who are already familiar with my stack. Who already know ECO and Kubernetes and name your flavor of the week. They'll know that and they'll know my company. When I come knocking saying, "Hey, I have an opening," I'm going to have people lined out the door. Not only will they be more qualified to come on board, but since they already know the stack, their onboarding time is dramatically cut. Therefore, they can get productive more quickly because they already know the software. They don't necessarily know all the special little delicate snowflake things of my stack, but they're familiar with the software. I don't have to teach them YAML and stuff like that. I'm not going to get that if my company isn't being an authentic community member in the free and open source software communities that it is using and participating in and really relies on.


    Mike Julian: Right. You've been talking about these concepts of open source strategy for years. It sounds like a lot of what we're just now discussing is part of that idea. Like, the open source strategy, is there more to it?


    Vicky Brasseur: Yes. What a leading question, yes, oh my gosh.


    Mike Julian: Yes. What else am I missing? Please tell us more.


    Vicky Brasseur: I mean, there's using open source software. If you look at the various studies out there, anywhere from 70% to 90% of the software that's being used and written right now relies on free and open source software in some way. We're not just simply counting Linux in there; it's everything else. It's the entire Node ecosystem, and it's Python, and it's PHP, it's everything. It's huge. Everything relies on free and open source software. And frankly, that's not really strategic. That's just a gimme, yeah, whatever. You're going to be using free and open source.


    Mike Julian: Of course, we're going to use Apache.


    Vicky Brasseur: Exactly, right, exactly. That's what we're going to do. Everyone does this. It would be stupid for us to roll our own at that point. Like, are you going to roll your own SSL libraries? Not if you're wise. You're going to use …


    Mike Julian: I sure hope you aren't.


    Vicky Brasseur: Oh please, and if you are, stop now. Just stop, back away slowly. You know you're using these things, but some of them are more important than others. What makes the most sense for your business to be looking at, to be investing in? Because you could just throw money and people and time at every single thing you're using in your stack, but that doesn't make a lot of sense. You have due diligence you have to perform, and you have to look at this strategically. It's not just releasing software strategically such that you can get the benefits of it, but it's also supporting software strategically. It's contributing to software strategically. You have to know how to do that properly, and your people have to be trained appropriately. You have to have policies in place for compliance and various things like that. There are just so many different moving parts to doing open source well from a business point of view. A lot of companies think they know how to do it, and as a now former, thank you Juniper, free and open source software business strategist, I'm here to tell you most companies do it wrong. They're putting themselves at massive risk. They just assume they know what they are doing, but it's as though they learned about open source software, like most open source practitioners now, via the telephone game.


    They heard from someone, who heard from someone, who heard from someone, who heard from, I don't know, Stallman 40 years ago, this is what it's about. Therefore they think they know what they're talking about, and I'm sorry, they don't. They just don't. There's a lot to this to do it properly.


    Mike Julian: I guess on that note, shifting gears a little bit, let's talk about open source business models. This has been a hot topic in the news in the past couple of years, with Amazon trying to kill Mongo and likewise trying to kill Elasticsearch. Well, basically Amazon just trying to kill everyone. What's going on with these concepts of an open source business model? Why are people suddenly changing their licensing now? What's going on there?


    Vicky Brasseur: You can't see me gritting my teeth because this is radio, so to speak. There is not now and there never will be an open source business model, full stop. People who say there is know absolutely nothing about business, and my goodness, it's difficult not dropping F bombs right now because I'm pretty passionate about this subject. There is no open source business model. Open source is one of the many tools you use to make your business successful. Just like any other tool you're using, just like your marketing team, your sales team, all the tools you're using like Salesforce, and the people who are cleaning up your office, they're all helping to make your company successful. Your support team is incredibly useful in making your company successful. Open source software is just another one of those things.


    If you as a business are going to release your secret sauce, and you're going to put it out there for the world to see and take, and put it under an OSI approved free and open source software license, and then you're going to get your knickers in a twist because someone else takes it and does something with it, I'm sorry, the license you put on there, you have given them permission to do this. They're doing exactly what you told them they could do. It is not their fault if you can't run a damn business. If they take this open source software that you have released and they make a more compelling business and product offering out of it than you do, that's not their fault, that's yours. That's you not listening to the market. That's you not listening to the users. That's you not being able to deliver on your particular business prospect. That's not the fault of open source, you've got to learn how to do some business, honey. There is no open source software business model, there are only business models. Open source is one of the many things that can help contribute to a successful business model. Sorry, I did say I was a little passionate about this.


    Mike Julian: I really wish I had an applause sound effect right now, that would be great. Yay. That was all very enlightening. There's no such thing as an open source business model; instead, we use open source as a technique for growing our business, but really we still need a business model to begin with. Open source is just a component of that.


    Vicky Brasseur: Yes.


    Mike Julian: Looking at companies, and I really don't mean to be calling out Mongo and Elasticsearch, but they're the two most recent ones. In those situations, what should they have considered doing instead of, as you say, getting their knickers in a twist over something they told the market was totally fine? What is the other option?


    Vicky Brasseur: Well, I can't say specifically what these companies could have done or should have done because I don't know what they did do.


    Mike Julian: Let's come at it from a different angle, rather than telling some other company what they should do. Let's say that I'm writing some software and I want to open source it for the world to use, and use that as lead gen to sell my commercial offering. You know, that sounds an awful lot like what everyone else has already been doing, and now they're getting their lunch eaten. What are my other options? What else could I consider?


    Vicky Brasseur: Well, there are multiple business approaches that you could take there. I mean, yes, a lot of other companies are going the open core model, and there's not necessarily anything wrong with the open core model. Now, for the listeners who don't know what that is, it is essentially where you have the core of your software under a free and open source software license. It's freely available. Then you can have an enterprise version that you sell that has value adds on top of it. You have your core version that's free and anyone can take and do things with. Then you have your enterprise version that people pay for, and they get increased support, or they get more features, or more speed, or more seats, or whatever, it doesn't matter what it is. That's part of your business model, that's part of your business.


    There's nothing wrong with open core in that way, that's perfectly fine. Part of what these particular companies are complaining about is, as you mentioned earlier, that other companies are eating their lunch by taking these things and not contributing back to the software itself. I am going to take your database software and I'm going to have another offering. I'm going to build a better product on it, and so I'm going to take your customers. That's fine, and that's perfectly okay, but using that software and not giving back to it is kind of dirty pool. In the free and open source software world we call that the free rider problem, where people are using the software and not contributing back.


    Now these companies that have recently switched licensing and said, "Oh my gosh, the open source business model doesn't work," yawn, whatever, your business model doesn't work; there is no open source business model. It's like saying unicorns don't work. They all complain about this, but none of them have ever once said, "And here's how we reached out to these other companies and asked them to contribute to the community." None of them have said, "And here's how we asked how we can make it easier for them to contribute to the community." None of them are talking about the attempts they have made to try to get other community members. Frankly, if you look at their repositories for their core software, it doesn't look like they've done that for anyone. You can't point fingers at a large bookstore to the north of me, saying that they've been doing bad things.


    If you haven't been running a good community, if you are just doing things where you are the only people playing in your little sandbox, you're not letting anyone else in, then do you really get to be pissed off when someone else builds their own sandbox next door out of the same sand you're using? I'm sorry, that doesn't make any sense to me. If you want your free and open source software project to be successful, you have to build a good community around it. That means reaching out rather than expecting everyone else to reach in. Meet the people where they are and try to figure out how you can make it easier for them to become a part of the community, because that becomes the rising tide that lifts all boats. How many metaphors can I throw into this particular rant? A lot of them.


    Vicky Brasseur: That's something that I think a lot of companies do very, very poorly when they release software: they just assume, if I release it, they will come. No, that is not what happens. Community takes time, community takes effort. If you want your open source software to benefit you more than just word of mouth of, look at them releasing something, you have to put a lot of effort into it to get it right. Sorry, little mini rant there.


    Mike Julian: It sounds like the community is really the core facet of all this. If you want good software and you want people to really like using your software, you need to build a community, you need to foster that. What are some tips? If I'm launching a piece of software, how can I grow my community from there?


    Vicky Brasseur: How do you grow your community? Well, there are lots of different ways to do this, and Mary Thengvall has a really great book that's come out recently that's related to community. You should check that out. It's officially about developer advocates, but there's a ton of community work in that, and Mary does really great work in community. She is your community specialist. However, having been in free and open source software for 30 years now, I have picked up a few things about community, so I feel more than qualified to talk about this. Number one, documentation. For the love of dog, write everything down, document all the stuff. Documentation is going to scale so much better than your developers. Make sure you have stuff documented before you release it. By stuff, I mean how to stand up your developer environment, how to get started, why would I even want to use this software, here's our glossary, and, very importantly, how do we contribute? How do I, as a user of your software, show up and even just send a simple bug patch? How do I send a documentation patch? How do I do even the simplest stuff? Where do I communicate with you?


    Document all of those things as well and really just throw open the doors. Also, and this is absolutely vital, it's table stakes now and people who say otherwise are probably jerks you don't want in your community anyway: have a code of conduct and enforce it. If your project is not friendly to people, if it doesn't treat people with the basic level of respect that a code of conduct ensures, then your community is not somewhere anyone wants to be, which means you don't have a community, you have a cesspit. Get a code of conduct, learn how to use it.
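
    (Pulling together everything Vicky just listed, here is a minimal sketch of what that baseline might look like in a repository. The file names are conventional examples, not anything she prescribes.)

        README.md            # why anyone would want this software, plus a quick start
        CONTRIBUTING.md      # how to send even a simple bug or docs patch, where to communicate
        CODE_OF_CONDUCT.md   # the behavior you expect, and how it's enforced
        docs/dev-setup.md    # standing up a development environment
        docs/glossary.md     # project-specific terms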


    Mike Julian: Completely agreed.


    Vicky Brasseur: Yeah. There's many different other things you can do as far as building community, but those are some of the starters.


    Mike Julian: Yeah, those are some really great tips. Shifting gears a little bit, you and I were talking before we started about this, we started this call about a concept you've been talking about called open source sustainability. Could you tell us more about that? What's your idea?


    Vicky Brasseur: There's a big buzzword in free and open source circles lately, and it's all about making free and open source sustainable. This started, well, we've kind of been talking about it for a long time because of the whole "free rider" problem. You can't see my air quotes, but the free rider problem is something we've talked about in free and open source software for a very long time: people using, but not contributing back. That's a problem, and it's something we can potentially work to, not fix, but at least shift a bit. So we've been talking about that for a long time.


    A few years ago Nadia Eghbal came out with a study through the Ford Foundation called Roads and Bridges. It was about, I guess, the crumbling infrastructure of free and open source software and how so much of what we use is not well maintained. That's led to a lot of conversations around this. We've all seen the problems with this around OpenSSL and Heartbleed, how there just weren't enough people there maintaining it. They were almost literally killing themselves to keep this stuff maintained.


    Mike Julian: Yeah, turns out it's like one person is kind of doing most of the work.


    Vicky Brasseur: There was a lot going on there. That started the conversation around: what does it mean to make sure the free and open source projects on which we all rely, because we've all built our businesses on them, are sustainable and will stay around? Frankly, to me it's a business risk to be using something that's not maintained.


    Mike Julian: Absolutely.


    Vicky Brasseur: I can't put my company's money into something that I can't guarantee is going to be maintained for a long amount of time. Now, because we're in technology, and because most of technology is run by VCs, most of the conversation around open source sustainability has been laser focused, just laser focused, on money. What we're going to do is, we're going to get a ton of money and we're going to pay these maintainers. If we pay these maintainers, it'll all be better, because money fixes all the problems, and what money doesn't fix, technology does. And no, no, for crying out loud, no, that's not the only way to solve this problem. This is a social problem as well as a financial problem.


    I have, in the past, managed a team where people were paid to contribute to free and open source software projects. That's all they were paid to do: make these open source projects better in whatever way made sense. These projects were very strategic for the company, so it made sense for the company to pay people to make these strategic things better for them, which was brilliant for the company. I'm really glad they did that, but at least one of those people was the only maintainer on an absolutely vital piece of internet infrastructure, something that ran so many different things. I know exactly how much this person made because they reported to me. I also know that throwing more money at the problem was not going to solve the fact that they were working 70 to 80 hours a week to try to maintain this. That is not a money problem, that is a resourcing problem. It's a standard sort of management issue. What we have to do if we want things to be more maintainable is fix that bottleneck. You fix that incredibly horrible bus factor of one. That's what you have in a lot of free and open source software projects.


    Now, how do you fix that? You, as a company, need to contribute back. By contributing back, I'm not just talking about throwing money at the problem; you have to contribute resources. Those resources can be human, they can be technological, like servers to help scale things out better. They can be more people to document, people to design, people to market. It doesn't matter, but get these vital free and open source software contributors support, and that support is not necessarily money. Money helps, and certainly all these people would love to get paid more and get paid full-time to work on their free and open source software projects, but it's not going to help if they are still the only one, and they're still working 80-plus hours a week to save your ass sometime. Give back, contribute, and learn how to contribute to these projects, which is where I'm going to plug my book, frankly, because we didn't talk about this, but I'm going to do it.


    Mike Julian: Please.


    Vicky Brasseur: It is the only book on how to contribute to free and open source software projects. If people don't do this properly, free and open source software will not scale. GitHub alone is adding millions of new open repositories every single year. That's just GitHub; that doesn't count GitLab, that doesn't count Bitbucket, that doesn't count all the things that Apache and all these other projects are running. Millions of new repositories every year. Who's going to maintain all of that? We need to train people how to contribute to open source software, and that's why I wrote my book. Otherwise, we are going to collapse under our own weight. Please learn how to do this.


    Mike Julian: On that note, where can we find your book?


    Vicky Brasseur: Where can you find it? Will there be show notes?


    Mike Julian: There will be show notes.


    Vicky Brasseur: Okay, good, well, there will be a link in the show notes then. The link, which will go directly to my publisher, is fossforge.com, so F-O-S-S-F-O-R-G-E dot com, and that will go directly to the Pragmatic Bookshelf page for the book. I love the Pragmatic folks.


    Mike Julian: Wonderful.


    Vicky Brasseur: They've been so amazing to work with. If you ever need to write a book, man go with them, they're so fun.


    Mike Julian: Yeah, that's great to hear. Aside from your book, where can more people find out about you and your work?


    Vicky Brasseur: Oh, about me. Well, they can go to my blog, which is anonymoushash.vmbrasseur.com, "anonymoushash" being one word. You can also just find it from my website, which is vmbrasseur.com, and I do way too much of the twittering, so that's probably the best way to keep up with all the things that are on my mind right now. You're not going to see "this is what I had for lunch" or "OMG look at the cute kitties." That goes on a different Twitter account, but you will hear all about …


    Mike Julian: Open source all the time?


    Vicky Brasseur: Yes, this one is open source all the time, management all the time. It's a lot less dull than that sounds, trust me. I hope.


    Mike Julian: Yeah, all right. That's awesome. Well thank you so much for joining us, this has been an absolute pleasure to have you.


    Vicky Brasseur: It's been super fun. I love talking about this stuff and I'm very grateful for the opportunity to do so.


    Mike Julian: Well thank you. To all our listeners thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play or wherever it is you get your podcast. I'll see you in the next episode.

  • About Ryn Daniels: Ryn Daniels is a staff infrastructure software engineer who got their start in programming with TI-80 calculators back when GeoCities was still cool. Their work has focused on infrastructure operability, sustainable on-call practices, and the design of effective and empathetic engineering cultures. They are the co-author of O’Reilly’s Effective DevOps and have spoken at numerous industry conferences on devops engineering and culture topics. Ryn lives in Berlin, Germany with a perfectly reasonable number of cats and in their spare time can often be found powerlifting, playing cello, or handcrafting knitted server koozies for the data center.

    Links: ryn.works | LinkedIn - Ryn Daniels | @rynchantress | Effective DevOps | InfoQ Article: Crafting a Resilient Culture
    Transcript
    Mike Julian: This is the Real World DevOps Podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers, I want to introduce you to the most interesting people I can find.


    This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools and that's where Influx comes in. Personally, I'm a huge fan of their products and I often recommend them to my own clients. You're probably familiar with their time series database InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version too. You can check all of this out at InfluxData.com.


    Mike Julian: Hi folks. I'm Mike Julian, your host with Real World DevOps. My guest this week is Ryn Daniels, co-author of O'Reilly's Effective DevOps, a public speaker, and previously an engineer at both Etsy and Travis CI. Ryn, I hear you're working at everyone's favorite infrastructure automation company now. HashiCorp, is it?


    Ryn Daniels: Yes, it is. I'm working on the Terraform ecosystem team. I'm going to be working on the AWS provider.


    Mike Julian: You've been writing and talking a lot about this idea of resilient culture, and you wrote an article for InfoQ, which we'll link in the show notes, about crafting resilient culture, which talked about the Apache snafu. You and I were just talking before the show about an earlier story about Postfix and Puppet and, well, things exploding in your face.


    Ryn Daniels: Yes, so it's a fun story with a little less of a happy ending than the Apache snafu. At my first ops job I inherited two data centers that didn't even have a lonely bash script for company. I was doing everything by hand. There were a lot of dragons and nobody was really sure where the dragons were lurking. One of the things that I was kind of put in charge of was the idea of, "What if we didn't do literally everything manually? What if we had some sort of automation?" So I got to do fun stuff like set up automated Linux installs instead of me going around carrying a USB DVD drive, and yeah.


    Mike Julian: Definitely been there.


    Ryn Daniels: Yeah, those were sad times. So I was starting to put together Puppet and it was mostly going pretty well. I was starting out with what seemed like the safe stuff. And I asked the engineering team, I'm like, "So it seems like Postfix is configured on these servers, but it's not running. Should it be running?" And people talked amongst themselves a little bit and they were like, "Yeah, it should definitely be running because the servers are set up to email us when something goes wrong." Okay.


    Mike Julian: So clearly everything was fine because no emails were going out.


    Ryn Daniels: Exactly. Exactly. So I clear this with everyone. I tell them, "Okay, I'm going to roll out this change." And I turn on Postfix everywhere. And this was my very first ops job, so we didn't have anything like a testing or a staging environment. I was really kind of playing everything by ear at that point and learning as I went. So I turn on Postfix and then a few minutes later somebody says the site's down. Like, how did turning on Postfix take the site down?


    Mike Julian: That's weird.


    Ryn Daniels: And we kind of poke a little bit at one of the servers that I was logged into, and the web server was still running. Everything looked like it should have been fine. What happened was there were eight years of emails queued up on every single server, and when Puppet turned on Postfix, those eight years of queued emails started sending all at once. And the way that networking was or wasn't configured back then, I think I just saturated every single network link in our two data centers with all of these emails, and everyone's like, "Ryn, help, make it stop, get everything back online." I'm like, "I don't know how to un-send eight years' worth of email, folks. We're just going to have to wait this out." Which is kind of what happened. Eventually all of the emails sent and, shockingly, there were a lot of error emails, as it turns out, in this sort of environment.


    Mike Julian: Surprise, surprise.


    Ryn Daniels: Yeah. And after that everyone was a little twitchy anytime I mentioned making a Puppet change. So yeah, it was definitely an exciting afternoon slash couple of days trying to figure out what went wrong with the automation and trying to keep it from going that sideways in the future.
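
    (A practical aside for anyone who inherits a similar time bomb: before enabling a long-dormant Postfix, you can see what it has been hoarding. A minimal sketch, assuming a stock Postfix layout; the last command is destructive and only for mail you're sure nobody wants.)

        # Count messages sitting in the deferred queue; works while Postfix is stopped
        find /var/spool/postfix/deferred -type f | wc -l

        # Once Postfix is running, list the queue (equivalent to mailq)
        postqueue -p

        # Destructive: discard everything queued, e.g. years of stale error mail
        postsuper -d ALL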


    Mike Julian: How did your teammates react to all this? Like aside from like, "Ryn, what have you done?"


    Ryn Daniels: It was mostly just that kind of panic and then everyone trying to figure out what to do. People had differing amounts of visibility into what was going on. There was kind of a homegrown monitoring system that also lived in the data center, which may or may not have been very accessible during this time. Oh, I remember, I was stuck in the data center physically because nothing was configured to have remote, out-of-band access. So most of my days were spent alone in the data center with this ancient MacBook. I think it was still running PowerPC, so I didn't even have Chrome running. And it had so little memory that it could really only run one application at a time.


    Ryn Daniels: So I would get the terminal up and do a thing, and then if I had to look something up, I would have to quit Terminal and open up Safari. And then if I wanted to talk to people in the office, I would have to quit Safari and open up, I think we used AIM. And it was a lot of back and forth and chaos trying to just get a baseline feel for what was going on, and there was definitely a lot of yelling going on in my general direction.


    Mike Julian: Yeah. What was the aftermath like?


    Ryn Daniels: Eventually all of the emails sent and everything went back to normal and people said, "Okay, Ryn, please don't do that again." I'm like, "Well I'm not going to, Postfix is already turned on. I'm moving onto the next thing on my list."


    Mike Julian: Did you find that people were more likely to blame you or Puppet for issues in the future?


    Ryn Daniels: I think the blaming of me was mostly good-natured. This was, well, not the most robust environment I ever worked in, but it was actually not the most blameful. Like, I'm pretty sure that once I quit, people blamed me a lot, because it was kind of the culture that whoever the previous person was, everything was their fault. But when I was there, it was mostly ... Yeah.


    Mike Julian: That's how it works in most places. This reminds me of when I used to work for a national lab and, due to a missing keyword in a Cisco configuration, I took down two entire research buildings.


    Ryn Daniels: Ooh.


    Mike Julian: Yeah, that was fun.


    Ryn Daniels: Yeah, sounds exciting.


    Mike Julian: Yeah, sounds exciting. Basically, if you're configuring a trunk line and you're trying to add a VLAN to a port, if you don't include the "add" keyword, it replaces all the VLANs with the one you specify.
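
    (For the curious, the difference really is one word of interface configuration. The VLAN number here is made up:)

        ! The mistake: replaces the entire allowed-VLAN list with just VLAN 42
        switchport trunk allowed vlan 42

        ! The intended command: appends VLAN 42 to the existing allowed list
        switchport trunk allowed vlan add 42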


    Ryn Daniels: Oh.


    Mike Julian: So there were a few hundred VLANs configured, and I replaced it with one.


    Ryn Daniels: I bet some of those were probably important.


    Mike Julian: Some of them were important. The weirdest thing about that whole situation is that on my first day, the network manager during my onboarding says, "Hey, you're going to make this mistake. Just come tell me when you do it." And I'm like, in hindsight, you know, maybe we should just make that mistake not a thing that you can do.


    Ryn Daniels: That would make sense.


    Mike Julian: Right. That would make sense. But they never did that. After I did it, of course I never did it again, but then I had to train my replacement and say, "Hey, you're going to make this mistake too." This just sounds awful in every way. So it's interesting contrasting that with the well-known story of the Apache snafu. Why don't you tell us a bit about that, like how it differed and basically how the experience went?


    Ryn Daniels: Yeah, so the rather quick version of the Apache snafu: this was when I was working at Etsy, I think in 2015 or 2016. I was working on the tooling that was provisioning servers in the data center. At the time, Etsy was, for the most part, servers running in our own data centers, and something had to get the servers from the state where the data center team had just unboxed them, racked them, and wired everything up, into "this is a useful server that does something nice like serve Etsy.com to people who want to buy yak cozies."


    Mike Julian: Sounds useful.


    Ryn Daniels: Yeah, whatever it is you're buying that's delightful and handcrafted.


    Mike Julian: Definitely yak cozies.


    Ryn Daniels: Yeah, so I was working on this provisioning software, which was a collection of mostly Ruby scripts at that point. And I was getting to the point of, "Okay, all the pieces seem to work individually, but I need to run some end-to-end tests. Can I actually provision a server? Or more importantly, when the data center team gets a whole bunch of new servers, can they actually use this to provision them in a timely manner?"


    So I have my test server and I'm trying to provision it, and one of the later provisioning steps was bootstrapping it into Chef. I was, I think, running a test web server, since that's one of the more common use cases. And Chef failed on the Apache install step. It said, "I can't do this because the version that is pinned in Chef is older than the version that is installed by the Anaconda installer." Now, this had happened a nonzero number of times before, not just to me but to other people, because the Yum mirror was configured to automatically pull down new versions of packages and get rid of the old ones.


    So pretty much anytime that happened, if a new server was being provisioned at that point, this sort of mismatch between the installed version and the pinned version in Chef would happen. The way we were used to dealing with this was, "Okay, manually test to make sure the new version does what you expect it to do, then update the version in Chef." So on my little test server, I manually install the new version of Apache. It was a point release. I remember I even checked the release notes and there was nothing super interesting in them. The way that Chef was configured, bumping the version was only supposed to impact newly provisioned servers, so that all of the existing servers would keep the same version; they wouldn't update.
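
    (The version pin being described maps to something like the following Chef resource. The Etsy cookbooks aren't public, so the package name and version here are illustrative:)

        # Pin Apache to a known package version; bumping this value is what
        # rolls a new version out, intended to affect only newly built servers.
        package 'httpd' do
          version '2.2.15-60.el6'
          action :install
        end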


    Ryn Daniels: This occasionally led to a little bit of config drift across the fleet, but that was the decision that was made. It had been working okay so far; nobody had complained enough to change the process. So I test it by hand and I roll it out and I'm like, "Okay, that was good. Nothing's going to happen. This is going to be a no-op." And then I said, "I should just log into one of the web servers and make sure that Chef does nothing." You can see where this is going, can't you?


    Mike Julian: Right. I love the ... I have a feeling.


    Ryn Daniels: My spidey sense, my spidey ops sense, was tingling a little bit. And so I log into one of the web servers and I run Chef, and it upgrades Apache, and Apache does not upgrade cleanly. It fails to start. And I'm like, "Oh, oh no, I've done a bad thing." It's, I don't know, sometime in the middle of the day, let's say 2:00 or 3:00 in the afternoon, and I realize that I have just rolled out a busted Apache upgrade to the entire production and staging environment all at once.


    Mike Julian: Cue panic.


    Ryn Daniels: It was kind of one of those slow-motion moments where you're like, "Oh God, I can see my whole life flashing before my eyes." And I happened to know that my coworker Pat, who was sitting next to me, was the one on call, and I turned to him and I'm like, "Hey, so you're about to get paged for a whole bunch of stuff. Sorry about that." And then I head into Slack and I jump into the main sysops/webops channel where everyone tended to congregate, especially when there were production issues. I'm like, "So, everyone, I've got good news and bad news. The bad news is I've broken everything. The good news is I'm aware that I've broken everything." And everyone jumped in immediately. They're like, "What can we do? How can we help?" Like, people who were in the office with us who overheard me talking about this and kind of muttering to myself came over and were like, "What's going on? Is there anything we can do?"


    And so people jumped in, and it was really nice to see, based on different people's areas of expertise, people who were really familiar with Apache and the Apache config started poking at that. People started trying to look at, "Oh, is the site impacted?" A fun part of this story is that many, many of the internal monitoring tools in use at the time used Apache as a web server. So all of a sudden, not only is everything mostly on fire, but we can't even really look at what the fire is doing, because the fire observation tools are also on fire.


    Mike Julian: That's incredible.


    Ryn Daniels: Yeah. Yeah. And people are looking at the config and trying to figure it out, but the config didn't change; nothing in it looked like it should have changed. Eventually somebody figured out that if you just ran Chef a second time, everything fixed itself. But at that point nobody was really digging into why, we were just like, "It's the middle of the day. People are trying to buy their cozies. We've got to get this back up." Somebody went to Etsy.com in a web browser, because a lot of the tools were down, and somehow the site was still up.


    Mike Julian: Interesting.


    Ryn Daniels: It was really, really, really, really slow.


    Mike Julian: But it wasn't down.


    Ryn Daniels: But it wasn't down. Like, I technically did not take the site down. So we coordinate the work of running Chef everywhere on all of the various servers. Like, "Let's do this one group at a time and not DoS the Chef server with everything trying to run at the same time, verifying that stuff comes back up." Everything starts to come back up, everything goes back to normal. And this all took place over the longest 20 minutes of my life.
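
    (The group-at-a-time coordination Ryn describes maps to something like this with Chef's knife; the role name and batch size are invented for illustration:)

        # Converge the web tier a few nodes at a time, so the Chef server
        # isn't hammered by every node running chef-client at once.
        knife ssh 'role:web' 'sudo chef-client' --concurrency 10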


    Mike Julian: Yeah. As you're telling the story, I'm like, "All right, this sounds like several hours." But no, it's actually not that long of a time.


    Ryn Daniels: Yeah. Yeah. Given that Chef ran automatically, I think every 10 minutes, and given that the second Chef run fixed it, if I had just ignored it, or not noticed and gone to lunch or something, it would have fixed itself pretty quickly.


    Mike Julian: That's both awesome and a little terrifying.


    Ryn Daniels: Yeah. Yeah.


    Mike Julian: So Etsy was pretty well known for their retrospectives and learning lessons. And they even had the funny-shaped sweater. What was the aftermath of that, compared to the aftermath of the previous story we just heard about?


    Ryn Daniels: There were a lot more high fives. I can't remember what time of year that happened. The three-armed sweater is a physical sweater that was given out once a year, usually in December, to the engineer who, not necessarily broke something in the most spectacular way, but contributed to an incident that we all learned a lot from. And there were a lot of weird little "how did that happen" moments in the Apache snafu. So I ended up at the whatever December all-hands meeting, John Allspaw was handing out the three-armed sweater, and it was awarded to me for this delightful, delightful incident.


    Mike Julian: Bravo.


    Ryn Daniels: And it was the kind of thing where that story would spread throughout engineering and people would come up to me afterwards and they're like, "Oh my God, you won the sweater. That's so cool. Congratulations." There definitely was not any incentive to go break the site, and I'm pretty sure there was fine print in there that trying to get the sweater would disqualify you from getting the sweater. But it was definitely something where people wanted to hear the story, because it was interesting. And so there were actually a lot of warm, fuzzy feelings. I didn't have to worry that people were secretly mad at me. I didn't have to worry that the next time I tried to make a change, people would be like, "No, actually, I don't trust you to do that anymore because you broke something that one time." It was a much more supportive environment, which was really nice.


    Mike Julian: I believe Etsy had a post on this a while back, this idea of a just culture. I think you've been talking about it in terms of resilient culture, and psychological safety plays into this. For those that aren't really aware of what all that means, could you talk more about it?


    Ryn Daniels: Yeah. Resilience engineering is a field that has been around for a while; this is not something new that I came up with. A lot of my thinking on the subject comes from conversations with John back when I was at Etsy and afterwards. And one thing that he likes to say that I really appreciate is the idea that computers and systems can be robust, but only people can be really resilient. So think about these sorts of failures that happen: "Okay, automation caused this thing or the other thing." You were talking about the VLAN incident, and wouldn't it be nice if there was a way to make it so that everyone didn't make the same mistake?


    You can make a system that is kind of robust to these known failures, so that one command that everyone entered wrong, you could write some tooling around that specific command or put it in a little web interface so nobody was entering raw commands on the devices by hand, that sort of thing. But you can only do that for a known set of failures. The problem is there's always going to be the unknown unknowns, the things that you haven't thought of yet because they haven't failed yet. Or you add in some new piece to your infrastructure and all of a sudden because complex systems are complex, you have these new interactions that just didn't exist before that people haven't thought of. And it's kind of how you respond to the unknowns that kind of defines resilience I think.


    Mike Julian: Okay.


    Ryn Daniels: So I like to think about resilience with its opposite being fragility. The story with Puppet and that data center, where nobody knew how to respond and everything just caught on fire, that was really fragile. That whole environment was very fragile, because people responded to the unknowns and to failures with fear. Another story from that job is the one database server. There was just the one. There was no mirroring, there was no sharding, there was just the one that had the data, and of course it was running Mongo. We all love to hate on everyone's favorite NoSQL data store.


    And the RAID array in this one server was degraded. And I went to the engineering managers. I'm like, "So, I need to get some new hard drives to replace the busted ones. Let's plan this work. I want to do this." And they're like, "No, no, you can't do that." "Why not?" "Well, because something might go wrong during the repair process. Can you guarantee that repairing the RAID array will not break it?" I'm like, "No, I can't. That's not a guarantee you can make." And they said, "Well, you can't do it then."


    I'm like, "Okay, let me tell you what I can guarantee is that if you let this raid array sit with 50% of its disks busted, at some point the remaining two discs are going to die. That I can guarantee, and then you will have no data because you have this one database server and it has no backups. Like that is the guarantee that I can make. Given those risks, what if we order some new hard drives and I rebuilt this array?" And they said no, and I did it anyway. Which, I mean, you got to do what you got to do sometimes, but it was that culture of fear and having to do things in secret that was really like the opposite of resilience there.


    Mike Julian: Yeah. That's a really interesting point. What I love about this concept of psychological safety and resilient culture is that people really are at the center of it. A lot of environments seem to divorce the idea of the technology we run from the people operating it, when in fact they're symbiotic. You have to have both in order to have a well-running environment. If someone says, "Hey, I need to work on this system and make this change," and you react with fear, "Oh God, we can't do that," well, you're actually breaking the technology too. And also breaking the people.


    Ryn Daniels: Yeah. I like to say that as engineers and as an engineering organization as a whole, you're not just shipping code, but you are also shipping the entire environment that allows you to ship code and that's culture. That's the people, that's the processes. And if you ship broken processes and if your culture ends up shipping broken people, then you're going to have a bad time.


    Mike Julian: Right. So for a company or a team or a person, who identifies more with this broken culture than the culture that you've been working in and have been building yourself, what can they do to start to shift their own culture? To change what's going on? How can we get a more resilient culture if we don't have one?


    Ryn Daniels: Yeah, I think that's a really interesting question, and a big part of culture change obviously is getting buy-in. People have to want to change the culture specifically, and they have to want to change it enough to actually overcome the inertia. And inertia is such a big factor in cultures and how we work. So getting buy-in is important. There were some really interesting stories at some of the DevOpsDays events; a few years ago, I remember Target gave some really interesting talks about, "Okay, if you have these individual teams throughout the organization who are trying to make these changes, how can they then spread those changes throughout the rest of the organization?" Some really interesting stuff there.


    But there are different things within a culture that you can look at. I think a big part of it is looking at what behaviors people are rewarded for, what sort of incentive structures there are. One of the things that I like to see in a culture is in the skills matrix or career matrix: it should be required that as a senior engineer, staff engineer, what have you at those higher levels, you are helping to create this kind of culture of psychological safety. You should be responding to people asking questions with actual help. There's the stereotype of the BOFH, the grumpy sysadmin who wants to hoard all of the information to themselves, who is never going to help anyone out because that would mean less job security, and who is going to yell at people and make fun of them for not knowing the answer. That's creating a psychologically unsafe culture. That's creating a place where people aren't going to ask questions, and they're not going to tell you what they did wrong, or even what they did.


    Mike Julian: Yeah. I used to find those stories hilarious. The Register has a massive collection of them. I always thought they were hilarious. And then I started actually being a professional and realized, "Wow, that'd be a terrible place to work, and that's a terrible person."


    Ryn Daniels: Yeah. I've definitely worked at places where ... I remember one time, a long time ago, somebody added a new alert to the monitoring system and it kept flapping. There was no context around it and I wanted to know, "Is this important? Should I be worried that this alert keeps firing?" And I, for the life of me, never found out who added that alert, because nobody would tell me, because everyone was so afraid. And it wasn't an "I'm mad" situation. I wasn't some executive, I was on their team, just trying to figure out what was going on. And I couldn't, because they had been yelled at so many times for making normal mistakes. The kinds of mistakes that literally every single person has made if they've interacted with a computer.


    Mike Julian: Oh, that's rough.


    Ryn Daniels: Yeah. And it's the kind of thing where if you have that sort of environment, you're never going to be resilient because people are going to keep more and more information to themselves, and a big part of resilience is learning. And you're not going to be able to learn effectively if you don't actually know what happened.


    Mike Julian: Yeah. So working on getting buy-in is great, and I can see how that's super valuable, but that takes a long time, and you may not have an executive who actually cares that much about it. They may not see the value themselves. Are there any closer-to-home things that someone could do, like within just their team?


    Ryn Daniels: Yeah, people can look at how they behave within their own team, and I think it can really help to try and set up some social scripts. Especially if you have leaders within your team who have been around the organization for a while, so people listen to them a little more, try to get those people to model the behavior that you want to see. One thing that I've found pretty helpful, when thinking about how people get information and how people talk about things, is that if I have a question about how something works, instead of private messaging someone in Slack one-to-one, I will drop that question in a public channel, find the channel where that's most appropriate, and just ask publicly. I'm like, "Yo, I don't know this thing."


    And this is something that, like, okay, I've been working for over a decade now. I've written a book, I've given conference talks. I feel like I have enough cred that people aren't going to question too much whether or not I actually know what I'm doing or belong there. So it's a lot safer for me than for somebody who's more junior to model this behavior. So that's something that you can try to deliberately do: model "here's what it looks like to admit you don't know something or to ask questions," and do that publicly, and have it be okay.


    Mike Julian: I love that advice. As a consultant, I go into a lot of different companies all the time, and one of my big red flags is when I look at a team Slack, or HipChat or whatever they're using, and the team channel has no activity.


    Ryn Daniels: Tumbleweeds. It's so scary.


    Mike Julian: It's so weird. And I immediately know that there's a lot of back channel going on. And this should terrify every team manager too because if there's no discussion happening in the team channel, well, it's happening without their knowledge. It's not that it isn't happening.


    Ryn Daniels: Yeah.


    Mike Julian: Yeah. Like it's always weird.


    Ryn Daniels: Yeah. One thing that I really liked that Etsy did was they had pluses, or imaginary internet points, first in IRC and then in Slack, where people would give each other pluses for answering a question, or for asking a good question, or for making a really good pun. Etsy really loved puns. I appreciate that. But that was the sort of thing where, okay, it takes time to redo your career matrix to make sure that the people who get promoted are the sort of people who are building this sort of environment.


    It's a lot easier to make a little chat robot hand out imaginary internet points. And that can be ... Some people don't like the gamification, and of course there's the problem that if you have some sort of insider clique of the cool kids within your company, other people are going to feel left out. But in the right environment you can have something like that that's a lot smaller, a lot lower friction. So if somebody asks a good question or gives a really helpful answer, you give them some internet points, and that gives literal incentives for those sorts of behaviors that you want to see.
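
    (The mechanics really are that small. A toy sketch of such a "plusplus" bot in Python; the message hook and output format are invented for illustration:)

        import re
        from collections import defaultdict

        karma = defaultdict(int)
        PLUS = re.compile(r"(\w+)\+\+")

        def handle_message(text):
            """Award an imaginary internet point for every 'name++' in a chat message."""
            replies = []
            for name in PLUS.findall(text):
                karma[name] += 1
                replies.append(f"{name} is up to {karma[name]} imaginary internet points")
            return replies

        # handle_message("great answer, ryn++")  ->  ["ryn is up to 1 imaginary internet points"]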


    Mike Julian: Yeah. That small little change can actually have a huge impact. I like that idea a lot. Shifting gears a little bit, we've been talking about the impact of resilient culture on people, and I want to talk about the people side of this. I've been following your blog for a while, and one of the things you started talking about when you were at Etsy was taking care of yourself more. I saw that you took up playing cello and you started doing powerlifting, and all of that's awesome. One of the most interesting things that I saw you start doing was this cupcake ritual. What is that all about? What led to that?


    Ryn Daniels: So my cupcakes were my own variation on something Lara Hogan does with donuts, where she wanted to be more deliberate about celebrating the successes and the wins in her career. Every time she did something that she felt was donut-worthy, she would go get a nice donut, take a picture with it, and talk a little bit about, "Hey, here's this cool thing I did." She wanted to, I think, in a way normalize it, especially for people from underrepresented groups, to talk about, "Hey, we're doing these cool things and it's not a bad thing to celebrate them." So I started doing cupcakes as kind of a little play on that.


    Mike Julian: I love that idea. I think one of the hardest problems of celebrating wins is having to decide what constitutes a win. Like for me, we both wrote a book. If you celebrate a win of "I just shipped a book," then what's the next one after that? It feels like it almost has to be bigger than writing a book. You're like, "Wait a minute. This is obviously going to mean I'll never get cupcakes again."


    Ryn Daniels: I think that's something that I struggled with in recent years. And you mentioned that you'd read my most recent blog post on retiring the cupcake ritual. I think part of it was related to that, where I'd done these things that were on my five-year career goals. I wrote this book, I keynoted Velocity, I'd gotten this job at Etsy that I really loved, and I found myself struggling with where I was trying to go next. And then I had a lot of personal change in my life, moving countries for example. That was a long and involved process, which probably not surprisingly took up a lot of time and brain power and just ability to focus. There were a bunch of other changes happening as well, some on-and-off chronic health issues, and it didn't feel like I was accomplishing anything anymore. Nothing was living up to the previous cupcakes.


    Mike Julian: Right. That has got to be super hard, because the fact that you do something huge doesn't mean that anything that comes after it is now not worthy. How I view it, because of my day-to-day work as a consultant, is I have to look at the tiny wins, and then occasionally I'll get a huge win and that's great, but most of my life is not, "I wrote a book and then I closed a huge deal," and so on. So yeah, I think the hardest part about that whole ritual is just understanding what actually is a win. Have you found that having some external support on that made it better? Like having someone basically call you out and say, "Cut your shit, you just did something awesome."


    Ryn Daniels: That has definitely helped. I definitely have a tendency to be a little hard on myself and to downplay my own accomplishments. And one thing that's been really great is having my partner, who will talk to me and be like, "Yo, you're full of shit. You have done all of these things." We were talking through all of this, and I kind of realized that just because I'm not doing big, concrete, publicly visible things doesn't mean that I'm not still making progress. So for me, stopping with the cupcakes was a way to help me reframe how I think about success and how I think about progress.


    I think, and I mentioned this in the blog post a little bit, one thing is that kind of mid-career progress is going to look different than early career stuff where, okay, once you've done kind of the big things that you wanted to do, which obviously doesn't have to be early career by any means, but once you've done kind of the big things, where do you go from there? Or once you've gotten ... It's usually a pretty clear progression, assuming you're working for a place that thinks about career progress, to get to a senior engineering level, but where you want to go after that, it can branch off. It's not as clearly defined. So I wanted to kind of stop focusing on doing things that looked good when celebrated with a cupcake and kind of just take a step back and think about where I want to go with the next stage of my career.


    Mike Julian: I love that starting that ritual and ending that ritual were both really for the same reason of helping you think better, think differently about your wins.


    Ryn Daniels: Yeah, there's definitely some upsides to celebrating things publicly, because you get support from people. I get to feel like, "Oh, if I'm helping other people feel better about their own progress and helping other people celebrate their own wins, that's awesome." But there's definitely then this pressure of, "I've done all these things publicly. Oh no, did I peak when I was 30? What am I doing with my life now?" And it was honestly really scary to publish that blog post, because it felt like admitting to everyone that I was a failure: now I'm not celebrating cupcakes anymore cause I don't have anything worth celebrating. But I think we need to also normalize that not everything is this big moment. Not everything turns into a cupcake or a big story that you can give a conference talk about; sometimes it's just the little moments that mean the most.


    Mike Julian: Well, on that note, it's been absolutely fantastic having you. Where can more people find out more about you and your work?


    Ryn Daniels: I am on Twitter, @RynChantress, and I blog occasionally at Ryn.works.


    Mike Julian: Awesome. Well, thank you so much for joining.


    Ryn Daniels: Thank you.


    Mike Julian: And to everyone listening in, thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at RealWorldDevOps.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.

  • About the Guest: Baron is the CTO and founder of VividCortex, the best way to see what your production database servers are doing. Baron has written a lot of open source software, and several books including High Performance MySQL. He’s focused his career on learning and teaching about scalability, performance, and observability of systems generally (including the view that teams are systems and culture influences their performance), and databases specifically.
    Links Referenced: xaprb.com | High Performance MySQL: Optimization, Backups, and Replication | eBook: DevOps for the Database | @xaprb | VividCortex
    Transcript
    Mike Julian: This is the Real World DevOps podcast and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps, from the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.


    Mike Julian: Ahh, crash reporting. An oft-forgotten piece of a solid monitoring strategy. If you struggle to replicate bugs or the elusive performance issues you're hearing about from your users, you should check out Raygun. Whether you're responsible for web or mobile applications, Raygun makes it pretty easy to find and diagnose problems in minutes instead of what you usually do, which, if you're anything like me, is ask the nearest person, "Hey, is the app slow for you?" and get a blank stare back because, hey, this is Starbucks and who's the weird guy asking questions about mobile app performance. Anyways, Raygun. My personal thanks to them for helping to make this podcast possible. You can check out their free trial today by going to raygun.com.


    Mike Julian: Hi folks. I'm Mike Julian, your host of the Real World DevOps podcast. My guest this week is Baron Schwartz, the founder and CTO of VividCortex and author of a book that I'm sure many of you have lying around, High Performance MySQL. Welcome to the show, Baron.


    Baron Schwartz: Thanks Mike. Great to be here.


    Mike Julian: Baron, some years ago ... well, I also started writing a book, in like 2015, I guess it was, or 2016. And you've written a lot about your writing. I mean, you are a very prolific writer on your blog, having also written the tome that is High Performance MySQL.


    Baron Schwartz: Maybe too much of a tome. Yeah.


    Mike Julian: Right. It is a pretty sizable book. And I found this article that you had written about how you wrote it, which I found absolutely fascinating. One of the pieces of advice I saw that I've used and have been regurgitating to everyone: how do you write the book? You outline. You start with an outline and then you keep outlining. And you keep and keep and keep on outlining until you're at the point where you feel like you should have stopped outlining long ago, and then you outline some more. Yeah, just a little bit more after that.


    I thought that was incredibly insightful advice, because most people do stop jotting down thoughts a bit too early.


    Baron Schwartz: Yeah. I should take that advice myself. Every time I do it, I'm really happy with the results. The truth is I don't always, and then I'm like, "Why is this so hard?" It's because I'm slogging through not having outlined. So for folks who are not going to read this blog post, which I assume will be a lot of our listeners, the general idea that led me to try this approach was actually writing the second edition. The blog post that you're talking about was written after the third edition of High Performance MySQL. And the second edition was an absolutely horrendous amount of work. I know because I used a timekeeping app, and I basically spent like half a year on it. So it was more than a thousand hours of really hard work.


    And then afterwards I looked at what most of that time was spent doing, and most of it was actually spent editing things that turned into prose too soon, because prose has to have correct grammar. And if you don't feel right about the order of the ideas in a paragraph and you start fussing with it, then the grammar gets mucked up when you're sort of dragging things around or whatever. And so you try and fix that, but you're polishing something that is structurally bad, and to fix it, you end up having to go back and restructure it and then re-edit it. But if you're re-editing for grammar and flow and everything the whole time, it's just an insane amount of work.


    So the third edition, I did using mostly the outlining style that you're talking about, mostly after several years of having just collected notes, which is a part of the writing process that's largely invisible, just like filling files full of one bullet point URLs or quick little one liners or whatever. And I took that stuff, I sort of decomposed the second edition into its component parts, restructured it, put in the stuff that I wanted to update or new material, and then outlined and wrote. And the third edition was about 300 hours of work. Although it was a complete rewrite of the book and made it much larger. So it was a lot more material and a lot more accomplishment with a heck of a lot less work.


    Mike Julian: Yeah. When I first started writing mine, I did actually start tracking time, and then I stopped doing it after a while on the advice of one of the editors of the first edition, a friend of mine, Derek Balling. He gave me a piece of advice: do not ever, under any circumstances, attempt to calculate your hourly rate of writing a book.


    Baron Schwartz: Excellent advice. Yeah. If you do the math, what you end up earning from a book is almost always depressing in purely economic terms. However, I will say ... well, there's a couple of things. High Performance MySQL has been unusually best-selling for a technical book, and very few books ever strike gold that way, which is purely a function of circumstances and luck and things like that. So I think if I calculated my hourly rate, it wouldn't be depressing, but that's-


    Mike Julian: It would also just not be awesome.


    Baron Schwartz: Yeah. It still wouldn't be awesome. But it's been good to get the royalty checks every now and then, which most authors never get. They never pay back their advances; that's part of the publishing world. But even if you do, there's also a lot about the benefits of writing a book that you can't just reduce to an hourly rate. For example, the book lives on, and carries to many audiences, and reaches many people, and touches many lives that you never would have touched in other ways, and that comes back and is very rewarding. People write me on a fairly regular basis and say something. I met with somebody a couple of months ago and he said, "I have to thank you for making me a millionaire." I said, "Well, can I get a finder's fee? Just 1%."


    Mike Julian: Oh, wouldn't that be nice. It's always wild to me when I get the notifications that, "Hey, your book's been translated into another language." I'm like, "What? People care that much?"


    Baron Schwartz: Yeah. I mean, it's so gratifying in so many ways. I recommend the experience, but also just like life skills, and connections, and there's probably 15 or 20 different kinds of things that have come out of this book that I would not have thought of going in.


    Mike Julian: So I saw you recently released an eBook on DevOps and databases.


    Baron Schwartz: Yeah. The newest tome. This one is only 65 pages, not 850 or whatever.


    Mike Julian: That's not so bad. I mean, that's like a weekend project, right?


    Baron Schwartz: It basically was a weekend project after five years.


    Mike Julian: Right. Overnight success 10 years in the making.


    Baron Schwartz: Exactly. Yeah. The way I wrote that book was actually a little bit different. I outlined and outlined and outlined, and I had accumulated a lot of material. I also crowdsourced a lot of it, and so I tried to do a lot of giving credit in the book, because a lot of it was not original material; it was more, "Let's pull together all of this stuff that people have done, this here kind of connects to that over there." So a lot of personal experience, really 15 years of career experience, plus five years of trying to put pieces together. And then one day last November, I think it was, I presented at QCon, and Jessica Carrie had helped me to tie together the stuff that I had been sort of ranting about with databases under the rubric of DevOps, which had not really occurred to me before. So I have to give her a lot of credit for tying all of these things together and recruiting me to give that talk.


    And then since then I gave the talk a couple of times, and did a lot of research, and digging around, and watching other people's talks, and things like that. So all of that came together. I outlined it, tried to organize the content thematically, and then I used a technique that I haven't used a whole lot for writing, which is, I just voice dictated the whole thing.


    Mike Julian: Oh, interesting.


    Baron Schwartz: Basically a weekend project.


    Mike Julian: I mean, speaking to yourself the entire weekend.


    Baron Schwartz: Yeah. Because you know what? I was sort of in a mental place where I just needed to somehow get the energy and momentum to plow through the whole thing or it wasn't going to get done. And the best way for me to do that sometimes is to hook myself up to a microphone, double-press the function key, and start dictating. And especially to dictate in a ranty kind of way, which helps me tie together things that I feel strongly about in a very fluid way. Very, very much stream of consciousness. So that book, when it actually got into characters in a Google doc, was not outlined like that. It was dictated and ranted. And then I basically crowdsourced that very crude first draft to a bunch of people, just asking on Twitter, "Hey, who wants ..." And then I shared the Google doc with them. And pretty much people said, "This really sounds very ranty."


    Mike Julian: I wonder why?


    Baron Schwartz: Yeah, I wonder why? And, "Here are some things that you can do to make it better." And then I went back through and edited, but I had been working from a fairly strong outline, so I ended up not having to do a lot of wordsmithing at the structural level. So basically what I needed to do was get the unsavory tone of voice and all of the things that come with a rant out of there as much as I could. It probably isn't completely there; there are probably still some places where you feel like I'm lecturing a little bit too much or whatever as you read it. But I think if there were a mythical perfect book, it would fall in the center of the bullseye, and books all start somewhere around the rim and they gravitate towards the center, but they never get there.


    Mike Julian: Yeah, exactly.


    Baron Schwartz: I just came from a different point on the rim.


    Mike Julian: Right. So what I want is for you to tell us ... I know the book is about DevOps and databases, but what's actually in there?


    Baron Schwartz: Yeah, there's a bunch of stuff. A bunch of it is things that other people have been thinking about for a long time. A lot of it was inspiration from our customers. For those who don't know me, VividCortex is a database performance monitoring company, and our customers use us to observe large fleets of databases running at scale, typically performance-critical, where it's really a big data problem to understand what these databases are doing. And we were noticing that some customers were very successful with this, and they would report to us that they were moving really fast and engineering productivity was going way up. And we would do onsite visits. I would do onsite visits with them and see the evidence. I was like, "These teams are awesome, not just in databases, not just in monitoring, but just across the board." They are doing really awesome work.


    I kept trying to figure out what was different about those companies, versus the companies that would have a champion who would bring us in, where we would get some traction, and then after a year they wouldn't renew. And just kind of putting the pieces together a little bit over time, and also with some things that I feel really strongly about. Like, where is your critical productivity bottleneck in your engineering team? It's typically not where people think it is, and if you read books like The DevOps Handbook, or The Phoenix Project, or The Goal, they focus on the sneaky nature of bottlenecks coming from dependencies, and variation, and people with specialization. And these things are so much more costly than anybody ever expects them to be.


    And I had been sort of ranting about this as, "Your DBA is a great resource for your company, but if you make them the critical bottleneck, then everybody suffers by having to wait for the DBA." And I just really couldn't connect that into DevOps for a long time until Jessica helped me to see that. And so that's one of the big things that I wrote about in the book that I think is actually, I don't know if it's revolutionary, but I don't know if people have really written a lot about it before. Certainly, Silvia Botros was I think one of your first guests on this podcast and she was talking about some --


    Mike Julian: The very first guest.


    Baron Schwartz: Yeah, the very first. She was certainly talking about that. She's written blog posts like "a DBA is a priesthood no more." I'm not the only person. But I didn't feel like it was an industry-wide thing that everybody recognized and was talking about, that making your DBA a specialized role is a bottleneck for the entire engineering team. So that's one of the big themes of this book. Especially in the database world, DevOps is often kind of defined as automation: we've automated the DBA grunt work. But the really high performing companies that I've worked with, and the ones that have been the most successful with VividCortex, are often the ones who take it beyond "Let's make the DBA efficient" into "Let's help the developers to own more of the application lifecycle." So it's more like the Netflix philosophy of full lifecycle product ownership rather than, "Let's collaborate really well together across these silos."


    Mike Julian: Yeah, that's an interesting point. I want to dig into that a bit more. How does that actually look in practice? How can an engineering team own more of the database work?


    Baron Schwartz: Yeah, there's a few things. One is, there's a lot of those things like company culture, and team structure, and who's responsible for what: what are your goals and incentives, what is rewarded, what is promoted, what is seen as high status or low status work, things like that. And so those are kind of subtle things that are difficult to be very prescriptive about. I mean, they're not actually very subtle, but subtle differences in how you apply them can make a big difference in the outcome.


    So I'm thinking of ... I think I can speak pretty openly about some of our customers, like DraftKings, and certainly about SendGrid, where Silvia works. There are teams there who desire to own more, and have folks who are curious and interested in the database and don't feel like they have to be, or want to be, protectively kept away from something, or that something is too hard for them to embrace and dig in and get really good at. Right? So part of it starts with having really high quality people who you then empower by breaking down some of the things that become functional barriers. Nobody's going to feel really great about being a developer who becomes awesome with databases or SQL if they also can't get visibility into the databases. For example, if they're forbidden from monitoring them.


    So that's particularly near and dear to my heart, because that's what we do: provide production visibility into databases. But there are other things as well. It's not just about monitoring. There are things like automation, automating the deploy pipeline, decomposing or decoupling things in such a way that you can release the application without changing the database and you can change the database without having to rerelease the application. And there's a lot of good material in there. There are very detailed case studies that I cited and linked to from companies like Zalando, and Flow, and the Gilt Groupe, and so forth about how to do all of those things. So this is not a theoretical book of preachy principles or something like that. The principles are there just to organize and structure things, but then it gets down into the nuts and bolts. But it's also a short book. It's only 65 pages, so you can't actually get into code listings. So instead of doing that, I just link out a lot to people's presentations, and other eBooks, and blog posts, and things like that.
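

    One of the patterns those case studies describe, releasing the application and the database independently, is often done with an expand/contract migration. Below is a minimal sketch of that idea, assuming a hypothetical users table and using SQLite purely for illustration; it is not an example from the book.

    ```python
    # Minimal expand/contract sketch: schema changes ship in
    # backward-compatible steps so app and database releases decouple.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

    # Expand: add new columns. Old app versions ignore them, so this
    # migration can ship before any application release.
    conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
    conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

    # Migrate: backfill the new columns from the old one.
    conn.execute("""
        UPDATE users
        SET first_name = substr(name, 1, instr(name, ' ') - 1),
            last_name  = substr(name, instr(name, ' ') + 1)
        WHERE name LIKE '% %'
    """)

    # Next, release an app version that writes both old and new columns,
    # then one that reads only the new columns.

    # Contract: once no running app version reads `name`, drop it in a
    # later, independent migration (SQLite supports DROP COLUMN >= 3.35):
    # conn.execute("ALTER TABLE users DROP COLUMN name")

    print(conn.execute("SELECT first_name, last_name FROM users").fetchall())
    ```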


    Mike Julian: That's wonderful. So, yeah, that's awesome. I'll include a link to that eBook in the show notes too. What I think is really interesting about some stuff you mentioned there is the cultural aspects and curiosity. High performing teams, it's not just that they do X, Y, and Z as tactical things, but how they think, how they work together, how they approach a problem holistically. It seems that to have a really high performing team, you actually have to focus a lot more on the human side of things.


    Baron Schwartz: It very much is. Yeah. Yeah. And there's a lot of stuff that I included and cited on that. So there's this weird thing about culture. We talk a lot about DevOps as being very culture focused, and that DevOps is a culture, or there are various ways to say it, but culture is kind of a squishy topic because you don't just-


    Mike Julian: Oh, yeah, right.


    Baron Schwartz: You don't 'change culture'. That's a recipe for disaster. Somebody tweeted a little while ago ... I can't remember who it was, but somebody tweeted that if you take a poor culture and you bring great people into it, you don't fix the culture, you break the great people.


    Mike Julian: Right. Right. Yeah. I've seen that.


    Baron Schwartz: Yes. Yeah. I have too. Yeah, absolutely. And another related tweet that comes up while I'm brainstorming about tweets is Kelsey Hightower tweeted pretty recently, "Get the right people on the bus, but only after you fired the wrong people." And I've also seen that that's super important, because you can only keep either your high performing people or your low performing people, but not both.


    That's a very loaded topic, because what does high performing or low performing mean, and have they been supported, and did you set them up to fail? Lots and lots of nuances there. But if you just want sort of a hot take, I think all of the things that I said are true when you kind of understand the depth of thinking that can be behind them. But anyway, back to culture. Basically you don't change culture directly. Instead you do things like changing an incentive, or giving people a different tool to make the right thing easier, or changing what people do. For example, there are lots and lots of culture change things that are just changes in which order people do a task, and then suddenly something else follows from that. There is a book that I'm trying to dredge out of my memory that touches on this topic and I can't quite get it. Maybe I'll think of it later and put it in the notes, or maybe you know it.


    Mike Julian: Yeah, absolutely. I worked with a company some years ago that worked with Pivotal. Now, a lot of people know Pivotal from Pivotal Tracker or Pivotal Web Services, but what a lot of people don't quite see is they also have a consulting wing where you can go in as a company, send part of your team, and work with Pivotal at their offices for a few days, weeks, months, whatever you want to pay them for. And I had the experience of being with a company that paid Pivotal to help us with launching a new project. So I was at their office for three, four months. And one of the coolest things about it is ... the best way to describe Pivotal's approach to Agile is zealotry. And that sounds negative. And in some ways at the time, I was really frustrated by this, but now I look back and I see, "Oh, actually that was very much intentional. It's probably not how it is all the time."


    But what they were doing is things like: work starts at 9:00 AM, work ends at 5:00 PM. If you're here before then working, we're going to ask you to stop. If you're here after then working, we're also going to ask you to stop. Lunch happens at noon; the bell rings and you have to stop working. That sounds terrible. And then you look at what actually happened as a result of that: people had a much healthier approach to work, and things like you were always pairing when writing code, nearly 100% of the time. That was frustrating for me. That was frustrating for basically everyone on my team. Well, the interesting thing that actually happened as a result of this was when we stopped working with Pivotal, we all went back. Well, there was a big contingent of the company that hadn't had that experience, but there were about 20 of us that had, and we came back with all the stuff that we had been doing for the past three, four months.


    These were very tactical processes. This was just how we did things. And then you get to see the culture conflict. Two completely separate teams that used to be one are now working very differently.


    Baron Schwartz: Oh, wow.


    Mike Julian: Yeah. It was wild to see 'cause you see all the zealotry paid off. It wasn't really about them being zealots about what they were doing. It was about, "We're teaching you a new way to work."


    Baron Schwartz: Yeah. That sounds like a really interesting experience. Yeah.


    Mike Julian: Oh, it was absolutely a fascinating experience that kind of ... what made me think of that is, if you change how you do a task and then you just repeat that whole thing of like, "I have task one through 10 and I'm going to change how I'm doing all 10 of these tasks." Over time you change culture. Culture doesn't change first.


    Baron Schwartz: That's right. Yeah. I think it's much more successful the way that you're describing. Yeah.


    Mike Julian: So, Nicole Forsgren's Accelerate and her State of DevOps reports. The book talks a lot about that as well.


    Baron Schwartz: Yeah. They've got entire sections on culture change, and the Westrum generative culture or organizational structure, and the consequences, and what creates, or sustains, or fights against those kinds of things. And everything in there resonates with my personal experience. And what's really cool is that now there's science. It's no longer anecdotes from people who've sort of been there, done that, let me tell you. And it's no longer simply, "Well, it's a bestselling business book, so it must be right." Right? Now we actually have legit, well-constructed research with scientifically valid results. I mean, it's incredible. It cannot be overstated what a value Accelerate, and Nicole, and the team have created for the whole community. I think it's incredible.


    Mike Julian: Completely agreed. One of the things that's been coming up a lot lately, and I know you've been discussing it on Twitter and through your blog posts, is the value of diversity as a performance factor, the idea that diverse teams are better teams. Also, diversity is good from a moral and ethical standpoint. We should be diverse because the world is diverse.


    Baron Schwartz: Exactly. Yeah. It seems hard to argue against it, but you find yourself having to make business justification arguments for things sometimes, which feels absurd. It feels like an argument that maybe shouldn't even have to be engaged. Thank goodness, again, Accelerate is there with the data, and the DORA report brings the numbers that you can use to whack people that want to be whacked with a number stick. And for those of us who are more focused on the human side of things, it's a little bit tricky, because both kinds of justification are good and work for different people. And at the same time it's a little bit of a shame, maybe even a lot of a shame, that we actually have to resort to financial benefit, return on investment arguments to advocate for people being equally valuable.


    Mike Julian: Yeah. I saw a tweet a while back that, it said, "I don't know how to explain to you how to care about other people or why you should care about other people." And that stuck with me because it's so directly and eloquently said like, "Well, yeah, you're right. Other people are valuable because they are other people."


    Baron Schwartz: Right. There's one that I've put in a lot of my slide decks as well that says something along the lines ... I'm going to have to find this for your notes as well later. It says something along the lines of good teams are composed of people actively caring about each other.


    Mike Julian: Oh, that's a good one.


    Baron Schwartz: Yeah, I'll find it. We'll put it in the notes 'cause I know I've got that bookmarked all over the place.


    Mike Julian: Man. That's a good one.


    Baron Schwartz: There's all of this stuff that came from Google's research, which actually came from somebody at MIT, I think, about psychological safety, and what that really means, and what creates psychological safety in teams and things like that. And it can get kind of long-winded. And then this one little tweet kind of sums it up. I've been trying to do this in my personal life too. Just personally, I'm a self-improver. I'm a seeker. And a while ago I realized that there were things about my attitudes and beliefs and biases that I didn't like. For example, I don't know if you've done this, but Project Implicit, which is out of Harvard. I think it's probably like implicit.harvard.edu or something like that.


    Mike Julian: Yeah. I took that and it was depressing.


    Baron Schwartz: Yes. Yeah. Basically I'm a racist.


    Mike Julian: Yeah. That's what I got too, and it was depressing not because I disagreed with it, since it is valid research, it's backed scientifically, but because it exposes that part of yourself that you don't want to think about. It exposes that, "Hey, there's a whole lot that you have been brought up around that you don't even think about."


    Baron Schwartz: Just completely unconscious.


    Mike Julian: Yeah. That's the entire point of implicit bias: you have biases that you don't think about, but they are there.


    Baron Schwartz: Yeah. Yeah. And we're two White guys for people who are listening and can't see us, but for two White guys, this comes as like a complete revelation. I don't know about you, but I think I first took that implicit bias test when I was well into my 30s and probably approaching 40. And it just made me realize that there were entire things that I had never thought about and would never have occurred to me to think about, entire realms of curiosity about my experience in the world that were just ... it was a complete revelation to me. And I started to lean into that. I started to think, "Well, what else is there?"


    So a couple of things that have been super helpful to me and that I recommend to everybody. Well, one of them I recommend to everybody who is male and another I recommend to everybody who is White. There's a podcast series from Scene on Radio called Men, which is all about what is maleness, and where did it come from, and what does it mean? And it tries to expose, or to help men experience, their maleness, because it's so unconscious. It's the default mode of living. It's the fish swimming in the water, as a metaphor they use. The fish is not aware of the water. And it was so helpful to me to start to realize this water that I swim in. And then the series prior to that was called Seeing White. And it's centered around exactly the same thing.


    As a White guy, what is Whiteness? Where did it come from? What does it mean? How does it work? How am I a White person operating in Whiteness? And so it kind of comes back to this implicit bias stuff. And it showed me, or gave me access to, a different mode of inquiry that has led me deeper since then. And the one thing that I want to emphasize from here, I think I've said several times before about these things, is how helpful they've been. But I don't know if I've really said how much joy I've gotten out of starting to realize the variety of human experience that was kind of denied to me, really. I was very much cut off from a huge spectrum of modes of being by virtue of unconsciously being trained to perform maleness and to operate in Whiteness and all of these kinds of things.


    And so one of the things I've done is I've intentionally diversified my feed. I don't follow very many White male voices, and I'm not meaning this in any sort of accusing or holier-than-thou kind of way or whatever. But I think my feed, the people that I follow on Twitter, is something like 85% women, and a lot of Black women, because they live at the intersection of two different types of oppression. And you hear a truth from them, because of their direct experience day by day, minute by minute, that is accessible if you listen to those voices. But if you don't really make a conscious effort to do it, you can live your entire life in the bubble of these things never occurring to you.


    What's probably coming through in what I'm talking about here is mostly, "Oh wow, what a revelation." But the part of it that might not come through, or might not be obvious until folks experience it for themselves, is, "Wow, what a joy. It's just so enriching to understand and then to have access to different ways of being." Yeah. I can only recommend it to everybody. Listen to Black people, listen to women, listen to minorities, listen to disabled people, listen to people who are not in tech. Think about it. If the links that are coming through in your Twitter or Facebook or whatever are mostly about, I dunno, so-and-so invented a new algorithm for sorting or whatever, it's like, "Okay, big deal." You're going to find out everything that you need to know and much more. You're saturated with all of this kind of stuff. I know I was when I was just following who is the majority population in tech. But following poets, following journalists: such richness and such joy.


    Mike Julian: Oh yeah, absolutely. And I can kind of echo some of the stuff you've experienced. Years ago when I first started really waking up, I had had this internal narrative that I was successful because I had made myself successful.


    Baron Schwartz: Yeah. Me too.


    Mike Julian: Right. And my parents and family and all of this were all looking at me saying, "Well, clearly you've made yourself successful." After a while I started to realize, "Well, no, actually there were a lot of experiences that I had that would not have been available to other people simply because I'm White." The old boys' network is definitely a thing.


    Baron Schwartz: Totally.


    Mike Julian: My breakout job that really kicked off my career happened about halfway through my career. But I got it because of the people that I knew, and to me it's like, "Oh well, anyone can build a network." And that's what I did. I consciously built a network. Well, the fact that I consciously did it is valuable, but that I was able to have that opportunity to begin with was something that I'd never even considered.


    Baron Schwartz: Yeah. When I started thinking about these things ... like, I listened to the Seeing White series first, and I started thinking about all of the ways that affirmative action has buoyed me and been the wind in my sails my entire life. I mean, I'll just give you one example. I started to think back to how I got into college. I was homeschooled. I got into UVA, one of the top colleges in America, without ever having taken the SAT, so I can probably just leave it right there. There's probably a lot of heads nodding and going, "Mm-hmm (affirmative)." Right. But I got into community college first and I had to take remedial courses, because there were areas where homeschooling had left me dramatically underprepared. And at the same time, my parents had told me if I wanted to go to college, I was going to have to pay for it myself.


    And the financial aid system just ushered me in, and I actually did some stuff that I shouldn't have done. I realized it later. I started feeling like, "Wow, those are kind of not really cool things that I did." I went back to the financial aid people and they said, "You can't fix this now. It's water under the bridge. Forget about it." Now, these are not federal felonies or whatever, but I was sort of feeling like I was between a rock and a hard place, being out on my own in the world and having some trouble with money and things like that. There are definitely people who would not have been so kindly forgiven by the system. There is no question that being White has put me several steps ahead at the starting gun many, many times, and being male many, many times.

    Look, it's not as though I should seek to tear away these privileges and throw them on the ground and trample them because it's power. Instead, I should use this power for good. I should use it to raise up and help people who don't have that power, should use it to advocate for them when there are folks who don't recognize that not everybody comes with those same levels of power, and privilege, and advantage. So those are some of the kinds of things that I do. And I have honestly come from a very difficult upbringing and experienced a lot of traumatic things and although I also have these privileges at the same time, and so what this combination has done for me, because I've been bullied, and beaten, and abused, and all of these other kinds of things, there's a very strong trigger in me when I see that happening.


    And by listening to the voices of people who have come from different places in the social advantage system, when I see that happening now the triggers kick in and I'm much more likely to jump in and try and help out. Even if that is just seeing that somebody is having a tough time and then just jumping into their DM and saying, "I see you're having a tough time or whatever it is."


    Mike Julian: You mentioned that learning about all this has brought you joy. And for me it's kind of the same thing. I enjoy helping other people. That's what I love doing. I love making connections with other people. I know tons of people in a lot of different places through just a lifetime of making new friends.


    Once I was able to understand that I have this privilege and a lot of other people don't, that a lot of other people will never have these connections that I do, will never have the access that I might, now that I know this, I get joy from making introductions. And those introductions lead to changing of lives, which makes me even happier and also betters the world.


    Baron Schwartz: Yeah. Being able to help people in a very real and genuine way is a source of enormous gratitude, and gratitude is the source of happiness. If you want to be happy, don't try to be happy; try to be grateful, and become conscious of what's going on around you. I'm a big follower of Thich Nhat Hanh. And I was listening to one of his recordings the other day.


    One of the things that he does through mindfulness, just coaching people to be present, to have access, and to be established in the here and now, is to recognize the small things that are a miracle. You and I are talking to each other on computers thousands of miles away. It's a miracle.


    Mike Julian: It really is. It blows my mind.


    Baron Schwartz: It's a thousand miracles. But in this little recording of his, he was talking about, "You turn on the tap water, and you put your hands under it, and you recognize that the water running into your house is a miracle. You didn't have to walk miles with a bucket to get that water. It's a miracle." So there are so many of these things that are miraculous and recognizing the differences in experiences and being able to make a little bit of a difference for someone else is enormously gratifying.


    Mike Julian: Well, I think that's a fantastic place to wrap up our conversation. Baron, thank you so much for joining me. Where can people find out more about you and your work?


    Baron Schwartz: I'm on Twitter. That's my main social network, @xaprb, and my personal blog is xaprb.com. And I may be coming soon to a conference near you, such as Velocity or something like that. Those are the typical types of conferences that I'm at. Tweet me, email me. You can email me [email protected] or [email protected] for personal stuff. And I always love hearing from people.


    Mike Julian: And don't forget to grab a copy of his latest eBook too.


    Baron Schwartz: Yes. And please tell me what you think of it and how to improve it.


    Mike Julian: Well, to our listeners, thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


  • About the GuestDr. Nicole Forsgren does research and strategy at Google Cloud following the acquisition of her startup DevOps Research and Assessment (DORA) by Google. She is co-author of the book Accelerate: The Science of Lean Software and DevOps, and is best known for her work measuring the technology process and as the lead investigator on the largest DevOps studies to date. She has been an entrepreneur, professor, sysadmin, and performance engineer. Nicole’s work has been published in several peer-reviewed journals. Nicole earned her PhD in Management Information Systems from the University of Arizona, and is a Research Affiliate at Clemson University and Florida International University.
    Links Referenced: 2019 State of DevOps Survey, Previous State of DevOps Reports

    Transcript

    Mike Julian: This is The Real World DevOps Podcast, and I'm your host Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, and the authors of great books to fantastic public speakers, I want to introduce you to the most interesting people I can find.


    Mike Julian: Ah, crash reporting. The oft-forgotten piece of a solid monitoring strategy. Do you struggle to replicate bugs, or elusive performance issues you're hearing about from your users? You should check out Raygun. Whether you're responsible for web or mobile applications, Raygun makes it pretty easy to find and diagnose problems in minutes instead of what you usually do, which if you're anything like me, is ask the nearest person, "Hey, is the app slow for you?" And getting a blank stare back because hey, this is Starbucks, and who's the weird guy asking questions about mobile app performance? Anyways, Raygun, my personal thanks to them for helping to make this podcast possible. You can check out their free trial today by going to raygun.com.


    Mike Julian: Hi folks. I'm Mike Julian, your host for the Real World DevOps Podcast. My guest this week is Dr. Nicole Forsgren. You may know her as the author of the book Accelerate: The Science of Lean Software and DevOps, or perhaps as a researcher behind the annual State of DevOps report. Of course, that's not all. She's also the founder of DevOps Research and Assessment (DORA), recently acquired by Google, was a Professor of Management Information Systems and Accounting, and has also been a performance engineer and sysadmin. To say I'm excited to talk to you is probably an understatement here. So, welcome to the show.


    Nicole Forsgren: Thank you. It's a pleasure to be here. I'm so glad we finally connected. How long have we been trying to do this?


    Mike Julian: Months. I think I reached out to you, it's March now. I reached out in November, and you're like, "Well, you know, I have all this other stuff going on, and by the way, my company was acquired."


    Nicole Forsgren: Well, back then, I had to be sly, right? I had to be like, "I've got this real big project. I'm sorry. Can we meet later?" And, God bless, you were very gracious and kind, and you said, "Sure-"


    Mike Julian: Well thank you.


    Nicole Forsgren: ... "we can chat later." And then I think you actually sent me a message after saying, "Oh, congrats on your 'big project'." I said, "Thank you."


    Mike Julian: That sounds about right.


    Nicole Forsgren: I appreciate it. Yeah. And then, you reached out again, and I said, "Oh, I'm actually working on another big project. But, this time ..."


    Mike Julian: It's not an acquisition.


    Nicole Forsgren: Yeah, it's not an acquisition. This time, it's a normal big project, and it's this year's State of DevOps report. And we just launched the survey, so I'm super excited we're collecting data again.


    Mike Julian: So we can get that right out of the way, where can you find the State of DevOps report?


    Nicole Forsgren: All of the State of DevOps reports are hosted at DORA's site. We still have the site up. All of the reports that we've been involved in, from, I want to say, 2014 when we started, I'm so old I already forgot, are hosted there. We'll post them in the show notes. Grab yourself a Diet Coke or coffee or a tea or a water, or if you want, a bourbon. Get comfortable. Sit back; it takes about 25 minutes. I know, right, everyone's like, "Girl, 25 minutes?"


    Mike Julian: That's a big survey.


    Nicole Forsgren: I know. It is. But it's because the State of DevOps report is scientific, right? We study prediction, and not just correlation. But sit back, get comfy, and let me know what it's like to do your work. Because we're digging into some additional things this year: productivity, tool chains, additional things around burnout and happiness, and how we can get into flow, and really what that looks like. And something really great is that a bunch of people have already chimed in after taking the survey in really thoughtful ways. Also, by the way, I love you all for taking it if you have. Share it with your colleagues, share it with your peers.


    But they've said that just by taking the survey, they've already come away, even before the report has come out, they've already walked away with really interesting ideas and tips and insights about how they can make their work better.


    Mike Julian: Yeah, that's wild to think about, that the act of taking a survey actually improves my work. Because most surveys I take, I'm finished, I'm like, "Well, that was kind of a waste of time." It feels like I just gave away a bunch of stuff without getting anything.


    Nicole Forsgren: Yeah, and I think the reason it works that way is because we're so careful about the way we write questions that sometimes just the act of taking the survey helps you think about the way you do your work. So just the act of kind of taking some of these questions helps people think about what they're doing. And then, of course, like I joked already, it's my circle of life, the survey will be open until May 3rd and then I will go into data analysis and report writing. And we expect the report itself to come out about mid-August.


    Mike Julian: Well, why don't we take a few steps back and say ... Everyone loves a good origin story. I believe you and I met at a LISA many, many years ago. You were giving a joint workshop with Carolyn Rowland on-


    Nicole Forsgren: Oh, I love Carolyn.


    Mike Julian: Yes, she's also wonderful. I should have her on here.


    Nicole Forsgren: My twin. Yes. Absolutely.


    Mike Julian: So you were a professor then when I first met you. I'm like, you know that's kind of interesting that a professor's hanging out at a LISA and giving all this great advice on how to understand business value, which I thought was absolutely fascinating. Professor, hanging out in the DevOps world, how'd that happen?


    Nicole Forsgren: Oh my gosh. Okay so, the interesting thing is, I actually started in industry. My very first job was on a mainframe, writing medical systems, and then writing finance systems. So I was a mainframe programmer. And then I supported my mainframe systems, right? Which is how so many of us in Ops got our start in Ops: someone was like, "Well, somebody's gotta run this nonsense." Right? I was still in school, and then I ended up as a dev, right? I was a Software Engineer at IBM for several years, and then pivoted into academia. Went and got a PhD, where I started asking questions about how to analyze systems, so I was actually doing NLP, natural language processing.


    Mike Julian: Interesting.


    Nicole Forsgren: Yeah, I was doing…


    Mike Julian: Yeah, that's a weird entry point into that. Definitely not what I would have expected.


    Nicole Forsgren: Yeah, so the crazy thing, my first year was actually deception detection.


    Mike Julian: I bet that's awesome.


    Nicole Forsgren: It was really interesting, it was super fun. But I leveraged so much of my background from systems work, right? Because what do we do? We analyze system logs.


    Mike Julian: Right.


    Nicole Forsgren: Right? We're so used to analyzing a ton of data in a messy format, many times text based, super noisy, can't always trust it, right? Right now people are like, "I can't trust surveys. People lie." Kids, so do our systems.


    Mike Julian: All the time.


    Nicole Forsgren: Right? And so, they loved me for a bunch of this work. All of a sudden, I randomly did a usability study with sysadmins. We wrote up the results, gave them back to IBM, and IBM was like, "Well, what do you mean? We followed UCD guidelines, user-centered design guidelines. This should be applicable." And I was like, "Wait, whoa whoa whoa whoa, what?"


    At the time, they had one set of UCD guidelines, for all users. Super super advanced, high level advanced, sysadmins, who were doing back-up, disaster recovery, everything. And people who had bought a laptop and were using email for the first time in their lives.


    Mike Julian: I'm sure that went over super well.


    Nicole Forsgren: What? I'm like, "That's it. Changing my dissertation." Which of course panicked my advisors. They were like, "You're gonna what?" So I started doing what, at the time, was kind of the groundwork for DevOps. Which is, how do you understand and predict information systems? And by information systems: technology, automation, usage and prediction, and then outcomes and impacts at the individual, team, and organizational level.


    Which now, I say all that, that's big words, that's academic words, for basically what's DevOps. How do I understand when people use automation and process and tooling and culture, and how do I know that it rolls up to make a difference and add value? Which now we're like, "Oh, that's DevOps."


    This is late 2007.


    Mike Julian: Oh wow. So you were early days with us.


    Nicole Forsgren: Yeah. It was a really interesting parallel track, because now we look back and we're like, oh this is about 10 years ago. That was kind of the nascent origins about the same time as DevOps, right? So, so many of us kind of stumbled into it about the same time. I had no idea this was happening in industry. I kept plugging away, I kept doing it, stumbled into LISA, trying to connect data, of course, like every good academic does. Desperately trying to find data.


    Stumbled into, bumped into a group collecting similar things but using different, rough methods. A team from a cute little configuration management startup called Puppet, right? Started working with them, invited myself onto the project. God bless them, I have so much love and respect for them, because they basically let this random, random academic tear apart their study and redo it, and lovingly tell these two dudes I had never met before, on the phone, named Gene and Jez, that they were doing everything wrong and that this word they were using wasn't the right word. Redid the State of DevOps report in late 2013, made it academically rigorous, and then kept going for several years, right? And then suddenly, we redid a bunch of stuff after a couple years.


    I left academia, walked away from what was about to be tenure, to go to another cute little configuration management startup called Chef. That was fun, right? So I'm working on the report with Puppet, and working for Chef, and continuing to do research and work with organizations and companies. And I left academia in part because I was seeing this crazy DevOps thing make a difference, but in academia, they weren't quite getting it yet. And I wanted to make sure I could make a bigger difference, because I'd started working in tech in college in '98, '99, 2000; we lived through this crazy dot-com bust.


    And it wasn't a bust because everything crashed and the world ended like people thought, but companies failed, and it had huge implications and impacts for what happens to people. They lose their jobs, it breaks apart families, they get depressed, it impacts their lives; some people were committing suicide. And I was so worried about what happens when we hit this wave again, and we're starting to see that hit again. So what happens if companies and organizations don't understand smart ways to make technology? Because you can't just keep throwing people at the problem, or throwing the same people at the problem. And when I say throwing the same people, I mean seven-day forced marches.


    I was at IBM when they made us do that, right? They got pulled into a class action lawsuit, you can't do that. That's not a way to live.


    Mike Julian: Yeah, I've been on many of those, they're brutal. And they don't result in anything useful.


    Nicole Forsgren: It's just broken hearts and broken lives, right? And so, some people say, "You really care about this." I'm just this nerd academic who cares too much about what I do. And so if we really can fundamentally change the way that people make software, and if it will in fact make their lives better ... let's do it.


    And then, thank God, what we found is that it really does. Sure, it's nice that it delivers value to the business, but what that does is help them make smarter investments, which in turn reduces burnout. It makes people happier, it makes their lives better, and I think that's the part that's important.


    Mike Julian: So what you've been finding is that by a company implementing all these better practices of continuous deployment, and faster time to delivery, faster time to value ... it makes the lives of the people doing the work better?


    Nicole Forsgren: Yeah and John Shook has found this as well. Right? He did this great work in Lean, in that in order to change ... some people have said like, "How do you change culture?" Let's find ways to change culture. Sometimes the best ways to change culture is to change the way you do your work and I'm sure we've seen that ourselves, right? In other aspects of our lives. To change the way we feel, to change the way our family works, to change the way our relationships work. You actually physically change your lived experience, or some aspect of your lived experience.


    And so if we change the way that we make our software, we will change the way that our teams function, which is changing the way that the culture is. And so, said another way, if we can tell our organizations which smart investments to make in technology and process, then we can also improve the culture. We can also change the lives of the people, right? And the Microsoft Bing team found this, right? They wanted to make smart investments in continuous delivery.


    And in one year, they saw work life scores go from, I'm pulling this off the top of my head, but I want to say it went from 38% to 75%. That's huge.


    Mike Julian: That's an incredible jump.


    Nicole Forsgren: Right. And it's because people are able to leave work at work and then go home. You can go see your families, you can go to a movie, you can go eat, you can have hobbies, or you can go binge watch Grey's Anatomy. You can do what you want.


    Mike Julian: That's one of the most incredible things to me: there's this idea that in order for a company to be successful, they have to push their employees, kind of put them through the wringer. Intuitively, that's never felt right. And you actually have data that shows that's not right. Doing these things actually makes everyone better. The business improves dramatically, the people's lives improve dramatically, and everything's awesome.


    Nicole Forsgren: Right, and if we just push people, that's not sustainable. If anything, we want to push people to do things that they're good at, and we want to leverage automation for things that automation is good at. So what does that mean?


    We want to have people doing creative, innovative, novel things. Let's have people solve problems. Let's have automation do things that we need consistency for, reliability for, repeatability for, auditability for. Let's not have people bang a hammer and do manual testing constantly. Let's have people figure out how to solve a problem, do it once or twice to make sure that's the right thing, automate it, delegate that to the automation and the machines and the tooling, hand it off, be done, and then pull people back into the loop, into the cycle, to figure out something new.


    I think it was Jesse Purcell that said, "I want to automate myself out of a job constantly." Right? Automate yourself out of your current job, and then find a new job to automate yourself out of again. We will never be out of work.


    Mike Julian: Yeah, I used to worry about that when I first started getting into DevOps and actually, when I first started working on automation it wasn't DevOps at the time, it was automating Windows desktop deployments at a University. And this is in the early 2000s. And one of my big worries was, well because I spend half my week doing this, if I were to automate it I'd spend an hour doing this, what am I gonna do the rest of the time? They're just gonna fire me 'cause they don't need me anymore.


    As it turns out, no, that's not what happened at all. Higher value work became my work, because I wasn't focused so much on the toil.


    Nicole Forsgren: Right, and those types of things, machines and computers can't do. And the other thing, I used to tell all my friends: don't think about that in terms of job security, right? Don't try to paint yourself into being the only person who can ever do a thing so that you can't be replaced, because that also means that you can never get promoted.


    If we always make sure that there are aspects of our job that can be automated so that there are opportunities for us to pick up new work, that only creates more opportunities for amazing things. There are always going to be problems, there are always problems for us to solve. I don't want to be stuck doing boring work.


    Mike Julian: Yeah, God knows that's the truth.


    Nicole Forsgren: Oh my gosh I know. I don't want to be stuck doing boring, repetitive work. That's just a headache. If we can find, especially really challenging, complex things, and if we can find ways to automate that, trust me, we will never dig ourselves to the bottom of that hole. That is always there.


    Mike Julian: So I want to talk about the State of DevOps report and I want to start off by asking a question about something you mentioned earlier. You mentioned this phrase, academic rigor. What is that, what does that mean?


    Nicole Forsgren: Academic rigor includes a few things, okay? So one part of academic rigor is research design. So it's not just yoloing a bunch of questions ... sorry, yolo is my shorthand for like, "Your methodology is questionable."


    Mike Julian: I've been seeing a lot of those surveys come out recently.


    Nicole Forsgren: Yeah. So one is research design. And some people say, "Nicole, what do you mean by research design?" Research design is: are the types of questions you're asking appropriately matched to the method that you're using to collect the data? Right? Are these things matched? And for some things, a survey is appropriate. A one-time, that is, cross-sectional, one-slice-in-time survey across a whole industry. Some things this is appropriate for. Some things this is not appropriate for.


    One good example: a whole bunch of people really want me to do open spaces questions in the State of DevOps report.


    Mike Julian: What does that mean? Like open ended questions?


    Nicole Forsgren: No, open spaces. So a lot of people have a lot of feels about open office spaces. Should I work in an open office space? Does an open office space influence productivity? Or pair-programming: does pair-programming affect productivity? Does pair-programming affect quality? People have a lot of feels about these things. The type of research design employed in the State of DevOps report, a survey that is deployed completely anonymously across the entire industry at a single point in time, is not the appropriate research design to answer either of those questions.


    Mike Julian: Why is that?


    Nicole Forsgren: Because what you would need to do is have a much more controlled research design. So I would need to know, for example, who you were working with. So let's go with the pair-programming one: I would need to know the types of problems you're working on, the types of code problems, I would need to know the complexity of the problems, I would need to know how long it's taking you, right? If you're wanting to know productivity, I would need a measure of productivity. I would need to know what the outcome is. So if my outcome is productivity, I would need to measure productivity, because I'm gonna need to control for complexity, right? Because things that are more complex, we expect to take longer. Things that are less complex, I expect to take not as long, right?


    And then I would need to match and control, right? So even things like open office spaces, right? Because if you're doing pair-programming in an open office space versus not an open office space, if you're doing it at an office, I would need to know the seniority of the person, or some proxy of seniority. I would need to know how you're paired: are you paired with someone at approximately your experience level, if not seniority level? I would need to know how the pair-programming works, I would need to know the technology involved, I would need to know if you're remote or if you're actually sitting next to each other. I would need to know if you're both able to input text at the same time or if one person is typing and the other person is not.


    So that when I do comparisons, I know what the comparisons are like.
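

    As a toy illustration of why those controls matter, here is a sketch, emphatically not DORA's actual analysis, with made-up numbers and variable names: pairing gets assigned to the more complex tasks, so a naive comparison makes pairing look slower, while a regression with complexity as a control recovers its true effect.

    ```python
    # Toy example: "controlling for" a confounder (task complexity).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    complexity = rng.uniform(1, 10, n)            # hidden confounder
    pairing = (complexity > 6).astype(float)      # harder tasks get paired
    # True effect of pairing: it saves 1.5 hours.
    hours = 2 * complexity - 1.5 * pairing + rng.normal(0, 1, n)

    # Naive comparison: pairing looks much slower, because paired tasks
    # were the complex ones.
    print("naive gap:", hours[pairing == 1].mean() - hours[pairing == 0].mean())

    # Controlled: regress hours on pairing AND complexity; the pairing
    # coefficient now comes out near the true -1.5.
    X = np.column_stack([np.ones(n), pairing, complexity])
    coef, *_ = np.linalg.lstsq(X, hours, rcond=None)
    print("controlled pairing effect:", coef[1])
    ```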


    Mike Julian: That's an incredible amount of information. I never expected that you would have to know so much in order to get a good answer out of that.


    Nicole Forsgren: And that's off the top of my head. Right, I'm spitballing because you asked me a good question. And that's just on research design, and then you move on to analysis, right? When you move on to analysis, then we need to get into the types of questions that you have asked. Are we looking at correlation? Are we looking at prediction? Are we looking at causation? What types of data do we have available, and which types of analysis and questions are they appropriate for?


    Again, they need to match up the right way. Some types of data are not appropriate for certain types of analysis or questions. So you really need to make sure that each one is appropriate for the right types of things. Right? Certain types of analysis, like mechanistic: survey questions will never be appropriate for mechanistic analysis, right? Although, quite honestly, no one's ever gonna be doing mechanistic analysis. Never. And by the way, if anyone comes to me and says they're doing mechanistic analysis, I'm gonna sit back and listen to you very intently, very interested, because I don't think anyone's doing mechanistic ... it's not a thing.


    Mike Julian: So when you're analyzing the results of the survey, what we're seeing is one question followed by another question, followed by another question, and you know hundreds of questions. When you're analyzing this stuff, are you looking at a question at a time, or are you looking at multiple questions and then interpreting the answers based on what you're seeing across several different questions?


    Nicole Forsgren: So when I'm writing up the results, when I'm writing up the report, I am writing up the results of my analysis, and my analysis is taking into account a very, very careful research design. Now what that means is, my research design has been very carefully constructed to minimize misunderstandings. It tries to minimize drift in answers. So, one way that we do that, and this is outlined in part two of Accelerate if there are any stats nerds who want to read up on this, is we do things called latent constructs.


    So, you asked about having only a few questions or several questions. One way we do this, I mentioned, is called latent constructs. If I want to ask you about culture, right, I could ask 10 people about culture and I would get 15 answers. 'Cause culture could mean so many different things, right? In general, when we talk about culture in a DevOps context, we tend to get something that ... people will say very common things like, breaking down silos, having good trust, having novelty, right?


    So what we do is we start with a definition, and then we will come up with several items, questions, that capture each of those dimensions. So you might want to think about a dope Venn diagram, where each of the questions is overlaid, and then all of the places where they have the biggest, or the perfect, overlay, that very center, that little nut, that is what the construct is. That is what culture is, that is what's represented by culture.


    And then each of the individual circles is each question. That's what we do in research design. One part of research design. When I get to stats analysis mode, I take all of the questions, all of the items, across not just culture but every single thing that I'm thinking about. So in years past I've done monitoring and observability, I've done CI, I've done automated testing, I've done version control, I've done all of these things, and I throw all of them into the hopper, right?


    Mike Julian: Which is probably your massive Excel spreadsheet I'm sure.


    Nicole Forsgren: No, it's SPSS. I use SPSS but you can use several different stats tools. And we do principal components analysis. And what we do is we say, how do they load? Basically, how do they group together and do we have convergent validity? Do they converge? Do they only measure what they're supposed to measure? And do we have discriminant validity? Do they not measure what they're not supposed to measure? And do we have reliability? Does everyone who's reading these questions, read them in a very very similar way?


    Once we have all of those things, and there's several statistical tests for all of those, then I say, "Okay, these several items, usually three to five items, all of these items together are culture," or "all of these items together are CI," or "all of these, right these grouping of items, represent this." Okay, now, now, I can start looking at things like correlations, or predictions, or something else and then I get to the report, and now I will just talk about it like, culture.


    So I talk about it as one thing, but it's actually several things and then when I talk about culture, I can say, "This is what culture is," and I can talk about it in this nuanced, multidimensional way, and I know what those dimensions are because it's made up of three to five, to six to seven questions, and by the way, if one of those questions didn't fit, because I know from the stats analysis, I can toss it, and I know why. And I always have several items. That's the risk, if you only have one question or if you only have two questions. If one of them doesn't work, which one is the wrong one? You don't know. Right? Because, is it A or is it B? I don't know.


    At least if I start with three and one falls out, then it's probably the two that are good.
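

    For the stats nerds she mentions, here is an illustrative sketch of those two checks, reliability and whether items load together, on fabricated data. Nicole uses SPSS; this numpy stand-in simulates three survey items that all noisily measure one hypothetical "culture" construct.

    ```python
    # Sketch: do three survey items behave like one latent construct?
    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    culture = rng.normal(0, 1, n)        # the unobserved construct
    # Three items, each a noisy measurement of the same construct:
    items = np.column_stack([culture + rng.normal(0, 0.5, n) for _ in range(3)])

    # Reliability: Cronbach's alpha (a common rule of thumb is > 0.7).
    k = items.shape[1]
    alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                             / items.sum(axis=1).var(ddof=1))
    print("Cronbach's alpha:", round(float(alpha), 2))

    # Convergence: if the items really measure one thing, the first
    # principal component of their correlation matrix should dominate.
    corr = np.corrcoef(items, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]     # descending order
    print("variance explained by PC1:", round(float(eigvals[0] / k), 2))
    ```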


    Mike Julian: Yeah. Many listeners on here have taken a lot of the surveys run by marketing organizations, except the surveys are also designed by people in marketing …


    Nicole Forsgren: They're designed by people who want a specific answer.


    Mike Julian: Exactly.


    Nicole Forsgren: And that's the challenge.


    Mike Julian: Right, whereas, to make this very clear, the State of DevOps report is not that at all. There's a lot of, as you said, rigor that goes into this.


    Nicole Forsgren: So the nice thing is that we have always been vendor and tool agnostic.


    Mike Julian: You're not looking for a very particular answer to come out, you want to know what is actually out there.


    Nicole Forsgren: And we're not looking for an answer to a product. So, in the example of CI, what is CI? I don't care about a tool. I'm saying, if you're doing CI, continuous integration, in a way that's predictive of smart outcomes, you will have these four things. The power in that is that anyone can go back and look at this as an evaluative tool. If you are a manager, or a leader, or a developer, you can say, "Any tool that I use, any tool in the world, I should look for these four things," or "Any tool I build myself, or if I'm doing CI, I should have these four things."


    If you're a vendor, you should say, "If I think I'm building or selling CI, I better have these four things." Right? So that's the great thing, and I've gotta say, God bless my new team. They're letting me run this the same way. It's still the same way. It's still vendor and tool agnostic, it's still capabilities focused. Every single thing you look for, whether it's automation or process or culture or outcomes, it's vendor and tool agnostic, it's capabilities focused, and again, the power is that you can use it as an evaluative tool.


    Is my team doing this? Is my tooling doing this? Is my technology doing this? Am I able to do this? If I'm not, what is my weakness? What is my constraint? Because if I take us back to the beginning, what is it that drives me and the DORA team, what is it that we want to get out of this? We want to make things better. And how do we do that? We can give people an easy evaluation criteria. And I'm not saying it's easy, because none of this is easy; it takes work. But if there's clear evaluation criteria, we've got somewhere to go.


    Mike Julian: I know that you love talking about what you've found in your several years of doing this. What are some of the most interesting results you've come up with?


    Nicole Forsgren: Oh, there's so many good ones.


    Mike Julian: Let's pick your top three.


    Nicole Forsgren: Okay, I think one of my favorites is, and I'm gonna do this in cheesy marketing speak …


    Mike Julian: Please have at it. We have prepared ourselves.


    Nicole Forsgren: As someone who had a little startup and had to fake it as a marketer for a minute, we'll see how I do at this.


    Architecture matters, technology doesn't. Number one. Okay. So what does that mean? What that means is, we have found that if you architect it the right way, your architectural outcomes have a greater impact than your technology stack. So architectural outcomes, some key questions are: Can I test? Can I deploy? Can I build without fine-grained communication and coordination?


    Mike Julian: What does fine-grained mean?


    Nicole Forsgren: Do I have to meet and work with and requisition something among teams? Do I have to spin up some crazy new test environment, or do I have to get approvals across 17 different teams? Notice, I just mentioned teams. Communication and coordination can be a technology limitation or it can be a people limitation. This harkens very much back to Conway's law.


    Mike Julian: One of my favorite laws.


    Nicole Forsgren: Right? This is very much a DevOps thing. But, it's very true. Whatever our communication patterns look like, we usually end up building them into our tech. Now, I will say this is very often easier to implement in Cloud and Cloud-native environments, but it can absolutely be achieved in Legacy and Mainframe environments as well. We did not see statistically significant differences between Brownfield and Greenfield respondents in previous years.


    Mike Julian: That's good to know.


    Nicole Forsgren: Yeah, so I love that one. That one's super fun.


    Okay, number two. Cloud matters, but only if you're doing it right.


    Mike Julian: Oh, what does right mean?


    Nicole Forsgren: Dun dun duh. So, this was one of my favorite stats. We found that you are 23 times more likely to be an elite performer if you're doing all five essential Cloud characteristics. I guess you could say if you're doing all five essential characteristics of Cloud computing according to NIST, the National Institute of Standards and Technology. So I didn't make this up, this comes from NIST, okay?


    So it was interesting because we asked a whole bunch of people if they were in the Cloud. They're like, of course we're in the Cloud, we're totally in the Cloud, right? But only 22% of people are doing all five things. So what are these five? The first is on-demand self-service. You can provision resources without human interaction, right? If you have to fill out a ticket and wait for a person to handle the ticket, this doesn't count. No points.


    Another one is broad network access. So you can access your Cloud stuff through any type of platform; mobile phones, tablets, laptops, workstations. Most people are pretty good at this. Another one is resource pooling, so resources are dynamically assigned and reassigned on demand. Another one is rapid elasticity, right, bursting magic. We usually know this one.


    Now the last one is measured service. So we only pay for what we use. And the ones that are most often checked off are usually broad network access and on-demand self-service.
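
    As a rough illustration, a minimal self-assessment sketch in Python against those five NIST characteristics; the yes/no answers are hypothetical, and the point, per the conversation above, is that only all five counts:

    ```python
    # Minimal sketch of a self-assessment against the five essential cloud
    # characteristics from NIST SP 800-145. The answers below are made up;
    # "we're in the Cloud" only fully counts when all five hold.

    nist_characteristics = {
        "on-demand self-service": True,   # provision without filing a ticket
        "broad network access": True,     # reachable from phones, tablets, laptops
        "resource pooling": True,         # resources dynamically (re)assigned
        "rapid elasticity": False,        # bursting on demand
        "measured service": False,        # pay only for what you use
    }

    met = [name for name, ok in nist_characteristics.items() if ok]
    missing = [name for name, ok in nist_characteristics.items() if not ok]

    print(f"meeting {len(met)}/5 essential characteristics")
    if missing:
        # Even in a private or legacy environment, each characteristic you
        # add can still yield benefits; it isn't all or nothing.
        print("still to implement:", ", ".join(missing))
    ```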


    Mike Julian: Yeah, what's interesting about that to me is there's nothing in there that prevents, say, an internal OpenStack cluster from qualifying.


    Nicole Forsgren: Exactly, right. So this could be private Cloud. I love that you pointed that out. The reason that this is so important to call out is, it just comes down to execution. It can be done and the other challenge is so often organizations, executives, or the board says you have to go to the Cloud and so someone says, "Oh yes, we're going to the Cloud." But then someone has redefined what it means to be in the Cloud. Right? And so, you get there, someone checks off their little box, puts a gold star on someone's chart, they walk away, and they're like, "Well we're not seeing any benefits." Well yeah, 'cause you're not doing it.


    Mike Julian: Right. Yep.


    Nicole Forsgren: It's like, "I bought a gym membership, I'm done." No. And again, I'm not saying it's easy, right? There's some work involved. Now the other thing that I love is that, let's say you're not in the Cloud, for some reason you have to stay in a Legacy environment, you can look at these five things and implement as many as possible, and you can still realize benefits.


    Mike Julian: Right. It's not an all or nothing approach. You can do some of these and still get a lot of benefit from it.


    Nicole Forsgren: It's almost like a cheat code back to number one, which was architecture matters, technology doesn't. It's like a cheat sheet with some really good tips on how to get there.


    Mike Julian: So what's your number three here?


    Nicole Forsgren: My number three would probably be, outsourcing doesn't work.


    Mike Julian: Yeah.


    Nicole Forsgren: Which some people hate me for and they shoot laser beams out of their eyes. So let's say outsourcing doesn't work*.


    Mike Julian: Okay, what's the asterisk?


    Nicole Forsgren: Asterisk, the asterisk is going to be that functional outsourcing doesn't work.


    Mike Julian: Okay, so say outsourcing my on call duties, probably isn't going to work so well.


    Nicole Forsgren: Taking all of DEV, shipping it away. Taking all of TEST, shipping it away. Taking all of OPS, shipping it away. Now, why is that? Because all you've done is create another set of handoffs, another silo. You've also batched up a huge set of work, and you're making everyone wait for that to happen. The goal is to create value and not make people wait. If now everyone has to wait for everything to come back, if you're making high-value work wait on low-value work because it all has to come back together, which is usually the way it works, you're boned.


    Now, functional outsourcing. If you have an outsourcing partner that collaborates with you and coordinates with you and delivers at the same cadence, that's not functional outsourcing. That's the asterisk.


    Mike Julian: Okay, gotcha.


    Nicole Forsgren: Also, if they're part of your team and they're part of your company but they basically disappear for three months at a time? Sorry, kids, that's functional outsourcing. No points, may God have mercy on your soul. It's not helpful.


    Mike Julian: Right. It seems to me the way you could tell if you're in this predicament is: if there is a noticeable handoff between your team and whoever you have given these items to, you have functional outsourcing. Would that be about right?


    Nicole Forsgren: Yes, and especially if there's a noticeable hand off and then a black box of mystery.


    Mike Julian: Of like, how is the work getting done?


    Nicole Forsgren: Step one, something, step two, question mark, step three: profit.


    Mike Julian: Maybe. So the first two are all good, because we can kind of see where to go from there, but this third one actually seems a bit harder, because if I'm a sysadmin, I have absolutely no control over this functional outsourcing. I may hate it just as much as anyone, but I don't have any control over it. What can I do as a sysadmin, or someone in ops, someone in dev? How can I improve that situation?


    Nicole Forsgren: So some ideas might include things like seeing if there's any way to improve communication or cadences in the interim. Right? You might still have that outsourcing partner, because that's just the way it's gonna be. But let's say that you've batched up work in three-month increments; is there any way to increase handoffs to once a month? Is there any way we can take capabilities that we know are important, like working in small batches, and just increase that handoff frequency? Is there any way that we can integrate them into our cadence, into our teams?


    Now I realize there is some challenge here because from a legal standpoint, we can't treat them like our team because then, at least from the United States standpoint, once we treat them like an employee, then we're liable for employment taxes and all of that other legal stuff. But if we can integrate them into our work cadence, or more closely into our work cadence, then our outcomes improve.


    Mike Julian: Okay, cool. That makes a lot more sense. That doesn't sound nearly as hard as I was fearing.


    Nicole Forsgren: So it can be starting to decrease the delay on the cadence, asking for slightly more visibility into what's happening, if it's a complete black box, looking for that.


    Mike Julian: Nicole, this has been absolutely fantastic. Thank you so much for joining me. I have two last questions. Where can people find this State of DevOps report to take the survey? Where is the survey at?


    Nicole Forsgren: Oh, we've got the survey posted. Can I include it in show notes?


    Mike Julian: Absolutely. Alright, folks, check the show notes for the link. And my last question for you is where can people find out more about you and your work, aside from this survey?


    Nicole Forsgren: I'm in a couple of places. So my own website is at nicolefv.com and I'm always on Twitter, usually talking about ice cream and Diet Coke, that's @nicolefv.


    Mike Julian: I do love your Twitter feed. It's one of my favorites.


    Nicole Forsgren: Yeah, everybody come say hi. My DMs are open.


    Mike Julian: What I love most about your Twitter feed is roughly around the time that you're writing the report and saying, "Oh my God, why did I do this?"


    Nicole Forsgren: Yeah, I try to keep it locked down, but every once in a while something will slip, like "Oh my gosh everybody, something good is happening," or "oh I forgot this one thing," or "So much good is happening."


    Mike Julian: Yeah, I remember last year like, "Oh my God this is so cool but I can't tell you about it."


    Alright, well thank you so much for coming on, and thanks to everyone else listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you on the next episode.

  • About the Guest: Thai helps teams build more resilient systems and improve their ability to effectively respond to incidents. A former EMT, he applies his experience managing emergency situations to the software industry. He writes about resilience engineering each week at ResilienceRoundup.com

    Links Referenced: Strong Bad’s “The System Is Down”, https://resilienceroundup.com/, https://re-deploy.io/

    Transcript

    Mike Julian: This is the Real World DevOps podcast, and I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools to the organizers of amazing conferences, from the authors of great books to fantastic public speakers. I want to introduce you to the most interesting people I can find.

    Ah, crash reporting, the oft forgotten about piece of a solid monitoring strategy. If you struggle to replicate bugs or elusive performance issues you're hearing about from your users, you should check out Raygun. Whether you're responsible for web or mobile applications, Raygun makes it pretty easy to find and diagnose problems in minutes instead of what you usually do, which, if you're anything like me, is ask the nearest person, “Hey, is the app slow for you?” and getting a blank stare back because hey, this is Starbucks, and who's the weird guy asking questions about mobile app performance? Anyways, Raygun, my personal thanks to them for helping to make this podcast possible. You can check out their free trial today by going to Raygun.com.


    Mike Julian: Hi folks, I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is Thai Wood. He's an internal tools specialist at Fastly and the editor of Resilience Roundup, a fantastic newsletter that I'm a huge fan of. And perhaps most interesting: he's a former EMS professional. For those that don't know, EMS is where much of the Ops world keeps looking to figure out how we can improve our own on-call and incident management procedures. So, welcome to the show, Thai.


    Thai Wood: Hey Mike, thanks for having me.


    Mike Julian: So for those that really have no idea or frame of reference for what EMS is, what is it?


    Thai Wood: EMS is sort of the broad umbrella of what happens anytime someone has a medical emergency, usually starts with someone calling 911. Someone gets sick or injured, they call 911 and the people who show up in whatever capacity are EMS workers and EMS professionals.


    Mike Julian: So, would this also be like firefighters, police, doctors, nurses? Everyone's included in that or?


    Thai Wood: Yeah, typically it's a big group of people just because who shows up actually, in the US strangely depends on where you live, what state you're in, what your county rules are, and a lot of things like that. So, here in Las Vegas, for example, we actually have three different fire departments, some of which run their own ambulance services, some of them do not. So it can also depend on what area of town you're in.


    Mike Julian: Yes. San Francisco is the same way. The San Francisco Fire Department, I guess, absorbed San Francisco Emergency Management, so there's no such thing in San Francisco County as separate EMS. That's why you hear so many fire trucks everywhere: most of the incidents are not for fires. The city is not always burning down, it's just that someone's always hurt. So they send a fire truck and police and more fire trucks. Where I'm from in Knoxville, Tennessee, we actually had separate EMS, which had the same vehicles as the ambulances except they were green instead of red, which was kind of cool. So yeah, that's fun.

    So one of the things I really find interesting about EMS is when you look at people responding to incidents, and I'm using incident in a very broad, vague way here: someone gets hurt. And the people responding have just about the most cool, collected nature I've ever seen. It's always like two people show up and everyone knows exactly what's going on. Meanwhile, the people who have called are freaking out, but the EMS personnel are calm and collected. How do you even get that way, where someone's bleeding on the sidewalk but EMS is perfectly cool about it?


    Thai Wood: I think that's a really good question. It's actually something that I feel is missing in a lot of software: a lot of it is just experience. The hundredth time that you've been on a similar call, the thousandth time, like anything, we habituate to it, and it gets easier. A lot of places with EMS, you have an opportunity to practice these things and to get better at them. Oftentimes, that might be that you do ride-alongs even before you are certified, so you're already becoming immersed in this world. You have an opportunity to do some clinical hours at hospitals, for example, sometimes, and you get to just be in these different situations. And of course, the first, maybe the fifth, maybe even the 10th, you feel the same way. There's also, of course, a culture where a lot of times what you're seeing is not in fact the truth.


    Mike Julian: Okay. Tell me more about that.


    Thai Wood: Well, depending on what it is that you're walking into, I or others might have an internal dialogue that says, “Oh, no, I have not seen this before.” But we're not doing either of us a service by letting that dictate our outward response. If I'm letting that change my behavior, or letting it allow me to be distracted or unfocused, I'm not helping you, nor am I making myself more effective in helping you.


    Mike Julian: Yeah, that makes sense. I mean, I've definitely had that situation when I've been on call. You walk in and it's something like, well, everything's exploded, I have absolutely no idea what's going on. But as the senior engineer, everyone's looking at you, expecting that you've seen this before. So you just kind of put on that veneer of, “Nope, I've got this,” while screaming internally.


    Thai Wood: Absolutely. And I think that, depending on why it's done, it's actually a good thing. In software, I do tend to question it a little bit. I think it should be okay to say, “I don't know this thing. I haven't seen it before, but I'm going to figure it out.” Because if you're on call, you're probably with personnel that you know at least somewhat, right? Network operations staff, people you've at least talked to before, which is not necessarily the case with emergency management in the physical realm. These are strangers and you don't know how they're going to react. So your best service to them is to just keep that cool.


    Mike Julian: Right. Yeah. It's just really hard to do sometimes, or most of the time for me. Man, I do not miss being on call at all. I never liked the idea of having to present that coolness that everyone's looking for.


    Thai Wood: I definitely understand that. It's a very visceral, physical experience. And I know a lot of us can habituate, good and bad, to some of this stuff. We know that noise, whatever our PagerDuty alert noise is, we know that noise, right? Or even if you put it on vibrate, the sound of that phone vibrating against your nightstand; you know the difference between what it sounds like vibrating on your nightstand versus your kitchen table. It embeds in your brain and there's a very physical, visceral response. And I think we don't give enough credit to that in technology and software, that people are experiencing this. Unlike EMS, where there is a lot of understanding, and a lot of the companies oftentimes will have staff psychologists, training, things like that to help you deal with this, in software a lot of times people are just like, it's software. But that ignores the human side of this response that we can't help but encounter.

    I actually saw a study once, I don't recall the details, but they'd done something like hook a cop up to an EKG and put him in kind of a simulated car, or maybe it was a real car. They got him thinking he's on duty, sitting around waiting for a call to come in, and they're watching his vitals. And they trigger his console, so he gets this call. And of course he knows it's simulated, because he's wearing these EKG wires, but immediately they watch his heart rate skyrocket to about 200 beats per minute. Just instantly, this very visceral response. I don't think it takes being a cop to have that same physical response. We still have the same neural architecture, right?


    Mike Julian: Man, my last job that I was on call, when I left the job I was waking up in the middle of the night to nothing. The phone was not going off but I thought it was going off. And inevitably like an hour after I woke up, it would go off. So that really messes with your head.


    Thai Wood: Yep, absolutely.


    Mike Julian: Here's a very important question, what's your on-call ringtone?


    Thai Wood: So for the moment, as we are speaking this right now, I do not have one because I am not currently in rotation for the moment. Typically, I try to change mine sometimes. It's just a thing that I experiment with.


    Mike Julian: That's a good idea.


    Thai Wood: Yeah, just so I don't habituate to one too badly. It just depends on what I'm shooting for. There were periods where I was concerned about missing it, because I was spending time with family or traveling somewhere away from my routines, and I would want that noise to catch me and help me get grounded. Then I might fall back on something normal, usually just a series of beeps, not too loud. But otherwise, I might just pick something random. I think at one point, it was Vivaldi's “Spring.”


    Mike Julian: I must have really screwed myself because I was using Strong Bad's “The System Is Down.”


    Thai Wood: I will admit that I can't help but think of that in my head oftentimes.


    Mike Julian: Yeah, I totally see that. It was a bad idea in hindsight. So EMS, on-call: as soon as we talk about EMS, it's hard to separate that whole idea from on-call, but there's another aspect to it that I find really interesting, which is the incident management portion. I kind of alluded to it when I started laying out the scenario I think of when I think EMS. But there are well-defined roles in EMS. When someone responds to an incident, there are certain people doing certain things, and they know why they're doing those things and what's expected of them. And this is incident command. I know you and I have discussed this many times over coffee and drinks. But there's something there about a standard incident command structure that you and I spoke of. Could you tell us more about that?


    Thai Wood: Yeah, absolutely. So, nowadays, especially in a post-9/11 world, incident command in most emergency services typically references the National Incident Management System, which uses ICS, the Incident Command System, which is this whole thing set up by FEMA. It started in like the '70s because they were having trouble managing a bunch of fires. Well, you know, it's tough to manage this stuff if you don't have everyone on the same page. Who would have thought?


    Mike Julian: Right.


    Thai Wood: So eventually it became this national standard of how do different agencies work together? What are the structures they form? One of the interesting things about it is that in EMS you probably aren't thinking about it day to day. You're not showing up to a car accident and going, “As I get out of the ambulance, I am now the incident commander.” You know that if you were to work with a larger group or another agency, you would become part of the system. But there is this established role, between you and your partner, you and the other responders, and, as it grows, you and other agencies, that I think is really helpful, because you know that, for the most part, people are going to want this information from you, or this is your role. And it plays out all the way down to, if you're rolling a patient, who is the one that counts off? Who is the one that decides and says one, two, three? It goes all the way down, so that everyone knows what to expect, and that provides this common ground for people to be able to work together.


    Mike Julian: Okay. So, that sounds like it's really only useful for large teams, though, like when there's a lot of people moving. Is there value in that when, say, my team's three people?


    Thai Wood: There definitely is, and I'm glad you asked that. Even though we're not keeping it always in the front of our heads, again, I'm not getting out saying, “I'm this,” you are that person. Even the way the system is defined, you essentially instantiate incident command by responding to the scene, and knowing that, and knowing who's first on scene and acting in that role, can help even as early as two or three people. I mean, you already know between you and your partner.

    But if a cop rolls up, maybe from the next county, they already have some idea of how to insert themselves into the situation. They have at least some notion of what they might ask of you and what they might not. Knowing that you are a form of incident commander, or the first person on scene, they might avoid asking you, “Hey, can you go fill out this paperwork?” “No, I'm busy.” There's a lot of things that they're able to just skip over because of that. And I think just having the role is valuable to responders themselves, because especially in software, people get put on call with all sorts of diverse backgrounds. And very rarely is it more code they could have learned that would make them better responders, right?


    Mike Julian: Yeah.


    Thai Wood: Being a better software engineer, being a better Ops person, doesn't always make you a better responder. But learning about and then participating in an incident management structure can help make you a better responder and feel more prepared. I think there's a gap: people are great at their jobs, they're good at their code, they're operating the infrastructure, but that doesn't always translate to, “I know what to do when the pager goes off.”


    Mike Julian: Yeah. Let's dig into that a bit. So, say I'm an Ops engineer on a small team. I've got three or four people on my team. My team is pretty good at doing Ops. But incidents are challenging. How can we get better? What are some concrete steps we can take?


    Thai Wood: So, the number one thing I recommend to everyone is just try to stay calm. It will be difficult, and that's okay. It's not a personal failure or anything like that. Just try to stay calm. And as we used to say in EMS, don't become part of the emergency.


    Mike Julian: I get what you're saying.


    Thai Wood: Yeah. If you're running around all over the place, at least in the physical world, and I'm speeding to an accident scene and I get hurt, well, now they have to send two people. It makes you less effective. So I'd say just take a moment, try to remain calm. That's not easy for everyone. That ability differs across a lot of people, and that's okay. It's the number one thing I always tell people. It's also important for you to have that space to actually be effective. After that, I would say just define some of those roles that we've talked about: incident command, and what that means to your team. You don't always have to follow this big federal guideline, but have something that you and your small team agree on in advance, so that in the moment everyone knows their role instead of trying to figure it out at the worst possible time.


    Mike Julian: Yeah. In fact, I would say trying to adopt the federal guidelines in a small team, or even right out of the gate no matter the size of your team, would probably be detrimental to your entire effort; that's taking on so much all at once. It's all new, so you're going to screw it up, and that's expected and that's fine. But to say, "We now have 10 different roles, and by the way, we've never formalized incident management here before," I'm sure that's going to go over great. How would you expect that to go while also keeping the software running?


    Thai Wood: Yeah, absolutely.


    Mike Julian: It seems to me that you should probably start with a couple of roles. And the thing that I've found the most valuable, and I would love to hear your feedback on it: the first role I've always wanted to implement anywhere I start doing formalized incident management is communications liaison. I don't care about anything else except for that, because that frees up people to work on the incident and defines who is doing the communication.


    Thai Wood: Yeah, that's really helpful. Knowing who's doing communication and what communication is expected, I've found, as you have, very effective. Also, asking that question allows you to go, “Wait, why are we giving updates to our CFO when she's not able to do anything in this moment? Or why is he or she asking for updates every five minutes when this is a 15-minute process?” Asking some of these questions about roles also helps you reconsider what your knee-jerk response might be. People are popping into the channel, so I'm just going to answer with my status. Whereas if we have someone in charge of communication, not only are they answering it, but we also get to define what it is that they answer, what communication it is that they provide.


    Mike Julian: Yeah, absolutely. So when you start overhauling incident management, is there a different role that you'd like to go for first or do you also fall in line, go for communications?


    Thai Wood: Usually, I like to set up some form of incident commander, or a notion of someone being first on scene, followed by communication. I typically find that in addition to having a role, it helps to practice communication and define how we as a team communicate. Often that's things like closed-loop communication. If I tell you something, then I'm going to expect that maybe you're going to repeat it back to make sure we're on the same page, or you're going to acknowledge it. And that way, I know that if you don't, you didn't hear me; you're very focused on the thing you're doing, as you should be. Just techniques and frameworks like that, in addition to the roles. Really, an incident command role, deciding who's doing communication, and then working on how it is that you communicate, I think, gets a lot of teams a huge leap forward from where they start.
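
    As a rough sketch of what agreeing on those roles and that closed-loop rule in advance might look like, here is a minimal example in Python; the team members, messages, and structure are hypothetical illustrations, not a prescribed framework:

    ```python
    # Minimal sketch: incident roles written down in advance, plus the
    # closed-loop communication rule described above. Names are invented.

    from dataclasses import dataclass, field

    @dataclass
    class Incident:
        commander: str                 # first on scene instantiates command
        comms_liaison: str = ""        # who answers questions in the channel
        outstanding: list = field(default_factory=list)

        def handoff_command(self, new_commander: str) -> None:
            # First on scene isn't commander forever; handing off is fine.
            print(f"{self.commander} hands incident command to {new_commander}")
            self.commander = new_commander

        def tell(self, person: str, message: str) -> None:
            # Closed loop: a message stays outstanding until acknowledged.
            self.outstanding.append((person, message))
            print(f"{self.commander} -> {person}: {message}")

        def acknowledge(self, person: str, message: str) -> None:
            self.outstanding.remove((person, message))
            print(f"{person} -> {self.commander}: confirming '{message}'")

    incident = Incident(commander="junior-on-call", comms_liaison="team-manager")
    incident.tell("jane", "investigate the database latency")
    incident.acknowledge("jane", "investigate the database latency")
    # Anything left in incident.outstanding wasn't heard; follow up on it.
    incident.handoff_command("senior-on-call")
    ```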


    Mike Julian: Yeah. One of the big worries or anxieties I've seen people have with the incident commander role is that the first person on scene is incident commander, but that doesn't mean you're always incident commander. Someone else can come on scene and you can hand that off, and that's fine. So there's the idea that if I'm on call, and I'm the first ops person to respond to an incident, now I'm running the incident. What if I'm a junior engineer? I'm kind of terrified of that whole thing: I'm directing these people who are way better than I am? But that's actually fine, because the role of the incident commander is not to be the best at solving the problem; it's to understand who's doing what and make sure that everyone's on track.


    Thai Wood: Absolutely. I was on call for a very big e-commerce organization, you drive past their stores, over the holidays, and that was something I noticed in this area as well: just having someone to coordinate. I mean, in this case, I think we capped out at just over 50 people on a particular bridge.


    Mike Julian: That's a big bridge.


    Thai Wood: But yeah, at that scale, there's a lot of things that really don't have to do with your seniority. Jane says, “I'm going to go investigate this thing.” And then 10 minutes later, no one's heard from Jane. Is she still on the bridge? Did she get disconnected? Is she making great headway? Is she seeing amazing things that might be revealing to us? Having someone managing that incident means someone is able to say, “Hey, Jane, can you go look at that, please, and come back to me in 10 minutes, or give us an update in 10 minutes, and then we'll go from there.” Or, someone might pipe up and say, “I'm actually really stuck here.” Well, with a group of 50, you tend to get the bystander effect, which is that in large groups, people don't individually tend to take action. Having an incident commander role allows that person to overcome that a bit and say, “John, I heard that you're stuck on this. What is it that you need from the group?” And that can help bridge some of that gap.
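
    As a rough sketch of that coordination pattern in code; the names, tasks, and the 10-minute window are hypothetical:

    ```python
    # Minimal sketch of the "report back in 10 minutes" pattern an incident
    # commander might use on a large bridge. All details are invented.

    import time

    assignments = {}  # person -> (task, due_at)

    def assign(person: str, task: str, minutes: int = 10) -> None:
        assignments[person] = (task, time.time() + minutes * 60)
        print(f"IC: {person}, please {task}, and give us an update in {minutes} minutes")

    def overdue():
        # The commander notices silence instead of relying on Jane to speak
        # up, which counters the bystander effect in a large group.
        now = time.time()
        return [person for person, (_, due) in assignments.items() if now > due]

    assign("jane", "investigate the checkout errors")
    for person in overdue():
        print(f"IC: {person}, any update on your task?")
    ```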


    Mike Julian: Right. So I wrote a little bit about incident command in my book, and the thing that I found most valuable for determining who is an incident commander in small teams, when they're first adopting this sort of stuff, is to intentionally not allow managers to be incident commanders. Because then you end up with these blurry lines of, is this the manager telling me to do this or else, or is it because they're in their role of incident commander? So actually, I really like having a team's manager be the communications liaison. Managers are really good at communication generally, they know all the players already, and they can more politically massage a hard message. But having them be incident commander sets up some weird incentives to me.


    Thai Wood: That's interesting. I think it can, depending on the team culture, for sure. I do find that some managers are much better at that communications role than anyone else tends to be, at least off the bat, without more training. I do like it when managers participate in incident command, even if it's in a simulated incident, a war game, a tabletop scenario of some sort, just so that they retain an appreciation of what the job is.


    Mike Julian: Yeah. Agreed. So, I want to completely switch gears here. You're the editor of the Resilience Roundup, which I keep calling Resilience Weekly, but it's called Resilience Roundup, which is what, like resilienceroundup.com?


    Thai Wood: Yes.


    Mike Julian: Yeah, there you go, resilienceroundup.com, fantastic newsletter. What I love about it is that it's not just a link roundup. It's actually significantly more than that. You have a unique take on things. Tell me more about that.


    Thai Wood: So I started this just by talking to some folks. Prior to this, I'd already started seeing a big overlap between a lot of different fields and my past experience, then learning somewhat about how NASA does things, how pilots do things, and seeing a lot of overlap in things that we could learn in software. I had a chance to go to Paul and Mary's really great REdeploy conference and talk to a lot of people there and hear their different takes on resilience. And there were so many resources. I walked away with a lot of people saying, “You should read this, or you should read this, or you should read this.” I think I filled up half of one of those little notebooks with not even notes, just title and author, title and author, title and author, all the way down. And I got home and went through these, and some of them were 500-page books. Some of them were 30-page papers, and man, this is just a huge uphill battle to get through some of this.


    Thai Wood: And so I had this thought that, well, if I'm going to do it, why don't I share it so that maybe the next person doesn't have to? But also, I don't want to keep them from forming their own conclusions. So that shaped the format, where I try to give you, not really a summary, but my take on it, how I think it can be useful, and what I see as some actionable takeaways, in about 10 minutes' reading time. But if you want to dig into that 20, 30, 40-page paper, you absolutely can; I purposely pick articles that are accessible, not behind a paywall. It all started with this idea that if I'm going to read this and form these conclusions, I really want to be able to help others do it as well.


    Mike Julian: Yeah. The prospect of reading a couple dozen academic papers on resiliency or incident command or any of these things, it's just daunting to me. But I've read a few, and there's some fantastic stuff to be had from it, but it's just so hard to pull it out. And then once you do get it all pulled out and you get concise points, then you have to figure out how am I actually going to apply this to ops and software. So I'm really glad that someone's doing it. Resilience lately has kind of felt like the new buzzword. It seems like resilience is the new reliability. But I think that's wrong. That doesn't feel correct at least. I know you have a lot to say about that topic since you are now reading so many papers, all the papers. So resilience, resilience at a conceptual level, what is it?


    Thai Wood: So I think you're right that, at least in language, it does seem to be trending toward a lot of the processes that have made other things buzzwords. But as a concept, resilience engineering comes from a bunch of disciplines, like human factors research and cognitive systems engineering, and is sort of this label that was developed partly from looking at biological ecosystems and all these different things. What most people mean when they talk about it is looking at systems and where they can continue to adapt and respond. So not just reliability. We might think of reliability as the engine in my car: those pistons are going to go for 300,000 miles if we're lucky, and they're very reliable. But if you were to put them in a different case, one that is not the normal operating envelope, how does it respond? I think the term for this in the research tends to be adaptive capacity. Does it still retain some capacity to respond to whatever the different situation is?


    Mike Julian: I've been trying to figure out where John Allspaw got his company name from and well, there you go, adaptive capacity labs. I had no idea. So, it is pretty buzzwordy at the moment. Hopefully that will improve in the coming years, but we'll see because observability is at about the same point. I think perhaps really interesting is observability being from like 1960s control theory. Resiliency is also about as old as, we've got research going back to the 60s if not earlier.


    Thai Wood: Yeah, absolutely. It wasn't always called that at the time, but different disciplines developed and started to see overlap; again, biology is a big point in this. David Woods has a great paper. David Woods has a lot of great papers, but in particular, he has one that he put out recently, and in it he talks a lot about biological systems, and how some of our ideas of system performance are drawn from and reflected in biology. So as a result, yeah, some of this research is pretty old, and it's still relevant, whether that's accident research or certain things about human cognition. We're not changing as humans that quickly. Even though the tools we're interacting with are potentially changing very quickly, we as the operators are not, really. We're still facing those same human limitations, or, the good thing about being human, we still have that intuition and can adapt to these scenarios.


    Mike Julian: So let's talk about resilience. Why are the ops and software worlds suddenly talking so much about this? What's the point of it? Why is it suddenly so interesting or so valuable?


    Thai Wood: So I think it is most valuable primarily because of the question that it asks about adaptive capacity, which is that a machine just sitting there in a rack, blinking at you, does not itself have adaptive capacity. Looking at the world of ops and software and incident response through this lens of resilience helps us realize that people are often still, and always have been, very key to how these things work: whether or not the systems fail, and how often, and whether or not we're able to support them. Humans are the key, in this industry where we are, I think, seeing a pushback from an era of saying, “Well, we'll just automate everything away. We'll just take people out of the loop. And eventually AI will just fix it.” I think this is a natural pushback to a feeling a lot of us have experienced: wait, that's not really how it's working. And the research has actually looked at this, and has supported it for at least a couple of decades, that adding automation actually makes things harder on humans.

    And I think we've just reached a point with such complex systems being so accessible, because at certain points in time, we wouldn't have had the large number of complex systems that we have, at least from an internet point of view. So we're building more and more complex systems. And there was a period where we were seeing, more and more, "we'll just automate things and it'll be fine." And so I think the culmination of that is this inclination to say, “Well, wait, how can we keep having this ability to adapt? How can we encourage it? How can we find it? How can we learn more about it?” And fortunately, these researchers have been trying to answer this question for decades, looking at pilots, NASA, firefighters, all these different things, and using this window wherever they can find it to try and extract these things for us.


    Mike Julian: I saw a take on this a while back, just a short quip, that complex systems are working because of the humans, not in spite of the humans.


    Thai Wood: Yes, absolutely. And the research does, for the most part, bear that out. There is no amount of automation, at least as we speak today that can really fix these problems. As often gets quoted as well is that, it's not surprising when it fails, it's surprising that it works at all. And that's because of, as we all know, of the people behind the scenes just day in day out doing their normal work in kind of the trenches, keeping all these things running.


    Mike Julian: I think that the quote you just referenced there came from a fantastic paper from … I first read it in the Web Ops book, years ago, How Complex Systems Fail, that was what it's called. I forget the author. Who wrote that? Do you know?


    Thai Wood: Yes, and I really think everyone should read it.


    Mike Julian: It is a wonderful paper. It's available freely online.


    Thai Wood: It is. It is a short paper, even though it's been printed in books. It's by Richard Cook. I strongly recommend most of his research, but-


    Mike Julian: I mean, it's like six pages long. It's pretty quick read.


    Thai Wood: Yeah, How Complex Systems Fail is just a list format, so it's a really great intro to a lot of this stuff. And I think as ops and software people, you can't get very far through it without nodding your head. I don't think I've ever seen a single person who works in these areas read it without either saying, “Ah, yes,” or nodding their head.


    Mike Julian: Yeah. Where I first learned about it, where I first read it, was in John Allspaw's Web Operations book, I forget the exact title. But it's a really great book, and it's pretty old now but, surprisingly, still applicable. And the only paper in there was Richard Cook's paper. Every time I go through it I'm like, “Yep, that's software in a nutshell.”


    Thai Wood: Yep. It's actually printed out just to my right in my office, as a reminder, and I do revisit it sometimes. It's just such a great summary, and I think it's easy to forget some of the points as we focus on these different areas, so I just have it out to remind myself.


    Mike Julian: So we often see resiliency talked about in like right alongside chaos engineering. What's all that about? Why are the two going hand in hand in conversations?


    Thai Wood: So I think most of that is because, at least as I see it, chaos engineering is built on a lot of the same principles as resilience. Chaos engineering and its tools are a kind of response to the things that resilience engineering is teaching us. It's kind of a subset of that same world? And because of-


    Mike Julian: So we could call it like applied resiliency?


    Thai Wood: Right. Absolutely. So, the research tells us that we are able to build, nowadays, systems that are so complex that we cannot predict all the system interactions. We can look at individual components in a system and try to assess if they're safe, but that doesn't prevent us from having system accidents, where there are interactions between components we could not have predicted. And the systems we're building are so complex, the answer isn't to get better at predicting, because we can't. So I think chaos engineering is an answer to that: well, if we can't predict it, what if we just cause it and watch what happens? Now we don't have to predict it.
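
    As a rough sketch of that "cause it and watch what happens" idea in Python; every function here is a hypothetical stand-in, not any particular chaos engineering tool's API:

    ```python
    # Minimal chaos-experiment sketch: define a steady state, inject a
    # failure that component-level analysis would not predict, and observe.

    import random

    def steady_state_ok(error_rate: float) -> bool:
        # Hypothesis: even under failure, the error rate stays under 1%.
        return error_rate < 0.01

    def measure_error_rate() -> float:
        # Stand-in for querying real monitoring; randomness simulates variance.
        return random.uniform(0.0, 0.02)

    def inject_failure(target: str) -> None:
        # Stand-in for killing an instance, adding latency, dropping packets.
        print(f"injecting failure into {target}")

    if not steady_state_ok(measure_error_rate()):
        print("system isn't healthy to begin with; don't run the experiment")
    else:
        inject_failure("a replica of the payments service")
        observed = measure_error_rate()
        # Instead of predicting the interaction, we caused it and watched it.
        print("hypothesis held" if steady_state_ok(observed) else "found a weakness")
    ```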

    Mike Julian: I love that take on it. I never really considered it that way, but you're completely right. Well, Thai, it's been fantastic talking with you. Where can people find out more about you and your work?


    Thai Wood: As you said, resilienceroundup.com. I have all the past issues there. People can sign up, and every Monday I'll send you something to read in this area.


    Mike Julian: All right. Well, thank you so much. And to all our listeners, thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.

  • About the Guest: Cory G Watson is a Technical Director at SignalFx with 20 years of SWE, SRE, and leadership experience at the likes of Twitter and Stripe. He's an observability wonk, optimist, and lover of sneakers. He hopes you're having a great day!
    Links: Twitter: @gphat, Website: onemogin.com

    Links Referenced: Patrick McKenzie’s blog post, “I’m Joining Stripe to Work on Atlas”; Book recommendation: Information Dashboard Design: The Effective Visual Communication of Data by Stephen Few

    Transcript

    Mike Julian: This is the Real World DevOps podcast. I'm your host, Mike Julian. I'm setting out to meet the most interesting people doing awesome work in the world of DevOps. From the creators of your favorite tools, to the organizers of amazing conferences, from the authors of great books to fantastic public speakers, I want to introduce you to the most interesting people I can find.


    Mike Julian: Ah, crash reporting, the oft forgotten about piece of a solid monitoring strategy. If you struggle to replicate bugs or elusive performance issues you're hearing about from your users, you should check out Raygun. Whether you're responsible for web or mobile applications, Raygun makes it pretty easy to find and diagnose problems in minutes instead of what you usually do which, if you're anything like me, is ask the nearest person, "Hey, is the app slow for you?" and getting a blank stare back because, "Hey, this is Starbucks, and who's the weird guy asking questions about mobile app performance?" Anyways, it's Raygun. My personal thanks to them for helping to make this podcast possible. You can check out their free trial today by going to raygun.com.


    Mike Julian: Hi folks. I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is Cory Watson, Technical Director for the Office of the CTO at SignalFx. He previously ran observability at Stripe and Twitter. Cory, welcome to the show.


    Cory Watson: Thanks for having me.


    Mike Julian: I think it's interesting that you have gone from running observability teams at pretty interesting places like Stripe and Twitter.


    Cory Watson: Thank you, thank you.


    Mike Julian: To now, you're working for the enemy.


    Cory Watson: That's a good way to put it.


    Mike Julian: Yeah, you have suddenly gone to the vendor side.


    Cory Watson: Yeah, the nicer way that I put it, is I say I switched sides of the table.


    Mike Julian: Ah, yes. That is a much better way to put it. Apologies to my sponsors for insinuating that they're the enemy. You're completely right, it really is just the other side of the table.


    Cory Watson: That implies that it's just a simple change in aspect, though. It's really not. It's actually a pretty fascinating difference, I think.


    Mike Julian: Well, why don't you tell us more about that? What is that difference? What is it all about?


    Cory Watson: I think, in the past, when you work at a place that uses these vendors, you're often trying to make sure you maintain, or at least it's always been my goal to maintain, a sort of neutrality, maybe a shim layer, if that's the right way to put it. How do we retain our independence and make sure that we could switch if necessary, and all these other things? It's good both from a technical perspective and from a leverage perspective, right? Because you wanna be able to switch if you need to, or go to the new, cool thing that might come out. Not that we jump and switch that frequently, but you wanna be able to retain that independence. Now, suddenly, I'm on the opposite side. I think there's two interesting bits about it. One is the change in perception, or the change in the approach. The second is just the learning experience I've had from watching business being conducted from this side of the table. I guess we can start on the difference in perspective. To your earlier question, it's like, alright, here I am previously sitting over here, going, "Okay, don't tell them anything, and pretend like you hate everything that they show you, and never show your true colors."


    Mike Julian: Fantastic negotiating tactics.


    Cory Watson: For lack of any better training, I think that's the place you try to go. But at the same time, it's actually somewhat different than I imagined, because I'm happy to work for a company that largely isn't trying to sell you something for the sake of selling it to you. I think this is true of pretty much all vendors: we only want you as a customer if you're ultimately going to be happy. Especially for the duration that many of these engagements, contracts, or purchase periods last, you don't just want to ease in, make a buck, and then ease back out again. You've got to stay with it.


    Cory Watson: In many ways, I think my past approach of wanting to hold everything close to the vest, to use that idiom, doesn't really work that well, because you need to give as much information as you can to the vendor so that the vendor can, hopefully, tailor the solution as well as possible, right? At the same time, I think it all comes down to price at the end of the day. Luckily, I don't have to have that conversation. I am only here to talk about the pros and cons and the approaches of how to do observability, and how SignalFx can be helpful for you. I actually think that in switching sides, I've seen how holding back can hinder the process of adoption and understanding. Because if you're like, "Well, we're not gonna tell you how many things we're monitoring, or how many metrics there are, or what our challenges are," it makes it extremely difficult to articulate a value proposition, because suddenly I'm like, "If I can't help you size this, I don't know how to have the conversation." It's interesting to be on the side where you suddenly lack so much information because-


    Mike Julian: I've run into that in the Amazon world, where people will say, "Oh, we're spending a ton of money with Amazon, but we don't want to tell them what our future plans are. We don't wanna tell them about our product roadmap because, security reasons. What if they leak it? What if they try to do it themselves?" And on some level, it's AWS, so maybe it will actually happen.


    Cory Watson: Yeah, as we saw last week.


    Mike Julian: Right. When you're spending that much money with a company, when a company is such a core aspect of how you do business, for example a monitoring product, it's no longer a vendor. They're a partner.


    Cory Watson: Yeah, that's an excellent way to put it. I love that phrasing, because I felt that way even as a customer. I've been a customer of many companies in this context, in the monitoring, observability, whatever-system-stuff context, and it's true. Once that spend gets to a certain amount, or when it's such a critical part of your infrastructure, these systems are effectively the highest criticality of your internal systems. Because if you can't see what's happening, you can't make changes. You can't run if you can't see. It's absolutely right, it needs to be a partnership. The more information you can give, the better. Once everybody gets under a mutual NDA, things loosen up a bit, I think. It's easier to share. It's also understandable, because things like the number of hosts you run, and the magnitude that some companies operate at, are sensitive subjects. I think it's very reasonable to hold that close. But at the same time, the better the information you can give, the better the solution can be tailored. So yeah, that's one side of the difference of switching over to being a vendor. Luckily, I think, on the other side of what I've learned ... actually, I shouldn't use "side" because that's confusing.


    Cory Watson: So there's that aspect, but then the learning experience is pretty interesting for me too: how sales organizations structure themselves and work through this stuff. I'm not in sales. I've always worked in infrastructure at companies, be it for observability things or as an SRE. Suddenly now, I'm faced with learning how they approach it. Recognizing who at the customer you're working with, who's your champion there. Just like anything, you need a champion, someone who's going to help. It's not strictly adversarial, but at the same time, the terminology is often like, "Well, who are the people who are fighting this? What are their motivations? Who are the people who are championing? What are their motivations?"


    Cory Watson: It's interesting because I've always used LinkedIn mostly as a tool to stay connected with people I used to work with. I don't do this, but the salespeople do: they know who everybody in the org is, because LinkedIn is the org chart. It's like, "Well, who do they answer to? Who's their boss? Who's gonna be in this meeting? What are their titles? Who's got the purse strings in this conversation?" It's all stuff I look back on and see. I remember salespeople asking me these sorts of questions. I was always like, "Harrumph, why are you bugging me with these questions?" Only to realize that they're trying to figure out how to position themselves to best answer the questions that are out there, and also to understand whether this is going to be fruitful, because companies can waste a lot of time if they're talking to the wrong people. It's been really fascinating. Some of this was intuitive to me, just from working with small companies. It's just been fascinating to be on this side of the table, as we've been describing it, and learn how they approach this problem, because I know how to approach systems problems. But these are essentially people problems, which is, at least in this context, all new to me.


    Mike Julian: Yeah, absolutely. I can totally echo everything you're saying on that as a consultant myself. I do also go through LinkedIn and start mapping org charts. Definitely been there. What's interesting to me about the sales conversations is, once you stop looking at vendor salespeople as adversaries, as someone trying to sell you a thing whether you need it or not, trying to foist a thing on you, trying to trick you into signing a contract... That's not actually how any of it goes, because most salespeople are not measured strictly on how much money they bring in, but also on retention.


    Cory Watson: Yeah, yeah. It's a good point. It's about that year over year, over year. No one wants to sign a contract. I often saw companies saying, "Oh, we want you to sign these long contracts," and I thought of it in purely dollar magnitude. I never thought of it as the comfort of that relationship being there. It's reasonable that when you sign up with a new company, you don't want to immediately get into some three-year deal or something. But, at the same time, knowing that's gonna be there helps every side of the equation. I used to think about the contracts we were signing in raw margin terms. Well, I know what it costs to buy a server, put it in a rack, and run it. But you don't think about the engineering organization that's built up to also deal with all the silly stuff I'm asking for, like the 100 different features I've got in a list, making sure those are all getting done. I'm not the only customer. There's so much that goes on behind the scenes. I don't know. I feel privileged every day to be able to see it from this side while still leveraging the fact that I'm an observability wonk and I do this stuff every day. I still get to leverage my strengths, but I'm also shoring up my weaknesses when it comes to the sales side of the table. It's been a lot of fun.


    Mike Julian: Yeah, learning the business side of things, I think, has been the most interesting aspect of my entire career. Basically, the past five years for me have been about learning how business is done, especially selling infrastructure services and infrastructure products. To me, it's also the most impactful thing I've ever done in my career. All the knowledge and skills I've gained over an entire career of doing monitoring and observability and infrastructure, yeah, it's all great. But the thing that's really made the most difference for me was learning how the business functions.


    Cory Watson: I don't think I'd thought about it that way until listening to that explanation. If I rewind a little bit, the reason I took the job was I felt like I could have more of an effect on our industry by helping people connect these dots and leverage a vendor, if it was the right vendor for them, to get this job done instead of developing some of this stuff in-house. Not that you shouldn't necessarily do that, but there's a trade-off. Now, basically, I have shorter conversations, leveraging my past experience, to your exact point, to help them make this decision, and then hopefully have an outsized impact. The conversations are almost small compared to the impact that they have. Whereas my engineering impact was so much more long-term, everyday typing and drudgery, compared to going and spending two hours meeting a customer and having these conversations directly.


    Mike Julian: Yeah, absolutely. When you're working on the vendor side, you do have the ability to be a much larger multiplier. When you're working inside of a company, the effect you have is pretty limited. If you want to affect an entire industry, it's possible, but it's hard, especially if your company is not a vendor. If you're at a vendor, and that vendor is sizable and doing interesting things, then you become a multiplier.


    Cory Watson: An interesting connection to my past life: a fellow by the name of Patrick McKenzie, who works at Stripe and goes by @patio11 on Twitter, recently wrote a blog post about why he joined Stripe. Part of it was that, even though he had previously worked on helping small companies succeed, working for that vendor gave him an outsized impact on all of those companies. He's probably much more articulate about it than I am. I just read it yesterday and it was like, "Yeah, man, I believe that. That's what I'm trying to do."


    Mike Julian: I will find that post and put it in the show notes because I'm a big fan of Patrick.


    Cory Watson: Yeah, he's a pretty good dude.


    Mike Julian: Transitioning a little bit: you've had a background running observability teams, and you've also been an IC at various places. But now you're in this weird middle ground, where you're not actually in sales, you're not running a team anymore, but you're not strictly an IC either. It sounds like you've got a weird role.


    Cory Watson: I like that you basically defined it by the absence of things instead of the presence of any one thing, because that's what I find tough about it. To pick up on the thread you dropped there: I've now spent a little over 20 years mostly doing IC work, occasionally engineering manager, a few VP roles. In all those cases, though, there was a fairly direct connection to either infrastructure work or engineering, programming output. Even as a consultant, doing more general programming consulting, there was always this time spent with hands on keyboard, and making code pop out the other side was what I was judged on, whether it was myself or, as a manager, the people that I worked with.


    Cory Watson: In this role, it's tricky because, as I just said, a lot of my job is to go and have pre-sales conversations. I'm not a sales engineer, and I'm also not a salesperson. I'm often brought out as, "Well, here's Cory Watson, who, as you said at the beginning of this session, has done a bunch of observability work. He's here to effectively just have a friendly conversation with you about what you're doing." Thankfully, I don't work at a company that expects me to just shill for them, right? I tell them what I think and what the approaches are. This is rarely a thing that you do with just one vendor; there are often a few that overlap or are mutually beneficial to each other. The trick, though, is trying to figure out: what do I value at the end of the day? What releases those endorphins in my brain, or whatever it is that triggers my happiness and excitement, and makes me want to get up and come to work every day? The conversation we started with, this idea that we're learning much more about the business and having a larger effect, is one thing, but that's a long game. That's like parenting: it takes decades to pay off, and it's often unfulfilling in the moment. I often joke with my partner that our daughter is not gonna repay us for all this work for many years. For now, we're just, "Nope, just do the kid thing." The difficulty here is connecting that. I've been spending a lot of time documenting my work, because it often hasn't felt like I "accomplished much" (I'm making air quotes). I've had to spend a lot of time documenting what I am doing and learning to recalibrate my internal measure of what types of accomplishments I've had. How many conversations that I had with customers, sometimes months ago, have now materialized into a happy customer? How many conversations did I collect feedback from? How much insight have I given into product changes that then land and turn into something like that? I think there's something to say here about that outsized impact you were describing: the larger impact you have also often takes much longer to propagate.


    Mike Julian: Absolutely.


    Cory Watson: The waves, even though they're big, take quite a long time to travel. That's a big difference in internal awareness of your own role.


    Mike Julian: Yeah, absolutely. When you think about the vendors we all look up to, like, "Oh, they've made a really cool product; look how old they are," what we see now was not quick. Success looks an awful lot like drudgery. [laughter]


    Cory Watson: You caught me off guard for that one. We may have to edit that laugh down. That was a snorty one.


    Mike Julian: Success looks an awful lot like just hard work. You're absolutely right. The success that you see in your day-to-day work often isn't actually felt for months.


    Cory Watson: Also, things that can seem simple. I've been doing observability work basically as long as observability has been associated with computer stuff in some capacity, since, I don't know, 2013-ish, something like that. Now, when I have conversations with some of our customers, a lot of it is re-discussing my past experiences, some of the decisions I made as a customer. That often doesn't feel... I feel like I'm just telling them something I already knew, and I don't feel like that's valuable. I often wonder, for example in this conversation: am I giving you new insights or just saying something someone else already said on Medium 100 times? The point is not to make it sound like it doesn't have value, just that the value isn't always obvious in the moment.


    Cory Watson: You don't often see the impact until much later, when you realize that that customer made some internal cultural change. I was just discussing with someone last week how to help them make observability more fundamental to their day-to-day practices. I asked them, "What carrots are you providing to your team?" It's easy to have sticks and say, "We're gonna whack you on the hand if you don't measure." Are your deployment processes connected to your observability data so that you can say, "If you do it right, you get these cool features as a side effect"? How much of that is really happening in your org? That's a conversation I've had many times. Sometimes, for a customer, that's brand new. Even if it isn't new, the other person you're speaking with often brings a whole new perspective, some really exciting new ways of thinking about the problem. I don't know. Every customer conversation is special and awesome in its own way. In some cases, we find out later those conversations had a big impact. In other cases, they buy something else. But that's okay too, because we've all gotten better as part of the process.


    Mike Julian: Something you said there reminds me of one of the ways a vendor really helps: there are often conversations happening internally, and when someone external comes in and says the exact same thing, it lends more credibility, because now it's not just internal. It's a third party with no vested interest saying the same thing. New initiatives that internal people want to pursue will often find traction as a result of a vendor coming in and saying it.


    Cory Watson: So much so that I had active conversations with vendors when I was a customer. Clearly, for some of the companies I've worked for, the name of the company meant something, and I'm something of a personality sometimes. With the combination of that weight, it was not uncommon for vendors to basically call me and want me to break a tie. Not me as in Cory personally, but me in the role I was doing, et cetera, et cetera. It really helps, because the problem with eating your own dog food, to use the industry metaphor for using your own stuff, is that eventually it all tastes the same. Sometimes you can't tell: do I need some cumin on this or not? Do I need more salt? I don't know, it's dog food to me. It's funny, because that's also my role internally. I work in the office of the CTO, which means I don't participate directly in day-to-day engineering. I do some R&D function. I do a lot of customer feedback stuff because I talk to a lot of customers. I'm also a neutral party when it comes to this stuff. I don't work on the back end, so I'm not defensive about it. If you're listening to this and there's something you don't like about SignalFx, hey, I probably don't like that either. I'm helping them understand what we could do to improve it and helping to break those ties, and also to shape where we're going and what we're doing. In addition to the customer side, the feedback side, and the sales side, there's also just the rank-and-file every day: what more cool stuff can we be doing?


    Mike Julian: Yep, absolutely. On the topic of you being a personality and talking to customers, I imagine there's a lot of stuff that's... Let me rephrase that. What do you see the future of observability holding? What are you talking about with your customers? What are you thinking about? What are you working on?


    Cory Watson: I think there's two pieces to this. When I started at SignalFx, I tweeted out one day like, "Hey, I've started doing work now. If you've got questions, thoughts, ideas, let me know." John Allspaw, who many listeners may be familiar with (if you're not, he used to be the CTO at Etsy and now runs a company called Adaptive Capacity Labs), talks a lot, especially lately, about incident stuff: how do people function in systems when there is failure? He tweeted back to me, which, first off, was like, "Holy crap, that's so cool. I've loved his work. How do you even know who I am?" He said, "Tell me what your tool does that helps us." This wasn't just aimed at me or at SignalFx. It was aimed at the industry.


    Cory Watson: A lot of what I've been discussing internally is, "Okay, you've instrumented your stuff. You've got these artifacts: charts, dashboards, tracing, all these mechanisms for looking into the problem. But what are we doing to help you with that? What are we doing to provide you with that information?" I don't wanna overfit for the problem of the person who got paged, because these platforms often do a lot of other stuff as well, right? But what are we doing for the person that got paged? It's a good line of questioning too. As an industry, I feel like across all the tools we operate, we've lost the person on call as the person giving feedback about the tool. I feel there's a lot of room for improvement in all the vendors' tools that I've used. In fact, there's even a rich third-party ecosystem of tools that basically intercept your alerts to help provide context or deduplification. Is that deduplification? I usually hate it when you say a word so many times you don't even know if it's real. I think that's a word. There's so much more we could do in that space. What about just the simple usability of it? I did a short stint as a product manager in training at Twitter, and one of the things I learned from someone who was training me was that sometimes a feature can be implemented really cheaply and simply, just to prove its efficacy. I don't think I've seen a single monitoring tool that, when it notifies you, has a button that's like, "Hey, computer, this is not helpful. Please stop doing this." It doesn't have to actually take action, but it should record that sentiment, because that's really important.


    Mike Julian: That's a fantastic idea.


    Cory Watson: Well, you can implement it pretty cheaply. Just log the thing: point the button at a web server, log the hit on that HTTP endpoint, and then go back and scrape it together later. Then feed that information back to the folks who do your DevOps tooling, or maybe your manager. I often think we provide very poor tools for engineering managers to look at the health of the on-call rotations they're responsible for and to help guide people. It's very easy when you're on call to get caught up in "these things just go off all the time, and I can't make it any better." We rarely leave time. One of the things I used to push on my team at Stripe, which I think we got pretty good at, is: if you're on call, part of your responsibility, assuming you have time, is to leave the rotation in a better place than it was when you got there.
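
    A minimal sketch of the cheap version Cory describes here: an HTTP endpoint that does nothing but record "this alert was not helpful" sentiment for later analysis. The /alert-feedback path, the payload fields, and the log format are invented for illustration, not any particular vendor's API.

    ```python
    # Record alert-feedback sentiment; nothing is acted on automatically.
    import json
    import logging
    from http.server import BaseHTTPRequestHandler, HTTPServer

    logging.basicConfig(filename="alert_feedback.log", level=logging.INFO,
                        format="%(asctime)s %(message)s")

    class FeedbackHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/alert-feedback":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            # Log the sentiment so it can be scraped together later and fed
            # back to whoever owns the alerting rules (or their manager).
            logging.info("alert=%s helpful=%s user=%s",
                         payload.get("alert_id"),
                         payload.get("helpful"),
                         payload.get("user"))
            self.send_response(204)  # the button just logs; no action taken
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), FeedbackHandler).serve_forever()
    ```

    Scraping that log into per-alert counts later is the whole feature; the point is only to capture the sentiment at the moment of the page.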


    Cory Watson: If you're seeing alerts that are bothering you, allocate time as part of your on-call rotation to go and tune those. If it's a static threshold that you need to change, do that. If the runbook's slightly out of date, schedule time for that or make tickets for other people. You don't necessarily have to take on all that work yourself, but record it, because engineering managers are never gonna be able to help you allocate time for it if it's an undefined quantity. That's something that I'm thinking a lot about. Today, specifically, I've been digging in a lot on accessibility of the tools that we operate. When people say accessibility, we often think of people who have some permanent or temporary disability. Maybe they have an injury and lost the use of an arm, or their arm's in a sling, or maybe they're blind, but we also [crosstalk 00:26:40].


    Mike Julian: There's a lot of those so-called invisible disabilities.


    Cory Watson: Well, about 8% of men of Northern European descent have red-green color blindness. What's the thing we all use in all of our charts to denote goodness and badness? Red and green. These are exactly the colors those people are most likely not to be able to distinguish. You've lost that entire channel of communication with those people. Today, for something I'm working on about dashboard design, I'm looking into accessibility. How about screen readers? What are we doing there? Think about our charting elements: are screen readers capable of deciphering those? There's also a lot of stuff that's not even that technically complex. What are the titles of your charts? I don't have it handy, but I ran into some research earlier where someone found that basically nobody knows what a chart shows except from what the title says. One of my favorite parts of observability tools is where someone goes, "Well, what's in that chart?" Well, it's so-and-so number of seconds that this happened. They're like, "Yeah, but where?" You just have to go look in the code, find that measurement, and go, "Okay, that's what that means."


    Mike Julian: Yeah, I hate that.


    Cory Watson: Well, I think that we do ourselves a disservice by not gratuitously labeling.


    Mike Julian: Right, I agree.


    Cory Watson: X-axis labels, Y-axis labels, the titles of the charts, the units that are placed there. Taking the time to basically wean yourself off of jargon, and off the assumption that the other person knows what these things mean.
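
    As a concrete sketch of that gratuitous labeling, here's what it might look like in matplotlib: a title that says where the number comes from, axis labels with units, and a colorblind-safe palette instead of the default red/green. The service name, metric name, and numbers are all invented for illustration.

    ```python
    import matplotlib.pyplot as plt

    minutes = list(range(10))
    p50 = [42, 40, 45, 44, 47, 90, 88, 52, 46, 43]            # fake latency samples
    p99 = [120, 118, 130, 125, 140, 310, 300, 160, 135, 128]

    fig, ax = plt.subplots()
    # Okabe-Ito blue and vermillion: distinguishable under red-green color blindness.
    ax.plot(minutes, p50, color="#0072B2", label="p50")
    ax.plot(minutes, p99, color="#D55E00", label="p99")

    # Gratuitous labeling: what the number is, where it comes from, and its units.
    ax.set_title("checkout-service HTTP request latency\n"
                 "(from metric http.server.request.duration, us-east-1)")
    ax.set_xlabel("Time (minutes ago)")
    ax.set_ylabel("Latency (ms)")
    ax.legend(title="Percentile")
    plt.show()
    ```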


    Mike Julian: One of the things that completely changed my views and perspective, and really leveled up my own skills on visualization, was reading Stephen Few's book Information Dashboard Design. It's this massive, full-color, really amazing print-quality book, originally published with O'Reilly. He's got a second edition out, which is also fantastic, and he's written tons and tons of other books. The entire thing is: this is how you should be building dashboards, this is what visualizations should look like, with a whole bunch of examples of visualizations done poorly and explanations of why they're not working. One of my favorite things is that he lays out really clear reasons why a pie chart is the worst chart ever created.


    Cory Watson: Oh, yeah. That sounds really interesting. I just put it on my list. Some of the work he probably drew on is what I've been reading lately, a lot of the old theory. [crosstalk 00:29:26]


    Mike Julian: He was a student of Tufte.


    Cory Watson: That pie chart example is one people love to harp on, but our ability to judge the amount of area contained in a pie wedge is pretty terrible. It works for large differences; for small differences, less so.


    Mike Julian: One of the visualization types that I really wish more monitoring tools used, and Few talks about this as well, is the table. Tables are actually a very valid visualization; we just don't think about a table as a visualization. But in a lot of cases, it's the most effective and easiest way to understand the information.


    Cory Watson: From that same perspective, I think we also underutilize bar charts. Our ability to process quantitative information encoded as length is much better than our ability to judge area, which is why bar charts are often cited as superior to pie charts.


    Mike Julian: Basically, anything that you would reach for a pie chart for should probably be a bar chart.


    Cory Watson: Yeah, and that lends itself also to the tabular form, because a table puts all the information on those X-Y axes; it's basically a chart without graphics, a chart with numbers in its place. One of the things that's even suggested for accessibility purposes is to put the label for the data as close to the visualization as possible. If you've got a line chart, put it at the tip, for lack of a better word, the right-hand side of the line. My favorite thing I've learned about that, and then I'll switch topics, concerns what we call run charts. The generic form of a chart that shows time on the bottom, on the X-axis, is a run chart; it's a special form of a line chart. My favorite thing I read about it, going back to my Tufte books, which I had tucked away and hadn't looked at in forever but pulled back out for this research, is that we rely on these charts, yet the passage of time is very rarely the causal part of what's happened. We have this entire visualization technique built around showing things change over time. Yes, time is changing, and that's very important to us; but rarely do we put as much thought into providing the context inside our organization or in our systems that is actually driving that change. How many of those hints are we providing to people? Which is the teaser for what I'm gonna try to talk about a lot: how do we do a better job of instrumenting the things that are currently not instrumented, and actually being able to... I don't think "causalate" is a word, but we correlate things often. Correlation is not causality, as we all know. Causalating things is much more difficult.
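
    A short sketch of two of those ideas together, again assuming matplotlib: the series label sits at the right-hand tip of the line rather than in a legend, and an organizational event (a hypothetical deploy) is drawn onto the run chart so the passage of time isn't the only hint about what drove the change. All data here is invented.

    ```python
    import matplotlib.pyplot as plt

    minutes = list(range(10))
    errors = [2, 3, 2, 4, 3, 18, 22, 9, 4, 3]  # fake per-minute error counts
    deploy_at = 5                               # hypothetical deploy event

    fig, ax = plt.subplots()
    ax.plot(minutes, errors, color="#0072B2")
    # Direct label at the tip of the line, per the accessibility suggestion.
    ax.annotate("HTTP 5xx / min", xy=(minutes[-1], errors[-1]),
                xytext=(5, 0), textcoords="offset points", va="center")
    # Context annotation: the change that plausibly caused the spike.
    ax.axvline(deploy_at, linestyle="--", color="#D55E00")
    ax.annotate("deploy v2.31", xy=(deploy_at, max(errors)), ha="right")

    ax.set_title("checkout-service errors (run chart)")
    ax.set_xlabel("Time (minutes ago)")
    ax.set_ylabel("Errors per minute")
    plt.show()
    ```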


    Mike Julian: You and I have talked about this previously, totally not on the show: there's plenty of stuff out there that seems hard or even impossible to measure, but it's actually the stuff we care about the most. One of my favorite examples is measuring customer happiness. A service-level indicator has to be mapped to customer happiness; otherwise, how do you know it's valid? Well, now I have a new problem: how do I measure customer happiness?


    Cory Watson: We get lazy and we just don't measure that stuff. We say, "Oh, you just can't measure that."


    Mike Julian: But that's not true.


    Cory Watson: Yeah, it turns out there's actually some techniques for that stuff.


    Mike Julian: There's a lot of really interesting techniques. Unfortunately, we could talk for hours about that topic alone. Perhaps, I'll have to have you back again sometime soon and we can dig into that one.


    Cory Watson: Yeah, the only other thing I'm poking at in this realm is control theory, which is where classical observability has its roots, not necessarily what we're talking about with computers. It's something I've doubled down on recently. It's pretty common, when vendors demonstrate this stuff, to see, "Here's a chart, here's it behaving badly, and here's some effect that we're having on the system." If you move forward into a more automated world, most modern, large-scale industrial things are automated to the point that things like control theory are what govern them. I'm very interested in why we still rely so heavily on intuition and people to do the operational work on the Ops side of the DevOps equation. How much of that could be improved if our systems were more, drumroll, controllable?


    Cory Watson: In the control theory sense, observability is how measurable something is: whether the state of a system's internals can be inferred from its outputs. What we don't have are systems that are easily controllable. How many of our systems have direct APIs for influencing the knobs and levers that govern their operation? That's something I'm really interested in, because as we increase the controllable surface area of our applications, if you imagine them as a multi-dimensional space of good configurations and bad configurations, how much of that stuff can we automate? There are people out there doing real math and science on keeping systems safe. Little's Law is a very commonly cited one for queueing, and the Universal Scalability Law and a lot of other performance-minded ideas can actually be applied to the things that we're doing, but our systems very rarely allow us to manipulate them that way. Instead, someone's gotta go edit the YAML file, save it, upload it, stop it, start it, redeploy it, blue-green it. How much more could we be doing if these things were a little more automated? I think that sort of solves the dirty secret that observability has, which is that it tells you something's wrong, but not necessarily what it is. That's probably the third thing I'm poking at a lot these days.
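
    For listeners who haven't run into it, Little's Law is a one-line queueing result: L = λW, the average number of items in a system equals their arrival rate times the average time each one spends in the system. A tiny worked example with invented numbers:

    ```python
    # Little's Law: L = lambda * W.
    arrival_rate = 200    # lambda: requests per second hitting the service
    avg_latency = 0.050   # W: average seconds each request spends in the system

    in_flight = arrival_rate * avg_latency  # L: average concurrent requests
    print(f"~{in_flight:.0f} requests in flight on average")  # ~10

    # Read the other way, it bounds capacity: with 40 worker threads and 50 ms
    # average latency, the service can sustain at most 40 / 0.050 = 800 req/s.
    max_throughput = 40 / avg_latency
    print(f"max sustainable throughput: {max_throughput:.0f} req/s")
    ```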


    Mike Julian: Oh, that's fantastic stuff. I'm hoping that you're going to be writing a bunch of blog posts and giving talks about all this.


    Cory Watson: Yeah, that's definitely something that's in my... I don't have a contract, but it's in my proverbial contract that I'm supposed to be writing about a lot of this stuff. In between talking with customers and assembling some of this, I think over the next year or so you'll see a lot more of that come out. I am definitely currently working on how to make better dashboards.


    Mike Julian: Wonderful.


    Cory Watson: As inspired by a lot of the stuff we're talking about here, I'm gonna have to totally check out the book that you mentioned earlier.


    Mike Julian: Awesome. Well, where can people find out more about you and your work?


    Cory Watson: Well, you can find out more about me on my personal website, which is onemogin.com, O-N-E-M-O-G-I-N. That's the Southernism for "do it one more time," if you haven't heard it before. You can also find me on Twitter @gphat, G-P-H-A-T. I do a pretty good job of babbling away in both places. Usually, though, Twitter's probably the right place, if you can deal with all my retweeted hilarious, weird Twitter jokes.


    Mike Julian: Yes. All right. Well, thank you so much for joining us on the show.


    Cory Watson: No problem. Thanks for having me.


    Mike Julian: To all the listeners, thank you for listening to the Real World DevOps podcast. If you wanna stay up-to-date on the latest episodes, you can find us at realworlddevops.com and in iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


  • About the Guest
    Josh is a salary negotiation coach who helps experienced software developers negotiate their job offers and the author of Fearless Salary Negotiation: A step-by-step guide to getting paid what you're worth.

    Links
    Twitter: @JoshDoody
    Website: fearlesssalarynegotiation.com
    Read Josh's book, Fearless Salary Negotiation
    Josh's Salary Negotiation Coaching services
    A detailed article on how to answer the salary history and salary expectations questions: https://fearlesssalarynegotiation.com/the-dreaded-salary-question/
    Salary negotiation email (how to write a counter offer email): https://fearlesssalarynegotiation.com/salary-negotiation-email-sample/
    Salary increase letter (how to write an email asking for a raise): https://fearlesssalarynegotiation.com/salary-increase-letter-sample/

    Transcript
    Mike Julian: Running infrastructure at scale is hard. It's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O'Reilly's Practical Monitoring.


    Mike Julian: Alright folks, I've got a question: how do you know when your users are running into showstopping bugs? When they complain about you on Twitter? Maybe they're nice enough to open a support ticket? You know most people won't even bother telling your support about bugs; they'll just suffer through it all instead, and God, don't even get me started about Twitter. Great teams are proactive about this. They have processes and tools in place to detect bugs in real time, well before they're frustrating all the customers. Teams from companies such as Twilio, Instacart, and CircleCI rely on Rollbar for exactly this. Rollbar provides your entire team with a real-time feed of application errors and automatically collects all the relevant data, presenting it to you in a nice, easily readable format. Just imagine: no more grepping logs and trying to piece together what happened. Rollbar even provides you with an exact stack trace, linked right into your code base, plus any request parameters, browser, operating system, and affected users, so you can easily reproduce the issue, all in one application. To sweeten the pot, Rollbar has a special offer for everyone. Visit rollbar.com/realworlddevops, sign up, and Rollbar will give you $100 to donate to an open source project of your choice through OpenCollective.com.


    Mike Julian: Hi, folks. I'm Mike Julian, your host for the Real World DevOps podcast. My guest this week is Josh Doody, a good friend of mine who is also a salary negotiation coach for engineers and author of the book Fearless Salary Negotiation. His articles have appeared on glassdoor.com and in any number of other places, and, amusingly, he's also been interviewed live on the BBC, which I thought was kind of cool when I found out about it. Josh, I'm just imagining the BBC calling you and you're like, "Oh, shit. I should probably put some pants on."


    Josh Doody: I already had pants on. I was at Starbucks and I started tweeting back and forth with whoever the producer was, and then within 45 minutes I was live on the air on international TV on the BBC being interviewed from my office. So that was interesting.


    Mike Julian: Yeah, that's pretty cool. Yeah. I don't know too many people that say they've been interviewed by BBC live. So that's pretty awesome.


    Josh Doody: That's a feather in my cap for sure.


    Mike Julian: Absolutely.


    Josh Doody: It was a pretty awesome, scary experience, and I'm really glad that it happened as quickly as it did because I think if I had time to think about the fact that that was going to happen I could've gotten nervous, but I didn't have time to get nervous. I literally was just scrambling to get some kind of lighting in my office and make sure my camera was working. So I spent 20 minutes on logistics and then I was on live international television being interviewed on BBC, and then it was over. Three minutes later it was done.


    Mike Julian: That's a lot of work for such a short little time. So I want to talk to you today about salary negotiation, job negotiation, asking for raises, that whole gamut. But where I want to start is everyone's favorite topic: everyone loves a good train wreck. So what's your favorite "negotiation went totally sideways" story?


    Josh Doody: Well, I was thinking about this before we talked in case you asked about it, and I'm not sure I can tell you my favorite one, because the only way I could tell it is if I censored so many pieces of it that it would be boring and meaningless to everyone. But the CliffsNotes for that one: it was a company that we have all heard of and whose services we use. They're currently a private company. Throughout the negotiation it was one red flag after another, my client trying to negotiate and then the company freaking out when he questioned the value of their equity, which, since they're private, is basically just Monopoly money. One thing after another where the recruiter was basically losing his mind, and eventually my client, even though this company had offered him more money, decided just to stay put. He was like, "You know what? I'm not going to go work there, because if they're treating me like that right now, I can't imagine what it's like to work there."


    Josh Doody: But there's another one I can be a little bit less vague about. It was a similar story, and it's a nice little way to kill two birds with one stone, because people like to ask, "Well, what happens if my job offer gets rescinded when I negotiate?" My clients ask, "If we do what you're telling me to do, are they going to rescind the offer? Are they going to get mad at me?" I say, "Listen, I can't tell you that there's a 0% chance that will happen. It's greater than zero, but it's so small I can't even see it on the graph." But it does happen occasionally. I think I've had, I want to say, two, maybe three clients in almost three and a half years that this has happened to.


    Josh Doody: But I had a client who was going to work for an engineering firm. He was an experienced mechanical engineer. He got a pretty compelling offer from this company, but he knew that there was room to negotiate, just based on market value research and some other kind of rudimentary stuff. Up to this point, there had been a couple of interesting interactions with the recruiter he was working with. This was a smaller firm, and the recruiter had done some weird things, like semi-unprofessional replies in email, the kind of thing that normally would have been a red flag. But I was like, "You know what? This is a really small firm. They probably don't have a lot of professional recruiters, if any, in house. This could just be a generic HR person who's also responsible for hiring; they don't seem to have outsourced that. So I wouldn't worry about it."


    Josh Doody: And then something else would happen, a weird conversation with the hiring manager. It's weird, but they're a small firm and they seem like good people. I could see their website; it looks like they're doing professional work. Eventually we got to the point where the recruiter just would not let off the gas in terms of asking for salary expectations and stuff like that. So eventually my client used the, "Well, you keep asking me for my salary expectations, but why don't you tell me the range you're willing to pay for the role and I'll tell you if that's in the ballpark." Which is a nice little judo move to use: you get a range, and you're able to tell them, "It's in the ballpark," which is not necessarily committing to that range, but it gets you past that question. And so they had given him a range of salaries, and then eventually they came back and finally said, "Alright. We're ready to make you this offer." This is after, again, we'd had a few red flags. And the offer was below the range. And I was like, "Well, that's weird." And so-


    Mike Julian: That is weird.


    Josh Doody: So obviously we're going to negotiate this, because they told us the range is higher than what they offered. And it's like, well, maybe they're just doing that so that we don't go blowing the range out of the water and try for the maximum salary. So we countered... I think the counter was actually still within the range, at the higher end but not even the top of it, a very cordial counter, and we made a good case. I mean, this person was a really good candidate for this role; it wasn't like he was a random candidate. And the recruiter, in response to that email we sent, just lost her mind. She was very frustrated with the counter. It was the kind of thing where I wondered if maybe she had replied to the wrong email, but no, it was all threaded there in the thread.


    Josh Doody: And she just said some things like, "I can't believe you're counter-offering. This is so disrespectful. They put together such a compelling offer and now I have to go tell them that you're 'counter-offering,'" and she's using quotes in the email around counter-offering and all this. And I'm literally reading it going, "We didn't even ask for the top of the range that she told us was available. We're not out of line here." And then she said, "You know what? We've decided this just isn't a good fit and we're just going to pull the offer." That was frustrating, because my client had spent several weeks... You know, he went onsite with them and did a bunch of work trying to get a good offer from this firm. We countered within the range they had told us was possible, and the recruiter ultimately pulled the offer despite really nothing out of line from us. But what I realized was that all those red flags added up to: this is just a disorganized firm. It's very possible that they had decided they didn't have the budget for it anymore or something like that. But just the way it went down, my client felt really bad. I said, "Listen, I promise you, I promise you that in two weeks you're going to call me and tell me how glad you are you didn't go work there." Three weeks later he calls me. He's like, "Hey, I just got a much better offer at a better firm. I want to work for them." We negotiated that, and we're off to the races. But it was the classic pattern: red flag, red flag, red flag, recruiter freaks out, and then I tell him, "Don't worry about it." He gets a better offer two weeks later, he goes to a better firm, and he lives happily ever after. That's typically what happens when I see offers rescinded: there were so many red flags that we probably should have seen it coming, and ultimately he's glad he didn't go work there.


    Mike Julian: That's been my experience too: getting an offer rescinded isn't just one of those things that happens out of the blue. You will typically see it coming long before it actually happens, much like your story. It's one red flag after another, and when it's finally rescinded, honestly, you shouldn't be that surprised they did it.


    Josh Doody: Yep. If your eyes are open, the red flags are usually there, and it's just a matter of-


    Mike Julian: It is hard to see them.


    Josh Doody: ... are you paying attention? Yeah. And a lot of times you don't want to see those because if you're that far in the process you just want to close the deal. You've already decided in your head that you've moved on. You're just starting this new job, you're hopefully making more money, and so you're willing to kind of overlook those little foibles of the company in order to just kind of get past it. Sometimes that works out and sometimes it doesn't.


    Mike Julian: Man, that's hitting way too close to home for me. I courted a company, and they semi-courted me, for years. I had gone through several different interview loops with them spread over three or four years. And finally, in the last one, we spent I think two and a half, three weeks going through interviews, and they just kept throwing last-minute interviews at me. And I'm like, "This is kind of weird." I would get on the phone with someone and he'd be like, "Yeah, I'm not really sure what I'm supposed to be doing here, but someone said to go interview you." And I'm like, "Okay. That's odd, but sure, let's chat." And everything was going great, everyone loved me. I had a call from the recruiter saying, "Hey, here's a verbal offer we want to make." And I'm like, "Okay. That sounds great." And then an hour later they rescinded the offer, and I'm like, "What just happened? Everything was going great." What I saw afterward is that there was a lot of disorganization, and there were red flags throughout the entire process that I wasn't paying attention to because I liked the offer so much.


    Josh Doody: Yup. Yeah. I mean, that'll happen. It's easy to get blind to that sort of thing and just ignore it and say, "Nah, it'll be fine. It'll be fine." And sometimes it is fine. Most of the time it's fine, but sometimes it's not fine.


    Mike Julian: Yeah. Most times it's fine; sometimes it's not. Something you've been saying here is that this is not typical of negotiation.


    Josh Doody: No. Not at all.


    Mike Julian: I've found that most people don't like to negotiate, that it makes them nervous, and that primarily they're afraid of the job offer being rescinded. But that's not how most negotiations go. So what's been your experience with someone negotiating a job offer for the first time? How does it actually go for them, usually? What's the happy path here?


    Josh Doody: Yeah. I mean, it's not even just a happy path. It's almost like a standard operating procedure, to be honest with you. I describe it as their playbook, right? And so this is a really fun part of my job, because so many people are terrified of this, but it's literally what I do professionally. I'm working with, I don't know, four or five clients right now, so I'm seeing this over and over and over, and, like I said, it's standard operating procedure. I'm able to tell my client, "Okay. Here's what's going to happen next. In about 12 hours you're going to get an email. It's going to say this." And then boom, it happens. "Whoa!" So typically here's how it goes; this is what I would say is the 80% case. You get the offer. Sometimes it's verbal, sometimes it's a written informal offer, but almost always informal. As a sidebar, I think the reason offers are so frequently informal is that there's some kind of ego or metric juicing going on, where companies want to be able to say that once they extend an offer, it's always accepted or something like that. And so it's like, "Well, technically I didn't extend an offer. I described a verbal offer to you. I described an informal offer," or whatever.


    Mike Julian: That's a nice scamming/gaming of the system.


    Josh Doody: I think it is. I don't have any data or confirmation on that, but that's sure what it feels like. So you rarely will just get a formal offer letter that you just have to sign. First you will get a verbal offer, or an email with some bullet points, some numbers, "how do you feel about this," that kind of stuff. And so the first thing I'll do is have my client make sure that they get something in writing, even if it's just, "Hey, thanks so much for the offer. I appreciate it. Would you mind just sending me the bullet-point summary in email so I can make sure I didn't miss anything?" Heading off any potential future miscommunication is why I do that. It's never actually happened that somebody misunderstood an offer, but you just want to be positive, because the next thing we're going to do is send a counteroffer email. Typically, that email will also have what I call the "why I'm awesome" paragraph in it, a description of why this person is such a good fit for this particular role at this particular company. I do that for two reasons. One is to give the counteroffer a little bit more weight, but the other is to make a document that can be circulated internally if they need to get approvals. I've had clients come back to me and say, "Hey, just so you know, my recruiter told me that that email with that paragraph literally got me another $10,000, because it went through finance and they approved it."


    Josh Doody: So we'll send that counteroffer email. Typically, the recruiter will get the counteroffer email and, within 12 to 24 business hours, they'll respond and say, "Hey, thanks for considering our offer. I wonder if you have some time to chat about this sometime soon." They're trying to move out of email and onto the phone. So then what I'll do with my clients is say, "Okay. Well, tell them you're not available until tomorrow afternoon." Then I'll create a script for them. The script is designed to give them the next moves they can make based on how the company responds to their counteroffer. So the company made an offer, let's say 150K base, and we countered at, let's say, 170K base. The first thing we'll do is talk about salaries between 150 and 170K: if they come back at 160K, what are we going to say? If they're at 165, what are we going to say? If they don't come off of 150, what are we going to say? And so I'll have a little script for them that's basically a way for them to home in on whatever the recruiter says in response and then make sure that they ask for the next few things that make sense. So, for example, if they were at 150 and they came up to 155, I think there's probably room there to ask for a little more salary. We might ask for 160.


    Josh Doody: And then if they say no, or they give a partial yes, meaning they don't come all the way up to what we asked for, which is kind of what I'm shooting for, by the way, then we'll ask for the next thing. "Okay. Well, you offered 20,000 equity. Can you do 30,000 equity?" No or partial yes? We'll move on to the next thing and say, "Yeah, well, you gave me 25,000 equity. That's great. Can you add a $10,000 sign-on bonus and I'm on board?" And so we're prepping for that, and sometimes rehearsing it, depending on whether the client wants to, so that when the recruiter calls them and says, "Hey, we offered you 150. You asked for 170. We can't do that. That's above our budget, but we can do 155. How do you feel about that?" they know exactly what to say. Once that call is over, usually that's basically the end of the negotiation. Either the recruiter will have said, "I'm authorized to do X, Y, Z. How's that feel?" and the candidate will say yes, or the recruiter will say, "I think I might be able to do that. Let me go talk to the comp team and see what they say." Then it'll wrap up and they'll talk start date. So that's the straightforward, very vanilla, easy-mode version, and that's about 80% of the time.


    Mike Julian: Yeah. It's incredible how scripted and prepared that all is and-


    Josh Doody: Yes.


    Mike Julian: ... I approached pretty much every job offer with "Oh, I'll just kind of wing it," until I learned that recruiters are always better at this than I am. Then one day I realized, "Wait a minute. The people I'm negotiating against do this for a living. They're doing this several times a day and I do this once every two years. Maybe I should be more prepared."


    Josh Doody: Yeah. That's what it's all about: being prepared for what comes next. That's a lot of what I do with clients; we'll ideate through different scenarios that might happen. And the scariest one, the one where people get the most scared and also try to wing it, is that phone call. 80% of the negotiation is just sending that counteroffer; if you can just do it, and do it tactfully, then you've done most of the work. But there's still that 20% where you can get another 5K salary or another 20K equity, and that can happen on that phone call. The problem is that, because you do it so rarely, as soon as you hear them go, "Hey, Mike, how's it going?" you freeze up and you go, "Oh, no," and then you don't say the right thing.


    Josh Doody: When I invented the script that I use, it was on the whiteboard in my office, which, Mike, you can see over my shoulder. I had scripted it out, and the recruiter said something to me, and I started to respond by asking for more vacation time. Then I glanced up at my whiteboard and realized that's not what it says. And so I said, "Wait, wait. Forget what I just said." And she kind of chuckled. I said, "If you can do, I don't know what it was... the 118,000, then I'm on board." And she said, "Alright." And so I almost just left money out there. I wrote about this in my book: I had the script, I had invented the script. I'm the inventor of this script. I am a salary negotiation expert. I wasn't at the time, but I was still ahead of most people, because I was doing it for maybe the third time. And I almost messed it up on this live phone call, even though I had planned it and written the script out, because I wasn't physically looking at the script when the recruiter spoke to me. So it's a question of adrenaline and experience. You just don't have a lot to call on, and things are happening really quickly. That phone call is two or three minutes long usually, at most. So it's super stressful, and we make mistakes under stress.


    Mike Julian: It is a very stressful, very short phone call. Every single time I've done it, it's been five minutes, but oh, man, it feels way longer.


    Josh Doody: It's exhausting. It's one of those you hang up the phone you sit down and let out a sigh and go, "Okay. What just happened?"


    Mike Julian: Yeah. I remember I had this happen some years ago. I was working for a company making 50,000, and I was going through interviews with a completely new company, and I hadn't told them what I was making; I was very careful about that. I was still kind of winging it. I still hadn't really prepared, and I got a text from the hiring manager that said, "Hey, I want to make you an offer for 90K." And I'm thinking, "Oh, my God, this is amazing."


    Josh Doody: Ding ding ding ding.


    Mike Julian: So I ask him like, "That sounds good. What's the title on that?" He goes, "It's like this junior systems engineer." Well, I knew he had a senior position also. I'm like, "I want that senior role." And he goes, "Oh, well, the salary for that is 120." I'm like, "Okay. That's good. I'll take that."


    Josh Doody: Sounds good.


    Mike Julian: So that was the extent of my negotiation. I'm like, "Oh, my God, I just did what now?" And I did this over text and I still hadn't prepared, but it was very much like ... I was lucky to do it. It was luck that played out, not my preparation.


    Josh Doody: Yeah. Well, I would push back a little bit on that. So yeah, I kind of agree. But just the fact that you knew that there was a senior role available and had the wherewithal to ask about it-


    Mike Julian: That's true.


    Josh Doody: ... that's not luck, right? That's preparation.


    Mike Julian: Yeah, that is true.


    Josh Doody: Most people wouldn't have thought about that. I mean, sometimes it works out that way, but I think you were prepared a little bit more than you gave yourself credit for there.


    Mike Julian: That's fair. So when talking sort of about some pitfalls that people run into what are three of the biggest mistakes that you see people run into when they're doing a job negotiation?


    Josh Doody: You mentioned one of them just a second ago and I said, "Good job." And that is that they share either salary history or salary expectations. Briefly, and this is a soapbox I've been on for a couple of years now, there are reasons you don't want to share either of those. For salary history, the main reason is that what you really want during the interview process, as you're getting an offer, is for the company to think, "What do we have to do to convince this person to join our team?" The problem with giving them salary history is that you've changed that from "What do we have to do to convince him to join our team?" to "What's the minimum we have to do to convince him to join our team?", which is usually your current salary plus 3% or 5% or something.


    Mike Julian: Right.


    Josh Doody: I mean, I'm sure a lot of people are listening to this right now and realizing, "Oh, I did that." What happened was, "How much are you making right now?" "I'm making $80,000." And then three weeks later you get a job offer and it's $85,000, and you're like, "Oh, wow! It's just a smidge higher than what I'm currently making. What a coincidence! What are the odds?" And the odds are zero that it's a coincidence, right? It's very much that they just said, "Well, he's making 80. We'll offer him 85 and see if he'll bite."


    Mike Julian: Yup.


    Josh Doody: On the flip side of that, the salary expectations question is also pretty dangerous. People's intuition on this is, "Well, I need to say a big number," and if you're going to say a number, bigger is better for the most part, within reason. But the reason I don't advocate sharing that number at all is easier to understand if I reframe the question as follows. I just did this with a client the other day and he said, "Oh."


    Josh Doody: So instead of, "What are you hoping to make if we hire you in this role?", imagine they said, "We're a pretty big company. We've got 10,000 employees. We've got a whole army of people doing data analysis, and we get salary surveys every month from four different firms, so we pretty much know what the market is paying. We have 17 people who do the job that we're hiring you to do, and they've all been here for about five years. We can pretty much dial in to the penny how much they should be paid. So given all that, what's your best guess at what we might pay somebody with your skill set to do the job we're thinking about hiring you for?"


    Josh Doody: And the answer should be a big... my favorite little emoticon thing is the shruggy guy. The shruggy guy just says, "Oh, I don't know." What they're doing is asking you to guess at their pay structures and their budget and all that stuff. You just don't know that stuff. And so you're either going to miss under, which is bad because you probably cost yourself money. You say 80,000 and they're thinking, "Whew! We would have gone as high as 95," and there goes $15,000 that you never knew you had. Or you say a number that's way too big. You say 120 when they were at 95, and they go, "We can't afford Mike. I was really hoping that we could bring him on, but maybe we should just move on to other candidates."


    Josh Doody: And so that may sound nice because it saves people time, but the problem is that you've missed the opportunity you so clearly articulated: if they didn't know what your expectations were, you had a chance through the interviews to convince them that you're worth what your expectations are. So maybe they started with a 95 budget for a mid-level engineer, but they talked to you six times in interviews, brought you onsite, put you on the whiteboard and all this other stuff, gave you a take-home project, and they come back and they're like, "Well, we were looking for a mid-level engineer, but I think we should probably offer this as a senior, and maybe we need to up the budget for that," right? So now you're able to talk them into that. You don't have that opportunity if you disqualify yourself by being out of their range initially. So that's number one, with a bullet and an asterisk next to it. It's underlined with an exclamation point. That's the mistake that almost everyone makes, and it warms my heart anytime I talk to a new client and they're like, "Hey, I read your stuff on not sharing salary expectations, so I haven't done that." I'm like, "Yeah, yes…


    Mike Julian: ... Awesome.


    Josh Doody: ," because it's going to be easier for us. You're going to be making more money. So that's a big one. The second one is, I would say not negotiating at all. A lot of people ... I feel like in my world it's crazy not to negotiate because that's all I think about. But most people still almost it doesn't even occur to them to negotiate. And I know this because so many of my clients come to me and they say, "Well, I got the offer. I was thinking about how to respond. And then I thought, maybe I shouldn't negotiate this. So I started googling salary negotiation stuff and I ran into you."


    Mike Julian: Almost like it's an afterthought.


    Josh Doody: It is an afterthought for them. And for every one of those people, there are 50 people for whom it never occurred at all, until maybe they were already working at the job and had lunch with somebody one day and realized that person was making 10,000 more than them, and that money had been available. So: just not even negotiating. I know it's not a very meaty answer, and I don't know what the number is, but I'm guessing less than half of people even consider negotiating. So there are a lot of people who are just not even trying.


    Josh Doody: And then I'd say the third one is giving up too easily. I mentioned before my 80% case, the straight line, how it usually goes. Even if they come back and say, "You asked for 170, we offered you 150, and our budget was 150," people go, "Oh, okay. Well, I don't want to lose the job, so I'll take 150." What they should say is, "Are you sure you can't get to 160?" or, "Well, what can you do?" Or, "Okay. Well, let's talk about the equity. Maybe you have some flexibility there." Or, "You didn't mention a sign-on bonus. Can you do a $10,000 sign-on bonus?" So there's opportunity there. Even if you feel like you hit a wall, there's probably an opportunity to find something else where they're flexible, by either directly saying, "Well, you're not flexible on salary. Is there anything else that we could talk about here? Is there anything else that you can give on?" or by asking for something specific. Giving up too early can be pretty expensive. The story that person would usually tell is, "Oh, well, I did counteroffer, but they said they couldn't move." But what they really mean is that the company said it couldn't move on base salary, and they didn't press on anything else. Usually, when you're that deep in the process, you have enough capital built up with them, in terms of how badly they want to fill the role, that if you ask for equity or a sign-on bonus, they'll just shrug it off. If it's something they can't do, they'll say, "Oh, no. We can't do that either." But they might say, "Yeah, here's another 5,000 shares," or, "Here's $10,000 for a sign-on bonus," or, "We'll do relocation for you." So that would be the third one.


    Mike Julian: Okay. Yeah. You started to talk a bit about how salary is not the only lever you can pull. You can pull equity, you can pull stock grants, you can pull vacation time, remote work. There are so many different levers you can pull to increase your total comp.


    Josh Doody: Yes.


    Mike Julian: And the levers are interdependent in some ways, but not as closely tied as people often think.


    Josh Doody: Yeah. And a lot of times... so it really varies by firm, but some firms, Google for example, and Amazon does this too, are not very flexible on base salary but can be very flexible on equity for the right candidates. Amazon especially will do this because of their weird vesting schedule. So there are other levers that you can pull, and you can't know whether they're fixed levers or variable levers until you ask about them. So yeah, there are things beyond just salary. I tend to have my clients tell me, "What are the two or three things that you care the most about that are explicitly in the offer?" Most offers that I see, because I work with engineers, look like some combination of base salary, equity, and sign-on bonus. Sometimes that sign-on bonus is zero, but it's almost always available if asked for. And so I'll ask them, just so I'm clear, "I assume that base salary is most important to you, followed by equity, followed by sign-on bonus. Correct me if I'm wrong." Sometimes they'll say, "Actually, I'd rather focus on the equity in this case," and I say, "Great. I'm hoping for a moonshot here. The salary is already good enough. Let's get more equity." But we'll talk about what your two or three or four most important things are, and then we'll build a strategy that allows us to essentially maximize each one of those piecemeal, in sequence, until we get to the end of the train. Then we can say, "Okay. Now we've maximized the offer in terms of base, equity, and sign-on bonus, and we feel pretty good about it."


    Mike Julian: I want to talk about that sign-on bonus for a bit, because sign-on bonuses aren't super common in the world of DevOps and SRE — much more common in software engineering roles. But what's a normal sign-on bonus at, say, a normal firm? Are we talking in the realm of $2,000? Are we talking $50,000? More? Less?


    Josh Doody: Yeah, I would say it depends on which firm. There are a lot of variables here, but I would say I feel ... I almost never ask for a sign-on bonus less than $10,000. And I've seen sign-ons that are in the six figures for specific engineers — maybe not DevOps or SREs but more for machine learning specialists, automation specialists, AI. But for most experienced engineers, most firms have kind of a little bucket of one-time cash sitting around that they'll dole out a little bit of for the right candidate. So I would think, especially for your audience, $10,000 is a reasonable amount for a sign-on bonus, not to expect but certainly to fish for and see if it's available.


    Mike Julian: Yeah. It's a matter of calibration. Because they're not very common, like what's a reasonable number here? “What's possible?” is really the question.


    Josh Doody: Yeah. And so a lot of times what I'm hoping is that the company will give us something to key on. There are a lot of little subtle things happening with my strategy, and one of them is that I like to counter on base salary first. One reason I'm doing that is to get more base salary. But another reason is that I actually want to see how flexible they are on different stuff. And so sometimes they'll come back because the base salary is more than they can budget, which is sort of by design. I'm intentionally asking for a little more base salary than I think they can do. What I want is for them to move a little on base and then move on some other dimension, like equity or a sign-on bonus, so that I can see how much they move or what they offer. So if we countered with $30,000 more base salary and they can only do 15, maybe they'll throw in a $25,000 signing bonus, and now I have kind of a baseline to use when I'm talking about the sign-on bonus with them.


    Mike Julian: That's pretty awesome.


    Josh Doody: But I try to get them to tell me first, and if they don't, then I usually start by keying off of whatever information I can find online, on Paysa or Glassdoor or something. But usually I just key off of how much of the base salary gap we asked for is still unfulfilled. That's usually my guide for how much of a signing bonus to ask for. So we asked for 30K more base. We got 15. There's $15,000 we need to try and come up with somewhere, in a way that's a coherent story. And so I might end up asking for a $15,000 or $20,000 sign-on bonus, in that range.
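
    As a rough illustration of the arithmetic Josh just walked through, here is a minimal sketch in Python of the gap-based heuristic. The function name and the upper-bound multiplier are illustrative assumptions, not anything stated in the conversation:

    ```python
    # Minimal sketch of the sign-on bonus heuristic described above:
    # key the ask off the unfulfilled portion of the base salary counter.
    # The 1.35x upper bound is an illustrative assumption.

    def signon_bonus_range(base_counter: int, base_received: int) -> tuple[int, int]:
        """Suggest a sign-on bonus ask keyed off the unmet base salary gap."""
        gap = base_counter - base_received   # the part of the counter they didn't meet
        return gap, int(gap * 1.35)          # ask for roughly the gap, up to ~1.35x of it

    # Josh's example: countered +$30,000 on base, they moved +$15,000.
    low, high = signon_bonus_range(30_000, 15_000)
    print(f"Ask for a sign-on bonus of ${low:,}-${high:,}")  # $15,000-$20,250
    ```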


    Mike Julian: Sure. So we've been talking a lot about new jobs, new job negotiation. There are a lot of people out there who actually really like the job they have but feel underpaid. Maybe they've been there for 15 years and they really like the place; they don't want to leave, but they just want more money. What can they do?


    Josh Doody: Well, they can forget everything that we just talked about. It's a totally different process.


    Mike Julian: Alright then.


    Josh Doody: I'll start by saying that this is a process I developed as a hiring manager, when I took over a team of people who were all underpaid. And just in case, on the off chance any of those people happen to be listening, they were underpaid just through inertia, through time passing at the firm, kind of like you said. This wasn't a 15-year situation, but they'd all been doing their jobs for two or three years, had been really good at them, and had essentially been promoted into jobs that were more demanding, and the salary just hadn't caught up. And so they came to me and said, "Hey, when I hired in I thought I was going to get this kind of raise over time, and my salary just hasn't tracked with that." I said, "Great. Here's what I need from you." And what I told them I needed from them is what I've written a course on and what I teach people. I wrote about it in my book. And it's this. First, they need to do some research and market value estimation to figure out, "Okay. Well, I've been at the firm for 15 years. I know I'm definitely behind on pay. But how far behind am I? What should I be making right now given my skill set, experience, and my value to this firm?" And figure that number out. That's what I call your target salary.


    Josh Doody: And so they need to go in with a number. So this is totally different. That's why I said forget everything we just talked about. In a new job offer, we want the company to say the first number because we don't know what their scale looks like. In the current job situation, we do know what the scale looks like, because I know my salary and they know my salary. And so you have to basically present everything to your manager on a silver platter, to make it as easy as possible for them to just rubber-stamp your request and go to finance and see if they'll approve it.


    Josh Doody: And so the first thing you want ... There are three things that you need. The first one is a target salary. That's based on the research that I mentioned. You want to do as good a job as you can to come up with a number that is reasonable and makes sense given your tenure and your value to the company, and you want to try and tie that to your actual work product. The second thing is accomplishments. This is how you demonstrate the value that you're asking for. The way I describe raises is ... let's see if I can get this right; I haven't said this out loud in a while. You're asking to be compensated for the unexpected value that you've added since the last time your salary was set. The key phrase there is unexpected value. Wherever your salary was set 15 years ago, you're now producing significantly more value than you were when that salary was set. And some of that value is unexpected: you accelerated faster than anticipated, or it's unexpected in the sense that you're just not being compensated for it, because we didn't do raises for five years during the recession or whatever it was, right? And so you're asking to be compensated for the unexpected value that you've added since the last time your salary was set.


    Josh Doody: And so you want to be able to say, "Well, here's what I think I should be making. Here's what I would like. I would like a $120,000 base salary." That's your specific ask. "And the reason I'm asking for that is I'm actually managing twice the business we expected me to be managing at this point. I've been mentoring six people, and I've also taken a lot of the responsibility that you, my manager, have onto my plate, in terms of reporting and getting things ready for roll-ups, so that you don't have to do that work and can focus on more valuable work, which saves the firm a lot of money and time." And so you're justifying your ask.


    Josh Doody: And the third thing I call accolades: other people recognizing what you're doing. That's a client who sent you a gift card because they loved your work. That's a coworker who wrote you an email and said, "Thanks for saving my bacon." That's your manager sending you an email identifying that you're a superstar and that you're crushing it this year, whatever that is. And so you package those all together. The way that I like to do this is to have my clients write an email that's going to be sent to the manager later. The idea is that if you can't write this email, then you've got problems; there's something missing. If you can't figure out the target salary, your manager can't figure it out either. If you can't back up your request with accomplishments, there's something wrong: you're asking for more money but for no reason. If you don't have any accolades, that's optional. But it can sometimes be a red flag, because it's like, "Well, nobody has recognized the work you're doing." That's probably not good. But that part is optional.


    Josh Doody: And so you package it up into this email. I'll send you a link to the template after we talk. You get it ready, you rehearse it, and then you ask your manager if they have time to talk about getting a raise. You tell them, "I'd like to talk about my compensation at our next one-on-one." You talk to them, you verbally present the case that I just described, and then after that conversation you follow up and say, "Hey, just wanted to follow up with an email summary of our conversation today about my raise request. Here's a summary of what I'm asking for." And you send them that email. Again, this is a document that can be circulated, and that's why you do it. So they can just rubber-stamp it and say, "Yep, looks good to me. Hey, finance, can you do 120 for Josh?" "Yes." "Great. Let's do it." Or, "No, the best we can do is 110," or whatever. Now the conversation is rolling. So in a nutshell, as a manager, that's the process I developed, and that's the process I teach people when they ask.
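
    To make that structure concrete, here is a minimal sketch, in Python, of how the three components could be packaged into the kind of summary email Josh describes. The class, field names, and template wording are illustrative assumptions, not Josh's actual template:

    ```python
    # A rough sketch of the three-part raise request described above.
    # Field names and email wording are illustrative assumptions only.
    from dataclasses import dataclass, field

    @dataclass
    class RaiseRequest:
        target_salary: int                  # from market research
        accomplishments: list[str]          # unexpected value since pay was last set
        accolades: list[str] = field(default_factory=list)  # optional third-party recognition

        def to_email(self) -> str:
            lines = [
                f"Following up on our conversation: I'd like to request a base salary of ${self.target_salary:,}.",
                "Since my salary was last set, I have:",
            ]
            lines += [f"  - {item}" for item in self.accomplishments]
            if self.accolades:
                lines.append("Others have recognized this work as well:")
                lines += [f"  - {item}" for item in self.accolades]
            return "\n".join(lines)

    request = RaiseRequest(
        target_salary=120_000,
        accomplishments=[
            "Managing twice the business we originally planned for this role",
            "Mentoring six teammates",
            "Taking over reporting and roll-up prep from my manager",
        ],
        accolades=["Client sent a gift card praising the project work"],
    )
    print(request.to_email())
    ```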


    Mike Julian: Man, that's fantastic advice. Coming from the other side of the table, I know there are things that managers must be doing wrong in negotiations with candidates. They're losing great people. What advice would you have for the managers that are listening, on how to improve negotiation with a candidate?


    Josh Doody: Yeah. It's almost not even negotiation, but you kind of mentioned what it is. I mean, employee attrition is very expensive. And if somebody leaves because they feel like they're undervalued or underpaid, that can be a really expensive thing to happen to a firm, when that person would rather have stayed and the firm needs them to do the work they're doing. And so I think maybe the big mistake managers make with this stuff is they're really busy, there's a lot going on, and they think, "Yeah, this person has mentioned a couple times they're a little frustrated with their pay, but I'll just deal with that later. I'll just kick the can down the road. I'll wait until it's slower up here. I'll wait until the end of the quarter," or whatever that is.


    Josh Doody: And I think that's a mistake just because it takes a lot of courage for an employee to go to their manager and say, "Hey, I know you're paying me well, but I'd like to be paid better," or, "I know you're paying me a paycheck every two weeks. I want that paycheck to be bigger." It's intimidating. And most people will not even do it. They just won't. And so if you're a manager and somebody comes to you and says, "I'd like to talk about my pay," you should perk up and listen. That's significant. They're overcoming a lot of inertia to do that, to have that conversation, to say that thing to you. And there's a really good chance that before that conversation, they've already started looking for jobs somewhere else.


    Mike Julian: Yeah. They've probably been thinking about this for months by the time they work up the courage.


    Josh Doody: Yup. They've been talking about it over happy hour on Friday. It comes up every Friday. They've been complaining about it. Somebody finally said, "Why don't you talk to your boss?" And they finally did, but it's after months, and maybe after they sent their resume to a couple of firms, maybe after they changed their status on LinkedIn to say "looking for work." And so if you care at all about having that employee on your team, if they have the courage to talk to you, you should sit up and listen and say, "How can I help?" And actively work with them to help. And then if they do present a case like this to you, give them real feedback on it. Tell them, "Here's what I'm going to do. Here's the timing on what we're going to do. Let me know if that works for you. Here's how I'm going to help you through this process. Here's what else I need from you. Here's what I think the timing will look like." Communicate clearly with them. So number one, don't blow off their request. Don't kick the can down the road. That's a big mistake that could result in you getting a two weeks' notice a couple of days later. And two, communicate with them openly and tell them what you need from them. If you know what the process looks like to get raises approved, tell them, "This is what I need from you to get this done." Be open with them about it and keep them posted on the process.


    Josh Doody: And the last thing I would say is, if that person has come to you, asked for a raise, and made their case in the way I described, and you disagree with their case ("Okay, I hear what you're asking for, but I just don't think you're quite ready for that yet"), not only should you share that with them, but you should also say, "And here's what you can do to get the raise you just asked me for. Here's what I need to see from you, performance-wise, to justify this when I go to finance. I think that'll take six months. There are discrete steps that you need to take, and if you accomplish those things, I think I can make a good enough case to get you a raise, and we'll follow up on this in our next one-on-one." So those are the three things I would say managers should be thinking about.


    Mike Julian: Alright then. Well, Josh, this has been just a fantastic time. Thank you so much for joining us. Where can people find out more about you and your work?


    Josh Doody: The easiest place to find me, actually to interact with me, is probably Twitter. I'm @JoshDoody on Twitter. Super active on there. If you mention me, I'll reply. Most of my salary negotiation work, including my coaching offering and all that good stuff, is on my website, fearlesssalarynegotiation.com. There's all kinds of free material there. I've made a ton of free stuff available. If you want to learn more about the coaching that I mentioned, that's fearlesssalarynegotiation.com/coach. Those are the three things I would say are probably the most useful and directly useful to your audience.


    Mike Julian: Yeah. And you wrote a book called Fearless Salary Negotiation, which I've had the pleasure of reading, and it's absolutely fantastic. I love it. Thank you so much for that. It's one of those-


    Josh Doody: Thank you for saying that.


    Mike Julian: It's one of those books I wish that I had read just years ago when I first started this. I shudder to think about how much money I've left on the table as a result of not having had that. So-


    Josh Doody: That's why I wrote it. That's very kind of you to say that. Thank you for saying that.


    Mike Julian: I've been sending all of my friends to you for negotiation and such. The work you're doing is fantastic. So thank you so much for joining me.


    Josh Doody: Awesome. Yeah, thanks for having me. It was a really good talk, and fun to talk to you face-to-face for the first time in a while.


    Mike Julian: So to all listeners, thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.


  • About the Guest: Matty Stratton is a HumanOps Advocate at PagerDuty, where he helps dev and ops teams advance the practice of their craft and become more operationally mature. He collaborates with PagerDuty customers and industry thought leaders in the broader DevOps community, and back when he drove, his license plate actually said “DevOps”.

    Matty has over 20 years of experience in IT operations, ranging from large financial institutions such as JPMorgan Chase to internet firms, including Apartments.com. He is a sought-after speaker internationally, presenting at Agile, DevOps, and ITSM focused events, including ChefConf, DevOpsDays, Interop, PINK, and others worldwide. Matty is the founder and co-host of the popular Arrested DevOps podcast, as well as a global organizer of the DevOpsDays set of conferences.

    He lives in San Francisco and has three awesome kids, who he loves just a little bit more than he loves Doctor Who. He is currently on a mission to discover the best pho in the world.

    Links:
    Twitter account: @mattstratton
    Website: mattstratton.com
    Arrested DevOps podcast
    Bryan Berry’s article: You Should Start a Technical Podcast

    Transcript

    Mike Julian: Running infrastructure at scale is hard. It's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.


    Mike Julian: Alright folks, I've got a question. How do you know when your users are running into showstopping bugs? When they complain about you on Twitter? Maybe they're nice enough to open a support ticket? You know most people won't even bother telling your support team about bugs. They'll just suffer through it all instead, and God, don't even get me started about Twitter. Great teams are actually proactive about this. They have processes and tools in place to detect bugs in real time, well before they're frustrating all the customers. Teams from companies such as Twilio, Instacart and CircleCI rely on Rollbar for exactly this. Rollbar provides your entire team with a real-time feed of application errors and automatically collects all the relevant data, presenting it to you in a nice, easily readable format. Just imagine — no more grepping logs and trying to piece together what happened. Rollbar even provides you with an exact stack trace, linked right into your code base, plus any request parameters, browser, operating system, and affected users, so you can easily reproduce the issue, all in one application. To sweeten the pot, Rollbar has a special offer for everyone. Visit rollbar.com/realworlddevops. Sign up and Rollbar will give you $100 to donate to an open source project of your choice through OpenCollective.com.



    Mike Julian: Hi folks, I'm Mike Julian, your host for the Real World DevOps podcast. I'm here speaking to Matty Stratton today, DevOps advocate for PagerDuty and the host of my arch nemesis, the Arrested DevOps podcast. Welcome to the show.



    Matty Stratton: Thanks Mike, I'm really excited to be here. It's always fun to be a guest on somebody else's show because you do a lot less work.



    Mike Julian: Ain't that the truth? I've been watching, or listening, I guess, as it is, to the Arrested DevOps podcast since, God, it began; it feels like a million years ago at this point.



    Matty Stratton: It's over five years. December 2013 is when we started. I should know off the top of my head how many episodes we have but I don't. In the hundreds.



    Mike Julian: Somewhere between five and 50,000.



    Matty Stratton: Right. It's a non-zero number.



    Mike Julian: Something that we were talking about before we started recording was this idea that there are a lot of people who shift from doing ops work into, I guess you could call it, an influencer role, where we're talking about the work and helping other people become better. It seems like that's what you did five-plus years ago with Arrested DevOps, as well as what you're doing now with developer advocacy, or DevOps advocacy, whatever you want to call it. Is that about right?



    Matty Stratton: I would say so. I have found that when I try to explain to people what I do for a living now, I say I worked in operations for 20 years and now they pay me to talk about it. But I didn't start out that way. The journey for Arrested DevOps was, it never started out to be a podcast. I didn't know what it was going to be, but when I was starting to learn about DevOps and starting to do some DevOps transformation at an organization I was at, I was taking in a lot of the podcasts of the time: DevOps Café and the Ship Show and Food Fight. The one thing that I found was there was a lot of material out there related to DevOps, but there wasn't a lot of stuff that seemed suited to people like me who were kind of dumb, for lack of a better word. Maybe a better way to say that is people who are very new to it. I actually originally intended to start a blog, because I had been blogging for years and years and years. I wanted to create a blog that was going to be very accessible. I don't remember if it was on Facebook or Twitter or somewhere that I was looking for ideas for what to call this blog, and my friend Jessica, who is not in the tech industry, she's a writer, is the one who coined the name “Arrested DevOps.” I bought the domain but didn't do anything with it for a while. Then I decided maybe I'll start a podcast. One thing I know about myself is I'm very good at starting things and very bad at doing them.



    Mike Julian: Aren't we all?



    Matty Stratton: Yeah. One of the things that helps with that is having a partner or having somebody else that keeps you accountable.



    Mike Julian: Oh yeah.



    Matty Stratton: I had met this cat named Trevor at an Azure meetup in Chicago; he's a software engineer type and into this DevOps stuff, and I was like, "Hey Trevor, maybe we should do a podcast." That was back in November of 2013 — we did our first episode, I think, in December of ‘13. But what's interesting is how it was never intended to be an influencer role. What I wanted to do was take things that I was learning and make them accessible. One thing I learned very quickly doing the show, and Mike, you've learned this lesson similarly with your newsletter and now this show: you can't control your audience.



    Mike Julian: No, not at all.



    Matty Stratton: Even in our very first episode, we had people who were tweeting at us, because we used to live stream the show. They were tweeting about listening to the show, and I remember thinking, "Why are you listening to my show? You should be on the show. You're not going to learn anything from me; you're the experts." Over the years we get the “it's not technical enough,” “it's too technical,” “it's too many of the same people,” “it's too many people I've never heard of.” No matter what, we can't appease everybody.



    Mike Julian: Yep.



    Matty Stratton: So what we do is, I just try to make myself happy and Bridget happy.



    Mike Julian: That's really all you can do.



    Matty Stratton: If we keep Matt and Bridget happy, that's fine, Trevor shows up sometimes. We have a new host, Jessica Kerr. We really want to make Jessica happy because she's new. We don't want to lose her.



    Mike Julian: Don't want to scare her away too quickly.



    Matty Stratton: Yeah.



    Mike Julian: Yeah. I remember, before this, I was going through, looking at the archives. I happened to look at the very first episode, and the quality is so bad. The video, it's just this live recording, the audio quality is not awesome, and you contrast that to the stuff you're doing today, and it's just way better, so much better. I feel like a lot of people look at what people are doing now and say, "I could never do that." It's like, "No, the people doing it couldn't do it either. It's years and years of work to get something great." The people hosting are not ... I don't think I'm anyone special. Who am I to be talking about the stuff that I am? I'm just the one interviewing people. It's kind of like going to a conference, and I know you speak at a ton of conferences too: the people on stage are not necessarily the experts, they're just the ones on stage.



    Matty Stratton: I found through my career, through doing podcasting and event organization, that it gives you access to conversations that you would not normally be able to have. I will quote from Bryan Berry, who was one of the founders of the Food Fight podcast, which was originally a Chef-oriented podcast and was about all things DevOps. Bryan had a blog post that is unfortunately hard to find, but if we can find it maybe we'll put it in the show notes, and it's called "The Dirty Little Secret of Tech" ... no, "Why Everyone Should Start a Tech Podcast." In this post he said the dirty secret of tech podcasting is this: it's how you get someone to sit and talk to you for an hour. You would not normally have that opportunity. You can't normally go up to folks at a conference and say, "Let's sit down and talk about this crap for an hour, just to shoot the shit." It has nothing to do with people being rude; who has that time? Everyone's like, "I can't do that, I'm busy." Then you say, "Hey, come on my podcast." "Sure."



    Matty Stratton: We used to joke at ADO about how big we had to get before I felt comfortable enough to invite Jez Humble on our show. It turns out the magic number was 11 episodes, I think; that's when we felt like we could. Also, by the way, there was a point where it seemed your 11th episode had to involve Etsy. That seemed to be a thing in DevOps podcasting back in the day. But as you know, things have changed. As an event organizer, I look at my goal and my role as being a force multiplier, a signal booster; it's giving people another avenue to communicate. But what ends up happening is you kind of come along for the ride, because you're part of that signal that you're boosting. If you're clever, you learn along the way. I guess that's sort of a thing; it would be possible to do this and not get any smarter. I feel like it's a great opportunity, because I've been able to have conversations that I had not had in my career until I started podcasting and doing events, just sitting and having chats. Then I can ask my questions, I can learn. And the thing is, the thousands of people who are listening get to learn that as well. I look at a big part of my role on the show as being an audience proxy: "Let me sit there, I can ask the questions, I can check for you, and then all of us can learn."



    Mike Julian: Absolutely. I have a friend who got one of those guests that you always think you'll never be able to get, the aspirational guest: they're so big, they're so inaccessible that you could never possibly get them on the show. But you try anyway. You reach out and you keep reaching out. He finally got this person to agree; he had to go through two layers of agents and PR reps, but he finally got this person to agree. It was awesome. He's like, "Crap, what now? I have all these questions I want to ask, but I'm afraid of screwing this up. I want to make sure this is good for the person I'm interviewing, but more importantly that people listening learn a lot." One of the things he did, which I thought was genius, was he reached out to a bunch of people who were also in that space, in the consulting world. He asked a bunch of consultants and freelancers, "I'm interviewing this person you know. What should I ask them? What do you want to know? What have they never talked about before? What are you unclear on?" He ended up with hundreds of questions. He couldn't possibly ask all of them, but the conversation that resulted from that was incredible.



    Matty Stratton: I think the thing that's great about that, about sourcing from the outside, is that it gives a starting point, like you said, to a conversation that might not have happened otherwise; it brought in some flavor from minds other than the host's. Even if it wasn't getting all the questions answered, it was getting some of that flavor in. We talked about going from working in a role, working in operations or something like that, into more of this persona, influencer, whatever we want to call it these days. I found that it's great that this is my job now. This is 100% my job, doing stuff like this. I did have a little bit of a thought this morning, though, and maybe this is just me, I'm hoping this is common: no matter what it is that we're trying to achieve, when we achieve it, when we get that dream job, that dream job suddenly becomes a thing we do everything in our power to not do anymore. By that I mean, tongue in cheek, procrastinating.


    Matty Stratton: But I was sitting there, for example, with a blog post I have to write. I used to love blogging. I still love blogging, but I would do it in my spare time and it was this fun thing; I did it all the time. Now it's because somebody told me to do it. I'm like, "I don't want to write that post. I just really don't want to do it. I so badly don't want to do it." The thing is, I know later, when I get off this recording and sit down, I'm going to write it and it's going to be really fun to do. But I think about the things that I quote “have to do,” and if I could go back in time 18 months, the Matty of 18 months ago would hear what I'm complaining about having to do now and be like, "Would you just shut up? This is what you get to do." I think that's been the case throughout my career; I guess what it boils down to is we're never happy. We're never satisfied. By no means should this come across as saying I don't like my job or I don't like the things that I do — it's just that no matter how much you may desire doing something, when you just don't want to do any work today, it doesn't matter what the work is. You will decide you don't want to do it. That was my thing today. I don't want to work today.



    Mike Julian: Totally been there.



    Matty Stratton: No matter how fun the job is, I am going to do everything I can to do something else stupid that isn't work. That’s probably harder and less fun than actually doing the job, but it is technically not work.



    Mike Julian: The lengths that we will go through to avoid doing the work are significantly more than just doing the work. The amount of things that I have found to do aside from writing, it's staggering.



    Matty Stratton: Absolutely. That's true of writing… it has multiple definitions, but the one that I like is that it's an amazing form of procrastination. I used to say the best time to get anything done in my apartment, back when I had a roommate in grad school, was during finals. We got so much done.



    Mike Julian: Holy crap that apartment was so clean.



    Matty Stratton: So clean, so much baking got done, we worked on projects together, anything but studying.



    Mike Julian: There's this role, the role of the influencer, advocate, evangelist — goes by so many different names. It seems to have come out of nowhere, at least seemingly to me. It didn't feel like this role existed five years ago but now I see a ton of people doing it.



    Matty Stratton: I have a couple thoughts on that. One is, it's been around for a long time. First of all, we know Guy Kawasaki was the original coiner of the term evangelist. He was an evangelist at Apple Computer back in the ‘80s. His job was to make Apple cool. I remember even from the Microsoft perspective that there were people I worked with as a customer of Microsoft who were part of an organization at the time called DPE, Developer Platform Evangelism, at Microsoft. I remember thinking, when they would come and work with us, “You have the coolest job ever. Your job is to play with shit and come in and tell me how cool it is.” I'm trying to think when I started learning about DPE; it was probably in the 2006, 2007 era. But what I was going to get at is, it’s kind of like when you have a red car-



    Mike Julian: Yes, suddenly you notice them-



    Matty Stratton: ... suddenly you see all these red cars. I'm very aware of people in developer relations and developer advocacy and developer evangelism, and part of that, I think, is not an echo chamber exactly, but you follow people that do things similar to what you do, and those voices get amplified within your network effect. But I don't mean to say it's not been exploding; I think that's absolutely true. Over the last few years, companies are seeing the advantage in marketing and communicating to developers, marketing with a lowercase “m,” whatever that is. Selling to developers is challenging because it's not a traditional audience. By that I mean you're selling to them, but you also need to be selling to them because they're the influencers within their own organization as well. As our friend Cory is fond of saying, "Engineers make terrible customers," because most engineers don't have signing authority. They can't actually buy anything, but they can be incredibly influential within the organization. What we're also seeing, and even large organizations like Microsoft have realized this, I was going to say recently but I feel like they figured it out about five, seven years ago — is that the old way of selling, where you have your relationship with the vendor, the old “IBM is the answer, what's the question,” right?



    Mike Julian: Right.



    Matty Stratton: This is not the way of the world anymore. You're going to be in some type of a network, you're going to be in some type of an environment, and that means that you need to play for that network effect. I can't remember where I was going with this. I guess what's happening is you're not going to get the whole account. You're not going to get the whole company. You're not going to sell every bit and byte that Global Corp is going to consume. It's not like they're writing a PO to you for every bit of software they ever buy. You're going to be able to get in there with your piece and part. You need to win some hearts and minds in order to do that. You need to win hearts and minds by being the best solution, because you're not going to sell to engineers through relationship selling. What I mean by relationship selling is: you and I have a personal relationship, we go golfing together, you buy software from me.



    Mike Julian: God I wish it was that easy.



    Matty Stratton: Because it's too small. It doesn't work at scale, which means you have to understand, you have to see, you have to have the voice of the software engineer as the customer coming back into your company. I think that's why you're seeing more use of the word advocate as opposed to evangelist. We at PagerDuty rebranded my role and the role of my team (I say my role because I was the only one that was called an evangelist), and then we changed it, because it's a two-way street. We advocate to and for the community, to help make the product better based upon what the community needs in order to build on it. To wrap it up, this is where I was getting at: if you want to have success, people have to be able to extend and build with the product they're buying from you, because they're not operating in a vacuum. This is why developer advocacy and developer relations are so important: it's all about building that community and that developer platform, and that doesn't work if you're not really understanding what those folks need. That's my opinion about why I think we're seeing it become more and more critical in today's industry.



    Mike Julian: Got you. What led you to switch from doing the hands-on-keyboard operations work to being in an evangelist or advocate role? What made you decide to make that change?



    Matty Stratton: Like many things in my career, it all happened by accident. I kind of turned around and said, "What happened?" I've always been interested in public speaking and things like that. I was a theater major in college, I did forensics, I studied improv, so no matter what I was doing, I would always look for an opportunity to present. When I was running technology for this eCommerce company, for apartments.com, several opportunities came up to do some conference speaking. They were at vendor conferences; it was a "Hey, we really like what you've done with our product," you know how they go, "come and talk on our stage about that." I absolutely loved it, I thought it was super great. Then when I left there, that was when I was starting this transition with DevOps, and of course, a big part of DevOps that was happening at that time was DevOpsDays. Not that DevOpsDays isn't important now, but I was learning about DevOpsDays and I said, "I wonder why there isn't one of those in Chicago.”



    Matty Stratton: To put things in perspective of how differently we thought about DevOpsDays back in 2013, 2014, there was a concern that having DevOpsDays in both Chicago and Minneapolis would put them too close to each other. The events might draw from each other. We weren't sure, since there already was going to be one in Minneapolis, could there maybe be one in Chicago also? It was through doing that that I started to apply to speak at a couple of different events, and that's also the time I started working at Chef. I was public facing, working as a sales engineer, or I'm sorry, a solution architect. I always prefer the term sales engineer because it gets you out of answering the question you don't want to answer. If they ask you too technical of a question you can say, "Well, I'm a sales engineer," or if they ask you a pricing question you can say, "I'm a sales engineer." Anyway. Very public facing, doing a lot of speaking at customer sites and prospect sites. Started doing more and more public speaking through that.



    Matty Stratton: It was in my role at Chef that I really started to get to know the community a lot better. That's when I started running DevOpsDays in Chicago and got involved with the global organization, and then an opportunity kind of came up at PagerDuty where it was like, wait, this could be the whole thing that I do. I had been learning a lot more about developer advocacy and developer relations and community building, and there wasn't really something for me at Chef in that capacity. I had been exploring opportunities within that organization to shift my focus more towards advocacy stuff, but the stars weren't aligning. It's one of those things where I went up for a job, didn't get it, looked at the person they hired, and said, "Of course you hired that person. You would have been a fool to hire me." That's always the way I like it. If I go up for a job and I don't get it, I always at least like to see the person who was hired, and I would have made the exact same call. Yes, yes indeed, you should have hired her. She's much, much better. Then it's been great. That's what I've been able to focus on, and I've learned a lot about the community of DevRel itself. Even though it's a long-standing profession, putting a lot of operationalization around it, and businessification if you will, is relatively new. Learning about that has been really great.



    Mike Julian: It seems to me like an entirely new career path has opened up for people doing operations, DevOps, SRE, whatever you want to call it. Though it existed before, there wasn't enough opportunity, and now there's way more, so we have a new viable career path for people who want to do more speaking, be more public, and be the lever that helps the rest of the industry grow. They now have another option.



    Matty Stratton: I think that's really great, and I think you're right. It's really that operations advocate, that SRE advocate if you will, focusing more on the ops side and the DevOps side, as opposed to the software evangelist or developer relations side. I'm seeing more and more of that. We definitely had a hard time filling that role at PagerDuty, because there aren't as many people yet; it's relatively new. But I feel like every time I look around, I'm seeing more and more people moving into it who are phenomenal, who are great. Seeing that ability to take the experience from real-world operations and real-time operations and say, "What can I do to help make people better? Maybe I need to quote-unquote put the pager down for a little bit and go help people be better for a while." Then I see a lot of folks that move in both directions. Eric Sigler was a DevOps advocate and evangelist at PagerDuty for four years, and he said, "I've done this for a while. I need to go be an SRE for a while because I'm getting burnt out, and it's time for me to go do a different kind of work for a while." I think it's nice that we can move around.



    Mike Julian: Absolutely, I know a few people that have gone back the other direction. They'll go from system administration to an advocate role, do that for a couple years, and then go back. It's not that they didn't like what they were doing, but bouncing between the two seems to give them a lot of satisfaction. To me, I think it's hugely valuable, because now you understand concerns and challenges and hot spots from a completely different perspective.



    Matty Stratton: I'm going to turn that on you a little bit. You've gone through your journey, through your monitoring newsletter, through now starting this podcast. What's your journey look like?



    Mike Julian: Mine was just as accidental. I started working on a book, Practical Monitoring; it came out in 2018, but I started working on it in 2015, and shortly after that I decided I wanted to go full-time consulting as well. I started looking around: what could I focus on? Clearly I should just do monitoring, because it's what I'm good at, it's what I'm known for. Except I wasn't actually known at the time; I had a book proposal and that was kind of it. From there I started thinking, you know, there are all these newsletters out there, but there are no monitoring newsletters. SRE Weekly and DevOps Weekly and Cron Weekly, which is currently on hiatus, all touched on monitoring from different perspectives, but no one really focused on it. I was like, "I want to do that. I want to start this newsletter that is just going to talk about monitoring: management, capacity planning, all these different areas under the monitoring umbrella." That just kind of took on a life of its own after a while. That was never the intention. The intention was, I want to do a newsletter because it sounds like a fun thing to do. Pro tip: there are many days where it is not fun to do. It was entirely accidental for me, but looking back at it, it makes sense in hindsight that I am where I am. I could never have predicted that it would lead to the opportunities I have.



    Matty Stratton: It's always kind of a fun game to play, as long as you can keep it fun, to sit there and say, "Where did those two roads diverge in the wood?" I also find that, going back to old stuff that I had written or questions I used to ask, sometimes it's better if things remain buried, because you look back and it's like, "Really? That's what I thought was important?"



    Mike Julian: I have a few of those articles. They are not online.



    Matty Stratton: I remember very distinctly having things come up in whatever flashback machine or app I had that would bring back old Facebook posts from years prior. There were ones from when I was first starting to ask DevOps questions, and I was very concerned about who was patching these servers. It was like, "If you have DevOps, who is patching the servers?" That was my main concern. I look back at that and, no, I didn't get it. That's okay. I'm proof that you can learn and you can become educated.



    Mike Julian: It's wild, looking back at the notes from when I really started going 100% into the monitoring world. I was reading all these articles by people I thought were absolutely fantastic, thinking I could never be as good as they were. I was taking notes furiously on all these different talks and articles I was reading, and then I go back through and look at my own notes and I'm like, "Wow, I was kind of an idiot."



    Matty Stratton: The other side of that, not to go off on a crazy tangent, but this is something I just find funny: over a long career we forget things we used to be expert in, where we used to have a lot of domain knowledge, because we don't work in that domain anymore. I remember, years ago, coming across an old mail file from when I was a network engineer and looking at the emails in it. It's like, I don't understand any ... I knew these words because I was typing them, but it was all from when I was managing frame relay networks and it was all very hardware-specific networking. It was like, I vaguely remember what this was about. But boy, did I sound smart.



    Mike Julian: That's weird, because I was also formerly a network engineer. Looking back through my own email history, it's just as wild. Apparently I used to know all sorts of stuff about BGP that I've completely forgotten.



    Matty Stratton: It's kind of a wonderful thing to be able to forget it.



    Mike Julian: It's true.



    Matty Stratton: There are other folks who can maintain that knowledge-



    Mike Julian: Absolutely.



    Matty Stratton: ... my packets just get delivered. Okay.



    Mike Julian: I was a Windows administrator for a very long time, and I recently had a client ask me a couple of questions like, "How should I structure this forest?" I'm like, "I don't know. I'm not sure I even understand the question." The road is interesting and long and generally only makes sense in hindsight, which is always fun. I recently had someone ask me, "If you could, what advice would you give to someone my age?" This guy is about 24, 25. I said: work with interesting people, work at companies that are doing interesting things that you like, and the rest takes care of itself. I think it's pretty good advice, but you get the impression that what he was actually looking for was, "Can you give me step by step by step?" I'm like, no, not really. Careers don't really work that way.



    Matty Stratton: I think you can do what you can to maximize opportunity and a lot of that is understanding when the opportunity is there. One bit of career advice that I did receive that I have generally found to be quite good was from a mentor of mine years ago. I was trying to make a decision between two roles and he said, "Never think about what your next job is but what's your next next?"



    Mike Julian: That's fantastic advice.



    Matty Stratton: The interesting thing now is, if you were to ask me questions like that, I haven't the slightest idea what I'm doing next, but that's okay. I think it also depends on where you are in your career, but for someone in their early 30s, as I was then, it was probably similar advice to what you gave the person you were talking to, which is: you're not going to get a step-by-step, and you don't have to have your whole career mapped out, but if you know what you're going to try to do after the next thing, you're going to know if the next thing is the thing that's going to potentially help you. That's the point my mentor was getting at: you're looking at these two roles, what's your next next job, and which of these will help you get there, because one of them won't. One of them sounds much cooler but is actually divergent from the path you think you're on, which could be an incorrect path. It doesn't necessarily mean it's the right path, and it doesn't mean you shouldn't do it, but at least think about some of that. If you map out my titles, I've done management, I've been a director, I've been an architect, I've gone back and forth, you know.



    Mike Julian: Much more like a squiggly line than anything else.



    Matty Stratton: It does. It goes up and down the leadership pyramid. I'll do my time where I say I'll never want to manage again, and then I'll be like, "If it was the right team, I could do that."



    Mike Julian: One of my favorite sayings, which I feel is misunderstood, is “luck favors the prepared.” The reason I think it's misunderstood is that I've found experienced people, especially older people, understand what it really means: do good work, and be out there always doing good work. Whereas younger, less experienced people will look at that saying and just focus on the “luck” part and not the “do the work” part.



    Matty Stratton: I can see that. Like all axioms, it hits for the 80% use case. You will be better able to activate opportunity if you're prepared. But the opposite isn't guaranteed: just because you're prepared does not mean you're going to be lucky.



    Mike Julian: Yes. It's always the catch, right?



    Matty Stratton: Right. Some of us are going to be luckier than others.



    Mike Julian: Becoming an expert in COBOL today is probably not going to result in a whole lot of luck.



    Matty Stratton: Right. Also just from the position of the background and the opportunities that we might have. I will tell you that if you look at my career path, if I wasn't a straight white dude, I would not have been given a lot of the opportunities I've been given. There were a lot more chances taken on me than on others; I know a lot of other people have had the opposite experience. That's something I try to remember.



    Mike Julian: That's very true.



    Matty Stratton: I have stumbled into a lot of stuff, and backing into things is an opportunity that not everybody has. I try to think about how to help with that when I can, or at least be cognizant of it if nothing else. If we think about what these roles provide, the ones that are more ops-focused, talking to the operational side, how does that, in your visibility, differ from the traditional developer relations that we're used to? Just thinking about the difference between the two as you've been observing it.



    Mike Julian: It's really hard to describe without sounding denigrating to ops. People that go into ops have, in my experience, very different personalities and different ways of thinking than people who are software engineers. Traditionally speaking, ops has been much more about stability and reliability and operationalizing things; we're thinking about all the ways that stuff could go wrong. By contrast, software engineers have traditionally approached everything from a "what cool new thing can we build?" perspective, less so from the long-term reliability of whatever it is. I think that attracts two different kinds of people, two different ways of thinking. The end result is that when you're an advocate for SRE, you're talking to a completely different audience than software engineers. I think it's really helpful, when you're working with an SRE audience or a DevOps audience, to also have someone in an advocate role from that background. I think it's also a lot harder to build; it's harder to pick up that skill set without actually doing the work. Take software engineering: I could learn software engineering at home. It's very difficult to learn ops at home.



    Matty Stratton: Yeah, some of the ... I think in both cases, once you get into scale there's no way to do that on your own.



    Mike Julian: Exactly.



    Matty Stratton: You can't learn software engineering at scale, you can't learn operations at scale by yourself. Some of that just comes from making some mud pies-



    Mike Julian: Right.



    Matty Stratton: ... and seeing what happens. I've seen that that's been one of the things, I think part of the reason why a lot of folks are enjoying going into ops advocacy is that it's an opportunity to share that experience.



    Mike Julian: Yeah, like, “Let me show you what I've seen in the muck.”



    Matty Stratton: Right. I've been in the shit.



    Mike Julian: Matty, it's been wonderful chatting with you today. I'm curious, do you have any advice for someone currently doing operations or DevOps, SRE that is interested in going into advocacy? What would you say to them?



    Matty Stratton: I think the great thing is you can practice part of this role before it becomes your main role. Be careful you don't get taken advantage of, because that's easy to do. But just like, if you want to build your career around software engineering, you can look at areas in open source you can contribute to and start to build up your GitHub profile as one of the mechanisms people use. The advice I would give, and this is from people I've interviewed and from those I've been mentoring and coaching along the way, is do what you can to get some kind of experience, whether it's writing or public speaking. Bear in mind you don't have to be a public speaker to be an advocate. DevRel and ops advocacy have lots of different facets, but you're probably going to need one of those: you're going to need to be a good writer and/or a good speaker. What needs to happen is you need to have something you can share with folks, with prospective employers.



    Matty Stratton: It's frustrating to me when I'll interview or coach somebody who I think would be great, and then I'll say, "Let me see some talks you've given," and they say, "I don't have any," or, "I've spoken at a bunch of meetups." "That's great. Okay, try to get those recorded." It's not because I want to see some amazing, re:Invent-level pyrotechnic extravaganza. It can be recorded on somebody's iPad at the meetup you hosted, but then we can see how it is. Or even a blog: if you can have something that you presented, something that you wrote, that's external. Think about little ways you can start moving towards it. And I'm always happy to help mentor, happy to help coach anyone who is interested. I'm easy to find on the internet, @mattstratton on Twitter. Hunt me down.



    Mike Julian: All right. On that note, where else can people find more about you and your work?



    Matty Stratton: The two places would be Twitter, @mattstratton, which is usually a good place to find me, and my website, mattstratton.com. The most useful part of that website is mattstratton.com/speaking, where I keep my calendar of public speaking events, both upcoming and past, if you want to see videos or slides of talks, or if you've seen me speak before. And of course, you can always find my podcast, Arrested DevOps, at arresteddevops.com, or just look for Arrested DevOps in your favorite podcasting tool of choice. You will probably find us somewhere.



    Mike Julian: Thank you so much for joining me. It's been fantastic.



    Matty Stratton: I really appreciated being a guest. Thank you so much.



    Mike Julian: Thank you for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play, or wherever it is you get your podcasts. I'll see you in the next episode.

  • About the Guest: Dan spent 12 years in the military as a fighter jet mechanic before transitioning to a career in technology as a Software/DevOps Engineer/Manager. He's now the Chief Architect at the National Association of Insurance Commissioners. He's leading the technical and cultural transformation for the NAIC, a non-profit focused on consumer protection in the insurance industry. Dan is also an organizer of the DevOps KC Meetup and the DevOpsDays KC conference.
    Links Referenced:
    Insure U: created by the NAIC for consumer insurance education
    Dan’s talk at DevOps Enterprise Summit 2018
    State Ahead Strategic Plan from NAIC
    Explore technology proposals, fiscal budget proposals, etc. on the NAIC site
    Dan’s website

    Transcript

    Mike Julian: Running infrastructure at scale is hard. It's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.


    Mike Julian: Alright folks, I've got a question. How do you know when your users are running into showstopping bugs? When they complain about you on Twitter? Maybe they're nice enough to open a support ticket? You know most people won't even bother telling your support about bugs. They'll just suffer through it all instead, and God, don't even get me started about Twitter. Great teams are actually proactive about this. They have processes and tools in place to detect bugs in real time, well before they're frustrating all the customers. Teams from companies such as Twilio, Instacart and CircleCI rely on Rollbar for exactly this. Rollbar provides your entire team with a real-time feed of application errors and automatically collects all the relevant data, presenting it to you in a nice, easy-to-read format. Just imagine: no more grepping logs and trying to piece together what happened. Rollbar even provides you with an exact stack trace, linked right into your code base, plus any request parameters, browser, operating system, and affected users, so you can easily reproduce the issue all in one application. To sweeten the pot, Rollbar has a special offer for everyone. Visit rollbar.com/realworlddevops. Sign up and Rollbar will give you $100 to donate to an open source project of your choice through OpenCollective.com.


    Mike Julian: Hi folks. I'm Mike Julian, your host for the Real World DevOps podcast. I'm here with Dan Barker, the chief architect for the National Association of Insurance Commissioners. Welcome to the show, Dan.


    Dan Barker: Hi Mike, it's great to be here.


    Mike Julian: So National Association of Insurance Commissioners, it's like the four least interesting words in one title I've ever heard. What in the world do you do?


    Dan Barker: We thought if we combined them it would be more interesting; that may not have had the intended effect. So the NAIC is a nonprofit, about 150 years old. We kind of got our start organizing events for insurance regulators, primarily the chief insurance regulators for each state and territory, and we organize events to get them together. And we also create model laws. Over time, as technology advanced, we started to host some centralized technologies within the NAIC, providing a lot of the back-end applications that regulators use. We also offer something called Insure U, where you can go and learn about insurance, all kinds of insurance. I'm sure that it will get flooded now.


    Mike Julian: Because we're all chomping at the bit about insurance.


    Dan Barker: Yeah. And so we still do a lot of the event planning, bringing regulators together and informing them on technology. There's a big movement in insurtech with tons of investment right now, so we're trying to make sure everyone is staying up to date with things like blockchain, AI, and a lot of the data-focused stuff, particularly around unconscious bias in the data and how to clean that out. We're a pretty diverse group with a lot of different focuses, but one of those is the technology side, and I'm the chief architect, leading a lot of the technological transformation as we move to Amazon Web Services, moving everything to the cloud, moving to some more open source tools, and trying to move towards a DevOps culture.


    Mike Julian: That sounds like a pretty fascinating situation.


    Dan Barker: Yeah, it's really exciting. We have all kinds of new tools and new things. This company has actually done a pretty good job of standard blocking and tackling, which might surprise you in a nonprofit.


    Mike Julian: Right.


    Dan Barker: One of the only companies I've ever even heard of that had one version of Java and that has been-


    Mike Julian: That's pretty impressive.


    Dan Barker: Yes, I continue to ask in every meeting if that's true, and-


    Mike Julian: Are you sure?


    Dan Barker: Yeah. I'm sure there's one somewhere around here. But yeah, all of our applications are on the same version of Java, and it's up to date to, you know, like Java 4.


    Mike Julian: That's pretty impressive, congratulations.


    Dan Barker: Yeah, so it's a great base to start on. And it's been a great journey so far. I've been here for a year and the CTO that kind of came in to start this off has been here for, I think about three years. So it's about the length of the transformation so far.


    Mike Julian: Okay. So we're talking about transformation here. You actually gave a talk about this at DevOps Enterprise Summit in Vegas earlier this year, or 2018. What was the problem that started this entire transformation process? What were you trying to solve?


    Dan Barker: So, we had several different opportunities that we were looking to utilize moving forward. We have this infrastructure that has been ... We're a nonprofit, so we don't have a ton of funding, and it's been a bit challenging. We haven't been able to move as efficiently because we haven't optimized a lot of our technological systems; much of what we have has been built by request, kind of an IT department within a non-technical company. So we're trying to move towards being more of a technology company, which requires a little more proactive planning. One of those areas is trying to gain some efficiencies across all the different teams: standardizing a lot of our [inaudible 00:06:30] methodologies, standardizing our development techniques. We also have a lot of silos. We were siloed not only between our operations and development sides, but each development area was siloed too. We have three main areas and they all do everything differently, which is very challenging, especially when you try to move someone across teams. As priorities shift, it becomes very hard for them to pick up on what is going on there. And they really formed different cultures, different processes, and they all use different technologies.


    Dan Barker: We also needed to improve our technology. I talked about insurtech coming. The regulators and their staff are expecting more from us on the technology side to help them better regulate more technologically advanced companies, particularly companies that are now getting into blockchain and AI. We need to advance all of our technology, our training, and our understanding to a point where we can explain these things to them and hopefully help audit some of the algorithms that'll be used in the future, and validate that the data doesn't have any implicit bias in it that the company hasn't noticed. That's been pretty common to this point; it hasn't been looked at enough in most companies, and most of it is unintentional. It just happens that the data is formed in a way that carries unconscious bias. So we need to accelerate all of our technological capabilities to deliver on the types of features that will help protect consumers of insurance, which is a long-term play. You're not going to know if it's going well until it's too late, so we're trying to protect against that.


    Mike Julian: Right. So if we were asking what the whole business case behind this was, it's really that second one. That second explanation is the business reason you started down this path: your main stakeholders, the insurance commissioners, are looking at the market, seeing a whole bunch of changes, and realizing they're not ready, from a technical perspective, to handle those changes.


    Dan Barker: Right.


    Mike Julian: So they're looking to you for that support and that increased capability and flexibility to handle the major shifts in the market that they're seeing.


    Dan Barker: Right, yeah. The main driver is definitely that we are able to offer faster response to demands by regulators and higher quality products for them. And notice I never said anything about saving money.


    Mike Julian: Right.


    Dan Barker: And that is not even an expectation of ours as we move to the cloud, which is important. We've often focused on saving money, but now we're focused more on delivering high-quality products.


    Mike Julian: That's absolutely interesting. I want to dig into that a bit, because every DevOps transformation we tend to see, as well as every cloud migration, is either explicitly or implicitly focused on cost savings. We have all these companies out there with their own data centers, and now they're looking at cloud services and realizing, "Hey, we can save a boatload of money by moving." The reality is that they rarely do; they always end up costing more, but they generally come out ahead.


    Dan Barker: Yeah, exactly.


    Mike Julian: It's interesting to me that you've kind of skipped that whole thing and said, "No, we're just not going to focus on saving money here. We're looking for the increased capability."


    Dan Barker: Yeah, so that's something that we discussed as well: how we were going to focus on that, or choose not to. As soon as I got here, I was asked to read and edit a document called State Ahead, which we released shortly after I arrived. Scott Morris, the CTO here, had written most of the technology side of it, and it never mentions anything about saving money. The point of that was to make sure people weren't focused on trying to save money. We thought that we would save money, but we didn't want to make it the priority, because we could make a lot of choices to save money that negatively impact our quality or speed of product development. State Ahead is something you can find on the naic.org website; it's our three-year strategic plan. It's been really cool to work for a nonprofit, and a particular nonprofit like this where the board is a little more in flux. The board changes every year because of the way it cycles; I won't bother explaining here. But anyway, it changes every year, and the membership changes regularly because people get re-elected, or they term-limit out, or they have-


    Mike Julian: Does it roll all at once every year or is it staggered?


    Dan Barker: No, it basically ... Yes, so I guess I'll explain it here.


    Mike Julian: But my whole point there is are you losing all of your stakeholders every single year?


    Dan Barker: No. You're losing the top one; it pops off.


    Mike Julian: Okay.


    Dan Barker: It's basically a queue.


    Mike Julian: You have an organizational queue, congratulations.


    Dan Barker: Our board is the queue. It's a first-in, first-out queue, and there are five seats. Every year we get a new president and a new vice president. Our CEO and COO are static, but we do have that change, and we have membership changes as well, and the membership approves all these things. The reason we wanted State Ahead is that we wanted a three-year plan that everyone in the membership was bought into, so that no matter who got onto the board, it was very likely they had signed off on the three-year plan and we could commit to it. A lot of times people would come in with their own initiatives for their own state, or wherever they were at, whatever the reasoning was. They would focus on that for a year, and then we'd do the next thing the next year, and the next year, and the next year. So very few of the projects were able to extend beyond just one year. It was really great to have that. And all the budgets, all the proposals I write up, are public, so you can actually go and comment on them if you'd like to. All of our technology proposals are out there. They're available for comment for a certain number of days, and then they're voted on after all the comments are addressed; we'll usually address comments if you have them.


    Mike Julian: That's pretty cool, also a little nerve-wracking.


    Dan Barker: Yeah, well, I was really happy that my first cloud one didn't get any comments, because I was a little nervous. It was the first one I'd ever done. But I think the next one after that had gotten some really good comments. So that's really helpful to help shape the direction of the insurance industry.


    Mike Julian: So was this report how you actually got started on this whole process, or did you start somewhere else?


    Dan Barker: Well, so that's kind of the end of one journey.


    Mike Julian: Okay.


    Dan Barker: Or the culmination of one journey, maybe. What happened is, Scott Morris was here for a couple years before I got here, and he started on the culture side. He really wanted to build conviction in leadership so that we didn't sway after a few months. At public companies, it's very common to have a three-to-six-month window, and then if it doesn't show massive improvement, it's gone.


    Mike Julian: Right.


    Dan Barker: And this is not a feature improvement to an app, where we can show value coming back. This is largely a paying-down-debt, maintenance-side move. It should show improvement, but it'll take time to show. So he wanted to build that conviction. He took a lot of the leadership to Amazon, to their data center, and the leadership team and the technology area went to partner companies to discuss what they've done with cloud and how they've moved with blockchain and other technologies. We really wanted to build some shared experiences, so that we all had the same vernacular to look back on.


    Dan Barker: We did an initial assessment as well, and we used that throughout the entire transformation. We did it with AWS and an AWS partner, so it was very focused on those technologies, but we've kept a lot of the same verbiage and language so that we had a common lexicon moving forward. That has been really helpful: even though we aren't necessarily using all of the recommendations, we have something to look back on, a common place. He also encouraged everyone to read The Phoenix Project, and that went really well. Everyone really enjoyed that book and understood why we're doing what we're doing. The focus initially was all about culture, and it has really been about culture the entire time. We're doing technological things, but we know that we have to have a solid base in the culture before we can execute on those technological areas effectively.


    Mike Julian: What does culture mean in this context? What were you trying to change? What were you trying to pay attention to?


    Dan Barker: So, I mean, this place has a really good culture. I don't know what it was like when Scott arrived, but when I arrived, I was shocked at how nice people were and how open people were to different ideas and to shining light into areas. What we focused on is really a culture of continuous learning, trying to encourage people to proactively go out and learn on their own. This is a nonprofit, a larger enterprise, and it sits right next to government, so it had a bit of a top-down structure for a long time. A lot of people were used to ... I'm sure they've been in trouble for doing things they weren't told to do.


    Mike Julian: Right.


    Dan Barker: And so they had a bit of a defense mechanism of, "We'll wait until we're told what we're supposed to do, and then we'll go do that." Very quickly, once given room, once we opened up space so that they could move on their own and solve their own problems, that changed. That was the big piece: empowering everyone so they can solve their own issues, their own problems in their area, rather than having someone say, "Hey, this is how you're going to do it, and this is what you need to do." More just giving direction and additional context. We've done that in many ways. We do lunch and learns; those are really hard to keep populated at other companies I've been at, it takes a great deal of effort. We haven't had a single vendor come in and speak, I don't believe, anyway. We may have had some consultants, but pretty much everybody has been internal.


    Mike Julian: That's awesome.


    Dan Barker: And we've had it every other week. Elsewhere I used to do it once a month, and it was hard to keep it staffed a lot of times. So it's been pretty amazing, and there have been talks on all kinds of things, work-related and not work-related, and it's really impressive to see everyone share-


    Mike Julian: How do you keep that so populated? What's the secret to success with that?


    Dan Barker: I think some of it is just the inherent culture here; people want to share and want to help. That might be a side effect of working for a nonprofit that's focused on insurance, which includes things like health care and other things about helping people. So I don't know if that's part of it. Definitely the person in charge of it, Gail McDaniels, really gets out there and tries to gin up interest in coming and speaking, so he's done a great job running that piece. We've also done a Tableau vizathon, a city-wide Tableau vizathon is what they call them.


    Mike Julian: Okay.


    Dan Barker: I'm not a Tableau person, but that was really well attended. We had basically a hackathon, and we really helped get our folks involved in the community more. And all of the nonprofit's money comes from consumers; it comes through the companies that pay us based off of, I don't know, some algorithm that other people in the company know. I have no idea how they get the money. So this comes from consumers, and my goal when I got here was to give back as much as possible. Some of these city-wide events that we've done, meetups and other things, were something I wanted to focus on, just like our adoption of more open-source tools has largely been, so that we can give back more to the community since they're helping fund us.


    Dan Barker: So that's been a big thing. We've also done a hackathon. We did this during the AWS re:Invent conference, and I think we may have a hard time getting people to go to re:Invent because it was such a successful event here. We watched the keynotes live as a group, and we actually had a Slack channel open so that people at the conference and in Kansas City could live-chat about what was happening during the keynotes. And then we had a hackathon that had, I think, over 40 participants. We only have about 260 engineering staff, and a lot of people, I think another 20 or 40 or something like that, were in Las Vegas at re:Invent. So it was pretty good attendance just for the hackathon. And we had a lot of cool things that people produced, some that I'm probably not allowed to mention.


    Mike Julian: These are always the most interesting ones.


    Dan Barker: Yeah. So it was a great time; everyone really seemed to love it. We also had internal people present, we had AWS and GitLab come and present for us, and we had a couple other people present on culture. We had an internal panel. It was really just a ton of sharing for, I think, three days. We rented out a local theater. The amazing thing about all of this wasn't even necessarily the event, which was amazing, but how the event was planned, which was all in a Slack channel. Basically a guy named Dennis Wilson, who's the Director of Technology over on the NIPR side, the National Insurance Producer Registry, which is a wholly owned subsidiary (a lot of explanation; go watch my talk, I talk a little more about it there).


    Mike Julian: That talk will be in the show notes.


    Dan Barker: Yeah, okay, great. So he just kind of mentioned it and had an idea of watching the keynotes, and then somehow that snowballed into a hackathon and all these other people coming in and talking, and renting out this theater, and having all these GitLab training sessions. It was pretty amazing to have people just come into the Slack channel as they heard about it and say, "Oh, well, I think I can help with that." They would just come in, read the history, grab a task off the task list, and start going. I mean, it was wild to just sit there and watch and not really be involved anymore.


    Mike Julian: Yeah.


    Dan Barker: Kind of like ideas stuff.


    Mike Julian: That's pretty cool. To me there's a pretty direct line here. You mentioned earlier that before you joined NAIC, people were very much, "I will wait to be told what to do." But since you've been there, that's not been the case at all; people are jumping in to do whatever they think is necessary. Whatever prompted that culture change, wherever it got started, to me there's a direct line from there to what you're talking about: people feeling empowered to do what they think is necessary, or what they think is interesting.


    Dan Barker: Yeah, definitely. And I think it's all about really giving people opportunity, allowing them the opportunity to succeed or fail, and not judging whether they succeed or fail, but judging that they've taken the opportunity. There are times where, whether I'm predicting it right or wrong, I may have seen that, "Well, this probably isn't going to work well." But it's better to let it happen and let it succeed or fail than to say, "Well, I don't think this is going to work. Stop it now." Because then, by definition, you're not empowering them.


    Mike Julian: Yeah, exactly.


    Dan Barker: …trying to protect them from possible failures. And it's like, "I may have done the same thing, but this is a different context as well." When I started, I was very clear that I may have done similar things at other companies, but the context here is different, and that's going to change a lot of decisions. And very quickly we'll be making decisions where I've never been in that situation before either.


    Mike Julian: Yeah.


    Dan Barker: I've never been in this situation before either, so you're going to have to help out.


    Mike Julian: Yeah, absolutely. So kind of switching gears a little bit, you mentioned how you got started with this, with your CTO, working with the executive teams and taking them to partners, taking them to Amazon data centers, basically getting executive buy-in. Once all that got started, and was underway, how did you get everyone else on board? What did you do for the rest of your management? What about the engineers, maybe people outside of engineering but are kind of on the ground, doing the work? What did you do to get everyone on the same page?


    Dan Barker: Yeah. So part of that was the State Ahead document, but we got started even before that by taking an application to the cloud. We took one of our smaller applications, but something pretty important, and we moved that up into AWS using Lambda, and we found some pain points in there with Java.


    Mike Julian: As one does.


    Dan Barker: It’s like the older Java apps spinning up on Lambdas. But it was a good experience, and it's still running in production. We've obviously made modifications and updated it, but we wanted to get a win. Then we also used a bunch of new technologies for another app, with MongoDB and some other more NoSQL-type systems, and that was a big hit because we were able to get it done very quickly, replacing an old and fairly complex system in half the time and half the budget of the original project.
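    For listeners curious what that Java-on-Lambda pain point looks like concretely, here is a minimal sketch of a Lambda handler in Java; the class name, payload shape, and config values are hypothetical illustrations, not NAIC's actual code. The JVM pays its class-loading and initialization cost on the first invocation of a cold container, which is the spin-up problem Dan mentions, so the common pattern is to do expensive setup once, outside the per-request path:

```java
// Minimal AWS Lambda handler in Java, illustrating the cold-start concern.
// Requires the com.amazonaws:aws-lambda-java-core dependency and Java 9+.
package example;

import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class LookupHandler implements RequestHandler<Map<String, String>, String> {

    // Static initializers run once per container, during the cold start.
    // Expensive setup placed here (config parsing, connection pools,
    // framework wiring) stays off the per-request path, but it is exactly
    // this work that makes older, framework-heavy Java apps slow to start.
    private static final Map<String, String> CONFIG = loadConfig();

    private static Map<String, String> loadConfig() {
        // Stand-in for reading real configuration; values are hypothetical.
        return Map.of("table", "licenses", "region", "us-east-1");
    }

    @Override
    public String handleRequest(Map<String, String> input, Context context) {
        // Warm invocations reuse the initialized container, so only this
        // method's cost is paid per request.
        context.getLogger().log("using table " + CONFIG.get("table"));
        return "hello, " + input.getOrDefault("name", "world");
    }
}
```

    Even with setup hoisted like this, a cold JVM start on a framework-heavy app can take seconds, which matches the pain point described above.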


    Mike Julian: That's pretty awesome.


    Dan Barker: Yeah. Which was amazing.


    Mike Julian: I bet everyone loved that one.


    Dan Barker: Yeah, definitely. Everyone loves that, even if it's not about money.


    Mike Julian: Right.


    Dan Barker: People love saving money.


    Mike Julian: Yeah.


    Dan Barker: So that was a huge success. We did those with a small, engaged team; we tried to pick people who would really step out of their boundaries and help wherever they could. And then we just repeated that over and over again. What I've learned is that if you're not tired of saying it, then you probably haven't said it enough. So I try to make sure that I'm fully tired of saying everything, congratulating people, and telling people what a good job this company has done so far. Those two projects in particular were trophies that we can hold up and say, "We've done this already, we don't have to worry about this. This is something that everyone can achieve." We also offer a ton of opportunities for learning. We'll basically pay for whatever courses you want to take, or if you want a subscription to any of the online learning systems, we'll provide those. We buy books like crazy. There was one order where I got a little worried: I listed all the books that I liked, with links for them on Amazon, and it was over $1,000 in books.


    Mike Julian: Wow.


    Dan Barker: Just things I could remember off the top of my head, and they bought them.


    Mike Julian: That's incredible.


    Dan Barker: Yeah, and I was like, "Wow. You just bought $1,000 in books. The for-profit companies I've worked for would never do that." So it was really great. We have our own little library. We also have an actual formal library, which is an interesting component of the cultural transformation journey. There are actually researchers up there, because we have a research center and we have model laws that we create. You can basically ask them to do any research and they'll just go do it and send you back all this amazing stuff.


    Mike Julian: That's pretty cool.


    Dan Barker: They do pretty much everything now. But one of the people up there, Erin Campbell, jumped onto Slack before we ever told anybody about it. We were just beta testing it, and I guess she heard through the grapevine how to get on it. I don't even know how she knew. But somehow she showed up there, created a book club channel and some other channels, and then started ordering books that people were talking about, taking them down to people's desks and handing them over, and really engaging with everybody. That was an awesome thing to have: someone who's not in the technical area, actually in the library, which you don't usually think of. I mean, it's like insurance.


    Mike Julian: Right.


    Dan Barker: Most people don't get that excited about the library, I'm sure there are people yelling at me on their radio now.


    Mike Julian: Absolutely.


    Dan Barker: Although, let's be honest, they're probably whispering. So she got on and really engaged, and that was a great thing to be able to show people on the technical side: "Look, there's someone from the library getting on here and fully engaging." Another piece that we were able to show, particularly to developers but also to project managers and management throughout the company: I did some of the docs on our internal communication standards for the IT group — and I basically copied most of those from GitLab, but don't tell them; they're all in their public handbook. I had the head of HR go in and help us make sure that they were all in agreement with it, and he actually submitted a commit back to GitLab, the HR director.


    Mike Julian: That's incredible.


    Dan Barker: It is shocking, right? And so then you-


    Mike Julian: Yeah, that's amazing.


    Dan Barker: Yeah. So I went and helped him through it and we talked through everything, but he did everything; I never typed anything for him. It was great to hold that up as a collaboration with someone outside of the IT group, showing that they're engaging in these systems that we're using, systems we find easier to interact with. Because once that kind of stuff is in a git-type repo, we can put it into a web page, into Confluence, and into all these other tools through automation. So they thought that was pretty cool. And to have someone on the HR side interacting with our systems was a nice achievement.
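    As a sketch of the kind of automation Dan describes, where docs live in a git-type repo and flow out to other tools, here is a hypothetical job that pushes an HTML-rendered doc to Confluence's REST API. The URL, space key, page title, file path, and credentials are all placeholders; this illustrates the general approach, not NAIC's actual pipeline:

```java
// Hypothetical sketch: publish a doc from a git repo to Confluence.
// Uses only the JDK (java.net.http and text blocks, so Java 15+).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class PublishDoc {
    public static void main(String[] args) throws Exception {
        // In a real pipeline this file comes from the checked-out repo,
        // typically converted from Markdown to HTML in an earlier step.
        String body = Files.readString(Path.of("docs/communication-standards.html"));

        // Confluence's "storage" representation accepts XHTML content.
        String json = """
            {"type": "page",
             "title": "IT Communication Standards",
             "space": {"key": "ENG"},
             "body": {"storage": {"value": %s, "representation": "storage"}}}
            """.formatted(quote(body));

        // Placeholder credentials; a real job would read CI secrets instead.
        String auth = Base64.getEncoder()
                .encodeToString("bot@example.org:api-token".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.atlassian.net/wiki/rest/api/content"))
                .header("Authorization", "Basic " + auth)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Confluence replied: " + response.statusCode());
    }

    // Minimal JSON string escaping for the embedded HTML body.
    private static String quote(String s) {
        return '"' + s.replace("\\", "\\\\").replace("\"", "\\\"")
                      .replace("\n", "\\n") + '"';
    }
}
```

    Run from a CI job on every merge, something like this keeps the Confluence page in lockstep with the repo, which is the "write once in git, publish everywhere" effect Dan is pointing at.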


    Mike Julian: Yeah, that's great.


    Dan Barker: And we've also had engagement ... I mean, this company is really amazing. We've also had engagement with our lawyers. They've always been very fast to respond and always do a great job of reviewing all the standard stuff you'd expect IT lawyers to be reviewing. However, they also engaged with us early on our Nexus IQ Server engagement with Sonatype, coming in and trying to understand how they're going to work in this new system. They never complain about something being new or about us trying to get them into other systems. They've been fully on board with the new open-source program that we're introducing, so we can open source more of our internal stuff and have more contributions to open source. They've been really strong advocates. It's pretty amazing to have so many people on board with the transformation across every area of the company, really.


    Mike Julian: Yeah that's great. So all that sounds amazing, but surely there's been some stuff that hasn't gone as well as you'd like.


    Dan Barker: Yeah. So there always are, right?


    Mike Julian: Right.


    Dan Barker: So a lot of times with these things, everyone wants you to predict the future of how long things will take, so you're always going to be behind schedule, because no matter what you say, something happens. We've taken down our Kubernetes clusters that run the GitLab [inaudible 00:35:00] runners. That was a bit of a hard impact, but we've done a good job of introducing blameless postmortems and the blameless culture, so that people don't ... You know, we don't want them to feel bad if something happens. It was an honest mistake, a growing pain of having to swap out some fundamentals of the AMI, which caused the overall crash of the Kubernetes system. So that was a little bit of a setback, and we've had multiple things like that. Networking is really hard between [inaudible 00:35:38] in AWS and the data center, and we've had some struggles there. The silo mentality of "I am network," or "I am database, and I don't know anything about the other things," has been a bit of a struggle. We have one person on the platform team, building out a lot of the platform components, who is a database administrator and has learned a ton of other stuff now. But it's funny, because that person will answer questions about monitoring, but then will get upset if someone wants to do database stuff, because database stuff is just too complex.


    Mike Julian: Yeah, interesting.


    Dan Barker: Yeah. And the last podcast in this series was interesting, talking about databases, actually, because it's very pertinent to a lot of the things that we're doing. We've had some connectivity issues, and we've questioned, "Do we have any settings anywhere limiting connections?" We've had this persistent issue where single connections are clearly being limited somewhere, because they just hit a flat plateau and stay there. You can run 1,000 of them and they'll increase linearly, but every one of them has the same limit. We haven't been able to find it, and we were told very sternly that there were no limits in the system anywhere. We're still searching that one out, because we're trying to make sure that when we move more of our apps up there, we're going to be able to connect to our databases on premises, because we're not prepared to move them. We feel like that would be too big of a bang to do all at once, so we're trying to do the apps and then the databases. We also have, I mean, it's a normal enterprise, so a lot of apps talk to other apps' databases, so you've got to wean those off during that process as well, or at least understand the connections. And we've had other things. Open communication like Slack is hard. I sometimes kind of suck at biting my tongue, so I've made mistakes and said things I shouldn't have in public forums, and I know that others have argued and such. It's been really nice to be in a company where people forgive you rather than holding things against you because you said something when you were upset. And in Slack, it's permanent.
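    One hypothetical way to pin down the per-connection cap Dan describes is to pull the same test object over N parallel connections and compare per-stream rates; the URL below is a placeholder. If a shaper enforces a per-flow limit, each stream plateaus at the same rate while the aggregate scales linearly with the stream count, which is exactly the symptom he mentions:

```java
// Hypothetical diagnostic sketch: fetch one URL over N parallel
// connections and report each stream's throughput.
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class ConnectionCapProbe {
    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "https://example.org/testfile.bin";
        int streams = args.length > 1 ? Integer.parseInt(args[1]) : 4;

        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < streams; i++) {
            final int id = i;
            Thread t = new Thread(() -> {
                try (InputStream in = URI.create(url).toURL().openStream()) {
                    byte[] buf = new byte[64 * 1024];
                    long bytes = 0;
                    long start = System.nanoTime();
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        bytes += n;
                    }
                    double secs = (System.nanoTime() - start) / 1e9;
                    System.out.printf("stream %d: %.1f MB/s%n",
                            id, bytes / secs / 1_000_000);
                } catch (Exception e) {
                    System.err.println("stream " + id + " failed: " + e);
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join(); // wait so all per-stream rates print together
        }
    }
}
```

    Running it with one stream and then with eight makes the pattern visible: the same per-stream ceiling each time, with aggregate throughput growing linearly.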


    Mike Julian: Right.


    Dan Barker: You can delete it but I think that it's better-


    Mike Julian: Someone saw it.


    Dan Barker: Yeah, someone saw it. And I think it's better to leave it, to show that, "Look, you can recover from a failure like that." It's a different kind of failure than the technical failures we often talk about in DevOps, but I think it's equally important, because we all have hard days, and we have hard weeks or months or years.


    Mike Julian: Right.


    Dan Barker: And stressful situations that come up where you accidentally say something or type something that you wish you hadn't.


    Mike Julian: Yeah.


    Dan Barker: So it's a very forgiving culture. And that's a personal failure for me. But we've also had some of the technological failures, and we've had a lot of questions about QA: what's going to happen to QA? And MBAs and other types of jobs like that. For the most part, I've tried to share articles or other information but not specify a direction, because I want the leaders in those areas to emerge organically and then lead those teams through it.


    Mike Julian: So all this has been pretty awesome, is really fascinating stuff. For those people that are listening that are also working in a similar organization or looking to start or perhaps in the middle of their DevOps transformation, what advice would you give them?


    Dan Barker: So I guess the most important thing that I've learned is patience. You have to be patient. These are usually long transformations. Oh, they're really never-ending; I don't know that you ever achieve the perfect culture and then don't have to do anything. You'll repeat the same thing over and over again to the same people sometimes, and the time you say it slightly differently will be the time they get it, just by changing up how you say it or how you display it. And that takes patience, especially when they're like, "Hey, I totally understand this," and you're kind of happy, but you have that little happy frustration.


    Mike Julian: Right.


    Dan Barker: [crosstalk 00:40:55]. And so I try to place the blame on myself in those situations: I need to figure out how to say it differently every time, rather than just saying it the same way. And the change needs to be organic, but supported from the top. We can't force anything into the culture, but we can create opportunities like Slack, or like GitLab, where there's more open communication and it's easier to interact with each other. We can provide those platforms somewhat by force, although when I provided Slack, it was adopted in an insane way. We didn't expect that; we had to quickly buy more licenses because we were hitting our license limit very quickly. And then we had to offer training and a rollout plan, which we hadn't planned on doing. It was just going to be our IT department.


    Dan Barker: And then the business units started getting involved, and legal, and everybody. It's a real battle of maintaining patience throughout the long haul and then trying to play catch-up when the dominoes start to fall. The same thing happened with GitLab. So I would say be patient, offer as many opportunities for learning or collaboration as possible, and let people choose whether they want to do something, whether or not you think they'll succeed or fail. Just let them do it. Even spending several weeks on something that you think is going to fail, or may not be as big of a value to the company, might spur tons of innovation in the future from that person. You do have to give them some goals to achieve; you can't just let people go do whatever they want without any type of limit. Usually we'll have some goals, but then some free time to do extra stuff, letting people go and lead those initiatives, and not having it be, "Well, you're not a manager, so you can't lead." That's kind of the core of what I would do. But all of that is [inaudible 00:43:28] patience.


    Mike Julian: Right.


    Dan Barker: I was in the military for 12 years. So I learned patience.


    Mike Julian: Oh, yes. Well, thank you so much for joining me. This has been great. Where can people find out more about you and your work?


    Dan Barker: So all of my information is on danbarker.codes — that's C-O-D-E-S. You can also check out naic.org for additional information about the NAIC and our initiatives, and to get our State Ahead document and any of our fiscal budget proposals. All my presentations and where I'll be speaking next are on danbarker.codes as well.


    Mike Julian: That's awesome. Well, thank you so much for joining me.


    Dan Barker: Yeah, thank you very much, Mike. I appreciate it.


    Mike Julian: Thank you to our dear listeners for listening to the Real World DevOps podcast. If you want to stay up to date on the latest episodes, you can find us at realworlddevops.com and on iTunes, Google Play or wherever it is you get your podcasts. I'll see you folks in the next episode.



  • About the Guest
    After many years of ghostwriting, Emily Freeman made the bold (insane?!) choice to switch careers into software engineering. Emily is the author of DevOps for Dummies (April 2019) and the curator of JavaScript January — a collection of JavaScript articles which attracts 30,000 visitors in the month of January. A former VP of Developer Relations, Emily is a CloudOps Advocate at Microsoft and lives in Denver, Colorado.
    Links
    Website: emilyfreeman.io
    Twitter: @editingemily
    Links Referenced
    Book recommendations: Monitoring with Graphite by Jason Dixon; Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale by Jennifer Davis and Ryn Daniels
    Emily’s talks: Humpty Dumpty + DevOps; Scaling Sparta; Dr. Seuss Guide to Code Craftsmanship
    Transcript
    Mike Julian: Running infrastructure at scale is hard. It's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O’Reilly's Practical Monitoring.



    Mike Julian: All right folks, I've got a question. How do you know when your users are running into showstopping bugs? When they complain about you on Twitter? Maybe they're nice enough to open a support ticket? You know most people won't even bother telling your support about bugs. They'll just suffer through it all instead, and God, don't even get me started about Twitter. Great teams are actually proactive about this. They have processes and tools in place to detect bugs in real time, well before they're frustrating all the customers. Teams from companies such as Twilio, Instacart and CircleCI rely on Rollbar for exactly this. Rollbar provides your entire team with a real-time feed of application errors and automatically collects all the relevant data, presenting it to you in a nice, easy-to-read format. Just imagine: no more grepping logs and trying to piece together what happened. Rollbar even provides you with an exact stack trace, linked right into your code base, plus any request parameters, browser, operating system, and affected users, so you can easily reproduce the issue all in one application. To sweeten the pot, Rollbar has a special offer for everyone. Visit rollbar.com/realworlddevops. Sign up and Rollbar will give you $100 to donate to an open source project of your choice through OpenCollective.com.



    Mike Julian: Now folks, welcome to the Real World DevOps podcast. I'm here with Emily Freeman, Cloud Advocate at Microsoft. Welcome to the show, Emily.



    Emily Freeman: Thank you, thank you so much for having me.



    Mike Julian: Yeah, so for the folks who aren't as familiar with developer advocacy or developer relations or developer evangelism or the 50 bajillion other names they go by, can you tell us what it is that you do? What are you doing for Microsoft?



    Emily Freeman: Yeah, absolutely. Along with the 50,000 names for developer relations, there are also 50,000 interpretations of the actual job. It will vary company to company and person to person. For me at Microsoft, I like to say that, because I have a background in writing and I was a writer before I was a developer, I like to tell stories. My job is to tell the stories of the developers in the communities I'm in, serve, and represent. The most important role I think I play at Microsoft is bringing customer feedback back to the product teams. When I go to a conference, I am never one to just, like, helicopter in, give my talk, and then peace out. I think that's just poor form and it doesn't benefit anyone at that point but my ego. My job really is to go and talk to people and hear all their feedback, positive and negative. It's not personal; it's about making the product better for the community.



    Mike Julian: That sounds like a much more reasonable job than you would hear from like reading shit shows on Twitter.



    Emily Freeman: Yes, absolutely.



    Mike Julian: That might be a good segue to the recent total shit storm that happened on Twitter about developer relations. I don't even know how these things started. It just suddenly appeared one day.



    Emily Freeman: Yeah, and the criticisms from this last one aren't new. This is just one iteration of the sort of overall DevRel hate, which, on a very personal level, is hurtful, because it's like, "Okay, this is my work, this is what I care about. I'm passionate about this, and obviously I dedicate all of the time I spend at work to this role." To hear that it has no value, or that we are simply low-key Twitter celebrities whom our companies shield, who drink champagne on planes and party.



    Mike Julian: That sounds like a fantastic life, can I get that job?



    Emily Freeman: I know. I was telling someone I'm going to Milan next week for Microsoft Ignite | The Tour. It's very exciting; I've never been to Italy, I'm thrilled. I was just telling them, "I sound a lot cooler than I am. I just got back from Toronto and I'm going to Milan." No, I'm just on the plane, sad. Yeah, the travel is part of it; I think it's the part that people cling to. I also think that we as an industry have not done a great job recently of actually representing what the work is. So much of the emphasis has been around Twitter, and a lot of the developer relations folks, especially the people Microsoft is picking up, have really — if not enormous Twitter followings — substantial ones.



    Mike Julian: Well they are already super well-known people to begin with in their own right.



    Emily Freeman: Yeah, but there's a cult of personality thing happening. I think when we have that and when we focus so much on the persona and that sort of Twitter platform, then we lose the actual benefits of developer relations, both for the community and the company that they work for.



    Mike Julian: Yeah, so some of the criticisms that I saw (you alluded to this one) are that you're a representative of a company, but when you're speaking on behalf of the company, you're not saying that you are. It's like influencer stuff: you're not divulging where you're getting your pay from. To me this sounds like a dubious argument, because your employer is right there on your profile.



    Emily Freeman: Yeah. I mean, I'm pretty open about any employer I've had. I think everyone knows I work for Microsoft, and certainly when I go speak somewhere, Microsoft's logo is on the deck. I will say that I'm a Cloud Advocate for Microsoft, and that's probably all I'm going to say about Microsoft, and I think that's the important thing. Unless it's a Microsoft event, I am not applying to, like, a DevOpsDays with "today let me talk about Azure DevOps." That doesn't benefit anyone. If I think Azure DevOps can show something really cool within a different storyline, like an actual problem set that I think the community is struggling with, awesome. If I think another tool that Microsoft does not own can do it better, I'm going to talk about that. We were all hired by Microsoft with the understanding that we represent specific communities. We target our talks to those communities, and we are not going to be just a different version of a sales engineer.



    Mike Julian: At the end of the day your job is to help build a product and help build a community around the product. Like you're bettering both sides.



    Emily Freeman: Absolutely and it's sort of like the middleman where we get the negative parts of both sides. It's like we're stuck in the middle, "We're trying to make things better," and everyone's kind of mad.



    Mike Julian: As someone who also kind of waded into this whole storm, I look at developer relations in a much more foundational sense: basically everyone in the company is there to do sales in some capacity.



    Emily Freeman: I agree.



    Mike Julian: My job when I was working in ops and SRE was to help build a product so that we could sell more of it. The job of developer relations is to be out there in the community, building community and bettering the product so we can sell more of it. We're all here to sell more stuff. To me, when I see developer relations get up on the stage, it's like, "Oh, they're on stage. They're clearly here to sell me more stuff," but I don't take that as a negative. I know their goal. If they weren't there to sell more stuff, they wouldn't be on stage to begin with. Their role wouldn't exist.


    Emily Freeman: Yes, and it's not a direct link. To be very clear, no one in developer relations has any kind of sales metrics or is measured by that. We get no rewards based on that. There's no direct link of, okay, if you buy Azure then I get a bump; that's just not how it works. You're right in that everyone is supposed to, you know, we all work for a company. We get paid because they make money, and so if you're a purist who just wants to develop and thinks you don't need to worry about the end user, that's just not being realistic, I think.



    Mike Julian: Right, it still doesn't work even as an engineer. Yeah, you're building a product, but why are you building the product? So that you can sell more of it. Even engineers don't have sales goals; they're not measured on how much product was sold, but if sales go down, they lose their jobs. It's very much the same in developer relations: you're playing the long game. You're building a community, because a community will become self-sustaining and grow the company over time. You're not a salesperson trying to meet the quota for this quarter. You're thinking, "How can I build something that people actually want, that people really want to be part of, a community for the long haul, for years at a time?"



    Emily Freeman: Yes, absolutely. I will not promote a product I don't believe in. I'm not going to get up there and say like, "Yeah, this is awesome product," when in reality I'm thinking like, "What a shit show this is." That's not my style.



    Mike Julian: That would be really hard.



    Emily Freeman: It really would be. Yeah. I don't even think great salespeople do that, because you have to believe in that and you have to be honest with your customer. If I mislead someone they're not going to trust me again, so I'd rather be honest and be like, "That's not a great product for you, that needs help. Let's go use this other thing." Have them come back and trust my word and my reputation — that's important to me.



    Mike Julian: Yeah, absolutely. I've been having a lot of conversations lately about like the Microsoft we all know and hate and the Microsoft of today.



    Emily Freeman: I know.



    Mike Julian: I was at dinner just recently and someone made the comment, "You should go to Microsoft. They're a much better company." I'm like, "I'm sorry, what? When did that happen? What universe is this?"



    Emily Freeman: I know, I know.



    Mike Julian: What got me thinking about that is you mentioned that there are situations where you won't recommend Microsoft products, because even though you represent Microsoft, their products may not be the right solution. It reminds me of the old Microsoft certification exams, where the answer to everything is a Microsoft product.



    Emily Freeman: That's amazing.



    Mike Julian: Everyone had this problem, but especially people in the industry: well, there's the right answer, and then there's the answer the test is looking for. They're not the same.



    Emily Freeman: Yeah, absolutely. I think so much of the emphasis at Microsoft right now is changing that perception and actually living up to the new vision and the new goal that we want to have as a company. I think Satya [crosstalk 00:11:35].



    Mike Julian: Which is just impressive.



    Emily Freeman: Oh yeah, absolutely. It is impressive. Satya has done a great job of setting that vision and then making sure that people at Microsoft understand it and live it every day. Obviously, we're a big company and there are faults; I'm not one to say that everything's unicorns and rainbows. It's a work in progress, but aren't we all? It's funny, because I think 10 years or so ago, Microsoft was actively undermining open source communities. Now it's one of the largest, if not the largest, open source contributors, and they're doing a lot of work around that. I wonder sometimes if some of the developer relations controversy is affected by the fact that all these corporations hiring a bunch of developer relations folks who come from open source are then either acquiring the open source or getting more involved in it, and there's this fear of conflict of interest?



    Mike Julian: No, I could absolutely see that.



    Emily Freeman: Yeah.



    Mike Julian: Yeah, I think you might be onto something there. I don't know how much that plays into it, but I think there's some truth there.



    Emily Freeman: Yeah. I think a lot of it is just fear-based; it's like, here are these mega corporations. We're also in this acquisition season where really big corporations are acquiring all these medium-sized companies. That always gets a little scary, because then it's just, oh, there's five companies, okay.



    Mike Julian: Right, all the 50 bajillion dollar companies are acquiring the 10 bajillion dollar companies.



    Emily Freeman: Exactly, yeah and may we all make some money off this please?



    Mike Julian: Yes, please. On a completely unrelated note, I heard you're writing a book.



    Emily Freeman: I am indeed, I am writing DevOps For Dummies.



    Mike Julian: That sounds awesome. My condolences and my congratulations.



    Emily Freeman: Thank you. I see you've written a book.



    Mike Julian: I've also written a book. I know the pain, so I'm not going to ask you how your book is going, lest you punch me through the phone.



    Emily Freeman: It's true. I think we all have those friends that are like, "Oh how's the book going? Are you making tons of progress? Are you excited?" Then you're just like, "No, none of those things. I'm not making any progress. I feel sad. I actually don't want to talk to you anymore. Where's a drink?"



    Mike Julian: Been there.



    Emily Freeman: Yeah, it is the hardest thing I have ever done. I don't know how to describe it. I think all authors, all people who have attempted this, get it, but it's incredibly difficult to convey just how difficult it is.



    Mike Julian: Yes, I've had that same struggle. I've read several books on this, like On Writing, which Stephen King wrote, where he gets into that. Steven Pressfield wrote The War of Art, which is the same idea, and then Anne Lamott, who lives in the Bay Area and is absolutely wonderful, wrote Bird by Bird, which is kind of the same thing but for creative writing. All of them talk about the same idea: writing sucks.


    Emily Freeman: Awful.



    Mike Julian: There's a quote by some author; he's like, "Writing is easy. All you've got to do is slice your wrist and bleed out."



    Emily Freeman: Then it's done.



    Mike Julian: Right, and then it's done, like it's just that easy.



    Emily Freeman: Yes.



    Mike Julian: Writing a book is hard and I just realized I don't think we've mentioned the title of your book here.



    Emily Freeman: Oh we did briefly, DevOps For Dummies.



    Mike Julian: DevOps For Dummies, okay.



    Emily Freeman: DevOps-focused, through O'Reilly, but in the standard yellow Dummies format, so yeah, it will be a high-level book, because it is.



    Mike Julian: Sure. Why another DevOps book?



    Emily Freeman: Well, there's not actually a ton of DevOps books. When you actually look at it, there are the really well-known ones, and when you look at the actual authors, it's like 10 people. My pitch to O'Reilly was: one, I don't think there's a ton of work done for the dummy, the beginner, the manager, or the developer who knows nothing about operations but has heard of DevOps and wants to see what this is. I do think that DevOps is still very ops-focused; when you go to DevOpsDays events or any of these DevOps-focused communities, you ask, "Are you a developer? Are you an operations person?"



    Mike Julian: Right, and it's mostly sysadmins.



    Emily Freeman: Exactly. I think it's really interesting sitting between those two communities, because a lot of operations folks express their concerns to me: "Well, I don't know how to code, and obviously I can't. How do I evolve? What's the place for me in the future, especially with cloud? Am I losing my job?" There's this whole NoOps idea, which I think is absurd. Then the developers, interestingly, have the opposite fear: "Well, I can develop, but I don't know how the systems run. I can't deploy. I don't know what to do. AWS is big and scary. Azure is confusing." All of these things. I think it's interesting to bring those communities together, and because I have a development background, I think I bring a unique perspective that some of the other books just haven't had, because they're written by ops folks.



    Mike Julian: Yeah. I love that pitch.


    Emily Freeman: Thank you.



    Mike Julian: What sort of stuff will you be covering in the book?



    Emily Freeman: I broke it down into five parts, with very much the DevOps focus of people, process, tools. The first part is focused on getting started in DevOps: you've heard about this DevOps thing, this is what it is, it's born out of agile, it's an evolution; how to transform your organization to start thinking like this; how to get people on board, finding internal evangelists. Then we get into the development life cycle and making sure that DevOps is integrated into every step. I've broken it down into: let's look at this linearly, and then let's create a cycle around it, centering the customer within that cycle. Then, to round it out, how you can add tools to your pipeline.


    Mike Julian: Okay.



    Emily Freeman: Yeah.



    Mike Julian: Yeah, that sounds pretty good. I love how you've broken that down.



    Emily Freeman: Thank you, I'm glad you're enthusiastic. One of the things ...



    Mike Julian: One of us has to be, right?



    Emily Freeman: I know. I try to be really genuine and open and honest about this process, but one of the weird things about writing a book is, I'm halfway through. You go through these cycles where it's like you get excited about it. It's like, “Okay this is good. Oh that's a brilliant idea. Yeah, write that down, this is great. Oh no, that's terrible. I'm unqualified. I should really just get out of tech. I think Kim's going to cry when he reads this.” Yeah, it's this weird cycle, so I'm constantly thinking like, "Yeah, I think this is good, but maybe it's not." In some beautiful irony, you cannot DevOps a book. It is a wonderful process.



    Mike Julian: Absolutely.



    Emily Freeman: I write it and then people decide whether it's great or shit and then I find out. If it's bad, you can't rewrite it and you only get to rewrite it if it's good, which is funny I think.



    Mike Julian: It is funny, funny in an "oh God, kill me" sort of way.



    Emily Freeman: Yes, absolutely.



    Mike Julian: O'Reilly, who I wrote my book for, modified the traditional process a little bit. You write several chapters, basically half the book, and then it goes off for an early release. You're supposed to solicit feedback during this period. They also send the entire draft to technical reviewers, people in the community who are practitioners and are good at whatever it is you wrote the book on. For example, I was one for Jason Dixon's Graphite book, so I read that book while he was writing it and provided technical feedback on it. He had several other people doing the same thing, and the same thing happened for my book. I think I had like six different people, all fantastic, well-known people in the industry, providing feedback on it. I got this as I went through, which was really, really helpful.



    Emily Freeman: That's awesome.



    Mike Julian: Right, and then by the time you get to the final draft, the community's already seen it. You have some feedback, but it's still not as great as you would want. At the end of the day you're just writing a bunch of words and you're hoping they make sense. Like I was saying, I would love to write a second edition of mine, because my book doesn't include a bunch of information that I wish it did, just due to time constraints. My book doesn't talk about observability; I don't even think the word is in the book at all. That is kind of dumb, and people are asking me about it, but you can't do a second edition unless the first one sold well.



    Emily Freeman: Exactly.



    Mike Julian: That's really dumb. Like I want to do a second edition because I can do it better than the first.



    Emily Freeman: Yes, absolutely. It will be better than the first. That was your first book right?



    Mike Julian: Yeah.



    Emily Freeman: I think one of the struggles I've had as a first-time author is just recognizing that, well, Mary Thengvall actually put this best. I was having a total panic attack and I called her, and she was like, "You are writing a book on DevOps, not the book on DevOps."



    Mike Julian: You're writing the first?



    Emily Freeman: Yeah. Like, "You're not the first, this isn't the Bible, calm down." I think just the difficulty of actually getting it out is such a hurdle that you're never going to be satisfied with it. You want to make those changes and updates.



    Mike Julian: One of the biggest struggles I had, I don't know if you've experienced this: I would sit down and look at my notes and be like, "Oh man, I have this great idea, I know exactly what I want to say," it's just rolling around my head. I sit down, bang it out, and then look at the paper like, "Wow, that is absolutely nothing like what I meant to say."



    Emily Freeman: Yeah, it's a great process. I had this tweet at some point and I don't remember exactly what it was, but it was like, the process of writing a chapter. It was like, “Think about it, have existential crisis, write a little bit, hate everything. Drink too much, write some more. Love it, feel confused, brace for editor feedback.” I'm like, “I think that's apt, yeah that's pretty much the process.” Yeah, it's a fascinating thing to do. I think from every author I've talked to, it is absolutely worth it. It will be great. Outside of the benefits of creating a platform for yourself and a personal branding thing, it is really good as a learning process.



    Mike Julian: Absolutely.



    Emily Freeman: Stretching some of these atrophied muscles. It's a marathon. That's, I think, the hardest part: it's a marathon and there are no rewards.



    Mike Julian: Yeah, like you just keep grinding.



    Emily Freeman: Yes. It is an absolute grind until it's over and then you're like, "I hate this whole thing, I don't want to talk about it." Then you're like, "Promote it."



    Mike Julian: Then the real work starts.



    Emily Freeman: Exactly.



    Mike Julian: It took me 19 months for mine, start to finish.



    Emily Freeman: Dang, okay, yep, yeah.



    Mike Julian: That was about 40,000 words, 45,000.



    Emily Freeman: Dang that makes me feel better.



    Mike Julian: Yeah. This is one of those things that when you talk to a publisher you say, "Well how long does it take to write a book?" Then they're like, "Oh about six months." You're like, "No." When you talk to authors they say, "No, the average book, nonfiction book, takes about two years to write." There's this huge disconnect between what the publisher is telling people and what the authors are actually doing. No one's really talking about it.



    Emily Freeman: No, and I think the publishers have caught on to the fact that they have to put in arbitrary deadlines to stress us the fuck out. Then we actually ...



    Mike Julian: I'm pretty sure O'Reilly did that to me, because my original contract said the book would be draft complete in like three months.



    Emily Freeman: Same, exact thing.



    Mike Julian: I'm like there's no way that's going to happen.



    Emily Freeman: No.



    Mike Julian: I don't think I hit draft until like 14 months.



    Emily Freeman: Okay, all right, yeah, that works. No, I just talked to my editor, who's fantastic, he's great. He just pushed it back. Editors play a really funny role, because they're like a coach: they have to push you, but not so hard that you break. It's like, "You need to get this done, but also we love you and it's okay."



    Mike Julian: One of the things that really pushed me was when they asked/threatened to get me a co-author.



    Emily Freeman: Oh my God they all play this game. There's a run book. There is a publisher run book.



    Mike Julian: I think there is.



    Emily Freeman: That is hilarious. Yeah they said like, "If it's not like 75% done by March or something we've got to get a co-author." I was like, "No."



    Mike Julian: That was my response too. Like, "No, no, I've got this. Give me a little bit more time."



    Emily Freeman: I know. Mostly because I don't want to have emails back and forth arguing about things. I'm like, "I will just get this out."



    Mike Julian: Yeah, it's like, "Let's spend the time arguing what DevOps means instead of writing about it."



    Emily Freeman: Yes. Which is a very tech thing I think.



    Mike Julian: Right, of course.



    Emily Freeman: It's how engineers write books I think.



    Mike Julian: Yeah. When I look at other authors, it's like James Turnbull, who just routinely cranks out tons of books. I'm like, "How do you do this?" Then Ryn Daniels and Jennifer Davis with Effective DevOps, like holy crap, that was an amazing book. They wrote it together, which is already an insane feat, and it's big. It covered a lot of stuff, and I look up to people like this. Before I started writing mine, I was looking at people like that, thinking, "Wow, this is the sort of stuff that I want to do." Now that I've done a much shorter version of that, it's even more impressive what a lot of these authors are able to accomplish.



    Emily Freeman: Yes. Absolutely it is.



    Mike Julian: I'm very much glad I wrote a book, but the process of doing it is pretty crappy.



    Emily Freeman: Yes. I always compare it to childbirth. Once the baby's there, it's like, "Oh, totally worth it. Easy, I'll have another one." But when you're in labor and everything hurts, you're just like, "I will murder everyone in this room immediately." In hindsight, it's a much lovelier vision.



    Mike Julian: Yeah.



    Emily Freeman: Can I ask you, before we move on, I want to just highlight something. You mentioning Jennifer, she's my colleague and I love her. She's fantastic. She cranks out books and everything. It's just funny how, to the outside world, I probably look fairly productive and put-together. Then I'm sitting in a room where there's a pile of stuff that I need to put away. There's unfolded laundry. There's toys everywhere. Things are just a disaster. I feel like I'm not functioning at all, and it's funny how we compare other people's highlight reels to our bloopers.


    Mike Julian: Yeah. I would love to see Stephen King's process, him having been writing for 30-plus years. I'm sure it's not that clean. I'm sure if you look around his office, he's probably in just as bad a shape as we are.



    Emily Freeman: Yeah, absolutely.



    Mike Julian: Well, on that note: you mentioned something on Twitter a while back, and that's what got me reaching out to you to have this episode. I don't know how to describe it; all you said was just a hint of an idea, that you see a parallel between legacy code, railroads, and subways.



    Emily Freeman: Yes, so I wrote this abstract. It's the talk I want to give this year, and I have to figure out who might actually be interested in it. It's about maintaining our legacy systems by comparing them to the first subways. I believe London's Underground was first, then the Métropolitain in Paris, then New York City. At this point we're dealing with century-old train systems that were sort of haphazardly put together. The engineering feats that those initial engineers overcame during the building of these is fascinating. If nothing else, go listen to some podcast about just the first subways. For instance, there is a subway station in Paris that is underground but actually suspended. It is on an underground bridge, because the plaster of Paris quarries, I always screw that up, the plaster of Paris quarries are so deep, going back to the Romans, that they are just these massive holes underground. They built these enormous structures to hold those trains and stations up.



    Mike Julian: That's incredible.



    Emily Freeman: That kind of thing.



    Mike Julian: Right.



    Emily Freeman: Yeah, and then New York deals with so many problems. I mean some of the electrical wires that we still use for communication are insulated in cloth, if you can believe that.



    Mike Julian: I had a house like that.



    Emily Freeman: Did you really?



    Mike Julian: I really did.



    Emily Freeman: Like, how do you mean? That just sounds impossible, and like a fire hazard.



    Mike Julian: I mean it absolutely is a fire hazard.



    Emily Freeman: Okay. Solid.



    Mike Julian: In the 1940s, and I guess back to the '20s, there was this type of electrical wiring setup called knob and tube, where the wire connections are wrapped in cloth that's impregnated with wax. Of course, over time the wax breaks down, at which point everything just sparks and catches fire.



    Emily Freeman: Yes, absolutely. I mean, if you look at it, I think rats cause a bunch of problems in the New York subway.



    Mike Julian: Oh yeah. As it turns out rats really like electrical lines.



    Emily Freeman: I know. It's fascinating. Yeah, I just think it's really interesting how we take these ideas and execute them without really knowing what we're doing, of course, because just like the first time you write a book, the first time you write code, it's not going to be beautiful. Then we typically, 90% of the time, never get back to actually fixing it. I think if we're honest, most companies have a corner of code where the new employee orientation is just, "Don't touch that. If you touch that, everything breaks. No one knows how this works. It's written in ColdFusion. Just don't touch that." I think it's important to have respect for the initial engineers who actually did something that was really hard, even though now, with hindsight, those decisions may look bad. They were making the best decisions they could with the data they had at that moment. Then, with that understanding and respect for them and the systems they produced, how do we update and maintain those systems, and maintain availability and reliability for our current customers? I think it's pretty interesting.



    Mike Julian: Yeah, that is really interesting. If you're interested in more about the London Underground history, I think there's a documentary on I want to say Netflix about them building the first Underground.



    Emily Freeman: That's amazing.



    Mike Julian: Yeah, it's super cool. I really enjoyed it.



    Emily Freeman: I'm going to write that down. I'm a documentary nerd.



    Mike Julian: You seem to have a flair for this, the parallel stuff. I was looking at some of your talks just recently, and you did one on building teams and how it relates to Sparta.



    Emily Freeman: Yes.



    Mike Julian: I'm like, it was really cool.



    Emily Freeman: Thank you, I'm glad. Yeah, this is my contribution to tech. I have a humanities degree. I went to school for political science. I love history. I wrote for so many years, so I like books and that kind of thing. I think every talk is somewhat related. I have a Humpty Dumpty guide to DevOps, the Sparta talk, Dr. Seuss's Guide to Code Craftsmanship.



    Mike Julian: Which is also really great.



    Emily Freeman: Thank you. There's a pattern here.



    Mike Julian: I see that.



    Emily Freeman: Yes. I have a shtick and I'm shticking to it, I will just say that.



    Mike Julian: Bravo.



    Emily Freeman: Thank you. Yes.



    Mike Julian: Well that sounds like it's going to be a really awesome talk. I'm looking forward to seeing it on stage, wherever it gets selected.



    Emily Freeman: Thank you.



    Mike Julian: I guess for anyone listening to this, if you have a conference, please invite Emily.



    Emily Freeman: Yes, that'd be lovely.



    Mike Julian: She has a talk.



    Emily Freeman: Yes. That'd be awesome.



    Mike Julian: Well thank you so much for joining me on this. It's been absolutely fantastic.



    Emily Freeman: Thank you so much. I've had a wonderful time chatting with you.



    Mike Julian: Yeah. Where can people find out more about you and your work?



    Emily Freeman: Yeah, so on Twitter I'm @editingemily, like writing and editing. You can find my website at emilyfreeman.io.



    Mike Julian: All right then. Well, thank you to all the listeners for listening to the Real World DevOps podcast. If you want to stay up to date with the latest episodes, you can find us at realworlddevops.com, on iTunes, Google Play, or wherever it is you get your podcasts. See you in the next episode.

  • About the Guest: Greg established and leads the Enterprise Monitoring Services team at Standard Chartered Bank, and together with his team wrote and implemented a strategy and approach to effectively monitor and leverage data from over 1,000 applications, 30,000 servers, 15,000 network devices, public and private cloud, mainframe, Tandem, and multiple other technologies in a sustainable and scalable way. Applying Agile and DevOps techniques to the build, engineering, and support of the monitoring ecosystem at Standard Chartered, the team brought together tools across the technology stack and advocated techniques such as monitoring as code in order to improve monitoring quality and make it a mandatory part of the deployment pipeline.
    Prior to that he worked at Barclays Capital in Singapore and Goldman Sachs in Tokyo, Japan in various infrastructure and engineering roles.
    Links Referenced: Connect with Greg on LinkedIn
    Transcript
    Mike Julian: Running infrastructure at scale is hard, it's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O'Reilly's Practical Monitoring.


    Mike Julian: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools — and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version too. You can check all of this out at influxdata.com.

    Mike Julian: Hi folks. Welcome to the Real World DevOps podcast. I'm here with Greg Parker, head of enterprise monitoring services at Standard Chartered Bank, way out in Singapore. Welcome to the show Greg.


    Greg Parker: Thanks, Mike. I'm doing well. How are you doing?


    Mike Julian: I'm doing fantastic. So Standard Chartered Bank, like what is this? It sounds like just a bank, but I've been talking to you about it and it sounds like it's a whole lot bigger than I imagined.


    Greg Parker: Well, Standard Chartered operates across 70 countries. There are more than 1,200 branches and 90,000 employees, and it's just a sprawling financial institution, primarily operating in Africa, the Middle East, and a lot of emerging markets. The headquarters for IT is in Singapore, though the bank is headquartered in London. Out of Singapore we drive the technology strategy across all of the markets, over 70 countries. And we get a lot of diversity in our environment because of the different strategies that we have in each country. I was working for Goldman Sachs for about ten years, where IT was very tightly controlled from the center, from New York; the word came down from the heavens that this is how you're going to do everything. Then I went to Barclays, which was a similar model except the word came down from London. At Standard Chartered it was really Singapore saying, this is what we should be doing, and that's how we operated for our group-owned applications, but there were 70 other countries saying, this is how we have to do it in Nigeria, this is how we have to do it in Kenya, this is how you have to do it in Pakistan. You have all of those issues creep up when you're working across emerging markets, and especially in a financial institution.


    Mike Julian: What's your role in Standard Chartered?


    Greg Parker: So my role at Standard Chartered is to run enterprise monitoring. It wasn't my original role. I came in to drive some large infrastructure projects, and when I got there, I saw that monitoring was essentially chaos. There was really no central strategy around how we were going to do it. I worked with some people there and we effectively established a central enterprise monitoring organization for Standard Chartered. The problem was there was no central strategy or tool set, or group of tools, that we were using for monitoring, and there were multiple vendor deals, negotiated at different prices at different times with different countries. So there were a lot of inefficiencies that were contributing to massive MTTRs. Which meant, when an issue occurred, a thousand different teams got an alert, nobody knew whose fault it was, and it took all this time to work out what the root cause was and how we were going to resolve it. And I think a lot of that comes down to the fact that monitoring wasn't precise.


    Mike Julian: And I'm sure in no small part due to countries not being able to talk to each other.


    Greg Parker: Certain countries couldn't talk to each other and other countries just didn't know to talk to us. And so there was a lot of people working in silos.


    Mike Julian: How does your strategy even look when you have all these different entities that are doing their own thing, and like culturally you're not able to say, “This is what we're going to do.” So what's your approach instead?


    Greg Parker: Well, we do have the authority to dictate if we want to. That's one of the things that came along with establishing this central organization, which is backed by the CTO and the head of technology services: our mandate is to go out and fix monitoring for SCB. But at the same time, just handing people a mandate is not something that we want to do. I've been saying this the whole time: our strategy is not perfect and it's never going to be perfect. Our strategy is focused around corralling all the data that's out there, translating it, enriching it, normalizing it, and then exposing it through APIs. That's really the crux of it. But it's never going to be perfect, and our focus was to just implement a working framework and help improve monitoring, so that we can reduce those mean-time-to-resolve and mean-time-to-detect numbers and generally give the company a better sense of observability, so that we know what's going on.


    Mike Julian: Why don't we talk more about the strategy behind what you're doing. Like what does it actually look like? How did you come to this strategy? How are you implementing it? What is it even?


    Greg Parker: So we started with multiple teams that had implemented their own monitoring with the tools that they wanted to work with. We have BMC deployed across all of the infrastructure. A lot of the application teams had purchased ITRS. There were other tools like AppDynamics and Dynatrace, and some open source tools out there with Grafana and Elastic and all of that. My first thought was, we're not going to standardize everybody on one tool, and there's not one tool that's going to be this panacea that solves all our monitoring problems. That's kind of one of the key tenets of monitoring: there are a lot of different tools, and it's not about finding the perfect tool, it's about getting them to work together. So the initial thought was, let's just take all this data that we have and somehow ingest it and normalize it, and then figure out a way to leverage it. We can use more modern tools once we have a dataset that we can leverage. So we started off on that path, and we started building integrations and ingestions, and building a translation, enrichment, and normalization layer that would ingest not only the data coming from the BMC Patrol agents across all of our servers, but everything that was being deployed: AppDynamics for APM, a tool called Sysdig that we onboarded for container monitoring, Dynatrace doing synthetics, ITRS obviously doing a ton of application component monitoring as well as infrastructure monitoring. Some tools are harder than others to integrate, and that was the main focus of our efforts over the last 18 months or so ...


    Mike Julian: Is on the integration side?


    Greg Parker: Yeah, it was bringing everything into one place. When you're at a big company like Standard Chartered, or any large investment bank, you can spend years talking about what you want to do and-


    Mike Julian: At some point you just got to do it.


    Greg Parker: ... Yes. And I've seen it happen. Exactly. I've seen companies talk for years about cloud strategies, and monitoring strategies, and web service strategies, and database strategies, and at some point you just have to dive in and start doing it.


    Mike Julian: So what makes some integrations more difficult than others? What are the actual challenges with it?


    Greg Parker: Well, we talk a lot about there being open source tools, vendor tools, and proprietary tools, and there is this thinking that proprietary tools aren't open and you can't get the data in or out. That's not necessarily the case. There are some open source tools where it's really hard to pull out the data, and there are some vendor tools where it's actually relatively easy. For us, Patrol wasn't too bad, and AppDynamics and Dynatrace have relatively open platforms. Actually, we ran into a bit of an issue with ITRS, because they don't really structure their data in the way that you would expect from a normal monitoring tool. There's [inaudible 00:10:57]. These things make it difficult for us to say, this is where the breach occurred and this is where it was resolved, this is what it should look like, and this is how we should normalize it. At the same time, they have a Kafka method for bringing in the data. But we were running it on such old VMs that if we were to turn on that Kafka, it would impact the performance of the gateways, and ITRS operates on a siloed model: we had more than 300 gateways across our environment. For us to ingest the data from all of those gateways would've meant either a single script pulling stuff from the central database, or 360 different scripts running on each gateway and bringing data in. So for us that was a major challenge.
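
    (For readers curious what the kind of translation and normalization layer Greg describes might look like, here is a minimal sketch in Python. The event shapes and field names are hypothetical, invented purely for illustration; they are not Standard Chartered's actual schema or any vendor's real payload format.)

```python
# A minimal sketch of a normalization layer: events arrive from different
# monitoring tools in different shapes, and each one is mapped onto a single
# common schema before enrichment and storage. All field names here are
# hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class NormalizedEvent:
    source_tool: str   # which monitoring tool produced the event
    host: str          # the affected server or device
    metric: str        # e.g. "cpu.utilization"
    value: float
    severity: str      # normalized to: "info" | "warning" | "critical"

def normalize_patrol(raw: dict) -> NormalizedEvent:
    # A BMC Patrol-style event (hypothetical field names).
    return NormalizedEvent(
        source_tool="patrol",
        host=raw["hostname"],
        metric=raw["parameter"],
        value=float(raw["value"]),
        severity={"MINOR": "warning", "CRITICAL": "critical"}.get(
            raw["sev"], "info"),
    )

def normalize_itrs(raw: dict) -> NormalizedEvent:
    # An ITRS-style dataview cell (hypothetical field names).
    return NormalizedEvent(
        source_tool="itrs",
        host=raw["managedEntity"],
        metric=f'{raw["sampler"]}.{raw["cell"]}',
        value=float(raw["cellValue"]),
        severity=raw.get("severity", "info").lower(),
    )

NORMALIZERS = {"patrol": normalize_patrol, "itrs": normalize_itrs}

def ingest(tool: str, raw: dict) -> NormalizedEvent:
    # Route each raw event through the normalizer for its source tool.
    return NORMALIZERS[tool](raw)
```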


    Mike Julian: Yeah, that sounds like a nightmare. How are you going around to all the teams and getting them to adopt what you're doing?


    Greg Parker: Well, like I said, we wrote a white paper as a team, essentially, about what we were going to do and what the approach was going to be as far as enterprise monitoring. We circulated that in the forums that we have for technology standards. We got some feedback. We got people who didn't read it and just said fine, and I think you'll see a lot of that, especially in big banks or from people who are running large organizations. Then we got people who vehemently disagreed with our approach. Like violently, just absolutely not. I think that's the problem you'll always find: you'll have somebody who's completely ideologically focused on a single type of solution. The protests tended to come from people who hadn't spent a lot of their careers in large corporations, but in technology companies where you can probably do more experimentation.


    Greg Parker: But it's not that they were wrong at all. They're absolutely right. We should maybe look at these more modern technologies. But we have a limited amount of time to deliver value for a company that's not focused on technology; our primary service isn't delivering cloud to our users or something like that, our primary service is being a bank. And we had an environment built out on BMC that we just needed to upgrade in order to achieve what we wanted to do. So it ended up being a compromise, obviously, where we would agree: yes, this is what we want. We're not necessarily delivering it the way everybody wants, but we're doing something for the purpose of expediency, getting a solution out there, and at the same time we're going to start evaluating what the best way to do it is. I think it's always an evolution like that at big companies.


    Mike Julian: I have spent a lot of time at very large companies myself, and I've also spent time at small companies, and I've run into a lot of people who don't like how technology decisions are made in very large companies, mostly how slow they are and how behind the curve they feel. But to me there are actually a lot of really good reasons for that, and it takes away a whole lot of stuff that I don't need to worry about anymore. Like with you: you have the support of a very large bank behind you. You have all this data that you get to play with. A single division you're working with dwarfs the entire revenue of many of the companies in Silicon Valley. The scale of what you're doing is way more interesting, and you're only going to find it at a large place.


    Greg Parker: You don't encounter these issues when you're trying to monitor a single application. Then you're just like, fantastic, let's do [inaudible 00:15:06] of tracing and let's just get everything into a database, and then you can build a Grafana dashboard and it's fantastic. But I think that's the thing: we're not trying to monitor our company. We're trying to build a framework so that each individual team can monitor their application. Right?


    Mike Julian: You and I were talking a while back about this particular challenge as well: once you develop this strategy and develop this platform, how do you get it out there? The thing we were talking about is, well, you don't have one team that needs to adopt it. You have several dozen teams that are all doing their own thing, and you want to get them to adopt it. So it's not a quick thing. This is not going to be done anytime soon; this is very much a long play for you. Do you have any idea how long of an investment this is?


    Greg Parker: It feels like, well, it's something that never ends, first of all, but it feels like it's already been going on forever. We established the team around the end of 2017. We spent the better part of the last 14 to 16 months building the platform and the integrations, and our thinking was, we'll get people on board with the concept, we'll build something, we'll deliver it to them, and then we'll slowly drive the adoption. So having spent most of the last year building the platform, in 2019 we're going to be primarily focused on driving the adoption. We've migrated about 50 teams at this point from monitoring through either email or Remedy tickets or something like that to our standard platform, which exposes data through an API and has a front end, and we plan to drive another 50 to 100 teams onto that central platform next year.


    Greg Parker: And I think one of the difficult pieces will be looking at the teams that are using more mature solutions, where they've actually spent a lot of time. We have a team that has four resources fully dedicated, full time, for the last three years to our eCommerce area [inaudible 00:17:28] the Straight2Bank platform, which is eCommerce, real-time trade execution, and all of those sorts of things. They've built an ecosystem around this ITRS gateway that is completely custom and completely complicated. It's just insanely complicated. They generate dynamic dashboards and they inject XML, because that's the only way you can really talk to ITRS in a programmatic way; it generates XML flat files and pushes them out to the gateways and everything like that. And I mean, for me, that solution works for them, so our primary goal is not to get absolutely 100% of the bank onto our framework and our platform. It's really for the teams that haven't put that type of thought into their monitoring. For teams that have just been working off of emails and Remedy tickets, because alerts were auto-generated into tickets, which is a horrible way to deal with alerting, they're more willing, and they're definitely much more eager to say, "I'm so glad that you delivered a platform that is actually purpose-built for monitoring."


    Mike Julian: You made a really fantastic point there, and I want to call it out so we don't miss it: your goal is not 100% adoption. Your goal is to provide something where there wasn't anything good before. So you're not telling all the teams that you support, "No, you have to use my thing." What you're telling them is, this is available, but if you want to do something else, that's fine. But this thing that ...


    Greg Parker: If you like your manual process, you can keep it.


    Mike Julian: Like this thing that we've built is fully supported, so if you don't want to maintain your own stuff, then you can use ours.


    Greg Parker: Yeah, exactly. And we try not to position it as a mandate. We're an internal team; we could mandate it, but we sell it. I have a team of people who spend their time sitting down with the support people on the ground and selling our platform to them as if we were a technology company, because we want people to want to buy into it, not just feel like they're forced to use it.


    Mike Julian: What sort of questions or I guess objections are you getting when you're selling this to teams? Are there any common complaints, any common objections that come up?


    Greg Parker: There are two big ones. The first one is that people are very used to the way they were doing things. That happens when they've been monitoring a certain way for the last 10 years, which is, they just stare at a Remedy queue and hit F5 every 30 seconds to refresh it. Some of them have actually created macros so they don't have to hit F5; it just automatically refreshes every 30 seconds.


    Mike Julian: That's incredible.


    Greg Parker: Talk about automation. But they're very used to that. Audit and compliance is always a huge pressure at a bank, and they're very worried that they might miss a ticket. That's why they've auto-ticketed every alert, even minor alerts, 70% threshold breaches and that sort of thing. They're like, as long as it's a ticket, then no one can say that we've missed it. And that's entirely missing the point of monitoring, which is that you shouldn't actually even have to look at anything if there's no remedial action to take. So you do have to sit down with a lot of junior support people, because the domain heads sitting in Singapore, who aren't really on the ground, hear gripes coming up from the ground saying, "We don't want to move to this solution that the monitoring team is trying to force on us," and it's about helping them understand the bigger picture. So even though we have the support of the domain heads, you have to do a grassroots campaign. I feel that's the best way to approach it. The other big gripe comes from people who are using tools like ITRS, because there are certain features in ITRS that we can't replicate in our central platform, like acknowledging and snoozing an alert. That's functionality people have been used to, even though it's not necessarily a good practice. There are things like that, and you just have to take them case by case.


    Mike Julian: Sure. And on that note, you mentioned that there's this functionality in ITRS that people are used to, acknowledging or snoozing an alert, and it's not a good practice. So in that situation, are you also, through your platform, teaching people how to do monitoring better? Is there an education component to it?


    Greg Parker: We really try. There were four pillars to our monitoring transformation: people, process, governance, and technology. On the people side, we built up the team. On the process side, we're building a lot of documentation and training about how people should think about monitoring, and a monitoring pyramid, which says you think about your business deliverables and your business KPIs before you think about what your alerts are going to be and what you're going to monitor for. So that's another, completely different aspect of my organization: trying to drive better monitoring practices across the company. And that's much harder than building the technology.


    Mike Julian: As it turns out, people are a lot more work.


    Greg Parker: Yes, exactly. Driving changes in behavior, especially when they're deep-seated little habits, is very difficult. We formally modified our software development framework, our SDF, and our SDLC and gateway checks and all of these things in our governance documents, but it doesn't mean anything. People are not going to sit down and read those. It's really about helping people drive better monitoring in their area.


    Mike Julian: You touched on a point earlier, that audit and compliance are a big deal. I mean, you're a bank, so regulatory controls and audit and compliance and all that stuff surely play a pretty big role in what you're doing.


    Greg Parker: Yep. For me that's probably 70% of my job. It's massive.


    Mike Julian: What does that mean? What does that look like to you?


    Greg Parker: So the compliance aspect is driven by your own policies, and that's the thing a lot of people probably don't get: all of your audit burden is brought on by yourself. You have three lines of defense in any organization, any corporation. You have your first line, which is your internal risk and control team; your second line, which is your group operational risk; and your third line, which is your group internal audit. They'll all judge you based on the policies that you write. But at the same time, if your policies don't adequately address the risks that face the company, they'll write them for you.


    Mike Julian: So it's much better for you to write them?


    Greg Parker: Yes, but it's always a balancing act, because I could write a policy that says there's no central monitoring standard for the bank, and then they'll say, well, what about the risk of the bank not having adequate monitoring? And I'll say, there's a small monitoring standard for the bank where all you have to do is monitor CPU. And they're like, well, what about the risk of a file system failure? So you're just constantly balancing: I don't want too much governance burden, but you want to address the risks of the bank. And so you go into this negotiation with GOR, with group operational risk, and say, I think this is not a material risk, based on my experience and based on industry practice, and you usually arrive at a compromise, but in the end there has to be a monitoring standard. And that's the little piddly, tedious thing that GIA is going to audit you on every time. A constant finding in audits across the company is that the monitoring standard says your CPU threshold should be 90%, and we looked at the configuration and it's 87%. It's just nonstop. So from our perspective, we want to put out a monitoring standard that helps improve the company's overall production stability and addresses the risk of the company not having adequate monitoring, but it shouldn't be so specific that it causes internal teams to fail audits when there's not actually a material risk.


    Mike Julian: Like alerting on a CPU?


    Greg Parker: Right. Alerting on a CPU at 91% is not a material risk if the standard says it should be 90%. One way we try to address that is to de-emphasize the importance of static thresholds, obviously; it's sort of an ancient monitoring technique anyway, and now you have a lot more machine learning and dynamic thresholds. We put more emphasis on broad themes of monitoring: you're monitoring for performance, you're monitoring for errors, you're monitoring for peak utilization and high levels of demand. And from that point on, it's really about educating your internal audit and risk and control teams about why these better address the risks of the company.
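
    (A minimal sketch of the kind of dynamic threshold Greg mentions: alerting on deviation from a rolling baseline rather than a fixed number. This is illustrative only; the window size and the three-sigma rule below are arbitrary choices, not Standard Chartered's actual approach.)

```python
# A minimal sketch of a dynamic threshold: instead of a fixed "alert at 90%
# CPU" rule, alert when the latest sample deviates sharply from a rolling
# baseline. Window size and deviation multiplier are arbitrary choices here.
from collections import deque
from statistics import mean, stdev

class DynamicThreshold:
    def __init__(self, window: int = 60, n_sigma: float = 3.0):
        self.history = deque(maxlen=window)  # recent samples
        self.n_sigma = n_sigma               # how far from normal is "bad"

    def observe(self, value: float) -> bool:
        """Record a sample; return True if it breaches the dynamic threshold."""
        breach = False
        if len(self.history) >= 10:  # need some history before judging
            baseline = mean(self.history)
            spread = stdev(self.history)
            breach = spread > 0 and abs(value - baseline) > self.n_sigma * spread
        self.history.append(value)
        return breach

cpu = DynamicThreshold()
for sample in [42.0, 45.1, 43.7, 44.0] * 5 + [97.3]:
    if cpu.observe(sample):
        print(f"anomalous CPU sample: {sample}%")
```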


    Mike Julian: You actually end up with really two different customers here. So you have the group that's using the platform and then you have your internal governance that's judging you on what you've written and what you're doing, and you're having to satisfy both, which is almost an impossible task in some ways.


    Greg Parker: That's a good word. They're constantly judging us — and sometimes based off an antiquated understanding of the industry. I love it: these auditors have sometimes been in the business for 30 years, and they saw how monitoring worked in 1990 at IBM, and they expect to see a similar structure when we're trying to drive something more modern. But generally the users are appreciative of the fact that we field the monitoring-related questions around audit. They do have to be aware of the policies that exist and try to drive and implement them, because another way our policies get driven across the bank is that teams show up on noncompliance lists if they're not compliant with the policies. So what we do is try to establish a reasonable standard and a reasonable framework, and from that point on we can let GOR and GIA drive the compliance. To an extent, that's sort of a hammer that a central team can use across a large organization.


    Mike Julian: So our listeners are going to absolutely roast me over a fire if I don't ask this question: what does the tech stack look like underneath this platform you've built? What goes into it? You mentioned that there are a lot of vendor and proprietary tools, a lot of commercial and a lot of open source tools. But what's gluing it all together?


    Greg Parker: Like I said, we tried to bring together a bunch of different tools. At the center is a tool called TrueSight Operations Manager, which is from BMC. It's an aggregation point, a manager of managers; it has enrichment models, it has normalization models, and it has a lot of interfaces. The different agents that we have across the organization to collect data include, like I said, ITRS, BMC Patrol, AppDynamics. There are open source tools out there: Elastic, Beats. Like I said, Sysdig does our container monitoring, and that's based on an open source agent that was developed back in 2011. There are teams that are using Prometheus, there are teams that are using Telegraf and Influx to collect data. We were able to ingest all of that, via a set of proxies across every country and in our two main data centers, and normalize it and enrich it with data from our CMDB, our configuration management database, which enriches that monitoring data with information about business criticality, application, owner, and a lot of different things.
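
    (A sketch of the CMDB enrichment step Greg describes: joining a normalized event against configuration records so an alert carries business context downstream. The record fields and hostnames here are hypothetical, for illustration only.)

```python
# A minimal sketch of enrichment: join a normalized monitoring event against
# CMDB records so downstream consumers see business context (criticality,
# owning application, owner) alongside the raw alert. All field names are
# hypothetical, for illustration only.

# Stand-in for a CMDB lookup, keyed by hostname.
CMDB = {
    "sgp-app-01": {"application": "payments-gateway",
                   "criticality": "tier-1",
                   "owner": "payments-support@example.com"},
}

def enrich(event: dict) -> dict:
    """Attach CMDB context to a normalized event; unknown hosts are flagged."""
    ci = CMDB.get(event["host"])
    event["cmdb"] = ci if ci else {"criticality": "unknown"}
    return event

alert = {"host": "sgp-app-01", "metric": "cpu.utilization",
         "value": 97.3, "severity": "critical"}
print(enrich(alert)["cmdb"]["criticality"])  # -> "tier-1"
```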


    Greg Parker: And on the other side of that, because like I talked about before, BMC is a massive corporate vendor, but they're starting to step in the API direction and they've developed APIs for TSOM. But there are a lot of issues with the APIs for TSOM: they're complicated, they're hard to use. So while teams can use the front end of TSOM, of TrueSight Operations Manager, to build simple dashboards, with drag-and-drop and widgets, we then stream that data out of TSOM into an Elastic database. We've built a platform on top of Elastic using the Elastic APIs, and then we have the Elastic APIs plus a set of custom APIs that we've built to expose all of the data to users in real time, so that they can build their own real-time dashboards and visualizations off the back of that. And for right now, that's why we have a team of like 60 people or whatever, that ...


    Mike Julian: Just to make sure that people got that. You said 60?


    Greg Parker: Yes. Across engineering and support and our governance and our service management teams.


    Mike Julian: I just want to be sure that people listening to this, like, this is not a simple thing to do. That's a pretty significant organization working on that.


    Greg Parker: And most of the banks that I've been with have not actually devoted that much resource to their enterprise monitoring organization. There's usually an enterprise tools team that has a handful of people or maybe 10 or 20 people, and then there's a support team that supports a group of sort of IT for IT applications, but we really thought that we're going to take all of those teams and bring them together, and really try to drive a strategy centrally. And that's why it's a relatively large organization, but usually you have that number of people spread out across the bank, supporting the entire ecosystem.


    Mike Julian: Got you. There's so much that goes into how a large bank operates that it's just never occurred to me before. This has been absolutely fantastic to learn about.


    Greg Parker: I mean, it's just an octopus with its tentacles going out everywhere, and you're really just trying to corral everything together. Still being early on in our journey, like I said, we're just trying to corral everything. But there are other teams that have gone out on their own. In Standard Chartered, for example, the cloud team has gone out and built everything greenfield, with their best-of-breed tool set of choice for DevOps and all of the aspects of the DevOps pipeline as well as monitoring, which also integrates with our central framework. So there are pockets where application teams are building microservices applications based on Docker Swarm and on Kubernetes, so they're very modern, fault-tolerant, and performant. And then of course there are still plenty of applications out there that run on a mainframe and Tandem, and all of those fall under the same rules and policies.


    Mike Julian: So surely you've learned a few things from doing this whole project. What's gone well? What hasn't worked?


    Greg Parker: I would say the thing I would change, if I were just starting over from the beginning, would be to talk to my users more and really focus on improving monitoring from day one. Instead, we were saying, we're so far behind, we have to upgrade our system, we have to start building all these integrations, we have to build this central data layer. But we actually had the tools in place where we could, from day one, start talking to our users and implement the fundamental things that didn't require our modern tool set. They didn't require anything advanced. You don't need Kafka, a data bus, Cassandra, and all these different things if a Unix team is not able to monitor VCS. I mean, we needed a few scripts and a knowledge module, and then you can have active, proactive monitoring of your VCS clusters and know when they're going down, or falling over, or hitting resource constraints. It was only after we had spent maybe half a year that I had the head of Unix, or the head of platform, coming up and saying, "When are we going to get better ping monitoring, and when are we going to get better VCS monitoring?" And I said, "Well, actually, we've been working on completely overhauling and upgrading the platform." A lot of problems can be solved just by talking to your users, and we probably could have reduced our MTTRs and MTTDs quicker if we had started with that while concurrently driving an upgrade.
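
    (A rough illustration of the "few scripts" Greg means for proactive VCS, Veritas Cluster Server, monitoring. The hastatus command is a real VCS utility, but the output parsing below is an assumption and would need adjusting against a real installation's output.)

```python
# A rough illustration of a "few scripts" approach to proactive VCS
# (Veritas Cluster Server) monitoring: shell out to the cluster status
# command and raise an alert for any service group that is not ONLINE.
# The line format parsed here is an assumption; real hastatus output
# varies by version and configuration.
import subprocess

def check_vcs_groups() -> list[str]:
    out = subprocess.run(["hastatus", "-sum"],
                         capture_output=True, text=True, check=True).stdout
    problems = []
    for line in out.splitlines():
        # Assumed format: "B  <group>  <system>  <probed>  <autodisabled>  <state>"
        parts = line.split()
        if parts and parts[0] == "B" and parts[-1] != "ONLINE":
            problems.append(f"group {parts[1]} on {parts[2]} is {parts[-1]}")
    return problems

if __name__ == "__main__":
    for problem in check_vcs_groups():
        print(f"ALERT: {problem}")
```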


    Mike Julian: This has been an absolutely fantastic conversation. Thank you so much for coming. People at companies like yours are probably listening and saying, "Yeah, but I can't do that, for all these different reasons. That won't work for me." What advice would you give them?


    Greg Parker: I would really just say to focus on your monitoring coverage and compliance with your monitoring standards, and focus on developing good standards. Like I said, if you're at a large company, or at a company that has a typical corporate governance structure, you don't necessarily have to be the person who drives the compliance. If you write a good, risk-reviewed policy that addresses your company's main risks as far as monitoring is concerned, then you have teams of people whose job it is to drive compliance with those policies, and who will audit the other teams against them. So federate the work out; try not to take everything on yourself, which is a lesson that I've learned. It's impossible in a bank with 34,000 or 35,000 production systems and 1,000 applications. If you're driving a large organization and you don't have the resources, then focus on a very solid base, and ensure that your group risk and your group audit are on board with your policies and understand them, so that when they do their jobs, they drive those teams to adhere to those policies.


    Mike Julian: All right, so is there anywhere that people can find out more about you or your work?


    Greg Parker: I'm trying to think.


    Mike Julian: You do work for a very large bank, so you're perhaps not as public as some others.


    Greg Parker: I mean, I'm on LinkedIn, and I welcome people to get in touch with me there.


    Mike Julian: Alright. I'll throw that in the show notes of course.


    Greg Parker: Sure.


    Mike Julian: Well, thank you so much for joining me, Greg.


    Greg Parker: No problem, Mike, anytime you want.


    Mike Julian: And thank you to all our listeners as well. If you want to stay up to date on the latest episodes, you can follow along at realworlddevops.com or on iTunes. And if you're listening to this on iTunes, please rate us. So thank you and have a wonderful evening.


    Greg Parker: All right. Thanks Mike. Appreciate it.






  • About the Guest: Yan is an experienced engineer who has run production workloads at scale in AWS for nearly 10 years. He has been an architect and principal engineer in a variety of industries ranging from banking, e-commerce, and sports streaming to mobile gaming. He has worked extensively with AWS Lambda in production, and has been helping various UK clients adopt AWS and serverless as an independent consultant.

    He is an AWS Serverless Hero and a regular speaker at user groups and conferences internationally, and he is also the author of Production-Ready Serverless.
    Guest Links: Yan's blog; Yan's video course: Production-Ready Serverless; Find Yan on Twitter (@theburningmonk); Subscribe to Yan's newsletter; Centralised logging for AWS Lambda
    Transcript
    Mike Julian: Running infrastructure at scale is hard, it's messy, it's complicated, and it has a tendency to go sideways in the middle of the night. Rather than talk about the idealized versions of things, we're going to talk about the rough edges. We're going to talk about what it's really like running infrastructure at scale. Welcome to the Real World DevOps podcast. I'm your host, Mike Julian, editor and analyst for Monitoring Weekly and author of O'Reilly's Practical Monitoring.


    Mike Julian: This episode is sponsored by the lovely folks at InfluxData. If you're listening to this podcast, you're probably also interested in better monitoring tools — and that's where Influx comes in. Personally, I'm a huge fan of their products, and I often recommend them to my own clients. You're probably familiar with their time series database, InfluxDB, but you may not be as familiar with their other tools: Telegraf for metrics collection from systems, Chronograf for visualization, and Kapacitor for real-time streaming. All of this is available as open source, and they also have a hosted commercial version too. You can check all of this out at influxdata.com.


    Mike Julian: Hi folks, I'm here with Yan Cui, an independent consultant who helps companies adopt serverless technologies. Welcome to the show, Yan.



    Yan Cui: Hi Mike, it's good to be here.



    Mike Julian: So tell me, what do you do? You're an independent consultant helping companies with serverless. What does that mean?



    Yan Cui: So I actually started using serverless quite a few years back; pretty much as soon as AWS announced it, I started playing around with it, and over the last couple of years I've done quite a lot of work building serverless applications in production. I've also been really active in writing about things I've learned along the way, and as part of that, a lot of people have been asking me questions because they saw my blog, about problems they've been struggling with: "Hey, can you come help me with this? I've got some questions." I like to help people, first of all, and that's been happening more and more often, so in the last couple of months I've started to work as an independent consultant, helping companies who are looking at adopting serverless, or maybe moving to serverless for new projects, and who want some guidance in terms of things they should be thinking about, and maybe some architectural reviews on a regular basis. I've been helping a number of companies with things like that, both in terms of workshops and regular architectural reviews. At the same time, I also work part-time at a company called DAZN, which is a sports streaming platform, and we use serverless and containers very heavily there as well.




    Mike Julian: Okay, so why don't we back up like several steps. What the hell is serverless? Just to make sure that we're all talking about the same thing. What are we talking about?



    Yan Cui: Yeah, that's a good question, and I guess a lot of people have been asking the same question as well, because now pretty much everyone is throwing the serverless label at their products and services. Going by the popular definition out there, based on what I see in talks and blog posts, at least in my social media circle, serverless is pretty much any technology where, one, you don't pay for it when you are not using it, because paying for uptime is a very serverful way of thinking and planning; two, you don't have to worry about managing and patching servers, because installing daemons or agents or any form of subsidiary or support software is, again, definitely tied to having servers that you have to manage; and three, you don't have to worry about scaling and provisioning, because the system just scales the number of underlying servers on demand. And by this definition, I think a lot of the traditional backend-as-a-service things out there, like AWS S3 or Google BigQuery, also qualify as serverless as well.



    Mike Julian: Okay, so Lambda is a good example of serverless, but there's also this thing of like a function as a service and they seem to be used interchangeably sometimes. What's going on there?



    Yan Cui: So to me, functions as a service describes a change in how we structure our applications, changing the unit of deployment and scaling to the function level, to the functions that make up the application. A lot of the functions-as-a-service solutions, like Azure Functions or Lambda, as you mentioned, would also qualify as serverless based on the definition we just talked about, and generally I find that there is a lot of overlap between the two concepts or paradigms, between functions as a service and serverless. But I think there are some important subtleties in how they differ, because you also have functions-as-a-service solutions like Kubeless or Knative that give you the function-oriented programming model, and the reactive and event-driven model for building applications, but then run on your own Kubernetes cluster.



    Yan Cui: So if you have to manage and run your own Kubernetes cluster, then you do have to worry about scaling, and you do have to worry about patching servers, and you do have to worry about paying for uptime for those servers, even when no one is running stuff on them. The line is blurred when you consider Kubernetes-as-a-service things like Amazon's EKS or Google's GKE, where they offer Kubernetes as a service, or Amazon's Fargate, which lets you run containers on Amazon's fleet of machines so you don't have to worry about provisioning, managing, and scaling servers yourself.
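
    (To make "the function as the unit of deployment" concrete: in AWS Lambda, the whole deployable artifact can be as small as a single handler. A minimal, generic illustration in Python, assuming an API Gateway proxy-style event; this is not an example taken from Yan.)

```python
# A minimal AWS Lambda handler in Python: the function itself is the unit
# of deployment and scaling. The platform invokes handler() once per event
# and runs as many copies as the incoming traffic demands.
import json

def handler(event, context):
    # "event" carries the trigger payload, e.g. an API Gateway request.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```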



    Yan Cui: At the end of the day, I think being serverless or having the right labels associated with your product is not important. It's all about delivering on business needs quickly. But having a well-understood definition for these different ideas really helps us understand the implicit assumptions we make when we talk about something. So everyone calling their services or products serverless now really isn't helping anyone, because if everything is serverless, then nothing is serverless, and I really can't tell what sort of assumptions I can make when I think about your product.



    Mike Julian: Right, this is the problem with buzzwords: the more you have of them, the less they actually mean and the more confused I am about what you do. So, because I love talking about where things fall apart... Serverless is a cool idea. I think it works really well, and yet I've seen so many companies get so enamored with it that they spend six months trying to build their application on serverless, or in that model, and then a month later they go under. I can't help but tie the two together — you spend all your time trying to innovate on the platform, and at the end of the day you didn't have any time to innovate on the product. So that's an interesting failure mode. But I'm sure there are others, where people are adopting serverless the same way we first adopted containers. Like, "Hey, I just deployed a container, works on my machine, have fun." So when is serverless not a good idea? What are the pitfalls we're running into? What are people not thinking about?



    Yan Cui: I think one of the problems we see all the time... You mentioned hype: a lot of adoption happens because there's a lot of hype behind a technology and a lack of understanding of the requirements and the technical constraints you actually have before you go straight into it. I think this happens all the time, and that's why we have the whole hype cycle to go with it. When you're a newcomer to a new paradigm, it's so easy to become infatuated by what a product can do, and when all you have is a hammer, you start looking for nails everywhere. This happened when we discovered NoSQL. All of a sudden, everything had to be done with NoSQL. MongoDB and Redis were everywhere, used to solve every possible database problem, often with disastrous results, because, again, people weren't thinking about the constraints and the business needs they actually had and focused too much on the tech. If anything, I think with serverless we have this great opportunity to think more about how to deliver business value quickly, rather than about the technology itself. But as engineers, as technology people ourselves, you can see how easy it is to fall into that trap. And I think there are a couple of use cases where serverless is just not very good in general right now. One of them is when you require consistent and very high performance.



    Yan Cui: So quite a lot has been made about cold starts, which is something that's relatively new to a lot of people using serverless, but again, it's not entirely new. For a very long time we've had to deal with long garbage collection pauses, or a server being overloaded because load is not evenly distributed. But with serverless it becomes systematic, because every time a new container is spawned for one of your functions, you get a spike in latency. For some applications that's not acceptable, because maybe you're building a realtime game, for example, where latency has to be consistent and very, very fast. If you're talking about, say, a multiplayer game needing 99th percentile latency below 100 milliseconds, that's just not something you can guarantee with Lambda or any serverless platform today.
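
    You can observe this for yourself with a trivial function. A minimal sketch, assuming the Node.js Lambda runtime — the log fields and return shape are just for illustration:

    ```typescript
    // initializedAt is set once, when the container is created; it persists
    // across invocations because Lambda reuses containers between calls.
    const initializedAt = Date.now();
    let invocationCount = 0;

    export const handler = async (): Promise<{ coldStart: boolean }> => {
      invocationCount += 1;
      // Only the first invocation in a given container pays the cold-start penalty;
      // warm invocations reuse the already-initialized container.
      const coldStart = invocationCount === 1;
      console.log(JSON.stringify({
        coldStart,
        containerAgeMs: Date.now() - initializedAt,
      }));
      return { coldStart };
    };
    ```

    Invoke it in a burst and you'll see one cold start per new container that Lambda spins up to absorb the load — exactly the systematic latency spike Yan describes.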



    Mike Julian: I worked with a company a while back that was building a realtime engine, and that was a hell of a problem. We were building everything on bare metal and VMware, and had this really nice orchestration layer running on top of Puppet. And it was a hell of a problem because as load comes up, we're automatically scaling the stuff out, except as we're adding the new nodes, latency is spiking because we're trying to move traffic over to something that's not ready for it.



    Yan Cui: Yes, and with serverless you don't have the luxury of letting a server warm up first, giving it some time before you actually put it into active use. The cold start literally happens on the first request that doesn't have a warm container ready to handle it. So you can't just say, "Okay, I'm gonna give this server five minutes to warm up first." Maybe it's the JVM that needs the warmup time, so you tell your load balancer and the rest of the system to take into account the time it needs to warm up before you put it into active service. With serverless you can't do that, so where you do need consistent high performance, serverless is a really bad fit right now. I think you just touched on something else there as well: the fact that you need a persistent connection to a server, so there's some kind of logical notion of a server.



    Yan Cui: That's, again, something that serverless is not a good fit for — if you want, say, a persistent connection in order to do realtime push notifications to connected devices, or to implement subscription features in GraphQL, for example. In those cases you're also constrained by the fact that an invocation of a function can only run for a certain amount of time. I think that's a good constraint. It tells you that there are certain use cases that are a really good fit for functions as a service, and whole other cases where you just shouldn't even think about doing it. There are ways you can get around it, but by the time you've done all of that, you really have to ask yourself, "Am I doing the right thing here?"



    Mike Julian: Right.



    Yan Cui: And I think another interesting case, and this is again something that I find is often blown out of proportion, is cost. Sure, Lambda is cheap because you don't pay for it when it's not running, but when you have even a medium amount of load, you might find that you pay more for API Gateway and Lambda compared to just running a web server yourself. Now, that's true, but one of the things most people don't think about enough is the personnel cost — the skill set you need to run your own cluster, to look after your Kubernetes cluster, to do all these other things associated with having a server. That often makes it more expensive than whatever premium you pay AWS to run your functions.



    Yan Cui: However, if you're talking about a system that has, I don't know, maybe tens of thousands of requests per second, consistently, all the way throughout the day, then those premiums on individual invocations can start to stack up really, really quickly. I had a chat with some of the guys at Netflix a while back, and they mentioned they did a rough calculation that if everything on Netflix ran on Lambda today, it would cost them something like eight times more. If you're running at Netflix scale, that is a lot of money — way more than the amount you'd pay to hire the best team in the world to look after your infrastructure. So if you're at that level of scale and the cost is starting to rack up, then maybe it's time to think about moving your load into a more traditional containerized or VM-based setup, where you can get a lot more out of your servers and do a lot more performance optimization than you can in Lambda.
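
    To make that concrete, here's a back-of-the-envelope sketch. The workload numbers are invented, and the prices are the published us-east-1 Lambda rates, which may change — treat it as an illustration of how the premiums stack up, not a quote:

    ```typescript
    // Rough monthly-cost model for a sustained workload on Lambda.
    // Workload numbers are made up; prices are the published us-east-1 rates
    // ($0.20 per million requests, $0.0000166667 per GB-second) at time of writing.
    const REQ_PER_SEC = 10_000;   // sustained, around the clock
    const AVG_DURATION_S = 0.1;   // 100 ms per invocation
    const MEMORY_GB = 0.125;      // a small 128 MB function

    const invocationsPerMonth = REQ_PER_SEC * 60 * 60 * 24 * 30; // ~25.9 billion

    const requestCost = (invocationsPerMonth / 1_000_000) * 0.20;
    const computeCost =
      invocationsPerMonth * AVG_DURATION_S * MEMORY_GB * 0.0000166667;

    console.log(`requests: $${requestCost.toFixed(0)}/month`); // ≈ $5,184
    console.log(`compute:  $${computeCost.toFixed(0)}/month`); // ≈ $5,400
    // Plus API Gateway's own per-request fees on top — and this is a small,
    // fast function. A fleet of servers at a steady 10k req/s can come in
    // far cheaper, which is the trade-off Yan describes at Netflix scale.
    ```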



    Yan Cui: And I think the final use case where Lambda, or serverless, is probably not that good a fit today is that even though you get a good baseline of redundancy built in — you get multi-AZ out of the box, and you can also build multi-region, active-active APIs relatively easily — because we're relying on the platform to do a lot more, and the platform is essentially a black box to us, there are cases where the built-in redundancy might not be enough. For example, if I'm processing events in real time with Kinesis and Lambda, the state of the poller is a black box; it's something I can't access. So if I want to build a multi-region setup whereby, if one region starts to fail, I can move the stream processing to a different region and turn it on there — an active-passive setup — then I need access to the internal state of the poller, which is not something I can do, or I have to build a whole lot of infrastructure around it to simulate that.



    Yan Cui: And again, by the time I've invested all that effort, maybe I should have just started with something else to begin with. Those are some of the constraints I have to think about when I decide whether or not Lambda or serverless is a good fit for the problem I'm trying to solve. As much as I love serverless, I don't think it's about the technology. It's about finding what delivers on the business needs you have, so whatever you choose, it has to meet the business needs first and foremost, and then anything that lets you move faster, you should go with that.


    Mike Julian: So all this reminds me of an image that floated around Twitter a while back, that people dubbed the “Docker cliff.” The idea was that you had Docker at the very bottom of dev and prod, but to get something from dev — like when I'm developing with Docker on my laptop — to actually put it in production takes way more than just a container. How do you do the orchestration? How do you do the scheduling? How are you managing the network? What are you doing about deployment, monitoring, supervision, security, and all this other stuff on top that people weren't really thinking about? For developers, Docker was fantastic: oh hey, everything is great, it's a really nice self-contained deployable thing — except it's not really that deployable. And I'm kind of seeing that serverless is much the same way: we threw out a bunch of Lambda functions, this is great, and immediately the next question is, “How do I know they're working? How do I know when they're not working? What's going on with them?” CloudWatch Logs is absolutely awful, so trying to understand what a function is doing through there is just super painful, and the deployment model is kind of janky right now. How I've been deploying them is just a shell script wrapped around the aws-cli. I'm sure there are better ways to do it. So is there other stuff like this? Are there other things we're not really thinking about, and what do we do about them?



    Yan Cui: Yeah, absolutely. The funny thing is that a lot of the problems you just described are things I hear from clients and from the community all the time — how do I do deployment, how do I do basic observability — and the thing is, there are solutions out there that address these to varying degrees. As is the case with a lot of AWS services, they cover the basic needs. CloudWatch Logs is a perfect example of that: it does the job, but it does it very crudely.



    Mike Julian: Right, it's like an MVP of a logging system.



    Yan Cui: Yes.



    Mike Julian: Sorry, CloudWatch team, it's true.



    Yan Cui: And the same goes for CloudWatch itself, I guess. But the good thing is that at least you don't have to worry about installing agents and whatnot to ship your logs to CloudWatch Logs. So CloudWatch Logs becomes a good staging place to gather your logs, and from there you can actually ship them somewhere else — maybe an ELK stack, or maybe one of the managed services like [inaudible 00:18:48] Loggly or Splunk or something else. The pattern for doing that is actually pretty straightforward. I've got two blog posts which I guess we can link to ...
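
    The usual shape of that pattern: subscribe a Lambda function to the log group and have it forward each log event. A minimal sketch, assuming the Node.js 18 runtime (for the global fetch) — the ingest endpoint is a placeholder, not a real service:

    ```typescript
    import { gunzipSync } from "zlib";

    // Shape of the payload CloudWatch Logs delivers to a subscribed Lambda:
    // base64-encoded, gzipped JSON under event.awslogs.data.
    interface CloudWatchLogsEvent {
      awslogs: { data: string };
    }

    export const handler = async (event: CloudWatchLogsEvent): Promise<void> => {
      // Decode and decompress the batch of log events.
      const payload = JSON.parse(
        gunzipSync(Buffer.from(event.awslogs.data, "base64")).toString("utf8")
      );

      for (const logEvent of payload.logEvents) {
        // Forward each log line to your aggregator of choice (ELK, Loggly,
        // Splunk, ...). The URL below is a placeholder.
        await fetch("https://logs.example.com/ingest", {
          method: "POST",
          headers: { "content-type": "application/json" },
          body: JSON.stringify({
            group: payload.logGroup,
            stream: payload.logStream,
            timestamp: logEvent.timestamp,
            message: logEvent.message,
          }),
        });
      }
    };
    ```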



    Mike Julian: Yeah we'll throw those in the show notes.



    Yan Cui: ... In the show notes. One other thing which I think is quite important is security. Again, as developers we're just not used to thinking about security, and I see a lot of organizations try to tackle the security problem with this hammer called VPC, as if having a VPC is gonna solve all of your problems. In fact, every single VPC I've seen in production, none of them do egress filtering. So if anyone is able to compromise your network security, they find themselves in a fully trusted environment where services talk to each other with no authentication, because you just assume it's trusted now that you're inside the VPC. But we've seen several times how easy it is to compromise the whole ecosystem by attacking the dependencies everyone [has]. I think it was last year when a researcher managed to compromise something like 14% of all NPM packages, which accounts for something like a quarter of the monthly downloads of NPM, including-



    Mike Julian: Well that's gonna make me sleep well.



    Yan Cui: So imagine someone compromises one of your dependencies and puts in a few lines of code to scan your environment variables and send them to their own backend, harvesting all these different AWS credentials to see whether they can do some funky stuff with them. That is not something you can protect against by putting a VPC in front of things. And yet we see people take this huge hammer and apply it to serverless all the same, even though with Lambda you pay a massive price for using VPCs in terms of cold starts. My experience tells me that having a Lambda function running inside a VPC can add as much as 10 seconds to your cold start time, which basically rules it out for any user-facing APIs you have. But with Lambda you can actually control your permissions down to the function level, and that's again something I see people struggle with, because we don't like to think about IAM permissions and stuff. It's difficult, it's laborious.
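
    Function-level permissions just means each function gets its own narrowly scoped IAM role. As a hypothetical illustration (the table name, region, and account ID are placeholders), a policy for a function that only ever reads and writes one DynamoDB table might look like:

    ```yaml
    # A hypothetical least-privilege policy for a single function; the table
    # name, region, and account ID below are placeholders, not real values.
    Version: "2012-10-17"
    Statement:
      - Effect: Allow
        Action:
          - dynamodb:GetItem
          - dynamodb:PutItem
        Resource: arn:aws:dynamodb:us-east-1:123456789012:table/orders
    ```

    If that function's credentials leak, the blast radius is one table — rather than, with a wildcard, your whole account.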



    Mike Julian: Well you know, I think the real problem is that no one knows how IAM actually works.



    Yan Cui: To be fair though, I guess I'm probably a bad example because I've been using AWS for such a long time and I'm used to the mechanics of IAM and writing the permissions and the policies, but yes, it is much more complicated than people-



    Mike Julian: It is a little esoteric.



    Yan Cui: Yes, definitely. And I have seen some tools coming onto the market now — I think PureSec is one of them, and there are a few others — that are all looking at how to automate this process: identify what your function needs by doing static analysis on your code, looking at how you interact with the AWS SDK — oh, your function talks to this table — and then, when you deploy or in your CI/CD pipeline, flag that, hey, your function doesn't have the right permissions, or it's overly permissive. Because, again, a lot of people just use a star, so their function can access everything, which also means that if your function is compromised, the attacker can take those temporary credentials and do everything with them. So some of these tools are going to automate the pain we experience as developers in figuring out what permissions our functions actually need, and automatically generate those templates we can just put into our deployment framework. And you talked about the deployment model being clunky right now. There are quite a lot of different deployment frameworks that take care of a lot of the plumbing and complexity under the hood. I don't know if you've ever tried to provision an API Gateway instance using CloudFormation or Terraform — it's horrendous.



    Mike Julian: It's not exactly simple.



    Yan Cui: It's so, so complicated because of the way resources are organized in API Gateway. But with something like the Serverless Framework or AWS SAM, or a number of other frameworks out there, I can just write a human-readable route in one line, and that translates to, I don't know, maybe 100 lines of CloudFormation template code.
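
    For instance, a minimal Serverless Framework config — the service name, runtime, and handler below are placeholders — where the single http line expands into the REST API, resource, method, deployment, stage, and Lambda permission resources in the generated CloudFormation:

    ```yaml
    # Hypothetical serverless.yml; service name, runtime, and handler are
    # placeholders for illustration.
    service: hello-api

    provider:
      name: aws
      runtime: nodejs18.x

    functions:
      hello:
        handler: handler.hello
        events:
          - http: GET /hello   # this one line becomes dozens of lines of CloudFormation
    ```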



    Mike Julian: That's awful.



    Yan Cui: This is just not stuff that I wanna deal with, so there are frameworks out there that ease a lot of the burden around deployment and similar things. On the visibility side of things, there are also quite a lot of companies focusing on that side of the equation, giving you better observability. Because one of the things we find with serverless is that people are now building more and more event-driven architectures, because it's so easy to do nowadays.



    Mike Julian: Right.



    Yan Cui: And part of the problem with that is they're a lot harder to trace compared to direct API calls. With API calls, I can easily pass a correlation ID along in the headers, and a lot of the existing tools like AWS X-Ray kick in and integrate with API Gateway and Lambda out of the box. But as soon as my event goes over asynchronous event sources like SNS, Kinesis, or SQS, I lose the trace entirely, because they don't support these asynchronous event sources. But there are companies like Epsagon who are now looking at that problem specifically, trying to understand how data flows through the entirety of the system, whether it's synchronous through APIs or asynchronous through the event streams, task queues, or SNS topics that you have. And there are also companies focusing on the cost side of things — understanding the cost of user transactions that span across this massive web of different functions, loosely coupled together through different event sources — CloudZero being one of those, I guess the foremost of the companies focusing on the cost story of serverless architectures. So there are quite a lot of interesting startups focusing on various aspects of the problems we've just described, and I think in the next six to twelve months we're gonna see more and more innovation in this space, even beyond all the things Amazon's already doing under the hood.
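
    Carrying a correlation ID over an asynchronous hop is mostly bookkeeping. A minimal sketch, assuming the AWS SDK v3 SNS client; the topic ARN environment variable and the attribute name are made up for the example:

    ```typescript
    import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";
    import { randomUUID } from "crypto";

    const sns = new SNSClient({});

    // Publisher: attach the correlation ID as a message attribute so it
    // survives the hop through the asynchronous event source.
    export async function publishOrderEvent(order: object, correlationId?: string) {
      await sns.send(new PublishCommand({
        TopicArn: process.env.ORDER_TOPIC_ARN, // placeholder
        Message: JSON.stringify(order),
        MessageAttributes: {
          "x-correlation-id": {
            DataType: "String",
            StringValue: correlationId ?? randomUUID(),
          },
        },
      }));
    }

    // Consumer: pull the correlation ID back out of the SNS envelope and
    // include it in every log line so the trace can be stitched together.
    export const handler = async (event: { Records: any[] }) => {
      for (const record of event.Records) {
        const correlationId =
          record.Sns.MessageAttributes?.["x-correlation-id"]?.Value ?? "unknown";
        console.log(JSON.stringify({ correlationId, message: record.Sns.Message }));
      }
    };
    ```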



    Mike Julian: Yeah, that sounds like it will be awesome. This whole area still feels pretty immature to me. I know there are people using it in production — there were also people using Mongo in production while it was dropping data like crazy every day, so more power to them if they don't like data. But I like stable things. So it sounds like serverless is still maturing. It is ready, but we're still kinda working some of the kinks out? Would that be a fair characterization?



    Yan Cui: I think that's a fair characterization of the tooling space, because a lot of things are provided by the platform, and as I mentioned before, Amazon is good at meeting the basic needs you have. So you can probably get by with a lot of the tools out of the box, but I guess that also slows down some of the commercial tooling support, compared to what something like containers gets with Kubernetes. Then again, because you only get so much out of the box, that's a huge opportunity for vendors to jump in very, very quickly, and I think those innovations are happening a lot faster than people realize. Maybe one of the problems is just education — getting the information about all the different tools coming into the space out there and making people aware of them.



    Mike Julian: That's really interesting. What I think a lot of people forget is exactly how old Docker is, because Docker was kind of in the same position as serverless, where it was really cool but still pretty immature. And thinking about when these things came out — now we're seeing Kubernetes, which is maturing that ecosystem further, actually in production. We know the patterns, we know how all that stuff is deployed, we know how to manage it, we know the security. It is pretty mature, but how long did it actually take to get there? Docker's initial release was in 2013 — that's five years ago, which has blown my mind — and Kubernetes' initial release was in 2014, four years ago. But it's only really been in the past year or two that Kubernetes has been what we'd call mature. And now we're starting to see this massive uptick of abstraction layers on top of Docker in the form of Kube. At some point, I think we're gonna see that with serverless, where it's not just, oh, we're deploying this Lambda function and calling it a day. I think we're gonna see a lot more tooling, a lot more abstraction that brings it all together and makes it so much easier to deal with, especially at scale.



    Yan Cui: Yeah, I absolutely agree. And just in terms of the dates you mentioned, the initial announcement of Lambda was in 2014, so in terms of age, it's not that much younger than Docker and Kubernetes.



    Mike Julian: Wow.



    Yan Cui: Where it differs is that it's a brand new paradigm, whereas with containers and Kubernetes, it's a lot easier to lift and shift existing workloads without having to massively restructure your application to be idiomatic in the paradigm. With Lambda and serverless, in order to be idiomatic, there's a lot of restructuring and rethinking you need to do, because it's a mindset change. And that takes a lot longer than just a technology change.



    Mike Julian: Right, yeah. We're talking about something completely new here. So it's not like, oh, we'll just implement Lambda overnight and call it a day, we'll just move our whole application over. It's not like when we started putting things in containers. We could actually put a thing in a container, but really all we were doing by lifting and shifting was moving from one server to another — except now it's a smaller server.



    Yan Cui: Yes.



    Mike Julian: We had the idea of the fat container where you had absolutely everything in a container. That is a bad idea, it's a dumb pattern. And it's going the same way with serverless, I think. You can't just lift and shift. It is a brand new architectural pattern. It requires a lot of serious thought.



    Yan Cui: Yeah, and I think one of the pitfalls I see in serverless adoption sometimes is that we get so wrapped up in this whole movement to a new paradigm that we forsake all the things we've learned in the past, even though a lot of the principles still very much apply. In fact, a lot of what I've been writing about is basically how we take previous principles and apply them, adjust them, make them work in this new paradigm. The practices and patterns may have to change, because some things just don't work anymore, but the principles still apply. Why do we do structured logging? Why do we sample logs in production? The principles behind those still very much apply when it comes to serverless; it's just that how we get there is different. And I think that's one of the key things I've had to learn in the last couple of years — just like with databases, where a lot of what we learned is still very much here to stay even if we don't need the specific skill set DBAs provided, in the new world of NoSQL databases. When it comes to serverless, making the leap from looking at practices to understanding the principles behind them — why do we do this, how can we apply it — is super important for a successful adoption of serverless in your organization.
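
    Structured logging and sampling are good examples of principles that carry over with a different implementation. A minimal sketch — the sample rate and field names are arbitrary choices for illustration, not from any particular library:

    ```typescript
    // Keep all info/error logs, but only emit debug logs for a small sample of
    // calls: you retain detail for diagnosis without paying to ship everything.
    const DEBUG_SAMPLE_RATE = 0.01;

    type Level = "debug" | "info" | "error";

    function log(level: Level, message: string, fields: Record<string, unknown> = {}): void {
      if (level === "debug" && Math.random() >= DEBUG_SAMPLE_RATE) return;
      // One JSON object per line: trivially parseable by whatever ships the logs.
      console.log(JSON.stringify({
        level,
        message,
        timestamp: new Date().toISOString(),
        ...fields,
      }));
    }

    log("info", "order received", { orderId: "abc-123" });
    log("debug", "raw request payload", { body: "..." }); // usually dropped
    ```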



    Mike Julian: That's an absolutely fascinating perspective, because I completely agree. What I absolutely love about it is that the principles of site reliability haven't actually changed. The principles of how we run and manage systems — have they really changed a whole lot in the past 10 years? Not much, which is fantastic. That's how it should be. We should always be looking for true principles — the pillars of how we behave and how we look at what we work on. How we do it changes all the time, and it absolutely should, but the principles shouldn't change that much. So that's what's interesting about trying to apply the principles that we already know to be true, the practices that we know work, to a new paradigm. And sure, maybe some of them aren't going to apply very well, and maybe we'll have to create new ones — I'm sure there will be some coming out of this. But we don't have to start from scratch.



    Yan Cui: No, what's that saying again? Those who don't know history are doomed to repeat it.



    Mike Julian: Right, exactly. We've talked a lot about the failures and the challenges, and you keep mentioning this idea, the business case for serverless. So sell me on it. I want to deploy serverless in my company. I'm just an engineer, but I really like it, so I wanna move everything to it. I wanna do a new application in it. What should I be thinking about? How do I come up with this business case?



    Yan Cui: I think the most important question there is: what does the business care about? Pretty much every business I know of cares about delivery and speed. As a business, you want to deliver the best possible user experience, and you want to build the right features that your users actually want. But to do that, you need to be able to hit the market quickly and inexpensively, so that you can iterate on those ideas — that's what lets you tell the good ideas from the bad ones, and then you can double down on the good ones and make them really great. And the more your engineering team has to do themselves, by definition, the slower you're gonna be able to move. That's why businesses should care about serverless: it frees the engineering team from having to worry about a whole load of concerns around how the applications are hosted, and lets the real experts, the people who work for AWS, worry about that undifferentiated heavy lifting. That frees up the brainpower you actually have — which, by the way, is super expensive — for solving the problems your users actually care about. No user cares about whether your application runs on containers or VMs or serverless, but they do care about when you're gonna deliver, and they do care about getting the right features. And that, again, means you need to optimize for time to market and the ability to iterate quickly. A lot of people talk about vendor lock-in, as if you should worry about Amazon one day holding the keys to your kingdom, but I think the real-



    Mike Julian: That's the last thing I'm worried about.



    Yan Cui: Yeah exactly, I think the biggest problem we should worry about is a competitor who can iterate faster than you, locking you out of the market altogether.



    Mike Julian: Right.



    Yan Cui: Yeah so I think that's why they should really really care about serverless.



    Mike Julian: I agree with that. That sounds great. The biggest thing that I see with engineers and their architectural decisions is that a lot of them are based essentially on resume-driven development. I've met a lot of engineers who say, I built this new application in Go because I wanted to learn Go, and I'm like, that's cool, what does the business have to say about that? And it's, well, "I convinced my boss to use Go." I'm like, "No you didn't. Your entire shop's in PHP; you basically just said PHP is shit. That was your business case." Instead, yes, we should be looking at this from the perspective of: how quickly can I get this new product to market? How quickly can I ship this feature? And yeah, there might be some scenarios where switching a language or a framework would be useful, but I agree with you that we should be focused significantly more on time to market and time to value. We're here to help our businesses make money — or in my case, help my own business make money. I have an application that I'm writing in PHP right now. It's PHP and MySQL, and it's gonna be a core facet of my own company. Most engineers would say I'm crazy for writing PHP, but the entire point is that I don't have time to dick around. I need to have this out in the market.



    Yan Cui: Yeah, absolutely, totally agree. I've had quite a few of those conversations in the past myself, and I've heard a lot of similar arguments around, say, why we should use functional programming. I was part of the functional programming community for quite a long time, and I'm still a big fan of it — but not for the reason that it makes your code more readable. Again, it's about moving up the abstraction ladder so that I have to do less; it's about getting that leverage to be able to do more with less. And I think that's the argument we should be making more, as opposed to, this is how I like to read my code.



    Mike Julian: Right, let’s take this from two different perspectives. For the people that are brand new to serverless, what can they do this week, or today, to learn more about it? And for the people that already have serverless in their infrastructure, what can they do this week to improve their situation?



    Yan Cui: I think learning by doing is always the best way to get to grips with something. If you're just starting, serverless makes it so easy to get started and play around with something, and when you're done, just delete everything — with CloudFormation it's a single button click, or if you're using the right tools, a single command. So definitely go build something. If you've got questions about how the platform behaves, build a proof of concept and try it out yourself. It's super, super simple nowadays. That's how I've learned a lot of what I know about serverless: just by running experiments. Come up with a question, come up with a hypothesis for how I expect the platform to behave, do a proof of concept to answer the question — and then, since I like to write about things, I have a record of it afterwards and can share what I've learned with other people as well.



    Yan Cui: And if you've already started and want to take your game to the next level — I don't wanna boast, but do check out my blog. I've shared a lot of what I've learned about running serverless in production, the problems you run into, and how to address a lot of the observability concerns. I also have a video course with Manning; feel free to check it out. In it we actually build something from scratch and apply a lot of the things I've been talking about for the last year and a half, two years: how to do all the basic observability things, how to think about security, VPCs, performance, and so on. All of that will be available in the podcast episode notes. And also, just go out there, talk to other people, and learn from them. There are a lot of very knowledgeable people in this space already — people like Ben Kehoe from iRobot, Paul Johnston, Jeremy Daly — quite a lot of people who have been very active in sharing their knowledge and their experiences. So definitely go out there, find other people who are doing this, and try to learn from them.



    Mike Julian: That's awesome. So thank you so much for joining us. Where can people find more about you and your work?



    Yan Cui: You can find me at theburningmonk.com — that's my blog, where I try to write actively — and you can also find me on Twitter. I try to share new things that I find interesting, anything I learn, and whenever I write something I publish it there as well. If you don't wanna miss anything, I also have a newsletter you can subscribe to on my blog, where I write up regular summaries and updates on the things I've been doing. And I'm also available for consultancy work if you need some help in your organization, whether that's getting started or tackling specific problems you have with serverless.



    Mike Julian: Wonderful. Well, thank you so much for joining us. And on that note, thanks for listening to the Real World DevOps podcast. If you wanna stay up to date on the latest episodes, you can find us at realworlddevops.com, and on iTunes, Google Play, or wherever you get your podcasts. I'll see you in the next episode.



    Yan Cui: See you guys.