In this Episode
- Why your prior success may not scale
- Cloud is great – except when it’s not
- Who is your customer? Are you sure?
Tom Cooper 0:00
“Tom, I feel like I have a fire hose that shoots money, and I just point it at any problems I find.” In today’s software horror story, we’re going to talk about what happens when a smart guy, a really smart guy, creates a startup that gets acquired by a large firm, and then they put him in charge of a rewrite of the cash cow from one of their business units. Sounds great, right? This guy could bring his expertise. And while he had been held back by tight budgets and a small staff before, now he could have the resources to build something really amazing. What could possibly go wrong? I’m reminded of financial planning expert Dave Ramsey talking about how access to more money, often borrowed money, can lead to failure. In fact, at one point he was a real estate developer who was worth more than a million dollars on paper, until things took a wrong turn and he spent over a year on the journey of losing everything to bankruptcy. One of the things he learned in that process was how having more money can amplify bad decisions. This came into play for him one time when he did a big promotion related to a quick-service restaurant. He had a children’s book written that he wanted to distribute as kids’ meal prizes. This was going to help kids learn about money and help promote his business. A win-win, right? Unfortunately, the big prize that the kid in the book was saving up for was a special trip, a trip to Space Camp. While Space Camp sounds amazing, Dave’s team didn’t realize that the term Space Camp was trademarked. He learned that after he had printed a bunch of books and had them in distribution. That’s when he got the cease and desist letter from the trademark and copyright holder, and he had to recall every single book. And what was his takeaway? It was a dumb mistake. But since he was using his hard-earned cash, and not investor money or bank loans, the mistake was a lot smaller than it could have been.
Sometimes having access to less money is actually a good thing. So let’s turn our attention back to Jamie, our CTO, and his so-called fire hose of money. As I mentioned, Jamie had been frustrated for some time because he had grand plans for expansion, but he’d been held back by a lack of ready cash. After his company was acquired by a larger company that had plenty of money, he had the freedom to overcome those limitations. And boy, did he ever. Jamie made investments in infrastructure, in design, and in engineering. He oversaw the technical details of the system, and he worked with the engineers on creating something amazing. It’s a great plan, right? Smart guys, sufficient resources, strong engineers, customers with significant problems and the money to solve them. It’s a perfect storm. Except it wasn’t. Well, the storm was a big one, but it turned out not to be so helpful of a storm. When I met Jamie, he was in the middle of overseeing the creation of a really cool product, one that was a cloud-based rewrite of a dominant on-premise tool. Going to the cloud meant that Jamie’s company could deliver a comprehensive solution to less tech-savvy customers, customers without a deep bench of technical skills. It also meant that customers using the old solution could potentially convert to the new one relatively easily. And that was where one of the problems began. Bill Gates famously once said, “Success is a lousy teacher. It seduces smart people into thinking they can’t lose.” Success is a lousy teacher; it seduces smart people into thinking they can’t lose. Now, Jamie had been successful, and he was a very capable CTO. Jamie’s new boss in this arrangement, Hank, had been super successful in selling that on-premise solution to customers. But Hank and Jamie were in that group that Bill Gates described. Over their careers they had been successful, and because of that success, they believed they could not lose.
Hank’s secret sauce was creating a tool that was very useful for his customer base. And Hank figured out he could get to a higher price point if he offered customization to larger companies. Hank’s solution at the time was on-prem software, and he had enough prospects ask him for custom options that he began to offer them some level of customization. Over the years, what happened was that Hank ended up with lots of variations of his software. Customer A wanted one thing, Customer B wanted a different thing, and Hank had his developers add those customizations. Now, the problem with this approach is that unless you are very careful, you end up with a lot of incompatible software. Hank said, “No problem. We will make everything configurable. That way we ship the same source code to every customer. Problem solved.” And he was right. He could do that. But the unexpected consequence was that there were more configurable options than anybody could ever test. And that meant that the customer was the one who discovered problems and errors. With a small number of clients, maybe that was manageable. But unfortunately for Hank, his sales team found a lot of enterprise customers who wanted custom options that were specific to their particular needs. So Hank directed his engineers to build exactly what the customer asked for. And the support team spent endless hours troubleshooting the predictably untested and buggy software. And now Hank’s team was improved. The acquiring company brought in Jamie as CTO and empowered Hank and Jamie with the money to build that replacement for the legacy on-premise enterprise software. The overall idea was that they could potentially offer a SaaS product, one that could compete with other products targeted not at enterprise customers, but at a large volume of mom-and-pop companies who would buy the software like an appliance to manage their business.
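To put a rough number on why “more configurable options than anybody could ever test” happens so quickly, here is a minimal Python sketch. It assumes every option is an independent on/off switch and that one automated test run per configuration takes one second. Both assumptions, the function names, and the option counts are hypothetical; the episode gives no actual figures for Hank’s product.

```python
# Hypothetical illustration: none of these numbers come from the episode.

SECONDS_PER_TEST = 1.0  # assumed: one automated test run per configuration
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def configuration_count(boolean_options: int) -> int:
    """Distinct configurations when every option is an independent on/off switch."""
    return 2 ** boolean_options


def years_to_test_all(boolean_options: int,
                      seconds_per_test: float = SECONDS_PER_TEST) -> float:
    """Years needed to run one test per configuration, one after another."""
    total_seconds = configuration_count(boolean_options) * seconds_per_test
    return total_seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    for options in (10, 20, 30, 40):
        print(f"{options} options: {configuration_count(options):,} configurations, "
              f"about {years_to_test_all(options):,.2f} years to test them all")
```

Each added switch doubles the number of combinations, so exhaustive testing stops being realistic after a few dozen options, and real settings with more than two values grow faster still. That is one way the customer ends up being the first person to exercise a given combination.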
Now, let me just ask you to pause for a second here. Can you think of some differences between enterprise customers and mom-and-pop customers? Having worked in large and small companies over the course of my career, I can think of many differences. Small business operators are so busy trying to keep the business running and keep customers happy that the last thing they want to do is think about or make technology decisions. They need a product that’s as reliable and predictable as a toaster. And, by the way, as cheap as a toaster. Now, if we keep that analogy going, enterprises already have a bunch of toasters and toaster ovens and regular full-size ovens and commercial ovens to manage. They have experience with weird wiring, ventilation, and electrical demand, and they have processes to manage all these different types of cookers, we’ll call them. And because of the variability of the world they live in, enterprise leaders want to be able to set up the new toaster like all the other toasters. So they ask, or they demand, that vendors deliver specific options that work with the way the other systems work. There’s an expression that says he who chases two rabbits catches neither. In this case, Hank and Jamie were hoping to chase enterprise and small business customers, SaaS and custom software customers. You can imagine that they were having some difficulty finding their minimum viable product, and you would be right. It seemed like no matter what they did, they could not get to 100% of something. So they added resources, more team members. Oh my goodness, the classic blunder. In 1975, Fred Brooks wrote the book The Mythical Man-Month, and he discussed an idea we now call Brooks’s Law, which says adding manpower to a late software project only makes it later. Fundamentally, late projects are not late because of a lack of engineers.
Late projects are late because of poor planning and poor feedback loops, the kind that should raise problems to the awareness of leaders so that leaders can make things better in a timely fashion. Late projects are also late because of poor scope management: incomplete thinking, followed by incomplete learning and incomplete replanning. They’re late because leaders assume they know exactly what’s needed, and they don’t think they need to learn about what the market, the customer, or the user needs, and how those needs might be different now from what they believe the customer or the market wants. By the time our team was engaged, this project was just like the ones described in The Mythical Man-Month: late, and adding resources that made it later and later. It’s been said that what got you here won’t get you there. And the reality is that what had helped Hank and Jamie be successful really was getting in the way of where they wanted to go. For Hank, success had come with infinite customizations for enterprise customers. For Jamie, his big product had been less customized, and he’d been able to get a lot done with a small team. Unfortunately for Hank and Jamie, they had never created a SaaS product for unskilled users before. They had also never created a one-size-fits-all product at a low price point. And they hadn’t managed engineering teams anywhere near as large as they now had. What had worked for them in the past simply would not work for them now. And going back to Bill Gates’s warning about success seducing smart people into thinking they can’t lose: for this president and CTO, it was worse. Not only did they believe they could not lose, they believed that losing was an indication that the person who lost was a loser. At one point, we had an opportunity to gain some insights from an influencer in the organization who had experienced a pretty public crash.
One of my favorite authors is C. S. Lewis, who famously wrote The Chronicles of Narnia series, and Lewis talked about learning from experience. Lewis said, “Experience: that most brutal of teachers. But you learn, my God do you learn.” This influencer had experience that Hank and Jamie just didn’t have. He had learned painful lessons only available through life experience. When I suggested that we might want to sit with him and ask some questions, Hank and Jamie were uninterested. In fact, Jamie literally said, “Tom, that guy’s a failure. Why would I listen to him?”
Tom Cooper 10:00
Now, the most comprehensive and complete lessons do come from personal experience. But the quickest learning can come from the experience of others. Hank and Jamie were successful. They each had a pattern of success, and it was preventing them from being curious about what might be different today from what had made them successful before. And when the project was running too slowly, they decided they needed more engineers, a lot more engineers. Time won’t permit me to detail all of the story, and perhaps I can come back to Hank and Jamie in a future episode. But for today, I want to talk about our experience using cloud computing. There were a couple of cloud computing problems. Now, it’s tempting to point a finger at the compute vendor and talk about how crazy expensive cloud is. And cloud can be crazy expensive. It doesn’t have to be, but it can be. So think about cloud computing. I remember a time in my youth when setting up an internet-facing system meant that we needed to have an internet connection; a firewall or router; a load balancer; capacity planning to tell us how much bandwidth, CPU, and memory we would need; licenses for front-end and back-end software; operating systems; database capacity and licenses; facilities to house equipment; power; cooling; and an operations team to oversee and run the gear. And in each of those functions, we needed an expert who could help us determine the exact specifications and help configure the thing. We’d have to spend a good amount of time planning and coordinating, and then installing and configuring and testing, and then developing and working out what hardware we needed. Even after we knew that, we then had to source it, and it would take time to get the boxes shipped to us. Then we had to coordinate with the facilities people to get it racked and powered. But today, any developer with a credit card can get started with a compute system at any time, in just a few minutes.
With a few well-chosen clicks on the virtual configuration panel, we get speed to market and options for scalability and flexibility, and we can ramp performance up and down without having to put new equipment in our racks. Not only that, but we can stand up new environments quickly, and we can prototype and test in virtual spaces without difficulty. That part of cloud, that part’s amazing. What’s interesting is we still actually need all of those broad skill sets: skills in security, capacity planning, databases, licenses, all those things and more. And that developer with the credit card, he or she may have some of those skills, but probably not all of them. Without the planning and expertise, you’re going to stand up services you don’t need, and you’re going to forget ones that you should have. Plus, you’ll probably have services misconfigured as well. It’s a risk. Now, that’s before we talk about compute, network, and storage charges. I want to be clear, I’m not anti-cloud. It’s a great option in so many situations. But we saw that Hank and Jamie had lots of challenges with cloud. I’d like to take a break here. When we come back, I want to talk about some great things that happened in this project.
Tom Cooper 13:00
There is no greater business investment than investing in your team. You’ve got an exceptional team, but you’re still struggling to unlock consistent, outstanding performance. You know it shouldn’t be this hard to deliver value to your business. And we do too. Even the best teams struggle to build a culture of collaboration and high performance without clear communication and the right planning frameworks. At Bright Hill Group, we deliver leadership coaching that empowers you and your software team toward peak performance. Our personalized coaching solutions help you build trust and increase efficiency across your development efforts. Head on over to BrightHillGroup.com to schedule a call today, so you and your team can break free from this rut of untapped potential and start delivering the high business value you know you’re capable of.
Tom Cooper 13:54
Welcome back. Now we’re gonna talk about some good things that happened on this project. One of the biggest benefits that Hank and Jamie had was their DevOps crew. Those guys did an amazing job of creating automation and setting standards that made it possible to see and know and track virtual assets and to get them consistently deployed. Having team members who can build solid infrastructure, that’s critical. And having a DevOps philosophy where there are standards, and where the system supports developers by helping them do the right thing, that’s helpful in reducing overall complexity and increasing consistency between environments. That saves tons of debug time. And if you can automate those things, wow, that’s good stuff. It’s challenging, because none of that stuff is customer facing. As I like to say, no one cares about the infrastructure until the toilets don’t flush. But fixing the drains after the house is built is a lot harder than designing it well to start with. This DevOps team had done an excellent job of creating automation and standards for how things were configured and deployed. They had created an elegant solution, and when their process pushed back on developers, it did slow them down a little bit. But it helped the developers focus on the most important stuff and not get distracted by the infrastructure. We often think about automation as the last part of the process, after we have the application working. But by then we might have made decisions in the design that make automation really hard, and changing that at the late stages of the project, that’s challenging. In this case, the DevOps team thought of all kinds of things that helped organize their large virtual setup. The one thing that nobody thought about was a business process that would put limits on what could get deployed. After all, if you need something, why wouldn’t you let the developers create it and start using it? And that works just fine.
Until the AWS bill shows up. In the previous month, Jamie’s team had blown through the annual budget for cloud spend. Every minute of delay led to additional overspending. They all of a sudden found themselves in a crisis that needed immediate attention. That day, all meetings were canceled in favor of an immediate all-team huddle: “We’ve got to shut everything down right now.” And the whole group began logging into the console to kill things. It was a virtual server bloodbath. “We’re in crisis. No sacred cows. Kill everything.” All day long, the team feverishly worked to kill dev and test systems. Things were shut down like crazy. Anything that was obviously not production had to go right now. I want to point out they had client-facing production infrastructure, and they had paying customers who were using services. But even in this situation, anything that was production was subject to review. Now, as happens from time to time, some services had been over-provisioned. There were some cases where autoscalers had ramped up to many instances which were essentially idle and chewing up budget for nothing. How do you move with both urgency and care? Well, let’s just say in this situation, mistakes were made. There were some self-inflicted wounds. I do remember there was one primary production database server for a cash cow for the company. That server had been provisioned on the largest possible server that Amazon offered. It had the most memory that Amazon offered and the most CPU. If there was ever a case of “throw hardware at the performance issue,” this database service was the poster child. And man, was that server expensive. The box was crazy costly. Now that a bunch of other services had been shut down, that one stuck out like a sore thumb. And even after shutting down all those services, hundreds of them, they were still over budget, so everything was subject to review. So we began to ask, “Are you sure you need that much capacity for that server?”
The answer was, “Well, no, but we’ve had performance problems before, and when we added capacity, that helped.” “All right, let’s get to work on analyzing it. Let’s get some profilers and tools and evaluate to identify bottlenecks.” “We don’t have time for that. Let’s just reprovision on less hardware right now.” So they did. They shut down the database, they cut the hardware in half, and they started the database back up again. And they waited. What was going to happen? Nothing. Nothing at all happened. The application was just fine on half as much hardware as had been provisioned. But of course, they were still way over their annual budget, and even at half the size, it stuck out from a cost point of view. So they decided to cut it in half again. And you know what happened? Nothing. That app was perfectly fine on a database that had 25% of its original capacity. This is one example of how not having that discipline of capacity planning leads to cost overruns and difficulties. It’s important to have that as part of your system. One of the things that we encouraged the team to do as we went through this experience with them was to create a learning model that made time for retrospectives. I’ve been involved in Agile software development since, I don’t remember, probably 2003, and I’ve worked with all sorts of models for development within that ideology. One thing I’ve learned is that because Agile is a philosophy more than it is a set of practices, it’s common for teams to use the same Agile terms but mean entirely different things. What we know is that high-performing Agile teams have high psychological safety. All team members feel empowered to speak up and to admit when they failed, without fear of retribution for revealing their imperfection. A good stand-up or a good retro is very similar to the Lean approach of going to the gemba, going to the place where work is done to see the work being done.
Because the ones who are doing the work know exactly what’s going on. And no one else knows. No one else knows. Healthy leadership teams know they need to listen and learn from the ones who have the information, and that’s the people doing the work. That’s going to the gemba. And the entire team needs to hear about impediments and failed experiments to help make the entire team better. Unfortunately for Jamie and Hank, when they said “retrospective,” they didn’t mean anything like that at all. Retros with Hank and Jamie tended to be the leaders summarizing the successes during the previous work period and glossing over the failures, while looking for ways to point blame at somebody outside their department. I’m sad to report this is a pretty common practice. In my experience, teams commonly are not created and maintained in a way that team members feel safe to share, or where leaders listen and learn from their people. Back in the day, I worked for a boss who believed that failure was bad and that failure needed to be avoided. When we made mistakes that could not be hidden from other departments, she demanded we write long post-mortem reports describing our mistakes. But because we never used those reports to help the organization learn and change, we concluded that her purpose was simply to punish us for failing. Rather than using the lessons learned to help make us better at our jobs, she pushed us to take fewer and fewer risks. Productivity slowed to a so-called safe crawl, filled with sign-offs, double checks, and approvals before any changes could be made. I just don’t think that was delivering value to the organization. When it comes to retrospectives, one of our goals is to learn. We want to learn as a team, because together we’re learning to create great software for the organization. President Lyndon Johnson said, “You ain’t learning nothing when you’re doing all the talking.” All too often, leaders are talking when they should be listening. It’s simply not helpful.
Instead of embracing humility and recognizing that what we do not know is far bigger than what we know, the focus is on what we did right and how, if things went wrong, there was someone other than me who was taking the blame. These behaviors led to a culture of throwing other teams or other departments under the bus. As long as there was a way to point attention to the errors of others, each leader was secure in their role. Also, there wasn’t an openness to discussing things that needed attention, or even to exposing problems so they could be discussed. So when we suggested they should make some time as a team to reflect on how we had managed to deploy enough services to be at 10 times our planned capacity, that was deemed wasteful, because there were more urgent problems to be solved. I saw a clip this week of Elon Musk talking about challenges with engineering, and he said one of the most common engineering errors is to optimize a thing that should not exist at all. Interestingly, I can recall hearing similar things from Peter Drucker and even computer science luminary Donald Knuth. Knuth called that problem one of premature optimization. He said if we believe we have defined the problem too early, our solution, no matter how clever it is, will simply not solve the problem at all. Knuth argued, and I think Musk does too, that we need to keep digging until we really understand the root cause. Unless we got to the root cause of how we managed to deploy 10 times the expected services, it was likely to be repeated in the future. But the crisis was now over, the spend was back within parameters, and the team could focus elsewhere. And they did. They focused elsewhere for about three months, until once again the AWS spend was far greater than the budget. And we repeated the entire fire drill all over again.
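The missing piece in this part of the story was a routine check that compares cloud spend against the budget before the bill forces a crisis. A minimal Python sketch of such a check might look like the following. It assumes a simple straight-line projection from the average daily burn rate, and every name and number in it is hypothetical; it does not call any cloud provider’s billing API and it is not what the team actually used.

```python
from dataclasses import dataclass


@dataclass
class SpendCheck:
    projected_annual_spend: float  # year-end spend if the current burn rate holds
    budget_ratio: float            # projected spend divided by the annual budget
    over_budget: bool


def check_cloud_spend(annual_budget: float, spend_to_date: float,
                      days_elapsed: int, days_in_year: int = 365) -> SpendCheck:
    """Project year-end spend from the average daily burn rate so far."""
    if annual_budget <= 0 or days_elapsed <= 0:
        raise ValueError("annual_budget and days_elapsed must be positive")
    daily_burn = spend_to_date / days_elapsed
    projected = daily_burn * days_in_year
    ratio = projected / annual_budget
    return SpendCheck(projected, ratio, ratio > 1.0)


if __name__ == "__main__":
    # Made-up figures: a 120,000 annual budget with 30,000 spent in the first 30 days.
    check = check_cloud_spend(annual_budget=120_000, spend_to_date=30_000, days_elapsed=30)
    print(f"Projected annual spend: {check.projected_annual_spend:,.0f}")
    print(f"Projected spend is {check.budget_ratio:.1f}x the budget")
    if check.over_budget:
        print("Review what is deployed before the invoice arrives.")
```

Run on a schedule, a check like this turns a once-a-quarter fire drill into a weekly conversation. A straight-line projection is crude, but it is enough to make a tenfold overrun visible within days rather than months.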
Tom Cooper 23:32
There was another issue we hadn’t addressed. I mentioned before that there was a disconnect between SaaS and custom enterprise software. There was another factor we hadn’t considered. One of the big pushes was to compete with low-cost providers by giving an all-you-can-eat low price point. I’ve heard it said that in the race to the bottom, the big problem is somebody wins. By definition, the person selling at the lowest price is very likely to have the lowest profit margin. The problem with a thin profit margin is that there’s a low margin for error in your business, too. In this case, I’m thinking specifically of the fixed-fee SaaS customers. In order for the app to be sticky, the company needed customers to love it and to use it a lot. And here was the big disconnect. When users use cloud services more, the cost of hosting that service goes up. If you advertise an all-you-can-eat pricing model, the more customers use your services, the higher your costs and the lower your profits. The natural incentives are flip-flopped. The best situation to be in is when your incentives match the customer’s incentives. Services like Stripe or PayPal will generally charge a small fee plus a percentage of the sale when transactions are at a smaller dollar volume. When a retailer is processing a bunch of small-dollar purchases, unless the payment provider is getting a base payment in addition to the percentage of the sale, they’re going to lose money on every transaction. The model that Stripe and PayPal offer, while I’ll confess I don’t love it, gives their customer an incentive to have higher-dollar transactions in order to get better rates. There’s an alignment of interest between the customer and the service provider. For this product, it was the inverse: the more the customer used the service, the less money the software company would make. In fact, as a practical matter, the beta sites were actually money losers.
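The incentive problem described above is easy to see with a little arithmetic. Here is a rough Python sketch comparing a flat all-you-can-eat fee with a per-transaction fee of the “small fee plus a percentage” kind. All of the prices, costs, and function names are made up for illustration; the episode does not give the product’s actual price or hosting cost, and the fee values are placeholders rather than any payment provider’s real rates.

```python
def flat_fee_margin(flat_fee: float, transactions: int,
                    cost_per_transaction: float) -> float:
    """Monthly margin when the customer pays one fixed fee regardless of usage."""
    return flat_fee - transactions * cost_per_transaction


def usage_fee_margin(base_fee: float, percent_fee: float, average_sale: float,
                     transactions: int, cost_per_transaction: float) -> float:
    """Monthly margin when each transaction pays a base fee plus a percentage of the sale."""
    revenue = transactions * (base_fee + percent_fee * average_sale)
    return revenue - transactions * cost_per_transaction


def flat_fee_break_even(flat_fee: float, cost_per_transaction: float) -> float:
    """Usage level at which a flat-fee customer stops being profitable."""
    return flat_fee / cost_per_transaction


if __name__ == "__main__":
    # Placeholder values: 50 per month flat, 0.02 hosting cost per transaction.
    for volume in (1_000, 2_500, 5_000):
        flat = flat_fee_margin(50.0, volume, 0.02)
        usage = usage_fee_margin(0.30, 0.03, 10.0, volume, 0.02)
        print(f"{volume:>5} transactions: flat-fee margin {flat:8.2f}, "
              f"usage-fee margin {usage:8.2f}")
```

With a flat fee, margin falls as usage rises and turns negative past the break-even volume, so the vendor’s best customers are the ones who barely use the product. With a usage-based fee, margin grows with volume, so the vendor and the customer want the same thing.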
Ever hear the expression, “We lose money on every transaction, but we make it up in volume”? That’s exactly what was happening here. This was a strategic disconnect, and it was caused by product management not thinking this through before coming up with their business plan. We talked about how Hank insisted that just about every setting should be configurable. We also started to look at the workflow to see if there was an opportunity to reduce cost per transaction. We sat down with some of the engineers and asked, “Tell me exactly how this process works.” And we began drawing flow diagrams showing all the round trips between the endpoints and the cloud servers. As we began to share publicly what we were learning, with each review cycle we would hear, “Well, that’s close, but you’re leaving something else out.” We went through a bunch of iterations of these diagrams before we had the opportunity to connect with Jamie. We told him that we had discovered that for every single transaction, there were more than 25 round trips between the endpoints and the cloud. And he said, “That’s not right. That’s not how it works at all.” But here’s the deal. Jamie’s a super smart guy, and he had tried to touch all the parts of the system. I bet the last time he had looked at it, he was exactly right. But this system had grown beyond the point where any one person could really understand the whole thing. And there was no way, in spite of his 80-plus-hour work weeks, that Jamie could keep his finger on the pulse of every part of the system. He thought he knew how it worked, and he believed it sincerely. Yet he was sincerely wrong. We showed him what we had learned, and then we kept digging. The last I recall, we had catalogued 47 round trips of network data between the client and the cloud. Every round trip affected performance, and it impacted costs.
That meant it was virtually impossible to make any money at all running this app with a low price point and an all-you-can-eat service plan. I want to emphasize that this team was made up of very smart people. They had immense talent, and yet they missed some critical items that led to strategic failures. From a functional perspective, the application was pretty cool. It cleverly solved real-world problems that customers were dealing with every single day. In a future episode, we can explore some other aspects of how engineering was managed that complicated things, but for today, let’s recap some of what went wrong here. From today’s discussion, I think we can say there were four main areas that led to the challenges in today’s episode. First, success was a lousy teacher. It convinced smart people like Hank and Jamie that they couldn’t lose. Number two, leaders thought the path to success at great scale required them to work harder and longer, repeating the lessons that had led them to success at a smaller scale. But Einstein said the problems of today can only be solved at a higher level of thinking than that which created them. In order for Hank and Jamie to tackle the new problems they were facing, they had to think at a higher level. They needed to continue to grow their skills. But they had convinced themselves that was not necessary, because they’d already learned the critical lessons needed to get them to future success. Harder, faster, and more work was needed. At least, that’s what they thought. As a result, there was a focus on what was urgent, not what was important. With the endless work hours, seemingly limitless travel, and continual fire drills, leaders got too busy to understand what was happening at the edges. It was easy to get sucked into the details and not lead at a strategic level. And that was holding engineering back. Number three, when it came to delivering a product, there was a lack of clarity.
Are we building a Sam’s Club or a Nordstrom? How would we measure success? What happened internally when leaders had conflicting definitions of success? And how would we resolve differences? And number four, DevOps had the philosophy and the promise of scalability and greatness, but the thinking and leadership that were present in that team were not reflected elsewhere. Areas like cost management, capacity planning, and security were often overlooked. And there was little to no desire to create repeatable and scalable business processes. I have to tell you, this project was a tough slog. It was hard to be positive and optimistic. There were a lot of smart people who worked a lot of long hours trying to create something great. Unfortunately, the vision of the all-you-can-eat low-price offering, well, it didn’t make it to market. The enterprise side did, because there was more margin in those clients. But a lot of the effort they put into trying to launch that low-price thing just didn’t work out at all. It’s a shame. I liked those guys, even if we often disagreed.
Tom Cooper 29:46
Thanks for joining us in this latest episode in the Software Horror Stories podcast series. For more information about Bright Hill Group, please check out www.BrightHillGroup.com. That’s www.BrightHillGroup.com. For Bright Hill Group, I’m Tom Cooper, and we’ll see you next time.