Diagnosing Eero issues with Big Sur

Update: So it turned out that it wasn’t an issue with Big Sur or Eero after all. It was a mixture of not fully diagnosing what was going on *before* upgrading, and software that was still running in the background. In this case, it looks like Fing triggers some sort of packet storm on the network; Fing periodically does a network scan as part of its device monitoring. Although I never got to diagnosing the actual root cause, the proximal clue was that “closing the laptop to change where I’m debugging from” fixed the problem. Once Fing was removed, the problems were solved.

Tl;dr – It appears that macOS Big Sur and a network of multiple Eeros have a compatibility problem.

With WFH there has been a need for good networking.  With a 700 Mbps cable connection and a set of Eero Pros, you’d expect the performance of the network to be good.  However, there would be periods of poor performance with no clear or identifiable cause.  This needed a deeper dive into the underlying root cause.  This is that journey.

After finally giving in to upgrading my 50/5 Mbps AT&T ADSL connection – with a family needing to stream for 3 schools and 2 working adults – we moved to 700/20 Mbps XFinity.

The behavior was that a system that bandwidth-tested reasonably well would occasionally become completely unusable for video conferencing for the entire family.  Initial analysis indicated poor ping times that would take out the local network, so that was at least a reasonable symptom that I could start digging into.  Other symptoms were that network-based IoT devices, such as wifi-controlled lights, would not always be 100% reliable.

Over the Christmas break I started seriously looking into the networking issues that I had been seeing.  I settled on fping/influxdb/grafana after exploring a few different local network monitoring systems and options.  Almost all of the other systems had some challenge that made them unsuitable for my Raspberry Pi-oriented home automation setup.

I live in an Eichler.  For those who aren’t familiar, Joseph Eichler was a mid-century developer with a particular style of home that is popular in the Silicon Valley area.  The typical Eichler is slab floor, lots of windows and a flat roof.  For a modern techie that presents some interesting challenges, since you can’t run cables anywhere without them being visible.  With a need for 3 different entertainment hubs and a drop for the internet, I had previously tried powerline networking and wired connections; both caused issues with either reliability (powerline is really not that good with old wiring) or aesthetics (you can’t ever really hide network cables).   So in the end I opted for Eeros and mesh networking.

So the network topology is as follows… The kitchen is where the modem is, and the kitchen is central to the house, which is good for my network needs.  I have a Garage (security cameras + file storage), a Family Room (TV, PS5, etc), and a Sunroom (PS4, printers, etc).

To set up monitoring I forked https://github.com/mcwelan/fping, which provided a starting point for running ping scans on the network and pushing that data into influxdb.

The data stored in influxdb is a simple collection of “ip address, ping loss, max, min, avg ping times”.
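
For reference, a minimal sketch of that collection step (not the fork itself) might look something like the following in Node.js. The database name “network”, the measurement name “ping”, and the use of the InfluxDB 1.x /write endpoint are assumptions for illustration:

const http = require('http');

// fping -q -c 10 <hosts> prints summary lines such as:
// "192.168.4.21 : xmt/rcv/%loss = 10/9/10%, min/avg/max = 1.21/3.40/9.87"
function parseFpingLine(line) {
  const m = line.match(/^(\S+)\s*:\s*xmt\/rcv\/%loss = \d+\/\d+\/([\d.]+)%(?:, min\/avg\/max = ([\d.]+)\/([\d.]+)\/([\d.]+))?/);
  if (!m) return null;
  return {
    host: m[1],
    loss: parseFloat(m[2]),
    min: m[3] ? parseFloat(m[3]) : null,
    avg: m[4] ? parseFloat(m[4]) : null,
    max: m[5] ? parseFloat(m[5]) : null,
  };
}

// InfluxDB 1.x line protocol: measurement,tag=value field=value[,field=value...]
function toLineProtocol(p) {
  const fields = [`loss=${p.loss}`];
  if (p.max !== null) fields.push(`min=${p.min}`, `avg=${p.avg}`, `max=${p.max}`);
  return `ping,host=${p.host} ${fields.join(',')}`;
}

// POST the batch of points to the local InfluxDB instance.
function writeToInflux(lines) {
  const req = http.request(
    { host: 'localhost', port: 8086, path: '/write?db=network', method: 'POST' },
    (res) => res.resume()
  );
  req.on('error', (e) => console.error('influx write failed', e));
  req.end(lines.join('\n'));
}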

The first chart that I generated was of raw ping times.  The ping graph gave an indication that the performance of the network was average and that there were likely problems, but gave no real indication of any underlying cause.  Anything over 50 ms on a local network is a concern, so it at least confirmed that my network was in a poor state.

Another interesting graph made the issues with the network clear.  In this graph I’m plotting each device (a different color per device) with the max ping whenever there was a report of at least one dropped packet (loss > 0).  This filters out devices operating well, as well as devices not on the network.  Fping at that stage was configured with a max wait of 1000 ms, which explains the ceiling in the graph.

There are some really interesting behaviors in that chart.  Some of the devices on the network switch between two or three distinct ping times when the network isn’t performing well.

So at least I had a way to identify if the network was operating well.

After trying a few different options (disconnecting Eeros, turning off devices, even creating a split network, etc.) I found I couldn’t reliably trigger the problem or get a clear indication of why the network wasn’t performing properly.  I surmised that at least the gateway network device should be as high end as possible, so before Christmas I replaced the gateway Eero Pro with a new Eero 6 Pro.

This did provide a slightly more consistent network experience, but I couldn’t see in the data the point at which I connected the Eero 6 Pro to the network.  Ultimately, it was clear that this macro-level network analysis wasn’t going to get me anywhere.

In the end I realized that I’d have to go to either Eero support or reddit’s r/eero to see if I could work things out.  To do so meant that I’d need to really start constructing a methodical view of the network.

Now, as a sanity check, I wanted to make sure that my internet connection and upstream were good and not contributing to the network problems.  I did this by pinging some public sites (Cloudflare’s 1.1.1.1, Google DNS at 8.8.8.8), an upstream XFinity router, my modem’s management interface and the Eero’s public IP address.  These were almost exactly what I’d expect – the Eero public interface is sub-millisecond ping, the modem around 3 milliseconds.  (Note that there is a small incremental increase in the modem’s ping time, which is a different issue, and which looks like it also tracks the overall network performance.  This is clear from the step down on the modem side after the modem reboot on the 23rd of December.  That will be another investigation for another day.)  Overall, the path from the Kitchen eero out to the internet seems rock solid and consistent, with a diurnal pattern on the XFinity upstream link following US business hours.  So at least I don’t need to consider anything outside of the network.

So I decided to start by looking at what’s going on at a finer level.  First up, it was time to look at how the network fabric was operating.  To do this, I wanted to look at *only* the eero network addresses to see if the fabric was operating well.   Since I was ping testing each node on the network, I already had most of the data that I needed.  By selecting and naming all the Eeros, I could get a graph that showed the performance of the fabric.  Bingo! This graph gave me a lot of information that I hadn’t been expecting.    I can see when I took the network down (mid afternoon on the 23rd), and when I put the network back together.  The network was possibly a little bit worse after the update, but there was a good signal there.

Now, when the network is good, it is very, very good.  Zooming into a window of performance degradation gives me a solid indication of what the performance is like.  We can see the Kitchen eero (basically hard wired) gives sub-millisecond ping, and the other Eeros are all typically < 10 ms ping time.  Overall, I’d be happy if the network performed like that in general.  However, the system regresses to 20-40 ms ping times on occasion.

This pattern allowed me to up my vigilance on what was happening.  Unfortunately, I couldn’t discern any particular trigger – although I had some suspicions about the timing and recovery, which focused on near-normal waking hours, none of the graphs gave me any real clue.

So, now that I had a hook into the fabric, let’s go one level up.  Part of the reason I have Eeros scattered around the house is to allow me to have mini wired hubs in different places.  So: a new ping filter for the devices wired to the different Eeros.  This graph gave me even more hope – the wired devices were just as consistent as the Eeros when the network was performing well.

An interesting observation was that the network was *really* reliable during Christmas day – it was performing almost flawlessly.  During Christmas you tend to stay off tech for a lot of the time as you spend it with family.  Of course I did some stuff just before bed, and got up late on the 26th.  Suddenly I had a hint that something we (possibly I) were doing on the network was causing the instabilities.

So with this information, I looked at the loss rate of the pings.  Since I’m pinging the entire network fairly consistently, I have full network visibility, and I can see when devices come on and off the network.  The last 7 days look like the other full-network graphs, although in this graph I can see a substantial increase in lost packets during periods of network instability.  Overall, the network has lots of devices coming on and off throughout the day.  (I believe the loss values > 100 are likely summing errors from aggregating the periods down to a single day.)  However, this graph on its own doesn’t give much information.

So now I had enough information to start triggering alarms.  I chose to create a private Discord server and send a notification to it.  This at least gave me a notification 5-10 minutes after the network had started performing poorly.  After a few observations, I managed to catch this behavior.  You can see that around 9:30 the network started to perform poorly.  At the same time the red line came down from the top, which indicates that a device came onto the network.  That device was also removed from the network just before the network recovered.
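
For completeness, a minimal sketch of the alerting step, assuming a Discord webhook URL in an environment variable and the parsed fping results from the sketch above; the 20 ms threshold and the Eero addresses are placeholders, not the actual values used here:

const https = require('https');

const WEBHOOK_URL = process.env.DISCORD_WEBHOOK_URL; // standard Discord webhook URL
const EERO_HOSTS = ['192.168.4.1', '192.168.4.2', '192.168.4.3', '192.168.4.4']; // placeholder addresses

// Post a simple message to the private Discord channel via its webhook.
function notifyDiscord(message) {
  const body = JSON.stringify({ content: message });
  const req = https.request(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(body) },
  }, (res) => res.resume());
  req.on('error', (e) => console.error('discord notify failed', e));
  req.end(body);
}

// Called after each fping sweep with the parsed results ({ host, loss, min, avg, max }).
function checkNetworkHealth(results) {
  const eeros = results.filter((r) => EERO_HOSTS.includes(r.host));
  const degraded = eeros.filter((r) => r.max !== null && r.max > 20); // 20 ms threshold is an assumption
  if (degraded.length >= 2) {
    notifyDiscord(`Network degraded: ${degraded.map((r) => `${r.host} max=${r.max}ms`).join(', ')}`);
  }
}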

Isolating that particular host provided an even stronger correlation across a number of days, although the triggers weren’t close enough, on their own, to tie the presence of the host to the poor network behavior.  When the host was not connected to power it would periodically wake up and do network activity (the dips in the lower graph away from 100), and it would cause network instabilities that matched perfectly with those periods of wakefulness.  In particular, the host came on at 9:30 and went off at 10:20, and when it went off the network issues disappeared.

That device is a MacBook Pro running macOS Big Sur.

Armed with that correlation, I started looking more deeply into the relationship.  You can see that I’ve been able to recover fairly quickly when there is a network issue.  Initially I would shut down the entire Mac; now I can just switch off the network on the MacBook to get the network to recover.

Eeros have a primary and a guest network.  It appears that connecting the MacBook to the guest network allows the network to operate correctly.

Sometimes the MacBook will cause network performance issues immediately; other times it will strike randomly.   But each time, I can take the Big Sur machine offline and the network will recover – almost immediately.

So at least now I have the ability to get my network into a good state again.  Unfortunately it means taking my primary machine offline to do it.   I think the above narrative provides a lot of useful information to allow Eero to begin to dive more deeply.

What I’ll be doing next is sitting the MacBook on a secondary, isolated network connected to the Kitchen eero.  It’s less than ideal, but it should give me the ability to continue to debug and analyze while at the same time providing increased stability for the network.

Finally, there are always other interesting things that you’ll discover on your network when you begin to look closely.  I can see there is a small performance degradation over time with the cable modem, and a factory reset/reconfigure of the Roomba 960 improved its ping behavior.  Once I’ve got the fabric of the network improved, I’ll be going through each unusual behavior that I see and hunting down other network quirks.


Differences between Rate Limiting and Throttling

When dealing with cloud APIs, there will typically be the concept of throttling when API consumption moves beyond a specific limit. The particular model of throttling (leaky bucket, burstable, etc.) is outside the scope of this post. For the purposes of this post, we will assume a simple, numeric rate limit (i.e. x API calls per second).

The key concept is that the API is expensive in either time or compute, and so there is a need to restrict the rate at which calls are made. Most developer code assumes that all APIs are free: tight loops that run as fast as possible are common. Cloud APIs that distribute compute are not free and need special handling.

Rate limiting is a client-side response to the maximum capacity of a channel. If a channel has the capacity to consume requests at a given rate per second, then a client should be prepared to limit its request rate to match. A common argument against implementing rate limiting on the client is that the server should simply allow processing at the appropriate rate. However, in the case where the API triggers asynchronous operations, the response to the API call may be fast while the operation in the background is much slower.

Throttling is a server-side response where feedback is provided to the caller indicating that there are too many requests coming in from that client, or that the server is overloaded and needs clients to slow down their rate of requests. When a throttling event happens, the general client-side response is to use exponential backoff to ensure that the system can recover even with multiple clients making requests at the same time.

An argument is often made that exponential backoff is a type of rate limiting. Yes, it is – but exponential backoff is a partial solution with a number of problems.

AWS uses throttling heavily across its APIs and can serve as a good background for this discussion. Some AWS APIs associated with EMR have very low default request limits (2-4 requests/second), with typical response times of around 100 ms. As you can see, the sustained rate at which a client can call the API exceeds the request limit. This represents an interesting (but easy) challenge for systems that need to call an API repeatedly. So for this example, I’m going to focus on the AWS Elastic Map Reduce DescribeCluster API. A common system management operation is to gather data for each active cluster. In this example, assume we have 1000 clusters, and that we can’t hook into the EMR events and take a transactional on-change approach.

Assume our client can make requests at a maximum rate of 10 requests/second against an API limit of 4 requests/second. Calling the API 1000 times would take 100 seconds at the client’s natural rate, but at the API limit the scan would take 250 seconds to complete. This of course assumes that our client is the only caller of that API; in a lot of cases AWS itself is internally making requests, you may have a secondary orchestrator making requests, and you may have an infosec scanner making requests. So in reality our client may only be able to get 1 or 2 requests/second.

So let’s look at the naive approach: we keep hammering the API at our maximum rate until we’re done. This places us into heavy contention with other clients and basically makes life miserable for all of them.

Following AWS guidance we can implement exponential backoff. The general idea is that every time we receive a throttling exception, we back off in an exponential way, typically 2^retry × base wait unit. Most approaches will have a max retry count before failing, or a max retry count before capping out at some delay. Some AWS SDKs also have a transparent retry that is hidden from the API caller, meaning that by the time you receive a throttling exception the SDK implementation has already backed off at least 3 times. Now, if we look at our case above, we can see that we will hit a throttling exception almost immediately, within the first second. Assuming we take exponential backoff with a max wait, we will likely get 4 requests in and then wait for maybe 10 seconds (assuming a max retry count of 10 and a base wait of 10 ms: 2^10 × 10 ms = 10,240 ms, roughly 10 seconds), and then we try again. So effectively we’re getting 0.4 requests per second, and our full scan will take 2500 seconds. This is also ignoring other clients: our bad behavior will be triggering throttling events for them as well, likely diminishing service for all of those clients too.
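
As a rough illustration, here is a minimal sketch of that backoff loop; callDescribeCluster() is a hypothetical wrapper for the DescribeCluster call, and an error named 'ThrottlingException' stands in for the SDK’s throttling error:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const BASE_WAIT_MS = 10;
const MAX_RETRIES = 10;

async function describeWithBackoff(clusterId) {
  for (let retry = 0; ; retry++) {
    try {
      return await callDescribeCluster(clusterId); // hypothetical API wrapper
    } catch (err) {
      if (err.name !== 'ThrottlingException' || retry >= MAX_RETRIES) throw err;
      // Back off 2^retry * base wait: at retry 10 this is 2^10 * 10 ms ≈ 10 seconds.
      await sleep(Math.pow(2, retry) * BASE_WAIT_MS);
    }
  }
}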

So currently we’ve got a poorly acting client that is ruining it for everyone. So what do we do?

We can rate limit the client. This involves either having prior knowledge of the limit or adapting our rate based on feedback from the server. A simple form of rate limiting is to use a RateLimiter implementation that blocks code until it is permitted to continue. In our example, if we had a rate limit of 4 permits per second, then we can guarantee that we will stay below the API limit. In a quiet environment we could work through our 1000 clusters without hitting any throttling exceptions.
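
A minimal sketch of such a blocking rate limiter (in the spirit of Guava’s RateLimiter, not a production implementation) might look like this, reusing the hypothetical callDescribeCluster() from above:

class RateLimiter {
  constructor(permitsPerSecond) {
    this.intervalMs = 1000 / permitsPerSecond;
    this.nextFreeTime = Date.now();
  }

  // Resolves when the next permit is available at the configured rate.
  async acquire() {
    const now = Date.now();
    const waitMs = Math.max(0, this.nextFreeTime - now);
    this.nextFreeTime = Math.max(now, this.nextFreeTime) + this.intervalMs;
    if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
}

// Stay at or below the 4 requests/second API limit while scanning the 1000 clusters.
async function scanClusters(clusterIds) {
  const limiter = new RateLimiter(4);
  for (const id of clusterIds) {
    await limiter.acquire();
    await callDescribeCluster(id); // hypothetical API wrapper, as above
  }
}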

Of course, hard coding 4 requests/second is naive to both the actual limit and to any other clients, so we would likely want to make it adaptive. We’d start off with a slightly aggressive limit of, say, 10 requests/second. As we hit throttling exceptions we would adjust the RateLimiter by an appropriate amount (say halving), from 10 requests/second to 5 requests/second to 2.5 requests/second. I haven’t come across any strong guidance on adapting rate limits, but my intuition says negative exponential is probably too aggressive and linear is not aggressive enough. By using rate limiting that is also sensitive to sustained requests from other clients, we get to a sustained, un-throttled request rate.

Planning for and respecting the API request limit is the best solution. APIs are not free to call.

So our exploration has demonstrated that sustained calls need something other than exponential backoff, and that we can get to a more predictable API rate by using rate limiting. However, we’re not there yet. There will be transient peaks that mean a period of load doesn’t represent the long term use of the API.

So to deal with transient spikes we will still need to tell our client to back off slightly when the API is under load. This does slow us down slightly, but we will definitely know that our rate limiting is preventing our client from being the problem, and that it is general external load on the API that is causing the issue.

So with a combination of exponential backoff to deal with transient spikes and rate limiting to treat the API limit with respect we have a much better solution.

But wait, there’s more.

Our past throttling doesn’t indicate that our future is throttled. So we need to ensure that our code is adaptive to both events and to other clients utilizing the API. Unfortunately, most APIs only give feedback on our own behavior, so we need to periodically and optimistically relax both our backoff and our rate limiting to ensure that we stay close to, but do not exceed, the limits.

For the exponential backoff, we need to reduce our retry count periodically so that we get back to zero retries after a period of success. In our example above, we could decide that after 10 un-throttled API requests we decrement our retry count, ultimately getting back to our 0-retry state. This allows us to deal with short term events (seconds of throttling) as well as sustained periods of load (minutes of throttling).

For our rate limiting, we need to periodically increase our rate to feel for what the total load on the API is and find the sustainable peak. Remember that in the case of AWS, any other client could be creating sustained load on the API, and so we cannot assume that a sustainable 1 request/sec in the past represents our future sustained request rate, particularly when other clients that were consuming some portion of the API capacity may no longer be there. I’d likely implement a negative exponential limit decrease (halving each time we are throttled), with an incremental linear return based on the number of successful requests.
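
A minimal sketch of that adaptive scheme, with illustrative parameter values (the minimum rate, step size and success window are all assumptions):

class AdaptiveRateLimiter {
  constructor(initialRate, { minRate = 0.5, maxRate = 10, step = 0.25, successWindow = 10 } = {}) {
    this.rate = initialRate;        // requests/second currently permitted
    this.minRate = minRate;
    this.maxRate = maxRate;
    this.step = step;               // linear increment on recovery
    this.successWindow = successWindow;
    this.successes = 0;
    this.nextFreeTime = Date.now();
  }

  // Blocks (resolves later) so that calls are spaced at the current rate.
  async acquire() {
    const now = Date.now();
    const waitMs = Math.max(0, this.nextFreeTime - now);
    this.nextFreeTime = Math.max(now, this.nextFreeTime) + 1000 / this.rate;
    if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
  }

  // Negative exponential reduction: halve the rate on every throttling event.
  onThrottled() {
    this.rate = Math.max(this.rate / 2, this.minRate);
    this.successes = 0;
  }

  // Linear return: creep the rate back up after a run of clean requests.
  onSuccess() {
    if (++this.successes >= this.successWindow) {
      this.rate = Math.min(this.rate + this.step, this.maxRate);
      this.successes = 0;
    }
  }
}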

So, by including adaptive rate limiting that backs off quickly and then feels for the right level, and exponential backoff that reduces the backoff after a period of no throttling, we end up with much better overall throughput and also end up being a good API citizen.

Unfortunately, I have never seen an implementation that handles both rate limiting and exponential backoff. Most of the implementations that I have seen do exponential backoff but then go back to hitting the API at the maximum rate once they’ve done their backoff. Less common is a naive rate limiting approach that is neither adaptive nor allows for backoff.

Thoughts, comments, seen a good rate limiting and backoff implementation? Add them below.

Career Stages of A Typical Software Engineer

This is a repost of a long-form response to a Quora question (What are the typical stages in the career of a software engineer?), which itself was derived from another answer – Matthew Tippett’s answer to What does this mean for a software engineer: “You are not at a senior level”?. Adapted and tweaked for the new question, and now tweaked for the blog.

Each company will have its own leveling guide, ask your HR department or search your intranet. It should go through a clear set of expectations for each grade and the different attributes that each grade should possess. Your manager may not even be aware of them – but they should provide a basis for you to understand the career progression.

Flat organizations will have only a few (Jr/Mid/Senior/Exec), other organizations will have many (Assoc/Eng/Snr Eng/Staff Eng/Snr Staff Eng/Princ Eng/Snr Princ Eng/Dist Eng/Fellow). Apply the following to your company, and don’t expect direct portability between companies.

Grade Levels

First, we’ll go over a hypothetical set of grades – generally well rounded against a lot of companies – some will have different titles, but will generally have a common set of attributes.

The career level is arbitrary, but reflects where you’d expect middle-of-the-curve people to be operating. Individuals will peak at a particular point and will then progress more slowly. Realistically, most good people will peak at what I am calling a Staff Engineer. Some will get frustrated with the leadership aspect of the senior grades and peak at Senior Engineer. The management ladder equivalence is also arbitrary, but should serve as a guide.

  • Junior/Associate Engineer/New College Grad – Assumed to know nothing; can code, has minimal understanding of how businesses work and what a professional life entails. Hand held or teamed with a more senior engineer to help get an understanding. Career level 0–2 years.
  • Engineer – Assumed to be able to work through tasks with minimal supervision. Will come back and ask for more work. Not expected to identify and fix secondary problems. Not expected to drive generalized improvements or be strong advocates for best practices or improvements. Quite simply a “Doer”. Scope is typically at a sub-component level. Career Level 2–5 years.
  • Senior Engineer – Beginning to be self directed. Expected to be able to work through small projects and foresee issues that may come up. Likely expected to mentor or lead sub-teams or development effort. Scope is typically at a component or subsystem level. Career Level 5–10 years – equivalent to a team lead.
  • Staff Engineer/Architect – Runs the technical side of projects, leader and mentor for a team. Holder of a high bar for best practices, quality and engineering workmanship. Scope is across a system, or across multiple subsystems. Career Level 10–20 years – equivalent to a manager.
  • Fellow/Distinguished Engineer – Runs the technical side of an organization. Interlopes on many projects, defines the strategic direction for the technology. Career Level 15–30 years – equivalent to a director or VP.

It’s not about the code

Hopefully it becomes clear from the descriptions that, pretty much from Senior Engineer and up, the technical role includes an increasing amount of leadership. This is distinct from management. The leadership traits are about having your peers trust and understand your direction, and being able to convince peers, managers and other teams about your general direction – being able to deliver on the “soft” skills needed to deliver code.

Amazon’s Leadership Principles actually give a pretty good indication of some of the leadership needs for engineers.

There is a tendency for organizations to promote based on seniority or time in role, or even worse, based on salary bands.

Applying this to Yourself

  1. Ground yourself in what your level means to you, the organization and your team. There may be three different answers.
  2. Introspect and ask yourself whether you are demonstrating the non-management leadership aspects of a team leader or junior manager. Do you show confidence? Do you help lead and define? Do you demonstrate an interest in bringing in best practices? Do you see problems before they occur and take steps to manage them?
  3. Consider where you are in your career.

Your Career is a Marathon

A final thought: although you indicate a few years in the industry, I’ve seen engineers gunning for “Senior Engineer” 3 years out of college and Staff Engineer 3 more years after that. My big advice to them is: what the hell are you going to do when you get 6 years into a 40 or 50 year career and realize that you’ve peaked, or that you have some serious slow grinding ahead for the next 20 years? I’m concerned about good engineers who become fixated on the sprint to the next title and not the marathon of their career.

ng-Whatever

We’ve all done it: sat around a table dissing the previous generation of our product.  The previous set of engineers had no idea, made some stupid fundamental mistakes that we obviously wouldn’t have made.  They suck, we’re awesome.  You know what – in 3 or 5 years’ time, the next generation of stewards of the system you are creating or replacing now will be saying the same thing, of you and the awesome system that you are slaving over now.

So what changes?  Is the previous generation always wrong?  Are they always buffoons who had no idea how to write software?  Unlikely.  They were just like you, at a different time, with a different set of contexts and a different set of immediate requirements and priorities.

Understanding Context

The context in which a system was created is the first critical ingredient for understanding it. Look to understand the priorities, the tradeoffs and the decisions that had to be made when the system was first created.  Were there constraints that you no longer have in place – were they restricted by infrastructure, memory, performance?  Were there other criteria driving success at that stage: was it ship the product, manage technical debt, or were there gaps in the organization that were being made up for?  What was the preferred type of system back then?

Understanding these items allows you to empathize with the system’s creators and understand some of the shortcuts they may have made.  Most engineers will attempt to do their best based on their understanding of the requirements, their competing priorities and their understanding of the best system that can be implemented in the time given.  Almost every one of these constraints forces some level of shortcut to be taken in delivering a system.

Seek first to understand the context before making the decision that the previous team made mistakes.  When you hear yourself making comments about a previous team, a peer team or other group not doing things in the way that you would like to see it, look for the possible reasons.  I’ve seen junior teams making rookie mistakes, teams focused on backend architectures making front-end mistakes, device teams making simple mistakes in back-end systems.   In each of these contexts, it is fairly obvious why the mistakes would be made.  Usually, it will be within your power to identify the shortcoming, determine a possible root cause by understanding the context and shore up the effort or the team to help smooth things over and result in a better outcome.

Constraining Your ng-Whatever

When faced with frustration with a previous system, consider carefully whether to do a full re-write into an ng-whatever system, or incremental changes with some fundamental breakpoints that evolve, refactor and replace parts of the system.

It is almost guaranteed that the moment a system gets an “ng-Whatever” moniker attached to it, it becomes a panacea for all things wrong with the old system.  It begins to accrete not only the glorious fixes for the old system, it also picks up a persona of its own.   This persona will appear as “When we get the ng-whatever done, we won’t have this problem…”.

These oversized expectations begin to add more and more implicit requirements to the system.  Very few of these expectations will actually be fulfilled, leaving a perception of a less valuable ng-Whatever.

Common Defect Density

I’m going to come out and say that most engineering teams, no matter how much of an “Illusory Superiority” bias they may have, are going to be at best incrementally better than the previous team.  With that said, their likelihood of having defects in their requirements, design or implementation will be more or less even (depending on how the software is being written this time around).

The impact will typically be that the business is trading a piece of potentially battle-hardened software with known intractable deficiencies for a new piece of software that will have bugs that will only be ironed out in the face of production.  Even worse, there will always be a set of intractable deficiencies that are now unknown – only to be discovered when the new software is in production.

When the original system was created, it is highly unlikely that the engineering team knowingly baked in a set of annoying deficiencies.  Likewise, the new system will, to the best of your team’s understanding, not have any deficiencies baked into it.  You need to make a conscious decision to take the risk that the new issues will be less painful than the old issues are.  If you can’t make that call, then sometimes refactoring and re-working parts of the system might be a better solution.

 

What have your experiences been with ng-Whatevers?  Have you found that your team can reliably replace an older system with a new one, and that in a few years’ time the new system is held in higher esteem than the original?  Follow this blog for more posts, or add comments below on this topic.

 

Code and the Written Word

Code history can read like a narrated story.  The ability of git rebase to reorder, rework and polish commits allows a developer (and code reviewers) to curate the code history so that it tells a well structured story.  This post will wander through how strongly the analogy can work.

TL;DR version in the slides.  Read on for the long form.

Continue reading “Code and the Written Word”

Estimating for Software is Like Estimating for Skinning a Cat

As I’ve mentioned a few times, estimation is an imprecise art.  There are ways to increase the accuracy of estimates, whether through consensus-based estimation or other methods.  This post explores why estimation is hard and why the software world struggles to find tools, techniques and methods that provide consistent and accurate estimates.

I’ve recently been playing with Codewars (connect with me there) and have been intrigued by the variance of the solutions that are provided.  In particular, you have the “smart” developers who come up with one-liners that need a PhD to decode, tight code, maintainable code, and then clearly hacked-till-it-works code.  This variance is likely the underlying reason for the difficulty in getting consistently accurate estimates.

Read on for some of the examples that I have pulled from the Convert Hex String to RGB kata.  The variance is quite astonishing.  When you dig deeper into the differences, you can begin to see the vast differences in approach.  I’m not going to dig into my personal views of the pros and cons of each one, but it did provide me a lightbulb moment as to why software in particular is always going to be difficult to estimate accurately.

I came up with an interesting analogy when pushed by a colleague on systematic ways of estimating.  His assertion was that other industries (building, in his example) have systematic ways of doing time and cost estimates for work.  While not diminishing the trades, there are considerably fewer degrees of freedom for, say, a framer putting up a house frame.  The fundamental degrees of freedom are:

  • experience – apprentice or master
  • tools – Nails or nail gun
  • quality of lumber – (not sure how to build this in)

For a master carpenter, framing will be very quick; for an apprentice, it will be slow.  The actual person doing the framing will likely be somewhere in between.   Likewise, using a nail and hammer will be slow, and a nail gun will be fast.  The combination of those two factors will be the prime determinant of how long a piece of framing will take to complete.

In code, however, we bring in other factors that need to be included in estimates but are typically not considered until the code is being written.  Looking at the examples below, we see the tools that are available:

  • Array utility operators (ie: slice)
  • string operators (ie: substring)
  • direct array index manipulation (array[index])
  • regular expression (.match)
  • lookup Tables (.indexOf)

Using each of these tools has a general impact on the speed at which the code can be written, the speed at which it can be debugged, and the general maintainability of the code.

With this simple example in mind, I feel that the current practice of “order of” estimation that is popular in agile is sufficient for the type of accuracy that we will get.  Fibonacci, t-shirt sizes, hours/days/weeks/months/quarters are really among the best classes of estimates that we can get.

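// Approach 1: string slicing with parseInt and an explicit radix of 16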
function hexStringToRGB(h) {
  return {
    r: parseInt(h.slice(1,3),16),
    g: parseInt(h.slice(3,5),16),
    b: parseInt(h.slice(5,7),16)
  };
}
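
// Approach 2: substring offsets via a helper, with hex parsed via a "0x" prefix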
function hexStringToRGB(hexString) {
  return {
    r : getColor(hexString),
    g : getColor(hexString, 'g'),
    b : getColor(hexString, 'b')
  };
}

function getColor(string, color) {
  var num = 1;
  if(color == "g"){ num += 2; }
  else if(color == "b"){ num += 4; }
  return parseInt("0x" + string.substring(num, num+2));
}
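
// Approach 3: direct array index manipulation on the string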
function hexStringToRGB(hs) {
  return {r: parseInt(hs[1]+hs[2],16), g : parseInt(hs[3]+hs[4],16), b : parseInt(hs[5]+hs[6],16)}
}
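
// Approach 4: regular expression match plus map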
function hexStringToRGB(s) {
  var foo=s.substring(1).match(/.{1,2}/g).map(function(val){return parseInt(val,16)})
  return {r:foo[0],g:foo[1],b:foo[2]}
}
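
// Approach 5: lookup table using indexOf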
function hexStringToRGB(hexString) {
  var hex = '0123456789ABCDEF'.split('');
   hexString = hexString.toUpperCase().split('');

  function hexToRGB(f,r){
    return hex.indexOf(f)*16 + hex.indexOf(r);
  }

  return {
    r: hexToRGB(hexString[1],hexString[2]),
    g: hexToRGB(hexString[3],hexString[4]),
    b: hexToRGB(hexString[5],hexString[6])
  }
}

About Skinning the Cats

“There are many ways to skin a cat” has its earliest print appearance in the “Money Diggers” article in the 1840s Gentleman’s Magazine, and Monthly American Review.  Even that reference implies it is a phrase already in use before then.  Generally the phrase implies there are many ways to achieve something.  In the context of this article, the analogy is that even simple tasks like writing a hex-to-RGB converter can be achieved in many different ways.


As always, vehemently opposed positions are encouraged in the comments, you can also connect with me on twitter @tippettm, connect with me on LinkedIn via matthewtippett, and finally +Matthew Tippett on google+.

ROI for Engineers

Short form presentation of how engineers can easily make judgements on Return on Investment. Also on SlideShare

High Confidence/Low Information vs High Accuracy/Low Information Estimates

Quite often an estimate is needed where there is low information but a high-confidence number is required.  For a lot of engineers, this presents a paradox.

How can I present a high confidence estimate, when I don’t have all the information?

Ironically, this issue is solved fairly easily by noting the difference between a high-confidence and a high-accuracy estimate.  A high-confidence estimate is defined by the likelihood that a task will be completed within a given timeframe, while a high-accuracy estimate provides a prescribed level of effort to complete the task.  This article presents a method of producing a high-confidence estimate while balancing analysis effort against accuracy.

This is a refinement on the “Getting Good Estimates” posting from 2011.

The Estimation Model

The basis for this method is captured in the diagram below. The key measures on the diagram are

  • Confidence, the likelihood that the task will be completed by a given date (Task will be complete in 15 days at 90% confidence)
  • Accuracy, the range of effort for an estimate (Task will be complete in 10-12 days)
  • No Earlier Than, absolute minimum effort for a task.

(Figure: Estimate Confidence and Accuracy)

In general, I never accept a naked estimate of a number of days. An estimate given as a range will usually imply a confidence level. An estimate given as a confidence level may or may not need an indication of accuracy, depending on the context for the estimate.

Gaming out the Estimate

As a refinement to the method outlined in Getting Good Estimates, the same technique of calling out numbers can be used to pull an estimation curve out of an engineer.  The method follows the same iterative approach outlined in Getting Good Estimates.  By asking the question, “What is the confidence that the task will be complete by date xxx?”, you will end up with results similar to:

  • What’s the lowest effort for this task? – 2 weeks
  • What’s the likelihood it will take 20 weeks? – 100% (usually said very quickly and confidently)
  • What’s the likelihood it will take 10 weeks? – 95% (usually accompanied by a small pause for contemplation)
  • What’s the likelihood it will take 5 weeks? – 70% (usually with a rocking head indicating reasonable confidence)
  • What’s the likelihood it will take 4 weeks? – 60%
  • What’s the likelihood it will take 3 weeks? – 30%
  • What’s the likelihood it will take 2 weeks? – 5%

That line of questions would yield the following graph.

(Figure: Worked Estimate)

I could then make the following statements based on that graph.

  • The task is unlikely to take less than 2 weeks. (No earlier than)
  • The task will likely take between 4 and 8 weeks (50-90% confidence)
  • We can be confident that the task will be complete within 8 weeks. (90% confidence)
  • Within a project plan, you could apply PERT (O=2, M=4 [50%], P=8 [90%]) and put in (2 + 4×4 + 8)/6 ≈ 4.3 weeks

Based on the estimate, I would probably dive into the delta between the 4 and 8 weeks. More succinctly, I would ask the engineer, “What could go wrong that would cause the 4 weeks to blow out to 8 weeks?”.   Most engineers will have a small list of items that they are concerned about, from code/design quality and familiarity with the subsystem to potentially violating performance or memory constraints.  This information is critically important because it kick-starts your Risks and Issues list (see a previous post on RAID) for the project.  A quick and simple analysis of the likelihood and impact of the risks may highlight an explicit risk mitigation or issue corrective action task that should be added to the project.

I usually do this sort of process on the whiteboard rather than formalizing it in a spreadsheet.

Shaping the Estimate

Within the context of a singular estimate I will usually ask some probing questions in an effort to get more items into the RAID information. After asking the questions, I’ll typically re-shape the curve by walking the estimate confidences again.   The typical questions are:

  • What could happen that could shift the entire curve to the right (effectively moving the No Earlier Than point)?
  • What could we do to make the curve more vertical (effectively mitigating risks, challenging assumptions or correcting issues)?

RAID and High Accuracy Estimates

The number of days on the curve from 50% to 90% is what I am using as my measure of accuracy.  So how can we improve accuracy?  In general, by working the RAID information to Mitigate Risks, Challenge Assumptions, Correct Issues, and Manage Dependencies. Engineers may use terms like “proof of concept”, “research the issue”, or “look at the code” to help drive the RAID.  I find it is more enlightening for the engineer to actually call out their unknowns, thereby making them a shared problem that other experts can help resolve.

Now, the return on investment for working the RAID information needs to be carefully managed.  After a certain point the return on deeper analysis begins to diminish and you just need to call the estimate complete.  An analogy I use is getting an electrician to quote on adding a couple of outlets, and then having the electrician check the breaker box and trace each circuit through the house.   Sure, it may make the estimate much more accurate, but you quickly find that the estimate refinement is eating seriously into the time the task will take anyway.

The level of accuracy needed for most tasks is a range of 50-100% of the base value.  In real terms, I am comfortable with estimates with accuracy of 4-6 weeks, 5-10 days and so on.  You throw PERT over those and you have a realistic estimate that will usually be reasonably accurate.

RAID and High Confidence Estimates

The other side of the estimation game deals with high-confidence estimates.  This is a slightly different kind of estimate that is used in roadmaps where there is insufficient time to determine an estimate with a high level of accuracy.  The RAID information is used heavily in this type of estimate, albeit in a different way.

In a high-confidence estimate, you are looking for something closer to “No Later Than” rather than “Typical”.  A lot of engineers struggle with this sort of estimate since it goes against the natural urge to ‘pull rabbits out of a hat’ with optimistic estimates.  Instead, you are playing a pessimistic game where an unusually high number of risks become realized into issues that need to be dealt with.  By baking those realized risks into the estimate you can provide high-confidence estimates without a deep level of analysis.

In the context of the Cone of Uncertainty, the high-confidence estimate will always be slightly on the pessimistic side.   This allows a sufficient hedge against something going wrong.

(Figure: High Confidence Estimate)

If there is a high likelihood that a risk will become realized or an assumption is incorrect, it is well worth investing a balanced amount of effort to remove those unknowns.  It tightens the cone of uncertainty earlier and allows you to converge faster.

Timeboxing and Prototypical Estimates

I usually place a timebox around initial estimates.  This forces quick thinking on the engineers side.  I try to give them the opportunity to blurt out a series of RAID items to help balance the intrinsic need to give a short estimate and the reality that there are unknowns that will make that short estimate wrong.  This timebox will typically be measured in minutes, not hours.  Even under the duress of a very small timebox, I find these estimates are usually reasonably accurate, particularly when estimates carry the caveats of risks and assumptions that ultimately are challenged.

There are a few prototypical estimates that I’ve seen engineers give out multiple times.  Below are my general interpretations of each estimate, and the refinement steps I usually take.  These steps fit into the timebox I describe above.

  • “The task will take between 2 days and 2 months” – Interpretation: low accuracy, low information.  Refinement: start with the 2 day estimate and identify the RAID items that push it to 2 months.
  • “The task will take up to 3 weeks” – Interpretation: unknown accuracy, no lower bound.  Refinement: ask for a no-earlier-than estimate, and identify RAID items.
  • “The task is about 2 weeks” – Interpretation: likely a lower bound, optimistic.  Refinement: identify RAID items – what could go wrong?

Agree? Disagree? Have an alternative view or opinion?  Leave comments below.

If you are interested in articles on Management, Software Engineering or any other topic of interest, you can contact Matthew at tippettm_@_gmail.com via email,  @tippettm on twitter, Matthew Tippett on LinkedIn, +MatthewTippettGplus on Google+ or this blog at https://use-cases.org/.

Desk-Checks, Control Flow Graphs and Unit Testing

Recently, during a discussion on unit testing, I made an inadvertent comment about how unit testing is like desk-checking a function.  That comment was met with a set of blank stares from the room.    It looks like desk-checking is no longer something that is taught in comp-sci education these days.  After explaining what it was, I felt like the engineers in the room were having a moment similar to the one I had when a senior engineer would talk about their early days with punch cards, just after I entered the field. I guess times have changed…

Anyway…

What followed was a very interesting discussion on what unit testing is, why it is important, and how mocking fills in one of the last gaps in function-oriented testing.  Through this discussion, I had my final unit testing light bulb moment: it all came together and went from an abstract best practice to an absolutely sane and necessary one.  This article puts out a unified view on what Unit Testing is, is not, and how one can conceptualize unit tests.

Continue reading “Desk-Checks, Control Flow Graphs and Unit Testing”

Root Cause Analysis; Template and Discussion

A typical interpretation of a Root Cause Analysis (RCA) is to identify parties responsible and apportion blame.  I prefer to believe a Root Cause Analysis is a tool to discover internal and external deficiencies and put in place changes to improve them.  These deficiencies can span the entire spectrum of a system of people, processes, tools and techniques, all contributing to what is ultimately a regrettable problem.

Rarely is there a singular causal event or action that snowballs into a particular problem that necessitates a Root Cause Analysis.  Biases, assumptions, grudges and viewpoints are typically hidden baggage when investigating root causes.  Hence it is preferable to use a somewhat analytical technique when faced with a Root Cause Analysis.  An objective analytical technique assists in removing the personal biases that make many Root Cause Analysis efforts less effective than they should be.

I present below a rationale and template that I have used successfully for conducting Root Cause Analysis.   This template is light enough to be used within a couple of short facilitated meetings.  This contrasts with exhaustive Root Cause Analysis techniques that take days or weeks of applied effort to complete.  On most occasions, the regrettable outcome is avoidable in the future by making changes that become evident within a collective effort of a few hours to a few days.  With multiple people working on a Root Cause Analysis, this timebox allows the analysis to be completed within a day.

Continue reading “Root Cause Analysis; Template and Discussion”
