On Diffs

This is a short style guide on diffs. Some background thoughts are in an older blog post about Code and the Written Word.

As an engineer, your legacy within Meta is primarily the code that you create. The oldest thing that will come back to haunt you years in the future is your code. What do I mean by legacy?

  • We regularly come across code that was written years ago.
  • We do not regularly come across docs written years ago.

So, through that lens we should consider how that legacy is created. If we work back from the code, we see diffs, near-code documentation, and design notes. I’d argue that diffs and near-code documentation are close peers in what we reference, but for the purposes of this note I’m focusing on diffs.

Hands up who has ever looked at code and thought “What were they thinking when they wrote this?”. The context for the change is probably the single best answer to that question.

Diffs are Context for A Code Change

Diffs and the associated metadata (tasks, tests, summary, etc.) all provide context for the change.

  • Title provides the one-liner view of what’s happening – it’s almost the topic sentence in a paragraph. See the blog post from above.
  • Summary provides the details of what the diff does that won’t be obvious from the diff itself. See below for a comment on design discussions.
  • Testing is actually important in two ways. It allows other engineers to walk through your final testing before you check in. It also provides a reference for what to test if you need to modify the code in the future. Remember that the familiarity you have with the code and the way to test it is not shared by others and won’t be shared by you in the future.

All these sections provide context on the change. They are the perfect place to capture the maturity of the environment and the critical factors affecting how the code was created.

Diffs are a Watercooler for the Code

Diffs are where we get to know each other and what is and is not important for individuals and the team. We get to know each other’s engineering style, our triggers, our level of pedantry, and what motivates us when we look at code.

For example, my style of code review is very sensitive to the “future self”. These are more or less the same questions that any other future engineer who intersects with your legacy is going to ask as well.

  • What’s likely to happen to the code?
  • What’s going to be confusing?
  • What’s going to fail?
  • What’s going to age poorly?
  • When I’ve forgotten I’ve written the code, what information will I need to quickly come up to speed?

As we converse and discuss the code, we begin to develop a Theory of Mind for the engineers that we will review the code with. This is important when preparing a diff.

When receiving feedback on the diff, graciously accept any feedback as well intentioned. However, you should feel comfortable accepting the feedback and incorporating it, considering it for the future, or respectfully rejecting it. Likewise, the reviewer should acknowledge the three paths for the feedback and assume the best intentions of the Author.

Rarely should there be disagreements in the diffs. If there are unresolvable differences, it’s time to step out, write something down, and agree on those principles.

Diffs should be a Reflection of the Community

It’s never your code. If you are listening to the Watercooler discussions on the diffs and begin to learn who is going to respond to your diff in different ways, you begin to incorporate your teammates’ considerations into the way you present your code. When I’m preparing a diff, I’m either consciously or subconsciously thinking about what’s important for my teammates and how I intersect with their concerns.

To be clear, I’m not saying to pander to everyone’s concerns within a diff. I’m saying consider what’s important, and when you know there is something relevant that another engineer would care about, be deliberate about whether you respond to it or not. If Sarah cares about architectural clarity, and you for good reason are violating that clarity, speak to it directly in the diff or follow up outside of the diff. If Paul is concerned about shotgun surgery for small changes, think about what needs to change for likely small changes.

Having these other voices in your head makes you a better engineer and allows you to live vicariously through others and get their experiences. I find this one of the strongest accelerants for engineer growth.

What doesn’t belong in diffs?

I’ll get a bit controversial here. This is my starting point, but the community around the codebase will have their own norms.

Design Review in Code Review. The later changes occur in development (requirements, design, code, deployment, production), the higher the cost to make the change. Sunk cost is a real killer: when designs are questioned late in the cycle, code may land just to avoid rework. Fortunately, there is an easy solution here – a similar approach to diffs but for design: small targeted design notes that address specific concerns, either as part of a bigger design doc or as individual docs.


How I Debug

We’ve all seen it. The engineer who almost feels how a system works and can within seconds identify what’s going on. Ask them how they do it, and most of them will say “I just know”. Not helpful at all.

Then there are the mere mortals who struggle to get things debugged, wandering in circles, trying lots of different things, and most of the time stumbling on the problem.

I spent a while observing the intuitive engineers, observing myself, and observing others, and here is how I do systematic debugging. It’s very close to my understanding of the technique of differential diagnosis. Apparently I’m not alone (link, link).

Worked example: let’s consider a web server that is not returning data.

Step 0: Start with a System View

Ironically, this is one of the biggest gaps I see.

What’s the first thing the SWAT team pulls out when they are trying to work out what’s happening in a building? It’s the floor plan. Why? So that they can conceptualize where people would be, where the weak and strong spots are. Without a floorplan they are going in blind.

A lot of engineers are either unwilling or unable to communicate how they think a system works. This is the most critical part, since the theory of operation in your head will guide the following steps. It ultimately doesn’t matter if the system view is right or wrong – it will evolve as more information is found.

A System View doesn’t have to be detailed; it can be superficial at first, and you build it up as you go. It can be a set of interconnected black boxes, each simply accepted as working consistently and as intended unless data indicates otherwise.

Step 1: Gather Information (Gather)

Gather whatever information is available: logs, failure modes, reproduction scenarios, whatever there is. It doesn’t really matter how much you have – sometimes too much makes it harder to diagnose. This gets you going.

Worked Example: this could be error codes present in the response, the amount of time it takes to fail, or whether other requests work. In this case, we don’t have any return code; the request just times out.

Step 2: Create Multiple Hypotheses (Guess)

So based on the available information, you can consider what may be the cause. As you enumerate the potential causes, don’t worry too much about pruning too quickly. By nature you will automatically prune out some scenarios based on the available information (see Other Thoughts).

Worked Example: There could be a server failure, server overload, bug in the page rendering, network issue, etc.

Step 3: Work out how to Prove or Refute each Hypothesis (Plan)

For each hypothesis, consider what you can do to either prove or disprove it. As you do, there will be experiments that are either easy to run or that split the hypothesis space. Keep these in mind. Don’t worry too much if there is information missing, or you can’t completely prove or disprove something.

Worked Example: What error codes are returned (decides between server failure/overload), are other pages working (may split between server side code and/or static code), server load values.

Step 4: Conduct your Experiments

Now that you’ve got your experiments, each one either removes a potential hypothesis or shows that it’s at least plausible. If you have refuted a hypothesis, you can remove it from your list of things to debug. If you have confirmed it, you now have more data that helps paint a better picture. Just like 20 questions, you need to choose the experiments that confirm or refute as many of your hypotheses as possible, mixing in a healthy dose of effort, information gained, and number of hypotheses covered.

Step 5: Repeat Steps 2-4

As you work through the debugging, you will continue to get more data. As you get more data, you will possibly come up with other hypotheses. This is more complex than a simple decision tree, since it’s dynamic rather than designed. By cycling through multiple times you will gather more data and isolate where the problem is.

Note that if it were easy, it would be linear and directed, and you would be done. Life isn’t that easy, so there will be mistakes and incorrect assumptions. Always keep a beginner’s mind: observe, re-challenge, and don’t take hard positions on things you may have excluded – new information may make a big difference.
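The loop in Steps 2-5 can be sketched as a small refine function. The hypotheses and observations below are hypothetical placeholders for the web-server worked example, not real diagnostics:

```python
# Each hypothesis predicts what the experiments should show; each piece of
# evidence refutes the hypotheses that disagree with it.

hypotheses = {
    "server failure":  {"other_pages_work": False, "load_high": False},
    "server overload": {"other_pages_work": False, "load_high": True},
    "rendering bug":   {"other_pages_work": True,  "load_high": False},
}

def refine(candidates, observation):
    """Steps 4-5: keep only hypotheses consistent with the new evidence."""
    key, value = observation
    return {name: pred for name, pred in candidates.items() if pred.get(key) == value}

# Run the cheapest, most space-splitting experiments first (Step 3),
# feeding each result back into the hypothesis set.
remaining = refine(hypotheses, ("other_pages_work", False))
remaining = refine(remaining, ("load_high", False))

print(sorted(remaining))   # ['server failure']
```

In practice the hypothesis set also grows as new evidence suggests causes you hadn’t considered; the shrinking shown here is only half the cycle.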

Other Thoughts

There are a number of other things that are happening in the background here. I’ll explore some things here.

Intuitive Engineers: We’ve all seen them, the engineers who sit down with a problem, tap a few keys, look at a couple of logs, and magically zero in on the root cause. In my experience, you can usually ask questions like “What do you know it is not?” or “How did you determine that it was or wasn’t that?”. Most engineers will have immediate answers to those questions. If you ask them directly how they get to the right answer, most won’t have a strong view on how they got there.

Systems Thinking: A strong part of this process is constructing a theory of operation of how the system works. A lot of the software industry is just plain bad at maintaining good architectural documentation. Independent of that, constructing a System Model of how the system works is a critical step. Like the Beginner’s Mind – allow your view to be fluid, and learn and respond to new information.

Remember 20 Questions: As a kid, I’d play 20 questions with the family. The critical part of quick isolation is to segment the solution space as quickly as possible. Through the power of bisection, 20 questions allows you to segment into 1,048,576 different sub-domains. Well-constructed hypotheses and experimentation apply the same halving in an ideal world. The real world is messy, but the rule still applies.
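The bisection arithmetic is easy to check: 20 yes/no questions split the space into 2^20 buckets, and conversely, isolating one of n candidate causes takes about log2(n) well-chosen questions.

```python
import math

# 20 binary questions -> 2**20 distinguishable sub-domains.
print(2 ** 20)                     # 1048576

# Going the other way: how many ideal halvings isolate 1 of 1000 causes?
print(math.ceil(math.log2(1000)))  # 10
```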

Depth vs Breadth First: There is no prescription on driving down one line of experiments versus knocking off a different set of experiments first. The best choice comes with practice and improvements in intuition and judgement. Generally, if you keep getting hypotheses confirmed, continuing depth first can help isolate; when there is confusion, breadth helps first.

Beginner’s Mind: Shoshin is a Japanese term for beginner’s mind, similar in concept to a growth mindset. Within that concept you drop preconceptions and biases and stay in the moment with the data that you have.

Allow Yourself to Be Wrong: This is critical: if you don’t take risks, it will be hard to learn. Learning is best when you make small mistakes that help shape your best path. You learn best when you are wrong.

Practice, Practice, Practice: I personally feel that intuition comes from improved judgement and experiencing patterns. You can only get better if you practice.

Differences between Rate Limiting and Throttling

When dealing with cloud APIs, there will typically be the concept of Throttling when API consumption moves beyond a specific limit. The particular model of throttling (leaky bucket, burstable, etc.) is outside the scope of this post. For the purposes of this post, we will assume a simple, numeric rate limit (i.e. x API calls per second).

The key concept is that the API is expensive in either time or compute, and so there is a need to restrict the rate at which calls are made. Most developer code assumes that all APIs are free: tight loops that run as fast as possible are common. Cloud APIs that distribute compute are not free and need special handling.

Rate Limiting is a client side response to the maximum capacity of a channel. If a channel has capacity to consume requests at a given rate/sec then a client should be prepared to limit their request rate. A common response to avoid implementing rate limiting on the client is that the server should allow processing at the appropriate rate. However, in the case where the API triggers asynchronous operations, the response rate of the API may be fast, but the operation in the background is much slower.

Throttling is a server side response where feedback is provided to the caller indicating that there are too many requests coming in from that client, or that the server is overloaded and needs clients to slow down their rate of requests. When a throttling event happens, the general client side response is to use exponential backoff to ensure that the system can recover even with multiple clients making requests at the same time.
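A minimal client-side backoff loop can be sketched as follows. ThrottlingError, call_api, and the constants are hypothetical placeholders for illustration, not any particular SDK’s API:

```python
import random
import time

BASE_WAIT = 0.01   # seconds; hypothetical base wait unit
MAX_RETRIES = 10

class ThrottlingError(Exception):
    """Raised when the server tells us to slow down."""

def call_with_backoff(call_api):
    """Retry call_api with exponential backoff on throttling."""
    for retry in range(MAX_RETRIES + 1):
        try:
            return call_api()
        except ThrottlingError:
            # Classic 2**retry * base wait, with full jitter so that many
            # clients backing off together don't retry in lockstep.
            delay = (2 ** retry) * BASE_WAIT
            time.sleep(random.uniform(0, delay))
    raise RuntimeError("retries exhausted")
```

The jitter is a common refinement; a plain `time.sleep(delay)` is closer to the bare scheme described above but synchronizes competing clients.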

An argument is often made that exponential backoff is a type of rate limiting. Yes it is, but exponential backoff is a partial solution with a number of problems.

AWS uses throttling heavily across its APIs and can serve as a good background for this discussion. Some AWS APIs associated with EMR have very low default API request limits (2-4 requests/second). These APIs have typical response times around 100 ms. As you can see, the sustained rate at which a client can access the API exceeds the request limit. This represents an interesting (but easy) challenge for systems that need to call an API repeatedly. So for this example, I’m going to focus on the AWS Elastic Map Reduce DescribeCluster API. A common system management operation will be to gather data for each active cluster. In this example, assume we have 1000 clusters, and that we can’t hook into the EMR events and take a transactional on-change approach.

With an assumed maximum client rate of 10 requests/second and an API limit of 4 requests/second, we can immediately see that calling the API 1000 times gives us 100 seconds of requests, while at the API limit the scan would take 250 seconds to complete. This of course assumes that our client is the only caller of that API; in a lot of cases AWS internally is making requests, you may have a secondary orchestrator making requests, and you may have an infosec scanner making requests. So in reality our client may only be able to get 1 or 2 requests/second.
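The arithmetic is worth checking: 1000 calls at the client’s pace versus at the API limit.

```python
CLUSTERS = 1000
CLIENT_RATE = 10   # requests/second the client could sustain
API_LIMIT = 4      # requests/second the API allows

print(CLUSTERS / CLIENT_RATE)  # 100.0 seconds at the client's pace
print(CLUSTERS / API_LIMIT)    # 250.0 seconds at the API limit
```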

So let’s look at the naive approach: we keep hammering at max rate until we’re done. This will place us into heavy contention with other clients and basically make life miserable for all of them.

Following AWS guidance we can implement exponential backoff. The general idea is that every time we receive a Throttling exception, we back off exponentially, typically 2 ^ retry * base wait unit. Most approaches will have a max retry count before failing, or will max out at some delay. Some AWS SDKs also have a transparent retry that is hidden from the API caller, meaning that by the time you receive a Throttling exception, the SDK implementation has already backed off at least 3 times. Now if we look at our case above, we will almost immediately hit a throttling exception within the first second. Assuming we take exponential backoff with a max wait, we will likely get 4 requests in and then wait for maybe 10 seconds (with max retries of 10 and a base wait of 10 ms: 2^10 * 10 ms = 10,240 ms, roughly 10 seconds), and then we try again. So effectively we’re getting 0.4 requests per second, and our full scan will take 2500 seconds. This also ignores other clients: our bad behavior will be triggering throttling events for them as well, likely diminishing service for all of those clients too.
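The worst-case numbers above check out:

```python
BASE_WAIT_MS = 10
MAX_RETRIES = 10

# Worst-case pause between attempts: 2**10 * 10 ms.
max_delay_ms = (2 ** MAX_RETRIES) * BASE_WAIT_MS
print(max_delay_ms)              # 10240 ms, roughly 10 seconds

# ~4 successful requests per ~10 second cycle.
effective_rate = 4 / 10
print(1000 / effective_rate)     # 2500.0 seconds for the full scan
```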

So currently we’ve got a poorly acting client that is ruining it for everyone. So what do we do?

We can rate limit the client. This would involve either having prior knowledge or adapting our rate based on feedback from the server. A simple form of rate limiting is to use a RateLimiter implementation that blocks code until it is permitted to continue. In our example, if we had a RateLimiter of 4 permits per second, then we can guarantee that we will stay below the API limit. In a quiet environment we could work through our 1000 clusters without hitting any throttling exception.
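A minimal blocking RateLimiter, in the spirit of Guava’s RateLimiter, might look like this. This is a sketch with no burst allowance, and describe_cluster is a hypothetical stand-in for the real API call:

```python
import threading
import time

class RateLimiter:
    """acquire() blocks the caller until the next permit is due."""

    def __init__(self, permits_per_second):
        self.interval = 1.0 / permits_per_second
        self.next_free = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        with self.lock:
            now = time.monotonic()
            wait = self.next_free - now
            # Reserve the next slot whether or not we have to sleep.
            self.next_free = max(self.next_free, now) + self.interval
        if wait > 0:
            time.sleep(wait)

limiter = RateLimiter(4)          # stay at or below the 4 requests/second limit
for cluster_id in range(4):       # e.g. one DescribeCluster call per cluster
    limiter.acquire()
    # describe_cluster(cluster_id)  # hypothetical API call goes here
```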

Of course, hard coding 4 requests/second is naive to both the actual limit and to any other clients, so we would likely want to make it adaptive. We’d start off with a slightly aggressive limit of say 10 requests/second. As we hit throttling exceptions we would adjust the RateLimiter by an appropriate amount (say halving) from 10 requests/second to 5 requests/second to 2.5 requests/second. I haven’t come across any strong guidance on adapting rate limits, but my intuition says negative exponential is probably too aggressive, while linear is not aggressive enough. By using rate limiting we get to a sustained un-throttled request rate that is also sensitive to any sustained requests by other clients.
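The adaptive limit can be sketched as halving on throttle and creeping back up on success. The starting rate, floor, and step are the rough numbers from this discussion, not tuned values:

```python
class AdaptiveRate:
    """Halve the budget when throttled; recover linearly on success."""

    def __init__(self, rate=10.0, floor=0.5, step=0.1):
        self.rate = rate       # current requests/second budget
        self.floor = floor     # never stall completely
        self.step = step       # linear recovery per successful request

    def on_throttled(self):
        self.rate = max(self.floor, self.rate / 2)

    def on_success(self):
        self.rate += self.step

limit = AdaptiveRate()
limit.on_throttled()
limit.on_throttled()
print(limit.rate)    # 10 -> 5 -> 2.5
```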

Planning and respecting the API request limit is the best solution. All APIs are not free to call.

So our exploration has demonstrated that sustained calls need something other than exponential backoff, and that we can get to a more predictable API rate by using rate limiting. However, we’re not there yet. There will be transient peaks, meaning a period of load doesn’t represent the long-term use of the API.

So to deal with transient spikes we will still need to tell our client to back off slightly when the API is under load. This does slow us down slightly, but we will know that our rate limiting is preventing our client from being the problem, and that it is external load on the API that is causing the issue.

So with a combination of exponential backoff to deal with transient spikes and rate limiting to treat the API limit with respect we have a much better solution.

But wait, there’s more.

Our past throttling doesn’t mean that our future is throttled. So we need to ensure that our code is adaptive to both events and other clients utilizing the API. Unfortunately, with most APIs there is only feedback on our own behavior, so we need to periodically and optimistically relax both our backoff and our rate limiting to ensure that we are close to, but not exceeding, the limits.

For the exponential backoff, we need to reduce our retry count periodically so that we get back to zero retries after a period of success. In our example above, we could assume that after 10 un-throttled API requests we can discount one retry, ultimately getting back to our 0-retry state. This allows us to deal with short-term events (seconds of throttling) as well as sustained periods of load (minutes of throttling).
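Forgiving retries after success can be sketched as a small state machine; the threshold of 10 un-throttled requests per forgiven retry level comes from the example above:

```python
class BackoffState:
    """Decay the retry count back toward zero after a run of successes."""

    SUCCESSES_PER_DECAY = 10

    def __init__(self):
        self.retries = 0
        self.successes = 0

    def on_throttled(self):
        self.retries += 1
        self.successes = 0

    def on_success(self):
        self.successes += 1
        if self.successes >= self.SUCCESSES_PER_DECAY and self.retries > 0:
            self.retries -= 1
            self.successes = 0

state = BackoffState()
state.on_throttled()
state.on_throttled()
for _ in range(20):      # 20 clean requests forgive both retry levels
    state.on_success()
print(state.retries)     # 0
```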

For our rate limiting, we need to periodically increase our rate to feel for the total load on the API and find the sustainable peak. Remember that in the case of AWS, any other client could be creating sustained load on the API, so we cannot assume that our sustainable 1 request/sec in the past represents our future sustained request rate, particularly when other clients that were consuming some portion of the API capacity may no longer be there. I’d likely implement a negative exponential limit decrease (1/2 each time), with an incremental linear return based on the number of successful requests.

So by including adaptive rate limiting that backs off quickly and then feels for the right level, and exponential backoff that relaxes after a period without throttling, we end up with much better overall throughput and also end up being a good API citizen.

Unfortunately, I have never seen implementations that handle both rate limiting and exponential backoff. Most of the implementations that I have seen do exponential backoff but then go back to hitting the API at the maximum rate once they’ve done their backoff. Less common is a naive rate limiting approach that is neither adaptive nor allows for backoff.

Thoughts, comments, seen a good rate limiting and backoff implementation? Add them below.

Models for EV Charging

Electric Vehicles represent a challenge for the future. This blog post imagines an EV future and what it may mean. I’m an outsider to the EV space, having had a used 2017 Nissan Leaf for just over 4 months.

I’m also going to look beyond the irrational and skewed EV world that we live in currently, with free Supercharging and heavily subsidized workplace EV charging. In this future, people won’t be sitting in their cars for 30 minutes on a Friday evening to fill their car for free. Neither will we be getting juiced at work for free, having never charged our cars at home or at an EV charging station. You have to pay for electricity in this world, one way or another, grounding this post in a mostly rational economic world.

Charging Options

In this post I will explore the two fundamental types of EV charging available today. Although there will be others in the future, and there have been others in the past, the two most common forms available now in the US are 6.6 kW Level 2 (L2) and DC Fast Charging (DCFC).

Charging at home – L1 or L2

In this first scenario, which is mostly a real-world scenario now, homes are equipped with L2 chargers. You install the number of chargers that you need for your cars. L2 chargers provide around 6.6 kW, or about 25 miles of range per hour.

In a simplistic scenario, you plug in when you get home and unplug when you leave for work. This would give around 12 hours of charging per day, or about 300 miles of charging overnight. PG&E, my local power supplier, has an EV rate which provides a heavily reduced price from 11 PM to 7 AM, leaving about 8 hours or 200 miles of charge each night. That’s sufficient for a very large commute or a reasonably sized car.
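The overnight numbers follow directly from the rule of thumb of about 25 miles of range per hour of L2 charging:

```python
MILES_PER_CHARGING_HOUR = 25   # rule of thumb for a 6.6 kW L2 charger

print(12 * MILES_PER_CHARGING_HOUR)  # 300 miles if plugged in for 12 hours
print(8 * MILES_PER_CHARGING_HOUR)   # 200 miles inside the 11pm-7am EV window
```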

For small commutes, L1 also works, but at 3.3 kW you will only be able to recharge less than 100 miles overnight, which leaves you with range anxiety if you can’t charge beyond your previous day’s usage. Electricity rates that make charging very cheap between 11 pm and 7 am put the consumer in a difficult “do I pay more” situation. Don’t underestimate the psychological barrier of paying double the off-peak rate, even if it is much cheaper than the retail rates at public stations.

Charging on the Road – DC Fast Charging

The across-the-country road trip in a Tesla is a clear possibility with the Supercharger stations. These DCFC chargers range from 20 kW all the way up to 350 kW. The common rules of thumb range from 1 hr/100 miles up to 1 hr/300 miles at 150 kW, and 20 minutes at 350 kW.

Note that most of these DCFC systems will charge quickly up to 80% and then move to trickle charging for the last 20% (see https://use-cases.org/2019/05/05/real-ev-charging-rates-from-the-bay-area/). So the math becomes complicated unless you discount that last 20%.

Most of the DCFC in the Bay Area charge you per minute rather than per kWh, so there is a strong disadvantage to going beyond 80%, at which point you are starting to pay for parking rather than charging. DCFC that charge per kWh usually charge a premium, likely balancing the time of use with the charge they deliver.

Either way, DCFC is the closest that we will get to the gas station experience unless we go down the battery swapping route or start pushing up to high hundreds of kW charging. And when you think about how electricity behaves at high power levels, the safety interlocks, cooling, and so on will make high-kW charging a challenge to install, maintain, and keep safe.

Charging Scenarios

At this stage, I can’t see too many other fundamental options, cheap and slow, or fast and expensive. Time may prove me wrong, but unlikely in the next 5 or so years.

Home Charging

Home charging with L2 charging is likely to be the most popular and predictable charging experience. Your car is charged when you get up in the morning, you do your driving during the day, come home and plug your car back in and it’s ready fully charged by the next day. If your daily commute and errands sit below the night time charge rate of around 100 miles, you should be fine to continue with this model for as long as you want.

Road Trip Charging

The Tesla Supercharger and Electrify America networks are good examples of road-trip charging networks. The DCFC chargers on these networks vary between 150 kW and 350 kW. Most reports from people using these networks indicate that the trip stops (typically after 4-5 hours of driving) are completely reasonable for charging while taking a short break. Most online articles (mainly Tesla) indicate that the car is typically charged before the driver has finished their break.

Destination Charging

Destination charging covers the scenarios where you go somewhere and plug the car in at the destination. These destinations are typically work, business, or other activities. Most of these are L2 chargers; for a typical commute of 10-20 miles, a car is recharged from the trip in 1-2 hours.

Work Charging

Work charging is an interesting challenge: the car is going to be at the office for about 8 or 9 hours, but if the car was charged at night and is above 80% when destination charging starts, it may take only 2-3 hours to complete the charge. At many workplaces there are 2 or 3 shifts of people shuffling cars and coordinating with each other through corporate chat systems. I usually go for a spot that doesn’t have a charger available, relying on colleagues to move the charger over when they finish the first shift, leaving me the option of moving the car in the early afternoon or, if there are spots open, not at all.

Shopping or Leisure Charging

Destination charging at a shopping center or leisure location is a slightly different scenario. Although most destination chargers are L2, the amount of time spent at the location may be measured in a few hours. For the purposes of this post, I’ll assume that the average distance to shopping or leisure will be around the same 20-40 miles from home.

In most cases the amount of time that a car is parked and charging will be slightly or considerably less than the amount of time needed to get the car fully charged.

Future Charging Directions

Charging at scale will present some critical challenges for electrical infrastructure. A large number of DCFC charging stations can easily overwhelm the available power within a particular area. Commercial road-trip charging installations will invest to support what is needed; destination charging may struggle with the electrical installation requirements for charging at scale.

Battery Smoothing

For home L2 charging, solar and house batteries like the Tesla Powerwall will be extremely useful for ensuring spare charging capacity is available. Although the Powerwall 2 stores only around 13.5 kWh, much less than the capacity of an EV battery, it does help reduce the dependence on the network for charging.

Another benefit of energy storage is that it allows short term peak consumption of energy above what would be available from the utility.

Adaptive Charging

As described above, most of the destination charging has a difference between the amount of time that a car will be charging and the amount of charge that will be required.

Allowing cars to declare a target departure time lets the charger adjust the charge rate to ensure that the car is fully charged when the driver is ready to leave. If a car needs 10 kWh of energy, it can charge at a very low rate for 8 hours and be fully charged by the end of the day. Likewise at a shopping center, declaring a departure time allows slower but consistent charging to get a car fully charged by the time the customer is ready to leave.

This has two benefits:

  • Allowing more charge points to be installed
  • Reducing the electrical infrastructure required to support the peak load from the number of charge points installed.
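The scheduling idea above reduces to a simple calculation: given the energy a car needs and its declared departure time, the site can assign the minimum sufficient rate rather than charging flat-out. The example values are illustrative:

```python
def required_rate_kw(needed_kwh, hours_until_departure):
    """Lowest constant charge rate that still finishes on time."""
    return needed_kwh / hours_until_departure

print(required_rate_kw(10, 8))   # 1.25 kW trickle covers a full workday
print(required_rate_kw(10, 2))   # 5.0 kW for a two-hour shopping trip
```

A real scheduler would also cap each rate at the charger’s maximum and share the site’s total power budget across all plugged-in cars.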

Powerflex offers such a system, having installed some of these systems in the Los Altos/Mountain View school districts. The Powerflex system is captured well in the graphs below.

Obviously, adaptive charging will not be suitable for road-trip style charging, where the starting assumption is that you want to charge as quickly as possible.

The Future – Energy Storage and Adaptive Charging

Ultimately, the charging future I see is a blend of energy storage and adaptive charging, as the EV revolution continues, with places in Europe expecting to be weaned off internal combustion engines by around 2025. I look forward to driving into a car park, being presented with 50 to 100 L2 charging stations, and selecting when I think I’ll be leaving the car park, knowing full well that the car will be mostly charged by the time I get back.

What price should we expect to pay for this sort of charging? It will likely be per kWh plus a small space-usage rate. I’d be happy paying a 50% to 100% premium on the best charge rate that I’d be able to get at home.

Muscle Memory for Management with Mentoring

For just over a year now, I have been involved with Plato as a mentor.  If you don’t know, Plato is a subscription platform that links engineering managers to mentors.  As a Mentee, you sign up for unlimited mentoring sessions and access to a Slack channel.  As a Mentor, you make yourself available for one or two 30-minute slots per week.  Plato’s system allows Mentees to connect with Mentors who are likely able to provide insight or guidance on their particular leadership problems.

I’ve been asked by a few people about the lack of symmetry between the Mentee, Mentor, and Plato.  I’ve thought about this quite a lot to ensure I have a mental model for the benefit all three parties get.  The Mentee and Plato relationship is clear: Plato provides access to Mentors, and the Mentee (or the Mentee’s company) pays for the privilege, a fairly simple charge for facilitation.

What about the Mentor?  The Mentor receives no direct monetary reward, though they have the opportunity to expand their network through Plato’s networking events.  Plato obviously receives access to the top tier of engineering leadership, all quite happy to provide time to the paying Mentees.  On the surface, the Mentor receives very little as part of this relationship.

Why have a Mentor?

Personally, I moved through most of my career without what I would call a strong mentor or coach.  My lessons were learned the hard way, and I apologize in retrospect to those employees who helped me shape my management views and understanding of how business works.  As I’ve progressed in my career and become a more experienced manager, I’ve come to strongly support coaching and mentoring.  It’s a way to help avoid some of the leadership potholes that, as humans, we are prone to stumble into again and again.  Guidance or advice at just the right time can help immensely.

I strongly recommend that people find Mentors to help navigate their careers.  Even if you are a naturally introverted person, you should still look to find a Mentor; it’s hard, it’s awkward, but it is still valuable.

One comment I’ve heard from a couple of Mentees is that they are concerned they will be imposing on the time or advice of the Mentor.  Generally, Mentors see mentoring as a professional gift to the industry.  Assuming a formal or semi-formal mentoring relationship, unless you have an agreed cadence of communication, contacting them once or twice a month is likely what a Mentor expects.  Most Mentors are senior professionals; they will know how to ask you to back off.

Why Mentor?

As a Mentor, I have the opportunity to pay my experience forward.  I can see new engineering managers making the same mistakes that I made and bringing forward the same assumptions.  Mentoring lets me find small ways to help the engineering management community avoid re-learning those lessons the hard way.

Why do I Mentor through Plato?  

Plato provides a very interesting value proposition for a Mentor.  I personally have one mentoring session each week, plus a monthly panel mentoring session, so I’m working with about 7 Mentees each month.  Each of these Mentees comes with a particular problem that they want or need to solve, so these are real problems needing real advice and guidance.

This speed-dating approach to mentoring actually has some interesting benefits for the Mentor.  The biggest benefit that I get is the need to think through rapid-fire management scenarios during the 30 minutes, typically 2-3 topics are discussed in each session.  This type of mental engagement and scenario practice would typically take months of real-world management to experience.

Muscle Memory and Tuning Judgement

Like any skill, practice makes perfect.  When a martial artist practices moves and scenarios over and over, they are training their subconscious to recognize the scenario and automatically respond in a practiced way.  Similar near automatic or intuitive judgment comes to others like chess masters who can see a layout of chess pieces on a board and have a good idea about the state of the board and likely good moves that can follow.  Grandmasters can usually detect how close to an end-game the board is.

What appears as expert intuition is usually attributable not to gut feel, but rather to practiced judgment.  This practiced judgment comes from automatically recognizing a scenario and automatically knowing sensible next steps.  The more experience you get, the better your recognition of a scenario, and the better you are able to respond.

Managers go through a similar process.  A manager either needs a large amount of experience, or a way of mentally exercising these scenarios again and again.  In the real world, a manager’s staff issues come up only a few times a week and can take months to resolve.

When mentoring, a Mentor is able to work through a larger number of these scenarios, at a rate faster than they would experience in their normal professional life.  This helps train that practiced judgment, making the manager more effective and faster on their feet.

But be Aware of Bias

A highly self-confident Mentor can train poor judgment in the same way.  If a Mentor isn’t careful, their responses to the Mentee’s problems can drift from what works well for the Mentee to what is simply trained into the Mentor’s mental model.  This is where both the Mentee and Mentor have some responsibility.

The Mentee’s first responsibility is to take the advice as a well-intentioned response to the scenario presented.  The Mentee must integrate the advice into their real-world scenario and take action as appropriate.  Each Mentor’s solution may be only a partial solution to the Mentee’s problem.

The Mentor’s responsibility is to both appraise the scenario and determine what they might do in that situation.  The appraisal stage is the most critical.  A Mentor who doesn’t adapt to the Mentee’s actual situation may end up pushing the Mentee down the knife edge that represents the Mentor’s own history.  Hoping that you can take the same life steps as your favorite artist and repeat their results is folly.  The linear path their life has taken also runs along a knife edge: any variance, be it a chance meeting or a different decision, and the path is different.

Career Stages of A Typical Software Engineer

This is a repost of a long-form response to a Quora question (What are the typical stages in the career of a software engineer?), which was itself derived from an earlier answer – Matthew Tippett’s answer to What does this mean for a software engineer: “You are not at a senior level”?. Adapted and tweaked for the new question, and now tweaked for the blog.

Each company will have its own leveling guide, ask your HR department or search your intranet. It should go through a clear set of expectations for each grade and the different attributes that each grade should possess. Your manager may not even be aware of them – but they should provide a basis for you to understand the career progression.

Flat organizations will have only a few levels (Jr/Mid/Senior/Exec), other organizations will have many (Assoc/Eng/Snr Eng/Staff Eng/Snr Staff Eng/Princ Eng/Snr Princ Eng/Dist Eng/Fellow). Apply the following to your company and don’t expect direct portability between companies.

Grade Levels

First, we’ll go over a hypothetical set of grades, generally well rounded against a lot of companies. Some will have different titles, but they will generally share a common set of attributes.

The career level is arbitrary, but reflects where you’d expect the middle-of-the-curve people to be operating. Individuals will peak at a particular point and will then progress more slowly. Realistically, most good people will peak at what I am calling a Staff Engineer. Some will get frustrated with the leadership aspect of the senior grades and peak at Senior Engineer. The management-ladder equivalence is also arbitrary, but should serve as a guide.

  • Junior/Associate Engineer/New College Grad – Assumed to know nothing; can code, with minimal understanding of how businesses work and what a professional life entails. Hand-held or teamed with a more senior engineer to help get an understanding. Career level 0–2 years.
  • Engineer – Assumed to be able to work through tasks with minimal supervision. Will come back and ask for more work. Not expected to identify and fix secondary problems. Not expected to drive generalized improvements or be strong advocates for best practices or improvements. Quite simply a “Doer”. Scope is typically at a sub-component level. Career Level 2–5 years.
  • Senior Engineer – Beginning to be self directed. Expected to be able to work through small projects and foresee issues that may come up. Likely expected to mentor or lead sub-teams or development effort. Scope is typically at a component or subsystem level. Career Level 5–10 years – equivalent to a team lead.
  • Staff Engineer/Architect – Runs the technical side of projects, leader and mentor for a team. Holder of a high bar for best practices, quality and engineering workmanship. Scope is across a system, or across multiple subsystems. Career Level 10–20 years – equivalent to a manager.
  • Fellow/Distinguished Engineer – Runs the technical side of an organization. Works across many projects; defines the strategic direction for the technology. Career Level 15–30 years – equivalent to a director or VP.

It’s not about the code

Hopefully it becomes clear from the descriptions that, from Senior Engineer up, the technical role includes an increasing amount of leadership. This is distinct from management. The leadership traits are about having your peers trust and understand your direction, and being able to convince peers, managers, and other teams about your general direction. In short, being able to deliver on the “soft” skills needed to deliver code.

Amazon’s Leadership Principles actually give a pretty good indication of some of the leadership needs for engineers.

There is a tendency for organizations to promote based on seniority or time in role, or even worse, based on salary bands.

Applying this to Yourself

  1. Ground yourself in what your level means to you, the organization, and your team. There may be three different answers.
  2. Introspect and ask yourself whether you are demonstrating the non-management leadership aspects of a team leader or junior manager. Do you show confidence? Do you help lead and define? Do you demonstrate an interest in bringing in best practices? Do you see problems before they occur and take steps to manage them?
  3. Consider where you are in your career.

Your Career is a Marathon

A final thought: I’ve seen engineers gunning for Senior Engineer three years out of college, and Staff Engineer three years after that. My big advice to them is: what the hell are you going to do when you get six years into a 40- or 50-year career and realize that you’ve peaked, or that you have some serious slow grinding ahead for the next 20 years? I’m concerned about good engineers who become fixated on the sprint to the next title and not the marathon of their career.

Exploring Cognitive Engagement in Amazon’s Six Page Narratives

In a previous post, I discussed the Evil Genius of Amazon’s Six Page Narrative, exploring via a Quora post how the document is structured and why it works so well.  In Jeff Bezos’ Financial Year 2017 Letter To Shareholders, Jeff covers the Six Page Narrative and goes into the heavy polishing that a good Narrative provides.

In the Six Page Narratives that I have read, reviewed or discussed, I have always been frustrated with the tendency for authors to not use standard mechanisms to ease the cognitive load of the reader.  For example, below is a typical paragraph that you might find in a Six Pager.

Based on our review of the customer surveys, we can see that US customer preference is Product A – 10%, Product B – 40%, and Product C – 20%.  EU interest is Product A – 20%, Product B – 50%, and Product C – 10%.  Finally, JP customers have a preference of Product A – 40%, Product B – 20%, and Product C – 15%.  Consequently, we will be focusing on Product A and Product B.

To me, this is clearly tabular data that should be structured in a way that walks the reader through the data providing support for the argument.

Geographic Region   Product A   Product B   Product C
US                  10%         40%         20%
EU                  20%         50%         10%
JP                  40%         20%         15%

As can be seen, there is a clear worldwide preference for Product A and B.

It is clear that with the narrative format, the reader must pull the information apart to clarify and confirm the conclusion.  In the tabular format, the information is presented for simple confirmation of interpretation.

It has always felt to me that the narrative form is unfair and unsympathetic to the reader, forcing mental gymnastics where the gymnastics should not be needed.  In my own writing, I have always found the decision to tabulate vs narrate is a decision primarily based on the information density and valuable space consumed where in some cases every line counts.

Recently, I read Thinking, Fast and Slow.  In this book, Daniel Kahneman gave me that lightning bolt answer to what had vexed me about Six Page Narratives so much.

The Six Page Narratives are typically consumed in Amazon’s infamous reading meetings, where a number of senior leaders take the first 10–15 minutes of a meeting to read a Narrative or PR-FAQ before discussing it.  The senior leadership in these meetings are generally very smart and have years of experience.  You want this leadership team to be engaged in reviewing the document and to surface areas that the author and their supporting team may not have considered.  You need the reader to be cognitively engaged to successfully provide that extra input.

According to Daniel Kahneman’s book, when a reader is having to do cognitive work to consume some information, they will typically think deeper and more broadly than if they were presented the information in a way that lowers cognitive load.

Assuming that Thinking, Fast and Slow is correct, it puts the onus on the author of a narrative to make a conscious decision about where the knife edge lies between getting the reader to think through the problem, possibly gaining deeper insights, and presenting the information so they can take the cognitively easy course.  Put slightly differently: how to choose between engaging a reader and simply informing them.

Meltdown and Spectre Computer Vulnerability Cubicle Magnets

When something bad happens, like Meltdown, Spectre, or Heartbleed, your engineers end up with late nights and grueling daily update meetings.  When the dust finally settles, they don’t get a Service Ribbon that shows their involvement in those days of battle.

With the security industry creating well-recognized brands and logos for some of the bigger vulnerabilities, we have an opportunity to make sure those who join the firefight get the tech equivalent of the service ribbon.

At Badgly, we’ve created some cubicle magnets that commemorate those days when you lost lunches, nights, and weekends to solve some of your company’s biggest and most burning issues.  We’ve got a set of vulnerability badges for Meltdown, Spectre, and Heartbleed.  We’ll update this post with images when they get back from the printers.  Some sample renderings are below.

For geekiness points, all the vulnerability badges will be golden-ratio rectangles.  Is there a vulnerability that you’d like to see memorialized?  Post comments below.



The Evil Genius of the Amazon Six Page Narrative

One of the leadership tools that Amazon uses internally is the Six Page Narrative. This is where a decision that needs to be made is presented as a narrative document, restricted to six pages, with appendixes. The outcome of the narrative is a go/no-go on the subject. The meetings that discuss the narratives are typically 15 minutes of silent reading and note taking followed by 30–45 minutes of deep-dive questions and clarifications. The best possible outcome is 5–10 minutes of discussion and then a “Good narrative, I agree, go do it!”. A bad outcome is “I now have more questions than I started with, let’s stop the meeting now and meet again in two weeks.”

The narrative is actually fairly evil in its construction. The six pages and the structure of the document itself are ultimately fairly arbitrary; it is the revision, improvement, and validation forced by fitting into six pages that is critical. As many a good orator has noted, conveying detailed content concisely takes longer to prepare than long form.

The six pages is a hard limit for the narrative (and no, you should not drop to 6-point font with 0.25-inch margins; that defeats the point). The limited page count forces the author (or authors) to go through numerous drafts to reshape the document, polishing it by ejecting, rewriting, abstracting, and summarizing as they go. This produces a reasonable taxonomy and structure of information within the subsections and a good ordering of information. The repeated revision to fit takes away the mental work that the reader would otherwise have to undertake to correlate and rationalize the information. If the reader is taken through a clear path where their questions are almost immediately answered, it shows clarity of thought and understanding, increasing confidence in the author far more than a set of bullet points in a PowerPoint.

One final part that is overlooked in the six-page structure is the appendixes. The appendixes carry the data, the validation, and the information that feeds into the narrative: material not needed inline to support the narrative, but needed for completeness and cross-reference. You can say “our data supports this information as follows”, and the appendix allows the reader to dive deeper into the data to ensure they would draw the same conclusion; but assuming the data presented in the narrative is interpreted correctly, the narrative can still hold its own.

This approach is common when needing to write a concise 1 or 2 paragraph email. You write what you want, re-work, re-work, re-work. Placing a sometimes arbitrary boundary on an output forces a deeper consideration than would otherwise be delivered.

Other Amazonian documents have a similar templated structure that is, to some extent, inviolable. The one-pager, the working-backwards document, the root cause analysis: all are structured to force the presenter to think, structure, and organize their thoughts. For communicating business information, I don’t think I’ve seen it necessary to present a PowerPoint for quite some time.

The narrative form still has some risks. To stick with the narrative form, authors are sometimes tempted to inline what is better communicated with tables: a) option x – 15, b) option y – 25, c) option z – 10. By its nature, this information is tabular in form. A small structured table can carry a lot more information and take considerable cognitive load away from the reader. The narrative author must balance the use of prose with other information-dense methods of presenting information.

What other documents does your organization use to communicate ideas? Are the powerpoint templates still the ruler of your domain?

I’ve also added a deeper dive into part of six page narrative with a discussion of cognitive engagement.

This is a Quora crosspost from this answer. I’m reposting my popular answers here for posterity. Since it now sits in a different context, it has been modified slightly, primarily by putting context inline.

Towards a Skate-Dad (Day 1)

So this Christmas, we bought our 10-year-old a skateboard.  As part of ensuring there’s a bit of extra commitment, I’m joining them in learning to skate.  So let’s put it into context here…  43 years old, 225 lbs, mostly sedentary tech-geek lifestyle…  so that probably paints the picture.

So why am I learning to skateboard?  The first part is the whole role-model thing.  I have reasonably high expectations for my kids to be resilient through adversity and to put in the practice time to get good at things.  If I expect them to get up after hitting the pavement, they can expect the same of me…  They are a hell of a lot less fragile than I am and should be able to get up and move after a fall that would wipe me out.  I’m also doing this as part of a general effort to get fit and to prove that I can still learn new physical skills.

What’s my end point?  Realistically, I currently have little interest in doing tricks.  I’ll probably declare personal victory when I can push, tic-tac, pump on a half pipe, and turn…  We’ll see how much further it goes from there.

This part of the blog will give periodic updates on my quest to reach that endpoint.  Today is day 1…  My kit is an 8.5″ board with 54mm wheels, from the local Skateworks shop.  I’m a bit of an analytical geek, so I’ll also be writing up some of my thoughts, most likely ill-founded, for others to pick apart.

So Day 1…  Practicing pushing and very simple tic-tacs… becoming kind of comfortable.  Of course, on day 1 something has to happen.  If my memory serves me right, my front foot wasn’t in the right spot, I tried to do a push, and it all went downhill from there.  I ended up rolling my left ankle and landing on my right hip.  No bruising on the hip, but the ankle is now strapped.  We’ll see how the healing goes, but I’m eager to get back to the park with the kids before the new year break finishes.

Fortunately, or maybe not so fortunately, it took a few hours for the pain from the rolled ankle to kick in, so I got about another hour of practice at the skatepark.
