Differences between Rate Limiting and Throttling

When dealing with cloud APIs, you will typically encounter the concept of Throttling when API consumption moves beyond a specific limit. The particular model of throttling (leaky bucket, burstable, etc.) is outside the scope of this post. For the purposes of this post, we will assume a simple, numeric rate limit (i.e., x API calls per second).

The key concept is that the API is expensive in either time or compute, and so there is a need to restrict the rate at which calls are made. Most developer code assumes that all APIs are free, and tight loops that call an API as fast as possible are common. Cloud APIs that distribute compute are not free and need special handling.

Rate Limiting is a client side response to the maximum capacity of a channel. If a channel has the capacity to consume requests at a given rate per second, then a client should be prepared to limit its request rate. A common argument for avoiding rate limiting on the client is that the server should simply process requests at the appropriate rate. However, in the case where the API triggers asynchronous operations, the response rate of the API may be fast while the operation in the background is much slower.

Throttling is a server side response where feedback is provided to the caller indicating that there are too many requests coming in from that client, or that the server is overloaded and needs clients to slow down their rate of requests. When a throttling event happens, the typical client side response is to use exponential backoff, which ensures that the system can recover even with multiple clients making requests at the same time.

An argument is often made that exponential backoff is a type of rate limiting. Yes it is, but exponential backoff is a partial solution with a number of problems.

AWS uses throttling heavily across its APIs and can serve as a good background for this discussion. Some AWS APIs associated with EMR have very low default API request limits (2-4 requests/second), with typical response times around 100ms. As you can see, the sustained rate at which a client can call the API exceeds the request limit. This represents an interesting (but easy) challenge for systems that need to call an API repeatedly. For this example, I’m going to focus on the AWS Elastic Map Reduce DescribeCluster API. A common system management operation is to gather data for each active cluster. In this example, assume we have 1000 clusters, and that we can’t hook into the EMR events and take a transactional on-change approach.

Assume a maximum client rate of 10 requests/second (driven by the 100ms response time) and an API limit of 4 requests/second. Calling the API 1000 times, the client could finish in 100 seconds, but at the API's limit the scan would take 250 seconds to complete. This of course assumes that our client is the only caller of that API. In a lot of cases AWS internally is making requests, you may have a secondary orchestrator making requests, and finally you may have an infosec scanner making requests. So in reality our client may only be able to get 1 or 2 requests/second.

So let's look at the naive approach: keep hammering at the maximum rate until we're done. This will place us into heavy contention with other clients and will basically make life miserable for all of them.

Following AWS guidance, we can implement exponential backoff. The general idea is that every time we receive a throttling exception, we back off exponentially, typically 2 ^ retry * base wait unit. Most approaches will have a max retry count before failing, or a max retry count before capping the delay. Some AWS SDKs also have a transparent retry that is hidden from the API caller, meaning that by the time you receive a throttling exception, the SDK implementation has already backed off at least 3 times. Now, if we look at our case above, we can see that we will hit a throttling exception almost immediately, within the first second. Assuming we take exponential backoff with a max wait, we will likely get 4 requests in and then wait for maybe 10 seconds (assuming max retries of 10 and a base wait of 10 ms: 2^10 * 10 ms = 1024 * 10 ms ≈ 10 seconds), and then we try again. So effectively we're getting 0.4 requests per second, and our full scan will take 2500 seconds. This also ignores other clients: our bad behavior will be triggering throttling events for them as well, likely diminishing service for all of those systems too.
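
To make this concrete, here is a minimal Python sketch of that retry loop. ThrottlingError is a stand-in for whatever exception your SDK raises, the constants mirror the example above, and the jitter is a common refinement rather than anything AWS-specific:

    import random
    import time

    class ThrottlingError(Exception):
        """Stand-in for an SDK's throttling exception."""

    def call_with_backoff(call, base_wait=0.01, max_retries=10):
        # On each throttling exception, wait 2^retry * base_wait seconds.
        # Jitter spreads concurrent clients out so they don't retry in
        # lockstep.
        for retry in range(max_retries):
            try:
                return call()
            except ThrottlingError:
                time.sleep((2 ** retry) * base_wait * random.uniform(0.5, 1.5))
        return call()  # final attempt; any exception now propagates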

So currently we’ve got a poorly acting client that is ruining it for everyone. So what do we do?

We can rate limit the client. This involves either having prior knowledge of the limit or adapting our rate based on feedback from the server. A simple form of rate limiting is to use a RateLimiter implementation that blocks until the caller is permitted to continue. In our example, if we had a RateLimiter with 4 permits per second, then we can guarantee that we will stay at or below the API limit. In a quiet environment we could work through our 1000 clusters without hitting any throttling exceptions.
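
A minimal blocking rate limiter might look like the following Python sketch (Guava's RateLimiter is a well-known production implementation of the same idea). The cluster list and the DescribeCluster wrapper at the bottom are hypothetical names for illustration:

    import threading
    import time

    class RateLimiter:
        """Block callers so that acquire() completes at most `rate` times
        per second, spacing permits evenly."""

        def __init__(self, rate):
            self.rate = rate
            self.interval = 1.0 / rate
            self.next_free = time.monotonic()
            self.lock = threading.Lock()

        def acquire(self):
            with self.lock:
                now = time.monotonic()
                wait = self.next_free - now
                self.next_free = max(now, self.next_free) + self.interval
            if wait > 0:
                time.sleep(wait)

    limiter = RateLimiter(4)       # stay at or below the 4 requests/second limit
    for cluster in clusters:       # hypothetical list of 1000 clusters
        limiter.acquire()
        describe_cluster(cluster)  # hypothetical DescribeCluster wrapper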

Of course, hard coding 4 requests/second is naive to both the actual limit and to any other clients, so we would likely want to make it adaptive. We'd start off with a slightly aggressive limit of, say, 10 requests/second. As we hit throttling exceptions we would adjust the RateLimiter down by an appropriate amount (say halving), from 10 requests/second to 5 requests/second to 2.5 requests/second. I haven't come across any strong guidance on adapting rate limits; my intuition says negative exponential is probably too aggressive and linear is not aggressive enough. Either way, we get to a sustained un-throttled request rate by using rate limiting that is also sensitive to sustained requests from other clients.
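
Sketching that adaptation on top of the RateLimiter above, where halving on each throttle is the heuristic from this post and the floor of 0.5 requests/second is an arbitrary assumption:

    class AdaptiveRateLimiter(RateLimiter):
        """Halve the permitted rate on every throttling event."""

        def __init__(self, rate, min_rate=0.5):
            super().__init__(rate)
            self.min_rate = min_rate

        def on_throttle(self):
            # 10 -> 5 -> 2.5 ... requests/second, never below min_rate.
            self.rate = max(self.rate / 2, self.min_rate)
            self.interval = 1.0 / self.rate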

Planning for and respecting the API request limit is the best solution. APIs are not free to call.

So our exploration has demonstrated that sustained calls need something other than exponential backoff, and that we can get to a more predictable API rate by using rate limiting. However, we're not there yet. There will be transient peaks, meaning a short period of load doesn't represent the long term use of the API.

So to deal with transient spikes we will still need our client to back off slightly when the API is under load. This does slow us down, but we will know that our rate limiting is preventing our client from being the problem, and that it is external load on the API that is causing the issue.

So with a combination of exponential backoff to deal with transient spikes and rate limiting to treat the API limit with respect, we have a much better solution.

But wait, there’s more.

Our past throttling doesn't indicate that our future is throttled. So we need to ensure that our code is adaptive both to events and to other clients utilizing the API. Unfortunately, with most APIs there is only feedback on our own behavior, so we need to periodically and optimistically relax our backoff and our rate limiting to ensure that we sit close to, but not over, the limits.

For the exponential backoff, we need to reduce our retry count periodically so that we get back to zero retries after a period of success. In our example above, we could decide that after every 10 un-throttled API requests we decrement our retry count, ultimately getting back to our 0 retry state. This allows us to deal with short term events (seconds of throttling) as well as sustained periods of load (minutes of throttling).
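
A sketch of that decay, using the 10 un-throttled requests from the example above as the threshold:

    class BackoffState:
        """Track the retry level, forgiving one level after a run of
        un-throttled requests."""

        SUCCESSES_PER_STEP = 10  # from the example above; tune to taste

        def __init__(self):
            self.retry = 0
            self.successes = 0

        def on_throttle(self):
            self.retry += 1
            self.successes = 0

        def on_success(self):
            self.successes += 1
            if self.successes >= self.SUCCESSES_PER_STEP and self.retry > 0:
                self.retry -= 1  # step back toward the zero-retry state
                self.successes = 0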

For our rate limiting, we need to periodically increase our rate to feel for the total load on the API and find the peak sustained rate. Remember that in the case of AWS, any other client could be creating sustained load on the API, so we cannot assume that a sustainable 1 request/sec in the past represents our future sustained request rate, particularly when other clients that were consuming some portion of the API capacity may no longer be there. I'd likely implement a negative exponential decrease (halving each time), with an incremental linear increase based on the number of successful requests.
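
Putting the two together gives something shaped like additive-increase/multiplicative-decrease (AIMD), the same pattern TCP congestion control uses. A sketch building on the AdaptiveRateLimiter above; the increment and probe threshold are assumptions to be tuned:

    class AIMDRateLimiter(AdaptiveRateLimiter):
        """Multiplicative decrease on throttling (inherited), plus a
        linear increase after a run of successes to probe for spare
        capacity."""

        def __init__(self, rate, min_rate=0.5, max_rate=10.0,
                     increment=0.25, probe_after=20):
            super().__init__(rate, min_rate)
            self.max_rate = max_rate
            self.increment = increment
            self.probe_after = probe_after
            self.successes = 0

        def on_throttle(self):
            super().on_throttle()
            self.successes = 0

        def on_success(self):
            self.successes += 1
            if self.successes >= self.probe_after:
                self.rate = min(self.rate + self.increment, self.max_rate)
                self.interval = 1.0 / self.rate
                self.successes = 0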

So by including adaptive rate limiting that backs off quickly and then feels for the right level, and exponential backoff that relaxes after a period without throttling, we end up with much better overall throughput and also end up being a good API citizen.

Unfortunately, I have never seen an implementation that handles both rate limiting and exponential backoff. Most of the implementations that I have seen do exponential backoff but then go back to hitting the API at the maximum rate once they've done their backoff. Less common is a naive rate limiting approach that is neither adaptive nor allows for backoff.

Thoughts, comments, seen a good rate limiting and backoff implementation? Add them below.

Models for EV Charging

Electric Vehicles represent a challenge for the future. This blog post imagines an EV future and what it may mean. I’m an outsider to the EV space, having had a used 2017 Nissan Leaf for just over 4 months.

I'm also going to look beyond the irrational and skewed EV world that we live in currently, with free Supercharging and heavily subsidized workplace EV charging. In this future, people won't be sitting in their cars for 30 minutes on a Friday evening to fill up for free, and neither will they be getting juiced at work for free, having never charged their car at home or at an EV charging station. You have to pay for electricity in this world, one way or another, grounding this post in a mostly rational economic world.

Charging Options

In this post I will explore the two fundamental types of EV charging available today. Although others are coming in the future, and others have been available in the past, the two most common forms available now in the US are 6.6kW Level 2 (L2) and DC Fast Charging (DCFC).

Charging at home – L1 or L2

In this first scenario, which is mostly a real world scenario now, homes are equipped with L2 chargers. You install the number of chargers that you need for your cars. L2 chargers provide around 6.6kW, or about 25 miles of range per hour.

In a simplistic scenario, you plug in when you get home and unplug when you leave for work. This would give around 12 hours of charging per day, or about 300 miles of charge overnight. PG&E, my local power supplier, has an EV rate which provides a heavily reduced price from 11 PM to 7 AM, leaving about 8 hours or 200 miles of charge each night. That's sufficient for a very large commute or a reasonably sized car.
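
The arithmetic is simple enough to sanity-check (numbers from the paragraph above):

    L2_MILES_PER_HOUR = 25  # ~6.6 kW Level 2

    print(12 * L2_MILES_PER_HOUR)  # 300 miles from a 12 hour plug-in
    print(8 * L2_MILES_PER_HOUR)   # 200 miles in the 11 PM - 7 AM window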

For small commutes, L1 also works, but at 3.3kW you will only be able to recharge less than 100 miles overnight, which leaves you with range anxiety if you can't charge beyond your previous day's driving. Electricity rates that make charging very cheap between 11 pm and 7 am put the consumer in a difficult "do I pay more" situation. Don't underestimate the psychological barriers to paying double the off-peak rate, even if it is much cheaper than the retail rates at public stations.

Charging on the Road – DC Fast Charging

The across-the-country road trip in a Tesla is a clear possibility with the Supercharger stations. DCFC chargers range from 20 kW all the way up to 350 kW. The common rules of thumb range from 1 hr/100 miles at the low end, up to 1 hr/300 miles at 150 kW, and around 20 minutes at 350 kW.

Note that most of these DCFC systems will charge quickly up to 80% and then move to trickle charging for the last 20% (see https://use-cases.org/2019/05/05/real-ev-charging-rates-from-the-bay-area/). So the math becomes complicated unless you discount that last 20%.

Most of the DCFC stations in the bay area charge you per minute, rather than per kWh, so there is a strong disadvantage to going beyond 80%, at which point you are starting to pay for parking rather than charging. DCFC stations that charge per kWh usually charge a premium, likely balancing the time of use with the charge they deliver.

Either way, DCFC is the closest that we will get to the gas station experience unless we go down the battery swapping route or start pushing up to high hundreds of kW charging. And when you think about how electricity behaves at high power levels, the safety interlocks, cooling and so on will make high kW charging a challenge to install, maintain and keep safe.

Charging Scenarios

At this stage, I can't see too many other fundamental options: cheap and slow, or fast and expensive. Time may prove me wrong, but unlikely in the next 5 or so years.

Home Charging

Home charging with L2 is likely to be the most popular and predictable charging experience. Your car is charged when you get up in the morning, you do your driving during the day, you come home and plug the car back in, and it's fully charged again by the next day. If your daily commute and errands sit below the night time charge capacity of around 100 miles, you should be fine to continue with this model for as long as you want.

Road Trip Charging

The Tesla Supercharger and Electrify America networks are good examples of road trip charging networks. The DCFC chargers on these networks vary between 150 kW and 350 kW. Most of the reports from people using these networks indicate that the trip stops (typically after 4-5 hours of driving) are completely reasonable for charging while taking a short break. Most online articles (mainly about Teslas) indicate that the car is typically charged before the driver has finished their break.

Destination Charging

Destination charging covers the scenarios where you go somewhere and plug the car in at the destination. These destinations are typically work, business or other activities. Most of these are L2 chargers, so for a typical commute of 10-20 miles, a car is recharged from the trip in 1-2 hours.

Work Charging

Work charging is an interesting challenge: the car is going to be at the office for about 8 or 9 hours, but if the car was charged at night and is above 80% when destination charging starts, it may take only 2-3 hours to complete the charge. At many workplaces there are 2 or 3 shifts of people shuffling cars and coordinating with each other through corporate chat systems. I usually go for a spot that doesn't have an available charger, relying on colleagues to move the charger over when they finish the first shift, leaving me the option of moving the car in the early afternoon or, if there are spots open, not at all.

Shopping or Leisure Charging

Destination charging at a shopping center or leisure location is a slightly different scenario. Although most destination chargers are L2, the amount of time spent at the location may only be a few hours. For the purposes of this post, I'll assume that the average distance to shopping or leisure is around the same 20-40 miles from home.

In most cases the amount of time that a car is parked and charging will be slightly or considerably less than the amount of time needed to get the car fully charged.

Future Charging Directions

Charging at scale will present some critical challenges for electrical infrastructure. A large number of DCFC charging stations can easily overwhelm the available power within a particular area. Commercial installations for high-power charging will invest to support what is needed. Destination charging may struggle with the electrical installation requirements for charging at scale.

Battery Smoothing

For home L2 charging, solar and house batteries like the Tesla Powerwall will be extremely useful for ensuring spare charging capacity is available. Although the Powerwall 2 stores only around 13.5 kWh, much less than the capacity of an EV battery, it does help reduce the dependence on the grid for charging.

Another benefit of energy storage is that it allows short term peak consumption of energy above what would be available from the utility.

Adaptive Charging

As described above, most destination charging has a mismatch between the amount of time that a car will be plugged in and the amount of charge that is required.

Allowing cars to declare a target departure time and adjusting the charge rate ensures that the car is fully charged at the time the driver is ready to depart. If a car needs 10 kWh of energy, it can charge at a very low rate for 8 hours and be fully charged by the end of the day. Likewise at a shopping center, declaring a departure time allows slower but consistent charging to get a car fully charged by the time the customer is ready to leave.
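
A sketch of the rate calculation, using the 10 kWh over 8 hours example above with the 6.6kW L2 ceiling discussed earlier:

    def adaptive_charge_rate(energy_needed_kwh, hours_until_departure,
                             max_rate_kw=6.6):
        """Spread the needed energy across the declared dwell time,
        capped at what the charge point can deliver."""
        return min(energy_needed_kwh / hours_until_departure, max_rate_kw)

    print(adaptive_charge_rate(10, 8))  # 1.25 kW over a work day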

This has two benefits:

  • Allowing more charge points to be installed
  • Reducing the electrical infrastructure required to support the peak load from the number of charge points installed.

Powerflex offers such a system, having installed some of these systems in the Los Altos/Mountain View school districts.

Obviously, adaptive charging will not be suitable for road-trip style charging, where the driver wants to leave as soon as possible.

The Future – Energy Storage and Adaptive Charging

Ultimately, the charging future I see is a blend of energy storage and adaptive charging as the EV revolution continues, with places in Europe expecting to be weaned off internal combustion engines by around 2025. I look forward to driving into a car park, being presented with 50 to 100 L2 charging stations, and selecting when I think I'll be leaving the car park, knowing full well that the car will be mostly charged by the time I get back.

What price should we expect to pay for this sort of charging? It will likely be per kWh plus a small space usage rate. I’d be happy paying a 50% to 100% premium on the best charge rate that I’d be able to get at home.

Muscle Memory for Management with Mentoring

For just over a year now, I have been involved with Plato as a mentor.  If you don't know, Plato is a subscription platform that links engineering managers to mentors.  As a Mentee, you sign up for unlimited mentoring sessions and access to a Slack channel.  As a Mentor, you make yourself available for one or two 30 minute slots per week.  Plato's system allows Mentees to connect with Mentors who are likely able to provide insight or guidance on their particular leadership problems.

A few people have asked me about the lack of symmetry between the Mentee, the Mentor, and Plato.  I've thought about this quite a lot to make sure I have a mental model for the collective benefit the 3 parties get.  The Mentee and Plato relationship is clear: Plato provides access to Mentors, and the Mentee (or the Mentee's company) pays for the privilege – a fairly simple charge for facilitation.

What about the Mentor?  The Mentor receives no direct monetary reward, although they do have the opportunity to expand their network through Plato's networking events.  Plato, meanwhile, receives access to the top tier of engineering leadership, all quite happy to provide time to the paying Mentees.  So on the surface, the Mentor receives very little as part of this relationship.

Why have a Mentor?

Personally, I moved through most of my career without what I would call a strong mentor or coach.  My lessons were learned the hard way, and I apologize in retrospect to those employees who helped me shape my management views and understanding of how business works.  As I've progressed in my career and become a more experienced manager, I've come to strongly support coaching and mentoring.  It's a way to help avoid some of the leadership potholes that, as humans, we are prone to stumble into again and again.  Guidance or advice at just the right time can help immensely.

I strongly recommend that people find Mentors to help navigate their careers.  Even if you are a naturally introverted person, you should still look to find a Mentor; it's hard, it's awkward, but it is still valuable.

One comment I've heard from a couple of Mentees is that they are concerned they will be imposing on the time or advice of the Mentor.  Generally, Mentors see mentoring as a professional gift to the industry.  Assuming a formal or semi-formal mentoring relationship, and unless you have an agreed cadence of communication, contacting them once or twice a month is within what most Mentors expect.  Most Mentors are senior professionals; they will know how to ask you to back off.

Why Mentor?

As a Mentor, I have the opportunity to pass my experiences forward.  I can see new engineering managers making the same mistakes that I made, bringing forward the same assumptions.  I look for small ways that I can help the engineering management community avoid re-learning those lessons the hard way.

Why do I Mentor through Plato?  

Plato provides a very interesting value proposition for a Mentor.  I personally have 1 mentoring session each week, plus a monthly panel mentoring session, so I'm working with about 7 Mentees each month.  Each of these Mentees is at a point where they have a particular problem that they want or need to solve, so these are real problems needing real advice and guidance.

This speed-dating approach to mentoring actually has some interesting benefits for the Mentor.  The biggest benefit that I get is the need to think through rapid-fire management scenarios during the 30 minutes; typically 2-3 topics are discussed in each session.  This type of mental engagement and scenario practice would typically take months of real-world management to experience.

Muscle Memory and Tuning Judgement

Like any skill, practice makes perfect.  When a martial artist practices moves and scenarios over and over, they are training their subconscious to recognize the scenario and automatically respond in a practiced way.  Similar near-automatic or intuitive judgment comes to others, like chess masters who can see a layout of chess pieces on a board and have a good idea about the state of the game and the likely good moves that can follow.  Grandmasters can usually detect how close to an end-game the board is.

What appears as expert intuition usually cannot be attributed to gut feel, but rather to practiced judgment.  This practiced judgment comes from automatically recognizing a scenario and automatically knowing sensible next steps.  The more experience you get, the better your recognition of a scenario, and the better you are able to respond.

Managers go through a similar process.  A manager either needs to have a large amount of experience, or needs a way of mentally exercising these scenarios again and again.  In the real world, a manager deals with staff issues that come up a few times a week and take months to resolve.

When mentoring, a Mentor is able to quickly work through a large number of these scenarios, at a rate faster than they would experience in their normal professional life.  This helps train that practiced judgement, making the manager more effective and faster on their feet.

But be Aware of Bias

A highly self-confident Mentor can train poor judgment in the same way.  If a Mentor isn't careful, their responses to the Mentee's problems can drift from what works well for the Mentee to what is simply trained into the Mentor's mental model.  This is where both the Mentee and the Mentor have some responsibility.

The Mentee's first responsibility is to take the advice as a well-intentioned response to the scenario presented.  The Mentee must integrate the advice into their real world scenario and take action as appropriate.  Each Mentor's solution may be only a partial solution to the Mentee's problem.

The Mentor's responsibility is to both appraise the scenario and determine what they might do in that situation.  The appraisal stage is the most critical.  A Mentor that doesn't adapt to the Mentee's actual situation may end up pushing the Mentee down the knife edge that represents the Mentor's own history.  Hoping that you can take the same life steps as your favorite artist and repeat their results is folly.  The linear path that their life has taken is itself a knife edge: any variance, be it a chance meeting or a different decision, and the path is different.

Career Stages of A Typical Software Engineer

This is a repost of a long form response to a Quora question (What are the typical stages in the career of a software engineer?), which itself was derived from an earlier answer – Matthew Tippett's answer to What does this mean for a software engineer: "You are not at a senior level"?. It has been adapted and tweaked for the new question, and now tweaked again for the blog.

Each company will have its own leveling guide; ask your HR department or search your intranet. It should go through a clear set of expectations for each grade and the different attributes that each grade should possess. Your manager may not even be aware of it, but it should provide a basis for you to understand the career progression.

Flat organizations will have only 3 or 4 levels (Jr/Mid/Senior/Exec); other organizations will have many (Assoc/Eng/Snr Eng/Staff Eng/Snr Staff Eng/Princ Eng/Snr Princ Eng/Dist Eng/Fellow). Apply the following to your company and don't expect direct portability between companies.

Grade Levels

First, we'll go over a hypothetical set of grades – generally representative of a lot of companies. Some will have different titles, but they will generally have a common set of attributes.

The career level is arbitrary, but reflects what you'd expect the middle of the curve to be operating at. Individuals will peak at a particular point and will then progress more slowly. Realistically, most good people will peak at what I am calling a Staff Engineer. Some will get frustrated with the leadership aspect of the senior grades and peak at Senior Engineer. The management ladder equivalence is also arbitrary, but should serve as a guide.

  • Junior/Associate Engineer/New College Grad – Assumed to know nothing; can code, but has minimal understanding of how businesses work and what a professional life entails. Hand held or teamed with a more senior engineer to help them get an understanding. Career level 0–2 years.
  • Engineer – Assumed to be able to work through tasks with minimal supervision. Will come back and ask for more work. Not expected to identify and fix secondary problems. Not expected to drive generalized improvements or be strong advocates for best practices or improvements. Quite simply a “Doer”. Scope is typically at a sub-component level. Career Level 2–5 years.
  • Senior Engineer – Beginning to be self directed. Expected to be able to work through small projects and foresee issues that may come up. Likely expected to mentor or lead sub-teams or development effort. Scope is typically at a component or subsystem level. Career Level 5–10 years – equivalent to a team lead.
  • Staff Engineer/Architect – Runs the technical side of projects, leader and mentor for a team. Holder of a high bar for best practices, quality and engineering workmanship. Scope is across a system, or across multiple subsystems. Career Level 10–20 years – equivalent to a manager.
  • Fellow/Distinguished Engineer – Runs the technical side of an organization. Interlopes on many projects, defines the strategic direction for the technology. Career Level 15–30 years – equivalent to a director or VP.

It’s not about the code

Hopefully it becomes clear from the descriptions that from Senior Engineer and up, the technical role includes an increasing amount of leadership. This is distinct from management. The leadership traits are about having your peers trust and understand your direction, being able to convince peers, managers and other teams of your general direction, and being able to deliver on the "soft" skills needed to deliver code.

Amazon’s Leadership Principles actually give a pretty good indication of some of the leadership needs for engineers.

There is a tendency for organizations to promote based on seniority or time in role, or even worse, based on salary bands.

Applying this to Yourself

  1. Ground yourself in what your level means to you, the organization and your team. There may be three different answers.
  2. Introspect and ask yourself whether you are demonstrating the non-management leadership aspects of a team leader or junior manager. Do you show confidence? Do you help lead and define? Do you demonstrate an interest in bringing in best practices? Do you see problems before they occur and take steps to manage them?
  3. Consider where you are in your career.

Your Career is a Marathon

A final thought: although the original question indicated only a few years in the industry, I've seen engineers gunning for "Senior Engineer" 3 years out of college and Staff Engineer 3 more years after that. My big question for them is: what the hell are you going to do when you get 6 years into a 40 or 50 year career and realize that you've peaked, or that you face some serious slow grinding for the next 20 years? I'm concerned about good engineers who become fixated on the sprint to the next title and not the marathon of their career.

Exploring Cognitive Engagement in Amazon’s Six Page Narratives

In a previous post, I discussed the Evil Genius of Amazon's Six Page Narrative, exploring via a Quora post how the document is structured and why it works so well.  In Jeff Bezos' Financial Year 2017 Letter To Shareholders, Jeff covers the Six Page Narrative and goes into the heavy polishing that a good Narrative requires.

In the Six Page Narratives that I have read, reviewed or discussed, I have always been frustrated by the tendency of authors not to use standard mechanisms to ease the cognitive load on the reader.  For example, below is a typical paragraph that you might find in a Six Pager.

Based on our review of the customer surveys, we can see that US customer preference is Product A – 10%, Product B – 40%, and Product C – 20%.  EU interest is Product A – 20%, Product B – 50%, and Product C – 10%.  Finally, JP customers have a preference of Product A – 40%, Product B – 20% and Product C – 15%.  Consequently, we will be focusing on Product A and Product B.

To me, this is clearly tabular data that should be structured in a way that walks the reader through the data providing support for the argument.

Product preference by geographic region:

    Region    Product A    Product B    Product C
    US        10%          40%          20%
    EU        20%          50%          10%
    JP        40%          20%          15%

As can be seen, there is a clear worldwide preference for Products A and B.

With the narrative format, the information needs to be pulled apart by the reader to clarify and confirm the conclusion.  In the tabular format, the information is presented for simple confirmation of the interpretation.

It has always felt to me that the narrative form is unfair and unsympathetic to the reader, forcing mental gymnastics where gymnastics should not be needed.  In my own writing, I have always found the decision to tabulate vs narrate to be driven primarily by information density and the valuable space consumed, since in some cases every line counts.

Recently, I read Thinking, Fast and Slow.  In this book, Daniel Kahneman gave me that lightning bolt answer to what had vexed me about Six Page Narratives so much.

Six Page Narratives are typically consumed in Amazon's infamous Reading Meetings, where a number of senior leaders take the first 10-15 minutes of a meeting to read a Narrative or PR-FAQ before discussing it.  The senior leadership in these meetings are generally very smart and have years of experience.  You want this leadership team to be engaged in reviewing the document and to surface areas that the author and their supporting team may not have considered.  You need the reader to be cognitively engaged to be able to successfully provide that extra input.

According to Daniel Kahneman's book, when a reader has to do cognitive work to consume some information, they will typically think deeper and more broadly than if they were presented the information in a way that lowers cognitive load.

Assuming that Thinking, Fast and Slow is correct, it puts the onus on the author of a narrative to make a conscious decision as to where the knife edge sits between making the reader think through the problem, possibly gaining deeper insights, and presenting the information so the reader can take the cognitively easy course.  Or, put slightly differently, how to choose between engaging a reader and simply informing them.

Meltdown and Spectre Computer Vulnerability Cubicle Magnets

When something bad happens, like Meltdown, Spectre, or Heartbleed, your engineers end up having late nights and grueling daily update meetings.  At the end of the day, when the dust settles, they don't get a Service Ribbon that shows their involvement in those days of battle.

With the security industry creating well-recognized brands and logos for some of the bigger vulnerabilities, we might have the opportunity to make sure those who join the firefight get the tech equivalent of the service ribbon.

At Badgly, we've created some cubicle magnets that commemorate those days when you lost lunches, nights and weekends to solve some of your company's biggest and most burning issues.  We've got a set of Vulnerability badges for Meltdown, Spectre, and Heartbleed.  We'll update this post with images when they get back from the printers.  Some sample renderings are below.

For geekiness points, all the vulnerability badges will be golden-ratio rectangles.  Is there a vulnerability that you’d like to see memorialized?  Post comments below.


The Evil Genius of the Amazon Six Page Narrative

One of the leadership tools that Amazon uses internally is the Six Page Narrative. A decision that needs to be made is presented as a narrative document, restricted to six pages, with appendixes. The outcome of the narrative is a go/no-go on the subject. The meetings that discuss the narratives are typically 15 minutes of silent reading and note taking, followed by 30-45 minutes of deep dive questions and clarifications. The best possible outcome is 5-10 minutes of discussion and then a "Good narrative, I agree, go do it!". A bad outcome is "I now have more questions than I started with, let's stop the meeting now and meet again in two weeks."

The narrative is actually fairly evil in its construction. The six pages and the structure of the document itself are ultimately fairly arbitrary. It is the forced revision, improvement and validation required to fit into six pages that is critical. As many a good orator has noted, conveying detailed content concisely takes longer to prepare than long form.

The six pages is a hard limit for the narrative (and no, you shouldn't drop to 6 point font with 0.25 inch margins; that defeats the point). Forcing a limited number of pages forces the author (or authors) to go through numerous drafts to reshape the document, polishing it by ejecting, rewriting, abstracting and summarizing as they go. This produces a reasonable taxonomy and structure of information within the subsections and a good ordering of information. This repeated revision to fit takes on the mental work that the reader must otherwise undertake to correlate and rationalize the information. If the reader is taken through a clear path where their questions are almost immediately answered, it shows clarity of thought and understanding, increasing confidence in the author far more than a set of bullet points on a powerpoint.

One final part that is overlooked in the six-page structure is the appendixes. The appendixes carry the data, the validation, and the information that feeds into the narrative – not needed to follow the narrative, but needed for completeness and cross-reference. You can say "our data supports this information as follows", and the appendix allows the reader to dive deeper into the data to ensure that they would draw the same conclusion; but assuming that the data presented in the narrative is interpreted correctly, the narrative can still hold its own.

This approach is common when needing to write a concise 1 or 2 paragraph email. You write what you want, then re-work, re-work, re-work. Placing a sometimes arbitrary boundary on an output forces a deeper consideration than would otherwise be delivered.

Other Amazonian documents have a similar templated structure that is to some extent inviolable. The 1 pager, the working backwards document, the root cause analysis – all structured to force the presenter to think, structure and organize their thoughts. For communicating business information, I don't think I've seen it necessary to present a Powerpoint for quite some time.

The narrative form still has some risks. To stick with the narrative form, authors are sometimes tempted to inline what is better communicated with tables: "a) option x – 15, b) option y – 25, c) option z – 10". By its nature, this information is tabular in form. A small structured table can carry a lot more information and take considerable cognitive load away from a reader. The narrative author must balance the use of prose with other, information dense methods of presenting information.

What other documents does your organization use to communicate ideas? Are the powerpoint templates still the ruler of your domain?

I've also added a deeper dive into part of the six page narrative with a discussion of cognitive engagement.


This is a Quora crosspost from this answer. I'm reposting my popular answers here for posterity. Obviously it was written in a different context, so it has been modified slightly, primarily putting context inline.

Towards a Skate-Dad (Day 1)

So this Christmas, we bought our 10-year-old a skateboard.  As part of ensuring there is a bit of extra commitment, I’m joining them in learning to skate.  So let’s put it into context here…  43 years old, 225 lbs, mostly sedentary lifestyle tech geek…  so that probably paints the picture.

So why am I learning to skateboard?  The first part is to do the whole role model thing.  I have reasonably high expectations for my kids to be resilient through adversity and to put in the time to practice to get good at things.  If I expect them to get up after hitting the pavement, they can expect the same of me…  They are a hell of a lot less fragile than I am and should be able to get up and move after a fall that would wipe me out.  I'm also doing this as part of a general effort to get fit and ensure that I can still learn new physical skills.

What's my end point?  Realistically, I currently have little interest in doing tricks.  I'll probably declare personal victory when I can push, tic-tac, pump on a half pipe and turn…  We'll see how much further it goes from there.

This part of the blog will give periodic updates on my quest to get to that endpoint.  Today is day 1…  My kit is an 8.5″ board with 54mm wheels, from the local Skateworks shop.  I'm a bit of an analytical geek, so I'll also be writing up some of my thoughts, most likely ill-founded, for others to pick on.

So Day 1…  Practicing pushing and very simple tic-tacs… becoming kind of comfortable.  Of course, on day 1 something has to happen.  If my memory serves me right, my front foot wasn't in the right spot when I tried to push, and it all went downhill from there.  I ended up rolling my left ankle and landing on my right hip.  No bruising on the hip, but the ankle is now strapped.  We'll see how the healing goes, but I'm eager to be helping the kids out at the park again before the new year break finishes.

Fortunately, or maybe not so fortunately, it took a few hours for the pain from the rolled ankle to kick in, so I got about another hour of practice at the skatepark.

ng-Whatever

We've all done it: sat around a table dissing the previous generation of our product.  The previous set of engineers had no idea, made some stupid fundamental mistakes that we obviously wouldn't have made.  They suck, we're awesome.  You know what?  In 3 or 5 years time, the next generation of stewards of the system you are creating or replacing now will be saying the same thing – of you and your awesome system that you are slaving over now.

So what changes?  Is the previous generation always wrong?  Are they always buffoons who had no idea how to write software?  Unlikely.  They were just like you at a different time, with a different set of contexts and a different set of immediate requirements and priorities.

Understanding Context

The context in which a system was created is the first critical thing to understand. Look at the priorities, the tradeoffs and the decisions that had to be made when the system was first created.  Were there constraints that you no longer have in place – were they restricted by infrastructure, memory, performance?  Were there other criteria driving success at that stage – was it shipping the product, managing technical debt, or were there gaps in the organization that were being made up for?  What was the preferred type of system back then?

Understanding these items allows you to empathize with the system's creators and understand some of the shortcuts they may have taken.  Most engineers will attempt to do their best based on their understanding of the requirements, their competing priorities and their understanding of the best system that can be implemented in the time given.  Almost every one of these constraints forces some level of shortcut to be taken in delivering a system.

Seek first to understand the context before deciding that the previous team made mistakes.  When you hear yourself making comments about a previous team, a peer team or another group not doing things the way that you would like, look for the possible reasons.  I've seen junior teams making rookie mistakes, teams focused on backend architectures making front-end mistakes, and device teams making simple mistakes in back-end systems.  In each of these contexts, it is fairly obvious why the mistakes would be made.  Usually, it will be within your power to identify the shortcoming, determine a possible root cause by understanding the context, and shore up the effort or the team to help smooth things over and produce a better outcome.

Constraining Your ng-Whatever

When faced with frustration at a previous system, consider carefully whether you want a full re-write into an ng-whatever system, or incremental changes with some fundamental breakpoints that evolve, refactor and replace parts of the system.

It is almost guaranteed that the moment a system gets an "ng-Whatever" moniker attached to it, it becomes a panacea for all things wrong with the old system, and begins to accrete not only the glorious fixes for the old system but also a persona of its own.  This persona will appear as "When we get the ng-whatever done, we won't have this problem…".

These oversized expectations begin to add more and more implicit requirements to the system.  Very few of these expectations will actually be fulfilled, leaving a perception of a less valuable ng-Whatever.

Common Defect Density

I'm going to come out and say that most engineering teams, no matter how much of an "Illusory Superiority" bias they may have, are going to be at best incrementally better than the previous team.  With that said, their likelihood of introducing defects into their requirements, design or implementation will be more or less the same (depending on how the software is being written this time around).

The impact will typically be that the business trades a piece of potentially battle hardened software with known intractable deficiencies for a new piece of software with bugs that will only be ironed out in the face of production.  Even worse, there will always be a set of intractable deficiencies that are now unknown – only to be discovered when the new software is in production.

When the original system was created, it is highly unlikely that the engineering team knowingly baked in a set of annoying deficiencies.  Likewise, the new system will, to the best of your team's understanding, not have any deficiencies baked in.  You need to make a conscious decision to take the risk that the new issues will be less painful than the old issues.  If you can't make that call, then sometimes refactoring and re-working parts of the system is the better solution.


What have your experiences been with ng-Whatevers?  Have you found that your team can reliably replace an older system with a new one, and that in a few years time the new system is held in higher esteem than the original?  Follow this blog for more posts, or post comments below on this topic.


Onboarding, Technical Debt and The Future Self

Any organizational leader is challenged to quickly get new engineers active and effective as they join the company.  This is commonly called "onboarding".  Common approaches include:

  • Boot camps, similar to Facebook's
  • Intern style mini-projects
  • The dreaded fix-a-few-bugs in the first week
  • The sink or swim: here is your desk, hit the ground running

All of these approaches (except the last one) attempt to:

  • Familiarize a developer with the code base
  • Do generally low-risk changes and get the code into mainline or production
  • Start to understand the systems and tools that the team uses

Generally, the stated intent of onboarding an engineer is to bring them up to speed and make them productive.  In this post, I'd like to turn this on its head and ask the industry to look at the onboarding process not as a way to get an engineer started, but rather as the way the existing engineering team showcases the code, the architecture, the behavioral norms and the vibe of the engineering team.

Throughout this post I'll lean on the analogy of a rich elderly aunt visiting from out of town.  The onboarding process is like that awkward first few minutes where you walk through the house showing the washroom, the kitchen, where they are sleeping and the other things needed to make the stay welcoming.  However, that five minute tour usually comes at the end of a few days of frantic cleaning, scrubbing, removing trash, setting rules for how the kids should act, and so on.  After all, your familial reputation is on the line here.  Who wants an awkward Thanksgiving or Christmas with the little comments, the glances and the whispering from the aunt who has formed her opinion of how you operate as a family?

When a new engineer arrives at a company, the deepest scrubbing we usually do is to make sure their desk is clean and the computer is there.  This leads to awkward recruiting moments when the reputation of the company is passed on to other engineers as the new engineer is asked "How's the new job?".  In the same way that the elderly aunt is associated with you but not invested in you, so is the new employee who doesn't yet feel part of the team.

Inverting Onboarding

Turning the onboarding process on its head also has some material impact on the way the engineering team sees itself.  When you have a visiting relative, you want to show that you are the good part of the family and have your things in order.  Use the onboarding experience as a way to ensure that the team presents a real demonstration of how the team is awesome.  You can't fake awesome for very long.

Make the leaders of the organization accountable for how the team is presented to the new hire.  Make those leaders ensure that the 2-4 week honeymoon period for the new hire sets the basis for a long term relationship.  The discussion shouldn't be whether an engineer is now productive for the company; it should be whether the company has made an engineer productive.

That inversion goes a long way to extending that honeymoon from a few weeks to a few years.  That inversion helps engender an organization that fosters growth and development during an engineer's tenure.  That inversion changes the way organizational leaders look at how they run the organization.

Make Your Leaders Accountable for Onboarding

The critical part of this inverted view of onboarding is making the leaders of your engineering team responsible and accountable for the onboarding of new engineers.  Some indicators of issues to consider when examining the onboarding process:

  • Can the new engineer check out and build with no major hiccups?  Typically there is a verbal tradition that carries engineers from the written references to being productive.  ("Oh yeah, you need to do this or that, we should update that.")
  • Does the new engineer need to talk to other engineers for a basic bootstrap of their environment?
  • Does the new engineer need things explained on a whiteboard?

Realistically, the new engineer has little ability to influence how quickly they come up to speed.  There will be an intrinsic ability that the new hire carries, but the largest influence on how the new hire onboards is carried by the organization.

By making the hiring manager and their lead engineers responsible for the effectiveness of the new engineer, you force introspection within the team about how they manage their development: how much of the way the team operates is transparent, captured, and communicable, and how much is opaque, a form of verbal tradition.

Some measures that I push my teams to meet for a new dev are:

  • After the initial orientation and first login, be able to sync the code and start a build within 1 hour.  (This forces the hiring manager to ensure the permission groups, environments and access are all done before the hire starts.)
  • Have the engineer be involved in a code review within the first day, but before their first change. (This sets the bar for how the code reviews operate and the expected conduct for the engineers on the team).
  • Have a problem or proving task resolved and pushed to a staging environment, device, or build within the first day. (This drives an understanding of the way development goes from the developer to production.)
  • Have a debrief after the first day and the first week about the questions, points of confusion, suggestions and best practices that the new hire misses.  (This helps drive introspection for the team for the next round of hires).

Equivalence to Technical Debt

A lot of these negative indicators, and the difficulty in achieving the measures above, are driven by the technical debt that an organization is carrying, either consciously or subconsciously.  Opportunities to spot the technical debt that is causing friction within an organization are golden.

Technical debt in a development team is most visible to a new hire.  They walk headlong into it with open eyes and an inquiring mind.

Helping the Future Self

The future self concept covers the lack of familiarity that an engineer will have with code or a system that they are working on now when they need to come back to it at some point in the future.  Over time, poorly written code – even code written by yourself – becomes that code written by the clueless engineer a few years back.

Technical debt is usually correlated with old, crappy code.  However, while engineers are writing the code, there is no assumption that the code going in is crappy.

A new engineer being onboarded is in exactly the same place that a future self or a new transfer into the team will be.  They don't have the history, so it will always be more jarring to them.  Future selves and new transfers will have enough history that they will orient themselves back to familiarity very quickly – warts and all.

Learn From the New Engineer’s Honeymoon Period

Engineering leaders should look closely at the questions, feedback, and interactions that a new hire has with their new code, systems, and team.  New hires don't have the history or familiarity with the code that lets them skip the opaque or confusing bits of how you do development.  This will help not only with onboarding new hires, but also with engineers' future selves and people switching into the team.

Those first few weeks of a new engineer are extremely valuable; use them as a barometer for how effectively the team is managing technical debt and creating maintainable code and systems.

Don’t squander those first few weeks.  Also, keep your house in a state that your elderly aunt would like.


As always, vehemently opposed positions are encouraged in the comments.  You can also connect with me on Twitter @tippettm, on LinkedIn via matthewtippett, and on Google+ as +Matthew Tippett.  You can also follow me on Quora, or simply follow this blog.