EngineeringManagement – Praveen's Almanac

Grow the scope or not?

At some point in their career, every manager has to answer the question: “Should my team take on more scope, or should we keep doing what we already do well?”

The general sentiment around this question tends to swing between two extremes:

Growing scope helps your team’s overall output.
Do what you are doing well. There is still room to improve. Don’t fall into the trap of scope growth.

Unfortunately, for most managers facing this decision, neither extreme is useful. In this post, I dig into why these extremes rarely work and what a reasonable framework for deciding on scope growth looks like.

Growing scope to increase output

The most obvious problem with scope growth is the time commitment it demands from your team. Time spent on additional scope is taken away from current commitments and from work that could have improved your on-call health. It also costs the team focus, which very likely hurts quality, and raises the risk of engineer burnout.

The optimist in you might say, “I will get budget for additional engineers, which should help with these problems.” Well, you are in for a surprise: a team’s output doesn’t always grow linearly with its size. Many factors get in the way, for example:

Your team’s engineering process might not be mature enough to absorb new members seamlessly. This includes the quality of onboarding docs, the code review process, CI tools, the stability of the code base, and the maturity of coding and deployment best practices. Without these in place, you are very likely to make matters worse and slow everyone down.
Have you accounted for the time spent hiring? This cost is easy to ignore because it is distributed across many people: your time as a manager sourcing candidates and doing sell calls, 5-8 other engineers interviewing each candidate, the time spent writing feedback, and follow-up debrief meetings. Remember that most top engineering companies have acceptance rates in the low single digits, so the number of candidates you need to screen for each engineer you hire is one to two orders of magnitude larger. Ask yourself whether your team’s time is better spent elsewhere.
Onboarding new engineers has a real cost. New engineers are not going to be productive immediately. They need to be mentored and to familiarize themselves with the code base (which means their initial code reviews will demand a lot more time from existing engineers), their potential for introducing bugs is higher, and they need to spend time understanding dependencies.
Can your engineering systems scale to new use-cases?
Is the new scope or opportunity complementary to your team’s current portfolio? If not, you suffer from a lack of focus and cannot leverage your existing infrastructure or the team’s institutional knowledge. Most likely there is a team in your company better suited to do this than yours.
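The hiring cost in the list above is easy to underestimate because it is spread across many people. The back-of-the-envelope sketch below turns it into arithmetic. Every input (acceptance rate, share of candidates who reach a full interview loop, hours per interviewer) is a hypothetical assumption to replace with your own numbers, and `FunnelAssumptions` and `hiring_cost` are illustrative names, not an existing tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FunnelAssumptions:
    # Hypothetical inputs: replace them with your own company's numbers.
    acceptance_rate: float        # fraction of screened candidates who are hired, e.g. 0.02
    onsite_rate: float            # fraction of screened candidates who reach a full interview loop
    interviewers_per_loop: int    # the post cites 5-8 engineers per candidate
    hours_per_interviewer: float  # interview, written feedback and debrief, per interviewer


def hiring_cost(hires: int, a: FunnelAssumptions) -> tuple[int, int, float]:
    """Return (candidates to screen, full interview loops, engineer-hours spent on loops)."""
    # round() guards against float noise nudging an exact quotient just above an integer.
    candidates = math.ceil(round(hires / a.acceptance_rate, 6))
    loops = math.ceil(round(candidates * a.onsite_rate, 6))
    engineer_hours = loops * a.interviewers_per_loop * a.hours_per_interviewer
    return candidates, loops, engineer_hours


example = FunnelAssumptions(
    acceptance_rate=0.02, onsite_rate=0.25, interviewers_per_loop=6, hours_per_interviewer=2.0
)
print(hiring_cost(2, example))  # (100, 25, 300.0)
```

Under these made-up inputs, two hires mean screening about 100 candidates and roughly 300 engineer-hours of interviews, before counting the manager’s sourcing and sell-call time.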

As you can see, if we grow scope and team size prematurely, we are likely to fall short of expectations not only on the new scope but also on the quality of current deliverables, and to cause churn in the team. Now let’s look at the other end of the spectrum.

Do what you are doing well

At the other end of the pendulum is continuing to invest only in what the team already does. After reading the previous section, it is easy to see why this option is appealing. But it is not without its faults. Let us look at some of them.

Engineers get better with time and practice

Engineers are human: with time and practice they get better at their jobs, and it is your job as a manager to make sure they do. So what they can handle and deliver in the same amount of time will likely be higher in the future.

Engineering systems get better

Any good company worth its salt will invest in and improve its engineering systems over time. This includes deployment systems, engineering frameworks (for logging, etc.), observability and monitoring, and code review tools. Some of these are process improvements within your own team and some happen at the company level. Over time, you are hopefully driving toward more output in less time, so collectively your team should be able to achieve more in the future.

Engineers want to grow

If you are doing a good job of helping your reports reach their long-term goals, some of your junior developers will soon become senior engineers, and some of today’s seniors will want to become leads with larger scope. It is your responsibility as a manager to pave the way for them.

Scope will not always grow organically within your product. If you are in a team or sub-org where it does, your job is easier. For the majority of teams, it might not. So you need to be on the lookout for new opportunities that complement your existing charter. Without them, you are likely to stall your engineers’ growth and risk attrition. Remember that not every engineer will see this problem ahead of time, and even those who do might not be comfortable speaking up. Once employees decide to leave, it is too late to retain them. It is always better to be proactive.

You need to grow

Don’t forget your own growth. As a leader, you are expected to self-manage your growth to an extent. If you are not constantly looking for new challenges, the comfort trap will soon make you irrelevant and impede your career growth.

Framework for Scope Growth

We can now see that neither extreme is really helpful. So how do we answer the question: “When should we consider scope growth?” The answer is very subjective for every team and company, but I have found it useful to think about it along two axes.

Individual level

As a manager, you are best positioned to know the growth trajectory of your reports and what it will look like 6 months from now. So you need to constantly think about the opportunities you can create for them by then. For junior engineers, these could be new features on existing initiatives (for example, wider rollouts, new country launches, and their associated challenges). For senior engineers, these could be new initiatives within the existing charter of your team or org. Avoid taking on completely new charters at the individual level (exceptions include a very senior engineer or exploratory work).

Team level

At this level, you are mostly thinking about new charters or larger initiatives. There are multiple factors to consider here:

Maturity of engineering systems: Are your team’s engineering systems mature enough to handle a new charter? If not, it is better to address that before considering scope growth.
Current state of the team: How well is your team executing? In the book An Elegant Puzzle: Systems of Engineering Management, Will Larson describes four states a team can be in: falling behind, treading water, repaying debt, and innovating. I highly recommend this framework; I have personally found it very useful in deciding when to cut or increase a team’s scope.
Composition of team: Always consider the team’s composition when taking on a new charter or initiative. How will it disrupt the team’s current execution? Can we take on additional engineers? I personally recommend keeping the ratio of new engineers to existing engineers low at any given time.

By constantly thinking along these two axes, we set ourselves and our teams up for steady, progressive scope growth and, hopefully, new charters and initiatives.

I hope you find this useful. Either way, I am interested in hearing your thoughts and learning from your experiences. Please don’t hesitate to leave a comment below.

Avoiding the vanity trap of the 4 9’s SLO

Service level objectives (SLOs) are a standard way of defining expectations for applications and systems (see SLA vs SLO vs SLI). A standard example of an SLO is the uptime or availability of an application or system. Simply put, it is the percentage of time the service or application responds with a valid response.

It is also common practice in large organizations for an SRE team to track the SLOs of critical systems and gateway services and report them to leadership on a frequent cadence.

But like any tool, SLOs have a motivation and a purpose. Without careful consideration, seemingly well-intended policies can turn them into vanity metrics and cause unintended consequences. In this post, I walk through one such situation I encountered in the past and provide a simple framework that can help avoid these traps.

4 9’s Availability

Figure: Simplified System Architecture of a Company

Consider the architecture above, which is common in many companies. A gateway service fronts all client and external calls and routes them to internal services. In some companies it is also common practice for a central SRE team to track company-level SLOs at this gateway (note: individual teams still measure their own services’ SLOs).

Today, with commoditized hardware and databases, the evolution of container systems, and so on, it has become common to expect a high uptime of 4 9’s (jargon for 99.99% availability). A simple way this is usually measured is the percentage of successful responses (2xx in HTTP) your application sends out of the total number of requests. This has almost become a vanity metric for engineering organizations. Let us look at two classes of problems that arise when these SLOs are defined without proper thought.
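A minimal sketch of that measurement, assuming you have the HTTP status code of every request in a measurement window; `is_success`, `availability` and `meets_nines` are hypothetical helper names. Following the post, anything outside 2xx counts as a failure. Some teams count only 5xx against availability, which is itself a policy choice worth making deliberately.

```python
def is_success(status_code: int) -> bool:
    # The post's definition: only 2xx responses count as successes.
    return 200 <= status_code < 300


def availability(status_codes: list[int]) -> float:
    """Fraction of requests in the window that received a 2xx response."""
    if not status_codes:
        raise ValueError("no requests in the measurement window")
    return sum(is_success(c) for c in status_codes) / len(status_codes)


def meets_nines(status_codes: list[int], nines: int = 4) -> bool:
    """True if failures fit the budget of an N-nines target (4 nines = 99.99%).

    N nines allows at most one failure per 10**N requests. Integer arithmetic
    avoids float surprises such as 1 - 0.9999 != 0.0001.
    """
    failures = sum(not is_success(c) for c in status_codes)
    return failures <= len(status_codes) // 10**nines


window = [200] * 9_999 + [503]  # hypothetical window: 10,000 requests, one 503
print(availability(window))     # 0.9999
print(meets_nines(window))      # True
```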

Lazy Policy Effect

If all you measure is aggregate availability across all endpoints, a single endpoint with a dominant share of your service’s traffic can drive the whole number. But does that traffic share reflect the endpoint’s importance?

For example, consider two endpoints:

A logging endpoint used to capture logs on the device and forward them to the backend for analytics.
A sign-in endpoint used to authenticate the user’s session.

It is easy to see how the first endpoint can receive far more traffic and hide the availability of the second. As long as the logging endpoint is stable, the SRE team’s numbers and SLOs look good week over week.
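To see how traffic share hides the second endpoint, here is a sketch with made-up weekly numbers (not measurements from any real system): the logging endpoint carries 99% of requests and never fails, while sign-in fails 0.9% of the time. `EndpointStats` and `meets_nines` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStats:
    name: str
    requests: int
    successes: int

    @property
    def availability(self) -> float:
        return self.successes / self.requests


def meets_nines(requests: int, successes: int, nines: int = 4) -> bool:
    # N nines allows at most one failure per 10**N requests.
    return requests - successes <= requests // 10**nines


# Hypothetical week of traffic: logging dominates volume and never fails.
log_stats = EndpointStats("logging", requests=9_900_000, successes=9_900_000)
sign_in_stats = EndpointStats("sign-in", requests=100_000, successes=99_100)

total = log_stats.requests + sign_in_stats.requests
ok = log_stats.successes + sign_in_stats.successes
print(f"aggregate: {ok / total:.5f}, meets 4 nines: {meets_nines(total, ok)}")
print(
    f"sign-in:   {sign_in_stats.availability:.5f}, "
    f"meets 4 nines: {meets_nines(sign_in_stats.requests, sign_in_stats.successes)}"
)
```

The aggregate (0.99991) passes 4 9’s, while sign-in, the endpoint users actually feel, has 900 failures against a budget of 10.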

I call this the lazy policy effect. It happens because, when the SLOs were defined, the policy’s authors may never have asked a few critical questions:

What is the purpose of tracking this SLO? Most often the answer will be: “We need to ensure we are providing the best experience to customers.” That leads to question #2.
Is this metric truly achieving this purpose? At this point it is clear that not all endpoints contribute directly to customer experience. So you probably need a policy that segments endpoints by category (customer-facing vs operational, etc.) and measures SLOs only for the relevant ones.
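One way to express the segmented policy suggested by the second question is sketched below. It is an assumption-laden illustration: the two categories, the endpoint paths and the helper names are hypothetical, and a real policy might give operational endpoints a looser target rather than no org-level target at all.

```python
from enum import Enum


class Category(Enum):
    CUSTOMER_FACING = "customer_facing"
    OPERATIONAL = "operational"


# Hypothetical endpoint registry; in practice this might live in a service catalog.
ENDPOINT_CATEGORIES = {
    "/v1/sign-in": Category.CUSTOMER_FACING,
    "/v1/device-logs": Category.OPERATIONAL,
}


def meets_nines(requests: int, successes: int, nines: int = 4) -> bool:
    # N nines allows at most one failure per 10**N requests.
    return requests - successes <= requests // 10**nines


def slo_violations(stats: dict[str, tuple[int, int]], nines: int = 4) -> list[str]:
    """Customer-facing endpoints that miss the target, each judged on its own.

    `stats` maps endpoint path -> (requests, successes). Operational endpoints are
    left to their owning teams. Unregistered paths default to customer-facing so a
    new endpoint is never silently excluded.
    """
    violations = []
    for path, (requests, successes) in stats.items():
        category = ENDPOINT_CATEGORIES.get(path, Category.CUSTOMER_FACING)
        if category is Category.OPERATIONAL:
            continue
        if not meets_nines(requests, successes, nines):
            violations.append(path)
    return sorted(violations)


weekly = {"/v1/sign-in": (100_000, 99_100), "/v1/device-logs": (9_900_000, 9_800_000)}
print(slo_violations(weekly))  # ['/v1/sign-in']
```

Judging each customer-facing endpoint separately, rather than pooling them, keeps a high-traffic customer endpoint from masking a low-traffic one, the same effect described above.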

Bad Incentive Effect

Now let’s look at another consequence of these blanket policies.

Say one endpoint’s purpose is to periodically collect sensor logs from mobile devices. This is most likely a background process that does not interfere with the consumer’s experience on the device. In this case, it is easy to see how a certain level of failures is acceptable for this endpoint: we can have the device retry on failure, or even afford to miss some logs.

But this team unfortunately has to abide by the 4 9’s policy set by SRE. Otherwise, it will drag down the organization-level SLO and be called into the next leadership review to provide an analysis. No matter how well-intended and blameless these reviews are, most teams will try to avoid being called into them, and there are various “clever” ways to do that.

One of them is to add multiple retries between the gateway and the downstream service. Another is to increase the timeout for calls from the gateway to the downstream service. You get the idea.

These will certainly reduce 5xxs (and improve the availability SLO). But they also increase latencies unnecessarily and make these logging APIs consume more resources on the gateway host, which can in turn increase latencies for other, customer-facing endpoints. Often these effects are hard to even notice.
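A rough model shows why these workarounds are attractive and where the cost goes. It assumes each attempt fails independently with probability `p_fail`, which is optimistic since real outages are often correlated, and it ignores backoff between attempts; `RetryPolicy` and the helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    attempts: int      # total tries from the gateway, including the first
    timeout_s: float   # per-attempt timeout at the gateway


def observed_success_rate(p_fail: float, policy: RetryPolicy) -> float:
    # The gateway reports a 5xx only if every attempt fails (independence assumed).
    return 1 - p_fail ** policy.attempts


def worst_case_latency_s(policy: RetryPolicy) -> float:
    # A request that times out on every attempt ties up the gateway this long (no backoff).
    return policy.attempts * policy.timeout_s


def expected_attempts(p_fail: float, policy: RetryPolicy) -> float:
    # Attempt k + 1 is made only when the first k attempts failed: sum of p_fail**k.
    return sum(p_fail ** k for k in range(policy.attempts))


for policy in (RetryPolicy(attempts=1, timeout_s=2.0), RetryPolicy(attempts=3, timeout_s=2.0)):
    print(
        policy,
        f"success={observed_success_rate(0.01, policy):.6f}",
        f"worst_case={worst_case_latency_s(policy):.1f}s",
        f"attempts/request={expected_attempts(0.01, policy):.4f}",
    )
```

With a 1% per-attempt failure rate, three attempts lift the reported success rate from 99% to about 99.9999%, while a request that keeps failing now holds the gateway for up to 6 seconds instead of 2. When the downstream service is badly degraded (`p_fail` close to 1), each request costs close to three attempts, which is exactly when the gateway’s other endpoints can least afford it.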

Even though the organization defined these policies for a better customer experience, they actually degraded it. This might even go unnoticed, because the availability SLO is always met.

Aspects to Consider

When defining such engineering policies in an organization or team, it is important to ask the following questions.

Purpose of the Specific SLO

What is the purpose of tracking this SLO?
Is this metric truly achieving this purpose, and in what situations will it not serve this purpose?

Second-Order Consequences

What does it mean for engineering teams to adhere to this policy/SLO?
What feedback mechanisms from teams do we need to put in place, so that we can adapt these policies and not incentivize teams to build workarounds? Every policy needs to be adaptable, especially policies that demand a large organizational cost to adhere to.

Choosing the Right Managerial Style

What is your managerial style: Leadership or Management? Coaching or Supporting?

I have deliberated on this question a lot during my career. While the definition of each style is well documented, conversations often pitch one against the other. That puts newer managers in the uncomfortable position of picking one over the other and guessing which is better, which leads to wrong choices (it certainly did for me).

Having seen these styles work well and not so well in my career (as both a manager and a tech lead), I now believe that good line managers need to adapt and use both. The success of a style depends entirely on the context and the people.

Management

When I say Management, I am referring to the following operating style:

Overseeing the goals of a team
Being tactical in determining strategy at every step
Being an operational thinker who plans execution steps
Focusing on objectives
Minimizing execution risk and seeking stability
Sometimes teaching by doing

Leadership

At a high level, operating style for “Leadership” is:

Setting the vision and direction
Influencing and coaching people through reasoning (explaining “why” we are doing things)
Making people feel part of the vision and motivating them to be creative in execution while still staying on track
Being a strategic thinker
Optimizing for the long-term autonomy of the team (sometimes trading off immediate risks)

The differences between the two styles are subtle, and it is often unclear which one is ideal. In this write-up, I try to capture a framework for choosing between them.

Learnings From Mistakes

I made one such mistake early in my management career. An engineer reporting to me was working on a project that involved a lot of cross-organizational alignment, planning, and execution. I knew this was a steep step up for this engineer. From the training I had received by then (or, let’s say, the wrong lessons I took from it), I had an idealistic view of a manager leading through coaching rather than being tactical and execution-focused.

Before long, this approach backfired for the team and the engineer. Project execution kept falling off track despite the best efforts of the lead engineer. Everyone involved lacked clarity on schedules, dependencies, and what needed to be done when. All the while, I was still coaching the engineer: guiding them through questions, helping them arrive at decisions, and helping them figure out the path of the project.

So what happened? What the engineer really needed was more hands-on support than just coaching: someone who could help operationalize the project, figure out milestones, and get alignment on deliverables and timelines across the team. Relying purely on coaching, and expecting the engineer to ask the right questions and find a path forward, set them up for failure at that stage of their career.

Experiences like these have convinced me that to be a good engineering lead (at least as a line manager), one has to be able to operate in both styles depending on the context and the people involved. Depending on the context, you need to be able to do any of the following:

Directing – Setting path, operationalizing, assigning clear deliverables
Supporting – Help in brainstorming, provide feedback proactively, teach by doing if needed
Coaching – Let them make the decisions, provide high level directions, ask probing questions, help with setting decision frameworks
Delegate – Trust and only get involved when asked

Learnings From Mythology

I was recently forwarded a story about two leaders in Hindu mythology and their differing styles. It is very relevant to the topic of this post, so I have modified it slightly here to draw parallels to our subject.

The Ramayan and the Mahabharata are two epics in Hindu mythology. The central story of both is the victory of good over evil.

In one story, Ram (the protagonist) leads his army to defeat Ravana in his own land, while in the other, Krishna (the protagonist) oversees the Pandavas’ defeat of the Kauravas in the battle at Kurukshetra.

In the Ramayan, Ram is the best warrior on his side. He leads his army from the front, strategizes, and directs different people to do the things that will meet the objectives. His people, while very skilled, are not capable of operational tactics. Ram sets the direction and also tells people what to do in difficult times. Ultimately, they won the war and achieved the final outcome.

On the other hand, Krishna told Arjuna (a skilled warrior): “I won’t fight the battle. I won’t pick up any weapon; I will only be there on our chariot as a charioteer.”

And he did what he said. He never picked up a weapon and never fought. Still, the Pandavas won the war and the final outcome was achieved.

What is the difference?

It was their managerial style, but also the type of people they were leading and the situation at hand.

Ram was leading an army of warriors who were not skilled fighters and were looking for direction. Krishna, on the other hand, was guiding Arjuna, one of the best archers of his time.

While Ram’s role was to show the way and lead from the front, Krishna played the role of a coach whose job was to clarify doubts and provide the general guidance Arjuna needed to go about his work.

Krishna couldn’t teach Arjuna archery, but he could certainly help him see things from a very different perspective, whereas Ram had to use his superior skills and experience to guide his warriors across difficult terrain.

So they had to operate in two styles:

Ram: A skilled warrior, he was tactical, gave precise roles and instructions (operationalizing the strategy), and motivated the army to fight with a specific cause in mind. He needed the trust of his warriors to be able to do this. Hint: Management.

Krishna: Arjuna was looking for a coach who would provide strategic clarity and explain the vision and why it was needed. Krishna did exactly that: he coached Arjuna and let him and the team take the lead, fight for the team’s cause, and use their skill and creativity to succeed. Hint: Leadership.

What type do you need to be?

Look at the combination of your team, project, and context to work out what type of role you need to play.

Are you someone who keeps answering questions and solving problems for people? Or someone who asks relevant questions so that people can find their own solutions?
Are you someone who tells and directs, is tactical, and operationalizes the plan? Or someone who coaches, sets a path, and lets people find their own way?
Do you have bright engineers who still fall short in executing larger projects? Or do you have an expert engineer who is seeking clarity and direction?

The best outcomes are achieved when you wear the right hat for the context.