
Performance Reviews for Engineering Teams

How to run performance reviews that work for engineers. Covers what to measure, what to skip, and how top engineering orgs handle reviews.

Engineering performance reviews have a reputation problem. Google found that 47% of employees considered their old review process a waste of time. Only 14% of employees across industries think reviews give them relevant feedback.

The usual approach, dropping a generic form on an engineering manager once or twice a year, fails engineers in many ways. Here’s how to fix it.

Why Standard Reviews Fail Engineering Teams

Most performance review systems were designed for roles with visible output: sales numbers, support tickets closed, projects delivered on deadline. Engineering work doesn’t map that cleanly.

A developer who refactors 500 messy lines into 50 clean ones looks less productive than someone churning out verbose code. An engineer who spends two weeks on a thorough code review that prevents a production incident doesn’t show up in any dashboard. The person who mentors three junior developers and unblocks the entire team has lower individual output but massive team impact.

Then there’s the format itself. Engineers are builders, not essayists. Asking them to write multi-paragraph self-assessments in a text box feels like busywork. Most will procrastinate until the deadline, rush through it, and leave out half of what they actually accomplished. The form-based review process fights against how engineers naturally communicate: concisely, in context, and close to the work.

Standard reviews reward what’s easy to count, miss what actually matters, and alienate the people they’re supposed to help.

The Metrics Trap

The instinct is to solve this with data: lines of code, commit counts, PRs merged, story points completed. Engineering leaders increasingly reject this approach. A 2026 benchmark study found 65% of engineering leaders actively avoid lines-of-code as a metric, and 35% avoid “days worked.”

Here’s why these metrics backfire:

  • Lines of code rewards verbosity over elegance. The best solution often involves deleting code, not writing more of it.
  • Commit counts penalize engineers working on complex problems that take weeks of research before a single commit. They also reward commit-splitting games.
  • Story points are meant for sprint planning, not individual evaluation. Using them for reviews incentivizes point inflation.
  • Bug counts without context punish engineers working in legacy codebases or tackling hard edge cases.

None of these capture whether an engineer made good architectural decisions, improved the team’s velocity, or built something users actually needed.

What to Measure Instead

The best engineering orgs have converged on measuring outcomes and impact, not output.

Team-Level: DORA Metrics

The DORA framework measures what matters for delivery: Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Recovery. Elite teams deploy on demand, have lead times under one hour, change failure rates under 5%, and recover from incidents in under an hour.

These are team metrics, not individual scorecards. They set context for individual conversations: “Our lead time doubled this quarter. What’s creating friction?”
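The four DORA metrics fall out of ordinary deployment records. A minimal sketch of the arithmetic, using hand-written sample data (the record shape and field names here are illustrative assumptions, not any particular tool's schema):

```python
from datetime import datetime, timedelta

# Hypothetical deploy records: when the change was committed, when it
# shipped, whether it caused a failure, and when service was restored.
deploys = [
    {"committed": datetime(2024, 5, 1, 9, 0), "deployed": datetime(2024, 5, 1, 9, 40),
     "failed": False, "restored": None},
    {"committed": datetime(2024, 5, 2, 14, 0), "deployed": datetime(2024, 5, 2, 14, 55),
     "failed": True, "restored": datetime(2024, 5, 2, 15, 30)},
    {"committed": datetime(2024, 5, 3, 11, 0), "deployed": datetime(2024, 5, 3, 11, 30),
     "failed": False, "restored": None},
]

window_days = 7

# Deployment Frequency: deploys per day over the window.
deployment_frequency = len(deploys) / window_days

# Lead Time for Changes: commit-to-production latency, averaged.
lead_times = [d["deployed"] - d["committed"] for d in deploys]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change Failure Rate: share of deploys that caused a failure.
failures = [d for d in deploys if d["failed"]]
change_failure_rate = len(failures) / len(deploys)

# Mean Time to Recovery: deploy-to-restore latency for failed deploys.
restore_times = [d["restored"] - d["deployed"] for d in failures]
mttr = sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
```

Real pipelines would pull these records from CI/CD and incident tooling, but the computations themselves stay this simple, which is part of why the framework travels well across teams.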

Individual-Level: Impact at Scope

Google’s GRAD system evaluates engineers on impact scaled to their level. A mid-level engineer (L4) is expected to deliver project-level impact. A senior engineer (L5) should demonstrate team-level influence. A staff engineer (L6) needs org-wide impact.

This framework avoids the trap of comparing a junior engineer’s feature output against a staff engineer’s architecture work. Different levels, different expectations.

The Invisible Work That Actually Matters

Some of the highest-impact engineering contributions are nearly invisible to standard tracking:

  • Code review quality. Not just approving PRs, but catching subtle bugs, suggesting better patterns, and teaching through review comments.
  • Mentoring. Helping a struggling teammate debug an issue at 4pm on a Friday. Pairing on a hard problem. Answering questions in Slack.
  • Incident response. The person who jumps on a production outage at midnight and resolves it in 20 minutes.
  • Technical debt cleanup. Improving test coverage, upgrading dependencies, and fixing flaky CI. Nobody celebrates this work, but teams collapse without it.
  • Design docs and RFCs. The thinking that prevents bad decisions before code gets written.

If your review process doesn’t capture these, your best team players get rated the same as (or lower than) individual contributors who optimize for personal output.

Tools that analyze collaboration patterns across GitHub, Slack, and project management systems can surface this work automatically. Windmill’s organizational network analysis identifies who actually helps teammates ship by analyzing interactions across integrated tools, rather than just counting individual commits.
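The core idea behind this kind of network analysis is simple to sketch. Here is a toy version under stated assumptions (the event tuples are hand-written stand-ins for what real tooling would derive from GitHub and Slack APIs; this is not Windmill's actual algorithm): record who helped whom, then rank people by how many distinct teammates they unblocked.

```python
from collections import defaultdict

# Hypothetical interaction events: (helper, person helped, source system).
events = [
    ("ana", "ben", "github_review"),
    ("ana", "chris", "github_review"),
    ("ana", "dee", "slack_answer"),
    ("ben", "ana", "github_review"),
    ("chris", "ben", "slack_answer"),
]

# helper -> set of distinct teammates they helped
helped = defaultdict(set)
for helper, receiver, _source in events:
    helped[helper].add(receiver)

# Rank by reach: how many different teammates each person unblocked.
reach = sorted(((len(people), name) for name, people in helped.items()), reverse=True)
for count, person in reach:
    print(f"{person}: helped {count} teammate(s)")
```

Counting distinct people helped, rather than raw interaction volume, is deliberate: it resists gaming by repeated low-value comments and surfaces the multiplier behavior the surrounding section describes.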

How Top Engineering Orgs Run Reviews

Google: GRAD System

Google overhauled reviews in 2022 after that damning 47% survey result. The new GRAD (Googler Reviews and Development) system:

  • Runs formal reviews once a year with promotion opportunities twice a year
  • Uses a five-point scale from “Transformative Impact” to “Not Enough Impact”
  • Evaluates on impact, craft, scope, collaboration, and leadership
  • Separates salary discussions from performance feedback by one month, because employees fixate on compensation and stop listening to development advice when both arrive together
  • Supplements with monthly check-ins and a mid-year checkpoint

Stripe: Career Ladders as Culture

Will Larson, who designed Stripe’s performance management system, built it around career ladders that define expected behaviors at each level. Reviews combine four inputs: self-review (engineers compare work against their level’s expectations), peer reviews, upward reviews from direct reports, and manager synthesis. Cycle frequency adapts to company size: quarterly when small, semi-annually as the org grows.

Netflix: No Formal Reviews

Netflix eliminated formal performance reviews entirely, replacing them with continuous feedback and the “Stop, Start, Continue” framework. Anyone can send feedback to anyone, from interns to the CEO. Managers apply the Keeper Test in ongoing 1:1s: “If this person wanted to leave, would I fight to keep them?” It’s not for everyone, but it works at Netflix because they hire senior talent and compensate at the top of market.

Running Better Engineering Reviews

You don’t need to copy Google’s system. But the companies getting this right share common patterns.

Collect Data Year-Round

The biggest failure mode is reviewing from memory. Managers who write reviews based on what they remember from the last few weeks produce biased, inaccurate assessments. Recency bias is the most pervasive problem in engineering reviews.

Fix this by tracking work continuously. Integrate your review tool with the systems where work happens: version control, project management, communication tools. Windmill syncs with GitHub, Jira, Linear, Slack, and 20+ other tools to capture what engineers accomplish throughout the year, so nothing gets lost.
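Whatever tooling you choose, the underlying data structure is a continuous work log per engineer. A toy sketch of the merge step, with hypothetical record shapes (real integrations would normalize API payloads into something like this):

```python
from datetime import date

# Hypothetical normalized events from three systems.
github = [{"who": "ana", "when": date(2024, 3, 5), "what": "Merged PR #412: payment retries"}]
jira = [{"who": "ana", "when": date(2024, 6, 1), "what": "Closed PROJ-88: checkout latency fix"}]
slack = [{"who": "ana", "when": date(2024, 4, 20), "what": "Led incident channel during outage"}]

def work_log(person, *sources):
    """Merge events from all sources into one chronological log for a person."""
    events = [e for src in sources for e in src if e["who"] == person]
    return sorted(events, key=lambda e: e["when"])

for event in work_log("ana", github, jira, slack):
    print(event["when"].isoformat(), event["what"])
```

At review time the manager reads a year of this log instead of reconstructing it from memory, which is exactly the recency-bias fix the paragraph above describes.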

Use Career Ladders, Not Generic Forms

Engineers need to know exactly what’s expected at their level and what the path to the next level looks like. A career ladder that defines competencies, scope expectations, and behavioral markers at each level turns reviews from subjective conversations into calibrated assessments.

Public frameworks from progression.fyi provide solid starting points. GitLab’s, Artsy’s, and CircleCI’s are among the most referenced.

Separate Feedback from Compensation

Google learned this the hard way: when performance feedback and salary changes arrive together, people only hear the number. Schedule development conversations and compensation decisions at least a month apart.

Get Multiple Perspectives

Self-reviews let engineers highlight work managers may have missed. Peer feedback catches collaboration and mentoring contributions. Upward reviews from direct reports give signal on management quality. No single source tells the full story.

Collecting multi-source feedback is where the process usually breaks down, because it means chasing a dozen people to fill out forms. Windmill handles this through Slack conversations: Windy interviews peers with targeted follow-up questions, so feedback is richer and collected without manual coordination.

Make It About Development, Not Just Scores

The best reviews answer two questions: “How did you do?” and “What’s next?” Pair backward-looking evaluation with forward-looking growth planning. Where is this person heading? What skills would accelerate their path? What projects would stretch them?

Getting Started

Start with three changes. First, define clear expectations per level using a career ladder. Second, integrate your review tool with the systems where engineers actually work so data collection isn’t manual. Third, supplement formal reviews with regular 1:1s that keep feedback flowing year-round.

If you want to eliminate the admin burden entirely, Windmill runs the whole process through AI: pulling data from engineering tools, collecting feedback through Slack conversations, and generating review drafts that managers refine rather than write from scratch.

Frequently Asked Questions

How should you evaluate software engineer performance?

Evaluate software engineers on impact at the appropriate scope for their level, quality of technical decisions, collaboration and multiplier effects on teammates, and delivery outcomes. Avoid vanity metrics like lines of code or raw commit counts. Use frameworks like DORA metrics for team health and SPACE for individual contributions.

What metrics should engineering managers avoid in performance reviews?

Avoid lines of code (rewards verbosity), raw commit or PR counts (penalizes complex work), hours worked (measures presence, not impact), and bug counts without context. These metrics incentivize the wrong behaviors and systematically undervalue engineers who do high-impact invisible work like mentoring, code review, and architecture.

How often should engineering teams do performance reviews?

Most high-performing engineering orgs have moved to annual formal reviews supplemented by continuous feedback. Google runs formal reviews once a year with monthly check-ins. Stripe adapts cycle frequency to team size: quarterly for small teams, semi-annually as the org grows. The key is separating ongoing feedback from formal evaluations.

How do you review engineers who do invisible work like code review and mentoring?

Track it deliberately. Use tools that capture collaboration patterns across GitHub, Slack, and project management systems. Include peer feedback that specifically asks about mentoring and code review quality. Windmill’s organizational network analysis identifies who actually helps teammates ship, surfacing contributions that traditional reviews miss.