In the wake of World Series Game 1, I came across a comment thread in a Facebook discussion group that neatly summed up the pitfalls of every fan’s favorite postseason pastime: first- or second-guessing the manager. It started when someone posted a tweet by an ESPN researcher, who noted that Dodgers right-handed reliever Pedro Báez, whom manager Dave Roberts removed in the seventh inning of Game 1 with left-handed-hitting Red Sox third baseman Rafael Devers due up, had not allowed a hit in 31 at-bats against a lefty since July 26. Most of the replies to that tweet—and the person who posted it on Facebook—interpreted the stat as a scathing indictment of Roberts, who clearly had been too beholden to handedness to realize the fact right in front of his face: Báez, who had struck out the first two batters he’d pitched to (including lefty pinch-hitter Mitch Moreland), was a lefty killer all along. So in came southpaw pitcher Alex Wood, and out went a three-run homer by right-handed pinch-hitter Eduardo Núñez, a dagger to the Dodgers in what was until then a one-run game.
“Given the game situation, the ‘eye test,’ and readily available information, the move made no sense at all,” the first commenter said. “Just over-managing based on old, crappy baseball traditions.”
Except, well, wait a second. The second commenter didn’t see it that way. However hitless they are, he argued, 31 at-bats constitutes a small sample. Dig a little deeper, and Báez’s vulnerabilities become clear. Yes, he has flashier superficial stats against lefties, both in 2018 and throughout his five-year career, but the peripherals don’t lie. This season, Báez’s strikeout-minus-walk rates against hitters of each handedness told dramatically different stories: The reliever ranked in the 90th percentile among pitchers with as many righties faced and the sixth percentile among pitchers with as many lefties faced. Clearly Roberts was wisely looking past a bit of BABIP luck and relying on more meaningful and predictive data. “I don’t think this was an indefensible decision at all,” the second commenter said. “I think it was the right move, that backfired.” The debate continued along lines that would feel familiar to every fan who’s ever debated a team’s tactical decision—which is to say, every fan.
Standard sabermetric thinking would support the second commenter more than the first. But even the standard sabermetric thinking about batter-pitcher matchups is increasingly lagging behind the internal intelligence of the stat-savviest teams—many of which, not by complete coincidence, have played in this postseason. October is the time of year when every moment matters and every spectator questions why managers make or don’t make moves. But it’s also the period when the gap between public and private knowledge about how hitters and pitchers perform grows significantly. And even during the regular season, that gap may be bigger than most first- or second-guessers realize.
Over the past few days, I’ve canvassed consultants and analysts for MLB teams—including some playoff teams—to assess the state of proprietary batter-pitcher projections. Their responses strongly indicate that there are more things in heaven and earth than are dreamt of in public projection systems, let alone a glance at single-season splits or even career rates. That doesn’t mean we can’t scrutinize managerial decisions about which players to sit, start, or substitute, but we should be aware that when we do, we’re working with a lot less information and motivation than the people pulling the strings.
There really was a time in the not-too-distant past when major league managers routinely made dumb decisions. Fifteen years ago, Hall of Fame manager Joe Torre repeatedly benched players like Robin Ventura and Aaron Boone in favor of utility man Enrique Wilson—one of the worst hitters in baseball over the course of his career—when the Yankees were facing Pedro Martínez, against whom Wilson had experienced some small-sample success. Against Pedro, Torre said, “I just go with recent history.” Everyone thought it was weird that the weak-hitting Wilson had been good against Pedro, but few fans or writers blamed Torre for riding the hot hand.
In the years since, study after study has shown that the effect of hot and cold streaks is, if not nonexistent, too tiny to be better than a tiebreaker in determining whom to sit or start. And basic batter-vs.-pitcher stats, like Wilson’s 10-for-his-first-20 against Pedro, are virtually useless. A widely read chapter about batter-vs.-pitcher performance in the 2006 modern sabermetric bible The Book: Playing the Percentages in Baseball concluded, “Having 20 to 30 PA against an opponent is a drop in the bucket, and it tells you almost nothing about what to expect.” The chapter continued, “When a particular batter has faced a particular pitcher 200 or 300 times, come back and we’ll talk. Maybe.” Of course, few hitters and pitchers ever face each other that often: Hall of Famer Phil Niekro pitched for 24 years and threw the fourth-most innings in history, and he faced only three batters at least 200 times. And even in cases where a pitcher and hitter have a relatively long history, their skills may morph enough from year to year to render much of that data obsolete. It’s hard to imagine that the results of the times when Niekro and Joe Morgan faced each other as rookies in 1965 still had predictive power when the two squared off for the final time in 1984. “While we make efforts to weigh hot/cold streaks and batter/pitcher matchup history, and there is some analytical value in them, I generally think that value is dwarfed by the value of looking at performance over longer stretches and performance of a larger group of similar batters/pitchers,” one front-office analyst says by email.
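To see why a 10-for-20 is "a drop in the bucket," it helps to run the standard regression-to-the-mean arithmetic. The sketch below is a minimal illustration, not The Book's own method: the function name, the .250 baseline, and the 500 at-bats of "ballast" are all hypothetical placeholders chosen only to show the shape of the calculation.

```python
def regress_to_mean(hits, at_bats, prior_mean, prior_weight):
    """Shrink an observed average toward a baseline.

    prior_weight acts as 'ballast': imaginary at-bats at the baseline
    rate that get mixed in with the real ones.
    """
    return (hits + prior_mean * prior_weight) / (at_bats + prior_weight)


# Hypothetical inputs: a .250 baseline and 500 at-bats of ballast.
# Neither figure comes from the article or from The Book.
wilson_vs_pedro = regress_to_mean(hits=10, at_bats=20,
                                  prior_mean=0.250, prior_weight=500)
print(f"{wilson_vs_pedro:.3f}")  # about .260
```

Under those assumptions, a .500 average over 20 at-bats moves the estimate only about 10 points off the baseline, which is roughly what "almost nothing" looks like in practice.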
The Book’s authors expanded their search for significance by examining prospective performance against “families” of pitchers that hitters had owned in the past, but even then they came up empty. “We wish we could tell you that we’d find something, but we can’t,” they wrote. “And the reason is always the same: small sample size.” The chapter confirmed the lefty-righty platoon effect and discovered a smaller, batted-ball-based platoon effect that hurts hitters who face pitchers with similar ground ball–fly ball tendencies. But its influential findings cemented the perception that the Wilsons of the world were nothing more than mirages. As existing executives read that research, outsider analysts migrated into the game, and managers increasingly bought in or found themselves forced out, dugout understanding spread. Just as largely counterproductive tactics such as sacrifice bunts, intentional walks, and pitchouts have declined dramatically, it’s now far rarer to see a skipper cite a hitter’s history against a particular pitcher to explain why someone is or isn’t starting. And when a manager does invoke a player’s past or refer to a favorable matchup, he’s probably not talking about—or putting too much faith in—a fluky 5-for-9.
Although The Book is a seminal sabermetric work, its authors had much less detailed data to draw on than current number-crunchers can. Months after The Book came out, during the 2006 playoffs, PITCHf/x cameras captured their first fastballs and breaking balls. Within a few years, PITCHf/x and HITf/x were recording every pitched and batted ball in big league parks, and when Statcast succeeded them, an even richer treasure trove presented itself—particularly to teams, who have access to everything the system spits out. Radar and optical tracking technology allows analysts to classify and compare players with far greater precision—and in much smaller samples—than they could before, and the accuracy with which teams can forecast individual matchups has advanced accordingly.
Roughly five years ago, a smart team might have used a conventional in-house projection system—similar to public systems such as Steamer, ZiPS, or PECOTA—to generate estimates of each hitter’s and pitcher’s true talent against average opponents. Those estimates, modified by adjustments for handedness, ground ball–fly ball tendencies, park effects, and other factors, could then be combined to generate an expected outcome of a plate appearance between two specified players. As The Book established, it takes thousands of plate appearances to assess how a certain hitter’s platoon split differs from the typical player’s: A season or two of performance simply isn’t enough to say with stats alone that a player is more or less susceptible to same-handed opponents than the norm. So teams then took steps to refine their platoon adjustments, using minor league splits, pitch types, release points, and even usage data—how a club deployed or protected its players against lefties or righties, which could allow rivals to infer something about those players’ perceived strengths or weaknesses—to take shortcuts to more accurate appraisals. In addition, if a team determined that a player projected to be especially platoon-sensitive, it could use similarity scores to lump like players together and examine how, for instance, similarly platoon-sensitive players tend to fare against other players with extreme splits. Most public systems still aren’t set up to model matchups that way, although some subscription-based daily-fantasy options attempt to.
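One well-known public way to combine a hitter's and a pitcher's rates into a single matchup estimate is the odds-ratio method (often called log5). The sketch below assumes that approach purely for illustration; the article's sources did not say which formula any team uses, and the function name and sample rates are hypothetical. Adjustments for handedness, batted-ball tendencies, and park would be layered on top of the inputs.

```python
def matchup_probability(batter_rate, pitcher_rate, league_rate):
    """Odds-ratio (log5-style) estimate of an outcome in one plate appearance.

    Each argument is the probability of the same event (e.g. reaching base):
    the batter's rate vs. average pitchers, the pitcher's rate allowed vs.
    average batters, and the league-wide rate.
    """
    def odds(p):
        return p / (1.0 - p)

    combined = odds(batter_rate) * odds(pitcher_rate) / odds(league_rate)
    return combined / (1.0 + combined)


# Hypothetical rates: a .350 on-base hitter, a pitcher who allows .300,
# and a .320 league average.
estimate = matchup_probability(0.350, 0.300, 0.320)
print(f"{estimate:.3f}")
```

A useful sanity check on the formula: when both players are exactly league average, the estimate comes back as the league rate.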
The next step, which smart teams started taking a few years ago, was incorporating pitches and batted-ball characteristics on a more granular level. The Book’s authors classified “families” of pitchers by handedness, strikeout and walk rates, and batted-ball rates, which did a decent job of detecting commonalities among pitchers like Jamie Moyer, Mark Buehrle, and David Wells. But teams today can call on so much more: pitch types, sequencing, speed, movement, release point, tunneling, location. Instead of relying on 10 plate appearances against a given pitcher, teams can group together hundreds or thousands of plate appearances that similar hitters have made against that pitcher’s nearest neighbors. And instead of treating one 3-for-4 from the past the same as another—even though some of those hits might be blasted and others might be blooped—teams can look at quality of contact to strip some of the residual luck from the model.
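The "nearest neighbors" idea can be sketched in a few lines. This is a toy version under loud assumptions: the pitchers, their two features (velocity and vertical break), and the hitter's results are all invented, and real models would use many more features and weight neighbors by similarity rather than pooling them equally.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(rows):
    """Convert each feature column to z-scores so no single unit dominates."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def nearest_neighbors(features, target, k):
    """Return the k pitchers closest to `target` in standardized feature space."""
    names = list(features)
    z = dict(zip(names, standardize([features[n] for n in names])))

    def dist(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    others = [n for n in names if n != target]
    return sorted(others, key=lambda n: dist(z[n], z[target]))[:k]


def pooled_average(results, pitchers):
    """Pool a hitter's (hits, at_bats) across a group of pitchers."""
    hits = sum(results[p][0] for p in pitchers if p in results)
    at_bats = sum(results[p][1] for p in pitchers if p in results)
    return hits / at_bats if at_bats else None


# Hypothetical pitchers: (fastball velocity in mph, vertical break in inches).
features = {"A": (95.0, 9.0), "B": (94.5, 8.5), "C": (88.0, 2.0),
            "D": (95.5, 9.5), "E": (87.0, 3.0)}
# Hypothetical results for one hitter: (hits, at_bats) against each pitcher.
results = {"A": (1, 4), "B": (3, 10), "D": (2, 12), "C": (5, 9)}

neighbors = nearest_neighbors(features, "A", k=2)
pooled = pooled_average(results, ["A"] + neighbors)
print(neighbors, round(pooled, 3))
```

Here the hitter's four at-bats against pitcher A become 26 once the two most similar pitchers are folded in, which is the whole point: a larger sample at the cost of assuming the neighbors really are comparable.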
More recently, some teams have advanced even beyond that stage. “Cutting-edge batter-pitcher projections are based on batter/pitcher swing/pitch planes and the ways that those attributes interact,” the analyst says. As a consultant to teams elaborates, “There are some guys with an east-west swing, there are some guys with an up-down swing. So a sinker-ball lefty, not that there are a lot of them, versus a four-seam lefty, match up with two right-handers very differently if they have more of an east-west swing or more of a north-south swing. [Teams] are trying to optimize swing angle for pitch angle.”
Pitch angle is easy enough to calculate from public info on release point, movement, and location, but determining swing plane can be complicated. “Those data are mostly derived from TrackMan/Statcast, which is mostly publicly available, but there’s a much greater barrier to entry than strictly performance data,” the analyst says. Teams can try to reverse-engineer a hitter’s natural swing path by isolating the launch angles at which he produces his peak batted-ball speeds. They can invest in a camera-based solution like KinaTrax, which may be able to directly record a hitter’s swing path if he plays in a park where the system is installed. Because many teams gather swing data on their prospects using bat sensors like Blast, whose use among minor leaguers is sanctioned by the CBA, they can identify minor leaguers in their system whose batted-ball profiles compare closely to a major leaguer of interest and then study the minor leaguers’ Blast data to make inferences about the big leaguer’s swing. Or they can advance-scout opponents via video, which is more practical in the postseason, when teams have time to focus on a few opponents and the potential reward for that effort increases. One team’s quantitative director says that his club has tried “all of the above” and gleaned insights from every approach, although it hasn’t developed a “unified theory of swing-plane measurement.” But he notes that the Yankees, for one, are reputed to have solved the swing-plane mystery.
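The reverse-engineering approach, isolating the launch angles at which a hitter produces his hardest contact, can be approximated crudely from batted-ball data alone. The sketch below is an assumption-laden proxy rather than any team's method: the function name, the choice of a simple average, and the sample data are all hypothetical.

```python
def estimate_swing_path_angle(batted_balls, top_n):
    """Average launch angle (degrees) of a hitter's top_n hardest-hit balls.

    batted_balls is a list of (exit_velocity_mph, launch_angle_deg) pairs.
    The premise is that a hitter squares the ball up best when the ball
    leaves along his natural swing path.
    """
    hardest = sorted(batted_balls, key=lambda b: b[0], reverse=True)[:top_n]
    return sum(angle for _, angle in hardest) / len(hardest)


# Hypothetical batted balls: (exit velocity, launch angle).
sample = [(105, 12), (104, 14), (90, 30), (85, -5), (80, 40),
          (100, 10), (70, 50), (95, 20), (88, 25), (110, 13)]

print(estimate_swing_path_angle(sample, top_n=3))  # 13.0
```

With so few batted balls the estimate is noisy, which helps explain why teams supplement it with cameras, bat sensors, and video.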
As The Athletic’s Marc Carig wrote about the Yankees last month, “With their tools, it’s possible to estimate a hitter’s performance not against just two-seam fastballs in general, but two-seam fastballs thrown by [Zach] Britton, or curveballs thrown by [David] Robertson, or sliders thrown by [Aroldis] Chapman. Specific velocity and spin is taken into account and matched up to a hitter’s bat path, which can also be precisely measured. Given that information, computers can simulate an expected result. From there, game plans can be formulated, strategies mapped out, scenarios anticipated.”
Players, too, can make that matchup data part of their planning. “I’d never been exposed to that amount of information,” ex-Oriole Britton told FanGraphs recently, recounting his exposure to a new wealth of data after arriving in New York earlier this year. “And it’s not just, ‘Here’s a stack of stuff to look over.’ It’s [targeted] to each individual player. I don’t want to get into specifics, but some of it is how my ball moves, both my sinker and my slider, compared to different hitters’ swings. It kind of opens your eyes to things you maybe didn’t think of when you didn’t have that information.” In the public sphere, it’s still missing, and that should open our eyes also.
It’s tough for teams to keep secrets: Players and executives travel, and they carry word with them. “This is going to be something every team is doing in like two years,” the consultant says. Some of the knowledge that teams are relying on to make matchup decisions could conceivably be replicated using public information. But teams don’t just have better data; they also invest much more time and effort into mining the same sources of information at the public’s disposal. Teams have millions of dollars of value riding on sit/start decisions and multiple full-time employees devoted to poring over potential postseason matchups in daunting detail. Outside sources don’t have the same incentive to gather and give away those insights—and whenever independent analysts start to encroach on competitive advantages, teams tend to hire them.
So that’s the level of information that managers are working with—and yes, most modern managers are working with it, because the iPad packed with projections that the front office sends down to the dugout isn’t a friendly suggestion, it’s a condition of keeping the job. Meanwhile, we on the outside are several steps behind, either going with our guts or squinting at splits and trying to do some mental math without being biased by recent results.
In the fifth inning of Game 2, Fox’s cameras caught a Dodgers bat boy dropping by the outfield to deliver new laminated positioning cards based on likely landing spots with reliever Ryan Madson on the mound instead of starter Hyun-Jin Ryu. It didn’t help—Madson walked the first hitter he faced to force in Boston’s second run, and even a presumably repositioned Yasiel Puig wasn’t shallow enough to catch J.D. Martinez’s game-winning, two-run single—but the Dodgers probably weren’t printing those positioning cards while Roberts was walking to the mound. They’d studied Madson and Martinez beforehand and gone to great lengths to quantify all that could come of a showdown between the two. Summoning Madson wasn’t necessarily the ideal decision, but it was a well-considered one. And when we critique it, we’re at an information disadvantage: We know nothing that they don’t know, but they know things that we don’t know. “There are absolutely times when a team makes a decision that looks strange from the outside, and often is criticized, because of information we have access to that others don’t,” another team’s assistant GM says. “That’s not to say that the decision will be the right one, but more often than not we’re working from intel that we believe in and that’s not available outside of our walls.”
Although the front-office sources were consistently cagey about specifying how much a sophisticated model might differ from a more rudimentary one, there’s clearly a limit to the benefits. “Some people are putting a lot of energy into trying to do this exactly, precisely, perfectly right, but there is actually only so much to gain there,” the consultant says. “And by all means, go grab it, but don’t expect it to change your team.” The quantitative director emphasizes that the numerical projection is “one input of a larger advance process that involves the advance guys watching a lot of video and looking at results on different pitch types/locations in addition to what the model says. I think it is rare that there [is] a black-and-white lineup decision where you could point to the model as the cause, but it [is] definitely an influence.”
No matter how amazing the model, one never knows exactly what a player is thinking or feeling, or how he’ll respond to a specific circumstance; a hitter might see a certain guy well despite struggling against many similar pitchers. And in many cases, of course, the model may not matter as much as another inside advantage. “Often times when a team goes with a matchup that looks suboptimal, the reason is that the player that would provide the better matchup is dealing with some minor injury that won’t be publicly acknowledged but is factoring into the team’s decision-making,” another team’s quantitative director says. “Another confounding factor is that pitchers need to be warmed up before they come into a game, and warming up a pitcher has a cost.” Managers also have to consider consequences for defense and depth that may be far from some fans’ minds in the heat of the moment. And although one analyst says his front office doesn’t adjust its statistical model when a coach or manager reports that a player’s mechanics are awry, it would want the manager to consider that when filling out the lineup card.
So where does that leave us, the (comparatively) uninformed fans and media members? Can’t we just keep being mad at managers when decisions don’t work out, the way we always have? Scapegoats are so satisfying!
No one is suggesting that we stop second-guessing completely. For one thing, it feels like a failure of independent thought to defer to authority in every case. Teams are better informed, but they aren’t infallible. Another quantitative director notes that there’s “not necessarily a definitive answer” to every decision, and a former quantitative director observes, “There’s no guarantee that said information is translating into good decisions.” One of the others says he’s seen times when knowing more misled a team. “I think on the whole it is a plus to bring data into the process, but it can definitely be a minus in specific instances, and it can be hard to know the difference sometimes,” he says.
The front-office analyst offers a rule of thumb. “When something seems bonkers, it’s probably worth criticizing,” he says. “When something is close, it’s probably worth a nuanced look. When something is close and Twitter goes bonkers, it can be frustrating knowing how difficult some of these decisions are. But when a decision is really confusing and the projections run strongly counter to it, you can at times check outlier batter/pitcher matchup histories and hot/cold streaks to see if there’s smoke there.” In other words, no matter how much teams try to downplay small-sample performance, some managers still fall for the Enrique Wilsons of the world.
Beyond that, though, we watch and discuss sports for fun, and it’s no fun to let top-secret stats for the team’s eyes only stop us from expressing opinions. The key thing to remember is that sports are silly, and there are few serious societal consequences to being publicly wrong about baseball.
I ask one of the quantitative directors what he thinks the outsider’s default stance should be when a manager does something that seems strange or inexplicable. Which is more likely: that the manager fell asleep at the controls Homer Simpson style or did something dumb based on bad information, or that he knew something the public wasn’t privy to that made his decision sound or defensible? “I think it’s more likely he knew something,” he says. “That something might be injury. It might be that the hitter says he doesn’t see the ball well or isn’t comfortable in the box against that pitcher (which the manager obviously won’t tell the media). It might be something valid from a model or advance work. Or it might be something wrong/specious from a model or advance work.”
So yes, second-guess—but second-guess responsibly. Embrace the uncertainty. Occasionally couch comments with a “based on what we know” or “unless he’s hiding an injury.” Be a bit cautious about calling a decision a “fireable offense.” Taking a strong stance leads to likes and retweets—and often, in the real world, more meaningful forms of affirmation. But remember that managers aren’t always to blame. And even when they’re wrong, they’re rarely responsible for a loss.
In the eighth inning of Game 2, with two outs, the bases empty, and the Dodgers already down by two, Roberts made a different decision than he had the day before, leaving Báez in to face Devers. The righty got the lefty to ground out. Maybe that would have worked in Game 1. But a single at-bat is the smallest sample of all. And even an 0-for-32 streak is less simple than it sounds.