Mathematical Musing: Simpson’s Paradox

No, nothing about Homer or OJ (is that too much of a nineties reference?). This paradox is a statistical phenomenon in which analysis of pooled data can lead a researcher to a conclusion that directly contradicts the one the unpooled data would support.  Prominent examples of Simpson’s Paradox have arisen in college admissions, treatment of kidney stones, and baseball batting averages.

The gist is this: Say you need to have a major operation done and there are two hospitals in your town where you could have it.  You’re worried about post-surgery complications, so you do some research into the hospitals and find that in the past year, patients at the larger hospital suffered post-surgery complications in 130 out of 1000 cases, and patients at the smaller hospital suffered complications in only 30 out of 300.  Based on these results, it looks like the smaller hospital is the better bet: only 10% of patients had complications after surgery there versus 13% at the larger hospital.

However, not all surgeries have the same rate of complications.  Relatively minor surgeries are less invasive and would probably result in a lower complication rate.  With that in mind, you look further at the data and find that, at the large hospital, 120 out of the 800 major surgery patients experienced complications compared to 10 out of 200 minor surgery patients, and at the small hospital, 10 of the 50 major surgery patients suffered complications compared to 20 out of 250 minor surgery patients.  In other words, broken down by type of surgery, the complication rates at the large hospital were 15%/5% for major/minor surgeries, while the small hospital saw rates of 20%/8%.  We see now that the larger hospital has the lower complication rate for both types of procedure.
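To check the arithmetic, here is a minimal Python sketch that recomputes the pooled and per-type complication rates from the counts above. The counts come straight from the example; the names `data`, `rate`, and `pooled_rate` are just illustrative choices.

```python
# Counts from the example, stored as (complications, surgeries)
data = {
    "large": {"major": (120, 800), "minor": (10, 200)},
    "small": {"major": (10, 50), "minor": (20, 250)},
}


def rate(complications, surgeries):
    """Complication rate as a fraction between 0 and 1."""
    return complications / surgeries


def pooled_rate(hospital):
    """Complication rate with both surgery types lumped together."""
    total_complications = sum(c for c, _ in data[hospital].values())
    total_surgeries = sum(n for _, n in data[hospital].values())
    return rate(total_complications, total_surgeries)


for hospital, by_type in data.items():
    parts = ", ".join(
        f"{kind} {rate(c, n):.0%}" for kind, (c, n) in by_type.items()
    )
    print(f"{hospital}: pooled {pooled_rate(hospital):.0%}; {parts}")
```

Running it shows the reversal directly: the small hospital wins on the pooled rate (10% versus 13%), yet the large hospital wins within each surgery type (15% versus 20% for major, 5% versus 8% for minor).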

So why the different conclusion?  It has to do with how many of each type of procedure each hospital did.  The vast majority of the larger hospital’s 1000 surgeries in the last year (800 of them) were major surgeries, which have higher complication rates at both hospitals.  The majority of the smaller hospital’s 300 surgeries (250 of them) were minor procedures, which have lower rates of complication.  As a result of this imbalance, the overall, pooled complication rates for the two hospitals are skewed: the larger hospital’s towards a higher rate and the smaller hospital’s towards a lower rate.  So it only appears that the smaller hospital has a lower complication rate, because most of the surgeries performed there are of the kind less likely to have complications.
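One way to see the imbalance at work is to write each pooled rate as a weighted average of the per-type rates, where the weights are each hospital's share of major and minor surgeries. The Python sketch below does that with the figures from the example; the `hospitals` dictionary and the `pooled` helper are hypothetical names made up for the illustration.

```python
# Figures from the example: share of surgeries that were major, and per-type rates
hospitals = {
    "large": {"major_share": 800 / 1000, "major_rate": 0.15, "minor_rate": 0.05},
    "small": {"major_share": 50 / 300, "major_rate": 0.20, "minor_rate": 0.08},
}


def pooled(major_share, major_rate, minor_rate):
    """Pooled rate as a case-mix-weighted average of the per-type rates."""
    return major_share * major_rate + (1 - major_share) * minor_rate


for name, h in hospitals.items():
    print(
        f"{name}: {h['major_share']:.0%} major surgeries, "
        f"pooled rate {pooled(**h):.1%}"
    )
```

The large hospital's pooled rate is pulled towards its 15% major-surgery rate because 80% of its cases are major, while the small hospital's is pulled towards its 8% minor-surgery rate because only about 17% of its cases are major. That difference in weights, not in the quality of care, is what flips the comparison.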

Check out this website for another explanation of Simpson’s Paradox, as well as some clever interactive animations that demonstrate how and why it can arise.  It’s an important lesson for consumers of data and statistics: while the saying may go “Less is More,” when it comes to how much detail to include in your research, sometimes less is wrong.

Update: It appears that the above VUDLab link is dead, which is too bad. Instead, you could check out this Towards Data Science article or this MinutePhysics YouTube video for some more information.