On the Flynn Effect and Merit in Medicine

I.

In the early 1980s, a researcher made a curious discovery.

Human beings were getting smarter.

Professor Flynn in 2016, from his New York Times obituary.

The researcher was named James Flynn. For several years, he’d evaluated trends in intelligence quotient (IQ) test performance – and in every population he studied, the average IQ seemed to be rising steadily each year.

The effect wasn’t subtle. Among Americans from the 1930s to the 1970s, the average IQ rose approximately 14 points – around one full standard deviation – from the previous mean. And scores have continued to rise since, such that the average test-taker today would score 130 – near ‘genius’ level – if scored against the norms of a hundred years ago. Meanwhile, if the average test-taker from a hundred years ago were scored against today’s norms, they’d score around 70 – close to qualifying as intellectually disabled.

IQ score categories for the Wechsler IQ test, from Wikipedia.
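For readers who want to check the arithmetic behind scores like 70 and 130, here is a minimal sketch. It assumes scores follow a normal curve with a mean of 100 and a standard deviation of 15 (the conventional Wechsler scaling, which this post does not state explicitly); the function names are just illustrative.

```python
from statistics import NormalDist

# Assumption: a normal curve with mean 100 and SD 15 (conventional
# Wechsler scaling; the SD is not given in the post itself).
IQ_SCALE = NormalDist(mu=100, sigma=15)


def z_score(iq: float) -> float:
    """Distance of a score from the mean, in standard deviations."""
    return (iq - IQ_SCALE.mean) / IQ_SCALE.stdev


def percentile(iq: float) -> float:
    """Percentage of test-takers expected to score below `iq`."""
    return 100 * IQ_SCALE.cdf(iq)


for iq in (70, 100, 130):
    print(f"IQ {iq}: z = {z_score(iq):+.1f}, percentile = {percentile(iq):.1f}")
```

Under that assumption, 70 and 130 each sit two standard deviations from the mean, at roughly the 2nd and 98th percentiles.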

This wasn’t supposed to happen. IQ tests were supposed to measure general mental ability – the so-called g factor or g. That kind of general intelligence was thought to be mostly innate and fixed within a given person.

But the data seemed incontrovertible. Regardless of the population studied and the test used, IQ scores were getting higher.

What was going on?

II.

In their 1994 book The Bell Curve, authors Charles Murray and Richard Herrnstein finally gave a name to the phenomenon that Flynn had described. They called it, appropriately enough, the Flynn effect.

The Bell Curve.

The Bell Curve was a polarizing work.

The authors’ general premise was that the world was transforming from one in which social position was determined by birthright into one in which a “cognitive elite” were sorted into positions of power, authority, and economic privilege.

But it wasn’t just that those with higher IQs got better jobs and made more money. According to Herrnstein and Murray, IQ predicted desirable social behaviors. They noted, for instance, that around a third of women who had low IQs (<75) lived in poverty, were chronic welfare recipients and had children without being married – while fewer than 2% of those with high IQs (>125) were similarly situated.

Murray and Herrnstein’s argument rested upon certain assumptions that the authors felt were “beyond dispute.” Central among these were the beliefs that g was measurable; that IQ testing was the best way to measure it; and that IQ scores were stable over time, substantially heritable, and not biased against any social class or racial/ethnic group.

Of course, these assumptions were disputed – and disputed vigorously – as were the conclusions that were drawn from them.

Stephen Jay Gould was among the most articulate critics. In a scathing review in The New Yorker, he wrote:

The Bell Curve, with its claims and supposed documentation that race and class differences are largely caused by genetic factors and are therefore essentially immutable, contains no new arguments and presents no compelling data to support its anachronistic social Darwinism, so I can only conclude that its success in winning attention must reflect the depressing temper of our time—a historical moment of unprecedented ungenerosity, when a mood for slashing social programs can be powerfully abetted by an argument that beneficiaries cannot be helped, owing to inborn cognitive limits expressed as low IQ scores.

Disturbing as I find the anachronism of The Bell Curve, I am even more distressed by its pervasive disingenuousness. The authors omit facts, misuse statistical methods, and seem unwilling to admit the consequence of their own words.

The penultimate chapter presents an apocalyptic vision of a society with a growing underclass permanently mired in the inevitable sloth of their low IQs. They will take over our city centers, keep having illegitimate babies (for many are too stupid to practice birth control), and ultimately require a kind of custodial state, more to keep them in check—and out of high IQ neighborhoods—than to realize any hope of amelioration, which low IQ makes impossible in any case. Herrnstein and Murray actually write, “In short, by custodial state, we have in mind a high–tech and more lavish version of the Indian reservation for some substantial minority of the nation’s population, while the rest of America tries to go about its business.”

Of course, The Bell Curve found a highly receptive audience among certain groups. To the already elite, it provided a comforting justification for their status that didn’t depend on having pre-existing societal privilege or economic advantage. If you were successful, it was because of your innate, unalterable g. If you were successful, it was because you were smart.

III.

In 1994 – the same year that The Bell Curve was published and the term Flynn effect was coined – the first cohort of students took Step 2 of the United States Medical Licensing Exam (USMLE).

The USMLE Step 2 exam replaced the old NBME Part 2 examination. To ensure that results from the new test would be readily distinguishable from the old one, the USMLE used a new scale for scores.

The NBME Part 2 exam had been scored using the same standard scale as the SAT, with a mean set at 500 and a score range of 200-800. In contrast, the new USMLE exam would have a scaled score with a mean at 200 and a standard deviation of 20.

For the first few years of the exam, student performance was steady. But starting in 1998, something funny started to happen.

Since the late 1990s, there has been a steady increase in student performance on USMLE Step 2 CK.

Each successive year, students began to answer more questions correctly than the group that came before – and the national average on the USMLE Step 2 exam began to creep up, little by little.

What was going on?

“It’s the Flynn effect!” someone comments, whenever I post this graphic (or the nearly identical one for USMLE Step 1).

But is it?

IV.

In March 2020, a cardiologist from the University of Pittsburgh published a paper in the Journal of the American Heart Association.

Over 17 pages and 108 references, Dr. Norman Wang argued against affirmative action policies in medicine.

Precious little in the paper had anything to do with medicine in general or cardiology in particular. Most of the piece was spent discussing legal standards and landmark Supreme Court cases related to affirmative action. It was an altogether curious piece to run in JAHA, and for the first few months after its publication, it was almost entirely ignored.

But in August 2020, Wang’s paper started to get some attention on Twitter – and not the good kind. Commenters were outraged by the paper’s suggestion that Black and Hispanic trainees were not as qualified and capable as their Asian and white counterparts. Within 24 hours of the Twitter furor, Wang lost his job as fellowship director and the American Heart Association issued a statement declaring that the paper was “completely incompatible with the [organization’s] core values.” A few days later, JAHA retracted the paper altogether.

But while #MedTwitter celebrated, in a different corner of the internet, outrage sprang up anew. Conservative news outlets pounced on the Wang affair as being the latest excess of the liberal medical establishment, criticizing the “woke” doctors and social justice warriors who were ruining meritocracy in medicine.

V.

As it turned out, identifying the Flynn effect was easy. In nearly every data set researchers examined, they could find evidence of rising IQ scores.

Almost anywhere you looked, you could find evidence of the Flynn effect in IQ testing. (From Flynn JR, Psychol Bull, 1987.)

What proved far more challenging was explaining why. Explanations that seemed sufficient in one instance were altogether lacking in another. Among the various explanations proffered were:

  • Better nutrition
  • Better medical care/less childhood disease
  • Broader exposure to testing
  • More stimulating general environment
  • Decreasing family size
  • Less consanguinity

Each of these hypotheses had at least some support in some datasets, but none were universally present when rising IQs were observed.

In contrast, what is driving the rise in USMLE scores seems much less mysterious.

It’s not because today’s medical students are better nourished than those in the early 1990s.

And it’s not because they come from families with less inbreeding.

It’s just that high scores matter more now than they used to.

What matters to residency program directors: then and now.

Examinees don’t study for the Wechsler IQ test. But they sure do for the USMLE. And they do it with highly efficient test prep resources specifically designed to optimize performance on multiple-choice questions.

Residency selection is an arms race, and it’s only relative performance that matters. Given the high stakes, it would only be surprising if USMLE scores didn’t rise a little each year.

VI.

In December 2020, Wang sued.

Among other things, Wang’s lawsuit sought the return of the $1600 open access publishing fee he’d paid the journal.

Defendants included his institution; his medical group; the American Heart Association; the journal publisher; his chair/division chief; and as yet unknown “John Does” who were alleged to have violated Wang’s right to free speech, defamed his reputation, and interfered with his publishing agreement.

In December of 2021, a judge dismissed most of the defamation and contract interference claims, pointing out that the article had included at least two misleading quotations (which were highlighted in the JAHA retraction notice).

However, the principal defendants’ motions to dismiss the First Amendment and due process claims were denied, and the lawsuit continues. (I asked a respected attorney what he thought would happen, and after studying the complaint and the judge’s initial ruling, he predicted the university will quietly settle the suit.)

VII.

Although Dr. Flynn passed away in 2020, he lives on on the internet. If you have 18 minutes to spare, you can watch his TED talk entitled, “Why Our IQ Levels are Higher than Our Grandparents.”

The title is provocative. But notice that it’s not called, “Why We are Smarter Than Our Grandparents.” And that’s for good reason.

Flynn himself questioned whether the rise in IQ scores really meant that people were getting more intelligent.

How can people get more intelligent and have no larger vocabularies, no larger stores of general information, no greater ability to solve arithmetical problems? …Why do we not have to make allowances for the limitations of our parents?

-James R. Flynn, What is Intelligence? Beyond the Flynn Effect

It’s simply inconceivable to believe that our great-great-grandparents were all intellectually disabled – even though that’s exactly what their average IQ score of 70 would suggest if they were around to take the test today.

Rather than measuring g, it seems likely that IQ tests measure a particular type of often abstract reasoning – and improved performance there may come at the expense of practical know-how and other types of cognitive reasoning.

Today’s average USMLE Step 2 CK examinee would have scored around the 99th percentile had they taken the test in 1994.
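To put a number on what a 99th-percentile score would have looked like back then, here is a rough sketch. It uses the original scale described earlier (mean 200, standard deviation 20) and assumes, purely for illustration, that 1994 scores were approximately normally distributed; the helper names are hypothetical.

```python
from statistics import NormalDist

# From the post: the original USMLE scale had mean 200 and SD 20.
# Assumption: 1994 scores were roughly normally distributed.
STEP2_1994 = NormalDist(mu=200, sigma=20)


def score_at_percentile(pct: float) -> float:
    """Scaled score below which `pct` percent of 1994 examinees would fall."""
    return STEP2_1994.inv_cdf(pct / 100)


def percentile_of(score: float) -> float:
    """Percentage of 1994 examinees expected to score below `score`."""
    return 100 * STEP2_1994.cdf(score)


print(f"99th percentile in 1994: about {score_at_percentile(99):.0f}")
```

Under those assumptions, the 99th percentile falls in the mid-240s on the original scale, a bit more than two standard deviations above the 1994 mean.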

Similarly, it’s beyond dispute that performance on the USMLE is better now than it ever has been. So where are the trumpets and fanfare? Where are the celebrations about this new golden era of medical care?

VIII.

Many were angered by the conclusions of Wang’s paper – but only a few took the time to challenge the basic premise underlying his argument.

Wang argued against affirmative action and other policies that improve diversity in medicine because of his belief that giving preference to historically excluded groups works against “excellence” in medicine. His paper concludes, “Ultimately, all who aspire to a profession in medicine and cardiology must be assessed as individuals on the basis of their personal merits, not their racial and ethnic identities.”

So what are the merits upon which individuals should be assessed? To Wang, the answer is clear: standardized test scores.

White and Asian students have higher scores on the MCAT than Black and Latinx students; therefore they have more merit as physicians. Right?

It’s true, there’s a modest correlation between MCAT scores and academic performance in early medical school (such that variations in MCAT scores explain 32% of the variability in first year medical school grades). But it’s a big leap to get from there to the myriad competencies that make a good physician or cardiologist.
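As a quick back-of-the-envelope check on what “explains 32% of the variability” means, the sketch below converts that figure into a correlation coefficient. It assumes the 32% is an ordinary r-squared from a simple linear relationship, which the post does not spell out.

```python
import math

# From the post: MCAT scores explain 32% of the variability in first-year grades.
# Assumption: this is an r-squared from a simple linear relationship.
r_squared = 0.32

r = math.sqrt(r_squared)       # implied correlation coefficient
unexplained = 1 - r_squared    # share of variability left unexplained

print(f"r = {r:.2f}, unexplained share = {unexplained:.0%}")
```

Read that way, about two-thirds of the variation in first-year grades is left unexplained by MCAT scores, and that is before getting anywhere near clinical performance.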

Accepting Wang’s argument requires taking a very narrow view of what constitutes merit in medicine. Though numerical differences in scores by self-reported racial categories can be demonstrated, it should set off your scientific B.S. detectors to accept, on the basis of such evidence, that some of these groups have more potential for medicine than others.

Claims that there are racial differences in innate intelligence should similarly both offend your conscience and reek of pseudoscience – though for Murray and Herrnstein, they seemingly didn’t. They devoted a full chapter of The Bell Curve to exploring “Ethnic Differences in Cognitive Ability.”

A figure from Chapter 13 of The Bell Curve.

Figures like the one above are catnip for white supremacists.

The intended conclusion is so obvious that any fifth-grader who knows how to read a graph could make it. But since these data are provided by a respected professor and presented with a veneer of scientific rigor, now that intended conclusion is no longer just someone’s noxious opinion. Now, it can be taken as a scientific fact.

This is an especially dangerous type of racism, and many critics shouted it down with outrage. (Even today, Murray’s lectures are often the target of violent protests.) But others felt that the faulty logic and claims about intelligence and race had to be challenged on their merits.

In fact, it was to offer just these kinds of challenges that Professor Flynn began studying intelligence in the first place.

IX.

In 1969, Arthur Jensen published a paper called, “How Much Can We Boost IQ and Scholastic Achievement?” It was a seminal work – since cited >6000 times – that argued that IQ differences between Black and white students were genetically determined and therefore largely immutable. It ran counter to the previously-growing consensus that race and intelligence were socially-determined constructs, and it didn’t sit right with a young American expatriate in New Zealand named James Flynn.

In 1980, Flynn published his first book – a devastating 300-page critique of Jensen’s work. Later, he invoked the Flynn effect itself to demonstrate that IQ scores for Black children increased substantially relative to those of white children between 1972 and 2002. Such a rise in IQ is awfully difficult to explain on the basis of innate racial differences in intelligence – but is easy to understand when considered in light of social changes. Throughout his career, he frequently debated both Jensen and Murray, believing that the best way to silence their rhetoric was not by force, but by making them look like fools.

When a racist makes judgments, we can use logic as a powerful weapon to force him to make his ideals clear. When he says black men deserve to be excluded or kept in their place or exterminated, we can ask him whether this is true simply because they are black. For example, if it were a Nazi talking we might say this: assume that through industrial pollution a chemical got in the water supply which turned the skins of all Germans permanently black; would they then deserve to be exploited or exterminated? He can of course answer in the affirmative with complete logical consistency but. . . [t]he reason a Nazi could not answer as above has nothing to do with logic but with the fact that such an answer carries with it an unacceptable price.

-James Flynn, from the preface to Race, IQ, and Jensen (1980)

X.

Perhaps the most succinct-yet-comprehensive explanation of the Flynn effect is that rising IQ scores are not just adaptive to life in the modern world – they’re also enabled by it. In other words, our modern society and economy reward those who can reason in abstract terms… but that society also affords the luxury of being able to think about the world in abstract terms in the first place.

From that standpoint, rising IQ scores are indeed something to be celebrated. The Flynn effect is most prominent as societal conditions are improving – for instance, as more citizens gain better nutrition and access to health care. (In fact, IQ scores may actually decline once those gains have been maximized: the peak IQ in Norway seems to have occurred among those born in the 1970s.)

When grouped by world regions, improvements in IQ parallel improvements in societal conditions. From Our World in Data.

Put simply, as societal conditions improve, more of us get to enjoy the luxury of abstract thinking – and to a greater extent – than our just-as-clever forefathers who had to occupy more of their cerebral cortex with the concrete cognitive processes necessary for practical survival. We’re not smarter, exactly… but we are more privileged.

The best interpretation of racial differences in MCAT scores or the Flynn-like effect in USMLE scores is similar.

In highly competitive processes like medical school admissions or residency selection, students will push themselves to the limits to distinguish themselves by whatever metric is used to determine the winners. But here, too, privilege is important. To engage in the hard work involved in standardized test preparation requires having the luxury to do so. And when the competition is intense and all the competitors’ engines are turning at their maximum rpm, the small degrees that separate winners from losers often have more to do with different starting points than different top speeds.

An IQ score – or an MCAT score, or a USMLE score – tells us something. But not everything. Identifying intelligence or merit – and rebutting the self-serving assertions of those who seek to enforce narrow definitions of these concepts – requires going beyond what numbers can tell us in isolation. It requires us to think less like Charles Murray or Arthur Jensen or Norman Wang – and more like James Flynn.

ADDENDUM: If you’d prefer a video of this post to share, there’s one on the Sheriff of Sodium YouTube site. (And yeah, it’s really just audio with some slides, but still.)
