Last year, as part of my quest to convince whoever will listen that USMLE scores aren’t as useful for residency selection as we act like they are, I wrote about how many points each USMLE Step 1 question is worth. I hypothesized that passing USMLE Step 1 probably required answering around 65% of questions correctly – which implied that each correctly answered question thereafter was worth around 1 point.
In suggesting this, I was pretty honest that some of my calculations were based on assumptions and guesswork. After all, the USMLE does not provide precise details on how they calculate the three-digit score.
But lately, I happened upon a piece of data that helps shed some light on this issue for a related exam.
The Rosetta Stone
The key piece of information to help us translate percentages to USMLE three-digit scores comes from this paper:
A paper written by NBME authors contains data that helps us convert percentages of items answered correctly to USMLE three-digit scores.
The paper is written by psychometricians with the goal of addressing something completely different: how many incorrect answer choices should be included with multiple choice questions? Their point is that, on an exam where most questions are answered correctly, an answer choice that is chosen by <5% of test-takers may still serve a useful role as a distractor. And I guess that’s a good point, as far as it goes.
But what’s more interesting is the dataset they used to make this point.
See, they used data from “an examination for physician licensure.”
A “high stakes” examination.
An examination with “approximately 320… test items.”
Wait a second… this is sounding kinda familiar…
The exam described in the Raymond et al. paper sounds a lot like USMLE Step 2 CK.
The authors note that on this high-stakes medical licensing test with approximately 320 questions (read: USMLE Step 2 CK), the mean percentage of five-answer multiple choice questions answered correctly ranged from 73.4% to 74.8%. And the standard deviation was 8.3% to 8.9% across the various test forms.
This, of course, is a very interesting piece of data that we can use to approximate how many questions must be answered correctly to pass USMLE Step 2 CK.
See, from the USMLE’s Score Interpretation Guidelines, we already know the mean and standard deviation for the Step 2 CK exam.
Lately, the mean Step 2 CK score has been 242 with a standard deviation of 17.
And we also know what the overall distribution of Step 2 CK scores looks like.
The distribution of USMLE Step 2 CK scores.
Note that, although the distribution of Step 2 CK scores is not perfectly normal, it’s probably close enough to treat it as such. And if we do, we can see how percentages correspond to three-digit scores by using z-scores to convert from one (approximately) normal distribution to the other: take a percentage, compute its z-score using the mean and standard deviation from Raymond et al., then find the three-digit score with that same z-score using the USMLE’s reported mean and standard deviation.
When we do that, we end up with something that looks like this:
Approximate conversion between raw score and three digit score for USMLE Step 2 CK.
In other words, to pass USMLE Step 2 CK, you only need to answer around 57% of items correctly.
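If you want to play with this conversion yourself, here’s a minimal Python sketch. The inputs are assumptions pulled from the numbers above – I’ve used 74.1% and 8.6% as midpoints of the ranges reported by Raymond et al., alongside the USMLE’s reported mean of 242 and standard deviation of 17 – and the NBME’s real equating procedure is surely more sophisticated, so treat this strictly as a back-of-the-envelope approximation.

```python
# Back-of-the-envelope conversion between percent correct and three-digit score.
# Assumed parameters (midpoints of ranges reported in the sources above):
#   percent correct:   mean 74.1, SD 8.6  (Raymond et al.)
#   three-digit score: mean 242,  SD 17   (USMLE Score Interpretation Guidelines)

PCT_MEAN, PCT_SD = 74.1, 8.6
SCORE_MEAN, SCORE_SD = 242, 17

def pct_to_score(pct_correct: float) -> float:
    """Convert a percentage of items answered correctly to a three-digit score."""
    z = (pct_correct - PCT_MEAN) / PCT_SD   # z-score on the percent-correct scale
    return SCORE_MEAN + z * SCORE_SD        # same z-score on the three-digit scale

def score_to_pct(score: float) -> float:
    """Inverse conversion: a three-digit score back to percent correct."""
    z = (score - SCORE_MEAN) / SCORE_SD
    return PCT_MEAN + z * PCT_SD

if __name__ == "__main__":
    for pct in (57, 65, 74.1, 85):
        print(f"{pct:5.1f}% correct -> {pct_to_score(pct):5.1f}")
```

Plugging in 57% correct yields a score of about 208 – which is where the “you only need around 57% to pass” estimate above comes from.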
Notice that the conversions are approximate, and differ slightly from the real distribution of scores reported in the USMLE’s interpretation guidelines.
Notice, also, that this doesn’t tell us exactly how many questions an examinee must get correct. Out of the 300+ questions on Step 2 CK, some are unscored experimental questions. How many? It’s not clear – but the Raymond et al. paper suggests that it’s not a small percentage.
In their methods, the authors described taking data from four different versions of this nameless high-stakes licensing test, each of which had “approximately 320 scored and unscored (experimental) items.” But they analyzed data only from items with five answer choices, excluding questions with three or four possible answers – which, they noted, were “too few… for systematic study.” And they ended up with a sample of just 206-220 items for each version of the test.
So if we assume that there are around 220 scored items on USMLE Step 2 CK, then each individual non-experimental item is worth just under half a point when scores are scaled.
Uh… so why does this matter?
If we’re gonna use USMLE scores for residency selection – a purpose for which these tests were not designed – then we should all understand exactly what we’re talking about when we’re talking scores above the passing threshold. It’s great that Applicant A scored a 233 and Applicant B scored a 250 on Step 2 CK – but what does that actually mean?
Well, if I’m right, it means that, over the course of a 9-hour testing day, Applicant B answered around 35 more questions correctly. How should we interpret that?
Here’s where randomness comes into play. To be confident that two Step 2 CK test-takers really differ in knowledge, their scores need to be more than 18 points apart (two standard errors of the difference for the test). So Applicants A and B might actually have exactly the same clinical knowledge – Step 2 CK can’t tell us for sure.
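As a sanity check on that 18-point threshold, here’s the arithmetic sketched in Python. The SEM value is my assumption – roughly 6.4 points, back-solved from the 18-point figure; the USMLE’s Score Interpretation Guidelines report the official standard errors – so this is illustrative, not authoritative.

```python
import math

# Assumed standard error of measurement (SEM) for a single Step 2 CK score.
# ~6.4 points is back-solved from the 18-point threshold; not an official figure.
SEM = 6.4

def min_meaningful_gap(sem: float = SEM) -> float:
    """Two standard errors of the difference (SED) between two independent scores."""
    sed = math.sqrt(2) * sem   # SED for two scores with equal SEMs
    return 2 * sed

def scores_distinguishable(score_a: float, score_b: float, sem: float = SEM) -> bool:
    """True if the gap exceeds two SEDs, i.e. is unlikely to be measurement noise."""
    return abs(score_a - score_b) > min_meaningful_gap(sem)

if __name__ == "__main__":
    print(round(min_meaningful_gap(), 1))    # the ~18-point threshold
    print(scores_distinguishable(233, 250))  # a 17-point gap: not clearly different
```

With these assumptions, the 17-point gap between Applicants A and B falls just inside the noise band.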
And even if two applicants’ scores were far enough apart that we could be confident that their performance differed, is that difference meaningful? What, exactly, does a 20 point difference mean? What percentage of questions do you need to get right to be a successful resident or practicing physician? (As a sidenote, we already know how many questions you need to get right to be a successful faculty member. When faculty “content experts” – the same experts who set the USMLE minimum passing score – took a block of USMLE Step 2 CK questions under standard conditions, they only got 67% of questions right.)
This is why the NBME should be more transparent about the way that three-digit scores are calculated. If we’re going to use scores, we need to do better than the NBME’s black box – we all need a firm understanding of how the score is calculated so we can interpret it in a sensible and evidence-based manner.