Well, I can give my 2 cents as a researcher who does a lot of statistical analysis. After a quick look at the paper itself, it's clear to me these results should not be trusted: the study suffers from serious methodological flaws, on top of being drastically underpowered.
On the front end, it lacks ecological validity. They claim to be comparing men and women in emergency situations under high stakes. However, simulators are certified for reproducing flight dynamics, not stakes, fear, and consequence. In the box you know you're safe: your job and life aren't on the line, and there are no passengers to worry about. So none of the stressors and pressures of a real emergency are actually being applied.
Statistically, the study is seriously underpowered by its tiny sample: only 10 participants of each sex. They even calculate post-hoc, in the text of the study, that they had essentially no power to detect reasonable effect sizes. What's more, in such an underpowered study, any comparison that does reach statistical significance is almost guaranteed to overestimate the true effect (and can even point in the wrong direction). Also, their sampling is not random; participants were recruited by convenience, which makes the results non-generalizable even with a large sample. To be fair, they do basically state outright that their results are meaningless in terms of generalizability. This means the study cannot tell us anything about sex differences in piloting at all (something the news reports on it missed). What the study CAN do is serve as a basis for future study design or hypothesis generation, but it cannot offer any conclusions.
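To put rough numbers on the power problem, here's a quick sketch in Python (scipy + statsmodels). The d = 0.5 "true effect" is my own assumption for illustration, not a number from the paper:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()

# Power of a two-sample t-test to detect a "medium" effect (Cohen's d = 0.5)
# with 10 participants per group:
power = calc.solve_power(effect_size=0.5, nobs1=10, alpha=0.05)
print(f"power at d=0.5, n=10/group: {power:.2f}")  # ~0.18

# Effect size you'd need before hitting the conventional 80% power:
d_needed = calc.solve_power(nobs1=10, alpha=0.05, power=0.8)
print(f"detectable effect at 80% power: d = {d_needed:.2f}")  # ~1.3, i.e. huge

# Effect inflation ("type M" error): among simulated studies that DO reach
# p < .05, how badly is the effect overestimated?
rng = np.random.default_rng(0)
true_d, sig_estimates = 0.5, []
for _ in range(20_000):
    a = rng.normal(0.0, 1.0, 10)     # group A, true mean 0
    b = rng.normal(true_d, 1.0, 10)  # group B, true mean d
    if stats.ttest_ind(a, b).pvalue < 0.05:
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        sig_estimates.append((b.mean() - a.mean()) / pooled_sd)

print(f"mean 'significant' estimate of d: {np.mean(sig_estimates):.2f}")
# comes out around 1, i.e. roughly double the true effect
```

So with 10 per group, even a genuinely medium-sized difference gets detected less than one time in five, and on the occasions it is detected, the estimate is exaggerated by roughly a factor of two.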
There are also several methodological problems. For example, when evaluating performance in the emergency scenario, they excluded the 3 crashes (2 women, 1 man, leaving 8 women and 9 men) and then compared completion times among the survivors. So they essentially dropped 2 of the women's worst outcomes. This is classic survivorship bias. A proper analysis would treat a crash as a worse outcome than a slow landing, e.g. by analyzing time-to-completion with the crash as an event in itself, rather than deleting those participants. Especially with such a small sample, one participant can dramatically shift a group's average, so dropping the 2 crashed women almost certainly flattered the women's average and made this comparison unreliable.
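Here's a minimal sketch of the difference this makes, with made-up completion times purely for illustration (the individual-level data isn't published, as far as I can tell), matching the paper's counts of 1 crashed man and 2 crashed women:

```python
import numpy as np
from scipy import stats

CRASH = 1e9  # sentinel worse than any real completion time; for a rank
             # test only the ordering matters, so the exact value is moot

# Hypothetical completion times in seconds; CRASH marks an emergency crash.
men   = np.array([212, 250, 231, 198, 305, 266, 240, 221, 259, CRASH])
women = np.array([205, 244, 238, 270, 215, 228, 251, 233, CRASH, CRASH])

# Survivor-biased version (what the paper did): drop the crashes and
# compare only the completers.
m_ok, w_ok = men[men < CRASH], women[women < CRASH]
print("completers only, p =", stats.ttest_ind(m_ok, w_ok).pvalue)

# Worst-rank version: keep everyone, let the crashes tie for last place,
# and use a rank-based test so the worst outcomes count against their
# group instead of silently vanishing.
print("worst-rank, p =",
      stats.mannwhitneyu(men, women, alternative="two-sided").pvalue)
```

A fuller analysis would model time-to-completion with the crash as a terminal event (a survival-analysis framing), but even this simple worst-rank fix stops a group's worst outcomes from quietly improving its average.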
Another thing they compare is situational awareness, but those ratings are self-reported by the pilots, not measured by the researchers. You can see why that estimate will be biased: self-assessments are subject to overconfidence and social-desirability effects, and if men and women rate themselves differently on average, the comparison is confounded before it starts.
The methodological flaws would need to be addressed even with larger samples, but in any case the low power here means the results should be taken with a grain of salt until replicated with better sampling and study design. To be fair, the authors state these caveats and limitations explicitly in the text; the news outlets and editorials, of course, do not.