Assessing the validity of labs as teaching methods and controlling for confounds

Anyone who has taken one of the harder sciences at university, or knows someone who has, will know what “labs” are: practical assignments meant to consolidate what you’ve learnt in the lecture or seminar. They are almost ubiquitous in physics, having become widespread by the beginning of the 20th century (Meltzer & Otero, 2015), as they are in chemistry (Layton, 1990) and biology (Brownell, Kloser, Fukami, & Shavelson, 2012). Their value is widely assumed to have been demonstrated multiple times across the hard sciences (Finkelstein et al., 2005; Blosser, 1990), but questions have occasionally been raised as to their effectiveness (Hofstein & Lunetta, 2004). A new paper by Holmes, Olsen, Thomas, & Wieman (2017) sought to test whether participating in labs actually improved physics students’ final grades. Across three American universities they tested three questions: what is the impact of labs on associated exam performance; did labs selectively impact the learning of physics concepts; and are there short-term learning benefits that are “washed out on the final exams”?

Introducing (potential) bias:

Selection bias arises when participants are sorted into groups not at random but on the basis of differences that already exist in the population (Breen, Choi, & Holm, 2015). This can have a pernicious impact on your research. One potential consequence is finding that groups differ on a variable (e.g. yearly earnings) and concluding the difference is due to the factor you divided the groups on (e.g. number of books in the house), when in fact that factor has no effect. The spurious result arises because people with some personality trait, environmental experience, etc. that affects the variable being measured were selected into one group and those without it were selected into the other.[1] Another consequence is wrongly estimating the direction or magnitude of a real effect, called a type S and a type M error respectively (Gelman, 2004; Gelman & Carlin, 2014). Selection bias is very powerful (deBoer, 2017) and as such needs to be controlled for.
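To make the mechanism concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (the variable names, the normal distributions, the selection rule); none of it comes from the papers cited. A hidden trait drives both who ends up in a group and the outcome, while group membership itself has no effect at all.

```python
import random
from statistics import mean

rng = random.Random(2017)  # fixed seed so the sketch is repeatable

# Hypothetical hidden trait (e.g. prior ability) for 10,000 simulated people.
ability = [rng.gauss(0, 1) for _ in range(10_000)]

in_group = []
outcome = []
for a in ability:
    # People higher on the trait are more likely to select into the group.
    in_group.append(a + rng.gauss(0, 1) > 0)
    # The outcome depends on the trait only: group membership adds nothing.
    outcome.append(a + rng.gauss(0, 1))

selected = [y for y, g in zip(outcome, in_group) if g]
not_selected = [y for y, g in zip(outcome, in_group) if not g]

gap = mean(selected) - mean(not_selected)
print(f"Gap between groups despite zero true effect: {gap:.2f}")
```

The groups differ clearly on the outcome even though membership does nothing, which is exactly the spurious result described above.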

Holmes, Olsen, Thomas, & Wieman (2017) identified that students who enrolled in the voluntary labs were not equivalent to those who didn’t. Those who did scored higher on pre-course measures of physics knowledge (when the data were available) and outperformed those who didn’t in the final exams. So any difference between the groups might be due to more gifted and hard-working students selecting themselves into the “labs” condition, rather than because the labs themselves conferred any kind of advantage. To overcome this, they created a difference score for each student between the lab-related questions and the non-lab-related questions in the final exams. This gave a measure of each student’s relative performance on each type of question: “For example, a student who scores 1 (i.e. 100%) on the lab-related items and 0 on the non-lab-related items would get a difference score of 1. A student who scores 0.5 (i.e. 50%) on the lab-related items and 0.75 (i.e. 75%) on the non-lab-related items would get a difference score of -0.25.” This allowed them to compare the means of these difference scores across the groups to see the added value of labs.
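The difference score can be sketched in a few lines. The first two calls reproduce the worked examples quoted from the paper; the two small groups of students are made-up numbers, and the function names are my own, not the authors’.

```python
from statistics import mean

def difference_score(lab_score, non_lab_score):
    """Proportion correct on lab-related items minus non-lab-related items."""
    return lab_score - non_lab_score

def mean_difference_score(students):
    """Average difference score for a list of (lab, non_lab) score pairs."""
    return mean(difference_score(lab, non_lab) for lab, non_lab in students)

# The two worked examples quoted from the paper.
print(difference_score(1.0, 0.0))   # 1.0
print(difference_score(0.5, 0.75))  # -0.25

# Hypothetical students: lab takers score higher overall, but their
# advantage is the same on lab-related and non-lab-related items.
lab_takers = [(0.75, 0.75), (0.5, 0.5), (1.0, 0.75)]
non_takers = [(0.5, 0.5), (0.25, 0.25), (0.75, 0.5)]

added_value = mean_difference_score(lab_takers) - mean_difference_score(non_takers)
print(added_value)  # 0.0
```

Because each student acts as their own baseline, a group that is simply stronger across the board shows no added value on this measure.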

Their results? They found no measurable benefits of labs for students.

Have you got the power?:

Whenever a null result like this comes up, the question of power must be raised. Did this study have suitable power to detect a small yet interesting effect? You can work out the sensitivity of an experiment by putting the study’s alpha level (0.05), power (95% or 80%), total sample size (2267), and number of groups (2) values into G*Power. This tells us their study had 95% power to detect an effect size f of 0.0757429 or 80% power to detect an effect size f of 0.0588656.

We can then work out the smallest partial eta-squared (ηp²) effect size[2] this study could detect at 95% and 80% power respectively. Using the formula ηp² = f² / (1 + f²) (Cohen, 1988), we find that this study could detect an ηp² effect size of 0.005704262 (90% confidence interval [0, 0.001325457]) with 95% power and an ηp² effect size of 0.003453193 (90% confidence interval [0, 0.001207499]) with 80% power. Using the estimates provided by Cohen (1988) for eta squared (η²)[3], we see that 0.01 is classed as a “small” effect. Thus, the current study had 95% power to detect an effect size roughly half the size of what is broadly classed as a “small” effect. I would argue this study had enough power to detect a small yet interesting effect, but did not find one because group membership didn’t have enough of an impact on the students’ final grades.
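The conversion from Cohen’s f to ηp² is easy to check by hand. The sketch below takes the two f values reported by G*Power as given (it does not redo the sensitivity analysis itself) and applies the formula above; the function name is my own.

```python
def cohens_f_to_partial_eta_squared(f):
    """Convert Cohen's f to partial eta-squared: f^2 / (1 + f^2)."""
    return f ** 2 / (1 + f ** 2)

eta_95 = cohens_f_to_partial_eta_squared(0.0757429)  # f detectable at 95% power
eta_80 = cohens_f_to_partial_eta_squared(0.0588656)  # f detectable at 80% power

cohen_small = 0.01  # Cohen's (1988) benchmark for a "small" eta-squared

print(f"95% power: {eta_95:.6f} ({eta_95 / cohen_small:.2f} of a 'small' effect)")
print(f"80% power: {eta_80:.6f} ({eta_80 / cohen_small:.2f} of a 'small' effect)")
```

Both values come out at roughly half and a third of Cohen’s “small” benchmark respectively.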

Controlling for confounding constructs:

This is a large-scale study that does a good job of controlling for possible confounding variables. But as the perfectly titled paper by Westfall & Yarkoni (2016) states, controlling for confounding constructs is harder than you think.

When a paper states that the authors controlled for a variable (Westfall & Yarkoni use the example of controlling for temperature when examining the relationship between ice cream sales and deaths by falling into a swimming pool), this statistical control is only as good as the measurement of that variable. If you use a noisy and unreliable measurement (e.g. subjective ratings of temperature rather than a more objective, correctly functioning thermometer) you can find a relationship or a difference that isn’t truly there. The technique Holmes, Olsen, Thomas, & Wieman use in their paper is an example of a reliable measure: the difference in scores between the lab questions and the non-lab questions. This allowed them to see the added value of the labs when comparing the means without relying on a noisy measurement, because they used recorded grades rather than, say, students’ beliefs about how well they did. If they had used a noisy measurement, the chances of a false positive would be greatly increased.[4]
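A small simulation illustrates the point using the ice cream example. This is a sketch under my own assumptions (simulated standard-normal data, a hand-rolled partial correlation), not the analysis Westfall & Yarkoni ran: temperature drives both variables, and we “control” for it once with the true values and once with a noisy measurement.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation between x and y after statistically controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical data: temperature causes both outcomes; they do not affect each other.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream_sales = [t + rng.gauss(0, 1) for t in temperature]
pool_deaths = [t + rng.gauss(0, 1) for t in temperature]

# An unreliable measurement of temperature (e.g. subjective ratings).
noisy_temperature = [t + rng.gauss(0, 1) for t in temperature]

r_true = partial_correlation(ice_cream_sales, pool_deaths, temperature)
r_noisy = partial_correlation(ice_cream_sales, pool_deaths, noisy_temperature)

print(f"Controlling for true temperature:  {r_true:.2f}")
print(f"Controlling for noisy temperature: {r_noisy:.2f}")
```

With the true temperature the leftover relationship is close to zero; with the noisy measurement a sizeable spurious relationship survives the “control”.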

One criticism:

Holmes, Olsen, Thomas, & Wieman identify a few limitations of their study which I agree with. I would add one (minor) criticism: the use of partial eta-squared (ηp²) instead of partial omega-squared (ωp²) or partial epsilon-squared (εp²). As Lakens (2015) shows, ηp² overestimates the amount of variance due to being part of a group by quite a bit, especially with more groups, a smaller number of participants, and a smaller effect size. The authors should therefore have used the more accurate ωp² or εp².[5]
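To see how the three estimators differ, here is a minimal sketch for a one-way design (where the partial and non-partial versions coincide), using the standard textbook formulas based on sums of squares. The two tiny groups of scores are made up for illustration and are not the study’s data.

```python
from statistics import mean

def one_way_effect_sizes(groups):
    """Eta-, epsilon- and omega-squared for a one-way between-subjects design."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_total = ss_between + ss_within
    df_between = len(groups) - 1
    ms_within = ss_within / (len(all_values) - len(groups))
    # Epsilon- and omega-squared subtract the between-group variance expected by chance.
    corrected = ss_between - df_between * ms_within
    return {
        "eta_squared": ss_between / ss_total,
        "epsilon_squared": corrected / ss_total,
        "omega_squared": corrected / (ss_total + ms_within),
    }

# Hypothetical exam scores (proportion correct) for two very small groups.
labs = [0.5, 0.6, 0.7]
no_labs = [0.4, 0.5, 0.6]

sizes = one_way_effect_sizes([labs, no_labs])
for name, value in sizes.items():
    print(f"{name}: {value:.3f}")
```

With so few participants, eta-squared comes out roughly three times larger than the other two estimates, which is the inflation Lakens describes. With a sample as large as the one in this study the gap shrinks considerably.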


I think this is a well-designed study with an interesting finding. I wanted not only to highlight its results but also to use it as an example to discuss the importance of controlling for selection bias, how difficult it can be, and the negative impact it can have if you don’t. With regard to the future of labs as a teaching method, I agree with the authors’ conclusions: we don’t need to scrap labs altogether, but we should change their focus, e.g. towards evaluating data and models.

Author feedback:

I contacted the lead author Professor Natasha Holmes and she was happy with how the post was written. I am grateful for her clarification of some of the study details.


Blosser, P.E. (1990). The Role of the Laboratory in Science Teaching. Research Matters – to the Science Teacher, No. 9001. Available online at: [accessed on: 11/08/2017].

Breen, R.; Choi, S.; & Holm, A. (2015). Heterogeneous Causal Effects and Sample Selection Bias.  Sociological Science 2: 351-369.

Brownell, S.E.; Kloser, M.J.; Fukami, T.; & Shavelson, R. (2012). Undergraduate Biology Lab Courses: Comparing the Impact of Traditionally Based “Cookbook” and Authentic Research-Based Courses on Student Lab Experiences. Journal of College Science Teaching; Washington 41 (4), 36-45.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed., New Jersey: Lawrence Erlbaum Associates, Inc.

deBoer, F. (2017). why selection bias is the most powerful force in education. Available online at: [accessed on 24/08/2017].

Finkelstein, N.D.; Adams, W.K.; Keller, C.J.; Kohl, P.B.; Perkins, K.K.; Podolefsky, N.S.; Reid, S.; & LeMaster, R. (2005). When learning about the real world is better done virtually: A study of substituting computer simulations for laboratory equipment. Physical Review Special Topics - Physics Education Research, 1, 010103.

Gelman, A. (2004). Type 1, type 2, type S, and type M errors. Available online at: [accessed on: 24/08/2017].

Gelman, A. & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9 (6), 641–651.

Hofstein, A. & Lunetta, V.N. (2004). The Laboratory in Science Education: Foundations for the Twenty-First Century. Science Education, 88 (1), 28-54.

Holmes, N.G.; Olsen, J.; Thomas, J.L.; & Wieman, C.E. (2017). Value added or misattributed? A multi-institution study on the educational benefit of labs for reinforcing physics content. Available online at: [accessed on: 11/08/2017]

Kelley, K. (2017). MBESS (Version 4.0.0 and higher) [computer software and manual]. Accessible from

Lakens, D. (2015). Why you should use omega-squared instead of eta-squared. Available at: [accessed on: 24/08/2017].

Layton, D. (1990). Student laboratory practice and the history and philosophy of science. In E. Hegarty-Hazel (Ed.), The student laboratory and the science curriculum (pp.37–59). London: Routledge.

Meltzer, D.E. & Otero, V.K. (2015). A brief history of physics education in the United States. American Journal of Physics, 83 (5), 447-458.

Image credit: Ron Kimball/KimballStock



  1. It doesn’t have to be an “all or nothing” distinction between the groups but I simplified the description to emphasise the point.
  2. ηp² is a measure of the proportion of variance in a dependent variable (final grades achieved) due to group membership defined by an independent variable (whether they attended labs or not) with the effects of other independent variables/interactions partialled out.
  3. When you only have one independent variable in the analysis i.e. you are performing a one-way ANOVA like in this study, then ηp² and η² are the same (Glen, 2016).
  4. Westfall & Yarkoni show in their paper how much the false positive rate can increase when doing so.
  5. This is mainly a moot point as the effect size recorded for attending labs was 0.00005. But the effect size for the differences in instructor across institutions was 0.105, so the use of ηp² may have overestimated the amount of variance due to the quality of the instructor.
