Ordinal scales are everywhere in psychology. From mood ratings to pain scales, they are one of the most prevalent tools in the field. They also appear frequently in other domains (e.g. medicine and education), most often as Likert scales. These scales require participants to give a score along an increasing scale in response to a question or series of questions, e.g. “How afraid are you of spiders?”. The number of options typically ranges from 5 to 11. Scales like these produce ordinal data1 as opposed to metric data2.
It is common practice to analyse ordinal data as though it were metric (Liddell & Kruschke, 2018): you analyse the data using tests that assume it is interval or ratio (e.g. t-tests, correlations), and therefore that there are equal-sized gaps between the units. These models also assume a normal distribution3 for the residual noise4. When the data is treated as ordinal, a different assumption is used for the noise: a thresholded cumulative normal distribution.
Mo models, mo problems
The two models described above, with their different noise assumptions, treat the data points in distinct ways. The former describes a single data point’s probability as the density of a normal distribution curve at a given value, whereas the latter describes a datum’s probability as the cumulative normal probability between two thresholds on an underlying construct. The graphs below show these two models.
For the metric model, the probability of a given ordinal response is just the probability density at the corresponding metric score. But for the thresholded cumulative normal model (also called an ordered probit model), the ordinal levels are created by dividing the normal distribution curve of an underlying continuous5 value into chunks. For example, if you asked someone to rate “How afraid of spiders are you?”, there is an assumption that the underlying fear is continuous and is divided at certain thresholds to create discrete values. The third graph above shows an assumed underlying continuous value with a normal distribution for the density of scores. The dashed lines show the divides along the continuous scale (the threshold values) which make up the ordinal response options. The areas under the normal distribution curve and between the dashed lines make up the probability of each ordinal response. The bar charts just above this graph represent the cumulative probability.
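The ordered probit calculation just described can be sketched in a few lines of Python. The thresholds and latent mean/SD below are invented purely for illustration; in a real analysis the inner thresholds would be estimated from the data:

```python
from scipy.stats import norm

# Illustrative values only: four thresholds divide a latent scale into
# five ordinal responses. The latent mean and SD are also assumed here.
thresholds = [1.5, 2.2, 3.1, 4.5]
latent_mean, latent_sd = 3.0, 1.0

# P(response = k) is the normal probability mass between consecutive
# thresholds: Phi(t_k) - Phi(t_{k-1}).
cuts = [float("-inf"), *thresholds, float("inf")]
probs = [norm.cdf(hi, latent_mean, latent_sd) - norm.cdf(lo, latent_mean, latent_sd)
         for lo, hi in zip(cuts, cuts[1:])]

for k, p in enumerate(probs, start=1):
    print(f"P(response = {k}) = {p:.3f}")
```

The five areas printed here are exactly the “chunks” in the third graph: slices of the latent normal curve between the dashed threshold lines, and they necessarily sum to 1.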
The probabilities of the responses under the metric model and the ordered probit model are not the same, even though both assume the same underlying distribution. This is due to the threshold values between the intervals: the outer thresholds are fixed, but the inner thresholds are estimated from the data, which is why the threshold between ‘2’ and ‘3’ on the graph is closer to ‘2’. To analyse data with a metric model, the data needs equal-sized gaps between units (which ordinal data doesn’t have) and a normal distribution. Ordinal data is frequently skewed or multi-modal, violating the assumption of normality (Ghosh et al., 2018), so the distribution is not appropriate for analysis as metric data.
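To see how easily that skew arises, here is a small sketch (with invented latent parameters and thresholds) that draws normally distributed latent values and discretises them into five ordinal responses. Because the thresholds sit unevenly relative to the latent mean, the ordinal distribution comes out skewed even though the latent one is perfectly normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed latent values with a normal distribution; the thresholds are
# illustrative and deliberately uneven, as estimated thresholds often are.
latent = rng.normal(loc=2.0, scale=1.0, size=10_000)
thresholds = [0.8, 1.6, 3.5, 4.4]

# np.digitize maps each latent value to the bin it falls in, giving
# ordinal responses 1-5.
responses = np.digitize(latent, thresholds) + 1

counts = np.bincount(responses, minlength=6)[1:]
print(counts / counts.sum())  # skewed, despite the normal latent values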
What does this mean for your results?
These discrepancies wouldn’t be a problem if the metric model produced robust results, and some studies have found that it does, e.g. Heeren and D’Agostino (1987) and Bevan et al. (1974). But they share a crucial weakness: they didn’t compare the results to a non-parametric model, so we don’t know how the metric model performed relative to a model that is theoretically better equipped for this kind of data. When research does compare the two, the metric model falls far short of its competitor.
Nanna & Sawilowsky (1998) used real-world data to compare how well a parametric t-test performed against a non-parametric Wilcoxon rank-sum test. The non-parametric test had greater power for almost every sample size, and therefore a better detection rate. The disparity in power grew as the sample sizes increased, running counter to the prevailing logic that once the sample size reaches 30 you can use a parametric test because the Central Limit Theorem lets you assume a normal distribution6.
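A toy simulation along these lines shows the mechanics of such a comparison. This is not a re-analysis of Nanna & Sawilowsky’s data: the group response probabilities below are invented, and the exact power figures will vary run to run; the point is simply how you would pit SciPy’s t-test against the Mann-Whitney/Wilcoxon rank-sum test on 5-point ordinal data:

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
levels = np.arange(1, 6)
# Invented response probabilities for two groups with a genuine difference.
p_a = [0.40, 0.30, 0.15, 0.10, 0.05]
p_b = [0.25, 0.30, 0.20, 0.15, 0.10]

def power(test, n, n_sims=400, alpha=0.05):
    """Fraction of simulated experiments in which `test` detects the difference."""
    hits = 0
    for _ in range(n_sims):
        a = rng.choice(levels, size=n, p=p_a)
        b = rng.choice(levels, size=n, p=p_b)
        hits += test(a, b).pvalue < alpha
    return hits / n_sims

for n in (20, 50, 100):
    print(f"n={n:3d}  t-test={power(ttest_ind, n):.2f}  "
          f"rank-sum={power(mannwhitneyu, n):.2f}")
```

With made-up probabilities like these the two tests may land close together; Nanna & Sawilowsky’s finding was that on real, skewed rehabilitation data the rank-sum test pulled ahead, and increasingly so at larger sample sizes.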
Liddell & Kruschke (2018) simulated a data set using an ordered probit model to see how well the two models explained the data. If it is acceptable to use a metric model when analysing ordinal data, the models should be roughly equally accurate. But that is not what they found.
The ordered probit model (unsurprisingly) was able to accurately capture both the effect size for the group difference (for both an effect size of 0 and 0.66) and the distribution of the responses. But the metric model wasn’t even close: it estimated an effect size of 0.49 when the true effect size was 0 (a Type I error) and -0.01 when the true effect size was 0.66 (a Type II error), as well as estimating the distribution very poorly. This pattern of results (the metric model failing to capture the distribution and effect size of the data whilst the ordered probit model succeeded) held true for real-world data.
These errors arise from discrepancies between the mean values produced by the two models. The ordered probit model correctly captured the underlying distributions and effect sizes (and therefore means) of the groups. For the metric model to do the same, its mean ordinal values would need to be the same regardless of the standard deviation (i.e. the spread) of the latent distribution. But looking at the graph below, we can clearly see they aren’t.
Source: Liddell & Kruschke (2018)
The underlying (latent) mean from the ordered probit model is on the X axis and the metric model’s mean is on the Y axis. Each line represents a different SD, i.e. a different distribution width (the larger the SD, the wider the distribution). When the underlying mean is the same between groups but the SDs differ (because of differences in distribution), the mean ordinal values for the two groups are different: the metric model reports a difference between the means when there isn’t one, a Type I error. This corresponds to points A and B on the graph. When the latent means are different but the mean ordinal values are the same (points B and D), the result is a Type II error.
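Both failure modes can be reproduced analytically. The function below, with thresholds and latent parameters invented for illustration, computes the mean ordinal value implied by a latent normal distribution. With the latent mean sitting off-centre on the scale, two groups with the same latent mean but different SDs get different ordinal means (the Type I case), while a suitably chosen pair with different latent means gets nearly identical ordinal means (the Type II case):

```python
from scipy.stats import norm

THRESHOLDS = [1.5, 2.5, 3.5, 4.5]  # illustrative equal-interval cuts

def mean_ordinal(latent_mean, latent_sd):
    """Expected ordinal response (1-5) implied by a latent normal distribution."""
    cuts = [float("-inf"), *THRESHOLDS, float("inf")]
    probs = [norm.cdf(hi, latent_mean, latent_sd) - norm.cdf(lo, latent_mean, latent_sd)
             for lo, hi in zip(cuts, cuts[1:])]
    return sum(k * p for k, p in enumerate(probs, start=1))

# Type I flavour: same latent mean, different spreads -> different ordinal means.
print(mean_ordinal(2.0, 0.5), mean_ordinal(2.0, 2.0))

# Type II flavour: different latent means -> near-identical ordinal means.
print(mean_ordinal(2.33, 0.5), mean_ordinal(2.0, 2.0))
```

The wider distribution piles latent mass past the upper thresholds (a floor/ceiling effect), dragging the ordinal mean away from the narrow group’s despite identical latent means, which is exactly the mechanism behind points A and B above.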
Many of us (including myself) were taught to analyse ordinal data with a metric model. But it seems like this is a bad idea. There is evidence that it reduces power and greatly inflates the chances of false positives and false negatives. Ordered probit models are much better suited to the task, as they allow for unequal variances across groups (an important property for a model of this kind of data). The best thing to do, it seems, is to analyse ordinal data like ordinal data.
Thank you to Alex Etz for the explanation drawing of the normal distribution of noise for a linear model.
John Kruschke recommended a few clarifications and developing the conclusion. Torrin Liddell stated he didn’t have anything to add to John’s comments and recommended this paper by Selker, Lee & Iyer (2017).
Bevan, M. F., Denton, J. Q., & Myers, J. L. (1974). The robustness of the F test to violations of continuity and form of treatment populations. British Journal of Mathematical and Statistical Psychology, 27, 199–204.
Ghosh, S. K., Burns, C. B., Prager, Zhang, D. L., & Hui, L. (2018). On nonparametric estimation of the latent distribution for ordinal data. Computational Statistics and Data Analysis, 119, 86-98.
Heeren, T., & D’Agostino, R. (1987). Robustness of the two independent samples t-test when applied to ordinal scaled data. Statistics in Medicine, 6, 79–90.
Jolliffe, I. T. (1995). Sample Sizes and the Central Limit Theorem: The Poisson Distribution as an Illustration. The American Statistician, 49(3), 269. DOI: 10.1080/00031305.1995.10476161
Liddell, T. & Kruschke, J. (2018). Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong? Journal of Experimental Social Psychology, 79, 328-348. Available at: https://osf.io/9h3et/
Nanna, M. J. and Sawilowsky, S. S. (1998). Analysis of Likert scale data in disability and medical rehabilitation research. Psychological Methods, 3(1), 55–67, doi:10.1037//1082-989X.3.1.55.
Selker, R., Lee, M. D., & Iyer, R. (2017). Thurstonian cognitive models for aggregating top-n lists. Decision, 4 (2), 87-101.
1 This means the data can be ordered, as one answer is higher than another, but you don’t know the size of the gaps or whether they are equal. For example, if you’re asked to rate how energetic you are from 1-5, you don’t know if the gap between ‘1’ and ‘2’ is the same size as the gap between ‘4’ and ‘5’.
2 Metric data can take the form of interval or ratio. Interval data means the gaps between units are equal but there isn’t a true zero (zero is an arbitrary point on the scale and negative values are possible), e.g. degrees Celsius. Ratio data also requires equal gaps between the units, as well as a true zero, e.g. Kelvin.
3 Data is often assumed to take the form of a normal distribution curve: most participants fall around the mean, and there are fewer participants the further you go from the mean. Distances from the mean are measured in standard deviations (σ, pronounced sigma, or SD); wider ranges encompass a greater percentage of all the data points, e.g. ±1.96 SDs contains 95% of them.
4 A normal distribution for residual noise simply means the probability of the position of a data point on the Y axis (as predicted by the model) follows a normal distribution curve (how far it falls from the prediction is affected by many variables, e.g. measurement error). Alex Etz gives an explanation (as well as a drawing to visualise the concept) here. A clearer version from him can be found here.
5 This means the data can theoretically be measured in infinitely small units, e.g. age. If you wanted to, you could measure your age in nanoseconds, but it’s typically measured in years. Doing so gives continuous data a discrete value for ease of comparison (though it costs some information).
6 Whilst many believe this is true, it is not necessarily (Jolliffe, 1995).