A couple of months ago, I wrote a summary of a recent paper arguing you shouldn’t analyse ordinal data like interval or ratio data. If you do so, there’s a risk of inflated Type I and Type II error rates, as well as reduced power [zotpressInText item=”{VD8XETGZ}”][note]Open access version here[/note]. In response, Helen Wauck wrote a comment asking about the relevance of this paper from [zotpressInText item=”{L8K4PCQ7}” format=”%a%, (%d%, %p%)”]. She had also been taught to use metric models for ordinal data, with this paper used as justification. The article argues you can analyse ordinal data like interval data because the two kinds of test produce similar results. It attributes this to the tests being robust to assumption violations, such as the data being non-metric and not being normally distributed.

Norman frames the paper by outlining common criticisms of statistical analyses given during peer review. The 3 are: ‘You can’t use parametric tests in this study because the sample size is too small’; ‘You can’t use t tests and ANOVA because the data are not normally distributed’; and ‘You can’t use parametric tests like ANOVA and Pearson correlations… because the data are ordinal and you can’t assume normality’. In this analysis, I’m only going to focus on the 3rd[note]Though the counter argument he provides to point 2 involves a mischaracterisation of the Central Limit Theorem, which I explore in this post.[/note]. In response to this last criticism, Norman gives 3 answers.

##### 1st answer: a history of robustness

Tests of central tendency, e.g. ANOVA and t tests, have previously demonstrated robustness. These studies analysed data sets using parametric and non-parametric tests and compared the results. If the results agreed, i.e. both tests returned significant results, then the test was said to be robust. Various studies have been performed over the years with different distributions and sample sizes as low as 4 per group. In the vast majority of comparisons reported in [zotpressInText item=”{L8K4PCQ7}” format=”%a%, (%d%, %p%)”], both tests were statistically significant. However, this focus on merely retaining or rejecting a two-sided null hypothesis ignores a lot of valuable information. As Richard Morey argues, it is a ‘shallow way of thinking’. The reduction of a statistical test to a simple binary, whilst common, tells us very little (for a more detailed explanation, see [zotpressInText item=”{6P96KSR4}” format=”%a%, %d%, %p%”]).

But this robustness also held true for correlations and went beyond retention or rejection of a two-sided null hypothesis [zotpressInText item=”{6Y8ALJBH}”]. In addition, Norman replicated the finding with real data from [zotpressInText item=”{VMUKANVW}” format=”%a%, (%d%, %p%)”]. The data started as a series of 10-point ordinal scales (taken at 2 time points), which he transformed into 5-point scales to make “extremely ordinal data sets”. He calculated the Pearson and Spearman correlations between Time 1 and Time 2 for each scale, then calculated the correlation between the pairs of correlations. For the original scales, the correlation between the Pearson and Spearman correlations was 0.99. For the transformed 5-point scales, the correlation was 0.987. There was a near perfect correlation between the parametric and non-parametric measures. Whilst this is an impressive correlation, it only pertains to one type of data set. There is no evidence this extends to other kinds of data, with varying amounts of skew and different distributions.
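To make the procedure concrete, here is a minimal Python sketch of the same kind of comparison. It does not use Norman’s data: the responses are simulated from a normally distributed latent trait, which is my assumption, and every function name (`simulate_scale`, `collapse_to_5`, `agreement`) is hypothetical. It computes the Pearson and Spearman correlations between Time 1 and Time 2 for each simulated scale, then correlates the two sets of correlations, once on the 10-point responses and once after collapsing them to 5 points.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def to_10_point(value):
    """Round a continuous value onto a 1-10 response scale."""
    return min(10, max(1, round(value)))


def collapse_to_5(score):
    """Merge adjacent pairs of a 1-10 response: 1,2 -> 1; 3,4 -> 2; and so on."""
    return (score + 1) // 2


def simulate_scale(n, noise_sd, rng):
    """Hypothetical Time 1 / Time 2 responses sharing one latent trait (assumed normal)."""
    time1, time2 = [], []
    for _ in range(n):
        latent = rng.gauss(5.5, 2.0)
        time1.append(to_10_point(latent + rng.gauss(0, noise_sd)))
        time2.append(to_10_point(latent + rng.gauss(0, noise_sd)))
    return time1, time2


def agreement(scales, recode=lambda s: s):
    """Correlate the per-scale Pearson values with the per-scale Spearman values."""
    pearsons, spearmans = [], []
    for time1, time2 in scales:
        t1 = [recode(s) for s in time1]
        t2 = [recode(s) for s in time2]
        pearsons.append(pearson(t1, t2))
        spearmans.append(spearman(t1, t2))
    return pearson(pearsons, spearmans)


if __name__ == "__main__":
    rng = random.Random(1)
    # Ten simulated scales with different amounts of measurement noise.
    scales = [simulate_scale(200, 0.5 + 0.3 * k, rng) for k in range(10)]
    print("10-point agreement:", round(agreement(scales), 3))
    print("5-point agreement:", round(agreement(scales, collapse_to_5), 3))
```

Because the simulated trait is symmetric and well behaved, this sketch cannot tell you how the agreement holds up elsewhere. Swapping the `rng.gauss` draw in `simulate_scale` for a skewed distribution is one way to probe the concern raised above about other kinds of data.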

##### 2nd answer: converting ordinal to interval

The next answer advocates using metric tests because, whilst individual Likert questions are ordinal, Likert scales (which involve summing items) are interval[note]In the same way that summing correct answers on multiple tests produces interval data, even though the underlying data are binary.[/note]. The same argument has been put forward by other authors, such as [zotpressInText item=”{5421944:XEAJ33NB}”]. But, as Saskia Homer explains, labeling the ordinal responses with integers doesn’t turn them into numbers. They are still based on ordinal data, with unknown gap sizes between the rankings. The argument also confuses the level of measurement with the shape of the variable’s distribution [zotpressInText item=”{5421944:RKURRCBS}”]. It is true the sum of the Likert items will be more like a normal distribution, but that doesn’t convert the units into interval. Likert himself argued this wasn’t a problem, as respondents typically tend to view the response scale as a set of evenly-spaced points along a continuum (as reported in [zotpressInText item=”{7LHI2BAX}” format=”%a%, %d%, %p%”]). But this doesn’t overcome the issue of the unknown sizes of the gaps[note]For a detailed explanation, please read this thread by Saskia Homer.[/note] and has been cautioned against [zotpressInText item=”{PR4AJ877}”]. Further, [zotpressInText item=”{5421944:VD8XETGZ}”] provide empirical evidence that summing Likert items and analysing them using metric models inflates both the Type I and Type II error rate, as the distributions are likely to be non-normal.
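A small Python sketch shows why the unknown gap sizes matter. The two groups of responses and the second coding (where the top category sits much further from the rest) are invented for illustration; nothing here comes from the papers cited. Both codings preserve the order of the categories, yet they disagree about which group has the higher mean.

```python
from statistics import mean

# Hypothetical responses to one 5-point item (1 = lowest, 5 = highest).
group_a = [3, 3, 3, 3, 3, 3]
group_b = [1, 1, 1, 1, 5, 5]

# Two order-preserving ways of turning categories into numbers.
equal_gaps = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}     # the usual integer labels
wide_top_gap = {1: 1, 2: 2, 3: 3, 4: 4, 5: 10}  # assumed: top category is far away


def mean_score(responses, coding):
    """Mean of the responses after mapping each category to its numeric code."""
    return mean(coding[r] for r in responses)


for name, coding in [("equal gaps", equal_gaps), ("wide top gap", wide_top_gap)]:
    a = mean_score(group_a, coding)
    b = mean_score(group_b, coding)
    higher = "A" if a > b else "B"
    print(f"{name}: mean A = {a:.2f}, mean B = {b:.2f}, higher group = {higher}")
```

With the integer labels group A has the higher mean; with the wider top gap group B does. Since ordinal data cannot tell us which spacing is right, a comparison of means inherits that ambiguity.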

##### 3rd answer: the numbers don’t know

Norman’s last piece of evidence is conceptual. Even if the numbers are drawn from a Likert scale, so we cannot theoretically guarantee the distances between the numbers are equal, they don’t have a magical property which means they can’t be analysed as metric data. The numbers aren’t aware of where they were drawn from, so they behave the same way regardless. As long as the numbers are reasonably distributed, we can make inferences from the data. However, this rests on a key assumption against which there is strong evidence. [zotpressInText item=”{VD8XETGZ}” format=”%a%, (%d%, %p%)”] provide comprehensive theoretical evidence as to why you can’t assume a normal distribution with ordinal data for the tests Norman argues for. Doing so can inflate both the false positive and false negative rate, among other things.

##### Is it ever acceptable to analyse ordinal data using metric models?

Under certain conditions, using metric models for ordinal data may be suitable. [zotpressInText item=”{V4KGVQI8}” format=”%a%, (%d%, %p%)”] ran simulations for confirmatory factor analysis, comparing the results from maximum likelihood estimation (a metric model) to weighted least squares means and variance adjusted estimation (a non-parametric model). They did this with a variety of sample sizes, numbers of factors, and numbers of categories for the ordinal data. The most relevant result concerned how the two methods compared across different numbers of categories: when the data was divided into 5 or more categories, the metric model performed as well as the non-metric model.

However, these results relied on polychoric correlations. Polychoric correlations are estimates of the linear relationship between two continuous variables when you only have ordinal data [zotpressInText item=”{EGELY3WS}”]. There is a significant amount of empirical work demonstrating that using maximum likelihood estimation with polychoric correlations is inappropriate, as it often produces biased parameter estimates [zotpressInText item=”{FJRIQ4G2}”] and standard errors [zotpressInText item=”{K3K7GDFD}”].

Despite the arguments against it, there has been little direct comparison of the methods under various conditions. [zotpressInText item=”{UEMGV4HM}” format=”%a%, (%d%, %p%)”] compared the results from metric models to non-metric models under a wide range of factors, e.g. number of categories, underlying distribution, and sample size. Non-parametric models were superior when the data had fewer than 5 categories. With 5 or more categories, both metric and non-parametric methods gave acceptable performance. The choice as to which is more appropriate depends on other aspects of the data, e.g. the symmetry of the observed distribution and the likely underlying distribution of the constructs being measured.

##### Why the Norman paper might pose problems

Whilst it seems that most of the time it is inappropriate to analyse ordinal data as if it were metric, there are instances where it is justifiable. However, the papers demonstrating its suitability are precise about the limiting conditions, i.e. when it is and is not appropriate to use these methods. [zotpressInText item=”{L8K4PCQ7}” format=”%a%, (%d%, %p%)”] paints in much broader strokes as to when it is acceptable to analyse ordinal data like metric data. It may be used to justify any use of metric analysis, when often it is suboptimal. A better approach would be to use ordinal models, with [zotpressInText item=”{PR4AJ877}” format=”%a%, (%d%, %p%)”] providing a valuable tutorial on how to do so.

Whether you can analyse ordinal data like interval or ratio data greatly depends on the structure of your data and the types of questions you want to ask. For certain kinds of structural equation modelling (SEM), like ANOVA and t tests, there is good reason to be cautious about doing so. But, under certain conditions, there is reason to believe analysing ordinal data using metric models is acceptable for other kinds of SEM, like confirmatory factor analysis with 5 or more categories.

##### References

[zotpressInTextBib style=”apa” sort=”ASC”]
