Should we retire nominal, ordinal, interval, & ratio levels of measurement?

One of the first things psychology students are taught is levels of measurement. Every student must wrap their head around the four forms data can take: nominal, ordinal, interval, or ratio. These levels are the bedrock of many students’ understanding of measurement, including mine. I didn’t realise until recently that there were questions about their validity and utility. Should we still use these levels of measurement? Do they aid our understanding of measurement? Or do they need to retire?

On the level

The levels of measurement we all know and love were proposed by Stanley Smith Stevens in the 1940s. These levels are based on whether the meaning of the data changes when different classes of transformation[note]Transformations refer to any mathematical function that is applied to each observation or data point.[/note] are applied [zotpressInText item=”{5421944:GJPTUX5Q}”]. For example, nominal data categories can be completely arbitrary. You can use numbers, letters, names, pretty much anything, and the information in the data is the same. Thus, they can be transformed in many ways. But you can’t do the same for ratio data; you must use transformations that preserve the information. If you are weighing something in kilograms and you stretch your ratio scale from 0-20 to 47.5-10,006 (whilst keeping the same number of units in the scale), you no longer have the same information: the ‘fixed’ zero point is no longer fixed and the units no longer represent the same amounts.
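As a rough sketch of this point (the numbers are hypothetical, not from Stevens), linearly rescaling a 0-20 kilogram scale onto 47.5-10,006 destroys the ratios between measurements, whereas simply changing units by multiplying by a positive constant (e.g. kilograms to pounds) preserves them:

```python
# Hypothetical illustration: why an arbitrary linear rescaling is not an
# admissible transformation for ratio data, but a change of units is.

def rescale(x, old_min=0, old_max=20, new_min=47.5, new_max=10_006):
    """Map a reading on a 0-20 scale onto a 47.5-10,006 scale."""
    return new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)

a, b = 10, 5                      # two weights in kg: a is exactly twice b
print(a / b)                      # 2.0 -- a genuine ratio on the original scale
print(rescale(a) / rescale(b))    # no longer 2.0: the fixed zero is gone
print((a * 2.2046) / (b * 2.2046))  # kg -> lb keeps the ratio at 2.0
```

The rescaled values still sit in the same order, but claims like "twice as heavy" stop being true, which is exactly the information a ratio scale is supposed to carry.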

There is a ‘hierarchy of data scales based on invariance of their meaning under different classes of transformations’ [zotpressInText item=”{5421944:TVDURDJ2}”]. Nominal data is the lowest on the hierarchy[note]Because it tells you the least amount of information about the data; it only tells you the label and nothing about how it relates to other data points.[/note], then ordinal, then interval, and ratio is the highest[note]As you not only have the number but how it relates to other numbers on the same scale (including order, size of the gaps, and whether one number is a ratio of another e.g. twice as much).[/note]. Different kinds of transformations therefore tell you where your data are on this hierarchy. If the meaning is preserved after many different transformations, it is at the lowest level. The fewer transformations it can tolerate, the higher up the hierarchy it is. There is a corresponding positive relationship between where the data are on the hierarchy and the number of meaningful calculations. The higher the data are on the hierarchy, the more calculations can be performed. For nominal data, you can only count the number of cases and therefore calculate the mode. But for ratio data, you can calculate the mean (as well as median and mode), the standard deviation, correlations, etc.
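To make the bottom of the hierarchy concrete, here is a small hypothetical example (the category codes are invented for illustration): a one-to-one relabelling of nominal codes leaves the mode intact, but the mean of the codes carries no stable meaning.

```python
# Hypothetical nominal data: the numeric codes are arbitrary labels.
from statistics import mean, mode

eye_colour = [1, 1, 2, 3, 1, 2]       # 1=brown, 2=blue, 3=green
relabelled = {1: 30, 2: 7, 3: 99}     # an equally valid one-to-one coding
recoded = [relabelled[x] for x in eye_colour]

# The modal category survives the relabelling (1 maps to 30)...
print(mode(eye_colour), mode(recoded))
# ...but the means bear no relation to each other, because adding
# arbitrary labels is meaningless.
print(mean(eye_colour), mean(recoded))
```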

Do as I command!

This hierarchy is the foundation of one of the main prescriptions of Stevens’ theory of scales of measurement: your choice of statistical test should be guided by the measurement level used, such that truth statements based on the statistical analyses should remain valid under ‘admissible’ transformations of the data [zotpressInText item=”{5421944:43VGWKHH}”]. The ‘admissible’ transformations for nominal data are one-to-one transformations, like replacing one label with another. For ordinal data, order-preserving transformations[note]This is also called monotonically increasing.[/note] are acceptable e.g. moving from 0-6 to 15-27[note]The original scale went up in units of 1 starting at ‘0’ and the second went up in units of 2 starting at ’15’.[/note]. For interval data, positive linear transformations[note]Linear transformations are mathematical transformations of the data that preserve its structure.[/note] are admissible, and for ratio data, only transformations that multiply by a positive constant[note]Thus preserving the ratios between two data points.[/note] are allowed. Therefore, the higher the level of measurement, the fewer transformations are admissible.
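The admissible transformations above can be sketched in a few lines of Python (the numbers are toy values, not from Stevens):

```python
# Toy illustrations of 'admissible' transformations at each level.

# Ordinal: an order-preserving map. The 0-6 scale becomes 15-27 by
# starting at 15 and doubling the step size, i.e. y = 2x + 15.
ordinal = [0, 1, 3, 6]
transformed = [2 * x + 15 for x in ordinal]
assert transformed == sorted(transformed)   # the ordering survives

# Interval: a positive linear map (e.g. Celsius -> Fahrenheit)
# preserves the ratios of *differences* between values.
interval = [10.0, 20.0, 35.0]
linear = [1.8 * x + 32 for x in interval]
print((interval[2] - interval[1]) / (interval[1] - interval[0]))  # 1.5
print((linear[2] - linear[1]) / (linear[1] - linear[0]))          # still 1.5

# Ratio: multiplying by a positive constant (e.g. kg -> lb)
# preserves the ratios of the values themselves.
ratio = [5.0, 10.0]
scaled = [2.2046 * x for x in ratio]
print(scaled[1] / scaled[0])                # the ratio 2.0 is preserved
```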

The transformations the data can tolerate tell us which statistical tests to use. For example, ordinal data from two independent groups should be statistically analysed using tests that are invariant to changing the scale whilst preserving the order. Squaring each data point is such a monotonically increasing transformation, assuming positive numbers. Analysing ordinal results with a statistical test that converts the scores into ranks is insensitive to this admissible transformation. Whether you perform this order preserving transformation or not, the truth statements based on this statistical analysis remain the same. Thus, this test is appropriate. But theoretically you shouldn’t analyse the same data using a test that can’t tolerate this transformation e.g. Welch’s or Student’s t-test. These tests use the mean of the scores, so squaring each data point will affect this calculation. Thus, the truth statement of the result has likely changed despite the admissible transformation. Therefore, this statistical test is not appropriate.
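A minimal sketch of this point, using hypothetical scores and hand-rolled statistics (no particular library is assumed): squaring positive scores leaves a rank-based statistic unchanged but alters a mean-based one.

```python
# Hypothetical scores from two independent groups; squaring positive
# numbers is an order-preserving (monotonically increasing) transformation.
from statistics import mean, stdev

def mann_whitney_u(a, b):
    """U statistic for group a: count of (a_i, b_j) pairs with a_i > b_j,
    counting ties as half. Depends only on ranks, so it is invariant to
    any order-preserving transformation."""
    return (sum(1 for x in a for y in b if x > y)
            + 0.5 * sum(1 for x in a for y in b if x == y))

def t_stat(a, b):
    """Student's t with pooled variance. Depends on means, so it is NOT
    invariant to order-preserving transformations."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

group1 = [2, 5, 6, 9]
group2 = [1, 3, 4, 7]
sq1, sq2 = [x ** 2 for x in group1], [x ** 2 for x in group2]

print(mann_whitney_u(group1, group2), mann_whitney_u(sq1, sq2))  # identical
print(t_stat(group1, group2), t_stat(sq1, sq2))                  # different
```

Because the rank-based statistic is identical before and after the admissible transformation, any truth statement built on it is unchanged; the t statistic shifts, so conclusions built on it need not survive the transformation.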

Lord have mercy, have mercy

As stated above, one of the purposes of these levels of measurement was to determine what kinds of statistical tests could be permitted [zotpressInText item=”{5421944:IBIB5MS3}”]. The level of measurement of the data should therefore be a very strong signal as to what statistical analyses you should conduct. However, one of the main criticisms of these levels is: ‘They do not describe the attributes of real data that are essential to good statistical analysis’ [zotpressInText item=”{5421944:TVDURDJ2}”]. Thus, many argue that the levels of measurement don’t capture the important information that guides you as to how you should statistically analyse your data.

One of the most famous examples of this problem comes from [zotpressInText item=”{5421944:DUBQSK8P}”][note]Open access version here.[/note]. In it, Lord presents a thought experiment where a retired professor sells numbers for the backs of American Football jerseys (called ‘football numbers’) to university players, dispensed by a vending machine. One team suspects they were sold lower numbers than another team and complain to the professor. To see whether they were sold lower numbers by chance, the professor enlists the help of a statistician. The statistician begins by calculating the mean of the football numbers, much to the despair of the professor[note]The numbers are nominal labels so, according to Stevens, you can’t add or divide the numbers as that is as meaningless as adding and dividing colours.[/note]. After performing more forbidden mathematics, the statistician concludes that the numbers were not a random sample. We are then presented with a conundrum: how did putatively invalid calculations produce a meaningful result? As [zotpressInText item=”{5421944:TVDURDJ2}”] explain, scale type, as defined by Stevens, is not a fundamental attribute of the data. Instead, the measurement level depends on the questions you intend to ask of the data and any additional information you have. Many have argued this shows the lack of utility of Stevens’ levels of measurement.

However, the picture is more complex.

Ghost in the machine

On the surface, it seems the professor in Lord’s 1953 paper asked a question about the nominal numbers and whether they were randomly distributed. But, as [zotpressInText item=”{5421944:43VGWKHH}”] argue, what the professor actually asked and drew a conclusion about was the machine; was it in its original state (randomly shuffled) when the numbers were drawn? As such, an inference about the uniqueness of the numbers wasn’t made. The reference class was a set of possible states for the machine (fair or biased). Therefore, whether a nominal scale can be analysed as metric wasn’t explored.

So what does it mean to say the inference regarded the state of the machine, not the player numbers? The test was to see if the machine was in one of two states: fair or biased. But the machine could have been tampered with in a multitude of ways, so its bias doesn’t fall into two neat nominal categories. The machine could have had all numbers below 20 removed, thus producing a higher mean value. Alternatively, all the numbers greater than 35 could have been removed, creating a heavily biased low mean result. Thus, there is a huge range of potential bias in the machines. This amount of bias can also be ordered (a machine can be more or less biased than another hypothetical machine), so it can be treated as ordinal.

Not only that, it can be shown that the amount of bias possesses a quantitative structure that can be represented, up to linear transformation, by the population mean. To show this quantitative structure, it is sufficient to show that the amount of bias can be concatenated[note]This means the numbers can be added together and the meaning is preserved. For example: using coins to see how much various objects weigh on a balance scale, you see a gridiron football weighs 126 coins and a cricket ball weighs 47. By adding a cricket ball and 79 coins, it should weigh exactly the same as one gridiron football (example from Williams, 2019).[/note] and that the result of this concatenation has the correct properties [zotpressInText item=”{5421944:XR95BQVN}”]. The authors give the analogy of concatenating temperatures in volumes of liquid. If you have two equal volumes of water, one at 10°C and the other at 20°C, and mix them together, the resulting temperature will be 15°C[note]With a corresponding doubling of volume.[/note]. This is the equivalent of taking the mean of the individual temperatures. The same process can conceptually be applied to the amount of bias in two machines. The bias in the machines can be “added” by concatenating the numbers drawn from each machine into a random pile of numbers[note]Because the random pile will be biased a certain amount depending on the bias in the original two machines. After the concatenation, the amount of bias in this new pile will be equidistant between the two machines.[/note]. This is the same as finding the mean of the bias of the two machines[note]The authors provide the specific mathematics in the Appendix of their article.[/note]. The fact these linear transformations represent the bias equally well demonstrates that the scale of bias is interval.
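A quick simulation makes the concatenation argument concrete (the machines, number range, and sample sizes here are hypothetical, not taken from the paper): pooling equal-sized draws from two differently biased machines gives a pile whose mean lies exactly midway between the two machines’ means.

```python
# Hypothetical biased vending machines drawing football numbers 1-99.
import random

random.seed(1)
numbers = list(range(1, 100))
machine_a = [n for n in numbers if n >= 20]   # low numbers removed: biased high
machine_b = [n for n in numbers if n <= 35]   # high numbers removed: biased low

draws_a = random.choices(machine_a, k=5_000)
draws_b = random.choices(machine_b, k=5_000)
pooled = draws_a + draws_b                    # 'concatenating' the two piles

mean_a = sum(draws_a) / len(draws_a)
mean_b = sum(draws_b) / len(draws_b)
mean_pooled = sum(pooled) / len(pooled)

# For equal-sized draws the pooled mean is exactly the average of the
# two machines' means -- the 'bias' of the pile sits midway between them.
print(mean_pooled, (mean_a + mean_b) / 2)
```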

Take care

This analysis relies on the assumption that a single set of numbers can be represented in different ways. This was one of the criticisms of Stevens’ typology made by [zotpressInText item=”{5421944:TVDURDJ2}”], who stated that the level of measurement isn’t a feature of the data. But, as Zand Scholten and Borsboom argue, being able to represent the numbers in different ways based on the types of questions you are asking isn’t a flaw in the levels of measurement. It undermines the rules presented by Stevens regarding admissible tests, but not necessarily the concept of levels of measurement itself.

Whilst Lord’s paper may not be the devastating critique many think it is, what it does show is that you must be careful when thinking about how to analyse your data. You cannot blindly assume one form of analysis is correct, as what is most appropriate depends on a range of things. Lord himself, in a comment on his 1953 paper, stated that ‘utmost care must be exercised in interpreting the results of arithmetic operations upon nominal and ordinal numbers’ [zotpressInText item=”{5421944:BWQN97P4}”]. He gives an example where an ordinal data set is best analysed using an ordinal statistic (the median) because to analyse it using a metric model would require some likely unjustified assumptions. What Lord was advocating for, as evidenced by his follow-up publication, was not the complete rejection of Stevens’ typology. What he, along with many other authors, was arguing for was the appreciation of the shades of grey when running statistical analyses. Rather than thoughtlessly follow the levels of measurement, ask: what is the distribution of the data [zotpressInText item=”{5421944:IM29UBZM}”]; how does your model treat the distribution of data and is it appropriate [zotpressInText item=”{5421944:VD8XETGZ}”]? These are some of the factors that are more important than the level of measurement [zotpressInText item=”{5421944:D5WBEKRU}”].

Time to retire?

Given the repeated calls for greater consideration when analysing data, would retiring Stevens’ levels of measurement help achieve this end? Would retiring them completely encourage people to think more about their data, rather than just thoughtlessly following the prescribed typology? If they were no longer taught, what should people use instead? Some have argued the levels do more to confuse than help, so we should only think about whether variables are discrete or continuous. I find this argument the most persuasive, though I worry the discussion is moot. The levels are largely ignored when it is convenient to the researcher, so removing them entirely is unlikely to improve the situation without a greater appreciation of nuance. If retiring the typology leads to a majority critically thinking about how to analyse their data, then I am all for it. But without an overhaul of how many of us approach statistical analysis, it feels like rearranging deck chairs on the Titanic.


[zotpressInTextBib style=”apa” sort=”ASC”]

[zotpress items=”{5421944:RKURRCBS}” style=”apa”]
