It's that time of year again - across the land teachers are to be seen seeking the views of their charges in the annual round of "student voice" surveys. Regardless of the motivation behind asking the questions, at some point shortly thereafter, the aforementioned surveys will be subject to some form of data analysis - and most importantly, conclusions will be drawn, practices changed and possibly interventions put into place.

However (and I'll make a big statement here): most surveys on student voice are constructed and processed incorrectly, and conclusions are drawn from poorly understood statistical principles.

Enter Likert scales
Regardless of your prior knowledge of Likert scales, you will have used them - they are amongst the most frequently used tools in questionnaire surveys. At their heart, they are a question or statement against which the respondent indicates their feeling. For example:


Most questionnaires will have a Likert scale - they are easy to create and the respondents tend to find them easy to complete. But they are constantly misused.

1 - Number of responses
People tend to want to avoid being seen as holding extreme views. Within statistics, this is called central tendency bias, and the result is that survey responses are biased towards the central, less extreme part of the scale. Surveys with an ODD number of possible responses have a convenient "fence" that can be sat on. The solution in most cases is to use an EVEN number of choices, forcing the respondent to "take a view". In the case above, removing the "neutral" option will produce data that gives more insight into our students' opinions of the past 12 months of science.

2 - Too "granular"
Giving the respondent the option to choose from a set of categories is not the same as deciding how we need to report the data. Take this question on the amount of practical work in science:


Initially, 8 levels of response are offered - an EVEN number of choices. But how meaningful is it to report such granular data? Condensing it into the statement that "90% of respondents felt that they undertook practical activities at least once per week, with 50% stating that practical science took place every lesson" is far more useful than the granular detail. Yes, we've lost resolution and analytical detail - but we've gained more insight into what's happening.
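As a rough sketch of how that condensing might be done in practice (the eight labels and counts below are largely invented for illustration - only "Every lesson", "Never", and the 90% / 50% figures come from the example above):

```python
from collections import Counter

# Hypothetical raw responses to the practical-work question.
# Only "Every lesson" and "Never" appear in the original question; the other
# six labels (and all of the counts) are invented for illustration.
responses = (["Every lesson"] * 25 + ["Most lessons"] * 12 + ["About once a week"] * 8
             + ["Once a fortnight"] * 2 + ["Once a month"] * 1 + ["Once a term"] * 1
             + ["Once a year"] * 0 + ["Never"] * 1)

# Collapse the eight fine-grained categories into three reportable bands.
bands = {
    "Every lesson": "at least weekly",
    "Most lessons": "at least weekly",
    "About once a week": "at least weekly",
    "Once a fortnight": "less than weekly",
    "Once a month": "less than weekly",
    "Once a term": "rarely or never",
    "Once a year": "rarely or never",
    "Never": "rarely or never",
}

counts = Counter(bands[r] for r in responses)
total = len(responses)
for band in ("at least weekly", "less than weekly", "rarely or never"):
    print(f"{band}: {counts[band]}/{total} ({100 * counts[band] / total:.0f}%)")
```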


3 - Means don't mean anything (The biggie and most overlooked)
The biggest mistake in handling Likert data is computing a mean value from the responses - and then drawing conclusions from that manipulation.

Likert data is ordinal, not numeric. In the example above, the "scale" is a perception scale moving from "Every lesson" to "Never". Where the trouble starts is when we conveniently "code" the responses as follows:

The moment we link the perception values to a numerical scale we are asking for trouble - remember, the column headings are descriptive labels, divorced from any numerical meaning or value. A response of Strongly agree (4) does not imply that this "agreement" is 4 times stronger than the Strongly disagree (1) option. Yet we are drawn into thinking that this is now numerical data that can be manipulated in some way.

To illustrate further:

The same questionnaire, but with colours instead of numbers. We would never consider calculating an average of Red, Yellow, Blue and Green - because we can see that they are different things and the idea of an average makes no sense.

To amplify: in calculating the mean, we are drawn into thinking about weighted averages:

In the first row, the weighted average is given by: ((1 x 13) + (2 x 12) + (3 x 3) + (4 x 22)) / 50 = 134 / 50 = 2.68 - somewhere between 2 (Disagree) and 3 (Agree).
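For the record, here is that (ill-advised) weighted-average calculation written out in code, using the response counts from the example:

```python
# Response counts from the example, coded 1 = Strongly disagree ... 4 = Strongly agree.
counts = {1: 13, 2: 12, 3: 3, 4: 22}

weighted_sum = sum(code * n for code, n in counts.items())  # 134
total = sum(counts.values())                                # 50
print(weighted_sum / total)                                 # 2.68
```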

Now, the mid-point of the scale would be 2.5 - below which we are in "disagree" territory and above which we are in "agree" territory.

So we construct a form of words like: "The average score of 2.68 indicates slight agreement with the statement 'I preferred science in KS3'." Total rubbish.

In this case, 25 learners came down on the "disagree" side and 25 learners on the "agree" side - a 50/50 split. A better commentary would be: "Learners were equally divided over science at KS3, with 50% of them claiming to have enjoyed science more at KS3 and 50% holding the contrary position. However, it is noted that whilst those learners who disagreed with the statement were equally divided between 'disagree' and 'strongly disagree', those agreeing with the statement were far more polarised, with 22 claiming to 'strongly agree' and only 3 to 'agree'. It could be concluded that those learners who did prefer science at KS3 hold that belief more strongly than those who did not. More work is needed with these constituents to understand their depth of feeling over KS3 science."
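A minimal sketch of pulling that 50/50 split straight from the response counts, rather than from a mean (the counts are those from the worked example above):

```python
counts = {"Strongly disagree": 13, "Disagree": 12, "Agree": 3, "Strongly agree": 22}
total = sum(counts.values())

disagree_side = counts["Strongly disagree"] + counts["Disagree"]  # 25
agree_side = counts["Agree"] + counts["Strongly agree"]           # 25

print(f"Disagreed: {100 * disagree_side / total:.0f}%")  # 50%
print(f"Agreed:    {100 * agree_side / total:.0f}%")     # 50%
```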

To compound this, what we probably have in our real survey is the mean score for the boys and the mean score for the girls. Let's say they are 2.71 (boys) and 2.40 (girls). We firstly conclude that "there is a difference between the boy and girl responses", and we further conclude that boys preferred KS3 and girls didn't. Again, total rubbish.

Going back to our colour-coded example, just how do I now calculate the mean value for the first question: ((RED x 13) + (YELLOW x 12) + (BLUE x 3) + (GREEN x 22)) / 50 = ????

In calculating these mean values, we've totally forgotten that a Likert scale is an entirely arbitrary, ordinal scale - its points are not numbers at all, but labels.


Back to my examples and the final two statements:

  • I like doing practical work = 3.04, so we conclude that our learners "agree" with the statement. We miss that 4 learners clearly hate practicals, and we never ask why.

  • I don't like group work = 2.5 - smack in the middle - so we conclude that our learners have no preference over group work. We miss the fact that 20 learners love group work and 20 learners hate it - why the extreme difference? (See the sketch below.)
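To see how a mean of 2.5 can hide that split, here is one hypothetical distribution consistent with it (the two groups of 20 come from the bullet above; the 5 + 5 in the middle are invented so that 50 responses average exactly 2.5):

```python
# "I don't like group work", coded 1 = Strongly disagree ... 4 = Strongly agree.
counts = {1: 20, 2: 5, 3: 5, 4: 20}

mean = sum(code * n for code, n in counts.items()) / sum(counts.values())
print(mean)  # 2.5 -- "no preference", despite 40 of the 50 learners sitting at the extremes
```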

Arrghhhhh!!!!

All this adding, multiplying and dividing of Likert data makes no mathematical sense. We might as well ask "what's the average of RED, YELLOW, BLUE and GREEN?" - but because we've coded them as numbers, we're led to think we can manipulate them. Remember, they're not numbers but labels.

Enter the modal value
For the first question about KS3, by simple observation we can identify the most common response - 22 learners, so 22/50 = 44% of learners strongly agreed with the statement "I preferred science in KS3".
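A minimal sketch of reporting the mode, again using the counts from the KS3 example:

```python
counts = {"Strongly disagree": 13, "Disagree": 12, "Agree": 3, "Strongly agree": 22}
total = sum(counts.values())

mode = max(counts, key=counts.get)   # the most common response
print(f"{mode}: {counts[mode]}/{total} ({100 * counts[mode] / total:.0f}%)")
# Strongly agree: 22/50 (44%)
```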

To compare boys vs girls and support statements such as "there is a difference between the boy and girl responses" and "boys preferred KS3 and girls didn't", we could compare these modes and make a statement about the percentages.
However, to do so with conviction, we need a non-parametric test (in this case the Mann-Whitney U test), specifically designed to test for differences between sets of ordinal data.
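As a sketch of what that test looks like in practice (this uses scipy's mannwhitneyu; the two groups of responses below, coded 1-4, are invented purely to show the mechanics):

```python
from scipy.stats import mannwhitneyu

# Hypothetical ordinal responses, coded 1 = Strongly disagree ... 4 = Strongly agree.
# Substitute the real boys' and girls' responses from the survey.
boys = [1] * 6 + [2] * 5 + [3] * 2 + [4] * 12
girls = [1] * 7 + [2] * 7 + [3] * 1 + [4] * 10

result = mannwhitneyu(boys, girls, alternative="two-sided")
print(result.statistic, result.pvalue)
# Only a small p-value would justify claiming a real difference between the
# boys' and girls' responses; otherwise the gap in the means may just be noise.
```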


Conclusions
What does this mean for us mere mortal teachers?

Firstly, we need to accept and acknowledge that there is far, far more to designing a questionnaire than just asking the questions - as (a) we will probably find that we're not asking what we think we are, and (b) depending on what we ask and how we ask it, we might not be able to draw the conclusions we think we can. Remember "central tendency bias".

Secondly, presenting more granular data does not make the results "better" - less really is more in this case.

and finally....


Never, never, ever average Likert scale data -- it don't make sense... Simples.


References:
The interweb is awash with Likert debates. Check out what Google says here.