ADD Blogposts open a window to the work in and across ADD. Meet our researchers from six Danish universities: Aalborg University, Aarhus University, Copenhagen Business School, Roskilde University, University of Copenhagen, and University of Southern Denmark. Read about their projects, activities, ideas and thoughts and maybe gain a new perspective on the controversies and dilemmas we face in digital democracy and how we can work to advance democracy in a digital age.
By Alf Rehn, Professor at Department of Technology and Innovation, University of Southern Denmark
ADD (Algorithms, Data, and Democracy) is a research project that could be described as being burdened with a name that might obfuscate more than it reveals. On the surface, things seem quite ordinary, even mundane. Starting from the end, we all feel we know what democracy is, at least to a functional degree. The word comes to us from Ancient Greek, dēmokratia, derived from the word for “people” and the word for “power” or “rule”. Rule by the people then, for better or worse. Simple concept, even if it might create various issues in practice. When it comes to the first word of the project, algorithms, we may feel that while the term in itself isn’t all that complicated, we are less sure as to whether we fully understand it. Sure, we might know it comes from Muhammad ibn Musa al-Khwarizmi, or to be more precise, Muhammad, the son of Musa, born in Khwarizm (today known as Khiva, Uzbekistan), whose name was later Latinized as Algoritmi as his work on the Hindu-Arabic numeral system was published in Europe (under the title “Algoritmi de numero Indorum”). Sure, we know it has something to do with calculations and the methods thereof, and frankly, that’s where most of us leave things to the nerds and the geeks. We don’t fully know how, but we do know that algorithms can, if properly tended, create strange and wonderful things, to the point that we’re even a little scared of them.
Take these two words together – such as in the notion of “algorithmic democracy” – and hackles start rising. We accept that algorithms may help in the world but combining them with democracy sounds… wrong. It smacks of automated voting, manipulation, computers taking over, and various other unpleasant things. One is a thing we like, the other a thing we are a little unsure of, and together they raise more questions than they answer. Which, if you think about it, is pretty perfect for a research project – something to like, something to doubt, and lots of questions to go round.
The astute reader will by now have noticed that I in this somewhat facetious deconstruction of the project’s name have not touched upon the middle word at all, the little word “data”. This is the only one of the three terms that has a basis in Latin, as it comes from the verb dare, which means “to give”. A thing given or granted, then, is a datum, and the plural of this is data. Thus, data means something like “things given to us”. In medieval times, philosophy started using this term to indicate things that were “a given”, i.e. true for the purpose of argument or reasoning (sometimes put in the form “data rerum”). Over time, science adopted this use, and the etymology got muddled. Data became a mass noun, used quite broadly indeed, and as this happened, the term got questioned less and less. Today we use it as part of a little hierarchy, in which data is assumed to be the raw material (as in “data is the new oil”), which can be structured into information, and in context turned into knowledge. It seems all so very tidy!
Now, let’s test this in practice. The following is data: 48, 12, 9, 28, 18, 24, 22, 52
Presented in this form it is nigh-on useless, unless you are really pressed for lottery numbers, or looking for an unlikely password. We can add something to it, however, and it becomes a little more interesting. That string of data is in fact the ages in a group of people. We now have some information about this group, such as the fact that the majority of said group are adults, at least in the sense that they can vote. We also know that two are children, and that none are old-age pensioners. Granted, not the most thrilling information, but still. I can now add yet another dimension to this and tell you that the string of numbers describes the ages in my family, i.e. the age of me and my partner, and the ages of our children. You now have some knowledge about my family – you’re welcome. So far, so simple, right?
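For readers who like to see such a step spelled out, here is a minimal Python sketch of the move from data to information described above. It is an illustration only, not part of the ADD project. The function name `describe_ages` and the two thresholds (a voting age of 18 and a pension age of 67) are my own assumptions, chosen for the example.

```python
# The raw series from the text; on its own it is just a list of numbers.
data = [48, 12, 9, 28, 18, 24, 22, 52]

# Assumed thresholds for illustration (not stated in the text).
VOTING_AGE = 18
PENSION_AGE = 67


def describe_ages(ages, voting_age=VOTING_AGE, pension_age=PENSION_AGE):
    """Read the numbers as ages and summarize what that category tells us."""
    adults = [a for a in ages if a >= voting_age]
    children = [a for a in ages if a < voting_age]
    pensioners = [a for a in ages if a >= pension_age]
    return {
        "adults": len(adults),
        "children": len(children),
        "pensioners": len(pensioners),
        "majority_adult": len(adults) > len(ages) / 2,
    }


info = describe_ages(data)
print(info)
```

Note that nothing in the list itself says “age”. The category, and the thresholds that go with it, are supplied from outside the numbers, which is exactly where the questions begin.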
Some questions remain, however. What was the data before it became information? If you answer data, you’d be correct, but only in a general sense. The string of numbers I presented (or gave, as it were) could have been just random ones I made up. They weren’t, as I had the information of our ages at hand, but does that mean that this was information for me yet data for you, at the same time? Was it thus only data from some perspectives but information from others? What if I lied? Was it still data to you, even when I knew it was just random numbers? Going a bit further: I wrote that we “can add something to it”, in that I revealed that the numbers represented ages. Was that something data? It was a category, unusable in and of itself (consider the question “What is the median age of dragons?” – acquiring the data necessary for turning that question into information and knowledge is, sadly, not possible), so it would seem to be. But where did it come from? It clearly pre-existed the data in the series, and it is likely that the category and the data that can populate it did not emerge as separate entities. Rather, we started to pay attention to ages, and the data and the category that created such information emerged simultaneously.
So, it would seem that data isn’t just data. More to the point, what we talk about as data is what can populate categories that we have decided are important, interesting, and apposite for specific phenomena. Consider for instance the data we tend to get about our children’s classes in school: How many students there are, and their gender split. The former data is to us information about whether the class is “small” or “big” and is used in particular to ensure that the class is not “too big”. This also means that schools know that the data is not allowed to go above a certain threshold – if a school says it has a class with 50 students, this will likely be illegal and cause a storm of protests from parents. So, whilst the category “class size” might seem like one that could contain data points from 1 to a thousand or more, the data is in fact closely curated to be in a fairly limited span – in Denmark between 24 and 28, with some outliers. Can data and information when it comes to class size really be separated? Then we have the question of genders in class. Without even thinking about it, we work from the assumption that the “correct” data for this category is something approximating an even split, such as 14+14 in a class of 28. Schools are, again, well aware of the curation demands for this, and so ensure that classes only rarely skew heavily in their gender split. So the assumed data in the statement “There are 13 boys and 14 girls in my child’s class” is in fact affected by the knowledge assumptions of said category. Further again, why this specific data? Size and gender, and for most parents, relatively little beyond this. It is for instance very unusual to get data about class happiness, about the noise levels in class in dB, or the average reading speed among the students. All these would be data about the class, but for various reasons only a very limited amount of data is considered important enough to collect.
In some cases, this can be due to the difficulty measuring it – such as in the case of happiness. In other cases, this can be due to fears of a backlash – no parents want their children to be in a noisy class. A seminal essay of second-wave feminism, written by Carol Hanisch in 1969, was titled “The Personal Is Political”, and emphasized the politics underlying much of what was considered personal or private. Today, we may need to open up to the fact that data, far from being just the neutral basis for information, is political as well.
To some, this comes as no surprise. Much of what has been discussed about algorithms, data, and democracy has been very attuned to questions about privacy and biases, often with the assumption that when it comes to data, less is better. If Big Tech has less data about us, they won’t be able to manipulate us in the same way – or at least so the story goes. This, however, ignores the lesson we should learn from the etymology of the word “data”. It stands for that which is given, the assumptions made regarding what is important, the manner in which we give ourselves to the world. We are (mostly) happy to give our gender and age, as these given categories have been with us as definitional from the moment we learnt to speak – the first things we teach a child to communicate to others are their name and age. Few think about the fact that this is not in fact data about us, but information. The categories here precede us, and as we are born, we are assigned a gender and have our age recorded. Less as data than as part of an information system, ready to categorize us, to treat us as given. The smallest aberration to this, and the system gets a hiccup, pushing away the data that does not fit in. Consider for instance the case of a young man in my extended family. In the Danish data systems he is well-categorized, with all the right names and codes. There is however something to him, a data point, that does not fit the information structure of the Danish state. This something happens to be a functional uterus, currently occupied with nurturing a new life to fruition. The existing information systems in Denmark lack the capacity to add in this data, as man-with-uterus is not a category that can be chosen. In effect, this data about him in the system becomes non-data, as there is no way to capture it therein. His surplus of data is ignored, left behind, treated as not given at all.
Why does all this matter? Simply put, data is a far more complex category than we tend to realize. We treat it as something akin to oil or water, a free-flowing resource that is always already ready for use. In reality, data is often a choice – we choose the data we gather based on categories we may or may not understand, and the data that we don’t choose, we don’t think much about. So, we learn some things about classes in school, or people in the healthcare system, but only what the pre-made structures allow us to know. That another world is possible, with different data structures, tends to be forgotten.