Classifying Data


Unedited audio transcription

Hi, everyone. I'm Stephen Downes. We're in Ethics, Analytics and the Duty of Care, Module 7, titled "The Decisions We Make." This is the second of four videos that we're doing in this module on the subject of data. This one is titled "Classifying Data." Let's begin by thinking of the different types of data that we can have.

I've approached it from two directions here, or maybe even three. On the left-hand side I've got a little diagram that says "Python data types." These are the data types that computer scientists will typically use, or even just amateur programmers like me.

Here are the different types: strings, like a word such as "Python"; numbers, which might be integers or floating-point numbers or exponentials; lists (I love lists; I call them arrays when I use them in Perl); tuples, which are basically an ordered set, like an ordered pair; and dictionaries. I call a dictionary a hash when I work with it in Perl; basically, it's a list of names and values. Then there are sets, which are a lot like lists but aren't; they're sets, like a group of numbers: one, two, three. And then there's the Boolean, which is something that can be either true or false, one or zero.

It's what we called in the previous video the most raw data, though it's not really any more raw than the rest of the data. So those are the sorts of data types that a programmer might use. Now, somebody working with a database may work with a different set of data types.
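The Python types just listed can be illustrated directly; this is a minimal sketch, with sample values invented for the purpose:

```python
# One example of each Python data type mentioned above.
word = "Python"                         # string
count = 42                              # integer
ratio = 3.14                            # floating-point number
big = 6.02e23                           # exponential (scientific) notation
colours = ["red", "green", "blue"]      # list -- roughly a Perl array
pair = (45.4, -75.7)                    # tuple -- an ordered, immutable grouping
person = {"name": "Alice", "age": 30}   # dictionary -- roughly a Perl hash
numbers = {1, 2, 3}                     # set -- unordered, no duplicates
done = False                            # Boolean -- True or False

print(type(word).__name__)   # str
print(person["name"])        # Alice
print(2 in numbers)          # True
```

Note that the set and the dictionary both use braces, but the dictionary pairs names with values while the set just holds members.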

For example, they will work with integers; with short strings, which they call VARCHARs; and with long strings, which they call TEXT. They might work with dates or date objects. They might work with pointers or references to external data, or they might work with BLOBs, which are great big blobs of unorganized data, like a photograph or a video.
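As a sketch of how those database types look in practice, here's a hypothetical table defined from Python using SQLite (the table and column names are invented; SQLite folds these declarations into a few storage classes, but accepts the familiar names):

```python
import sqlite3

# In-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE photo (
        id       INTEGER PRIMARY KEY,  -- integer
        title    VARCHAR(80),          -- short string
        caption  TEXT,                 -- long string
        taken_on DATE,                 -- date
        owner_id INTEGER,              -- pointer/reference to an external record
        image    BLOB                  -- big blob of unorganized data
    )
""")
conn.execute(
    "INSERT INTO photo (title, caption, taken_on, image) VALUES (?, ?, ?, ?)",
    ("Sunset", "A long description...", "2021-11-01", b"\x89PNG..."),
)
row = conn.execute("SELECT title, taken_on FROM photo").fetchone()
print(row)  # ('Sunset', '2021-11-01')
```

The `image` column would normally hold the raw bytes of a photograph or video; here it's just a placeholder byte string.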

If we're talking about types of data, we can also think about where the data came from. An example of that is personal data: personal data is personal because it comes from, or is about, a person. We can even distinguish between private data and professional data. Or it might be our sports data, if we're an athlete; there might be business data.

There could be meteorological data, etc. Anything that data is about, or comes from, could be considered a type of data. And then, of course, there's multimodal data: data that comes from different kinds of instruments. For example, not just logs of activities completed on a computer, but data produced by things like biosensors, eye tracking, infrared imaging, wearable cameras, and Kinect-style devices.

Remember the Xbox Kinect, or those things you used to be able to use on a Nintendo Wii? I used to play hockey with the hockey stick; it was great. They actually had a stick for the Wii, and then they discontinued the stick the next year, and I lost interest.

All of those are different data types. This is important because we associate data types with the sorts of things that we're trying to talk about, not just in artificial intelligence but in software generally. These are generally talked about under the heading of what we call entities, and we can understand what a computer program is trying to do using what's called an entity-relationship diagram, or ER diagram.

This is a very non-standard representation of an ER diagram. I'm using this one because it's clearer, and it's really easy to see what's going on; an actual ER diagram is a lot messier, and it's a lot harder to see what the diagram is trying to do.

So here we have basically four entities, and these four entities are in the squares: we have a user, we have a post (which is an article on a blog), we have a rating, and we have a category. What we have here, then, is the idea that a user writes a post, or a user writes a rating (which is maybe a comment), or the post is in a category, etc.

Now, this isn't a very good entity diagram, because "user writes a rating"? Wait, no, what? "Rating rates a post"? No, that doesn't quite make sense either. So it's not a very good entity diagram; it's kind of confused here. In this entity, the user, the circles around it are intended to indicate properties of the user, like their registration date, their username, their birthday, their email, etc. A rating, by contrast, might consist of stars; it would have an ID, and it might have a comment attached. A category might typically have a name, or perhaps more than that.

A post might have an image associated with it. It might have a description; it would probably have a title, but that's missing from this diagram. So the idea here is that the data is being organized as a set of entities, and the person creating the program is trying to conceptualize how these entities work and how they're related to each other.

As you've taken this course, you've noticed that I've been creating entities as well. For example, I created a set of entities that I called "applications," which were applications of artificial intelligence. I created another set of entities, which I called "issues," which were issues in the application of artificial intelligence, or "values." But I also have other entities, including posts, links, presentations, videos like this one, files, persons, authors, etc.

So, your data can be organized in a variety of different ways.
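One common way to conceptualize entities like the ones in the diagram is with simple record types. Here's a hypothetical sketch using Python dataclasses; the field names are guesses based on the properties discussed above, not the actual diagram's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    username: str
    email: str
    registration_date: str
    birthday: Optional[str] = None

@dataclass
class Category:
    name: str

@dataclass
class Post:
    title: str                      # noted above as missing from the diagram
    description: str
    category: Category              # "post is in a category"
    author: User                    # "user writes a post"
    image: Optional[str] = None

@dataclass
class Rating:
    id: int
    stars: int
    author: User                    # "user writes a rating"
    post: Post                      # the rating is of a post
    comment: Optional[str] = None

alice = User("alice", "alice@example.com", "2021-01-15")
cat = Category("AI Ethics")
post = Post("Classifying Data", "Types of data...", cat, alice)
rating = Rating(1, 5, alice, post, "Great overview")
print(rating.post.category.name)  # AI Ethics
```

Notice how the relationships in the diagram become references between records: the rating points at its author and its post, and the post points at its category.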

Now, this leads to the question: what about the entities we're not talking about? If we're talking about data and data points, there are two ways to omit data. One way is the way that's illustrated here, where we have the data points on our x-y grid.

Here's what the line would be if we count all of the points. But if we remove some of them, indicated in gray over here, the slope of the line changes; if we remove others, the slope of the line changes again; and if we remove all of these down here, the slope of the line changes yet again. So we can, and the term is sometimes used, "cherry-pick" our data to change the slope of the line. Here the changes are very slight, but I could easily imagine completely flipping the slope of that line by omitting the right data points. But it might also be that we change the results by not even considering some types of data at all.
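The effect of omitting points on a fitted line can be demonstrated with a least-squares slope; the data here is invented for illustration:

```python
def slope(points):
    """Least-squares slope of a line fitted through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

data = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 3.0), (5, 5.1), (6, 6.2)]
print(round(slope(data), 2))                 # 0.74 -- full data set

cherry = [p for p in data if p != (4, 3.0)]  # omit one inconvenient point
print(round(slope(cherry), 2))               # 0.78 -- the line got steeper
```

Even dropping a single point changes the slope; dropping points systematically (all the high ones, or all the low ones) can change it dramatically.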

Suppose, for example, and this is a bit far-fetched, that we did this entire course on ethics, analytics and the duty of care without talking about ethical issues. That wouldn't make a whole lot of sense, but we could do it, and we'd probably come up with a very different perspective.

If we didn't talk about any of the cases where ethical breaches actually occurred, then we might become concerned about different sorts of things than the things that were actually problems. Another example, also related to data, is the idea of data quality: we can have good data or we can have bad data, very roughly speaking.

What do we mean by data quality? Well, Field and others talk about the principle of representative and high-quality data. Now, "representative" we'll talk about a little bit later on; "high quality" is a bit harder to get at. Does it mean high fidelity? Does it mean data that comes from authoritative sources?

Does it mean data that doesn't have chunks missing, so that it's very clean data, or data with very limited noise? It's hard to say without being precise about what you mean by quality. But there is a problem that has existed since the beginning of computer science, known as "garbage in, garbage out."

That's the idea that if you want your system to work correctly, you have to put the right input into the system; otherwise you'll get, say, garbage out. Data quality is typically taken as consisting of a number of different parameters, and here we have one list from [name unclear in the audio],

although I've seen this diagram, or something similar to it, in a number of different sources. It includes accuracy, relevance, completeness, timeliness, uniqueness, conformity, consistency, integrity, validity and precision. Now, we could talk, and people have talked, about each of these for a long time. What does each of them mean?

How do we determine, for example, that data is timely? What are the parameters for that? How do we measure it? How much of an impact does it have on our overall assessment of data quality? For our purposes, what I want to note here is that for each one of these categories, we're making decisions: decisions about what will count as an instance of this type. What will count as an indicator of completeness, for example? Then, what processes do we need to undertake to ensure that something is complete? And then, third, what sort of test do we have after the fact, so that we know whether or not something is complete? These are hard problems, and they're hard problems with ethical implications.
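A sketch of how such decisions become concrete: these hypothetical checks pick one operational definition each for completeness and timeliness, and different choices here would score the same data differently. All field names and thresholds are invented:

```python
from datetime import datetime, timedelta

REQUIRED = {"name", "email", "recorded_at"}  # our decision: what "complete" means
MAX_AGE = timedelta(days=30)                 # our decision: what "timely" means

def is_complete(record):
    """Complete = every required field present and non-empty."""
    return all(record.get(f) for f in REQUIRED)

def is_timely(record, now):
    """Timely = recorded within the last MAX_AGE."""
    recorded = datetime.fromisoformat(record["recorded_at"])
    return now - recorded <= MAX_AGE

now = datetime(2021, 11, 1)
records = [
    {"name": "Alice", "email": "a@example.com", "recorded_at": "2021-10-20"},
    {"name": "Bob", "email": "", "recorded_at": "2021-10-25"},              # incomplete
    {"name": "Carol", "email": "c@example.com", "recorded_at": "2021-01-05"},  # stale
]
good = [r for r in records if is_complete(r) and is_timely(r, now)]
print(len(good))  # 1 -- only Alice's record passes both tests
```

Change `MAX_AGE` to a year and Carol's record passes; treat an empty email as acceptable and Bob's does too. The quality score is a product of the decisions, not just of the data.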

For example, if you don't care whether your data is timely, that may have an impact on the analytics that produces the output. A good concrete example of that, believe it or not, is Google Maps. I had a built-in map system in my car when I bought the car; it expired after a year, and thereafter they demanded that I pay money for it. I'm not going to do that; it was a crappy map anyway. What do I mean by that? Well, it was accurate, but it wasn't very usable. But "usable" isn't in this data quality list. Well, the presentation of the data is distinct from the content of the data.

Anyhow, so I use Google Maps. Google Maps is kept reasonably up to date because it's based on satellite imagery. But part of the problem with that is that it detects anything that looks like a road as a road. As a result, I found myself riding my bicycle down what could only be described as a rock pile.

Also, a lot of private roads are listed as roads. So there can be issues with Google Maps. But I found as well that Google Maps' Street View is way out of date. If I look at Street View for Moncton, for example, I see my old house, and not my old house the way it was when I sold it, but my old house the way it was before I renovated it. So the Street View version of it is very much out of date, and if you were looking for property, say, maybe they're selling that house now, you'd be given bad information by Google Maps. So that's what we mean by data quality.

And you can see the implications of data quality throughout the application, the use, and therefore the ethics of analytics and AI. We talked briefly about the timeliness of data. Data takes on a particular quality if we're thinking of it not just as a static representation of some state of affairs in the world, but as representing something that's constantly changing, like, say, a train that could be coming by: the location data on the train is never the same from moment to moment.

That's why sometimes it's tooting its horn and sometimes it's not tooting its horn. How do you like the way I worked that in? So there are two basic types here. There's dynamic data, which is data sets that evolve and accumulate over time; the thing with those is that they may permit discoveries in the future that they don't permit today. That's important, because if we can consider what sorts of discoveries are possible, that has an impact on whether we want to allow the data to be used in that way, or, for example, if we're the data subject, whether we give consent for the use of that data. As well, we have real-time data. Real-time data is based on online analysis and decision-making as data arrives (the slide should say "poses" rather than "post"), and it poses additional challenges, because you need both timeliness and accuracy.

But those are hard to get at the same time, especially if there's some distance or some processing between you and the data that's being recorded. We see this in multi-user virtual reality systems, for example, where you're using one VR headset on your computer, I'm using a VR headset on my computer, and there's a third computer which is actually managing the simulation.

Well, there's going to be a time lag between what you do, when the server hears about it, and when I see it, and if that lag is too great it really degrades our experience of the game: they fire, and I'm hit by the bullet before I even see them fire. On the other hand, if we want the system to be right up to date, then we have to give up on the really beautiful graphics and some of the calculations, so that basically we're living in a cartoon world instead of an actual real-looking world. That way I can see the person fire before I get hit by the bullet.

But it's a cartoon bullet, and it doesn't really feel very believable. That's the sort of problem that you can have when you're working with real-time live data, especially real-time live data that's used as the basis of decision-making and analysis. So that moves us into the topic of data assemblage, and that's going to lead us to the next part of this presentation, on the management of data.

What we want to think about here is how we pull together all the different aspects of managing our data. So, quoting from Prinsloo, critical data studies "encompasses all of the technological, political, social and economic apparatuses and elements that constitute and frame the generation, circulation and deployment of data."

So what does that mean, really? Well, there's the whole process that we've been describing so far of retrieving data, or even creating data, sharing data, and using data. And so we need to do things like what we have here in the diagram: design and pilot instruments, collect the data, clean the data, and then create a hypothesis. Maybe; these two blue steps depend on how much you want theory to influence what you're doing. Maybe you clean and analyze the data and go straight back into designing and piloting instruments, or maybe you just pump that data directly, in raw form, into some other system. These are all questions that need to be considered. Maybe what's being proposed here is that this is an iterative process, and it certainly is an iterative process, where we're constantly in a state of negotiation, almost, with the world.
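The iterative cycle just described, design, collect, clean, analyze, and back again, can be sketched as a simple loop; every stage function here is a hypothetical stub, standing in for real instruments and real analysis:

```python
# A minimal skeleton of the iterative data cycle described above.

def design_instrument(round_no):
    # Design/pilot an instrument; here, a fixed two-question survey.
    return {"round": round_no, "questions": ["q1", "q2"]}

def collect(instrument):
    # Collect raw responses; here, canned data with one bad record.
    return [{"q1": 4, "q2": 5}, {"q1": None, "q2": 3}]

def clean(raw):
    # Clean the data: drop records with missing answers.
    return [r for r in raw if all(v is not None for v in r.values())]

def analyze(data):
    # Analyze: e.g. the mean score on the first question.
    return sum(r["q1"] for r in data) / len(data)

results = []
for round_no in range(3):          # try something, collect data, repeat
    instrument = design_instrument(round_no)
    data = clean(collect(instrument))
    results.append(analyze(data))  # each round could inform the next design

print(results)  # [4.0, 4.0, 4.0]
```

In a real pipeline each round's analysis would feed back into the next round's instrument design; that feedback is the "negotiation with the world" described above.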

We try something, we collect data; we try something, we collect data; we try something, we collect data. It's sort of a back-and-forth, and that's what we do when we're having a conversation with other people; that's what we do when we're manipulating physical objects. It's even what you do when you're writing software, right?

You write the software, you try it, you change it; you write it, you try it, you change it, etcetera. And so all of that process is encompassed under the topic of critical data studies. So that's the end of this presentation on classifying data; it's not the end of the discussions of classifying data that we'll have.

Now we're going to move into the topic of data management, and that'll be next on our list. So, for now, I'm Stephen Downes, and I'll be there waiting for you when you get to part three.
