Organizing Data

By Stephen Downes
Dec 04, 2021
Transcript of Organizing Data

Unedited audio transcript

Hi everyone. I'm Stephen Downes. Welcome back to another instance, edition, part, presentation, issue, chapter of Ethics, Analytics and the Duty of Care. We're in module seven, which is the decisions we make, and in this module we're talking about organizing data. Or we might talk about it as managing data, or we might talk about it as classifying data, or some such thing. This is the third of four videos in this module where we're talking about data. I've already talked about the different data sources and the different data types that exist, and, you know, we dealt with some issues in those videos, but that was to a large degree to set up this video and the next video.

So now we have a bit of a background and a framework that we're working from, and we can look at some of the decisions that we make in the process of assembling our data and using our data that really do have significant implications for the output of AI or analytics. I've covered all that under the heading of organizing data.

And then later on we'll talk about how that data is used in actual applications and what the consequences of that are. So we mentioned last time an operation called data cleaning, and I want to talk about that a little bit more. It's something that's covered quite a bit out there on the internet.

It may be covered quite a bit because it's such a persistent problem. Even as I do this course, I'm going to be involved in data cleaning. To give you an example, I'm recording the audio. Well, that's great. The reason why I record the audio is that it's producing a transcript that is generated as I talk.

Now, this transcript is the best Google can do at the moment. It's not very good. So I've been throwing an unedited transcript into the database. I'm going to have to go back and review all of those, and probably listen to what I said, and correct the misinterpretations of both punctuation and pronunciation, and even, as was suggested by one of the participants in the course, remove the profanities and the obscenities that Google Recorder has for some reason decided that I like to indulge in.

So technically data cleaning, quote, "is the process of identifying, deleting, or replacing inconsistent or incorrect information from the database." That's odd phrasing, because if it's inconsistent, I'm not sure how it can be correct.

So, really, it's simply replacing incorrect information from the database. But now, when we put it that way, we're faced with the question: how do we determine what is correct? For example, with this transcript, what I'm going to be doing is changing all of the words so that they are sentences in correct English. But you've been listening to me for all this time, for hours and hours and hours. You know as well as I that I make mistakes, I mispronounce words, I drop Ts in some of the words that I use, and so on.

So actually what's going to result is a more correct version of what I actually put into the database. And so data cleaning isn't just an instance of fixing what was wrong; it's sometimes making better what was already there. So there's a process, and it's actually a cycle.

Sometimes people represent it in a linear fashion, but I certainly wouldn't. You import the data. You merge the data sets, you know, because you might have data from different sources. There may be data missing, so you rebuild that data. Then you apply standardization to it; for example, the Google transcript has spelled my name any of three or four different ways, and I'm going to pick one and go with that.

Normalization is a process where we get all the data talking about one entity in the same place. There is a formal process called data normalization in the design of databases; here, where we're doing data cleaning, we just want to make sure that, you know, we don't have multiple sources of information in various places about the same thing. Then deduplication, which is, as it suggests, removing duplicate records or whatever. Verification and enrichment, a pretty tough process.

And then finally exporting. Exporting very often involves translating the data into some other format. For example, the audio here: I'm going to import the audio from this recording. I might mix it with more audio if I've recorded it in pieces. Sometimes I have to do a little blurb at the start if I forgot to introduce it. But more to the point, when I export it, it's recorded in M4A and I'm going to export it in MP3, and I'm also going to enrich it by adding some metadata.

So that's the sort of thing that happens in data cleaning. And you can see there's an interpretation that's happening here, right off the bat, and you can see the ethical questions being raised. Because what if you're making the data more perfect than it actually was? What if the important thing in the data was the fact that some of the data was missing, or some other data was unclear, or some of the data was wrong?
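A minimal sketch of the cleaning steps just described, assuming a made-up list of transcript records. The field names (`speaker`, `text`), the table of spelling variants, and the `was_missing` flag are illustrative assumptions, not anything from the course's actual database. The flag is there because of the concern raised above: a rebuilt value is recorded as rebuilt rather than passed off as original.

```python
# Hypothetical spelling variants of one name, mapped to a single chosen form.
NAME_VARIANTS = {
    "steven downs": "Stephen Downes",
    "stephen downs": "Stephen Downes",
    "steven downes": "Stephen Downes",
    "stephen downes": "Stephen Downes",
}


def merge(*sources):
    """Merge records from several sources into one list of copies."""
    return [dict(record) for source in sources for record in source]


def rebuild_missing(records, default_speaker="unknown"):
    """Fill in a missing speaker, but keep a flag saying the value was rebuilt."""
    for record in records:
        missing = not record.get("speaker")
        record["was_missing"] = missing
        if missing:
            record["speaker"] = default_speaker
    return records


def standardize(records):
    """Pick one spelling for each known name variant and tidy whitespace."""
    for record in records:
        key = record["speaker"].strip().lower()
        record["speaker"] = NAME_VARIANTS.get(key, record["speaker"].strip())
        record["text"] = " ".join(record["text"].split())
    return records


def deduplicate(records):
    """Drop records that repeat the same speaker and text, keeping the first."""
    seen = set()
    unique = []
    for record in records:
        key = (record["speaker"], record["text"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def clean(*sources):
    """Run the steps once, in order; in practice the cycle may repeat."""
    return deduplicate(standardize(rebuild_missing(merge(*sources))))


recorder = [
    {"speaker": "Steven Downs", "text": "Welcome  back."},
    {"speaker": "", "text": "Any questions?"},
]
notes = [
    {"speaker": "stephen downes", "text": "welcome back."},
]

cleaned = clean(recorder, notes)
for record in cleaned:
    print(record)
```

Three records go in and two come out: the two spellings of the name collapse into one record, and the record with no speaker survives with `was_missing` set, so a later reader can still see that something was absent in the source.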

If you're doing data cleaning, you might be covering over something that's really important in your data. You know, it returns us to the question of data quality. When we first looked at the question of data quality in the previous video, we thought of it as an attribute of the source data.

But now, when we're looking at it, we're realizing that data quality is also the objective of data cleaning, that what we're trying to do is improve the quality of the data through data cleaning. And so data quality now isn't just a property of the raw data, we can call it that, but it's also a property of the cleaned data. It's an output of data cleaning.
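To make the idea of quality as an output a little more concrete, here is a small sketch that measures completeness before and after a fill-in step. The records, the `completeness` helper, and the placeholder values are all hypothetical and chosen only for illustration.

```python
def completeness(records, fields):
    """Fraction of expected field values that are actually present."""
    total = len(records) * len(fields)
    present = sum(
        1
        for record in records
        for field in fields
        if record.get(field) not in (None, "")
    )
    return present / total if total else 1.0


FIELDS = ("title", "duration_min")

raw = [
    {"title": "Data Sources", "duration_min": 42},
    {"title": "Data Types", "duration_min": None},
    {"title": "", "duration_min": 55},
    {"title": "Using Data", "duration_min": 38},
]

# A cleaning step that fills each gap with a placeholder value (an assumption).
filled = [
    {
        "title": record["title"] or "Untitled",
        "duration_min": (
            record["duration_min"] if record["duration_min"] is not None else 45
        ),
    }
    for record in raw
]

raw_score = completeness(raw, FIELDS)
filled_score = completeness(filled, FIELDS)
print(f"raw: {raw_score:.2f}, cleaned: {filled_score:.2f}")
```

The score goes from 0.75 to 1.00, yet two of the eight values were supplied by the cleaning step rather than by the source, which the score on its own does not reveal.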

And so things like completeness, consistency, even relevance and timeliness, etc., are things that, in an important way, we might be adding to the data. And each time we tweak the data in some way, we're making a decision about: have we changed the nature of the data? Have we changed the accuracy of the data? Have we changed the usability of the data? Or have we obscured something important?

There is a data cleaning workflow, and, you know, this is fairly typical here, and we can see the actual details of what gets done. And, you know, there are algorithms that do a lot of this.

There's also a lot of this that's done by hand, and it's really interesting. I'll use, for example, my photographs. I take a photograph and I do the best I can with my camera, but it comes in and it's blurry and it's pixelated, and, you know, I'd like it to be as faithful as possible an image of what I actually photographed. So I'll actually run it through some artificial intelligence made by Topaz. I have one that's called DeNoise, and it takes all the speckling and flattens out the color, so I don't have speckled color anymore. And I also have one that's called Sharpen.

And so there are different ways that my image can be non-sharp. It might be caused by motion: if I move the camera while the lens is open, I'm gonna get some motion blur. Or it might be a little bit out of focus, and so I'm gonna get some fuzziness around the edges. And this AI will detect what kind of fuzziness I have, and then draw the line where it thinks it should be, and then color the two sides of the line.

So the question I have to ask is: is that a photograph of the bird? Or is that an artistic representation of the bird pretending to be a photograph of the bird?

You know, I mean, I've looked at what this AI does, and, you know, I could not say for sure whether there actually is a feather or a hair where the resulting image says there's a feather or a hair. There might be; it's certainly possible. But there might not be. And this is the sort of thing that comes up in data cleaning.

And the ethical question is, should I clean my data? Well, the problem is, if I don't clean my data, then errors in the data caused by the instrument will end up in my artificial intelligence workflow. And so I might be introducing error that will impact the outcome, and the cleaning might actually prevent that sort of thing.

So it looks kind of six of one, half a dozen of the other; it's really hard to tell which way to go here. And that's just where our problems begin. Remember, before, when I was characterizing what data is, I made a point of saying that it's all ones and zeros. And it's all ones and zeros.

But the thing is, as humans we can't deal with ones and zeros, not very well anyways. So we do a process of what's called labeling. We talked a little bit about that in the video on how AI works. You might remember that. Remember we talked about things like edge detection and feature detection, and even the final set of neurons that represented, or were represented by, the digits zero through nine? Those are labels, right? It's not really an edge, or a feature, or a digit. It's just a series of ones and zeros, but we're imposing or giving or blessing it with a name.

And these names are not just, you know, strings of text that we assign to a particular neuron or a particular set of neurons. They are actually organized, so that one name may refer to a set of other names, and so on. So we get out of this as well things like classification and taxonomies and names and all the apparatus, I might add, that we have when we're doing theory. And this is where we get this data versus theory thing again.

Because the theory side of it begins with all of these names, whatever they are, and based on whatever justification they have for using these names rather than other names, or these classifications rather than other classifications, and then brings that to our data and says, okay, we want to recognize these things in the data. But the other way of working is: we have the data, we see things in the data, maybe patterns or regularities in the data, or even just a particular neuron flashing on or off, and we say something like, well, what's the best word or concept, of all the words and concepts that I have accessible to me, that I can use to name that neuron or that pattern?

So that's the issue of classification and naming. And so these are essentially using human-readable signs that interpret, that is to say, give meaning or significance to, a specific piece of data.

We have, you know, an auto-labeling system here, because a lot of the time labeling is done automatically by an AI, and I find that really interesting. So the raw data comes in, that is, the ones and zeros come in, and our system makes a labeling request. So the system gives it a label based on something, and that's the preliminary labeled data. A human checks it and either approves it, in which case the data has been labeled, or sends back the correction to train it through, say, a process of backpropagation. And we try again. We keep going around and around and around until we hit a label the human likes, and it's labeled.
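The loop just described (propose a label, have a human check it, feed any correction back, try again) can be sketched as follows. This is only an illustration under stated assumptions: the `ToyLabeler` class is a hypothetical stand-in that counts word and label associations instead of doing backpropagation, and the "human" is simulated by a lookup table of approved answers.

```python
from collections import Counter, defaultdict


class ToyLabeler:
    """Stand-in for an AI labeler: scores labels by word/label co-occurrence."""

    def __init__(self, default_label="unlabeled"):
        self.counts = defaultdict(Counter)
        self.default_label = default_label

    def propose(self, text):
        """Return the label most associated with the words seen so far."""
        scores = Counter()
        for word in text.lower().split():
            scores.update(self.counts[word])
        if not scores:
            return self.default_label
        return scores.most_common(1)[0][0]

    def correct(self, text, label):
        """Feed a human correction back into the labeler."""
        for word in text.lower().split():
            self.counts[word][label] += 1


def label_with_review(labeler, items, human_answers, max_rounds=5):
    """Go around the propose/check/correct loop until the human approves."""
    labeled = {}
    for text in items:
        for _ in range(max_rounds):
            proposal = labeler.propose(text)
            if proposal == human_answers[text]:  # the human approves
                labeled[text] = proposal
                break
            labeler.correct(text, human_answers[text])  # the human corrects
    return labeled


# Simulated human judgments (hypothetical examples).
human_answers = {
    "election results announced": "politics",
    "markets rally on jobs report": "economy",
    "election debate tonight": "politics",
}

labeler = ToyLabeler()
labeled = label_with_review(labeler, list(human_answers), human_answers)
print(labeled)
print(labeler.propose("election coverage"))
```

The first two items each need one correction before the proposal is approved; the third is approved on the first try because the labeler has already associated "election" with "politics". Note that the labels themselves come entirely from the human side of the loop.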

So, as I said, labeling is the basis for a whole bunch of stuff: taxonomies, ontologies, and kinds. The diagram here to the right is sort of an indication of what's going on. So we might have a taxonomy of articles, say. Under the heading of news, news articles might be broken into politics and economy; politics might be broken down into subcategories of international and local; international might be broken down between USA and Asia.
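The article taxonomy just described can be written down as a nested structure. The sketch below is one possible representation, not the one in the diagram; the helper names are hypothetical, and the last function shows one assumed use, attaching a taxonomy path to each position in a classifier's output.

```python
# The taxonomy from the example: each name maps to its subcategories.
TAXONOMY = {
    "news": {
        "politics": {
            "international": {"USA": {}, "Asia": {}},
            "local": {},
        },
        "economy": {},
    }
}


def leaf_paths(tree, prefix=()):
    """Yield the full path to every leaf category in the taxonomy."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


# One output position per leaf category.
OUTPUT_LABELS = list(leaf_paths(TAXONOMY))


def name_output(scores):
    """Attach the taxonomy path to whichever output position scored highest."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return " > ".join(OUTPUT_LABELS[best])


for index, path in enumerate(OUTPUT_LABELS):
    print(index, " > ".join(path))
print(name_output([0.1, 0.7, 0.1, 0.1]))
```

The scores are just numbers; the path "news > politics > international > Asia" is something the table of labels adds to position 1 after the fact.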

And then this is used as a classification system for a set of documents. So this taxonomy that we have might describe what our output neurons in a neural classification system stand for, just as we described the numbers in the previous example. Now instead we have these words. But there's a question here, and the question is: are there certain words that are the right words? And this is an important question, because it gets at the question of whether our use of words or labels actually reflects what exists in the world, and there are different ways of talking about this.

It's a huge philosophical issue, which I can only briefly address. But when we look at things like biology, for example, it really seems like the world breaks down nicely, neatly, into predefined categories. We have the kingdoms, we have the species and the genuses, and we can tell the difference between them, because all of them are birds, but some of them are crows, and some of the crows have gray, some of the crows don't have gray, et cetera, and it just seems like it works out, right?

And that leads some philosophers, for example Saul Kripke, to say there are certain natural kinds in the world. The world is organized a certain way and we can understand that organization. There are skeptics, though, such as Quine, who suggests that it's really going to depend a lot on our interpretation of what the world is.

And he, in Word and Object, gives an example where we're looking at a rabbit. We call it rabbit, and maybe a native person looks at a rabbit and calls it gavagai, and we think, oh, gavagai means rabbit. But gavagai might not mean rabbit. It might mean young rabbit, or it might mean moving rabbit, or it might mean the physical incarnation of an eternal spirit that we would understand as rabbit, or something else we can't even fathom. In other words, there's no way we can know, if we're doing radical translation, whether we've actually translated the word correctly or not, whether we're actually talking about the same states of affairs in the world.

And George Lakoff talks about that at some length, and he points to a particular culture that classifies all the things in the world into two categories.

In one category, you have women, fire, and dangerous things, and in the other category you have everything else. So, ha ha ha, George Lakoff. But it does point to the idea here that we don't know a priori what the right categories are. And from that, it follows that we bring an awful lot to the table ourselves when we use categories or taxonomies or labels of any kind to interpret what's happening in an artificial intelligence.

It's just ones and zeros, and these ones and zeros are significant, and they can be used to do things like detect patterns or group things into groups or whatever. But the names of the patterns, the names of the groupings, the names of the individual neurons: that's us. That's us bringing it to the artificial intelligence, unless you have an AI that does auto-categorization without human intervention. But then why would we think an AI doing auto-categorization would categorize the world the same way we do? And this is where we get at some of what Chris Anderson was talking about. An AI, given wide and comprehensive data about things that exist and change in the world, might organize the world very differently from the way we do.

And, you know, we like to divide people into, you know, like four or five categories, or, you know, like learning styles, or MBTI, which is a grid, I guess, of 16 categories based on these four variables, four squared, 16. But an AI might divide the people in the world into 60,000 categories, each with its own set of properties, each with its own kind of predictions that it could make about what kind of learning resource it wants. And who's to say that our 16 categories are better than the AI's 60,000 categories?

And also, when the AI is dividing the world into 60,000 categories, what's the point of naming them? What's the point of giving them labels? There is no reason. There would be a reason only if the labeling made it easier for us humans to comprehend what's going on. But we're not going to comprehend what's going on with 60,000 categories. It's just not going to happen, even if we came up with some kind of taxonomy for them. But there might not even be a taxonomy, right? It might just be 60,000 independent categories that are not organized into a hierarchy, because why would the AI necessarily think that the world needs to be organized into a hierarchy? It could just be 60,000 independent, unrelated categories.

That's what Anderson was getting at. And that's why Anderson is saying, well, forget theory, right? We're going to take this system of six categories or 16 categories and impose it on the world, when an AI would detect 60,000? That no longer makes any sense. And it really does no longer make any sense.
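As a rough illustration of categories that have no names and no hierarchy, the sketch below groups records purely by which bins their feature values fall into. This is a deliberately crude stand-in for whatever an actual AI would do; the learner records, the feature values, and the `auto_categorize` function are all hypothetical.

```python
from collections import defaultdict


def auto_categorize(learners, bin_width=0.25):
    """Group learners whose features fall in the same bins.

    Categories are identified by arbitrary integers: no labels, no hierarchy.
    """
    groups = defaultdict(list)
    for name, features in learners.items():
        key = tuple(int(value // bin_width) for value in features)
        groups[key].append(name)
    return {index: members for index, members in enumerate(groups.values())}


# Made-up feature values between 0 and 1 for four learners.
learners = {
    "a": (0.10, 0.90, 0.40),
    "b": (0.15, 0.95, 0.45),
    "c": (0.80, 0.20, 0.60),
    "d": (0.55, 0.55, 0.05),
}

categories = auto_categorize(learners)
print(categories)

# Four bins per feature over eight features already allows 4 ** 8 categories.
print(4 ** 8)
```

Learners "a" and "b" land in the same category and the other two each get their own. Nothing in the result says what category 0 is a category of, and with eight such features the number of possible categories is already 65,536.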

So this is an area of exploration in AI right now. There are systems out there, among the types of AIs that we've been talking about, that are specifically focused on classification. And it's interesting, because really they're almost torn between two purposes. The one purpose is to be able to use the classification in some useful way, for example, to recommend a piece of content or a learning path or an intervention on the part of an instructor or something like that. Or, on the other hand, to take complex data and provide an interpretation of that data that we can use. And when we see things like number recognizers and things like that, facial recognition systems, etc., that's the sort of thing that we might be thinking of.

These aren't always at cross purposes, of course, but they are sometimes at cross purposes, and we need to make the choice whether we want the AI classification to focus on the utility of the classification for whatever task we've put it to, either creating new content or grading students or detecting plagiarism or whatever, or the interpretability of the data, the way we understand it.

There are different types of classification. I've grouped them into three. I don't think this exhausts the possibilities. But we have binary classification: you know, everything in the world is either a frog or not a frog, or everything in the world is either ethical or unethical.

We could design an AI that did that, and one has been designed, actually, although it now does more than just binary good/bad classifications. It also says "that is allowed" or "it wouldn't be wrong", so it's more of a four-, five-, six-valued classification. And that's what we get when we have multi-class classification.

So, for example, if we have a system that recognizes animals and sorts them into, you know, they're either a cat, dog, fox, tiger, or lion, that's multi-class. The number identifier or recognizer we talked about previously is a multi-class classification system. Same with face rec, except there's a very large number of entities in the list of possible classes.

Multi-label classification is kind of like tagging. So we're not organizing everything into nice neat classes, but rather we're associating the incoming data with a set of strings. So it's a lot like movie classification. We might say that, you know, it's an action crime thriller, or we might say that something is readable, first-rate, written in Spanish, whatever.
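The difference between the three types shows up in what is done with the same kind of score table. The sketch below assumes a classifier has already produced a score per class or tag; the scores, the 0.5 threshold, and the function names are made up for illustration.

```python
def binary_label(scores, positive, threshold=0.5):
    """Binary: the item either is the positive class or it is not."""
    return scores.get(positive, 0.0) >= threshold


def multiclass_label(scores):
    """Multi-class: exactly one class, the highest-scoring one."""
    return max(scores, key=scores.get)


def multilabel_tags(scores, threshold=0.5):
    """Multi-label: every tag whose score clears the threshold."""
    return sorted(tag for tag, score in scores.items() if score >= threshold)


# Hypothetical classifier outputs.
animal_scores = {"cat": 0.05, "dog": 0.10, "fox": 0.70, "tiger": 0.10, "lion": 0.05}
movie_scores = {"action": 0.9, "crime": 0.8, "thriller": 0.6, "comedy": 0.1}

print(binary_label(animal_scores, "fox"))   # fox or not a fox
print(multiclass_label(animal_scores))      # one animal out of five
print(multilabel_tags(movie_scores))        # any number of tags
```

In each case the threshold, the list of classes, and the choice between one answer and many are decisions made by whoever sets the system up, not properties of the scores.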

So we're gonna have multiple labels. It doesn't all have to refer to the same thing. The picture at the right contains a cat and it contains a bird, and so those would both also be part of the classes that we can attribute to that photograph.

So, as you can tell, I'm sure, I could talk about the classification algorithms. That goes beyond the scope of this discussion, but it's there so that you know they exist, and we could talk about the specific differences between the types of probability calculus or graph analysis, etc., of the different classifiers and classification algorithms.

The important thing here, for the purposes of this discussion, is that a large set of decisions is being made by the designers of the AI and the people implementing the AI, ranging all the way from how to label the entities that we're using as input data, to how to label the different stages, and to how to label the output. And it raises the question of whether this corresponds with classes or categories or things in the world, or whether it doesn't, and how tied the AI is to the actual state of affairs in the world.

And it also raises the question: are our existing systems of naming and classification correct? Are those the ones that we should be using in AI and analytics? Or should we just forget about all of that, and just let the artificial intelligence do its thing, and think about classification or labels or categories only after the fact, if it comes up at all? In a lot of cases, it might never come up at all.

So, that's it for this presentation. I've got one more in this four-video series on data, and I'm going to get to that right away. But again, it might be a while for you, depending on what you're doing. So until then, I'm Stephen Downes. Bye for now.
