Working With Data


Unedited audio transcript from Google Recorder

Hi everyone. I'm Stephen Downes. This is Ethics, Analytics and the Duty of Care. We're in module seven, called The Decisions We Make. This is where we talk about the process of AI and analytics and look at all the different points where we make individual decisions that add up to large

ethical implications as we apply analytics and AI in a learning context. This presentation is the fourth in our mini-series within module seven, where we've been talking about data. And yeah, it's been quite a bit of work, but I'm happy to be here, just wrapping this little series up.

So in this presentation, I'll talk about working with data. I won't be talking so much about the applications of AI and analytics in learning; we talked about that at length in module one. Instead, I'm going to touch on some considerations where the fact that we're working with data impacts the ethics of what we're doing.

And of course, the big one we'll be talking about is bias in AI, but let's set ourselves up for that discussion by working from the beginning. A lot of what happens in analytics and AI happens not just in the plain raw data collection, but in the actual combining of data to form larger data sets; it's like big data magnified.

There's a little example here of the sort of thing that can happen. In the illustration we've got unique individual IDs, and each of them has a four-digit zip code, because this is a very small island in the Pacific, I guess. And we also have a table mapping zip codes to precise incomes. The data is unique enough that we can now identify the income of each individual ID, because each one is the only person in their zip code.
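Here's a minimal sketch, with entirely made-up records, of how that kind of linkage works: joining a de-identified table against an auxiliary table on a quasi-identifier like a zip code re-identifies everyone when each zip code contains only one person.

```python
# Hypothetical illustration: two "anonymized" tables that, joined on a
# quasi-identifier (zip code), re-identify individuals and their incomes.

# De-identified study data: user ID and zip code only, no names or incomes.
study_records = [
    {"user_id": "u-001", "zip": "9021"},
    {"user_id": "u-002", "zip": "9022"},
    {"user_id": "u-003", "zip": "9023"},
]

# Separately published table mapping zip codes to incomes.
income_by_zip = {
    "9021": 42_000,
    "9022": 87_000,
    "9023": 63_000,
}

# Because each zip code holds exactly one person, the join reveals
# every individual's income, even though neither table did so on its own.
for record in study_records:
    income = income_by_zip.get(record["zip"])
    print(f"{record['user_id']} (zip {record['zip']}): income {income}")
```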

Well, yeah, maybe the zip code isn't exactly the thing. But, you know, if we were able to identify people by their addresses or by their phone numbers, we could pull off a similar sort of thing. And actually, there is a process out there in the world called the identity graph,

where what marketers do is gather dozens, maybe hundreds, of these individual data sources, identify specific individuals, and then track those individuals through all of those data sources in order to create a comprehensive picture of each person. Facebook does that, and they do it even for people who are not Facebook members: this is a person the existence of whom they've been able to deduce from all of the other data put in there by other people about that person.
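A toy sketch of the same identity-graph idea, again with made-up records and identifiers: records from different services get merged whenever they share an email address or phone number, yielding one combined profile.

```python
# Hypothetical identity-graph sketch: merge records from different sources
# whenever they share a common identifier (email or phone). All data invented.
from collections import defaultdict

sources = [
    {"email": "pat@example.com", "purchases": ["bike lock"]},                 # retailer
    {"email": "pat@example.com", "phone": "555-0100", "likes": ["cycling"]},  # social network
    {"phone": "555-0100", "store_visits": 14},                                # loyalty program
]

# Index every record by each identifier it carries.
by_identifier = defaultdict(list)
for rec in sources:
    for key in ("email", "phone"):
        if key in rec:
            by_identifier[rec[key]].append(rec)

# Naive merge: start from one record and pull in everything reachable
# through shared identifiers, building a single combined profile.
profile, seen, stack = {}, set(), [sources[0]]
while stack:
    rec = stack.pop()
    if id(rec) in seen:
        continue
    seen.add(id(rec))
    profile.update(rec)
    for key in ("email", "phone"):
        if key in rec:
            stack.extend(by_identifier[rec[key]])

print(profile)  # one profile stitched together from three separate sources
```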

And that's an important lesson: data seldom stands alone. Data is linked to other data and often reveals more than was intended. There was a case, for example, where a person quite innocently sent in their DNA to something like 23andMe, I'm not sure of the exact service, but they sent in their DNA.

The company does a DNA analysis and sends it back to them, usually telling them that they have a little bit of indigenous ancestry. Okay, I made a joke there, but not a very good one. Anyhow, I made the joke because I was reading online today that something like 40% of the United States population claims to have Irish descent.

There's only seven million people in all of Ireland. And I found that particularly relevant because when we did ancestry studies, all of my ancestors are Irish on my mother's side, and a good chunk of them on my father's side, which makes me one of those, you know, millions of people that say they're Irish; only

I might actually be, which is both a concern and something that's nice, I'm not sure. Anyhow, let's stay on topic here because this is already taking hours. So the company analyzes her DNA and sends the results back. But the DNA is also enough of a match with some crime-related DNA that they're able to identify a criminal, a brother of the sister who sent in the DNA: the criminal who committed this crime is related to the sister, not the sister herself, but someone related to the sister, and therefore it must be the brother.

That's the sort of thing that can happen when you combine data. Data on travel can be used to discover things far beyond travel; data related to purchases reveals things not solely relevant to purchases. For example, if you look at my purchases, you can tell I like football, you can tell I go cycling, etc.
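A tiny, fabricated illustration of that kind of inference: the purchase list and keyword map below are invented, but they show how interests get read off purchase records that were never meant to state them.

```python
# Hypothetical sketch: inferring interests from purchase categories alone.
purchases = ["football tickets", "bike helmet", "cycling gloves", "team jersey"]

interest_keywords = {
    "football": ["football", "jersey", "team"],
    "cycling":  ["bike", "cycling", "helmet"],
}

# Any purchase matching a keyword marks the corresponding interest as inferred.
inferred = {
    interest
    for item in purchases
    for interest, words in interest_keywords.items()
    if any(word in item for word in words)
}
print(inferred)  # {'football', 'cycling'} -- revealed without ever being stated
```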

So that's just the beginning of all of this, because now we're talking about using this data in more and more applications, and so the nature and the quality of the data becomes quite significant. We can unpack some of the implications of these data assemblages, and Paul Prinsloo pointed out that part of critical data studies, actually, is the unpacking of these in the context of what he calls dataveillance, that is, data surveillance,

he could have just said data surveillance, we did not need another word there; the attrition and loss of privacy; the impact of profiling, social sorting, etc. On the other hand, look at the sort of things that can be done, right? Walmart, for example, using it for supply chain management; Progressive Insurance or Marriott Hotels using it for pricing, maybe identifying the best price, or whatever

the maximum price is that gives them the greatest yield or profit; Moneyball baseball teams, they've put Oakland and Boston here, that's a nod to the movie Moneyball and the work of Billy Beane; you know, to assess what they call here human capital, I hate that word, but companies like Verizon and MCI; you know, understanding, quote, the drivers of financial performance and the effect of non-financial factors on product and service quality.

They've got Honda and Intel here for customer selection, loyalty and service, or even research and development to improve the quality, efficiency and safety of products and services. And NRC did a project using AI to analyze parts in aircraft, so that mechanics could know when they were going to break or fail before they actually broke or failed, and replace them so that they don't create a hazard in flight.

I'm in favor of that kind of thing; I don't think that's a bad thing. So, you know, there are both good things and bad things that come out of the work of these sorts of data assemblages, and, you know, it's the job of ethics to decide what's good and what's bad.

But we need to understand: it's not all going to be good and it's not all going to be bad. It's going to depend a lot on your perspective. Pricing, for example, has a winner and a victim, right? The winner is the company that's able to get the best price to maximize profits, and the victim, or the loser, is the person who pays more than they might otherwise have to.

But then you could switch that around, right? A purchaser using AI might find the best price for something on the market, and the victim becomes the company that isn't able to maximize its profits. So there are no simple answers here. And now we turn to bias. Bias is, well, a huge subject, and basically we can describe it as an incomplete or skewed training data set.

Now, remember I said before that our data is all ones and zeros. Our data is indeed all ones and zeros, but it's carefully preselected ones and zeros, and it may be data that has, well, been cleaned or modified or corrected in some way. Or, you know, it may be taken from a very specific place, from a very specific source, and therefore not be representative of a population, or it might be data that has been cherry-picked or is incomplete in some other way, etc.
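A minimal sketch, with fabricated numbers, of how a training set drawn from only one part of a population skews what gets "learned"; the "model" here is nothing more than an average score.

```python
# Fabricated example: sampling only from one group skews the learned value
# for everyone else. The "model" is just a mean score learned from the sample.
population = (
    [{"group": "A", "score": 80} for _ in range(50)] +
    [{"group": "B", "score": 60} for _ in range(50)]
)

def learned_average(sample):
    return sum(row["score"] for row in sample) / len(sample)

# Representative sample vs. a sample cherry-picked from group A only.
representative = population
group_a_only = [row for row in population if row["group"] == "A"]

print("learned from everyone:", learned_average(representative))    # 70.0
print("learned from group A only:", learned_average(group_a_only))  # 80.0
# A model trained on the second sample will systematically misjudge group B.
```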

Another source of bias, and these are all from Josh Feast in an article he did in Harvard Business Review, is the labels that we use for training. We've talked about labels quite a bit already in this short video series on data. Training data is labeled in order to teach the model how to behave, and humans create these labels, and humans use these labels as well

when they are correcting the AI. And if you use a label to indicate the output of your AI, you can suggest how a person should correct for that particular outcome. So, for example, if the label you use is 'bad', then the person who's training the AI would send that back through backpropagation for some correction.

Well, of course, the labels might not simply be 'good' or 'bad', but, you know, if the label is something like 'deviance' or some other prejudicial term, it could cause the person to react in a negative way even if the thing is not necessarily negative. Another source of bias is the features and the modeling techniques. We talked quite a bit in the video on how AI works about some of the aspects of the actual AI algorithms,

for example, adjusting the bias in order to move the line along to create categorizations. So you can twiddle those knobs and dials in order to adjust the AI to produce just the sort of outcome that you want to produce, you know? It's just like statistics generally, right? I mean, if you get your hands deep enough into the statistical mix, you can massage the statistics to show what you want to show. That's the problem with AI, I mean, you know, though it's not original to AI.
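A small sketch with invented scores showing that knob in action: nudging a single decision threshold changes which cases get categorized as passing, without touching the data at all.

```python
# Invented scores: moving one bias/threshold "knob" changes who ends up
# in which category, even though the underlying data never changes.
scores = [0.42, 0.48, 0.51, 0.55, 0.63, 0.71]

def categorize(scores, threshold):
    return [("pass" if s >= threshold else "fail") for s in scores]

print(categorize(scores, threshold=0.50))  # ['fail', 'fail', 'pass', 'pass', 'pass', 'pass']
print(categorize(scores, threshold=0.60))  # ['fail', 'fail', 'fail', 'fail', 'pass', 'pass']
# Same inputs, different "outcomes" -- the choice of threshold is itself a decision.
```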

Darrell Huff wrote the book How to Lie with Statistics back in 1954. It was a long time ago, well before computers and AI, and a lot of the stuff that he describes is stuff that we can do now with artificial intelligence and analytics, and a lot of the mechanisms for skewing

the statistics are the same. The difficulty is, back then you could see what was being done, because you could look at the data, you could look at the model that was being used to interpret the data, you could actually look at the environment where that data was being collected, and determine whether bias was being created in the data collection or interpretation.

But that's much harder to do with a system that uses 60,000 classifications of something. How can you tell whether there's a bias between putting something in classification 59,001 or 59,002? It's virtually impossible for a human to detect that sort of bias. This is kind of a bad article with an even worse quality of graphic,

but I thought I'd put it in here in any case to indicate that there are many different types of bias and ways in which data can be biased. So there's selection bias, that's like the cherry-picking; there's self-selection bias, and that's, for example, when people determine for themselves whether they're going to answer a survey or not, and typically what happens is the people who care more about the outcome of the survey will answer the survey,

and the people who don't care just won't bother, and so you'll get a self-selection bias. Recall bias occurs, for example, when we're depending on the memory power of the respondent, and, you know, you can suggest what the person remembered, or hint at what they remembered, and actually make them think that they remembered it.

There's observer bias as well; survivorship bias, you know, I took the information from the people who survived; cause-effect bias, a sort of bias where we want to attribute cause and effect where maybe there isn't any; omitted variable bias, which I've talked about; and even funding bias, where we're inclined to support the opinions of our funders.

And that might be conscious or it might be unconscious. There's a huge cognitive bias codex out there, and it's too small to see on the slide, and even if I use the view that gives you the big picture, you won't be able to read it. So I recommend that you go to it yourselves:

either go to the web page, which is the link on the left, or go to the page which is the link on the right, and you'll see some of the categories, right? Things that focus on what we should remember, for example. You know, what counts as salient to one person isn't what counts as salient to another, and if you don't believe that, just look at how people give directions.

Some people will point you in the direction. When I was in Australia, they would point me in a direction, but it wasn't the direction of where the thing was; it was the direction I should take first in order to follow the road to eventually get there, but I would have to turn somewhere along the line.

Other people will give you a street address and a map; other people will give you landmarks, you know, you turn left at the IGA that used to be there, things like that. Other biases are caused by too much information, or the need to act fast, so, you know, a quick instant reaction, or there's not enough meaning, the signals coming in are just vague or unclear, they just don't tell you what they need to tell you, etc.

And it's not as though AI doesn't have these biases. So how do you deal with bias? Again, because there are so many biases, there's not going to be a nice simple set of solutions, of course, right? You should know that by now in this course. But this is from the Feast article in HBR, where he says there are four ways to address gender bias in AI. Of course, he's wrong,

but let's look at what he said anyways. First of all, ensure diversity in the training samples. Well, we've talked about diversity a bunch of times already in this course, and indeed, diversity is required in data in order to generate representativeness. If all of your data comes from male students, then your analytics is probably not going to be applicable to female students.

And remember, that was the gist of Carol Gilligan's response way back when she first developed, you know, an ethics of care. Secondly, ensure that the humans doing the labeling come from diverse backgrounds. Again, diversity comes into play here; it's the labeling part of it. We've talked about the importance of labeling several times, and ensuring that the labeling is done in an even, neutral manner helps to ensure that the people who are using the labels in order to train

the AI are not influenced unduly by the labels, and so do proper training rather than improper training. Third, measure accuracy levels separately for different demographic categories to identify when one category is being treated unfavorably. That's an interesting approach, and presumably it works, but the idea here is, you know, you look at whatever you're studying in men specifically, and then you look at whatever you're studying in women specifically, and then compare that with what you find when you study the population as a whole.

And that would tell you, presumably, if one or the other is being treated unfavorably. And then finally, and this kind of reflects the ethics of care a little bit, solve for unfairness by collecting more training data associated with sensitive groups. In other words, look for the vulnerabilities.

Look for where the problems or issues might be most likely to arise, and do extra careful, and indeed extra, data collection at those points. That makes a lot of sense to me. It minimizes the chances that effects that harm only a small number of people are overlooked in the data collection and analysis.

You know, it may be that only one percent of the population is negatively impacted by the way the analytics is being done, in a certain sense, but that one percent still matters. And if we can identify, you know, the greater likelihood of somebody being in that one percent, by, say, being a member of a vulnerable group, then doing more measurement makes it less likely that they'll be overlooked. So I think that's a good idea.
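Going back to the third approach for a moment, here's a minimal sketch, using hypothetical predictions and labels, of measuring accuracy separately per demographic group and for the population as a whole.

```python
# Hypothetical predictions and true labels, tagged with a demographic group.
# Measuring accuracy per group can expose a disparity the overall number hides.
records = [
    {"group": "men",   "predicted": 1, "actual": 1},
    {"group": "men",   "predicted": 0, "actual": 0},
    {"group": "men",   "predicted": 1, "actual": 1},
    {"group": "women", "predicted": 1, "actual": 0},
    {"group": "women", "predicted": 0, "actual": 1},
    {"group": "women", "predicted": 1, "actual": 1},
]

def accuracy(rows):
    return sum(r["predicted"] == r["actual"] for r in rows) / len(rows)

print("overall:", accuracy(records))  # 0.67 -- looks tolerable on its own
for group in ("men", "women"):
    rows = [r for r in records if r["group"] == group]
    print(group + ":", accuracy(rows))  # 1.0 for men vs 0.33 for women
```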

There's also this whole discussion about data and objectivity, and again this discussion comes from the perspective of, you know, staying grounded while you're working with data. And so, as one author puts it, first of all, just never accept data on faith. And this is a criticism

I would apply, indeed, to a lot of writers out there who promote, without nearly enough skepticism, instances of what is called evidence-based teaching, or evidence-based program management, or evidence-based whatever. To a large degree, they're accepting data on faith: because, you know, it's evidence, therefore it must be good.

And we've seen all the ways in which analytics and AI can be swayed and influenced and made incorrect by poor data. The same applies for humans as well, and if you're not being critical about the evidence that's being presented to you, then it's very likely that the solutions or the approaches that you advocate will be based on incorrect data,

and will themselves be influenced, and possibly negatively impacted, by this incorrect data. We can't just take data as objective. I think everybody knows that, and I hope everybody knows that, but that's not something that was always commonly accepted. I mean, consider the philosophical school called logical positivism, which you hear a lot of people criticize.

One of the premises of logical positivism is that you have objectively neutral sense data, which you use as the basis for all of your inferences and statements about the world. And one of the contributions of Quine, who's already been cited in these presentations, is to argue that you can't make that distinction between theoretical data and pure sense

data. All data is what others would call theory-laden data. Let's take something like, you know, 'I see a red patch.' Forget the patch part; just focus on the word red. What do we mean by red? Right. There are, according to my computer, something like 14 million colors.

That's more colors than I can detect, but it could produce more even if it wanted to, I suppose. But that's beside the point. A subset of those colors is what I would call red. What subset? Well, there's no precise line, and there's no necessity that where I draw the line is where everybody else would draw it.
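A toy sketch of that point: calling an RGB value 'red' means drawing an arbitrary line in color space, and nothing forces my line to match yours. Both cut-offs below are invented.

```python
# Toy sketch: "red" is whatever subset of color space I decide it is.
# Someone else's function could draw the line somewhere else entirely.
def is_red_mine(r, g, b):
    return r > 200 and g < 80 and b < 80      # my arbitrary cut-off

def is_red_yours(r, g, b):
    return r > 150 and g < 120 and b < 120    # an equally arbitrary cut-off

patch = (180, 100, 90)  # the same patch of color
print(is_red_mine(*patch), is_red_yours(*patch))  # False True -- we disagree
```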

And when I say there's a red patch, what I'm doing is associating what I see with some preconceived notion of what constitutes redness. And there's an article out there that says that the ancients, people living in ancient Greece, never saw blue. They saw, you know, various colors.

Now, presumably they looked at the same sky that we did, but when they looked at the sky, they didn't see blue. They saw something else, they called it something different, and their organization of colors was just different. So even in the discussion of sense data, the rawest, most basic information that meets our senses,

there's no objectivity, it's not context-free. And that's going to be true of our cameras and our detection devices as well, right? We use our color camera; our color camera has receptors that detect different colors. We could change those; some cameras, for example, detect infrared light, and nothing prevents a camera from doing that.

I can't do it, but a camera sure could. And so, you know, the camera's data depends on what kind of input I have given that camera. And that's why people like Gitelman and Jackson point out that raw data is an oxymoron; there is no raw data. There is data that has in some important way been created by the person doing the experimentation, or doing the observation, or doing the data collection, or doing the AI task. This data has been created;

it's been selected, and it's been organized and measured in a very specific way. And so a critical analysis of data should take these sorts of things into account, not so much the things that bias the data, because maybe that's the wrong word, because when you say something is biased, you're suggesting there's a right way

things should be, and the bias skews away from that. But maybe that's not the way it is. Right? Maybe there isn't a right way to look at things, and it's not bias so much as interpretation: from your perspective it's bias, from my perspective it might not be. That's a really hard concept to get at, and it's really hard because in many ways it's almost impossible to shake our sense of the way we think things must be, or what George Lakoff would call the frame, or perhaps the worldview, with which we see the world,

the things we think are in the world. All of these words, these theories, these classifications, these taxonomies constitute a culture. It might be a scientific culture, it might be a political culture, it might be a civic culture, but it's a culture, and it's a way, more or less shared, you know,

sure, it's not identical in each person, but it's a more or less shared way of accepting what sorts of things there are in the world and what sorts of things we can say about what there is in the world. But there are a lot of cases where there is room for discussion, interpretation and debate about what those things are and how they influence what we count as data in the world.

And that is the role of theory, right? The role of theory, to a significant degree, is to take a look at what our culture, whatever it is, thinks these things are, and then to question whether these things are really the best way or the right way or the ethical way we should be looking at the way things are. And, you know, so the ethics of our AI and analytics systems might not even be based in the systems at all.

You know, they might be based in the culture of the people who are using the systems: in selecting the data that's going to be put in the systems, or even creating or generating the data that's going to be used in the system, interpreting that data, labeling that data, and then applying that data to different circumstances.

You know, none of that is in the machine; none of that is in the AI itself. The AI is a statistical manipulation of the data that maybe uses 60,000 categories where we would use four, but that doesn't make it more or less ethical in any sense at all. There are data risks associated with AI; I've listed a bunch of them here,

although I think I would want to underline first that we need to be clear about who benefits and who incurs the risk, because a lot of the time the person who creates the risk isn't the same person who pays for the outcome when the risk comes true. But anyhow, a few risks: the danger of stale and outdated data,

we've talked about that a little bit already; sensitive files subject to regulations like GDPR, HIPAA, PCI and others; the lack of space, time and social context; limitation on the scope of data, which relates to what I've just talked about regarding the culture in which the AI application is being used;

the use of data for unexpected purposes or to reveal unintended information, for example, somebody using the DNA sample that you sent in because you wanted to know whether you were Irish to solve a crime; the risk of exceptional intrusiveness, revealing things about yourself that you do not necessarily want to see revealed; and the potential for misuse, privacy breach, blackmail and other crimes.

I've said for a long time, you know, if you don't want to be blackmailed, don't do the thing that you would be blackmailed for. I still kind of think that's true, but then again, people can be blackmailed for things that really aren't wrong, for example, being gay. There are many environments around the world where people can be blackmailed if it comes out that they're gay, but there's nothing wrong with being gay, and I'm not gonna tell them to stop being gay just to avoid blackmail.

So that is a potential for misuse, for privacy breach. And finally, ghost users and accounts: people who had an account on your system but now no longer have an account on your system, yet they're still generating data on it. Or, in one case I read on Reddit yesterday, a person was fired, left his job,

and then a few years later, I think it was, realized that the company was still using part of his Google Docs account in order to create models and representations and stuff. So he deleted his Google Docs account, or at least that part of it, and they lost all of their data, because they were using his zombie account. Stuff like that.

These are all things, you know, and again they have ethical implications, because if you don't attend to the risk, there may be harm that's caused, and we know that on some ethical theories, anyways, if there's harm caused, that is a breach of ethics. Something that can be done to address

some of this is what's called data tracing. I did that a long time ago, when I was really suspicious about some of the information that was being collected about me. This was before they had automated checking of all the data. So I used a different postal code

every time I gave my address to different companies, and that included the government, and then I could watch as these postal codes worked their way through the system and showed up as junk mail on my doorstep. So I put in a fake postal code for my tax return, and they didn't check that, it didn't matter, because the post office wasn't actually using postal codes at the time, and lo and behold, I see the same postal code show up on a collection letter

from a student loan company. The student loan company had somehow managed to get information about my income from the tax department, traced me using that information, and was now harassing me. So that made me hate these agencies all the more, and made me doubt the safety of information that I gave to the government.
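A minimal sketch of that data-tracing trick, with made-up recipients and codes: give each recipient a slightly different marker, and whatever shows up on the junk mail points back at whoever passed the data along.

```python
# Made-up sketch of data tracing: a unique (fake) postal code per recipient,
# so a leaked code points straight back at whoever shared the data.
markers = {
    "tax_department":  "X1A 0A1",
    "magazine_sub":    "X1A 0A2",
    "insurance_quote": "X1A 0A3",
}

# Reverse lookup: from the marker seen in the wild back to the source.
source_of = {code: recipient for recipient, code in markers.items()}

junk_mail_code = "X1A 0A1"  # the postal code that appeared on the collection letter
print("data leaked via:", source_of.get(junk_mail_code, "unknown"))  # tax_department
```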

It's that kind of thing that really undermines somebody's trust in the institution. Data markers are used all the time. I mean, in this one article they say, we call this new verification method, quote, 'radioactive data', because it's analogous to the use of radioactive markers in medicine. But you see things like watermarks, embedded metadata in images, hash addressing on digital content, etc.,

that are used to trace the flow of data through a network. Again, a two-edged sword, right? It can be used to find inappropriate, unauthorized or malicious uses of data. On the other hand, it's kind of intrusive, because it's tracking how people are using a piece of data over time,

and so it's revealing something about them that they haven't necessarily voluntarily given. We're getting to the end here. The last slide deals with data ownership, and of course it has a great big gray block of text on it. What do we mean by data ownership? Well, we need to be careful here, because, you know, there's this principle out there,

more apocryphal than anything, that data cannot be copyrighted. But, you know, the actual presentation, formatting and organization of it all can be copyrighted. And even so, you can own data in the sense that you have it. I mean, I've got lots of data on my computer. It's mine because I have it, and even because I created it, and also because I'm responsible for it.

So ownership here implies power as well as control. In any of these data analytics situations, we need to ask questions like: who owns the data, and, as a separate question, who owns the information the data is about? Again, that can get tricky, right? Sometimes there's data about a person that can be owned not by that person but by some third entity.

For example, a newspaper has an exclusive report on a car accident where they have a photograph of the person who was the victim of the accident. The newspaper owns the photo, the newspaper owns the information about the car accident, but the car accident victim does not own that. And, you know, the court cases go different ways depending on circumstances, but arguably the newspaper can report on the accident whether or not the victim wants them to.

Now, I have always found this a bit ironic when I turn around and describe something that the newspaper has done and they say, no, no, you can't talk about what we've done, we own everything about us, including the contents of our stories. But, I mean, if I'm not copying the content, do they own it?

And that's where we get into disagreements. (I made it through four videos today before I sneezed.) In data, there's any number of different owners, and here's a list: the creator of the data, the consumer of the data, the compiler of the data, the enterprise, the funder, the decoder of the data, the packager, the reader as owner, the subject as owner, like we talked about, and then the purchaser or licenser as owner.

That's probably a partial list, especially when we get into AI, and it's closely tied to the question of responsibility. You know, to what degree does the person who wrote the AI algorithm back in, say, 1995 have responsibility over the output of that algorithm? To what degree is somebody responsible if some act of theirs that became AI data results in the AI misrepresenting someone? You might think, well, they shouldn't have any, but what about digital redlining, which exists

because crime is enforced more by the police in one area than in another area, and so the AI thinks there's more crime in that one area? Now, is that the AI's fault? It's not even the fault of the person who collected the data, because they're just collecting crime reports.

That's the fault of the person who actually committed the act of enforcing more in one area than the other, which resulted in the AI reaching this bad decision. So questions of ownership and responsibility become really tricky when we're tracing responsibility and ethical blameworthiness, or ethical credit, when we're working with AI and analytics. Anyhow, that ends these four videos on data. I hope you found them interesting.

Remember, data is one small fraction of everything we've been talking about in this course, and biased data, I'm trying to keep my hands as close together as possible without touching, is one small fraction of that. And most of the literature out there on AI and ethics talks about biased data in AI producing bad results, and yeah, biased AI,

sorry, biased data in AI produces bad results. But that's just one pathway, and that one pathway can come up in many different ways, have many different causes, have many different effects. And simply saying, well, the data shouldn't be biased, that's not going to happen. It's just not going to happen.

There's no such thing as non-biased data. The question is, is the data biased in an ethically responsible or ethically irresponsible way, or, if you don't like those words, in an ethically good or ethically bad way, or just in an ethical or unethical way? That's a harder question. It's easy to say, hey,

yeah, no biased data. But it's hard to look at the question of data from the beginning: the data sources, the different places we get data, how we measure data, the tools that we use, what they bring to the picture, the data cleaning process, the labeling, the classification systems, the interpretation of the data, and then the potential bias of the data, but also data risk and data ownership issues and more. It's a huge issue, not going to be solved by a list of ten principles, and I think it would be absurd to suppose that it could be.

I think that, you know, we're not even through the unit on the decisions that we make in analytics and AI, but already we can see that no approach that is based on sweeping general principles is going to work. In fact, interestingly, it's such a complex area that arguably only an AI could address it.

It might be that the distinctions and the categorizations and the classifications that we need in order to undertake a discussion, much less an understanding, of whether something is ethical or unethical in analytics and AI are just beyond our capacity. We need 60,000 variables and not 16, and I'll leave you with that thought.

Anyhow, for now, we've got some more videos to do in the area of the decisions we make, because we've only begun to talk about AI, and then we'll be looking at some of the practical implications of this in the final module of the course. So thanks for being with me.

I'm Stephen Downes. This investigation of data was long, but I hope you found it was worth it. Till next time.
