Outside the Globe Theater, London, England, 2008
In the last post, we talked about Big Data — what it is, what it might be good for, and how our implicit assumptions in creating various schemas (data structures inside a given set) might influence how and what we understand. None of this is easy.
Yet our ability to understand validity — what a result actually means, and what its predictive reach may be — is directly tied to understanding what we do, or did, to the data to get an answer. Any good statistician (and yes, I made fun of statisticians in the last post) will tell you that they intrinsically do sensemaking every time they run an analysis. The simplest example might be bimodality of data. Below is an example from Wikipedia, by user Maksim.
If one takes the average (mean, whatever you want to call it), the answer will come out just a little to the left of zero. In this case, as any statistician will tell you, the mean won't mean much (pun intended)!
But an experienced statistician will recognize from various cues in the data set (maybe through the variance, maybe through a quick plot of the probability density function) that the distribution is bimodal. Then, if you’re approaching them for expert advice, they’ll say “well, we need to do some different things to make sense of this data set.”
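The statistician's quick cues can be sketched numerically. One simple screen (alongside just plotting the data) is Sarle's bimodality coefficient, which combines skewness and kurtosis; values above the uniform-distribution baseline of 5/9 hint that a single mean is hiding two humps. The mixture parameters below are made up for illustration, roughly mimicking the Wikipedia figure:

```python
import random
import statistics

def bimodality_coefficient(xs):
    """Sarle's bimodality coefficient: (skew^2 + 1) / kurtosis.
    Values above ~5/9 (the uniform-distribution baseline) suggest bimodality.
    (Simplified: no small-sample correction term.)"""
    n = len(xs)
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2          # non-excess kurtosis
    return (skew ** 2 + 1) / kurt

random.seed(42)
# A 50/50 mixture of two well-separated normals, like the figure above
sample = [random.gauss(-2, 0.5) for _ in range(10_000)] + \
         [random.gauss(2, 0.5) for _ in range(10_000)]

print(f"mean = {statistics.fmean(sample):+.3f}")   # near zero: between the humps
print(f"bimodality coefficient = {bimodality_coefficient(sample):.3f}")
```

The mean lands in the valley where almost no data lives, while the coefficient (here around 0.8, well above 5/9) flags that something two-humped is going on — exactly the cue that sends the experienced statistician off to "do some different things."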
What’s really going on here? The statistician is applying a scaffolded heuristic to the data, because he knows you want an answer that means something — that is valid. He is demonstrating evolved, empathetic behavior. And though he’s using algorithms, he isn’t operating out of the purely algorithmic/legalistic v-Meme set, because he knows (if he’s a good statistician, at least) that you want to learn something that will help you along toward your goal. And that just giving you the mean won’t do that.
In the paragraph above, we've encapsulated the basic place-taking empathy of the v-Memes above the Trust Boundary. The statistician is place-taking with you, knowing that just telling you the mean is not enough. He knows you're trying to reach a goal — that would be Performance v-Meme territory as well. And the heuristic he's applying comes out of a combination of lower-level scaffolding (algorithmic tests for bimodality) as well as integrative experience (this is not his first rodeo!)
And he cares — feels responsible — for the result. Another result of increased empathy. If he didn’t, he could hand you a laundry list of statistics and say “well, this is what the analysis says. I’m sure the generated statistics are correct (meaning, of course, that they’re reproducible and reliable) but you’re on your own as far as applying them to your problem. I’m a statistician, after all!”
He’s not just any statistician. He’s YOUR statistician.
Right off the bat, one can see that things get much more complicated in the world of Big Data. By definition, we're going to collect lots and lots of data, of different types. Not just one distribution, but lots of them. With lots of statistics. And somehow, we're going to assume that these statistics are going to help us understand these large, perplexing, ‘wicked’ problems. If we weren’t attempting to solve wicked problems, why go through the hassle of Big Data?
And how are we likely to approach these problems? Well, let’s say we’re examining a social issue. Like who’s going to vote for Donald Trump! First off, it should be pretty obvious — let’s break that data up. Men vs. Women. Women can’t be for Trump, can they? He’s been talking about all those negative things about women, and it’s pretty obvious that he’s in bed with the Religious Right. Let’s break that data out!
Then we’ve got to move on. Black vs. White. Rich vs. Poor. Educated vs. Uneducated. College vs. High School. And so on. I’ll tell you — no one’s going to get fired for assuming standard demographics in a statistical analysis.
Maybe some of these categories, and their associated archetypes, are good. Maybe they’re not. But they’re built on assumptions that, as the last election pretty clearly demonstrated, are not nearly as clear-cut as they may have been 50 years ago.
No problem! you say. We just need more topical groups. Black urban women. White rural professionals. We’ll fine-scale. Add more factors. Intersectionality! People of faith will reject Trump — he doesn’t go to church! At the Prayer Breakfast, he delivered a prayer for Arnold Schwarzenegger. He’s got a thing for people on the Austrian/Slovenian border. That’s it!
Except, of course, it’s not. And in a world where information streams and social situations are more and more differentiated, the people at the end of them are also more topically differentiated — until we get down to the microscopically topically fractionated, and realize that this approach is not going to get us where we need to go. Ideally, we’d live in a Communitarian v-Meme world, where individuals would rise to the level of their independent data stream. Martin Luther King had it right. But I digress.
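The fractionation problem is also just arithmetic: every factor you add multiplies the number of cells, so per-cell sample sizes collapse fast. A toy sketch, with made-up factor names and level counts (real cells would also be wildly uneven, with many nearly empty):

```python
# Hypothetical demographic factors and their number of levels
factors = {
    "gender": 2, "race": 5, "income bracket": 4,
    "education": 4, "urban/rural": 2, "religiosity": 3,
}

population = 1_000_000   # assumed survey size
cells = 1
for name, levels in factors.items():
    cells *= levels
    print(f"after adding {name:15s}: {cells:4d} cells, "
          f"~{population // cells:7d} people per cell")
```

Six coarse factors already give 960 cells; add a few more "intersectional" splits with more levels, and even a million-person dataset thins out to a handful of people per archetype — the microscopically topically fractionated.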
What is really going on here? Our mental models, produced primarily by our lower v-Memes, which designate what we are, are not sufficiently capturing the dynamics of how and why we are. Topical information alone is almost never sufficient to capture empathetic dynamics and how we process information. It can give clues, and hint at biases toward certain v-Memes. But it is not conclusive.
Take an issue like global warming. If someone is concerned about global warming, one might think that they must be Global Holistic — after all, they’re concerned about the impact of global warming across the planet. Yet maybe the reason that person is concerned is because their parents told them to be concerned (Authoritarian v-Meme.) Or they’re part of a church that thinks environmental issues matter more than anything (Legalistic v-Meme.) Or they come from an aboriginal tradition of respecting the Earth (Tribal v-Meme, with likely 2nd Tier implications — we’re seeing more and more of this!) It’s certainly far more likely that someone who is Global Holistic, with empathetic networks across the world, is going to be concerned about global warming. But the superficial topic alone can’t inform.
What are the implications for Big Data? We have to develop different methodologies for understanding how data categories connect together. We have to find ways of capturing patterns of inference that establish a deeper ‘Why’. These patterns of inference will likely depend on v-Meme content, and will also boil down to spatial and temporal awareness in the individual’s brain — directly relating to their empathetic development. What this also means is that we have to guess explicitly at models a priori, state what they are, and then see whether the data bears those a priori models out.
Academics don’t like to do this. The idea is that nine times out of ten, the data alone informs. We’re objective. We do a good job collecting it, and making sure it’s accurate. And then we curve-fit, and some number of polynomial terms will give us the power law we need to establish the cause-and-effect relationships we need to know. Yet once we move out of the relatively meta-simple world of curve fitting, this is not likely to happen. We may capture interpolative, interior detail with increasing accuracy. But more profound extrapolation will elude us.
How then, can we utilize Big Data in the service of society, understanding that our old topical models don’t and won’t work like we think they should? I absolutely don’t have all the answers. But let’s consider a problem that is near and dear to my heart, involving education.
I was involved with Mike Richey, of the Boeing Company, in exploring the potential for improving online education and associated content for teaching aerospace fundamentals through the construction of Unmanned Aerial Vehicles. I’ve talked about this before in this post. Mike works with a Big Data analyst at Purdue, Krishna Madhavan, whose expertise is pulling ‘digital exhaust’ off the interactions students have with the various content the course architects have decided is important to know. Such digital exhaust shows clicks and other interaction patterns that can be categorized, at the most basic level, as the sequences of content students follow to learn the material.
Because the course was prepared in a traditional academic fashion — with topical content in mind — the information stream coming off the users’ interaction with the content is scattered. The topical content — the various steps necessary to build a UAV, which include everything from the aerodynamic body shape to investigate, to weight-and-balance calculations for the aircraft — may make sense on the surface, but from a knowledge structure perspective it is all over the map. No one went through and broke down the different v-Memes associated with the information. Aerodynamic drag coefficients are mixed in with heuristics for structures. This is no surprise, and typical in education.
What might happen if, instead of tracking topical information, we identified the connectivity in the material the students were seeing? What if we broke things down along knowledge structure lines? We could then see if connected, empathetic thinking took more time, or required more feedback loops. We could test for retention after experience/use of developed heuristics — something that is generally well accepted in education. We could also see how the learners that proceeded linearly through the material differentiated themselves from feedback-loop learners, or selective content learners. If this experiment were run in a company, we could sort jobs along the various v-Memes — it’s likely that, because of the demands of the positions, sales folks might have different empathetic biases than the stress analysts. In short, we could create new lenses for grouping and modality based not on superficial characteristics, but on core brain wiring.
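A first crude pass at that grouping could come straight from the clickstream itself. The sketch below (hypothetical integer content IDs standing in for ordered course units) sorts students into linear, feedback-loop, and selective learners by the shape of their path rather than by any demographic label:

```python
def classify_path(clicks):
    """Crude learner classifier over an ordered sequence of content-unit IDs.
    'linear' = never steps backward; 'feedback-loop' = revisits earlier
    material; 'selective' = skips ahead past units without revisiting."""
    revisits = sum(1 for prev, cur in zip(clicks, clicks[1:]) if cur < prev)
    skips = sum(1 for prev, cur in zip(clicks, clicks[1:]) if cur > prev + 1)
    if revisits > 0:
        return "feedback-loop"
    if skips > 0:
        return "selective"
    return "linear"

paths = {
    "student A": [1, 2, 3, 4, 5, 6],          # marches straight through
    "student B": [1, 2, 3, 2, 3, 4, 2, 5],    # loops back to earlier units
    "student C": [1, 4, 6],                   # cherry-picks content
}
for student, clicks in paths.items():
    print(student, "->", classify_path(clicks))
```

A real analysis would be far richer — dwell times, the v-Meme character of each unit, retention outcomes — but even this toy version groups learners by how they move through knowledge, not by who the enrollment form says they are.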
What that really means is that we are allowing expression of agency into our analysis. By not locking people down with titles, we discover what they really know. And what they really want to learn.
There are other applications I’ve been thinking about as well. One young professor I’ve been discussing things with is attempting to optimize traffic flow through detours. One can ask how spatial and temporal awareness — all things affected by empathetic development — might affect one’s ability to problem-solve through a construction zone. The possibilities, associated with core brain wiring, instead of more cosmetic characteristics, are literally endless.
There is much more to explore here. But hopefully this starts your own process. How can we start using Big Data, which will almost certainly give a much more comprehensive view of the world around us, to tell us something truly reliable AND valid? It’s the beginning of a journey.