In the process of analyzing textual data, when should researchers reconfigure the text into numerical representation?
(a) Basically never. Text should be kept “whole.” Textual data are entirely dependent on context.
(b) Under certain circumstances. Depends on the research question.
(c) Basically always. Other forms of textual data analysis are too subjective, prone to personal interpretive error, and unreplicable.
This question and these possible answers remain front and center as a sociology web debate about coding, textual data, and the Biernacki book Reinventing Evidence presses on. The debate is a good one, even though I suspect that every participant, if asked, would answer (b): coding should proceed depending on the research question. For instance, if someone wanted to know when the “Bad Economy” became socialized knowledge, they could make headway by coding textual data found on the internet. Could anyone disagree that, to answer this question, we could code for the use of specific words and images? And that doing so would be useful, would create value?
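To make "coding for the use of specific words" concrete, here is a minimal sketch in Python. The four posts and the function name `code_phrase_by_month` are invented for illustration; they are not data or tooling from anyone in the debate. The sketch assumes the simplest possible coding rule: count exact, case-insensitive matches of one phrase per month.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (month, text) pairs standing in for web posts.
POSTS = [
    ("2008-01", "Housing prices dipped again this month."),
    ("2008-09", "Everyone is talking about the bad economy."),
    ("2008-10", "In this bad economy, layoffs keep coming. Bad economy indeed."),
    ("2008-10", "The economy is bad, but my garden is fine."),
]


def code_phrase_by_month(posts, phrase):
    """Count case-insensitive, whole-word occurrences of `phrase` per month."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for month, text in posts:
        # Adding zero still records the month, so empty months stay visible.
        counts[month] += len(pattern.findall(text))
    return dict(counts)


counts = code_phrase_by_month(POSTS, "bad economy")
print(counts)
```

Note that the last post ("The economy is bad") contributes nothing to the October count, because an exact-phrase rule cannot see it. Even in a case where coding is clearly useful, what gets counted depends on how the coding rule is written.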
I strongly suspect Andrew Perrin would answer (b). I think Fabio Rojas would answer (b), though on his blog he makes provocatively inflammatory statements about qualitative methods from time to time that qualify for (c). I can forgive him for that. One of the commenters at Orgtheory, Thomas, is trying to convince Andrew Perrin that Biernacki would answer (b), too.
Andrew Perrin doesn’t buy it. His criticism of Biernacki’s book, if I may, is precisely that Biernacki is arguing for (a), that turning text into numerical representation is never a legitimate option when trying to analyze textual data. Perrin thinks this is unreasonable because coding, at the very least, has its place.
I say Biernacki’s defenders should admit that Perrin is right, turn the tables, and instead point out that Perrin himself puts some stock in the more specific argument that “coding has limits.” For example, my previous blog post quoted Perrin’s original Scatterplot post granting a lot of ground on the question of coding’s “validity.”
So, while one might say that Perrin has made his point on the absolutist nature of Biernacki’s argument (Thomas’s comments, to take the most prominent example, make a different argument than Biernacki made), I am still left wondering how Perrin would deal with the details, especially the one Biernacki claims to demonstrate and Perrin concedes: the problem of validity. Under what conditions is the validity of coding limited?
Here are three questions, from the point of view of a Biernacki defender trying to stake out some semblance of offensive ground, but who nevertheless has from the start conceded that coding is not never good (double negative, of course, intended):
1. Can we agree that lack of “context” is the general, vague, but real weakness of “coding”?
2. When can a social-scientific study using solely methods in which “context” is the weakness be a complete and maximally useful study?
3. So in general, when is coding less and more valid?
* * *
This debate matters not only to academic sociologists, but to private business more generally. The industrial-scale collection of symbolic data as “intellectual property” going on in private enterprise is well documented. We also know participation in the production of these data is broad and cross-demographic. We have oodles of text and images; symbolic data proliferate in advanced capitalism. If coding these data is worthless, then these data are of little economic value, too. That would mean the business models of “social media” sites, as intellectual-property miners, stand on much thinner economic foundations than widely believed.
So what has the debate hashed out? Rojas and Perrin’s position is that Biernacki’s central thesis is wrong for its absolutism. This is the position most sociologists would take, and I would agree with them. However, I want to make a potentially more important point here, and it doesn’t entail agreeing with the fundamentalist nature of Biernacki’s central thesis. My hypothesis: Coding symbolic data may not be over-valued, but keeping symbolic data “whole” and in context may be under-valued.
I developed this hypothesis in the first half of 2012, while studying Facebook’s IPO and developing a theory of the case. In the course of the research I wrote blog posts like “Is the Facebook IPO worth $100 billion?” and “Facebook and the value of its symbolic data.”
I did not have access to any insider knowledge, so what I found came with qualifications right off the bat. My position was that Facebook’s data are valuable, but limited. They are valuable because they are large, efficient datasets. And coding procedures get better all the time.
But the data are limited because they give you “what” and “who” when prospective clients (I say clients because Facebook is an economic enterprise, not a remotely academic one) need “why.” As in: why are consumers, employees, and audiences behaving, thinking, and changing the way they are? If clients want insight about “why” from Facebook, it will require hours of directly observing textual trails in their broader context. Facebook helps make the connection, but it doesn’t automatically give you the data you need.
My final theory of the case was that Facebook was far more valuable as a “social media” site than as an “intellectual property” giant, with whatever implications for its valuation should follow. Even for most economic actors, using Facebook’s connective features for their own data-collection purposes is still more valuable than using the coding-based analytics Facebook’s proprietary data directly provide.
(This belief stems from another, the belief that most actual economic actors — i.e. Facebook’s potential clients — need particularized, local, contextualized knowledge of a kind that probabilistic, population-level analytics do not provide. Most business actors, in my theoretical formulation, are better off with knowledge coming from direct, extended contact with audiences. Facebook helps connect you with these audiences. Clients can use Facebook to get a better picture of those with whom they then need direct contact, i.e. from whom they can extract “wholistic,” context-based data. But Facebook itself doesn’t do the real, dirty extracting.)
I bring up this analysis of Facebook, which could be wrong for all I know, or simply irrelevant, to give the background to my view of the Biernacki debate, and to answer the question at the top: When should analysts “code”? I think the answer is obvious: coding oodles of data makes sense when your research question (whether devised by yourself, in consultation with a Committee, or with a client) calls for it. When the goal is not to know context. When the goal is to gain insights spanning across contexts.
But this raises a significant problem.
Because if you want to know a particular case, you want and need that context. You want to know not only how many times a subject was raised, but why, and for what reasons. The more localized the intellectual pursuit, the less value there is in coding, and the more value there is in keeping data “whole” and “extended” and finding the right informants.
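One hedged way to picture the difference between a count and the context around it is a keyword-in-context listing, which returns each mention along with its surrounding text rather than a tally. The sketch below is illustrative only: the excerpt is invented, and `keyword_in_context` is a hypothetical helper, not a method proposed by Biernacki or Perrin.

```python
import re


def keyword_in_context(text, keyword, width=40):
    """Return each occurrence of `keyword` with up to `width` characters on either side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        # Clamp the window so it never runs past either end of the text.
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end])
    return snippets


# Hypothetical excerpt, invented for illustration.
TEXT = (
    "I stopped posting about layoffs because my manager follows me. "
    "My brother jokes about layoffs all the time, but he is self-employed."
)

snippets = keyword_in_context(TEXT, "layoffs", width=30)
for snippet in snippets:
    print(repr(snippet))
```

A count would report two mentions of "layoffs." The snippets show two quite different reasons for raising the subject, and even then a fixed window of characters is a thin stand-in for the “whole” and “extended” data a particular case calls for.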
The value of data depends on the context of the data. And the value of “coding” depends on the research question. I agree with Perrin that this is a nuance Biernacki’s book seems to miss. But I think the book helps make clear that there are layers of possible engagement with “how to study symbolic data,” layers that matter for social science and private business alike.
The more textual data continue to proliferate, and the more researchers “code” textual data, the more researchers will need complementary methods to fill the gaps in understanding that coding leaves behind. What are these gaps? And what are these complementary methods?