Data analysis is the process of (A) inspecting, (B) cleaning, (C) transforming, and (D) modelling data with the objective of discovering useful information, drawing conclusions, and supporting decision-making. That's a very typical statement, but it's also really vague.
I don't think there's any essential difference between data analysis inside and outside the social sciences. It's all about looking at the data, understanding it, modelling it, and then doing something with the insight gained. Whether you do this in public policy, in academia, or in marketing, there are many more similarities than differences. In all these contexts, you will ask similar questions: can you generalize from the sample to cases beyond it (to the population or to other samples; that's what we call inference)? Are the relationships you find causal? Are you in fact measuring the right concept? And so on.
Here, the focus is on data analysis in the social sciences using R. With that in mind, let's look at what experts in the field have to say about what data analysis is.
For Llaudet and Imai:
The focus is on (D) modelling. They identify three things we do with data analysis: describe, predict, explain. This is a key threefold distinction that comes up often.
“Describing” is what we get in polls. For example, “56% of Canadians believe this or that. 57% of homeowners believe this or that. In contrast, only 41% of renters believe that same thing”. Correlation is also about describing. For example, age is correlated with social media use: younger people use social media more often. It is often said, and it is true, that correlation does not imply causation. A classic example is the relationship between ice cream sales and drowning incidents. In many regions, ice cream sales tend to increase during the summer months. Unfortunately, the number of drowning incidents also tends to rise during this period. But clearly, one does not cause the other: a third factor, warm weather, causes both. For age and social media use, it's more complicated. It depends on what we mean by “cause”, and there are many things that lead to, explain, or cause social media use. Age is likely one of them.
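To make the ice cream and drowning example concrete, here is a minimal R sketch using simulated, made-up data (the variable names and effect sizes are invented for illustration): a shared driver, temperature, produces a correlation between two variables that do not cause each other, and the association largely disappears once we condition on temperature.

```r
# Simulated data: warm weather drives both ice cream sales and drownings.
set.seed(123)
temperature <- runif(200, min = 5, max = 35)                 # daily temperature (Celsius)
ice_cream   <- 20 + 3 * temperature + rnorm(200, sd = 10)    # sales depend on temperature
drownings   <- 0.5 + 0.1 * temperature + rnorm(200, sd = 1)  # drownings also depend on temperature

cor(ice_cream, drownings)                         # sizeable positive correlation
summary(lm(drownings ~ ice_cream))                # naive model: ice cream "predicts" drownings
summary(lm(drownings ~ ice_cream + temperature))  # conditioning on temperature, the association shrinks
```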
“Predicting” is more common outside the social sciences. Machine learning is about finding patterns, using them to predict, and acting on those predictions. It's more common in engineering and finance than in social science, but it is still sometimes done, for example in election forecasting. Technically, economics is a social science, and there economic prediction/forecasting is very common.
“Explaining” is probably the most common thing we do in social science. It's about finding causal effects. All other things being equal, does attending a private school increase student test scores? How does the electoral system (e.g., first-past-the-post, proportional representation) influence the nature of party politics and election outcomes? Does economic inequality lead to political instability? Getting at causality is notoriously difficult in many cases.
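As a hedged illustration of the private school question, here is a small R sketch with simulated data and invented effect sizes: family wealth drives both private school attendance and test scores, so the naive comparison is biased, while controlling for wealth (or randomizing attendance with a lottery) recovers the true effect.

```r
# Simulated data: family wealth is a confounder of private schooling and scores.
set.seed(42)
n       <- 1000
wealth  <- rnorm(n)                                          # standardized family wealth
private <- rbinom(n, 1, plogis(1.5 * wealth))                # wealthier families choose private school more often
scores  <- 60 + 2 * private + 5 * wealth + rnorm(n, sd = 5)  # true private school effect is 2 points

coef(lm(scores ~ private))           # naive comparison: biased upward by wealth
coef(lm(scores ~ private + wealth))  # controlling for the confounder recovers roughly 2

# A randomized experiment (a lottery) breaks the link between wealth and treatment:
lottery    <- rbinom(n, 1, 0.5)
scores_exp <- 60 + 2 * lottery + 5 * wealth + rnorm(n, sd = 5)
coef(lm(scores_exp ~ lottery))       # close to 2 even without controlling for wealth
```

The point of the lottery lines is the design choice the quote below emphasizes: randomization removes confounding by construction, whereas observational data force us to control for confounders statistically.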
It’s worth quoting in full from Llaudet and Imai:
Figuring out whether you aim to measure, predict, and/or explain a quantity of interest should always precede the analysis and often also precede the data collection. As you will learn, the goals of your research will determine (i) what data you need to collect and how, (ii) the statistical methods you use, and (iii) what you pay attention to in the analysis.
To measure a quantity of interest such as a population characteristic, we often use survey data, that is, information collected on a sample of individuals from the target population. To analyze the data, we may compute various descriptive statistics, such as mean and median, and create visualizations like histograms and scatter plots. The validity of our conclusions depends on whether the sample is representative of the target population. To measure the proportion of eligible voters in favor of a particular policy, for example, our conclusions will be valid if the sample of voters surveyed is representative of all eligible voters.
To predict a quantity of interest, we typically use a statistical model such as a linear regression model to summarize the relationship between the predictors and the outcome variable of interest. The stronger the association between the predictors and the outcome variable, the better the predictive model will usually be. To predict the likely winner of an upcoming election, for example, if economic conditions are strongly associated with the electoral outcomes of candidates from the incumbent party, we may be able to use the current unemployment rate as our predictor.
To explain a quantity of interest such as the causal effect of a treatment on an outcome, we need to find or create a situation in which the group of individuals who received the treatment is comparable, in the aggregate, to the group of individuals who did not. In other words, we need to eliminate or control for all confounding variables, which are variables that affect both (i) the likelihood of receiving the treatment and (ii) the outcome variable. For example, when estimating the causal effect of attending a private school on student test scores, family wealth is a potential confounding variable. Students from wealthier families are more likely to attend a private school and also more likely to receive after-school tutoring, which might have a positive impact on their test scores. To produce valid estimates of causal effects, we may conduct a randomized experiment, which eliminates all confounding variables by assigning the treatment at random. In the current example, we would achieve this by using a lottery to determine which students attend private schools and which do not. Alternatively, if we cannot conduct a randomized experiment and need to rely on observational data instead, we would need to use statistical methods to control for all confounding variables such as family wealth. Otherwise, we would not know what portion of the difference in average test scores between private and public school students was the result of the type of school attended and what portion was the result of family background.
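To connect the first two goals in the quote to R, here is a short sketch with entirely hypothetical numbers (the survey responses, the proportion, and the unemployment and vote-share figures are made up for illustration, not taken from Llaudet and Imai): measuring a population proportion from a sample, and predicting an election outcome from the unemployment rate.

```r
set.seed(1)

# Measuring: estimate the proportion of eligible voters in favor of a policy
# from a (hypothetical) representative sample coded 1 = in favor, 0 = not.
survey <- rbinom(800, 1, 0.56)  # simulated responses
mean(survey)                    # sample proportion as an estimate of the population proportion

# Predicting: summarize the association between unemployment and the
# incumbent party's vote share in past elections, then forecast a new one.
past_elections <- data.frame(
  unemployment   = c(4.2, 5.1, 7.8, 6.0, 9.3, 5.5),  # hypothetical past rates (%)
  incumbent_vote = c(53, 51, 45, 49, 42, 50)          # hypothetical vote shares (%)
)
fit <- lm(incumbent_vote ~ unemployment, data = past_elections)
summary(fit)

# Forecast for an upcoming election given the current unemployment rate:
predict(fit, newdata = data.frame(unemployment = 6.5))
```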
For Gelman, Hill and Vehtari:
Here is a slightly different take, by Gelman, Hill, and Vehtari, on what statistical inference is: