Understanding and communicating data

Reading (to be completed before class)

Asking the right questions

There has been an explosion of data being made available online. Some of it is information that has always been collected, but has only recently been digitized. Lots of government-related data falls into this category (see, e.g., data.gov). Much of it, though, is "natively" digital. Information on websites, the terms people enter into search engines, and articles posted to social networks all fall into this category. Many operators of web services make information available to developers through APIs.

Given the wide variety of data available, it is easy to become overwhelmed and not know where to begin. My advice is to focus on an area where you have interest or expertise. Then you have two options. The first is to study the available sources of data in that area, and then ask questions that the data can answer. The second is to pose the question first, and then look for sources of data that can answer it. In this second case, you may well have to construct parts of the dataset yourself.

We will continue to discuss question selection when we get to the lecture on strategies for data collection.

Properties of data

Data often comes in tabular format. Each row in the table can be thought of as a grouped tuple of values corresponding to each column in the table. These rows are called records. The values that comprise the record are called fields. Consequently, a table is simply a collection of records.

We refer to the columns of the data table as variables. Most interesting data analysis requires tables with several columns whose values combine in ways that tell a compelling "story", such as two variables having a roughly linear relationship.

There are two main classes of variables: categorical and numerical. Numerical variables are just that: numbers. Categorical variables describe a characteristic of the data point in question. They take values from a finite set of categories, and those categories are usually unordered. (Ordered categorical variables are also called ordinal variables.) The most interesting data sets usually include both categorical and numerical variables.

Let's take a look at the data set explored in the case study, along with a few records.

                           term    pop  epc                     cat    avgt10 hasmalu
       mardi gras 2011 st louis   5400 0.05   News & Current Events 0.0447370   FALSE
    nuestra belleza latina 2011   2400 0.05           Entertainment 0.0175000   FALSE
             hmong satellite tv   1900 0.05      Telecommunications 0.0222220   FALSE
                  julius caesar 550000 0.37       Arts & Humanities 0.0500000   FALSE
justin tennison deadliest catch   1300 0.05              Recreation 0.0000000    TRUE
               craigslist phila 246000 0.54                   Local 0.0028935    TRUE
                    cyanogenmod 301000 0.05 Computers & Electronics 0.0000000    TRUE

Each variable has the following meaning:

variable  meaning
term      trending search term
pop       monthly searches for that term
epc       estimated pay-per-click ad price for the term
cat       term category
avgt10    fraction of top 10 results for the term that point to ad-filled sites
hasmalu   true if term had undetected malware in its results while trending

We can see that pop, epc and avgt10 are numerical, while cat and hasmalu are categorical. In fact, hasmalu is a boolean variable, which is a special case of a categorical variable: it has exactly two categories.
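The classification above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the case study: the `records` list re-enters three sample rows from the table, and `classify` and `classify_table` are hypothetical helper names that guess a variable's class from the Python type of its values. The `term` column is skipped because it identifies the record rather than describing it.

```python
# Three sample records from the table above, one dict per record.
records = [
    {"term": "julius caesar", "pop": 550000, "epc": 0.37,
     "cat": "Arts & Humanities", "avgt10": 0.05, "hasmalu": False},
    {"term": "craigslist phila", "pop": 246000, "epc": 0.54,
     "cat": "Local", "avgt10": 0.0028935, "hasmalu": True},
    {"term": "cyanogenmod", "pop": 301000, "epc": 0.05,
     "cat": "Computers & Electronics", "avgt10": 0.0, "hasmalu": True},
]


def classify(values):
    """Guess a variable's class from its values (hypothetical helper)."""
    # bool is a subclass of int in Python, so test for it first.
    if all(isinstance(v, bool) for v in values):
        return "categorical (boolean)"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


def classify_table(rows, skip=("term",)):
    """Map each variable (column) name to its guessed class."""
    return {
        name: classify([row[name] for row in rows])
        for name in rows[0]
        if name not in skip
    }


variable_classes = classify_table(records)
for name, kind in variable_classes.items():
    print(f"{name:8} {kind}")
```

A type-based guess is only a starting point. A numeric code such as a postal code would be labeled numerical by this sketch even though it behaves like a category, so the final call belongs to the analyst.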

Categorical variables are especially useful when they are consistent across data sets. For example, country is often a categorical variable in data sets. Consistency across categorical variables makes combining data sets much easier. For example, if you look at the data made available in Gapminder, you can see that they have successfully linked economic development data sets with public health data sets through countries.
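To see why a shared categorical variable makes combining data sets easy, here is a minimal sketch of a join on country. Everything in it is an assumption made for illustration: the two tables, the placeholder country names and all of the values are invented, and `join_on` is a hypothetical helper, not anything Gapminder provides.

```python
# Two tiny, made-up tables keyed by the same categorical variable, "country".
# The country names and all values are placeholders, not real statistics.
income = [
    {"country": "Country A", "income_per_person": 1000},
    {"country": "Country B", "income_per_person": 2000},
    {"country": "Country C", "income_per_person": 3000},
]
life_expectancy = [
    {"country": "Country A", "life_expectancy": 60},
    {"country": "Country B", "life_expectancy": 70},
    {"country": "Country D", "life_expectancy": 80},
]


def join_on(key, left, right):
    """Inner join: keep only records whose key value appears in both tables."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the fields of both records into one combined record.
            joined.append({**row, **match})
    return joined


combined = join_on("country", income, life_expectancy)
for row in combined:
    print(row)
```

Only the countries present in both tables survive the join. If one table wrote "Country A" and the other "country a", that record would be dropped without any warning, which is why consistent category labels matter.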

There are additional attributes of some data that we must consider. The first is whether the data comes as a time series. Note that the Gapminder data is time-based, and much of the power of the visualization comes from seeing the data points move over time. But that's not the only way the Gapminder data is special. It also includes variables that are naturally correlated. These correlations may arise by chance, but often we expect that one variable depends on another.

This leads us to an orthogonal classification of variables: response and explanatory variables. Response variables are the primary variables of interest in a study. Explanatory variables are used to help explain the response variables. For example, in the trending case study, the response variables are the prevalence of ad-filled (MFA) sites (avgt10) and the existence of malware for terms (hasmalu), since measuring these was the primary objective of the data collection effort. We also collected additional information about terms, hypothesizing that these additional values might affect the values of the response variables. The additional information included pop, epc and cat, and so they are explanatory variables.

Note that response and explanatory variables can be numerical, categorical or ordinal. For example, pop and epc are numerical explanatory variables, while cat is categorical; similarly, avgt10 is a numerical response variable, while hasmalu is a categorical response variable. As you might have inferred by now, outside statistics, response variables are sometimes called dependent variables, while explanatory variables are called independent variables.
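The two classifications can be written down side by side. The sketch below is an illustration only: `SCHEMA` and `split_by_role` are hypothetical names, and the kind and role recorded for each variable simply restate the text above.

```python
# Kind and role of each variable, as described in the text above.
SCHEMA = {
    "pop":     {"kind": "numerical",   "role": "explanatory"},
    "epc":     {"kind": "numerical",   "role": "explanatory"},
    "cat":     {"kind": "categorical", "role": "explanatory"},
    "avgt10":  {"kind": "numerical",   "role": "response"},
    "hasmalu": {"kind": "categorical", "role": "response"},
}


def split_by_role(record, schema=SCHEMA):
    """Split one record into its explanatory and response fields."""
    parts = {"explanatory": {}, "response": {}}
    for name, info in schema.items():
        parts[info["role"]][name] = record[name]
    return parts


# One sample record from the table earlier in this section.
record = {"term": "julius caesar", "pop": 550000, "epc": 0.37,
          "cat": "Arts & Humanities", "avgt10": 0.05, "hasmalu": False}

parts = split_by_role(record)
print(parts["explanatory"])
print(parts["response"])
```

Keeping kind and role as separate entries makes the point of the paragraph concrete: knowing that a variable is numerical says nothing about whether it is a response or an explanatory variable.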

For more information on variable types in data, see this resource.

In-class exercise: play with Gapminder world

Questions to note:

  1. Are there other categorical values in use?
  2. What else is going on with the data?
  3. What are some explanatory and response variables?

Writing effectively

As noted in the syllabus, improving the communication of your findings is a key objective of this course. One aspect of communication is writing. While this may surprise you, writing is so essential to the data analysis process that no successful analysis can be conducted without it. Gopen and Swan put it best in the conclusion of their piece on "The Science of Scientific Writing":

The substance of science comprises more than the discovery and recording of data; it extends crucially to include the act of interpretation. It may seem obvious that a scientific document is incomplete without the interpretation of the writer; it may not be so obvious that the document cannot "exist" without the interpretation of each reader. In other words, writers cannot "merely" record data, even if they try. In any recording or articulation, no matter how haphazard or confused, each word resides in one or more distinct structural locations. The resulting structure, even more than the meanings of individual words, significantly influences the reader during the act of interpretation. The question then becomes whether the structure created by the writer (intentionally or not) helps or hinders the reader in the process of interpreting the scientific writing.

Gopen and Swan give lots of practical guidance on how to structure writing whose goal is to explain. Some key conclusions I drew from their article:

- Ordering should follow the logic of a reader.
- Each unit of discourse should make a single point.
- Readers naturally emphasize what is placed at the end of sentences (aka the "stress position").

I found it particularly helpful to distinguish between the "topic position" and "stress position" of sentences. Here is my summary of their advice in a table:

           Topic position                  Stress position
Reader     Perspective, context, linkage   Closure, fulfillment
Placement  Beginning                       End
Content    Old info, linking backward      New information to emphasize

By taking the reader's perspective, it becomes clear that the ordering of your argument is crucial. In essence, all this talk about what to put where is a structured way to construct a logical argument.

Meanwhile, Strunk and White's brief guide to writing style, first written by Strunk in 1918 and later revised by White, has stood the test of time. It's more general than Gopen and Swan, aimed at all writing, not just scientific writing. Nonetheless, many of its "guiding principles of composition" can usefully be applied to the writing you will do in this course and beyond. Here are the principles listed out:

  1. Make the paragraph the unit of composition: one paragraph to each topic
  2. As a rule, begin each paragraph with a topic sentence; end it in conformity with the beginning
  3. Use the active voice
  4. Put statements in positive form
  5. Use definite, specific, concrete language
  6. Omit needless words
  7. Avoid a succession of loose sentences
  8. Express co-ordinate ideas in similar form
  9. Keep related words together
  10. In summaries, keep to one tense
  11. Place the emphatic words of a sentence at the end

Consciously following these guidelines can substantially improve the clarity of your writing. I find it interesting that principle 11 matches Gopen and Swan's recommendation to put new content in the stress position, particularly since Strunk and White reached this conclusion without the benefit of results from cognitive psychology.

Finally, lest you think this advice only applies to writing, the information visualization pioneer Edward Tufte has argued that Strunk and White's Elements of Style "is one of the best treatises on graphing data" (Cleveland 1994). When we read a selection from Cleveland later in the course, he too will present principles for graphing data that are analogous to the principles from Strunk and White.

Example: communicating from the case study

In-class exercise: communicating data from a data sample

We will split up into groups and examine 4 different data analyses. Here are the topics:

For each article, each group has three tasks:

  1. Briefly describe in your own words the goal of the study.
  2. Describe the data that has been collected and explain how this will help achieve the goal of the study. You should begin by writing down what you expect the variables are, then classify each as categorical or numerical, and finally, if appropriate, label the response and explanatory variables.
  3. Pick a graph that interests you, and write an explanation of what the graph says, relative to the data that has been collected.