Seeing Red: A Statistics Debate

The journal Psychological Science has published some odd research lately. There’s a fun back-and-forth unfolding between statistician Andrew Gelman and psychologist Jessica Tracy and her student Alec Beall, who published one of those pieces. Both sides make good points, and Gelman does overstep a bit, but ultimately Gelman makes the better case.

[See the research (pdf), Gelman's original critique, a response by Beall & Tracy, and Gelman's rejoinder.]

Beall & Tracy wrote a research article on whether women at peak fertility dress to maximize their attractiveness. Given that prior research asserts that red might indicate fertility, the authors conducted an experiment: they asked 100 women on Mechanical Turk and 24 undergraduates about their latest menses, and observed the color of the women’s shirts. They found a statistically significant relationship (p=0.02 and p=0.051 for the two samples) between women wearing red and women being at peak fertility.

Gelman posted a critique for Slate, where he asserts that “this paper provides essentially no evidence about the researchers’ hypotheses.” He points out that the researchers’ samples are not representative, that their measure of “peak fertility” may be flawed, and that participants may not accurately recall their last menstrual period.

Gelman’s larger critique, however, indicts the general practice of data analysis in many social sciences. He cites the “researcher degrees of freedom” in the Beall & Tracy piece, meaning that researchers have many choices in their research designs that may affect their outcomes. For example, Beall & Tracy chose to combine red and pink into a single category, to collect 100 Turkers and 24 undergraduate participants, to compare shirt color only, to compare red/pink to all other colors, to use days 6-14 as “peak fertility”¹, and so forth.

At their most benign, these well-intentioned choices could produce false positive results. More nefariously, researchers could massage the research design or methods, or even rewrite their predetermined hypotheses, to get a publishable result.
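To get a rough feel for how open-ended choices can inflate false positives, here is a minimal sketch. It assumes, purely for illustration, that each analysis path behaves like an independent test of a true null hypothesis at the 0.05 level. Real analysis choices are correlated, and the path counts below are hypothetical, so this is an idealized calculation and not a claim about what Beall & Tracy did.

```python
def chance_of_any_false_positive(n_analyses, alpha=0.05):
    """Chance that at least one of n independent tests of a true null gives p < alpha."""
    return 1 - (1 - alpha) ** n_analyses

# Hypothetical numbers of analysis paths a researcher might have open.
for n_paths in (1, 5, 10, 20):
    share = chance_of_any_false_positive(n_paths)
    print(f"{n_paths:2d} paths -> {share:.0%} chance of at least one spurious 'finding'")
```

Even under these simplified assumptions, a handful of reasonable-looking choices moves the chance of a spurious result well above the nominal five percent.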

Gelman isn’t wrong to point this out, though I find some parts of his criticism unsettling. As the authors write in their rebuttal, Gelman’s critique was published without consulting the authors. If Gelman aspired to clarify the research, he should have reached out to the authors with his comments before impugning their work in Slate. This goes beyond academic courtesy: the authors may have been able to answer some questions which would make the entire discussion more informative.

Some of Gelman’s critiques are also pedantic. Debates about measurement and sample representativeness, while important, are first-year graduate seminar fodder. His “researcher degrees of freedom” point lands a harder blow, though it seems to accuse the authors of subconsciously mining for statistical significance without any real evidence in this particular case.

So why is Gelman right?

Gelman points to a real concern in social scientific research, even if Beall & Tracy take the brunt of his criticism. Current practice heavily incentivizes scholars to find statistically significant results, but the requirements for publication and replication do not sufficiently disincentivize “mining” for significance stars.

Psychological Science has published several weird pieces lately, including work relating physical strength to political attitudes that I criticized on this blog (Parts I, II). Gelman, too, has pointed to several of these curious studies. In each case, the claims far surpass the data mustered to support them.

With surprising claims, like the one made by Beall & Tracy, should come additional scrutiny. Think of this as a Bayesian updating issue. Radical claims are radical because I see little reason to believe them a priori, so my prior belief is pretty strong that the effect doesn’t exist. Compared to intuitive claims (do liberals tend to vote Democratic?), radical hypotheses should require much stronger evidence to overcome my prior belief.
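A small sketch of that updating logic, using Bayes’ rule. The priors and the two likelihoods below are made-up numbers chosen only to illustrate the point; nothing here is estimated from the Beall & Tracy data.

```python
def posterior(prior, p_data_if_true, p_data_if_false):
    """Bayes' rule: probability the effect is real after seeing the data."""
    numerator = prior * p_data_if_true
    return numerator / (numerator + (1 - prior) * p_data_if_false)

# Hypothetical evidence: the observed result is 16 times more likely
# if the effect exists (0.80) than if it does not (0.05).
evidence = dict(p_data_if_true=0.80, p_data_if_false=0.05)

intuitive_claim = posterior(prior=0.50, **evidence)  # a claim I already find plausible
radical_claim = posterior(prior=0.01, **evidence)    # a claim I strongly doubt a priori

print(f"intuitive claim: {intuitive_claim:.2f}")
print(f"radical claim:   {radical_claim:.2f}")
```

With identical evidence, the plausible claim ends up close to certain while the radical claim remains unlikely, which is why the radical claim needs much stronger data.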

This may seem too subjective, or too unfair to researchers like Beall & Tracy, but that’s science. A top journal should demand much stronger evidence than two small samples to support such a radical claim.

We should, as a field, also move toward preregistered research designs. Beall & Tracy may not have tweaked the designs or hypotheses to match their data, but this type of practice happens, and will continue unless scholars are required to lock in their designs before collecting any data. Shifting publication incentives toward accepting interesting questions and designs, instead of interesting results, would help too.

Gelman targeted Beall & Tracy when his true criticism points at social science publication writ large. That doesn’t mean, though, that he’s not correct. The Beall & Tracy work makes a big claim which it supports with small samples and few robustness checks.

It’s the research equivalent of a sexy stranger who turns out to have little personality.


Did Arrested Development Hurt Netflix?

In the interest of full disclosure, I’ll admit: I’m a big fan of Arrested Development.

And I’ll even take the apparently unpopular stance that the show’s recent fourth season, produced and released by Netflix, is pretty spot-on.

I was surprised, then, to hear that the fourth season release hurt Netflix stock values. According to many, many stories, Netflix took a six percent hit because critics gave the revamped sitcom tepid reviews.

True enough, on the first day of trading after the season’s release, Netflix closed down at $214.19 from $228.74. How unusual is a drop that large, given the history of Netflix stock?

Playing with Netflix stock data, it appears to be fairly rare. Of the nearly 2,800 days of trading data available (since May 29, 2002), only 246 days (around 9 percent) saw absolute change of six percent or greater.

Histogram of daily closing price changes for Netflix since May 29, 2002.

Interestingly, the opposite thing happened the day after Netflix released another straight-to-internet show, House of Cards. After that serial went live, Netflix stocks closed up, to $174.74 from $164.80 (+ 6 percent).

This despite House of Cards garnering only marginally better critical review (76 percent on Metacritic) than Arrested Development (71 percent on Metacritic). That seems like a large difference in reaction based on only a small gap in critics’ opinions.

Netflix closing price from January 4 through June 13, 2013. Vertical lines show daily changes of 6% or more. HC1 is House of Cards Season 1 release and AD4 is Arrested Development Season 4 release.

So did Arrested Development cause Netflix to have its 85th worst day since going public? Could be.

On one hand, it’s rare that Netflix stock jumps or falls by six or more percent. It’s even less likely that it would change at random by that much the day after its two most prominent projects go live. In fact, if we were to randomly pick days for such large shifts to occur, there’s less than a one percent chance that any two specific days would both see large changes.
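The arithmetic behind that figure can be sketched in a few lines. It uses the counts quoted above (246 large-move days out of "nearly 2,800", so the denominator is approximate) and assumes that large moves on the two days are independent, which is what picking days at random implies.

```python
large_move_days = 246   # days with an absolute change of six percent or more
trading_days = 2800     # approximate; the post says "nearly 2,800"

p_large_move = large_move_days / trading_days
# Independence assumption: a big move on one release day tells us
# nothing about a big move on the other.
p_both_days = p_large_move ** 2

print(f"P(large move on a given day) = {p_large_move:.3f}")
print(f"P(large move on both days)   = {p_both_days:.4f}")
```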

Yet the reactions to Arrested Development and House of Cards seem very different despite the shows’ similar critical reviews.

If traders reacted to critics, it’s possible that they responded to expectations rather than to the reviews themselves. House of Cards pleasantly surprised viewers, while Arrested Development’s decent 72-point score still falls far behind the 88-point average for seasons one through three.

Either way, Netflix should still breathe easy. There’s always money in the banana stand.

Flexing those Conservative Muscles, Pt. II

In Part I, we discussed popular media coverage of a forthcoming Psychological Science article. The work by Michael Bang Petersen and his colleagues claims to show an evolutionary link between physical strength and the intensity of political beliefs. In Part II, we’ll push that causal claim a bit further.

In the past few days, furor over the NSA surveillance program has sparked new debates between liberals and conservatives. Interestingly, many of the defense-versus-liberty arguments pit conservatives against one another: defense hawks who want liberal power to root out terrorists, and libertarian-minded conservatives who abhor anything smelling like a police state.

This calls to mind the more traditional “liberals are like mothers, conservatives are like fathers” trope. Conservatives are tough and strong while liberals are warm and nurturing. The recent piece by Petersen and his colleagues plays, in part, into that stereotype. But how reliable is the research?

Revisiting Causation

Petersen and his colleagues find a correlation between bicep size (their measure of fighting aptitude) and the congruence between individual economic conditions and political conservatism. Why might the relationship exist? The authors argue that this clearly points to evolutionary and thus genetic bases of political attitudes. That’s not impossible, but it’s hardly the only reasonable conclusion to draw.^[1]

Before we dig too deeply into the research, it might serve to review the basics of causal inference. Causation, as the name implies, seeks to establish that some phenomenon causes another. On its surface, this is simple to do. I flip the light switch, the light comes on. Boom: cause produces effect. Sure, there are mechanisms that allow the switch to work, but for all intents and purposes I caused the lights to come on.

Unfortunately, causal inference is rarely so simple. The cleanest way to observe causation is through randomized experiments, where some subjects receive a treatment (some rooms get the light switch flipped) and others don’t (some rooms’ switches are left untouched). If the treatment group experiences different outcomes than the control group, we may have observed a robust causal process.

With observational data (i.e., data collected by observation and not by experimentation), testing causal theories gets stickier. We can observe correlations, but we often cannot determine if x caused y, if y caused x, or if both x and y were caused by some other unobserved phenomenon z.

In political terms, we might ask: Does my evaluation of the economy determine my partisanship? Does my partisanship determine my evaluation of the economy? Does my political ideology cause both? Or might they all three interact in a more complicated way? With observational data, it’s hard to tell.
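To make the third possibility concrete, here is a minimal simulation in which an unobserved z drives both x and y, and x has no effect on y at all. Everything in it (the variable names, the sample size, the unit effect sizes) is an assumption made for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def residuals(target, predictor):
    """What is left of target after removing its linear fit on predictor."""
    mt, mp = mean(target), mean(predictor)
    slope = (sum((p - mp) * (t - mt) for p, t in zip(predictor, target))
             / sum((p - mp) ** 2 for p in predictor))
    return [(t - mt) - slope * (p - mp) for p, t in zip(predictor, target)]

rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]   # the unobserved common cause
x = [zi + rng.gauss(0, 1) for zi in z]    # x depends only on z plus noise
y = [zi + rng.gauss(0, 1) for zi in z]    # y depends only on z plus noise

r_raw = correlation(x, y)
r_adjusted = correlation(residuals(x, z), residuals(y, z))
print(f"correlation of x and y:          {r_raw:.2f}")
print(f"after accounting for z:          {r_adjusted:.2f}")
```

The raw correlation is substantial even though neither variable causes the other, and it disappears once z is accounted for. With observational data the trouble is that z is often something nobody measured.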

We don’t just abandon causal inference here, though. We can still make causal claims, but we must (a) provide thorough theoretical explanations for how our posited cause produces a certain effect; and (b) exhaustively examine alternate explanations that might falsify our theory.^[2] Petersen and his colleagues tentatively satisfy the first condition, but hardly attempt the latter.

Examining the Research

Turning back to the research, we can note first that the effects proffered by the authors are quite small. This illuminates the difference between statistical and substantive significance, which is frequently overlooked. The effect may exist, but it’s so small as to be swamped by other influences in the real world.

The more important challenge, however, critiques the authors’ causal claims. The authors posit that survival likelihood in men should cause more self-interested behavior, but they rather lazily support their case. In fact, the authors don’t even get close to providing an exhaustive test of their causal story.

The authors instead show a correlation between the two phenomena. Yet other reasonable explanations abound. Bicep size, after all, is not purely innate, but depends in large part on resistance training and even body fat percentage. And the predispositions toward having large arms may reflect little about evolution or genetics.

Certain people are more likely to believe strongly in physical aptitude and healthfulness, and perhaps these people are also more likely to hold strong political views. Education, for instance, tends to predict both political sophistication and physical exercise, health and lower obesity. Political science research also shows that knowledge and interest in politics play an important role in our ability to match economic policies and political elites to our personal economic well-being.

Or maybe the story does involve assertiveness, which makes individuals more likely to care about politics, behave more self-interestedly and/or hit the gym more regularly. The mechanism here could be nature, but it could also be nurture. Guys who are raised to assert their opinions on any manner of subjects, politics included, may also be raised to be physically strong.

Or perhaps self-interest causes physical fitness. It could be that people who are more self-interested also tend to care about physical strength. From a survival perspective, that would make sense. In fact, none of these alternative explanations is wholly unreasonable, and any of them could produce correlations similar to those reported in the study.

Of Science and Old Sayings

The age-old bromide “correlation does not imply causation” should be ringing in our ears about now. The authors have a mildly interesting finding, but at root all they’ve presented is a correlation which hardly supports their causal story.

Image courtesy of http://xkcd.com

Unless future research shows that other fitness characteristics condition political attitudes, the theory remains weak.

Even in that case, researchers need to rule out competing theories. Strength might cause self-interest; but self-interest could cause strength, or both could be the effects of some other evolutionary or behavioral cause.

What To Do?

How might researchers better examine the connection between strength and political attitudes? Let’s consider a few things to try.

1. Measure strength, not (just) biceps: The authors argue that biceps are the best single predictor of fighting ability. But why use just one? Surely there are reasonable ways to measure strength, like with compression springs, that would give us a better idea of how strong participants are. This eliminates concerns that bicep size might be a poor indicator of strength.

2. Measure fighting ability, not just strength: Bicep size (or strength, otherwise defined) only matters to the degree that it taps into fighting ability. Other characteristics help us fight, though. Height and longer arms are useful, as are stronger legs and faster reflexes. The authors argue that biceps are used by others to assess a person’s fighting ability, but that should not matter for this study. The theory suggests that good fighters, not people who just look like good fighters, should take what they want from the political system.

3. Explore innate characteristics: Behavior, personality and interpersonal influence (parents, peers, et cetera) affect our tendency to lift heavy things. That makes it nearly impossible to say that strength per se influences our political views. More innate qualities, not as subject to behavioral manipulation, could be helpful.

4. Consider other dimensions of assertiveness: The authors argue that strong men will be more politically assertive; but why stop there? Asking, or experimenting, with other facets of assertiveness may shed light on an interesting question, namely whether strength predicts certain personality types, politics included.

In sum, it’s not surprising that conservative outlets latched onto the research without reading it thoroughly. Conservatives on my Facebook feed were certainly thrilled to learn that they were, in fact, ruggedly strong and athletic, standing in stark contrast to their pantywaist liberal counterparts. That’s motivated reasoning at its best, but it’s also wrong.

And frankly, I’m not even too surprised that a peer-reviewed article would oversell its findings. The research garnered a lot of attention and, if true, could constitute a generally interesting finding.

But science isn’t about pithy titles or provocative theories. Causal claims must be made carefully, especially when working with observational data without randomization and controlled treatments. Time will expose the authors’ theory to better tests, and I suspect that when that happens, the theory will find its way to the scientific dustbin.

Notes:

1. I’m fairly agnostic to using biological research in political science. Like all research subfields, it produces some good and some mediocre products. Some of the bio-politics research is actually interesting, some reflects a fetish for new data with little theoretical development, and some just mines for significance stars (see § 3).

2. This is actually a more optimistic view of causal inference in the social sciences than I tend to embrace. For an interesting read on causation writ large, I recommend Causality by Judea Pearl (see also here). For a grounded, if slightly depressing, view of statistical models and causal inference, David Freedman’s posthumous collection is a must read.

Flexing those Conservative Muscles, Pt. I

Several media outlets recently published an astonishing finding: strong men are more conservative. Based on an article in Psychological Science, the popular accounts stray far from the research and probably even further from the truth. In this two-part post, we’ll discuss what the researchers actually said (Pt. I) and what their research might actually mean (Pt. II).

Strong Men are Conservative!

Last week, my Facebook feed brought me some intriguing news: researchers apparently discovered that physically stronger men tend to hold conservative economic views.

A couple quick clicks and a Google search showed ample coverage, much in conservative media, of the peer-reviewed article. Michael Bang Petersen and his coauthors did, indeed, find a relationship between physical strength and political attitudes.

The study’s authors examine the connection between strength and politics by interviewing men and women in three countries (Argentina, Denmark and the United States). They measure the flexed, dominant-arm bicep of participants, and ask them a battery of political questions. The relationship is “statistically significant” at conventional levels.

So, are beefy men more conservative? Nope. Popular accounts of the research misinterpret the findings, which are themselves oversold by the authors. At no point do the data provide exhaustive support for the theory that physical strength causes any part of one’s political views, much less that it makes men more conservative.

Leveling the mountain to uncover the molehill

The study’s authors nest their theory in the evolutionary advantage strong men enjoy in physical conflicts. Members of the conservative press seemed to love this story, and thus extracted the evolutionary thread without reading beyond the article’s abstract.

The logic, according to the popular accounts, claims that since nature favors strong males, these same males will eschew social safety nets. They don’t need it, after all, and would rather not see the fruits of their strength redistributed to the ninnies.

Here’s the kicker: that’s not at all what the authors claimed to find. For the story above to be supported by data, we would expect bicep size to positively correlate with conservative views. Presumably it doesn’t, given that the authors don’t publish that finding.

The authors posit instead that strength interacts with other characteristics to affect political views. They point out that traditional rational decision making models don’t perform so well in the electorate, especially when using economic wellbeing to predict attitudes. That is, rich people tend to support redistribution less than poorer people, but the relationship between wealth and economic attitudes isn’t as clean as we may expect.

Here enters strength: stronger males may be more willing to engage in self-interested behavior than their weaker counterparts. The authors explain that strong males would, in nature, be more likely to claim resources because they are more able to defend them, and the same may be true of political resources. Strong, rich men may be more willing to fight against redistribution, while stronger, poor men may be more willing to claim resources through redistribution.

That’s a horse of a different color. According to the authors’ account, strength does not predict conservatism, but instead predicts the congruence of personal economic status and political views. The authors do not present a correlation between strength and attitudes, but an interaction term between bicep size and wealth that together predict conservatism.^[1]
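The difference between a main effect and an interaction is easy to see in simulated data. The sketch below invents a world that matches the authors’ account: strength pushes richer men toward opposing redistribution and poorer men toward supporting it. All coefficients, the two-level wealth variable, and the sample size are assumptions chosen for illustration and are not taken from the study.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple one-variable least-squares regression of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

rng = random.Random(1)
strength, wealth, opposition = [], [], []
for _ in range(8000):
    s = rng.gauss(0, 1)        # standardized strength
    w = rng.choice([-1, 1])    # -1 = poorer, +1 = richer (hypothetical coding)
    strength.append(s)
    wealth.append(w)
    # Opposition to redistribution: a wealth effect plus a strength-by-wealth
    # interaction, with no direct effect of strength on its own.
    opposition.append(0.5 * w + 0.4 * s * w + rng.gauss(0, 1))

overall = ols_slope(strength, opposition)
among_rich = ols_slope([s for s, w in zip(strength, wealth) if w == 1],
                       [o for o, w in zip(opposition, wealth) if w == 1])
among_poor = ols_slope([s for s, w in zip(strength, wealth) if w == -1],
                       [o for o, w in zip(opposition, wealth) if w == -1])

print(f"strength slope, everyone: {overall:+.2f}")
print(f"strength slope, richer:   {among_rich:+.2f}")
print(f"strength slope, poorer:   {among_poor:+.2f}")
```

In this invented world, strength is essentially unrelated to conservatism overall, yet it matters in opposite directions within each wealth group. That is the pattern the popular headlines flattened into "strong men are conservative."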

How did it go so wrong?

I have several quibbles with the original research, which we’ll explore in Part II, but we should be clear: the authors did not claim to have found that strength relates to conservatism. That’s wholly an invention of either lazy or ulteriorly motivated journalists.

On one level, this shouldn’t be surprising. Of my friends who posted stories about the research, all are quite conservative. The story provides a feel-good boost to people who identify with conservative politics, and it’s not odd that they would share the good news.

Second, the original theory matches our collective political notions. Cowboys are conservative, academics are liberals; soldiers are conservative, environmentalists are liberals. Given this natural frame, it’s easy enough to package the findings in a way that calls to mind these biases. Of course strong men oppose redistribution! They’re rugged, self-sufficient, independent, and thus the perfect candidates for conservative political views. The more nuanced story, of strength conditioning self-interest, is more difficult to tell and less intuitive, so it gets left by the roadside.

Third, media face powerful incentives to hyperbolize. “Strong men are more conservative” is flashier and more provocative than “strong men more likely to act self-interested in economic policy preferences.” A story guaranteed to tickle conservatives and enrage liberals will also attract more shares, tweets, and trackbacks in the blogosphere. If there’s no such thing as bad publicity, perhaps there’s also no such thing as bad web traffic.

Surprising or not, the coverage should still be depressing and alarming. In an era where we’re debating the role of social science research in society, it’s important to understand how the public accesses major findings and how these may improve the democratic process. News media play an important role here, but that responsibility is corrupted when journalists choose sexy over scientific.

In Part II, we’ll discuss the research by Petersen and his colleagues, and explore the relationship between strength, economics and political attitudes.

Notes:


For It While They Were Against It

Sometimes, people respond in strange ways to survey questions.

For a recent project with Jim Stimson and Elizabeth Coggins, I spent a fair amount of time analyzing data from the Cooperative Congressional Election Study (CCES). Here’s a fun nugget from my exploration: a sizable proportion (21 percent) of respondents both support and oppose Obamacare. Simultaneously.

We can speculate wildly about why a fifth of respondents — in a sample that is disproportionately educated and interested in politics! — would give such a puzzling answer.

But in a bigger sense, surveys — as useful as they are — offer highly artificial settings where respondents will give answers. Not attitudes, nor opinions, nor preferences per se — just answers. We should keep that in mind before reading too much into public opinion reports.

The Conflict

Part of the CCES comprises a set of “roll call” votes. These present respondents with a policy position and require a simple yea/nay answer. Two of these questions ask about the Affordable Care Act: one asks the respondent to vote for or against Obamacare, the second asks respondents to vote for or against repealing Obamacare.

There is a logical connection between these two questions. In general, someone who wants to repeal the law would probably not vote for it; and those who want to keep the law around should vote for it to begin with.

Generally that works… but as the ‘jitter’ plot shows above, it doesn’t work that way for everyone. Each dot on the figure represents a single respondent. (I like imagining that I’m assigning people to stand in a corner of the room depending on their answers to questions. Maybe I have a power complex…) There are clearly a good number of respondents in two quadrants: those who support Obamacare and want to keep it, and those who oppose Obamacare and want to repeal it. Makes sense.

But who are those respondents in the other two quadrants? Slightly more than 12 percent of them want to repeal Obamacare, despite saying that they would vote for the bill; and 9 percent would vote against the bill, but wouldn’t repeal it.

The latter group (the Vote Against / Don’t Repeal group) may be reasoning through the path dependency of Obamacare. Something like, “Well, I don’t like it, but it would endanger the health care system to repeal it now.” Or maybe they’re just ardent believers in the democratic process: elected officials passed the bill, and who would I be to usurp them? I doubt either of these stories, but it’s not impossible.

The other group — the Vote For / Repeal It! group — is weirder, though. There’s really no logical connection between the two answers.

Surveys are weird…

Well, they are! Despite having used public opinion data in research for several years now, I took my first “real” political survey over the winter holidays. Gallup called and wanted to talk to me about global warming, and that sounded like fun.

It wasn’t. First, you get pretty tired of answering questions after the first twenty. Second, even as a well-educated, highly-informed and engaged observer of the political world, the survey made me feel dumb. There’s this unusual pressure in a survey to answer questions promptly, which is fine but sometimes you don’t have an easy answer right at the top of your mind. Besides, these issues are complicated! Global warming? Economics? Coal, nuclear, wind, oil? Health care mandates?

Stressed yet? Even informed and engaged respondents get a bit overwhelmed by the survey items, and by the need to provide clean answers to complicated questions. And sometimes the questions aren’t entirely clear. Are we asking if you would have voted for Obamacare back in 2010? Or would you vote for it today? Do some respondents miss the “repeal” part of the question? These are all possible points of confusion, introduced in a highly artificial environment, and they are impossible to test for without a specific instrument.

Here’s the uncomfortable truth about polls: we use them because they’re what we have. On many questions, they’re good for giving the general feeling in the public. “Will you vote for Mitt Romney, the Republican, or Barack Obama, the Democrat?” isn’t terribly difficult, and most respondents can give a decent answer.

But as the questions become more complicated, responses become less reliable. Accessing “true” attitudes on policy questions with a survey can sometimes be like removing a splinter from your finger with an axe: In a sense it works, but it’s awfully messy.

And it gets messier when we try drawing relationships between multiple items, all of which have some weird characteristics, like non-attitudes, weak attitudes, and nonresponse. Aggregating to reduce the high dimensionality of multiple responses can help filter out some of the noise, but that’s a topic for another post.

Pundits and commentators roll out polls daily to elicit support for some position or another. Being an informed consumer of surveys means going beyond “What’s the Margin of Error?” (We are the Margin of Error, duh!)

It means realizing that a fair number of responses might carry little objective meaning. When pressed I’ll answer, but I honestly don’t know, don’t care or haven’t quite figured out my views yet. Treating these responses as some true-to-life measure of how the American people feel, or how they’ll act, can go pretty far afield.

Note: The CCES sample above is limited to the UNC module of 1,000 respondents. Expanding this to the full CCES sample of 55k+ doesn’t change anything, but it does make the figure a bit messier.

What’s the (Expected) Value of Your Life?

Last week, a friend asked for my opinion on an economics problem where students were asked to estimate the statistical value of a human life.

This will make more sense later…

The procedure blew my mind, and not in a good way. Not because I’m not a fan of quantifying the value of life — it’s weird, but I’d rather governments use good estimates over bad ones — but because the statistics are being used so poorly.

And it wasn’t my friend’s fault. He was following the example of several scholars who used the same framework to answer this question. And he got the problem set question correct, despite it being horrifyingly misguided.

Regression to the Rescue!

Here’s the published article by James Hammitt with the economic theory, here’s an overview of the field by Viscusi & Aldy (2003), and here are some applications. Basically, we’re trying to pin down how much you would spend to reduce your risk of death by some fixed amount. So, for example, how much is it worth to you if I could reduce your chance of death this year by, say, 0.1 percent?

Unfortunately it’s expensive and theoretically questionable to ask people that question. Even if we could afford the survey, do we think people could respond to the hypothetical accurately? I couldn’t.

So we flip the task toward figuring out how much we demand to be paid to undertake dangerous jobs. And since we’re in econometrics land, let’s start with linear regression. With some controls (age, blue collar, race, et cetera), let’s find the relationship between the dangerousness of a job (risk of death) and compensation (weekly wages). Economists call this a hedonic wage model.

And then? Let’s predict wages where risk of death is certain, i.e., p = 1.0. Or, as Moore & Viscusi (1986) put it, “We extrapolate the willingness to pay of the individual worker for a small risk reduction linearly to calculate the collective willingness to pay for a statistical life.”
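In arithmetic terms, the linear extrapolation amounts to dividing a wage premium by the extra risk that earns it. The sketch below uses hypothetical numbers (a $500 annual premium for an extra 1-in-10,000 annual risk of death); neither figure comes from the papers cited above.

```python
# Hypothetical inputs, chosen only to show the mechanics.
extra_annual_pay = 500.0       # dollars a worker demands for the riskier job
extra_annual_risk = 1 / 10_000 # added probability of death per year

# Linear extrapolation: scale the small-risk trade-off all the way up to
# certain death (risk = 1.0). Equivalently, 10,000 such workers collectively
# accept one expected death in exchange for 10,000 * $500.
value_of_statistical_life = extra_annual_pay / extra_annual_risk

print(f"Implied value of a statistical life: ${value_of_statistical_life:,.0f}")
```

The calculation is only as trustworthy as the assumption that the trade-off observed at tiny risks stays linear all the way to a risk of 1.0.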

Voilà! Now we know how much people value their lives.

See the problems yet?

When my friend first showed me this approach, I laughed a bit. His response went something like this:

Yea, I know, it’s not great. We only have a sample of 300. And we don’t have all of the variables we would want, so there may be some omitted variable bias, right? So it’s not perfect, but…

He’s right. Fitting a linear model without including relevant variables results in biased estimates. Proving this is simple but isn’t really where my objections lie.

My problem comes from the ridiculous extrapolation involved here. Let’s assume that we have the relevant variables in hand. Even under this unlikely condition, extrapolating so far beyond the data can make us super confident in an utterly stupid model and its predictions.

A simulated example

To show what I mean, we will use some simulated data. By using data that we create ourselves, we can observe the “truth” in a purely objective way, and thus test our intuition about lots of stuff. For the following demonstration, I’ll give the basic overview and give technical details after the post.

Let’s assume that we have some variable $x$ that represents “risk of death” in various occupations. Now let’s assume that nature defines some true function linking “risk of death” to wages, and the function is arbitrarily complicated:

$y_i = \alpha + \frac{(100x_i + 3^{x_i})}{1-x_i} + \epsilon_i$

There’s nothing special about the function, other than that I (acting as ‘nature’) defined it myself. It also has a nice asymptote at $x=$ 1, meaning that the limit from the left $\displaystyle \lim_{x \to 1 -} f(x) = \infty$ . This comes from the denominator (since dividing by 0 is undefined), and could match some intuition about nobody accepting any reasonable wage for a risk $x \approx$ 1. The error term $\epsilon_i$ we will assume to be normally distributed with expectation of zero.

With this function in hand, we can randomly generate some values for x that could reasonably be risks of death; in this case, for ease, $x \sim \mathrm{Unif}(0,1)$, meaning uniformly distributed on the interval 0 (risk=0%) to 1 (risk=100%).
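
Here is a minimal sketch of that data-generating process in Python with NumPy. The intercept `alpha`, the noise scale `sigma`, and the seed are assumptions chosen for illustration; the only things fixed by the setup above are the functional form, the uniform draw, and n=10,000.

```python
import numpy as np

def true_wage(x, alpha=0.0):
    """Nature's noiseless function: alpha + (100x + 3^x) / (1 - x)."""
    return alpha + (100 * x + 3.0 ** x) / (1 - x)

def simulate(n=10_000, alpha=0.0, sigma=0.01, seed=0):
    """Draw risks x ~ Unif(0, 1) and wages y = true_wage(x) + normal noise."""
    rng = np.random.default_rng(seed)   # assumed seed, for reproducibility only
    x = rng.uniform(0.0, 1.0, size=n)   # draws fall in [0, 1), so 1 - x is never 0
    y = true_wage(x, alpha) + rng.normal(0.0, sigma, size=n)
    return x, y

x, y = simulate()
print("range of x:", x.min(), x.max())
print("median and max of y:", np.median(y), y.max())
```

The gap between the median and the maximum of `y` is the asymptote at work: a handful of draws near $x = 1$ dominate everything else.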

If we fit the linear model to the full data (because this data is costless, let’s say n=10,000), we’d see pretty quickly that regressing y on x is a terrible idea. The estimated coefficient on risk of death is significant (with 10,000 observations it’s hard not to get ‘significant’ coefficients; see here and here); but the plot of predicted values $\hat{y}$ against y makes it pretty obvious that the model misses the mark:

Here’s the problem: we don’t have the full data. Nobody accepts jobs where the risk of death $x \approx 1$, or really anywhere near it. The highest occupational risks seem to have ~120 deaths per 100,000 workers. The data my friend was given for his homework, similarly, had the following risk density (and for amusement, see here for perspective):

Density of “risk of death” in the real data given with a problem set.

So, for verisimilitude, let’s refit the model using only a small subset of the data from the low-end values of x, say those for which 0<x<0.01. Now things get frighteningly fun…

Again, the coefficient estimate is significant, but this time there’s nothing in the residuals that gives us pause:

These residuals look pretty normal, don’t they?

In fact, there’s nothing to point out exactly how wrong our model is! For example, check out this image which shows the “true” line and the linear fit line. You can’t distinguish one from the other. In this restricted sample, the root mean squared difference between the linear fit and the true model is only 0.002, compared to more than 13,000 between the true and linear fit values in the full data.

How bad could this possibly be? Well, let’s plot the true function in black and the linear fit in grey:

Well, as it turns out, we can be really, really far off.

Oh, right. We could be really, really far off. Oops.
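
The whole exercise can be sketched in a few lines of Python. The noise scale (0.01), the zero intercept, and the seed are assumptions, so the printed numbers will not match the 0.002 and 13,000 quoted above; the pattern is what matters.

```python
import numpy as np

def true_wage(x):
    """Nature's noiseless function (intercept assumed to be 0)."""
    return (100 * x + 3.0 ** x) / (1 - x)

rng = np.random.default_rng(0)                          # assumed seed
x = rng.uniform(0.0, 1.0, size=10_000)
y = true_wage(x) + rng.normal(0.0, 0.01, size=x.size)   # assumed noise scale

# Keep only the "observable" low-risk jobs and fit y = intercept + slope * x.
low = (x > 0) & (x < 0.01)
slope, intercept = np.polyfit(x[low], y[low], deg=1)

def linear_fit(x_new):
    return intercept + slope * x_new

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_low = rmse(linear_fit(x[low]), true_wage(x[low]))
rmse_all = rmse(linear_fit(x), true_wage(x))
print(f"RMSE vs. truth, 0 < x < 0.01: {rmse_low:.4f}")
print(f"RMSE vs. truth, all x:        {rmse_all:.1f}")

# Extrapolate the restricted fit far outside the data it was fit on.
for risk in (0.5, 0.9, 0.99):
    print(f"x={risk}: linear={linear_fit(risk):.1f}, true={true_wage(risk):.1f}")
```

Inside the restricted sample the two curves are nearly indistinguishable; the gap only shows up at risks for which the fit never saw any data.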

That was a fun exercise, but so what?

Why does any of this matter? First, consider this fact: I provided us with one possible true function, just for comparison with our linear model. But the actual relationship between risk and wages, to the extent that we can say it exists at all, could take any form.

Second, and this is the scarier part, it’s entirely possible to fit a linear model that looks perfectly acceptable. Within the small subset of data, the linear model is a decent approximation of the truth, at least insofar as we might be interested in predicting wages from risk.

In extrapolating so far beyond the data, however, we assume that this decent approximation works for all values of x, even with no evidence to support the assumption. There’s no data out at the extremes to tell us how good or bad the assumption is, and thus not only is our model likely wrong, but there is effectively no bound on how wrong we might be.

This isn’t too different from me telling you that the relationship between, say, age and income, can be approximated by a linear model. We’ll include a squared term since income at upper ages tends to fall below our full earning potential. And then - because the model fits so well for data we have! - I’ll extrapolate to give you the expected income for somebody who is 200 years old (it’s less than negative $500,000… someone should stop researchers from improving life expectancy, stat!).

E(income | age=200) ~ <-$500,000… or something. Scale on the y-axis is categorical from 0 ($0) to 16 ($500,000+). The rug plot at the bottom shows the range of actual data in the CCES. Everything else is just a really bad guess.

Third, no standard ‘fix’ in the econometric toolkit is going to help us. Unlike our toy data, we aren’t ignoring data that might exist out there somewhere. We’re extrapolating beyond any data that will likely ever exist. On our subsample, even transforming x with the true function - the function that we’d never know was correct a priori - gives us predictions indistinguishable from the linear fit.
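
A rough sketch of that last claim, again in Python. For brevity it draws 100 risks directly from Unif(0, 0.01) instead of subsetting a larger sample, and the noise scale and seed are assumptions.

```python
import numpy as np

def g(x):
    """The true transformation of risk (unknowable in practice)."""
    return (100 * x + 3.0 ** x) / (1 - x)

rng = np.random.default_rng(0)                    # assumed seed
x = rng.uniform(0.0, 0.01, size=100)              # only low-risk jobs are observed
y = g(x) + rng.normal(0.0, 0.01, size=x.size)     # assumed noise scale

# Model A regresses y on raw x; Model B regresses y on the true transform g(x).
fit_raw = np.polyval(np.polyfit(x, y, deg=1), x)
fit_true = np.polyval(np.polyfit(g(x), y, deg=1), g(x))

max_gap = float(np.max(np.abs(fit_raw - fit_true)))
resid_sd = float(np.std(y - fit_raw))
print(f"largest in-sample gap between the two fits: {max_gap:.4f}")
print(f"residual spread around the linear fit:      {resid_sd:.4f}")
```

When the gap between the two sets of fitted values is smaller than the noise around either one, nothing in the sample can tell the “right” specification from the “wrong” one.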

In short, we will never know how wrong we are, in which direction, or how to improve the model.

So what do you propose?

When I pointed out the extrapolation issue to my friend, he responded with some frustration: “We need an estimate, and this is as good a method as any. What do you want governments to do? Guess?”

Well, yeah, kinda. I would rather a government or agency guess, and be transparent about it, than use the prediction from a hedonic wage model. The linear extrapolation is just a poorly founded guess anyway, since the method was neither designed for nor suited to the question at hand. Masking it as some scientific method for defining a quantity of interest gives it an air of authority that it doesn’t deserve.

To put it another way, I can buy a car to get us from New York to London. But hey, at least I bought a vehicle of some variety, right? Except that we’ll both drown, and you’ll wish you hadn’t trusted my whole “look, it’s a fancy machine” argument.

Now, I don’t have a better alternative at the moment. Asking people seems silly, and besides, that could also result in dumb linear modeling. Guessing isn’t satisfactory. So what to do?

Well first, please stop defining everything as a regression problem. As a wise man should never have had to write (but did): “Linear regression is not the philosopher’s stone.” Let’s stop treating it like one.

And second… wait, I’ve written 1,500 words and - if you’re still with me - I owe it to the reader to stop. Also, I don’t have an answer right now. I leave that open to intrepid readers. E-mail or comment, and maybe there’s a follow-up post in the near future.

Americans Secretly Oppose Gay Marriage

If you’ve struggled to find humor in politics recently, rejoice. At least the skewed-polls people are still around.

Yesterday, Chris Stirewalt blogged for Fox News that polls overstate support for gay marriage. Voicing a similar belief, leading social conservative Gary Bauer showed little concern over public opinion, telling Fox’s Chris Wallace:

“No, I’m not worried about it because the polls are skewed, Chris. Just this past November, four states, very liberal states, voted on this issue. And my side lost all four of those votes. But my side had 45, 46 percent of the vote in all four of those liberal states.”

As with many fallacies, there’s an iota of truth here. Stirewalt draws on work by New York University political scientist Patrick Egan that shows that late-season polls typically overestimate support for gay marriage compared with the election returns.

I don’t really have a problem so far. A Pollster article by Harry back in 2009 made a similar point and explored some ways to improve predictive models. The gap between pre-election polls and election returns, in other words, is well documented.

So, the polls are skewed…

Here’s where I depart from most interpretations of this observation. The poll-vote gap does not necessarily imply that the polls are “skewed.” Could it? Yes. But it doesn’t need to. I suspect a good bit of the bias comes from who votes, not how they vote.

Stirewalt argues that the polls are skewed and mainly blames social desirability bias. In this line of reasoning, respondents do not want to admit opposition to gay rights for fear of social judgement; instead, they act supportive but cast their secret ballot against. In other words, the “true” level of support is systematically lower than the polls show.

What’s crazy to me is that Stirewalt, even after basing his entire argument on Egan’s research, ignores the part where Egan dismisses social desirability as the primary cause of the polls’ inaccuracy. And Egan couldn’t be much plainer about it: “On the whole, these analyses fail to pin the blame for the inaccuracy of polling on same‐sex marriage bans on social desirability bias” (p. 7)¹.

What seems most likely is that pollsters haven’t figured out how to calibrate their samples to match the turnout. Ballot measures only attract at least moderately engaged observers. On an issue like gay marriage, it’s not surprising that some who ostensibly support gay rights aren’t nearly as motivated as those who have social, cultural or religious objections to it. The polls may decently represent the “true” proportion of citizens who support gay marriage, but not the class of voters who cast a ballot on the issue.

We’re Missing the Point

But far, far more importantly, any potential skew in the polls misses the true point here. Let’s assume that the polls are skewed, and that “true” support for gay marriage is actually seven points (best guess from the Egan research) lower than the polls say.

So what?

Those who invoke public opinion aren’t really that worried about crossing 50 percent. Even if the polls exaggerate support for gay marriage, the trend favors the equal rights argument. The above figure² shows general sentiment (“thermometer” scores) toward gays and lesbians in the American National Election Study³. This figure by Nate Silver shows a similar rise in support for gay marriage. And this figure from Gallup shows a widening gap favoring general rights for gays and lesbians.

In this light, even yelling “Skewed Polling!” doesn’t change the fact that support for gays and their ability to marry is rising steadily.

Now I know that race and sexual orientation are not the same, but there are some similarities between the above kernel density plot and the one at the top of the post. In general, support for rights and general sentiment co-evolve. Sentiment toward black Americans has increased even in the post-Civil Rights era. We see a smaller but similar “swell” in sentiment for homosexuals, with every reason to think it will continue on its current trajectory.

Even if support today is really, say, 51 percent instead of 58 percent, it’s much higher than it used to be.

Could we just be getting more politically correct, instead of more ‘liberal’, on gay rights? Sure, but the green line in the time series doesn’t show any real change in the rate of respondents opting out. No, young people are coming of age with a more permissive view on this issue.

Skew or no, the trend speaks for itself.

Notes:
[1] Now, as a brief aside, Egan’s first test for social desirability bias makes no sense to me. I can imagine plenty of reasons why a state’s gay population wouldn’t predict the poll-election gap. But the second test is much stronger: despite the social acceptance of LGBTs growing, the gap has become smaller. All in all, I’m sure social desirability is part of the story, but it’s most likely not the primary factor.

[2] The figure shows thermometers scaled on the interval [0, 1], as well as the proportion of respondents who respond to gays warmly (therm > 0.5), coolly (therm < 0.5), and those who opt to not answer. Confidence bands are generated using 1,000 bootstraps from the survey margin of error. The margin around “skip” seems odd, but for convenience I’m treating “skip” as an expression of a desire to not answer, and thus as a random variable in its own right.

[3] The ANES, funded by the National Science Foundation, could be at risk thanks to recent Congressional targeting of political science. Contact your representatives in Congress because (I promise!) most scholars use the study for more consequential research than I.

The Democratic Left Ascendant

Elections have consequences, and many of those consequences are — well, consequential. But every election brings endless speculation that this election was the election - the realignment, the death of the losing party, the upending of the current political era.

The midterm election in 2002 was “a disaster” that crushed Democrats and forced them, if they were smart, to tack toward the center. President Bush’s 2004 victory only solidified the Democrats’ fate. Until, of course, the 2006 and 2008 elections marked a resurgent Democratic left and the ultimate failure of the conservative project. Until 2010… well, you get the point.

Some of these speculations are more well-founded than others. In his last post for the New York Times’ Campaign Stops blog, Columbia professor Thomas Edsall asks if “Rush Limbaugh’s country [is] gone”. Pointing to some polling data and discussions with prominent Democratic pollsters, Edsall suggests that a new left-leaning electorate is emerging from the ashes of the political polarization and financial crisis of the late 2000s.

The argument is interesting, but we could probably reconsider some of the evidence he points to.

Mr. Edsall, for example, discusses a Pew Research survey showing young voters, African Americans and Democrats with a favorable impression of socialism. This could mark the emergence of a potent leftism that could forever transform the American political landscape.

Source: Thomas Edsall, New York Times Campaign Stops blog.

The numbers say something else to me. I’m not sure that any more Americans would support actual socialist policies today than would have two or ten years ago. What likely changed is the affective charge of the term.

Take, for example, dissertation research by UNC’s K. Elizabeth Coggins on the emergence and relative decline of “liberalism” as a political identity. A paper (pdf) by Coggins, coauthored with Jim Stimson, explores how individuals attach meaning to such labels as “liberal” and “conservative”, and “how widely popular liberal policies like Social Security, Medicare, and workplace safety came unhinged from the ideological label which defines them.”

If liberalism could come unhinged from its ideological content, it stands to reason that the same could happen to socialism. Over the past several years, many conservative commentators and Republican leaders have called President Obama’s policies “socialism”; and if the term might rally voters on the right, it may too help to redefine how many liberals think of “socialism”. If liberals begin associating Obamacare and higher taxes on top income earners with socialism, they may be more inclined toward the ideology.

The rest of Mr. Edsall’s case rests on striking differences between liberals and conservatives on an array of policy proposals. The gaps are stark, but they are not necessarily new. Self-identified liberals and conservatives have long held distinct views on an array of policy issues from education to welfare spending.

Liberal and Conservative Attitudes toward Increased Welfare Spending. (ANES Cumulative Data File)

Liberal and Conservative Attitudes toward Increased Education Spending. (ANES Cumulative Data File)

True, ideologically-leftist voters attach more consistently to the Democratic party, and conservatives self-identify more as Republicans, than in decades past. This supports a partisan sorting hypothesis, but not any particular narrative of either left- or right-of-center ideologies emerging as dominant.

Liberalism and Conservatism Over Time. (ANES Cumulative Data File)

In fact, there is little sign that the American electorate is moving either left or right. Macropartisanship is known to shift over time in response to economic conditions, the occupant of the White House and political shocks. Surely Republicans will want to rethink their strategy of appealing to minorities and to a lesser extent women; but it’s unlikely that Republicans will have to learn to live with an emerging leftism in the American electorate.

It’s Good to be Average

Last week, we examined the accuracy of several presidential forecasts. For those familiar with statistics and probability theory, the results proved unsurprising: the forecasts came reasonably close to the state-level outcomes, but the average forecast outperformed them all.

Put another way, the aggregate of aggregates performed better than the sum of its parts.

This year’s Senate races provide us another opportunity to test our theory. Today, I gathered the Senate forecasts from several prognosticators and compared them to the most recent Election Day returns. As before, I also computed the RMSE (root mean squared error) to capture how accurate each forecaster was on average.

We must note one modest complication: not all forecasters posited a point-estimate for every Senate race. Nate Silver put forward a prediction for every race; but Sam Wang of Princeton University only released 10 predictions for competitive races.

We accordingly compute two different RMSEs. The first, RMSE-Tossups, only computes the RMSE for those races for which each forecaster put forward a prediction. (There are nine races that fall into this category: Arizona, Connecticut, Massachusetts, Missouri, Montana, Nevada, North Dakota, Virginia and Wisconsin.)

The other calculation, RMSE-Total, shows each forecaster’s RMSE over all predictions. Wang, for example, is evaluated by his accuracy on the ten predictions he made; while Silver is evaluated on all 33 races.
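
As a sketch of the two calculations, here is a small Python example. The race labels, margins, and forecaster names are hypothetical placeholders, not the actual 2012 forecasts or returns.

```python
import math

# Hypothetical Democratic margins in points; NOT real forecasts or results.
actual = {"A": 3.0, "B": -1.0, "C": 12.0, "D": -20.0}
forecasts = {
    "full_slate": {"A": 1.0, "B": 2.0, "C": 9.0, "D": -12.0},
    "tossups_only": {"A": 4.0, "B": -3.0},
}

def rmse(pred, truth, races):
    """Root mean squared error of pred against truth over the given races."""
    errs = [(pred[r] - truth[r]) ** 2 for r in races]
    return math.sqrt(sum(errs) / len(errs))

# RMSE-Tossups uses only the races that every forecaster called.
common = set(actual)
for pred in forecasts.values():
    common &= set(pred)

scores = {}
for name, pred in forecasts.items():
    rmse_tossups = rmse(pred, actual, sorted(common))
    rmse_total = rmse(pred, actual, sorted(pred))   # every race this forecaster called
    scores[name] = (rmse_tossups, rmse_total)
    print(f"{name}: RMSE-Tossups={rmse_tossups:.2f}, RMSE-Total={rmse_total:.2f}")
```

The two numbers coincide for a forecaster who called only the common races, and can diverge sharply for one who also called lopsided races.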

Forecast	RMSE-Tossups	RMSE-Total
Wang	4.7	4.6
Silver	5.1	8.0
Pollster	3.8	5.8
RealClearPolitics	5.4	5.1
TalkingPointsMemo	3.9	8.0
Average Forecast	4.4	5.4

The numbers in the above table give us a sense of how accurate each forecast was. The bigger the number, the larger the error. So what can we learn?

Alas! The average performs admirably yet again. It’s not perfect, of course; for some races, there are precious few forecasts to average over: Delaware, for instance, has only the 538 prediction.

To begin accounting for this, we weight the RMSE by the share of forecasts used to compute the average. If we limit our evaluation of the average to only those races with three or more available forecasts, the RMSE drops to 4.8.
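
The restriction to well-covered races can be sketched like this. The data and names are hypothetical, and the sketch only implements the “three or more forecasts” filter, not any particular weighting scheme.

```python
import math
from statistics import mean

# Hypothetical margins; a missing key means the forecaster skipped the race.
actual = {"A": 3.0, "B": -1.0, "C": 12.0}
forecasts = {
    "f1": {"A": 1.0, "B": 2.0, "C": 9.0},
    "f2": {"A": 6.0, "B": -3.0},
    "f3": {"A": 5.0, "B": -1.0},
}

def average_forecast(forecasts, races, min_forecasts=1):
    """Average the available predictions per race; drop thinly covered races."""
    avg = {}
    for race in races:
        preds = [f[race] for f in forecasts.values() if race in f]
        if len(preds) >= min_forecasts:
            avg[race] = mean(preds)
    return avg

def rmse(pred, truth):
    """RMSE over whichever races appear in pred."""
    return math.sqrt(mean((pred[r] - truth[r]) ** 2 for r in pred))

avg_all = average_forecast(forecasts, actual)
avg_3 = average_forecast(forecasts, actual, min_forecasts=3)
print(f"average, all races:       {rmse(avg_all, actual):.2f}")
print(f"average, 3+ forecasts:    {rmse(avg_3, actual):.2f}")
```

In this toy data, race "C" has a single forecast, so the “average” there is just one forecaster's guess, which is the Delaware situation in miniature.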

What else emerges from the table? For one, the poll-only forecasts — especially the Wang, RCP and Pollster forecasts — perform better than Nate’s mélange of state polls and economic fundamentals.

North Dakota, where Democrat Heidi Heitkamp bested Republican Rick Berg, provides a case in point. Pollster and RealClearPolitics both predicted a narrow win for Ms. Heitkamp. The 538 model considered the same polls upon which Pollster and RCP based their predictions; but the fundamentals in Mr. Silver’s model overwhelmed the polls. As a result, the 538 model predicted that Mr. Berg would win by more than five points. [See Footnote 1.]

In sum, however, all of the forecasts did reasonably well at calling the overall outcome. We can chalk this up to another victory for (most) pollsters and the quants who crunch the data.

1. Addendum: In the original post (text above is unchanged), I argued that Mr. Silver’s economic fundamentals pushed his ND forecast in the wrong direction. This undoubtedly contributed to his inaccuracy in North Dakota, but it wasn’t the main factor. As commenters pointed out, Silver’s model was selective in the polls it used to predict the outcome. As of the last run of the model, Silver’s polling average lined up fairly well with RCP (Berg +5) but not with Pollster (Heitkamp +0.3). Mea culpa.

Forecasting the End of the World

As many readers may note, we at Margin of Error tend to think of the world within a Bayesian framework. That’s not exactly unique: most of the prominent forecasters think of probability more as Bayesians than as Frequentists.

This morning, XKCD decided to wade into the debate:

I know which bet I would take. How about you?