Saturday, February 28, 2009

Public data and replication

When should data that support a published study be made public? What if the study is not yet published but is merely presented to an audience of peers? Scienceblogs reports:

An Italian-led research group's closely held data have been outed by paparazzi physicists, who photographed conference slides and then used the data in their own publications.

The controversy over who gets credit is another concern, but what if the data had been made public at the same time as the presentation? Then there would be no dispute over credit. However, for many reasons (personal or otherwise), researchers tend not to make their data public until their study is published, if at all. This would be acceptable if and only if their research were presented only to their peers and did not become part of the public debate.

In economics, unlike discussions of dark matter or other matters of cosmology, this is hardly ever the case. When a study is trumpeted in the Wall Street Journal or the New York Times prior to publication and before the data are made public, what is the status of the study? It is treated as if it had already been published and had already passed peer review.

For instance, the New York Times gave some coverage to research by Kotchen and Grant on the lack of energy savings from adopting daylight saving time. Unfortunately, the data needed for replication are not available for download. The conclusions of the study are therefore unverifiable, and yet they have already made it into the public domain.

Likewise, claims of success against malaria were made based on data that were not made public.

Mr. and Mrs. Gates are repeating numbers that have already been discredited. This story of irresponsible claims goes back to a big New York Times headline on February 1, 2008: “Nets and New Drug Make Inroads Against Malaria,” which quoted Dr. Arata Kochi, chief of malaria for the WHO, as reporting 50-60 percent reductions in deaths of children in Zambia, Ethiopia, and Rwanda, and so celebrated the victories of the anti-malaria campaign. Alas, Dr. Kochi had rushed to the press a dubious report. The report was never finalized by WHO, it promptly disappeared, and its specific claims were contradicted by WHO’s own September 2008 World Malaria Report, by which time Dr. Kochi was no longer WHO chief of malaria.

Thus, there is a case to be made that data should be made public as soon as the authors feel their research is ready to be presented to their peers, rather than waiting for publication, especially since the press is likely to report these "preliminary" findings.

Another source of controversy is the tariff-growth paradox, whose authors have not made their data available for download. There are reasons for this, although technology is not one of them: the Internet has made downloads easy, and if the authors can make their papers downloadable, why not their data?

Reasons for not making data downloadable are probably mundane:
1. Documentation - the data were probably not well documented, and only the authors understand the structure and the variable names. The work required to go back and document the data seems overwhelming to the authors (a minimal codebook sketch appears after this list).
2. Personal - the authors have done the hard work of gathering and entering the data, and other researchers who want to study the topic should do their own data entry. This reason may seem petty, but independent data collection does serve as a check on the original data. That check, however, can still be carried out even if the data have been made available.
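On the documentation point, the first pass need not be overwhelming: a rough codebook can be generated almost automatically and then annotated by hand. The sketch below is only illustrative; the file name, column names, and descriptions are hypothetical and not taken from any of the studies discussed here.

    import pandas as pd

    # Hypothetical example file; replace with the study's actual data set.
    df = pd.read_csv("study_data.csv")

    # Hand-written variable descriptions; anything missing is flagged for review.
    descriptions = {
        "region": "Region identifier (illustrative)",
        "income": "Household income in current dollars (illustrative)",
        "treatment": "1 if observation is in the treated group, 0 otherwise",
    }

    # Rough codebook: variable name, type, count of missing values, description.
    codebook = pd.DataFrame({
        "variable": df.columns,
        "dtype": [str(df[c].dtype) for c in df.columns],
        "n_missing": [int(df[c].isna().sum()) for c in df.columns],
        "description": [descriptions.get(c, "TODO: document") for c in df.columns],
    })

    codebook.to_csv("codebook.csv", index=False)
    print(codebook.to_string(index=False))

Even a table like this, posted alongside the paper, would let others see which variables exist and which still need documentation.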

David Albouy has challenged the settler mortality data in "The Colonial Origins of Comparative Development: An Empirical Investigation" by Acemoglu, Johnson, and Robinson. Acemoglu et al. never made the data downloadable, but they provided it to Albouy via personal communication, and that was enough for him to contradict their findings.

More insidious is the use of confidential data that cannot be made public. Findings based on such data have made it into policy circles, for example on the effects of file sharing on music sales. The Chronicle of Higher Education has documented the debate in which Stan Liebowitz has tried to verify a Journal of Political Economy article by Oberholzer-Gee and Strumpf.

What is the use of publishing research that cannot be replicated? Hamermesh provides a good overview of the issues, but the bottom line is this: if a study cannot be replicated, it is worthless, and if economics wants to stake its claim as a science, it should reject all articles that rely on confidential data.

Related to the use of confidential data, but not quite as serious, is the use of public but restricted data. Geo-coded data sets at NCES are a common example. The restrictions are fairly onerous, including an application with a signed affidavit as well as submitting, at a bureaucrat's whim, to audits of the user's data security arrangements. Other examples of restricted data are Medicare data, Census data without topcoding of income categories (i.e., the March supplements), and the Longitudinal Employer-Employee Data.

While privacy concerns are real, much of the data collected can be adequately modified to protect privacy. If the Federal Reserve can make data on wealth publicly available, then it is hard to believe that economists are unable to make their data sufficiently anonymous to eliminate disclosure risk.
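To be concrete about what "sufficiently anonymous" could look like, here is a minimal sketch of two standard disclosure-limitation steps: topcoding and geographic coarsening. The column names, thresholds, and suppression rule are illustrative assumptions, not any agency's actual rules.

    import pandas as pd

    def anonymize(df: pd.DataFrame,
                  income_col: str = "income",
                  geo_col: str = "zip_code",
                  topcode_quantile: float = 0.99,
                  min_cell_size: int = 20) -> pd.DataFrame:
        """Illustrative disclosure limitation: topcoding plus geographic coarsening."""
        out = df.copy()

        # Topcode income at the chosen quantile so extreme values are capped.
        cap = out[income_col].quantile(topcode_quantile)
        out[income_col] = out[income_col].clip(upper=cap)

        # Coarsen geography: keep only the 3-digit ZIP prefix.
        out[geo_col] = out[geo_col].astype(str).str[:3]

        # Suppress geographic cells with too few observations to be safe.
        counts = out[geo_col].value_counts()
        small = counts[counts < min_cell_size].index
        out.loc[out[geo_col].isin(small), geo_col] = "suppressed"

        return out

A researcher could post an extract processed along these lines together with the paper while keeping the raw file restricted.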

The conclusions are thus:
1. Data must be made downloadable as soon as the authors are willing to present their findings to their peers.
2. Restricted data must be made public - disclosure risk can be eliminated even with geo-coding. NCES type restrictions are too onerous.
3. Articles based on confidential data must be summarily rejected. These studies add nothing to the knowledge base.
