The difficulty of getting academics to correct errors is a regular theme on this blog, the Lewandowsky affair being just the latest in a long and shameful litany. Today's guest post by Doug Keenan describes a set of allegations he has submitted to the University of Oxford. Although not related to climatology, the parallels are obvious.
Research Misconduct by Christopher Bronk Ramsey
Submitted to the University of Oxford by Douglas J. Keenan, 28 March 2014
NOTE: a draft of this report was sent to Ramsey; Ramsey acknowledged receipt, but had no comments on the contents.
The perpetrator
Christopher Bronk Ramsey is a professor at the University of Oxford. His main area of work is in a subject known as “radiocarbon dating”. Briefly, radiocarbon dating tries to determine how many years ago an organism died. For example, suppose that we find a bone from some animal; then, using radiocarbon dating, we might be able to determine that the animal died, say, 3000 years ago.
The determination of the age of a sample, such as a bone, includes two steps of relevance here. The first step is to take a certain measurement of the sample. The second step is to apply a statistical procedure to the measurement. The statistical procedure is called “calibration”.
Ramsey is the author of a computer program that implements the calibration procedure. The program, OxCal, is very widely used by radiocarbon scientists, and has been for decades. Ramsey’s offences, described below, are related to OxCal.
An overview of radiocarbon dating is given in Appendix 1. In the descriptions of Ramsey’s offences, familiarity with the subject at the level of that appendix is assumed.
First offence: calibrating a radiocarbon age
I found an error in the calibration procedure. The root of the error is in a mathematical derivation, which turns out to be invalid. Note that for a mathematical derivation, there cannot be opinions about whether the derivation is valid, any more than there can be opinions about whether an addition, or multiplication, is valid; rather, validity is absolute.
I submitted a paper about the error to the journal Nonlinear Processes in Geophysics. The journal’s reviewers included a statistician who, the editors told me, was eminent. After the journal accepted the paper for publication, I sent the paper to Ramsey.
Ramsey said that the paper did not convince him. He and I discussed the derivation further; he appeared to not understand the mathematics, even though the level is at most third-year undergraduate.
Seeking to convince him, I tried using an example to illustrate the error in the calibration procedure. The example is simple enough that it does not require training in mathematical statistics. The example is included in the published version of my paper (on p. 349). It is also explained below.
Assume that there are only three calendar years: 9, 10, 11. Additionally, assume that those years have radiocarbon ages 110, 100, 100, with the standard deviations being zero (i.e. the radiocarbon ages are exact). Suppose that the sample’s measurement has a probability distribution with Pr(age = 110) = 1/2 and Pr(age = 100) = 1/2. What is the probability that the sample is from year 9?
To answer the question, note that “year = 9” is true if and only if “age = 110” is true; so Pr(year = 9) = Pr(age = 110). Thus, the answer is 1/2. The method used by the OxCal program, however, does not give that answer, but instead gives 1/3.
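The arithmetic of the example can be checked with a short Python sketch. The function names are hypothetical. The equal split of probability between years 10 and 11 in the first function is an assumption (the example only fixes the answer for year 9). The second function is only an assumption about how an answer of 1/3 can arise, namely by giving each year a weight equal to the probability of its radiocarbon age and then normalising across years; it is not taken from OxCal's source code.

```python
from fractions import Fraction

# Toy calibration curve from the example: calendar year -> exact radiocarbon age.
curve = {9: 110, 10: 100, 11: 100}

# Probability distribution of the sample's measured radiocarbon age.
pr_age = {110: Fraction(1, 2), 100: Fraction(1, 2)}


def by_age_partition(curve, pr_age):
    """Give each age's probability to the years that have that age.

    Years sharing an age split its probability equally; the equal split
    is an assumption of this sketch, not something stated in the example.
    """
    years_by_age = {}
    for year, age in curve.items():
        years_by_age.setdefault(age, []).append(year)
    return {
        year: pr_age.get(age, Fraction(0)) / len(years)
        for age, years in years_by_age.items()
        for year in years
    }


def by_normalised_weights(curve, pr_age):
    """Weight each year by Pr(age = curve[year]), then normalise over years.

    Assumed illustration of how 1/3 can arise; not OxCal's actual code.
    """
    weights = {year: pr_age.get(age, Fraction(0)) for year, age in curve.items()}
    total = sum(weights.values())
    return {year: weight / total for year, weight in weights.items()}


print(by_age_partition(curve, pr_age)[9])       # 1/2
print(by_normalised_weights(curve, pr_age)[9])  # 1/3
```

In the first function the probabilities for years 9, 10, 11 come out as 1/2, 1/4, 1/4; in the second, every year receives 1/3, even though the measurement says the age is 110 half of the time and only year 9 has that age.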
Ramsey claimed that he was still not convinced. I then gave up trying to convince him using logic. Instead, I tried to use an argument from authority. For that reason, I hired two independent statistical consultants to check the result: the Statistical Advisory Service at Imperial College London and the Statistical Consulting Service at the University of Waterloo. Both consultants got the same answer that I did. (Moreover, both consultants said that the analysis was so easy, there would be no charge.) I informed Ramsey about this. Ramsey, though, maintained that he was still not convinced.
There is also a Statistical Consultancy Service at the University of Oxford. The Service is not available to the public, however, only to the University’s researchers. Hence, I proposed that Ramsey check my result with the Service, and I offered to provide full funding. Ramsey refused my offer. It should be clear from the foregoing that the OxCal program is in error and that Ramsey knows it is in error. Yet Ramsey refuses to acknowledge the error, and continues to support using the program. The consequences of this are potentially substantial, because thousands of research papers have relied upon OxCal for calibration.
Second offence: combining radiocarbon ages
Occasionally, repeated radiocarbon measurements are made on a single sample. A statistical method for combining such repeated measurements was presented by Ward & Wilson [Archaeometry, 1978]. The method is now standard in the literature. A problem, though, has arisen: the same method has also been used when measurements are made on different samples. That problem is considered in this section.
Ward & Wilson consider two distinct cases. Case I is where all the measurements are made on the same sample, which is believed to be homogeneous. In Case I, the authors advise first doing a simple statistical test, as a partial check that there was no measurement, or other, error. Assuming that the test is passed, the radiocarbon measurements are combined via a simple weighted average.
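For readers who want to see the Case I arithmetic, here is a minimal Python sketch of an inverse-variance weighted average together with the usual test statistic, which is compared against a chi-squared distribution with n - 1 degrees of freedom. The function name and the input values are hypothetical, chosen only for illustration, and the sketch is not taken from OxCal or from any published analysis.

```python
import math


def pool_case_one(ages, sigmas):
    """Combine repeated measurements of one sample (Case I only).

    Returns the weighted mean, its standard deviation, the test
    statistic T, and the degrees of freedom for the chi-squared check.
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    total_weight = sum(weights)
    # Inverse-variance weighted mean of the radiocarbon ages.
    mean = sum(w * a for w, a in zip(weights, ages)) / total_weight
    # Standard deviation of the pooled mean.
    pooled_sigma = math.sqrt(1.0 / total_weight)
    # Scatter of the measurements about the pooled mean.
    t_stat = sum((a - mean) ** 2 / s ** 2 for a, s in zip(ages, sigmas))
    return mean, pooled_sigma, t_stat, len(ages) - 1


# Hypothetical measurements: two determinations, each with sigma = 50.
mean, pooled_sigma, t_stat, dof = pool_case_one([3000.0, 3050.0], [50.0, 50.0])
print(round(mean), round(pooled_sigma, 1), round(t_stat, 2), dof)
```

Note that the pooled standard deviation (about 35 here) is smaller than either input standard deviation (50). That narrowing is justified only when all the measurements estimate the same quantity, which is the Case I assumption.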
Case II is for when “one does not know whether all determinations are estimating the same date (or effectively indistinguishable different dates)”; the emphasis is theirs [Ward & Wilson, 1978, p. 21]. Case II is thus for measurements of different samples. As the authors state, there is a “fundamental difference” between Case I and Case II: the simple weighted average that is used for combining measurements in Case I should not be used in Case II. Instead, a more complicated method must be used. An approximating method is described by the authors.
The foregoing is accepted by everyone who has studied the issues, as far as I know. (There can be different approximating methods used in Case II; exact methods are not feasible.)
Despite the above, Ramsey has published papers where he uses the simple method of Case I in situations that fall under Case II. The result is to give illusory precision. For example, if some radiocarbon samples that should be treated under Case II would have given an age range of 3000–3100 years using a valid method, Ramsey might claim a more precise range of 3025–3075 by using the method of Case I (this example is for illustrative purposes only).
I e-mailed Ramsey about this, in February 2014. A copy of my message, and Ramsey’s reply, is in Appendix 2. The reply admits that “any classical statistical method, based on normality assumptions will only be an approximation - and possibly not a very good one”. Thus, the reply acknowledges that the method used in his papers is sometimes not good. The reason for using the method seems to be that doing the calculations correctly would be difficult. It is indeed true that doing the calculations correctly is difficult, but that is not a valid reason for misleading readers by presenting calculations that are known to be wrong.
Ramsey’s reply also states that his “papers all have the primary data and so anyone can reanalyse the data using different assumptions”. That obviously does nothing to allay the problem of unjustified precision in the published papers’ claims — which is what readers of the papers rely upon. The problem is especially acute because hardly any of those readers will have the statistical expertise required to do reanalysis themselves.
The rest of Ramsey’s reply is mostly verbiage.
The full problem is worse than the above might indicate, because the simple method of Case I is implemented in OxCal. Since OxCal is a standard computer program in the field of radiocarbon, many other researchers follow the use of OxCal’s author, Ramsey. Thus, Ramsey has led many other researchers in the field to also publish papers that are based on the inappropriate method.
Appendix 1: Basics of radiocarbon dating
The term “radiocarbon” is commonly used to denote 14C, an isotope of carbon which is radioactive with a half-life of about 5730 years. 14C is produced by cosmic rays in the stratosphere and upper troposphere. It is then distributed throughout the rest of the troposphere, the oceans, and Earth’s other exchangeable carbon reservoirs. In the surface atmosphere, about one part per trillion (ppt) of carbon is 14C.
All organisms absorb carbon from their environment. Those that absorb their carbon directly or indirectly from the surface atmosphere have about 1 ppt of their carbon content as 14C. Such organisms comprise almost all land-dwelling plants and animals. (Other organisms — e.g. fish — have slightly less of their carbon as 14C; this affects how radiocarbon dating works, and there are methods of adjusting for it.)
When an organism dies, carbon stops being absorbed. Hence after 5730 years, about half of its 14C will have radioactively decayed (to nitrogen): only about 0.5 ppt of the carbon of the organism’s remains will be 14C. And if the carbon of the remains is found to be 0.25 ppt 14C, then the organism would be assumed to have died about 11 460 years ago. Thus, a simple calculation can find the age, since death, from any 14C concentration. (Remains older than about 50 000 years, however, have a 14C concentration that is in practice too small to measure; so they cannot be dated via 14C.)
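The simple calculation just described can be written as a few lines of Python. This is only a sketch of the uncalibrated arithmetic, assuming a constant starting concentration of 1 ppt and the 5730-year half-life quoted above; the function name is hypothetical.

```python
import math

HALF_LIFE_YEARS = 5730.0  # half-life of 14C, as quoted in the text
INITIAL_PPT = 1.0         # assumed 14C concentration at death, in ppt


def simple_age(concentration_ppt):
    """Years since death implied by a 14C concentration, with no calibration."""
    # Each halving of the concentration corresponds to one half-life.
    halvings = math.log2(INITIAL_PPT / concentration_ppt)
    return HALF_LIFE_YEARS * halvings


print(round(simple_age(0.5)))   # 5730
print(round(simple_age(0.25)))  # 11460
```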
Ages are conventionally reported together with the standard deviation of the laboratory 14C measurement, e.g. 900±25 14C BP (14C-dated, years Before Present). The true standard deviation, though, will often be larger than what is reported, due to non-laboratory sources of error, e.g. the admixture of impurities with the remains.
Although a tree may live for hundreds, even thousands, of years, each ring of a tree absorbs carbon only during the year in which it grows. The year in which a ring was grown can be determined exactly (by counting); so radiocarbon dating can be tested by measuring the 14C concentrations in old tree rings. Such testing found errors of up to several centuries. It turns out that the concentration of 14C in the carbon of the surface atmosphere has not been a constant 1 ppt, but has varied with time. Thus the simple calculation of age from 14C concentration is unreliable.
Tree rings, though, also provide a solution to this problem. The concentration of 14C in the carbon of an organism’s remains can be compared with the concentrations in tree rings. Tree rings that match, within confidence limits, give the years in which the organism could have plausibly died.
The matching procedure thus provides calibration of 14C concentrations. (Calibration via tree rings, though, does not extend back 50 000 years; other ways of calibrating are therefore being developed.) Ages that are estimated without calibration continue to be reported, and are called “uncalibrated 14C ages”, or simply “14C ages”.
Appendix 2: Some correspondence on the Ward & Wilson method
From: D.J. Keenan
Sent: 05 February 2014 18:20
To: Christopher Bronk Ramsey
Cc: Malcolm Xxxxxx; Tiziano Xxxxxxxx
Subject: Misuse of Ward & Wilson test
Dear Christopher,
Malcolm forwarded to me a copy of some correspondence between you and Tiziano. In the correspondence, you claim that “if all the samples are short-lived from the same year we would expect them to all have the same radiocarbon composition and thus act as if they were all from the same sample”. The claim is clearly false for samples from Thera. You are aware of the issue of misusing the test of Ward & Wilson: you and I have discussed it before by e-mail; the issue is treated in detail in my NPG [2012] paper, which you read. I note that you have been misusing the Ward & Wilson test in some of your published papers, e.g. in Science. Correcting the problem would lead to a wider date range. I ask if you intend to publish corrigenda for those papers.
Sincerely, Doug
____________________________________
From: Christopher Ramsey
Sent: 06 February 2014 00:03
To: D.J. Keenan
Cc: Malcolm Xxxxxx; Tiziano Xxxxxxxx
Subject: Re: Misuse of Ward & Wilson test
Doug
There are several different issues here which are not that simple.
1. There is the Ward and Wilson test, formally for subsamples of the same material. This is a more stringent test than you would apply for samples of material spreading over some years. If a set of dates pass this test, it clearly does not mean that they are all from the same year, nor that they are from the same sample. However, it does indicate that the scatter is not greater than that you would expect for material from a short-lived horizon.
2. There is the combination method used. Which method you used does depend on what you think about the samples - which ultimately is a matter of opinion. If you assume the samples might be a range of different ages there is not a simple solution to this. The distribution of ages is almost certainly non-normal - so any classical statistical method, based on normality assumptions will only be an approximation - and possibly not a very good one. I have suggested Tiziano some other ways he might think about this.
3. In terms of the publications - the papers all have the primary data and so anyone can reanalyse the data using different assumptions. This is what I assume Tiziano will be doing, amongst other things. There have of course already been many papers, book chapters etc putting different interpretations on the data, and also looking at models that exclude the Thera VDL data altogether. I'm sure there is scope for more of this. In the end there are not perfect solutions to any of these. The real situation is quite complicated, the range of possibilities to be entertained is quite large, and there is no statistical model which will incorporate all of this information.
In the end "all models are wrong but some are useful" - which applies to classical statistical models as well as Bayesian ones. A good approach is probably one which looks at robustness - how much do the results change under different assumptions.
However, as I said to Tiziano (which you probably heard), I don't think the details of the statistical methods really address the main issue here. If you think the eruption is - say 1520 BC, then you cannot explain the radiocarbon data just by using slightly different statistical models.
Anyway - I think it is worth Tiziano investigating these ideas in some depth - without too much badgering by the rest of us. I'm happy to answer his queries if he wants any suggestions - but also think that it would be worth him discussing these things with people who have no particular interest in this particular research. I don't think the adversarial tone, which these discussions sometimes descend into, is particularly useful.
Best wishes
Christopher
There is a response to Doug's critique here.