Dating error
The difficulty of getting academics to correct errors is a regular theme on this blog, the Lewandowsky affair being just the latest in a long and shameful litany. Today's guest post by Doug Keenan describes a set of allegations he has submitted to the University of Oxford. Although the case is not related to climatology, the parallels are obvious.
Research Misconduct by Christopher Bronk Ramsey
Submitted to the University of Oxford by Douglas J. Keenan 28 March 2014
NOTE: a draft of this report was sent to Ramsey; Ramsey acknowledged receipt, but had no comments on the contents.
The perpetrator
Christopher Bronk Ramsey is a professor at the University of Oxford. His main area of work is in a subject known as “radiocarbon dating”. Briefly, radiocarbon dating tries to determine how many years ago an organism died. For example, suppose that we find a bone from some animal; then, using radiocarbon dating, we might be able to determine that the animal died, say, 3000 years ago.
The determination of the age of a sample, such as a bone, includes two steps of relevance here. The first step is to take a certain measurement of the sample. The second step is to apply a statistical procedure to the measurement. The statistical procedure is called “calibration”.
Ramsey is the author of a computer program that implements the calibration procedure. The program, OxCal, is very widely used by radiocarbon scientists, and has been for decades. Ramsey’s offences, described below, are related to OxCal.
An overview of radiocarbon dating is given in Appendix 1. In the descriptions of Ramsey’s offences, familiarity with the subject at the level of that appendix is assumed.
First offence: calibrating a radiocarbon age
I found an error in the calibration procedure. The root of the error is in a mathematical derivation, which turns out to be invalid. Note that for a mathematical derivation, there cannot be opinions about whether the derivation is valid, any more than there can be opinions about whether an addition, or multiplication, is valid; rather, validity is absolute.
I submitted a paper about the error to the journal Nonlinear Processes in Geophysics. The journal’s reviewers included a statistician who, the editors told me, was eminent. After the journal accepted the paper for publication, I sent the paper to Ramsey.
Ramsey said that the paper did not convince him. He and I discussed the derivation further; he appeared not to understand the mathematics, even though the level is at most third-year undergraduate.
Seeking to convince him, I tried using an example to illustrate the error in the calibration procedure. The example is simple enough that it does not require training in mathematical statistics. The example is included in the published version of my paper (on p. 349). It is also explained below.
Assume that there are only three calendar years: 9, 10, 11. Additionally, assume that those years have radiocarbon ages 110, 100, 100, with the standard deviations being zero (i.e. the radiocarbon ages are exact). Suppose that the sample’s measurement has a probability distribution with Pr(age = 110) = 1/2 and Pr(age = 100) = 1/2. What is the probability that the sample is from year 9?
To answer the question, note that “year = 9” is true if and only if “age = 110” is true; so Pr(year = 9) = Pr(age = 110). Thus, the answer is 1/2. The method used by the OxCal program, however, does not give that answer, but instead gives 1/3.
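For readers who want to check the arithmetic, the example can be worked through in a few lines of Python. This is a minimal sketch: the year-to-age mapping and the measurement probabilities are those of the example above, and the second computation follows the description of the disputed method given here, not OxCal's actual source code.

calibration = {9: 110, 10: 100, 11: 100}   # calendar year -> exact radiocarbon age
pr_age = {110: 0.5, 100: 0.5}              # measurement distribution over ages

# Direct reading: year 9 is the only year with age 110,
# so Pr(year = 9) = Pr(age = 110) = 1/2.
pr_year9_direct = pr_age[110]

# The disputed method, as described here: weight every calendar year by the
# measurement probability of its age, then normalise over the years.
weights = {y: pr_age[a] for y, a in calibration.items()}
total = sum(weights.values())
pr_year9_method = weights[9] / total

print(pr_year9_direct)   # 0.5
print(pr_year9_method)   # 0.333...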
Ramsey claimed that he was still not convinced. I then gave up trying to convince him using logic. Instead, I tried to use an argument from authority. For that reason, I hired two independent statistical consultants to check the result: the Statistical Advisory Service at Imperial College London and the Statistical Consulting Service at the University of Waterloo. Both consultants got the same answer that I did. (Moreover, both consultants said that the analysis was so easy, there would be no charge.) I informed Ramsey about this. Ramsey, though, maintained that he was still not convinced.
There is also a Statistical Consultancy Service at the University of Oxford. The Service is not available to the public, however, only to the University’s researchers. Hence, I proposed that Ramsey check my result with the Service, and I offered to provide full funding. Ramsey refused my offer. It should be clear from the foregoing that the OxCal program is in error and that Ramsey knows it is in error. Yet Ramsey refuses to acknowledge the error, and continues to support using the program. The consequences of this are potentially substantial, because thousands of research papers have relied upon OxCal for calibration.
Second offence: combining radiocarbon ages
Occasionally, repeated radiocarbon measurements are made on a single sample. A statistical method for combining such repeated measurements was presented by Ward & Wilson [Archaeometry, 1978]. The method is now standard in the literature. A problem, though, has arisen: the same method has also been used when measurements are made on different samples. That problem is considered in this section.
Ward & Wilson consider two distinct cases. Case I is where all the measurements are made on the same sample, which is believed to be homogeneous. In Case I, the authors advise first doing a simple statistical test, as a partial check that there was no measurement, or other, error. Assuming that the test is passed, the radiocarbon measurements are combined via a simple weighted average.
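As an illustration, here is a minimal sketch (in Python, with made-up numbers) of the Case I recipe described above: an inverse-variance weighted average, preceded by the chi-square consistency check. It is a sketch of the standard procedure, not Ward & Wilson's own code.

import math

def combine_case1(ages, sigmas):
    # Inverse-variance weights for repeated measurements on one sample.
    w = [1.0 / s ** 2 for s in sigmas]
    mean = sum(wi * a for wi, a in zip(w, ages)) / sum(w)
    sigma = math.sqrt(1.0 / sum(w))   # standard deviation of the pooled age
    # Consistency check: compare t to a chi-square with len(ages) - 1 d.f.
    t = sum(((a - mean) / s) ** 2 for a, s in zip(ages, sigmas))
    return mean, sigma, t

print(combine_case1([3010, 2990, 3005], [30, 25, 35]))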
Case II is for when “one does not know whether all determinations are estimating the same date (or effectively indistinguishable different dates)”; the emphasis is theirs [Ward & Wilson, 1978, p. 21]. Case II is thus for measurements of different samples. As the authors state, there is a “fundamental difference” between Case I and Case II: the simple weighted average that is used for combining measurements in Case I should not be used in Case II. Instead, a more complicated method must be used. An approximating method is described by the authors.
The foregoing is accepted by everyone who has studied the issues, as far as I know. (There can be different approximating methods used in Case II; exact methods are not feasible.)
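For contrast with Case I, here is a minimal sketch of one approximating method for Case II: a moment-based random-effects combination that allows for extra between-sample variance. To be clear, this illustrates the general idea of a Case II approximation; it is not necessarily the specific method described by Ward & Wilson.

import math

def combine_case2(ages, sigmas):
    w = [1.0 / s ** 2 for s in sigmas]
    mean = sum(wi * a for wi, a in zip(w, ages)) / sum(w)
    # Heterogeneity statistic and a moment estimate of the extra
    # between-sample variance (DerSimonian-Laird style).
    q = sum(wi * (a - mean) ** 2 for wi, a in zip(w, ages))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(ages) - 1)) / c)
    # Re-weight with the extra variance included; the pooled uncertainty
    # is then wider than the Case I figure whenever tau2 > 0.
    w2 = [1.0 / (s ** 2 + tau2) for s in sigmas]
    mean2 = sum(wi * a for wi, a in zip(w2, ages)) / sum(w2)
    sigma2 = math.sqrt(1.0 / sum(w2))
    return mean2, sigma2

print(combine_case2([3010, 2990, 3070], [30, 25, 35]))

The wider pooled uncertainty is the point at issue: using the Case I average where Case II applies understates the true range.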
Despite the above, Ramsey has published papers where he uses the simple method of Case I in situations that fall under Case II. The result is to give illusory precision. For example, if some radiocarbon samples that should be treated under Case II would have given an age range of 3000–3100 years — using a valid method — Ramsey might claim a more precise range of 3025–3075, by using the method of Case I (this example is for illustrative purposes only).
I e-mailed Ramsey about this, in February 2014. A copy of my message, and Ramsey’s reply, is in Appendix 2. The reply admits that “any classical statistical method, based on normality assumptions will only be an approximation - and possibly not a very good one”. Thus, the reply acknowledges that the method used in his papers is sometimes not good. The reason for using the method seems to be that doing the calculations correctly would be difficult. It is indeed true that doing the calculations correctly is difficult, but that is not a valid reason for misleading readers by presenting calculations that are known to be wrong.
Ramsey’s reply also states that his “papers all have the primary data and so anyone can reanalyse the data using different assumptions”. That obviously does nothing to allay the problem of unjustified precision in the published papers’ claims — which is what readers of the papers rely upon. The problem is especially acute because hardly any of those readers will have the statistical expertise required to do reanalysis themselves.
The rest of Ramsey’s reply is mostly verbiage.
The full problem is worse than the above might indicate, because the simple method of Case I is implemented in OxCal. Since OxCal is a standard computer program in the field of radiocarbon, many other researchers follow the use of OxCal’s author, Ramsey. Thus, Ramsey has led many other researchers in the field to also publish papers that are based on the inappropriate method.
Appendix 1: Basics of radiocarbon dating
The term “radiocarbon” is commonly used to denote 14C, an isotope of carbon which is radioactive with a half-life of about 5730 years. 14C is produced by cosmic rays in the stratosphere and upper troposphere. It is then distributed throughout the rest of the troposphere, the oceans, and Earth’s other exchangeable carbon reservoirs. In the surface atmosphere, about one part per trillion (ppt) of carbon is 14C.
All organisms absorb carbon from their environment. Those that absorb their carbon directly or indirectly from the surface atmosphere have about 1 ppt of their carbon content as 14C. Such organisms comprise almost all land-dwelling plants and animals. (Other organisms — e.g. fish — have slightly less of their carbon as 14C; this affects how radiocarbon dating works, and there are methods of adjusting for it.)
When an organism dies, carbon stops being absorbed. Hence after 5730 years, about half of its 14C will have radioactively decayed (to nitrogen): only about 0.5 ppt of the carbon of the organism’s remains will be 14C. And if the carbon of the remains is found to be 0.25 ppt 14C, then the organism would be assumed to have died about 11 460 years ago. Thus, a simple calculation can find the age, since death, from any 14C concentration. (Remains older than about 50 000 years, however, have a 14C concentration that is in practice too small to measure; so they cannot be dated via 14C.)
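The simple calculation can be written out explicitly. A sketch in Python, using the half-life quoted above:

import math

HALF_LIFE = 5730.0  # years

def simple_age(ppt):
    # Years since death, given the 14C concentration of the remains in
    # parts per trillion, assuming the living concentration was 1 ppt.
    return -HALF_LIFE * math.log(ppt) / math.log(2.0)

print(simple_age(0.5))    # 5730.0
print(simple_age(0.25))   # 11460.0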
Ages are conventionally reported together with the standard deviation of the laboratory 14C measurement, e.g. 900±25 14C BP (14C-dated, years Before Present). The true standard deviation, though, will often be larger than what is reported, due to non-laboratory sources of error — e.g. the admixture of impurities with the remains.
Although a tree may live for hundreds, even thousands, of years, each ring of a tree absorbs carbon only during the year in which it grows. The year in which a ring was grown can be determined exactly (by counting); so radiocarbon dating can be tested by measuring the 14C concentrations in old tree rings. Such testing found errors of up to several centuries. It turns out that the concentration of 14C in the carbon of the surface atmosphere has not been a constant 1 ppt, but has varied with time. Thus the simple calculation of age from 14C concentration is unreliable.
Tree rings, though, also provide a solution to this problem. The concentration of 14C in the carbon of an organism’s remains can be compared with the concentrations in tree rings. Tree rings that match, within confidence limits, give the years in which the organism could have plausibly died.
The matching procedure thus provides calibration of 14C concentrations. (Calibration via tree rings, though, does not extend back 50 000 years; other ways of calibrating are therefore being developed.) Ages that are estimated without calibration continue to be reported, and are called “uncalibrated 14C ages”, or simply “14C ages”.
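The matching step itself is simple in outline. A minimal sketch, with made-up illustrative values, that ignores the measurement errors on the rings themselves:

# tree_rings maps a known calendar year to the measured 14C age of its ring.
tree_rings = {1000: 1105, 1001: 1098, 1002: 1101, 1003: 1080, 1004: 1062}

sample_age, sample_sigma = 1100.0, 15.0   # uncalibrated 14C age of the sample

# Keep the years whose ring ages lie within about two standard deviations
# of the sample's measured age: the years in which death was plausible.
plausible_years = [y for y, a in tree_rings.items()
                   if abs(a - sample_age) <= 2 * sample_sigma]
print(plausible_years)   # [1000, 1001, 1002, 1003]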
Appendix 2: Some correspondence on the Ward & Wilson method
From: D.J. Keenan
Sent: 05 February 2014 18:20
To: Christopher Bronk Ramsey
Cc: Malcolm Xxxxxx; Tiziano Xxxxxxxx
Subject: Misuse of Ward & Wilson test
Dear Christopher,
Malcolm forwarded to me a copy of some correspondence between you and Tiziano. In the correspondence, you claim that “if all the samples are short-lived from the same year we would expect them to all have the same radiocarbon composition and thus act as if they were all from the same sample”. The claim is clearly false for samples from Thera. You are aware of the issue of misusing the test of Ward & Wilson: you and I have discussed it before by e-mail; the issue is treated in detail in my NPG [2012] paper, which you read. I note that you have been misusing the Ward & Wilson test in some of your published papers, e.g. in Science. Correcting the problem would lead to a wider date range. I ask if you intend to publish corrigenda for those papers.
Sincerely, Doug
____________________________________
From: Christopher Ramsey
Sent: 06 February 2014 00:03
To: D.J. Keenan
Cc: Malcolm Xxxxxx; Tiziano Xxxxxxxx
Subject: Re: Misuse of Ward & Wilson test
Doug
There are several different issues here which are not that simple.
1. There is the Ward and Wilson test, formally for subsamples of the same material. This is a more stringent test than you would apply for samples of material spreading over some years. If a set of dates pass this test, it clearly does not mean that they are all from the same year, nor that they are from the same sample. However, it does indicate that the scatter is not greater than that you would expect for material from a short-lived horizon.
2. There is the combination method used. Which method you used does depend on what you think about the samples - which ultimately is a matter of opinion. If you assume the samples might be a range of different ages there is not a simple solution to this. The distribution of ages is almost certainly non-normal - so any classical statistical method, based on normality assumptions will only be an approximation - and possibly not a very good one. I have suggested Tiziano some other ways he might think about this.
3. In terms of the publications - the papers all have the primary data and so anyone can reanalyse the data using different assumptions. This is what I assume Tiziano will be doing, amongst other things. There have of course already been many papers, book chapters etc putting different interpretations on the data, and also looking at models that exclude the Thera VDL data altogether. I'm sure there is scope for more of this. In the end there are not perfect solutions to any of these. The real situation is quite complicated, the range of possibilities to be entertained is quite large, and there is no statistical model which will incorporate all of this information.
In the end "all models are wrong but some are useful" - which applies to classical statistical models as well as Bayesian ones. A good approach is probably one which looks at robustness - how much do the results change under different assumptions.
However, as I said to Tiziano (which you probably heard), I don't think the details of the statistical methods really address the main issue here. If you think the eruption is - say 1520 BC, then you cannot explain the radiocarbon data just by using slightly different statistical models.
Anyway - I think it is worth Tiziano investigating these ideas in some depth - without too much badgering by the rest of us. I'm happy to answer his queries if he wants any suggestions - but also think that it would be worth him discussing these things with people who have no particular interest in this particular research. I don't think the adversarial tone, which these discussions sometimes descend into, is particularly useful.
Best wishes
Christopher
There is a response to Doug's critique here.
Reader Comments (58)
Note that for a mathematical derivation, there cannot be opinions about whether the derivation is valid, any more than there can be opinions about whether an addition, or multiplication, is valid; rather, validity is absolute
That's why I love maths.
I cannot begin to imagine the level of frustration that must be boiling inside of Doug Keenan's brain.
Clearly, denial is not just a river in Egypt!
The really interesting part will be Oxford's formal response. Will they actually take appropriate actions or will they emulate UEA and UWA?
They wish that the issue would go away (ie that Doug Keenan would shut up). As we've seen in the case of the Met Office statistics, that is not going to happen.
The owners of Oxcal would be well advised to do the following without delay:
- Confirm the existence of the error.
- Evaluate its significance in [1] the 5% worst cases [2] typical cases [3] the 5% of cases where the error is least.
- Issue a version of the application with the error corrected.
- Issue some means for correcting as best can be done in cases where the original data is no longer available but the published results obtained via Oxcal are out there.
- Have independent testing and verification of the foregoing things.
- Ensure full publicity via the carbon dating journals with advice to users of Oxcal and users of results obtained using it.
I somehow suspect that this is not going to happen in the immediate future.
Stand by for the closing of the academic ranks...
Has Doug Keenan contacted the NERC Radiocarbon Facility? It ought to be interested in the problems he described.
NERC Radiocarbon Facility
http://www.c14.org.uk/
"The NERC radiocarbon facility is administered by the Natural Environment Research Council (NERC) and provides funded radiocarbon dating research services to the academic community served by NERC and the Arts and Humanities Research Council (AHRC)."
If the NERC Radiocarbon Facility is not interested in the matter then perhaps he should contact his MP to point out that taxpayers' money is being wasted by two research councils, NERC and AHRC, that are funding research projects that use computer software that produces erroneous results.
I'm a retired casualty actuary. Almost all the work in this field involves assumptions that are not exactly correct. For any published article, another actuary might be able to find a more precise way to handle a particular aspect. Such a discovery would not call for admission of error, etc. It would allow another actuary to write a paper with a more precise formula, if he wanted to.
According to Ramsey, all his data is available, so Martin A could himself presumably do the tasks he sets out for Ramsey.
There are some academics who use carbon-dating techniques who believe that Keenan does not know what he is talking about. It could be the case that his vendetta against Ramsey comes into a similar category.
http://quantpalaeo.wordpress.com/2013/11/01/radiocarbon-calibration-keenan-2012/
i just don't see "misconduct".
misconduct would be something like implementing the programme, in the full knowledge that it is wrong.
here, doug has pointed out, and published, an error in a method. Fine. All is well and good, but it is not up to Doug to determine the allocation of resource in Ramsey's group to correct a bit of software. As to whether Ramsey completely misunderstands Doug's point about the maths- well, i don't know about that. But were it to be true, then it is clear that making a mistake, failing to comprehend, or indeed, being stupid, are not research misconduct.
As for the second case, it is an implemented, published method, with all the caveats of this method. So long as it is in the published domain (and it appears it is), i don't see anything which comes close to research misconduct. If there is an inappropriate assumption in there, and you give unjustified precision for your age range, then all of the assumptions are public and people can follow what you have done.
Statistical errors in science are surprisingly common, and sometimes serious. I think it counter-productive to shout misconduct when there isn't a hope of making it stick.
yours
per
I believe that you are wrong, and Ramsey is right, at least regarding your first point:
Assume that there are only three calendar years: 9, 10, 11. Additionally, assume that those years have radiocarbon ages 110, 100, 100, with the standard deviations being zero (i.e. the radiocarbon ages are exact). Suppose that the sample’s measurement has a probability distribution with Pr(age = 110) = 1/2 and Pr(age = 100) = 1/2. What is the probability that the sample is from year 9?
To answer the question, note that “year = 9” is true if and only if “age = 110” is true; so Pr(year = 9) = Pr(age = 110). Thus, the answer is 1/2. The method used by the OxCal program, however, does not give that answer, but instead gives 1/3.
You are of course correct about the mathematics you present above. If Pr(age=110)=1/2 and age=110 happens in year 9 and only year 9, then Pr(year=9)=1/2. The problem is that the information you have is almost certainly not correctly expressed as Pr(age=110), by which I assume you really mean the conditional probability, Pr(age=110|sample).
Such a probability statement could not come solely from an analysis of the sample. It would also be necessary to specify a prior probability that age=110. Non-Bayesian statisticians wouldn't be willing to even assign a meaning to the statement that Pr(age=110|sample)=1/2.
What you likely meant to say - and in any case should have said - is that the probability of the observed data would be the same if age=110 as it would be if age=100. That is the likelihoods for age=110 and age=100 are the same. Note that "likelihood" is a technical term in statistics, which is not a synonym for "probability". To get the posterior probability, Pr(age=110|sample), by using Bayes' Rule, you need to multiply these equal likelihoods by the prior probabilities of age=110 and age=100, then divide by the factor necessary to get the probabilities for age=110 and age=100 (the only possibilities) to sum to one.
So what should the prior probabilities be for age=110 and age=100? Absent any specific prior knowledge of the situation that favours some years over others, the only reasonable thing to do is assign prior probability 1/3 to age=110 and 2/3 to age=100, since two years result in age=100 but only one year results in age=110. If you then apply Bayes' Rule, you will find that Pr(age=110|sample)=1/3, and all three years have posterior probability 1/3, as they ought to, since with the assumptions made, the sample is uninformative about the year.
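For concreteness, that computation can be sketched in a few lines of Python (using the numbers from the example above):

years = {9: 110, 10: 100, 11: 100}          # calendar year -> radiocarbon age
prior_year = {y: 1.0 / 3 for y in years}    # uniform prior over the three years

# Equal likelihoods: the measurement is equally probable whether the
# true age is 110 or 100.
likelihood = {110: 0.5, 100: 0.5}

unnorm = {y: prior_year[y] * likelihood[a] for y, a in years.items()}
z = sum(unnorm.values())
posterior = {y: p / z for y, p in unnorm.items()}
print(posterior)   # 1/3 for each year: the sample is uninformative here

Summing the posterior over the single year with age 110 gives Pr(age=110|sample) = 1/3, as claimed.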
"Assume that there are only three calendar years: 9, 10, 11. Additionally, assume that those years have radiocarbon ages 110, 100, 100, with the standard deviations being zero (i.e. the radiocarbon ages are exact)."
I only got this far. When I assumed there were 3 calendar years, the fact they have exact radiocarbon ages 110, 100, 100 means there are only 2 calendar years, with 10 & 11 being the same year.
What have I misunderstood?
@ diogenes, 7:57 PM
Here is a quote from Telford’s post: “[Keenan’s paper] makes the implicit assumption that every radiocarbon date is a priori equally likely”. The post then explains why that implies my paper is incorrect—essentially, because it is calendar years (not radiocarbon ages) that should be a priori equally likely.
Here is what my paper actually says: “Choose a non-empty finite set T⊂ℤ, to represent the possible calendar years … assuming a uniform prior distribution on T (i.e. the calendar years are a priori equally probable)…”. In other words, my paper does things as Telford says they should be done.
Simply put, Telford accuses me of saying something that I did not say, and then criticizes me for that. He has done similar things before.
@ per, 8:19 PM
Ramsey is doing analyses that are invalid and that he knows are invalid. Hence my report to the University of Oxford.
What method would you recommend, for getting scientists to cease doing analyses that are known to be invalid?
I think it means year 9 gives a reading of 110 on their carbon-dating machine, and years 10 and 11 both give readings of 100. So if the machine says 100, the sample could be from year 10 or 11.
So if you took lots of samples of wood from an old preserved tree that you wanted to know the age of, and the samples all gave slightly different ages when you tried to carbon date them, you would get Ramsey's computer program to work out the age of the tree from the samples.
And the Pr(age) bit meant that if there were as many samples with a reading of 100 as there were with a reading of 110, the computer program would think there was only a 1 in 3 chance the tree was from year 9, when really it would be 50:50.
What have I misunderstood?
Mar 30, 2014 at 8:51 PM | Unregistered Commenter son of mulder
My recollection is a bit hazy but I think that it's possible for material originating in different actual years to have the same indicated radiocarbon age, because of inherent ambiguity in the method, not because of random error. I read DK's paper a year or two back and I think that makes it clear.
The graph in the paper referred to by diogenes shows how the ambiguity can arise.
http://quantpalaeo.files.wordpress.com/2013/10/oxcal-calibration.png
Doug
This was exceptionally interesting. A high-profile case of radiocarbon dating is surely the Turin shroud, where wildly different answers are given with each new test, which either makes the shroud potentially authentic at 2000 years old or a medieval fake at only 600 years old.
Applying this to climate change: if the calculations are incorrect, will both tree rings and moss surely have margins of error that render them useless as an indicator of what the climate was doing 1000 years ago, in the case of tree rings, and tens of thousands of years ago, in the case of moss?
Tonyb
"Ramsey is doing analyses that are invalid and that he knows are invalid."
"Ramsay claimed that he was still not convinced."
i don't see direct evidence that he knows they are invalid. He can fail to understand, be too stupid to understand, take a different view from you, think your position won't make any difference even if there is an error, lack funds or priority to implement this in a new programme, or think another area is more deserving of his attention; any of these things would make it a non-misconduct issue.
"What method would you recommend, for getting scientists to cease doing analyses that are known to be invalid?"
you have published a paper, which relates to this. This seems the appropriate way to go.
The unfortunate issue is that if you level very serious charges, when there isn't a hope of making them stick, it opens a credibility issue, and that detracts from any contribution you make.
yours
per
Maybe the revered professorship is following Julia Slingo's UK Met Office scientific doctrine and intends to skip pesky statistics altogether?
"we look at the ensemble of the observations from many sources, and conflate that with simulations.."
Or something like that.
The burden of proof for misconduct is a high bar. Like per, I don't see it being met.
Oxford's formal response to the allegation of research misconduct by Christopher Bronk Ramsey will be the same as UEA's and UWA's and all the others supporting fraud. The fingers must stay in the dyke as long as possible while the funding is still flowing.
diogenes wrote:
"There are some academics who who use carbon-dating techniques who believe that Keenan does not know what he is talking about. "
I read Doug's radiocarbon dating paper some time ago. It seemed to me quite obviously to be correct, and the standard method he criticised to be quite obviously wrong.
Although the arguments here aren't couched in Bayesian terms, Doug's analysis is closely related to an objective Bayesian approach. Given all the stupidity about the use of subjective Bayesian approaches with inappropriate uniform etc. priors in climate science, and the inability of many (almost all?) climate scientists to understand the issues properly, I'm afraid Ramsey's behaviour doesn't greatly surprise me.
I do hope Doug's actions succeed in shaking things up.
@ son of mulder, 8:51 PM
As others have noted, different calendar years can indeed have the same radiocarbon age. For illustrations, see the blue lines in Figure 2 and Figure 4 of my paper.
Appendix 1 should have made this issue clear. Thanks for pointing it out.
@ Radford Neal, 8:36 PM
(There seems to be an issue with the blog software. At the time that I left my prior comment, 8:54 PM, your comment was not displaying. Hence I am only replying now.)
A laboratory radiocarbon measurement does produce a (Gaussian) distribution. The prior distributions are specified, with calendar years being a priori equally likely. And Bayes’ theorem is indeed invoked, based on those distributions: for details, see §3.1 of my paper. Additionally, expressions such as Pr(age<110 | measurement) are accepted by all researchers in the field.
It is trivial to demonstrate that Keenan (2012) is seriously flawed. See for example
http://quantpalaeo.wordpress.com/2013/11/01/radiocarbon-calibration-keenan-2012/
Nobody who uses radiocarbon is going to take Keenan (2012) seriously, so it is not worth the effort of submitting a comment to NPG
@ Douglas J. Keenan, 11:37pm
I've looked more closely at your paper, through the second paragraph of section 4.1, which seems sufficient.
One problem in the paper is that you seem to use the word "age" with two meanings. You need to distinguish the "true age", which is directly related to the true amount of carbon-14 in the sample, from the "measured age", which is the true age plus whatever error the measurement procedure introduced. This may seem trivial, but I think it may be the main source of your confusion.
At the start of section 3.1, you refer to the second input to the procedure being "the sample’s radiocarbon measurement, i.e. a Gaussian distribution for the sample’s 14C age". As mentioned in my previous comment above, this is not correct terminology. This Gaussian-shaped object (plotted in red in Fig. 1) is a likelihood function for the true age, not a probability distribution over age (either true or measured). This likelihood function is the probability density of the measured age, seen as a function of the true age.
Your expression for Pr(age = a|year = y) in equation (1) is correct only if by "age" you mean "measured age". The Gaussian distribution c(y) corresponding to the calendar year y is the distribution for measured age, when the true age is the one associated with year y by the calibration curve. Your equation (2), for Pr(year = y|age = a), is also correct if by "age" you mean "measured age", and you assume a uniform prior for year. This distribution for year given what was measured is what should be the output of the procedure, and it appears to be what the existing programs output, as shown in Fig. 1.
Where things go wrong in your paper is at equation (3), where you take the correct result of equation (2) and turn it into an incorrect result. There may be a typo in this equation (with p_t(a) repeated), but it is wrong in any case, since the right answer was already obtained in equation (2). You seem to have arrived at equation (3) by first of all confusing the likelihood function for true age with a probability distribution over true age, and then confusing the measured age, which is what "age" in equation (2) should be interpreted as, with the true age, leading you to average the result of equation (2) with respect to an (incorrect) distribution for the true age. The result of this procedure can be completely wrong.
I don't know, I've only skimmed the paper quickly, but my first thought was that if the chronological year has a uniform prior, that induces a highly non-uniform (and non-Gaussian) prior on the radiocarbon age. This is then combined with the Gaussian measurement to yield a non-Gaussian posterior on the radiocarbon age which you can then use to weight your 'inner summations'.
Because plateaus in the calibration curve induce large peaks in the prior for the age, these are more heavily weighted in the post-measurement posterior than you might expect just from looking at the measurement error.
But as I said, I've only had a quick look, and might have got hold of the wrong end of the stick.
This matter is familiar to me; once, many years ago, the university department I did all the technical work for purchased a very expensive suite of tests for determining undergraduates' mechanical aptitude. Being the only technical person in an academic department, I was tasked with assessing the test before it went into use. I found a serious error in one visual test which arose from very careless depiction of the illustrated problem, which was, in reality, impossible. I informed the Prof and his reaction was swift and, to my mind, very proper in that he asked me to redraw the problem and have it checked by two groups of professional consulting mechanical engineers. My new drawings were passed and I received a letter of commendation from my Prof., who told me that he thought that I had assisted the university to 'dodge a bullet'. I cannot imagine any sensible head of department not wanting to sort problems in any form of testing, but, sadly, refusal to acknowledge problems appears to be a feature of Post-Normal Science.
per wrote:
Based on the information presented here, that is my opinion as well. One of the side effects of this sort of over-charging is that it makes it that much harder to get admission of what is actually wrong.
Having accepted that the ambient concentration of carbon-14 has not been conveniently constant over the last 50,000 applicable years, will the next discovery be that the ambient distribution of carbon-14 has not been constant either?
It's more than a thought question, because of the potential for circular argument: the need to use calibration to determine whether the calibration was accurate.
Mar 31, 2014 at 4:50 AM | Steve McIntyre
I don't know if you have experience with union bargaining negotiations, but these sorts of ambit claims are successfully lodged to create negotiating points in order to achieve the covert aims of the union.
I'm with per (and McIntyre) on this one. As far as the University is concerned this will be seen as a technical dispute between experts, and absolutely not something to get involved in.
According to Ramsey, all his data is available, so Martin A could himself presumably do the tasks he sets out for Ramsey.
Mar 30, 2014 at 7:41 PM David in Cal
It's possible that Martin A could do those tasks but he won't be doing so. As he is a non-user of radiocarbon dating methods, it's not his job. Even if he were to do so, it would not address the issue.
It's a fact of academic life that people will point out apparent errors in your published work. It takes time and effort to deal with but that, as they say, goes with the territory.
You have no option (if you are following normal professional standards) but to check to see whether or not there is actually an error. It takes time to do so and it often turns out to be a misunderstanding on the part of the person who has found the supposed error.
If there is an error, you thank the person politely (through gritted teeth) and confirm the correction, perhaps taking the opportunity to add additional comments that might be of interest to readers.
If there is in fact no error, you politely point out the nature of the misunderstanding, perhaps with examples to help other readers to see clearly what is involved.
Suppose that the owners of OxCal follow my suggested list of actions.
- Confirm the existence of the error.
If they accurately determine there is no error, that's the end of the story.
- Evaluate its significance in [1] the 5% worst cases [2] typical cases [3] the 5% of cases where the error is least.
If the error is real but, even in the worst case, it has negligible significance, that would then be the end of the story, perhaps with a brief published comment.
As I said, it's just a fact of academic life that people will report errors in your work. It's part of the scientific process to deal with such reports. Even if they are wrong, it takes time and effort to address what they say. That's just how it is.
___________________________
Having said all that, I'd comment that getting people to see and admit errors is more likely to succeed when they are offered a path to do so that does not involve eating large amounts of crow. The use of words such as 'misconduct' and 'perpetrator' makes their cooperation in verifying and correcting the error unlikely.
Doug Keenan,
Two comments:
1. I would support the views of Per/McIntyre/Jones that an accusation of misconduct here is not "proportionate", and risks antagonising without achieving your desired result.
2. "Note that for a mathematical derivation, there cannot be opinions about whether the derivation is valid, any more than there can be opinions about whether an addition, or multiplication, is valid; rather, validity is absolute."
I would agree with you except that here the difference may be arising from a difference in the conceptual assumptions underlying the analysis - specifically the derivation and nature of the Gaussian distribution of the radiocarbon measure.
It is not clear to me what this Gaussian animal is supposed to represent. Some labs claim a very high accuracy in the actual measurement of the isotope ratio - which translates into an age range far less than the distribution spread of several hundred years shown in your worked examples. These lab measurement errors seem to be typically quoted under the assumption of a known initial C14 concentration. The range in "your" Gaussian distribution clearly is not then just a simple reflection of measurement error. You need to be very clear in your own mind about what this Gaussian distribution purports to represent conceptually before pursuing the matter. And do keep an open mind about the possibility that you might actually be incorrect conceptually, even though your maths are bulletproof for your particular assumption set.
The comment by Nullius in Verba, at 1:47 AM, is similar to remarks that Nic Lewis e-mailed me last year. Since Nic has commented on this post, I hope he will agree with my copying those remarks.
@ Geoff Sherrington, 5:07 AM
Yes, I published a paper on regional variations in the journal Radiocarbon. The paper merely presents a hypothesis though; there is supporting evidence, but that evidence is not conclusive. This month, however, there was a paper in Antiquity that presents a similar hypothesis, with new supporting evidence. The author, Manfred Bietak, was unaware of my paper. He and I are each glad that the other came to such similar conclusions.
@ Radford Neal, 1:14 AM
The terminology is not due to me; rather, it is common in the field. The interpretation of the Gaussian distribution—resulting from the measurement—is apparently (I do not have competence at lab work) different from usual laboratory measurements, because the substance is radioactive. Your comment, as I understand it, assumes that measuring 14C is like measuring e.g. mass; that is incorrect.
For those who have indicated that my approach might be too strong or overdone, I will note that I have tried to do things in other ways. In particular, I e-mailed most statisticians who have been involved with radiocarbon; only two replied, and they did not give any comments. I have also tried hard (and politely) to persuade Ramsey, as described in the post—and that included offering to fund his own statistical consultant, which he refused.
Additionally, there are archaeologists who are extremely fed up with what Ramsey, and other radiocarbon scientists, are doing—e.g. the way measurements are combined. The background here is that Ramsey and colleagues have been trying to rewrite ancient history, based on their radiocarbon dates. The protestations of the archaeologists have been going on for decades, and have achieved little. Hence I ask the same question that I asked per: what should be done to persuade radiocarbon scientists to do the statistical analyses correctly?
"The fingers must stay in the dyke as long as possible while the funding is still flowing." That's no small feat while the index and middle fingers are gesturing "Run along with you."
Most things in life are less simple than you think.
A possible further twist to C14 dating.
http://www.ann-geophys.net/20/115/2002/angeo-20-115-2002.html
The ~ 2400-year cycle in atmospheric radiocarbon concentration: bispectrum of 14C data over the last 8000 years.
Can you not test the estimated age ranges provided by OxCal using samples of known provenance? If OxCal is systematically biased in its output, would this not be apparent using sets of samples of wood (etc.) from old buildings, boats, etc. for which the history is actually known? (Or is this one of those areas of science where validation is not permitted?)
@ Douglas J. Keenan, 1:59pm
The terminology is not due to me; rather, it is common in the field.
Whether they use correct terminology or not, the other people in the field seem to have obtained the right answer. You have not, I think partly because you do not understand the concept behind the terminology. A measurement alone cannot produce a probability distribution for the thing that was measured (with error). Such a probability distribution can be produced only when the measurement is combined with a prior distribution for what was measured. The result of the measurement alone is therefore properly expressed in terms of a likelihood function, giving the probability density of the measurement as a function of the true value, which is not the same as a probability density function for the true value. This common confusion comes about partly because if you assume a uniform prior, you get a posterior probability distribution for the thing measured by just re-scaling the likelihood function so it integrates to one. In this situation, however, assuming a uniform prior for the true C14 age (based on the true amount of C14 in the sample) is not reasonable when the calibration curve has a flat spot.
The interpretation of the Gaussian distribution—resulting from the measurement—is apparently (I do not have competence at lab work) different from usual laboratory measurements, because the substance is radioactive. Your comment, as I understand it, assumes that measuring 14C is like measuring e.g. mass; that is incorrect.
Frankly, this comment is ridiculous. Nothing in your derivation of your method makes any reference to mysterious aspects of carbon-14 measurement. If the measurement is done by counting radioactive decays, the count will be Poisson-distributed, but for a large number of counts, this is approximately Gaussian. In any case, whatever complications there are simply change the form of the likelihood function, without affecting the structure of the argument.
My previous comment explained in detail where you went wrong. You need to read it carefully, and then stop embarrassing yourself and damaging the scientific enterprise by making spurious claims of error, and worse, misconduct.
Doug,
"The comment by Nullius in Verba, at 1:47 AM, is similar to remarks that Nic Lewis e-mailed me last year. Since Nic has commented on this post, I hope he will agree with my copying those remarks."
I agree with Nic's sentiment, but it doesn't answer the point.
If the year is a priori uniform, and the calibration curve has plateaus, then the prior for the age must have peaks in it around where the plateaus are. If 20 different years give the same particular radiocarbon age, then that radiocarbon age is a priori 20 times more likely to be the true value than one that is only given by a single year. With such a lumpy prior, the posterior will be lumpy too - that nice Gaussian (or whatever) measurement will be heavily modified. And it's the posterior distribution, not the measurement distribution, that you need to use to weight the 'inner sums'.
To use your example of 9, 10 11 mapping to 110, 100, 100, the value 110 has a priori probability 1/3 and 100 has a priori probability 2/3. We make a measurement that is 50:50 between 100 and 110, and this measurement modifies the prior probabilities to posterior probabilities according to Bayes rule.
In the log-likelihood form, this is:
log[P(H1|O) / P(H2|O)] = log[P(H1) / P(H2)] + log[P(O|H1) / P(O|H2)]
Which says the posterior LLR in favour of hypothesis H1 over hypothesis H2 given the observation is equal to the prior LLR plus the information/evidence in the experiment in favour of H1 over H2.
Since P(O|H1) = P(O|H2) = 0.5, which is what your 50:50 measurement result really means, in fact the prior distribution is unmodified, and the probability of 110 is still 1/3, and the probability of 100 is still 2/3, and the probabilities of 9, 10, and 11 are still equal.
The problem seems to be that you are interpreting the 'measurement error' as a posterior error distribution on the measured result, which is only true if the prior on the measured quantity is uniform. (One could argue that it's wrong to call this thing the measurement error, but it's common usage, unfortunately.) Have a closer look at Radford Neal's comments, too - I think he's saying the same thing in different words.
I sympathise, I really do. But I'd look on it as an opportunity to show everyone how errors *ought* to be corrected in science.
I'd also agree that practitioners ought to have a deep enough understanding of what they're doing to have been able to explain to you what they think is wrong with your argument without it getting this far, rather than handwave and bluster about it, as they seem to have done.
A scientist who does not want to collaborate to get to the root of a flaw that someone points out in his little baby is NOT a scientist.
That's a simple fact.
As for the others who want to put this back in a box and not call it "misconduct": they are, I think, rusted into an old way of thinking about education, professors, highest respect, and the endless paper-pushing ceremony around it.
The new way is that the "professorship" gets his ass kicked around a bit (a lot) until he picks up the phone and contacts the statistics service of what still calls itself a university there, amongst the leftist shysters.
That's a way more in line with modern industries, presumably not -cough- mining, at least in the view of some pensioners.
A new, better-performing education (the present one is past its expiry date, and it STINKS) should be done WITHOUT professorships, without the endless paper flow to make PhDs, even without juveniles sitting years on end on campuses.
Cheers.
Radford Neal and Nullius in Verba, very much and very kind thanks. I have struggled to fully follow what you are saying, in part because I have little familiarity with the Bayesian paradigm. I want to think about this further, and will comment again tomorrow.
_____________________
Regarding the Bayesian paradigm, I have shied away from it for the following reason. The Bayesian paradigm implies using the Bayes factor to compare statistical models; and the Bayes factor is asymptotically equal to BIC, in general [Robert, Bayesian Choice, 2007]. BIC and AIC are mutually and materially inconsistent—in general, and certainly asymptotically. AIC is essentially just a reformulation of the Second Law of Thermodynamics, which states that entropy is always maximized; indeed, Akaike originally called his approach an “entropy maximization principle” [Burnham & Anderson, Model Selection, 2002]. Thus, the Bayesian paradigm seems to be generally asymptotically inconsistent with the Second Law.
Is that valid? If anyone can comment, I would greatly appreciate it.
Ah, AIC as “entropy maximization principle”. That rings ancient, exciting bells. But I know nothing of the BIC, sorry. I equally look forward to expert commentary.
How long before Doug Keenan realises he has to issue an apology and a corrigendum?
http://quantpalaeo.wordpress.com/2014/04/01/keenans-accusations-of-research-misconduct/
@ Douglas J. Keenan 11:54pm
Regarding the Bayesian paradigm, I have shied away from it for the following reason. The Bayesian paradigm implies using the Bayes factor to compare statistical models;
This is true in a way, but only in so far as you think you have narrowed down your question to which of several models is correct, and only if you have put careful thought into the prior distributions for parameters in each of the models, to which the Bayes factor is very sensitive. In practice, it is usually best to avoid comparing models by their Bayes factors. For one thing, we often think that none of the models is all that good. There's no justification for using the Bayes factor to choose the least bad of several models that are all pretty bad, since which is the least bad depends on what you plan to use the model for, and this isn't an input to the Bayes factor.
Of course, Bayes factors and model comparison are not relevant to the issue in this post.
and the Bayes factor is asymptotically equal to BIC, in general [Robert, Bayesian Choice, 2007].
You have to be careful in interpreting this result. The Bayes Factor is asymptotically equal to the BIC times a factor which could be anything at all. This is a pretty weak sense of "equal". In particular, the BIC totally ignores whatever priors you used, and since real Bayes factors are very sensitive to the priors, you can see that the justification for using BIC is hardly solid.
BIC and AIC are mutually and materially inconsistent—in general, and certainly asymptotically. AIC is essentially just a reformulation of the Second Law of Thermodynamics, which states that entropy is always maximized; indeed, Akaike originally called his approach an “entropy maximization principle” [Burnham & Anderson, Model Selection, 2002]. Thus, the Bayesian paradigm seems to be generally asymptotically inconsistent with the Second Law.
Learning statistics from (some) physicists can be hazardous. You should be suspicious of any claim that results from statistical physics imply something about principles of statistical inference. The same goes for results from information theory. There are some connections amongst these fields, but these mostly take the form of either technical mathematical tools that happen to be useful in more than one field, or analogies that may sometimes help develop intuitions, but which should not be taken too seriously.
Doug,
I'm not a specialist in this area, but BIC and AIC are a different sort of thing to the Bayes rule. BIC and AIC are ways of taking into account model complexity when choosing which model fits data better, and are essentially applying statistics to express how likely a particular model structure is. The BIC and AIC can be derived in a common framework, just assuming different priors. The AIC is based on the more uninformative prior, and is based more closely on pure information theory principles, while the BIC, despite its general-sounding name, is specific to a particular exponential family of models.
However, neither of them is the same thing as Bayes rule, which is really just a geometric theorem in measure theory - an area of mathematics generalising things like areas and volumes - applied to probabilities. It's the basic principle by which we interpret experimental evidence probabilistically.
The basic rule is essentially derived from the definition of conditional probability: P(A|B) = P(A & B) / P(B). If we want to swap the terms around, we do P(B|A) = P(A & B) / P(A) = (P(A & B) / P(B)) * (P(B) / P(A)) = P(A|B) P(B) / P(A).
We apply it thus:
P(Hypothesis|Observation) = P(Observation|Hypothesis) P(Hypothesis) / P(Observation)
where P(Hypothesis) is the probability of the hypothesis independent of any observations we might make - i.e. our prior belief in the hypothesis, before we start the experiment. P(Observation) is the tricky part, and is usually unknown. Sometimes we can get away without it by working with a complete set of mutually exclusive hypotheses and normalising probabilities to add up to 1. But there's another trick for cancelling it out, which is to compare our hypothesis H1 to an alternative hypothesis H2.
We write:
P(H1|Obs) = P(Obs|H1) P(H1) / P(Obs)
and
P(H2|Obs) = P(Obs|H2) P(H2) / P(Obs)
and divide one by the other. P(Obs) cancels out, and we get:
P(H1|Obs) / P(H2|Obs) = (P(Obs|H1) / P(Obs|H2)) * (P(H1) / P(H2))
The quantity P(Obs|H1) / P(Obs|H2) is called the likelihood ratio, and measures how strongly the observation favours H1 over H2. (We can also choose H2 = "not H1" and just ask whether H1 is more likely true than not, but the theorem is more general than that.) The left hand side is the odds in favour of H1 after making the observation. The right hand side is the product of two terms, one of them depending only on the experimental design, and the other our prior odds.
We can help the intuition by taking logarithms, to convert the multiplication to an addition. This is the log-likelihood ratio, or the information in the quantity. Then our theorem can be interpreted as:
(Information about the hypotheses after the observation) = (Evidence in the observation) plus (Information about the hypotheses before the observation). This makes intuitive sense - an experiment adds or subtracts confidence in a conclusion, but if the conclusion started off very unlikely, we need stronger evidence before we believe it. This is the essence of Sagan's dictum: 'Extraordinary claims require extraordinary evidence'.
We start off at some point on the confidence scale, and experimental observations move the slider up or down by some fixed-size chunk, but where you end up depends on where you started. An experiment does not prove the hypothesis true or false with some probability, it only adds or subtracts evidence in its favour. This is one of the most common misunderstandings of experimental results - passing a 95% confidence test does NOT tell you that the conclusion is true with 95% probability. It tells you that you have added a large chunk of evidence such that, if you started off at the zero point of the scale (0 information = 50% probability), then this would move you as far as 95%.
If the observations are independent (i.e. P(O1 & O2) = P(O1)*P(O2)), you can even chain observations together to yield combined evidence by simply adding all the chunks of information together. (You have to be a bit more careful if the evidence isn't independent, though.)
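For concreteness, the additive bookkeeping can be sketched in Python with made-up numbers (working in bits, i.e. base-2 logarithms):

def posterior_prob(prior_llr_bits, evidence_bits):
    # Combine a prior log-likelihood ratio with independent chunks of
    # evidence (all in bits), then convert back to a probability for H1.
    total = prior_llr_bits + sum(evidence_bits)
    odds = 2.0 ** total                 # posterior odds P(H1|Obs) : P(H2|Obs)
    return odds / (1.0 + odds)

# Start sceptical (prior odds 1:8, i.e. -3 bits); two independent
# observations each contribute +2 bits of evidence for H1:
print(posterior_prob(-3.0, [2.0, 2.0]))   # odds 2:1, probability 2/3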
And yes, information is essentially the same thing as entropy. This stuff is all tied in to the second law.
Bayes' theorem is common to all forms of probability theory, and is an essential thing to understand. The Bayesian approach on the other hand is a particular way of using and interpreting Bayes' theorem, and there is indeed some disagreement here. The fundamental problem is how - in the real world - can we justify our choice of prior? Empirical evidence can only ever add or subtract, it can never tell us where to start. We can argue on symmetry grounds in some special cases, but there are always unjustified assumptions lurking around somewhere. It's what the philosophers call "the problem of induction".
One way of thinking about it is to divide it into two distinct concepts: 'Bayesian probability' and 'Bayesian belief'. Bayesian probability is the ontological, objective, true probability - something we have no physical access to. Bayesian belief is the epistemological 'what we know' sort of probability we derive from experiments, and is subjective. The Bayesian belief probability of an event depends on what you know, and can take different numeric values for different people simultaneously. The two follow the same algebraic rules, but we can only access objective probabilities in theoretical models, in the real world we only have access to beliefs. By calling it 'belief' rather than 'probability', we avoid a lot of the problematic intuitions that lead many astray. Even so, it's still tricky philosophically, and you have to keep your wits about you!
I must apologise for wittering on at such length - I do so enjoy this stuff! - but the relevance for your problem is that the experimental result by convention quotes the measurement errors in the radiocarbon age effectively assuming a uniform prior on the measured quantity. They're isolating the bit that is just to do with the experiment itself, the P(Obs|H(x)) where H(x) is the hypothesis that the value being observed is x. This is common practice, because it allows you to more easily combine it with whatever actual prior you might have: the P(H(x)) you need to be able to figure out the posterior P(H(x)|Obs).
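To tie this back to calibration: here is a minimal sketch, with a made-up calibration curve containing a plateau, of how a uniform prior on calendar year combines with the Gaussian measurement likelihood. It mirrors the reasoning above; it is not OxCal's actual algorithm.

import math

# Toy curve: a shallow plateau for years 0-39, a steep section afterwards.
cal = {y: (3000 - 0.5 * y if y < 40 else 2980 - 2.0 * (y - 40))
       for y in range(80)}

measured, sigma = 2980.0, 10.0   # measured 14C age and its standard deviation

def likelihood(y):
    # Probability density (up to a constant) of the measured age,
    # given that the true age is cal[y].
    return math.exp(-((cal[y] - measured) ** 2) / (2 * sigma ** 2))

# Uniform prior on calendar year: posterior weights are the likelihoods.
unnorm = {y: likelihood(y) for y in cal}
z = sum(unnorm.values())
posterior = {y: w / z for y, w in unnorm.items()}

# The plateau years collectively soak up most of the posterior mass, even
# though no single year stands out - the 'lumpiness' described above.
print(round(sum(p for y, p in posterior.items() if y < 40), 3))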
Having now looked at Bronk Ramsey's "Bayesian analysis of radiocarbon dates" 2009 paper, it is evident that the situation is more complex than I initially thought from looking at Doug Keenan's paper and the reproduction therein of OxCal and Calib calibration graphs. The comments below accordingly supersede my earlier ones.
The narrowness of the calibration curves in the calibration graphs confirms that they must relate the 'true' calendar year to the 'true' radiocarbon determination date. As no information to the contrary is given, it follows that the y-axis of the OxCal and Calib calibration graphs is the 'true' radiocarbon determination date, not the 'measured' date, for the Gaussian probability density shown as well as for the calibration curve. That being so, that probability density must represent a Bayesian posterior density for the 'true' radiocarbon determination date. Given the assumption of Gaussian distributed measurement etc. errors, a Gaussian posterior density for the 'true' radiocarbon date follows from the applicable standard noninformative prior – here uniform in radiocarbon date.
The Calib manual (http://calib.qub.ac.uk/calib/manual/chapter1.html) gives the following explanation of the calibration procedure:
"The probability distribution P(R) of the radiocarbon ages R around the radiocarbon age U is assumed normal with a standard deviation equal to the square root of the total sigma (defined below). Replacing R with the calibration curve g(T), P(R) is defined as:
P(R) = exp{-[g(T)-U]²/2σ²}/[σ sqrt(2π)]
[in words, the probability density of R, substituted in the RHS by g(T), is Gaussian with mean U and standard deviation sigma]
To obtain P(T), the probability distribution along the calendar year axis, the P(R) function is transformed to calendar year dependency by determining g(T) for each calendar year and transferring the corresponding probability portion of the distribution to the T axis. "
The manual does not state explicitly whether U is the measured radiocarbon age and R the 'true' radiocarbon age or vice versa. But since the calibration curve must relate the 'true' radiocarbon age to the true calendar age, R must be the 'true' radiocarbon age. So this formula must represent the posterior probability density for the 'true' radiocarbon age. If it were instead a likelihood function then it would be a probability density for U, not one for R even though expressed as a function of R (represented by g(T)) for a given U.
On the above bases, Doug Keenan's analysis makes sense to me. If one ignores the (small) width of the calibration curve and the fact that it is not monotonically declining, his results would follow from the standard method of converting a probability density upon a change of variables, using the Jacobian determinant.
However, both the y-axis labelling of the OxCal graph and the explanation in the Calib manual seem to be inconsistent with the underlying approach set out in Ramsey's paper. In essence, he asserts therein that a uniform prior in calendar year should be assumed for the true calendar year:
"If we only have a single event, we normally take the prior for the date of the event to be uniform".
That translates into a highly informative, non-uniform prior for the 'true' radiocarbon date as inferred from the measured radiocarbon date. Applying Bayes' theorem in the usual way, the posterior density for the 'true' radiocarbon date will then be non-Gaussian. This approach represents a respectable procedure on the basis of the prior set out in the paper – whether or not it is an appropriate one – but is IMO not what the calibration program documentation and output imply. Neither the prior used nor its justification is explained in them.
If one assumes a uniform prior for the true calendar date, then Doug Keenan's results do not follow from standard Bayesian theory. In my view Ramsey's method doesn't represent best scientific inference either. However, the issues are complex and, predicated on his assumed uniform prior in true calendar year, Ramsey seems to have followed a perfectly defensible approach. So I don't see that it can be a matter of misconduct – rather it is an issue of poor program documentation, including key assumptions only being set out in referenced papers (and not necessarily being justified).
The statistical issues involved certainly merit further debate. Even assuming there is genuine and objective probabilistic prior information as to the true calendar year, I don't think a simple use of Bayes' theorem to update that information using the likelihood from the radiocarbon measurement will give an objective posterior probability density for the true calendar year based on the combined information in that measurement and in the prior distribution. I recommend using instead the modified form of Bayesian updating set out in my paper at http://arxiv.org/abs/1308.2791 .
Apologies for the length of this comment!
Nic, no need to apologise for a long comment if it is full of substance.
Once again, BH has helped at least this scientifically challenged person to learn something.
Radford Neal and Nullius in Verba, I accept and agree with what you say about how the sample measurement should be interpreted as probability, and that this implies that my criticism of the calibration method is invalid. I thank you very much for explaining this.
About thermodynamics, I appreciate that that was going off on a tangent, but I thought that I should ask for your insights while I had the opportunity. This, too, is really appreciated.
My background is that I used to work in mathematical research groups on Wall Street and in the City of London. There, I was introduced to time series, but most of my statistics is self-taught. (Time series are common in finance, and many of the best people are there, because of the salaries.)
Again, thank you kindly.
Douglas,
I'm glad we have had a productive discussion. Let me know if you want any further clarification. (You should be able to easily find me with a web search.)
Radford