Thursday, Nov 4, 2010
by Bishop Hill
Why climate scientists don't release code
John Graham-Cumming has picked up on an article in the magazine of the Association for Computing Machinery, which looks at the question of scientists releasing their code (or not). JG-C makes some interesting comparisons between the reasons for withholding code given by the Real Climate guys and the reasons identified by the ACM.
Reader Comments (41)
I got asked on another site a couple of questions that I figured I'd reproduce here:
Why are we singling out climate scientists here?
Because this recent rash of articles is a result of "ClimateGate". Clearly the issues raised are more general.
And why dismiss so casually the argument that running the code used to generate a paper's result provides no actual independent verification of that result? How does running the same buggy code and getting the same buggy result help anyone?
I think it's a bogus argument because it's one scientist deciding to protect another scientist from doing something silly. I like your argument about the code base's bugs propagating but I don't buy it. If you look at CRUTEM3 you'll see that hidden, buggy code from the Met Office has resulted in erroneous _data_ propagating through the field even though there was a detailed description of the algorithm available (http://blog.jgc.org/2010/04/met-office-confirms-that-station-errors.html). It would have been far easier to fix that problem had the source code been available. It was only when an enthusiastic amateur (myself) reproduced the algorithm in the paper that the bug was discovered.
My rule is to release the data and the code upon acceptance of the paper. That indeed allows other people to write papers based on my work. However, I know the data and code much better than they do. If I cannot beat them to publishing the next idea, my career should come to an end.
I'm a computing academic and would hope that every journal would demand the code for any paper that employed a program, just on scientific-validation grounds. However, there are other reasons. First, natural language is notoriously bad at describing an algorithm; this is why clued-in system developers use semi-formal notations. The code is an exact description of an algorithm which has no ambiguity and, even though it might not be documented, it can be examined for the algorithm. Also, statistical procedures are difficult to describe and, as before, the code provides an unambiguous narrative. There are a small number of studies which indicate that scientific papers are getting more and more difficult to understand. The code is a necessary adjunct which helps understanding; it would save a fellow researcher so much time if they had the code rather than carrying out a number of implementations based on a misinterpretation of a poor description. So, to the researcher who says 'I've documented the algorithm, that's all you need', I would say: look at software engineering and the problems it faces with respect to natural language description.
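To make the ambiguity point concrete, here is a hypothetical example of my own (not taken from any actual paper). A description like "average the monthly values over the 1961-90 baseline" admits at least two implementations that give different numbers on the same data; only the code pins down which one was actually run:

    # Reading 1: subtract a single overall baseline mean from every value.
    def anomaly_overall(monthly, baseline):
        base_mean = sum(baseline) / len(baseline)
        return [x - base_mean for x in monthly]

    # Reading 2: subtract a separate baseline mean for each calendar month.
    def anomaly_by_month(monthly, baseline_by_month):
        means = [sum(m) / len(m) for m in baseline_by_month]
        return [x - means[i % 12] for i, x in enumerate(monthly)]

Both are defensible readings of the same sentence, and both will typically run without error, which is exactly why prose alone is not enough.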
John
The other reason that climate scientists are being singled out is the public policy ramifications of their work. I would have thought that healthcare related research would demand similar levels of scrutiny.
Good article.
Climate scientists are surely intelligent people? Why does it come as a surprise to them that people should demand new standards of transparency? Climate science is no longer a backwater science where they needed to jealously guard their techniques to gain some small notoriety. Now they can leap into international fame just by being picked up by the IPCC or the Guardian. Nobel prizes, grant money, TV exposure, movie references, dinner with Al Gore. With new rewards there must come new responsibilities. How long (if ever) will it take them to restructure climate science to reflect its apparent importance? Or in other words, can you believe the World is at risk if petty career considerations take precedence?
In my line of work in the private sector, all code was independently verified by other workers and was visible to everyone in the company and to any client/customer organisations. Nobody did shoddy work or work that they would be ashamed for others to see and criticise. Everybody benefited from this openness and complete transparency. Progress was pretty much assured to be built on firm foundations.
What is funny about this is the actual reality of software development...
I could write a beautiful piece of Perl code. Something that I like. Something that does what it should do... and...
I show it to 5 different Perl Programmers and everyone of them will criticise it...
It is the nature of the beast... after-the-fact reading of code nearly always produces the statement: "God, who wrote this?", even if it is in perfectly decent form and structure. It cannot be avoided.
Basically, get over it... after-the-fact code review will always be critical, even of the best. Goes with the job.
Poor coding is also not always an issue...
Take the simple cosine function of a language. It could be the biggest bag of crap ever written, but if it has been proven to work, over many years, many compilers, many platforms, then I just assume it is robust enough for the job.
But of course a Cosine function has something against which it can be proved to function correctly. ;)
If you do not have a baseline (and Climate Models by definition cannot have one), then the "many eyes" approach works wonders re quality assurance.
I once had pi as an integer when doing a finite element analysis of a human spine, I sure wish someone had spotted that for me... the hours I wasted on such a simple error.
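For what it's worth, that is exactly the kind of slip a baseline check catches. A minimal sketch (mine, not the original poster's code); the assertion deliberately fails, which is the point:

    import math

    PI = 3  # the bug: pi truncated to an integer

    def circumference(radius):
        return 2 * PI * radius

    # With a trusted reference value (math.pi), a one-line check exposes it:
    assert abs(circumference(1.0) - 2 * math.pi) < 1e-9, \
        "circumference disagrees with the known value of 2*pi"

Without a known-good value to compare against, that same bug can hide in plain sight for hours, as described above.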
I would have thought that healthcare related research would demand similar levels of scrutiny
If you work under Good Laboratory Practice (GLP) in the healthcare sector you not only have to keep the code but you have to document and be prepared to show the GLP inspectors that you have actually validated the code to ensure that it does what it says.
This is simply Good Practice, something that appears to be absent from too many areas of academic science.
"Very often, novel methodologies applied to one set of data to gain insight can be applied to others as well. And so an individual scientist with such a methodology might understandably feel that providing all the details to make duplication of their type of analysis ‘too simple’ (that is, providing the code rather carefully describing the mathematical algorithm) will undercut their own ability to get future funding to do similar work...."
In other words, greed is their motivation. Not scientific rigor or synergy. Not even fear of looking like utter morons who have published drivel. Just plain greed.
Oh, yes, greed is the word.
It's not just the code but the data you feed it with that matters; from the CRU e-mails they seem to have had little idea if they were even using the right data. If the work being done is as all-important as claimed, expecting the code and data handling to stand up to an audit is actually a rather low standard. And no audit, or indeed any investigation, has ever been done in this area of CRU's work; all we have is the leaked e-mails suggesting a house in disorder.
I've noticed that the Guardian has dropped Climate Change as a main heading under its Environment section.
http://www.guardian.co.uk/environment
It is now a sub-heading under Energy.
http://www.guardian.co.uk/environment/energy
Is the Guardian now waving the white flag on climate change?
If the algorithm is accurately described and if the code accurately implements the algorithm then there is no need to provide the code. However, there are 2 ifs in the previous statement, both of which must be true in order for the code not to be provided. No matter how simple the algorithm it is still possible to inaccurately describe it or inaccurately implement it and therefore both should be provided.
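A tiny illustration of the second "if" (hypothetical, my own): here the description "return the arithmetic mean of the values" is perfectly accurate, but the code silently fails to implement it, and only seeing the code reveals that:

    def mean(xs):
        total = 0.0
        for i in range(len(xs) - 1):  # bug: skips the last element
            total += xs[i]
        return total / len(xs)

    print(mean([1.0, 2.0, 3.0]))  # prints 1.0, not the correct 2.0

No description, however careful, would have exposed that off-by-one; only the source does.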
Latest from the Guardian, front page, no white flag ;(
More UN rubbish
http://www.guardian.co.uk/global-development/2010/nov/04/united-nations-human-development-report
Kudos to JGC - I've been impressed by your work on rapid builds and makefile magic on many occasions. Thank you for bringing some logic to this field.
Here's my view: In science, code is routinely made public (e.g. physics and crystallography). Climatology is obviously a very different domain. Apparently even Fortran is strangely non-portable in the exacting world of climatology - as was rudely shouted at me at realclimate and protested by Mann to the inquiry into his immaculate work. But, as we all know, passing off the bogus as plausible, usually to willing dupes, and often with considerable aggression, appears to be the main focus of climatology.
TerryS says: "If the algorithm is accurately described and if the code accurately implements the algorithm then there is no need to provide the code. However, there are 2 ifs in the previous statement, both of which must be true in order for the code not to be provided. No matter how simple the algorithm it is still possible to inaccurately describe it or inaccurately implement it and therefore both should be provided."
Exactly! In addition, we still have the matter of the raw data. What specific data was used, from which data pool, was it "adjusted," "corrected," or "modified" in any way, and if so, with what justification?
Without a clear explanation of data sources, usage, modifications, algorithms chosen and software used, all we have is the following:
"I took some unspecified data, did something to it, manipulated it in some way and did so with software of unknown properties. My conclusion is robust."
That is not science. That is nothing more than unsupported assertion.
"The Cathedral and the Bazaar" is a worthwhile read on some of the ideas from Open Source that are applicable here.
The wiki page is here
http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
The IPR arguments used are also a little bogus, aren't they? Ideas could still be copyrightable or patentable, and also have some academic protection from plagiarism, don't they? I've been wondering about this with respect to university funding cuts and also their commercial ventures, and whether that can conflict with needs or desires to publish.
According to Ian Jolliffe discussing temperature datasets:
"Raw data can seldom be taken at face value. In addition to random recording errors, the vast majority of records have experienced systematic changes at some time in their histories. Changes can be abrupt, such as a change of site or instrument, or gradual such as urbanisation or vegetation growth. The metadata will include information about such changes, when available. Current practice is to apply various adjustments, so-called homogenisation, to account for known and unknown changes. Interpolation across data void regions is also frequently carried out......In the past homogenisation has not always been transparent. For this project any institution or individuals wishing their homogenised (Stage 5) data set to be included will need to fully document their quality control and homogenisation algorithms and have them objectively tested and assessed. A suite of artificial benchmarking datasets that replicate the structure of real climate data sets, with changes artificially introduced to mimic those most likely to occur in practice, will be created by a third party. Data set creators will be required to apply their algorithms to the benchmark data sets and the results will be recorded."
www.significancemagazine.org/details/webexclusive/870053/New-temperature-datasets-for-the-21st-Century.html
www.surfacetemperatures.org/
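As a toy version of that benchmarking idea (my own sketch, nothing to do with the project's actual code): generate a synthetic series, inject a step change of known size and location, and score a candidate homogenisation algorithm on whether it recovers the break:

    import random

    random.seed(42)
    n, break_at, shift = 240, 120, 1.5
    # Synthetic monthly series with a known step change planted at index 120.
    series = [random.gauss(10.0, 0.5) + (shift if i >= break_at else 0.0)
              for i in range(n)]

    def detect_break(xs):
        """Return the split point that maximises the jump between segment means."""
        best_i, best_gap = None, 0.0
        for i in range(12, len(xs) - 12):
            left = sum(xs[:i]) / i
            right = sum(xs[i:]) / (len(xs) - i)
            if abs(right - left) > best_gap:
                best_i, best_gap = i, abs(right - left)
        return best_i

    print(detect_break(series), "vs true break at", break_at)

Because the break was planted by a third party, the algorithm's output can be marked objectively, which is the whole attraction of the scheme described in the quote.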
Not releasing code is like Fermat's last theorem: claiming that you have solved a problem but letting people work three hundred years to check your work.
Probably "the proof doesn't fit in the margin" is the biggest understatement in science.
"You may have better-known people market your idea better than you can and be credited with the work"
(quoted by JG-C)
That's not science. In any case, if it has been published, you can always invoke 'prior art'.
Outside of GCMs, is there any code in climate science that is actually that difficult? How hard can it be to make adjustments to thermometer readings?
Because scientists write lousy code, as a rule. I thought that was kinda obvious.
If they don't provide the data and code, the public should ignore the work for any policy purpose. Period. As should any assessment of the science. If it can't be checked, it's not science in accordance with the scientific method.
"Can you believe the World is at risk if petty career considerations take precedence?"
TinyCO2, many thanks, that is my fav phrase of the week!
Stan
If it can't be checked, it's not science in accordance with the scientific method.
That pretty much summarizes it.
Mojo
Because scientists write lousy code, as a rule.
That has been my experience. However, I will go further and say that there are plenty of generally available statistical packages out there which can give erroneous results as well. All computationally intensive code is prone to abuse. My favourite was Factor Analysis. Take a bunch of data, run it through a computer program, find some random eigenvector and publish. The "climate science" version of FA is PCA, which is basically the same thing. PCA and FA are valuable analytical tools, but like chainsaws, can be misused.
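To see that failure mode concretely, here is a quick sketch of my own (it assumes numpy is available): run PCA on pure noise and it still dutifully reports a "leading component":

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))  # 100 samples of 20 independent noise variables
    X -= X.mean(axis=0)             # centre each column
    cov = X.T @ X / (len(X) - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, largest first
    print("variance 'explained' by PC1: %.1f%%"
          % (100 * eigvals[0] / eigvals.sum()))

The first component will "explain" noticeably more than the 5% an even split across 20 variables would give, despite there being no structure at all; without a significance test against a noise baseline, that eigenvector is just a pattern in randomness.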
I think this computer engineer has a better idea of how to conduct science than the AGW clowns, whose only reason to keep code, it seems, is ego.
http://www.nature.com/news/2010/101013/full/467753a.html
Pablo: Like this one?
http://www.sas.com/technologies/analytics/statistics/index.html
Mojo
Like my Stihl 045 chainsaw (which I have had a long time), the SAS statistical package is a very fine product. I have nothing against chainsaws and statistical programs. I do have problems with people taking a chainsaw, cutting down trees, counting their rings and using a statistical program to predict the weather 100 years from now. (Naw, that would NEVER happen, would it?)
My point was not about the tool, but the use of the tool. Computers are abused in the same way. I have lost count of how many people have told me something must be true because "A computer said so!"
Up until a few moments ago I thought it ironic that the chap who did the "audit" of the Wegman Report was an apparently well-thought-of computer scientist; why isn't he looking at the software quality issue, I asked myself. Now I'm wondering if he has been involved, or has a professional interest at stake, in climate research/modelling?
http://en.wikipedia.org/wiki/John_Mashey
Thanks not banned:
Re. John Mashey - John's wife Angela Hey is involved with 'Cleantech Open'. And, of course, Cleantech Open exists to push green technologies. Hence, no doubt, their immense joint interest in shoring up the shoddy 'science' of AGW with strange plagiarism attacks on Wegman (but, of course, lousy code and plagiarism in the CAGW cause are a-ok).
Anyway, one hesitates to think it - but could the attack on Wegman have been funded, in part, by Cleantech Open?
OT
The Channel 4 debate on what the green movement got wrong IS a damp squib. Really disappointing that there are NO recognised "skeptics" on the panel. BUT wait, they did find time to get George Monbiot on to talk about the hundreds of millions of dollars being pumped into the skeptic movement... completely unchallenged!
I guess bush that must make you a multi millionaire eh?
Mailman
@Mailman
-- I guess bush that must make you a multi millionaire eh? ----
Is that bish not bush? Bush probably is a multi-millionaire. Having seen Bish's suit in the TV interview a few weeks ago, I don't think he is that well off ;-)
The C4 debate is even worse than the programme!!
Yeah, auto spell on the iPhone changed bish to bush!
Mailman
I thought the CH4 programme and subsequent discussion were unintentionally hilarious exposés of the whole ramshackle green bandwagon.
Brand - elderly hippie living in a shack (and driving an SUV!), reminiscing about his LSD days and trying to grab a last flicker of attention by upsetting green orthodoxy.
Lynas - sublimely confident, but dimwitted, public schoolboy who looks as if he's finally struggling towards comprehending that his jolly pie-throwing, crop destroying days made him a laughing stock.
Monbiot and the self-styled "Greenpeace Scientist" - the worst kind of boss-eyed, spittle-flecked high priests of green orthodoxy.
And the biggest loony of all - the so-called "nuclear power expert" whose theory that nuclear power stations were about to be submerged by tidal waves made even the other greenies curl up with embarrassment.
Made me wonder how anybody could have swallowed anything these second rate charlatans have pumped out over the last 20 or 30 years.
Still, the whole thing had the encouraging feel of a "Titanic deck chair arrangement exercise" - so quite enjoyable in a masochistic sort of way.
It's my view that they were lured into this degree by degree.
They were producing bodge code, like many researchers who have to do programming as a sideline. Nobody usually cares.
The work assumed political importance which they basked in and encouraged, but the bodge code and data management were allowed to be glossed over. A mushroom growing under the floorboards.
The work assumed immense political importance, and the bodge code and crap data management were eventually put under a spotlight by Climategate. An outraged world wanted to know why they hadn't applied the highest standards of software development and verification and why they appeared to be bullshitting; they felt they were researchers producing a bit of bodge code as was normal, and were being unfairly picked on, but they knew they had been bullshitting.
Very clearly, the difference between the coding in climate science and the coding in support of say, a paper on an obscure aspect of metallurgy, is the huge political and economic interests at stake in the matter of climate science.
Just watched the recorded C4 prog; I have the feeling that Stewart Brand has just delivered the equivalent of:-
Friends, Romans, countrymen, lend me your ears;
I come to bury Caesar, not to praise him;
The evil that men do lives after them,
His opening "all is not bad, you have just had a prime-time show saying that global warming is happening" is a fob-off.
The message was loud and clear, get real, it is time for the howling at the moon to stop.
Sadly I doubt the message will have got through to many; maybe George got it, though that would have been a while ago.
@Green Sand
Sadly George did not get it, or at least isn't ready to admit it yet, so Brand gets a trademark howling fit.
@Foxgoose
It is unwise to dismiss Brand lightly. He's thoughtful, pragmatic and fairly honest; he has a long track record of being worth listening to. He made a television programme, aired back in 1997, about why he lives in a "shack"; had more people listened to it (especially the bit from around 20:00), the world economy wouldn't be in such a hole now.
Sorry: I see that Google Video has chosen to decorate this with a grotesque (and utterly unrelated) porn image in the Related Videos sidebar, which may make the page NSFW.