
Wonky code
Nature has an article up about wonky computer code, with particular reference made to the Harry Readme file and Nick Barnes' efforts to get climatologists to do better on the coding front.
This struck me as interesting:
When hackers leaked thousands of e-mails from the Climatic Research Unit (CRU) at the University of East Anglia in Norwich, UK, last year, global-warming sceptics pored over the documents for signs that researchers had manipulated data. No such evidence emerged...
Now correct me if I'm wrong, but none of the inquiries actually looked at the computer code, apart from there being a brief word from Tim Osborn in evidence to Muir Russell, denying that the bodges he'd mentioned affected published results. I'm pretty sure the Harry Readme was not looked at by any of the inquiries.
There is an accompanying comment piece by Nick Barnes here.
Reader Comments (31)
There should be a requirement that all data and code be publicly released before they can be used in an IPCC report or government-funded research. The more transparency the better. Unfortunately, the scientists will have to be forced to open up by the government.
The article also asserts
'Although Harry's frustrations did not ultimately compromise CRU's work'
But since nobody, AFAIK, has ever seen the code that CRU used, how can this be stated so certainly?
Just a few sentences later it details an example of how undetected errors in shared software caused a whole host of scientific papers to be wrong and have to be withdrawn...even though everyone concerned had been acting with goodwill and to the best of their abilities.
This is a classic case where you do not need somebody to actively 'manipulate data' to come up with entirely the wrong answer. In Harry's case at CRU, even he admits that he was out of his depth, and yet the article's authors are confident that he got the right results.
H'mmmm. I find that very hard to believe without further proof.
Perhaps computer code in climatology is different from the stuff used in chemistry, physics and maths... Maybe, by a process akin to 'teleconnections', a bit of FORTRAN with a bug in it when used in a chemistry program suddenly becomes bug-free when used in climatology. It sort of 'knows' the intention of the programmer. Or perhaps it isn't :-)
Having been in computing for over 40 years, I have to disagree with both Zeeya Merali, who clearly has little understanding of computational issues, and even Nick Barnes, who might be right about commercial code such as Windows, which is not at all computational -- it is graphical and an OS. It does almost no computational work. And if it gives a blue screen, well hell, that is what the reboot button is for, isn't it?
However, when you come to computational computing, such as that referred to by Latimer Alder, it doesn't take much sloppiness in your code to get totally spurious results, even with quad precision floating point. There is a very active field in Computer Science -- which is a real science -- called Numerical Analysis, which has been studying these issues for 50 or so years. All computers have limited precision and if not used carefully, you can easily overwhelm the available digits with a number of issues, such as scaling.
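To make that concrete, here is a toy sketch (Python here purely for brevity; the same trap exists in Fortran or anything else) of how poor scaling overwhelms the available digits:

# Toy illustration: double precision carries ~16 significant digits,
# so a badly scaled accumulation silently discards small contributions.
big = 1.0e16      # a large running total
small = 1.0       # a physically meaningful increment

print(big + small == big)   # True: the increment vanishes entirely

# Adding a thousand small terms one by one loses every single one...
total = 1.0e16
for _ in range(1000):
    total += 1.0
print(total - 1.0e16)       # 0.0, not 1000.0

# ...whereas summing the small terms first preserves them.
print(1000 * 1.0 + 1.0e16 - 1.0e16)   # 1000.0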
All of that is ignored by Ms. Merali and Mr Barnes.
I agree that not only should the raw data be disclosed, but the actual code used as well, together with all the parameter settings. Yesterday we had a bit of fun with some non-chemists claiming their simulations showed that airplane exhausts kill thousands of people -- with no data -- just some of the input parameters to a program they got somewhere. Total BS.
(claps sardonically)
Oh, well done, chaps! I've seen better coding from 12 year olds.
@ Don Pablo
Grrrr!!
They weren't non-chemists any more than they were non-traffic wardens or non-lion tamers! They were Engineers. And from Fen Poly as well. You really must work on your Chemist awareness issues. With or without that substance coming from the River Liffey.
Otherwise your post is spot on as ever. Ciao
Relying on memory, I'm pretty sure that some of Harry's comments admitted that some of the results were just made up. The distinct impression was that CRU couldn't be real sure exactly what was going into or coming out of their work product.
I think someone ought to bookmark this claim by Nature and revisit it 5 and 10 years from now. I suspect that their claim that there is nothing in the software errors which affects results will come to look very, very foolish.
I am a software dev. It's pretty much a badge of honour these days to publish your code in a blog, so that others can learn from it. This is true across all platforms, Linux, Microsoft included.
There is very little of what I would call commercially sensitive code that can't be let out into the wild.
Scientists could also benefit from learning some of the modern approaches to software development, such as unit testing frameworks, source control, and so on. However, there is possibly a bit of an ego issue with scientists who don't really regard coding as that important.
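By way of example, even a tiny test earns its keep. A hypothetical sketch using Python's standard unittest module (the anomaly function is invented purely for illustration, not anyone's actual code):

import unittest

def anomaly(values, baseline):
    """Return each value as a departure from the baseline mean."""
    mean = sum(baseline) / len(baseline)
    return [v - mean for v in values]

class TestAnomaly(unittest.TestCase):
    def test_zero_departure(self):
        # A value equal to the baseline mean should give zero anomaly.
        self.assertEqual(anomaly([2.0], [1.0, 2.0, 3.0]), [0.0])

    def test_sign(self):
        # A value below the baseline mean must come out negative.
        self.assertLess(anomaly([0.0], [1.0, 2.0, 3.0])[0], 0.0)

if __name__ == "__main__":
    unittest.main()

Five minutes' work, and any later bodge that flips a sign gets caught before it reaches a paper.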
Enough preaching, now, off to the day job, writing some wonky code...
Finally, some sense about this issue. This is a real problem, as I showed with the final bug I found in the Met Office's code: http://blog.jgc.org/2010/04/met-office-confirms-that-station-errors.html
Zeeya Merali shows her complete ignorance of the subject by the way she drifts in and out of "code" - singular - as we call our code in the trade and "codes" - plural - as James Bond calls secret messages in his fictional world.
No, you're right: there has so far been no investigation of the code at all, so how this claim could be made may have to remain a mystery. But it's not just the code; the way the data is handled, and the version control, are all over the place. Good code is useless if you feed it bad data, and to know it's good data you have to be able to produce an audit trail and control the versions. The e-mails suggest that neither of these things occurs; they simply don't know if it's the right data nor if it's valid. It's all very back-of-the-fag-packet stuff.
A fair and open review of this area would actually benefit everyone involved; so far there simply has not been one in any way.
I'm also a software dev and (like ALL developers, even commenters here I'm sure) I have been responsible for some pretty poor code. We all have. Anyone who says no is a liar ;-)
I haven't read the full article, but the opinion piece by Nick Barnes seems to hit the nail on the head IMHO.
It basically boils down to this: he accepts that most scientific code is pretty poor, and his response is, well, publish and be damned.
Publish the code along with your analysis and source data. In that way others can rerun your routines, point out bugs or areas where things are not quite as you expect them to be.
You may be scared of people needing support, but as he (rightly, again IMHO) points out if they choose to use the code then they have to take it as is.
Questions and errors are fine. People starting from scratch, well, they have to find out on their own.
Seems reasonable to me.
I know nothing about R, nothing about Fortran - before my time. I won't even start with those unless I expect to have to deal with learning them myself.
Steve M had a nice post on code readability in M&W recently, and I think that backs this up nicely. In that post he mentions things he likes to find in R code and elsewhere, and things he prefers but often doesn't find.
But he knows the software at issue and can work his way round it.
I am off to read the other article just now; it looks like another 'Software Engineering from The Clean Room' piece at first glance, though.
PS - See the No Excuses section in the Barnes piece.
I completely agree with Don Pablo de la Sierra's comment:
"However, when you come to computational computing, such as that referred to by Latimer Alder, it doesn't take much sloppiness in your code to get totally spurious results, even with quad precision floating point. There is a very active field in Computer Science -- which is a real science -- called Numerical Analysis, which has been studying these issues for 50 or so years. All computers have limited precision and if not used carefully, you can easily overwhelm the available digits with a number if issues, such as scaling."
I do believe that numerical analysis (as a math area) existed before computer science did. Anyway, the climate is a chaotic system, so accurate models of the climate system are also chaotic. A small input change to a chaotic system can lead to enormous changes in the output. The classic example of an unintended small change that leads to a large error is a rounding error.
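A toy demonstration of that point, with the logistic map standing in for any chaotic system (Python, purely illustrative): perturb the input by roughly one rounding error and the two trajectories soon part company.

# Two runs of a chaotic system (logistic map, r = 4) whose inputs
# differ by about one double-precision rounding error.
x, y = 0.4, 0.4 + 1e-15

for _ in range(60):
    x = 4.0 * x * (1.0 - x)
    y = 4.0 * y * (1.0 - y)

print(abs(x - y))   # typically of order 0.1 to 1: the error has exploded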
I used to port modeling code from Matlab to C++ (i.e. going from experimental prototype to commercial realtime code) and had to take tremendous care that each line of the ported C++ produced identical intermediate results to the original Matlab model code. These tiny intermediate errors can change results in a very large way.
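For what it's worth, the harness for that sort of check can be very simple. A hypothetical sketch (stage names invented) that walks both implementations' logged intermediates and flags the first divergence:

def compare_stages(original, ported, tol=0.0):
    """original/ported: lists of (stage_name, value) pairs logged by
    the two implementations. tol=0.0 demands identical results."""
    for (name, a), (_, b) in zip(original, ported):
        if abs(a - b) > tol:
            return "first divergence at '%s': %r vs %r" % (name, a, b)
    return "all stages match"

# Example: the port drifts at the second stage.
print(compare_stages([("filter", 0.5), ("scale", 1.25)],
                     [("filter", 0.5), ("scale", 1.2500001)]))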
Bottom line, there is no room for sloppiness in GCMs. Nick was right about the big picture (i.e. that scientists need to release their code so the world can review it) but had a bit of a rounding error in his detailed comments.
UEA database, E-mails, and programmer-comment files were not "hacked" but transmitted anonymously, in untraceable format, to foreign websites only after a BBC recipient had chosen to suppress the astounding scoop for weeks. Assembling and editing this large amount of abstruse technical material, much of it with highly incriminating personal transmissions, can only have been done by a capable and knowledgeable insider, a so-called whistle-blower who on the eve of cultists' Copenhagen dogpile had undergone a change of heart.
Nearly a year later, an ongoing police investigation has demonstrated incompetent stupidity beyond excuse. Manifestly, UEA powers-that-be know or strongly suspect that the culprit is one of their own, and are obstructing inquiries from Muir Russell's to Oxburgh's accordingly. Paul Dennis denies everything; Phil Jones' bleats and squeaks do rise to very heaven; and as for low-lying Keith Briffa... well, why not subpoena his on-record backup files preliminary to interrogating him directly, under oath?
Perish forbid that anyone should cast aspersions on eco-fascists' extraordinarily mendacious AGW/Save-the-Planet scam. But the time for treading gently around Green Gangsters of this ilk, including Hansen, Mann, Trenberth and various internet Big Liars (Romm, Schmidt, and others come to mind) is long since past. To assess this historic fraud in detail, we need to know how its internals played. Those who plead otherwise are cutting their own throats. Speak, now!
I am spectacularly confused by this post.
"Nature has an article up about wonky computer code, with particular reference made to the Harry Readme file and Nick Barnes' efforts to get climatologists to do better on the coding front."
I have read the article and am with you so far....
"This struck me as interesting:
When hackers leaked thousands of e-mails from the Climatic Research Unit (CRU) at the University of East Anglia in Norwich, UK, last year, global-warming sceptics pored over the documents for signs that researchers had manipulated data. No such evidence emerged..."
Global warming sceptics did indeed. I would quarrel with the assertion that "no such evidence emerged" but otherwise I am still on board.
"Now correct me if I'm wrong, but none of the inquiries actually looked at the computer code, apart from there being a brief word from Tim Osborn in evidence to Muir Russell, denying that the bodges he'd mentioned affected published results. I'm pretty sure the Harry Readme was not looked at by any of the inquiries."
So, suddenly, we are now talking about the inquiries... NOT the work done by sceptics? The work I came across on the "code", by one or two (sceptic?) programmers, eviscerated it as sloppy and amateurish, BUT THERE IS ABSOLUTELY NO EVIDENCE AT ALL THAT THIS CODE WAS EVER USED. As a consequence, the Harry Read Me, intriguing as it is as a window into the travails of a particular poor sod out of his depth and fully aware of it, has no relevance.
As far as I am aware, the sceptics took the code apart but it was only peripheral, as there was no way to link it to any actual output by the CRU. I do not therefore find it surprising that the inquiries ignored it. They should. It only has relevance if someone at the CRU held up their hand and admitted that this was the code used to "adjust" their surface temperatures. Unsurprisingly, no one did so. It may be that it was, but without that confirmation it has to be ignored.
Have I got it completely round my neck? Help! I usually understand plain English but I am floundering here!
Jack Hughes:
Actually scientific programmers often talk about 'codes', plural; it's a bit of a shibboleth in fact.
I have to admit I'm puzzled too, like Jack Savage. I've learned that learnèd institutions can make completely false, spurious, untrue, mendacious, bogus claims, receive corrections, and continue to do the same. Even good friends of mine "ping" back to former opinions that I've serially disproved to them.
So are the issues of code bodging or data manipulating, that arose in the Climategate emails, put to rest, or not? Also, I have the feeling there were, in addition to the emails, a quantity of documents and I'm not even sure if these have been opened and studied properly at all. After all, there was so much in the emails... when added to the tsunami of comments, counter-comments, and counter-counter comments...
...anyone? please?
Lucy Skywalker, there has been no investigation into the code or data-handling procedures at CRU. Therefore it is simply not possible to say there are no issues in this area. The leaked e-mails do suggest that there may be, and indeed there have been calls for such an investigation, but at the moment we just do not know.
anonymous
Actually scientific programmers often talk about 'codes', plural; it's a bit of a shibboleth in fact.
Given that one definition of shibboleth is "A custom or practice that betrays one as an outsider", I would have to agree. Many years ago, I spent six years at the Cornell University computer center's Office of Computer Services (OCS) as both their statistical consultant (on BMD, which I rewrote so it actually might work if used reasonably) and as a debugging and problem-solving consultant for Fortran, PL/I, JCL, APL and other languages back in the late 1960s and early 1970s.
About 90% of the code I had to deal with was rank amateur code. When I read the Harry Read Me file, it was clear that nothing had changed. These "scientific" programmers don't know the first thing about writing reasonable code even today. I am sure that if we saw the actual code used, it would be ripped apart mercilessly by real programmers who take pride in their work and know what they are doing, including the real chemists who write some of the most sophisticated code for their computational spatial molecular modeling programs. Indeed, it is considered a subfield of numerical analysis in its own right.
This, of course, is not to be confused with non-chemists, non-traffic wardens, non-lion tamers, and in my opinion, non-engineers from Phen Fen Poly who don't know Jack Shite about what they are doing but still get published in ACS journals for some mysterious reason.
(I do love watching Latimer spin!)
Data are what gets fed into the box. Code transforms the data into results.
Results are what you act on.
If you can trust, by checking, the data, and if you can trust, by checking, the code then you can examine the results against reality.
If you are willing to bet the family jewels on the results then you must have complete trust in every stage.
I find it difficult to place my trust in the principal actors who've been involved thus far.
Maybe a recent study of the gambling habits of pigeons does reveal clues as to why western governments are willing to take a "flutter" on insider tipsters.
10 REM How to do climate science
20 CALL Collect_Data
30 CALL Computer_Scientist
40 CALL Professional_Statistician
50 PRINT "Paper_Data_Code"
60 GOTO 10
Jack Hughes and anonym,
I have also in the past been puzzled by written references to 'codes' and 'the codes', rather than 'code' when referring to computer software, since I had never used the term myself, or heard it used in that way in the computer industry. Always 'code' and 'the code'.
I've also only ever seen it used by Americans, so a combined 'scientific' and 'American' usage seems possible. Perhaps Don Pablo can enlighten us? For me it jarred most forcefully in George Dyson's book Project Orion, where he repeatedly referred to 'codes' and 'the codes' in relation to software to model various nuclear interactions and processes.
A long time ago a programmer friend told me about a cautionary phrase to keep in mind when writing code:
"Meatball Logic Generates Spaghetti Code"
Methinks that everyone who writes code should keep this in mind.
Before you get to the code - and the choice of Fortran as a coding language was a big mistake IMO - there is the data management. You can think of it as the chain of custody: that each data item is what you think it is, that each alteration has been made correctly and for the right reasons, and that the work can be replicated by others. 'Harry' shows prima facie a complete - and I mean absolutely complete - lack of data management.
When you move on to the code, you have very similar issues - the right version of the right program, the records of its validation, the nature of and reasons for changes between versions - so that the work can be replicated by others. 'Harry' showed an utter and complete lack of that too. Ironically the UK government, paymaster for all this, sponsors ITIL - a method and toolkit to address just this issue.
So regardless of the integrity of 'the science', we know the integrity of the data cannot be established, and neither can the integrity of the code that processes the data. What does that tell us about the integrity of the results?
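To gesture at what a chain of custody means in practice, here is a minimal sketch (Python; file names hypothetical, and emphatically not CRU's system): fingerprint every data file and log each alteration with a reason, so the provenance of any number can be traced.

import hashlib, json, time

def fingerprint(path):
    # SHA-256 of the file's contents: any change, however small, shows up.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def log_change(audit_path, data_path, reason):
    # Append one audit record: what changed, its fingerprint, when, why.
    entry = {
        "file": data_path,
        "sha256": fingerprint(data_path),
        "when": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "why": reason,
    }
    with open(audit_path, "a") as log:
        log.write(json.dumps(entry) + "\n")

# e.g. log_change("audit.jsonl", "station_temps_v2.dat",
#                 "corrected station lat/long from source register")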
@ Jack Savage: "There is absolutely no evidence at all that this code was ever used"
Of course there is no evidence - the enquiries did not look into these issues. As a vindication this is as weak as it gets, really.
John Blake: "a BBC recipient had chosen to suppress the astounding scoop for weeks"
I think that is a Legend of the Interwebs. From my memory of the contemporary discussion (some of it on this very blog, I think), Paul Hudson probably received no more than a strand of CRU emails that related directly to him (he was suspected of heresy and deviationism).
Chuckles
"Code" is both singular and plural on this side of the Pond and I suspect the European side as well. The point Jack Hughes made is akin to you meeting someone who learned English as a second language and who tells you about all the sheeps in the field and deers in the forest. They will also often use "youse" for the plural second person pronoun.
As an Irishman as well, I have spent a good deal of time on Hiberno-English, as the English used by the Irish in Ireland is called. They -- at least in the countryside -- insist on a plural form for "you" and will use "youse" or "yiz". And if you use the standard English form of "you" they look at you in wonderment. It is all a matter of perspective.
English is an illogical language. And our technical jargon is just as illogical. The use of "codes" is very much a giveaway of a less than complete understanding. This is not as prominent a point as it once was, as more and more people talk about technology and you hear more of them use "codes", much like most of the people in Dublin will use standard English except in the pub.
These comments are fine, because the decline is a problem still unsolved, although the dates seem somewhat arbitrary.
However, these others are very suspicious:
Don Pablo, thanks for your comment; your thoughts on 'code' and 'codes' exactly mirror my experience.
I'm still troubled by George Dyson's use of 'codes' in this context, as I found it very jarring when I first read it in his book. e.g. 'he would let them have one of the big two-dimensional codes that they had developed.' or 'In addition to MOTET for expansion and SPUTTER for ablation, the mathematical group at General Atomic came up with codes such as BUMP for impulse, RAMM, for dynamic response, PRESS, for pusher plate stress...' or 'OBOP and OBOPLE were evolutionary dead ends, codes so specialized that they went extinct...More adaptable Orion codes are still going strong.'
Is it possible that it is a modeling term rather than computer/IT terminology? All of these would have modeled physical processes in their research.
Devil's Kitchen is under no illusions on this topic
http://www.devilskitchen.me.uk/2010/10/is-it-in-their-nature-to-lie.html
@chuckles - Oct 14 at 5:16 pm
@Don Pablo de la Sierra - Oct 14 at 2:42 pm
As a programmer who got started in the 1970s, I was used to the phrase "coding a program". But I first encountered the term "codes" as shorthand for a group of computer programs in the IT trade literature of the 1970s. Those articles described the term as being used almost exclusively by the numerical simulation folks on the big supercomputers at the US nuclear research/weapon design labs (Sandia/Los Alamos and Lawrence/Berkeley). I got the impression at the time that these "codes" were pretty complex. Their purpose was the simulation of nuclear explosions for weapon design, since this took place after the Nuclear Test Ban treaties went into effect in the early 1960s. I have no idea how much internal review they received, but the articles did make the point that they were reviewed. At the time, they were almost exclusively written in FORTRAN and used pretty much exclusively on the Control Data Corp. supercomputers of the day (pre-Cray designs). The alternative language would have been CDC assembler. It wouldn't surprise me to find that they are still in use pretty much as is, due to the cost of re-coding them.
Old Unix Head,
Thanks for that info, that makes a lot of sense, and certainly explains why many of us are unfamiliar with the usage.
The George Dyson 'codes' I listed are from the late '50s/early '60s and actually ran on IBM gear - 650s, 704/4 and 7090 mainframes - but FORTRAN, definitely. I think FORTRAN was actually developed for the 704. Definitely predates my first acquaintance with a 1401 in the late '60s.