Donald Gillies, "Why Research Assessment Exercises Are a Bad Thing", Post-Autistic Economics Review, issue 37

post-autistic economics review
Issue no. 37, 28 April 2006
article 1

Why Research Assessment Exercises Are a Bad Thing

Donald Gillies (University College London^*)

1. Introduction

In the UK a Research Assessment Exercise (henceforth abbreviated to RAE) was introduced in 1986 by Thatcher, and was continued by Blair. Now the idea seems to be catching on, and RAEs are being introduced in many countries. But are such RAEs really a good thing? In this paper I want to argue that they are not. The rationale for conducting an RAE is presumably that it will improve research output. However I will show that an RAE is likely to have the opposite effect, and make the quality of research produced worse than it was before.

An RAE usually involves a double use of peer review. A researcher has to submit publications, and these will in general have been peer reviewed. Then the review by the RAE panels is itself a peer review. This exclusive reliance on peer review is the first major defect of an RAE, for, as I will argue, it is likely to lead to a systematic failure to recognise ground-breaking research. Indeed the study of the history of science shows that peer review can give results which later turn out to have been quite erroneous. It often happens that researchers produce work which is judged at the time by their fellow researchers to be worthless, but which is later (sometimes much later) recognized to have been a major advance. In the next section I will give three examples of this phenomenon selected from different branches of science, namely (i) mathematics, (ii) medicine, and (iii) astronomy. Then in section 3 I will explain why this occurs, using Kuhn’s philosophy of science. These results will be used in sections 4 and 5 to analyse the likely effects of an RAE on the quality of research and on wealth-generating technologies. The last section argues that the conclusions drawn can be extended to economics and the social sciences.

2. Examples of the Failure of Peer Review

Mathematics

My first example is taken from the field of mathematics, and concerns an important advance in mathematical logic. This advance was made by Frege in a booklet published in 1879, which is usually referred to by its German title, Begriffsschrift, meaning literally ‘concept-writing’. Frege worked all his life in the mathematics department of the University of Jena.

In the Begriffsschrift, Frege presents for the first time an axiomatic-deductive development of the propositional calculus and of the predicate calculus (or quantification theory). These subjects are the core of modern mathematical logic, and are expounded in the opening chapters of most modern textbooks on the subject.

Frege’s remarkable achievement has been fully recognised by experts in the field since the 1950s. William and Martha Kneale in their 1962 history of logic write: ‘Frege’s Begriffsschrift is the first really comprehensive system of formal logic. … Frege’s work … contains all the essentials of modern logic, and it is not unfair either to his predecessors or to his successors to say that 1879 is the most important date in the history of the subject.’

However the significance of Frege’s work was certainly not realised by his contemporaries working in the same field. There were 6 reviews of the Begriffsschrift – 4 by Germans, 1 by a Frenchman, and 1 by an Englishman. Of these 6, 1 was favourable, but the other 5 were not only hostile but even completely dismissive. Schröder, the leading German logician of the time, wrote: ‘ ... the present little book makes an advance which I should consider very creditable, if a large part of what it attempts had not already been accomplished by someone else, and indeed (as I shall prove) in a doubtlessly more adequate fashion.’ Tannery in France wrote: ‘In such circumstances, we should have a right to demand complete clarity or a great simplification of formulas or important results. But much to the contrary, the explanations are insufficient, the notations are excessively complex; and as far as applications are concerned, they remain only promises.’ Venn in England entirely agreed with Schröder that Frege had made no advance in the subject, and had indeed taken a step backwards. He wrote: ‘ … it does not seem to me that Dr. Frege’s scheme can for a moment compare with that of Boole. I should suppose, from his making no reference whatever to the latter, that he has not seen it, nor any of the modifications of it with which we are familiar here. Certainly the merits which he claims as novel for his own method are common to every symbolic method.’ Venn concluded his review by saying: ‘ … Dr Frege’s system … seems to me cumbrous and inconvenient.’

The importance of Frege’s work only began to be recognised towards the end of the 19th century, twenty years after it had been published, and then only by a few avant-garde researchers such as Peano in Italy and Bertrand Russell in Britain.

Medicine

My second case-history (Semmelweis and antisepsis) comes from a completely different branch of science. Semmelweis’s investigation was into the causes of puerperal fever, which was, at the time, the principal cause of death in childbirth.

Semmelweis was Hungarian, but studied medicine at the University of Vienna. In 1844 he qualified as a doctor, and, later in the same year obtained the degree of Master of Midwifery. From then until 1849, he held the posts of either aspirant to assistant or full assistant at the first maternity clinic in Vienna. It was during this period that he carried out his research.

The Vienna Maternity Hospital was divided into two clinics from 1833. Between 1833 and 1840, medical students, doctors and midwives attended both clinics, but, thereafter, although doctors went to both clinics, the first clinic only was used for the instruction of medical students who were all male in those days, and the second clinic was reserved for the instruction of midwives. When Semmelweis began working as a full assistant in 1846, the mortality statistics showed a strange phenomenon.

Between 1833 and 1840, the death rates in the two clinics had been comparable, but, in the period 1841-46, the death rate in the first clinic was 9.92% and in the second clinic 3.88%. The first figure is more than 2.5 times the second – a difference which is certainly statistically significant. Semmelweis was puzzled and set himself the task of finding the cause of the higher death rate in the first clinic.
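The comparison of the two death rates can be checked with a short calculation. The sketch below uses only the two percentages quoted above; the variable names are illustrative. Since the underlying patient counts are not given here, it computes the ratio and the gap but does not attempt a significance test.

```python
# Death rates for 1841-46 as quoted in the text (percent of patients).
first_clinic_rate = 9.92
second_clinic_rate = 3.88

# Ratio of the two rates; the text describes it as "more than 2.5 times".
rate_ratio = first_clinic_rate / second_clinic_rate

# Absolute gap between the clinics, in percentage points.
rate_gap = first_clinic_rate - second_clinic_rate

print(f"ratio of death rates: {rate_ratio:.2f}")
print(f"gap: {rate_gap:.2f} percentage points")
```

The ratio comes out at roughly 2.56, which agrees with the figure of more than 2.5 given above.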

After considering many different hypotheses, Semmelweis finally hit on the idea that some cases of puerperal fever might be caused by doctors transferring particles from corpses to the patients. In fact professors, assistants and students often went directly from dissecting corpses to examining patients in the first clinic. It is true that they washed their hands with soap and water, but perhaps some cadaverous particles still adhered to their hands. Indeed this seemed probable since their hands often retained a cadaverous odour after washing. The doctors and medical students might then infect some of the patients in the first clinic with these cadaverous particles, thereby giving them puerperal fever. This would explain why the death rate was lower in the second clinic, since the student midwives did not carry out post-mortems.

In order to test this hypothesis, Semmelweis, from some time in May 1847, required everyone to wash their hands in disinfectant before making examinations. At first he used chlorina liquida, but, as this was rather expensive, chlorinated lime was substituted. The result was dramatic. In 1848 the mortality rate in the first clinic fell to 1.27%, while that in the second clinic was 1.30%. This was the first time the mortality rate in the first clinic had been lower than that of the second clinic since the medical students had been divided from the student midwives in 1841.

Through a consideration of some further cases, Semmelweis extended his theory to the view that, not just cadaverous particles, but any decaying organic matter, could cause puerperal fever if it entered the bloodstream of a patient.

Let us next look at Semmelweis’s theory from a modern point of view. Puerperal fever is now known as ‘post-partum sepsis’ and is considered to be a bacterial infection. The bacterium principally responsible is streptococcus pyogenes, but other streptococci and staphylococci may be involved. Thus, from a modern point of view, cadaverous particles and other decaying organic matter would not necessarily cause puerperal fever but only if they contain a large enough quantity of living streptococci and staphylococci. However as putrid matter derived from living organisms is a good source of such bacteria, Semmelweis was not far wrong.

As for the hand washing recommended by Semmelweis, that is of course absolutely standard in hospitals. Medical staff have to wash their hands in antiseptic soap (hibiscrub), and there is also a gelatinous substance (alcogel) which is squirted on to the hand. Naturally a doctor’s hands must be sterilised in this way before examining any patient – exactly as Semmelweis recommended.

This then is the modern point of view, but how did Semmelweis’s contemporaries react to his new theory of the cause of puerperal fever and the practical recommendations based on it? The short answer is that Semmelweis’s reception by his contemporaries was almost exactly the same as Frege’s. Semmelweis did manage to persuade one or two doctors of the truth of his findings, but the vast majority of the medical profession rejected his theory and ignored the practical recommendations based upon it. This can be illustrated by one typical reaction. After Semmelweis had made his discoveries in 1848, he and some of his friends in Vienna wrote about them to the directors of several maternity hospitals. Simpson of Edinburgh replied somewhat rudely to this letter saying that its authors obviously had not studied the obstetrical literature in English. Simpson was of course a very important figure in the medical world of the time. He had introduced the use of chloroform for operations, and had recommended its use as a pain-killer in childbirth. His response to Semmelweis and his friends is very similar in character to Venn’s review of Frege’s Begriffsschrift.

The failure of the research community to recognise Semmelweis’s work had of course much more serious consequences than the corresponding failure to appreciate Frege’s innovations. In the twenty years after 1847 when Semmelweis made his basic discoveries, hospitals throughout the world were plagued with what were known as ‘hospital diseases’, that is to say, diseases which a patient entering a hospital was very likely to contract. These included not just puerperal fever, but a whole range of other unpleasant illnesses: wound sepsis, hospital gangrene, tetanus, spreading gangrene, erysipelas (or ‘St. Anthony’s fire’), pyaemia and septicaemia (two different forms of blood poisoning), and so on. Many of these diseases were fatal. From the modern point of view, they are all bacterial diseases which can be avoided by applying the kind of antiseptic precautions recommended by Semmelweis.

In 1871, over twenty years after his rather abrupt reply to Semmelweis and his friends, Simpson of Edinburgh wrote a series of articles on ‘Hospitalism’. These contained his famous claim, well-supported by statistics, that ‘the man laid on the operating-table in one of our surgical hospitals is exposed to more chances of death than the English soldier on the field of Waterloo’. Simpson thought that hospitals infected with pyaemia might have to be demolished completely. So serious was the crisis, that he even recommended replacing hospitals by villages of small iron huts to accommodate one or two patients, which were to be pulled down and re-erected periodically. Luckily the theory and practice of antisepsis were introduced in Britain by Lister in 1865, and were supported by the germ theory of disease developed by Pasteur in France and Koch in Germany. The new antiseptic methods had become general by the mid 1880s, so that the hospital crisis was averted. All the same, the failure to recognise Semmelweis’s work must have cost the lives of many patients.

Astronomy

I now turn to my third example (Copernicus and astronomy). Copernicus (1473-1543) was born in what is now Poland and studied at universities in both Poland and Italy. Through the influence of his uncle, he obtained the post of Canon of Frauenburg Cathedral in 1497, and held this position until his death. Copernicus’ duties as canon seem to have left him plenty of time for other activities, and he seems to have devoted much of this time to developing in detail his new theory of the universe. This was published as De Revolutionibus Orbium Caelestium, when Copernicus was on his death bed. In the preface Copernicus states that he had meditated on this work for more than 36 years.

There is little doubt that during Copernicus’ lifetime and for 50 or 60 years or more after his death, his view that the Earth moved was regarded as absurd, not only by the vast majority of the general public, but also by the vast majority of those who were expert in astronomy.

Although the majority of expert astronomers of the period would have dismissed the Copernican view as absurd, a few such astronomers, notably Kepler and Galileo, did side with Copernicus and carried out researches developing his theory until, in due course, it won general acceptance by astronomers.

3. Kuhn’s Distinction between Normal and Revolutionary Science

I have given three examples of the failure of peer review, and, of course, many others could be given. But why do such errors occur? How is it possible for experts in a field to judge as worthless what is later seen to be a major advance? This phenomenon is explained by Kuhn’s theory of scientific development as set out in his The Structure of Scientific Revolutions (1962). Kuhn’s view is that science develops through periods of normal science which are characterised by the dominance of a paradigm, but which are interrupted by occasional revolutions during which the old paradigm is replaced by a new one. During a period of normal science, the researchers in a given field all accept the dominant paradigm. So those who diverge from the paradigm are regarded as ‘cranks’ who ‘don’t know what they are talking about’. Usually such dissidents are indeed cranks who don’t know what they are talking about, but every so often they turn out to be a Frege, a Semmelweis, or a Copernicus, and initiate a revolutionary advance in the subject. An important consequence of Kuhn’s theory is that the mistaken judgements regarding Copernicus, Semmelweis and Frege are not features of science’s past, but are likely to recur over and over again, because they are features of the development of science in general.

4. Analysis of the Likely Effects of an RAE

Let us begin by considering the effects of an RAE on normal science. In a period of normal science, those working in a branch of the subject will all accept the dominant paradigm, and no revolutionary alternative will have been suggested. It will then be an easier matter for the experts in the field to judge who is best according to the criteria of the dominant paradigm. Allocating research funding to these most successful ‘puzzle solvers’, as Kuhn calls them, will usually enable the normal science activity of puzzle solving to continue successfully.

Even in the relatively unproblematic case of normal science, however, an excessive reliance on peer review can lead to mistakes. Suppose research is required on some problem, and there are four different approaches to its solution which lead to four different research programmes. This situation is still possible, and indeed often occurs, in normal science, for the four different research programmes could all be compatible with the dominant paradigm. It may be almost impossible to say at the beginning of the research which of the four programmes is going to lead to success. Suppose it turns out that only research programme number 3 is successful. The researchers on programmes 1, 2 & 4 may be just as competent and hard-working as those on programme 3, but, because their efforts are being made in the wrong direction, they will lead nowhere. In a case like this, a thoughtless use of peer review as a tool could easily lead to wrong decisions. Suppose that programme 3, the one which eventually leads to the solution of the problem, is initially supported by only a few researchers. A peer review conducted by a committee chosen at random from those working on the problem might well contain an overwhelming majority of researchers working on programmes 1, 2 & 4, and such a committee could easily recommend the cancellation of funding for research programme 3, a decision which would have disastrous long term results.
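The committee argument can be made concrete with a small probability calculation. The sketch below is an illustration under stated assumptions, not a model of any actual RAE panel: the numbers (40 researchers, 4 of them on programme 3, a committee of 5) and the function name are hypothetical, and the committee is assumed to be drawn uniformly at random without replacement, so the count of programme 3 supporters on it follows a hypergeometric distribution.

```python
from math import comb


def prob_outvoted(total, supporters, committee_size):
    """Probability that supporters of one programme hold fewer seats than
    everyone else on a committee drawn uniformly at random, without
    replacement, from all researchers working on the problem."""
    others = total - supporters
    favourable = 0
    for k in range(committee_size + 1):
        # k supporters on the committee; they are outvoted when k is
        # strictly less than the number of other members.
        if 2 * k < committee_size:
            favourable += comb(supporters, k) * comb(others, committee_size - k)
    return favourable / comb(total, committee_size)


# Hypothetical figures: 40 researchers, 4 on programme 3, committee of 5.
p = prob_outvoted(total=40, supporters=4, committee_size=5)
print(f"probability that programme 3 is outvoted: {p:.4f}")
```

With these assumed figures the supporters of programme 3 are outvoted on more than 99% of possible committees, which is the situation described above.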

This point can be clarified and extended by introducing a distinction between two types of error (Type I error, and Type II error). A research assessment procedure commits a Type I error if it leads to funding being withdrawn from a research programme which would have obtained excellent results had the funding been continued. A research assessment procedure commits a Type II error if it leads to funding being continued for a research programme which obtains no good results however long it goes on. This distinction enables us to state a second major defect of an RAE. An RAE concentrates exclusively on eliminating Type II errors. The idea behind an RAE is to make research more cost effective by withdrawing funds from bad researchers and giving them to good researchers. No thought is devoted to the possibility of making a Type I error, that is, the error of withdrawing funding from researchers who would have made important advances if their research had been supported. Yet the history of science shows that Type I errors are much more serious than Type II errors. The case of Semmelweis is a very striking example. The fact that his line of research was not recognised and supported by the medical community meant that, for twenty years after his investigation, thousands of patients lost their lives and there was a general crisis in the whole hospital system.

In comparison with Type I errors, Type II errors are much less serious. The worst that can happen is that some government money is spent with nothing to show for it. Moreover Type II errors are inevitable from the very nature of research. Suppose that, in our example of the four competing research programmes, programme 3 is cancelled in order to save money (a Type I error). Then all the money spent on research into the problem will lead nowhere. It will be a total loss. On the other hand, if programme 3 is retained and another unsuccessful programme (programme 5) is also funded (a Type II error), the costs will be a bit higher but a successful result will be obtained. This shows why Type I errors are much more serious than Type II errors, and why funding bodies should make sure that some funding at least is given to every research school and approach rather than concentrating on the hopeless task of trying to foresee which approach will in the long run prove successful.
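The asymmetry between the two kinds of error can be set out as a toy calculation. The sketch below is only an illustration: the unit cost per programme is a hypothetical figure, the function name is invented for the example, and it assumes, as in the example above, that programme 3 is the only one that can succeed.

```python
# Hypothetical cost of funding one research programme (arbitrary units).
COST_PER_PROGRAMME = 1.0

# As in the example in the text, only programme 3 eventually succeeds.
SUCCESSFUL_PROGRAMMES = {3}


def outcome(funded):
    """Return (total cost, whether the problem gets solved) for a set of
    funded programme numbers."""
    cost = COST_PER_PROGRAMME * len(funded)
    solved = bool(SUCCESSFUL_PROGRAMMES & set(funded))
    return cost, solved


# Type I error: programme 3 is cut to save money.
type_i = outcome({1, 2, 4})

# Type II error: programme 3 is kept and an extra unsuccessful
# programme 5 is funded as well.
type_ii = outcome({1, 2, 3, 4, 5})

print("Type I error: ", type_i)
print("Type II error:", type_ii)
```

Under these assumptions the Type I case spends three units and solves nothing, while the Type II case spends five units and does solve the problem.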

So an RAE may well have a damaging effect even on normal science. Yet normal science tends to be routine in character and to produce small advances rather slowly. Surely, however, we want a research regime to encourage big advances in the subject, exciting innovations, breakthroughs, etc.

It is precisely here that an RAE is likely to fail in the most serious way. Any big advance is likely to have something revolutionary about it, something which challenges accepted ideas and paradigms. However it is precisely in these cases, as we have shown above, that an RAE with its excessive reliance on peer review is likely to have a very negative effect. Our conclusion then is that an RAE is likely to shift the research community in the direction of producing the routine research of normal science resulting in slow progress and small advances. At the same time it will have the effect of tending to stifle the really good research – the big advances, the exciting innovations, the major breakthroughs. Clearly then the overall effect of an RAE is likely to be very negative as regards research output.

5. The effects of an RAE on wealth-generating technologies

An RAE is also likely to impact very negatively on the production of wealth-generating science-based technologies. The reason for this is that the most striking technologies from the point of view of wealth-generation are often based on revolutionary scientific advances. This is well-illustrated by the three examples considered in this paper. Copernicus’ new astronomy led to much better astronomical tables, and so to a much improved navigation. This greatly helped the profitable development of European sea-borne trade in the 17th and 18th centuries. The new mathematical logic introduced by Frege was essential for the development of the computer. It is significant here that Bertrand Russell was one of the first to recognise and develop Frege’s work. Russell established an interest in mathematical logic in the UK, which passed on to two later researchers at Cambridge: Max Newman and his student Alan Turing. After the Second World War, Newman and Turing were part of the team at Manchester which produced the Manchester Automatic Digital Machine (MADM). This started running in 1948, and can be considered as the first computer in the modern sense. Thus Russell’s early recognition of Frege’s revolutionary innovations led indirectly to the UK taking an early lead in the computer field. This early lead was later lost, as we know, but this was owing to lack of sufficient investment by either the public or private sectors. There was no problem with the UK’s research community in those pre-RAE days. Our third case was concerned with the revolutionary introduction of antisepsis in conjunction with revolutionary new theories about the causes of disease. We focussed on Semmelweis whose research work was rejected by the medical community of his time. As we remarked, however, Lister was more successful, and was able to persuade the medical community in the UK to accept antisepsis.

This was obviously of great benefit to patients, but I would now like to add that it led to very successful business developments. For his new form of surgery Lister needed antiseptic dressings, and he devoted a lot of time and thought to working out the best design and composition of such dressings. As his ideas came to be accepted, the demand for these dressings increased and companies were formed to produce them. One of these was founded by a pharmacist, Thomas James Smith. In 1896, he went into partnership with his nephew Horatio Nelson Smith to produce and sell antiseptic dressings. They called the firm Smith and Nephew. Today Smith and Nephew is a transnational company operating in 33 countries and generating sales of £1.25 billion. The company is still involved in wound care as one of its three main specialities, but it has expanded into orthopaedics and endoscopy. One of its well-known products is Elastoplast, which was developed in 1928. The general design of Elastoplast is based on some of the antiseptic dressings developed by Lister. The commercial success of Smith and Nephew is a good illustration of the importance of having a satisfactory research regime in the UK. If Lister’s research on antisepsis had met the same fate as that of Semmelweis only 17 years earlier, then the firm of Smith and Nephew would not be with us today.

6. General Conclusions

The examples I have given are taken from science, using this term in a broad sense to include mathematics and medicine, as well as the natural sciences such as physics and chemistry. But do the conclusions drawn apply also to economics and the social sciences? It seems to me clear that they do. If areas such as mathematics and experimental medicine, which are normally thought of as unproblematic, raise severe problems as regards peer review, it can hardly be denied that such problems are going to be worse in areas such as economics and the social sciences where political and ideological factors are much stronger. So for mathematics, the natural sciences, medicine, economics and the social sciences, we can draw the following conclusion.

An RAE is very expensive both in money and in the time which academics have to devote to it. Its likely effect is to shift the research community in the direction of producing the routine research of normal science resulting in slow progress and small advances, while tending to stifle the really good research – the big advances, the exciting innovations, the big breakthroughs. Thus a great deal of taxpayers’ money will be spent on an exercise whose likely effect is to make research output worse rather than better. Only one conclusion can be drawn from this, namely that RAEs should be abolished rather than introduced.

Note

This is a shortened version of a paper entitled: ‘Lessons from the History and Philosophy of Science regarding the Research Assessment Exercise’ which was read at the Royal Institute of Philosophy in London on 18 November 2005 in a series of talks on the Philosophy of Science. The series will be published by Cambridge University Press in 2006. For convenience of reading I have not included exact references and other academic apparatus in this shortened version, but they are to be found in the longer version, which is available on my website: www.ucl.ac.uk/sts/gillies.

Author contact: donald.gillies@ucl.ac.uk

___________________________

SUGGESTED CITATION:
Donald Gillies, “Why Research Assessment Exercises Are a Bad Thing”, post-autistic economics review, issue no. 37, 28 April 2006, article 1, pp. 2-9, http://www.paecon.net/PAEReview/issue37/Gillies37.htm