Cort Johnson Interviews Tom Kindlon (Part Two)
In early 2012, Cort Johnson began an interview with the Irish ME Association Officer and peer-review published author, Tom Kindlon. This interview project was revived and edited by Stukindawski in early 2013. The interview is in three parts, with an introduction by Stukindawski, and the other parts can be accessed by clicking the following links:-
In this second part of the interview, Cort and Tom discuss the two main types of CBT: programs that include incremental exercise (Graded Exercise Therapy) and programs that include activity management in respect of the patient’s symptoms (Pacing). Cort and Tom focus on the prevalence of their deployment and their potential for harm. Later Cort asks Tom about the value of the data that informs the CBT debate and Tom highlights issues which impact on the reliability of available data on the efficacy of CBT.
CORT: There doesn’t seem to be any such thing as ‘CBT’ – instead there are a variety of programs that can differ substantially. You noted that the Cochrane report stated there are two different kinds of CBT/GET; one seems to downplay the patient’s symptoms while systematically increasing their activity levels and reducing their rest times using a fairly rigid schedule. The other is more focused on helping patients ‘cope’ better with their disability. It sounds like the potential for harm differs significantly depending on what type of program a patient is in. Do we know how prevalent one type of program is versus the other one? Has anyone looked at levels of harm in each type of therapy?
TOM: The type of CBT that, as you describe, “downplays the patient’s symptoms while systematically increasing their activity levels and reducing their rest times using a fairly rigid schedule”, sometimes called the “fear avoidance” model, would be seen as the “evidence-based” form of CBT for the condition. It is the form of CBT recommended by the NICE “CFS/ME” guidelines that cover England, Wales & Northern Ireland in the UK (Scotland has its own system).
The other form of CBT has been very rarely assessed in trials.
If one extrapolates from the ratios of research studies assessing the two types, one might then think that the “fear avoidance” model is generally the most widely used.
However, I think the CBT used in the private sector in many countries (and probably the public sector in some countries) is probably less dominated by the fear avoidance model. Also, I think a lot of the CBT used in the private sector isn’t CFS-specific CBT but often based on CBT for other conditions. Some therapists will have more knowledge and experience than others of other therapeutic modalities and this may influence what therapies are used. Just as a medical doctor can have a variety of interventions they can offer in general, so are there a variety of nonpharmacological therapies in many healthcare professionals’ armamentariums.
The reporting of harms in trials of CBT for CFS has been recognised by systematic reviews as being poor. This does not seem to be uncommon with nonpharmacological interventions; while the reporting of harms in trials of drugs is still not perfect in general, there has been much more focus on it. It is easy for people to presume that nonpharmacological interventions are safe but this should not be accepted without testing, particularly with certain groups. For instance, unmonitored aerobic exercise is likely to be relatively “safe” for the average healthy adult but not so for the patient who has just had a heart attack. A lot of care is taken with exercise, from what I have heard, in individuals undergoing cardiac rehabilitation. Healthcare staff ask such patients if they have symptoms such as shortness of breath or chest pain (i.e. possible signs that the heart might not be receiving adequate oxygen), patients’ blood pressure and heart rate are taken regularly, and they are observed for abnormal heart rhythms using portable monitors. If symptoms occur or abnormalities are seen, the professionals involved might ask patients to stop exercising or decrease the intensity or duration of exercise.
In the field of ME/CFS, particularly with the very unusual effects of exercise with this condition, I think there is much more of a possibility for adverse effects from nonpharmacological interventions so there should be a lot more focus on the issue.
Researchers in Nijmegen in the Netherlands have produced many papers on CBT for CFS over the years. Their program now for “relatively passive” patients is quite intense: “So, for example, the first day the patient has six 1-minute walks, the second day six 2-minute walks, the third day six 3-minute walks, and so on. The aim is a total build-up of 5 minutes a week for each walk a day.” I think most people with a good knowledge and understanding of ME/CFS would have no difficulty imagining that such a program might cause problems for some patients.
I don’t believe that we have data that divides up the two types of CBT you refer to. In trials, as I say, reporting has been very poor. Researchers have generally just reported average (mean) scores or other measures of central tendency (e.g. median scores). And the changes have generally been more likely to be positive than negative on the measures used. And even if there were generally negative results, they would tend to be not that negative if one looked at the averages. However, significant deteriorations by individuals can be buried within averages (and variances) – getting an average improvement doesn’t tell one that every individual has improved.
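A small numerical sketch may make the point about averages concrete. The change scores below are invented purely for illustration (they are not taken from any trial), with positive numbers meaning improvement:

```python
from statistics import mean

# Hypothetical change scores for ten participants (positive = improved).
# These numbers are made up for illustration only.
changes = [4, 3, 3, 2, 2, 2, 1, 1, -6, -7]

average_change = mean(changes)
deteriorated = [c for c in changes if c < 0]

print(f"Mean change: {average_change:+.1f}")
print(f"Participants who deteriorated: {len(deteriorated)} of {len(changes)}")
print(f"Largest individual deterioration: {min(changes)}")
```

In this made-up group the mean change is positive, yet two of the ten participants became considerably worse, which a report giving only the mean would not reveal.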
There has been more focus on possible adverse effects in the last 18 months or so. A Dutch study looked at “possible detrimental effects” of CBT and claimed their study showed CBT to be safe based on a review of the findings of three studies. A published letter of mine, “Harms of cognitive behaviour therapy designed to increase activity levels in chronic fatigue syndrome: questions remain”, pointed out that a previous study had shown that, for the same three studies, those who did CBT didn’t have a bigger change in their activity levels (as measured by a motion-sensing device) than the control group. So we didn’t learn what the effects would be if the patients had actually increased their activity levels, which is supposed to be the aim of the programs. What appears to have happened is that the patients may have gone for longer walks but reduced their activity in other areas of their life, so the overall change was not that great. Other points could be made about the study, such as the post-hoc nature of the measures, whether they cover the range of possible adverse effects, whether the measures used were sensitive to change (including whether they had ceiling/floor effects), etc.
A paper which reported on harms closer to the way one would see it in a drug trial was the Lancet PACE Trial paper. They had specific adverse event outcome measures, rather than simply relying on efficacy measures to do the job. There were other aspects of the trial reporting that are praiseworthy, such as releasing the manuals for therapists and patients, so one can see what the interventions involved. As with the Dutch group, the authors claimed that CBT (and GET) can be safely used. However, the CONSORT statement on the reporting of harms specifically recommends against such claims as they are so hard to prove (think of a comparable statement with drugs, i.e. it is hard for a pharmaceutical company to prove a drug will be safe for everyone, esp. if one looks at the long term). The trial was a long way from proving that the interventions had few important adverse events associated with them, either. For example, we were given details of less than 1% of the adverse events that occurred during CBT or GET. They felt the need to give the details only for severe adverse events, but most such “severe” events (e.g. death) were not the sorts of events one might be concerned about, or focused on, with these interventions, in people with an average age of less than 40 who don’t have other conditions because of the definition used. The authors also changed the adverse outcome measures from what had been listed in the trial protocol which they had published – and, importantly, didn’t mention this (if one does make changes to protocols, one is supposed to highlight this). I discuss this trial in quite a lot of detail in my recent paper. It has some good aspects with regard to the reporting of harms that other authors/researchers can learn from. But it has significant problems also, so I think any claims that the interventions have been shown to be “safe” in this trial are very premature.
Another interesting paper from recent times was published by a team from Barcelona, in the Catalonia region of Spain, looking at a “multidisciplinary treatment combining CBT, GET, and pharmacological treatment”. It found that there was a worsening in SF-36 physical function and bodily pain scores along with a worsening on the pain subscale of the Stanford Health Assessment Questionnaire (HAQ) (54). The SF-36 physical function finding is possibly less interesting as other studies, admittedly not using the exact same program, have shown average improvements. However, the worsening on two pain scales is interesting – pain has generally not been reported in ME/CFS CBT or GET studies, so one is left wondering what the effects on pain levels would have been if other investigators had used such scales. The study also found a significant increase in the total number of the following co-morbidities: fibromyalgia, sicca syndrome, endometriosis/dysmenorrhea, dysthymia, thyroid dysfunction, multiple chemical sensitivity, and irritable bowel syndrome. Again this is interesting as most studies in the field have not looked at comorbidities before and after treatment. So although one wouldn’t say this trial would qualify as high quality in terms of the reporting of harms, it does raise interesting questions.
I’m not sure if there is any survey data comparing different types of CBT programs.
CORT: In the paper you mentioned that a ‘ceiling/floor’ effect could confound results. Can you explain what that is and how that figures in the CBT debate?
TOM: Drs. Bart Stouten and Ellen Goudsmit have done the most work on this in the field that I’m aware of, but I have also seen others such as Drs. Leonard Jason and Fred Friedberg refer to the problem occasionally. I find this effect quite fascinating. It takes a little thought to understand it initially but no mathematical knowledge, so hopefully most people will understand it after the following explanation.
An instrument or questionnaire gives a range of possible scores. Often the range is not as broad as the 101 possible scores in a 0-100 scale – different questionnaires will have different ranges, e.g. 0-10, 5-25, etc. Sometimes a questionnaire may give several summary numbers based on different subscales: for example, a fatigue questionnaire might have a “physical fatigue” and “mental fatigue” score, as well as a total score.
If an individual has already scored the lowest they can on a particular item (question) or even on a subscale or total score, then even if, in reality, they have become worse because of a treatment, this won’t show up. A particular problem in the ME/CFS field is that often participants have a relatively bad or severe problem like severe “fatigue”. This means questionnaires that might not have a particularly bad floor/ceiling problem in the general population (as the percentage of the group with the worst score is small) can have a bigger floor/ceiling problem in an ME/CFS population (as the percentage of a cohort with the worst score is not small).
On some scales, the worst score would be the lowest possible number, such as 0 out of 10 – one could call that a “floor effect”; similarly, on some scales, the worst score could be the highest possible number, for example, 10 out of 10 – one could call that a “ceiling effect”. [Aside: it can get slightly confusing as one can have the same problem “in reverse” in other situations; e.g. somebody scores the best possible score initially and hence, even if they improve, this wouldn’t show up. Again this might be called a “floor” or “ceiling” effect depending on the questionnaire].
Drs. Goudsmit and Stouten along with their colleague Sandra Howes published a nice little paper that illustrated the problem in 2008. They looked at the Chalder Fatigue Scale, which is named after Dr Trudie Chalder, a well-known UK-based researcher. This scale has been regularly used in trials assessing CBT and GET interventions in CFS. This scale has 11 questions (the initial version had 14 questions). There are four possible answers for each of the questions: “Less than usual”; “No more than usual”; “More than usual” and “Much more than usual”. Questions are scored respectively 0, 0, 1, 1 in what’s called the bimodal scoring system – giving a possible range of scores of 0-11. To make things a little complicated, it can also be scored 0, 1, 2, 3 respectively in what’s called the Likert scoring system – giving a possible range of scores of 0-33. (Aside: the scoring for one single question (out of the 11) is actually reversed because of the wording but I’ll ignore that from now on as things are complicated enough).
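For readers who find it easier to follow in code, here is a minimal sketch of the two scoring systems just described. It ignores the one reverse-worded question, as above; the function name `score` and the respondents' answers are hypothetical, chosen only to illustrate the arithmetic.

```python
# The four answer options, in the order given in the text.
OPTIONS = [
    "Less than usual",
    "No more than usual",
    "More than usual",
    "Much more than usual",
]

# Points per answer under each scoring system.
BIMODAL = dict(zip(OPTIONS, [0, 0, 1, 1]))
LIKERT = dict(zip(OPTIONS, [0, 1, 2, 3]))


def score(answers, scheme):
    """Sum the item scores for 11 answers under the given scheme."""
    if len(answers) != 11:
        raise ValueError("expected answers to 11 questions")
    return sum(scheme[a] for a in answers)


# Hypothetical respondent at baseline.
before = (["Much more than usual"] * 7
          + ["More than usual"] * 3
          + ["No more than usual"])

# The same respondent after worsening on three items.
after = ["Much more than usual"] * 10 + ["No more than usual"]

print("Bimodal:", score(before, BIMODAL), "->", score(after, BIMODAL))
print("Likert: ", score(before, LIKERT), "->", score(after, LIKERT))
```

In this made-up case the bimodal total stays at 10 although three answers moved from “More than usual” to “Much more than usual”, while the Likert total rises from 28 to 31.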
Ellen Goudsmit and her team divided a sample of people with ME into those who they defined as being “mild”, “moderate” or “severe” overall. What they found was that, using bimodal scoring, 9 out of 11 (82%) of the severely affected scored the maximum score of 11 on the Chalder Fatigue Scale. Four out of 12 (33%) in the moderate group also scored the maximum. This means that in a trial, even if they got worse on all 11 items, it would not show up – they would score the same. If one looks at the figure (linked here), this may make it easier to understand.
Many of the others in the study scored 10 out of 11: 2/11 (18%) in the “severe” group, 4/12 (33%) in the “moderate” group and 1/3 (33%) in the mild group. This illustrates another strange effect that can occur in questionnaires which have “ceiling/floor” effect problems: somebody could potentially deteriorate overall, moving from the “moderate” to the “severe” group but their score could suggest they improved! This would happen if a moderately affected patient scored 11 initially, deteriorated to become “severe”, but scored 10 on completion (again see the figure to see this more clearly). Some people might deteriorate overall but might say that they were no worse than usual to the question “Do you feel sleepy or drowsy?” or “Do you have problems starting things?” hence allowing the change in score from 11/11 to 10/11.
As one can imagine, this can cause a problem with harms reporting: somebody could become worse but they would either show up as the same or even improved!
It can also affect the validity of efficacy scores, where averages tend to be more important: more people could deteriorate than improve but the mean score might suggest an average improvement! This is because the deterioration of somebody who scored 11/11 initially would not show up in the figures. Then, if somebody else scored 10/11 initially, the maximum deterioration that could be recorded is one point (even if the degree of deterioration was “worth” more points). Another person could improve overall by the same amount but their score might improve by (say) four points (because the ceiling/floor effect means the other individual’s score couldn’t worsen by four points). For the mean to show “no change”, one would need four people deteriorating by 1 point to balance out one person improving by 4 points. So mean scores can show artificial improvements in such scenarios.
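The four-against-one example can be checked with a short sketch. It assumes a hypothetical “true” severity that is allowed to go beyond the top of the scale, which the questionnaire then caps at 11; the people and numbers are invented for illustration, and higher scores mean worse fatigue.

```python
from statistics import mean

MAX_SCORE = 11  # worst possible bimodal score (higher = worse)


def recorded(true_severity):
    """Cap a hypothetical 'true' severity to what the scale can record."""
    return max(0, min(MAX_SCORE, true_severity))


# Hypothetical (baseline, follow-up) true severities for five people.
# Four get worse by 4 points' worth; one gets better by 4 points.
people = [(10, 14), (10, 14), (10, 14), (10, 14), (10, 6)]

true_changes = [after - before for before, after in people]
recorded_changes = [recorded(after) - recorded(before)
                    for before, after in people]

print("True changes:        ", true_changes)
print("Recorded changes:    ", recorded_changes)
print("Mean true change:    ", mean(true_changes))
print("Mean recorded change:", mean(recorded_changes))
```

Here four of the five people deteriorate, yet the recorded mean shows no change at all, because each deterioration can register as at most one point while the single improvement registers in full.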
As one can see, this is potentially a very important issue. Dr. Stouten published a neat paper in 2005 on the issue. In it, he used the summary scores that were publicly available to try to estimate the minimum percentage of CFS patients who scored the worst score on any one question (item) for three different fatigue questionnaires. If these “worst scoring” CFS patients were in a trial, they couldn’t then show up as having become worse even if they did feel worse in this domain. He produced figures (i.e. percentages) that were quite high but also did vary depending on the questionnaire (i.e. some questionnaires seemed to have a worse problem than others with ceiling/floor effects). It is important to remember with this data that his figures are minimums: the actual percentage could be a lot higher than the percentages he reports. I have seen instances where such figures have been published but generally such information has not been made available in the ME/CFS field.
Although these examples have to do with fatigue, one can also have similar problems with other symptoms or other domains. I know, for example, that a lot of people with ME/CFS score 0/100 in the role physical subscale of the SF-36 questionnaire. So again if they were in a trial, their scores would never show a deterioration.
Hopefully in the future, there will be more of a focus on this area. A review by Haywood and colleagues was recently published looking at the quality and acceptability of patient-reported outcome measures used in ME/CFS. It was part of the PRIME (Partnership for Research in ME/CFS) initiative in the UK. It found that relatively few “CFS/ME-specific” measures had been used previously in the field. Those which did exist generally scored poorly in terms of the quality and quantity of the evidence for their psychometric properties. The review found that a wide variety of other questionnaires had been used that were not “CFS/ME-specific”. Generic questionnaires can be useful in terms of making comparisons with other conditions. However, again, there was limited evidence of measurement and/or practical properties. The authors concluded that “clear discrepancies exist between what is measured in research and how patients define their experience of CFS/ME.” They recommended that new patient-reported outcome measures be devised and that patients be actively involved in their development and evaluation. In the meantime, one needs to be cautious when interpreting the results of trials which use questionnaires that have the “floor/ceiling effect” problem.
Addendum: Since I initially wrote this, the PACE analysis team have done a more detailed graph of Chalder fatigue scores using both the Goudsmit et al (2008) data as well as new data they collected themselves. This can be seen at this link. One of them (who preferred not to be named as they say it was a team effort) adds that the data shows that people who said that they were “mild” scored between 4 and 10 with bimodal scoring (15 and 29 with Likert scoring). “Moderate” scores were between 3 and 11 with bimodal scoring (12 and 33 with Likert scoring). “Severe” scores were between 7 and 11 with bimodal scoring (22 and 33 with Likert scoring).