A modest proposal for a clinical spirometry grading system
A while back I reviewed the spirometry grading system that was included in the 2017 ATS reporting standards. My feeling was, and continues to be, that its usefulness is very limited because it’s mostly a reproducibility grading system that relies on a few easy-to-measure parameters. This doesn’t mean that a grading system can’t be helpful, just that it needs to be focused differently.
In a clinical PFT lab many patients have difficulty performing adequate and reproducible spirometry, but that doesn’t mean the results aren’t clinically useful. Moreover, suboptimal quality results may be the very best the patient is ever able to produce. So what’s more important in a grading system than reproducibility is the ability to assess the clinical utility of a reported spirometry effort.
The two most important results that come from spirometry are the FEV1 and the FVC, and I strongly believe that they need to be assessed separately. For each of these values there are two aspects that need to be determined. First, is there a reliable probability that the reported value is correct? Second, are any errors causing the reported value to be underestimated or overestimated? The two are inter-related since a value with excellent reliability is not going to have any significant errors, but if there are errors then a reviewer needs to know which direction the result is being biased.
The current ATS/ERS standards contain specific thresholds for certain spirometry values such as expiratory time and back-extrapolation. Although these are certainly indications of test quality they are almost always used in a binary [pass | fail] manner. In order to assess clinical usefulness however, you instead need to grade these on a scale. For example an expiratory time of 5.9 seconds for spirometry from a 60 year-old individual would mean that there is a small probability that the FVC is underestimated, but with an expiratory time of 1.9 seconds the FVC would have a very high probability of being underestimated and this needs to be recognized in order to assess clinical utility.
Note: Although the A-B-C-D-F grading system is rather prosaic it is still universally understandable, so I will use it for grading reliability. An A grade or an F grade are probably easy to assign but differentiating between B-C-D may be more subjective, particularly since reliability depends on multiple parameters and judging their relative contribution is always going to be subjective at some point. For bias, I will be using directional characters (↑↓) to show the direction of the bias (i.e. positive or negative), so ↑ will indicate probable overestimation, ↓ will indicate probable underestimation, and ~ indicates a neutral bias.
FEV1 / Back extrapolation:
Back-extrapolation is a way to assess the quality of the start of a spirometry effort and the accuracy of the timing of the FEV1. The ATS/ERS statement says that the back-extrapolated volume must be less than 5% of the FVC or less than 0.150 L, whichever is greater.
My experience is that an elevated back-extrapolation tends to cause FEV1 to be overestimated far more often than underestimated. So a suggested grading system for back-extrapolation would be (and I’ll be the first to admit these are off the top of my head and open for discussion):
FEV1: | ||
Back-Extrapolation: | Reliability: | Bias: |
Within standards: | A | ~ |
> 1 x standard, < 1.5 x standard: | B | ↑ |
> 1.5 x standard, < 2 x standard | C | ↑↑ |
> 2 x standard, < 2.5 x standard: | D | ↑↑↑ |
> 2.5 x standard | F | ↑↑↑↑ |
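The table above could be sketched in code roughly as follows. This is only an illustration: the function names `bev_limit` and `grade_back_extrapolation` are hypothetical, the "standard" is taken to be the larger of 5% of FVC or 0.150 L as quoted above, and the treatment of values that land exactly on a boundary (which the table leaves open) is an assumption.

```python
def bev_limit(fvc_l):
    """Largest acceptable back-extrapolated volume in litres:
    5% of the FVC or 0.150 L, whichever is greater."""
    return max(0.05 * fvc_l, 0.150)


def grade_back_extrapolation(bev_l, fvc_l):
    """Return (reliability, bias) for one effort, following the table above.

    Assumption: a value exactly on a boundary gets the better grade.
    """
    ratio = bev_l / bev_limit(fvc_l)  # multiples of the standard
    if ratio <= 1.0:
        return ("A", "~")
    if ratio <= 1.5:
        return ("B", "↑")
    if ratio <= 2.0:
        return ("C", "↑↑")
    if ratio <= 2.5:
        return ("D", "↑↑↑")
    return ("F", "↑↑↑↑")
```

For example, with an FVC of 4.00 L the standard allows 0.20 L, so a back-extrapolated volume of 0.35 L is 1.75 x the standard and `grade_back_extrapolation(0.35, 4.0)` gives a C with ↑↑.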
FEV1 / Pause:
Any pauses that occur due to cough or glottal closure during the first second of exhalation will cause the FEV1 to be underestimated. The time at which these occur and their duration will determine how much the FEV1 will be affected.
What matters is the duration of the pause within the FEV1. Any part of the pause that occurs after the FEV1 may possibly affect the FVC, but not the FEV1. A possible grading system would be:
FEV1: | ||
Pause Duration: | Reliability: | Bias: |
No pause | A | ~ |
>0, <0.1 second | B | ↓ |
>0.1 second, <0.15 second | C | ↓↓ |
>0.15 second, <0.2 second | D | ↓↓↓ |
>0.2 second | F | ↓↓↓↓ |
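Since only the part of a pause that falls inside the first second counts against the FEV1, the grading could be sketched in two steps: clip the pause to the first second, then look up the grade. The names `pause_within_fev1` and `grade_pause` are hypothetical, times are assumed to be measured in seconds from time zero of the effort, and boundary values are assigned to the better grade as an assumption.

```python
def pause_within_fev1(pause_start_s, pause_end_s):
    """Seconds of a pause that overlap the first second of exhalation."""
    overlap = min(pause_end_s, 1.0) - max(pause_start_s, 0.0)
    return max(overlap, 0.0)


def grade_pause(duration_s):
    """Return (reliability, bias) for the pause duration inside the first second.

    Assumption: a value exactly on a boundary gets the better grade.
    """
    if duration_s <= 0.0:
        return ("A", "~")
    if duration_s <= 0.10:
        return ("B", "↓")
    if duration_s <= 0.15:
        return ("C", "↓↓")
    if duration_s <= 0.20:
        return ("D", "↓↓↓")
    return ("F", "↓↓↓↓")
```

A cough that runs from 0.9 to 1.3 seconds, for instance, contributes only about 0.1 second to the FEV1 grading, and a pause that starts at 1.2 seconds contributes nothing.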
FEV1 / Peak flow contour
This part of the grading gets into subjective territory. Although the ATS/ERS spirometry standard does not consider the Peak Expiratory Flow (PEF) to be a criterion when selecting spirometry efforts, it does say that a good spirometry effort should show maximal patient effort. I think that PEF should be a selection criterion for FEV1 because submaximal spirometry efforts, as shown by a lower PEF, often have an elevated FEV1.
So the FEV1 from the effort with the highest PEF should be reported (even when it isn’t the highest reported FEV1), but that’s only looking at the reported PEF value, not the actual quality of the PEF effort. For that, we generally have to look at how “pointy” (there’s got to be a better way to describe this) the PEF is on the flow-volume loop.
In general, the “sharper” the PEF contour, the more likely the PEF effort was good. The more “blunted” the PEF contour (which should not be mistaken for the expiratory plateaus of intrathoracic large airway obstructions or the typical flattened contour of tracheomalacia) the more likely the PEF effort was submaximal and the more likely that the FEV1 is overestimated.
And the suggested grading would be:
FEV1: | ||
PEF Contour: | Reliability: | Bias: |
Sharp | A | ~ |
Mildly blunted | B | ↑ |
Moderately Blunted | C | ↑↑ |
Severely blunted | F | ↑↑↑ |
FVC / Expiratory time:
The ATS/ERS spirometry standard recommends a minimum expiratory time of 6 seconds for adults, but this fails to acknowledge that the expiratory time necessary to obtain a reliable FVC is often lower in young adults and higher in the elderly. Nor does it take into consideration the fact that expiratory time increases as airway obstruction increases.
For these reasons, expiratory time needs to be assessed by two different criteria, age and degree of airway obstruction.
FVC / Expiratory time / Age:
I’d like to suggest that an adequate expiratory time should be 4 seconds for a 20 year old and 8 seconds for an 80 year old (totally arbitrary of course but hopefully reasonably correct). Because exhaled volume closely follows an exponential curve, an expiratory time that’s low by 2 seconds has a proportionally greater effect on FVC than does an expiratory time that’s low by 1 second. For this reason, grading the reliability of expiratory time should look something like this:
There is a pretty direct relationship between the reliability of the expiratory time and the bias:
FVC: | |
Expiratory Time – Age: | Bias: |
A | ~ |
B | ↓ |
C | ↓↓ |
D | ↓↓↓ |
F | ↓↓↓↓ |
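To make the reasoning about age and the exponential curve concrete, here is a minimal sketch. Everything beyond the two anchor points (4 seconds at age 20, 8 seconds at age 80) is an assumption for illustration: the linear interpolation between them, holding the value constant outside that age range, the simple 1 - exp(-t/tau) model of exhaled volume, and the made-up time constant `tau_s`. The function names are hypothetical.

```python
import math


def expected_expiratory_time(age_years):
    """Adequate expiratory time in seconds: 4 s at age 20, 8 s at age 80.

    Assumption: linear in between, held constant outside 20 to 80 years.
    """
    age = min(max(age_years, 20), 80)
    return 4.0 + (age - 20) * 4.0 / 60.0


def fraction_missed(actual_s, adequate_s, tau_s=1.5):
    """Fraction of the asymptotic volume that would have been exhaled
    between actual_s and adequate_s.

    Assumes exhaled volume follows 1 - exp(-t / tau_s); tau_s is a
    made-up value used only for illustration.
    """
    if actual_s >= adequate_s:
        return 0.0
    return math.exp(-actual_s / tau_s) - math.exp(-adequate_s / tau_s)
```

Under these assumptions, stopping 2 seconds early loses more than twice the volume lost by stopping 1 second early, which is why the reliability grades should not be spaced evenly in time.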
Note: Expiratory time is usually determined by the point at which the patient starts to inhale after their maximal exhalation or when the technician manually terminates the test. The reported expiratory time will therefore be overestimated when there are expiratory pauses or when the patient stops exhaling but the test system does not immediately register that this has occurred. Whenever possible the expiratory time used in grading reliability and bias should be adjusted for pauses and early termination of exhalation.
FVC / Expiratory time / Airway obstruction
The presence of airway obstruction is assessed using the LLN of the FEV1/VC ratio and its severity is assessed using the percent predicted FEV1. There is likely a curvilinear relationship between the severity of airway obstruction and the amount of extra expiratory time that’s required for a reliable FVC.
In one sense this curvilinearity quickly produces excessive FVC expiratory times that aren’t clinically or physiologically realistic (i.e. more than 12-15 seconds) and under no circumstances should we expect our patients to exhale that long. At the same time however, does anybody expect that an FVC that’s 50% of predicted in a patient with an FEV1 of 25% of predicted and a 12 second expiratory time is the patient’s “real” FVC?
The degree of this curvilinearity is only speculative, but expiratory time should be adjusted for obstruction in some way. Off the top of my head I’d suggest that anybody with mild airway obstruction needs an additional 25% in expiratory time for a reliable FVC, that anybody with very severe airway obstruction would need 3 times their expected expiratory time, and that the factors for moderate and severe obstruction fall in between:
FVC: | Expiratory time factor: |
Mild OVD | 1.25 |
Moderate OVD | 1.50 |
Severe OVD | 2.00 |
Very Severe OVD | 3.00 |
I further suggest that the age-adjusted expected expiratory time should be multiplied by the appropriate factor, and that the actual expiratory time should then be scored as a percent of that adjusted expected time:
FVC: | |
Expiratory Time – OVD | Reliability |
> 90% | A |
>75%, <90% | B |
>60%, <75% | C |
>50%, <60% | D |
<50% | F |
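The two tables above could be combined in code along these lines. This is a sketch rather than a definitive implementation: the names are hypothetical, the factor of 1.00 for no obstruction is my assumption (the table only starts at mild), and values exactly on a boundary are given the better grade.

```python
# Expiratory time factors from the table above; "none" is an assumption.
OVD_FACTOR = {
    "none": 1.00,
    "mild": 1.25,
    "moderate": 1.50,
    "severe": 2.00,
    "very severe": 3.00,
}


def grade_expiratory_time_ovd(actual_s, expected_age_s, severity):
    """Grade the actual expiratory time as a percent of the age-adjusted
    expected time multiplied by the obstruction factor."""
    required_s = expected_age_s * OVD_FACTOR[severity]
    percent = 100.0 * actual_s / required_s
    if percent >= 90.0:
        return "A"
    if percent >= 75.0:
        return "B"
    if percent >= 60.0:
        return "C"
    if percent >= 50.0:
        return "D"
    return "F"
```

As a worked example, a patient with an age-adjusted expected time of 6 seconds and very severe obstruction would need 18 seconds, so a 12 second exhalation is about 67% of that and scores a C.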
FVC / Terminal expiratory flow rate:
The current ATS/ERS standard for an adequate terminal expiratory flow rate is 0.025 L/sec (although it’s actually expressed as a volume change of 0.025 L over 1 second and not as an actual flow rate). The problem is that an FVC with a terminal expiratory flow that is only slightly over this value still has a reasonable probability of being correct. It’s when the terminal flowrate is high that it’s clear the probability the FVC is being underestimated is also high.
However, there are no test systems that I know of that report the terminal expiratory flowrate (why not?), so until they do this has to be judged by eye.
And terminal expiratory flow rate should be graded as:
FVC: | ||
Terminal Flowrate: | Reliability: | Bias: |
Within standard | A | ~ |
Mild | B | ↓ |
Moderate | C | ↓↓ |
Severe | F | ↓↓↓ |
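Although test systems don't report it, the terminal value could in principle be computed from the volume-time data. The sketch below measures the volume change over the final second of the effort, which is how the standard expresses the 0.025 L criterion. The function names and the input format (two lists of samples with strictly increasing times) are assumptions, and no thresholds for Mild, Moderate or Severe are attempted because the table above doesn't define them.

```python
def terminal_volume_change(times_s, volumes_l, window_s=1.0):
    """Volume exhaled (L) during the final window_s seconds of the effort.

    Assumes times_s is strictly increasing and both lists have the same
    length. The volume at the start of the window is found by linear
    interpolation between the two neighbouring samples.
    """
    t_start = times_s[-1] - window_s
    if len(times_s) < 2 or t_start < times_s[0]:
        raise ValueError("effort is shorter than the measurement window")
    for i in range(1, len(times_s)):
        if times_s[i] >= t_start:
            t0, t1 = times_s[i - 1], times_s[i]
            v0, v1 = volumes_l[i - 1], volumes_l[i]
            v_start = v0 + (v1 - v0) * (t_start - t0) / (t1 - t0)
            return volumes_l[-1] - v_start
    raise ValueError("could not locate the start of the window")


def meets_terminal_standard(times_s, volumes_l, limit_l=0.025):
    """True when the volume change over the final second is within the limit."""
    return terminal_volume_change(times_s, volumes_l) <= limit_l
```

A reported number like this would let the terminal flow be graded on a scale instead of by eye.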
FVC / Gas Trapping:
A spirometry effort may meet all the ATS/ERS criteria but inspection of the flow-volume loop sometimes shows that the exhaled volume is lower than the inhaled volume. This is a sign of gas trapping and can happen in individuals with severe airway obstruction. Unfortunately, there are no test systems that measure the volume of the initial inhalation (and again, why not?) so this must be detected by eye.
If the difference in inspiratory and expiratory volumes can be measured, the expiratory volume should be expressed as a percent of the inspiratory volume and could be graded accordingly:
FVC: | ||
Gas Trapping: | Reliability: | Bias: |
Exhaled volume ≥ Inhaled Volume | A | ~ |
Exhaled volume > 95% Inhaled Volume | B | ↓ |
Exhaled volume > 85% & < 95% Inhaled Volume | C | ↓↓ |
Exhaled volume > 75% & < 85% Inhaled Volume | D | ↓↓↓ |
Exhaled volume < 75% Inhaled Volume | F | ↓↓↓↓ |
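If a test system did report both volumes, the table above could be applied like this. The function name is hypothetical, and the table doesn't say which grade a value exactly at 95%, 85% or 75% should get, so putting those in the lower grade is an assumption.

```python
def grade_gas_trapping(exhaled_l, inhaled_l):
    """Return (reliability, bias) from exhaled volume as a percent of
    the initial inhaled volume.

    Assumption: values exactly at 95%, 85% or 75% get the lower grade.
    """
    percent = 100.0 * exhaled_l / inhaled_l
    if percent >= 100.0:
        return ("A", "~")
    if percent > 95.0:
        return ("B", "↓")
    if percent > 85.0:
        return ("C", "↓↓")
    if percent > 75.0:
        return ("D", "↓↓↓")
    return ("F", "↓↓↓↓")
```

An exhaled volume of 3.6 L after an inhaled volume of 4.0 L, for example, is 90% and would be graded C with ↓↓.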
FVC / Inadequate Inhalation:
All of the ATS/ERS criteria that apply to the FVC are concerned with an inadequate exhalation and there are no criteria that address an inadequate inhalation. This is mostly because detecting an inadequate inhalation is quite difficult. Although there are several signs that are suspicious for this problem the only circumstance in which this clearly shows is when a maximal inhalation is performed after the maximal expiratory maneuver and the final inhalation has a larger volume than the initial one. Many test systems will measure this final maximal inhalation as the FIVC, although this value is not often reported.
When the initial inspiratory volume (as shown by the FVC) can be compared to the final inspiratory volume (FIVC), the FVC should be expressed as a percent of the FIVC and could be graded accordingly:
FVC: | ||
Inadequate Inhalation: | Reliability: | Bias: |
FVC ≥ FIVC | A | ~ |
FVC > 95% FIVC | B | ↓ |
FVC > 85% & < 95% FIVC | C | ↓↓ |
FVC > 75% & < 85% FIVC | D | ↓↓↓ |
FVC < 75% FIVC | F | ↓↓↓↓ |
FVC / Zero offset error:
This is primarily an equipment error that is uncommon but still occasionally happens (twice last week in my lab on two different test systems). It can also be caused by transtracheal O2.
It can be difficult to detect, particularly when it’s a negative offset, but when it is detected there is no way to be sure how much or how little the FVC has been over- or underestimated. This error gets an automatic F score with a ↑↑↑ for bias when it’s a positive zero offset and ↓↓↓ when it’s a negative zero offset.
FEV1 and FVC scoring:
FEV1 and FVC will each be affected most by the lowest reliability score. So when the individual scores are combined:
A+A = A
A+B = B
A+C = C
A+D = D
A+F = F
There should also be an additive effect, so:
B+B = C
C+C = D
D+D = F
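One way to turn these combination rules into code is sketched below. The rules above only cover pairs of scores, so how to handle three or more components is my assumption: take the worst grade, and if that grade occurs at least twice, drop it by one step (and only one step). The function name is hypothetical.

```python
GRADES = "ABCDF"  # best to worst


def combine_reliability(scores):
    """Combine a list of component grades into one overall grade.

    The worst grade dominates. Assumption: when the worst grade appears
    two or more times it is lowered by a single step (B+B = C, C+C = D,
    D+D = F), no matter how many times it appears.
    """
    worst = max(scores, key=GRADES.index)
    if worst not in ("A", "F") and scores.count(worst) >= 2:
        worst = GRADES[GRADES.index(worst) + 1]
    return worst
```

With this rule, FVC component scores of B, A, B, A, A and A combine to an overall C, because the two B grades add together.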
Bias is likely additive. Opposite biases will cancel each other out to some extent, but probably not ever exactly. For this reason, when opposing biases are added, they should be replaced with ↕ to indicate that the resultant bias is uncertain but may be neutral. For example:
Overall FEV1 bias: ↓↓↑
Would be reported as:
Overall FEV1 bias: ↕↓
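The bookkeeping for adding biases could look something like this. It is a minimal sketch with a hypothetical function name, and it assumes each component bias is a string made up only of ↑, ↓ or ~ characters: each opposing pair of arrows becomes one ↕, and whatever is left over keeps its direction.

```python
def combine_bias(component_biases):
    """Combine component bias strings into one overall bias string.

    Assumes inputs contain only the characters ↑, ↓ and ~.
    """
    joined = "".join(component_biases)
    ups = joined.count("↑")
    downs = joined.count("↓")
    opposed = min(ups, downs)  # each opposing pair becomes one ↕
    result = "↕" * opposed + "↑" * (ups - opposed) + "↓" * (downs - opposed)
    return result or "~"
```

So `combine_bias(["↓↓", "↑"])` gives ↕↓, matching the example above, and components that are all neutral stay at ~.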
FEV1/FVC:
When the FVC or FEV1 are under- or over-estimated this will affect the reliability and the bias of the reported FEV1/FVC ratio. The reliability of the FEV1/FVC ratio should equal the lowest overall reliability score for the FVC and the FEV1. For example:
Overall FEV1 reliability: C
Overall FVC reliability: B
FEV1/FVC reliability: C
Biases in FEV1 and FVC have opposite effects on the FEV1/FVC ratio. A negative (↓) bias in FEV1 will produce a negative bias (↓) in the FEV1/FVC ratio. A negative bias in FVC (↓), on the other hand, will produce a positive bias (↑) in the FEV1/FVC ratio. For this reason, when estimating the total bias acting on the FEV1/FVC ratio it is probably easiest to flip the direction of the FVC bias and add it to the FEV1 bias.
FVC and FEV1 biases can oppose or reinforce each other. Opposite-acting biases will probably never cancel each other out exactly but will leave an uncertainty regarding the actual bias of the FEV1/FVC ratio. For this reason I’d again suggest that when two biases oppose each other they are replaced with an indication of uncertainty: ↕. So, for example, after the FVC biases have been flipped:
FEV1 bias: ↓↓
FVC bias: ↑↑
and would be reported as:
FEV1/FVC ratio bias: ↕↕
or:
FEV1 bias: ↓↓↓
FVC bias: ↑
Would be reported as:
FEV1/FVC ratio bias: ↕↓↓
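The ratio rules could be sketched as follows. Note that this function takes the FVC bias as originally graded and does the flipping itself, whereas the examples above show the FVC bias after it has already been flipped. The names are hypothetical, and the inputs are assumed to contain only ↑, ↓ or ~ characters.

```python
GRADES = "ABCDF"  # best to worst
FLIP = str.maketrans("↑↓", "↓↑")


def ratio_reliability(fev1_grade, fvc_grade):
    """The FEV1/FVC reliability is the worse of the two overall grades."""
    return max(fev1_grade, fvc_grade, key=GRADES.index)


def ratio_bias(fev1_bias, fvc_bias):
    """Flip the FVC bias, add it to the FEV1 bias, and replace each
    opposing pair of arrows with ↕.

    Assumes inputs contain only the characters ↑, ↓ and ~.
    """
    combined = fev1_bias + fvc_bias.translate(FLIP)
    ups = combined.count("↑")
    downs = combined.count("↓")
    opposed = min(ups, downs)
    result = "↕" * opposed + "↑" * (ups - opposed) + "↓" * (downs - opposed)
    return result or "~"
```

An FEV1 bias of ↓↓ together with an unflipped FVC bias of ↓↓ therefore gives ↕↕ for the ratio, and an unbiased FEV1 with an FVC bias of ↓ gives ↑.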
And the overall reporting of reliability and bias for all these parameters could look something like this:
FEV1: | Reliability: | Bias: | |
Overall: | C | ↓↓ | |
Back Extrapolation: | A | ~ | |
Pause: | C | ↓↓ | |
PEF Contour: | A | ~ | |
FVC: | Reliability: | Bias: | |
Overall: | C | ↓↓ | |
Expiratory Time – age: | B | ↓ | |
Expiratory Time – OVD: | A | ~ | |
Terminal Flow: | B | ↓ | |
Gas Trapping: | A | ~ | |
Inadequate Inhalation: | A | ~ | |
Zero Offset: | A | ~ | |
FEV1/FVC: | Reliability: | Bias: | |
Overall: | C | ↕↕ | |
My point in suggesting this grading system is that spirometry results are often less than perfect. Some patients (10%? 15%? 20%?) are completely unable to give any kind of a reproducible effort, but that doesn’t mean that the reported effort isn’t clinically relevant. The clinical utility of FVC and FEV1 is difficult, if not impossible, to judge using the current [pass | fail] approach to grading results. Even more importantly, the reliability and bias of the reported FEV1 and FVC need to be addressed separately rather than combined in a single score.
Reliability and bias scores would help reviewers to assess the clinical utility of the reported results and this system attempts to address this. Most of the values I’ve suggested for assessing test quality are fairly arbitrary but I wouldn’t have suggested them if I didn’t think they were reasonably accurate.
There’s no particular reason that most, if not all, of this suggested grading system couldn’t be implemented in software, and so there’s some potential for producing reliability and bias scores automatically. Most manufacturers are reluctant to add features like this, however, unless they are recommended or mandated by the ATS and ERS. As much as I may think this is the direction that a clinically-oriented grading system should go, I’m well aware that until it gains approval by the ATS or ERS this type of system would have to be implemented manually, and that means it’s unlikely to be adopted. Nevertheless I still hope to at least generate some ideas and conversation on this subject.
Finally though, I’ve begun to wonder if the basic premise of getting both the FEV1 and the FVC from the same test maneuver is really correct. The standard spirometry maneuver is good for getting the best FEV1 but often so-so in getting the best VC. An SVC maneuver on the other hand, is good for getting the best VC, but very poor in getting the best FEV1. Is it time that we re-thought routine spirometry and obtained the FEV1 and VC from different maneuvers rather than just the one? But I’ll save discussion of this topic for another time.
References:
Brusasco V, Crapo R, Viegi G. ATS/ERS task force: Standardisation of lung function testing. Standardisation of spirometry. Eur Respir J 2005; 26: 319-338.
Culver BH, Graham BL, Coates AL et al. Recommendations for a standardized pulmonary function report. Am J Respir Crit Care Med 2017; 196(11): 1463-1472.