Migraine is an important health issue because of its high prevalence, disability, and heavy burden. Fifty-one percent of respondents to a survey reported a reduction of at least 50% in their work or school productivity because of migraine, and approximately 23% of households had at least one member who suffered from migraines [1]. Migraine prevalence is highest during peak productive years (ages 25–55), resulting in substantial indirect costs. Estimates of workdays lost to migraine range from 8.1 to 40.5 days annually [2]. Prophylactic therapy in the management of migraine has gained increasing interest in recent years. Yet the effects of these preventive therapies [3, 4] need to be better understood, especially in the area of health-related quality of life (HRQoL).

The Migraine-Specific Quality of Life Questionnaire (MSQ) is a widely used migraine-specific instrument in HRQoL research [5]. Developed by Jhingran et al. [6], the MSQ measures the impact of migraine on the patient’s HRQoL over the past four weeks across three dimensions: Role Restrictive (RR), Role Preventive (RP), and Emotional Function (EF).

Whereas the validity of the MSQ has been examined in many studies of migraine patients treated with abortive therapy, its validity for patients undergoing prophylactic treatment has not yet been examined. Validity of an instrument in one group (e.g., migraineurs on acute therapy) cannot be assumed to hold in other groups (e.g., migraineurs on prophylactics) [7]. Because the MSQ is increasingly used to measure HRQoL in migraineurs, we undertook a comprehensive examination of the psychometric properties of the MSQ in a sample of patients undergoing migraine prevention.

Methods

Subjects

Conducted between February 2001 and April 2002, TOPMAT-MIGR-001 and TOPMAT-MIGR-002 were randomized, double-blind, placebo-controlled clinical trials of identical design, one conducted in the US [8] (n = 468) and the other in the US and Canada [9] (n = 448). The 916 patients were aged 12–65 years, had a 6-month history of migraine (International Headache Society criteria), and experienced 3–12 migraines per month but no more than 15 headache days per month during the 28-day baseline period. After washout, patients were randomized to placebo or to one of several topiramate dosage groups. Maintenance therapy continued for 18 weeks.

Measures

Migraine-Specific Quality of Life Questionnaire version 2.1

The MSQ was administered at baseline and at months 2, 4, and 6. The 14-item MSQ measures the impact of migraine on HRQoL in three domains: RR, RP, and EF. All domains of the MSQ are scored from 0 to 100, with higher scores indicating better functioning. Items use a standard six-point ordered-categorical scale with choices ranging from “none of the time” to “all of the time.” Internal consistency reliability and validity (i.e., structural and convergent) have been found to be appropriate [6, 10–12] in patients undergoing acute migraine treatment.

Convergent and discriminant measures

Using an IRT calibration process that incorporates measurement error, MSQ scores (collected in the clinical trials) were converted into scores on the Headache Disability Inventory (HDI) and the six-item Headache Impact Test (HIT-6) as measures of convergent validity [13]. Discriminant validity was assessed with baseline scores on the SF-36 [14], which measures general HRQoL through physical (PCS) and mental (MCS) component summaries.

Data analysis

MSQ scores were calculated per published instructions [12]. The MSQ latent structure was examined using item-level information. Latent structure analyses included confirmatory factor analysis (CFA) to verify that the MSQ scoring system matches the factor-analytic results [15], followed by item response theory (IRT) and differential item functioning (DIF) analyses to explore possible item bias. Subsequently, MSQ item and scale measurement properties were reviewed.

Missing data

Because removing cases with missing data causes loss of power, spuriously inflated standard errors, and reduced generalizability [16], we undertook a single multivariate-normal Bayesian imputation [16, 17] using the sequential missingness-imputation system described in Cole [17].
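As an illustration only, a single multivariate-normal imputation draw can be sketched as follows. This is a minimal sketch in which the mean and covariance are estimated from complete cases and each incomplete row's missing values are drawn from the conditional normal distribution given its observed values; the actual analysis used the sequential system of Cole [17], and the function name here is our own.

```python
import numpy as np

def impute_mvn(X, seed=0):
    """Draw one multivariate-normal imputation of the NaN entries in X.

    The mean and covariance are estimated from complete cases; each
    incomplete row's missing values are then drawn from the conditional
    normal distribution given that row's observed values.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    complete = ~np.isnan(X).any(axis=1)
    mu = X[complete].mean(axis=0)
    S = np.cov(X[complete], rowvar=False)
    for i in np.where(~complete)[0]:
        m = np.isnan(X[i])   # missing positions in this row
        o = ~m               # observed positions in this row
        beta = S[np.ix_(m, o)] @ np.linalg.inv(S[np.ix_(o, o)])
        cond_mu = mu[m] + beta @ (X[i, o] - mu[o])
        cond_S = S[np.ix_(m, m)] - beta @ S[np.ix_(o, m)]
        X[i, m] = rng.multivariate_normal(cond_mu, cond_S)
    return X
```

Unlike complete-case deletion, this keeps every row in the analysis; observed values are left untouched and only NaN entries are filled.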

CFA

CFA was used to examine the match between the scoring system and the MSQ latent structure. CFA was conducted with the same methods described previously in Cole et al. [18]. The first CFA model specified each of the three domains (RR, RP, and EF) as a latent factor loading on its respective items. A second MSQ latent model examined endpoint data to determine whether the baseline model fit the endpoint data (so that scores may be interpreted in the same manner at baseline and endpoint).

IRT latent analyses and DIF

We used the generalized partial credit IRT model [19] with a maximum marginal likelihood estimation procedure [20]. We also examined an item-fit statistic adapted for polytomous items: for each item, we compared expected and observed frequency distributions [21].

After calculation of IRT item properties, DIF was examined. The MSQ scoring assumes that people with similar scale scores have the same probability of answering an item in a certain way, regardless of their demographic composition. Logistic regression methods [22] were used to assess DIF, wherein the simple sum score of the items is used as a proxy for migraine-specific QoL (once for each of the three domains). DIF is evaluated by testing for associations between each item and subgroup membership, conditioning on the sum score. Because multiple tests were performed, two criteria were required for an item to exhibit DIF: statistical significance (p-value below a Bonferroni-corrected alpha of .05) and magnitude of DIF (an R² difference [ΔR²] of at least 2%, a level indicative of the minimal change found to be important) [23].
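This screen can be illustrated with a self-contained sketch that fits nested logistic regressions (using our own Newton–Raphson fitter, with item responses dichotomized for simplicity; the trial analysis applied the polytomous logistic methods of [22]) and returns the ΔR² attributable to group membership after conditioning on the sum score:

```python
import numpy as np

def logit_fit(X, y, iters=50):
    """Fit a logistic regression by Newton-Raphson; return the log-likelihood."""
    X = np.column_stack([np.ones(len(y)), X])  # prepend intercept
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-9 * np.eye(X.shape[1])
        b += np.linalg.solve(H, X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X @ b)), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log1p(-p)))

def dif_delta_r2(item, sum_score, group):
    """McFadden pseudo-R2 gained by adding group membership to a model
    that already conditions on the scale sum score (uniform-DIF screen)."""
    y = np.asarray(item, dtype=float)
    ll0 = logit_fit(np.zeros((len(y), 0)), y)                # intercept only
    ll1 = logit_fit(np.column_stack([sum_score]), y)         # + sum score
    ll2 = logit_fit(np.column_stack([sum_score, group]), y)  # + group
    return (1 - ll2 / ll0) - (1 - ll1 / ll0)
```

An unbiased item should yield a ΔR² near zero once the sum score is conditioned on, whereas an item whose difficulty shifts with group membership yields a ΔR² that can exceed the 2% criterion.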

Item-level psychometric evaluation

Five sets of analyses were conducted on items retained after the latent analyses: equality of item-total correlations within each scale, equality of variances for each item within each scale, sufficient item-total correlations, small alpha-removed statistics, and item-total correlations that were higher for each item’s own scale than for the other scales. Four of these analyses are MSQ standards [12]; alpha-removed statistics were added based on recommendations from other psychometricians [24].

Scale-level psychometric evaluation

Scale psychometrics were examined with internal consistency reliability as well as convergent and discriminant validity. Internal consistency was measured with three indices: coefficient alpha, alpha adjusted to a 10-item scale via the Spearman–Brown formula, and the average interitem correlation [24]. We evaluated coefficient alpha and the Spearman–Brown adjusted alpha against criteria from Nunnally and Bernstein [25]: ≥.90 is excellent (i.e., an excellent degree of internal consistency) and ≥.80 is sufficient. Average interitem correlations should be at least .40, and preferably above .50 [26].
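These three indices can be computed directly from a respondents-by-items score matrix. The sketch below (the function name is our own) computes coefficient alpha, its Spearman–Brown projection to a 10-item scale, and the average interitem correlation:

```python
import numpy as np

def reliability_indices(items, target_k=10):
    """Coefficient alpha, its Spearman-Brown projection to target_k items,
    and the average interitem correlation for a respondents-by-items matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                             / items.sum(axis=1).var(ddof=1))
    R = np.corrcoef(items, rowvar=False)
    avg_r = R[np.triu_indices(k, 1)].mean()
    # Invert alpha = k*r / (1 + (k-1)*r) for the implied mean correlation r,
    # then project to a target_k-item scale with the Spearman-Brown formula.
    r_bar = alpha / (k + alpha * (1 - k))
    alpha_k = target_k * r_bar / (1 + (target_k - 1) * r_bar)
    return alpha, alpha_k, avg_r
```

Because alpha rises with scale length, the Spearman–Brown projection puts scales with different numbers of items on a common 10-item footing before applying the Nunnally and Bernstein criteria.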

Although average correlations can provide an appropriate measure of overall convergent and discriminant validity, meta-analysis provides a more generalizable measure of the overall relationship, including standard-error adjustments based on the error and sample size of each correlation [27]. Convergent validity was calculated for each of the MSQ domains with a fixed-effects model from correlations with the other two MSQ domains, both HDI scales, and the HIT-6. Discriminant validity was determined through correlations of the MSQ domains with PCS and MCS scores. Differences between convergent and discriminant effects were examined.
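A fixed-effects pooling of correlations on Fisher’s z scale can be sketched as follows (a generic illustration with our own function name; [27] describes the full procedure):

```python
import numpy as np

def fixed_effects_r(rs, ns):
    """Pool correlations under a fixed-effects model on Fisher's z scale.

    Each correlation is weighted by n - 3, the inverse variance of its
    z transform; returns the pooled correlation and the overall z statistic.
    """
    rs = np.asarray(rs, dtype=float)
    w = np.asarray(ns, dtype=float) - 3.0
    z_bar = (w * np.arctanh(rs)).sum() / w.sum()   # weighted mean of Fisher z
    return float(np.tanh(z_bar)), float(z_bar * np.sqrt(w.sum()))
```

Pooling on the z scale rather than averaging raw correlations keeps the sampling distribution approximately normal and lets larger samples carry proportionally more weight.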

Results

The final sample of 916 patients had a mean age of 40.7 years (SD = 10.7) and the following mean (SD) baseline values for the MSQ: RR = 49.58 (SD = 16.65), RP = 67.38 (SD = 19.81), and EF = 54.00 (SD = 24.83). Few missing items were present among the baseline MSQ items, with a range from 0% missingness (several items) to 0.44% missingness (items 3 and 4). Because of dropout, there was greater missingness at endpoint, ranging from 10.70% (several items) to 11.68% (items 1 and 13). PCS and MCS (measured at baseline) were missing for 4.59% of the sample.

Poor performance of item 12

All analyses were conducted on the 14-item MSQ as described in the Methods. The analyses revealed a consistent problem: item 12 (frustration) had poor psychometric performance. A brief review follows.

The initial CFA model (Model 1) yielded fit that was close to acceptable (GFI = .93, CFI = .95, NNFI = .95, and RMSEA = .07). Results indicated that the model was missing some significant relationships. A large modification index indicated that item 12 should also have a path from the RR factor. However, dual loadings of items on scales lead to interpretation and multiplicity problems [24]. For item 12, the item-total correlation was significantly less than the mean of the other EF items: z-diff = 3.667 (p < .001). Although sample size could have been an issue, halving the sample (n = 458) did not change significance (p = .009). Item 12 had an alpha-removed value of .85, a difference of .017. Although this does not meet the alpha-removed criterion of .02, removal is worth considering given the item’s other poor statistics. Finally, the item-total correlation for item 12 was not significantly different from the correlations between item 12 and the other scales. Indeed, item 12 correlated numerically higher with RR (.62) than with its own scale (.60).

Reanalysis after removal of item 12

Because of item 12’s poor performance, it was considered for removal. We considered psychometric, theoretical, and clinical issues related to its removal. Psychometrically, a two-item scale is not ideal, but with carefully selected item removal, two-item scales can perform sufficiently [28]. Theoretically, the ambiguous relationship of a frustration item with the rest of the items is logical in that RP and RR are contributing factors to further frustration. Clinically, the MSQ has undergone several refinements and item reductions, including a major overhaul of items between versions 1 and 2 [11], a recent reduction from 16 to 14 items from version 2 to version 2.1 [12], and other papers considering removal of additional items [5]. With these justifications, psychometrics were reassessed after removal of item 12. The mean (SD) baseline score for EF changed to 60.95 (27.44). Correlations for all final outcomes at baseline are provided in Table 1.

Table 1 Scale-level correlations and descriptive statistics for all participants after removal of item 12

The CFA of Model 1 was reanalyzed and suggested some important paths were missing, so modification indices were inspected. Modification indices revealed a strong relationship between the residuals for items 1 (time with family) and 2 (leisure) (modification index = 45.64) and between items 6 (tired) and 7 (energy) (modification index = 60.64), resulting in correlations of .24 and .28, respectively. Both correlated residual pairs are logical: issues that impact one’s ability to spend time with family likely impact one’s ability to spend time on leisure activity, and extraneous issues that make one feel tired likely also make one feel less energetic.

Fig. 1 CFA model 2 (standardized estimates): basic MSQ with two residual correlations for baseline data (standardized loadings), no item 12 (e1–e14 are item residuals)

The refined version of Model 1, now Model 2, yielded strong fit statistics on all model fit criteria, with GFI, CFI, and NNFI > .95 and RMSEA = .06 (see Fig. 1). A similar model (based on Model 2) with endpoint data had sufficient fit on all fit statistics. DIF analyses for the three domains found that none of the R²-change values was greater than 2% and none was significant, indicating no item bias for gender, study, or age. Item-level analyses were also re-examined (Table 2). Mean item-total correlations within each scale were RR = .744, RP = .675, and EF = .740. Equality of the item-total correlations did not need to be retested for EF because, with only two items, the item-total correlations are necessarily equal (.74). The item-total correlations most deviant from the respective mean item-total correlations were item 7 (energy) for RR (.69) and item 9 (help) for RP (.66). For RR, z-diff = 2.395 (p = .017). This significant difference suggests item 7 was significantly less related to RR than the combination of the rest of the RR items. However, halving the sample (n = 458) yielded p = .091, indicating that the significant difference reflects a small effect. The result was even more favorable for RP: z-diff = 0.586 (p = .558).

Table 2 Item-level psychometrics after removal of item 12

Hartley’s F-max values were 1.14, 1.09, and 1.04 for RR, RP, and EF, respectively. Thus, all domains showed sufficient item-variance equivalence. Item-total correlations were compared against a criterion of .40, which all domains met. The smallest corrected item-total correlations were .69 (item 7), .66 (item 9), and .74 (items 13 and 14) for RR, RP, and EF, respectively. In addition, no items had alpha-removed statistics that were too high.
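For reference, Hartley’s F-max is simply the ratio of the largest to the smallest item variance within a scale, so values near 1 indicate the item-variance equivalence reported above:

```python
import numpy as np

def hartley_fmax(items):
    """Hartley's F-max: ratio of the largest to the smallest item variance
    within a scale, a quick screen for item-variance equivalence."""
    v = np.asarray(items, dtype=float).var(axis=0, ddof=1)
    return float(v.max() / v.min())
```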

Finally, each item was examined to ensure that its corrected item-total correlation with its own scale was significantly higher than that with either of the other scales. All items on RR had markedly higher item-total correlations with RR than with RP or EF. Items on RP had item-total correlations that were not significantly higher than those with RR but were consistently higher than correlations with EF. Finally, EF items had significantly higher item-total correlations with EF than with RR or RP.

Coefficient alpha for the MSQ scales ranged from excellent (.915 for RR) to good (.841 and .850 for RP and EF, respectively). Excellent ratings were achieved for the ten-item adjusted alphas (.939, .930, and .966 for RR, RP, and EF, respectively) and the average interitem correlations (.606, .569, and .739 for RR, RP, and EF, respectively).

Meta-analytically derived z scores indicated very large convergent effect sizes (from 51.03 for EF to 86.62 for RR) and small discriminant effect sizes (from 12.81 for RP to 13.49 for EF). Z-score differences between convergent and discriminant effects were extremely large (from 772.54 for EF to 1515.45 for RR), providing strong evidence that each domain has appropriate convergent and discriminant validity (far beyond the probable inflation of convergent scores due to the derivation of HDI and HIT-6 scores).

Discussion

The purpose of the current study was to investigate various measurement characteristics of the MSQ for migraineurs undergoing prophylactic treatment. Our initial investigation revealed that most psychometric characteristics were appropriate, but one particular problem was present: item 12. After the removal of this item, improvements were seen in the measurement properties of MSQ.

The final 13-item scale demonstrated appropriate psychometric characteristics. The final model with three latent factors was found to have strong model fit on all indices. Moreover, all factor loadings were greater than .80, indicating that at least 65% of the variance for each item was explained by its respective factor. There was also a lack of item-level bias for both age and gender demographics. Strong item-level reliability estimates were found for all 13 items. Finally, good internal consistency, convergent validity, and discriminant validity for each of the MSQ scales was also found.

The match between the scoring system and the latent structure of the 13-item MSQ was examined from many perspectives, as there have been questions regarding the use of domain scores versus one total score, as well as the ability to distinguish RP and RR items from one another. Ultimately, the comparison of various factor structures for the 13- and 14-item versions of the MSQ supported the separate interpretation of the three domain scores. Moreover, distinctions between RR and RP remain sufficiently strong: all RR items correlate significantly higher with RR than with RP, and all but one of the RP items correlate significantly higher with RP than with RR.

We provide scoring for the new two-item EF scale and a comparison between the two- and three-item EF scales in clinical trials. Scoring of the new EF scale is based on the scoring of Martin et al. [12]. To calculate the new EF scale, use the following formula: EF = ((item13 + item14 − 2) × 100)/10 (rather than subtracting 3 and dividing by 15, as for the 3-item scale). Examining a change score for 916 patients over 16 weeks of treatment revealed a significant difference between the 13-item and 14-item MSQ scores, though the effect was small (Cohen’s d = 0.11), indicating that this difference does not outweigh the poorer psychometrics incurred by retaining item 12.
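Written out as code, this two-item rescaling maps the raw sum (range 2–12, with each item scored 1–6) linearly onto 0–100:

```python
def ef_two_item(item13, item14):
    """Two-item EF domain score: raw sum (range 2-12, items scored 1-6)
    rescaled linearly to 0-100."""
    return (item13 + item14 - 2) * 100 / 10
```

For example, responses of 4 and 3 yield an EF score of 50, and the floor and ceiling responses (1, 1) and (6, 6) yield 0 and 100, respectively.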

Two previous studies have suggested removal of items from the MSQ v2.1 in migraineurs undergoing acute therapy [5, 12]. However, after removing these items, the authors did not provide updated item-level statistics for the remaining MSQ items. Future research can provide additional insight into the 13-item versus 14-item MSQ.

Limitations of the current study include the absence of test–retest reliability analysis and the use of derived convergent scores. We did not conduct a test–retest reliability analysis because we determined that the structure of the MSQ at endpoint was more important than a retest correlation on an MSQ administered at 2-month intervals. Nevertheless, future research should examine test–retest reliability. Convergent correlations may have been inflated, given the use of estimated convergent measures rather than data actually collected. The very large differences between convergent and discriminant correlations would likely remain with actual measures, but the reader should be aware of this limitation.

Conclusions

This study supports the use of the MSQ v2.1 to measure HRQoL in migraineurs undergoing preventive treatment. Removal of item 12 markedly improved the psychometric properties of this scale. In practical terms, removal of item 12 has no effect on the RR and RP domain scores and only a small effect on scoring of the EF domain; findings regarding change over time on the EF domain will likely not be materially affected. Further studies are recommended to provide additional evidence of the measurement properties of the MSQ.