Abstract
Background
In meta-regression, as the number of trials in the analysis decreases, the risk of false-positive or false-negative findings increases. This is partly because the assumption of normality may not hold in small samples. Constructing a distribution from the observed trials using permutation methods to calculate P values may produce fewer spurious findings. Permutation has not been empirically tested in meta-regression. The objective of this study was to perform an empirical investigation exploring the differences in results for meta-analyses of a small number of trials using standard large-sample approaches versus permutation-based methods for meta-regression.
Methods
We isolated a sample of randomized controlled clinical trials (RCTs) for interventions that typically have a small number of trials (herbal medicine trials). Trials were then grouped by herbal species and condition and assessed for methodological quality using the Jadad scale, and data were extracted for each outcome. Finally, we performed meta-analyses on the primary outcome of each group of trials and meta-regression for methodological quality subgroups within each meta-analysis. We used large-sample methods and permutation methods in our meta-regression modeling, and then compared the final models and final P values between methods.
Results
We collected 110 trials across 5 intervention/outcome pairings, with 5 to 10 trials per covariate. When applying large-sample methods and permutation-based methods in our backwards stepwise regression, the covariates in the final models were identical in all cases. The P values for the covariates in the final model were larger with permutation in 78% (7/9) of cases and identical in 22% (2/9) of cases.
Conclusions
We present empirical evidence that permutation-based resampling may not change final models when using backwards stepwise regression, but may increase P values in meta-regression of multiple covariates across a relatively small number of trials.
Introduction
Systematic reviews are prone to various forms of heterogeneity between included studies. Variability in the participants, interventions and outcomes across studies may be termed clinical heterogeneity; variability in trial design and quality is typically termed methodological heterogeneity; and variability in treatment effects between trials can be termed statistical heterogeneity [1,2]. Methodological heterogeneity hinges on the exact methods of the individual trials and how they differ from each other. For example, trials that do not properly conceal allocation to treatment groups may bias estimates of treatment effect and increase variation in effect between studies included in systematic reviews [3]. Significant statistical heterogeneity arising from methodological heterogeneity suggests that the studies are not all estimating the same effect because they suffer from different degrees of bias [2]. In the current work, we focus on clinical heterogeneity that arises from differences in participant characteristics (for example, sex, age, baseline disease severity, ethnicity, and so on), types of outcome measurements, and intervention characteristics (for example, dose, duration of treatment, form of intervention, and so on).
In systematic reviews, heterogeneity is typically examined through subgroup analyses or meta-regression. Subgroup analyses involve dividing the complete dataset into smaller subgroups to make comparisons between them. It is suggested that subgroup analyses be preplanned as part of a systematic review protocol, and even then they should be interpreted with caution [2]. Subgroup analyses may be performed for subsets of participants (for example, males and females) or for intervention characteristics (for example, dose or duration of treatment). These analyses may be performed to investigate heterogeneous results, or to answer questions concerning particular patient groups, types of intervention or types of study. However, as more subgroup analyses are performed on a set of trials, the likelihood of finding false-positive or false-negative results increases [4].
Meta-regression is an extension of subgroup analysis that allows continuous as well as categorical variables to be examined and multiple variables of interest to be investigated, although it is not recommended for comparisons with fewer than ten trials [4]. Meta-regression is similar to simple regression, in which an outcome variable is predicted from the values of one or more explanatory variables. The outcome variable in meta-regression is the effect estimate, and the explanatory variables (that is, potential effect modifiers or covariates) are any characteristics of the studies that might influence the effect estimate. The regression coefficient describes how the treatment effect changes with each unit increase in the explanatory variable, and the statistical significance of the coefficient is a test of whether there is a linear relationship between the two. These investigations can be misleading for several reasons [1,2].
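As a concrete sketch of the weighting involved, a meta-regression slope can be obtained by weighted least squares with weights equal to the inverse trial variances. The helper below and its numbers are illustrative only; they are not the Stata routines or the data used in this study:

```python
import numpy as np

def meta_regression(effects, variances, covariate):
    """Fixed-effect meta-regression of trial effect estimates on one
    covariate via weighted least squares (weights = 1/variance)."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    X = np.column_stack([np.ones_like(y), np.asarray(covariate, dtype=float)])
    # Weighted least squares: beta = (X'WX)^-1 X'Wy
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)  # [intercept, slope]

# Hypothetical trial-level data: mean differences, their variances, daily dose
effects = [-0.8, -1.1, -0.5, -1.4, -0.9]
variances = [0.04, 0.09, 0.05, 0.16, 0.08]
dose = [1.0, 2.0, 1.0, 3.0, 2.0]
intercept, slope = meta_regression(effects, variances, dose)
```

The slope estimates how the treatment effect changes per unit increase in the covariate, exactly as described above; trials with smaller variances pull the fitted line toward themselves.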
First, meta-regression involves making observational associations that are subject to bias (for example, aggregation bias) and confounding (for example, resulting from correlation between characteristics). Also, many systematic reviews using this technique include only a small number of studies, while any one of a large number of study characteristics could be a cause of heterogeneity [1,2]. That is, with a smaller number of studies the likelihood of finding a statistically significant explanatory variable rises, and false-positive findings are more likely than in conventional regression [1,2,4]. For these reasons the use of permutation tests has been suggested to assess the 'true' statistical significance of an observed meta-regression finding for individual covariates [1,2,4].
The permutation test is useful for analyses of a small number of studies or a large number of covariates, with the goal of maintaining the type I error rate [4]. Permutation-based resampling tests the hypothesis of no effect when the distribution of the test statistic is unknown or when the data are not randomly sampled from a defined population, as is often the case in primary meta-analyses or meta-regression [5]. For example, suppose we were testing the effect of allocation concealment on the summary treatment effects, comparing the summary effect of trials that performed adequate allocation concealment (group A) with that of trials that did not (group B). We would test the equality of the two summary treatment effects (one for A and one for B) to determine whether they likely originated from the same sampling distribution. In standard meta-regression the associated P value rests on the assumption that the two statistics were sampled from a normal distribution. For permutation tests, a distribution is instead built from all possible reassignments of the observations to groups A and B (32 possible assignments for 2 groups across 5 pairs of trials, with a summary treatment effect difference for each). The P value is then the proportion of these permuted summary differences that are at least as large as the observed difference in treatment effect found in the primary analysis (for example, 3/32; P = 0.09). A simulation study has demonstrated that permutation analyses result in more conservative P values and lower false-positive rates than standard meta-regression (referring to the t statistic), especially in the presence of heterogeneity, when there are a small number of studies, or when there are a large number of covariates [4].
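The 2^5 = 32 worked example above can be sketched as an exact paired permutation test: within each of 5 matched pairs of trials the A/B assignment is flipped or kept, and the mean difference is recomputed for every assignment. The numbers below are hypothetical, and a two-sided count is used:

```python
from itertools import product
import numpy as np

def paired_permutation_p(diffs):
    """Exact permutation P value for matched pairs of trials: within each
    pair the A/B assignment is flipped or not, giving 2^k equally likely
    values of the mean A-B difference under the null hypothesis."""
    d = np.asarray(diffs, dtype=float)
    observed = d.mean()
    hits = total = 0
    for signs in product([1.0, -1.0], repeat=len(d)):  # all 2^k flips
        permuted = (d * np.asarray(signs)).mean()
        total += 1
        if abs(permuted) >= abs(observed):  # two-sided count
            hits += 1
    return hits / total

# 5 hypothetical within-pair differences in treatment effect (A minus B)
p = paired_permutation_p([1.0, 1.0, 1.0, 1.0, 1.0])
```

With 5 pairs there are 32 assignments, so the smallest attainable two-sided P value is 2/32; small numbers of trials bound how extreme a permutation P value can be.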
Due to the large number of treatment-level covariates in trials of herbal interventions [6] and the small number of clinical trials included in systematic reviews of these interventions [7], it is important to decrease the false-positive rate in meta-regression analyses while still providing valuable information on which characteristics of the intervention or other trial characteristics influence treatment effects. This may lead to increasingly valid and clinically relevant systematic reviews. With this in mind, our objective was to explore the differences in meta-regression modeling results for a predetermined set of covariates using standard large-sample approaches versus permutation-based resampling methods.
Methods
Trial inclusion
From a sample of 406 randomized controlled trials (RCTs) of herbal medicine identified through a literature search reported elsewhere [8], 2 individuals (JJG, HB) with expertise in herbal medicine interventions identified a selection of herbal intervention/condition pairings that would include the largest number of trials for this project. Through consensus they decided on the following pairings: Hypericum perforatum L. (St. John's wort) for treating patients with depression, Ginkgo biloba (ginkgo) for dementia, Serenoa repens (saw palmetto) for benign prostatic hypertrophy, and Allium sativum (garlic) for blood lipids (triglycerides, lipoproteins and cholesterol). Next, we examined the details of the included RCTs for each pairing and determined which trials included similar outcome measures. All other trials were excluded from the analyses.
A generally accepted 'rule of thumb' is that ten events per predictor variable (EPV) will keep bias and variability at acceptable levels. This guidance derives from simulation studies carried out for logistic and Cox modeling strategies [9-11]. It has been adapted to meta-regression, where it is suggested that there should be at least ten trials for each covariate to avoid spurious findings [12]. Therefore, we sought to include a minimum of 20 RCTs for each intervention/outcome pairing to allow for valid meta-regression with 2 or more covariates.
Development of intervention level covariates
Next, we identified a variety of intervention-level covariates with which to explore heterogeneity. We consulted a group of four individuals with expertise in RCTs of herbal medicine interventions, who were asked to identify characteristics of the herbal medicine interventions that might influence estimates of treatment effect (for example, active constituent levels). The compiled list of intervention-level covariates included: dose, duration of use, percentage of active constituent, part of plant used, form of product (for example, liquid, dried, powder), method of extraction, and the extraction solvent used.
Methodological quality assessment
Methodological quality was assessed separately and independently by two individuals. We assessed the sequence generation, allocation sequence concealment, blinding, and description of withdrawals and dropouts, which was the Cochrane practice at the time of this research [13,14]. Each question was answered with a yes, no or don't know response option.
All assessments were entered into preformatted extraction forms. The assessors met on a weekly basis to discuss disagreements that were then resolved by consensus. A third party was not required to resolve any disagreements.
Data extraction
One individual extracted the following data and a second individual checked all extractions: type of intervention; details of the intervention identified above; summary effect estimates (mean differences, odds ratio (OR), relative risk (RR), risk difference (RD), number needed to treat (NNT), number needed to harm (NNH)) for the outcomes of interest; baseline values for the outcomes of interest in each group; participant-level characteristics (mean age, proportion of males and females); sample sizes per group (intention to treat where available); measures of central tendency for all groups (mean, median, mode) and measures of variability (SD, SE, range, CI) before and after treatment on all outcomes of interest; trial length (in weeks); and year of publication.
Statistical analysis
All statistical procedures were performed by one individual (JJG) using STATA 10.0 (Stata, College Station, TX, USA).
Effect size calculation for individual trials
For each trial, where effect sizes for continuous outcome measures were not reported we calculated them from the reported means for all outcomes of interest. For continuous variables the effect sizes calculated were mean differences (MD) from baseline. All effect estimates for the outcomes of interest were calculated on intention-to-treat (ITT) data where reported. When variance data were not reported as SDs, they were calculated from the trial data using the standard error of the mean (SE), 95% confidence intervals, reported P values, ranges or interquartile ranges, according to methods described in the Cochrane handbook [14]. When trials did not report enough information to calculate the variances, we imputed values by giving those trials the average SD calculated across the other trials.
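The SD recoveries used here follow directly from the definitions of the SE and of a normal-approximation CI. A minimal sketch (the function names are ours, not from the Cochrane handbook):

```python
import math

def sd_from_se(se, n):
    """Recover SD from the standard error of the mean: SD = SE * sqrt(n)."""
    return se * math.sqrt(n)

def sd_from_ci(lower, upper, n, z=1.96):
    """Recover SD from a normal-approximation CI for a mean:
    SE = (upper - lower) / (2 * z), then SD = SE * sqrt(n).
    z = 1.96 corresponds to a 95% interval."""
    se = (upper - lower) / (2.0 * z)
    return se * math.sqrt(n)

# e.g. a trial arm of n = 25 reporting SE = 2.0 implies SD = 10.0
sd = sd_from_se(2.0, 25)
```

Recovery from P values, ranges or interquartile ranges uses analogous back-calculations described in the handbook and is not shown here.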
Metaanalysis for overall effect
The main meta-analysis used a random-effects model on the weighted mean differences (WMD) of the primary outcome of interest across all trials. We then collected the WMD, together with the 95% CI, the P value for the main effect, and the I² results for statistical heterogeneity with the associated P value. Where measures of variance were imputed from the average across the other trials, we performed a sensitivity analysis by removing those trials and comparing the results with the findings for the analysis including all trials.
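A random-effects pooled WMD of this kind can be sketched with the DerSimonian-Laird estimator of the between-trial variance. This is one common estimator for a random-effects model; the paper does not state which estimator its Stata routine used, so treat this as an illustration:

```python
import numpy as np

def random_effects_wmd(effects, variances):
    """DerSimonian-Laird random-effects pooled weighted mean difference.
    Returns (pooled estimate, 95% CI lower bound, 95% CI upper bound)."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                  # fixed-effect weights
    fixed = np.sum(w * y) / np.sum(w)            # fixed-effect pooled WMD
    q = np.sum(w * (y - fixed) ** 2)             # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                # between-trial variance
    w_star = 1.0 / (v + tau2)                    # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    se = 1.0 / np.sqrt(np.sum(w_star))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se
```

When the trials are homogeneous (Q less than its degrees of freedom), tau2 is truncated to zero and the result coincides with the fixed-effect pooled estimate.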
Exploration of heterogeneity
The presence of statistical heterogeneity was tested using Cochran's Q test [15,16], which tests whether the amount of between-trial heterogeneity is greater than would be expected by chance, and the magnitude of statistical heterogeneity between trials was measured using the I² statistic, which partitions out the heterogeneity expected by chance [1]. An I² of 0% to 40% indicates no important statistical heterogeneity, 30% to 60% moderate heterogeneity, 50% to 90% substantial heterogeneity, and 75% to 100% considerable heterogeneity [2].
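The two statistics can be written explicitly. With trial effect estimates, within-trial variances, weights equal to the inverse variances, and the fixed-effect pooled estimate:

```latex
Q = \sum_{i=1}^{k} w_i \left( \hat{\theta}_i - \hat{\theta} \right)^2 ,
\qquad w_i = \frac{1}{v_i} ,
\qquad
I^2 = \max\!\left( 0,\; \frac{Q - (k - 1)}{Q} \right) \times 100\%
```

Under the null hypothesis of homogeneity, Q follows a chi-squared distribution with k - 1 degrees of freedom, so I² expresses the excess of Q over its expected value as a proportion of Q.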
We performed multivariable meta-regression on the summary treatment effect for the covariates identified above, separated into two models for all intervention/outcome pairings. Covariates representing similar topics were grouped into models through a small-group process with the investigators. The first model (model 1) included methodological covariates (proper randomization, proper double blinding, a description of withdrawals and dropouts (items 2, 4 and 5 of the Jadad scale), and adequate allocation concealment). The second model (model 2) included participant-level and intervention-level covariates (length of the trial, proportion of males/females, dose of intervention, and baseline severity for the intervention/treatment group). All categorical covariates were coded in a binary manner.
We used backwards stepwise elimination, in which the variable with the largest nonsignificant P value for contribution to the model was dropped and the regression repeated. For each model we used standard large-sample methods (referring to the t distribution) and permutation-based resampling to arrive at P values. Variables were dropped until only significant predictors of the treatment effect remained. To maintain model validity, if a binary-coded categorical variable was statistically significant, both categories of the variable were kept in the model. Collinear variables were also dropped from the model and explored individually for significance. Statistically significant P values were those less than 0.05. We then compared the final models and the P values obtained with each method.
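The elimination loop can be sketched as follows. This is an illustration of the procedure only: it uses a large-sample normal (z) approximation rather than the t-based Stata routine of the study, and the covariate names are hypothetical:

```python
import numpy as np
from math import erf, sqrt

def wls_fit(y, w, X):
    """Weighted least squares with known weights; returns coefficients
    and their standard errors from the diagonal of (X'WX)^-1."""
    XtW = X.T * w
    cov = np.linalg.inv(XtW @ X)
    beta = cov @ (XtW @ y)
    return beta, np.sqrt(np.diag(cov))

def backward_stepwise(effects, variances, covariates, names, alpha=0.05):
    """Backwards stepwise elimination for meta-regression: refit after
    dropping the covariate with the largest nonsignificant P value,
    until only significant predictors remain (intercept always kept)."""
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    kept = list(range(len(names)))
    while kept:
        X = np.column_stack([np.ones_like(y)] +
                            [np.asarray(covariates[j], float) for j in kept])
        beta, se = wls_fit(y, w, X)
        # two-sided normal-approximation P value per covariate (skip intercept)
        p = [2.0 * (1.0 - 0.5 * (1.0 + erf(abs(b / s) / sqrt(2.0))))
             for b, s in zip(beta[1:], se[1:])]
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break          # everything remaining is significant
        kept.pop(worst)    # drop the least significant covariate and refit
    return [names[j] for j in kept]
```

The rule of keeping both categories of a significant binary-coded variable and the handling of collinear variables described above are left out of this sketch for brevity.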
In the permutation test we tested the equality of two means: H0: mu_x = mu_y. The assumption is that if the X data are sampled from the distribution Px and the Y data from Py, then under the null hypothesis all permutations of the observations are equally probable. The test statistic was the WMD from the meta-analysis results above on all levels of the covariates across trials. We then estimated the null distribution for the treatment effect by permuting (randomly shuffling) the observations in the combined X and Y samples. We then compared the effect estimate (or difference in effect) obtained with meta-regression for each covariate against the permuted distribution of effects to arrive at a P value. All permutation analyses used 10,000 shuffles and were performed simultaneously across all variables.
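The shuffling step can be sketched for a single covariate as follows. This is a Monte Carlo illustration with a fixed seed, not the study's Stata implementation (which permuted all covariates simultaneously):

```python
import numpy as np

def permutation_p(effects, variances, covariate, n_perm=10_000, seed=0):
    """Monte Carlo permutation P value for one meta-regression slope:
    shuffle the covariate across trials to build the null distribution
    of the coefficient, then count permuted slopes at least as extreme
    as the observed one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    x = np.asarray(covariate, dtype=float)

    def slope(xv):
        # weighted least squares slope of effect on covariate
        X = np.column_stack([np.ones_like(y), xv])
        XtW = X.T * w
        return np.linalg.solve(XtW @ X, XtW @ y)[1]

    observed = slope(x)
    hits = sum(abs(slope(rng.permutation(x))) >= abs(observed)
               for _ in range(n_perm))
    # add-one correction keeps a Monte Carlo P value away from zero
    return (hits + 1) / (n_perm + 1)
```

Because the covariate values are only relabeled across trials, the null distribution is built from the observed data themselves, with no normality assumption.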
Results
A total of 110 trials were included in our analyses (see Figure 1). The intervention/outcome pairings are described in Table 1. All other intervention/outcome pairings had fewer than 20 trials and thus were not included. The primary meta-analyses for each of the pairings are listed in Table 2. For all analyses there was a statistically significant effect in favor of the herbal medicine intervention. We also ran a sensitivity analysis removing those trials for which standard deviations were imputed; the results did not appear to differ from the analyses with all trials included. In the primary analyses there was statistically significant heterogeneity throughout, with I² values ranging from 65.7% to 99.1%.
Figure 1. Flow chart of inclusion of trials.
Table 1. Description of included intervention condition pairings
Table 2. Results of primary metaanalyses and sensitivity analyses
Many of the intervention-level covariates were not reported in the included trials. Also, in no instance did a trial report enough information to code it as performing inadequate allocation concealment; therefore, this variable was coded in a binary manner (1 = adequate allocation concealment; 0 = not enough information reported in the trial). For the backwards stepwise meta-regression procedures, the covariates remaining in the final models were identical for large-sample techniques and for permutation-based resampling methods of arriving at P values. Out of a total of ten backwards stepwise procedures, six of the models for both large-sample methods and permutation-based resampling resulted in several significant covariates in the final models. Only the models for low-density lipoprotein (LDL)/A. sativum (garlic) and for total cholesterol (TC)/A. sativum resulted in no covariates remaining significant. The P values and I² from the backwards stepwise elimination procedure for large-sample methods (referring to the t table) and the corresponding P values using permutation-based resampling are listed in Tables 3, 4 and 5. The P values for the covariates in the final model were larger with permutation in 78% (7/9) of cases and identical in 22% (2/9) of cases.
Table 3. Hypericum perforatum/HAM-D pairing stepwise elimination results: comparison of final P values for final models using both large sample and permutation techniques
Table 4. Allium sativum/high-density lipoprotein (HDL) stepwise elimination: comparison of final P values for final models using both large sample and permutation techniques
Table 5. Allium sativum/triglyceride (TG) stepwise elimination: comparison of final P values for final models using both large sample and permutation techniques
For the Hypericum perforatum (St. John's wort)/Hamilton Depression scale (HAM-D) pairing, the backwards stepwise elimination meta-regression and permutation P values are listed in Table 3. All P values obtained with the permutation tests exceeded those obtained with standard meta-regression. The stepwise elimination meta-regression and permutation P values for the influence of covariates on the effect of Allium sativum on high-density lipoprotein (HDL) are listed in Table 4. None of the methodological variables reached significance in the first model. In the second model, for participant-level and intervention-level covariates, both P values obtained with permutation were identical to those obtained with standard large-sample meta-regression. This analysis yielded the smallest I² value (45.8%) of all stepwise elimination values in all models. The stepwise elimination meta-regression and permutation P values for the effect of covariates on changes in triglycerides (TG) for Allium sativum are listed in Table 5. All P values for permutation were larger than those obtained with large-sample methods.
In Table 6 we present various features of the final meta-regression models, including the number of trials per variable, the magnitude of statistical heterogeneity (I²), and the P values for both the large-sample methods and permutation-based resampling. In the model exhibiting a moderate level of statistical heterogeneity (that is, 45.8%; A. sativum/HDL: model 2) and an adequate number of trials per variable (that is, 9.5), both methods of arriving at P values produced identical values.
Table 6. Results of final stepwise models relative to number of trials per variable and statistical heterogeneity ranked by number of trials per variable in the final models
Discussion
Approximately 50% of systematic reviews use statistical techniques to combine study results, and most of these assess consistency across the studies [17]. Several studies report that tests for the presence of heterogeneity are frequently performed in meta-analyses, that they are often statistically significant, and that a variety of methods are used to explore heterogeneity, including subgroup analyses and meta-regression [18-21]. Meta-regression may produce spurious findings when performed on a small number of studies or when investigating multiple covariates [4].
Our study found that backwards stepwise meta-regression with large-sample methods and with permutation-based resampling resulted in identical final models. To the best of our knowledge this is the first empirical test of permutation in meta-regression modeling using an elimination procedure. The results are surprising given that we had only 5 to 6.25 trials per covariate in all of the initial models, well below the recommended 10 trials per variable [9-12]. Given that this rule of thumb was derived from simulations for logistic and Cox modeling, it is possible that it does not reliably apply to meta-regression procedures on continuous outcome variables. For example, a simulation study using linear and logistic regression led to a proposed rule of thumb that, to avoid overfitting when using forward stepwise selection procedures, more than 4 events per variable are required [22-25]. If we extend this observation to linear meta-regression with a backwards stepwise selection procedure, then our results make empirical sense. We recommend that more simulations be performed to test the ten-trials-per-variable rule of thumb.
We also found that the P values obtained using permutation tests were more conservative, that is larger, than P values obtained using standard meta-regression methods. Specifically, the P values for significant covariates obtained with stepwise meta-regression were larger with permutation 78% of the time and identical 22% of the time compared with standard large-sample methods of obtaining P values. This finding is particularly important when P values are near a set level of statistical significance (for example, 0.05 or 0.01), since any increase in the P value may render the result nonsignificant. It provides empirical evidence that permutation tests increase P values and that permutation can be used in meta-analyses, especially when examining the effects of multiple covariates, when faced with a relatively small number of included primary studies, or when a large amount of statistical heterogeneity is present. However, these results do not indicate that permutation-based resampling protects against associations arising from ecological bias. One must always be aware that some associations found at the group level, in this case the study level, may not apply at the individual level.
Our findings are similar to those of Higgins and Thompson [4], who found that when examining multiple covariates in a random-effects analysis, permutation resulted in generally larger P values, but not in all instances. For example, they found that for the covariate 'intention to treat analysis' the P value obtained by meta-regression was larger than that found with permutation. We found that for seven of nine significant covariates the P values were larger in the permutation tests; in the other two instances the P values were identical. It is possible that the relatively low heterogeneity in the latter final model (45.8%), or the relative normality of the distribution of effects in that model, explains this finding. It is also possible that the level of statistical heterogeneity plays a larger role than the number of trials in meta-regression. This hypothesis remains to be tested.
Meta-regression may result in spurious findings with multiple covariates, with a small number of primary studies [4,25-28], or when there is a large magnitude of statistical heterogeneity. Permutation tests temper P values when exploring heterogeneity in these circumstances and lead to more conservative probability estimates; in some cases the P value may cross over to nonsignificance. This is important in meta-analytic research, especially since meta-analyses often include a small number of primary studies [4,15], included studies often fail to report important information for these analyses [19], and meta-analyses often have significant statistical heterogeneity [28-30]. The ability to accurately explore the reasons for heterogeneity and the influence of specific covariates could produce increasingly specific and clinically relevant findings in meta-analyses and lead to valid hypothesis generation for future clinical trials. This could potentially save many resources, including time, healthcare expenditure, and funding allocation. In agreement with previous recommendations [4], we advise that permutation tests be used in all meta-analyses that include a small number of clinical trials per covariate (fewer than ten) or that have substantial or considerable heterogeneity.
This study has several strengths. First, we included a large number of primary studies and cross-checked the data extractions. By including a large number of studies we have performed an empirical test of permutation, which goes beyond previous simulations [4]. Next, we used a set of covariates that have empirical and theoretical relationships with our outcome variables. For example, it is well known that adequate randomization sequence generation and allocation concealment are important predictors of effect size [2,3,29,30]. Also, it follows from basic pharmacokinetics that differing doses of the herbal medicine or its active constituent predict clinical responses [10]. An additional strength of this study is that we ran 10,000 iterations in the permutation tests. Because permutation tests are random processes, a larger number of permutations yields P values that are stable across repeated runs for the same variable; therefore, our permutation test P values are likely robust. Finally, this study extends the statistical techniques that can be used in meta-analyses, giving reviewers an empirically tested method for analyzing the influence of covariates.
A drawback of this study is that it may not include all RCTs examining the herbal medicine/outcome pairings chosen. That is, the treatment effect estimates presented in Table 2 may not be 'true' estimates of the effect across all published trials. We did not set out to perform a comprehensive meta-analysis of each pairing or to describe the actual extent of bias or influence of certain covariates on summary effect estimates. Our objective was to explore the difference in P values obtained from standard meta-regression and permutation tests on a sample of trials for several covariates; this project was, in effect, an empirical exercise comparing two statistical techniques for arriving at P values. Another potential drawback is that at no point did the permutation test change a significant P value to a nonsignificant one. Although we did not observe this, such a change would be expected in some datasets, since in most instances the permutation P value exceeded the meta-regression value; it is conceivable that a marginally significant P value obtained with meta-regression would become nonsignificant with permutation [4]. Also, though P values are often used to determine the existence of covariate-based effect modification, one should complement P values with the actual differences in effect estimates to determine whether changes in the effect are clinically significant. Further research could build on the intersection of statistical and clinical significance in effect modification. Finally, data from eight trials for the SJW/depression pairing were extracted from Cochrane reviews [31,32]; any errors of extraction in those reviews would carry over into our extractions and the resulting data analyses.
Conclusions
In summary, given that systematic reviews frequently contain a small number of studies and often seek to explore the influence of covariates to explain heterogeneity, the permutation test may help protect against spurious findings when using meta-regression. However, the changes in significance level we found for the permutation test in our sample of trials were small. The relationship between the magnitude of statistical heterogeneity, events per variable, and meta-regression with permutation-based resampling should be explored in future research.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
JJG conceived of the study design, performed all data extractions and data analysis and wrote the manuscript. JB conceived of the study design and helped write and edit the manuscript. DM aided in the study design and helped edit the manuscript. HB aided in the study design and helped edit the manuscript. CB aided in the study design and helped edit the manuscript. All authors read and approved the final manuscript.
Acknowledgements
JJG was supported by a Canadian Institutes of Health Research postdoctoral fellowship at the time that this research was carried out.
References

Higgins JPT, Thompson SG: Quantifying heterogeneity in a metaanalysis.
Statist Med 2002, 21:15391558. Publisher Full Text

Higgins JPT, Green S (Eds): Cochrane Handbook for Systematic Reviews of Interventions of interventions 5.0.0 [http://www.cochranehandbook.org] webcite
In

Pildal J, Hrobjartsson A, Jorgensen KJ, Hilden J, Altman DG, Gotzsche PC: Impact of allocation concealment on conclusions drawn from metaanalyses of randomized trials.
Int J Epidemiol 2007, 36:847857. PubMed Abstract  Publisher Full Text

Higgins JPT, Thompson SG: Controlling the risk of spurious findings from metaregression.
Statist Med 2004, 23:16631682. Publisher Full Text

Huang Y, Xu H, Calian V, Hsu JC: To permute or not to permute.
Bioinformatics 2006, 22:22442248. PubMed Abstract  Publisher Full Text

Gagnier JJ, Rochon PA, Boon H, Barnes J, Moher D, Bombardier C, for the CONSORT Group: Improving the reporting of randomized controlled trials of herbal medicine interventions: an application of the CONSORT statement.
Ann Intern Med 2006, 144:364367. PubMed Abstract

Linde K, ter Riet G, Hondras M, Melchart D, Willich SN: Characteristics and quality of systematic reviews of acupuncture, herbal medicines and homeopathy.
Forsch Komplmentarmed Klass Natureheilkd 2003, 10:8894. Publisher Full Text

8. Gagnier JJ, DeMelo J, Boon H, Rochon P, Bombardier C: Quality of reporting of randomized controlled trials of herbal medicine interventions. Am J Med 2006, 119:800.

9. Concato J, Peduzzi P, Holford TR, Feinstein AR: Importance of events per independent variable in proportional hazards analysis. I. Background, goals, and general strategy. J Clin Epidemiol 1995, 48:1495-1501.

10. Peduzzi P, Concato J, Feinstein AR, Holford TR: Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995, 48:1503-1510.

11. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR: A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996, 49:1373-1379.

12. The Cochrane Collaboration: The Cochrane Manual Issue 4, 2008. [http://www.cochrane.org/admin/manual.htm]

13. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, McQuay HJ: Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996, 17:1-12.

14. Higgins JPT, Green S (Eds): Cochrane Handbook for Systematic Reviews of Interventions 4.2.6 [http://www.cochranehandbook.org]

15. Cochran WG: The combination of estimates from different experiments. Biometrics 1954, 10:101-129.

16. Whitehead A, Whitehead J: A general parametric approach to the meta-analysis of randomised clinical trials. Stat Med 1991, 10:1665-1677.

17. Moher D, Tetzlaff J, Tricco AC, Sampson M, Altman DG: Epidemiology and reporting characteristics of systematic reviews. PLoS Med 2007, 4:e78.

18. Higgins J, Thompson S, Deeks J, Altman D: Statistical heterogeneity in systematic reviews of clinical trials: a critical appraisal of guidelines and practice. J Health Serv Res Policy 2002, 7:51-61.

19. Petitti DB: Approaches to heterogeneity in meta-analysis. Stat Med 2001, 20:3625-3633.

20. Xu H, Platt RW, Luo Z, Wei S, Fraser WD: Exploring heterogeneity in meta-analyses: needs, resources and challenges.

21. Schmid CH, Stark PC, Berlin JA, Landais P, Lau J: Meta-regression detected associations between heterogeneous treatment effects and study-level, not patient-level, factors. J Clin Epidemiol 2004, 57:683-697.

22. Flack V, Chang PC: Frequency of selecting noise variables in subset regression analysis: a simulation study.

23. Freedman LS, Pee D: Return to a note on screening regression equations. Am Stat 1989, 43:279-282.

24. Freedman LS, Pee D, Midthune DN: The problem of underestimating the residual error variance in forward stepwise regression. Statistician 1992, 41:405-412.

25. Berlin JA, Antman EM: Advantages and limitations of meta-analytic regressions of clinical trials data.

26. Davey Smith G, Egger M, Phillips AN: Meta-analysis: beyond the grand mean? BMJ 1997, 315:1610-1614.

27. Lau J, Ioannidis JP, Schmid CH: Summing up evidence: one answer is not always enough. Lancet 1998, 351:123-127.

28. Thompson SG, Higgins JPT: How should meta-regression analyses be undertaken and interpreted? Stat Med 2002, 21:1559-1574.

29. Kunz R, Vist GE, Oxman AD: Randomisation to protect against selection bias in healthcare trials. Cochrane Database Syst Rev 2007, 2:MR000012.

30. Verhagen AP, De Bie RA, Lenssen AF, de Vet HC, Kessels AG, Boers M, van den Brandt PA: Impact of quality items on study outcome. Int J Technol Assess Health Care 2000, 16:1136-1146.

31. Linde K, Mulrow CD, Berner M, Egger M: St John's Wort for depression. Cochrane Database Syst Rev 2005, 2:CD000448.

32. Linde K, Mulrow CD, Berner M, Egger M: St John's Wort for depression. Cochrane Database Syst Rev 2008, 4:CD000448.