This section is intended to be a quick reference for a selection of basic statistical tools for hypothesis testing. You can also calculate confidence intervals for most of these examples based on the descriptive measures. Before using this reference, review the definitions in the Statistical Terms Dictionary to help you determine which statistic is best for your project.

Consider consulting with a statistician. All statistical tests rely on mathematical assumptions which should be evaluated prior to using and interpreting the statistics.

| Groups | Purpose | Conditions or Assumptions | Test | Descriptive Measures (also used to calculate CIs) |
| --- | --- | --- | --- | --- |
| Single sample | Compare mean to historical or hypothetical (known) mean | Standard deviation known or N large and approximate normality | One-sample z-test | Mean, std deviation, difference from historical mean |
| | Compare mean to historical or hypothetical (known) mean | Standard deviation unknown; approximate normality | One-sample t-test | Mean, std deviation, difference from historical mean |
| Two independent samples | Compare means | Approximate normality | Two-sample t-test | Means, std deviations, pooled std deviation, difference in means |
| | Compare medians (Pr (X < Y)) | Use for highly non-normal outcomes | Wilcoxon rank-sum test (Mann-Whitney U test) | Medians, quartiles, difference between medians |
| Two+ independent samples | Compare means (any or overall difference) | Approximate normality | Analysis of variance (ANOVA) | Means, std deviations, pairwise differences in means |
| | Compare medians (outcome distributions) | Use for highly non-normal outcomes | Kruskal-Wallis test | Medians, quartiles, pairwise differences in medians |
| Matched Pairs (2) | Evaluate whether mean of differences nonzero | Approximate normality | Paired t-test | Mean of differences, std deviation of differences |
| | Evaluate whether median of differences nonzero | Use for highly non-normal outcomes | Wilcoxon signed-rank test | Median of differences, quartiles of differences |
| Matched Sets (> 2) | Evaluate whether mean of differences nonzero | Approximate normality | Repeated measures ANOVA | Mean of differences, std deviation of differences |
| | Evaluate whether median of differences nonzero | Use for highly non-normal outcomes | Friedman test | Median of differences, quartiles of differences |

| Groups | Purpose | Conditions or Assumptions | Test | Descriptive Measures (also used to calculate CIs) |
| --- | --- | --- | --- | --- |
| Single sample | Compare proportion to historical or hypothetical (known) proportion | N > 30 | One-sample z-test (exact binomial N < 30) | Proportion, difference from historical proportion |
| | Compare proportion to historical or hypothetical (known) proportion | | Exact binomial test (one-sample) | Proportion, difference from historical proportion |
| | Compare proportion(s) to hypothetical (known) proportion(s) | Allows overall comparison of > 2 categories | Pearson’s goodness-of-fit Chi-square | Proportion(s) relative to historical proportion(s) |
| Two independent samples | Compare proportions (absolute difference) | N large (> 30) | Normal approximation to binomial | Difference between two proportions |
| | Compare proportions (absolute difference) | | Exact binomial test (two samples) | Difference between two proportions |
| Two+ independent samples | Compare proportions or test for association | Expected cell count > 1; most > 5 | Chi-square test | Odds ratio, row and column percentages |
| | Compare proportions and test for association | Use for empty cells or small cell counts | Fisher’s exact test | Odds ratio, row and column percentages |
| | Compare rates and test for association | Population denominators available (able to calculate valid rates) | Chi-square or Fisher’s exact test | Relative risk, row and column percentages |
| Matched Pairs (2) | Compare proportions or test for association | | McNemar’s test | Difference between two proportions (different std dev than unmatched) |
| Matched Sets (> 2) | Compare proportions or test for association | | Cochran’s Q test | Difference between pairs of proportions |

| Groups | Purpose | Conditions or Assumptions | Test | Descriptive Measures (also used to calculate CIs) |
| --- | --- | --- | --- | --- |
| Single sample | Describe survival and hazard | Non-informative censoring; reasonable number of events | One-sample log-rank test | Kaplan-Meier survival curve, median survival, hazard |
| Two+ independent samples | Compare survival | Non-informative censoring; reasonable number of events | Log-rank test | Kaplan-Meier survival curves, median survival, hazard ratio |
| Matched Pairs or Sets (2+) | Compare survival | Non-informative censoring; reasonable number of events | Sign test or conditional proportional hazards | Kaplan-Meier survival curves, median survival, hazard |

Remember that there is variability associated with your outcomes and statistics.

When you calculate a statistic based on your sample data, how do you know if the statistic truly represents your population? Even if you have selected a random sample, your sample will not completely reflect your population. Each sample you take will give you a different result.

Let’s Look at an Example:
Suppose that you want to compare the mean age for those with and without an IV in the prehospital setting. You review the ambulance runs for the past two weeks and calculate a mean age of 10.4 years for those with an IV and 8.5 years for those without an IV. The difference between the two means is 1.9 years. From this, you might conclude that those receiving an IV were older on average.

Now suppose you collect the same data over the next six weeks. This time the average age for those with an IV is 9.2 years and the average age for those without an IV is 8.9 years, for a difference of 0.3 years. Suddenly, it is not clear that there is an important difference in age between these two groups. Why did your different samples yield different results? Is one sample more correct than the other?

Remember that there is variability in your outcomes and statistics. The more individual variation you see in your outcome, the less confidence you have in your statistics. In addition, the smaller your sample size, the less comfortable you can be asserting that the statistics you calculate are representative of your population.

#### Providing a Range of Values

A confidence interval provides a range of values that will capture the true population value a certain percentage of the time. You determine the level of confidence, but it is generally set at 90%, 95%, or 99%. Confidence intervals use the variability of your data to assess the precision of your estimated statistics. You can use confidence intervals to describe a single group or to compare two groups. We will not cover the statistical equations for a confidence interval here, but we will discuss several examples.

Example
Average pulse rate = 101 bpm; Standard Deviation = 50; N = 200

95% Confidence Interval = (94, 108)
We are 95% confident that the true mean pulse rate for our population is between 94 and 108 bpm.
Margin of error = (108 – 94) / 2 = ± 7 bpm

The confidence interval in the above example could be described as 94 to 108 bpm (beats per minute) or 101 bpm ± 7 bpm. Here the number 7 is your margin of error. For confidence intervals around the mean, the margin of error is just half of your total confidence interval width.
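As a minimal sketch of where numbers like these can come from, the pulse rate interval can be reproduced with a normal approximation (mean ± z × SD / √N). The function name and the use of a z critical value are assumptions made for illustration; this reference does not state which formula was used.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean (illustrative helper)."""
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / sqrt(n)  # margin of error = half the interval width
    return mean - margin, mean + margin, margin


# Values from the pulse rate example: mean 101 bpm, SD 50, N 200.
low, high, margin = mean_confidence_interval(mean=101, sd=50, n=200)
print(f"95% CI = ({low:.0f}, {high:.0f}); margin of error = ±{margin:.0f} bpm")
```

Raising `confidence` to 0.99 widens the interval, because a larger critical value is needed to capture the true mean more often.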

#### Sample Size and Variability

The precision of your statistics depends on your sample size and variability. A larger sample size or lower variability will result in a tighter confidence interval with a smaller margin of error. A smaller sample size or a higher variability will result in a wider confidence interval with a larger margin of error. The level of confidence also affects the interval width. If you want a higher level of confidence, that interval will not be as tight. A tight interval at 95% or higher confidence is ideal.

Examples:
Average Scene Time = 5.5 mins; Standard Deviation = 3 mins; N = 10 runs

95% Confidence Interval = (3.6, 7.4)
Margin of Error = ±1.9 minutes

Average Scene Time = 5.5 mins; Standard Deviation = 3 mins; N=1,000 runs

95% Confidence Interval = (5.3, 5.7)
Margin of Error = ± 0.2 minutes

Average Scene Time = 5.5 mins; Standard Deviation = 15 mins; N=1,000 runs

95% Confidence Interval = (4.6, 6.4)
Margin of Error = ± 0.9 minutes
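
The effect of sample size and variability on the margin of error can be explored with a short sketch. It assumes the same normal-approximation formula (z × SD / √N) for all three scene time scenarios; the helper name and the scenario list are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value (about 1.96)


def margin_of_error(sd, n, z=Z_95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * sd / sqrt(n)


mean_scene_time = 5.5  # minutes, as in the examples above
scenarios = [
    {"sd": 3, "n": 10},     # small sample, low variability
    {"sd": 3, "n": 1000},   # large sample, low variability
    {"sd": 15, "n": 1000},  # large sample, high variability
]

margins = [margin_of_error(s["sd"], s["n"]) for s in scenarios]
for s, m in zip(scenarios, margins):
    print(f"SD={s['sd']:>2}, N={s['n']:>4}: "
          f"({mean_scene_time - m:.1f}, {mean_scene_time + m:.1f}), margin ±{m:.2f}")
```

For small samples such as N = 10, a t critical value is commonly used in place of z, which would give a somewhat wider interval than this sketch produces.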

When you are evaluating a hypothesis, you need to account for both the variability in your sample and how large your sample is.

#### Introduction

Hypothesis testing is generally used when you are comparing two or more groups.

For example, you might implement protocols for performing intubation on pediatric patients in the pre-hospital setting. To evaluate whether these protocols were successful in improving intubation rates, you could measure the intubation rate over time in one group randomly assigned to training in the new protocols and compare this to the intubation rate over time in another control group that did not receive training in the new protocols.

When you are evaluating a hypothesis, you need to account for both the variability in your sample and how large your sample is. Based on this information, you would like to assess whether any differences you see are meaningful, or if they are just due to chance. This is formally done through a process called hypothesis testing.

#### Five Steps in Hypothesis Testing:

Step 1: Specify the Null Hypothesis
The null hypothesis (H0) is a statement of no effect, relationship, or difference between two or more groups or factors. In research studies, a researcher is usually interested in disproving the null hypothesis.

Examples:
There is no difference in intubation rates across ages 0 to 5 years.

The intervention and control groups have the same survival rate (or, the intervention does not improve survival rate).

There is no association between injury type and whether or not the patient received an IV in the prehospital setting.

Step 2: Specify the Alternative Hypothesis
The alternative hypothesis (H1) is the statement that there is an effect or difference. This is usually the hypothesis the researcher is interested in proving. The alternative hypothesis can be one-sided (only provides one direction, e.g., lower) or two-sided. We often use two-sided tests even when our true hypothesis is one-sided because it requires more evidence against the null hypothesis to accept the alternative hypothesis.

Examples:

The intubation success rate differs with the age of the patient being treated (two-sided).

The time to resuscitation from cardiac arrest is lower for the intervention group than for the control (one-sided).

There is an association between injury type and whether or not the patient received an IV in the prehospital setting (two-sided).

Step 3: Set the Significance Level (α)
The significance level (denoted by the Greek letter alpha, α) is generally set at 0.05. This means that there is a 5% chance that you will reject your null hypothesis, and accept your alternative hypothesis, when your null hypothesis is actually true. The smaller the significance level, the greater the burden of proof needed to reject the null hypothesis, or in other words, to support the alternative hypothesis.

Step 4: Calculate the Test Statistic and Corresponding P-Value
In another section we present some basic test statistics to evaluate a hypothesis. Hypothesis testing generally uses a test statistic that compares groups or examines associations between variables. When describing a single sample without establishing relationships between variables, a confidence interval is commonly used.

The p-value describes the probability of obtaining a sample statistic as extreme as, or more extreme than, the one you observed by chance alone if your null hypothesis is true. This p-value is determined based on the result of your test statistic. Your conclusions about the hypothesis are based on your p-value and your significance level.

Example:

P-value = 0.01. A result this extreme will happen 1 in 100 times by pure chance if your null hypothesis is true. It is not likely to have happened strictly by chance.

Example:

P-value = 0.75. A result this extreme will happen 75 in 100 times by pure chance if your null hypothesis is true. It is highly likely to have occurred strictly by chance.

Your sample size directly impacts your p-value. Large sample sizes can produce small p-values even when differences between groups are not meaningful. You should always verify the practical relevance of your results. On the other hand, a sample size that is too small can result in a failure to identify a difference when one truly exists.

Plan your sample size ahead of time so that you have enough information from your sample to show a meaningful relationship or difference if one exists. See calculating a sample size for more information.

Example:

Average ages were significantly different between the two groups (16.2 years vs. 16.7 years; p = 0.01; n=1,000). Is this an important difference? Probably not, but the large sample size has resulted in a small p-value.

Example:

Average ages were not significantly different between the two groups (10.4 years vs. 16.7 years; p = 0.40, n=10). Is this an important difference? It could be, but because the sample size is small, we cannot determine for sure if this is a true difference or just happened due to the natural variability in age within these two groups.
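
To see how sample size alone can move a p-value, the following sketch applies a two-sample z-test to the same difference in means at two different group sizes. The standard deviation, the group sizes, and the equal-variance normal approximation are all hypothetical choices for illustration; they are not the data behind the examples above.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_pvalue(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation).

    Assumes both groups share the same standard deviation and size.
    """
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = (mean_a - mean_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same 0.5-year difference in mean age, hypothetical SD of 3 years.
p_small = two_sample_z_pvalue(16.7, 16.2, sd=3, n_per_group=10)
p_large = two_sample_z_pvalue(16.7, 16.2, sd=3, n_per_group=1000)
print(f"n = 10 per group:   p = {p_small:.3f}")
print(f"n = 1000 per group: p = {p_large:.5f}")
```

The difference is identical in both calls; only the amount of data changes, which is why statistical significance should be read alongside practical relevance.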

If you do a large number of tests to evaluate a hypothesis (called multiple testing), then you need to control for this in your designation of the significance level or calculation of the p-value. For example, if three outcomes measure the effectiveness of a drug or other intervention, you will have to adjust for these three analyses.
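
One common way to adjust for multiple testing is the Bonferroni correction, which divides the significance level by the number of tests. This reference does not name a specific method, so the sketch below is one possible approach under that assumption, and the three p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply a Bonferroni correction to a list of p-values.

    Returns one (raw p, adjusted p, significant?) tuple per test.
    """
    m = len(p_values)
    threshold = alpha / m  # stricter per-test significance level
    return [(p, min(1.0, p * m), p <= threshold) for p in p_values]


# Hypothetical p-values for three outcomes measuring the same intervention.
results = bonferroni([0.01, 0.03, 0.20], alpha=0.05)
for raw, adjusted, significant in results:
    print(f"raw p = {raw:.2f}, adjusted p = {adjusted:.2f}, significant: {significant}")
```

With three tests, each p-value is compared to 0.05 / 3 rather than 0.05, so a result that would pass on its own may no longer be declared significant.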

Step 5: Drawing a Conclusion

• P-value <= significance level (α) => Reject your null hypothesis in favor of your alternative hypothesis. Your result is statistically significant.
• P-value > significance level (α) => Fail to reject your null hypothesis. Your result is not statistically significant.
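
The decision rule above is simple enough to write down directly. This is only an illustrative helper (the function name and wording of the returned strings are assumptions), applied to the two p-values used as examples in Step 4.

```python
def conclude(p_value, alpha=0.05):
    """Translate a p-value and significance level into a conclusion."""
    if p_value <= alpha:
        return "Reject the null hypothesis: statistically significant."
    return "Fail to reject the null hypothesis: not statistically significant."


for p in (0.01, 0.75):
    print(f"p = {p}: {conclude(p)}")
```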

Hypothesis testing is not set up so that you can absolutely prove a null hypothesis. Therefore, when you do not find evidence against the null hypothesis, you fail to reject the null hypothesis. When you do find strong enough evidence against the null hypothesis, you reject the null hypothesis. Your conclusions also translate into a statement about your alternative hypothesis. When presenting the results of a hypothesis test, include the descriptive statistics in your conclusions as well. Report exact p-values rather than a certain range. For example, “The intubation rate differed significantly by patient age with younger patients having a lower rate of successful intubation (p=0.02).”  Here are two more examples with the conclusion stated in several different ways.

Example:

H0: There is no difference in survival between the intervention and control group.

H1: There is a difference in survival between the intervention and control group.

α = 0.05; 20% increase in survival for the intervention group; p-value = 0.002

Conclusion:

• Reject the null hypothesis in favor of the alternative hypothesis.
• The difference in survival between the intervention and control group was statistically significant.
• There was a 20% increase in survival for the intervention group compared to control (p=0.002).

Example:

H0: There is no difference in survival between the intervention and control group.

H1: There is a difference in survival between the intervention and control group.

α = 0.05; 5% increase in survival for the intervention group; p-value = 0.20

Conclusion:

• Fail to reject the null hypothesis.
• The difference in survival between the intervention and control group was not statistically significant.
• There was no significant increase in survival for the intervention group compared to control (p=0.20).

#### Suppose you read the following statement:

The mean value for the intervention group was 29 points lower than for the control group (p-value < 0.05). This might correspond to either of the following 95% confidence intervals:

• Treatment difference: 29.3 (22.4, 36.2)
• Treatment difference: 29.3 (11.8, 46.8)
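
Both intervals exclude zero, so both are consistent with p < 0.05, yet they describe very different precision. As a rough sketch, an approximate p-value can be backed out of each interval, assuming the estimate is approximately normal and the null hypothesis is a difference of zero. The function name is illustrative.

```python
from statistics import NormalDist


def p_value_from_ci(estimate, low, high, confidence=0.95):
    """Approximate two-sided p-value (null value 0) implied by a confidence interval."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = (high - low) / (2 * z_crit)  # interval width is 2 * z * SE
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_narrow = p_value_from_ci(29.3, 22.4, 36.2)
p_wide = p_value_from_ci(29.3, 11.8, 46.8)
print(f"narrow interval: p = {p_narrow:.2g}")
print(f"wide interval:   p = {p_wide:.2g}")
```

A statement of "p < 0.05" hides this distinction, which is one reason exact p-values and confidence intervals are more informative than a significance label alone.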

If the exact p-value is reported, then the relationship between confidence intervals and hypothesis testing is very close. However, the objective of the two methods is different:

• Hypothesis testing relates to a single conclusion of statistical significance vs. no statistical significance.
• Confidence intervals provide a range of plausible values for your population.

#### Which one?

• Use hypothesis testing when you want to do a strict comparison with a pre-specified hypothesis and significance level.
• Use confidence intervals to describe the magnitude of an effect (e.g., mean difference, odds ratio, etc.) or when you want to describe a single sample.

Combining multiple databases into one extensive database for analysis or linking multiple events…

Have you ever wondered what effect seatbelt usage has on the amount of money spent for hospital admission for crash victims? Or perhaps you want to know if size-appropriate splinting in the prehospital setting reduces hospital admissions, lengths of stay, and charges?

#### Using Existing Databases

The answers to these questions are rarely contained in one database. A researcher must therefore start from scratch, building a new database that follows patients from the point of splinting, to the emergency department, and finally to whether the patient was admitted. You can imagine that building these databases can be expensive in terms of time and money. However, if you have access to existing databases, such as computerized EMS run reports and an emergency department or hospital discharge database, then probabilistic record linkage may be the tool for you.

##### Purpose

The purpose of probabilistic record linkage is to combine multiple databases into one extensive database for analysis. It can also be used to link multiple events within one database that refer to a single patient or individual.

##### Description

Probabilistic record linkage is accomplished by comparing data fields in two files, such as birth date or gender. The comparison of numerous data fields leads to a judgment of whether two records refer to the same patient and/or event (and should be linked). This judgment is based on the cumulative weight of agreement and disagreement among field values. The amount of information in a field affects the field’s impact on whether two records should be linked. For instance, agreement of the gender field alone would not determine that two records refer to the same patient, but agreement on Social Security Number nearly guarantees that two records refer to the same individual. Probabilistic linkage software utilizes mathematical algorithms to determine whether two records should be linked based on the information in each record.

##### Technical Details
By assigning log-likelihood ratios to field comparisons, it is possible to computerize the judgment process. Let $m_i$ equal the probability that the $i$th field agrees, given that the records are known to refer to the same person or event (a true match). Let $u_i$ equal the probability that the $i$th field will agree by chance among records known not to match. Then for a given pair of records, if field $i$ agrees, the agreement weight is $w_i = \log_2(m_i / u_i)$. If field $i$ disagrees, a disagreement weight $w_i = \log_2((1 - m_i) / (1 - u_i))$ is assigned. The composite weight for a record pair is the sum of the agreement and disagreement weights for all fields available for comparison.

To improve computation time, both files are sorted on one or several data fields, which are called blocking variables. Comparisons are then made only between records that agree on the blocking variable(s). If an error occurs in a data field that is used for blocking, records that should match will not be compared. To account for this problem, records that fail to match are subjected to subsequent matching attempts after re-blocking with different data fields.

Based on the sizes of the databases being linked and the number of expected matches, researchers can relate the match weight for a pair of records to the probability that these records are correctly matched. Generally, only record pairs with a probability of being correct of at least 0.90 are linked and considered true matches.

Examples: Probabilistic record linkage has been used on a national level to look at:
• The effects of seatbelts and motorcycle helmets on medical outcomes.¹

The Data Coordinating Center (housed with EDC) has used probabilistic linkage to study a variety of topics including:

• Drivers with medical conditions.²
• Effect of wearing only a shoulder strap in a motor vehicle crash.³
• Older and teenage drivers as well as children involved in motor vehicle crashes.⁴⁻⁷
• Pediatric utilization of pre-hospital emergency medical services.⁸
• Injuries sustained in shop classes at public schools.⁹
• Impact of individual components of emergency department pediatric readiness on pediatric mortality.¹⁰
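
The agreement and disagreement weights described under Technical Details can be sketched in a few lines. The field names and the m and u probabilities below are hypothetical values chosen for illustration; in practice they are estimated from the files being linked, and real linkage software also handles blocking and converts weights to match probabilities.

```python
from math import log2

# Hypothetical m (agreement among true matches) and u (chance agreement
# among non-matches) probabilities for each comparison field.
FIELDS = {
    "birth_date": {"m": 0.95, "u": 0.003},
    "sex": {"m": 0.98, "u": 0.5},
    "zip_code": {"m": 0.90, "u": 0.01},
}


def field_weight(m, u, agrees):
    """Agreement weight log2(m/u) or disagreement weight log2((1-m)/(1-u))."""
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))


def composite_weight(record_a, record_b, fields=FIELDS):
    """Sum the weights over all fields available in both records."""
    total = 0.0
    for name, prob in fields.items():
        a, b = record_a.get(name), record_b.get(name)
        if a is None or b is None:
            continue  # field not available for comparison
        total += field_weight(prob["m"], prob["u"], a == b)
    return total


ems = {"birth_date": "2015-03-02", "sex": "F", "zip_code": "84108"}
hospital_same = {"birth_date": "2015-03-02", "sex": "F", "zip_code": "84108"}
hospital_other = {"birth_date": "2011-11-20", "sex": "M", "zip_code": "84010"}

w_match = composite_weight(ems, hospital_same)
w_nonmatch = composite_weight(ems, hospital_other)
print(f"all fields agree:    {w_match:.2f}")
print(f"all fields disagree: {w_nonmatch:.2f}")
```

Agreement on sex contributes little because half of all record pairs agree on it by chance, while agreement on birth date contributes much more, which mirrors the point made in the Description about how much information each field carries.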

#### Bibliography

1. Johnson SW, Walker J. The Crash Outcome Data Evaluation System (CODES). Washington DC: National Highway Traffic Safety Administration; 1996.

2. Diller E, Cook LJ, Leonard DR, Dean M, Reading JM, Vernon DD. Evaluating Drivers Licensed with Medical Conditions in Utah, 1992 – 1996. National Highway Traffic Safety Administration 1999 June; Report No. DOT HS 809 023.

3. Knight S, Cook LJ, Nechodom PJ, Olson LM, Reading JC, Dean JM. Shoulder belts in motor vehicle crashes: a statewide analysis of restraint efficacy. Accident Analysis & Prevention, 33(1), 65-71.

4. Cook LJ, Knight S, Olson LM, Nechodom PJ, Dean JM. Motor vehicle crash characteristics and medical outcomes among older drivers in Utah, 1992 – 1995. Annals of Emergency Medicine 2000;35(6):585-591.

5. Cvijanovich NZ, Cook LJ, Nechodom PJ, Dean JM. A Population-Based Study of Teenage Drivers: 1992-1996. 43rd Annual Proceedings Association for the Advancement of Automotive Medicine 1999;175-186.

6. Berg M, Cook LJ, Corneli H, Vernon D, Dean JM. Effect of Seating Position and Restraint Use on Injuries to Children in Motor Vehicle Crashes. Pediatrics 2000;105(4):831-835.

7. Corneli HM, Cook LJ, Dean JM. Adults and Children in severe motor vehicle crashes: A Matched-Pairs Study. Annals of Emergency Medicine 2000 Oct;36(4):340-5.

8. Suruda AJ, Vernon DD, Reading J, Cook LJ, Nechodom PJ, Leonard D, Dean JM. Pre-Hospital Emergency Medical Services: A Population-Based Study of Pediatric Utilization. Injury Prevention 1999;5(4):294-297.

9. Knight S, Junkins EP, Lightfoot AC, Cazier C, Olson LM. Injuries Sustained by Students in Shop Class. Pediatrics 2000;106(1):10-13.

10. Remick K, Smith M, Newgard CD, Lin A, Hewes H, Jensen AR, Glass N, Ford R, Ames S, Gcph JC, Malveau S, Dai M, Auerbach M, Jenkins P, Gausche-Hill M, Fallat M, Kuppermann N, Mann NC. Impact of Individual Components of Emergency Department Pediatric Readiness on Pediatric Mortality in US Trauma Centers. J Trauma Acute Care Surg. 2022 Sep 1.

This section is intended to be a quick reference for a selection of advanced statistical models with a single outcome and multiple explanatory factors. The importance of moving beyond looking at each explanatory factor’s relationship to the outcome separately is that you can evaluate the independent effect of each factor and reduce the variability in the outcome. All statistical models rely on mathematical assumptions which should be evaluated prior to implementing and interpreting the model.

Consult with a statistician when planning or analyzing an advanced statistical model. This section utilizes several definitions that can be found here.

• Linear Regression
• Logistic Regression
• Cumulative Logistic Regression
• Multinomial Logit Model
• Poisson Regression
• Survival Model (Cox Proportional Hazards)
• Mixed Models (Hierarchical or Multilevel Modeling)
• Generalized Estimating Equations (GEE)

#### Linear Regression

• Continuous outcome
• Independent observations (no repeated measures or clustering)
• Objective: to evaluate the effect of continuous and/or categorical explanatory variables on the outcome
• Interpretation: For categorical variables such as sex, you estimate a difference in mean response (outcome). For continuous variables such as age, the change in the response for every one unit change in the variable is estimated. Because you are evaluating all variables in the same model, each effect estimate is interpreted within the context of the other variables (holding all else constant).
• Also obtain R², an estimate of the amount of variability in your response that is accounted for by the model.
• Analysis of variance (ANOVA) and analysis of covariance (ANCOVA) models can fit into this framework.
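
A minimal sketch of a linear regression with one categorical and one continuous explanatory variable is shown here, fitted by ordinary least squares through the normal equations. The data are fabricated for illustration (an outcome modeled from a female indicator and age), and in practice a statistical package would be used rather than a hand-written solver.

```python
def fit_linear_regression(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns (coefficients with intercept first, R squared).
    """
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    coefs = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(c * v for c, v in zip(coefs, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coefs, 1 - ss_res / ss_tot


# Fabricated data: (female indicator, age in years) and a continuous outcome.
rows = [(0, 2), (1, 4), (0, 6), (1, 8), (0, 10), (1, 12)]
outcome = [94, 100, 97, 109, 106, 112]

(intercept, b_female, b_age), r_squared = fit_linear_regression(rows, outcome)
print(f"difference in mean outcome, female vs. male (same age): {b_female:.2f}")
print(f"change in outcome per one-year increase in age (same sex): {b_age:.2f}")
print(f"R squared: {r_squared:.3f}")
```

Each coefficient is read while holding the other variable constant, which is the interpretation described in the list above.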

#### Logistic Regression

• Yes/No (two-level) outcome
• Independent observations (no repeated measures or clustering)
• Objective: to evaluate the effect of continuous and/or categorical explanatory variables on the probability of a “yes” outcome
• Interpretation: For categorical variables such as sex, you estimate the ratio of the odds of a yes outcome for females to the odds of a yes outcome for males (referred to as an odds ratio; when the outcome is rare, it can be interpreted as an approximate relative risk). For continuous variables such as age, the odds ratio is for a one-unit increase in the explanatory variable.
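
When a logistic regression contains a single two-level explanatory variable and nothing else, its odds ratio equals the odds ratio from the corresponding 2×2 table, so the interpretation can be sketched without fitting a model. The counts below are fabricated, and the confidence interval uses the usual large-sample approximation on the log-odds scale, which is an assumption of this sketch.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio(yes_a, no_a, yes_b, no_b, confidence=0.95):
    """Odds ratio (group A vs. group B) with an approximate confidence interval."""
    estimate = (yes_a / no_a) / (yes_b / no_b)
    # Standard error of the log odds ratio (large-sample approximation).
    se_log = sqrt(1 / yes_a + 1 / no_a + 1 / yes_b + 1 / no_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return estimate, exp(log(estimate) - z * se_log), exp(log(estimate) + z * se_log)


# Fabricated counts: 30 of 100 females and 20 of 100 males with a "yes" outcome.
or_est, or_low, or_high = odds_ratio(yes_a=30, no_a=70, yes_b=20, no_b=80)
print(f"odds ratio (female vs. male) = {or_est:.2f}, 95% CI ({or_low:.2f}, {or_high:.2f})")
```

An interval that includes 1 would indicate that the data are compatible with no difference in odds between the two groups.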

#### Cumulative Logistic Regression

• Similar to logistic regression but modeling an ordered categorical outcome with more than 2 levels
• Independent observations (no repeated measures or clustering)
• Objective: to evaluate the effect of continuous and/or categorical explanatory variables on the cumulative probability of a higher-level outcome (e.g., level 2 or 3 vs. level 1, and level 3 vs. level 1 or 2).
• Interpretation: Odds ratios again – similar to logistic regression but now reflecting a one-level increase in the outcome.

#### Multinomial Logit Model

• Extension of logistic regression with more than 2 levels for the outcome but no ordering of the outcome is required
• Independent observations (no repeated measures or clustering)
• Objective: to evaluate the effect of continuous and/or categorical explanatory variables on the probability of level 2 outcome vs. level 1 outcome and level 3 outcome vs. level 1 outcome (this is different from the cumulative model because you estimate a different effect for each level separately rather than a cumulative effect)
• Interpretation: Odds ratios again – similar to logistic regression but now must specify which two levels of the outcome you are comparing.

#### Poisson Regression

• Count or rate outcome (0, 1, 2, . . . )
• You can have varying follow-up time on subjects and model the outcome as a rate over time
• Independent observations (no repeated measures or clustering)
• Objective: to evaluate the effect of continuous and/or categorical explanatory variables on the outcome
• Interpretation: Effect estimates are used to obtain incidence rate ratios. Fabricated example: older drivers have twice the rate of motor vehicle crashes per year as middle-aged drivers.
• Related models include the negative binomial and zero-inflated Poisson
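
For a single two-level explanatory variable, the incidence rate ratio from a Poisson regression reduces to a ratio of two crude rates, which makes the interpretation easy to sketch. The counts and driver-years below are fabricated to match the "twice the rate" example, and the confidence interval uses a large-sample approximation on the log scale.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def incidence_rate_ratio(events_a, time_a, events_b, time_b, confidence=0.95):
    """Rate ratio (group A vs. group B) with an approximate confidence interval."""
    irr = (events_a / time_a) / (events_b / time_b)
    se_log = sqrt(1 / events_a + 1 / events_b)  # SE of the log rate ratio
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return irr, exp(log(irr) - z * se_log), exp(log(irr) + z * se_log)


# Fabricated data: crashes and driver-years of follow-up in each age group.
irr, irr_low, irr_high = incidence_rate_ratio(
    events_a=40, time_a=1000,   # older drivers
    events_b=60, time_b=3000,   # middle-aged drivers
)
print(f"incidence rate ratio = {irr:.2f}, 95% CI ({irr_low:.2f}, {irr_high:.2f})")
```

Dividing by follow-up time is what allows groups with different amounts of observation to be compared on the same scale.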

#### Survival Model (Cox Proportional Hazards)

• Outcome is time to a certain event (e.g., death, disease development or recovery, stabilization of patient)
• Independent observations (no repeated measures or clustering)
• Special methods are necessary because not all patients/units will have an observed outcome (called censoring).
• Objective: to evaluate the effect of continuous and/or categorical explanatory variables on the time to event outcome
• Interpretation: Effect estimates are used to obtain hazard ratios, which are often interpreted as the relative risk of the outcome at a given time. Fabricated example: the relative risk of death using radiation therapy only vs. radiation + additional treatment is 6.2. Risk of death is about six times higher for radiation only than for radiation + additional treatment.
• You also can estimate median survival time and mortality rates overall and within groups. Graphing the survival function illustrates changes in survival over time.
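
The survival function and median survival time mentioned above are commonly estimated with the Kaplan-Meier method, which is the descriptive companion to a Cox model rather than the model itself. The sketch below uses fabricated follow-up times, with `False` marking censored observations; the function names are illustrative.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival probability) at event times.

    events[i] is True if the event was observed at times[i], False if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_at_t = sum(1 for tt, _ in data if tt == t)        # leaving the risk set at t
        deaths = sum(1 for tt, e in data if tt == t and e)  # observed events at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t  # censored subjects leave the risk set without an event
        i += n_at_t
    return curve


def median_survival(curve):
    """First event time at which estimated survival drops to 50% or below."""
    return next((t for t, s in curve if s <= 0.5), None)


# Fabricated follow-up times (months) and event indicators.
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [True, True, False, True, False, True, True, False]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t:>2}: estimated survival = {s:.3f}")
print("median survival (months):", median_survival(curve))
```

Censored subjects still count toward the number at risk until their last follow-up time, which is how the method uses partial information instead of discarding it.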

#### Mixed Models (Hierarchical or Multilevel Modeling)

• Continuous outcome (similar framework can extend to categorical outcomes)
• Mixed refers to fixed and random effects. Fixed effects are those for which any levels you wish to make conclusions about are included in your study (e.g., sex). Random effects are those for which you wish to take the conclusions from your study and apply them to a wider range of factors (e.g., hospitals).
• Often used for dependent observations where data are clustered by some factor (e.g., treating physicians, hospital sites, individuals with repeated measures). It is not appropriate to analyze clustered data by traditional methods that do not account for the correlation between observations.
• This is an immensely powerful and flexible framework for correctly evaluating not only data means but also the variance and covariance structure of the data.
• Repeated Measures ANOVA can also be used to look at measurements taken over time but is a less flexible framework, especially in the case of missing data.

#### Generalized Estimating Equations (GEE)

If observations are not independent because of clustering (e.g., treating physicians, hospital sites, repeated measures), then an alternative to the mixed model is to adjust for the correlation between observations by using generalized estimating equations. This is possible for most of the models discussed above. It allows accurate estimation and significance testing of both individual-level and cluster-level variables.

Caution: At least 40 clusters are needed for GEE to yield reliable estimates.