MedCalc

Glossary of statistical terms

25th percentile
The value below which 25 percent of the observed cases fall and above which 75 percent of the observed cases fall.
50th percentile
The value above and below which half of the observed values of a variable fall (= the median).
75th percentile
The value below which 75 percent of the observed values of a variable fall and above which 25 percent of the observed values of a variable fall.
80% central range
The range between the 10th and 90th percentiles (in this range lie 80% of the values).
90% central range
The range between the 5th and 95th percentiles (in this range lie 90% of the values).
95% CI for the correlation coefficient
The range of values that contains the 'true' correlation coefficient with a 95% confidence.
95% CI for the mean
A range of values which contains the population mean with a 95% probability.
95% CI for the median
A range of values which contains the population median with a 95% probability.
95% central range
The range between the 2.5th and 97.5th percentiles (in this range lie 95% of the values).
95% confidence interval (regression)
Curves representing a 95% confidence interval for the regression line. This interval includes the true regression line with 95% probability.
95% confidence interval
A range of values that 95% of the time includes the population (true) value.
95% prediction interval (regression)
Curves representing the 95% prediction interval for the regression curve. For any given value of the independent variable, this interval contains the value of the dependent variable with 95% probability.
A
Absolute difference
The difference, taken without regard to sign, between two values.
Absolute error
The absolute error of an observation x is the absolute deviation of x from its "true" value. See also Relative error.
Absolute value
The value of a number without regard of its algebraic sign.
Alternative hypothesis (H1)
The statement that contradicts the null hypothesis, indicating the presence of an effect or difference.
Analysis of variance (regression)
The analysis of variance table divides the total variation in the dependent variable into two components, one which can be attributed to the regression model (labeled Regression) and one which cannot (labeled Residual), and calculates a corresponding F-statistic.
ANCOVA (Analysis of covariance)
A general linear model with one continuous outcome variable (quantitative) and one or more factor variables (qualitative).
ANOVA (Analysis of variance)
ANOVA is used to compare means across multiple groups to determine if at least one group mean is significantly different from the others.
ANOVA Assumptions
The set of assumptions that must be met for the results of ANOVA to be valid, including independence, normality, and homogeneity of variance.
Area Under the ROC curve (AUC)
The area under the ROC curve (AUC) measures a classifier's ability to distinguish between classes, ranging from 0.5 (random) to 1.0 (perfect discrimination).
Average
See Mean.
B
Backward elimination
A stepwise regression technique that starts with all predictors and iteratively removes the least significant ones. See also Forward selection, Stepwise selection.
Bar chart
A chart that presents categorical data with rectangular bars representing the frequency or value of each category.
Bayesian inference
A method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available.
Beta Coefficient
A measure of the influence of an independent variable on a dependent variable in regression analysis.
Bimodal distribution
A distribution with two different modes or peaks.
Binomial distribution
A discrete probability distribution representing the number of successes in a fixed number of independent trials, each with the same probability of success.
Bland & Altman plot
Graphical method for the assessment of agreement between two measurement techniques.
Bootstrap resampling
A resampling method that involves repeatedly sampling with replacement from the data to estimate statistics. See also Jackknife resampling, Monte Carlo simulation.
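As an illustration, a minimal percentile-bootstrap sketch in Python (stdlib only; the function name `bootstrap_ci` and the sample data are hypothetical, and the percentile method shown is just one of several ways to form a bootstrap interval):

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=10000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for a statistic.

    Repeatedly resamples the data with replacement, computes the
    statistic on each resample, and returns the alpha/2 and
    1 - alpha/2 quantiles of the resulting distribution.
    """
    rng = random.Random(seed)
    n = len(data)
    boots = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(n_boot))
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

sample = [4.1, 5.0, 6.2, 5.5, 4.8, 5.9, 6.1, 5.2]  # hypothetical data
lo, hi = bootstrap_ci(sample)
print(f"95% bootstrap CI for the mean: ({lo:.2f}, {hi:.2f})")
```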
Box-and-whisker plot
Graphical statistical summary of a variable: median, quartiles, range and possibly extreme values (outliers). See also Notched box-and-whisker plot.
Box plot
See Box-and-whisker plot.
C
Case-Control study
An observational study designed to help determine if an exposure is associated with an outcome.
Causal inference
The process of drawing a conclusion about a causal connection based on the conditions of the occurrence of an event.
Censoring
In survival analysis, censoring occurs when the exact time of the event (e.g., death, disease recurrence) is unknown for some subjects in a study. This happens because the study ends before the event occurs, the participant is lost to follow-up, or they withdraw from the study.
Central Limit Theorem
A statistical theory that states the distribution of the sample means approaches a normal distribution as the sample size gets larger, regardless of the population's distribution.
Chi-Squared test
A statistical test used to determine if there is a significant association between categorical variables.
Cluster sampling
Clusters or groups are randomly selected, rather than individuals, for analysis.
Cochran’s Q test
Tests for differences in proportions across related groups.
Cochran-Mantel-Haenszel test
The Cochran-Mantel-Haenszel test computes an odds ratio taking into account a confounding factor.
Coefficient of contingency
A measure of association based on Chi-squared. This coefficient is always between 0 and 1, but it is not generally possible for it to attain the value of 1. The maximum value possible depends on the number of rows and columns in a table.
Coefficient of determination
In regression analysis, the coefficient of determination indicates the proportion of the variance in one variable that can be associated with the variance in the other variable. It can range from 0 to 1.
Coefficient of repeatability
Defined as 1.96 times the standard deviations of the differences between two measurements (see Bland & Altman plot).
Coefficient of variation
See Relative standard deviation (RSD).
Cohen's d
An effect size measurement that expresses the difference between two group means in terms of standard deviation.
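A short Python sketch of this idea, using one common formulation in which the mean difference is divided by the pooled sample standard deviation (the function name and the two hypothetical groups are illustrative; other pooled-SD variants exist):

```python
import statistics

def cohens_d(group1, group2):
    """Cohen's d: mean difference divided by the pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
    # Pooled SD weights each group's variance by its degrees of freedom.
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

treated = [5.1, 4.9, 6.0, 5.5, 5.8]  # hypothetical data
control = [4.2, 4.0, 4.8, 4.5, 4.1]
print(round(cohens_d(treated, control), 2))  # → 2.85
```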
Cohort study
A type of observational study that follows a group of people over time to determine how certain exposures affect their outcomes.
Conditional probability
Conditional probability is the probability of an event occurring given that another event has occurred.
Confidence Interval (CI)
A range of values that is likely to contain the population parameter with a specified level of confidence (e.g., 95%).
Confounding variable
A variable that influences both the dependent and independent variable, leading to a false association.
Continuous variable
A random variable that can take on any value within a given range.
Control chart
A control chart is a statistical tool used in quality control to monitor and analyze process performance over time.
Control group
The subjects in a controlled study who do not receive the treatment.
Controlled study
A study that evaluates the effect of a treatment by comparing treated subjects with a control group, who do not receive the treatment. See also Cross-sectional study, Longitudinal study.
Correlation
The correlation between two variables x and y is a measure of how closely related they are, or how linearly related they are. Correlation is the measure of the extent to which a change in one random variable tends to correspond to change in the other random variable.
Correlation coefficient
A measure of the strength and direction of the relationship between two variables.
Covariance
A measure of the joint variability of two random variables.
Cox regression
Cox regression (or Cox proportional hazards regression) is a statistical method to analyze the effect of several risk factors on survival, or in general on the time it takes for a specific event to happen.
Cronbach's alpha
A measure of internal consistency reliability. It calculates the average correlation among all items in a scale or questionnaire.
Crosstabulation
A method of quantitatively analyzing the relationship between multiple variables by displaying the distributions of variables in a matrix format.
Cross-validation
A statistical method used to estimate the skill of a model on unseen data.
Cross-sectional study
A cross-sectional study is a type of research design in which you collect data from many different individuals at a single point in time. In cross-sectional research, you observe variables without influencing them. See also Controlled study, Longitudinal study.
Cumulative frequency
The sum of the frequencies of all the values up to a given value.
Cumulative frequency distribution graph
A graph where for each value of the characteristic, the percentage of elements equal to or less than that value is plotted. The points are then connected by straight lines.
D
Data cleaning
The process of correcting or removing erroneous data from a dataset.
Data mining
The practice of examining large datasets to discover patterns and extract valuable information.
Data visualization
The graphical representation of data to identify patterns, trends, and insights.
Degrees of Freedom (DF)
The number of independent values or quantities that can vary in an analysis without breaking any constraints.
Dependent variable
In regression, the variable whose values are supposed to be explained by changes in the other variable(s) (the independent or explanatory variables). Usually represented by $y$. See also Independent variable.
Descriptive statistics
Summarizes and describes the characteristics of a data set, including measures of central tendency and variability.
Dichotomous variable
A variable that can only have 2 values.
Discrete variable
A type of random variable that can take on a finite or countably infinite number of values.
E
Effect size
A quantitative measure of the magnitude of a phenomenon or the strength of a relationship.
Error term
A variable representing the amount of variation in the dependent variable that cannot be explained by the independent variable(s).
Experimental study
An experimental study is a study where the factors under consideration are controlled so as to obtain information about their influence on the variable of interest.
Explanatory variable
See Independent variable.
Explained variation (regression)
The amount of the total observed variability in the dependent variable that is explained by the regression.
Exponential distribution
A probability distribution that describes the time between events in a Poisson process.
Extrapolate
Extrapolation is a way to estimate values beyond the known data. You can use patterns and graphs to determine other possible data points that were not actually measured. See also Interpolate.
F
F-test
A statistical test used to compare the variances of two populations or to assess the overall significance of a regression model.
Factor
An independent variable defining groups of cases.
Factor analysis
A technique used to reduce data to a smaller set of summary variables and identify underlying relationships.
False Positive
A Type I error in hypothesis testing, where a null hypothesis is incorrectly rejected. See also False Negative.
False Negative
A Type II error in hypothesis testing, where a null hypothesis is incorrectly accepted. See also False Positive.
Far out value
A far out value is defined as a value that is smaller than the lower quartile minus 3 times the interquartile range, or larger than the upper quartile plus 3 times the interquartile range (outer fences) (see also Box-and-whisker plot).
Fisher's exact test
A statistical significance test used for analyzing contingency tables, especially when sample sizes are small.
Forest plot
In meta-analysis, a forest plot visually represents effect sizes and confidence intervals from multiple studies.
Forward selection
A stepwise regression technique that starts with no predictors and iteratively adds the most statistically significant ones. See also Backward elimination, Stepwise selection.
Frequency chart
A graphical representation of a frequency table.
Frequency table
A table showing the number of cases that belong to distinct categories, or simultaneously to two or more distinct categories, e.g. patients cross-classified according to both gender and age group, or according to treatment and result categories.
Friedman test
Non-parametric alternative to repeated-measures ANOVA.
Full Model
A regression model that includes all potential predictor variables.
Funnel plot
A graphical tool for detecting bias in meta-analysis.
G
Gaussian Distribution
Another name for the normal distribution.
Geometric mean
The geometric mean is the nth root of the product of n observations.
$$\left ( \prod_{i=1}^n{x_i} \right ) ^\tfrac1n = \sqrt[n]{x_1 x_2 \cdots x_n} = \exp\left[\frac1n\sum_{i=1}^n\ln x_i\right] $$
See also Mean, Harmonic mean.
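The rightmost (log) form of this formula translates directly into a few lines of Python (the helper name `geometric_mean` is illustrative; Python 3.8+ also provides `statistics.geometric_mean`):

```python
import math

def geometric_mean(values):
    """nth root of the product of n values, computed via the log
    form exp(mean of logs) to avoid overflow for large products."""
    return math.exp(sum(math.log(x) for x in values) / len(values))

print(round(geometric_mean([2, 8]), 6))  # → 4.0  (sqrt of 2*8 = 16)
```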
Goodness of fit
Quantification of how well a model fits observed data.
H
H-test
See Kruskal-Wallis test.
Harmonic mean
Harmonic mean is defined as the average of the reciprocal values of the given values.
$$\frac{n}{\frac1{x_1} + \frac1{x_2} + \cdots + \frac1{x_n}} = \frac{n}{\sum\limits_{i=1}^n \frac1{x_i}} $$
See also Mean, Geometric mean.
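This formula is likewise a one-liner in Python (the helper name is illustrative; `statistics.harmonic_mean` exists in the standard library as well). The classic use case: averaging two speeds over the same distance.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals of the values."""
    return len(values) / sum(1 / x for x in values)

# Average speed when travelling the same distance at 40 and 60 km/h:
print(round(harmonic_mean([40, 60]), 6))  # → 48.0, not 50
```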
Hazard ratio
The hazard ratio is a statistical measure used primarily in survival analysis to compare the likelihood of an event occurring at any given point in time between two groups.
Histogram
Graphical representation of the distribution of a numerical variable. The number of observations belonging to each interval on the horizontal scale is represented by the height of a bar erected above that interval. (Note: if the intervals are not of equal length, the number of observations is represented by the area of the bar.)
Hypothesis
A proposed explanation for a phenomenon, typically formulated in the context of a statistical test.
Hypothesis testing
Hypothesis testing is a method to test an assumption regarding a population parameter, based on sample data.
I
Independent variable
In regression, the independent variables are the ones that are supposed to explain the dependent variable. Usually represented by $x$ or $x_i$.
Inferential statistics
Procedures for making generalizations about a population by studying a sample from this population.
Interpolate
Interpolation is a way to estimate data. When you interpolate you estimate the data between two known observations or measurements. See also Extrapolate.
Interquartile range (IQR)
A measure of statistical dispersion, calculated as the difference between the first (Q1) and third (Q3) quartile.
J
Jackknife resampling
A resampling technique used to estimate the bias and variance of a statistical estimator. See also Bootstrap resampling.
Jonckheere-Terpstra Test
A non-parametric statistical test for trends in ordinal data.
Jittering
A technique used in data visualization to add random noise to data points to reduce overlap in scatterplots.
K
Kaplan-Meier curve
A stepwise graph that estimates the survival probability over time, and that accounts for censored data.
Kappa
A measure of the agreement between two (diagnostic) classification systems, after correction for agreement by chance.
Kendall’s Tau
A statistic that measures the ordinal association between two measured quantities.
Kolmogorov-Smirnov test
Compares a sample distribution with a reference distribution (e.g. the Normal distribution) or compares two distributions.
Kruskal-Wallis test
Non-parametric alternative to ANOVA for comparing three or more groups. An extension of the Mann-Whitney U Test.
Kurtosis
Kurtosis is a measure for the degree of tailedness in the variable distribution.
L
Least squares
Used to estimate parameters in statistical models such as those that occur in regression analysis. Estimates for the parameter are obtained by minimizing the sum of the squares of the differences between the observed values and the predicted values under the model.
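For the simple straight-line case, minimizing the sum of squared differences has a closed-form solution, sketched below in Python (function name and data are hypothetical):

```python
def least_squares_fit(xs, ys):
    """Slope and intercept of the straight line minimizing the
    sum of squared residuals (closed-form solution)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # roughly y = 2x
b, a = least_squares_fit(xs, ys)
print(round(b, 3), round(a, 3))  # slope ≈ 1.95, intercept ≈ 0.15
```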
Levene’s test
Tests the equality of variances across multiple groups.
Likelihood function
A function of parameters in a statistical model that measures how well the model explains the observed data.
Linear interpolation
Linear interpolation is a method of estimating an unknown value between two known values on a straight line. Given two known points $(x_1,y_1)$ and $(x_2,y_2)$, the interpolated value $y$ at point $x$ between them is
$$ y = y_1 + \frac{(x - x_1)(y_2 - y_1)}{x_2 - x_1} $$
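The formula above maps directly to code; a minimal Python version (the function name is illustrative):

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """y at x on the straight line through (x1, y1) and (x2, y2)."""
    return y1 + (x - x1) * (y2 - y1) / (x2 - x1)

# Halfway between (2, 10) and (3, 20):
print(linear_interpolate(2.5, 2, 10, 3, 20))  # → 15.0
```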
Linear regression
A regression analysis method that models the relationship between two variables by fitting a linear equation.
Logarithmic transformation
Logarithmic transformation is a statistical technique that involves applying a logarithm to each data point in a dataset to stabilize variance and to make the distribution closer to a Normal distribution.
Logistic regression
A regression analysis method used to model the relationship between a binary dependent variable and one or more independent variables. Logistic regression analysis generates the coefficients of a formula to predict a logit transformation of the probability of presence of the characteristic of interest.
$$ logit(p) = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + b_n x_n $$
Logit function
The logit function is the inverse of the sigmoid (logistic) function and is used in logistic regression and other statistical models. It transforms probabilities $p$ into a range from $-\infty$ to $+\infty$, making it useful for modeling binary outcomes.
$$ \operatorname{logit}(p)= \ln\left( \frac{p}{1-p} \right) $$
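A small Python sketch of the logit and its inverse, the sigmoid, showing that they round-trip a probability (function names are illustrative):

```python
import math

def logit(p):
    """Log-odds of a probability p (requires 0 < p < 1)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Sigmoid (logistic) function: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

print(logit(0.5))                         # → 0.0 (even odds)
print(round(inv_logit(logit(0.9)), 6))    # → 0.9 (round-trip)
```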
Logrank test
The logrank test, or log-rank test, is a hypothesis test to compare the survival distributions of two samples.
Longitudinal study
A study that observes the same subjects over a long period of time. See also Controlled study, Cross-sectional study.
M
Mann-Whitney U test
Non-parametric alternative to the T-test for comparing two independent groups.
Maximum
The largest value taken on by a variable.
Maximum Likelihood Estimate (MLE)
The maximum likelihood estimate of a parameter is the possible value of the parameter for which the chance of observing the data is largest.
McNemar test
Compares paired categorical data (e.g., pre/post designs with binary outcomes).
Mean
The mean is the average value calculated by summing all values and dividing by the number of values.
$$ \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n} = {1 \over n} \sum_{i=1}^{n}{x_i} = {1 \over n} \sum_{}^{}{x} $$
See also Geometric mean, Harmonic mean.
Mean survival time
The mean survival time is estimated as the area under the survival curve in the interval 0 to tmax. See also Restricted mean survival time.
Mean Squared Error (MSE)
An average of the squares of errors or deviations, representing the average squared difference between estimated values and actual value.
Measures of central tendency
Measures of central tendency help to summarize a data set with a single value. Examples: Mean, Median.
Measures of dispersion
These measures indicate the spread or variability within a data set. Examples: Range, Variance, Standard Deviation.
Median
The median is the middle value in an ordered data set. If there is an even number of observations, it is the average of the two middle values. The median is equal to the 50th percentile.
Median survival time
The median survival is the smallest time at which the survival probability drops to 0.5 (50%) or below.
Meta-analysis
A statistical method that combines results from multiple studies to identify overall trends.
Minimum
The smallest value taken on by a variable.
Mode
The mode is the most frequently occurring value in a data set. A dataset may have one mode, more than one mode, or no mode at all.
Monte Carlo simulation
A computational technique that uses randomness to obtain numerical results, typically for assessing risk and uncertainty. See also Bootstrap resampling.
Multiple regression
Multiple regression assesses the relationship between one dependent variable and multiple independent variables.
$$ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 +\ ...\ + b_k x_k $$
Multivariate analysis
A statistical method used to analyze data that involves multiple variables to understand relationships.
N
N
Number of cases in a sample.
Negative likelihood ratio
Ratio between the probability of a negative test result given the presence of the disease and the probability of a negative test result given the absence of the disease. See also Positive likelihood ratio.
Negative predictive value
Probability that the disease is not present when the test is negative. See also Positive predictive value.
Nominal data
Observations that are coded alphanumerically but without having an obvious order, e.g. blood group, male/female. See also Ordinal data.
Non-parametric tests
Non-parametric tests are tests that do not assume a specific distribution for the data. See Mann-Whitney U Test, Wilcoxon Signed-Rank Test, Kruskal-Wallis Test.
Normal distribution
A symmetric probability distribution characterized by a bell-shaped curve, defined by its mean $\mu$ and standard deviation $\sigma$. The normal distribution is given by the formula
$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2} } e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
Normal plot
The Normal plot is a graphical tool to judge the Normality of the distribution of sample data. In a Normal plot, the expected z-scores are plotted against the observed data. A random sample from a normal distribution will form a near straight line.
Normal range
See Reference interval.
Normality test
A statistical test that assesses whether a dataset is normally distributed.
Notched box-and-whisker plot
A Box-and-whisker plot with notches representing intervals for the medians, allowing pairwise comparison of the medians at a 95% confidence level. If the notches about two medians do not overlap, the medians differ significantly at roughly the 95% confidence level.
Null and Alternative Hypotheses
Null Hypothesis (H0): Assumes no effect or no difference.
Alternative Hypothesis (H1): Assumes there is an effect or a difference.
Null distribution
The probability distribution of a statistic under the null hypothesis.
Null hypothesis (H0)
A statement asserting that there is no effect or no difference, used as a starting point for hypothesis testing.
Number Needed to Treat (NNT)
The number needed to treat (NNT) is the estimated number of patients who need to be treated with the new treatment rather than the standard treatment (or no treatment) for one additional patient to benefit.
O
Observational study
An observational study is a study where the researcher observes and records behavior or outcomes without manipulating any variables.
Odds
A way of representing the likelihood of an event's occurrence. The odds m:n in favor of an event means we expect the event will occur m times for every n times it does not occur.
Odds ratio
The ratio of the odds of the outcome in two groups. The odds ratio compares the odds of an event occurring in one group to the odds of it occurring in another group. See also Relative risk.
One-way analysis of variance
See ANOVA (Analysis of Variance).
Ordinal data
Ordinal data are categorical data which can take a value that can be logically ordered or ranked. See also Nominal data.
Outlier
An observation that is deemed to be unusual and possibly erroneous because it does not follow the general pattern of the data in the sample.
Outside value
An outside value is defined as a value that is smaller than the lower quartile minus 1.5 times the interquartile range, or larger than the upper quartile plus 1.5 times the interquartile range (inner fences) (see also Box-and-whisker plot).
Overfitting
A modeling error that occurs when a model captures noise rather than the underlying pattern. See also Underfitting.
P
P-value
The probability of obtaining a value of the test statistic equal to or more extreme than that observed, assuming that the null hypothesis is true.
Paired T-Test
Compares the means of two related samples (e.g., before vs. after treatment).
Parameter
A numerical characteristic or measure of a population.
Parametric tests
Statistical tests that assume the data follows a certain distribution (e.g., normal distribution).
Partial area under ROC curve
The partial area under the ROC curve summarizes a portion of the ROC curve over a prespecified interval of interest. This interval can be a specificity or sensitivity interval.
Partial correlation
Examines the relationship between two variables while controlling for a third.
Pearson correlation coefficient
This measures the linear correlation between two variables, producing a value between -1 and 1.
Percentile
A percentile is the value below which a certain percent of numerical data falls. The p-th percentile is equal to the observation with rank number
$$ R(p) = 0.5 + \frac {p \times n} {100} $$
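A Python sketch of this rank formula (function name is illustrative; linearly interpolating between adjacent ranked observations when $R(p)$ is fractional, and clamping at the extremes, are assumptions not spelled out in the definition above):

```python
def percentile(data, p):
    """p-th percentile via the rank R(p) = 0.5 + p*n/100,
    interpolating linearly when the rank is fractional."""
    xs = sorted(data)
    n = len(xs)
    r = 0.5 + p * n / 100       # 1-based rank from the formula above
    if r <= 1:
        return xs[0]            # clamp below the smallest rank
    if r >= n:
        return xs[-1]           # clamp above the largest rank
    k = int(r)                  # integer part of the rank
    frac = r - k                # fractional part → interpolation weight
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

data = [15, 20, 35, 40, 50]
print(percentile(data, 50))  # rank R(50) = 3.0 → middle value, 35.0
```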
Poisson distribution
A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space.
Polynomial Regression
Polynomial regression is a type of regression that models the relationship between an independent variable $x$ and a dependent variable $y$ as an n-th degree polynomial.
$$ y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \ ... \ + b_n x^n $$
Population
The entire set of items from which data can be selected, or the entire group of individuals or items that researchers are interested in studying.
Positive likelihood ratio
Ratio between the probability of a positive test result given the presence of the disease and the probability of a positive test result given the absence of the disease. See also Negative likelihood ratio.
Positive predictive value
Probability that the disease is present when the test is positive. See also Negative predictive value.
Power of a test
The probability that the test correctly rejects a false null hypothesis (1 minus the probability of a Type II error).
Precision-recall curve
A precision-recall curve is a plot of the precision (positive predictive value, y-axis) against the recall (sensitivity, x-axis) for different thresholds. It is an alternative for the ROC curve.
Predicted value
Value for the dependent variable predicted from the regression model.
Probability
The probability of an event is the ratio of the number of outcomes that includes the event to the total number of possible outcomes.
Probability distributions
Probability distributions describe how probabilities are distributed over the values of a random variable. See Normal distribution, Binomial Distribution, Poisson Distribution.
Q
Q-Q plot
A Q-Q plot (quantile-quantile plot) is a graphical tool used to compare the distribution of a dataset to a theoretical distribution (e.g., normal distribution).
Quadratic regression
Quadratic regression is a type of polynomial regression used to model the relationship between a dependent variable and an independent variable when the data exhibits a parabolic (U-shaped or inverted U-shaped) pattern. It is a non-linear regression technique, but it is still considered a linear model because it is linear in terms of the coefficients.
$$y = a x^2 + b x + c $$
Qualitative data
Qualitative data are measures of 'types' and may be represented by a name, symbol, or a number. See also Quantitative data.
Quantitative data
Quantitative data are measures of values or counts and are expressed as numbers. See also Qualitative data.
Quartiles
Quartiles divide an ordered dataset into four equal parts. They correspond to the 25th, 50th and 75th percentiles.
R
R-adjusted
The coefficient of (multiple) determination adjusted for the number of independent variables in the regression model. R-adjusted may decrease if variables are entered in the model that do not add significantly to the model fit.
Random sample
A sample in which all population members have the same probability of being selected, and the selection of each member is independent of the selection of all other members.
Random variable
A variable whose values are determined by the outcomes of a random phenomenon.
Randomized controlled trial (RCT)
An experiment that randomly assigns participants to a treatment or control group.
Range
The minimum and maximum values, or the difference between the maximum and minimum values in a data set.
Rank correlation
Method to study the relationship between two variables that are not Normally distributed or between ordinal variables.
Rate
A measure of the quantity of one thing in relation to another thing. The rate compares two different quantities, measured in different units. See also Ratio.
Ratio
A comparison of two quantities by division. A ratio compares the frequency of one value for a quantity with another value for that quantity. See also Rate.
Reference interval
A Reference interval (Reference range, Normal range) for a parameter is the interval in which the central 95% values of apparently healthy subjects lie.
Regression
A statistical technique that models the relationship between a dependent variable and one or more independent variables.
Regression coefficients
The regression coefficients indicate the expected change in the dependent variable for a one-unit change in the independent variable. For example, in the multiple regression equation $ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 +\ ...\ + b_k x_k $, the values $b_1\ ... \ b_n$ are the regression coefficients.
Regression equation
The regression equation describes the relationship between two variables. For example in the linear regression equation $y = a + bx $, where $y$ is the dependent variable and $x$ is the independent variable, the coefficient $b$ is the slope, and $a$ is the y-intercept. See also Multiple regression, Quadratic regression.
Regression line
A graphical representation of the regression equation. Typically combined with a scatter diagram.
Relative error
Absolute error divided by the true value.
Relative frequency
A relative frequency describes the number of times a particular value for a variable has been observed to occur in relation to the total number of values for that variable.
Relative risk
The ratio of the proportions of cases having a positive outcome in two groups. The relative risk compares the probability of an event occurring in the exposed group to the probability of it occurring in the unexposed group. See also Odds ratio.
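Both the relative risk and the odds ratio can be computed from the counts of a 2x2 table; a minimal Python sketch (function names and the example counts are hypothetical):

```python
def relative_risk(a, b, c, d):
    """RR from a 2x2 table: a = exposed with event, b = exposed without,
    c = unexposed with event, d = unexposed without."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR from the same 2x2 table: ratio of the odds a/b and c/d."""
    return (a / b) / (c / d)

# Hypothetical table: 20/100 events in the exposed group,
# 10/100 events in the unexposed group.
print(relative_risk(20, 80, 10, 90))        # → 2.0
print(round(odds_ratio(20, 80, 10, 90), 2)) # → 2.25
```

Note how the OR (2.25) exceeds the RR (2.0) here; the two measures diverge more as the event becomes more common.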
Relative standard deviation (RSD)
This is the standard deviation divided by the mean. If appropriate, this number can be expressed as a percentage by multiplying it by 100 to obtain the coefficient of variation.
Repeated Measures ANOVA
Compares means across multiple time points or conditions within the same subjects.
Residuals
The differences between observed and predicted values. A residual is the difference between the observed value of the variable and the value predicted by the regression model; residuals represent the unexplained (or residual) variation after fitting the model.
Residual standard deviation
The standard deviation of the residuals in regression analysis.
Restricted mean survival time
Restricted mean survival time (RMST) is defined as the area under the survival curve up to a specific time point $t$. See also Mean survival time.
Risk ratio
See Relative risk.
ROC curve
A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier model at varying threshold values.
R-squared
A statistic that provides an indication of the goodness of fit of a regression model.
S
Sample
A subset of the population chosen for analysis.
Sample size
The number of cases (observations) in the sample.
Sampling
The process of selecting a proper subset of elements from the full population so that the subset can be used to make inference to the population as a whole.
Sampling bias
A bias that occurs when the sample collected is not representative of the population.
Sampling distribution
The probability distribution of a statistic obtained from a large number of samples drawn from the same population.
Sampling techniques
Sampling techniques are methods used to select individuals from a population to participate in a study. See Random Sampling, Stratified Sampling.
Scatter diagram
A graph that illustrates the relationship between two quantitative variables.
Sensitivity
Probability that a test result will be positive when the disease is present (true positive rate). See also Specificity.
Significance level
Significance level (alpha) defines the threshold for deciding whether to reject H0.
Simple linear regression
See Linear regression.
Skewed distribution
A distribution that is not symmetrical.
Skewness
Skewness is a measure of the degree of asymmetry of a variable's distribution.
Slope
A number that indicates the incline or steepness of a line on a graph.
Spearman’s rank correlation
A non-parametric measure of correlation that assesses how well the relationship between two variables can be described by a monotonic function.
Specificity
Probability that a test result will be negative when the disease is not present (true negative rate). See also Sensitivity.
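Both sensitivity and specificity can be read off a 2x2 confusion matrix; a sketch with hypothetical counts:

```python
# Hypothetical test results versus true disease status.
tp, fn = 45, 5    # diseased subjects: true positives, false negatives
tn, fp = 90, 10   # healthy subjects: true negatives, false positives

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(sensitivity, specificity)  # → 0.9 0.9
```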
Standard deviation
The standard deviation is the square root of the Variance. It provides a measure of the average distance of each data point from the mean.
$$s = \sqrt{\frac{\sum_{}^{}{(x-\bar{x})^2}}{n-1}} $$
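The formula above translates directly to code; a sketch with hypothetical data (note the $n-1$ denominator, giving the sample standard deviation):

```python
from math import sqrt

# Sample standard deviation, matching the formula above.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
xbar = sum(data) / n
s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
print(round(s, 3))  # → 2.138
```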
Standard error
A measure of how much the value of a test statistic may vary from sample to sample. It is the standard deviation of the sampling distribution for a statistic.
Standard error of the mean (SEM)
The SEM is calculated by dividing the standard deviation by the square root of the sample size.
$$SEM = \frac{s}{\sqrt{n}} $$
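A sketch of the SEM calculation with hypothetical data, using the standard library:

```python
from statistics import stdev
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]
sem = stdev(data) / sqrt(len(data))  # SEM = s / sqrt(n)
print(round(sem, 3))  # → 0.756
```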
Standardized Normal distribution
A Normal distribution with mean 0 and standard deviation 1. A Normal distribution can be converted into a standardized Normal distribution by substituting all values x with (x-mean)/SD.
Statistic
A numerical characteristic or measure of a sample.
Statistical quality control (SQC)
A set of techniques and tools used to monitor and control a process to ensure it operates at its full potential.
Statistical significance
The likelihood that a relationship between two or more variables is caused by something other than chance.
Statistics
Statistics is the discipline that uses mathematical theories and methodologies to collect, analyze, interpret, present, and organize data. It provides a framework for decision-making and making inferences about populations based on sample data.
Stepwise selection
In regression, a combination of Forward selection and Backward elimination.
Stratified sampling
A sampling method that involves dividing the population into subgroups and taking samples from each subgroup.
Stratum
In random sampling, sometimes the sample is drawn separately from different disjoint subsets of the population. Each such subset is called a stratum.
Survey
A method of data collection that involves asking people questions to gather information.
Survival analysis
Survival analysis is a statistical method used to analyze time-to-event data, where the event of interest is often death, disease progression, or recovery. It estimates the time until an event occurs and accounts for censoring.
T
T-test
A statistical test used to compare the means of two groups.
Time series analysis
A statistical technique that deals with time-ordered data to identify trends and seasonal patterns.
Trimmed mean
A trimmed mean is a statistical measure that calculates the average after removing a certain percentage of extreme values from both ends of the sample.
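A minimal sketch of a 20% trimmed mean (assumption: the trim count per end is rounded down; conventions vary between implementations):

```python
def trimmed_mean(values, proportion=0.2):
    # Drop the lowest and highest `proportion` of sorted values, average the rest.
    values = sorted(values)
    k = int(len(values) * proportion)  # number trimmed from each end
    trimmed = values[k:len(values) - k] if k else values
    return sum(trimmed) / len(trimmed)

print(trimmed_mean([1, 2, 3, 4, 100]))  # → 3.0 (the outlier 100 is dropped)
```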
Total variation (regression)
The sum of the squares of the difference between the observed values and the arithmetic mean for the dependent variable.
Two-way analysis of variance
Method to analyze the effect of two qualitative factors on one dependent variable.
Type I Error
The error of rejecting the null hypothesis when it is actually true (false positive).
Type II Error
The error of failing to reject the null hypothesis when it is actually false (false negative).
U
Unbiased estimator
For an estimator to be unbiased it is required that on average the estimator will yield the true value of the unknown parameter.
Underfitting
A modeling error that occurs when a model is too simple to capture the underlying trend of the data. See also Overfitting.
Unexplained variation (regression)
The sum of the squares of the differences between the observed values and the estimated or predicted values.
Uniform distribution
A probability distribution where all outcomes are equally likely.
V
Variable
A quantity that varies.
Variance
A measure of the dispersion of a set of values, calculated as the sum of squared differences from the mean, divided by the number of values minus 1.
$$s^2 = \frac{\sum_{}^{}{(x-\bar{x})^2}}{n-1} $$
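The standard library computes this directly; a sketch with hypothetical data:

```python
from statistics import variance

# Sample variance (n - 1 denominator), matching the formula above.
data = [2, 4, 4, 4, 5, 5, 7, 9]
v = variance(data)
print(round(v, 3))  # → 4.571
```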
Violin plot
A violin plot depicts distributions of numeric data for one or more groups using density curves. The width of each curve corresponds with the approximate frequency of data points in each region.
W
Welch’s T-test
Version of the T-test used when two groups have unequal variances and/or sample sizes.
Wilcoxon signed-rank test
Tests differences between paired samples, assessing whether their population mean ranks differ. Non-parametric alternative to the paired T-test.
X
X-bar ($ \bar{X}$)
The sample mean.
X-variable
Often used to denote the independent variable in regression analysis.
Y
Y-variable
Typically represents the dependent variable in a mathematical or statistical model. See Regression.
Yates' correction
A correction applied in chi-squared tests for continuity, particularly with 2x2 contingency tables.
Youden plot
A graphical method to analyze inter-laboratory data, where all laboratories have analyzed 2 samples.
Z
Z-score
A measure that describes a value's relation to the mean of a group of values, expressed in standard deviations.
$$ z = \frac{x-\bar{x}}{s} $$
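A sketch matching the formula above, with hypothetical data (here the group mean and standard deviation are estimated from the sample):

```python
from statistics import mean, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]
x = 9
z = (x - mean(data)) / stdev(data)  # standard deviations above the mean
print(round(z, 2))  # → 1.87
```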
Zero-order correlation coefficient
The simple (unadjusted) correlation coefficient between the dependent variable and each independent variable separately.
