The Controversy over Null Hypothesis Significance Testing Revisited
Abstract
Null hypothesis significance testing (NHST) is one of the most widely used methods for testing hypotheses in psychological research, yet it has remained shrouded in controversy throughout the almost seventy years of its existence. The present article reviews both the main criticisms of the method and the alternatives that have been put forward to complement or replace it. It focuses primarily on those alternatives whose use is recommended by the Task Force on Statistical Inference (TFSI) of the APA (Wilkinson and TFSI, 1999) in the interests of improving researchers' practices of statistical analysis and data interpretation. In addition, the arguments used to rebut each of the criticisms levelled against NHST are reviewed, and the main problems with each of the alternatives are pointed out. It is concluded that rigorous research activity requires the use of NHST in the appropriate context, the complementary use of other methods that provide information about aspects not addressed by NHST, and adherence to a series of recommendations that promote its rational use in psychological research.
References
Abelson, R.P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.
Abelson, R.P. (1997). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would be invented). In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 117–144). Hillsdale, NJ: Erlbaum.
Allen, M., & Preiss, R. (1993). Replication and meta-analysis: A necessary connection. Journal of Social Behavior and Personality, 8(6), 9–20.
American Psychological Association (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423–437.
Baril, G.L., & Cannon, J.T. (1995). What is the probability that null hypothesis testing is meaningless? American Psychologist, 50, 1098–1099.
Berger, J.O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of P values and evidence. Journal of the American Statistical Association, 82, 112–122.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the χ² test. Journal of the American Statistical Association, 33, 526–542.
Binder, A. (1963). Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 70, 107–115.
Bleymüller, J., Gehlert, G., & Gülicher, H. (1988). Statistik für Wirtschaftswissenschaften (5. Aufl.). München: Vahlen.
Bracey, G.W. (1991). Sense, non-sense, and statistics. Phi Delta Kappan, 73, 335–.
Brandstätter, E. (1999). Confidence intervals as an alternative to significance testing. Methods of Psychological Research Online, 4(2), 33–46.
Brewer, J.K. (1985). Behavioral statistics textbooks: Source of myths and misconceptions? Journal of Educational Statistics, 10, 252–268.
Carver, R.P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378–399.
Carver, R.P. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287–292.
Chow, S.L. (1987). Experimental psychology: Rationale, procedures and issues. Calgary, Alberta, Canada: Detselig Enterprises.
Chow, S.L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105–110.
Chow, S.L. (1989). Significance tests and deduction: Reply to Folger (1989). Psychological Bulletin, 106, 161–165.
Chow, S.L. (1991). Some reservations about power analysis. American Psychologist, 46, 1088–1089.
Chow, S.L. (1996). Statistical significance: Rationale, validity, and utility. Beverly Hills, CA: Sage.
Chow, S.L. (1998a). Précis of statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences, 21, 169–239.
Chow, S.L. (1998b). What statistical significance means. Theory and Psychology, 8, 323–330.
Cleveland, W.S. (1993). Visualizing data. Summit, NJ: Hobart.
Cleveland, W.S., & McGill, M.E. (Eds.) (1988). Dynamic graphics for statistics. Belmont, CA: Wadsworth.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1987). Statistical power analysis for the behavioral sciences (rev. ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cooper, H.M. (1979). Statistically combining independent studies: A meta-analysis of sex differences in conformity research. Journal of Personality and Social Psychology, 37, 131–146.
Cooper, H.M., & Rosenthal, R. (1980). Statistical versus traditional procedures for summarizing research findings. Psychological Bulletin, 87, 442–449.
Cortina, J.M., & Dunlap, W.P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161–172.
Cowles, M. (1989). Statistics in psychology: An historical perspective. Hillsdale, NJ: Erlbaum.
Cowles, M., & Davis, C. (1982). On the origins of the .05 level of statistical significance. American Psychologist, 37, 553–558.
Cox, D.R. (1977). The role of significance tests. Scandinavian Journal of Statistics, 4, 49–70.
Cronbach, L.J. (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.
Cronbach, L.J., & Snow, R.E. (1977). Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington.
Crow, E.L. (1991). Response to Rosenthal's comment "How are we doing in soft psychology?" American Psychologist, 46, 1083–.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532–574.
Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices of psychologists. American Psychologist, 42, 145–151.
Dar, R., Serlin, R.C., & Omer, H. (1994). Misuse of statistical tests in three decades of psychotherapy research. Journal of Consulting and Clinical Psychology, 62, 75–82.
Dixon, P. (1998). Why scientists value p values. Psychonomic Bulletin and Review, 5, 390–396.
Dooling, D., & Danks, J.H. (1975). Going beyond tests of significance: Is psychology ready? Bulletin of the Psychonomic Society, 5, 15–17.
Edwards, W. (1965). Tactical note on the relation between scientific and statistical hypotheses. Psychological Bulletin, 63, 400–402.
Erwin, E. (1998). The logic of null hypothesis testing. Behavioral and Brain Sciences, 21, 197–198.
Falk, R. (1986). Misconceptions of statistical significance. Journal of Structural Learning, 9, 83–96.
Falk, R., & Greenbaum, C.W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory and Psychology, 5, 75–98.
Fidler, F. (2002). The fifth edition of the APA publication manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement, 62(5), 749–770.
Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for ANOVA fixed- and random-effects effect sizes. Educational and Psychological Measurement, 61, 575–604.
Finch, S., Cumming, G., & Thomason, N. (2001). Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform. Educational and Psychological Measurement, 61, 181–210.
Fisher, R.A. (1925). Statistical methods for research workers. London: Oliver & Boyd.
Fisher, R.A. (1931). Introduction. In J.R. Airey (Ed.), Table of Hh functions (pp. xxvi–xxxv). London: British Association.
Fisher, R.A. (1935). The design of experiments. London: Oliver & Boyd.
Folger, R. (1989). Significance tests and the duplicity of binary decisions. Psychological Bulletin, 106, 155–160.
Frick, R.W. (1995). Accepting the null hypothesis. Memory & Cognition, 23(1), 132–138.
Frick, R.W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379–390.
Gigerenzer, G. (1993). The Superego, the Ego, and the Id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Volume 1. Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum.
Gigerenzer, G., & Murray, D.J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. Cambridge, UK: Cambridge University Press.
Glass, G.V. (1976). Primary, secondary and meta-analysis of research. Educational Researcher, 5, 3–8.
Glass, G.V., McGaw, B., & Smith, M.L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Gorsuch, R.L. (1991). Things learned from another perspective (so far). American Psychologist, 46, 1089–1090.
Grant, D.A. (1962). Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 69, 54–61.
Greenland, S. (1998). Meta-analysis. In K. Rothman & S. Greenland (Eds.), Modern epidemiology. Philadelphia: Lippincott-Raven.
Greenwald, A.G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1–20.
Greenwald, A.G. (1993). Consequences of prejudice against the null hypothesis. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Volume 1. Methodological issues (pp. 419–448). Hillsdale, NJ: Erlbaum.
Greenwald, A.G., Gonzalez, R., Harris, R.J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175–183.
Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1, 3–10.
Hagen, R.L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52(1), 15–24.
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research Online, 7(1), 1–20.
Harris, R.J. (1991). Significance tests are not enough: The role of effect-size estimation in theory corroboration. Theory and Psychology, 1, 375–382.
Hayes, A.F. (1998). Reconnecting data analysis and research designs: Who needs a confidence interval? Behavioral and Brain Sciences, 21, 203–204.
Hays, W.L. (1963). Statistics for psychologists. New York: Holt, Rinehart & Winston.
Hays, W.L. (1994). Statistics (4th ed.). New York: Holt, Rinehart & Winston.
Howard, G.S., Maxwell, S.E., & Fleming, K.J. (2000). The proof of the pudding: An illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis. Psychological Methods, 5, 315–332.
Hubbard, R. (1995). The Earth is highly significantly round (p < .001). American Psychologist, 50, 1098–.
Hubbard, R., & Armstrong, J.S. (1994). Replications and extensions in marketing: Rarely published but quite contrary. International Journal of Research in Marketing, 11, 233–248.
Hubbard, R., Parsa, A.R., & Luthy, M.R. (1997). The spread of statistical significance testing in psychology: The case of the Journal of Applied Psychology, 1917–1994. Theory and Psychology, 7, 545–554.
Hubbard, R., & Ryan, P.A. (2000). The historical growth of statistical significance testing in psychology and its future prospects. Educational and Psychological Measurement, 60, 661–681.
Huberty, C.J. (1987). On statistical testing. Educational Researcher, 16(8), 4–9.
Hunter, J.E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3–7.
Hunter, J.E., & Schmidt, F.L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Jeffreys, H. (1934). Probability and scientific method. Proceedings of the Royal Society of London, Series A, 146, 9–16.
Johnson, D.H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management, 63, 763–772.
Kazdin, A.E., & Bass, D. (1989). Power to detect differences between alternative treatments in comparative psychotherapy outcome research. Journal of Consulting and Clinical Psychology, 57, 138–147.
Kirk, R.E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746–759.
Kirk, R.E. (2001). Promoting good statistical practices: Some suggestions. Educational and Psychological Measurement, 61, 213–218.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16–26.
Kupfersmid, J. (1988). Improving what is published: A model in search of an editor. American Psychologist, 43, 635–642.
Levin, J.R. (1998). To test or not to test H0? Educational and Psychological Measurement, 58, 313–333.
Lindgren, B.W. (1976). Statistical theory (3rd ed.). New York: Macmillan.
Lindley, D.V. (1957). A statistical paradox. Biometrika, 44, 187–192.
Lindsay, R.M., & Ehrenberg, A.S.C. (1993). The design of replicated studies. American Statistician, 47, 217–228.
Loftus, G.R. (1991). On the tyranny of hypothesis testing in the social sciences. Contemporary Psychology, 36, 102–105.
Loftus, G.R. (1993). A picture is worth a thousand p values: On the irrelevance of hypothesis testing in the microcomputer age. Behavior Research Methods, Instruments and Computers, 25, 250–256.
Loftus, G.R. (1995). Data analysis as insight: Reply to Morrison and Weaver. Behavior Research Methods, Instruments and Computers, 27, 57–59.
Loftus, G.R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171.
Loftus, G.R., & Masson, M.E. (1994). Using confidence intervals in within-subject designs. Psychonomic Bulletin and Review, 1, 476–490.
Lykken, D. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151–159.
Markus, K.A. (2001). The converse inequality argument against tests of statistical significance. Psychological Methods, 6, 147–160.
McGraw, K.O. (1991). Problems with the BESD: A comment on Rosenthal's "How are we doing in soft psychology?" American Psychologist, 46(10), 1084–1086.
McGraw, K.O. (1995). Determining false alarm rates in null hypothesis testing research. American Psychologist, 50, 1099–1100.
Meehl, P.E. (1967). Theory testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103–115.
Meehl, P.E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
Meehl, P.E. (1990a). Appraising and amending theories: The strategy of Lakatosian defence and two principles that warrant it. Psychological Inquiry, 1, 108–141.
Meehl, P.E. (1990b). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195–244.
Meehl, P.E. (1991). Why summaries of research on psychological theories are often uninterpretable. In R.E. Snow & D.E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 13–59). Hillsdale, NJ: Erlbaum.
Meehl, P.E. (1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 391–423). Hillsdale, NJ: Erlbaum.
Morrison, D.E., & Henkel, R.E. (Eds.) (1970). The significance test controversy: A reader. Chicago: Aldine.
Murphy, K.R. (1990). If the null hypothesis is impossible, why test it? American Psychologist, 45, 403–404.
Murphy, K.R., & Myors, B. (1999). Testing the hypothesis that treatments have negligible effects: Minimum-effect tests in the general linear model. Journal of Applied Psychology, 84, 234–248.
Neyman, J., & Pearson, E.S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175–263.
Neyman, J., & Pearson, E.S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A, 264–294.
Neyman, J., & Pearson, E.S. (1933). On the testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 28, 492–.
Nickerson, R.S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Nunnally, J. (1960). The place of statistics in psychology. Educational and Psychological Measurement, 20, 641–650.
Oakes, M. (1986). Statistical inference: A commentary for social and behavioral sciences. New York: Wiley.
Parker, S. (1995). The "difference of means" may not be the "effect size." American Psychologist, 50, 1101–1102.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175.
Pearson, E., & Hartley, H. (1972). Biometrika tables for statisticians (Vol. 2). Cambridge, UK: Cambridge University Press.
Pollard, P. (1993). How significant is "significance"? In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Volume 1. Methodological issues. Hillsdale, NJ: Erlbaum.
Popper, K.R. (1959). The logic of scientific discovery. New York: Basic Books.
Robinson, D., & Levin, J. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26(5), 21–26.
Robinson, D.H., & Wainer, H. (2001). On the past and future of null hypothesis significance testing. Princeton: Statistics & Research Division.
Rosenthal, R. (1983). Assessing the statistical and social importance of the effects of psychotherapy. Journal of Consulting and Clinical Psychology, 51, 4–13.
Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.
Rosenthal, R. (1993). Cumulating evidence. In G. Keren & C. Lewis (Eds.), A handbook of data analysis in the behavioral sciences: Volume 1. Methodological issues (pp. 519–559). Hillsdale, NJ: Erlbaum.
Rosenthal, R., & Rubin, D.B. (1994). The counternull value of an effect size: A new statistic. Psychological Science, 5, 329–334.
Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.
Rossi, J.S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646–656.
Rossi, J.S. (1997). A case study in the failure of psychology as a cumulative science: The spontaneous recovery of verbal learning. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 175–197). Hillsdale, NJ: Erlbaum.
Rouanet, H. (1996). Bayesian methods for assessing importance of effects. Psychological Bulletin, 119, 149–158.
Rozeboom, W.W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416–428.
Schafer, W.D. (1993). Interpreting statistical significance and nonsignificance. Journal of Experimental Education, 61, 383–387.
Schmidt, F.L. (1992). What do data really mean? American Psychologist, 47, 1173–1181.
Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.
Schmidt, F.L. (2002). Are there benefits from NHST? American Psychologist, 57, 65–71.
Schmidt, F.L., & Hunter, J.E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 37–64). Hillsdale, NJ: Erlbaum.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.
Serlin, R.C., & Lapsley, D.K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40, 73–83.
Serlin, R.C., & Lapsley, D.K. (1993). Rational appraisal of psychological research and the good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Volume 1. Methodological issues (pp. 199–228). Hillsdale, NJ: Erlbaum.
Shafer, G. (1982). Lindley's paradox. Journal of the American Statistical Association, 77, 325–334.
Shaver, J. (1985). Chance and nonsense: A conversation about interpreting tests of statistical significance. Phi Delta Kappan, 67(1), 138–141.
Shaver, J. (1993). What statistical significance testing is, and what it is not. Journal of Experimental Education, 61, 293–316.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61, 605–632.
Snow, R.E. (1998). Inductive strategy and statistical tactics. Behavioral and Brain Sciences, 21, 219–.
Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 334–349.
Steiger, J.H., & Fouladi, R.T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L.L. Harlow, S.A. Mulaik, & J.H. Steiger (Eds.), What if there were no significance tests? (pp. 221–258). Hillsdale, NJ: Erlbaum.
Strahan, R.F. (1991). Remarks on the binomial effect size display. American Psychologist, 46, 1083–1084.
Student [W.S. Gosset] (1908). The probable error of a mean. Biometrika, 6, 1–25.
Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434–438.
Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361–377.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.
Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26–30.
Thompson, B. (1997). Editorial policies regarding statistical significance tests: Further comments. Educational Researcher, 26(5), 29–32.
Thompson, B. (2002). "Statistical," "practical," and "clinical": How many kinds of significance do counselors need to consider? Journal of Counseling and Development, 80, 64–71.
Thompson, B., & Snyder, P.A. (1998). Statistical significance and reliability analyses in recent Journal of Counseling & Development research articles. Journal of Counseling and Development, 76, 436–441.
Tryon, W.W. (2001). Evaluating statistical difference, equivalence, and indeterminacy using inferential confidence intervals: An integrated alternative method of conducting null hypothesis statistical tests. Psychological Methods, 6, 371–386.
Tufte, E.R. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tufte, E.R. (1990). Envisioning information. Cheshire, CT: Graphics Press.
Tukey, J.W. (1962). The future of data analysis. Annals of Mathematical Statistics, 33, 1–67.
Tukey, J.W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83–91.
Tukey, J.W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Tukey, J.W. (1991). The philosophy of multiple comparisons. Statistical Science, 6, 100–116.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6(2), 103–118.
Weitzman, R.A. (1984). Seven treacherous pitfalls of statistics, illustrated. Psychological Reports, 54, 355–363.
Wilkinson, L., & the Task Force on Statistical Inference (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Wilson, W., Miller, H.L., & Lower, J.S. (1967). Much ado about the null hypothesis. Psychological Bulletin, 68, 188–196.