In 2017 my Website was migrated to the cloud and reduced in size.
Hence some links below are broken.
Contact me at rjensen@trinity.edu if you really need a file that is missing.
Common Accountics Science and Econometric Science Statistical Mistakes
Bob Jensen at Trinity University
Accountics is the mathematical science of values.
Charles Sprague [1887] as quoted by McMillan [1998, p. 1]
http://faculty.trinity.edu/rjensen/395wpTAR/Web/TAR395wp.htm#_msocom_1
Tom Lehrer on Mathematical Models and Statistics ---
http://www.youtube.com/watch?v=gfZWyUXn3So
You must watch this to the ending to appreciate it.
David Johnstone asked me to write a paper on the following:
"A Scrapbook on What's Wrong with the Past, Present and Future of Accountics
Science"
Bob Jensen
February 19, 2014
SSRN Download:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2398296
Abstract
For operational convenience I define accountics science as research that features equations and/or statistical inference. Historically, there was a heated debate in the 1920s as to whether the main research journal of academic accounting, The Accounting Review (TAR) that commenced in 1926, should be an accountics journal with articles that mostly featured equations. Practitioners and teachers of college accounting won that debate.
TAR articles and accountancy doctoral dissertations prior to the 1970s seldom had equations. For reasons summarized below, doctoral programs and TAR evolved to where, in the 1990s, having equations became virtually a necessary condition for a doctoral dissertation and acceptance of a TAR article. Qualitative normative and case method methodologies disappeared from doctoral programs.
What’s really meant by “featured equations” in doctoral programs is merely symbolic of the fact that North American accounting doctoral programs pushed out most of the accounting to make way for econometrics and statistics that are now keys to the kingdom for promotion and tenure in accounting schools ---
The purpose of this paper is to make a case that the accountics science monopoly of our doctoral programs and published research is seriously flawed, especially its lack of concern about replication and focus on simplified artificial worlds that differ too much from reality to creatively discover findings of greater relevance to teachers of accounting and practitioners of accounting. Accountics scientists themselves became a Cargo Cult.
http://faculty.trinity.edu/rjensen/Theory01.htm#DoctoralPrograms
Significance Testing: We Can Do Better
Strategies to Avoid Data Collection Drudgery and Responsibilities for Errors in the Data
Drawing Inferences From Very Large Data-Sets
The Insignificance of Testing the Null
Can You Really Test for Multicollinearity?
Simpson's Paradox and Cross-Validation
David Giles' Top Five Econometrics Blog Postings for 2013
Gasp! How could an accountics scientist question such things? This is sacrilege!
A Scrapbook on What's Wrong with the Past, Present and Future of Accountics Science
574 Shields Against Validity Challenges in Plato's Cave ---
http://faculty.trinity.edu/rjensen/TheoryTAR.htm
Real Science versus Pseudo Science ---
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm#Pseudo-Science
How Accountics Scientists Should Change:
"Frankly, Scarlett, after I get a hit for my resume in The Accounting Review
I just don't give a damn"
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm
One more mission in what's left of my life will be to try to change this
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm
"How Non-Scientific Granulation Can Improve Scientific
Accountics"
http://www.cs.trinity.edu/~rjensen/temp/AccounticsGranulationCurrentDraft.pdf
Gaming for Tenure as an Accounting Professor ---
http://faculty.trinity.edu/rjensen/TheoryTenure.htm
(with a reply about tenure publication point systems from Linda Kidwell)
David Giles Econometrics Beat Blog ---
http://davegiles.blogspot.com/
Significance Testing: We Can Do Better
Significance Testing: We Can Do Better
Abacus, June 13, 2016
http://onlinelibrary.wiley.com/doi/10.1111/abac.12078/full
This is not a free article
Author
Thomas R. Dyckman Professor Emeritus Cornell University
Abstract
This paper advocates abandoning null hypothesis statistical tests (NHST) in favor of reporting confidence intervals. The case against NHST, which has been made repeatedly in multiple disciplines and is growing in awareness and acceptance, is introduced and discussed. Accounting as an empirical research discipline appears to be the last of research communities to face up to the inherent problems of significance test use and abuse. The paper encourages adoption of a meta-analysis approach which allows for the inclusion of replication studies in the assessment of evidence. This approach requires abandoning the typical NHST process and its reliance on p-values. However, given that NHST has deep roots and wide “social acceptance” in the empirical testing community, modifications to NHST are suggested so as to partly counter the weakness of this statistical testing method.
Extended Quotation
. . .
2. Why The Frequentist Approach (NHSTs) Should be Abandoned in Favor of a Bayesian Approach
Frequentist Approach:
The frequentist NHST relies on rejecting a null hypothesis of no effect or relationship based on the probability, or “p-level”, of observing a specific sample result X equal to or more extreme than the actual observation X₀, conditional on the null hypothesis H₀ being true. In symbols, this calculation yields a p-level = Pr(X≥X₀|H₀), where ≥ signifies “as or more discrepant with H₀ than X₀”. The origin of the approach is generally credited to Karl Pearson (1900), who introduced it in his χ²-test (Pearson actually called it the P, χ²-test). However, it was Sir Ronald Fisher who is credited with naming and popularizing statistical significance testing and p-values as promulgated in the many editions of his classic books Statistical Methods for Research Workers and The Design of Experiments. See Spielman (1974), Seidenfeld (1979), Johnstone et al. (1986), Barnett (1999), Berger (2003) and Howson and Urbach (2006) on the ideas and development of modern hypothesis tests (NHST).
The Bayesian Approach:
Probabilities, under the Bayesian approach, rely on informed beliefs rather than physical quantities. They represent informed reasoned guesses. In the Bayesian approach, the objective is the posterior (post sample) belief concerning where a parameter, β in our case, is possibly located. Bayes’ theorem allows us to use the sample data to update our prior beliefs about the value of the parameter of interest. The revised (posterior) distribution represents the new belief based on the prior and the statistical method (the model) applied, and calculated using Bayes’ theorem. Prior beliefs play an important role in the Bayesian process. In fact, no data can be interpreted without prior beliefs (“data cannot speak for themselves”).
Bayesians emphasize the unavoidably subjective nature of the research process. The decision to select a model and specific prior or family of priors is necessarily subjective, and the sample data are seldom obtained objectively (Basturk et al., 2014). Indeed, data quality has become a major problem with the advent of “big data” and with the recognition that the rewards for publication tend to induce gamesmanship and even fraud in the data selected for the study.
When the investigator experiences difficulty and uncertainty in specifying a specific prior distribution, the use of a diffuse or “uninformative” prior is typically adopted. The idea is to impose no strong prior belief on the analysis and hence allow the data to have a bigger part in the final conclusions. Ultimately, enough data will “swamp” any prior distribution, but in reality, where systems are not stationary and no model is known to be “true”, there is always subjectivity and room for revision in Bayesian posterior beliefs.
The Bayesian viewpoint is that this is a fact of research life and needs to be faced and treated formally in the analysis. Objectivity is not possible, so there is no gain from pretending that it is. Formal Bayesian methods for coping with subjectivity are easy to understand. For example, one approach is to ask how robust the posterior distribution of belief about β is to different possible prior distributions. If we can say that we come to essentially the same qualitative belief over all feasible models and prior distributions, or across the different priors that different people hold, then that is perhaps the most objective that a statistical conclusion can claim.
Continued in article
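The robustness check described in the quotation above (ask whether different priors lead to essentially the same posterior belief) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it replaces the regression parameter β with a simple success proportion, uses conjugate Beta priors so the posterior has a closed form, and the three priors and both data sets are hypothetical.

```python
# Hypothetical Beta(a, b) priors for a success proportion (not from the article).
PRIORS = {
    "diffuse": (1.0, 1.0),      # uniform, "uninformative"
    "skeptical": (2.0, 8.0),    # prior mean 0.2
    "optimistic": (8.0, 2.0),   # prior mean 0.8
}


def posterior_mean(prior_a, prior_b, successes, trials):
    """Mean of the Beta posterior after observing binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


def posterior_spread(successes, trials):
    """Largest disagreement between posterior means across the priors."""
    means = [posterior_mean(a, b, successes, trials) for a, b in PRIORS.values()]
    return max(means) - min(means)


if __name__ == "__main__":
    for successes, trials in [(7, 10), (700, 1000)]:
        print(trials, "observations, spread across priors:",
              round(posterior_spread(successes, trials), 4))
```

With ten observations the three priors still disagree substantially; with a thousand observations the data "swamp" the priors and the posterior means nearly coincide, which is the sense of robustness the quotation has in mind.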
Academic psychology and medical testing are both dogged by unreliability.
The reason is clear: we got probability wrong ---
https://aeon.co/essays/it-s-time-for-science-to-abandon-the-term-statistically-significant?utm_source=Aeon+Newsletter&utm_campaign=b8fc3425d2-Weekly_Newsletter_14_October_201610_14_2016&utm_medium=email&utm_term=0_411a82e59d-b8fc3425d2-68951505
. . .
For one, it’s of little use to say that your observations would be rare if there were no real difference between the pills (which is what the p-value tells you), unless you can say whether or not the observations would also be rare when there is a true difference between the pills. Which brings us back to induction.
The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since.
Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right before any observations have been made (the ‘prior probability’). Bayes’s theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’.
These intangible probabilities persuaded Fisher that Bayes’s approach wasn’t feasible. Instead, he proposed the wholly deductive process of null hypothesis significance testing. The realisation that this method, as it is commonly used, gives alarmingly large numbers of false positive results has spurred several recent attempts to bridge the gap.
There is one uncontroversial application of Bayes’s theorem: diagnostic screening, the tests that doctors give healthy people to detect warning signs of disease. They’re a good way to understand the perils of the deductive approach.
In theory, picking up on the early signs of illness is obviously good. But in practice there are usually so many false positive diagnoses that it just doesn’t work very well. Take dementia. Roughly 1 per cent of the population suffer from mild cognitive impairment, which might, but doesn’t always, lead to dementia. Suppose that the test is quite a good one, in the sense that 95 per cent of the time it gives the right (negative) answer for people who are free of the condition. That means that 5 per cent of the people who don’t have cognitive impairment will test, falsely, as positive. That doesn’t sound bad. It’s directly analogous to tests of significance which will give 5 per cent of false positives when there is no real effect, if we use a p-value of less than 5 per cent to mean ‘statistically significant’.
But in fact the screening test is not good – it’s actually appallingly bad, because 86 per cent, not 5 per cent, of all positive tests are false positives. So only 14 per cent of positive tests are correct. This happens because most people don’t have the condition, and so the false positives from these people (5 per cent of 99 per cent of the people), outweigh the number of true positives that arise from the much smaller number of people who have the condition (80 per cent of 1 per cent of the people, if we assume 80 per cent of people with the disease are detected successfully). There’s a YouTube video of my attempt to explain this principle, or you can read my recent paper on the subject.
Notice, though, that it’s possible to calculate the disastrous false-positive rate for screening tests only because we have estimates for the prevalence of the condition in the whole population being tested. This is the prior probability that we need to use Bayes’s theorem. If we return to the problem of tests of significance, it’s not so easy. The analogue of the prevalence of disease in the population becomes, in the case of significance tests, the probability that there is a real difference between the pills before the experiment is done – the prior probability that there’s a real effect. And it’s usually impossible to make a good guess at the value of this figure.
An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
In general, though, we don’t know the real prevalence of true effects. So, although we can calculate the p-value, we can’t calculate the number of false positives. But what we can do is give a minimum value for the false positive rate. To do this, we need only assume that it’s not legitimate to say, before the observations are made, that the odds that an effect is real are any higher than 50:50. To do so would be to assume you’re more likely than not to be right before the experiment even begins.
If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent. Any lower prevalence will result in an even higher false positive rate.
The upshot is that, if a scientist observes a ‘just significant’ result in a single test, say P = 0.047, and declares that she’s made a discovery, that claim will be wrong at least 26 per cent of the time, and probably more. No wonder then that there are problems with reproducibility in areas of science that rely on tests of significance.
Continued in article
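The screening arithmetic in the quotation above follows directly from Bayes's theorem and can be checked with a minimal Python sketch. The function name and parameter names are illustrative. The second call applies the same formula to the drug-testing example under two assumptions that are not in the quoted passage: 80 per cent power, and counting every result with p below 0.05 as a positive. The article's 76 and 26 per cent figures appear to come from a different calculation that conditions on a 'just significant' P = 0.047, which this sketch does not attempt to reproduce.

```python
def false_discovery_rate(prior, sensitivity, false_positive_rate):
    """Share of positive results that are false, by Bayes's theorem.

    prior               -- probability the condition (or effect) is real
    sensitivity         -- probability of a positive when it is real
    false_positive_rate -- probability of a positive when it is not real
    """
    true_positives = prior * sensitivity
    false_positives = (1.0 - prior) * false_positive_rate
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Screening example from the quotation: 1% prevalence, 80% detected,
    # 5% of healthy people test positive.
    print(round(false_discovery_rate(0.01, 0.80, 0.05), 2))  # 0.86

    # Drug example with ASSUMED 80% power and a p < 0.05 threshold.
    print(round(false_discovery_rate(0.10, 0.80, 0.05), 2))  # 0.36
```

Even the cruder threshold calculation gives a false discovery rate far above 5 per cent when the prior probability of a real effect is low.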
Jensen Comment
Especially note the many replies to this article. . .
David Colquhoun
https://aeon.co/conversations/what-should-be-done-to-improve-statistical-literacy#
I think that it’s quite hard to find a really good practical guide to Bayesian analysis. By really good, I mean one that is critical about priors and explains exactly what assumptions are being made. I fear that one reason for this is that Bayesians often seem to have an evangelical tendency that leads to them brushing the assumptions under the carpet. I agree that Alexander Etz is a good place to start, but I do wonder how much it will help when you’re faced with a particular set of observations to analyze.
Henning Strandin ---
https://aeon.co/users/henning-strandin
Thank you for a good and useful article on the pitfalls of ignoring the baseline. I have a couple of comments.
Bayes didn’t resolve the problem of induction, even in principle. The problem of induction is the problem of knowing that the observations you have made are relevant to some set of (perhaps as-yet) unobserved events. In his Essay on Probabilities, Laplace illustrated the problem in the same paragraph in which he suggests . . .
Karl Young
Nice article; as a Bayesian who was forced to quote p values in a couple of medical physics papers for which the journal would have nothing else, I appreciate the points made here. But even as a Bayesian one has to acknowledge that there are a number of open problems besides just how to estimate priors. E.g. what one really wants to know is given some observations, how one’s hypothesis fares against as complete a list of alternative hypothesis as can be mustered. Even assuming that one could come up with such a list, calculating the probability that one’s hypothesis best fits the observations in that case requires calculation of a quantity called the evidence that is generally extremely difficult (the reason that the diagnostic examples mentioned in the piece lead to reasonable calculations is that calculating the evidence for the set of proposed hypotheses, that either someone in the population has a disease or doesn’t, is straightforward). So while I think Bayes is the philosophically most coherent approach to analyzing data (doesn’t solve the problem of induction but tries to at least manage it) there are still a number of issues preventing it . . .
Comments Continued at
https://aeon.co/conversations/what-should-be-done-to-improve-statistical-literacy
Strategies to Avoid Data Collection Drudgery and Responsibilities for Errors in the Data
In 2013 I scanned all six issues of The Accounting Review (TAR) published in 2013 to detect what public databases were used (usually purchased at relatively heavy fees for a system of databases) in the 72 articles published January-November, 2013 in TAR. The outcomes were as follows:
Count   Percent   Database
   42    35.3%    Miscellaneous public databases used infrequently
   33    27.7%    Compustat --- http://en.wikipedia.org/wiki/Compustat
   21    17.6%    CRSP --- http://en.wikipedia.org/wiki/Center_for_Research_in_Security_Prices
   17    14.3%    Datastream --- http://en.wikipedia.org/wiki/Thomson_Financial
    6     5.0%    Audit Analytics --- http://www.auditanalytics.com/
  119   100.0%    Total Purchased Public Databases
   10             Non-public Databases (usually experiments) and mathematical analysis studies with no data
Note that there are subsets of databases within databases like Compustat, CRSP, and Datastream. Many of these 72 articles used more than one public database, and when the Compustat and CRSP joint database was used I counted one for the Compustat Database and one for the CRSP Database. Most of the non-public databases are behavioral experiments using students as surrogates for real-world decision makers.
My opinion is that 2013 is a typical year, in which over 92% of the articles published in TAR used purchased public databases. The good news is that most of these public databases are enormous, thereby allowing for huge samples for which statistical inference is probably superfluous. For very large samples even minuscule differences are significant in hypothesis testing, making statistical inference testing superfluous.
My theory is that accountics science gained dominance in accounting research, especially in North American accounting Ph.D. programs, because it abdicated responsibility:
1. Most accountics scientists buy data, thereby avoiding the greater cost and drudgery of collecting data.
2. By relying so heavily on purchased data, accountics scientists abdicate responsibility for errors in the data.
3. Since adding missing variable data to the public database is generally not at all practical in purchased databases, accountics scientists have an excuse for not collecting missing variable data.
4. Software packages for modeling and testing data abound. Accountics researchers need only feed purchased data into the hopper of statistical and mathematical analysis programs. It still takes a lot of knowledge to formulate hypotheses and to invent and understand complex models. But the really hard work of collecting data and error checking is avoided by purchasing data.
David Johnstone posted the following message on the AECM Listserv on November 19, 2013:
An interesting aspect of all this is that there is a widespread a priori or learned belief in empirical research that all and only what you have to do to get meaningful results is to get data and run statistics packages, and that the more advanced the stats the better. It's then just a matter of turning the handle. Admittedly it takes a lot of effort to get very proficient at this kind of work, but the presumption that it will naturally lead to reliable knowledge is an act of faith, like a religious tenet. What needs to be taken into account is that the human systems (markets, accounting reporting, asset pricing etc.) are madly complicated and likely changing structurally continuously. So even with the best intents and best methods, there is no guarantee of reliable or lasting findings a priori, no matter what “rigor” has gone in.
Part and parcel of the presumption that empirical research methods are automatically “it” is the even stronger position that no other type of work is research. I come across this a lot. I just had a 4th year Hons student do his thesis, he was particularly involved in the superannuation/pension fund industry, and he did a lot of good practical stuff, thinking about risks that different fund allocations present, actuarial life expectancies etc. The two young guys (late 20s) grading this thesis, both excellent thinkers and not zealots about anything, both commented to me that the thesis was weird and was not really a thesis like they would have assumed necessary (electronic data bases with regressions etc.). They were still generous in their grading, and the student did well, and it was only their obvious astonishment that there is any kind of worthy work other than the formulaic-empirical that astonished me. This represents a real narrowing of mind in academe, almost like a tendency to dark age, and cannot be good for us long term. In Australia the new push is for research “impact”, which seems to include industry relevance, so that presents a hope for a cultural widening.
I have been doing some work with a lawyer-PhD student on valuation in law cases/principles, and this has caused similar raised eyebrows and genuine intrigue with young colleagues – they just have never heard of such stuff, and only read the journals/specific papers that do what they do. I can sense their interest, and almost envy of such freedom, as they are all worrying about how to compete and make a long term career as an academic in the new academic world.
This could also happen in
accountics science, but we'll probably never know! ---
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm
"Statistical Flaw Punctuates Brain Research in Elite Journals," by
Gary Stix, Scientific American, March 27, 2014 ---
http://blogs.scientificamerican.com/talking-back/2014/03/27/statistical-flaw-punctuates-brain-research-in-elite-journals/
Neuroscientists need a statistics refresher.
That is the message of a new analysis in Nature Neuroscience that shows that more than half of 314 articles on neuroscience in elite journals during an 18-month period failed to take adequate measures to ensure that statistically significant study results were not, in fact, erroneous. Consequently, at least some of the results from papers in journals like Nature, Science, Nature Neuroscience and Cell were likely to be false positives, even after going through the arduous peer-review gauntlet.
The problem of false positives appears to be rooted in the growing sophistication of both the tools and observations made by neuroscientists. The increasing complexity poses a challenge to one of the fundamental assumptions made in statistical testing, that each observation, perhaps of an electrical signal from a particular neuron, has nothing to do with a subsequent observation, such as another signal from that same neuron.
In fact, though, it is common in neuroscience experiments—and in studies in other areas of biology—to produce readings that are not independent of one another. Signals from the same neuron are often more similar than signals from different neurons, and thus the data points are said by statisticians to be clustered, or “nested.” To accommodate the similarity among signals, the authors from VU University Medical Center and other Dutch institutions suggest that a technique called multilevel analysis is needed to take the clustering of data points into account.
No adequate correction was made in any of the 53 percent of the 314 papers that contained clustered data when surveyed in 2012 and the first half of 2013. “We didn’t see any of the studies use the correct multi-level analysis,” says Sophie van der Sluis, the lead researcher. Seven percent of the studies did take steps to account for clustering, but these methods were much less sensitive than multi-level analysis in detecting actual biological effects. The researchers note that some of the studies surveyed probably report false-positive results, although they couldn’t extract enough information to quantify precisely how many. Failure to statistically correct for the clustering in the data can increase the probability of false-positive findings to as high as 80 percent—a risk of no more than 5 percent is normally deemed acceptable.
Jonathan D. Victor, a professor of neuroscience at Weill Cornell Medical College had praise for the study, saying it “raises consciousness about the pitfalls specific to a nested design and then counsels you as to how to create a good nested design given limited resources.”
Emery N. Brown, a professor of computational neuroscience in the department of brain and cognitive sciences at the MIT-Harvard Division of Health Sciences and Technology, points to a dire need to bolster the level of statistical sophistication brought to bear in neuroscience studies. “There’s a fundamental flaw in the system and the fundamental flaw is basically that neuroscientists don’t know enough statistics to do the right things and there’s not enough statisticians working in neuroscience to help that.”
The issue of reproducibility of research results has preoccupied the editors of many top journals in recent years. The Nature journals have instituted a checklist to help authors on reporting on the methods used in their research, a list that inquires about whether the statistical objectives for a particular study were met. (Scientific American is part of the Nature Publishing Group.) The one clear message from studies like that of van der Sluis and others is that the statistician will take on an increasingly pivotal role as the field moves ahead in deciphering ever more dense networks of neural signaling.
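The clustering problem described in the quoted article can be illustrated with a small simulation. This is a hedged sketch, not the Nature Neuroscience authors' method: it assumes two groups with no true difference, five "neurons" (clusters) per group, twenty readings per neuron, and equal between-neuron and within-neuron variance. As a simple stand-in for multilevel analysis it compares a naive test on all readings with a test on one mean per cluster. All names and parameter values are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def t_statistic(a, b):
    """Two-sample t statistic using each sample's own variance."""
    mean_a, mean_b = mean(a), mean(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def simulate_false_positive_rates(n_sims=1000, clusters_per_group=5,
                                  obs_per_cluster=20, cluster_sd=1.0,
                                  noise_sd=1.0, cluster_critical=2.306,
                                  seed=0):
    """Return (naive_rate, cluster_mean_rate) when there is NO true effect.

    cluster_critical is the two-sided 5% t value for 8 degrees of freedom,
    which matches the default of 5 clusters per group.
    """
    rng = random.Random(seed)
    naive_hits = 0
    cluster_hits = 0
    for _ in range(n_sims):
        groups = []
        for _group in range(2):
            clusters = []
            for _cluster in range(clusters_per_group):
                centre = rng.gauss(0.0, cluster_sd)  # shared by one cluster
                clusters.append([centre + rng.gauss(0.0, noise_sd)
                                 for _ in range(obs_per_cluster)])
            groups.append(clusters)
        # Naive analysis: pretend all readings are independent.
        flat = [[x for cluster in g for x in cluster] for g in groups]
        if abs(t_statistic(flat[0], flat[1])) > 1.96:
            naive_hits += 1
        # Cluster-aware analysis: one summary value per cluster.
        means = [[mean(cluster) for cluster in g] for g in groups]
        if abs(t_statistic(means[0], means[1])) > cluster_critical:
            cluster_hits += 1
    return naive_hits / n_sims, cluster_hits / n_sims


if __name__ == "__main__":
    naive, clustered = simulate_false_positive_rates()
    print("naive false-positive rate:", naive)
    print("cluster-mean false-positive rate:", clustered)
```

Under these assumptions the naive test rejects the true null far more often than the nominal 5 percent, while the cluster-mean test stays close to it. Aggregating to cluster means discards information, which is consistent with the article's remark that such alternatives are less sensitive than multilevel analysis.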
Jensen Comment
Accountics science differs from neuroscience in that reproducibility of research
results does not preoccupy research journal editors ---
http://faculty.trinity.edu/rjensen/TheoryTAR.htm
Obsession With R-Squared
"Good Old R-Squared," by David Giles, Econometrics Beat: Dave
Giles’ Blog, University of Victoria, June 24, 2013 ---
http://davegiles.blogspot.com/2013/05/good-old-r-squared.html
My students are often horrified when I tell them, truthfully, that one of the last pieces of information that I look at when evaluating the results of an OLS regression, is the coefficient of determination (R²), or its "adjusted" counterpart. Fortunately, it doesn't take long to change their perspective!
After all, we all know that with time-series data, it's really easy to get a "high" R² value, because of the trend components in the data. With cross-section data, really low R² values are really common. For most of us, the signs, magnitudes, and significance of the estimated parameters are of primary interest. Then we worry about testing the assumptions underlying our analysis. R² is at the bottom of the list of priorities.
Continued in article
Also see http://davegiles.blogspot.com/2013/07/the-adjusted-r-squared-again.html
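The remark about trending time-series data can be illustrated with a short simulation. This is a sketch under assumptions that are not in the blog post: it uses independent random walks as the "trending" series, 200 observations per series, and 500 repetitions. By construction the two series in each regression are unrelated, so any R² they produce is spurious.

```python
import random


def r_squared(x, y):
    """R-squared of a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def random_walk(rng, n_obs):
    """Cumulative sum of independent shocks: a stochastic trend."""
    level = 0.0
    path = []
    for _ in range(n_obs):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(rng, n_obs):
    """Independent shocks with no trend component."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_obs)]


def average_r_squared(make_series, n_obs=200, n_sims=500, seed=1):
    """Average R-squared from regressing one unrelated series on another."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        total += r_squared(make_series(rng, n_obs), make_series(rng, n_obs))
    return total / n_sims


if __name__ == "__main__":
    print("unrelated random walks:", round(average_r_squared(random_walk), 3))
    print("unrelated white noise: ", round(average_r_squared(white_noise), 3))
```

Unrelated trending series produce a much higher average R² than unrelated trendless series, which is one reason a "high" R² from time-series data says little on its own.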
Drawing Inferences From Very Large Data-Sets
David Johnstone wrote the following:
Indeed if you hold H₀ the same and keep changing the model, you will eventually (generally soon) get a significant result, allowing “rejection of H₀ at 5%”, not because H₀ is necessarily false but because you have built upon a false model (of which there are zillions, obviously).
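A rough feel for how quickly this happens comes from the probability of at least one rejection at the 5% level across several tries when H₀ is true. The sketch below treats the tries as independent, which is an assumption made only for illustration: different models fitted to the same data are correlated, so the numbers are indicative rather than exact.

```python
def chance_of_false_rejection(n_models, alpha=0.05):
    """P(at least one rejection of a true H0) over independent tries."""
    return 1.0 - (1.0 - alpha) ** n_models


if __name__ == "__main__":
    for n_models in (1, 5, 14, 50):
        print(n_models, round(chance_of_false_rejection(n_models), 3))
```

Under this simplification, fourteen tries are already enough to make a spurious "significant" result more likely than not.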
"Drawing Inferences From Very Large Data-Sets," by David Giles, Econometrics
Beat: Dave Giles’ Blog, University of Victoria, April 26, 2011 ---
http://davegiles.blogspot.ca/2011/04/drawing-inferences-from-very-large-data.html
. . .
Granger (1998; 2003) has reminded us that if the sample size is sufficiently large, then it's virtually impossible not to reject almost any hypothesis. So, if the sample is very large and the p-values associated with the estimated coefficients in a regression model are of the order of, say, 0.10 or even 0.05, then this is really bad news. Much, much, smaller p-values are needed before we get all excited about 'statistically significant' results when the sample size is in the thousands, or even bigger. So, the p-values reported above are mostly pretty marginal, as far as significance is concerned. When you work out the p-values for the other 6 models I mentioned, they range from 0.005 to 0.460. I've been generous in the models I selected.
Here's another set of results taken from a second, really nice, paper by Ciecieriski et al. (2011) in the same issue of Health Economics:
Continued in article
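Granger's point quoted above is easy to verify numerically. The sketch below assumes a one-sample z-test with known standard deviation and a hypothetical true effect of one hundredth of a standard deviation, a difference that would rarely matter in practice. The function name and the effect size are illustrative.

```python
import math


def two_sided_p_value(effect_in_sd, sample_size):
    """Two-sided p-value of a z-test when the sample mean equals the effect."""
    z = effect_in_sd * math.sqrt(sample_size)
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    tiny_effect = 0.01  # hypothetical: 1% of a standard deviation
    for n in (100, 10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,}  p = {two_sided_p_value(tiny_effect, n):.3g}")
```

The effect size never changes, yet the p-value moves from nowhere near significance at n = 100 to vanishingly small at n = 1,000,000. Statistical significance in very large samples therefore says little about whether a difference is large enough to matter.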
Jensen Comment
My research suggests that over 90% of the recent papers published in TAR use
purchased databases that provide enormous sample sizes in those papers. Their
accountics science authors keep reporting those meaningless levels of
statistical significance.
What is even worse is when meaningless statistical significance tests are used to support decisions.
"Statistical Significance - Again," by David Giles, Econometrics
Beat: Dave Giles’ Blog, University of Victoria, December 28, 2013 ---
http://davegiles.blogspot.com/2013/12/statistical-significance-again.html
Statistical Significance - Again
With all of this emphasis on "Big Data", I was pleased to see this post on the Big Data Econometrics blog, today.
When you have a sample that runs to the thousands (billions?), the conventional significance levels of 10%, 5%, 1% are completely inappropriate. You need to be thinking in terms of tiny significance levels.
I discussed this in some detail back in April of 2011, in a post titled, "Drawing Inferences From Very Large Data-Sets". If you're one of those (many) applied researchers who uses large cross-sections of data, and then sprinkles the results tables with asterisks to signal "significance" at the 5%, 10% levels, etc., then I urge you to read that earlier post.
It's sad to encounter so many papers and seminar presentations in which the results, in reality, are totally insignificant!
The Cult of Statistical Significance: How Standard Error Costs Us Jobs,
Justice, and Lives, by Stephen T. Ziliak and Deirdre N. McCloskey (Ann
Arbor: University of Michigan Press, ISBN-13: 978-472-05007-9, 2007)
http://www.cs.trinity.edu/~rjensen/temp/DeirdreMcCloskey/StatisticalSignificance01.htm
Page 206
Like scientists today in medical and economic and
other sizeless sciences, Pearson mistook a large sample size for the definite,
substantive significance---evidence, as Hayek put it, of "wholes." But it was as
Hayek said "just an illusion." Pearson's columns of sparkling asterisks, though
quantitative in appearance and as appealing as is the simple truth of the sky,
signified nothing.
pp. 250-251
The textbooks are wrong. The teaching is wrong. The
seminar you just attended is wrong. The most prestigious journal in your
scientific field is wrong.
You are searching, we know, for ways to avoid being wrong. Science, as Jeffreys said, is mainly a series of approximations to discovering the sources of error. Science is a systematic way of reducing wrongs or can be. Perhaps you feel frustrated by the random epistemology of the mainstream and don't know what to do. Perhaps you've been sedated by significance and lulled into silence. Perhaps you sense that the power of a Rothamsted test against a plausible Dublin alternative is statistically speaking low but you feel oppressed by the instrumental variable one should dare not to wield. Perhaps you feel frazzled by what Morris Altman (2004) called the "social psychology rhetoric of fear," the deeply embedded path dependency that keeps the abuse of significance in circulation. You want to come out of it. But perhaps you are cowed by the prestige of Fisherian dogma. Or, worse thought, perhaps you are cynically willing to be corrupted if it will keep a nice job
Bob Jensen's threads on the way analysts, particularly accountics
scientists, often cheer for the statistical significance of large-sample
outcomes that are substantively insignificant, such as R^{2}
values of .0001 ---
The Cult of Statistical Significance: How Standard Error Costs Us Jobs, Justice,
and Lives ---
http://www.cs.trinity.edu/~rjensen/temp/DeirdreMcCloskey/StatisticalSignificance01.htm
Significance Testing: We Can Do Better
Abacus, June 13, 2016
http://onlinelibrary.wiley.com/doi/10.1111/abac.12078/full
This is not a free article
Author
Thomas R. Dyckman, Professor Emeritus, Cornell University
Abstract
This paper advocates abandoning null hypothesis statistical tests (NHST) in favor of reporting confidence intervals. The case against NHST, which has been made repeatedly in multiple disciplines and is growing in awareness and acceptance, is introduced and discussed. Accounting as an empirical research discipline appears to be the last of research communities to face up to the inherent problems of significance test use and abuse. The paper encourages adoption of a meta-analysis approach which allows for the inclusion of replication studies in the assessment of evidence. This approach requires abandoning the typical NHST process and its reliance on p-values. However, given that NHST has deep roots and wide “social acceptance” in the empirical testing community, modifications to NHST are suggested so as to partly counter the weakness of this statistical testing method.
Extended Quotation
. . .
2. Why The Frequentist Approach (NHSTs) Should be Abandoned in Favor of a Bayesian Approach
Frequentist Approach:
The frequentist NHST relies on rejecting a null hypothesis of no effect or relationship based on the probability, or “p-level”, of observing a specific sample result X equal to or more extreme than the actual observation X₀, conditional on the null hypothesis H₀ being true. In symbols, this calculation yields a p-level = Pr(X≥X₀|H₀), where ≥ signifies “as or more discrepant with H₀ than X₀”. The origin of the approach is generally credited to Karl Pearson (1900), who introduced it in his χ²-test (Pearson actually called it the P, χ²-test). However, it was Sir Ronald Fisher who is credited with naming and popularizing statistical significance testing and p-values as promulgated in the many editions of his classic books Statistical Methods for Research Workers and The Design of Experiments. See Spielman (1974), Seidenfeld (1979), Johnstone et al. (1986), Barnett (1999), Berger (2003) and Howson and Urbach (2006) on the ideas and development of modern hypothesis tests (NHST).
The Bayesian Approach:
Probabilities, under the Bayesian approach, rely on informed beliefs rather than physical quantities. They represent informed reasoned guesses. In the Bayesian approach, the objective is the posterior (post sample) belief concerning where a parameter, β in our case, is possibly located. Bayes’ theorem allows us to use the sample data to update our prior beliefs about the value of the parameter of interest. The revised (posterior) distribution represents the new belief based on the prior and the statistical method (the model) applied, and calculated using Bayes’ theorem. Prior beliefs play an important role in the Bayesian process. In fact, no data can be interpreted without prior beliefs (“data cannot speak for themselves”).
Bayesians emphasize the unavoidably subjective nature of the research process. The decision to select a model and specific prior or family of priors is necessarily subjective, and the sample data are seldom obtained objectively (Basturk et al., 2014). Indeed, data quality has become a major problem with the advent of “big data” and with the recognition that the rewards for publication tend to induce gamesmanship and even fraud in the data selected for the study.
When the investigator experiences difficulty and uncertainty in specifying a specific prior distribution, a diffuse or “uninformative” prior is typically adopted. The idea is to impose no strong prior belief on the analysis and hence allow the data to have a bigger part in the final conclusions. Ultimately, enough data will “swamp” any prior distribution, but in reality, where systems are not stationary and no model is known to be “true”, there is always subjectivity and room for revision in Bayesian posterior beliefs.
The Bayesian viewpoint is that this is a fact of research life and needs to be faced and treated formally in the analysis. Objectivity is not possible, so there is no gain from pretending that it is. Formal Bayesian methods for coping with subjectivity are easy to understand. For example, one approach is to ask how robust the posterior distribution of belief about β is to different possible prior distributions. If we can say that we come to essentially the same qualitative belief over all feasible models and prior distributions, or across the different priors that different people hold, then that is perhaps the most objective that a statistical conclusion can claim.
Continued in article
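The prior-robustness check described in the excerpt above can be sketched with a conjugate normal update. This is only an illustration under stated assumptions: the sample estimate of β, its standard error, and the three priors are all hypothetical numbers chosen for the example, and none of them comes from the Dyckman paper.

```python
def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Conjugate normal update: combine a normal prior with a normal likelihood."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_mean * prior_precision + estimate * data_precision) / post_precision
    return post_mean, post_precision ** -0.5


# Hypothetical sample result: estimated beta of 0.30 with standard error 0.05.
ESTIMATE, STD_ERROR = 0.30, 0.05

# Hypothetical priors (mean, sd) held by three different readers.
PRIORS = {
    "skeptical (centered on 0)": (0.0, 0.10),
    "diffuse": (0.0, 10.0),
    "optimistic": (0.5, 0.20),
}

posteriors = {
    name: normal_posterior(mean, sd, ESTIMATE, STD_ERROR)
    for name, (mean, sd) in PRIORS.items()
}

for name, (post_mean, post_sd) in posteriors.items():
    print(f"{name:28s} posterior mean {post_mean:.3f}, sd {post_sd:.3f}")
```

With these made-up inputs all three readers end up with a positive posterior mean for β, which is the kind of agreement across priors that the excerpt treats as the most objective conclusion available. A different data set could just as easily show the posterior swinging with the prior.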
Academic psychology and medical testing are both dogged by unreliability.
The reason is clear: we got probability wrong ---
https://aeon.co/essays/it-s-time-for-science-to-abandon-the-term-statistically-significant?utm_source=Aeon+Newsletter&utm_campaign=b8fc3425d2-Weekly_Newsletter_14_October_201610_14_2016&utm_medium=email&utm_term=0_411a82e59d-b8fc3425d2-68951505
. . .
For one, it’s of little use to say that your observations would be rare if there were no real difference between the pills (which is what the p-value tells you), unless you can say whether or not the observations would also be rare when there is a true difference between the pills. Which brings us back to induction.
The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since.
Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right before any observations have been made (the ‘prior probability’). Bayes’s theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’.
These intangible probabilities persuaded Fisher that Bayes’s approach wasn’t feasible. Instead, he proposed the wholly deductive process of null hypothesis significance testing. The realisation that this method, as it is commonly used, gives alarmingly large numbers of false positive results has spurred several recent attempts to bridge the gap.
There is one uncontroversial application of Bayes’s theorem: diagnostic screening, the tests that doctors give healthy people to detect warning signs of disease. They’re a good way to understand the perils of the deductive approach.
In theory, picking up on the early signs of illness is obviously good. But in practice there are usually so many false positive diagnoses that it just doesn’t work very well. Take dementia. Roughly 1 per cent of the population suffer from mild cognitive impairment, which might, but doesn’t always, lead to dementia. Suppose that the test is quite a good one, in the sense that 95 per cent of the time it gives the right (negative) answer for people who are free of the condition. That means that 5 per cent of the people who don’t have cognitive impairment will test, falsely, as positive. That doesn’t sound bad. It’s directly analogous to tests of significance which will give 5 per cent of false positives when there is no real effect, if we use a p-value of less than 5 per cent to mean ‘statistically significant’.
But in fact the screening test is not good – it’s actually appallingly bad, because 86 per cent, not 5 per cent, of all positive tests are false positives. So only 14 per cent of positive tests are correct. This happens because most people don’t have the condition, and so the false positives from these people (5 per cent of 99 per cent of the people), outweigh the number of true positives that arise from the much smaller number of people who have the condition (80 per cent of 1 per cent of the people, if we assume 80 per cent of people with the disease are detected successfully). There’s a YouTube video of my attempt to explain this principle, or you can read my recent paper on the subject.
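The screening arithmetic in the paragraph above can be checked directly. The sketch below uses only the three figures quoted in the article (1 per cent prevalence, 95 per cent specificity, 80 per cent sensitivity); the function and variable names are illustrative.

```python
def false_positive_share(prevalence, sensitivity, specificity):
    """Fraction of all positive test results that come from people without the condition."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return false_positives / (true_positives + false_positives)


share = false_positive_share(prevalence=0.01, sensitivity=0.80, specificity=0.95)
print(f"{share:.0%} of positive tests are false positives")
```

The false positives (5 per cent of 99 per cent) are about six times as numerous as the true positives (80 per cent of 1 per cent), which is where the 86 per cent comes from.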
Notice, though, that it’s possible to calculate the disastrous false-positive rate for screening tests only because we have estimates for the prevalence of the condition in the whole population being tested. This is the prior probability that we need to use Bayes’s theorem. If we return to the problem of tests of significance, it’s not so easy. The analogue of the prevalence of disease in the population becomes, in the case of significance tests, the probability that there is a real difference between the pills before the experiment is done – the prior probability that there’s a real effect. And it’s usually impossible to make a good guess at the value of this figure.
An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
In general, though, we don’t know the real prevalence of true effects. So, although we can calculate the p-value, we can’t calculate the number of false positives. But what we can do is give a minimum value for the false positive rate. To do this, we need only assume that it’s not legitimate to say, before the observations are made, that the odds that an effect is real are any higher than 50:50. To do so would be to assume you’re more likely than not to be right before the experiment even begins.
If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent. Any lower prevalence will result in an even higher false positive rate.
The upshot is that, if a scientist observes a ‘just significant’ result in a single test, say P = 0.047, and declares that she’s made a discovery, that claim will be wrong at least 26 per cent of the time, and probably more. No wonder then that there are problems with reproducibility in areas of science that rely on tests of significance.
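The article does not show how the 76 per cent and 26 per cent figures were computed, so the following is not the author's method. It is a consistency check under one assumption: that the evidence in a "just significant" result can be summarized by a single likelihood ratio in favor of a real effect. Backing that ratio out of the quoted 26 per cent figure and applying Bayes' theorem in odds form to a 10 per cent prior reproduces the quoted 76 per cent.

```python
def false_positive_risk(prior_prob_real, likelihood_ratio):
    """P(no real effect | observed result), from Bayes' theorem in odds form."""
    prior_odds = prior_prob_real / (1.0 - prior_prob_real)
    posterior_odds = prior_odds * likelihood_ratio
    return 1.0 / (1.0 + posterior_odds)


# Likelihood ratio implied by the quoted 26% figure at a 50:50 prior (an inferred value).
implied_lr = (1.0 - 0.26) / 0.26

risk_at_50_percent = false_positive_risk(0.50, implied_lr)
risk_at_10_percent = false_positive_risk(0.10, implied_lr)

print(f"implied likelihood ratio: {implied_lr:.2f}")
print(f"false positive risk, 50% prior: {risk_at_50_percent:.0%}")
print(f"false positive risk, 10% prior: {risk_at_10_percent:.0%}")
```

The implied ratio is under 3 to 1, which is weak evidence by any reading. The same observed result moves from roughly one-in-four wrong to roughly three-in-four wrong purely because the prior changed.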
Continued in article
Jensen Comment
Especially note the many replies to this article. . .
David Colquhoun
https://aeon.co/conversations/what-should-be-done-to-improve-statistical-literacy#
I think that it’s quite hard to find a really good practical guide to Bayesian analysis. By really good, I mean one that is critical about priors and explains exactly what assumptions are being made. I fear that one reason for this is that Bayesians often seem to have an evangelical tendency that leads to them brushing the assumptions under the carpet. I agree that Alexander Etz is a good place to start, but I do wonder how much it will help when you’re faced with a particular set of observations to analyze.
Henning Strandin ---
https://aeon.co/users/henning-strandin
Thank you for a good and useful article on the pitfalls of ignoring the baseline. I have a couple of comments.
Bayes didn’t resolve the problem of induction, even in principle. The problem of induction is the problem of knowing that the observations you have made are relevant to some set of (perhaps as-yet) unobserved events. In his Essay on Probabilities, Laplace illustrated the problem in the same paragraph in which he suggests . . .
Karl Young
Nice article; as a Bayesian who was forced to quote p values in a couple of medical physics papers for which the journal would have nothing else, I appreciate the points made here. But even as a Bayesian one has to acknowledge that there are a number of open problems besides just how to estimate priors. E.g. what one really wants to know is given some observations, how one’s hypothesis fares against as complete a list of alternative hypotheses as can be mustered. Even assuming that one could come up with such a list, calculating the probability that one’s hypothesis best fits the observations in that case requires calculation of a quantity called the evidence that is generally extremely difficult (the reason that the diagnostic examples mentioned in the piece lead to reasonable calculations is that calculating the evidence for the set of proposed hypotheses, that either someone in the population has a disease or doesn’t, is straightforward). So while I think Bayes is the philosophically most coherent approach to analyzing data (doesn’t solve the problem of induction but tries to at least manage it) there are still a number of issues preventing it . . .
Comments Continued at
https://aeon.co/conversations/what-should-be-done-to-improve-statistical-literacy
The Insignificance of Testing the Null
"Statistics: reasoning on uncertainty, and
the insignificance of testing null," by Esa Läärä
Ann. Zool. Fennici 46: 138–157
ISSN 0003-455X (print), ISSN 1797-2450 (online)
Helsinki 30 April 2009 © Finnish Zoological and Botanical Publishing Board 200
http://www.sekj.org/PDF/anz46-free/anz46-138.pdf
The practice of statistical analysis and inference in ecology is critically reviewed. The dominant doctrine of null hypothesis significance testing (NHST) continues to be applied ritualistically and mindlessly. This dogma is based on superficial understanding of elementary notions of frequentist statistics in the 1930s, and is widely disseminated by influential textbooks targeted at biologists. It is characterized by silly null hypotheses and mechanical dichotomous division of results being “significant” (P < 0.05) or not. Simple examples are given to demonstrate how distant the prevalent NHST malpractice is from the current mainstream practice of professional statisticians. Masses of trivial and meaningless “results” are being reported, which are not providing adequate quantitative information of scientific interest. The NHST dogma also retards progress in the understanding of ecological systems and the effects of management programmes, which may at worst contribute to damaging decisions in conservation biology. In the beginning of this millennium, critical discussion and debate on the problems and shortcomings of NHST has intensified in ecological journals. Alternative approaches, like basic point and interval estimation of effect sizes, likelihood-based and information theoretic methods, and the Bayesian inferential paradigm, have started to receive attention. Much is still to be done in efforts to improve statistical thinking and reasoning of ecologists and in training them to utilize appropriately the expanded statistical toolbox. Ecologists should finally abandon the false doctrines and textbooks of their previous statistical gurus. Instead they should more carefully learn what leading statisticians write and say, collaborate with statisticians in teaching, research, and editorial work in journals.
Jensen Comment
And to think Alpha (Type 1) error is the easy part. Does anybody ever test for
the more important Beta (Type 2) error? I think some engineers test for Type 2
error with Operating Characteristic (OC) curves, but these are generally applied
where controlled experiments are super controlled such as in quality control
testing.
Jensen Comment
Beta Error ---
http://en.wikipedia.org/wiki/Beta_error#Type_II_error
I've never seen an accountics science study published anywhere that tested for Beta Error.
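For readers who have not seen a Type 2 error calculation, here is a minimal sketch of an operating characteristic curve. It assumes the simplest textbook case, a one-sided z-test of a zero mean with known standard deviation; the sample size and effect sizes are hypothetical, and a real accountics study would need a calculation matched to its own test statistic.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(effect, sigma, n, alpha=0.05):
    """Beta for a one-sided z-test of H0: mean = 0 when the true mean is `effect`."""
    critical_z = STD_NORMAL.inv_cdf(1.0 - alpha)
    shift = effect * n ** 0.5 / sigma  # true mean measured in standard errors
    return STD_NORMAL.cdf(critical_z - shift)


# Operating characteristic curve: beta as a function of the true effect size.
oc_curve = {
    effect: type_ii_error(effect, sigma=1.0, n=25)
    for effect in (0.0, 0.1, 0.25, 0.5, 1.0)
}

for effect, beta in oc_curve.items():
    print(f"true effect {effect:4.2f}: beta = {beta:.3f}, power = {1.0 - beta:.3f}")
```

Beta cannot be stated without naming a specific alternative, which is one reason it is reported so much less often than alpha.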
Scientific Irreproducibility (Frequentists Versus
Bayesians)
"Weak statistical standards implicated in scientific irreproducibility:
One-quarter of studies that meet commonly used statistical cutoff may be false."
by Erika Check Hayden, Nature, November 11, 2013 ---
http://www.nature.com/news/weak-statistical-standards-implicated-in-scientific-irreproducibility-1.14131
The plague of non-reproducibility in science may be mostly due to scientists’ use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University in College Station.
Johnson compared the strength of two types of tests: frequentist tests, which measure how unlikely a finding is to occur by chance, and Bayesian tests, which measure the likelihood that a particular hypothesis is correct given data collected in the study. The strength of the results given by these two types of tests had not been compared before, because they ask slightly different types of questions.
So Johnson developed a method that makes the results given by the tests — the P value in the frequentist paradigm, and the Bayes factor in the Bayesian paradigm — directly comparable. Unlike frequentist tests, which use objective calculations to reject a null hypothesis, Bayesian tests require the tester to define an alternative hypothesis to be tested — a subjective process. But Johnson developed a 'uniformly most powerful' Bayesian test that defines the alternative hypothesis in a standard way, so that it “maximizes the probability that the Bayes factor in favor of the alternate hypothesis exceeds a specified threshold,” he writes in his paper. This threshold can be chosen so that Bayesian tests and frequentist tests will both reject the null hypothesis for the same test results.
Johnson then used these uniformly most powerful tests to compare P values to Bayes factors. When he did so, he found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in fields such as social science, in which non-reproducibility has become a serious issue — corresponds to Bayes factors of between 3 and 5, which are considered weak evidence to support a finding.
False positives
Indeed, as many as 17–25% of such findings are probably false, Johnson calculates^{1}. He advocates for scientists to use more stringent P values of 0.005 or less to support their findings, and thinks that the use of the 0.05 standard might account for most of the problem of non-reproducibility in science — even more than other issues, such as biases and scientific misconduct.
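The link between Bayes factors of 3 to 5 and the 17–25% figure can be illustrated with a short calculation. The sketch assumes equal prior odds for the null and alternative hypotheses; that assumption is mine for the example and may not match every detail of Johnson's own calculation.

```python
def prob_null_given_data(bayes_factor, prior_prob_null=0.5):
    """Posterior probability of the null, given a Bayes factor favoring the alternative."""
    prior_odds_alt = (1.0 - prior_prob_null) / prior_prob_null
    posterior_odds_alt = bayes_factor * prior_odds_alt
    return 1.0 / (1.0 + posterior_odds_alt)


prob_null_bf3 = prob_null_given_data(3.0)
prob_null_bf5 = prob_null_given_data(5.0)

print(f"Bayes factor 3: P(null | data) = {prob_null_bf3:.0%}")
print(f"Bayes factor 5: P(null | data) = {prob_null_bf5:.0%}")
```

Under that assumption a Bayes factor of 5 leaves about a one-in-six chance that the null is true, and a Bayes factor of 3 leaves a one-in-four chance.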
“Very few studies that fail to replicate are based on P values of 0.005 or smaller,” Johnson says.
Some other mathematicians said that though there have been many calls for researchers to use more stringent tests^{2}, the new paper makes an important contribution by laying bare exactly how lax the 0.05 standard is.
“It shows once more that standards of evidence that are in common use throughout the empirical sciences are dangerously lenient,” says mathematical psychologist Eric-Jan Wagenmakers of the University of Amsterdam. “Previous arguments centered on ‘P-hacking’, that is, abusing standard statistical procedures to obtain the desired results. The Johnson paper shows that there is something wrong with the P value itself.”
Other researchers, though, said it would be difficult to change the mindset of scientists who have become wedded to the 0.05 cutoff. One implication of the work, for instance, is that studies will have to include more subjects to reach these more stringent cutoffs, which will require more time and money.
“The family of Bayesian methods has been well developed over many decades now, but somehow we are stuck to using frequentist approaches,” says physician John Ioannidis of Stanford University in California, who studies the causes of non-reproducibility. “I hope this paper has better luck in changing the world.”
Accountics Scientists are More Interested in
Their Tractors Than Their Harvests ---
http://faculty.trinity.edu/rjensen/TheoryTAR.htm
Can You Really Test for Multicollinearity?
Unlike real scientists, accountics scientists seldom replicate published
accountics science research by the exacting standards of real science ---
http://faculty.trinity.edu/rjensen/TheoryTAR.htm#Replication
Multicollinearity ---
http://en.wikipedia.org/wiki/
"Can You Actually TEST for Multicollinearity?" by David Giles, Econometrics
Beat: Dave Giles’ Blog, University of Victoria, June 24, 2013 ---
http://davegiles.blogspot.com/2013/06/can-you-actually-test-for.html
. . .
Now, let's return to the "problem" of multicollinearity.
What do we mean by this term, anyway? This turns out to be the key question!
Multicollinearity is a phenomenon associated with our particular sample of data when we're trying to estimate a regression model. Essentially, it's a situation where there is insufficient information in the sample of data to enable us to draw "reliable" inferences about the individual parameters of the underlying (population) model.
I'll be elaborating more on the "informational content" aspect of this phenomenon in a follow-up post. Yes, there are various sample measures that we can compute and report, to help us gauge how severe this data "problem" may be. But they're not statistical tests, in any sense of the word.
Because multicollinearity is a characteristic of the sample, and not a characteristic of the population, you should immediately be suspicious when someone starts talking about "testing for multicollinearity". Right?
Apparently not everyone gets it!
There's an old paper by Farrar and Glauber (1967) which, on the face of it, might seem to take a different stance. In fact, if you were around when this paper was published (or if you've bothered to actually read it carefully), you'll know that this paper makes two contributions. First, it provides a very sensible discussion of what multicollinearity is all about. Second, the authors take some well known results from the statistics literature (notably, by Wishart, 1928; Wilks, 1932; and Bartlett, 1950) and use them to give "tests" of the hypothesis that the regressor matrix, X, is orthogonal.
How can this be? Well, there's a simple explanation if you read the Farrar and Glauber paper carefully, and note what assumptions are made when they "borrow" the old statistics results. Specifically, there's an explicit (and necessary) assumption that in the population the X matrix is random, and that it follows a multivariate normal distribution.
This assumption is, of course, totally at odds with what is usually assumed in the linear regression model! The "tests" that Farrar and Glauber gave us aren't really tests of multicollinearity in the sample. Unfortunately, this point wasn't fully appreciated by everyone.
There are some sound suggestions in this paper, including looking at the sample multiple correlations between each regressor, and all of the other regressors. These, and other sample measures such as variance inflation factors, are useful from a diagnostic viewpoint, but they don't constitute tests of "zero multicollinearity".
So, why am I even mentioning the Farrar and Glauber paper now?
Well, I was intrigued to come across some STATA code (Shehata, 2012) that allows one to implement the Farrar and Glauber "tests". I'm not sure that this is really very helpful. Indeed, this seems to me to be a great example of applying someone's results without understanding (bothering to read?) the assumptions on which they're based!
Be careful out there - and be highly suspicious of strangers bearing gifts!
References
Bartlett, M. S., 1950. Tests of significance in factor analysis. British Journal of Psychology, Statistical Section, 3, 77-85.
Farrar, D. E. and R. R. Glauber, 1967. Multicollinearity in regression analysis: The problem revisited. Review of Economics and Statistics, 49, 92-107.
Shehata, E. A. E., 2012. FGTEST: Stata module to compute Farrar-Glauber Multicollinearity Chi2, F, t tests.
Wilks, S. S., 1932. Certain generalizations in the analysis of variance. Biometrika, 24, 477-494.
Wishart, J., 1928. The generalized product moment distribution in samples from a multivariate normal population. Biometrika, 20A, 32-52.
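The variance inflation factors that Giles mentions as sample diagnostics (not tests) can be computed as the diagonal of the inverse of the regressors' sample correlation matrix. The sketch below uses only the Python standard library. The three simulated regressors are hypothetical, with the second built to be nearly a copy of the first.

```python
import random
from statistics import mean


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length data columns."""
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([v - m for v in col])
    norms = [sum(v * v for v in col) ** 0.5 for col in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF_j is the j-th diagonal element of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical regressors: x2 is x1 plus a little noise, x3 is unrelated.
random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(500)]
x2 = [a + random.gauss(0, 0.1) for a in x1]
x3 = [random.gauss(0, 1) for _ in range(500)]

vifs = variance_inflation_factors([x1, x2, x3])
for name, vif in zip(("x1", "x2", "x3"), vifs):
    print(f"VIF({name}) = {vif:.1f}")
```

Large values for x1 and x2 describe this particular sample. In keeping with the post's argument, they are a descriptive gauge of how little independent information the sample carries, and no null hypothesis is being tested.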
It's relatively uncommon for accountics scientists to criticize each other's
published works. A notable exception is as follows:
"Selection Models in Accounting Research," by Clive S. Lennox, Jere R.
Francis, and Zitian Wang, The Accounting Review, March 2012, Vol. 87,
No. 2, pp. 589-616.
This study explains the challenges associated with the Heckman (1979) procedure to control for selection bias, assesses the quality of its application in accounting research, and offers guidance for better implementation of selection models. A survey of 75 recent accounting articles in leading journals reveals that many researchers implement the technique in a mechanical way with relatively little appreciation of important econometric issues and problems surrounding its use. Using empirical examples motivated by prior research, we illustrate that selection models are fragile and can yield quite literally any possible outcome in response to fairly minor changes in model specification. We conclude with guidance on how researchers can better implement selection models that will provide more convincing evidence on potential selection bias, including the need to justify model specifications and careful sensitivity analyses with respect to robustness and multicollinearity.
. . .
CONCLUSIONS
Our review of the accounting literature indicates that some studies have implemented the selection model in a questionable manner. Accounting researchers often impose ad hoc exclusion restrictions or no exclusion restrictions whatsoever. Using empirical examples and a replication of a published study, we demonstrate that such practices can yield results that are too fragile to be considered reliable. In our empirical examples, a researcher could obtain quite literally any outcome by making relatively minor and apparently innocuous changes to the set of exclusionary variables, including choosing a null set. One set of exclusion restrictions would lead the researcher to conclude that selection bias is a significant problem, while an alternative set involving rather minor changes would give the opposite conclusion. Thus, claims about the existence and direction of selection bias can be sensitive to the researcher's set of exclusion restrictions.
Our examples also illustrate that the selection model is vulnerable to high levels of multicollinearity, which can exacerbate the bias that arises when a model is misspecified (Thursby 1988). Moreover, the potential for misspecification is high in the selection model because inferences about the existence and direction of selection bias depend entirely on the researcher's assumptions about the appropriate functional form and exclusion restrictions. In addition, high multicollinearity means that the statistical insignificance of the inverse Mills' ratio is not a reliable guide as to the absence of selection bias. Even when the inverse Mills' ratio is statistically insignificant, inferences from the selection model can be different from those obtained without the inverse Mills' ratio. In this situation, the selection model indicates that it is legitimate to omit the inverse Mills' ratio, and yet, omitting the inverse Mills' ratio gives different inferences for the treatment variable because multicollinearity is then much lower.
In short, researchers are faced with the following trade-off. On the one hand, selection models can be fragile and suffer from multicollinearity problems, which hinder their reliability. On the other hand, the selection model potentially provides more reliable inferences by controlling for endogeneity bias if the researcher can find good exclusion restrictions, and if the models are found to be robust to minor specification changes. The importance of these advantages and disadvantages depends on the specific empirical setting, so it would be inappropriate for us to make a general statement about when the selection model should be used. Instead, researchers need to critically appraise the quality of their exclusion restrictions and assess whether there are problems of fragility and multicollinearity in their specific empirical setting that might limit the effectiveness of selection models relative to OLS.
Another way to control for unobservable factors that are correlated with the endogenous regressor (D) is to use panel data. Though it may be true that many unobservable factors impact the choice of D, as long as those unobservable characteristics remain constant during the period of study, they can be controlled for using a fixed effects research design. In this case, panel data tests that control for unobserved differences between the treatment group (D = 1) and the control group (D = 0) will eliminate the potential bias caused by endogeneity as long as the unobserved source of the endogeneity is time-invariant (e.g., Baltagi 1995; Meyer 1995; Bertrand et al. 2004). The advantages of such a difference-in-differences research design are well recognized by accounting researchers (e.g., Altamuro et al. 2005; Desai et al. 2006; Hail and Leuz 2009; Hanlon et al. 2008). As a caveat, however, we note that the time-invariance of unobservables is a strong assumption that cannot be empirically validated. Moreover, the standard errors in such panel data tests need to be corrected for serial correlation because otherwise there is a danger of over-rejecting the null hypothesis that D has no effect on Y (Bertrand et al. 2004).
Finally, we note that there is a recent trend in the accounting literature to use samples that are matched based on their propensity scores (e.g., Armstrong et al. 2010; Lawrence et al. 2011). An advantage of propensity score matching (PSM) is that there is no MILLS variable and so the researcher is not required to find valid Z variables (Heckman et al. 1997; Heckman and Navarro-Lozano 2004). However, such matching has two important limitations. First, selection is assumed to occur only on observable characteristics. That is, the error term in the first stage model is correlated with the independent variables in the second stage (i.e., u is correlated with X and/or Z), but there is no selection on unobservables (i.e., u and υ are uncorrelated). In contrast, the purpose of the selection model is to control for endogeneity that arises from unobservables (i.e., the correlation between u and υ). Therefore, propensity score matching should not be viewed as a replacement for the selection model (Tucker 2010).
A second limitation arises if the treatment variable affects the company's matching attributes. For example, suppose that a company's choice of auditor affects its subsequent ability to raise external capital. This would mean that companies with higher quality auditors would grow faster. Suppose also that the company's characteristics at the time the auditor is first chosen cannot be observed. Instead, we match at some stacked calendar time where some companies have been using the same auditor for 20 years and others for not very long. Then, if we matched on company size, we would be throwing out the companies that have become large because they have benefited from high-quality audits. Such companies do not look like suitable “matches,” insofar as they are much larger than the companies in the control group that have low-quality auditors. In this situation, propensity matching could bias toward a non-result because the treatment variable (auditor choice) affects the company's matching attributes (e.g., its size). It is beyond the scope of this study to provide a more thorough assessment of the advantages and disadvantages of propensity score matching in accounting applications, so we leave this important issue to future research.
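The excerpt above says the selection model is vulnerable to multicollinearity but does not spell out why. One commonly given explanation, offered here as background and not as something stated in the excerpt, is that the inverse Mills' ratio is close to a linear function of the first-stage index over the range most samples cover. Without exclusion restrictions it is then nearly collinear with the second-stage regressors. The sketch checks that near-linearity numerically over a hypothetical index range of -1 to 1.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def inverse_mills_ratio(z):
    """Standard normal density divided by the standard normal CDF."""
    return STD_NORMAL.pdf(z) / STD_NORMAL.cdf(z)


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical first-stage index values on an even grid from -1 to 1.
index_values = [i / 100.0 for i in range(-100, 101)]
mills_values = [inverse_mills_ratio(z) for z in index_values]

r = pearson(index_values, mills_values)
print(f"correlation between index and inverse Mills' ratio: {r:.3f}")
```

A correlation this close to -1 means the ratio adds little information beyond the variables that generate the index, which fits the fragility the authors report when exclusion restrictions are weak or absent.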
A second indicator is our journals.
They have proliferated in number. But we struggle with an intertemporal
sameness, with incremental as opposed to discontinuous attempts to move our
thinking forward, and with referee intrusion and voyeurism. Value relevance is a
currently fashionable approach to identifying statistical regularities in the
financial market arena, just as a focus on readily observable components of
compensation is a currently fashionable dependent variable in the compensation
arena. Yet we know measurement error abounds, that other sources of information
are both present and hardly unimportant, that compensation is broad-based
and intertemporally managed, and that compensating wage differentials are part
of the stew. Yet we continue on the comfortable path of sameness.
Joel Demski, AAA President's Message, Accounting Education News, Fall 2001
http://aaahq.org/pubs/AEN/2001/Fall2001.pdf
Models That Aren't Robust
Robust Statistics --- http://en.wikipedia.org/wiki/Robust_statistics
Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normally distributed. Robust statistical methods have been developed for many common problems, such as estimating location, scale and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from parametric distributions. For example, robust methods work well for mixtures of two normal distributions with different standard-deviations, for example, one and three; under this model, non-robust methods like a t-test work badly.
Continued in article
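The mixture example in the excerpt above (two normal distributions with standard deviations of one and three) can be simulated in a few lines. This is only a sketch under stated assumptions: the 10% contamination rate, the sample size, and the function names are hypothetical choices, and it compares two scale estimates rather than the t-test mentioned in the excerpt. The ordinary standard deviation is pulled upward by the wide component, while the rescaled median absolute deviation (MAD) stays much closer to the scale of the main component.

```python
import random
import statistics

def contaminated_sample(n, rng, p_wide=0.1):
    """Draw n values from a mixture: N(0, 1) mostly, N(0, 3) with probability p_wide."""
    return [rng.gauss(0.0, 3.0 if rng.random() < p_wide else 1.0) for _ in range(n)]

def mad_scale(xs):
    """Median absolute deviation, rescaled by 1.4826 so it estimates sigma under normality."""
    med = statistics.median(xs)
    return 1.4826 * statistics.median(abs(x - med) for x in xs)

rng = random.Random(0)            # fixed seed so the run is repeatable
xs = contaminated_sample(20000, rng)

print("sample standard deviation:", round(statistics.stdev(xs), 3))
print("rescaled MAD:             ", round(mad_scale(xs), 3))
```

The main component here has a standard deviation of one, so the closer of the two printed numbers to 1.0 is the one less affected by the contaminating observations.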
Game Theory Model Solutions Are Rarely Robust
Nash Equilibrium --- http://en.wikipedia.org/wiki/Nash_equilibrium
Question
Why do game theory model solutions like Nash Equilibrium fail so often in the
real world?
"They Finally Tested The 'Prisoner's Dilemma' On Actual Prisoners — And
The Results Were Not What You Would Expect," by Max Nisen, Business
Insider, July 13, 2013 ---
http://www.businessinsider.com/prisoners-dilemma-in-real-life-2013-7
The "prisoner's dilemma" is a familiar concept to just about everyone who took Econ 101. The basic version goes like this: Two criminals are arrested, but police can't convict either on the primary charge, so they plan to sentence them to a year in jail on a lesser charge. Each of the prisoners, who can't communicate with each other, are given the option of testifying against their partner. If they testify, and their partner remains silent, the partner gets three years and they go free. If they both testify, both get two. If both remain silent, they each get one.
In game theory, betraying your partner, or "defecting" is always the dominant strategy as it always has a slightly higher payoff in a simultaneous game. It's what's known as a "Nash Equilibrium," after Nobel Prize winning mathematician and "A Beautiful Mind" subject John Nash.
In sequential games, where players know each other's previous behavior and have the opportunity to punish each other, defection is the dominant strategy as well.
However, on an overall basis, the best outcome for both players is mutual cooperation.
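The payoff structure described above can be checked mechanically. The sketch below is a minimal illustration and is not taken from the article: it encodes the jail terms from the basic version (free, one, two, or three years), then tests each pair of choices for a pure-strategy Nash equilibrium and looks for a dominant strategy. All function and variable names are hypothetical.

```python
from itertools import product

SILENT, TESTIFY = "silent", "testify"
ACTIONS = (SILENT, TESTIFY)

# Years in jail for (first prisoner, second prisoner), as in the basic version above.
YEARS = {
    (SILENT, SILENT): (1, 1),
    (SILENT, TESTIFY): (3, 0),
    (TESTIFY, SILENT): (0, 3),
    (TESTIFY, TESTIFY): (2, 2),
}

def is_nash(a, b):
    """True if neither prisoner can shorten their own sentence by changing only their own choice."""
    first_ok = all(YEARS[(a, b)][0] <= YEARS[(alt, b)][0] for alt in ACTIONS)
    second_ok = all(YEARS[(a, b)][1] <= YEARS[(a, alt)][1] for alt in ACTIONS)
    return first_ok and second_ok

def nash_equilibria():
    """All pure-strategy profiles that pass the mutual best-response check."""
    return [(a, b) for a, b in product(ACTIONS, ACTIONS) if is_nash(a, b)]

def dominant_actions():
    """Choices of the first prisoner that are strictly better whatever the partner does."""
    return [a for a in ACTIONS
            if all(YEARS[(a, b)][0] < YEARS[(alt, b)][0]
                   for b in ACTIONS for alt in ACTIONS if alt != a)]

def total_years(profile):
    """Combined jail time of both prisoners for one profile."""
    return sum(YEARS[profile])

print("Nash equilibria:", nash_equilibria())
print("Dominant choice:", dominant_actions())
print("Fewest total years:", min(YEARS, key=total_years))
```

Under these payoffs the only equilibrium is mutual testimony, even though mutual silence gives the fewest total years in jail, which is the tension the article goes on to test.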
Yet no one's ever actually run the experiment on real prisoners before, until two University of Hamburg economists tried it out in a recent study comparing the behavior of inmates and students.
Surprisingly, for the classic version of the game, prisoners were far more cooperative than expected.
Menusch Khadjavi and Andreas Lange put the famous game to the test for the first time ever, putting a group of prisoners in Lower Saxony's primary women's prison, as well as students, through both simultaneous and sequential versions of the game.
The payoffs obviously weren't years off sentences, but euros for students, and the equivalent value in coffee or cigarettes for prisoners.
They expected, building off of game theory and behavioral economic research that show humans are more cooperative than the purely rational model that economists traditionally use, that there would be a fair amount of first-mover cooperation, even in the simultaneous simulation where there's no way to react to the other player's decisions.
And even in the sequential game, where you get a higher payoff for betraying a cooperative first mover, a fair amount will still reciprocate.
As for the difference between student and prisoner behavior, you'd expect that a prison population might be more jaded and distrustful, and therefore more likely to defect.
The results went exactly the other way: for the simultaneous game, only 37% of students cooperate. Inmates cooperated 56% of the time.
On a pair basis, only 13% of student pairs managed to get the best mutual outcome and cooperate, whereas 30% of prisoners do.
In the sequential game, far more students (63%) cooperate, so the mutual cooperation rate skyrockets to 39%. For prisoners, it remains about the same.
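A rough arithmetic check on the pair figures for the simultaneous game: if the two members of a pair choose independently (an assumption of this sketch, not something the article states), the share of pairs that both cooperate is the square of the individual cooperation rate. The function name is hypothetical.

```python
def mutual_cooperation_rate(individual_rate):
    """Share of pairs in which both members cooperate, assuming independent choices."""
    return individual_rate ** 2

# Individual cooperation rates reported for the simultaneous game.
for group, rate in {"students": 0.37, "inmates": 0.56}.items():
    print(f"{group}: {mutual_cooperation_rate(rate):.1%} of pairs")
```

The squares are 0.1369 and 0.3136, close to the 13% and 30% reported on a pair basis.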
What's interesting is that the simultaneous game requires far more blind trust from both parties, and you don't have a chance to retaliate or make up for being betrayed later. Yet prisoners are still significantly more cooperative in that scenario.
Obviously the payoffs aren't as serious as a year or three of your life, but the paper still demonstrates that prisoners aren't necessarily as calculating, self-interested, and un-trusting as you might expect, and as behavioral economists have argued for years, as mathematically interesting as Nash equilibrium might be, they don't line up with real behavior all that well.
"Nobody understands “Prisoner’s dilemma”" July 23, 2013
http://beranger.org/2013/07/23/nobody-understands-prisoners-dilemma/
. . .
Now, the theory says they’d be better off by betraying — i.e. by confessing. And they invoke the Nash equilibrium to “prove” that they’d be better off this way.
The problem in real life is that:
- there’s no such thing as an “iterated prisoners’ dilemma” — the deal is only offered you once!
- once again, this is a one-time shot, not something that repeats, so anything Nash-related is pure stupidity — you can’t have any kind of “equilibrium” when a unique, unrepeatable decision affects the entire outcome!
- “maximizing” WHAT? You’re not playing baccarat, you’re getting out or staying in, game over!
- also, any discussion of a “probability distribution” is pointless — it’s a ONE-TIME ISSUE, and then you go to jail, dammit! Statistics doesn’t work with a unique sample.
- consequently, any analysis of what the other prisoner might be f***ing thinking is f***ing useless — you cannot possibly know how stupid or how intelligent the other guy is, and again, any “statistical assumption” is pure intellectual masturbation; you can only hope he’s not mentally deranged;
- as a practical issue, the “classical” dilemma, which uses prison terms of 1, 2, and 3 years, is not only confusing and making the judgement more difficult (as the terms are too close to each other), but it’s also highly unrealistic in terms of what the lack of evidence or the presence of a confession would give — therefore, the variant with 1, 5 and 20 years is much more appropriate.
OK, now let me say what I’d do if I were a prisoner to have been offered such a deal: I’ll keep being silent — they’d call this “cooperation”, but by elementary logic, this is obviously the best thing to do, especially when thinking of real jail terms: 20 years is horrendous, 5 years is painful, but 1 year is rather cheap, so I’d assume the other prisoner would think the same. Just common sense. Zero years would be ideal, but there is a risk, and the risk reads “5 years”. This is not altruism, but compared to 5 years, 1 year would be quite acceptable for a felon, wouldn’t you think so? Nothing about any remorse of possibly putting the other guy behind bars for 20 years — just selfish considerations are enough to choose this strategy! (Note that properly choosing the prison terms makes the conclusion easier to reach: 2 years are not as much different from 1 year as the 5 years are.)
They’ve now for the first time tried this dilemma in practice. The idiots have used two groups: students — the stake being a material reward –, and real inmates — where not the freedom was at stake, but merely some cigarettes or coffee.
In such a flawed test environment, 37% of the students did “cooperate”, versus 56% of the inmates. The “iterated” (sequential) version of the dilemma showed an increased cooperation, but only amongst the students (which, in my opinion, proves that they were totally dumb).
Now, I should claim victory, as long as this experiment contradicts the theory saying the cooperation should have been negligible — especially amongst “immoral convicts”. And really, invoking a Pareto standpoint (making one individual better off without making any other individual worse off) is equally dumb, as nobody thinks in terms of ethics… for some bloody cigarettes! In real conditions though, where PERSONAL FREEDOM would be at stake FOR YEARS (1, 5, or 20) — not just peanuts –, an experiment would show even more “cooperation”, meaning that most people would remain silent!
They can’t even design an experiment properly. Not winning a couple of bucks, or a cuppa coffee is almost irrelevant to all the subjects involved (this is not a real stake!), whereas the stress of staying in jail for 5 or for 20 years is almost a life-or-death issue. Mathematicians and sociologists seem unbelievably dumb when basic empathy is needed in order to analyze a problem or conduct an experiment.
__
P.S.: A classical example that’s commonly mentioned is that during the Cold War, both parts have chosen to continuously arm, not to disarm — which means they didn’t “cooperate”. Heck, this is a continuously iterated prisoners’ dilemma, which is a totally different issue than a one-time shot prisoners’ dilemma! In such a continuum, the “official theory” applies with great success.
__
LATE EDIT: If it wasn’t clear enough, the practical experiment was flawed for two major reasons:
- The stake. When it’s not about losing personal FREEDOM for years, but merely about not earning a few euros or not being given some cigarettes or coffee, people are more prone to take chances and face the highest possible risk… because they don’t risk that much!
- The reversed logic. How can you replace penalties with rewards (on a reversed scale, obviously) and still have people apply the same judgement? Being put in jail for 20 years is replaced with what? With not earning anything? Piece of cake! What’s the equivalent of being set free? Being given a maximum of cash or of cigarettes? To make the equivalent of a real prisoner’s dilemma, the 20 years, 5 years or 1 year penalties shouldn’t have meant “gradually lower earnings”, but rather fines imposed to the subjects! Say, for the students:
- FREE means you’re given 100 €
- 1 year means you should pay 100 €
- 5 years means you should pay 500 €
- 20 years means you should pay 2000 €
What do you think the outcome would have been in such an experiment? Totally different, I’m telling you!
"ECONOMICS AS ROBUSTNESS ANALYSIS," by Jaakko Kuorikoski, Aki Lehtinen
and Caterina Marchionni, the University of Pittsburgh, 2007 ---
http://philsci-archive.pitt.edu/3550/1/econrobu.pdf
ECONOMICS AS ROBUSTNESS ANALYSIS
Jaakko Kuorikoski, Aki Lehtinen and Caterina Marchionni
25.9.2007
1. Introduction
2. Making sense of robustness
3. Robustness in economics
4. The epistemic import of robustness analysis
5. An illustration: geographical economics models
6. Independence of derivations
7. Economics as a Babylonian science
8. Conclusions
1. Introduction
Modern economic analysis consists largely in building abstract mathematical models, and deriving familiar results from ever sparser modeling assumptions is considered as a theoretical contribution. Why do economists spend so much time and effort in deriving same old results from slightly different assumptions rather than trying to come up with new and exciting hypotheses? We claim that this is because the process of refining economic models is essentially a form of robustness analysis. The robustness of modeling results with respect to particular modeling assumptions, parameter values or initial conditions plays a crucial role for modeling in economics for two reasons. First, economic models are difficult to subject to straightforward empirical tests for various reasons. Second, the very nature of economic phenomena provides little hope of ever making the modeling assumptions completely realistic. Robustness analysis is therefore a natural methodological strategy for economists because economic models are based on various idealizations and abstractions which make at least some of their assumptions unrealistic (Wimsatt 1987; 1994a; 1994b; Mäki 2000; Weisberg 2006b). The importance of robustness considerations in economics ultimately forces us to reconsider many commonly held views on the function and logical structure of economic theory.
Given that much of economic research praxis can be characterized as robustness analysis, it is somewhat surprising that philosophers of economics have only recently become interested in robustness. William Wimsatt has extensively discussed robustness analysis, which he considers in general terms as triangulation via independent ways of determination. According to Wimsatt, fairly varied processes or activities count as ways of determination: measurement, observation, experimentation, mathematical derivation etc. all qualify. Many ostensibly different epistemic activities are thus classified as robustness analysis.
In a recent paper, James Woodward (2006) distinguishes four notions of robustness. The first three are all species of robustness as similarity of the result under different forms of determination. Inferential robustness refers to the idea that there are different degrees to which inference from some given data may depend on various auxiliary assumptions, and derivational robustness to whether a given theoretical result depends on the different modelling assumptions. The difference between the two is that the former concerns derivation from data, and the latter derivation from a set of theoretical assumptions. Measurement robustness means triangulation of a quantity or a value by (causally) different means of measurement. Inferential, derivational and measurement robustness differ with respect to the method of determination and the goals of the corresponding robustness analysis. Causal robustness, on the other hand, is a categorically different notion because it concerns causal dependencies in the world, and it should not be confused with the epistemic notion of robustness under different ways of determination.
In Woodward’s typology, the kind of theoretical model-refinement that is so common in economics constitutes a form of derivational robustness analysis. However, if Woodward (2006) and Nancy Cartwright (1991) are right in claiming that derivational robustness does not provide any epistemic credence to the conclusions, much of theoretical model- building in economics should be regarded as epistemically worthless. We take issue with this position by developing Wimsatt’s (1981) account of robustness analysis as triangulation via independent ways of determination. Obviously, derivational robustness in economic models cannot be a matter of entirely independent ways of derivation, because the different models used to assess robustness usually share many assumptions. Independence of a result with respect to modelling assumptions nonetheless carries epistemic weight by supplying evidence that the result is not an artefact of particular idealizing modelling assumptions. We will argue that although robustness analysis, understood as systematic examination of derivational robustness, is not an empirical confirmation procedure in any straightforward sense, demonstrating that a modelling result is robust does carry epistemic weight by guarding against error and by helping to assess the relative importance of various parts of theoretical models (cf. Weisberg 2006b). While we agree with Woodward (2006) that arguments presented in favour of one kind of robustness do not automatically apply to other kinds of robustness, we think that the epistemic gain from robustness derives from similar considerations in many instances of different kinds of robustness.
In contrast to physics, economic theory itself does not tell which idealizations are truly fatal or crucial for the modeling result and which are not. Economists often proceed on a preliminary hypothesis or an intuitive hunch that there is some core causal mechanism that ought to be modeled realistically. Turning such intuitions into a tractable model requires making various unrealistic assumptions concerning other issues. Some of these assumptions are considered or hoped to be unimportant, again on intuitive grounds. Such assumptions have been examined in economic methodology using various closely related terms such as Musgrave’s (1981) heuristic assumptions, Mäki’s (2000) early step assumptions, Hindriks’ (2006) tractability assumptions and Alexandrova’s (2006) derivational facilitators. We will examine the relationship between such assumptions and robustness in economic model-building by way of discussing a case: geographical economics. We will show that an important way in which economists try to guard against errors in modeling is to see whether the model’s conclusions remain the same if some auxiliary assumptions, which are hoped not to affect those conclusions, are changed. The case also demonstrates that although the epistemological functions of guarding against error and securing claims concerning the relative importance of various assumptions are somewhat different, they are often closely intertwined in the process of analyzing the robustness of some modeling result.
. . .
8. Conclusions
The practice of economic theorizing largely consists of building models with slightly different assumptions yielding familiar results. We have argued that this practice makes sense when seen as derivational robustness analysis. Robustness analysis is a sensible epistemic strategy in situations where we know that our assumptions and inferences are fallible, but not in what situations and in what way. Derivational robustness analysis guards against errors in theorizing when the problematic parts of the ways of determination, i.e. models, are independent of each other. In economics in particular, proving robust theorems from different models with diverse unrealistic assumptions helps us to evaluate what results correspond to important economic phenomena and what are merely artefacts of particular auxiliary assumptions. We have addressed Orzack and Sober’s criticism against robustness as an epistemically relevant feature by showing that their formulation of the epistemic situation in which robustness analysis is useful is misleading. We have also shown that their argument actually shows how robustness considerations are necessary for evaluating what a given piece of data can support. We have also responded to Cartwright’s criticism by showing that it relies on an untenable hope of a completely true economic model.
Viewing economic model building as robustness analysis also helps to make sense of the role of the rationality axioms that apparently provide the basis of the whole enterprise. Instead of the traditional Euclidian view of the structure of economic theory, we propose that economics should be approached as a Babylonian science, where the epistemically secure parts are the robust theorems and the axioms only form what Boyd and Richerson call a generalized sample theory, whose role is to help organize further modelling work and facilitate communication between specialists.
Jensen Comment
As I've mentioned before, I spent a goodly proportion of my time for two years in
a think tank trying to invent adaptive regression and cluster analysis models.
In every case the main reason for my failures was lack of robustness. In
particular, any two models feeding in predictor variables w, x, y, and z
generated different outcomes that were not robust in terms of the time ordering
of the variables feeding into the algorithms. This made the results dependent on
dynamic programming, which has rarely been noted for computing practicality ---
http://en.wikipedia.org/wiki/Dynamic_programming
Simpson's Paradox and Cross-Validation
Simpson's Paradox --- http://en.wikipedia.org/wiki/Simpson%27s_paradox
"Simpson’s Paradox: A Cautionary Tale in Advanced Analytics," by Steve
Berman, Leandro DalleMule, Michael Greene, and John Lucker, Significance:
Statistics Making Sense, October 2012 ---
http://www.significancemagazine.org/details/webexclusive/2671151/Simpsons-Paradox-A-Cautionary-Tale-in-Advanced-Analytics.html
Analytics projects often present us with situations in which common sense tells us one thing, while the numbers seem to tell us something much different. Such situations are often opportunities to learn something new by taking a deeper look at the data. Failure to perform a sufficiently nuanced analysis, however, can lead to misunderstandings and decision traps. To illustrate this danger, we present several instances of Simpson’s Paradox in business and non-business environments. As we demonstrate below, statistical tests and analysis can be confounded by a simple misunderstanding of the data. Often taught in elementary probability classes, Simpson’s Paradox refers to situations in which a trend or relationship that is observed within multiple groups reverses when the groups are combined. Our first example describes how Simpson’s Paradox accounts for a highly surprising observation in a healthcare study. Our second example involves an apparent violation of the law of supply and demand: we describe a situation in which price changes seem to bear no relationship with quantity purchased. This counterintuitive relationship, however, disappears once we break the data into finer time periods. Our final example illustrates how a naive analysis of marginal profit improvements resulting from a price optimization project can potentially mislead senior business management, leading to incorrect conclusions and inappropriate decisions. Mathematically, Simpson’s Paradox is a fairly simple—if counterintuitive—arithmetic phenomenon. Yet its significance for business analytics is quite far-reaching. Simpson’s Paradox vividly illustrates why business analytics must not be viewed as a purely technical subject appropriate for mechanization or automation. Tacit knowledge, domain expertise, common sense, and above all critical thinking, are necessary if analytics projects are to reliably lead to appropriate evidence-based decision making.
The past several years have seen decision making in many areas of business steadily evolve from judgment-driven domains into scientific domains in which the analysis of data and careful consideration of evidence are more prominent than ever before. Additionally, mainstream books, movies, alternative media and newspapers have covered many topics describing how fact and metric driven analysis and subsequent action can exceed results previously achieved through less rigorous methods. This trend has been driven in part by the explosive growth of data availability resulting from Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM) applications and the Internet and eCommerce more generally. There are estimates that predict that more data will be created in the next four years than in the history of the planet. For example, Wal-Mart handles over one million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes in size - the equivalent of 167 times the books in the United States Library of Congress.
Additionally, computing power has increased exponentially over the past 30 years and this trend is expected to continue. In 1969, astronauts landed on the moon with a 32-kilobyte memory computer. Today, the average personal computer has more computing power than the entire U.S. space program at that time. Decoding the human genome took 10 years when it was first done in 2003; now the same task can be performed in a week or less. Finally, a large consumer credit card issuer crunched two years of data (73 billion transactions) in 13 minutes, which not long ago took over one month.
This explosion of data availability and the advances in computing power and processing tools and software have paved the way for statistical modeling to be at the front and center of decision making not just in business, but everywhere. Statistics is the means to interpret data and transform vast amounts of raw data into meaningful information.
However, paradoxes and fallacies lurk behind even elementary statistical exercises, with the important implication that exercises in business analytics can produce deceptive results if not performed properly. This point can be neatly illustrated by pointing to instances of Simpson’s Paradox. The phenomenon is named after Edward Simpson, who described it in a technical paper in the 1950s, though the prominent statisticians Karl Pearson and Udny Yule noticed the phenomenon over a century ago. Simpson’s Paradox, which regularly crops up in statistical research, business analytics, and public policy, is a prime example of why statistical analysis is useful as a corrective for the many ways in which humans intuit false patterns in complex datasets.
Simpson’s Paradox is in a sense an arithmetic trick: weighted averages can lead to reversals of meaningful relationships—i.e., a trend or relationship that is observed within each of several groups reverses when the groups are combined. Simpson’s Paradox can arise in any number of marketing and pricing scenarios; we present here case studies describing three such examples. These case studies serve as cautionary tales: there is no comprehensive mechanical way to detect or guard against instances of Simpson’s Paradox leading us astray. To be effective, analytics projects should be informed by both a nuanced understanding of statistical methodology as well as a pragmatic understanding of the business being analyzed.
The first case study, from the medical field, presents a surface indication on the effects of smoking that is at odds with common sense. Only when the data are viewed at a more refined level of analysis does one see the true effects of smoking on mortality. In the second case study, decreasing prices appear to be associated with decreasing sales and increasing prices appear to be associated with increasing sales. On the surface, this makes no sense. A fundamental tenet of economics is that of the demand curve: as the price of a good or service increases, consumers demand less of it. Simpson’s Paradox is responsible for an apparent—though illusory—violation of this fundamental law of economics. Our final case study shows how marginal improvements in profitability in each of the sales channels of a given manufacturer may result in an apparent marginal reduction in the overall profitability of the business. This seemingly contradictory conclusion can also lead to serious decision traps if not properly understood.
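The weighted-average reversal behind Simpson's Paradox is easy to reproduce with a handful of counts. The sketch below uses hypothetical numbers invented purely for illustration (they are not from the article or from any of its case studies), and the function names are likewise hypothetical. Option X has the higher success rate inside each group, yet option Y has the higher rate once the groups are pooled, because X is observed mostly in the low-rate group.

```python
# Hypothetical (successes, trials) counts, invented for illustration only.
COUNTS = {
    "group 1": {"X": (90, 100), "Y": (800, 1000)},
    "group 2": {"X": (300, 1000), "Y": (20, 100)},
}

def rate(successes, trials):
    """Success rate for one cell of the table."""
    return successes / trials

def within_group_rates(counts):
    """Success rate of each option inside each group."""
    return {g: {arm: rate(*cell) for arm, cell in arms.items()}
            for g, arms in counts.items()}

def pooled_rates(counts):
    """Success rate of each option after adding the groups together."""
    totals = {}
    for arms in counts.values():
        for arm, (s, n) in arms.items():
            ts, tn = totals.get(arm, (0, 0))
            totals[arm] = (ts + s, tn + n)
    return {arm: rate(s, n) for arm, (s, n) in totals.items()}

for group, rates in within_group_rates(COUNTS).items():
    print(group, {arm: round(r, 3) for arm, r in rates.items()})
print("pooled ", {arm: round(r, 3) for arm, r in pooled_rates(COUNTS).items()})
```

Nothing in the pooled table signals that a reversal has occurred, which is why the grouping variable has to come from knowledge of the setting rather than from the arithmetic.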
Case Study 1: Are those warning labels really necessary?
We start with a simple example from the healthcare world. This example both illustrates the phenomenon and serves as a reminder that it can appear in any domain.
The data are taken from a 1996 follow-up study from Appleton, French, and Vanderpump on the effects of smoking. The follow-up catalogued women from the original study, categorizing based on the age groups in the original study, as well as whether the women were smokers or not. The study measured the deaths of smokers and non-smokers during the 20 year period.
Continued in article
"Is the Ohlson (1995) Model an Example of the Simpson's Paradox?" by
Samithamby Senthilnathan, SSRN 1417746, June 11, 2009 ---
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1417746
The Equity Prices and Accounting Variables: The role of the most recent prior
period's price in value relevance studies, by Samithamby Senthilnathan,
LAP LAMBERT Academic Publishing, May 22, 2012 (paperback),
ISBN-10: 3659103721, ISBN-13: 978-3659103728 ---
http://www.amazon.com/dp/3659103721?tag=beschevac-20
"Does an End of Period's Accounting Variable Assessed have Relevance for the
Particular Period?" by Samithamby Senthilnathan, SSRN 1415182, June 6, 2009 ---
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1415182
Also see
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1406788
What happened to cross-validation in accountics science research?
Over time I've become increasingly critical of
the lack of validation in accountics science, and I've focused mainly upon lack
of replication by independent researchers and lack of commentaries published in
accountics science journals ---
http://faculty.trinity.edu/rjensen/TheoryTAR.htm
Another type of validation that seems to be on the decline in accountics science is cross-validation. Accountics scientists seem to be content with their statistical inference tests on Z-Scores, F-Tests, and correlation significance testing. Cross-validation seems to be less common; at least I'm having trouble finding examples of it. Cross-validation entails comparing sample findings with findings in holdout samples.
Cross Validation --- http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
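To make the idea of comparing sample findings with holdout findings concrete, here is a minimal sketch of k-fold cross-validation for a one-regressor least squares model. Everything in it is an assumption made for illustration: the data are simulated, the true coefficients (2 and 3) and the function names are hypothetical, and five folds is an arbitrary choice. Each fold is held out in turn, the model is fitted on the remaining observations, and prediction error is measured only on the held-out fold.

```python
import random

def fit_ols(xs, ys):
    """Closed-form least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(a, b, xs, ys):
    """Mean squared prediction error of the fitted line on (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average out-of-sample MSE over k holdout folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for hold in folds:
        held = set(hold)
        train = [i for i in idx if i not in held]
        a, b = fit_ols([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(a, b, [xs[i] for i in hold], [ys[i] for i in hold]))
    return sum(scores) / k

# Simulated data (an assumption of this sketch): y = 2 + 3x + standard normal noise.
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(200)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

a, b = fit_ols(xs, ys)
print("in-sample MSE:     ", round(mse(a, b, xs, ys), 3))
print("5-fold holdout MSE:", round(k_fold_mse(xs, ys), 3))
```

A holdout error far above the in-sample error would suggest that the fitted relationship does not carry over to observations the model has not seen.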
When reading the following paper using logit
regression to predict audit firm changes, it struck me that this would've
been an ideal candidate for the authors to have performed cross-validation using
holdout samples.
"Audit Quality and Auditor Reputation: Evidence from Japan," by Douglas J.
Skinner and Suraj Srinivasan, The Accounting Review, September 2012, Vol.
87, No. 5, pp. 1737-1765.
We study events surrounding ChuoAoyama's failed audit of Kanebo, a large Japanese cosmetics company whose management engaged in a massive accounting fraud. ChuoAoyama was PwC's Japanese affiliate and one of Japan's largest audit firms. In May 2006, the Japanese Financial Services Agency (FSA) suspended ChuoAoyama for two months for its role in the Kanebo fraud. This unprecedented action followed a series of events that seriously damaged ChuoAoyama's reputation. We use these events to provide evidence on the importance of auditors' reputation for quality in a setting where litigation plays essentially no role. Around one quarter of ChuoAoyama's clients defected from the firm after its suspension, consistent with the importance of reputation. Larger firms and those with greater growth options were more likely to leave, also consistent with the reputation argument.
Rather than just use statistical inference tests on logit model Z-statistics, it struck me that in statistics journals the referees might've requested cross-validation tests on holdout samples of firms that changed auditors and firms that did not change auditors.
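As a sketch of what such a holdout test might look like, the code below fits a one-regressor logit on an estimation sample and then scores its classifications on a holdout sample. It is purely illustrative: the data are simulated and have nothing to do with the Skinner and Srinivasan sample, the single predictor, the coefficients (-0.5 and 2), the 50/50 split, and the function names are all hypothetical, and the fit uses plain gradient ascent rather than any particular statistics package.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ys, steps=2000, lr=0.5):
    """One-regressor logit fitted by gradient ascent on the average log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)   # observed outcome minus fitted probability
            ga += err
            gb += err * x
        a += lr * ga / n
        b += lr * gb / n
    return a, b

def accuracy(a, b, xs, ys):
    """Share of cases classified correctly with a 0.5 probability cutoff."""
    hits = sum((sigmoid(a + b * x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

# Simulated firms (hypothetical): y = 1 means the firm changed auditors.
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(400)]
ys = [1 if rng.random() < sigmoid(-0.5 + 2 * x) else 0 for x in xs]

est_x, est_y = xs[:200], ys[:200]      # estimation sample
hold_x, hold_y = xs[200:], ys[200:]    # holdout sample, never used in fitting

a, b = fit_logit(est_x, est_y)
print("estimation-sample accuracy:", accuracy(a, b, est_x, est_y))
print("holdout-sample accuracy:   ", accuracy(a, b, hold_x, hold_y))
```

The comparison of interest is between the two printed accuracies: significance tests on the fitted coefficient say nothing directly about how well the model classifies firms outside the estimation sample.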
I do find somewhat more frequent cross-validation studies in finance, particularly in the areas of discriminant analysis in bankruptcy prediction models.
Instances of cross-validation in accounting research journals seem to have died out in the past 20 years. There are earlier examples of cross-validation in accounting research journals. Several examples are cited below:
"A field study examination of budgetary participation and locus of control," by Peter Brownell, The Accounting Review, October 1982 ---
http://www.jstor.org/discover/10.2307/247411?uid=3739712&uid=2&uid=4&uid=3739256&sid=21101146090203
"Information choice and utilization in an experiment on default prediction," Abdel-Khalik and KM El-Sheshai - Journal of Accounting Research, 1980 ---
http://www.jstor.org/discover/10.2307/2490581?uid=3739712&uid=2&uid=4&uid=3739256&sid=21101146090203
"Accounting ratios and the prediction of failure: Some behavioral evidence," by Robert Libby, Journal of Accounting Research, Spring 1975 ---
http://www.jstor.org/discover/10.2307/2490653?uid=3739712&uid=2&uid=4&uid=3739256&sid=21101146090203
There are other examples of cross-validation in the 1970s and 1980s, particularly in bankruptcy prediction.
I have trouble finding illustrations of cross-validation in the accounting research literature in more recent years. Has the interest in cross-validating waned along with interest in validating accountics research? Or am I just being careless in my search for illustrations?
Reverse Regression
"Solution to Regression Problem," by David Giles, Econometrics
Beat: Dave Giles’ Blog, University of Victoria, December 26, 2013 ---
http://davegiles.blogspot.com/2013/12/solution-to-regression-problem.html
O.K. - you've had long enough to think about that little regression problem I posed the other day. It's time to put you out of your misery!
Here's the problem again, with a solution.
Problem:
Suppose that we estimate the following regression model by OLS:
y_i = α + β x_i + ε_i.
The model has a single regressor, x, and the point estimate of β turns out to be 10.0.
Now consider the "reverse regression", based on exactly the same data:
x_i = a + b y_i + u_i.
What can we say about the value of the OLS point estimate of b?
- It will be 0.1.
- It will be less than or equal to 0.1.
- It will be greater than or equal to 0.1.
- It's impossible to tell from the information supplied.
Solution: Continued in article
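Readers who want to check their answer before clicking through can use a standard OLS identity; this is not Giles's own write-up, which is in the linked post. Both slope estimates share the same numerator, the sample covariance of x and y, so the product of the forward slope and the reverse slope equals the squared sample correlation, which cannot exceed 1. With a forward estimate of 10.0, the reverse estimate is therefore r²/10, which is at most 0.1 and equals 0.1 only when the fit is perfect. The Python sketch below verifies the identity on synthetic data; the data-generating numbers and function names are illustrative assumptions.

```python
import random

def mean(v):
    return sum(v) / len(v)

def ols_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def r_squared(x, y):
    """Squared sample correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Synthetic data (an assumption for illustration): true slope 10, noisy fit.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [2.0 + 10.0 * xi + rng.gauss(0, 5) for xi in x]

beta_hat = ols_slope(x, y)  # forward regression: y on x
b_hat = ols_slope(y, x)     # reverse regression: x on y
r2 = r_squared(x, y)

print(f"beta_hat = {beta_hat:.4f}")
print(f"b_hat    = {b_hat:.4f}")
print(f"1/beta   = {1 / beta_hat:.4f}")
print(f"beta_hat * b_hat = {beta_hat * b_hat:.6f}, r^2 = {r2:.6f}")
```

The reverse slope falls short of the reciprocal of the forward slope by exactly the factor r², so the noisier the relationship, the larger the gap.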
David Giles' Top Five Econometrics Blog Postings for 2013
Econometrics Beat: Dave Giles’ Blog, University of Victoria, December
31, 2013 ---
http://davegiles.blogspot.com/2013/12/my-top-5-for-2013.html
Everyone seems to be doing it at this time of the year. So, here are the five most popular new posts on this blog in 2013:
- Econometrics and "Big Data"
- Ten Things for Applied Econometricians to Keep in Mind
- ARDL Models - Part II - Bounds Tests
- The Bootstrap - A Non-Technical Introduction
- ARDL Models - Part I
Thanks for reading, and for your comments.
Happy New Year!
Jensen Comment
I really like the way David Giles thinks and writes about econometrics. He does
not pull his punches about validity testing.
Econometrics Beat: Dave Giles' Blog --- http://davegiles.blogspot.com/
Reading for the New Year
Back to work, and back to reading:
- Basturk, N., C. Cakmakli, S. P. Ceyhan, and H. K. van Dijk, 2013. Historical developments in Bayesian econometrics after Cowles Foundation monographs 10,14. Discussion Paper 13-191/III, Tinbergen Institute.
- Bedrick, E. J., 2013. Two useful reformulations of the hazard ratio. American Statistician, in press.
- Nawata, K. and M. McAleer, 2013. The maximum number of parameters for the Hausman test when the estimators are from different sets of equations. Discussion Paper 13-197/III, Tinbergen Institute.
- Shahbaz, M, S. Nasreen, C. H. Ling, and R. Sbia, 2013. Causality between trade openness and energy consumption: What causes what high, middle and low income countries. MPRA Paper No. 50832.
- Tibshirani, R., 2011. Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society, B, 73, 273-282.
- Zamani, H. and N. Ismail, 2014. Functional form for the zero-inflated generalized Poisson regression model. Communications in Statistics - Theory and Methods, in press.
Once upon a time, when all the world and you and I were young and beautiful, there lived in the ancient town of Metrika a young boy by the name of Joe.
Now, young Joe was a talented lad, and his home town was prosperous and filled with happy folk - Metricians, they were called. Joe was a member of the Econo family, and his ancestors had been among the founding-fathers of the town. Originating in the neighbouring city of Econoville, Joe Econometrician's forebears had arrived in Metrika not long after the original settlers of that town - the Biols (from nearby Biologica), and the unfortunately named Psychos (from the hamlet of Psychovia).
In more recent times, other families (or "specialists", as they were sometimes known) had also established themselves in the town, and by the time that Joe was born there was already a sprinkling of Clios (from the ancient city of Historia), and even a few Environs. Hailing from the suburbs of Environmentalia, the Environs were regarded with some disdain by many of the more established families of Metrika.
Metrika began as a small village - little more than a coach-stop and a mandatory tavern at a junction in the highway running from the ancient data mines in the South, to the great city of Enlightenment, far to the North. In Metrika, the transporters of data of all types would pause overnight on their long journey; seek refreshment at the tavern; and swap tales of their experiences on the road.
To be fair, the data transporters were more than just humble freight carriers. The raw material that they took from the data mines was largely unprocessed. The vast mountains of raw numbers usually contained valuable gems and nuggets of truth, but typically these were buried from sight. The data transporters used the insights that they gained from their raucous, beer-fired discussions and arguments (known locally as "seminars") with the Metrika locals at the tavern to help them to sift through the data and extract the valuable jewels. With their loads considerably lightened, these "data-miners" then continued on their journey to the City of Enlightenment in a much improved frame of mind, hangovers notwithstanding!
Over time, the town of Metrika prospered and grew as the talents of its citizens were increasingly recognized and valued by those in the surrounding districts, and by the data transporters.
Young Joe grew up happily, supported by his family of econometricians, and he soon developed the skills that were expected of his societal class. He honed his computing skills; developed a good nose for "dodgy" data; and studiously broadened and deepened his understanding of the various tools wielded by the artisans in the neighbouring town of Statsbourg.
In short, he was a model child!
But - he was torn! By the time that he reached the tender age of thirteen, he felt the need to make an important, life-determining, decision.
Should he align his talents with the burly crew who frequented the gym near his home - the macroeconometricians - or should he throw in his lot with the physically challenged bunch of empirical economists known locally as the microeconometricians?
What a tough decision! How to decide?
He discussed his dilemma with his parents, aunts, and uncles. Still, the choice was unclear to him.
Then, one fateful day, while sitting by the side of the highway and watching the data-miners pass by with their increasingly heavy loads, the answer came to him! There was a simple solution - he would form his own break-away movement that was free of the shackles of his Econo heritage.
Overwhelmed with excitement, Joe raced back to the tavern to announce to the locals that henceforth he was to be known as a Data Scientist.
As usual, the locals largely ignored what he was saying, and instead took turns at talking loudly about things that they thought would make them seem important to their peers. Finally, though, after many interruptions, and the consumption of copious quantities of ale, Joe was able to hold their attention.
"You see", he said, "the data that are now being mined, and transported to the City of Enlightenment, are available in such vast quantities that the truth must lie within them."
"All of this energy that we've been expending on building economic models, and then using the data to test their validity - it's a waste of time! The data are now so vast that the models are superfluous."
(To be perfectly truthful, he probably used words of one syllable, but I think you get the idea.)
"We don't need to use all of those simplifying assumptions that form the basis of the analysis being undertaken by the microeconometricians and macroeconometricians."
(Actually, he slurred these last three words due to a mixture of youthful enthusiasm and a mouthful of ale.)
"Their models are just a silly game, designed to create the impression that they're actually adding some knowledge to the information in the data. No, all that we need to do is to gather together lots and lots of our tools, and use them to drill deep into the data to reveal the true patterns that govern our lives."
"The answer was there all of the time. While we referred to those Southerners in disparaging terms, calling them "data miners" as if such activity were beneath the dignity of serious modellers such as ourselves, in reality data-mining is our future. How foolish we were!"
Now, it must be said that there were a few older econometricians who were somewhat unimpressed by Joe's revelation. Indeed, some of them had an uneasy feeling that they'd heard this sort of talk before. Amid much head-scratching, beard-stroking, and ale-quaffing, some who were present that day swear they heard mention of long-lost names such as Koopmans and Vining. Of course, we'll never know for sure.
However, young Joe was determined that he had found his destiny. A Data Scientist he would be, and he was convinced that others would follow his lead. Gathering together as many calculating tools as he could lay his hands on, Joe hitched a ride North, to the great City of Enlightenment. The protestations of his family and friends were to no avail. After all, as he kept insisting, we all know that "E" comes after "D".
And so, Joe was last seen sitting in a large wagon of data, trundling North while happily picking through some particularly interesting looking nuggets, and smiling the smile of one who knows the truth.
To this day, econometricians gather, after a hard day of modelling, in the taverns of Metrika. There, they swap tales of new theories, interesting computer algorithms, and even the characteristics of their data. Occasionally, Joe's departure from the town is recalled, but what became of him, or his followers, we really don't know. Perhaps he never actually found the City of Enlightenment after all. (Shock, horror!)
And that, dear children, is what can happen to you - yes, even you - if you don't eat all of your vegetables, or if you believe everything that you hear at the tavern.
"Some Thoughts About Accounting Scholarship," by Joel Demski, AAA President's Message, Accounting Education News, Fall 2001
http://aaahq.org/pubs/AEN/2001/Fall2001.pdf
Some Thoughts on Accounting Scholarship From Annual Meeting Presidential Address, August 22, 2001
Tradition calls for me to reveal plans and aspirations for the coming year. But a slight deviation from tradition will, I hope, provide some perspective on my thinking.
We have, in the past half century, made considerable strides in our knowledge of accounting institutions. Statistical connections between accounting measures and market prices, optimal contracting, and professional judgment processes and biases are illustrative. In the process we have raised the stature, the relevance, and the sheer excitement of intellectual inquiry in accounting, be it in the classroom, in the cloak room, or in the journals.
Of late, however, a malaise appears to have settled in. Our progress has turned flat, our tribal tendencies have taken hold, and our joy has diminished.
Some Warning Signs
One indicator is our textbooks, our primary communication medium and our statement to the world about ourselves. I see several patterns here. One is the unrelenting march to make every text look like People magazine. Form now leads, if not swallows, substance. Another is the insatiable appetite to list every rule published by the FASB (despite the fact we have a tidal wave thanks to DIG, EITF, AcSEC, SABs, and what have you). Closely related is the interest in fads. Everything, including this paragraph of my remarks, is now subject to a value-added test. Benchmarking, strategic vision, and EVA ® are everywhere. Foundations are nowhere. Building blocks are languishing in appendices and wastebaskets.
A second indicator is our journals. They have proliferated in number. But we struggle with an intertemporal sameness, with incremental as opposed to discontinuous attempts to move our thinking forward, and with referee intrusion and voyeurism. Value relevance is a currently fashionable approach to identifying statistical regularities in the financial market arena, just as a focus on readily observable components of compensation is a currently fashionable dependent variable in the compensation arena. Yet we know measurement error abounds, that other sources of information are both present and hardly unimportant, that compensation is broad-based and intertemporally managed, and that compensating wage differentials are part of the stew. Yet we continue on the comfortable path of sameness.
A third indicator is our work habits. We have embraced, indeed been swallowed by, the multiple adjective syndrome, or MAS: financial, audit, managerial, tax, analytic, archival, experimental, systems, cognitive, etc. This applies to our research, to our reading, to our courses, to our teaching assignments, to our teaching, and to the organization of our Annual Meeting. In so doing, we have exploited specialization, but in the process greatly reduced communication networks, and taken on a near tribal structure.
A useful analogy here is linearization. In accounting we linearize everything in sight: additive components on the balance sheet, linear cost functions, and the most glaring of all, the additive representation inherent in ABC, which by its mere structure denies the scope economy that causes the firm to jointly produce that set of products in the first place. Linearization denies interaction, denies synergy; and our recent propensity for multiple adjectives does precisely the same to us. We are doing to ourselves what we’ve done to our subject area. What, we might ask, happened to accounting? Indeed, I worry we will someday have a section specialized in depreciation or receivables or intangibles.
I hasten to add this particular tendency has festered for some time. Rick Antle, discussing the “Intellectual Boundaries in Accounting Research” at the ’88 meeting observed:
In carving out tractable pieces of institutionally defined problems, we inevitably impose intellectual boundaries. ... My concern arises when, instead of generating fluid, useful boundaries, our processes of simplification lead to rigid, dysfunctional ones. (6/89 Horizons, page 109).
I fear we have perfected and made a virtue out of Rick’s concern. Fluid boundaries are now held at bay by our work habits and natural defenses.
A final indicator is what appears to be coming down the road, our work in progress. Doctoral enrollment is down, a fact. It is also arguably factual that doctoral training has become tribal. I, personally, have witnessed this at recent Doctoral and New Faculty Consortia, and in our recruiting at UF. This reinforces the visible patterns in our textbooks, in our journals, and in our work habits.
Some Contributors
These patterns, of course, are not accidental. They are largely endogenous. And I think it is equally instructive to sketch some of the contributors.
One contributor is employers, their firms, and their professional organizations. Employers want and lobby for the student well equipped with the latest consulting fad, or the student well equipped to transition into a billable audit team member or tax consultant within two hours of the first day of employment. Immediacy is sought and championed, though with the caveat of critical-thinking skills somehow being added to the stew.
Continued in article
Jensen Comment
I agree with much of what Joel said, but I think he overlooks what I think is a
major problem in accounting scholarship. That major problem in my viewpoint is
the takeover of accountancy doctoral programs in North America where accounting
dissertations are virtually not acceptable unless they have equations ---
http://faculty.trinity.edu/rjensen/Theory01.htm#DoctoralPrograms
Recommendation 2 of the American Accounting Association Pathways Commission (emphasis added)
Scrapbook1083 --- http://faculty.trinity.edu/rjensen/TheoryTar.htm#Scrapbook1083
Promote accessibility of doctoral education by allowing for flexible content and structure in doctoral programs and developing multiple pathways for degrees. The current path to an accounting Ph.D. includes lengthy, full-time residential programs and research training that is for the most part confined to quantitative rather than qualitative methods. More flexible programs -- that might be part-time, focus on applied research and emphasize training in teaching methods and curriculum development -- would appeal to graduate students with professional experience and candidates with families, according to the report.
It has been well over a year in which I've scanned the media for signs of change. But in that time I've seen little progress and zero encouragement that accounting doctoral programs and our leading accounting research journals are going to change. It remains a necessary condition that an accounting doctoral dissertation or an Accounting Review article have equations to be acceptable.
Accounting scholarship in doctoral programs is still "confined to quantitative rather than qualitative methods." The main reason is simple. Quantitative research is easier.
My theory is that accountics science gained dominance in accounting research, especially in North American accounting Ph.D. programs, because it abdicated responsibility:
1. Most accountics scientists buy data, thereby avoiding the greater cost and drudgery of collecting data.
2. By relying so heavily on purchased data, accountics scientists abdicate responsibility for errors in the data.
3. Since adding missing variable data to the public database is generally not at all practical in purchased databases, accountics scientists have an excuse for not collecting missing variable data.
4. Software packages for modeling and testing data abound. Accountics researchers need only feed purchased data into the hopper of statistical and mathematical analysis programs. It still takes a lot of knowledge to formulate hypotheses and to understand the complex models. But the really hard work of collecting data and error checking is avoided by purchasing data.
"Some Thoughts About Accounting Scholarship," by Joel Demski, AAA President's Message, Accounting Education News, Fall 2001
http://aaahq.org/pubs/AEN/2001/Fall2001.pdf
. . .
A second indicator is our journals. They have proliferated in number. But we struggle with an intertemporal sameness, with incremental as opposed to discontinuous attempts to move our thinking forward, and with referee intrusion and voyeurism. Value relevance is a currently fashionable approach to identifying statistical regularities in the financial market arena, just as a focus on readily observable components of compensation is a currently fashionable dependent variable in the compensation arena. Yet we know measurement error abounds, that other sources of information are both present and hardly unimportant, that compensation is broad-based and intertemporally managed, and that compensating wage differentials are part of the stew. Yet we continue on the comfortable path of sameness.
It has been well over a year since the Pathways Report was issued. Nobody is listening on the AECM or anywhere else! Sadly the accountics researchers who generate this stuff won't even discuss their research on the AECM or the AAA Commons:
"Frankly,
Scarlett, after I get a hit for my resume in The Accounting Review I just
don't give a damn"
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm
One more mission in what's left of my life will be to try to change this
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm
Bob Jensen's threads on validity testing in accountics science ---
http://faculty.trinity.edu/rjensen/TheoryTAR.htm
How did academic accounting research
become a pseudo science?
http://faculty.trinity.edu/rjensen/theory01.htm#WhatWentWrong
Avoiding applied research for practitioners and failure to attract
practitioner interest in academic research journals ---
"Why business ignores the business schools," by Michael Skapinker
Some ideas for applied research ---
http://faculty.trinity.edu/rjensen/theory01.htm#AcademicsVersusProfession
Clinging to Myths in Academe and Failure to Replicate and Authenticate
Research Findings
http://faculty.trinity.edu/rjensen/theory01.htm#Myths
Poorly designed and executed experiments that are rarely, I mean very, very
rarely, authenticated
http://faculty.trinity.edu/rjensen/theory01.htm#PoorDesigns
Discouragement
of case method research by leading journals (TAR, JAR, JAE, etc.) by turning
back most submitted cases ---
http://faculty.trinity.edu/rjensen/000aaa/thetools.htm#Cases
Economic Theory Errors
Where analytical mathematics in accountics research made a huge mistake
relying on flawed economic theory and interval/ratio scaling
http://faculty.trinity.edu/rjensen/theory01.htm#EconomicTheoryErrors
Accentuate the Obvious and Avoid the Tough Problems (like fraud) for Which
Data and Models are Lacking
http://faculty.trinity.edu/rjensen/theory01.htm#AccentuateTheObvious
Financial Theory Errors
Where capital market research in accounting made a huge mistake by relying
on CAPM
http://faculty.trinity.edu/rjensen/theory01.htm#AccentuateTheObvious
Philosophy of Science is a Dying Discipline
Most scientific papers are probably wrong
http://faculty.trinity.edu/rjensen/theory01.htm#PhilosophyScienceDying
History of Quantitative Finance
"Four features in appreciation of the life and work of Benoit Mandelbrot,"
Simoleon Sense, February 3, 2011 ---
http://www.simoleonsense.com/four-features-in-appreciation-of-the-life-and-work-of-benoit-mandelbrot/
"Psychology’s
Treacherous Trio: Confirmation Bias, Cognitive Dissonance, and Motivated
Reasoning," by sammcnerney, Why We Reason, September 7, 2011 ---
http://whywereason.wordpress.com/2011/09/07/psychologys-treacherous-trio-confirmation-bias-cognitive-dissonance-and-motivated-reasoning/
Gasp! How could an accountics scientist question such things? This is
sacrilege!
Let me end my remarks with a question: Have Ball and
Brown (1968)—and Beaver (1968) for that matter, if I can bring Bill Beaver into
it—have we had too much influence on the research agenda to the point where
other questions and methods are being overlooked?
Phil Brown of Ball and Brown Fame
"How Can We Do Better?" by Philip R. Brown (of Ball and Brown Fame),
Accounting Horizons (Forum on the State of Accounting Scholarship),
December 2013 ---
http://aaajournals.org/doi/full/10.2308/acch-10365
Not Free
Philip R. Brown AM is an Honorary Professor at The University of New South Wales and Senior Honorary Research Fellow at The University of Western Australia.
I acknowledge the thoughtful comments of Sudipta Basu, who arranged and chaired this session at the 2012 American Accounting Association (AAA) Annual Meeting, Washington, DC.
The video presentation can be accessed by clicking the link in Appendix A.
Corresponding author: Philip R. Brown AM. Email: philip.brown@uwa.edu.au
When Sudipta Basu asked me whether I would join this panel, he was kind enough to share with me the proposal he put to the conference organizers. As background to his proposal, Sudipta had written:
Analytical and empirical researchers generate numerous results about accounting, as do logicians reasoning from conceptual frameworks. However, there are few definitive tests that permit us to negate propositions about good accounting.
This panel aims to identify a few “most wrong” beliefs held by accounting experts—academics, regulators, practitioners—where a “most wrong” belief is one that is widespread and fundamentally misguided about practices and users in any accounting domain.
While Sudipta's proposal resonated with me, I did wonder why he asked me to join the panel, and whether I am seen these days as just another “grumpy old man.” Yes, I am no doubt among the oldest here today, but grumpy? You can make your own mind on that, after you have read what I have to say.
This essay begins with several gripes about editors, reviewers, and authors, along with suggestions for improving the publication process for all concerned. The next section contains observations on financial accounting standard setting. The essay concludes with a discussion of research myopia, namely, the unfortunate tendency of researchers to confine their work to familiar territory, much like the drunk who searches for his keys under the street light because “that is where the light is.”
ON EDITORS AND REVIEWERS, AND AUTHORS
I have never been a regular editor, although I have chaired a journal's board of management and been a guest editor, and I appointed Ray Ball to his first editorship (Ray was the inaugural editor of the Australian Journal of Management). I have, however, reviewed many submissions for a whole raft of journals, and written literally hundreds of papers, some of which have been published. As I reflect on my involvement in the publications process over more than 50 years, I do have a few suggestions on how we can do things better. In the spirit of this panel session, I have put my suggestions in the form of gripes about editors, reviewers, and authors.
One-eyed editors—and reviewers—who define the subject matter as outside their journal's interests are my first gripe; and of course I except journals with a mission that is stated clearly and in unequivocal terms for all to see. The best editors and the best reviewers are those who are open-minded who avoid prejudging submissions by reference to some particular set of questions or modes of thinking that have become popular over the last five years or so. Graeme Dean, former editor of Abacus, and Nick Dopuch, former editor of the Journal of Accounting Research, are fine examples, from years gone by, of what it means to be an excellent editor.
Editors who are reluctant to entertain new ways of looking at old questions are a second gripe. Many years ago I was asked to review a paper titled “The Last Word on …” (I will not fill in the dots because the author may still be alive.) But at the time I thought, what a strange title! Can any academic reasonably believe they are about to have the last say on any important accounting issue? We academics thrive on questioning previous works, and editors and their reviewers do well when they nurture this mindset.
My third gripe concerns editors who, perhaps unwittingly, send papers to reviewers with vested interests and the reviewers do not just politely return the paper to the editor and explain their conflict of interest. A fourth concerns editors and reviewers who discourage replications: their actions signal a disciplinary immaturity. I am referring to rejecting a paper that repeats an experiment, perhaps in another country, purely because it has been done before. There can be good reasons for replicating a study, for example if the external validity of the earlier study legitimately can be questioned (perhaps different outcomes are reasonably expected in another institutional setting), or if methodological advances indicate a likely design flaw. Last, there are editors and reviewers who do not entertain papers that fail to reject the null hypothesis. If the alternative is well-reasoned and the study is sound, and they can be big “ifs,” then failure to reject the null can be informative, for it may indicate where our knowledge is deficient and more work can be done.^{1}
It is not only editors and reviewers who test my emotional state. I do get a bit short when I review papers that fail to appreciate that the ideas they are dealing with have long yet uncited histories, sometimes in journals that are not based in North America. I am particularly unimpressed when there is an all-too-transparent and excessive citation of works by editors and potential reviewers, as if the judgments of these folks could possibly be influenced by that behavior. Other papers frustrate me when they are technically correct but demonstrate the trivial or the obvious, and fail to draw out the wider implications of their findings. Then there are authors who rely on unnecessarily coarse “control” variables which, if measured more finely, may well threaten their findings.^{2} Examples are dummy variables for common law/code law countries, for “high” this and “low” that, for the presence or absence of an audit/nomination/compensation committee, or the use of an industry or sector variable without saying which features of that industry or sector are likely to matter and why a binary representation is best. In a nutshell, I fear there may be altogether too many dummies in financial accounting research!
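Brown's point about coarse controls can be illustrated with a small simulation. The Python sketch below is a hypothetical illustration only, not drawn from Brown's paper: all variable names and the data-generating process are assumptions. A continuous factor z drives both x and y, and x has no effect of its own on y. Controlling for z with a high/low dummy leaves a clearly positive coefficient on x, while controlling for z itself drives that coefficient toward zero. The partial coefficients are computed by the Frisch-Waugh-Lovell route of regressing residuals on residuals.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(v, control):
    """Residuals from an OLS regression of v on control (with intercept)."""
    b = slope(control, v)
    mv, mc = mean(v), mean(control)
    return [vi - mv - b * (ci - mc) for vi, ci in zip(v, control)]

def partial_slope(x, y, control):
    """Coefficient on x when y is regressed on x and control together
    (Frisch-Waugh-Lovell: regress residualized y on residualized x)."""
    return slope(residualize(x, control), residualize(y, control))

# Hypothetical data: z confounds x and y; the true effect of x on y is zero.
rng = random.Random(42)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]
z_dummy = [1.0 if zi > 0 else 0.0 for zi in z]  # "high" versus "low" z

coarse = partial_slope(x, y, z_dummy)  # control measured as a dummy
fine = partial_slope(x, y, z)          # control measured finely

print(f"coefficient on x, dummy control:      {coarse:.3f}")
print(f"coefficient on x, continuous control: {fine:.3f}")
```

The dummy removes only the between-group part of z, so the variation in z that remains within the "high" and "low" groups still loads onto x and produces a spurious finding.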
Finally, there are the International Financial Reporting Standards (IFRS) papers that fit into the category of what I describe as “before and after studies.” They focus on changes following the adoption of IFRS promulgated by the London-based International Accounting Standards Board (IASB). A major concern, and I have been guilty too, is that these papers, by and large, do not deal adequately with the dynamics of what has been for many countries a period of profound change. In particular, there is a trade-off between (1) experimental noise from including too long a “before” and “after” history, and (2) not accommodating the process of change, because the “before” and “after” periods are way too short. Neither do they appear to control convincingly for other time-related changes, such as the introduction of new accounting and auditing standards, amendments to corporations laws and stock exchange listing rules, the adoption of corporate governance codes of conduct, more stringent compliance monitoring and enforcement mechanisms, or changes in, say, stock market liquidity as a result of the introduction of new trading platforms and protocols, amalgamations among market providers, the explosion in algorithmic trading, and the increasing popularity among financial institutions of trading in “dark pools.”
ON FINANCIAL ACCOUNTING STANDARD SETTING
I count a number of highly experienced financial accounting standard setters among my friends and professional acquaintances, and I have great regard for the difficulties they face in what they do. Nonetheless, I do wonder
. . .
ON RESEARCH MYOPIA
A not uncommon belief among academics is that we have been or can be a help to accounting standard setters. We may believe we can help by saying something important about whether a new financial accounting standard, or set of standards, is an improvement. Perhaps we feel this way because we have chosen some predictive criterion and been able to demonstrate a statistically reliable association between accounting information contained in some database and outcomes that are consistent with that criterion. Ball and Brown (1968, 160) explained the choice of criterion this way: “An empirical evaluation of accounting income numbers requires agreement as to what real-world outcome constitutes an appropriate test of usefulness.” Note their reference to a requirement to agree on the test. They were referring to the choice of criterion being important to the persuasiveness of their tests, which were fundamental and related to the “usefulness” of U.S. GAAP income numbers to stock market investors 50 years ago. As time went by and the financial accounting literature grew accordingly, financial accounting researchers have looked in many directions for capital market outcomes in their quest for publishable results.
Research on IFRS can be used to illustrate my point. Those who have looked at the consequences of IFRS adoption have mostly studied outcomes they believed would interest participants in equity markets and to a lesser extent parties to debt contracts. Many beneficial outcomes have now been claimed,^{4} consistent with benefits asserted by advocates of IFRS. Examples are more comparable accounting numbers; earnings that are higher “quality” and less subject to managers' discretion; lower barriers to international capital flows; improved analysts' forecasts; deeper and more liquid equity markets; and a lower cost of capital. But the evidence is typically coarse in nature; and so often the results are inconsistent because of the different outcomes selected as tests of “usefulness,” or differences in the samples studied (time periods, countries, industries, firms, etc.) and in research methods (how models are specified and variables measured, which estimators are used, etc.). The upshot is that it can be difficult if not impossible to reconcile the many inconsistencies, and for standard setters to relate reported findings to the judgments they must make.
Despite the many largely capital market outcomes that have been studied, some observers of our efforts must be disappointed that other potentially beneficial outcomes of adopting IFRS have largely been overlooked. Among them are the wider benefits to an economy that flow from EU membership (IFRS are required),^{5} or access to funds provided by international agencies such as the World Bank, or less time spent by CFOs of international companies when comparing the financial performance of divisions operating in different countries and on consolidating the financial statements of foreign subsidiaries, or labor market benefits from more flexibility in the supply of professionally qualified accountants, or “better” accounting standards from pooling the skills of standard setters in different jurisdictions, or less costly and more consistent professional advice when accounting firms do not have to deal with as much cross-country variation in standards and can concentrate their high-level technical skills, or more effective compliance monitoring and enforcement as regulators share their knowledge and experience, or the usage of IFRS by “millions (of small and medium enterprises) in more than 80 countries” (Pacter 2012), or in some cases better education of tomorrow's accounting professionals.^{6} I am sure you could easily add to this list if you wished.
In sum, we can help standard setters, yes, but only in quite limited ways.^{7} Standard setting is inherently political in nature and will remain that way as long as there are winners and losers when standards change. That is one issue. Another is that the results of capital markets studies are typically too coarse to be definitive when it comes to the detailed issues that standard setters must consider. A third is that accounting standards have ramifications extending far beyond public financial markets and a much more expansive view needs to be taken before we can even hope to understand the full range of benefits (and costs) of adopting IFRS.
Let me end my remarks with a question: Have Ball and Brown (1968)—and Beaver (1968) for that matter, if I can bring Bill Beaver into it—have we had too much influence on the research agenda to the point where other questions and methods are being overlooked?
February 27, 2014 Reply from Paul Williams
Bob,
If you read that last Horizons section provided by "thought leaders" you realize the old guys are not saying anything they could not have realized 30 years ago. That they didn't realize it then (or did, but it was not in their interest to say so), which led them to run journals whose singular purpose seemed to be to enable them and their cohorts to create politically correct academic reputations, is not something to ask forgiveness for at the end of your career. Like the sinner on his deathbed asking for God's forgiveness, now is a hell of a time to suddenly get religion. If you heard these fellows speak when they were young, they certainly didn't speak with voices that adumbrated any doubt that what they were doing was rigorous research and anyone doing anything else was the intellectual hoi polloi.
Oops, sorry we created an academy that all of us now regret, but, hey, we got ours. It's our mess, but now we are telling you it's a mess you have to clean up. It isn't like no one was saying these things 30 years ago (you were, as well as others including yours truly), and we have intimate knowledge of how we were treated by these geniuses.
David Johnstone asked me to write a paper on the following:
"A Scrapbook on What's Wrong with the Past, Present and Future of Accountics
Science"
Bob Jensen
February 19, 2014
SSRN Download:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2398296
Abstract
For operational convenience I define accountics science as research that features equations and/or statistical inference. Historically, there was a heated debate in the 1920s as to whether the main research journal of academic accounting, The Accounting Review (TAR) that commenced in 1926, should be an accountics journal with articles that mostly featured equations. Practitioners and teachers of college accounting won that debate.
TAR articles and accountancy doctoral dissertations prior to the 1970s seldom had equations. For reasons summarized below, doctoral programs and TAR evolved to where, in the 1990s, having equations became virtually a necessary condition for a doctoral dissertation and acceptance of a TAR article. Qualitative normative and case method methodologies disappeared from doctoral programs.
What’s really meant by “featured equations” in doctoral programs is merely symbolic of the fact that North American accounting doctoral programs pushed out most of the accounting to make way for econometrics and statistics that are now keys to the kingdom for promotion and tenure in accounting schools ---
The purpose of this paper is to make a case that the accountics science monopoly of our doctoral programs and published research is seriously flawed, especially its lack of concern about replication and focus on simplified artificial worlds that differ too much from reality to creatively discover findings of greater relevance to teachers of accounting and practitioners of accounting. Accountics scientists themselves became a Cargo Cult.
http://faculty.trinity.edu/rjensen/Theory01.htm#DoctoralPrograms
Shielding Against Validity Challenges in Plato's Cave ---
http://faculty.trinity.edu/rjensen/TheoryTAR.htm
Common Accountics Science and Econometric Science Statistical Mistakes ---
http://www.cs.trinity.edu/~rjensen/temp/AccounticsScienceStatisticalMistakes.htm
The Cult of Statistical Significance:
How Standard Error Costs Us Jobs, Justice, and Lives ---
http://www.cs.trinity.edu/~rjensen/temp/DeirdreMcCloskey/StatisticalSignificance01.htm
How Accountics Scientists Should Change:
"Frankly, Scarlett, after I get a hit for my resume in The Accounting Review
I just don't give a damn"
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm
One more mission in what's left
of my life will be to try to change this
http://www.cs.trinity.edu/~rjensen/temp/AccounticsDamn.htm
What went wrong in accounting/accountics research? ---
http://faculty.trinity.edu/rjensen/theory01.htm#WhatWentWrong
The Sad State of Accountancy Doctoral
Programs That Do Not Appeal to Most Accountants ---
http://faculty.trinity.edu/rjensen/theory01.htm#DoctoralPrograms