# Applied Bayesian and Classical Inference (English)

The Case of The Federalist Papers

by Mosteller, F. & Wallace, D. L.

## 97,45 €

incl. VAT

## Table of Contents

Analytic Table of Contents.- 1. The Federalist Papers As a Case Study.- 1.1. Purpose.- To study how Bayesian inference works in a large-scale data analysis, we chose to try to resolve the problem of the authorship of the disputed Federalist papers..- 1.2. The Federalist papers.- The Federalist papers were written by Hamilton, Madison, and Jay. Jay's papers are known. Of the 77 papers originally published in newspapers, 12 are in dispute between Hamilton and Madison, and 3 may be regarded as joint by them. Historians have varied in their attributions..- 1.3. Early work.- Frederick Williams and Frederick Mosteller found that sentence length and its variability within papers did not discriminate. Tables 1.3-1, 2, 3, 4 show that they found some discriminating power in percentage of nouns, of adjectives, of one- and two-letter words, and of the's. Together these variables could have decided whether Hamilton or Madison wrote all the disputed papers, if that were the problem, but the problem is to make an effective assignment for each paper..- 1.4. Recent work: pilot study.- We call marker words those which one author often uses and the other rarely uses. Douglass Adair found while (Hamilton) versus whilst (Madison). We found enough (Hamilton) and upon (Hamilton); see Tables 1.4-1, 2 for incidence and rates. Tables 1.4-3, 4, 5 give an overview of marker words for Federalist and non-Federalist writings. Alone, they would not settle the dispute compellingly..- 1.5. Plots and honesty.- Some say that the dispute is not a matter of honesty but a matter of memory. Hamilton was hurried in his annotation by an impending duel, but Madison had plenty of time. Editing may be a hazard. We want to use many words as discriminating variables..- 1.6. The plan of the book.- 2. Words and Their Distributions.- 2.1. Why words?.- Hamilton and Madison use the same words at different rates, and so their rates offer a vehicle for discrimination.
Some words like by and to vary relatively little in their rates as context changes, others like war vary a lot, as the empirical distributions in the four tables show. Generally, less meaningful words offer more stability..- 2.2. Variation with time.- In Table 2.2-2, a separate study illustrated by Madison's rates for 11 function words over a 26-year period examines the stability of rates through time. We desire stability because we need additional text of known authorship to choose words and their rates for discriminating between authors. Among function words, some pronouns and auxiliary verbs seem unstable..- 2.3. How frequency of use varies.- For establishing a mathematical model, we need to find out empirically how rates of use by an author vary from one chunk of writing to another..- 2.4. Correlations between rates for different words.- Theoretical study shows that the correlation between the rates of occurrence for different words should ordinarily be small but negative. An empirical study whose results appear in Table 2.4-1 shows that these correlations are ordinarily negligible for our work..- 2.5. Pools of words.- Three pools of words produced potential discriminators..- 2.6. Word counts and their accuracies.- Some word counts were carried out by hand using slips of paper, one word per slip. Others were done by a high-speed computer which constructed a concordance..- 2.7. Concluding remarks.- Although words offer only one set of discriminators, one needs a large enough pool of potential discriminators to offer a good chance of success. We need to avoid selection and regression effects. Ideally we want enough data to get a grip on the distribution theory for the variables to be used..- 3. The Main Study.- In the main study, we use Bayes' theorem to determine odds of authorship for each disputed paper by weighting the evidence from words. Bayesian methods enter centrally in estimating the word rates and choosing the words to use as discriminators.
We use not one but an empirically based range of prior distributions. We present the results for the disputed papers and examine the sensitivity of the results to various aspects of the analysis..- After a brief guide to the chapter, we describe some views of probability as a degree of belief and we discuss the need for and the difficulties of such an interpretation..- 3.1. Introduction to Bayes' theorem and its applications.- We give an overview, abstracted from technical detail, of the ideas and methods of the main study, and we describe the principal sources of difficulties and how we go about meeting them..- 3.1A. An example applying Bayes' theorem with both initial odds and parameters known.- 3.1B. Selecting words and weighting their evidence.- 3.1C. Initial odds.- 3.1D. Unknown parameters.- 3.2. Handling unknown parameters of data distributions.- We begin to set out the components of our Bayesian analysis..- 3.2A. Choosing prior distributions.- 3.2B. The interpretation of the prior distributions.- 3.2C. Effect of varying the prior.- 3.2D. The posterior distribution of (?, ?).- 3.2E. Negative binomial.- 3.2F. Final choices of underlying constants.- 3.3. Selection of words.- The prior distributions are the route for allowing and protecting against selection effects in choice of words. We use an unselected pool of 90 words for estimating the underlying constants of the priors, and we assume the priors apply to the populations of words from which we developed our pool of 165 words. We then selectively reduce that pool to the final 30 words. We describe a stratification of words into word groups and our deletion of two groups because of contextuality..- 3.4. Log odds.- We compute the logarithm of the odds factor that changes initial odds to final odds and call it simply log odds.
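
As a rough illustration of the log odds in Section 3.4, the sketch below computes the log odds factor for one word under a Poisson model and sums it over words. The function names, the rates, and the paper length are hypothetical values chosen for illustration and are not taken from the book's tables. The book's main results also use a negative binomial model and posterior modal estimates, which this sketch does not attempt.

```python
import math

def poisson_log_odds(count, words_in_paper, rate_h, rate_m):
    """Natural-log odds factor (Hamilton : Madison) for one word.

    count          -- occurrences of the word in the paper
    words_in_paper -- paper length in words
    rate_h, rate_m -- assumed rates per 1000 words for each author
    """
    w = words_in_paper / 1000.0
    # log of Poisson(count; w*rate_h) / Poisson(count; w*rate_m);
    # the count! and w**count terms cancel in the ratio.
    return count * math.log(rate_h / rate_m) - w * (rate_h - rate_m)

def total_log_odds(counts, words_in_paper, rates):
    """Sum per-word log odds, treating the words as independent."""
    return sum(
        poisson_log_odds(counts[word], words_in_paper, rate_h, rate_m)
        for word, (rate_h, rate_m) in rates.items()
    )

# Hypothetical rates per 1000 words: (Hamilton, Madison).
example_rates = {"upon": (3.0, 0.2), "whilst": (0.05, 0.5)}
example_counts = {"upon": 0, "whilst": 1}
score = total_log_odds(example_counts, 2000, example_rates)
```

With these made-up inputs the score is negative, which in this sign convention points toward Madison. Final log odds would add the log of the initial odds to this factor.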
The computations use the posterior modal estimates as if they were exact and are made under the various choices of underlying constants and using both negative binomial and Poisson models..- 3.4A. Checking the method.- 3.4B. The disputed papers.- 3.5 Log odds by words and word groups.- 3.5A. Word groups.- 3.5B. Single words.- 3.5C. Contributions of marker and high-frequency words.- 3.6. Late Hamilton papers.- We assess the log odds for four of the late Federalist papers, written by Hamilton after the newspaper articles appeared and not used in any of our other analyses. The log odds all favor Hamilton, very strongly for all but the shortest paper..- 3.7. Adjustments to the log odds.- Through special studies, we estimate the magnitude of effects on the log odds of various approximations and imperfect assumptions underlying the main computations and results presented in Section 3.4. Percentage reductions in log odds are a good way to extrapolate from the special studies to the main study..- 3.7A. Correlation.- 3.7B. Effects of varying the underlying constants that determine the prior distributions.- 3.7C. Accuracy of the approximate log odds calculation.- 3.7D. Changes in word counts.- 3.7E. Approximate adjusted log odds for the disputed papers.- 3.7F. Can the odds be believed?.- 4. Theoretical Basis of the Main Study.- This chapter is a sequence of technical sections supporting the methods and results of the main study presented in Chapter 3. We set out the distributional assumptions, our methods of determining final odds of authorship, and the logical basis of the inference. We explain our methods for choosing prior distributions. We develop theory and approximate methods to explore the adequacy of the assumptions and to support the methods and the findings..- 4.1. The negative binomial distribution.- We review and develop properties of the negative binomial family of distributions..- 4.1A. Standard properties.- 4.1B. Distributions of word frequency.- 4.1C.
Parametrization.- 4.1D. Estimation.- 4.2. Analysis of the papers of known authorship.- We treat the choice of prior distributions, the determination of the posterior distribution, and the computational problem in finding posterior modes..- 4.2A. The data: notations and distributional assumptions.- 4.2B. Object of the analysis.- 4.2C. Prior distributions: assumptions.- 4.2D. The posterior distribution.- 4.2E. The modal estimates.- 4.2F. An alternative choice of modes.- 4.2G. Choice of initial estimate.- 4.3. Abstract structure of the main study.- We describe an abstract structure for our problem; we derive the appropriate formulas for our application of Bayes' theorem and give a formal basis for the method of bracketing the prior distribution. The treatment is abstracted both from the notation of words and their distributions and from numerical evaluations..- 4.3A. Notation and assumptions.- 4.3B. Stages of analysis.- 4.3C. Derivation of the odds formula.- 4.3D. Historical information.- 4.3E. Odds for single papers.- 4.3F. Prior distributions for many nuisance parameters.- 4.3G. Summary.- 4.4 Odds factors for the negative binomial model.- We develop properties of the Poisson and negative binomial families of distributions. The discussion of appropriate shapes for the likelihood ratio function may suggest new ways to choose the form of distributions..- 4.4A. Odds factors for an unknown paper.- 4.4B. Integration difficulties in evaluation of ?.- 4.4C. Behavior of likelihood ratios.- 4.4D. Summary.- 4.5. Choosing the prior distributions.- We give methods for choosing sets of underlying constants to bracket the prior distributions and we explore the effects of varying the prior on the log odds. Choices are based in part on empirical analysis but also on heuristic considerations of reasonableness, analogy, and tractability..- 4.5A. Estimation of ?1 and ?2: first analysis.- 4.5B. Estimation of ?1 and ?2: second analysis.- 4.5C. Estimation of ?3.- 4.5D.
Estimation of ?4 and ?5.- 4.5E. Effect of varying the set of underlying constants.- 4.5F. Upon: a case study.- 4.5G. Summary.- 4.6. Magnitudes of adjustments required by the modal approximation to the odds factor.- We study, by example, the effect of using the posterior mode as if it were exact. To make the assessment we develop some general asymptotic theory of posterior densities..- 4.6A. Ways of studying the approximation.- 4.6B. Normal theory for adjusting the negative binomial modal approximation.- 4.6C. Approximations to expectations.- 4.6D. Notes on asymptotic methods.- 4.7. Correlations.- We study the magnitudes of effects of erroneous assumptions: the effects of correlations between rates for different words..- 4.7A. Independence and odds.- 4.7B. Adjustment for a pair of words.- 4.7C. Example. The words upon and on.- 4.7D. Study of 15 word pairs.- 4.7E. Several words.- 4.7F. Further theory.- 4.7G. Summary.- 4.8. Studies of regression effects.- To study the adequacy of assumptions, we compare the performance of the log odds for the disputed papers with theoretical expectations..- 4.8A. Introduction.- 4.8B. The study of word rates.- 4.8C. Total log odds for the final 30 words.- 4.8D. Log odds by word group.- 4.8E. Theory for the Poisson model.- 4.8F. Theory for the negative binomial model.- 4.8G. Two-point formulas for expectations of negative binomial log odds.- 4.9. A logarithmic penalty study.- 4.9A. Probability predictions.- 4.9B. The Federalist application: procedure.- 4.9C. The Federalist application: the penalty function.- 4.9D. The Federalist application: numerical results.- 4.9E. The Federalist application: adjusted log odds.- 4.9F. The choice of penalty function.- 4.9G. An approximate likelihood interpretation.- 4.10. 
Techniques in the final choice of words.- This section provides details of a special difficulty, and its possible general value lies in illustrating how to investigate the effects of a split into two populations of what was thought to be a single population..- 4.10A. Systematic variation in Madison's writing.- 4.10B. Theory.- 5. Weight-Rate Analysis.- 5.1. The study, its strengths and weaknesses.- Using a screening set of papers, we choose words and weights to use in a linear discriminant function for distinguishing authors. We use a calibrating set to allow for selection and regression effects. A stronger study would use the covariance structure of the rates for different words in choosing the weights; we merely allow for it through the calibrating set. The zero-rate words also weaken the study because we have not allowed for length of paper as we have done in the main study and in a robust one reported later..- 5.2. Materials and techniques.- Using the pool of words described in Chapter 2, we develop a linear discriminant function
$$\tilde{y} = \sum_i W_i \tilde{x}_i,$$
where $W_i$ is the weight assigned to the $i$th word and $\tilde{x}_i$ is the rate for that word. The $W_i$ are chosen so that $\tilde{y}$ tends to be high if Hamilton is the author, low if Madison is. Ideally the weights are proportional to the difference between the authors' rates and inversely proportional to the sum of the variances. By a simplified and robust calculation, an index of importance of a word was created. We use it to cut the number of words used to 20..- 5.3 Results for the screening and calibrating sets.- The 20 words, their weights, and estimated importances are displayed in Table 5.3-1, upon being outstanding by a factor of 4. Table 5.3-2 shows the results of applying the weights to the screening set of papers. Hamilton's 23 average .87 and all exceed .40, while Madison's 25 average -.41 and all are below -.19. For the calibrating set Hamilton averages .92 and Madison -.38. The smallest Hamilton score is .31, and the largest Madison is .15 (zero plays no special role here)..- 5.4. Regression effects.- As a rough measure of separation, we use the number of standard deviations between the Hamilton and Madison means. For the whole set of 20 words, the separation regresses from 6.9 standard deviations in the screening set to 4.5 in the calibrating set. In Section 5.3, we see almost no change from screening to calibration set in the average separations; the loss comes from increased standard deviations. In a general way, as the groups of words become more contextual the regression effect is larger. Group 1, the word upon, actually gains strength from screening to calibration set..- 5.5. Results for the disputed papers.- After displaying the numerical outcome of the weight-rate discriminant function for the disputed papers in Table 5.5-1, we carry out two types of analyses, one based on significance tests and one based on likelihood ratios. In Table 5.5-2 we show two t-statistics and corresponding P-values for each paper, first for testing that the paper is a Hamilton paper, and second for testing that the paper is a Madison paper. We compute
$$
t_j = \frac{y - \bar{y}_j}{s_j \sqrt{1 + (1/n_j)}}
$$
where j = Hamilton or Madison, $y$ is the value for the disputed paper from Table 5.5-1, $\bar{y}_j$ and $s_j$ are the mean and standard deviation for author j in the calibrating set, and $n_j = 25$, the number of papers in each calibrating set. Except for paper 55, the P-values for the Hamilton hypotheses are all very small (less than .004); the P-values for the Madison hypotheses are large, the smallest being .087. Paper 55 is further from Madison than from Hamilton but both P-values are significant..- Table 5.5-3 gives log likelihood ratios for the joint and disputed papers, assuming normal distributions and using the means and variances in the calibrating set. To allow for the uncertainty in estimating the means and variances, conservative 90 per cent confidence limits are shown for the log likelihood ratio, and a Bayesian log odds is calculated using the t-distribution. Except for paper 55, which goes slightly in Hamilton's favor, the odds favor Madison for the disputed papers..- 6. A Robust Hand-Calculated Bayesian Analysis.- 6.1. Why a robust study?.- Because the main study leans on parametric assumptions and heavy calculations, we want a study to check ourselves that depends less on distributional assumptions and that has calculations that a human being can check. This robust approach, based on Bayes' theorem, naturally sacrifices information. It dichotomizes the observed frequency distributions of occurrences of words. For choosing and weighting words, it uses both a screening set and a validating set of papers..- 6.2. Papers and words.- Using a screening set of 46 papers of length about 2000 words, we selected the words shown in Table 6.2-2 for the robust Bayes study..- 6.3. Log odds for high-frequency words.- For each of the 64 high-frequency words, we divide the rates of the 46 papers in the 2000-word set into two equal parts, highs and lows. For each word, we form a 2 × 2 table for the high and for the low rates.
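
One way to read the dichotomized table of Section 6.3 is sketched below: for a single word, count each author's papers in the high and low halves, add 1.25 to every cell (the adjustment that section specifies), and take the ratio of the smoothed proportions. The table layout, the function name, and the example counts are assumptions made for illustration. The book's own arithmetic may differ in detail.

```python
import math

PRIOR_COUNT = 1.25  # added to every cell, per Section 6.3

def smoothed_log_odds(h_high, h_low, m_high, m_low, prior=PRIOR_COUNT):
    """Return (log odds if the rate is high, log odds if it is low).

    Arguments are the numbers of Hamilton (h_*) and Madison (m_*) papers
    whose rate for one word falls in the high or the low half.
    Positive values favor Hamilton.
    """
    h_total = h_high + h_low + 2 * prior
    m_total = m_high + m_low + 2 * prior
    p_high_h = (h_high + prior) / h_total
    p_high_m = (m_high + prior) / m_total
    log_odds_high = math.log(p_high_h / p_high_m)
    log_odds_low = math.log((1 - p_high_h) / (1 - p_high_m))
    return log_odds_high, log_odds_low

# Hypothetical counts for one word across 46 papers (23 high, 23 low).
high, low = smoothed_log_odds(h_high=20, h_low=3, m_high=3, m_low=20)
```

Because of the added 1.25, a cell with zero papers still yields finite odds, which is the practical reason for smoothing the counts.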
To estimate the odds (Hamilton to Madison) to be assigned to a word, we first add 1.25 to the count in each of the four cells of the word's 2 × 2 table. We explain the theoretical framework for this adjustment, which is based on a beta prior distribution. We use the adjusted counts to estimate the odds for the high and for the low rate for that word..- 6.4 Low-frequency words.- For low-frequency words, we use the probability of zero occurrence and must adjust the Hamilton-Madison odds according to the length of paper..- 6.5 The procedure for low-frequency words.- Following the theory of Section 6.6, this section explains the arithmetic leading to log odds for each word appropriate to the length of the paper. Ultimately we sum the log odds..- 6.6 Bayesian discussion for low-frequency words.- Theoretical development required for the procedure given in Section 6.5..- 6.7 Log odds for 2000-word set and validating set.- For each of the five groups of words in Table 6.2-2 and in total, Table 6.7-1 shows the log odds for each paper in the 2000-word set used to choose the words and create the odds. All Hamilton papers have positive log odds (averaging 14.0) and all Madison papers have negative log odds (averaging -14.2). Table 6.7-2 gives a more relevant assessment: the same information for the validating set of papers not used to develop the odds. The corresponding averages are 10.2 for 13 Hamilton papers and -8.2 for 18 Madison papers. One Hamilton paper has log odds of 0 or equivalently even odds of 1:1..- 6.8. Disputed papers.- Table 6.8-1 gives the detailed data parallel to the previous tables for the unknown papers. Only paper 55 is not ascribed to Madison. The strength of attribution is, of course, much weaker than in the main study..- 7. Three-Category Analysis.- 7.1. The general plan.- By categorizing rates into three categories (low, middle, and high) and estimating log odds for each category, we can get a score for each unknown paper.
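
The three-category scoring of Section 7.1 can be sketched as a lookup followed by a sum. The cut-points, the per-category log odds, and the rule for rates that fall exactly on a cut-point are all hypothetical here; the book's actual values are in its Table 7.2-2 and are not reproduced.

```python
def category(rate, low_cut, high_cut):
    """Classify a rate as 'low', 'middle', or 'high' (boundary rule assumed)."""
    if rate <= low_cut:
        return "low"
    if rate >= high_cut:
        return "high"
    return "middle"

def paper_score(rates, table):
    """Sum the per-word log odds selected by each word's rate category."""
    total = 0.0
    for word, (low_cut, high_cut, log_odds) in table.items():
        total += log_odds[category(rates[word], low_cut, high_cut)]
    return total

# Hypothetical entries: word -> (low cut, high cut, log odds by category).
# Positive log odds favor Hamilton, negative favor Madison.
example_table = {
    "upon": (0.5, 2.0, {"low": -1.5, "middle": 0.2, "high": 1.8}),
    "on": (4.0, 8.0, {"low": 1.0, "middle": 0.0, "high": -1.2}),
}
example_rates = {"upon": 0.0, "on": 9.1}  # rates per 1000 words
score = paper_score(example_rates, example_table)
```

Since only the category matters, an extreme rate counts no more than any other rate in the same category, which is how this kind of scoring limits the influence of outlying results.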
This study defends against outlying results and failures of assumptions though it does a crude job of handling zero frequencies..- 7.2 Details of method.- For a given word, the rates in 48 papers (23 Hamilton and 25 Madison) were ranked with the lowest 18 papers giving the cutoff for "low" and the highest 18 papers the cutoff for "high". Table 7.2-2 gives the cut-points so determined and the log odds for 63 words. To get a score for a paper, sum the log odds. Some special rules killed some words and pooled categories in others..- 7.3 Groups of words.- After applying the rules of Section 7.2, we had 63 words left, grouped as before by perceived degrees of contextuality..- 7.4 Results for the screening and calibrating sets.- The scoring system was applied to the screening set of papers. As shown in Table 7.4-1, all Hamilton papers scored positive averaging 20.54, all Madison negative averaging -31.24. To see the regression effect, the same scheme was applied to a calibrating set as shown in Table 7.4-2 with average log odds for Hamilton of 8.54 and for Madison of -19.30..- 7.5. Regression effects.- 7.5A. Word group.- For each word group we show in Table 7.5-1 the regression effect from screening to calibrating set. Generally speaking, the more the group is perceived as contextual, the greater its regression effect. The word upon improved from screening to calibrating set..- 7.5B. The regression effect by single words.- 7.6. Results for the joint and disputed papers.- As in the analysis of Chapter 6, all disputed papers but paper 55 lean strongly toward Madison, and that paper falls on the fence..- 8. 
Other Studies.- 8.1 How word rates vary from one text to another.- For 165 words we give rates in Table 8.1-1 from six sources: Hamilton, Madison, Jay, Miller-Newman-Friedman, Joyce's Ulysses, and the Bible..- 8.2 Making simplified studies of authorship.- To begin an authorship study we advise: Edit for quotations and special usage; make counts for separate pieces, using a list of words of moderate length; obtain the rates; assess variation and discard words; get statistical help if the problem is delicate; use natural groupings; use a high-speed computer; see Chapter 10 for some new variables..- 8.3 The Caesar letters.- As a little example, we explore the possibility that Hamilton, as opposed to someone else, wrote the Caesar letters. Table 8.3-1 shows the rates for 23 high-frequency words for the Caesar letters, and for Hamilton, for Madison, and for Jay. For 13 of the words, the Caesar rate differs from the Hamilton rate by two or more standard deviations under Poisson theory. If we apply the log odds computation of Chapter 3 for Hamilton versus Madison to the Caesar letters, we get -4.2, instead of positive log odds in the teens or twenties as we would expect if Hamilton were the author. The results are strongly against Hamilton, though not in favor of Madison, but of some unknown author..- 8.4 Further analysis of Paper No. 20.- Among the three papers we classified as having joint Hamilton-Madison authorship, paper No. 20 is most nearly on the fence. We hunted for Hamilton's contribution. Some Hamilton markers could be traced not to him but to the writing of Sir William Temple, from which Madison drew extensively for this paper. We abandoned the analysis..- 8.5 How words are used.- Joanna F. Handlin made an elaborate study of the various dictionary meanings of 22 marker words and probably. In 15 appearances of upon, Madison had 3 usages that Hamilton never used in 216 appearances.
Table 8.5-1 gives detailed data for the occurrences of 13 meanings of of in several papers for each author, a study carried out by Miriam Gallaher..- 8.6 Scattered investigations.- We hunted for useful pairs of words like toward-towards with little success. Use of comparatives and superlatives showed great variation. Words with emotional tone gave no discrimination. How Hamilton and Madison handled enumerations led nowhere. A study of conditional clauses failed because of unreliability in classification. Relating strength of discrimination to proportion of original material, although suggestive, was not useful. Length of papers offered some discrimination, but we feared it because of contextuality and because of newspaper constraints..- 8.7. Distributions of word-length.- The earliest discrimination analyses by Mendenhall used word length as a discriminator. Robert M. Kleyle and Marie Yeager display the distribution of word length for eight Hamilton and seven Madison papers in Table 8.7-1 and in three figures. The chi-squared statistic for goodness of fit in Table 8.7-2 shows so much variation that we cannot use it for discrimination. The Hamilton papers fit the Madison averages as well as do the Madison papers..- 9.
Summary of Results and Conclusions.- 9.1 Results on the authorship of the disputed Federalist papers.- Except for paper No. 55, the odds are strong for Madison in the main study. For No. 55 they are about 90 to 1 for Madison.- 9.2 Authorship problems.- Function words offer a fertile source of discriminators. Contextuality must be investigated. See also Chapter 10 for further variables.- 9.3 Discrimination problems.- A large pool of variables systematically explored may pay off when obvious important variables are not available. Contextual effects have counterparts in other situations. Selection effects must be allowed for..- 9.4 Remarks on Bayesian studies.- We recommend sensitivity studies made by varying the priors. We like priors that have an empirical orientation. Data distributions matter. We need simple routine Bayesian methods.- 9.5 Summing up.- We tracked the problems of Bayesian analysis to their lair and solved the problem of the disputed Federalist papers.- 10 The State of Statistical Authorship Studies in 1984.- 10.1 Scope.- We treat the time period since 1969, emphasizing prose disputes almost exclusively. This chapter discusses both technological advances and empirical studies.- 10.2 Computers, concordances, texts, and monographs.- The computer and its software leading to easy compilation of concordances have been the major technological advance. Scholars have produced several monographs but few statistical texts in stylistics..- 10.3 General empirical work.- Morton studies sentence length further, and like Ellegard, uses proportional pairs of words (the fraction that the occurrences of word U make up of the total occurrences of word U and word V). Morton introduces collocation variables to expand the number of potential discriminators. (A collocation consists of a keyword like in and has associated words that precede or succeed it.) The ratio of the number of times the associate word occurs with the keyword to the number of times the keyword appears is the measure of collocation. Position of a word in a sentence (especially first or last) offers additional discriminators.-
10.4 Poetry versus prose.- To examine a possible systematic difference between poetry and prose, Williams looks at the Shakespeare-Bacon controversy. He takes samples of Shakespeare (who wrote only poetry), Bacon (who wrote only prose), and as a control samples of both poetry and prose from Sir Philip Sidney. Williams uses words of length 3 and 4 as discriminators. Table 10.4-1 shows the comparisons. He concludes that poetry and prose produce differing distributions of word lengths, and that the difference between Shakespeare and Bacon could be regarded as a poetry-to-prose effect rather than an authorship effect.- 10.5 Authorship studies similar to the Junius or Federalist studies.- 10.5A And Quiet Flows the Don.- We review the dispute about the authorship of the Russian novel And Quiet Flows the Don. An anonymous critic, D*, in a book with preface by Solzhenitsyn, regards Mikhail Sholokhov, the reputed author, as having plagiarized much of the work of the anti-Bolshevik author Fyodor Kryukov, who died before publishing his work on the Don Cossacks. Roy A. Medvedev reviews the issues, concluding that Sholokhov probably had access to some Don Cossack writings.- 10.5B Kesari.- In discriminating between two possible authors of certain editorials published in the Indian newspaper Kesari, Gore, Gokhale, and Joshi use the variables word length, sentence length, and the rate of use of commas as discriminators. They reject the hypothesis that word length follows the log normal distribution. They find sentence length to be approximately log normal, but unfortunately unstable for material from the same author, and so not helpful. Their new variable, rate of use of commas, offers some discrimination.-
10.5C Die Nachtwachen.- The author of this pseudonymous romantic German novel has been hotly sought since its publication in 1804. Wickmann uses transition frequencies from one part of speech to another as discriminators and concludes that among several candidates only Hoffmann is a reasonable possibility.- 10.5D Economic history.- O'Brien and Darnell tackle six authorship puzzles from the field of economics. They use the collocation method and the first words of sentences to decide authorship in a book-length sequence of studies.- 10.6 Homogeneity problems.- In the simplest homogeneity problem, we have two pieces of text and we ask whether they were produced by the same author.- 10.6A Aristotle and Ethics.- Kenny analyzes two versions of a book
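
The two ratio variables defined in Section 10.3, proportional pairs and collocations, can be computed as in the sketch below. The tokenizer, the function names, and the sample sentence are assumptions for illustration. For simplicity the collocation ratio counts only an associate word that immediately follows the keyword, whereas the text allows associates that precede or succeed it.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def proportional_pair(tokens, u, v):
    """Occurrences of u as a fraction of the occurrences of u and v together."""
    n_u = tokens.count(u)
    n_v = tokens.count(v)
    if n_u + n_v == 0:
        return None  # neither word occurs, so the ratio is undefined
    return n_u / (n_u + n_v)

def collocation_ratio(tokens, keyword, associate):
    """Share of keyword occurrences immediately followed by the associate."""
    n_key = tokens.count(keyword)
    if n_key == 0:
        return None
    together = sum(
        1 for a, b in zip(tokens, tokens[1:])
        if a == keyword and b == associate
    )
    return together / n_key

sample = "In the end, the plan in question was kept in the drawer while others waited."
tokens = tokenize(sample)
in_the = collocation_ratio(tokens, "in", "the")            # 2 of 3 uses of "in"
while_share = proportional_pair(tokens, "while", "whilst")  # only "while" occurs
```

On a text this short the ratios are of course unstable; they are meant to be computed over whole papers and compared between candidate authors.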

Springer Book Archives