# Identifying the most anomalous member of a set

Prof Wonmug, Sep 21, 2010

Prof Wonmug

I have a large database containing the text from various sources
(newspapers, magazines, books, etc.) from 1950-1999. The text is
broken down by decade (50s, 60s, 70s, 80s, 90s). I have a program that
counts the number of times each word occurs in the text in each decade.

Here are the counts for a few words:

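| 50s | 60s | 70s | 80s | 90s |
|---:|---:|---:|---:|---:|
| 120 | 0 | 0 | 0 | 0 |
| 23051 | 12 | 19 | 55 | 9 |
| 1 | 6032 | 501 | 2 | 28 |
| 537 | 25384 | 1544 | 818 | 220 |
| 2 | 220 | 2285 | 5 | 13 |
| 6 | 86 | 1322 | 12 | 58 |
| 1075 | 331 | 3266 | 11882 | 319 |
| 45 | 18 | 143 | 1541 | 16 |
| 1 | 0 | 3 | 0 | 4156 |
| 189 | 143 | 959 | 283 | 22541 |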

I am not sure what the correct terminology is. I would like to
calculate the "skew" in each of these numbers. That is, I would like
to know how unlikely it is to get "120" in the first row. I guess this
would be as compared to having the numbers evenly distributed across
the decades (24 24 24 24 24).

I would like to calculate this value for each number, but I am only
interested in the largest in each row.
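
To make the comparison concrete, here is a minimal Python sketch of one way to measure how far the largest count sits from an even split. It assumes each decade is equally likely (probability 1/k per cell) and uses a standardized residual, which is only one possible choice of "skew" measure; the function name `max_cell_z` is made up for illustration.

```python
import math

def max_cell_z(counts):
    """Expected count under an even split, and the standardized residual of the largest cell."""
    k = len(counts)            # number of decades (cells)
    n = sum(counts)            # total occurrences of the word
    expected = n / k           # count per decade if evenly distributed
    # Binomial standard deviation of a single cell with probability 1/k.
    sd = math.sqrt(n * (1 / k) * (1 - 1 / k))
    return expected, (max(counts) - expected) / sd

expected, z = max_cell_z([120, 0, 0, 0, 0])
print(expected, round(z, 2))  # 24.0 21.91
```

For the first row the largest count is about 22 standard deviations above the even-split expectation of 24, so it is extremely unlikely under that assumption.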

Prof Wonmug, Sep 21, 2010

Ray Koopman

The p-values for the maximum in each row are all infinitesimal:

5.34 * 10^-106
1.66 * 10^-19901
2.65 * 10^-4607
1.15 * 10^-18450
2.29 * 10^-1705
1.85 * 10^-963
3.75 * 10^-5824
1.80 * 10^-1089
3.47 * 10^-3607
1.95 * 10^-17670

In the notation of

Fuchs, C., & Kenett, R. (1980). A test for detecting outlying cells
in the multinomial distribution and two-way contingency tables.
Journal of the American Statistical Association 75, 395-398,

the p-values were obtained by setting M+* to the observed max Zi,
and inverting the upper bound for M+* in equation 3.6.
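
For readers without the paper to hand, the calculation can be sketched as below. This is reconstructed from the description above and from the structure of the code in this post, assuming equal cell probabilities 1/k; the symbols f_i (cell counts), k (number of cells), n (row total), and Φ (standard normal CDF) are labels chosen here, and the exact form of equation 3.6 should be checked against the paper itself.

$$
Z_i = \frac{f_i - n/k}{\sqrt{n\,(1/k)(1 - 1/k)}} = \frac{k f_i - n}{\sqrt{n(k-1)}},
\qquad
p \le k\,\bigl[1 - \Phi(\max_i Z_i)\bigr] = \frac{k}{2}\,\operatorname{erfc}\!\left(\frac{\max_i Z_i}{\sqrt{2}}\right)
$$

The factor k is a Bonferroni-type correction for having picked the largest of k cells.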

For the record, the Mathematica code I used was

px[f_?VectorQ] := With[{k = Length@f, n = Total@f},
  Erfc[(k*Max@f - n)/Sqrt[2n(k-1.)]]*k/2 ]
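
Outside Mathematica, p-values this small underflow ordinary floating point, so a rough Python equivalent has to work in log space. The sketch below assumes `k` is the number of cells, `n` the row total, and that the statistic uses the row maximum. The function names, the cutoff of 30, and the three-term asymptotic expansion of the normal tail are choices made here for illustration, not part of the original code.

```python
import math

def log10_p_max(counts):
    """log10 of k * P(Z >= z_max) for the largest cell, assuming equal cell probabilities."""
    k = len(counts)
    n = sum(counts)
    z = (k * max(counts) - n) / math.sqrt(n * (k - 1))
    if z < 30:
        # erfc is still representable as a float in this range.
        tail = 0.5 * math.erfc(z / math.sqrt(2))
        return math.log10(k * tail)
    # Asymptotic expansion of the upper tail: Q(z) ~ phi(z)/z * (1 - 1/z^2 + 3/z^4)
    log_tail = (-0.5 * z * z - math.log(z) - 0.5 * math.log(2 * math.pi)
                + math.log1p(-1 / z**2 + 3 / z**4))
    return (math.log(k) + log_tail) / math.log(10)

def sci(log10_p):
    """Format a log10 value as 'mantissa * 10^exponent'."""
    exponent = math.floor(log10_p)
    return f"{10 ** (log10_p - exponent):.2f} * 10^{exponent}"

print(sci(log10_p_max([120, 0, 0, 0, 0])))       # 5.34 * 10^-106
print(sci(log10_p_max([23051, 12, 19, 55, 9])))  # 1.66 * 10^-19901
```

These two rows reproduce the first two p-values listed above.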

Ray Koopman, Sep 21, 2010