# Simple probability question (volume vs. probability)

Discussion in 'Scientific Statistics Math' started by golabidoon, Dec 28, 2007.

1. ### golabidoonGuest

Hello all,

Assume that unit cube in R^3 is randomly and uniformly sampled by N
points. Let's call this set of points S(N). Now consider a new sample
point X drawn from the same distribution. Denote the nearest element
of S(N) to X by Y. I would like to find the probability that the
distance between X and Y is below some constant D. The constants N and
D are given, i.e. the ultimate probability should hold for any S and X
and must only depend on N and D.

G.D.

golabidoon, Dec 28, 2007

2. ### David WinsemiusGuest

Musings of a former physics major: I am not sure that it is simple. My
subject line would have been "Distance to nearest neighbor in 3-space".
My first hunch was a Rayleigh distribution, but that distribution is
derived from distances to neighbors in a Gaussian distribution.

I found treatments of the one dimensional analogue of this problem in a
text on order statistics where the term given to the distances to the
next order statistic was "spacings". I would think, perhaps naively, that
an arbitrary additional point would have at most 1/2 of a "spacing
instance" to the nearest neighbor, and also perhaps naively, that the
distribution within that 1/2-distance would be uniform. (The 1-D spacings
are distributed as Poisson. So I my probability guesswork says the
nearest neighbor distance would also be Poisson, but I have not done the
math.)

I would have expected to find a worked solution in studies of gas
dynamics, electron physics, interstellar medium, or perhaps even in
physical systems where electrons are bumping into crystal defects ... but
I failed to find a reference that gave an immediate answer. Search terms
used without success in various combinations included "3 dimensional
uniform", "nearest neighbor", "spacings". You might try posting to one of
the theoretical physics newsgroups and see if my hunch, that this has
been solved in that domain, pans out.

David Winsemius, Dec 31, 2007

3. ### Robert DodierGuest

A couple of ideas. (1) Work with the square of the distance to
avoid a square root which would probably make life more difficult.
(2) The cumulative distribution function of the minimum of
a set of independent, identically distributed variables
{d(X, Y)^2 : Y in S(N)} is just (1 - F(u))^N where F is the cdf of
any one of them. (3) The pdf of d(X, Y)^2 is going to be
a convolution pdf(d1^2) * pdf(d2^2) * pdf(d3^2)
where d1 = x1 - y1, d2 = x2 - y2, d3 = x3 - y3.

Maybe you've already figured out that much. As suggested
elsewhere in this thread, it seems this problem must have been
solved already in some context ....

Hope this is useful in some way.

Robert Dodier

Robert Dodier, Dec 31, 2007
4. ### Herman RubinGuest

It is easier to look upon it the other way.

The distribution of X is uniform. For any position
x of X, let h(x,d) denote the volume of the intersection
of the sphere of radius d around x with the unit cube,
which is of volume 1. Then the probability that the
distance is greater than D is

int (1-h(x,D))^N dx,

integrating over the cube.

Herman Rubin, Dec 31, 2007
5. ### Robert DodierGuest

Oops, that should be 1 - (1 - F(u))^N.
I've spent some time noodling on this and made a little progress.
It appears the pdf of d(X, Y)^2 is a function defined piecewise,
with the piece for d^2 in [0, 1] being
(-3*x^2+16*x^(3/2)-6*%pi*x)/6+(sqrt(x)*(4*x+6*%pi)-6*%pi*x)/3
(there being a piece for [1, 2] and one for [2, 3] also, but so far
I am too lazy to work them out), as determined by computing
the convolution directly from the definition (f conv g)(x) =
integral(f(t) * g(x - t), t, -inf, inf),
with pdf(d1^2) being f(x) = 1/sqrt(x) - 1 (and g = f in this case).

The piece of the cdf corresponding to the pdf above appears to be
Pr(d1^2 < D) = (-5*D^3+sqrt(D)*(48*D^2+40*%pi*D)-45*%pi*D^2)/30
for D in [0, 1]. Incidentally I've computed the integrals with the
computer program Maxima; %pi is Maxima's symbol for 3.14....

Completing the solution is left as an exercise for the reader.
Dunno if I'll get around to it.

HTH

Robert Dodier

Robert Dodier, Jan 1, 2008
6. ### Scott HemphillGuest

I confirm your result for the cdf. Of course your D is the square of
the original poster's D. I used Mathematica to work on this, and have
arrived at a cdf for all three pieces. I'll use "d" for the original
poster's variable.

For d in [0,1]:

(8*d^5)/5 - d^6/6 + (4*d^3*Pi)/3 - (3*d^4*Pi)/2

For d in [1,Sqrt[2]]:

1/10 - d^2/2 + (3*d^4)/2 + d^6/3 -
(2*Sqrt[-1 + d^2]*(-2 + 9*d^2 + 8*d^4))/5 - Pi/2 + 3*d^2*Pi -
(8*d^3*Pi)/3 + 6*d^4*ArcSec[d]

For d in (Sqrt[2],Sqrt[3]]:

9/10 - (5*d^2)/2 - (3*d^4)/2 - d^6/6 +
(2*Sqrt[-2 + d^2]*(1 + 9*d^2 + 4*d^4))/5 - (17*Pi)/2 + 3*d^2*Pi +
(3*d^4*Pi)/2 - (16*d^3*ArcTan[Sqrt[d^2/(-2 + d^2)]])/3 +
8*ArcTan[(2 - d)/Sqrt[-2 + d^2]] + 8*ArcTan[(2 + d)/Sqrt[-2 + d^2]] -
6*(-3 + 2*d^2 + d^4)*ArcTan[Sqrt[-2 + d^2]] +
(16*d^3*ArcTan[Sqrt[d^2*(-2 + d^2)]])/3

This last expression is undefined for d==Sqrt[2], but it has the
correct limiting value. The "ArcTan"s in the expression give some
hope for simplification, but they didn't seem to collapse as well as
the ones that had existed in the middle expression prior to
simplification.

Scott

Scott Hemphill, Jan 3, 2008