separating two normal distributions

torekp · March 28, 2007

Let's say you have two groups of items, "good" ones and "bad" ones, and you perform some test on them and each item gets a score from the test. Let's say a high score indicates good. Unfortunately the test is not perfect: some of the good ones get lower scores than some of the bad ones. Take this XY graph for example, let's say the red ones are the good ones:

http://forums.lavag.org/index.php?act=attach&type=post&id=5312

You're going to separate the items into two piles based on their score. Putting a bad item in the "good" pile has a cost of 1, while putting a good item in the "bad" pile has a cost of R (0<R<Inf). Where would you draw the line - and how to program it in Labview?

Has anyone here already done this - and if not, would you be interested in receiving my solution, once I get one?

So far, I'm planning to pretend (and I do mean pretend) that my distributions are normal distributions, and use the "Normal CDF" (Cumulative Distribution Function) vi from the math palette. On each iteration, I'll take some guesses about where the cutoff should be and calculate the cost of each one using the Normal CDF. And here's the tricky part: I want to start with three+ guesses about where the cutoff will be that straddle the right answer, fit a curve through these points, and use "Brent with Derivatives 1D" optimization. Only, the only way I can think of to guarantee this "straddle" is to test some fairly wild guesses - say, the low mean minus lots of standard dev's, and the high mean plus lots of stdev's. If one of these turns out to be the best initial guess, and better than a slightly less wild guess, then I simply declare it to be the "right" answer and don't do any optimization.

Tomi Maila · March 28, 2007

Seems similar to signal-noise separation problems. Have you tried to google for "signal noise separation"? You may want also to read some text books on the issue, however I'm not that familiar with the topis so I cannot suggest any.

Tomi

After a minute of further thinking, why not to change the algorithm that does the classification so that it really does the classification all the way so that you didn't have to first use classification algorithm and then classify the result of classification algorithm. You can use Bayesian statistics to determine exactly the probability for a particular point belonging to one of your two classes. I guess however that this requires you to study some math... For quite an easy introduction to this kind of math, see for example D. S. Sivia: Data Analysis; A Bayesian Tutorial.

Tomi

syrus · March 28, 2007

QUOTE(torekp @ Mar 27 2007, 04:34 AM)

Let's say you have two groups of items, "good" ones and "bad" ones, and you perform some test on them and each item gets a score from the test. Let's say a high score indicates good. Unfortunately the test is not perfect: some of the good ones get lower scores than some of the bad ones.
You're going to separate the items into two piles based on their score. Putting a bad item in the "good" pile has a cost of 1, while putting a good item in the "bad" pile has a cost of R (0<R<Inf). Where would you draw the line - and how to program it in Labview?

Has anyone here already done this - and if not, would you be interested in receiving my solution, once I get one?

This is a pattern classification problem of the statistical machine learning variety. If you have the ability to collect a large amount of labeled examples, and the data source is stationary, then you can apply standard regression techniques to find an optimal parametric solution (e.g. if you assume normal distributions) or an optimal non-parametric solution (e.g. using a neural network approach).

Your goal is to approximate p("good"|x ) where x is an n-dimensional data vector and p("good"|x ) is the probability that the data example is in class "good" given that x . In your diagram, x is a 2-dimensional vector. [in the two-class case, p("bad"|x ) = 1 - p("good"|x ), so it is not necessary to explicitly approximate both quantities.]

Because you have an uneven cost function, you will add an additional weighting parameter to account for the different, i.e. greater, cost of labeling "bads" as "goods". If the cost was the same in each case, you would get the Bayes optimal classification accuracy by choosing the class with the maximum posterior probability, i.e. choose "good" if p("good"|x )>p("bad"|x ).

Reference: Neural Networks for Pattern Recognition by Chris Bishop (http://research.microsoft.com/%7Ecmbishop/nnpr.htm ''>http://research.microsoft.com/%7Ecmbishop/nnpr.htm ' target="_blank">http://research.microsoft.com/%7Ecmbishop/nnpr.htm)

torekp · April 3, 2007

Thanks guys. I will do some reading as you suggest.

I did come up with a solution, which assumes that the distributions are normal distributions. It's kinda ugly but it converges in a reasonable number of iterations nonetheless. If anyone wants to use it, just ask.

Sign In

separating two normal distributions

Recommended Posts

torekp

Tomi Maila

syrus

torekp

Join the conversation

Browse

Activity

Important Information