# Separating two normal distributions


Let's say you have two groups of items, "good" ones and "bad" ones. You perform some test on them, and each item gets a score, where a high score indicates good. Unfortunately the test is not perfect: some of the good ones get lower scores than some of the bad ones. Take this XY graph for example, where the red ones are the good ones:

http://forums.lavag.org/index.php?act=attach&type=post&id=5312

You're going to separate the items into two piles based on their score. Putting a bad item in the "good" pile has a cost of 1, while putting a good item in the "bad" pile has a cost of R (0 < R < Inf). Where would you draw the line, and how would you program it in LabVIEW?
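One way to write the cost down, as a sketch only: the symbols here are assumptions and not from the original problem statement. Let $t$ be the cutoff (scores at or above $t$ go in the "good" pile), $F_{\text{good}}$ and $F_{\text{bad}}$ the cumulative distribution functions of the two groups' scores, $f_{\text{good}}$ and $f_{\text{bad}}$ their densities, and $N_{\text{good}}$, $N_{\text{bad}}$ the group sizes.

$$
C(t) = N_{\text{bad}}\,\bigl[1 - F_{\text{bad}}(t)\bigr] + R\,N_{\text{good}}\,F_{\text{good}}(t)
$$

$$
C'(t) = -N_{\text{bad}}\,f_{\text{bad}}(t) + R\,N_{\text{good}}\,f_{\text{good}}(t) = 0
\quad\Longrightarrow\quad
\frac{f_{\text{bad}}(t^{*})}{f_{\text{good}}(t^{*})} = R\,\frac{N_{\text{good}}}{N_{\text{bad}}}
$$

The first term counts bad items that score above the cutoff, the second counts good items that score below it, weighted by R. The stationarity condition says the line belongs where the two weighted densities cross, although a crossing can also be a local maximum, so it still has to be checked against the cost itself.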

Has anyone here already done this - and if not, would you be interested in receiving my solution, once I get one?

So far, I'm planning to pretend (and I do mean pretend) that my distributions are normal, and to use the "Normal CDF" (Cumulative Distribution Function) VI from the math palette. On each iteration, I'll take some guesses about where the cutoff should be and calculate the cost of each one using the Normal CDF. And here's the tricky part: I want to start with three or more guesses that straddle the right answer, fit a curve through these points, and use the "Brent with Derivatives 1D" optimization. The only way I can think of to guarantee this "straddle" is to test some fairly wild guesses: say, the low mean minus lots of standard deviations, and the high mean plus lots of standard deviations. If one of these wild guesses turns out to be the best initial guess, better even than a slightly less wild guess, then I simply declare it to be the "right" answer and skip the optimization.
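Since a LabVIEW diagram can't be pasted as text, here is a rough Python sketch of the same plan. It assumes both groups are normal and, as an extra assumption not stated above, lets the groups be weighted by hypothetical counts `n_good` and `n_bad`. `statistics.NormalDist.cdf` stands in for the Normal CDF VI, and a coarse grid scan followed by a golden-section search stands in for "Brent with Derivatives 1D". The function names `expected_cost` and `best_cutoff` are made up for illustration.

```python
from statistics import NormalDist


def expected_cost(cutoff, good, bad, r, n_good=1.0, n_bad=1.0):
    """Cost of sending every score >= cutoff to the "good" pile."""
    false_accepts = n_bad * (1.0 - bad.cdf(cutoff))   # bad items in the good pile, cost 1 each
    false_rejects = n_good * good.cdf(cutoff)         # good items in the bad pile, cost r each
    return false_accepts + r * false_rejects


def best_cutoff(good, bad, r, n_good=1.0, n_bad=1.0, n_grid=201, tol=1e-8):
    """Grid scan to bracket the minimum, then golden-section search to refine it."""
    def cost(t):
        return expected_cost(t, good, bad, r, n_good, n_bad)

    # "Wild" outer guesses: six standard deviations beyond either mean (assumed width).
    lo = min(good.mean - 6 * good.stdev, bad.mean - 6 * bad.stdev)
    hi = max(good.mean + 6 * good.stdev, bad.mean + 6 * bad.stdev)
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]
    costs = [cost(t) for t in grid]
    i = min(range(n_grid), key=costs.__getitem__)

    # If a wild endpoint wins, accept it without refining, as described above.
    if i == 0 or i == n_grid - 1:
        return grid[i]

    # Golden-section search on the bracket around the best grid point.
    a, b = grid[i - 1], grid[i + 1]
    inv_phi = (5 ** 0.5 - 1) / 2
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if cost(c) < cost(d):
            b = d
        else:
            a = c
    return (a + b) / 2
```

For example, with good scores distributed as N(2, 1), bad scores as N(0, 1) and R = 1, `best_cutoff(NormalDist(2, 1), NormalDist(0, 1), 1.0)` should land at about the midpoint, 1.0. Raising R above 1 pushes the cutoff lower, since rejecting a good item then costs more than accepting a bad one.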

Seems similar to signal-noise separation problems. Have you tried googling "signal noise separation"? You may also want to read some textbooks on the issue; however, I'm not that familiar with the topic, so I can't suggest any.

Tomi

After a minute of further thinking: why not change the algorithm that does the classification so that it goes all the way, so that you don't have to run a classification algorithm first and then classify its result? You can use Bayesian statistics to determine exactly the probability that a particular point belongs to each of your two classes. I guess, however, that this requires you to study some math... For quite an easy introduction to this kind of math, see for example D. S. Sivia, Data Analysis: A Bayesian Tutorial.

Tomi
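As a hedged illustration of the Bayesian suggestion, the sketch below computes the probability that a score came from the "good" group and then picks the pile with the lower expected cost. It borrows the normal-distribution assumption from the original question. The prior `prior_good` and the function names are hypothetical, introduced only for this example.

```python
from statistics import NormalDist


def posterior_good(score, good, bad, prior_good=0.5):
    """P(good | score) by Bayes' rule, with normal likelihoods (an assumption)."""
    w_good = prior_good * good.pdf(score)
    w_bad = (1.0 - prior_good) * bad.pdf(score)
    # Note: far out in the tails both weights can underflow to zero; not handled here.
    return w_good / (w_good + w_bad)


def classify_as_good(score, good, bad, r, prior_good=0.5):
    """True if the "good" pile has the lower expected cost for this score."""
    p = posterior_good(score, good, bad, prior_good)
    cost_if_called_good = (1.0 - p) * 1.0   # it was really bad: cost 1
    cost_if_called_bad = p * r              # it was really good: cost r
    return cost_if_called_good < cost_if_called_bad
```

Under these assumptions an item goes in the "good" pile whenever its posterior probability of being good exceeds 1 / (1 + R), so a larger R accepts items with weaker evidence.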

Thanks guys. I will do some reading as you suggest.

I did come up with a solution, which assumes that the distributions are normal distributions. It's kinda ugly but it converges in a reasonable number of iterations nonetheless. If anyone wants to use it, just ask.
