LV OOP Kd-Tree much slower than .Net, brute force

jkflying · February 20, 2012

I am writing a performance-sensitive application which requires the use of a nearest-neighbor lookup. Originally I used a brute-force method, but unfortunately this gets to be very slow as the data size increases . I have a point cloud of ~100k points in 2D, and need around 50k nearest neighbor lookups per second as a minimum performance requirement.

As a solution I wrote a kd-tree in .Net and used LabView to call the .Net dll. However, I discovered that each .Net transaction carries with it about a 0.5ms delay. I've tried bunching the data up into groups, but this only helps so much, as I am using an iterative process.

Armed with my new-found kd-tree knowledge, I then wrote the kd-tree in LV-OOP, using DVRs for both subtrees and leaf values. However, my LV implementation is still 100x slower than the .Net implementation, and 20x slower than brute force. And this is with just 10k points.

I'm fairly new to LV (about 6 months in an academic environment) and I'm fairly sure I've made a massive blunder somewhere, but I don't have any idea what it might be.

http://robowiki.net/...d-tree_Tutorial is the tutorial that I used for writing both trees - note that I've only implemented a 1-NN lookup method, so have no need for the priority queue.

Just some notes:

I found using in-place data structure unbundle-bundle was much faster than using normal unbundle-bundle for all of the read/writes.
The tree started out with pure by-value subtrees and data, so was even slower before I changed to DVRs.
The lookup uses a queue as a stack, rather than being recursive. Not sure if this is good or bad.
The add element uses recursion. Again, not sure if this is good or bad.

I've written speed tests for brute force, .Net and LV lookups with names like with test_...... .vi if you want to compare performance.

kd_tree Folder.zip

Thanks in advance for any help

Julian Kent

asbo · February 20, 2012

Depending on the size of your point data, does it make sense to use LVOOP? I have no benchmarks to point to, but I imagine for small data, the overhead of LVOOP will outweigh the performance of, say, a cluster of {prev, data, next} that you don't get the nice OOP-ness with. There will probably be more legwork with that route (and you may have to be more careful with memory allocations) but I think for the performance spec you're looking to hit, the simple route may be more effective.

ShaunR · February 20, 2012

Depending on Irrespective of the size of your point data, does it make sense to use LVOOP? I have no benchmarks to point to, but I imagine for small data, the overhead of LVOOP will outweigh the performance of, say, a cluster of {prev, data, next} for loop that you don't get the nice OOP-ness with. There will probably be more less legwork with that route (and you may won't have to be more careful with memory allocations) but I think for the performance spec you're looking to hit, the simple route may be more effective.

....and get rid of the queues.

asbo · February 20, 2012

....and get rid of the queues.

:lol:

sam · February 20, 2012

just my 2¢: You can always do (translate) a recursive algorithm with a while loop. With above suggestion of clusters instead of Classes and while loop you should gain significant speeds.

Never mind my post, I looked at the code after I posted. and I don't see a Recursive implementation.

Edited February 20, 2012 by sam

Aristos Queue · February 21, 2012

Hi, jkflying. Welcome to LabVIEW.

You've now heard the position of the "OO is bad" crowd. If you'd like to rewrite without classes, go for it, but that's not the bulk of your performance problem. If you're like 90% of the world's programmers, you'd probably prefer some help getting the OO solution up to speed. So let's look at what you've got.

There's a lot of text in this post, but I've tried to break it into short paragraphs -- each one is independent, so you can read each one and absorb it by itself without having to take in all of this in one go.

First of all, remove the Timed Loops from your test harnesses. Timed Loops have side effects on the thread scheduling and prioritization of LabVIEW. Lots of things that could be faster are not in order to get determinism within the timed loop (in other words, repeatable schedule is more important than fast schedule on any given iteration). I do not understand the details well enough to teach them to you -- it's an area of LV outside my normal baliwick -- but the long and short of it is that Timed Loops are meant to provide fixed scheduling for real-time deployments, and using them for other purposes will have undesirable side-effects. Instead of timed loops, you can use a Flat Sequence Structure, with a Get Date/Time In Seconds node in the first frame, your code in the second frame, and another Get Date/Time In Seconds node in the third frame.

Now, about the benchmark itself... any language has problems where it just cannot outcompete another given language, but I do not think we are dealing with one of those in your case. One major item to consider: is the .NET implementation thread safe? A lot of the overhead in your code implemented in LV is going into thread safety for that kd tree to be used in parallel branches simultaneously... I'm going to guess that the .NET code is not thread safe if only because almost nothing that C# programmers write is thread safe (haven't looked at your implementation). Unless the .NET solution is doing the same level of mutex locking, you're benchmark comparisons are skewed. We can get the LV implementation up to speed, but this is one of the common problems with comparisons between LV and other languages: as long as everything is dataflow, our compares match theirs. Once you introduce references, we implicitly buy the cost of thread safety, where as they leave you to implement that yourself if you decide you need it. What that often means is that small test apps used for benchmarks suffer because we've got overhead, but real world apps are faster because we then use that thread safety to achieve parallel performance. Please keep that in mind when evaluating.

Looking at your code... you use LabVIEW Object as the type of your DVRs, which is the reason you need all the To More Specific and Preserve Run-Time Class primitives. Introduce a parent class (kdtreeparent.lvclass) that has the methods on it that you need (dynamic dispatch). Have kdtree.lvclass inherit from it, and use kdtreeparent.lvclass inside the DVRs. Now when you access a DVR inside the Inplace Element (IPE) structures you avoid the fairly expensive dynamic type checking. The dynamic type checking is getting a speed boost in LV 2012, but it is better to remove it entirely.

Using the queues as stacks is fine... they're optimized for that use case. However, you should definitely not name the queue... you've got the name "Stack" wired in. That means that you're using the same queue as anyone else who uses the name "Stack". It's just like putting a variable in the global namespace. Bad things happen. There are legitimate reasons for using this feature -- ok, one legitimate reason -- but this isn't it.

Inside addExemplar.vi, you're using Read Left.vi, Read Right.vi and other accessors. Although it is normally fine practice to use your accessors inside the other methods of your class, in this case, I suggest using a plain Unbundle node (not the IPE) to access those fields. The reason is that you're extracting multiple fields and you'll be able to unbundle multiple at once.

Read Domain.vi, as written and used inside addExemplar.vi, is very expensive (and, side note to previous posters, it would also be expensive with plain clusters). You are calling Read Domain.vi and then wiring its output directly to an Index Array node. That's producing a duplicate of the array every time. Instead, create an accessor on Exemplar.lvclass that accesses the domain inside the IPE, indexes one element, and returns that out of the subVI. With that, you'll only be copying a single floating point number instead of allocating a whole array.

In the isTree.vi, you are doing the "!= infinity" check outside of the IPE. This is copying the floating point integer. You can avoid that by moving the "!= infinity" computation inside the IPE. Similar performance improvement can be done in other VIs in your hierarchy.

I am confused about why AddToTree is destroying the DVR and then creating a new one instead of just accessing the DVR through an IPE. If this is a trick to avoid initializing the class (which is my guess as to why you're doing that), you've just introduced a significant thread synchronization problem. If you fork your tree wire on the top-level VI, when you call Add on one branch, you'll invalidate the DVR refnum that is still inside the tree on the bottom branch. And now you'll get errors from that other branch, or just lose all your data. Probably not what you want. This should just access the DVR using an IPE and you should have a Create Tree.vi that returns the object populated with the top-level DVR and a matching Delete Tree.vi which knows how to traverse and deallocate the entitre tree and throw away the top-level DVR.

On that same note, you really should include this in your test cases:

That will show you a big problem with your current implementation. It is a bad idea to put the meta data about the contents of a DVR as by value fields of your top-level class (or cluster) and have the refnums shared. Putting the whole data block into a single top-level DVR makes this truly behave like a refnum type. What that means is that all of your public VIs will all start off by accessing a DVR in order to get access to all the fields that are currently in your kdtree.lvclass private data cluster. Oh... and change your wire color to refnum cyan (it's a predefined color in the color picker) to let people know that your wire contains (and therefore behaves like) a refnum.

Those are the big points. You didn't pick an easy problem for your first out-of-the-gate project. A basic KD tree is kids stuff... a thread safe KD tree? That takes a bit more to reason through. You've got a good start here, though. I have seen DVR-with-classes implementations for other data structures that can beat or exceed .NET thread-safe structures performance, so if you keep chipping at it, I'm confident you can get there.

Happy wiring!

mje · February 21, 2012

Being a chemist rather than an engineer, I can't say I really know the finer details of what differentiates a k-d from a red-black or a spruce, but those are all very good tips which can take years of experience to learn. They are valuable to know in any context where performance matters.

I for one never thought of the ramifications of using the timed loop. Interesting.

jkflying · February 21, 2012

asbo, ShaunR:

How would you suggest I place clusters within clusters within clusters within... (dynamically)? I'm desperate for speed, and if it means sacrificing OOP design I'm willing to try it, but as far as I can tell LV specifies cluster contents pretty rigidly. Doing a prev/next doesn't work in a kd tree because we have a large number of dimensions... and implicit kd trees will give this type of structure but use huge amounts of memory (curse of dimensionality and all that). As you'll see in my project, I've already tried a simple brute force method with a simple array subtract/square/sum/min for finding the nearest point, and if it isn't fast enough with raw arrays it definitely won't be fast enough with clusters. Unless I've completely misunderstood you...

sam:

The recursive implementation is simply because I can't think of any other way to move into a tree of unknown depth in a data-flow environment without doing a copy of each subtree as we go. Sure,i could use a while loop, but it would be really slow as it would require copying each subtree onto the stack as I go.

Aristos Queue

Thanks for the feedback. I've gone through and implemented the ideas you've given, excepting the parent_kdtree class, and I've almost doubled my performance, but it's still 50x slower than the .Net. As I'm not really concerned about concurrent access, would you then recommend getting rid of the DVR-on-a-stack technique and instead go for a recursive unbundle/bundle without any DVRs at all then?

I thought the casting might be slow so I tried changing the data type in the queue to just straight kdtree DVRs and it slowed me down about 4x. Not sure why, but it seems that is the wrong thing to do.

mje: A kd-tree is a tree that works in more than one dimension, ie. for a k-dimensional point cloud. It is typically used for fast nearest-neighbor lookup. Check out http://en.wikipedia.org/wiki/K-d_tree. The one described on Wikipedia isn't quite what I'm using, as I don't store points on lower parts of the tree, just the leaves, and at each leaf I keep an array of points. I also have some optimizations such as keeping track of the hyperrect containing each subtree and only traversing that tree if that hyperrect could contain a closer point. A bit abstract, I know :rolleyes:

I just tried profiling, and I spend more time in Exemplar:getSqDist than I do in the entire .Net lookup algorithm. So maybe I'll try a C++ .dll instead... sigh.

ShaunR · February 21, 2012

You've now heard the position of the "OO is bad" crowd.

Wow. I'm a crowd? I really can be in two places at once (multiple salaries )

So maybe I'll try a C++ .dll instead... sigh.

For raw speed. It's the only way. Labview is fast (in comparison to, say Java), but not the fastest. A dll call will probably only add about 1-3 usecs of overhead.

asbo · February 21, 2012

asbo, ShaunR:

How would you suggest I place clusters within clusters within clusters within... (dynamically)? I'm desperate for speed, and if it means sacrificing OOP design I'm willing to try it, but as far as I can tell LV specifies cluster contents pretty rigidly. Doing a prev/next doesn't work in a kd tree because we have a large number of dimensions... and implicit kd trees will give this type of structure but use huge amounts of memory (curse of dimensionality and all that). As you'll see in my project, I've already tried a simple brute force method with a simple array subtract/square/sum/min for finding the nearest point, and if it isn't fast enough with raw arrays it definitely won't be fast enough with clusters. Unless I've completely misunderstood you...

Yes, I rather meant that prev/next would be indices in a parent array, where data would actually be the data cluster. The indices would be used for lookup, which could make traversal a little more work (thus my legwork comment). So, as a C programmer would write:

cluster data { ... };

cluster node {int prev, cluster data, int next}; (whatever metadata is appropriate, this is really more of a linked list)

cluster[] tree;

I should have warned in my first post, I didn't look at your implementation. I don't consider myself experienced enough to critique LVOOP methodologies, I was just throwing an idea in the ring. I'm not an OOP nay-sayer, but if you can't get the performance you need there may be other options. I'm certainly interested in what implementation reaches your goal for you.

Wouter · February 22, 2012

@Aristos Queue

Maybe he can use your implementation of linkedlist, tree and the other datastructures etc... Can't find the files anymore though.

Phillip Brooks · February 22, 2012

@Aristos Queue

Maybe he can use your implementation of linkedlist, tree and the other datastructures etc... Can't find the files anymore though.

Were you thinking of the Map thread?

http://lavag.org/topic/5983-map-implemented-with-classes-for-85/

Sign In

LV OOP Kd-Tree much slower than .Net, brute force

Recommended Posts

jkflying

asbo

ShaunR

asbo

sam

Aristos Queue

mje

jkflying

ShaunR

asbo

Wouter

Phillip Brooks

Join the conversation

Similar Content

ViPER an Open-Source Dependency Injection Framework for LabVIEW

To OOP or not to OOP?? Please Help "Array of clusters or Classed data"

OOP QMH

Nested collection with interfaces

Public DVR on a private class : access problems

Browse

Activity

Important Information