Clone Wars!

John Lokanis · March 13, 2009

Found some really interesting behaviors when trying out shared clone mode vs preallocate clone mode:

I have a massively parallel application that does a lot of dynamic VI execution in many (200+) threads. I have been trying to figure out if I should set all my reentrant VIs (almost everything is reentrant) to the 'shared' mode or the 'preallocate' mode.

I am running this app on an 8 core machine.

In 'shared' mode, the CPU load on all cores seems to be fairly balanced but the whole system bogs down as I cross the ~100 thread level. Here is what the CPU load looks like:

In 'preallocate' mode, the CPU load on one core seems to get 'pegged' a lot but the overall load seems lower. I can reach the ~150 thread level before it bogs down too much and can get to ~200 threads, albeit slowly. Here is what the CPU load looks like in this mode:

From talking with some sources at NI, it seems that the UI thread is used for the memory manager as well as all VI server calls. So, in preallocate mode, there is a lot more memory management going on (allocating space for all those clones) and that is swamping out the core that is running the UI thread and the VI server calls, thus causing the sluggishness. In shared clone mode, this does not happen, but the system has to do a lot more overhead (spread across all cores) to manage the shared clones and their execution data spaces.

Does anyone have any ideas on what the optimal solution/configuration would be in cases like this? Any 'best practices' to apply here? It seems to me if we are going to be doing a lot of multi-core programming in the future, LabVIEW needs to be tweaked to spread the load out a bit more effectively.

-John

Michael Aivaliotis · March 13, 2009

No idea, I just hopped in to comment that I think your post title is awesome!

Mark Smith · March 13, 2009

John,

Here's an interesting article from the home team

http://www.sandia.gov/LabNews/081219.html

I think presently the best way to build a massively parallel test system is not by using a multi-core machine (at least not over four cores) but by using multiple machines that coordinate over some common interface (gigabit ethernet and TCP/IP?). This allows the architect access to more memory and bandwidth per thread (and typically per hardware device) than you can get from just adding more cores to the processor. Of course, it costs more

Mark

John Lokanis · March 13, 2009

QUOTE (mesmith @ Mar 12 2009, 09:38 AM)

Here's an interesting article from the home team

Great link! Thanks!

I have considered load balancing my test system across multiple machines but it does make it a nightmare to do UI interaction. LabVIEW just does not offer an easy way to embed UIs from other machines. If only I could have a VI in a sub panel that was actually running on a different computer over the network!

But, for now, I will be looking at a Nehalem based system to get me over the hump. It is supposed to address the memory access and multicore bottlenecks and yield at least 30% improvement over Penryn based cores.

-John

Mark Yedinak · March 13, 2009

QUOTE (jlokanis @ Mar 12 2009, 12:50 PM)

Great link! Thanks!
I have considered load balancing my test system across multiple machines but it does make it a nightmare to do UI interaction. LabVIEW just does not offer an easy way to embed UIs from other machines. If only I could have a VI in a sub panel that was actually running on a different computer over the network!

But, for now, I will be looking at a Nehalem based system to get me over the hump. It is supposed to address the memory access and multicore bottlenecks and yield at least 30% improvement over Penryn based cores.

-John

Are you saying you need to have multiple UIs in your system or that coordinating your data for multiple applications on multiple machines to a single UI is the problem? If it is the latter you could use network queues to send data back and forth from the controlling UIs to the slave applications. If you want to have bidirectional communication I would have a master UI queue that all of your individual applications can post to to inform the UI that an update is required. As part of your message you would need an application ID. Each application would have its own queue for receiving updates from the UI. The master UI would need to manage the application queues. As processes start up they register with the UI an give it the connection specifics for its queue. The connection specifics would need to be static for the UI code (fixed server address and port number) so all of the remote processes know how to register with the UI. If the UI wanted to broadcast an update to all remote applications it could simply iterate over all of its remote queues and post the event.

bsvingen · March 13, 2009

I have a similar problem I think. On single core machines, no problem, but on dual core machines there seems to be some memory allocation bug kind of thing. For no apparent reason the whole UI becomes sluggish when showing graphs. It can happen when only 3-4 vis are loaded dynamically and with severe sluggishness that varies from pc to pc (slowdown of a factor of 10 or more and worse on laptops than desktops). I have added a "other 1" execution thread vi that runs in a main loop. That seems to make it somewhat better - sometimes. It seems to me it has something to do with display card RAM and buffering or something, but it is random, it can suddenly start, and once started the only remedy i to restart LV.

Aristos Queue · March 13, 2009

QUOTE (jlokanis @ Mar 11 2009, 05:33 PM)

From talking with some sources at NI, it seems that the UI thread is used for the memory manager as well as all VI server calls. So, in preallocate mode, there is a lot more memory management going on (allocating space for all those clones) and that is swamping out the core that is running the UI thread

That doesn't make sense. In preallocate mode, the memory is all preallocated at the moment you start your program running. There is clone allocation after that point, so that shouldn't be responsible for pegging the UI thread. Let me offer a different theory...

In the preallocate model, I agree that the one pegged thread is the UI thread. But I think what it is doing is responding to UI requests from all those other threads. The other threads have much higher performance, so they get to their UI requests more often, so the UI thread always has work to do. In the shared model, the UI thread sometimes has downtime while everyone is sharing copies around.

John Lokanis · March 13, 2009

QUOTE (Mark Yedinak @ Mar 12 2009, 11:04 AM)

Are you saying you need to have multiple UIs in your system or that coordinating your data for multiple applications on multiple machines to a single UI is the problem? If it is the latter you could use network queues to send data back and forth from the controlling UIs to the slave applications. If you want to have bidirectional communication I would have a master UI queue that all of your individual applications can post to to inform the UI that an update is required. As part of your message you would need an application ID. Each application would have its own queue for receiving updates from the UI. The master UI would need to manage the application queues. As processes start up they register with the UI an give it the connection specifics for its queue. The connection specifics would need to be static for the UI code (fixed server address and port number) so all of the remote processes know how to register with the UI. If the UI wanted to broadcast an update to all remote applications it could simply iterate over all of its remote queues and post the event.

Those are all good ideas for rearchitecting my system but it would be a lot of work at this point. Currently, my top level VI spawns sub-vis (engines) that run tests on a DUT. My top level VI can display the FP of any of these running engines inside a sub-panel on the main VI's FP. From this sub-panel, I can interact with the UI of the engine. The problem is I cannot use this technique if the engine is running in a different app instance.

So, I would have to use your techniques above and separate my engine code into a UI component (running in the app instance of the main VI) and a functional component, running on a different machine (and app instance).

Also, I would need to maintain a pool of machines (like a server farm) and be able to ping them for their current load so I can load balance tests across them.

A very interesting project, but first I would need to get my boss's approval to spend a few months on it. In the short run, buying a faster machine is easier.

I will be sure to post some benchmarks on the Nehalem machine vs the Penryn machine when I get it.

John Lokanis · March 14, 2009

QUOTE (Aristos Queue @ Mar 12 2009, 12:24 PM)

In preallocate mode, the memory is all preallocated at the moment you start your program running. There is clone allocation after that point, so that shouldn't be responsible for pegging the UI thread.

Yes, but if the app is continuously launching reentrant VIs dynamically, then every time it does this, it needs to allocate space for that VI and all it's sub-vis 'on the fly'. So, there is continuous memory allocation going on in this case.

Aristos Queue · March 14, 2009

QUOTE (jlokanis @ Mar 12 2009, 06:41 PM)

Yes, but if the app is continuously launching reentrant VIs dynamically, then every time it does this, it needs to allocate space for that VI and all it's sub-vis 'on the fly'. So, there is continuous memory allocation going on in this case.

In that case, there is allocation regardless of the preallocate vs. shared clone setup. The cache of clones are only (to the best of my knowledge) shared among the subVI calls, not among the Open VI Reference calls.

John Lokanis · March 25, 2009

QUOTE (Aristos Queue @ Mar 12 2009, 07:25 PM)

In that case, there is allocation regardless of the preallocate vs. shared clone setup. The cache of clones are only (to the best of my knowledge) shared among the subVI calls, not among the Open VI Reference calls.

In the end, the problem was not the clone setting but rather a very hard to find memory allocation issue.

Quick tip: don't place a large data structure in an attribute of a variant that you store in a single element queue and access from several points in your app. If you do, then everytime you preview the queue element to read it, you will make a copy of the whole shabang, even if you are only interested in another attribute of the variant structure that is a simple scalar. This will eventually blow up on you. Took over a year to discover that one!

So, now we are back to shared clones and all is well again... Oh, and the app runs 100x faster. Maybe I will have time to come to NI Week after all!

Neville D · March 25, 2009

QUOTE (jlokanis @ Mar 24 2009, 12:21 PM)

Quick tip: don't place a large data structure in an attribute of a variant that you store in a single element queue and access from several points in your app. If you do, then everytime you preview the queue element to read it, you will make a copy of the whole shabang, even if you are only interested in another attribute of the variant structure that is a simple scalar. This will eventually blow up on you.

So is the copy made because of the preview Q or is it because its an attribute of the variant?

(Just to be clear..)

N.

John Lokanis · March 25, 2009

QUOTE (Neville D @ Mar 24 2009, 12:42 PM)

So is the copy made because of the preview Q or is it because its an attribute of the variant?

Yes.

The answer really is both. When you preview, it makes a copy. When you then get the attribute value, it makes another copy. And, I am pretty sure when you cast the variant value of the attribute to its LV datatype, it makes yet another copy.

So, the worst case is when you try to access the large data structure in the queue element. But even if you are just looking as something small that is stored as part of the variant 'tree' you still incure one copy of everything.

Mark Yedinak · March 25, 2009

This is good information to know. I would ask though if you find yourself only peeking at a small part regularly and only accessing the large data occasionally wouldn't it improve the design to separate the elements and their storage to improve performance? I mean if know that you will be accessing some information frequently and other parts only once in a while (at least when talking about large data sets), why store them together. This is one of the basic tenants of code optimization.

Regardless your findings are interesting and good to know.

mje · March 25, 2009

I thought variants were really just handles, so a U32, or maybe U64 nowadays? So you're saying that running a variant down a wire from the queue copies the entire variant to the wire, all attributes and everything? The linked article below (very bottom) to me implies otherwise. I'd have expected LV to be smart enough NOT to make a copy of the actual variant unless you modify the data or attributes.

http://zone.ni.com/reference/en-XX/help/37...data_in_memory/

-m

John Lokanis · March 25, 2009

QUOTE (Mark Yedinak @ Mar 24 2009, 02:13 PM)

I would ask though if you find yourself only peeking at a small part regularly and only accessing the large data occasionally wouldn't it improve the design to separate the elements and their storage to improve performance?

Yes yes yes. That is very true. And that is exactly how I fixed it. My mistake was not realizing I was doing this in the first place. It wasn't untill I started using a large data set in the structure that the problem became more evident. But, after some work, liberal use of the inplace structure and addition of some shift registers, I was able to corden off the large data set and eliminate all reoccuring copies.

The concept of a tree of variants stored in a single element queue as an efficient way of sharing data between parallel threads seemed like such a good idea. But, like every programming construct, it does have pitfalls. Maybe I should put together an NI Week session on the do's and don'ts of massive parallelism in LV programming. I think I have had to climb out of almost every pitfall there is over the last few years...

QUOTE (MJE @ Mar 24 2009, 02:23 PM)

I thought variants were really just handles, so a U32, or maybe U64 nowadays? So you're saying that running a variant down a wire from the queue copies the entire variant to the wire, all attributes and everything? The linked article below (very bottom) to me implies otherwise. I'd have expected

LV

to be smart enough NOT to make a copy of the actual variant unless you modify the data or attributes.

-m

That may be true, but if you preview the queue, you make a copy. I suppose if you dequeue it, the element you get is just a pointer to the same block of memory that the queue was holding on to. But even that may not be true. After all, if you dequeue all elements in a queue, the queue does not free the memory it used to store those elements. A queue hangs on to all the memory it has ever allocated. So, if each element takes up 1k and you had 100 elements in the queue at one time, then the queue is still consuming 100k, even if you flush it.

Now, I do know that when I previewed the queue, even if I was just referencing an attribute of the variant that was a scalar, I did incur a memory copy of all the attributes of the variant. You can even see this if you turn on 'show buffer allocations'.

Getting the large structure out of the variant tree and into it's own queue that I did not access often saved the day for my app.

Mark Yedinak · March 25, 2009

QUOTE (MJE @ Mar 24 2009, 04:23 PM)

I thought variants were really just handles, so a U32, or maybe U64 nowadays? So you're saying that running a variant down a wire from the queue copies the entire variant to the wire, all attributes and everything? The linked article below (very bottom) to me implies otherwise. I'd have expected LV to be smart enough NOT to make a copy of the actual variant unless you modify the data or attributes.
http://zone.ni.com/reference/en-XX/help/37...data_in_memory/

-m

Variants are simply a type agnostic way to pass data around. They are not a reference to the data. It is simply a large binary dump of the data that you need to indicate how it should be decoded. Think of a variant like the data portion of a TCP packet. To TCP it is simply data. It doesn't care or need to know what that data represents but it does pass all of the data. It is up to the receiver of the data to know how to interpret it. The same is true for variants. A variant is just a generic way to pass data.

mje · March 26, 2009

Yeah, I was aware variants weren't reference types, there's a fundamental difference between a reference, pointer, and handle. A variant is a little more sophisticated than just a binary data dump though. It is aware of native types, and can intelligently convert between compatible scalars (not sure how well that translates to arrays & clusters).

I guess this thread just brings up one of my major beefs with the LV IDE: NI really needs a good profiling tool to allow you to watch how your memory footprint is set up. What I wouldn't give for the equivalent of a watch and locals window to be able to browse my stack and heap data spaces during break points.

QUOTE (jlokanis)

That may be true, but if you preview the queue, you make a copy. I suppose if you dequeue it, the element you get is just a pointer to the same block of memory that the queue was holding on to. But even that may not be true.

At the risk of going offtopic...I admit predicting memory allocation when dealing with queues is rather nebulous to me. Anyone got a good handle on that, maybe a link or reference? I completely missed you were dealing with previews. I remember somewhere that that primitive will *always* return a copy, which makes sense since, because not copying it would open a whole can of worms with regards to synchronization and thread safety. It is entirely possible to use a queue and never deal with copies, see the NI singleton implementation if I recall. But what I don't understand is how that plays with fixed size queues. I was also under the impression queues hang on to their element buffer space (excluding heap allocations for arrays, reference types, etc). Those two strategies seem exclusive to me, I'm not sure how they're mixed and when one will be used as opposed to the other.

jdunham · March 26, 2009

QUOTE (MJE @ Mar 24 2009, 01:23 PM)

I thought variants were really just handles, so a U32, or maybe U64 nowadays? So you're saying that running a variant down a wire from the queue copies the entire variant to the wire, all attributes and everything? The linked article below (very bottom) to me implies otherwise.

I think you are misreading the article:

QUOTE (How LabVIEW Stores Data in Memory)

String

LabVIEW stores strings as pointers to a structure that contains ... If the

handle

, or the pointer to the structure, is NULL...

QUOTE (How LabVIEW Stores Data in Memory)

Variant

LabVIEW stores variants as

handles

to a LabVIEW internal data structure. Variant data is made up of 4 bytes.

As I read it, variants are a lot like strings, but with more features (a type descriptor and attributes). Also variants don't behave like by-reference objects (queues & notifiers) and they aren't represented by refnums. When you probe them, you can see the actual data.

QUOTE (jlokanis @ Mar 24 2009, 01:59 PM)

That may be true, but if you preview the queue, you make a copy. I suppose if you dequeue it, the element you get is just a pointer to the same block of memory that the queue was holding on to. But even that may not be true. After all, if you dequeue all elements in a queue, the queue does not free the memory it used to store those elements. A queue hangs on to all the memory it has ever allocated. So, if each element takes up 1k and you had 100 elements in the queue at one time, then the queue is still consuming 100k, even if you flush it.

Hmm. It would seem to me that a queue of 100 strings would contain 100 handles pointing to some string data. Then if you flushed the queue, the 100 handles would probably remain allocated, but the actual string data they were pointing to might get released. Exactly when that release would happen should be handled by LabVIEW's memory manager and would depend on whether any other wires on any active diagrams were also pointing to those strings. Of course I'm too lazy to try to test this out, and MJE is right that it's awfully hard to tell which wires are using/sharing memory at any given time. In general, the lack of memory control and debugging tools is a feature, but I can understand that when things go wrong and memory usage rockets upward, it is hard to isolate the problem.

Phillip Brooks · March 26, 2009

QUOTE (jdunham @ Mar 25 2009, 12:46 AM)

Hmm. It would seem to me that a queue of 100 strings would contain 100 handles pointing to some string data. Then if you flushed the queue, the 100 handles would probably remain allocated, but the actual string data they were pointing to might get released. Exactly when that release would happen should be handled by LabVIEW's memory manager and would depend on whether any other wires on any active diagrams were also pointing to those strings.

I recall Aristos telling us that there is an algorithm that defines the way LabVIEW will allocate additional memory for an unbounded queue, and that LabVIEW will not free memory allocated for a queue until the top level VI goes idle or the Release Queue function is used with the 'destroy' option set to true. I also recall that Aristos told us LabVIEW http://forums.lavag.org/Lossy-Queue-Capability-t4217.html' target="_blank">queues are implemented as circular in-memory buffer.

Yair · March 26, 2009

I don't have an official documentation for this, but a queue will hang onto its unused memory until it's destroyed OR until it's needed elsewhere. Calling the Request Deallocation primitive will probably also release it. In any case, I think you should regard this as an implementation detail, not something to be relied upon.

P.S. A dequeue does use a pointer to the enqueued data to move the data without creating a copy. It can do this because every enqueue is will always have only one dequeue. An "exception" to this is if you split the wire before enqueuing, which will force LV to create a copy for the queue (it's not really an exception when you think about it, since you still have a pointer to the enqueued data).

Sign In

Clone Wars!

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation

Important Information