Code Optimization


Hi gurus,

I am trying to optimize a real-time data processing application. I believe that a big chunk of my time is associated with moving large arrays into and out of subVIs, rather than actual calculation time. The nature of the processing does not lend itself well to parallelization, and 3rd-party driver considerations prevent us from using LVRT.

Here's my question. Are there advantages (e.g. regarding inplaceness) to keeping VIs that share data in the same execution thread? Or conversely, are there penalties for spreading large memory blocks among various execution threads?

Thanks,

Gary

QUOTE(Gary Rubin @ Jul 17 2007, 12:18 PM)

Hi Gary,

I'll take a first stab at this and let others correct me if I get this wrong.

The execution threads are all about the process the OS carries out to decide "what value do I set the program counter to next?"

It has nothing to do with whether data gets copied or buffers get re-used.

Rather than repeat what I suspect you already know, I'll end here.

Ben

QUOTE(Ben @ Jul 17 2007, 12:46 PM)

The execution threads are all about the process the OS carries out to decide "what value do I set the program counter to next?"

It has nothing to do with whether data gets copied or buffers get re-used.

Thanks Ben,

That's what I suspected, but every time I think I understand something about the rules for memory reuse, I find that it's more complicated than I expected.

QUOTE(Gary Rubin @ Jul 17 2007, 01:06 PM)

For RT projects that had a lot of data and a bunch of analysis, I set up an Action Engine that handled everything to do with the data.

The AE had actions to

"Read from I/O" - This just read the hardware and put the results in shift registers (SRs).

"Analyze" to crunch the numbers and determine the results

"Post" to make the results available where required.

This let me work completely in-place and left the rest of memory for me to buffer the results.
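A LabVIEW Action Engine is a block diagram, so it cannot be shown as text. As a rough analogy only, the Python sketch below keeps one preallocated buffer inside an object and exposes the three actions described above. The names `DataEngine` and `Action`, the buffer size, and the placeholder "mean" analysis are all assumptions made for illustration, not details from the post.

```python
from array import array
from enum import Enum


class Action(Enum):
    READ = "read"
    ANALYZE = "analyze"
    POST = "post"


class DataEngine:
    """Owns one preallocated buffer; callers pick an action instead of passing arrays."""

    def __init__(self, size: int) -> None:
        # Allocated once and reused on every call (the shift register analogue).
        self._data = array("d", [0.0]) * size
        self._result = 0.0

    def call(self, action: Action, source=None):
        if action is Action.READ:
            # Overwrite existing elements in place; never grow the buffer.
            for i, value in zip(range(len(self._data)), source):
                self._data[i] = value
        elif action is Action.ANALYZE:
            # Placeholder analysis (assumed): the mean of the buffer.
            self._result = sum(self._data) / len(self._data)
        elif action is Action.POST:
            return self._result
        return None


engine = DataEngine(4)
engine.call(Action.READ, [1.0, 2.0, 3.0, 6.0])  # stands in for a hardware read
engine.call(Action.ANALYZE)
print(engine.call(Action.POST))
```

The point of the pattern is that the large array never crosses the engine's boundary; only the small result does.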

Ben

QUOTE(Ben @ Jul 17 2007, 01:14 PM)

My "datapoints" are basically records consisting of 28 fields. I keep a history of 20k, so that's one 560k-element array. The datapoints are placed (by reference) into 2000 different "bins", with each bin having a depth of 64, so that's a 128k-element array. Figuring out which bin to put each new datapoint in, as well as other bookkeeping, uses linked lists, so tack on a few more 20k- and 2k-element vectors. Several different subVIs need access to those two big arrays in order to do their thing. I found that scaling my bin depth down from 200 to 64 had a pretty significant impact on execution speed, even though calculations/loop lengths didn't change; this is why I came to the conclusion that my bottleneck is memory related, rather than computational.
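The element counts above can be checked with a few lines of arithmetic. The sketch below is Python rather than LabVIEW, and `BYTES_PER_ELEMENT` is an assumption: the post does not say what data type the arrays hold, so the megabyte figures are only indicative.

```python
# Counts taken from the post; the element size is an assumption.
FIELDS_PER_POINT = 28
HISTORY_POINTS = 20_000
NUM_BINS = 2_000
BYTES_PER_ELEMENT = 8  # assumed: 64-bit values


def element_count(rows: int, cols: int) -> int:
    """Number of elements in a rows x cols 2D array."""
    return rows * cols


history = element_count(HISTORY_POINTS, FIELDS_PER_POINT)
bins_depth_64 = element_count(NUM_BINS, 64)
bins_depth_200 = element_count(NUM_BINS, 200)

for name, n in [
    ("history", history),
    ("bins, depth 64", bins_depth_64),
    ("bins, depth 200", bins_depth_200),
]:
    print(f"{name}: {n:,} elements, {n * BYTES_PER_ELEMENT / 1e6:.2f} MB")
```

At a depth of 200 the bin array holds 400,000 elements, a little over three times the 128,000 at depth 64, which is consistent with the observation that the speed change tracks memory size rather than loop length.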

I've managed to get rid of most of the buffer allocation dots. I am using the Quotient/Remainder operator on an array and am only using the remainder output. I noticed that there's a buffer allocation dot on the unwired IQ output. I thought that LabVIEW knows whether an output is wired, and doesn't allocate space for one that isn't. Am I mistaken about that?

Dunno how much/if this will help, but figured I'd add a thought to consider. When I've got apps that repeatedly process not-so-small chunks of data, I at least consider the following approach:

- Try to design the code to process fixed-size chunks (or at least to define an upper bound to the size)

- Processing routines that generate a buffer allocation dot are made into subVIs. The output data requiring a buffer allocation becomes a candidate for an Uninitialized Shift Register (USR).

- I either initialize this USR array once using the "First Call?" primitive, or I turn the subVI into a small Action Engine with explicit cases for "Initialize" and "Process Data". It depends whether I can live with the memory allocation delay on the first call or not.
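As a text-language analogy for the list above (not LabVIEW code), the sketch below keeps an output buffer alive between calls and allocates it only on the first call, which is roughly the role the USR and "First Call?" play. The names `make_scaler` and `process`, the gain operation, and the bound `MAX_CHUNK` are hypothetical.

```python
from array import array

MAX_CHUNK = 1024  # assumed upper bound on the chunk size


def make_scaler(max_chunk: int = MAX_CHUNK):
    """Return a process(chunk, gain) function that reuses one output buffer."""
    out = None  # plays the role of the uninitialized shift register

    def process(chunk, gain: float):
        nonlocal out
        if out is None:
            # The "First Call?" branch: allocate exactly once.
            out = array("d", [0.0]) * max_chunk
        n = len(chunk)
        if n > max_chunk:
            raise ValueError("chunk exceeds the preallocated upper bound")
        for i in range(n):
            out[i] = chunk[i] * gain  # overwrite in place
        # Return a view of the valid part of the buffer, not a copy.
        return memoryview(out)[:n]

    return process


process = make_scaler(8)
first = list(process([1.0, 2.0], 2.0))
second = list(process([5.0], 3.0))
print(first, second)
```

Note the trade-off this makes explicit: the caller must consume the returned view before the next call, because the next call overwrites the same memory.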

If I'm following your app right, this advice may not seem directly relevant. Sounds like you've got some mini database-like sets of data / information. Your program extracts chunks of this information for processing. The point of concern is where this chunk of data is created, where data copying is necessary and memory allocation may be possible.

My only thought there is to see whether you can combine Ben's action engine suggestion and my thoughts at the top of this post. If one of those USR's is fed directly into an output indicator, that may give LV enough clue that it can keep re-using the same memory space for that indicator on subsequent calls (it could conceivably see that the USR data and the indicator data are of the exact same size, and realize that it only needs to copy rather than allocate).

However, all the ways LV optimizes for memory allocation are still partly a mystery to me. Sometimes I find I've done extra work to be explicitly careful about memory, only to find that using simpler sloppy-seeming built-in functions still works better anyway.

-Kevin P.

Thanks for the ideas.

QUOTE(Kevin P @ Jul 18 2007, 09:10 AM)

I'm doing that. Everything is preallocated (as much as possible).

QUOTE(Kevin P @ Jul 18 2007, 09:10 AM)

- Processing routines that generate a buffer allocation dot are made into subVIs. The output data requiring a buffer allocation becomes a candidate for an Uninitialized Shift Register (USR).

- I either initialize this USR array once using the "First Call?" primitive, or I turn the subVI into a small Action Engine with explicit cases for "Initialize" and "Process Data". It depends whether I can live with the memory allocation delay on the first call or not.

A simple example (but maybe not a good example) of something that always generates an array allocation is an array subset or index array call.

Are you suggesting replacing method 1 (below) with method 2?

http://forums.lavag.org/index.php?act=attach&type=post&id=6394

I don't have LV handy to be able to post a screenshot.

What you posted isn't quite what I had in mind. My idea -- and it's just an idea, I'm not sure whether it'd be better or worse until after benchmarking -- would be to set up a For loop, explicitly index values out of Array 2, and use "Replace Array Subset" to overwrite the corresponding elements of subarray 2.

I'd expect the average execution speed to be noticeably slower than the LV array subset primitive, but I'd also expect it to be more nearly constant and predictable. At least, that'd be the desired tradeoff; I'm not sure if it'd work that way, because it seems I've also observed (I think) that allocating chunks of memory in the kB size range and smaller doesn't slow execution speed measurably.
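The structural difference between the two approaches can be sketched in Python, with the caveat that this only illustrates the shape of the idea and says nothing about LabVIEW timing. `source` stands in for "Array 2" and `subarray` for the preallocated "subarray 2"; the sizes and the helper name `copy_subset_in_place` are made up.

```python
from array import array

source = array("d", range(10))      # stands in for "Array 2"
subarray = array("d", [0.0]) * 4    # preallocated once, outside the hot loop


def copy_subset_in_place(src, start, dest):
    """Overwrite dest element by element with src[start : start + len(dest)]."""
    for i in range(len(dest)):
        dest[i] = src[start + i]


# Allocating version: every slice creates a new array object.
fresh = source[2:6]

# In-place version: same values, but the destination buffer is reused.
copy_subset_in_place(source, 2, subarray)

print(fresh == subarray, fresh is subarray)
```

Both versions end up with the same four values; only the in-place version avoids creating a new buffer on each call, which is the property being traded against raw speed.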

I wonder also if it may be helpful to incorporate a queue into the processing scheme. I know when I have producer-consumer loops in data acq apps, I wire directly from DAQmx Read to Enqueue Element to pass data without extra allocation. The queue is given ownership of the memory space, and only stores a pointer or something like that. Later, the call to Dequeue transfers the pointer & ownership to my consumer code, and I can do processing without having had to pass it through controls and indicators, where extra data copying might have happened.
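The hand-off described here, where the queue carries a reference rather than a copy of the samples, has a close analogue in most languages. The Python sketch below uses the standard `queue` and `threading` modules; the sample blocks and the doubling step are hypothetical stand-ins for DAQmx data and real processing, and it makes no claim about how LabVIEW queues are implemented internally.

```python
import queue
import threading

work = queue.Queue(maxsize=2)   # bounded queue between producer and consumer
received_ids = []               # identities of the objects the consumer got


def producer(blocks):
    for block in blocks:
        work.put(block)         # enqueues a reference; the samples are not copied
    work.put(None)              # sentinel: no more data


def consumer():
    while True:
        block = work.get()
        if block is None:
            break
        received_ids.append(id(block))
        for i in range(len(block)):
            block[i] *= 2.0     # process in place on the dequeued buffer


blocks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical acquired chunks
worker = threading.Thread(target=consumer)
worker.start()
producer(blocks)
worker.join()
print(blocks)
```

The consumer sees the very same objects the producer enqueued, so the in-place processing is visible through the original `blocks` list without any intermediate copy.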

Anyone have insight into Gary's original question about execution threads?

-Kevin P.

QUOTE(Gary Rubin @ Jul 18 2007, 04:11 PM)

I assume that Ben was correct, as nobody has disagreed with him yet...

Yes, the rules for whether LabVIEW makes a data copy are mostly tied to the functions the wire runs to, and especially to whether the wire branches. And here multithreading might sometimes get in the way. When a wire branches, LabVIEW tries to execute the functions that only read the data without reusing the buffer, or that create a complete copy anyhow, before the functions that could reuse the buffer. In that context subVIs are always considered a potential consumer of a buffer. And if you have multithreading in place, LabVIEW might get into a difficult situation:

Either serialize the code so that the non-consuming nodes execute first, or allow simultaneous execution, which requires a data copy in any case. I would opt for the first option myself, as it is probably the more efficient solution in all but the simplest cases, where only scalars are involved.
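The two options can be illustrated with a small Python analogy. This is only a sketch of the trade-off, not a description of the LabVIEW compiler: `sum` stands in for a read-only node, the increment loop for a node that reuses the buffer, and both function names are invented.

```python
def branch_serialized(data):
    """Reader runs first, then the in-place node reuses the buffer: no copy."""
    total = sum(data)              # read-only branch executes first
    for i in range(len(data)):     # modifying branch runs afterwards, in place
        data[i] += 1.0
    return total, data


def branch_parallel_safe(data):
    """If both branches may run at once, the modifying branch needs its own copy."""
    private = list(data)           # the extra data copy
    for i in range(len(private)):
        private[i] += 1.0
    total = sum(data)              # the reader still sees the untouched original
    return total, private


original = [1.0, 2.0, 3.0]
total_a, out_a = branch_serialized(original)

untouched = [1.0, 2.0, 3.0]
total_b, out_b = branch_parallel_safe(untouched)

print(total_a == total_b, out_a == out_b, out_a is original, out_b is untouched)
```

Both routes give the same results; the serialized one pays with ordering, the parallel-safe one pays with a copy whose cost grows with the array size.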

Rolf Kalbermatter
