Slower queues in LabVIEW 8.6?


Hello all,

As I indicated in a post last night, I am working on transitioning some code from LabVIEW 7.1.1 to LabVIEW 8.6. Because performance is very important in this application, the first thing I did was benchmark the execution times for the two versions using the profiler. I found something surprising that I'm hoping someone can corroborate and/or explain. I noticed that a lot (but not all) of my subVIs were faster in LV8.6, but that those involving queues were quite a bit slower. To check this out, I did the test shown in the attached picture.

The reason I wrapped the enqueue and flush queue in subVIs was just so I could use the profiler on them. Both the subVIs are running in Time Critical Priority with debugging and auto error handling turned off. The Top Level VI is normal priority, but also with debugging and error handling turned off.

These results are typical of what I saw for my real code: the top-level VI is MUCH faster in 8.6, but the enqueue is considerably slower.

So, what's going on? Is this speed difference real, or a by-product of using Profile? Are 8.6 queues really slower than 7.1 queues?

post-4344-1241181115.png?width=400

Thanks,

I'm sure I'll be posting more as I explore the 8.6.

Link to comment

Hi Gary,

I don't know the answer to your question, but I am curious. If you changed your code so that all of the enqueues happened first and then all of the dequeues, I wonder whether it would change the numbers. My thinking is that the scheduling of parallel tasks may be different, so the dequeue may be taking longer because it is waiting for something to show up in the queue. I just wonder if there is some interaction between the two loops.

So no answer here, just more Q's.

Ben
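For readers without LabVIEW at hand, the sequential-versus-parallel comparison Ben suggests can be sketched in a text language. This is only a rough Python analogy, not the VI from the attached picture: `drain`, `run_sequential` and `run_parallel` are hypothetical names, `drain` stands in loosely for Flush Queue, and the timings it prints say nothing about LabVIEW 7.1 or 8.6.

```python
import queue
import threading
import time


def drain(q):
    """Remove and return everything currently in the queue (a rough 'Flush Queue')."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


def run_sequential(n):
    """Enqueue n items, then dequeue them all. Returns (elapsed_s, items_received)."""
    q = queue.Queue()
    start = time.perf_counter()
    for i in range(n):
        q.put(i)
    received = len(drain(q))
    return time.perf_counter() - start, received


def run_parallel(n):
    """Producer and consumer threads share one queue. Returns (elapsed_s, items_received)."""
    q = queue.Queue()
    done = threading.Event()
    received = []

    def producer():
        for i in range(n):
            q.put(i)
        done.set()

    def consumer():
        while not done.is_set():
            received.extend(drain(q))
        received.extend(drain(q))  # pick up anything enqueued just before 'done' was set

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, len(received)


if __name__ == "__main__":
    for name, fn in (("sequential", run_sequential), ("parallel", run_parallel)):
        elapsed, count = fn(10_000)
        print(f"{name}: {count} items in {elapsed * 1e3:.2f} ms")
```

In the parallel case the two threads contend for the same queue, so any difference between the two timings reflects that interaction rather than the cost of a single enqueue.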

Link to comment

QUOTE (neBulus @ May 1 2009, 09:28 AM)

so the dequeue may be taking longer because it is waiting for something to show up in the queue.

One of the reasons I'm wondering whether it's a red herring coming out of the profiler is that both versions report that the dequeue subVI takes 0.0ms. In fact, changing the display in the profiler to microseconds, LV8.6 claims the dequeue subVI takes 0us. Not sure I believe that.

Link to comment

QUOTE (Gary Rubin @ May 1 2009, 09:45 AM)

One of the reasons I'm wondering whether it's a red herring coming out of the profiler is that both versions report that the dequeue subVI takes 0.0ms. In fact, changing the display in the profiler to microseconds, LV8.6 claims the dequeue subVI takes 0us. Not sure I believe that.

I have had many questions about what the profiler shows me in recent versions as well.

Run some benchmark tests to see if they confirm what the profiler is showing.

Ben

Link to comment

QUOTE (neBulus @ May 1 2009, 09:28 AM)

I think you're onto something there, Ben, but I'm not sure what to make of it.

I modified my test code as follows, and enforced the parallel vs. sequential loops by the addition/subtraction of the wire going from the Enqueue Time indicator to the while loop. I had to decrease my number of iterations to 10k to keep the queue from growing to an unmanageable size in the sequential case.

post-4344-1241190054.png?width=400

Here are the results:

post-4344-1241190363.png?width=400

I guess the sequential results are more like what I'd expect; LV8.6 is more efficient at allocating memory for the growing queue.

I am surprised at the parallel results. Am I really to understand that a version of LabVIEW that was written back when everyone was on a single CPU is better at parallel tasks than one that's been developed in the world of multi-processors?

Incidentally, the 25% difference between the two is consistent with what the profiler was saying.

Link to comment

Ok still wandering in the dark looking for the door...

Both of your loops require allocating memory for the queue elements or for the array you are building.

If you preallocate the array and replace elements instead of building it, then we can stop thinking about calls to the OS to allocate memory being a common bottleneck.

Please see Q below my avatar.

Ben
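The "preallocate and replace instead of build" idea, sketched in Python purely as an illustration. The function names are hypothetical, and Python lists over-allocate on append, so the gap here is usually much smaller than what Build Array versus Replace Array Subset can show in LabVIEW.

```python
import time


def build_by_appending(n):
    """Grow the container one element at a time (analogous to Build Array in a loop)."""
    out = []
    for i in range(n):
        out.append(i * 2)
    return out


def fill_preallocated(n):
    """Allocate once up front, then overwrite elements in place."""
    out = [0] * n
    for i in range(n):
        out[i] = i * 2
    return out


def best_time(fn, n, repeats=5):
    """Return the fastest of several runs, in seconds, to reduce scheduling noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(n)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    n = 1_000_000
    print(f"append:      {best_time(build_by_appending, n) * 1e3:.1f} ms")
    print(f"preallocate: {best_time(fill_preallocated, n) * 1e3:.1f} ms")
```

Taking memory allocation out of both loops this way makes it easier to see whether the remaining difference comes from the queue itself.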

Link to comment

QUOTE (neBulus @ May 1 2009, 11:21 AM)

Both of your loops require allocating memory for the queue elements or for the array you are building.

But the parallel one doesn't get nearly as big as the sequential one, because it's flushed periodically. The sequential one just keeps growing.

Link to comment

QUOTE (Gary Rubin @ May 1 2009, 11:27 AM)

But the parallel one doesn't get nearly as big as the sequential one, because it's flushed periodically. The sequential one just keeps growing.

I was just grasping at straws to explain the apparent interaction when they run in parallel. Calls to the OS memory manager are just a guess at how the two loops could slow each other down when run at the same time.

Ben

Link to comment

QUOTE (neBulus @ May 1 2009, 06:49 AM)

I have had many questions about what the profiler shows me in recent version as well.

Run some benchmark tests to see if they confirm what the profile is showing.

Ben

I have had a lot of questions about the profiler since LV 8.0. I believe something was changed somewhere so that it no longer behaves quite the same way as in previous LV versions.

Link to comment

QUOTE (PJM_labview @ May 1 2009, 12:43 PM)

I have had a lot of questions about the profiler since LV 8.0. I believe something was changed somewhere so that it no longer behaves quite the same way as in previous LV versions.

I won't disagree with that, but my second attempt using timers in the code seemed to show the same general behavior that the profiler showed.

Link to comment

Okay, this is (as usual) pretty far afield, but...

Could the queue's internal lock/unlock mechanism have changed between versions? The compiler might be smart enough to recognize that the sequential case only needs two lock/unlock sets, while the parallel case needs many. Unfortunately, this is not easy to measure.

(Edit: thinking more on the above, unless the queue has a lock-on-read, it seems unlikely. Sorry.)

I suppose it's also slightly possible you're getting context switching issues in the parallel version. It seems unlikely, since that would probably cause a much larger delay, and having the queue run in more than one thread seems unlikely. On Windows, Process Explorer can show the switches, though... (http://technet.microsoft.com/en-us/sysinte...s/bb896653.aspx).

Joe Z.

Link to comment

All the internal benchmarks that I've done show a substantial speed improvement for all queue operations from 7.1 to 8.5. I didn't redo benchmarks for 8.6. There is a new queue primitive in 8.6, but to the best of my memory none of the functionality of the existing primitives changed.

QUOTE (Gary Rubin @ May 1 2009, 08:45 AM)

On Windows (and I believe all the desktop platforms) the timing cannot be any more exact than milliseconds. There just isn't a more precise timer available. The fact that the time reported is zero means that it is taking less than 1 millisecond to execute, so the sum of all iterations is still zero. That's why it makes sense to benchmark in aggregate but not individual operations. If you want a more precise timing, you need LV Real Time running on a real-time target.
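The point about benchmarking in aggregate can be shown with a small sketch. It is an illustration only: `coarse_clock_ms` is a hypothetical stand-in for a timer that resolves whole milliseconds, built here by truncating Python's `time.monotonic`, and none of the names come from LabVIEW.

```python
import time


def coarse_clock_ms():
    """Hypothetical timer that only reports whole milliseconds."""
    return int(time.monotonic() * 1000)


def time_single_ms(op):
    """Time one call with the coarse clock. A fast op will usually read as 0 ms."""
    start = coarse_clock_ms()
    op()
    return coarse_clock_ms() - start


def time_aggregate_us(op, iterations):
    """Time many calls with the coarse clock and return the mean per call in microseconds."""
    start = coarse_clock_ms()
    for _ in range(iterations):
        op()
    total_ms = coarse_clock_ms() - start
    return total_ms * 1000.0 / iterations


if __name__ == "__main__":
    data = list(range(100))
    op = lambda: sum(data)
    print("single call:", time_single_ms(op), "ms")
    print("mean over 1,000,000 calls:", time_aggregate_us(op, 1_000_000), "us")
```

The mean is only meaningful when the total run is long compared with the clock resolution, and it still includes the loop overhead.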

QUOTE (PJM_labview @ May 1 2009, 11:43 AM)

I have had a lot of questions about the profiler since LV 8.0. I believe something was changed somewhere so that it no longer behaves quite the same way as in previous LV versions.

I don't believe it had any significant changes since at least LV 6.1 other than displaying VIs from multiple app instances, added in LV 8.0. I could be wrong -- it isn't an area that I pay close attention to feature-wise.

QUOTE (jzoller @ May 1 2009, 02:18 PM)

(Edit: thinking more on the above, unless the queue has a lock-on-read, it seems unlikely. Sorry.)

I'm not sure if you're asking about this, but, yes, there is a mutex that must be acquired by the thread when it reads the queue.
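To make the locking discussion concrete, here is a toy Python queue that counts how often its mutex is taken. It is not how LabVIEW implements queues internally, and `CountingQueue` with its methods is a hypothetical name. It only illustrates why many small operations interleaved between two loops acquire the lock far more often than one bulk flush.

```python
import threading
from collections import deque


class CountingQueue:
    """Toy queue whose every operation takes one mutex and counts the acquisition."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self.lock_acquisitions = 0

    def enqueue(self, item):
        with self._lock:
            self.lock_acquisitions += 1
            self._items.append(item)

    def dequeue(self):
        """Remove and return the oldest item. Raises IndexError if the queue is empty."""
        with self._lock:
            self.lock_acquisitions += 1
            return self._items.popleft()

    def flush(self):
        """Remove and return all items under a single lock acquisition."""
        with self._lock:
            self.lock_acquisitions += 1
            items = list(self._items)
            self._items.clear()
            return items


if __name__ == "__main__":
    q = CountingQueue()
    for i in range(1000):
        q.enqueue(i)
    q.flush()
    print("lock acquisitions:", q.lock_acquisitions)
```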

Link to comment

QUOTE (Gary Rubin @ May 2 2009, 11:00 PM)

I'm still looking for confirmation that my results are reproducible, but here's more. I've done a side-by-side comparison with parallel Queue-Dequeue vs. a LV2-style Global FIFO.

Download File:post-4344-1241463223.zip

post-4344-1241463354.png?width=400

If the producer loop has a 0 ms wait, and the consumer loop has a 1ms wait, I get the following results:

post-4344-1241463644.png?width=400

post-4344-1241463649.png?width=400

If I change the consumer loop wait to 0ms, I get the following:

post-4344-1241463761.png?width=400

post-4344-1241463799.png?width=400

I thought maybe I was comparing apples to oranges, in that the LV2 FIFO uses preallocated memory, while the queue does not, so I tried preallocating the queue by filling it up then flushing it prior to entering my sequence structures. This did help, but not by much. For a 100 element array, preallocating the queue reduced its runtime by about a factor of 4, but it's still twice as slow as the LV2 FIFO.

The take-away message seems to be that 8.6 Queues are faster than 7.1 queues if the number of elements in the enqueued data is less than ~50. After that, 7.1 Queues are faster. Also, LV2-style FIFOs have the potential of being much faster than Queues.

I would love to have someone shoot holes in this, or tell me that I'm not doing my comparisons right. One of my suspicions is that, because I'm timing both the producer and consumer together, it's the build array in the dequeue consumer loop that's killing me, but is there any other way around that? In the meantime, I'm going to be thinking hard about changing over all my queues to LV2-FIFOs. Are queues just not meant to be used as deep FIFOs for passing data between loops?

Gary
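For anyone unfamiliar with the LV2-style global FIFO being compared here, the general idea is a preallocated circular buffer behind a single non-reentrant access point. The sketch below is a Python approximation under stated assumptions, not the code in the attached zip: `RingFifo` is a hypothetical name, the lock stands in for the non-reentrant VI boundary, and dropping the new item on overflow is an assumption (the thread only establishes that an overflow flag exists).

```python
import threading


class RingFifo:
    """Fixed-capacity FIFO with an overflow flag, guarded by a single lock."""

    def __init__(self, capacity):
        self._buf = [None] * capacity   # storage allocated once, up front
        self._capacity = capacity
        self._head = 0                  # index of the oldest element
        self._count = 0
        self._overflow = False
        self._lock = threading.Lock()   # stands in for the non-reentrant VI boundary

    def write(self, item):
        """Store one item. Returns False and sets the overflow flag if the buffer is full."""
        with self._lock:
            if self._count == self._capacity:
                self._overflow = True   # assumption: the new item is dropped
                return False
            tail = (self._head + self._count) % self._capacity
            self._buf[tail] = item
            self._count += 1
            return True

    def read_all(self):
        """Return (items in FIFO order, overflow flag) and reset both."""
        with self._lock:
            items = [self._buf[(self._head + i) % self._capacity]
                     for i in range(self._count)]
            self._head = (self._head + self._count) % self._capacity
            self._count = 0
            overflow, self._overflow = self._overflow, False
            return items, overflow


if __name__ == "__main__":
    fifo = RingFifo(4)
    for i in range(6):
        fifo.write(i)
    print(fifo.read_all())
```

Because the storage never grows, writes involve no allocation, which is one plausible reason such a FIFO can beat a queue that is not preallocated. The price is that data is lost when the consumer falls behind.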

Link to comment

QUOTE (Gary Rubin @ May 4 2009, 09:26 PM)

http://lavag.org/old_files/monthly_05_2009/post-4344-1241463354.png

I would love to have someone shoot holes in this, or tell me that I'm not doing my comparisons right. One of my suspicions is that, because I'm timing both the producer and consumer together, it's the build array in the dequeue consumer loop that's killing me, but is there any other way around that? In the meantime, I'm going to be thinking hard about changing over all my queues to LV2-FIFOs. Are queues just not meant to be used as deep FIFOs for passing data between loops?

Gary

I am suspicious about the array building as well.

I haven't looked at the LV2 code, but I don't think that using a queue to store all the data and then reading it as a burst with 'Flush Queue' is standard procedure.

To be totally fair, I wouldn't include the 'Destroy Queue' in the timing. Another thing is that you preset the size of the LV2 global, something you didn't do for the queue. You only have an overflow flag to detect the slow 'dequeuing' of data.

There are some little things already taken care of in the queue solution that aren't available in the LV2 global, like prevention of getting old data.

Ton

Link to comment

Thanks, Ton.

I'll try some of those suggestions tonight, like rerunning with the queue preallocate and removing the Destroy queue. I'll post results later, so long as the kids let me work ;)

Gary

Link to comment

QUOTE (Gary Rubin @ May 4 2009, 10:51 PM)

Thanks, Ton.

I'll try some of those suggestions tonight, like rerunning with the queue preallocate and removing the Destroy queue. I'll post results later, so long as the kids let me work ;)

Gary

I changed the "preferred Execution System" of your benchmarking VI to "User Interface" and I got much faster results for the Queues..... (Still slower than the LV2 global, though)

Discuss :unsure:

Performed in 8.2.1.

Shane.

Link to comment

Eureka!

I've added queue preallocation, and am finally getting trends more in line with what I was expecting. I also tried changing the execution system from Same as Caller to Standard, but that didn't seem to have much effect.

Download File:post-4344-1241484811.zip

For a 0ms wait in both consumer and producer loops:

post-4344-1241484819.png?width=400

post-4344-1241484834.png?width=400

For a 0ms wait in the producer and a 1ms wait in the consumer:

post-4344-1241484828.png?width=400

post-4344-1241484842.png?width=400

It appears that from 7.1 to 8.6, the queue method (as measured by this benchmark) got about 50% faster, while the LV2 method got considerably slower?! :wacko:

So, the lesson learned is that you really want to preallocate your queue if you plan to use it for a large amount of data.

Thanks to all who have taken a look at my code.

There's still something I'm wondering about; as Ton suggested, is a queue recommended for this type of use? (AQ?)

QUOTE (shoneill @ May 4 2009, 05:51 PM)

I changed the "preferred Execution System" of your benchmarking VI to "User Interface" and I got much faster results for the Queues..... (Still slower than the LV2 global, though)

Shane, I tried that and got MUCH slower results overall, although the queue and the LV2 were more similar.

QUOTE

There are some little things already taken care of in the queue solution that aren't available in the LV2 global, like prevention of getting old data.

Ton, what do you mean by this? How could you get old data (I assume you mean previously read data) out of an LV2 global?

Thanks again,

Gary

Link to comment

QUOTE (Gary Rubin @ May 5 2009, 03:09 AM)

Eureka!

I've added queue preallocation, and am finally getting trends more in line with what I was expecting. I also tried changing the execution system from Same as Caller to Standard, but that didn't seem to have much effect.

Shane, I tried that and got MUCH slower results overall, although the queue and the LV2 were more similar.

I can confirm that on the Mac in LV 8.6, selecting User Interface makes the whole thing a LOT slower.

Funnily enough on LV 8.2.1 and WinXP, it got faster.....

The QUEUE is still a good bit slower than the LV2 on the Mac with 8.6.....

Weird.

Shane.

Link to comment
