Gary Rubin Posted May 2, 2009
Hello all,
As I indicated in a post last night, I am working on transitioning some code from LabVIEW 7.1.1 to LabVIEW 8.6. Because performance is very important in this application, the first thing I did was benchmark the execution times for the two versions using profile. I found something surprising that I'm hoping someone can corroborate and/or explain. I noticed that a lot (but not all) of my subVIs were faster in LV8.6, but that those involving queues were quite a bit slower.
To check this out, I did the test shown in the attached picture. The reason I wrapped the enqueue and flush queue in subVIs was just so I could use the profiler on them. Both the subVIs are running in Time Critical Priority with debugging and auto error handling turned off. The Top Level VI is normal priority, but also with debugging and error handling turned off.
These results are typical of what I saw for my real code: the top-level VI is MUCH faster in 8.6, but the enqueue is considerably slower.
So, what's going on? Is this speed difference real, or a by-product of using Profile? Are 8.6 queues really slower than 7.1 queues?
Thanks, I'm sure I'll be posting more as I explore 8.6.
Grampa_of_Oliva_n_Eden Posted May 2, 2009
QUOTE (Gary Rubin @ May 1 2009, 08:43 AM) Hello all, As I indicated in a post last night, I am working on transitioning some code from LabVIEW 7.1.1 to LabVIEW 8.6. Because performance is very important in this application, the first thing I did was benchmark the execution times for the two versions using profile. I found something surprising that I'm hoping someone can corroborate and/or explain. I noticed that a lot (but not all) of my subVIs were faster in LV8.6, but that those involving queues were quite a bit slower. To check this out, I did the test shown in the attached picture. The reason I wrapped the enqueue and flush queue in subVIs was just so I could use the profiler on them. Both the subVIs are running in Time Critical Priority with debugging and auto error handling turned off. The Top Level VI is normal priority, but also with debugging and error handling turned off. These results are typical of what I saw for my real code: the top-level VI is MUCH faster in 8.6, but the enqueue is considerably slower. So, what's going on? Is this speed difference real, or a by-product of using Profile? Are 8.6 queues really slower than 7.1 queues? Thanks, I'm sure I'll be posting more as I explore 8.6.
Hi Gary, I don't know the answer to your Q but I am curious. If you changed your code such that all of the enqueues happened and then all of the dequeues, I wonder if it would change the numbers. My thinking is that the scheduling of parallel tasks may be different, so the dequeue may be taking longer since it is waiting for something to show up in the queue. I just wonder if there is some interaction between the two loops. So no answer here, just more Q's.
Ben
Gary Rubin Posted May 2, 2009 Author
QUOTE (neBulus @ May 1 2009, 09:28 AM) so the dequeue may be taking longer since it is waiting for something to show up in the queue.
One of the reasons I'm wondering whether it's a red herring coming out of the profiler is that both versions report that the dequeue subVI takes 0.0ms. In fact, changing the display in the profiler to microseconds, LV8.6 claims the dequeue subVI takes 0us. Not sure I believe that.
Grampa_of_Oliva_n_Eden Posted May 2, 2009
QUOTE (Gary Rubin @ May 1 2009, 09:45 AM) One of the reasons I'm wondering whether it's a red herring coming out of the profiler is that both versions report that the dequeue subVI takes 0.0ms. In fact, changing the display in the profiler to microseconds, LV8.6 claims the dequeue subVI takes 0us. Not sure I believe that.
I have had many questions about what the profiler shows me in recent versions as well. Run some benchmark tests to see if they confirm what the profiler is showing.
Ben
Gary Rubin Posted May 2, 2009 Author
QUOTE (neBulus @ May 1 2009, 09:28 AM) I don't know the answer to your Q but I am curious. If you changed your code such that all of the enqueues happened and then all of the dequeues, I wonder if it would change the numbers.
I think you're onto something there, Ben, but I'm not sure what to make of it. I modified my test code as follows, and enforced the parallel vs. sequential loops by the addition/subtraction of the wire going from the Enqueue Time indicator to the while loop. I had to decrease my number of iterations to 10k to keep the queue from growing to an unmanageable size in the sequential case. Here are the results:
I guess the sequential results are more like what I'd expect; LV8.6 is more efficient at allocating memory for the growing queue. I am surprised at the parallel results. Am I really to understand that a version of LabVIEW that was written back when everyone was on a single CPU is better at parallel tasks than one that's been developed in the world of multi-processors?
Incidentally, the 25% difference between the two is consistent with what the profiler was saying.
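For readers without LabVIEW at hand, the parallel vs. sequential comparison described above can be sketched in Python. This is only an analogy under stated assumptions: `run_benchmark` is a hypothetical helper, the Python `queue.Queue` stands in for a LabVIEW queue, and the consumer takes one element at a time rather than flushing. Absolute numbers from it say nothing about LabVIEW 7.1 or 8.6; the point is only the structure of the test, where the enqueue loop alone is timed while the consumer either runs alongside it or starts afterwards.

```python
import queue
import threading
import time


def run_benchmark(n_items, parallel):
    """Return (enqueue_seconds, items_received) for one run."""
    q = queue.Queue()
    received = []
    done = threading.Event()

    def consumer():
        # Take one element at a time until the producer has finished
        # and the queue has been emptied.
        while not (done.is_set() and q.empty()):
            try:
                received.append(q.get(timeout=0.01))
            except queue.Empty:
                pass

    worker = threading.Thread(target=consumer)
    if parallel:
        worker.start()  # consumer runs while the producer is enqueuing

    # Only the enqueue loop is timed, as in the test described above.
    start = time.perf_counter()
    for i in range(n_items):
        q.put(i)
    enqueue_seconds = time.perf_counter() - start
    done.set()

    if not parallel:
        worker.start()  # consumer starts only after every enqueue is done
    worker.join()
    return enqueue_seconds, len(received)


if __name__ == "__main__":
    for mode in (True, False):
        seconds, count = run_benchmark(10_000, parallel=mode)
        label = "parallel" if mode else "sequential"
        print(f"{label}: {count} items, enqueue loop took {seconds * 1000:.2f} ms")
```

In the parallel case the producer and consumer contend for the queue's internal lock, so the enqueue time includes that interaction; in the sequential case it does not.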
Grampa_of_Oliva_n_Eden Posted May 2, 2009
QUOTE (Gary Rubin @ May 1 2009, 11:12 AM) I think you're onto something there, Ben, but I'm not sure what to make of it. I modified my test code as follows, and enforced the parallel vs. sequential loops by the addition/subtraction of the wire going from the Enqueue Time indicator to the while loop. I had to decrease my number of iterations to 10k to keep the queue from growing to an unmanageable size in the sequential case. Here are the results: I guess the sequential results are more like what I'd expect; LV8.6 is more efficient at allocating memory for the growing queue. I am surprised at the parallel results. Am I really to understand that a version of LabVIEW that was written back when everyone was on a single CPU is better at parallel tasks than one that's been developed in the world of multi-processors? Incidentally, the 25% difference between the two is consistent with what the profiler was saying.
Ok, still wandering in the dark looking for the door... Both of your loops require allocating memory for the queue elements or for the array you are building. If you preallocate and replace the array instead of building it, then we can stop thinking about calls to the OS to allocate memory being a common bottleneck. Please see Q below my avatar.
Ben
Gary Rubin Posted May 2, 2009 Author
QUOTE (neBulus @ May 1 2009, 11:21 AM) Both of your loops require allocating memory for the queue elements or for the array you are building.
But the parallel one doesn't get nearly as big as the sequential one, because it's flushed periodically. The sequential one just keeps growing.
Grampa_of_Oliva_n_Eden Posted May 2, 2009
QUOTE (Gary Rubin @ May 1 2009, 11:27 AM) But the parallel one doesn't get nearly as big as the sequential one, because it's flushed periodically. The sequential one just keeps growing.
I was just grasping at straws to explain the apparent interaction when they run in parallel. Calls to the OS memory manager are just a guess at how the two loops could slow each other down when run at the same time.
Ben
PJM_labview Posted May 2, 2009
QUOTE (neBulus @ May 1 2009, 06:49 AM) I have had many questions about what the profiler shows me in recent versions as well. Run some benchmark tests to see if they confirm what the profiler is showing. Ben
I have had a lot of questions about the profiler since LV 8.0. I believe something was changed somewhere so that it no longer behaves quite the same way as in previous LV versions.
Gary Rubin Posted May 2, 2009 Author
QUOTE (PJM_labview @ May 1 2009, 12:43 PM) I have had a lot of questions about the profiler since LV 8.0. I believe something was changed somewhere so that it no longer behaves quite the same way as in previous LV versions.
I won't disagree with that, but my second attempt using timers in the code seemed to show the same general behavior that the profiler showed.
jzoller Posted May 2, 2009
Okay, this is (as usual) pretty far afield, but... Could the queue's internal lock/unlock mechanism have changed between versions? The compiler might be smart enough to recognize that the sequential case only needs two lock/unlock sets, while the parallel case needs many. Unfortunately, this is not easy to measure. (Edit: thinking more on the above, unless the queue has a lock-on-read, it seems unlikely. Sorry.)
I suppose it's also slightly possible you're getting context switching issues in the parallel version. It seems unlikely, since that would probably cause a much larger delay, and having the queue run in more than one thread seems unlikely. On Windows, Process Explorer can show the switches, though... (http://technet.microsoft.com/en-us/sysinte...s/bb896653.aspx).
Joe Z.
Aristos Queue Posted May 3, 2009
All the internal benchmarks that I've done show a substantial speed improvement for all queue operations from 7.1 to 8.5. I didn't redo benchmarks for 8.6. There is a new queue primitive in 8.6, but to the best of my memory none of the functionality of the existing primitives changed.
QUOTE (Gary Rubin @ May 1 2009, 08:45 AM) One of the reasons I'm wondering whether it's a red herring coming out of the profiler is that both versions report that the dequeue subVI takes 0.0ms. In fact, changing the display in the profiler to microseconds, LV8.6 claims the dequeue subVI takes 0us. Not sure I believe that.
On Windows (and I believe all the desktop platforms) the timing cannot be any more exact than milliseconds. There just isn't a more precise timer available. The fact that the time reported is zero means that each call is taking less than 1 millisecond to execute, so the sum of all iterations is still zero. That's why it makes sense to benchmark in aggregate but not individual operations. If you want a more precise timing, you need LV Real Time running on a real-time target.
QUOTE (PJM_labview @ May 1 2009, 11:43 AM) I have had a lot of questions about the profiler since LV 8.0. I believe something was changed somewhere so that it no longer behaves quite the same way as in previous LV versions.
I don't believe it has had any significant changes since at least LV 6.1 other than displaying VIs from multiple app instances, added in LV 8.0. I could be wrong -- it isn't an area that I pay close attention to feature-wise.
QUOTE (jzoller @ May 1 2009, 02:18 PM) (Edit: thinking more on the above, unless the queue has a lock-on-read, it seems unlikely. Sorry.)
I'm not sure if you're asking about this, but, yes, there is a mutex that must be acquired by the thread when it reads the queue.
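The "benchmark in aggregate" advice above can be illustrated with a small Python sketch. It assumes a timer that only resolves whole milliseconds, which is simulated here by truncating `time.perf_counter()`; the helper names `coarse_ms`, `time_single`, and `time_aggregate` are made up for illustration and are not part of any LabVIEW or profiler API.

```python
import time


def coarse_ms():
    """Simulated millisecond-resolution tick count (an assumption for this sketch)."""
    return int(time.perf_counter() * 1000)


def time_single(op):
    """Time one call with the coarse timer; a fast call usually reads as 0 ms."""
    start = coarse_ms()
    op()
    return coarse_ms() - start


def time_aggregate(op, iterations):
    """Time many calls with one start/stop pair, then derive a per-call estimate."""
    start = coarse_ms()
    for _ in range(iterations):
        op()
    total_ms = coarse_ms() - start
    return total_ms, total_ms / iterations


if __name__ == "__main__":
    data = list(range(100))
    op = lambda: sum(data)  # stand-in for a sub-millisecond operation
    print("single call (ms):", time_single(op))
    total, per_call = time_aggregate(op, 200_000)
    print(f"aggregate: {total} ms total, about {per_call * 1000:.3f} us per call")
```

Timing each call separately and summing the readings mostly adds up zeros, whereas a single start/stop pair around the whole loop accumulates enough elapsed time for the coarse timer to register.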
Gary Rubin Posted May 4, 2009 Author
QUOTE (Aristos Queue @ May 2 2009, 10:33 AM) All the internal benchmarks that I've done show a substantial speed improvement for all queue operations from 7.1 to 8.5. I didn't redo benchmarks for 8.6. There is a new queue primitive in 8.6, but to the best of my memory none of the functionality of the existing primitives changed.
Ok, given that, could others please confirm my results, that when the enqueue and dequeue loops are run in parallel, LV7.1 is faster on the enqueue than LV8.6? To get the loops to run in parallel, just delete the wire going from the timer subtraction to the lower loop.
Download File:post-4344-1241319157.zip
Download File:post-4344-1241319209.zip
If someone else can confirm these results, are there any workarounds? A 25% slowdown in my producer loop (based on enqueue performance) is unacceptable for my application.
Also, I just saw my typo in the topic title.
Gary Rubin Posted May 5, 2009 Author
QUOTE (Gary Rubin @ May 2 2009, 11:00 PM) If someone else can confirm these results, are there any workarounds? A 25% slowdown in my producer loop (based on enqueue performance) is unacceptable for my application.
I'm still looking for confirmation that my results are reproducible, but here's more. I've done a side-by-side comparison of parallel Queue-Dequeue vs. a LV2-style Global FIFO.
Download File:post-4344-1241463223.zip
If the producer loop has a 0 ms wait, and the consumer loop has a 1ms wait, I get the following results:
If I change the consumer loop wait to 0ms, I get the following:
I thought maybe I was comparing apples to oranges, in that the LV2 FIFO uses preallocated memory, while the queue does not, so I tried preallocating the queue by filling it up then flushing it prior to entering my sequence structures. This did help, but not by much. For a 100 element array, preallocating the queue reduced its runtime by about a factor of 4, but it's still twice as slow as the LV2 FIFO.
The take-away message seems to be that 8.6 Queues are faster than 7.1 queues if the number of elements in the enqueued data is less than ~50. After that, 7.1 Queues are faster. Also, LV2-style FIFOs have the potential of being much faster than Queues.
I would love to have someone shoot holes in this, or tell me that I'm not doing my comparisons right. One of my suspicions is that, because I'm timing both the producer and consumer together, it's the build array in the dequeue consumer loop that's killing me, but is there any way around that? In the meantime, I'm going to be thinking hard about changing over all my queues to LV2-FIFOs. Are queues just not meant to be used as deep FIFOs for passing data between loops?
Gary
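The idea behind an LV2-style global FIFO with preallocated memory can be sketched in Python as a fixed-size ring buffer. This is a hypothetical illustration, not the code from the attached zip: the class name `RingFifo`, its methods, and the choice to drop new items and raise an overflow flag when the buffer is full are all assumptions. The lock plays the role of the non-reentrant VI, which lets only one caller in at a time.

```python
import threading


class RingFifo:
    """Fixed-capacity FIFO over a buffer that is allocated once (hypothetical sketch)."""

    def __init__(self, capacity):
        self._buf = [None] * capacity   # preallocated storage, reused for every write
        self._capacity = capacity
        self._head = 0                  # index of the oldest stored element
        self._count = 0                 # number of elements currently stored
        self._lock = threading.Lock()   # one caller at a time, like a non-reentrant VI
        self.overflow = False           # set when a write had to be dropped

    def write(self, item):
        """Store one item; return False and flag overflow if the buffer is full."""
        with self._lock:
            if self._count == self._capacity:
                self.overflow = True    # assumption: drop the new item when full
                return False
            tail = (self._head + self._count) % self._capacity
            self._buf[tail] = item
            self._count += 1
            return True

    def read_all(self):
        """Return every stored element in FIFO order and mark the buffer empty."""
        with self._lock:
            items = [self._buf[(self._head + i) % self._capacity]
                     for i in range(self._count)]
            self._head = (self._head + self._count) % self._capacity
            self._count = 0
            return items


if __name__ == "__main__":
    fifo = RingFifo(capacity=4)
    for value in range(6):
        fifo.write(value)
    print(fifo.read_all(), "overflow:", fifo.overflow)
```

Because the storage never grows, writes involve no allocation, which is the property being compared against a queue that has to grow as elements arrive. The trade-off is that a slow consumer shows up only as the overflow flag rather than as a longer queue.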
LAVA 1.0 Content Posted May 5, 2009
QUOTE (Gary Rubin @ May 4 2009, 09:26 PM) I would love to have someone shoot holes in this, or tell me that I'm not doing my comparisons right. One of my suspicions is that, because I'm timing both the producer and consumer together, it's the build array in the dequeue consumer loop that's killing me, but is there any way around that? In the meantime, I'm going to be thinking hard about changing over all my queues to LV2-FIFOs. Are queues just not meant to be used as deep FIFOs for passing data between loops? Gary
I am suspicious about the array building as well. I haven't looked at the LV2 code, but I don't think that using a queue to store all the data and then reading it as a burst using 'Flush Queue' is standard procedure. To be totally fair, I wouldn't include the 'Destroy Queue' in the timing. One other thing is that you preset the size of the LV2 global, something you didn't do for the queue. You only have an overflow flag to detect the slow 'dequeuing' of data. There are some little things already taken care of in the queue solution that aren't available in the LV2 global, like prevention of getting old data.
Ton
Gary Rubin Posted May 5, 2009 Author
QUOTE (Ton @ May 4 2009, 04:32 PM) I am suspicious about the array building as well. I haven't looked at the LV2 code, but I don't think that using a queue to store all the data and then reading it as a burst using 'Flush Queue' is standard procedure. To be totally fair, I wouldn't include the 'Destroy Queue' in the timing. One other thing is that you preset the size of the LV2 global, something you didn't do for the queue. You only have an overflow flag to detect the slow 'dequeuing' of data. There are some little things already taken care of in the queue solution that aren't available in the LV2 global, like prevention of getting old data. Ton
Thanks, Ton. I'll try some of those suggestions tonight, like rerunning with the queue preallocation and removing the Destroy Queue. I'll post results later, so long as the kids let me work.
Gary
shoneill Posted May 5, 2009
QUOTE (Gary Rubin @ May 4 2009, 10:51 PM) Thanks, Ton. I'll try some of those suggestions tonight, like rerunning with the queue preallocation and removing the Destroy Queue. I'll post results later, so long as the kids let me work. Gary
I changed the "preferred Execution System" of your benchmarking VI to "User Interface" and I got much faster results for the Queues..... (Still slower than the LV2 global though)
Discuss.
Performed in 8.2.1.
Shane.
Gary Rubin Posted May 6, 2009 Author
Eureka! I've added queue preallocation, and am finally getting trends more in line with what I was expecting. I also tried changing the execution system from Same as Caller to Standard, but that didn't seem to have much effect.
Download File:post-4344-1241484811.zip
For a 0ms wait in both consumer and producer loops:
For a 0ms wait in the producer and a 1ms wait in the consumer:
It appears that from 7.1 to 8.6, the queue method (as measured by this benchmark) got about 50% faster, while the LV2 method got considerably slower?! So, the lesson learned is that you really want to preallocate your queue if you plan to use it for a large amount of data. Thanks to all who have taken a look at my code.
There's still something I'm wondering about; as Ton suggested, is a queue recommended for this type of use? (AQ?)
QUOTE (shoneill @ May 4 2009, 05:51 PM) I changed the "preferred Execution System" of your benchmarking VI to "User Interface" and I got much faster results for the Queues..... (Still slower than the LV2 global though)
Shane, I tried that and got MUCH slower results overall, although the queue and the LV2 were more similar.
QUOTE There are some little things already taken care of in the queue solution that aren't available in the LV2 global, like prevention of getting old data.
Ton, what do you mean by this? How could you get old (I assume you mean previously read) data out of an LV2 global?
Thanks again,
Gary
shoneill Posted May 6, 2009
QUOTE (Gary Rubin @ May 5 2009, 03:09 AM) Eureka! I've added queue preallocation, and am finally getting trends more in line with what I was expecting. I also tried changing the execution system from Same as Caller to Standard, but that didn't seem to have much effect. Shane, I tried that and got MUCH slower results overall, although the queue and the LV2 were more similar.
I can confirm that on the Mac in LV 8.6, selecting User Interface makes the whole thing a LOT slower. Funnily enough, on LV 8.2.1 and WinXP, it got faster.....
The queue is still a good bit slower than the LV2 on the Mac with 8.6..... Weird.
Shane.
Grampa_of_Oliva_n_Eden Posted May 6, 2009
RE: running faster in the user interface... An old recommendation from Greg McKaskle was to set the OS to optimize for background processes.... I am not sure if that will make a difference.
Ben