infinitenothing - Posted November 10, 2016

I was considering migrating to tag channels as a stop signal and wanted to do a performance test. I'm targeting an sbRIO-9627. I've attached the benchmarking code below.

Results:
- DVR: 8.3 µs
- Global: 4.1 µs
- Notifier: 7.9 µs
- Pipe: 90 µs
- Queue: 7.7 µs

I'd say all the reference-based methods are about the same. The global is the fastest, but there are potential reuse issues. The pipe is significantly slower.
infinitenothing (Author) - Posted November 10, 2016

To focus on the VI being benchmarked, I put a For Loop in the sequence structure:
- DVR: 2.4 µs
- Pipe: 77 µs
- Notifier: 2.2 µs
- Queue: 2.0 µs
- Global: 0.04 µs
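For anyone wanting to replicate the measurement pattern outside LabVIEW, here is a minimal sketch of the idea in C++ as a stand-in for the G diagram (the `read_tag` function is a hypothetical placeholder for whichever read primitive is under test): time an empty loop, time the same loop containing the operation, and divide the difference by the iteration count.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the primitive under test (tag read, queue
// status check, global read, ...). In the real benchmark this is a G node.
volatile int shared_tag = 0;
static int read_tag() { return shared_tag; }

int main() {
    using clock = std::chrono::steady_clock;
    const long N = 1000000;

    // Frame 1: empty loop, to measure loop overhead alone. (The volatile
    // counter stops an optimizer from deleting the loop; in G the
    // equivalent frame is not optimized away.)
    auto t0 = clock::now();
    for (volatile long i = 0; i < N; ++i) {}
    auto t1 = clock::now();

    // Frame 2: the same loop containing the operation under test.
    int sink = 0;
    for (long i = 0; i < N; ++i) sink += read_tag();
    auto t2 = clock::now();

    double empty = std::chrono::duration<double, std::micro>(t1 - t0).count();
    double full  = std::chrono::duration<double, std::micro>(t2 - t1).count();
    std::printf("per-call cost: %.4f us (sink=%d)\n", (full - empty) / N, sink);
}
```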
smithd - Posted November 11, 2016

I don't know how much this got optimized before 2016, but look at the 'what goes on behind the scenes' section here: https://decibel.ni.com/content/docs/DOC-41918

> "the channel wire is replaced by a static VI reference to a clone VI that both writers and readers share....Some of the implementations use the core VI for their entire implementation. Take a look at the Pipe template for an example of one of these."

Unless anything has changed in the last few years, VI reference calls are obscenely slow. If you've ever poked around the "why is LVOOP so slow" threads, the answer is "LVOOP isn't slow... it's just the technology they're using" (which is VI ref calls).

Other thoughts:
- Re the global: they are fast, but I don't think they're that fast. 60x faster than a DVR accessed by a single thread seems awfully high.
- You missed RT FIFOs in your benchmark; they should be on par with queues.
- As mentioned in the link, channels have other, more performant implementations.
- There is a use case for out-of-band stops, but it's generally better to use in-band signaling (a user event for an event loop, a 'stop' message for queued message handlers). One nice thing channels do is carry an inline stop bit, so consumers can be told "this is my last bit of data, you can stop after you're done processing it" (see the sketch below).
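As an illustration of that in-band stop pattern (not LabVIEW code, just a C++ analogue with a hypothetical `Msg` type): each element carries a last-element flag, so the consumer finishes processing the final item and then stops, with no separate stop channel and no data lost.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Each element carries the data plus a "last element" flag, mirroring the
// channel's inline stop bit described above.
struct Msg { int data; bool last; };

std::queue<Msg> q;
std::mutex m;
std::condition_variable cv;

void producer() {
    for (int i = 0; i < 5; ++i) {
        std::lock_guard<std::mutex> lk(m);
        q.push({i, i == 4});   // flag the final element in-band
        cv.notify_one();
    }
}

void consumer() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !q.empty(); });
        Msg msg = q.front(); q.pop();
        lk.unlock();
        std::printf("processed %d\n", msg.data);
        if (msg.last) break;   // finish this item, then stop: nothing is lost
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}
```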
shoneill - Posted November 11, 2016

> Unless anything has changed in the last few years, VI reference calls are obscenely slow. If you've ever poked around the "why is LVOOP so slow" threads, the answer is "LVOOP isn't slow... it's just the technology they're using" (which is VI ref calls).

Being someone on the side of "why is LVOOP so slow", I can't let this stand uncorrected. While DD calls ARE slow, they most certainly do NOT utilise Call VI by Reference in the background; they're too fast for that. My benchmarks have shown that the overhead for a pure DD call is in the region of 1 µs per call. Note that if a child calls its parent, that's TWO DD calls (and therefore in the region of 2 µs of overhead). Please note this is purely the OVERHEAD of the call; the actual call may take considerably longer. But even if the code does NOTHING (kind of like here), the 1-2 µs are pretty much guaranteed. So LVOOP is slower than it should be, but I don't know if I'd equate it with calling VIs by reference. That's way worse, I think.
shoneill - Posted November 11, 2016

CAVEAT: I can't open the code provided as I don't have LV 2016 installed, so I don't know HOW the OP is calling the VIs by reference. I'm assuming it's over the connector pane with a strictly-typed VI reference?
GregSands - Posted November 11, 2016

Opening up the Tag VIs, it doesn't look as though there's a call by reference (except when the channel is instantiated) - it's a straight call inside the channel code. But there is a Lookup Channel Probe on every call - I wonder if that's what's slow? Nothing else looks particularly unusual or time-consuming.
infinitenothing (Author) - Posted November 11, 2016

15 hours ago, smithd said:
> Unless anything has changed in the last few years, VI reference calls are obscenely slow. [...]

A large part of the slowness in the pipe might be the occurrence. Just checking an occurrence with a 0 ms timeout takes 50 µs.

If back-saving this to 2015 would be useful, I can probably do that.

I skipped RT FIFOs because I couldn't see a way to use them as a global (Preview Queue). The High Speed Stream is more akin to queues than to a tag or a global, and I got error 1055 when I tried to create one in RT.

Many of my loops in RT are something like "grab data, send data, wait, repeat until something catastrophic happens, and then clean up." There really isn't any in-band communication to tag onto.
infinitenothing (Author) - Posted November 11, 2016

This is the call by ref in question inside NI's tag channel code. I don't think it gets called unless there's a probe, but, yeah, the lookup could have a small penalty.
smithd - Posted November 12, 2016

On 11/11/2016 at 11:04 AM, infinitenothing said:
> This is the call by ref in question inside NI's tag channel code. I don't think it gets called unless there's a probe, but, yeah, the lookup could have a small penalty.

I have no real idea how it works, but if you make two boolean channels, for example, they both point to the same VI in the channel instances folder on disk. Since the data is a functional global, the compiler must by some magic figure out which FGV is associated with each channel. I would assume it uses call-by-ref or something similar to assign a specific channel instance to a given call site, but I'd certainly be curious if there were a real answer to this somewhere (see the sketch below).

On 11/11/2016 at 0:43 AM, shoneill said:
> While DD calls ARE slow, they most certainly do NOT utilise Call VI by Reference in the background; they're too fast for that. [...]

I (well, mostly other people I work with, but hey) have done a series of benchmarks which show to my satisfaction that DD is on the same order of magnitude in performance as call-by-ref. I'm not sure what difference in performance you saw. The reason I accuse DD of being similar to call-by-ref is that this is what I see in the profile tool. It shows every DD call site as a separate call-by-ref instance:

[Image: block diagram of untitled 2.vi, calling 2 instances of DD function untitled 1.vi]
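One way to picture the clone-plus-FGV scheme, purely as a C++ analogy under the assumption described above (this is not how LabVIEW actually implements channels): all channels of one type share a single piece of code, and each channel wire amounts to a reference to its own state block.

```cpp
#include <atomic>
#include <cstdio>
#include <memory>

// One implementation shared by every tag channel of a type (the "core VI"),
// with the per-channel state factored out, so a channel is just a reference
// to its own state block.
template <typename T>
struct TagChannel {
    std::atomic<T> value{};
    void write(T v) { value.store(v, std::memory_order_release); }
    T read() const { return value.load(std::memory_order_acquire); }
};

int main() {
    // Two boolean channels share the code above but not the state, just as
    // two boolean channel wires share one instance VI on disk.
    auto stop_a = std::make_shared<TagChannel<bool>>();
    auto stop_b = std::make_shared<TagChannel<bool>>();
    stop_a->write(true);
    std::printf("a=%d b=%d\n", stop_a->read(), stop_b->read());
}
```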
shoneill - Posted November 14, 2016

I don't understand what you are trying to show there, to be honest.
smithd - Posted November 15, 2016

I'm just trying to show that the VI profiler says the dynamic dispatch calls are call-by-ref instances. The code shown is all the code being run through the profiler ('untitled 2' being the top level). I don't know if the VI profiler is telling the truth, but that's why I thought DD ~= call-by-ref.
shoneill - Posted November 15, 2016

Try setting the DD VI to not be reentrant. The tests I made were with non-reentrant VIs (and with all debugging disabled), and I saw overheads in the region of 1 µs per DD call. I have had a long discussion with NI over this over HERE. If DD calls really are by-reference VI calls in the background, that would be interesting, but I always thought the overhead of such VI calls was significantly more than 1 µs. Maybe I've been misinformed all this time.
Neil Pate - Posted November 15, 2016

21 minutes ago, shoneill said:
> I have had a long discussion with NI over this over HERE.

Champions access only (things like this make me sad...)
shoneill - Posted November 15, 2016

Ah crap, really? I keep forgetting that. Well, it was basically a discussion where Stephen Mercer helped me out with benchmarking DD calls and making apples-to-apples comparisons.

- LVOOP non-reentrant: 260 ns overhead
- LVOOP reentrant: 304 ns overhead
- LVOOP static inline: 10 ns overhead
- Standard non-inlined VI: 78 ns overhead
- Case structure with specific code instead of DD call: 20.15 ns overhead
- "Manual DD" (case structure with non-inlined, non-reentrant VIs): 99 ns overhead

A direct apples-to-apples comparison of a DD call versus a case structure with N VIs within (manually selecting the version to call) showed that whatever DD is doing, it is three times slower (in OVERHEAD, not execution speed in general) than doing the same thing manually. Again, bear in mind this measures the OVERHEAD of the VI call only; the VIs themselves are doing basically nothing. If your code takes even 100 µs to execute, the DD overhead is basically negligible.
smithd - Posted November 15, 2016

7 hours ago, shoneill said:
> Try setting the DD VI to not be reentrant.

I had both cases in my post originally but removed the labels to try to make it clearer. The behavior is identical for reentrant and non-reentrant.
shoneill - Posted November 15, 2016

So, questioning my previously-held notions of speed regarding calls by reference, I performed a test with several identical VIs called in different ways:

1) DD method called normally (DD cannot be called by reference)
2) Static class method called by reference
3) Same static method VI called statically
4) Standard VI (not a class member) called by reference
5) Same standard VI called statically

All VIs have the SAME connector pane controls connected, return the same values, and are run the same number of times. I do NOT have the VI profiler running while they are being benchmarked, as this HUGELY changes the results. Debugging is enabled and nothing is inlined. The class used for testing initially had NO private data, but I repeated the tests with a string as private data, along with an accessor to write to the string. Of course, as the actual size of the object increases, so does the overhead (presumably a copy is being made somewhere). The same trend is observed throughout. I can't attach any images currently; LAVA is giving me errors.

Results were:
1) 0.718 µs per call (DD); 1.334 µs with string length 640
2) 0.842 µs per call (non-DD, by reference); 1.458 µs with string length 640
3) 0.497 µs per call (non-DD, static); 1.075 µs with string length 640
4) 0.813 µs per call (standard, by reference); 1.487 µs with string length 640
5) 0.504 µs per call (standard, static); 1.098 µs with string length 640

It appears to me that calling a VI by reference versus statically adds approximately 0.3 µs to the call (nearly doubling the overhead). Given this, a single DD call is actually slightly more efficient than calling an equivalent static member by reference (or a standard VI by reference, for that matter). Of course, we're at the limit of what we can reliably benchmark here.
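For comparison, the same three call shapes exist in textual languages, though LabVIEW's call machinery is far heavier, so the absolute numbers won't match. A minimal C++ sketch of static call versus call through a pointer versus dynamic dispatch:

```cpp
#include <chrono>
#include <cstdio>

// Three call shapes, loosely analogous to the cases above: static call
// (3/5), call by reference (2/4), and dynamic dispatch (1).
struct Base { virtual int f(int x) { return x + 1; } virtual ~Base() {} };
struct Child : Base { int f(int x) override { return x + 2; } };

static int plain(int x) { return x + 1; }

template <typename F>
static double us_per_call(F call, long n) {
    volatile int sink = 0;   // keep the calls from being optimized out
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < n; ++i) sink = call(static_cast<int>(i));
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / n;
}

int main() {
    const long n = 10000000;
    Child child;
    Base* obj = &child;       // dynamic dispatch through a vtable (an
                              // optimizer may devirtualize this simple case;
                              // LabVIEW DD cannot be resolved this way)
    int (*fp)(int) = plain;   // call through a function pointer

    std::printf("static:  %.4f us/call\n", us_per_call([](int x) { return plain(x); }, n));
    std::printf("pointer: %.4f us/call\n", us_per_call([fp](int x) { return fp(x); }, n));
    std::printf("virtual: %.4f us/call\n", us_per_call([obj](int x) { return obj->f(x); }, n));
}
```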
shoneill - Posted November 15, 2016

So one thing I have learned here is that calling a VI by reference is WAY faster than I thought. When did that change? I may have last tested in LV 6.1.
GregSands - Posted November 15, 2016

OK, that all looks interesting, even reassuring, but does it explain why the Tag channel appears to be so slow? And is it specific to the Tag channel, or is it the same slowdown for any type of channel? Given that it's a new technology, it's not too surprising that there are some optimizations yet to come, but the degree to which it is slower is concerning.
infinitenothing (Author) - Posted November 15, 2016

Most of the slowness is the occurrence (50 µs by itself).
GregSands - Posted November 16, 2016

On 16/11/2016 at 9:36 AM, infinitenothing said:
> Most of the slowness is the occurrence (50 µs by itself).

I was thinking "this can't be right, maybe that's only on your sbRIO", but I checked on a Windows desktop (just creating a new case in your timing test above), and checking an occurrence takes ~75% of the time of checking the Tag channel. I've almost never used occurrences, but I note in the Help that "National Instruments encourages you to use the Notifier Operations functions in place of occurrences for most operations." Perhaps channels should follow this advice! Unless, of course, there is some reason that an occurrence is required.
infinitenothing (Author) - Posted November 17, 2016

If you use a notifier with a 0 ms wait, it takes just as long. I was cheating when I used the notifier status function, which is much faster. I think this is common to all the functions that might wait for a message and time out. I wonder if it's similar to a 0 ms wait on elapsed time, where it does allow the processor to sleep a little.
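The same two shapes are easy to reproduce in other languages; exact costs are platform-dependent, but the point is that a timeout-capable wait goes through the full wait path (lock, possible kernel call, timeout bookkeeping) even with a 0 ms timeout, while a status check is a plain read. A C++ sketch:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>

std::atomic<bool> flag{false};
std::mutex m;
std::condition_variable cv;

int main() {
    using namespace std::chrono;
    const long N = 100000;

    // "Get status" style: a non-blocking read.
    auto t0 = steady_clock::now();
    for (long i = 0; i < N; ++i) (void)flag.load();
    auto t1 = steady_clock::now();

    // "Wait with 0 ms timeout" style: the full wait path runs even
    // though the call never actually blocks.
    for (long i = 0; i < N; ++i) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait_for(lk, milliseconds(0), [] { return flag.load(); });
    }
    auto t2 = steady_clock::now();

    std::printf("status check: %.4f us/call\n",
                duration<double, std::micro>(t1 - t0).count() / N);
    std::printf("0 ms wait:    %.4f us/call\n",
                duration<double, std::micro>(t2 - t1).count() / N);
}
```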
GregSands - Posted November 18, 2016

I didn't know that; interesting. I modified the test code to check all these options, with these results (median time per iteration):

[results shown in attached image]
ShaunR - Posted November 18, 2016

1 hour ago, GregSands said:
> I modified the test code to check all these options

Be aware when benchmarking globals that their access times are heavily dependent on how many instances there are and on whether you are reading and/or writing. They deteriorate rapidly as contention for the resource and the number of copies of the data increase.
GregSands - Posted November 18, 2016

2 hours ago, ShaunR said:
> Be aware when benchmarking globals that their access times are heavily dependent on how many instances there are and on whether you are reading and/or writing. [...]

Thanks. I guess this means the value shown above is the best case (for a single reader) and is comparable with a local variable read.
Rolf Kalbermatter - Posted November 23, 2016

On 16-11-2016 at 9:44 PM, GregSands said:
> I've almost never used occurrences, but I note in the Help that "National Instruments encourages you to use the Notifier Operations functions in place of occurrences for most operations." Perhaps channels should follow this advice! Unless, of course, there is some reason that an occurrence is required.

I'm pretty convinced that notifiers, queues, semaphores and such all internally use the occurrence mechanism for their asynchronous operation. The Wait on Occurrence node is likely a more complex wrapper around the internal primitive that waits on the actual occurrence and is used by those objects internally. But there might be a fundamental implementation detail in how the OnOccurrence() function is implemented in LabVIEW that takes this much time; it's what Wait on Occurrence (and all those other nodes, when they need to wait) ultimately ends up calling.
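To make that layering concrete, here is a minimal occurrence-like primitive sketched in C++, assuming only the semantics described above (set wakes waiters; wait blocks with a timeout). Real LabVIEW occurrences also have "ignore previous" subtleties that are omitted, and nothing here claims to match OnOccurrence() internally; a notifier would then be this plus shared data, and a queue this plus a buffer.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

// A minimal occurrence-like primitive: set() wakes anything blocked in wait().
class Occurrence {
    std::mutex m;
    std::condition_variable cv;
    unsigned generation = 0;   // counts how many times set() has fired
public:
    void set() {
        std::lock_guard<std::mutex> lk(m);
        ++generation;
        cv.notify_all();
    }
    // Returns true if the occurrence fired before the timeout elapsed.
    bool wait(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(m);
        unsigned seen = generation;
        return cv.wait_for(lk, timeout, [&] { return generation != seen; });
    }
};

int main() {
    Occurrence occ;
    std::thread setter([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        occ.set();   // a notifier would pair this with writing shared data
    });
    bool fired = occ.wait(std::chrono::milliseconds(100));
    std::printf("fired before timeout: %d\n", fired);
    setter.join();
}
```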