
Boolean Tag Channel vs Notifier Performance



I was considering migrating to tag channels as a stop signal and wanted to do a performance test. I'm targeting an sbRIO-9627. I've attached the benchmarking code below.

Results:

DVR: 8.3 us

Global: 4.1 us

Notifier: 7.9 us

Pipe: 90 us

Queue: 7.7 us

I'd say all the reference-based methods are about the same. The global is the fastest, but there are potential reuse issues. The pipe is significantly slower.
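For anyone who can't open the attachment, here is a rough outline of the kind of timing loop behind numbers like these. This is a Python stand-in, not the attached LabVIEW code, and it assumes a read-many-times-and-average approach; the Event and dict below are only illustrative analogues of a notifier/tag read and a global read.

```python
import threading
import time

def per_read_us(read_flag, iterations=1_000_000):
    """Average cost of one read of a boolean 'stop' signal, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        read_flag()                      # the read under test
    return (time.perf_counter() - start) / iterations * 1e6

# Illustrative stand-ins for two of the mechanisms being compared.
stop_event = threading.Event()           # rough analogue of a notifier/tag read
plain_flag = {"stop": False}             # rough analogue of a global read

print("event read: %.3f us" % per_read_us(stop_event.is_set))
print("plain read: %.3f us" % per_read_us(lambda: plain_flag["stop"]))
```

On a desktop this mostly measures loop overhead, so the absolute numbers are not comparable to the sbRIO results above; the point is only the measurement pattern.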

 

 

 

 

 

bool read benchmark.png

Link to comment

I don't know how much this got optimized before 2016, but if you look here at the 'what goes on behind the scenes' section:

https://decibel.ni.com/content/docs/DOC-41918

"the channel wire is replaced by a static VI reference to a clone VI that both writers and readers share....Some of the implementations use the core VI for their entire implementation. Take a look at the Pipe template for an example of one of these."

Unless anything has changed in the last few years, VI reference calls are obscenely slow. If you've ever poked around the "why is lvoop so slow" threads, the answer is "lvoop isn't slow... it's just the technology they're using" (which is VI ref calls).

 

Other thoughts:
- Re the global: they are fast, but I don't think they're that fast... 60x faster than a DVR accessed by a single thread seems awfully high.
- You missed RT FIFOs in your benchmark; they should be on par with queues.
- As mentioned in the link, channels have other, more performant implementations.
- There is a use case for out-of-band stops, but it's generally better to use in-band (a user event for an event loop, a 'stop' message for queued message handlers). One nice thing channels do is provide an inline stop bit, so consumers can be told "this is my last bit of data, you can stop after you're done processing it".
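A rough sketch of that in-band stop idea, using Python as a stand-in (the queue, the STOP sentinel, and the consumer are illustrative only, not LabVIEW or channel internals):

```python
import queue
import threading

STOP = object()  # sentinel: the "this is my last bit of data" marker

def consumer(q):
    """Queued message handler that stops in-band: the stop request arrives
    through the same queue as the data, so nothing queued ahead of it is lost."""
    while True:
        msg = q.get()
        if msg is STOP:
            break            # stop only after everything before it was handled
        print("processing", msg)

q = queue.Queue()
worker = threading.Thread(target=consumer, args=(q,))
worker.start()
for i in range(3):
    q.put(i)                 # normal data
q.put(STOP)                  # the in-band stop, ordered after the data
worker.join()
```

The point is ordering: because the stop travels in the same stream as the data, the consumer finishes everything queued before it and then shuts down cleanly, which is what the channel's inline stop bit gives you.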

Edited by smithd
Link to comment
Quote

Unless anything has changed in the last few years, VI reference calls are obscenely slow. If you've ever poked around the "why is lvoop so slow" threads, the answer is "lvoop isn't slow... it's just the technology they're using" (which is VI ref calls).

 

Being someone on the side of "Why is LVOOP so slow", I can't let this stay uncorrected. While DD calls ARE slow, they most certainly do NOT use call-VI-by-reference in the background; they're too fast for that. My benchmarks have shown that the overhead for a pure DD call is in the region of 1 us per call. Note that if a child calls its parent, that's TWO DD calls (and therefore in the region of 2 us overhead). Please note this is purely the OVERHEAD of the call; the actual call may take considerably longer. But even if the code does NOTHING (kind of like here), the 1-2 us are pretty much guaranteed.

So LVOOP is slower than it should be, but I don't know if I'd equate it with calling VIs by reference. That's way worse, I think.

Edited by shoneill
Link to comment

Opening up the Tag VIs, it doesn't look as though there's a call by reference (except when it is instantiated) - it's a straight call inside the channel code.  But there is a Lookup Channel Probe on every call - I wonder if that's what's slow?  Nothing else looks particularly unusual or time-consuming.

 

Link to comment
15 hours ago, smithd said:

I don't know how much this got optimized before 2016, but if you look here at the 'what goes on behind the scenes' section:

https://decibel.ni.com/content/docs/DOC-41918

"the channel wire is replaced by a static VI reference to a clone VI that both writers and readers share....Some of the implementations use the core VI for their entire implementation. Take a look at the Pipe template for an example of one of these."

Unless anything has changed in the last few years, VI reference calls are obscenely slow. If you've ever poked around the "why is lvoop so slow" threads, the answer is "lvoop isn't slow... it's just the technology they're using" (which is VI ref calls).

 

Other thoughts:
- Re the global: they are fast, but I don't think they're that fast... 60x faster than a DVR accessed by a single thread seems awfully high.
- You missed RT FIFOs in your benchmark; they should be on par with queues.
- As mentioned in the link, channels have other, more performant implementations.
- There is a use case for out-of-band stops, but it's generally better to use in-band (a user event for an event loop, a 'stop' message for queued message handlers). One nice thing channels do is provide an inline stop bit, so consumers can be told "this is my last bit of data, you can stop after you're done processing it".

A large part of the slowness in the pipe might be the occurrence. Just checking an occurrence for 0 ms takes 50us.

If back-saving this to 2015 would be useful, I can probably do that.

I skipped RT FIFOs because I couldn't see a way to use them as a global (no Preview Queue equivalent).

The high-speed stream is more akin to queues than to a tag or a global, and I got an error 1055 when I tried to create it in RT.

Many of my loops in RT are something like "grab data, send data, wait, repeat until something catastrophic happens and then clean up." There really isn't any in-band communication to tag onto.
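For context, that loop shape looks roughly like the sketch below (Python stand-in with hypothetical placeholder functions; the real thing is LabVIEW on the sbRIO, and the boolean tag is what the Event plays here):

```python
import threading
import time

stop = threading.Event()            # out-of-band stop signal (boolean tag stand-in)

def rt_loop():
    """Grab data, send data, wait; repeat until told to stop, then clean up."""
    while not stop.is_set():        # this per-iteration check is the cost being benchmarked
        data = read_hardware()      # grab data (placeholder)
        send(data)                  # send data (placeholder)
        time.sleep(0.001)           # wait for the next cycle
    cleanup()                       # placeholder

# Illustrative placeholders so the sketch runs on its own.
def read_hardware(): return 0
def send(x): pass
def cleanup(): pass

worker = threading.Thread(target=rt_loop)
worker.start()
time.sleep(0.05)
stop.set()                          # the out-of-band stop, set from anywhere
worker.join()
```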

 

 

occurance.png

Link to comment
On 11/11/2016 at 11:04 AM, infinitenothing said:

This is the call by ref in question inside NI's tag channel code. I don't think that gets called unless there's a probe but, yeah, the lookup could have a small penalty.

I have no real idea how it works, but if you make two boolean channels, for example, they both point to the same VI in the channel instances folder on disk. Since the data is a functional global, the compiler must by some magic figure out which FGV is associated with each channel. I would assume it uses call-by-ref or something similar to figure out how to assign a specific channel instance to a given call site, but I'd certainly be curious if there were a real answer to this somewhere.

On 11/11/2016 at 0:43 AM, shoneill said:

Being someone on the side of "Why is LVOOP so slow", I can't let this stay uncorrected. While DD calls ARE slow, they most certainly do NOT use call-VI-by-reference in the background; they're too fast for that. My benchmarks have shown that the overhead for a pure DD call is in the region of 1 us per call. Note that if a child calls its parent, that's TWO DD calls (and therefore in the region of 2 us overhead). Please note this is purely the OVERHEAD of the call; the actual call may take considerably longer. But even if the code does NOTHING (kind of like here), the 1-2 us are pretty much guaranteed.

So LVOOP is slower than it should be, but I don't know if I'd equate it with calling VIs by reference. That's way worse, I think.

I (well, mostly other people I work with, but hey) have done a series of benchmarks which show, to my satisfaction, that DD is in the same order of magnitude of performance as call-by-ref. I'm not sure what difference in performance you saw. The reason I accuse DD of being similar to call by ref is that this is what I see in the profile tool. It shows every DD call site as being a separate call-by-ref instance:

Block diagram of untitled 2.vi, calling 2 instances of DD function untitled 1.vi.

non-reentrant.PNG

 

Edited by smithd
clarify
Link to comment

I'm just trying to show that the VI profiler says the dynamic dispatch calls are call-by-ref instances. The code shown is all the code being run through the profiler ('untitled 2' being the top level). I don't know if the VI profiler is telling the truth, but that's why I thought DD ~= call-by-ref.

Edited by smithd
Link to comment

Try setting the DD VI to not be reentrant...

The tests I made were with non-reentrant VIs (and with all debugging disabled) and I saw overheads in the region of 1 microsecond per DD call.  I have had a long discussion with NI over this over HERE.

If DD calls really are by-reference VI calls in the background, that would be interesting, but I always thought the overhead of such VI calls was significantly more than 1 us.  Maybe I've been misinformed all this time. :o

Link to comment

Ah crap, really? I keep forgetting that.

Well, it was basically a discussion where Stephen Mercer helped me out with benchmarking DD calls and making apples-to-apples comparisons.

LVOOP non-reentrant: 260 ns overhead
LVOOP reentrant: 304 ns overhead
LVOOP static inline: 10 ns overhead
Standard non-inlined VI: 78 ns overhead
Case structure with specific code instead of DD call: 20.15 ns overhead
"Manual DD" (case structure with non-inlined non-reentrant VIs): 99 ns overhead

A direct apples-to-apples comparison of a DD call vs a case structure with N VIs within (manually selecting the version to call) showed that whatever DD is doing, it is three times slower (in overhead, NOT execution speed in general) than doing the same thing manually.

Again, bear in mind this measures the OVERHEAD of the VI call only; the VIs themselves are doing basically nothing.  If your code takes even 100 us to execute, then the DD overhead is basically negligible.

Link to comment

So, questioning my previously held notions about the speed of calls by reference, I performed a test with several identical VIs called in different ways:

1) DD method called normally (DD cannot be called by reference)
2) Static class method called by reference
3) Same static method VI called statically
4) Standard VI (not a class member) called by reference
5) Same standard VI called statically

All VIs have the SAME connector pane controls connected, return the same values, and are run the same number of times.  I do NOT have the VI profiler running while they are being benchmarked, as this HUGELY changes the results.  Debugging is enabled and nothing is inlined.  The class used for testing initially had NO private data, but I repeated the tests with a string as private data along with an accessor to write to the string.  Of course, as the actual size of the object increases, so does the overhead (presumably a copy is being made somewhere).  The same trend is observed throughout.

I can't attach any images currently, LAVA is giving me errors...

Results were:

1) 0.718 us per call (DD) - 1.334 us with string length 640
2) 0.842 us per call (non-DD, Ref) - 1.458 us with string length 640
3) 0.497 us per call (non-DD, Static) - 1.075 us with string length 640
4) 0.813 us per call (Std, Ref) - 1.487 us with string length 640
5) 0.504 us per call (Std, Static) - 1.098 us with string length 640

It appears to me that calling a VI by reference versus statically adds approximately 0.3 us to the call (nearly doubling the overhead).  Given this, a single DD call is actually slightly more efficient than calling an equivalent static member by reference (or a standard VI by reference, for that matter).  Of course, we're at the limit of what we can reliably benchmark here.
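For reference, the numbers above isolate per-call overhead by calling do-nothing VIs many times and averaging. A rough sketch of that methodology (Python stand-in with illustrative names, not the actual LabVIEW benchmark):

```python
import time

def noop(x):
    """Does nothing useful, so any measured time is call overhead."""
    return x

def per_call_ns(fn, iterations=1_000_000):
    """Average cost of one call in nanoseconds (loop cost included)."""
    start = time.perf_counter()
    for i in range(iterations):
        fn(i)
    return (time.perf_counter() - start) / iterations * 1e9

direct  = per_call_ns(noop)                  # stand-in for a static call
via_hop = per_call_ns(lambda i: noop(i))     # one extra level of indirection

print("direct:        %.0f ns per call" % direct)
print("via extra hop: %.0f ns per call (difference ~%.0f ns)" % (via_hop, via_hop - direct))
```

The difference between the two timings is the analogue of the ~0.3 us "by reference versus static" gap being described; absolute values depend entirely on the machine and runtime.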

Link to comment

OK, that all looks interesting, even reassuring, but does it explain why the Tag Channel appears to be so slow?  And is it specific to the Tag Channel, or is the slowdown the same for any type of Channel?  Given that it's a new technology, it's not too surprising that there are some optimizations yet to come, but the degree to which it is slower is concerning.

Link to comment
On 16/11/2016 at 9:36 AM, infinitenothing said:

Most of the slowness is the occurrence (50us by itself)

I was thinking "this can't be right, maybe that's only on your sbRIO", but I checked on a Windows desktop (just creating a new case in your timing test above), and checking an occurrence takes ~75% of the time of checking the Tag Channel.  I've almost never used occurrences, but I note in the Help that "National Instruments encourages you to use the Notifier Operations functions in place of occurrences for most operations."  Perhaps Channels should follow this advice!  Unless of course there is some reason that an occurrence is required.

Link to comment

If you use a notifier with a 0 ms wait it takes just as long. I was cheating when I used the notifier status function, which is much faster. I think this is common to all the functions that might wait for a message and time out. I wonder if that's similar to a 0 ms wait, where it does allow the processor to sleep a little.
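The distinction, sketched in Python purely as an analogy (this is not a claim about how the LabVIEW primitives are implemented): a wait primitive still goes through its timeout/condition machinery even with a 0 ms timeout, while a pure status query just reads the flag.

```python
import threading
import time

flag = threading.Event()

def per_call_us(fn, iterations=100_000):
    """Average cost of one call in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations * 1e6

# Wait-with-timeout path: exercises the condition/timeout machinery even
# when the timeout is zero (the analogue of a 0 ms Wait on Notifier).
wait_cost = per_call_us(lambda: flag.wait(timeout=0))

# Status-only path: just reads the flag (the analogue of a status function).
status_cost = per_call_us(flag.is_set)

print("wait(0):  %.3f us per call" % wait_cost)
print("is_set(): %.3f us per call" % status_cost)
```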

Link to comment
1 hour ago, GregSands said:

I modified the test code to check all these options

Be aware when benchmarking globals that their access times are heavily dependent on how many instances there are and whether you are reading and/or writing. They deteriorate rapidly as contention for the resource and copies of the data increase.

Edited by ShaunR
Link to comment
2 hours ago, ShaunR said:

Be aware when benchmarking globals that their access times are heavily dependent on how many instances there are and whether you are reading and/or writing. They deteriorate rapidly as contention for the resource and copies of the data increase.

Thanks.  This means, I guess, that the value shown above will be the best case (for a single reader) and is comparable with a local variable read.

Link to comment
On 16-11-2016 at 9:44 PM, GregSands said:

I was thinking "this can't be right, maybe that's only on your sbRIO", but I checked on a Windows desktop (just creating a new case in your timing test above), and checking an occurrence takes ~75% of the time of checking the Tag Channel.  I've almost never used occurrences, but I note in the Help that "National Instruments encourages you to use the Notifier Operations functions in place of occurrences for most operations."  Perhaps Channels should follow this advice!  Unless of course there is some reason that an occurrence is required.

I'm pretty convinced that the Notifiers, Queues, Semaphores and such all use the occurrence mechanism internally for their asynchronous operation. The Wait on Occurrence node is likely a more complex wrapper around the internal primitive that waits on the actual occurrence, the same primitive those objects use internally. But there might be a fundamental implementation detail in how the OnOccurrence() function (which is what Wait on Occurrence, and all those other nodes when they need to wait, ultimately ends up calling) is implemented in LabVIEW that makes it take this much time.

Edited by rolfk
Link to comment
