Queues Vs Ring Buffers

Aristos Queue · July 3, 2013

I've been rewriting the benchmark VIs to my own standards. An important change is making those While Loops into For Loops and adding some sequence structures because the timers are being delayed by panel updates. Also the queue and buffer benchmarks do not match. I'll probably post revised VIs late tonight.

ShaunR · July 3, 2013

Version 2 has the same problems for me. Firstly, here's the Queues as a baseline:
CB Test QUEUE_FP.png

Next is Buffers (10k iterations with a 10k buffer):

CB Test BUFFER_10k_FP.png

There are no periodic delays, but still some occasional longer iterations (up to 100ms) presumably due to Windows being Windows.

Last is Buffers (10k iterations with a 1k buffer):

CB Test BUFFER_1k_FP.png

The results are the same for both 32-bit and 64-bit LabVIEW, all on Windows 7 x64. I've also run on another machine with 4 cores - this works without any long iterations.

OK. I've managed to replicate this on a Core 2 Duo (the other PC is an i7). The periodic large deviations coincide with the buffer length. If you look at your 10K @ 1K buffer plot, the points are 1K apart. If you were to set the #iterations to 100 and, say, the buffer length to 10, then the separation would be every 10 and you would see 9 points [ (#iterations/buffer length) - 1 ]. Similarly, if you set the buffer length to 20, you would see 4.
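The spacing rule above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not code from the benchmark VIs; the function name `wrap_points` is made up for illustration, and it assumes the deviations fall exactly where the write index wraps back to slot 0.

```python
def wrap_points(iterations, buffer_length):
    """Iteration indices at which the write index wraps back to slot 0."""
    # The first pass through the buffer starts at 0 and is not a wrap, so
    # wraps occur at buffer_length, 2*buffer_length, ... below `iterations`.
    return list(range(buffer_length, iterations, buffer_length))


print(len(wrap_points(100, 10)))        # 9
print(len(wrap_points(100, 20)))        # 4
print(wrap_points(10_000, 1_000)[:3])   # [1000, 2000, 3000]
```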

My initial thought (as seems to be the norm) was a collision or maybe a task switch on the modulo operator. But it smells more like processor hogging, as you can get rid of it by placing a wait with 0 ms in the write and read while loops. You have to set the read and write VIs to be "Normal" instead of "subroutine" to be able to use the wait, and therefore you lose a lot of the performance, so it's not ideal, but it does seem to cure it. I'm not sure of the exact mechanism; I'll have to chew it over. But it seems to be CPU architecture dependent.

I've been rewriting the benchmark VIs to my own standards. An important change is making those While Loops into For Loops and adding some sequence structures because the timers are being delayed by panel updates. Also the queue and buffer benchmarks do not match. I'll probably post revised VIs late tonight.

Look forward to it. Care to elaborate on for loop Vs While loop?

(Don't speed up the queues too much, eh?)

Edited July 3, 2013 by ShaunR

mje · July 3, 2013

My only criticism of the benchmarks is the queue consumers go until the queue is released, which does not mean they consumed all the data. I think it would be best to handle releasing each queue only after the consumer has finished, don't stop on error, stop on fixed number of iterations. I suspect that's what AQ was getting at. I don't expect it will make much difference when sample sizes are large.
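As a rough text-language analogue of that suggestion (Python's `queue.Queue` standing in for a LabVIEW queue, which is an assumption of this sketch), the consumer below stops after a fixed number of elements and the producer only finishes up once the consumer has joined, so nothing is left unread.

```python
import queue
import threading


def run(n_items):
    """Produce n_items and return how many the consumer actually received."""
    q = queue.Queue()
    consumed = []

    def consumer():
        # Stop on a fixed iteration count, not on the queue being destroyed.
        for _ in range(n_items):
            consumed.append(q.get())

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(n_items):
        q.put(i)
    worker.join()  # tear down only after the consumer has finished
    return len(consumed)


print(run(1000))
```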

I'm also concerned about CPU exhaustion. Correct me if I'm wrong, but if a reader tries reading an empty buffer, it will spin in an unthrottled loop until data is ready?

I worry that should there not be enough cores to accommodate all the end points, how much time will be spent spinning waiting for the cursor to advance that could otherwise be executing other endpoints? Worse would the subroutine priority lead to complete starvation? Similar arguments for writers, though I haven't poked under the hood there.

Changing the VI from a subroutine and letting it yield should it ever not get a valid cursor did not seem to change my metrics.
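The two behaviours being compared (an unthrottled spin versus a loop that yields) can be sketched in Python as follows. This is only an illustration of the idea: `Cursor` and `wait_for` are hypothetical names, and `time.sleep(0)` is used here as the nearest equivalent of a 0 ms wait.

```python
import threading
import time


class Cursor:
    def __init__(self):
        self.value = 0  # number of elements published by the writer so far


def wait_for(cursor, wanted, yield_cpu=True):
    """Poll until element `wanted` is published; return the number of polls."""
    polls = 0
    while cursor.value <= wanted:
        polls += 1
        if yield_cpu:
            time.sleep(0)  # hand the rest of the time slice to other threads
    return polls


cursor = Cursor()
writer = threading.Thread(
    target=lambda: (time.sleep(0.01), setattr(cursor, "value", 1))
)
writer.start()
polls = wait_for(cursor, 0)
writer.join()
print("polls while waiting:", polls)
```

With `yield_cpu=False` the loop burns a core for the whole wait, which is the starvation concern raised above when there are more endpoints than cores.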

ShaunR · July 3, 2013

My only criticism of the benchmarks is the queue consumers go until the queue is released, which does not mean they consumed all the data. I think it would be best to handle releasing each queue only after the consumer has finished, don't stop on error, stop on fixed number of iterations. I suspect that's what AQ was getting at. I don't expect it will make much difference when sample sizes are large.

I'm also concerned about CPU exhaustion. Correct me if I'm wrong, but if a reader tries reading an empty buffer, it will spin in an unthrottled loop until data is ready?

Read Loop.png

I worry that should there not be enough cores to accommodate all the end points, how much time will be spent spinning waiting for the cursor to advance that could otherwise be executing other endpoints? Worse would the subroutine priority lead to complete starvation? Similar arguments for writers, though I haven't poked under the hood there.

Changing the VI from a subroutine and letting it yield should it ever not get a valid cursor did not seem to change my metrics.

Read Loop 2.png

Indeed. See my post in reply to GregSands about processor hogging.

For a practical implementation, then yes, I agree. It should yield. Adding the wait ms and changing to Normal degrades the performance on my machine by about 30%, and as I wanted to see what it "could do" and explore its behaviour rather than LabVIEW's, I wasn't particularly concerned about other software or other parts of the same software. It makes it quite hard to optimise if the benchmarks are cluttered with task/thread switches et al. Better (to begin with) to run it as fast as possible with as much resource as required to eke out the last few drops; then you can make informed decisions as to what you are prepared to trade for how much performance. Of course, in 2011+ you can "inline". In 2009 you cannot, so subroutine is the only way.

Ooooh. I've just thought of a new signature.......

"I don't write code fast. I just write fast code."

Edited July 3, 2013 by ShaunR

drjdpowell · July 3, 2013

I noticed that sequencing the two “Enqueue” operations (by connecting the error wire) increased the queue performance greatly on my system.

todd · July 3, 2013

The LabVIEW memory manager uses an exponential allocation method for creating memory

Can't find the reference, but someone once said that the buffer size doubles until it's as big as a block (4kB?), then increases by block-sized increments.
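For what it's worth, the recalled policy would look something like the Python sketch below. Everything here is an assumption taken from the sentence above: the 4096-byte block size is the uncertain "4kB?" figure, and `next_capacity` is a hypothetical name, not a LabVIEW memory manager function.

```python
BLOCK = 4096  # assumed block size in bytes; the post itself is unsure of this


def next_capacity(current):
    """Next buffer size: double up to one block, then grow a block at a time."""
    if current < BLOCK:
        return min(max(1, current * 2), BLOCK)
    return current + BLOCK


sizes = [256]
while sizes[-1] < 4 * BLOCK:
    sizes.append(next_capacity(sizes[-1]))
print(sizes)  # [256, 512, 1024, 2048, 4096, 8192, 12288, 16384]
```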

ShaunR · July 3, 2013

I noticed that sequencing the two “Enqueue” operations (by connecting the error wire) increased the queue performance greatly on my system.

It's a good job MJE noticed the bug where the dequeues aren't reading everything then. Still, it's rather counter-intuitive.

Can't find the reference, but someone once said that the buffer size doubles until it's as big as a block (4kB?), then increases by block-sized increments.

Well remembered. That also explains the sudden increase (the curved lines around 1ms) in the old 1M data point plots.

By the way peeps. Feel free to write your own benchmarks/test software or abuse-ware, especially if it shows a particular problem. The ones I supplied are really just test harnesses so that I could write the API and see some info on what was going wrong (you can probably see by the latest images that the ones I am using now are evolving as I probe different aspects). Not much time has been spent on them, so I expect there are lots of issues.

Eventually, once all fingers have been inserted in all orifices, they will become usage examples and regression tests (if anyone uses the SQLite API For LabVIEW you will know what this means), so if you do knock something up, please post it, big or small.

I think I should also request at this point, to avoid any confusion, that you should only post code to this thread if you are happy for your submission to be public domain. By all means make your own proprietary stuff, but please post it somewhere else so those that come here can be sure that anything downloaded from this thread is covered by the previous public domain declaration.

Eventually it will go in the CR, and it will not be an issue, but until then please be aware that any submissions to this thread are public domain. Sorry to have to state it, but a little bit of garlic now will save the blood being sucked later.

Edited July 3, 2013 by ShaunR

GregSands · July 3, 2013

Does anyone want to put forward a theory why we get discrete lines exactly 0.493 usecs apart? (maybe a different number on your machine, but the separation is always constant)

I was going to suggest that's the resolution of the Timer. But when I just put your Tick Count+ VI in a loop, and compute differences, I get:

which says that the resolution is 29.1038 ps, but the loop time (with no other work done) is around 405.2 ns. So the even spacing is just a function of a largely constant, and almost empty, loop.
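The same experiment can be sketched in Python: read a clock back to back in an otherwise empty loop and look at the differences. `time.perf_counter_ns` stands in for the Tick Count+ VI here, which is an assumption; the numbers it prints will differ from the LabVIEW ones and depend on the machine.

```python
import statistics
import time


def summarise(timestamps_ns):
    """Return (smallest non-zero step, median step) between consecutive readings."""
    deltas = [b - a for a, b in zip(timestamps_ns, timestamps_ns[1:])]
    nonzero = [d for d in deltas if d > 0]
    return min(nonzero), statistics.median(deltas)


def sample(n=10_000):
    # Read the clock repeatedly with no other work in the loop.
    return [time.perf_counter_ns() for _ in range(n)]


if __name__ == "__main__":
    step, loop_time = summarise(sample())
    print(f"smallest step: {step} ns, median loop time: {loop_time} ns")
```

The smallest non-zero step is an upper bound on the clock resolution, while the median step is dominated by the cost of the loop itself, which is the distinction drawn above.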

ShaunR · July 4, 2013

I was going to suggest that's the resolution of the Timer. But when I just put your Tick Count+ VI in a loop, and compute differences, I get:
TimerTest.png

which says that the resolution is 29.1038 ps, but the loop time (with no other work done) is around 405.2 ns. So the even spacing is just a function of a largely constant, and almost empty, loop.

+1. It's not conclusive between resolution and loop jitter if you check with the Elapsed Time.vi (that is inside the loop), since it is difficult to separate them (occasionally you see readings of tens of ns or ps). But as the Tick count2+ uses the same timing method (which yields ps resolution), I think we can assume it is mainly loop jitter that causes the spacing. I think also that whether the Elapsed Time.vi gets executed in the same slice or the next (or the next +1 etc., depending on the scheduler) is the reason we see several lines at x*loop jitter.

drjdpowell · July 5, 2013

Playing around with Queue tester, it was interesting how most of the slow queue performance is due to attempting operations in parallel. In a stripped-down test (statistics removed) it was about 8 times faster if it was run in the UI thread (forcing serial access). It was 0.6 µs per element on my relatively slow processor.

ShaunR · July 5, 2013

Playing around with Queue tester, it was interesting how most of the slow queue performance is due to attempting operations in parallel. In a stripped-down test (statistics removed) it was about 8 times faster if it was run in the UI thread (forcing serial access). It was 0.6 µs per element on my relatively slow processor.

The queue primitives are actually faster (as can be expected). But even a while loop containing only the Elapsed Time.vi and nothing else yields ~900ns on my machine. If it was one producer and one consumer, queues would be better, but the "spread" or "jitter" is far greater for queues. If you run the tests without the Elapsed Time.vi, you will also see a significant improvement in the buffer (not so much with the queues), so I'm interested to see what AQ comes up with for tests.

I think also we will see the true advantage when we scale up to, say, 5-10 in parallel (a more realistic application scenario) especially on quad core machines where thread use can be maximised. But I'll get the variants working first as then it's actually usable. The next step after that is to scale up then add events to the read to see if we can maintain throughput with the better application link.

Aristos Queue · July 7, 2013

Sorry... I've been on vacation and out of wi-fi range (enjoyably!).

I've got some very weird results at the moment... trying to figure out what's up. On any Mac, the graphs are flatlined. Is that just an artifact of the low-resolution timer?

I'm also trying to figure out which time we are trying to minimize... I think we want to minimize the total execution time of two readers, one with a short operation and one with a long operation. That's not quite the same as comparing against the two-queue solution. Are we supposed to be comparing against a single reader that does both operations in parallel or a single reader that does both operations serially?

Finally, if you put the buffer solution in a loop, it deadlocks. Trying to figure out how to get the state out of the read function so that it becomes part of the input -- I think the right way is to structure this as a com-channel where you create reader refnums from the com-channel refnum, readers that are invalidated when the main goes stale.

There's also a memory leak if you abort or run to complete without calling the uninit VI. This is caused by not having any way to throw away the buffers that are allocated in Init when Abort/auto cleanup runs. The call library node has callbacks to handle these, but we would need to create an actual DLL to handle that work.

Aristos Queue · July 7, 2013

In case it wasn't clear this morning, I'm pretty sure that the proper compare for this proposed algorithm is against the single reader that executes both operations in parallel. I'm attaching my rewritten benchmark VIs here. Now, the benchmarks I'm attaching have some overly large values for the time of the operations -- 50ms and 5ms respectively -- which means that the data transmission times are completely subsumed by the data operation times... scale these down to much shorter (but similarly skewed) values to do the benchmarking. I have the long delays in there because those are the values I finished on when I was using it to validate the benchmarks themselves... probably want to make these into parameters in the long run so we can get a feel for how fast the data operations need to be before this disruptor pattern has any value.

Details:

Put the "Average Array.vi" into the "TestUtils" directory.

CB Test BUFFER.vi is the test of Shaun's implementation of disruptor.

CB Test 1QUEUE.vi is what I think is a correct comparison of disruptor.

In case I'm wrong on some front, I've included "CB Test QUEUEPAIR.vi" which is the double-queue implementation.

I've moved stuff to For Loops so that LV preallocates the arrays so that ceases to be an issue. I cleaned up the sequencing of timer calls so that there's minimal ambiguity about which timer goes off first. I say "minimal" because the final timing read after each of the For Loops is totally unpredictable -- sometimes LV waits until all of the For Loops have finished before any of those timer calls get executed, and sometimes they execute in reverse order of the order the loops finished. I consider the internal timings to be junk data -- I'm not sure why I left them in place. The outer timer check around the entire write and read operation (labeled as "Total Time (sec)" on the front panel) is the interesting value that we are trying to minimize.

Aristos Queue · July 7, 2013

Now, some more comments about the pattern itself... this post wanders from one topic to the next haphazardly. I'm regurgitating my notes more than trying to polish up a discussion of this pattern. ALL OF WHAT I SAY BELOW IS SUBJECT TO FUTURE CORRECTION BY MYSELF OR SOMEONE ELSE WHO SEES SOMETHING I MISSED.

To be generally usable within LV, the data structure is going to need some strengthening and probably some language tweaks (aka changes to labview.exe and lvrt.dll). The items in the buffer in Shaun's implementation are doubles. But this pattern is intended to be used for larger object-like data structures that are used by reference in multiple threads of operation. The objects are preallocated into the ring, the writer sets their fields, and then the objects are used simultaneously by both (all) readers.

The system of the disruptor works in these other languages *by contract* not by language/environment enforcement. In other words, those other languages have no protections against a write to a data structure happening while two readers are both reading it. Allowing a truly "use at your own risk" structure in LV is antithetical to the design of LV. We do not have any sort of "typed raw memory address" data type in LV specifically because that isn't the design environment of LabVIEW. R&D has repeatedly refused to implement anything like that even under the argument that "advanced users could use it to write high-performance code".

LabVIEW tries to be a language that is reliable in the hands of any of its users, and such raw pointer constructs aren't provably reliable. But although LabVIEW does not have any sort of "shared read-only large data structure", it does seem like there should be a way to introduce a DVR-like reference that has a modal switch on it where it is "read only for now" and then a writer comes along and gets an error if there is any reader currently reading and otherwise flips the mode to "writing now" during which any readers or additional writers would error out.

These would not be locks, just atomic instruction bookkeeping. I stressed that because if a disruptor were properly implemented (i.e. the timings were set up correctly), such bookkeeping would -- in theory -- be limited to an additional Boolean test-and-set instruction on each write and each read. Minimal overhead. But it would be sufficient to trap an improperly implemented disruptor and produce a runtime error without seg-faulting the execution environment.
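To make the "modal switch" idea concrete, here is a minimal Python sketch of the bookkeeping only. It is not a LabVIEW feature and not part of Shaun's cluster: `ModeGuard` is a hypothetical name, and because Python exposes no atomic test-and-set, a `threading.Lock` around each tiny update stands in for the atomic instruction. No caller ever waits on another's data access; a conflict raises an error instead.

```python
import threading


class ModeGuard:
    """Tracks 'reading' vs 'writing' and errors out on conflict, never blocks."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for an atomic test-and-set
        self._readers = 0
        self._writing = False

    def begin_read(self):
        with self._lock:
            if self._writing:
                raise RuntimeError("read attempted while a write is in progress")
            self._readers += 1

    def end_read(self):
        with self._lock:
            self._readers -= 1

    def begin_write(self):
        with self._lock:
            if self._writing or self._readers:
                raise RuntimeError("write attempted while the slot is in use")
            self._writing = True

    def end_write(self):
        with self._lock:
            self._writing = False
```

In a correctly timed disruptor none of these errors should ever fire; they exist only to turn a contract violation into a runtime error rather than a crash.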

Assuming we add the bookkeeping checks above (which *I think* could be added to the cluster that Shaun has created), there's still a problem of data copies of the complex data structures in the ring. If we work on changing the double data type out for a string data type (or any of the non-flat types, i.e., anything other than Boolean or the numerics or clusters of the same), we'd need a way to teach LV that the output of that DLL call is not stompable so that the inplaceness analysis goes ahead and makes a copy if it is used in a stomper operation. I'm not sure what that trick would be, or if there is a way to do it. Suspect it is likely possible to do with a Call Library Node if we lie and say that the method call does not modify the value when in fact it actually does. Would have to play with that a lot to be confident of it working... might not be possible without revving LV.

Anyway, both of those modifications are for the future. For now, we need to work with the benchmarks to see if there really is any advantage in LV for this pattern. So far, I'm not seeing it when compared against the 1QUEUE benchmark. Seems to be pretty much a wash, especially when the full implementation of disruptor will need a few more flag checks to be fully robust.

Now, having said all of that, the ring buffer concept when we talk about network transmission is interesting for the NAK replay abilities. In LV, that would still require copying data out of the buffer for all non-flat data types, but it might be a really nice way to handle network traffic on a noisy line. All that requires is a LV array in a shift register, with none of the rest of the disruptor fanciness. That's definitely worth exploring if you're writing that kind of network-retry code.
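A bare-bones version of that replay idea might look like the Python sketch below. The class and method names are invented for illustration, sequence numbers are assumed to increase by one per packet, and no real network protocol is implied; the list plays the role of the LV array in a shift register.

```python
class ReplayBuffer:
    """Keeps the last `size` packets so a NAK for a recent one can be replayed."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # fixed-size storage, reused in a ring
        self.next_seq = 0

    def send(self, payload):
        """Record a payload under the next sequence number and return it."""
        seq = self.next_seq
        self.slots[seq % self.size] = (seq, payload)
        self.next_seq += 1
        return seq

    def replay(self, seq):
        """Return the payload for `seq`, or None if it was never sent or is overwritten."""
        entry = self.slots[seq % self.size]
        if entry is not None and entry[0] == seq:
            return entry[1]
        return None


buf = ReplayBuffer(4)
for i in range(6):
    buf.send(f"packet-{i}")
print(buf.replay(5))  # packet-5
print(buf.replay(1))  # None (slot reused by packet 5)
```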

Done with the data dump. :-)

[EDIT] Oh, and this comment from Trisha's comments is interesting:

"The RingBuffer is not the secret sauce in the Disruptor's performance, in fact in the current version of the Disruptor you don't need it at all."

Without the ring buffer? The only way I'm understanding what is going on here is with that buffer... not sure what Disruptor is without the buffer.

ShaunR · July 7, 2013

Sorry... I've been on vacation and out of wi-fi range (enjoyably!).

I thought holidays were for not doing what you do at work. I'll still be at this when you get back, so just enjoy your holiday.

I've got some very weird results at the moment... trying to figure out what's up. On any Mac, the graphs are flatlined. Is that just an artifact of the low-resolution timer?

Most definitely, yes. The timer uses the Windows hi-res timer (on Windows, of course) but falls back to the ms timer on other platforms.

I'm also trying to figure out which time we are trying to minimize... I think we want to minimize the total execution time of two readers, one with a short operation and one with a long operation. That's not quite the same as comparing against the two-queue solution. Are we supposed to be comparing against a single reader that does both operations in parallel or a single reader that does both operations serially?

The benchmarks are just to ascertain the execution time of the read and write VIs themselves and what jitter there is from execution to execution. The theory is that because there are only memory barriers and no synchronisation locking between a read and a write, the "spread" should be much more predictable since, with a sizable buffer, access contention is virtually eliminated. In that respect, the queue benchmark isn't really a "use case" comparison, just an examination of the execution time of the primitives vs the dll calls with logic. Don't forget also, that the buffer has inherent feedback in that the writer can slow down if the readers can't keep up. If we were being pedantic, we would also enforce a fixed size buffer so that the write "stalls" when one or more readers can't keep up. I don't think that would be useful however.
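The feedback described here (the writer stalling when the slowest reader falls a full buffer behind) comes down to comparing counters. The Python sketch below shows that logic only, under stated assumptions: `RingBuffer`, `try_write` and `try_read` are hypothetical names, the code is not the posted VIs, and it omits the memory barriers a real multi-threaded implementation would need.

```python
class RingBuffer:
    """Single writer, several readers; every reader sees every element."""

    def __init__(self, size, n_readers):
        self.size = size
        self.slots = [None] * size
        self.write_cursor = 0                # elements written so far
        self.read_cursors = [0] * n_readers  # elements consumed per reader

    def try_write(self, value):
        # The slowest reader gates the writer: never overwrite an unread slot.
        if self.write_cursor - min(self.read_cursors) >= self.size:
            return False  # full; a real writer would stall or retry here
        self.slots[self.write_cursor % self.size] = value
        self.write_cursor += 1  # publish only after the slot is filled
        return True

    def try_read(self, reader):
        pos = self.read_cursors[reader]
        if pos >= self.write_cursor:
            return None  # nothing new for this reader yet
        value = self.slots[pos % self.size]
        self.read_cursors[reader] = pos + 1
        return value
```

Note that the writer has to see every reader's counter, and that `None` doubles as the "empty" signal, so this sketch cannot carry `None` as data.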

Finally, if you put the buffer solution in a loop, it deadlocks. Trying to figure out how to get the state out of the read function so that it becomes part of the input -- I think the right way is to structure this as a com-channel where you create reader refnums from the com-channel refnum, readers that are invalidated when the main goes stale.

I'm not sure what you mean here by "buffer solution in a loop". Can you elaborate?

There's also a memory leak if you abort or run to complete without calling the uninit VI. This is caused by not having any way to throw away the buffers that are allocated in Init when Abort/auto cleanup runs. The call library node has callbacks to handle these, but we would need to create an actual DLL to handle that work.

Indeed. I'm trying to keep away from external code......at least for now. There is still a lot to do to make it an API. I'm still investigating its behaviour ATM so this thread is really just documenting that journey. If you remember the SQLite API for LabVIEW, that's how that started. So right now we have some of the building blocks, but a lot needs to be added/removed/modified before it's ready for "prime-time". The issue is going to be: how do we make it robust without enforcing access synchronisation?

ShaunR · July 8, 2013

In case it wasn't clear this morning, I'm pretty sure that the proper compare for this proposed algorithm is against the single reader that executes both operations in parallel. I'm attaching my rewritten benchmark VIs here. Now, the benchmarks I'm attaching have some overly large values for the time of the operations -- 50ms and 5ms respectively -- which means that the data transmission times are completely subsumed by the data operation times... scale these down to much shorter (but similarly skewed) values to do the benchmarking. I have the long delays in there because those are the values I finished on when I was using it to validate the benchmarks themselves... probably want to make these into parameters in the long run so we can get a feel for how fast the data operations need to be before this disruptor pattern has any value.

Details:

Put the "Average Array.vi" into the "TestUtils" directory.

CB Test BUFFER.vi is the test of Shaun's implementation of disruptor.

CB Test 1QUEUE.vi is what I think is a correct comparison of disruptor.

In case I'm wrong on some front, I've included "CB Test QUEUEPAIR.vi" which is the double-queue implementation.

I've moved stuff to For Loops so that LV preallocates the arrays so that ceases to be an issue. I cleaned up the sequencing of timer calls so that there's minimal ambiguity about which timer goes off first. I say "minimal" because the final timing read after each of the For Loops is totally unpredictable -- sometimes LV waits until all of the For Loops have finished before any of those timer calls get executed, and sometimes they execute in reverse order of the order the loops finished. I consider the internal timings to be junk data -- I'm not sure why I left them in place. The outer timer check around the entire write and read operation (labeled as "Total Time (sec)" on the front panel) is the interesting value that we are trying to minimize.

Hmm. I still don't think you've quite grasped it, but I think we are nearly there. You are "averaging" a total time. That's not what we want to do as it doesn't tell us anything meaningful. With the default values of your benchmarks, we see some straight lines until the buffer is full in both. But what are the values you see periodically in the buffer @ about 1us that do not appear in the queue example? What is going on there? Data being processed? (you betcha)

To demonstrate with your new benchmark, set the buffer size to 2 (this scenario is more typical of, say, an FPGA or RT target where you don't have GB of memory for your queue. Fixed-size array ring a bell?). You will notice there is not much difference in your "average times", and the queue write gets into its stall state much more quickly, but everything else is pretty much the same. But look at the graphs and the median of the buffer. They are definitely not equivalent even though the "average times" say they are. You have not been kind to queues in this example, since the write will block until the largest process time has executed (once the buffer is full) before moving to the next. The buffer doesn't suffer as much from this limitation, and my benchmark was designed so that queues didn't suffer from it either.
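The point about averages hiding the shape of the data is easy to show with made-up numbers. The per-iteration times in this Python sketch are invented purely for illustration and are not measurements from either benchmark.

```python
import statistics

# Hypothetical per-iteration times in microseconds (not measured data).
steady = [10.0] * 10         # every iteration takes the same time
bursty = [1.0] * 9 + [91.0]  # mostly fast, with one long stall

for name, times in (("steady", steady), ("bursty", bursty)):
    print(name, statistics.mean(times), statistics.median(times), max(times))
```

Both series have a mean of 10.0, yet their medians (10.0 and 1.0) and worst cases (10.0 and 91.0) are very different, which is why the graphs and the median say more than the "average times".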

You asked me previously what time I was trying to optimise (which I mistakenly failed to answer; apologies). Well, the time I'm trying to get below is 1usec for the execution of a VI: the buffer access time. I am, at the moment, comparing the execution time of the VIs I've written with the queue primitives (which are compiled code). The next thing I am looking at is the "jitter" and then trying to identify any blocking modes which may indicate resource contention locking. This was the reason for moving from the global to the pointers, since a read of the global array locks the writer (albeit transparently) and forces a copy of the entire array. Once I can figure out variants (IF I can figure out variants) then I'll start looking at 5-10 readers with one writer (that's going to be a tedious weekend).

ShaunR · July 8, 2013

Now, some more comments about the pattern itself... this post wanders from one topic to the next haphazardly. I'm regurgitating my notes more than trying to polish up a discussion of this pattern. ALL OF WHAT I SAY BELOW IS SUBJECT TO FUTURE CORRECTION BY MYSELF OR SOMEONE ELSE WHO SEES SOMETHING I MISSED.

Good get-out clause

To be generally usable within LV, the data structure is going to need some strengthening and probably some language tweaks (aka changes to labview.exe and lvrt.dll). The items in the buffer in Shaun's implementation are doubles. But this pattern is intended to be used for larger object-like data structures that are used by reference in multiple threads of operation. The objects are preallocated into the ring, the writer sets their fields, and then the objects are used simultaneously by both (all) readers.

Yes. Variants I think are the prime candidate for LV.

The system of the disruptor works in these other languages *by contract* not by language/environment enforcement. In other words, those other languages have no protections against a write to a data structure happening while two readers are both reading it. Allowing a truly "use at your own risk" structure in LV is antithetical to the design of LV. We do not have any sort of "typed raw memory address" data type in LV specifically because that isn't the design environment of LabVIEW. R&D has repeatedly refused to implement anything like that even under the argument that "advanced users could use it to write high-performance code".

I don't think we need to "allow" it at all. They have certain implementation aspects that address atomicity, but we don't suffer from it because of LabVIEW. So we can ignore those aspects as long as it doesn't affect performance. What we do need is to be able to read a single or multiple elements from an array without copying the entire array, hence the move to LV memory manager functions.
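The requirement in that last sentence (reading a few elements without copying the whole array) has a direct analogue in other languages. The Python sketch below uses a `memoryview` over an `array` of doubles; it is only an analogy and says nothing about how the LV memory manager functions behave.

```python
from array import array

buffer = array("d", [0.0] * 8)  # eight preallocated doubles
view = memoryview(buffer)

window = view[2:5]              # a view onto elements 2..4; no data is copied
buffer[3] = 42.0                # the writer updates the underlying storage
print(window.tolist())          # [0.0, 42.0, 0.0]

snapshot = buffer[2:5]          # slicing the array itself does make a copy
buffer[3] = 7.0
print(snapshot[1], window[1])   # 42.0 7.0
```

The view always reflects the current contents of the storage, while the slice is a snapshot, which is the copy overhead the global-variable approach pays on every read.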

LabVIEW tries to be a language that is reliable in the hands of any of its users, and such raw pointer constructs aren't provably reliable. But although LabVIEW does not have any sort of "shared read-only large data structure", it does seem like there should be a way to introduce a DVR-like reference that has a modal switch on it where it is "read only for now" and then a writer comes along and gets an error if there is any reader currently reading and otherwise flips the mode to "writing now" during which any readers or additional writers would error out.

Again. I don't think a new primitive is needed although a DVR would have been the first choice had it not come with all the baggage. It is as close to pointer referencing as we get and that is what we need. A global variable has most/all the features we need which you are describing here, without all the external signalling. It just comes with the baggage of copying entire arrays. Works great for <500 elements, but too much copy overhead for larger buffers. Don't forget, that a writer can't write whilst a reader is reading by the design (the counters), not by the storage locking mechanism. I think we can do everything with a bit of lateral thinking with what we have rather than require specialised semantics.

These would not be locks, just atomic instruction bookkeeping. I stressed that because if a disruptor were properly implemented (i.e. the timings were set up correctly), such bookkeeping would -- in theory -- be limited to an additional Boolean test-and-set instruction on each write and each read. Minimal overhead. But it would be sufficient to trap an improperly implemented disruptor and produce a runtime error without seg-faulting the execution environment.

See my previous paragraph. Unnecessary I think.

Assuming we add the bookkeeping checks above (which *I think* could be added to the cluster that Shaun has created), there's still a problem of data copies of the complex data structures in the ring. If we work on changing the double data type out for a string data type (or any of the non-flat types, i.e., anything other than Boolean or the numerics or clusters of the same), we'd need a way to teach LV that the output of that DLL call is not stompable so that the inplaceness analysis goes ahead and makes a copy if it is used in a stomper operation. I'm not sure what that trick would be, or if there is a way to do it. Suspect it is likely possible to do with a Call Library Node if we lie and say that the method call does not modify the value when in fact it actually does. Would have to play with that a lot to be confident of it working... might not be possible without revving LV.

I think the idea that the buffer "contains" the data object is perhaps misleading. I am expecting the buffer to contain pointers to the objects and the reader just de-references (if we can get away without a copy, even better). So you create your variants then stick the addresses in the buffer. The writer can change the variants by the normal LabVIEW methods. The reader can read via the normal LV methods (after de-referencing). The counters make sure the writer doesn't update a variant that hasn't been read yet. All is happy in R&D because nothing ever gets stomped on to require management.

Anyway, both of those modifications are for the future. For now, we need to work with the benchmarks to see if there really is any advantage in LV for this pattern. So far, I'm not seeing it when compared against the 1QUEUE benchmark. Seems to be pretty much a wash, especially when the full implementation of disruptor will need a few more flag checks to be fully robust.

Now, having said all of that, the ring buffer concept when we talk about network transmission is interesting for the NAK replay abilities. In LV, that would still require copying data out of the buffer for all non-flat data types, but it might be a really nice way to handle network traffic on a noisy line. All that requires is a LV array in a shift register, with none of the rest of the disruptor fanciness. That's definitely worth exploring if you're writing that kind of network-retry code.

Done with the data dump. :-)

[EDIT] Oh, and this comment from Trisha's comments is interesting:

"The RingBuffer is not the secret sauce in the Disruptor's performance, in fact in the current version of the Disruptor you don't need it at all."

Without the ring buffer? The only way I'm understanding what is going on here is with that buffer... not sure what Disruptor is without the buffer.

No idea what she means there. It looks to me to be the very heart of the disruptor pattern, unless she is using the name to reference the resource access and boundary contracts. But they wouldn't work well with anything other than a circular buffer as far as I can tell. I see the elegance of the pattern in the way the writer and readers are able to circumvent lock contention (on a circular buffer) with simple counters and memory barriers. The usual method is to dollop large quantities of mutexes around.

Aristos Queue · July 8, 2013

Variants? You'll lose any performance gains in type checking when you actually want to manipulate the data. If the final solution has anything to do with variants, I'll be very surprised.

ShaunR · July 8, 2013

Well. It's cheap to wire anything to it. What about your "Generic" terminals?

mje · July 8, 2013

What we do need is to be able to read a single or multiple elements from an array without copying the entire array, hence the move to LV memory manager functions.

Was the memory manager the first place you looked? There are definitely ways to do this in native LabVIEW code...

ShaunR · July 8, 2013

Nope. The first release was using a global variable and an array index. That was after looking at classes, DVRs and LV2 globals. Most work OK until you get to the writer needing to know the index of all the readers. Do you have something specific in mind?

Rolf Kalbermatter · July 9, 2013

Personally I think DVRs are probably the way to go here. They have reference semantics, so you won't have to copy all the data each time, even though I'm pretty sure the DVR overhead will be a net loss in the case of scalar buffer elements but a net gain for more complex datatypes.

The only external code solution I could think of would be the use of the ILVDataInterface but I'm now pretty sure that is exactly what the DVR actually is internally about and I do not see any benefit in moving that into external code as in both cases you would have to use polymorphic VIs to support multiple datatypes.

About the writer needing to know the index of the readers: this would seem most easily solved by having all readers subscribe themselves to the buffer and get back a refnum (index) into an array of indices where the buffer stores the reader index. Each time the reader wants to retrieve its data, it hands over its refnum, and the buffer retrieves the data for the reader and updates the corresponding index. Personally I would use a functional global for the buffer management including the reader indices, but doing it with LVOOP would allow easy instantiation, so you can have multiple circular buffers in an application without having to resort to making the FGV itself indexable too.
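The registration scheme described above could be sketched roughly as follows in C. This is only an illustration under stated assumptions: all access is serialised (as it would be inside a non-reentrant functional global), the element type is a double, and `ring`, `ring_register_reader`, `ring_write`, `ring_read`, `MAX_READERS` and `BUF_SLOTS` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_READERS 4u   /* assumed upper limit on subscribers */
#define BUF_SLOTS   16u  /* assumed buffer capacity */

typedef struct {
    double        data[BUF_SLOTS];
    unsigned long write_count;              /* total elements written so far */
    unsigned long read_count[MAX_READERS];  /* total elements read, per reader */
    unsigned      reader_count;             /* number of registered readers */
} ring;

/* Subscribe a reader. Returns its refnum (an index into read_count),
   or -1 if no slot is free. A new reader starts at the current write position. */
static int ring_register_reader(ring *r)
{
    assert(r != NULL);
    if (r->reader_count >= MAX_READERS) return -1;
    int ref = (int)r->reader_count++;
    r->read_count[ref] = r->write_count;
    return ref;
}

/* Write one element. Returns 0 (and writes nothing) if that would
   overwrite data the slowest registered reader has not consumed yet. */
static int ring_write(ring *r, double v)
{
    assert(r != NULL);
    for (unsigned i = 0; i < r->reader_count; i++)
        if (r->write_count - r->read_count[i] >= BUF_SLOTS) return 0;
    r->data[r->write_count % BUF_SLOTS] = v;
    r->write_count++;
    return 1;
}

/* Read the next element for the reader identified by `ref`.
   Returns 0 if that reader has already consumed everything written. */
static int ring_read(ring *r, int ref, double *out)
{
    assert(r != NULL && out != NULL);
    assert(ref >= 0 && (unsigned)ref < r->reader_count);
    if (r->read_count[ref] == r->write_count) return 0;
    *out = r->data[r->read_count[ref] % BUF_SLOTS];
    r->read_count[ref]++;
    return 1;
}
```

A zero-initialised `ring` is ready to use. Adding a reader is one call to `ring_register_reader` and needs no change to the writer, which is the scaling property asked for. With no readers registered, `ring_write` in this sketch simply overwrites old data.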

ShaunR · July 9, 2013

Registering etc. isn't a problem. The problem is the access to the indexes and the buffer without locking (mutexes etc.), which is where the pattern derives its performance gains. My first choice was a class, since the buffer and/or the reader/writer indexes can be held in each instance. However, the locking overhead around the private cluster, coupled with the atomicity of the private data cluster, means the performance degrades significantly (we are talking ms rather than µs). You need a way for the indexes and buffer controls to be accessed independently (e.g. you cannot have them all in one cluster) and to break access restrictions so that a reader can read data while a writer is writing, without waiting (which you cannot do if you wrap anything in a non-reentrant VI).

You end up in a kind of no-man's-land where you need global storage (i.e. something like a DVR or LV2 global) with no locking mechanisms (which requires reentrant-like access behaviour). The closest native LabVIEW object that fulfills that is the global variable. If someone else has an alternative, I'm all ears.

Rolf Kalbermatter · July 9, 2013

Well, let's forget about the polymorphic data storage aspect for a moment then. That is IMHO a completely separate issue. What you therefore want is a managing infrastructure for the indices for both reader and writer, but in a way that they are not globally locked but only locally protected. What I didn't like in your example (the first I think; I didn't look at the others, if there are any) was (besides the use of globals, of course) that the implementation of the reader index does not scale at all. There is no way to add an additional reader without modifying just about everything in there. So I think the first approach would be to have the reader index as an array, hence why I came up with the reader registration. Now you are right that globals have some sort of protection, but only for immediate access; you cannot have a protected read/modify/store without additional external synchronization.

The question is, do we need protected read/modify/store at all? I think not, as there is only really one writer for every variable, and the variables, being integers, inherently have atomic read access guaranteed on all modern platforms. So what about making the reader index an array? I think that should work. And if you separate the index management from the actual buffer data storage somehow, I think that the delays caused by a non-reentrant call to the index manager entity (be it an FGV or LVOOP private data) should not cost too much performance.

If protected read/modify/store were required, maybe the inplace structure node might help. It seems to do some sort of locking that makes sure that the data cannot be in an intermediate state. If that fails, the only solution I would see in order to avoid explicit semaphores or mutexes would be the use of some external code to access some cmpexch()-like function.

Incidentally I have been struggling with this type of thing just now for some external code (LuaVIEW for those wanting to know) where I needed to be able to update internal flags without running into a potential race if some other code tries to update another flag in the same data structure. The solution turned out to be the cmpexch() or similar function which is a function that atomically compares the state of a value with an expected value and only updates the value with the new value if that compare is positive.

So to set a bit in a word I could then do something like:

    long atomic_fetch_and_or(long *value, long mask)
    {
        long old;
        do
        {
            old = *value;
        }
        while (!cmpexch(value, old, old | mask));
        return old;
    }

In C this is fairly simple since cmpexch() (or whatever it is named) is a standard OS function nowadays (or usually a compiler intrinsic), but there are exceptions in LabVIEW land, such as the older Pharlap-based RT targets and, it seems, the VxWorks-based ones. At least I couldn't find a reliable cmpexch() or similar function so far in the VxWorks headers, and Pharlap ETS before LabVIEW 8.5 or so did not have kernel32.InterlockedCompareExchange.
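For comparison, here is a sketch of the same retry loop using the C11 `<stdatomic.h>` interface, assuming a compiler and target that provide it (which, as noted above, cannot be taken for granted on every RT target). The function name `fetch_and_or_cas` is hypothetical; `atomic_compare_exchange_weak` stands in for the generic cmpexch().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Atomically set the bits in `mask` and return the previous value,
   using a compare-and-swap retry loop. */
static long fetch_and_or_cas(atomic_long *value, long mask)
{
    assert(value != NULL);
    long old = atomic_load(value);
    /* If another thread changed *value in the meantime, the exchange fails
       and stores the current contents into `old`, so the next attempt
       recomputes old | mask from fresh data. */
    while (!atomic_compare_exchange_weak(value, &old, old | mask)) {
        /* retry */
    }
    return old;
}
```

Where C11 atomics are available, the standard `atomic_fetch_or(value, mask)` does the same job in a single call; the explicit loop is shown only because it generalises to updates that have no ready-made atomic operation.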

These are so-called lock-free mechanisms, although personally I find that a little misleading, since there is actually still locking in cmpexch(): it normally implements a complete bus lock for the duration of the assembly sequence, to prevent state corruption even when other CPU cores might want to access the same address at that moment. There are variations possible on the type of lock held, such as only for write operations or only for reads, but they make the whole story even more complex and are rather unportable between architectures because of differences in their semantics, so I don't think it makes much sense to bother with them for more general-purpose use.

ShaunR · July 9, 2013

Well, let's forget about the polymorphic data storage aspect for a moment then. That is IMHO a completely separate issue.

Indeed it is. But it is what makes it usable.

What you therefore want is a managing infrastructure for the indices for both reader and writer, but in a way that they are not globally locked but only locally protected. What I didn't like in your example (the first I think; I didn't look at the others, if there are any) was (besides the use of globals, of course) that the implementation of the reader index does not scale at all. There is no way to add an additional reader without modifying just about everything in there. So I think the first approach would be to have the reader index as an array, hence why I came up with the reader registration. Now you are right that globals have some sort of protection, but only for immediate access; you cannot have a protected read/modify/store without additional external synchronization.

The question is, do we need protected read/modify/store at all? I think not, as there is only really one writer for every variable, and the variables, being integers, inherently have atomic read access guaranteed on all modern platforms. So what about making the reader index an array? I think that should work. And if you separate the index management from the actual buffer data storage somehow, I think that the delays caused by a non-reentrant call to the index manager entity (be it an FGV or LVOOP private data) should not cost too much performance.

If protected read/modify/store were required, maybe the inplace structure node might help. It seems to do some sort of locking that makes sure that the data cannot be in an intermediate state. If that fails, the only solution I would see in order to avoid explicit semaphores or mutexes would be the use of some external code to access some cmpexch()-like function.

Bingo! No. We don't need protected read/modify/store IF we are using the LV memory manager functions. The reason is that we can pick and choose individual elements of arrays to update. You cannot do this in LabVIEW without reading the whole array, modifying it and then writing the whole array back.

This is why I'm not worried about registration. Originally I had an array for the indexes in the global variable. But the obvious race conditions (read/modify/write) caused me to revert to single indexes fixed to the number of readers I was testing so I could get on and benchmark. It is/was an interim solution as, at the time, it wasn't clear whether it was worth spending the time working out a registration scheme if the performance was atrocious.

Now I'm not using a global. This is a no-brainer and doesn't require additional logic, just a pointer to an array of indexes. The writer only needs to know how many are registered and can then do an array min on all of them. I'm thinking of 2 arrays, one for the readers' counts and one for the readers' R indexes, as the two calls are probably more efficient than one call to a 2D flat array with all the gubbins required to extract the dimensions (that's just a gut feeling). This part could, of course, also be done in native LabVIEW. It's the readers that are the problem...

Because the writer only reads the readers' indexes and the readers only write to them, AND the only important value is the lowest, if it is possible to update a single location in the array without affecting the others, then no locking is required at all (memory barriers are sufficient) and no read/modify/write issues arise. MoveBlock enables this, but any native LabVIEW array manipulation will not (AFAIK).
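Expressed in C rather than through MoveBlock, the single-writer-per-slot argument might look like the sketch below. It assumes C11 atomics, a fixed number of readers set up before any thread starts, and counts that only ever increase; `reader_indexes`, `reader_publish`, `slowest_reader` and `writer_may_advance` are hypothetical names, not part of the posted VIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_READERS 4u   /* assumed upper limit */

typedef struct {
    atomic_ulong read_count[MAX_READERS];  /* each slot is written by exactly one reader */
    unsigned     readers;                  /* fixed after setup in this sketch */
} reader_indexes;

/* Reader side: publish this reader's progress. Only its own slot is
   touched, so no read/modify/write of the whole array ever happens. */
static void reader_publish(reader_indexes *ix, unsigned ref, unsigned long count)
{
    assert(ix != NULL && ref < ix->readers);
    atomic_store_explicit(&ix->read_count[ref], count, memory_order_release);
}

/* Writer side: only the lowest count matters (the "array min"). */
static unsigned long slowest_reader(reader_indexes *ix)
{
    assert(ix != NULL && ix->readers > 0);
    unsigned long min = atomic_load_explicit(&ix->read_count[0], memory_order_acquire);
    for (unsigned i = 1; i < ix->readers; i++) {
        unsigned long c = atomic_load_explicit(&ix->read_count[i], memory_order_acquire);
        if (c < min) min = c;
    }
    return min;
}

/* The writer may store element number `write_count` only if that does not
   overwrite a slot the slowest reader has yet to consume. */
static int writer_may_advance(reader_indexes *ix, unsigned long write_count,
                              unsigned long capacity)
{
    return write_count - slowest_reader(ix) < capacity;
}
```

If the writer sees a slightly stale count, it only underestimates how far a reader has got, so it waits longer than strictly necessary but never overwrites unread data. That is why release/acquire ordering on individual slots is enough here and no lock is taken.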

Incidentally I have been struggling with this type of thing just now for some external code (LuaVIEW for those wanting to know) where I needed to be able to update internal flags without running into a potential race if some other code tries to update another flag in the same data structure. The solution turned out to be the cmpexch() or similar function which is a function that atomically compares the state of a value with an expected value and only updates the value with the new value if that compare is positive.

So to set a bit in a word I could then do something like:

    long atomic_fetch_and_or(long *value, long mask)
    {
        long old;
        do
        {
            old = *value;
        }
        while (!cmpexch(value, old, old | mask));
        return old;
    }

In C this is fairly simple since cmpexch() (or whatever it is named) is a standard OS function nowadays (or usually a compiler intrinsic), but there are exceptions in LabVIEW land, such as the older Pharlap-based RT targets and, it seems, the VxWorks-based ones. At least I couldn't find a reliable cmpexch() or similar function so far in the VxWorks headers, and Pharlap ETS before LabVIEW 8.5 or so did not have kernel32.InterlockedCompareExchange.

These are so-called lock-free mechanisms, although personally I find that a little misleading, since there is actually still locking in cmpexch(): it normally implements a complete bus lock for the duration of the assembly sequence, to prevent state corruption even when other CPU cores might want to access the same address at that moment. There are variations possible on the type of lock held, such as only for write operations or only for reads, but they make the whole story even more complex and are rather unportable between architectures because of differences in their semantics, so I don't think it makes much sense to bother with them for more general-purpose use.

cmpexch (CAS) is an optimised CPU instruction, so the processor needs to support it (if it's Intel, it's a given; PPC, not so sure). I was hoping this was what SwapBlock used, as it could potentially be more efficient than MoveBlock. But the only reference I can find to it was on a German forum, and you provided an answer. I can't see any reason for SwapBlock other than to wrap cmpexch. The only people who can answer that question are NI, really. For my purposes I don't really need the compare aspect, but an in-place atomic exchange would be useful for the indexes.
