-
Posts
4,942 -
Joined
-
Days Won
308
Content Type
Profiles
Forums
Downloads
Gallery
Posts posted by ShaunR
-
-
I noticed that sequencing the two “Enqueue” operations (by connecting the error wire) increased the queue performance greatly on my system.
It's a good job MJE noticed the bug where the dequeues aren't reading everything then.
Still. It's rather counter intuitive.
Can't find the reference, but someone once said that the buffer size doubles until it's as big as a block (4kB?), then increases by block-sized increments.Well .remembered. That also explains why the sudden increase (the curved lines around 1ms) in the old 1M data point plots.
By the way peeps. Feel free to write your own benchmarks/test software or abuse-ware. Especially if it shows a particular problem. The ones I supplied are really just test harnesses so that I could write the API and see some info on what was going wrong (you can probably see by the latest images that the ones I am using now are evevolving as I probe different aspects). Not much time has been spent on them so I expect there's lots of issues.
Eventually, once all fingers have been inserted in all orifices, they will become usage examples and regression tests (if anyone uses the SQLite API For LabVIEW you will know what this means) so if you do knock something up, please post it-big or small.
I think I should also request at this point, to avoid any confusion, that you should only post code to this thread if you are happy for your submission to be public domain. By all means make your own proprietary stuff, but please post it somewhere else so those that come here can be sure that anything downloaded from this thread is covered by the previous public domain declaration.
Eventually it will go in the CR, and it will not be an issue,but until then please be aware that any submissions to this thread are public domain. Sorry to have to state it, but a little bit of garlic now will save the blood being sucked later.
-
Indeed. See my post in reply to GregSands about processor hogging.My only criticism of the benchmarks is the queue consumers go until the queue is released, which does not mean they consumed all the data. I think it would be best to handle releasing each queue only after the consumer has finished, don't stop on error, stop on fixed number of iterations. I suspect that's what AQ was getting at. I don't expect it will make much difference when sample sizes are large.I'm also concerned about CPU exhaustion. Correct me if I'm wrong, but if a reader tries reading an empty buffer, it will spin in an unthrottled loop until data is ready?
I worry that should there not be enough cores to accommodate all the end points, how much time will be spent spinning waiting for the cursor to advance that could otherwise be executing other endpoints? Worse would the subroutine priority lead to complete starvation? Similar arguments for writers, though I haven't poked under the hood there.
Changing the VI from a subroutine and letting it yield should it ever not get a valid cursor did not seem to change my metrics.
For a practical implementation. Then yes.I agree. It should yield. Adding the wait ms and changing to Normal degrades the performance on my machine by about 30% and as I wanted to see what it "could do" and explore it's behaviour rather than labviews, I wasn't particulalry concerned about other software or parts of the same software. It makes it quite hard to optimise if the benchmarks are cluttered with task/thread switches et al. Better (to begin with) to run it as fast as posssible with as much resource as required to eek out the last few drops, then you can make informed decisions as to what you are prepared to trade for how much performance. Of course, in 2011+ you can "inline". In 2009 you cannot, so subroutine is the only way.
Ooooh. I've just thought of a new signature.......
"I don't write code fast. I just write fast code
-
Version 2 has the same problems for me. Firstly, here's the Queues as a baseline:
Next is Buffers (10k iterations with a 10k buffer):
There are no periodic delays, but still some occasional longer iterations (up to 100ms) presumably due to Windows being Windows.
Last is Buffers (10k iterations with a 1k buffer):
The results are the same for both 32-bit and 64-bit LabVIEW, all on Windows 7 x64. I've also run on another machine with 4 cores - this works without any long iterations.
OK. I've managed to replicate this on a Core 2 Duo (the other PC is an I7). The periodic large deviations coincide with the buffer length. If you look at your 10K@ 1Kbuffer, each point is 1K apart. If you were to set the #iterations to 100 and, say the buffer length to 10, then the separation would be every 10 and you would see 9 points [ (#iterations/buffer length) -1 ]. Similarly, if you set the buffer length to 20, you would see 4.
My initial thoughts (as seems to be the norm) was a collision or maybe a task switch on the modulo operator. But it smells more like processor hogging as you can get rid of it by placing a wait with 0 ms in the write and read while loops. You have to set the read and write VIs to be "Normal" instead of "subroutine" to be able to use the wait and therefore you lose a lot of the performance so it's not ideal, but it does seem to cure it. I'm not sure of the exact mechanism-i'll have to chew it over. But it seems CPU architecture dependent.
I've been rewriting the benchmark VIs to my own standards. An important change is making those While Loops into For Loops and adding some sequence structures because the timers are being delayed by panel updates. Also the queue and buffer benchmarks do not match. I'll probably post revised VIs late tonight.
Look forward to it. Care to elaborate on for loop Vs While loop?
(Don't speed up the queues too much eh?
)
-
-
Addendum.
I've just modified the test to
a) ensure timers always execute at pre-determined points in the while loops (connected the error terminals).
b) pre-allocate the arrays.
So it looks like a lot of what I was surmising is true. There is still one allocation in the 258 iteration image which might be for the shift registers. But everything is a lot more stable and the STD and mean are now meaningful (if you'l excuse the pun).
Does anyone want to put forward a theory why we get discrete lines exactly 0.493 usecs apart? (maybe a different number on your machine, but the separation is always constant)
-
@ GregSands
Whilst a median of 4..6 micro seconds isn't fantastic. It's still not bad (slightly faster than the queues). In your images I am looking at the median and Max-Count Exec Times peak. The reason is as will follow. I'm not (at the moment) sure why you get the 300ms spikes (thread starvation?) but most of the other stuff can be explained (I think)
I've been playing a bit more and have some theories to fit the facts. The following images are a little contrived since I run the tests multiple times and chose the ones that show the effects without extra spurii. But the data is always there in each run, just that you get more of them
The following is with 258 data points buffer.
We can clearly see two distinct levels (lines) for the write time and believe me; each data point in each line is identical. There are a couple of data points above these two (5 in fact). Theses anomalous points intriguingly occur at 1, 33, 65,129 and 257 or (2^n)+1. OK. So 17 is missing. you'll just have to take my word for it that it does sometimes show up.
We can also notice that these points occur in the reader as well; at exactly the same locations with approximately the same magnitude. That is just too convenient.
OK. So maybe we are getting collisions between the read and write. The following is again with 258 iterations with a buffer size of 2 (the minimum). That will definitely cause collisions if any are to be had.
Nope. Exactly the same positions and "roughly" the same magnitudes. I would expect something to change at least if that were the issue. So if they really do occur at 2n+1 if we increase further we should see another appear at 513.
Bingo!. Indeed. There it is.
Here is what I think is going on. The LabVIEW memory manager uses an exponential allocation method for creating memory (perhaps AQ can confirm). So, every time the "Build Array" on the edge of the while loop needs more space, we see the the allocation take place which causes these artifacts. The more data points we build, the more of impact the LabVIEW memory manager has on the results. The "real" results for the buffer itself are the straight lines in the plots which are predictable and discrete. The spurious data points above these are LabVIEW messing with the test so that we can get pretty graphs.
We can exacerbate the effect by going to a very high number.
We can still clearly see our two lines and you will notice throughout all the images the Median and the Max Count-Exec Times have remained constant (scroll back up and check) which implies that the vast majority of the data points are the same. The results for the mean, STD and to our eyes are "confused" by the number of suprii. So I am postulating that most, if not everything above those two lines in the images is due to the build arrays on the border of the while loops. Of course. I cannot prove this since we are suffering from the "Observer Effect" and if I remove the build arrays on the border, we won't have any pretty graphs
. Even running the profiler will cause massive increases in benchmark times and spread the data. I think we need a better bench marking method (pre-allocated arrays?).
Of course. It raises the question. Why don't the queues exhibit this behaviour with the queue benchmark?. Well. I think they do. But it is less obvious because, for small numbers of iterations there is greater variance in the execution times and it is buried in the natural jitter. It only sticks out like a sore thumb with the buffer because it is so predictable that they are the only anomalies.
-
Sweet!
I've replaced the global variable with LabVIEW memory manager functions and, as surmised, the buffer size longer affects performance so you can make it as big as you want..
You can grab version 2 here:Prototype Circular Buffer Benchmark V2I've been asked to clarify/state Licencing so here goes.
I, the copyright holder of this work, hereby release it into the public domain. This applies worldwide.In case it is not legally possible or if the public domain licence is not recognised in any particular country ,The copyright holder shall be deemed to be the original developer (myself) and I hereby grant any entity the right to use this work for any purpose, without any conditions, except that the entity cannot claim ownership, copywrite or other lawful means to restrict or prevent others adopting the rights I hereby grant. This is free software for everyone.
-
2
-
-
Contingent on buffer multiples is a little perplexing. But that it disapers when the buffer is bigger than the points suggests it is due to the write waiting for the readers to catch up and letting LabVIEW get involved in scheduling (readers "should" be faster than the writer because they are simpler but the global array access is slow for the reasons MJE and AQ mentioned - I think I have a fix for thatThe slow iterations seem to occur at multiples of the Buffer Size - i.e. above, at 64, 128, 192, 256, ... If the buffer is bigger than the number of data points, it doesn't occur.)
Can you post the image of it not misbehaving? The graphs tell me a lot about what is happening. (BTW, I like the presentation as a log graphs better than my +3stds-I will modify the tests)
-
Easy to try. Unfortunately, I get this for the Circular Buffer (LV 2012 x32 on Win 7 x64):
Note the logarithmic scale! I wonder if it is something to do with my machine having only two cores? Here's the Profile Data for the two subroutines:
Interesting. You can see a lot of data points at about 300ms dragging the average up whereas most are clearly sub 10us.
If you set the buffer size to 101 and do 100 data points, does it improve? (don't run the profiler at the same time, it will skew the results)
-
OK. Tidied up the benchmark so you can play, scorn, abuse. If we get something viable, then I will stick it in the CR with the loosest licence (public domain, BSD or something)
You can grab it here: Prototype Circular Buffer Benchmark
It'll be interesting to see the performance on different platforms/versions/bitnesses. (The above image was on Windows 7 x64 using LV2009 x64)
Compared alongside each other. This is what we see.
-
Have you hit your 10MB limit? The only place I can find to check my current usage is when I try to attach a file. Currently, I have 5.89MB left.
It doesn't tell me when I upload (just says uploading is not allowed with a upload error). When it first happened, I deleted 2 of my code repository submissions just in case that was the problem (deleted about 1.2 MB to upload a 46KB image). It didn't make a difference. There used to be a page in the profile that allowed you to view all the attachments and uploads and monitor your usage. That seems to have disappeared so I can't be sure that deleting more from the CR will allow uploading..
-
Interestingly, not sure if it is related, your most recent posts have included a broken image link icon whenever you have attached a picture. Even more interestingly, this picture does actually display, inline, in the quick reply editor!
The images are just inserted images that have to reside on another server (using the insert image button in the bar). Usually I upload the images to lavag and insert them. Obviously I cannot do that at the moment so this way is a work-around. I can do something similar for the files I want to upload, but then they won't appear inline in the posts as attachments (and presumably I cannot put stuff in the code repository for people). You will have to be redirected to my download page to get them (not desirable).
-
For a while now (ever since the last Lavag.org crash). I have not been able to post pictures or upload files,
The upload section just states "Uploading is not allowed" and there is no real indication as to why this is so.
-
It's PXI-RT, And yes, A 2MB/s is not high speed. But when I need a frequently transfering and receiving function in a determined loop rate, it is...
Here is how my application works.
PXI A and PXI B are synchronized using a 10K trigger signal.
PXI A: Loop Rate 10K, Read 12 DBL from PXI B -->Processing-->Send 12 DBL to PXI B
PXI B: Loop Rate 10K, Read 12 DBL from PXI A -->Processing-->Send 12 DBL to PXI A
the transfer part took more time than I thought.. My processing part will take about 50us, and the transfer data take about 70us while using reflective memory(GE 5565)... So I am wonder how can I make the transfer time less than 40us.
You want it 40 usecs because because 40+50 <100?
Put your acquisition and processing (50us) in a producer loop and the TX in a consumer loop. Then your total processing time will be just the worst of the two (70us) rather than the addition of both.
-
Hi,
I am doing a project which required high speed data commnication between 2 chassis. 24 Double digital numbers in a loop rate 10K. The first thing come to my mind is using reflective memory. But the result is not good enough. The data transfer tooks 80% of time in the 10K loop, Then to avoid loop late, I could not do any thing else in this loop.
Is there any option else? Maybe using digital I/O in FPGA card?
Thanks in advance!
It depends where your bottleneck is.
24xDouble precision numbers @ 10k is about 2MB/sec. Doesn't sound a lot to me.
Are we talking PXI-RT or PXI-Windows7? How are you acquiring and how are you transferring (TCPIP, MXI?).
-
Well.
Another weekend wasted being a geek
So I wrote a circular buffer and took on board (but didn't implement exactly) the Disruptor pattern. I've written a test harness which I will post soon once it is presentable so that we can optismise with the full brain power of Lavag.org rather than my puny organ (fnarr, fnarr).
In a nutshell, it looks good - assuming I'm not doing anything daft and getting overruns that I'm not detecting.
-
1
-
-
It looks to me like the data is already processed. If you just plot the data directly (and change the graph scale to logarithmic) you will get:
If you want to smooth it, use the Interpolate 1D.vi and select spline (ntimes= 10).and you will get:
-
Resolved via email.Hi, I've been trying to download a copy from the Web Site but it keeps redirecting me to login page. -
You can no longer have a dynamic dispatch "read" VI because some tasks will return an array, others a single value. So, you can make a polymorphic VI, but then you are back at square one if the wrong instance is selected: run time errors.
You can always just return an array with one element for single values (if a task returns a single value, just use the build array to convert it). Then all your companes are the same. If you really want to, you can wrap that that into a single value polymorphic VI to return just element 0. That way you won't get a run-time error.
-
I have a really hard time believing their architecture doesn't use some sort of locking mechanism. At some point they need to read that circular buffer, and either they maintain a lock on an element while its being operated on, or they only lock it briefly while it's being copied. There's just no other way to ensure an old element doesn't get overwritten when the circular buffer flows around.
Reference: LMAX Technical paper
In the case of having only one producer, and regardless of the complexity of the consumer graph, no locks or CAS operations are required. The whole concurrency coordination can be achieved with just memory barriers on the discussed sequences. -
> Which method did you use to create the ring buffer?
I haven't gotten as far as actually creating any buffers -- I was just setting up the shell VIs for one writer and two readers.
> Are you saying that merely using the index array primitive destroys the element?
Using Index Array copies the element out of the array. That is *exactly* what made me pause: if at any point you make a copy of the data, you have lost the advantage of this system over the parallel queues. Since you *must* make a copy of the data in order to have two separate processes act on it at the same time, there's no way to get an advantage over the parallel queues mechanism.
If I remember correctly. The Array Subset only keeps track of indexes (sub-array type). Would this avoid the copy?
Not sure where "parallel" queues came into it. If we were to try parallel queues (I think you are saying a queue for each read process). Then data still has to be copied (within the labview code) on to each queue? Would you not get a copy at the wire junction at least if not on the queues themselves? This scenario is really slow in LabVIEW. I have to use TCPIP repeaters (one of the things I have my eye on for this implementation) and it is a huge bottleneck since to cater for arbitrary numbers of queues, you need to use a for loop to populate the queues and copies are definately made.
Their buffer system appears to take advantage of the fact that you can, in a thread-controlled by-reference environment, act on data in-place. That's impossible in a free-thread by-value environment.
I stopped going any further than that until I clarified my understanding since if these thoughts are correct, there's no possible advantage in LabVIEW with this concept. If I've missed something, I'm happy to re-evaluate.
I think it will be impossible to see the same sort of performance that they achieve without going to compiled code (there is a .NET and C++ implementation) and if our only interest is to benchmark it to the LV queue primitives, we aren't really comparing apples (compiled code implementation of queues in the LV runtime vs labview code). However, the principle still stands and I think it may yield benefits for the aforementioned scenarios (queue-case for example), so I will certainly persevere.
Of course. It'd be great if NI introduced a set of primitives in the LV kernel (Apache 2.0 licence I believe
)
-
This intrigued me, so I started to put the code together for this to see how it would work. After just a bit of setting up the VIs, I realized that I can't have two consumers walk the same buffer without making copies of the data for each consumer. That made me fairly certain that this ring buffer idea doesn't buy you anything in LabVIEW. In Java/C#/C++, the items in your queue-implemented-as-ring-buffer might be pointers/references, so having a single queue and having multiple readers traverse down it makes sense -- it saves you duplication. But in LV, each of those readers is going to have to copy data out of the buffer in order to leave the buffer in tact for the next reader. That means that you have zero advantage over just having multiple queues, one for each consumer that is going to consume the data. It could save you if you had a queue of data value references, but then you're going to be back to serializing your operations waiting on the DVR to be available.
Am I missing some trick here?
In general, I think it will not realise the performance improvements for the pointer reasons you have stated (we are ultimately constrained by the internal workings of LV, which we cannot circumvent). I'm sure if we tried to implement a queue in native labview, it wouldn't be anywhere near as fast as the primitives. That said... There a lot of the code seems to be designed around ensuring atomicity. For example. In LabVIEW, we can both read and write to a global variable without having to use mutexes (I believe this is why they discuss CAS). LabVIEW handles all that. Maybe there are some aspects of their software (I haven't got around to looking at their source yet) that is redundant due to LabVIEWS machinations........that's a big maybe with a capital "PROBABLY NOT".
I'm not quite sure what you mean about "is going to have to copy data out of the buffer in order to leave the buffer in tact for the next reader".
Are you saying that merely using the index array primitive destroys the element?
I'm currently stuck at the "back pressure" aspect to the writer as I can't seem to get the logic right.
Assuming I have the logic right (still not sure) then this is one instance when a class beats the pants off of classic labview.
With a class I can read (2 readers) and write at about 50us, but don't quote me on that as I still don't have confidence in my logic (strange thing is, this slows down if you remove the error case structures to about 1ms
). I'm not trying anything complex. Just an array of doubles as the buffer.
DVRs just kill it. Not an option, So it makes classes a bit of a nightmare since you need to share the buffer off-wire. To hack around this, I went for a global variable to store the buffer (harking back to my old "Data Pool" pattern) and the classes just being accessors (Dpendancy Barrier?) and storing the position (for the reader).
I should just qualify that time claim in that the class VIs are all re-entrant subroutines (using 2009, so no in-place). Not doing this you can multiply by about 100.
Which method did you use to create the ring buffer? I'm currently trying the size mod 2 with the test for 1 element gap. This is slower than checking for overflow and reset, but easier to read whilst I'm chopping things around.
-
Interesting topic, and I can't say I anticipate being able to formulate a full reply so I'm just going to lob this one over the fence and see where it lands.
I can't say I'm really convinced it's a matter of "Queues vs Ring Buffers". Queues after all provide a whole synchronization layer. Perhaps arguing "Sequential vs Random Access Buffers" might be more apt?
Also it seems to me the underlying challenge here is how do you contend with multiple consumers which might be operating at different rates. Without delving too deep down the rabbit hole, it appears they've done only a handful of things that distinguish this Disruptor from the traditional producer-consumer pattern we all use know of in LabVIEW. Obviously there are huge differences in implementation, but I'm referring to very high level abstraction.
- Replace a sequential buffer with a random access buffer.
- Make the buffer shared among multiple consumers.
- Decouple the buffer from the synchronization transport.
We definitely have the tools to do this in LabVIEW. Throw your data model behind some reference-based construct (DVR, SEQ, DB, file system...), then start to think about your favorite messaging architecture not as a means of pushing data to consumers, but as a way of signalling to a consumer how to retrieve data from the model. That is not "Here's some new data, do what you will with it", but rather "Data is ready, here's how to get it". Seems like a pretty typical model-view relationship to me, I've definitely done exactly this many times in LabVIEW.
Of course I may have completely missed the point of their framework. I probably spent more time over the last day thinking about it than I have actually reading the referenced documents...
Well. The title was really to place the discussion in the queue producer/consumer pattern vs a ring buffer producer/consumer. Whilst queues generally are just a buffer behind the scenes (some can be linked lists) there is a domain separation here.
Queues are a "Many to One". Their real benefit is having many produces and a single consumer. In the one producer, one consumer, this is ok, but the example isn't really a one-to-one although we would shoehorns it into one in LabVIEW such that we have one consumer then branch the wire. Additionally. Looking at the classic dequeue with case statement which many messaging architectures are based on including mine. This is mitigating concurrency by enforcing serial execution.
The Disruptor or ring buffer approach is a "One to Many". So it has more in common with Events than queues.Events, however, have a lot of signalling and, in labview, are useless for encapsulation.
I've only breached the surface of the Disruptor pattern. But it doesn't seem to be a "Data Is Ready" approach since its premise is to try and remove signalling to enhance performance. The "Write" free wheels at the speed of the incoming data or until it reaches a piece of data that has not been "consumed". By consumed, I do not mean removed, simply that it has been read and therefore is no longer relevant.
A "Reader" requests a piece of data that is next in the sequence and waits until it receives it. Once received, it then processes it and requests the next. So. If it is up to the latest, it will idle or yield until new data is incoming.
The result seems to be self regulating throughput with back-pressure to the writer and all readers running flat out as long as there is data to process somewhere in the buffer. It also seems to be inherently cooperative towards resource allocation since the fast ones will yield (when they catch up to the writer) allowing more to the slower ones.
Here's a pretty good comparison of the different methods.
There's also a nice latency chart of the Java performance
- Replace a sequential buffer with a random access buffer.
-
Another thought: Some problems the Disruptor pattern solves in Java we don't have in LV, e.g garbage collection issues. It could however be worth trying to implement it in LV because we were able to build a producer/consumer architecture that doesn't suffer from stalling when producers and consumers run at different speeds.
It doesn't solve the garbage collection issues in Java. They "alleviate" that by having custom objects and reboot every 24hrs
.
The advantage of this technique is that M processes can work on the same data buffer in parallel without waiting for all processes to finish before moving to the next.
As an example.
Lets say we have a stream of doubles being written to a Queue/buffer of length N. We wish to do a mean and linear fit (Pt by Pt). We will assume that the linear fit~ 2x slower and that the queue/buffer is full and therefore blocking writes.
With a queue we remove one element, then proceed to do our aforesaid operations (which in LabVIEW we can do in parallel anyway). The queue writer can now add an element. The mean finishes first and then the reader has to wait for the linear fit to finish before it can de-queue the next element. Once the linear fit finishes, we then de-queue the next and start the process again evaluating the mean and linear fit.
From what I gather with this technique the following would happen.
We read the first element and pass it to the mean and linear fit. The mean finishes and then moves on to the next data point (doesn't wait for the linear fit). Once the linear fit has finished, the next value in the buffer can be inserted and it too moves to the next value. At this point the mean is working on element 3 (it is twice as fast)The result is that the mean travels through the buffer ahead of the linear fit (since it is faster) and is not a consideration for reading the next element from the buffer. Additionally (the theory goes) that once the faster process has reached the end of the data, there are more processing cycles available to the linear fit so that *should* decrease its processing time.
Now. They cite that by reading in this fashion, they can parallelise the processing. We already have that capability so I don't think we gain much of a benefit there. But leveraging processing of a function that would spend most of it's time doing nothing due to data being unavailable until the slower process finishes seems like it is worth experimenting with.
Queues Vs Ring Buffers
in OpenG General Discussions
Posted
+1. It's not conclusive between resolution and loop jitter if you check with the elapsed time.vi (that is inside the loop) since it is difficult to separate them (occasionally you see 10s of nsec or psec readings). But as the Tick count2+ uses the same timing method (which yields ps resolution), I think we can assume it is mainly loop jitter that causes the spacing. I think also that whether the Elapsed Time.vi gets executed in the same slice or the next (or the next +1 etc, depending on the schedular) is the reason we see several lines at x*loop jitter.