Everything posted by ShaunR
-
It's a good job MJE noticed the bug where the dequeues aren't reading everything then. Still. It's rather counter-intuitive. Well remembered. That also explains the sudden increase (the curved lines around 1ms) in the old 1M data point plots. By the way peeps. Feel free to write your own benchmarks/test software or abuse-ware. Especially if it shows a particular problem. The ones I supplied are really just test harnesses so that I could write the API and see some info on what was going wrong (you can probably see by the latest images that the ones I am using now are evolving as I probe different aspects). Not much time has been spent on them so I expect there are lots of issues. Eventually, once all fingers have been inserted in all orifices, they will become usage examples and regression tests (if anyone uses the SQLite API For LabVIEW you will know what this means) so if you do knock something up, please post it, big or small. I think I should also request at this point, to avoid any confusion, that you should only post code to this thread if you are happy for your submission to be public domain. By all means make your own proprietary stuff, but please post it somewhere else so those that come here can be sure that anything downloaded from this thread is covered by the previous public domain declaration. Eventually it will go in the CR and it will not be an issue, but until then please be aware that any submissions to this thread are public domain. Sorry to have to state it, but a little bit of garlic now will save the blood being sucked later.
-
Indeed. See my post in reply to GregSands about processor hogging. For a practical implementation, then yes, I agree. It should yield. Adding the wait ms and changing to Normal degrades the performance on my machine by about 30% and, as I wanted to see what it "could do" and explore its behaviour rather than LabVIEW's, I wasn't particularly concerned about other software or parts of the same software. It makes it quite hard to optimise if the benchmarks are cluttered with task/thread switches et al. Better (to begin with) to run it as fast as possible with as much resource as required to eke out the last few drops, then you can make informed decisions as to what you are prepared to trade for how much performance. Of course, in 2011+ you can "inline". In 2009 you cannot, so subroutine is the only way. Ooooh. I've just thought of a new signature....... "I don't write code fast. I just write fast code"
-
OK. I've managed to replicate this on a Core 2 Duo (the other PC is an i7). The periodic large deviations coincide with the buffer length. If you look at your 10K @ 1K buffer, each point is 1K apart. If you were to set the #iterations to 100 and, say, the buffer length to 10, then the separation would be every 10 and you would see 9 points [ (#iterations/buffer length) -1 ]. Similarly, if you set the buffer length to 20, you would see 4. My initial thought (as seems to be the norm) was a collision or maybe a task switch on the modulo operator. But it smells more like processor hogging as you can get rid of it by placing a wait with 0 ms in the write and read while loops. You have to set the read and write VIs to be "Normal" instead of "subroutine" to be able to use the wait and therefore you lose a lot of the performance so it's not ideal, but it does seem to cure it. I'm not sure of the exact mechanism; I'll have to chew it over. But it seems CPU architecture dependent. Look forward to it. Care to elaborate on for loop vs while loop? (Don't speed up the queues too much eh? )
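The spacing rule above can be checked with a few lines of arithmetic. This is only a sketch in Python (the benchmark itself is LabVIEW); the function name is made up for illustration and it assumes the deviations land exactly on multiples of the buffer length, excluding the final iteration.

```python
def spike_positions(iterations, buffer_length):
    """Iteration indices where a periodic deviation would be expected.

    Assumption: one deviation every `buffer_length` iterations, and none
    counted at the very last iteration.
    """
    return list(range(buffer_length, iterations, buffer_length))


# When iterations is a multiple of the buffer length, the count matches
# (#iterations / buffer length) - 1.
print(len(spike_positions(100, 10)))   # 9 points, spaced 10 apart
print(len(spike_positions(100, 20)))   # 4 points, spaced 20 apart
```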
-
Calculating the MD5 Message-Digest of a String or File
-
Addendum. I've just modified the test to a) ensure timers always execute at pre-determined points in the while loops (connected the error terminals) and b) pre-allocate the arrays. So it looks like a lot of what I was surmising is true. There is still one allocation in the 258 iteration image which might be for the shift registers. But everything is a lot more stable and the STD and mean are now meaningful (if you'll excuse the pun). Does anyone want to put forward a theory why we get discrete lines exactly 0.493 usecs apart? (maybe a different number on your machine, but the separation is always constant)
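For anyone who wants to try the same idea outside LabVIEW, here is a rough text-language analogue of a harness with a pre-allocated results array. It is a sketch only: the function name and the use of Python's `time.perf_counter_ns` are my assumptions, and Python lists do not behave like LabVIEW arrays, so it illustrates the structure of the test rather than reproducing the numbers.

```python
import time


def benchmark(operation, iterations):
    """Time `operation` once per iteration into a pre-allocated list."""
    samples = [0] * iterations              # allocated once, before timing starts
    for i in range(iterations):
        start = time.perf_counter_ns()
        operation()
        # Store in place so the results container never grows inside the timed loop.
        samples[i] = time.perf_counter_ns() - start
    return samples


timings_ns = benchmark(lambda: None, 258)
print(len(timings_ns))
```

The point of the design is simply that nothing inside the timed loop can trigger growth of the results container, so any remaining outliers belong to the code under test or to the scheduler.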
-
@ GregSands Whilst a median of 4..6 microseconds isn't fantastic, it's still not bad (slightly faster than the queues). In your images I am looking at the median and the Max-Count Exec Times peak. The reason is as follows. I'm not (at the moment) sure why you get the 300ms spikes (thread starvation?) but most of the other stuff can be explained (I think). I've been playing a bit more and have some theories to fit the facts. The following images are a little contrived since I ran the tests multiple times and chose the ones that show the effects without extra spurii. But the data is always there in each run; it's just that you get more of them. The following is with a buffer of 258 data points. We can clearly see two distinct levels (lines) for the write time and, believe me, each data point in each line is identical. There are a couple of data points above these two (5 in fact). These anomalous points intriguingly occur at 1, 33, 65, 129 and 257, or (2^n)+1. OK. So 17 is missing. You'll just have to take my word for it that it does sometimes show up. We can also notice that these points occur in the reader as well, at exactly the same locations with approximately the same magnitude. That is just too convenient. OK. So maybe we are getting collisions between the read and write. The following is again with 258 iterations with a buffer size of 2 (the minimum). That will definitely cause collisions if any are to be had. Nope. Exactly the same positions and "roughly" the same magnitudes. I would expect something to change at least if that were the issue. So if they really do occur at (2^n)+1, then if we increase further we should see another appear at 513. Bingo! Indeed. There it is. Here is what I think is going on. The LabVIEW memory manager uses an exponential allocation method for creating memory (perhaps AQ can confirm). So, every time the "Build Array" on the edge of the while loop needs more space, we see the allocation take place, which causes these artifacts.
The more data points we build, the more of an impact the LabVIEW memory manager has on the results. The "real" results for the buffer itself are the straight lines in the plots, which are predictable and discrete. The spurious data points above these are LabVIEW messing with the test so that we can get pretty graphs. We can exacerbate the effect by going to a very high number. We can still clearly see our two lines and you will notice throughout all the images the Median and the Max Count-Exec Times have remained constant (scroll back up and check) which implies that the vast majority of the data points are the same. The results for the mean, STD and to our eyes are "confused" by the number of spurii. So I am postulating that most, if not everything, above those two lines in the images is due to the build arrays on the border of the while loops. Of course, I cannot prove this since we are suffering from the "Observer Effect" and if I remove the build arrays on the border, we won't have any pretty graphs. Even running the profiler will cause massive increases in benchmark times and spread the data. I think we need a better benchmarking method (pre-allocated arrays?). Of course, it raises the question: why don't the queues exhibit this behaviour with the queue benchmark? Well. I think they do. But it is less obvious because, for small numbers of iterations, there is greater variance in the execution times and it is buried in the natural jitter. It only sticks out like a sore thumb with the buffer because it is so predictable that they are the only anomalies.
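The theory is easy to model. The sketch below is Python, not LabVIEW, and it does not claim to be how the LabVIEW memory manager actually works: it simply assumes an array that starts with no storage, gets a first chunk on the first append, and doubles its capacity whenever it is full. The `first_chunk=32` default is an assumption picked because it reproduces the positions listed above; a first chunk of 16 would also produce a point at 17.

```python
def reallocation_points(appends, first_chunk=32):
    """Return the 1-based append numbers at which a (re)allocation happens.

    Model assumption: capacity starts at 0, becomes `first_chunk` on the
    first append, and doubles every time the array is full.
    """
    size = 0
    capacity = 0
    points = []
    for n in range(1, appends + 1):
        if size == capacity:                     # no room left: grow first
            capacity = first_chunk if capacity == 0 else capacity * 2
            points.append(n)
        size += 1
    return points


print(reallocation_points(600))   # [1, 33, 65, 129, 257, 513]
```

If the model is anywhere near right, the spurious points should keep appearing at (2^n)+1 no matter what the buffer size is, which is what the 258 and 513 runs suggest.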
-
Sweet! I've replaced the global variable with LabVIEW memory manager functions and, as surmised, the buffer size no longer affects performance so you can make it as big as you want. You can grab version 2 here: Prototype Circular Buffer Benchmark V2. I've been asked to clarify/state licensing so here goes.
-
Contingent on buffer multiples is a little perplexing. But that it disappears when the buffer is bigger than the points suggests it is due to the write waiting for the readers to catch up and letting LabVIEW get involved in scheduling (readers "should" be faster than the writer because they are simpler, but the global array access is slow for the reasons MJE and AQ mentioned; I think I have a fix for that). Can you post the image of it not misbehaving? The graphs tell me a lot about what is happening. (BTW, I like the presentation as log graphs better than my +3 STDs. I will modify the tests.)
-
Interesting. You can see a lot of data points at about 300ms dragging the average up whereas most are clearly sub 10us. If you set the buffer size to 101 and do 100 data points, does it improve? (don't run the profiler at the same time, it will skew the results)
-
OK. Tidied up the benchmark so you can play, scorn, abuse. If we get something viable, then I will stick it in the CR with the loosest licence (public domain, BSD or something). You can grab it here: Prototype Circular Buffer Benchmark. It'll be interesting to see the performance on different platforms/versions/bitnesses. (The above image was on Windows 7 x64 using LV2009 x64.) Compared alongside each other, this is what we see.
-
It doesn't tell me when I upload (just says uploading is not allowed with an upload error). When it first happened, I deleted 2 of my code repository submissions just in case that was the problem (deleted about 1.2 MB to upload a 46KB image). It didn't make a difference. There used to be a page in the profile that allowed you to view all the attachments and uploads and monitor your usage. That seems to have disappeared so I can't be sure that deleting more from the CR will allow uploading.
-
The images are just inserted images that have to reside on another server (using the insert image button in the bar). Usually I upload the images to lavag and insert them. Obviously I cannot do that at the moment so this way is a work-around. I can do something similar for the files I want to upload, but then they won't appear inline in the posts as attachments (and presumably I cannot put stuff in the code repository for people). You will have to be redirected to my download page to get them (not desirable).
-
For a while now (ever since the last Lavag.org crash), I have not been able to post pictures or upload files. The upload section just states "Uploading is not allowed" and there is no real indication as to why this is so.
-
You want it 40 usecs because 40+50 < 100? Put your acquisition and processing (50us) in a producer loop and the TX in a consumer loop. Then your total processing time will be just the worst of the two (70us) rather than the addition of both.
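As plain arithmetic, the difference between the two arrangements looks like this. A Python sketch only; the function names are illustrative and the 70us TX figure is taken from the "worst of the two" number above.

```python
def serial_period_us(stage_times_us):
    """One loop doing every stage in turn: the times add up."""
    return sum(stage_times_us)


def pipelined_period_us(stage_times_us):
    """Producer and consumer loops running in parallel: the slowest stage sets the pace."""
    return max(stage_times_us)


stages = [50, 70]   # acquisition + processing, TX (microseconds)
print(serial_period_us(stages))      # 120
print(pipelined_period_us(stages))   # 70
```

So with the two loops decoupled, a 70us TX already fits inside a 100us budget, assuming the queue between the loops adds negligible overhead.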
- 4 replies
-
- data communication
- reflective memory
-
(and 3 more)
Tagged with:
-
It depends where your bottleneck is. 24xDouble precision numbers @ 10k is about 2MB/sec. Doesn't sound a lot to me. Are we talking PXI-RT or PXI-Windows7? How are you acquiring and how are you transferring (TCPIP, MXI?).
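The back-of-envelope figure works out as follows. A small Python sketch; it assumes "10k" means 10,000 samples per second per channel and 8 bytes per double, and the function name is illustrative.

```python
def throughput_bytes_per_second(channels, sample_rate_hz, bytes_per_sample=8):
    """Raw payload rate, ignoring any protocol or framing overhead."""
    return channels * sample_rate_hz * bytes_per_sample


rate = throughput_bytes_per_second(channels=24, sample_rate_hz=10_000)
print(rate)               # 1920000 bytes/s
print(rate / 1_000_000)   # 1.92, i.e. about 2MB/sec
```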
-
Well. Another weekend wasted being a geek. So I wrote a circular buffer and took on board (but didn't implement exactly) the Disruptor pattern. I've written a test harness which I will post soon once it is presentable so that we can optimise with the full brain power of Lavag.org rather than my puny organ (fnarr, fnarr). In a nutshell, it looks good, assuming I'm not doing anything daft and getting overruns that I'm not detecting.
-
It looks to me like the data is already processed. If you just plot the data directly (and change the graph scale to logarithmic) you will get: If you want to smooth it, use the Interpolate 1D.vi and select spline (ntimes = 10) and you will get:
-
SQLite API For LabVIEW testers Needed (Apple Mac).
ShaunR replied to ShaunR's topic in Apple Macintosh
Resolved via email.
-
DAQmx and wrapping it in classes
ShaunR replied to GregFreeman's topic in Object-Oriented Programming
You can always just return an array with one element for single values (if a task returns a single value, just use the build array to convert it). Then all your connector panes are the same. If you really want to, you can wrap that into a single value polymorphic VI to return just element 0. That way you won't get a run-time error.
-
Reference: LMAX Technical paper
-
If I remember correctly, the Array Subset only keeps track of indexes (sub-array type). Would this avoid the copy? Not sure where "parallel" queues came into it. If we were to try parallel queues (I think you are saying a queue for each read process), then data still has to be copied (within the LabVIEW code) on to each queue? Would you not get a copy at the wire junction at least, if not on the queues themselves? This scenario is really slow in LabVIEW. I have to use TCPIP repeaters (one of the things I have my eye on for this implementation) and it is a huge bottleneck since, to cater for arbitrary numbers of queues, you need to use a for loop to populate the queues and copies are definitely made. I think it will be impossible to see the same sort of performance that they achieve without going to compiled code (there is a .NET and C++ implementation) and if our only interest is to benchmark it against the LV queue primitives, we aren't really comparing apples (compiled code implementation of queues in the LV runtime vs LabVIEW code). However, the principle still stands and I think it may yield benefits for the aforementioned scenarios (queue-case for example), so I will certainly persevere. Of course, it'd be great if NI introduced a set of primitives in the LV kernel (Apache 2.0 licence I believe )
-
In general, I think it will not realise the performance improvements for the pointer reasons you have stated (we are ultimately constrained by the internal workings of LV, which we cannot circumvent). I'm sure if we tried to implement a queue in native LabVIEW, it wouldn't be anywhere near as fast as the primitives. That said... A lot of the code seems to be designed around ensuring atomicity. For example, in LabVIEW we can both read and write to a global variable without having to use mutexes (I believe this is why they discuss CAS). LabVIEW handles all that. Maybe there are some aspects of their software (I haven't got around to looking at their source yet) that are redundant due to LabVIEW's machinations........that's a big maybe with a capital "PROBABLY NOT". I'm not quite sure what you mean about "is going to have to copy data out of the buffer in order to leave the buffer intact for the next reader". Are you saying that merely using the index array primitive destroys the element? I'm currently stuck at the "back pressure" aspect to the writer as I can't seem to get the logic right. Assuming I have the logic right (still not sure) then this is one instance when a class beats the pants off of classic LabVIEW. With a class I can read (2 readers) and write at about 50us, but don't quote me on that as I still don't have confidence in my logic (strange thing is, this slows down to about 1ms if you remove the error case structures). I'm not trying anything complex. Just an array of doubles as the buffer. DVRs just kill it. Not an option. So it makes classes a bit of a nightmare since you need to share the buffer off-wire. To hack around this, I went for a global variable to store the buffer (harking back to my old "Data Pool" pattern) with the classes just being accessors (Dependency Barrier?) and storing the position (for the reader). I should just qualify that time claim in that the class VIs are all re-entrant subroutines (using 2009, so no in-place).
Not doing this you can multiply by about 100. Which method did you use to create the ring buffer? I'm currently trying the size mod 2 with the test for 1 element gap. This is slower than checking for overflow and reset, but easier to read whilst I'm chopping things around.
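For reference, one common way to do the "1 element gap" test in a text language is sketched below. This is Python rather than LabVIEW and it is my reading of the approach, not the actual VI: indexes wrap with a modulo, the buffer counts as empty when the read and write indexes are equal, and it counts as full when the write index is one slot behind the read index, so one slot is always left unused. The class and method names are made up for illustration.

```python
class RingBuffer:
    """Single-writer, single-reader ring buffer using a one-element gap."""

    def __init__(self, size):
        if size < 2:
            raise ValueError("size must be at least 2 (one slot is always kept free)")
        self._data = [0.0] * size
        self._size = size
        self._write = 0   # index of the next slot to write
        self._read = 0    # index of the next slot to read

    def is_empty(self):
        return self._read == self._write

    def is_full(self):
        # Full when advancing the write index would land on the read index.
        return (self._write + 1) % self._size == self._read

    def put(self, value):
        """Store a value; returns False (and stores nothing) if the buffer is full."""
        if self.is_full():
            return False
        self._data[self._write] = value
        self._write = (self._write + 1) % self._size
        return True

    def get(self):
        """Return the oldest value, or None if the buffer is empty."""
        if self.is_empty():
            return None
        value = self._data[self._read]
        self._read = (self._read + 1) % self._size
        return value


rb = RingBuffer(4)                           # usable capacity is size - 1 = 3
print([rb.put(x) for x in (1.0, 2.0, 3.0, 4.0)])   # [True, True, True, False]
print(rb.get())                              # 1.0
```

The gap costs one slot of storage but avoids a separate element counter that both the reader and the writer would have to update.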
-
Well. The title was really to place the discussion in the queue producer/consumer pattern vs a ring buffer producer/consumer. Whilst queues generally are just a buffer behind the scenes (some can be linked lists) there is a domain separation here. Queues are a "Many to One". Their real benefit is having many producers and a single consumer. In the one producer, one consumer case, this is OK, but the example isn't really a one-to-one, although we would shoehorn it into one in LabVIEW such that we have one consumer then branch the wire. Additionally, look at the classic dequeue with case statement which many messaging architectures are based on, including mine. This is mitigating concurrency by enforcing serial execution. The Disruptor or ring buffer approach is a "One to Many". So it has more in common with Events than queues. Events, however, have a lot of signalling and, in LabVIEW, are useless for encapsulation. I've only scratched the surface of the Disruptor pattern. But it doesn't seem to be a "Data Is Ready" approach since its premise is to try and remove signalling to enhance performance. The "Write" free-wheels at the speed of the incoming data or until it reaches a piece of data that has not been "consumed". By consumed, I do not mean removed, simply that it has been read and therefore is no longer relevant. A "Reader" requests a piece of data that is next in the sequence and waits until it receives it. Once received, it then processes it and requests the next. So, if it is up to the latest, it will idle or yield until new data is incoming. The result seems to be self-regulating throughput with back-pressure to the writer and all readers running flat out as long as there is data to process somewhere in the buffer. It also seems to be inherently cooperative towards resource allocation since the fast ones will yield (when they catch up to the writer), allowing more to the slower ones. Here's a pretty good comparison of the different methods.
There's also a nice latency chart of the Java performance. And finally, some hand-drawn pictures of the basics.
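To make the write/read rules above concrete, here is a minimal single-threaded model in Python. It is a sketch of the behaviour as described in this post, not the LMAX implementation and not the LabVIEW prototype: the names are made up, there is no threading, no memory barriers and no wait strategy. The writer is refused (back-pressure) when it would overwrite a slot that the slowest reader has not yet read, and a reader gets nothing when it has caught up with the writer.

```python
class SharedRing:
    """One writer, many readers, each reader keeping its own position."""

    def __init__(self, size, reader_count):
        self._data = [None] * size
        self._size = size
        self._next_write = 0                    # sequence number of the next write
        self._next_read = [0] * reader_count    # next sequence number per reader

    def try_write(self, value):
        """Write unless that would overwrite data the slowest reader still needs."""
        if self._next_write - min(self._next_read) >= self._size:
            return False                        # back-pressure: writer must wait
        self._data[self._next_write % self._size] = value
        self._next_write += 1
        return True

    def try_read(self, reader):
        """Return (True, value) for the reader's next item, or (False, None)
        when the reader has caught up with the writer and should idle/yield."""
        seq = self._next_read[reader]
        if seq >= self._next_write:
            return (False, None)
        value = self._data[seq % self._size]    # read in place; nothing is removed
        self._next_read[reader] = seq + 1
        return (True, value)


ring = SharedRing(size=2, reader_count=2)
ring.try_write("a")
ring.try_write("b")
print(ring.try_write("c"))   # False: neither reader has consumed "a" yet
print(ring.try_read(0))      # (True, 'a')
print(ring.try_read(1))      # (True, 'a'), both readers see every item
print(ring.try_write("c"))   # True: slot 0 has now been read by everyone
```

Note that reading never removes anything: a slot only becomes writable again once every reader's position has moved past it, which is where the self-regulating back-pressure comes from.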
-
It doesn't solve the garbage collection issues in Java. They "alleviate" that by having custom objects and rebooting every 24hrs. The advantage of this technique is that M processes can work on the same data buffer in parallel without waiting for all processes to finish before moving to the next. As an example, let's say we have a stream of doubles being written to a Queue/buffer of length N. We wish to do a mean and linear fit (Pt by Pt). We will assume that the linear fit is ~2x slower and that the queue/buffer is full and therefore blocking writes. With a queue we remove one element, then proceed to do our aforesaid operations (which in LabVIEW we can do in parallel anyway). The queue writer can now add an element. The mean finishes first and then the reader has to wait for the linear fit to finish before it can de-queue the next element. Once the linear fit finishes, we then de-queue the next and start the process again, evaluating the mean and linear fit. From what I gather, with this technique the following would happen. We read the first element and pass it to the mean and linear fit. The mean finishes and then moves on to the next data point (doesn't wait for the linear fit). Once the linear fit has finished, the next value in the buffer can be inserted and it too moves to the next value. At this point the mean is working on element 3 (it is twice as fast). The result is that the mean travels through the buffer ahead of the linear fit (since it is faster) and is not a consideration for reading the next element from the buffer. Additionally (the theory goes), once the faster process has reached the end of the data, there are more processing cycles available to the linear fit so that *should* decrease its processing time. Now. They cite that by reading in this fashion, they can parallelise the processing. We already have that capability so I don't think we gain much of a benefit there.
But leveraging processing of a function that would spend most of its time doing nothing due to data being unavailable until the slower process finishes seems like it is worth experimenting with.
-
I happened to stumble upon what was, to me, an interesting presentation about high speed transaction processing (LMAX presentation). Their premise was that queues, which are the standard approach, were not the most appropriate for high throughput due to the pipeline nature of queues. To achieve their requirements, they have approached it using ring buffers, which enable them to process data in the ring buffer in parallel, thus alleviating, but not eliminating, pipe-lining (if the readers are faster than the writer, they still have to wait for data). The "classic" producer consumer in LabVIEW relies heavily on queues and one of the problems we encounter is when the reader is slower than the writer (we are concerning ourselves with a single writer only). Because we can only process the data at the head of the queue, we have a similar throughput problem in that we cannot use LabVIEW's parallelism to alleviate the bottleneck. So I was thinking that the alternative design pattern that they term the Disruptor might be worth discussing even though we are constrained by how LabVIEW manages things in the background (it will probably pan out that LabVIEW's queues will out-perform anything we can write in LabVIEW, parallel or not). Thoughts? (apart from why can't I upload images )