Posts posted by ShaunR

  1. Playing around with the Queue tester, it was interesting to see that most of the slow queue performance is due to attempting operations in parallel.  In a stripped-down test (statistics removed) it was about 8 times faster when run in the UI thread (forcing serial access): about 0.6 µs per element on my relatively slow processor.

    The queue primitives are actually faster (as can be expected). But even a while loop with only the Elapsed Time.vi in it and nothing else yields ~900 ns on my machine. With one producer and one consumer, queues would be better, but the "spread" or "jitter" is far greater for queues. If you run the tests without the Elapsed Time.vi you will also see a significant improvement in the buffer (not so much with the queues), so I'm interested to see what AQ comes up with for tests.

     

    I think we will also see the true advantage when we scale up to, say, 5-10 in parallel (a more realistic application scenario), especially on quad-core machines where thread use can be maximised. But I'll get the variants working first, as then it's actually usable. The next step after that is to scale up, then add events to the read to see if we can maintain throughput with a better link to the application.

  2. And why do you think a variant is 8 bytes long? I don't know how long it is, but it is either a pointer or a more complex structure whose size you cannot easily determine. In the pointer case, which I would tend to believe it is, it would be either 4 bytes or 8 bytes long depending on whether you run this on 32-bit or 64-bit. The pointer theory is further reinforced by the lonely definition of the LvVariant typedef in the extcode.h file.

     

    If you run your VI in LabVIEW for 32-bit, the second MoveBlock will overwrite memory beyond the variant pointer and therefore destroy something in memory.

     

    Please note also that LabVIEW in fact knows two variants: the native LvVariant and the Windows VARIANT. They look the same on the diagram and LabVIEW will happily coerce from one to the other, but in memory they are different. And while you can configure a CLN parameter to be a Windows VARIANT, that is obviously only supported on Windows.

     

    Still wish they would document the LvVariant API that is exported by LabVIEW.exe.

     

    I know. I'm running x64 at the moment, so that's why it's 8. Maybe I should have put in a conditional disable for people on 32-bit, but it was late at night and I had already pulled most of my hair out.

     

    One doc I found states:

    "LabVIEW stores variants as handles to a LabVIEW internal data structure. Variant data is made up of 4 bytes"

     

    And a handle is a pointer to a pointer (a **). I don't really care what the data structure is, only that I can point to the handle, so it "should" be just pointer copying. It works fine for the write and once for the read, but the second time around the deallocation kills LV.
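
    For what it's worth, here is a minimal C sketch of the pointer-size point rolfk is making (illustrative only, not the VI itself; the variable names are made up): when copying pointer-sized values, the copy length has to be sizeof(void*), which is 4 on 32-bit LabVIEW and 8 on 64-bit, so a hard-coded 8-byte MoveBlock tramples the adjacent memory on a 32-bit build.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
            void  *variant = malloc(16);                  /* stands in for an LvVariant handle */
            void **slots   = calloc(4, sizeof(void *));   /* "array of pointers to variants"   */
            if (!variant || !slots) return 1;

            /* Safe: copy exactly one pointer's worth of bytes into the slot. */
            memcpy(&slots[0], &variant, sizeof(void *));

            /* On a 32-bit build, memcpy(&slots[0], &variant, 8) would also clobber slots[1],
             * which is the kind of corruption that later makes the dispose call crash.     */
            printf("pointer size on this build: %zu bytes\n", sizeof(void *));

            free(slots);
            free(variant);
            return 0;
        }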

     

    So. Suggestions?

  3. I'm probably missing something fundamental about how variants are stored in memory :angry: .

     

    I want to create an array of pointers to variants. I seem to be able to write quite happily, but reading causes a problem.

    The following VI demonstrates what I am trying to do. It runs through once quite happily, but on the second execution it fails on the DSDisposePointer (even though the variant has been written and read correctly). If you disable the read then it doesn't crash, so it must be something to do with the way I'm retrieving the data.

     

    Any help appreciated.

     

    varpointers.vi

     

     

  4. I was going to suggest that's the resolution of the Timer.  But when I just put your Tick Count+ VI in a loop, and compute differences, I get:

    TimerTest.png

    which says that the resolution is 29.1038 ps, but the loop time (with no other work done) is around 405.2 ns. So the even spacing is just a function of a largely constant, and almost empty, loop.

     

    +1. It's not conclusive between resolution and loop jitter if you check with the Elapsed Time.vi (that is inside the loop), since it is difficult to separate them (occasionally you see readings of tens of ns or even ps). But as the Tick count2+ uses the same timing method (which yields ps resolution), I think we can assume it is mainly loop jitter that causes the spacing. I think also that whether the Elapsed Time.vi gets executed in the same slice or the next (or the next +1 etc., depending on the scheduler) is the reason we see several lines at x*loop jitter.

  5. I noticed that sequencing the two “Enqueue” operations (by connecting the error wire) increased the queue performance greatly on my system.  

    It's a good job MJE noticed the bug where the dequeues weren't reading everything, then. :) Still, it's rather counter-intuitive. 

     

    Can't find the reference, but someone once said that the buffer size doubles until it's as big as a block (4kB?), then increases by block-sized increments.

    Well remembered. That also explains the sudden increase (the curved lines around 1 ms) in the old 1M data point plots.

    By the way, peeps, feel free to write your own benchmarks/test software or abuse-ware, especially if it shows a particular problem. The ones I supplied are really just test harnesses so that I could write the API and get some info on what was going wrong (you can probably see from the latest images that the ones I am using now are evolving as I probe different aspects). Not much time has been spent on them, so I expect there are lots of issues.

    Eventually, once all fingers have been inserted in all orifices, they will become usage examples and regression tests (if anyone uses the SQLite API For LabVIEW you will know what this means), so if you do knock something up, please post it, big or small.

    I think I should also request at this point, to avoid any confusion, that you only post code to this thread if you are happy for your submission to be public domain. By all means make your own proprietary stuff, but please post it somewhere else so that those who come here can be sure that anything downloaded from this thread is covered by the previous public domain declaration.

     

    Eventually it will go in the CR, and it will not be an issue, but until then please be aware that any submissions to this thread are public domain. Sorry to have to state it, but a little bit of garlic now will save the blood being sucked later.

  6. My only criticism of the benchmarks is that the queue consumers go until the queue is released, which does not mean they consumed all the data. I think it would be best to release each queue only after its consumer has finished: don't stop on error, stop on a fixed number of iterations. I suspect that's what AQ was getting at. I don't expect it will make much difference when sample sizes are large.

     

    I'm also concerned about CPU exhaustion. Correct me if I'm wrong, but if a reader tries reading an empty buffer, it will spin in an unthrottled loop until data is ready?

     

    Read Loop.png

     

    I worry that, should there not be enough cores to accommodate all the endpoints, how much time will be spent spinning, waiting for the cursor to advance, that could otherwise be spent executing other endpoints? Worse, would the subroutine priority lead to complete starvation? Similar arguments apply to writers, though I haven't poked under the hood there.

     

    Changing the VI from a subroutine and letting it yield should it ever not get a valid cursor did not seem to change my metrics.

     

    Read Loop 2.png

    Indeed. See my post in reply to GregSands about processor hogging.

    For a practical implementation, then yes, I agree: it should yield. Adding the Wait (ms) and changing to Normal degrades the performance on my machine by about 30%, and as I wanted to see what it "could do" and explore its behaviour rather than LabVIEW's, I wasn't particularly concerned about other software or other parts of the same software. It makes it quite hard to optimise if the benchmarks are cluttered with task/thread switches et al. Better (to begin with) to run it as fast as possible with as much resource as required to eke out the last few drops; then you can make informed decisions as to what you are prepared to trade for how much performance. Of course, in 2011+ you can "inline". In 2009 you cannot, so subroutine is the only way.
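
    As a rough illustration of that trade-off (a sketch only, with made-up names, not the benchmark VIs): a reader can either busy-spin on an empty buffer for minimum latency, or yield its timeslice when nothing is there, which is roughly what dropping a 0 ms wait into the read loop does.

        #include <sched.h>      /* sched_yield() on POSIX; Sleep(0) is the Windows analogue */
        #include <stdbool.h>

        /* Stand-in for the circular-buffer read; returns false while the buffer is empty. */
        static bool try_read(double *out) { *out = 42.0; return true; }

        static double read_blocking(bool yield_when_empty)
        {
            double value;
            while (!try_read(&value)) {          /* cursor has not advanced yet            */
                if (yield_when_empty)
                    sched_yield();               /* give the core away instead of hogging  */
                /* else: pure busy-wait, fastest but starves other work on this core       */
            }
            return value;
        }

        int main(void)
        {
            double v = read_blocking(true);
            return v == 42.0 ? 0 : 1;
        }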

    Ooooh. I've just thought of a new signature.......

    "I don't write code fast. I just write fast code :D

  7. Version 2 has the same problems for me.  Firstly, here's the Queues as a baseline:

    CB Test QUEUE_FP.png

     

    Next is Buffers (10k iterations with a 10k buffer):

     

    CB Test BUFFER_10k_FP.png

     

    There are no periodic delays, but still some occasional longer iterations (up to 100ms) presumably due to Windows being Windows.

     

    Last is Buffers (10k iterations with a 1k buffer):

     

    CB Test BUFFER_1k_FP.png

     

    The results are the same for both 32-bit and 64-bit LabVIEW, all on Windows 7 x64.  I've also run on another machine with 4 cores - this works without any long iterations.

     

    OK. I've managed to replicate this on a Core 2 Duo (the other PC is an i7). The periodic large deviations coincide with the buffer length. If you look at your 10K iterations @ 1K buffer, each point is 1K apart. If you were to set the #iterations to 100 and, say, the buffer length to 10, then the separation would be every 10 and you would see 9 points [ (#iterations/buffer length) - 1 ]. Similarly, if you set the buffer length to 20, you would see 4.

     

    My initial thought (as seems to be the norm) was a collision or maybe a task switch on the modulo operator. But it smells more like processor hogging, as you can get rid of it by placing a wait of 0 ms in the write and read while loops. You have to set the read and write VIs to "Normal" instead of "subroutine" to be able to use the wait, and therefore you lose a lot of the performance, so it's not ideal, but it does seem to cure it. I'm not sure of the exact mechanism; I'll have to chew it over. But it seems CPU-architecture dependent.

     

     

     

    I've been rewriting the benchmark VIs to my own standards. An important change is making those While Loops into For Loops and adding some sequence structures because the timers are being delayed by panel updates. Also the queue and buffer benchmarks do not match. I'll probably post revised VIs late tonight.

     

     

    Look forward to it. Care to elaborate on For Loop vs While Loop?

    (Don't speed up the queues too much eh? :P)

  8. Addendum.

     

    I've just modified the test to

    a) ensure timers always execute at pre-determined points in the while loops (connected the error terminals).

    b) pre-allocate the arrays.

     

    So it looks like a lot of what I was surmising is true. There is still one allocation in the 258-iteration image, which might be for the shift registers. But everything is a lot more stable and the STD and mean are now meaningful (if you'll excuse the pun).

     

    buffer258x258-prealloc.png

     

    buffer1Mx1M-prealloc.png

     

    Does anyone want to put forward a theory as to why we get discrete lines exactly 0.493 µs apart? (Maybe a different number on your machine, but the separation is always constant.)

     

    lines.png

  9. @ GregSands

    Whilst a median of 4-6 microseconds isn't fantastic, it's still not bad (slightly faster than the queues). In your images I am looking at the median and the Max Count-Exec Times peak; the reason will follow. I'm not (at the moment) sure why you get the 300 ms spikes (thread starvation?), but most of the other stuff can be explained (I think).

     

    I've been playing a bit more and have some theories to fit the facts. The following images are a little contrived, since I ran the tests multiple times and chose the ones that show the effects without extra spurii. But the data is always there in each run; you just get more of them.

     

    The following is with 258 data points and a 258-element buffer.

     

    buffer258x258.png

     

    We can clearly see two distinct levels (lines) for the write time and, believe me, each data point in each line is identical. There are a few data points above these two (5 in fact). These anomalous points intriguingly occur at 1, 33, 65, 129 and 257, i.e. (2^n)+1. OK, so 17 is missing; you'll just have to take my word for it that it does sometimes show up.

     

    We can also notice that these points occur in the reader as well, at exactly the same locations and with approximately the same magnitude. That is just too convenient.

     

    OK. So maybe we are getting collisions between the read and write. The following is again with 258 iterations with a buffer size of 2 (the minimum). That will definitely cause collisions if any are to be had.

     

    buffer258x2.png

     

    Nope. Exactly the same positions and "roughly" the same magnitudes. I would expect something to change, at least, if that were the issue. So if they really do occur at (2^n)+1, then if we increase further we should see another appear at 513.

     

    buffer562x562.png

     

    Bingo! Indeed, there it is.

     

    Here is what I think is going on. The LabVIEW memory manager uses an exponential allocation method for creating memory (perhaps AQ can confirm). So every time the "Build Array" on the edge of the while loop needs more space, we see the allocation take place, which causes these artifacts. The more data points we build, the more impact the LabVIEW memory manager has on the results. The "real" results for the buffer itself are the straight lines in the plots, which are predictable and discrete. The spurious data points above these are LabVIEW messing with the test, the price we pay for getting pretty graphs.
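
    As a rough way to see why that theory would produce anomalies at exactly (2^n)+1, here is a small C sketch (illustrative only; this is not how LabVIEW's Build Array is actually implemented) of an array that grows by doubling: the reallocations land on appends 1, 2, 3, 5, 9, 17, 33, 65, 129, 257, ...

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            size_t capacity = 0, count = 0;
            double *buf = NULL;

            for (size_t i = 1; i <= 1024; i++) {
                if (count == capacity) {                      /* out of room: double the capacity */
                    capacity = capacity ? capacity * 2 : 1;
                    double *tmp = realloc(buf, capacity * sizeof *buf);
                    if (!tmp) { free(buf); return 1; }
                    buf = tmp;
                    printf("grew at element %zu (new capacity %zu)\n", i, capacity);
                }
                buf[count++] = (double)i;                     /* the cheap, common-case append    */
            }
            free(buf);
            return 0;
        }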

     

    We can exacerbate the effect by going to a very high number.

     

    buffer1Mx1M.png

     

    We can still clearly see our two lines, and you will notice throughout all the images that the Median and the Max Count-Exec Times have remained constant (scroll back up and check), which implies that the vast majority of the data points are the same. The results for the mean and STD, and what our eyes see, are "confused" by the number of spurii. So I am postulating that most, if not everything, above those two lines in the images is due to the build arrays on the border of the while loops. Of course, I cannot prove this since we are suffering from the "Observer Effect": if I remove the build arrays on the border, we won't have any pretty graphs :P. Even running the profiler causes massive increases in benchmark times and spreads the data. I think we need a better benchmarking method (pre-allocated arrays?).

     

    Of course, this raises the question: why don't the queues exhibit this behaviour in the queue benchmark? Well, I think they do. But it is less obvious because, for small numbers of iterations, there is greater variance in the execution times and it is buried in the natural jitter. It only sticks out like a sore thumb with the buffer because the buffer is so predictable that these are the only anomalies.

     

    queue258x258.png

     

    queue1Mx1M.png

  10. Sweet!
    I've replaced the global variable with LabVIEW memory manager functions and, as surmised, the buffer size no longer affects performance, so you can make it as big as you want.

    You can grab version 2 here: Prototype Circular Buffer Benchmark V2

     

    I've been asked to clarify/state licensing, so here goes.

     

     

    I, the copyright holder of this work, hereby release it into the public domain. This applies worldwide.

     

    In case that is not legally possible, or if the public domain licence is not recognised in any particular country, the copyright holder shall be deemed to be the original developer (myself) and I hereby grant any entity the right to use this work for any purpose, without any conditions, except that the entity cannot claim ownership, copyright or use other lawful means to restrict or prevent others adopting the rights I hereby grant. This is free software for everyone. 

  11. The slow iterations seem to occur at multiples of the Buffer Size - i.e. above, at 64, 128, 192, 256, ...  If the buffer is bigger than the number of data points, it doesn't occur.
    That it is contingent on buffer multiples is a little perplexing. But that it disappears when the buffer is bigger than the number of points suggests it is due to the write waiting for the readers to catch up and letting LabVIEW get involved in scheduling (readers "should" be faster than the writer because they are simpler, but the global array access is slow for the reasons MJE and AQ mentioned - I think I have a fix for that ;) )

    Can you post the image of it not misbehaving? The graphs tell me a lot about what is happening. (BTW, I like the presentation as log graphs better than my +3 STDs - I will modify the tests.)

  12. Easy to try.  Unfortunately, I get this for the Circular Buffer (LV 2012 x32 on Win 7 x64):

     

    CB Test BUFFER_FP.png

     

    Note the logarithmic scale!  I wonder if it is something to do with my machine having only two cores?  Here's the Profile Data for the two subroutines:

     

     

     

    Interesting. You can see a lot of data points at about 300 ms dragging the average up, whereas most are clearly sub-10 µs.

    If you set the buffer size to 101 and do 100 data points, does it improve? (Don't run the profiler at the same time; it will skew the results.)

  13. OK. Tidied up the benchmark so you can play, scorn, abuse. If we get something viable, then I will stick it in the CR with the loosest licence (public domain,  BSD or something)

     

    You can grab it here: Prototype Circular Buffer Benchmark

     

    It'll be interesting to see the performance on different platforms/versions/bitnesses. (The above image was on Windows 7 x64 using LV2009 x64)

     

    Compared alongside each other, this is what we see.

     

    qvb.png

  14. Have you hit your 10MB limit? The only place I can find to check my current usage is when I try to attach a file. Currently, I have 5.89MB left.

     

    It doesn't tell me when I upload (it just says uploading is not allowed, with an upload error). When it first happened, I deleted 2 of my code repository submissions just in case that was the problem (deleted about 1.2 MB to upload a 46 KB image). It didn't make a difference. There used to be a page in the profile that allowed you to view all the attachments and uploads and monitor your usage. That seems to have disappeared, so I can't be sure that deleting more from the CR will allow uploading.

  15. Interestingly, not sure if it is related, your most recent posts have included a broken image link icon whenever you have attached a picture. Even more interestingly, this picture does actually display, inline, in the quick reply editor!

    The images are just inserted images that have to reside on another server (using the insert image button in the bar). Usually I upload the images to lavag and insert them. Obviously I cannot do that at the moment so this way is a work-around. I can do something similar for the files I want to upload, but then they won't appear inline in the posts as attachments (and presumably I cannot put stuff in the code repository for people). You will have to be redirected to my download page to get them (not desirable).

  16. It's PXI-RT. And yes, 2 MB/s is not high speed. But when I need frequent transferring and receiving at a deterministic loop rate, it is...

    Here is how my application works.

    PXI A and PXI B are synchronized using a 10K trigger signal.

    PXI A: Loop Rate 10K, Read 12 DBL from PXI B  -->Processing-->Send 12 DBL to PXI B

    PXI B: Loop Rate 10K, Read 12 DBL from PXI A  -->Processing-->Send 12 DBL to PXI A

     

    The transfer part takes more time than I thought. My processing part takes about 50 µs, and the data transfer takes about 70 µs using reflective memory (GE 5565)... So I am wondering how I can make the transfer time less than 40 µs.

    You want it at 40 µs because 40 + 50 < 100?

     

    Put your acquisition and processing (50 µs) in a producer loop and the TX in a consumer loop. Then your total processing time will be just the worst of the two (70 µs) rather than the sum of both.
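
    To put numbers on it: at a 10 kHz loop rate each iteration has a 100 µs budget. Run serially, 50 µs of processing plus 70 µs of transfer is 120 µs and the loop runs late; pipelined into two loops, each loop only carries its own stage, so the effective period is the slower of the two, 70 µs, which fits inside the 100 µs budget.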

  17. Hi,

    I am doing a project which requires high speed data communication between 2 chassis: 24 double-precision numbers at a 10 kHz loop rate. The first thing that came to my mind was using reflective memory, but the result is not good enough. The data transfer takes 80% of the time in the 10 kHz loop, so to avoid running the loop late I cannot do anything else in it.

     

    Is there any other option? Maybe using digital I/O on an FPGA card?

     

    Thanks in advance!

     

    It depends where your bottleneck is.

    24 x double-precision numbers @ 10 kHz is about 2 MB/s. Doesn't sound like a lot to me.

     

    Are we talking PXI-RT or PXI-Windows7? How are you acquiring and how are you transferring (TCPIP, MXI?).

  18. Well.

    Another weekend wasted being a geek  :D

     

    So I wrote a circular buffer and took on board (but didn't implement exactly) the Disruptor pattern. I've written a test harness which I will post soon, once it is presentable, so that we can optimise with the full brain power of Lavag.org rather than my puny organ (fnarr, fnarr).

    In a nutshell, it looks good - assuming I'm not doing anything daft and getting overruns that I'm not detecting.

     

    cb.png

     

     

  19. It looks to me like the data is already processed. If you just plot the data directly (and change the graph scale to logarithmic) you will get:

     

     

     

    psd1.png

     

    If you want to smooth it, use the Interpolate 1D.vi and select spline (ntimes = 10), and you will get:

     

    psd2.png

  20. You can no longer have a dynamic dispatch "read" VI because some tasks will return an array, others a single value. So, you can make a polymorphic VI, but then you are back at square one if the wrong instance is selected: run time errors.

    You can always just return an array with one element for single values (if a task returns a single value, just use Build Array to convert it). Then all your connector panes are the same. If you really want to, you can wrap that into a single-value polymorphic VI that returns just element 0. That way you won't get a run-time error.

  21. I have a really hard time believing their architecture doesn't use some sort of locking mechanism. At some point they need to read that circular buffer, and either they maintain a lock on an element while it's being operated on, or they only lock it briefly while it's being copied. There's just no other way to ensure an old element doesn't get overwritten when the circular buffer flows around.

     

     

    Reference: LMAX Technical paper

    In the case of having only one producer, and regardless of the complexity of the consumer graph, no locks or CAS operations are required. The whole concurrency coordination can be achieved with just memory barriers on the discussed sequences.
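
    For anyone who wants to see the single-producer case the paper is describing in concrete terms, here is a minimal single-producer/single-consumer ring buffer sketch in C11 (my own illustration, not LMAX's code and not the LabVIEW prototype; all names are made up): with only one writer per sequence counter, acquire/release ordering on the head and tail is enough, so no locks or CAS operations are needed.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdint.h>

        #define SIZE 1024  /* power of two so index wrapping is a cheap mask */

        typedef struct {
            double           data[SIZE];
            _Atomic uint64_t head;   /* written only by the producer */
            _Atomic uint64_t tail;   /* written only by the consumer */
        } spsc_ring;

        static bool rb_push(spsc_ring *r, double v)
        {
            uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
            uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
            if (head - tail == SIZE) return false;            /* full  */
            r->data[head & (SIZE - 1)] = v;
            atomic_store_explicit(&r->head, head + 1, memory_order_release);
            return true;
        }

        static bool rb_pop(spsc_ring *r, double *v)
        {
            uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
            uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
            if (head == tail) return false;                   /* empty */
            *v = r->data[tail & (SIZE - 1)];
            atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
            return true;
        }

        int main(void)
        {
            static spsc_ring r = {0};
            double v = 0.0;
            rb_push(&r, 3.14);                                /* producer side */
            return (rb_pop(&r, &v) && v == 3.14) ? 0 : 1;     /* consumer side */
        }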