ShaunR Posted June 25, 2013 Report Posted June 25, 2013 (edited) I happened to come stumble upon what, was to me, an interesting presentation about high speed transaction processing (LMAX presentation). Their premise was that queues, which are the standard approach, were not the most appropriate for high throughput due to the pipeline nature of queues. To achieve their requirements, they have approached it from using ring buffers which enable them to parallel process data in the ring buffer, thus alleviating, but not eliminating pipe-lining (if the readers are faster than the writer, they still have to wait for data). The "classic" producer consumer in LabVIEW heavily relies on queues and, one of the problems we encounter is when the reader is slower than the writer (we are concerning ourselves with a single write only). Because we can only process the data at the head of the queue, we have a similar throughput problem in that we cannot use LabVIEWs parallelism to alleviate the bottleneck. So I was thinking that the alternative design pattern.that they term the Disruptor might be worth discussing even though we are contained by how LabVIEW manages things in the background (it will probably pan out that LabVIEWs queues will out-perform anything we can write in LabVIEW-parallel or not). Thoughts? (apart from why can't I upload images ) Edited June 25, 2013 by ShaunR 1 Quote
candidus Posted June 25, 2013 Report Posted June 25, 2013 I just watched the LMAX presentation and found it quite impressive. However, I don't think this architecture will increase performance when used with LV: The parallelism of LV doesn't help us here. As I understand the Disruptor pattern, RingBuffer access must be sequentially, there seem to be good reasons to access it from only one thread. Quote
candidus Posted June 26, 2013 Report Posted June 26, 2013 Another thought: Some problems the Disruptor pattern solves in Java we don't have in LV, e.g garbage collection issues. It could however be worth trying to implement it in LV because we were able to build a producer/consumer architecture that doesn't suffer from stalling when producers and consumers run at different speeds. Quote
ShaunR Posted June 26, 2013 Author Report Posted June 26, 2013 (edited) Another thought: Some problems the Disruptor pattern solves in Java we don't have in LV, e.g garbage collection issues. It could however be worth trying to implement it in LV because we were able to build a producer/consumer architecture that doesn't suffer from stalling when producers and consumers run at different speeds. It doesn't solve the garbage collection issues in Java. They "alleviate" that by having custom objects and reboot every 24hrs . The advantage of this technique is that M processes can work on the same data buffer in parallel without waiting for all processes to finish before moving to the next. As an example. Lets say we have a stream of doubles being written to a Queue/buffer of length N. We wish to do a mean and linear fit (Pt by Pt). We will assume that the linear fit~ 2x slower and that the queue/buffer is full and therefore blocking writes. With a queue we remove one element, then proceed to do our aforesaid operations (which in LabVIEW we can do in parallel anyway). The queue writer can now add an element. The mean finishes first and then the reader has to wait for the linear fit to finish before it can de-queue the next element. Once the linear fit finishes, we then de-queue the next and start the process again evaluating the mean and linear fit. From what I gather with this technique the following would happen. We read the first element and pass it to the mean and linear fit. The mean finishes and then moves on to the next data point (doesn't wait for the linear fit). Once the linear fit has finished, the next value in the buffer can be inserted and it too moves to the next value. At this point the mean is working on element 3 (it is twice as fast)The result is that the mean travels through the buffer ahead of the linear fit (since it is faster) and is not a consideration for reading the next element from the buffer. Additionally (the theory goes) that once the faster process has reached the end of the data, there are more processing cycles available to the linear fit so that *should* decrease its processing time. Now. They cite that by reading in this fashion, they can parallelise the processing. We already have that capability so I don't think we gain much of a benefit there. But leveraging processing of a function that would spend most of it's time doing nothing due to data being unavailable until the slower process finishes seems like it is worth experimenting with. Edited June 26, 2013 by ShaunR Quote
mje Posted June 26, 2013 Report Posted June 26, 2013 Interesting topic, and I can't say I anticipate being able to formulate a full reply so I'm just going to lob this one over the fence and see where it lands. I can't say I'm really convinced it's a matter of "Queues vs Ring Buffers". Queues after all provide a whole synchronization layer. Perhaps arguing "Sequential vs Random Access Buffers" might be more apt? Also it seems to me the underlying challenge here is how do you contend with multiple consumers which might be operating at different rates. Without delving too deep down the rabbit hole, it appears they've done only a handful of things that distinguish this Disruptor from the traditional producer-consumer pattern we all use know of in LabVIEW. Obviously there are huge differences in implementation, but I'm referring to very high level abstraction. Replace a sequential buffer with a random access buffer. Make the buffer shared among multiple consumers. Decouple the buffer from the synchronization transport. We definitely have the tools to do this in LabVIEW. Throw your data model behind some reference-based construct (DVR, SEQ, DB, file system...), then start to think about your favorite messaging architecture not as a means of pushing data to consumers, but as a way of signalling to a consumer how to retrieve data from the model. That is not "Here's some new data, do what you will with it", but rather "Data is ready, here's how to get it". Seems like a pretty typical model-view relationship to me, I've definitely done exactly this many times in LabVIEW. Of course I may have completely missed the point of their framework. I probably spent more time over the last day thinking about it than I have actually reading the referenced documents... Quote
ShaunR Posted June 26, 2013 Author Report Posted June 26, 2013 Interesting topic, and I can't say I anticipate being able to formulate a full reply so I'm just going to lob this one over the fence and see where it lands. I can't say I'm really convinced it's a matter of "Queues vs Ring Buffers". Queues after all provide a whole synchronization layer. Perhaps arguing "Sequential vs Random Access Buffers" might be more apt? Also it seems to me the underlying challenge here is how do you contend with multiple consumers which might be operating at different rates. Without delving too deep down the rabbit hole, it appears they've done only a handful of things that distinguish this Disruptor from the traditional producer-consumer pattern we all use know of in LabVIEW. Obviously there are huge differences in implementation, but I'm referring to very high level abstraction. Replace a sequential buffer with a random access buffer. Make the buffer shared among multiple consumers. Decouple the buffer from the synchronization transport. We definitely have the tools to do this in LabVIEW. Throw your data model behind some reference-based construct (DVR, SEQ, DB, file system...), then start to think about your favorite messaging architecture not as a means of pushing data to consumers, but as a way of signalling to a consumer how to retrieve data from the model. That is not "Here's some new data, do what you will with it", but rather "Data is ready, here's how to get it". Seems like a pretty typical model-view relationship to me, I've definitely done exactly this many times in LabVIEW. Of course I may have completely missed the point of their framework. I probably spent more time over the last day thinking about it than I have actually reading the referenced documents... Well. The title was really to place the discussion in the queue producer/consumer pattern vs a ring buffer producer/consumer. Whilst queues generally are just a buffer behind the scenes (some can be linked lists) there is a domain separation here. Queues are a "Many to One". Their real benefit is having many produces and a single consumer. In the one producer, one consumer, this is ok, but the example isn't really a one-to-one although we would shoehorns it into one in LabVIEW such that we have one consumer then branch the wire. Additionally. Looking at the classic dequeue with case statement which many messaging architectures are based on including mine. This is mitigating concurrency by enforcing serial execution. The Disruptor or ring buffer approach is a "One to Many". So it has more in common with Events than queues.Events, however, have a lot of signalling and, in labview, are useless for encapsulation. I've only breached the surface of the Disruptor pattern. But it doesn't seem to be a "Data Is Ready" approach since its premise is to try and remove signalling to enhance performance. The "Write" free wheels at the speed of the incoming data or until it reaches a piece of data that has not been "consumed". By consumed, I do not mean removed, simply that it has been read and therefore is no longer relevant. A "Reader" requests a piece of data that is next in the sequence and waits until it receives it. Once received, it then processes it and requests the next. So. If it is up to the latest, it will idle or yield until new data is incoming. The result seems to be self regulating throughput with back-pressure to the writer and all readers running flat out as long as there is data to process somewhere in the buffer. It also seems to be inherently cooperative towards resource allocation since the fast ones will yield (when they catch up to the writer) allowing more to the slower ones. Here's a pretty good comparison of the different methods. There's also a nice latency chart of the Java performance And finally. Some hand drawn pictures of the basics Quote
Aristos Queue Posted June 27, 2013 Report Posted June 27, 2013 This intrigued me, so I started to put the code together for this to see how it would work. After just a bit of setting up the VIs, I realized that I can't have two consumers walk the same buffer without making copies of the data for each consumer. That made me fairly certain that this ring buffer idea doesn't buy you anything in LabVIEW. In Java/C#/C++, the items in your queue-implemented-as-ring-buffer might be pointers/references, so having a single queue and having multiple readers traverse down it makes sense -- it saves you duplication. But in LV, each of those readers is going to have to copy data out of the buffer in order to leave the buffer in tact for the next reader. That means that you have zero advantage over just having multiple queues, one for each consumer that is going to consume the data. It could save you if you had a queue of data value references, but then you're going to be back to serializing your operations waiting on the DVR to be available. Am I missing some trick here? Quote
ShaunR Posted June 27, 2013 Author Report Posted June 27, 2013 (edited) This intrigued me, so I started to put the code together for this to see how it would work. After just a bit of setting up the VIs, I realized that I can't have two consumers walk the same buffer without making copies of the data for each consumer. That made me fairly certain that this ring buffer idea doesn't buy you anything in LabVIEW. In Java/C#/C++, the items in your queue-implemented-as-ring-buffer might be pointers/references, so having a single queue and having multiple readers traverse down it makes sense -- it saves you duplication. But in LV, each of those readers is going to have to copy data out of the buffer in order to leave the buffer in tact for the next reader. That means that you have zero advantage over just having multiple queues, one for each consumer that is going to consume the data. It could save you if you had a queue of data value references, but then you're going to be back to serializing your operations waiting on the DVR to be available. Am I missing some trick here? In general, I think it will not realise the performance improvements for the pointer reasons you have stated (we are ultimately constrained by the internal workings of LV, which we cannot circumvent). I'm sure if we tried to implement a queue in native labview, it wouldn't be anywhere near as fast as the primitives. That said... There a lot of the code seems to be designed around ensuring atomicity. For example. In LabVIEW, we can both read and write to a global variable without having to use mutexes (I believe this is why they discuss CAS). LabVIEW handles all that. Maybe there are some aspects of their software (I haven't got around to looking at their source yet) that is redundant due to LabVIEWS machinations........that's a big maybe with a capital "PROBABLY NOT". I'm not quite sure what you mean about "is going to have to copy data out of the buffer in order to leave the buffer in tact for the next reader". Are you saying that merely using the index array primitive destroys the element? I'm currently stuck at the "back pressure" aspect to the writer as I can't seem to get the logic right. Assuming I have the logic right (still not sure) then this is one instance when a class beats the pants off of classic labview. With a class I can read (2 readers) and write at about 50us, but don't quote me on that as I still don't have confidence in my logic (strange thing is, this slows down if you remove the error case structures to about 1ms ). I'm not trying anything complex. Just an array of doubles as the buffer. DVRs just kill it. Not an option, So it makes classes a bit of a nightmare since you need to share the buffer off-wire. To hack around this, I went for a global variable to store the buffer (harking back to my old "Data Pool" pattern) and the classes just being accessors (Dpendancy Barrier?) and storing the position (for the reader). I should just qualify that time claim in that the class VIs are all re-entrant subroutines (using 2009, so no in-place). Not doing this you can multiply by about 100. Which method did you use to create the ring buffer? I'm currently trying the size mod 2 with the test for 1 element gap. This is slower than checking for overflow and reset, but easier to read whilst I'm chopping things around. Edited June 27, 2013 by ShaunR Quote
Aristos Queue Posted June 27, 2013 Report Posted June 27, 2013 > Which method did you use to create the ring buffer? I haven't gotten as far as actually creating any buffers -- I was just setting up the shell VIs for one writer and two readers. > Are you saying that merely using the index array primitive destroys the element? Using Index Array copies the element out of the array. That is *exactly* what made me pause: if at any point you make a copy of the data, you have lost the advantage of this system over the parallel queues. Since you *must* make a copy of the data in order to have two separate processes act on it at the same time, there's no way to get an advantage over the parallel queues mechanism. Their buffer system appears to take advantage of the fact that you can, in a thread-controlled by-reference environment, act on data in-place. That's impossible in a free-thread by-value environment. I stopped going any further than that until I clarified my understanding since if these thoughts are correct, there's no possible advantage in LabVIEW with this concept. If I've missed something, I'm happy to re-evaluate. Quote
ShaunR Posted June 27, 2013 Author Report Posted June 27, 2013 (edited) > Which method did you use to create the ring buffer? I haven't gotten as far as actually creating any buffers -- I was just setting up the shell VIs for one writer and two readers. > Are you saying that merely using the index array primitive destroys the element? Using Index Array copies the element out of the array. That is *exactly* what made me pause: if at any point you make a copy of the data, you have lost the advantage of this system over the parallel queues. Since you *must* make a copy of the data in order to have two separate processes act on it at the same time, there's no way to get an advantage over the parallel queues mechanism. If I remember correctly. The Array Subset only keeps track of indexes (sub-array type). Would this avoid the copy? Not sure where "parallel" queues came into it. If we were to try parallel queues (I think you are saying a queue for each read process). Then data still has to be copied (within the labview code) on to each queue? Would you not get a copy at the wire junction at least if not on the queues themselves? This scenario is really slow in LabVIEW. I have to use TCPIP repeaters (one of the things I have my eye on for this implementation) and it is a huge bottleneck since to cater for arbitrary numbers of queues, you need to use a for loop to populate the queues and copies are definately made. Their buffer system appears to take advantage of the fact that you can, in a thread-controlled by-reference environment, act on data in-place. That's impossible in a free-thread by-value environment. I stopped going any further than that until I clarified my understanding since if these thoughts are correct, there's no possible advantage in LabVIEW with this concept. If I've missed something, I'm happy to re-evaluate. I think it will be impossible to see the same sort of performance that they achieve without going to compiled code (there is a .NET and C++ implementation) and if our only interest is to benchmark it to the LV queue primitives, we aren't really comparing apples (compiled code implementation of queues in the LV runtime vs labview code). However, the principle still stands and I think it may yield benefits for the aforementioned scenarios (queue-case for example), so I will certainly persevere. Of course. It'd be great if NI introduced a set of primitives in the LV kernel (Apache 2.0 licence I believe ) Edited June 27, 2013 by ShaunR Quote
mje Posted June 27, 2013 Report Posted June 27, 2013 <blockquote class='ipsBlockquote'data-author="Aristos Queue" data-cid="103871" data-time="1372301245"> <p>That is *exactly* what made me pause: if at any point you make a copy of the data, you have lost the advantage of this system over the parallel queues. Since you *must* make a copy of the data in order to have two separate processes act on it at the same time, there's no way to get an advantage over the parallel queues mechanism.</p></blockquote> I would argue otherwise. If you're implementing parallel queues to pass data directly, you've avoided making two distinct buffers. Yes, each element ultimately gets copied for each consumer, but there's only ever a copy for a single element per consumer floating around. If on the other hand your parallel queues are passing indices to the shared buffer, it allows you to get in and out of that buffer-- you make your copy and do your work, no need to maintain a lock while you operate in place. I have a really hard time believing their architecture doesn't use some sort of locking mechanism. At some point they need to read that circular buffer, and either they maintain a lock on an element while its being operated on, or they only lock it briefly while it's being copied. There's just no other way to ensure an old element doesn't get overwritten when the circular buffer flows around. Aside: I wish lava would fix posting from mobile browsers... Quote
Aristos Queue Posted June 27, 2013 Report Posted June 27, 2013 Shaun: Yes, the parallel queues make copies of the data. The only advantage I saw for the detonator was the lack of copies. So if I end up making copies, there's no advantage. Mje raises the point about the buffers. That's a reasonable point. I think that argument might flip the advantage back toward this being useful even in LV. Quote
ShaunR Posted June 27, 2013 Author Report Posted June 27, 2013 I have a really hard time believing their architecture doesn't use some sort of locking mechanism. At some point they need to read that circular buffer, and either they maintain a lock on an element while its being operated on, or they only lock it briefly while it's being copied. There's just no other way to ensure an old element doesn't get overwritten when the circular buffer flows around. Reference: LMAX Technical paper In the case of having only one producer, and regardless of the complexity of the consumer graph, no locks or CAS operations are required. The whole concurrency coordination can be achieved with just memory barriers on the discussed sequences. Quote
ShaunR Posted June 30, 2013 Author Report Posted June 30, 2013 Well. Another weekend wasted being a geek So I wrote a circular buffer and took on board (but didn't implement exactly) the Disruptor pattern. I've written a test harness which I will post soon once it is presentable so that we can optismise with the full brain power of Lavag.org rather than my puny organ (fnarr, fnarr). In a nutshell, it looks good - assuming I'm not doing anything daft and getting overruns that I'm not detecting. 1 Quote
ShaunR Posted July 1, 2013 Author Report Posted July 1, 2013 (edited) OK. Tidied up the benchmark so you can play, scorn, abuse. If we get something viable, then I will stick it in the CR with the loosest licence (public domain, BSD or something) You can grab it here: Prototype Circular Buffer Benchmark It'll be interesting to see the performance on different platforms/versions/bitnesses. (The above image was on Windows 7 x64 using LV2009 x64) Compared alongside each other. This is what we see. Edited July 1, 2013 by ShaunR Quote
GregSands Posted July 1, 2013 Report Posted July 1, 2013 Easy to try. Unfortunately, I get this for the Circular Buffer (LV 2012 x32 on Win 7 x64): Note the logarithmic scale! I wonder if it is something to do with my machine having only two cores? Here's the Profile Data for the two subroutines: Profile Data: all times in milliseconds, memory usage in kilobytes.Begin: Tue, 2 Jul 2013 11:29:38 a.m..End: Tue, 2 Jul 2013 11:29:46 a.m.. VI Time Sub VIs Time Total Time # Runs Average Shortest Longest Diagram Display Draw Tracking Locals Avg Bytes Min Bytes Max Bytes Avg Blocks Min Blocks Max Blocks Project Library Application InstanceCB Read.vi 3993.6 0.0 3993.6 2000 2.0 0.0 249.6 3993.6 0.0 0.0 0.0 0.0 0.00k 0.00k 0.00k 0 0 0 My ComputerCB Write.vi 1482.0 0.0 1482.0 1000 1.5 0.0 109.2 1482.0 0.0 0.0 0.0 0.0 0.00k 0.00k 0.00k 0 0 0 My Computer The Queue version works fine, with a mean time of 8.37us. Quote
ShaunR Posted July 2, 2013 Author Report Posted July 2, 2013 (edited) Easy to try. Unfortunately, I get this for the Circular Buffer (LV 2012 x32 on Win 7 x64): CB Test BUFFER_FP.png Note the logarithmic scale! I wonder if it is something to do with my machine having only two cores? Here's the Profile Data for the two subroutines: Interesting. You can see a lot of data points at about 300ms dragging the average up whereas most are clearly sub 10us. If you set the buffer size to 101 and do 100 data points, does it improve? (don't run the profiler at the same time, it will skew the results) Edited July 2, 2013 by ShaunR Quote
mje Posted July 2, 2013 Report Posted July 2, 2013 Neat. This is a 64-bit Win7 virtual machine which has access to 2 cores and 4 GB RAM. I see 15 us for queues versus 9 us for buffer when looking at medians. The mean and standard deviations are useless metrics on this platform because of the occasional hideously long access times on the buffer implementation. Note the chart for the buffers has been scaled to remove the outliers. Observed a moderate dependence on buffer size, which I expect is a consequence of using globals-- copy the whole array to read only a single element. I can see up to 12 us for the buffers if I increase the buffer size by even a factor of 2, but it levels off after that, I never saw more than 12 us. Haven't looked at the guts of it yet, but does the system ensure that old elements aren't overwritten if one of the consumers hasn't read them? Or is that "OK" in this pattern? I'm not surprised the buffers do better than the queues considering the lack of any synchronization in the buffers however I am surprised at how close the results were. I was half expecting an order of magnitude gain by circumventing the refnum locks that need to be run using queues. Perhaps the polling is starving some processing, or the buffer copies, or a combination of the two? Quote
GregSands Posted July 2, 2013 Report Posted July 2, 2013 Interesting. You can see a lot of data points at about 300ms dragging the average up whereas most are clearly sub 10us.If you set the buffer size to 101 and do 100 data points, does it improve? (don't run the profiler at the same time, it will skew the results) The slow iterations seem to occur at multiples of the Buffer Size - i.e. above, at 64, 128, 192, 256, ... If the buffer is bigger than the number of data points, it doesn't occur. Quote
ShaunR Posted July 2, 2013 Author Report Posted July 2, 2013 (edited) The slow iterations seem to occur at multiples of the Buffer Size - i.e. above, at 64, 128, 192, 256, ... If the buffer is bigger than the number of data points, it doesn't occur.Contingent on buffer multiples is a little perplexing. But that it disapers when the buffer is bigger than the points suggests it is due to the write waiting for the readers to catch up and letting LabVIEW get involved in scheduling (readers "should" be faster than the writer because they are simpler but the global array access is slow for the reasons MJE and AQ mentioned - I think I have a fix for that )Can you post the image of it not misbehaving? The graphs tell me a lot about what is happening. (BTW, I like the presentation as a log graphs better than my +3stds-I will modify the tests) Edited July 2, 2013 by ShaunR Quote
ShaunR Posted July 2, 2013 Author Report Posted July 2, 2013 (edited) Sweet!I've replaced the global variable with LabVIEW memory manager functions and, as surmised, the buffer size longer affects performance so you can make it as big as you want. .You can grab version 2 here:Prototype Circular Buffer Benchmark V2 I've been asked to clarify/state Licencing so here goes. I, the copyright holder of this work, hereby release it into the public domain. This applies worldwide. In case it is not legally possible or if the public domain licence is not recognised in any particular country ,The copyright holder shall be deemed to be the original developer (myself) and I hereby grant any entity the right to use this work for any purpose, without any conditions, except that the entity cannot claim ownership, copywrite or other lawful means to restrict or prevent others adopting the rights I hereby grant. This is free software for everyone. Edited July 2, 2013 by ShaunR 2 Quote
jcarmody Posted July 2, 2013 Report Posted July 2, 2013 FWIW Quote I, the copyright holder of this work, hereby release it into the public domain. This applies worldwide. In case it is not legally possible or if the public domain licence is not recognised in any particular country ,The copyright holder shall be deemed to be the original developer (myself) and I hereby grant any entity the right to use this work for any purpose, without any conditions, except that the entity cannot claim ownership, copywrite or other lawful means to restrict or prevent others adopting the rights I hereby grant. This is free software for everyone. Quote
GregSands Posted July 2, 2013 Report Posted July 2, 2013 Version 2 has the same problems for me. Firstly, here's the Queues as a baseline: Next is Buffers (10k iterations with a 10k buffer): There are no periodic delays, but still some occasional longer iterations (up to 100ms) presumably due to Windows being Windows. Last is Buffers (10k iterations with a 1k buffer): The results are the same for both 32-bit and 64-bit LabVIEW, all on Windows 7 x64. I've also run on another machine with 4 cores - this works without any long iterations. Quote
ShaunR Posted July 3, 2013 Author Report Posted July 3, 2013 (edited) @ GregSands Whilst a median of 4..6 micro seconds isn't fantastic. It's still not bad (slightly faster than the queues). In your images I am looking at the median and Max-Count Exec Times peak. The reason is as will follow. I'm not (at the moment) sure why you get the 300ms spikes (thread starvation?) but most of the other stuff can be explained (I think) I've been playing a bit more and have some theories to fit the facts. The following images are a little contrived since I run the tests multiple times and chose the ones that show the effects without extra spurii. But the data is always there in each run, just that you get more of them The following is with 258 data points buffer. We can clearly see two distinct levels (lines) for the write time and believe me; each data point in each line is identical. There are a couple of data points above these two (5 in fact). Theses anomalous points intriguingly occur at 1, 33, 65,129 and 257 or (2^n)+1. OK. So 17 is missing. you'll just have to take my word for it that it does sometimes show up. We can also notice that these points occur in the reader as well; at exactly the same locations with approximately the same magnitude. That is just too convenient. OK. So maybe we are getting collisions between the read and write. The following is again with 258 iterations with a buffer size of 2 (the minimum). That will definitely cause collisions if any are to be had. Nope. Exactly the same positions and "roughly" the same magnitudes. I would expect something to change at least if that were the issue. So if they really do occur at 2n+1 if we increase further we should see another appear at 513. Bingo!. Indeed. There it is. Here is what I think is going on. The LabVIEW memory manager uses an exponential allocation method for creating memory (perhaps AQ can confirm). So, every time the "Build Array" on the edge of the while loop needs more space, we see the the allocation take place which causes these artifacts. The more data points we build, the more of impact the LabVIEW memory manager has on the results. The "real" results for the buffer itself are the straight lines in the plots which are predictable and discrete. The spurious data points above these are LabVIEW messing with the test so that we can get pretty graphs. We can exacerbate the effect by going to a very high number. We can still clearly see our two lines and you will notice throughout all the images the Median and the Max Count-Exec Times have remained constant (scroll back up and check) which implies that the vast majority of the data points are the same. The results for the mean, STD and to our eyes are "confused" by the number of suprii. So I am postulating that most, if not everything above those two lines in the images is due to the build arrays on the border of the while loops. Of course. I cannot prove this since we are suffering from the "Observer Effect" and if I remove the build arrays on the border, we won't have any pretty graphs . Even running the profiler will cause massive increases in benchmark times and spread the data. I think we need a better bench marking method (pre-allocated arrays?). Of course. It raises the question. Why don't the queues exhibit this behaviour with the queue benchmark?. Well. I think they do. But it is less obvious because, for small numbers of iterations there is greater variance in the execution times and it is buried in the natural jitter. It only sticks out like a sore thumb with the buffer because it is so predictable that they are the only anomalies. Edited July 3, 2013 by ShaunR Quote
ShaunR Posted July 3, 2013 Author Report Posted July 3, 2013 (edited) Addendum. I've just modified the test to a) ensure timers always execute at pre-determined points in the while loops (connected the error terminals). b) pre-allocate the arrays. So it looks like a lot of what I was surmising is true. There is still one allocation in the 258 iteration image which might be for the shift registers. But everything is a lot more stable and the STD and mean are now meaningful (if you'l excuse the pun). Does anyone want to put forward a theory why we get discrete lines exactly 0.493 usecs apart? (maybe a different number on your machine, but the separation is always constant) Edited July 3, 2013 by ShaunR Quote
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.