How to implement triple buffering

ShaunR · October 18, 2014

Are you currently using IMAQ Extract Buffer VI and finding it is not adequate?

CharlesB · October 19, 2014

Once again, thanks everyone for your propositions!

Scenario 2 Implementation:
Producer Dequeues a buffer from Q2. Fills it. Enqueues it to Q1.

Consumer Dequeues from Q1. Processes it. Enqueues it to Q2.

Since Producer is pulling from Q2, there is no chance it will ever overwrite an unprocessed buffer.

Q1 being empty is not a problem. Means consumer is faster than Producer.

If Q2 is empty, Consumer is backlogged and a loss must occur. Producer Dequeues from Q1.

Since an element can only be Dequeued once, there is no chance the Consumer is processing that buffer and it is safe to overwrite.

But if consumer is too slow, producer will have an empty Q2 when starting to fill, and will have to wait.

I have sketched a simple condition variable (only one waiter allowed) class, that protects access to a variant, and gives two main methods, signal and wait.

It is used in a triple-buffer class with 3 main methods: start grab, grab ready, get latest. "get latest" returns immediately if a buffer has become ready since the last call, and waits otherwise. "start grab" and "grab ready" never wait.

I will post it as soon as it's ready.

ShaunR · October 19, 2014

I think I fully understand the advantages of dataflow programming,

I mean, dataflow with pointers isn't really dataflow, since they don't carry the data.

Obviously you don't. :book:

Looking forward to your triple buffering and benchmarking it. Maybe put it in the CR?

bbean · October 19, 2014

Triple Buffering with simple queues and shift register.

BufferTest.llb

ShaunR · October 20, 2014

And the stage is set!

In the blue corner, we have the "Triple Buffering Triceratops from Timbuktoo". In the red corner, we have the "Quick Queue Quisling from Quebec". And, at the last minute, in the green corner, we have the old geezers' favourite: the "Dreaded Data Pool From Global Grange".

Dreaded Data-pool From Global Grange.llb

Tune in next week to see the results.

Edited October 20, 2014 by ShaunR

drjdpowell · October 20, 2014

Unfortunately I can't use any async framegrabber option, because I'm doing processing on sequence of images, which disables the possibility of a "get latest frame". I really need to implement triple-buffering by myself, or I'll have to slow down the producer with copying image to the display at each iteration.

I don't have enough of a feel for your application to give specific advice, but are you sure it wouldn't be simpler to attend to the main processing of your 400 images per second in a simple clean way, then just make a copy for display 20-30 times per second for the separate display part, rather than invent a complex locking system to avoid those 20-30 copies? You might be able to do the display with a single IMAQ reference, making it very simple.

bbean · October 20, 2014

ShaunR - I was unaware that you could use events like that with Imaq IO refs...very nice. I will have to remember that for my next vision app.

While the Godwin Global approach is nice, I think there are two issues: 1) I believe the poster is not using IMAQ camera interface (Matrox proprietary instead) and 2) Somehow his Imaq Image Display is getting corrupted by newer images when it is updated faster (400fps) than it can be painted by the OS.

I'm not suggesting the "triple buffering" approach is the proper solution here, but I am collaborating with the hopes that he can see a "simple" LabVIEW queue approach can work.

PS. I'm surprised ShaunR doesn't have a "Cantankerous Callback" approach, and I haven't seen any Actor Framework approaches with multiple Actors.

Sparc · October 20, 2014

But if consumer is too slow, producer will have an empty Q2 when starting to fill, and will have to wait.

If no losses are acceptable and buffers are reused, any mechanism proposed will have this problem. That point is not valid criticism.

If losses are acceptable a "no wait" path is provided by reusing a buffer from Q1.
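
The two-queue scheme described here can be sketched in Python. This is only an illustrative translation, since the thread itself uses LabVIEW queues. Q1 holds filled buffers and Q2 holds free ones; the function names `make_pool`, `acquire_for_fill`, `produce`, and `consume` are hypothetical, and the `lossy` flag selects between the "must wait" and "no wait" paths.

```python
import queue

def make_pool(n_buffers=3, size=4):
    """Return (q1, q2): q1 = filled buffers, q2 = free buffers (preloaded)."""
    q1 = queue.Queue()
    q2 = queue.Queue()
    for _ in range(n_buffers):
        q2.put(bytearray(size))
    return q1, q2

def acquire_for_fill(q1, q2, lossy=True):
    """Get a buffer the producer may overwrite."""
    if not lossy:
        return q2.get()  # no losses allowed: block until the consumer returns one
    # Losses allowed: prefer a free buffer, otherwise recycle the oldest
    # unprocessed one from Q1. A dequeued buffer cannot be in the consumer's
    # hands, so overwriting it is safe. Assumes at least 2 buffers exist.
    while True:
        for q in (q2, q1):
            try:
                return q.get_nowait()
            except queue.Empty:
                pass

def produce(q1, q2, value, lossy=True):
    """Fill one buffer with `value` (stand-in for a frame grab) and publish it."""
    buf = acquire_for_fill(q1, q2, lossy)
    buf[:] = bytes([value]) * len(buf)
    q1.put(buf)

def consume(q1, q2):
    """Take the oldest filled buffer, 'process' it, and hand it back via Q2."""
    buf = q1.get()
    value = buf[0]
    q2.put(buf)
    return value
```

With three buffers and a stalled consumer, a fourth call to `produce` in lossy mode drops the oldest frame instead of waiting; with `lossy=False` it would block until `consume` returns a buffer.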

CharlesB · October 21, 2014

UPDATE: Victory!! The corruption problem wasn't related to the triple buffering, but to my display, which was using an XControl. I don't know why, but it looks like my XControl was doing display work after the value was set. Anyway, the problem is gone.

Note that the two-queue solution posted by bbean works perfectly and has similar performance. Kudos! :worshippy: However, I am keeping my solution, which is more complex but has a fully independent producer.

Triple buffering.zip

How to use

This class lets a producer loop run at its own rate, independently of the consumer. It is useful when the producer is faster than the consumer and the consumer doesn't need to process all the data (like a display).

Buffers are provided at initialization, through refnums. They can be DVRs, or IMAQ refnums, or any pointer to some memory area.
Once initialized, the consumer gets a refnum with "get latest or wait". The refnum given is locked and guaranteed not to be overwritten by the producer loop. If new data has been produced between two consumer calls, the call doesn't wait and returns the latest data. If not, it waits for the next data.
At each iteration, the producer starts with a "start grab", which returns the refnum of the buffer to fill. Once data is ready, it calls "ready". These two calls never wait, so the producer always runs at its own pace.

Implementation details

A condition variable is shared between producer and consumer. This variable is a cluster holding the indexes "locked", "grabbing", and "ready". The condition variable has a mechanism that allows a caller to acquire mutex access to the cluster, and to atomically release it and wait. When the variable is signaled by the producer, the mutex is re-acquired by the consumer. This guarantees the consumer that the producer doesn't access the variable between the end of the consumer's wait and the consumer re-acquiring the lock.

Reference for CV implementation: "Implementing Condition Variables with Semaphores", Andrew D. Birrell, Microsoft Research
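
For readers without LabVIEW, the bookkeeping above can be approximated in Python. This is a rough sketch, not the attached class: it uses the standard `threading.Condition` rather than the hand-built condition variable, and the class and method names (`TripleBuffer`, `start_grab`, `grab_ready`, `get_latest_or_wait`) as well as the `timeout` parameter are assumptions chosen to mirror the description.

```python
import threading

class TripleBuffer:
    """Three shared buffers tracked by the indexes 'locked', 'grabbing', 'ready'."""

    def __init__(self, buffers):
        assert len(buffers) == 3
        self._buffers = list(buffers)
        self._cond = threading.Condition()
        self._locked = None    # index currently held by the consumer
        self._grabbing = None  # index currently being filled by the producer
        self._ready = None     # index of the newest complete, unconsumed frame

    def start_grab(self):
        """Producer: pick a buffer that is neither locked nor ready. Never waits."""
        with self._cond:
            busy = {self._locked, self._ready}
            self._grabbing = next(i for i in range(3) if i not in busy)
            return self._buffers[self._grabbing]

    def grab_ready(self):
        """Producer: publish the buffer just filled. Never waits."""
        with self._cond:
            self._ready = self._grabbing  # any older ready frame is dropped
            self._grabbing = None
            self._cond.notify()

    def get_latest_or_wait(self, timeout=None):
        """Consumer: return the newest frame, waiting if none has arrived
        since the previous call. Returns None if the timeout expires."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._ready is not None, timeout):
                return None
            self._locked = self._ready  # previous locked buffer becomes free
            self._ready = None
            return self._buffers[self._locked]
```

Because at most two indexes are ever busy (one locked, one ready), `start_grab` always finds a free buffer, which is what keeps the producer wait-free in this sketch.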

Edited October 21, 2014 by CharlesB

drjdpowell · October 21, 2014

You'll need to mark your Image Indicator as "Synchronous Display" if you want it to display before the Producer overwrites the buffer. Indicators are asynchronous by default and update at something like 60Hz, slower than your 400 frames/second.

BTW, I can't see how this code would interface with some other process doing the main application work on all 400 frames. What do you do with the full 400 frames?

bbean · October 21, 2014

CharlesB...I can't figure out where your race condition is. Also am not sure why you need all the extra mechanisms (Semaphore, DVR, Status) when you can achieve the same thing using 2 simple queues as shown in my example. Plus the 2 queue approach guarantees you can not work on the image being displayed until it is put back in the camera/processing pipeline. IMHO it is a simpler and easier to debug solution. The other thing my solution does is allow you to do the image processing in "parallel" to your acquisition.

CharlesB · October 21, 2014

CharlesB...I can't figure out where your race condition is. Also am not sure why you need all the extra mechanisms (Semaphore, DVR, Status) when you can achieve the same thing using 2 simple queues as shown in my example. Plus the 2 queue approach guarantees you can not work on the image being displayed until it is put back in the camera/processing pipeline. IMHO it is a simpler and easier to debug solution. The other thing my solution does is allow you to do the image processing in "parallel" to your acquisition.

It may be a bit overkill, but the DVR and semaphore are here to protect against race conditions. I actually just translated the code shown in the paper from MS Research. It's important that the operation "unlock then wait then re-lock" is atomic, so that the producer doesn't read data in between; otherwise you get inconsistent operation...

Yes, the 2 queue approach is simpler, and it also works, but it's also interesting to have a G implementation of the condition variable, as this pattern may be helpful in some cases. I agree it's not aligned with the usual paradigm in LabVIEW, but overall it was a good exercise :cool: Also I have a small performance gain with the CV version of triple buffer.

Maybe you can implement a condition variable using fewer synchronization mechanisms; I'd have to think about it.
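
As a point of comparison, a single-waiter condition variable can be built from one mutex and one counting semaphore. The Python sketch below is a simplified illustration in the spirit of the Birrell paper, not a translation of the attached LabVIEW code; the class name `SingleWaiterCondition` and its members are hypothetical. The unlock and the wait are not literally one atomic step here, but the semaphore remembers a signal sent in between, so the wake-up cannot be lost.

```python
import threading

class SingleWaiterCondition:
    """Condition variable supporting at most one waiting thread."""

    def __init__(self):
        self.mutex = threading.Lock()         # protects the shared state
        self._sem = threading.Semaphore(0)    # carries the wake-up to the waiter
        self._waiting = False                 # only touched while mutex is held

    def wait(self):
        """Call with mutex held; returns with mutex held again."""
        self._waiting = True
        self.mutex.release()
        self._sem.acquire()   # a signal sent after the release is not lost
        self.mutex.acquire()

    def signal(self):
        """Call with mutex held; wakes the waiter if there is one."""
        if self._waiting:
            self._waiting = False
            self._sem.release()
```

As with any condition variable, the waiter should re-check its predicate in a loop around `wait()` while holding `mutex`.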

You'll need to mark your Image Indicator as "Synchronous Display" if you want it to display before the Producer overwrites the buffer. Indicators are asynchronous by default and update at something like 60Hz, slower than your 400 frames/second.

BTW, I can't see how this code would interface with some other process doing the main application work on all 400 frames. What do you do with the full 400 frames?

Thanks, I didn't know about the synchronous display stuff. The producer is actually doing other processing tasks with the frames, and needs to spend as little time as possible on the display, which is secondary compared to the overall acquisition rate, so I need display-related stuff to be wait-free in the producer.

drjdpowell · October 21, 2014

Thanks, I didn't know about the synchronous display stuff. The producer is actually doing other processing tasks with the frames, and needs to spend as little time as possible on the display, which is secondary compared to the overall acquisition rate, so I need display-related stuff to be wait-free in the producer.

I suspect simply writing to an IMAQ image indicator inside the producer (and getting rid of the consumer entirely) will be less overhead than your complex structure.

shoneill · October 22, 2014

Won't a single Queue do what you want? Your Queue represents a pool of available references and the consumer can remove one from the Queue, leaving two for your acquisition to re-use until your consumer is finished. When the consumer is finished, it places the item back in the Queue and then the acquisition again has three available for storage.

You could theoretically run into the situation where the consumer gets the same image twice if the timings are just right. To overcome this, perhaps implement a return Queue so that the acquisition, when choosing a reference for storage, first sees if there's anything in the return Queue and if yes, use it first (thus ensuring that the same image is never received twice by the consumer). If the Queue is empty, take an item from the queue as in the original example.

This is very similar to Shaun's proposal except for the fact that the producer actually consumes the same queue it is filling in order to keep things fresh. We don't refresh any reference which has been "locked" because this item is no longer in the Queue (we can treat this as an effective "Lock").

Shane.

Edited October 22, 2014 by shoneill

ShaunR · October 22, 2014

Do you have 2012 or later as an option? If so, the IMAQ ImageToEDVR VI will be available.

ShaunR · October 22, 2014

Sorry to say, but lossy queue on a shared buffer doesn't solve data corruption, as I said before.

This has been really bugging me in that if you have the image in an IMAQ ref (and your 3buff seems to imply you do), how are you able to write a partial frame to get corruption? Are you absolutely, super-duper, positively sure that you don't have two IMAQ images that are inadvertently identically named? That would cause corruption and may not be apparent until higher speeds.

bbean · October 23, 2014

Do you have 2012 or later as an option? If so, the IMAQ ImageToEDVR VI will be available.

Off topic: that looks like one of the most interesting/promising improvements to the Vision toolkit in a while.

CharlesB · October 23, 2014

Do you have 2012 or later as an option? If so, the IMAQ ImageToEDVR VI will be available.

I'm not sure I understand how it would help here.

This has been really bugging me in that if you have the image in an IMAQ ref (and your 3buff seems to imply you do), how are you able to write a partial frame to get corruption? Are you absolutely, super-duper, positively sure that you don't have two IMAQ images that are inadvertently identically named? That would cause corruption and may not be apparent until higher speeds.

Yes, perfectly sure. Buffers are allocated with different names everywhere, and filled in DLL functions, using IMAQ GetImagePixelPtr.

I have run some benchmarks, measuring both consumer and producer frequency, for the three solutions. Display is faster now that I have dumped the XControl I used to embed the IMAQ control, which I believe was causing the corruption.

1. Trivial solution: 1-element queue, enqueued by producer. Consumer previews the queue, displays, and empties the queue, blocking the producer during display
2. 2 queues solution (by bbean)
3. Condition variable (mine)

All three solutions have similar performance in all my scenarios, except when I limit the consumer loop to 25 Hz; in this case the producer in 1. is also limited to 25 Hz. The trivial solution shows image corruption in some cases.

Apart from this case, I never see the producer loop run faster than the consumer; both stay at roughly 80 Hz, even though the producer has some margin: when I hide the display window, the producer goes up to its max speed (200 Hz in this benchmark). When the CPU is doing other things, the rates go down to the same values at the same time, as if both loops were synchronized. This is quite strange, because in both 2. and 3. the producer loop rate should be independent of the consumer.

Consumer really does only display, so there's no reason it would slow down the producer like this... Everything looks like there's a global lock on IMAQ functions? Everything is shared reentrant. Producer is part of an actor, execution system set to "data acquisition" and consumer is in the main VI.

bbean · October 23, 2014

All three solutions have similar performance in all my scenarios, except when I limit consumer loop to a 25 Hz, in this case producer in 1. is also limited at 25 Hz. Trivial solution shows image corruption in some cases.

Except this case, I never see producer loop being faster than consumer, they both stay at roughly 80 Hz, while it has some margin: when I hide display window, producer goes up at its max speed (200 Hz in this benchmark). When CPU is doing other things, the rates go down to the same values at the same time, as if both loops were synchronized. This is quite strange, because in both 2. and 3. producer loop rate should be independent from consumer.

Consumer really does only display, so there's no reason it would slow down the producer like this... Everything looks like there's a global lock on IMAQ functions? Everything is shared reentrant. Producer is part of an actor, execution system set to "data acquisition" and consumer is in the main VI.

Are the Matrox dll calls thread safe? Are you making any dll calls in your image processing? Is it possible they are executing in the user interface thread?

shoneill · October 23, 2014

Charles, Have you tried my version?

Edited October 23, 2014 by shoneill

CharlesB · October 23, 2014

Are the Matrox dll calls thread safe? Are you making any dll calls in your image processing? Is it possible they are executing in the user interface thread?

Ooh thanks! I had some processing CLFNs that were set to run in the UI thread! Now the producer loop frequency is more independent of the display loop.

Edited October 23, 2014 by CharlesB

CharlesB · October 23, 2014

Charles, Have you tried my version?

Yes, it also works, and the benchmarking gives the same results as the two other solutions.

ShaunR · October 23, 2014

Yes, perfectly sure. Buffers are allocated with different names everywhere, and filled in DLL functions, using IMAQ GetImagePixelPtr.

IMAQ GetImagePixelPtr? That only retrieves a pointer. Are you then using IMAQ SetPixelValue to write individual pixels to an IMAQ ref?

Everything looks like there's a global lock on IMAQ functions?

Not IMAQ functions. IMAQ references. This may explain my confusion about corruption. Yes, IMAQ handles resource locking transparently (same as global variables, local variables, and any other shared resource we use in LabVIEW), so we never have to worry about data corruption (unless we cock up the IMAQ names, of course). Once you have an image inside an IMAQ ref, never, never manipulate it outside of IMAQ (use only the IMAQ functions like copy, extract etc). Going across the IMAQ boundary (in either direction) causes huge performance hits. As AQ's signature states, "we write C++ so you don't have to".

If you are pixel bashing singly into an IMAQ reference, pretending it is just an array of bytes in memory somewhere, using IMAQ SetPixelValue, then you will never achieve performance. Get it in with one block copy inside the DLL and never take the data out into values or arrays. Use the IMAQ functions to manipulate and display the data. This will cure any corruption, as you will only receive complete frames via your DLL. If you want, you can implement your triple buffering inside the DLL. Will it be fast enough? Maybe. This is where using NI products has the advantage, as the answer would be a resounding "easily".
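
The difference between per-pixel writes and a single block copy can be illustrated outside LabVIEW with a loose Python analogy. Nothing here touches IMAQ: `copy_per_pixel` and `copy_block` are hypothetical names, and the byte buffers merely stand in for a U8 greyscale frame.

```python
def copy_per_pixel(src, dst):
    """Write one 'pixel' at a time: one interpreted store per byte."""
    for i in range(len(src)):
        dst[i] = src[i]

def copy_block(src, dst):
    """Copy the whole frame in one operation.

    Going through memoryview keeps dst at its fixed size and raises
    ValueError if src and dst differ in length.
    """
    memoryview(dst)[:] = src

# A stand-in 1024x1024 U8 frame.
frame = bytes(range(256)) * 4096
a = bytearray(len(frame))
b = bytearray(len(frame))
copy_per_pixel(frame, a)
copy_block(frame, b)
```

Both calls produce identical buffers; the block copy simply does it in one step, which is the same reason a single block copy inside the DLL is preferable to per-pixel calls.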

Anecdotally, with your method and mine I can easily get 400 FPS using a 1024x1024 U8 greyscale image in multiple viewers. I'm actually simulating acquisition with an animation on a cruddy ol' laptop running all sorts of crap in the background. If I don't do the animation and just do a straight buffer copy from one image to another, I get thousands of frames/sec. However, I'm not trying to put it into an IMAQ ref a pixel at a time.

Animation

Buffer copy.

bbean · October 23, 2014

Once you have an image inside an IMAQ ref. Never, never manipulate it outside of IMAQ (use only the IMAQ functions like copy, extract etc). Going across the IMAQ boundary (either direction) causes huge performance hits. As AQs signature states "we write C++ so you don't have to".

What type of performance hit would you expect manipulating pixels using IMAQ ImageToEDVR and an IPE? Curious as to the difference between an algorithm implemented with this technique vs a c DLL call, but I'm not near a development environment with Vision.

ShaunR · October 23, 2014

What type of performance hit would you expect manipulating pixels using IMAQ ImageToEDVR and an IPE? Curious as to the difference between an algorithm implemented with this technique vs a c DLL call, but I'm not near a development environment with Vision.

I don't really know that much about it (it's not in 2009, which I'm using here). I do know that this is the method they now employ for all other DAQ, such as FPGA, for high-throughput applications. I would guess that it is very efficient, as they can do DMA transfers directly from acquired memory to application memory without having to go through the OS. Reading out of the DVR would be the bottleneck, rather than the transfer from acquired memory (system memory, as they call it) to LabVIEW memory, which is the case with the painfully slow IMAQ to Array and Array to IMAQ, for example. Until IMAQ ImageToEDVR, that was really the only way when you wanted to stream frames to files. With the EDVR coupled with the asynchronous TDMS (which just so happens to take a DVR), I can see huge speed benefits to file streaming at the very least.

Edited October 23, 2014 by ShaunR
