Jump to content

How to implement triple buffering


Recommended Posts

In the need for displaying large images at a high performance, I want to use triple buffering in my program. This type of acquisition allows to acquire large data in buffers, and have it used without copying images back and forth between producer and consumer.

 

This way consumer thread doesn't wait if a buffer is ready, and producer works at max speed because it never waits or copy.
If the consumer makes the request when a buffer is ready, it is atomically turned into a "lock" state. If a buffer isn't ready, it waits for it, atomically lock it when it is ready.

The following timing diagram shows how it goes with the 3 buffers.

 

post-1401-0-56949400-1413476263_thumb.pn

 

Traditional LabVIEW queues don't fit here because we have large buffers that we don't want to copy.

 

I tried to implement it with notifiers, but there is always a risk of a race condition between the producer selecting where to fill next acquisition, and locking by the consumer. With condition variables, it is easy because when a wait ends, a mutex is locked. There is no such synchronization primitive in LabVIEW. How can I implement this?

 

(cross-posted on the dark-side)

Edited by CharlesB
Link to comment

In the need for displaying large images at a high performance, I want to use triple buffering in my program. This type of acquisition allows to acquire large data in buffers, and have it used without copying images back and forth between producer and consumer.

 

This way consumer thread doesn't wait if a buffer is ready, and producer works at max speed because it never waits or copy.

If the consumer makes the request when a buffer is ready, it is atomically turned into a "lock" state. If a buffer isn't ready, it waits for it, atomically lock it when it is ready.

The following timing diagram shows how it goes with the 3 buffers.

 

attachicon.giftriple buffer.png

 

Traditional LabVIEW queues don't fit here because we have large buffers that we don't want to copy.

 

I tried to implement it with notifiers, but there is always a risk of a race condition between the producer selecting where to fill next acquisition, and locking by the consumer. With condition variables, it is easy because when a wait ends, a mutex is locked. There is no such synchronization primitive in LabVIEW. How can I implement this?

 

(cross-posted on the dark-side)

 

DVRs (for the buffers) and semaphores (LabVIEW's "condition variable").

 

However. You only have one writer and reader, right? So push the DVRs into a queue and you will only copy a pointer. You can then either let LabVIEW handle memory by creating and destroying a DVR for each image or have a round-robin pool of permanent DVRs if you want to be fancy (n-buffering). You were right in your original approach, you just didn't use the DVR so had to copy the data.

Edited by ShaunR
  • Like 1
Link to comment

Following on from what Shaun said, I use exactly the set up he described. I pull my images from an FPGA FIFO so I have it set up to give me a DVR which I pass through a series of queues (a messaging hierarchy). The ultimate endpoint of the DVR is the File IO process, but one of the intermediate message routers will periodically copy the data out from the DVR (at a rate of 30 Hz or similar) for display before piping the DVR onto the File IO process.

Link to comment

 

Traditional LabVIEW queues don't fit here because we have large buffers that we don't want to copy.

 

Are you sure of that?  I wouldn’t expect a queue to make a copy of a pointer-based data structure like an array or object.  Unlike Notifiers, which must make a copy on reading.  

Link to comment

DVRs (for the buffers) and semaphores (LabVIEW's "condition variable").

 

However. You only have one writer and reader, right? So push the DVRs into a queue and you will only copy a pointer. You can then either let LabVIEW handle memory by creating and destroying a DVR for each image or have a round-robin pool of permanent DVRs if you want to be fancy (n-buffering). You were right in your original approach, you just didn't use the DVR so had to copy the data.

 

In my case buffers are IMAQ references, so I don't think I need DVRs, since IMAQ refs are already pointers on data.

 

If I give it a try with queues, I need a P->C queue for the "ready" event.

 

At grab start, P determines on which buffer to fill, which is the one that is neither locked nor ready.

At grab end, P pushes on the "ready" queue. If C is waiting, the just-filled buffer is marked as locked, and if not, it is marked as ready (and the previous lock remains). But P also needs to know whether a buffer has been locked by C during a fill, like in the third iteration on my original diagram.

 

So we must have another queue, C->P, that P will check at grab start. Which brings the following race condition problem (P and C are in different threads) if C request happen at the same time a grab ends:

  • B1 is ready, P is filling B2, and B3 is locked
  • C queries the "ready" queue, which already has an element (no wait), B1
  • C thread is put of hold
  • P threads wakes, with a grab end, P pushes in the "ready" queue B2
  • P sees no change in the "locked", queue, and chooses to fill on B1, since B3 is still locked
  • C thread finally wakes, and pushes on the lock queue B1
  • But P has already chose to fill on the B1 buffer, so B1 will be corrupted when C will read data.

So we need a semaphore around this, so that P doesn't interrupt the sequence "ready dequeue, lock queue". The semaphore will be acquired by C before waiting on the "ready" queue, released after pushing to the "locked" queue, and P will acquire it at grab start. I don't think there's a dead lock risk, but I'm not sure...

 

That's why condition variables are great: you have an atomic wait and lock, which prevents such race conditions. I pretty sure condition variables can be implemented with queues and semaphores, but I don't know how.

 

Following on from what Shaun said, I use exactly the set up he described. I pull my images from an FPGA FIFO so I have it set up to give me a DVR which I pass through a series of queues (a messaging hierarchy). The ultimate endpoint of the DVR is the File IO process, but one of the intermediate message routers will periodically copy the data out from the DVR (at a rate of 30 Hz or similar) for display before piping the DVR onto the File IO process.

 

But how does the original FIFO knows it shouldn't write on the memory location it gave to you? If the producer acquires data at a higher rate than the consumer, there must be a mechanism to prevent data corruption if buffers aren't copied between producer and consumer.

 

Are you sure of that?  I wouldn’t expect a queue to make a copy of a pointer-based data structure like an array or object.  Unlike Notifiers, which must make a copy on reading.  

 

So if you put a large array in a queue, and dequeue it in another loop, no data copy is ever made?

Edited by CharlesB
Link to comment

Why don’t you just use two queues and three IMAQ refs?  Producer takes a ref from one queue, fills it, and puts on the other queue.  Consumer takes from that queue, reads it, and puts it back on the first queue.   Simple.  Why do you need some complex locking system?

Link to comment

 

So if you put a large array in a queue, and dequeue it in another loop, no data copy is ever made?

Yes and no. I've had to characterize this recently with a cluster of various arrays and I've found using the RT trace toolkit (9068 in LVRT2013) that:

-If you pass a dataset into a queue, the original (upstream) producer of that data set must generate a new buffer because the current buffer is handed off to the queue

-If that upstream producer is an empty queue, an empty buffer must be allocated. For some reason I don't understand this ends up with like 5 wait on memory flags in the trace.

-If that upstream producer is a full queue, no new buffer will be allocated

-If the buffer (pulled out of any of the queues) is fed to another API, like IMAQ, you'll end up losing that buffer and you'll need to allocate a new one, unless the API can avoid stomping on the data.

 

tl;dr: in normal operation you won't make a copy of the full dataset by using a queue until you pass that dataset to someone else. For determinism purposes dequeuing an empty queue will cause an allocation, which is why we have RT fifos. If you can avoid another API stomping on the data, you can pass a set of buffers through a loop of queues (A->B->C->A...) without allocations.

 

Obviously all of the above is for my use case only and you should always use a trace tool to confirm that your code behaves as expected if performance is critical to your application.

Link to comment

Threw together some quick code using an Express VI (GASP) to test the theory.  Haven't run or debugged other than with my webcam.

 

The Queue should prevent race conditions and provide a little margin on processing. Not sure how many data copies of the images will exist.  Don't have time to investigate.

 

Bean

BufferTest.llb

Link to comment

Why don’t you just use two queues and three IMAQ refs?  Producer takes a ref from one queue, fills it, and puts on the other queue.  Consumer takes from that queue, reads it, and puts it back on the first queue.   Simple.  Why do you need some complex locking system?

 

Complex locking is here to be sure that P won't overwrite a buffer locked by C in a race condition, because P won't be aware of it. Both events "C wait ends" and "C locking buffer" must be atomic, or you have this race condition.

 

With your solution, naming queues PQ and CQ.

If C is longer than P, PQ might be empty when P is ready to fill a buffer, and has to know which buffer is held by C, in order to not corrupt it. So P has to wait for C to complete operation and fill PQ, which is against purpose of triple buffering where P acquires at full frequency.

Edited by CharlesB
Link to comment

Why don’t you just use two queues and three IMAQ refs?  Producer takes a ref from one queue, fills it, and puts on the other queue.  Consumer takes from that queue, reads it, and puts it back on the first queue.   Simple.  Why do you need some complex locking system?

 

Complex locking is here to be sure that P won't overwrite a buffer locked by C in a race condition, because P won't be aware of it. Both events "C wait ends" and "C locking buffer" must be atomic, or you have this race condition.

 

With your solution, naming queues PQ and CQ.

If C is longer than P, PQ might be empty when P is ready to fill a buffer, and has to know which buffer is held by C, in order to not corrupt it. So P has to wait for C to complete operation and fill PQ, which is against purpose of triple buffering where P acquires at full frequency.

 

I think you are over thinking this. The inherent nature of a queue is your lock. Only place the IMAQ ref on the queue when the grab is complete and make the queue a maximum length of 3 (although why not make it more?). The producer will wait until a there is at least one space left  when it tries to place a 4th ref on the queue (because it is a fixed length queque). If you have multiple grabs that represent 1 consumer retrieval (3 grabs then the consumer takes all three), then just pass an array of IMAQ refs as the queue element. As 

Link to comment

I think you are over thinking this. The inherent nature of a queue is your lock. Only place the IMAQ ref on the queue when the grab is complete and make the queue a maximum length of 3 (although why not make it more?). The producer will wait until a there is at least one space left  when it tries to place a 4th ref on the queue (because it is a fixed length queque). If you have multiple grabs that represent 1 consumer retrieval (3 grabs then the consumer takes all three), then just pass an array of IMAQ refs as the queue element. As 

 

See my code

Link to comment

I think you are over thinking this. The inherent nature of a queue is your lock. Only place the IMAQ ref on the queue when the grab is complete and make the queue a maximum length of 3 (although why not make it more?). The producer will wait until a there is at least one space left  when it tries to place a 4th ref on the queue (because it is a fixed length queque). If you have multiple grabs that represent 1 consumer retrieval (3 grabs then the consumer takes all three), then just pass an array of IMAQ refs as the queue element. As 

 

I wish I was overthinking, but with a simple 3-elements queue, the producer has to wait when queue is full. This means lower performance when the consumer task is longer than producer task. In triple buffering the producer never waits, that's its strength. Take a look at http://en.wikipedia.org/wiki/Multiple_buffering#Triple_buffering if you're not convinced of the advantages.

 

Yes. You are right. I had forgotten about those. Whats the betting it's just a IMAQ refs queue ;)

 

I don't use IMAQ for acquisition, only for display (I have Matrox framegrabber, and make calls to their dlls to fill IMAQ references), so I don't have access to these VIs.

 

See my code

 

Same remark, producer is filling a queue, and will have to wait if it's full.

Link to comment

In triple buffering the producer never waits, that's its strength. 

Uh, it aint magic.   If you can consume 90 frames a second then you can’t produce 100 frames a second.   In that case, a single buffer would lead to something like 45 frames/sec, as the Consumer waits about half the time for the producer.  With two buffers the frame rate would still be less than 90/sec, as jitter in the Producer sometimes causes the Consumer to wait.  With three buffers the jitter doesn’t matter, and one gets 90 frames/sec.  But you don’t get 100.

BTW, you should check to see if your Matrox frame grabber isn’t already buffering frames asynchronously, rendering additional buffering pointless.

Link to comment

Uh, it aint magic.   If you can consume 90 frames a second then you can’t produce 100 frames a second.   In that case, a single buffer would lead to something like 45 frames/sec, as the Consumer waits about half the time for the producer.  With two buffers the frame rate would still be less than 90/sec, as jitter in the Producer sometimes causes the Consumer to wait.  With three buffers the jitter doesn’t matter, and one gets 90 frames/sec.  But you don’t get 100.

BTW, you should check to see if your Matrox frame grabber isn’t already buffering frames asynchronously, rendering additional buffering pointless.

 

But in my case a slower consumer can drop frames (maybe I should have pointed that before), as it's just for display. So if producer is faster it must not be slowed down waiting for consumer to process every frame. Does my problem make more sense now?

Link to comment

Can you share your code? Are you doing any image processing?  Can you drop frames on the image processing?

It's a really large app, so I can't share the code, but producer actually does acquire and process, and consumer dose only display. Camera is awfully fast (2Mpixels at 400 FPS), and I'm doing very basic processing, so in the end image display is slower than acquisition, which is why I want to drop frames, without slowing acquisition down.

Link to comment

But in my case a slower consumer can drop frames (maybe I should have pointed that before), as it's just for display. So if producer is faster it must not be slowed down waiting for consumer to process every frame. Does my problem make more sense now?

Again, check that your frame grabber isn’t already buffering, with a “Get latest frame†method.  That is what I would expect it to do.  And this would greatly simplify your program as you’ll only need one loop.

 

However, you can get “lossy behavior†by getting the Consumer to flush the queue and only process the last element, passing any other IMAQ refs returned back to the C—>P queue.  You might need more than 3 refs if your Consumer is very slow.

Link to comment

How fast can you run your acquire and processing loop if you do not display?

 

So passing the imaq reference to a display loop with a notifier doesn't work because you are afraid you may be overwriting the "displayed" image in the notifier loop? 

Link to comment

Again, check that your frame grabber isn’t already buffering, with a “Get latest frame†method.  That is what I would expect it to do.  And this would greatly simplify your program as you’ll only need one loop.

 

However, you can get “lossy behavior†by getting the Consumer to flush the queue and only process the last element, passing any other IMAQ refs returned back to the C—>P queue.  You might need more than 3 refs if your Consumer is very slow.

 

Unfortunately I can't use any async framegrabber option, because I'm doing processing on sequence of images, which disables the possibility of a "get latest frame". I really need to implement triple-buffering by myself, or I'll have to slow down the producer with copying image to the display at each iteration.

 

It seems that LabVIEW doesn't have proper synchronization function to do this, but in C++11 it is really straightforward (condition variable are native), so I'll use a CFLN...

 

How fast can you run your acquire and processing loop if you do not display?

 

So passing the imaq reference to a display loop with a notifier doesn't work because you are afraid you may be overwriting the "displayed" image in the notifier loop? 

 

It runs at full speed if I disable display. And I already tested simple notifier, display gets corrupted by the acquisition

Link to comment

It seems that LabVIEW doesn't have proper synchronization function to do this, but in C++11 it is really straightforward (condition variable are native), so I'll use a CFLN...

 

 

It runs at full speed if I disable display. And I already tested simple notifier, display gets corrupted by the acquisition

 

Everything we said before and use a lossy queue.

 

You are over thinking it because LabVIEW is a "Dataflow paradigm". Synchronisation is inherent!. Double/triple buffering is a solution t get around a language problem with synchronising asynchronous processes that labview just doesn't have. If you do emulate a triple buffer, it will be slower than the solutions we are advocating because they use dataflow synchronisation and all the critical sections, semaphores and other techniques required in other languages are not needed.

 

C++ triple buffering is not the droid you are looking for.

Link to comment

However, you can get “lossy behavior†by getting the Consumer to flush the queue and only process the last element, passing any other IMAQ refs returned back to the C—>P queue.  You might need more than 3 refs if your Consumer is very slow.

 

Forgot to answer on this: if doing this, and the consumer is really slow, you have to adjust the number of buffers, depending on the "slowness" of the consumer, in order to prevent corruption. So for me it has the flow of "too bad, it worked on my setup". More, there is still no way for the producer to know which buffer is locked without the race condition problem.

 

Everything we said before and use a lossy queue.

 

Sorry to say, but lossy queue on a shared buffer doesn't solve data corruption, as I said before.

 

You are over thinking it because LabVIEW is a "Dataflow paradigm". Synchronisation is inherent!. Double/triple buffering is a solution t get around a language problem with synchronising asynchronous processes that labview just doesn't have. If you do emulate a triple buffer, it will be slower than the solutions we are advocating because they use dataflow synchronisation and all the critical sections, semaphores and other techniques required in other languages are not needed.

 

C++ triple buffering is not the droid you are looking for.

I think I fully understand the advantages of dataflow programming, My program has more than 1500 VIs, uses the actor model for communication, and doesn't have any semaphore or complex synchronization.

 

But here we have large buffers, in a pointer-style programming, so dataflow isn't really applicable. I mean, dataflow with pointers isn't really dataflow, since they don't carry the data. These buffers are shared between two threads, and LabVIEW doesn't protect you if you don't have proper synchronization. Sorry to insist, but triple buffer is just a simple sync pattern that fits to my problem if I want the best performance, it is widely used display-related programming.

Edited by CharlesB
Link to comment

The IMAQ wire is a reference to an image buffer.  The data they contain isn't copied unless explicitly copied via IMAQ Copy function or IMAQ Get Bytes function or you wire up the "Dest" terminal on the IMAQ functions. 

 

Scenario 1: If the Consumer must process every item sent by the Producer the number of buffers must be a size to store the back log.  Whether this size is 2 or 3 or 907 is only relevant when deciding or being constrained by the platform it's running on.  

 

Scenario 2: If the Consumer is only required to process as many as it can and losses are acceptable, pick a number of buffers that satisfies the acceptable loss criteria.  Again the actual number of buffers is irrelevant.

 

Two Queues.  Q1 is Producer to Consumer.  Q2 is Consumer to Producer.  Queue element is same in both and is whatever you want as long as it has an IMAQ reference (buffer) in it.

 

During initialization Q2 is loaded with as many unique buffers as deemed necessary by the Scenario.

 

Scenario 1 Implementation:

Producer Dequeues a buffer from Q2.  Fills it.  Enqueues it to Q1.

Consumer Dequeues from Q1.  Processes it.  Enqueues it to Q2.

 

Since Producer is pulling from Q2, there is no chance it will ever overwrite an unprocessed buffer.

 

Q1 being empty is not a problem.  Means consumer is faster than Producer.

Q2 being empty is an error condition that must be addressed at the system level.

 

Scenario 2 Implementation:

Producer Dequeues a buffer from Q2.  Fills it.  Enqueues it to Q1.

Consumer Dequeues from Q1.  Processes it.  Enqueues it to Q2.

 

Since Producer is pulling from Q2, there is no chance it will ever overwrite an unprocessed buffer.

 

Q1 being empty is not a problem.  Means consumer is faster than Producer.

If Q2 is empty, Consumer is backlogged and a loss must occur.  Producer Dequeues from Q1.

 

Since an element can only be Dequeued once, there is no chance the Consumer is processing that buffer and it is safe to overwrite.

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Unfortunately, your content contains terms that we do not allow. Please edit your content to remove the highlighted words below.
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.