Data transfer strategies for FPGA-RT-Host

Max Joseph · July 6, 2017

Hi all,

I have a question about high-level system design with FPGA-RT-PC. It would be great if I could get some advice about ideal approaches to moving data between the three components efficiently. There are several steps: a DMA FIFO from FPGA to RT, processing the data stream on the RT to derive chunks of useful information, parsing these chunks into complete sets on the RT, and sending these sets up to the Host.

In my system, I have the FPGA monitoring a channel of a digitiser and deriving several data streams from events that occur (wave, filtered data, parameters etc). When an event occurs the data streams are sent to the RT through a DMA FIFO in U64 chunks. Importantly, events can be variable length. To overcome this, I reunite the data by inserting unique identifiers and special characters (sets of 0's) into the data streams which I later search for on the RT.

Because the FPGA is so fast, I might fill the DMA FIFO buffer rapidly, so I want to poll the FIFO frequently and deterministically. I use a timed loop on the RT to poll the FIFO and dump the data as U64's straight into a FIFO on the RT. The RT FIFO is much larger than the DMA FIFO, so I don't need to poll it as regularly before it fills.

The RT FIFO is polled and parsed by a parallel loop on the RT that empties the RT FIFO and dumps it into a variable-sized array. The parsing of the array then happens by looking for special characters element-wise. A list of special character indices is then passed to a loop which chops out the relevant chunks and, using the UID therein, writes them to a TDMS file.

Another parallel loop then looks at the TDMS group names and when an event has an item relating to each of the data streams (i.e. all the data for the event has been received), a cluster is made for the event and it is sent to the host over a network stream. This UID is then marked as completed.

The aim of the system is to be fast enough that I do not fill any data buffers. This means I need to carefully avoid bottlenecks. But I worry that the parsing step, with a dynamically assigned memory operation on a potentially large memory object and an element-wise search and delete operation (another dynamic memory operation), may become slow. I can't think of a better way to arrange my system or handle the data. Does anyone have any ideas?

PS I would really like to send the data streams to the RT in a unified manner straight from the FPGA, by creating a DMA FIFO with a custom data type. But this is not possible for DMA FIFOs, even though it is for target-scoped FIFOs!

Many thanks,

Max

Tim_S · July 6, 2017

Ideal approach really depends on your application. Your description sounds like you're streaming large amounts of data from FPGA->RT->PC. To that, I have an application that is similar in that the cRIO is a glorified DAQ card (at least initially). The design was able to get 9 channels at 1 MS/sec and 9 channels at 1 kS/sec simultaneously. I did an overnight test checking all 18 channels arrived at the PC without loss, and the design has been in use (at much lower and less intensive transfer rates) for over a year, so the design should be solid though I expect there is a lot of room for improvement.

FPGA side...

The FPGA folds, spindles and mutilates the signals to where each is a U32 value. The top byte contains a 'channel number' and the bottom 3 bytes contain the data. This gives 256 possible channels and up to 24 bits resolution. This gets transferred across a DMA to the RT. The commands to the FPGA (e.g., sample rate) are done by reading/writing the control from the RT side. This is doubled so I can have the two sample rates on the same channels.

RT side...

The RT side grabs whatever data is in the DMAs (which are as big as I could make them) and puts it in a circular buffer. The circular buffer gets sent to a network stream to the PC at regular intervals.

PC side...

The PC side connects to the network stream using the primitives. The channel identifier (top byte) is used to sort out the data into an array of waveforms. If I recall right, performance blips up to about 5% CPU usage on my test machine (2.6 GHz i7).
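
For readers who want to see the tagging scheme in concrete form, here is a minimal sketch in Python (not LabVIEW, and not the code from the application described above). It assumes unsigned 24-bit data; the function names `pack_sample` and `demux` are made up for illustration.

```python
from collections import defaultdict


def pack_sample(channel: int, value: int) -> int:
    """Pack an 8-bit channel number (top byte) and 24-bit data into one U32."""
    if not 0 <= channel <= 0xFF:
        raise ValueError("channel must fit in 8 bits")
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError("value must fit in 24 bits (unsigned assumed)")
    return (channel << 24) | value


def demux(words):
    """Sort a mixed stream of U32 words into per-channel sample lists."""
    channels = defaultdict(list)
    for word in words:
        channels[(word >> 24) & 0xFF].append(word & 0xFFFFFF)
    return dict(channels)


# Interleaved samples from channels 1 and 2, as they might arrive at the PC.
stream = [pack_sample(1, 16), pack_sample(2, 32), pack_sample(1, 17)]
print(demux(stream))  # {1: [16, 17], 2: [32]}
```

Because every word carries its own channel tag, the receiver does not need to know the order in which channels were written to the stream.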

smithd · July 6, 2017

8 hours ago, Max Joseph said:

In my system, I have the FPGA monitoring a channel of a digitiser and deriving several data streams from events that occur (wave, filtered data, parameters etc). When an event occurs the data streams are sent to the RT through a DMA FIFO in U64 chunks. Importantly, events can be variable length. To overcome this, I reunite the data by inserting unique identifiers and special characters (sets of 0's) into the data streams which I later search for on the RT.

Because the FPGA is so fast, I might fill the DMA FIFO buffer rapidly, so I want to poll the FIFO frequently and deterministically. I use a timed loop on the RT to poll the FIFO and dump the data as U64's straight into a FIFO on the RT. The RT FIFO is much larger than the DMA FIFO, so I don't need to poll it as regularly before it fills.

The RT FIFO is polled and parsed by a parallel loop on the RT that empties the RT FIFO and dumps it into a variable-sized array. The parsing of the array then happens by looking for special characters element-wise. A list of special character indices is then passed to a loop which chops out the relevant chunks and, using the UID therein, writes them to a TDMS file.

If you haven't seen it, this is a slightly old but still useful resource: http://www.ni.com/compactriodevguide/

For example it would inform you that there is no reason to copy data from the DMA directly into another fifo, because the real-time side of the DMA can be as large as you need it to be.

As for your transfer mechanism, prefixing your data with its length rather than null terminating is the better plan, as C has been teaching us over and over for 40 years. In either case, if you lose any data all future DMA data is invalidated, so you have to set it up so that the FPGA never loses a packet of data. You can do this by previewing the DMA FIFO size to make sure it's big enough or using a -1 timeout. The length prefix will also give you nominally lower latency, as you know exactly how much data you're looking for.
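
A rough sketch of what length-prefixed parsing could look like on the receiving side, written in Python for illustration only. The packet layout (one header word holding the payload length, followed by that many payload words) and the name `split_length_prefixed` are assumptions, not something defined earlier in the thread.

```python
def split_length_prefixed(words):
    """Split a flat list of words into complete packets.

    Assumed layout (hypothetical): one header word holding the payload
    length N, followed by N payload words.
    Returns (complete_packets, leftover_words). The leftover is the start
    of an incomplete packet, to be prepended to the next read.
    """
    packets = []
    i = 0
    while i < len(words):
        end = i + 1 + words[i]
        if end > len(words):
            break  # the rest of this packet has not arrived yet
        packets.append(words[i + 1:end])
        i = end
    return packets, words[i:]


packets, rest = split_length_prefixed([2, 10, 11, 3, 20, 21, 22, 4, 30])
print(packets)  # [[10, 11], [20, 21, 22]]
print(rest)     # [4, 30]
```

No element-wise search for terminators is needed: the parser jumps from header to header, which is also why a single lost word would corrupt everything after it.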

ned · July 6, 2017

First, one terminology note: NI uses "host" to mean the system that hosts the FPGA, in this case your RT system. RT and Host are synonymous here.

7 hours ago, Max Joseph said:

Because the FPGA is so fast, I might fill the DMA FIFO buffer rapidly, so I want to poll the FIFO frequently and deterministically. I use a timed loop on the RT to poll the FIFO and dump the data as U64's straight into a FIFO on the RT. The RT FIFO is much larger than the DMA FIFO, so I don't need to poll it as regularly before it fills.

It sounds like you have a misunderstanding about the DMA FIFO. The FIFO has two buffers: one on the host (RT) and the other on the FPGA. The host-side buffer can be many times larger than the buffer on the FPGA. The DMA logic automatically transfers data from the FPGA buffer to the host buffer whenever the FPGA buffer fills, or at regular intervals. If you're finding that you're filling the DMA FIFO, make the host-side buffer larger (you can do this through an FPGA method on the RT side) and read larger chunks. I would take out the RT FIFO entirely here and read directly from the DMA FIFO, although probably in a normal loop rather than a timed one since it sounds like your timing is event-based.

I don't fully understand your parsing scheme with the special characters; if you can share some code, it might be possible to provide specific suggestions. Have you considered using multiple DMA FIFOs to separate out different streams, or do you need them all combined into a single channel?

Max Joseph · July 21, 2017

Hi all!

Many thanks for the replies! I have been super busy the last couple of weeks so I am only getting back now. For some reason the reply notification didn't seem to work, maybe it went to my spam!

I have also since been on the NI high-throughput FPGA course, which was quite useful. I did not really understand that the DMA FIFO buffer is different sizes on the FPGA and host; I was confused by the single buffer size on the General page of the DMA FIFO properties page! I will try using very large buffer sizes on the host and remove the RT FIFO polling step. Thinking about it though, how does the FPGA side buffer become full? When the DMA engine is not able to read the buffer as fast as data is being put into it. How fast is the DMA engine? Does it suffer from jitter? Is there an optimum of buffer size ratios here?

My sample rate of raw data is 16-bit at 250 MHz across up to four channels, so my data rate can get quite large when there are a lot of events occurring in my input. Although, in reality, events are relatively sparse so my total real data rate is likely to be a couple of orders of magnitude smaller than this theoretical maximum!
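
As a quick sanity check on those numbers, the arithmetic can be written out as below (Python used purely as a calculator). The factor of 100 is just one reading of "a couple of orders of magnitude".

```python
BYTES_PER_SAMPLE = 2     # 16-bit samples
SAMPLE_RATE_HZ = 250e6   # 250 MHz
CHANNELS = 4

# Theoretical maximum if every channel streamed raw data continuously.
peak_rate = BYTES_PER_SAMPLE * SAMPLE_RATE_HZ * CHANNELS
print(f"peak: {peak_rate / 1e9:.1f} GB/s")  # peak: 2.0 GB/s

# Sparse events: assume two orders of magnitude below the peak.
sparse_rate = peak_rate / 100
print(f"sparse: {sparse_rate / 1e6:.0f} MB/s")  # sparse: 20 MB/s
```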

I am now thinking about how to manage data through the RT. I want to parse the data streams into groups relating to individual events on the RT and then pass these groups up to the PC periodically for plotting. I have previously dumped the data from each stream into a TDMS file in whatever order it came in. I need to determine if a group is complete and then send it to the host. The process of checking if an event group is complete and then sending it up became very slow quickly; when # events > 10 k it took 1 s to check the set and send the complete events over to the host. Since I want to be able to handle and stream at least 50k events per second (corresponding to 10's of MB/s) I thought that this performance was insufficient.

So, I thought about making an individual TDMS file for each event group, which is then checked for completeness whenever a data stream puts a new bit of data in. If the event group is complete it is packaged and sent to the PC and then deleted from the RT. This approach makes it easier to check and send new event data but leads to lots of TDMS open/close actions and a proliferation of small files which seems to get slow too.

Does anyone have any ideas about this aspect specifically?

ned · July 21, 2017

2 hours ago, Max Joseph said:

Thinking about it though, how does the FPGA side buffer become full? When the DMA engine is not able to read the buffer as fast as data is being put into it. How fast is the DMA engine? Does it suffer from jitter? Is there an optimum of buffer size ratios here?

The most common way that the FPGA buffer fills is when the host side either hasn't been started or hasn't been read, but at very high data rates a buffer that's too small could do it too. The DMA engine is fast, but I can't quantify that. Transfer rates vary depending on the hardware and possibly also on the direction of transfer (on older boards, transfers from the FPGA to the host were much faster than the other direction; that may no longer be true with newer hardware). Is there jitter? You could measure it: make a simple FPGA VI that fills a DMA buffer continuously at a fixed rate. On the RT side, read, say, 6000 elements at a time in a loop, and measure how long each loop iteration takes and how much variation there is. As for buffer sizing, there's no optimal ratio; it depends on details of your application and what else needs memory in your FPGA design.
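
The measurement logic on the reading side might look something like the following Python sketch. It only illustrates the bookkeeping: `read_block` is a placeholder for whatever performs the blocking read of a fixed number of elements, and a short sleep stands in for the real DMA FIFO read so the example runs anywhere.

```python
import statistics
import time


def measure_loop_jitter(read_block, iterations=1000):
    """Time each call to read_block() and summarise the variation."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        read_block()
        durations.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(durations),
        "stdev_s": statistics.stdev(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


# Stand-in for a real FIFO read (assumption: roughly 1 ms per block).
stats = measure_loop_jitter(lambda: time.sleep(0.001), iterations=50)
print(stats)
```

The spread between `min_s` and `max_s`, together with the standard deviation, gives a first impression of how much the read time varies from one iteration to the next.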

2 hours ago, Max Joseph said:

So, I thought about making an individual TDMS file for each event group, which is then checked for completeness whenever a data stream puts a new bit of data in. If the event group is complete it is packaged and sent to the PC and then deleted from the RT. This approach makes it easier to check and send new event data but leads to lots of TDMS open/close actions and a proliferation of small files which seems to get slow too.

Does anyone have any ideas about this aspect specifically?

I'm having trouble following your description of events and what data is transferred, but if you can provide a clearer explanation or some sample code I'll try to provide suggestions. Why are you using TDMS files to transfer data? Using a file transfer for raw data is probably not the most efficient approach. As for the DMA transfer, inserting magic values in the middle of the stream and then scanning the entire stream for them doesn't sound too efficient; there's probably a better approach there. Perhaps it's possible to use a second FIFO that simply tells the RT how many elements to read from the main data FIFO? That is, you write however many points are necessary for one event into the main data FIFO, and when that's complete, you write the element count to another FIFO. Then, on the host, you read the second FIFO and then read that number of elements from the first FIFO. That may not match what you're doing at all - as I said, I don't quite understand your data transfers - but maybe it helps think about alternate approaches.
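
To make the two-FIFO idea concrete, here is a small Python model of it. The two `deque` objects merely stand in for the DMA FIFOs, and the function names are hypothetical; the point is only the ordering of writes and reads.

```python
from collections import deque

data_fifo = deque()   # stands in for the main data FIFO
count_fifo = deque()  # stands in for the second FIFO carrying element counts


def fpga_write_event(samples):
    """Producer side: write the event data first, then its element count."""
    data_fifo.extend(samples)
    count_fifo.append(len(samples))


def host_read_event():
    """Consumer side: read a count, then exactly that many data elements."""
    if not count_fifo:
        return None  # no complete event available yet
    n = count_fifo.popleft()
    return [data_fifo.popleft() for _ in range(n)]


fpga_write_event([1, 2, 3])
fpga_write_event([4, 5])
print(host_read_event())  # [1, 2, 3]
print(host_read_event())  # [4, 5]
print(host_read_event())  # None
```

Writing the count only after the data means that any count the host sees refers to data already in the main FIFO, so variable-length events come out whole without scanning for markers.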

smithd · July 21, 2017

It's possible to overflow the buffer on the FPGA just from FPGA use, but it's hard. The Zynq chips can DMA something like 300 MB/s, so I could imagine overflowing that only if you have images you are processing. The older targets have a PCI bus which I think should in theory support something like 80 MB/s of data, but I vaguely remember getting more like 40. The highest speed analog module (except the new store and forward scope) generates 2 MB/s/ch, so a full chassis would be 8*4*2 = 64 MB/s. So basically if you have old hardware and abuse it, you can hit the limit.

ShaunR · July 21, 2017

1 hour ago, smithd said:

PCI bus which I think should in theory support something like 80 MB/s

PCI = 33MHz x 4 bytes = 132 MB/s

Edited July 21, 2017 by ShaunR

smithd · July 21, 2017

Oh, I meant "in theory, but really", not the theoretical max. There's the physical bus max, then there's what is achievable for most people, then there is what LabVIEW FPGA achieves.

Edited July 22, 2017 by smithd
