

Hi all,

I have a question about high-level system design with FPGA-RT-PC. It would be great if I could get some advice about ideal approaches to move data between the three components in an efficient manner. There are several steps: DMA FIFO from FPGA to RT, processing the data stream on the RT to derive chunks of useful information, parsing these chunks into complete sets on the RT, and sending these sets up to the Host.

In my system, I have the FPGA monitoring a channel of a digitiser and deriving several data streams from events that occur (wave, filtered data, parameters, etc.). When an event occurs, the data streams are sent to the RT through a DMA FIFO in U64 chunks. Importantly, events can be variable length. To handle this, I reunite the data by inserting unique identifiers (UIDs) and special characters (runs of 0s) into the data streams, which I later search for on the RT.

Because the FPGA is so fast, I might fill the DMA FIFO buffer rapidly, so I want to poll the FIFO frequently and deterministically. I use a timed loop on the RT to poll the FIFO and dump the data as U64s straight into a FIFO on the RT. The RT FIFO is much larger than the DMA FIFO, so I don't need to poll it as regularly before it fills.

The RT FIFO is polled and parsed by a parallel loop on the RT that empties the RT FIFO and dumps the data into a variable-sized array. The array is then parsed by searching element-wise for the special characters. The list of special-character indices is passed to a loop which chops out the relevant chunk and, using the UID therein, writes it to a TDMS file.
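To illustrate, here is a rough Python sketch of that parsing step. The framing is a simplified stand-in for my actual scheme (assumption: a single zero word acts as the delimiter, immediately followed by the UID word, then the payload):

```python
# Sketch of sentinel-based parsing: split a flat word stream into
# (uid, payload) chunks. Framing is a simplified assumption: one
# zero word delimits chunks; the first word after it is the UID.

SENTINEL = 0  # the "special character": a zero U64 word

def parse_chunks(words):
    """Split a flat list of words into (uid, payload) chunks."""
    chunks = []
    uid, payload = None, []
    for w in words:
        if w == SENTINEL:
            if uid is not None:          # close the current chunk
                chunks.append((uid, payload))
            uid, payload = None, []
        elif uid is None:
            uid = w                      # first word after sentinel = UID
        else:
            payload.append(w)
    if uid is not None:                  # flush the trailing chunk
        chunks.append((uid, payload))
    return chunks

stream = [0, 7, 10, 11, 12, 0, 8, 20, 21]
print(parse_chunks(stream))  # [(7, [10, 11, 12]), (8, [20, 21])]
```

Note the obvious weakness: a genuine zero inside a payload would be misread as a delimiter, so real data must never contain the sentinel value.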

Another parallel loop then looks at the TDMS group names, and when an event has an item relating to each of the data streams (i.e. all the data for the event has been received), a cluster is made for the event and it is sent to the host over a network stream. The UID is then marked as completed.

The aim of the system is to be fast enough that I do not fill any data buffers. This means I need to carefully avoid bottlenecks. But I worry that the parsing step, with a dynamically assigned memory operation on a potentially large memory object and an element-wise search-and-delete operation (another dynamic memory operation), may become slow. I can't think of a better way to arrange my system or handle the data, though. Does anyone have any ideas?

PS I would really like to send the data streams to the RT in a unified manner straight from the FPGA, by creating a DMA FIFO with a custom data type. But this is not possible for DMA FIFOs, even though it is for target-scoped FIFOs!

Many thanks,

Max


The ideal approach really depends on your application. Your description sounds like you're streaming large amounts of data from FPGA->RT->PC. To that end, I have a similar application in which the cRIO is a glorified DAQ card (at least initially). The design was able to get 9 channels at 1 MS/s and 9 channels at 1 kS/s simultaneously. I ran an overnight test checking that all 18 channels arrived at the PC without loss, and the design has been in use (at much lower and less intensive transfer rates) for over a year, so it should be solid, though I expect there is a lot of room for improvement.

FPGA side...

The FPGA folds, spindles, and mutilates the signals down to a U32 value each. The top byte contains a 'channel number' and the bottom 3 bytes contain the data. This gives 256 possible channels and up to 24 bits of resolution. This gets transferred across a DMA FIFO to the RT. The commands to the FPGA (e.g., sample rate) are done by reading/writing the controls from the RT side. This is doubled so I can have the two sample rates on the same channels.
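The packing scheme above can be sketched in a few lines of Python (a conceptual stand-in for the FPGA logic, not the actual LabVIEW code):

```python
# Sketch of the U32 packing described above: top byte = channel
# number (0-255), bottom 3 bytes = sample data (up to 24 bits).

def pack(channel, sample):
    """Combine a channel number and a 24-bit sample into one U32."""
    assert 0 <= channel < 256 and 0 <= sample < 2**24
    return (channel << 24) | sample

def unpack(word):
    """Recover (channel, sample) from a packed U32 word."""
    return word >> 24, word & 0xFFFFFF

w = pack(5, 0x01ABCD)
print(unpack(w))  # (5, 109517)
```

On the PC side, the top byte is all that's needed to demultiplex the single DMA stream back into per-channel waveforms.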

RT side...

The RT side grabs whatever data is in the DMAs (which are as big as I could make them) and puts it in a circular buffer. The circular buffer gets sent over a network stream to the PC at regular intervals.

PC side...

The PC side connects to the network stream using the primitives. The channel identifier (top byte) is used to sort the data into an array of waveforms. If I recall right, performance blips up to about 5% CPU usage on my test machine (2.6 GHz i7).

 

8 hours ago, Max Joseph said:

In my system, I have the FPGA monitoring a channel of a digitiser and deriving several data streams from events that occur (wave, filtered data, parameters etc). When an event occurs the data streams are sent to the RT through a DMA FIFO in U64 chunks. Importantly, events can be variable length.  To overcome this, I reunite the data by inserting unique identifiers and special characters (sets of 0's) into the data streams which I later search for on the RT.

Because the FPGA is so fast, I might fill the DMA FIFO buffer rapidly, so I want to poll the FIFO frequently and deterministically. I use a timed loop on the RT to poll the FIFO and dump the data as U64's straight into a FIFO on the RT. The RT FIFO is much larger than the DMA FIFO, so I don't need to poll it as regularly before it fills.

The RT FIFO is polled and parsed by parallel loop on the RT that empties the RT FIFO and dumps into a variable sized array. The parsing of the array then happens by looking for special characters element wise. A list of special character indices is then passed to a loop which chops out the relevant chunk and, using the UID therein, writes them to a TDMS file.

If you haven't seen it, this is a slightly old but still useful resource: http://www.ni.com/compactriodevguide/

For example, it would inform you that there is no reason to copy data from the DMA directly into another FIFO, because the real-time side of the DMA buffer can be as large as you need it to be.

As for your transfer mechanism, prefixing your data with a length rather than null-terminating it is the better plan, as C has been teaching us over and over for 40 years. In either case, if you lose any data, all future DMA data is invalidated, so you have to set it up so that the FPGA never loses a packet of data. You can do this by previewing the DMA FIFO size to make sure it's big enough, or by using a -1 timeout. The length prefix will also give you nominally lower latency, as you know exactly how much data you're looking for.
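A minimal sketch of length-prefixed framing, with a Python deque standing in for the DMA FIFO (the layout here - count word, then UID, then payload - is one reasonable assumption, not NI's or the original poster's exact format):

```python
# Sketch of length-prefixed framing: the reader knows exactly how
# many words to take, so no element-wise sentinel search is needed.
# Assumed layout per event: [count, uid, payload...].

from collections import deque

def write_event(fifo, uid, payload):
    fifo.append(len(payload))   # prefix: element count
    fifo.append(uid)
    fifo.extend(payload)

def read_event(fifo):
    n = fifo.popleft()          # read the count first...
    uid = fifo.popleft()
    return uid, [fifo.popleft() for _ in range(n)]  # ...then exactly n words

fifo = deque()
write_event(fifo, 42, [10, 20, 30])
write_event(fifo, 43, [99])
print(read_event(fifo))  # (42, [10, 20, 30])
print(read_event(fifo))  # (43, [99])
```

Unlike sentinel framing, zeros in the payload are harmless here, and the reader never scans - it just counts.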


First, one terminology note: NI uses "host" to mean the system that hosts the FPGA, in this case your RT system. RT and Host are synonymous here.

7 hours ago, Max Joseph said:

Because the FPGA is so fast, I might fill the DMA FIFO buffer rapidly, so I want to poll the FIFO frequently and deterministically. I use a timed loop on the RT to poll the FIFO and dump the data as U64's straight into a FIFO on the RT. The RT FIFO is much larger than the DMA FIFO, so I don't need to poll it as regularly before it fills.

It sounds like you have a misunderstanding about the DMA FIFO. The FIFO has two buffers: one on the host (RT) and the other on the FPGA. The host-side buffer can be many times larger than the buffer on the FPGA. The DMA logic automatically transfers data from the FPGA buffer to the host buffer whenever the FPGA buffer fills, or at regular intervals. If you're finding that you're filling the DMA FIFO, make the host-side buffer larger (you can do this through an FPGA method on the RT side) and read larger chunks. I would take out the RT FIFO entirely here and read directly from the DMA FIFO, although probably in a normal loop rather than a timed one, since it sounds like your timing is event-based.

I don't fully understand your parsing scheme with the special characters; if you can share some code, it might be possible to provide specific suggestions. Have you considered using multiple DMA FIFOs to separate out the different streams, or do you need them all combined into a single channel?

  • 2 weeks later...

Hi all!

Many thanks for the replies! I have been super busy the last couple of weeks, so I am only getting back now. For some reason the reply notification didn't seem to work; maybe it went to my spam folder!

I have also since been on the NI high-throughput FPGA course, which was quite useful. I did not realise that the DMA FIFO buffer has different sizes on the FPGA and the host; I was confused by the single buffer size on the General page of the DMA FIFO properties dialog! I will try using very large buffer sizes on the host and remove the RT FIFO polling step. Thinking about it though, how does the FPGA-side buffer become full? When the DMA engine is not able to read the buffer as fast as data is being put into it. How fast is the DMA engine? Does it suffer from jitter? Is there an optimum ratio of buffer sizes here?

My sample rate of raw data is 16-bit at 250 MHz across up to four channels, so my data rate can get quite large when there are a lot of events occurring in my input. Although, in reality, events are relatively sparse so my total real data rate is likely to be a couple of orders of magnitude smaller than this theoretical maximum!
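For concreteness, the theoretical maximum I'm quoting works out as follows (a trivial back-of-the-envelope calculation, using decimal GB):

```python
# Worked calculation of the theoretical maximum raw data rate:
# 16-bit samples at 250 MHz across up to four channels.

bytes_per_sample = 2          # 16-bit samples
sample_rate = 250e6           # 250 MHz digitiser clock
channels = 4                  # worst case: all four channels active

rate = bytes_per_sample * sample_rate * channels
print(rate / 1e9, "GB/s")     # 2.0 GB/s theoretical maximum
```

With sparse events, the sustained rate should be a couple of orders of magnitude below this, i.e. tens of MB/s.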

I am now thinking about how to manage data through the RT. I want to parse the data streams into groups relating to individual events on the RT and then pass these groups up to the PC periodically for plotting. I have previously dumped the data from each stream into a TDMS file in whatever order it came in. I need to determine whether a group is complete and then send it to the host. The process of checking whether an event group is complete and then sending it up became very slow very quickly; when the number of events exceeded 10 k, it took 1 s to check the set and send the complete events over to the host. Since I want to be able to handle and stream at least 50 k events per second (corresponding to tens of MB/s), this performance was insufficient.

So, I thought about making an individual TDMS file for each event group, which is then checked for completeness whenever a data stream puts a new bit of data in. If the event group is complete it is packaged and sent to the PC and then deleted from the RT. This approach makes it easier to check and send new event data but leads to lots of TDMS open/close actions and a proliferation of small files which seems to get slow too.
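One alternative I have been sketching (in Python, as a stand-in for the RT logic; the stream names are hypothetical placeholders) is to keep event groups in memory keyed by UID, and only touch the network or disk when an event completes:

```python
# Hedged alternative sketch: instead of one TDMS file per event,
# group chunks in an in-memory dict keyed by UID, and emit an event
# only once every stream has reported in. Stream names are
# hypothetical placeholders, not the actual application's names.

STREAMS = {"wave", "filtered", "parameters"}  # assumed set of streams

events = {}  # uid -> {stream_name: payload}

def add_chunk(uid, stream, payload):
    """Store one chunk; return the complete event if this finished it."""
    group = events.setdefault(uid, {})
    group[stream] = payload
    if set(group) == STREAMS:        # all streams present -> complete
        return events.pop(uid)       # free RT memory once it's sent
    return None

add_chunk(1, "wave", [1, 2])
add_chunk(1, "filtered", [3])
done = add_chunk(1, "parameters", [4])
print(done)  # {'wave': [1, 2], 'filtered': [3], 'parameters': [4]}
```

The completeness check is then a constant-time dict operation per chunk, rather than a scan over all open events or a file open/close per update.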

Does anyone have any ideas about this aspect specifically?

2 hours ago, Max Joseph said:

Thinking about it though, how does the FPGA side buffer become full? When the DMA engine is not able to read the buffer as fast as data is being put into it. How fast is the DMA engine? Does it suffer from jitter? Is there an optimum of buffer size ratios here?

The most common way that the FPGA buffer fills is when the host side either hasn't been started or hasn't been read, but at very high data rates a buffer that's too small could do it too. The DMA engine is fast, but I can't quantify that. Transfer rates vary depending on the hardware and possibly also on the direction of transfer (on older boards, transfers from the FPGA to the host were much faster than the other direction; that may no longer be true with newer hardware). Is there jitter? You could measure it... make a simple FPGA VI that fills a DMA buffer continuously at a fixed rate. On the RT side, read, say, 6000 elements at a time in a loop, and measure how long each loop iteration takes and how much variation there is. As for buffer sizing, there's no optimal ratio; it depends on the details of your application and what else needs memory in your FPGA design.
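The measurement idea above, sketched generically in Python (the real test would of course be a LabVIEW timed loop reading the DMA FIFO; the sleep here is just a stand-in for the read):

```python
# Sketch of the jitter measurement suggested above: time a
# fixed-size read repeatedly and look at the spread of durations.
import time

def measure_jitter(read_fn, iterations=100):
    """Time read_fn() repeatedly; return (min, max, mean) duration."""
    durations = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        read_fn()                       # stand-in for the DMA FIFO read
        durations.append(time.perf_counter() - t0)
    return min(durations), max(durations), sum(durations) / len(durations)

lo, hi, mean = measure_jitter(lambda: time.sleep(0.001))
print(hi - lo)  # the spread is the observed jitter
```

On a desktop OS the spread will be dominated by the scheduler; on the RT target it tells you how deterministic the read path actually is.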

2 hours ago, Max Joseph said:

So, I thought about making an individual TDMS file for each event group, which is then checked for completeness whenever a data stream puts a new bit of data in. If the event group is complete it is packaged and sent to the PC and then deleted from the RT. This approach makes it easier to check and send new event data but leads to lots of TDMS open/close actions and a proliferation of small files which seems to get slow too.

Does anyone have any ideas about this aspect specifically?

I'm having trouble following your description of events and what data is transferred, but if you can provide a clearer explanation or some sample code, I'll try to provide suggestions. Why are you using TDMS files to transfer data? Using a file transfer for raw data is probably not the most efficient approach. As for the DMA transfer, inserting magic values in the middle of the stream and then scanning the entire stream for them doesn't sound very efficient; there's probably a better approach. Perhaps it's possible to use a second FIFO that simply tells the RT how many elements to read from the main data FIFO? That is, you write however many points are necessary for one event into the main data FIFO, and when that's complete, you write the element count to the other FIFO. Then, on the host, you read the second FIFO and then read that number of elements from the first FIFO. That may not match what you're doing at all - as I said, I don't quite understand your data transfers - but maybe it helps you think about alternate approaches.
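The two-FIFO idea might look like this in outline (Python deques standing in for the two DMA FIFOs; purely a conceptual sketch, not LabVIEW API code):

```python
# Sketch of the two-FIFO idea described above: the data FIFO carries
# raw words; a second FIFO carries one element count per event, so
# the reader never scans for sentinel values.

from collections import deque

data_fifo, count_fifo = deque(), deque()

def fpga_write(payload):
    data_fifo.extend(payload)        # write the event's words...
    count_fifo.append(len(payload))  # ...then publish its length

def rt_read():
    n = count_fifo.popleft()         # learn how many words to take
    return [data_fifo.popleft() for _ in range(n)]

fpga_write([1, 2, 3])
fpga_write([7, 8])
print(rt_read())  # [1, 2, 3]
print(rt_read())  # [7, 8]
```

The key property: an element appearing on the count FIFO guarantees the corresponding words are already in the data FIFO, so the host-side read can block on the count FIFO alone.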


It's possible to overflow the buffer on the FPGA just from FPGA use, but it's hard. The Zynq chips can DMA something like 300 MB/s, so I could imagine overflowing that only if you have images you are processing. The older targets have a PCI bus which I think should in theory support something like 80 MB/s of data, but I vaguely remember getting more like 40. The highest-speed analog module (except the new store-and-forward scope) generates 2 MB/s per channel, so a full chassis would be 8 modules * 4 channels * 2 MB/s = 64 MB/s. So basically, if you have old hardware and abuse it, you can hit the limit.


Oh, I meant "in theory, but really", not the theoretical maximum. There's the physical bus maximum, then there's what is achievable for most people, then there's what LabVIEW FPGA achieves.

Edited by smithd

