Neil Pate Posted July 25, 2020 (edited)

Can anyone shed some light for me on the best practices for the FIFO Acquire Read Region technique? I have never used this before; I have always just done the usual trick of reading zero elements to get the size of the buffer and then reading if there are enough elements for my liking. To my knowledge this was a good technique and I have used it quite a few times with no actual worries (including in some VST code with a ridiculous data rate).

This screenshot is taken from here. Is this code really more efficient? Does the Read Region block with high CPU usage like the Read FIFO method does? (I don't want that.) Has anyone used this "new" technique successfully?

For reference this is my current technique:

Edited July 25, 2020 by Neil Pate
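Since the screenshot of this technique did not survive as text, here is a minimal sketch of the zero-element polling idea in Python. It is only an illustration under stated assumptions: `MockDmaFifo`, `push`, `read` and `poll_once` are hypothetical names invented for this sketch and are not part of any NI API; the mock merely imitates a read call that returns the data together with the number of elements remaining.

```python
from collections import deque


class MockDmaFifo:
    """Hypothetical stand-in for a host-side DMA FIFO; not an NI API."""

    def __init__(self):
        self._buffer = deque()

    def push(self, samples):
        # Simulates the FPGA side writing samples into the FIFO.
        self._buffer.extend(samples)

    def read(self, count):
        """Return (data, elements_remaining); count=0 only queries the fill level."""
        if count > len(self._buffer):
            # A real blocking read would wait (and possibly poll) here.
            raise TimeoutError("not enough elements in the FIFO")
        data = [self._buffer.popleft() for _ in range(count)]
        return data, len(self._buffer)


def poll_once(fifo, chunk_size):
    """Read one chunk only if it is already available, else return None."""
    _, available = fifo.read(0)  # zero-element read: ask how much is buffered
    if available < chunk_size:
        return None  # caller sleeps and retries, so the read itself never waits
    data, _ = fifo.read(chunk_size)
    return data
```

The point of the pattern is that the actual read is only issued once enough elements are known to be present, so it never has to wait inside the read call.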
ned Posted August 6, 2020

I really wanted to use that function a few years back, but it wasn't available on the cRIO I was using. In case it's helpful, here's the situation in which I'd hoped to use it:

We were using a cRIO to drive an inkjet print head. The host system downloaded a bitmap to the cRIO, which the cRIO then sent to the FPGA over a DMA FIFO. I used a huge host-side buffer, large enough to store the entire bitmap; the FPGA read data from that FIFO as needed. I benchmarked this and it required 3 copies of the entire bitmap, which could be several megabytes: one copy when initially downloaded; one copy for the conversion from string (from TCP Read) to numeric array (for the FIFO Write); and one copy in the FIFO buffer. These memory copies were one of the limiting factors in the speed of the overall system (the other was how fast we could load data to the print head).

If I had been able to use "Acquire Write Region" I could have saved one copy, because the typecast from string to numeric array could have written directly to the FIFO buffer. If there were some way to do the string to numeric array conversion in-place maybe I could have avoided that copy too.
Rolf Kalbermatter Posted August 26, 2020 (edited)

On 8/7/2020 at 12:52 AM, ned said:
I really wanted to use that function a few years back, but it wasn't available on the cRIO I was using. In case it's helpful, here's the situation in which I'd hoped to use it: We were using a cRIO to drive an inkjet print head. The host system downloaded a bitmap to the cRIO, which the cRIO then sent to the FPGA over a DMA FIFO. I used a huge host-side buffer, large enough to store the entire bitmap; the FPGA read data from that FIFO as needed. I benchmarked this and it required 3 copies of the entire bitmap, which could be several megabytes: one copy when initially downloaded; one copy for the conversion from string (from TCP Read) to numeric array (for the FIFO Write); and one copy in the FIFO buffer. These memory copies were one of the limiting factors in the speed of the overall system (the other was how fast we could load data to the print head). If I had been able to use "Acquire Write Region" I could have saved one copy, because the typecast from string to numeric array could have written directly to the FIFO buffer. If there were some way to do the string to numeric array conversion in-place maybe I could have avoided that copy too.

Actually there is another option to avoid the typecast copy. Typecast is in fact not just a type reinterpretation like in C; it also does byte swapping on all Little Endian platforms, which currently means all platforms except the old VxWorks-based cRIO targets, since those use a PowerPC CPU that by default operates in Big Endian mode (this CPU can support both Endian modes but is typically used in Big Endian mode). If all you need is a byte array then use the String to Byte Array node instead. This is more like a C typecast, as the data type, at least in Classic LabVIEW, doesn't change at all (somewhat sloppily stated: a string is simply a byte array with a different wire color 😀).
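To make the difference concrete, here is a small Python sketch (an illustration only, not LabVIEW code). It assumes, as described above, that Typecast always interprets the bytes as Big Endian, and it uses Python's `struct` module to contrast that with a plain reinterpretation on a Little Endian CPU. The variable names and the sample bytes are made up for the example.

```python
import struct

# 8 bytes as they might arrive from TCP Read (arbitrary example values).
raw = bytes([0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x01, 0x00])

# Big Endian interpretation: the values Typecast is described as producing.
# On a Little Endian CPU this means swapping the bytes of every element.
typecast_like = struct.unpack(">2I", raw)

# Plain reinterpretation on a Little Endian CPU, as a C pointer cast would give:
# no bytes are moved, but the resulting numbers are different.
reinterpreted = struct.unpack("<2I", raw)

# String to Byte Array analogue: the bytes are used exactly as they are.
as_bytes = list(raw)
```

Here `typecast_like` is `(1, 256)` while `reinterpreted` is `(16777216, 65536)`, which is why a swap-free cast is only usable when the consumer expects the data in the CPU's native byte order.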
If you need a typecast sort of thing because your numeric array is something other than a byte array, but don't want endianizing, you could, with a bit of low-level byte shuffling (preferably in C, but with enough persistence it could even be done on the LabVIEW diagram, although not 100% safely), write a small function that swaps out two handles with an additional correction of the numElm value in the array, and do this as a virtually zero-cost operation.

I'm not sure the Acquire Write Region would save you as much as you hope for this. The DVR returned still needs to copy your LabVIEW data array into the DMA buffer, and there is also some overhead from protecting the DVR access from the DMA routine which will attempt to read the data. Getting rid of the inherent copy in the Typecast function is probably more performant.

On 7/26/2020 at 12:00 AM, Neil Pate said:
Can anyone shed some light for me on the best practices for the FIFO Acquire Read Region technique? I have never used this before; I have always just done the usual trick of reading zero elements to get the size of the buffer and then reading if there are enough elements for my liking. To my knowledge this was a good technique and I have used it quite a few times with no actual worries (including in some VST code with a ridiculous data rate). This screenshot is taken from here. Is this code really more efficient? Does the Read Region block with high CPU usage like the Read FIFO method does? (I don't want that.) Has anyone used this "new" technique successfully?

Why would the Read FIFO method block with high CPU usage? I'm not sure what you refer to here. Sure, it needs to allocate an array of the requested size and then copy the data from the DMA buffer into this array, and that of course takes CPU, but if you don't request more data than there currently is in the DMA buffer it does not "block"; it simply has to do some considerable work.
Depending on what you are then doing with the data, you do not save anything by using the Acquire Region variant. This variant is only useful if you can do all of the operations on the data inside the IPE in which you access the actual data. If you only use the IPE to read the data and then pass it outside of the IPE as a normal LabVIEW array, there is absolutely nothing to be gained by using the Acquire Read Region variant. In the case of the Read FIFO, the array is generated (and copied into) in the Read FIFO node; in the Acquire Read Region version it is generated (and copied into) as soon as the wire crosses the IPE border. It's pretty much the same effort and there is really nothing LabVIEW could do to avoid that. The DVR data is only accessible without creating a full data copy inside the IPE.

I recently did a project where I used an Acquire Read Region but found that it had no real advantage over the normal FIFO Read, since all I did with the data was in fact to pass it on to a TCP Write. As soon as the data needs to be sent to TCP Write, the data buffer has to be allocated anyhow as a real LabVIEW handle, and then it doesn't really matter if that happens inside the FIFO Read or inside the IPE accessing the DVR from the FIFO Region.

My loop timing was heavily dominated by the TCP Write anyhow. As long as I only read the data from the FIFO, my loop could run consistently at 10.7MB/s with a steady 50 ms interval with very little jitter. As soon as I added the TCP Write, the loop timing jumped to 150 ms and steadily increased until the FIFO was overflowing. My tests showed that I could go up to 8MB/s with a loop interval of around 150 ms ± 50 ms jitter without the loop starting to run off. This was also caused by the fact that the ethernet port was really only operating at 100Mb/s, due to the switch I was connected to not supporting 1Gb/s. The maximum theoretical throughput at 100Mb/s is only 12.5MB/s and the realistic throughput is usually around 60% of that.
But even with a 1Gb/s switch the overhead of TCP Write was dominating the loop by far, making other differences, including the use of an optimized Typecast without any Endian normalization compared to the normal LabVIEW Typecast with Endian normalization, fall into unmeasurable noise.

And it's nitpicking really, and likely only costs a few ns of extra execution time, but the calculation of the number of scans inside the loop, used to resize the array to number of scans by number of channels, should all be done in integer space anyhow, using Quotient & Remainder. There is not much use in working with Double Precision values for something that inherently should be integer numbers. There is even a potential for a wrong number of scans in the 2D array, since the To I32 conversion does standard rounding, so it could end up one more than there are full scans in the read data.

Edited August 26, 2020 by Rolf Kalbermatter
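The rounding pitfall mentioned in that last paragraph can be shown with a short Python sketch. This is an illustration only; the function names are made up, and `divmod` simply plays the role that Quotient & Remainder plays on a LabVIEW diagram.

```python
def split_scans(num_elements, num_channels):
    """Return (full_scans, leftover_elements) using integer arithmetic only."""
    return divmod(num_elements, num_channels)


def scans_via_double(num_elements, num_channels):
    """Double-precision division followed by round-to-nearest, for contrast."""
    return round(num_elements / num_channels)
```

For example, with 1000 elements and 64 channels, `split_scans(1000, 64)` gives 15 full scans with 40 elements left over, whereas `scans_via_double(1000, 64)` rounds 15.625 up to 16, one scan more than the data actually contains.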
Neil Pate Posted August 26, 2020 Author (edited)

@Rolf Kalbermatter it was a few years ago now, but if I recall correctly it was a known issue that requesting a fixed number of elements from a DMA buffer caused the CPU to poll unnecessarily fast while it was waiting for those elements to arrive. I will see if I can find the KB.

https://knowledge.ni.com/KnowledgeArticleDetails?id=kA00Z000000P9SASA0&l=en-US

That is specifically for RT but I have definitely seen this on Windows and FPGA also.

Edited August 26, 2020 by Neil Pate
Rolf Kalbermatter Posted August 26, 2020 (edited)

46 minutes ago, Neil Pate said:
@Rolf Kalbermatter it was a few years ago now, but if I recall correctly it was a known issue that requesting a fixed number of elements from a DMA buffer caused the CPU to poll unnecessarily fast while it was waiting for those elements to arrive. I will see if I can find the KB. https://knowledge.ni.com/KnowledgeArticleDetails?id=kA00Z000000P9SASA0&l=en-US That is specifically for RT but I have definitely seen this on Windows and FPGA also.

Ahhh, I see, blocking when you request more data than is currently available. Well, I would in fact not expect the Acquire Read Region to perform much differently in that respect.

I solved this in my last project a little differently though. Rather than calling FIFO Read with 0 samples to read, I used the <remaining samples> from the previous loop iteration to estimate the number of samples to request, using a formula similar to this: <previous remaining samples> + (<current sample rate> * <measured loop interval>). It works flawlessly and saves a call to Read FIFO with 0 samples to read (which I do not expect to take any measurable execution time, but still). I need to do this since the sampling rate is in fact externally determined through a quadrature encoder, so it can change dynamically over a pretty large range.

But unless you can do all data-intensive work inside the IPE, as in the example you show, the Acquire FIFO Read Region offers no advantage in terms of execution speed over a normal FIFO Read.

Edited August 26, 2020 by Rolf Kalbermatter
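The estimation formula from the post above can be written down as a short Python sketch. It is only an illustration: the function and parameter names are invented here, and rounding the expected new samples down is an assumption of this sketch (chosen so that the request errs on the small side), not something stated in the post.

```python
def samples_to_request(previous_remaining, sample_rate_hz, loop_interval_s):
    """Estimate how many samples should be in the FIFO for this iteration.

    previous_remaining: elements-remaining value returned by the last read
    sample_rate_hz:     current (possibly varying) sample rate
    loop_interval_s:    measured duration of the previous loop iteration
    """
    # Samples expected to have arrived since the last read, rounded down
    # (assumption of this sketch) so the request is not larger than estimated.
    expected_new = int(sample_rate_hz * loop_interval_s)
    return previous_remaining + expected_new
```

A caller would feed the elements-remaining output of each read back in as `previous_remaining` on the next iteration, together with the freshly measured loop interval.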
ned Posted August 27, 2020

19 hours ago, Rolf Kalbermatter said:
Actually there is another option to avoid the typecast copy. Typecast is in fact not just a type reinterpretation like in C; it also does byte swapping on all Little Endian platforms, which currently means all platforms except the old VxWorks-based cRIO targets, since those use a PowerPC CPU that by default operates in Big Endian mode (this CPU can support both Endian modes but is typically used in Big Endian mode). If all you need is a byte array then use the String to Byte Array node instead. This is more like a C typecast, as the data type, at least in Classic LabVIEW, doesn't change at all (somewhat sloppily stated: a string is simply a byte array with a different wire color 😀). If you need a typecast sort of thing because your numeric array is something other than a byte array, but don't want endianizing, you could, with a bit of low-level byte shuffling (preferably in C, but with enough persistence it could even be done on the LabVIEW diagram, although not 100% safely), write a small function that swaps out two handles with an additional correction of the numElm value in the array, and do this as a virtually zero-cost operation. I'm not sure the Acquire Write Region would save you as much as you hope for this. The DVR returned still needs to copy your LabVIEW data array into the DMA buffer, and there is also some overhead from protecting the DVR access from the DMA routine which will attempt to read the data. Getting rid of the inherent copy in the Typecast function is probably more performant.

Thanks for the notes! String to byte array wasn't an option because I needed to use a 32-bit wide FIFO to get sufficiently fast transfers (my testing indicated that DMA transfers were roughly constant in elements/second regardless of the element size, so using a byte array would have cut throughput by 75%).
I posted about this at the time https://forums.ni.com/t5/LabVIEW/optimize-transfer-from-TCP-Read-to-DMA-Write-on-sbRIO/td-p/2622479 but 7 years (and 3 job transfers) later I'm no longer in a position to experiment with it.

I like the idea of implementing type cast without a copy as a learning experience; I think the C version would be straightforward and pure LabVIEW (with calls to memory manager functions) would be an interesting challenge.
Rolf Kalbermatter Posted August 27, 2020 (edited)

5 hours ago, ned said:
Thanks for the notes! String to byte array wasn't an option because I needed to use a 32-bit wide FIFO to get sufficiently fast transfers (my testing indicated that DMA transfers were roughly constant in elements/second regardless of the element size, so using a byte array would have cut throughput by 75%). I posted about this at the time https://forums.ni.com/t5/LabVIEW/optimize-transfer-from-TCP-Read-to-DMA-Write-on-sbRIO/td-p/2622479 but 7 years (and 3 job transfers) later I'm no longer in a position to experiment with it. I like the idea of implementing type cast without a copy as a learning experience; I think the C version would be straightforward and pure LabVIEW (with calls to memory manager functions) would be an interesting challenge.

I haven't benchmarked the FIFO transfer with respect to element size, but I know for a fact that the current FIFO DMA implementation from NI-RIO packs data to 64-bit boundaries. This made me change the previous implementation in my project from transferring 12:12 bit FXP signed integer data to 16-bit signed integers, since four 12-bit samples are internally transferred as 64 bits over DMA anyhow, just as four 16-bit samples are. (In fact I'm currently always packing two 16-bit integers into a 32-bit unsigned integer for the FIFO transfer, not because of performance but because of the implementation in the FPGA, which makes it faster to always grab two 16-bit memory locations at once and push them into the FIFO. Otherwise the memory read loop would take twice as much time, or require a higher loop speed, to be able to keep up with the data acquisition.) 64 12-bit ADC samples at 75 kHz add up to quite some data that needs to be pushed into the FIFO.
I might consider pushing this up to 64-bit FIFO elements just to see if it makes a performance difference, but the main problem I have is not the FIFO but rather getting the data pushed onto the TCP/IP network in the RT application. Calling libc:send() directly to push the data into the network socket stack, rather than going through TCP Write, seems to have more effect.

Edited August 27, 2020 by Rolf Kalbermatter
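The packing of two 16-bit samples into one 32-bit FIFO element described above could look roughly like this in Python. This is a sketch under assumptions: the function names are hypothetical, and placing the first sample in the upper 16 bits is a choice made here for illustration; the post does not say which order the FPGA code uses.

```python
def pack_pair(first, second):
    """Pack two signed 16-bit samples into one unsigned 32-bit element.

    Assumption of this sketch: `first` goes into the upper 16 bits.
    """
    return ((first & 0xFFFF) << 16) | (second & 0xFFFF)


def _to_signed16(value):
    # Reinterpret a 16-bit pattern (0..65535) as a signed two's-complement value.
    return value - 0x10000 if value & 0x8000 else value


def unpack_pair(element):
    """Recover the two signed 16-bit samples from a packed 32-bit element."""
    return _to_signed16((element >> 16) & 0xFFFF), _to_signed16(element & 0xFFFF)
```

On the host side, `unpack_pair(pack_pair(a, b))` returns `(a, b)` for any pair of values in the signed 16-bit range, so the packing halves the number of FIFO elements without losing information.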
Neil Pate Posted August 27, 2020 Author

For what it is worth, the performance of regular DMA FIFOs is quite impressive. Recently I worked with a VST and had multiple channels at 120 MHz data rate and I was able to read these from the FPGA, do some processing and stream continuously to a RAID array at the full rate.