JimPanse

FFT LabVIEW FPGA or CUDA


Hello experts,

I have a camera application (USB3 or Camera Link) and would like to do an FFT analysis of a line with 1024 or 2048 pixels. Data types are U8, U16, I16. The processing must be very fast because of the process.

The processing time should not exceed 4 µs. Processing on an FPGA would be conceivable, e.g.
 

Kintex-7 160T / IC-3120 (USB3)
Virtex-5 LX50 / PCIe 1473 (camera link)

 

or on the GPU using CUDA.
 

I have also heard that LabVIEW FPGA under 64-bit LabVIEW should be much faster.
 

Does anyone know whether a processing time of 4 µs is achievable with the described methods?
Or what would you prefer, and why?
Or is there another solution with LabVIEW?
 

Does CUDA run under LabVIEW Real-Time?


A good start to the week
 

Jim


I don't think 64-bit will be any faster than 32-bit for your application. From what I understand, going to 64-bit only helps if you need the extra memory (> 3 GB).

I suspect CUDA will be limited to Windows, not Real-Time (which is Linux). Also, real-time does not mean things actually happen any faster, just that determinism is better. Generally an RT CPU is also quite a bit slower than a desktop CPU.

Although I have not done it myself, I would expect an FPGA to process a 2048-point FFT in 4 µs quite easily, but getting the data to the FPGA in time could be difficult, especially if you have to wait for it to come in over USB and pass through the host.

Edited by Neil Pate


Yeah, that depends pretty wildly on where that 4 µs requirement comes from and what you want to do with the result. It seems like an oddly specific number.

In any case, with sufficient memory on the FPGA I believe you can do a line FFT in a single pass (although I can't remember for sure), but it's several operations, so you'd have to do a lot of parallelization, clock-rate fiddling, etc. -- at the default clock rate, 4 µs is just 160 cycles. Your best bet would probably be to look at the Xilinx core (http://zone.ni.com/reference/en-XX/help/371599N-01/lvfpgahelp/fpga_xilinxip_descriptions/) and see how much data you can feed it.
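As a sanity check on that "160 cycles" figure, here is the arithmetic, assuming the default 40 MHz top-level clock of LabVIEW FPGA targets (the 40 MHz value is my assumption, not stated in the post):

```python
# Cycle budget in a 4 us window at the default LabVIEW FPGA clock.
clock_hz = 40e6        # default FPGA top-level clock (assumption)
window_s = 4e-6        # the 4 us processing budget from the post
cycles = clock_hz * window_s
print(round(cycles))   # 160 clock cycles available in the window
```

Derived clocks of 80, 120, or 160 MHz would multiply that budget accordingly, which is why clock-rate fiddling matters here.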


With the Xilinx FFT IP Core the latency is usually about twice the FFT length, and the maximum achievable clock frequency is about 300 MHz. With a 1024-point FFT that gives you roughly 7 µs of latency. And we're only talking about the 1D FFT, so we'd also need to account for image acquisition, data preparation for the FFT, and post-FFT processing and decision making. By the way, 4 µs is 250,000 frames per second. There are two possibilities: either your requirements need a bit of grooming... or you're working on some amazing project which I would love to take part in :D
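The two numbers in that post can be reproduced from the stated rule of thumb (latency ≈ 2 × FFT length, ~300 MHz clock); this is only a back-of-envelope check, not a datasheet figure:

```python
# Latency estimate under the rule of thumb: ~2x FFT length at ~300 MHz.
fft_len = 1024
clock_hz = 300e6
latency_us = 2 * fft_len / clock_hz * 1e6
print(round(latency_us, 2))   # ~6.83 us, i.e. "about 7 us"

frames_per_s = 1 / 4e-6       # one line every 4 us
print(round(frames_per_s))    # 250,000 frames per second
```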

Edited by PiDi


Hello and thank you for your answers.


So an FFT of a 2048-pixel line would be possible in 4 µs with the appropriate parallelization. As far as I know, the mentioned NI FPGAs run at 400 MHz. The question then is which type of FPGA is suitable, because it has a decisive influence on the cost.
According to NI, the FFT runs in 4 µs (2048 pixels, 12-bit) on the PXIe-7965 (Virtex-5 SX95T). Unfortunately, that card costs €10,000 and would probably blow the project budget.

Therefore, I would like to know whether it would also work with these variants:

Kintex-7 160T / IC-3120 (USB3)
Virtex-5 LX50 / PCIe-1473 (Camera Link)

If you look at the document:

http://www.bertendsp.com/pdf/whitepaper/BWP001_GPU_vs_FPGA_Performance_Comparison_v1.0.pdf

Then it should also work with CUDA and a GPU, and at significantly lower cost. How do you see it? @PiDi: In principle, you are already doing with ...
 

Have a nice time, Jim

Edited by JimPanse


If we go down the Xilinx FFT IP Core path: the FPGA itself is not the limiting factor for the clock frequency, the IP Core is. Take a look at the Xilinx docs: https://www.xilinx.com/support/documentation/ip_documentation/ru/xfft.html . Assuming a Kintex-7 target and the pipelined architecture, you probably won't make it to 400 MHz. From my own experience, when trying to compile the FFT core at 300 MHz I got about a 50% success rate (that is, 50% of the compilations failed on the timing constraints) -- but this is FPGA compilation, so when you're at the performance boundary it really is random. We can also look at older versions of the FFT IP -- Xilinx even included latency information there: https://www.xilinx.com/support/documentation/ip_documentation/xfft_ds260.pdf . Take a look at page 41, for example: they didn't go under 4 µs.

Ok, that's Xilinx, but you say: "According to NI, the FFT runs in 4 μs (2048 pixels, 12-bit) on the PXIe-7965 (Virtex-5 SX95T)." I can't find it; could you provide a reference?

GPU itself should be able to do the FFT calculation in that time with no problem, the limiting factor is data transfer to and from GPU.

I wrote all of the above to provide a bit of perspective, but I'm not saying that this is impossible to do. I rather say that the only way to know for sure is to actually prototype it, try different configurations and implementations and see if it works.

So, wrapping this up, I would review the requirements (OK, if you say it is absolutely 4 µs without any doubt, then let's stick with it -- and I really do think it's awesome to push the performance to the limits :D ). Then try to get hold of some FPGA target (borrow it somewhere? from NI directly?) and try to implement this FFT. And the same for the GPU (for the initial tests you could probably go with any non-ancient GPU you can find in a PC or laptop).


Re GPU: 2048 16-bit ints is 4096 bytes; per 4 µs that is 1,024,000,000 bytes/s, or 976 MiB/s. Except it's both directions, so actually ~2 GB/s. If you're using a Haswell, for example (PCIe 3.0), that's 3 lanes already... without giving your GPU any processing time. A x16 card would give you 3.5 µs, assuming the CUDA interface itself has no overhead.
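Those bandwidth figures check out numerically; a quick reproduction, taking ~985 MB/s as the usable per-lane bandwidth of PCIe 3.0 (my assumption for the lane figure):

```python
import math

# 2048 16-bit samples each way, one line every 4 us.
bytes_per_line = 2048 * 2                 # 4096 bytes
one_way_bps = bytes_per_line / 4e-6       # bytes/s, one direction
both_ways_bps = 2 * one_way_bps           # in and out
print(one_way_bps)                        # ~1.024e9 B/s, i.e. ~976 MiB/s

pcie3_lane_bps = 985e6                    # usable PCIe 3.0 bandwidth per lane (assumption)
lanes = math.ceil(both_ways_bps / pcie3_lane_bps)
print(lanes)                              # 3 lanes for the transfers alone
```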

As mentioned above, it also depends on the rest of your budget -- what's the cycle time, how much time are you allocating for the image capture itself, and what do you need to do with that FFT (if greater than x, write a Boolean output? send a message? etc.)?

Edited by smithd


Hello experts .... Thank you for your answers.


The line rate really is 250 kHz @ 2048 pixels ;)

GPU
A current graphics card, e.g. a 6 GB Asus GeForce GTX 1060 (PCIe 3.0 x16), should have a transfer rate of about 15 GB/s. PCIe 4.0, with twice the data rate, comes out this year. At an expected data rate of about 1 GB/s, that should not be the bottleneck, right? So in theory it should work with CUDA.

 

What would be the limiting element in the data transmission?
 

With the USB 3.0 variant, the USB 3.0 interface is the data bottleneck. It can transfer at most about 640 MB/s. That would probably only work at the expense of reduced dynamic range: 2048 pixels × 250 kHz × 8 bits = 488 MB/s.
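The USB budget above can be checked directly; the calculation also shows why 16-bit data would not fit (it would double the rate past the quoted 640 MB/s limit):

```python
# USB3 budget: 2048-pixel lines at 250 kHz, reduced to 8 bits/pixel.
pixels = 2048
line_rate_hz = 250e3
bits_per_pixel = 8                                      # 16-bit would need double
rate_bps = pixels * line_rate_hz * bits_per_pixel / 8   # bytes per second
print(round(rate_bps / 2**20))                          # ~488 MiB/s

usb3_limit_bps = 640e6        # practical USB 3.0 limit quoted in the post
print(rate_bps < usb3_limit_bps)   # True at 8 bits/pixel
```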

FPGA
Ok, that's Xilinx, but you say: "According to NI, the FFT runs in 4 μs (2048 pixels, 12-bit) on the PXIe-7965 (Virtex-5 SX95T)." - I can't find it, could you provide a reference?
That is a statement from an NI systems engineer who tested it on the mentioned hardware. He tested it under LabVIEW FPGA 32-bit and said it should go even better under LabVIEW 64-bit.

 

If you look at the latency in the document
https://www.xilinx.com/support/documentation/ip_documentation/xfft_ds260.pdf
for the Virtex-5 SX95T, it is always larger than 4 µs at a 2k point size. The data type with LV FPGA can only be U8 or U16, right?


That raises the question: how did he do it? Can speed be gained through parallelization, and why should it be even faster with LV FPGA 64-bit?
 

The document is from 2011. Maybe something has changed in recent years?
 

The result of the FFT (a single value) is then stored in main memory or on the hard disk.


Have a nice day, Jim

Edited by JimPanse


I still do not really understand the 4 µs requirement. Are you expecting to do something, i.e. react to some signal, within 4 µs? Can you not just accept that some portion of the signal processing may lag a bit, and have some sort of buffering mechanism? If you are just storing to disk or memory, then as long as you have enough buffer to capture the required sampling history, does it really matter if you are not doing the FFT at exactly the same rate? I did some work with a FlexRIO on a system with an approx. 1000 frames/second rate, and we just buffered to RAM on the FPGA card and then spooled it out at a more sensible rate to be processed by the host.

 

4 hours ago, JimPanse said:

 

FPGA
Ok, that's Xilinx, but you say: "According to NI, the FFT runs in 4 μs (2048 pixels, 12-bit) on the PXIe-7965 (Virtex-5 SX95T)." - I can't find it, could you provide a reference?
That is a statement from an NI systems engineer who tested it on the mentioned hardware. He tested it under LabVIEW FPGA 32-bit and said it should go even better under LabVIEW 64-bit.

If you look at the latency in the document
https://www.xilinx.com/support/documentation/ip_documentation/xfft_ds260.pdf
for the Virtex-5 SX95T, it is always larger than 4 µs at a 2k point size. The data type with LV FPGA can only be U8 or U16, right?

1. LabVIEW 64-bit has no effect whatsoever on FPGA performance, because the FPGA is neither 32-bit nor 64-bit -- only the development environment is. So sorry, but LV 64-bit brings absolutely nothing to the table.

2. Nowhere in this Xilinx document do I see a Virtex-5 doing a 2k FFT anywhere near 4 µs (more like 30 µs). This also assumes a maximum clock rate of up to 425 MHz, which NI cards will not do; 320 MHz is tops. I know from experience -- I work with an SX95T on a daily basis.

3. No, data types on the FPGA can be from 1 bit to N bits. No real restriction.

4. I have the feeling your expertise and ability to discuss this topic is very limited because you seem to have so little understanding of the numbers behind the claims of a 2k FFT in 4us on 10 year old hardware.

5. Get the NI systems engineer to do your work for you and see how far he gets. Sounds like he's talking through his hat.


Thanks for the friendly answer!

The FFT calculation does not have to finish every 4 µs. I should have made that clearer. It is a 250 kHz continuous detector, and that processing rate must be guaranteed. It would be quite conceivable to buffer a certain amount of data and then process it in parallel.

For example: if one FFT calculation takes 100 µs, you could buffer 25 measurements and calculate them in parallel, if that is feasible... The question then is: which low-cost FPGA is able to buffer the data and calculate it in parallel?
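The "25 buffered measurements" follows directly from the numbers: the degree of parallelism needed is the per-FFT latency divided by the line period (a sketch, using the 100 µs figure from the post as an example value):

```python
import math

# Parallel FFT instances needed when lines arrive every 4 us (250 kHz)
# and each FFT takes 100 us (example latency from the post).
line_period_us = 4.0
fft_latency_us = 100.0
parallel_ffts = math.ceil(fft_latency_us / line_period_us)
print(parallel_ffts)   # 25 FFTs in flight to sustain the line rate
```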

Simulation can determine which FPGA type has the necessary resources... but not the timing.
 

Have fun sweating, Jim


FPGA compilation is basically a simulation, so you can get the clock rate from that, and use cycle-accurate simulation of the FFT core to determine the throughput performance.

So if the calculation can be buffered, I think we all collectively want to know what your control requirement is. From detecting a 'bad' value on your camera, how long do you have to respond? If you have a 100 µs latency budget, it doesn't matter that the GPU can do the processing in 4 µs -- the data might not even get there in time. However, an FPGA card with the camera acquisition and processing on the same device makes 100 µs feasible, using multiple FFT cores to increase throughput. You even get to cheat, since you read each pixel as it is acquired from the sensor, rather than waiting for the entire acquisition to complete and be buffered as you would with a normal CPU driver. On the other hand, if you have a 10 ms latency budget, then the CPU+GPU implementation will probably be much, much simpler to implement for the same throughput.


Ahhh, so if you don't actually need to process the data in 4 µs, but can tolerate a little more latency, that changes a lot! Using the Pipelined Streaming architecture in the Xilinx FFT core, you can stream the data in continuously and it will generate results continuously, after an initial latency:

[Image: xilinxFFT.png -- latency table from the Fast Fourier Transform v9.0 LogiCORE IP Product Guide]

In other words: if you start streaming the data in continuously, you'll get the first output value after X µs, and then the next value every clock cycle. Though you'll need to properly implement the AXI4-Stream protocol around the IP core (basically this: http://zone.ni.com/reference/en-XX/help/371599K-01/lvfpgaconcepts/xilinxip_using/ , but there are some caveats when interfacing with Xilinx IP cores) and some buffering.
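One caveat worth quantifying: a single pipelined-streaming instance accepts one sample per clock, so its sustained line rate is the clock divided by the FFT length. A sketch, assuming a 250 MHz clock (my assumption; the achievable clock was debated above):

```python
# Sustained throughput of one pipelined-streaming FFT instance:
# one sample in and one out per clock once the pipeline is full.
clock_hz = 250e6      # assumed achievable clock on this class of FPGA
fft_len = 2048
lines_per_s = clock_hz / fft_len
print(round(lines_per_s))   # ~122,070 lines/s -- short of the 250 kHz target
```

So even in streaming mode, a single instance would not keep up with 250 kHz lines of 2048 points; some parallelism would still be needed.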

I also agree with smithd that with enough "latency budget" the GPU way might be more cost-efficient.

22 hours ago, JimPanse said:

The FFT calculation does not have to finish every 4 µs. I should have made that clearer. It is a 250 kHz continuous detector, and that processing rate must be guaranteed. It would be quite conceivable to buffer a certain amount of data and then process it in parallel.

Ah, so you need 250 kHz throughput, not 4 µs latency. Then why on earth was the 4 µs so prominent in the earlier posts... anyway...

Yes, given a maximum allowed latency, parallel processing over several FFT instances will get you where you need to be. Just keep track of the resource utilization of the FFT cores. It seems the Radix-4 Burst I/O version allows 10 channels at 16-bit resolution with a base clock of 250 MHz and a latency of approx. 26 µs. Streaming allows only one channel, so I don't know how well that is going to work out for you... I have no experience with the FFT IP core (except what I have just tried in order to get the information listed above).
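A quick check that the 10-channel burst-mode core has headroom at these numbers (using the ~26 µs latency and 4 µs line period from the post):

```python
import math

# With ~26 us latency per burst-mode FFT and a new line every 4 us,
# several lines are in flight at once; compare against the 10 channels.
line_period_us = 4.0
burst_latency_us = 26.0
in_flight = math.ceil(burst_latency_us / line_period_us)
print(in_flight)       # 7 lines in flight -- within the 10-channel budget
```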

