Bruniii Posted April 16 Hi all, I'm looking for help to speed up, as much as possible, a function that reshapes a 1D array from a 4-channel acquisition board into four 2D arrays. The input array is: Ch0_0_0 - Ch1_0_0 - Ch2_0_0 - Ch3_0_0 - Ch0_1_0 - Ch1_1_0 - Ch2_1_0 - Ch3_1_0 - ... - Ch0_N_0 - Ch1_N_0 - Ch2_N_0 - Ch3_N_0 - Ch0_0_1 - Ch1_0_1 - Ch2_0_1 - Ch3_0_1 - Ch0_1_1 - Ch1_1_1 - Ch2_1_1 - Ch3_1_1 - ... - Ch0_N_M - Ch1_N_M - Ch2_N_M - Ch3_N_M where, basically, the array is the stream of samples from 4 channels over M measures, with N samples per channel per measure: first the first sample of each channel of the first measure, then the second sample of each channel, and so on. Additionally, I need to remove the first X samples and the last Z samples of each measure for each channel (basically, I'm getting N samples from the board but I only care about the samples from X to N-Z, for each channel and measure). The board can only be configured with a power-of-2 number of samples per measure, so there is no way to receive only the desired length from the board. The end goal is to have four 2D arrays (one for each channel), each with M rows and N-(X+Z) columns. The typical length of the input 1D array is 4 channels * M=512 measures * N=65536 samples per channel per measure; typical X = 200, Z = 30000. Originally I tried the following code: [image] and then this, which is faster: [image] Still, every millisecond gained will help, and I'm sure that an expert here can achieve the same result with a single, super efficient function. The function will run on a 32-core Intel i9 CPU. Thanks! Marco.
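A minimal sketch of the index arithmetic behind this reshape, written in Python purely for illustration (it says nothing about LabVIEW speed). It assumes zero-based indices, so sample k of channel ch in measure row sits at flat index (row * N + k) * 4 + ch; `deinterleave_trim` is a hypothetical name, not one of the VIs in this thread:

```python
def deinterleave_trim(data, n_ch, n, x, z):
    """Split an interleaved acquisition into one 2D array per channel.

    data : flat sequence ordered [measure][sample][channel], i.e. sample k of
           channel ch in measure row sits at index (row * n + k) * n_ch + ch
    n_ch : number of channels (4 for the board in this thread)
    n    : samples per channel per measure (N)
    x, z : samples to drop at the start (X) and end (Z) of each measure
    Returns n_ch lists of M rows, each row holding n - x - z samples.
    """
    frame = n_ch * n                 # values per measure, all channels together
    if len(data) % frame or x + z > n:
        raise ValueError("buffer size or trim lengths do not match the layout")
    m = len(data) // frame           # number of measures (M)
    start = n_ch * x                 # offset of sample x inside one measure
    stop = n_ch * (n - z)            # offset just past sample n - z - 1
    out = [[None] * m for _ in range(n_ch)]
    for row in range(m):
        base = row * frame
        for ch in range(n_ch):
            # a step of n_ch selects one channel; start/stop apply the trim
            out[ch][row] = list(data[base + start + ch:base + stop:n_ch])
    return out
```

With the typical values above (N = 65536, X = 200, Z = 30000), each row keeps 65536 - 200 - 30000 = 35336 samples, so roughly 54% of the 4 * 512 * 65536 = 134,217,728 input samples end up in the output arrays.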
X___ Posted April 16 (edited) Try decimate array? Edited April 16 by X___: corrected mistake
Bruniii (Author) Posted April 16 3 hours ago, X___ said: Try interleave array? I no longer have a "nice" VI, but the very first implementation was based on the Decimate 1D Array function (I guess you are referring to decimate and not interleave), and it was slower than the other two solutions: [image]
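For comparison, the decimate-based approach can be sketched the same way (Python, hypothetical function name, not Bruniii's actual VI). It first splits the buffer into one full-length stream per channel, the way a decimate with 4 outputs hands element ch, ch + 4, ch + 8, ... to output ch, and only then cuts each stream into measures and trims. Every sample, including the roughly 46% discarded with the typical X and Z from the first post, is therefore copied once before the kept part is copied again; that extra pass is one plausible reason it measured slower, though this is an inference rather than something measured in the thread:

```python
def decimate_then_trim(data, n_ch, n, x, z):
    """Decimate first, then cut each channel stream into trimmed measures."""
    # Step 1: channel ch receives elements ch, ch + n_ch, ch + 2*n_ch, ...
    # (one full copy of the whole input, trimmed samples included)
    streams = [data[ch::n_ch] for ch in range(n_ch)]
    out = []
    for stream in streams:
        m = len(stream) // n          # measures in this channel stream
        # Step 2: keep samples x .. n - z - 1 of every measure
        out.append([stream[row * n + x:row * n + n - z] for row in range(m)])
    return out
```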
ShaunR Posted April 17 (edited) Post the VIs rather than snippets (snippets don't work on Lavag.org), along with example data. It's also helpful if you have a standard benchmark that we can plug our implementations into (a sequence structure with frames and getmillisecs) so we can compare and contrast, e.g.: [image] Edited April 17 by ShaunR
Bruniii (Author) Posted April 17 9 hours ago, ShaunR said: Post the VIs rather than snippets (snippets don't work on Lavag.org) along with example data. [...] Sure, the attached VI contains the generation of a sample 1D array that simulates the 4 channels, M measures and N samples, plus the latest version of the code to reshape it, inside a sequence structure. test_reshape.vi
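For readers without LabVIEW, the same kind of benchmark can be sketched in Python. The synthetic buffer below stores each value at its own flat index, so every output sample can be checked against where it should have come from as well as timed. The helper names, the best-of-N timing with `time.perf_counter`, and the tiny sizes are assumptions for illustration, not what test_reshape.vi does; `candidate` is just one strided-slice implementation to plug in:

```python
import time

def make_test_data(n_ch, m, n):
    """Synthetic interleaved buffer whose values equal their flat index,
    i.e. (row * n + k) * n_ch + ch for measure row, sample k, channel ch."""
    return list(range(n_ch * m * n))

def verify(out, n_ch, m, n, x, z):
    """Raise ValueError unless every kept sample came from the right index."""
    if len(out) != n_ch:
        raise ValueError("wrong number of channels")
    for ch in range(n_ch):
        if len(out[ch]) != m:
            raise ValueError(f"channel {ch}: wrong number of measures")
        for row in range(m):
            want = [(row * n + k) * n_ch + ch for k in range(x, n - z)]
            if list(out[ch][row]) != want:
                raise ValueError(f"channel {ch}, measure {row}: wrong samples")

def candidate(data, n_ch, n, x, z):
    """One implementation to plug into the harness (strided slices)."""
    m = len(data) // (n_ch * n)
    return [[data[(row * n + x) * n_ch + ch:(row * n + n - z) * n_ch:n_ch]
             for row in range(m)]
            for ch in range(n_ch)]

def benchmark(fn, n_ch, m, n, x, z, repeats=5):
    """Best-of-`repeats` wall time in ms; the last result is then verified."""
    data = make_test_data(n_ch, m, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        out = fn(data, n_ch, n, x, z)
        best = min(best, time.perf_counter() - t0)
    verify(out, n_ch, m, n, x, z)
    return best * 1000.0

if __name__ == "__main__":
    print(f"{benchmark(candidate, 4, 8, 256, 20, 100):.3f} ms")
```

Taking the best of several runs reduces noise from other processes; scaling `m` and `n` toward the real sizes is left to the reader.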
ShaunR Posted April 23 Nope. I can't beat it. To get better performance, I expect you would have to use different hardware (FPGA or GPU). Auto-indexing arrays in LabVIEW are extremely efficient, and I've come across this situation before, where decimate is usually about 4 times slower. Your particular requirement involves deleting a subsection at the beginning and end of each acquisition, so most optimisations aren't available. Just be aware that your code assumes a fixed number of channels, so hope the HW guys don't add more or make a cheaper version with only 2.
Bruniii (Author) Posted April 23 2 hours ago, ShaunR said: Nope. I can't beat it. [...] Thanks for trying! How "easy" is it to use GPUs in LabVIEW for this type of operation? I remember reading that I'm supposed to write the code in C++ using the CUDA API, compile the DLL and then use the LabVIEW toolkit to call the DLL. Unfortunately, I have zero knowledge of basically all these steps.
ShaunR Posted April 23 (edited) 2 hours ago, Bruniii said: Thanks for trying! How "easy" is it to use GPUs in LabVIEW for this type of operation? [...] There is a GPU Toolkit if you want to try it. No need to write wrapper DLLs. It's in VIPM, so you can just install it and try. Don't bother with the download button on the website; it's just a launch link for VIPM and you'd have to log in. One afterthought: when benchmarking you must never leave outputs unwired (like the 2D arrays in your benchmark). LabVIEW will know that the data isn't used anywhere and will optimise accordingly, giving different results than in production. So you should at least do something like this: [image] On my machine your original executed in ~10 ms. With the above it was ~30 ms. Edited April 23 by ShaunR
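The same precaution applies outside LabVIEW, since optimising compilers can also drop work whose result is never used. A minimal sketch of the pattern in Python (hypothetical helper names): time the call, then fold every element of every output array into a checksum that is returned, so the outputs are genuinely consumed. CPython does not eliminate unused results, so here the checksum mainly doubles as a cheap way to confirm that two implementations produced the same data:

```python
import time

def checksum(outputs):
    """Fold every element of every 2D output array into a single number."""
    total = 0
    for array_2d in outputs:
        for row in array_2d:
            total += sum(row)
    return total

def timed_run(fn, *args):
    """Return (elapsed_ms, checksum) for fn(*args).

    Only the call itself is timed; the checksum is computed afterwards,
    but it depends on every output element, so the outputs are used.
    """
    t0 = time.perf_counter()
    outputs = fn(*args)
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return elapsed_ms, checksum(outputs)
```

Computing the checksum after the second timestamp keeps it out of the measurement; including it instead is a judgment call, as long as the outputs end up being used somewhere.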
Bruniii (Author) Posted April 27 On 4/23/2024 at 5:07 PM, ShaunR said: There is a GPU Toolkit if you want to try it. No need to write wrapper DLLs. [...] Thank you for the note about the compiler and the need to "use" all the outputs. I know it, but forgot when writing this specific VI. Regarding the GPU Toolkit: it's the one I read about in the past. In the so-called "documentation", NI writes: "In this toolkit, the function wrappers for the FFT and BLAS operations already are built with the LVGPU SDK, and they specifically call the NVIDIA CUDA libraries and communicate with a GPU through an NVIDIA API. You can use the LVGPU SDK to build wrappers for implementing custom GPU functions to execute on any co-processor device as long as LabVIEW can call the external function." https://www.ni.com/docs/en-US/bundle/labview-gpu-analysis-toolkit-api-ref/page/lvgpu/lvgpu.html And, for example, I found the following topic on the NI forum: https://forums.ni.com/t5/GPU-Computing/Need-Help-on-Customizing-GPU-Computing-Using-the-LabVIEW-GPU/td-p/3395649 where it looks like a custom DLL for the specific operations needed is required.
Rolf Kalbermatter Posted April 27 There are several alternatives to the NI GPU Toolkit that are considerably more up to date and actually still maintained: https://www.ngene.co/gpu-toolkit-for-labview https://www.g2cpu.com/