
Optimization of reshape 1d array to 4 2d arrays function



Hi all,

I'm looking for help to increase as much as possible the speed of a function that reshapes a 1D array from a 4-channel acquisition board into four 2D arrays. The input array is:

Ch0_0_0 - Ch1_0_0 - Ch2_0_0 - Ch3_0_0 - Ch0_1_0 - Ch1_1_0 - Ch2_1_0 - Ch3_1_0 - ... - Ch0_N_0 - Ch1_N_0 - Ch2_N_0 - Ch3_N_0 - Ch0_0_1 - Ch1_0_1 - Ch2_0_1 - Ch3_0_1 - Ch0_1_1 - Ch1_1_1 - Ch2_1_1 - Ch3_1_1 - ... - Ch0_N_M - Ch1_N_M - Ch2_N_M - Ch3_N_M

where, basically, the array is the stream of samples from 4 channels, for M measures, with N samples per channel per measure. It starts with the first sample of each channel of the first measure, then the second sample of each channel, and so on.

Additionally, I need to remove the first X samples and the last Z samples from each measure for each channel (basically, I'm getting N samples from the board but I only care about the samples from X to N-Z, for each channel and measure). The board can only be configured with a power-of-2 number of samples per measure, so there is no way to receive only the desired length from the board.

The end goal is to have four 2D arrays (one for each channel), each with M rows and N-(X+Z) columns. The typical input 1D array has 4 channels * M = 512 measures * N = 65536 samples per channel per measure; typical values are X = 200 and Z = 30000.
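For readers who cannot open the block diagrams, the layout and the trimming described above can be sketched in text form. This is a minimal pure-Python illustration, not the LabVIEW implementation from this thread; the function name `deinterleave_trim` and its parameter names are hypothetical. It assumes that sample s of channel c in measure m sits at index `n_channels * (m * n_samples + s) + c`, and that exactly N samples per channel are delivered per measure.

```python
def deinterleave_trim(data, n_channels, n_measures, n_samples, skip_head, skip_tail):
    """Split an interleaved 1D stream into one 2D list (rows = measures) per channel.

    Assumed layout: sample s of channel c in measure m is stored at index
    n_channels * (m * n_samples + s) + c.
    """
    if len(data) != n_channels * n_measures * n_samples:
        raise ValueError("input length does not match the given dimensions")
    if skip_head + skip_tail > n_samples:
        raise ValueError("cannot trim more samples than a measure holds")

    channels = [[] for _ in range(n_channels)]
    for m in range(n_measures):
        base = m * n_samples * n_channels                    # first element of measure m
        start = base + skip_head * n_channels                # skip the first X samples
        stop = base + (n_samples - skip_tail) * n_channels   # drop the last Z samples
        for c in range(n_channels):
            # Extended slice: every n_channels-th element, starting at channel c.
            channels[c].append(data[start + c:stop:n_channels])
    return channels


# Tiny example: 4 channels, M = 2 measures, N = 5 samples, X = 1, Z = 2.
# Each value encodes (channel, sample, measure) as c*100 + s*10 + m.
C, M, N, X, Z = 4, 2, 5, 1, 2
stream = [c * 100 + s * 10 + m for m in range(M) for s in range(N) for c in range(C)]
result = deinterleave_trim(stream, C, M, N, X, Z)
print(result[0])  # [[10, 20], [11, 21]]
```

With the typical sizes quoted above, the input holds 4 * 512 * 65536 = 134,217,728 samples and each output array has 65536 - (200 + 30000) = 35336 columns, so only about 54% of the input needs to be copied.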

Originally I tried the following code:

[Image: first_version.png]

and then this, which is faster:

[Image: second_version.png]

Still, every millisecond gained will help, and I'm sure that an expert here can achieve the same result with a single, super-efficient function. The function will run on a 32-core Intel i9 CPU.

Thanks!

Marco.

Link to comment
3 hours ago, X___ said:

Try interleave array?

I do not have a "nice" VI anymore, but the very first implementation was based on the decimate array function (I guess you are referring to decimate and not interleave), and it was slower than the other two solutions:

[Image: 55568808_Screenshot2024-04-16222236.png]
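Since the decimate-based VI itself is no longer available, the following is only a guess at its general shape, written in pure Python for illustration. The function name `decimate_then_trim` is hypothetical. It first splits the whole stream into one full-length stream per channel, then cuts each stream into measures and trims them.

```python
def decimate_then_trim(data, n_channels, n_measures, n_samples, skip_head, skip_tail):
    """Decimate first, trim afterwards (assumed shape of the decimate-based version)."""
    # Step 1: "decimate" into one full-length stream per channel.
    # This copies every sample, including those that are discarded in step 2.
    streams = [data[c::n_channels] for c in range(n_channels)]

    # Step 2: cut each stream into measures and keep samples X .. N-Z-1 of each.
    keep_to = n_samples - skip_tail
    return [
        [stream[m * n_samples + skip_head:m * n_samples + keep_to]
         for m in range(n_measures)]
        for stream in streams
    ]


# Same tiny layout as in the question: value = c*100 + s*10 + m.
C, M, N, X, Z = 4, 2, 5, 1, 2
stream = [c * 100 + s * 10 + m for m in range(M) for s in range(N) for c in range(C)]
decimated = decimate_then_trim(stream, C, M, N, X, Z)
print(decimated[1])  # [[110, 120], [111, 121]]
```

In this form, step 1 copies all N samples per measure even though X + Z of them are thrown away afterwards (about 46% with X = 200, Z = 30000, N = 65536). That extra copy may be one reason a decimate-first approach is slower, but this sketch says nothing about how LabVIEW actually executes the Decimate function.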

Link to comment

Post the VIs rather than snippets (snippets don't work on Lavag.org), along with example data. It's also helpful if you have a standard benchmark that we can plug our implementations into (a sequence structure with frames and getmillisecs) so we can compare and contrast.

e.g

 

Edited by ShaunR
Link to comment
9 hours ago, ShaunR said:

Post the VIs rather than snippets (snippets don't work on Lavag.org), along with example data. It's also helpful if you have a standard benchmark that we can plug our implementations into (a sequence structure with frames and getmillisecs) so we can compare and contrast.

e.g

 

Sure, the attached VI generates a sample 1D array that simulates the 4 channels, M measures, and N samples, and contains the latest version of the reshaping code inside a sequence structure.

test_reshape.vi

Link to comment

Nope. I can't beat it. To get better performance, I expect you would probably have to use different hardware (FPGA or GPU).

Self auto-incrementing arrays in LabVIEW are extremely efficient, and I've come across this situation before, where decimate was usually about 4 times slower. Your particular requirement involves deleting a subsection at the beginning and end of each acquisition, so most optimisations aren't available.

Just be aware that you have a fixed number of channels and hope the HW guys don't add more or make a cheaper version with only 2.

Link to comment
2 hours ago, ShaunR said:

Nope. I can't beat it. To get better performance, I expect you would probably have to use different hardware (FPGA or GPU).

Self auto-incrementing arrays in LabVIEW are extremely efficient, and I've come across this situation before, where decimate was usually about 4 times slower. Your particular requirement involves deleting a subsection at the beginning and end of each acquisition, so most optimisations aren't available.

Just be aware that you have a fixed number of channels and hope the HW guys don't add more or make a cheaper version with only 2.

Thanks for trying!

How "easy" is it to use GPUs in LabVIEW for this type of operation? I remember reading that I'm supposed to write the code in C++ using the CUDA API, compile it into a DLL, and then use the LabVIEW toolkit to call the DLL. Unfortunately, I have zero knowledge of basically all of these steps.

Link to comment
2 hours ago, Bruniii said:

Thanks for trying!

How "easy" is it to use GPUs in LabVIEW for this type of operation? I remember reading that I'm supposed to write the code in C++ using the CUDA API, compile it into a DLL, and then use the LabVIEW toolkit to call the DLL. Unfortunately, I have zero knowledge of basically all of these steps.

There is a GPU Toolkit if you want to try it. No need to write wrapper DLLs. It's in VIPM, so you can just install it and try. Don't bother with the download button on the website; it's just a launch link for VIPM and you'd have to log in.

One afterthought. When benchmarking, you must never leave outputs unwired (like the 2D arrays in your benchmark). LabVIEW will know that the data isn't used anywhere and will optimise the code, giving different results than in production. So you should at least do something like this:

[Image: image.png]

On my machine your original executed in ~10 ms. With the above it was ~30 ms.
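The benchmark in this thread is a LabVIEW block diagram, but the same structure can be sketched in text: time only the code under test, then consume its outputs so the result is demonstrably used. The following pure-Python sketch is an illustration only; `split_channels` and `benchmark` are hypothetical names, and CPython does not remove unused results the way the LabVIEW compiler can, so it mirrors the structure of the benchmark rather than the optimisation effect.

```python
import time


def split_channels(data, n_channels):
    """Stand-in for the code under test: plain de-interleave, no trimming."""
    return [data[c::n_channels] for c in range(n_channels)]


def benchmark(func, *args, repeats=5):
    """Return (best time in ms, checksum of the last result)."""
    best_ms = float("inf")
    checksum = 0
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = func(*args)                       # timed region: only the call
        elapsed_ms = (time.perf_counter() - t0) * 1000.0
        best_ms = min(best_ms, elapsed_ms)
        # Consume every output outside the timed region, so the result is
        # actually used without adding to the measured time.
        checksum = sum(sum(channel) for channel in result)
    return best_ms, checksum


data = list(range(4 * 1000))
best_ms, checksum = benchmark(split_channels, data, 4)
print(f"best of 5: {best_ms:.3f} ms, checksum {checksum}")
```

Reporting the best of several runs reduces the influence of one-off delays such as memory allocation on the first call.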

Edited by ShaunR
Link to comment
On 4/23/2024 at 5:07 PM, ShaunR said:

There is a GPU Toolkit if you want to try it. No need to write wrapper DLLs. It's in VIPM, so you can just install it and try. Don't bother with the download button on the website; it's just a launch link for VIPM and you'd have to log in.

One afterthought. When benchmarking, you must never leave outputs unwired (like the 2D arrays in your benchmark). LabVIEW will know that the data isn't used anywhere and will optimise the code, giving different results than in production. So you should at least do something like this:

[Image: image.png]

On my machine your original executed in ~10 ms. With the above it was ~30 ms.

Thank you for the note regarding the compiler and the need to "use" all the outputs. I knew this but forgot about it when writing this specific VI.

Regarding the GPU toolkit: it's the one I read about in the past. In the "documentation", NI writes:

Quote

In this toolkit, the function wrappers for the FFT and BLAS operations already are built with the LVGPU SDK, and they specifically call the NVIDIA CUDA libraries and communicate with a GPU through an NVIDIA API. You can use the LVGPU SDK to build wrappers for implementing custom GPU functions to execute on any co-processor device as long as LabVIEW can call the external function.

https://www.ni.com/docs/en-US/bundle/labview-gpu-analysis-toolkit-api-ref/page/lvgpu/lvgpu.html

Also, for example, I found the following topic on the NI forum: https://forums.ni.com/t5/GPU-Computing/Need-Help-on-Customizing-GPU-Computing-Using-the-LabVIEW-GPU/td-p/3395649 where it looks like a custom DLL for the specific operations is required.

Link to comment
