Jump to content


Photo
- - - - -

Fastest Search Array of String


  • Please log in to reply
25 replies to this topic

#1 hooovahh

hooovahh

    The 500 club

  • Members
  • PipPipPipPipPip
  • 809 posts
  • Version:LabVIEW 2011
  • Since:2004

Posted 24 July 2012 - 12:49 PM

So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array.

There are many different ways to do what I just mentioned, some more obvious then others, and I wanted to know which was the methods that should not be used for larger arrays, and which method appears to be the best, on a Windows machine with time to process being the only thing to judge on.

So attached you'll find a VI (saved in 2011) which generates an array of strings. Each string will be between 5 and 15 characters long and the array size is controllable from the front panel. For approximately half the elements in the array, a string will be appended to the front of the generated string. These are the values we will search for.

We then search the array looking for the indexes that contain the string we are looking for. In this example the string is appended to the front, but I would like to make this generic and I am assuming the string we are looking for could be any where in the string not just the beginning. I use 6 different methods for finding the indexes and calculate the time it took to find them.

I noticed for arrays of 1000 elements or less it doesn't matter too much. Each of the 6 methods take 0 or 1 milliseconds to complete. For 10,000, or 100,000 the winner is Method 2, which uses the Search/Split String, combined with the OpenG Conditional Auto-Indexing Tunnel.

I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory.

Attached Files


"Maybe Hoovah is really Crelf's alter-ego, which he uses to irk people?" - Gary Rubin

"Seemingly minor changes....can mean the difference between a working app and a quivering heap of unresponsive code." - David Boyd


#2 ShaunR

ShaunR

    LabVIEW Archetype

  • Members
  • PipPipPipPipPipPip
  • 2,267 posts
  • Version:LabVIEW 2009
  • Since:1994

Posted 24 July 2012 - 01:20 PM

So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array.
<snip>
I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory.

Well. For me anything that doesn't use 3rd party products is preferred ;)

But generally when we are talking about this sort of thing we are also talking about convergence on a solution for the final optimisation step. As was indicated with the "Fast Trim"; using LabVIEW parallelism will always converge faster since the worst case is a solution is in the middle (if split over 2 for loops) rather than at the end. This however comes with the caveat for this case in how do you tell one loop that the other has found it without introducing complexity that mitigates the benefit.

I haven't tried your example (I don't have the JKI toolkit) but in theory enabling parallelism of the for loop should also yield benefits although I have yet to see a real world scenario where it actually does.

Edited by ShaunR, 24 July 2012 - 01:21 PM.

A positive attitude may not solve all your problems, but it will annoy enough people to make it worth the effort. (Herm Albright 1876-1944).

Founder and general mischief maker on www.labview-tools.com.
SQlite aficionado and websocket zealot.
If it 'aint in LabVIEW, then you 'aint got a clue!

#3 lordexod

lordexod

    Very Active

  • Members
  • PipPipPip
  • 64 posts
  • Location:Poland
  • Version:LabVIEW 2012
  • Since:2007

Posted 24 July 2012 - 04:35 PM

You can set: File -> VI Properties -> Execution -> Reentrant execution -> Preallocate clone for each instance.
LabVIEW Reverse Engineer

#4 Mellroth

Mellroth

    The 500 club

  • Members
  • PipPipPipPipPip
  • 538 posts
  • Version:LabVIEW 2011
  • Since:1995

Posted 24 July 2012 - 05:25 PM

You can set: File -> VI Properties -> Execution -> Reentrant execution -> Preallocate clone for each instance.

Why would you do this in this specific case?
I think Shaun´s parallel FOR loop has much more potential, since a VI call always has a small time associated with it regardless of the reentrancy setting.

/J

#5 Wouter

Wouter

    Very Active

  • Members
  • PipPipPip
  • 103 posts
  • Version:LabVIEW 2011
  • Since:2006

Posted 24 July 2012 - 05:56 PM

I played with this just now. Let me post my remarks. The tests where done with 1 million elements

First off al the first method can be made much more efficient, by just preallocation the data. If the input has 10000 then you should also preallocate a array of 10000 elements for both the string array as the index array. Then after the loop you just split the array and hence remove the elements you didn't use.

1.png

The first method, (500 ms), is then most times faster then the second method, (550 ms).

However of course you can also apply this to the second method. Which makes the second faster by more then 50% (550 ms to 250 ms)

2.png (yes the ss is of a subvi, but I did that for other testing purposes after I benchmarked it with the above results, I took the ss of the code when I was finished)

I then thought I could optimize it further but my attempt failed. I tried combining the two OpenG VI's into one. You namely do not have to keep track of the indices in the first for loop since they are also recored in the boolean array. The index at which the value is true, is the index of the element that we want to keep. After that you know the exact size of the array you are going to create and you can prealocate that. Then just fill it. However the execution then came back on its original level (+/- 250 ms to 550 ms)

3.png

Edited by Wouter, 24 July 2012 - 06:02 PM.


#6 hooovahh

hooovahh

    The 500 club

  • Members
  • PipPipPipPipPip
  • 809 posts
  • Version:LabVIEW 2011
  • Since:2004

Posted 24 July 2012 - 06:20 PM

I played with this just now. Let me post my remarks. The tests where done with 1 million elements

....

Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall.

"Maybe Hoovah is really Crelf's alter-ego, which he uses to irk people?" - Gary Rubin

"Seemingly minor changes....can mean the difference between a working app and a quivering heap of unresponsive code." - David Boyd


#7 lordexod

lordexod

    Very Active

  • Members
  • PipPipPip
  • 64 posts
  • Location:Poland
  • Version:LabVIEW 2012
  • Since:2007

Posted 25 July 2012 - 12:40 PM

Maybe something like this:

Attached File  test.vi   12.85K   51 downloads
Attached File  search.vi   12.82K   48 downloads

Edited by lordexod, 25 July 2012 - 12:41 PM.

LabVIEW Reverse Engineer

#8 hooovahh

hooovahh

    The 500 club

  • Members
  • PipPipPipPipPip
  • 809 posts
  • Version:LabVIEW 2011
  • Since:2004

Posted 25 July 2012 - 02:02 PM

Maybe something like this:

I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory.

To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison.

"Maybe Hoovah is really Crelf's alter-ego, which he uses to irk people?" - Gary Rubin

"Seemingly minor changes....can mean the difference between a working app and a quivering heap of unresponsive code." - David Boyd


#9 Mellroth

Mellroth

    The 500 club

  • Members
  • PipPipPipPipPip
  • 538 posts
  • Version:LabVIEW 2011
  • Since:1995

Posted 25 July 2012 - 02:51 PM


Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall.


Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result.

SearchStringArray.png

I have not bechmarked it against the other solutions, so I don't know if it is faster or slower.

/J

#10 lordexod

lordexod

    Very Active

  • Members
  • PipPipPip
  • 64 posts
  • Location:Poland
  • Version:LabVIEW 2012
  • Since:2007

Posted 25 July 2012 - 04:47 PM

I forgot, I have a better computer.

xxx.jpg

Edited by lordexod, 25 July 2012 - 04:48 PM.

LabVIEW Reverse Engineer

#11 ShaunR

ShaunR

    LabVIEW Archetype

  • Members
  • PipPipPipPipPipPip
  • 2,267 posts
  • Version:LabVIEW 2009
  • Since:1994

Posted 25 July 2012 - 05:02 PM

A bit of lateral thinking?

A positive attitude may not solve all your problems, but it will annoy enough people to make it worth the effort. (Herm Albright 1876-1944).

Founder and general mischief maker on www.labview-tools.com.
SQlite aficionado and websocket zealot.
If it 'aint in LabVIEW, then you 'aint got a clue!

#12 Mellroth

Mellroth

    The 500 club

  • Members
  • PipPipPipPipPip
  • 538 posts
  • Version:LabVIEW 2011
  • Since:1995

Posted 25 July 2012 - 05:17 PM

I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory.

To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison.

I'm getting times very close to this as well, and I believe that the reason is that the size of each string is also set to the size of the array... i.e. basically 10000 elements at approximately 10000 bytes each. On my system I run out of memory, or at least start swapping all the time.

Setting the size of the generated random string to more moderate 20 bytes, gives much better results... but still not close to the original code.

/J

#13 Jordan Kuehn

Jordan Kuehn

    Very Active

  • Premium Member
  • 239 posts
  • Location:Oklahoma
  • Version:LabVIEW 2011
  • Since:2009

Posted 25 July 2012 - 05:44 PM

Maybe something like this:

Attached File  test.vi   12.85K   51 downloads
Attached File  search.vi   12.82K   48 downloads


On the initial run I got 14.8s, but when running again without closing I got 278ms for 10,000 elements.

1-4ms on the original post for 10,000 elements.
The Colex Group
Lead Software Engineer
Certified LabVIEW Developer

#14 hooovahh

hooovahh

    The 500 club

  • Members
  • PipPipPipPipPip
  • 809 posts
  • Version:LabVIEW 2011
  • Since:2004

Posted 25 July 2012 - 05:52 PM


Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result.

SearchStringArray.png

I have not bechmarked it against the other solutions, so I don't know if it is faster or slower.

/J

I'm starting to get inconsistent results, but from what I've seen it looks to be one of the better solutions but not the best. The best I've seen would either be Method 3, or Method 3, modified as wouter suggests.

A bit of lateral thinking?

In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer.

"Maybe Hoovah is really Crelf's alter-ego, which he uses to irk people?" - Gary Rubin

"Seemingly minor changes....can mean the difference between a working app and a quivering heap of unresponsive code." - David Boyd


#15 Darin

Darin

    Very Active

  • Members
  • PipPipPip
  • 168 posts
  • Version:LabVIEW 2009
  • Since:1992

Posted 25 July 2012 - 06:12 PM

In general I am not a fan of the "plow through an array to select a subset of items" then "plow through the subset of items" methodology. When possible I prefer to do whatever it is I am going to do when I find the element. Therefore, the code I would optimize and package for reuse would simply take an array and a start index and return the index of the first element which matches, and perhaps the element itself in this case. Sort of a special version of Search 1D array. This use case of collecting all of the items would simply use that code in a loop, with appropriate pre-allocation code.

My other general point (don't have LV11 to test any of this code), is that benchmarking string operations is a dicey proposition and I often find very high variances. Be careful about drawing too many conclusions. I would check proper scaling with array size (should scale as N not N^2 or worse), and avoid the third-party code when practical. Hard to beat the good-old swap/cut method to move the desired items to the front of the input array, then use Reshape Array or Array Subset to lop of the undesired tail.

#16 Wouter

Wouter

    Very Active

  • Members
  • PipPipPip
  • 103 posts
  • Version:LabVIEW 2011
  • Since:2006

Posted 26 July 2012 - 12:25 AM

A bit of lateral thinking?


I get a memory is full error for a size of 1 million :D


On the initial run I got 14.8s, but when running again without closing I got 278ms for 10,000 elements.

1-4ms on the original post for 10,000 elements.

Thats because proberbly some intern machine code results are cached.

#17 ShaunR

ShaunR

    LabVIEW Archetype

  • Members
  • PipPipPipPipPipPip
  • 2,267 posts
  • Version:LabVIEW 2009
  • Since:1994

Posted 26 July 2012 - 01:38 AM

I get a memory is full error for a size of 1 million :D

Thats because I cut and pasted your seed, method so it's not surprising (i.e you were trying to fit 1,0000,0000 x 1,000,0000 chars in memory)

In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer.

There's no pleasing some people :D

Edited by ShaunR, 26 July 2012 - 01:49 AM.

A positive attitude may not solve all your problems, but it will annoy enough people to make it worth the effort. (Herm Albright 1876-1944).

Founder and general mischief maker on www.labview-tools.com.
SQlite aficionado and websocket zealot.
If it 'aint in LabVIEW, then you 'aint got a clue!

#18 asbo

asbo

    I have no idea what you're talking about... so:

  • V I Engineering, Inc.
  • 1,273 posts
  • Version:LabVIEW 2011
  • Since:2008

Posted 26 July 2012 - 04:33 AM

Thats because proberbly some intern machine code results are cached.

It's more likely related to a (lack of) memory allocations.

#19 Jordan Kuehn

Jordan Kuehn

    Very Active

  • Premium Member
  • 239 posts
  • Location:Oklahoma
  • Version:LabVIEW 2011
  • Since:2009

Posted 26 July 2012 - 04:36 AM

It's more likely related to a (lack of) memory allocations.


System info:
Posted Image
The Colex Group
Lead Software Engineer
Certified LabVIEW Developer

#20 asbo

asbo

    I have no idea what you're talking about... so:

  • V I Engineering, Inc.
  • 1,273 posts
  • Version:LabVIEW 2011
  • Since:2008

Posted 26 July 2012 - 02:11 PM

System info:

What's your point? It still takes time to execute a memory allocation, regardless of how much you have. It is very normal for a memory-intensive VI to execute magnitudes faster on a second run.