Jump to content

Fastest Search Array of String


Recommended Posts

Posted

So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array.

There are many different ways to do what I just mentioned, some more obvious then others, and I wanted to know which was the methods that should not be used for larger arrays, and which method appears to be the best, on a Windows machine with time to process being the only thing to judge on.

So attached you'll find a VI (saved in 2011) which generates an array of strings. Each string will be between 5 and 15 characters long and the array size is controllable from the front panel. For approximately half the elements in the array, a string will be appended to the front of the generated string. These are the values we will search for.

We then search the array looking for the indexes that contain the string we are looking for. In this example the string is appended to the front, but I would like to make this generic and I am assuming the string we are looking for could be any where in the string not just the beginning. I use 6 different methods for finding the indexes and calculate the time it took to find them.

I noticed for arrays of 1000 elements or less it doesn't matter too much. Each of the 6 methods take 0 or 1 milliseconds to complete. For 10,000, or 100,000 the winner is Method 2, which uses the Search/Split String, combined with the OpenG Conditional Auto-Indexing Tunnel.

I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory.

Search Array Of Strings.vi

  • Like 1
Posted (edited)

So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array.

<snip>

I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory.

Well. For me anything that doesn't use 3rd party products is preferred ;)

But generally when we are talking about this sort of thing we are also talking about convergence on a solution for the final optimisation step. As was indicated with the "Fast Trim"; using LabVIEW parallelism will always converge faster since the worst case is a solution is in the middle (if split over 2 for loops) rather than at the end. This however comes with the caveat for this case in how do you tell one loop that the other has found it without introducing complexity that mitigates the benefit.

I haven't tried your example (I don't have the JKI toolkit) but in theory enabling parallelism of the for loop should also yield benefits although I have yet to see a real world scenario where it actually does.

Edited by ShaunR
Posted

You can set: File -> VI Properties -> Execution -> Reentrant execution -> Preallocate clone for each instance.

Why would you do this in this specific case?

I think Shaun´s parallel FOR loop has much more potential, since a VI call always has a small time associated with it regardless of the reentrancy setting.

/J

Posted (edited)

I played with this just now. Let me post my remarks. The tests where done with 1 million elements

First off al the first method can be made much more efficient, by just preallocation the data. If the input has 10000 then you should also preallocate a array of 10000 elements for both the string array as the index array. Then after the loop you just split the array and hence remove the elements you didn't use.

post-17774-0-46361500-1343149793.png

The first method, (500 ms), is then most times faster then the second method, (550 ms).

However of course you can also apply this to the second method. Which makes the second faster by more then 50% (550 ms to 250 ms)

post-17774-0-62793600-1343152022.png (yes the ss is of a subvi, but I did that for other testing purposes after I benchmarked it with the above results, I took the ss of the code when I was finished)

I then thought I could optimize it further but my attempt failed. I tried combining the two OpenG VI's into one. You namely do not have to keep track of the indices in the first for loop since they are also recored in the boolean array. The index at which the value is true, is the index of the element that we want to keep. After that you know the exact size of the array you are going to create and you can prealocate that. Then just fill it. However the execution then came back on its original level (+/- 250 ms to 550 ms)

post-17774-0-43469500-1343152227_thumb.p

Edited by Wouter
Posted

I played with this just now. Let me post my remarks. The tests where done with 1 million elements

....

Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall.

Posted

Maybe something like this:

I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory.

To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison.

Posted

Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall.

Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result.

post-5958-0-82514200-1343227680_thumb.pn

I have not bechmarked it against the other solutions, so I don't know if it is faster or slower.

/J

  • Like 1
Posted

I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory.

To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison.

I'm getting times very close to this as well, and I believe that the reason is that the size of each string is also set to the size of the array... i.e. basically 10000 elements at approximately 10000 bytes each. On my system I run out of memory, or at least start swapping all the time.

Setting the size of the generated random string to more moderate 20 bytes, gives much better results... but still not close to the original code.

/J

Posted

Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result.

post-5958-0-82514200-1343227680_thumb.pn

I have not bechmarked it against the other solutions, so I don't know if it is faster or slower.

/J

I'm starting to get inconsistent results, but from what I've seen it looks to be one of the better solutions but not the best. The best I've seen would either be Method 3, or Method 3, modified as wouter suggests.

A bit of lateral thinking?

In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer.

Posted

In general I am not a fan of the "plow through an array to select a subset of items" then "plow through the subset of items" methodology. When possible I prefer to do whatever it is I am going to do when I find the element. Therefore, the code I would optimize and package for reuse would simply take an array and a start index and return the index of the first element which matches, and perhaps the element itself in this case. Sort of a special version of Search 1D array. This use case of collecting all of the items would simply use that code in a loop, with appropriate pre-allocation code.

My other general point (don't have LV11 to test any of this code), is that benchmarking string operations is a dicey proposition and I often find very high variances. Be careful about drawing too many conclusions. I would check proper scaling with array size (should scale as N not N^2 or worse), and avoid the third-party code when practical. Hard to beat the good-old swap/cut method to move the desired items to the front of the input array, then use Reshape Array or Array Subset to lop of the undesired tail.

Posted

A bit of lateral thinking?

I get a memory is full error for a size of 1 million :D

On the initial run I got 14.8s, but when running again without closing I got 278ms for 10,000 elements.

1-4ms on the original post for 10,000 elements.

Thats because proberbly some intern machine code results are cached.

Posted (edited)

I get a memory is full error for a size of 1 million :D

Thats because I cut and pasted your seed, method so it's not surprising (i.e you were trying to fit 1,0000,0000 x 1,000,0000 chars in memory)

In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer.

There's no pleasing some people :D

Edited by ShaunR
Posted

Thats because proberbly some intern machine code results are cached.

It's more likely related to a (lack of) memory allocations.

Posted

System info:

What's your point? It still takes time to execute a memory allocation, regardless of how much you have. It is very normal for a memory-intensive VI to execute magnitudes faster on a second run.

Posted

What's your point? It still takes time to execute a memory allocation, regardless of how much you have. It is very normal for a memory-intensive VI to execute magnitudes faster on a second run.

Yes you are totally correct. In the second run the memory is already allocated and hence does not need to be done anymore.

Posted

I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested:

post-5877-0-03813800-1343556340_thumb.pn

I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously.

Posted

I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested:

post-5877-0-03813800-1343556340_thumb.pn

I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously.

I'm not suprised by it. I actually also build a version which used the string subset but then i thought of the case that the string "test" may not be on the front in each case, then I just choose to create a more general solution.

Posted

I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested:

post-5877-0-03813800-1343556340_thumb.pn

I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously.

I realize some of my posts are a little long winded so I don't blame you for not picking up on what I was trying to say. In my first post I said this:

In this example the string is appended to the front, but I would like to make this generic and I am assuming the string we are looking for could be any where in the string not just the beginning.

Which was my way of saying that the search should be throughout the string. If I knew it was at the beginning I would use your technique with the string subset.

Also I may be missing something, but why does concatenating a "^" symbol cause the Match Pattern method to work faster?

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Unfortunately, your content contains terms that we do not allow. Please edit your content to remove the highlighted words below.
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.