Regex Challenge: How to exclude variable-width parts of a match?

JackDunaway · March 14, 2013

How can I exclude a variable-width, inner part from a regex match?

For instance, given the three following inputs:

The quick brown fox jumped over the lazy dog
The quick brown fox jumped over the sleepy dog
The quick brown fox jumped over the hotdog

I want to match the following:

The quick brown fox jumped over the dog

I have investigated negative lookarounds, but since these are zero-width assertions, I can't easily figure out how to include additional regex directives on *both sides* of the lookaround.

Is this problem relegated to the use of Search and Replace String configured for regex matching, or can this be achieved with a simple regex match?

Here's a screenshot of one of my naïve attempts using RegexPal; a successful attempt would also show "dog" highlighted as part of the match, but excluding the dog modifier. At least it doesn't match 'frog' :-)

Darin · March 14, 2013

In Perl you can get funky; here I would use S&R (Search and Replace), or use capture groups on both sides and concatenate. Actually, I would probably do that in Perl as well.
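A minimal sketch of the two approaches mentioned here, written with Python's `re` module rather than LabVIEW. The function names are made up for illustration. One version captures the text on each side of the modifier and concatenates it; the other does the equivalent search and replace with backreferences.

```python
import re

inputs = [
    "The quick brown fox jumped over the lazy dog",
    "The quick brown fox jumped over the sleepy dog",
    "The quick brown fox jumped over the hotdog",
]

# Group 1 is the text before the modifier, group 2 the text after it.
# The lazy, uncaptured .*? in the middle is the part to drop.
pattern = re.compile(r"(The quick brown fox jumped over the ).*?(dog)")

def join_groups(text):
    """Match, then concatenate the two captured sides."""
    m = pattern.search(text)
    return m.group(1) + m.group(2) if m else None

def search_and_replace(text):
    """Same result via search and replace with backreferences."""
    return pattern.sub(r"\1\2", text)

for s in inputs:
    print(join_groups(s), "|", search_and_replace(s))
```

Both functions build a new string. Neither makes the regex engine itself return the match with the middle removed.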

GregSands · March 14, 2013

Once again, just as I think of a possible solution (capturing "dog" separately and concatenating) Darin jumps in first with the answer.

:worshippy:

JackDunaway · March 14, 2013

OK; so, the consensus is that there does not exist a single regex to solve this? (I was really hoping to learn of a solution that could be achieved purely by the regex engine.)

The best name I can come up with for what I'm trying to do is submatch extraction, where the whole match depends on matching directives before and after the submatch, yet excludes the submatch like this:

mje · March 14, 2013

Yeah, S&R will be way faster. But you asked...

Don't forget that the * and + operators are greedy: they match as long a string as possible. So a simple "The quick brown fox jumped over the (.*)dog" will do the trick. Substitute + for * if you want to require at least one character, or use the {m,n} syntax if you have other length restrictions.
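To see what the greedy capture actually grabs, here is a small Python sketch. The helper name `modifier` and the fourth test sentence are invented for illustration.

```python
import re

greedy = re.compile(r"The quick brown fox jumped over the (.*)dog")
at_least_one = re.compile(r"The quick brown fox jumped over the (.+)dog")

def modifier(pattern, text):
    """Return the captured submatch, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(repr(modifier(greedy, "The quick brown fox jumped over the lazy dog")))  # 'lazy '
print(repr(modifier(greedy, "The quick brown fox jumped over the hotdog")))    # 'hot'

# Greedy: with two "dog"s, the capture runs up to the LAST one.
print(repr(modifier(greedy, "The quick brown fox jumped over the dog and the dog")))  # 'dog and the '

# With + instead of *, an empty modifier no longer matches.
print(modifier(at_least_one, "The quick brown fox jumped over the dog"))  # None
```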

Obligatory:

JackDunaway · March 14, 2013

...a simple "The quick brown fox jumped over the (.*)dog" will do the trick.

Well, simply finding a submatch is the easy part ;-)

What I'm really interested in doing is returning the original input string as the match, minus the submatch. Figure this out, and then we can fly around on vines saving days :-)

Darin · March 14, 2013

Why does S&R not fulfill your desire?

The regex engine deals in offsets and lengths in its internal state machine. Implementing dropped characters would really require a fundamental change to this representation. I am not sure what it would do to its speed (lack thereof) but my guess is that it will not speed up.

There seems to be a reason why regular expressions are low-level tools used in higher-level languages.

In Perl you can sprinkle some other code inside, but it really makes things hard to read. Most people do not realize that you can comment inside regexes, and even fewer ever want to deal with a regex which requires commenting.

JackDunaway · March 14, 2013

Why does S&R not fulfill your desire?

It would be helpful to have access to the text that was replaced. It is easy enough to create a re-use VI that does this; I'm just curious if the ability already exists.
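For comparison, outside LabVIEW this kind of "replace, but also tell me what was replaced" helper is short to write. A rough Python sketch, where `remove_and_report` is a hypothetical name and not an existing library function:

```python
import re

pattern = re.compile(r"(The quick brown fox jumped over the )(.*?)(dog)")

def remove_and_report(text):
    """Return (text with the middle group dropped, list of dropped pieces)."""
    removed = []

    def keep_sides(m):
        removed.append(m.group(2))      # remember what is being dropped
        return m.group(1) + m.group(3)  # keep only the two sides

    return pattern.sub(keep_sides, text), removed

result, removed = remove_and_report("The quick brown fox jumped over the sleepy dog")
print(result)   # The quick brown fox jumped over the dog
print(removed)  # ['sleepy ']
```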

mje · March 15, 2013

What I'm really interested in doing is returning the original input string as the match, minus the submatch. Figure this out, and then we can fly around on vines saving days :-)

Oh, I misread. Yes, the look ahead/behind are what you want.

(?<=The quick brown fox jumped over the ).*(?=dog)

Using the S&R in regex mode will get you the string you want, but the match primitive won't since the whole point of the match is not to create new strings but to only return substrings. You could still use the match though if you concatenate the before/after substrings.

Ok, I give up. My phone completely screwed that up:

(?<=The quick brown fox jumped over the ).*(?=dog)

If that doesn't work, I give up.

JackDunaway · March 15, 2013

(?<=The quick brown fox jumped over the ).*(?=dog)

If that doesn't work, I give up.

That got me excited... but it's not quite right. The whole match is still just the submatch, because the lookarounds are zero-width.
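The zero-width behaviour is easy to confirm in any engine that supports lookarounds. A quick Python check (Python's `re` only allows fixed-width lookbehind, which this literal prefix satisfies):

```python
import re

text = "The quick brown fox jumped over the lazy dog"
m = re.search(r"(?<=The quick brown fox jumped over the ).*(?=dog)", text)

# The lookarounds constrain where the match may sit but consume nothing,
# so the whole match is only the modifier.
print(repr(m.group(0)))  # 'lazy '

# The match offsets still allow rebuilding the string around the submatch,
# but that is extra logic outside the regex.
rebuilt = text[:m.start()] + text[m.end():]
print(rebuilt)  # The quick brown fox jumped over the dog
```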

Virtual +50 bounty for the regex that makes the green light come on in the following test harness:

mje · March 15, 2013

By giving up I meant giving up on fixing that post by the way.

Anyways, you're not going to be able to do it with the match primitive alone. It won't make a new string for you, which is what you need if you want to get the actual "The quick brown fox jumped over the dog" out of the match. The S&R would do it, but then you won't get the match. If you add some extra logic though, you can do it.

I hope I got your examples right, my LabVIEW stopped accepting snippets for some reason, so I rolled this one from scratch.

Other case is empty by the way, in case the snippets I produce are as defective as my ability to read them. Basically you need to construct the final string yourself.

JackDunaway · March 15, 2013

Anyways, you're not going to be able to do it with the match primitive alone. It won't make a new string for you...

Well, it's kinda not a new string... it's just a noncontiguous substring. (Can a substring be defined as 'noncontiguous'? Perhaps not, and what I desire is impossible.)

I hope I got your examples right, my LabVIEW stopped accepting snippets for some reason, so I rolled this one from scratch.

Wow, thanks for going out of your way to recreate it! Sorry snippets broke for you.

That is basically the way I'm solving the problem right now; with extra syntax. The prime motivation for finding a *purely* regex solution is to generalize this problem -- consider wanting to remove adjectives from both nouns:

The fox jumped over the dog

This general solution more closely matches my problem domain. (This thread presents the simplest form of the problem, since I can't even figure that out; or if the desired solution is even possible!)
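One way the "extra logic" approach might generalize to several removed parts is to capture everything that should be kept and join the groups. This is a sketch in Python, not a pure-regex answer. The literal anchors in the pattern and the name `keep_captured` are assumptions for this example.

```python
import re

# Capturing groups mark what to keep; the lazy .*? gaps between them are dropped.
pattern = re.compile(r"(The ).*?(fox jumped over the ).*?(dog)")

def keep_captured(text):
    """Concatenate every capture group of the first match, or return None."""
    m = pattern.search(text)
    return "".join(m.groups()) if m else None

print(keep_captured("The quick brown fox jumped over the lazy dog"))
# The fox jumped over the dog
```

The join step is the same regardless of how many gaps the pattern contains, so only the pattern changes from case to case.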

mje · March 15, 2013

Hah, no worries. I love these types of problems. Pure logic.

Non-contiguous strings exist, just not in LabVIEW.

I figured you had a good reason to use a regex because as it stood a regex was not the best way to do it: scanning would be faster.

You have an interesting problem though in that you want to replace and match at the same time, and apparently globally.

ShaunR · March 15, 2013

You are expecting that "Whole Match" really means Whole Match except those bits I don't want?

LV Help

whole match contains all the characters that match the expression entered in regular expression. Any substring matches the function finds appear in the submatch outputs.

To match individual components you have to create capture groups. The ?: syntax means that you exclude that component from the list of capture groups (this isn't a LV peculiarity, it's how all regex parsers work).

So whilst (The quick brown fox jumped over the\s*)([a-zA-Z0-9]*)(\s*dog) will give you three terminal outputs with

1. The quick brown fox jumped over the

2. lazy

3. dog

(The quick brown fox jumped over the\s*)(?:[a-zA-Z0-9]*)(\s*dog) will give you only two terminals with

1. The quick brown fox jumped over the

2. dog

Similarly, The quick brown fox jumped over the\s*([a-zA-Z0-9]*)\s*dog will give you only one terminal with

1. lazy

In all cases the Whole Match will give you the whole string if all capture groups match and bugger all if it doesn't.
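The same three patterns can be tried outside LabVIEW. A short Python sketch, reading the whitespace token in the patterns above as `\s*`, where `groups()` stands in for the submatch terminals and `group(0)` for Whole Match:

```python
import re

text = "The quick brown fox jumped over the lazy dog"

three = re.search(r"(The quick brown fox jumped over the\s*)([a-zA-Z0-9]*)(\s*dog)", text)
two = re.search(r"(The quick brown fox jumped over the\s*)(?:[a-zA-Z0-9]*)(\s*dog)", text)
one = re.search(r"The quick brown fox jumped over the\s*([a-zA-Z0-9]*)\s*dog", text)

print(three.groups())  # ('The quick brown fox jumped over the ', 'lazy', ' dog')
print(two.groups())    # ('The quick brown fox jumped over the ', ' dog')
print(one.groups())    # ('lazy',)

# The whole match is the full matched text in every case,
# including the part inside the non-capturing (?:...) group.
print(two.group(0) == text)  # True
```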

Edited March 15, 2013 by ShaunR

Phillip Brooks · March 15, 2013

So I've tried to load every snippet in this thread to LV2012 and all I get is a picture on the BD. I tested LV by grabbing a couple from the dark side and they work fine.

I created a few of my own as well and could load them back in.

Could there be a filter on the LAVAG server that is stripping out the metadata?

I'll share my SuperSecret toggle snippet here and see if I can load it back into LV.

EDIT:

I think something is happening on the server. My image file on disk was ~15k in size and when downloaded is only 5k.

Hmmm.... :shifty:

Edited March 15, 2013 by Phillip Brooks

ShaunR · March 15, 2013

So I've tried to load every snippet in this thread to LV2012 and all I get is a picture on the BD. I tested LV by grabbing a couple from the dark side and they work fine.

I created a few of my own as well and could load them back in.

Could there be a filter on the LAVAG server that is stripping out the metadata?

I'll share my SuperSecret toggle snippet here and see if I can load it back into LV.

EDIT:

I think something is happening on the server. My image file on disk was ~15k in size and when downloaded is only 5k.

Hmmm....

Yup.

I checked my snippet after Hoover said (on another thread) and it was fine.

I've now uploaded the same snippet to http://postimage.org/image/5p8stekp7/ and it works fine when downloaded. Definitely something going on with lavag.org. Maybe it's now being optimised/compressed?

Daklu · March 20, 2013

If you have to do regular expressions very often and want to learn them, RegexBuddy is your friend.
