String Find/Replace Algorithm - algorithm

I would like to be able to search a string for various words, when I find one, i want to split the string at that point into 3 parts (left, match, right), the matched text would be excluded, and the process would continue with the new string left+right.
Now, once i have all my matches done, i need to reverse the process by reinserting the matched words (or a replacement for them) at the point they were removed. I have never really found what i wanted in any of my searches, so I thought I would ask for input here on SO.
Please let me know if this question needs further description.
BTW - at the moment, i have a very poor algorithm that replaces matched text with a unique string token, and then replaces the tokens with the replacement text for the appropriate match after all the matches have been done.
This is the goal:
one two three four five six
match "three" replace with foo (remember we found three, and where we found it)
one two four five six
|
three
match "two four" and prevent it from being matched by anything (edited for clarity)
one five six
|
two four
|
three
at this point, you cannot match for example "one two"
all the matches have been found, now put their replacements back in (in reverse order)
one two four five six
|
three
one two foo four five six
What's the point? Preventing one match's replacement text from being matched by another pattern. (all the patterns are run at the same time and in the same order for every string that is processed)
I'm not sure the language matters, but I'm using Lua in this case.
I'll try rephrasing, I have a list of patterns i want to find in a given string, if I find one, I want to remove that part of the string so it isnt matched by anything else, but I want to keep track of where i found it so I can insert the replacement text there once I am done trying to match my list of patterns
Here's a related question:
Shell script - search and replace text in multiple files using a list of strings

Your algorithm description is unclear. There's no exact rule where the extracted tokens should be re-inserted.
Here's an example:
Find 'three' in 'one two three four five six'
Choose one of these two to get 'foo bar' as result:
a. replace 'one two' with 'foo' and 'four five six' with 'bar'
b. replace 'one two four five six' with 'foo bar'
Insert 'three' back in the step 2 resulting string 'foo bar'
At step 3 does 'three' goes before 'bar' or after it?
Once you've come up with clear rules for reinserting, you can easily implement the algorithm as a recursive method or as an iterative method with a replacements stack.

Given the structure of the problem, I'd probably try an algorithm based on a binary tree.

pseudocode:
for( String snippet in snippets )
{
int location = indexOf(snippet,inputData);
if( location != -1)
{
// store replacement text for a found snippet on a stack along with the
// location where it was found
lengthChange = getReplacementFor(snippet).length - snippet.length;
for each replacement in foundStack
{
// IF the location part of the pair is greater than the location just found
//Increment the location part of the pair by the lengthChange to account
// for the fact that when you replace a string with a new one the location
// of all subsequent strings will be shifted
}
//remove snippet
inputData.replace(snippet, "");
}
}
for( pair in foundStack )
{
inputData.insert( pair.text, pair.location);
}
This is basically just doing exactly as you said in your problem description. Step through the algorithm, putting everything on a stack with the location it was found at. You use a stack so when you reinsert in the second half, it happens in reverse order so that the stored "location" applies to the current state of the inputString.
Edited with a potential fix for commenter's criticism. Does the commented for block within the first one account for your criticisms, or is it still buggy in certain scenarios?

Related

Bash Script Loop for Substitution and Traversal

So I'm trying to figure out what a question from an old exam means and I'm slightly confused about one or two parts.
#!/bin/bash
awk '{$0 = tolower($0)
gsub(/[,.?;:#!\(\)]/),"",$0)
for(a=1;a<=NF;a++)
b[$a]++}
END print b[a],a}'
sort -sk2
Here is my interpretation:
target the bash script location
scan file with awk
convert string to lower case
sub all occurrences of symbols with nothing (ie. remove) and overwrite string
(here is my issue) for every field increment a by 1?
(again not sure what this is doing) b takes a's number and increments by 1?
end the for loop and print (b, a)
sort by size of the second field
I think the last four lines are my main issue. Also is it just me or is there an extra } in that question?
Thanks in advance.
The for loop is weirdly formatted. Here it is again with proper indentation:
for(a=1; a<=NF; a++)
b[$a]++
In other words, we loop over the field positions; for each, the count in the associative array b is incremented. So if the current input line is
foo bar poo bar baz
the script will do
b["foo"]++ # a is 1; $a is $1
b["bar"]++
b["poo"]++
b["bar"]++
b["baz"]++
So now b contains a set of tokens as keys, and the number of times each occurred as their respective values. In other words, this collects word counts for each word in the input.
The case folding and removal of punctuation normalizes the input so that
Word word word, word!
will count as four occurrences of "word", rather than one each for the capitalized version, the undecorated normal form, and the ones with punctuation attached at the end. It slightly distorts e.g. words which should properly be capitalized, and conflates into homographs words which are differentiated only by capitalization (such as china porcelain vs China the country.)
The END block is executed only when all input lines have been consumed, and thus b is fully loaded with all input words from all input lines, with their final counts. (Though here, there is no valid END block actually, because the opening brace after END is missing; this is a fatal syntax error. There isn't one closing brace too many, there's one non-optional opening brace missing.)

Apache NiFi: Extracting nth column from a csv [duplicate]

I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail#mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 7th entry (alpha-beta).
My first thought would not be to use a regular expression, but to use something that splits the string into an array on the comma, but since you asked for a regex.
most regexes allow you to specify a minimum or maximum match, so something like this would probably work.
/(?:[^\,]*,){5}([^,]*)/
This is intended to match any number of character that are not a comma followed by a comma six times exactly (?:[^,]*,){5} - the ?: says to not capture - and then to match and capture any number of characters that are not a comma ([^,]+). You want to use the first capture group.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of (n-1).) The content would be in capture group 2. This doesn't handle things like lists of size n, but that's just a matter of making the appropriate modifications for your situation.
#list = split /,/ => $string;
$it = $list[6];
or just
$it = (split /,/ => $string)[6];
Beats writing a pattern with a {6} in it every time.

Finding and Editing Multiple Regex Matches on the Same Line

I want to add markdown to key phrases in a (gollum) wiki page that will link to the relevant wiki page in the form:
This is the key phrase.
Becomes
This is the [[key phrase|Glossary#key phrase]].
I have a list of key phrases such as:
keywords = ["golden retriever", "pomeranian", "cat"]
And a document:
Sue has 1 golden retriever. John has two cats.
Jennifer has one pomeranian. Joe has three pomeranians.
I want to iterate over every line and find every match (that isn't already a link) for each keyword. My current attempt looks like this:
File.foreach(target_file) do |line|
glosses.each do |gloss|
len = gloss.length
# Create the regex. Avoid anything that starts with [
# or (, ends with ] or ), and ignore case.
re = /(?<![\[\(])#{gloss}(?![\]\)])/i
# Find every instance of this gloss on this line.
positions = line.enum_for(:scan, re).map {Regexp.last_match.begin(0) }
positions.each do |pos|
line.insert(pos, "[[")
# +2 because we just inserted 2 ahead.
line.insert(pos+len+2, "|#{page}\##{gloss}]]")
end
end
puts line
end
However, this will run into a problem if there are two matches for the same key phrase on the same line. Because I insert things into the line, the position I found for each match isn't accurate after the first one. I know I could adjust for the size of my insertions every time but, because my insertions are a different size for each gloss, it seems like the most brute-force, hacky solution.
Is there a solution that allows me to make multiple insertions on the same line at the same time without several arbitrary adjustments each time?
After looking at #BryceDrew's online python version, I realized ruby probably also has a way to fill in the match. I now have a much more concise and faster solution.
First, I needed to make regexes of my glosses:
glosses.push(/(?<![\[\(])#{gloss}(?![\]\)])/i)
Note: The majority of that regex is look-ahead and look-behind assertions to prevent catching a phrase that's already part of a link.
Then, I needed to make a union of all of them:
re = Regexp.union(glosses)
After that, it's as simple as doing gsub on every line, and filling in my matches:
File.foreach(target_file) do |line|
line = line.gsub(re) {|match| "[[#{match}|Glossary##{match.downcase}]]"}
puts line
end

Ruby (on Rails) Regex: removing thousands comma from numbers

This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?
result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes commas only if exactly three digits follow.
(?=...) is a lookahead assertion: It needs to be possible to be matched at the current position, but it's not becoming part of the text that is actually matched (and subsequently replaced).
You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
In your example, it is possible to tell from the number of digits after the last separator (either , or .) that it is a decimal point, since there are 2 lone digits. For most cases, if the last group of digits does not have 3 digits then you can assume that the separator in front is decimal point. Another sign is the multiple appearance of a separator in big numbers allows us to differentiate between decimal point and separators.
However, I can give a string 123,456 or 123.456 without any sort of context. It is impossible to tell whether they are "123 thousand 456" or "123 point 456".
You need to scan the document to look for clue whether , is used for thousand separator or decimal point, and vice versa for .. With the context provided, then you can safely apply the same method to remove the thousand separators.
You may also want to check out this article on Wikipedia on the less common ways to specify separators or decimal points. Knowing and deciding not to support is better than assuming things will work.

A more elegant way to parse a string with ruby regular expression using variable grouping?

At the moment I have a regular expression that looks like this:
^(cat|dog|bird){1}(cat|dog|bird)?(cat|dog|bird)?$
It matches at least 1, and at most 3 instances of a long list of words and makes the matching words for each group available via the corresponding variable.
Is there a way to revise this so that I can return the result for each word in the string without specifying the number of groups beforehand?
^(cat|dog|bird)+$
works but only returns the last match separately , because there is only one group.
OK, so I found a solution to this.
It doesn't look like it is possible to create an unknown number of groups, so I went digging for another way of achieving the desired outcome: To be able to tell if a string was made up of words in a given list; and to match the longest words possible in each position.
I have been reading Mastering Regular Expressions by Jeffrey E. F. Friedl and it shed some light on things for me. It turns out that NFA based Regexp engines (like the one used in Ruby) are sequential as well as lazy/greedy. This means that you can dictate how a pattern is matched using the order in which you give it choices. This explains why scan was returning variable results, it was looking for the first word in the list that matched the criteria and then moved on to the next match. By design it was not looking for the longest match, but the first one. So in order to rectify this all I needed to do was reorder the array of words used to generate the regular expression from alphabetical order, to length order (longest to shortest).
array = %w[ as ascarid car id ]
list = array.sort_by {|word| -word.length }
regexp = Regexp.union(list)
Now the first match found by scan will be the longest word available. It is also pretty simple to tell if a string contains only words in the list using scan:
if "ascarid".scan(regexp).join.length == word.length
return true
else
return false
end
Thanks to everyone that posted in response to this question, I hope that this will help others in the future.
You could do it in two steps:
Use /^(cat|dog|bird)+$/ (or better /\A(cat|dog|bird)+\z/) to make sure it matches.
Then string.scan(/cat|dog|bird/) to get the pieces.
You could also use split and a Set to do both at once. Suppose you have your words in the array a and your string in s, then:
words = Set.new(a)
re = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
parts = s.split(re).reject(&:empty?)
if(parts.any? {|w| !words.include?(w) })
# 's' didn't match what you expected so throw a
# hissy fit, format the hard drive, set fire to
# the backups, or whatever is appropriate.
else
# Everything you were looking for is in 'parts'
# so you can check the length (if you care about
# how many matches there were) or something useful
# and productive.
end
When you use split with a pattern that contains groups then
the respective matches will be returned in the array as well.
In this case, the split will hand us something like ["", "cat", "", "dog"] and the empty strings will only occur between the separators that we're looking for and so we can reject them and pretend they don't exist. This may be an unexpected use of split since we're more interested in the delimiters more than what is being delimited (except to make sure that nothing is being delimited) but it gets the job done.
Based on your comments, it looks like you want an ordered alternation so that (ascarid|car|as|id) would try to match from left to right. I can't find anything in the Ruby Oniguruma (the Ruby 1.9 regex engine) docs that says that | is ordered or unordered; Perl's alternation appears to be specified (or at least strongly implied) to be ordered and Ruby's certainly behaves as though it is ordered:
>> 'pancakes' =~ /(pan|pancakes)/; puts $1
pan
So you could sort your words from longest to shortest when building your regex:
re = /(#{a.sort_by{|w| -w.length}.map{|w| Regexp.quote(w)}.join('|')})/
and hope that Oniguruma really will match alternations from left to right. AFAIK, Ruby's regexes will be eager because they support backreferences and lazy/non-greedy matching so this approach should be safe.
Or you could be properly paranoid and parse it in steps; first you'd make sure your string looks like what you want:
if(s !~ /\A(#{a.map{|w| Regexp.quote(w)}.join('|')})+\z/)
# Bail out and complain that 's' doesn't look right
end
The group your words by length:
by_length = a.group_by(&:length)
and scan for the groups from the longest words to the shortest words:
# This loses the order of the substrings within 's'...
matches = [ ]
by_length.keys.sort_by { |k| -k }.each do |group|
re = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
s.gsub!(re) { |w| matches.push(w); '' }
end
# 's' should now be empty and the matched substrings will be
# in 'matches'
There is still room for possible overlaps in these approaches but at least you'd be extracting the longest matches.
If you need to repeat parts of a regex, one option is to store the repeated part in a variable and just reference that, for example:
r = "(cat|dog|bird)"
str.match(/#{r}#{r}?#{r}?/)
You can do it with .Net regular expressions. If I write the following in PowerShell
$pat = [regex] "^(cat|dog|bird)+$"
$m = $pat.match('birddogcatbird')
$m.groups[1].captures | %{$_.value}
then I get
bird
dog
cat
bird
when I run it. I know even less about IronRuby than I do about PowerShell, but perhaps this means you can do it in IronRuby as well.

Resources