Matching strings with variations in spaces - ruby

I have a log file with a lot of arbitrary spaces:
Number of active files:
20
Missing files:
10
I am trying to determine if a certain string like this:
Expected_string = "Number of active files: 20"
is contained in the log file. Is there an easy way to compare the strings disregarding the variation in spaces?
I am using a method that looks like this:
def isStringInLog?(string)
if open(@log_full_path).grep(/#{string}/).size > 0
return true
end
return false
end
However, this only works if the strings match exactly.

You can use Ruby's gsub method to turn all instances of one or more whitespace characters (including newlines) into a single space:
def string_in_log?(str)
File.read(@log_full_path).gsub(/\s+/, " ").include?(str)
end
gsub uses the regular expression /\s+/ to replace all groups of whitespace with one space.
Also, Ruby variables (except constants) and method names should begin with a lowercase letter and use snake_case, not camelCase.
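As a minimal sketch of the same idea, the snippet below normalizes both the log text and the expected string, so stray spaces in either one do not matter. The helper names normalize_spaces and text_contains? and the sample log text are assumptions made up for illustration.

```ruby
# Hypothetical helper: squeeze every run of whitespace down to one space.
def normalize_spaces(text)
  text.gsub(/\s+/, " ").strip
end

# Hypothetical helper: compare after normalizing both sides.
def text_contains?(log_text, expected)
  normalize_spaces(log_text).include?(normalize_spaces(expected))
end

# Made-up log text with irregular spacing and newlines.
log_text = "Number of   active files:\n   20\nMissing files:\n10\n"
puts text_contains?(log_text, "Number of active files:  20") # prints true
puts text_contains?(log_text, "Number of active files: 30")  # prints false
```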

Maybe break and rejoin the log?
log = <<-EOF
Number of active files:
20
Missing files:
10
EOF
pattern = 'Number of active files: 20'
puts log.split.join(' ').include?(pattern) # true
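If the original grep-style regex match is preferred, one alternative (not taken from the answers above) is to build a pattern in which every gap between words accepts any run of whitespace. The helper name whitespace_tolerant_regex is hypothetical.

```ruby
# Hypothetical helper: turn "a b c" into a regex equivalent to /a\s+b\s+c/,
# escaping each word so characters like "." are matched literally.
def whitespace_tolerant_regex(expected)
  Regexp.new(expected.split.map { |word| Regexp.escape(word) }.join('\s+'))
end

log = "Number of active files:\n   20\nMissing files:\n10\n"
pattern = whitespace_tolerant_regex("Number of active files: 20")
puts log.match?(pattern) # prints true
```

Note that this is a substring match, so a log containing "200" would also satisfy a pattern built from "20".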

Related

Ruby: How to copy a word in a string containing a specific sequence of letters?

I am trying to read in a text file and iterate through every line. If the line contains "_u" then I want to copy that word in that line.
For example:
typedef struct {
reg 1;
reg 2;
} buffer_u;
I want to copy the word buffer_u.
This is what I have so far (everything up to how to copy the word in the string):
f_in = File.open( h_file )
text = f_in.read
text.each_line do |line|
if line.include? "_u"
# copy word
# add to output file
end
end
Thanks in advance for your help!
Don't make it harder than it has to be. If you want to scan a body of text for words that match a criterion, do just that:
text = "
word_u1
something
_u1 foo
bar _u2
another word_u2
typedef struct {
reg 1;
reg 2;
} buffer_u;
"
text.scan(/\w+/).select{ |w| w['_u'] }
# => ["word_u1", "_u1", "_u2", "word_u2", "buffer_u"]
Regex are useful, but the more complex ("smarter") they are, the slower they run unless you are very careful to anchor them, as anchors give them hints on where to look. Without those, the engine tries a number of things to determine exactly what you want, and that can really bog down the processing.
I recommend instead simply grabbing the words in the text:
scan(/\w+/)
Then filtering out the ones that match:
select{ |w| w['_u'] }
Using select with a simple sub-string search w['_u'] is extremely fast.
It could probably run faster using split() instead of scan(/\w+/) but you'll have to deal with cleaning up non-word characters.
Note: \w means [a-zA-Z0-9_] so what we generally call a "word" character is actually a "variable" definition for most languages since words generally don't include digits or _.
You can probably reduce your code to:
File.read( h_file ).scan(/\w+/).select{ |w| w['_u'] }
That will return an array of matching words.
Caveat: Using read has scalability issues. If you're concerned about the size of the file being read (which you always should be) then use foreach and iterate over the file line-by-line. You will probably see no change in processing speed.
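A line-by-line version of the same scan/select idea might look like the sketch below. The helper name words_containing is hypothetical, and the temporary file only stands in for h_file so the snippet is self-contained.

```ruby
require "tempfile"

# Hypothetical helper: collect every word containing `fragment`,
# reading the file one line at a time instead of all at once.
def words_containing(path, fragment)
  words = []
  File.foreach(path) do |line|
    words.concat(line.scan(/\w+/).select { |w| w.include?(fragment) })
  end
  words
end

# Stand-in for h_file, written to a temporary file.
Tempfile.create("sample_h") do |file|
  file.write("typedef struct {\n  reg 1;\n  reg 2;\n} buffer_u;\n")
  file.flush
  p words_containing(file.path, "_u") # prints ["buffer_u"]
end
```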
You can try something like this:
words = []
File.open( h_file ) { |file| file.each_line { |line|
words << line.scan(/\w+/).find { |a| a =~ /_u/ }
}}
words.compact!
# => ["buffer_u"]
puts words
# buffer_u
This regex should catch a word ending with _u
(\w*_u)(?!\w)
The matching group will match a word ending with _u that is not followed by letters, digits, or underscores.
If you want _u to appear anywhere in a word use
(\w*_u\w*)
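To see how the two patterns differ, here is a small sketch run against a made-up sample string. The constant names and the matches_for helper are assumptions for illustration only.

```ruby
ENDS_WITH_U = /(\w*_u)(?!\w)/ # word must end in _u
CONTAINS_U  = /(\w*_u\w*)/    # _u may appear anywhere in the word

# Hypothetical helper: scan returns one array per match when the regex
# has a capture group, so flatten the result into a list of words.
def matches_for(regex, text)
  text.scan(regex).flatten
end

sample = "} buffer_u; pig_u_o word_u2"
p matches_for(ENDS_WITH_U, sample) # prints ["buffer_u"]
p matches_for(CONTAINS_U, sample)  # prints ["buffer_u", "pig_u_o", "word_u2"]
```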
This will return all such words in the file, even if there are two or more in a line:
r = /
\w* # match >= 0 word characters
_u # match string
\w* # match >= 0 word characters
/x # extended mode
File.read(fname).scan r
For example:
str = "Cat_u has 9 lives, \n!dog_u has none and \n pig_u_o and cow_u, 3."
fname = 'temp'
File.write(fname, str)
#=> 63
Confirm the file contents:
File.read(fname)
#=> "Cat_u has 9 lives, \n!dog_u has none and \n pig_u_o and cow_u, 3."
Extract strings:
File.read(fname).scan r
#=> ["Cat_u", "dog_u", "pig_u_o", "cow_u"]
It's not difficult to modify this code to return at most one string per line. Simply read the file into an array of lines (or read a line at a time) and execute s = line[r]; arr << s if s for each line, where r is the above regex.
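The one-match-per-line variant described above could be sketched like this. The helper name first_match_per_line is hypothetical, and the regex is the single-line equivalent of the extended-mode pattern r.

```ruby
# Hypothetical helper: keep only the first matching string on each line.
def first_match_per_line(text, regex)
  arr = []
  text.each_line do |line|
    s = line[regex] # first match on this line, or nil
    arr << s if s
  end
  arr
end

r = /\w*_u\w*/
str = "Cat_u has 9 lives, \n!dog_u has none and \n pig_u_o and cow_u, 3."
p first_match_per_line(str, r) # prints ["Cat_u", "dog_u", "pig_u_o"]
```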

Ruby - Get file contents with in a separator in an array

I have a file like this:
some content
some oterh
*********************
useful1 text
useful3 text
*********************
some other content
How do I get the content of the file between the two lines of stars into an array? For example, on processing the above file, the content of the array should be like this:
a=["useful1 text" , "useful2 text"]
A really hacky solution is to split the content on the lines of stars, grab the middle part, and then split that into lines, too:
content.split(/^\*+$/)[1].split(/\n+/).reject(&:empty?)
# => ["useful1 text", "useful3 text"]
f = File.open('test_doc.txt', 'r')
content = []
f.each_line do |line|
content << line.rstrip unless !!(line =~ /^\*(\*)*\*$/)
end
f.close
The regex pattern /^\*(\*)*\*$/ matches lines that consist only of asterisks (two or more). !!(line =~ /^\*(\*)*\*$/) always returns a boolean value, so every line that does not match the pattern is added to the array.
What about this:
def values_between(array, separator)
array.slice array.index(separator)+1..array.rindex(separator)-1
end
filepath = '/tmp/test.txt'
lines = %w(trash trash separator content content separator trash)
separator = "separator\n"
File.write filepath, lines.join("\n")
values_between File.readlines(filepath), separator
#=> ["content\n", "content\n"]
I'd do it like this:
lines = []
File.foreach('./test.txt') do |li|
lines << li if (li[/^\*{5}/] ... li[/^\*{5}/])
end
lines[1..-2].map(&:strip).select{ |l| l > '' }
# => ["useful1 text", "useful3 text"]
/^\*{5}/ means "a line that starts with at least five '*' characters".
... is one of two uses of .. and ... and, in this use, is commonly called a "flip-flop" operator. It isn't used often in Ruby because most people don't seem to understand it. It's sometimes mistaken for the Range delimiters .. and ....
In this use, Ruby watches for the first test, li[/^\*{5}/], to return true. Once it does, .. or ... will return true until the second condition returns true. In this case we're looking for the same delimiter, so the same test will work, li[/^\*{5}/], and this is where the difference between the two versions, .. and ..., comes into play.
.. tests the second condition on the same line that satisfied the first, so it would toggle back to false immediately, whereas ... waits until the next line before testing it, which avoids the problem of the first test seeing a delimiter and then the second test seeing the same line and triggering.
That lets the test assign to lines, which, prior to the [1..-2].map(&:strip).select{ |l| l > '' } looks like:
# => ["*********************\n",
# "\n",
# "useful1 text\n",
# "\n",
# "useful3 text\n",
# "\n",
# "*********************\n"]
[1..-2].map(&:strip).select{ |l| l > '' } cleans that up by slicing the array to remove the first and last elements, strip removes leading and trailing whitespace, effectively getting rid of the trailing newlines and resulting in empty lines and strings containing the desired text. select{ |l| l > '' } picks up the lines that are greater than "empty" lines, i.e., are not empty.
See "When would a Ruby flip-flop be useful?" and its related questions, and "What is a flip-flop operator?" for more information and some background. (Perl programmers use .. and ... often, for just this purpose.)
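The difference between the two flip-flop forms can be seen in a small sketch. The helper names and the sample lines are made up for illustration, and the delimiter test is the same one used above.

```ruby
# Hypothetical helper: three-dot flip-flop, the closing test starts on the next line.
def three_dot_lines(lines)
  kept = []
  lines.each { |li| kept << li if (li[/^\*{5}/] ... li[/^\*{5}/]) }
  kept
end

# Hypothetical helper: two-dot flip-flop, the closing test runs on the same line.
def two_dot_lines(lines)
  kept = []
  lines.each { |li| kept << li if (li[/^\*{5}/] .. li[/^\*{5}/]) }
  kept
end

lines = ["junk\n", "*****\n", "useful1 text\n", "useful3 text\n", "*****\n", "more junk\n"]
p three_dot_lines(lines)
# prints ["*****\n", "useful1 text\n", "useful3 text\n", "*****\n"]
p two_dot_lines(lines)
# prints ["*****\n", "*****\n"]
```

With .. each delimiter line switches the flip-flop on and off again in one step, so only the delimiters themselves are kept.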
One warning though: If the file has multiple blocks delimited this way, you'll get the contents of them all. The code I wrote doesn't know how to stop until the end-of-file is reached, so you'll have to figure out how to handle that situation if it could occur.
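One possible way to handle several delimited blocks, sketched here without the flip-flop, is to track explicitly whether we are inside a block and collect each block separately. The helper name blocks_between and the sample lines are assumptions for illustration.

```ruby
# Hypothetical helper: return one array of stripped, non-empty lines per delimited block.
def blocks_between(lines, delimiter = /^\*{5}/)
  blocks = []
  current = nil # nil while outside a block, an array while inside one
  lines.each do |line|
    if line.match?(delimiter)
      if current
        blocks << current # closing delimiter: finish the block
        current = nil
      else
        current = []      # opening delimiter: start a new block
      end
    elsif current
      text = line.strip
      current << text unless text.empty?
    end
  end
  blocks
end

lines = [
  "some content\n", "*********************\n", "useful1 text\n", "\n",
  "useful3 text\n", "*********************\n", "some other content\n",
  "*********************\n", "second block\n", "*********************\n"
]
p blocks_between(lines)
# prints [["useful1 text", "useful3 text"], ["second block"]]
```

Taking blocks_between(lines).first would then give only the first block.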

Ruby: How to append to each line of a string based on a given regex?

I want to append </tag> to each line where it's missing:
text = '<tag>line 1</tag>
<tag>line2 # no closing tag, append
<tag>line3 # no closing tag, append
line4</tag> # no opening tag, but has a closing tag, so ignore
<tag>line5</tag>'
I tried to create a regular expression to match this but I know it's wrong:
text.gsub! /.*?(<\/tag>)\Z/, '</tag>'
How can I create a regular expression to conditionally append each line?
Here you go:
text.gsub!(%r{(?<!</tag>)$}, "</tag>")
Explanation:
$ means end of line and \z means end of string. \Z means something similar, with complications.
(?<!...) is a negative lookbehind: the $ only matches where the text just before it is not </tag>.
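A quick check of the lookbehind approach on the sample text (without the inline comments) could look like the sketch below. The helper name close_missing_tags is hypothetical, and it uses the non-mutating gsub so the input is left untouched.

```ruby
# Hypothetical helper: append </tag> at every line end not already preceded by it.
def close_missing_tags(text)
  text.gsub(%r{(?<!</tag>)$}, "</tag>")
end

text = "<tag>line 1</tag>\n<tag>line2\n<tag>line3\nline4</tag>\n<tag>line5</tag>"
puts close_missing_tags(text)
# prints:
# <tag>line 1</tag>
# <tag>line2</tag>
# <tag>line3</tag>
# line4</tag>
# <tag>line5</tag>
```

This assumes the text has no blank lines and no trailing newline; an empty line also has a line end that is not preceded by </tag>, so it would receive a tag as well.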
Given the example provided, I'd just do something like this:
text.split(/<\/?tag>/).
reject {|t| t.strip.length == 0 }.
map {|t| "<tag>%s</tag>" % t.strip }.
join("\n")
You're basically treating either <tag> or </tag> as record delimiters, so you can just split on them, reject any blank records, then construct a new combined string from the extracted values. This works nicely when you can't count on newlines being record delimiters and will generally be tolerant of missing tags.
If you're insistent on a pure regex solution, though, and your data format will always match the given format (one record per line), you can use a negative lookbehind:
text.strip.gsub(/(?<!<\/tag>)(\n|$)/, "</tag>\\1")
One that could work is:
/<tag>[^\n ]+[^>][\s]*(\n)/
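This will match the lines that open with <tag> and have no ">" before the newline character; the newline is captured in the last group.
Capture the line's content too, and replace the match with that content followed by "</tag>\n", i.e.
text.gsub!( /(<tag>[^\n ]+[^>])[\s]*(\n)/ , "\\1</tag>\n")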
For more polishing, try http://rubular.com/
text = '<tag>line 1</tag>
<tag>line2
<tag>line3
line4</tag>
<tag>line5</tag>'
result = ""
text.each_line do |line|
line.rstrip!
line << "</tag>" if not line.end_with?("</tag>")
result << line << "\n"
end
puts result
--output:--
<tag>line 1</tag>
<tag>line2</tag>
<tag>line3</tag>
line4</tag>
<tag>line5</tag>

How do I write a regular expression that will match characters in any order?

I'm trying to write a regular expression that will match a set of characters without regard to order. For example:
str = "act"
str.scan(/Insert expression here/)
would match:
cat
act
tca
atc
tac
cta
but would not match ca, ac or cata.
I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.
To clarify a bit, I'm using ruby and do not want to allow repeat characters.
Here is your solution
^(?:([act])(?!.*\1)){3}$
^ # matches the start of the string
(?: # open a non capturing group
([act]) # The characters that are allowed and a capturing group
(?!.*\1) # That character is matched only if it does not occur once more, Lookahead assertion
){3} # Defines the amount of characters
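$ # matches the end of the string
The only special thing is the lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string. (In Ruby they anchor at line boundaries; use \A and \z to anchor the whole string.)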
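In Ruby, the pattern can be tried against the words from the question as in this sketch. The constant and helper names are made up, and \A and \z are swapped in for ^ and $ so that the whole string is anchored rather than a single line.

```ruby
# Each of the three characters must come from [act] and must not
# reappear later in the string (negative lookahead on the backreference).
ANAGRAM_OF_ACT = /\A(?:([act])(?!.*\1)){3}\z/

# Hypothetical helper wrapping the match.
def anagram_of_act?(word)
  word.match?(ANAGRAM_OF_ACT)
end

%w[cat act tca atc tac cta ca ac cata aac].each do |w|
  puts "#{w}: #{anagram_of_act?(w)}"
end
# The six permutations print true; ca, ac, cata and aac print false.
```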
[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.
Edit: as mentioned by @georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:
^(?=[act]{3}$)(?!.*(.).*\1).*$
That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.
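The two-lookahead version can be checked the same way. This is only a sketch on single-line input; the constant and helper names are hypothetical.

```ruby
# (?=[act]{3}$)   exactly three characters, all from a, c, t
# (?!.*(.).*\1)   no character may occur twice
NO_REPEATS_ACT = /^(?=[act]{3}$)(?!.*(.).*\1).*$/

# Hypothetical helper wrapping the match.
def three_distinct_act_letters?(word)
  word.match?(NO_REPEATS_ACT)
end

puts three_distinct_act_letters?("tca")  # prints true
puts three_distinct_act_letters?("aac")  # prints false
puts three_distinct_act_letters?("cata") # prints false
```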
Here's how I'd go about it:
regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /\b(?:act|atc|cat|cta|tac|tca)\b/
%w[
cat act tca atc tac cta
ca ac cata
].each do |w|
puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end
That outputs:
"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match
I use the technique of passing an array into Regexp.union for a lot of things; it works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:
puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)
puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b
Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.
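The hash-driven search/replace mentioned above might be sketched as follows. The helper name replace_words, the replacement table, and the sample sentence are all made up for illustration.

```ruby
# Hypothetical helper: build one pattern from the hash keys, then let gsub
# look up each matched word in the same hash.
def replace_words(text, replacements)
  pattern = /\b(?:#{Regexp.union(replacements.keys).source})\b/
  text.gsub(pattern, replacements)
end

replacements = { "cat" => "feline", "dog" => "canine" }
puts replace_words("the cat chased the dog, not the catalog", replacements)
# prints: the feline chased the canine, not the catalog
```

The word boundaries keep "catalog" from being touched, and source is used so that no flags are embedded in the surrounding pattern.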
If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:
use List::Permutor;
use Regexp::Assemble;
my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my @set = $perm->next) {
$regex_assembler->add(join('', @set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.
I will assume several things here:
- You are looking for permutations of given characters
- You are using ruby
str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}
# and for the actual test
permutations.include?("cat")
It is no regex though.
No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like @scones suggested.
"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]
And if you really need a regex out of it for scanning a larger string, you can always:
/\b(#{Regexp.union("act".split('').permutation.map(&:join)).source})\b/
=> /\b(act|atc|cat|cta|tac|tca)\b/
Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.
EDIT: Added word boundaries for false positive on cata based on @theTinMan's feedback.

Using select rather than gsub to avoid multiple regex evaluations in Ruby

Here is one approach that requires multiple regex evaluations but does what I want (remove everything except the text).
words = IO.read("file.txt").
gsub(/\s/, ""). # delete white spaces
gsub(".",""). # delete periods
gsub(",",""). # delete commas
gsub("?","") # delete Q marks
puts words
# output
# WheninthecourseofhumaneventsitbecomesnecessaryIwanttobelieveyoureallyIdobutwhoamItoblameWhenthefactsarecountedthenumberswillbereportedLotsoflaughsCharlieIthinkIheardthatonetentimesbefore
Looking at this post - Ruby gsub : is there a better way - I figured I would try to do a match to accomplish the same result without multiple regex evaluations. But I don't get the same output.
words = IO.read("file.txt").
match(/(\w*)+/)
puts words
# output - this only gets the first word
# When
And this only gets the first sentence:
words = IO.read("file.txt").
match(/(...*)+/)
puts words
# output - this only gets the first sentence
# When in the course of human events it becomes necessary.
Any suggestions on getting the same output (including stripping out white spaces and non-word characters) on a match rather than gsub?
You can do what you want in one gsub operation:
s = 'When in the course of human events it becomes necessary.'
s.gsub /[\s.,?]/, ''
# => "Wheninthecourseofhumaneventsitbecomesnecessary"
You don't need multiple regex evaluations for this.
str = "# output - this only gets the first sentence
# When in the course of human events it becomes necessary."
p str.gsub(/\W/, "")
#=>"outputthisonlygetsthefirstsentenceWheninthecourseofhumaneventsitbecomesnecessary"
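Since the question asks for a match-style alternative to gsub, two other options are sketched below: collecting the word characters with scan and joining them, or removing a fixed set of characters with String#delete, which uses no regex at all. The helper names and the sample sentence are assumptions for illustration.

```ruby
# Hypothetical helper: keep only runs of word characters and glue them together.
def letters_only_scan(text)
  text.scan(/\w+/).join
end

# Hypothetical helper: delete whitespace, periods, commas and question marks.
def letters_only_delete(text)
  text.delete(" \t\r\n.,?")
end

s = "When in the course, of human events? Yes."
puts letters_only_scan(s)   # prints WheninthecourseofhumaneventsYes
puts letters_only_delete(s) # prints WheninthecourseofhumaneventsYes
```

The two differ on other punctuation: the scan version drops every non-word character, while the delete version removes only the characters listed.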
