Ruby grep, match and return

Is there any way to check if a value exists in a file without ALWAYS going through the entire file?
Currently I use:
if open('file.txt').grep(/value/).length > 0
  puts "match"
else
  puts "no match"
end
But it's not efficient, as I only want to know whether the value exists or not. I'd really appreciate a solution using grep or a similar one-liner.
Please note the "ALWAYS" before down-voting my question.

If you want line-by-line comparison using a one-liner (find stops reading at the first match):
matches = open('file.txt') { |f| f.each_line.find { |line| line.include?("value") } }
puts matches ? "yes" : "naaw"

By definition, the only way you can tell if an arbitrary expression exists in a file is by going over the file and looking for it. If you're looking for the first instance, then on average you'll be scanning half the file until you find your expression when it's there. If the expression isn't there then you'll have to scan the entire file to figure that out.
You could implement that in a one-liner by scanning the file line by line. Use IO.foreach.
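For instance, a minimal sketch with IO.foreach (without a block it returns a lazy enumerator, so any? stops reading at the first hit):
found = IO.foreach('file.txt').any? { |line| line =~ /value/ }
puts found ? "match" : "no match"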
If you do this often, then you can make the search super efficient by indexing the file first, e.g. by using Lucene. It's a trade-off: you still have to scan the file, but only once, since you save its content in a more search-friendly data structure. However, if you don't access a given file very frequently, it's probably not worth the overhead in implementation, maintenance and extra storage.

Here's a ruby one-liner that will work from the linux command line to perform a grep on a text file, stopping at the first match.
ruby -ne '(puts "first found on line #{$.}"; break) if $_ =~ /regex here/' file.txt
-n reads each line of the file into the global variable $_
$. is a global variable that stores the current line number
If you want to find all lines matching the regex, then:
ruby -ne 'puts "found on line #{$.}" if $_ =~ /regex here/' file.txt

Related

Sed or Perl: One file with regex instructions, one instruction per line, executed on another file

I'm setting up a regex learning environment purely in bash/tmux, with a pane for the file containing a regex, a pane for a text file for processing, and a pane for the bash shell. I'm at the start of the regex chapter of "The Bastards Book of Ruby".
The 'Bastards Book' shows an example of a 'negative-lookahead' regex (perfect, let's learn), where perl is recommended over sed. As I'm going for a CLI approach, the Bash command is: $ perl -p file_with_regex.pl test.txt
(This prints the lines from test.txt with the intended substitutions)
Question: How would I add a second regex (on a new line) to the regex.pl file, and have perl execute both the first and then this second instruction when processing the text file?
# regex.pl
s/^(?!Mr)/Ms./g
s/Ms./Mrs./g
(Adding the second regex results in "Execution of regex.pl aborted due to compilation errors.")
The overall aim here is to progress in Ruby while testing Regular Expressions as concisely as possible. Picking up a bare minimum of sed/perl while doing so would be a plus, as a proper dive into perl would take time away from Ruby (and when it's time for the perl dive, I'll have had some time with the basics). The more I look at this, the more it seems necessary to just do it in Ruby, if there isn't a perl switch that would enable a command-line-with-files approach.
The basic answer is that you need a semicolon after each line.
Paraphrased from perlrun, -p reads all lines of input, runs the commands you specified, and then prints out the value in $_ (the implicit variable you're running your substitute commands on in this script).
So, removing the magic, -p transformed your code into:
LINE:
while (<>) {
    # regex.pl
    s/^(?!Mr)/Ms./g
    s/Ms./Mrs./g
} continue {
    print or die "-p destination: $!\n";
}
Perl requires a semicolon between statements (though a terminal semicolon at the end of a block is optional), hence the error.
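With the semicolons added, regex.pl compiles and -p runs both substitutions on every line:
# regex.pl
s/^(?!Mr)/Ms./g;
s/Ms./Mrs./g;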
I personally would recommend writing the whole script above into the file instead of using -p because it is far less magical, but you're welcome to do it either way.
If you were going to write the whole script, I would recommend something more like the following:
use strict;
use warnings;

while ( my $line = <ARGV> ) {
    $line =~ s/^(?!Mr)/Ms./g;
    print "After first subst: $line";
    $line =~ s/Ms./Mrs./g;
    print "After second subst: $line";
}
use strict and use warnings are the boilerplate you want at the top of any perl script (to catch typos and other common mistakes) and explicitly calling the variable $line gives you a better understanding of how the script is working ($_ is very magical for beginners and the source of many errors IMO, but great when you know what's what).
If you're wondering about <> vs. <ARGV>: they are the same thing and mean "read through all the lines of the files provided as command-line arguments to this script, or standard input if no files are provided".

How do I print the line number of the file I am working with via ARGV?

I'm currently opening a file taken at runtime via ARGV:
File.open(ARGV[0]) do |f|
  f.each_line do |line|
Once a match is found I print output to the user.
if line.match(/(strcpy)/i)
  puts "[!] strcpy does not check for buffer overflows when copying to destination."
  puts "[!] Consider using strncpy or strlcpy (warning, strncpy is easily misused)."
  puts " #{line}"
end
I want to know how to print out the line number for the matching line in the (ARGV[0]) file.
Using print __LINE__ shows the line number from the Ruby script. I've tried many different variations of print __LINE__ with different string interpolations of #{line} with no success. Is there a way I can print out the line number from the file?
When Ruby's IO class opens a file, it sets the $. global variable to 0. For each line that is read, that variable is incremented, so to know which line has just been read, simply use $..
Look in the English module for $. or $INPUT_LINE_NUMBER.
We can also use the lineno method that is part of the IO class. I find that a bit more convoluted because we need an IO stream object to tack it onto, while $. always works.
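For comparison, a minimal sketch using lineno on the open file handle:
File.open(ARGV[0]) do |f|
  f.each_line do |line|
    puts "#{f.lineno}: #{line}" if line.match(/strcpy/i)
  end
end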
I'd write the loop more simply:
File.foreach(ARGV[0]) do |line|
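Filled out, with $. supplying the line number (the warning text is abbreviated from the question):
File.foreach(ARGV[0]) do |line|
  if line.match(/strcpy/i)
    puts "[!] strcpy does not check for buffer overflows (line #{$.}):"
    puts "    #{line}"
  end
end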
Something to think about is, if you're on a *nix system, you can use the OS's built-in grep or fgrep tool to greatly speed up your processing. The "grep" family of applications is highly optimized for doing what you want: it can find all occurrences or only the first, can use regular expressions or fixed strings, and can easily be called using Ruby's %x or backtick operators.
puts `grep -inm1 abacus /usr/share/dict/words`
Which outputs:
34:abacus
-inm1 combines -i ("ignore character-case"), -n ("output line numbers") and -m1 ("stop after the first occurrence")

replace $1 variable in file with 1-10000

I want to create 1000s of copies of this one file.
All I need to replace in the file is one var:
kitename = $1
But I want to do that 1000s of times to create 1000s of diff files.
I'm sure it involves a loop.
People answering people is more effective than google search!
thx
I'm not really sure what you are asking here, but the following will create 1000 files named filename.n, each containing the single line "kitename = n", for n = 1 to n = 1000:
for i in {1..1000}
do
  echo "kitename = $i" > filename.$i
done
If you have mysql installed, it comes with a lovely command-line util called "replace", which does in-place string replacement across any number of files. Too few people know about it, given that it exists on most linux boxen everywhere. Syntax is easy:
replace SEARCH_STRING REPLACEMENT -- targetfiles*
If you MUST use sed for this... that's okay too :) The syntax is similar:
sed -i.bak 's/SEARCH_STRING/REPLACEMENT/g' targetfile.txt
So if you're just using numbers, you'd use something like:
for a in {1..1000}
do
  cp inputFile.html outputFile-$a.html
  replace kitename $a -- outputFile-$a.html
done
This will produce a bunch of files "outputFile-1.html" through "outputFile-1000.html", with the word "kitename" replaced by the relevant number, inside the file.
But, if you want to read your names from a file rather than generate them by magic, you might want something more like this (we're not using for a in $(cat file), since that splits on words, and I'm assuming you might have multi-word replacement strings to put in):
cat kitenames.txt | while read -r a
do
  cp inputFile.html "outputFile-$a.html"
  replace kitename "$a" -- "outputFile-$a.html"
done
This will produce a bunch of files like "outputFile-red kite.html" and "outputFile-kite with no string.html", which have the word "kitename" replaced by the relevant name, inside the file.

Remove line break if line does not start with KEYWORD

I have a flat file with lines that look like
KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....
How do I go about removing the linebreak so that
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
turns into
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
This is in a HP-UNIX environment and I can move the file to another system (windows box with powershell and ruby installed).
I don't know what tools you are using, but you can use this regex to match every \n (or maybe \r) that isn't followed by KEYWORD, so you can replace it with a space and you're done.
Regex: \r(?!KEYWORD) (With global modifier)
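In Ruby, for example, that replacement could be a one-line gsub (a sketch; the filenames are placeholders, and use \r\n in place of \n if the file has Windows line endings):
text = File.read('flatfile.txt')
File.write('fixed.txt', text.gsub(/\n(?!KEYWORD)/, ' '))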
Ruby's Array has a nice method called slice_before that it inherits from Enumerable, which comes to the rescue here:
require 'pp'
text = 'KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....'
pp text.split("\n").slice_before(/^KEYWORD/).map{ |a| a.join(' ') }
=> ["KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING",
"KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING",
"KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE",
"KEYWORD|....."]
This code just splits your text on line breaks, then uses slice_before to break the resulting array into sub-arrays, one for each block of text starting with /^KEYWORD/. Then it walks through the resulting sub-arrays, joining them with a single space. Any line that wasn't pre-split will be left alone. Ones that were broken are rejoined.
For real use you'd probably want to replace pp with a regular puts.
As for moving the code to Windows with Ruby, why? Install Ruby on HP-Unix and run it there. It's a more natural fit.
This short awk one-liner should do the job:
awk '/^KEYWORD/{print ""}{printf "%s", $0}' file
This might work for you (GNU sed):
sed ':a;$!{N;/\n.*|/!{s/\n/ /;ba}};P;D' file
Keep two lines in the pattern space, and if the second line doesn't contain a |, replace the newline with a space; repeat until it does or the end of the file is reached.
This assumes the field that overflows is the last one; otherwise key on the KEYWORD, like so:
sed ':a;$!{N;/\nKEYWORD/!{s/\n/ /;ba}};P;D' file
Powershell way:
[System.IO.File]::ReadAllText( "c:\myfile.txt" ) -replace "`r`n(?!KEYWORD)", ' '
You can use sed or awk (preferred) for this »
sed -n 's|\r||g;$!{1{x;d};H};${H;x;s|\n\(KEYWORD\)|\r\1|g;s|\n||g;s|\r|\n|g;p}' file.txt
awk 'BEGIN{ORS="";}NR==1{print;next;}/^KEYWORD/{print"\n";print;next;}{print;}' file.txt
Note: write each command (sed, awk) on a single line.

How to efficiently parse large text files in Ruby

I'm writing an import script that processes a file that has potentially hundreds of thousands of lines (log file). Using a very simple approach (below) took enough time and memory that I felt like it would take out my MBP at any moment, so I killed the process.
#...
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
This file in particular has 642,868 lines:
/code/src/myimport$ wc -l ../nginx.log
642868 ../nginx.log
Does anyone know of a more efficient (memory/cpu) way to process each line in this file?
UPDATE
The code inside of the f.each_line from above is simply matching a regex against the line. If the match fails, I add the line to a @skipped array. If it passes, I format the matches into a hash (keyed by the "fields" of the match) and append it to a @results array.
# regex built in `def initialize` (not on each line iteration)
@regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/

# ... loop lines
match = line.match(@regex)
if match.nil?
  @skipped << line
else
  @results << convert_to_hash(match)
end
I'm completely open to this being an inefficient process. I could make the code inside of convert_to_hash use a precomputed lambda instead of figuring out the computation each time. I guess I just assumed it was the line iteration itself that was the problem, not the per-line code.
I just did a test on a 600,000-line file and it iterated over the file in less than half a second. I'm guessing the slowness is not in the file looping but in the line parsing. Can you paste your parse code as well?
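If you want to reproduce that kind of check, here's a minimal sketch using the standard library's Benchmark (the filename is a placeholder):
require 'benchmark'

# Time a bare iteration over the file, doing no per-line work
puts Benchmark.measure { File.foreach('nginx.log') { |line| } }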
This blog post includes several approaches to parsing large log files. Maybe that's an inspiration. Also have a look at the file-tail gem.
If you are using bash (or similar) you might be able to optimize like this:
In input.rb:
while x = gets
  # Parse
end
then in bash:
cat nginx.log | ruby -n input.rb
The -n flag tells ruby to assume a 'while gets(); ... end' loop around your script, which might cause it to do something special to optimize.
You might also want to look into a prewritten solution to the problem, as that will be faster.
