Ruby script for matching 3 patterns on ruby - ruby

I have a fail2ban.log from which I want to grab specific fields, from 'Ban' strings. I can grab the data I need using regex one at the time, but I am not able to combine them. A typical 'fail2ban' log file has many strings. I'm interested in strings like these:
2012-05-02 14:47:40,515 fail2ban.actions: WARNING [ssh-iptables] Ban 84.xx.xx.242
xx = numbers (digits)
I want to grab: a) Date and Time, b) Ban (keyword), c) IP address
Here is my regex:
IP = (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
date & time = ^(\d{4}\W\d{2}\W\d{2}\s\d{2}\W\d{2}\W\d{2})
My problem here is, how can I combine these three together. I tried something like this:
^(?=^\d{4}\W\d{2}\W\d{2}\s\d{2}\W\d{2}\W\d{2})(?=\.*d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$)(?=^(?Ban).)*$).*$
but does not work as I would wanted too.
To give a clearer example, here is what I want:
greyjewel:FailMap atma$ cat fail2ban.log |grep Ban|awk -F " " '{print $1, $2, $7}'|tail -n 3
2012-05-02 14:47:40,515 84.51.18.242
2012-05-03 00:35:44,520 202.164.46.29
2012-05-03 17:55:03,725 203.92.42.6
Best Regards

A pretty direct translation of the example
ruby -alne 'BEGIN {$,=" "}; print $F.values_at(0,1,-1) if /Ban/' fail2ban.log
And because I figure you must want them from within Ruby
results = File.foreach("input").grep(/Ban/).map { |line| line.chomp.split.values_at 0, 1, -1 }

If the field placement doesn't change, you don't even need a regex here:
log_line =
'2012-05-02 14:47:40,515 fail2ban.actions: WARNING [ssh-iptables] Ban 84.12.34.242'
date, time, action, ip = log_line.split.values_at(0,1,-2,-1)

Related

Checking really fast if one of many strings exists in one of many other strings, in Perl

Let's say I have a set of 100,000 strings, each one around 20-50 bytes on average.
Then let's say I have another set of 100,000,000 strings, each one also around 20-50 bytes on average.
I'd like to go through all 100,000,000 strings from the second set and check if any one of the strings from the first set exist in any one string from the second set.
Example: string from first set: "abc", string from second set: "123abc123" = match!
I've tried using Perl's index(), but it's not fast enough. Is there a better way to do this type of matching?
I found Algorithm::AhoCorasik::XS on CPAN, which implements the classic, very efficient multiple string search Aho-Corasick algorithm (Same one used by grep -F), and it seems to be reasonably fast (About half the speed of an equivalent grep invocation):
Example script:
#!/usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;
open my $set1, "<", "set1.txt";
my #needles = <$set1>;
chomp #needles;
my $search = Algorithm::AhoCorasick::XS->new(\#needles);
open my $set2, "<", "set2.txt";
while (<$set2>) {
chomp;
say if defined $search->first_match($_);
}
and using it (With randomly-generated test files):
$ wc -l set1.txt set2.txt
10000 set1.txt
500000 set2.txt
510000 total
$ time perl test.pl | wc -l
458414
real 0m0.403s
user 0m0.359s
sys 0m0.031s
$ time grep -Ff set1.txt set2.txt | wc -l
458414
real 0m0.199s
user 0m0.188s
sys 0m0.031s
You should use a regex alternation, like:
my #string = qw/abc def ghi/;
my $matcher = qr/#{[join '|', map quotemeta, sort #string]}/;
This should be faster than using index. But it can be made faster yet:
Up to a certain limit, depending on both the length and number of the strings, perl will build a trie for efficient matching; see e.g. https://perlmonks.org/?node_id=670558. You will want to experiment with how many strings you can include in a single regex to generate an array of regexes. Then combine those separate regexes into a single one (untested):
my #search_strings = ...;
my #matchers;
my $string_limit = 3000; # a guess on my part
my #strings = sort #search_strings;
while (my #subset = splice #strings, 0, $string_limit) {
push #matchers, qr/^.*?#{[join '|', map quotemeta, sort #subset]}/s;
}
my $matcher = '(?:' . join('|', map "(??{\$matchers[$_]})", 0..$#matchers) . ')';
$matcher = do { use re 'eval'; qr/$matcher/ };
/$matcher/ and print "line $. matched: $_" while <>;
The (??{...}) construct is needed to join the separate regexes; without it, the subregexes are all just interpolated and the joined regex compiled all together, removing the trie optimization. Each subregex starts with ^.*? so it searches the entire string; without that, the joined regex would have to invoke each subregex separately for each position in the string.
Using contrived data, I'm seeing about 3000 strings searched per second with this approach in a not very fast vm. Using the naive regex is less than 50 strings per second. Using grep, as suggested in a comment by Shawn, is faster (about 4200 strings per second for me) but gives you less control if you want to do things like identify which strings matched or at what positions.
You may want to have a look at https://github.com/leendo/hello-world .
Its parallel processing makes it really fast. Either just type in all search terms individually or as || conjunctions, or (better) adapt it to run the second set programatically in one go.
Here is an idea: you could partition the dictionary into lists of words that have the same 2 or 3 letter prefix. You would then iterate on the large set and for each position in each string, extract the prefix and try and match the strings that have this prefix.
You would use hashtable to store the lists with O(1) lookup time.
If some words in the dictionary are shorter than the prefix length, you would have to special case short words.
Making prefixes longer will make the hashtable larger but the lists shorter, improving the test time.
I have no clue if this can be implemented efficiently in Perl.
The task is quite simple and perhaps used on everyday basis around the globe.
Please see following code snippet for one of many possible implementations
use strict;
use warnings;
use feature 'say';
use Getopt::Long qw(GetOptions);
use Pod::Usage;
my %opt;
my $re;
GetOptions(
'sample|s=s' => \$opt{sample},
'debug|d' => \$opt{debug},
'help|h' => \$opt{help},
'man|m' => \$opt{man}
) or pod2usage(2);
pod2usage(1) if $opt{help};
pod2usage(-verbose => 2) if $opt{man};
pod2usage("$0: No files given.") if ((#ARGV == 0) && (-t STDIN));
$re = read_re($opt{sample});
say "DEBUG: search pattern is ($re)" if $opt{debug};
find_in_file($re);
sub find_in_file {
my $search = shift;
while( <> ) {
chomp;
next unless /$search/;
say;
}
}
sub read_re {
my $filename = shift;
open my $fh, '<', $filename
or die "Couldn't open $filename";
my #data = <$fh>;
close $fh;
chomp #data;
my $re = join('|', #data);
return $re;
}
__END__
=head1 NAME
file_in_file.pl - search strings of one file in other
=head1 SYNOPSIS
yt_video_list.pl [options] -s sample.txt dbfile.txt
Options:
-s,--sample search pattern file
-d,--debug debug flag
-h,--help brief help message
-m,--man full documentation
=head1 OPTIONS
=over 4
=item B<-s,--sample>
Search pattern file
=item B<-d,--debug>
Print debug information.
=item B<-h,--help>
Print a brief help message and exits.
=item B<-m,--man>
Prints the manual page and exits.
=back
B<This program> seaches patterns from B<sample> file in B<dbfile>
=cut

Extract 2 fields from string with search

I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't chose a column number. I feel I need to search for "id" and "hwVersion" Any help is GREATLY appreciated.
Totally agree with #KamilCuk. More specifically
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON
More to the point, your input should probably be processed record by record, and my guess is that a two column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since the data looks like a mapping objects and even corresponding to a JSON format, something like this should do, if you don't mind using Python (which comes with JSON) support:
import json
def get_id_hw(s):
d = json.loads(s)
return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input string into s and parse it as JSON into a dictionary d. Then we return a formatted string with double-quoted id and hwVersion strings followed by column and double-quoted value of corresponding key from the previously obtained dict.
We can try this with these test input strings and prints:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
If you really wanted to use awk, you could, but it's not the most robust and suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
h = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n"), i, h}' /your/file
Since you mention position is not known and assuming it can be in any order, we use one regex to extract id and the other to get hwVersion, then we print it out in given format. If the values could be something other then decimal digits as in your example, the [0-9]+ but would need to reflect that.
And for the fun if it (this preserves the order) if entries from the file, in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".

Find lines that have partial matches

So I have a text file that contains a large number of lines. Each line is one long string with no spacing, however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies that the first 4 numbers/letters of the line coincide to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which coincides with a specific instrument. However these lines are not identical. The program I'm using can not do its calculation if there are duplicates of the same station (ie, there are two 1435* stations). I need to find a way to search through my text files and identify if there are any duplicates of the partial strings that represent the stations within the file so that I can delete one or both of the duplicates. If I could have BASH script output the number of the lines containing the duplicates and what the duplicates lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples of this. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
a[substr($0,1,4)]++ # put prefixes to array and count them
}
END { # in the end
for (i in a) { # go thru all indexes
if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
}
}
Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use following Python script(syntax of python 2.7 version used)
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
line_count += 1
if device.has_key(line[:4]):
device[line[:4]] = device[line[:4]] + "," + str(line_count)
else:
device[line[:4]] = str(line_count)
f1.close()
print device
here the script reads each line and initial 4 character of each line are considered as device name and creates a key value pair device with key representing device name and value as line numbers where we find the string(device name)
following would be output
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!

Grep for displaying count of muliple strings in a single file

Another question ... Can I get the count of items that are unique .. If in my previous case, i just took a simple instance . My business req is here ...
I have string like the below happy=7
happy=5
happy=5,
bascically I will be using regex for searching the word happy, I would give like "happy=*"... I need the output as "count of happy =2" as there is one duplicate instance ...
Use awk:
awk '/happy/{ happy+=1 } /sad/ {sad += 1 }
END { print "happy =", happy+0, "sad = ", sad+0 }'
Note that like grep -c, this does not count occurrences of each word but the number of lines that match each word.
You're better off using something like perl or awk, where you can increment counters based on conditional statements.

SPLIT file by Script (bash, cpp) - numbers in columns

I have files with some columns filled by numbers (float). I would need to split these files according to the value in one of the columns (can set as the first one). This means, when
a b c
in my file the value c fullfils 0.05<=c<=0.1 then create the file named c and copy the whole columns there which fullfils the c-condition...
is this possible? I can something small with bash, awk, something also with c++.
I have searched for some solutions but - I can the data sort of course and only read the first number of the line..
I don't know.
Please, very please.
Thank you
Jane
As you mentioned awk, the basic rule in awk is 'match a line (either by default or with a regexp, condition or line number)' AND 'do something because you found a match'.
awk uses values like $1, $2, $3 to indicate which column in the current line of data it is looking at. $0 refers to the whole line. So ...
awk '
BEGIN{
afile="afile.txt"
bfile="bfile.txt"
cfile="cfile.txt"
}
{
# test c value between .05 and .1
if ($3 >= 0.05 && $3 <= 0.1) print $0 > cfile
} inputData
Note that I am testing the value of the third column (c in your example). You can use $2 to test b column, etc.
If you don't know about the sort of condition test I have included >= 0.5 && $3 <= 0.1 you'll have some learning ahead of you.
Questions in the form of 1. I have this input, 2. I want this output. 3. (but) I'm getting this output, 4. with this code .... {code here} .... have a much better chance of getting a reasonable response in a reasonable amount of time ;-)
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.
If I understand your requirements correctly:
awk '{print > $3}' file ...

Resources