How to extract multiple patterns followed by numbers - bash

I have this file (pattern1 and pattern2 are fixed, but the numbers are random):
aaaa patern1[1234] bbbb cccc pattern2[5678]
jjjj patern1[9999] hhhhhhhh
and I want to extract the following patterns with a bash script:
pattern1[1234] pattern2[5678]
pattern1[9999]
I tried grep -Eo 'pattern1\[[0-9]{1,4}'; it works for one pattern but not for two.

$ cat ip.txt
aaaa pattern1[1234] bbbb cccc pattern2[5678]
jjjj pattern1[9999] hhhhhhhh
$ perl -lne 'print join " ", /pattern[12]\[\d+\]/g' ip.txt
pattern1[1234] pattern2[5678]
pattern1[9999]
pattern[12]\[\d+\] is the pattern to extract
print join " ", prints the matches from each line separated by a space
If lines not containing the desired pattern are to be omitted:
perl -lne 'print join " ", //g if /pattern[12]\[\d+\]/' ip.txt

You can use the pipe character | to allow for multiple patterns:
grep -oP '(patern1|pattern2)\[[0-9]{1,4}\]' file
patern1[1234]
pattern2[5678]
patern1[9999]
Since the patterns are similar, you can simplify like this:
grep -oP 'patt?ern[12]\[[0-9]{1,4}\]' file
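Note that grep -o prints every match on its own line, so the per-line grouping asked for in the question is lost. Here is a minimal sketch of one way to restore it, assuming GNU grep (where -n combined with -o prefixes each match with its input line number). The sample file and the variable names are made up for illustration:

```shell
# Hypothetical sample input mirroring the question (including the "patern1" spelling).
input=$(mktemp)
printf '%s\n' 'aaaa patern1[1234] bbbb cccc pattern2[5678]' \
              'jjjj patern1[9999] hhhhhhhh' > "$input"

# grep -n -o emits "LINENO:MATCH" once per match; awk joins matches sharing a LINENO.
grouped=$(grep -noE 'patt?ern[12]\[[0-9]{1,4}\]' "$input" |
  awk -F: '
    $1 != prev { if (NR > 1) print line; line = $2; prev = $1; next }
               { line = line " " $2 }
    END        { if (NR) print line }')

printf '%s\n' "$grouped"
rm -f "$input"
```

This only works as long as the matched text itself contains no colon, because the colon is used as the field separator.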

$ awk '{ c=0; while ( match($0,/(patern1|pattern2)[[][^][]+[]]/) ) { printf "%s%s", (c++?OFS:""), substr($0,RSTART,RLENGTH); $0=substr($0,RSTART+RLENGTH) } if (c) print "" }' file
patern1[1234] pattern2[5678]
patern1[9999]
If you prefer brevity over clarity then consider this, using GNU awk for multi-char RS and RT and run against the same input file as shown in https://stackoverflow.com/a/39453928/1745001:
$ awk -v RS='pattern[12][[][0-9]+[]]|\n' '{$0=RT;ORS=(/\n/?x:FS)} 1' file
pattern1[1234] pattern2[5678]
pattern1[9999]

Related

Partial string search between two files using AWK

I have been trying to re-write an egrep command using awk to improve performance but haven't been successful. The egrep command performs a simple case insensitive search of the records in file1 against (partial matches in) file2. Below is the command and sample output.
file1 contains:
Abc
xyz
123
blah
hh
a,b
file2 contains:
abc de
xyz
123
456
blah
test1
abdc
abc,def,123
kite
a,b,c
Original command :
egrep -i -f file1 file2
Original (egrep) command output :
$ egrep -i -f file1 file2
abc de
xyz
123
blah
abc,def,123
a,b,c
I would like to use AWK to rewrite the command to do the same operation. I have tried the below but it is performing a full record match and not partial like grep does.
Modified command in awk :
awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
Modified command (awk) output:
$ awk 'NR==FNR{a[tolower($0)];next} tolower($0) in a' file1 file2
xyz
123
blah
This excludes the records which had partial matches for the string "abc". Any help to fix the awk command please? Thanks in advance.
Use index like this for a partial literal match:
awk '
NR == FNR {
needles[tolower($0)]
next
}
{
haystack = tolower($0)
for (needle in needles) {
if (index(haystack, needle)) {
print
break
}
}
}' file1 file2
I would be a bit surprised that it's significantly faster than egrep but you can try this:
$ awk 'NR==FNR {r=r ((r=="")?"":"|") tolower($0);next} tolower($0)~r' file1 file2
abc de
xyz
123
blah
abc,def,123
Explanation: first build the r1|r2|...|rn regular expression from the content of file1 and store it in awk variable r. Then print all lines of file2 that match it, thanks to the ~ match operator.
If you have GNU awk you can use its IGNORECASE variable instead of tolower:
$ awk -v IGNORECASE=1 'NR==FNR{r=r ((r=="")?"":"|") $0;next} $0~r' file1 file2
abc de
xyz
123
blah
abc,def,123
And with GNU awk it could be that forcing the type of r to regexp instead of string leads to better performance. The manual says:
Given that you can use both regexp and string constants to describe
regular expressions, which should you use? The answer is "regexp
constants," for several reasons:
...
It is more efficient to use regexp constants. 'awk' can note that
you have supplied a regexp and store it internally in a form that
makes pattern matching more efficient. When using a string
constant, 'awk' must first convert the string into this internal
form and then perform the pattern matching.
In order to do this you can try:
$ awk -v IGNORECASE=1 'NR==FNR {s=s ((s=="")?"":"|") $0;next}
FNR==1 && NR!=FNR {r=@//;sub(//,s,r);print typeof(r),r} $0~r' file1 file2
regexp Abc|xyz|123|blah|hh
abc de
xyz
123
blah
abc,def,123
(r=@// forces variable r to be of type regexp and sub(//,s,r) does not change this)
Note: just like with your egrep attempts, the lines of file1 are considered as regular expressions, not simple text strings to search for. So, if one line in file1 is .*, all lines in file2 will match, not just the lines containing substring .*.
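To make the difference between regular-expression and literal matching concrete, here is a small sketch that runs the regex-building approach and the index() approach against the same data. The two input files are hypothetical and exist only for this demonstration:

```shell
# Hypothetical inputs: a single needle that contains a regex metacharacter.
needles=$(mktemp); haystack=$(mktemp)
printf '%s\n' 'a.c' > "$needles"
printf '%s\n' 'abc' 'A.C' 'xyz' > "$haystack"

# Regex matching: "." matches any character, so "abc" is selected too.
regex_hits=$(awk 'NR==FNR { r = r (r == "" ? "" : "|") tolower($0); next }
                  tolower($0) ~ r' "$needles" "$haystack")

# Literal matching with index(): only the line containing "a.c" itself is selected.
literal_hits=$(awk 'NR==FNR { n[tolower($0)]; next }
                    { h = tolower($0)
                      for (k in n) if (index(h, k)) { print; break } }' "$needles" "$haystack")

printf 'regex:\n%s\nliteral:\n%s\n' "$regex_hits" "$literal_hits"
rm -f "$needles" "$haystack"
```

If the entries of file1 are meant as plain strings, the index() variant (or grep -iFf file1 file2) is probably the safer choice.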

How to get lines from the last match to the end of file?

I need to print the lines from after the last match to the end of the file. The number of matches is not known in advance. I have some text as shown below.
MARKER
aaa
bbb
ccc
MARKER
ddd
eee
fff
MARKER
ggg
hhh
iii
MARKER
jjj
kkk
lll
Output desired is
jjj
kkk
lll
Do I use awk with RS and FS to get the desired output?
You can actually do it with awk (gawk) without using any pipe.
$ awk -v RS='(^|\n)MARKER\n' 'END{printf "%s", $0}' file
jjj
kkk
lll
Explanations:
The record separator is set to the regex (^|\n)MARKER\n via RS='(^|\n)MARKER\n'; by default it is the newline character.
'END{printf "%s", $0}' => at the end of the file, print the last record; because RS is (^|\n)MARKER\n, $0 holds every line after the last MARKER up to EOF.
Another option is to use grep (GNU):
$ grep -zoP '(?<=MARKER\n)(?:(?!MARKER)[^\0])+\Z' file
jjj
kkk
lll
Explanations:
-z to use the ASCII NUL character as the line delimiter, so the whole file is treated as one line
-o to print only the matching part
-P to enable Perl-compatible regular expressions (PCRE)
PCRE regex: (?<=MARKER\n)(?:(?!MARKER)[^\0])+\Z explained here https://regex101.com/r/RpQBUV/2/
Last but not least, the following sed approach can also be used:
sed -n '/^MARKER$/{n;h;b};H;${x;p}' file
jjj
kkk
lll
Explanations:
n read the next input line into the pattern space
h replace the hold space with the current line
H do the same, but append instead of replacing
${x;p} at the end of the file, exchange (x) the hold space and pattern space and print (p)
With tac, this can be simplified to:
tac file | sed -n '/^MARKER$/q;p' | tac
You could try the following:
tac file | awk '/MARKER/{print val;exit} {val=(val?val ORS:"")$0}' | tac
The benefit of this approach is that awk reads only the last block of Input_file (which is the first block awk sees after tac reverses it) and exits after that.
Explanation:
tac file | ##Printing Input_file in reverse order.
awk '
/MARKER/{ ##Searching for a string MARKER in a line of Input_file.
print val ##Printing variable val here. Because we need last occurrence of string MARKER,which has become first instance after reversing the Input_file.
exit ##Using exit to exit from awk program itself.
}
{
val=(val?val ORS:"")$0 ##Appending each line to val, separated by newlines, to collect the lines that come before MARKER in the reversed input.
}
' | ##Sending output of awk command to tac again to make it in its actual form, since tac prints it in reverse order.
tac ##Using tac to make it in correct order(lines were reversed because of previous tac).
You can try Perl as well
$ perl -0777 -ne ' /.*MARKER(.*)/s and print $1 ' input.txt
jjj
kkk
lll
$
This might work for you (GNU sed):
sed -nz 's/.*MARKER.//p' file
This relies on greedy matching to delete everything up to and including the last occurrence of MARKER.
Simplest to remember:
tac fun.log | sed "/MARKER/Q" | tac
This awk solution would work with any version of awk on any OS:
awk '/^MARKER$/ {s=""; next} {s = s $0 RS} END {printf "%s", s}' file
jjj
kkk
lll
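For comparison, here is a sketch of a different technique that uses neither tac nor a stateful script: look up the line number of the last MARKER with grep -n and let tail print everything after it. The sample file and the variable names are made up for illustration:

```shell
# Hypothetical sample file with three MARKER lines.
file=$(mktemp)
printf '%s\n' MARKER aaa bbb MARKER ddd MARKER jjj kkk lll > "$file"

# Line number of the last line that is exactly MARKER (empty if there is none).
last=$(grep -n '^MARKER$' "$file" | tail -n 1 | cut -d: -f1)

# Print everything after that line; ${last:-0} falls back to the whole file.
after_last=$(tail -n "+$(( ${last:-0} + 1 ))" "$file")

printf '%s\n' "$after_last"
rm -f "$file"
```

The trade-off is that the file is read twice, so this cannot be used on a stream, and it prints the whole file when no MARKER is present, which may or may not be what you want.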

Print all lines between two patterns, exclusive, first instance only (in sed, AWK or Perl) [duplicate]

This question already has answers here:
How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?
Using sed, AWK (or Perl), how do you print all lines between (the first instance of) two patterns, exclusive of the patterns? [1]
That is, given as input:
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
Or possibly even:
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
fff
PATTERN1
ggg
hhh
iii
PATTERN2
jjj
I would expect, in both cases:
bbb
ccc
ddd
[1] A number of users voted to close this question as a duplicate of this one. In the end, I provided a gist that proves they are different. The question is also superficially similar to a number of others, but there is no exact match, and none of them are of high quality. As I believe that this specific problem is the one most commonly faced, it deserves a clear formulation and a set of correct, clear answers.
If you have GNU sed (tested using version 4.7 on Mac OS X), the simplest solution could be:
sed '0,/PATTERN1/d;/PATTERN2/Q'
Explanation:
The d command deletes from line 1 to the line matching /PATTERN1/ inclusive.
The Q command then exits without printing on the first line matching /PATTERN2/.
If the file has only one instance of the pattern, or if you don't mind extracting all of them, and you want a solution that doesn't depend on a GNU extension, this works:
sed -n '/PATTERN1/,/PATTERN2/{//!p}'
Explanation:
Note that the empty regular expression // repeats the last regular expression match.
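As a sketch of how the empty regular expression behaves inside the range, the snippet below runs the portable command on input with two blocks. It then adds a variant with an extra /PATTERN2/q, which is my own assumption about how to stop after the first block without GNU extensions and is not part of the answer above. The sample file is hypothetical:

```shell
# Hypothetical sample input containing two PATTERN1 ... PATTERN2 blocks.
input=$(mktemp)
printf '%s\n' aaa PATTERN1 bbb ccc ddd PATTERN2 eee fff \
              PATTERN1 ggg hhh iii PATTERN2 jjj > "$input"

# Every block: inside the range, // matches whichever delimiter the range just tested.
all_blocks=$(sed -n '/PATTERN1/,/PATTERN2/{//!p;}' "$input")

# First block only: additionally quit as soon as the closing delimiter is reached.
first_block=$(sed -n '/PATTERN1/,/PATTERN2/{//!p;/PATTERN2/q;}' "$input")

printf 'all:\n%s\nfirst:\n%s\n' "$all_blocks" "$first_block"
rm -f "$input"
```

The q command is placed after //!p on purpose, so that it does not change which regular expression // refers to.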
With awk (assumes that PATTERN1 and PATTERN2 always occur in pairs and that neither of them occurs inside a pair)
$ cat ip.txt
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
fff
PATTERN1
ggg
hhh
iii
PATTERN2
jjj
$ awk '/PATTERN2/{exit} f; /PATTERN1/{f=1}' ip.txt
bbb
ccc
ddd
/PATTERN1/{f=1} set flag if /PATTERN1/ is matched
/PATTERN2/{exit} exit if /PATTERN2/ is matched
f; print input line if flag is set
Generic solution, where the block required can be specified
$ awk -v b=1 '/PATTERN2/ && c==b{exit} c==b; /PATTERN1/{c++}' ip.txt
bbb
ccc
ddd
$ awk -v b=2 '/PATTERN2/ && c==b{exit} c==b; /PATTERN1/{c++}' ip.txt
ggg
hhh
iii
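The counter idea can be wrapped in a small shell function so that the delimiters and the block number become parameters. This is only a sketch: the function name between and its argument order are hypothetical, and both delimiters are treated as awk regular expressions:

```shell
# between BEG END N FILE: print the lines strictly between the N-th BEG line
# and the next END line. Hypothetical helper; BEG and END are awk regexes.
between() {
  awk -v beg="$1" -v fin="$2" -v want="$3" '
    BEGIN { seen = 0 }
    $0 ~ fin && seen == want { exit }   # closing delimiter of the wanted block
    seen == want                        # inside the wanted block: print the line
    $0 ~ beg { seen++ }                 # count opening delimiters
  ' "$4"
}

# Hypothetical sample input containing two blocks.
input=$(mktemp)
printf '%s\n' aaa PATTERN1 bbb ccc ddd PATTERN2 eee fff \
              PATTERN1 ggg hhh iii PATTERN2 jjj > "$input"

block1=$(between PATTERN1 PATTERN2 1 "$input")
block2=$(between PATTERN1 PATTERN2 2 "$input")
printf '%s\n---\n%s\n' "$block1" "$block2"
rm -f "$input"
```

Because the patterns are passed with -v, awk processes escape sequences in them, so a delimiter containing backslashes would need them doubled.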
This might work for you (GNU sed):
sed -n '/PATTERN1/{:a;n;/PATTERN2/q;p;$!ba}' file
This prints only the lines between the first set of delimiters, or if the second delimiter does not exist, to the end of the file.
I attempted twice to answer, but the questions switched hold/duplicate statuses.
Borrowing input from @Sundeep and adding the answer which I shared in the question comments.
Using awk
awk -v x=0 -v y=1 ' /PATTERN1/&&y { x=1;next } /PATTERN2/&&y { x=0;y=0; next } x ' file
with Perl
perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if $x++ <1 } '
Results:
$ cat ip.txt
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
PATTERN1
2
46
PATTERN2
xyz
$
$ awk -v x=0 -v y=1 ' /PATTERN1/&&y { x=1;next } /PATTERN2/&&y { x=0;y=0; next } x ' ip.txt
bbb
ccc
ddd
$ perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if $x++ <1 } ' ip.txt
bbb
ccc
ddd
$
To make it generic
awk: here y is the number of the block to print
awk -v x=0 -v y=2 ' /PATTERN1/ { x++;next } /PATTERN2/ { if(x==y) exit } x==y ' ip.txt
2
46
perl: compare ++$x against the desired occurrence; here it is 2
perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if ++$x==2 } ' ip.txt
2
46
Adding a few more solutions (just for fun; I am not claiming that these are better than the usual ones). All were written and tested in GNU awk, and only against the given examples.
1st Solution:
awk -v RS="" -v FS="PATTERN2" -v ORS="" '$1 ~ /\nPATTERN1\n/{sub(/.*PATTERN1\n/,"",$1);print $1}' Input_file
2nd solution:
awk -v RS="" -v ORS="" 'match($0,/PATTERN1[^(PATTERN2)]*/){val=substr($0,RSTART,RLENGTH);gsub(/^PATTERN1\n|^$\n/,"",val);print val}' Input_file
3rd solution:
awk -v RS="" -v OFS="\n" -v ORS="" 'sub(/PATTERN2.*/,"") && sub(/.*PATTERN1/,"PATTERN1"){$1=$1;sub(/^PATTERN1\n/,"")} 1' Input_file
All of the above commands produce the following output:
bbb
ccc
ddd
Using GNU sed:
sed -nE '/PATTERN1/{:s n;/PATTERN2/q;p;bs}'
-n suppresses automatic printing, so only the lines explicitly printed by the p command are output.
An address applies only to the single command that follows it, so the commands have to be grouped with {}.
The n command moves past the PATTERN1 line by reading the next line. If that line matches PATTERN2, sed quits immediately; otherwise it prints the line and branches back to the label s to continue with the next line.

Use bash to cut lines in one file to lengths explicitly stated in another

I have one file that is a list of numbers, and another file (same number of lines) in which I need the length of each line to match the number of the line in the other file. For example:
file 1:
5
8
7
11
15
file 2:
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
output:
abcde
abcdefgh
abcdefg
abcdefghijk
abcdefghijklmno
I've tried using awk and cut together but I keep getting the error "fatal: attempt to use array `line' in a scalar context". I'm not sure how else to go about this. Any guidance is much appreciated!
awk is probably more appropriate, but you can also do:
while IFS= read -r line <&3; do
read -r len <&4; printf '%s\n' "${line:0:$len}";
done 3< file2 4< file1
awk is your tool for this: one of
# read all the lengths, then process file2
awk 'NR == FNR {len[NR] = $1; next} {print substr($0, 1, len[FNR])}' file1 file2
# fetch a line from file1 whilst processing file2
awk '{getline len < lenfile; print substr($0, 1, len)}' lenfile=file1 file2
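One thing to watch with the getline variant is that its return value is not checked, so if file1 runs out of lines the previous length is silently reused. Here is a sketch that tests the return value instead. Printing the remaining lines unchanged is an assumption of mine, not something the question specifies, and the sample files are hypothetical:

```shell
# Hypothetical inputs: the length file is deliberately one line short.
lens=$(mktemp); text=$(mktemp)
printf '%s\n' 5 8 > "$lens"
printf '%s\n' abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz \
              abcdefghijklmnopqrstuvwxyz > "$text"

# getline returns 1 on success, 0 at end of file and -1 on error.
trimmed=$(awk -v lenfile="$lens" '
  (getline len < lenfile) > 0 { print substr($0, 1, len); next }
                              { print }   # no length left: keep the line as is
' "$text")

printf '%s\n' "$trimmed"
rm -f "$lens" "$text"
```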
another awk
$ paste file1 file2 | awk '{print substr($2,1,$1)}'
abcde
abcdefgh
abcdefg
abcdefghijk
abcdefghijklmno
Using Perl
perl -lne ' BEGIN { open($f,"file1.txt");@x=<$f>;close($f) }
print substr($_,0,$x[$.-1]) ' file2.txt
with the given inputs
$ cat cmswen1.txt
5
8
7
11
15
$ cat cmswen2.txt
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
$ perl -lne ' BEGIN { open($f,"cmswen1.txt");@x=<$f>;close($f) } print substr($_,0,$x[$.-1]) ' cmswen2.txt
abcde
abcdefgh
abcdefg
abcdefghijk
abcdefghijklmno
$

Non matching word from file1 to file2

I have two files - file1 & file2.
file1 contains only words, say:
ABC
YUI
GHJ
I8O
..................
file2 contains many paragraphs:
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
...................
I am using below command to get the matching lines which contains word from file1 in file2
grep -Ff file1 file2
(Gives output of lines where words of file1 found in file2)
I also need the words from file1 that are not found in file2, but I have been unable to find a way to list them.
Can anyone help me get the output below?
YUI
I8O
I am looking for a one-liner (via grep, awk, or sed), as I am using the pssh command and can't use while or for loops.
You can print only the matched parts with -o.
$ grep -oFf file1 file2
ABC
GHJ
Use that output as a list of patterns for a search in file1. Process substitution <(cmd) simulates a file containing the output of cmd. With -v you can print lines that did not match. If file1 contains two lines such that one line is a substring of another line you may want to add -x (only match whole lines) to prevent false positives.
$ grep -vxFf <(grep -oFf file1 file2) file1
YUI
I8O
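The effect of -x is easiest to see with a word list in which one entry is a prefix of another. Below is a minimal sketch with made-up data. It writes the matched words to a temporary file instead of using process substitution, so that it also runs under a plain POSIX sh:

```shell
# Hypothetical inputs: ABC occurs in the text, ABCD and YUI do not.
words=$(mktemp); text=$(mktemp); found=$(mktemp)
printf '%s\n' ABC ABCD YUI > "$words"
printf '%s\n' 'dfghjo ABC kll njjgg' > "$text"

# Words from the list that occur somewhere in the text (here only ABC).
grep -oFf "$words" "$text" | sort -u > "$found"

# Without -x, ABCD is dropped because it merely contains the found word ABC.
without_x=$(grep -vFf "$found" "$words")
# With -x, only lines identical to a found word are dropped.
with_x=$(grep -vxFf "$found" "$words")

printf 'without -x:\n%s\nwith -x:\n%s\n' "$without_x" "$with_x"
rm -f "$words" "$text" "$found"
```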
Using Perl - both matched/non-matched in same one-liner
$ cat sinw.txt
ABC
YUI
GHJ
I8O
$ cat sin_in.txt
dfghjo ABC kll njjgg bla bla
GHJ njhjckhv chasjvackvh ..
ihbjhi hbhibb jh jbiibi
$ perl -lne '
BEGIN { %x=map{chomp;$_=>1} qx(cat sinw.txt); $w="\\b".join("\|",keys %x)."\\b"}
print "$&" and delete($x{$&}) if /$w/ ;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
ABC
GHJ
non-matched
I8O
YUI
$
Getting only the non-matched
$ perl -lne '
BEGIN {
%x = map { chomp; $_=>1 } qx(cat sinw.txt);
$w = "\\b" . join("\|",keys %x) . "\\b"
}
delete($x{$&}) if /$w/;
END { print "\nnon-matched\n".join("\n", keys %x) }
' sin_in.txt
non-matched
I8O
YUI
$
Note that even a single use of $& variable used to be very expensive for the whole program, in Perl versions prior to 5.20.
Assuming the "words" in file1 are spread over more than one line:
while read -r line
do
for word in $line
do
if ! grep -qF -- "$word" file2
then echo "$word not found"
fi
done
done < file1
For the non-matching words, here's one GNU awk solution:
awk 'NR==FNR{a[$0];next} !($1 in a)' RS='[ \n]' file2 file1
YUI
I8O
Using !($0 in a) instead is equivalent: because RS='[ \n]' makes every space a record separator as well, each record is a single word.
And note that I read file2 first, and then file1.
If file2 could be empty, you should replace NR==FNR with a different file check, such as ARGIND==1 for GNU awk, FILENAME=="file2", or FILENAME==ARGV[1].
Same mechanism for only the matched one too:
awk 'NR==FNR{a[$0];next} $0 in a' RS='[ \n]' file2 file1
ABC
GHJ
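If a regular expression in RS is not available or not wanted, the same lookup can be sketched with the default record separator by looping over the fields of file2. This is an alternative I am suggesting, not part of the answer above, and the file contents are just the sample data from the question:

```shell
# Sample data from the question, written to hypothetical temporary files.
words=$(mktemp); text=$(mktemp)
printf '%s\n' ABC YUI GHJ I8O > "$words"
printf '%s\n' 'dfghjo ABC kll njjgg bla bla' \
              'GHJ njhjckhv chasjvackvh ..' \
              'ihbjhi hbhibb jh jbiibi' > "$text"

# First file (the text): remember every whitespace-separated word.
# Second file (the word list): print the words that were never seen.
missing=$(awk 'FNR == NR { for (i = 1; i <= NF; i++) seen[$i]; next }
               NF && !($1 in seen)' "$text" "$words")

printf '%s\n' "$missing"
rm -f "$words" "$text"
```

As with the RS version, this compares whole words, so it behaves differently from grep -F, which also matches substrings. The same caveat about an empty first file and NR==FNR applies here.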
