Grep based on pattern - bash

Sample Text:
This is a test
This is aaaa test
This is aaa test
This is test a
This aa is test
I have just started learning Unix commands like grep, awk and sed, and have a quick question. If my text file contains the above text, how can I print only the lines that use the letter ‘a’ 2 or fewer times?
I tried using awk, but don't understand the syntax to add up all the instances of ‘a’ and only print the lines that have ‘a’ 2 or fewer times. I understand comparing numbers based on columns, like awk '$1 <= 2', but don't know how to use that with characters as well. Any help would be appreciated.
Essentially it should print out:
This is a test
This is test a
This aa is test
For clarity: I don't want to remove the extra a's, but rather only print the lines that contain two or fewer a's.

Using awk
awk '!/aaa+/' file
This is a test
This is test a
This aa is test
This does not print lines with three or more consecutive a's.
Same with sed
sed '/aaa\+/d' file
This is a test
This is test a
This aa is test
By default, sed prints every line. /aaa\+/d tells sed to delete lines with three or more consecutive a's (\+ is a GNU extension to basic regular expressions).

like this?
kent$ grep -v 'aaa\+' file
This is a test
This is test a
This aa is test
Update
I just saw the comment. If the requirement is to count a's anywhere on the line, consecutive or not, see this example (with awk):
kent$ cat f
1a a
2a
3
4a a a aa
5aaaaaaaaaa
kent$ awk 'gsub(/a/,"a")<3' f
1a a
2a
3
Without gsub, using a as the field separator (a line with n a's splits into n+1 fields):
kent$ awk -F'a' 'NF<4' f
1a a
2a
3
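As a hedged sketch of the counting approach above, the threshold can be passed in as an awk variable instead of being hard-coded. The names max, tmp, by_count and by_regex are made up for this illustration, and the input is the sample text from the question. The grep line shows a regex-only way to express "three a's anywhere on the line".

```shell
# Write the question's sample text to a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'This is a test' \
  'This is aaaa test' \
  'This is aaa test' \
  'This is test a' \
  'This aa is test' > "$tmp"

# gsub() returns how many replacements it made; "&" puts the matched
# text back, so the line is unchanged and the result is the count of a's.
by_count=$(awk -v max=2 'gsub(/a/, "&") <= max' "$tmp")

# Same filter with grep: drop lines where three a's occur anywhere.
by_regex=$(grep -v 'a.*a.*a' "$tmp")

printf '%s\n' "$by_count"
rm -f "$tmp"
```

Changing max to another number changes the cutoff without touching the awk program itself.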

Related

Shell script: Insert multiple lines into a file ONLY after a specified pattern appears for the FIRST time. (The pattern appears multiple times)

I want to insert multiple lines into a file using shell script. Let us consider my original file: original.txt:
aaa
bbb
ccc
aaa
bbb
ccc
aaa
bbb
ccc
.
.
.
and my insert file: toinsert.txt
111
222
333
Now I have to insert the three lines from the 'toinsert.txt' file ONLY after the line 'ccc' appears for the FIRST time in the 'original.txt' file. Note: the 'ccc' pattern appears more than one time in my 'original.txt' file. After inserting ONLY after the pattern appears for the FIRST time, my file should change like this:
aaa
bbb
ccc
111
222
333
aaa
bbb
ccc
aaa
bbb
ccc
.
.
.
I should do the above insertion using a shell script. Can someone help me?
Note2: I found a similar case, with a partial solution:
sed -i -e '/ccc/r toinsert.txt' original.txt
which actually does the insertion multiple times (for every time the ccc pattern shows up).
Use ed, not sed, to edit files:
printf "%s\n" "/ccc/r toinsert.txt" w | ed -s original.txt
This inserts the contents of toinsert.txt after the first line containing ccc and, unlike your sed version, only after the first.
This might work for you (GNU sed):
sed '0,/ccc/!b;/ccc/r insertFile' file
Use a range:
If the current line lies outside the range that runs from the start of the file to the first occurrence of ccc (that is, it comes after that first occurrence), break from further processing and implicitly print as usual.
Otherwise, if the current line contains ccc, insert the lines from insertFile.
N.B. This uses the address 0 which allows the regexp to occur on line 1 and is specific to GNU sed.
or:
sed -e '/ccc/!b;r insertFile' -e ':a;n;ba' file
Use a loop:
If a line does not contain ccc, no further processing and print as usual.
Otherwise, insert lines from insertFile and then using a loop, fetch/print the remaining lines until the end of the file.
N.B. The r command insists on being delimited from other sed commands by a newline. The -e option simulates this effect and thus the sed commands are split across two -e options.
or:
sed 'x;/./{x;b};x;/ccc/!b;h;r insertFile' file
Use a flag:
If the hold space is not empty (the flag has already been set), no further processing and print as usual.
Otherwise, if the line does not contain ccc, no further processing and print as usual.
Otherwise, copy the current line to the hold space (set the flag) and insert lines from insertFile.
N.B. In all cases the r command inserts lines from insertFile after the current line is printed.
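For systems without GNU sed or ed, the same "flag" idea can be sketched in plain awk. This is only an illustration under stated assumptions: the variable names (dir, ins, done, inserted) are hypothetical, the sample files are shortened versions of the ones in the question, and the result goes to standard output rather than being edited in place.

```shell
# Build small copies of the question's two files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' aaa bbb ccc aaa bbb ccc > "$dir/original.txt"
printf '%s\n' 111 222 333 > "$dir/toinsert.txt"

# Print every line; after the first line matching ccc, copy the insert
# file to the output once and set a flag so later matches are skipped.
inserted=$(awk -v ins="$dir/toinsert.txt" '
  { print }
  !done && /ccc/ {
    while ((getline line < ins) > 0) print line
    close(ins)
    done = 1
  }
' "$dir/original.txt")

printf '%s\n' "$inserted"
rm -rf "$dir"
```

To change the file itself, redirect the awk output to a temporary file and move it over the original afterwards.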

Extracting numeric pattern from file line

I have a file that has the following format:
EDouble entry for scenario XX AAA 70337262003 Line 000000003350
EDouble entry for scenario XX AAA 70337262003 Line 000000003347
EDouble entry for scenario XX AAA 71375201001 Line 000000003353
EDouble entry for scenario XX AAA 71375201001 Line 000000003351
EDouble entry (different date/time) for scenario YY AAA 10722963407 Line 000000000447
EDouble entry for scenario YY AAA 55173006602 Line 000000002868
EDouble entry (different date/time) for scenario YY AAA 60404822801 Line 000000003285
What I want to do is basically strip away all the alphabet characters and output a file that contains:
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
I've thought of a couple ways that could assist me in getting there, simply listing some ideas since I don't have a ready solution. I could strip all alphabetic characters with:
tr -d '[:alpha:]'
but that would still mean I would need to process the file further to separate the first number from the second. Sed could perhaps provide a simpler solution since the second number will always start with 0.
sed -n 's/.*\[1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1- 9]\).*/\1/p'
to find the pattern, and only printing pattern – but the above command doesn't output anything. Could someone help me please? It's not necessary to accomplish this with sed, I imagine awk with gsub and grep have something similar?
Print third to last column:
awk '{print $(NF-2)}' file
Output:
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
If you prefer sed, use this:
sed -rn "s#.*([1-9][0-9]{10}).*#\1#p" file.txt
With grep you can do this:
grep -o '[1-9][0-9]\{10\}' file
With sed:
sed -n 's/.*\([1-9][0-9]\{10\}\).*/\1/p' file
There's a narrow margin of error targeting 11 digits, as the numbers starting with 0 are 12 digits long. A more robust solution considering that fact would be:
sed -n 's/.*[[:blank:]]\([1-9][0-9]\{10\}\).*/\1/p' file
i.e., make sure to match a [[:blank:]] before the number.
I see that AAA always comes directly before the number in every row.
Therefore you can use this:
$ grep -oP '(?<=AAA\s)\s*\d+' data
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
This one extracts a group of digits followed by a word boundary, but not followed by the end of the line:
$ grep -Po '\d+\b(?!$)' infile
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
-P enables Perl regular expressions
-o retains only the match
\d+\b greedily matches digits followed by a word boundary
(?!$) is a "negative look-ahead": if the next character is the end of the line, don't match
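Where grep -P is not available, the "number follows AAA" observation can also be sketched in plain awk. This is an illustrative variant, not one of the answers above; it assumes AAA appears as its own whitespace-separated field, and the names tmp and ids are made up for the example.

```shell
# Two lines from the question's sample, one with the longer wording.
tmp=$(mktemp)
printf '%s\n' \
  'EDouble entry for scenario XX AAA 70337262003 Line 000000003350' \
  'EDouble entry (different date/time) for scenario YY AAA 10722963407 Line 000000000447' > "$tmp"

# Scan the fields for the literal AAA and print the field right after it,
# so the result does not depend on how many words precede the number.
ids=$(awk '{ for (i = 1; i < NF; i++) if ($i == "AAA") { print $(i + 1); break } }' "$tmp")

printf '%s\n' "$ids"
rm -f "$tmp"
```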

In shell, how to process this line in order to extract the field that I want

I have some lines in a flat file. Take two lines, for instance:
1 aa bb 05 may 2014 cc G 14-MAY-2014 hello world
j sd az 20140505 sd G 14-MAY-2014 hello world haha
As you may have noticed, I can count neither the number of characters nor the number of spaces, because the lines are not well aligned, and the fourth field sometimes looks like 20140505 and sometimes like 05 may 2014. So what I want is to match the G, or match the 14-MAY-2014. Then I can easily get the following fields: hello world or hello world haha. Can anyone help me? Thank you!
Assuming your lines are in a file called test.txt:
sed -r 's/^.*-[0-9]{4}\s//' test.txt
This uses GNU sed on a Linux system. There are many other ways. Here I simply remove everything up to and including the date from the beginning of the line.
sed -r 's/^.*-[0-9]{4}\s//'
-r = extended regex, makes things like the quantifier {4} possible
's/ ... //' = s is for substitute,
it matches the first part and replaces it with the second.
Since the second part is empty, the match is simply removed.
^ = start of line
.* = any character, any number of times
-[0-9]{4} = a dash, followed by four digits ([0-9]), the year part of the date
\s = any white space
You can use Perl's lookbehind regex:
perl -lne '/(?<=14-MAY-2014)(.*)/ && print $1' file
It will print everything after 14-MAY-2014, including the space that follows the date.
You can also use grep if it supports -P:
grep -Po '(?<=14-MAY-2014)(.*)' file
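The answers above hard-code either "a dash and four digits" or the literal date. As a hedged sketch, the two anchors the asker mentions (the G field and the DD-MON-YYYY date) can be combined into one pattern so that any date in that shape works. The exact date format is an assumption taken from the two sample lines, and tmp and tails are illustrative names.

```shell
# The two sample lines from the question.
tmp=$(mktemp)
printf '%s\n' \
  '1 aa bb 05 may 2014 cc G 14-MAY-2014 hello world' \
  'j sd az 20140505 sd G 14-MAY-2014 hello world haha' > "$tmp"

# Delete everything up to and including " G DD-MON-YYYY " so that only
# the trailing fields remain; -E enables extended regular expressions.
tails=$(sed -E 's/^.* G [0-9]{2}-[A-Z]{3}-[0-9]{4} //' "$tmp")

printf '%s\n' "$tails"
rm -f "$tmp"
```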

Removing newlines between tokens

I have a file that contains some information spanning multiple lines. In order for certain other bash scripts I have to work properly, I need this information to all be on a single line. However, I obviously don't want to remove all newlines in the file.
What I want to do is replace newlines, but only between all pairs of STARTINGTOKEN and ENDINGTOKEN, where these two tokens are always on different lines (but never get jumbled up together, it's impossible for instance to have two STARTINGTOKENs in a row before an ENDINGTOKEN).
I found that I can remove newlines with
tr "\n" " "
and I also found that I can match patterns over multiple lines with
sed -e '/STARTINGTOKEN/,/ENDINGTOKEN/!d'
However, I can't figure out how to combine these operations while leaving the remainder of the file untouched.
Any suggestions?
Are you looking for this?
awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
example:
kent$ cat file
foo
bar
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm
5
6
7
nnn ENDINGTOKEN
8
9
kent$ awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
foo
bar
STARTINGTOKEN xx12ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm567nnn ENDINGTOKEN
8
9
This seems to work:
sed -ne '/STARTINGTOKEN/{ :next ; /ENDINGTOKEN/!{N;b next;}; s/\n//g;p;}' "yourfile"
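Once it finds the starting token it loops, picking up lines until it finds the ending token, then removes all the embedded newlines and prints the result. Then it repeats. Note that because of -n, only the joined blocks are printed; to keep the rest of the file as well, drop the -n option and the p command.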
Using awk:
awk '$0 ~ /STARTINGTOKEN/ || l {l=sprintf("%s%s", l, $0)}
/ENDINGTOKEN/{print l; l=""}' input.file
This might work for you (GNU sed):
sed '/STARTINGTOKEN/!b;:a;$bb;N;/ENDINGTOKEN/!ba;:b;s/\n//g' file
or:
sed -r '/(START|END)TOKEN/,//{/STARTINGTOKEN/{h;d};H;/ENDINGTOKEN/{x;s/\n//gp};d}' file
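The commands above join the lines with nothing in between. Since the question's own attempt was tr "\n" " ", here is a hedged awk sketch that joins each block with single spaces and leaves all other lines untouched. The names tmp, joined, inblock and buf are made up for the illustration, and the input is a shortened version of the example file shown earlier.

```shell
# A short input with one token pair surrounded by ordinary lines.
tmp=$(mktemp)
printf '%s\n' foo 'STARTINGTOKEN xx' 1 2 'ENDINGTOKEN yy' 3 > "$tmp"

joined=$(awk '
  /STARTINGTOKEN/ { inblock = 1 }          # a block begins on this line
  inblock {
    buf = (buf == "" ? $0 : buf " " $0)    # collect lines, space-separated
    if (/ENDINGTOKEN/) { print buf; buf = ""; inblock = 0 }
    next                                   # do not print block lines singly
  }
  { print }                                # lines outside any block
  END { if (buf != "") print buf }         # flush a block left unterminated
' "$tmp")

printf '%s\n' "$joined"
rm -f "$tmp"
```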

Split input into multiple outputs based on content?

Let's assume there is a file which looks like this:
xxxx aa whatever
yyyy bb whatever
zzzz aa whatever
I'd like split it into 2 files, containing:
first:
xxxx aa whatever
zzzz aa whatever
second:
yyyy bb whatever
I.e. I want to group the rows based on some value in the lines (rule can be: 2nd word separated by spaces), but do not reorder the lines within groups.
Of course I can write a program to do it, but I'm wondering if there is any ready tool that can do something like this?
Sorry, I didn't mention it, as I assumed it was pretty obvious: the number of different "words" is huge. We are talking about at least 10000 of them, so any solution based on enumerating the words beforehand will not work.
And also - I wouldn't really like multi-pass split - the files in question are usually pretty big.
This will create files named output.aa, output.bb, etc.:
awk '{print >> ("output." $2)}' input.file
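With around 10000 distinct words, some awk implementations may hit the per-process limit on open files, because every output file stays open until awk exits. A hedged variant closes each file right after writing to it. The names dir, out, first and second are illustrative, and the input is the three-line sample from the question.

```shell
# Work in a scratch directory so the output.* files are easy to clean up.
dir=$(mktemp -d)
printf '%s\n' 'xxxx aa whatever' 'yyyy bb whatever' 'zzzz aa whatever' > "$dir/input.file"

# One pass over the input: append each line to the file named after its
# second field, then close it so only one output file is open at a time.
(cd "$dir" && awk '{ out = "output." $2; print >> out; close(out) }' input.file)

first=$(cat "$dir/output.aa")
second=$(cat "$dir/output.bb")
printf '%s\n' "$first" "$second"
rm -rf "$dir"
```

Closing after every line costs extra open/close calls, so it trades speed for a bounded number of open files. Because >> appends, output files left over from an earlier run should be removed first.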
Well, you could do a grep to get the lines that match, and a grep -v to get the lines that don't match.
Hm, you could do sort -t" " -s -k 2,2, but that's O(n log n).
