Extracting numeric pattern from file line - bash

I have a file that has the following format:
EDouble entry for scenario XX AAA 70337262003 Line 000000003350
EDouble entry for scenario XX AAA 70337262003 Line 000000003347
EDouble entry for scenario XX AAA 71375201001 Line 000000003353
EDouble entry for scenario XX AAA 71375201001 Line 000000003351
EDouble entry (different date/time) for scenario YY AAA 10722963407 Line 000000000447
EDouble entry for scenario YY AAA 55173006602 Line 000000002868
EDouble entry (different date/time) for scenario YY AAA 60404822801 Line 000000003285
What I want to do is basically strip away all the alphabet characters and output a file that contains:
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
I've thought of a couple ways that could assist me in getting there, simply listing some ideas since I don't have a ready solution. I could strip all alphabetic characters with:
tr -d '[[:alpha:]]'
but that would still mean I would need to process the file further to separate the first number from the second. Sed could perhaps provide a simpler solution since the second number will always start with 0.
sed -n 's/.*\[1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1-9][1- 9]\).*/\1/p'
to find the pattern, and only printing pattern – but the above command doesn't output anything. Could someone help me please? It's not necessary to accomplish this with sed, I imagine awk with gsub and grep have something similar?

Print third to last column:
awk '{print $(NF-2)}' file
Output:
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
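A minimal sketch of the same column approach with a guard, assuming the ID is always the third field from the end. The sample file and the variable name ids are made up for illustration. The extra pattern only prints the field when it consists of digits and does not start with 0:

```shell
# Hypothetical sample data in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
EDouble entry for scenario XX AAA 70337262003 Line 000000003350
EDouble entry (different date/time) for scenario YY AAA 10722963407 Line 000000000447
A short line
EOF

# Print the third-to-last field only when it looks like an ID
ids=$(awk 'NF >= 3 && $(NF-2) ~ /^[1-9][0-9]*$/ { print $(NF-2) }' "$tmp")
printf '%s\n' "$ids"
rm -f "$tmp"
```

The line "A short line" is skipped because its third-to-last field is not numeric.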

If you prefer sed, use this:
sed -rn "s#.*([1-9][0-9]{10}).*#\1#p" file.txt

With grep you can do this:
grep -o '[1-9][0-9]\{10\}' file
With sed:
sed -n 's/.*\([1-9][0-9]\{10\}\).*/\1/p' file
There's a narrow margin of error when targeting 11 digits: the line numbers, which start with 0, are 12 digits long and could contain an 11-digit run that begins with a non-zero digit. A more robust solution that takes this into account would be:
sed -n 's/.*[[:blank:]]\([1-9][0-9]\{10\}\).*/\1/p' file
i.e. make sure that a [[:blank:]] character precedes the number.
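To illustrate the margin of error, here is a small sketch using a made-up input line whose line number happens to contain an 11-digit run starting with 1. The variable names loose and strict are only for this example:

```shell
# Hypothetical line: the 12-digit line number contains "12345678901"
line='EDouble entry for scenario XX AAA 70337262003 Line 012345678901'

# Without the blank, the greedy .* pushes the match to the end of the line
loose=$(printf '%s\n' "$line" | sed -n 's/.*\([1-9][0-9]\{10\}\).*/\1/p')

# With the blank, the number has to start right after whitespace
strict=$(printf '%s\n' "$line" | sed -n 's/.*[[:blank:]]\([1-9][0-9]\{10\}\).*/\1/p')

printf 'loose:  %s\nstrict: %s\n' "$loose" "$strict"
```

Here loose ends up as 12345678901 (taken from the line number), while strict is 70337262003.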

I see that AAA is constant in all rows and sits directly before the number.
Therefore you can use this:
$ grep -oP '(?<=AAA\s)\s*\d+' data
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
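The -P option is not part of POSIX grep, so it may be missing on some systems. As a rough sketch of the same "anchor on AAA" idea with plain sed, assuming AAA is always followed by a single space and then the number (the sample file and the name after_aaa are illustrative):

```shell
# Hypothetical sample data in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
EDouble entry for scenario XX AAA 71375201001 Line 000000003353
EDouble entry (different date/time) for scenario YY AAA 60404822801 Line 000000003285
EOF

# Capture the digits that follow "AAA " and print only lines where that worked
after_aaa=$(sed -n 's/.* AAA \([0-9][0-9]*\).*/\1/p' "$tmp")
printf '%s\n' "$after_aaa"
rm -f "$tmp"
```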

This one extracts a group of digits followed by a word boundary, but not followed by the end of the line:
$ grep -Po '\d+\b(?!$)' infile
70337262003
70337262003
71375201001
71375201001
10722963407
55173006602
60404822801
-P enables Perl regular expressions
-o retains only the match
\d+\b greedily matches digits followed by a word boundary
(?!$) is a "negative look-ahead": if the next character is the end of the line, don't match

Related

Shell script: Insert multiple lines into a file ONLY after a specified pattern appears for the FIRST time. (The pattern appears multiple times)

I want to insert multiple lines into a file using shell script. Let us consider my original file: original.txt:
aaa
bbb
ccc
aaa
bbb
ccc
aaa
bbb
ccc
.
.
.
and my insert file: toinsert.txt
111
222
333
Now I have to insert the three lines from the 'toinsert.txt' file ONLY after the line 'ccc' appears for the FIRST time in the 'original.txt' file. Note: the 'ccc' pattern appears more than one time in my 'original.txt' file. After inserting ONLY after the pattern appears for the FIRST time, my file should change like this:
aaa
bbb
ccc
111
222
333
aaa
bbb
ccc
aaa
bbb
ccc
.
.
.
I should do the above insertion using a shell script. Can someone help me?
Note2: I found a similar case, with a partial solution:
sed -i -e '/ccc/r toinsert.txt' original.txt
which actually does the insertion multiple times (for every time the ccc pattern shows up).
Use ed, not sed, to edit files:
printf "%s\n" "/ccc/r toinsert.txt" w | ed -s original.txt
It inserts the contents of the other file after the first line containing ccc and, unlike your sed version, nowhere else.
This might work for you (GNU sed):
sed '0,/ccc/!b;/ccc/r insertFile' file
Use a range:
If the current line lies after the first occurrence of ccc (that is, outside the range 0,/ccc/), skip further processing and print it as usual.
Otherwise, if the current line contains ccc, insert the lines from insertFile.
N.B. This uses the address 0 which allows the regexp to occur on line 1 and is specific to GNU sed.
or:
sed -e '/ccc/!b;r insertFile' -e ':a;n;ba' file
Use a loop:
If a line does not contain ccc, no further processing and print as usual.
Otherwise, insert lines from insertFile and then using a loop, fetch/print the remaining lines until the end of the file.
N.B. The r command insists on being delimited from other sed commands by a newline. The -e option simulates this effect and thus the sed commands are split across two -e options.
or:
sed 'x;/./{x;b};x;/ccc/!b;h;r insertFile' file
Use a flag:
If the hold space is not empty (the flag has already been set), no further processing and print as usual.
Otherwise, if the line does not contain ccc, no further processing and print as usual.
Otherwise, copy the current line to the hold space (set the flag) and insert lines from insertFile.
N.B. In all cases the r command inserts lines from insertFile after the current line is printed.
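The flag idea can also be sketched in awk, which avoids the GNU-specific address 0. This is only an illustration: the temporary directory, the variable names ins, inserted and merged, and the shortened sample files are assumptions made for the example:

```shell
# Hypothetical sample files in a temporary directory
dir=$(mktemp -d)
printf '%s\n' aaa bbb ccc aaa bbb ccc > "$dir/original.txt"
printf '%s\n' 111 222 333 > "$dir/toinsert.txt"

merged=$(awk -v ins="$dir/toinsert.txt" '
  { print }                          # always print the current line first
  !inserted && /ccc/ {               # only the first line containing ccc
    while ((getline line < ins) > 0) print line
    close(ins)
    inserted = 1                     # flag: never insert again
  }
' "$dir/original.txt")
printf '%s\n' "$merged"
rm -rf "$dir"
```

The result is written to standard output, so redirect it to a new file and move that over the original if an in-place edit is wanted.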

Grep based on pattern

Sample Text:
This is a test
This is aaaa test
This is aaa test
This is test a
This aa is test
I have just started learning unix commands like grep, awk and sed and have a quick question. If my text file contains the above text, how can I print only the lines that use the letter 'a' 2 or fewer times?
I tried using awk, but don't understand the syntax to add up all the instances of 'a' and only print the lines that have 'a' 2 or fewer times. I understand comparing numbers based on columns, like awk '$1 <= 2', but don't know how to use that with characters as well. Any help would be appreciated.
Essentially it should print out:
This is a test
This is test a
This aa is test
For Clarity: I don't want to remove the extra As, but rather only print the lines that contain two or fewer As.
Using awk
awk '!/aaa+/' file
This is a test
This is test a
This aa is test
This does not print lines that contain three or more consecutive a's.
Same with sed
sed '/aaa\+/d' file
This is a test
This is test a
This aa is test
By default sed prints every line. /aaa\+/d tells sed to delete lines with three or more consecutive a's (\+ is a GNU extension).
like this?
kent$ grep -v 'aaa\+' file
This is a test
This is test a
This aa is test
Update
I just saw the comment. If the a's may appear anywhere on the line, consecutive or not, see this example (with awk):
kent$ cat f
1a a
2a
3
4a a a aa
5aaaaaaaaaa
kent$ awk 'gsub(/a/,"a")<3' f
1a a
2a
3
without gsub:
kent$ awk -F'a' 'NF<4' f
1a a
2a
3
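As a sketch of the counting approach applied to the sample text from the question, with the letter and the limit passed in as variables. The names ch, max, copy and few are made up for this example, and ch is treated as a regular expression by gsub:

```shell
# Sample text from the question in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
This is a test
This is aaaa test
This is aaa test
This is test a
This aa is test
EOF

# gsub returns the number of replacements; work on a copy so $0 stays intact
few=$(awk -v ch="a" -v max=2 '{ copy = $0; n = gsub(ch, "", copy) } n <= max' "$tmp")
printf '%s\n' "$few"
rm -f "$tmp"
```

Counting on a copy keeps the printed line unchanged, which matches the requirement of not removing any of the a's.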

In shell, how to process this line, in order to extract the field that I want

I have some lines in a flat file. Take two lines for instance:
1 aa bb 05 may 2014 cc G 14-MAY-2014 hello world
j sd az 20140505 sd G 14-MAY-2014 hello world haha
As you may have noticed, I can count neither the characters nor the spaces, because the lines are not well aligned, and the fourth field sometimes looks like 20140505 and sometimes like 05 may 2014. So what I want is to match the G, or the 14-MAY-2014; then I can easily get the following fields: hello world or hello world haha. Can anyone help me? Thank you!
Assuming your lines are in a file called test.txt:
cat test.txt | sed -r 's/^.*-[0-9]{4}\s//'
This is using GNU sed on a Linux system. There are many other ways. Here I simply remove everything up to and including the date from the beginning of the line.
sed -r 's/^.*-[0-9]{4}\s//'
-r = extended regex, makes things like the quantifier {4} possible
's/ ... //' = s is for substitute,
it matches the first part and replaces it with the second.
since the second part is empty, it's a remove/delete
^ = start of line
.* = any character, any number of times
-[0-9]{4} = a dash, followed by four digits ([0-9]), the year part of the date
\s = any white space
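Since \s is a GNU extension, a variant with the POSIX class [[:space:]] may be more portable. The following is a sketch under that assumption, using -E (accepted by both GNU and BSD sed) and the two sample lines from the question. The variable name tail_text is illustrative:

```shell
# The two sample lines from the question in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
1 aa bb 05 may 2014 cc G 14-MAY-2014 hello world
j sd az 20140505 sd G 14-MAY-2014 hello world haha
EOF

# Remove everything up to and including "-YYYY" plus one whitespace character
tail_text=$(sed -E 's/^.*-[0-9]{4}[[:space:]]//' "$tmp")
printf '%s\n' "$tail_text"
rm -f "$tmp"
```

Because .* is greedy, the last dash followed by four digits on the line is the one that gets matched.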
You can use Perl's lookbehind:
perl -lne '/(?<=14-MAY-2014)(.*)/ && print $1' file
It prints everything after 14-MAY-2014, including the space that follows the date.
You can also use grep if it supports -P:
grep -Po '(?<=14-MAY-2014)(.*)' file

'grep +A': print everything after a match [duplicate]

I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after: http://www.yahoo.com, results looks like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3:http://www.yahoo.com
But I don't know how to get the part of the file after line number 3. I also know that grep has the -A flag, which prints the lines after a match, but you need to specify how many lines you want. I am wondering whether there is a way around that, like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line with yahoo, we set the variable y=1, and then skip that line (the next command will jump to the next line, thus skip any further processing on the current line). Without the next command, the line yahoo will be printed.
The second part is a short hand for:
y != 0 { print }
Which means: for each line, if the variable y is non-zero, print that line. In AWK, a variable is created the first time you refer to it and starts out as zero or the empty string, depending on context. Before yahoo is encountered, y is 0, so the script prints nothing. After yahoo is encountered, y is 1, so every following line is printed.
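A variation on the same flag technique, sketched here with an exact string comparison instead of a regex so that the dots in the URL are not treated as wildcards. The variable names url, found and rest are assumptions for this example:

```shell
# The sample list from the question in a temporary file
tmp=$(mktemp)
printf '%s\n' http://www.google.com http://www.bing.com http://www.yahoo.com \
  http://www.baidu.com http://www.yandex.com > "$tmp"

rest=$(awk -v url="http://www.yahoo.com" '
  found { print }            # print only once the flag is set
  $0 == url { found = 1 }    # exact string comparison, no regex
' "$tmp")
printf '%s\n' "$rest"
rm -f "$tmp"
```

Putting the print rule before the rule that sets the flag plays the role of next: the matching line itself is not printed.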
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1); a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes, meaning "the first line that matches this regexp". (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the -e directive means "from the first appearance of a line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
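One of the "slightly more complicated" rules is worth a sketch: with an address like 1,/re/ the end pattern is only looked for from line 2 onward, so if the match is on line 1 the range runs on to the next match or to the end of the file. The file below is a made-up case with the URL on the first line; range_out and awk_out are illustrative names, and the awk line is shown only as one way to sidestep the issue:

```shell
# Hypothetical file where the pattern is on the very first line
tmp=$(mktemp)
printf '%s\n' http://www.yahoo.com http://www.baidu.com http://www.yandex.com > "$tmp"

# The range starts on line 1 and never finds a later "yahoo", so everything is deleted
range_out=$(sed '1,/yahoo/d' "$tmp")

# A flag in awk does not have this edge case
awk_out=$(awk 'found; /yahoo/ { found = 1 }' "$tmp")

printf 'sed: [%s]\nawk:\n%s\n' "$range_out" "$awk_out"
rm -f "$tmp"
```

With GNU sed, the address 0,/re/ is another way to allow a match on the first line.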
awk '/yahoo/ ? c++ : c' file1
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get the line number of the first line containing "yahoo"
index=$(grep -n "yahoo" filepath | head -n 1 | cut -d':' -f1)
# Get the total number of lines in the file
totallines=$(wc -l < filepath)
# Subtract the index from the total number of lines
result=$((totallines - index))
# Print the matching line and everything after it
grep -A "$result" "yahoo" filepath
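Once the line number is known, tail can print everything after it directly, without counting the remaining lines. A minimal sketch, assuming the pattern occurs at least once in the file (the names index and after and the temporary file are illustrative):

```shell
# The sample list from the question in a temporary file
tmp=$(mktemp)
printf '%s\n' http://www.google.com http://www.bing.com http://www.yahoo.com \
  http://www.baidu.com http://www.yandex.com > "$tmp"

# Line number of the first match
index=$(grep -n 'yahoo' "$tmp" | head -n 1 | cut -d: -f1)

# tail -n +N starts printing at line N, so start one line after the match
after=$(tail -n "+$((index + 1))" "$tmp")
printf '%s\n' "$after"
rm -f "$tmp"
```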

Removing newlines between tokens

I have a file that contains some information spanning multiple lines. In order for certain other bash scripts I have to work properly, I need this information to all be on a single line. However, I obviously don't want to remove all newlines in the file.
What I want to do is replace newlines, but only between all pairs of STARTINGTOKEN and ENDINGTOKEN, where these two tokens are always on different lines (but never get jumbled up together, it's impossible for instance to have two STARTINGTOKENs in a row before an ENDINGTOKEN).
I found that I can remove newlines with
tr "\n" " "
and I also found that I can match patterns over multiple lines with
sed -e '/STARTINGTOKEN/,/ENDINGTOKEN/!d'
However, I can't figure out how to combine these operations while leaving the remainder of the file untouched.
Any suggestions?
are you looking for this?
awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
example:
kent$ cat file
foo
bar
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm
5
6
7
nnn ENDINGTOKEN
8
9
kent$ awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
foo
bar
STARTINGTOKEN xx12ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm567nnn ENDINGTOKEN
8
9
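If the joined lines should be separated by a space, as tr "\n" " " would do, the printf format can carry it. A small sketch of that variation on a shortened, made-up sample (the variable name joined is illustrative):

```shell
# Shortened hypothetical sample in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
foo
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
EOF

# Inside a token pair, print the line followed by a space instead of a newline
joined=$(awk '/STARTINGTOKEN/ { f = 1 } /ENDINGTOKEN/ { f = 0 }
              { if (f) printf "%s ", $0; else print }' "$tmp")
printf '%s\n' "$joined"
rm -f "$tmp"
```

The second line of the output is then "STARTINGTOKEN xx 1 2 ENDINGTOKEN yy", and the lines outside the pair are unchanged.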
This seems to work:
sed -ne '/STARTINGTOKEN/{ :next ; /ENDINGTOKEN/!{N;b next;}; s/\n//g;p;}' "yourfile"
Once it finds the starting token it loops, picking up lines until it finds the ending token, then removes all the embedded newlines and prints the result. Then it repeats. Note that because of -n, lines outside the token pairs are not printed.
Using awk:
awk '$0 ~ /STARTINGTOKEN/ || l {l=sprintf("%s%s", l, $0)}
/ENDINGTOKEN/{print l; l=""}' input.file
This might work for you (GNU sed):
sed '/STARTINGTOKEN/!b;:a;$bb;N;/ENDINGTOKEN/!ba;:b;s/\n//g' file
or:
sed -r '/(START|END)TOKEN/,//{/STARTINGTOKEN/{h;d};H;/ENDINGTOKEN/{x;s/\n//gp};d}' file

Resources