bash command to extract sequences of text based on a starting key

I have a log file like the following. Each line logs some string and the thread id. Each thread belongs to a process, and a process can have N threads.
Based on the following sample, I want to extract (using bash tools: grep, sed, or whatever) all the lines of all threads that belong to a given process. Note that the process is mentioned only once, at the top of a thread sequence:
line1 thread= 150 process= 200
line2 thread= 152 whatever
line3 thread= 150 whatever
line4 thread= 150 whatever
line5 thread= 130 whatever
line6 thread= 130 process= 200
line7 thread= 150 process= 201
line8 thread= 130 whatever
line9 thread= 130 whatever
For this sample, given process 200, the output should be:
line1 thread= 150 process= 200
line3 thread= 150 whatever
line4 thread= 150 whatever
line6 thread= 130 process= 200
line8 thread= 130 whatever
line9 thread= 130 whatever

awk solution:
filter_threads.awk script:
#!/bin/awk -f
function get_thread(s) {   # extracts the thread number from the string
    t = substr(s, index(s, "=") + 1);   # considering `=` as separator (e.g. `thread=150`)
    return t;
}
BEGIN {
    pat = "process=" p     # regex pattern to match the line with the specified process
}
$3 ~ pat {                 # on encountering a "process" line
    thread = get_thread($2); print; next   # getting the base thread number
}
{
    t = get_thread($2);
    if (t == thread) print # comparing the current thread number with the base thread number
}
Usage:
awk -f filter_threads.awk -v p=200 yourfile
where p is the process number.
The output:
line1 thread=150 process=200
line3 thread=150 whatever
line4 thread=150 whatever
line6 thread=130 process=200
line8 thread=130 whatever
line9 thread=130 whatever
Update:
As you have changed your initial input, the new solution would be as below:
awk -v p=200 '$4~/process=/ && $5==p{ thread=$3; print; next }$3==thread{ print }' yourfile
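If the same process can keep several threads logging at once, a variant of the updated one-liner that remembers every thread seen for the target process may be safer. This is only a sketch against the spaced input format shown above; yourfile and the field positions are the same assumptions as in the answer:
awk -v p=200 '
    $4 == "process=" && $5 == p { seen[$3] = 1; print; next }   # start of a block for the wanted process: remember its thread
    $4 == "process=" && $5 != p { delete seen[$3]; next }       # the thread moved on to another process: forget it
    $3 in seen                                                   # any other line from a remembered thread
' yourfile
On the sample it prints the same six lines; it only differs from the single-variable version when a remembered thread keeps logging after another thread has opened its own block for the same process.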

Related

Move all lines between header and footer into new file in unix

I have file records like below: header, data & footer records.
I need to move only the data part to another file. The new file should only contain the lines between Header2 and Footer1.
I have tried head -n 30 filename | tail -10 > newfile
but that does not work, as the data record counts may vary.
example records from source file .
Header1
Header2
Header3
SEQ++1
line1
line2
SEQ++2
line1
SEQ++3
line1
line2
line3
Footer1
Footer2
Footer3
Output file should have:
SEQ++1
line1
line2
SEQ++2
line1
SEQ++3
line1
line2
line3
There are different ways.
grep:
grep -v -E "Header|Footer" source.txt
awk:
awk '! /Header.|Footer./ { print }' source.txt
You can replace the "Header" and "Footer" values with whatever you use to identify those lines.
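If you specifically want everything strictly between the last header line and the first footer line, rather than just stripping header/footer lines, a range-style awk sketch such as this should also work, assuming Header3 and Footer1 always delimit the data block as in the sample:
awk '/^Footer1$/ { f = 0 } f; /^Header3$/ { f = 1 }' source.txt
The flag is cleared before printing and set after printing, so the Header3 and Footer1 lines themselves are excluded.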

Print all the lines between two patterns in shell

I have a file which is the log of a script running in a daily cronjob. The log file looks like-
Aug 19
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Aug 19
Aug 20
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Aug 20
Aug 21
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Aug 21
The script writes the log starting with the date and ending with the date, and all the log lines are written in between.
Now when I try to get the logs for a single day using the command below -
sed -n '/Aug 19/,/Aug 19/p' filename
it displays the output as -
Aug 19
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Aug 19
But if I try to get the logs for multiple dates, the logs of the last day are always missing.
Example- If I run the command
sed -n '/Aug 19/,/Aug 20/p' filename
the output looks like -
Aug 19
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Aug 19
Aug 20
I have gone through this site and found some valuable input on a similar problem, but none of the solutions works for me. The links are Link 1 and Link 2.
The commands that I have tried are -
awk '/Aug 15/{a=1}/Aug 21/{print;a=0}a'
awk '/Aug 15/,/Aug 21/'
sed -n '/Aug 15/,/Aug 21/p'
grep -Pzo "(?s)(Aug 15(.*?)(Aug 21|\Z))"
but none of these commands gives the logs of the last date; all of them print only up to the first timestamp, as I have shown above.
I think you can use the awk command below to print the lines between Aug 19 & Aug 20:
awk '/Aug 19/||/Aug 20/{a++}a; a==4{a=0}' file
Brief explanation:
/Aug 19/||/Aug 20/: find the records that match Aug 19 or Aug 20.
When that criterion is met, increment the counter with a++.
The bare a before the semicolon prints the record whenever the counter is greater than 0.
The final criterion, a==4, resets a to 0. Mind that this only works for the case in the example; if more than 4 lines match Aug 19 or Aug 20, adjust the number 4 in the answer to meet your new requirement.
If you want to put the searched patterns into variables, modify the command as follows:
$ b="Aug 19"
$ c="Aug 20"
$ awk -v b="$b" -v c="$c" '$0 ~ c||$0 ~ b{a++}a; a==4{a=0}' file
You may also use multiple range patterns in sed, separated by a semicolon:
sed -n '/Aug 19/,/Aug 19/p;/Aug 20/,/Aug 20/p' filename
Could you please try the following awk solution too and let me know if it helps you.
awk '/Aug 19/||/Aug 20/{flag=1}; /Aug/ && (!/Aug 19/ && !/Aug 20/){flag=""} flag' Input_file
EDIT: Adding the output here too so the OP can see it.
awk '/Aug 19/||/Aug 20/{flag=1}; /Aug/ && (!/Aug 19/ && !/Aug 20/){flag=""} flag' Input_file
Aug 19
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Aug 19
Aug 20
Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Aug 20
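For readability, the same flag-based command can be spread out with comments (the annotation is mine, the logic is the answer's):
awk '
    /Aug 19/ || /Aug 20/              { flag = 1  }   # turn printing on at either boundary date
    /Aug/ && (!/Aug 19/ && !/Aug 20/) { flag = "" }   # any other date line turns it off again
    flag                                              # print while the flag is set
' Input_file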
The following approach is quite easy to understand conceptually...
Print all lines from Aug 19 onwards to the end of the file.
Reverse the order of the lines (with tac, because tac is cat backwards).
Print all lines from Aug 21 onwards; on the reversed text this starts at the last Aug 21, i.e. the closing line of that day.
Reverse the order of the lines back to the original order.
sed -ne '/Aug 19/,$p' filename | tac | sed -ne '/Aug 21/,$p' | tac
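If you prefer a single pass without tac, a rough awk equivalent (assuming each date line appears exactly twice, once opening and once closing the day) could be:
awk '/Aug 19/ { f = 1 } f; /Aug 21/ && ++n == 2 { exit }' filename
Here f turns printing on at the first Aug 19, and n counts the Aug 21 lines so the program stops right after printing the closing one.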

How to find duplicate lines in a file?

I have an input file with the following data:
line1
line2
line3
begin
line5
line6
line7
end
line9
line1
line3
I am trying to find all the duplicate lines. I tried
sort filename | uniq -c
but it does not seem to be working for me. It gives me:
1 begin
1 end
1 line1
1 line1
1 line2
1 line3
1 line3
1 line5
1 line6
1 line7
1 line9
The question may seem like a duplicate of Find duplicate lines in a file and count how many times each line was duplicated?, but the nature of the input data is different.
Please suggest.
use this:
sort filename | uniq -d
man uniq
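If you want every occurrence of the duplicated lines printed, not just one copy per duplicate, GNU uniq also has -D:
sort filename | uniq -D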
try
sort -u file
or
awk '!a[$0]++' file
You'll have to modify the standard de-dupe code just a tiny bit to account for this.
If you want a single copy of each duplicated line, it's very much the same idea:
{m,g}awk 'NF~ __[$_]++' FS='^$'
{m,g}awk '__[$_]++==!_'
If you want every copy of the duplicates printed, then whenever the condition yields true for the first time, print 2 copies of it, and print further matches along the way.
Usually it's much faster to de-dupe first and then sort, instead of the other way around.
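A minimal sketch of that "print every copy" idea in plain awk (my own variant, not the answerer's exact code):
awk 'seen[$0]++ { if (seen[$0] == 2) print; print }' filename
The first time a line repeats, seen[$0] has just reached 2, so it is printed twice to account for the first copy; every later repeat prints once more.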

Merging Two, Nearly Similar Text Files

Suppose we have ~/file1:
line1
line2
line3
...and ~/file2:
line1
lineNEW
line3
Notice that these two files are nearly identical, except that line2 differs from lineNEW.
Question: How can I merge these two files to produce one that reads as follows:
line1
line2
lineNEW
line3
That is, how can I merge the two files so that all unique lines are captured (without overlap) into a third file? Note that the order of the lines doesn't matter (as long as all unique lines are being captured).
awk '{
    print                        # print the current line of file1
    getline line < second        # read the corresponding line from file2
    if ($0 != line) print line   # if they differ, print the file2 line as well
}' second=file2 file1
will do it.
Consider the command below. It is more robust, since it also works for files where a new line has been added instead of replaced (see f1 and f2 below).
First, I executed it using your files. I divided the command into two lines so that it fits nicely in the code block:
$ (awk '{ print NR, $0 }' file1; awk '{ print NR, $0 }' file2) |\
sort -k 2 | uniq -f 1 | sort | cut -d " " -f 2-
It produces your expected output:
line1
line2
lineNEW
line3
I also used these two extra files to test it:
f1:
line1 stuff after a tab
line2 line2
line3
line4
line5
line6
f2:
line1 stuff after a tab
lineNEW
line2 line2
line3
line4
line5
line6
Here is the command:
$ (awk '{ print NR, $0 }' f1; awk '{ print NR, $0 }' f2) |\
sort -k 2 | uniq -f 1 | sort | cut -d " " -f 2-
It produces this output:
line1 stuff after a tab
line2 line2
lineNEW
line3
line4
line5
line6
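Roughly how the pipeline works, as a commented sketch over the same f1 and f2 files (the comments and the layout are mine):
{
  awk '{ print NR, $0 }' f1   # prefix every line with its position in its file
  awk '{ print NR, $0 }' f2
} |
sort -k 2 |          # sort by the content (field 2 onward), so identical lines become adjacent
uniq -f 1 |          # drop duplicates while ignoring the first field (the position prefix)
sort |               # restore the original order via the prefix (use sort -n for files longer than 9 lines)
cut -d " " -f 2-     # strip the helper prefix again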
When you do not care about the order, just sort them:
cat ~/file1 ~/file2 | sort -u > ~/file3
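If you would rather keep the first-seen order instead of sorting, the usual awk de-dupe idiom works here too:
awk '!seen[$0]++' ~/file1 ~/file2 > ~/file3
It prints each distinct line the first time it appears across both files.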

awk or sed - delete specific lines

I've got this
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10
line11
line12
line13
line14
line15
I want this
line1
line3
line5
line6
line8
line10
line11
line13
line15
As you can see, the lines to be deleted follow a pattern: from a deleted line x, the next deletion is at x+2, then the one after that at x+3, and so on.
I tried this with awk, but it is not the right way.
awk '(NR)%2||(NR)%3' file > filev1
Any ideas why?
If I decipher your requirements correctly, then
awk 'NR % 5 != 2 && NR % 5 != 4' file
should do.
Based on Wintermute's logic :)
awk 'NR%5!~/^[24]$/' file
line1
line3
line5
line6
line8
line10
line11
line13
line15
or
awk 'NR%5~/^[013]$/' file
How it works:
We can see from your lines that the ones marked with * should be removed and the others kept.
line1
line2*
line3
line4*
line5
line6
line7*
line8
line9*
line10
line11
line12*
line13
line14*
line15
By grouping the data into blocks of 5 lines with NR%5,
we see that the lines to delete are the 2nd and the 4th in every group.
NR%5!~/^[24]$/ divides the data into groups of 5,
and the /^[24]$/ part says not to keep positions 2 or 4.
The ^ and $ anchors are important so that the test matches exactly 2 or exactly 4,
not just any value that happens to contain a 2 or a 4 (such as 12 or 47); that is why it is anchored as ^2$ and ^4$.
Using GNU sed, you can use the following command:
sed '2~5d;4~5d' test.txt
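As a quick sanity check (GNU sed, since the first~step address is a GNU extension), you can generate the sample input with seq:
seq 15 | sed 's/^/line/' | sed '2~5d;4~5d'
2~5 addresses every 5th line starting at line 2 (2, 7, 12) and 4~5 every 5th starting at line 4 (4, 9, 14), which are exactly the lines to drop.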
