Bash script cut at specific ranges - bash

I have a log file with plenty of collected logs. I already have a grep command with a regex that outputs the line numbers of the lines that match it.
This is the grep command I'm using to output the matched line numbers:
grep -n -E 'START_REGEX|END_REGEX' Example.log | cut -d ':' -f 1 > ranges.txt
The regex is an alternation: it can match either the beginning of a specific log entry or its end, so the output is something like:
12
45
128
136
...
The idea is to use this as a source of ranges to cut specific sections out of the log file, from the first number to the second, and save them to another file.
The ranges are formed by consecutive pairs of the output: in the example, the first range is 12,45 and the second is 128,136.
I expect to see in the final file all the text from line 12 to 45 and then from 128 to 136.
The problem I'm facing is that the sed command seems to work with only one range at a time.
sed -E -iTMP "${START_RANGE},${END_RANGE}!d;${END_RANGE}q" "$FILE_NAME"
Is there any way (maybe with awk) to do that in just one pass?
Constraints: I can only use supported bash commands.

You can use an awk statement, too:
awk '(NR>=12 && NR<=45) || (NR>=128 && NR<=136)' file
where NR is a special variable in Awk which keeps track of the line number as it processes the file.
An example,
seq 1 10 > file
cat file
1
2
3
4
5
6
7
8
9
10
awk '(NR>=1 && NR<=3) || (NR>=8 && NR<=10)' file
1
2
3
8
9
10
You can also avoid hard-coding the line numbers by using the -v variable option:
awk -v start1=1 -v end1=3 -v start2=8 -v end2=10 '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2)' file
1
2
3
8
9
10

With sed you can do multiple ranges of lines like so:
sed -n '12,45p;128,136p' file
This would output lines 12-45, then 128-136.
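Neither answer reads the ranges from ranges.txt itself, which is what the question asks for. A minimal sketch of one way to do that in a single pass, assuming ranges.txt always holds an even number of entries (start, end, start, end, ...). The function name extract_ranges and the demo file contents are made up for illustration:

```shell
# Hypothetical helper: print every line of a file that falls inside one of
# the ranges listed pairwise (start, end, start, end, ...) in a ranges file.
extract_ranges() {
    awk '
        NR == FNR {                      # first file: the list of line numbers
            if (FNR % 2) lo[++n] = $1    # odd entries open a range
            else         hi[n]  = $1     # even entries close it
            next
        }
        {                                # second file: the log itself
            for (i = 1; i <= n; i++)
                if (FNR >= lo[i] && FNR <= hi[i]) { print; break }
        }
    ' "$1" "$2"
}

# Demo on generated data (the file names are placeholders).
tmpdir=$(mktemp -d)
seq 1 20 > "$tmpdir/Example.log"
printf '%s\n' 2 4 9 10 > "$tmpdir/ranges.txt"
result=$(extract_ranges "$tmpdir/ranges.txt" "$tmpdir/Example.log")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

With the demo data this prints lines 2 to 4 and 9 to 10. The inner loop checks every range for every line, which is fine for a handful of ranges but would be worth replacing with a moving index if there were thousands of them.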

Related

Grep all content of File 1 from File 2

This is regarding grepping all the Thread IDs which are mentioned in one file from the thread dump file in unix.
I also require at least 5 lines below each thread id from thread dump while grepping.
Like below:-
MAX_CPU_PID_TD_Ids.out:
1001
1003
MAX_CPU_PID_TD.txt:
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
............TDID=1002...................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Output should contain :-
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
If possible I would like to have the above output in the mail body.
I have tried the code below, but it sends me the thread IDs in the body with the thread dump file as an attachment.
However, I would like to have only the description of each thread ID in the body of the mail.
JAVA_HOME=/u01/oracle/products/jdk
MAX_CPU_PID=`ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -2 | sed -n '1!p' | awk '{print $1}'`
ps -eLo pid,ppid,tid,pcpu,comm | grep $MAX_CPU_PID > MAX_CPU_PID_SubProcess.out
cat MAX_CPU_PID_SubProcess.out | awk '{ print "pccpu: "$4" pid: "$1" ppid: "$2" ttid: "$3" comm: "$5}' |sort -n > MAX_CPU_PID_SubProcess_Sorted_temp1.out
rm MAX_CPU_PID_SubProcess.out
sort -k 2n MAX_CPU_PID_SubProcess_Sorted_temp1.out > MAX_CPU_PID_SubProcess_Sorted_temp2.out
rm MAX_CPU_PID_SubProcess_Sorted_temp1.out
awk '{a[i++]=$0}END{for(j=i-1;j>=0;j--)print a[j];}' MAX_CPU_PID_SubProcess_Sorted_temp2.out > MAX_CPU_PID_SubProcess_Sorted_temp3.out
rm MAX_CPU_PID_SubProcess_Sorted_temp2.out
awk '($2 > 15 ) ' MAX_CPU_PID_SubProcess_Sorted_temp3.out > MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out
rm MAX_CPU_PID_SubProcess_Sorted_temp3.out
awk '{ print $8 }' MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out > MAX_CPU_PID_SubProcess_Sorted_temp4.out
( echo "obase=16" ; cat MAX_CPU_PID_SubProcess_Sorted_temp4.out ) | bc > MAX_CPU_PID_TD_Ids_temp.out
rm MAX_CPU_PID_SubProcess_Sorted_temp4.out
$JAVA_HOME/bin/jstack -l $MAX_CPU_PID > MAX_CPU_PID_TD.txt
#grep -i -A 10 'error' data
awk 'BEGIN{print "The below thread IDs from the attached thread dump of OUD1 server are causing the highest CPU utilization. Please Analyze it further\n"}1' MAX_CPU_PID_TD_Ids_temp.out > MAX_CPU_PID_TD_Ids.out
rm MAX_CPU_PID_TD_Ids_temp.out
tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out | mailx -s "OUD1 MAX CPU Utilization Analysis" -a MAX_CPU_PID_TD.txt <My Mail ID>
Answer for the first part: How to extract the lines.
The solution with grep -F -f MAX_CPU_PID_TD_Ids.out -A 5 MAX_CPU_PID_TD.txt as proposed in a comment is much simpler, but it may fail if the lines Line 1 etc can contain the values from MAX_CPU_PID_TD_Ids.out. It may also print a non-matching TDID= line if there are not enough lines after the previous matching line.
For the grep solution it may be better to create a file with patterns like ...TDID=1001....
The following script will print the matching lines ...TDID=XYZ... and at most the following 5 lines. It will stop after fewer lines if a new ...TDID=XYZ... is found.
For simplicity an empty line is printed before every ...TDID=XYZ... line, i.e. also before the first one.
awk 'NR==FNR {ids[$1]=1;next} # from the first file save all IDs as array keys
/\.\.\.TDID=/ {
    sel = 0; # stop any previous output
    id=gensub(/\.*TDID=([^.]*)\.*/,"\\1",1); # extract ID
    if(id in ids) { # select if ID is present in array
        print "" # empty line as separator
        sel = 1;
    }
    count = 0; # counter to limit number of lines
}
sel { # selected for output?
    print;
    count++;
    if(count > 5) { # stop after ...TDID= + 5 more lines (change the number if necessary)
        sel = 0
    }
}' MAX_CPU_PID_TD_Ids.out MAX_CPU_PID_TD.txt > MAX_CPU_PID_TD.extract
Apart from the first empty line, this script produces the expected output from the example input as shown in the question. If it does not work with the real input or if there are additional requirements, update the question to show the problematic input and the expected output or the additional requirements.
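The script above relies on gensub, which is a GNU awk extension. A rough sketch of the same selection logic using only sub(), which should also run on other awk implementations. It assumes the ID is followed by dots as in the example input, and it does not print the empty separator line. The function name extract_threads and the demo file names are hypothetical:

```shell
# Hypothetical helper: $1 = file of thread IDs, $2 = thread dump,
# $3 = number of lines to keep after each matching header (default 5).
extract_threads() {
    awk -v max="${3:-5}" '
        NR == FNR { ids[$1] = 1; next }   # first file: remember the wanted IDs
        /TDID=/ {
            id = $0
            sub(/^.*TDID=/, "", id)       # drop everything up to and including TDID=
            sub(/[.].*$/, "", id)         # drop the trailing dots
            left = (id in ids) ? max + 1 : 0   # header line plus max more lines
        }
        left > 0 { print; left-- }
    ' "$1" "$2"
}

# Demo data shaped like the example in the question.
tmpdir=$(mktemp -d)
printf '%s\n' 1001 1003 > "$tmpdir/ids.out"
for id in 1001 1002 1003; do
    printf '......TDID=%s......\n' "$id"
    printf 'Line %s\n' 1 2 3 4 5 6 7
done > "$tmpdir/dump.txt"
result=$(extract_threads "$tmpdir/ids.out" "$tmpdir/dump.txt")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

On the demo data this prints the 1001 and 1003 headers, each followed by Line 1 to Line 5, and nothing for 1002.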
Answer for the second part: Mail formatting
To get the resulting data into the mail body you simply have to pipe it into mailx instead of specifying the file as an attachment.
( tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out ; cat MAX_CPU_PID_TD.extract ) | mailx -s "OUD1 MAX CPU Utilization Analysis" <My Mail ID>

SED to spit out nth and (n+1)th lines

EDITS: For reference, "stuff" is a general variable, as is "KEEP".
KEEP could be "Hi, my name is Dave" on line 2 and "I love pie" on line 7. The numbers I've put here are for illustration only and DO NOT show up in the data.
I had a file that needed to be parsed, keeping every 4th line, starting at the 3rd line. In other words, it looked like this:
1 stuff
2 stuff
3 KEEP
4
5 stuff
6 stuff
7 KEEP
8 stuff etc...
Great, sed solved that easily with:
sed -n -e 3~4p myfile
giving me
3 KEEP
7 KEEP
11 KEEP
Now I have a different file format and a different take on the pattern:
1 stuff
2 KEEP
3 KEEP
4
5 stuff
6 KEEP
7 KEEP etc...
and I still want the output of
2 KEEP
3 KEEP
6 KEEP
7 KEEP
10 KEEP
11 KEEP
Here's the problem - this is a multi-pattern "pattern" for sed. It's "every 4th line, spit out 2 lines, but start at line 2".
Do I need to have some sort of DO/FOR loop in my sed, or do I need a different command like awk or grep? Thus far, I have tried formats like:
sed -n -e '3~4p;4~4p' myfile
and
awk 'NR % 3 == 0 || NR % 4 ==0' myfile
and
sed -n -e '3~1p;4~4p' myfile
and
awk 'NR % 1 == 0 || NR % 4 ==0' myfile
source: https://superuser.com/questions/396536/how-to-keep-only-every-nth-line-of-a-file
If your intent is to print lines 2,3 then every fourth line after those two, you can do:
$ seq 20 | awk 'BEGIN{e[2];e[3]} (NR%4) in e'
2
3
6
7
10
11
14
15
18
19
You were pretty close with your sed:
$ printf '%s\n' {1..12} | sed -n '2~4p;3~4p'
2
3
6
7
10
11
This is the idiomatic way to write it in awk:
$ awk 'NR%4==2 || NR%4==3' file
However, this special case can be shortened to
$ awk 'NR%4>1' file
This might work for you (GNU sed):
sed '2~4,+1p;d' file
Use a range: the first parameter is the starting line and step (in this case, starting at line 2 and repeating every 4 lines). The second parameter is how many lines following the start of the range to include (in this case plus one). Print these lines and delete all others.
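In the generic case, you want to keep lines p to p+q and p+n to p+q+n and p+2n to p+q+2n ... So you can write the following, passing the values in with -v (here p=2, n=4, q=1). The NR >= p test is needed because the remainder is negative for lines before p:
awk -v p=2 -v n=4 -v q=1 'NR >= p && (NR - p) % n <= q' file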
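As a quick check of the generic form, here is a small sketch that wraps it in a shell function and runs it on the question's pattern (start at line 2, period 4, one extra line). The function name keep_blocks is made up. The NR >= p guard is there because awk's % yields a negative remainder for lines before p, which would otherwise pass the <= q test:

```shell
# Hypothetical helper: keep q+1 lines out of every n, starting at line p.
# Usage: keep_blocks p n q < input
keep_blocks() {
    awk -v p="$1" -v n="$2" -v q="$3" 'NR >= p && (NR - p) % n <= q'
}

result=$(seq 12 | keep_blocks 2 4 1)
printf '%s\n' "$result"
```

For the numbers 1 to 12 this prints 2, 3, 6, 7, 10 and 11, matching the output requested in the question.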

manipulate text using shell script?

How can I manipulate a text file using a shell script?
input
chr2:98602862-98725768
chr11:3100287-3228869
chr10:3588083-3693494
chr2:44976980-45108665
expected output
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
Using sed you can write
$ sed 's/chr//; s/[:-]/ /g' file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
Or maybe you could use awk
awk -F "chr|[-:]" '{print $2,$3, $4}' file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
What it does
-F "chr|[-:]" sets the field separator to chr, : or -. Because each line starts with chr, the first field is empty and the values are in fields 2 to 4.
You can also use another field separator, such as -F '[^0-9]+', which treats anything other than digits as a separator.
If you don't care about a leading blank char:
$ tr -s -c '[0-9\n]' ' ' < file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
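If no external tool is wanted at all, the same split can be sketched in the shell itself by letting read break each line on : and - and then stripping the chr prefix with parameter expansion. The function name to_columns is hypothetical, and the sketch assumes every line has exactly the chrN:start-end shape shown in the question:

```shell
# Hypothetical helper: turn "chrN:start-end" lines on stdin into "N start end".
to_columns() {
    while IFS=':-' read -r chr start end; do
        # ${chr#chr} removes the leading literal "chr" from the first field
        printf '%s %s %s\n' "${chr#chr}" "$start" "$end"
    done
}

result=$(printf '%s\n' 'chr2:98602862-98725768' 'chr11:3100287-3228869' | to_columns)
printf '%s\n' "$result"
```

A shell loop like this is noticeably slower than sed or awk on large files, so it is mostly useful for small inputs.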

How do I pick random unique lines from a text file in shell?

I have a text file with an unknown number of lines. I need to grab some of those lines at random, but I don't want there to be any risk of repeats.
I tried this:
jot -r 3 1 `wc -l<input.txt` | while read n; do
awk -v n=$n 'NR==n' input.txt
done
But this is ugly, and doesn't protect against repeats.
I also tried this:
awk -vmax=3 'rand() > 0.5 {print;count++} count>max {exit}' input.txt
But that obviously isn't the right approach either, as I'm not guaranteed even to get max lines.
I'm stuck. How do I do this?
This might work for you:
shuf -n3 file
shuf is one of GNU coreutils.
If you have Python accessible (change the 10 to what you'd like):
python -c 'import random, sys; print("".join(random.sample(sys.stdin.readlines(), 10)).rstrip("\n"))' < input.txt
(This will work in Python 2.x and 3.x.)
Also, (again change the 10 to the appropriate value):
sort -R input.txt | head -10
If jot is on your system, then I guess you're running FreeBSD or OSX rather than Linux, so you probably don't have tools like rl or sort -R available.
No worries. I had to do this a while ago. Try this instead:
$ printf 'one\ntwo\nthree\nfour\nfive\n' > input.txt
$ cat rndlines
#!/bin/sh
# default to 3 lines of output
lines="${1:-3}"
# default to "input.txt" as input file
input="${2:-input.txt}"
# First, put a random number at the beginning of each line.
while IFS= read -r line; do
printf '%8d%s\n' "$(jot -r 1 1 99999999)" "$line"
done < "$input" |
sort -n | # Next, sort by the random number.
sed 's/^.\{8\}//' | # Last, remove the number from the start of each line.
head -n "$lines" # Show our output
$ ./rndlines 3 input.txt
two
one
five
$ ./rndlines 3 input.txt
four
two
three
$
Here's a 1-line example that also inserts the random number a little more cleanly using awk:
$ printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'BEGIN{srand()} {printf("%8d%s\n", rand()*10000000, $0)}' | sort -n | head -n 3 | cut -c9-
Note that different versions of sed (in FreeBSD and OSX) may require the -E option instead of -r to handle ERE instead of the BRE dialect in the regular expression if you want to use that explicitly, though everything I've tested works with escaped bounds in BRE. (Ancient versions of sed (HP/UX, etc) might not support this notation, but you'd only be using those if you already knew how to do this.)
This should do the trick, at least with bash and assuming your environment has the other commands available:
cat chk.c | while read -r x; do
echo "$RANDOM:$x"
done | sort -t: -k1 -n | tail -10 | sed 's/^[0-9]*://'
It basically outputs your file, placing a random number at the start of each line.
Then it sorts on that number, grabs the last 10 lines, and removes that number from them.
Hence, it gives you ten random lines from the file, with no repeats.
For example, here's a transcript of it running three times with that chk.c file:
====
pax$ testprog chk.c
} else {
}
newNode->next = NULL;
colm++;
====
pax$ testprog chk.c
}
arg++;
printf (" [%s] n", currNode->value);
free (tempNode->value);
====
pax$ testprog chk.c
char tagBuff[101];
}
return ERR_OTHER;
#define ERR_MEM 1
===
pax$ _
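The decorate, sort, strip approach used in the answers above can be wrapped in a small function. This is only a sketch: the name random_lines is made up, a tab is used as the separator between the random key and the line, and awk's srand() seeds from the clock, so two runs within the same second will usually pick the same lines:

```shell
# Hypothetical helper: print $1 randomly chosen lines from stdin, each input
# line used at most once.
random_lines() {
    awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' |  # prefix a random key
        sort -n |      # order by that key
        cut -f2- |     # strip the key again (everything after the first tab)
        head -n "$1"   # keep the requested number of lines
}

result=$(seq 1 100 | random_lines 3)
printf '%s\n' "$result"
```

Because every input line appears exactly once in the sorted stream, no line position can be picked twice. If the input file itself contains identical lines, they can still both show up.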
sort -Ru filename | head -5
will ensure no duplicates. Not all implementations of sort have the -R option.
To get N random lines from FILE with Perl:
perl -MList::Util=shuffle -e 'print shuffle <>' FILE | head -N
Here's an answer using ruby if you don't want to install anything else:
cat filename | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
for example, given a file (dups.txt) that looks like:
1 2
1 3
2
1 2
3
4
1 3
5
6
6
7
You might get the following output (or some permutation):
cat dups.txt| ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
4
6
5
1 2
2
3
7
1 3
Further example from the comments:
printf 'test\ntest1\ntest2\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test1
test
test2
Of course if you have a file with repeated lines of test you'll get just one line:
printf 'test\ntest\ntest\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test

Shell script numbering lines in a file

I need to find a faster way to number lines in a file in a specific way using tools like awk and sed. I need the first character on each line to be numbered in this fashion: 1,2,3,1,2,3,1,2,3 etc.
For example, if the input was this:
line 1
line 2
line 3
line 4
line 5
line 6
line 7
The output needs to look like this:
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
Here is a chunk of what I have. $lines is the number of lines in the data file divided by 3. So for a file of 21000 lines I process this loop 7000 times.
export i=0
while [ $i -le $lines ]
do
export start=`expr $i \* 3 + 1`
export end=`expr $start + 2`
awk NR==$start,NR==$end $1 | awk '{printf("%d%s\n", NR,$0)}' >> data.out
export i=`expr $i + 1`
done
Basically this grabs 3 lines at a time, numbers them, and adds to an output file. It's slow...and then some! I don't know of another, faster, way to do this...any thoughts?
Try the nl command.
See https://linux.die.net/man/1/nl (or another link to the documentation that comes up when you Google for "man nl" or the text version that comes up when you run man nl at a shell prompt).
The nl utility reads lines from the
named file or the standard input if
the file argument is omitted, applies
a configurable line numbering filter
operation and writes the result to the
standard output.
edit: No, that's wrong, my apologies. The nl command doesn't have an option for restarting the numbering every n lines, it only has an option for restarting the numbering after it finds a pattern. I'll make this answer a community wiki answer because it might help someone to know about nl.
It's slow because you are reading the same lines over and over. Also, you are starting up an awk process only to shut it down and start another one. Better to do the whole thing in one shot:
awk '{print ((NR-1)%3)+1 $0}' $1 > data.out
If you prefer to have a space after the number:
awk '{print ((NR-1)%3)+1, $0}' $1 > data.out
Perl comes to mind:
perl -pe '$_ = (($.-1)%3)+1 . $_'
should work. No doubt there is an awk equivalent. Basically, ((line# - 1) MOD 3) + 1.
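To see why ((line# - 1) MOD 3) + 1 gives the wanted 1,2,3,1,2,3 cycle, the arithmetic can be checked for the first seven line numbers with plain shell arithmetic. This is just an illustration of the formula, not a replacement for the one-liners:

```shell
# Apply ((n - 1) % 3) + 1 to line numbers 1..7 and collect the prefixes.
result=""
for n in 1 2 3 4 5 6 7; do
    result="$result$(( (n - 1) % 3 + 1 ))"
done
printf '%s\n' "$result"
```

This prints 1231231. Subtracting 1 before the modulo shifts the line numbers to start at 0, and adding 1 afterwards shifts the result back to the range 1 to 3.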
This might work for you:
sed 's/^/1/;n;s/^/2/;n;s/^/3/' input
Another way is just to use grep and match everything. For example this will enumerate files:
grep -n '.*' <<< `ls -1`
Output will be:
1:file.a
2:file.b
3:file.c
awk '{printf "%d%s\n", ((NR-1) % 3) + 1, $0;}' "$@"
Python
import sys
for count, line in enumerate(sys.stdin):
    sys.stdout.write("%d%s" % (1 + (count % 3), line))
You don't need to leave bash for this:
i=0; while IFS= read -r; do echo "$((i++ % 3 + 1))$REPLY"; done < input
This should solve the problem. Here $_ prints the whole line, because _ is an unset awk variable that evaluates to 0, which makes $_ the same as $0.
awk '{print ((NR-1)%3+1) $_}' < input
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
# cat input
line 1
line 2
line 3
line 4
line 5
line 6
line 7
