Grep specific lines in a file - bash

my file looks like this
Tree:0,pos:0,len:2.29276,TMRCA:0.795328,ARG:,len:2.29276,TMRCA:0.795328
NEWICK_TREE: [169]((2:0.147398,(6:0.136844,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.0105546):0.647929,(0:0.199142,(1:0.0103058,3:0.0103058):0.188836):0.596186);
SITE: 0 0.0123617064 0.648849164 0010111111
iHistoryMax: 0
Tree:1,pos:0.0169589,len:2.28476,TMRCA:0.795328,ARG:,len:2.28476,TMRCA:0.795328
NEWICK_TREE: [303]((2:0.147398,((6:0.00230499,1:0.00230499):0.134539,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.0105546):0.647929,(0:0.199142,3:0.199142):0.596186);
iHistoryMax: 1
Tree:2,pos:0.0472255,len:2.77342,TMRCA:0.795328,ARG:,len:2.77342,TMRCA:0.795328
NEWICK_TREE: [67](((6:0.00230499,1:0.00230499):0.134539,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.658484,((0:0.199142,3:0.199142):0.436921,2:0.636062):0.159266);
iHistoryMax: 2
Tree:3,pos:0.0539094,len:2.96385,TMRCA:0.795328,ARG:,len:2.96385,TMRCA:0.795328
NEWICK_TREE: [40](((6:0.00230499,1:0.00230499):0.134539,(((9:0.00903981,4:0.00903981):0.084126,5:0.0931658):0.0077254,(7:0.0053182,8:0.0053182):0.095573):0.0359525):0.658484,((0:0.389568,3:0.389568):0.246494,2:0.636062):0.159266);
iHistoryMax: 3
However, all I need is the pos of each Tree (from lines such as Tree:1,pos:...). The output should be only the number following pos, in one column with one row per Tree line. The Tree line does not always occur every third line, because the part in between can vary in length. Can this be done in bash?

Use awk with : and , as field delimiters, then print the fields you want. For example, this will print the Tree and pos numbers (use print $4 alone to get only pos):
awk -F'[:,]' '/^Tree:/{print $2,$4}' file

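Using grep with -P (PCRE). A lookbehind containing .* is rejected because it is not fixed length, so \K is used instead to discard everything matched up to pos: and print only the number:
grep -Po '^Tree:.*?pos:\K[0-9.]+' file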
0
0.0169589
0.0472255
0.0539094
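For comparison, the same extraction can be sketched in plain shell without awk or grep. This is only an illustration: the function name print_pos is made up here, and it assumes that pos: is always the second comma-separated field of a Tree: line, as in the sample.

```shell
# print_pos FILE: print the value after "pos:" for every line starting with "Tree:".
# Assumption: "pos:<number>" is the second comma-separated field, as in the sample.
print_pos() {
  while IFS=, read -r first second rest; do
    case $first,$second in
      Tree:*,pos:*) printf '%s\n' "${second#pos:}" ;;  # strip the "pos:" prefix
    esac
  done < "$1"
}
```

Calling print_pos file prints one position per line, regardless of how many lines sit between two Tree lines.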

Related

how to add one to all fields in a file

suppose I have file containing numbers like:
1 4 7
2 5 8
and I want to add 1 to all these numbers, making the output like:
2 5 8
3 6 9
is there a simple one-line command (e.g. awk) to realize this?
Try the following:
awk '{for(i=1;i<=NF;i++){$i=$i+1}} 1' Input_file
EDIT: As requested by the OP, here are solutions without a loop (written for the shown sample only; a regular expression in RS needs an awk that supports it, such as GNU awk or mawk).
With the number of fields hardcoded:
awk -v RS='[ \n]' '{ORS=NR%3==0?"\n":" ";print $0+1}' Input_file
OR
Without hardcoding the number of fields:
awk -v RS='[ \n]' -v col="$(awk 'FNR==1{print NF; exit}' Input_file)" '{ORS=NR%col==0?"\n":" ";print $0+1}' Input_file
Explanation: The first solution hardcodes the number of fields as 3. The second one sets a variable named col from the first line of Input_file and exits right after that line, so the rest of the file is not read for the count. In both, the record separator is set to a space or a newline, so every number becomes its own record and can be incremented without a loop. Each incremented value is followed by a space, except when the record number is divisible by col, in which case a newline is printed instead.
In native bash (no awk or other external tool needed):
#!/usr/bin/env bash
while read -r -a nums; do          # read a line into an array, splitting on spaces
    out=( )                        # initialize an empty output array for that line
    for num in "${nums[@]}"; do    # iterate over the input array...
        out+=( "$(( num + 1 ))" )  # ...and add num+1 to the output array.
    done
    printf '%s\n' "${out[*]}"      # then print that output array with a newline following
done <in.txt >out.txt              # with input from in.txt and output to out.txt
You can do this using GNU awk:
awk -v RS="[[:space:]]+" '{$0++; ORS=RT} 1' file
2 5 8
3 6 9
If you don't mind Perl:
perl -pe 's/(\d+)/$1+1/eg' file
This substitutes every run of one or more digits (\d+) with that number ($1) plus 1. /e means the replacement is evaluated as an expression, and /g means every match on each line is replaced.
As mentioned in the comments, the above only works for positive integers - per the OP's original sample file. If you wanted it to work with negative numbers, decimals and still retain text and spacing, you could go for something like this:
perl -pe 's/([-]?[.0-9]+)/$1+1/eg' file
Input file
Some column headers # words
1 4 7 # a comment
2 5 cat dog # spacing and stray words
+5 0 # plus sign
-7 4 # minus sign
+1000.6 # positive decimal
-21.789 # negative decimal
Output
Some column headers # words
2 5 8 # a comment
3 6 cat dog # spacing and stray words
+6 1 # plus sign
-6 5 # minus sign
+1001.6 # positive decimal
-20.789 # negative decimal

Bash script cut at specific ranges

I have a log file with plenty of collected logs. I already made a grep command with a regex that outputs the line numbers of the lines that match it.
This is the grep command I'm using to output the matched lines:
grep -n -E 'START_REGEX|END_REGEX' Example.log | cut -d ':' -f 1 > ranges.txt
The regex is an alternation: it can match either the beginning of a specific log or its end, so the output is something like:
12
45
128
136
...
The idea is to use this as a source of ranges to cut specific parts out of the log file, from the first number to the second, and save them in another file.
The ranges are formed by consecutive pairs of the output; in the example, the first range is 12,45 and the second is 128,136.
I expect to see in the final file all the text from line 12 to 45 and then from 128 to 136.
The problem I'm facing is that the sed command seems to work with only one range at a time.
sed -E -iTMP "${START_RANGE},${END_RANGE}! d;${END_RANGE}q" "$FILE_NAME"
Is there any way (maybe with awk) to do that just in one "cycle"?
Constraints: I can only use standard commands available in bash.
You can use an awk statement, too
awk '(NR>=12 && NR<=45) || (NR>=128 && NR<=136)' file
where NR is a special variable in awk that keeps track of the line number as it processes the file.
An example,
seq 1 10 > file
cat file
1
2
3
4
5
6
7
8
9
10
awk '(NR>=1 && NR<=3) || (NR>=8 && NR<=10)' file
1
2
3
8
9
10
You can also avoid hard-coding the line numbers by passing them in with the -v variable option:
awk -v start1=1 -v end1=3 -v start2=8 -v end2=10 '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2)' file
1
2
3
8
9
10
With sed you can do multiple ranges of lines like so:
sed -n '12,45p;128,136p' file
This would output lines 12-45, then 128-136.
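To avoid typing the ranges by hand, the pairs in ranges.txt can be fed to awk together with the log. The following is a sketch under assumptions, not a tested production script: the function name extract_ranges is hypothetical, and it assumes ranges.txt holds one line number per line with starts and ends strictly alternating, as in the question.

```shell
# extract_ranges RANGES LOG: print the lines of LOG that fall inside the
# start/end pairs listed one number per line in RANGES (start, end, start, end, ...).
extract_ranges() {
  awk '
    NR == FNR {                      # first file: the list of line numbers
      if (FNR % 2) lo[++n] = $1      # odd rows open a range
      else         hi[n]   = $1      # even rows close it
      next
    }
    {                                # second file: the log itself
      for (i = 1; i <= n; i++)
        if (FNR >= lo[i] && FNR <= hi[i]) { print; break }
    }
  ' "$1" "$2"
}
```

For example, extract_ranges ranges.txt Example.log > extracted.log reads the log once. Every log line is checked against every range, which is fine for a handful of ranges but would deserve a smarter lookup for thousands of them.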

How to replace all matches with an incrementing number in BASH?

I have a text file like this:
AAAAAA this is some content.
This is AAAAAA some more content AAAAAA. AAAAAA
This is yet AAAAAA some more [AAAAAA] content.
I need to replace all occurrence of AAAAAA with an incremented number, e.g., the output would look like this:
1 this is some content.
This is 2 some more content 3. 4
This is yet 5 some more [6] content.
How can I replace all of the matches with an incrementing number?
Here is one way of doing it:
$ awk '{for(x=1;x<=NF;x++)if($x~/AAAAAA/){sub(/AAAAAA/,++i)}}1' file
1 this is some content.
This is 2 some more content 3. 4
This is yet 5 some more [6] content.
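The field loop above replaces one occurrence per whitespace-separated field that contains the string, so it relies on no field holding the string twice. A variant that loops on the line itself could look like the sketch below. The function name number_matches and its pattern argument are made up for illustration, and the pattern must not be able to match digits, otherwise the loop would never end.

```shell
# number_matches PATTERN FILE: replace every match of PATTERN with 1, 2, 3, ...
# Assumption: PATTERN cannot match the digits that are inserted.
number_matches() {
  awk -v pat="$1" '
    {
      while (match($0, pat))   # is there still an occurrence on this line?
        sub(pat, ++i)          # replace the leftmost one with the next number
      print
    }
  ' "$2"
}
```

With the sample file, number_matches AAAAAA file gives the same output as shown above.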
A perl solution:
perl -pe 'BEGIN{$A=1;} s/AAAAAA/$A++/ge' test.dat
This might work for you (GNU sed):
sed -r ':a;/AAAAAA/{x;:b;s/9(_*)$/_\1/;tb;s/^(_*)$/0\1/;s/$/:0123456789/;s/([^_])(_*):.*\1(.).*/\3\2/;s/_/0/g;x;G;s/AAAAAA(.*)\n(.*)/\2\1/;ta}' file
This is a toy example, perl or awk would be a better fit for a solution.
The solution only acts on lines which contain the required string (AAAAAA).
The hold buffer is used as a place to keep the incremented integer.
In overview: when a required string is encountered, the integer in the hold space is incremented, appended to the current line, swapped for the required string, and the process is then repeated until all occurrences of the string are accounted for.
Incrementing an integer simply swaps the last digit (other than trailing 9's) for the next integer in sequence i.e. 0 to 1, 1 to 2 ... 8 to 9. Where trailing 9's occur, each trailing 9 is replaced by a non-integer character e.g '_'. If the number being incremented consists entirely of trailing 9's a 0 is added to the front of the number so that it can be incremented to 1. Following the increment operation, the trailing 9's (now _'s) are replaced by '0's.
As an example say the integer 9 is to be incremented:
9 is replaced by _, a 0 is prepended (0_), the 0 is swapped for 1 (1_), and the _ is replaced by 0, resulting in the number 10.
See comments directed at @jaypal for further notes.
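Maybe something like this:
#!/bin/bash
n=1
while IFS= read -r line
do
    # replace one occurrence at a time so that every match gets its own number
    while [[ $line == *AAAAAA* ]]
    do
        line=${line/AAAAAA/$n}
        n=$((n + 1))
    done
    printf '%s\n' "$line"
done < filename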
Perl did the job for me
perl -pi -e 's/\bDROP\b/$&."_".++$A/ge' /folder/subfolder/subsubfolder/*
Input:
DROP
drop
$drop
$DROP
$DROP="DROP"
$DROP='DROP'
$DROP=$DROP
$DROP="DROP";
$DROP='DROP';
$DROP=$DROP;
$var="DROP_ACTION"
drops
DROPS
CODROP
'DROP'
"DROP"
/DROP/
Output:
DROP_1
drop
$drop
$DROP_2
$DROP_3="DROP_4"
$DROP_5='DROP_6'
$DROP_7=$DROP_8
$DROP_9="DROP_10";
$DROP_11='DROP_12';
$DROP_13=$DROP_14;
$var="DROP_ACTION"
drops
DROPS
CODROP
'DROP_15'
"DROP_16"
/DROP_17/

Getting specific lines of a file

I have this file with 25 million rows. I want to get specific 10 million lines from this file
I have the indices of these lines in another file. How can I do it efficiently?
Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read -r wanted
do
    while ((current < wanted))
    do
        if IFS= read -r -u 3 line
        then ((current++))
        else break 2
        fi
    done
    printf '%s\n' "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file
Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
The output.txt will have the lines wanted. I am not sure if this is efficient enough. It seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) under my environment though.
Some explanations:
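The 1st command preprocesses the indices file so that the numbers are right-aligned with leading zeroes and a width of 8 (since the number of rows in input.txt is known to be 25M). join expects both inputs to be sorted on the join field, which the zero-padding preserves as long as no.txt is in ascending order.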
The 2nd command will print the rows and line numbers with exactly the same format as in the preprocessed index file, then join them to get the wanted rows (cut to remove the line numbers).
Since you said the file with lines you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
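If the line numbers are not sorted, one option is to load them into an awk array first and then test each data line against it. This is a sketch that assumes the whole list of indices fits in memory; the function name pick_lines is hypothetical.

```shell
# pick_lines INDEX_FILE DATA_FILE: print the lines of DATA_FILE whose line
# numbers are listed (one per line, in any order) in INDEX_FILE.
pick_lines() {
  awk '
    NR == FNR { want[$1 + 0]; next }   # first file: remember each wanted line number
    FNR in want                        # second file: print lines whose number was listed
  ' "$1" "$2"
}
```

Each file is still read only once. The selected lines come out in the order they appear in the data file, not in the order of the index list.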
If your index file is index.txt and your data file is data.txt, you can do it with sed as follows. Note that this rescans data.txt from the top for every index, so it is slow for a long index list.
#!/bin/bash
while read -r line_no
do
    sed "${line_no}q;d" data.txt
done < index.txt
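You could run a loop that reads from the 25-million-line file and, when the loop counter reaches a line number that you want, writes that line. Example (Java fragment):
// br is a BufferedReader over the data file; indice is the next wanted line number
String line;
int count = 0;
while ((line = br.readLine()) != null)
{
    count++;
    if (count == indice)
    {
        System.out.println(line); // or write to a file
        // advance indice to the next wanted line number here
    }
}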

Shell script numbering lines in a file

I need to find a faster way to number lines in a file in a specific way using tools like awk and sed. I need the first character on each line to be numbered in this fashion: 1,2,3,1,2,3,1,2,3 etc.
For example, if the input was this:
line 1
line 2
line 3
line 4
line 5
line 6
line 7
The output needs to look like this:
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
Here is a chunk of what I have. $lines is the number of lines in the data file divided by 3. So for a file of 21000 lines I process this loop 7000 times.
export i=0
while [ $i -le $lines ]
do
export start=`expr $i \* 3 + 1`
export end=`expr $start + 2`
awk NR==$start,NR==$end $1 | awk '{printf("%d%s\n", NR,$0)}' >> data.out
export i=`expr $i + 1`
done
Basically this grabs 3 lines at a time, numbers them, and adds to an output file. It's slow...and then some! I don't know of another, faster, way to do this...any thoughts?
Try the nl command.
See https://linux.die.net/man/1/nl (or another link to the documentation that comes up when you Google for "man nl" or the text version that comes up when you run man nl at a shell prompt).
The nl utility reads lines from the
named file or the standard input if
the file argument is omitted, applies
a configurable line numbering filter
operation and writes the result to the
standard output.
edit: No, that's wrong, my apologies. The nl command doesn't have an option for restarting the numbering every n lines, it only has an option for restarting the numbering after it finds a pattern. I'll make this answer a community wiki answer because it might help someone to know about nl.
It's slow because you are reading the same lines over and over. Also, you are starting up an awk process only to shut it down and start another one. Better to do the whole thing in one shot:
awk '{print ((NR-1)%3)+1 $0}' $1 > data.out
If you prefer to have a space after the number:
awk '{print ((NR-1)%3)+1, $0}' $1 > data.out
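The cycle length does not have to be fixed at 3. As a small sketch, the same modulo arithmetic can take the period as a parameter; the function name cycle_number and the awk variable n are made up for illustration.

```shell
# cycle_number N FILE: prefix each line of FILE with 1..N, restarting after N.
cycle_number() {
  awk -v n="$1" '{ printf "%d%s\n", (NR - 1) % n + 1, $0 }' "$2"
}
```

cycle_number 3 file reproduces the numbering asked for in the question, still in a single pass over the file.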
Perl comes to mind:
perl -pe '$_ = (($.-1)%3)+1 . $_'
should work. No doubt there is an awk equivalent. Basically, ((line# - 1) MOD 3) + 1.
This might work for you:
sed 's/^/1/;n;s/^/2/;n;s/^/3/' input
Another way is just to use grep and match everything. For example this will enumerate files:
grep -n '.*' <<< `ls -1`
Output will be:
1:file.a
2:file.b
3:file.c
awk '{printf "%d%s\n", ((NR-1) % 3) + 1, $0;}' "$@"
Python
import sys
for count, line in enumerate(sys.stdin):
    sys.stdout.write("%d%s" % (1 + (count % 3), line))
You don't need to leave bash for this:
i=0; while IFS= read -r; do printf '%s%s\n' "$((i++ % 3 + 1))" "$REPLY"; done < input
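This should solve the problem. In awk, $_ happens to print the whole line only because _ is an unset variable, which makes $_ the same as $0; $0 is used here for clarity.
awk '{print ((NR-1)%3+1) $0}' < input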
1line 1
2line 2
3line 3
1line 4
2line 5
3line 6
1line 7
# cat input
line 1
line 2
line 3
line 4
line 5
line 6
line 7
