Searching for a string between two characters - bash

I need to find two numbers from lines which look like this
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that doesn't contain ":" so we can search for colon to find the line with numbers. Every line have different numbers.
I need to find both numbers after ":" which are separated by "-" then substract the first number from the second one and print result on the screen for each line
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For this example i showed above my result would be:
4899
Because it is the result of 458800 - 453901 = 4899
I can't figure it out on my own and would appreciate some help

With GNU awk. Separate the row into multiple columns using the : and - separators. In each row containing :, subtract the contents of column 2 from the contents of column 3 and print result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899

Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899

Related

Print the characters from string

I have a file which contains words-
abfiuf.com abdbhj.co.in abcahjkl.org.in.2 abciuf zasdg cbhjk asjk
including other contents. My Requirement is -The word which starts with abfiuf,
abdbhj, abcahjkl, abciuf ,.... cut the two character from middle like below.
abfiuf - fi
abdbhj - db
abjcahjkl - ca
abciuf - ci
I have tried below-
First to get the matching word-
cat /etc/xyz.txt|grep -Eo \<(abfiuf|abdbhj|abjcahjkl|abciuf)\S*'|cut -f1 -d"."
But unable to cut before and after matching "fi" , "db" , "ca" , "ci" from words .
Tried with sed command - sed 's/^.*fi/fi/' -> working only for one word removing before. But How to cut multiple char before & after from words ?
EDIT2: Since OP told he/she only want to print matching strings value only if that is the case then one should try following.
awk 'match($0,/fi|db|ca|ci/){print substr($0,RSTART,RLENGTH)}' Input_file
OR in case you want to print a message with line number that DO NOT have any match found then try following.
awk 'match($0,/fi|db|ca|ci/){print substr($0,RSTART,RLENGTH);next} {print "Line number " FNR " is NOT having any matching value in it."}' Input_file
Assuming that you need to print only 3rd and 4th character if that is the case then try following.
awk '{print substr($0,3,2)}' Input_file
EDIT: Now I am assuming that you DO NOT want to hard code the position to print from lines if that is the case then try following, which will first calculate the length of line and will print 2 characters starting from its middle letter.
awk '{len=length($0)/2;print substr($0,len,2)}' Input_file

Extract the last three columns from a text file with awk

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA
ENST00000000233 127228399 127228552 ARF5
ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
as you see some times there are some empty lines between 2 lines which must be ignored. here is the output that I want to make:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
awk  '/a/ {print $1- "\t" $-2 "\t" $-3}'  file.txt.
it does not return what I want. do you know how to correct the command?
Following awk may help you in same.
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding explanation of command too now.(NOTE this following command is for only explanation purposes one should run above command only to get the results)
awk 'NF ###Checking here condition NF(where NF is a out of the box variable for awk which tells number of fields in a line of a Input_file which is being read).
###So checking here if a line is NOT NULL or having number of fields value, if yes then do following.
{
print $(NF-2),$(NF-1),$NF###Printing values of $(NF-2) which means 3rd last field from current line then $(NF-1) 2nd last field from line and $NF means last field of current line.
}
' OFS="\t" Input_file ###Setting OFS(output field separator) as TAB here and mentioning the Input_file here.
You can use sed too
sed -E '/^$/d;s/.*\t(([^\t]*[\t|$]){2})/\1/' infile
With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, cat the file to tr to squeeze out repeted \ns to get rid of empty lines. Then reverse the lines, cut the first three fields and reverse again. You could replace the useless cat with the first rev.

print 3 consecutive column after specific string from CSV

I need to print 2 columns after specific string (in my case it is 64). There can be multiple instances of 64 within same CSV row, however next instance will not occur within 3 columns of previous occurrence. Output of each instance should be in next line and unique. The problem is, the specific string does not fall in same column for all rows. All row is having kind of dynamic data and there is no header for CSV. Let say, below is input file (its just a sample, actual file is having approx 300 columns & 5 Million raws):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required(only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There is no performance related concerns. I have tried to search many threads but could not get anything near related. Please help.
We can use a pipe of two commands... first to put the 64's leading on a line and a second to print first three columns if we see a leading 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming column are separated by blank space or comma
need a sort -u for uniq (possible in sed but a new "simple" action of the same kind to add in this case)
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
if($i==64)
{k=$i FS $(++i) FS $(++i);
if (!a[k]++)
print k
}
}' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.

awk combine 2 commands for csv file formatting

I have a CSV file which has 4 columns. I want to first:
print the first 10 items of each column
only print the items in the third column
My method is to pipe the first awk command into another but i didnt get exactly what i wanted:
awk 'NR < 10' my_file.csv | awk '{ print $3 }'
The only missing thing was the -F.
awk -F "," 'NR < 10' my_file.csv | awk -F "," '{ print $3 }'
You don't need to run awk twice.
awk -F, 'NR<=10{print $3}'
This prints the third field for every line whose record number (line) is less than or equal to 10.
Note that < is different from <=. The former matches records one through nine, the latter matches records one through ten. If you need ten records, use the latter.
Note that this will walk through your entire file, so if you want to optimize your performance:
awk -F, '{print $3} NR>10{exit}'
This will print the third column. Then if the record number is greater than 10, it will exit. This does not step through your entire file.
Note also that awk's "CSV" matching is very simple; awk does not understand quoted fields, so the record:
red,"orange,yellow",green
has four fields, two of which have double quotes in them. YMMV depending on your input.

Getting repeated lines with awk in Bash

I'm trying to know which are the lines that are repeated X times in a text file, and I'm using awk but I see that awk in my command, not work with lines that begin with the same characters or words. That is, does not recognize the full line individually.
Using this command I try to get the lines that are repeated 3 times:
awk '++A[$1]==3' ./textfile > ./log
This is what you need hopefully:
awk '{a[$0]++}END{for(i in a){if(a[i]==3)print i}}' File
Increment array a with the line($0) as index for each line. In the end, for each index ($0), check if the count(a[i] which is the original a[$0]) equals 3. If so, print the line (i which is the original $0 / line). Hope it's clear.
This returns lines repeated 3 times but adds a space at the beginning of each 3x-repeated line:
sort ./textfile | uniq -c | awk '$1 == 3 {$1 = ""; print}' > ./log

Resources