awk for sorting lines in a file - shell

I have a file that needs to be sorted on the basis of a column, and it is a fixed-width column, i.e. from character 5 to 10.
example file:
0120456789bcdc hsdsjjlofk
01204567-9 __abc __hsdsjjjiejks
01224-6777 abcddd hsdsjjjpsdpf
012645670- abccccd hsdsjjjopp
I tried awk -v FIELDWIDTHS="4 10" '{print|"$2 sort -n"}' file but it does not give the proper output.

You can use sort for this, with a key that spans characters 5 through 10 of the first field:
$ sort -k1.5,1.10 file
01224-6777 abcddd hsdsjjjpsdpf
01204567-9 __abc __hsdsjjjiejks
012645670- abccccd hsdsjjjopp
0120456789bcdc hsdsjjlofk
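For reference, the key syntax is -kF.C[,F.C]: field F, character C within that field. A minimal sketch on made-up data (the sample keys here are not from the question):

```shell
# Sort on characters 5-10 of the first (whitespace-delimited) field.
printf 'aaaa000003 x\nbbbb000001 y\ncccc000002 z\n' | sort -k1.5,1.10
# -> bbbb000001 y
#    cccc000002 z
#    aaaa000003 x
```

The comparison is lexicographic by default; append n to the key (-k1.5,1.10n) if the range is strictly numeric.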

Related

Pass number of for loop elements to external command

I'm using a for loop to iterate through the .txt files in a directory and grab specified rows from each file. Afterwards the output is passed to the pr command in order to print it as a table. Everything works fine; however, I'm manually specifying the number of columns that the table should contain. This is cumbersome when the number of files is not constant.
The command I'm using:
for f in *txt; do awk -F"\t" 'FNR ~ /^(2|6|9)$/{print $2}' $f; done | pr -ts --column 4
How should I modify the command to replace '4' with the number of matching files?
Edit:
The fundamental question was whether one can pass the number of matching files to a command outside the loop. Judging from the solutions, I guess there is no way to work around the problem. Until reaching this conclusion the structure of the files was not really relevant.
However, taking the above into account, I'm providing the file structure below.
Sample file.txt:
Irrelevant1 text
Placebo 1222327
Irrelevant1 text
Irrelevant2 text
Irrelevant3 text
Treatment1 105956
Irrelevant1 text
Irrelevant2 text
Treatment2 49271
Irrelevant1 text
Irrelevant2 text
The for loop generates the following from 4 *txt files:
1222327
105956
49271
969136
169119
9672
1297357
237210
11581
1189529
232095
13891
Expected pr output using a dynamically generated --column 4:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
Assumptions:
all input files generate the same number of output lines (otherwise we can add some code to keep track of the max number of lines and generate blank columns as needed)
Setup (columns are tab-delimited):
$ grep -n xxx f[1-4].txt
f1.txt:6:xxx 1222327
f1.txt:9:xxx 105956
f1.txt:24:xxx 49271
f2.txt:6:xxx 969136
f2.txt:9:xxx 169119
f2.txt:24:xxx 9672
f3.txt:6:xxx 1297357
f3.txt:9:xxx 237210
f3.txt:24:xxx 11581
f4.txt:6:xxx 1189529
f4.txt:9:xxx 232095
f4.txt:24:xxx 13891
One idea using awk to dynamically build the 'table' (replaces OP's current for loop):
awk -F'\t' '
FNR==1 { c=0 }
FNR ~ /^(6|9|24)$/ { ++c ; arr[c]=arr[c] (FNR==NR ? "" : " ") $2 }
END { for (i=1;i<=c;i++) print arr[i] }
' f[1-4].txt | column -t -o ' '
NOTE: we'll go ahead and let column take care of pretty-printing the table with a single space separating the columns; otherwise we could add some more code to awk to right-pad the columns with spaces.
This generates:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
You could just run ls and pipe the output to wc -l. Then once you've got that number you can assign it to a variable and place that variable in your command.
num=$(ls *.txt | wc -l)
I forget how to place bash variables in AWK, but I think you can do that. If not, respond back and I'll try to find a different answer.
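Putting that together (a sketch; it assumes plain filenames with no newlines, and reuses the row numbers from the question's loop):

```shell
# Count the input files, then hand that count to pr's column option.
num=$(ls *.txt | wc -l)
for f in *.txt; do
  awk -F'\t' 'FNR ~ /^(2|6|9)$/ {print $2}' "$f"
done | pr -ts --column "$num"
```

Here the variable never has to reach awk at all; it is only pr that needs the file count.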

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
The gensub function parses AF=<number> out of the last field of the input, capturing the number in group #1, which is then compared with 0.1.
PS: +0 converts the parsed field to a number.
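A POSIX-awk alternative that needs no gensub (assuming AF= appears at most once per line) is to split on the literal string AF= and compare the numeric prefix of what follows:

```shell
# Split each line on "AF="; $2 then begins with the number, and +0
# converts that leading numeric prefix for the comparison.
awk -F'AF=' 'NF > 1 && ($2 + 0) > 0.1' file
```

NF > 1 skips any line that has no AF= field at all.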
You could use awk with multiple delimiters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN, you can simply match values where the tenths place is 1-9, e.g.:
grep ';AF=0.[1-9][0-9];' your_file.csv
You could add a + after the second character class to support additional digits (i.e. 0.NNNNN), but if the values could fall outside the range [0, 1) you shouldn't try to compare the field with a regular expression.
$ awk -F= '$5>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
I would use awk. Since awk supports string comparisons you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields on ;. $(NF-1) addresses the second-to-last field of the line (NF is the number of fields). Note that this is a lexicographic comparison, so it only behaves numerically as long as the values keep the same AF=0.NN format.

How to split a CSV file into multiple files based on column value

I have CSV file which could look like this:
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
there could be more or fewer rows, and I need to split it into multiple .dat files, each containing the rows with the same value in the second column. (Then I will make a bar chart from each .dat file.) For this case it should be two files:
data1.dat
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
data2.dat
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
Is there any simple way of doing it with bash?
You can use awk to generate a file containing only a particular value of the second column:
awk -F ';' '($2==1){print}' data.dat > data1.dat
Just change the value in the $2== condition.
Or, if you want to do this automatically, just use:
awk -F ';' '{print > ("data"$2".dat")}' data.dat
which writes each row to a file whose name contains the value of the second column.
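If the input can contain many distinct values in the second column, some awk implementations run out of open file descriptors with print > file. A sketch that closes each file when the key changes (it assumes rows with the same key are grouped together, as in the sample, so each file is opened only once):

```shell
# Close each output file when the group key changes.
awk -F';' '
  $2 != prev { if (prev != "") close("data" prev ".dat"); prev = $2 }
  { print > ("data" $2 ".dat") }
' data.dat
```

If the rows are not grouped, dropping the close() and using GNU awk (which handles many open files itself) is the simpler route.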
Try this:
while IFS=";" read -r a b c; do echo "$a;$b;$c" >> data${b}.dat; done <file

Pull out a range of data from unique character to unique character using grep or awk

I have a moderately large fasta format file that has a complex header. I need to pull a sequence out based on a value (an 8 digit number) from another file. I can get the sequence out using 'grep -20 "value" fasta.file'. Some of the sequences are very large and I often have to adjust the number of lines in order to get the whole sequence. I then have to copy and paste into another file. Right now, I have too many values (1000) to do this manually. The tools that I have found to do this haven't worked so far...
The fasta format file looks like this:
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818559; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134301675; Stop=134301762; Strand=+; Length=87;
GGATCATTGATGACCAAAAAAAAAAAAACATCTGGGAGTCCTCTGAGACATCCATGATGA
CCACAACATTGGGAGTCTGAGGTCCAC
If I use the command grep -4 "17818557" fasta.fa I get:
ATTGCGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818555; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134299894; Stop=134299978; Strand=+; Length=84;
GGATCATTGATGACCAGAAAAAAATCATCTCGGAGTCCTCTGAGACATCCATGATGACCA
CAACATTGGGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818559; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134301675; Stop=134301762; Strand=+; Length=87;
GGATCATTGATGACCAAAAAAAAAAAAACATCTGGGAGTCCTCTGAGACATCCATGATGA
grep -4 grabs four lines above and below the match. What I need to do is use the numerical query and pull out just the sequence data below the fasta header (>). It would be nice to collect everything from the matching fasta header to the next fasta header, i.e. from > to >.
I have tried some of the UCSC tools 'faSomeRecord' and some perl scripts. They haven't worked with the numerical query in list file or on the command line with and without the 'transcript_cluster:RaGene-2_0-st-v1:' addition. I am thinking that it is the colons or because the header includes positions and lengths which are variable.
Any comments or help is greatly appreciated!
EDIT 30July14
Thanks to the help I received here. I was able to get the data from one file to another using this bash script:
#!/usr/bin/bash
filename='21Feb14_list.txt'
filelines=`cat $filename`
for i in $filelines ; do
awk '/transcript/ && f==1 {f=0;next} /'"$i"'/ {f=1} f==1{print $1}' RaGene-2_0-st-v1.rn4.transcript_cluster.fa
done
This pulls out the sequence, but it truncates the header to just the first field (because of the print $1). Is there a way to modify this so that I can get the entire header?
example output:
>transcript_cluster:RaGene-2_0-st-v1:17719499;
ATGCCTGAGCCTTCGAAATCTGCACCAGCTCCTAAGAAGGGCTCTAAGAAAGCTATCTCT
AAAGCTCAGAAAAAGGATGGCAAGAAGCGCAAGCGTAGCCGCAAGGAGAGCTATTCCGTG
TACGTGTACAAGGTGCTGAAGCAAGTGCACCCGGACACCGGCATCTCTTCCAAGGCCATG
GGCATCATGAACTCGTTCGTGAACGACATCTTCGAGCGCATCGCGGGCGAGGCGTCGCGC
CTGGCGCATTACAACAAGCGCTCGACCATCACGTCCCGGGAGATCCAGACCGCCGTGCGC
CTGCTGCTGCCGGGGGAGCTGGCCAAGCACGCGGTGTCGGAAGGCACCAAGGCGGTCACC
AAGTACACCAGCTCCAAGTG
>transcript_cluster:RaGene-2_0-st-v1:17623679;
Thanks again!!
$ awk '/transcript/ {f=0} /17818557/ {f=1} f==1{print}' fasta
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
How it works:
The code uses a flag, called f, to decide if a line should be printed. Taking each command, one by one:
/transcript/ {f=0}
If "transcript" appears in the line, indicating a header, we set the flag to 0.
/17818557/ {f=1}
If the line contains our key, 17818557, we set the flag to 1.
f==1{print}
If the flag is 1, print the line.
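To loop over many IDs from the list file, the key can be passed in with -v and matched with index() instead of being spliced into the script (a sketch along the same lines, reusing the filenames from the question's script):

```shell
# For each ID in the list, print its whole record: the header line that
# contains the ID, plus every sequence line up to the next header.
while read -r id; do
  awk -v id="$id" '
    /^>/ { f = (index($0, id) > 0) }  # at each header, test for our ID
    f                                  # print while the flag is set
  ' RaGene-2_0-st-v1.rn4.transcript_cluster.fa
done < 21Feb14_list.txt
```

Because the flag is recomputed at every header line, the record ends cleanly at the next >, and the full header is printed rather than just its first field.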
sed '1,/17818557/d;/>/,$d' file
Output:
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
With Header:
id=17818557
sed "/$id/p;1,/$id/d;/>/,\$d" file
Output:
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC

bash sort on column but do not sort same columns

My file contains:
9827259163,0,D
9827961481,0,D
9827202228,0,A
9827529897,5,D
9827529897,0#1#5#8,A
9827700249,0#1,A
9827700249,1#2,D
9883219029,0,A
9861065312,0,A
I want it sorted on the basis of the first column; if the records in the first column are the same, then do not sort those records further.
$ sort -t, -k1,1 test
9827202228,0,A
9827259163,0,D
9827529897,0#1#5#8,A
9827529897,5,D
9827700249,0#1,A
9827700249,1#2,D
9827961481,0,D
9861065312,0,A
9883219029,0,A
but what I expect is:
9827202228,0,A
9827259163,0,D
9827529897,5,D
9827529897,0#1#5#8,A
9827700249,0#1,A
9827700249,1#2,D
9827961481,0,D
9861065312,0,A
9883219029,0,A
because there are two records each for 9827529897 and 9827700249, those pairs should keep their original relative order.
Please suggest the command in bash shell
Add the -s option, which makes the sort stable (it disables sort's last-resort comparison on the whole line, so lines with equal keys keep their input order):
sort -st, -k1,1 test
output:
9827202228,0,A
9827259163,0,D
9827529897,5,D
9827529897,0#1#5#8,A
9827700249,0#1,A
9827700249,1#2,D
9827961481,0,D
9861065312,0,A
9883219029,0,A