Shell command to count lines in a file

I have a file with 3 columns and need a shell/bash command that counts how many lines there are for each combination of the 1st and 3rd columns.
My file is as follows
COLS949 300 7
COLS949 301 7
COLS949 302 7
COLS949 302 8
COLS949 303 7
COLS949 43401 84
COLS950 303 7
Desired output:
COLS949 7 4
COLS949 8 1
COLS949 84 1
COLS950 7 1
So there are 4 lines with "COLS949" in the first column and "7" in the third column, and so on. The order of the output columns does not matter, so the following output is also fine:
COLS949 4 7
COLS949 1 8
COLS949 1 84
COLS950 1 7

awk '{a[$1 " " $3]++}END {for( i in a) print i, a[i]}' input
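To check the one-liner against the sample from the question, a minimal sketch follows. The data is fed in through a here-document instead of the `input` file, and the `sort` stage is an addition made here, because `for (key in count)` visits the keys in an unspecified order.

```shell
# Count lines per (column 1, column 3) pair; sort only to make the order stable.
result=$(
    awk '{ count[$1 " " $3]++ }
         END { for (key in count) print key, count[key] }' <<'EOF' | LC_ALL=C sort
COLS949 300 7
COLS949 301 7
COLS949 302 7
COLS949 302 8
COLS949 303 7
COLS949 43401 84
COLS950 303 7
EOF
)
printf '%s\n' "$result"
```

Each output line holds column 1, column 3 and the number of input lines that share that pair.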

Related

How can I split comma separated values into multiple rows?

I'm trying to split multiple comma-separated values into rows.
I have achieved it in a small number of columns with comma-separated values (with awk), but in my real table, I have to do it in 80 columns. So, I'm looking for a way to iterate.
An example of input that I need to split:
CHROM POS REF ALT GT_00 C_00 D_OO E_00 F_00 GT_11
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
The expected output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
I have done it with the following code:
awk 'BEGIN{FS=OFS="\t"}
{
j=split($5,a,",");split($6,b,",");
split($7,c,",");split($8,d,",");split($9,e,",");
for(i=1;i<=j;++i)
{
$5=a[i];$6=b[i];$7=c[i];$8=d[i];$9=e[i];print
}}'
But, as I have said before, there are 80 columns (or more) with comma-separated values in my real data.
Is there a way to do it using iteration?
Note: I need to do it in bash (not MySQL, SQL, python...)
This awk may do:
file:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
Solution:
awk '{
n=0;
for(i=5;i<=NF;i++) {
t=split($i,a,",");if(t>n) n=t};
for(j=1;j<=n;j++) {
printf "%s\t%s\t%s\t%s",$1,$2,$3,$4;
for(i=5;i<=NF;i++) {
# test (j in a) rather than a[j], so that a value of 0 is not treated as missing
split($i,a,",");printf "\t%s",((j in a)?a[j]:a[1])
};
print ""
}
}' file
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7
Your test input gives:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
It does not matter whether the comma-separated fields are adjacent, as long as you do not mix fields with different numbers of values (for example 2 and 3) on the same line.
Here is another awk. In contrast to the previous solutions where we split fields into arrays, we attack the problem differently using substitutions. There is no field iterating going on:
awk '
BEGIN { OFS="\t" }
{ $1=$1;t=$0; }
{ while(index($0,",")) {
gsub(/,[[:alnum:],]*/,""); print;
$0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0;
}
print t
}' file
How does it work:
The idea is based on two types of substitution:
gsub(/,[[:alnum:],]*/,""): this removes every substring that starts with a comma and consists of alphanumeric characters and commas: 1,2,3,4 -> 1. Fields that contain no comma are left unchanged.
gsub(OFS "[[:alnum:]]*,",OFS): this removes the alphanumeric characters at the beginning of a field together with the single comma that follows them: 1,2,3,4 -> 2,3,4
Using these two substitutions, we iterate until no comma is left. See "How can you tell which characters are in which character classes?" for details on [[:alnum:]].
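The effect of the two substitutions can be traced on a single record. This is only an illustrative sketch: the record is the third line of the input below, the variable names `head` and `rest` are made up here, and the final `tr` merely turns tabs into spaces for display. It assumes an awk that supports POSIX character classes such as `[[:alnum:]]`.

```shell
result=$(printf 'chr1\t10\tT\tC\t5\t1,2,3\t4,2,1\t7\t1,8,3\t3\n' | awk '
BEGIN { FS = OFS = "\t" }
{
    head = $0
    gsub(/,[[:alnum:],]*/, "", head)        # keep only the first value of every field
    rest = $0
    gsub(OFS "[[:alnum:]]*,", OFS, rest)    # drop the first value of multi-valued fields
    print "head: " head
    print "rest: " rest
}' | tr '\t' ' ')
printf '%s\n' "$result"
```

`head` is the row that gets printed in this pass, and `rest` is what the loop works on in the next pass.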
input:
chr1 10 T A 2,2 1,1 0,1 1,2 1,0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1,2,3 4,2,1 7 1,8,3 3
chr1 10 T D 1,2,3,5 4,2,1,8 1,8,3,2 3 5 7
output:
chr1 10 T A 2 1 0 1 1 2
chr1 10 T A 2 1 1 2 0 2
chr1 10 T G 3 2 1 2 0 0
chr1 10 T C 5 1 4 7 1 3
chr1 10 T C 5 2 2 7 8 3
chr1 10 T C 5 3 1 7 3 3
chr1 10 T D 1 4 1 3 5 7
chr1 10 T D 2 2 8 3 5 7
chr1 10 T D 3 1 3 3 5 7
chr1 10 T D 5 8 2 3 5 7

How to print the field with duplicate value in .txt file

I have a file like this
File
124 3 ac 7
143 3 zf 10
176 8 lm 1
547 7 km 5
862 8 sf 6
991 7 zv 6
I want to create 3 different files from this with following output
File 1
124 3 ac 7
143 3 zf 10
File 2
176 8 lm 1
862 8 sf 6
File 3
547 7 km 5
991 7 zv 6
Please help me with the commands.
$ awk 'NR>1{print $2,$3,$4 > $1}' File
This command did the work for me.
Thank You!!
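For the sample shown in the question, where whole lines are grouped by the value in the second column, a minimal sketch could look like the one below. The `group_` prefix for the output file names and the temporary directory are assumptions made here, since the question does not say how the files should be named.

```shell
workdir=$(mktemp -d)

cat > "$workdir/File" <<'EOF'
124 3 ac 7
143 3 zf 10
176 8 lm 1
547 7 km 5
862 8 sf 6
991 7 zv 6
EOF

# Write each whole line to a file named after its second column:
# group_3, group_8 and group_7 for this sample.
awk -v dir="$workdir" '{ print > (dir "/group_" $2) }' "$workdir/File"

cat "$workdir/group_3"
```

The parentheses around the file name expression keep the concatenation unambiguous across awk implementations.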

AWK: Add number to the column for specific line

I have a data file of:
1 2 3
1 5 7
2 5 9
11 21 110
6 17 -2
10 2 8
6 4 3
5 1 8
6 1 5
7 3 1
I want to add 1 to the third column, but only on lines 1, 3, 6, 8, 9 and 10, and add 2 to the second column on lines 6 to 9.
I know how to add 2 to entire second column, and add 1 to entire third column using awk
awk '{print $1, $2+2, $3+1}' data > data2
But how can I modify this code so that it applies only to specific lines of the second and third columns?
Thanks
Best,
awk to the rescue! You can test NR directly in the condition, but for 6 values that gets tedious. Alternatively, you can do a string match against a list in which every line number is anchored by commas.
$ awk 'BEGIN{lines=",1,3,6,8,9,10,"}
match(lines,","NR","){$3++}
NR>=6 && NR<=9{$2+=2}1' nums
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2
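Why the commas around NR matter can be seen with a short experiment. The list `10,12` below is hypothetical and chosen only to provoke false matches; `index` is used instead of `match` because a plain substring test is enough here.

```shell
result=$(awk 'BEGIN {
    list = "10,12"    # hypothetical line numbers, not taken from the question
    for (n = 1; n <= 12; n++) {
        if (index("," list ",", "," n ",")) anchored = anchored " " n
        if (index(list, n))                 loose    = loose " " n
    }
    print "anchored:" anchored
    print "loose:" loose
}')
printf '%s\n' "$result"
```

Without the surrounding commas, line numbers 1 and 2 also match because they occur as substrings of 10 and 12.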
$ cat tst.awk
BEGIN {
for (i=6;i<=9;i++) {
d[2,i] = 2
}
split("1 3 6 8 9 10",t);
for (i in t) {
d[3,t[i]] = 1
}
}
{ $2 += d[2,NR]; $3 += d[3,NR]; print }
$ awk -f tst.awk file
1 2 4
1 5 7
2 5 10
11 21 110
6 17 -2
10 4 9
6 6 3
5 3 9
6 3 6
7 3 2

How to use awk and sed to make statistics on a data

I have a raw data file and I want to generate an output file; both are shown below. The rule is that column 1 of the output is column 2 of the raw file, and column 2 of the output is the mean of the last-column values over all raw-data lines with that name. For example, value1 in the output is (25+24.3846+13.8972+1.33333+1)/5.
#Raw data
4 compiler-compiler 100000 99975 1 25
4 compiler-compiler 100000 99683 13 24.3846
4 compiler-compiler 100000 93649 457 13.8972
4 compiler-compiler 100000 99764 177 1.33333
4 compiler-compiler 100000 99999 1 1
4 compiler-sunflow 100000 99999 1 1
4 compiler-sunflow 100000 99674 11 29.6364
4 compiler-sunflow 100000 93467 423 15.4444
4 compiler-sunflow 100000 99694 159 1.92453
4 compiler-sunflow 100000 99938 4 15.5
4 compress 100000 99997 1 3
4 compress 100000 99653 10 34.7
4 compress 100000 93639 454 14.011
4 compress 100000 99666 173 1.93064
4 compress 100000 99978 4 5.5
4 serial 100000 99998 1 2
4 serial 100000 99932 6 11.3333
4 serial 100000 93068 460 15.0696
4 serial 100000 99264 206 3.57282
4 serial 100000 99997 3 1
4 sunflow 100000 99998 1 2
4 sunflow 100000 99546 18 25.2222
4 sunflow 100000 93387 481 13.7484
4 sunflow 100000 99752 189 1.31217
4 sunflow 100000 99974 4 6.5
4 xml-transfomer 100000 99994 1 6
4 xml-transfomer 100000 99964 3 12
4 xml-transfomer 100000 93621 463 13.7775
4 xml-transfomer 100000 99540 199 2.31156
4 xml-transfomer 100000 99986 2 7
4 xml-validation 100000 99996 1 4
4 xml-validation 100000 99563 16 27.3125
4 xml-validation 100000 93748 451 13.8625
4 xml-validation 100000 99716 190 1.49474
4 xml-validation 100000 99979 3 7
#Output data
compiler-compiler value1
....
xml-transfomer value2
xml-validation value3
I think the commands awk and sed can do this, but I do not know how.
sed cannot be used here since it does not support arithmetic. It's a job for awk:
awk 'NR>1{c[$2]++;s[$2]+=$(NF)}END{for(i in c){print i,s[i]/c[i]}}' input.txt
Explanation:
NR>1 { c[$2]++; s[$2]+=$(NF) }
NR>1 means that the following block is executed on every line except the first. $2 is the value of the second column. NF is the number of fields on the line, so $(NF) is the value of the last column. c and s are associative arrays: c counts the occurrences of $2, and s stores the running total of the numeric values in the last column, grouped by $2.
END {for(i in c){print i,s[i]/c[i]}}
END means the following action will take place after the last line of input has been processed. The for loop iterates through c and outputs the name and the mean for all indexes in c.
Output:
xml-validation 10.7339
compiler-compiler 13.123
serial 6.59514
sunflow 9.75655
xml-transfomer 8.21781
compiler-sunflow 12.7011
compress 11.8283
Note that you have no influence on the output order when using an associative array, because for (i in c) visits the keys in an unspecified order. If you care about the output order you might use the following command:
awk 'NR>1 && $2!=n && c {print n,t/c;c=t=0} NR>1{n=$2;c++;t+=$(NF)} END{if(c)print n,t/c}' input.txt
This command does not use associative arrays. It prints the stats as soon as $2 changes, and the END block prints the last group. Note that this requires the input to be sorted by $2.
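As a quick check of the streaming approach, the sketch below runs it on two of the groups from the raw data (compress and serial), fed in through a here-document. The longer variable names are introduced here for readability, and the END block is included so that the final group is reported as well.

```shell
result=$(awk '
NR > 1 && $2 != name && n { print name, total / n; n = total = 0 }  # group changed: report it
NR > 1                    { name = $2; n++; total += $NF }          # accumulate current group
END                       { if (n) print name, total / n }          # report the last group
' <<'EOF'
#Raw data
4 compress 100000 99997 1 3
4 compress 100000 99653 10 34.7
4 compress 100000 93639 454 14.011
4 compress 100000 99666 173 1.93064
4 compress 100000 99978 4 5.5
4 serial 100000 99998 1 2
4 serial 100000 99932 6 11.3333
4 serial 100000 93068 460 15.0696
4 serial 100000 99264 206 3.57282
4 serial 100000 99997 3 1
EOF
)
printf '%s\n' "$result"
```

The groups come out in input order, which is the point of this variant.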

adding columns for specified rows & dividing by the number of rows using awk

So I'm really new to using linux and script commands, help would really be appreciated!
I have a file of 1050 rows and 8 columns. Example:
anger 1 0 5 101 13 2 somesentenceofwords
anger 2 0 5 101 23 3 somesentenceofwords
anger 3 0 3 101 35 3 somesentenceofwords
anger 4 0 2 101 23 3 somesentenceofwords
arch 5 0 3 101 34 12 somesentenceofwords
arch 6 0 2 101 45 23 somesentenceofwords
arch 7 0 2 101 23 12 somesentenceofwords
hand 8 9 0 101 32 21 somesentenceofwords
hand 9 0 2 101 23 12 somesentenceofwords
What I want to do is if the first column is the same for x number of rows then output the sum of the 6th column for those rows and divide it by the number of rows (an average essentially).
So in the example, since the first 4 rows are all anger, I want the average of column 6 over all rows with anger in column 1. It would compute (13 + 23 + 35 + 23) / 4. It would then do the same for arch, then hand, and so on.
Example output:
anger 23.5
arch 34
hand 27.5
I tried this just to see if I can do it individually where each column would equal a specific letter string but couldn't even get that to work.
$ awk '{if($1="anger"){sum+=$6} {print sum}}' filename
Is this possible?
Using awk:
awk '!($1 in s){b[++i]=$1; s[$1]=0} {c[$1]++; s[$1]+=$6}
END{for (k=1; k<=i; k++) printf "%s %.1f\n", b[k], s[b[k]]/c[b[k]]}' file
anger 23.5
arch 34.0
hand 27.5
Pretty straightforward with awk:
$ awk '{a[$1]+=$6;b[$1]++}END{for (i in a) print i,a[i]/b[i]}' file
hand 27.5
arch 34
anger 23.5
How does this work?
The block {a[$1]+=$6;b[$1]++} is executed for every line that is read. We create two maps, one storing the sum, for each key, and one storing the count for each key.
The block END{for (i in a) print i,a[i]/b[i]} is executed after all lines are read. We iterate over the keys of the first map, and print the key, and the division of the sum over the count (i.e. the mean).
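A side note on the attempt in the question: in awk, `$1="anger"` is an assignment, so the condition is always true and column 1 is overwritten on every line; the comparison operator is `==`. A minimal sketch for a single key, using only the anger and arch rows of the sample, might look like this:

```shell
result=$(awk '$1 == "anger" { sum += $6; n++ }     # compare with ==, do not assign
              END           { if (n) print sum / n }' <<'EOF'
anger 1 0 5 101 13 2 somesentenceofwords
anger 2 0 5 101 23 3 somesentenceofwords
anger 3 0 3 101 35 3 somesentenceofwords
anger 4 0 2 101 23 3 somesentenceofwords
arch 5 0 3 101 34 12 somesentenceofwords
arch 6 0 2 101 45 23 somesentenceofwords
arch 7 0 2 101 23 12 somesentenceofwords
EOF
)
printf '%s\n' "$result"
```

The `if (n)` guard avoids a division by zero when no line matches the key.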
