I have a matrix like this:
A 25 27 50
B 35 37 475
C 75 78 80
D 99 88 76
0 234 230 681
The last row is the sum of all elements in the column - and it is also the maximum value.
What I would like to get is the matrix in which each value is divided by the last value in the column (e.g. for the first number in column 2, I would want "25/234="):
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.14957264957265 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.11160058737151
An answer in another thread gives an acceptable result for one column, but I was not able to loop it over all columns.
$ awk 'FNR==NR{max=($2+0>max)?$2:max;next} {print $1,$2/max}' file file
(this answer was provided here: normalize column data with maximum value of that column)
I would be grateful for any help!
In addition to the great approaches by #RavinderSingh13, you can also isolate the last line in the input file with, e.g. tail -n1 Input_file and then use the split() command in the BEGIN rule to separate the values. You can then make a single-pass through the file with awk to update the values as you indicate. In the end, you can pipe the output to head -n-1 to remove the unneeded final row, e.g.
awk -v lline="$(tail -n1 Input_file)" '
BEGIN { split(lline,a," ") }
{
printf "%s", $1
for(i=2; i<=NF; i++)
printf " %.15lf", $i/a[i]
print ""
}
' Input_file | head -n-1
Example Use/Output
$ awk -v lline="$(tail -n1 Input_file)" '
> BEGIN { split(lline,a," ") }
> {
> printf "%s", $1
> for(i=2; i<=NF; i++)
> printf " %.15lf", $i/a[i]
> print ""
> }
> ' Input_file | head -n-1
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.149572649572650 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.111600587371512
(note: this presumes you don't have trailing blank lines in your file and you really don't have blank lines between every row. If you do, let me know)
The differences between the approaches are largely negligible. In each case you are making a total of 3-passes through the file. Here with tail, awk and then head. In the other case with wc and then two-passes with awk.
Let either of us know if you have questions.
1st solution: Could you please try following, written and tested with shown samples in GNU awk. With exact 15 floating points as per OP's shown samples:
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) }
print
}
' Input_file Input_file
2nd solution: If you don't care of floating points to be specific points then try following.
awk -v lines=$(wc -l < Input_file) '
FNR==NR && FNR==lines{
for(i=2;i<=NF;i++){ arr[i]=$i }
next
}
FNR<lines && FNR!=NR{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
OR(placing condition of FNR==lines inside FNR==NR condition):
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v lines=$(wc -l < Input_file) ' ##Starting awk program from here, creating lines which variable which has total number of lines in Input_file here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
if(FNR==lines){ ##Checking if FNR is equal to lines then do following.
for(i=2;i<=NF;i++){ arr[i]=$i } ##Traversing through all fields here of current line and creating an array arr with index of i and value of current field value.
}
next ##next will skip all further statements from here.
}
FNR<lines{ ##Checking condition if current line number is lesser than lines, this will execute when 2nd time Input_file is being read.
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) } ##Traversing through all fields here and saving value of divide of current field with arr current field value with 15 floating points into current field.
print ##Printing current line here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Related
I have a csv file like below:
http://www.a.com/1,apple
http://www.a.com/2,apple
http://www.a.com/3,apple
http://www.a.com/4,apple
...
http://www.z.com/1,flower
http://www.z.com/2,flower
http://www.z.com/3,flower
...
I want combine the csv file to new csv file like below:
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4
",apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3
http://www.z.com/4
...
http://www.z.com/100
",flower
"http://www.z.com/101
http://www.z.com/102
http://www.z.com/103
http://www.z.com/104
...
http://www.z.com/200
",flower
I want keep the first column every cell have max 100 lines http url.
Column two same value will appear in corresponding cell.
Is there a very simple command pattern to achieve this idea ?
I used command below:
awk '{if(NR%100!=0)ORS="\t";else ORS="\n"}1' test.csv > result.csv
$ awk -F, '$2!=p || n==100 {if(NR!=1) print "\"," p; printf "\""; p=$2; n=0}
{print $1; n+=1} END {print "\"," p}' test.csv
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4
",apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3
",flower
First set the field separator to the comma (-F,). Then:
If the second field changes ($2!=p) or if we already printed 100 lines in the current batch (n==100):
if it is not the first line, print a double quote, a comma, the previous second field and a newline,
print a double quote,
store the new second field in variable p for later comparisons,
reset line counter n.
For all lines print the first field and increment line counter n.
At the end print a double quote, a comma and the last value of second field.
1st solution: With your shown samples, please try following awk code.
awk '
BEGIN{
s1="\""
FS=OFS=","
}
prev!=$2 && prev{
print s1 val s1,prev
val=""
}
{
val=(val?val ORS:"")$1
prev=$2
}
END{
if(val){
print s1 val s1,prev
}
}
' Input_file
2nd solution: In case your Input_file is NOT sorted with 2nd column then try following sort + awk code.
sort -t, -k2 Input_file |
awk '
BEGIN{
s1="\""
FS=OFS=","
}
prev!=$2 && prev{
print s1 val s1,prev
val=""
}
{
val=(val?val ORS:"")$1
prev=$2
}
END{
if(val){
print s1 val s1,prev
}
}
'
Output will be as follows:
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4",apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3",flower
Given:
cat file
http://www.a.com/1,apple
http://www.a.com/2,apple
http://www.a.com/3,apple
http://www.a.com/4,apple
http://www.z.com/1,flower
http://www.z.com/2,flower
http://www.z.com/3,flower
Here is a two pass awk to do this:
awk -F, 'FNR==NR{seen[$2]=FNR; next}
seen[$2]==FNR{
printf("\"%s%s\"\n,%s\n",data,$1,$2)
data=""
next
}
{data=data sprintf("%s\n",$1)}' file file
If you want to print either at the change of the $2 value or at some fixed line interval (like 100) you can do:
awk -F, -v n=100 'FNR==NR{seen[$2]=FNR; next}
seen[$2]==FNR || FNR%n==0{
printf("\"%s%s\"\n,%s\n",data,$1,$2)
data=""
next
}
{data=data sprintf("%s\n",$1)}' file file
Either prints:
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4"
,apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3"
,flower
I want to use awk to match all the occurrences of a pattern within a large file. For each match, I would like to print the row number and the starting position of the pattern along the row (sort of xy coordinates). There are several occurrences of the pattern in each line.
I found this somewhat related question.
So far, I managed to do it only for the first (leftmost) occurrence in each line. As an example:
echo xyzABCdefghiABCdefghiABCdef | awk 'match($0, /ABC/) {print NR, RSTART } '
The resulting output is :
1 4
But what I would expect is something like this:
1 4
1 13
1 22
I tried using split instead of match. I manage to identify all the occurrences, but the RSTART is lost and printed as "0".
echo xyzABCdefghiABCdefghiABCdef | awk ' { split($0,t, /ABC/,m) ; for (i=1; i in m; i++) print (NR, RSTART) } '
Output:
1 0
1 0
1 0
Any advice would be appreciated. I am not limited to using awk but a awk solution would be appreciated.
Also, in my case the pattern to match would be a regex (/A.C/).
Thank you
This may be what you're trying to do:
echo xyzABCdefghiABCdefghiABCdef |
awk '{ begpos=1
while (match(substr($0, begpos), /ABC/)) {
print NR, begpos + RSTART - 1
begpos += RLENGTH + RSTART - 1
}
}'
Another option using gnu awk could be using split with a regex.
Using the split function, the 3rd field is the fieldsep array and the 4th field is the seps array which you can both use to calculate the positions.
echo xyzABCdefghiABCdefghiABCdef |
awk ' {
n=split($0, a, /ABC/, seps); pos=1
for(i=1; i<n; i++){
pos += length(a[i])
print NR, pos
pos += length(seps[i])
}
}'
Output
1 4
1 13
1 22
With your shown samples, please try following awk code.
awk '
{
prev=0
while(match($0,/ABC/)){
$0=substr($0,RSTART+RLENGTH)
print FNR,prev+RSTART
prev+=RSTART+2
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
prev=0 ##Setting prev variable to 0 here.
while(match($0,/ABC/)){ ##Using while loop to match ABC string and it runs till ABC match is ture in current line.
$0=substr($0,RSTART+RLENGTH) ##Re-creating current line by assigning value of rest of line(which starts after match of ABC).
print FNR,prev+RSTART ##Printing line number along with prev+RSTART value here.
prev+=RSTART+2 ##Setting prev to prev+RSTART+2 here.
}
}
' Input_file ##Mentioning Input_file name here.
Determination of the coordinates of a string with awk:
echo "xyzABCdefghiABCdefghiABCdef" \
| awk -v s="ABC" 'BEGIN{ len=length(s) }
{
for(i=1; i<=length($0); i++){
if(substr($0, i, len)==s){
print NR, i
}
}
}'
Output:
1 4
1 13
1 22
As one line:
echo xyzABCdefghiABCdefghiABCdef | awk -v s="ABC" 'BEGIN{ len=length(s) } { for(i=1; i<=length($0); i++){ if(substr($0,i,len)==s) { print NR,i } } }'
Source: Find position of character with awk
One awk idea using split() and some slicing-n-dicing of length() results:
ptn='ABC'
echo xyzABCdefghiABCdefghiABCdef |
awk -v ptn="${ptn}" '
{ pos=-(length(ptn)-1)
n=split($0,arr,ptn)
for (i=1;i<n;i++) {
pos+=length(arr[i] ptn)
print NR,pos
}
}'
This generates:
1 4
1 13
1 22
I am looking for a simple solution to have for each line the same number of commas in file (CSV file)
e.g.
example of file:
1,1
A,B,C,D,E,F
2,2,
3,3,3,
4,4,4,4
expected:
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
the line with the largest number of commas has 5 commas in this case (line #2). so, I want to add other commas in all lines to have the same number for each line (i.e. 5 commas)
Using awk:
$ awk 'BEGIN{FS=OFS=","} {$6=$6} 1' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
As you can see above, in this approach the max. number of fields must be hardcoded in the command.
Another take on providing making all lines in a CSV file have the same number of fields. The number of fields need not be known. The max fields will be calculated and a substring of needed commas appended to each record, e.g.
awk -F, -v max=0 '{
lines[n++] = $0 # store lines indexed by line number
fields[lines[n-1]] = NF # store number of field indexed by $0
if (NF > max) # find max NF value
max = NF
}
END {
for(i=0;i<max;i++) # form string with max commas
commastr=commastr","
for(i=0;i<n;i++) # loop appended substring of commas
printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
}' file
Example Use/Output
Pasting at the command-line, you would receive:
$ awk -F, -v max=0 '{
> lines[n++] = $0 # store lines indexed by line number
> fields[lines[n-1]] = NF # store number of field indexed by $0
> if (NF > max) # find max NF value
> max = NF
> }
> END {
> for(i=0;i<max;i++) # form string with max commas
> commastr=commastr","
> for(i=0;i<n;i++) # loop appended substring of commas
> printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
> }' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
Could you please try following, a more generic way. This code will work even number of fields are not same in your Input_file and will first read and get maximum number of fields from whole file and then 2nd time reading file it will reset the fields(why because we have set OFS as , so if current line's number of fields are lesser than nf value those many commas will be added to that line). Enhanced version of #oguz ismail's answer.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
nf=nf>NF?nf:NF
next
}
{
$nf=$nf
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program frmo here.
BEGIN{ ##Starting BEGIN section of awk program from here.
FS=OFS="," ##Setting FS and OFS as comma for all lines here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
nf=nf>NF?nf:NF ##Creating variable nf whose value is getting set as per condition, if nf is greater than NF then set it as NF else keep it as it is,
next ##next will skip all further statements from here.
}
{
$nf=$nf ##Mentioning $nf=$nf will reset current lines value and will add comma(s) at last of line if NF is lesser than nf.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.
i wrote a bash script in order to pull substrings and save it to an output file from two input files that looks like this:
input file 1
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
input file 2
gene1 10 20
gene2 40 50
genen x y
my script
>output_file
cat input_file2 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' input_file1 >> output file
done
done
how can i make it work in a loop for more than one string in the input file 1?
new input file looks like this:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotypen...
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
I would like to have a different out file for every genotype and that the file name would be the genotype name.
thank you!
If I'm understanding correctly, would you try the following:
awk '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > genotype
print substr($0, start[i], len[i]) >> genotype
}
close(genotype)
}' input_file2 input_file1
input_file1:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotype3
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Input_file2:
gene1 10 20
gene2 40 50
gene3 20 25
[Results]
genotype1:
>gene1
aaaaaaaaaa
>gene2
aaaaaaaaaa
>gene3
aaaaa
genotype2:
>gene1
bbbbbbbbbb
>gene2
bbbbbbbbbb
>gene3
bbbbb
genotype3:
>gene1
nnnnnnnnnn
>gene2
nnnnnnnnnn
>gene3
nnnnn
[EDIT]
If you want to store the output files to a different directory,
please try the following instead:
dir="./outdir" # directory name to store the output files
# you can modify the name as you want
mkdir -p "$dir"
awk -v dir="$dir" '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > dir"/"genotype
print substr($0, start[i], len[i]) >> dir"/"genotype
}
close(dir"/"genotype)
}' input_file2 input_file1
The 1st two lines are executed in bash to define and mkdir the destination directory.
Then the directory name is passed to awk via -v option
Hope this helps.
Could you please try following, where I am assuming that your Input_file1's column which starts with > should be compared with 1st column of Input_file2's first column (since samples are confusing so based on OP's attempt this has been written).
awk '
FNR==NR{
start_point[$1]=$2
end_point[$1]=$3
next
}
/^>/{
sub(/^>/,"")
val=$0
next
}
{
print val ORS substr($0,start_point[val],end_point[val])
val=""
}
' Input_file2 Input_file1
Explanation: Adding explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file named Input_file2 is being read.
start_point[$1]=$2 ##Creating an array named start_point with index $1 of current line and its value is $2.
end_point[$1]=$3 ##Creating an array named end_point with index $1 of current line and its value is $3.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if a line starts from > then do following.
sub(/^>/,"") ##Substituting starting > with NULL.
val=$0 ##Creating a variable val whose value is $0.
next ##next will skip all further statements from here.
}
{
print val ORS substr($0,start_point[val],end_point[val]) ##Printing val newline(ORS) and sub-string of current line whose start value is value of start_point[val] and end point is value of end_point[val].
val="" ##Nullifying variable val here.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
I have table like
Student_name,Subject
Ram,Maths
Ram,Science
Arjun,Maths
Arjun,Science
Arjun,Social
Arjun,Social
Output : I need to report only 'student' whose 'Social' subject percentage is more than 49%
Final output
Arjun, social, 50
.
Temp output(backend)
Student_name,Subject,Percentage(group by student name)
Ram,Maths,50
Ram,Science,50
Arjun,Maths,25
Arjun,Science,25
Arjun,Social,50
I have tried with below awk commands but I see percentage on complete subjects irrespective group by student name.
awk -F, '{x++;}{a[$1,$2]++;}END{for (i in a)print i, a[i],(a[i]/x)*100;}' OFS=, test1.csv > output2.dat
awk -F, '$2=="Science" && $3>=49{ print $1}' output2.dat
And Can we get it in single awk command.
Try following awk too once where it will provide the output in same order in which Input_file is data is there.
awk 'FNR>1 && FNR==NR{a[$1]++;b[$1]=$0;next} FNR==1 && FNR!= NR{print $0,"percentage";next}($1 in b){print $0"\t"100/a[$1]"%"}' Input_file Input_file
EDIT: Adding non-one liner form of solution too now.
awk '
FNR>1 && FNR==NR{
a[$1]++;
b[$1]=$0;
next
}
FNR==1 && FNR!= NR{
print $0,"percentage";
next
}
($1 in b){
print $0"\t"100/a[$1]"%"
}
' Input_file Input_file
EDIT1: Adding new solution as per OP's change in requirement.
awk '
FNR>1 && FNR==NR{
a[$1]++;
b[$1]=b[$1]?b[$1] ORS $0:$0;
c[$1,$2];
next
}
FNR==1 && FNR!= NR{
print $0,"percentage";
next
}
($1 in b){
if($2=="Science" && (100/a[$1])>49){
print b[$1]
}
}
' Input_file Input_file
GNU awk solution:
awk -F, 'NR==1{ print $0,"Percentage" }NR>1{ a[$1][$2]++ }
END{
for(i in a) for(j in a[i]) print i,j,(a[i][j]/length(a[i])*100"%")
}' OFS=',' test1.csv | column -t
The output:
Student_name,Subject,Percentage
Ram,Maths,50%
Ram,Science,50%
Arjun,Social,66.6667%
Arjun,Maths,33.3333%
Arjun,Science,33.3333%
Use a Numeric Comparison
You can do this with a very simple numeric comparison against the third field:
$ awk '$3 > 49 {print}' /tmp/input
Student_name Subject Percentage(group by student name)
Ram Maths 50%
Ram Science 50%
For this comparison, AWK coerces to a string, so the comparion will treat 50% the same as 50. As a nice byproduct, if the third field doesn't contain any numbers then it does a string comparison. The header line is greater than ! so it matches, too.