How to get percentage of a column based on key in unix - bash

I have a table like:
Student_name,Subject
Ram,Maths
Ram,Science
Arjun,Maths
Arjun,Science
Arjun,Social
Arjun,Social
Output: I need to report only the student whose 'Social' subject percentage is more than 49%.
Final output:
Arjun, social, 50

Temp output (backend):
Student_name,Subject,Percentage (grouped by student name)
Ram,Maths,50
Ram,Science,50
Arjun,Maths,25
Arjun,Science,25
Arjun,Social,50
I have tried the awk commands below, but the percentage is computed across all subjects rather than grouped by student name.
awk -F, '{x++;}{a[$1,$2]++;}END{for (i in a)print i, a[i],(a[i]/x)*100;}' OFS=, test1.csv > output2.dat
awk -F, '$2=="Science" && $3>=49{ print $1}' output2.dat
Can we get this in a single awk command?
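For what it's worth, here is a minimal single-command sketch (assuming student names and subjects never contain commas; the tot/cnt array names are just illustrative) that counts rows per student and per student+subject in one pass, then reports Social above 49% from the END block:
awk -F, '
NR>1{
  tot[$1]++                 # rows per student
  cnt[$1 FS $2]++           # rows per student+subject
}
END{
  for(k in cnt){
    split(k, p, FS)
    pct = cnt[k]*100/tot[p[1]]
    if(p[2]=="Social" && pct>49) print p[1], p[2], pct
  }
}' OFS=, test1.csv
On the sample above this should print Arjun,Social,50.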

Try the following awk as well; it prints the output in the same order in which the data appears in Input_file. It counts rows per student and per student+subject on the first read, then appends each row's percentage on the second read.
awk 'BEGIN{FS=OFS=","} FNR>1 && FNR==NR{a[$1]++;b[$1,$2]++;next} FNR==1 && FNR!=NR{print $0,"percentage";next} FNR>1{print $0"\t"(b[$1,$2]*100/a[$1])"%"}' Input_file Input_file
EDIT: Adding a non-one-liner form of the solution too.
awk '
BEGIN{
FS=OFS=","
}
FNR>1 && FNR==NR{
a[$1]++;
b[$1,$2]++;
next
}
FNR==1 && FNR!=NR{
print $0,"percentage";
next
}
FNR>1{
print $0"\t"(b[$1,$2]*100/a[$1])"%"
}
' Input_file Input_file
EDIT1: Adding a new solution as per the OP's changed requirement (report only the student whose Social percentage exceeds 49%).
awk '
BEGIN{
FS=OFS=","
}
FNR>1 && FNR==NR{
a[$1]++;
c[$1,$2]++;
next
}
FNR>1 && $2=="Social" && !seen[$1,$2]++ && (c[$1,$2]*100/a[$1])>49{
print $1,$2,c[$1,$2]*100/a[$1]
}
' Input_file Input_file
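With the shown sample this prints a single line, Arjun,Social,50, matching the requested final output.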

GNU awk solution:
awk -F, 'NR==1{ print $0,"Percentage" }NR>1{ a[$1][$2]++ }
END{
for(i in a) for(j in a[i]) print i,j,(a[i][j]/length(a[i])*100"%")
}' OFS=',' test1.csv | column -t
The output:
Student_name,Subject,Percentage
Ram,Maths,50%
Ram,Science,50%
Arjun,Social,66.6667%
Arjun,Maths,33.3333%
Arjun,Science,33.3333%

Use a Numeric Comparison
You can do this with a very simple numeric comparison against the third field:
$ awk '$3 > 49 {print}' /tmp/input
Student_name Subject Percentage(group by student name)
Ram Maths 50%
Ram Science 50%
For this comparison, AWK coerces the value to a string, so the comparison treats 50% the same as 50. As a nice byproduct, if the third field doesn't contain any numbers then it does a string comparison. The header line is "greater than" 49 when compared as a string, so it matches, too.
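If you'd rather skip the header and make the comparison explicitly numeric, a small variant (a sketch) coerces the field with $3+0; awk's string-to-number conversion stops at the first non-numeric character, so 50% becomes 50:
$ awk 'NR>1 && $3+0 > 49' /tmp/input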

Related

Copy one csv header to another csv with type modification

I want to copy one csv header to another, row-wise, with some modifications.
Input csv
name,"Mobile Number","mobile1,mobile2",email2,Address,email21
test, 123456789,+123456767676,a#test.com,testaddr,a1#test.com
test1,7867778,8799787899898,b#test,com, test2addr,b2#test.com
In the new csv it should look like below, and the file should also be created. For a string column I will pass the column name, so only that column will be converted to string.
name.auto()
Mobile Number.auto()
mobile1,mobile2.string()
email2.auto()
Address.auto()
email21.auto()
As you see above, each header with its type modification should be inserted on a separate row.
I have tried the below command, but it only copies the first row:
sed '1!d' input.csv > output.csv
You may try this alternative GNU awk command as well:
awk -v FPAT='"[^"]+"|[^,]+' 'NR == 1 {
for (i=1; i<=NF; ++i)
print gensub(/"/, "", "g", $i) "." ($i ~ /,/ ? "string" : "auto") "()"
exit
}' file
name.auto()
Mobile Number.auto()
mobile1,mobile2.string()
email2.auto()
Address.auto()
email21.auto()
Or using sed:
sed -i -e '1i 1234567890.string(),My address is test.auto(),abc3#gmail.com.auto(),120000003.auto(),abc-003.auto(),3.com.auto()' -e '1d' test.csv
EDIT: As per the OP's comment, to print only the first line (header), please try the following.
awk -v FPAT='[^,]*|"[^"]+"' '
FNR==1{
for(i=1;i<=NF;i++){
gsub(/"/,"",$i)
if($i~/,/){
print $i".string()"
}
else{
print $i".auto()"
}
}
exit
}
' Input_file > output_file
Could you please try the following, written and tested with GNU awk on the shown samples.
awk -v FPAT='[^,]*|"[^"]+"' '
FNR==1{
for(i=1;i<=NF;i++){
gsub(/"/,"",$i)
if($i~/,/){
print $i".string()"
}
else{
print $i".auto()"
}
}
next
}
1
' Input_file
Explanation: Adding a detailed explanation for the above.
awk -v FPAT='[^,]*|"[^"]+"' ' ##Starting awk program and setting FPAT so that a field is either comma-free or a quoted string, which may contain commas.
FNR==1{ ##If this is the first line then do the following.
for(i=1;i<=NF;i++){ ##Running a for loop over all the fields.
gsub(/"/,"",$i) ##Substituting all occurrences of " with nothing in the current field.
if($i~/,/){ ##If the field still contains a comma, it was a quoted, comma-containing header.
print $i".string()" ##Printing the current field followed by .string() here.
}
else{ ##else do the following.
print $i".auto()" ##Printing the current field followed by .auto() here.
}
}
next ##next will skip all further statements for the first line.
}
1 ##1 will print the current (data) line unchanged.
' Input_file ##Mentioning the Input_file name here.
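Assuming the sample input from the question is saved as Input_file, the last command should print the typed header lines followed by the untouched data rows:
name.auto()
Mobile Number.auto()
mobile1,mobile2.string()
email2.auto()
Address.auto()
email21.auto()
test, 123456789,+123456767676,a#test.com,testaddr,a1#test.com
test1,7867778,8799787899898,b#test,com, test2addr,b2#test.com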

divide each column by max value/last value

I have a matrix like this:
A 25 27 50
B 35 37 475
C 75 78 80
D 99 88 76
0 234 230 681
The last row is the sum of all elements in the column - and it is also the maximum value.
What I would like to get is the matrix in which each value is divided by the last value in the column (e.g. for the first number in column 2, I would want "25/234="):
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.14957264957265 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.11160058737151
An answer in another thread gives an acceptable result for one column, but I was not able to loop it over all columns.
$ awk 'FNR==NR{max=($2+0>max)?$2:max;next} {print $1,$2/max}' file file
(this answer was provided here: normalize column data with maximum value of that column)
I would be grateful for any help!
In addition to the great approaches by @RavinderSingh13, you can also isolate the last line of the input file with, e.g., tail -n1 Input_file, and then use the split() command in the BEGIN rule to separate the values. You can then make a single pass through the file with awk to update the values as you indicate, and finally pipe the output to head -n-1 to remove the unneeded final row, e.g.
awk -v lline="$(tail -n1 Input_file)" '
BEGIN { split(lline,a," ") }
{
printf "%s", $1
for(i=2; i<=NF; i++)
printf " %.15lf", $i/a[i]
print ""
}
' Input_file | head -n-1
Example Use/Output
$ awk -v lline="$(tail -n1 Input_file)" '
> BEGIN { split(lline,a," ") }
> {
> printf "%s", $1
> for(i=2; i<=NF; i++)
> printf " %.15lf", $i/a[i]
> print ""
> }
> ' Input_file | head -n-1
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.149572649572650 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.111600587371512
(note: this presumes you don't have trailing blank lines in your file and you really don't have blank lines between every row. If you do, let me know)
The differences between the approaches are largely negligible. In each case you make a total of three passes through the file: here with tail, awk, and then head; in the other case with wc and then two passes with awk.
Let either of us know if you have questions.
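If a single pass over the file is preferred and the matrix fits in memory, a sketch using GNU awk's arrays of arrays (gawk-only; the m and nf array names are illustrative) buffers every row and divides in the END block:
awk '
{ for(i=1;i<=NF;i++) m[NR][i]=$i; nf[NR]=NF }   # buffer every row
END{
  for(r=1; r<NR; r++){                          # skip the final sums row
    printf "%s", m[r][1]
    for(i=2; i<=nf[r]; i++) printf " %.15f", m[r][i]/m[NR][i]
    print ""
  }
}' Input_file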
1st solution: Could you please try the following, written and tested in GNU awk with the shown samples. It prints exactly 15 decimal places, as per the OP's shown output:
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) }
print
}
' Input_file Input_file
2nd solution: If you don't need a specific number of decimal places, then try the following.
awk -v lines=$(wc -l < Input_file) '
FNR==NR && FNR==lines{
for(i=2;i<=NF;i++){ arr[i]=$i }
next
}
FNR<lines && FNR!=NR{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
OR (placing the FNR==lines condition inside the FNR==NR block):
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
if(FNR==lines){
for(i=2;i<=NF;i++){ arr[i]=$i }
}
next
}
FNR<lines{
for(i=2;i<=NF;i++){ $i=(arr[i]?$i/arr[i]:"NaN") }
print
}
' Input_file Input_file
Explanation: Adding a detailed explanation for the above.
awk -v lines=$(wc -l < Input_file) ' ##Starting awk program, creating a variable named lines which holds the total number of lines of Input_file.
FNR==NR{ ##Checking condition FNR==NR, which is TRUE the first time Input_file is read.
if(FNR==lines){ ##Checking if FNR equals lines, i.e. this is the last line; if so, do the following.
for(i=2;i<=NF;i++){ arr[i]=$i } ##Traversing all fields of the current line, storing each field value in array arr indexed by i.
}
next ##next will skip all further statements from here.
}
FNR<lines{ ##Checking if the current line number is less than lines; this executes when Input_file is read the second time.
for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) } ##Traversing all fields, replacing each with itself divided by the corresponding arr value, formatted to 15 decimal places.
print ##Printing the current line here.
}
' Input_file Input_file ##Mentioning the Input_file name twice, so it is read twice.

How to compare two files and print the values of both the files which are different

There are 2 files. I need to sort them first, then compare them, and for each difference print the values from both file 1 and file 2.
file1:
pair,bid,ask
AED/MYR,3.918000,3.918000
AED/SGD,3.918000,3.918000
AUD/CAD,3.918000,3.918000
file2:
pair,bid,ask
AUD/CAD,3.918000,3.918000
AUD/CNY,3.918000,3.918000
AED/MYR,4.918000,4.918000
Output should be:
pair,inputbid,inputask,outputbid,outputask
AED/MYR,3.918000,3.918000,4.918000,4.918000
The only difference between the 2 files is AED/MYR, with different bid/ask rates. How can I print the differing values from file 1 and file 2?
I tried using the below command:
nawk -F, 'NR==FNR{a[$1]=$4;a[$2]=$5;next} !($4 in a) || !($5 in a) {print $1 FS a[$1] FS a[$2] FS $4 FS $5}' file1 file2
Result output as below:
pair,bid,ask,bid,ask
AUD/CAD,3.918000,3.918000,3.918000,3.918000
AUD/CHF,3.918000,3.918000,3.918000,3.918000
AUD/CNH,3.918000,3.918000,3.918000,3.918000
AUD/CNY,3.918000,3.918000,3.918000,3.918000
AED/MYR,3.918000,3.918000,4.918000,4.918000
We are still not able to get only the difference.
Could you please try the following, written and tested in GNU awk with the shown samples.
awk -v header="pair,inputbid,inputask,outputbid,outputask" '
BEGIN{
FS=OFS=","
}
FNR==NR{
arr[$1]=$0
next
}
($1 in arr) && arr[$1]!=$0{
val=$1
$1=""
sub(/^,/,"")
if(!found){
print header
found=1
}
print arr[val],$0
}' Input_file1 Input_file2
Explanation: Adding a detailed explanation for the above.
awk -v header="pair,inputbid,inputask,outputbid,outputask" ' ##Starting awk program and setting the header variable here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting field separator and output field separator as comma here.
}
FNR==NR{ ##Checking condition FNR==NR, which is TRUE while Input_file1 is being read.
arr[$1]=$0 ##Creating arr with index $1, keeping the current line as its value.
next ##next will skip all further statements from here.
}
($1 in arr) && arr[$1]!=$0{ ##Checking if the first field is present in arr and its stored line is NOT equal to the current line.
val=$1 ##Saving the first field in val.
$1="" ##Nullifying the first field here.
sub(/^,/,"") ##Substituting the leading , with nothing here.
if(!found){ ##Checking if found is unset; if so, do the following.
print header ##Printing the header here, only once.
found=1 ##Setting found here.
}
print arr[val],$0 ##Printing arr indexed by val, followed by the current line's remaining fields.
}' Input_file1 Input_file2 ##Mentioning the Input_files here.
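For reference, running the above against the shown file1 and file2 should produce:
pair,inputbid,inputask,outputbid,outputask
AED/MYR,3.918000,3.918000,4.918000,4.918000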
With bash process substitution, then join, and then filtering with awk:
# print the header
printf "%s\n" "pair,inputbid,inputask,outputbid,outputask"
# remove first line from both files, then sort them on first field
# then join them on first field and output first 5 fields
join -t, -11 -21 -o1.1,1.2,1.3,2.2,2.3 <(tail -n +2 file1 | sort -t, -k1) <(tail -n +2 file2 | sort -t, -k1) |
# output only the lines whose columns differ
awk -F, '$2 != $4 || $3 != $5'
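To emit the header and the filtered rows as one stream (for example, to redirect everything into a single file; diff.csv is just an illustrative name), the steps can be grouped:
{
  printf "%s\n" "pair,inputbid,inputask,outputbid,outputask"
  join -t, -11 -21 -o1.1,1.2,1.3,2.2,2.3 \
    <(tail -n +2 file1 | sort -t, -k1,1) \
    <(tail -n +2 file2 | sort -t, -k1,1) |
    awk -F, '$2 != $4 || $3 != $5'
} > diff.csv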

Remove duplicate from csv using bash / awk

I have a csv file with the format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by the unique ids in the first column and concatenate the types into a single row, like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found that awk does a great job handling such scenarios, but all I could achieve is this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?
quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ is true only if the line has not been seen before.
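The idiom is easy to check in isolation (a quick sketch):
$ printf 'x\ny\nx\n' | awk '!seen[$0]++'
x
y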
If the second column's values should all be within a single pair of double quotes:
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the between-item double quotes should be eliminated, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
For the sample input, the one below will work, but the output is unsorted.
One-liner
# using two array ( recommended )
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
More readable:
Using regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
}
END{
for(i in a)
print i,a[i]
}
' infile
Using two array
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
Note: you can also use !seen[$0]++, which uses the entire line as the index; but if, in your real data, you want to key on particular columns, you may prefer !seen[$1,$2]++, where column 1 and column 2 are used as the index.
awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

find unique lines based on one field only [duplicate]

I would like to print unique lines based on the first field: keep the first occurrence of each line and remove the other, duplicate occurrences.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the below command, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
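The one-liner relies on awk's default action (print the line) when the condition is true; written out long-hand, it is equivalent to:
awk -F, '
!seen[$1]++ { print $0 }   # true only on the first occurrence of field 1
' Input.csv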
This should give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv
Note that for (i in a) does not guarantee input order; if the order of first occurrences matters, prefer the !seen[$1]++ approach above.
