Awk to (randomly) sample a file by unique-id criteria - shell

I'm learning AWK to read a big file whose format is similar to this MasterFile:
Beth|4.00|0|
Dan|3.75|0|
Kathy|4.00|10|
Mark|5.00|20|
Mary|5.50|22|
Susie|4.25|18|
Jise|5.62|0|
Mark|5.60|23.3|
Mary|8.50|42|
Susie|8.75|8.8|
Jise|3.62|0.8|
Beth|3.21|10|
Dan|8.39|20|
I would like to take a random sample (of size K, which I choose) of the unique values in the first column.
What I have done so far is the following: I select the unique values from the first column and save them as IDfile.txt. Then I take K random values from that file and match them against the MasterFile. I mean:
awk -F\| 'BEGIN{srand()} {print rand() " " $0}' IDfile | sort -n | tail -n K |
awk -F'[[:blank:]|]+' 'BEGIN{OFS="|"} {$1=""; sub(/\|/,"")} 1' > tmp |
awk -F\| 'NR==FNR{a[$1]; next} {for (i in a) if (index($0, i)) print $0}' tmp MasterFile
But the output has repeated values; the result I'd like to get looks like this (assuming K=3):
Beth|4.00|0|
Mark|5.60|23.3|
Mary|5.50|22|
I know that my code is far from efficient or nice, and I'm open to suggestions.
Thanks!

This is one of the right ways to do it:
$ sort -t'|' -u -k1,1 file | shuf -n3
Mark|5.00|20|
Kathy|4.00|10|
Jise|5.62|0|
Change -n3 to whatever number of unique entries you need.
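If shuf isn't available (it's part of GNU coreutils, so e.g. stock macOS lacks it), the srand() trick from the question can do the shuffling instead. A sketch of the same idea: prepend a random key, sort on it, take 3, then cut the key away:
$ sort -t'|' -u -k1,1 file |
  awk 'BEGIN{srand()} {print rand() "\t" $0}' |
  sort -n | head -n 3 | cut -f2-
Note that sort -u -k1,1 keeps a single row per id, so only the set of ids is sampled randomly, not which row represents each id.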

Related

check if column has more than one value in unix [duplicate]

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So it's like the first column has colors, so I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I wanted is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green, etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 by 2 will give you the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
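As a side note, sort -u collapses the sort | uniq pair into one command:
cut -f 1 input_file | sort -u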
EDIT:
To count the number of unique values you can add the wc command to the chain:
cut -f 1 input_file | sort | uniq | wc -l
Alternatively, awk can produce each unique value together with its count in a single pass:
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column:
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l.
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
# count tabs in the header line; columns = tabs + 1, so loop while i < tabs + 2
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 2))
for ((i = 1; i < cols; i++))
do
    echo Column $i ::
    cut -f $i < "$FILE" | sort | uniq -c
    echo
done
This script outputs, for each column of a given file, the count of every unique value in it. It assumes that the first line of the given file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter to this script.
Code
#!/bin/bash
awk '
# first line: remember the column names
(NR==1){
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
# remaining lines: count each value per column
# (arrays of arrays need GNU awk 4+)
(NR!=1){
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
END{
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

get unique combination of values of two columns

I can get the unique values from a column using the command below:
cut -d',' -f3 file.txt | uniq -c
This gives me the unique values in field 3.
But if I want to get the unique combinations of two fields, how can I do that?
input
A,B,C
B,C,D
D,B,C
H,C,D
K,C,D
output
2 B,C
3 C,D
You can specify a range of fields using -f 2-3 or -f 2,3:
cut -d',' -f2-3 file.txt | sort | uniq -c
uniq does not detect repeated lines unless they are adjacent, so the input should be sorted before running uniq.
Output
2 B,C
3 C,D
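To see why the sort matters: without it, uniq -c only merges adjacent duplicates, so the same input would give fragmented counts:
cut -d',' -f2-3 file.txt | uniq -c
1 B,C
1 C,D
1 B,C
2 C,D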
Another option you may find provides greater flexibility in processing the input is awk. You can use a concatenation of the fields at issue as an index into an array to count the occurrences of each unique combination of fields, and then output the results using the END rule, e.g.
awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' file
Example Use/Output
With your example file as input you would have:
$ awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' input
3 C,D
2 B,C
awk arrays are associative rather than indexed, but you can preserve the order of appearance using an additional array if needed, as sketched below. Or you can simply pipe the output to sort for whatever order you like.
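For instance, a minimal sketch of that order-preserving variant, recording each key in a second array the first time it appears:
awk -F, '{k=$2","$3; if (!(k in a)) order[++n]=k; a[k]++}
         END{for (j=1; j<=n; j++) print a[order[j]], order[j]}' file
2 B,C
3 C,D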

AWK : To print data of a file in sorted order of result obtained from columns

I have an input file that looks somewhat like this:
PlayerId,Name,Score1,Score2
1,A,40,20
2,B,30,10
3,C,25,28
I want to write an awk command that finds players whose scores sum to more than 50 and outputs their PlayerId and Name, sorted by their total score.
When I try the following:
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k5
It does not work and seemingly sorts them on the basis of their ids.
1 A
3 C
Whereas the correct output I'm expecting is (since player A has a score sum of 60, C has 53, and we want the output sorted in ascending order):
3 C
1 A
In addition to this, what confuses me a bit is that when I try to sort on the basis of score1 (column 3) but intend to print only the corresponding ids and names, it doesn't work either.
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50) print $1,$2}' | sort -k3
And it outputs:
1 A
3 C
But if the $3 that the data is being sorted on is included in the print,
awk 'BEGIN{FS=",";}{$5=$3+$4;if($5>50)print $1,$2,$3}' | sort -k3
it produces the correct output (but includes the unwanted score1 field in the display):
3 C 25
1 A 40
But what if one wants to print only the id and name fields?
Actually, I'm new to awk commands, and probably I'm not using the sort command correctly. It would be really helpful if someone could explain.
I think this is what you're trying to do:
$ awk 'BEGIN{FS=","} {sum=$3+$4} sum>50{print sum,$1,$2}' file |
sort -k1,1n | cut -d' ' -f2-
3 C
1 A
You have to print the sum so you can sort by it (sort can only sort on fields that are actually present in its input, which is why your sort -k3 attempts behaved unexpectedly), and then the cut removes it.
If you wanted the header output too then it'd be (the leading (NR>1) flag is 0 only for the header line, so the numeric sort keeps the header first, and the cut strips both the flag and the sum):
$ awk 'BEGIN{FS=","} {sum=$3+$4} (NR==1) || (sum>50){print (NR>1),sum,$1,$2}' file |
sort -k1,2n | cut -d' ' -f3-
PlayerId Name
3 C
1 A
If you outsource the sorting, you need to carry the auxiliary value along and cut it out later; some of the complication is due to preserving the header.
$ awk -F, 'NR==1 {print s "\t" $1 FS $2; next}
(s=$3+$4)>50 {print s "\t" $1 FS $2 | "sort -n" }' file | cut -f2
PlayerId,Name
3,C
1,A
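If you'd rather keep everything inside awk, GNU awk can also sort the array traversal itself via PROCINFO["sorted_in"]. A sketch (rows with equal sums get collected under one key):
$ gawk -F, 'NR>1 && (s=$3+$4)>50 { rows[s] = (s in rows ? rows[s] ORS : "") $1 " " $2 }
      END { PROCINFO["sorted_in"] = "@ind_num_asc"; for (s in rows) print rows[s] }' file
3 C
1 A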

awk to do group by sum of column

I have this csv file and I am trying to write a shell script to calculate the sum of a column after doing a group by on it. The column number is 11 (STATUS).
My script is
awk -F, 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' $f > $parentdir/outputfile.csv;
The expected file output is
COMMITTED 2
but the actual output is just 2.
It prints only the count and not the group-by key. If I delete the other columns and run the same query it works fine, but not with the sample data below.
FILE NAME;SEQUENCE NR;TRANSACTION ID;RUN NUMBER;START EDITCREATION;END EDITCREATION;END COMMIT;EDIT DURATION;COMMIT DURATION;HAS DEPENDENCY;STATUS;DETAILS
Buldhana_Refinesource_FG_IW_ETS_000001.xml;1;4a032127-b20d-4fa8-9f4d-7f2999c0c08f;1;20180831130210345;20180831130429638;20180831130722406;140;173;false;COMMITTED;
Buldhana_Refinesource_FG_IW_ETS_000001.xml;2;e4043fc0-3b0a-46ec-b409-748f98ce98ad;1;20180831130722724;20180831130947144;20180831131216693;145;150;false;COMMITTED;
Change the FS to ; in your script:
awk -F';' 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' file
COMMITTED 2
You're using the wrong field separator: with -F, the whole semicolon-separated line is one field, so $11 is empty and everything is counted under a single empty key. Use
awk -F\;
The ; must be escaped (or quoted) so the shell doesn't interpret it. Apart from this, your approach seems OK.
Besides awk, you may also use
tail -n +2 $f | cut -f11 -d\; | sort | uniq -c
or
datamash --header-in -t \; -g 11 count 11 < $f
to do the same thing.
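If you eventually need an actual sum per group rather than a count (as the question title suggests), the same array pattern works. A sketch, assuming for example you wanted EDIT DURATION (field 8) summed per STATUS:
awk -F';' 'NR>1{sum[$11]+=$8} END{for (s in sum) print s, sum[s]}' file
COMMITTED 285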

awk loop over all fields in one file

This statement gives me the count of unique values in column 1:
awk -F ',' '{print $1}' infile1.csv | sort | uniq -c | sort -nr > outfile1.csv
It does what I expected (gives the count (left) of unique values (right) in the column):
117 5
58 0
18 4
14 3
11 1
9 2
However, now I want to create a loop, so it will go through all columns.
I tried:
for i in {1..10}
do
awk -F ',' '{print $$i}' infile.csv | sort | uniq -c | sort -nr > outfile$i.csv
done
This does not do the job (it does produce a file, but with much more data). I think that a variable in a print statement, as I tried with print $$i, is not something that works in general, since I haven't come across it so far.
I also tried this:
awk -F ',' '{for(i=1;i<=NF;i++) infile.csv | sort | uniq -c | sort -nr}' > outfile$i.csv
But this does not give any result at all (it produces syntax errors around infile and the sort command). I am sure I am using the for statement the wrong way.
Ideally, I would like the code to find the count of unique values for each column and print them all in the same output file. However, I would already be very happy with a well-functioning loop.
Please let me know if this explanation is not good enough, I will do my best to clarify.
Any time you write a loop in shell just to manipulate text, you have the wrong approach. Just do it in one awk command, something like this using GNU awk for 2D arrays and sorted_in (untested since you didn't provide any sample input):
awk -F, '
BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }
{ for (i=1; i<=NF; i++) cnt[i][$i]++ }
END {
for (i=1; i<=NF; i++)
for (val in cnt[i])
print val, cnt[i][val] > ("outfile" i ".csv")
}
' infile.csv
No need for half a dozen different commands, pipes, etc.
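If you'd rather have all the counts in a single output file, as mentioned at the end of the question, a variant of the same idea (again GNU awk, and again just a sketch):
awk -F, '
BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }   # iterate each count array in descending order
{ for (i=1; i<=NF; i++) cnt[i][$i]++; if (NF > maxnf) maxnf = NF }
END {
    for (i=1; i<=maxnf; i++) {
        print "Column " i " ::"
        for (val in cnt[i])
            print cnt[i][val], val
        print ""
    }
}
' infile.csv > outfile.csv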
You want to loop through the columns and perform the same command on each one of them. So what you are doing is fine: pass the column number to awk. However, you need to pass the value differently, so that it is an awk variable (inside single quotes the shell does not expand $i, so awk saw a literal $$i; with i uninitialized, $i is $0, and $($0) is again $0 for lines that don't start with a number, hence the whole lines in your output):
for i in {1..10}
do
awk -F ',' -v col=$i '{print $col}' infile.csv | sort | uniq -c | sort -nr > outfile$i.csv
^^^^^^^^^^^^^^^^^^^^^^^^
done
