Bash count occurrences based on parameters - bash
I'm new to the bash shell and I have to write a script that works with a CSV file.
The file is a list of participants, their countries, sports and medals achieved.
When executing the script, I should pass the nationality (column 3) and the sport (column 8) as parameters. The script should return the number of participants from that country in that sport, and the number of medals they achieved.
The number of medals is the sum of the "gold", "silver" and "bronze" columns (columns 9, 10 and 11) of each matching row.
I cannot use grep, awk, sed or csvkit.
So far I have this code, but I'm stuck on the medal-counting part.
nacionality=$1
sport=$2
columns= cut -d, -f 3,8 athletes.csv
echo columns | tr -cd $nacionality,$sport | wc -c
Could anyone help me?
The file is: https://github.com/flother/rio2016/blob/master/athletes.csv
The script is named script2_4.sh.
An example of the output is:
./script2_4.sh POL rowing
Participants, Medals
26, 6
A sample of the file:
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,
382571888,Aaron Younger,AUS,male,1991-09-25,1.93,100,aquatics,0,0,0,
87689776,Aauri Lorena Bokesa,ESP,female,1988-12-14,1.80,62,athletics,0,0,0,
997877719,Ababel Yeshaneh,ETH,female,1991-07-22,1.65,54,athletics,0,0,0,
343694681,Abadi Hadis,ETH,male,1997-11-06,1.70,63,athletics,0,0,0,
591319906,Abbas Abubakar Abbas,BRN,male,1996-05-17,1.75,66,athletics,0,0,0,
258556239,Abbas Qali,IOA,male,1992-10-11,,,aquatics,0,0,0,
376068084,Abbey D'Agostino,USA,female,1992-05-25,1.61,49,athletics,0,0,0,
162792594,Abbey Weitzeil,USA,female,1996-12-03,1.78,68,aquatics,1,1,0,
521036704,Abbie Brown,GBR,female,1996-04-10,1.76,71,rugby sevens,0,0,0,
149397772,Abbos Rakhmonov,UZB,male,1998-07-07,1.61,57,wrestling,0,0,0,
256673338,Abbubaker Mobara,RSA,male,1994-02-18,1.75,64,football,0,0,0,
337369662,Abby Erceg,NZL,female,1989-11-20,1.75,68,football,0,0,0,
334169879,Abd Elhalim Mohamed Abou,EGY,male,1989-06-03,2.10,88,volleyball,0,0,0,
215053268,Abdalaati Iguider,MAR,male,1987-03-25,1.73,57,athletics,0,0,0,
763711985,Abdalelah Haroun,QAT,male,1997-01-01,1.85,80,athletics,0,0,0,
Here is a pure bash implementation. Build a hash from field name to position ($h):
#!/bin/bash
file=athletes.csv
nationality=$1
sport=$2
IFS=, read -a l < "$file"    # read the header line to get the field names
declare -A h
for pos in "${!l[@]}"
do
h["${l[$pos]}"]=$pos
done
declare -i participants=0
declare -i medals=0
while IFS=, read -a l
do
if [ "${l[${h["nationality"]}]}" = "$nationality" ] &&
[ "${l[${h["sport"]}]}" = "$sport" ]
then
((participants++))
medals=$((
$medals +
"${l[${h["gold"]}]}" +
"${l[${h["silver"]}]}" +
"${l[${h["bronze"]}]}"
))
fi
done < "$file"
echo "Participants, Medals"
echo "$participants, $medals"
And example output with the first 4 data lines of the sample as input:
$ ./script2_4.sh CAN athletics
Participants, Medals
1, 1
Related
Average marks from list
Sorry if I don't write well; it's my first post. I have a list in one file with the name, id, marks, etc. of students (see below), and I want to calculate the average mark and write it to another file, but I don't know how to take only the marks and compute the average. Thanks.

#name surname student_index_number course_group_id lecturer_id list_of_marks
athos musketeer 1 1 1 3,4,5,3.5
porthos musketeer 2 1 1 2,5,3.5
aramis musketeer 3 2 2 2,1,4,5

What I have so far:

while read line; do
    echo "$line" | cut -f 6 -d ' '
done < main_list
awk 'NR>1{n=split($NF,a,",");for(i=1;i<=n;i++){s+=a[i]} ;print $1,s/n;s=0}' input
athos 3.875
porthos 3.5
aramis 3

For all lines except the header (NR>1 filters out the header), pick up the last column and split it into the individual numbers on the commas. Using a for loop, sum all the marks and then divide by the total number of subjects.
Something like this (untested):

awk '{ n = split($6, a, ","); total=0; for (v in a) total += a[v]; print total / n }' main_list
In bash (with bc for the floating-point arithmetic), could you please try the following:

while read first second third fourth fifth sixth
do
    if [[ "$first" =~ (^#) ]]
    then
        continue
    fi
    count="${sixth//[^,]}"
    val=$(echo "(${#count}+1)" | bc)
    echo "scale=2; (${sixth//,/+})/$val" | bc
done < "Input_file"
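To write the averages into another file, as the question asks, the output of any of these loops can simply be redirected. A minimal bash-plus-bc sketch along the lines of the answers above (the output name averages.txt is just an example):

#!/bin/bash
# compute each student's average mark and write the results to another file
while read -r name surname index group lecturer marks
do
    [[ $name == \#* ]] && continue       # skip the header line
    IFS=, read -r -a m <<< "$marks"      # split the comma-separated marks
    sum=$(IFS=+; echo "${m[*]}")         # build an expression like 3+4+5+3.5
    echo "$name $surname $(echo "scale=3; ($sum)/${#m[@]}" | bc)"
done < main_list > averages.txt          # averages.txt is an example name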
while loops in parallel with input from split file
I am stuck on this. I have a while-read loop in my code that is taking very long, and I would like to run it on many processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is, I don't know how to tell each while loop which file to work on.

For example, in a regular while-read loop I would write:

while read line
do
<some code>
done < input_file

But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file. I tried:

split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done

and also tried:

split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop

But neither way could I tell parallel which file should be the input to each loop, so that it would understand "take each of those xa* files and give each one to its own loop in parallel".

An example of the input file (a list of strings):

AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580

EDIT: This is the code. The output is a table (741100 lines) with some statistics about DNA sequence alignments that were already made. The loop takes an input_file (no broken lines, varying from 500 to ~45000 lines, 800 Kb) with DNA sequence accessions, reads it line by line, and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):

Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%

$ head input_file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580

Two additional files are also used inside the loop; they are not the loop input.

1) A file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also made outside this loop, beforehand. It can vary from hundreds to thousands of lines. Only columns 1 and 2 matter here: column 1 is the name of the sequence and column 2 is the name of each sequence it matched in the databank. It looks like this:

$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0

2) A databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences, one taxonomy per line:

$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus

So, given sequence KF625180.1.1799: I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has the accessions of all the sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
    #find hits
    hits=$(grep $line alnout_file | cut -f 2)
    completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
    printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
    printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"

    #find hits and calculate the frequence (%) of the taxonomy in the alignment output
    # ex.: Bacillus_subtilis 33
    freqHits=$(grep "${hits[@]}" $TAXONOMY | \
        cut -f 2 | \
        awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
        sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
        sort -k2 -hr)

    # print frequence of each taxonomy in the databank
    freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))

    #print cols with taxonomy and calculations
    paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file

It is a lot of greps and parsing, so it takes about ~12 h on one processor to do it for all the 45000 DNA sequence accessions. So I would like to split input_file and do it on all the processors I have (14), because that would reduce the time spent. Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):

export TAXONOMY=path/taxonomy.file.tsv
doit() {
    while read line
    do
        #find hits
        hits=$(grep $line alnout_file | cut -f 2)
        completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
        printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
        printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"

        #find hits and calculate the frequence (%) of the taxonomy in the alignment output
        # ex.: Bacillus_subtilis 33
        freqHits=$(grep "${hits[@]}" $TAXONOMY | \
            cut -f 2 | \
            awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
            sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
            sort -k2 -hr)

        # print frequence of each taxonomy in the databank
        freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))

        #print cols with taxonomy and calculations
        paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
    done
}
export -f doit
parallel -a input_file --pipepart doit

This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, and run ncpu jobs in parallel. That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python I expect you will see a major speedup.
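To illustrate that "too many programs" point without leaving bash: the taxonomy databank can be loaded into an associative array once, so the per-sequence grep/sed over the ~45000-line file disappears. This is only a sketch of the lookup step, assuming the tab-separated "accession<TAB>taxonomy" layout shown in the question; the taxonomy trimming and the rest of the statistics would still have to be filled in:

#!/bin/bash
declare -A taxonomy
# load the whole databank once: accession -> full taxonomy string
while IFS=$'\t' read -r acc tax
do
    taxonomy[$acc]=$tax
done < path/taxonomy.file.tsv

while read -r line
do
    completename=${taxonomy[$line]}    # in-memory lookup, no grep/sed per sequence
    # ... per-sequence statistics would go here ...
done < input_file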
As an alternative, I threw together a quick test.

#! /bin/env bash
mkfifo PIPELINE                      # create a single queue
cat "$1" > PIPELINE &                # supply it with records
{ declare -i cnt=0 max=14
  while (( ++cnt <= max ))           # spawn loop creates worker jobs
  do printf -v fn "%02d" $cnt
     while read -r line              # each work loop reads common stdin...
     do echo "$fn:[$line]"
        sleep 1
     done >$fn.log 2>&1 &            # these run in background in parallel
  done                               # this one exits
} < PIPELINE                         # *all* read from the same queue
wait
cat [0-9][0-9].log

Doesn't need split, but does need a mkfifo. Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.

So, let's make a million-line file and split it into 14 parts:

seq 1000000 > 1M
split -n 14 1M part-

That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):

#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
   # Pick up parameters
   job=$1
   file=$2
   # Count lines in specified file
   lines=$(wc -l < "$file")
   echo "Job No: $job, file: $file, lines: $lines"
}

# Make the function above known to processes spawned by GNU Parallel
export -f DoOne

# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??

Sample Output

Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297

You would normally omit the -k argument to GNU Parallel - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:

#!/bin/bash
for a in {A..Z} {0..9} ; do
   for b in {A..Z} {0..9} ; do
      for c in {A..Z} {0..9} ; do
         echo "${a}${b}${c}"
      done
   done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b

Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits, and if I nest 3 loops of that I get 36^3 or 46,656 lines, which roughly matches your file sizes.

File a now looks like this:

AAA
AAB
AAC
AAD
AAE
AAF

File b looks like this:

UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN

Now I want to loop through a, finding the corresponding line in b. First, I use your approach:

time while read thing ; do grep $thing b > /dev/null ; done < a

That takes 9 mins 35 seconds. If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want:

time while read thing ; do grep -m1 $thing b > /dev/null ; done < a

That improves the time down to 4 mins 30 seconds. If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:

time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null

it now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same sort of time and also provide more expressive facilities for the maths in the middle of your loop.
I hope this small script helps you out:

function process {
    while read line; do
        echo "$line"
    done < $1
}

function loop {
    file=$1
    chunks=$2
    dir=`mktemp -d`
    cd $dir
    split -n l/$chunks $file
    for i in *; do
        process "$i" &
    done
    rm -rf $dir
}

loop /tmp/foo 14

It runs the process loop on the specified file, split into the specified number of chunks (without splitting lines), in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead I use native bash, spawning the processes with &:

function loop () {
  while IFS= read -r -d $'\n'
  do
    # YOUR BIG STUFF
  done < "${1}"
}

arr_files=(./xa*)

for i in "${arr_files[@]}"
do
  loop "${i}" &
done
wait
Using either ksh or bash, how to do this in a performance-effective way
112345D000000000000129
123456D000000000000129
112345C000000000000129
123456C000000000000129
123456C000000000000126

Positions 2-6 are the account number; positions 7-22 are the debit or credit value, based on the D or C in the 7th position. I want to sum the credit and debit values on a per-account basis. I tried:

awk '{array[substr($0,7,1)]+=substr($0,8,15)+0} END{for(i in array){print array[i]}}'

but since the file is huge it takes a lot of time. Is there a way to do this faster?

MVCE: fileA contains the account number + other info, fileB contains the example above with debit/credit.

typeset -i stbal2
typeset -i endbal2
DONE=false
until $DONE; do
  read s || DONE=true
  accountnumber=${s:1:10}  # account number
  endbal=${s:26:1}         # contains + or - sign
  endbal1=${s:11:15}       # balance
  endbal2=$endbal1         # strip off leading zeros
  endbal3=$endbal$endbal2  # concatenate the sign with the balance
  # similar process as above to get the start balance
  stbal=${s:42:1}
  stbal1=${s:27:15}
  stbal2=$stbal1
  stbal3=$stbal$stbal2
  creditdebit="$(grep "${bban}" ${fileB} | awk '{array[substr($0,7,1)]+=substr($0,8,15)+0} END{for(i in array){print array[i]}}')"
  set -- $creditdebit
  ... further logic
done < ${fileA}
Without a complete MCVE it's a guess, but this might be what you're looking for, using GNU awk for true 2D arrays:

$ awk '
{ tots[substr($0,7,1)][substr($0,2,5)] += substr($0,8) }
END {
    for (type in tots) {
        for (id in tots[type]) {
            print type, id, tots[type][id]+0
        }
    }
}
' file
C 12345 129
C 23456 255
D 12345 129
D 23456 129
In ksh you might do something like this (myfile.txt being your information file):

#!/bin/ksh
typeset -A Ledger
typeset -i amount
typeset -L10 Col1
typeset -L10 Col2

while read line
do
  account=${line:1:5}
  action=${line:6:1}
  amount=${line:7:21}
  if [[ $action == "C" ]]; then
    Ledger[$account]=$(( ${Ledger[$account]} + $amount ))
  elif [[ $action == "D" ]]; then
    Ledger[$account]=$(( ${Ledger[$account]} - $amount ))
  fi
done < myfile.txt

Col1="Account"
Col2="Balance"
print "$Col1$Col2\n"
for i in ${!Ledger[@]}; do
  Col1=$i
  Col2=${Ledger[$i]}
  print "$Col1$Col2"
done

With your example my output is:

Account   Balance

12345     0
23456     126

Hope it could help.
How to sum a row of numbers from a text file -- Bash Shell Scripting
I'm trying to write a bash script that calculates the average of numbers by rows and columns. An example of a text file that I'm reading in is:

1 2 3 4 5
4 6 7 8 0

There is an unknown number of rows and an unknown number of columns. Currently, I'm just trying to sum each row with a while loop. The desired output is:

1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25

and so on and so forth for each row. Currently this is the code I have:

while read i
do
echo "num: $i"
(( sum=$sum+$i ))
echo "sum: $sum"
done < $2

To call the program it's stats -r test_file ("-r" indicates rows -- I haven't started columns quite yet). My current code actually just takes the first number of each column and adds them together, and then the rest of the numbers error out as a syntax error. It says the error comes from line 16, which is the (( sum=$sum+$i )) line, but I honestly can't figure out what the problem is. I should tell you I'm extremely new to bash scripting and I have googled and searched high and low for the answer and can't find it. Any help is greatly appreciated.
You are reading the file line by line, and summing a whole line is not an arithmetic operation. Try this:

while read i
do
    sum=0
    for num in $i
    do
        sum=$(($sum + $num))
    done
    echo "$i Sum: $sum"
done < $2

Just split each line into its individual numbers using a for loop. I hope this helps.
Another non-bash way (con: OP asked for bash, pro: does not depend on bashisms, works with floats):

awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}'
Another way (not pure bash):

while read line
do
    sum=$(sed 's/[ ]\+/+/g' <<< "$line" | bc -q)
    echo "$line Sum = $sum"
done < filename
Using the numsum -r util covers the row addition, but the output format needs a little glue, by inefficiently paste-ing a few utils:

paste "$2" \
      <(yes "Sum =" | head -$(wc -l < "$2") ) \
      <(numsum -r "$2")

Output:

1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25

Note -- to run the above line on a given file foo, first initialize $2 like so:

set -- "" foo
paste "$2" <(yes "Sum =" | head -$(wc -l < "$2") ) <(numsum -r "$2")
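For the column part that the question mentions but has not started yet, a similar pure-bash sketch keeps one running total per column index. It assumes whole numbers (bash arithmetic is integer-only) and that every row has the same number of columns; as in the question's code, the file name is expected in $2:

declare -a colsum
while read -r -a nums
do
    for i in "${!nums[@]}"
    do
        (( colsum[i] += nums[i] ))    # unset totals start from 0
    done
done < "$2"
echo "Column sums: ${colsum[*]}"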
Sorting and printing a file in bash UNIX
I have a file with a bunch of paths that look like so:

7 /usr/file1564
7 /usr/file2212
6 /usr/file3542

I am trying to use sort to pull out and print the path(s) with the most occurrences. Here is what I have so far:

cat temp | sort | uniq -c | sort -rk1 > temp

I am unsure how to print only the highest occurrences. I also want my output to be printed like this:

7 1564
7 2212

7 being the total number of occurrences and the other numbers being the file numbers at the end of the name. I am rather new to bash scripting so any help would be greatly appreciated!
To emit only the first line of output (with the highest number, since you're doing a reverse numeric sort immediately prior), pipe through head -n1.

To remove all content which is not either a number or whitespace, pipe through tr -cd '0-9[:space:]'.

To filter for only the values with the highest number, allowing there to be more than one:

{
    read firstnum name && printf '%s\t%s\n' "$firstnum" "$name"
    while read -r num name; do
        [[ $num = $firstnum ]] || break
        printf '%s\t%s\n' "$num" "$name"
    done
} < temp
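Since the counts are already the first field of each line in the sample, a sketch that combines the first two suggestions on that file directly would be (untested; note the sort is made numeric here):

sort -rnk1 temp | head -n1 | tr -cd '0-9[:space:]'

This prints only a single top line; the read loop above handles several paths tied for the highest count, and its output can be piped through the same tr command to keep only the numbers.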
If you want to avoid sort and you are allowed to use awk, then you can do this:

awk '{ if($1>maxcnt) {s=$1" "substr($2,10,4); maxcnt=$1} else if($1==maxcnt) {s=s "\n"$1" "substr($2,10,4)}} END{print s}' \
    temp