Most repeated string based on two argument strings in a CSV file (bash)

First exercise with bash, and it is taking a lot of time...
I'm trying to create a script that, given 2 arguments (height, weight), searches athletes.csv and returns the number of rows matching both values, plus the predominant nationality among those matches. And if that were not enough: if two countries are tied for predominance, echo the one with the lowest id.
Also, I can't use awk, grep, sed or csvkit.
Here is csv header:
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,
Until now:
count=0
while IFS=, read -a id _ nation _ _ height weight _ _ _ _; do
if (( $height == "$2" )) && (( "$weight" == $3 )) ; then
((count++))
fi
done < athletes.csv
echo "$count"
I have seen a similar problem, but can't find a way to return the most common nationality (a string).
Looking for something similar to:
Count, Predominant_nationality 1.85 130
8460, BRA
Should I try to work the whole exercise with arrays instead of loops? Probably I could do it with indexes, but it looks like arrays are only one-dimensional here?
Any help would be a blessing.

It's a sorting-and-counting problem that can be solved with standard Linux text utilities:
csv='athletes.csv'
crit='1\.85,90'
echo "Count Predominant_nationality $crit"
# Get fields from csv and sort on filtered fields 2,3
cut -d ',' -f 1,3,6,7 "$csv" | grep "$crit" | sort -t ',' -k2,3 | tr ',' ' ' | \
# Count unique skipping first field, get first
uniq -f 1 -c | sort -n -k1,1nr -k2n | head -n1 | tr -s ' ' | \
# print result
cut -d ' ' -f 2,4 --output-delimiter=' '
Result
Count Predominant_nationality 1.85,90
2 BRA

A few issues with current code:
read -a says to read values into an array but what you really want is to read the values into individual variables
read -r is typical in this type of scenario (-r disables backslashes as escapes)
the if (( ... )) constructs are typically used for integer comparisons, and since heights are non-integer (eg, 1.85) it's probably best to stick with string comparisons (especially since we're only interested in equality matches)
Setup; instead of downloading the link/data file I'll add 4x bogus lines to OP's sample input, making sure all 4x lines match OP's sample search parameters (1.85 and 130):
$ cat athletes.csv
id,name,nationality,sex,date_of_birth,height,weight,sport,gold,silver,bronze,info
736041664,A Jesus Garcia,ESP,male,1969-10-17,1.72,64,athletics,0,0,0,
532037425,A Lam Shin,KOR,female,1986-09-23,1.68,56,fencing,0,0,0,
435962603,Aaron Brown,CAN,male,1992-05-27,1.98,79,athletics,0,0,1,
521041435,Aaron Cook,MDA,male,1991-01-02,1.83,80,taekwondo,0,0,0,
33922579,Aaron Gate,NZL,male,1990-11-26,1.81,71,cycling,0,0,0,
173071782,Aaron Royle,AUS,male,1990-01-26,1.80,67,triathlon,0,0,0,
266237702,Aaron Russell,USA,male,1993-06-04,2.05,98,volleyball,0,0,1,
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
One bash idea:
arg1="1.85"
arg2="130"
maxid=99999999999
unset counts ids maxcount
declare -A counts ids
maxcount=0
while IFS=, read -r id _ nation _ _ height weight _
do
if [[ "${height}" == "${arg1}" && "${weight}" == "${arg2}" ]]
then
(( counts[${nation}]++ ))
# keep track of overall max count
[[ "${counts[${nation}]}" -gt "${maxcount}" ]] && maxcount="${counts[${nation}]}"
# keep track of min(id) for each nation
[[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
fi
done < athletes.csv
Alternatively, since it looks like our search patterns are together and can only occur in one location within a line, we can use grep to filter out only the matching rows:
$ grep ",${arg1},${arg2}," athletes.csv
134,Aaron XX1,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
127,Aaron XX2,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
34,Aaron XX3,USA,male,1993-06-04,1.85,130,volleyball,0,0,1,
27,Aaron XX4,CAD,male,1993-06-04,1.85,130,volleyball,0,0,1,
We can then feed this result to the while/read loop and eliminate the need to test the height/weight variables, eg:
while IFS=, read -r id _ nation _
do
(( counts[${nation}]++ ))
[[ "${counts[${nation}]}" -gt "${maxcount}" ]] && maxcount="${counts[${nation}]}"
[[ "${id}" -lt "${ids[${nation}]:-${maxid}}" ]] && ids[${nation}]="${id}"
done < <(grep ",${arg1},${arg2}," athletes.csv)
At this point both of these while/read loops produce:
$ typeset -p counts ids maxcount
declare -A counts=([USA]="2" [CAD]="2" )
declare -A ids=([USA]="34" [CAD]="27" )
declare -- maxcount="2"
From here OP can loop through the list of nations ("${!counts[@]}") looking for counts equal to maxcount, and when found, apply an additional check to see if the nation has the lowest id (ids[]) seen so far in the loop. At the end of the loop OP should have the country whose a) count equals maxcount and b) id is lowest.
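A minimal sketch of that final selection step (the counts/ids values are hard-coded from the typeset output above; variable names are illustrative):

```shell
#!/bin/bash
# Hypothetical final step: pick the nation whose count equals maxcount,
# breaking ties by the lowest id seen for that nation.
declare -A counts=([USA]=2 [CAD]=2)
declare -A ids=([USA]=34 [CAD]=27)
maxcount=2
best_nation='' best_id=99999999999
for nation in "${!counts[@]}"; do
    if [[ "${counts[$nation]}" -eq "$maxcount" && "${ids[$nation]}" -lt "$best_id" ]]; then
        best_nation=$nation
        best_id=${ids[$nation]}
    fi
done
echo "${maxcount}, ${best_nation}"   # 2, CAD
```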

Related

How to count the number of differing letters between two strings in bash?

a="ABCDEFG"
b="ABCDXYG"
How can I count the number of positions at which these two strings differ in bash?
In this case the answer is 2 (E != X and F != Y).
As far as I understand, you want the number of different letters in the same position in both strings.
So:
Insert a newline after each character in both strings (so we can process them line by line)
Paste the first string (with newlines) into one column, the second one into another
Print only lines which have different columns
Count the lines.
paste <(<<<"$a" sed 's/./&\n/g') <(<<<"$b" sed 's/./&\n/g') |
awk '$1 != $2' |
wc -l
And just for fun a pure bash solution:
declare -i cnt
cnt=0
while
IFS= read -r -n1 -u3 c1 &&
IFS= read -r -n1 -u4 c2
do
if [ "$c1" != "$c2" ]; then
cnt=cnt+1
fi
done 3<<<"$a" 4<<<"$b"
echo "$cnt"
This is Shellcheck-clean pure Bash code, with no subprocesses and no I/O:
#! /bin/bash
a=ABCDEFG
b=ABCDXYG
declare -i diffcount=0
(( ${#a} < ${#b} )) && maxlen=${#b} || maxlen=${#a}
for ((i=0; i<maxlen; i++)) ; do
[[ ${a:i:1} != "${b:i:1}" ]] && diffcount+=1
done
echo $diffcount
maxlen is the maximum of the lengths of the strings. If one of the strings is longer than the other then each character past the length of the short string in the long string is counted as a difference. The code will need to be modified if you want a different behaviour.
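For example, here is the unequal-length behaviour described above (a small sketch reusing the loop from the answer):

```shell
#!/bin/bash
# Sketch: with strings of unequal length, every position past the end of
# the shorter string counts as a difference (${a:i:1} expands to empty).
a=ABC
b=ABCDE
declare -i diffcount=0
(( ${#a} < ${#b} )) && maxlen=${#b} || maxlen=${#a}
for ((i=0; i<maxlen; i++)); do
    [[ ${a:i:1} != "${b:i:1}" ]] && diffcount+=1
done
echo "$diffcount"   # 2 (positions 3 and 4 differ)
```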

printing line numbers that are multiple of 5

Hi, I am trying to print/echo the lines whose line numbers are multiples of 5. I am doing this in a shell script. I am getting errors and am unable to proceed. Below is the script:
#!/bin/bash
x=0
y=$wc -l $1
while [ $x -le $y ]
do
sed -n `$x`p $1
x=$(( $x + 5 ))
done
When executing above script i get below errors
#./echo5.sh sample.h
./echo5.sh: line 3: -l: command not found
./echo5.sh: line 4: [: 0: unary operator expected
Please help me with this issue.
For efficiency, you don't want to be invoking sed multiple times on your file just to select a particular line. You want to read through the file once, filtering out the lines you don't want.
#!/bin/bash
i=0
while IFS= read -r line; do
(( ++i % 5 == 0 )) && echo "$line"
done < "$1"
Demo:
$ i=0; while read line; do (( ++i % 5 == 0 )) && echo "$line"; done < <(seq 42)
5
10
15
20
25
30
35
40
A funny pure Bash possibility:
#!/bin/bash
mapfile ary < "$1"
printf "%.0s%.0s%.0s%.0s%s" "${ary[@]}"
This slurps the file into an array ary, with each line of the file in a field of the array. Then printf takes care of printing one line out of every 5: %.0s consumes a field but prints nothing, and %s prints the field. Since mapfile is used without the -t option, the newlines are included in the array. Of course this really slurps the file into memory, so it might not be good for huge files. For large files you can use a callback with mapfile:
#!/bin/bash
callback() {
printf '%s' "$2"
ary=()
}
mapfile -c 5 -C callback ary < "$1"
We're removing all the elements of the array during the callback, so that the array doesn't grow too large, and the printing is done on the fly, as the file is read.
Another funny possibility, in the spirit of glenn jackmann's solution, yet without a counter (and still pure Bash):
#!/bin/bash
while read && read && read && read && IFS= read -r line; do
printf '%s\n' "$line"
done < "$1"
Use sed.
sed -n '0~5p' "$1"
This prints every fifth line of the file (lines 5, 10, 15, ...), using GNU sed's first~step addressing.
Also
y=$wc -l $1
won't work; you need
y=$(wc -l < "$1")
That is command substitution: without it, bash sees the space as the end of the assignment. Also, if you just want the number, it's best to redirect the file into wc so the file name isn't printed.
Don't know what you were trying to do with this?
x=$(( $x + 5 ))
Guessing you were trying to use let, so I'd suggest looking up the syntax for that command. It would look more like
(( x = x + 5 ))
Hope this helps
There are cleaner ways to do it, but what you're looking for is this.
#!/bin/bash
x=5
y=`wc -l $1`
y=`echo $y | cut -f1 -d\ `
while [ "$y" -gt "$x" ]
do
sed -n "${x}p" "$1"
x=$(( $x + 5 ))
done
Initialize x to 5, since there is no "line zero" in your file $1.
Also, wc -l $1 will display the number of line counts, followed by the name of the file. Use cut to strip the file name out and keep just the first word.
In Bash conditionals, an exit status of zero is treated as "true".
You should not have space between your $x and your p in your sed command. You can put them right next to each other using curly braces.
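A quick sketch of the wc -l / cut point above (the temp file here is only for illustration):

```shell
#!/bin/bash
# Sketch: wc -l FILE prints the count followed by the file name;
# an unquoted echo collapses whitespace, then cut keeps just the count.
f=$(mktemp)
printf 'a\nb\nc\n' > "$f"
wc -l "$f"                      # count plus file name
y=$(wc -l "$f")
y=$(echo $y | cut -f1 -d' ')
echo "$y"                       # 3
rm -f "$f"
```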
You can do this quite succinctly using awk:
awk 'NR % 5 == 0' "$1"
NR is the record number (line number in this case). Whenever it is a multiple of 5, the expression is true, so the line is printed.
You might also like the even shorter but slightly less readable:
awk '!(NR%5)' "$1"
which does the same thing.

Match first few letters of a file name : Shell script

I am trying to match the first few letters of a file name.
for entry in `ls`; do
echo $entry
done
With the above code I get the names of all the files.
I have a few files with similar name at the start:
Beaglebone-v1
Beaglebone-v3
Beaglebone-v2
How can I compare $entry with Beaglebone* and then extract the latest version file name?
If you want to loop over all Beaglebone-* files:
for entry in Beaglebone-* ; do
echo $entry
done
If you just need the file with the latest version, you can depend on the fact that ls sorts the names alphabetically, so you could just do:
LATEST_FILE_NAME=$(ls Beaglebone-* | tail -n 1)
which will just take the last one alphabetically.
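One caveat: plain alphabetical order breaks down once version numbers reach double digits (Beaglebone-v10 sorts before Beaglebone-v2). If GNU sort is available, its -V (version sort) option handles this; a sketch, with printf standing in for the actual file listing:

```shell
# Assumes GNU sort with the -V (version sort) option; printf stands in
# for the real file listing.
latest=$(printf '%s\n' Beaglebone-v1 Beaglebone-v10 Beaglebone-v2 Beaglebone-v3 | sort -V | tail -n 1)
echo "$latest"   # Beaglebone-v10
```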
To deal with larger numbers, you could use numeric comparison like this:
stem="Beaglebone-v"
for file in $stem*; do
ver=${file#"$stem"} # cut away stem to get version number
(( ver > max )) && max=$ver # conditionally assign `ver` to `max`
done
echo "$stem$max"
Testing it out:
bash-4.3$ ls Beaglebone-v*
Beaglebone-v1 Beaglebone-v10 Beaglebone-v2 Beaglebone-v3
bash-4.3$ stem="Beaglebone-v" &&
for file in $stem*
do
ver=${file#"$stem"}
(( ver > max )) && max=$ver
done; echo "$stem$max"
Beaglebone-v10
You can store the filenames matching the pattern in an array and then pick the last element of the array.
shopt -s nullglob
arr=( Beaglebone-* )
if (( ${#arr[@]} > 0 ))
then
latest="${arr[ ${#arr[@]} - 1 ]}"
echo "$latest"
fi
You need to enable nullglob so that if there are no files matching the pattern, you will get an empty array rather than the pattern itself.
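A small sketch of what nullglob changes (using a throwaway empty directory so no files match):

```shell
#!/bin/bash
# Sketch: an unmatched glob normally stays literal; with nullglob it
# expands to nothing, so the array comes out empty.
dir=$(mktemp -d)
cd "$dir" || exit 1
arr=( Beaglebone-* )           # no matching files: arr holds the literal pattern
echo "${#arr[@]}"              # 1
shopt -s nullglob
arr=( Beaglebone-* )           # now the unmatched glob vanishes
echo "${#arr[@]}"              # 0
cd / && rmdir "$dir"
```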
If version numbers can go beyond single digits,
function version_numbers {
typeset w; for w in $1-v*; do echo ${w#$1-v}; done
}
version_numbers "Beaglebone" | sort -n | tail -1
Or, adding function max:
# read a stream of numbers, from stdin (one per line)
# and return the largest value
function max
{
typeset _max n
read _max || return
while read n; do
((_max < n)) && _max=$n
done
echo $_max
}
We can now do the whole thing without external commands:
version_numbers Beaglebone | max
Note that max will fail horribly if any one line fails the numerical comparison.
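For instance, feeding max a stream of well-formed numbers (a self-contained sketch repeating the function from above):

```shell
#!/bin/bash
# Sketch: the max function from the answer, fed numbers on stdin.
max() {
    local _max n
    read -r _max || return
    while read -r n; do
        (( _max < n )) && _max=$n
    done
    echo "$_max"
}
result=$(printf '3\n10\n2\n' | max)
echo "$result"   # 10
```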

Unexpected behaviour of for

Script:
#!/bin/bash
IFS=','
i=0
for j in `cat database | head -n 1`; do
variables[$i]=$j
i=`expr $i + 1`
done
k=0
for l in `cat database | tail -n $(expr $(cat database | wc -l) - 1)`; do
echo -n $k
k=`expr $k + 1`
if [ $k -eq 3 ]; then
k=0
fi
done
Input file
a,b,c
d,e,f
g,e,f
Output
01201
Expected output
012012
The question is: why does the for loop skip the last echo? It is weird, because if I change $k to $l, echo runs 6 times.
Update:
@thom's analysis is correct. You can fix the problem by changing IFS=',' to IFS=$',\n'.
My original statements below may be of general interest, but do not address the specific problem.
If accidental shell expansions were a concern, here's how the loop could be rewritten (assuming it's practical to read everything into an array variable first):
IFS=$',\n' read -d '' -r -a fields < <(echo $'*,b,c\nd,e,f\ng,h,i')
for field in "${fields[@]}"; do
  # $field is '*' in 1st iteration, then 'b', 'c', 'd',...
done
Original statements:
Just a few general pointers:
You should use a while loop rather than for to read command output - see http://mywiki.wooledge.org/BashFAQ/001; the short of it: with for, the input lines are subject to various shell expansions.
A missing iteration typically stems from the last input line missing a terminating \n (or a separator as defined in $IFS). With a while loop, you can use the following approach to address this: while read -r line || [[ -n $line ]]; do …
For instance, your 2nd for loop could be rewritten as (using process substitution as input to avoid creating a subshell with a separate variable scope):
while read -r l || [[ -n $l ]]; do …; done < <(cat database | tail -n $(expr $(cat database | wc -l) - 1))
Finally, you could benefit from using modern bashisms: for instance,
k=`expr $k + 1`
could be rewritten much more succinctly as (( ++k )) (which will run faster, too).
Your code expects a comma after EVERY value (IFS contains only a comma), but you only give this:
a,b,c
d,e,f
g,e,f
instead of this:
a,b,c,
d,e,f,
g,e,f,
so it reads:
d,e,f'\n'g,e,f
where f'\n'g is a single value because the newline is not in IFS, and that is equal to 5 values, not 6
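Putting the pieces together, a sketch of the second loop with the IFS=$',\n' fix applied (a printf stands in for the tail of the database file):

```shell
#!/bin/bash
# Sketch: with IFS=$',\n' the words are split on commas AND newlines,
# so all 6 values are seen and k cycles through 0 1 2 twice.
IFS=$',\n'
k=0
out=''
for l in $(printf 'd,e,f\ng,e,f\n'); do
    out+=$k
    k=$(( (k + 1) % 3 ))
done
echo "$out"   # 012012
```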

Creating a bash script that acts as a fortune teller or "magic 8 ball" essentially

I am trying to create a bash script that is essentially like a magic 8 ball with 6 different responses (Yes, No, Maybe, Hard to tell, Unlikely, and Unknown). The key is that once a response is given, it should not be given again until all responses have been given.
Here is what I have so far:
#!/bin/bash
echo "Ask and you shall receive your fortune: "
n=$((RANDOM*6/32767))
while [`grep $n temp | wc awk '{print$3}'` -eq 0]; do
n=$((RANDOM*6/32767))
done
grep -v $n temp > temp2
mv temp2 temp
Basically I have the 6 responses all on different lines in the temp file, and I am trying to construct the loops so that once a response is given, it creates a new file without that response (temp2), then copies it back to temp. Then once the temp file is empty it will continue from the beginning.
I'm quite positive that my current inner loop is wrong, and that I need an outer loop, but I'm fairly new to this and I am stuck.
Any help will be greatly appreciated.
Try something like this:
#!/bin/bash
shuffle() {
local i tmp size max rand
# $RANDOM % (i+1) is biased because of the limited range of $RANDOM
# Compensate by using a range which is a multiple of the array size.
size=${#array[*]}
max=$(( 32768 / size * size ))
for ((i=size-1; i>0; i--)); do
while (( (rand=$RANDOM) >= max )); do :; done
rand=$(( rand % (i+1) ))
tmp=${array[i]} array[i]=${array[rand]} array[rand]=$tmp
done
}
array=( 'Yes' 'No' 'Maybe' 'Hard to tell' 'Unknown' 'Unlikely' )
shuffle
for var in "${array[@]}"
do
echo -n "Ask a question: "
read q
echo "${var}"
done
I wrote a script that follows your initial approach (using temp files):
#!/bin/bash
# Make a copy of temp, so you don't have to recreate the file every time you run this script
TEMP_FILE=$(tempfile)
cp temp $TEMP_FILE
# You know this from the start: the file contains 6 possible answers. If you need to add more in the future, change this to the line count of the file
TOTAL_LINES=6
echo "Ask and you shall receive your fortune: "
# Dummy reading of the char, adds a pause to the script and involves the user interaction
read
# Contrary to what you stated, you don't need an extra loop; one is enough:
# just change the condition to check the line count of the TEMP file
while [ $TOTAL_LINES -gt 0 ]; do
# You need to add 1 so the answer ranges from 1 to 6 instead of 0 to 5
N=$((RANDOM*$TOTAL_LINES/32767 + 1))
# This prints the answer (grab the first N lines with head then remove anything above the Nth line with tail)
head -n $N < $TEMP_FILE | tail -n 1
# Get a new file deleting the $N line and store it in a temp2 file
TEMP_FILE_2=$(tempfile)
head -n $(( $N - 1 )) < $TEMP_FILE > $TEMP_FILE_2
tail -n $(( $TOTAL_LINES - $N )) < $TEMP_FILE >> $TEMP_FILE_2
mv $TEMP_FILE_2 $TEMP_FILE
echo "Ask and you shall receive your fortune: "
read
# Get the total lines of TEMP (use cut to delete the file name from the wc output, you only need the number)
TOTAL_LINES=$(wc -l $TEMP_FILE | cut -d" " -f1)
done
$ man shuf
SHUF(1) User Commands
NAME
shuf - generate random permutations
SYNOPSIS
shuf [OPTION]... [FILE]
shuf -e [OPTION]... [ARG]...
shuf -i LO-HI [OPTION]...
DESCRIPTION
Write a random permutation of the input lines to standard output.
More stuff follows, you can read it on your own machine :)
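A sketch of how shuf could replace the temp-file juggling entirely (the response list is taken from the question):

```shell
#!/bin/bash
# Sketch: shuffle the six responses once with shuf, then hand them out
# in order; each response appears exactly once per cycle.
mapfile -t responses < <(shuf -e 'Yes' 'No' 'Maybe' 'Hard to tell' 'Unlikely' 'Unknown')
for r in "${responses[@]}"; do
    echo "$r"                  # in a real script: prompt, read the question, then answer
done
```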
