Comparison shell script for large text/csv files - improvement needed - bash
My task is the following - I have two CSV files:
File 1 (~9,000,000 records):
type(text),number,status(text),serial(number),data1,data2,data3
File 2 (~6,000 records):
serial_range_start(number),serial_range_end(number),info1(text),info2(text)
The goal is to add to each entry in File 1 the corresponding info1 and info2 from File 2:
type(text),number,status(text),serial(number),data1,data2,data3,info1(text),info2(text)
I use the following script:
#!/bin/bash
USER="file1.csv"
RANGE="file2.csv"
for SN in `cat $USER | awk -F , '{print $4}'`
do
#echo \n "$SN"
for LINE in `cat $RANGE`
do
i=`grep $LINE $RANGE| awk -F, '{print $1}'`
#echo \n "i= " "$i"
j=`grep $LINE $RANGE| awk -F, '{print $2}'`
#echo \n "j= " "$j"
k=`echo $SN`
#echo \n "k= " "$k"
if [ $k -ge $i -a $k -le $j ]
then
echo `grep $SN $USER`,`grep $i $RANGE| cut -d',' -f3-4` >> result.csv
break
#else
#echo `grep $SN $USER`,`echo 'N/A','N/A'` >> result.csv
fi
done
done
The script works rather well on small files, but I'm sure there is a way to optimize it, because I am running it on an i5 laptop with 4GB of RAM.
I am a newbie in shell scripting and I came up with this script after hours and hours of research, trial and error but now I am out of ideas.
Note: not every serial in file 1 can be found in a range in file 2.
Thank you!
Adrian.
FILE EXAMPLES and additional info:
File 1 example:
prep,28620026059,Active,123452010988759,No,No,No
post,28619823474,Active,123453458466109,Yes,No,No
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes
File 2 example:
123452010988750,123452010988759,promo32,1.11
123453458466100,123453458466199,promo64,2.22
123450000000000,123450000000010,standard128,3.333
Result example (currently):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
Result example (nice to have):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes,NA,NA
File 1 is sorted on the 4th column.
File 2 is sorted on the first column.
File 2 does not have ranges that overlap
Not all the info in file 1 can be found in a range in file 2
Thanks again!
LE (later edit):
The script provided by Jonathan seems to have an issue on some records, as follows:
file 2:
123456780737000,123456780737012,ONE 32,1.11
123456780016000,123456780025999,ONE 64,2.22
file 1:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes
The output is the following:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 32,1.11
and it should be:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 64,2.22
It seems that the search returns 0 and appends the info from the first record of file 2...
I think this will work reasonably well:
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { for (i = 0; i < n; i++)
{
if ($4 >= lo[i] && $4 <= hi[i])
{
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
break;
}
}
}' file2 file1
Given file2 containing:
1,10,xyz,pqr
11,20,abc,def
21,30,ambidextrous,warthog
and file1 containing:
A,123,X2,1,data01_1,data01_2,data01_3
A,123,X2,2,data02_1,data02_2,data02_3
A,123,X2,3,data03_1,data03_2,data03_3
A,123,X2,4,data04_1,data04_2,data04_3
A,123,X2,5,data05_1,data05_2,data05_3
A,123,X2,6,data06_1,data06_2,data06_3
A,123,X2,7,data07_1,data07_2,data07_3
A,123,X2,8,data08_1,data08_2,data08_3
A,123,X2,9,data09_1,data09_2,data09_3
A,123,X2,10,data10_1,data10_2,data10_3
A,123,X2,11,data11_1,data11_2,data11_3
A,123,X2,12,data12_1,data12_2,data12_3
A,123,X2,13,data13_1,data13_2,data13_3
A,123,X2,14,data14_1,data14_2,data14_3
A,123,X2,15,data15_1,data15_2,data15_3
A,123,X2,16,data16_1,data16_2,data16_3
A,123,X2,17,data17_1,data17_2,data17_3
A,123,X2,18,data18_1,data18_2,data18_3
A,123,X2,19,data19_1,data19_2,data19_3
A,223,X2,20,data20_1,data20_2,data20_3
A,223,X2,21,data21_1,data21_2,data21_3
A,223,X2,22,data22_1,data22_2,data22_3
A,223,X2,23,data23_1,data23_2,data23_3
A,223,X2,24,data24_1,data24_2,data24_3
A,223,X2,25,data25_1,data25_2,data25_3
A,223,X2,26,data26_1,data26_2,data26_3
A,223,X2,27,data27_1,data27_2,data27_3
A,223,X2,28,data28_1,data28_2,data28_3
A,223,X2,29,data29_1,data29_2,data29_3
the output of the command is:
A,123,X2,1,data01_1,data01_2,data01_3,xyz,pqr
A,123,X2,2,data02_1,data02_2,data02_3,xyz,pqr
A,123,X2,3,data03_1,data03_2,data03_3,xyz,pqr
A,123,X2,4,data04_1,data04_2,data04_3,xyz,pqr
A,123,X2,5,data05_1,data05_2,data05_3,xyz,pqr
A,123,X2,6,data06_1,data06_2,data06_3,xyz,pqr
A,123,X2,7,data07_1,data07_2,data07_3,xyz,pqr
A,123,X2,8,data08_1,data08_2,data08_3,xyz,pqr
A,123,X2,9,data09_1,data09_2,data09_3,xyz,pqr
A,123,X2,10,data10_1,data10_2,data10_3,xyz,pqr
A,123,X2,11,data11_1,data11_2,data11_3,abc,def
A,123,X2,12,data12_1,data12_2,data12_3,abc,def
A,123,X2,13,data13_1,data13_2,data13_3,abc,def
A,123,X2,14,data14_1,data14_2,data14_3,abc,def
A,123,X2,15,data15_1,data15_2,data15_3,abc,def
A,123,X2,16,data16_1,data16_2,data16_3,abc,def
A,123,X2,17,data17_1,data17_2,data17_3,abc,def
A,123,X2,18,data18_1,data18_2,data18_3,abc,def
A,123,X2,19,data19_1,data19_2,data19_3,abc,def
A,223,X2,20,data20_1,data20_2,data20_3,abc,def
A,223,X2,21,data21_1,data21_2,data21_3,ambidextrous,warthog
A,223,X2,22,data22_1,data22_2,data22_3,ambidextrous,warthog
A,223,X2,23,data23_1,data23_2,data23_3,ambidextrous,warthog
A,223,X2,24,data24_1,data24_2,data24_3,ambidextrous,warthog
A,223,X2,25,data25_1,data25_2,data25_3,ambidextrous,warthog
A,223,X2,26,data26_1,data26_2,data26_3,ambidextrous,warthog
A,223,X2,27,data27_1,data27_2,data27_3,ambidextrous,warthog
A,223,X2,28,data28_1,data28_2,data28_3,ambidextrous,warthog
A,223,X2,29,data29_1,data29_2,data29_3,ambidextrous,warthog
This uses a linear search on the list of ranges; you can write functions in awk and a binary search looking for the correct range would perform better on 6,000 entries. That part, though, is an optimization — exercise for the reader. Remember that the first rule of optimization is: don't. The second rule of optimization (for experts only) is: don't do it yet. Demonstrate that it is a problem. This code shouldn't take all that much longer than the time it takes to copy the 9,000,000 record file (somewhat longer, but not disastrously so). Note, though, that if the file1 data is sorted, the tail of the processing will take longer than the start because of the linear search. If the serial numbers are in a random order, then it will all take about the same time on average.
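The question's "nice to have" output appends NA,NA when a serial falls in no range. A minimal variant of the linear-search script above could handle that with a matched flag (the flag is my addition; the sample data is copied from the question so the sketch is runnable as-is):

```shell
# Sample data from the question, written locally so the sketch runs standalone
printf '%s\n' \
  '123452010988750,123452010988759,promo32,1.11' \
  '123453458466100,123453458466199,promo64,2.22' \
  '123450000000000,123450000000010,standard128,3.333' > file2.csv
printf '%s\n' \
  'prep,28620026059,Active,123452010988759,No,No,No' \
  'post,28619823474,Active,123453458466109,Yes,No,No' \
  'post,28619823474,Inactive,123453395270941,Yes,Yes,Yes' > file1.csv

# Same linear range lookup as above; rows whose serial matches no range
# are emitted with NA,NA appended instead of being silently dropped
awk -F, 'BEGIN { OFS = "," }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++; next }
{
    matched = 0
    for (i = 0; i < n; i++)
        if ($4 >= lo[i] && $4 <= hi[i]) {
            print $0, i1[i], i2[i]; matched = 1; break
        }
    if (!matched) print $0, "NA", "NA"
}' file2.csv file1.csv > result.csv
```

On the sample above, result.csv matches the "nice to have" example, including the NA,NA line for the serial outside every range.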
If your CSV data has commas embedded in the text fields, then awk is no longer suitable; you need a tool with explicit support for CSV format — Perl and Python both have suitable modules.
Answer to Exercise for the Reader
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { i = search($4)
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
}
function search(i, l, h, m)
{
l = 0; h = n - 1;
while (l <= h)
{
m = int((l + h)/2);
if (i >= lo[m] && i <= hi[m])
return m;
else if (i < lo[m])
h = m - 1;
else
l = m + 1;
}
return 0; # Should not get here
}' file2 file1
Not all that hard to write the binary search. This gives the same result as the original script on the sample data. It has not been exhaustively tested, but appears to work.
Note that the code does not really handle missing ranges in file2; it assumes that the ranges are contiguous but non-overlapping and in sorted order and cover all the values that can appear in the serial column of file1. If those assumptions are not valid, you get erratic behaviour until you fix either the code or the data.
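The later-edit mismatch (ONE 32 appended where ONE 64 was expected) is consistent with one of those assumptions being violated: the later-edit file 2 lists its ranges out of order, so the binary search falls through and returns 0, i.e. the first record. A minimal precaution, assuming the range starts are plain numbers, is to sort file 2 numerically before feeding it to the script:

```shell
# The later-edit sample ranges, out of order as reported in the question
printf '%s\n' \
  '123456780737000,123456780737012,ONE 32,1.11' \
  '123456780016000,123456780025999,ONE 64,2.22' > file2.csv

# Numeric sort on the first comma-separated field restores the
# sorted-by-range_start order the binary search relies on
sort -t, -k1,1n file2.csv > file2.sorted.csv
```

After sorting, the ONE 64 range comes first and the search behaves as intended.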
In Unix you may use the join command (type 'man join' for more information), which can be configured to work similarly to the join operation in databases. That may help you add the information from File 2 to File 1.
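One caveat: join matches on exact key equality, not on ranges, so file 2's serial ranges would need expanding or bucketing before join could perform this particular merge. A toy sketch of the mechanics, on hypothetical exact-key data:

```shell
# join needs both inputs sorted on the join field; -t, sets the delimiter
# and -1/-2 pick the key field in each file (data here is made up)
printf '%s\n' 'a,1' 'b,2' | sort -t, -k1,1 > left.csv
printf '%s\n' 'a,x' 'c,y' | sort -t, -k1,1 > right.csv

join -t, -1 1 -2 1 left.csv right.csv
# unmatched keys (b in left, c in right) are dropped by default
```

The single output line is a,1,x: the shared key, then the remaining fields of each file.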
Related
printing contents of variable to a specified line in outputfile with sed/awk
I have been working on a script to concatenate multiple csv files into a single, large csv. The csv files contain names of folders and their respective sizes, in a 2-column setup with the format "Size, Projectname". Example of a single csv file:

49747851728,ODIN
32872934580,_WORK
9721820722,LIBRARY
4855839655,BASELIGHT
1035732096,ARCHIVE
907756578,USERS
123685100,ENV
3682821,SHOTGUN
1879186,SALT
361558,SOFTWARE
486,VFX
128,DNA

For my current test I have 25 similar files, with different numbers in the first column. I am trying to get this script to do the following:

Read each csv file. For each Project it sees, scan the output file to check whether that Project was already printed to it. If not, print the Projectname. Then, for each file, for each Project, if the Project was found, print the Size to the output csv.

However, I need the Projects to all be on text line 1, comma separated, so I can use this output file as input for a javascript graph. The Sizes should be added in the column below their project name. My current script:

csv_folder=$(echo "$1" | sed 's/^[ \t]*//;s/\/[ \t]*$//')
csv_allfiles="$csv_folder/*.csv"
csv_outputfile=$csv_folder.csv
echo -n "" > $csv_outputfile
for csv_inputfile in $csv_allfiles; do
  while read line && [[ $line != "" ]]; do
    projectname=$(echo $line | sed 's/^\([^,]*\),//')
    projectfound1=$(cat $csv_outputfile | grep -w $projectname)
    if [[ ! $projectfound1 ]]; then
      textline=1
      sed "${textline}s/$/${projectname}, /" >> $csv_outputfile
      for csv_foundfile in $csv_allfiles; do
        textline=$(echo $textline + 1 | bc )
        projectfound2=$(cat $csv_foundfile | grep -w $projectname)
        projectdata=$(echo $projectfound2 | sed 's/\,.*$//')
        if [[ $projectfound2 ]]; then
          sed "${textline}s/$/$projectdata, /" >> $csv_outputfile
        fi
      done
    fi
  done < $csv_inputfile
done

My current script finds the right information (projectname, projectdata), and if I just 'echo' those variables, it prints the correct data to a file. However, with echo it only prints in a long list per project.
I want it to 'jump back' to line 1 and print the new project at the end of the current line, then run the loop to print data at the end of each next line. I was thinking this should be possible with sed or awk.

sed should have a way of inserting text at a specific line with sed '{n}s/search/replace/', where {n} is the line to insert at. awk should be able to do the same thing with something like:

awk -v l2="$textline" -v d="$projectdata" 'NR == l2 {print d} {print}' >> $csv_outputfile

However, while replacing the sed commands in the script with

echo $projectname
echo $projectdata

spits out the correct information (so I know my variables are filled correctly), the sed and awk commands tend to spit out the entire contents of their current input csv, not just the line that I want them to.

Pastebin outputs per variant of writing to file:
https://pastebin.com/XwxiAqvT - sed output
https://pastebin.com/xfLU6wri - echo, plain output (single column)
https://pastebin.com/wP3BhgY8 - echo, detailed output per variable
https://pastebin.com/5wiuq53n - desired output

As you see, the sed output tends to paste the whole contents of the input csv, making the loop stop after one iteration (since it finds the other Projects after one loop). So my question is one of these:

How do I make sed/awk behave the way I want, i.e. print only the info in my variable to the current text line instead of the whole input csv? Is sed capable of this, printing just one line from a variable?
Or should I output the variables through 'echo' into a temp file, then loop over the temp file to make sed sort the lines the way I want them? (Bear in mind that more .csv files will be added in the future; I can't just make it loop x times to sort the info.)
Is there a way to echo/print text to a specific text line without using sed or awk? Is there a printf option I'm missing? Other thoughts?

Any help would be very much appreciated.
A way to accomplish this transposition is to save the data to an associative array. In the following example, we use a two-dimensional array to keep track of our data. Because ordering seems to be important, we create a col array and create a new increment whenever we see a new projectname -- this col array ends up being our first index into our data. We also create a row array which we increment whenever we see new data for the current column. The row number is our second index into data. At the end, we print out all the records.

#! /usr/bin/awk -f

BEGIN {
    FS = ","
    OFS = ", "
    rows = 0
    cols = 0
    head = ""
    split("", data)
    split("", row)
    split("", col)
}

!($2 in col) { # new project
    if (head == "")
        head = $2
    else
        head = head OFS $2
    i = col[$2] = cols++
    row[i] = 0
}

{
    i = col[$2]
    j = row[i]++
    data[i,j] = $1
    if (j > rows)
        rows = j
}

END {
    print head
    for (j=0; j<=rows; ++j) {
        if ((0,j) in data)
            x = data[0,j]
        else
            x = ""
        for (i=1; i<cols; ++i) {
            if ((i,j) in data)
                x = x OFS data[i,j]
            else
                x = x OFS
        }
        print x
    }
}

As a bonus, here is a script to reproduce the detailed output from one of your pastebins.

#! /usr/bin/awk -f

BEGIN {
    FS = ","
    split("", data)  # accumulated data for a project
    split("", line)  # keep track of textline for data
    split("", idx)   # index into above to maintain input order
    sz = 0
}

$2 in idx { # have seen this projectname
    i = idx[$2]
    x = ORS "textline = " ++line[i]
    x = x ORS "textdata = " $1
    data[i] = data[i] x
    next
}

{ # new projectname
    i = sz++
    idx[$2] = i
    x = "textline = 1"
    x = x ORS "projectname = " $2
    x = x ORS "textline = 2"
    x = x ORS "projectdata = " $1
    data[i] = x
    line[i] = 2
}

END {
    for (i=0; i<sz; ++i)
        print data[i]
}
Fill parray with project names and array with values, then print them with bash printf. You can choose the column width in the printf command (currently 13 characters - %13s).

#!/bin/bash

declare -i index=0
declare -i pindex=0

while read project; do
    parray[$pindex]=$project
    index=0
    while read; do
        array[$pindex,$index]="$REPLY"
        index+=1
    done <<< $(grep -h "$project" *.csv | cut -d, -f1)
    pindex+=1
done <<< $(cat *.csv | cut -d, -f 2 | sort -u)

maxi=$index
maxp=$pindex

for (( pindex=0; $pindex < $maxp; pindex+=1 )); do
    STR="%13s $STR"
    VAL="$VAL ${parray[$pindex]}"
done
printf "$STR\n" $VAL

for (( index=0; $index < $maxi; index+=1 )); do
    STR=""; VAL=""
    for (( pindex=0; $pindex < $maxp; pindex+=1 )); do
        STR="%13s $STR"
        VAL="$VAL ${array[$pindex,$index]}"
    done
    printf "$STR\n" $VAL
done
If you are OK with the output being sorted by name, this one-liner might be of use:

awk 'BEGIN {FS=",";OFS=","} {print $2,$1}' * | sort | uniq

The files have to be in the same directory; if not, a list of files replaces the *. First it exchanges the two fields. Awk will take a list of files and do the concatenation. Then sort the lines and print just the unique lines. This depends on the project size always being the same.

The simple one-liner above gives you one line for each project. If you really want to do it all in awk and have awk write the two lines, then the following is needed. A second awk at the end accumulates each column entry in an array, then spits it out at the end:

awk 'BEGIN {FS=","} {print $2,$1}' * | sort | uniq |
awk 'BEGIN {n=0}
     {p[n]=$1; s[n++]=$2}
     END {for (i=0;i<n;i++) printf "%s,",p[i]; print "";
          for (i=0;i<n;i++) printf "%s,",s[i]; print ""}'

If you have the rs utility, then this can be simplified to:

awk 'BEGIN {FS=","} {print $2,$1}' * | sort | uniq | rs -C',' -T
Split file with 800,000 columns
I want to split a file of genomic data with 800,000 columns and 40,000 rows into a series of files with 100 columns each, total size 118 GB. I am currently running the following bash script, multithreaded 15 times:

infile="$1"
start=$2
end=$3
step=$(($4-1))

for((curr=$start, start=$start, end=$end; curr+step <= end; curr+=step+1)); do
    cut -d' ' -f$curr-$((curr+step)) "$infile" > "${infile}.$curr"
done

However, judging by the current progress of the script, it will take 300 days to complete the split?! Is there a more efficient way to column-wise split a space-delimited file into smaller chunks?
Try this awk script:

awk -v cols=100 '{
    f = 1
    for (i = 1; i <= NF; i++) {
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
        f = int(i/cols) + 1
    }
}' largefile

I expect it to be faster than the shell script in the question.
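To see the field bookkeeping in action, here is the same script run with cols=2 on a toy two-row file (the file names are just for illustration):

```shell
# Toy check of the splitter: 3 columns split into files of up to 2 columns
printf '%s\n' 'a b c' 'd e f' > largefile

awk -v cols=2 '{
    f = 1
    for (i = 1; i <= NF; i++) {
        # emit OFS inside a chunk, ORS at a chunk boundary or end of line
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
        f = int(i/cols) + 1
    }
}' largefile
# largefile.1 now holds columns 1-2, largefile.2 holds column 3
```

Each output file keeps one row per input row, so the chunks line up when pasted back together.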
How to efficiently sum two columns in a file with 270,000+ rows in bash
I have two columns in a file, and I want to automate summing both values per row. For example:

read write
5 6
read write
10 2
read write
23 44

I want to then sum the "read" and "write" of each row. Eventually, after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to get rid of the column headers per row, which, as stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line. I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line:

lines=`grep -v READ $x | wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
    arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $2 + $3}'`
    echo $line_num
    line_num=$[$line_num+1]
    arr_num=$[$arr_num+1]
done

However, the file to be summed has 270,000+ rows. The script has been running for a few hours now and is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
Use awk instead and take advantage of modulus function: awk '!(NR%2){print $1+$2}' infile
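The sample input alternates header and data lines, so the even-numbered lines (NR%2 == 0) carry the numbers. A quick run on the question's data:

```shell
# !(NR%2) is true on even-numbered lines, i.e. the data lines here
printf '%s\n' 'read write' '5 6' 'read write' '10 2' 'read write' '23 44' |
awk '!(NR%2){print $1+$2}'
# prints 11, 12 and 67, one per line
```

A single awk process streams the whole file once, instead of re-reading it for every row as the question's loop does.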
awk is probably faster, but the idiomatic bash way to do this is something like:

while read -a line; do
    # read each line one-by-one, into an array
    # use arithmetic expansion to add col 1 and 2
    echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)

Note that the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins. The <( ) process substitution is used in case variables set in the while loop are required outside the scope of the loop; otherwise a | pipe could be used.
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:

awk '
NR%2 == 1 {next}
NR == 2 {max = $1+$2; next}
$1+$2 > max {max = $1+$2}
END {print max}
' filename
You could also use a pipeline with tools that implicitly loop over the input, like so:

grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE

This assumes there are spaces between your read and write data values.
Why not run:

awk 'NR==1 { print "sum"; next } { print $1 + $2 }'

You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process. You can use Perl or Python instead of awk if you prefer.

Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
Assuming that it's always one 'header' row followed by one 'data' row:

awk '
BEGIN { max = 0 }
{
    if (NR%2 == 0) {
        sum = $1 + $2
        if (sum > max) { max = sum }
    }
}
END { print max }' input.txt

Or simply trim out all lines that do not conform to what you want:

grep '^[0-9]\+\s\+[0-9]\+$' input.txt |
awk '
BEGIN { max = 0 }
{
    sum = $1 + $2
    if (sum > max) { max = sum }
}
END { print max }'
Shell script: copying columns by header in a csv file to another csv file
I have a csv file which I'll be using as input, with a format looking like this:

xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20

The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files. The output would then look something like this:

value1.csv:
xValue,value1-avg,value1-median
1,3,4

value2.csv:
xValue,value2-avg
1,20

I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files. Any help is greatly appreciated!

P.S. The output file may have already been written to during previous runs of this script, meaning the code should append to the output file.
Untested but should be close:

awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
}
{
    delete(outstr)
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}
' inFile.csv

Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.

Note that this appends the output you wanted to existing files, BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not, but the solution I THINK you'll want would look something like this:

awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
    for (i in outfiles) {
        outfile = outfiles[i]
        exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
        close(outfile)
    }
}
{
    delete(outstr)
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        if ( (NR > 1) || !exists[outfile] )
            print $1 outstr[outfile] >> outfile
}
' inFile.csv
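A quick way to check the first script is to run it on the sample input from the question. This sketch uses the portable split("",outstr) form of the array reset mentioned in the answer, so it also runs on non-gawk awks:

```shell
# Sample input from the question
printf '%s\n' \
  'xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median' \
  '1,3,4,20,14,20' > inFile.csv

# Start clean, since the script appends to its per-value output files
rm -f value1.csv value2.csv value3.csv

awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/, ".csv", outfile)   # value1-avg -> value1.csv
        outfiles[i] = outfile
    }
}
{
    split("", outstr)                 # portable whole-array reset
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}' inFile.csv
```

Afterwards value1.csv, value2.csv and value3.csv each hold their header row plus one data row, matching the desired output in the question.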
Just figure out the name associated with each column and use that mapping to manipulate the columns. Whether you're working in awk, ksh93, or bash, you can use associative arrays to store the column names and the rows they correspond to; perl, python, ruby and the like offer equivalent structures. Or push the columns into an array to map the names to column numbers. Either way, you then have a list of column headers, which can be manipulated however you need.
The solution I have found most useful for this kind of problem is to first retrieve the column numbers using an awk script (encapsulated in a shell function) and then follow with a cut statement. This technique/strategy turns into a very concise, general and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:

#! /bin/sh

fields() {
    LC_ALL=C awk -F, -v pattern="$1" '{
        j=0; split("", f)
        for (i=1; i<=NF; i++)
            if ($(i) ~ pattern) f[j++] = i
        if (j) {
            printf("%s", f[0])
            for (i=1; i<j; i++) printf(",%s", f[i])
        }
        exit 0
    }' "$2"
}

cut_fields_with_append() {
    if [ -s "$3" ]
    then
        cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
    else
        cut -d, -f `fields "$1" "$2"` "$2" > "$3"
    fi
}

cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait

The result is as you would expect:

$ ls
values  values.csv
$ cat values.csv
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv  value2.csv  value3.csv  values  values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
1,14,20
$
Convert tallies to relative probabilities
Background

Create a probability lexicon based on a CSV file of words and tallies. This is a prelude to a text segmentation problem, not a homework problem.

Problem

Given a CSV file with the following words and tallies:

aardvark,10
aardwolf,9
armadillo,9
platypus,5
zebra,1

Create a file with probabilities relative to the largest tally in the file:

aardvark,1
aardwolf,0.9
armadillo,0.9
platypus,0.5
zebra,0.1

Where, for example, aardvark,1 is calculated as aardvark,10/10 and platypus,0.5 is calculated as platypus,5/10.

Question

What is the most efficient way to implement a shell script to create the file of relative probabilities?

Constraints

Neither the words nor the numbers are in any order.
No major programming language (such as Perl, Ruby, Python, Java, C, Fortran, or Cobol).
Standard Unix tools such as awk, sed, or sort are welcome.
All probabilities must be relative to the highest probability in the file.
The words are unique; the numbers are not.
The tallies are natural numbers.

Thank you!
awk 'BEGIN {max=0; OFS=FS=","} $NF > max {max=$NF} NR > FNR {print $1, ($2/max)}' file file
No need to read the file twice:

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2} $2 > max {max=$2} END {for (w in a) print w, a[w]/max}' inputfile

If you need the output sorted by word:

awk ... | sort

or, using gawk's asort:

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2; ind[j++] = $1} $2 > max {max=$2} END {n = asort(ind); for (i=1; i<=n; i++) print ind[i], a[ind[i]]/max}' inputfile

If you need the output sorted by probability:

awk ... | sort -t, -k2,2n -k1,1
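A quick sanity run of the single-pass script on the question's sample data (piped through sort only because the order of a for (w in a) loop is unspecified):

```shell
# Question's sample tallies, fed on stdin; output sorted for stable order
printf '%s\n' 'aardvark,10' 'aardwolf,9' 'armadillo,9' 'platypus,5' 'zebra,1' |
awk 'BEGIN {OFS = FS = ","}
     {a[$1] = $2}            # remember every word and its tally
     $2 > max {max = $2}     # track the largest tally seen
     END {for (w in a) print w, a[w]/max}' | sort
```

This prints the expected lexicon: aardvark,1 through zebra,0.1.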
This is not error-proof, but something like this should work:

#!/bin/bash

INPUT=data.csv
OUTPUT=tally.csv
DIGITS=1

OLDIFS=$IFS
IFS=,

maxval=0
# Assuming all $val are positive
while read name val
do
    if (( val > maxval )); then maxval=$val; fi
done < $INPUT

# Make sure $OUTPUT exists
touch $OUTPUT

while read name val
do
    tally=`echo "scale=$DIGITS; result=$val/$maxval; if (0 <= result && result < 1) { print \"0\" }; print result" | bc`
    echo "$name,$tally" >> $OUTPUT
done < $INPUT

IFS=$OLDIFS

Borrowed from this question, and various googling.