Comparison shell script for large text/csv files - improvement needed - bash
My task is the following - I have two CSV files:
File 1 (~9,000,000 records):
type(text),number,status(text),serial(number),data1,data2,data3
File 2 (~6,000 records):
serial_range_start(number),serial_range_end(number),info1(text),info2(text)
The goal is to add to each entry in File 1 the corresponding info1 and info2 from File 2:
type(text),number,status(text),serial(number),data1,data2,data3,info1(text),info2(text)
I use the following script:
#!/bin/bash
USER="file1.csv"
RANGE="file2.csv"
for SN in `cat $USER | awk -F , '{print $4}'`
do
#echo \n "$SN"
for LINE in `cat $RANGE`
do
i=`grep $LINE $RANGE| awk -F, '{print $1}'`
#echo \n "i= " "$i"
j=`grep $LINE $RANGE| awk -F, '{print $2}'`
#echo \n "j= " "$j"
k=`echo $SN`
#echo \n "k= " "$k"
if [ $k -ge $i -a $k -le $j ]
then
echo `grep $SN $USER`,`grep $i $RANGE| cut -d',' -f3-4` >> result.csv
break
#else
#echo `grep $SN $USER`,`echo 'N/A','N/A'` >> result.csv
fi
done
done
The script works rather well on small files, but I'm sure there is a way to optimize it, because I am running it on an i5 laptop with 4GB of RAM.
I am a newbie in shell scripting and I came up with this script after hours and hours of research, trial and error but now I am out of ideas.
Note: not every serial in file 1 can be found in a range in file 2.
Thank you!
Adrian.
FILE EXAMPLES and additional info:
File 1 example:
prep,28620026059,Active,123452010988759,No,No,No
post,28619823474,Active,123453458466109,Yes,No,No
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes
File 2 example:
123452010988750,123452010988759,promo32,1.11
123453458466100,123453458466199,promo64,2.22
123450000000000,123450000000010,standard128,3.333
Result example (currently):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
Result example (nice to have):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes,NA,NA
File 1 is sorted on the 4th column.
File 2 is sorted on the first column.
File 2 does not have ranges that overlap
Not all the info in file 1 can be found in a range in file 2
Thanks again!
LE (later edit):
The script provided by Jonathan seems to have an issue on some records, as follows:
file 2:
123456780737000,123456780737012,ONE 32,1.11
123456780016000,123456780025999,ONE 64,2.22
file 1:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes
The output is the following:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 32,1.11
and it should be:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 64,2.22
It seems that the search returns 0 and appends the info from the first record of file 2...
I think this will work reasonably well:
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { for (i = 0; i < n; i++)
{
if ($4 >= lo[i] && $4 <= hi[i])
{
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
break;
}
}
}' file2 file1
Given file2 containing:
1,10,xyz,pqr
11,20,abc,def
21,30,ambidextrous,warthog
and file1 containing:
A,123,X2,1,data01_1,data01_2,data01_3
A,123,X2,2,data02_1,data02_2,data02_3
A,123,X2,3,data03_1,data03_2,data03_3
A,123,X2,4,data04_1,data04_2,data04_3
A,123,X2,5,data05_1,data05_2,data05_3
A,123,X2,6,data06_1,data06_2,data06_3
A,123,X2,7,data07_1,data07_2,data07_3
A,123,X2,8,data08_1,data08_2,data08_3
A,123,X2,9,data09_1,data09_2,data09_3
A,123,X2,10,data10_1,data10_2,data10_3
A,123,X2,11,data11_1,data11_2,data11_3
A,123,X2,12,data12_1,data12_2,data12_3
A,123,X2,13,data13_1,data13_2,data13_3
A,123,X2,14,data14_1,data14_2,data14_3
A,123,X2,15,data15_1,data15_2,data15_3
A,123,X2,16,data16_1,data16_2,data16_3
A,123,X2,17,data17_1,data17_2,data17_3
A,123,X2,18,data18_1,data18_2,data18_3
A,123,X2,19,data19_1,data19_2,data19_3
A,223,X2,20,data20_1,data20_2,data20_3
A,223,X2,21,data21_1,data21_2,data21_3
A,223,X2,22,data22_1,data22_2,data22_3
A,223,X2,23,data23_1,data23_2,data23_3
A,223,X2,24,data24_1,data24_2,data24_3
A,223,X2,25,data25_1,data25_2,data25_3
A,223,X2,26,data26_1,data26_2,data26_3
A,223,X2,27,data27_1,data27_2,data27_3
A,223,X2,28,data28_1,data28_2,data28_3
A,223,X2,29,data29_1,data29_2,data29_3
the output of the command is:
A,123,X2,1,data01_1,data01_2,data01_3,xyz,pqr
A,123,X2,2,data02_1,data02_2,data02_3,xyz,pqr
A,123,X2,3,data03_1,data03_2,data03_3,xyz,pqr
A,123,X2,4,data04_1,data04_2,data04_3,xyz,pqr
A,123,X2,5,data05_1,data05_2,data05_3,xyz,pqr
A,123,X2,6,data06_1,data06_2,data06_3,xyz,pqr
A,123,X2,7,data07_1,data07_2,data07_3,xyz,pqr
A,123,X2,8,data08_1,data08_2,data08_3,xyz,pqr
A,123,X2,9,data09_1,data09_2,data09_3,xyz,pqr
A,123,X2,10,data10_1,data10_2,data10_3,xyz,pqr
A,123,X2,11,data11_1,data11_2,data11_3,abc,def
A,123,X2,12,data12_1,data12_2,data12_3,abc,def
A,123,X2,13,data13_1,data13_2,data13_3,abc,def
A,123,X2,14,data14_1,data14_2,data14_3,abc,def
A,123,X2,15,data15_1,data15_2,data15_3,abc,def
A,123,X2,16,data16_1,data16_2,data16_3,abc,def
A,123,X2,17,data17_1,data17_2,data17_3,abc,def
A,123,X2,18,data18_1,data18_2,data18_3,abc,def
A,123,X2,19,data19_1,data19_2,data19_3,abc,def
A,223,X2,20,data20_1,data20_2,data20_3,abc,def
A,223,X2,21,data21_1,data21_2,data21_3,ambidextrous,warthog
A,223,X2,22,data22_1,data22_2,data22_3,ambidextrous,warthog
A,223,X2,23,data23_1,data23_2,data23_3,ambidextrous,warthog
A,223,X2,24,data24_1,data24_2,data24_3,ambidextrous,warthog
A,223,X2,25,data25_1,data25_2,data25_3,ambidextrous,warthog
A,223,X2,26,data26_1,data26_2,data26_3,ambidextrous,warthog
A,223,X2,27,data27_1,data27_2,data27_3,ambidextrous,warthog
A,223,X2,28,data28_1,data28_2,data28_3,ambidextrous,warthog
A,223,X2,29,data29_1,data29_2,data29_3,ambidextrous,warthog
This uses a linear search on the list of ranges; you can write functions in awk and a binary search looking for the correct range would perform better on 6,000 entries. That part, though, is an optimization — exercise for the reader. Remember that the first rule of optimization is: don't. The second rule of optimization (for experts only) is: don't do it yet. Demonstrate that it is a problem. This code shouldn't take all that much longer than the time it takes to copy the 9,000,000 record file (somewhat longer, but not disastrously so). Note, though, that if the file1 data is sorted, the tail of the processing will take longer than the start because of the linear search. If the serial numbers are in a random order, then it will all take about the same time on average.
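The question's "nice to have" output appends NA,NA when a serial falls in no range. A minimal variant of the linear-search script above could handle that with a matched flag (the flag is my addition; the sample data is copied from the question so the sketch is runnable as-is):

```shell
# Sample data from the question, written locally so the sketch runs standalone
printf '%s\n' \
  '123452010988750,123452010988759,promo32,1.11' \
  '123453458466100,123453458466199,promo64,2.22' \
  '123450000000000,123450000000010,standard128,3.333' > file2.csv
printf '%s\n' \
  'prep,28620026059,Active,123452010988759,No,No,No' \
  'post,28619823474,Active,123453458466109,Yes,No,No' \
  'post,28619823474,Inactive,123453395270941,Yes,Yes,Yes' > file1.csv

# Same linear range lookup as above; rows whose serial matches no range
# are emitted with NA,NA appended instead of being silently dropped
awk -F, 'BEGIN { OFS = "," }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++; next }
{
    matched = 0
    for (i = 0; i < n; i++)
        if ($4 >= lo[i] && $4 <= hi[i]) {
            print $0, i1[i], i2[i]; matched = 1; break
        }
    if (!matched) print $0, "NA", "NA"
}' file2.csv file1.csv > result.csv
```

On the sample above, result.csv matches the "nice to have" example, including the NA,NA line for the serial outside every range.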
If your CSV data has commas embedded in the text fields, then awk is no longer suitable; you need a tool with explicit support for CSV format — Perl and Python both have suitable modules.
Answer to Exercise for the Reader
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { i = search($4)
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
}
function search(i, l, h, m)
{
l = 0; h = n - 1;
while (l <= h)
{
m = int((l + h)/2);
if (i >= lo[m] && i <= hi[m])
return m;
else if (i < lo[m])
h = m - 1;
else
l = m + 1;
}
return 0; # Should not get here
}' file2 file1
Not all that hard to write the binary search. This gives the same result as the original script on the sample data. It has not been exhaustively tested, but appears to work.
Note that the code does not really handle missing ranges in file2; it assumes that the ranges are contiguous but non-overlapping and in sorted order and cover all the values that can appear in the serial column of file1. If those assumptions are not valid, you get erratic behaviour until you fix either the code or the data.
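The later-edit mismatch (ONE 32 appended where ONE 64 was expected) is consistent with one of those assumptions being violated: the later-edit file 2 lists its ranges out of order, so the binary search falls through and returns 0, i.e. the first record. A minimal precaution, assuming the range starts are plain numbers, is to sort file 2 numerically before feeding it to the script:

```shell
# The later-edit sample ranges, out of order as reported in the question
printf '%s\n' \
  '123456780737000,123456780737012,ONE 32,1.11' \
  '123456780016000,123456780025999,ONE 64,2.22' > file2.csv

# Numeric sort on the first comma-separated field restores the
# sorted-by-range_start order the binary search relies on
sort -t, -k1,1n file2.csv > file2.sorted.csv
```

After sorting, the ONE 64 range comes first and the search behaves as intended.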
In Unix you may use the join command (type 'man join' for more information), which can be configured to work similarly to the join operation in databases. That may help you add the information from File 2 to File 1.
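One caveat: join matches on exact key equality, not on ranges, so file 2's serial ranges would need expanding or bucketing before join could perform this particular merge. A toy sketch of the mechanics, on hypothetical exact-key data:

```shell
# join needs both inputs sorted on the join field; -t, sets the delimiter
# and -1/-2 pick the key field in each file (data here is made up)
printf '%s\n' 'a,1' 'b,2' | sort -t, -k1,1 > left.csv
printf '%s\n' 'a,x' 'c,y' | sort -t, -k1,1 > right.csv

join -t, -1 1 -2 1 left.csv right.csv
# unmatched keys (b in left, c in right) are dropped by default
```

The single output line is a,1,x: the shared key, then the remaining fields of each file.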
Related
printing contents of variable to a specified line in outputfile with sed/awk
I have been working on a script to concatenate multiple csv files into a single, large csv. The csv files contain names of folders and their respective sizes, in a 2-column setup with the format "Size, Projectname". Example of a single csv file:

49747851728,ODIN
32872934580,_WORK
9721820722,LIBRARY
4855839655,BASELIGHT
1035732096,ARCHIVE
907756578,USERS
123685100,ENV
3682821,SHOTGUN
1879186,SALT
361558,SOFTWARE
486,VFX
128,DNA

For my current test I have 25 similar files, with different numbers in the first column. I am trying to get this script to do the following:

Read each csv file. For each Project it sees, scan the output file to check whether that Project was already printed to it. If not, print the Projectname. Then, for each file, for each Project, if the Project was found, print the Size to the output csv.

However, I need the Projects to all be on text line 1, comma separated, so I can use this output file as input for a javascript graph. The Sizes should be added in the column below their project name. My current script:

csv_folder=$(echo "$1" | sed 's/^[ \t]*//;s/\/[ \t]*$//')
csv_allfiles="$csv_folder/*.csv"
csv_outputfile=$csv_folder.csv
echo -n "" > $csv_outputfile
for csv_inputfile in $csv_allfiles; do
  while read line && [[ $line != "" ]]; do
    projectname=$(echo $line | sed 's/^\([^,]*\),//')
    projectfound1=$(cat $csv_outputfile | grep -w $projectname)
    if [[ ! $projectfound1 ]]; then
      textline=1
      sed "${textline}s/$/${projectname}, /" >> $csv_outputfile
      for csv_foundfile in $csv_allfiles; do
        textline=$(echo $textline + 1 | bc )
        projectfound2=$(cat $csv_foundfile | grep -w $projectname)
        projectdata=$(echo $projectfound2 | sed 's/\,.*$//')
        if [[ $projectfound2 ]]; then
          sed "${textline}s/$/$projectdata, /" >> $csv_outputfile
        fi
      done
    fi
  done < $csv_inputfile
done

My current script finds the right information (projectname, projectdata), and if I just 'echo' those variables, it prints the correct data to a file. However, with echo it only prints in a long list per project.
I want it to 'jump back' to line 1 and print the new project at the end of the current line, then run the loop to print data at the end of each next line. I was thinking this should be possible with sed or awk.

sed should have a way of inserting text at a specific line with sed '{n}s/search/replace/', where {n} is the line to insert at. awk should be able to do the same thing with something like:

awk -v l2="$textline" -v d="$projectdata" 'NR == l2 {print d} {print}' >> $csv_outputfile

However, while replacing the sed commands in the script with

echo $projectname
echo $projectdata

spits out the correct information (so I know my variables are filled correctly), the sed and awk commands tend to spit out the entire contents of their current input csv, not just the line that I want them to.

Pastebin outputs per variant of writing to file:
https://pastebin.com/XwxiAqvT - sed output
https://pastebin.com/xfLU6wri - echo, plain output (single column)
https://pastebin.com/wP3BhgY8 - echo, detailed output per variable
https://pastebin.com/5wiuq53n - desired output

As you see, the sed output tends to paste the whole contents of the input csv, making the loop stop after one iteration (since it finds the other Projects after one loop). So my question is one of these:

How do I make sed/awk behave the way I want, i.e. print only the info in my variable to the current text line instead of the whole input csv? Is sed capable of this, printing just one line from a variable?
Or should I output the variables through 'echo' into a temp file, then loop over the temp file to make sed sort the lines the way I want them? (Bear in mind that more .csv files will be added in the future; I can't just make it loop x times to sort the info.)
Is there a way to echo/print text to a specific text line without using sed or awk? Is there a printf option I'm missing? Other thoughts?

Any help would be very much appreciated.
A way to accomplish this transposition is to save the data to an associative array. In the following example, we use a two-dimensional array to keep track of our data. Because ordering seems to be important, we create a col array and create a new increment whenever we see a new projectname -- this col array ends up being our first index into our data. We also create a row array which we increment whenever we see new data for the current column. The row number is our second index into data. At the end, we print out all the records.

#! /usr/bin/awk -f

BEGIN {
    FS = ","
    OFS = ", "
    rows = 0
    cols = 0
    head = ""
    split("", data)
    split("", row)
    split("", col)
}

!($2 in col) { # new project
    if (head == "")
        head = $2
    else
        head = head OFS $2
    i = col[$2] = cols++
    row[i] = 0
}

{
    i = col[$2]
    j = row[i]++
    data[i,j] = $1
    if (j > rows)
        rows = j
}

END {
    print head
    for (j=0; j<=rows; ++j) {
        if ((0,j) in data)
            x = data[0,j]
        else
            x = ""
        for (i=1; i<cols; ++i) {
            if ((i,j) in data)
                x = x OFS data[i,j]
            else
                x = x OFS
        }
        print x
    }
}

As a bonus, here is a script to reproduce the detailed output from one of your pastebins.

#! /usr/bin/awk -f

BEGIN {
    FS = ","
    split("", data)  # accumulated data for a project
    split("", line)  # keep track of textline for data
    split("", idx)   # index into above to maintain input order
    sz = 0
}

$2 in idx { # have seen this projectname
    i = idx[$2]
    x = ORS "textline = " ++line[i]
    x = x ORS "textdata = " $1
    data[i] = data[i] x
    next
}

{ # new projectname
    i = sz++
    idx[$2] = i
    x = "textline = 1"
    x = x ORS "projectname = " $2
    x = x ORS "textline = 2"
    x = x ORS "projectdata = " $1
    data[i] = x
    line[i] = 2
}

END {
    for (i=0; i<sz; ++i)
        print data[i]
}
Fill parray with project names and array with values, then print them with bash printf. You can choose the column width in the printf command (currently 13 characters - %13s).

#!/bin/bash

declare -i index=0
declare -i pindex=0

while read project; do
    parray[$pindex]=$project
    index=0
    while read; do
        array[$pindex,$index]="$REPLY"
        index+=1
    done <<< $(grep -h "$project" *.csv | cut -d, -f1)
    pindex+=1
done <<< $(cat *.csv | cut -d, -f 2 | sort -u)

maxi=$index
maxp=$pindex

for (( pindex=0; $pindex < $maxp; pindex+=1 )); do
    STR="%13s $STR"
    VAL="$VAL ${parray[$pindex]}"
done
printf "$STR\n" $VAL

for (( index=0; $index < $maxi; index+=1 )); do
    STR=""; VAL=""
    for (( pindex=0; $pindex < $maxp; pindex+=1 )); do
        STR="%13s $STR"
        VAL="$VAL ${array[$pindex,$index]}"
    done
    printf "$STR\n" $VAL
done
If you are OK with the output being sorted by name, this one-liner might be of use:

awk 'BEGIN {FS=",";OFS=","} {print $2,$1}' * | sort | uniq

The files have to be in the same directory; if not, a list of files replaces the *. First it exchanges the two fields. Awk will take a list of files and do the concatenation. Then sort the lines and print just the unique lines. This depends on the project size always being the same.

The simple one-liner above gives you one line for each project. If you really want to do it all in awk and have awk write the two lines, then the following is needed. A second awk at the end accumulates each column entry in an array, then spits it out at the end:

awk 'BEGIN {FS=","} {print $2,$1}' * | sort | uniq |
awk 'BEGIN {n=0}
     {p[n]=$1; s[n++]=$2}
     END {for (i=0;i<n;i++) printf "%s,",p[i]; print "";
          for (i=0;i<n;i++) printf "%s,",s[i]; print ""}'

If you have the rs utility, then this can be simplified to:

awk 'BEGIN {FS=","} {print $2,$1}' * | sort | uniq | rs -C',' -T
Split file with 800,000 columns
I want to split a file of genomic data with 800,000 columns and 40,000 rows into a series of files with 100 columns each, total size 118 GB. I am currently running the following bash script, multithreaded 15 times:

infile="$1"
start=$2
end=$3
step=$(($4-1))

for((curr=$start, start=$start, end=$end; curr+step <= end; curr+=step+1)); do
    cut -d' ' -f$curr-$((curr+step)) "$infile" > "${infile}.$curr"
done

However, judging by the current progress of the script, it will take 300 days to complete the split?! Is there a more efficient way to column-wise split a space-delimited file into smaller chunks?
Try this awk script:

awk -v cols=100 '{
    f = 1
    for (i = 1; i <= NF; i++) {
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
        f = int(i/cols) + 1
    }
}' largefile

I expect it to be faster than the shell script in the question.
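To see the field bookkeeping in action, here is the same script run with cols=2 on a toy two-row file (the file names are just for illustration):

```shell
# Toy check of the splitter: 3 columns split into files of up to 2 columns
printf '%s\n' 'a b c' 'd e f' > largefile

awk -v cols=2 '{
    f = 1
    for (i = 1; i <= NF; i++) {
        # emit OFS inside a chunk, ORS at a chunk boundary or end of line
        printf "%s%s", $i, (i % cols && i < NF ? OFS : ORS) > (FILENAME "." f)
        f = int(i/cols) + 1
    }
}' largefile
# largefile.1 now holds columns 1-2, largefile.2 holds column 3
```

Each output file keeps one row per input row, so the chunks line up when pasted back together.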
How to efficiently sum two columns in a file with 270,000+ rows in bash
I have two columns in a file, and I want to automate summing both values per row. For example:

read write
5 6
read write
10 2
read write
23 44

I want to then sum the "read" and "write" of each row. Eventually, after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to get rid of the column headers per row, which, as stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line. I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line:

lines=`grep -v READ $x | wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
    arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $2 + $3}'`
    echo $line_num
    line_num=$[$line_num+1]
    arr_num=$[$arr_num+1]
done

However, the file to be summed has 270,000+ rows. The script has been running for a few hours now and is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
Use awk instead and take advantage of modulus function: awk '!(NR%2){print $1+$2}' infile
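The sample input alternates header and data lines, so the even-numbered lines (NR%2 == 0) carry the numbers. A quick run on the question's data:

```shell
# !(NR%2) is true on even-numbered lines, i.e. the data lines here
printf '%s\n' 'read write' '5 6' 'read write' '10 2' 'read write' '23 44' |
awk '!(NR%2){print $1+$2}'
# prints 11, 12 and 67, one per line
```

A single awk process streams the whole file once, instead of re-reading it for every row as the question's loop does.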
awk is probably faster, but the idiomatic bash way to do this is something like:

while read -a line; do
    # read each line one-by-one, into an array
    # use arithmetic expansion to add col 1 and 2
    echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)

Note that the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins. The <( ) process substitution is used in case variables set in the while loop are required outside the scope of the loop; otherwise a | pipe could be used.
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:

awk '
NR%2 == 1 {next}
NR == 2 {max = $1+$2; next}
$1+$2 > max {max = $1+$2}
END {print max}
' filename
You could also use a pipeline with tools that implicitly loop over the input, like so:

grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE

This assumes there are spaces between your read and write data values.
Why not run:

awk 'NR==1 { print "sum"; next } { print $1 + $2 }'

You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process. You can use Perl or Python instead of awk if you prefer.

Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
Assuming that it's always one 'header' row followed by one 'data' row:

awk '
BEGIN { max = 0 }
{
    if (NR%2 == 0) {
        sum = $1 + $2
        if (sum > max) { max = sum }
    }
}
END { print max }' input.txt

Or simply trim out all lines that do not conform to what you want:

grep '^[0-9]\+\s\+[0-9]\+$' input.txt |
awk '
BEGIN { max = 0 }
{
    sum = $1 + $2
    if (sum > max) { max = sum }
}
END { print max }'
Shell script: copying columns by header in a csv file to another csv file
I have a csv file which I'll be using as input, with a format looking like this:

xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20

The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files. The output would then look something like this:

value1.csv:
xValue,value1-avg,value1-median
1,3,4

value2.csv:
xValue,value2-avg
1,20

I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files. Any help is greatly appreciated!

P.S. The output file may have already been written to during previous runs of this script, meaning the code should append to the output file.
Untested but should be close:

awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
}
{
    delete(outstr)
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}
' inFile.csv

Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.

Note that this appends the output you wanted to existing files, BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not, but the solution I THINK you'll want would look something like this:

awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
    for (i in outfiles) {
        outfile = outfiles[i]
        exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
        close(outfile)
    }
}
{
    delete(outstr)
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        if ( (NR > 1) || !exists[outfile] )
            print $1 outstr[outfile] >> outfile
}
' inFile.csv
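A quick way to check the first script is to run it on the sample input from the question. This sketch uses the portable split("",outstr) form of the array reset mentioned in the answer, so it also runs on non-gawk awks:

```shell
# Sample input from the question
printf '%s\n' \
  'xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median' \
  '1,3,4,20,14,20' > inFile.csv

# Start clean, since the script appends to its per-value output files
rm -f value1.csv value2.csv value3.csv

awk -F, '
NR==1 {
    for (i=2; i<=NF; i++) {
        outfile = $i
        sub(/-.*/, ".csv", outfile)   # value1-avg -> value1.csv
        outfiles[i] = outfile
    }
}
{
    split("", outstr)                 # portable whole-array reset
    for (i=2; i<=NF; i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}' inFile.csv
```

Afterwards value1.csv, value2.csv and value3.csv each hold their header row plus one data row, matching the desired output in the question.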
Just figure out the name associated with each column and use that mapping to manipulate the columns. Whether you're working in awk, ksh93, or bash, you can use associative arrays to store the column names and the rows they correspond to; perl, python, ruby and the like offer equivalent structures. Or push the columns into an array to map the names to column numbers. Either way, you then have a list of column headers, which can be manipulated however you need.
The solution I have found most useful for this kind of problem is to first retrieve the column numbers using an awk script (encapsulated in a shell function) and then follow with a cut statement. This technique/strategy turns into a very concise, general and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:

#! /bin/sh

fields() {
    LC_ALL=C awk -F, -v pattern="$1" '{
        j=0; split("", f)
        for (i=1; i<=NF; i++)
            if ($(i) ~ pattern) f[j++] = i
        if (j) {
            printf("%s", f[0])
            for (i=1; i<j; i++) printf(",%s", f[i])
        }
        exit 0
    }' "$2"
}

cut_fields_with_append() {
    if [ -s "$3" ]
    then
        cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
    else
        cut -d, -f `fields "$1" "$2"` "$2" > "$3"
    fi
}

cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait

The result is as you would expect:

$ ls
values  values.csv
$ cat values.csv
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv  value2.csv  value3.csv  values  values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
1,14,20
$
Convert tallies to relative probabilities
Background

Create a probability lexicon based on a CSV file of words and tallies. This is a prelude to a text segmentation problem, not a homework problem.

Problem

Given a CSV file with the following words and tallies:

aardvark,10
aardwolf,9
armadillo,9
platypus,5
zebra,1

Create a file with probabilities relative to the largest tally in the file:

aardvark,1
aardwolf,0.9
armadillo,0.9
platypus,0.5
zebra,0.1

Where, for example, aardvark,1 is calculated as aardvark,10/10 and platypus,0.5 is calculated as platypus,5/10.

Question

What is the most efficient way to implement a shell script to create the file of relative probabilities?

Constraints

Neither the words nor the numbers are in any order.
No major programming language (such as Perl, Ruby, Python, Java, C, Fortran, or Cobol).
Standard Unix tools such as awk, sed, or sort are welcome.
All probabilities must be relative to the highest probability in the file.
The words are unique; the numbers are not.
The tallies are natural numbers.

Thank you!
awk 'BEGIN {max=0; OFS=FS=","} $NF > max {max=$NF} NR > FNR {print $1, ($2/max)}' file file
No need to read the file twice:

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2} $2 > max {max=$2} END {for (w in a) print w, a[w]/max}' inputfile

If you need the output sorted by word:

awk ... | sort

or, using gawk's asort:

awk 'BEGIN {OFS = FS = ","} {a[$1] = $2; ind[j++] = $1} $2 > max {max=$2} END {n = asort(ind); for (i=1; i<=n; i++) print ind[i], a[ind[i]]/max}' inputfile

If you need the output sorted by probability:

awk ... | sort -t, -k2,2n -k1,1
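A quick sanity run of the single-pass script on the question's sample data (piped through sort only because the order of a for (w in a) loop is unspecified):

```shell
# Question's sample tallies, fed on stdin; output sorted for stable order
printf '%s\n' 'aardvark,10' 'aardwolf,9' 'armadillo,9' 'platypus,5' 'zebra,1' |
awk 'BEGIN {OFS = FS = ","}
     {a[$1] = $2}            # remember every word and its tally
     $2 > max {max = $2}     # track the largest tally seen
     END {for (w in a) print w, a[w]/max}' | sort
```

This prints the expected lexicon: aardvark,1 through zebra,0.1.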
This is not error-proof, but something like this should work:

#!/bin/bash

INPUT=data.csv
OUTPUT=tally.csv
DIGITS=1

OLDIFS=$IFS
IFS=,

maxval=0
# Assuming all $val are positive
while read name val
do
    if (( val > maxval )); then maxval=$val; fi
done < $INPUT

# Make sure $OUTPUT exists
touch $OUTPUT

while read name val
do
    tally=`echo "scale=$DIGITS; result=$val/$maxval; if (0 <= result && result < 1) { print \"0\" }; print result" | bc`
    echo "$name,$tally" >> $OUTPUT
done < $INPUT

IFS=$OLDIFS

Borrowed from this question, and various googling.