Totalling a column based on another column - bash

Say I have a text file items.txt
Another Purple Item:1:3:01APR13
Another Green Item:1:8:02APR13
Another Yellow Item:1:3:01APR13
Another Orange Item:5:3:04APR13
Where the 2nd column is price and the 3rd is quantity, how could I loop through this in bash so that each unique date gets a total of price * quantity?

Try this awk one-liner:
awk -F: '{v[$NF]+=$2*$3}END{for(x in v)print x, v[x]}' file
Here $NF is the last field (the date), so v accumulates price * quantity per unique date.
result:
01APR13 6
04APR13 15
02APR13 8
EDIT: sorting
As I commented, there are two approaches to sort the output by date; I'll take the simpler one:
kent$ awk -F: '{ v[$NF]+=$2*$3}END{for(x in v){"date -d\""x"\" +%F"|getline d;print d,x,v[x]}}' file|sort|awk '$0=$2" "$3'
01APR13 6
02APR13 8
04APR13 15

Take a look at bash's associative arrays.
You can create a map from dates to values, then go over the lines, compute price * quantity, and add it to the value currently in the map (or insert it if not present).
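A minimal sketch of that approach (requires bash 4+ for declare -A; assumes the items.txt layout shown above):
declare -A total
while IFS=: read -r name price qty date
do
    # accumulate price * quantity per unique date
    (( total[$date] += price * qty ))
done < items.txt
for d in "${!total[@]}"
do
    echo "$d ${total[$d]}"
done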

Related

How to get the mean value of several columns when another column matches a value

I have a big data file with many columns. I would like to get the mean value of some of the columns if another column has a specific value.
For example, if $19=9.1 then get the mean of $24, $25, $27, $28, $32 and $35, and write these values in a file like
9.1 (mean$24) (mean$25) ..... (mean$32) (mean$35)
and add two more lines for two other values of the $19 column, for example 11.9 and 13.9, resulting in:
9.1 (mean$24) (mean$25) ..... (mean$32) (mean$35)
11.9 (mean$24) (mean$25) ..... (mean$32) (mean$35)
13.9 (mean$24) (mean$25) ..... (mean$32) (mean$35)
I have seen the post "awk average part of a column if lines (specific field) match", which takes the mean of only one column when the first matches some value, but I do not know how to extend that solution to my problem.
This should work, if you fill in the blanks...
$ awk 'BEGIN {n=split("9.1 11.9 13.9",a)}
       {k=$19; c[k]++; m24[k]+=$24; m25[k]+=$25; ...}
       END {for(i=1;i<=n;i++) print k=a[i], m24[k]/c[k], m25[k]/c[k], ...}' file
Perhaps handle the c[k]==0 case as well, with something like this:
function mean(sum,count) {return (count==0?"NaN":sum/count)}
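For illustration, a filled-in sketch covering just $24 and $25 (extend the same pattern to $27, $28, $32 and $35; the input file name is a placeholder):
awk 'function mean(sum,count) {return (count==0 ? "NaN" : sum/count)}
     BEGIN {n=split("9.1 11.9 13.9",a)}
     {k=$19; c[k]++; m24[k]+=$24; m25[k]+=$25}
     END {for(i=1;i<=n;i++) {k=a[i]; print k, mean(m24[k],c[k]), mean(m25[k],c[k])}}' file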

bash: identifying the first value in a list that also exists in another list

I have been trying to come up with a nice way in bash to find the first entry in list A that also exists in list B, where A and B are in separate files.
A B
1024dbeb 8e450d71
7e474d46 8e450d71
1126daeb 1124dae9
7e474d46 7e474d46
1124dae9 3217a53b
In the example above, 7e474d46 is the first entry in A that also appears in B, so I would return 7e474d46.
Note: A can be millions of entries, and B can be around 300.
awk is your friend:
awk 'NR==FNR{a[$1]++;next}{if(a[$1]>=1){print $1;exit}}' file2 file1
7e474d46
NR==FNR is true only while the first file (file2, i.e. list B) is being read, so B's values are collected into a; then the first line of file1 whose value is in a is printed, and awk exits.
Note: check the [ previous version ] of this answer too, which assumed the values were listed as two columns in a single file. This one was written after you clarified in [ this ] comment that the values are fed in as two files.
A few points are not clear, though, such as: what if a value in list A appears two or more times? (In your example, 7e474d46 itself appears twice.) Assuming you need all the line numbers of the list A entries that are also present in list B, the following will help:
awk '{col1[$1]=col1[$1]?col1[$1]","FNR:FNR;col2[$2];} END{for(i in col1){if(i in col2){print col1[i],i}}}' Input_file
Or, in non-one-liner form:
awk '{
    col1[$1]=col1[$1]?col1[$1]","FNR:FNR;
    col2[$2];
}
END{
    for(i in col1){
        if(i in col2){
            print col1[i],i
        }
    }
}
' Input_file
The above code produces the following output:
3,5 7e474d46
6 1124dae9
This creates array col1, indexed by the first field, whose value is the current line number appended to any previously seen line numbers, and array col2, indexed by the second field. The END section then traverses col1 and, for each index also present in col2, prints col1's value (the line numbers) and the index itself.
If you have GNU grep, you can try this:
grep -m 1 -f B A
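If the values may contain regex metacharacters or be substrings of one another, it is safer to treat B's entries as fixed, full-line strings:
grep -m 1 -Fxf B A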

How to split a matrix market file according to the number of rows of the matrix stored in it?

I would like to split a Matrix Market file into two parts of varying sizes, where the sizes correspond to numbers of rows of the matrix stored in the file.
There are some examples:
http://math.nist.gov/MatrixMarket/formats.html
For a usual matrix file of, e.g., 100 rows it is quite easy:
head -n 70 matrix1.mtx > matrix170.mtx
tail -n 30 matrix1.mtx > matrix130.mtx
where matrix170.mtx gets the first 70 lines of matrix1.mtx, and so on.
Thank you.
awk to the rescue!
You can use this script to split the matrix file unevenly:
awk -v splits='70 30' 'BEGIN{n=split(splits,s); i=1; limit=s[i]}
NR==1{split(FILENAME,f,".")}
NR>limit{limit+=s[++i]}   # move to the next chunk once this one is full
i>n{exit}                 # stop after the last requested chunk
{print > (f[1] s[i] "." f[2])}' matrix.mtx
This will generate two files, matrix70.mtx and matrix30.mtx, with names derived from the input file name and the split values.
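A quick sanity check of the result (assuming matrix.mtx has 100 lines):
wc -l matrix70.mtx matrix30.mtx   # expect 70 and 30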

How I can extract some numbers from a value in bash

I'm trying to separate numbers from a value in bash. For example, I have a text file with the following row:
2015 0212 0455 25.0 L -20.270 -70.950 44.0 GUC 4.6LGUC 1
I need to separate the number 0212 in the second column in order to get two numbers: num1=02 and num2=12. The same goes for the number in the third column.
I'd like to find a generalized method with awk or sed to do this, because other files have this line:
2015 0212 0455 25.0 L -20.270 -70.950136.0 GUC 4.6LGUC 1
And in that case I also have to separate the value -70.950136.0 in two numbers: -70.950 and 136.0. In this case the first number always has the same length: -70.950, -69.320, -68.000, etc.
Assuming fixed-length records:
sed 's/.\{37\}/& /;s/.\{29\}/& /;s/.\{21\}/& /;s/.\{12\}/& /;s/.\{7\}/& /' YourFile
Adapt it to your needs by adding or removing s/.\{IndexOfCharInLine\}/& /; expressions. Note that the substitutions run from the largest offset to the smallest, so inserting a space never shifts the positions that are still to be split.
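As a sketch of the same fixed-width idea in awk (the column widths are assumptions read off the sample rows):
awk '{ print substr($2,1,2), substr($2,3,2) }' file
prints 02 12 for the second column, and since the first of the two merged numbers always has 7 characters (-70.950, -69.320, ...):
echo '-70.950136.0' | awk '{ print substr($1,1,7), substr($1,8) }'
prints -70.950 136.0.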

How to Write a Unix Shell Script to Sum the Values in a Row Against Each Unique Column (e.g., how to calculate total votes for each distinct candidate)

In its basic form: I am given a text file with state vote results from the 2012 Presidential Election, and I need to write a one-line shell script in Unix to determine which candidate won. The file has various fields, one of which is CandidateName and another TotalVotes. Each record in the file is the result from one precinct within the state, so there are many records for any given CandidateName. What I'd like to do is sort the data by CandidateName and then sum TotalVotes for each unique CandidateName (so each sum starts at a unique CandidateName and ends before the next one).
With awk and its associative arrays there is no need for sorting. For convenience, the data file format can be:
precinct1:candidate name1:732
precinct1:candidate2 name:1435
precinct2:candidate name1:9920
precinct2:candidate2 name:1238
Thus you need to create totals of field 3 based on field 2 with : as the delimiter.
awk -F: '{sum[$2] += $3} END { for (name in sum) { print name " = " sum[name] } }' data.file
Some versions of awk can sort internally; others can't. I'd use the sort program to process the results:
sort -t= -k2nb
(field separator is the = sign; the sort is on field 2, which is a numeric field, possibly with leading blanks).
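Putting the two together, the winner is the last line after the numeric sort:
awk -F: '{sum[$2] += $3} END { for (name in sum) { print name " = " sum[name] } }' data.file | sort -t= -k2nb | tail -n 1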
Not quite one line, but it will work:
$ cat votes.txt
Colorado Obama 50
Colorado Romney 20
Colorado Gingrich 30
Florida Obama 60
Florida Romney 20
Florida Gingrich 30
script
while read loc can num
do
    # first time we see this candidate, remember the name
    if ! [ "${!can}" ]
    then
        cans+=("$can")
    fi
    # accumulate into a variable named after the candidate (indirect reference);
    # this relies on candidate names being valid shell identifiers
    (( $can += num ))
done < votes.txt

for can in "${cans[@]}"
do
    echo "$can" "${!can}"
done
output
Obama 110
Romney 40
Gingrich 60
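For the votes.txt layout above, the same associative-array idea also fits in one awk line (output order is arbitrary):
awk '{sum[$2] += $3} END {for (c in sum) print c, sum[c]}' votes.txt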
