How to divide my script output by the output of another command? - bash

I have a folder, my_folder, which contains over 800 files named myfile_*.dat, where * is the unique ID for each file. Each file contains a variety of repeated fields, but the one I am interested in is the <rating> field. Lines of this field look like the following: <rating>n, where n is the rating score. I have a script which sums up all of the ratings per file, but now I must divide that sum by the number of lines that contain <rating>n in order to obtain an average rating per file. Here is my script:
dir=$1
cd $dir
grep -P -o '(?<=<rating>).*' * |awk -F: '{A[$1]+=$2;next}END{for(i in A){print i,A[i]}}'|sort -nr -k2
I figure I could use grep -c '<rating>' myfile_*.dat to count the number of matching lines and then divide the sum by this count per file, but I do not know where to fit this into my script. Any suggestions are appreciated.
My script takes the folder name as an argument in the command line.
INPUT FILE
<Overall Rating>
<Avg. Price>$155
<URL>
<Author>Jeter5
<Content>I hope we're not disappointed! We enjoyed New Orleans...
<Date>Dec 19, 2008
<No. Reader>-1
<No. Helpful>-1
<rating>4
<Value>-1
<Rooms>3
<Location>5
<Cleanliness>3
<Check in / front desk>5
<Service>5
<Business service>5
<Author>...
repeat fields again...

Just set up another array L to track the count of items:
grep -P -o '(?<=<rating>).*' * |
awk -F: '{A[$1]+=$2;L[$1]++;next}END{for(i in A){print i,A[i],A[i]/L[i]}}' |
sort -nr -k2
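For what it's worth, the same thing could be done in a single awk pass without grep, keying on FILENAME instead of grep's filename: prefix (a sketch, untested; it assumes the ratings always appear as <rating>N lines in the .dat files):
awk -F'>' '$1=="<rating" {A[FILENAME]+=$2; L[FILENAME]++}
           END{for(i in A) print i, A[i], A[i]/L[i]}' *.dat | sort -nr -k2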

Related

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the last two numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges over 1, 2, 3, ..., 10.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result should have a row for each pair of averaged 2nd columns of the sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds to the sys and known files that share the same last two numbers. The first column holds the penultimate number of those files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (though you could easily swap in awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
# One temporary file per output column
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
    # The known-ratio file that pairs with this sys-time file
    knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
    # Penultimate number of the filename -> first column
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    # Mean of column 2 of each file -> second and third columns
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste to combine them all and print the result to standard output.
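And if you'd rather avoid the datamash dependency, the two datamash calls could be swapped for the mean-of-column-2 awk idiom from the question, e.g. (a sketch under the same assumptions about the files' second column):
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$systime"    >> "$sysmeans"
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$knownratio" >> "$knownmeans"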
I've used GNU Awk for its easy per-file operations (BEGINFILE/ENDFILE). This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" | sort -t\- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); sub(/^\.\//, "", f[1]); type=f[1]; a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if (type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
1. Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
2. At the beginning of every file, clear any existing average value and the line counter, and save the type as either "sys" or "known".
3. On every line, update the Cumulative Moving Average (the recurrence is illustrated just after this list).
4. At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
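In case that recurrence looks opaque: it is the standard cumulative-moving-average update, avg_new = (x + avg_old * c) / (c + 1), which after a file's last line equals the plain mean sum/count. A standalone illustration (the single-column file numbers.txt is hypothetical):
awk '{ avg = ($1 + avg * c++) / c } END { print "mean:", avg }' numbers.txt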

How to sort a 3 GB access log file?

Hi all: I have a 3 GB Tomcat access log named urls; each line is a URL. I want to count each URL and sort the URLs by their number of occurrences. I did it this way:
awk '{print $0}' urls | sort | uniq -c | sort -nr >> output
But it is taking a really long time to finish; it has already been running for 30 minutes and is still working.
The log file looks like this:
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/open_api/borrow_business/get_apply_by_user
/loan/recent_apply_info?passportId=Y20151206000011745
/loan/recent_apply_info?passportId=Y20160331000000423
/open_api/borrow_business/get_apply_by_user
...
Is there any other way that I could process and sort a 3 GB file? Thanks in advance!
I'm not sure why you're using awk at the moment - it's not doing anything useful.
I would suggest using something like this:
awk '{ ++urls[$0] } END { for (i in urls) print urls[i], i }' urls | sort -nr
This builds up a count of each URL and then sorts the output.
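Run against the seven sample lines shown in the question, the output would look something like this (lines with equal counts may come out in either order):
5 /open_api/borrow_business/get_apply_by_user
1 /loan/recent_apply_info?passportId=Y20160331000000423
1 /loan/recent_apply_info?passportId=Y20151206000011745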
I generated a sample file of 3,200,000 lines, amounting to 3GB, using Perl like this:
perl -e 'for($i=0;$i<3200000;$i++){printf "%d, %s\n",int rand 1000, "0"x1000}' > BigBoy
I then tried sorting it in one step, followed by splitting it into 2 halves and sorting the halves separately and merging the results, then splitting into 4 parts and sorting separately and merging, then splitting into 8 parts and sorting separately and merging.
This resulted, on my machine at least, in a very significant speedup.
Here is the script. The filename is hard-coded as BigBoy but could easily be changed; the number of parts to split the file into is supplied as a parameter (it defaults to 1).
#!/bin/bash -xv
################################################################################
# Sort large file by parts and merge result
#
# Generate sample large (3GB with 3,200,000 lines) file with:
# perl -e 'for($i=0;$i<3200000;$i++){printf "%d, %s\n",int rand 1000, "0"x1000}' > BigBoy
################################################################################
file=BigBoy
N=${1:-1}
echo $N
if [ $N -eq 1 ]; then
    # Straightforward sort
    sort -n "$file" > sorted.$N
else
    rm sortedparts-* parts-* 2> /dev/null
    tlines=$(wc -l < "$file")
    echo $tlines
    ((plines=tlines/N))
    echo $plines
    split -l $plines "$file" parts-
    for f in parts-*; do
        sort -n "$f" > "sortedparts-$f" &
    done
    wait
    sort -n -m sortedparts-* > sorted.$N
fi
Needless to say, the resulting sorted files are identical :-)
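For completeness, a typical invocation would look like this (partsort.sh is just a placeholder name for wherever you saved the script):
./partsort.sh       # N defaults to 1: a single straightforward sort into sorted.1
./partsort.sh 8     # split into 8 parts, sort them in parallel, merge into sorted.8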

How do I calculate the standard deviation in my shell script?

I have a shell script:
dir=$1
cd $dir
grep -P -o '(?<=<rating>).*' * |
awk -F: '{A[$1]+=$2;L[$1]++;next}
         END{for(i in A){print i, A[i]/L[i]}}' | sort -nr -k2 |
awk '{ sub(/.dat/, " "); print }'
which averages all of the numbers that follow the <rating> field in each file of my folder. Now I need to calculate the standard deviation of the numbers rather than the average, i.e., by summing the squared difference of each rating from the mean, dividing this by the sample size minus 1, and taking the square root. I do not need to do this for every file in the folder, only for 2 specific files, hotel_188937.dat and hotel_203921.dat. Here is an example of the contents of one of these files:
<Overall Rating>
<Avg. Price>$155
<URL>
<Author>Jeter5
<Content>I hope we're not disappointed! We enjoyed New Orleans...
<Date>Dec 19, 2008
<No. Reader>-1
<No. Helpful>-1
<rating>4
<Value>-1
<Rooms>3
<Location>5
<Cleanliness>3
<Check in / front desk>5
<Service>5
<Business service>5
<Author>...
repeat fields again...
The sample size of the first file is 127 with a mean of 4.78, compared with a sample size of 324 and a mean of 4.78 for the second file. Is there any way that I can alter my script to calculate the standard deviation for these two specific files rather than calculating the average for every file in my directory? Thanks for your time.
You can do it all in one awk script:
$ awk -F'>' '
$1=="<rating" {k=FILENAME;sub(/.dat/,"",k);
s[k]+=$2;ss[k]+=$2^2;c[k]++}
END{for(i in s)
print i,m=s[i]/c[i],sqrt(ss[i]/c[i]-m^2)}' r1.dat r2.dat
r1 2.5 1.11803
r2 3 1.41421
s is for sum, ss for square sum, c for count, m for mean. Note that this computes the population standard deviation, not the sample standard deviation. For the latter you need a scaling adjustment with (count-1), as sketched below.
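A sketch of that adjustment, reusing the same sums (the Bessel-corrected sample variance is (ss - c*m^2)/(c - 1), so the only change is in the final print), run here on the two files named in the question:
awk -F'>' '
$1=="<rating" {k=FILENAME;sub(/.dat/,"",k);
s[k]+=$2;ss[k]+=$2^2;c[k]++}
END{for(i in s){m=s[i]/c[i];
print i,m,sqrt((ss[i]-c[i]*m^2)/(c[i]-1))}}' hotel_188937.dat hotel_203921.dat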
Yes.
The * in the grep line tells it to search in all the files.
Change the line
grep -P -o '(?<=<rating>).*' * |
to
grep -P -o '(?<=<rating>).*' hotel_188937.dat hotel_203921.dat |

How to split text files by number of rows that corresponds to another set of files?

Cut a file into several files according to numbers in a list:
$ wc -l all.txt
8500 all.txt
$ wc -l STS.*.txt
2000 STS.input.answers-forums.txt
1500 STS.input.answers-students.txt
2000 STS.input.belief.txt
1500 STS.input.headlines.txt
1500 STS.input.images.txt
How do I split my all.txt according to the number of lines in each STS.*.txt and save the pieces to the respective STS.output.*.txt files?
I've been doing it manually as such:
$ sed '1,2000!d' all.txt > STS.output.answers-forums.txt
$ sed '2001,3500!d' all.txt > STS.output.answers-students.txt
$ sed '3501,5500!d' all.txt > STS.output.belief.txt
$ sed '5501,7000!d' all.txt > STS.output.headlines.txt
$ sed '7001,8500!d' all.txt > STS.output.images.txt
The all.txt input would look something like this:
$ head all.txt
2.3059
2.2371
2.1277
2.1261
2.0576
2.0141
2.0206
2.0397
1.9467
1.8518
Or sometimes all.txt looks like this:
$ head all.txt
2.3059 92.123
2.2371 1.123
2.1277 0.12452
2.1261123 213
2.0576 100
2.0141 0
2.02062 1
2.03972 34.123
1.9467 9.23
1.8518 9123.1
As for the STS.*.txt, they are just plain text lines, e.g.:
$ head STS.output.answers-forums.txt
The problem likely will mean corrective changes before the shuttle fleet starts flying again. He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.
The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650. The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.
"It's a huge black eye," said publisher Arthur Ochs Sulzberger Jr., whose family has controlled the paper since 1896. "It's a huge black eye," Arthur Sulzberger, the newspaper's publisher, said of the scandal.
I wish you'd posted some sample input for splitting an input file of, say, 10 lines into output files of, say, 2, 3, and 5 lines instead of 8500 lines, as that would have given us something to test a solution against. Oh well, this might work, but it is untested of course:
awk '
ARGIND < (ARGC-1) { outfile[NR] = gensub(/input/,"output","",FILENAME); next }
{ print > outfile[FNR] }
' STS.input.* all.txt
The above uses GNU awk for ARGIND and gensub().
It just creates an array that maps each line number across all "input" files to the name of the "output" file that that same line number of "all.txt" should be written to.
Any time you write a loop in shell just to manipulate text you have the wrong approach. The guys who created shell also created awk for shell to call to manipulate text so just do that.
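If you want something to test it against, a throwaway case along the lines the answer asks for might look like this (the names just follow the question's convention; the 2-, 3-, and 5-line counts are arbitrary):
seq 1 2  > STS.input.answers-forums.txt
seq 1 3  > STS.input.belief.txt
seq 1 5  > STS.input.headlines.txt
seq 1 10 > all.txt
awk '
ARGIND < (ARGC-1) { outfile[NR] = gensub(/input/,"output","",FILENAME); next }
{ print > outfile[FNR] }
' STS.input.* all.txt
wc -l STS.output.*.txt    # expect 2, 3, and 5 lines respectively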
I would suggest writing a loop:
total=0
for file in answers-forums answers-students belief headlines images; do
    lines=$(wc -l < "STS.input.$file.txt")
    sed "$(( total + 1 )),$(( total + lines ))!d" all.txt > "STS.output.$file.txt"
    (( total += lines ))
done
total keeps track of how many lines have been consumed so far. The sed command extracts the lines from total + 1 to total + lines, writing them to the corresponding output file.

UNIX shell-scripting: Split a textfile by its entries

I'm trying to analyze an enormous text file (1.6GB), whose data lines look like this:
20090118025859 -2.400000 78.100000 1023.200000 0.000000
20090118025900 -2.500000 78.100000 1023.200000 0.000000
20090118025901 -2.400000 78.100000 1023.200000 0.000000
I don't even know how many lines there are. But I'm trying to split the file by date. The leftmost number is a timestamp (these lines, for example, are from January 18th, 2009).
How can I split this file into pieces according to the date?
The number of entries per date differs, so using split with a constant number won't work.
All I can think of is something like grep '^20090118' file > data20090118.dat, but surely there is a way to do all the dates at once, right?
Thanks in advance,
Alex
Using awk:
awk '{print > "data"substr($1,1,8)".dat"}' myfile
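One practical note: with an input that spans many distinct dates, a non-gawk awk may run out of open file descriptors. A variant that closes each output file when the date changes (a sketch, relying on the log being in timestamp order, as the question's sample suggests):
awk 'substr($1,1,8) != d { if (d) close(out); d = substr($1,1,8); out = "data" d ".dat" }
     { print > out }' myfile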
This should work if the items are in date sequence:
date=20090101 # Change to the earliest date
while IFS= read -r line
do
    day=$(echo "$line" | cut -d ' ' -f 1 | cut -c 1-8)
    if [ "$day" != "$date" ]
    then
        date=$day    # a new date has started
    fi
    echo "$line" >> "$date.dat"
done < log.dat
With the caveats that each day needs to have more than 1 record, and that the output files will have blank lines:
uniq --all-repeated=separate -w8 file | csplit -s - '/^$/' '{*}'
uniq really should have an option to output even unique records. Also, csplit should have an option to suppress the matched line.
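If you want the pieces named by date rather than csplit's default xx00, xx01, ... (and without the leading blank separator line), something along these lines should do it (a sketch using GNU sed's -i; the data<date>.dat naming follows the earlier awk answer):
for f in xx*; do
    sed -i '/^$/d' "$f"                      # drop the blank separator line(s)
    mv "$f" "data$(head -c 8 "$f").dat"      # first 8 characters of the first line = the date
done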
