Multiple column sort

I have some data in the following format:
1298501934.311 42.048
1298501934.311 60.096
1298501934.311 64.128
1298501934.311 64.839
1298501944.203 28.352
1298501966.283 6.144
1298501972.900 0
1298501972.939 0
1298501972.943 0
1298501972.960 0
1298501972.961 0
1298501972.964 0
1298501973.964 28.636
1298501974.215 27.52
1298501974.407 25.984
1298501974.527 27.072
1298501974.527 31.168
1298501974.591 30.144
1298501974.591 31.296
1298501974.83 27.605
1298501975.804 28.096
1298501976.271 23.879
1298501978.488 25.472
1298501978.744 25.088
1298501978.808 25.088
1298501978.936 26.24
1298501979.123 26.048
1298501980.470 23.75
1298501980.86 17.53
1298501982.392 22.336
1298501990.199 8.064
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44
My goal is to get the maximum value from the right column for each unique value in the left column. For instance, after processing the following 4 lines:
1298501997.943 0.256
1298501997.943 0.448
1298501997.943 0.512
1298501997.943 5.952
I would like to get just the last line,
1298501997.943 5.952
since "5.952" is the largest value for 1298501997.943
Similarly, for the following lines:
1298501997.946 0.448
1298501997.946 0.576
1298501997.946 5.44
I would like to get:
1298501997.946 5.44
And for:
1298501990.199 8.064
simply:
1298501990.199 8.064
and so on...
I tried searching for hints involving awk/uniq/etc., but I'm not even sure how to formulate the query.
I could write a Python script, but it feels like awk or some other standard tool would be more efficient (especially since I have a lot of data: millions or tens of millions of lines).
PS: Is there any Python module for text processing scenarios like that?
Thank you

You could put it in Excel (importing it by splitting on the SPACE character) and sort it that way. This is a rather brute-force solution, but it's simple.

Use awk:
{
    # keep the largest second-column value seen so far for each first-column key
    if (!($1 in array) || array[$1] < $2)
        array[$1] = $2
}
END {
    printf("%-20s%s\n", "Value", "Max")
    printf("%-20s%s\n", "-----", "---")
    for (i in array)
        printf("%-20s%s\n", i, array[i])
}
Output:
$ awk -f sort.awk log
Value Max
----- ---
1298501980.86 17.53
1298501978.808 25.088
1298501974.215 27.52
1298501973.964 28.636
1298501979.123 26.048
1298501978.936 26.24
1298501975.804 28.096
1298501972.964 0
1298501944.203 28.352
1298501974.83 27.605
1298501974.407 25.984
1298501997.943 5.952 <---- as in your example
1298501978.488 25.472
1298501972.939 0
1298501972.900 0
1298501982.392 22.336
1298501974.527 31.168
1298501997.946 5.44 <---- as in your example
1298501980.470 23.75
1298501974.591 31.296
1298501990.199 8.064 <---- as in your example
1298501966.283 6.144
1298501934.311 64.839
1298501976.271 23.879
1298501972.960 0
1298501978.744 25.088
1298501972.961 0
1298501972.943 0
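Note that for (i in array) visits the keys in no particular order, which is why the output above is unsorted. If you want the report ordered by timestamp, one option (a sketch, assuming GNU sort) is to skip the two header lines and sort the rest numerically:
awk -f sort.awk log | tail -n +3 | sort -k1,1g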

A simple sort -g does the trick. It is a general numeric sort and handles this space-separated data.
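sort on its own only orders the lines, though; to end up with one line per timestamp you still need a second pass that keeps the maximum. A minimal sketch, assuming GNU sort and that the data is in a file called data.txt (a name chosen here just for illustration):
# within each timestamp the largest value sorts last, so it overwrites earlier entries
sort -k1,1g -k2,2g data.txt | awk '{ max[$1] = $2 } END { for (k in max) print k, max[k] }'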

I doubt Python would be significantly less efficient here than other tools (unless you need to process millions of lines every fraction of a second). You can do something like this:
import sys

d = {}
for line in open(sys.argv[1]):
    a, b = (float(item) for item in line.split())
    d[a] = max(d.get(a, b), b)

for a in d:
    print(a, d[a])
and run it with
$ python script.py dataFile

As a shell one-liner (it uses the -f argument of uniq, which skips the first n fields when comparing; since we want to compare on the timestamp only, the columns are swapped before uniq and swapped back afterwards). Because uniq keeps the first line of each group, the values are sorted in descending order within each timestamp so the maximum is the one that survives:
sort -k1,1g -k2,2gr yourData | awk '{print $2,$1}' | uniq -f1 | awk '{print $2,$1}'

Related

Parsing multiline program output

I've recently been working on some lab assignments, and in order to collect and analyze results well, I prepared a bash script to automate my job. It was my first attempt at creating such a script, so it is not perfect, and my question is about improving it.
Example output of the program is shown below, but I would like to make the script more general for other purposes.
>>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478
All the data I want to collect is always on different lines. My first attempt was to run the same program twice (or more times, depending on the amount of data) and then use grep on each run to extract the data I need by looking for a keyword. That is very inefficient; it should be possible to parse the whole output of a single run, but I could not come up with a way to do it. At the moment the script is:
#!/bin/bash
write() {
    o1=$(./progname args | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
    o2=$(./progname args | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
    o3=$(./progname args | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
    echo "$1 $o1 $o2 $o3"
}

for ((i = 1; i <= 10; i++)); do
    write $i >> times.dat
done
It is worth mentioning that echoing results in one line is crucial, as I am using gnuplot later and having data in columns is perfect for that use. Sample output should be:
1 0.019306 3.369 170620476
2 0.019559 3.375 170620475
3 0.021971 3.334 170620478
4 0.020536 3.378 170620480
5 0.019692 3.390 170620475
6 0.020833 3.375 170620477
7 0.019951 3.450 170620477
8 0.019417 3.381 170620476
9 0.020105 3.374 170620476
10 0.020255 3.402 170620475
My question is: how could I improve the script to collect such data in just one program execution?
You could use awk here to get the values into an array and later access them by index 0, 1 and 2, in case you want to do this in a single command.
myarr=($(your_program args | awk '/Total/{print $NF;next} /cycle/{print $NF;next} /Time/{print $(NF-1)}'))
Or use the following to force all the values onto a single line of awk output, so they do not arrive on separate lines even if the command substitution is quoted with " to preserve newlines.
myarr=($(your_program args | awk '/Total/{val=$NF;next} /cycle/{val=(val?val OFS:"")$NF;next} /Time/{print val OFS $(NF-1)}'))
Explanation: here is a detailed breakdown of the first awk program above.
awk ' ##Starting awk program from here.
/Total/{ ##Checking if a line has Total keyword in it then do following.
print $NF ##Printing last field of that line which has Total in it here.
next ##next keyword will skip all further statements from here.
}
/cycle/{ ##Checking if a line has cycle in it then do following.
print $NF ##Printing last field of that line which has cycle in it here.
next ##next keyword will skip all further statements from here.
}
/Time/{ ##Checking if a line has Time in it then do following.
print $(NF-1) ##Printing 2nd last field of that line which has Time in it here.
}'
To access individual items you could use:
echo "${myarr[0]}", echo "${myarr[1]}" and echo "${myarr[2]}" for Total, cycle and Time respectively.
Example of accessing all elements in a loop, in case you need it:
for i in "${myarr[@]}"
do
    echo "$i"
done
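To get the one-line-per-run format the question needs (run index, Time, cycle, Total), a minimal sketch, assuming the array was filled by the first command above and that the loop counter is i as in the question's script:
# element order follows the program output: myarr[0]=Total, myarr[1]=cycle, myarr[2]=Time
echo "$i ${myarr[2]} ${myarr[1]} ${myarr[0]}" >> times.dat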
You can execute your program once and save the output in a variable.
o0=$(./progname args)
Then you can grep that saved string as many times as you like, like this:
o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
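Putting it together, a sketch of how write() might look with a single run, reusing the exact grep patterns from the question:
write() {
    o0=$(./progname args)                                        # run the program once
    o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
    o2=$(echo "$o0" | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
    o3=$(echo "$o0" | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
    echo "$1 $o1 $o2 $o3"
}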
Assumptions:
each of the 3 search patterns (Time, cycle, Total) occurs just once in a set of output from ./progname
the format of the ./progname output is always the same (i.e., the same number of space-separated fields on each line of output)
I've created my own progname script that just does an echo of the sample output:
$ cat progname
echo ">>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478"
One awk solution to parse and print the desired values:
$ i=1
$ ./progname | awk -v i=${i} ' # assign awk variable "i" = ${i}
/Time/ { o1 = $3 } # o1 = field 3 of line that contains string "Time"
/cycle/ { o2 = $5 } # o2 = field 5 of line that contains string "cycle"
/Total/ { o3 = $4 } # o3 = field 4 of line that contains string "Total"
END { printf "%s %s %s %s\n", i, o1, o2, o3 } # print 4x variables to stdout
'
1 0.042127 3.386 170620482
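To generate the same times.dat as the original script, this awk command can simply replace the body of write(); a sketch, assuming progname and its arguments as in the question:
for ((i = 1; i <= 10; i++)); do
    ./progname args | awk -v i="$i" '
        /Time/  { o1 = $3 }
        /cycle/ { o2 = $5 }
        /Total/ { o3 = $4 }
        END     { printf "%s %s %s %s\n", i, o1, o2, o3 }
    ' >> times.dat
done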

Multiplication of two variables containing tuples in BASH script

I have two variables containing tuples of the same length, generated from a PostgreSQL database and several successful follow-on calculations, and I would like to multiply them to generate a third variable containing the answer tuple. Each tuple contains 100 numeric records. Variable 1 is called rev_p_client_pa and variable 2 is called lawn_p_client. I tried the following, which gives me a third tuple, but the answer rows are not calculated correctly:
rev_p_client_pa data is:
0.018183
0.0202814
0.013676
0.0134083
0.0108168
0.014197
0.0202814
lawn_p_client data is:
52.17
45
30.43
50
40
35
50
The command I used in the script:
awk -v var3="$rev_p_client_pa" 'BEGIN{print var3}' | awk -v var4="$lawn_p_client" -F ',' '{print $(1)*var4}'
The command gives the following output:
0.948607
1.05808
0.713477
0.699511
0.564312
0.740657
1.05808
However, when calculated manually in LibreOffice Calc I get:
0.94860711
0.912663
0.41616068
0.670415
0.432672
0.496895
1.01407
I used this awk structure to multiply a tuple variable by a single numeric variable in a previous calculation and it worked correctly. Does someone know how the correct awk statement should be written, or do you have other ideas that might be useful? Thanks for your help.
Use paste to join the two data sets together, forming a list of pairs, each separated by tab.
Then pipe the result to awk to multiply each pair of numbers, resulting in a list of products.
#!/bin/bash
rev_p_client_pa='0.018183
0.0202814
0.013676
0.0134083
0.0108168
0.014197
0.0202814'
lawn_p_client='52.17
45
30.43
50
40
35
50'
paste <(echo "$rev_p_client_pa") <(echo "$lawn_p_client") | awk '{print $1*$2}'
Output:
0.948607
0.912663
0.416161
0.670415
0.432672
0.496895
1.01407
All awk:
$ awk -v rev_p_client_pa="$rev_p_client_pa" \
-v lawn_p_client="$lawn_p_client" ' # "tuples" in as vars
BEGIN {
split(lawn_p_client,l,/\n/) # split the "tuples" by \n
n=split(rev_p_client_pa,r,/\n/) # get count of the other
for(i=1;i<=n;i++) # loop the elements
print r[i]*l[i] # multiply and output
}'
Output:
0.948607
0.912663
0.416161
0.670415
0.432672
0.496895
1.01407

How do I sort a "MON_YYYY_day_NUM" time with UNIX tools?

I'm wondering how I can sort this example based on time. I have already sorted it based on everything else, but I just cannot figure out how to sort it by time (the 07:30 part, for example).
My current code:
sort -t"_" -k3n -k2M -k5n (still need to implement the time sort for the last sort)
What still needs to be sorted is the time:
Dunaj_Dec_2000_day_1_13:00.jpg
Rim_Jan_2001_day_1_13:00.jpg
Ljubljana_Nov_2002_day_2_07:10.jpg
Rim_Jan_2003_day_3_08:40.jpg
Rim_Jan_2003_day_3_08:30.jpg
Any help or just a point in the right direction is greatly appreciated!
Sort it alphabetically; a 24-hour time with a fixed number of digits sorts correctly under a plain alphabetic sort.
sort -t"_" -k3n -k2M -k5n -k6 # default sorting
sort -t"_" -k3n -k2M -k5n -k6V # version-number sort.
There's also the version sort (the V in the second command), which works fine as well.
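For example, applied to the sample file names above, assuming GNU sort (for the -k2M month key) and that the names are in a file called files.txt (a name used here just for illustration), the first command should give:
$ sort -t"_" -k3n -k2M -k5n -k6 files.txt
Dunaj_Dec_2000_day_1_13:00.jpg
Rim_Jan_2001_day_1_13:00.jpg
Ljubljana_Nov_2002_day_2_07:10.jpg
Rim_Jan_2003_day_3_08:30.jpg
Rim_Jan_2003_day_3_08:40.jpg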
I have to admit to shamelessly stealing from this answer on SO:
How to split log file in bash based on time condition
awk -F'[_:.]' '
BEGIN {
months["Jan"] = 1
months["Feb"] = 2
months["Mar"] = 3
months["Apr"] = 4
months["May"] = 5
months["Jun"] = 6
months["Jul"] = 7
months["Aug"] = 8
months["Sep"] = 9
months["Oct"] = 10
months["Nov"] = 11
months["Dec"] = 12
}
{ print mktime($3" "months[$2]" "$5" "$6" "$7" 00"), $0 }
' input | sort -n | cut -d' ' -f2-
Use the characters _, : and . as field separators to parse each file name.
Initialize an associative array so we can map month names to numerical values (1-12).
Use the awk function mktime(); it takes a string in the format "YYYY MM DD HH MM SS [ DST ]", as per https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html. Each line of input is printed with a prepended column containing the time in epoch seconds.
The results are piped to sort -n, which sorts numerically on the first column.
Now that the results are sorted, we can remove the first column with cut.
I'm on a Mac, so I had to use gawk to get the mktime() function (it's not normally available in the macOS awk). I've read that mawk is another option.

Having SUM issues with a bash script

I'm trying to write a script to pull the numbers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script, and it generates the general readouts I wanted; however, when I try to generate a SUM of the temperature readings, I get the following printout in the file, and my goal is to print only the end SUM, not the individual numbers printed out in a vertical format.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
Here is what I am getting in return:
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM; I don't need the numbers already listed, just the SUM total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
    while (match($0, /[0-9]+\.[0-9]+/))      # while there are decimal numbers left on the record
    {
        sum += substr($0, RSTART, RLENGTH)   # extract the match and add it to sum
        $0 = substr($0, RSTART + RLENGTH)    # continue after the previous match
        count++                              # count matches
    }
}
END {
    print sum "/" count "=" sum/count        # print sum, count and the average
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
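If you do need that extra precision and your grep supports Perl-compatible regexes (-P, a GNU extension), one sketch is to add lookarounds that reject any match glued to further digits or dots, so standalone decimals match but pieces of an IP address do not:
grep -oP '(?<![0-9.])[0-9]+\.[0-9]+(?![0-9.])' "/location/$value-1.txt"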
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input; or have awk write material before grep has finished reading, and it suddenly gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
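A safer structure is to collect the summary in a temporary file and only append it once the pipeline has finished; a sketch keeping the original file name (the .tmp name here is just for illustration):
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt" |
    awk '{ SUM += $1 } END { print SUM }' > "/location/$value-1.tmp" &&
    cat "/location/$value-1.tmp" >> "/location/$value-1.txt" &&
    rm "/location/$value-1.tmp"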

Bash - select lines of a file based on values in another file

I have 2 files; let's call them file1 and file2. file1 contains a start and an end coordinate in each row, e.g.:
start end
2000 2696
3465 3688
8904 9546
etc.
file2 has several columns, of which the first is the most relevant for the question:
position v2 v3 v4
3546 value12 value13 value14
9847 value22 value23 value24
12000 value32 value33 value34
Now, I need to output a new file that contains only the lines of file2 for which the 'position' value (1st column) is between the 'start' and 'end' values of any of the rows of file1. In R I'd just write a double loop, but it takes too much time (the files are large), so I need to do it in bash. In case the question is unclear, here's the R loop that would do the job:
for(i in 1:dim(file1)[1]){
  for(j in 1:dim(file2)[1]){
    if(file2[j,1] > file1$start[i] & file2[j,1] < file1$end[i]) file2$select=1 else file2$select=0
  }
}
Very sure there's a simple way of doing this using bash / awk...
The awk will look like this, but you'll need to remove the first line from file1 and file2 first:
awk 'FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
The bit in curly braces after "FNR==NR" only applies to the processing of file1, and it says to store field 1 in array x[] and field 2 in array y[], so we have the lower and upper bounds of each range. The bit in the second set of curly braces applies to processing file2 only. It says to iterate through all the bounds in arrays x[] and y[] and see if field 1 is between the bounds, and print the whole record if it is.
If you don't want to remove the header line at the start, you can make the awk a little more complicated and ignore it like this:
awk 'FNR==1{next}FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
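With the sample file1 and file2 shown above (headers included), the header-skipping version should print only the row whose position falls inside one of the ranges:
$ awk 'FNR==1{next}FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
3546 value12 value13 value14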
EDITED
Ok, I have added code to check "chromosome" (whatever that is!) assuming it is in the first field in both files, like this:
File1
x 2000 2696
x 3465 3688
x 8904 9546
File2
x 3546 value12 value13 value14
y 3467 value12 value13 value14
x 9847 value22 value23 value24
x 12000 value32 value33 value34
So the code now stores the chromosome in array c[] as well and checks they are equal before outputting.
awk 'BEGIN{i=0}FNR==NR{c[i]=$1;x[i]=$2;y[i++]=$3;next}{for(j=0;j<i;j++){if(c[j]==$1&&$2>=x[j]&&$2<=y[j]){print $0;next}}}' file1 file2
Don't know how to do this in bash...
I would try a perl script, reading the first file and storing it in memory (if it's possible, it depends on its size) and then going through the second file line by line and doing the comparisons to output the line or not.
I think you can do this in R too, the same way: store the first file, then loop over each line of the second file.
Moreover, if the intervals don't overlap, you can sort the files to speed up your algorithm.
This should be faster than the for loop:
# check, for each position in file2, whether it falls inside any interval of file1
res <- sapply(file2$position, function(pos) {
  any(pos > file1$start & pos < file1$end)
})
Assuming the delimiter for the files is a space (if not, change the -d setting).
The script uses cut to extract the first field of file2.
Then a simple grep searches for the field in file1. If present, the line from file2 is printed.
#!/bin/bash
while read -r line
do
    word=$(echo "$line" | cut -f1 -d" ")
    if grep -q "$word" file1; then
        echo "$line"
    fi
done < file2
