How to reduce a set of lines to take the average? - bash

I have a file with lines like these (columns are tab separated):
2 1.414455 3.70898
2 2.414455 3.80898
2 3.414455 3.90898
2 1.414455 3.90898
4 4.414455 7.23898
4 3.414455 6.23898
4 5.414455 8.23898
i.e. there are consecutive lines where the first column is an integer and the other two columns are floats.
I want to reduce them as below
2 2.164455 3.83398
4 4.414455 7.23898
where I keep the first column and take the averages of the second and third columns over all lines with the same first column. The number of consecutive lines with the same first element may vary, but they will always be consecutive.
I can do this in perl, but was wondering if there is a simpler bash / sed / awk mix that can do the same for me?

Using awk:
awk '{a[$1]+=$2;b[$1]+=$3;c[$1]++;}END{for(i in c)print i, a[i]/c[i],b[i]/c[i];}' file
2 2.16445 3.83398
4 4.41446 7.23898
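This uses three arrays keyed on the first column: a and b hold the running sums of the 2nd and 3rd columns, and c holds the number of lines seen. The END block divides each sum by its count and prints the averages. Note that awk does not guarantee the iteration order of for(i in c), so the groups are not necessarily printed in input order.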
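For comparison, the same sum-and-count bookkeeping can be sketched in Python. This is only an illustration and not part of the original answer; the function name `average_by_key` and the `sample` list are made up for the example, and each line is assumed to hold exactly three whitespace-separated fields.

```python
def average_by_key(lines):
    """Average columns 2 and 3 for each distinct value of column 1."""
    sums = {}  # key -> [sum of column 2, sum of column 3, number of lines]
    for line in lines:
        key, x, y = line.split()
        acc = sums.setdefault(key, [0.0, 0.0, 0])
        acc[0] += float(x)
        acc[1] += float(y)
        acc[2] += 1
    # dicts keep insertion order (Python 3.7+), so keys come out in input order
    return [(key, sx / n, sy / n) for key, (sx, sy, n) in sums.items()]

# Hypothetical input: the lines from the question
sample = [
    "2\t1.414455\t3.70898",
    "2\t2.414455\t3.80898",
    "2\t3.414455\t3.90898",
    "2\t1.414455\t3.90898",
    "4\t4.414455\t7.23898",
    "4\t3.414455\t6.23898",
    "4\t5.414455\t8.23898",
]
for key, avg_x, avg_y in average_by_key(sample):
    print(key, round(avg_x, 6), round(avg_y, 6))
```

Unlike the awk loop, this version reports the groups in the order they first appear.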


Ascending Cardinal Numbers in APL

In the FinnAPL Idiom Library, the 19th item is described as “Ascending cardinal numbers (ranking, all different),” and the code is as follows:
⍋⍋X
I also found a book review of the same library by R. Peschi, in which he said, “'Ascending cardinal numbers (ranking, all different)' How many of us understand why grading the result of Grade Up has that effect?” That's my question too. I searched extensively on the internet and came up with zilch.
Ascending Cardinal Numbers
For the sake of shorthand, I'll call that little code snippet “rank.” It becomes evident what is happening with rank when you start applying it to binary numbers. For example:
X←0 0 1 0 1
⍋⍋X ⍝ output is 1 2 4 3 5
The output indicates the position of the values after sorting. You can see from the output that the two 1s will end up in the last two slots, 4 and 5, and the 0s will end up at positions 1, 2 and 3. Thus, it is assigning rank to each value of the vector. Compare that to grade up:
X←7 8 9 6
⍋X ⍝ output is 4 1 2 3
⍋⍋X ⍝ output is 2 3 4 1
You can think of grade up as "this position gets that number," and you can think of rank as "this number gets that position":
7 8 9 6 ⍝ values of X
4 1 2 3 ⍝ position 1 gets the number at 4 (6)
⍝ position 2 gets the number at 1 (7) etc.
2 3 4 1 ⍝ 1st number (7) gets the position 2
⍝ 2nd number (8) gets the position 3 etc.
It's interesting to note that grade up and rank are like two sides of the same coin in that you can alternate between the two. In other words, we have the following identities:
⍋X = ⍋⍋⍋X = ⍋⍋⍋⍋⍋X = ...
⍋⍋X = ⍋⍋⍋⍋X = ⍋⍋⍋⍋⍋⍋X = ...
Why?
So far that doesn't really answer Mr Peschi's question as to why it has this effect. If you think in terms of key-value pairs, the answer lies in the fact that the original keys are a set of ascending cardinal numbers: 1 2 3 4. After applying grade up, a new vector is created, whose values are the original keys rearranged as they would be after a sort: 4 1 2 3. Applying grade up a second time is about restoring the original keys to a sequence of ascending cardinal numbers again. However, the values of this third vector aren't the ascending cardinal numbers themselves. Rather they correspond to the keys of the second vector.
It's kind of hard to understand since it's a reference to a reference, but the values of the third vector are referencing the original set of numbers as they occurred in their original positions:
7 8 9 6
2 3 4 1
In the example, 2 is referencing 7 from 7's original position. Since the value 2 also corresponds to the key of the second vector, which in turn is the second position, the final message is that after the sort, 7 will be in position 2. 8 will be in position 3, 9 in 4 and 6 in the 1st position.
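To experiment with this outside APL, grade up and rank can be imitated in Python. This is a rough sketch, not a faithful APL implementation; the names `grade_up` and `rank` are invented here, and 1-based indices are used to match the APL examples above.

```python
def grade_up(values):
    """Return the 1-based indices that would sort values in ascending order.

    sorted() is stable, so equal values keep their original relative order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [i + 1 for i in order]

def rank(values):
    """Grade up applied twice: the position each value takes after sorting."""
    return grade_up(grade_up(values))

x = [7, 8, 9, 6]
print(grade_up(x))  # position -> index of the value that lands there
print(rank(x))      # value -> position it lands in
```

Because grading a permutation yields its inverse, applying `grade_up` a third time gives back the result of the first application, which is the identity listed above.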
Ranking and Shareable
In the FinnAPL Idiom Library, the 2nd item is described as “Ascending cardinal numbers (ranking, shareable),” and the code is as follows:
⌊.5×(⍋⍋X)+⌽⍋⍋⌽X
The output of this code is the same as its brother, ascending cardinal numbers (ranking, all different) as long as all the values of the input vector are different. However, the shareable version doesn't assign new values for those that are equal:
X←0 0 1 0 1
⌊.5×(⍋⍋X)+⌽⍋⍋⌽X ⍝ output is 2 2 4 2 4
The values of the output should generally be interpreted as relative, i.e. the 2s have a lower rank than the 4s, so they will appear first in the sorted array.
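The shareable idiom can be imitated in Python in the same spirit. This is a sketch under the assumption that a stable sort stands in for APL's grade up; the helper names `grade_up`, `rank` and `shareable_rank` are invented for the illustration.

```python
def grade_up(values):
    """1-based indices that would sort values ascending (stable for ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [i + 1 for i in order]

def rank(values):
    """Grade up applied twice."""
    return grade_up(grade_up(values))

def shareable_rank(values):
    """Floor of the mean of the forward rank and the reversed rank of the reversed input."""
    forward = rank(values)
    backward = rank(values[::-1])[::-1]
    return [(f + b) // 2 for f, b in zip(forward, backward)]

print(shareable_rank([0, 0, 1, 0, 1]))
```

Ranking the reversed vector breaks ties in the opposite direction, so averaging the two rankings gives every member of a tie the same value.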

Maximum values in matrix

So here is an interesting problem in C#. I'm looking for a better way of solving it:
Given a matrix M (not necessarily square) of matches, find the best matching elements. Element i matches element j with value M(i,j). M(i,j) != M(j,i).
Since #rows != #columns, find the best min(#rows,#columns) matching pairs (i,j).
Basically the problem is to pick the maximum from each row/column such that no row/column is picked twice.
Example:
      1   2   3
    +-----------
  a | 10   3   1
  b | 12  99   2
  c | 20   5   3
  d |  5   7   4
The maximum value in this matrix is 99, so the best match is (b,2). For the next selection we can no longer use row b or column 2. It is like cutting them out:
      1   2   3        or, if you prefer,       1   3
    +-----------       a smaller matrix:      +-------
  a | 10  ||   1                            a | 10   1
  b | ===++===                              c | 20   3
  c | 20  ||   3                            d |  5   4
  d |  5  ||   4
The max is now 20 and the match is (c, 1). The remaining matrix has only one column.
After another pick we'll get the match (d, 3) with match = 4
In the end "a" has no match.
My current implementation uses two arrays to store the already matched rows/columns, and for each match it goes through the entire matrix, picking the first maximum that belongs to a row/column not yet matched.
PS: if multiple candidate matches have the same value, just pick one of them
PS2: The array is stored as int [,]
How would you approach this problem in a more optimal/beautiful way?
If you are trying to maximise the sum of the cells chosen, such that exactly one cell is picked from each row and from each column, then this is http://en.wikipedia.org/wiki/Assignment_problem. If your matrix is not square, you can make it square by adding rows or columns to it, with values in the new cells which mean that they won't be picked unless there is no other way to fill out the solution.
(If you are not maximising the sum, you need to say what function of the values chosen you are maximising: is (1,3) better than (2,2)? Otherwise you are into http://en.wikipedia.org/wiki/Multi-objective_optimization, which is possible, but more complicated.)
You could first sort all of the entries of the matrix in descending order, and then process the sorted list. Whenever you see an entry that isn't in an already-picked row/col, it means that entry should be picked, so you mark the corresponding row/column and continue further down the list until either all rows or all columns have been picked.
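The sort-then-scan idea can be sketched as follows. Python is used only for illustration (the question is about C#), the function name `greedy_matches` is invented, and rows and columns are 0-based. Note that this reproduces the greedy selection described in the question; it does not necessarily maximise the sum as a solution to the assignment problem would.

```python
def greedy_matches(matrix):
    """Pick cells in descending value order, never reusing a row or a column."""
    if not matrix:
        return []
    # Flatten to (value, row, column) and sort by value, largest first
    entries = sorted(
        ((value, r, c) for r, row in enumerate(matrix) for c, value in enumerate(row)),
        key=lambda entry: entry[0],
        reverse=True,
    )
    limit = min(len(matrix), len(matrix[0]))
    used_rows, used_cols, picks = set(), set(), []
    for value, r, c in entries:
        if len(picks) == limit:
            break  # every row or every column has been matched
        if r in used_rows or c in used_cols:
            continue
        used_rows.add(r)
        used_cols.add(c)
        picks.append((r, c, value))
    return picks

# The example matrix from the question (rows a..d, columns 1..3)
M = [[10, 3, 1],
     [12, 99, 2],
     [20, 5, 3],
     [5, 7, 4]]
print(greedy_matches(M))
```

Sorting costs O(RC log(RC)) once, after which a single pass replaces the repeated full scans of the matrix.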

Array problem using if and do loop

This is my code:
data INDAT8; set INDAT6;
Array myarray{24,27};
goodgroups=0;
do i=2 to 24 by 2;
do j=2 to 27;
if myarray[i,j] gt 1 then myarray[i+1,j] = 'bad';
else if myarray[i,j] eq 1 and myarray[i+1,j] = 1 then myarray[i+1,j]= 'good';
end;
end;
run;
proc print data=INDAT8;
run;
Problem:
I have the data in this format- it is just an example: n=2
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
Now, the above data is sorted (7 rows in total). I need to make groups of 2, 3 or 4 rows separately and generate a graph. In the above data, I made groups of 2 rows. The third row is left alone because there is no other row with X=3 to form a group with. A group can only be formed from rows with the same X value, NOT across different X values.
Now, I will check if both the rows have “good” in the info column or not. If both rows have “good” – the group formed is also good , otherwise bad. In the above example, 3rd /last group is “good” group. Rest are all bad group. Once I’m done with all the rows, I will calculate the total no. of Good groups formed/Total no. of groups.
In the above example, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2(size of group)
Now, for n=3, we make group of 3 rows and for n=4, we make a group of 4 rows and find the good /bad groups in a similar way. If all the rows in a group has “good” block—the result is good block, otherwise bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make groups of 4 and find the result. The 5th, 6th, 7th and 8th rows are left behind or ignored. I made 2 groups of 4 rows and both are “good” blocks.
Output: 2/2
So, after getting the 3 output values from n=2, n=3 and n=4, I will plot a graph of these values.
If you can help in any language using arrays, if and do loops, it would be great.
I can change my code accordingly.
Update:
The answer for this doesn't have to be in sas. Since it is more algorithm-related than anything, I will accept suggestions in any language as long as they show how to accomplish this using arrays and do.
I am having trouble understanding your problem statement, but from what I can gather here is what I can suggest:
Place the data into bins and then process the summary data.
Implementation 1
Assumption: You don't know what the range of the first column will be, or the distribution will be sparse
Create a hash table. The key will be the item you are doing your grouping on. The value will be the count seen so far.
Process each record. If the key already exists, increment the count (value for that key in the hash). Otherwise add the key and set the value to 1.
Continue until you have processed all records
Count the number of keys in the hash table and the number of values that are greater than your threshold.
Implementation 2
Assumption: You know the range of the first column and the distribution is reasonably dense
Create an array of integers with enough elements so the index can match the column value. Initialize all elements to zero. This array will hold your count for each item you are grouping on
Process each record. Examine value of first column. Increment corresponding index in array. (So if you have "2 1 good", do groupCount[2]++)
Continue until you have processed all records
Walk each element in the array. Count how many items are non zero (meaning they appeared at least once) and how many items meet your threshold.
You can use the same approach for gathering the good and bad counts.
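One possible reading of the question can be sketched in Python. Everything here is an assumption rather than a confirmed specification: rows are taken as (X, Y, info) tuples already sorted by X, each run of equal X is cut into consecutive chunks of n rows, leftover rows are ignored, and a chunk is good only if every row in it is "good". The function name `count_good_groups` is invented.

```python
from itertools import groupby

def count_good_groups(rows, n):
    """Return (good groups, total groups) for groups of n rows sharing the same X."""
    good = total = 0
    for _, run in groupby(rows, key=lambda row: row[0]):
        infos = [row[2] for row in run]
        # Cut the run into full chunks of n rows; an incomplete tail is dropped
        for start in range(0, len(infos) - n + 1, n):
            total += 1
            if all(info == "good" for info in infos[start:start + n]):
                good += 1
    return good, total

# The n=2 example from the question
rows_n2 = [(2, 1, "good"), (2, 4, "bad"), (3, 2, "good"),
           (4, 1, "bad"), (4, 4, "good"), (6, 2, "good"), (6, 3, "good")]
print(count_good_groups(rows_n2, 2))
```

The same function can be called with n=3 and n=4 on the corresponding data to obtain the three values to plot.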

How to display unique numbers with their frequencies as occurring in a matrix?

I have a matrix with some numbers:
1 2 3 6
6 7 2 1
1 4 5 6
And the program should display each distinct number with its frequency, for example:
1 -> 3
2 -> 2
3 -> 1
4 -> 1
5 -> 1
6 -> 3
7 -> 1
Please help me
You probably mean
1->3
Create a vector (array) filled with zeros whose size covers the maximum value in the matrix (for example, indices 0..9). Traverse the whole matrix and, at each step, increment the vector element whose index equals the current number.
This is a solution for a small range of values in the matrix. If you expect some big values, use a linked list instead of a vector, or a matrix like this for counting:
1 0
5 0
15 0
142 0
2412 0
Then increment the values in the second column, and add a row to this matrix every time you find a new number.
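The counting-vector variant can be sketched like this. Python is used only as an illustration, the name `count_values` is invented, and the sketch assumes a non-empty matrix of small non-negative integers.

```python
def count_values(matrix):
    """Count occurrences using a list indexed by value (assumes small non-negative ints)."""
    largest = max(max(row) for row in matrix)
    counts = [0] * (largest + 1)  # index = number, value = how often it was seen
    for row in matrix:
        for value in row:
            counts[value] += 1
    # Report only the numbers that actually occur
    return [(value, count) for value, count in enumerate(counts) if count > 0]

matrix = [[1, 2, 3, 6],
          [6, 7, 2, 1],
          [1, 4, 5, 6]]
for value, count in count_values(matrix):
    print(value, "->", count)
```

For sparse or very large values, a dictionary keyed on the number avoids allocating a mostly empty list.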
With pointers, this problem reduces from a matrix to a one-dimensional array. Maintain a 1D array, call it COUNT, whose size equals the total number of elements in the matrix, and initialize it with zeros. Start with the first element of the matrix and compare it with all the other elements; with pointers, this amounts to traversing a 1D array and counting the occurrences of each element, and traversing only requires incrementing the pointer. When the comparison finds the same number again, shift all the following elements one place toward the front. For example, if the 0th element is 1 and you find 1 again at index 4, move the element at index 5 to index 4, the one at index 6 to index 5, and so on up to the last element, so that the duplicate entry at index 4 is overwritten. Then decrease the total number of elements by 1 and increase the corresponding entry in COUNT by 1. Continuing this way until the last element leaves a matrix of distinct numbers, with their frequencies in COUNT.
This implementation is very effective for languages which support pointers.
Here's an example of how it could be done in Python.
The dict is of this format: {key:value, key2:value2}. So you can use it to get something like {'2':3}, which tells you how many occurrences each number has. (I'm not assuming you're going to use Python. It's just so you understand the code... maybe)
matrix = [[1,5,6],
          [2,6,3],
          [5,3,9]]
counts = {}  # named counts rather than dict to avoid shadowing the built-in type
for row in matrix:
    for column in row:
        # keys are stored as strings, e.g. '2'
        if str(column) in counts:
            counts[str(column)] += 1
        else:
            counts[str(column)] = 1
for key in sorted(counts):
    print(key, '->', counts[key])
I hope you can figure out what this does. This codepad shows the output and nice syntax highlighting.

How can I maximally partition a set?

I'm trying to solve one of the Project Euler problems. As a consequence, I need an algorithm that will help me find all possible partitions of a set, in any order.
For instance, given the set 2 3 3 5:
2 | 3 3 5
2 | 3 | 3 5
2 | 3 3 | 5
2 | 3 | 3 | 5
2 5 | 3 3
and so on. Pretty much every possible combination of the members of the set. I've searched the net of course, but haven't found much that's directly useful to me, since I speak programmer-ese not advanced-math-ese.
Can anyone help me out with this? I can read pretty much any programming language, from BASIC to Haskell, so post in whatever language you wish.
Have you considered a search tree? Each node would represent a choice of where to put an element and the leaf nodes are answers. I won't give you code because that's part of the fun of Project Euler ;)
Take a look at:
The Art of Computer Programming, Volume 4, Fascicle 3: Generating All Combinations and Partitions
7.2.1.5. Generating all set partitions
In general I would look at the structure of the recursion used to compute the number of configurations, and build a similar recursion for enumerating them. Best is to compute a one-to-one mapping between integers and configurations. This works well for permutations, combinations, etc. and ensures that each configuration is enumerated only once.
Now even the recursion for the number of partitions of some identical items is rather complicated.
For partitions of multisets the counting amounts to solving the generalization of Project Euler problem 181 to arbitrary multisets.
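As an illustration of building the enumeration from a recursion, here is a small Python sketch. It is one possible approach and not the algorithm from the book cited above; the names `set_partitions` and `multiset_partitions` are invented. Each element either joins one of the blocks already built from the remaining elements or starts a new block, and duplicates caused by repeated values are filtered through a canonical sorted form.

```python
def set_partitions(items):
    """Yield every partition of items (treated as distinct) as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ...or give it a block of its own
        yield [[first]] + partial

def multiset_partitions(items):
    """Yield each partition once, even when items contains repeated values."""
    seen = set()
    for partition in set_partitions(list(items)):
        key = tuple(sorted(tuple(sorted(block)) for block in partition))
        if key not in seen:
            seen.add(key)
            yield [list(block) for block in key]

for partition in multiset_partitions([2, 3, 3, 5]):
    print(" | ".join(" ".join(map(str, block)) for block in partition))
```

Filtering with a `seen` set is simple but still generates every duplicate first, so it is only practical for small inputs.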
Well, the problem has two aspects.
Firstly, the items can be arranged in any order. So for N items, there are N! permutations (assuming the items are treated as unique).
Secondly, you can envision the grouping as a bit flag between each item indicating a divide. There would be N-1 of these flags, so for a given permutation there would be 2^(N-1) possible groupings.
This means that for N items, there would be a total of N!*(2^(N-1)) groupings/permutations, which gets big very very fast.
In your example, the top four items are groupings of one permutation. The last item is a grouping of another permutation. Your items can be viewed as :
2 on 3 off 3 off 5
2 on 3 on 3 off 5
2 on 3 off 3 on 5
2 on 3 on 3 on 5
2 off 5 on 3 off 3
The permutations (the order of display) can be derived by looking at them like a tree, as mentioned by the other two. This would almost certainly involve recursion, such as here.
The grouping is independent of them in many ways. Once you have all the permutations, you can link them with the groupings if needed.
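The bit-flag view of the groupings for a single permutation can be sketched in Python. The function name `groupings` is invented for the illustration; bit i-1 of the mask being set means there is a divider just before item i.

```python
def groupings(items):
    """Return all 2^(N-1) ways to cut one fixed ordering of items into consecutive groups."""
    if not items:
        return [[]]
    n = len(items)
    result = []
    for mask in range(2 ** (n - 1)):
        groups = [[items[0]]]
        for i in range(1, n):
            if mask & (1 << (i - 1)):
                groups.append([items[i]])    # flag on: start a new group
            else:
                groups[-1].append(items[i])  # flag off: extend the current group
        result.append(groups)
    return result

for groups in groupings([2, 3, 3, 5]):
    print(" | ".join(" ".join(map(str, group)) for group in groups))
```

Combining this with every permutation of the items gives the N!*(2^(N-1)) figure above, which counts many arrangements that differ only in order.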
Here is the code you need for this part of your problem:
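def memoize(f):
    # Cache the results of f so each argument is computed only once
    memo={}
    def helper(x):
        if x not in memo:
            memo[x]=f(x)
        return memo[x]
    return helper

@memoize
def A000041(n):
    # Number of integer partitions of n (OEIS A000041),
    # computed with the pentagonal number recurrence
    if n == 0: return 1
    S = 0
    J = n-1
    k = 2
    while 0 <= J:
        T = A000041(J)
        S = S+T if k//2%2!=0 else S-T
        J -= k if k%2!=0 else k//2
        k += 1
    return S

print(A000041(100)) # the 100th number in this series, as an example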
I quickly whipped up some code to do this. However, I left out separating every possible combination of the given list, because I wasn't sure it was actually needed, but it should be easy to add, if necessary.
Anyway, the code runs quite well for small amounts, but, as CodeByMoonlight already mentioned, the amount of possibilities gets really high really fast, so the runtime increases accordingly.
Anyway, here's the python code:
import time

def separate(toseparate):
    "Find every possible way to separate a given list."
    #The list of every possibility
    possibilities = []
    n = len(toseparate)
    #We can distribute up to n-1 separations in the given list, so iterate from 0 to n-1
    for i in range(n):
        #Create a copy of the list to avoid modifying the already existing list
        copy = list(toseparate)
        #A boolean list indicating where a separator is put. 'True' indicates a separator
        #and 'False', of course, no separator.
        #The list will contain i separators, the rest is filled with 'False'
        separators = [True]*i + [False]*(n-i-1)
        for j in range(len(separators)):
            #We insert the separators into our given list. The separators have to
            #be between two elements. The index between two elements is always
            #2*[index of the left element]+1.
            copy.insert(2*j+1, separators[j])
        #The first possibility is, of course, the one we just created
        possibilities.append(list(copy))
        #The following is a modification of the QuickPerm algorithm, which finds
        #all possible permutations of a given list. It was modified to only permutate
        #the spaces between two elements, so it finds every possibility to insert n
        #separators in the given list.
        m = len(separators)
        hi, lo = 1, 0
        p = [0]*m
        while hi < m:
            if p[hi] < hi:
                lo = (hi%2)*p[hi]
                copy[2*lo+1], copy[2*hi+1] = copy[2*hi+1], copy[2*lo+1]
                #Since the items are non-unique, some possibilities will show up more than once, so we
                #avoid this by checking first.
                if not copy in possibilities:
                    possibilities.append(list(copy))
                p[hi] += 1
                hi = 1
            else:
                p[hi] = 0
                hi += 1
    return possibilities

t1 = time.time()
separations = separate([2, 3, 3, 5])
print(time.time()-t1)
sepmap = {True:"|", False:""}
for a in separations:
    for b in a:
        #Check the type explicitly: in a dict lookup 1 == True and 0 == False,
        #so the numbers 0 and 1 would otherwise be mistaken for separators
        if isinstance(b, bool):
            print(sepmap[b], end=" ")
        else:
            print(b, end=" ")
    print()
It's based on the QuickPerm algorithm, which you can read more about here: QuickPerm
Basically, my code generates a list containing n separations, inserts them into the given list and then finds all possible permutations of the separations in the list.
So, if we use your example we would get:
2 3 3 5
2 | 3 3 5
2 3 | 3 5
2 3 3 | 5
2 | 3 | 3 5
2 3 | 3 | 5
2 | 3 3 | 5
2 | 3 | 3 | 5
In 0.000154972076416 seconds.
However, I read through the description of the problem you are working on and I see how you are trying to solve it, but seeing how quickly the runtime increases, I don't think it would work as fast as you would expect. Remember that Project Euler's problems should be solvable in around a minute.
