Sort tuples by averages of second element - sorting

I have been able to convert a csv file to list format using a function. In doing so I was able to associate a name with a class number and three additional numbers, e.g.:
In the csv file:
Hussain 1 7 8 0
Alexandra 1 0 0 2
became:
['Alexandra', 2],['Hussain', 8]
The sorting method asked for the names in alphabetical order along with each person's highest score. I used tuples to complete the code above and would like to carry on using tuples.
Now I wish to sort this so that it goes from highest average to lowest; e.g., the sorting method for averages would result in:
[['Hussain', 5.0], ['Alexandra', 0.6666666667]]
These numbers are what I expect, as they are the averages of the last three numbers in the csv file (the 2nd column, the class number, is ignored here). As Hussain has the highest average he is placed first. I would appreciate any possible help.
What I would like is the following:
I would like to be able to print out all the students in order of highest average to lowest. As Hussain has the higher average (5.0), he is printed out first, then Alexandra is printed as she has a lower average. These two students are from the same class (shown in the second column of the csv file) and they are to be printed when the user chooses class 1 to be sorted.
TIA

Suppose you have a list for class 1 like:
class1 = [['Alexandra', 1, 0, 0, 2], ['Hussain', 1, 7, 8, 0]]
Then you build [name, average] pairs and sort them by the second element:
# this is class number 1; you find class-1 people via the second element of each list, i[1] == 1 (the class number)
avg_list = [[i[0], float(sum(i[2:])) / len(i[2:])] for i in class1 if i[1] == 1]
dd = sorted(avg_list, key=lambda x: x[1])
dd.reverse()
print(dd)
Output:
[['Hussain', 5.0], ['Alexandra', 0.6666666666666666]]

Use the key argument of Python's sort functions. Read about it here:
https://docs.python.org/3/howto/sorting.html
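For example, a minimal sketch (assuming the class-1 rows shown in the answer above) that uses key together with reverse=True to sort by average in one call:

class1 = [['Alexandra', 1, 0, 0, 2], ['Hussain', 1, 7, 8, 0]]

# build (name, average-of-last-three) pairs for class 1, highest average first
averages = [(row[0], sum(row[2:]) / len(row[2:])) for row in class1 if row[1] == 1]
averages.sort(key=lambda pair: pair[1], reverse=True)
print(averages)  # [('Hussain', 5.0), ('Alexandra', 0.6666666666666666)]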

Related

How to calculate the percentage of values higher than one tenth of the mean of an Int list

I have a long list of ints and I would like to calculate the percentage of numbers which are equal to or above one tenth of the mean. That is, I want to calculate the percentile of the score mean / 10. Here is a naive approach (in Python, but that doesn't matter):
ls = [35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1]
length = 0
summ = 0
for i in ls:
    length += 1
    summ += i
mean = float(summ) / float(length)
print('The input value list is: {}'.format(ls))
print('The mean is: {}'.format(mean))
tenth_mean = mean / 10
print('One tenth of the mean is: {}'.format(tenth_mean))
summ = 0
for i in ls:
    if i >= tenth_mean:
        summ += 1
result = float(summ) / float(length)
print('The percentage of values equal or above one tenth of the mean is: {}'.format(result))
Output:
The input value list is: [35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1]
The mean is: 19.3333333333
One tenth of the mean is: 1.93333333333
The percentage of values equal or above one tenth of the mean is: 0.666666666667
The problem with this approach is that I have to loop over the list twice. Is there any smart way to avoid this?
I can't see any since I first need to calculate the average in order to know which values to keep in the count (second loop).
Furthermore, I would like to do this for multiple percentages (i.e. one tenth of the mean, one fifth of the mean, etc.). This can be easily achieved within the second loop. I just wanted to point this out.
The input array does not follow any distribution.
EDIT: The range of possible values spans only a couple of thousand distinct values. The total number of values is around 3 billion.
EDIT: Fixed usage of the word "percentile" above.
If you have many queries on the list, it might be helpful to do some preprocessing to decrease the per-query time complexity to O(log(n)).
If you sort the list and compute the mean (using Python's built-ins), you can find percentiles in the list using binary search. Hence, each query would take O(log(n)).
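For instance, a minimal sketch of that idea with Python's bisect module (the list and the mean / 10 threshold are taken from the question; the function name is illustrative):

from bisect import bisect_left

ls = sorted([35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1])  # O(n log n) preprocessing
mean = sum(ls) / len(ls)

def fraction_at_or_above(threshold):
    # binary search for the first element >= threshold: O(log n) per query
    first = bisect_left(ls, threshold)
    return (len(ls) - first) / len(ls)

print(fraction_at_or_above(mean / 10))  # 0.6666666666666666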
This is a well-known result of stats and information science: you cannot get all of that information with a single pass. @OmG already gave you the best complexity. Depending on the distribution of your scores, you may be able to improve the search time (but not the complexity) with an interpolation search.
If you have a massive data set, you might also be able to improve the search's starting point with partial estimates of the mean as you progress.
Based on the answers from others I have come up with the following approach for an improved search: the key insight is that for every possible value x one can count the occurrences of values smaller than or equal to x. Independently, the mean can be calculated in parallel (i.e. in the same loop). One can then do a linear or binary search in the tuple list to calculate any arbitrary fraction.
This works very well when the number of possible different values is much smaller than the total number of values.
Here is a simple implementation in bash/awk:
# The "tee >(awk ... > meant.txt) calculates the mean on the fly
# The second awk ("... value2count ...") counts the occurences of each value
# The sort simply sorts the output of awk (could be done within awk, too)
# The third awk ("... value2maxline ...") counts the number of lines having value x or less ("prevc" = previous count, "prevv" = previous value)
# The sort simply sorts the output of awk (could be done within awk, too)
echo -n "10\n15\n15\n20\n20\n25" | tee >(awk '{ sum += $1; } END { print sum / NR; }' > mean.txt) | awk '{ value2count[$1]++ } END { for (value in value2count) { print value, value2count[value] } }' | sort --numeric-sort --stable -k 1,1 | awk 'BEGIN { prevc = 0 ; prevv = -1 } { if (prevv != $1) { value2maxline[$1] = prevc + $2 ; prevc += $2 ; prevv = $1 } } END { for (value in value2maxline) { print value, value2maxline[value] } }' | sort --numeric-sort --stable -k 1,1 > counts.txt
cat mean.txt
17.5
cat counts.txt
10 1 # one line with value 10
15 3 # 3 lines with value 15 or less
20 5 # 5 lines with value 20 or less
25 6 # 6 lines with value 25 or less, 6 is also the total number of values
In the example above, if I were interested in the percentage of values >= 70% of the mean, I would calculate
int(0.7 * 17.5) = 12
Then find (with linear or binary search in the tuple list) that 1 line (of 6 total lines) has a value at or below 12 ("10 1" is still below, "15 3" is already above). Finally, I'd calculate (6 - 1) / 6 ≈ 0.83: 83% of the values are greater than or equal to 70% of the mean.
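For reference, a rough Python equivalent of the counting pipeline above (a sketch with illustrative names, not tuned for the 3-billion-value case):

from bisect import bisect_right
from collections import Counter
from itertools import accumulate

values = [10, 15, 15, 20, 20, 25]
mean = sum(values) / len(values)                  # 17.5

pairs = sorted(Counter(values).items())           # [(10, 1), (15, 2), (20, 2), (25, 1)]
uniques = [v for v, _ in pairs]
cum = list(accumulate(c for _, c in pairs))       # [1, 3, 5, 6]: lines with value <= v

threshold = int(0.7 * mean)                       # 12
idx = bisect_right(uniques, threshold)            # how many distinct values are <= 12
below = cum[idx - 1] if idx else 0                # 1 line at or below the threshold
print((len(values) - below) / len(values))        # 0.8333... (about 83%)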

Trying to improve efficiency of this search in an array

Suppose I have an input array where all objects are non-equivalent - e.g. [13,2,36]. I want the output array to be [1,0,2], since 13 is greater than 2 so "1", 2 is greater than no element so "0", 36 is greater than both 13 and 2 so "2". How do I get the output array with efficiency better than O(n²)?
Edit 1: I also want to print the output in the same ordering. Please give C/C++ code if possible.
Seems like dynamic programming. Maybe this can help.
Here is an O(n + V) algorithm, where V is the largest possible value:
1. Declare an array of some max size, say 1000001.
2. Traverse all the elements and set arr[input[i]] = 1, where input[i] is the element.
3. Traverse arr and add each entry to the previous one (to keep a record of how many elements arr[i] is greater than or equal to), like this: arr[i] += arr[i-1]
Example: if input[] = {12, 3, 36}
After step 2:
arr[3] = 1, arr[12] = 1, arr[36] = 1
After step 3:
arr[3] = 1,
arr[4] = arr[3] + arr[4] = 1 (arr[4] = 0, arr[3] = 1), and likewise arr[5] = arr[6] = ... = arr[11] = 1,
arr[12] = arr[11] + arr[12] = 2 (arr[11] = 1, arr[12] = 1),
arr[36] = arr[35] + arr[36] = 3 (because arr[13], arr[14], ..., arr[35] = 2 and arr[36] = 1).
4. Traverse the input array and print arr[input[i]] - 1, where i is the index.
So arr[3] = 1, arr[12] = 2, arr[36] = 3; if you print arr[input[i]] directly the output will be {2, 1, 3}, so we subtract 1 from each element and the output becomes {1, 0, 2}, which is the desired output.
// C implementation of the steps above
#include <stdio.h>

int arr[1000001];  // global, so zero-initialized

int main(void) {
    int size;
    scanf("%d", &size);              // size of the input array
    int input[size];
    for (int i = 0; i < size; i++) {
        scanf("%d", &input[i]);      // take input
        arr[input[i]] = 1;           // mark the value input[i] as present
    }
    for (int i = 1; i < 1000001; i++)
        arr[i] += arr[i - 1];        // prefix sums: arr[v] now counts values <= v
    for (int i = 0; i < size; i++)
        printf("%d ", arr[input[i]] - 1);  // subtract 1 so the smallest element prints 0
    return 0;
}
To understand the algorithm better, take paper and pen and do a dry run; it will help.
Hope it helps. Happy coding!
Clone the original array (and keep the original indexes of the elements somewhere), then quicksort it in descending order. The value for the element at index i in the sorted array should be quicksorted.length - i - 1.
[13, 2, 36] - original
[36(2), 13(1), 2(0)] - sorted
[1, 0, 2] - substituted
def sort(array):
    temp = sorted(array)
    indexDict = {temp[i]: i for i in range(len(temp))}
    return [indexDict[i] for i in array]
I realize it's in Python, but it should nevertheless still help you.
Schwartzian transform: decorate, sort, undecorate.
Create a structure holding an object as well as an index. Create a new list of these structures from your list. Sort by the objects as planned. Create a list of the indices from the sorted list.
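A minimal sketch of that decorate/sort/undecorate idea in Python, using the [13, 2, 36] input from the question (the function name is illustrative):

def ranks(values):
    # decorate each value with its original index and sort by value,
    # then write each element's rank back to its original position
    decorated = sorted((v, i) for i, v in enumerate(values))
    result = [0] * len(values)
    for rank, (_, original_index) in enumerate(decorated):
        result[original_index] = rank
    return result

print(ranks([13, 2, 36]))  # [1, 0, 2]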

Column-wise comparison of common elements - Matlab

How can I compare 2 matrices column-wise, find whether there is any common element in the corresponding columns, and return the column numbers? (Note: the elements need not be in corresponding positions.)
The function bsxfun(@eq, A, B) is NOT useful here, as it compares the corresponding elements in each column.
Requirement: A = [1 2 3; 4 5 6; 7 8 9], B = [0 0 0; 8 7 9; 4 1 6]. Here the value 4 is common in column 1 of A and B; similarly, the values 6 and 9 are common in column 3 of A and B. So return columns 1 and 3.
Can you please suggest a method; I would be grateful to you.
You can use ismember to compare columns (or rows) as you describe. It returns a logical index of A indicating matches in B. Use any to reduce column wise and find to get the column indices.
You can use a for loop over columns or use arrayfun:
find(arrayfun(@(c) any(ismember(A(:,c), B(:,c))), 1:size(A,2)))
I would be interested to see if you find a neater, more succinct solution!

Project Euler #22 - Incorrect logic?

I'm tackling some of the programming challenges on Project Euler. The challenge is as follows:
Using names.txt (right click and 'Save Link/Target As...'),
a 46K text file containing over five-thousand first names,
begin by sorting it into alphabetical order. Then working out
the alphabetical value for each name, multiply this value by
its alphabetical position in the list to obtain a name score.
For example, when the list is sorted into alphabetical order,
COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list.
So, COLIN would obtain a score of 938 × 53 = 49714.
What is the total of all the name scores in the file?
So I've written it in CoffeeScript, but I will explain the logic so it's understandable.
fs = require 'fs'
total = 0
fs.readFile './names.txt', (err, names) ->
  names = names.toString().split(',')
  names = names.sort()
  for num in [0..(names.length - 1)]
    asc = 0
    for i in [1..names[num].length]
      asc += names[num].charCodeAt(i - 1) - 64
    total += num * asc
  console.log total
So basically, I'm reading the file in. I split the names into an array and sort them. I'm looping through each of the names. As I loop through, I go through each character and get its charCode (as it's all capitals). I then subtract 64 to get its position in the alphabet. Finally, I add to the total variable the loop index num * the sum of the positions of all letters.
The answer I get is 870873746; however, it's incorrect, and other answers give a slightly higher number.
Can anyone see why?
total += num * asc
I think this is where it went wrong. The loop variable num starts from 0 (that's how computers store things), but ranking should start from 1, not 0. So the total should be accumulated as:
total += (num+1) * asc
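For reference, the same scoring logic with the 1-based fix, sketched in Python (assuming names.txt holds comma-separated, double-quoted names, as the Project Euler file does):

def total_score(path='names.txt'):
    with open(path) as f:
        # strip the surrounding double quotes from each name, then sort
        names = sorted(name.strip('"') for name in f.read().split(','))
    # enumerate from 1 so the first name gets rank 1, not 0
    return sum(rank * sum(ord(ch) - 64 for ch in name)
               for rank, name in enumerate(names, start=1))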

Random number generator that fills an interval

How would you implement a random number generator that, given an interval, (randomly) generates all numbers in that interval, without any repetition?
It should consume as little time and memory as possible.
Example in a just-invented C#-ruby-ish pseudocode:
interval = new Interval(0,9)
rg = new RandomGenerator(interval);
count = interval.Count // equals 10
count.times.do{
print rg.GetNext() + " "
}
This should output something like:
1 4 3 2 7 5 0 9 8 6
Fill an array with the interval, and then shuffle it.
The standard way to shuffle an array of N elements (indexed 0 to N-1) is: let n start at the last index; pick a random number R between 0 and n inclusive, and swap item[R] with item[n]. Then subtract one from n, and repeat until you reach n = 1.
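A minimal sketch of that shuffle (the classic Fisher-Yates algorithm) in Python, using the 0-9 interval from the question; in practice random.shuffle does the same job:

import random

def shuffled_interval(lo, hi):
    items = list(range(lo, hi + 1))              # fill an array with the interval
    for n in range(len(items) - 1, 0, -1):
        r = random.randint(0, n)                 # random index, inclusive of n itself
        items[n], items[r] = items[r], items[n]  # swap item[r] with item[n]
    return items

print(' '.join(map(str, shuffled_interval(0, 9))))  # e.g. 1 4 3 2 7 5 0 9 8 6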
This has come up before. Try using a linear feedback shift register.
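For example, a 4-bit Galois LFSR with feedback polynomial x^4 + x^3 + 1 (mask 0b1100) visits every value in 1..15 exactly once per cycle. A sketch (the width and taps are illustrative choices; note an LFSR never emits 0 and directly covers only intervals of length 2^k - 1):

def lfsr4(seed=1):
    state = seed
    while True:
        yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0b1100  # apply the feedback taps

gen = lfsr4()
print([next(gen) for _ in range(15)])
# [1, 12, 6, 3, 13, 10, 5, 14, 7, 15, 11, 9, 8, 4, 2] -- all of 1..15, no repeats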
One suggestion, but it's memory intensive:
The generator builds a list of all numbers in the interval, then shuffles it.
A very efficient way to shuffle an array of numbers where each entry is unique comes from image processing; it is used when applying techniques like pixel-dissolve.
Basically you start with an ordered 2D array and then shift columns and rows. Those permutations are, by the way, easy to implement; you can even have one exact method that will yield the resulting value at (x, y) after n permutations.
The basic technique, described on a 3x3 grid:
1) Start with an ordered list, each number may exist only once
0 1 2
3 4 5
6 7 8
2) Pick a row/column you want to shuffle and advance it one step. In this case, I am shifting the second row one to the right.
0 1 2
5 3 4
6 7 8
3) Pick a row/column you want to shuffle... I shift the second column one step down.
0 7 2
5 1 4
6 3 8
4) Pick ... For instance, the first row, one to the right.
2 0 7
5 1 4
6 3 8
You can repeat those steps as often as you want. You can always do this kind of transformation on a 1D array as well. So your result would now be [2, 0, 7, 5, 1, 4, 6, 3, 8].
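A small sketch of that row/column rotation idea on an n-by-n grid stored as nested lists (the step count and function name are illustrative choices):

import random

def grid_shuffle(n, steps=100):
    grid = [list(range(r * n, r * n + n)) for r in range(n)]  # ordered start
    for _ in range(steps):
        i = random.randrange(n)
        if random.random() < 0.5:
            grid[i] = [grid[i][-1]] + grid[i][:-1]            # rotate row i one to the right
        else:
            col = [grid[r][i] for r in range(n)]
            col = [col[-1]] + col[:-1]                        # rotate column i one down
            for r in range(n):
                grid[r][i] = col[r]
    return [v for row in grid for v in row]                   # flatten to a 1D array

print(grid_shuffle(3))  # e.g. [2, 0, 7, 5, 1, 4, 6, 3, 8]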
An occasionally useful alternative to the shuffle approach is to use a subscriptable set container. At each step, choose a random number 0 <= n < count. Extract the nth item from the set.
The main problem is that typical containers can't handle this efficiently. I have used it with bit-vectors, but it only works well if the largest possible member is reasonably small, due to the linear scanning of the bitvector needed to find the nth set bit.
99% of the time, the best approach is to shuffle as others have suggested.
EDIT
I missed the fact that a simple array is a good "set" data structure (don't ask me why; I've used it before). The "trick" is that you don't care whether the items in the array are sorted or not. At each step, you choose one randomly and extract it. To fill the empty slot (without having to shift an average of half your items one step down) you just move the current last item into the empty slot in constant time, then reduce the size of the array by one.
For example...
class remaining_items_queue
{
private:
    std::vector<int> m_Items;
public:
    ...
    bool Extract (int &p_Item);  // return false if items already exhausted
};

bool remaining_items_queue::Extract (int &p_Item)
{
    if (m_Items.size () == 0) return false;
    int l_Random = Random_Num (m_Items.size ());
    // Random_Num written to give 0 <= result < parameter
    p_Item = m_Items [l_Random];
    m_Items [l_Random] = m_Items.back ();
    m_Items.pop_back ();
    return true;  // an item was extracted successfully
}
The trick is to get a random number generator that gives (with a reasonably even distribution) numbers in the range 0 to n-1 where n is potentially different each time. Most standard random generators give a fixed range. Although the following DOESN'T give an even distribution, it is often good enough...
int Random_Num (int p)
{
    return (std::rand () % p);
}
std::rand returns random values in the range 0 <= x <= RAND_MAX, where RAND_MAX is implementation defined.
Take all numbers in the interval, put them to list/array
Shuffle the list/array
Loop over the list/array
One way is to generate an ordered list (0-9 in your example).
Then use the random function to select an item from the list. Remove the item from the original list and add it to the tail of the new one.
The process is finished when the original list is empty.
Output the new list.
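A short sketch of that select-and-remove idea in Python (the function name is illustrative; note that popping at a random index is O(n) per draw):

import random

def draw_all(lo, hi):
    pool = list(range(lo, hi + 1))  # the ordered original list
    out = []
    while pool:
        # pick a random item, remove it from the pool, append it to the new list
        out.append(pool.pop(random.randrange(len(pool))))
    return out

print(draw_all(0, 9))  # e.g. [1, 4, 3, 2, 7, 5, 0, 9, 8, 6]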
You can use a linear congruential generator with parameters chosen randomly but so that it generates the full period. You need to be careful, because the quality of the random numbers may be bad, depending on the parameters.
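A sketch of that idea using the Hull-Dobell conditions for a full-period LCG (m = 16, a = 5, c = 3 are illustrative parameters, not from the original answer): x <- (a*x + c) mod m visits every value in 0..m-1 exactly once per period when c is coprime to m, a - 1 is divisible by every prime factor of m, and a - 1 is divisible by 4 whenever m is.

def lcg_full_period(m=16, a=5, c=3, seed=0):
    # Hull-Dobell holds here: gcd(3, 16) == 1; a - 1 == 4 is divisible by 2
    # (the only prime factor of 16) and by 4 (since 4 divides 16)
    x = seed
    for _ in range(m):
        x = (a * x + c) % m
        yield x

print(list(lcg_full_period()))
# [3, 2, 13, 4, 7, 6, 1, 8, 11, 10, 5, 12, 15, 14, 9, 0] -- every value 0..15 once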
