Correlation with groupby function - correlation

If I want to calculate the correlation coefficient for the two variables "Duration" and "% played". Am I allowed to use df.groupby("Duration").agg({"played": "mean"}) and then calculate the correlation coefficient? Or do I have to take the entire dataset as is without groupby?
Duration
% played
6
50%
12
48.5%
6
42.5%
8
41.5%
12
48.5%

Related

Finding maximum and minimum value in a matrix column using Octave

I have a 10 x 2 sample matrix as follows
2104 3
1600 3
2400 3
1416 2
3000 4
1985 4
1534 3
1427 3
1380 3
1494 3
I need a generalized method to find the minimum and maximum value in a column.
I can use
max(max(X)) to find the maximum value in a matrix, but not of a column.
Also, max(min(X)) to find the minimum value is not a generalized solution.
Given a matrix X, max(X) will return the maximum value in each column. You can index the result to get the value for a given column:
max(X)(1) % max of the fist column (doesn't work in MATLAB)
Alternatively, extract the column and get its max:
max(X(:,1)) % max of the fist column
max (and many similar functions) operate on columns by default. To get the maximum of each row, use max(X,[],2).

How do I representation percentage in evolutionary Algorithm?

Considering I have 4 chromosomes (gi, i=1 to 4}) to represent 4 percentages of different things so that the sum of 4 percentages are equal to 100. How Do I represent this efficiently?
I know that it is possible by: g1/(g1+g2+g3+g4). However, This is not efficient. Consider all gi=0.2 or all gi=0.1 will represent 25% in these two cases. It is possible to generate many cases where different genes present same percentage. Is there any other efficient way, where unique set of combination of genes present unique set of percentages.
Thanks in advance.
I think you're confusing genes and chromosomes. A chromosome encodes a candidate solution to your problem. A gene is part of a chromosome.
Under this setting, why would you want that constraint on the chromosomes? it sounds like you want it on the genes of a chromosome.
In order to do this you can do a number of things: have each gene encode an integer in [0, 100]. If the genes do not add to 100 in the end, penalize the fitness of those chromosomes.
Another way, which might make crossover operators more natural to apply, is to have each gene store 100 bits. If x bits are set, that means the gene will encode x%.
Yet another way is to have the entire chromosome encode 100 set bits. Then each gene will hold a value x, which represents an interval. The number of set bits between two split points is the percentage associated to that gene. For example:
1 2 3 4 5 6 7 8 ... 100
1 1 1 1 1 1 1 1 ... 1
| | | | |
g1 g2 g3 g4
This can be done by generating 5 random numbers <= 100, sorting them and taking the differences between them.
One way to assign X units to N possibilities is to store X * (N-1) bits. Every unit is given (N-1) bits and if k of the (N-1) bits are set then the unit is assigned to k.
This is easy to work with as there are no invalid solutions and no penalties/repairs are necessary. This makes fitness evaluation, crossover and mutation easier to implement.
For example, the problem is to assign 5 units (X) to one of 4 (N) possibilities. Each individual is (4-1)x5=15 bits.
The bit string: 010 100 000 011 111 assigns the first 2 units to possibility 1 because both groups have 1 bit set. The third unit which has no bits set is assigned to 0. The fourth unit is assigned to 2 and the fifth to 3.
partition units
0 1
1 2
2 1
3 1

Calculate the Ranks of Candidates based on Votes and Total Candidates

How can I claulate the rank of each candidate when I have the total candidates and votes secured by each?
I've managed the percentage part, but calculating the rank has me stuck.
I'll be using MySql in the end for this, but right now I only need the formula or method to calculate ranks.
Id be glad if you could help with just the formula. Just like the formula for interest is PTR/100.
Total Candidates
5
Total Votes
75
Votes
Name Marks Percentage Rank(What I'm trying to calculate)
A 25 33.34 1/5 ->Rank 1/5 has the most votes
B 20 26.67 2/5 ->And so on
C 10 13.34 4/5
D 5 6.67 5/5
E 15 20.00 3/5
There is a previous question on SO that addresses this, using MySQL and a ranking variable. There is some lovely stuff in the answers
MySQL rank function

Including STDEV in this macro and call

Ive got together the following macro to get a random sample of various sizes then calculate MEAN a given number of times. I would also like calculate the STDEV alongside the MEAN. Ive tried various amendments to the script but im struggling to apply the correct syntax i think. Thanks for any help.
Regards
Andy
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
myvar = the variable of interest (here we want the mean of salary)
nbsampl = number of samples.
size = the size of each samples.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='E:\Monte carlo testing\s1.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='E:Monte carlo testing\sample.sav'.
!IFEND
SAVE OUTFILE='E:\Monte carlo testing\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=salary BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
!sample myvar=VAR00001 nbsampl=200 size= 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
Does this help?
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar)
/sd = STDEV(!myvar)
/ss=FIRST(ss).

Probability of event

Here is a probability problem: you observe .5 cars on average passing in front of you every 5 minutes on a road. What is the probability of seeing at least 1 car in 10 minutes?
I'm trying to solve this in 2 ways. The first way is to say: P(no car in 5 minutes) = 1 - .5 = .5. P(no car in first 5 minutes and no car in second 5 minutes) = P(no car in first 5 minutes) * P(no car in second 5 minutes) by independence. Therefore P(at least 1 car in 10 minutes) = 1 - .5*.5 = .75.
However, if I try the same, with a Poisson distribution with rate lambda = .5 per unit of time, for 2 units of time, I get: P(at least 1 car in 2 units of time) = 1 - exp(-2*lambda) = .63.
Am I doing something wrong? If not, what explains the discrepancy?
Thanks!
Your first calculation is incorrect. An average .5 cars / 5 minutes does not imply P(no car in 5 minutes) = 0.5. Consider for instance a process where every five minute, you see either no car with probability 90%, or 5 cars with probability 10%. On average you will see 0.5 cars every five minute, but the probability you see 0 cars in the next 5 minutes is clearly not 50%.
I haven't checked the computations for your second example; the calculation logic is looks correct, but the conclusion is incorrect: you are making an assumption about the distribution (Poisson) which is plausible but not implied by the problem statement.
If you take again my example, which is consistent with your problem description, the probability to see 0 cars in 10 minutes is 0.9 x 0.9 = 0.81, which gives you 19% of seeing one car or more. We could arbitrarily change my example to give you a wide variety of probabilities.
From your problem statement, the only thing you can say is that "in the long run, you'll see 0.5 cars every 5 minutes". Beyond that you can't make a statement on what should be expected within 10 minutes, unless you make some assumptions about the distribution of the cars arrivals.

Resources