I have a huge dataset that is broken down into counts per tree. So there are 15 counts made per tree. I need to make an average of counts of egg.scars (column name) within each tree. I don't want an average of the whole column like I keep getting, I need an average egg scar count per tree.
Thanks!
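You can extract specific rows of a column with eggs[1:5, 5], where eggs is your data frame, 1:5 selects rows 1 to 5, and 5 is the column number. Then take the mean of just those rows with mean(eggs[1:5, 5]), adjusting the row range so that it covers the 15 counts of each tree.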
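More generally, what the question asks for is a grouped mean: group the rows by tree and average egg.scars within each group. A minimal sketch of that idea in Python, assuming each row is a dict that carries a tree identifier under the key "tree" (a hypothetical column name; only egg.scars is named in the question):

```python
from collections import defaultdict

def mean_per_group(rows, group_key, value_key):
    """Return {group: mean of value_key over the rows in that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
        counts[row[group_key]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical data: "tree" is an assumed column name and the counts are made up.
rows = [
    {"tree": "T1", "egg.scars": 2},
    {"tree": "T1", "egg.scars": 4},
    {"tree": "T2", "egg.scars": 3},
]
print(mean_per_group(rows, "tree", "egg.scars"))
```

This avoids hard-coding row ranges, so it still works if some trees have fewer than 15 counts.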
Was asked this question in a coding round:
Given a matrix of 0's and 1's where the values in every row are in ascending order, i.e. the 1's always come after the 0's. Consider the example:
0,0,0,1,1
0,0,1,1,1
0,0,0,0,1
1,1,1,1,1
0,0,0,0,0
Find the first column that has a 1 (from left to right).
In this case the first column (in row 4) has a 1.
Answer is 1
I suggested a column-wise traversal across all rows, exiting when the current column contains a 1 in any of the rows.
Since the worst-case performance is n * n (comparing every element in the matrix), the interviewer wasn't pleased and was looking for an efficient solution. What is an efficient solution here?
Take advantage of the fact that the rows are sorted, which is evident from "in any row - the values will be ascending order. i.e 1's are always after the 0's".
Let there be m rows and n columns. Do a binary search on the first row to find the position of its first 1 and store that position in a variable, say index (a better variable name is possible; the focus here is on solving the problem efficiently). Then binary search every remaining row, updating index whenever that row's first 1 lies in a column with a smaller index. After searching every row, index holds the result.
Time complexity: m rows * log2(n columns) i.e. O(m * log2(n)).
This is the approach I could think of, which is better than the brute-force approach with O(mn) time complexity. Note, however, that a staircase walk from the top-right corner (move left on a 1, move down on a 0) takes O(m + n) time, which can be faster when there are many rows, so the binary search approach is not optimal in every case.
[I don't think I should add the details on how to do a binary search to figure out the first column containing a 1. In case someone isn't very familiar with binary search, I leave this trivial part as an exercise.]
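For readers who want to see the per-row binary search spelled out, here is a minimal Python sketch. It leans on the standard library's bisect_left rather than a hand-written search, returns a 0-based column index (so the example's answer of 1 corresponds to index 0), and narrows each search to the columns left of the best position found so far. The function name is made up for illustration.

```python
from bisect import bisect_left

def first_column_with_one(matrix):
    """Return the 0-based index of the leftmost column containing a 1, or None."""
    if not matrix or not matrix[0]:
        return None
    width = len(matrix[0])
    best = width  # sentinel: no 1 found yet
    for row in matrix:
        # Each row is sorted (0's then 1's), so bisect_left finds its first 1.
        # Only columns left of the current best can improve the answer.
        best = bisect_left(row, 1, 0, best)
        if best == 0:
            break  # cannot do better than the first column
    return best if best < width else None

matrix = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
print(first_column_with_one(matrix))  # 0, i.e. the first column
```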
Given a text file with two columns, produce the largest possible subset of lines for which no value is repeated within either column.
For example, given these four lines :
1 a
1 b
2 a
2 b
One can use something like "sort -u" on the command line, to unique first on column 1, leaving
1 a
2 a
and then on column two, leaving just
1 a
This satisfies "no value is repeated" but not "largest possible subset".
In an ideal world, I would have produced either
1 a
2 b
or
1 b
2 a
Given the further constraint that these files might be many gigabytes (i.e. much larger than available RAM, but much smaller than available disk), I can't just keep all the values in a data structure.
Can anyone think of an approach?
I would also be happy with "a pretty large subset" if I can't literally get "the largest possible subset".
If I sort by (column 1 ascending and then column 2 random), uniq'ing on column 1 will give me slightly better results, but I feel like there's something simple that I'm missing.
For each unique item in column 1, build a list of the unique column 2 items it appears with. Then, starting with the shortest list, build the final output: for each column 1 item, take the first value from its list that has not been used in the output yet.
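A minimal in-memory Python sketch of this greedy idea follows. It is a heuristic, so it gives "a pretty large subset" rather than a guaranteed maximum, and it assumes the pairs fit in memory, which the question says the real files do not; an on-disk version would need external sorting or chunking. The function name is hypothetical.

```python
from collections import defaultdict

def greedy_unique_pairs(pairs):
    """Pick pairs so that no value repeats within either column (greedy heuristic)."""
    # Map each column 1 value to its distinct column 2 values, in first-seen order.
    candidates = defaultdict(dict)
    for left, right in pairs:
        candidates[left][right] = None

    used_right = set()
    chosen = []
    # Handle the column 1 values with the fewest options first.
    for left in sorted(candidates, key=lambda key: len(candidates[key])):
        for right in candidates[left]:
            if right not in used_right:
                used_right.add(right)
                chosen.append((left, right))
                break
    return chosen

pairs = [("1", "a"), ("1", "b"), ("2", "a"), ("2", "b")]
print(greedy_unique_pairs(pairs))  # [('1', 'a'), ('2', 'b')]
```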
I want to use Google Sheets for a competition ranking that re-ranks the teams automatically when I key in the points.
However, ties can happen. If a tie happens, I will take the Score Difference (SD) into consideration: within a tie, the team with the lower SD ranks higher.
See below table for illustration:
For example: Team A and Team D currently have the highest PTS, so both of them are currently Rank 1. However, Team D has a lower SD than Team A, so I want the sheet to automatically rank Team D as Rank 1 and Team A as Rank 2.
Is this possible?
One solution might be to create a hidden column with a formula like:
=PTS * 10000 - SD
(Replacing PTS and SD with the actual cell references)
Multiplying PTS by 10000 ensures it has a higher priority than SD (as long as SD always stays below 10000 in absolute value).
We want to reward low SDs, so we subtract instead of add.
Finally, in the rank column, we can use a formula like:
=RANK(HiddenScoreCell, HiddenScoreColumnRange, 0)
So, for example, if the HiddenScore column is column K, the actual formula for row 2 might look like
=RANK(K2, K:K, 0)
The third parameter is 0 because we want higher scores to get a numerically lower (better) rank.
To sort, you can just apply a sort on the Rank column.
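The composite-score idea can be checked outside the spreadsheet. Below is a minimal Python sketch that mimics a descending RANK over PTS * 10000 - SD. The team names and numbers are invented for illustration, since the table from the question is not reproduced here, and the sketch assumes SD stays well below 10000 in absolute value.

```python
def rank_teams(teams):
    """teams maps name -> (pts, sd). Returns name -> rank, where 1 is best."""
    # Composite score: PTS dominates, and a lower SD gives a slightly higher score.
    scores = {name: pts * 10000 - sd for name, (pts, sd) in teams.items()}
    # Descending rank: 1 plus the number of strictly higher scores.
    return {
        name: 1 + sum(other > score for other in scores.values())
        for name, score in scores.items()
    }

# Hypothetical standings: A and D tie on PTS, D has the lower SD.
teams = {"A": (9, 5), "B": (6, 2), "C": (3, -1), "D": (9, 3)}
print(rank_teams(teams))  # {'A': 2, 'B': 3, 'C': 4, 'D': 1}
```

Teams that tie on both PTS and SD still share a rank, which matches how a descending RANK treats equal values.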
With sort() you can define multiple sorting criteria (see the documentation), e.g.
=sort(A2:H5,8,false,7,true)
This sorts your table (in A2:H5, change accordingly) first on PTS, descending, then on SD, ascending, since a lower SD should rank higher. You can add more criteria with more pairs of parameters (column index, then true for ascending or false for descending).
Then you need to compare each team name with the sorted table and find its position in the sorted list:
=ArrayFormula(match(A2:A5,index(sort(A2:H5,8,false,7,true),,1),0))
Paste that formula in I2 (assuming your table starts in A1 with its headers, otherwise adjust accordingly).
=ARRAYFORMULA(IF(LEN(A2:A), RANK(H2:H*9^9-G2:G, H2:H*9^9-G2:G), ))
We have a matrix A = (a_ij). We sort every row of A in increasing order, and after that we sort every column in increasing order. Why do the rows stay sorted every time I do that? Can someone give an explanation, a counterexample, or a proof?
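The observation can be checked empirically before looking for a proof. The Python sketch below sorts the rows, then the columns, and tests whether every row is still non-decreasing; it is only a quick experiment on random matrices, not a proof, and the helper names are made up.

```python
import random

def sort_rows_then_columns(matrix):
    """Sort each row, then each column, and return the resulting matrix."""
    rows_sorted = [sorted(row) for row in matrix]
    cols_sorted = [sorted(col) for col in zip(*rows_sorted)]
    # Transpose back so the result is a list of rows again.
    return [list(row) for row in zip(*cols_sorted)]

def rows_are_sorted(matrix):
    """True if every row is in non-decreasing order."""
    return all(row[i] <= row[i + 1] for row in matrix for i in range(len(row) - 1))

# Try many random 4 x 5 matrices with repeated values allowed.
random.seed(0)
trials = [
    [[random.randint(0, 9) for _ in range(5)] for _ in range(4)]
    for _ in range(1000)
]
print(all(rows_are_sorted(sort_rows_then_columns(m)) for m in trials))
```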
We are given following matrix
5 7 9
7 8 4
4 2 9
We need to find the row or column with the maximum sum, subtract 1 from each element of that row or column, and repeat this operation 3 times.
I will try to explain.
The matrix is n × n and the decrement operation is repeated k times.
An O(n^2 + k × log(n)) algorithm is possible.
If the line with the largest sum is a row, then:
Its row sum decreases by n.
Every column sum decreases by 1.
The same two rules apply, with rows and columns swapped, when the largest sum belongs to a column.
For rule one, store the row sums and the column sums in two AVL trees
(or any other data structure that supports O(log(n)) insert and remove).
For rule two, store the number of row operations and the number of column operations performed so far (just two integers).
Now take the maximum of each tree, using the two integers to correct for the pending decrements: a row's true sum is its stored sum minus the number of column operations, and vice versa. Pick the larger of the two, decrease its stored sum by n, increment the matching counter, and insert it back.
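A minimal Python sketch of this bookkeeping, for the version of the problem where 1 is subtracted from the chosen row or column. It swaps the AVL trees for two binary heaps from the standard library (heapq), which also give O(log(n)) updates, and it breaks ties in favour of rows; both choices are assumptions, as is the function name.

```python
import heapq

def pick_max_lines(matrix, k):
    """Repeat k times: pick the row/column with the largest sum, subtract 1 from it.

    Returns a list of (kind, index, sum_before_subtracting) tuples.
    """
    n = len(matrix)
    # heapq is a min-heap, so sums are stored negated.
    rows = [(-sum(row), i) for i, row in enumerate(matrix)]
    cols = [(-sum(matrix[i][j] for i in range(n)), j) for j in range(n)]
    heapq.heapify(rows)
    heapq.heapify(cols)
    row_ops = col_ops = 0  # the "two integers" from rule two
    picks = []
    for _ in range(k):
        # Every column operation so far has lowered each row sum by 1, and vice versa.
        best_row = -rows[0][0] - col_ops
        best_col = -cols[0][0] - row_ops
        if best_row >= best_col:
            neg_sum, i = rows[0]
            heapq.heapreplace(rows, (neg_sum + n, i))  # stored sum drops by n
            row_ops += 1
            picks.append(("row", i, best_row))
        else:
            neg_sum, j = cols[0]
            heapq.heapreplace(cols, (neg_sum + n, j))
            col_ops += 1
            picks.append(("col", j, best_col))
    return picks

matrix = [[5, 7, 9], [7, 8, 4], [4, 2, 9]]
print(pick_max_lines(matrix, 3))
# [('col', 2, 22), ('row', 0, 20), ('row', 1, 18)]
```

Building the initial sums costs O(n^2) and each of the k steps costs O(log(n)), which matches the bound stated above.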