How to filter rows in which at least a defined number of observations is greater than a specific value? - filter

I have a data frame with 9 columns and many rows. I want to filter all the rows that have observations greater than 3.0 in at least 3 columns. Which conditional statements should I use to subset my data frame?
Since I am a n00b, I only came up with this:
data_frame[data_frame > 3,]
Obviously, this gives me all the rows for which all values are > 3, regardless of what I actually need.
Thanks!

I figured that you could also combine logical operators:
data[rowSums(data > 3) >= 3, ]
Here data > 3 yields a logical matrix and rowSums() counts the TRUE values in each row, so this keeps the rows in which at least three observations are greater than 3, without having to specify the columns.
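For anyone who wants to check the logic outside R, the same count-then-compare idea can be sketched in plain Python. This is only an illustration using the question's condition (more than 3.0 in at least 3 columns). The function name filter_rows, its parameters and the sample rows are made up here and are not part of the answer above.

```python
def filter_rows(rows, threshold, min_count):
    """Keep rows in which at least `min_count` values exceed `threshold`."""
    kept = []
    for row in rows:
        # Booleans sum as 0/1, which mirrors rowSums() over a logical matrix in R.
        hits = sum(value > threshold for value in row)
        if hits >= min_count:
            kept.append(row)
    return kept


# Hypothetical sample data: 4 rows, 9 columns.
rows = [
    [1.0, 4.2, 3.5, 0.5, 2.0, 3.1, 1.1, 0.0, 2.9],  # 3 values > 3, kept
    [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0],  # none strictly > 3, dropped
    [5.0, 6.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],  # 2 values > 3, dropped
    [9.9, 8.8, 7.7, 6.6, 5.5, 4.4, 3.3, 2.2, 1.1],  # 7 values > 3, kept
]
result = filter_rows(rows, threshold=3.0, min_count=3)
```

The comparison is strict, so a row made up entirely of 3.0 values is dropped. Use >= inside the sum if values equal to the limit should count as well.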

Use a logical operator and, in this case, your brain. I used sum(rowSums(data)) > x, where x is the limit value multiplied by the number of columns available.

Related

How to select highly variable genes in bulk RNA seq data?

As a pre-processing step, I need to select the top 1000 highly variable genes (rows) from a bulk RNA-seq dataset that contains about 60k genes across 100 different samples (columns). Each column value is already the mean of the triplicates. The table contains normalized values in FPKM. (Note: I don't have access to raw counts, so I am not able to use the common R packages, as these take raw counts as input.)
In this case, what is the best way to select the top 1000 variable genes?
I have tried filtering the genes with the rowSums() function (to remove the genes with low row sums) and narrowed them down from 60k genes to 10k genes, but I am not sure if it is the right way to select highly variable genes. Any input is appreciated.
Filtering on row sums is the first filtration step. After that, the data are filtered further by a log2 fold change cutoff and an adjusted p-value (padj) cutoff (0.05 or 0.01, depending on your goal). You can repeat this procedure with different row-sum cutoffs and compare the results. Personally, I discard rows whose sum is zero.
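The row-sum step, repeated with different cutoffs, could look roughly like the sketch below. It is written in plain Python purely for illustration. The gene names, the FPKM values and the helper name filter_by_row_sum are hypothetical, and the fold-change and adjusted p-value steps are not covered.

```python
def filter_by_row_sum(expression, cutoff=0.0):
    """Return the genes whose summed expression across samples exceeds `cutoff`."""
    return {gene: values for gene, values in expression.items() if sum(values) > cutoff}


# Hypothetical FPKM table: gene -> one value per sample.
expression = {
    "geneA": [0.0, 0.0, 0.0],
    "geneB": [1.5, 0.0, 2.5],
    "geneC": [10.0, 12.0, 8.0],
}

# Repeat the filter with several cutoffs to see which genes survive each one.
survivors = {
    cutoff: sorted(filter_by_row_sum(expression, cutoff))
    for cutoff in (0.0, 5.0, 20.0)
}
```

With the default cutoff of 0.0, only genes with a row sum of zero are discarded.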

PowerQuery syntax to overcome #NUM! error

I have two columns of data in Excel. Using PowerQuery, I am trying to divide these two columns and call the result column X. The problem is that there are zeros in these two columns, so we get "#NUM!" in column X when dividing. How can I write an if statement in PowerQuery so that if the value of column X (the division) is NaN (#NUM!), it is set to zero?
The below doesn't change the NaN's to zeros:
if[Column1]/[Column2]="NaN" then 0 else[Column1]/[Column2]
This should be a FAQ, but the approach is similar in almost every language. I'd write your statement like this: if [Column2] = 0 then 0 else [Column1]/[Column2]. The division is then only evaluated for non-zero denominators.
Another thought, which I just used: Power Pivot (DAX) has a DIVIDE function that is divide-by-zero safe: DIVIDE([Column1], [Column2]). It is shorter to write and should perform better, as the calculation is only performed once, especially with more complex denominators. Note that DIVIDE is a DAX function, so it belongs in a Power Pivot measure or calculated column rather than in a Power Query step.
Final thought: because ratios aren't additive, I tend not to store them in the PQ results, choosing instead to calculate them dynamically in Power Pivot or elsewhere in the reporting. In Excel you can use =iferror(a/b, 0).
JR
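As the answer says, the approach is similar in almost every language: test the denominator before dividing, rather than comparing the result with the text "NaN". A minimal sketch in plain Python follows. The function name safe_divide and the fallback parameter are assumptions made for this illustration, not Power Query functions.

```python
import math


def safe_divide(numerator, denominator, fallback=0.0):
    """Divide, returning `fallback` when the denominator is zero or the result is NaN."""
    # Check the denominator first so the division is never attempted with zero.
    if denominator == 0:
        return fallback
    result = numerator / denominator
    # A NaN numerator still produces NaN, so map that to the fallback as well.
    return fallback if math.isnan(result) else result


ratios = [safe_divide(n, d) for n, d in [(6, 3), (1, 0), (0, 0), (float("nan"), 2)]]
```

Comparing a numeric result with a string never matches, which is why the original if expression leaves the NaN values in place.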

A column of the CSV file in Google AutoML Tables is recognised as text or categorical instead of numeric, as I would like

I tried to train a model using Google AutoML Tables, but I have the following problem.
The CSV file is correctly imported; it has 2 columns and about 1870 rows, all numeric.
The system recognises only 1 column as numeric, but not the other.
The column where the problem occurs has 5 digits in each row, separated by spaces.
Is there anything I should do in order for the system to properly recognise the data as numeric?
Thanks in advance for your help.
The issue is with the definition of the Numeric data type: the value needs to be comparable (greater than, smaller than, equal).
Two different lists of numbers are not comparable; for example, 2 4 7 is not comparable to 1 5 7. To solve this without falling back to strings, which would lose the "information" in those numbers, you have several options.
For example:
1. Create an array of numbers by enclosing each entry of the second column in [ ]. Keep in mind the relative weighted approach that AutoML Tables applies to the Array data type, as it may affect the "information" extracted from the sequence.
2. Create an additional column for every number in the second column, so that each one holds a single number and is hence truly numeric.
I would personally go for the second option.
If you are afraid of losing "information" by splitting the numbers, keep in mind that after training, the model should deduce by itself the importance of the position and other "information" those number sequences might contain (mean, norm/modulus, relative increase, ...), provided the training data is representative.
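The second option, one column per number, can be prepared before the upload. Below is a minimal sketch in plain Python, assuming a CSV whose problem column holds space-separated digits. The column names target and digits, the sample values and the helper name split_digit_column are all hypothetical.

```python
import csv
import io


def split_digit_column(csv_text, column):
    """Replace `column` with one integer column per space-separated number."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Remove the combined column and split its value on whitespace.
        parts = row.pop(column).split()
        for position, part in enumerate(parts, start=1):
            row[f"{column}_{position}"] = int(part)
        rows.append(row)
    return rows


# Hypothetical input: one plain numeric column and one space-separated column.
csv_text = "target,digits\n10,1 5 7 2 9\n20,2 4 7 0 3\n"
rows = split_digit_column(csv_text, "digits")
```

Each resulting row holds five separate numeric fields (digits_1 to digits_5), which can then be written back out with csv.DictWriter.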

Count cells in a row where value is greater than zero

I'm looking to add a column to my PowerQuery data which will count how many of 5 cells in the row are greater than zero. Example data below with end result:
I could do this with lots of If statements but I need to be able to expand my number of columns in the future.
I think this should do it. Add a column with:
List.Count(List.RemoveMatchingItems(Record.FieldValues(_),{0}))
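Note that the expression above removes the zeros and counts whatever is left, so negative numbers and non-numeric fields in the row are counted too. The stricter "greater than zero" rule can be sketched as follows, in plain Python purely to show the logic. The column names and the helper name count_positive are assumptions for this example.

```python
def count_positive(row):
    """Count the numeric values in `row` that are strictly greater than zero."""
    return sum(
        1
        for value in row.values()
        # Skip text and booleans so that only real numbers are compared.
        if isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value > 0
    )


# Hypothetical rows: one label column plus five value columns.
table = [
    {"Name": "a", "C1": 0, "C2": 3, "C3": -1, "C4": 2.5, "C5": 0},
    {"Name": "b", "C1": 0, "C2": 0, "C3": 0, "C4": 0, "C5": 0},
]
counts = [count_positive(row) for row in table]
```

Because the function iterates over whatever fields the row has, adding more value columns later needs no change to the code.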

Oracle: Loop over partitions / groups and their subpartitions / groups

I would like to know if it is possible to achieve the steps below in PL/SQL.
Please note that I use the word "partition" when I mean "put rows with a certain condition together" because a) I would like to avoid the word "group" because it combines rows in SQL, b) my research so far led me to think that the "PARTITION BY" clause is possibly what I want:
1. Select rows based on a long query with many joins,
partition the results based on a certain column value of type LONG.
2. Loop through each row of a partition and partition again,
based on another column of type VARCHAR.
Do that for every partition.
3. Loop through each row of the resulting sub-partition, compare multiple columns
with predefined values, set a boolean column to true or false based on the result.
Do that for every sub-partition.
It would be really easy to do for me in a normal programming language, such as Java. But can I do that in PL/SQL? If so, what would be a good approach?

Resources