How to select highly variable genes in bulk RNA-seq data? - rna-seq

As a pre-processing step, I need to select the top 1000 highly variable genes (rows) from a bulk RNA-seq dataset that contains about 60k genes across 100 different samples (columns). Each column value is already the mean of triplicates. The table contains normalized values in FPKM. (Note: I don't have access to raw counts, so I cannot use the common R packages that take raw counts as input.)
In this case, what is the best way to select the top 1000 variable genes?
I have tried to filter the genes with the rowSums() function (to remove genes with low row sums) and narrowed the set down from 60k genes to 10k, but I am not sure this is the right way to select highly variable genes. Any input is appreciated.

Filtering on row sums is a first filtration step. After that, your data would be filtered by a log2 fold change cutoff and an adjusted p-value (0.05 or 0.01, depending on your goal). You can repeat this procedure with different row sum cutoffs to compare the results. I personally discard genes whose row sums are zero.
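If it helps, here is a minimal base-R sketch of one common approach when only FPKM is available: rank genes by their variance on log2(FPKM + 1) and keep the 1000 most variable ones. The object name fpkm (a numeric matrix with genes as rows and samples as columns) is an assumption for the example, not something from the question.
# A sketch, assuming `fpkm` is a ~60k x 100 numeric matrix of FPKM values
log_fpkm <- log2(as.matrix(fpkm) + 1)              # log-transform to tame the FPKM dynamic range
log_fpkm <- log_fpkm[rowSums(log_fpkm) > 0, ]      # optional pre-filter: drop all-zero genes
gene_var <- apply(log_fpkm, 1, var)                # per-gene variance across the samples
top_idx  <- order(gene_var, decreasing = TRUE)[1:1000]
hvg      <- log_fpkm[top_idx, ]                    # expression matrix of the 1000 most variable genes
On a matrix of this size, matrixStats::rowVars() is a faster drop-in for the apply() call.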

Related

Parquet file with uneven columns

I'm trying to figure out how to write a parquet file where the columns do not contain the same number of rows per Row Group. For example, my first column might be a value sampled at 10Hz, while my second column may be a value sampled at only 5Hz. I'd rather not repeat values in the slower column since this can lead to computational errors. However, I cannot write columns of two different sizes to the same Row Group, so how can I accomplish this?
I'm attempting to do this with ParquetSharp.
It is not possible for the columns in a Parquet file to have different row counts.
It is not explicit in the documentation, but if you look at https://parquet.apache.org/documentation/latest/#metadata, you will see that a RowGroup has a num_rows and several ColumnChunks that do not themselves have individual row counts.

The column of the csv file in google automl tables is recognised as text or categorical instead of numeric as i would like

I tried to train a model using Google AutoML Tables, but I have the following problem.
The CSV file is imported correctly; it has 2 columns and about 1870 rows, all numeric.
The system recognises only one column as numeric, but not the other.
The problematic column has 5 digits in each row, separated by spaces.
Is there anything I should do so that the system properly recognises the data as numeric?
Thanks in advance for your help.
The issue is with the definition of the Numeric data type: the numbers need to be comparable (greater than, smaller than, equal).
Two different lists of numbers are not comparable; for example, 2 4 7 is not comparable to 1 5 7. To solve this without using strings, and therefore without losing the "information" in those numbers, you have several options.
For example:
Create an array of numbers by inserting [ ] around the values of the second column. Take into consideration the relative weighted approach AutoML Tables takes to the Array data type, as it may affect the "information" extracted from the sequence.
Create additional columns for every entry of the second column, so that each one is a single number and hence truly numeric (see the sketch after this answer).
I would personally go for the second option.
If you are afraid of losing "information" by splitting the numbers, take into consideration that after training the model should deduce by itself the importance of the position and other "information" those number sequences might contain (mean, norm/modulus, relative increase, ...), provided the training data is representative.
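As an illustration of the second option, here is a small R sketch that splits a space-separated column into individual numeric columns before re-exporting the CSV. The file names and the column name "value" are assumptions for the example, not taken from the question.
# A sketch, assuming the second column is called "value" and holds strings like "2 4 7 1 5"
df <- read.csv("input.csv", stringsAsFactors = FALSE)
parts <- do.call(rbind, strsplit(df$value, " ", fixed = TRUE))     # split each cell into its 5 pieces
parts <- apply(parts, 2, as.numeric)                               # convert the pieces to numbers
colnames(parts) <- paste0("value_", seq_len(ncol(parts)))          # value_1 ... value_5
df <- cbind(df[setdiff(names(df), "value")], parts)
write.csv(df, "input_split.csv", row.names = FALSE)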

What is the optimal speed for handling ranges?

Let's say I have a Google Sheet with tab1 and tab2.
In tab1 I have 2000 rows and 20 columns filled with data, and 1000 rows are empty, so I have 3000 rows in total.
In tab2 I have a few formulas, such as VLOOKUP and some IF functions.
The options I can think of are:
I can name the range of the data in tab1 and use that in the formula(s); if the range expands, I can edit the named range
I can use B:B
I can delete the empty rows and use B:B
What is the fastest way?
All three of those options have no real-world effect on overall performance, given that you have only 3000 rows across 20 columns. The biggest impact on performance comes from QUERYs, IMPORTRANGEs and ARRAYFORMULAs if they are fed a huge amount of data (10000+ rows), or if you have extensive calculations with multiple sub-steps consisting of whole virtual arrays.

How to filter rows in which at least a defined number of observations is greater than a specific value?

I have a data frame with 9 columns and many rows. I want to filter all the rows that have observations greater than 3.0 in at least 3 columns. Which conditional statements should I use to subset my data frame?
Since I am a n00b, I only came up with this:
data_frame[data_frame > 3,]
Obviously, this does not do what I need, since it places no restriction on how many columns must exceed the threshold.
Thanks!
I figured out that you can also combine logical operations:
data[rowSums(data > 3) >= 3, ]
Like this, you subset from the data frame the rows for which a value higher than 3 occurs in three or more columns, with no need to specify the columns.
As for the logical operator, in this case it is the brain: I used sum(rowSums(data)) > x, where x is the limit value times the number of columns available.
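To see the rowSums() approach on a toy example (the data frame below is made up purely for illustration):
# Toy data: 5 rows, 4 columns of made-up values between 0 and 6
set.seed(1)
data <- as.data.frame(matrix(runif(20, 0, 6), nrow = 5))
data > 3                         # logical matrix: TRUE where a value exceeds 3
rowSums(data > 3)                # how many columns per row exceed 3
data[rowSums(data > 3) >= 3, ]   # keep rows where that count is at least 3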

How to organise and rank observations of a variable?

I have this dataset containing world bilateral trade data for a few years.
I would like to determine which goods were the most exported ones in the timespan considered by the dataset.
The dataset is composed of the following variables:
"year"
"hs2", containing a two-digit number that tells which good is exported
"exp_val", giving the value of the export in a certain year, for that good
"exp_qty", giving the exported quantity of the good in a certain year
Basically, I would like to get the total sum of the quantity exported for a certain good, so an output like
hs2 exp_qty
01 34892
02 54548
... ...
and so forth. Right now, the column "hs2" gives me a very large number of observations and, as you can understand, they repeat themselves multiple times (as the variables vary across both time and country of destination). So, the task would be to have every hs2 number just once, with the corresponding value of "total" exports.
Also (this would just be a plus, as I could check the numbers myself), it would be nice to get the result sorted by exp_qty, so as to have a ranking of the most exported goods by quantity.
The following might be a start at what you need.
collapse (sum) exp_qty, by(hs2)
gsort -exp_qty
collapse summarizes the data in memory to one observation per value of hs2, summing the values of exp_qty. gsort then sorts the collapsed data by descending value of exp_qty so the first observation will be the largest. See help collapse and help gsort for further details.
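For anyone doing the same thing in R rather than Stata, a rough equivalent might look like the sketch below; the data frame name trade is an assumption for the example.
# Sum exp_qty within each hs2 code, then sort in descending order
totals <- aggregate(exp_qty ~ hs2, data = trade, FUN = sum)
totals <- totals[order(-totals$exp_qty), ]
head(totals)    # the most exported goods by quantity come first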
