Parquet file with uneven columns

I'm trying to figure out how to write a parquet file where the columns do not contain the same number of rows per Row Group. For example, my first column might be a value sampled at 10Hz, while my second column may be a value sampled at only 5Hz. I'd rather not repeat values in the slower column since this can lead to computational errors. However, I cannot write columns of two different sizes to the same Row Group, so how can I accomplish this?
I'm attempting to do this with ParquetSharp.

It is not possible for the columns in a Parquet file to have different row counts.
It is not explicit in the documentation, but if you look at https://parquet.apache.org/documentation/latest/#metadata, you will see that a RowGroup has a num_rows, and its ColumnChunks do not have individual row counts of their own.
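The constraint is easy to demonstrate with any Parquet writer. Below is a minimal sketch in Python using pyarrow rather than ParquetSharp (the format rule is the same for both); padding the slower column with nulls is a common workaround, shown here as an assumption rather than something the answer above prescribes:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Columns of unequal length cannot even form a table:
    try:
        pa.table({"fast_10hz": [1.0, 2.0, 3.0, 4.0],
                  "slow_5hz": [10.0, 20.0]})
    except pa.ArrowInvalid as exc:
        print(exc)  # expected length 4 but got length 2

    # Workaround sketch: pad the slower column with nulls, which Parquet
    # encodes compactly via definition levels instead of repeated values.
    table = pa.table({
        "fast_10hz": [1.0, 2.0, 3.0, 4.0],
        "slow_5hz": pa.array([10.0, None, 20.0, None], type=pa.float64()),
    })
    pq.write_table(table, "samples.parquet")

    # Every column chunk in a row group shares the group's single num_rows:
    print(pq.read_metadata("samples.parquet").row_group(0).num_rows)  # 4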

Related

Evaluate table from row x to row y

I am trying to extract a table to a flat file using Python and the ssas_api package, which lets me run DAX queries from Python code.
The table is fairly big, and because of that a simple EVALUATE tablename query will time out after 1 hour.
I want to split the query into smaller ones, iterating over the table in chunks of, say, 20k rows.
I could do the first chunk using TOPN but what about the next ones?
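One way to sketch that loop in Python, assuming the table has (or can be given) a monotonically increasing numeric column, here called RowIndex, to slice on; the ssas_api calls below follow that package's set_conn_string/get_DAX helpers, and the server, model, table, and credential names are placeholders:

    import pandas as pd
    import ssas_api

    # Placeholder connection details.
    conn = ssas_api.set_conn_string(server="my-ssas-server",
                                    db_name="MyModel",
                                    username="user",
                                    password="secret")

    CHUNK = 20_000
    frames, start = [], 0
    while True:
        # RowIndex is an assumed monotonically increasing column used
        # only to carve the table into non-overlapping slices.
        dax = (f"EVALUATE FILTER(tablename, "
               f"tablename[RowIndex] > {start} && "
               f"tablename[RowIndex] <= {start + CHUNK})")
        chunk = ssas_api.get_DAX(connection_string=conn, dax_string=dax)
        if chunk.empty:
            break
        frames.append(chunk)
        start += CHUNK

    pd.concat(frames, ignore_index=True).to_csv("tablename.csv", index=False)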

How to select highly variable genes in bulk RNA seq data?

As a pre-processing step, I need to select the top 1000 highly variable genes (rows) from bulk RNA-seq data containing about 60k genes across 100 different samples (columns). Each column value is already the mean of triplicates. The table contains normalized values in FPKM. (Note: I don't have access to raw counts and cannot use the common R packages, as they take raw counts as input.)
In this case, what is the best way to select the top 1000 variable genes?
I have tried filtering the genes with the rowSums() function (to remove genes with low row sums) and narrowed the set down from 60k genes to 10k, but I am not sure it is the right way to select highly variable genes. Any input is appreciated.
Filtering on row sums is a first filtration step. After that, your data can be filtered by a log2 fold change cutoff and an adjusted p-value (0.05 or 0.01, depending on your goal). You can repeat this process with different row-sum cutoffs and compare the results. I personally discard genes whose row sums are zero.
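For the variable-gene selection itself, ranking genes by the variance of log-transformed FPKM is one common heuristic when raw counts are unavailable; a minimal pandas sketch, where the file name and matrix shape are assumptions for illustration:

    import numpy as np
    import pandas as pd

    # Genes as rows, samples as columns (~60k x 100 in the question).
    fpkm = pd.read_csv("fpkm_matrix.csv", index_col=0)

    # The rowSums()-style pre-filter from the question: drop all-zero genes.
    fpkm = fpkm[fpkm.sum(axis=1) > 0]

    # Log-transform so variance is not dominated by highly expressed genes.
    log_fpkm = np.log2(fpkm + 1)

    # Keep the 1000 genes with the highest variance across samples.
    top_genes = log_fpkm.var(axis=1).nlargest(1000).index
    hvg = fpkm.loc[top_genes]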

A CSV column in Google AutoML Tables is recognised as text or categorical instead of numeric, as I would like

I tried to train a model using Google AutoML Tables, but I have the following problem.
The CSV file is correctly imported; it has 2 columns and about 1870 rows, all numeric.
The system recognises only 1 column as numeric but not the other.
The column where the problem is has 5 digits in each row, separated by spaces.
Is there anything I should do in order for the system to properly recognise the data as numeric?
Thanks in advance for your help
The issue is with AutoML Tables' definition of the Numeric data type: the values need to be comparable (greater than, smaller than, equal).
Two different lists of numbers are not comparable; for example, 2 4 7 is not comparable to 1 5 7. To solve this without using strings, and therefore without losing the "information" in those numbers, you have several options.
For example:
Create an array of numbers by wrapping the values of the second column in [ ]. Take into consideration the Array data type's relative weighted approach in AutoML Tables, as it may affect the "information" extracted from the sequence.
Create additional columns for every entry of the second column, so each one is a single number and hence truly numeric (sketched below).
I would personally go for the second option.
If you are afraid of losing "information" by splitting the numbers, consider that after training the model should deduce by itself the importance of the position and other "information" those number sequences might contain (mean, norm/modulus, relative increase, ...), provided the training data is representative.
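A minimal pandas sketch of that second option, splitting the space-separated column into one truly numeric column per value before upload; the file and column names are made up for illustration:

    import pandas as pd

    df = pd.read_csv("training_data.csv")

    # 'features' holds strings like "2 4 7 1 9"; expand into one column
    # per number so each value is individually comparable.
    split = df["features"].str.split(expand=True).astype(float)
    split.columns = [f"f{i + 1}" for i in range(split.shape[1])]

    out = pd.concat([df.drop(columns=["features"]), split], axis=1)
    out.to_csv("training_data_numeric.csv", index=False)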

Count cells in a row where value is greater than zero

I'm looking to add a column to my Power Query data which will count how many of the 5 cells in the row are greater than zero.
I could do this with lots of if statements, but I need to be able to expand the number of columns in the future.
I think this should do it. Add a custom column with:
List.Count(List.RemoveMatchingItems(Record.FieldValues(_), {0}))
Note that this counts every non-zero cell; if negative values are possible, use List.Select(Record.FieldValues(_), each _ > 0) instead.
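For comparison, the same row-wise count is a one-liner in pandas as well (hypothetical frame; it scales to any number of columns without extra if logic):

    import pandas as pd

    df = pd.DataFrame({"A": [1, 0, 2], "B": [0, 0, 5], "C": [3, -1, 0]})

    # Count, per row, how many cells are strictly greater than zero.
    df["CountGreaterThanZero"] = df.gt(0).sum(axis=1)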

Oracle ORDER BY for a denormalized table

Say we have a denormalized table where the row size is quite big.
When Oracle performs an ORDER BY (in memory):
Is it loading the whole row into memory just to check a small column for ordering, or is it loading only the ID and the column to sort by?
Is the behavior different when the sort spills to disk?
It only sorts the required data, which includes the order-by columns and the columns being projected.
If you select ten columns from a fifty-column table and sort by two columns that are not selected, then twelve columns are included in the sort area requirements.
