How to write multiple arrow/parquet files in chunks while reading in large quantities of data, so that all written files form one dataset?

I'm working in R with the arrow package. I have multiple tsv files:
sample1.tsv
sample2.tsv
sample3.tsv
sample4.tsv
sample5.tsv
...
sample50.tsv
each of the form:
| id | start| end | value|
| --- | -----|-----|------|
| id1 | 1 | 3 | 0.2 |
| id2 | 4 | 6 | 0.5 |
| id. | ... | ... | ... |
| id2 | 98 | 100 | 0.5 |
and an index file:
| id | start| end |
| --- | -----|-----|
| id1 | 1 | 3 |
| id2 | 4 | 6 |
| id. | ... | ... |
| id2 | 98 | 100 |
I left join each sample onto the index file by id, start, and end, to get a data table like this:
| id | start| end | sample1 | sample2 | sample... |
| --- | -----|-----|---------|---------|-----------|
| id1 | 1 | 3 | 0.2 | 0.1 | ... |
| id2 | 4 | 6 | 0.5 | 0.8 | ... |
| id. | ... | ... | ... | ... | ... |
| id2 | 98 | 100 | 0.5 | 0.6 | ... |
I'd like to read the samples in chunks (e.g. chunk_size = 5), and once chunk_size samples have been read and joined, write that joined data table to disk as a parquet file.
Currently I'm able to write each chunked data table to disk, and I read them back with open_dataset(datadir). In a loop, with i as the sample number:
# read and join
...
if (i %% chunk_size == 0) {
  write_parquet(joined_table, file.path("datadir", paste0("chunk", i / chunk_size, ".parquet")))
}
...
# clear the data table of samples
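For concreteness, here is a fuller sketch of that loop. The file names, the column rename, and the use of dplyr's left_join are assumptions based on the description above, not the actual code:

library(arrow)
library(dplyr)

chunk_size <- 5
index <- read_tsv_arrow("index.tsv")  # assumed name of the index file
joined_table <- index

for (i in 1:50) {
  sample_i <- read_tsv_arrow(paste0("sample", i, ".tsv"))
  # rename the value column so each sample gets its own column after the join
  names(sample_i)[names(sample_i) == "value"] <- paste0("sample", i)
  joined_table <- left_join(joined_table, sample_i, by = c("id", "start", "end"))

  if (i %% chunk_size == 0) {
    write_parquet(joined_table, file.path("datadir", paste0("chunk", i / chunk_size, ".parquet")))
    joined_table <- index  # clear the accumulated samples
  }
}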
However, even though the arrow package reports reading as many files as were written, when I check the available columns, only the columns from the first chunk are present.
data <- arrow::open_dataset("datadir")
data
# FileSystemDataset with 10 Parquet files
# id: string
# start: int32
# end: int32
# sample1: double
# sample2: double
# sample3: double
# sample4: double
# sample5: double
Samples 6-50 are missing. Reading the parquet files individually shows that each one contains the samples from its own chunk.
data2 <- arrow::open_dataset("datadir/chunk2.parquet")
data2
# FileSystemDataset with 1 Parquet file
# id: string
# start: int32
# end: int32
# sample6: double
# sample7: double
# sample8: double
# sample9: double
# sample10: double
Are parquet files the right format for this task? I'm not sure what I'm missing to make a set of files split this way behave as one dataset when read back in.
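Two things seem to be at play here, both hedged guesses from the arrow documentation rather than a tested fix. First, open_dataset() inspects only the first file for the schema unless you pass unify_schemas = TRUE. Second, a multi-file dataset is always a row-wise union: even with a unified schema the chunks stack as rows, so rows coming from chunk2.parquet would carry NA in sample1 through sample5 instead of lining up as a column-wise join. A long layout, one row per (id, start, end, sample), sidesteps both problems:

library(arrow)

# 1. Surface the columns of every chunk (the rows still stack, with NAs
#    in the sample columns a given file lacks):
data <- open_dataset("datadir", unify_schemas = TRUE)

# 2. Alternatively, write long-format data partitioned by sample; this
#    reads back as one dataset with a stable schema. Here long_table is
#    assumed to have columns id, start, end, sample, value:
# write_dataset(long_table, "datadir", format = "parquet", partitioning = "sample")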

Related

How to split a row where there are 2 values in each cell separated by a carriage return?

Someone gives me a file that sometimes contains badly formed data.
The data should look like this:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
+---------+-----------+--------+
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
Sometimes it comes like this, though:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
As you can see, Nicole and Mark are put in the same row, but the values are separated by a carriage return.
I can split by row, but that multiplies the data:
+---------+-----------+--------+
| Nicole | N | 12 |
| | M | 22 |
+---------+-----------+--------+
| Mark | N | 12 |
| | M | 22 |
+---------+-----------+--------+
This loses the fact that Mark is associated with the "2nd row" of data.
(The data here is purely an example.)
One way to do this is to transform each cell into a list by doing a Text.Split on the line feed / carriage return symbol.
TextSplit = Table.TransformColumns(Source,
    {
        {"Name", each Text.Split(_, "#(lf)"), type text},
        {"Initial", each Text.Split(_, "#(lf)"), type text},
        {"Age", each Text.Split(_, "#(lf)"), type text}
    }
)
Now each column is a list of lists, which you can combine into one long list using List.Combine, and then glue these columns together into a table with Table.FromColumns.
= Table.FromColumns(
    {
        List.Combine(TextSplit[Name]),
        List.Combine(TextSplit[Initial]),
        List.Combine(TextSplit[Age])
    },
    {"Name", "Initial", "Age"}
)
Putting this together, the whole query looks like this:
let
    Source = <Your data source>,
    TextSplit = Table.TransformColumns(Source, {{"Name", each Text.Split(_, "#(lf)"), type text}, {"Initial", each Text.Split(_, "#(lf)"), type text}, {"Age", each Text.Split(_, "#(lf)"), type text}}),
    FromColumns = Table.FromColumns({List.Combine(TextSplit[Name]), List.Combine(TextSplit[Initial]), List.Combine(TextSplit[Age])}, {"Name", "Initial", "Age"})
in
    FromColumns

MDX - filter empty outside of selected range

The cube is populated with data divided along a time dimension (period), which represents a month.
The following query:
select non empty {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
returns:
+--------+----+---+--------+
| Period | a | b | c |
+--------+----+---+--------+
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 5 | 23 | 2 | 2 |
+--------+----+---+--------+
Removing non empty:
select {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].ALLMEMBERS} on rows
from MyCube
Renders:
+--------+--------+--------+--------+
| Period | a | b | c |
+--------+--------+--------+--------+
| 1 | (null) | (null) | (null) |
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 4 | (null) | (null) | (null) |
| 5 | 23 | 2 | 2 |
| 6 | (null) | (null) | (null) |
+--------+--------+--------+--------+
What I would like to get is all records from period 2 to period 5: the first occurrence of a value in measure "a" denotes the start of the range, and the last occurrence the end.
This works, but I need the range to be calculated dynamically at runtime by MDX:
select non empty {[Measures].[a], [Measures].[b], [Measures].[c]} on columns,
{[Period].[Period].&[2] :[Period].[Period].&[5]} on rows
from MyCube
Desired output:
+--------+--------+--------+--------+
| Period | a | b | c |
+--------+--------+--------+--------+
| 2 | 3 | 2 | (null) |
| 3 | 5 | 3 | 1 |
| 4 | (null) | (null) | (null) |
| 5 | 23 | 2 | 2 |
+--------+--------+--------+--------+
I tried looking for first/last values but couldn't compose them into the query properly. Has anyone had this issue before? It should be pretty common, seeing as I want a continuous financial report without skipping months where nothing is going on. Thanks.
Maybe try playing with the NonEmpty / Head / Tail functions in a WITH clause:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;
To debug a custom set and see what members it returns, you can do something like this:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].MEMBERS, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First] on rows
FROM MyCube;
Reading your comment about Children, I think this is also an alternative: add an extra [Period] level:
WITH
SET [First] AS
{HEAD(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SET [Last] AS
{TAIL(NONEMPTY([Period].[Period].[Period].MEMBERS
, [Measures].[a]))}
SELECT
{
[Measures].[a]
, [Measures].[b]
, [Measures].[c]
} on columns,
[First].ITEM(0).ITEM(0)
:[Last].ITEM(0).ITEM(0) on rows
FROM MyCube;

Hive ROWS PRECEDING unexpected behavior

Given this ridiculously simple data set:
+--------+-----+
| Bucket | Foo |
+--------+-----+
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
+--------+-----+
I want to see the value of Foo in the previous row:
select
foo,
max(foo) over (partition by bucket order by foo rows between 1 preceding and 1 preceding) as prev_foo
from
...
Which gives me:
+--------+-----+----------+
| Bucket | Foo | Prev_Foo |
+--------+-----+----------+
| 1 | A | A |
| 1 | B | A |
| 1 | C | B |
| 1 | D | C |
+--------+-----+----------+
Why do I get 'A' back for the first row? I would expect it to be null. It's throwing off calculations where I'm looking for that null. I can work around it by throwing a row_number() in there, but I'd prefer to handle it with fewer calculations.
Use the LAG function to get the previous row:
LAG(foo) OVER(partition by bucket order by foo) as Prev_Foo
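A minimal sketch for comparison, assuming the sample data lives in a table named t; LAG returns NULL when there is no preceding row in the partition, which is the behavior the frame-based max was expected to have:

select
  bucket,
  foo,
  lag(foo) over (partition by bucket order by foo) as prev_foo
from t;

-- +--------+-----+----------+
-- | bucket | foo | prev_foo |
-- +--------+-----+----------+
-- | 1      | A   | NULL     |
-- | 1      | B   | A        |
-- | 1      | C   | B        |
-- | 1      | D   | C        |
-- +--------+-----+----------+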

Multiple tables: group and subtract sums of columns using LINQ

Here I have two tables.
Table One
+---------------+----------+------------+
| Raw Material | Size | Qty |
+---------------+----------+------------+
| A | 1 | 5 |
| A | 2 | 2 |
| A | 1 | 2 |
| B | 0 | 5 |
| B | 0 | 1 |
+---------------+----------+------------+
Table Two
+---------------+----------+------------+
| Raw Material | Size | Qty |
+---------------+----------+------------+
| A | 1 | 2 |
| A | 2 | 1 |
| A | 1 | 1 |
+---------------+----------+------------+
I want output like:
+---------------+----------+------------+
| Raw Material | Size | Qty |
+---------------+----------+------------+
| A | 1 | 4 |
| A | 2 | 1 |
| B | 0 | 6 |
+---------------+----------+------------+
I want to subtract the sums of Qty between the two tables, grouping by RawMaterial and Size.
Something like this should do the job:
var result = tableA.Select(e => new { Item = e, Factor = 1 })
.Concat(tableB.Select(e => new { Item = e, Factor = -1 }))
.GroupBy(e => new { e.Item.RawMaterial, e.Item.Size }, (key, elements) => new
{
RawMaterial = key.RawMaterial,
Size = key.Size,
Qty = elements.Sum(e => e.Item.Qty * e.Factor)
}).ToList();
First we create a union of the two tables using Concat, recording which one is additive (the Factor field), and then just do the normal grouping.
If you want the result to be List<YourTableElementType>, just replace the final anonymous type projection (new { ... }) with new YourTableElementType { ... }.
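As a self-contained check against the question's sample data, assuming a simple StockRow element type (the real element type is whatever your tables hold):

using System;
using System.Collections.Generic;
using System.Linq;

class StockRow
{
    public string RawMaterial { get; set; }
    public int Size { get; set; }
    public int Qty { get; set; }
}

class Demo
{
    static void Main()
    {
        var tableA = new List<StockRow>
        {
            new StockRow { RawMaterial = "A", Size = 1, Qty = 5 },
            new StockRow { RawMaterial = "A", Size = 2, Qty = 2 },
            new StockRow { RawMaterial = "A", Size = 1, Qty = 2 },
            new StockRow { RawMaterial = "B", Size = 0, Qty = 5 },
            new StockRow { RawMaterial = "B", Size = 0, Qty = 1 },
        };
        var tableB = new List<StockRow>
        {
            new StockRow { RawMaterial = "A", Size = 1, Qty = 2 },
            new StockRow { RawMaterial = "A", Size = 2, Qty = 1 },
            new StockRow { RawMaterial = "A", Size = 1, Qty = 1 },
        };

        // Union with a +1/-1 factor, then group and sum, as above.
        var result = tableA.Select(e => new { Item = e, Factor = 1 })
            .Concat(tableB.Select(e => new { Item = e, Factor = -1 }))
            .GroupBy(e => new { e.Item.RawMaterial, e.Item.Size }, (key, elements) => new
            {
                RawMaterial = key.RawMaterial,
                Size = key.Size,
                Qty = elements.Sum(e => e.Item.Qty * e.Factor)
            }).ToList();

        foreach (var r in result)
            Console.WriteLine("{0} {1} {2}", r.RawMaterial, r.Size, r.Qty);
        // Prints: A 1 4, A 2 1, B 0 6
    }
}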

How to remove repeated columns using Ruby FasterCSV

I'm using Ruby 1.8 and FasterCSV.
The CSV file I'm reading in has several repeated columns.
| acct_id | amount | acct_num | color | acct_id | acct_type | acct_num |
| 345 | 12.34 | 123 | red | 345 | 'savings' | 123 |
| 678 | 11.34 | 432 | green | 678 | 'savings' | 432 |
...etc
I'd like to condense it to:
| acct_id | amount | acct_num | color | acct_type |
| 345 | 12.34 | 123 | red | 'savings' |
| 678 | 11.34 | 432 | green | 'savings' |
Is there a general purpose way to do this?
Currently my solution is something like:
headers = FasterCSV.parse_line(file.gets)  # header row
file.gets  # get rid of garbage line between headers and data
FasterCSV.filter(file, :headers => headers) do |row|
  row.delete(6)  # delete second acct_num field
  row.delete(4)  # delete second acct_id field
  # additional processing on the data
  row['color'] = color_to_number(row['color'])
  row['acct_type'] = acct_type_to_number(row['acct_type'])
end
Assuming you want to get rid of the hardcoded deletions,
row.delete(6) #delete second acct_num field
row.delete(4) #delete second acct_id field
they can be replaced by
row = row.to_hash
This will clobber duplicates. The rest of the posted code will keep working.
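A quick sketch of what to_hash does with duplicate headers. FasterCSV::Row keeps every field; the hash keeps one entry per header name, with the last value winning (harmless here, since the duplicated columns carry identical values):

require 'rubygems'
require 'fastercsv'

row = FasterCSV::Row.new(
  %w[acct_id amount acct_num color acct_id acct_type acct_num],
  [345, 12.34, 123, 'red', 345, 'savings', 123]
)

row.to_hash
# => {"acct_id"=>345, "amount"=>12.34, "acct_num"=>123,
#     "color"=>"red", "acct_type"=>"savings"}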
