Apache Arrow C++: What's the best fast alternative to parquet::StreamWriter? - parquet

I'm writing a program to convert a custom tabular ("rownar", i.e. row-oriented) binary file format to Parquet using Arrow C++.
The core of the program works as follows:
ColInfo colinfo = ...;           // fixed schema per file, not known at compile time
parquet::StreamWriter w = ...;

void on_row(RowData row) {
  for (const auto& col : colinfo) {
    w << convertCell(row, col);
  }
}
Here, on_row is called by the input file parser for each row parsed.
This works fine but is pretty slow, with StreamWriter::operator<< being the bottleneck.
Question: What's an alternative to StreamWriter that's similarly easy to use but faster?
Constraints:
Can't change the callback-based interface shown above.
Input data doesn't fit into memory.
I've looked into the reader_writer{,2}.cc examples in the Arrow repository that use the WriteBatch interface. Is that the recommended way to quickly create Parquet files? If so, what's the recommended way to size row groups? Or is there an interface that abstracts away row groups, like StreamWriter does? And what num_values size is recommended per WriteBatch call?
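For concreteness, here is roughly the buffered, WriteBatch-based structure I have in mind, adapted from reader_writer.cc. This is only a sketch of what I'm considering, not something I've benchmarked: the BatchedParquetWriter class and the kRowsPerRowGroup constant are my own placeholders, it assumes an all-INT64 schema purely to keep the example short (the real schema would come from ColInfo), and the exact IO/exception helper names (PARQUET_ASSIGN_OR_THROW, ConvertedType, etc.) vary a bit between Arrow versions.

// Sketch only: buffer cells per column, then write one complete row group at a
// time through the low-level WriteBatch API (as in reader_writer.cc). To keep it
// short it assumes every column is INT64; a real version would dispatch on the
// per-file ColInfo schema.
#include <arrow/io/file.h>
#include <parquet/api/writer.h>
#include <parquet/exception.h>

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

constexpr int64_t kRowsPerRowGroup = 64 * 1024;  // tuning knob, not a recommendation

class BatchedParquetWriter {
 public:
  BatchedParquetWriter(const std::string& path, int num_cols) : buffers_(num_cols) {
    // Trivial all-INT64 schema; the real converter would build this from ColInfo.
    parquet::schema::NodeVector fields;
    for (int i = 0; i < num_cols; ++i) {
      fields.push_back(parquet::schema::PrimitiveNode::Make(
          "col" + std::to_string(i), parquet::Repetition::REQUIRED,
          parquet::Type::INT64, parquet::ConvertedType::NONE));
    }
    auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
        parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                         fields));
    std::shared_ptr<arrow::io::FileOutputStream> sink;
    PARQUET_ASSIGN_OR_THROW(sink, arrow::io::FileOutputStream::Open(path));
    writer_ = parquet::ParquetFileWriter::Open(
        sink, schema, parquet::WriterProperties::Builder().build());
  }

  // Called once per parsed row, i.e. from the same on_row callback as above.
  void OnRow(const std::vector<int64_t>& cells) {
    for (size_t c = 0; c < buffers_.size(); ++c) buffers_[c].push_back(cells[c]);
    if (++buffered_rows_ == kRowsPerRowGroup) Flush();
  }

  void Finish() {
    if (buffered_rows_ > 0) Flush();
    writer_->Close();
  }

 private:
  void Flush() {
    parquet::RowGroupWriter* rg = writer_->AppendRowGroup();
    for (auto& column : buffers_) {
      auto* col_writer = static_cast<parquet::Int64Writer*>(rg->NextColumn());
      // One WriteBatch call per column per row group; REQUIRED columns need no
      // definition/repetition levels, hence the nullptrs.
      col_writer->WriteBatch(static_cast<int64_t>(column.size()), nullptr, nullptr,
                             column.data());
      column.clear();
    }
    rg->Close();
    buffered_rows_ = 0;
  }

  std::vector<std::vector<int64_t>> buffers_;
  std::unique_ptr<parquet::ParquetFileWriter> writer_;
  int64_t buffered_rows_ = 0;
};

From what I've read, row groups are usually sized by bytes (tens to hundreds of MB) rather than by a fixed row count, so I assume kRowsPerRowGroup would have to be tuned against the actual row width - which is part of what I'm asking above.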
Secondary question: What are some good opportunities to create the Parquet file concurrently? Can batches, chunks, columns, or row groups be written concurrently?

Related

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over 1 month. I want to reduce this data to have one sample for every hour. The problem is: some of my rows have an "NA" value, so I delete those rows, which means there are not exactly 60 points for every hour - it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column which has the same value if the data set has the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour=unique(datehour,1)
index=[]
avedata=reshape([],0,length(alldata[1,:]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i]==j
            index=vcat(index,i)
        else
            rows=alldata[index,:]
            rows=convert(Array{Float64,2},rows)
            avehour=mean(rows,1)
            avedata=vcat(avedata,avehour)
            index=[]
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. When wrapping it, make sure to either pass the data to your function as arguments or, if the data is in global scope, qualify it with const;
Layer two: recommendations for your algorithm
A statement like [] creates an array of type Any, which is slow; you should use a type qualifier like index=Int[] to make it fast;
Using vcat like index=vcat(index,i) is inefficient; it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values to an existing matrix than to do vcat on it;
Your code will produce incorrect results, if I am not mistaken, as it will not catch the last entry of the uniquedatehour vector (assume it has only one element and check what happens - avedata will have zero rows);
Line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not Matrix{Float64} it is better to convert it at the beginning with Matrix{Float64}(alldata);
You can change line rows=alldata[index,:] to a view like view(alldata, index, :) to avoid allocation;
In general you can avoid creating the index vector at all, as it is enough to remember the start position s and end position e of a range of equal values and then use the range s:e to select the rows you want.
If you correct those things, please post your updated code and maybe I can help further, as there is still room for improvement, but it requires a somewhat different algorithmic approach (then again, maybe you will prefer the option below for simplicity).
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all what you need if DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back), but it is much simpler for me.

How to draw a chart from a CSV-file in MQL4?

I'm new to MQL and MetaTrader 4, but I want to read a .CSV file and draw the values I've got onto the chart of the Expert Advisor I'm working on.
Every .CSV file has the form of:
;EURUSD;1
DATE;TIME;HIGH;LOW;CLOSE;OPEN;VOLUME
2014.06.11;19:11:00;1.35272;1.35271;1.35271;1.35272;4
2014.06.11;19:14:00;1.35287;1.35282;1.35284;1.35283;30
Where the EURUSD part is the _Symbol, which another program generated, the 1 is the period, and all the other things are the data to draw.
Is there any form to do it inside an Expert Advisor, or do I need to use a Custom Indicator?
If that's the case, how can I do it in the simplest way?
P.S.: I read the data into a struct:
struct entry
{
    string date;
    string time;
    double high;
    double low;
    double close;
    double open;
    int    volume;
};
There are three principally different approaches available in MT4.
First, one may reshuffle the data-cells into a compatible format T,O,H,L,C,V and import the records using the F2 History Center [Import] facility of the MetaTrader Terminal. One may create one's own Symbol-name so as to avoid name-colliding cases in the History Center database.
This way, one lets MT4 create system-level illustrations of the TOHLCV-data, using the platform's underlying graphical engine.
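A trivial pre-processing sketch for that first route, to be run outside MT4 (any language will do; plain C++ shown here, with a hypothetical "reshuffle" program name): it re-orders the question's DATE;TIME;HIGH;LOW;CLOSE;OPEN;VOLUME rows into a Date,Time,Open,High,Low,Close,Volume layout. The output column ordering is an assumption about what the History Center importer expects - verify it against one's own Terminal build before importing.

// Sketch: reshuffle the question's CSV into an importer-friendly column order.
// Input  row: DATE;TIME;HIGH;LOW;CLOSE;OPEN;VOLUME   (';' separated)
// Output row: DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME   (',' separated - assumed order)
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: reshuffle <in.csv> <out.csv>\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  std::ofstream out(argv[2]);
  std::string line;
  std::getline(in, line);  // skip the ";EURUSD;1" preamble line
  std::getline(in, line);  // skip the DATE;TIME;... header line
  while (std::getline(in, line)) {
    std::vector<std::string> f;
    std::stringstream ss(line);
    std::string cell;
    while (std::getline(ss, cell, ';')) f.push_back(cell);
    if (f.size() < 7) continue;  // skip malformed rows
    // input indices: 0=DATE 1=TIME 2=HIGH 3=LOW 4=CLOSE 5=OPEN 6=VOLUME
    out << f[0] << ',' << f[1] << ',' << f[5] << ',' << f[2] << ','
        << f[3] << ',' << f[4] << ',' << f[6] << '\n';
  }
  return 0;
}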
Second,
one may ignore the underlying graphical engine and work on a user-controlled GUI-overlay, so as to implement an algorithm that reads a CSV file and creates a set of MQL4 GUI-objects algorithmically, based on the data contained in the said CSV file. An experience-based decision whether to use an { ExpertAdvisor | CustomIndicator } would yield to using a Script for this purpose, due to its one-shot processing.
One shall realise that the MT4 code-execution ecosystem does a specific context-binding between an MQL4-code (which is being run) and an MT4.Graph, which does not allow a code launched on a GBPJPY MT4.Graph to directly process objects related to an FTSE.100 MT4.Graph. Yes, if asked to, one may implement a few add-ons and develop a sophisticated distributed-processing model to make this work "across" the said context-binding borders.
Third,
and for some cases the most interesting way, is a file-based approach, where one may
- pre-process the CSV data in a similar way as in the second option, but not inside a live MT4 process - rather "beforehand" - and
- generate one's own Profile file, keeping the MT4 convention of placement & content of:
  - ~/profiles/<aProfileNAME>/chart01.chr
  - ~/profiles/<aProfileNAME>/order.wnd
  - ~/profiles/lastprofile.ini, referring to <aProfileNAME> on its first row
This way, once the MT4 session starts, the pre-fabricated files are pilot-tape auto-loaded and displayed as one wishes, Q.E.D.
A .chr file syntax sample:
<chart>
id=130394787628125000
comment=msLIB.TERMINAL: _______________2013.04.15 08:00:00 |cpuClockTIXs = 448765484 |
symbol=EURCHF
period=60
leftpos=6188
digits=4
scale=4
graph=1
fore=0
grid=0
volume=1
scroll=0
shift=1
ohlc=1
...
<window>
height=100
fixed_height=0
<indicator>
name=main
<object>
type=10
object_name=Fibo 16762
...
<object>
type=16
object_name=msLIB.RectangleOnEVENT
period_flags=0
create_time=1348596865
color=25600
style=0
weight=1
background=0
filling=0
selectable=1
hidden=0
zorder=0
time_0=1348592400
value_0=1.213992
time_1=1348624800
value_1=1.209486
ray=0
</object>
...
<object>
type=17
object_name=msLIB.TriangleMarker
period_flags=0
create_time=1348064992
color=17919
style=2
weight=1
background=0
filling=0
selectable=1
hidden=0
zorder=0
time_0=1348052400
value_0=1.213026
time_1=1348070400
value_1=1.213026
time_2=1348070400
value_2=1.210476
</object>

Logic to compare rows in pig

I need logic for the below scenario, which needs to be implemented using Pig scripts. Can anyone please help by providing some ideas on how to do this?
The input contains a column groupName with some values like others and unknown. These values need to be replaced by the groupName of the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
Output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown is to be replaced using 125,sale0001; 130,others needs to be replaced using 129,sale0002; and 132,unknown, 133,unknown and 134,others are to be replaced using 131,casc0004.
--Edit--
I tried the LEAD function in Pig, but it only compares n rows at a time, which cannot solve this completely.
Below is another approach that works, but I am looking for a more optimized one:
- Cogroup the same data set with itself (like Dataset and Dataset_self)
- Filter Dataset.id=Dataset_self.id or Dataset_self.groupname='others' or Dataset_self.groupname='unknown'
- Generate IdDiff like (Dataset_self.id-Dataset.id), CASE when id=id then (id, group) else (id_self, group)
- Foreach (group id) {
      ordered = order by id, diff, group;
      limited = ordered limit 1;
      generate limited;
  }
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to trace back to the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as you have LEAD or LAG functions readily available which could be used to do exactly what you need.
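Just to make the lead/lag point concrete: once everything flows through a single reader (or a single reducer), the fill-forward itself is trivial - remember the last "real" group name and reuse it whenever others or unknown appears. A rough single-machine sketch of that logic, written as plain C++ over the id,groupName CSV shown in the question (not Pig or MapReduce, just an illustration of the sequential logic):

// Sequential fill-forward of "others"/"unknown" group names - illustrates the
// single-node / single-reducer logic discussed above, not a distributed job.
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::string line;
  std::string last_good;  // most recent groupName that was not others/unknown
  if (std::getline(std::cin, line)) std::cout << line << "\n";  // pass the "id,groupName" header through
  while (std::getline(std::cin, line)) {
    std::istringstream row(line);
    std::string id, group;
    std::getline(row, id, ',');
    std::getline(row, group);
    if (group == "others" || group == "unknown") {
      group = last_good;   // replace with the previous record's group
                           // (stays empty if the very first record is already others/unknown)
    } else {
      last_good = group;   // remember the latest "real" group name
    }
    std::cout << id << "," << group << "\n";
  }
  return 0;
}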

Dealing with huge data

Let's assume that I have a big file (500GB+) and I have a data record declaration Sample which indicates a row in that file:
data Sample = Sample {
    field1 :: Int,
    field2 :: Int
}
Now what is a data structure suitable for processing (filter/map/fold) over the collection of these Sample values? Don Stewart has answered here that the Sample type should not be treated as a list ([Sample]) but as a Vector. My question is: how does representing it as a Vector solve the problem? Doesn't representing the file contents as a vector of Sample values also occupy around 500 GB?
What is the recommended method for solving these types of problems?
As far as I can see, the operations you want to use (filter, map and fold) can be done via both conduit (see Data.Conduit.List) and pipes (see Pipes.Prelude).
Both libraries are perfectly capable of manipulating/folding and filtering streaming data. Depending on your scenario they might solve your actual problem.
If you, however, need to investigate values several times, you're better off loading chunks into a vector, as @Don said.

Getting output files which contain the value of one key only?

I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply outputting each value in the iterator. For example, here's some python streaming code:
for line in sys.stdin:
    data = line.split("\t")
    print data[1]
This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.
However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.
Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.
Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks
OK, then create a custom JAR implementation of your MapReduce solution and go for MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the file name (in your case the key) as the key in your reducer and the entire payload as the value, and your data will be written to the file named after your key.
