Divide a dataset into chunks - performance

I have a function in R that chokes if I apply it to a dataset with more than 1000 rows. Therefore, I want to split my dataset into a list of n chunks, each of not more than 1000 rows.
Here's the function I'm currently using to do the chunking:
chunkData <- function(Data, chunkSize) {
  Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)
  lapply(unique(Chunks), function(x) Data[Chunks == x, ])
}
chunkData(iris, 100)
I would like to make this function more efficient, so that it runs faster on large datasets.

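You can do this easily using split from base R. For example, split(iris, 1:3) splits the iris dataset into a list of three data frames by row; because 1:3 is recycled, rows are assigned to the three groups in rotation rather than in contiguous blocks. To get chunks of a fixed size, pass a grouping vector built from the row index, for example split(iris, ceiling(seq_len(nrow(iris)) / 100)).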
Since the output is still a list of data frames, you can easily use lapply on the output to process the data, and combine them as required.
Since speed is the primary issue for using this approach, I would recommend that you take a look at the data.table package, which works great with large data sets. If you specify more information on what you are trying to achieve in your function, people at SO might be able to help.

Replace the lapply() call with a call to split():
split(Data, Chunks)

You should also take a look at ddply from the plyr package; this package is built around the split-apply-combine principle. This paper about the package explains how this works and what is available in plyr.
The general strategy I would take here is to add a new column to the dataset called chunkid. This column divides the data into chunks of 1000 rows; look at the rep function to create it. You can then do:
result = ddply(dat, .(chunkid), functionToPerform)
I like plyr for its clear syntax and structure, and its support of parallel processing. As already said, please also take a look at data.table, which could be quite a bit faster in some situations.
An additional tip could be to use matrices instead of data.frames...
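The split-apply-combine pattern described in these answers is not specific to R. Below is a minimal sketch in Python, assuming the rows are held in a list; chunk_rows and apply_in_chunks are hypothetical helper names introduced only for illustration. Slicing by position visits each row once, whereas the original chunkData compares the whole Chunks vector against every chunk id.

```python
def chunk_rows(rows, chunk_size):
    """Split a list of rows into consecutive chunks of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Slice by position: chunk k holds rows k*chunk_size .. (k+1)*chunk_size - 1.
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def apply_in_chunks(rows, chunk_size, func):
    """Apply func to each chunk (split), then concatenate the results (combine)."""
    results = []
    for chunk in chunk_rows(rows, chunk_size):
        results.extend(func(chunk))
    return results
```

Here func stands in for the function that cannot handle more than 1000 rows at a time; it is expected to return a list of results for the chunk it receives.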

Related

What is the best way to lag a value in a Dask Dataframe?

I have a Dask Dataframe called data which is extremely large and cannot be fit into main memory, and is importantly not sorted. The dataframe is unique on the following key: [strike, expiration, type, time]. What I need to accomplish in Dask is the equivalent of the following in Pandas:
data1 = data[['strike', 'expiration', 'type', 'time', 'value']].sort_values(['strike', 'expiration', 'type', 'time'])
data1['lag_value'] = data1.groupby(['strike', 'expiration', 'type'])['value'].shift(1)
In other words, I need to lag the variable value within a by group. What is the best way to do this in Dask - I know that sorting is going to be very computationally expensive, but I don't think there is a way around it given what I would like to do?
Thank you in advance!

I'll make a few assumptions, but my guess is that the data is 'somewhat' sorted. So you might have file partitions that are specific to a day or a week or maybe an hour if you are working with high-frequency data. This means that you can do sorting within those partitions, which is often a more manageable task.
If this guess is wrong, then it might be a good idea to incur the fixed cost of sorting (and persisting) the data since it will speed up your downstream analysis.
Since you have only one large file and it's not very big (25GB should be manageable if you have access to a cluster), the best thing might be to load into memory with regular pandas, sort and save the data with partitioning on dates/expirations/tickers (if available) or some other column division that makes sense for your downstream analysis.
It might be possible to reduce memory footprint by using appropriate dtypes, for example strike, type, expiration columns might take less space as categories (vs strings).
If there is no way at all of loading it into memory at once, then it's possible to iterate over chunks of rows with pandas and save the relevant bits in smaller files. Here's a rough sketch:
import os
import pandas as pd

reader = pd.read_csv('big_file', iterator=True, chunksize=10**4)
for rows in reader:
    # split each chunk into smaller sets based on some logic
    for (type_, strike), group_df in rows.groupby(['type', 'strike']):
        path = f"{type_}_{strike}.csv"
        # the mode is append, so write the header only when the file
        # does not exist yet
        group_df.to_csv(path, mode='a', header=not os.path.exists(path))
Now the above might sound weird, since the question is tagged with dask and I'm focusing on pandas, but the idea is to save time downstream by partitioning the data on the relevant variables. It is probably possible to achieve this with dask as well, but in my experience situations like these tend to run into memory problems due to data shuffling among workers. Of course, if in your situation there were many files rather than one, then some parallelisation with dask.delayed would be helpful.
Now, after you partition/index your data, then dask will work great when operating on the many smaller chunks. For example, if you partitioned the data based on date and your downstream analysis is primarily using dates, then operations like groupby and shift will be very fast because the workers will not need to check with each other whether they have overlapping dates, so most processing will occur within partitions.
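To make the point about within-partition processing concrete, here is a minimal plain-Python sketch (not Dask) of lagging a value inside each group. It assumes rows are dictionaries and that one partition fits in memory; lag_within_groups is a hypothetical helper name. Because each group is handled independently, the same function could be applied to every partition separately once the data has been partitioned on the group columns.

```python
from itertools import groupby
from operator import itemgetter


def lag_within_groups(rows, group_keys, order_key, value_key, lag_name="lag_value"):
    """Return new row dicts carrying the previous value within each group.

    The first row of every group gets None, like shift(1) yielding a missing value.
    """
    key_of = itemgetter(*group_keys)
    # Sort by group, then by order_key, so each group is contiguous and ordered.
    ordered = sorted(rows, key=lambda r: (key_of(r), r[order_key]))
    out = []
    for _, group in groupby(ordered, key=key_of):
        previous = None
        for row in group:
            out.append({**row, lag_name: previous})
            previous = row[value_key]
    return out
```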

LightGBM incrementally construct Dataset

I want to construct a LightGBM Dataset object from very large X and y, which cannot be loaded into memory. Is there any method that can construct the Dataset in batches? E.g. something like
import lightgbm as lgb
ds = lgb.Dataset()
for X, y in data_generator():
    ds.add_new_data(data=X, label=y)

Regarding the data, there are a few hacks. For example, if your data is numeric, make sure the precision is not higher than necessary; two decimal digits may well be enough, depending on your data. If you have categorical data, store the categories as integer codes. But you are probably looking for a better approach.
There is a concept called incremental learning. You build a model (a tree) in the first iteration using the first batch of data. For the next model, you use that tree as a template and only update its values (you can also allow for shrinkage). You can use the keep_training_booster option for this scenario; please read up on the mechanism yourself.
A third technique is to build multiple models: divide your data into N pieces, train N models, and combine them with an ensemble approach. This way every piece of the data is used, even though each model sees only its own piece.

How to find average of two lines in NiFi?

I need to find the average of two values in separate lines.
My CSV file looks like this:
Name,ID,Marks
Mahi,1,90
Mahi,1,100
Andy,2,85
Andy,2,95
Now I need to store the average of the two marks in a database.
The "Average" column should add the two marks, divide by 2, and store the result via an SQL query.
Table:
Name,ID,Average
Mahi,1,95
Andy,2,90
Is it possible to find the average of two values in separate rows using NiFi?

Given a lot of assumptions, this is doable. You are definitely better off pre-processing the data in NiFi and exporting it to a tool better suited to this, like Apache Spark using the NiFi Spark Receiver library (instructions here), because this solution will not scale well.
However, you could certainly use a combination of SplitText processors to get the proper data into individual flowfiles (i.e. all Mahi rows in one, all Andy rows in another). Once you have a record that looks like:
Andy,2,85
Andy,2,95
you can use ExtractText with regular expressions to get 85 and 95 into attributes marks.1 and marks.2 (a good example of where scaling will break down -- doing this with 2 rows is easy; doing this with 100k is ridiculous). You can then use UpdateAttribute with the Expression Language to calculate the average of those two attributes (convert toNumber() first) and populate a third attribute marks.average (either through chaining plus() and divide() functions or with the math advanced operation (uses Java Reflection)). Once you have the desired result in an attribute, use ReplaceText to update the flowfile content, and MergeContent to merge the individual flowfiles back into a single instance.
If this were me, I'd first evaluate how static my incoming data format was, and if it was guaranteed to stay the same, probably just write a Groovy script that parsed the data and calculated the averages in place. I think that would even scale better (within reason) because of the flexibility of having written domain-specific code. If you need to offload this to cluster operations, Spark is the way to go.
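As a rough illustration of the parse-and-average logic such a script would contain, here is a minimal sketch in Python rather than Groovy. It assumes the header names from the question (Name, ID, Marks) and that the whole file fits in memory; average_marks is a hypothetical function name, and the NiFi scripting integration itself is not shown.

```python
import csv
import io
from collections import defaultdict


def average_marks(csv_text):
    """Return (Name, ID, Average) tuples, one per (Name, ID) pair, in first-seen order."""
    totals = defaultdict(lambda: [0.0, 0])  # (name, id) -> [sum of marks, row count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[(row["Name"], row["ID"])]
        entry[0] += float(row["Marks"])
        entry[1] += 1
    return [(name, id_, total / count)
            for (name, id_), (total, count) in totals.items()]
```

Grouping by key rather than by row position means the same code handles any number of rows per student, which is where the attribute-based flow above stops scaling.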

Building a histogram faster

I am working with a large dataset that I need to build a histogram of. I feel like my method of just going through the entire list and marking in a second array the frequency is a slow approach. Any suggestions on how to speed the process up?

Given that a histogram is a graph containing the counts of all items in each bin, you can't make one without visiting all the items.
However, you can:
1. Create the histogram as you collect the data. Then it takes no time to generate.
2. Break up the data into N parts, and work on each part in parallel. When each part is done counting, just sum the results for each bin. (You can also combine this with #1)
3. Sample the data. In theory, looking at a fraction of your data, you should be able to estimate the rest of it. The Math.
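The second option can be sketched as follows in Python, assuming fixed-width bins starting at lo and clamping out-of-range values into the edge bins (both are choices made for this illustration, not something the answer specifies). The function names are hypothetical. The parts are counted sequentially here; in practice each partial_histogram call could be handed to a separate worker.

```python
from collections import Counter


def bin_index(value, lo, width, n_bins):
    """Map a value to a bin index, clamping out-of-range values to the edge bins."""
    idx = int((value - lo) // width)
    return min(max(idx, 0), n_bins - 1)


def partial_histogram(values, lo, width, n_bins):
    """Count one part of the data into n_bins fixed-width bins."""
    counts = Counter(bin_index(v, lo, width, n_bins) for v in values)
    return [counts.get(i, 0) for i in range(n_bins)]


def merge_histograms(parts):
    """Sum per-bin counts across partial histograms of equal length."""
    return [sum(column) for column in zip(*parts)]
```

Since bin counts simply add, merging the parts gives the same result as one pass over all the data.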

Comparing two large datasets using a MapReduce programming model

Let's say I have two fairly large data sets - the first is called "Base" and it contains 200 million tab delimited rows and the second is called "MatchSet", which has 10 million tab delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex rules, aka not a simple string match, involving some proprietary methods. Let's say for now Match(row1,row2) is written in pseudo-code so implementation in another language is not a problem (though it's in C++ today).
In a linear model, aka program running on one giant processor - we would read each line from MatchSet and each line from Base and compare one to the other using Match() and write out our match stats. For example we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non values to separate files for inspection. Aka, a nested loop of sorts:
for each row1 in MatchSet
{
    for each row2 in Base
    {
        var type = Match(row1,row2);
        switch(type)
        {
            //do something based on type
        }
    }
}
I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record compare in parallel across the 200 million records, so 200 million/n nodes * 10 million iterations per node. Is that the most efficient way to do this?

From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than for me to ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (the CDDB1 calculation). Sometimes a given title may yield slightly different IDs (e.g. different CDs of the same title, or even the same CD read several times). But by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
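The hash-key idea can be sketched in Python as a simple in-memory join. This is only an illustration under stated assumptions: key_fn is a hypothetical function producing one hash key per row, match_fn stands in for the question's Match(), and everything fits in memory. A real job would distribute the index or use the framework's join instead.

```python
from collections import defaultdict


def join_on_hash_key(match_set, base, key_fn, match_fn):
    """Compare each Base row only with the MatchSet rows that share its hash key."""
    index = defaultdict(list)
    for row1 in match_set:
        index[key_fn(row1)].append(row1)  # bucket MatchSet rows by key
    results = []
    for row2 in base:
        # Only candidates in the same bucket are passed to the expensive match_fn.
        for row1 in index.get(key_fn(row2), ()):
            results.append((row1, row2, match_fn(row1, row2)))
    return results
```

The number of match_fn calls drops from len(match_set) * len(base) to the sum of the bucket sizes hit by each Base row, at the price of missing pairs whose keys differ.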

Check Section 3.5 - Relational Joins in the paper 'Data-Intensive Text Processing with MapReduce'. I haven't gone through it in detail, but it might help you.

This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd likely apply to both the single stream and parallel versions. Examples include blocking so that you need to do fewer than 200M * 10M computations or precomputing constant portions of the algorithm for the 10M match set.
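The batch-and-combine scheme described in this answer can be sketched in plain Python, without Hadoop. The label names ("strong", "weak", "none"), their ranking, and all function names are assumptions made for the illustration; match_fn stands in for Match(). Each map_batch call corresponds to the work one mapper would do on its share of Base, and reduce_best keeps the strongest label seen for each MatchSet row before counting.

```python
from collections import Counter

RANK = {"none": 0, "weak": 1, "strong": 2}  # assumed ordering of match labels


def split_batches(rows, n):
    """Split rows into at most n consecutive batches of roughly equal size."""
    size = max(1, -(-len(rows) // n))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_batch(base_batch, match_set, match_fn):
    """Map step: best label for each MatchSet row against one batch of Base."""
    best = {}
    for i, row1 in enumerate(match_set):
        labels = (match_fn(row1, row2) for row2 in base_batch)
        best[i] = max(labels, key=RANK.__getitem__, default="none")
    return best


def reduce_best(partials):
    """Reduce step: keep the strongest label per MatchSet row, then count labels."""
    overall = {}
    for partial in partials:
        for i, label in partial.items():
            current = overall.get(i, "none")
            overall[i] = label if RANK[label] > RANK[current] else current
    return Counter(overall.values())
```

The total number of match_fn calls is unchanged; the speedup comes only from running the map_batch calls on different nodes at the same time.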
