I need to find the average of two values that are on separate lines.
My CSV file looks like this:
Name,ID,Marks
Mahi,1,90
Mahi,1,100
Andy,2,85
Andy,2,95
Now I need to store the average of the two marks in a database.
The "Average" column should add the two marks, divide by 2, and store that result via a SQL query.
Table:
Name,ID,Average
Mahi,1,95
Andy,2,90
Is it possible to find the average of two values in separate rows using NiFi?
Given a lot of assumptions, this is doable. You are definitely better off pre-processing the data in NiFi and exporting it to a tool better suited to this, like Apache Spark using the NiFi Spark Receiver library (instructions here), because this solution will not scale well.
However, you could certainly use a combination of SplitText processors to get the proper data into individual flowfiles (i.e. all Mahi rows in one, all Andy rows in another). Once you have a record that looks like:
Andy,2,85
Andy,2,95
you can use ExtractText with regular expressions to get 85 and 95 into attributes marks.1 and marks.2 (a good example of where scaling will break down -- doing this with 2 rows is easy; doing this with 100k is ridiculous). You can then use UpdateAttribute with the Expression Language to calculate the average of those two attributes (convert toNumber() first) and populate a third attribute marks.average (either through chaining plus() and divide() functions or with the math advanced operation (uses Java Reflection)). Once you have the desired result in an attribute, use ReplaceText to update the flowfile content, and MergeContent to merge the individual flowfiles back into a single instance.
If this were me, I'd first evaluate how static my incoming data format was, and if it was guaranteed to stay the same, probably just write a Groovy script that parsed the data and calculated the averages in place. I think that would even scale better (within reason) because of the flexibility of having written domain-specific code. If you need to offload this to cluster operations, Spark is the way to go.
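For illustration, here is roughly what that parse-and-average script would do, sketched in Python rather than Groovy (the file names, and the assumption that the CSV matches the sample above, are mine):

import csv
from collections import defaultdict

# 'marks.csv' / 'averages.csv' are hypothetical file names
marks = defaultdict(list)
ids = {}

with open('marks.csv') as f:
    for row in csv.DictReader(f):          # header: Name,ID,Marks
        marks[row['Name']].append(float(row['Marks']))
        ids[row['Name']] = row['ID']

with open('averages.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'ID', 'Average'])
    for name, values in marks.items():
        writer.writerow([name, ids[name], sum(values) / len(values)])

The rewritten CSV can then be handed to the rest of the flow for the database insert.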
We have a use case where we are merging CSV files from two different, independent JMeter instances and using the combined CSV file to generate the dashboard. While merging two files, there is a question about the order of the data points in the CSV and how it impacts the final dashboard. To test it out, I created two CSV files with the same data but one where the data is in increasing order of timestamps and the other is in decreasing order and compared the Response Time Percentiles Over Time (successful responses) graphs.
From what I understand of the k-th selection algorithm used for percentile calculation, if the input data set consists of the same values, then irrespective of their order the output should be the same every time. In JMeter, the input data set is controlled by the jmeter.reportgenerator.statistic_window property.
So my hypothesis is: if the statistic_window is large enough to hold all the samples for a given time-granularity window, the output graphs should be 100% identical, but that doesn't seem to be the case. I generated graphs with a statistic_window of 100 million, which is far more than the number of samples in the entire CSV file, and still the graphs differ from one another, even if only by a small percentage.
I would like to understand whether my assumptions are wrong or whether there is a bug in JMeter that I need to look for.
Turns out there was a bug in JMeter which I was able to submit a patch for. For more details: https://bz.apache.org/bugzilla/show_bug.cgi?id=65446
I have a Dask DataFrame called data which is extremely large, cannot fit into main memory, and, importantly, is not sorted. The dataframe is unique on the following key: [strike, expiration, type, time]. What I need to accomplish in Dask is the equivalent of the following in Pandas:
data1 = data[['strike', 'expiration', 'type', 'time', 'value']].sort_values(['strike', 'expiration', 'type', 'time'])
data1['lag_value'] = data1.groupby(['strike', 'expiration', 'type', 'time'])['value'].shift(1)
In other words, I need to lag the variable value within each by-group. What is the best way to do this in Dask? I know that sorting is going to be very computationally expensive, but I don't think there is a way around it given what I would like to do.
Thank you in advance!
I'll make a few assumptions, but my guess is that the data is 'somewhat' sorted. So you might have file partitions that are specific to a day or a week or maybe an hour if you are working with high-frequency data. This means that you can do sorting within those partitions, which is often a more manageable task.
If this guess is wrong, then it might be a good idea to incur the fixed cost of sorting (and persisting) the data since it will speed up your downstream analysis.
Since you have only one large file and it's not very big (25GB should be manageable if you have access to a cluster), the best thing might be to load into memory with regular pandas, sort and save the data with partitioning on dates/expirations/tickers (if available) or some other column division that makes sense for your downstream analysis.
It might be possible to reduce the memory footprint by using appropriate dtypes; for example, the strike, type and expiration columns might take less space as categories (vs strings).
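For example, assuming data is the dataframe in question and these columns are currently plain strings:

# convert the low-cardinality string columns to categories to save memory
data = data.astype({'strike': 'category', 'type': 'category', 'expiration': 'category'})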
If there is no way at all of loading it into memory at once, then it's possible to iterate over chunks of rows with pandas and save the relevant bits into smaller files; here's rough pseudocode:
import pandas as pd

df = pd.read_csv('big_file', iterator=True, chunksize=10**4)

for rows in df:
    # here we want to split into smaller sets based on some logic
    # note the mode is append, so some additional check on file
    # existence (and on writing the header only once) should be added
    for group_label, group_df in rows.groupby(['type', 'strike']):
        group_df.to_csv(f"{group_label}.csv", mode='a')
Now the above might sound odd, since the question is tagged with dask and I'm focusing on pandas, but the idea is to save time downstream by partitioning the data on the relevant variables. This is probably achievable with dask as well, but in my experience, in situations like these I would run into memory problems due to data shuffling among workers. Of course, if in your situation there were many files rather than one, then some parallelisation with dask.delayed would be helpful.
Now, after you partition/index your data, then dask will work great when operating on the many smaller chunks. For example, if you partitioned the data based on date and your downstream analysis is primarily using dates, then operations like groupby and shift will be very fast because the workers will not need to check with each other whether they have overlapping dates, so most processing will occur within partitions.
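For instance, a rough sketch of that last point, assuming the data has already been written out as parquet partitioned such that a [strike, expiration, type] group never spans two partitions (the path and the choice of grouping columns are my assumptions):

import dask.dataframe as dd

ddf = dd.read_parquet('partitioned_data/')  # hypothetical partitioned output

def add_lag(pdf):
    # runs on a plain pandas DataFrame, entirely within one partition;
    # assumes the lag is over time within each [strike, expiration, type] group
    pdf = pdf.sort_values(['strike', 'expiration', 'type', 'time'])
    pdf['lag_value'] = pdf.groupby(['strike', 'expiration', 'type'])['value'].shift(1)
    return pdf

ddf = ddf.map_partitions(add_lag)

Because each partition is processed independently, no shuffle between workers is needed here.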
Please help me understand the best way of storing information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and I am a bit confused. People have suggested using fewer column families for better performance, so I am facing three options to choose from:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns per row, like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs encoded into one big string, like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thank you in advance; I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over a user-profiling output, like percentiles, engagement and stat summaries for the last 60 days. Most likely, I will use Hive for that.
Possibly you are confused by the similar naming of column family and column. These are different concepts in HBase. A column family consists of several columns. This design improves the speed of access when you only need to read some types of columns. E.g., if you have raw data and processed data, reading the processed data will not touch the raw data if they are stored in separate column families. You can have practically any number of columns per row key, but a single row should fit in one region, i.e. no more than about 10 GB. The design depends on what you want:
The first variant has no alternative when you need to store so much data per row key that it cannot fit in a single region (more than 10 GB).
The second is good when you need to read only a few metrics per single read per row key.
The last variant is suitable when you always read all metrics in a single read per row key.
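For illustration, a minimal sketch of the second variant using the Python happybase client (the Thrift host, table name and metric names below are assumptions, not taken from your setup):

import happybase

connection = happybase.Connection('hbase-thrift-host')   # assumed Thrift gateway
table = connection.table('user_sessions')                # assumed table name

row_key = b'ab12cd34_2021-07-01_sess42'   # hashed_uid + date + session_id
table.put(row_key, {
    b'cf1:duration': b'734',
    b'cf1:location': b'NYC',
    b'cf1:depth': b'5',
})

# a single read can return only the metrics you ask for
duration = table.row(row_key, columns=[b'cf1:duration'])

This layout keeps one row per session and lets readers (e.g. your Hive jobs) select just the columns they need.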
In a KNN-like algorithm we need to load the model data into a cache for predicting the records.
Here is the example for KNN.
So if the model is a large file, say 1 or 2 GB, will we be able to load it into the Distributed Cache?
Example:
In order to predict one outcome, we need to find the distance between that single record and all the records in the model result, and then take the minimum distance. So we need the whole model result in hand, and if it is a large file it cannot be loaded into the Distributed Cache for computing the distances.
One way is to split/partition the model result into several files, perform the distance calculation against all records in each file, and then take the minimum distance and the most frequent class label to predict the outcome.
How can we partition the file and perform the operation on these partitions?
i.e. 1st record <distance> file1, file2, ..., fileN
2nd record <distance> file1, file2, ..., fileN
This is what came to my mind.
Is there any better way?
Any pointers would help me.
I think the way you partition the data mainly depends on the data itself.
Given that you have a model with a bunch of rows, and that you want to find the k closest ones to the data in your input, the trivial solution is to compare them one by one. This can be slow because it means going through 1-2 GB of data millions of times (I assume you have a large number of records that you want to classify, otherwise you don't need Hadoop).
That is why you need to prune your model efficiently (your partitioning) so that you can compare only those rows that are most likely to be the closest. This is a hard problem and requires knowledge of the data you operate on.
Additional tricks that you can use to squeeze out more performance are:
Pre-sorting the input data so that input items that will be compared against the same partition come together. Again, this depends on the data you operate on.
Using random-access indexed files (like Hadoop's MapFile) to find the data faster and caching them.
In the end it may actually be easier to store your model in a Lucene index, so you can achieve the effect of partitioning by looking up the index. Pre-sorting the data is still helpful there.
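To make the partition-and-merge idea from the question concrete, here is a minimal single-machine sketch in Python (the partition file names and row format -- feature values followed by a class label -- are assumptions; in MapReduce each partition scan would run in its own mapper and the final minimum would be taken in a reducer, and for k > 1 you would keep the k best per partition instead of just one):

import csv
import math

def nearest_in_partition(path, query):
    # scan one partition of the model and keep its closest record
    best = (math.inf, None)
    with open(path) as f:
        for row in csv.reader(f):
            features, label = [float(x) for x in row[:-1]], row[-1]
            dist = math.dist(features, query)
            if dist < best[0]:
                best = (dist, label)
    return best

def predict(partition_paths, query):
    # merge step: the global nearest neighbour is the minimum
    # over the per-partition minima
    best_dist, best_label = min(
        (nearest_in_partition(p, query) for p in partition_paths),
        key=lambda t: t[0])
    return best_label

# e.g. predict(['model_part0.csv', 'model_part1.csv'], [1.5, 2.0, 0.3])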
I have a function in R that chokes if I apply it to a dataset with more than 1000 rows. Therefore, I want to split my dataset into a list of n chunks, each of not more than 1000 rows.
Here's the function I'm currently using to do the chunking:
chunkData <- function(Data, chunkSize){
  Chunks <- floor(0:(nrow(Data) - 1) / chunkSize)
  lapply(unique(Chunks), function(x) Data[Chunks == x, ])
}
chunkData(iris, 100)
I would like to make this function more efficient, so that it runs faster on large datasets.
You can do this easily using split from base R. For example, split(iris, 1:3) will split the iris dataset into a list of three 50-row data frames (the grouping vector is recycled over the rows). You can modify the grouping vector to specify a chunk size.
Since the output is still a list of data frames, you can easily use lapply on the output to process the data, and combine them as required.
Since speed is the primary issue for using this approach, I would recommend that you take a look at the data.table package, which works great with large data sets. If you specify more information on what you are trying to achieve in your function, people at SO might be able to help.
Replace the lapply() call with a call to split():
split(Data, Chunks)
You should also take a look at ddply from the plyr package; this package is built around the split-apply-combine principle. This paper about the package explains how it works and what is available in plyr.
The general strategy I would take here is to add a new column to the dataset called chunkid, which cuts the data up into chunks of 1000 rows; take a look at the rep function to create this column. You can then do:
result = ddply(dat, .(chunkid), functionToPerform)
I like plyr for its clear syntax and structure, and its support of parallel processing. As already said, please also take a look at data.table, which could be quite a bit faster in some situations.
An additional tip could be to use matrices instead of data.frames...