I am working on Databricks and using PySpark to write my code. I have a large PySpark SQL DataFrame and I want to run a weekly or bi-weekly regression and then predict the output on the next week's data. I need to do this for at least 2 years of data. Is there a way to do this without running loops and dividing the entire dataset into weekly chunks? Is there any built-in function for rolling-window regression? If possible, I want to store the results of all the weekly regressions (coefficients, R2 value, p-values).
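There is no built-in rolling-window regression in PySpark, but one common way to avoid an explicit Python loop is to group the data by week and fit one model per group with groupBy(...).applyInPandas, collecting the coefficients, R2 and p-values into a result DataFrame. Below is a minimal sketch under assumed column names (ds for the date, y for the target, x1 for a single feature), using statsmodels, which would need to be available on the workers (Spark 3.0+):

import pandas as pd
import statsmodels.api as sm
from pyspark.sql import functions as F

def fit_weekly_ols(pdf: pd.DataFrame) -> pd.DataFrame:
    # Fit an ordinary least squares model on one week's worth of rows.
    X = sm.add_constant(pdf[["x1"]])
    model = sm.OLS(pdf["y"], X).fit()
    return pd.DataFrame({
        "week": [pdf["week"].iloc[0]],
        "intercept": [model.params["const"]],
        "coef_x1": [model.params["x1"]],
        "r2": [model.rsquared],
        "pvalue_x1": [model.pvalues["x1"]],
    })

# Tag each row with the start of its week, then fit one model per week.
weekly = df.withColumn("week", F.date_trunc("week", "ds"))
results = weekly.groupBy("week").applyInPandas(
    fit_weekly_ols,
    schema="week timestamp, intercept double, coef_x1 double, r2 double, pvalue_x1 double",
)

The results DataFrame can then be persisted (for example as a Delta table), and each week's coefficients can be joined onto the following week's rows to produce the predictions.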
I work with diary data and I am trying to run a multilevel regression with a lagged variable. I am trying to:
predict a binary variable from a continuous variable of the previous day, and
predict a continuous variable from a binary variable of the previous day.
I don't understand how to include this time lag / diary data format in the R code. I also do not know how to control for the predictor variable from the same day.
I use the lmer and glmer commands for continuous and binary outcomes, respectively, and so far have included only level-1 predictors:
m1 <- lmer(outcome.cont ~ pred.binary.previousday + (1 | level2), data = data)
m2 <- lmer(outcome.cont ~ pred.binary.previousday + (pred.binary.previousday | level2), data = data)
m3 <- glmer(outcome.binary ~ pred.cont.previousday + (1 | level2), family = binomial, data = data)
AIC did not show a better fit for the random coefficients, but that was what I was expecting. The regression results so far were not significant, but maybe that is because I did not write the code correctly.
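For what it's worth, building the lagged previous-day columns is independent of the modelling step. Here is a small conceptual sketch in Python/pandas rather than R, with hypothetical column names (person, day, stress, binge), just to illustrate the shift-within-person idea that would precede the lmer/glmer calls above:

import pandas as pd

# Hypothetical diary data: one row per person per day.
df = pd.DataFrame({
    "person": [1, 1, 1, 2, 2, 2],
    "day":    [1, 2, 3, 1, 2, 3],
    "stress": [3.2, 4.1, 2.8, 1.5, 2.2, 3.0],   # continuous variable
    "binge":  [0, 1, 0, 0, 0, 1],               # binary variable
})

df = df.sort_values(["person", "day"])

# Shift within each person so every row also carries yesterday's values;
# the first diary day of each person has no lag and is dropped.
df["stress_previousday"] = df.groupby("person")["stress"].shift(1)
df["binge_previousday"] = df.groupby("person")["binge"].shift(1)
lagged = df.dropna(subset=["stress_previousday", "binge_previousday"])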
I am struggling with long training times with tf.data.Dataset, and am beginning to wonder whether reading the CSV file may be a bottleneck. Does tf.data.experimental.CsvDataset read from the file over and over?
I am considering first importing the whole dataset into a NumPy array and then creating a new TF Dataset from tensors. But such a change will take time, and I don't want to waste it if Stack Overflow could tell me beforehand that it makes no difference.
I do not know exactly why I got such long training times with CsvDataset, but modifying my code to first import the data into a NumPy array and then load it with tf.data.Dataset.from_tensor_slices made training about 10-100 times faster. One more, perhaps relevant, change that followed from this was that the dataset was no longer nested throughout the handling: in the old version each batch was a tuple of column tensors, whereas in the new version each batch is simply a tensor. (Further speedups may be achieved by removing transforms tailored to the nested structure, which are now applied to a single tensor only.)
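As a rough sketch of the faster variant (the file name data.csv and the label column name are assumptions, and the model itself is omitted): load the CSV into memory once, then build the dataset from the arrays.

import numpy as np
import pandas as pd
import tensorflow as tf

# Read the whole CSV once instead of re-parsing it every epoch.
df = pd.read_csv("data.csv")                          # hypothetical file name
features = df.drop(columns=["label"]).to_numpy(np.float32)
labels = df["label"].to_numpy(np.float32)             # hypothetical label column

# Each element is now a flat (features, label) pair instead of a nested
# tuple of per-column tensors, so no per-column transforms are needed.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=len(labels))
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)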
We are using Google AutoML Tables with CSV files as input. We imported the data, linked the whole schema with nullable columns, trained the model, deployed it, and used online prediction to predict the value of one column.
The column we targeted has values in the range 44-263 (min-max).
When we deployed the model and ran online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it to values in the range 44-263? We didn't find much documentation online about this.
We are looking for a documentation reference and an interpretation of the result, along with an interpretation of the 95% prediction interval.
Actually, to clarify (I'm the PM of AutoML Tables):
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a distribution with a min/max of 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column
2) Your input features for this prediction are dramatically different from what was seen in the training data
Please feel free to reach out to cloud-automl-tables-discuss@googlegroups.com if you'd like us to help debug further.
I need to find the average of two values that appear on separate lines.
My CSV file looks like this:
Name,ID,Marks
Mahi,1,90
Mahi,1,100
Andy,2,85
Andy,2,95
Now I need to store the average of the two marks in a database.
The "Average" column should add the two marks, divide by 2, and store that result via a SQL query.
Table:
Name,ID,Average
Mahi,1,95
Andy,2,90
Is it possible to find the average of two values in separate rows using NiFi?
Given a lot of assumptions, this is doable. You are definitely better off pre-processing the data in NiFi and exporting it to a tool better suited to this, like Apache Spark using the NiFi Spark Receiver library (instructions here), because this solution will not scale well.
However, you could certainly use a combination of SplitText processors to get the proper data into individual flowfiles (i.e. all Mahi rows in one, all Andy rows in another). Once you have a record that looks like:
Andy,2,85
Andy,2,95
you can use ExtractText with regular expressions to get 85 and 95 into attributes marks.1 and marks.2 (a good example of where scaling will break down -- doing this with 2 rows is easy; doing this with 100k is ridiculous). You can then use UpdateAttribute with the Expression Language to calculate the average of those two attributes (convert toNumber() first) and populate a third attribute marks.average (either through chaining plus() and divide() functions or with the math advanced operation (uses Java Reflection)). Once you have the desired result in an attribute, use ReplaceText to update the flowfile content, and MergeContent to merge the individual flowfiles back into a single instance.
If this were me, I'd first evaluate how static my incoming data format was, and if it was guaranteed to stay the same, probably just write a Groovy script that parsed the data and calculated the averages in place. I think that would even scale better (within reason) because of the flexibility of having written domain-specific code. If you need to offload this to cluster operations, Spark is the way to go.
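If the data does end up in Spark as suggested above, the aggregation itself is a one-liner. Here is a minimal PySpark sketch (the file name, JDBC URL, credentials and table name are all placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("marks-average").getOrCreate()

# Read the CSV exported by NiFi.
df = spark.read.csv("marks.csv", header=True, inferSchema=True)

# One row per (Name, ID) with the mean of the Marks column.
averages = df.groupBy("Name", "ID").agg(F.avg("Marks").alias("Average"))

# Store the result in a relational database over JDBC (placeholder connection details).
(averages.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/school")
    .option("dbtable", "student_averages")
    .option("user", "dbuser")
    .option("password", "dbpass")
    .mode("append")
    .save())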
I'm building an application that finds all images similar to a user's input image, using Hadoop.
I'm implementing it in two ways:
Way 1:
My image collection is converted to a SequenceFile to be used as input for the map function.
Then, in the map function, I use the OpenCV library to compare those images with the user's input image, which involves these steps (sketched below):
- Extract keypoints
- Compute descriptors
- Calculate the distance between each pair to determine the similarity
In the reduce function, I just copy the images that are similar to the output folder.
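For reference, the three map-side steps look roughly like this in the Python OpenCV bindings (the question uses the Java API, and the choice of ORB as the detector and of a plain mean match distance as the score are assumptions):

import cv2

def similarity_distance(query_path, candidate_path):
    # Load both images in grayscale for feature extraction.
    query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    candidate = cv2.imread(candidate_path, cv2.IMREAD_GRAYSCALE)

    # Extract keypoints and compute descriptors in one call.
    orb = cv2.ORB_create()
    _, query_desc = orb.detectAndCompute(query, None)
    _, cand_desc = orb.detectAndCompute(candidate, None)

    # Brute-force match the descriptors and average the distances;
    # a smaller value means the two images are more similar.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(query_desc, cand_desc)
    return sum(m.distance for m in matches) / max(len(matches), 1)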
Way 2:
Similar to Way 1, except:
I use HBase to store the image features (keypoints, descriptors) first. Because OpenCV doesn't support converting the keypoint and descriptor data types to byte[] directly (to insert data into HBase, we have to convert them to byte[]), I have to use the trick I found here: OpenCV Mat object serialization in java
Then, in the map function, I just query the image features from HBase to compare with the user's input image features.
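For comparison, with the Python bindings the serialization trick is much smaller, because descriptors come back as plain NumPy arrays. A sketch of the byte[] round trip (the actual HBase put/get is omitted):

import io
import numpy as np

def descriptors_to_bytes(descriptors: np.ndarray) -> bytes:
    # np.save keeps the shape and dtype in the header, so nothing extra is needed.
    buf = io.BytesIO()
    np.save(buf, descriptors)
    return buf.getvalue()          # value to store in the HBase cell

def bytes_to_descriptors(raw: bytes) -> np.ndarray:
    # Rebuild the descriptor matrix read back from HBase.
    return np.load(io.BytesIO(raw))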
Intuitively, saving all image features to a database and then just querying them to compare with the user's input image should be faster than extracting those features from scratch in each map function before the comparison.
But in fact, when I implement and test both ways on my virtual machine (standalone mode), Way 2 runs slower than Way 1, and the running time is not acceptable. In my opinion, Way 2 runs slowly because in the map function it takes a lot of time to convert the byte[] values from HBase back into OpenCV keypoint and descriptor data types for the comparison, and that degrades the performance of the whole map function.
My image collection includes just 240 images in JPG format.
So my question here is: besides the reason I gave above, is there any other reason that would make Way 2 run slower than Way 1, such as:
Running in standalone mode is not recommended when using HBase?
The input size is not big enough to benefit from HBase?
Thanks.