Time series based features for binary classification - h2o

Is it possible (and how?) to provide time series for binary classification in H2O.ai's Driverless AI? I have a dataframe that looks like this:
ID
Status/Target [0/1]
TimeStamp for events that happened on a given ID in the last 90 days
Details of those events (category, description, values, etc...)
Ideally, what I want is to build a model that predicts the status for a given ID, based on the provided history of events.

For H2O's Driverless AI, you can use it out of the box for time-series modeling; see the time-series section of the documentation. You need to provide the "Time Column" as your TimeStamp and add ID to your "Time Groups Column".
If your target column contains 0s and 1s, it should automatically be identified as binary. If not, you can toggle the problem type from regression to binary classification.
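To make the expected layout concrete, here is a minimal pandas sketch of such a training frame (the Category and Value columns are hypothetical stand-ins for the event details; only ID, TimeStamp, and Status come from the question). Each row is one event, with the target repeated for every event of the same ID:

```
import pandas as pd

# One row per event; the binary target is constant within an ID.
df = pd.DataFrame({
    "ID":        [101, 101, 101, 202, 202],
    "TimeStamp": pd.to_datetime([
        "2020-01-03", "2020-01-15", "2020-02-01",
        "2020-01-10", "2020-02-20",
    ]),
    "Category":  ["login", "purchase", "login", "login", "support"],  # event details
    "Value":     [0.0, 49.9, 0.0, 0.0, 12.5],
    "Status":    [1, 1, 1, 0, 0],  # binary target [0/1]
})

# In the experiment setup you would then select:
#   Target Column      -> "Status"
#   Time Column        -> "TimeStamp"
#   Time Groups Column -> "ID"
```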

Related

Power BI Only Showing Average of Whole Column in Line Graph

So I am trying to create a report for (among other things) measurements made by a device.
I have normalised my schema such that these measurements are in a measurement dimension.
I have created a calendar and timetable to plot these measurements as a line graph.
However rather than having a line graph showing the changes, I get a constant line of the average (or whichever aggregate function I am using) of the whole column.
My line graph at a date level
My line graph at a date and time level
Since entries are unique for each datetime stamp, I don't understand why the line graph doesn't show variation here.
I've tried all sorts of changes to my schema, but it seems it only wants to work when it's a flat file.
For a sense of my schema:
My schema
All the data types for the columns seem to be correct.
It's happy filtering by date if I do it in Transform Data, so I don't think there is a problem with my relationships.
Am I missing something obvious?

How to create visualization using ratio of fields

I have a data set similar to the table below (simplified for brevity).
I need to calculate the total spend per conversion per team for every month, with the ability to plot this as a time-based line chart being an additional nicety. The total spend is equal to the sum of Phone Expenditure, Travel allowance, and Misc. Allowance; this can be a calculated field.
I cannot add a per-row calculated field for the ratio, because for some salespeople the number of conversions can be 0 in a given month, so averaging over the team is not an option. How can I go about this?
Thanks for your help and suggestions in advance!
I've discussed the question with Harish offline. I've learned that he is trying to calculate the ratio per group, not per row.
To perform calculations per group, users can add calculated fields inside a QuickSight analysis and use level-aware aggregation expressions. (Note that level-aware aggregations can only be used in an analysis, not in the data prep view.) Here is a link to the documentation about level-aware aggregations if you want to learn more: https://docs.aws.amazon.com/quicksight/latest/user/level-aware-aggregations.html
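As a rough illustration of what a group-level ratio does (this is a pandas analogue of the idea, not QuickSight syntax; the Conversions column and the sample numbers are made up):

```
import pandas as pd

df = pd.DataFrame({
    "Team":              ["A", "A", "B", "B"],
    "Month":             ["Jan", "Jan", "Jan", "Jan"],
    "Phone Expenditure": [100, 150, 80, 120],
    "Travel allowance":  [50, 0, 30, 60],
    "Misc. Allowance":   [10, 20, 5, 15],
    "Conversions":       [0, 4, 2, 3],  # one salesperson has 0 conversions
})

df["Total spend"] = (
    df["Phone Expenditure"] + df["Travel allowance"] + df["Misc. Allowance"]
)

# Aggregate to the team/month level first, then divide: summing
# conversions over the group avoids the per-row division by zero.
per_group = df.groupby(["Team", "Month"]).agg(
    spend=("Total spend", "sum"),
    conversions=("Conversions", "sum"),
)
per_group["spend_per_conversion"] = per_group["spend"] / per_group["conversions"]
print(per_group)
```

A level-aware aggregation such as sumOver plays the same role inside the analysis: it computes the sums at the team/month level before the ratio is taken.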

Which machine learning algorithm should I use for sequence prediction?

I have a dataset like the one below. The datetime column is the index, and type is a column containing a sequence. For example, R,C,D,D,D,R,R is a sequence.
start_time           type
2019-12-14 09:00:00  RCDDDRR
2019-12-14 10:00:00  CCRD
2019-12-14 11:00:00  DDRRCC
2019-12-14 12:00:00  ?
I want to predict what the next sequence at 12:00:00 will be. Which is the best algorithm for predicting the next sequence?
I know that we can use a Markov chain to predict the most probable sequence. However, are there any other, better algorithms?
Thanks
You can use kNN or SVM for prediction, but first of all you have to transform the data and define features for the training dataset; see the sketch below for an example.
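For instance, a minimal sketch of one possible encoding (the count-based features are my assumption, just to show the shape of the data a kNN or SVM would need):

```
from collections import Counter

sequences = ["RCDDDRR", "CCRD", "DDRRCC"]
alphabet = ["R", "C", "D"]

def to_features(seq):
    # Count how often each event type occurs in the sequence.
    counts = Counter(seq)
    return [counts.get(symbol, 0) for symbol in alphabet]

X = [to_features(s) for s in sequences]
# X == [[3, 1, 3], [1, 2, 1], [2, 2, 2]], rows a kNN or SVM can consume
print(X)
```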
You can also use another method based on deep learning; I think this link can help you:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
LSTMs have an edge over conventional feed-forward neural networks and RNNs in many ways. This is because of their property of selectively remembering patterns for long durations of time.
LSTMs make small modifications to the information by multiplications and additions. With LSTMs, the information flows through a mechanism known as cell states. This way, LSTMs can selectively remember or forget things. The information at a particular cell state has three different dependencies.
Let’s take the example of predicting stock prices for a particular stock. The stock price of today will depend upon:
The trend that the stock has been following in the previous days, maybe a downtrend or an uptrend.
The price of the stock on the previous day, because many traders compare the stock’s previous day price before buying it.
The factors that can affect the price of the stock for today. This can be a new company policy that is being criticized widely, or a drop in the company’s profit, or maybe an unexpected change in the senior leadership of the company.
These dependencies can be generalized to any problem as:
The previous cell state (i.e., the information that was present in the memory after the previous time step).
The previous hidden state (this is the same as the output of the previous cell).
The input at the current time step (i.e., the new information that is being fed in at that moment).
Maybe these links and this method could help you:
https://www.bioinf.jku.at/publications/older/2604.pdf
https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
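As a concrete starting point, here is a minimal Keras sketch (assuming TensorFlow 2.x is installed; the window size and hyperparameters are illustrative, not tuned) that frames the task as next-event prediction over the character stream:

```
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Concatenate the logged sequences into one event stream and learn to
# predict the next event type from the previous `window` events.
sequences = ["RCDDDRR", "CCRD", "DDRRCC"]
stream = "".join(sequences)
vocab = sorted(set(stream))                    # ['C', 'D', 'R']
char_to_idx = {c: i for i, c in enumerate(vocab)}

window = 3
X, y = [], []
for i in range(len(stream) - window):
    X.append([char_to_idx[c] for c in stream[i:i + window]])
    y.append(char_to_idx[stream[i + window]])
X, y = np.array(X), np.array(y)

model = Sequential([
    Embedding(input_dim=len(vocab), output_dim=8),
    LSTM(32),
    Dense(len(vocab), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=50, verbose=0)

# Most likely next event type after the last observed window.
last = np.array([[char_to_idx[c] for c in stream[-window:]]])
print(vocab[int(model.predict(last, verbose=0).argmax())])
```

To predict a whole sequence for 12:00:00 rather than a single event, you would feed each prediction back in as input and generate step by step.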

Interpret Google AutoML Online Prediction Results

We are using Google AutoML Tables with CSV files as input. We have imported the data, linked the schema with nullable columns, trained the model, and then deployed it and used online prediction to predict the value of one column.
The column we targeted has values ranging from a minimum of 44 to a maximum of 263.
When we deployed and ran the online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it to values in the range of 44 to 263? I didn't find much documentation online about this.
I'm looking for a documentation reference and an interpretation of the result, along with an interpretation of the 95% prediction interval.
To clarify (I'm the PM of AutoML Tables):
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a distribution of min/max 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column
2) Your input features for this prediction are dramatically different from what was seen in the training data used.
Please feel free to reach out to cloud-automl-tables-discuss@googlegroups.com if you'd like us to help debug further

Several sensors - noise filtering algorithm needed

My software receives information from several sensors. The number of sensors is not fixed: they can be added and removed, and each sensor has its own unique identifier. Sensors send data irregularly; they can keep silent for weeks or push data every second. Each sensor generates a value from a fixed set of values, so the sensors are discrete. My program logs each message from each sensor into an SQL database table (sensorId, time, value).
The task is to filter this information: I need to select only one record from this log, which I consider to be the actual information. For example, if the latest record from a single sensor says that the value is A, but before it 10 different sensors told me that the value is B, then I shall still consider B to be the actual information. At the same time, the problem is not just the usual noise filtering, because if there was one sensor which told me every second for a month that the value was C, and then five sensors recently tell me that in fact the value is D, I shall immediately consider D to be the actual data despite the long history. In other words, the number of independent sources must also have weight.
So I think I need a kind of function of two variables: time (ageing) and the number of unique sensors at a given moment. I must somehow calculate the weight of each record and then just select the one with the biggest weight. And I suppose that to calculate a record's weight I should use not only the information from the current record, but also information from all the previous ones.
I need some help with the algorithm. Maybe there is actually some well-known solution which I'm not aware of?
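One straightforward starting point, sketched below under assumptions of my own (an exponential time decay and one vote per sensor; the decay constant tau is a knob to tune): take each sensor's latest report, weight it by exp(-age/tau), and sum the weights per value. Recency and the number of independent sensors then both contribute, so five fresh reports of D quickly outweigh one sensor's month-long history of C.

```
import math
import time
from collections import defaultdict

def actual_value(records, tau=3600.0, now=None):
    """records: iterable of (sensor_id, timestamp, value) tuples,
    e.g. straight from the (sensorId, time, value) log table."""
    now = time.time() if now is None else now

    # Keep only the latest record per sensor: each sensor gets one vote.
    latest = {}
    for sensor_id, ts, value in records:
        if sensor_id not in latest or ts > latest[sensor_id][0]:
            latest[sensor_id] = (ts, value)

    # Sum exponentially decayed votes per value: many recent independent
    # sensors outweigh a single long, stale history.
    weights = defaultdict(float)
    for ts, value in latest.values():
        weights[value] += math.exp(-(now - ts) / tau)

    return max(weights, key=weights.get) if weights else None
```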
