Question: Is it possible to train the same Model, from Google AutoML, multiple times?
Problem: I have several datasets with time series data. Example:
Dataset A: [[product1, date1, price], [product1, date2, price]]
Dataset B: [[product2, date1, price], [product2, date2, price]]
Dataset C: [[product3, date1, price], [product3, date2, price]]
When describing the columns in Google AutoML you can mark the data as time series data and specify the date column as the time series column. It is very important to keep in mind that it is time series data. I'd think combining the datasets wouldn't be a good idea, because there would be duplicate dates.
Is it possible to train the model on dataset A and, after that finishes, on dataset B, etc., or would you advise combining the datasets?
Thanks.
You can combine the data; I don't see how that would matter for what you are describing. Marking a column as a Time column has AutoML Tables split the data based on that column, putting the oldest 80% in the training set, the next most recent 10% in the validation set, and the most recent 10% in the test set.
If your dataset does not have enough distinct values in the time column to support the 80/10/10 split described above, you will want to not mark it as the Time column and instead split the data manually, as sketched below.
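For the manual route, a chronological 80/10/10 split is straightforward. A minimal sketch with pandas (the file name and column names are assumptions):

import pandas as pd

# Hypothetical file with the [product, date, price] layout from the question.
df = pd.read_csv("dataset_a.csv", parse_dates=["date"])
df = df.sort_values("date")  # oldest rows first

n = len(df)
train = df.iloc[:int(n * 0.8)]                   # oldest 80%
validation = df.iloc[int(n * 0.8):int(n * 0.9)]  # next 10%
test = df.iloc[int(n * 0.9):]                    # most recent 10%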
If the datasets are not related and are distinct from each other, then you would want to train individual models for each.
I need to tune a job that looks like the one below.
import pyspark.sql.functions as F
dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]
# Aggregation
aggregate = (spark.table("input_table")
.groupBy(*dimensions)
.agg(*expressions))
# Write out summary table
aggregate.write.format("delta").mode("overwrite").save("output_table")
The input table contains transactions, partitioned by date, 8 files per date.
It has 108 columns and roughly half a billion records. The aggregated result has 37 columns and ~20 million records.
I am unable to make any improvement in the runtime whatever I do, so I would like to understand what affects the performance of this aggregation, i.e. what things I could potentially change.
The only thing that seems to help is to manually partition the work, i.e. starting multiple concurrent copies of the same code but with different date ranges.
To the best of my understanding, the groupBy clause currently doesn't include the 'date' column, so you are actually aggregating across all dates and not using the input table's partitioning at all.
You can add the "date" column to the groupBy clause, and then you will sum up the measures for each date.
Also, as for input_table, when it is built, if possible, you can additionally partition it by d1, d2, d3, or at least those of them that don't have high cardinality.
Finally, input_table will benefit from a columnar file format (Parquet), so you won't have to I/O all 108 columns if you are using something like CSV. You are probably already using something like Parquet, but just in case. A sketch of these changes follows.
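A minimal sketch of the suggested changes, keeping the names from the question (the output partitioning is an assumption on my part):

import pyspark.sql.functions as F

dimensions = ["d1", "d2", "d3"]
measures = ["m1", "m2", "m3"]
expressions = [F.sum(m).alias(m) for m in measures]

# Grouping by the partition column as well means each date is aggregated
# separately, and a later filter on date can prune input partitions.
aggregate = (spark.table("input_table")
    .groupBy("date", *dimensions)
    .agg(*expressions))

# Partitioning the output by date lets downstream reads prune as well.
(aggregate.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("date")
    .save("output_table"))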
Is it possible (and how?) to provide time series for binary classification in H2O.ai's Driverless AI? I have a dataframe that looks like this:
ID
Status/Target [0/1]
TimeStamp for events that happened on a given ID, in the last 90 days
Details of those events (category, description, values, etc...)
Ideally, what I want is to build a model that predicts the status for a given ID, based on the provided history of events.
For H2O's Driverless AI, you can use it out of the box for time-series modeling. See this section. You need to provide the "Time Column" as your TimeStamp and add ID to your "Time Groups Column".
If your target column contains only 0s and 1s, it should automatically be identified as binary. If not, you can toggle it from regression to binary classification.
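To make the mapping concrete, this is the data layout implied by the question ("TimeStamp" would be set as the Time Column and "ID" added to the Time Groups Column in the experiment setup; all values below are made up):

import pandas as pd

events = pd.DataFrame({
    "ID": [101, 101, 102],
    "TimeStamp": pd.to_datetime(["2024-01-03", "2024-02-11", "2024-01-20"]),
    "category": ["login", "purchase", "login"],  # event details
    "Status": [1, 1, 0],                         # binary target (0/1)
})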
We are using Google AutoML Tables with CSV files as input. We have imported the data, defined the schema with nullable columns, trained the model, deployed it, and used online prediction to predict the value of one column.
The column we targeted has values in the range 44-263 (min-max).
When we deployed and ran online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it to values in the range 44-263? We didn't find much documentation online about this.
We are looking for documentation references and an interpretation of the result, along with an interpretation of the 95% prediction interval.
Actually, to clarify (I'm the PM of AutoML Tables):
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a min/max range of 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column.
2) Your input features for this prediction are dramatically different from what was seen in the training data.
Please feel free to reach out to cloud-automl-tables-discuss@googlegroups.com if you'd like us to help debug further.
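If you want to rule out possibility 1) quickly, a simple check of the label column's range in the training CSV can help. A minimal sketch with pandas (the file name and label column name are assumptions):

import pandas as pd

# Hypothetical training export; replace with your actual file and label column.
df = pd.read_csv("training_data.csv")
print(df["target_column"].describe())  # min/max should be roughly 44-263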
Please help me understand the best way to store information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and I am a bit confused. People have suggested using fewer column families for better performance, so I am facing three options to choose from:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns, like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs encoded into one big string, like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thank you in advance. I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over a user-profiling output, like percentiles, engagement, and summary stats for the last 60 days. Most likely, I will use Hive for that.
Possibly you are confused by the similar naming of column family and column. These are different concepts in HBase: a column family consists of several columns. This design improves the speed of access when you need to read only some types of columns; e.g., if you have raw data and processed data stored in separate column families, reading the processed data will not involve the raw data. Practically, you can have any number of columns per row key, but a row must be stored in one region, i.e. no more than 10GB. The design depends on what you want:
The first variant has no alternative when you need to store so much data per row key that it cannot fit in one region (more than 10GB).
The second is good when you need to read only a few metrics per row key in a single read.
The last variant is suitable when you always fetch all metrics per row key in a single read.
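As an illustration of the second variant, a minimal sketch using the happybase Python client (the host, table name, and keys are assumptions):

import happybase

connection = happybase.Connection("hbase-host")  # hypothetical host
table = connection.table("user_metrics")         # hypothetical table

# One row per hashed_uid+date+session_id, one column per metric in cf1.
row_key = b"a1b2c3|2024-01-15|sess42"
table.put(row_key, {
    b"cf1:duration": b"120",
    b"cf1:location": b"NYC",
    b"cf1:depth": b"7",
})

# Reading just a few metrics touches only the requested columns.
row = table.row(row_key, columns=[b"cf1:duration", b"cf1:depth"])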
I'm building a table that contains about 400k rows of a messaging app's data.
The current table's columns look something like this:
message_id (int) | sender_userid (int) | other_col (string) | other_col2 (int) | create_dt (timestamp)
A lot of queries I will be running in the future will rely on a WHERE clause involving the create_dt column. Since I expect this table to grow, I would like to try to optimize it right now. I'm aware that partitioning is one way, but when I partition based on create_dt the result is too many partitions, since I have every single date going back to Nov 2013.
Is there a way to instead partition by a range of dates? How about a partition for every 3 months? Or even every month? If this is possible, could I end up with too many partitions in the future, making it inefficient? What are some other possible partitioning methods?
I've also read about bucketing, but as far as I'm aware that's only useful if you are doing joins on the column the buckets are based on. I would most likely be doing joins only on the sender_userid (int) column.
Thanks!
I think this might be a case of premature optimization. I'm not sure what your definition of "too many partitions" is, but we have a similar use case. Our tables are partitioned by date and a customer column, with data going back to Mar 2013. This created approximately 160k+ partitions. We also filter on date and we haven't seen any performance problems with this schema.
On a side note, Hive is getting better at scaling up to 100s of thousands of partitions and tables.
On another side note, I'm curious as to why you're using Hive in the first place for this. 400k rows is a tiny amount of data and is not really suited for Hive.
Check out Hive's built-in UDFs. With the right combination of them you can achieve what you want. Here's an example that partitions on every month (it produces a "YEAR-MONTH" string you can use as the partition column value):
select concat(cast(year(to_date(create_dt)) as string),'-',cast(month(to_date(create_dt)) as string))
But when partitioning on dates it is usually useful to have multiple levels of the date dimension, so in this case you should have two partition columns, the first for year and the second for month:
select year(to_date(create_dt)),month(to_date(create_dt))
Keep in mind that timestamps and dates may be stored as strings, while functions like month() or year() return integers for date fields. You can use simple arithmetic to work out the right partition.
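To show the two-level idea end to end, a hedged PySpark sketch that derives year and month from create_dt and writes a partitioned copy of the table (the source table name and output path are assumptions):

import pyspark.sql.functions as F

messages = spark.table("messages")  # hypothetical source table

(messages
    .withColumn("year", F.year("create_dt"))
    .withColumn("month", F.month("create_dt"))
    .write
    .mode("overwrite")
    .partitionBy("year", "month")   # one directory per (year, month)
    .format("parquet")
    .save("/warehouse/messages_partitioned"))  # hypothetical location

This yields at most 12 partitions per year instead of one per day, and a filter on create_dt can be rewritten as a filter on year/month to get partition pruning.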