Reducing data with DataStage - ETL

I've been asked to reduce an existing data model using Data Stage ETL.
It's more of an exercise and a way to get to know this tool, which I'm very new to.
Of course, the data must be reduced following some functional rules.
Table: MEMBERSHIP (.., A, B, C) # where A, B, C are different attributes (our filters)
I'm reducing the data from ~700k rows to around 7k rows.
I was thinking of keeping the same percentages as in the data source.
So if we have 70% of A, 20% of B and 10% of C in the source, the reduced version would have pretty much the same percentages.
I'm looking for the best way to do this and the built-in stages to use (maybe the Aggregator stage?).
Is there any way to do scripting similar to PL/SQL with DataStage?
I hope I've been clear enough. If you have any advice I'd be very grateful.
Thanks to all of you.
~Whitoo

DataStage does not do percentage-wise reductions out of the box.
What you can do is use a Transformer stage or a Filter stage to filter the data from the source based on certain conditions. But like I said, the conditions have to be very specific (for example, select only those records where A = [somevalue] or A <> [somevalue]).

DataStage PX has the Sample stage, which allows you to specify what percentage of the data you want it to sample: http://datastage4you.blogspot.com/2014/01/sample-stage-in-datastage.html.
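If part of the reduction can be pushed down to the source database rather than done inside DataStage, the proportion-preserving idea from the question can also be sketched in plain SQL. This is only an illustration: it assumes the source supports window functions, and the random-ordering function is named differently per database (RANDOM(), RAND(), ...):
-- Keep roughly 1% of MEMBERSHIP per value of A, so the distribution of A is preserved
SELECT *
FROM (
    SELECT m.*,
           ROW_NUMBER() OVER (PARTITION BY A ORDER BY RANDOM()) AS rn,
           COUNT(*)     OVER (PARTITION BY A)                   AS grp_cnt
    FROM MEMBERSHIP m
) t
WHERE rn <= CEIL(grp_cnt * 0.01);
Within DataStage itself, the Sample stage mentioned above is the closest built-in equivalent; alternatively, a Transformer stage with stage variables placed after a Sort stage can approximate it by keeping every Nth row per group.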

Related

Interpret Google AutoML Online Prediction Results

We are using Google AutoML Tables with CSV files as input. We have imported the data, linked the whole schema with nullable columns, trained the model, then deployed it and used online prediction to predict the value of one column.
The column we targeted has values in the range 44 - 263 (min - max).
When we deployed and ran online prediction, it returned values like this:
Prediction result
0.49457597732543945
95% prediction interval
[-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it to values in the range 44 - 263? We didn't find much documentation online on this.
We're looking for documentation references and an interpretation of the result, along with an interpretation of the 95% prediction interval.
Actually to clarify (I'm the PM of AutoML Tables)--
AutoML Tables does not do any normalization of the predicted values for your label data, so if you expect your label data to have a distribution with a min/max of 44-263, then the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column.
2) Your input features for this prediction are dramatically different from what was seen in the training data.
Please feel free to reach out to cloud-automl-tables-discuss@googlegroups.com if you'd like us to help debug further.

HBase design: concat long key-value pairs vs many columns

Please help me understand the best way of storing information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and I am a bit confused. People have suggested using fewer column families for better performance, so I am facing three options to choose from:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns, like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs encoded into one big string, like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thanks in advance; I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over a user-profiling output, such as percentiles, engagement and stat summaries for the last 60 days. Most likely, I will use Hive for that.
Possibly you are confused by the similarity of the names column family and column. These are different concepts in HBase. A column family consists of several columns. This design improves the speed of access when you only need to read certain types of columns. E.g., if you have raw data and processed data, reading the processed data will not involve the raw data when they are stored in separate column families. You can have practically any number of columns per row key, but a single row is stored in one region and should not exceed about 10 GB. The design depends on what you want:
The first variant has no alternative when you need to store so much data per row key that it cannot fit in one region (more than 10 GB).
The second is good when you only need to read a few metrics per single read of a row key.
The last variant is suitable when you always read all metrics in a single read of a row key.
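If you do go the Hive route mentioned in the question, option 2 (one column per metric) also maps naturally onto a Hive external table over HBase. A rough sketch, assuming an HBase table named user_profile with column family cf1 and a few of the metrics from the question (all of these names are hypothetical):
-- Expose selected HBase columns as a Hive table for windowing/aggregation queries
CREATE EXTERNAL TABLE user_profile_hbase (
  rowkey   STRING,
  duration BIGINT,
  location STRING,
  depth    INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf1:duration,cf1:location,cf1:depth"
)
TBLPROPERTIES ("hbase.table.name" = "user_profile");
With option 3 (one packed string per row), the same mapping would expose only a single string column and every Hive query would have to parse it, which is why option 2 is usually friendlier for the windowing and summary work you describe.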

Cassandra Modeling for filter and range queries

I'm trying to model a database of users. These users have various vital statistics: age, sex, height, weight, hair color, etc.
I want to be able to write queries like these:
get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds
or
get all users who are men, are 6'0" tall, are ages 31-37 and have black hair
How can I model my data in order to make these queries? Let's assume this database will hold billions of users. I can't think of an approach that wouldn't require me to make MANY requests or cluster the data on VERY few nodes.
EDIT:
Just a little more background, let's assume this thought problem is to build a dating website. The site should allow users to filter people based on the aforementioned criteria (age, sex, height, weight, hair, etc.). These filters are optional, and you can have as many as you want. This site has 2 billion users. Is that something that can be achieved through data modeling alone?
IF I UNDERSTAND THINGS CORRECTLY
If I have 2 billion users and I create both of the tables mentioned in the first answer (assuming options of male and female for sex, and blonde, brown, red for hair color), I will, for the first table, be putting at most 2 billion records on one node if everyone has blonde hair. Best case scenario, 2/3 billion records on three nodes. In the second case, I will be putting 2/5 billion records on each node in the best case with the same worst case. Am I wrong? Shouldn't the partition keys be more unique than that?
If you are trying to model your data inside Cassandra, then the general rule is that you need to make a table per query. There are also significant restrictions on what you can filter your query by. If you want to understand some of the restrictions, I suggest you take a look at this post:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
or my long answer here:
cassandra - how to perform table query?
All of the above only applies if you are running fixed queries that are known ahead of time. If instead you are looking to perform some sort of analytical processing on your data (it sounds like you might be), then I would look at using Spark in conjunction with Cassandra. This will give you a fast tool for in-memory processing of your data. If you look at using DataStax (Community or Enterprise), then Spark also has a connector that makes reading and writing data to and from Cassandra easy.
Edited with Additional Information
Based on the query "get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds" you would need to build a table with following:
CREATE TABLE user_by_haircolor_weight_height (
haircolor text,
weight float,
height_in int,
user varchar,
PRIMARY KEY ((haircolor), weight, height_in)
);
You could then query this by:
SELECT * from user_by_haircolor_weight_height where haircolor='red' and weight>100 and height_in>61 and height_in<73;
For the query "get all users who are men, are 6'0" tall, are ages 31-37 and have black hair" you would need to build a similar table with a
PRIMARY KEY ((haircolor, sex), height_in, age)
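For illustration, that second table and its query might look roughly like the following (the column names are assumed to mirror the first table, with age added, and 6'0" taken as 72 inches):
CREATE TABLE user_by_haircolor_sex_height_age (
haircolor text,
sex text,
height_in int,
age int,
user varchar,
PRIMARY KEY ((haircolor, sex), height_in, age)
);
SELECT * from user_by_haircolor_sex_height_age where haircolor='black' and sex='male' and height_in=72 and age>=31 and age<=37;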
In the end, if what you are trying to do is perform either ad-hoc or a fixed set of analytics queries (i.e. queries that can tolerate a bit more latency than a straight CQL query) on the data stored in your Cassandra table, then I suggest you look at using Spark. If you need something a bit more real-time to handle ad-hoc queries, you can look at using Solr to perform Lucene-powered searches on your table.
My recommendation is:
1) Keep the main table with a proper partition key, so that millions of records are spread across the cluster; don't use a clustering arrangement here that lets a single partition cross the row/partition size limitation of about 2 GB.
2) Depending on the query patterns, create as many additional tables (like indexes) as you need to hold inverted-index data, because writes are cheap.
3) Use multiple queries to get what you need.
4) As a last option, use the DSE Solr search capability.
Just to reiterate the end of the conversation:
"Your understanding is correct and you are correct in stating that partition keys should be more unique than that. Each partition had a maximum size of 2GB but a practical limit is lower. In practice you would want your data partitioned into far smaller chunks that the table above. Given the ad-hoc nature of your queries in your example I do not think you would be able to practically do this by data modelling alone. I would suggest looking at using a Solr index on a table. This would allow you a robust search capability. If you use Datastax you are even able to query this via CQL"
Cassandra alone is not a good candidate for this sort of complex filtering across a very large data set.

Vertica query optimization

I want to optimize a query in a Vertica database. I have a table like this:
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions).
I fetch some data using this query:
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slowly. The query plan shows something like this:
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I run EXPLAIN on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan over part of the table. In other databases I can create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have a concept of indexes. You would want to create a query specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
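For reference, here is roughly what two of those suggestions look like as SQL; the schema name and the literal key list are placeholders for whatever you actually have:
-- Collect statistics so the NO STATISTICS warning goes away and the optimizer gets row counts
SELECT ANALYZE_STATISTICS('public.data');
-- Rewrite the long IN list as a join against a small derived table of keys
SELECT d.b, d.c
FROM data d
JOIN (SELECT 1 AS b UNION ALL SELECT 2 UNION ALL SELECT 3) k ON d.b = k.b
WHERE d.a = 1;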
Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data on a column which is frequently used in the queries, such as transaction_date, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing an insert to affect only one small ROS container instead of the entire file.
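Putting segmentation and sort order together for the table in the question, a query-specific projection might look roughly like this (the sort and segmentation choices below are only assumptions based on the example query, and this table has no date-like column to partition on):
CREATE PROJECTION data_by_a_b (a, b, c)
AS SELECT a, b, c FROM data
ORDER BY a, b                          -- matches the WHERE a = ... AND b IN (...) filter
SEGMENTED BY HASH(a, b, c) ALL NODES;  -- hash on all columns to limit data skew

SELECT REFRESH('data');                -- populate the new projection from existing data
SELECT ANALYZE_STATISTICS('data');     -- collect statistics so the optimizer can cost it properly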

Generate Price Data from 3 variables and data

I'm trying to come up with an algorithm that would generate a price based on 3 variables. I have to come up with a way of extracting this from some data.
For instance, I'm trying to come up with the price for a used car. The 3 variables would be:
The make of the car (i.e. Honda Civic)
The year of the car (i.e. 2006)
Kilometers driven (i.e. 200,000 km)
I would feed it data extracted from a listing site. The data I would have is the same as above as well as the listing price.
The user can then pick the make, year, and kilometers driven and it will generate an average price based on that data.
Any ideas at all would be helpful! I'm building this in PHP with a MySQL database.
Thanks so much!
If you are looking for something simple that is just based on the available data, plain SQL will suffice: GROUP BY your variables, use AVG on the price, and filter with WHERE, as sketched below.
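A minimal sketch, assuming a MySQL table named listings with columns make, year, km_driven and price (all of these names are hypothetical):
-- Average listed price for one pick; a kilometre band stands in for an exact value
SELECT AVG(price) AS avg_price, COUNT(*) AS listings_used
FROM listings
WHERE make = 'Honda Civic'
  AND year = 2006
  AND km_driven BETWEEN 150000 AND 250000;

-- Or precompute an average for every make/year combination
SELECT make, year, AVG(price) AS avg_price
FROM listings
GROUP BY make, year;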
If you are looking for something fancier, and want to make predictions based on limited data or incomplete queries, you should have a look at things like regression trees.
