How can I perform analytical operations along with a join in Mosaic Decisions?

While joining two tables, I need to perform some analytical operations on the columns. For example:
left outer join $(CORE_DB_NAME).GLOBAL_DIMENSION_CORE.plant_d loc on
upper(trim(COALESCE(loc.PLANT_NUMBER,'-1'))) =
upper(trim(COALESCE(afru.werks,'-1')))
How can I achieve this in Mosaic Decisions?

To achieve this in Mosaic Decisions, the Transformer Node can be used to transform the data using the various functions available in it.
Considering the above scenario, the following functions, available under String Functions in the Transformer Node, can be used to process the data before performing the join operation on it (see the example after the list):
UPPER()
TRIM()
COALESCE()
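For example, the join key from the query above could be pre-computed in the Transformer Node as a derived column. This is a sketch: the column name is taken from the query, and the exact expression syntax may vary by product version.

UPPER(TRIM(COALESCE(PLANT_NUMBER, '-1')))

Applying the same expression to werks on the other input produces normalized keys on both sides, after which a plain equality join can be configured.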

Related

(Spark) Is there any possible way to optimize a join of two large RDDs when both of them are too large for memory (meaning broadcast cannot be used)?

As the title says:
Is there any possible way to optimize a join of two large RDDs when both of them are too large for memory? In this case, I suppose we cannot use broadcast for a map-side join.
I have to join these two RDDs, and both of them are too large to fit in memory:
country_rdd:
(id, country)
income_rdd:
(id, (income, month, year))
joined_rdd = income_rdd.join(country_rdd)
Is there any possible way to reduce the shuffling here? Or anything I can do to tune the join performance?
Besides, the joined_rdd will be further calculated and reduced only by country and time, no longer by id. E.g., my final result is the income for each country in each year. What's the best practice for that?
I considered doing some pre-partitioning, but it seems that if I only need to do the join once, that won't help much?
In the general case (no a priori knowledge of the key properties) it is not possible. Shuffle is an essential part of the join and cannot be avoided.
In specific cases you can reduce shuffling in two ways:
Design your own Partitioner which takes advantage of the pre-existing data distribution. For example, if you know that the data is sorted by key, you can use that knowledge to limit the shuffle.
If you apply an inner join, and only a fraction of the keys occurs in both RDDs, you can (see the sketch below):
Create a Bloom filter on each dataset. Let's call these leftFilter and rightFilter.
Filter each RDD with the opposite filter (leftRDD with rightFilter, rightRDD with leftFilter).
Join the filtered RDDs.
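A minimal PySpark sketch of the Bloom-filter approach, using a small hand-rolled pure-Python filter so the example stays self-contained. The filter size, hash count, and sample data are all assumptions; in practice you would size the filter to your key cardinality.

import hashlib
from pyspark import SparkContext

class BloomFilter(object):
    # Tiny illustrative Bloom filter; not tuned for production use.
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Double hashing: derive num_hashes bit positions from one MD5 digest.
        digest = hashlib.md5(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def union(self, other):
        for i, b in enumerate(other.bits):
            self.bits[i] |= b
        return self

def build_filter(rdd):
    # Build one filter per partition, then OR them together on the driver.
    def per_partition(keys):
        bf = BloomFilter()
        for k in keys:
            bf.add(k)
        yield bf
    return rdd.keys().mapPartitions(per_partition).reduce(lambda a, b: a.union(b))

sc = SparkContext(appName="bloom-filtered-join")
country_rdd = sc.parallelize([(1, "US"), (2, "DE"), (3, "FR")])
income_rdd = sc.parallelize([(1, (1000, 1, 2019)), (4, (500, 2, 2019))])

left_filter = sc.broadcast(build_filter(country_rdd))
right_filter = sc.broadcast(build_filter(income_rdd))

# Each side keeps only keys that might exist on the other side,
# so far fewer rows reach the shuffle performed by join().
joined_rdd = (income_rdd.filter(lambda kv: kv[0] in left_filter.value)
              .join(country_rdd.filter(lambda kv: kv[0] in right_filter.value)))
print(joined_rdd.collect())   # [(1, ((1000, 1, 2019), 'US'))]

False positives let a few non-matching rows slip through the filters, but the join itself discards them, so the result is still exact.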

Not allowed to use Joiner transformation

I have an Expression transformation from which I am passing the data to two different transformations.
Later in the downstream of these parallel flows, I am trying to apply a Joiner transformation, but I am not allowed to do so.
Is the Joiner transformation not allowed in such a case, similar to a self-join?
What could be an alternative approach if I wanted to achieve such a transformation?
It would be great if somebody could help me sort out this issue.
You need to sort the data before the Joiner, and turn on 'sorted merge join' before connecting the second set of ports to the Joiner.
One voice of caution though: carefully consider the 'key' you join these data on. It should be a unique value across all records in at least one of the two data streams, otherwise you'll end up with a data explosion. I know this may sound very basic, but it is often forgotten in self-joins :)
The Joiner transformation will work. If the data comes from the same source table and is passed through different pipelines, use the SORTED INPUT option in the Joiner transformation.

Vertica query optimization

I want to optimize a query in a Vertica database. I have a table like this:
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions).
I fetch some data using this query:
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slowly. The query plan shows something like this:
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I run EXPLAIN on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan over some part of the table. In other databases I can create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have a concept of indexes. You would want to create a query-specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
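A sketch of the IN list rewritten as a join; the values and aliases come from the example query, and whether this wins depends on the plan Vertica chooses:

SELECT d.b, d.c
FROM data d
JOIN (SELECT 1 AS b UNION ALL SELECT 2 UNION ALL SELECT 3) keys
  ON d.b = keys.b
WHERE d.a = 1;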
Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data on a column which is frequently used in queries, such as transaction_date, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, since a load touches only one small ROS container instead of the table's entire storage. A sketch putting ordering and segmentation together follows below.
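A sketch of a query-specific projection for the example table above. The projection name and segmentation key are assumptions; run the query through Database Designer for a tuned design.

CREATE PROJECTION data_a_b
AS SELECT a, b, c FROM data
ORDER BY a, b
SEGMENTED BY HASH(a, b) ALL NODES;

SELECT START_REFRESH();            -- populate the new projection
SELECT ANALYZE_STATISTICS('data'); -- clears the NO STATISTICS note in the plan

Ordering by a, then b lets the WHERE a = 1 AND b IN (...) predicate read one narrow sorted range instead of scanning the whole table.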

How to decide when to use a Map-Side Join or Reduce-Side while writing an MR code in java?

A map-side join performs the join in the map phase, before any data reaches the reducers. It places strong prerequisites on the input data. Both methods have pros and cons: a map-side join is more efficient than a reduce-side join, but it requires strictly prepared input.
Prerequisites:
The data must be partitioned and sorted in a particular way.
Each input must be divided into the same number of partitions.
Each input must be sorted by the same key.
All the records for a particular key must reside in the same partition.
A reduce-side join, also called a repartitioned join or repartitioned sort-merge join, is the most commonly used join type.
It has to go through the sort-and-shuffle phase, which incurs network overhead. A reduce-side join uses a few terms: data source, tag, and group key. Let's get familiar with them.
Data source refers to the input files, probably extracted from an RDBMS.
A tag is attached to every record to mark its source name, so that the source can be identified at any given point in the map/reduce phases; why this is required is covered later.
The group key is the column used as the join key between the two data sources.
Since we are going to join this data on the reduce side, we must prepare it so it can be joined in the reduce phase; a sketch of the mapper and reducer follows below.
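A minimal Java sketch of a tagged reduce-side join. The class names, the comma-separated record layout, and the file-name-based tagging are assumptions for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (groupKey, taggedRecord); the tag encodes the source file.
class TaggingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String source = ((FileSplit) ctx.getInputSplit()).getPath().getName();
        String tag = source.startsWith("country") ? "C" : "I";
        String[] fields = value.toString().split(",", 2);   // fields[0] = join key
        ctx.write(new Text(fields[0]), new Text(tag + "|" + fields[1]));
    }
}

// Reducer: split each key's records by tag, then cross-join the two sides.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        List<String> countries = new ArrayList<>();
        List<String> incomes = new ArrayList<>();
        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("C|")) countries.add(s.substring(2));
            else incomes.add(s.substring(2));
        }
        for (String c : countries)
            for (String i : incomes)
                ctx.write(key, new Text(c + "," + i));
    }
}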
For more information check this link:
http://hadoopinterviews.com/map-side-join-reduce-side-join/
You would use a map-side join if one of your tables can fit in memory, which avoids the sort-and-shuffle overhead on that data (a sketch of this variant follows below).
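A sketch of that in-memory variant: the small table is shipped via the distributed cache and loaded into a HashMap in setup(). The file name, record layout, and the driver call in the comment are assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class MapSideJoinMapper extends Mapper<Object, Text, Text, Text> {
    private final Map<String, String> countryById = new HashMap<>();

    @Override
    protected void setup(Context ctx) throws IOException {
        // The driver is assumed to have shipped the small table with:
        //   job.addCacheFile(new URI("hdfs:///data/country.csv#country"));
        // The '#country' fragment makes it appear as a local file named "country".
        try (BufferedReader r = new BufferedReader(new FileReader("country"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] f = line.split(",", 2);      // layout assumed: id,country
                countryById.put(f[0], f[1]);
            }
        }
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",", 2);  // layout assumed: id,rest
        String country = countryById.get(f[0]);
        if (country != null) {                        // inner-join semantics
            ctx.write(new Text(f[0]), new Text(country + "," + f[1]));
        }
    }
}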
Reduce-side joins are simpler than map-side joins since the input datasets need not be structured. But they are less efficient, as both datasets have to go through the MapReduce shuffle phase, where the records with the same key are brought together in the reducer.

In Oracle, what is the difference between a hash join and a sort-merge join?

In Oracle, I can use the hints USE_HASH or USE_MERGE to instruct the optimizer to do a hash join or a sort-merge join. How are those types of joins different and when/why should I use one or the other?
Jonathan Lewis posted a really good explanation of how hash joins and merge joins work:
Hash Joins - http://jonathanlewis.wordpress.com/2010/08/10/joins-hj/
Merge Joins - http://jonathanlewis.wordpress.com/2010/08/15/joins-mj/
and for good measure...
Nested Loop Joins - http://jonathanlewis.wordpress.com/2010/08/09/joins-nlj/
"when/why should I use one or the other"
Generally you shouldn't worry about it. That's what the Oracle optimizer is for.
The use_hash hint requests a hash join against the specified tables. A hash join loads the rows from the left-hand table into an in-memory hash table and probes it with rows from the other table.
The use_merge hint requests a sort-merge join, which sorts both row sources on the join key and then merges the two sorted streams.
Because of the memory restrictions on hash joins, you generally want to use them only when the left-hand table is small.
Sort-merge joins are generally best for queries that produce very large result sets or for tables that do not possess indexes on the join keys.
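A sketch of both hints on the classic emp/dept demo tables (the table and alias names are assumptions; without a hint, the optimizer chooses the join method for you):

SELECT /*+ USE_HASH(e d) */ e.ename, d.dname
FROM emp e
JOIN dept d ON e.deptno = d.deptno;

SELECT /*+ USE_MERGE(e d) */ e.ename, d.dname
FROM emp e
JOIN dept d ON e.deptno = d.deptno;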
