Tableau Data Blend Performance / Level of Detail

I have two blended sources where the source A has 500,000 rows and source B has 20,000. The problem is that when I use source A as the primary source, any computation in the dashboard takes far too long to be useful. When I use B as my primary source, performance is much improved...
...but, the level of detail I need is in source A. When source B is primary I am left with the dreadful asterisk where there is a one-to-many relationship.
Source A primary:

Event (from source B)   Occurred_On (from source A)
ABC                     1/1/2000
ABC                     5/10/2000
XYZ                     9/9/2002
XYZ                     4/5/2002

Source B primary:

Event (from source B)   Occurred_On (from source A)
ABC                     *
XYZ                     *
The data must be blended: source A is a database and source B is a text file, so a join is out of the question.
Patience is waning and all hope seems to be lost. Does there exist any possible way to use B as the primary while maintaining the level of detail from a field in A?
Or any other workaround that could solve this?

I would go with the options below:
- Cross-database joins: http://www.tableau.com/about/blog/2016/7/integrate-your-data-cross-database-joins-56724
- Create a lookup table from the text file and apply a join (a sketch follows below).
- To improve performance, create a data extract file of the combined dataset, which will be the primary data source for the report: http://onlinehelp.tableau.com/current/pro/desktop/en-us/help.htm#extracting_data.html
A 500K-row source alone should not cause this kind of performance drag, so it would also be a good idea to recheck the server configuration and see if there are any bottlenecks.
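For the lookup-table route, here is a minimal sketch of what the join could look like if the 20K-row text file were staged as a lookup table next to source A (the table and column names below are assumptions for illustration, not from the original question):
-- Hypothetical: the text file (source B) staged as event_lookup, joined to the
-- 500K-row table (source A) so Tableau sees a single source that can be extracted.
SELECT
    a.event_id,
    a.occurred_on,      -- the detail-level field needed from source A
    b.event_name        -- the attribute coming from source B
FROM source_a AS a
INNER JOIN event_lookup AS b
    ON b.event_id = a.event_id;
With the combined result materialized as an extract, the blend (and the asterisk problem) goes away entirely.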

Related

SSIS Google Analytics - Reporting dimensions & metrics are incompatible

I'm trying to extract data from Google Analytics, using KingswaySoft - SSIS Integration Toolkit, in Visual Studio.
I've set the metrics and dimensions, but I get this error message:
Please remove transactions to make the request compatible. The request's dimensions & metrics are incompatible. To learn more, see https://ga-dev-tools.web.app/ga4/dimensions-metrics-explorer/
I've tried removing the transactions metric and it works, but this metric is really necessary.
Metrics: sessionConversionRate, sessions, totalUsers, transactions
Dimensions: campaignName, country, dateHour, deviceCategory, sourceMedium
Any idea on how to solve it?
I'm not sure how helpful this suggestion is, but could a possible workaround be to use two queries?
Query 1: Existing query without transactions
Query 2: The same dimensions with transactionId included
The idea would be to use the SSIS Aggregate component to group by the original dimensions and count the transactions. You could then merge the queries together via a merge join.
Would that work?
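To make the idea concrete, here is a rough SQL equivalent of what that Aggregate step would produce (the dimension names come from the question; Query2_output is a hypothetical staging of the Query 2 rows):
-- One row per original dimension combination with a transaction count,
-- ready to be merge-joined back onto the Query 1 rows.
SELECT
    campaignName, country, dateHour, deviceCategory, sourceMedium,
    COUNT(DISTINCT transactionId) AS transactions
FROM Query2_output   -- hypothetical staging of the Query 2 result
GROUP BY
    campaignName, country, dateHour, deviceCategory, sourceMedium;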
The API supports what it supports. So if you've attempted to pair things that are incompatible, you won't get any data back. Things that seem like they should totally work go together like orange juice and milk.
While I worked on the GA side of things through Python, an approach we found that helped us work through incompatible metrics and totals was to make multiple pulls using the same dimensions. Since the data sets are at the same level of grain, as long as you match up each dimension in the set, you can have all the metrics you want.
In your case, I'd have 2 data flows, followed by an Execute SQL Task that brings the data together for the final table:
DFT1: Query1 -> Derived Column -> Stage.Table1
DFT2: Query2 -> Derived Column -> Stage.Table2
Execute SQL Task
SELECT
    T1.*, T2.Metric_A, T2.Metric_B, ... T2.Metric_Z
INTO
    #T
FROM
    Stage.Table1 AS T1
    INNER JOIN
        Stage.Table2 AS T2
        ON T2.Dim1 = T1.Dim1 /* etc */ AND T2.Dim7 = T1.Dim7;

-- Update rows for which we now have solid data, i.e.
-- isDataGolden exists in the "data" section of the response
-- (usually within ~7 days but possibly sooner).
UPDATE
    X
SET
    metric1 = T.metric1 /* etc */
FROM
    dbo.X AS X
    INNER JOIN #T AS T
        ON T.Dim1 = X.Dim1
WHERE
    X.isDataGolden IS NULL
    AND T.isDataGolden IS NOT NULL;

-- Add new data, but be aware that not all of it may have
-- reported in yet.
INSERT INTO
    dbo.X
SELECT
    *
FROM
    #T AS T
WHERE
    NOT EXISTS (SELECT * FROM dbo.X AS X WHERE X.Dim1 = T.Dim1 /* etc */);

HIVE/PIG JOIN Based on SUBSTRING match

I have a requirement where I need to JOIN a tweets table with person names, like filtering the tweets if it contains any person name. I have following data:
Tweets Table: (70 million records stored as a HIVE table)

id   tweet
1    Cristiano Ronaldo greatest of all time
2    Brad Pitt movies
3    Random tweet without any person name
Person Names: (1.6 million names stored on HDFS as a .tsv file)

id   person_name
1    Cristiano Ronaldo
2    Brad Pitt
3    Angelina Jolie
Expected Result:

id   tweet                                     person_name
1    Cristiano Ronaldo greatest of all time    Cristiano Ronaldo
2    Brad Pitt movies                          Brad Pitt
What I've tried so far:
I have converted the person names .tsv file to a HIVE table as well, and then tried to join the two tables with the following HIVE query:
SELECT * FROM tweets t INNER JOIN people p WHERE instr(t.tweet, p.person_name) > 0;
I tried it with some sample data and it works fine. But when I try to run it on the entire data (70m tweets joined with 1.6m person names), it takes forever. It definitely doesn't look very efficient.
I wanted to try the JOIN with PIG as well (as it is considered a little more efficient than a HIVE JOIN), where I can directly JOIN the person names .tsv file with the tweets HIVE table, but I'm not sure how to JOIN based on a substring in PIG.
Can someone please share the PIG JOIN syntax for this problem, if you have any idea? Also, please suggest any alternatives that I can use.
The idea is to create buckets so that we don't have to compare so many records. We are going to increase the number of records/joins so that multiple nodes can do the work, instead of one large cross join driven by WHERE instr(t.tweet, p.person_name) > 0.
- Split the tweets into individual words. Yes, this multiplies your record count way up.
- Filter out 'stopwords' or some other list of words that fits in memory.
- Split the names into (first names) and "last name".
- Join tweets and names on "lastname" and instr(t.tweet, p.person_name). This should significantly reduce the size of the data that you compare via a function, so it will run faster (see the sketch below).
If you are going to do this regularly, consider creating tables with sort/bucket to really make things sizzle. (Make it faster, as it can hopefully be Sort Merge Join ready.)
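As a rough HiveQL sketch of that approach (using the tweets and people tables from the question; the word-splitting and last-name logic here is illustrative, not tested):
-- Explode each tweet into words, take the last token of each person name,
-- join on that token first, and only then apply the expensive instr() check.
WITH name_parts AS (
    SELECT id, person_name,
           regexp_extract(person_name, '([^ ]+)$', 1) AS last_name
    FROM people
),
tweet_words AS (
    SELECT t.id, t.tweet, w.word
    FROM tweets t
    LATERAL VIEW explode(split(t.tweet, ' ')) w AS word
    -- a stopword filter would go here to cut the exploded row count
)
SELECT DISTINCT tw.id, tw.tweet, np.person_name
FROM tweet_words tw
JOIN name_parts np
    ON np.last_name = tw.word
WHERE instr(tw.tweet, np.person_name) > 0;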
It is worth trying a Map-Join.
The person table is the small one, and a join with it can be converted to a Map-Join operator if it fits into memory; the table will be loaded into each mapper's memory.
Check the EXPLAIN output. If it says that the Common Join operator is on a Reducer vertex, then try to increase the mapper container memory and adjust the map-join settings to convert it to a Map-Join.
Settings responsible for the Map-Join (supposing the People table is < 2.5 GB):
Try to bump the map-join table size to 2.5 GB (check the actual size) and run EXPLAIN again.
set hive.auto.convert.join=true; --this enables map-join
set hive.auto.convert.join.noconditionaltask = true;
set hive.mapjoin.smalltable.filesize=2500000000; --size of table to fit in memory
set hive.auto.convert.join.noconditionaltask.size=2500000000;
Also, the container size should be increased to avoid OOM errors (if you are on Tez):
set hive.tez.container.size=8192; --container size in megabytes
set hive.tez.java.opts=-Xmx6144m; --set this to ~80% of hive.tez.container.size
These figures are just an example. Try to adjust them and check the EXPLAIN again; if it shows the Map-Join operator, then run the query again, it should be much faster.

Spring: weird onetomany mapping with relational table for each relation

I'm trying to build a number of relations the way I've been told to. I have 6 tables... let's call them A, B, C, D, E and F.
The relation between them is always 1:N.
I've been asked to map those tables in Spring/JPA by creating a new relational table each time like so:
A + B -> AB
AB + C -> ABC
ABC + D -> ABCD
ABCD + E -> ABCDE
ABCDE + F -> ABCDEF
...where AB, ABC, ABCD, ABCDE and ABCDEF are the new relational tables I have to create.
It feels so weird to me to map tables like this, even more so when the relation between them is not N:N but 1:N. Also, I don't see the purpose of doing that, and I came here to see if you could help me with both issues: understanding why this would make sense, and how to achieve it.
I've tried myself for 2 days, mapping the tables as if they were N:N, and I always get an error like "Caused by: org.hibernate.MappingException: Foreign key (FKsxjpculqrp0noj2x8cetijcof:CEV_ambito [id_amb])) must have same number of columns as the referenced primary key (CEV_ambito [FK_tea_amb,id_amb])"
Please, any help or indications on how to do this properly would be really appreciated. Thank you all.
This is indeed a weird requirement (it might make more sense if we knew about the data inside those tables), but anyway: to get 1:N you should use a foreign key inside the next table, so B has a foreign key to A, C has a foreign key pointing at B, and so forth (a small sketch follows below).
Hibernate (JPA) by default will use a separate intermediary table for the mapping, which makes it look like many-to-many, but you can customize this behavior with @JoinColumn as shown here:
https://www.baeldung.com/jpa-join-column#oneToMany_mapping
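As a small sketch of what that plain 1:N design means at the table level (using the question's A/B placeholders; the column names are made up):
-- The "many" side simply carries a foreign key to its parent;
-- no AB/ABC relational tables are needed for a 1:N relation.
CREATE TABLE A (
    id BIGINT PRIMARY KEY
);

CREATE TABLE B (
    id   BIGINT PRIMARY KEY,
    a_id BIGINT NOT NULL,    -- each B row belongs to exactly one A
    CONSTRAINT fk_b_a FOREIGN KEY (a_id) REFERENCES A (id)
);

-- C would reference B the same way, and so on down the chain.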

Global filters for different data sources (with common tables)

I am currently working in Tableau with 2 data sources, each using a join of 2 tables (named A, B and C):
Data source 1: A-B
Data source 2: A-C
Basically, A contains the major information that I need, and then I join data from B and C to get the extra information I need for each report I am doing.
I then build a dashboard that contains reports using data sources 1 and 2.
My problem now is that I am filtering this dashboard using a dimension in A, and I would like the filter to apply to all worksheets (i.e. both those using data source 1 and those using data source 2).
I thought that because A is the common table in both data sources, using a dimension from A would be enough to filter everything, but it seems that this is not the case.
Is there a way to fix this?
I read some forums about creating a parameter. However, the filtering I am doing is basically as follows: I want my users to choose 1 shop name. They can find it either by:
- Typing the name in the 'Shop name' quick filter,
- Using a combination of the 'Region' and 'Country' quick filters to then get a 'Shop Name' drop-down with a reduced number of shop names (easier when the user knows where the shop is but does not remember its exact name).
Using a parameter would not allow me to do this anymore, since all of this is based on 'filtering the relevant values'.
Does anyone have any recommendations?

Having more than 50 columns in a SQL table

I have designed my database in such a way that one of my tables contains 52 columns. All the attributes are tightly associated with the primary key attribute, so there is no scope for further normalization.
Please let me know: if the same kind of situation arises and you don't want to keep so many columns in a single table, what other options are there?
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is ensuring that most columns have valid default values (null, today, etc.). That will simplify inserts.
Also ensure your code always specifies the columns (i.e. no "select *"). Any kind of future optimization will include indexes with a subset of the columns.
One approach we used once is to split your table into two tables. Both of these tables get the primary key of the original table. In the first table you put your most frequently used columns, and in the second table the lesser-used columns; generally the first one should be smaller. You can now speed things up in the first table with various indexes. In our design we even had the first table running on the memory engine (RAM), since we only had read queries. If you need a combination of columns from table1 and table2, you join both tables on the primary key (a rough sketch follows below).
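A rough sketch of that split, with made-up table and column names, both tables sharing the original primary key:
-- Frequently used columns in the "hot" table, the rest in the "cold" table.
CREATE TABLE main_hot (
    id     INT PRIMARY KEY,
    status VARCHAR(20),
    total  DECIMAL(10,2)
    -- ... other frequently queried columns ...
);

CREATE TABLE main_cold (
    id         INT PRIMARY KEY REFERENCES main_hot (id),
    notes      VARCHAR(4000),
    legacy_ref VARCHAR(50)
    -- ... other rarely queried columns ...
);

-- When columns from both halves are needed, join on the shared key:
SELECT h.*, c.notes
FROM main_hot  AS h
JOIN main_cold AS c ON c.id = h.id;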
A table with fifty-two columns is not necessarily wrong. As others have pointed out many databases have such beasts. However I would not consider ERP systems as exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this:
"All the attributes are tightly associated with the primary key
attribute"
Which means that your table is in third normal form (or perhaps BCNF). That being the case it's not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns are dependent on the primary key, but there may also be dependencies between columns: e.g. there are multiple values of COL42 associated with each value of COL23. Join dependencies mean that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
I admit not many people go as far as 5NF. And it might well be that even with fifty-two columns your table is already in 5NF. But it's worth checking, because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
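As a tiny illustration of "breaking out a subsidiary table", using the hypothetical COL23/COL42 names from above:
-- If several COL42 values belong to each COL23 value, move that pair into its
-- own table instead of repeating rows in the wide table.
CREATE TABLE main_table (
    pk    INT PRIMARY KEY,
    col23 INT
    -- ... the remaining columns that are single-valued per pk ...
);

CREATE TABLE col23_col42 (
    col23 INT,
    col42 INT,
    PRIMARY KEY (col23, col42)
);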
Another option is the "item-result pair" (IRP) design over the "multi-column table" (MCT) design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL  ITEM  VALUE
-------  ----  -----
1        NAME  Joe
1        AGE   44
1        WGT   202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design, and it can perform well even for massive data. In fact it kind of behaves like a column-organized DB, since you only pull in the rows you need (less I/O) rather than an entire wide row when you only need a few columns (more I/O).
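For what it's worth, here is a minimal sketch (using the hypothetical item names from the sample above) of reading the IRP design back as columns; each needed item becomes one conditional aggregate, so you only touch the items you actually ask for:
-- Pivot the item/value pairs for NAME and AGE back into one row per key.
SELECT
    key_col,
    MAX(CASE WHEN item = 'NAME' THEN value END) AS name,
    MAX(CASE WHEN item = 'AGE'  THEN value END) AS age
FROM IRP_TABLE
WHERE item IN ('NAME', 'AGE')
GROUP BY key_col;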
