HIVE/PIG JOIN Based on SUBSTRING match - hadoop

I have a requirement where I need to JOIN a tweets table with person names, essentially filtering the tweets that contain a person's name. I have the following data:
Tweets Table: (70 million records stored as a HIVE Table)
id | tweet
1  | Cristiano Ronaldo greatest of all time
2  | Brad Pitt movies
3  | Random tweet without any person name
Person Names: (1.6 million names stored on HDFS as .tsv file)
id | person_name
1  | Cristiano Ronaldo
2  | Brad Pitt
3  | Angelina Jolie
Expected Result:
id | tweet                                  | person_name
1  | Cristiano Ronaldo greatest of all time | Cristiano Ronaldo
2  | Brad Pitt movies                       | Brad Pitt
What I've tried so far:
I converted the person names .tsv file to a HIVE table as well and then tried to join the 2 tables with the following HIVE query:
SELECT * FROM tweets t INNER JOIN people p WHERE instr(t.tweet, p.person_name) > 0;
I tried it with some sample data and it works fine, but when I run it on the entire data (70m tweets JOINed with 1.6m person names), it takes forever. It definitely doesn't look very efficient.
I wanted to try the JOIN with PIG as well (as it is considered a little more efficient than a HIVE JOIN), where I can directly JOIN the person names .tsv file with the tweets HIVE table, but I'm not sure how to JOIN based on a substring in PIG.
Can someone please share the PIG JOIN syntax for this problem, if you have any idea? Also, please suggest any alternatives that I can use.

The idea is to create buckets so that we don't have to compare so many records. Instead of one large cross join filtered by WHERE instr(t.tweet, p.person_name) > 0, we increase the number of records/joins so that multiple nodes share the work.
I'd suggest:
1. Splitting the tweets into individual words (yes, multiplying your record count way up).
2. Filtering out 'stopwords' or some other list of words that fits in memory.
3. Splitting the names into "first name(s)" and "last name".
4. Joining tweets and names on "last name" and instr(t.tweet, p.person_name).
This should significantly reduce the amount of data that you compare via a function, so it will run faster.
If you are going to do this regularly, consider creating sorted/bucketed tables to really make things sizzle (faster still, as the join can hopefully become a Sort Merge Join).
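A rough HiveQL sketch of this approach (table and column names follow the question; the whitespace tokenization, the tiny stopword list and taking the last word as the "last name" are all assumptions to adapt):
-- 1. Explode tweets into individual words, dropping stopwords.
CREATE TABLE tweet_words AS
SELECT t.id, t.tweet, w.word
FROM tweets t
LATERAL VIEW explode(split(t.tweet, '\\s+')) w AS word
WHERE w.word NOT IN ('the', 'of', 'all', 'time', 'movies');  -- replace with a real stopword list

-- 2. Extract the last name from each person name.
CREATE TABLE people_lastname AS
SELECT id, person_name, regexp_extract(person_name, '(\\S+)$', 1) AS last_name
FROM people;

-- 3. Join on last name first, then confirm the full-name match.
SELECT DISTINCT tw.id, tw.tweet, p.person_name
FROM tweet_words tw
JOIN people_lastname p ON (tw.word = p.last_name)
WHERE instr(tw.tweet, p.person_name) > 0;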

It is worth trying a Map Join.
The Person table is the small one, and the join with it can be converted to a Map Join operator if the table fits into memory; the table will then be loaded into each mapper's memory.
Check the EXPLAIN output. If it says that a Common Join operator is on a Reducer vertex, then try to increase the mapper container memory and adjust the map-join settings to convert it to a Map Join.
Settings responsible for Map Join (supposing the People table is < 2.5 GB).
Try bumping the map-join table size up to 2.5 GB (check the actual size) and run EXPLAIN again:
set hive.auto.convert.join=true; --this enables map-join
set hive.auto.convert.join.noconditionaltask = true;
set hive.mapjoin.smalltable.filesize=2500000000; --size of table to fit in memory
set hive.auto.convert.join.noconditionaltask.size=2500000000;
Also container size should be increased to avoid OOM (if you are on Tez):
set hive.tez.container.size=8192; --container size in megabytes
set hive.tez.java.opts=-Xmx6144m; --set this 80% of hive.tez.container.size
These figures are just an example; try to adjust them and check the EXPLAIN again. If it shows a Map Join operator, then run the query again; it should be much faster.
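For example, you could re-run EXPLAIN on the query from the question after applying the settings above and look for a Map Join Operator (rather than a Common Join on a reducer) in the plan:
EXPLAIN
SELECT * FROM tweets t INNER JOIN people p
WHERE instr(t.tweet, p.person_name) > 0;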

Related

Scan on DynamoDB table or Query on secondary global index or a local index (What's the best solution)

I have an AWS DynamoDB table called "Users", whose hash key/primary key is "UserID", which consists of emails. It has two attributes, the first called "Daily Points" and the second "TimeSpendInTheApp". Now I need to run a query or scan on the table that will give me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day by a cron AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost is more important than speed or efficiency. Maintaining a global secondary index or a local index on points can be a costly operation, as I have to assign read and write units for those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records and on average it will have 50,000 records. What are my best options? Please suggest.
I am thinking my first option is to scan the whole table with a Filter Expression for records above a certain number of points (5000 for example). If this scan finds 50 or more records, simply sort the values and take the top 50. If the scan returns no results or very few, reduce the Filter Expression value (3000 for example) and run the same scan again. If a Filter Expression value (2500 for example) returns too many records, like 5000 or more, raise the value again. Is this even possible? I guess it would also need to handle pagination. Is it advisable to scan a table which has 50,000 records?
Any advice or suggestion will be helpful. Thanks in advance.
Firstly, creating indexes for the above use case doesn't simplify the process, as they don't provide a solution for aggregation or sorting.
I would export the data to HIVE and run the queries there rather than writing code to determine the result, especially as it is a batch executed only once per day.
Something like below:-
Create Hive table:-
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
Queries:-
SELECT dailyPoints, userId from hive_users sort by dailyPoints desc;
SELECT timeSpendInTheApp, userId from hive_users sort by timeSpendInTheApp desc;
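Note that Hive's sort by only orders rows within each reducer. Since the question asks for the top 50, a global order by with a limit is probably what is wanted, for example (same table as above):
SELECT userId, dailyPoints FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT userId, timeSpendInTheApp FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;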
Hive Reference

Unreliable results of a query

I need to get sales figures from open orders, sorted by code. The items are separated in the stock table by lot number (for traceability reasons) but the lot numbers do not appear in the orders table. The only link between the 2 tables is the part number.
When my query
SELECT code, SUM(qty*price) AS Sales
FROM orders INNER JOIN stock ON orders.partno = stock.partno
GROUP BY code
started returning strange results (very high sales figures for a given code), I changed it to
SELECT DISTINCT orders.partno, stock.lot, stock.code
FROM orders INNER JOIN stock ON orders.partno = stock.partno
and noticed that if several lots of a given part are in stock they are all returned
Part1 LotA code
Part1 LotB code
Part1 LotC code
which means that if a customer orders 300 units of Part1, my query returns 900 and my sales figure is multiplied by 3.
How can I work around that?
It must be noted that I do not work from a database but from a group of tables, the structures of which can sometimes be whimsical.
You should really use table.column or alias.column references when writing queries. As your question stands, we do not know which table the PRICE comes from... the parts table or the lots table. If you are dealing with inventory tracking such as FIFO or LIFO method accounting, you must have an association to the lot table for the inventory being tracked/sold.
Now, why are you getting large numbers? Because of a Cartesian result: for each record in one table, the join returns a row for every matching record in the other table.
So, if an order has one line item and there is only one matching row in a products-available table, that is a simple 1:1 ratio. But your STOCK table can have multiple records for the exact same part number, so the same original order line item is returned for EACH LOT ENTRY in the stock table. For your 1 item, you are getting 3 lots (a 1:3 result).
I know this is important from a cost-of-goods sold basis, hence your need to know which "lot" it is joined to so you only get that one specific record for proper pricing.
If however, you do have a generic product table of everything you sell, and that table has a generic common price no matter which "lot" was used for the sale, I would join to that table instead for your report. But you will still have the accounting issue of inventory, cost-of-goods, etc.
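One possible workaround, assuming the price and quantity live on the orders table and that every lot of a part carries the same code (a sketch, not a definitive fix), is to collapse the stock table to one row per part number before joining:
SELECT s.code, SUM(o.qty * o.price) AS Sales
FROM orders o
INNER JOIN (
    SELECT partno, MIN(code) AS code  -- one row per part number
    FROM stock
    GROUP BY partno
) s ON o.partno = s.partno
GROUP BY s.code;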

More efficient query to avoid OutOfMemoryError in Hive

I'm getting an exception in Hive:
java.lang.OutOfMemoryError: GC overhead limit exceeded.
In searching I've found that this happens when 98% of the process's CPU time is spent on garbage collection (whatever that means?). Is the core of my issue in my query? Should I write the query below in a different way to avoid this kind of problem?
I'm trying to count how many devices of a certain phone type have an active 'Use' in a given time period. Is there a way to write this logic differently so that it runs better?
select count(a.imei)
from
(Select distinct imei
from pingdata
where timestamp between TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
and ((SUBSTR(imei,12,2) = "04") or (SUBSTR(imei,12,2) = "05")) ) a
join
(SELECT distinct imei
FROM eventdata
where timestamp between TO_DATE("2016-06-01") AND TO_DATE("2016-07-17")
AND event = "Use" AND clientversion like '3.2%') b
on a.imei=b.imei
Thank you
Applying distinct to each dataset before joining them is safer, because joining on non-unique keys will duplicate data.
I would recommend partitioning your datasets by the to_date(timestamp) field (yyyy-MM-dd) so that partition pruning works with your where clause (check that it does). Also partition by the event field if the datasets are too big and contain a lot of data where event <> 'Use'.
It's important to know at which stage it fails; study the exception as well. If it fails on the mappers, then you should optimize your subqueries (add partitions as mentioned). If it fails on the reducer (the join), then try to improve the join, for example reduce the bytes per reducer: set hive.exec.reducers.bytes.per.reducer=67108864; or even less. If it fails on the writer (OrcWriter), then try adding a partition on a substring of imei to the output table and a distribute by substr(imei, ...) at the end of the query to reduce pressure on the reducers.
Or add one more column with low cardinality and an even distribution, to spread the data across more reducers evenly:
distribute by substr(imei...), col2
Make sure the partition column is in the distribute by. This will reduce the number of files written by each reducer and help to get rid of the OOM.
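A minimal sketch of the partitioning idea, assuming pingdata has at least the imei and timestamp columns (names and settings would need adjusting to the real schema):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- date-partitioned copy of pingdata so the BETWEEN filter can prune partitions
CREATE TABLE pingdata_part (imei string)
PARTITIONED BY (dt string);

INSERT OVERWRITE TABLE pingdata_part PARTITION (dt)
SELECT imei, to_date(`timestamp`) AS dt
FROM pingdata
DISTRIBUTE BY to_date(`timestamp`), substr(imei, 12, 2);  -- partition column plus a low-cardinality column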
In order to improve performance, looking at your query, I would partition the Hive tables by yyyy, mm, dd, or by the first two digits of imei. You will have to decide which, according to how you need to query these tables and the amount of data, but I would vote for yyyy, mm, dd, as that will give you a tremendous improvement in performance. See improving-query-performance-using-partitioning.
But for now, this should give you some improvements:
Select count(distinct(pd.imei))
from pingdata pd join eventdata ed on pd.imei=ed.imei
where
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and pd.timestamp=ed.timestamp
and SUBSTR(pd.imei,12,2) in ('04','05')
and ed.event = 'Use' AND ed.clientversion like '3.2%';
If the TO_DATE(timestamp) values are inserted on the same day, in other words if both date values are the same, then the pd.timestamp=ed.timestamp condition should be excluded:
Select count(distinct(pd.imei))
from pingdata pd join eventdata ed on pd.imei=ed.imei
where
TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
and SUBSTR(pd.imei,12,2) in ('04','05')
and ed.event = 'Use' AND ed.clientversion like '3.2%';
Try running both queries and compare results. Do let us know the differences and if you find this helpful.

Speeding up a postgres query (which works on 2 tables)

I am doing, in postgresql, something like this:
select A.first,
count(B.second) as count,
array_agg(A.second) as second,
array_agg(A.third) as third,
array_agg(B.kids) as kids
from A join B on A.first=B.second
group by A.first;
And it's taking forever (also because the tables are pretty big). Limiting the output to 10 rows and looking at EXPLAIN ANALYZE told me there's a huge nested loop which takes most of the time.
Is there any way I can rewrite this query (which I'll then use in CREATE TABLE AS to create a new table) to speed it up, while preserving the same output, which is what I want?
Thanks!
Ensure the column being used as a foreign key is indexed:
create index b_second on b(second);
Without such an index, every row of a would cause a table scan of b, which would make your query crawl.
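To confirm the index helps, you could re-check the plan and verify that the inner side of the nested loop becomes an index scan on b (or that the planner switches to a hash or merge join), for example:
EXPLAIN ANALYZE
SELECT A.first, count(B.second) AS count
FROM A JOIN B ON A.first = B.second
GROUP BY A.first;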

How to fill a Cassandra Column Family from another one's columns?

I have always read that Cassandra is good if your application changes frequently and features are added frequently.
That makes sense: since you don't have any fixed schema, you can add columns to rows as your needs require, instead of running an ALTER TABLE query which may freeze your database for hours on very large tables.
However, I have a hypothetical problem which I'm not able to solve.
Let's say I have:
CREATE COLUMN FAMILY Students
with comparator='CompositeType(UTF8Type,UTF8Type)',
and key_validation_class=UUIDType;
Each student has some generic columns (you know, meta:username, meta:password, meta:surname, etc.), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).
On the other side, I may have a Courses CF, where each row contains the UUIDs of all the students following that course.
So I can ask "which courses are followed by XXX" and "which students follow course YYY".
The problem is: what if I didn't create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn't a requirement.
This is a simple example, but I believe it's quite common. "With Cassandra you plan CFs in terms of queries instead of relationships". I need that query now, while at first it wasn't needed.
Given a table of students with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig or Hive? (I have never touched any of those, just guessing.)
Pig (which uses the Hadoop integration) is actually perfect for this type of work, because you can not only read but also write data back into Cassandra using CassandraStorage. It gives you the parallel processing capability to do the job with minimal time and overhead. Otherwise the alternative is to write something to do the extraction yourself, then write the new CF.
Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:
rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;
cass_group = GROUP avgs BY group;
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
If you use the existing Cassandra column family, you would have to unwind the data. Since NoSQL files are unidirectional, this could be a very time-consuming operation in Cassandra itself: the data would have to be sorted in the opposite order from the first column family. Frankly, I believe you would have to go back to the original data that was used to populate the first column family and populate this new one from that.
