Sequence Number UDF in Hive - hadoop

I have tried the UDFRowSequence UDF in Hive, but it is not generating unique values: the sequence repeats depending on the number of mappers.
Suppose I have one file (with 4 records) on HDFS. One mapper is created for the job and the result looks like
1
2
3
4
but when there are multiple (large) files at the HDFS location, multiple mappers are created for the job and each mapper generates its own repeating sequence, like below:
1
2
3
4
1
2
3
4
1
2
.
Is there any solution so that a unique number is generated for each record?

I think you are looking for ROW_NUMBER(). You can read about it and other "windowing" functions here.
Example:
SELECT *, ROW_NUMBER() OVER ()
FROM some_database.some_table

#GoBrewers14: Yes, I did try that. We used the ROW_NUMBER function, and on small data (e.g. a file containing 500 rows) it works perfectly. But on large data the query runs for a couple of hours and finally fails to produce output.
I have since come across the following explanation:
Generating a sequential order in a distributed query is not possible with simple UDFs, because the approach would require a centralised entity to keep track of the counter, which would also be severely inefficient for distributed queries and is therefore not recommended.

If you want to work with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo
It uses ZooKeeper as a central repository to maintain the sequence state.

A query to generate sequences. It can also be used to produce surrogate keys for a dimension table.
WITH TEMP AS
  (SELECT if(max(seq) IS NULL, 0, max(seq)) AS max_seq
   FROM seq_test)
SELECT col_id,
       col_val,
       row_number() over() + max_seq AS seq
FROM source_table
INNER JOIN TEMP ON 1 = 1;
seq_test: your target table.
source_table: your source table.
seq: surrogate key / sequence number / key column.
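For completeness, a minimal sketch of using the same pattern to append newly keyed rows into the target table; this is an assumption-laden illustration, not part of the original answer (it assumes seq_test has exactly the columns col_id, col_val, seq, and that your Hive version allows reading and writing the same table in one statement):

-- Sketch: load newly keyed rows into the target, continuing from its current max(seq).
INSERT INTO TABLE seq_test
SELECT s.col_id,
       s.col_val,
       row_number() over() + t.max_seq AS seq
FROM source_table s
INNER JOIN (SELECT if(max(seq) IS NULL, 0, max(seq)) AS max_seq
            FROM seq_test) t ON 1 = 1;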

Related

Hive + Tez :: A join query stuck at last 2 mappers for a long time

I have a views table joined with a temp table, with the parameters below intentionally enabled:
hive.auto.convert.join=true;
hive.execution.engine=tez;
The code snippet is:
CREATE TABLE STG_CONVERSION AS
SELECT CONV.CONVERSION_ID,
CONV.USER_ID,
TP.TIME,
CONV.TIME AS ACTIVITY_TIME,
TP.MULTI_DIM_ID,
CONV.CONV_TYPE_ID,
TP.SV1
FROM VIEWS TP
JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;
In the normal scenario, both tables can have any number of records.
However, in the SCU_TMP table, only 10-50 records are expected per user ID.
In some cases, though, a couple of user IDs come with around 10k-20k records in the SCU_TMP table, which creates a cross-product effect.
In such cases the job runs forever, stuck on a single mapper.
Is there any way to optimise this so it runs gracefully?
I was able to solve it with the query below.
set hive.exec.reducers.bytes.per.reducer=10000;
CREATE TABLE STG_CONVERSION AS
SELECT CONV.CONVERSION_ID,
CONV.USER_ID,
TP.TIME,
CONV.TIME AS ACTIVITY_TIME,
TP.MULTI_DIM_ID,
CONV.CONV_TYPE_ID,
TP.SV1
FROM (SELECT USER_ID, TIME, MULTI_DIM_ID, SV1 FROM VIEWS SORT BY TIME) TP
JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;
The problem arises because, when a single user ID dominates the table, the join for that user is processed by a single mapper, which gets stuck.
Two modifications fixed it:
1) Replaced the table name with a subquery, which added a sorting step before the join.
2) Reduced the hive.exec.reducers.bytes.per.reducer parameter to 10 KB.
The SORT BY TIME in step (1) added a shuffle phase that evenly distributed the data, which had previously been skewed by user ID.
Reducing the bytes-per-reducer parameter spread the data across all available reducers.
With these two changes, a 10-12 hour run came down to 45 minutes.
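As a more general alternative for skewed join keys, Hive's built-in skew-join handling may also be worth trying; a minimal sketch of the relevant settings (defaults vary by version, and the key threshold should be tuned to your data):

-- Let Hive detect heavily repeated join keys at runtime and process them
-- in a follow-up map-join stage instead of a single overloaded reducer.
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;   -- rows per join key above which a key is treated as skewed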

How much data is considered "too large" for a Hive MAPJOIN job?

EDIT: added more file size details, and some other session information.
I have a seemingly straightforward Hive JOIN query that surprisingly requires several hours to run.
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key
WHERE a.keyPart BETWEEN b.startKeyPart AND B.endKeyPart;
I'm trying to determine if the execution time is normal for my dataset and AWS hardware selection, or if I am simply trying to JOIN too much data.
Table A: ~2.2 million rows, 12MB compressed, 81MB raw, 4 files.
Table B: ~245 thousand rows, 6.7MB compressed, 14MB raw, one file.
AWS: emr-4.3.0, running on about 5 m3.2xlarge EC2 instances.
Each record from A always matches one or more records in B, so logically at most ~500 billion rows are generated before they are pruned by the WHERE clause.
4 mappers are allocated for the job, which completes in 6 hours. Is this normal for this type of query and configuration? If not, what should I do to improve it?
I've partitioned B on the JOIN key, which yields 5 partitions, but haven't noticed a significant improvement.
Also, the logs show that the Hive optimizer starts a local map join task, presumably to cache or stream the smaller table:
2016-02-07 02:14:13 Starting to launch local task to process map join; maximum memory = 932184064
2016-02-07 02:14:16 Dump the side-table for tag: 1 with group count: 5 into file: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2016-02-07 02:14:17 Uploaded 1 File to: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (12059634 bytes)
2016-02-07 02:14:17 End of local task; Time Taken: 3.71 sec.
What is causing this job to run slowly? The data set doesn't appear too large, and the "small-table" size is well under the 25 MB small-table limit above which Hive disables the MAPJOIN optimization.
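For reference, these are the session properties that govern that behaviour; a sketch only, with commonly cited defaults that may differ across Hive versions:

SET hive.auto.convert.join=true;                 -- allow automatic conversion of joins to map joins
SET hive.mapjoin.smalltable.filesize=25000000;   -- ~25 MB threshold for the "small" table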
A dump of the EXPLAIN output is copied on PasteBin for reference.
My session enables compression for output and intermediate storage. Could this be the culprit?
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
SET io.seqfile.compression.type=BLOCK;
My solution to this problem is to express the JOIN predicate entirely within the JOIN ON clause, as this is the most efficient way to execute a JOIN in Hive. As for why the original query was slow, I believe that the mappers just need time when scanning the intermediate data set row by row, 100+ billion times.
Because Hive only supports equality expressions in the JOIN ON clause and rejects function calls that use both table aliases as parameters, there is no way to rewrite the original query's BETWEEN clause as an algebraic expression there. For example, the following expression is illegal:
-- Only handles exclusive BETWEEN
JOIN b ON a.key = b.key
AND sign(a.keyPart - b.startKeyPart) = 1.0 -- keyPart > startKeyPart
AND sign(a.keyPart - b.endKeyPart) = -1.0 -- keyPart < endKeyPart
I ultimately modified my source data to include every value between startKeyPart and endKeyPart in a Hive ARRAY<BIGINT> data type.
CREATE TABLE LookupTable (
  key BIGINT,
  startKeyPart BIGINT,
  endKeyPart BIGINT,
  keyParts ARRAY<BIGINT>
);
Alternatively, I could have generated this array inline within my queries using a custom Java method; however, LongStream.rangeClosed() is only available in Java 8, which is not part of Hive 1.0.0 in AWS emr-4.3.0.
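As an aside, the range can also be expanded at query time in pure HiveQL without materialising the array at all; a minimal sketch, assuming Hive 0.13+ for posexplode() (space(n) produces n blanks, splitting on ' ' yields n+1 empty strings, and posexplode() turns their positions 0..n into rows):

SELECT key,
       startKeyPart + pos AS keyPart,
       value   -- assumes the same value column used in the rewritten query below
FROM LookupTable
LATERAL VIEW posexplode(
    split(space(cast(endKeyPart - startKeyPart AS int)), ' ')
) r AS pos, blank;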
Now that I have the entire key space in an array, I can transform the array to a table using LATERAL VIEW and explode(), rewriting the JOIN as follows.
WITH b AS
(
SELECT key, keyPart, value
FROM LookupTable
LATERAL VIEW explode(keyParts) keyPartsTable AS keyPart
)
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key AND a.keyPart = b.keyPart;
The end result is that the above query takes approximately 3 minutes to complete, when compared with the original 6 hours on the same hardware configuration.

How to speed up the creation of a table from table partitioned by date?

I have a table with a huge amount of data. It is partitioned by week. This table contains a column named group. Each group could have multiple records of weeks. For example:
gr week data
1 1 10
1 2 13
1 3 5
. . 6
2 2 14
2 3 55
. . .
I want to create a table based on one group. The creation currently takes ~23 minutes on Oracle 11g. That is a long time, since I have to repeat the process for each group and I have many groups. What is the fastest way to create these tables?
Create all the tables, then populate them in a single pass with INSERT ALL ... WHEN:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9014.htm#i2145081
The data will be read only once.
insert all
when gr=1 then
into tab1 values (gr, week, data)
when gr=2 then
into tab2 values (gr, week, data)
when gr=3 then
into tab3 values (gr, week, data)
select *
from big_table
The best speed-up would come from not copying the data out on a per-group basis at all and instead processing it week by week, but you don't say what you ultimately want to achieve, so it is hard to comment further (that approach may of course be difficult or impracticable, but you should at least consider it).
So here are some hints for extracting the group data:
remove all indexes, as they only consume space; all you need is one large FULL TABLE SCAN
check the available space and size of each group; maybe you can process several groups in one pass
use a parallel query, for example:
create table tmp as
select /*+ parallel(4) */ * from BIG_TABLE
where group_id in (..list of groupIds..);
Please note that parallel mode must be enabled in the database; ask your DBA if you are unsure. The point is that the large FULL TABLE SCAN is performed by several sub-processes (here 4), which may (depending on your system) cut the elapsed time.

Forcing a reduce phase or a second map reduce job in hive

I am running a hive query of the following form:
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT /*+ MAPJOIN(...) */ * FROM ...
Because of the MAPJOIN, the result does not require a reduce phase. The map phase uses about 5000 mappers, and it ends up taking about 50 minutes to complete the job. It turns out that most of this time is spent copying those 5000 files to the local directory.
To try to optimize this, I replaced SELECT * ... with SELECT DISTINCT * ... (I know in advance that my results are already distinct, so this doesn't actually change my result), in order to force a second map reduce job. The first map reduce job is the same as before, with 5000 mappers and 0 reducers. The second map reduce job now has 5000 mappers and 3 reducers. With this change, there are now only 3 files to be copied, rather than 5000, and the query now only takes a total of about 20 minutes.
Since I don't actually need the DISTINCT, I'd like to know whether my query can be optimized in a less kludge-y way, without using DISTINCT?
What about wrapping your query in another SELECT, plus a trivial WHERE clause, to make sure it kicks off another job?
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT *
FROM (
SELECT /*+ MAPJOIN(...) */ *
FROM ..
) x
WHERE 1 = 1
I'll run this when I get a chance tomorrow and delete this part of the answer if it doesn't work. If you get to it before me then great.
Another option would be to take advantage of the virtual columns for file name and block offset to force distinct results. This complicates the query and introduces two meaningless columns, but has the advantage that you no longer have to know in advance that your results will be distinct. If you can't abide the useless columns, wrap it in another SELECT to remove them.
INSERT OVERWRITE LOCAL DIRECTORY ...
SELECT {{enumerate every column except the virtual columns}}
FROM (
SELECT DISTINCT /*+ MAPJOIN(...) */ *, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM ..
) x
Both solutions are more kludge-y than what you came up with, but have the advantage that you are not limited to queries with distinct results.
We get another option if you aren't limited to Hive. You could drop the LOCAL and write the results to HDFS, which should be fast even with 5000 mappers. Then use hadoop fs -getmerge /result/dir/on/hdfs/ to pull the results into the local filesystem. This unfortunately reaches outside Hive, but maybe setting up a two-step Oozie workflow is acceptable for your use case.
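A minimal sketch of that last option; the HDFS output path is purely illustrative, and the MAPJOIN hint and source query stand in for the original ones:

-- Write to HDFS instead of the local filesystem, then merge locally with:
--   hadoop fs -getmerge /tmp/query_out ./result.txt
INSERT OVERWRITE DIRECTORY '/tmp/query_out'
SELECT /*+ MAPJOIN(...) */ * FROM ...;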

generating unique ids in hive

I have been trying to generate unique ids for each row of a table (30 million+ rows).
Using sequential numbers obviously does not work due to the parallel nature of Hadoop.
The built-in UDFs rand() and hash(rand(), unixtime()) seem to generate collisions.
There has to be a simple way to generate row IDs, and I was wondering if anyone has a solution.
My next step is to create a Java MapReduce job that generates a real hash string seeded with SecureRandom + host IP + current time, but I figured I'd ask here before doing that ;)
Use the reflect UDF to generate UUIDs.
reflect("java.util.UUID", "randomUUID")
Update (2019)
For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values that are far more performant than UUID strings. Documentation is still a bit sparse, but here is one example:
create table customer (
id bigint default surrogate_key(),
name string,
city string,
primary key (id) disable novalidate
);
To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:
-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;
Not sure if this is all that helpful, but here goes...
Consider the native MapReduce analog: assuming your input data set is text based, the input Mapper's key (and hence unique ID) would be, for each line, the name of the file plus its byte offset.
When you are loading the data into Hive, if you can create an extra 'column' that has this info, you get your rowID for free. It's semantically meaningless, but so too is the approach you mention above.
Elaborating on the answer by jtravaglini:
there are two built-in Hive virtual columns (since 0.8.0) that can be used to generate a unique identifier:
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
Use like this:
select
concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey,
...
;
...
OK
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342
...
Or you can anonymize that with md5 or similar, here's a link to md5 UDF:
https://gist.github.com/dataminelab/1050002
(note the function class name is initcap 'Md5')
select
Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey,
...
Use the ROW_NUMBER function to generate monotonically increasing integer IDs.
select ROW_NUMBER() OVER () AS id from t1;
See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html
reflect("java.util.UUID", "randomUUID")
I could not vote up the other one. I needed a pure binary version, so I used this:
unhex(regexp_replace(reflect('java.util.UUID','randomUUID'), '-', ''))
Depending on the nature of your jobs and how frequently you plan on running them, using sequential numbers may actually be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.
Write a custom Mapper that keeps a counter for every map task and, for each row, creates as the row ID the concatenation of the JobID (as obtained from the MapReduce API) and the current value of the counter. Increment the counter before the next row is examined.
If you want to work with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo
It uses ZooKeeper as a central repository to maintain the sequence state and generates unique, incrementing numeric values.