Generating unique IDs in Hive - Hadoop

I have been trying to generate unique ids for each row of a table (30 million+ rows).
Using sequential numbers obviously does not work due to the parallel nature of Hadoop.
The built-in UDFs rand() and hash(rand(), unixtime()) seem to generate collisions.
There has to be a simple way to generate row IDs, and I was wondering if anyone has a solution.
My next step is to create a Java MapReduce job that generates a real hash string with a secure random + host IP + current time as the seed, but I figured I'd ask here before doing that ;)

Use the reflect UDF to generate UUIDs.
reflect("java.util.UUID", "randomUUID")
Update (2019)
For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF, which you can use to generate unique values that are far more performant than UUID strings. Documentation is still a bit sparse, but here is one example:
create table customer (
  id bigint default surrogate_key(),
  name string,
  city string,
  primary key (id) disable novalidate
);
To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:
-- staging_table would have two string columns.
insert into customer (name, city) select * from staging_table;

Not sure if this is all that helpful, but here goes...
Consider the native MapReduce analog: assuming your input data set is text based, the input Mapper's key (and hence unique ID) would be, for each line, the name of the file plus its byte offset.
When you are loading the data into Hive, if you can create an extra 'column' that has this info, you get your rowID for free. It's semantically meaningless, but so too is the approach you mention above.

Elaborating on the answer by jtravaglini,
there are two built-in Hive virtual columns, available since 0.8.0, that can be used to generate a unique identifier:
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
Use them like this:
select
concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey,
...
;
...
OK
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:0
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:57
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:114
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:171
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:228
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:285
hdfs://<nodename>:8020/user/dhdpadmn/training/training_data/nyse/daily/NYSE_daily2.txt:342
...
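A complete, runnable version of the snippet above might look like this (the table name is hypothetical; any Hive table over those files would do):

select concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE) as rowkey,
       t.*
from nyse_daily t;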
Or you can anonymize that with MD5 or similar; here's a link to an MD5 UDF:
https://gist.github.com/dataminelab/1050002
(note the function class name is initcap 'Md5')
select
Md5(concat(INPUT__FILE__NAME, ':', BLOCK__OFFSET__INSIDE__FILE)) as rowkey,
...

Use the ROW_NUMBER function to generate monotonically increasing integer IDs.
select ROW_NUMBER() OVER () AS id from t1;
See https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html

reflect("java.util.UUID", "randomUUID")
I could not vote up the other one. I needed a pure binary version, so I used this:
unhex(regexp_replace(reflect('java.util.UUID','randomUUID'), '-', ''))

Depending on the nature of your jobs and how frequently you plan on running them, using sequential numbers may actually be a reasonable alternative. You can implement a rank() UDF as described in this other SO question.

Write a custom Mapper that keeps a counter for every map task and, for each row, creates the row ID as the concatenation of the JobID (as obtained from the MR API) and the current value of the counter. The counter is incremented before the next row is examined.

If you want to work with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo
It uses ZooKeeper as a central repository to maintain the state of the sequence and generates unique, incrementing numeric values.

Related

MD5 of entire row in Oracle

Below is the query to get the MD5 of an entire row in Snowflake:
SELECT MD5(TO_VARCHAR(ARRAY_CONSTRUCT(*))) FROM T
taken from here
What is the alternative query in Oracle to achieve this without having to list all the column names manually?
You may use the packaged function dbms_sqlhash.gethash as described below, but remember:
- the package was removed from the documentation (I guess in 11g), so in recent releases this is an undocumented feature
- if you calculate the hash code from more than one row you must define an order (ORDER BY on a unique key or keys); otherwise the calculated hash is not deterministic (this was probably the reason for the removal)
- columns with data types other than varchar2 are converted to strings before the hash calculation, so the result depends on the NLS settings; you must stabilize the NLS settings to get reproducible results, e.g. with alter session set nls_date_format='dd-mm-yyyy hh24:mi:ss';
- the columns must be concatenated with some special delimiter (one that does not occur in the data) to avoid collisions: 'A' || null is the same as null || 'A'. These are unknown internals, so it is rather hard to compare the resulting MD5 hash with a hash calculated on other (non-Oracle) data
- you need an extra grant to execute the package
Some additional info
Example
select * from tab where x=1;
X          Y Z
---------- - -------------------
         1 a 13.10.2021 00:00:00

select
  dbms_sqlhash.gethash(
    sqltext     => 'select * from tab where x=1',
    digest_type => 2 /* dbms_crypto.hash_md5 */
  ) MD5
from dual;
MD5
--------------------------------
215A9C4642A3691F951DD8060877D191
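Per the second point above, if the query can return more than one row you must add a deterministic order to the SQL text, e.g. (same table as above):

select
  dbms_sqlhash.gethash(
    sqltext     => 'select * from tab order by x',
    digest_type => 2 /* dbms_crypto.hash_md5 */
  ) MD5
from dual;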
Order-Independent Hash Code of a Table
Contrary to a file (where the order matters), in a database table the order is not relevant. It would therefore be meaningful to have a way to calculate an order-independent hash code of a table.
Unfortunately this feature is currently not available in Oracle, but it was implemented as a prototype as described here.
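To illustrate the idea (this is not the prototype from the link, just a minimal sketch): since addition is commutative, summing a per-row hash yields an order-independent result. Using Oracle's ORA_HASH on the example table above, with the date column forced to a stable format:

select sum(ora_hash(x || '|' || y || '|' || to_char(z, 'dd-mm-yyyy hh24:mi:ss'))) as table_hash
from tab;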

Sequence Number UDF in Hive

I have tried this UDF in Hive: UDFRowSequence.
But it's not generating unique values, i.e. it repeats the sequence depending on the mappers.
Suppose I have one file (having 4 records) available in HDFS. One mapper will be created for this job and the result will be like
1
2
3
4
but when there are multiple (large) files at the HDFS location, multiple mappers will be created for that job, and each mapper will generate a repeating sequence like below
1
2
3
4
1
2
3
4
1
2
.
Is there any solution for this, so that a unique number is generated for each record?
I think you are looking for ROW_NUMBER(). You can read about it and other "windowing" functions here.
Example:
SELECT *, ROW_NUMBER() OVER ()
FROM some_database.some_table
@GoBrewers14: Yes, I did try that. We tried to use the ROW_NUMBER function, and on small data, e.g. a file containing 500 rows, it works perfectly. But when it comes to large data, the query runs for a couple of hours and finally fails to produce output.
I have since come to know the following about this:
Generating a sequential order in a distributed processing query is not possible with simple UDFs. This is because the approach requires some centralised entity to keep track of the counter, which also results in severe inefficiency for distributed queries and is not recommended.
If you want to work with multiple mappers and a large dataset, try this UDF: https://github.com/manojkumarvohra/hive-hilo
It uses ZooKeeper as a central repository to maintain the state of the sequence.
Query to generate sequences. We can use this as a surrogate key in a dimension table as well.
WITH temp AS
  (SELECT if(max(seq) IS NULL, 0, max(seq)) AS max_seq
   FROM seq_test)
SELECT col_id,
       col_val,
       row_number() over () + max_seq AS seq
FROM source_table
INNER JOIN temp ON 1 = 1;
seq_test: it's your target table.
source_table: it's your source.
seq: surrogate key / sequence number / key column
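To load the target in a single statement, the same query can feed an insert (a sketch, assuming seq_test has exactly the columns (col_id, col_val, seq); Hive accepts a CTE before the INSERT):

WITH temp AS
  (SELECT if(max(seq) IS NULL, 0, max(seq)) AS max_seq
   FROM seq_test)
INSERT INTO TABLE seq_test
SELECT col_id,
       col_val,
       row_number() over () + max_seq AS seq
FROM source_table
INNER JOIN temp ON 1 = 1;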

Should I use Oracle's sys_guid() to generate guids?

I have some inherited code that calls SELECT SYS_GUID() FROM DUAL each time an entity is created. This means that for each insertion there are two calls to Oracle, one to get the Guid, and another to insert the data.
I suppose there may be a good reason for this; for example, Oracle's GUIDs may be optimized for high-volume insertion by being sequential, thus avoiding excessive index tree re-balancing.
Is there a reason to use SYS_GUID as opposed to building your own Guid on the client?
Why roll your own if you already have it provided to you? Also, you don't need to grab it first and then insert; you can just insert:
create table my_tab
(
  val1 raw(16),
  val2 varchar2(100)
);
insert into my_tab(val1, val2) values (sys_guid(), 'Some data');
commit;
You can also use it as a default value for a primary key:
drop table my_tab;
create table my_tab
(
  val1 raw(16) default sys_guid(),
  val2 varchar2(100),
  primary key(val1)
);
Here there's no need to set up a before-insert trigger to use a sequence (or, in most cases, to even care about val1 or how it's populated in the code).
Sequences also mean more maintenance, not to mention the portability issues when moving data between systems.
But sequences are more human-friendly, IMO (looking at and using a number beats a 32-character hex rendering of a raw value, by far). There may be other benefits to sequences; I haven't done any extensive comparisons, so you may wish to run some performance tests first.
If your concern is two database calls, you should be able to call SYS_GUID() within your INSERT statement. You could even use a RETURNING clause to get the value that Oracle generated, so that you have it in your application for further use.
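For example (a sketch in PL/SQL, reusing the my_tab table from the previous answer):

declare
  v_id my_tab.val1%type;
begin
  insert into my_tab (val1, val2)
  values (sys_guid(), 'Some data')
  returning val1 into v_id;
  -- v_id now holds the generated GUID, obtained in a single round trip
end;
/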
SYS_GUID can be used as a default value for a primary key column, which is often more convenient than using a sequence, but note that the values will be more or less random and not sequential. On the plus side, that may reduce contention for hot blocks, but on the minus side your index inserts will be all over the place as well. We generally recommend against this practice.
For reference, click here.
I have found no reason to generate a GUID from Oracle. The round trip between Oracle and the client for every GUID is likely slower than the occasional index rebalancing that occurs with random-value inserts.

How do I optimize the following SQL query for performance?

How do I optimize the following SQL query for performance?
select * from Employee where CNIC = 'some-CNIC-number'
Will using an alias help make it a little faster?
I am using Microsoft SQL Server.
This is better if you tell us what RDBMS you are using, but...
1 - Don't do SELECT *. Specify which columns you need. Less data = faster query
2 - For indexing, make sure you have an index on CNIC. You also want a good clustered index on a primary key (preferably something like an ID number)
3 - You put the number in single quotes ' ', which indicates you may have it as a varchar column. If it will always be numeric, it should be an int/bigint data type. This takes up less space and will be faster to retrieve and index by.
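Putting tips 1 and 2 together, the query would look something like this (the column names are hypothetical):

SELECT EmployeeID, FirstName, LastName
FROM Employee
WHERE CNIC = 'some-CNIC-number'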
Create an index on CNIC:
CREATE INDEX ix_employee_cnic ON employee (cnic)
First, since this column will be used for storing ID card numbers, you can make your column of type int rather than varchar or nvarchar, as searching is faster on an integer type than on varchar or nvarchar.
Second, use WITH (NOLOCK), like
select * from Employee with (nolock) where CNIC = 'some-CNIC-number'
This is to minimize the chances of a deadlock.

How to protect a running column within Oracle/PostgreSQL (kind of MAX-result locking or something)

I need advice on the following situation with Oracle/PostgreSQL:
I have a db table with a "running counter" and would like to protect it in the following situation with two concurrent transactions:
T1                                           T2
SELECT MAX(C) FROM TABLE WHERE CODE='xx';
-- C for new: result + 1
                                             SELECT MAX(C) FROM TABLE WHERE CODE='xx';
                                             -- C for new: result + 1
INSERT INTO TABLE...
                                             INSERT INTO TABLE...
So, in both cases, the column value for the INSERT is calculated from the old result plus one.
Given this, a running counter handled by the DB would seem fine, but that wouldn't work because:
- the counter values of existing rows are sometimes changed
- sometimes I'd like there to be multiple counter "value groups" (as with the CODE mentioned): with different values for CODE the counters would be independent
With some other databases this can be handled with the SERIALIZABLE isolation level, but at least with Oracle and PostgreSQL the phantom reads are prevented, yet the table still ends up with two distinct rows with the same counter value. This seems to have to do with predicate locking, locking "all the possible rows covered by the query"; some other DBs end up locking the whole table or something.
SELECT ... FOR UPDATE statements seem to be for other purposes and don't even seem to work with the MAX() function.
Setting a UNIQUE constraint on the column would probably be a solution, but are there other ways to prevent the situation?
b.r. Touko
EDIT: One more option could probably be manual locking even though it doesn't appear nice to me..
Both Oracle and PostgreSQL support what are called sequences, and they are a perfect fit for your problem. You can have a regular int column, but define one sequence per group, and do a single query like
--PostgreSQL
insert into table (id, ... ) values (nextval(sequence_name_for_group_xx), ... )
--Oracle
insert into table (id, ... ) values (sequence_name_for_group_xx.nextval, ... )
Increments in sequences are atomic, so your problem just wouldn't exist. It's only a matter of creating the required sequences, one per group.
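Creating them is a one-time step, one sequence per group (the names are illustrative; this syntax works in both PostgreSQL and Oracle):

create sequence sequence_name_for_group_xx start with 1 increment by 1;
create sequence sequence_name_for_group_yy start with 1 increment by 1;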
the counter values of existing rows are sometimes changed
You should put a unique constraint on that column if this would be a problem for your app. Doing so would guarantee that a transaction at SERIALIZABLE isolation level would abort if it tried to use the same ID as another transaction.
One more option could probably be manual locking even though it doesn't appear nice to me..
Manual locking in this case is pretty easy: just take a SHARE UPDATE EXCLUSIVE or stronger lock on the table before selecting the maximum. This will kill concurrent performance, though.
sometimes I'd like there to be multiple counter "value groups" (as with the CODE mentioned): with different values for CODE the counters would be independent.
This leads me to the Right Solution for this problem: sequences. Set up several sequences, one for each "value group" you want to get IDs in their own range. See Section 9.15 of The Manual for the details of sequences and how to use them; it looks like they're a perfect fit for you. Sequences will never give the same value twice, but might skip values: if a transaction gets the value '2' from a sequence and aborts, the next transaction will get the value '3' rather than '2'.
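That skipping behaviour is easy to demonstrate (a PostgreSQL sketch):

CREATE SEQUENCE s;
BEGIN;
SELECT nextval('s');  -- returns 1
ROLLBACK;             -- the transaction aborts, but the sequence does not roll back
SELECT nextval('s');  -- returns 2; the value 1 is skipped for good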
The sequence answer is common, but might not be right. The viability of this solution depends on what you actually need. If what you semantically want is "some guaranteed to be unique number" then that is what a sequence is for. However, if what you want is to make sure that your value increases by exactly one on each insert (as you have asked), then DO NOT USE A SEQUENCE! I have run into this trap before myself. Sequences are not guaranteed to be sequential! They can skip numbers. Depending on what sort of optimizations you have configured, they can skip LOTS of numbers. Even if you have things configured just right so that you shouldn't skip any numbers, that is not guaranteed, and is not what sequences are for. So, you are only asking for trouble if you (mis)use them like that.
A step better is to bundle the select into the insert, like so:
INSERT INTO table(code, c, ...)
VALUES ('XX', (SELECT MAX(c) + 1 AS c FROM table WHERE code = 'XX'), ...);
(I haven't test run that query, but I'm pretty sure it should work. My apologies if it doesn't.) But, doing something like that reflects the semantic intent of what you are trying to do. However, this is inefficient, because you have to do a scan for MAX, and the inference I am taking from your sample is that you have a small number of code values relative to the size of the table, so you are going to do an expensive, full table scan on every insert. That isn't good. Also, this doesn't even get you the ACID guarantee you are looking for. The select is not transactionally tied to the insert. You can't "lock" the result of the MAX() function. So, you could still have two transactions running this query and they both do the sub-select and get the same max, both add one, and then both try to insert. It's a much smaller window, but you may still technically have a race condition here.
Ultimately, I would challenge that you probably have the wrong data model if you are trying to increment on insert. You should insert with a unique key, most commonly a sequence value (at least as an easy, surrogate key for any natural key). That gets the data safely inserted. Then, if you need a count of things, then have one table that stores your counts.
CREATE TABLE code_counts (
  code  VARCHAR(2), -- or whatever
  count NUMBER
);
If you really want to store the code count of each item as it is inserted, the separate count table also allows you to do so correctly, transactionally, like so:
UPDATE code_counts SET count = count + 1 WHERE code = 'XX' RETURNING count INTO :count;
INSERT INTO table(code, c, ...) VALUES ('XX', :count, ...);
COMMIT;
The key is that the update locks the counter row and reserves that value for you. Then your insert uses that value, and all of that is committed as one transactional change. You have to do this in a transaction. Having a separate count table avoids the full table scan of doing SELECT MAX(). In essence, what this does is re-implement a sequence, but it also guarantees you sequential, ordered use.
Without knowing your whole problem domain and data model, it is hard to say, but abstracting your counts out to a separate table like this where you don't have to do a select max to get the right value is probably a good idea. Assuming, of course, that a count is what you really care about. If you are just doing logging or something where you want to make sure things are unique, then use a sequence, and a timestamp to sort by.
Note that I'm saying not to sort by a sequence either. Basically, never trust a sequence to be anything other than unique: once you get to caching sequence values on a multi-node system, your application might even consume them out of order.
This is why you should use the SERIAL data type, which defers the lookup of C to the time of insert (which uses table locks, I presume). You would then not specify C; it would be generated automatically. If you need C for some intermediate calculation, you would need to save first, then read C, and finally update with the derived values.
Edit: Sorry, I didn't read your whole question. What about solving your other problems with normalization? Just create a second table for each specific type (for each x where A='x') where you have another auto-increment. Manually edited sequences could be another column in the same table, which uses the generated sequence as a base (i.e. if pk = 34 you can have another column mypk='34Changed').
You can create a sequential column by using a sequence as the default value.
First, you have to create the sequence counter:
CREATE SEQUENCE SEQ_TABLE_1 START WITH 1 INCREMENT BY 1;
Then you can use it as the default value:
CREATE TABLE T (
  COD NUMERIC(10) DEFAULT NEXTVAL('SEQ_TABLE_1') NOT NULL,
  column1...
  column2...
);
Now you don't need to worry about the sequence when inserting rows:
INSERT INTO T (collumn1, collumn2) VALUES (value1, value2);
Regards.
