Recalculate some columns after insert on Hbase table - hadoop

Is there a simple way to recalculate some values after an insert occurs? I have a table with multiple column families, one of them statistical. I'd like to insert the original record and than have some HBase-specific facility to calculate the values offline - without blocking the insert.
Let's assume I put some files into an hbase table and want to have information about the number of lines in them and also the dates stored there.
I've been looking into RegionObserver and its preGet method. This solution works but I'm afraid it blocks the actual insert from occurring until the calculation is completed.

use the postPut method. You can see a short introduction to HBase's coprocessors here

Try apache Pig, this is best suited for stasticl calculations and can run in local as well as mapred mode
For more detail you can visit
http://pig.apache.com

Related

Sqoop incremental load using Informatica BDM

I am new to Informatica BDM.I have a use case in which I have to import the data incrementally (100 tables) from RDBMS into Hive on daily basis. Can someone please guide me with the best possible approach to achieve this?
Thanks,
Sumit
Hadoop is write onces read many (WORM) approach and the incremental load is not easy stuff. There are following guideline you can follow and validate your current requirement
If the table is a small/mid-size and not having too many records,
better to refresh the entire table
If the table is too big and incremental load has add/update/delete operation, you can think of staging the delta and perform a join operation to re-create data set.
For large table and large delta, you can create a version number for all the latest record and each delta may come to a new directory and a view should be created to get the latest version for further processing. This avoid heavy merge operation.
If the delete operation is not coming as change, then you also need to think how to act on it and in such case, you need to get the full refresh.

Stop Hbase update operation if it have same value

I have a table in Hbase named 'xyz' . When I do an update operation on this table , it updates a table even though it is same record .
How can I control second record to not be added.
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
Above put statements are giving 2 records with different timestamp if I check all versions. I am expecting that it should not add 2nd record because it have same value
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
As mentioned by #Ben Watson, HBase is best known for it's performance in write since it doesn't need to check for the existence of a value as multiple versions will be maintained by default.
One hack what you can do is, you can use custom versioning. As show in the below screenshot, you have two versions already for a row key. Now if you are going to insert the same record with the same timestamp. HBase would be overwriting the same record with just the value.
NOTE: It is left to your application to get the same timestamp for a particular value.

Best approaches to UPDATE the data in tables - Teradata

I am new to Teradata & fortunately got a chance to work on both DDL-DML statements.
One thing I observed is Teradata is very slow when time comes to UPDATE the data in a table having large number of records.
The simplest way I found on the Google to perform this update is to write an INSERT-SELECT statement with a CASE on column holding values to be update with new values.
But what when this situation arrives in Data Warehouse environment, when we need to update multiple columns from a table holding millions of rows ?
Which would be the best approach to follow ?
INSERT-SELECT only OR MERGE-UPDATE OR MLOAD ?
Not sure if any of the above approach is not used for this UPDATE operation.
Thank you in advance!
At enterprise level, we expect volumes to be huge and updates are often part of some scheduled jobs/scripts.
With huge volume of data, Updates comes as a costly operation that involve risk of blocking table for some time in case the update fails (due to fallback journal). Although scripts are tested well, and failures seldom happen in production environments, it's always better to have data that needs to be updated loaded to a temporary table in required form and inserted back to same table after deleting matching records to maintain SCD-1 (Where we don't maintain history).

Simulating a columnar store using cluster tables

I have a client that mostly uses calculations on a single column of many rows from a table (each time another column), which is classic for a columnar DB.
The problem is that he is using Oracle, so what I thought of doing was to build a bunch of cluster table where each table has just one column besides the PK and this way allow him to work in a pseudo-columnar model.
What are you thoughts on the subject?
Will it even work as expected or am I just forcing the solution here ?
Thanks,
Daniel
I didn't test it in the end but I did achieve close to vertical performance time using sorted table hash cluster.

MERGE in Vertica

I would like to write a MERGE statement in Vertica database.
I know it can't be used directly, and insert/update has to be
combined to get the desired effect.
The merge sentence looks like this:
MERGE INTO table c USING (select b.field1,field2 aeg from table a, table b
where a.field3='Y'
and a.field4=b.field4
group by b.field1) t
on (c.field1=t.field1)
WHEN MATCHED THEN
UPDATE
set c.UUS_NAIT=t.field2;
Would just like to see an example of MERGE being used as insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. So since you want to do an operation that does 1 of the 2 I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend, not knowing the details of what you want to do so this is subject to change:
1) Insert data from an outside source into a staging table.
2) Perform and INSERT-SELECT from that table into the table you desire using the criteria you are thinking about. Either using a join or in two statements with subqueries to the table you want to test against.
3) Truncate the staging table.
It seems convoluted I guess, but you really don't want to do UPDATE's. And if you think that is a hassle, please remember that what causes the hassle is what gives you your gains on SELECT statements.
If you want an example of a MERGE statement follow the link. That is the link to the Vertica documentation. Remember to follow the instructions clearly. You cannot write a Merge with WHEN NOT MATCHED followed and WHEN MATCHED. It has to follow the sequence as given in the usage description in the documentation (which is the other way round). But you can choose to omit one completely.
I'm not sure, if you are aware of the fact that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This sort of data can be manually removed by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that Merge is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to save a lot of history (deleted) data, it will slow down your updates. The update speeds might go from what is bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supportes Subquery in Merge. You would get the following.
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
In fact merge also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in ON clause, which should ideally be a join on the primary keys.
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited for single row updates but large bulk updates are much less of an issue. I would not recommend re-creating the entire table, I would look into strategies around recreating partitions or bulk updates from staging tables.

Resources