Stop HBase update operation if the value is the same - hadoop

I have a table in HBase named 'xyz'. When I do an update operation on this table, it updates the table even though the record is the same.
How can I prevent the second record from being added?
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
The put statements above give 2 records with different timestamps if I check all versions. I expect that the 2nd record should not be added, because it has the same value.

HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
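The read-before-write idea can be sketched as follows. This is only a toy in-memory model in Python, not the HBase client API: the `VersionedCell` class and the `put_if_changed` helper are made up for illustration, and the version counter stands in for HBase timestamps.

```python
import itertools


class VersionedCell:
    """Toy in-memory stand-in for one HBase cell that keeps several versions."""

    _clock = itertools.count(1)  # hypothetical stand-in for timestamps

    def __init__(self, max_versions=5):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def get(self):
        """Return the newest value, or None if the cell is empty."""
        return self.versions[0][1] if self.versions else None

    def put(self, value):
        """Unconditionally add a new version, as a plain put would."""
        self.versions.insert(0, (next(self._clock), value))
        del self.versions[self.max_versions:]  # drop versions beyond the limit


def put_if_changed(cell, value):
    """Read first and write only when the newest value differs."""
    if cell.get() == value:
        return False  # same value: skip the write, no new version
    cell.put(value)
    return True


cell = VersionedCell()
first = put_if_changed(cell, "NewYork")   # written
second = put_if_changed(cell, "NewYork")  # skipped, value unchanged
print(first, second, len(cell.versions))  # prints: True False 1
```

Note that a separate read followed by a write is not atomic, so two concurrent writers could still both insert the same value, and every write now costs an extra read.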

As mentioned by @Ben Watson, HBase is best known for its write performance, since it doesn't need to check whether a value already exists: multiple versions are maintained by default.
One hack you can use is custom versioning. As shown in the screenshot, you already have two versions for a row key. If you now insert the same record with the same timestamp, HBase overwrites the existing version instead of adding a new one.
NOTE: It is left to your application to get the same timestamp for a particular value.
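A rough sketch of why this works, again as a toy Python model rather than real HBase code: versions of a cell are keyed by timestamp, so a write with an existing timestamp replaces that version. The `timestamp_for` helper is hypothetical; it is just one way an application could arrive at the same timestamp for the same value.

```python
import zlib


def timestamp_for(value):
    """Hypothetical helper: derive a stable 'timestamp' from the value itself."""
    return zlib.crc32(value.encode("utf-8"))


# Toy model of one cell: versions are keyed by timestamp, so writing with
# an existing timestamp replaces that version instead of adding another.
versions = {}


def put(value, timestamp):
    versions[timestamp] = value


put("NewYork", timestamp_for("NewYork"))
put("NewYork", timestamp_for("NewYork"))  # same timestamp: overwritten
after_duplicate = len(versions)

put("Boston", timestamp_for("Boston"))    # different value: new version
print(after_duplicate, sorted(versions.values()))
```

The catch is that timestamps derived this way no longer reflect when the data was written, so "latest version" stops meaning "most recent write". Whether that is acceptable depends on your application.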

Related

How to obtain count of update records from Hive

I am working with the Hive driver, in which executeUpdate() does not return the affected record count. Is there an alternative way to obtain it? We need the affected record count for further processing.
If I am not mistaken Hive does not show (or even know?!) the number of updated records. Therefore extracting this directly is likely not going to work.
Workaround
First run a count query using exactly the same WHERE clause and log the result
Then do the actual update
Naturally this incurs significant overhead.
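A minimal sketch of this count-then-update workaround, using Python's built-in sqlite3 module purely as a stand-in for a Hive connection. The `update_with_count` helper and the `orders` table are made up for the example.

```python
import sqlite3


def update_with_count(conn, table, set_clause, where_clause, params=()):
    """Count rows matching the WHERE clause, then run the UPDATE with it.

    The SQL fragments are pasted into the statement, so they must come
    from trusted code, never from user input.
    """
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table} WHERE {where_clause}", params)
    affected = cur.fetchone()[0]
    cur.execute(f"UPDATE {table} SET {set_clause} WHERE {where_clause}", params)
    conn.commit()
    return affected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "new"), (2, "new"), (3, "done")],
)

affected = update_with_count(conn, "orders", "status = 'seen'", "status = ?", ("new",))
print(affected)  # prints: 2
```

If other writers can change the table between the count and the update, the logged number may be off, so treat it as an estimate unless the two statements are isolated from concurrent writes.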

How to know when was the last row of a table inserted

First of all I want to apologize because I do not have the vocabulary to talk about hive properly, I'm not sure if what goes into a row is called data and so on, I'm trying to be as correct as possible.
I want to know if it's possible, without adding an extra column to a hive table (where you would put the date/some metadata), to find out which new rows were added.
The case is as follows: A very large amount of data is going to be processed, and the data selected ends up in another hive table. If some new data is added to the original tables, I want to process only that new data, not re-run the whole process, because that seems wasteful (we're talking several million entries).
I would normally add a new column with dates, or just metadata that tells me whether or not a row was already "computed" with.
edit: I have been updated with more info. Turns out, there are actually two problems, imo.
One, new data may arrive, and it would be infinitely better to insert just the new rows into the destination table.
Second, data might be updated. I've been told that hive does not allow updates in the normal sense, since for example insert overwrite would just rewrite the whole set (turns out it's Hive 0.12.0, and in 0.14 SOME functionality has been added but updating is not a possibility).

Informatica 9.5.1, huge table (scd1)

I have a table (in Oracle) of about 860 million records (850 GB). On top of that, we are getting about 2-3 million records as source (flat file).
We do a lookup on the target: if the record already exists it is updated, and if it is a new record it is inserted (SCD1).
The transformations we use are unconnected lookup, sorter, filter, router and update strategy. It was fine all this time, but as the table is huge and still growing, it is taking forever to insert and update. Last night it took 19 hours to process 2.4 million records (2.1 million were new and so were inserted; the rest were updates).
Today I have about 1.9 million records to go through and I am not sure how long it will take. Any suggestions on how we can handle this?
1) Use just a connected lookup to the Oracle table after the SQ, matching on the primary key, and filter on nulls (records missing in the Oracle table, i.e. inserts) or not nulls (updates). Don't check other columns for updates. Skip the sorter and filter. Just use the update strategy.
2) Or use a joiner and make the flat file pipeline the master. Then check for nulls to tell inserts from updates.
3) Check that your target table doesn't have any triggers etc. on it. If it does, check their logic and implement it in the ETL.
Since you are dealing with about 860 million rows, you have two major bottlenecks - target lookup and writing into target.
You can think of this strategy -
Mapping 1 - Create a new mapping to load flat file data into a temp table TMP1.
Mapping 2 - Modify the existing mapping. Just modify the lookup query and join TMP1 with the target (860mil) table in the SQL Override. This will reduce time, I/O and lookup cache size.
Also, please make sure you have an index on the key columns in the target, and drop all other indexes before loading and recreate them afterwards. Skipping the sorter will help, but adding a joiner will not help much.
Regards,
Koushik
How many inserts vs updates do you have?
With just a few updates, try using Update else Insert target property.
If there are many updates and few inserts, perform update just if a key is found, without checking if anything has changed.
If there are many source rows matching what you already have (i.e. an update that doesn't change anything) try to eliminate them. But don't compare all columns - use a hash instead. Just create an additional computed column that will contain a MD5 calculated on all columns. Then all you need to do is compare one column instead of all to detect a change.
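The hash comparison could look roughly like this. It is a Python sketch rather than an Informatica expression, and the `row_md5` and `classify` helpers, the column names and the separator character are all assumptions made for illustration.

```python
import hashlib


def row_md5(row, columns):
    """MD5 over the chosen columns, joined with a separator unlikely in data.

    None is hashed as an empty string, so NULL and '' count as equal here.
    """
    parts = ["" if row[c] is None else str(row[c]) for c in columns]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def classify(source_row, target_hashes, key, columns):
    """Return 'insert', 'update' or 'skip' by comparing one hash per row."""
    stored = target_hashes.get(source_row[key])
    if stored is None:
        return "insert"  # key not present in the target
    return "skip" if stored == row_md5(source_row, columns) else "update"


columns = ["name", "city"]
# Hashes already stored in the target's extra MD5 column, keyed by primary key.
target_hashes = {1: row_md5({"id": 1, "name": "Ann", "city": "Oslo"}, columns)}

unchanged = classify({"id": 1, "name": "Ann", "city": "Oslo"}, target_hashes, "id", columns)
changed = classify({"id": 1, "name": "Ann", "city": "Rome"}, target_hashes, "id", columns)
new = classify({"id": 2, "name": "Bob", "city": "Kyiv"}, target_hashes, "id", columns)
print(unchanged, changed, new)  # prints: skip update insert
```

The column order and the separator must be identical when the stored hash and the incoming hash are computed, otherwise every row will look changed.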
1) Try using a merge statement if source and targets are in same database.
2) We can also use sql loader connection to improve the performance.
Clearly the bottleneck is in the target lookup and target load (update to be specific).
Try the following to tune the existing code:
1) Try to remove any unwanted lookup ports if you have in the lookup transformation. Keep only the fields that are used in the lookup condition as you are using it just to check if the record exists.
2) Try adding an index to the target table for the fields you are using for the update
3) Increase the commit interval of the session to a higher value.
4) Partial Pushdown optimization:
You can pushdown some of the processing to database which might be faster instead of doing it in Informatica
Create a staging table to hold the incoming data for that load.
Create a mapping to load the incoming file to the staging table. Truncate it before the start of the load to clear the records of the previous run.
In the SQL override of the existing mapping do a left join between the staging table and target table to find insert/updates. This will be faster than the Informatica lookup and eliminates the time taken to build the Informatica lookup cache.
5) Using MD5 to eliminate unwanted updates
To use MD5 you need to add a new field to the target table and run a one-time mapping to populate it for the existing records.
Then in your existing mapping add a step to compute the MD5 for the incoming columns.
If the record is identified for update, check whether the MD5 computed for the incoming columns is the same as the one stored in the target column. If the checksum matches, don't update the record. Update the record only if the checksum is different. This way you filter out the unwanted updates. If there is no lookup match, insert the record.
Advantages: You are reducing the unwanted updates.
Disadvantages: You have to run a one-time process to populate MD5 values for the existing records in the table.
If none of this works check with your database administrator to see if there is any issue in the database side that might slow down the load.
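The staging-table left join from step 4 can be sketched like this. It uses Python's sqlite3 as a stand-in for Oracle, and the table and column names (`target`, `staging`, `id`, `name`) are invented for the example; the real SQL override would use your own keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO target VALUES (1, 'alpha'), (2, 'beta');
    INSERT INTO staging VALUES (2, 'beta2'), (3, 'gamma');
""")

# Left join from the small staging table to the big target: a missing
# target key means the row is new, a present key means it is an update.
rows = conn.execute("""
    SELECT s.id, s.name,
           CASE WHEN t.id IS NULL THEN 'insert' ELSE 'update' END AS action
    FROM staging AS s
    LEFT JOIN target AS t ON t.id = s.id
    ORDER BY s.id
""").fetchall()
print(rows)  # prints: [(2, 'beta2', 'update'), (3, 'gamma', 'insert')]
```

The join is driven by the few million staging rows, so with an index on the target key the database only has to probe the big table instead of Informatica caching it.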

Does postgresql index update on inserting new row?

Sorry if this is a dumb question, but do I need to reindex my table every time I insert rows, or does the new row get indexed when added?
From the manual
Once an index is created, no further intervention is required: the system will update the index when the table is modified
http://postgresguide.com/performance/indexes.html
I think when you insert rows, the index does get updated. It maintains the sort on the index table as you insert data. Hence there are performance issues or downtimes on a table, if you try adding large number of rows at once.
On top of the other answers: PostgreSQL is a top notch Relational Database. I'm not aware of any Relational Database system where indices are not updated automatically.
It seems to depend on the type of index. For example, according to https://www.postgresql.org/docs/9.5/brin-intro.html, for BRIN indexes:
When a new page is created that does not fall within the last summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. This process can be invoked manually using the brin_summarize_new_values(regclass) function, or automatically when VACUUM processes the table.
Although this seems to have changed in version 10.

MERGE in Vertica

I would like to write a MERGE statement in Vertica database.
I know it can't be used directly, and insert/update has to be combined to get the desired effect.
The MERGE statement looks like this:
MERGE INTO table c USING (select b.field1,field2 aeg from table a, table b
where a.field3='Y'
and a.field4=b.field4
group by b.field1) t
on (c.field1=t.field1)
WHEN MATCHED THEN
UPDATE
set c.UUS_NAIT=t.field2;
Would just like to see an example of MERGE being used as insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. So since you want to do an operation that does 1 of the 2 I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend, not knowing the details of what you want to do so this is subject to change:
1) Insert data from an outside source into a staging table.
2) Perform an INSERT-SELECT from that table into the table you desire, using the criteria you are thinking about. Either use a join, or use two statements with subqueries against the table you want to test against.
3) Truncate the staging table.
It seems convoluted I guess, but you really don't want to do UPDATEs. And if you think that is a hassle, please remember that what causes the hassle is also what gives you your gains on SELECT statements.
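For reference, splitting a MERGE into an update plus an insert from a staging table could look like the sketch below. It runs against Python's sqlite3 only as a stand-in, so the exact SQL that Vertica accepts may differ; the `staging` table is assumed to already hold the result of the subquery from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE c (field1 INTEGER PRIMARY KEY, uus_nait TEXT);
    CREATE TABLE staging (field1 INTEGER PRIMARY KEY, field2 TEXT);
    INSERT INTO c VALUES (1, 'old'), (2, 'keep');
    INSERT INTO staging VALUES (1, 'new'), (3, 'added');
""")

with conn:  # one transaction for all three steps
    # WHEN MATCHED: update rows whose key exists in staging.
    conn.execute("""
        UPDATE c
        SET uus_nait = (SELECT s.field2 FROM staging AS s
                        WHERE s.field1 = c.field1)
        WHERE EXISTS (SELECT 1 FROM staging AS s WHERE s.field1 = c.field1)
    """)
    # WHEN NOT MATCHED: insert staging rows whose key is missing in c.
    conn.execute("""
        INSERT INTO c (field1, uus_nait)
        SELECT s.field1, s.field2 FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM c WHERE c.field1 = s.field1)
    """)
    # Empty the staging table for the next load.
    conn.execute("DELETE FROM staging")

result = conn.execute("SELECT field1, uus_nait FROM c ORDER BY field1").fetchall()
print(result)  # prints: [(1, 'new'), (2, 'keep'), (3, 'added')]
```

If you follow the advice above and avoid updates altogether, you would drop the first statement and keep only the INSERT-SELECT.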
If you want an example of a MERGE statement, follow the link. That is the link to the Vertica documentation. Remember to follow the instructions carefully. You cannot write a MERGE with WHEN NOT MATCHED followed by WHEN MATCHED. It has to follow the sequence given in the usage description in the documentation (which is the other way round). But you can choose to omit one of them completely.
I'm not sure if you are aware that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This sort of data can be manually removed by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that Merge is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to save a lot of history (deleted) data, it will slow down your updates. The update speeds might go from what is bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supports a subquery in MERGE. You would get the following error.
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
In fact merge also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in ON clause, which should ideally be a join on the primary keys.
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited for single row updates but large bulk updates are much less of an issue. I would not recommend re-creating the entire table, I would look into strategies around recreating partitions or bulk updates from staging tables.

Resources