Check and put in HBase - performance

I have a case where we need to insert records into an HBase table, and 90% of the records coming from the source are repeats. In this case,
is it advisable to first query HBase for the record and call put only if it is not present,
or
to simply call put?
Which of the two performs better?

Both HTable methods, checkAndPut() and exists(), require access to table data, which could hurt you badly if you receive lots of write requests and the data is not in the memstore.
Plain writes in HBase are usually not that expensive, so if you have a good rowKey design and you're already avoiding hot regions, I'd just stick to overwriting the data.

If you don't want to re-insert existing records, you can use the checkAndPut method of HTable. With it, the put is applied only if the condition you specify is met. So you could check for the existence of a column and put only if it does not exist yet.
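The "put only if the column does not exist" idea can be sketched with a toy in-memory model. This is not the HBase client API: `ToyTable`, `check_and_put` and the row and column names are hypothetical, and a lock stands in for the atomicity that the real operation provides on the server side.

```python
import threading


class ToyTable:
    """Toy in-memory stand-in for a table (hypothetical, not the HBase API)."""

    def __init__(self):
        self._cells = {}                # (row, column) -> value
        self._lock = threading.Lock()   # makes check + put a single atomic step

    def put(self, row, column, value):
        """Unconditional write: always overwrites the cell."""
        with self._lock:
            self._cells[(row, column)] = value

    def check_and_put(self, row, column, expected, value):
        """Write only if the cell currently equals `expected`.

        expected=None means "the cell must not exist yet".
        Returns True if the write was applied, False otherwise.
        """
        with self._lock:
            if self._cells.get((row, column)) != expected:
                return False
            self._cells[(row, column)] = value
            return True

    def get(self, row, column):
        with self._lock:
            return self._cells.get((row, column))


table = ToyTable()
first = table.check_and_put("row1", "cf1:name", None, "NewYork")   # cell absent: applied
second = table.check_and_put("row1", "cf1:name", None, "Boston")   # cell present: skipped
```

Note that every conditional write has to read the current cell first, which is the extra cost the other answer warns about.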

I kind of agree with both answers. It is true that before using the CAS (Check And Set) mechanism, you should review your design first and see whether it can be refactored to use plain writes instead. However, in some cases this is not trivial.
Another thing I would make sure of before using checkAndPut() is that the operation really requires isolation when updating values. HBase only guarantees isolation when rewriting a value, not when updating it.
Finally, check whether it is possible to use Append instead of checkAndPut.

Related

When should I use CREATE and when MERGE in Cypher queries?

I've seen that sometimes CREATE is used to create nodes, and in other situations, MERGE is used. What's the difference, and when should one be used in place of another?
CREATE does just what it says. It creates, and if that means creating duplicates, well then it creates.
MERGE does the same thing as CREATE, but also checks to see if a node already exists with the properties you specify. If it does, then it doesn't create. This helps avoid duplicates.
Here's an example: I use CREATE twice to create a person with the same name.
CREATE should be used when you are absolutely certain that the information doesn't exist in the database (for example, when you are loading data). MERGE is used whenever there is a possibility that the node or relationship already exists and you don't want to duplicate it. MERGE shouldn't be used everywhere, as it is considerably slower than the CREATE clause.

Stop HBase update operation if it has the same value

I have a table in HBase named 'xyz'. When I do an update operation on this table, it updates the table even though it is the same record.
How can I prevent the second record from being added?
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
The put statements above give 2 records with different timestamps when I check all versions. I expect the 2nd record not to be added, because it has the same value.
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
As mentioned by @Ben Watson, HBase is best known for its write performance, since it doesn't need to check for the existence of a value: multiple versions of a cell can be maintained.
One hack you can use is custom versioning, i.e. supplying the timestamp yourself. Say you already have two versions for a row key. If you now insert the same record with the same timestamp, HBase overwrites the existing version with the new value instead of adding another version.
NOTE: It is left to your application to get the same timestamp for a particular value.
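The custom-timestamp hack can be illustrated with a toy model of a single versioned cell. This is only a sketch of the behaviour described above, not HBase code: `ToyVersionedCell` and the timestamp values are hypothetical, and the version limit of 5 mirrors the `VERSIONS => 5` setting from the question.

```python
class ToyVersionedCell:
    """Toy model of one cell that keeps up to `max_versions` timestamped values."""

    def __init__(self, max_versions=5):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, value, timestamp):
        # Same timestamp: the existing version is replaced, not duplicated.
        self._versions[timestamp] = value
        # Drop the oldest timestamps beyond the version limit.
        for ts in sorted(self._versions)[:-self.max_versions]:
            del self._versions[ts]

    def versions(self):
        """Return (timestamp, value) pairs, newest first."""
        return sorted(self._versions.items(), reverse=True)


# Timestamps that differ per put (as when they are assigned automatically)
# produce two versions of the same value.
auto = ToyVersionedCell()
auto.put("NewYork", timestamp=1)
auto.put("NewYork", timestamp=2)

# An application-chosen, repeated timestamp keeps a single version.
pinned = ToyVersionedCell()
pinned.put("NewYork", timestamp=100)
pinned.put("NewYork", timestamp=100)
```

The hard part stays with the application: it has to come up with the same timestamp every time it sees the same value.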

Put performance - HBase Java Client

I ran some benchmarks on PUT performance from a Java client, but the results are not clear to me.
Here's the problem:
What is the best way to do puts in HBase? A single put with 1000 columns (4 families), or 1000 puts with a single column each? Maybe 4 puts with 250 columns each?
In theory, what would be the best strategy?
PS: I can't use batch because I need the WALs for Solr.
Thanks.
To get good write performance, use one Put per row. Otherwise performance degrades significantly, because HBase takes a lock on the row key and a lot of time is wasted on synchronization. With a single Put per row, write performance is comparable to bulk load.
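As a rough sketch of the "one Put per row" advice, the helper below groups individual cell writes by row key, so that each row would be written once with all of its columns. It is plain Python with hypothetical names (`group_cells_by_row`, the sample row keys and columns) and does not call any HBase API.

```python
from collections import defaultdict


def group_cells_by_row(cells):
    """Group (row_key, column, value) triples into one column map per row.

    Returns {row_key: {column: value}}; each entry corresponds to one write.
    """
    puts = defaultdict(dict)
    for row_key, column, value in cells:
        puts[row_key][column] = value
    return dict(puts)


cells = [
    ("row1", "cf1:a", "1"),
    ("row2", "cf1:a", "2"),
    ("row1", "cf1:b", "3"),
    ("row1", "cf2:c", "4"),
]
puts = group_cells_by_row(cells)  # 4 cell writes collapse into 2 row writes
```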
First of all, use as few column families as you can (I have provided details in this answer). Second, you must specify not only your write patterns but also your read patterns. HBase works best for "write once, read many" scenarios, so you want to design your table so that it provides the fastest access to the data. This criterion determines whether you need a "tall" or a "wide" table. Check out the HBase table design chapter of "HBase in Action".

Should I store reference data in my application memory, or in the database?

I am faced with the choice where to store some reference data (essentially drop down values) for my application. This data will not change (or if it does, I am fine with needing to restart the application), and will be frequently accessed as part of an AJAX autocomplete widget (so there may be several queries against this data by one user filling out one field).
Suppose each record looks something like this:
category
effective_date
expiration_date
field_A
field_B
field_C
field_D
The autocomplete query will need to check the input string against 4 fields in each record and discrete parameters against the category and effective/expiration dates, so if this were a SQL query, it would have a where clause that looks something like:
... WHERE category = ?
AND effective_date < ?
AND expiration_date > ?
AND (colA LIKE ? OR colB LIKE ? OR colC LIKE ?)
I feel like this might be a rather inefficient query, but I suppose I don't know enough about how databases optimize their indexes, etc. I do know that a lot of really smart people work really hard to make database engines really fast at this exact type of thing.
The alternative I see is to store it in my application memory. I could have a list of these records for each category, and then iterate over each record in the category to see if the filter criteria are met. This is definitely O(n), since I need to examine every record in the category.
Has anyone faced a similar choice? Do you have any insight to offer?
EDIT: Thanks for the insight, folks. Sending the entire data set down to the client is not really an option, since the data set is so large (several MB).
Definitely cache it in memory if it's not changing during the lifetime of the application. You're right, you don't want to be going back to the database for each call, because it's completely unnecessary.
There can be debate about exactly how much to cache on the server (I tend to cache as little as possible until I really need to), but for information that will not change and will be accessed repeatedly, you should almost always cache it in the Application object.
Given the number of directions you're coming at this data from (filtering on 6 or more columns), I'm not sure how much more you'll be able to optimize the information in memory. The first thing I would try is to store it in a list in the Application object and query it using LINQ-to-objects. Or, if there is one field that is used significantly more than the others, try using a Dictionary instead of a list. If performance continues to be a problem, try storing it in a DataSet and setting indexes on it (but of course you lose some code simplicity and maintainability this way).
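A minimal sketch of the in-memory approach, written in Python rather than LINQ: records are grouped by category once at startup, and each autocomplete query scans only that category's list. `ReferenceCache`, `Record`, `search` and the sample data are hypothetical. The strict date comparisons follow the WHERE clause from the question, and a case-insensitive substring test stands in for LIKE, which is an assumption since the actual patterns are not given.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    category: str
    effective_date: date
    expiration_date: date
    field_A: str
    field_B: str
    field_C: str
    field_D: str


class ReferenceCache:
    """Holds all reference records in memory, grouped by category."""

    def __init__(self, records):
        self._by_category = defaultdict(list)
        for record in records:
            self._by_category[record.category].append(record)

    def search(self, category, on_date, text, limit=10):
        """Linear scan of one category; returns at most `limit` matches."""
        needle = text.lower()
        matches = []
        for r in self._by_category.get(category, []):
            # Skip records that are not in effect on the requested date.
            if not (r.effective_date < on_date < r.expiration_date):
                continue
            fields = (r.field_A, r.field_B, r.field_C, r.field_D)
            if any(needle in f.lower() for f in fields):
                matches.append(r)
                if len(matches) == limit:
                    break
        return matches


cache = ReferenceCache([
    Record("fruit", date(2020, 1, 1), date(2030, 1, 1), "Apple", "Malus", "red", "tree"),
    Record("fruit", date(2020, 1, 1), date(2021, 1, 1), "Apricot", "Prunus", "orange", "tree"),
    Record("tool", date(2020, 1, 1), date(2030, 1, 1), "Apple corer", "kitchen", "steel", "hand"),
])
hits = cache.search("fruit", date(2025, 6, 1), "ap")
```

The scan is still O(n) per category, as the question notes, but it avoids a database round trip per keystroke, and the `limit` lets an autocomplete stop early.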
I do not think there is a one size fits all answer to your question. Depending on the data size and usage patterns the answer will vary. More than that the answer may change over time.
This is why, in my development, I built an intermediate layer that allows me to change how the caching is done by changing configuration (with no code changes). Every once in a while we analyze various stats (cache hit ratio, etc.) and decide whether we want to change the cache behavior.
BTW, there is also a third layer: you can push your static data to the browser and cache it there too.
Can you just hard-wire it into the program (as long as you stick to DRY)? Changing it only requires a rebuild.

Thread-safe unique entity instance in Core Data

I have a Message entity that has a messageID property. I'd like to ensure that there's only ever one instance of a Message entity with a given messageID. In SQL, I'd just add a unique constraint to the messageID column, but I don't know how to do this with Core Data. I don't believe it can be done in the data model itself, so how do you go about it?
My initial thought is to use a validation method to do a fetch on the NSManagedObject's context for the ID, see if it finds anything but itself, and if so, fails the validation. I suspect this will work - but I'm worried about the performance of something like that. I went through a lot of effort to minimize the fetch requests needed for the entire import routine, and having it validate by performing a fetch for every single new message entity seems a bit excessive. I can get all pre-existing objects I need and identify all the new objects I need to insert into the store using just two fetch queries before I do the actual work of importing and connecting everything together. This would add a fetch to every single update or insert in addition to those two - which would seem to eliminate any performance advantage I had by pre-processing the import data in the first place!
The main reason this is an issue is that the importer can (potentially) run several batches concurrently on several threads and may include some overlapping/duplicate data that needs to ultimately result in just one object in the store and not duplicate entries. Is there a reasonable way to do this and does what I'm asking for make sense for Core Data?
The only way to guarantee uniqueness is to do a fetch. Fortunately you can just do a -countForFetchRequest:error: and check to see if it is zero or not. That is the least expensive way to guarantee uniqueness at this time.
You can probably accomplish this in the validation or run it in the loop that is processing the data. Personally I would do it above the creation of the NSManagedObject so that you do not have the unnecessary allocs when a record already exists.
I don't think there is a way to easily guarantee an attribute is unique without doing a lot of work on your own. You can, of course use CFUUIDCreate to create a globally unique UUID, which should be unique, even in a multithreaded environment. But...
The objectID (type NSManagedObjectID) of all managed objects is guaranteed to be unique within the persistent store coordinator. Since you can add arbitrarily many persistent stores to the coordinator, this guarantee basically guarantees that the objectIDs are globally unique. Why don't you use the objectID as your messageID? You can't, of course, change the objectID once it's assigned (and it won't get assigned until the context containing the inserted object is saved; until then it will be a temporary but still unique ID).
So you have an NSManagedObjectContext for each thread, backed by the same persistent store, is that correct? And before you save the NSManagedObjectContext, you'd like to make sure the messageID is unique, that is, that you are not updating an existing row, and that it is not in one of the other contexts, correct?
Given that model (correct me if I misunderstand), I think you'd be better served having one object that manages access to the persistent store. That way, all threads would update one context and you can do your validation in there, using Marcus's -countForFetchRequest:error: suggestion. Granted, that places a bottleneck on this operation.
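The "one object that manages access" idea can be sketched language-independently. The Python below is not Core Data code: `MessageStore`, `insert_if_absent` and the sample batches are hypothetical, and a dictionary guarded by a lock stands in for the single shared context plus the count-is-zero check.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MessageStore:
    """Hypothetical single gatekeeper; stands in for one shared context."""

    def __init__(self):
        self._messages = {}             # message_id -> payload
        self._lock = threading.Lock()   # serializes check + insert

    def insert_if_absent(self, message_id, payload):
        """Insert once per message_id; return True only for the first caller."""
        with self._lock:
            if message_id in self._messages:   # plays the role of "count is not zero"
                return False
            self._messages[message_id] = payload
            return True

    def count(self):
        with self._lock:
            return len(self._messages)


def import_batch(store, batch):
    """Import one batch and return how many messages were actually inserted."""
    return sum(store.insert_if_absent(mid, body) for mid, body in batch)


store = MessageStore()
batches = [                     # overlapping batches, as in the question
    [(1, "a"), (2, "b"), (3, "c")],
    [(2, "b"), (3, "c"), (4, "d")],
    [(4, "d"), (5, "e"), (1, "a")],
]
with ThreadPoolExecutor(max_workers=3) as pool:
    inserted = list(pool.map(lambda b: import_batch(store, b), batches))
```

Because the check and the insert happen under the same lock, overlapping batches cannot both insert the same ID; the price is the bottleneck mentioned above.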
Just to add my 2 cents: I think inconsistencies will occur sooner or later anyway, and the only way to mitigate them seems to be to do it on an application-level with rather complex code.
So in my case I decided to allow duplicate values for what are supposed to be "unique" fields.
I added code, however, that detects these problems later (e.g. when a fetch that should return 1 object returns more than 1) and fixes them when they occur (usually by deleting).
It's a "go ahead, make a mistake, I'll fix it later for you" strategy.
This is not ideal, of course, but a valid way to attack this problem, imho.
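A small sketch of the repair step in this detect-and-fix strategy, assuming messages are plain dictionaries and that the first record seen for a given ID is the one to keep. Both are assumptions made for illustration; `repair_duplicates` is a hypothetical helper, not a Core Data API.

```python
def repair_duplicates(messages, key="messageID"):
    """Keep the first record per key; return (kept, removed)."""
    seen = {}
    removed = []
    for message in messages:
        k = message[key]
        if k in seen:
            removed.append(message)   # a later duplicate: candidate for deletion
        else:
            seen[k] = message
    return list(seen.values()), removed


fetched = [
    {"messageID": 7, "body": "hello"},
    {"messageID": 8, "body": "world"},
    {"messageID": 7, "body": "hello"},   # duplicate that slipped in
]
kept, removed = repair_duplicates(fetched)
```

In a real store, the records in `removed` would be deleted wherever a fetch that should return one object returns more than one.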

Resources