I am trying to copy specific data from one HBase table to another, which requires scanning the table for row keys only and parsing a specific value out of them. It works fine, but I noticed the results seem to be returned in ascending sort order, in this case alphabetically. Is there a way to specify a reverse order, or perhaps ordering by insert timestamp?
Scan scan = new Scan();
scan.setMaxResultSize(1000);
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner scanner = TestHbaseTable.getScanner(scan);
for (Result r : scanner) {
    String rowKey = Bytes.toString(r.getRow());
    System.out.println(rowKey);
    if (rowKey.startsWith("dm.") || rowKey.startsWith("bk.") || rowKey.startsWith("rt.")) {
        continue;
    } else if (rowKey.startsWith("yt")) {
        // split on "." (dot escaped in the regex), trimming surrounding whitespace
        List<String> ytresult = Arrays.asList(rowKey.split("\\s*\\.\\s*"));
        .....
This table is huge so I would prefer to skip to the rows I actually need. Appreciate any help here.
Have you tried the .setReversed() property of the Scan? Keep in mind that in this case your start row would have to be the logical END of your rowKey range, and from there it would scan 'upwards'.
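For example, a minimal sketch of the same scan reversed (the start and stop rows below are placeholders for whatever the actual bounds of your key range are):
Scan scan = new Scan();
scan.setReversed(true);
// with a reversed scan, the start row is the logical END of the range
// and the stop row is the logical START
scan.setStartRow(Bytes.toBytes("zz"));   // placeholder: upper bound of the range
scan.setStopRow(Bytes.toBytes("a"));     // placeholder: lower bound of the range
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner scanner = TestHbaseTable.getScanner(scan);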
Here is the requirement I am working on:
There are multiple indexes with name content_ssc, content_teal, content_mmy.
These indexes can have common data (co_code is one of the fields in the documents of these indexes):
a. content_ssc can have documents with co_code = teal/ssc/mmy
b. content_mmy can have documents with co_code = ssc/mmy
I need to get the data using the below condition (this is one approach to getting unique data from these indexes):
a. (Index = content_ssc and site_code = ssc) OR (Index = content_mmy and site_code = mmy)
Basically, I am currently getting duplicate data from these indexes, so I need a solution that fetches unique data from them using the above condition.
I have tried using a boolean query with multiple indices from this link, but it didn't produce unique results.
Please suggest.
You can use a distinct query, and you will get unique results.
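For what it's worth, the condition in (a) can also be expressed directly as a bool query with one should clause per index; below is a sketch using the Elasticsearch Java QueryBuilders (index and field names are taken from the question, the client version is an assumption):
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// (index = content_ssc AND site_code = ssc) OR (index = content_mmy AND site_code = mmy)
BoolQueryBuilder query = QueryBuilders.boolQuery()
        .should(QueryBuilders.boolQuery()
                .must(QueryBuilders.termQuery("_index", "content_ssc"))
                .must(QueryBuilders.termQuery("site_code", "ssc")))
        .should(QueryBuilders.boolQuery()
                .must(QueryBuilders.termQuery("_index", "content_mmy"))
                .must(QueryBuilders.termQuery("site_code", "mmy")))
        .minimumShouldMatch(1);
// run this against both indexes, e.g. new SearchRequest("content_ssc", "content_mmy")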
In HBase 1.2.4, what is the difference between checkAndPut and checkAndMutate?
checkAndPut - compares an expected value with the current value in HBase according to the passed CompareOp (e.g. CompareOp.EQUALS) and, if the check passes, applies the Put object.
checkAndMutate - performs the same comparison, but if the check passes it applies a RowMutations object instead.
You can add multiple Put and Delete objects to the RowMutations object, in the order in which you want the mutations to execute in HBase.
In RowMutations, the order of the Puts and Deletes matters.
RowMutations mutations = new RowMutations(row);

// add new columns
Put put = new Put(row);
put.add(cf, col1, v1);
put.add(cf, col2, v2);

Delete delete = new Delete(row);
delete.deleteFamily(cf1, now);

// delete the column family and add new columns to the same row
mutations.add(delete);
mutations.add(put);
table.mutateRow(mutations);
checkAndMutate
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#checkAndMutate-byte:A-byte:A-byte:A-org.apache.hadoop.hbase.filter.CompareFilter.CompareOp-byte:A-org.apache.hadoop.hbase.client.RowMutations-
checkAndPut
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#checkAndPut-byte:A-byte:A-byte:A-org.apache.hadoop.hbase.filter.CompareFilter.CompareOp-byte:A-org.apache.hadoop.hbase.client.Put-
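For reference, against the objects built above, the two calls look roughly like this (expectedValue is a placeholder for the byte[] you expect to find in cf:col1):
// apply the single Put only if cf:col1 currently equals expectedValue
boolean putApplied = table.checkAndPut(row, cf, col1, CompareFilter.CompareOp.EQUAL, expectedValue, put);

// apply the whole RowMutations (the delete plus the put above) atomically,
// but only if cf:col1 currently equals expectedValue
boolean mutationsApplied = table.checkAndMutate(row, cf, col1, CompareFilter.CompareOp.EQUAL, expectedValue, mutations);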
Could someone please explain to me the difference between Hadoop streaming and buffering?
Here is the context I have read in the Hive documentation:
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
In a reduce-side join, the values from multiple tables are tagged so that, at the reducer stage, they can be identified by the table they come from.
Consider the case of two tables:
In each reduce call, the mixed values associated with both tables are iterated over.
During iteration, the values for one of the tags/tables are stored locally in an ArrayList (this is buffering).
While the rest of the values are streamed through and values for the other tag/table are encountered, the values of the first tag are fetched from the saved ArrayList. The two tag values are joined and written to the output collector.
Contrast this with the case where the larger table's values are kept in the ArrayList: it could result in an OOM error if the ArrayList outgrows the memory of the container's JVM.
void reduce(TextPair key, Iterator<TextPair> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // buffer for table1 (the table whose values are kept in memory)
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // table1 tag
    Text table1Tag = key.getSecond();
    TextPair value = null;
    while (values.hasNext()) {
        value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            // still reading table1: buffer its values
            table1Values.add(value.getFirst());
        } else {
            // now streaming the other table: join each of its values
            // against the buffered table1 values
            for (Text val : table1Values) {
                output.collect(key.getFirst(), new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}
You can use the hint below to specify which of the joined tables should be streamed on the reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Hadoop Streaming in general refers to using custom-made Python or shell scripts to perform your map-reduce logic (for example, using the Hive TRANSFORM keyword).
Hadoop buffering, in this context, refers to the phase in a map-reduce job of a Hive query with a join, when records are read into the reducers after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables appear last; it helps optimize the way joins are implemented in Hive.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, it must take records from multiple tables, sort them by the join key, and then collate them together in the proper order. It has to read them grouped by table, so it sees groups from the different tables, and only once all tables have been seen can it start processing them. The groups from the first tables need to be buffered (kept in memory) because they cannot be processed until the last table is seen. The last table can be streamed (each row processed as it is read), since the other tables' groups are already in memory and the join can start.
I have big text content saved in the co column, and I want to search whether the column co contains a particular word, something like what we do in an RDBMS, e.g. WHERE co LIKE '%test%'. To achieve this, should I write a filter or a MapReduce job? Could somebody give an example of how to achieve this?
You can do something like
RegexStringComparator comp = new RegexStringComparator(".test."); // or (\W|^)test(\W|$) if you want complete words only
or
SubstringComparator comp = new SubstringComparator("test");
and then
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("COLUMN_FAMILY_NAME"),
        Bytes.toBytes("co"),
        CompareOp.EQUAL,
        comp);
scan.setFilter(filter);
Note that the performance of this will not be spectacular, as HBase will have to look at every instance of the column in your table.
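Also note that, by default, SingleColumnValueFilter lets through rows that do not contain the co column at all; if you only want rows where the column exists and matches, you can additionally set:
filter.setFilterIfMissing(true); // skip rows that have no co column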
Is there an efficient way to delete multiple rows in HBase, or does my use case smell like it is not suitable for HBase?
There is a table, say 'chart', which contains items that are in charts. Row keys are in the following format:
chart|date_reversed|ranked_attribute_value_reversed|content_id
Sometimes I want to regenerate chart for a given date, so I want to delete all rows starting from 'chart|date_reversed_1' till 'chart|date_reversed_2'. Is there a better way than to issue a Delete for each row found by a Scan? All the rows to be deleted are going to be close to each other.
I need to delete the rows because I don't want one item (one content_id) to have multiple entries, which it will have if its ranked_attribute_value has changed (that change is the reason the chart needs to be regenerated).
Being an HBase beginner, perhaps I am misusing rows for something that columns would handle better -- if you have design suggestions, cool! Or maybe the charts are better generated in a file (i.e. no HBase for output)? I'm using MapReduce.
Firstly, coming to the point of range deletes: there is no range delete in HBase yet, AFAIK. But there is a way to delete more than one row at a time through the HTableInterface API. Simply build a Delete object for each row key returned by the scan, collect them in a List, and pass the list to the delete API - done! To make the scan faster, do not include any column family in the scan result, as all you need is the row key for deleting whole rows.
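A minimal sketch of that approach (the start/stop keys are placeholders for your reversed-date boundaries, and FirstKeyOnlyFilter is just one way to keep the scan down to row keys):
// scan only the row keys in the range and delete them in one batch
Scan scan = new Scan(Bytes.toBytes("chart|date_reversed_1"),
                     Bytes.toBytes("chart|date_reversed_2"));
scan.setFilter(new FirstKeyOnlyFilter());
List<Delete> deletes = new ArrayList<Delete>();
for (Result r : table.getScanner(scan)) {
    deletes.add(new Delete(r.getRow()));
}
table.delete(deletes); // batch delete on HTableInterface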
Secondly, about the design. First, my understanding of the requirement: there are contents with a content id, each content has charts generated against it, and that data is stored; there can be multiple charts per content across dates, depending on the rank. In addition, we want the last generated content's chart to show at the top of the table.
For my assumption of the requirement I would suggest using three tables - auto_id, content_charts and generated_order. The row key for content_charts would be its content id, and the row key for generated_order would be a long that is auto-decremented using the HTableInterface API. For the decrement, use -1 as the offset amount, and initialize the value to Long.MAX_VALUE in the auto_id table at the first start-up of the app, or manually. Now, if you want to delete the chart data, simply clean the column family using a delete, put back the new data, and then put a row into the generated_order table. This way the latest insertion will also be at the top of the latest-insertion table, which holds the content id as a cell value. If you want to ensure generated_order has only one entry per content, save the generated_order id first, store that value in content_charts when putting, and delete the row from generated_order before deleting the column family. This way you can look up the charts for a content with at most 2 gets, and no scan is required for the charts.
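For the auto-decremented row key of generated_order, the counter part could look like this (table, family, qualifier and counter-row names below are placeholders, not from the original answer):
// atomically decrement the counter in the auto_id table; the counter cell
// must have been initialized to Long.MAX_VALUE once
long nextId = autoIdTable.incrementColumnValue(Bytes.toBytes("chart_counter"),
        Bytes.toBytes("cf"), Bytes.toBytes("next"), -1L);

// use the decremented value as the row key in generated_order, with the
// content id stored as a cell value
Put orderPut = new Put(Bytes.toBytes(nextId));
orderPut.add(Bytes.toBytes("cf"), Bytes.toBytes("content_id"), Bytes.toBytes(contentId));
generatedOrderTable.put(orderPut);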
I hope this is helpful.
You can use the BulkDeleteProtocol which uses a Scan that defines the relevant range (start row, end row, filters).
See here
I ran into the same situation, and this is my code to implement what you want:
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("Family"));
scan.setStartRow(structuredKeyMaker.key(startDate));
scan.setStopRow(structuredKeyMaker.key(endDate + 1));

try {
    ResultScanner scanner = table.getScanner(scan);
    Iterator<Entity> entityIterator = new EntityIteratorWrapper(scanner.iterator(), EntityMapper.create()); // a simple iterator that maps rows to my own entity type, not so important!
    List<Delete> deletes = new ArrayList<Delete>();
    int bufferSize = 10000000; // needed so I don't run out of memory, as I have a huge amount of data; a simple in-memory buffer
    int counter = 0;
    while (entityIterator.hasNext()) {
        if (counter < bufferSize) {
            // KeyMaker is used to extract the key as byte[] from my entity
            deletes.add(new Delete(KeyMaker.key(entityIterator.next())));
            counter++;
        } else {
            // buffer full: flush the batch and start over (the current entity is not
            // consumed here, so it is picked up on the next loop iteration)
            table.delete(deletes);
            deletes.clear();
            counter = 0;
        }
    }
    // flush whatever is left in the buffer
    if (deletes.size() > 0) {
        table.delete(deletes);
        deletes.clear();
    }
} catch (IOException e) {
    e.printStackTrace();
}