Column Value Range Filter in HBase 0.94 - filter

I want to use a range filter in HBase on more than one column. I know we can use SingleColumnValueFilter with And/Or conditions, but I want to run the same filter condition against two different columns.
Example: my HBase table
rowkey, cf:bidprice, cf:askprice, cf:product
I want to filter all the rows with (cf:bidprice > 10 and cf:bidprice < 20) or (cf:askprice > 10 and cf:askprice < 20).

I think I figured it out. The code snippet below is an example implementation.
byte[] startRow = Bytes.toBytes("startrow");
byte[] endRow = Bytes.toBytes("stoprow");
// Range condition on the bidprice column
SingleColumnValueFilter bidPriceGreaterFilter = new SingleColumnValueFilter(
        Bytes.toBytes("q"), Bytes.toBytes("bidprice"),
        CompareFilter.CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("12345"));
SingleColumnValueFilter bidPriceLesserFilter = new SingleColumnValueFilter(
        Bytes.toBytes("q"), Bytes.toBytes("bidprice"),
        CompareFilter.CompareOp.LESS_OR_EQUAL, Bytes.toBytes("12346"));
// Same range condition on the askprice column
SingleColumnValueFilter askPriceGreaterFilter = new SingleColumnValueFilter(
        Bytes.toBytes("q"), Bytes.toBytes("askprice"),
        CompareFilter.CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("12345"));
SingleColumnValueFilter askPriceLesserFilter = new SingleColumnValueFilter(
        Bytes.toBytes("q"), Bytes.toBytes("askprice"),
        CompareFilter.CompareOp.LESS_OR_EQUAL, Bytes.toBytes("12346"));
// bidprice range: both conditions must pass (AND)
FilterList andFilter1 = new FilterList(FilterList.Operator.MUST_PASS_ALL);
andFilter1.addFilter(bidPriceGreaterFilter);
andFilter1.addFilter(bidPriceLesserFilter);
// askprice range: both conditions must pass (AND)
FilterList andFilter2 = new FilterList(FilterList.Operator.MUST_PASS_ALL);
andFilter2.addFilter(askPriceGreaterFilter);
andFilter2.addFilter(askPriceLesserFilter);
// Either range may match (OR)
FilterList finalFilterList = new FilterList(FilterList.Operator.MUST_PASS_ONE);
finalFilterList.addFilter(andFilter1);
finalFilterList.addFilter(andFilter2);
Scan scan = new Scan(startRow, endRow);
scan.setFilter(finalFilterList);
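To actually run the scan, here is a minimal sketch, assuming the table is already open as an HTable instance named table (the table handle and printed fields are just for illustration):
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result result : scanner) {
        byte[] bid = result.getValue(Bytes.toBytes("q"), Bytes.toBytes("bidprice"));
        byte[] ask = result.getValue(Bytes.toBytes("q"), Bytes.toBytes("askprice"));
        System.out.println(Bytes.toString(result.getRow())
                + " bid=" + Bytes.toString(bid) + " ask=" + Bytes.toString(ask));
    }
} finally {
    scanner.close();
}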

Related

Create criteria object using map in mongo

I have a Map filterParams whose key/value pairs should become part of a Criteria. How do I create the Criteria object?
Earlier I was using Query.addCriteria. However, now I want a Criteria object because I need to pass it to Aggregation.match() in Mongo.
filterParams.entrySet().forEach(e -> query.addCriteria(Criteria.where(e.getKey()).is(e.getValue())));
Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(criteria),
        Aggregation.group("property_type.name").count().as("count"),
        Aggregation.project("property_type").andExclude("_id"));
Try this one:
Criteria criteria = new Criteria();
filterParams.entrySet().forEach(e -> criteria.and(e.getKey()).is(e.getValue()));
Aggregation aggregation = Aggregation.newAggregation(
Aggregation.match(criteria),
Aggregation.group("property_type.name").count().as("count"),
Aggregation.project("property_type").andExclude("_id")
);
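To run the pipeline, a rough usage sketch, assuming a Spring Data mongoTemplate is available; the collection name is a placeholder:
// "properties" is a placeholder collection name
AggregationResults<BasicDBObject> results =
        mongoTemplate.aggregate(aggregation, "properties", BasicDBObject.class);
List<BasicDBObject> counts = results.getMappedResults();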

Hadoop Pig: Show entries using STARTSWITH

I am having issues using the STARTSWITH string function. I want to display all records whose System_Period begins with 20040.
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:int);
sysGroup = GROUP transactions BY System_Period;
sysFilter = FILTER sysGroup BY STARTSWITH(transactions.System_Period, 20040);
DUMP sysFilter;
The error I am receiving is
Could not infer the matching function for org.apache.pig.builtin.STARTSWITH as multiple or none of them fit. Please use an explicit cast.
STARTSWITH compares two chararrays and checks whether the first one starts with the second. You cannot pass a relation or a bag to it, and it accepts only String (chararray), not an integer. Either FILTER the rows whose System_Period begins with 20040 before the GROUP BY (load System_Period as chararray and cast it back after the filter as needed):
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysFilter = FILTER transactions BY STARTSWITH(System_Period, '20040');
Or, after the GROUP BY, FLATTEN the result and then filter:
transactions = LOAD '/home/cloudera/datasets/assignment2/Transactions.csv'
USING PigStorage(',') AS (Branch_Number:int, Contract_Number:int,
Customer_Number:int,Invoice_Date:chararray, Invoice_Number:int,
Product_Number:int, Sales_Amount:double, Employee_Number:int,
Service_Date:chararray, System_Period:chararray);
sysGroup = GROUP transactions BY System_Period;
flatres = FOREACH sysGroup GENERATE group,FLATTEN(transactions);
sysFilter = FILTER flatres BY STARTSWITH(System_Period, '20040');

Triple composite key in Hbase

I have a use case where I want a 3-level composite key.
For example:
Rollnumber:class:friendsRollNumber
I would want to query "Get all friends for a particular roll number and a class".
I could not find sufficient examples on the net of using composite keys together with range scans over them.
Currently, I am doing the following:
byte[] rowkey = Bytes.add(Bytes.toBytes("myrollnumber"), Bytes.toBytes("myClass"), Bytes.toBytes("myFriendsRollNumber"));
This is the way I form the row key.
Will it select the region server based on myRollNumber and myClass? If not, how can I do that?
Also, for the range scan, what is the correct way to use it? I am doing it the following way. I am still in the process of writing the code, so I have not tested it.
Scan s = new Scan();
Filter f = new PrefixFilter(Bytes.add(Bytes.toBytes("myrollnumber"), Bytes.toBytes("class")));
s.setFilter(f);
Is the above way correct to scan as per my requirement?
Also, how do I get the individual parts of the row key back from the scanner?
Try this:
byte[] prefix = Bytes.toBytes("rollnumber" + "class");
Scan scan = new Scan(prefix);
PrefixFilter prefixFilter = new PrefixFilter(prefix);
scan.setFilter(prefixFilter);
ResultScanner resultScanner = table.getScanner(scan);
For your requirement, you can use start stop row feature of Scan. You do not need a filter.
byte[] startRow = Bytes.add(Bytes.toBytes("search_rollnumber"), Bytes.toBytes("myClass"));
byte[] stopRow = Bytes.add(Bytes.toBytes("search_rollnumber"), Bytes.toBytes("myClass"));
stopRow[stopRow.length - 1]++;
Scan s = new Scan(startRow, stopRow);
Using this scan you will get all rows starting with search_rollnumbermyClass.
I am not sure whether you use : in your row key, but I think you should if both rollnumber and class are represented as integers.
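As for getting the individual parts of the row key back from the scanner, here is a minimal sketch, assuming the components are fixed-length and concatenated without a delimiter, and that an open HTable named table is available; the widths below are placeholders:
// Hypothetical fixed widths for the first two key components
final int ROLL_LEN = 10;
final int CLASS_LEN = 5;
for (Result result : table.getScanner(s)) {
    byte[] rowKey = result.getRow();
    String rollNumber = Bytes.toString(rowKey, 0, ROLL_LEN);
    String clazz = Bytes.toString(rowKey, ROLL_LEN, CLASS_LEN);
    String friendsRollNumber = Bytes.toString(rowKey, ROLL_LEN + CLASS_LEN,
            rowKey.length - ROLL_LEN - CLASS_LEN);
    // use the individual parts here
}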

Hbase filter to find rows without a specific column

I want to filter out all rows that do not have a specific column. Any idea which comparator to use?
You can use a SkipFilter combined with a QualifierFilter.
If you use the Java client API:
Filter filter = new QualifierFilter(CompareFilter.CompareOp.EQUAL,new BinaryComparator(Bytes.toBytes("column-name")));
Filter filter2 = new SkipFilter(filter);
scan.setFilter(filter2);
This will return all the rows without that specific column.
SingleColumnValueFilter has a method setFilterIfMissing that, when set to true, excludes all rows that do not contain the given column. All that is needed is to design the filter so that it always passes and to call setFilterIfMissing(true):
SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes(columnFamily), Bytes.toBytes("column_name"), CompareFilter.CompareOp.NOT_EQUAL, Bytes.toBytes("non-sense"));
filter.setFilterIfMissing(true);
scan.setFilter(filter);

Hadoop: Map Reduce: read from HBase, but filter rows by content of one column

I am really new to Hadoop and I am not able to find an answer to my question. I want to write a map reduce job where I read from HBase and then write to a simple text file.
In HBase, I've got a column representing an id. Now I don't want to work on all rows in my HBase table, but only on those between a maxId and a minId.
I found out that I could possibly use filters (scan.setFilter), so that I can filter out rows which don't match my request.
This is my first Map Reduce job, so please be patient :-)
I've got a starter class where I configure the job and the Scan object and then start the job.
Now, my first try looks like this:
private Scan getScan()
{
    final Scan scan = new Scan();
    // ** FILTER **
    List<Filter> filters = new ArrayList<Filter>();
    Filter filter1 = new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(minId))));
    filters.add(filter1);
    Filter filter2 = new ValueFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(maxId))));
    filters.add(filter2);
    FilterList filterList = new FilterList(filters);
    scan.setFilter(filterList);
    scan.setCaching(500);
    scan.setCacheBlocks(false);
    // id
    scan.addColumn("columnfamily".getBytes(), "id".getBytes());
    return scan;
}
Well, I'm not sure if this is the right way to do it. I also read that I could pass my minId and maxId to the map job with the Configuration object, but I'm not sure how.
Besides, what do I have to do afterwards? I would normally just initiate the job with initTableMapperJob and pass the Scan object to it. I've read something about ResultScanner; do I need it? I thought the MapReduce framework would automatically pass the correct rows to my map job, is that correct?
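A rough sketch of the Configuration idea mentioned above, assuming a standard TableMapper setup; the property keys, table name, and mapper class are placeholders:
// In the starter class, before submitting the job:
Configuration conf = HBaseConfiguration.create();
conf.set("filter.minId", minId);
conf.set("filter.maxId", maxId);
Job job = new Job(conf, "hbase-export");
TableMapReduceUtil.initTableMapperJob("mytable", getScan(), MyMapper.class,
        Text.class, Text.class, job);

// In the mapper, the values can be read back in setup():
public static class MyMapper extends TableMapper<Text, Text> {
    private int minId;
    private int maxId;

    @Override
    protected void setup(Context context) {
        minId = Integer.parseInt(context.getConfiguration().get("filter.minId"));
        maxId = Integer.parseInt(context.getConfiguration().get("filter.maxId"));
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
        // rows arriving here have already passed the scan filter set in getScan()
    }
}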
