Unable to fetch data from HBase based on query parameters - hadoop

How do I get data from HBase? I have a table with empId, name, startDate, endDate and other columns, and I want to fetch rows based on empId, startDate and endDate. In normal SQL I could write:
select * from tableName where empId=val and date>=startDate and date<=endDate
How can I do this in HBase, given that it stores data as key-value pairs? The row key is empId.

Getting filtered rows in the HBase shell is tricky. Since the shell is JRuby-based, you can run Ruby commands in it as well:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.filter.FilterList
import java.text.SimpleDateFormat
import java.lang.Long
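# Encodes a yyyy/MM/dd date as the decimal string of its epoch milliseconds;
# the stored column values must use the same encoding for the comparison to work
def dateToBytes(val)
  Long.toString(
    SimpleDateFormat.new("yyyy/MM/dd").parse(val).getTime()).to_java_bytes
end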
# table properties
colfam='c'.to_java_bytes;
col_name='name';
col_start='startDate';
col_end='endDate';
# query params
q_name='name2';
q_start='2012/08/14';
q_end='2012/08/24';
# filters
f_name=SingleColumnValueFilter.new(
  colfam, col_name.to_java_bytes,
  CompareFilter::CompareOp::EQUAL,
  BinaryComparator.new(q_name.to_java_bytes));
f_start=SingleColumnValueFilter.new(
  colfam, col_start.to_java_bytes,
  CompareFilter::CompareOp::GREATER_OR_EQUAL,
  BinaryComparator.new(dateToBytes(q_start)));
f_end=SingleColumnValueFilter.new(
  colfam, col_end.to_java_bytes,
  CompareFilter::CompareOp::LESS_OR_EQUAL,
  BinaryComparator.new(dateToBytes(q_end)));
filterlist= FilterList.new([f_name, f_start, f_end]);
# get the result
scan 'mytable', {"FILTER"=>filterlist}
Similarly, in Java, construct a FilterList:
// Table properties (same column family and qualifiers as in the shell example)
byte[] colFam = Bytes.toBytes("c");
byte[] nameQual = Bytes.toBytes("name");
byte[] startDateQual = Bytes.toBytes("startDate");
byte[] endDateQual = Bytes.toBytes("endDate");
// Query params
String nameParam = "name2";
String startDateParam = "2012/08/14";
String endDateParam = "2012/08/24";
Filter nameFilter =
    new SingleColumnValueFilter(colFam, nameQual, CompareOp.EQUAL,
        Bytes.toBytes(nameParam));
// getBytesFromDate(): your own helper that parses the date string and returns
// a byte array in the same encoding as the stored values
Filter startDateFilter =
    new SingleColumnValueFilter(colFam, startDateQual,
        CompareOp.GREATER_OR_EQUAL, getBytesFromDate(startDateParam));
Filter endDateFilter =
    new SingleColumnValueFilter(colFam, endDateQual,
        CompareOp.LESS_OR_EQUAL, getBytesFromDate(endDateParam));
// A FilterList defaults to MUST_PASS_ALL, so all three conditions must hold
FilterList filters = new FilterList();
filters.addFilter(nameFilter);
filters.addFilter(startDateFilter);
filters.addFilter(endDateFilter);
HTable htable = new HTable(conf, tableName);
Scan scan = new Scan();
scan.setFilter(filters);
ResultScanner rs = htable.getScanner(scan);
// process your result...
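The Java snippet leaves getBytesFromDate() undefined. A minimal sketch of such a helper might look like the following. It assumes the column values were written with the same encoding as the shell helper dateToBytes (the decimal string of the epoch milliseconds) and that dates are interpreted in UTC. The class name DateBytes and the compareBytes method are hypothetical; compareBytes only illustrates the unsigned, byte-wise ordering that BinaryComparator relies on.

```java
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

class DateBytes {
    // Hypothetical helper: parses "yyyy/MM/dd" and returns the epoch
    // milliseconds as a decimal string in UTF-8 bytes.
    static byte[] getBytesFromDate(String date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
        fmt.setLenient(false);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: dates are UTC
        try {
            long millis = fmt.parse(date).getTime();
            return Long.toString(millis).getBytes(StandardCharsets.UTF_8);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad date: " + date, e);
        }
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] start = getBytesFromDate("2012/08/14");
        byte[] end = getBytesFromDate("2012/08/24");
        System.out.println(new String(start, StandardCharsets.UTF_8)); // 1344902400000
        System.out.println(compareBytes(start, end) < 0);              // true
    }
}
```

Because the comparison is lexicographic, this encoding only orders dates correctly while all values have the same number of digits. Storing the timestamp as a fixed-width value may be a sturdier choice.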

Related

Match individual records during Batch predictions with VertexAI pipeline

I have a custom model in Vertex AI and a table that stores the model's features along with a record_id.
I am building a pipeline component for batch prediction and have run into a critical issue.
When I submit the batch prediction job I have to exclude record_id from the input, but how can I map each prediction back to its record if the result does not contain record_id?
from google.cloud import bigquery
from google.cloud import aiplatform
aiplatform.init(project=project_id)
client = bigquery.Client(project=project_id)
query = '''
SELECT * except(record_id) FROM `table`
'''
df = client.query(query).to_dataframe()  # drop record_id and load the rest into another table
job = client.load_table_from_dataframe(
    df, "table_wo_id",
)
clf = aiplatform.Model(model_name='custom_model')
clf.batch_predict(
    job_display_name='custom model batch prediction',
    bigquery_source='bq://table_wo_id',
    instances_format='bigquery',
    bigquery_destination_prefix='bq://prediction_result_table',
    predictions_format='bigquery',
    machine_type='n1-standard-4',
    max_replica_count=1,
)
As in the example above, there is no record_id column in prediction_result_table, so there is no way to map the results back to each record.

SparkRDD Operations

Let's assume I have a table with two columns, A and B, in a CSV file. I pick the maximum value from column A [max value = 100] and need to return the corresponding value of column B [return value = AliExpress] using JavaRDD operations, without using DataFrames.
Input Table :
COLUMN A Column B
56 Walmart
72 Flipkart
96 Amazon
100 AliExpress
Output Table:
COLUMN A Column B
100 AliExpress
This is what I have tried so far.
Source code:
SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");
JavaRDD<String> date = diskfile.flatMap(f -> Arrays.asList(f.split(",")[1]));
With the above code I can fetch only one column. Is there any way to get both columns? Any suggestions are welcome. Thanks in advance.
You can use either the top or the takeOrdered function to achieve this.
rdd.top(1) // gives you the top element of your RDD
Data:
COLUMN_A,Column_B
56,Walmart
72,Flipkart
96,Amazon
100,AliExpress
Creating a DataFrame using Spark 2:
val df = sqlContext.read.option("header", "true")
.option("inferSchema", "true")
.csv("filelocation")
df.show
import sqlContext.implicits._
import org.apache.spark.sql.functions._
Using DataFrame functions:
df.orderBy(desc("COLUMN_A")).take(1).foreach(println)
OUTPUT:
[100,AliExpress]
Using RDD functions
df.rdd
.map(row => (row(0).toString.toInt, row(1)))
.sortByKey(false)
.take(1).foreach(println)
OUTPUT:
(100,AliExpress)
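The answer above uses Scala and starts from a DataFrame, while the question asks for plain JavaRDD operations. The selection logic itself can be sketched in plain Java: parse each line into a (column A, column B) pair and keep the pair with the largest key. The class and method names here are hypothetical, and the sketch assumes a header line and integer values in column A. With Spark, the same parse function could presumably be passed to JavaRDD.map and a serializable comparator to JavaRDD.max, but that wiring is not shown or tested here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class MaxByColumnA {
    // Parses a line such as "56,Walmart" into the pair (56, "Walmart").
    static Map.Entry<Integer, String> parse(String line) {
        String[] parts = line.split(",", 2);
        Integer key = Integer.valueOf(parts[0].trim());
        return new SimpleEntry<Integer, String>(key, parts[1].trim());
    }

    // Skips the header line and returns the row with the largest column A value.
    static Map.Entry<Integer, String> maxRow(List<String> lines) {
        Map.Entry<Integer, String> best = null;
        for (String line : lines.subList(1, lines.size())) {
            Map.Entry<Integer, String> row = parse(line);
            if (best == null || row.getKey() > best.getKey()) {
                best = row;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no data rows");
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "COLUMN_A,Column_B",
                "56,Walmart",
                "72,Flipkart",
                "96,Amazon",
                "100,AliExpress");
        Map.Entry<Integer, String> max = maxRow(lines);
        System.out.println(max.getKey() + "," + max.getValue()); // 100,AliExpress
    }
}
```

Keeping both columns together in one pair is what the attempt in the question is missing: it extracts only the second field of each line, so the link between A and B is lost.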

HBase Get values where rowkey in

How do I get all the values in HBase given Rowkey values?
val tableName = "myTable"
val hConf = HBaseConfiguration.create()
val hTable = new HTable(hConf, tableName)
val theget= new Get(Bytes.toBytes("1001-A")) // rowkey values (1001-A, 1002-A, 2010-A, ...)
val result = hTable.get(theget)
val values = result.listCells()
The code above only works for one rowkey.
You can use batch operations; see the HTable Javadoc on batch operations.
Another approach is to Scan with a start row key and a stop row key (the first and last keys of the set sorted in ascending order). This makes more sense when there are many keys to fetch.
HTable also has a get method that takes a list of Gets:
List<Get> gets = ....
Result[] results = htable.get(gets);

HBase - get column names for row by column name prefix

I have an HBase table with the following layout.
For a row key, my columns have the form a_1, a_2, a_3, b_1, c_1, C_2 and so on, a compound key format.
Suppose one of my rows looks like this:
row key - row1
column family - c1
columns - a_1, a_2, a_3, b_1, b_2, c_1, C_2, d_9, d_99
Is there an operation that retrieves a, b, c, d as the columns for row1? I don't care what the suffixes of a, b, c... are.
I could fetch all column names for the row, split each name to take its first part, add those parts to a set and emit the set. Is there a better way of doing this with filters or some other HBase mechanism?
You can use ColumnPrefixFilter for that, as in the following code:
Configuration hadoopConf = new Configuration();
hadoopConf.set("hbase.zookeeper.quorum", "localhost");
hadoopConf.set("hbase.zookeeper.property.clientPort", "2181");
HTable hTable = new HTable(hadoopConf, "KunderaExamples");
Scan scan = new Scan();
// Qualifiers are case-sensitive: the prefix "A" does not match a_1, a_2, ...
scan.setFilter(new ColumnPrefixFilter("A".getBytes()));
ResultScanner scanner = hTable.getScanner(scan);
Iterator<Result> resultsIter = scanner.iterator();
while (resultsIter.hasNext())
{
    Result result = resultsIter.next();
    List<KeyValue> values = result.list();
    for (KeyValue value : values)
    {
        // Convert the byte arrays to strings; printing a byte[] directly shows only its reference
        System.out.println(Bytes.toStringBinary(value.getKey()));
        System.out.println(Bytes.toString(value.getQualifier()));
        System.out.println(Bytes.toString(value.getValue()));
    }
}
scanner.close();
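ColumnPrefixFilter narrows a scan to one known prefix; it does not by itself list the distinct prefixes present in a row. The client-side approach described in the question (read the qualifiers, keep the part before the underscore, collect the parts into a set) can be sketched as follows. The class and method names are hypothetical, and the qualifiers are assumed to be already decoded to strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class QualifierPrefixes {
    // Returns the distinct prefixes (text before the first '_') in sorted order.
    // A qualifier without '_' is kept whole.
    static Set<String> prefixes(List<String> qualifiers) {
        Set<String> result = new TreeSet<String>();
        for (String qualifier : qualifiers) {
            int idx = qualifier.indexOf('_');
            result.add(idx < 0 ? qualifier : qualifier.substring(0, idx));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList(
                "a_1", "a_2", "a_3", "b_1", "b_2", "c_1", "C_2", "d_9", "d_99");
        // Uppercase sorts before lowercase, and "C" stays distinct from "c"
        System.out.println(prefixes(columns)); // [C, a, b, c, d]
    }
}
```

Note that "C" and "c" come out as separate prefixes, since qualifiers are compared case-sensitively; lowercase the prefix first if the two should be merged.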

How to apply several QualifierFilter to a row in HBase

We would like to filter a scan on an HBase table with two QualifierFilters.
That is, we only want the rows of the table that have a certain column 'col_A' AND (!) a certain other column 'col_B'.
Our current approach looks like this:
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
Filter filter1 = new QualifierFilter(CompareOp.EQUAL, new BinaryComparator("col_A".getBytes()));
filterList.addFilter(filter1);
Filter filter2 = new QualifierFilter(CompareOp.EQUAL, new BinaryComparator("col_B".getBytes()));
filterList.addFilter(filter2);
Scan scan = new Scan();
scan.setFilter(filterList);
...
The ResultScanner does not return any results from this scan, although several rows in the HBase table have both columns 'col_A' and 'col_B'.
If we apply only filter1 to the scan, everything works fine and we get all the rows that have 'col_A'.
The same holds if we apply only filter2: we get all rows that have 'col_B'.
Only when we combine the two filters do we get no results.
What is the right way to get only the rows from the table that have both col_A AND col_B?
You can achieve this by defining the following filters:
List<Filter> filters = new ArrayList<Filter>(2);
byte[] colfam = Bytes.toBytes("c");
byte[] fakeValue = Bytes.toBytes("DOESNOTEXIST");
byte[] colA = Bytes.toBytes("col_A");
byte[] colB = Bytes.toBytes("col_B");
SingleColumnValueFilter filter1 =
new SingleColumnValueFilter(colfam, colA , CompareOp.NOT_EQUAL, fakeValue);
filter1.setFilterIfMissing(true);
filters.add(filter1);
SingleColumnValueFilter filter2 =
new SingleColumnValueFilter(colfam, colB, CompareOp.NOT_EQUAL, fakeValue);
filter2.setFilterIfMissing(true);
filters.add(filter2);
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters);
Scan scan = new Scan();
scan.setFilter(filterList);
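The idea here is to define one SingleColumnValueFilter per column you are looking for, each with a fake value and a CompareOp.NOT_EQUAL operator. Such a filter passes every row in which the named column holds any value other than the fake one,
and setFilterIfMissing(true) drops the rows in which the column is absent, so only rows that contain both columns remain.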
Source: http://mapredit.blogspot.com/2012/05/using-filters-in-hbase-to-match-two.html
I think this line is the issue:
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
You want it to be:
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE);
With MUST_PASS_ALL, both QualifierFilters are applied to the same column, so the scan looks for a single column that has both qualifiers, and no such column exists. Note that MUST_PASS_ONE also returns rows that contain only one of the two columns.
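To see why the combined QualifierFilters return nothing, it can help to model the evaluation outside HBase. The sketch below is a plain-Java model, not the HBase API: a row is reduced to its list of qualifiers, and each filter is a predicate tested against one qualifier at a time. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class PerCellFilterModel {
    static Predicate<String> qualifierEquals(String name) {
        return qualifier -> qualifier.equals(name);
    }

    // Stand-ins for the two QualifierFilters in the question.
    static final List<Predicate<String>> FILTERS =
            Arrays.asList(qualifierEquals("col_A"), qualifierEquals("col_B"));

    // Models MUST_PASS_ALL: a cell is kept only if every filter accepts it.
    static List<String> mustPassAll(List<String> rowQualifiers) {
        List<String> kept = new ArrayList<String>();
        for (String q : rowQualifiers) {
            if (FILTERS.stream().allMatch(f -> f.test(q))) {
                kept.add(q);
            }
        }
        return kept;
    }

    // Models MUST_PASS_ONE: a cell is kept if at least one filter accepts it.
    static List<String> mustPassOne(List<String> rowQualifiers) {
        List<String> kept = new ArrayList<String>();
        for (String q : rowQualifiers) {
            if (FILTERS.stream().anyMatch(f -> f.test(q))) {
                kept.add(q);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> both = Arrays.asList("col_A", "col_B", "col_C");
        List<String> onlyA = Arrays.asList("col_A", "col_C");
        System.out.println(mustPassAll(both));  // []
        System.out.println(mustPassOne(both));  // [col_A, col_B]
        System.out.println(mustPassOne(onlyA)); // [col_A]
    }
}
```

In this model MUST_PASS_ALL never keeps a cell, because no qualifier equals both names, while MUST_PASS_ONE also keeps cells from a row that has only col_A. That is why a row-level check, such as the SingleColumnValueFilter approach with setFilterIfMissing(true), is needed for a true AND.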
