how to search column value like '%test%' in hbase - hadoop

I have large text content stored in the column co, and I want to find the rows where co contains a particular word, like WHERE co LIKE '%test%' in an RDBMS. To achieve this, should I write a filter or a MapReduce job? Could somebody give an example of how to do this?

You can do something like
RegexStringComparator comp = new RegexStringComparator(".*test.*"); // or (\W|^)test(\W|$) if you want complete words only (double the backslashes in a Java string literal)
or
SubstringComparator comp = new SubstringComparator("test");
and then
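SingleColumnValueFilter filter = new SingleColumnValueFilter(
    Bytes.toBytes("COLUMN_FAMILY_NAME"),
    Bytes.toBytes("co"),
    CompareOp.EQUAL,
    comp
);
// By default, rows that do not have the column at all also pass the filter.
filter.setFilterIfMissing(true);
scan.setFilter(filter);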
Note that the performance of this will not be spectacular, because HBase has to look at each instance of the column in your table.
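To see what the two patterns accept, here is a minimal sketch in plain Java, with no HBase involved. It assumes the comparator behaves like java.util.regex with "find" semantics; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class LikeMatchSketch {
    // Rough equivalent of SQL LIKE '%word%': true if word occurs anywhere in value.
    static boolean containsSubstring(String value, String word) {
        return value.contains(word);
    }

    // Whole-word variant: word must be surrounded by non-word characters
    // or by the start/end of the string.
    static boolean containsWholeWord(String value, String word) {
        Pattern p = Pattern.compile("(\\W|^)" + Pattern.quote(word) + "(\\W|$)");
        return p.matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSubstring("my testing notes", "test")); // true
        System.out.println(containsWholeWord("my testing notes", "test")); // false
        System.out.println(containsWholeWord("run the test now", "test")); // true
    }
}
```

The substring check is the cheaper of the two, so it is probably the one to prefer when whole-word matching is not required.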

Related

how to change hbase table scan results order

I am trying to copy specific data from one HBase table to another, which requires scanning the table for row keys only and parsing a specific value out of each key. It works fine, but I noticed that the results are returned in ascending order (alphabetically, in this case). Is there a way to request reverse order, or perhaps ordering by insert timestamp?
Scan scan = new Scan();
scan.setMaxResultSize(1000);
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner scanner = TestHbaseTable.getScanner(scan);
for (Result r : scanner) {
    System.out.println(Bytes.toString(r.getRow()));
    String rowKey = Bytes.toString(r.getRow());
    if (rowKey.startsWith("dm.") || rowKey.startsWith("bk.") || rowKey.startsWith("rt.")) {
        continue;
    } else if (rowKey.startsWith("yt")) {
        // Escape the dot: an unescaped "." matches every character.
        List<String> ytresult = Arrays.asList(rowKey.split("\\s*\\.\\s*"));
        .....
This table is huge so I would prefer to skip to the rows I actually need. Appreciate any help here.
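Have you tried setReversed(true) on the Scan? Keep in mind that in this case your start row has to be the logical END of your row key range, and the scan runs 'upwards' from there. HBase always returns rows sorted by row key, so ordering by insert timestamp is only possible if the timestamp is part of the row key.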
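The start-row point can be illustrated with a sorted map standing in for the table. This is only a sketch in plain Java, not HBase client code; the keys and the "yt.~" end marker are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class ReverseScanSketch {
    // Returns up to `limit` keys at or below `startRow`, in descending order,
    // mimicking how a reversed scan walks a table sorted by row key.
    static List<String> reverseFrom(TreeMap<String, String> table, String startRow, int limit) {
        List<String> keys = new ArrayList<>();
        for (String key : table.headMap(startRow, true).descendingKeySet()) {
            if (keys.size() == limit) {
                break;
            }
            keys.add(key);
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        for (String k : new String[] {"bk.1", "dm.1", "rt.1", "yt.1", "yt.2", "zz.1"}) {
            table.put(k, "");
        }
        // Start at the logical end of the "yt." range and walk backwards.
        System.out.println(reverseFrom(table, "yt.~", 2)); // [yt.2, yt.1]
    }
}
```

Starting at "yt.~" rather than "yt." matters here: a reversed walk that starts at the beginning of the range would immediately leave it.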

Pig latin join by field

I have a Pig latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
I would like to join those two datasets: for each value in dataset A, look up the corresponding value in dataset B and place it beside the value from A. So the expected output is:
FITKA 0.124411, FINVA 0.454535 and so on ..
(It could also be: FITKA, 0.124411, FINVA, 0.454535 and so on ..)
Then I would be able to multiply the values (0.124411 x 0.454535 .. and so on), because they are on the same row, and this is what I want.
Of course I can join column by column, but then the values are appended at the end of the row and I have to clean that up with another FOREACH ... GENERATE. I would prefer a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
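Outside of Pig, the lookup-and-multiply step the question is after can be sketched like this in plain Java. It assumes that words without an entry in the values dataset are simply skipped, which the question does not specify; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SentenceValueSketch {
    // Multiplies the values of all words that have an entry in `values`.
    // Words without an entry are skipped (an assumption, see above).
    static double product(List<String> sentence, Map<String, Double> values) {
        double result = 1.0;
        for (String word : sentence) {
            Double v = values.get(word);
            if (v != null) {
                result *= v;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Data taken from the question.
        List<String> sentence = Arrays.asList("FITKA", "FINVA", "FINVU", "FEEVA", "FETKA", "FINVA");
        Map<String, Double> values = new HashMap<>();
        values.put("FINVA", 0.454535);
        values.put("FITKA", 0.124411);
        values.put("FEEVA", 0.123133);
        System.out.println(product(sentence, values));
    }
}
```

Because the sentence is kept as a list of words, a repeated word such as FINVA contributes its value once per occurrence.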

How do I split birt dataset column into multiple rows

My datasource has a column that contains a comma-separated list of numbers.
I want to create a dataset that takes those numbers and turns them into groupings to use in a bar chart.
Requirements:
numbers will be between 0 and 17 inclusive
groupings: 0-2, 3-5, 6-10, 11-17
x-axis labels have to be the groupings
y-axis is the percent of rows that contain that grouping
Note that because each row can contribute to multiple columns, the percentages can add up to more than 100%.
Any help you can offer would be awesome. I'm very new to BIRT and have been stuck on this for a couple of days now.
Not sure that I understand the requirements exactly, but your basic question "split dataset column into multiple rows" can be solved either using a scripted dataset or with pure SQL (depending on your DB).
Either way, you will need a second dataset (i.e. your data model is master-detail), and in your layout you will need something like
Table/List "Master" bound to master DS
Table/List "Detail" bound to detail DS
The detail DS needs the comma-separated result column from the master DS as an input parameter of type "String".
Doing this with a scripted dataset is quite easy if you understand Javascript AND you understand how scripted datasets work: Create a report variable "myValues" of type object with a default value of null and a second report variable "myValuesIndex" of type integer with a default value of 0.
(Note: this is all untested!)
Create the dataset "detail" as a scripted DS, with one input parameter "csv" of type String and one output parameter "value" of type String.
In the open event of the scripted DS, code:
vars["myValues"] = this.getInputParameterValue("csv").split(",");
vars["myValuesIndex"] = 0;
In the fetch event, code:
var i = vars["myValuesIndex"];
var len = vars["myValues"].length;
if (i < len) {
    row["value"] = vars["myValues"][i];
    vars["myValuesIndex"] = i + 1;
    return true;
} else {
    return false;
}
For example, for the master DS result row with csv = "1,2,3-4,foo", the detail DS will result in 4 rows with
value = "1"
value = "2"
value = "3-4"
value = "foo"
Using an Oracle DB, this can be done without Javascript. The detail DS (with the same input parameter as above) would then look like:
select t.value as value from table(split(?)) t
For the definition of the split function, see RedFilter's answer on
Is there a function to split a string in PL/SQL?
If you get ORA-22813, you should change the original definition
create or replace type split_tbl as table of varchar2(32767);
to
create or replace type split_tbl as table of varchar2(4000);
as mentioned on https://community.oracle.com/thread/2288603?tstart=0
It's also possible with pure SQL in 11g using regexp_substr (see the same page).
Create parameters in the scripted data set. After assigning the scripted data set to a table, pass (link) the actual data set values to the scripted data set's parameters through the table's data set parameter binding.
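Once the comma-separated values are split, the chart numbers from the question (percent of rows that contain each grouping) can be computed along these lines. This is a sketch in plain Java, not BIRT script; the class name, method name and sample rows are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupingPercentSketch {
    // Group boundaries from the question: 0-2, 3-5, 6-10, 11-17 (inclusive).
    static final int[][] RANGES = {{0, 2}, {3, 5}, {6, 10}, {11, 17}};

    // For each grouping, the percentage of rows whose list has at least one number in it.
    static Map<String, Double> percentByGroup(List<String> rows) {
        int[] hits = new int[RANGES.length];
        for (String csv : rows) {
            boolean[] seen = new boolean[RANGES.length];
            for (String token : csv.split(",")) {
                String t = token.trim();
                if (t.isEmpty()) {
                    continue;
                }
                int n = Integer.parseInt(t);
                for (int g = 0; g < RANGES.length; g++) {
                    if (n >= RANGES[g][0] && n <= RANGES[g][1]) {
                        seen[g] = true;
                    }
                }
            }
            // Count each row at most once per grouping.
            for (int g = 0; g < RANGES.length; g++) {
                if (seen[g]) {
                    hits[g]++;
                }
            }
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (int g = 0; g < RANGES.length; g++) {
            double pct = rows.isEmpty() ? 0.0 : 100.0 * hits[g] / rows.size();
            result.put(RANGES[g][0] + "-" + RANGES[g][1], pct);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1,4", "2,3,12", "7", "0,17");
        System.out.println(percentByGroup(rows));
        // {0-2=75.0, 3-5=50.0, 6-10=25.0, 11-17=50.0}
    }
}
```

The percentages in the sample add up to 200%, which matches the remark in the question that one row can contribute to several groupings.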

Hadoop, how to normalize multiple columns data?

I have a .txt file like this
1036177 19459.7356 17380.3761 18084.1440
1045709 19674.2457 17694.8674 18700.0120
1140443 19772.0645 17760.0904 19456.7521
where the first column represents the key and the others are the values.
I would like to normalize (min-max) each column and then sum up the columns.
Can someone give me some advice on how to do that in MapReduce?
From an algorithmic perspective you'll need to:
Mapper
Parse / tokenize each input line by its delimiter (space?)
Use a Text object to encapsulate the key field
Either create a custom value class to encapsulate the other fields or use an ArrayWritable wrapper
Output this Key / Value from your Mapper
Reducer
All values will be grouped by the same key, so here you'll just need to process each input value and calculate the min, max and sum for each column
Finally output your result
You might want to look at using Apache Pig, which should make this task much easier (untested; this groups by key as in the reducer above, use GROUP A ALL instead if you need each column's min and max over the whole file):
grunt> A = LOAD '/path/to/data.txt' USING PigStorage(' ')
           AS (key, fld1:float, fld2:float, fld3:float);
grunt> GRP = GROUP A BY key;
grunt> B = FOREACH GRP GENERATE $0, MIN(A.fld1), MAX(A.fld1), SUM(A.fld1),
           MIN(A.fld2), MAX(A.fld2), SUM(A.fld2),
           MIN(A.fld3), MAX(A.fld3), SUM(A.fld3);
grunt> STORE B INTO '/path/to/output' USING PigStorage('\t', '-schema');
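For reference, the arithmetic itself can be sketched in plain Java on a small in-memory sample; this is not a MapReduce job. It assumes that min-max normalization means (v - min) / (max - min) with min and max taken per column over all rows, that every row has the same number of columns, and that a constant column normalizes to 0. The class and method names are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MinMaxSketch {
    // Input lines look like "key v1 v2 v3".
    // Returns key -> sum of that row's min-max normalized values.
    static Map<String, Double> normalizeAndSum(List<String> lines) {
        Map<String, Double> sums = new LinkedHashMap<>();
        if (lines.isEmpty()) {
            return sums;
        }
        String[] keys = new String[lines.size()];
        double[][] data = new double[lines.size()][];
        for (int r = 0; r < lines.size(); r++) {
            String[] parts = lines.get(r).trim().split("\\s+");
            keys[r] = parts[0];
            data[r] = new double[parts.length - 1];
            for (int c = 1; c < parts.length; c++) {
                data[r][c - 1] = Double.parseDouble(parts[c]);
            }
        }
        // Pass 1: minimum and maximum of each column over all rows.
        int cols = data[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : data) {
            for (int c = 0; c < cols; c++) {
                min[c] = Math.min(min[c], row[c]);
                max[c] = Math.max(max[c], row[c]);
            }
        }
        // Pass 2: normalize each value and sum across the row.
        for (int r = 0; r < data.length; r++) {
            double sum = 0.0;
            for (int c = 0; c < cols; c++) {
                double range = max[c] - min[c];
                sum += range == 0.0 ? 0.0 : (data[r][c] - min[c]) / range;
            }
            sums.put(keys[r], sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1036177 19459.7356 17380.3761 18084.1440",
            "1045709 19674.2457 17694.8674 18700.0120",
            "1140443 19772.0645 17760.0904 19456.7521");
        normalizeAndSum(lines).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

The two passes are the important part: the normalization of any one row depends on every other row, so in MapReduce this would likely take one job (or one pass) to find the column extremes and a second to apply them.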

UnGroup in Apache Pig

Does Apache Pig support an UNGROUP operation? I guess not. So could anyone help me out with this problem?
I have rows of the form
1,a-b-c
2,d-e-f
3,g-h
I would like to expand it to the form
1,a
1,b
1,c
2,d
2,e
2,f
3,g
3,h
Any help appreciated.
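You should probably use the builtin TOKENIZE with '-' as the delimiter to split your second field into a bag of tokens, and then apply FLATTEN to create 1 row per element. (STRSPLIT returns a tuple, and flattening a tuple produces extra columns rather than extra rows.) Something like this:
A = LOAD 'input.txt' USING PigStorage(',') AS (id:chararray, data:chararray);
B = FOREACH A GENERATE id, FLATTEN(TOKENIZE(data, '-'));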
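The same split-then-flatten idea, written out in plain Java as a rough sketch of what the expansion does to the sample rows. The class and method names are hypothetical, and rows without a comma are skipped as malformed, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UngroupSketch {
    // Expands "id,a-b-c" into "id,a", "id,b", "id,c".
    static List<String> ungroup(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String[] parts = row.split(",", 2);
            if (parts.length < 2) {
                continue; // no comma: treated as malformed and skipped
            }
            for (String token : parts[1].split("-")) {
                if (!token.isEmpty()) {
                    out.add(parts[0] + "," + token);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : ungroup(Arrays.asList("1,a-b-c", "2,d-e-f", "3,g-h"))) {
            System.out.println(line);
        }
    }
}
```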
