HBase shell - Retrieve (only) column values (and not column name) - hadoop

I am pretty new to Hadoop and HBase, and I am trying to learn and evaluate whether they can be used for my use case. Being new to Java (I am basically a Perl/Unix and DB developer), I am trying to find a solution in the HBase shell if possible.
I have an HBase table (schema below) in which I am trying to store historical data (which can be used for audit and analytics).
Assume the basic structure is as below:
rowkey 'cf1:id', 'cf1:price', 'cf1:user', 'cf1:timestamp'
Now,
rowkey - instrument or any object
id - used to identify which column has the latest data. The first entry will have 1 as its value, incrementing from there
user - the user who updated the data
e.g.
initially the data looks like:
hbase(main):009:0> scan 'price_history'
ROW COLUMN+CELL
row1 column=cf1:id, timestamp=1389020633920, value=1
row1 column=cf1:pr, timestamp=1389020654614, value=109.45
row1 column=cf1:us, timestamp=1389020668338, value=feed
row2 column=cf1:id, timestamp=1389020687334, value=1
row2 column=cf1:pr, timestamp=1389020697880, value=1345.65
row2 column=cf1:us, timestamp=1389020708403, value=feed
Now assume row2 (or instrument 2) is updated on the same day with a new price:
hbase(main):003:0> scan 'price_history'
ROW COLUMN+CELL
row1 column=cf1:id, timestamp=1389020633920, value=1
row1 column=cf1:pr, timestamp=1389020654614, value=109.45
row1 column=cf1:us, timestamp=1389020668338, value=feed
row2 column=cf1:id, timestamp=1389020859674, value=2
row2 column=cf1:pr, timestamp=1389020697880, value=1345.65
row2 column=cf1:pr1, timestamp=1389020869856, value=200
row2 column=cf1:us, timestamp=1389020708403, value=feed
row2 column=cf1:us1, timestamp=1389020881601, value=user1
As you can see, id is changed to 2 to indicate that the second set of data is the latest, and the new values were added as new columns.
What I want is,
1) Can I fetch just the value of the id column? i.e. the output should be 1 or 2, not all the other attributes.
2) Based on the above output I will fetch the further data, but can I also search by value and get the rowkey as the output? i.e. something like: give me the rows having a given VALUE, with row1 as the output (the output can be a list: row1, row2, ..., rown).
Please assist with an HBase shell approach as much as possible (other solutions are also welcome).
Also, if any architects can suggest a better way to model the table to keep track of changes/versions of prices, that is welcome too.
Thanks.

That is going to be tough to do in the shell without a lot of piping output and grepping the results. The shell's output formatting also makes this difficult because of how it breaks up lines. A lighter-weight solution than writing Java would be to write your scanner in Ruby. HBase ships with the JRuby jar and lets you execute Ruby scripts.
include Java
import "org.apache.hadoop.hbase.HBaseConfiguration"
import "org.apache.hadoop.hbase.client.Scan"
import "org.apache.hadoop.hbase.client.HTable"
import "org.apache.hadoop.hbase.util.Bytes"

config = HBaseConfiguration.create()

# restrict the scan to the one column we care about
family = Bytes.toBytes("family-name")
qualifier = Bytes.toBytes("qualifier")
scan = Scan.new()
scan.addColumn(family, qualifier)

table = HTable.new(config, "table-name")
scanner = table.getScanner(scan)

# print only the latest value of that column for each row
scanner.each do |result|
  keyval = result.getColumnLatest(family, qualifier)
  puts "#{Bytes.toDouble(keyval.getValue())}"
end
That should get you pretty close, and you can add additional data to the output, for example the row key. To run it, just use hbase org.jruby.Main your_ruby_file.rb
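For the second question (searching by value and returning the row key), here is a minimal sketch replacing the loop above, assuming the same setup; the literal "2" is just an illustrative id value, and values written from the shell are read back here as strings:
# print "rowkey value" pairs, keeping only rows whose scanned column equals "2"
scanner.each do |result|
  keyval = result.getColumnLatest(family, qualifier)
  row = Bytes.toString(result.getRow())
  value = Bytes.toString(keyval.getValue())
  puts "#{row} #{value}" if value == "2"
end
(As an aside, for the first question alone the shell can already restrict a scan to a single column, e.g. scan 'price_history', {COLUMNS => 'cf1:id'}, though the output still includes row keys and timestamps.)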

Related

How to extract key-value pairs from CSV using Talend

I have data for one column in my CSV file as:
`column1`
row1 : {'name':'Steve Jobs','location':'America','status':'none'}
row2 : {'name':'Mark','location':'America','status':'present'}
row3 : {'name':'Elan','location':'Canada','status':'present'}
I want the output for that column to be:
`name` `location` `status`
Steve Jobs America none
Mark America present
Elan Canada present
But sometimes I have a row value like {'name':'Steve Jobs','location':'America','status':'none'},{'name':'Mark','location':'America','status':'present'}
Please help!
You have to use the tMap and tExtractDelimitedFields components.
Flow: input -> tMap -> tExtractDelimitedFields -> tMap -> output
Below is the step-by-step explanation:
1. Original data - row1 : {'name':'Steve Jobs','location':'America','status':'none'}
2. Substring the value inside the braces using the function below:
row1.Column0.substring(row1.Column0.indexOf("{") + 1, row1.Column0.indexOf("}"))
Now the result is - 'name':'Steve Jobs','location':'America','status':'none'
3. Extract the single column into multiple columns using tExtractDelimitedFields. Since the fields are separated by commas, set the delimiter to a comma; and since there are 3 fields in the data, create 3 fields in the component schema.
Now the result is:
name location status
'name':'Steve Jobs' 'location':'America' 'status':'none'
'name':'Mark' 'location':'America' 'status':'present'
'name':'Elan' 'location':'Canada' 'status':'present'
4. Using one more tMap, remove the key names and single quotes from the data:
row2.name.replaceAll("'name':", "").replaceAll("'", "")
row2.location.replaceAll("'location':", "").replaceAll("'", "")
row2.status.replaceAll("'status':", "").replaceAll("'", "")
Your final result is:
name location status
Steve Jobs America none
Mark America present
Elan Canada present
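Put together, the same transformation can be sketched in plain Java outside Talend (String.split stands in for tExtractDelimitedFields here; the sample row is taken from the question):
public class ExtractFields {
    public static void main(String[] args) {
        String row = "{'name':'Steve Jobs','location':'America','status':'none'}";
        // step 2: keep only the text inside the braces
        String inner = row.substring(row.indexOf("{") + 1, row.indexOf("}"));
        // step 3: split on commas into the three key:value fields
        String[] fields = inner.split(",");
        // step 4: strip the key names and single quotes, as in the tMap expressions
        String name     = fields[0].replaceAll("'name':", "").replaceAll("'", "");
        String location = fields[1].replaceAll("'location':", "").replaceAll("'", "");
        String status   = fields[2].replaceAll("'status':", "").replaceAll("'", "");
        System.out.println(name + " " + location + " " + status); // Steve Jobs America none
    }
}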

Pig Latin join by field

I have a Pig Latin-related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two datasets joined: for each value in dataset A, I would get the corresponding value from dataset B and place it beside it. So the expected output is below:
FITKA 0.124411, FINVA 0.454535 and so on ...
(They can also be like: FITKA, 0.124411, FINVA, 0.454535 and so on ...)
And then I would be able to multiply the values (0.124411 x 0.454535 ... and so on), because they are on the same row now, and this is what I want.
Of course I can join column by column, but then the values end up at the end of the row and I have to clean that up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple that contains a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
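With the data in that shape, the lookup against B becomes an ordinary join. Here is a rough sketch; all aliases and field names are illustrative, a sentence id is assumed so the words can be regrouped into one row afterwards, and B's value field is loaded as double so the values can be multiplied later:
-- one bag of words per sentence (schema is hypothetical)
A = LOAD 'records' AS (sentence_id:int, words:bag{t:tuple(word:chararray)});
B = LOAD 'values' AS (f1:chararray, f2:double);
-- one row per (sentence, word)
flat = FOREACH A GENERATE sentence_id, FLATTEN(words) AS word;
-- attach the value for each word
joined = JOIN flat BY word, B BY f1;
-- regroup so each sentence is a single row of (word, value) pairs again
grouped = GROUP joined BY flat::sentence_id;
result = FOREACH grouped GENERATE group, joined.(flat::word, B::f2);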

Pig Latin using two data sources in one FILTER statement

In my Pig script, I am reading data from more than 5 data sources (Hive tables), where one is the main source of data and the rest are dimension data tables. I am trying to filter the main data source relation (or alias) w.r.t. some value in one of the dimension relations.
E.g.
-- main_data is main data source and dept_data is department data
filtered_data1 = FILTER main_data BY deptID == dept_data.departmentID;
filtered_data2 = FOREACH filtered_data1 GENERATE $0, $1, $3, $7;
In my Pig script there are at minimum 20 instances where I need to match some value between multiple data sources and produce a new relation. But I am getting an error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias filtered_data1.
Backend error : Scalar has more than one row in the output. 1st : ( ..... ) 2nd : ( .... )
Details at logfile: /root/pig_1403263965493.log
I tried the "relation::field" approach as well, to no avail. Alternatively, I am joining these two relations (data sources) to get the filtered data, but I feel this will slow down the execution and unnecessarily shuffle huge amounts of data.
Please guide me on how to use two or more data sources in one FILTER statement, something like in SQL, so that I can avoid JOIN statements and get it done from the FILTER statement itself:
Where A.deptID = B.departmentID And A.sectionID = C.sectionID And A.cityID = D.cityID
If you want to match records from different tables by a single ID, you pretty much have to use a join; after the join, the fields are referenced with the :: disambiguator:
Where A::deptID = B::departmentID And A::sectionID = C::sectionID And A::cityID = D::cityID
If you just want to keep the records that occur in all other tables, you could probably go for an INTERSECT and then a
FILTER BY someID IN someIDList
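For reference, a sketch of the join-based version for a single dimension table; the field names here are hypothetical, it assumes main_data was loaded with a schema, and the same pattern repeats per dimension:
-- the join replaces the scalar comparison that caused ERROR 1066
joined = JOIN main_data BY deptID, dept_data BY departmentID;
-- project only the main-table fields, disambiguated with ::
filtered_data2 = FOREACH joined GENERATE main_data::f0, main_data::f1, main_data::f3, main_data::f7;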

Using XPATH to select the row *after* a row containing certain text

I can't work out whether this is possible or not. I've got a basic table, but that table has a varying number of rows and varying data within it.
Assuming the table is just one column wide and an arbitrary number of rows long, to select a row containing the text "COW" I can do something very simple like:
table/tbody/tr[contains(td[1],"COW")]/td[1]
But let's say that this table contains two types of data: a list of animals and, underneath each animal, a list of attributes, all in the same column, looking something like this:
COW
Horns = 2
Hooves = 4
Tail = 1
CHICKEN
Horns = 0
Hooves = 0
Tail = 1
Is there a way using XPath to first identify the row that contains the text COW and then select the row directly after it, to return the text "Horns = 2"?
Cheers
It seems that you want something like this:
table/tbody/tr[contains(td[1],"COW")]/following-sibling::tr[1]/td[1]
This will select the first td in the row immediately following the row whose first td contains COW.
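For concreteness, here is a minimal sketch of the markup the expression assumes, reconstructed from the question's description (a real page may nest things differently):
<table>
  <tbody>
    <tr><td>COW</td></tr>
    <tr><td>Horns = 2</td></tr>  <!-- selected: first row after the COW row -->
    <tr><td>Hooves = 4</td></tr>
    <tr><td>Tail = 1</td></tr>
    <tr><td>CHICKEN</td></tr>
    <tr><td>Horns = 0</td></tr>
  </tbody>
</table>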

Max/Min for whole sets of records in PIG

I have a set of records that I am loading from a file, and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in Pig as well, but I'm having trouble finding it. Pig has MAX and MIN functions, but when I tried the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
This didn't work. I had better luck adding an extra column with the same value to each row, grouping on that column, and then taking the max of that group. That seems like a convoluted way of getting what I want, so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said, you need to group all the data together (MAX and MIN are aggregate functions that operate on a bag, so they cannot be applied to an ungrouped column), but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group GENERATE
    FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)
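To also carry the min, as in the SQL version in the question, the same grouped relation can produce both aggregates at once; a straightforward extension of the code above:
with_max_min = FOREACH records_group GENERATE
    FLATTEN(records.(state, population)),
    MAX(records.population) AS max_pop,
    MIN(records.population) AS min_pop;
For the sample input this would give (CA,10,10,2), (VA,5,10,2), (WI,2,10,2).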
