My rowkeys in HBase look like this:
a1s1
a1s2
a1s3
a2s1
a3s1
a3s2
...
I want to get only this data:
a1s1
a2s1
a3s1
But when I run this query: scan 't1', {STARTROW=>'a1s1', ENDROW=>'a4s1'}
It gives me:
a1s1
a1s2
a1s3
a2s1
a3s1
But I don't want to get a1s2 and a1s3. How can I do this?
You should combine STARTROW/ENDROW with a RowFilter that uses a RegexStringComparator. If you use only the start/end row range, HBase compares the rowkeys lexicographically, character by character, because the rowkey is not numeric, so every key between a1s1 and a4s1 (including a1s2 and a1s3) falls inside the range. In the HBase shell you can try this:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.RegexStringComparator
scan 't1', {STARTROW => 'a1s1', ENDROW => 'a4s1', FILTER => org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),RegexStringComparator.new("s1$"))}
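If you need the same thing from the Java client, here is a minimal sketch using the pre-2.x HBase Java API (the connection handling and class name are my assumptions; the filter is the same RowFilter/RegexStringComparator combination as in the shell command above):
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySuffixScan {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("t1"))) {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("a1s1"));  // the range limits how much is scanned...
            scan.setStopRow(Bytes.toBytes("a4s1"));
            // ...and the RowFilter keeps only rowkeys ending in "s1"
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                    new RegexStringComparator("s1$")));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow())); // a1s1, a2s1, a3s1
                }
            }
        }
    }
}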
I assume you want to get the row keys starting with "a" and ending with "s1".
So you can use either of the below:
scan 't1', { ENDROW=>'s1'}
Or
scan 't1', {STARTROW=>'a', ENDROW=>'s1'}
Another option is using regexString:
scan 't1', {FILTER => "RowFilter(=, 'regexstring:.*s1$')"}
Related
Hi, I have a SQL table where I am storing values like this:
Column Name: Registration_ID
180,1801,1803,18011,220
180,1801,
180,1801,1803
Now I want to match the exact Registration_ID and get records based on it. I have tried Contains, but it does not match exact values.
Here is my query:
var Result=db.Entity_StudentRepository.Get(x =>
x.Registration_ID.Contains(Used_For_Id.ToString())).Select(x => x.Registration_ID).ToArray();
Could you please try the following query and let me know if it works? Note that AsEnumerable() switches to in-memory (LINQ to Objects) evaluation, since Split cannot be translated to SQL:
db.Entity_StudentRepository.AsEnumerable().Where(t => t.Registration_ID.Split(',').Select(int.Parse).Contains(Used_For_Id));
I was doing a scan using startRowKey and stopRowKey in the HBase shell, but the output I am receiving is outside the range passed. Please refer to the HBase query:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'TableName',{ LIMIT => 2 , STARTROW => '000|9223370554721275807', STOPROW => '101|9223370554727575807', FILTER => SingleColumnValueFilter.new(Bytes.toBytes('col_family'), Bytes.toBytes('col_qualifier'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('Some Value')), COLUMNS => 'col_family:col_qualifier', REVERSED => false}
But the output received is outside this range:
016|9223370554960173487
021|9223370555154148992
Please let me know if my search query is correct, or what the root cause for this could be. Any help will be really appreciated.
Thanks
If you put the four rowkeys mentioned in your question in a file and sort them, the result will be:
000|9223370554721275807
016|9223370554960173487
021|9223370555154148992
101|9223370554727575807
HBase compares rowkeys lexicographically, byte by byte, so both 016|9223370554960173487 and 021|9223370555154148992 sort between your STARTROW and STOPROW. Thus the values you received are not outside the range of your scan.
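If you want to double-check this yourself, here is a small, hypothetical snippet that compares the keys the way HBase does, lexicographically on the raw bytes, using Bytes.compareTo (STARTROW inclusive, STOPROW exclusive):
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyRangeCheck {
    public static void main(String[] args) {
        byte[] start = Bytes.toBytes("000|9223370554721275807");
        byte[] stop = Bytes.toBytes("101|9223370554727575807");
        String[] returned = {"016|9223370554960173487", "021|9223370555154148992"};
        for (String key : returned) {
            byte[] k = Bytes.toBytes(key);
            // byte-by-byte comparison, the same way scan ranges are evaluated
            boolean inRange = Bytes.compareTo(start, k) <= 0 && Bytes.compareTo(k, stop) < 0;
            System.out.println(key + " in range: " + inRange); // prints true for both keys
        }
    }
}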
Actually I want to get the full table, but it should be based on a distinct Doc_Type.
That means it should only pick the records from the table that have a unique Doc_Type. I have tried the following, but it returns a single column into ToList(), whereas I want the full table. How can I do it?
var data = DB.tblDocumentTypes.Select(m => m.Doc_Type).Distinct().ToList();
You can use GroupBy:
var data = DB.tblDocumentTypes.GroupBy(m => m.Doc_Type).Select(x => x.First());
Let's say a comments table has the following structure:
id | author | timestamp | body
I want to use an index to efficiently execute the following query:
r.table('comments').getAll("me", {index: "author"}).orderBy('timestamp').run(conn, callback)
Is there another efficient method I can use?
It looks like an index is currently not supported for a filtered result of a table. When creating an index for timestamp and adding it as a hint in orderBy('timestamp', {index: timestamp}), I'm getting the following error:
RqlRuntimeError: Indexed order_by can only be performed on a TABLE. in:
This can be accomplished with a compound index on the "author" and "timestamp" fields. You can create such an index like so:
r.table("comments").index_create("author_timestamp", lambda x: [x["author"], x["timestamp"]])
Then you can use it to perform the query like so:
r.table("comments")
.between(["me", r.minval], ["me", r.maxval]
.order_by(index="author_timestamp)
The between works like the get_all did in your original query because it gets only documents that have the author "me" and any timestamp. Then we do an order_by on the same index, which orders by the timestamp (since all of the keys have the same author). The key here is that you can only use one index per table access, so we need to cram all of this information into the same index.
It's currently not possible to chain a getAll with an orderBy using indexes twice.
Ordering with an index can be done only on a table right now.
NB: The command to orderBy with an index is orderBy({index: 'timestamp'}) (no need to repeat the key)
The answer by Joe Doliner was selected, but it seems wrong to me.
First, in the between command no index was specified, therefore between will use the primary index.
Second, between returns a selection:
table.between(lowerKey, upperKey[, {index: 'id', leftBound: 'closed', rightBound: 'open'}]) → selection
and orderBy cannot run on a selection with an index; only a table can use an index:
table.orderBy([key1...], {index: index_name}) → selection<stream>
selection.orderBy(key1, [key2...]) → selection<array>
sequence.orderBy(key1, [key2...]) → array
You want to create what's called a "compound index." After that, you can query it efficiently.
//create compound index
r.table('comments')
.indexCreate(
'author__timestamp', [r.row("author"), r.row("timestamp")]
)
//the query
r.table('comments')
.between(
['me', r.minval],
['me', r.maxval],
{index: 'author__timestamp'}
)
.orderBy({index: r.desc('author__timestamp')}) //or "r.asc"
.skip(0) //pagi
.limit(10) //nation!
I like using two underscores for compound indexes. It's just stylistic; it doesn't matter how you choose to name your compound index.
Reference: How to use getall with orderby in RethinkDB
I'm trying to figure out how to do the equivalent of Oracle's LEAD and LAG in HBase, or some other pattern that will solve my problem. I could write a MapReduce program that does this quite easily, but I'd love to be able to exploit the fact that the data is already sorted in the way I need it to be.
My problem is as follows: I have a rowkey and a value that looks like:
(employee name + timestamp) => data:salary
So, some example data might be:
miller, bob;2010-01-14 => data:salary=90000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2011-12-03 => data:salary=107000
monty, fred;2010-04-10 => data:salary=19000
monty, fred;2011-09-09 => data:salary=24000
What I want to do is calculate the changes of salary, record by record. I want to transform the above data into differences between records:
miller, bob;2010-01-14 => data:salarydiff=90000
miller, bob;2010-11-04 => data:salarydiff=12000
miller, bob;2011-12-03 => data:salarydiff=5000
monty, fred;2010-04-10 => data:salarydiff=19000
monty, fred;2011-09-09 => data:salarydiff=5000
I'm up for changing the rowkey strategy if necessary.
What I'd do is change the key so that the timestamp is descending (newer salary first):
miller, bob;2011-12-03 => data:salary=107000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2010-01-14 => data:salary=90000
Now you can run a simple map-only job that scans the table. In the map you create a new Scan starting at the current key, call next() on the scanner to get the previous (older) salary, calculate the diff, and store it in a new column on the current row key.
Basically, in your mapper class (the one that extends TableMapper) you override the setup method and get the configuration:
@Override
protected void setup(Mapper.Context context) throws IOException, InterruptedException {
    Configuration config = context.getConfiguration();
    // keep an HTable handle so the map() method can issue extra scans and puts
    table = new HTable(config, <Table Name>);
}
Then inside the map you extract the row key from the row parameter, create the new Scan, and continue as explained above; a sketch of what that could look like is below.
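This is only a rough sketch of such a map method, using the same old-style HTable API as the setup above. It assumes the salary is stored as an 8-byte long, that the employee name and timestamp are separated by ';' in the rowkey, and that the oldest record for an employee keeps its full salary as the diff (as in the example output in the question):
// extra imports needed in the mapper class:
//   org.apache.hadoop.hbase.client.Put, Result, ResultScanner, Scan
//   org.apache.hadoop.hbase.io.ImmutableBytesWritable
//   org.apache.hadoop.hbase.util.Bytes
@Override
protected void map(ImmutableBytesWritable rowKey, Result current, Context context)
        throws IOException, InterruptedException {
    byte[] key = rowKey.get();
    byte[] family = Bytes.toBytes("data");
    long currentSalary = Bytes.toLong(current.getValue(family, Bytes.toBytes("salary")));

    // Scan forward from the current key: with descending timestamps the next row
    // for the same employee holds the previous (older) salary.
    Scan scan = new Scan(key);
    scan.setCaching(2);
    ResultScanner scanner = table.getScanner(scan);
    scanner.next();                    // the current row itself
    Result previous = scanner.next();  // the row right after it
    scanner.close();

    long diff = currentSalary;         // oldest record for an employee keeps the full salary
    if (previous != null) {
        String currentEmployee = Bytes.toString(key).split(";")[0];
        String previousEmployee = Bytes.toString(previous.getRow()).split(";")[0];
        if (currentEmployee.equals(previousEmployee)) {
            long previousSalary = Bytes.toLong(previous.getValue(family, Bytes.toBytes("salary")));
            diff = currentSalary - previousSalary;
        }
    }

    // write the diff into a new column on the current row
    Put put = new Put(key);
    put.add(family, Bytes.toBytes("salarydiff"), Bytes.toBytes(diff));
    table.put(put);
}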
In most cases the next record will be in the same region; occasionally it might be served by another region server.