Range in HBase FuzzyRowFilter?

I have some data in HBase. The row key has the structure userID (Integer) + dateTimeInMillis (Long). In the past I used the following code to get the rows within a range:
Scan scan = new Scan(startKey.array(), endKey.array());
scan.addFamily(Bytes.toBytes(""));
ResultScanner result = table.getScanner(scan);
To query rows this way, I need to know both the userIDs and the timestamps.
One of my colleagues suggested I use FuzzyRowFilter to scan data for testing, and I found it very helpful. After playing around with it a little, this is how I was able to get the result for all userIDs for one day:
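For reference, here is a minimal sketch of how a 12-byte key with this layout could be built using java.nio.ByteBuffer, which is what the startKey.array() call above suggests. The class and method names are hypothetical; only the layout (a 4-byte userID followed by an 8-byte timestamp) comes from the description above.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: names are illustrative, not taken from the original code.
final class RowKeys {
    static final int KEY_LENGTH = Integer.BYTES + Long.BYTES; // 4 + 8 = 12

    // Big-endian userID followed by big-endian timestamp in milliseconds.
    static byte[] rowKey(int userId, long timeMillis) {
        return ByteBuffer.allocate(KEY_LENGTH)
                .putInt(userId)
                .putLong(timeMillis)
                .array();
    }
}
```

Because ByteBuffer writes big-endian by default, keys built this way sort by user and then by time as long as both values are non-negative, which is why a plain start/stop scan works for a single user.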
List<Pair<byte[], byte[]>> keys = new ArrayList<Pair<byte[], byte[]>>();
keys.add(new Pair<byte[], byte[]>(
startKey.array(), new byte[] { 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 }));
Filter filter = new FuzzyRowFilter(keys);
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("d"));
scan.setFilter(filter);
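To make the mask above easier to reason about, here is a small stand-alone model of how a fuzzy mask is usually described: a 0 means the byte at that position is fixed and must match, a 1 means any byte is accepted. This is only an illustration in plain Java with hypothetical names, not the HBase implementation.

```java
import java.nio.ByteBuffer;

// Illustrative model of the mask semantics only; this is not HBase code.
final class FuzzyMaskDemo {
    // Builds a 12-byte key: 4-byte userID followed by 8-byte timestamp.
    static byte[] key(int userId, long timeMillis) {
        return ByteBuffer.allocate(12).putInt(userId).putLong(timeMillis).array();
    }

    // mask[i] == 0: byte i of the row must equal pattern[i].
    // mask[i] == 1: any byte is accepted at position i.
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In this model, the mask { 1, 1, 1, 1, 0, ... } accepts any userID but requires all eight timestamp bytes to equal the pattern, so a single pair describes one exact timestamp rather than an interval.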
However, I can't adjust this filter to my use case, because I can't find any method to specify a range. I searched the internet, and most developers said that ranges aren't supported in this filter yet. Some suggested using multiple filters.
Is there a better way to specify a range besides using multiple filters?
I tried setting the last byte of my key's mask to 1 in order to get a better result, but it didn't work the way I expected. If anybody knows a better way to apply FuzzyRowFilter with a range, or has implemented a custom range filter, I would appreciate any ideas on how to get maximum performance.
Regards,

Related

Creating a dynamically sorted and filtered list from a range using a single formula in Google Sheets

I'm trying to do something similar to this question: Google sheets using Filter and Sort together
I have an input range with two columns, and as output I want to create a dynamically sorted and filtered list from the input range, using a single formula.
See this document for the desired result: https://docs.google.com/spreadsheets/d/109xcbORFZxTjH0Vjd6PVqYlOxMIdK7aXqf5-jnMMPik/edit?usp=sharing
I tried the formula =SORT(FILTER(B11:C100, B11:B100 = or(I11,I12,I13,I14)), 2, 0) but it doesn't work. What am I doing wrong here? Any help is much appreciated.
try:
=ARRAYFORMULA(QUERY(B11:C,
"where lower(B) matches '"&TEXTJOIN("|", 1, LOWER(I11:I))&"'
order by C desc", 0))
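OR collapses its arguments into a single TRUE/FALSE value, so it cannot be used as a row-by-row condition inside FILTER. You can replace or(...) with a sum of comparisons, as follows: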
=SORT(
FILTER(B11:C100,
((B11:B100 = I11)*1+(B11:B100 = I12)*1+(B11:B100 = I13)*1+(B11:B100 = I14)*1)>0
)
, 2, 0
)

how to change hbase table scan results order

I am trying to copy specific data from one HBase table to another, which requires scanning the table for row keys only and parsing a specific value from them. It works fine, but I noticed the results are returned in ascending order, in this case alphabetically. Is there a way to specify a reverse order, or perhaps to order by insert timestamp?
Scan scan = new Scan();
scan.setMaxResultSize(1000);
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner scanner = TestHbaseTable.getScanner(scan);
for(Result r : scanner){
System.out.println(Bytes.toString(r.getRow()));
String rowKey = Bytes.toString(r.getRow());
if(rowKey.startsWith("dm.") || rowKey.startsWith("bk.") || rowKey.startsWith("rt.")) {
continue;
} else if(rowKey.startsWith("yt")) {
List<String> ytresult = Arrays.asList(rowKey.split("\\s*\\.\\s*")); // dot escaped so it matches a literal '.'
.....
This table is huge so I would prefer to skip to the rows I actually need. Appreciate any help here.
Have you tried calling setReversed(true) on the Scan? Keep in mind that in this case your start row has to be the logical END of your row key range, and from there the scan moves 'upwards' toward smaller keys.
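The point about the start row can be illustrated with a toy model of a sorted key space. The sketch below uses a plain Java NavigableSet instead of an HBase table, and the class and method names are hypothetical; it assumes the start row is inclusive and the stop row is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Toy model of a sorted key space; this is not HBase code.
final class ReversedScanDemo {
    // Returns keys from startRow (inclusive) walking toward smaller keys,
    // stopping before stopRow (exclusive). startRow must not sort before stopRow.
    static List<String> reversedScan(NavigableSet<String> rows, String startRow, String stopRow) {
        return new ArrayList<>(rows.subSet(stopRow, false, startRow, true).descendingSet());
    }
}
```

With keys "bk.1", "dm.1", "yt.1", "yt.2" and "zz.1", calling reversedScan(rows, "yt.9", "yt") visits "yt.2" and then "yt.1": the larger key is passed as the start row and the smaller one as the stop row.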

Azure Search Scoring Profile Magnitude by Downloads

I am new to Azure Search, so I just want to run this by you before I try to implement it. We have a search set up on items, and we want to score/rank the results based on each item's initial score and how many times the item has been used/downloaded. We want the most-downloaded items to appear at the top of the result list.
We have a separate field in the search index that contains the used/download count (itemCount).
I know I have to set up a Magnitude profile, but I am not sure what to use for the range, since itemCount can be anywhere from 0 to N. Do I just set the range end to some large number such as 100,000,000, or is there a best practice?
var functionRankByDownload = new MagnitudeFunction()
{
Boost = 1000,
BoostingRangeStart = 0,
BoostingRangeEnd = 100000000,
ConstantBoostBeyondRange = true,
FieldName = "itemCount",
Interpolation = InterpolationTypes.Linear
};
scoringProfile1.Functions = new List<ScoringFunction>() { functionRankByDownload };
I found the score calculation is as follows:
((initialScore * boost * itemCount) - min) / (max-min)
So it seems like it should work OK with a large value for the max, but again I just want to know the best practice.
Thanks!
That seems reasonable. The BoostingRangeEnd can be any reasonable upper bound for your range, depending on the scenario. Since you are using ConstantBoostBeyondRange, values outside the range will also be boosted appropriately.
You might also want to experiment with the boost value for a large range like this and see if a bigger boost value is more helpful for your scenario.
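To get a feel for why the boost value may need tuning with such a wide range, here is a rough sketch of a clamped linear interpolation. It is an assumption about the general shape of linear interpolation over a range, not Azure Search's actual scoring code, and the names are hypothetical.

```java
// Illustrative only: a clamped linear interpolation, not Azure Search's scoring code.
final class MagnitudeDemo {
    // Position of value within [start, end], clamped to [0, 1].
    // Clamping loosely mirrors keeping a constant boost beyond the range.
    static double linearPosition(double value, double start, double end) {
        double t = (value - start) / (end - start);
        return Math.max(0.0, Math.min(1.0, t));
    }
}
```

Under this model an item with 1,000 downloads sits at about 0.00001 of a 0 to 100,000,000 range but at 0.1 of a 0 to 10,000 range, so a range end far above typical counts leaves most items with almost no relative boost.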

RethinkDB: Can I group by fields between dates efficiently?

I'd like to group by multiple fields, between two timestamps.
I tried something like:
r.table('my_table').between(r.time(2015, 1, 1, 'Z'), r.now(), {index: "timestamp"}).group("field_a", "field_b").count()
This takes a lot of time since my table is pretty big. I started thinking about using an index in the 'group' part of the query, but then I remembered that it's impossible to use more than one index in the same ReQL query.
Can I achieve what I need efficiently?
You could create a compound index, and then efficiently compute the count for any of the groups without computing all of them:
r.table('my_table').indexCreate('compound', function(row) {
return [row('field_a'), row('field_b'), row('timestamp')];
})
r.table('my_table').between(
[field_a_val, field_b_val, r.time(2015, 1, 1, 'Z')],
[field_a_val, field_b_val, r.now()],
{index: 'compound'}
).count()
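The reason this is efficient is that all entries for one (field_a, field_b) pair sit next to each other in the index, ordered by timestamp, so the count only touches one contiguous slice. The following toy model in plain Java shows the idea; it is not RethinkDB code, the names are hypothetical, and it assumes timestamps are unique within a group because a set keeps one entry per key.

```java
import java.util.Comparator;
import java.util.NavigableSet;

// Toy model of a compound index; this is not RethinkDB code.
final class CompoundIndexDemo {
    static final class Key {
        final String a;
        final String b;
        final long ts;

        Key(String a, String b, long ts) {
            this.a = a;
            this.b = b;
            this.ts = ts;
        }
    }

    // Orders keys by field_a, then field_b, then timestamp.
    static final Comparator<Key> ORDER = Comparator
            .comparing((Key k) -> k.a)
            .thenComparing(k -> k.b)
            .thenComparingLong(k -> k.ts);

    // Counts entries of one (a, b) group with from <= ts < to.
    // The index must be sorted with ORDER.
    static int count(NavigableSet<Key> index, String a, String b, long from, long to) {
        return index.subSet(new Key(a, b, from), true, new Key(a, b, to), false).size();
    }
}
```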

how to sort the documents according to a field in lucene?

Hi guys,
I've got billions of records, each with two attributes:
RecordCreatedTime, RecordContent
I've used Lucene to index the records, and that part is done.
Now I want to query records by RecordCreatedTime, for example to retrieve only the documents from November 2013.
I am considering sorting the documents by RecordCreatedTime and have tried some approaches such as NumericDocValuesSorter, but they didn't work.
Can you point me to some more material so I can take a careful look?
Many thanks.
You should check out Lucene's DateTools which provides you with the tools to represent dates in a way that is appropriate for searching and sorting in the index. A TermRangeQuery can be used to search a particular range (such as the month of November, 2013), when indexed in that format.
You can also sort easily, by passing a Sort into your search call.
For example, something like:
String startDateString = DateTools.dateToString(startDate, DateTools.Resolution.DAY);
String endDateString = DateTools.dateToString(endDate, DateTools.Resolution.DAY);
TermRangeQuery query = TermRangeQuery.newStringRange("recordCreatedTime", startDateString, endDateString, true, false);
SortField field = new SortField("recordCreatedTime", SortField.Type.STRING);
Sort sort = new Sort(field);
TopDocs results = searcher.search(query, numDocs, sort);
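The reason a string range works here is that day-resolution date strings of the form yyyyMMdd sort lexicographically in the same order as the dates themselves. The stand-alone sketch below imitates that format with java.time and does not call Lucene; the class and method names are hypothetical, and the yyyyMMdd pattern is my understanding of the day-resolution output.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Stand-alone illustration; it imitates the day-resolution string format and does not call Lucene.
final class DayStringDemo {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    static String dayString(LocalDate date) {
        return date.format(DAY);
    }

    // Lower bound inclusive, upper bound exclusive, matching the (true, false) flags used above.
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) < 0;
    }
}
```

For November 2013 the bounds would be "20131101" (inclusive) and "20131201" (exclusive), so "20131115" falls inside the range and "20131201" does not.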
