SCAN command with Spring RedisTemplate

I am trying to execute the SCAN command with RedisConnection. I don't understand why the following code throws a NoSuchElementException:
RedisConnection redisConnection = redisTemplate.getConnectionFactory().getConnection();
Cursor<byte[]> c = redisConnection.scan(scanOptions);
while (c.hasNext()) {
    c.next();
}
Exception:
java.util.NoSuchElementException
    at java.util.Collections$EmptyIterator.next(Collections.java:4189)
    at org.springframework.data.redis.core.ScanCursor.moveNext(ScanCursor.java:215)
    at org.springframework.data.redis.core.ScanCursor.next(ScanCursor.java:202)

Yes, I have tried this with spring-data-redis version 1.6.6.RELEASE: no issues, the simple while loop below is enough. I also set the count value to 100 (a higher value) to save round-trip time.
RedisConnection redisConnection = null;
try {
    redisConnection = redisTemplate.getConnectionFactory().getConnection();
    ScanOptions options = ScanOptions.scanOptions().match(workQKey).count(100).build();
    Cursor<byte[]> c = redisConnection.scan(options);
    while (c.hasNext()) {
        logger.info(new String(c.next()));
    }
} finally {
    if (redisConnection != null) {
        redisConnection.close(); // Ensure closing this connection.
    }
}

I'm using spring-data-redis 1.6.0.RELEASE and Jedis 2.7.2; I do think that the ScanCursor implementation is slightly flawed with regard to handling this case in this version - I've not checked previous versions, though.
So: it's rather complicated to explain, but the ScanOptions object has a "count" field that needs to be set (the default is 10). This field holds an "intent", or the expected number of results, for this search. As explained (not really clearly, IMHO) here, you may change the value of count at each invocation, especially if no result has been returned. I understand this as a "work intent": if you do not get anything back, maybe your key space is vast and the SCAN command has not worked "hard enough". Obviously, as long as you're getting results back, you do not need to increase this.
A "simple-but-dangerous" approach would be to have a very large count (e.g 1 million or more). This will make REDIS go away trying to search your vast key space to find "at least or near as much" as your large count. Don't forget - REDIS is single-threaded so you just killed your performance. Try this with a REDIS of 12M keys and you'll see that although SCAN may happily return results with a very high count value, it will absolutely do nothing more during the time of that search.
On to the solution to your problem:
ScanOptions options = ScanOptions.scanOptions().match(pattern).count(countValue).build();

boolean done = false;
// The while loop below makes sure that we'll get a valid cursor -
// by looking harder if we don't get a result initially.
while (!done) {
    try (Cursor<byte[]> c = redisConnection.scan(options)) {
        while (c.hasNext()) {
            c.next();
        }
        done = true; // We've made it here, let's go away.
    } catch (NoSuchElementException nse) {
        System.out.println("Going for " + countValue + " was not hard enough. Trying harder");
        countValue *= 2; // double the count and retry
        options = ScanOptions.scanOptions().match(pattern).count(countValue).build();
    }
}
Do note that the ScanCursor implementation in Spring Data Redis properly follows the SCAN contract and will loop as many times as needed to reach the end of the key space, as per the documentation. I've not found a way to change the scan options within the same cursor - so there is a risk that if you get halfway through your results and then hit a NoSuchElementException, you'll start again (and essentially do some of the work twice).
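If redoing part of the work worries you, one mitigation (a minimal sketch of my own, not part of Spring Data Redis; pattern and redisConnection are assumed to exist as above) is to accumulate the keys seen so far in a Set, so a retry with a larger count only adds keys that were not already collected - and since SCAN may return duplicates anyway, the Set also deduplicates within a single pass:
import java.util.HashSet;
import java.util.NoSuchElementException;
import java.util.Set;
import org.springframework.data.redis.core.Cursor;
import org.springframework.data.redis.core.ScanOptions;

Set<String> collected = new HashSet<>();
int countValue = 100;
boolean done = false;
while (!done) {
    ScanOptions options = ScanOptions.scanOptions().match(pattern).count(countValue).build();
    try {
        Cursor<byte[]> c = redisConnection.scan(options);
        while (c.hasNext()) {
            collected.add(new String(c.next())); // Set semantics absorb duplicates across retries
        }
        done = true;
    } catch (NoSuchElementException nse) {
        countValue *= 2; // look harder on the next attempt
    }
}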
Of course, better solutions are always welcome :)

My old code
ScanOptions.scanOptions().match("*" + query + "*").count(10).build();
Working code
ScanOptions.scanOptions().match("*" + query + "*").count(Integer.MAX_VALUE).build();

Related

Cancel ajax calls?

I'm using the Select2 select boxes in my Django project. The ajax calls it makes can be fairly time-consuming if you've only entered a character or two in the query box, but go quicker if you've entered several characters. So what I'm seeing is you'll start typing a query, and it will make 4 or 5 ajax calls, but the final one returns and the results display. It looks fine on the screen, but meanwhile, the server is still churning away on the earlier queries. I've increased the "delay" parameter to 500 ms, but it's still a bit of a problem.
Is there a way to have the AJAX handler on the server detect that this is a new request from the same client as one that is currently processing, and tell the older one to exit immediately? It appears from reading other answers here that merely calling .abort() on the client side doesn't stop the query running on the server side.
If they are DB queries that are taking up the time, then basically nothing will stop them besides stopping the database server, which is of course not an option. If it is computation in nested loops, for example, then you could use the cache to detect whether another request has been submitted by the same user. Basically:
from django.core.cache import cache
from django.utils import timezone

def view(request):
    start_time = timezone.now()
    cache_key = request.session.session_key + 'some_identifier'
    cache.set(cache_key, start_time)
    for q in werty:  # placeholder for your expensive iterable
        # Very expensive computation with millions of loops
        if start_time != cache.get(cache_key):
            # A newer request from the same session has started; stop early.
            break
        # ...continue the nasty computations
    else:
        # The loop finished without being interrupted; clean up.
        cache.delete(cache_key)
But the Django part aside, here is what I would do: in the JS, add a condition so that when the search term is shorter than 3 characters it waits 0.5 s (or less, whatever you like) before searching, and when another character is added so the term is long enough, it searches right away.
I.e.
var timeout;

function srch(param) {
    clearTimeout(timeout); // cancel any search that is still waiting to fire
    if (param.length < 3) {
        // Short query: wait 500 ms before actually searching.
        timeout = setTimeout(function () {
            $.ajax({blah: blah});
        }, 500);
    } else {
        // Long enough query: search right away.
        $.ajax({blah: blah});
    }
}

Fetch 1M records in orientdb: why is it 6x slower than bare SQL+MySQL

For some graph algorithm I need to fetch a lot of records from a database into memory (~1M records). I want this to be done fast and I want the records to be objects (that is: I want ORM). To crudely benchmark different solutions I created a simple problem of one table with 1M Foo objects, like I did here: Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor?
One can see that fetching them using bare SQL is extremely fast; also converting the records to objects using a simple for-loop is fast. Both execute in around 2-3 seconds. However using ORM's like SQLAlchemy and Hibernate, this takes 20-30 seconds: a lot slower if you ask me, and this is just a simple example without relations and joins.
SQLAlchemy advertises a "Mature, High Performing Architecture" (http://www.sqlalchemy.org/features.html), and Hibernate similarly claims "High Performance" (http://hibernate.org/orm/). In a way both are right, because they allow very generic object-oriented data models to be mapped back and forth to a MySQL database. On the other hand they are awfully wrong, since they are 10x slower than plain SQL and native code. Personally I think they could publish better benchmarks to show this, that is, benchmarks comparing against native SQL + Java or Python. But that is not the problem at hand.
Of course, I don't want SQL + native code, as it is hard to maintain. So I was wondering why there isn't something like an object-oriented database that handles the database-to-object mapping natively. Someone suggested OrientDB, hence I tried it. The API is quite nice: once you have your getters and setters right, the object is insertable and selectable.
But I want more than just API-sweetness, so I tried the 1M example:
import java.io.Serializable;

public class Foo implements Serializable {
    public int a, b, c;

    public Foo() {}
    public Foo(int a, int b, int c) { this.a = a; this.b = b; this.c = c; }

    public int getA() { return a; }
    public void setA(int a) { this.a = a; }
    public int getB() { return b; }
    public void setB(int b) { this.b = b; }
    public int getC() { return c; }
    public void setC(int c) { this.c = c; }
}
import com.orientechnologies.orient.object.db.OObjectDatabaseTx;

public class Main {
    public static void insert() throws Exception {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);
        int N = 1000000;
        long time = System.currentTimeMillis();
        for (int i = 0; i < N; i++) {
            Foo foo = new Foo(i, i * i, i + i * i);
            db.save(foo);
        }
        db.close();
        System.out.println(System.currentTimeMillis() - time);
    }

    public static void fetch() {
        OObjectDatabaseTx db = new OObjectDatabaseTx("plocal:/opt/orientdb-community-1.7.6/databases/test").open("admin", "admin");
        db.getEntityManager().registerEntityClass(Foo.class);
        long time = System.currentTimeMillis();
        for (Foo f : db.browseClass(Foo.class).setFetchPlan("*:-1")) {
            if (f.getA() == 345234) System.out.println(f.getB());
        }
        System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
        db.close();
    }

    public static void main(String[] args) throws Exception {
        //insert();
        fetch();
    }
}
Fetching 1M Foos using OrientDB takes approximately 18 seconds. The for loop with the getA() call is there to force the object fields to actually be loaded into memory, as I noticed that by default they are fetched lazily. I guess this may also be why fetching the Foos is slow: there is a database access on each iteration instead of a single access that fetches everything (including the fields).
I tried to fix that using setFetchPlan("*:-1"), figuring it might also apply to fields, but that did not seem to work.
Question: Is there a way to do this fast, preferably in the 2-3 seconds range? Why does this take 18 seconds, whilst the bare SQL version uses 3 seconds?
Addition: Using an ODatabaseDocumentTx as @frens-jan-rumph suggested gave me a speedup of only approximately 2, not approximately 5. Adjusting the code as follows gave me a running time of approximately 9 seconds. This is still 3 times slower than raw SQL, even though no conversion to Foos was performed. Almost all the time goes into the for loop.
public static void fetch() {
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:/opt/orientdb-community-1.7.6/databases/pits2").open("admin", "admin");
    long time = System.currentTimeMillis();
    ORecordIteratorClass<ODocument> it = db.browseClass("Foo");
    it.setFetchPlan("*:0");
    System.out.println("Fetching all Foo records took: " + (System.currentTimeMillis() - time) + " ms");

    time = System.currentTimeMillis();
    for (ODocument f : it) {
        //if ((int) f.field("a") == 345234) System.out.println(f.field("b"));
    }
    System.out.println("Iterating all Foo records took: " + (System.currentTimeMillis() - time) + " ms");
    db.close();
}
The answer lies in convenience.
During an interview, when I asked a candidate what they thought of LINQ (C#, I know, but pertinent to your question), they quite rightly answered that it sacrifices performance for convenience.
A hand-written SQL statement (whether or not it calls a stored procedure) is always going to be faster than using an ORM that auto-magically converts the results of the query into nice, easy-to-use POCOs.
That said, the difference should not be as great as you have experienced. Yes, there is overhead in doing it the auto-magical way, but it shouldn't be that great. I do have experience here, and within C# I have had to use special reflection classes to reduce the time it takes to do this auto-magical mapping.
With large swaths of data, I would expect an initial slow-down from an ORM, but then it would be negligible. 3 seconds to 18 seconds is huge.
If you profile your test, you will discover that around 60-80% of the CPU time is taken by execution of the following four methods:
com.orientechnologies...OObjectEntitySerializer.getField(...)
com.orientechnologies...OObjectEntityEnhancer.getProxiedInstance(...)
com.orientechnologies...OObjectMethodFilter.isScalaClass(...)
javassist...SecurityActions.getDeclaredMethods(...)
So yes, in this setup the bottleneck is in the ORM layer. Using ODatabaseDocumentTx provides a speedup of around 5x, which might just get you where you want to be.
Still, a lot of time (close to 50%) is spent in com.orientechnologies...OJNADirectMemory.getInt(...). That's expensive for just reading an integer from a memory location. I don't understand why plain Java NIO ByteBuffers aren't used here; that would save a lot of crossings of the Java / native border, etc.
Apart from these micro benchmarks and remarkable behaviour in OrientDB I think that there are at least two other things to consider:
Does this test reflect your expected workload?
I.e. you read a straightforward list of records. If so, why use a database? If not, then test on the actual workload, e.g. your searches, graph traversals, etc.
Does this test reflect your expected setup?
E.g. you are reading from a plocal database, while reading from a database over TCP/IP might just as well have its bottleneck somewhere else. Also, you are reading from one thread / process; if you expect concurrent use of the database, this probably throws things off considerably (disk seeks, more bookkeeping overhead, etc.).
P.S. I would recommend warming up the code before benchmarking.
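For instance, a minimal warm-up sketch (nothing OrientDB-specific; just run the measured path a few times before the timed run so JIT compilation and class loading are out of the picture):
public static void main(String[] args) throws Exception {
    // Warm-up: run fetch() a few times and discard the timings.
    for (int i = 0; i < 3; i++) {
        fetch();
    }
    // Only this run's timing is worth reporting.
    fetch();
}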
What you do here is a worst-case scenario. As you wrote (or should have written), for your database this test just reads a table and writes it straight to a stream of whatever.
So what you see is the complete overhead of a lot of magic. Usually, if you do something more complex like joining, selecting, filtering and ordering, the overhead of your ORM comes down to a more reasonable share of 5 to 10%.
Another thing you should think about - and I guess Orient is doing the same - is that an ORM solution creates new objects, multiplying memory consumption, and Java is really bad with memory consumption. That is the reason I use custom in-memory tables whenever I handle a lot of data / objects: there you know that an object is just a row in a table.
Another thing: your objects also get inserted into a list / map (at least Hibernate does this) so that their dirtiness can be tracked once you change them. This bookkeeping also takes a lot of time as the data scales up, and is a reason why we use paginated lists or maps - copying 1M references is dead slow if the backing array has to grow.
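If you are on Hibernate and only need to stream entities read-only, one way around that dirty-tracking overhead is a StatelessSession. This is just a sketch against the Hibernate 4/5-era API, assuming a configured SessionFactory named sessionFactory and the Foo entity from the question:
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;

// StatelessSession keeps no first-level cache and does no dirty checking,
// so the loaded Foo instances are never registered in a persistence context.
StatelessSession session = sessionFactory.openStatelessSession();
try {
    ScrollableResults results = session.createQuery("from Foo").scroll(ScrollMode.FORWARD_ONLY);
    while (results.next()) {
        Foo foo = (Foo) results.get(0);
        // process foo; it is detached, so changes are neither tracked nor flushed
    }
    results.close();
} finally {
    session.close();
}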

Sorting CouchDB Views By Value

I'm testing out CouchDB to see how it could handle logging some search results. What I'd like to do is produce a view where I can produce the top queries from the results. At the moment I have something like this:
Example document portion
{
    "query": "+dangerous +dogs",
    "hits": "123"
}
Map function
(Not exactly what I need/want but it's good enough for testing)
function(doc) {
    if (doc.query) {
        var split = doc.query.split(" ");
        for (var i in split) {
            emit(split[i], 1);
        }
    }
}
Reduce Function
function (key, values, rereduce) {
    return sum(values);
}
Now this will get me results in a format where a query term is the key and the count for that term is the value, which is great. But I'd like it ordered by the value, not the key. From the sounds of it, this is not yet possible with CouchDB.
So does anyone have any ideas of how I can get a view where I have an ordered version of the query terms & their related counts? I'm very new to CouchDB and I just can't think of how I'd write the functions needed.
It is true that there is no dead-simple answer. There are several patterns however.
http://wiki.apache.org/couchdb/View_Snippets#Retrieve_the_top_N_tags. I do not personally like this because they acknowledge that it is a brittle solution, and the code is not relaxing-looking.
Avi's answer, which is to sort in-memory in your application.
couchdb-lucene which it seems everybody finds themselves needing eventually!
What I like is what Chris said in Avi's quote. Relax. In CouchDB, databases are lightweight and excel at giving you a unique perspective of your data. These days, the buzz is all about filtered replication which is all about slicing out subsets of your data to put in a separate DB.
Anyway, the basics are simple. You take your .rows from the view output and insert them into a separate DB whose view simply emits keyed on the count. An additional trick is to write a very simple _list function. Lists "render" the raw couch output into different formats. Your _list function should output
{ "docs":
[ {..view row1...},
{..view row2...},
{..etc...}
]
}
What that will do is format the view output exactly the way the _bulk_docs API requires it. Now you can pipe curl directly into another curl:
curl host:5984/db/_design/myapp/_list/bulkdocs_formatter/query_popularity \
| curl -X POST host:5984/popularity_sorter/_design/myapp/_view/by_count
In fact, if your list function can handle all the docs, you may just have it sort them itself and return them to the client sorted.
This came up on the CouchDB-user mailing list, and Chris Anderson, one of the primary developers, wrote:
This is a common request, but not supported directly by CouchDB's views -- to do this you'll need to copy the group-reduce query to another database, and build a view to sort by value.
This is a tradeoff we make in favor of dynamic range queries and incremental indexes.
I needed to do this recently as well, and I ended up doing it in my app tier. This is easy to do in JavaScript:
db.view('mydesigndoc', 'myview', {'group': true}, function(err, data) {
    if (err) throw new Error(JSON.stringify(err));
    data.rows.sort(function(a, b) {
        return a.value - b.value;
    });
    data.rows.reverse(); // optional, depending on your needs
    // do something with the data…
});
This example runs in Node.js and uses node-couchdb, but it could easily be adapted to run in a browser or another JavaScript environment. And of course the concept is portable to any programming language/environment.
HTH!
This is an old question, but I feel it still deserves a decent answer (I spent at least 20 minutes searching for the correct one...).
I disagree with the other suggestions in the answers here and feel that they are unsatisfactory. In particular, I don't like the suggestion to sort the rows in the application layer, as it doesn't scale well and doesn't deal with the case where you need to limit the result set in the DB.
The better approach that I came across is suggested in this thread: if you need to sort by a value, add it into the key array and then query the key using a range - specifying the part of the key you want and leaving the rest of the range open. For example, if your key is composed of country, state and city:
emit([doc.address.country,doc.address.state, doc.address.city], doc);
Then you query just the country and get free sorting on the rest of the key components:
startkey=["US"]&endkey=["US",{}]
In case you also need to reverse the order, note that simply defining descending: true will not suffice. You actually need to swap the start and end keys, i.e.:
startkey=["US",{}]&endkey=["US"]
See more reference at this great source.
I'm unsure about the 1 you have as your returned result, but I'm positive this should do the trick:
emit([doc.hits, split[i]], 1);
The rules of sorting are defined in the docs.
Based on Avi's answer, I came up with this CouchDB list function that worked for my needs; it is simply a report of most-popular events (key = event name, value = attendees).
ddoc.lists.eventPopularity = function(req, res) {
    start({ headers : { "Content-type" : "text/plain" } });
    var row, data = [];
    while (row = getRow()) {
        data.push(row);
    }
    data.sort(function(a, b) {
        return a.value - b.value;
    }).reverse();
    for (var i in data) {
        send(data[i].value + ': ' + data[i].key + "\n");
    }
}
For reference, here's the corresponding view function:
ddoc.views.eventPopularity = {
    map : function(doc) {
        if (doc.type == 'user') {
            for (var i in doc.events) {
                emit(doc.events[i].event_name, 1);
            }
        }
    },
    reduce : '_count'
}
And the output of the list function (snipped):
165: Design-Driven Innovation: How Designers Facilitate the Dialog
165: Are Your Customers a Crowd or a Community?
164: Social Media Mythbusters
163: Don't Be Afraid Of Creativity! Anything Can Happen
159: Do Agencies Need to Think Like Software Companies?
158: Customer Experience: Future Trends & Insights
156: The Accidental Writer: Great Web Copy for Everyone
155: Why Everything is Amazing But Nobody is Happy
Every solution above will hurt CouchDB performance, I think. I am very new to this database, but as I understand it, CouchDB views prepare their results before they are queried. It seems we need to prepare the results manually: for example, each search term could reside in the database with its hit count, and when somebody searches, their terms are looked up and the hit counts incremented. When we want to see search-term popularity, the view can then emit (hitcount, searchterm) pairs.
The Link Retrieve_the_top_N_tags seems to be broken, but I found another solution here.
Quoting the dev who wrote that solution:
rather than returning the results keyed by the tag in the map step, I would emit every occurrence of every tag instead. Then in the reduce step, I would calculate the aggregation values grouped by tag using a hash, transform it into an array, sort it, and choose the top 3.
As stated in the comments, the only problem would be in case of a long tail:
Problem is that you have to be careful with the number of tags you obtain; if the result is bigger than 500 bytes, you'll have couchdb complaining about it, since "reduce has to effectively reduce". 3 or 6 or even 20 tags shouldn't be a problem, though.
It worked perfectly for me; check the link to see the code!

How much information hiding is necessary when doing code refactoring?

How much information hiding is necessary? I have boilerplate code before I delete a record, it looks like this:
public override void OrderProcessing_Delete(Dictionary<string, object> pkColumns)
{
    var c = Connect();

    using (var cmd = new NpgsqlCommand("SELECT COUNT(*) FROM orders WHERE order_id = :_order_id", c)
                     { Parameters = { {"_order_id", pkColumns["order_id"]} } })
    {
        var count = (long)cmd.ExecuteScalar();

        // deletion's boilerplate code...
        if (count == 0) throw new RecordNotFoundException();
        else if (count > 1) throw new DatabaseStructureChangedException();
        // ...boiler plate code
    }

    // deleting of table(s) goes here...
}
NOTE: boilerplate code is code-generated, including the "using (var cmd = new NpgsqlCommand( ... )"
But I'm seriously thinking of refactoring the boilerplate code; I want more succinct code. This is how I envision the refactored code (made nicer with an extension method, though that's not the sole reason ;)):
using (var cmd = new NpgsqlCommand("SELECT COUNT(*) FROM orders WHERE order_id = :_order_id", c)
                 { Parameters = { {"_order_id", pkColumns["order_id"]} } })
{
    cmd.VerifyDeletion(); // [EDIT: was ExecuteWithVerification before]
}
I want the ExecuteScalar call and the boilerplate code to go inside the extension method.
For my code above, does it warrant refactoring / information hiding? Does my refactored operation look too opaque?
I would say that your refactor is extremely good, if your new single line of code replaces a handful of lines of code in many places in your program. Especially since the functionality is going to be the same in all of those places.
The programmer coming after you and looking at your code will simply look at the definition of the extension method to find out what it does, and now he knows that this code is defined in one place, so there is no possibility of it differing from place to place.
Try it if you must, but my feeling is it's not about succinctness but whether or not you want to enforce the behavior every time or most of the time - and, by extension, whether a change to the verify condition should apply across the board.
Basically, reducing a small chunk of boiler-plate code doesn't necessarily make things more succinct; it's just one more bit of abstractness the developer has to wade through and understand.
As a developer, I'd have no idea what "ExecuteWithVerify" means. What exactly are we verifying? I'd have to look it up and remember it. But with the boiler-plate code, I can look at the code and understand exactly what's going on.
And by NOT reducing it to a separate method I can also tune the boiler-plate code for cases where exceptions need to be thrown for differing conditions.
It's not information-hiding when you extract or refactor your code. It's only information-hiding when you start restricting access to your extension definition after refactoring.
"new" operator within a Class (except for the Constructor) should be Avoided at all costs. This is what you need to refactor here.

Best practice for incorrect parameters on a remove method

So I have an abstract data type called RegionModel with a series of values (Region), each mapped to an index. It's possible to remove a number of regions by calling:
regionModel.removeRegions(index, numberOfRegionsToRemove);
My question is: what's the best way to handle a call when the index is valid
(between 0 (inclusive) and the number of Regions in the model (exclusive))
but the numberOfRegionsToRemove is invalid
(index + numberOfRegionsToRemove > the number of regions in the model)?
Is it best to throw an exception like IllegalArgumentException or just to remove as many Regions as I can (all the regions from index to the end of the model)?
Sub-question: if I throw an exception what's the recommended way to unit test that the call threw the exception and left the model untouched (I'm using Java and JUnit here but I guess this isn't a Java specific question).
Typically, for structures like this, you have a remove method which takes an index and if that index is outside the bounds of the items in the structure, an exception is thrown.
That being said, you should be consistent with whatever the remove method that takes a single index does. If it simply ignores incorrect indexes, then do the same when your range exceeds (or even starts before) the indexes of the items in your structure.
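For illustration only (the internal List and its use of the Region class here are hypothetical; only the removeRegions signature comes from the question), the validate-then-throw approach could look like this:
import java.util.ArrayList;
import java.util.List;

public class RegionModel {
    private final List<Region> regions = new ArrayList<Region>();

    /** Removes numberOfRegionsToRemove regions starting at index, or throws without modifying the model. */
    public void removeRegions(int index, int numberOfRegionsToRemove) {
        if (index < 0 || index >= regions.size()) {
            throw new IllegalArgumentException("index out of bounds: " + index);
        }
        if (numberOfRegionsToRemove < 0 || index + numberOfRegionsToRemove > regions.size()) {
            throw new IllegalArgumentException("cannot remove " + numberOfRegionsToRemove
                    + " regions starting at " + index + "; the model holds " + regions.size());
        }
        // All checks passed: mutate only now, so a failed call leaves the model untouched.
        regions.subList(index, index + numberOfRegionsToRemove).clear();
    }
}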
I agree with Mitchel and casperOne -- an Exception makes the most sense.
As far as unit testing is concerned, JUnit 4 allows you to test for expected exceptions directly:
http://www.ibm.com/developerworks/java/library/j-junit4.html
You would only need to pass parameters which are guaranteed to cause the exception, and add the correct annotation (@Test(expected = IllegalArgumentException.class)) to the JUnit test method.
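For example, a minimal JUnit 4 sketch (the test name and the empty-model setup are hypothetical):
import org.junit.Test;

public class RegionModelTest {

    @Test(expected = IllegalArgumentException.class)
    public void removeRegionsRejectsTooLargeCount() {
        RegionModel regionModel = new RegionModel();
        // Far more regions than the model holds, so removeRegions must throw.
        regionModel.removeRegions(0, 1000);
    }
}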
Edit: As Tom Martin mentioned, JUnit 4 is a decent-sized step away from JUnit 3. It is, however, possible to also test exceptions using JUnit 3. It's just not as easy.
One of the ways I've tested exceptions is by using a try/catch block within the test itself and embedding Assert statements in it.
Here's a simple example -- it's not complete (e.g. regionModel is assumed to be instantiated), but it should get the idea across:
public void testRemoveRegionsInvalidInputs() {
    int originalSize = regionModel.length();
    int index = 0;
    int numberOfRegionsToRemove = 1000; // greater than regionModel's current size

    try {
        regionModel.removeRegions(index, numberOfRegionsToRemove);
        // Since the exception will immediately jump to the 'catch' block,
        // this code only runs if the IllegalArgumentException wasn't thrown.
        Assert.fail("Exception not thrown!");
    } catch (IllegalArgumentException e) {
        Assert.assertTrue("Exception thrown, but regionModel was modified",
                regionModel.length() == originalSize);
    } catch (Exception e) {
        Assert.fail("Incorrect exception thrown");
    }
}
I would say that an exception such as IllegalArgumentException would be the best way to go here. If the calling code is not passing a workable value, you wouldn't necessarily want to trust that it really meant to remove what it asked you to remove.
