Sorting CouchDB Views By Value

Sorting CouchDB Views By Value - sorting

I'm testing out CouchDB to see how it could handle logging some search results. What I'd like to do is produce a view where I can produce the top queries from the results. At the moment I have something like this:
Example document portion
{
"query": "+dangerous +dogs",
"hits": "123"
}
Map function
(Not exactly what I need/want but it's good enough for testing)
function(doc) {
if (doc.query) {
var split = doc.query.split(" ");
for (var i in split) {
emit(split[i], 1);
}
}
}
Reduce Function
function (key, values, rereduce) {
return sum(values);
}
Now this will get me results in a format where a query term is the key and the count for that term on the right, which is great. But I'd like it ordered by the value, not the key. From the sounds of it, this is not yet possible with CouchDB.
So does anyone have any ideas of how I can get a view where I have an ordered version of the query terms & their related counts? I'm very new to CouchDB and I just can't think of how I'd write the functions needed.

It is true that there is no dead-simple answer. There are several patterns however.
http://wiki.apache.org/couchdb/View_Snippets#Retrieve_the_top_N_tags. I do not personally like this because they acknowledge that it is a brittle solution, and the code is not relaxing-looking.
Avi's answer, which is to sort in-memory in your application.
couchdb-lucene which it seems everybody finds themselves needing eventually!
What I like is what Chris said in Avi's quote. Relax. In CouchDB, databases are lightweight and excel at giving you a unique perspective of your data. These days, the buzz is all about filtered replication which is all about slicing out subsets of your data to put in a separate DB.
Anyway, the basics are simple. You take your .rows from the view output and you insert it into a separate DB which simply emits keyed on the count. An additional trick is to write a very simple _list function. Lists "render" the raw couch output into different formats. Your _list function should output
{ "docs":
[ {..view row1...},
{..view row2...},
{..etc...}
]
}
What that will do is format the view output exactly the way the _bulk_docs API requires it. Now you can pipe curl directly into another curl:
curl host:5984/db/_design/myapp/_list/bulkdocs_formatter/query_popularity \
| curl -X POST host:5984/popularity_sorter/_design/myapp/_view/by_count
In fact, if your list function can handle all the docs, you may just have it sort them itself and return them to the client sorted.

This came up on the CouchDB-user mailing list, and Chris Anderson, one of the primary developers, wrote:
This is a common request, but not supported directly by CouchDB's
views -- to do this you'll need to copy the group-reduce query to
another database, and build a view to sort by value.
This is a tradeoff we make in favor of dynamic range queries and
incremental indexes.
I needed to do this recently as well, and I ended up doing it in my app tier. This is easy to do in JavaScript:
db.view('mydesigndoc', 'myview', {'group':true}, function(err, data) {
if (err) throw new Error(JSON.stringify(err));
data.rows.sort(function(a, b) {
return a.value - b.value;
});
data.rows.reverse(); // optional, depending on your needs
// do something with the data…
});
This example runs in Node.js and uses node-couchdb, but it could easily be adapted to run in a browser or another JavaScript environment. And of course the concept is portable to any programming language/environment.
HTH!

This is an old question but I feel it still deserves a decent answer (I spent at least 20 minutes on searching for the correct answer...)
I disapprove of the other suggestions in the answers here and feel that they are unsatisfactory. Especially I don't like the suggestion to sort the rows in the applicative layer, as it doesn't scale well and doesn't deal with a case where you need to limit the result set in the DB.
The better approach that I came across is suggested in this thread and it posits that if you need to sort the values in the query you should add them into the key set and then query the key using a range - specifying a desired key and loosening the value range. For example if your key is composed of country, state and city:
emit([doc.address.country,doc.address.state, doc.address.city], doc);
Then you query just the country and get free sorting on the rest of the key components:
startkey=["US"]&endkey=["US",{}]
In case you also need to reverse the order - note that simple defining descending: true will not suffice. You actually need to reverse the start and end key order, i.e.:
startkey=["US",{}]&endkey=["US"]
See more reference at this great source.

I'm unsure about the 1 you have as your returned result, but I'm positive this should do the trick:
emit([doc.hits, split[i]], 1);
The rules of sorting are defined in the docs.

Based on Avi's answer, I came up with this Couchdb list function that worked for my needs, which is simply a report of most-popular events (key=event name, value=attendees).
ddoc.lists.eventPopularity = function(req, res) {
start({ headers : { "Content-type" : "text/plain" } });
var data = []
while(row = getRow()) {
data.push(row);
}
data.sort(function(a, b){
return a.value - b.value;
}).reverse();
for(i in data) {
send(data[i].value + ': ' + data[i].key + "\n");
}
}
For reference, here's the corresponding view function:
ddoc.views.eventPopularity = {
map : function(doc) {
if(doc.type == 'user') {
for(i in doc.events) {
emit(doc.events[i].event_name, 1);
}
}
},
reduce : '_count'
}
And the output of the list function (snipped):
165: Design-Driven Innovation: How Designers Facilitate the Dialog
165: Are Your Customers a Crowd or a Community?
164: Social Media Mythbusters
163: Don't Be Afraid Of Creativity! Anything Can Happen
159: Do Agencies Need to Think Like Software Companies?
158: Customer Experience: Future Trends & Insights
156: The Accidental Writer: Great Web Copy for Everyone
155: Why Everything is Amazing But Nobody is Happy

Every solution above will break couchdb performance I think. I am very new to this database. As I know couchdb views prepare results before it's being queried. It seems we need to prepare results manually. For example each search term will reside in database with hit counts. And when somebody searches, its search terms will be looked up and increments hit count. When we want to see search term popularity, it will emit (hitcount, searchterm) pair.

The Link Retrieve_the_top_N_tags seems to be broken, but I found another solution here.
Quoting the dev who wrote that solution:
rather than returning the results keyed by the tag in the map step, I would emit every occurrence of every tag instead. Then in the reduce step, I would calculate the aggregation values grouped by tag using a hash, transform it into an array, sort it, and choose the top 3.
As stated in the comments, the only problem would be in case of a long tail:
Problem is that you have to be careful with the number of tags you obtain; if the result is bigger than 500 bytes, you'll have couchdb complaining about it, since "reduce has to effectively reduce". 3 or 6 or even 20 tags shouldn't be a problem, though.
It worked perfectly for me, check the link to see the code !

Related

Extracting all children belongs to specific parent in graphql

I am using GrapgQL and Java. I need to extract all the children belongs to specific parent. I have used the below way but it will fetch only the parent and it does not fetch any children.
schema {
query: Query
}
type LearningResource{
id: ID
name: String
type: String
children: [LearningResource]
}
type Query {
fetchLearningResource: LearningResource
}
#Component
public class LearningResourceDataFetcher implements DataFetcher{
#Override
public LearningResource get(DataFetchingEnvironment dataFetchingEnvironment) {
LearningResource lr3 = new LearningResource();
lr3.setId("id-03");
lr3.setName("Resource-3");
lr3.setType("Book");
LearningResource lr2 = new LearningResource();
lr2.setId("id-02");
lr2.setName("Resource-2");
lr2.setType("Paper");
LearningResource lr1 = new LearningResource();
lr1.setId("id-01");
lr1.setName("Resource-1");
lr1.setType("Paper");
List<LearningResource> learningResources = new ArrayList<>();
learningResources.add(lr2);
learningResources.add(lr3);
learningResource1.setChildren(learningResources);
return lr1;
}
}
return RuntimeWiring.newRuntimeWiring().type("Query", typeWiring -> typeWiring.dataFetcher("fetchLearningResource", learningResourceDataFetcher)).build();
My Controller endpoint
#RequestMapping(value = "/queryType", method = RequestMethod.POST)
public ResponseEntity query(#RequestBody String query) {
System.out.println(query);
ExecutionResult result = graphQL.execute(query);
System.out.println(result.getErrors());
System.out.println(result.getData().toString());
return ResponseEntity.ok(result.getData());
}
My request would be like below
{
fetchLearningResource
{
name
}
}
Can anybody please help me to sort this ?

Because I get asked this question a lot in real life, I'll answer it in detail here so people have easier time googling (and I have something to point at).
As noted in the comments, the selection for each level has to be explicit and there is no notion of an infinitely recursive query like get everything under a node to the bottom (or get all children of this parent recursively to the bottom).
The reason is mostly that allowing such queries could easily put you in a dangerous situation: a user would be able to request the entire object graph from the server in one easy go! For any non-trivial data size, this would kill the server and saturate the network in no time. Additionally, what would happen once a recursive relationship is encountered?
Still, there is a semi-controlled escape-hatch you could use here. If the scope in which you need everything is limited (and it really should be), you could map the output type of a specific query as a (complex) scalar.
In your case, this would mean mapping LearningResource as a scalar. Then, fetchLearningResource would effectively be returning a JSON blob, where the blob would happen to be all the children and their children recursively. Query resolution doesn't descent deeper once a scalar field is reached, as scalars are leaf nodes, so it can't keep resolving the children level-by-level. This means you'd have to recursively fetch everything in one go, by yourself, as GraphQL engine can't help you here. It also means sub-selections become impossible (as scalars can't have sub-selections - again, they're leaf nodes), so the client would always get all the children and all the fields from each child back. If you still need the ability to limit the selection in certain cases, you can expose 2 different queries e.g. fetchLearningResource and fetchAllLearningResources, where the former would be mapped as it is now, and the latter would return the scalar as explained.
An object scalar implementation is provided by the graphql-java ExtendedScalars project.
The schema could then look like:
schema {
query: Query
}
scalar Object
type Query {
fetchLearningResource: Object
}
And you'd use the method above to produce the scalar implementation:
RuntimeWiring.newRuntimeWiring()
.scalar(ExtendedScalars.Object) //register the scalar impl
.type("Query", typeWiring -> typeWiring.dataFetcher("fetchLearningResource", learningResourceDataFetcher)).build();
Depending on how you process the results of this query, the DataFetcher for fetchLearningResource may need to turn the resulting object into a map-of-maps (JSON-like object) before returning to the client. If you simply JSON-serialize the result anyway, you can likely skip this. Note that you're side-stepping all safety mechanisms here and must take care not to produce enormous results. By extension, if you need this in many places, you're very likely using a completely wrong technology for your problem.
I have not tested this with your code myself, so I might have skipped something important, but this should be enough to get you (or anyone googling) onto the right track (if you're sure this is the right track).
UPDATE: I've seen someone implement a custom Instrumentation that rewrites the query immediately after it's parsed, and adds all fields to the selection set if no field had already been selected, recursively. This effectively allows them to select everything implicitly.
In graphql-java v11 and prior, you could mutate the parsed query (represented by the Document class), but as of v12, it will no longer be possible, but instrumentations in turn gain the ability to replace the Document explicitly via the new instrumentDocument method.
Of course, this only makes sense if your schema is such that it can not be exploited or you fully control the client so there's no danger. You could also only do it selectively for some types, but it would be extremely confusing to use.

Compute the most common array elements

I have a bunch of documents containing an array of tags:
{ tags: ["tag1", "tag2", "tag3"] }
What I'd like to do is to compute the top 10 most common tags used among all documents. After some trial-and-error I've come up with the following solution:
r.db("database").table("table").concatMap(function(doc) {
return doc("tags")
}).coerceTo("array").group(function(entry) {
return entry
}).count().ungroup().orderBy(r.desc("reduction").limit(10).map(function(doc) {
return doc("group")
})
However, I "feel" (with my limited knowledge of query optimization) that this a rather cumbersome way to do it. Can anyone suggest a more efficient approach with proper use of indexes?

That query looks fine to me except for the coerceTo('array'), which I don't think is necessary and which will probably affect performance. You can also shorten it quite a bit:
r.table('table').group('tags', {multi: true}).count().ungroup().orderBy('reduction').slice(-10)('group')

SCAN command with spring redis template

I am trying to execute "scan" command with RedisConnection. I don't understand why the following code is throwing NoSuchElementException
RedisConnection redisConnection = redisTemplate.getConnectionFactory().getConnection();
Cursor c = redisConnection.scan(scanOptions);
while (c.hasNext()) {
c.next();
}
Exception:
java.util.NoSuchElementException at
java.util.Collections$EmptyIterator.next(Collections.java:4189) at
org.springframework.data.redis.core.ScanCursor.moveNext(ScanCursor.java:215)
at
org.springframework.data.redis.core.ScanCursor.next(ScanCursor.java:202)

Yes, I have tried this, in 1.6.6.RELEASE spring-data-redis.version. No issues, the below simple while loop code is enough. And i have set count value to 100 (more the value) to save round trip time.
RedisConnection redisConnection = null;
try {
redisConnection = redisTemplate.getConnectionFactory().getConnection();
ScanOptions options = ScanOptions.scanOptions().match(workQKey).count(100).build();
Cursor c = redisConnection.scan(options);
while (c.hasNext()) {
logger.info(new String((byte[]) c.next()));
}
} finally {
redisConnection.close(); //Ensure closing this connection.
}

I'm using spring-data-redis 1.6.0-RELEASE and Jedis 2.7.2; I do think that the ScanCursor implementation is slightly flawed w/rgds to handling this case on this version - I've not checked previous versions though.
So: rather complicated to explain, but in the ScanOptions object there is a "count" field that needs to be set (default is 10). This field, contains an "intent" or "expected" results for this search. As explained (not really clearly, IMHO) here, you may change the value of count at each invocation, especially if no result has been returned. I understand this as "a work intent" so if you do not get anything back, maybe your "key space" is vast and the SCAN command has not worked "hard enough". Obviously, as long as you're getting results back, you do not need to increase this.
A "simple-but-dangerous" approach would be to have a very large count (e.g 1 million or more). This will make REDIS go away trying to search your vast key space to find "at least or near as much" as your large count. Don't forget - REDIS is single-threaded so you just killed your performance. Try this with a REDIS of 12M keys and you'll see that although SCAN may happily return results with a very high count value, it will absolutely do nothing more during the time of that search.
To the solution to your problem:
ScanOptions options = ScanOptions.scanOptions().match(pattern).count(countValue).build();
boolean done = false;
// the while-loop below makes sure that we'll get a valid cursor -
// by looking harder if we don't get a result initially
while (!done) {
try(Cursor c = redisConnection.scan(scanOptions)) {
while (c.hasNext()) {
c.next();
}
done = true; //we've made it here, lets go away
} catch (NoSuchElementException nse) {
System.out.println("Going for "+countValue+" was not hard enough. Trying harder");
options = ScanOptions.scanOptions().match(pattern).count(countValue*2).build();
}
}
Do note that the ScanCursor implementation of Spring Data REDIS will properly follow the SCAN instructions and loop correctly, as much as needed, to get to the end of the loop as per documentation. I've not found a way to change the scan options within the same cursor - so there may be a risk that if you get half-way through your results and get a NoSuchElementException, you'll start again (and essentially do some of the work twice).
Of course, better solutions are always welcome :)

My old code
ScanOptions.scanOptions().match("*" + query + "*").count(10).build();
Working code
ScanOptions.scanOptions().match("*" + query + "*").count(Integer.MAX_VALUE).build();

d3js: Calculating sub-totals, conditional on i=="some stuff"

Warning/Disclaimer: This is a basic JavaScript question, but I've gone through a bunch of iterations in my code, much Googling, and I'm having trouble wrapping my head around how to proceed.
I have data in three columns in a CSV file: a political candidate's name, their party, and their approval rating.
I've created a bubble chart/force layout, similar to this. Candidates are represented as bubbles. Users can select to see candidates organized in one big blob, or they can click to see the bubbles organize themselves by party. I have some <div> elements that pop up under each grouping of same-party candidates. What I'd like to do now is to have each party-specific <div> element display the total approval rating all its same-party candidates get, combined. (So, in Excel, a =SUMIF().)
To do this: I'm creating a function(party) which should, in principle, return that conditional sum. Here's what it looks like when I'm calling it for candidates with no party affiliation:
d3.select("#text-NONE")
.text(label_party("N/A"));
"N/A" is the string found in the CSV file.
And the function itself:
function label_party(party) {
var party_total = 0;
function party(d) {
if (d.party[i] == "party") {
party_total+= d.y2012[i];
};
};
return party_total;
};
Both of the above are happening outside of the d3.csv() call. My main Q: how can I set up a conditional sum over two columns in a CSV? At the moment, it's simply returning 0 - so it's skipping my loop, though I don't know why.

Order $each by name

I am trying to figure why my ajax $each alters the way my list of names gets printed?
I have an json string like this:
[{"name":"Adam","len":1,"cid":2},{"name":"Bo","len":1,"cid":1},{"name":"Bob","len":1,"cid":3},{"name":"Chris","len":1,"cid":7},{"name":"James","len":1,"cid":5},{"name":"Michael","len":1,"cid":6},{"name":"Nick","len":1,"cid":4},{"name":"OJ","len":1,"cid":8}]
Here all the names are sorted in alphabetic order, but when getting them out they are sorted by "cid"? Why, and how can I change this?
Here is my jQuery:
var names = {};
$.getJSON('http://mypage.com/json/names.php', function(data){
$.each(data.courses, function (k, vali) {
names[vali.cid] = vali.name;
});
});
I guess its because "names[vali.cid]", but I need that part to stay that way. Can it still be done?
Hoping for help and thanks in advance :-.)

Ordering inside an object is not really defined or predictable when you iterate over it. I would suggest sorting the array based on an internal property:
var names = [];
$.getJSON('http://mypage.com/json/names.php', function(data){
$.each(data.courses, function (k, vali) {
names.push({name: vali.name, cid: vali.cid});
});
names.sort(function(a, b) {
return a.name.localeCompare(b.name);
});
});
Now you have an array that is ordered and you can iterate over it in a predictable order as well.

There is no "ajax $each" - you probably mean the jQuery function.
With "when getting them out" I presume you mean something like console.debug(names) after your $each call
Objects aren't ordered in javascript per definition, so there is no more order in your variable "names". Still, most javascript implementations today (and all the ones probably important to you - the ones used in the most used browsers) employ a stable order in objects which normally depends on the order you insert stuff.
All this said, there can probably be 3 reasons you're not getting what you're expecting:
Try console.debug(data) and see what you get - the order as you want it?
As you don't explicitly state how you debug your stuff, the problem could be in the way you output and not the data is stored. Here too try console.debug(names).
You're using a function which dereferences on expection, like console.*. This means if you console.debug() an object, the displayed values will depend on the moment you unfold the displayed tree in your browser, not when the line was called!

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio