I have a bunch of documents containing an array of tags:
{ tags: ["tag1", "tag2", "tag3"] }
What I'd like to do is to compute the top 10 most common tags used among all documents. After some trial-and-error I've come up with the following solution:
r.db("database").table("table").concatMap(function(doc) {
return doc("tags")
}).coerceTo("array").group(function(entry) {
return entry
}).count().ungroup().orderBy(r.desc("reduction").limit(10).map(function(doc) {
return doc("group")
})
However, I "feel" (with my limited knowledge of query optimization) that this a rather cumbersome way to do it. Can anyone suggest a more efficient approach with proper use of indexes?
That query looks fine to me except for the coerceTo('array'), which I don't think is necessary and which will probably affect performance. You can also shorten it quite a bit:
r.table('table').group('tags', {multi: true}).count().ungroup().orderBy('reduction').slice(-10)('group')
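For reference, here is how that shortened query might be run end to end with the official JavaScript driver. This is a minimal sketch: the connection details are assumptions, and the database/table names match the question.

const r = require('rethinkdb');

async function topTags() {
  // assumes a local RethinkDB instance with database "database" and table "table"
  const conn = await r.connect({ host: 'localhost', port: 28015 });
  const top10 = await r.db('database').table('table')
    .group('tags', { multi: true }) // one group per tag, per document
    .count()                        // -> { group: <tag>, reduction: <count> }
    .ungroup()                      // grouped data -> plain array
    .orderBy('reduction')           // ascending by count
    .slice(-10)('group')            // keep the 10 largest counts, project the tag names
    .run(conn);
  console.log(top10);
  await conn.close();
}

Note that group('tags', {multi: true}) treats the tags array as multiple group keys, so each document is counted once for every tag it contains.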
Normally one would just do a simple join to merge both arrays into one, but the problem is that I have arrays with different object structures, and depending on the type of object, I need to pass a different value.
Example:
array 1: fruits.type.name
array 2: animals.family.name
Is there any option other than crafting a custom component from scratch using something like v-text-input?
You mean something like this? Check this codesandbox I made:
https://codesandbox.io/s/stack-71429578-switch-autocomplete-array-757lve?file=/src/components/Example.vue
computed: {
  // pick the source array based on the selected type
  autoArray() {
    return this.typeAnimal ? this.animals : this.fruits
  },
  // the item-value path differs between the two structures
  autoTypeId() {
    return this.typeAnimal ? 'family.id' : 'type.id'
  },
  // the item-text path differs between the two structures
  autoText() {
    return this.typeAnimal ? 'family.name' : 'type.name'
  }
}
With the help of a couple of computed props, you can switch the array, item-text, and item-value depending on the array you're working with, as sketched below.
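A minimal sketch of how those computed props might be wired up (typeAnimal, animals, and fruits are assumed to be data props, as in the codesandbox):

<v-autocomplete
  :items="autoArray"
  :item-value="autoTypeId"
  :item-text="autoText"
/>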
As far as I know, there's no easy way to supply two different arrays to v-autocomplete and retain the search functionality.
You could probably join the arrays and write a custom filter property. Then use selection and item slots to change the output of the select based on the structure.
But if your data arrays aren't too complicated, I would avoid the above. Instead, I would loop through both arrays and build a new combined one with a coherent structure, as sketched below.
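A minimal sketch of that normalization (the field names are assumptions based on the fruits.type.name / animals.family.name example in the question):

computed: {
  combined() {
    // build one array with a coherent { id, name, kind } shape
    return [
      ...this.fruits.map(f => ({ id: f.type.id, name: f.type.name, kind: 'fruit' })),
      ...this.animals.map(a => ({ id: a.family.id, name: a.family.name, kind: 'animal' })),
    ];
  }
}

Then bind it directly: <v-autocomplete :items="combined" item-text="name" item-value="id" />.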
We have a large corpus of JSON-formatted documents to search through to find patterns and historical trends. Elasticsearch seems like the perfect fit for this problem. The first trick is that the documents are collections of tens of thousands of "nested" documents (with a header). The second trick is that these nested documents represent data with varying types.
In order to accommodate this, all the value fields have been "encoded" as an array of strings, so a single integer value has been stored in the JSON as "[\"1\"]", and a table of floats is flattened to "[\"123.45\",\"678.9\",...]" and so on. (We also have arrays of strings, which don't need converting.) While this is awkward, I would have thought this would be a good compromise, given the way everything else involved in Elasticsearch seems to work.
The particular problem here is that these stored data values might represent a bitfield, from which we may need to inspect the state of one bit. Since this field will have been stored as a single-element string array, like "[\"14657\"]", we need to convert that to a single integer, and then bit-shift it multiple times to the desired bit (or apply a mask, if such a function is available).
With Elasticsearch, I see that I can embed "Painless" scripts, but the examples vary, and I haven't been able to find one that shows how I can convert the arbitrary-length string-array data field to the appropriate types for further comparison. Here's my query script as it stands.
{
  "_source" : false,
  "from" : 0, "size" : 10,
  "query": {
    "nested": {
      "path": "Variables",
      "query": {
        "bool": {
          "must": {
            "match": { "Variables.Designation": "Big_Long_Variable_Name" }
          },
          "must_not": {
            "match": { "Variables.Data": "[0]" }
          },
          "filter": {
            "script": {
              "script": {
                "source": "def vals = doc['Variables.Data']; return vals[0] != params.setting;",
                "params": {
                  "setting": 3
                }
              }
            }
          }
        }
      },
      "inner_hits": {
        "_source": "Variables.Data"
      }
    }
  }
}
I need to somehow transform the vals variable to an array of ints, pick off the first value, do some bit operations, and make a comparison to return true or false. In this example, I'm hoping to be able to set "setting" equal to the bit position I want to check for on/off.
I've already been through the exercise with Elasticsearch in finding out that I needed to make my Variables.Data field a keyword so I could search on specific values in it. I realize that this is getting away from the intent of Elasticsearch, but I still think this might be the best solution, for other reasons. I created a new index, and reimported my test documents, and the index size went up about 30%. That's a compromise I'm willing to make, if I can get this to work.
What tools do I have in Painless to make this work? (Or, am I crazy to try to do this with this tool?)
I would suggest that you encode your data in Elasticsearch-provided types wherever possible (and even when not) to get the most out of Painless. For instance, you can encode the bit strings as arrays of 1's and 0's for easier manipulation in Painless.
Painless, in my opinion, is still primitive. It's hard to debug. It's hard to read. It's hard to maintain. And, it's a horrible idea to have large functions in Painless.
To answer your question, you'd basically need to parse the array string with Painless and get it into one of the available datatypes in order to do the comparison you want. For example, for the list, you'd use something like the split function, and then manually cast each item in the result to int, float, string, etc., as sketched below.
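For the single-element bitfield case in the question, the parsing might look roughly like this. This is only a sketch: it assumes Variables.Data is a keyword field holding strings like "[\"14657\"]" and that params.setting is the bit position to test.

def raw = doc['Variables.Data'].value;   // e.g. '["14657"]'
// strip the brackets and quotes to leave just the digits (assumes one element)
def digits = raw.replace('[', '').replace(']', '').replace('"', '');
int value = Integer.parseInt(digits);
// shift the desired bit down and mask it off
return ((value >> params.setting) & 1) == 1;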
Use the execute API to test small bits before adding this to your scripted field:
POST /_scripts/painless/_execute
{
  "script": {
    "source": """
      ArrayList arr = [];   // to start with
      // use arr.add(value) to append after parsing
      arr.add(params.foo);
      arr.add(params.bar);
      return arr;
    """,
    "params": {
      "foo": 100.0,
      "bar": 1000.0
    }
  }
}
On the other hand, if you save your data in Elasticsearch-provided datatypes (note that Elasticsearch supports saving lists inside documents), this task becomes far easier in Painless.
For example, instead of having my_doc.foo = "[\"123.45\",\"678.9\",...]" as a string to be parsed later, why not save it as a native list of floats, like my_doc.foo = [123.45, 678.9, ...]?
This way, you avoid the unnecessary Painless code required to parse the text.
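For instance, a mapping along these lines (a sketch; the nested path matches the query above, but the exact mapping is an assumption) would let each document carry "Data": [123.45, 678.9] directly, so doc['Variables.Data'] in Painless yields numbers instead of text to parse:

"mappings": {
  "properties": {
    "Variables": {
      "type": "nested",
      "properties": {
        "Data": { "type": "float" }
      }
    }
  }
}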
I'd like to add an extension Coding to DSTU2 ClaimResponse.item.adjudication.code, which has an extensible binding strength. I have three candidate formats; which one is proper, or, if none of them, what is the suggested format? Thanks.
a. Use FHIR code "system" with a new code value
"adjudication":[
{
"code":{
"system":"http://hl7.org/fhir/ValueSet/adjudication",
"code":"allowed"
},
"amount":{
"value":21,
"system":"urn:std:iso:4217",
"code":"USD"
}
}
]
b. Use custom code "system" with a new code value
"adjudication":[
{
"code":{
"system":"http://myhealth.com/ClaimResponse/adjudication#allowed",
"code":"allowed"
},
"amount":{
"value":21,
"system":"urn:std:iso:4217",
"code":"USD"
}
}
]
c. Use extension
"adjudication":[
{
"code":{
"extension":[
{
"url":"http://myhealth.com/ClaimResponse/adjudication#allowed",
"valueCode":"allowed"
}
]
},
"amount":{
"value":234,
"system":"urn:std:iso:4217",
"code":"USD"
}
}
]
Option b is the closest, but the system URL looks a little funky. Something like this would be better: "system":"http://myhealth.com/CodeSystem/adjudication-code"
The system should ideally be a URL that resolves to the code system definition (though it doesn't have to) and should apply to a set of codes, not the single code you're conveying. (While it's possible to have one-code code systems, it's more than a little unusual.)
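Restated, option b would then look like this (the system URL is just an illustration):

"adjudication":[
  {
    "code":{
      "system":"http://myhealth.com/CodeSystem/adjudication-code",
      "code":"allowed"
    },
    "amount":{
      "value":21,
      "system":"urn:iso:std:iso:4217",
      "code":"USD"
    }
  }
]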
Option a is wrong because we never send the value set URL as the Coding.system. Option c is unnecessary - with an extensible binding, you're free to use any codes that aren't already covered by the defined value set.
All that said, it's not clear that "allowed" makes sense as a value for "code" given the other options in the extensible value set. You might also look at the draft STU 3 version which eliminates "code" altogether. See if that design will meet your needs better, and if not, provide feedback when it goes to ballot this August.
In my ElasticSearch index, location is a MultiValueField. When I write a custom scoring formula for my documents involving location, I want the script to pick up on whichever location is the closest to the point in my query.
So, I have this part of my scoring formula:
...
if (!doc['location'].empty && doc['location'].values.length > 1) {
  least_distance = 10000;
  foreach (loc_index: doc['location'].values) {
    temp_distance = loc_index.distance(lat, lng);
    if (temp_distance < least_distance) {
      least_distance = temp_distance;
    }
...
It's not the most elegant (I'm new to MVEL and ES), but conceptually I'm first checking whether doc['location'] indeed has more than one location in it, and if so, going through each of the locations to calculate distance, keeping track of the minimum distance found so far.
When I do this, ElasticSearch is returning an error:
Query Failed [Failed to execute main query]]; nested: PropertyAccessException[[Error: unable to resolve method: org.elasticsearch.common.geo.GeoPoint.distance(java.lang.Double, java.lang.Double)
which I think means that it doesn't want to do .distance() on a GeoPoint, which for some reason is different than a field that I might get by doing doc['location'].
Am I interpreting this situation correctly, and does anybody know of a workaround? Is there a way to just calculate distance (ideally without actually putting all the arithmetic for the distance between two coordinates) using ElasticSearch?
The issue here is that calling .values gives a list of GeoPoint objects. There is a workaround, although we need to do a bit of extra work to pull in the appropriate Java classes: we need the latitude and longitude of both points.
import org.elasticsearch.common.geo.GeoDistance;
import org.elasticsearch.common.unit.DistanceUnit;

base_point = doc['base_location'].value;
least_distance = 10000;
if (!doc['location'].empty && doc['location'].values.length > 1) {
  foreach (loc_index: doc['location'].values) {
    distance = GeoDistance.PLANE.calculate(loc_index.lat, loc_index.lon, base_point.lat, base_point.lon, DistanceUnit.MILES);
    // keep the minimum, as in the question's scoring formula
    if (distance < least_distance) {
      least_distance = distance;
    }
  }
}
We can get the result in different units, as described by the DistanceUnit enum. We can also use different calculation methodologies, like GeoDistance.ARC.
I'm testing out CouchDB to see how it could handle logging some search results. What I'd like to do is produce a view where I can produce the top queries from the results. At the moment I have something like this:
Example document portion
{
"query": "+dangerous +dogs",
"hits": "123"
}
Map function
(Not exactly what I need/want but it's good enough for testing)
function(doc) {
  if (doc.query) {
    var split = doc.query.split(" ");
    for (var i in split) {
      emit(split[i], 1);
    }
  }
}
Reduce Function
function (key, values, rereduce) {
  return sum(values);
}
Now this will get me results where a query term is the key and the count for that term is the value, which is great. But I'd like it ordered by the value, not the key. From the sounds of it, this is not yet possible with CouchDB.
So does anyone have any ideas of how I can get a view where I have an ordered version of the query terms & their related counts? I'm very new to CouchDB and I just can't think of how I'd write the functions needed.
It is true that there is no dead-simple answer. There are several patterns however.
1. http://wiki.apache.org/couchdb/View_Snippets#Retrieve_the_top_N_tags. I do not personally like this one, because the authors acknowledge that it is a brittle solution, and the code is not relaxing-looking.
2. Avi's answer, which is to sort in-memory in your application.
3. couchdb-lucene, which it seems everybody finds themselves needing eventually!
What I like is what Chris said in Avi's quote: relax. In CouchDB, databases are lightweight and excel at giving you a unique perspective of your data. These days, the buzz is all about filtered replication, which lets you slice out subsets of your data into a separate DB.
Anyway, the basics are simple. You take the .rows from the view output and insert them into a separate DB which simply emits, keyed on the count. An additional trick is to write a very simple _list function. Lists "render" the raw couch output into different formats. Your _list function should output
{ "docs":
[ {..view row1...},
{..view row2...},
{..etc...}
]
}
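A minimal sketch of such a _list function (the name bulkdocs_formatter matches the curl example below; the row-to-doc mapping is an assumption):

ddoc.lists.bulkdocs_formatter = function(head, req) {
  start({ headers : { "Content-Type" : "application/json" } });
  var row, docs = [];
  while (row = getRow()) {
    // row.key is the query term, row.value is its count
    docs.push({ term : row.key, count : row.value });
  }
  send(JSON.stringify({ docs : docs }));
};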
What that will do is format the view output exactly the way the _bulk_docs API requires it. Now you can pipe curl directly into another curl:
curl host:5984/db/_design/myapp/_list/bulkdocs_formatter/query_popularity \
| curl -X POST -H "Content-Type: application/json" -d @- host:5984/popularity_sorter/_bulk_docs
In fact, if your list function can handle all the docs, you may just have it sort them itself and return them to the client sorted.
This came up on the CouchDB-user mailing list, and Chris Anderson, one of the primary developers, wrote:
This is a common request, but not supported directly by CouchDB's views -- to do this you'll need to copy the group-reduce query to another database, and build a view to sort by value.
This is a tradeoff we make in favor of dynamic range queries and incremental indexes.
I needed to do this recently as well, and I ended up doing it in my app tier. This is easy to do in JavaScript:
db.view('mydesigndoc', 'myview', {'group':true}, function(err, data) {
  if (err) throw new Error(JSON.stringify(err));
  data.rows.sort(function(a, b) {
    return a.value - b.value;
  });
  data.rows.reverse(); // optional, depending on your needs
  // do something with the data…
});
This example runs in Node.js and uses node-couchdb, but it could easily be adapted to run in a browser or another JavaScript environment. And of course the concept is portable to any programming language/environment.
HTH!
This is an old question, but I feel it still deserves a decent answer (I spent at least 20 minutes searching for the correct one...).
I disapprove of the other suggestions in the answers here and feel that they are unsatisfactory. In particular, I don't like the suggestion to sort the rows in the application layer, as it doesn't scale well and doesn't deal with the case where you need to limit the result set in the DB.
The better approach that I came across is suggested in this thread: if you need to sort by the values, add them into the key set and then query the keys as a range, specifying the desired key prefix and leaving the rest of the range open. For example, if your key is composed of country, state and city:
emit([doc.address.country, doc.address.state, doc.address.city], doc);
Then you query just the country and get free sorting on the rest of the key components:
startkey=["US"]&endkey=["US",{}]
If you also need to reverse the order, note that simply defining descending: true will not suffice. You actually need to reverse the startkey and endkey order, i.e.:
startkey=["US",{}]&endkey=["US"]
See more reference at this great source.
I'm unsure about the 1 you have as your returned result, but I'm positive this should do the trick:
emit([doc.hits, split[i]], 1);
The rules of sorting are defined in the docs.
Based on Avi's answer, I came up with this CouchDB list function that worked for my needs; it is simply a report of the most popular events (key = event name, value = attendees).
ddoc.lists.eventPopularity = function(head, req) {
  start({ headers : { "Content-type" : "text/plain" } });
  var row, data = [];
  while (row = getRow()) {
    data.push(row);
  }
  // sort ascending by attendee count, then reverse for most-popular-first
  data.sort(function(a, b) {
    return a.value - b.value;
  }).reverse();
  for (var i in data) {
    send(data[i].value + ': ' + data[i].key + "\n");
  }
}
For reference, here's the corresponding view function:
ddoc.views.eventPopularity = {
  map : function(doc) {
    if (doc.type == 'user') {
      for (var i in doc.events) {
        emit(doc.events[i].event_name, 1);
      }
    }
  },
  reduce : '_count'
}
And the output of the list function (snipped):
165: Design-Driven Innovation: How Designers Facilitate the Dialog
165: Are Your Customers a Crowd or a Community?
164: Social Media Mythbusters
163: Don't Be Afraid Of Creativity! Anything Can Happen
159: Do Agencies Need to Think Like Software Companies?
158: Customer Experience: Future Trends & Insights
156: The Accidental Writer: Great Web Copy for Everyone
155: Why Everything is Amazing But Nobody is Happy
I think every solution above will hurt CouchDB performance. I am very new to this database, but as I understand it, CouchDB prepares view results before they are queried, so it seems we need to prepare the results manually: keep each search term in the database with its hit count, and whenever somebody searches, look up each of its terms and increment their hit counts. When we want to see search-term popularity, a view just emits (hitcount, searchterm) pairs.
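A rough sketch of that idea (the document shape and view name are assumptions):

// one document per search term, e.g. { _id: "term:dogs", type: "term", hits: 42 },
// with the application incrementing hits on every search;
// a view keyed on the count then gives popularity ordering for free
ddoc.views.termPopularity = {
  map : function(doc) {
    if (doc.type == 'term') {
      emit(doc.hits, doc._id.slice('term:'.length));
    }
  }
}

Query it with descending=true to get the most popular terms first.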
The link Retrieve_the_top_N_tags above seems to be broken, but I found another solution here.
Quoting the dev who wrote that solution:
rather than returning the results keyed by the tag in the map step, I would emit every occurrence of every tag instead. Then in the reduce step, I would calculate the aggregation values grouped by tag using a hash, transform it into an array, sort it, and choose the top 3.
As stated in the comments, the only problem would be in case of a long tail:
Problem is that you have to be careful with the number of tags you obtain; if the result is bigger than 500 bytes, you'll have couchdb complaining about it, since "reduce has to effectively reduce". 3 or 6 or even 20 tags shouldn't be a problem, though.
It worked perfectly for me; check the link to see the code!