Is there a way to efficiently find all unique values for a field in MongoDB? - ruby

Consider a collection of Users:
{ name: 'Jeff' }
{ name: 'Joel' }
Is there a way to efficiently get all the unique values for name?
User.pluck(:name).uniq
to return:
[ 'Jeff', 'Joel' ]
I think this would get the whole collection, so it would be inefficient.
However, if there is an index on name, is there a way to get all the unique values without getting all the documents?
Or is there another way to efficiently get the unique names?

As indicated in the comments, you can efficiently get the unique values of a field over all docs in a collection using distinct.
The documentation specifically mentions that indexes are used when possible, and that they can cover the distinct query. This means that only the supporting index needs to be loaded into memory to get the results.
When possible, db.collection.distinct() operations can use indexes.
Indexes can also cover db.collection.distinct() operations. See
Covered Query for more information on queries covered by indexes.
In Ruby, you would perform your distinct query as:
User.distinct(:name)
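For reference, here is the same thing in the mongo shell, together with the index that allows the distinct to run as a covered query (the collection name users is assumed for this sketch):

// index on name, so distinct("name") can be answered from the index alone
db.users.createIndex({ name: 1 })

// returns [ "Jeff", "Joel" ] without reading the documents themselves
db.users.distinct("name")

// the underlying database command, in case you want to inspect its plan
db.runCommand({ distinct: "users", key: "name" })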

Related

Get raw Low Cardinality values in Clickhouse

Is there a way to retrieve the underlying values of LowCardinality types in Clickhouse? I would also need to retrieve a mapping (in a separate query) of the underlying values to the logical values. I've tried using lowCardinalityIndices and lowCardinalityKeys but it appears that indices -> keys returned by those functions are a many to many mapping.
Thank you!
The question rests on a false premise: a column with LowCardinality does not have a single, table-wide dictionary. Each data part keeps its own dictionary (or dictionaries) for the same LowCardinality column. That is why lowCardinalityIndices and lowCardinalityKeys appear to give a many-to-many mapping.
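If you want to observe that per-part behaviour directly, here is a minimal sketch using the Node.js @clickhouse/client package against a local server (the table lc_demo and its contents are made up for illustration):

import { createClient } from '@clickhouse/client'

const client = createClient() // assumes a server on http://localhost:8123

// lc_demo is a throwaway demo table; every INSERT below creates its own
// data part, and each part keeps its own dictionary for the column.
await client.command({
  query: `CREATE TABLE lc_demo (name LowCardinality(String))
          ENGINE = MergeTree ORDER BY tuple()`,
})
await client.command({ query: "INSERT INTO lc_demo VALUES ('Jeff'), ('Joel')" })
await client.command({ query: "INSERT INTO lc_demo VALUES ('Joel'), ('Anna')" })

// lowCardinalityIndices is evaluated per part, so the same value
// ('Joel') can map to different indices in different parts.
const result = await client.query({
  query: 'SELECT _part, name, lowCardinalityIndices(name) AS idx FROM lc_demo',
  format: 'JSONEachRow',
})
console.log(await result.json())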

Categorising documents in elasticsearch

I've got a bunch of ES documents that I'd like to put into "collections". Each document has a unique integer as an ID. Each collection also needs to have a unique integer as an ID.
I need to be able to run queries to get a list of docs in a collection, and easily add an existing doc to a collection.
What would be the most efficient and logical way of approaching this:
An index of collections, each of which has an array of document IDs, or
For each document have an array of integers (or a single integer) indicating to which collections it belongs?
Thank you.
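For what it's worth, here is a minimal sketch of the second option, which maps directly onto both requirements (the index name docs, the field name collections, and a recent Elasticsearch with the _update script API are all assumptions of this sketch):

const ES = 'http://localhost:9200' // assumed local cluster
const headers = { 'Content-Type': 'application/json' }

// each document stores the integer IDs of the collections it belongs to
await fetch(`${ES}/docs/_doc/42`, {
  method: 'PUT',
  headers,
  body: JSON.stringify({ title: 'some doc', collections: [7] }),
})

// list all docs in collection 7: a plain term query on the array field
const res = await fetch(`${ES}/docs/_search`, {
  method: 'POST',
  headers,
  body: JSON.stringify({ query: { term: { collections: 7 } } }),
})
console.log((await res.json()).hits.hits)

// add an existing doc to collection 9 without resending its source
await fetch(`${ES}/docs/_update/42`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    script: {
      source: 'if (!ctx._source.collections.contains(params.c)) { ctx._source.collections.add(params.c) }',
      params: { c: 9 },
    },
  }),
})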

elasticsearch: decide which query should run first

We have a simple web page, where the user can provide some input and query the database. We currently use mongodb but want to migrate to elasticsearch, since the queries are faster.
There are some required search fields, like start and end date, and some optional ones, like a search string to match an entry, or a parent search string to match parent entries. Parent-child relations are simply described through fields containing each entry's ancestor ids.
The question is the following: if both a search string and a parent search string are provided, is there a way to know, before executing the queries, which one should run first so that results are returned faster?
For example, a specific parent search might match only 2 parent entries, and we could then fetch all children that match the search string. In that case we should execute the parent query first and then the entry query.
One option would be to get the count of both queries and then execute the one with the smaller count first, but isn't that worse, since each query would then run twice: once for the count and once for the actual results?
Are there any other options to solve this?
PS. We use elasticsearch v1.7
Example
Let's say the user wants to search for all entries matching the following fields.
searchString: type:BLOCK AND name:test
parentSearchString: name:parentTest AND NOT type:BLOCK
This means that we either have to
fetch all entries (parents) matching the parentSearchString and store their ids. Then, we have to fetch all entries that match the searchString and also have to contain any of the parent ids in the ancestors field.
OR
fetch all entries that match the searchString and store all ancestors ids. Then fetch all entries that match the parentSearchString and their id is one of the ancestors ids.
Just to clarify, both parent and children entries have exactly the same structure and reside in the same index. We cannot use separate indices, since the parent-child relation can be nested up to 10 levels deep, so an entry can be both a parent and a child. An entry looks more or less like:
{
  id: "e32452365321",
  name: "name",
  type: "type",
  ancestors: "id1 id2 id3" // stored in node as an array of ids
}
First of all, I would advise you to upgrade your Elasticsearch version if possible. A lot has happened since 1.7, and to be honest, I can't tell whether everything in the following article still holds for such an old version (probably it doesn't).
But to your actual question: if I understand you correctly, you are trying to estimate how costly a query is for Elasticsearch. Well, you don't have to; if you provide all 'queries' in one nested query, Elasticsearch will do that for you: https://www.elastic.co/blog/elasticsearch-query-execution-order
Regarding speed, there is one other thing worth mentioning: calculating the score takes time. So if you don't sort by the Elasticsearch _score, you want to use boolean filter queries. The same applies if you want to sort only by the _score of parent matches: then you could put the query for the children into a filter.
Update
Thanks to your example, I now see the problem. Self-referential parent-child relations are unfortunately not supported by Elasticsearch, so your approach is probably right. You might want to check out the short chapter of the documentation about application-joins.
So yes, in general you want to send the second query with the smallest possible number of ids/terms. Getting the counts for both queries first is not as bad as you might think, because the results will most likely still be cached when the real query runs. But does it actually help? If you go from child to parent, you would have to count the distinct ancestors (field values), not the matching documents.
I would argue that the most expensive operation is very often fetching the result _source from disk. So whichever way you go, you should fetch only what you need in the first query. Your options are (see the sketch after this list):
Fetch only the id of parent matches, and then use a terms filter on ancestors in the second query.
Or fetch only the ancestors field of child matches, and use an id filter in your second query.
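Here is a sketch of the first option as an application-side join, written in current query syntax (on 1.7 the bool/filter part would be a filtered query instead; the index name entries is an assumption, the field names come from the example entry above):

const ES = 'http://localhost:9200' // assumed local cluster
const headers = { 'Content-Type': 'application/json' }

// step 1: parent query, fetching ids only (no _source from disk)
const parents = await fetch(`${ES}/entries/_search`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    size: 1000, // arbitrary cap for the sketch
    _source: false,
    query: { query_string: { query: 'name:parentTest AND NOT type:BLOCK' } },
  }),
}).then(r => r.json())
const parentIds = parents.hits.hits.map(h => h._id)

// step 2: child query, restricted to descendants of those parents via a
// terms filter on ancestors, which skips scoring for that clause
const children = await fetch(`${ES}/entries/_search`, {
  method: 'POST',
  headers,
  body: JSON.stringify({
    query: {
      bool: {
        must: { query_string: { query: 'type:BLOCK AND name:test' } },
        filter: { terms: { ancestors: parentIds } },
      },
    },
  }),
}).then(r => r.json())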
Unfortunately, I can't help you more than that, since I don't have enough experience comparing the speed of those approaches. My guess would be that an id filter is faster in general. But that's just a guess...

RethinkDb OrderBy Before Filter, Performance

The data table is the biggest table in my db. I would like to query the db and then order the results by the entries' timestamps. Common sense suggests filtering first and then manipulating the data.
queryA = r.table('data').filter(filter).filter(r.row('timestamp').minutes().lt(5)).orderBy('timestamp')
But this is not possible, because the filter creates a side table, and the command throws an error (https://github.com/rethinkdb/rethinkdb/issues/4656).
So I was wondering: if I put the orderBy first, would that hurt performance once the database grows huge over time?
queryB = r.table('data').orderBy('timestamp').filter(filter).filter(r.row('timestamp').minutes().lt(5))
Currently I sort after querying, but usually databases are quicker at this.
queryA.run (err, entries) ->
  ...
  entries = _.sortBy(entries, 'timestamp').reverse() # this takes ~2000ms on my local machine
Question:
What is the best approach (performance-wise) to query these entries ordered by timestamp?
Edit:
The db is run with one shard.
Using an index is often the best way to improve performance.
For example, an index on the timestamp field can be created:
r.table('data').indexCreate('timestamp')
It can be used to sort documents:
r.table('data').orderBy({index: 'timestamp'})
Or to select a given range, for example the past hour:
r.table('data').between(r.now().sub(60*60), r.now(), {index: 'timestamp'})
The last two operations can be combined into one:
r.table('data').between(r.now().sub(60*60), r.maxval, {index: 'timestamp'}).orderBy({index: 'timestamp'})
Additional filters can also be added. A filter should always be placed after an indexed operation:
r.table('data').orderBy({index: 'timestamp'}).filter({colour: 'red'})
This restriction on filters is only for indexed operations. A regular orderBy can be placed after a filter:
r.table('data').filter({colour: 'red'}).orderBy('timestamp')
For more information, see the RethinkDB documentation: https://www.rethinkdb.com/docs/secondary-indexes/python/
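Putting the pieces together with the official JavaScript driver, inside an async context (table, index, and filter values are taken from the snippets above; the index must finish building before it can be used, hence indexWait):

const r = require('rethinkdb')

const conn = await r.connect({ host: 'localhost', port: 28015 })

// one-time setup: create the index and wait until it is ready
await r.table('data').indexCreate('timestamp').run(conn)
await r.table('data').indexWait('timestamp').run(conn)

// past hour, newest first; the non-indexed filter comes last
const cursor = await r.table('data')
  .between(r.now().sub(60 * 60), r.maxval, { index: 'timestamp' })
  .orderBy({ index: r.desc('timestamp') })
  .filter({ colour: 'red' })
  .run(conn)
const entries = await cursor.toArray()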

MongoDB: Two-field index vs document field index

I need to index a collection by two fields (a unique index), say field1 and field2. Which approach is better in terms of performance:
Create a regular two-column index
-or -
Combine those two fields in a single document field {field1 : value, field2 : value2} and index that field?
Note: I will always be querying by those two fields together.
You can keep the fields separate and create a single compound index, which will improve performance when querying both fields together.
db.things.ensureIndex({field1:1, field2:1});
http://www.mongodb.org/display/DOCS/Indexes#Indexes-CompoundKeysIndexes
Nesting the fields inside a single embedded document provides no performance increase, because you must index them the same way (note that dotted key names have to be quoted):
db.things.ensureIndex({"fields.field1": 1, "fields.field2": 1});
http://www.mongodb.org/display/DOCS/Indexes#Indexes-EmbeddedKeys
Or you can index the entire document
db.things.ensureIndex({fields: 1});
http://www.mongodb.org/display/DOCS/Indexes#Indexes-DocumentsasKeys
There could be some performance difference, but probably not much. Create test data in a test database and benchmark both approaches to find out. We would love to hear your results.
I'd create a compound index over both fields. This'll take up less disk space because you won't need to store the extra combined field, and give you the bonus of an additional index over the first field, i.e. an index over { a:1, b:1 } is also an index over { a:1 }.
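Since the question asks for a unique index, here is the compound-index approach sketched in the mongo shell (createIndex is the current name for ensureIndex; the collection name things is kept from the answer above):

// unique compound index over both fields
db.things.createIndex({ field1: 1, field2: 1 }, { unique: true })

// queries on both fields use the index; projecting only indexed
// fields even makes this a covered query
db.things.find(
  { field1: "a", field2: "b" },
  { _id: 0, field1: 1, field2: 1 }
)

// prefix bonus: the same index also serves queries on field1 alone
db.things.find({ field1: "a" })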
