How to optimize filtering: RethinkDB - rethinkdb

Introduction
I have been working with RethinkDB using the Data Explorer tab. I am new to RethinkDB.
I have created this query to filter records by date. I need to optimize the query so that it takes less time on large numbers of records.
r.db('test').table('usrz').filter(function(test) {
  return test("createdDate").date().during(
    r.time(2016, 12, 20, 'Z'),
    r.time(2016, 12, 30, 'Z'))
}).orderBy(r.desc('createdDate'))
Any help or reference will be appreciated. Thanks for your time.

RethinkDB queries can be optimized by using indexes (see https://www.rethinkdb.com/docs/secondary-indexes/javascript).
To create an index:
r.table('usrz').indexCreate('createdDate')
Your query can be converted to use that index by transforming the filter/during combination into a between and by adding an index argument to orderBy.
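A sketch of the rewritten query (assuming the createdDate index above has finished building; by default both during and between include the lower bound and exclude the upper one, so comparing the raw timestamps against midnight bounds selects the same range as the original date() comparison):

r.db('test').table('usrz')
  // between uses the index to read only the matching range of rows
  .between(r.time(2016, 12, 20, 'Z'), r.time(2016, 12, 30, 'Z'),
           {index: 'createdDate'})
  // an indexed orderBy avoids loading the whole result set into memory
  .orderBy({index: r.desc('createdDate')})

Since between and orderBy use the same index, the server can stream rows in index order instead of scanning and sorting the whole table.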

Related

Hibernate criteria query with restrictions takes long execution time

In our application we are using a Criteria query with about 15 restrictions to filter results.
The query looks like this:
SELECT column1, column2, column3... FROM Table WHERE column = ?1 AND column = ?2....
Based upon the input, the AND clause grows. I really can't find where the issue lies.
criteria.addRestrictions() is what we are using to build this kind of dynamic query. We validate each input with a simple if clause using the predicate input != null. All the inputs are strings. We also don't have indexes on every column. I know that without indexes a full table scan will be performed, but I want to know what is causing the long execution time for this query.
Any help is appreciated. Thanks.

RethinkDb OrderBy Before Filter, Performance

The data table is the biggest table in my db. I would like to query the db and then order the results by the entries' timestamps. Common sense says to filter first and then manipulate the data.
queryA = r.table('data').filter(filter).filter(r.row('timestamp').minutes().lt(5)).orderBy('timestamp')
But this is not possible, because the filter produces a selection rather than a table, and the command would throw an error (see https://github.com/rethinkdb/rethinkdb/issues/4656).
So I was wondering, if I put the orderBy first, whether this would hurt performance once the database gets huge over time.
queryB = r.table('data').orderBy('timestamp').filter(filter).filter(r.row('timestamp').minutes().lt(5))
Currently I sort after querying, but databases are usually quicker at this.
queryA.run (err, entries) ->
  ...
  entries = _.sortBy(entries, 'timestamp').reverse() # this sort takes ~2000ms on my local machine
Question:
What is the best approach (performance-wise) to query these entries ordered by timestamp?
Edit:
The db is run with one shard.
Using an index is often the best way to improve performance.
For example, an index on the timestamp field can be created:
r.table('data').indexCreate('timestamp')
It can be used to sort documents:
r.table('data').orderBy({index: 'timestamp'})
Or to select a given range, for example the past hour:
r.table('data').between(r.now().sub(60*60), r.now(), {index: 'timestamp'})
The last two operations can be combined into one:
r.table('data').between(r.now().sub(60*60), r.maxval, {index: 'timestamp'}).orderBy({index: 'timestamp'})
Additional filters can also be added. A filter should always be placed after an indexed operation:
r.table('data').orderBy({index: 'timestamp'}).filter({colour: 'red'})
This restriction on filters is only for indexed operations. A regular orderBy can be placed after a filter:
r.table('data').filter({colour: 'red'}).orderBy('timestamp')
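Putting those pieces together for the question's query, a sketch might look like the following (it assumes the minutes().lt(5) filter was meant to select entries from the last five minutes, and that the timestamp index above exists):

r.table('data')
  // use the index to select only recent entries
  .between(r.now().sub(5 * 60), r.maxval, {index: 'timestamp'})
  // the same index keeps the slice ordered, newest first
  .orderBy({index: r.desc('timestamp')})
  // any remaining conditions run on the already-reduced slice
  .filter(filter)

This keeps both the range selection and the ordering on the index, so only the matching slice is read and no in-memory sort is needed.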
For more information, see the RethinkDB documentation: https://www.rethinkdb.com/docs/secondary-indexes/python/

Rethinkdb getAll, orderBy with index - Tags

I'm new to RethinkDB and I love it, but I found some problems when I tried to optimize my query and make it work on bigger datasets.
The problem is simple.
I need to filter my "event" table by timestamp (row.to) and by tag (row.tags), order by timestamp (row.from), and then slice for pagination.
row.tags has a multi index and works well!
row.from and row.to are the start/end times of an event.
The slow query (tested on 100k entries) is this:
r.db("test").table("event")
  .getAll(r.args(["148a6e03-b6c3-4092-afa0-3b6d1a4555cd", "7008d4b0-d859-49f3-b9e0-2e121f000ddf"]), {"index": "tags"})
  .filter(function(row) { return row("to").ge(r.epochTime(1480460400)); })
  .orderBy(r.asc("from"))
  .slice(0, 20)
I created an index on 'from' and tried
.orderBy(r.asc("from"), {index: 'from'})
but I get
e: Indexed order_by can only be performed on a TABLE or TABLE_SLICE in:
I have already read about the problems with index intersection in RethinkDB, but maybe I'm missing something; maybe there is a way to do this simple task.
Thank you.
The reason RethinkDB complains is this:
getAll returns a selection. When filter is applied to a selection, it returns a selection. When orderBy is applied to a selection, the index parameter can't be used (it can only be used when orderBy is applied to a table).
orderBy can be applied to a table, sequence, or selection. Only when it is applied to a table can the index parameter be used. This makes sense, as the index is updated when rows are added to and removed from the table.
In your case, you are applying orderBy to the result of filter, which is a selection. In order to sort a selection, the database needs to:
- read all the elements into memory (by default the maximum is 100,000 elements)
- sort them using the provided function or field
It can't use an index in this case.
The way to improve your query might be to sort the table first and then apply the filter; that way orderBy can use the index.
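Here is a rough sketch of that shape (untested; it assumes row("tags") is an array, so the getAll on the tags index is replaced by a contains check for either tag, mirroring the union semantics of the original getAll):

r.db("test").table("event")
  // indexed orderBy works here because it is applied directly to the table
  .orderBy({index: r.asc("from")})
  .filter(function(row) {
    return row("to").ge(r.epochTime(1480460400)).and(
      row("tags").contains("148a6e03-b6c3-4092-afa0-3b6d1a4555cd").or(
        row("tags").contains("7008d4b0-d859-49f3-b9e0-2e121f000ddf")));
  })
  .slice(0, 20)

The trade-off is that the tags index is no longer used: the server walks the table in 'from' order and filters lazily. That is fast when the first 20 matches appear early in the index, but can be slow when matching rows are rare.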

How to retrieve all document ids matching a search, in elastic search?

I'm working on a simple side project, and have a tech stack that involves both a SQL database and ElasticSearch. I only have ElasticSearch because I assumed that as my project grows, my full text searching would be most efficiently performed by ES. My ES schema is very simple - documents that I insert into ES have 2 fields, one being the id and the other being the field with the body of text to search. The id being inserted into ES corresponds to that document's primary key id from the SQL database.
insert record into SQL -> insert record into ES using PK from SQL
Searching would be the reverse of that. Query ES and grab all the matching ids, and then turn around and use those ids to get records from SQL.
search ES can get all PK ids -> use those ids to get documents from SQL
The problem that I am facing though, is that ES can only return documents in a paginated manner. This is a problem because I also have a WHERE clause on my SQL query, beyond just the ids. My SQL query might look like this ...
SELECT * FROM foo WHERE id IN (1,2,3,4,5) AND bar != 'baz'
Well, with ES paginating the results, my WHERE clause will always only be querying a subset of the full results from ES. Even if I utilize ES' skip and take, I'm still only querying SQL using a subset of document ids.
Is there a way to get Elasticsearch to return the entire list of matching document ids? I realize this restriction exists to stop me from shooting myself in the foot, because doing this across all shards and many, many documents is not efficient. Is there no way, though?
After putting in some hours on this project, I've only now realized that I've engineered this poorly, unless I can get all of these ids from ES. One alternative implementation I've thought of would be to also store in ES the fields that I'm filtering on in SQL. A problem there is that I'd have to update the ES document every time I update the document in SQL, which would require a pretty big rewrite of some of my data access code. I could also scrap Elasticsearch altogether and, for now, just perform searching in Postgres, until I can think of a better way to structure this.
Elasticsearch does not support returning every document that matches a query, because that would overload the system. Instead, use the scroll concept in Elasticsearch; it's like the cursor concept in databases.
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
For more examples, refer to this GitHub repo: https://github.com/sidharthancr/elasticsearch-java-client
Hope it helps.
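As an illustration, here is a minimal JavaScript sketch of the scroll pattern against the REST API. The host, index name ("documents"), and page size are assumptions, and the exact scroll endpoints differ between Elasticsearch versions, so treat this as a shape rather than a drop-in implementation:

const ES = 'http://localhost:9200';

async function allMatchingIds(query) {
  // open a scroll context; _source: false returns only metadata such as _id
  let resp = await fetch(ES + '/documents/_search?scroll=1m', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({size: 1000, _source: false, query: query})
  }).then(r => r.json());

  const ids = [];
  while (resp.hits.hits.length > 0) {
    for (const hit of resp.hits.hits) ids.push(hit._id);
    // fetch the next batch using the scroll id from the previous response
    resp = await fetch(ES + '/_search/scroll', {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({scroll: '1m', scroll_id: resp._scroll_id})
    }).then(r => r.json());
  }
  return ids;
}

The accumulated ids can then be fed into the SQL IN clause, in batches if the list is large.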
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-fields.html
Please have a look at the Elasticsearch documentation above; you can specify that only particular fields be returned from the matching documents.
Hope this resolves your problem.
{
  "fields" : ["user", "postDate"],
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}

This filters in memory right?

I just want to make sure I understand this correctly...
search is an object that contains a query string.
Repo.Query returns an ObjectQuery<T>.
From my understanding, the chained LINQ statements will filter the results after Entity Framework has returned all the rows satisfying the query. So really ALL the rows are being returned and THEN filtered in memory, meaning we are returning a bunch of data that we don't really want. About 10k rows are being returned, so this is kind of important. I'd just like to get my confusion cleared up.
var searchQuery = Repo.Query(search)
    .Where(entity =>
        entity.Prop1.ToUpper().Equals(prop1.ToUpper()) &&
        entity.Prop2.ToUpper().Equals(prop2.ToUpper()))
    .OrderBy(entity => Repo.SortExpression ?? entity.Prop1);
Your Repo.Query(string query) function should return IQueryable<T>.
Then you can filter and order without getting all rows first.
See the IQueryable(Of T) interface documentation.
Hope this helps.
If this is going to SQL, it will most likely be translated into a SQL query that filters on the server, not in memory.
As a matter of fact, the statement above wouldn't actually do anything by itself.
It's only when you iterate over it that the query is executed. This is why certain providers (like the EF-to-SQL one) can collapse expression trees into a SQL query.
The easiest way to check is to use LINQPad or SQL Profiler to see what query is actually executed.
