Is there a way to change the default limit of results in ArangoDB? I have a collection with more than 1000 records, and it seems impossible to retrieve them all at once.
I'm using the arangojs driver with Node, and I've also tried running the simple query in the Arango web interface (even with the limit set above 1000, it won't return more than 1000 records).
Other things that I've already tried:
Get the full collection using arango functions
db.collection("collection_name")
.all().then(() => ...)
Run the query using arango functions without setting limits
let query = `FOR document in vehicles
RETURN document`;
db.query(query)
.then(() => ...)
Run the query trying to paginate results
let query = `FOR document in vehicles
LIMIT 1000,2000
RETURN document`;
db.query(query)
.then(() => ...)
In all cases (including the last one) the results are limited to the first 1000 records, as if the rest weren't stored in the collection.
Can anyone help? Thank you.
1000 is the default cursor batch size. If you have more results, you have to fetch the additional batches. The simplest way to do this is using all(). For more details see the cursor documentation of the arangojs driver: https://www.arangodb.com/docs/stable/drivers/js-reference-cursor.html
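For example, a minimal sketch with arangojs (assuming a vehicles collection and default connection settings; adjust the URL and credentials to your setup):

const { Database } = require("arangojs");

const db = new Database({ url: "http://localhost:8529" });

async function fetchAllVehicles() {
  // db.query returns a cursor; only the first batch (1000 documents by default)
  // is transferred up front. cursor.all() keeps requesting batches until the
  // cursor is exhausted and resolves with the complete result array.
  const cursor = await db.query("FOR document IN vehicles RETURN document");
  return cursor.all();
}

fetchAllVehicles().then(docs => console.log(docs.length));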
Related
I have a requirement in which I just need a single row to be returned while querying a table in DynamoDB.
I can see a parameter in the AWS CLI named 'max-items' which apparently limits the result size of the query. Here is the sample query:
aws dynamodb query --table-name testTable \
    --key-condition-expression "CompositePartitionKey = :pk" \
    --expression-attribute-values '{ ":pk": { "S": "1234_125" }, ":ps": { "S": "SOME_STATE" } }' \
    --filter-expression 'StateAttribute IN (:ps) AND attribute_not_exists(AnotherAttribute)' \
    --index-name GSI_PK_SK --endpoint-url http://localhost:8000 --max-items 1
But I am not able to figure out any similar keyword/attribute in Go.
Here is what I could find relevant: How to set limit of matching items returned by DynamoDB using Java?
As you can see in DynamoDB's official description of the "Query" operation - https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html - the parameter you are looking for is "Limit". Please consult your Go library's documentation on how exactly to pass this parameter to the query.
By the way, note that Limit doesn't quite limit the number of returned results, but rather the number of items read on the server side. If your query has a filter, it can return fewer than Limit results. I don't know whether this matters to you or not.
You might want to look into pagination. You can use page-size to control how much you get in each query.
More details on pagination can be found in Paginating Table Query Results.
You need to paginate through DynamoDB's responses until your "user limit" is fulfilled.
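A hedged sketch of that pagination loop, shown with the AWS SDK for JavaScript (the Go SDK's QueryInput and QueryOutput expose the same Limit, ExclusiveStartKey, and LastEvaluatedKey fields); table, index, and attribute names are taken from the CLI example above:

const AWS = require("aws-sdk");
// Any region works against DynamoDB Local, but the SDK requires one to be set.
const dynamodb = new AWS.DynamoDB.DocumentClient({
  region: "us-east-1",
  endpoint: "http://localhost:8000",
});

async function queryFirstMatch() {
  let startKey;
  do {
    const page = await dynamodb.query({
      TableName: "testTable",
      IndexName: "GSI_PK_SK",
      KeyConditionExpression: "CompositePartitionKey = :pk",
      FilterExpression: "StateAttribute IN (:ps) AND attribute_not_exists(AnotherAttribute)",
      ExpressionAttributeValues: { ":pk": "1234_125", ":ps": "SOME_STATE" },
      Limit: 1,                    // items read per request, not items returned
      ExclusiveStartKey: startKey, // undefined on the first request
    }).promise();
    if (page.Items.length > 0) return page.Items[0]; // "user limit" of one fulfilled
    startKey = page.LastEvaluatedKey;                // otherwise keep paginating
  } while (startKey);
  return null; // no matching item
}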
The data table is the biggest table in my db. I would like to query the db and then order the results by the entries' timestamps. Common sense would be to filter first and then manipulate the data.
queryA = r.table('data').filter(filter).filter(r.row('timestamp').minutes().lt(5)).orderBy('timestamp')
But this is not possible, because the filter creates a side table and the command throws an error (https://github.com/rethinkdb/rethinkdb/issues/4656).
So I was wondering whether putting the orderBy first would hurt performance once the database gets huge over time.
queryB = r.table('data').orderBy('timestamp').filter(filter).filter(r.row('timestamp').minutes().lt(5))
Currently I order the entries after querying, but usually databases are quicker at these operations.
queryA.run (err, entries) ->
  ...
  entries = _.sortBy(entries, 'timestamp').reverse() # this takes ~2000ms on my local machine
Question:
What is the best approach (performance-wise) to query these entries ordered by timestamp?
Edit:
The db is run with one shard.
Using an index is often the best way to improve performance.
For example, an index on the timestamp field can be created:
r.table('data').indexCreate('timestamp')
It can be used to sort documents:
r.table('data').orderBy({index: 'timestamp'})
Or to select a given range, for example the past hour:
r.table('data').between(r.now().sub(60*60), r.now(), {index: 'timestamp'})
The last two operations can be combined into one:
r.table('data').between(r.now().sub(60*60), r.maxval, {index: 'timestamp'}).orderBy({index: 'timestamp'})
Additional filters can also be added. A filter should always be placed after an indexed operation:
r.table('data').orderBy({index: 'timestamp'}).filter({colour: 'red'})
This restriction on filters is only for indexed operations. A regular orderBy can be placed after a filter:
r.table('data').filter({colour: 'red'}).orderBy('timestamp')
For more information, see the RethinkDB documentation: https://www.rethinkdb.com/docs/secondary-indexes/python/
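Putting the pieces together for the original query, a hedged sketch (assuming the intent of r.row('timestamp').minutes().lt(5) is "entries from the last five minutes"; if you really mean the minute-of-hour component, keep that as a regular filter instead). The indexed between and orderBy run first, and the remaining non-indexed filter runs last:

r.table('data')
  .between(r.now().sub(5 * 60), r.maxval, {index: 'timestamp'}) // e.g. the last five minutes
  .orderBy({index: r.desc('timestamp')})                        // newest first, using the index
  .filter(filter)                                                // the remaining non-indexed predicate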
I have a fairly large CouchDB database (approximately 3 million documents). I have various view functions returning slices of the data that can't be modified (or at least, should only be modified as a last resort).
I need the ability to sort on an arbitrary field for reporting purposes. For smaller DBs, I return the entire object, json_parse it in our PHP backend, then sort there. However, we're often getting Out Of Memory errors when doing this on our largest DBs.
After some research, I'm leaning towards accessing a sort key (via URL parameter) in a list function and doing the sort there. This is an idea I've stolen from here. Excerpt:
function (head, req) {
  var row;
  var rows = [];
  while (row = getRow()) {
    rows.push(row);
  }
  rows.sort(function (a, b) {
    return b.value - a.value;
  });
  send(JSON.stringify({"rows": rows}));
}
It seems to be working for smaller DBs, but it still needs a lot of work to be production ready.
Is this:
a) a good solution?
b) going to work with 3, 5, or 10 million rows?
You can't avoid loading everything into memory by using a list function. So with enough data, eventually, you'll get an out of memory error, just as you're getting with PHP.
If you can live within the memory constraints, it's a reasonable solution, with some advantages.
Otherwise, investigate using something like lucene, elasticsearch, or Cloudant Search (clouseau & dreyfus).
In our environment, we have more than 5 million records. The CouchDB is designed such that every document has some specific fields which distinguish it from other categories of documents.
For example, there are a number of documents with the field DocumentType set to "User" or DocumentType set to "XXX".
This DocumentType field allows us to sort the documents into different categories.
So if you have 3 million docs and around 10 categories, each category will have about 300k docs.
Now you can design the system such that you always pass the DocId you need to Couch; that way it will be faster.
So the view's map function can be like:
function (doc) {
  if (doc.DocumentType === 'XXX' && doc._id) {
    emit(doc.FieldYouWant, doc._id);
  }
}
This is how our backend is designed in production.
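To illustrate how such a view might be queried (a sketch only; the database name mydb, design document docs, and view name by_field are hypothetical), the rows come back sorted by the emitted key, and the descending, limit, and skip parameters give you server-side ordering and paging:

// using the global fetch available in modern Node and browsers
const url = "http://localhost:5984/mydb/_design/docs/_view/by_field" +
            "?descending=true&limit=100&skip=0";

fetch(url)
  .then(res => res.json())
  .then(body => console.log(body.rows)); // rows are sorted by the emitted FieldYouWant key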
In MySQL I can do something like:
SELECT id FROM table WHERE field = 'foo' LIMIT 5
If the table has 10,000 rows, then this query is way way faster than if I left out the LIMIT part.
In ElasticSearch, I've got the following:
{
  "query": {
    "fuzzy_like_this_field": {
      "body": {
        "like_text": "REALLY LONG (snip) TEXT HERE",
        "max_query_terms": 1,
        "min_similarity": 0.95,
        "ignore_tf": true
      }
    }
  }
}
When I run this search, it takes a few seconds, whereas MySQL can return results for the same query in far, far less time.
If I pass in the size parameter (set to 1), it successfully only returns 1 result, but the query itself isn't any faster than if I had set the size to unlimited and returned all the results. I suspect the query is being run in its entirety and only 1 result is being returned after the query is done processing. This means the "size" attribute is useless for my purposes.
Is there any way to have my search stop searching as soon as it finds a single record that matches the fuzzy search, rather than processing every record in the index before returning a response? Am I misunderstanding something more fundamental about this?
Thanks in advance.
You are correct that the query is being run in its entirety. Queries by default return data sorted by score, so your query is going to score each document. The docs state that the fuzzy query isn't going to scale well, so you might want to consider other queries.
A limit filter might give you similar behavior to what you're looking for:
A limit filter limits the number of documents (per shard) to execute on.
To replicate MySQL's field = 'foo', try using a term filter. You should use filters when you don't care about scoring; they are faster and cacheable.
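For example, a sketch using the legacy filtered-query syntax from the same Elasticsearch era as fuzzy_like_this_field (the field name field and the value foo are placeholders): the term filter does the exact match, and the limit filter caps how many documents each shard executes on:

{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          { "term":  { "field": "foo" } },
          { "limit": { "value": 1 } }
        ]
      }
    }
  }
}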
I just want to make sure I understand this correctly...
search is an object that contains a querystring.
Repo.Query returns an ObjectQuery<T>.
From my understanding, the chained LINQ statements will filter the results after Entity Framework has returned all the rows satisfying the query. So really ALL the rows are being returned and THEN filtered in memory, meaning we are returning a bunch of data that we don't really want. There are about 10k rows being returned, so this is kind of important. I'd just like to get my confusion cleared up.
var searchQuery = Repo.Query(search)
.Where(entity =>
entity.Prop1.ToUpper().Equals(prop1.ToUpper()) &&
entity.Prop2.ToUpper().Equals(prop2.ToUpper()))
.OrderBy(entity => Repo.SortExpression ?? entity.prop1);
Your Repo.Query(string query) function should return IQueryable<T>.
Then you can filter and order without getting all rows first.
IQueryable(Of T) Interface
Hope this helps.
If this targets SQL, it will most likely be translated into a SQL query and filtered on the server, not in memory.
As a matter of fact, the statement above wouldn't actually do anything yet.
It's only when you iterate over it that the query will be executed. This is why certain providers (like the EF-to-SQL one) can collapse expression trees into a SQL query.
The easiest way to check is to use LINQPad or SQL Profiler to see what query is actually executed.