Is there a way to get all the objects in key/value format that share the same secondary index value? I know we can get the list of keys for one secondary index (bucket/{{bucketName}}/index/{{index_name}}/{{index_val}}), but my requirements are such that I need the objects themselves as well. I don't want to perform a separate query for each key to fetch each object's details separately, if there is a way around it.
I am completely new to Riak and I am totally a front-end guy, so please bear with me if something I ask is of novice level.
In Riak, doing separate lookups for each key is sometimes the better way. Coming from other databases this seems strange, and likely inefficient, but you may find that an index query followed by a bunch of single-object gets is faster than a map/reduce that returns all the objects in a single go.
Try both of these approaches and see which turns out fastest for your dataset. The variables that affect this include the amount of data being queried, the size of each document, the power of your cluster, and the load the cluster is under.
Python code demonstrating the index and separate gets (if the data you're getting is large, this method can be made memory-efficient on the client, as you don't need to store all the objects in memory):
query = riak_client.index("bucket_name", "myindex", 1)
query.map("""
    function(v, kd, args) {
        // emit just the key of each matched object
        return [v.key];
    }"""
)
results = query.run()

bucket = riak_client.bucket("bucket_name")
for key in results:
    obj = bucket.get(key)
    # ... do something with the object
Python code demonstrating a map/reduce for all objects (returns a list of {key:document} objects):
query = riak_client.index("bucket_name", "myindex", 1)
query.map("""
    function(v, kd, args) {
        var obj = Riak.mapValuesJson(v)[0];
        return [ {
            'key': v.key,
            'data': obj
        } ];
    }"""
)
results = query.run()
I'm using EF Core but I'm not really an expert with it, especially when it comes to details like querying tables in a performant manner...
So what I'm trying to do is simply get the max value of one column from a table, with the data filtered.
What I have so far is this:
protected override void ReadExistingDBEntry()
{
    using Model.ResultContext db = new();
    // Filter the table data down to the rows relevant to us. The whole table may contain 0 rows or millions of them.
    IQueryable<Measurement> dbMeasuringsExisting = db.Measurements
        .Where(meas => meas.MeasuringInstanceGuid == Globals.MeasProgInstance.Guid
            && meas.MachineId == DBMatchingItem.Id);
    if (dbMeasuringsExisting.Any())
    {
        // The max value we're interested in. dbMeasuringsExisting could still contain millions of rows.
        iMaxMessID = dbMeasuringsExisting.Max(meas => meas.MessID);
    }
}
The equivalent SQL to what I want would be something like this:
select max(MessID)
from Measurement
where MeasuringInstanceGuid = Globals.MeasProgInstance.Guid
  and MachineId = DBMatchingItem.Id;
While the above code works (it returns the correct value), I think it will have a performance issue as the database table grows, because the max filtering is done client-side after all rows have been transferred. Or am I wrong here?
How can I do it better? I want the database server to filter my data. Of course I don't want any SQL script ;-)
This can be addressed by typing the return as nullable, so that you do not get an error when no rows match, and then applying a default value for the int. Alternatively, you can just assign the result to a nullable int. Note the assumption here that the ID is an integer; the same principle would apply to a Guid as well.
int MaxMessID = dbMeasuringsExisting.Max(p => (int?)p.MessID) ?? 0;
There is no need for the Any() call, as it causes an additional round trip to the database, which is not desirable in this case.
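Putting it together, a minimal sketch of the revised method (assuming the same iMaxMessID field and the Globals and DBMatchingItem members from the question):
protected override void ReadExistingDBEntry()
{
    using Model.ResultContext db = new();
    // Both the filter and the max aggregation are translated to SQL and
    // executed server-side; only a single scalar value is transferred.
    iMaxMessID = db.Measurements
        .Where(meas => meas.MeasuringInstanceGuid == Globals.MeasProgInstance.Guid
            && meas.MachineId == DBMatchingItem.Id)
        .Max(meas => (int?)meas.MessID) ?? 0;
}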
Does anyone know if there's an easy way to negate a Parse query? Something like this:
Parse.Query.not(query)
More specifically I want to do a relational query that gets everything except for the objects within the query. For example:
const relation = myParseObject.relation("myRelation");
const query = relation.query();
const negatedQuery = Parse.Query.not(query);
return await negatedQuery.find();
I know one solution would be to fetch the objects in the relation and then create a new query by looping through the objectIds using query.notEqualTo("objectId", fetchedObjectIds[i]), but this seems really circuitous...
Any help would be much appreciated!
doesNotMatchKeyInQuery is the solution as Davi Macedo pointed out in the comments.
For example, if I wanted to get all of the Comments that are not in an Article's relation, I would do the following:
const relationQuery = article.relation("comments").query();
const notInRelationQuery = new Parse.Query("Comment");
notInRelationQuery.doesNotMatchKeyInQuery("objectId", "objectId", relationQuery);
const notRelatedComments = await notInRelationQuery.find();
As I understand it, the first argument specifies the key in the objects we are fetching. The second argument specifies the key in the objects returned by the query passed as the third argument. So we pass a query for the objects we don't want, and the method compares, for the given keys, the values of the objects we do want against the values of the objects we don't want. It then returns all the objects we do want. I could probably write that more succinctly, but w/e.
I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough, but as the amount of logged data has increased, the search page has become unusable (it times out before returning when using default settings, regardless of the query used). Right now we have about 45 million sessions in the collection that gets queried, but at steady state it is expected to hold around 150 million documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope someone can give me some ideas about the most productive areas to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
    Map = sessions => from session in sessions
                      select new Result
                      {
                          ApplicationName = session.ApplicationName,
                          SessionId = session.SessionId,
                          StartedUtc = session.StartedUtc,
                          User_Cpr = session.User.Cpr,
                          User_CprPersonId = session.User.CprPersonId,
                          User_ApplicationUserId = session.User.ApplicationUserId
                      };

    Store(r => r.ApplicationName, FieldStorage.Yes);
    Store(r => r.StartedUtc, FieldStorage.Yes);
    Store(r => r.User_Cpr, FieldStorage.Yes);
    Store(r => r.User_CprPersonId, FieldStorage.Yes);
    Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essence of the query is this bit:
// Query input parameters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();

sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtc >= fromDateUtc &&
        s.StartedUtc <= toDateUtc
    );

var totalItems = Count(sessionQuery);

var sessionData =
    sessionQuery
        .OrderByDescending(s => s.StartedUtc)
        .Skip((page - 1) * PageSize)
        .Take(PageSize)
        .ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
        .Select(s => new
        {
            s.SessionId,
            s.SessionGroupId,
            s.ApplicationName,
            s.StartedUtc,
            s.Type,
            s.ResourceUri,
            s.User,
            s.ImpersonatingUser
        })
        .ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
    RavenQueryStatistics stats;
    results.Statistics(out stats).Take(0).ToArray();
    return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use GUID-based IDs. I'm not sure I can easily change the IDs of the existing values (with this much data), and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential IDs result in better-behaving B-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on the StartedUtc.
I'm assuming that you are indexing exact timestamps, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to second or minute granularity (which is usually what you are querying on) and then using Ticks, which allows a numeric range query.
StartedUtcTicks = new DateTime(session.StartedUtc.Year, session.StartedUtc.Month,
    session.StartedUtc.Day, session.StartedUtc.Hour, session.StartedUtc.Minute,
    session.StartedUtc.Second).Ticks,
And then query by the date ticks.
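For illustration, the range filter from the question would then compare ticks instead of dates. A sketch, assuming the index result class gains a long StartedUtcTicks field populated as above:
// Truncate the bounds to the same second granularity the index stores,
// so the numeric range covers whole seconds.
var fromTicks = fromDateUtc.Ticks - (fromDateUtc.Ticks % TimeSpan.TicksPerSecond);
var toTicks = toDateUtc.Ticks - (toDateUtc.Ticks % TimeSpan.TicksPerSecond);

sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtcTicks >= fromTicks &&
        s.StartedUtcTicks <= toTicks
    );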
I am looking for the most efficient way to get the last elements of a fairly large (> 1 million docs) MongoDB collection.
Specifically, it is the oplog collection, and I am looking for all entries after a given timestamp. It makes no sense to search the first million or so entries for a timestamp larger than the given one, since they are all definitely older: the collection is stored in its natural order.
Is there a way to tell MongoDB to search from the end of a collection?
I tried a LINQ query with Skip(N), but it's very slow. It seems to scan all documents from the beginning and simply not return the first N.
The most efficient way is probably using aggregation. If your collection is sorted, you can get the last Timestamp using this aggregation:
var group = new BsonDocument
{
    {
        "$group", new BsonDocument
        {
            { "_id", 0 },
            { "newestTimeStamp", new BsonDocument { { "$last", "$timeStamp" } } }
        }
    }
};

var pipeline = new[] { group };
var result = _dtCollection.Aggregate(pipeline);
Then you can deserialize the result into a Timestamp class. If you want to get several elements, you could create a similar expression using $match.
Also make sure to add an index to the collection on the TimeStamp field. This will probably make your LINQ query faster if you decide to use that instead.
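For completeness, a sketch of the deserialization step, assuming the legacy 1.x C# driver implied by the Aggregate(pipeline) call above (where the result exposes ResultDocuments):
// The pipeline produces a single grouped document; read the computed field.
var doc = result.ResultDocuments.FirstOrDefault();
if (doc != null)
{
    BsonValue newestTimeStamp = doc["newestTimeStamp"];
    // ... map newestTimeStamp onto your Timestamp class as needed
}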
This one is related to my previous question on the performance of Arrays and Hashes in Ruby.
Prerequisites
I know that using Hashes to store lots of objects leads to a significant performance increase because of the O(1) lookup.
Now let's assume I had two classes, namely A and B and they can be related to each other, but only if a third class C exists (it's sort of a relationship class). To give a practical example, let's say I have Document, Query and the relationship class Judgement (this is from information retrieval, so basically the judgement tells you whether the document is relevant for a query or not).
(I hope I got this right)
The Problem
In most cases you'd like to find out how many Judgements there are for a combination of Document and Query, or whether there are any at all.
In order to find out the latter, I iterate over each Judgement ...
@judgements.each { |j| return true if j.document == document and j.query == query }
Now, this brings us back to a linear search again, which isn't that helpful.
How to solve it?
I was thinking about a way to have double Hashes - if there's such a thing - so that I could just look up Judgements using the Document and Query I already have.
Or is there any other way of quickly finding out whether a Judgement exists for a given pair of Document and Query?
Well, if you need performance, you can always create another data structure to facilitate indexing. In your case, you could build a hash whose keys are [document, query] pairs and whose values are arrays of judgements. Depending on the architecture of your app, you can either update this index upon every change to your objects, or rebuild it from scratch whenever you need to perform a batch of lookups.
Or perhaps you should leave the lookups to the database instead; that is, of course, if you have a database at all.
This
@judgements.each { |j| return true if j.document == document and j.query == query }
could be written as
@judgements.any? { |j| j.document == document and j.query == query }
I agree with Mladen Jablanović in that there's a good chance you should let your database handle this. In MongoDB it would be something like this:
db = Mongo::Connection.new.db("mydb")
judgements = db.collection("judgements")
judgement = { :judgement_no => "2011:73", :document => 4711, :query => 42 }
judgements.insert(judgement)
judgements.create_index([['document', Mongo::ASCENDING], ['query', Mongo::ASCENDING]])
judgements.find({:document => 4711, :query => 42}).each { |jm| puts jm }