Within the MapReduce implementation, are reduce functions indexed similarly to map functions?

If I have a couple docs in Couch that look like this:
{
    "_id": "be890e3ee1457e920f12722c44001b0e", // Or whatever auto ID
    "_rev": "7-74d1787aa3ca6d2526c4436577da660f", // Or whatever auto rev
    "type_": "count",
    "value": -1,
    "time": 1485759832925 // Epoch time in milliseconds, the result of this JavaScript: var x = (new Date()).getTime(), which I calculate in the console just before saving the doc
}
And then I create a map function to retrieve these docs (which I run directly after creating a few docs):
function(doc) {
    if (doc.type_) {
        if (doc.time) {
            var datetime = (new Date()).getTime();
            var docTime = doc.time;
            var docAge = datetime - docTime;
            // Only emit docs younger than 1 minute
            if (docAge / 1000 <= 60) {
                emit(doc.time, docAge);
            }
        }
    }
}
I found that once the view is calculated, the docAge never changes and the docs are always emitted despite being 'too old'.
If you open a doc and re-save it, the view will NOT emit that doc (because re-saving counts as a CouchDB update, so the map function re-runs and the time value is now too old), but the other docs will not have been recalculated (i.e. the docAge for those docs stays the same).
From this I can see that views are incrementally updated to reflect changed docs, and as I understand it, they are cached.
Questions:
Where are these cached views stored?
Are group and reduce outputs recalculated from scratch every time the map function incrementally updates?

Your views are not being "cached" per se. The idea behind CouchDB views is that they are deterministic, and thus should not be influenced by anything beyond the document in question.
Using new Date() in your view means that you are bringing in an external resource (the clock), which means your view index will be computed in a way you aren't intending.
Your map function must deal in absolutes, so it should output the timestamp regardless of when your view index is rebuilt. From your application, you'll pass the time you want to query as a parameter to the view query.
For example, consider this view function:
function (doc) {
    if (doc.type_ && doc.time) {
        emit(doc.time);
    }
}
It will output the time for all your documents. Then, you will query the view passing in the expected timeframe.
?start_key=<timestamp from 1 minute ago>
Then you will get the documents whose timestamp falls in the last minute. You can include end_key to specify an upper limit.
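As a rough sketch of what that query might look like from the browser (the database name mydb and the design doc/view names docs and by_time are placeholders for illustration, not from the original question):

function fetchRecentDocs() {
    // Compute the key for "1 minute ago" in epoch milliseconds,
    // matching the emit(doc.time) key in the view above
    var oneMinuteAgo = Date.now() - 60 * 1000;
    var url = '/mydb/_design/docs/_view/by_time' +
        '?start_key=' + encodeURIComponent(JSON.stringify(oneMinuteAgo)) +
        '&include_docs=true';
    return fetch(url)
        .then(function (res) { return res.json(); })
        .then(function (body) {
            // Each row's key is doc.time; row.doc is the full document
            return body.rows.map(function (row) { return row.doc; });
        });
}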
There's a bit of a mental hurdle to overcome with how MapReduce views in CouchDB are designed to work, so I would highly recommend their Guide to Views to get started. (In fact, their newest documentation is quite good and worth reading through in full.)

Related

Slow query over large collection

I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough, but as the amount of logged data increased, the search page became unusable (it times out before returning with default settings, regardless of the query used). Right now we have about 45 million sessions in the collection that gets queried, but steady state is expected to be around 150 million documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope someone can give me some ideas about the most productive areas to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
    Map = sessions => from session in sessions
                      select new Result
                      {
                          ApplicationName = session.ApplicationName,
                          SessionId = session.SessionId,
                          StartedUtc = session.StartedUtc,
                          User_Cpr = session.User.Cpr,
                          User_CprPersonId = session.User.CprPersonId,
                          User_ApplicationUserId = session.User.ApplicationUserId
                      };

    Store(r => r.ApplicationName, FieldStorage.Yes);
    Store(r => r.StartedUtc, FieldStorage.Yes);
    Store(r => r.User_Cpr, FieldStorage.Yes);
    Store(r => r.User_CprPersonId, FieldStorage.Yes);
    Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essence of the query is this bit:
// Query input parameters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();

sessionQuery = sessionQuery
    .Where(s =>
        s.ApplicationName == applicationName &&
        s.StartedUtc >= fromDateUtc &&
        s.StartedUtc <= toDateUtc
    );

var totalItems = Count(sessionQuery);

var sessionData =
    sessionQuery
        .OrderByDescending(s => s.StartedUtc)
        .Skip((page - 1) * PageSize)
        .Take(PageSize)
        .ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
        .Select(s => new
        {
            s.SessionId,
            s.SessionGroupId,
            s.ApplicationName,
            s.StartedUtc,
            s.Type,
            s.ResourceUri,
            s.User,
            s.ImpersonatingUser
        })
        .ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
    RavenQueryStatistics stats;
    results.Statistics(out stats).Take(0).ToArray();
    return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the number of result items in any relevant way. If I use a value for the applicationName parameter that matches none of the results, the query is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid-based ids. I'm not sure if I can easily change the IDs of the existing documents (with this much data), and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better-behaved B-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents whose ids start with a string matching enough of the timestamp to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84. This also requires me to update or drop all the existing data, and it of course also has the benefit of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
At a glance, it is probably related to the range query on StartedUtc.
I'm assuming that you are indexing exact timestamps, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to second or minute granularity (which is usually what you are querying on) and then using Ticks, which allows a numeric range query:
StartedUtcTicks = new DateTime(
    session.StartedUtc.Year,
    session.StartedUtc.Month,
    session.StartedUtc.Day,
    session.StartedUtc.Hour,
    session.StartedUtc.Minute,
    session.StartedUtc.Second).Ticks,
And then query by the date ticks.

storing business hours in Parse DB

I need some help with the infrastructure for storing business hours for a location on Parse.com. I already tried a separate class called BusinessHours, where each row has a pointer to the Location class. With a minimum of 7 rows (one per day of the week) per location, the object count comes to over 10,000.
Then, in Swift, I do this to determine whether the location is open now:
for hour in hours {
    if hour.isClosedAllDay {
        isOpen = "closed".localized
    } else {
        let now = NSDate()
        if now.hasDayOffset(hour.weekday, closeWeekDay: hour.nextWeekday) {
            if hour.open != nil && hour.close != nil {
                let open = now.hourDateFromString(hour.open!, offset: now.dayOpenOffset(hour.weekday, closeWeekDay: hour.nextWeekday))
                let close = now.hourDateFromString(hour.close!, offset: now.dayCloseOffset(hour.weekday, closeWeekDay: hour.nextWeekday))
                if now.isBetween(open, close: close) {
                    isOpen = "open".localized
                    timeOfBusiness = hour.time!
                    break
                }
            }
        }
    }
}
Is there a better way to do this than to have thousands of rows for business hours only? I was thinking of adding an object field to the Location class for the hours, but I don't know if that is the right way to go either.
Depending on how you want to edit and change the details, and the complexity of multiple opening times per day, I'd consider not using multiple columns and rows. Instead, you could simply store a JSON string in a single column which contains all of the required details.
Obviously you wouldn't be able to use this for querying, so if you need to do that then you need to keep something more like your current solution.
If you don't need querying, or you only need simple querying like 'is it open at all on a Monday', then a combined solution, supported by cloud code so the app doesn't need lots of knowledge of the JSON, could work well. For instance, you could have columns for general open hours each day and the details in JSON, so you can get a rough answer by querying and then check the exact detail before presenting / using the result.
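A minimal Cloud Code sketch of that combined approach might look like the following (the isOpenOnDay function name, the businessHours field, and the 1-7 weekday convention are all assumptions for illustration, not part of the original answer):

// Hypothetical Cloud Code function: a rough "open at all on this weekday?"
// check against a JSON array field stored on the Location row.
Parse.Cloud.define('isOpenOnDay', function (request, response) {
    var query = new Parse.Query('Location');
    query.get(request.params.locationId, {
        success: function (location) {
            // Assumes businessHours is an array with one entry per weekday (1-7)
            var hours = location.get('businessHours') || [];
            var open = hours.some(function (h) {
                return h.weekday === request.params.weekday && !h.isClosedAllDay;
            });
            response.success(open);
        },
        error: function (error) {
            response.error(error);
        }
    });
});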
I ended up doing it like this in an array field called businessHours in my Location class:
[
    {"close":"20:00Z","open":"12:00Z","time":"09:00 - 17:00","isClosedAllDay":false,"nextWeekday":1,"weekday":1},
    {"close":"20:00Z","open":"12:00Z","time":"09:00 - 17:00","isClosedAllDay":false,"nextWeekday":2,"weekday":2},
    {"close":"20:00Z","open":"12:00Z","time":"09:00 - 17:00","isClosedAllDay":false,"nextWeekday":3,"weekday":3},
    {"close":"20:00Z","open":"12:00Z","time":"09:00 - 17:00","isClosedAllDay":false,"nextWeekday":4,"weekday":4},
    {"close":"20:00Z","open":"12:00Z","time":"09:00 - 17:00","isClosedAllDay":false,"nextWeekday":5,"weekday":5},
    {"close":"20:00Z","open":"12:00Z","time":"09:00 - 17:00","isClosedAllDay":false,"nextWeekday":6,"weekday":6},
    {"close":"20:00Z","open":"12:00Z","time":"09:00 - 17:00","isClosedAllDay":false,"nextWeekday":7,"weekday":7}
]
and then looping through the objects as NSDictionary instances.
thanks Wain!

Meteor: filter data in publish or on client

In Meteor I want to work at the document level with a Mongo database, and according to various sources what I have to watch out for is expensive publications, so today my question is:
How would I go about publishing documents with relations? Would I follow a relational-type query, finding assignment details by assignment id like this:
Meteor.publish('someName', function () {
    var empId = "dj4nfhd56k7bhb3b732fd73fb";
    var assignmentIds = Assignment.find({ employee_id: empId })
        .map(function (a) { return a._id; });
    return AssignmentDetails.find({ assignment_id: { $in: assignmentIds } });
});
or should we rather take an approach like this, where we skip the relational filtering step in the publish and instead query the details directly, handling any further filtering on the client:
Meteor.publish('someName', function () {
    var empId = "dj4nfhd56k7bhb3b732fd73fb";
    var assignmentData = Assignment.find({ employee_id: empId });
    var detailData = AssignmentDetails.find({ employee_id: empId });
    return [assignmentData, detailData];
});
I guess this is a question of whether the amount of data being searched through on the server should be bigger, or whether the amount of data being transferred to the client should be bigger.
Which of these would be most cost-effective for the server?
It's a matter of opinion, but if possible I would strongly recommend attaching employee_id to docs in AssignmentDetails, as you have in the second example. You're correct in suggesting that publications are expensive, but much more so if the publication function is more complex than necessary, and you can reduce your pub function to one line if you have employee_id in AssignmentDetails (even where there are many employee_ids for each assignment) by just searching on that; see the sketch below. You don't even need to return that field to the client (you can specify the fields to return in your find), so the only incurred overhead would be in database storage (which is very cheap) and in adding it to inserted/updated AssignmentDetails docs (which would be imperceptible). The actual amount of data transferred would be the same as in the first case.
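A sketch of that one-line publication, under the assumption that employee_id has been denormalized onto AssignmentDetails (the publication name here is illustrative):

Meteor.publish('assignmentDetails', function (empId) {
    // One simple, indexable query; employee_id itself is excluded
    // from the published fields since the client doesn't need it
    return AssignmentDetails.find(
        { employee_id: empId },
        { fields: { employee_id: 0 } }
    );
});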
The alternative of just publishing everything might be fine for a small collection, but it really depends on the number of assignments, and it's not going to be at all scalable this way. You need to send the entire collection to the client every time a client connects, which is expensive and time-consuming at both ends if it's more than a MB or so, and there isn't really any way around that overhead when you're talking about a dynamic (i.e. frequently-changing) collection, which I think you are (whereas for largely static collections you can do things with localStorage and poll-and-diff).

backbone.js: Retrieve a smaller version of model building a collection

I'm trying to build an api to create a collection in backbone. My model is called log and has these (shortened) properties (the format for getLog/<id>):
{
    'id': string,
    'duration': float,
    'distance': float,
    'startDate': string,
    'endDate': string
}
I need to create a collection, because I have many logs and I want to display them in a list. The api for creating the collection (getAllLogs) takes 30 sec to run, which is too slow. It returns the same format as getLog/<id>, but in an array, with one element for each log in the database.
To speed things up, I rebuilt the api several times and optimized it to its limits, but it is still at 30 sec, which is too slow.
My question is whether it is possible to have a collection filled with instances of a model without ALL the information in the model, just the part needed to display the list. This would speed up loading the collection and displaying the list, while in the background I could continue loading all the other properties, or load them only for the elements I really need.
In my case, the model would load only with this information:
{
    'id': string,
    'distance': float
}
and all the other properties could be loaded later.
How can I do it? Is it a good idea anyway?
Thanks.
One way to do this is to use map to get the shortened model. Something like this will convert a Backbone.Collection "collection" with all properties to one with only "id" and "distance":
var shortCollection = new Backbone.Collection(collection.toJSON().map(function (x) {
    return { id: x.id, distance: x.distance };
}));
Here's a Fiddle illustration.
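If the real bottleneck is the getAllLogs api itself, another option is to fill the collection from a lightweight endpoint and fetch the remaining attributes on demand; Backbone merges fetched attributes into the existing model. This is only a sketch: the getAllLogsSummary endpoint is hypothetical and would have to be added server-side, returning just id and distance per log.

// Hypothetical endpoints: getAllLogsSummary returns [{id, distance}, ...],
// while getLog/<id> returns the full log document.
var Log = Backbone.Model.extend({
    urlRoot: 'getLog'
});

var Logs = Backbone.Collection.extend({
    model: Log,
    url: 'getAllLogsSummary'
});

var logs = new Logs();
logs.fetch(); // fast: only id and distance for each log

// Later, for a single log the user actually opens:
// logs.get(someId).fetch() issues GET getLog/<id> and merges
// duration, startDate and endDate into the partial model.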

Model records ordering in Spine.js

As far as I can see in the Spine.js sources, the Model.each() function returns the model's records in the order of their IDs. This is completely unreliable in scenarios where ordering is important: a long list of people, etc.
Can you suggest a way to keep the original record ordering (the same order in which the records arrived via refresh() or similar functions)?
P.S.
Things are even worse because by default Spine.js internally uses new GUIDs as IDs, so record order is completely random, which is unacceptable.
EDIT:
It seems that with the latest commit, https://github.com/maccman/spine/commit/116b722dd8ea9912b9906db6b70da7948c16948a, they made it possible, but I have not tested it myself because I switched from Spine to Knockout.
I bumped into the same problem while learning spine.js. I'm using pure JS, so I had been neglecting the contacts example (http://spinejs.com/docs/example_contacts), which helped out on this one. As a matter of fact, you can't really keep the ordering from the server this way, but you can do your own ordering in JavaScript.
Note that I'm using the Element Pattern here (http://spinejs.com/docs/controller_patterns).
First, define the function that will do the sorting on the model:
/* Extending the Student model */
Student.extend({
    nameSort: function (a, b) {
        if ((a.name || a.email) > (b.name || b.email))
            return 1;
        else
            return -1;
    }
});
Then, in the students controller, append the elements using that sort:
/* Controller that manages the students */
var Students = Spine.Controller.sub({
    /* code omitted for simplicity */
    addOne: function (student) {
        var item = new StudentItem({ item: student });
        this.append(item.render());
    },
    addAll: function () {
        var sortedByName = Student.all().sort(Student.nameSort);
        var _self = this;
        $.each(sortedByName, function () { _self.addOne(this); });
    }
});
And that's it.
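For completeness, a hypothetical way to wire this up (the #students element and the refresh binding are assumptions based on the Element Pattern docs, not shown in the original answer):

// Instantiate the controller against an existing element, then
// re-render the sorted list whenever records arrive from the server
var students = new Students({ el: $('#students') });
Student.bind('refresh', function () {
    students.addAll();
});
Student.fetch(); // loads records, triggering 'refresh'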
