Comparing permanent and ad hoc views - view

I need to reassure that I have a correct understanding of the difference for permanent and ad hoc views.
Permanent views are stored in the
_design document and computed upon being queried the first time, later
changes to documents will result in
changing the related documents in the
stored B tree file for that view.
Ad hoc views are calculated completely every time for the complete database?

Ok,
just found the Solution in the CouchDB Wiki.
I got it right.

Related

My couchdb view is rebuilding for no reason

i have a couchdb with a database containing ~20M documents. it takes ~12h to build a single view.
i have saved 6 views successfully. they returned results quickly. at first.
after 2 days idle, i added another view. it took much longer to build, and it was a "nice-to-have", not a requirement, so i killed it after ~60% completion (restarted the windows service).
my other views now start re-building their indexes when accessed.
really frustrated.
additional info: disk had gotten within 65GB of full (1TB disk; local)
Sorry you have no choice but to wait for the views to rebuild here. However I will try to explain why this is happening. It won't solve your problem but perhaps it will help you understand what is happening and how to prevent it in future.
From the wiki
CouchDB view index filenames are based on the contents of the design document (not its name, ID or revision). This means that two design documents with identical view code will share view index files.
What follows is that if you change the contents by adding a new view or updating the existing one couchdb will rebuild the indexes.
So I think the most obvious solution is to add new views in new design docs. It will prevent re indexing of existing views and the new one will take whatever time it needs to index any way.
Here is another helpful answer that throws light on how to effectively use couchdb design documents and views.

Handling passive deletion updates (ie. archiving instead of deleting)

We are developing an application based on DDD principles. We have encountered a couple of problems so far that we can't answer nor can we find the answers on the Internet.
Our application is intended to be a cloud application for multiple companies.
One of the demands is that there are no physical deletions from the database. We make only passive deletion by setting Active property of entities to false. That takes care of Select, Insert and Delete operations, but we don't know how to handle update operations.
Update means changing values of properties, but also means that past values are deleted and there are many reasons that we don't want that. One of the primary reason is for Accounting purposes.
If we make all update statements as "Archive old values" and then "Create new values" we would have a great number of duplicate values. For eg., Company has Branches, and Company is the Aggregate Root for Branches. If I change Companies phone number, that would mean I have to archive old company and all of its branches and create completely new company with branches just for one property. This may be a good idea at first, but over time there will be many values which can clog up the database. Phone is maybe an irrelevant property, but changing the Address (if street name has changed, but company is still in the same physical location) is a far more serious problem.
Currently we are using ASP.NET MVC with EF CF for repository, but one of the demands is that we are able to easily switch, or add, another technology like WPF or WCF. Currently we are using Automapper to map DTO's to Domain entities and vice versa and DTO's are primary source for views, ie. we have no view models. Application is layered according to DDD principle, and mapping occurs in Service Layer.
Another demand is that we musn't create a initial entity in database and then fill the values, but an entire aggregate should be stored as a whole.
Any comments or suggestions are appreciated.
We also welcome any changes in demands (as this is an internal project, and not for a customer) and architecture, but only if it's absolutely neccessary.
Thank you.
Have you ever come across event sourcing? Sounds like it could be of use if you're interested in tracking the complete history of aggregates.
To be honest I would create another table that would be a change log inserting the old record and deleted records etc etc into it before updating the live data. Yes you are creating a lot of records but you are abstracting this data from live records and keeping this data as lean as possible.
Also when it comes to clean up and backup you have your live date and your changed / delete data and you can routinely back up and trim your old changed / delete and reduced its size depending on how long you have agreed to keep changed / delete data live with the supplier or business you are working with.
I think this would be the best way to go as your core functionality will be working on a leaner dataset and I'm assuming your users wont be wanting to check revision and deletions of records all the time? So by separating the data you are accessing it when it is needed instead of all the time because everything is intermingled.

update app database regularly without needing an app update

I am working on a WP7 app that contains
CategoryGroups
Categories
Products
The rows for each of these entities are populated on first run of the application.
The issues is that when the app gets published, the rows in each of the entities will change (added, deleted, modified). I would like some suggestions on how I should handle this? Any pointers to existing code samples will be great?
I am using an object oriented database to store my entities. The app also allows the user to add their own entities (which get added to the database as personalized (flagged) entities). One solution I was thinking was to read an xml file from the server and then loop through the database entries and make the necessary modifications in the database. So, on the first run, all the entities will just get inserted. On subsequent runs, if the version number attribute in xml is different, then the system populated data is reloaded from xml but the user data is preserved.
Also, maybe only check for the new xml file on the server when internet connection is available and only periodically (like every 2 weeks).
Any other suggestions are welcome. If there is a simpler, cleaner way - please share.
Pratik
I think it's fair to say that this question has nothing to do with WP7 and everything to do with finding an efficient way to to compute and deliver update deltas.
Timestamp your items. When requesting an update, specify the time of last update. You server can trivially query for items newer than this and return a delta. At the client (ie in the phone) it is not necessary to store a last update time because you can simply add one second to the most recent timestamp in the items present on the phone.

How to access data in Dynamics CRM?

What is the best way in terms of speed of the platform and maintainability to access data (read only) on Dynamics CRM 4? I've done all three, but interested in the opinions of the crowd.
Via the API
Via the webservices directly
Via DB calls to the views
...and why?
My thoughts normally center around DB calls to the views but I know there are purists out there.
Given both requirements I'd say you want to call the views. Properly crafted SQL queries will fly.
Going through the API is required if you plan to modify data, but it isnt the fastest approach around because it doesnt allow deep loading of entities. For instance if you want to look at customers and their orders you'll have to load both up individually and then join them manually. Where as a SQL query will already have the data joined.
Nevermind that the TDS stream is a lot more effecient that the SOAP messages being used by the API & webservices.
UPDATE
I should point out in regard to the views and CRM database in general: CRM does not optimize the indexes on the tables or views for custom entities (how could it?). So if you have a truckload entity that you lookup by destination all the time you'll need to add an index for that property. Depending upon your application it could make a huge difference in performance.
I'll add to jake's comment by saying that querying against the tables directly instead of the views (*base & *extensionbase) will be even faster.
In order of speed it'd be:
direct table query
view query
filterd view query
api call
Direct table updates:
I disagree with Jake that all updates must go through the API. The correct statement is that going through the API is the only supported way to do updates. There are in fact several instances where directly modifying the tables is the most reasonable option:
One time imports of large volumes of data while the system is not in operation.
Modification of specific fields across large volumes of data.
I agree that this sort of direct modification should only be a last resort when the performance of the API is unacceptable. However, if you want to modify a boolean field on thousands of records, doing a direct SQL update to the table is a great option.
Relative Speed
I agree with XVargas as far as relative speed.
Unfiltered Views vs Tables: I have not found the performance advantage to be worth the hassle of manually joining the base and extension tables.
Unfiltered views vs Filtered views: I recently was working with a complicated query which took about 15 minutes to run using the filtered views. After switching to the unfiltered views this query ran in about 10 seconds. Looking at the respective query plans, the raw query had 8 operations while the query against the filtered views had over 80 operations.
Unfiltered Views vs API: I have never compared querying through the API against querying views, but I have compared the cost of writing data through the API vs inserting directly through SQL. Importing millions of records through the API can take several days, while the same operation using insert statements might take several minutes. I assume the difference isn't as great during reads but it is probably still large.

How does Facebook do it?

Have you ever noticed how facebook says “3 friends and 33 others liked this”? I was wondering what the best approach to do this is. I don’t think going through the friends list, and the list of users who “liked this” and comparing them is efficient at all! Do they keep a track of this in the database? That will make the database size very huge.
What do you guys think?
Thanks!
I would guess they outer join their friends table with their likes table to count both regular likes and friend likes at the same time.
With the proper indexes, it wouldn't be a slow query at all. Huge databases aren't necessarily slow, so there's really no reason to not store all of this information in a database. The trick is to make sure the indexes and partitions (if any) are set up well.
Facebook uses Cassandra, a NoSQL database for at least some things. Here's a more detailed discussion of what some of the bigger social media sites do to solve these problems:
http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx
Lots of interesting reading in there if you follow the links from it to the Digg blog post, etc.
Yes they definitely keep it in their database as they definitely have more than 1 server that needs to access the data.
As for scalability, I'm sure they use a lot of caching.
Here is an example:
If you have 1 million rows to go through, an index can perform O(logn) = 20 operations (in the worst case) only to find what you need.
For 2 million, you only need 21 operations (in the worst case) to find what you need.
Every time you double the amount of users to go through you simply need only 1 more operation (in the worst case) with a O(logn) index.
They also have a distributed architecture or a clustered database.
Facebook must be using a trigger(which automatically gets executed as soon as an event occurs).
For example, suppose a trigger is created to store the count and names of people who liked the status, then it will get executed every time when someone likes your status and that too implicitly (automatically).
This makes the operation way too easy and Facebook doesn't have to manually update the database or store a huge database for this. Also,this approach is a faster one.
In designing social networking software (mothsorchid.com) I found the only way to address this is to pre-cache streams of notifications. One doesn't query the database at the time of page load to count how many friends and others 'liked this', when someone 'likes' something that is recorded on the object, and when retrieving the object one can compare with the current user's friend list. If someone updates their profile/makes a comment/etc it sends notification objects to friends which are pre-cached in their feeds. Cuts down tremendously on database work at expense of disk space, but disk space is cheap.
As to how Facebook does this, they use Cassandra DBMS, which is probably a little different to what you have in mind.
Keep in mind that Facebook strongly utilizes memcached, so they're retaining a lot of data in memory and only refreshing it when absolutely necessary. See this blog post for some scalability discussion around this:
http://www.facebook.com/note.php?note_id=39391378919
Each entry that somebody can like probably contains a list of everybody who does like it (all of this is of course in a database). When you view that entry, they match it against your friends list to see which of them is your friend. Voila.
A lot of this are explained by the Director of Engineering of Facebook in this QCon presentation :
http://www.infoq.com/presentations/Facebook-Software-Stack
A great presentation to watch.....

Resources