How to speed up performance by avoiding querying Mongoid multiple times? - performance

I have approx. 10 million Article objects in a MongoDB database (accessed through Mongoid). The huge number of Article objects makes the queries quite time consuming.
As exemplified below, I am registering, for each week (e.g. 700 days ago ... 7 days ago, 0 days ago), how many articles are in the database.
But with every query I make the time consumption grows, and CPU usage quickly exceeds 100%.
articles = Article.where(published: true).asc(:datetime)
days = Date.today.mjd - articles.first.datetime.to_date.mjd

days.step(0, -7) do |n|
  current_date = Date.today - n.days
  previous_articles = articles.lt(datetime: current_date)
  previous_good_articles = previous_articles.where(good: true).size
  previous_bad_articles = previous_articles.where(good: false).size
end
Is there a way to load the Article objects into memory, so that I only need to call the database on the first line?

A MongoDB database is not built for that.
I think the best way is to run a daily script that creates your data for that day and saves it in a Redis database (http://www.redis.io).
Redis stores your data in server memory, so you can access it at any time of day, and it is very quick.
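A minimal sketch of that daily script, assuming the redis gem and made-up key names (none of this comes from the original post):

require 'redis'

# A Redis server is assumed to be running locally; key names are illustrative.
redis = Redis.new

today = Date.today
published = Article.where(published: true).lt(datetime: today + 1)

# Store today's cumulative counts once (e.g. from a cron job)...
redis.hset("article_counts:#{today}", "good", published.where(good: true).count)
redis.hset("article_counts:#{today}", "bad",  published.where(good: false).count)

# ...and read them back from memory for the rest of the day, without touching MongoDB.
redis.hgetall("article_counts:#{today}")  # => {"good"=>"123456", "bad"=>"7890"}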

Don't Repeat Yourself (DRY) is a best practice that applies not only to code but also to processing. Many applications have natural epochs for summarizing data; a day is a good choice for your question, and if the data is historical, it only has to be summarized once. So you reduce the processing of 10 million Article documents down to roughly 700 day-summary documents. You need special code for merging in today if you want up-to-the-moment accuracy, but the savings are well worth the effort.
I politely disagree with the statement, "A MongoDB database is not built for that." You can see from the above that this is all about not repeating processing. The 700 day-summary documents can be stored in any reasonable data store; since you are already using MongoDB, simply use another MongoDB collection for the day summaries. There's no need to spin up another data store if you don't want to. The summary data will easily fit in memory, and the reduction in processing means that your working set will no longer be blown out by the historical processing.
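A rough sketch of what such a day-summary collection could look like with Mongoid; the model, field names, and helper methods below are invented for illustration, and the exact criteria syntax depends on your Mongoid version:

# Hypothetical summary model: one tiny document per day in its own collection.
class DaySummary
  include Mongoid::Document
  field :day,        type: Date
  field :good_count, type: Integer
  field :bad_count,  type: Integer
  index({ day: 1 }, unique: true)
end

# Summarize a single day; historical days only ever need this once.
def summarize_day(day)
  scope = Article.where(published: true).gte(datetime: day).lt(datetime: day + 1)
  DaySummary.find_or_initialize_by(day: day).update(
    good_count: scope.where(good: true).count,
    bad_count:  scope.where(good: false).count
  )
end

# Cumulative weekly totals then become a cheap sum over at most ~700 small documents.
def cumulative_counts_up_to(date)
  docs = DaySummary.lte(day: date)
  [docs.sum(:good_count), docs.sum(:bad_count)]
end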

Related

Hadoop vs Cassandra: Which is better for the following scenario?

There is a situation in our systems in which the user can view and "close" a report. After they close it, the report is moved to a temporary table inside the database, where it is kept for 24 hrs, and then moved to an archives table (where the report is stored for the next 7 years). At any point during those 7 years, a user can "reopen" the report and work on it. The problem is that the archives storage is getting large, and finding/reopening reports tends to be time consuming. I also need to get statistics on the archives from time to time (i.e. report dates, clients, average length "opened", etc.). I want to use a big data approach but I am not sure whether to use Hadoop, Cassandra, or something else. Can someone provide me with some guidelines on how to get started and decide what to use?
If your archive is large and you'd like to get reports from it, you won't be able to use just Cassandra, as it has no easy means of aggregating the data. You'll end up co-locating Hadoop and Cassandra on the same nodes.
From my experience, archives (write once, read many) are not the best use case for Cassandra if you have a lot of writes (we tried it as the backend for a backup system). Depending on your compaction strategy you'll pay either in space or in IOPS for that. Added changes are propagated through the SSTable hierarchies, resulting in a lot more writes than the original change.
It is not possible to answer your question in full without knowing other variables: how much hardware (servers, their RAM/CPU/HDD/SSD) are you going to allocate? What is the size of each 'report' entry? How many reads/writes do you usually serve daily? How large is your archive storage now?
Cassandra might work fine. Keep two tables, reports and reports_archive. Define the schema using a TTL of 24 hours and 7 years:
CREATE TABLE reports (
...
) WITH default_time_to_live = 86400;        -- 24 hours
CREATE TABLE reports_archive (
...
) WITH default_time_to_live = 220752000;    -- 86400 * 365 * 7, i.e. 7 years
Use the new Time Window Compaction Strategy (TWCS) to minimize write amplification. It could be advantageous to store the report metadata and report binary data in separate tables.
For roll-up analytics, use Spark with Cassandra. You don't mention the size of your data, but roughly speaking 1-3 TB per Cassandra node should work fine. Using RF=3 you'll need at least three nodes.

high volume data storage and processing

I am building a new application where I am expecting a high volume of geolocation data: moving objects sending geo coordinates every 5 seconds. This data needs to be stored in some database so it can be used for tracking the moving objects on a map at any time. I am expecting about 250 coordinates per moving object per route, each object can run about 50 routes a day, and I have 900 such objects to track. That brings it to about 11.5 million geo coordinates to store per day, and I have to store at least one week of data in my database.
This data will basically be used for simple queries like finding all the geo coordinates for a particular object and a particular route, so the queries are not very complicated, and the data will not be used for any analysis.
So my question is: should I just go with a normal Oracle database like 12c distributed over two VMs, or should I think about big data technologies like NoSQL or Hadoop?
One of the key requirements is high performance: each query has to respond within 1 second.
Since you know the volume of data (11.5 million coordinates per day), you can easily simulate your whole scenario in an Oracle DB and test it well beforehand.
My suggestion is to go with day-level partitions and two subpartition keys, object and route, so that all your business SQL always hits the right partitions.
You will also need to clear out older days' data, or create some sort of aggregation over past days and then delete the raw data.
It's quite doable in 12c.

MongoID where queries map_reduce association

I have an application that aggregates data from different social network sites. The back-end processing is done in Java and works great.
Its front end is a Rails application; the deadline for some analytics filter and report tasks was 3 weeks, a few days are still left, and it is almost completed.
When I started, I implemented map/reduce for different states and it worked great over 100,000 records on my local machine.
Then my colleague gave me the current database, which has 2.7 million records. My expectation was that it would still run fast, since I specify a date range and filter before the map_reduce execution; my belief was that map/reduce would only operate on the result set of that filter, but that is not the case.
Example
I have a query that just shows stats for records loaded in the last 24 hours.
The result is 0 records found, but with 2.7 million records it takes 200 seconds; before, it came back in milliseconds.
Code example below; filter is a hash of conditions expected to be applied before map_reduce, and map and reduce are the map and reduce functions (not shown here).
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Any suggestions? What would be the ideal solution in the remaining time frame, given that the database is growing exponentially by the day?
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R cannot make use of indexes. You have not shown your Map and Reduce functions, so it's not possible to point out mistakes. Update the question with those functions and what they are supposed to do, and I'll update the answer.
Look at using the Aggregation Framework instead: it can make use of indexes and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/
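As a rough illustration, here is what an equivalent aggregation pipeline might look like from Mongoid; the created_at and state field names are guesses, and the exact collection.aggregate call varies with the driver version:

# Count documents per state, restricted to the last 24 hours.
# The $match stage can use an index on created_at, unlike map/reduce.
pipeline = [
  { "$match" => { "created_at" => { "$gte" => Time.now.utc - 24 * 3600 } } },
  { "$group" => { "_id" => "$state", "count" => { "$sum" => 1 } } },
  { "$sort"  => { "count" => -1 } }
]

SocialContent.collection.aggregate(pipeline).to_a
# => [{ "_id" => "some_state", "count" => 123 }, ...]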

SQLCE performance in windows phone very poor

I'm writing this thread as I've fought this problem for three whole days now!
Basically, I have a program that reads a big CSV file and uses it as input to a local SQLCE database.
For every row in this CSV file (which represents some sort of object, let's call it "dog"), I need to know whether this dog already exists in the database.
If it already exists, don't add it to the database.
If it doesn't exist, add a new row to the database.
The problem is that every query takes around 60 milliseconds (in the beginning, when the database is empty) and goes up to about 80 ms when the database is around 1000 rows big.
When I have to go through 1000 rows (which in my opinion is not much), this takes around 70000 ms = 1 minute and 10 seconds just to check whether the database is up to date, which is way too slow! Considering this amount will probably some day exceed 10000 rows, I cannot expect my users to wait for over 10 minutes before their DB is synchronized.
I've tried using a compiled query instead, but that did not improve performance.
The field I'm searching on is a string (the primary key), and it is indexed.
If necessary, I can update this thread with code so you can see what I do.
SQL CE on Windows Phone isn't the fastest of creatures but you can optimise it:
This article covers a number of things you can do: WP7 Local DB Best Practices
They also provide a WP7 project that can be downloaded so you can play with the code.
On top of that article, I'd suggest changing your PK from a string to an int; strings take up more space than ints, so your index will be larger and take more time to load from isolated storage. Certainly in SQL Server, searches on strings are slower than searches on ints/longs.

Rails 3 - ActiveRecord, what is more efficient (update vs. count)?

Okay, let's say I've got two different models:
Poll (has_many :votes)
Vote (belongs_to :poll)
So, one poll can have many votes.
At the moment I'm displaying a list of all polls, including each one's overall vote_count.
Every time someone votes for a poll, I update the overall vote_count of that specific poll using:
@poll = Poll.update(params[:poll_id], :vote_count => vote_count + 1)
To retrieve the vote_count I use @poll.vote_count, which works fine.
Let's assume I've got a huge number of polls (in my db) and a lot of people voting for the same poll at the same time.
Question: Wouldn't it be more efficient to remove the vote_count from the poll table and use @vote_count = Poll.find(params[:poll_id]).votes.count to retrieve the overall vote_count instead? Which operation (update vs. count) would make more sense in this case? (I'm using PostgreSQL in production.)
Thanks for the help!
Have you considered using a counter cache (see the :counter_cache option)? Rails has this functionality built in to handle all the possible updates to an association and how they affect the counter.
It's as simple as adding a zero-initialized integer column named #{attribute.pluralize}_count (in your case votes_count) to the table on the "one" side of the one-to-many association (in your case Poll).
Then, on the other side of the association, add the :counter_cache => true argument to the belongs_to statement.
belongs_to :poll, :counter_cache => true
Now this doesn't answer your question exactly, and the correct answer will depend on the shape of your data and the indexes you've configured. If you expect your votes table to number in the millions, spread out over thousands of polls, then go with the counter cache; otherwise just counting the associations should be fine.
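For concreteness, a minimal sketch of that setup (Rails 3.1-style migration assumed; the class names around the belongs_to line are filled in from the question):

# Migration adding the cache column, defaulted to 0 so existing rows start out consistent.
class AddVotesCountToPolls < ActiveRecord::Migration
  def change
    add_column :polls, :votes_count, :integer, :default => 0, :null => false
  end
end

class Poll < ActiveRecord::Base
  has_many :votes
end

class Vote < ActiveRecord::Base
  belongs_to :poll, :counter_cache => true
end

# Rails now maintains polls.votes_count for you:
poll = Poll.find(params[:poll_id])
poll.votes.create!           # increments votes_count automatically
poll.reload.votes_count      # read the cached total without a COUNT(*) query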
This is a great question as it touches the fundamental issue of storing summary and aggregate information.
Generally this is not a good idea as things can easily get out of sync as systems grow.
Sometimes there are occasions when you do want summary information, but these are more specialized cases, such as read-only databases that are used solely for reporting and are updated once a day at midnight.
In those cases summary/aggregate reporting is not only OK but preferred over repeating the same aggregate calculation with every query. It also depends on usage and size: for example, if there are 300 queries a day (against the once-a-day updated, read-only database) that all have to calculate the same totals, and each query reads 20,000 rows, it is more efficient to do that calculation once and store it. As the data and queries grow, this may be the only practical way to allow complex reporting.
To me, it doesn't make sense in such a simple case to keep a vote_count on Poll. Counting rows is really fast, and if you add a vote and forget to increment vote_count, the data is kind of broken...
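If getting out of sync is the main worry, one option (purely an illustration, not something suggested in the answers above) is to keep the counter but resynchronize it periodically with Rails' built-in reset_counters:

# Recompute votes_count from the actual rows, e.g. in a nightly rake task,
# so a missed increment can never stay wrong for long.
Poll.find_each do |poll|
  Poll.reset_counters(poll.id, :votes)
end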

Resources