I'm feeling good to work with Neo4j as I'm building a social network and neo4j is working well for me. Please answer to these points:
1) I'm stuck at making a decision as to store the number of likes on a post (de-normalized)somewhere in the database or should I count the number of edges to that post dynamically every time.
For example, When retrieving the "post" json data, for each user who needs that data, I need to count the no. of edges everytime I generate json.
2) Stuck at deciding best way to notify users about likes or comments.
For example, I want to push a notification to the user saying "John and 3 others also commented on Cena's post".
This notification might be updated as the number of comments increases. So, it's helpful for me to update notification if I'm using count(*) rather than than storing the counter somewhere, because I can fetch the count of "${count} new replies on your post" easily. But I'm worried about the performance.
3) Can I use Redis or other memcache with neo4j? Does that make a "significant" difference?
Please help me out in deciding which is better.
P.S: Please keep in mind the efficiency and scalability of application.
Related
So I will be embarking on designing a dashboard that will display KPI's and other relevant information for my team. Since I am in the early stages of this project and am not very familiar on the technical process behind designing a dashboard, I need some questions vetted out first before I go and shop for some solutions to avoid reinventing the wheel.
Here are some of my questions:
We want a dashboard that can provide live-time information via our data sources (or as close to live-time as possible). What function allows a dashboard to update itself with concurrent datasources? From a conceptual standpoint, I can understand creating a dashboard out of Microsoft Excel, and having the dashboard dependent on the values you may have set within your pivot table.
How do you make a dashboard request information from multiple datasources on its own? Just like the excel example, a user may have to go into the pivot tables to update values, but I want to know how would a dashboard request this by itself and what is the exact method from a programming standpoint? Does the code execute itself every time you refresh the webpage?
How do you create datasources organically? I know for some solutions such as SharePoint BI Center, there are pre-supported datasources like an excel sheet or SharePoint and it's as easy as uploading your document and letting the design handle the rest. However, there are going to be some datasources that I know that will need to be fetched. Do I need to understand something else like an event recorder in order to navigate this issue?
Introduction
The dashboard (or a report, respectively) is usually the result of a long chain of steps. Very much simplified it could look like this:
src1
|------\
src2 | /---- Dashboards
|------+---[DWH]-[BR]-+
src n | | \---- Reports etc.
|------/ [Big Data]
Keep in mind, this is only a very, very simple structure of a data backend / frontend.
DWH means Data Warehouse, where data might be stored temporarily (you referred to this as fetching). This could be a database, could be a Big Data engine, could be a combination of both...
Afterwards, there are Business Rules (BR). Those might be specific rules in how different departments calculate and relate to data, but also simple things like algebra.
Questions
So, the main question should not be about the technology:
What software should we choose?
How can we create a dashboard?
but on the contrary focused on your business processes (see it like a top-down view):
How does our core process look like? Where would I like to measure data?
How would department a calculate sales in difference to department b? Should all use the same rule?
Where does everyone store the data? Can we access it? Do we need structural data?
And, very easy to forget but also easily sometimes one of the biggest parts: Is the identifier of a business object (say, sales id) everywhere build and formatted in the same way?
Conclusion
When those questions are at least in the back of your head and you keep working in this direction, more or less automatically data will spill out at certain points of that process.
Then it won't matter if you use Excel, a small-to medium app like Tableau, Tibco Spotfire, QlikView, Power BI or you want to go full scale with a big Hadoop backend, databases and JasperReports, Apache Drill, Pentaho, SSIS on top of it... it will come out eventually.
TL;DR
Focus on the processes first. Make sure to understand them. Draft in Excel. Then proceed in getting the data and the tools you need to help your use cases. It will work out much better from a "top-down" approach than trying to solve your requirements with tools only.
I am developing a web app in Meteor, with Mongo, that will be running on cloud. Each user must belong to a Company.
Each Company can only access it's own data.
Each user can access it's own data and some data shared with other users of the same company.
Imagine 1.000 companies and 100 users per company, it could get very bad in performance and secutiry, if I use 1 Mongodb database for whole app.
So, because Mongo is "Schema-less and Database-less" I think I can define 1.000 dbs, lets say db_0001, db_0002, ... with same name collections, lets say tasks, messages, ..., so the app can be efficient and more secure (same code for every Company and isolation of data).
Also, on hosting side (let's say for example with Digital Ocean), I think its easier to distribute the dbs if the are already atomized.
Is this a good approach? Or should I not worry about it and let the hosting do this job?
Any thoughts are wellcome.
You are currently only looking at one side of the coin. That's fine to start with.
Think about how you are going to be displaying that data and what query does it translate to. Do a thorough due diligence on all the potential query. For example, how often would user/getbyid be called and how often would you have to show a user their info and their relationship with other users. What other meta data would be required beside user info, would you have to perform a join to get that data? or is it stored as an embedded document? What fields are you going to be searching and sorting by most? Which types of data are write heavy and what are read heavy?
Now lets get back to your database shading approach. It's great that you are thinking ahead of time on this front rather than having to rewrite your component later. Data volume/storage does not worry me here. How many concurrent users would be using at application and what are primary use cases should be the first place to look at to think about scale.
Additionally, you need to understand the nature of the business and project growth. Is it like Instragram type of hyper growth? or is it more predictable. A big Mongo cluster can handle thousands of concurrent read/write requests (assuming your design and query are optimized) so that does not bother me. If you want to keep it flexible MongoDB has a sharding mechanism and you can shard on a key and it takes care all the fancy stuff for ya.
MongoDB has eventual consistency (look up MongoDB CAP theorem) if you enable read from secondaries and you have a high volume business critical app you need to be careful because you can be reading out of date result.
As far as hosting is concerned, DO is fine but always have a backup in another region to maintain geographic redundancy so in case if a region goes down (Hello AWS!) you have something to fall back on.
Good luck on your project!
Which is the best way to implement visitor's logic?
Create visitors table |ip|resource_type|resource_id|
Create serialize field in records (Post, Pet, Event, Ad, etc...)
Use nosql solutions
Any other idea
In the 1st case, we have extended the table size for every visit.
In the 2nd, we have a long field.
In the 3nd, I have trouble with mongoid at production (centOS).
Not sure I'm answering, but I would not implement that myself, but rather take a look at existing solutions. For basic counting :
Vanity
Google Analytics
For more detailed metrics about what each user does, I would go toward cohort.
A totally other option could be using just the log and something like lograge to log each request. It is very easy to add fields (such as the IP). You can then extract all the informations from your logs.
Okay, lets say I got 2 different Models:
Poll (has_many :votes)
Vote (belongs_to :poll)
So, one poll can have many votes.
At the moment I'm displaying a list of all polls including its overall vote_count.
Everytime someone votes for a poll, I'm going to update the overall vote_count of the specific poll by using:
#poll = Poll.update(params[:poll_id], :vote_count => vote_count+1)
To retrieve the vote_count I use : #poll.vote_count which works fine.
Lets assume I got a huge amount of polls (in my db) and a lot of people would vote for the same poll at the same time.
Question : Wouldn't it be more efficient to remove the vote_count from the poll table and use: #vote_count = Poll.find(params[:poll_id]).votes.count for retrieving the overall vote_count instead? Which operation (update vs. count) would make more sense in this case?(I'm using postgresql in production)
Thanks for the help!
Have you considered using a counter cache (see counter_cache option)? Rails has this built in functionality to handle all the possible updates to an association and how that would affect the counter.
It's as simple as adding a 0 initialized integer column named #{attribute.pluralize}_count(in your case votes_count) on the table of one to many side of an association (in your case Poll).
And then on the other side of the association add the :counter_cache => true argument to the belongs to statement.
belongs_to :poll, :counter_cache => true
Now this doesn't answer your question exactly, and the correct answer will depend on the shape of your data and the indexes you've configured. If you're expecting your votes table to number in the millions spread out over thousands of polls, then go with the counter cache, otherwise just counting the associations should be fine.
This is a great question as it touches the fundamental issue of storing summary and aggregate information.
Generally this is not a good idea as things can easily get out of sync as systems grow.
Sometimes there are occasions when you do want summary information, but these are more specialized cases such as read-open databases that are only used for reporting and are updated once a day at midnight.
In those cases summary aggregate reporting is not only ok but is preferred over repeated summary/aggregate information calculation that would otherwise be done with each query. That will also depend on both usage and size, e.g. if there are 300 queries a day (against the once a day updated, read only database) and they all have to calculate the same totals, and each query reads 20,000 rows, it is more efficient to do that once and store that calculation. As the data and queries grow this may be the only practical way to allow complex reporting.
To me, it doesn't make more sense in such a simple case to use a vote_count in the Poll. Counting rows is really fast and if you add a vote and forget to increment vote_count then the data is kind of broken...
Have you ever noticed how facebook says “3 friends and 33 others liked this”? I was wondering what the best approach to do this is. I don’t think going through the friends list, and the list of users who “liked this” and comparing them is efficient at all! Do they keep a track of this in the database? That will make the database size very huge.
What do you guys think?
Thanks!
I would guess they outer join their friends table with their likes table to count both regular likes and friend likes at the same time.
With the proper indexes, it wouldn't be a slow query at all. Huge databases aren't necessarily slow, so there's really no reason to not store all of this information in a database. The trick is to make sure the indexes and partitions (if any) are set up well.
Facebook uses Cassandra, a NoSQL database for at least some things. Here's a more detailed discussion of what some of the bigger social media sites do to solve these problems:
http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx
Lots of interesting reading in there if you follow the links from it to the Digg blog post, etc.
Yes they definitely keep it in their database as they definitely have more than 1 server that needs to access the data.
As for scalability, I'm sure they use a lot of caching.
Here is an example:
If you have 1 million rows to go through, an index can perform O(logn) = 20 operations (in the worst case) only to find what you need.
For 2 million, you only need 21 operations (in the worst case) to find what you need.
Every time you double the amount of users to go through you simply need only 1 more operation (in the worst case) with a O(logn) index.
They also have a distributed architecture or a clustered database.
Facebook must be using a trigger(which automatically gets executed as soon as an event occurs).
For example, suppose a trigger is created to store the count and names of people who liked the status, then it will get executed every time when someone likes your status and that too implicitly (automatically).
This makes the operation way too easy and Facebook doesn't have to manually update the database or store a huge database for this. Also,this approach is a faster one.
In designing social networking software (mothsorchid.com) I found the only way to address this is to pre-cache streams of notifications. One doesn't query the database at the time of page load to count how many friends and others 'liked this', when someone 'likes' something that is recorded on the object, and when retrieving the object one can compare with the current user's friend list. If someone updates their profile/makes a comment/etc it sends notification objects to friends which are pre-cached in their feeds. Cuts down tremendously on database work at expense of disk space, but disk space is cheap.
As to how Facebook does this, they use Cassandra DBMS, which is probably a little different to what you have in mind.
Keep in mind that Facebook strongly utilizes memcached, so they're retaining a lot of data in memory and only refreshing it when absolutely necessary. See this blog post for some scalability discussion around this:
http://www.facebook.com/note.php?note_id=39391378919
Each entry that somebody can like probably contains a list of everybody who does like it (all of this is of course in a database). When you view that entry, they match it against your friends list to see which of them is your friend. Voila.
A lot of this are explained by the Director of Engineering of Facebook in this QCon presentation :
http://www.infoq.com/presentations/Facebook-Software-Stack
A great presentation to watch.....