Handle huge data imported from Facebook - Hadoop

I'm currently creating a program that imports all the groups and feeds from Facebook that the user wants.
I have been using the Graph API with OAuth, and this works very well.
But I've reached the point where I realized that a single request can't handle the import of 1000 groups plus their feeds.
So I'm looking for a solution that imports this data in the background (like a cron job) into a database.
Requirements
Runs in background
Runs under Linux
Restful
Questions
What's your experience with this?
Would Hadoop be the right solution?

You can use Neo4j.
Neo4j is a graph database that is reliable and fast for managing and querying highly connected data.
http://www.neo4j.org/
1) Decide on the structure of your nodes, relationships, and their properties, and accordingly create an API that gets the data from Facebook and stores it in Neo4j.
2) Create a cron job that gets the data from Facebook and stores it in Neo4j.
I have used Neo4j in 3 big projects, and it is the best fit for graph data.
I don't think implementing a graph model on top of MySQL is a good idea; for large, highly connected data Neo4j is the better option.
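A minimal sketch of point 2), assuming Python with the requests library, Neo4j's transactional Cypher endpoint (Neo4j 2.x uses {param} placeholders; newer versions use $param), and placeholder labels, property names, and URLs - none of these are prescribed by the answer:

    # Hypothetical sketch: a cron-driven job that pulls the user's groups from
    # the Graph API and MERGEs them into Neo4j. All names are illustrative.
    import requests

    GRAPH_API = "https://graph.facebook.com/me/groups"
    NEO4J_TX = "http://localhost:7474/db/data/transaction/commit"

    def import_groups(access_token, user_fb_id):
        url = GRAPH_API
        params = {"access_token": access_token}
        while url:
            page = requests.get(url, params=params).json()
            statements = []
            for group in page.get("data", []):
                statements.append({
                    "statement": (
                        "MERGE (u:User {fb_id: {uid}}) "
                        "MERGE (g:Group {fb_id: {gid}}) "
                        "SET g.name = {name} "
                        "MERGE (u)-[:MEMBER_OF]->(g)"
                    ),
                    "parameters": {"uid": user_fb_id,
                                   "gid": group["id"],
                                   "name": group.get("name")},
                })
            if statements:
                # One HTTP request commits the whole page in one transaction
                requests.post(NEO4J_TX, json={"statements": statements})
            # Graph API pagination: follow the "next" link until it is absent
            url = page.get("paging", {}).get("next")
            params = {}  # the "next" URL already carries the token and cursor

Run from cron (or any scheduler), this keeps the import out of the user's request path, which is exactly what the question asks for.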

Interestingly, you designed the appropriate solution yourself already. So in fact you need the following components:
a relational database, since you want to request data in a structured, quick way
-> from experience I would stress the point of having a fully normalized data model (in your case with tables users, groups, users2groups), and of using 4-byte surrogate keys rather than the larger keys from Facebook (for back-referencing you can store their keys as attributes, but internal relations are more efficient on surrogate keys)
-> establish indexes based on hashes rather than strings (e.g. crc32(lower(STRING))) - an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)); a small sketch of this lookup follows after this list
-> never, ever establish unique columns based on strings longer than 8 bytes; unique bulk inserts can be done based on hashes + string checking via insert...select
-> once you have that settled you could also look into sparse matrices (see Wikipedia) and bitmaps to get your users2groups table optimized (however, I have learned that this is an extra that should not keep you from coming up with a first version soon)
a cron job that is run periodically
-> ideally along the caps Facebook gives you (so if they rule that you may not request more often than once per second, stick to that - not more, but also try to come as close as possible to the cap) -> invest some time in getting the management of this settled, in case different types of requests need to be fired (requests for user records <> requests for group records, but maybe hit by the same cap)
-> most of the optimization can only be done during development - so if I were you I would stick to a high-level programming language that does not bother you too much with variable type juggling and that comes with broad support for associative arrays, such as PHP, and I would program that thing myself
-> I have had good experiences with setting up the cron job as a web page with deactivated output buffering (for PHP look at ob_end_flush()) - it is easy to test and the cron job can be triggered via curl; if you channel status output through a function of your own (e.g. with timestamps) this also becomes flexible enough to run either via browser or via command line -> which means efficient testing + efficient production running
your user UI, which only queries your database and never, ever, never the external system API
lots of memory, to keep your performance high (optimal: all your data + index data fits into the memory/cache dedicated to the database)
-> if you use MySQL as the database you should look into innodb_flush_log_at_trx_commit=0 and innodb_buffer_pool_size (just google them if interested)
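To make the hash-based lookup point above concrete, here is a minimal sketch assuming Python with PyMySQL and an illustrative users table with name and name_hash columns (the table, column, and connection details are mine, not part of the original advice):

    # Sketch of the crc32(lower(name)) lookup idea; all names are illustrative.
    import zlib
    import pymysql

    def crc32_lower(s):
        # Unsigned CRC32 of the lower-cased string, matching crc32(lower(...)) in SQL
        return zlib.crc32(s.lower().encode("utf-8")) & 0xFFFFFFFF

    conn = pymysql.connect(host="localhost", user="app", password="secret", database="fb")

    def find_user(name):
        with conn.cursor() as cur:
            # The index sits on the 4-byte name_hash column; the string comparison
            # only filters out the rare hash collisions.
            cur.execute(
                "SELECT id, name FROM users WHERE name_hash = %s AND name = %s",
                (crc32_lower(name), name),
            )
            return cur.fetchall()

The point is that the index stays small and integer-based while the query remains exact.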
Hadoop is essentially a distributed storage and processing layer - it could help you with availability. However, I would put it into the same category as the "sparse matrix" point: it is nothing that should stop you from coming up with a solution. From my experience, availability is not a primary constraint in data exposure projects.
-------------------------- UPDATE -------------------
I like Neo4j from the other answer. So I wondered what I can learn from it for my future projects. My experience with MySQL is that RAM is usually the biggest constraint. So increasing your RAM so that the full database can be loaded can gain you performance improvements by a factor of 2-1000 - depending on where you are coming from. Everything else, such as index improvements and structure, more or less follows from that. So if I were to make up a performance prioritization list, it would be something like this:
MYSQL + enough RAM dedicated to the database to load all data
NEO4J + enough RAM dedicated to the database to load all data
I would still prefer MySQL. It stores records efficiently, but needs to run joins to derive relations (which Neo4j does not require to that extent). Join costs are usually low with the right indexes, and according to http://docs.neo4j.org/chunked/milestone/configuration-caches.html Neo4j needs to add extra management data for its property separation. For big-data projects that management data adds up, and in a fully-in-memory setup it requires you to buy more memory. Performance-wise, these two options are at the top. Further, much further down the line, you would find this:
NEO4J + not enough RAM dedicated to the database to load all data
MYSQL + not enough RAM dedicated to the database to load all data
In the worst case MySQL will even put indexes on disk (at least partly), which can result in massive read delays. In comparison, with Neo4j you could perform a 'direct jump from node to node', which should - at least in theory - be faster.

Related

Is Elasticsearch optimized for inserts?

I develop for a relatively large online store with a PHP backend, and it uses Elasticsearch for some things (like text search, logging, etc.).
Now, I'd like to start storing all kinds of information about user activity in ES. For instance, every page view (e.g. a user entering a product page, a category page, etc.).
Is ES optimized for such a heavy load of continuous inserts, or should I consider some alternative, like having some sort of buffer layer where I store all of my immediate inserts in memory and then, every minute or so, insert them into ES in bulk?
What is the industry standard? Or am I worrying in vain and ES is already optimized for that?
Thanks.
Elasticsearch, when properly sized to handle your load, is definitely a valid alternative for such a use case.
You might decide, however, to store that streaming data in another cluster, different from your production cluster, so as not to impact the health of the production cluster too much.
There are a lot of variables involved in arriving at the correct decision, and we don't have enough information here, but it's definitely a valid way to go.
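If you do go the buffer-then-bulk route described in the question, a minimal sketch with the official Python client's bulk helper could look like this (the index name, document fields, batch size, and connection URL are assumptions for illustration):

    # Sketch of the "buffer in memory, then bulk-index" idea.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch("http://localhost:9200")  # connection args vary by client version
    buffer = []

    def record_pageview(user_id, page, ts):
        buffer.append({"_index": "pageviews",
                       "_source": {"user_id": user_id, "page": page, "ts": ts}})
        if len(buffer) >= 1000:          # or flush on a timer, e.g. every minute
            flush()

    def flush():
        if buffer:
            bulk(es, buffer)             # one HTTP request for the whole batch
            buffer.clear()

Whether the buffer lives in the PHP app, a queue, or a small side process is a separate decision; the point is simply that many tiny writes become one bulk request.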

Oracle Materialized View for sensory data transfer

In an application we have to send a stream of sensory data from multiple clients to a central server over the internet. One obvious solution is to use a MOM (Message-Oriented Middleware) such as Kafka, but I recently learned that we can also do this with database synchronization tools such as an Oracle Materialized View.
The latter approach works in some applications (sending data from a central server to multiple clients, the inverse direction of our application), but what are its pros and cons in our application? Which one is better for sending a sensory data stream from multiple (~100) clients to a server in terms of speed, security, etc.?
Thanks.
P.S.
For more detail, consider an application in which many (about 100) clients have to send streaming data (1 MB of data per minute) to a central server over the internet. The data is needed on the server for online monitoring, analysis, and some computation such as machine learning and data mining tasks.
My question is about the difference between a db-to-db connection and streaming solutions such as Kafka for transferring data from the clients to the server.
Prologue
I'm going to try to break your question down in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write, so I'd really appreciate it if you do two things off the back of it:
Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
Look again at how you explained yourself - this is a valid question that has been partly held back by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over a complex question before you post it may help.
Problem definition
sensory data stream from multiple clients to a central server
You're sending data from multiple locations to a single persistence store
online monitoring
You're going to be triggering further actions based off the raw data and potentially some aggregated data
analysis and some computation such as machine learning and data mining tasks
You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere
Further assumptions
Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.
The data coming in from your clients is about the same topic.
You've got ~100 clients; at that number we can assume that:
the number of clients might change
you want to be able to add clients without increasing the number of methods of accessing data
You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.
Architecture diagram
Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my limited knowledge of your systems. The "raw data persistence" can be files, Kafka, a database, etc. This is a description of the components that are going to be required and a rough guess as to how they will have to connect.
Applying assumed architecture to materialized views
Materialized views are a persisted query. Therefore you have two choices:
Create a query that unions all 100 clients' data together. If you add or remove a client you must change the query. If a network issue occurs at any one of your clients then everything fails.
Write and maintain 100 materialized views. The Oracle database at your central location has 100 incoming connections.
As you can probably guess from the tradeoffs you'll have to make, I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and the number of single points of failure.
You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.
There is a single service that accepts incoming data
There is a single service that puts all the incoming data into a single place
You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
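As a rough illustration of that single aggregation materialized view, here is a hedged sketch using cx_Oracle; the table, view, and column names, the aggregation, and the refresh policy are all placeholders rather than a recommendation:

    # Sketch: one materialized view aggregating an assumed raw-readings table.
    import cx_Oracle

    conn = cx_Oracle.connect("app", "secret", "dbhost/orcl")
    cur = conn.cursor()
    cur.execute("""
        CREATE MATERIALIZED VIEW sensor_hourly_mv
        BUILD IMMEDIATE
        REFRESH COMPLETE ON DEMAND
        AS
        SELECT client_id,
               TRUNC(reading_ts, 'HH') AS reading_hour,
               COUNT(*)                AS readings,
               AVG(reading_value)      AS avg_value
        FROM   sensor_readings_raw
        GROUP  BY client_id, TRUNC(reading_ts, 'HH')
    """)
    # Refresh from a scheduled job once new raw data has landed:
    cur.callproc("DBMS_MVIEW.REFRESH", ["SENSOR_HOURLY_MV"])

Note that this sits behind the single ingestion service; the clients never touch it directly.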
Consequences of changes
Now that we've decided you have a single data pipeline, your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from the raw data persistence. This means that the choices are now yours, but they're also considerably easier to change.
Reimagining architecture
Here we need to work out what technologies aren't going to change.
Oracle databases are expensive, and you're pushing 140GB/day into yours (that's 50TB/year, by the way - quite a bit). I don't know if you're actually storing all the raw data, but at those volumes it's less likely that you are - you're probably only storing the aggregations.
I'm assuming you've got some preferred technologies for your machine learning and data mining. If you don't, then consider picking some to prevent the madness of supporting everything.
Putting all of this together we end up with the following. There's actually only one question that matters:
How many times do you want to read your raw data off your database?
If the answer to that is once, then we've just described middleware of some description. If the answer is more than once, then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial, and there's going to be some trial and error in getting this right.
One final point about this: we've defined a data pipeline - a single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients? No need to do anything. Want to change the technology behind part of the system? As long as the interface remains the same there's no issue. Want to send data elsewhere? No problem, it's all in the raw data persistence layer.
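If Kafka does end up as that middle layer, a minimal client-side producer sketch could look like the following (using kafka-python; the broker address, topic name, and payload fields are assumptions):

    # Each client publishes its readings into the single pipeline.
    import json
    import time
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="central.example.com:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def publish_reading(client_id, payload):
        # Keying by client keeps each client's readings ordered within a partition
        producer.send("sensor-readings",
                      key=str(client_id).encode("utf-8"),
                      value={"client_id": client_id,
                             "ts": time.time(),
                             "payload": payload})

    # Flush before shutdown so buffered messages are not lost:
    # producer.flush()

A single consumer service on the central side then owns the "raw data persistence" box in the diagram.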

Mongodb strategy for Multi-Company web app

I am developing a web app in Meteor, with Mongo, that will be running in the cloud. Each user must belong to a Company.
Each Company can only access its own data.
Each user can access their own data and some data shared with other users of the same company.
Imagine 1,000 companies and 100 users per company; it could get very bad in terms of performance and security if I use one MongoDB database for the whole app.
So, because Mongo is "schema-less and database-less", I think I can define 1,000 dbs, let's say db_0001, db_0002, ..., with collections of the same names, let's say tasks, messages, ..., so the app can be efficient and more secure (same code for every Company and isolation of data).
Also, on the hosting side (let's say, for example, with DigitalOcean), I think it's easier to distribute the dbs if they are already atomized.
Is this a good approach? Or should I not worry about it and let the hosting do this job?
Any thoughts are welcome.
You are currently only looking at one side of the coin. That's fine to start with.
Think about how you are going to be displaying that data and what queries that translates to. Do thorough due diligence on all the potential queries. For example, how often would user/getById be called, and how often would you have to show a user their info and their relationship with other users? What other metadata would be required besides user info - would you have to perform a join to get that data, or is it stored as an embedded document? What fields are you going to be searching and sorting by most? Which types of data are write-heavy and which are read-heavy?
Now let's get back to your database sharding approach. It's great that you are thinking ahead of time on this front rather than having to rewrite your components later. Data volume/storage does not worry me here. How many concurrent users would be using the application and what the primary use cases are should be the first things to look at when thinking about scale.
Additionally, you need to understand the nature of the business and the projected growth. Is it Instagram-type hyper growth, or is it more predictable? A big Mongo cluster can handle thousands of concurrent read/write requests (assuming your design and queries are optimized), so that does not bother me. If you want to keep it flexible, MongoDB has a sharding mechanism: you can shard on a key and it takes care of all the fancy stuff for you.
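To make the two layouts concrete, here is a small pymongo sketch of the per-company-database idea from the question next to the more common single-database-plus-company_id layout; database, collection, and field names are purely illustrative:

    # Sketch of the two layouts, using pymongo.
    from pymongo import MongoClient, ASCENDING, DESCENDING

    client = MongoClient("mongodb://localhost:27017")

    # Option A: one database per company (the db_0001, db_0002, ... idea)
    def tasks_for_company_a(company_id):
        db = client["db_%04d" % company_id]
        return list(db.tasks.find({}))

    # Option B: one shared database, every document tagged with company_id.
    # A compound index keeps per-company queries selective, and company_id
    # is also a natural shard key if the cluster grows later.
    shared = client["app"]
    shared.tasks.create_index([("company_id", ASCENDING), ("created_at", DESCENDING)])

    def tasks_for_company_b(company_id):
        return list(shared.tasks.find({"company_id": company_id}))

Which one wins depends on the query and growth questions above, not on raw storage volume.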
MongoDB has eventual consistency if you enable reads from secondaries (look up MongoDB and the CAP theorem), and if you have a high-volume, business-critical app you need to be careful because you can be reading out-of-date results.
As far as hosting is concerned, DO is fine, but always have a backup in another region to maintain geographic redundancy, so in case a region goes down (hello AWS!) you have something to fall back on.
Good luck on your project!

Neo4j - Using Java plugins to the REST API to improve performance?

I am building an application that requires a lot of data to be constantly extracted from a local MongoDB and put into Neo4j. Seeing as I also have many users accessing the Neo4j database, from both a Django webserver and other places, I decided to use the REST interface to Neo4j.
The problem I am having is that, even with batch insertion, the Neo4j server is busy over 50% of the time just trying to insert all the data from MongoDB. As far as I can see there might be some waiting time because of the HTTP requests, but I have been tweaking this and have only gotten so far.
The question is: if I write a Java plugin (http://docs.neo4j.org/chunked/stable/server-plugins.html) that handles inserting the MongoDB extractions directly, will I then bypass the REST API? Or will the Java plugin commands just be converted to regular REST API requests? Furthermore, will there be a performance boost from using the plugin?
The last question is how to optimize the speed of the REST API (so far I am performing around 1500 read/write operations, which include many "get_or_create_in_index" operations). Is there a sweet spot where the number of queries appended to one HTTP request will keep Neo4j busy until the next HTTP request arrives?
Update:
I am using Neo4j version 2.0
The data that I am extracting consists of Bluetooth observations, where the phone that is running the app I created scans all nearby phones. A single observation is then saved as a document in MongoDB and consists of the user's id, the time of the scan, and a list of the phones/users seen in that scan.
In Neo4j I model all the users as nodes, and I also model an observation between two users as a node, so that it looks like this:
(user1)-[observed]->(observation_node)-[observed]->(user2)
Furthermore I index all user nodes.
When moving the observations from MongoDB to Neo4j, I do the following for each document:
Check in the index whether the user doing the scan already has a node assigned, else create one
Then, for each observed user in the scan: A) check in the index whether the observed user has a node, else create one; B) create an observation node and relationships between the users and the observation node, if they don't already exist; C) create a relationship between the observation node and a timeline node (the timeline just consists of a tree of nodes so that I can quickly find observations at a certain time)
As can be seen, I am doing quite a few lookups in the user index (3), some normal reads (2-3), and potentially many writes for each observation.
Each Bluetooth scan averages around 5-30 observations, and I batch 100 scans in a single HTTP request. This means that each request usually contains 5000-10000 updates.
What version are you using?
An unmanaged extension would use the underlying Java API, so it is much faster; you can also decide on the format and protocol of the data that you push to it.
It is sensible to batch writes, so that you don't incur transaction overhead for each tiny write. E.g. aggregating 10-50k updates in one operation helps a lot.
What is the concrete shape of the updates you do? Can you edit your question to reflect that?
Some resources for this:
http://maxdemarzi.com/2013/09/05/scaling-writes/
http://maxdemarzi.com/2013/12/31/the-power-of-open-source-software/
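As a rough illustration of the batching advice, here is a sketch that folds many small get-or-create style writes into one request against the Neo4j 2.0 transactional Cypher endpoint (MERGE plays the get-or-create role here); the labels and properties mirror the question's model but are otherwise assumptions, and newer Neo4j versions use $param instead of {param}:

    # Sketch: batch many observation writes into one transactional request.
    import requests

    NEO4J_TX = "http://localhost:7474/db/data/transaction/commit"

    OBSERVATION_CYPHER = (
        "MERGE (a:User {user_id: {scanner}}) "
        "MERGE (b:User {user_id: {seen}}) "
        "MERGE (a)-[:OBSERVED]->(o:Observation {scan_ts: {ts}})-[:OBSERVED]->(b)"
    )

    def flush(statements):
        if statements:
            requests.post(NEO4J_TX, json={"statements": statements})
            statements.clear()

    def import_scans(scans, batch_size=200):
        statements = []
        for scan in scans:                       # each scan: one MongoDB document
            for seen in scan["observed_users"]:
                statements.append({
                    "statement": OBSERVATION_CYPHER,
                    "parameters": {"scanner": scan["user_id"],
                                   "seen": seen,
                                   "ts": scan["timestamp"]},
                })
                if len(statements) >= batch_size:
                    flush(statements)            # one HTTP request, one transaction
        flush(statements)

Tuning batch_size is exactly the "sweet spot" experiment the question describes; an unmanaged extension would replace the HTTP round trip with direct Java API calls.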

Free data warehouse - Infobright, Hadoop/Hive or what?

I need to store a large amount of small data objects (millions of rows per month). Once they're saved they won't change. I need to:
store them securely
use them for analysis (mostly time-oriented)
retrieve some raw data occasionally
It would be nice if it could be used with JasperReports or BIRT
My first shot was Infobright Community - just a column-oriented, read-only storage engine for MySQL.
On the other hand, people say that a NoSQL approach could be better. Hadoop+Hive looks promising, but the documentation looks poor and the version number is less than 1.0.
I heard about Hypertable, Pentaho, MongoDB...
Do you have any recommendations?
(Yes, I found some topics here, but they were from a year or two ago)
Edit:
Other solutions: MonetDB, InfiniDB, LucidDB - what do you think?
I'm having the same problem here and have done some research; there are a few types of storage for BI:
Column-oriented. Free and well-known: MonetDB, LucidDB, Infobright, InfiniDB
Distributed: HTable, Cassandra (also column-oriented, theoretically)
Document-oriented: MongoDB, CouchDB
The answer depends on what you really need:
If your millions of rows are loaded at once (nightly batch or so), InfiniDB or another column-oriented DB is the best choice; they have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf
And they won't require a setup of "nodes", "sharding" and other stuff that comes with distributed/"NoSQL" DBs.
http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
If the rows are added in real time, then column-oriented DBs are a bad fit. You can either choose to have two separate DBs (that's my choice: one NoSQL DB for the real-time feeding of the stats by the frontend and the real-time stats, and another column-oriented DB for BI), or turn towards something that mixes column orientation (for read requests) with distribution (for writes), like Cassandra.
Document-oriented DBs are not suited for BI; they are more useful for CRM/CMS issues where you need frequent access to a particular row.
As for the exact choice within a category, I'm still undecided. Cassandra for distributed, and MonetDB or InfiniDB for a column-oriented DB, are the leaders. MonetDB is reported to have problems loading very big tables because it keeps its indexes in memory.
You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.
GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may work. There are several other solutions as well; however, you may also want to look at plain old MySQL and the ARCHIVE storage engine. Each record is compressed when stored and, IIRC, it's designed for your type of workload, though I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.
As for the key-value stores (e.g. NoSQL), yes, they can work as well, and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use any of them, so I don't know how well they work.
My only concern with your data set is that, since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain age. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.
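To illustrate the time-based partitioning and archiving idea, here is a hedged sketch of classic inheritance-based monthly subtables in PostgreSQL (which GridSQL builds on), driven from Python with psycopg2; the table and column names are made up, and recent PostgreSQL versions also offer declarative PARTITION BY RANGE:

    # Sketch: monthly subtables of an assumed parent table "facts" with an
    # event_ts timestamp column; old months are dropped to keep N months online.
    import psycopg2

    conn = psycopg2.connect("dbname=dw user=app password=secret host=localhost")
    cur = conn.cursor()

    def create_month_partition(year, month):
        name = "facts_%04d_%02d" % (year, month)
        start = "%04d-%02d-01" % (year, month)
        end = "%04d-%02d-01" % (year + (month == 12), month % 12 + 1)
        # Dates are generated above, so plain string formatting is safe here.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS %s "
            "(CHECK (event_ts >= DATE '%s' AND event_ts < DATE '%s')) "
            "INHERITS (facts)" % (name, start, end)
        )
        conn.commit()

    def archive_month(year, month):
        # Dropping (or detaching) an old subtable is how "keep N months online"
        # is usually implemented; dump the subtable to cheap storage first.
        cur.execute("DROP TABLE IF EXISTS facts_%04d_%02d" % (year, month))
        conn.commit()

The CHECK constraints let the planner skip subtables outside the queried time range, which is why time-oriented analysis benefits so much from this layout.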

Resources