I'm developing a web application that will depend heavily on its ability to suggest items based on users with similar preferences. A friend of mine told me that what I'm looking for, mathematically, is some cluster analysis algorithm. On the other hand, here on SO, I was told that Neo4j (or some other graph DB) was the kind of DB I should use for this task (the preferences one).
I started studying both of these tools, and I'm having some doubts.
For cluster analysis purposes it looks to me like a standard SQL DB would still be the perfect choice, while Neo4j would be better suited to a neural-network kind of approach (although still perfectly fit for the task).
Am I missing something? Am I trying to use the wrong combination of tools?
I would love to hear some ideas on the subject.
This depends on your data. Neo4j can provide even complex recommendations in real time for one particular node. Say you want to recommend a product to a user: a graph DB can handle this in real time,
whereas a clustering system is the best way to compute recommendations for all users at once (and then perhaps save them somewhere so you don't need to calculate them again).
The computational difference:
Neo4j has no initialization cost and can give you a single recommendation in an acceptable time.
Clustering needs more time for initialization (not seconds but more likely minutes or hours) and is better suited to calculating recommendations for the whole dataset. In fact, looking strictly at the time for one calculation for a specific user, clustering can be faster than Neo4j, but the big restriction is the initialization, which makes it a poor fit for real-time applications.
The practical difference:
If you have mostly static data and it is OK for you to compute recommendations once in a while, then do clustering with SQL.
If you have dynamic data that is updated with each interaction and you always need to provide the newest recommendation, then use Neo4j.
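To make the "one recommendation for one node" idea concrete, here is a minimal sketch in plain Python. It is not Neo4j code: the `likes` dictionary is a hypothetical stand-in for user-to-item relationships, and `recommend_for` only imitates the kind of short traversal (user, liked items, other users, their items) that a graph query would perform for a single user on demand.

```python
from collections import Counter

# Hypothetical toy data: user -> set of liked items (stands in for graph edges).
likes = {
    "alice": {"a", "b"},
    "bob": {"a", "b", "c"},
    "carol": {"b", "d"},
    "dave": {"e"},
}

def build_item_index(likes):
    """Invert the user -> items mapping into item -> users."""
    index = {}
    for user, items in likes.items():
        for item in items:
            index.setdefault(item, set()).add(user)
    return index

def recommend_for(user, likes, item_index, k=3):
    """Walk user -> items -> other users -> their unseen items, for one user only."""
    seen = likes.get(user, set())
    scores = Counter()
    for item in seen:
        for other in item_index.get(item, ()):
            if other == user:
                continue
            # Each shared item gives one vote to everything the other user likes.
            for candidate in likes[other] - seen:
                scores[candidate] += 1
    # Sort by score descending, then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:k]]

item_index = build_item_index(likes)
alice_recs = recommend_for("alice", likes, item_index)  # ["c", "d"]
```

Only the neighbourhood of the requested user is touched, so new likes are reflected in the very next call. A clustering approach would instead process every user up front and then look results up.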

I am currently working on various topics related to recommendation and clustering with neo4j.
I'm not exactly sure what you're looking for, but depending on how you model your data in the graph, you can easily work out clustering algorithms based on counting links to various types of nodes.
If you plan your nodes and relationships correctly, you can then identify groups of nodes that share the most links to a common set of categories.
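As a rough illustration of grouping by shared links, the sketch below uses plain Python rather than a graph database. The `links` data, the `min_shared` threshold, and the choice to merge groups with a union-find structure are all assumptions made for the example, not something prescribed by Neo4j.

```python
from itertools import combinations

# Hypothetical data: node -> set of categories it links to.
links = {
    "u1": {"rock", "jazz", "blues"},
    "u2": {"rock", "jazz"},
    "u3": {"cooking", "travel"},
    "u4": {"cooking", "travel", "jazz"},
}

def shared_link_counts(links):
    """Count how many categories each pair of nodes has in common."""
    return {
        (a, b): len(links[a] & links[b])
        for a, b in combinations(sorted(links), 2)
    }

def group_nodes(links, min_shared=2):
    """Put two nodes in the same group when they share at least min_shared categories."""
    parent = {node: node for node in links}

    def find(node):
        # Follow parent pointers to the group representative (with path halving).
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (a, b), count in shared_link_counts(links).items():
        if count >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for node in links:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in groups.values())

groups = group_nodes(links)  # [["u1", "u2"], ["u3", "u4"]]
```

Raising `min_shared` yields smaller, tighter groups. Lowering it to 1 would merge all four nodes here, because "jazz" connects both pairs.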

Let me introduce Reco4J, an open source framework that provides recommendations based on a graph database source. It uses Neo4j as its graph database management system.
Have a look at it and contact us if you are interested in support.
It is in a really early release but we are working hard to provide extended documentation and new interesting features.


Exasol vs HBase

I'm quite new to Big Data architecture, so please don't be too harsh on me.
I am trying to figure out the best way to build a BI architecture able to deal with huge amounts of data. As I see it, the solution has to be clustered/horizontally scalable to cope with system growth. I would like to be able to interact with the system using SQL, so HBase + Hive (or even Pig, not for SQL but to avoid writing MR tasks manually) could be a solution. What would be the benefits and disadvantages of such an architecture as opposed to, for instance, EXASolution and its in-memory, MPP, columnar solution?
Are there other alternatives which might have some extra benefits? What about maintenance and configuration? Is there any Microsoft solution (I may run into customer-specific needs regarding this)?
Sorry for posting such an open question, but I would like to see some discussion so that I can learn from you as much as possible.
Though being an EXASOL guy, I will not start to try to convince you that EXASOL is the one and only good solution out there. It heavily depends on the use case you are trying to implement, and the requirements you have to fulfill.
Hadoop is a very flexible, scalable system and used very often for storing and processing huge volumes of data.
EXASOL in contrast is a specialized RDBMS for complex analytic query processing.
I think that these two options don't really compete directly but complement each other. In many cases companies need a scalable data lake to store and preprocess their data, or to query it in rather simple ways. Once you want to enter the real-time business with complex analytics, where dozens, hundreds or even thousands of analysts are running lots of queries, then an in-memory RDBMS is a great choice.
King, the producer of Candy Crush, combines these two worlds into a powerful data management ecosystem. They store petabytes of data within Hadoop and use EXASOL on top as an in-memory layer for hundreds of terabytes of data. You can read more about that exciting use case here:
Another important difference between these two worlds is the complexity. While EXASOL is tuning-free because it is a specialized system (similar to an appliance) for a certain use case, running SQL queries or R/Python/Java in-database analytics, the Hadoop stack is much more complex. You'll need a certain level of know-how to set up, maintain and tune this system. This doesn't need to be a reason for or against either of the two options. As mentioned, it heavily depends on what you want.
From a price perspective, Hadoop is free and so it should be much cheaper than an in-memory DB such as EXASOL, right? Wait a minute, it's not that easy. Again, you have to consider the whole picture: how much data you really want to store, how much of that needs to be queried for analysis, how much hardware you would need to buy, and how many people have to be hired and trained for operations or for the analytics deployed on the system.
To summarize my thoughts, the world is too complicated to directly compare these two technologies. Depending on the use case and your personal requirements, either one or the other could be the better option. And in my opinion, the trend in the market is combining such systems into a data management ecosystem where you get the best out of the two worlds... Actually three worlds, because the world of operational data processing with NoSQL solutions should also be mentioned here.
I hope that helped a bit. If you need any further details especially about EXASOL, don't hesitate to contact me or connect with me on LinkedIn:

Is a data warehouse a good solution for sharing customer data across technologies?

I want to be able to share data across all areas of our business in a way that reduces the overall complexity of our infrastructure.
The Problem
Our problem is that we currently have 4 main applications that all connect to our CRM application (Microsoft Dynamics 2011):
The decision-makers at our firm currently want to upgrade our CRM to the most current version and then stay up to date as new upgrades are released (every 2-3 years). Almost all of our applications are rigidly integrated with Microsoft Dynamics, so each upgrade is very expensive and risky. I want to design another approach that will reduce this expense and risk.
In 2006, Roger Sessions wrote an article called A Better Path to Enterprise Architectures (here) which outlines ways to improve business IT systems. One of the central themes in his discussion is reducing complexity, and by arranging dice in different ways, he shows that you can exponentially reduce the complexity of systems by partitioning technologies into segments rather than letting any technology connect to any other technology. Jeanne Ross has a great presentation on this topic as well (here), and she talks about having a digitized platform for sharing core data and services between areas of the business in order to reduce the complexity of the overall system and increase agility in responding to current and future business needs.
As I reflect on the lessons from Sessions and Ross, I am confident that we need to take Microsoft Dynamics out of the center of our architecture if we want to overhaul the technology every 2-3 years. We'll just need to replace it with something that will allow our core data (mostly customer data) to be shared across applications. I know that data warehouses are often used for aggregating data across the organization. Could this work?
I understand that data warehouses are mostly used for reporting, so I don't know if having direct connections to the data warehouse would be ideal. However, each application would not need the ability to update any data in the data warehouse. They just need the ability to grab their IDs to set up relationships between global, data warehouse entities (customers) and various unit-specific entities within each application's database.
Which of these three options would meet my needs: (1) a data warehouse to which all applications connect directly, (2) a data warehouse that feeds data to each application-specific database through overnight updates or (3) something else?
What you're after is a data integration architecture - that doesn't necessarily mean a data warehouse. The pattern you're describing is called "hub and spoke," and it's very common - I'd say you're definitely on the right track for resolving the integration problem you're describing.
This page goes into this problem and pattern in much more depth, and it also has a section on the differences between data warehousing and data integration. You've noted that you're aware data warehouses are commonly used for reporting - that's true, and they're also used heavily for analytics, as the link discusses. They're traditionally a data source for business intelligence efforts. This can mean they're not always focused on the kind of data you're interested in - i.e. operational data which your systems need to function, but which might not be of interest for reporting or analytical purposes. Or, they might not function in a way that's helpful for your needs - for instance, traditional overnight ETL loads might not be the best solution if you need your applications to be up-to-date more quickly.
All this is to say that data warehouses can definitely be used as a data hub - the EDW becomes your "master data" source, any data quality processes needed run on the EDW data, and ETL processes fire corrected data back out to the various sources - but you'll probably be better served by researching the topic of data integration than the topic of data warehousing, even if the two share a lot of similarities and can overlap.
If you create a data warehouse without any business intelligence requirements, it might not function very well as a data warehouse. A very suitable data integration/master data solution might not resolve all of the future requirements you have for a data warehouse. Equally, if you were to create a traditional data warehouse after researching data warehousing best practices, it might not fulfill your data integration requirements, or fulfill them in the best way. As the link suggests, separate the two ideas: resolve your data integration problem, and if you want a data warehouse in the future, you can use your data integration solution to help populate it.

Which open-source recommendation system should I choose to deal with big dataset

I want to build a recommendation system, and the target is to deal with a really big data set, around 1 TB of data.
Each user has a really huge number of items, but the number of users is small, in the thousands or tens of thousands.
I searched on Google and found some open-source recommendation engines based on Hadoop, like Mahout. I guess it may be able to deal with data this big, but I'm not sure.
I also found some engines written in C++, Python, and even PHP. I don't think scripting languages can deal with data this big, because the whole dataset won't fit in memory.
Or am I wrong? Could someone give me a recommendation?
Your question title is:
"Which open-source recommendation system should I choose to deal with big dataset?"
and in the first line you say:
"I want to build a recommendation system, and the target is to deal with really big data set, like 1 TB data."
And you are asking for a recommendation as an answer.
To answer your second question first. In my experience of building recommender systems I would advise you do not "build" a recommender system from the ground up if you can avoid it. Recommender Systems are complex and can use a wide range of techniques to provide a user with a recommendation. So my recommendation is unless you are really committed, and have a team of people with a range of experience and knowledge in recommender systems, statistics, and software engineering then look to implement an existing recommender system rather than building your own.
In terms of which open source recommender system you should choose, this is actually pretty difficult to answer with great accuracy. Let me try to answer this by breaking it down.
Consider the open source license, its restrictions and your requirements.
Consider which algorithm you want to use to make recommendations
Consider the environment you will be running your recommender system on.
I recommend you look more into the algorithm side as it will be the determining factor as to which tool you can use, or whether you will need to roll your own. Start reading here for a very brief insight in to the different approaches that recommender systems use. In summary the different approaches are:
Content based
Neighbourhood / Collaborative filtering based
Constraint based
In your case to keep things relatively straightforward it sounds like you should consider a user-user collaborative filtering algorithm for this. The reasons being:
Neighbourhood Collaborative Filtering is quite intuitive to understand and it can be relatively easy to implement.
With this method you can also justify your recommendations to your users in a basic way
There is no requirement to build a model for training, and the processing of neighbours can be done "offline", to provide quick recommendations to the end user.
Storing neighbours is actually quite memory efficient, which means better scalability. Something it sounds like you will need lots of.
The user-based part of my suggestion is because it sounds like you have fewer users than items. In a user-based nearest-neighbour approach, a predicted rating of a new item I for user U is calculated by looking at the other users who have also rated item I and are most similar to user U. Because you have fewer users than items in your system, it will be faster to compute user-based collaborative filtering than item-based collaborative filtering.
Within user-based collaborative filtering you need to consider which rating normalisation (mean-centering vs z-score) you want to use, the similarity weight computation method (e.g. cosine vs Pearson's correlation vs other similarity measures), the neighbourhood selection criteria (pre-filtering of neighbours, number of neighbours involved in the prediction), and any dimensionality reduction methods (SVD, SVD++) you want to implement (with a large dataset like yours you will want to seriously consider dimensionality reduction).
So really, instead of looking for an open source tool that will be able to process your data set, you should consider your algorithm choice first, then look for a tool that has an implementation of this algorithm, and then assess whether it can process the volume of data involved in your dataset.
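To show how these choices fit together, here is a minimal sketch of user-based collaborative filtering in plain Python. It assumes mean-centering for normalisation, cosine similarity over co-rated items, and a fixed number of neighbours `k`. The `ratings` data and all function names are hypothetical, and this is not how Mahout implements it.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3, "d": 5},
    "u3": {"a": 1, "b": 5, "d": 2},
}

def user_mean(user):
    """Average rating given by one user."""
    values = ratings[user].values()
    return sum(values) / len(values)

def centered(user):
    """Mean-centered ratings: each rating minus the user's own average."""
    mu = user_mean(user)
    return {item: r - mu for item, r in ratings[user].items()}

def cosine(u, v):
    """Cosine similarity of mean-centered ratings over co-rated items."""
    cu, cv = centered(u), centered(v)
    common = cu.keys() & cv.keys()
    num = sum(cu[i] * cv[i] for i in common)
    den = sqrt(sum(cu[i] ** 2 for i in common)) * sqrt(sum(cv[i] ** 2 for i in common))
    return num / den if den else 0.0

def predict(user, item, k=2):
    """Predict a rating from the k most similar users who rated the item."""
    neighbours = [
        (cosine(user, other), other)
        for other in ratings
        if other != user and item in ratings[other]
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    norm = sum(abs(sim) for sim, _ in neighbours)
    if norm == 0:
        return user_mean(user)  # no usable neighbours: fall back to the user's mean
    offset = sum(sim * (ratings[o][item] - user_mean(o)) for sim, o in neighbours)
    return user_mean(user) + offset / norm

prediction = predict("u1", "d")
```

The similarities could be computed offline and stored per user, leaving only the final weighted sum for request time. Note that this sketch does not clip predictions to the rating scale, and it recomputes means on every call, which a real system would cache.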
In saying all of that, if you do choose to go down the user-based collaborative filtering route then I am confident that Apache Mahout will be able to solve your problem, and if not it will certainly help you understand the complexity involved in building your own (just look at their source code).
Please note that the advice is really to consider the algorithm choice. "Good" recommender systems are so much more than just being able to process a large dataset. You need to think about accuracy, coverage, confidence, novelty, serendipity, diversity, robustness, privacy, risk, user trust, and finally scalability. You should also consider how you are going to perform experiments and evaluate your recommendations. Remember, if the recommendations you are churning out are rubbish and are turning your users off, then there is no point in having a recommender system!
It is such a big area with lots to think about, there is probably no one single tool that is going to help you with everything, so be prepared to do a lot of reading and research as well as implementing lots of different open source tools to help you.
In saying that, start looking at Apache Mahout. Going back to the three areas I said you should think about:
It has a commercial-friendly open-source license,
it has really great implementation of the algorithms you are likely going to need to use, and
it can work on distributed environments (read scalable).
Hope that helps, and good luck.

cluster analysis Hadoop, Map reduce environment

We are currently trying to create some very basic personas based on our user database (a few million profiles). The goal at this stage is to find out what the characteristics of our users are, for example what they look like and what they are looking for, and to create several "typical" user profiles.
I believe the best way to achieve this would be to run a cluster analysis in order to find similarities among users.
The big roadblock however is how to get there. We are tracking our data in a Hadoop environment and I am being told that this could be potentially achieved with our tools.
I have familiarised myself with the theory of the topic and know that it can be done for example in SPSS (quite hard to use and limited to samples of large data sets).
The big question: Is it possible to perform one or several types of cluster analysis in a Hadoop environment and then visualise the results like in SPSS? It is my understanding that we would need to run several types of analysis in order to find the best way to cluster the data, including different distance measures for the clusters.
I have not found any information on the internet with regards to this, so I wonder if this is possible at all, without a major programming effort (meaning literally implementing for example all the standard tools available in SPSS: Dendrograms, the different result tables and cluster graphs etc.).
Any input would be much appreciated. Thanks.

Resources related to data-mining and gaming on social networks

I'm interested in the problem of pattern mining among players of social networking games, for example detecting cheaters in a game, given a company's user database. So far I have been following the usual recipe for a data mining project:
construct a data warehouse that aggregates significant information
select a classifier, and train it with a subset of records from the warehouse
validate classifier with another test set
lather, rinse, repeat
Surprisingly, I've found very little in this area regarding literature, best practices, etc. I am hoping to crowdsource the information gathering problem here. Specifically what I'm looking for:
What classifiers have worked well for this type of pattern mining (it seems highly temporal: users playing games, users receiving rewards, users transferring prizes etc.)?
Are there any highly agreed upon attributes specific to social networking / gaming data?
What is a practical amount of information that should be considered? One problem I've run into is data overload, where queries and data cleansing may take days to complete.
Related to point above, what hardware resources are required to produce results? I've found it difficult to estimate the amount of computing power I will require for production use. It has become apparent that a white box in the corner does not have enough horse-power for such a project. Are companies generally resorting to cloud solutions? Are they buying clusters?
Basically, any resources (theoretical, academic, or practical) about implementing a social networking / gaming pattern-mining program would be very much appreciated.
I am looking for the same kind of resources. Here are some things I found that I consider pretty interesting. I hope you can take advantage of them, and if you discover more resources, please let me know.
Here they are:
This is in Portuguese but is excellent:
