What could be the best way to use Neo4j and Hadoop?
I have to show the output in an admin panel.
My constraints are - large amount of data and query operations.
What I am currently thinking is,
Bring data into hadoop, perform ETL operations and write it back to the system. Convert this into a job and set it for repititive execution. Use Neo4j on this exported data. Is this the right way?
When searched about it, I found an article
In the past there were some approaches that used Hadoop to quickly generate Neo4j datastores directly. While this approach is performant, it is also tightly coupled to the store-format of a certain Neo4j version as it has to duplicate the functionality of writing to split-up store-files. With the parallel neo4j-import tool and APIs introduced in Neo4j 2.2, such a solution is no longer needed. The import facilities scale across a large number of CPUs to maximize import performance.
Does this mean in terms of large datasets Neo4j no longer needs Hadoop for data processing?
Related
Which technology is efficient to analyze the data hadoop or python? and which technology is speed between the above two?
So Hadoop mostly uses spark. If the underlying framework you are using to analyse or crunch your data contains spark, you are good to go with either Scala, PySpark or maybe R. Using alone python won't give you benefits of Spark which makes data analysis faster and also various transformations on Big Data. So whichever you use, its about using spark.
Scala or PySpark : both contains almost all of these features.
Whenever analyzing data and considering speed as a criteria, two key components determine the speed: The amount of data you have and where the data is located.
If you have Big Data, consider using Hadoop or Spark to analyze it. This will make it much faster and you will not be dependent of load time. If you have a few gigabytes of data it maybe best to use python but it still may slow down your machine.
Now to address where the data is, if you have your data on premise then python is the best approach. If your data is located in cloud server, then Azure, GCP, or even AWS have big data tools available to make this data exploration easier. All three cloud systems have big data tools available for use.
So in terms of speed, it really depends on the two constraints. If you have Big Data and your data is located in a cloud system. Consider using Hadoop to analyze your data. If you have only a few gigabytes of data and on-premise, use python to analyze your data.
Usually, data warehouses in the context of big data are managed and implemented on the basis of Hadoop-based system, like Apache Hive (right?).
On the other hand, my question regards the methodological process.
How do big data affect the design process of a data warehouse?
Is the process similar or new tasks must be considered?
Hadoop is similar in architecture to MPP data warehouses, but with some significant differences. Instead of rigidly defined by a parallel architecture, processors are loosely coupled across a Hadoop cluster and each can work on different data sources.
The data manipulation engine, data catalog, and storage engine can work independently of each other with Hadoop serving as a collection point. Also critical is that Hadoop can easily accommodate both structured and unstructured data. This makes it an ideal environment for iterative inquiry. Instead of having to define analytics outputs according to narrow constructs defined by the schema, business users can experiment to find what queries matter to them most. Relevant data can then be extracted and loaded into a data warehouse for fast queries.
The Hadoop ecosystem starts from the same aim of wanting to collect together as much interesting data as possible from different systems, but approaches it in a radically better way. With this approach, you dump all data of interest into a big data store (usually HDFS – Hadoop Distributed File System). This is often in cloud storage – cloud storage is good for the task, because it’s cheap and flexible, and because it puts the data close to cheap cloud computing power. You can still then do ETL and create a data warehouse using tools like Hive if you want, but more importantly you also still have all of the raw data available so you can also define new questions and do complex analyses over all of the raw historical data if you wish. The Hadoop toolset allows great flexibility and power of analysis, since it does big computation by splitting a task over large numbers of cheap commodity machines, letting you perform much more powerful, speculative, and rapid analyses than is possible in a traditional warehouse.
We have a lot of user interaction data from various websites stored in Cassandra such as cookies, page-visits, ads-viewed, ads-clicked, etc.. that we would like to do reporting on. Our current Cassandra schema supports basic reporting and querying. However we also would like to build large queries that would typically involve Joins on large Column Families (containing millions of rows).
What approach is best suited for this? One possibility is to extract data out to a relational database such as mySQL and do data mining there. Alternate could be to attempt at use hadoop with hive or pig to run map reduce queries for this purpose? I must admit I have zero experience with the latter.
Anyone have experience of performance differences in one one vs the other? Would you run map reduce queries on a live Cassandra production instance or on a backup copy to prevent query load from affecting write performance?
In my experience Cassandra is better suited to processes where you need real-time access to your data, fast random reads and just generally handle large traffic loads. However, if you start doing complex analytics, the availability of your Cassandra cluster will probably suffer noticeably. In general from what I've seen it's in your best interest to leave the Cassandra cluster alone, otherwise the availability starts suffering.
Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.
If you can afford it, having a real data-warehouse would allow you to do complex queries with complex joins on multiples tables. These data-warehouse systems are widely used for reporting, here is a list of what are in my opinion the key players:
Netezza
Aster/TeraData
Vertica
A recent one which is gaining a lot of momentum is Amazon Redshift, but it is currently in beta, but if you can get your hands on it you could give this a try since it looks like a solid analytics platform with a pricing much more attractive than the above solutions.
Alternatives like using Hadoop MapReduce/Hive/Pig are also interesting to look at, but probably not a replacement for Hadoop technologies. I would recommend Hive if you have a SQL background because it will be very easy to understand what you're doing and you can scale easily. There are actually already libraries integrated with Hadoop, like Apache Mahout, which allow you to do data-mining on a Hadoop cluster, you should definitely give this a try and see if it fits your needs.
To give you an idea, an approach that I've used that has been working well so far is pre-aggregating the results in Hive and then have the reports themselves generated in a data-warehouse like Netezza to compute complex joins .
Disclosure: I'm an engineer at DataStax.
In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.
As Charles mentioned, you don't want to run your analytics directly against Cassandra nodes that are handling your real-time application needs because they can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkToplogyStrategy sense of the term). Queries performed as part of a Hadoop job will only impact those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs.
There are a couple of upsides to the DSE approach. The first is that you don't need to perform any ETL prior to processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (CassandraFS), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.
What is the point in feeding an Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright datawarehouse ?
All thse vendor keep saying "we can connect with Hadoop", but I don't understand what's the point. What is the interest of storing in Hadoop and transfering into InfoBright ? Why not have the applications store directly in the Infobright/Vertica DW ?
Thank you !
Why combine the solutions? Hadoop has some great capabilities (see url below). These capabilities though do not include allowing business users to run quick analytics. Queries that take 30 minutes to hours in Hadoop are being delivered in 10’s of seconds with Infobright.
BTW, your initial question did not presuppose an MPP architecture and for good reason. Infobright customers Liverail, AdSafe Media & InMobi, among others, utilize IEE with Hadoop.
If you register for an Industry White Paper http://support.infobright.com/Support/Resource-Library/Whitepapers/ you will see a view of the current marketplace where four suggested Use Cases for Hadoop are outlined. It was authored by Wayne Eckerson , Director of Research, Business Applications and Architecture Group, TechTarget, in September 2011.
1) Create an online archive.
With Hadoop, organizations don’t have to delete or ship the data to offline storage; they can keep it online indefinitely by adding commodity servers to meet storage and processing requirements. Hadoop becomes a low-cost alternative for meeting online archival requirements.
2) Feed the data warehouse.
Organizations can also use Hadoop to parse, integrate and aggregate large volumes of Web or other types of data and then ship it to the data warehouse, where both casual and power users can query and analyze the data using familiar BI tools. Here, Hadoop becomes an ETL tool for processing large volumes of Web data before it lands in the corporate data warehouse.
3) Support analytics.
The big data crowd (i.e., Internet developers) views Hadoop primarily as an analytical engine for running analytical computations against large volumes of data. To query Hadoop, analysts currently need to write programs in Java or other languages and understand MapReduce, a framework for writing distributed (or parallel) applications. The advantage here is that analysts aren’t restricted by SQL when formulating queries. SQL does not support many types of analytics, especially those that involve inter-row calculations, which are common in Web traffic analysis. The disadvantage is that Hadoop is batch-oriented and not conducive to iterative querying.
4) Run reports.
Hadoop’s batch-orientation, however, makes it suitable for executing regularly scheduled reports. Rather than running reports against summary data, organizations can now run them against raw data, guaranteeing the most accurate results.
There are several reasons you may want to do that
1. Cost per TB. The storage costs in Hadoop are much cheaper than Vertica/Netezza/greenplum and the like). You can get long-term retention in Hadoop and shorter term data in the analytics DB
2. Data ingestion capabilities in hadoop (performing transformations) is better in Hadoop
3. programatic analytics (libraries like Mahout ) so you can build advanced text analytics
4. dealing with unstructured data
The MPP dbs provide better performance in ad-hoc queries, better dealing with structured data and connectivity to traditional BI tools (OLAP and reporting) - so basically Hadoop complements the offering of these DBs
Hadoop is more of a platform than a DB.
Think of Hadoop as a neat filesystem that supports lots of queries over different of file types. With this in mind, most people dump raw data onto Hadoop and use it as a staging layer in the data pipeline, where it can chew the data and push it to other systems like vertica or any other. You have several advantages that can be resumed to decoupling.
So Hadoop is turning into the facto storage platform for big data. It is simple, fault-tolerant, scales well, and it is easy to feed and to get data out of it. So most vendors are trying to push a product to companies that probably have a Hadoop installation.
What makes the joint deployment so effective for this software ?
First, both platforms have a lot in common:
Purpose-built from scratch for Big Data transformation and analytics
Leverage MPP architecture to scale out with commodity hardware,
capable of managing TBs through PBs of data
Native HA support with low administration overhead
Hadoop is ideal for the initial exploratory data analysis, where the data is often available in HDFS and is schema-less, and batch jobs usually suffice, whereas Vertica is ideal for stylized, interactive analysis, where a known analytic method needs to be applied repeatedly to incoming batches of data.
By using Vertica’s Hadoop connector, users can easily move data between the two platforms. Also, a single analytic job can be decomposed into bits and pieces that leverage the execution power of both platforms; for instance, in a web analytics use case, the JSON data generated by web servers is initially dumped into HDFS. A map-reduce job is then invoked to convert such semi-structured data into relational tuples, with the results being loaded into Vertica for optimized storage and retrieval by subsequent analytic queries.
What are the Key differences that make Hadoop and Vertica complement each other when addressing Big Data.
Interface and extensibility
Hadoop
Hadoop’s map-reduce programming interface is designed for developers.The platform is acclaimed for its multi-language support as well as ready-made analytic library packages supplied by a strong community.
Vertica
Vertica’s interface complies with BI industry standards (SQL, ODBC, JDBC etc). This enables both technologists and business analysts to leverage Vertica in their analytic use cases. The SDK is an alternative to the map-reduce paradigm, and often delivers higher performance.
Tool chain/Eco system
Hadoop
Hadoop and HDFS integrate well with many other open source tools. Its integration with existing BI tools is emerging.
Vertica
Vertica integrates with the BI tools because of its standards compliant interface. Through Vertica’s Hadoop connector, data can be exchanged in parallel between Hadoop and Vertica.
Storage management
Hadoop
Hadoop replicates data 3 times by default for HA. It segments data across the machine cluster for loading balancing, but the data segmentation scheme is opaque to the end users and cannot be tweaked to optimize for the analytic jobs.
Vertica
Vertica’s columnar compression often achieves 10:1 in its compression ratio. A typical Vertica deployment replicates data once for HA, and both data replicas can attain different physical layout in order to optimize for a wider range of queries. Finally, Vertica segments data not only for load balancing, but for compression and query workload optimization as well.
Runtime optimization
Hadoop
Because the HDFS storage management does not sort or segment data in ways that optimize for an analytic job, at job runtime the input data often needs to be resegmented across the cluster and/or sorted, incurring a large amount of network and disk I/O.
Vertica
The data layout is often optimized for the target query workload during data loading, so that a minimal amount of I/O is incurred at query runtime. As a result, Vertica is designed for real-time analytics as opposed to batch oriented data processing.
Auto tuning
Hadoop
The map-reduce programs use procedural languages (Java, python, etc), which provide the developers fine-grained control of the analytic logic, but also requires that the developers optimize the jobs carefully in their programs.
Vertica
The Vertica Database Designer provides automatic performance tuning given an input workload. Queries are specified in the declarative SQL language, and are automatically optimized by the Vertica columnar optimizer.
I'm not a Hadoop user (just a Vertica user/DBA), but I would assume the answer would be something along these lines:
-You already have a setup using Hadoop and you want to add a "Big Data" database for intensive analytical analysis.
-You want to use Hadoop for non-analytical functions and processing and a database for analysis. But it is the same data, so no need for two feeds.
To expand slightly on Arnon's answer, Hadoop has been recognized as a force that is not going away and is gaining increasing traction in organizations, many times via grassroots efforts from developers. MPP databases are good at answering questions that we know about at design time such as "How many transactions do we get per hour by country?".
Hadoop started as a platform for a new type of developer that lives somewhere between analysts and developers, one who can write code but also understands data analysis and machine learning. MPP databases (column or not) are very poor at serving this type of developer who often is analyzing unstructured data, using algorithms that require too much CPU power to run in a database or datasets which are too large. The sheer amount of CPU power required to build some models makes running these algorithms in any sort of traditional sharded DB impossible.
My personal pipeline using hadoop typically looks like:
Run a number of very large global queries in Hadoop to get a basic feel for the data and the distribution of variables.
Use Hadoop to build a smaller dataset with just the data I am interested in.
Export the smaller dataset into a relational DB.
Run lots of small queries on the relational db, build excel sheets, sometimes do a little R.
Bear in mind that this workflow only works for the "analyst developer" or "data scientist". Others mileage will vary.
Coming back to your question due to people like me abandoning their tools these companies are looking for ways to remain relevant in an age where Hadoop is synonymous with big data, the coolest startups and cutting edge technology (whether this is earned or not you may discuss amongst yourselves.) Also many Hadoop installations are an order of magnitude or more larger than an organizations MPP deployments, meaning more data is being retained for longer in Hadoop.
Massive parallel database like Greenplum DB are excellent for handling massive amounts of structured data. Hadoop is excellent at handling even more massive amounts of unstructured data, e.g. websites.
Nowadays, a ton of interesting analytics combines these both types of data to gain insight. Therefore it is important for these database systems to be able to integrate with Hadoop.
For example you could do text processing on the Hadoop Cluster using MapReduce until you have some scoring value per product or something. This scoring value then could be used by the database to combine it with other data that is already stored in the database or data that has been loaded into the database from other sources.
Unstructured data, by their nature, is not suitable for loading into your traditional data warehouse. Hadoop mapreduce jobs can extract structures out of your log files (ex) and then the same can then be ported into your DW for analytics. Hadoop is batch processing, therefore is not suitable for analytic query processing. So you can process your data using hadoop to bring some structure, and then make it query ready via your visualization/sql layer.
What is the point in feeding an Hadoop cluster and using that cluster to feed data into a Vertica/InfoBright datawarehouse ?
The point is you would not want your users to fire up a query and wait for minutes, sometimes hours before you come back with an answer. Hadoop cannot provide you with a real time query response. Although this is changing with the advent of Cloudera's Impala and Hortonworks's Stinger. These are real-time data processing engines over Hadoop.
Hadoop's underlying data system, HDFS, allows chunking up your data and distributing it over the nodes in your cluster. In fact, HDFS can also be replaced with a 3rd party data storage like S3. Point is: Hadoop provides both -> storage + processing. So you are welcome to use hadoop as storage engine and extract the data into your data warehouse when needed. You can also use Hadoop to create cubes and marts and store these marts in the warehouse.
However, with the advent of Stinger and Impala, the strength of these claims will eventually be erased. So keep an eye out.
I am considering various technologies for data warehousing and business intelligence, and have come upon this radical tool called Hadoop. Hadoop doesn't seem to be exactly built for BI purposes, but there are references of it having potential in this field. ( http://www.infoworld.com/d/data-explosion/hadoop-pitched-business-intelligence-488).
However little information I have got from the internet, my gut tells me that hadoop can become a disruptive technology in the space of traditional BI solutions. There really is sparse information regarding this topic, and hence I wanted to gather all the Guru's thoughts here on the potential of Hadoop as a BI tool as compared to traditional backend BI infrastructure like Oracle Exadata, vertica etc. For starters, I would like to ask the following question -
Design Considerations - How would designing a BI solution with Hadoop be different from traditional tools? I know it should be different, as I read one cannot create schemas in Hadoop. I also read that a major advantage will be the complete elimination of ETL tools for Hadoop (is this true?) Do we need Hadoop + pig + mahout to get a BI solution??
Thanks & Regards!
Edit - Breaking down into multiple questions. Will start with the one i think most imp.
Hadoop is a great tool to be part of a BI solution. It is not, itself, a BI solution. What Hadoop does is takes in Data_A and outputs Data_B. Whatever is needed for Bi but is not in a useful form can be processed using MapReduce and output a useful form of the data. Be it CSV, HIVE, HBase, MSSQL or anything else used to view data.
I believe Hadoop is supposed to be the ETL tool. That's what we are using it for. We process gigs of log files every hour and store it in Hive and do daily aggregations that are loading into a MSSQL server and viewed through a visualization layer.
The major design considerations I've run against are:
- Data Flexibility: Do you want your users to view pre-aggregated data or have the flexibility to adjust the query and look at the data how they want
- Speed: How long do you want your users to wait for the data? Hive (for example) is slow. It takes minutes to generate results, even on fairly small data sets. The larger the data traversed the longer it will take to generate a result.
- Visualization: What type of visualization do you want to use? Do you want to custom build a lot of pieces or be able to use something off the shelf? What restraints and flexibility are needed for your visualization? How flexible and changeable does the visualization need to be?
hth
Update: As a response to #Bhat's comment asking about lack of visualization...
The lack of a visualization tool that would allow us to effectively utilize the data stored in HBase was a major factor in re-evaluating our solution. We stored the raw data in Hive, and pre-aggregated the data and stored it HBase. To utilize this we were going to have to write a custom connector (did this part) and visualization layer. We looked at what we would be able to produce and what is commercially available, and went the commercial route.
We still use Hadoop as our ETL tool for processing our weblogs, it's fantastic for that. We just send the ETL'd raw data to a commercial big data database that will take the place of both Hive and HBase in our design.
Hadoop doesn't really compare to MSSQL or other data warehouse storage. Hadoop doesn't do any storage (ignoring the HDFS), it does processing of data. Running MapReduces (which Hive does) is going to be slower than MSSQL (or such).
Hadoop is very well suited for storing colossal files that can represent fact tables. These tables can be partitioned by placing individual files representing the table into separate directories. Hive understands such file structures and allows to query them like partitioned tables. You can phrase your BI questions to the Hadoop data in the form of SQL queries via Hive, but you will still need to write and run an occasional MapReduce job.
From business perspective, you should consider Hadoop if you have a lot of low-value data. There are many cases when RDBMS / MPP solutions are not cost effective.
You also should consider Hadoop as a serious option if your data is not structured (HTMLs for example).
We are creating a comparison matrix for BI tools for Big Data / Hadoop
http://hadoopilluminated.com/hadoop_book/BI_Tools_For_Hadoop.html
It is work in progress and would love any input.
(disclaimer : I am the author of this online book)