What is it exactly? - hadoop

Why is Apache Hadoop defined as a Platform as a Service in figure 1 of this link, http://www.ibm.com/developerworks/aix/library/au-cloud_apache/#figure2, while http://nosql-databases.org defines it as a NoSQL wide column store database?
I mean, when working with Hadoop, do I need a database too?
Thanks in advance.

Hadoop is basically a collection of Java software that fundamentally provides two things:
A distributed file system implementation.
A framework for writing and running MapReduce jobs written in Java.
Many things are built on top of these two pieces (like HBase, which is probably the columnar datastore you have read about).
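To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the job. It is plain Python rather than the Hadoop Java API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. In a real cluster the framework performs the shuffle and spreads the work across machines.

```python
from collections import defaultdict


def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group emitted values by key. In Hadoop the framework
    # does this between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reducer: fold all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop runs jobs"]))
```

The mapper and reducer never share state. That independence is what allows a framework to run many copies of each in parallel.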
A good resource for learning more about Hadoop is the Apache project page documentation. If that looks confusing, there is also a book called 'Hadoop: The Definitive Guide', which is pretty good reading.
If you want to read about how it all began, I'd recommend reading this Google paper upon which Hadoop is based.
Hope that helps.

Related

Difference between Apache NiFi and StreamSets

I am planning a class project and have been looking at technologies that let me automate or set up the flow of data between systems. I found a couple of them, namely Apache NiFi and StreamSets (to my knowledge). What I couldn't understand is the difference between them and the use cases where each can be used. I am new to this, so if anyone could explain a bit, it would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of StreamSets is relatively limited, so I'll let them speak for it, as they have.
The key thing to understand is that NiFi was built to do one really important thing really well, and that is 'Dataflow Management'. Its design is based on a concept called Flow Based Programming, which you may want to read about and reference for your project: https://en.wikipedia.org/wiki/Flow-based_programming
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you to do just that: as the data is flowing you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, read helpful in-line documentation, and more. Almost all other systems by comparison have a model that is design-and-deploy oriented, meaning you make a series of changes and then deploy them. That model is fine and can be intuitive, but for the dataflow management job it means you don't get the interactive change-by-change feedback that is so vital to quickly build new flows or to safely and efficiently correct or improve handling of existing data streams.
2) Data Provenance
A very unique capability of NiFi is its ability to generate fine-grained and powerful traceability details for where your data comes from, what is done to it, where it's sent, and when it is done in the flow. This is essential to effective dataflow management for a number of reasons, but for someone in the early exploration phases and working a project, the most important thing this gives you is awesome debugging flexibility. You can set up your flows and let things run and then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected, you can fix the flow and replay the object, then repeat. Really helpful.
3) Purpose built data repositories
NiFi's out of the box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design which gives us the high performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write ahead log implementation and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes) or we can transform data by simply reading from the original and writing out a new version. Again very efficient. Couple that with the provenance stuff I mentioned a moment ago and it just provides a really powerful platform. Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like size of data involved. The NiFi API was built to honor that fact and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written so again you get really helpful information in establishing and observing your flows. This design also means NiFi can support back-pressure and pressure-release naturally and these are really critical features for a dataflow management system.
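The "copy by pointer" and "transform by writing a new version" ideas can be sketched in a few lines. This is not NiFi's implementation or API. `ContentStore`, `FlowFile`, `clone`, and `transform` are hypothetical names, and the store is a simple in-memory list standing in for the content repository.

```python
class ContentStore:
    """Append-only store: content is written once and never modified."""

    def __init__(self):
        self._blobs = []

    def write(self, data):
        # Store an immutable copy and return a pointer (claim) to it.
        self._blobs.append(bytes(data))
        return len(self._blobs) - 1

    def read(self, claim):
        return self._blobs[claim]

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs)


class FlowFile:
    """Lightweight handle: attributes plus a pointer to the content."""

    def __init__(self, claim, attributes=None):
        self.claim = claim
        self.attributes = dict(attributes or {})


def clone(flowfile):
    # "Copying" only duplicates the pointer and attributes, not the bytes.
    return FlowFile(flowfile.claim, flowfile.attributes)


def transform(store, flowfile, fn):
    # Read the original content and write the result as a new version;
    # the original stays untouched for any other FlowFile pointing at it.
    new_claim = store.write(fn(store.read(flowfile.claim)))
    return FlowFile(new_claim, flowfile.attributes)


if __name__ == "__main__":
    store = ContentStore()
    original = FlowFile(store.write(b"hello"), {"source": "demo"})
    copy = clone(original)
    print(store.stored_bytes())          # still only the original bytes
    upper = transform(store, original, bytes.upper)
    print(store.read(copy.claim), store.read(upper.claim))
```

A processor that only routes or tags data works on the `FlowFile` handle alone, which is why it need not touch the content at all.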
It was mentioned previously by the folks from the StreamSets company that NiFi is file oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms, but the reality is that when data is in the flow, it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high-speed tiny things or you have large things, and whether they came from a live audio stream off the Internet or from a file sitting on your hard drive, it doesn't matter. Once it is in the flow it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the Streamsets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally to some special NiFi format nor do we have to reconvert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that because what this means is that even the most trivial of cases would have problematic performance implications and luckily NiFi does not have that problem. Further had we gone that route then it would mean handling diverse datasets like media (images, video, audio, and more) would be difficult but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to check out:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions the users@nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant called Hortonworks DataFlow (HDF).
While the two have a lot of similarities, such as a web-based UI and a focus on ingesting data, there are a few key differences. Both consist of processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC), however, takes a record-based approach. What this means is that as data enters your pipeline (whether it's JSON, CSV, etc.) it is parsed into a common format, so that the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
SDC also runs standalone and in a clustered mode, but its clustered mode runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NiFi (HDP) is more mature and StreamSets is more lightweight.
Both are easy to use, both have strong capability. And StreamSets could easily
Both have companies behind them: Hortonworks and Cloudera.
There are more contributors working on NiFi than on StreamSets, and NiFi has more enterprise deployments in production.
Two of the key differentiators between the two, IMHO, are:
Apache NiFi is a Top Level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process, which ensures software quality. StreamSets is Apache LICENSED, meaning anyone can reuse the code, etc., but the project is not managed as an Apache project. In fact, in order to even contribute to StreamSets, you are REQUIRED to sign a contract: https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer: https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have," which imposes a bit of a restriction if you want to deploy your dataflows further toward the Edge, where the devices that are generating the data live. Apache MiNiFi, a sub-project of NiFi, can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

Hbase vs Cassandra: Which is better for a timeseries data storage?

I use my API logs to extract information like:
In this period of time, how many users does my API have?
Or in this period of time, what types of services are called the most?
Almost all the information I extract depends on the timestamp. Currently I use MongoDB and I added the timestamp as an index (for 80GB of data, the index size is 12GB).
A migration to Cassandra or HBase was recommended to me, and I want to know which is better for my use case:
Analysis for timeseries data.
Both good write and read performance are required.
Possibility of using Hadoop to do my data analysis.
Thanks for sharing your point of view or your experience.
Advantages of Cassandra:
Cassandra generally shows better performance (though both are excellent).
Cassandra is substantially easier to set up and manage from an operational standpoint (though there are tools that will help either way).
Advantages of HBase:
Native to the Hadoop ecosystem
HBase will require you to install Hadoop anyway, so you get a nice two-for-one. To use Cassandra you will probably need to either use DataStax Enterprise, a commercial, non-open source product, OR investigate using Spark for your analytics work, which has an open-source connector for Cassandra.
Chocolate or Vanilla ice cream - which is better?
I would suggest that you would be the best decision maker. Set up development environments for each option, and this will tell you much more about operational and tuning issues than, I think, anyone else might be able to give you. :)

Enterprise Data warehouse with NOSQL /Hadoop - "NO RDBMS"

Are there any EDW (enterprise data warehouse) systems designed using a NoSQL/Hadoop solution?
I know there are PDW systems (MS PDW PolyBase, Greenplum HAWQ, etc.) which connect to HDFS sub-systems. These are proprietary hardware and software solutions and are expensive at scale. I am looking for an enterprise data warehouse solution built on NoSQL or Hadoop, preferably open source. I would like to hear about your experiences if you have implemented one. Just to mention again, I am not looking for any type of proprietary RDBMS as a player in this EDW solution.
I did some research on the internet. It seems possible (Impala is one option), but I did not see anyone who has really implemented an EDW completely with NoSQL or Hadoop.
If you have done something of this type, I would like to hear how you designed it and which tools your business analysts use. If you can share your experience along the journey, that would be really appreciated.
Updating....
How about VoltDB and NEOdb, which are not true RDBMSs but claim to support ANSI SQL to a great extent?
The first problem you will face when building an EDW on top of Hadoop is that its storage is not updatable, so you should forget about the SQL UPDATE and DELETE commands.
Second, a solution built on top of Hadoop is usually several times more expensive to maintain. The specialists are more expensive and the debugging is more complex (compare debugging a problem in a Hive query with debugging a SQL query in Oracle: which would be easier?).
Third, Hadoop usually gives you much less concurrency and much higher latency for any type of workload you put on top of it.
Given all of this, why do you think a DWH is built on top of Hadoop only at really big enterprises like Facebook, Yahoo, eBay, LinkedIn and so on? Because it is not that simple to do, although once implemented it can be more scalable and more customizable than any proprietary solution.
So if you have firmly decided to go with Hadoop or any other NoSQL solution to build your DWH, I would recommend the following:
Use Hadoop HDFS as a base for data storage
Use Flume for data loading into the HDFS
Use Hive with Tez for heavy ETL jobs
Provide Impala as a SQL query interface for analysts
Provide Spark as an advanced instrument for analysts
Use Ambari for management and provisioning of all the tools together
Together, these tools will cover most of your needs.

Example application using HDFS+Map Reduce

I have an academic course, "Middleware", which covers different aspects of distributed software systems, including an introduction to topics like distributed file systems. It also involves an introduction to HBase, Hadoop, MapReduce, HiveQL, and Pig Latin.
I want to know whether I can do a small project that tries to integrate the above technologies. For starters, I am aware of the VM provided by Cloudera for getting a feel for Hadoop and playing around using Eclipse.
I was thinking along the lines of implementing an application which accepts a stream of events as input, analyses it, and gives an output.
I have both Windows and Linux on my machine, with an i7 processor and 4GB of RAM.
Please let me know how to get started with everything. Any suggestions for a simple example application are welcome.
Here is a blog post on analyzing Tweets using Hive/HDFS. And here is a blog post on performing Clickstream analytics using Pig and Hive.
Check some of the Big Data use cases here and try to solve an interesting problem.

How to use Hadoop for a web application?

I am working on a social networking web application which uses the Apache web server and a MySQL database server with the CodeIgniter MVC framework. I don't know how to integrate Hadoop into this application or how to write a MapReduce program.
Hadoop and MapReduce have no direct relationship to web applications. You should not integrate Hadoop into a web application as long as you understand a web application as something that responds (quickly) to user input (web requests).
Hadoop and MapReduce are very useful for algorithms that run on large datasets in order to transform those datasets or extract data/knowledge from them.
While it is true that Hadoop is nowadays mostly used for "offline analytics", it can be useful to web projects as well. For example, to pre-compute recommendations or suggestions that are then provided to the users of a website.
Another use case is running ETL over multiple sources of data to produce an inverted index for a website (for example, jobs/cars/rentals-like websites with huge amounts of input data).
Think of Hadoop when you have a "Big Data" problem, not when your website is managing small amounts of data.
Using Hadoop to tackle this sort of problem has some advantages and disadvantages. The obvious advantage is that it makes any sort of batch process (like the examples I mentioned) scale transparently. The disadvantage is that it isn't real-time: you can't use Hadoop to update your website every 5 seconds.
I think Hadoop can have two "classic" usages for the social network style of applications.
The first is using HBase to store messaging and other dynamic information. Storing user profiles in HBase can also be considered, in order to completely replace MySQL with this kind of NoSQL solution.
The second is using Hadoop MapReduce to analyse your network. A good example of such analysis is generating friend suggestions.
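As a rough illustration of the friend-suggestion analysis, here is a single-process sketch of a "friends of friends" job written in the map/reduce style. It is plain Python, not Hadoop code. The function names and the sample graph are made up, and ranking by number of mutual friends is one assumed scoring rule among many possible ones.

```python
from collections import Counter, defaultdict
from itertools import permutations


def map_friends(user, friends):
    # Direct friendships are emitted with a marker so that the reducer
    # can exclude people who are already friends.
    for friend in friends:
        yield user, (friend, True)
    # Any two friends of `user` share `user` as a mutual friend.
    for a, b in permutations(friends, 2):
        yield a, (b, False)


def reduce_suggestions(user, values):
    # `values` is a list, so it can safely be scanned twice.
    already = {other for other, direct in values if direct}
    mutual = Counter(
        other for other, direct in values
        if not direct and other not in already
    )
    # Candidates ordered by how many mutual friends they share.
    return user, mutual.most_common()


def suggest(graph):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for user, friends in graph.items():
        for key, value in map_friends(user, friends):
            grouped[key].append(value)
    return dict(reduce_suggestions(u, v) for u, v in grouped.items())


if __name__ == "__main__":
    graph = {
        "ann": ["bob", "cat"],
        "bob": ["ann", "dan"],
        "cat": ["ann", "dan"],
        "dan": ["bob", "cat"],
    }
    print(suggest(graph)["ann"])
```

On a real network the results would typically be written back to a store the web application can query quickly, rather than being computed per request.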
Yes, it is possible to build a web application using Apache Hadoop as a back-end.
You can create a web application using Apache Hive and Pig, and you can write custom mappers and reducers and use them as UDFs, but in my personal experience it is slow. If you have very little data, it is better to use another database and do the analytics there. I prefer Spark for better response time.
Use Hadoop to analyse your data and load the results into your MySQL database. Then use those results in your web application.
In your web application you can get required data from Hadoop (like job results) using REST services: https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html
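A minimal sketch of calling those REST services from Python, using only the standard library. It assumes the ResourceManager's `/ws/v1/cluster/apps` endpoint and its `{"apps": {"app": [...]}}` JSON response shape. The host name `rm.example.com:8088` is a placeholder, and the helper names are made up. Check the documentation for your Hadoop version before relying on specific fields or query parameters.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_address, states=None):
    # Build the URL of the ResourceManager "cluster applications" resource.
    url = f"http://{rm_address}/ws/v1/cluster/apps"
    if states:
        url += "?" + urlencode({"states": ",".join(states)})
    return url


def parse_apps(payload):
    # "apps" may be null when nothing matches, so fall back to an empty list.
    apps = (json.loads(payload).get("apps") or {}).get("app") or []
    return [(app["id"], app["name"], app["finalStatus"]) for app in apps]


def fetch_apps(rm_address, states=None):
    # Performs the actual HTTP GET; needs a reachable ResourceManager.
    with urlopen(apps_url(rm_address, states)) as response:
        return parse_apps(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder address: replace with your own ResourceManager host:port.
    for app_id, name, status in fetch_apps("rm.example.com:8088", ["FINISHED"]):
        print(app_id, name, status)
```

Keeping URL building and response parsing separate from the network call makes the parsing easy to test against a saved response.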

Resources