How to use Hadoop for a web application?

I am working on a social networking web application that uses the Apache web server and a MySQL database with the CodeIgniter MVC framework. I don't know how to integrate Hadoop into this application, or how to write a MapReduce program.

Hadoop and MapReduce have no direct relationship to web applications. You should not integrate Hadoop into a web application, if by "web application" you mean something that responds (quickly) to user input (web requests).
Hadoop and MapReduce are very useful for algorithms that run over large datasets in order to transform them or extract data/knowledge from them.

While it is true that Hadoop is nowadays mostly used for "offline analytics", it can be useful to web projects as well, for example to pre-compute recommendations or suggestions that are then served to the users of a website.
Another use case is ETL from multiple data sources to produce an inverted index for a website (for example, jobs/cars/rentals-style websites with huge amounts of input data).
Always think of Hadoop when you have a "Big Data" problem, not when your website manages small amounts of data.
Using Hadoop to tackle this sort of problem has advantages and disadvantages. The obvious advantage is that it makes any sort of batch process (like the examples I mentioned) scale transparently. The disadvantage is that it isn't real-time: you can't use Hadoop to update your website every 5 seconds.

I think Hadoop has two "classic" usages for social-network-style applications.
The first is using HBase to store messaging and other dynamic information. Storing user profiles in HBase can also be considered, in order to completely replace MySQL with this kind of NoSQL solution.
The second is using Hadoop MapReduce to analyze your network. A good example of such analysis is computing friend suggestions.
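To make that concrete, here is a minimal sketch of a "people you may know" MapReduce job in Java. Everything in it is illustrative rather than prescribed by Hadoop: the input format (one line per user, user TAB comma-separated friends) and all class names are assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FriendSuggestions {

  public static class PairMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length != 2) return;
      String user = parts[0];
      String[] friends = parts[1].split(",");
      // Mark existing friendships so the reducer can exclude them.
      for (String f : friends) {
        ctx.write(new Text(pairKey(user, f)), new Text("DIRECT"));
      }
      // Every pair among this user's friends shares "user" as a mutual friend.
      for (int i = 0; i < friends.length; i++) {
        for (int j = i + 1; j < friends.length; j++) {
          ctx.write(new Text(pairKey(friends[i], friends[j])), new Text("MUTUAL"));
        }
      }
    }

    private static String pairKey(String a, String b) {
      return a.compareTo(b) < 0 ? a + "," + b : b + "," + a;
    }
  }

  public static class SuggestReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      int mutuals = 0;
      boolean alreadyFriends = false;
      for (Text v : values) {
        if ("DIRECT".equals(v.toString())) alreadyFriends = true;
        else mutuals++;
      }
      // Suggest only pairs that are not friends yet, ranked by mutual count.
      if (!alreadyFriends && mutuals > 0) {
        ctx.write(pair, new IntWritable(mutuals));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "friend suggestions");
    job.setJarByClass(FriendSuggestions.class);
    job.setMapperClass(PairMapper.class);
    job.setReducerClass(SuggestReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would run as a scheduled batch (hadoop jar suggestions.jar FriendSuggestions /input /output), with the ranked pairs then loaded wherever the website can read them.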

Yes, it is possible to build a web application using Apache Hadoop as a back-end.
You can create a web application using Apache Hive and Pig, writing custom mappers and reducers and using them as UDFs. But in my personal experience it is slow. If you have very little data, it is better to use another database and do the analytics there. I'd suggest Spark as the solution for better response times.

Use Hadoop to analyse your data and load the results into your MySQL database. Then use that database with your web application.
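As an illustrative sketch of that pattern (in practice, Apache Sqoop is the usual tool for exporting Hadoop output to an RDBMS), the loader below reads a reducer output file like the one the friend-suggestion example above would produce and batch-inserts it into MySQL. The table name, file name, and credentials are all made up:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadResultsIntoMysql {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost:3306/app", "user", "secret");
         PreparedStatement stmt = conn.prepareStatement(
             "REPLACE INTO friend_suggestions (user_a, user_b, mutuals) VALUES (?, ?, ?)");
         BufferedReader reader = new BufferedReader(new FileReader("part-r-00000"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Each line is "userA,userB TAB mutual-count", as emitted by the job above.
        String[] fields = line.split("\t");
        String[] pair = fields[0].split(",");
        stmt.setString(1, pair[0]);
        stmt.setString(2, pair[1]);
        stmt.setInt(3, Integer.parseInt(fields[1]));
        stmt.addBatch();
      }
      stmt.executeBatch(); // one round trip for the whole batch
    }
  }
}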

In your web application you can get the required data from Hadoop (like job results) using REST services: https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html
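As a hedged illustration, the snippet below calls the ResourceManager's documented /ws/v1/cluster/apps endpoint to list applications and their states; the host name is a placeholder, and 8088 is only the default web port.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class YarnAppsClient {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://resourcemanager-host:8088/ws/v1/cluster/apps");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json");

    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      StringBuilder json = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        json.append(line);
      }
      // The response lists submitted applications; parse it with any JSON
      // library before presenting results to users.
      System.out.println(json);
    }
  }
}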

Related

Relevance of Hadoop & Streaming solutions when Spark exists

I am starting a big data initiative for my startup. In 2018, is there any reason to use Hadoop at all, since Spark is touted to be way faster because, unlike Hadoop's MapReduce, it mostly avoids writing intermediate data to disk?
I realize Spark has a higher need for RAM, but wouldn't that be just a one-time CAPEX cost that would pay for itself?
In general, unless there are legacy projects, why should one pick Hadoop at all since Spark is available?
Would appreciate real-world comparisons of the two, gotchas, etc.
Alternatively, are there use cases that Hadoop can solve but Spark cannot?
---- comment below for the actual problem ----
I would use YARN as the resource manager, with HDFS as the file system for Spark.
Also realize that Spark intersects quite a bit with the Hadoop ecosystem.
The comparisons are:
MapReduce vs Spark code
SparkSQL vs Hive
People mention Pig too, but not a whole lot of people want to learn its custom query language. And if I had to use Pig as a data scientist, why wouldn't I use, say, Apache NiFi with Hadoop?
Also not sure how Spark handles the following:
If data does not fit in RAM then what? Back to a disk-based paradigm (not talking about streaming use cases here), so no better than MapReduce? How does Tez make MR2 better?
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For example, streaming alone has:
Spark Streaming
Apache Storm
Apache Samza
Kafka Streams
Commercial CEP tools (Oracle CEP, TIBCO, etc.)
A lot of them use DAGs similar to Spark's core engine, so it's hard to pick one over the other.
Use case:
The app sends data to middleware until the end of an event. An event can end based on periodicity or when a business condition is met.
The middleware must show the real-time sum of a value (simplifying here) sent by users from their app instances. It is accepted that the middleware shows a floor of the actual sum and that the real value can be higher. I plan to use Kafka Streams here, with a consumer that adds all the inputs with minimal latency; the consumer posts to a cache that is polled by the apps to show the current additive value.
The middleware logs all input.
After the event ends, a big-data job scans through the log data and database records to get an accurate count, comparing all DB values and log entries (an audit) with the Kafka-shown value. The value calculated by this scheme is the final value.
Design choices:
I like Kafka because it decouples the application from the middleware and provides low-latency, high-throughput messaging. Streams code is easy to write. Happy for someone to counter-argue for Spark Streaming, Apache Storm, or Apache Samza instead.
The application itself is Java code on a Tomcat server with REST endpoints for iOS/Android clients. Not doing client-side caching, due to the explicit liveness requirement on the additive value.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3, many other cloud storages, or bare-metal data stores like Ceph or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark standalone mode with any file system not mounted on the Spark executors, then all your data requests will need to be pulled over a network connection, saturating the network and causing a bottleneck, regardless of memory. Compare that to Spark executors running on YARN NodeManagers: HDFS DataNodes are ideally also NodeManagers.
A similar problem: people say Hive is slow and SparkSQL is faster. Well, that's true if you run Hive with the MapReduce execution engine instead of Tez or Spark.
Now, if you want streaming and real-time events rather than the batch world commonly associated with Hadoop, you might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real-time than Spark micro batches. And it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to some big-data architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka; I haven't seen them talk about Samza much. Hortonworks will support Storm, NiFi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. StreamSets is a similar company offering a tool that competes with NiFi, staffed by people with backgrounds in other batch/streaming Apache projects.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer-friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily use Kafka Streams instead. And Kafka Streams isn't DAG-based; it's just a high-level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has parameters you can configure if you don't want disk to be touched. In that case, your jobs die of OOM more quickly, obviously.
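For illustration, that choice looks like this with Spark's Java API (the input path is a placeholder); note that shuffle data spills to disk regardless of the cache level you pick.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class CacheModes {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("cache-modes"));
    JavaRDD<String> lines = sc.textFile("hdfs:///data/big-input");

    // MEMORY_AND_DISK: cached partitions that don't fit in memory spill to disk.
    lines.persist(StorageLevel.MEMORY_AND_DISK());

    // MEMORY_ONLY (the RDD-cache default): partitions that don't fit are
    // dropped and recomputed on use, so memory pressure surfaces sooner.
    // lines.persist(StorageLevel.MEMORY_ONLY());

    System.out.println(lines.count());
    sc.stop();
  }
}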
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem; we already covered this. Erasure coding is primarily for data at rest, not data being processed. I actually don't know if Spark supports Hadoop 3 yet.
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) you are using Java already, and 2) it's a standalone thread in your code that lets you read from and publish to Kafka without Hadoop/YARN or Spark clusters. It's not clear what your question has to do with Hadoop given the client-server architecture you listed, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
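As a minimal sketch of that approach, the topology below keeps a running sum per event key, which is exactly the "floor of the actual sum" behaviour the question accepts. The topic names and serdes are assumptions for illustration; the cache polled by the apps would consume the output topic.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class RunningTotal {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "running-total");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, Long> values =
        builder.stream("app-values", Consumed.with(Serdes.String(), Serdes.Long()));

    // Sum all values per event key; this is the low-latency "floor" of the
    // true total, reconciled later by the batch audit described above.
    values
        .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
        .reduce(Long::sum)
        .toStream()
        .to("running-totals", Produced.with(Serdes.String(), Serdes.Long()));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}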
You could also use NiFi as your mobile REST API: just use ListenHTTP and send requests to it, then route flows based on attributes in the data. Then manipulate the data and publish it to Kafka as well as other systems.
Spark and Hadoop work in a pretty similar way when it comes to solving MapReduce-style problems.
Hadoop is quite relevant from an HDFS point of view: HDFS is a well-known solution for big-data storage. But your question is about MapReduce.
Spark is the best option if you have good machines with a really good configuration of memory and network throughput. But we know that kind of machine is expensive, and sometimes your best option is to use Hadoop to process your data. Spark is great and fast, but you can run into memory issues if you don't have a good cluster and try to fit too much data in memory. In that case Hadoop can be better. This problem becomes less relevant year after year, though.
So Hadoop is here to complement Spark. Hadoop is not only MapReduce; Hadoop is an ecosystem. Spark doesn't have a distributed file system, and for Spark to work well you need one. Spark also doesn't have a resource manager; Hadoop has one, called YARN, and Spark in cluster mode needs a resource manager.
Conclusion
Hadoop is still relevant as an ecosystem, but MapReduce on its own is hardly used anymore.

Apache NiFi for ETL

How effective is it to use Apache NiFi for an ETL process with HDFS as the source and an Oracle DB as the destination? What are the limitations of Apache NiFi compared to other ETL tools such as Pentaho, DataStage, etc.?
Main advantages of NiFi
The main advantages of NiFi:
An intuitive GUI, which allows for easy inspection of the data
Strong delivery guarantees
Low latency; you can support both batch and streaming use cases
It can handle any format; it is not limited to SQL tables and can also move log files, etc.
Schema-aware, and can share schemas with solutions like Kafka, Flink, and Spark
Main limitation of NiFi
NiFi is really a tool for moving data around. You can do enrichments of individual records, but it is typically said to do "EtL" with a small "t". A typical thing that you would not want to do in NiFi is joining two dynamic data sources.
For joining tables, tools like Spark, Hive, or classical ETL alternatives are often used (see the Spark sketch below).
For joining streams, tools like Flink and Spark Streaming are often used.
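For example, here is a hedged sketch of the kind of two-source join you would hand to Spark rather than NiFi; the paths and the join column are made up for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JoinJob {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("join-example").getOrCreate();

    // Two dynamic data sources, e.g. landed in HDFS by NiFi.
    Dataset<Row> orders = spark.read().parquet("hdfs:///data/orders");
    Dataset<Row> customers = spark.read().parquet("hdfs:///data/customers");

    // The join itself: one line in Spark, awkward to express as a NiFi flow.
    Dataset<Row> enriched = orders.join(customers, "customer_id");
    enriched.write().mode("overwrite").parquet("hdfs:///data/orders_enriched");

    spark.stop();
  }
}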
Conclusion
NiFi is a great tool; you just need to make sure you use it for the right use case. Where needed, you can complement it with other tools.
Extra strong full disclosure: I am an employee of Cloudera, the company that supports NiFi and other projects such as Spark and Flink. I have used other ETL tools before, but not to the same extent as NiFi.
I'm not sure about Sqoop, but I can explain the benefits of using Apache NiFi. In your case the data in HDFS could be in any format (unstructured); NiFi has the capability to process it and bring it into a format of your choice, so that you can save it directly to any RDBMS.
NiFi handles back-pressure in a very effective way, enabling lossless transmission.
One of the critical features that NiFi provides that our competitors generally don't is the ability to stop jobs and examine the flow and downstream systems while it's running. For you, this means you can test the flow against a test HDFS folder and a test Oracle DB, let some data go through, pause the flow and poke around Oracle to make sure it's to your liking after a matter of seconds or minutes instead of waiting for a "job to complete." It makes the process extremely agile.
Actually, NiFi is a very good tool. You can easily manipulate processors, and you can migrate huge amounts of data in a short time.
But for destinations such as an RDBMS, there are always problems. I used to have a lot of problems with threads that wouldn't die; you have to be very careful about stopping processes and about processor configuration. Some processors, like QueryDatabaseTable, consume huge amounts of memory, and the server goes down.

HBase vs Cassandra: Which is better for time-series data storage?

I use my API logs to extract information like:
In this period of time, how many users did my API have?
Or, in this period of time, which types of services were called the most?
Almost all the information I extract depends on the timestamp. I currently use MongoDB, and I added the timestamp as an index (for 80 GB of data, the indexes take 12 GB).
A migration to Cassandra or HBase was recommended to me, and I want to know which is better for my use case:
Analysis of time-series data.
Both good write and good read performance are required.
The possibility of using Hadoop to do my data analysis.
Thanks for sharing your point of view or your experience.
Advantages of Cassandra:
Cassandra generally shows better performance (though both are excellent).
Cassandra is substantially easier to set up and manage from an operational standpoint (though there are tools that help either way).
Advantages of HBase:
Native to the Hadoop ecosystem.
HBase will require you to install Hadoop anyway, so you get a nice two-for-one. To use Cassandra you will probably need to go with DataStax Enterprise, a commercial, non-open-source product, OR investigate using Spark for your analytics work, which has an open-source connector for Cassandra.
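Whichever store you pick, the time-series data model matters at least as much as the engine. For Cassandra, a common pattern is to partition by source and time bucket; below is a hedged sketch using the DataStax Java driver, with the keyspace, table, and column names made up for illustration.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateLogTable {
  public static void main(String[] args) {
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
         Session session = cluster.connect()) {
      session.execute(
          "CREATE KEYSPACE IF NOT EXISTS logs WITH replication ="
              + " {'class': 'SimpleStrategy', 'replication_factor': 1}");
      // Partition by (service, day) so a time-range query for one day hits a
      // single partition; cluster by timestamp so rows come back newest first.
      session.execute(
          "CREATE TABLE IF NOT EXISTS logs.api_calls ("
              + " service text, day date, ts timestamp, user_id text,"
              + " PRIMARY KEY ((service, day), ts))"
              + " WITH CLUSTERING ORDER BY (ts DESC)");
    }
  }
}

With such a model, "how many users in this period" becomes a scan over a handful of (service, day) partitions instead of a secondary index over the whole dataset.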
Chocolate or Vanilla ice cream - which is better?
I would suggest that you would be the best decision maker. Set up development environments for each option, and this will tell you much more about operational and tuning issues than, I think, anyone else might be able to give you. :)

What Cassandra client should I use for Hadoop integration?

I am trying to build a data services layer using Cassandra as the backend store. I am new to Cassandra and not sure which client to use: Thrift or CQL 3? We have a lot of MapReduce jobs using Amazon Elastic MapReduce (EMR) that will be reading/writing data from Cassandra at high volume. The total data volume will be > 100 TB, with billions of rows in Cassandra. The MapReduce jobs may be read- or write-heavy, with high QPS (> 1000 QPS). The requirements are as follows:
Simplicity of client code. It seems Thrift has built-in integration with Hadoop for bulk data loading using sstableloader (http://www.datastax.com/dev/blog/bulk-loading).
The ability to define new columns at run time. We may need to add more columns depending on application requirements. It seems CQL 3 does not allow columns to be defined dynamically at runtime.
Performance of bulk reads/writes. I'm not sure which client is better. However, I found this post claiming the Thrift client has better performance for high data volumes: http://jira.pentaho.com/browse/PDI-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
I could not find any authoritative source of information that answers this question. Appreciate if you could help with this since I am sure this is a common problem for most folks and would benefit the overall community.
Many thanks in advance.
-Prateek
Hadoop and Cassandra are both written in Java, so definitely pick a Java-based driver. As far as simplicity of code goes, I'd go for Astyanax; their wiki page is really good and the documentation is solid all round. And yes, Astyanax does allow you to define columns at runtime as you please, but be aware that Thrift-based APIs are being superseded by CQL APIs.
If, however, you want to go down the pure CQL 3 route, DataStax's driver is what I'd advise you to use. It allows for asynchronous connections and is continuously updated (view the changelogs). The code is also very clean, although the documentation isn't quite there yet, but there are tests in the source that you can look at.
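To give a flavour of that asynchronous API, here is a hedged sketch with the DataStax Java driver (3.x-era API); the contact point, keyspace, table, and column names are made up for illustration.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class AsyncRead {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("my_keyspace");

    // executeAsync returns immediately and the driver pipelines requests,
    // which is what you want at the > 1000 QPS mentioned in the question.
    ResultSetFuture future =
        session.executeAsync("SELECT value FROM events WHERE id = ?", "event-1");

    ResultSet rs = future.getUninterruptibly(); // blocking here only for the demo
    for (Row row : rs) {
      System.out.println(row.getString("value"));
    }
    cluster.close();
  }
}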
But to be honest, there are so many questions about the APIs that you should read through them and form an opinion for yourself:
Cassandra Client Java API's
About Java Cassandra Client, which one is better? How about CQL?
Advantages of using cql over thrift
Also, for performance, here are some benchmarks (they are, however, outdated!) showing that CQL is catching up to (and, for prepared statements, somewhat overtaking) Thrift:
compare string vs. binary prepared statement parameters
CQL benchmarking

What is it exactly?

Why is Apache Hadoop defined as a Platform as a Service in figure 1 of this link {http://www.ibm.com/developerworks/aix/library/au-cloud_apache/#figure2}, but defined as a NoSQL wide-column-store database at http://nosql-databases.org?
I mean, when working with Hadoop, do I need a database too?
Thanks in advance.
Hadoop is basically a collection of Java software that fundamentally provides two things:
A distributed file system implementation.
A framework for writing and running MapReduce jobs in Java.
Many things are built on top of these two pieces (like HBase, which is probably the columnar datastore you have read about).
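As a tiny, hedged illustration of the first piece, this is what reading a file through Hadoop's Java FileSystem API looks like (the path is made up; the Configuration picks up your cluster's core-site.xml from the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // loads core-site.xml if present
    FileSystem fs = FileSystem.get(conf);     // HDFS if fs.defaultFS points there

    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}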
A good resource for learning more about Hadoop is the Apache project page documentation. If that looks confusing, there is also a book called 'Hadoop: The Definitive Guide' which is pretty good reading.
If you want to read about how it all began, I'd recommend reading the Google MapReduce paper upon which Hadoop is based.
Hope that helps.
