I want to consume a user-selected file in Hadoop, through a given user interface. What should the approach be? Is it a wise decision to use Hadoop in a web application?
There are two issues here: whether you should use Hadoop at all, and how to use it.
The typical file you can expect a user to upload through a web interface is much smaller than the data sizes at which Hadoop starts to be relevant.
So it's very probable that Hadoop isn't the right choice for this scenario.
It's hard to know what the problem is without more detailed logs.
But the most common cause, if the class that isn't found is your own class (and not an infrastructure class), is that you need to distribute your jars to all the Hadoop tasks.
A simple solution is to use the -libjars parameter when running your application.
See a good explanation here.
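For -libjars to be picked up, the driver needs to run through ToolRunner/GenericOptionsParser. A minimal sketch, assuming a hypothetical driver class MyJob and placeholder jar names:

    // Minimal sketch: running the driver through ToolRunner so that generic
    // options such as -libjars are parsed. MyJob and the jar names are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf(); // -libjars entries are already applied here
            // ... set up and submit the Job using this conf ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            // Invoked e.g. as:
            //   hadoop jar myjob.jar MyJob -libjars dep1.jar,dep2.jar /input /output
            System.exit(ToolRunner.run(new MyJob(), args));
        }
    }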
I don't know how to build an architecture for the following use case:
I have a web application where users can upload files (PDF and PPTX) and directories to be processed. After the upload is complete, the web application puts these files and directories in HDFS, then sends a message on Kafka with the path to these files.
A Spark application reads the messages from Kafka Streaming, collects them on the master (driver), and only then processes them. I collect the messages first because I need to move the code to the data, and not move the data to where the message is received. As I understand it, Spark assigns a job to an executor that already has the file locally.
I have issues with Kafka because I was forced to collect the messages first for the reason above, and when I want to create a checkpoint the app crashes with "because you are attempting to reference SparkContext from a broadcast variable", even though the same code ran fine before adding checkpointing. (I use SparkContext there because I need to save data to Elasticsearch and PostgreSQL.) I don't know exactly how I can do code upgrades under these conditions.
I read about the Hadoop small files problem, and I understand what the problems are in this case. I read that HBase is a better solution for saving small files than just saving them in HDFS. Another aspect of the small files problem is the large number of mappers and reducers created for the computation, but I don't understand whether this problem also exists in Spark.
What is the best architecture for this use case?
How should I do job scheduling? Is Kafka good for that, or do I need to use another service like RabbitMQ or something else?
Is there some method to add jobs to a running Spark application through some REST API?
What is the best way to save the files? Is it better to use HBase because I have small files (<100 MB)? Or do I need to use SequenceFile? I think SequenceFile isn't for my use case because I need to reprocess some files randomly.
What do you think is the best architecture for this use case?
Thanks!
There is no single "best" way to build an architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider the following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As a retrieval component, I would use Flume. It is flexible and supports a lot of sources, channels (including Kafka) and sinks. In your case you can configure a source that monitors a directory and picks up newly received files.
For data processing/transformation - it depends on what task you are solving. You have probably decided on Spark Streaming. Spark Streaming can be integrated with a Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html). There are other options available, e.g. Apache Storm; Flume combines very well with Storm. Some transformations can also be applied in Flume itself.
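A minimal sketch of that push-based Flume integration, assuming the spark-streaming-flume artifact is on the classpath; the host, port and batch interval are placeholders:

    // Sketch: Spark Streaming receiving events that Flume pushes to an Avro sink.
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    public class FlumeSparkSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("FlumeSparkSketch");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Spark acts as the Avro endpoint that the Flume agent's sink points at.
            JavaReceiverInputDStream<SparkFlumeEvent> events =
                    FlumeUtils.createStream(jssc, "spark-host", 9988);

            // Each event body is the raw payload Flume picked up; here we just count them.
            events.count().print();

            jssc.start();
            jssc.awaitTermination();
        }
    }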
For data archival - do not store/archive the files directly in Hadoop unless they are bigger than a few hundred megabytes. One solution would be to put them in HBase.
Make your architecture more flexible: I would place processed files in a temporary HDFS location and have some job regularly archive them into zip files, HBase, a Hadoop Archive (there is such an animal) or any other solution.
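To illustrate the HBase option, here is a hypothetical sketch that stores one small file as a single cell. It assumes a table 'files' with a column family 'f' already exists; the table, family and row-key scheme are just placeholders:

    // Sketch: storing a small uploaded file as a cell in an HBase table.
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SmallFileToHBase {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("files"))) {
                byte[] content = Files.readAllBytes(Paths.get(args[0]));
                Put put = new Put(Bytes.toBytes(args[0])); // row key = local file path
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("data"), content);
                table.put(put);
            }
        }
    }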
Consider using Apache NiFi (aka HDF - Hortonworks Data Flow). It uses queues internally and provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is a nice Hortonworks tutorial which, combined with the HDP Sandbox running on a virtual machine/Docker, can bring you up to speed in a very short time (1-2 hours?).
I would like to understand whether Hadoop has support for Siebel applications; can anybody share experience in doing that? I looked for online documentation and was not able to find any proper link explaining this, so I'm posting the question here.
I have a Siebel application running with an Oracle database, and I would like to replace it with Hadoop... is it possible?
No is the answer.
Basically Hadoop isn't a database at all.
Hadoop is basically a distributed file system (HDFS) - it lets you store a large amount of file data on a cloud of machines, handling data redundancy etc.
On top of that distributed file system it provides an API for processing all stored data using something called Map-Reduce.
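To make that concrete, the canonical word-count job shows the shape of that API. This is just the standard example (written against the newer org.apache.hadoop.mapreduce API), not anything Siebel-specific:

    // Standard word-count example illustrating the Map-Reduce API.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    ctx.write(word, ONE); // emit (word, 1) for every token
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }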
I have a question. I have a program written in NetBeans. The program reads data from Cassandra and writes the result back into it. My program is not MapReduce at all. I executed the program and made a .jar file from it. Now I want to know whether I can execute it in Hadoop.
Actually, I want to know: can I run a non-MapReduce program in Hadoop?
You could architect this program to run on Hadoop v2 as a YARN application. This would require re-architecting your application to fit the YARN paradigm. An example of how to do this is given here: Writing App Framework on Yarn
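To give a feel for what that paradigm involves, here is a very rough sketch of a client submitting an application to YARN. The application name, memory settings and the echo command are placeholders; a real application would ship its own jar as a LocalResource and launch its own ApplicationMaster class:

    // Rough sketch of submitting an application to YARN via YarnClient.
    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SimpleYarnSubmit {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration();
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(conf);
            yarnClient.start();

            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("my-non-mapreduce-app");

            // The "ApplicationMaster" here is just a shell command for illustration.
            ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
            amContainer.setCommands(Collections.singletonList(
                    "echo hello from the application master"));
            appContext.setAMContainerSpec(amContainer);

            Resource resource = Records.newRecord(Resource.class);
            resource.setMemory(256);
            resource.setVirtualCores(1);
            appContext.setResource(resource);

            yarnClient.submitApplication(appContext);
        }
    }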
This is not a simple exercise. Also, if you are interested in using Hadoop, I would consider simply re-writing your application to use HBase (another NoSQL columnar database and a competitor to Cassandra), which is written specifically for Hadoop. It translates your query requests to MapReduce calls automatically.
This question is ages old but has never been answered. Anyhow, two projects are looking into this issue:
Apache Slider (incubating): http://slider.incubator.apache.org/
and
Apache Myriad (incubating): http://myriad.incubator.apache.org/
Slider is mainly sponsored by Hortonworks while Myriad is a MapR / Mesosphere project with large assistance from PayPal.
I'm new to Hadoop, and now have to process an input file. I want to process each line, and the output should be one file for each line.
I surfed the internet and found MultipleOutputFormat and generateFileNameForKeyValue.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place. And I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs as well as interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything - seeing as your job will be run in a single JVM. With this in mind I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (in neither 1.1.2 nor 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs - the Javadoc page has some usage notes for MultipleOutputs).
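If it helps, here is a rough sketch of writing one output file per key with MultipleOutputs from the new mapreduce package; the reducer class name and the path scheme are just illustrative:

    // Sketch: one output file per key using MultipleOutputs (new mapreduce API).
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class PerKeyReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, Text>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // The third argument is the base output path, so each key
                // ends up in its own file under the job output directory.
                mos.write(key, value, key.toString() + "/part");
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

Using LazyOutputFormat in the driver avoids the empty default part files that would otherwise be created.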
So my MR job generates a report file, and that file needs to be downloadable by an end user who clicks a button on a normal web reporting interface and gets the output. According to this O'Reilly book excerpt, there is an HTTP read-only interface. It says it's XML-based, but it seems that it's simply the normal web interface intended to be viewed through a web browser, not something that can be programmatically queried, listed, and downloaded. Is my only recourse to write my own servlet-based interface? Or to execute the Hadoop CLI tool?
The way to access HDFS programmatically from something other than Java is by using Thrift.
There are pre-generated client classes for several languages (Java, Python, PHP, ...) included in the HDFS source tree.
See http://wiki.apache.org/hadoop/HDFS-APIs
I'm afraid you will probably have to settle for the CLI, AFAIK.
Not sure if it would fit your situation, but I think it would be reasonable to have whatever script kicks off the MR job do a hadoop dfs -get ... after job completion, into a known directory that's served.
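For what it's worth, the programmatic equivalent of that hadoop dfs -get step, via the Java FileSystem API, is roughly the following; both paths are placeholders:

    // Sketch: copy the MR output from HDFS into a directory the web server serves.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            fs.copyToLocalFile(new Path("/jobs/output/report.txt"),
                               new Path("/var/www/reports/report.txt"));
            fs.close();
        }
    }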
Sorry that I don't know of an easier solution.