I am designing an application, which requires response very fast and need to retrieve and process a large volume of data (>40G) from hadoop file system, given one input (command).
I am thinking, if it is possible to catch such high amount of data in the distributed memory using spark, and let the application running all the time. If I give the application an command, it could start to process data based on the input.
I think catching such big data is not a problem. However, how can I let the application running, and take input?
As far as I know, there is nothing can be done after "spark-submit" command...
You can try spark job server and Named Objects to cache dataset in distributed memory and use it in various input commands.
The requirement is not clear!!!, but based on my understanding,
1) In spark-submit after the application.jar, you can provide application specific command line arguments. But if you want to send commands after the job was started, then you can write a spark streaming job which processes kafka messages.
2) HDFS is already optimised for processing large volume of data. You can cache intermediate reusable data so that they do not get re-computed. But for better performance you might consider using something like elasticsearch/cassandra, so that they can be fetched/stored even faster.
Related
consider that you have 10GB data and you want to process them by a MapReduce program using Hadoop. Instead of copying all the 10GB at the beginning to HDFS and then running the program, I want to for example copy 1GB and start the work and gradually add the remaining 9GB during the time. I wonder if it is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce Job, part of the setup process is determining block locations of your input. If the input is only partially there, the setup process will only work on those blocks and wont dynamically add inputs.
If you are looking for a stream processor, have a look at Apache Storm https://storm.apache.org/ or Apache Spark https://spark.apache.org/
I'm running Hadoop 1.2.1 on top of Mesos 0.14. My goal is to log the input data size, running time, cpu usage, memory usage, and so on for optimization purposes later. All of these but data size are obtained using Sigar.
Is there any way I can get the input data size of any job which is running?
For example, when I'm running hadoop example's terasort, I need to get the teragen's generated data size before the job actually runs. If I'm running Wordcount example, I need to get the wordcount input file size. I need to get the data size automatically since I won't be able to know what job will be run inside this framework later.
I'm using Java to write some of the mesos library code. Preferably, I want to get the data size inside MesosExecutor class. For some reason, upgrading Hadoop/Mesos isn't an option.
Any suggestions or related API will be appreciated. Thank you.
Does hadoop fs -dus satisfy your requirement? Before submit the job to hadoop, calculate the input file size and pass it as params to your executor.
Is it possible to process GBs of data in Kafka/Storm as a single message? File frequency is 30 minutes.
If not possible If I break the message into 1 MB each and then can I process it in Kafka/Storm?
My files is in SEGY format (Oil/gas domain) and I will call bin executables (written in c++) through storm to process this file. Whether tuples can be formed successfully for this file format?
Please help.
Are you sure you want to use Storm to do this processing? Seems like a batch application may be more appropriate.
Regardless, you might be able to get it to work but I would recommend having your spout split the data up into more manageable chunks that can be processed by your bolts.
I'm writing a mapreduce job, and I have the input that I want to pass to the mappers in the memory.
The usual method to pass input to the mappers is via the Hdfs - sequencefileinputformat or Textfileinputformat. These inputformats need to have files in the fdfs which will be loaded and splitted to the mappers
I cant find a simple method to pass, lets say List of elemnts to the mappers.
I find myself having to wrtite these elements to disk and then use fileinputformat.
any solution?
I'm writing the code in java offcourse.
thanks.
Input format is not have to load data from the disk or file system.
There are also input formats reading data from other systems like HBase or (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html) where data is not implied to sit on the disk. It only is implied to be available via some API on all nodes of the cluster.
So you need to implement input format which splits data in your own logic (as soon as there is no files it is your own task) and to chop the data into records .
Please note that your in memory data source should be distributed and run on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the Mapper process.
I would be glad also to know what is your case which leads to this unusual requirement.
I want to use Apache pig to transform/join data in two files, but I want to implement it step by step, which means, test it from real data, but with a small size(10 lines for example), is it possible to use pig that read from STDIN and output to STDOUT?
Basically Hadoop supports Streaming in various ways, but Pig originally lacked support for loading data through streaming. However there are some solutions.
You can check out HStreaming:
A = LOAD 'http://myurl.com:1234/index.html' USING HStream('\n') AS (f1, f2);
The answer is no. The data needs to be out in the cluster on data nodes before any MR job can even run over the data.
However if you are using a small sample of data and are just wanting to do something simple you could use Pig in local mode and just write stdin to a local file and run it through your script.
But the bigger question becomes why are you wanting to use MR/Pig on a stream of data? It was and is not intended for that type of use.