processing GBs of data in kafka/storm - apache-storm

Is it possible to process GBs of data in Kafka/Storm as a single message? File frequency is 30 minutes.
If not possible If I break the message into 1 MB each and then can I process it in Kafka/Storm?
My files is in SEGY format (Oil/gas domain) and I will call bin executables (written in c++) through storm to process this file. Whether tuples can be formed successfully for this file format?
Please help.

Are you sure you want to use Storm to do this processing? Seems like a batch application may be more appropriate.
Regardless, you might be able to get it to work but I would recommend having your spout split the data up into more manageable chunks that can be processed by your bolts.

Related

Save and Process huge amount of small files with spark

I'm new in big data! I have some questions about how to process and how to save large amount of small files(pdf and ppt/pptx) in spark, on EMR Clusters.
My goal is to save data(pdf and pptx) into HDFS(or in some type of datastore from cluster) then extract content from this file from spark and save it in elasticsearch or some relational database.
I had read the problem of small files when save data in HDFS. What is the best way to save large amount of pdf & pptx files (maxim size 100-120 MB)? I had read about Sequence Files and HAR(hadoop archive) but none of them I don't understand how exactly it's works and i don't figure out what is the best.
What is the best way to process this files? I understood that some solutions could be FileInputFormat or CombineFileInputFormat but again I don't know how exactly it's works. I know that can't run every small file on separated task because the cluster will be put in the bottleneck case.
Thanks!
If you use Object Stores (like S3) instead of HDFS then there is no need to apply any changes or conversions to your files and you can have them each as a single object or blob (this also means they are easily readable using standard tools and needn't be unpacked or reformatted with custom classes or code).
You can then read the files using python tools like boto (for s3) or if you are working with spark using the wholeTextFile or binaryFiles command and then making a BytesIO (python) / ByteArrayInputStream (java) to read them using standard libraries.
2) When processing the files, you have the distinction between items and partitions. If you have a 10000 files you can create 100 partitions containing 100 files each. Each file will need to anyways be processed one at a time since the header information is relevant and likely different for each file.
Meanwhile, I found some solutions for that small files problem in HDFS. I can use the following approaches:
HDFS Federation help us to distribute the load of namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could be also a good alternative if your files size is not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone which is object storage like S3 but is on-premises. At the time of writing, from what I know, Ozone is not production ready. https://hadoop.apache.org/ozone/

How can Spark take input after it is submitted

I am designing an application, which requires response very fast and need to retrieve and process a large volume of data (>40G) from hadoop file system, given one input (command).
I am thinking, if it is possible to catch such high amount of data in the distributed memory using spark, and let the application running all the time. If I give the application an command, it could start to process data based on the input.
I think catching such big data is not a problem. However, how can I let the application running, and take input?
As far as I know, there is nothing can be done after "spark-submit" command...
You can try spark job server and Named Objects to cache dataset in distributed memory and use it in various input commands.
The requirement is not clear!!!, but based on my understanding,
1) In spark-submit after the application.jar, you can provide application specific command line arguments. But if you want to send commands after the job was started, then you can write a spark streaming job which processes kafka messages.
2) HDFS is already optimised for processing large volume of data. You can cache intermediate reusable data so that they do not get re-computed. But for better performance you might consider using something like elasticsearch/cassandra, so that they can be fetched/stored even faster.

Hadoop streaming with multiple input files

I want to build an inverted index from a set of files with Hadoop using the Streaming API. The documentation always refers to using a file whose lines have the entries to the mapper to be fed. But in this case, I have multiple input files, and I need the mappers to process only one file at a time. Is there a way to accomplish that. For preprocessing reasons, I need the input to be like this, and I cannot have the input in the classic line = key, value format that the documentation refers.
By default a mapper only processes one file, unless you use an input class that allow combine inputs like CombineFileInputFormat.
Then, if you have 10 files you will end with 10 mappers and each of them will process only one file. If you are only using mappers (not reducers) that will end in 10 outputs files (one for each mapper).
In the other side, if you have enough big splittable files, it is possible that one file be processed by several mappers at the same time.

hadoop/HDFS: Is it possible to write from several processes to the same file?

f.e. create file 20bytes.
1st process will write from 0 to 4
2nd from 5 to 9
etc
I need this to parallel creating a big files using my MapReduce.
Thanks.
P.S. Maybe it is not implemented yet, but it is possible in general - point me where I should dig please.
Are you able to explain what you plan to do with this file after you have created it.
If you need to get it out of HDFS to then use it then you can let Hadoop M/R create separate files and then use a command like hadoop fs -cat /path/to/output/part* > localfile to combine the parts to a single file and save off to the local file system.
Otherwise, there is no way you can have multiple writers open to the same file - reading and writing to HDFS is stream based, and while you can have multiple readers open (possibly reading different blocks), multiple writing is not possible.
Web downloaders request parts of the file using the Range HTTP header in multiple threads, and then either using tmp files before merging the parts together later (as Thomas Jungblut suggests), or they might be able to make use of Random IO, buffering the downloaded parts in memory before writing them off to the output file in the correct location. You unfortunately don't have the ability to perform random output with Hadoop HDFS.
I think the short answer is no. The way you accomplish this is write your multiple 'preliminary' files to hadoop and then M/R them into a single consolidated file. Basically, use hadoop, don't reinvent the wheel.

hadoop - How can i use data in memory as input format?

I'm writing a mapreduce job, and I have the input that I want to pass to the mappers in the memory.
The usual method to pass input to the mappers is via the Hdfs - sequencefileinputformat or Textfileinputformat. These inputformats need to have files in the fdfs which will be loaded and splitted to the mappers
I cant find a simple method to pass, lets say List of elemnts to the mappers.
I find myself having to wrtite these elements to disk and then use fileinputformat.
any solution?
I'm writing the code in java offcourse.
thanks.
Input format is not have to load data from the disk or file system.
There are also input formats reading data from other systems like HBase or (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html) where data is not implied to sit on the disk. It only is implied to be available via some API on all nodes of the cluster.
So you need to implement input format which splits data in your own logic (as soon as there is no files it is your own task) and to chop the data into records .
Please note that your in memory data source should be distributed and run on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the Mapper process.
I would be glad also to know what is your case which leads to this unusual requirement.

Resources