using pyspark, read/write 2D images on hadoop file system

I want to be able to read / write images on an hdfs file system and take advantage of the hdfs locality.
I have a collection of images where each image is composed of:
- a 2D array of uint16
- basic additional information stored as an XML file.
I want to create an archive on the HDFS file system and use Spark to analyze the archive. Right now I am struggling to find the best way to store the data on HDFS in order to take full advantage of the Spark+HDFS structure.
From what I understand, the best way would be to create a sequenceFile wrapper. I have two questions:
Is creating a sequenceFile wrapper the best way?
Does anybody have a pointer to examples I could use to start with? I can't be the first one who needs to read something other than a text file from HDFS through Spark!

I have found a solution that works: using the pyspark 1.2.0 binaryFiles method does the job. It is flagged as experimental, but I was able to read TIFF images with the proper combination of OpenCV calls.
import cv2
import numpy as np
# build rdd and take one element for testing purpose
L = sc.binaryFiles('hdfs://localhost:9000/*.tif').take(1)
# convert to bytearray and then to np array
file_bytes = np.asarray(bytearray(L[0][1]), dtype=np.uint8)
# use opencv to decode the np bytes array
R = cv2.imdecode(file_bytes, 1)
Note the pyspark help:
binaryFiles(path, minPartitions=None)
:: Experimental
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Note: Small files are preferred, large file is also allowable, but may cause bad performance.
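As a follow-up sketch, the same decoding can be pushed to the executors instead of pulling a single element to the driver with take(1). This is only an assumption-laden example: the HDFS path is the one from above, cv2 and numpy must be installed on every worker, and cv2.IMREAD_UNCHANGED is used so the 16-bit depth of the TIFFs is preserved instead of being converted to 8-bit colour (this assumes OpenCV was built with TIFF support).
import cv2
import numpy as np

def decode_tiff(record):
    """Decode one (path, bytes) record from binaryFiles into a numpy array."""
    path, content = record
    file_bytes = np.asarray(bytearray(content), dtype=np.uint8)
    # IMREAD_UNCHANGED keeps the original uint16 depth instead of
    # converting the image to 8-bit BGR
    img = cv2.imdecode(file_bytes, cv2.IMREAD_UNCHANGED)
    return path, img

# decode every image on the executors rather than on the driver
images = sc.binaryFiles('hdfs://localhost:9000/*.tif').map(decode_tiff)
print(images.first()[1].dtype)  # should be uint16 for 16-bit TIFFs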

Related

Save and Process huge amount of small files with spark

I'm new to big data! I have some questions about how to process and how to save a large amount of small files (pdf and ppt/pptx) in Spark, on EMR clusters.
My goal is to save the data (pdf and pptx) into HDFS (or some type of datastore on the cluster), then extract the content from these files with Spark and save it in Elasticsearch or some relational database.
I have read about the small-files problem when saving data in HDFS. What is the best way to save a large amount of pdf & pptx files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archive), but I don't understand exactly how they work and I can't figure out which is best.
What is the best way to process these files? I understood that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file as a separate task, because that would turn the cluster into a bottleneck.
Thanks!
1) If you use object stores (like S3) instead of HDFS, then there is no need to apply any changes or conversions to your files and you can keep each one as a single object or blob (this also means they are easily readable using standard tools and needn't be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3), or, if you are working with Spark, using the wholeTextFiles or binaryFiles command and then wrapping the content in a BytesIO (Python) / ByteArrayInputStream (Java) to read it using standard libraries.
2) When processing the files, you have the distinction between items and partitions. If you have 10,000 files you can create 100 partitions containing 100 files each. Each file will need to be processed one at a time anyway, since the header information is relevant and likely different for each file.
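A minimal pyspark sketch of the two points above. The s3a path, the partition count and the parser placeholder are assumptions, not a definitive implementation; any parser that accepts a file-like object would slot in the same way.
from io import BytesIO

def extract_content(record):
    """Parse one (path, bytes) record through a file-like object."""
    path, content = record
    buf = BytesIO(content)
    # hand `buf` to a PDF/PPTX parser that accepts file-like objects here
    return path, len(content)

# hypothetical bucket; roughly 100 partitions for ~10,000 files as described above
docs = sc.binaryFiles('s3a://my-bucket/raw-docs/', minPartitions=100)
results = docs.map(extract_content)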
Meanwhile, I found some solutions for that small files problem in HDFS. I can use the following approaches:
HDFS Federation helps distribute the load across namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your file sizes are not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone, which is object storage like S3 but on-premises. At the time of writing, as far as I know, Ozone is not production ready. https://hadoop.apache.org/ozone/

How to load multiple files in tar.gz into Pig

Scenario: A vendor will provide a raw feed in tar.gz format which contains multiple files in tab-delimited format.
File Detail:
a) One Hit level data
b) Multiple Lookup files
c) One Header file for (a)
The feed (tar.gz) will be ingested and landed into the BDP operational raw area.
Query: I would like to load this data from the operational raw area into Pig for a data quality checking process. How can this be achieved? Should the files be extracted in Hadoop for us to use, or are there alternatives available? Please advise. Thanks!
Note: Any sample script would be most helpful.
Ref: http://pig.apache.org/docs/r0.9.1/func.html#load-store-functions
Extract from the docs:
Handling Compression
Support for compression is determined by the load/store function. PigStorage and TextLoader support gzip and bzip compression for both read (load) and write (store). BinStorage does not support compression.
To work with gzip compressed files, input/output files need to have a .gz extension. Gzipped files cannot be split across multiple maps; this means that the number of maps created is equal to the number of part files in the input location.
A = load 'myinput.gz';
store A into 'myoutput.gz';

Downloading list of files in parallel in Apache Pig

I have a simple text file which contains a list of folders on some FTP servers. Each line is a separate folder. Each folder contains a couple of thousand images. I want to connect to each folder, store all the files inside that folder in a SequenceFile, and then remove that folder from the FTP server. I have written a simple Pig UDF for this. Here it is:
dirs = LOAD '/var/location.txt' USING PigStorage();
results = FOREACH dirs GENERATE download_whole_folder_into_single_sequence_file($0);
/* I don't need results bag. It is just a dummy bag */
The problem is that I'm not sure whether each line of input is processed in a separate mapper. The input file is not huge, just a couple of hundred lines. If it were pure Map/Reduce then I would use NLineInputFormat and process each line in a separate Mapper. How can I achieve the same thing in Pig?
Pig lets you write your own load functions, which let you specify which InputFormat you'll be using. So you could write your own.
That said, the job you described sounds like it would only involve a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest just doing it in vanilla map-reduce. If the total file size is Gigabytes or less, I'd just do it all directly on a single host. It's simpler not to use map reduce if you don't have to.
I typically use map-reduce to first load data into HDFS, and then Pig for all data processing. Pig doesn't really add any benefits over vanilla Hadoop for loading data, IMO; it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus, it's technically possible with Pig that your loader will be called multiple times. That's a gotcha you don't have to worry about when using Hadoop map-reduce directly.
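As an aside, since most of this thread uses pyspark: the one-line-per-mapper behaviour of NLineInputFormat can be approximated in Spark by giving the folder list one partition per line. This is only a sketch under that assumption; download_folder is a hypothetical stand-in for the UDF's logic, and the path is the one from the question.
def download_folder(ftp_folder):
    # hypothetical helper: connect to the FTP server, archive the images
    # into a SequenceFile, then delete the remote folder
    return ftp_folder

# the folder list is only a couple of hundred lines, so collecting it is cheap
folders = sc.textFile('/var/location.txt').collect()

# one partition per folder, so each download runs in its own task,
# mirroring NLineInputFormat's one-line-per-mapper behaviour
rdd = sc.parallelize(folders, numSlices=len(folders))
rdd.map(download_folder).collect()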

Write sequence file using mapreduce and org.apache.hadoop.fs. differences?

I have seen examples of writing a sequence file to HDFS using either the org.apache.hadoop.fs package or mapreduce. My questions are:
What are the differences?
Is the end result, I mean the sequence file written to HDFS, the same with both methods?
I have only tried org.apache.hadoop.fs to write a sequence file; when I used hadoop fs -text to view the result, I saw the "key" still attached to each record/block. Would it be the same if I used mapreduce to produce the sequence file? I'd rather not see the "key".
How does one decide which method to use to write a sequence file to HDFS?
For a sequence file you write your content as objects, i.e. your own custom objects, whereas a text file is just a string per line.
The Apache Hadoop Wiki states that "SequenceFile is a flat file consisting of binary key/value pairs". The Wiki shows the actual file format, that includes the key. Note that SequenceFiles support multiple formats, such as "Uncompressed", "Record Compressed", and "Block Compressed". Additionally there are various compression codecs that can be used. Since the file format and compression information is stored in the file header, applications (such as Mapper and Reducer tasks) can easily determine how to correctly process the files.
The append() method on the org.apache.hadoop.io.SequenceFile.Writer class requires both a key and a value.
Also keep in mind that both the MapReduce Mapper and Reducer ingest and emit key-value pairs. So having the key stored in the SequenceFile allows Hadoop to operate very efficiently with these types of files.
So in a nutshell:
SequenceFiles will always contain a "key" in addition to the "value".
Two SequenceFiles containing the same data are not necessarily exactly the same in terms of size or actual bytes. It all depends on whether compression is used, the type of compression, and the compression codec.
The method you use to create SequenceFiles and add them to HDFS largely depends on what you are trying to achieve and accomplish. SequenceFiles are typically a means to efficiently accomplish a particular goal; they are rarely the end result.
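To make the key/value point concrete from pyspark (a sketch with hypothetical paths, not the Java org.apache.hadoop.fs or mapreduce APIs the question asks about): both writing and reading a SequenceFile go through (key, value) pairs, so the key is always part of what comes back.
# every record must be a (key, value) pair, e.g. (file name, payload)
pairs = sc.parallelize([('img001', 'payload-1'), ('img002', 'payload-2')])
pairs.saveAsSequenceFile('hdfs://localhost:9000/tmp/seq_demo')

# reading it back returns the keys as well -- they are part of the format
back = sc.sequenceFile('hdfs://localhost:9000/tmp/seq_demo')
print(back.collect())  # e.g. [('img001', 'payload-1'), ('img002', 'payload-2')]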

Hadoop streaming: single file or multi file per map. Don't Split

I have a lot of zip files that need to be processed by a C++ library. So I use C++ to write my hadoop streaming program. The program will read a zip file, unzip it, and process the extracted data.
My problem is that:
my mapper can't get the content of exactly one file. It usually gets something like 2.4 files or 3.2 files. Hadoop will send several files to my mapper, but at least one of the files is partial. Zip files can't be processed like this.
Can I get exactly one file per map? I don't want to use a file list as input and read the files from my program, because I want to keep the advantage of data locality.
I can accept the contents of multiple zip files per map if Hadoop doesn't split the zip files. I mean exactly 1, 2, 3 files, not something like 2.3 files. Actually that would be even better, because my program needs to load about 800 MB of data for processing the unzipped data. Can we do this?
You can find the solution here:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_each_of_a_job.27s_maps_to_work_on_one_complete_input-file_and_not_allow_the_framework_to_split-up_the_files.3F
The easiest way I would suggest is to set mapred.min.split.size to a large value so that your files do not get split.
If this does not work, you would need to implement your own InputFormat, which is not very difficult to do; you can find the steps at: http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
Rather than depending on the min split size, an easier way I would suggest is to gzip your files.
There is a way to compress files using gzip:
http://www.gzip.org/
If you are on Linux, you can compress the extracted data with:
gzip -r /path/to/data
Now pass this data as the input to your hadoop streaming job.
