Working with big scientific data on Hadoop

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop".
The data I have consists of HDF files totalling over a terabyte. As far as I know, Hadoop expects text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.
The other is to find a way to use the raw HDF files directly in map-reduce programs.
So far I have not been successful in finding any Java code that reads HDF files and extracts data from them.
If somebody has a better idea of how to work with HDF files, I would really appreciate the help.
Thanks
Ayush

Here are some resources:
SciHadoop (uses NetCDF but may already have been extended to HDF5).
You can use either JHDF5 or the lower-level official Java HDF5 interface to read data from any HDF5 file in your map-reduce tasks.
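For illustration, here is a minimal JHDF5 sketch for reading a one-dimensional double dataset (the file path and dataset path are placeholders; in a map-reduce task you would typically point it at a local copy of the file, since JHDF5 reads from the local file system rather than HDFS):
import java.io.File;
import ch.systemsx.cisd.hdf5.HDF5Factory;
import ch.systemsx.cisd.hdf5.IHDF5Reader;

public class ReadHdf5Sample {
    public static void main(String[] args) {
        // Open the HDF5 file read-only (placeholder local path)
        IHDF5Reader reader = HDF5Factory.openForReading(new File("/tmp/sample.h5"));
        try {
            // "/mydata/series" is a placeholder dataset path
            double[] series = reader.readDoubleArray("/mydata/series");
            System.out.println("Read " + series.length + " values");
        } finally {
            reader.close();
        }
    }
}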

For your first option, you could use a conversion tool (such as h5dump) to dump the HDF file to a text format. Otherwise, you can write a program that uses a Java HDF library to read the HDF file and write it out as text.
For your second option, SciHadoop is a good example of how to read scientific datasets from Hadoop. It uses the NetCDF-Java library to read NetCDF files. Hadoop does not support the POSIX API for file I/O, so SciHadoop uses an extra software layer to translate the POSIX calls made by the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you may have to take the somewhat harder path of developing a similar solution yourself.

If you cannot find any suitable Java code and can work in other languages, you can use Hadoop Streaming.

SciMATE http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf is a good option. It is built on a variant of MapReduce that has been shown to run many scientific applications much more efficiently than Hadoop.

Related

Save and Process huge amount of small files with spark

I'm new to big data! I have some questions about how to process and save a large number of small files (PDF and PPT/PPTX) in Spark, on EMR clusters.
My goal is to save the data (PDF and PPTX) into HDFS (or some other datastore available to the cluster), then extract the content from these files with Spark and save it in Elasticsearch or a relational database.
I have read about the small-files problem when saving data in HDFS. What is the best way to save a large number of PDF and PPTX files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archives), but I don't understand exactly how either of them works or which one is best.
What is the best way to process these files? I understand that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that running every small file as a separate task would turn the cluster into a bottleneck.
Thanks!
If you use object stores (like S3) instead of HDFS, then there is no need to apply any changes or conversions to your files, and you can keep each one as a single object or blob (this also means they are easily readable using standard tools and need not be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3) or, if you are working with Spark, using the wholeTextFiles or binaryFiles methods and then wrapping the contents in a BytesIO (Python) / ByteArrayInputStream (Java) to read them with standard libraries, as shown in the sketch after this answer.
When processing the files, keep the distinction between items and partitions in mind. If you have 10,000 files you can create 100 partitions containing 100 files each. Each file still needs to be processed one at a time, since the header information is relevant and likely different for each file.
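For illustration, a minimal Java/Spark sketch of the binaryFiles approach (the s3a://bucket/docs/ path is a placeholder, and the actual PDF/PPTX parsing is left as a comment since the choice of library is yours):
import java.io.ByteArrayInputStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;
import scala.Tuple2;

public class ReadSmallFiles {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("read-small-files"));
        // binaryFiles yields one (path, PortableDataStream) pair per file
        sc.binaryFiles("s3a://bucket/docs/*")
          .foreach((Tuple2<String, PortableDataStream> pair) -> {
              byte[] bytes = pair._2().toArray();              // whole file in memory
              ByteArrayInputStream in = new ByteArrayInputStream(bytes);
              // hand `in` to the parser of your choice (PDFBox, POI, Tika, ...) here
              System.out.println(pair._1() + ": " + bytes.length + " bytes");
          });
        sc.close();
    }
}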
Meanwhile, I found some solutions for the small-files problem in HDFS. I can use the following approaches:
HDFS Federation helps us distribute the load across namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your file sizes are not too large (see the sketch after this list).
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone, which is object storage like S3 but on-premises. At the time of writing, from what I know, Ozone is not production ready. https://hadoop.apache.org/ozone/
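As a rough illustration of the HBase route, here is a minimal Java sketch that stores a small file as a cell value (the table name "docs", column family "f", and file path are assumptions; the table must already exist with that column family):
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreSmallFileInHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("docs"))) {
            byte[] content = Files.readAllBytes(Paths.get("/local/report.pdf")); // placeholder path
            Put put = new Put(Bytes.toBytes("report.pdf"));                      // row key = file name
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("content"), content);
            table.put(put);
        }
    }
}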

Load image to pig

I am a newbie to analyzing images using Apache Pig.
Can anyone suggest how to load and process images?
I know that for text files,
alias = load '/user/Pavan/sample.txt' using PigStorage(" ");
How do I do this with images?
You have a few options, which really depend on the kind of manipulation you're looking to do:
1) Write a custom load function
Pig can be used for images, but you'd need to write a custom load function, which could be more than you're looking to do.
2) Use a Sequence File (my recommendation)
You could also convert the images to a sequence file, which Pig has a loader for in the Piggybank JAR. There are also load and store functions for reading and writing sequence files available via Twitter's Elephant Bird package. See the sketch after this list for one way to pack images into a sequence file.
Here's an article about using Sequence Files on Hadoop for astronomical categorization tasks.
3) Go with MapReduce.
Depending on the nature of your task, you may be better off in native MapReduce.
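As a rough sketch of option 2, here is one way to pack a directory of images into a Hadoop sequence file (the input directory and output file names are placeholders; keys are file names and values are the raw image bytes):
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("images.seq")),     // output on the default filesystem
                SequenceFile.Writer.keyClass(Text.class),             // key = image file name
                SequenceFile.Writer.valueClass(BytesWritable.class)); // value = raw image bytes
        try {
            for (File img : new File("/local/images").listFiles()) {  // placeholder local directory
                byte[] bytes = Files.readAllBytes(Paths.get(img.getPath()));
                writer.append(new Text(img.getName()), new BytesWritable(bytes));
            }
        } finally {
            writer.close();
        }
    }
}
The resulting file can then be loaded in Pig with the Piggybank SequenceFileLoader or the Elephant Bird loaders mentioned above.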

How to pull data from Mainframe to Hadoop

I have files on a mainframe. I want this data to be pushed to Hadoop (HDFS)/Hive.
I can use Sqoop for the mainframe DB2 database and import it into Hive, but what about files (such as COBOL data files, VSAM, etc.)?
Is there any custom Flume source that I can write, or some alternative tool to use here?
COBOL is a programming language, not a file format. If what you need is to export files produced by COBOL programs, you can use the same technique as if those files were produced by C, C++, Java, Perl, PL/I, Rexx, etc.
In general, you will have three different data sources: flat files, VSAM files, and a DBMS such as DB2 or IMS.
DBMSs have export utilities to copy the data into flat files. Keep in mind that data in DB2 will likely be normalized, so you will likely need the contents of related tables in order to make sense of the data.
VSAM files can be exported to flat files via the IDCAMS utility.
I would strongly suggest you get the files into a text format before transferring them to another box with a different code page. Trying to deal with mixed text (which must have its code page translated) and binary (which must not have its code page translated but which likely must be converted from big endian to little endian) is harder than doing the conversion up front.
The conversion can likely be done via the SORT utility on the mainframe. Mainframe SORT utilities tend to have extensive data manipulation functions. There are other mechanisms you could use (other utilities, custom code written in the language of your choice, purchased packages) but this is what we tend to do in these circumstances.
Once you have your flat files converted such that all data is text, you can transfer them to your Hadoop boxes via FTP or SFTP or FTPS.
This isn't an exhaustive coverage of the topic, but it will get you started.
Syncsort has been processing mainframe data for 40 years (approximately 50% of mainframes already run their software). They have a specific product called DMX-H which can source mainframe data, handle the data type conversions, import the COBOL copybooks, and load the data directly into HDFS.
Syncsort also recently contributed a new feature enhancement to the Apache Hadoop core.
I suggest you contact them at www.syncsort.com.
They were showing this in a demo at a recent Cloudera roadshow.
Update for 2018:
There are a number of commercial products that help move data from the mainframe to distributed platforms. Here is a list of the ones I have run into, for those who are interested. All of them take data on Z as described in the question and will do some transformation and enable movement of the data to other platforms. Not an exact match, but the industry has changed and the goal of moving data to other platforms for analysis is growing. From what I've seen, Data Virtualization Manager provides the most robust tooling for transforming the data.
SyncSort IronStream
IBM Common Data Provider
Correlog
IBM Data Virtualization Manager
Why not: hadoop fs -put <what> <where>?
Transmission of COBOL layout files can be done through the options discussed above. However, actually mapping them to Hive tables is a complex task, since COBOL layouts have complex features such as DEPENDING ON clauses, variable lengths, etc.
I have tried to create a custom SerDe to achieve this, although it is still in its initial stages. Here is the link, which might give you some idea of how to deserialize according to your requirements:
https://github.com/rbheemana/Cobol-to-Hive
Not pull, but push: use the Co:Z Launcher from Dovetailed Technologies.
For example (JCL excerpt):
//FORWARD EXEC PGM=COZLNCH
//STDIN DD *
hadoop fs -put <(fromfile /u/me/data.csv) /data/data.csv
# Create a catalog table
hive -f <(fromfile /u/me/data.hcatalog)
/*
where /u/me/data.csv (the mainframe-based data that you want in Hadoop) and /u/me/data.hcatalog (corresponding HCatalog file) are z/OS UNIX file paths.
For a more detailed example, where the data happens to be log records, see Extracting logs to Hadoop.
Cobrix might be able to solve it for you. It is an open-source COBOL data source for Spark and can parse the files you mentioned.

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...)
Is there also a library out there that would allow writing to Hadoop sequence files from Pig?
It's just a matter of implementing a StoreFunc to do so.
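For illustration, here is a rough skeleton of such a StoreFunc (the class name is made up and it assumes tuples with two text fields, key then value; it is a sketch, not an existing library class):
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class SequenceFileTextStorer extends StoreFunc {
    private RecordWriter<Text, Text> writer;

    @Override
    public OutputFormat getOutputFormat() {
        return new SequenceFileOutputFormat<Text, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            // naive mapping: first field becomes the key, second the value
            writer.write(new Text(String.valueOf(t.get(0))),
                         new Text(String.valueOf(t.get(1))));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}
You would then register the jar and use the class in a STORE statement, much like SequenceFileLoader is used in the LOAD statement above.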
This is possible now, although it will become a fair bit easier once Pig 0.7 comes out, as it includes a complete redesign of the Load/Store interfaces.
The "Hadoop expansion pack" Twitter is about to open source open-sourced at github, includes code for generating Load and Store funcs based on Google Protocol Buffers (building on Input/Output formats for same -- you already have those for sequence files, obviously). Check it out if you need examples of how to do some of the less trivial stuff. It should be fairly straightforward though.
This seemed to work for me. https://github.com/kevinweil/elephant-bird/pull/73
