How to pull data from Mainframe to Hadoop

I have files on a mainframe. I want this data to be pushed to Hadoop (HDFS)/Hive.
I can use Sqoop for the mainframe DB2 database and import it into Hive, but what about files (like COBOL, VSAM, etc.)?
Is there a custom Flume source that I can write, or some alternative tool to use here?

COBOL is a programming language, not a file format. If what you need is to export files produced by COBOL programs, you can use the same technique as if those files were produced by C, C++, Java, Perl, PL/I, Rexx, etc.
In general, you will have three different data sources: flat files, VSAM files, and a DBMS such as DB2 or IMS.
DBMSs have export utilities to copy the data into flat files. Keep in mind that data in DB2 will likely be normalized, so you will likely need the contents of related tables in order to make sense of the data.
VSAM files can be exported to flat files via the IDCAMS utility.
I would strongly suggest you get the files into a text format before transferring them to another box with a different code page. Trying to deal with mixed text (which must have its code page translated) and binary (which must not have its code page translated but which likely must be converted from big endian to little endian) is harder than doing the conversion up front.
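To illustrate why mixed records are awkward to handle after the transfer, here is a minimal Python sketch that decodes one record field by field. The record layout (a 10-byte EBCDIC name, a 4-byte big-endian integer, a 3-byte packed-decimal amount) and the choice of code page cp037 are assumptions made up for this example; your real layout comes from your copybook and your site's code page.

```python
import struct


def unpack_comp3(raw: bytes) -> int:
    """Decode an IBM packed-decimal (COMP-3) field into an int."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    # The last byte holds one digit plus the sign nibble.
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    return -value if sign_nibble == 0x0D else value


def decode_record(record: bytes) -> dict:
    # Hypothetical layout, for illustration only.
    name = record[0:10].decode("cp037").rstrip()   # text: translate the code page
    count = struct.unpack(">i", record[10:14])[0]  # binary: big endian, no translation
    amount = unpack_comp3(record[14:17])           # packed decimal: no translation
    return {"name": name, "count": count, "amount": amount}


# Build a sample record the way it might look on the mainframe side.
sample = (
    "SMITH".ljust(10).encode("cp037")
    + struct.pack(">i", 42)
    + bytes([0x12, 0x34, 0x5D])
)
print(decode_record(sample))
```

Every field needs its own treatment, and a blanket code page translation of the whole record would corrupt the last two fields. That is the work you avoid by converting everything to text first.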
The conversion can likely be done via the SORT utility on the mainframe. Mainframe SORT utilities tend to have extensive data manipulation functions. There are other mechanisms you could use (other utilities, custom code written in the language of your choice, purchased packages) but this is what we tend to do in these circumstances.
Once you have your flat files converted such that all data is text, you can transfer them to your Hadoop boxes via FTP or SFTP or FTPS.
This isn't an exhaustive coverage of the topic, but it will get you started.

Syncsort has been processing mainframe data for 40 years (approximately 50% of mainframes already run the software). They have a specific product called DMX-H, which can source mainframe data, handle the data type conversions, import the COBOL copybooks, and load the data directly into HDFS.
Syncsort also recently contributed a new feature enhancement to the Apache Hadoop core.
I suggest you contact them at www.syncsort.com.
They were showing this in a demo at a recent Cloudera roadshow.

Update for 2018:
There are a number of commercial products that help to move data from the mainframe to distributed platforms. Here is a list of the ones that I have run into, for those who are interested. All of them take data on Z as described in the question, do some transformation, and enable movement of the data to other platforms. They are not an exact match for the question, but the industry has changed, and moving data to other platforms for analysis is a growing goal. From what I've seen, Data Virtualization Manager provides the most robust tooling for transforming the data.
SyncSort IronStream
IBM Common Data Provider
Correlog
IBM Data Virtualization Manager

Why not : hadoop fs -put <what> <where>?

The COBOL layout files can be transferred using the options discussed above. However, actually mapping them to a Hive table is a complex task, because COBOL layouts include complex constructs such as DEPENDING ON clauses, variable-length records, etc.
I have tried to create a custom SerDe to achieve this, although it is still in its initial stages. Here is the link, which might give you some idea of how to deserialize according to your requirements.
https://github.com/rbheemana/Cobol-to-Hive
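To give a feel for why a DEPENDING ON clause complicates the mapping, here is a small Python sketch. It is not taken from the SerDe linked above; the copybook in the comment and all field names are invented for illustration, and the record is assumed to have been converted to text already.

```python
# Hypothetical copybook, for illustration only:
#   01 ORDER-REC.
#      05 CUST-ID      PIC X(6).
#      05 ITEM-COUNT   PIC 9(2).
#      05 ITEMS OCCURS 0 TO 99 TIMES DEPENDING ON ITEM-COUNT.
#         10 ITEM-CODE PIC X(4).

HEADER_LEN = 8   # CUST-ID (6) + ITEM-COUNT (2)
ITEM_LEN = 4     # ITEM-CODE (4)


def parse_order(record: str) -> dict:
    """Parse one text record whose length depends on ITEM-COUNT."""
    cust_id = record[0:6].rstrip()
    item_count = int(record[6:8])
    expected_len = HEADER_LEN + item_count * ITEM_LEN
    if len(record) < expected_len:
        raise ValueError(f"record too short: {len(record)} < {expected_len}")
    items = [
        record[start:start + ITEM_LEN].rstrip()
        for start in range(HEADER_LEN, expected_len, ITEM_LEN)
    ]
    return {"cust_id": cust_id, "items": items}


print(parse_order("CUST0102AAAABBBB"))
print(parse_order("CUST0200"))
```

The number of columns is only known after reading the count field, so a fixed-width Hive table definition does not fit directly; one common way out is to map the repeating group to an array column.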

Not pull, but push: use the Co:Z Launcher from Dovetailed Technologies.
For example (JCL excerpt):
//FORWARD EXEC PGM=COZLNCH
//STDIN DD *
hadoop fs -put <(fromfile /u/me/data.csv) /data/data.csv
# Create a catalog table
hive -f <(fromfile /u/me/data.hcatalog)
/*
where /u/me/data.csv (the mainframe-based data that you want in Hadoop) and /u/me/data.hcatalog (corresponding HCatalog file) are z/OS UNIX file paths.
For a more detailed example, where the data happens to be log records, see Extracting logs to Hadoop.

Cobrix might be able to solve it for you. It is an open-source COBOL data source for Spark and can parse the files you mentioned.

Related

Save and Process huge amount of small files with spark

I'm new to big data! I have some questions about how to process and store a large number of small files (pdf and ppt/pptx) in Spark, on EMR clusters.
My goal is to save the data (pdf and pptx) into HDFS (or some other datastore on the cluster), then extract the content of these files with Spark and save it in Elasticsearch or a relational database.
I have read about the small-files problem when saving data in HDFS. What is the best way to store a large number of pdf and pptx files (maximum size 100-120 MB)? I have read about Sequence Files and HAR (Hadoop archives), but I don't understand exactly how either of them works, and I can't figure out which is best.
What is the best way to process these files? I understand that some solutions could be FileInputFormat or CombineFileInputFormat, but again I don't know exactly how they work. I know that I can't run every small file as a separate task, because that would turn the cluster into a bottleneck.
Thanks!
1) If you use object stores (like S3) instead of HDFS, there is no need to apply any changes or conversions to your files; you can keep each one as a single object or blob (this also means they are easily readable using standard tools and needn't be unpacked or reformatted with custom classes or code).
You can then read the files using Python tools like boto (for S3) or, if you are working with Spark, using the wholeTextFiles or binaryFiles methods, and then wrapping the bytes in a BytesIO (Python) / ByteArrayInputStream (Java) to read them with standard libraries.
2) When processing the files, you have the distinction between items and partitions. If you have 10,000 files, you can create 100 partitions containing 100 files each. Each file will need to be processed one at a time anyway, since the header information is relevant and likely different for each file.
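The items-versus-partitions idea can be sketched in plain Python, without Spark, as follows. The file names and the process_file stand-in are hypothetical; in Spark itself, the number of partitions would more likely be controlled through the minPartitions argument of binaryFiles or a repartition call.

```python
import io


def make_batches(paths, num_partitions):
    """Distribute paths round-robin over num_partitions batches."""
    batches = [[] for _ in range(num_partitions)]
    for i, path in enumerate(paths):
        batches[i % num_partitions].append(path)
    return batches


def process_file(name, content: bytes):
    """Stand-in for a real parser: wrap the bytes and read the header."""
    stream = io.BytesIO(content)
    header = stream.read(4)
    return name, header


# 10,000 items grouped into 100 partitions of 100 items each.
paths = [f"file_{i:05d}.pdf" for i in range(10000)]
batches = make_batches(paths, 100)
print(len(batches), len(batches[0]))

# Within a partition, files are still handled one at a time.
print(process_file("example.pdf", b"%PDF-1.7 ..."))
```

The point is that the scheduling unit (a partition) and the parsing unit (a file) are separate, so you pay the task overhead 100 times instead of 10,000 times.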
Meanwhile, I found some solutions to the small-files problem in HDFS. I can use the following approaches:
HDFS Federation helps to distribute the load across namenodes: https://hortonworks.com/blog/an-introduction-to-hdfs-federation/
HBase could also be a good alternative if your files are not too large.
There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic. All rows in HBase conform to the Data Model, and that includes versioning. Take that into consideration when making your design, as well as block size for the ColumnFamily.
https://hbase.apache.org/book.html
Apache Ozone, which is object storage like S3 but runs on-premises. At the time of writing, from what I know, Ozone is not production ready. https://hadoop.apache.org/ozone/

Gathering heterogeneous data with hadoop

We have a system, including some Oracle and Microsoft SQL DBMSs, that gets data from several different sources in different formats, then stores and processes it. "Different formats" means files (dbf, xls and others, including binary formats such as images), which are imported into the DBMSs with different tools, as well as direct access to the databases. I want to isolate all the incoming data, store it "forever", and be able to retrieve it later by source and creation time. After some study I want to try the Hadoop ecosystem, but I'm not quite sure whether it's an adequate solution for this goal. And which parts of the ecosystem should I use? HDFS alone, Hive, maybe something else? Could you give me a piece of advice?
I assume you want to store the files that contain the data -- effectively a searchable file archive.
The files themselves can just be stored in HDFS ... or you may find a system like Amazon's S3 cheaper and more flexible. As you store the files, you could manage the other data about the data, namely: location, source, and creation time by appending to another file -- a simple tab-separated file or several other formats supported by Hadoop make this easy.
You can manage and query the file with Hive or other SQL-on-Hadoop tools. In effect, you're creating a simple file system with special attributes, so the trick would be to make sure that each time you write a file, you also write the metadata. You may have to handle cases like write failures, what happens when you delete, rename, or move files (I know, you say "never").
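As a rough sketch of the "write the metadata every time you write a file" idea, here is what appending to and reading from a tab-separated index could look like in Python. The column order, the function names, and the s3://archive/... locations are assumptions for the example; an in-memory buffer stands in for the real index file.

```python
import csv
import io
from datetime import datetime, timezone


def append_metadata(index_file, location, source, created_at):
    """Append one tab-separated row: location, source, creation time."""
    writer = csv.writer(index_file, delimiter="\t", lineterminator="\n")
    writer.writerow([location, source, created_at.isoformat()])


def find_by_source(index_text, source):
    """Return the locations of all files recorded for the given source."""
    reader = csv.reader(io.StringIO(index_text), delimiter="\t")
    return [row[0] for row in reader if row[1] == source]


# An in-memory buffer stands in for the index file on HDFS or S3.
index = io.StringIO()
created = datetime(2015, 12, 1, tzinfo=timezone.utc)
append_metadata(index, "s3://archive/foo/myfile.dbf", "foo", created)
append_metadata(index, "s3://archive/bar/myexcel.xls", "bar", created)

print(index.getvalue())
print(find_by_source(index.getvalue(), "foo"))
```

A file in this shape can be exposed to Hive as an external table with a tab delimiter, which gives you the "query by source and creation time" part with plain SQL.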
Your solution might be simpler, depending on your needs: you may find that storing the data in subdirectories within HDFS (or AWS S3) is enough. For example, if you wanted to store DBF files from source "foo" and XLS files from source "bar", created on December 1, 2015, you could simply create a directory structure like
/2015/12/01/foo/dbf/myfile.dbf
/2015/12/01/bar/xls/myexcel.xls
This solution has the advantage of being self-maintaining -- the file path stores the metadata which makes it very portable and simple, requiring nothing more than a shell script to implement.
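The same path convention can be sketched in Python instead of a shell script. The function names are made up for this example, and the layout (/year/month/day/source/extension/filename) follows the two paths shown above.

```python
import posixpath
from datetime import date


def archive_path(created: date, source: str, filename: str) -> str:
    """Build /YYYY/MM/DD/<source>/<extension>/<filename>."""
    ext = posixpath.splitext(filename)[1].lstrip(".").lower()
    return posixpath.join("/", f"{created:%Y/%m/%d}", source, ext, filename)


def parse_archive_path(path: str):
    """Recover creation date, source and file name from an archive path."""
    _, year, month, day, source, _ext, filename = path.split("/")
    return date(int(year), int(month), int(day)), source, filename


print(archive_path(date(2015, 12, 1), "foo", "myfile.dbf"))
print(parse_archive_path("/2015/12/01/bar/xls/myexcel.xls"))
```

Because the metadata can be recovered from the path alone, there is no separate index to keep in sync with the files.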
I don't think there's any reason to make the solution more complicated than necessary. Hadoop or S3 are both fine for long-term, high-durability storage and for querying. My company has found that storing the information about the file in Hadoop (which we use for many other purposes) and storing the files themselves on AWS S3 is far simpler, more easily secured and much cheaper.
There are various things that you may want to do, each with their own solution. If more than 1 use case is relevant for you, you probably want to implement multiple solutions in parallel.
1. Store files for use
If you want to store files in a way that they can be picked up efficiently (distributed), the solution is simple: Put the files on hdfs
2. Store the information for use
If you want to use the information rather than store the files, you should be interested in storing the information in a way that it can be picked up efficiently. The general solution here would be: Parse the files in a lossless way and store their information in a database
You may find that storing information in (partitioned) ORC files can be nice for this. You can do this with Hive, Pig or even UDFs (e.g. Python) in Pig.
3. Keep the files for the future
In this case you would mostly care about preserving the files, and not so much about ease of access. Here the recommended solution is: Store compressed files with proper backups
Note that the replication that hdfs does is to deal more efficiently with data (and hardware issues). Just having your data on hdfs does NOT mean that it is backed up.

working with big scientific data on Hadoop

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop".
The data I have is in HDF files, over a terabyte in size. As far as I know, Hadoop expects text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.
Or I find a way to use the raw HDF files in map-reduce programs.
So far I have not been successful in finding any Java code which reads HDF files and extracts data from them.
If somebody has a better idea of how to work with HDF files, I would really appreciate the help.
Thanks
Ayush
Here are some resources:
SciHadoop (uses netCDF but might be already extended to HDF5).
You can either use JHDF5 or the lower level official Java HDF5 interface to read out data from any HDF5 file in the map-reduce task.
For your first option, you could use a conversion tool like HDF dump to dump the HDF files to text format. Otherwise, you can write a program that uses a Java library to read the HDF files and write them out as text files.
For your second option, SciHadoop is a good example of how to read scientific datasets from Hadoop. It uses the NetCDF-Java library to read NetCDF files. Hadoop does not support the POSIX API for file I/O, so SciHadoop uses an extra software layer to translate the POSIX calls of the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you might go down a somewhat harder path and develop a similar solution yourself.
If you do not find any Java code and can work in other languages, then you can use Hadoop Streaming.
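For the streaming route, a mapper is just a program that reads lines on stdin and writes tab-separated key/value lines on stdout. Here is a minimal Python sketch, assuming the HDF data has already been dumped to text lines of the form series_id,timestamp,value; that line format and the function names are assumptions for the example.

```python
import io


def map_line(line: str) -> str:
    """Turn 'series_id,timestamp,value' into 'series_id<TAB>value'."""
    series_id, _timestamp, value = line.rstrip("\n").split(",")
    return f"{series_id}\t{value}"


def run_mapper(lines, out):
    """Emit one key/value line per non-empty input line."""
    for line in lines:
        if line.strip():
            out.write(map_line(line) + "\n")


# In a real streaming job this would be run_mapper(sys.stdin, sys.stdout);
# in-memory buffers are used here so the sketch runs on its own.
demo_in = io.StringIO("s1,2020-01-01,3.5\ns2,2020-01-01,7\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
print(demo_out.getvalue())
```

Hadoop groups the emitted lines by key before passing them to the reducer, so each time series ends up together for the mining step.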
SciMATE http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf is a good option. It is built on a variant of MapReduce, which has been shown to run many scientific applications much more efficiently than Hadoop.

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...);
Is there also a library out there that would allow writing to Hadoop sequence files from Pig?
It's just a matter of implementing a StoreFunc to do so.
This is possible now, although it will become a fair bit easier once Pig 0.7 comes out, as it includes a complete redesign of the Load/Store interfaces.
The "Hadoop expansion pack" that Twitter open-sourced on GitHub includes code for generating Load and Store funcs based on Google Protocol Buffers (building on Input/Output formats for the same -- you already have those for sequence files, obviously). Check it out if you need examples of how to do some of the less trivial stuff. It should be fairly straightforward, though.
This seemed to work for me. https://github.com/kevinweil/elephant-bird/pull/73
