Hadoop streaming with zip input files - hadoop

I'm trying to run a streaming job where the input files are csv inside zip files.
I tried using this, however it doesn't seem to work with CDH4 (I get the error: class com.cotdp.hadoop.ZipFileInputFormat not org.apache.hadoop.mapred.InputFormat).
Anyone know of an input file reader I can use for streaming with zip files? If possible, I'm looking for a multi-file reader (that can be given the top-level directory).

I ended up writing zipstream.
Note that it processes only the first file in the zip; I'll probably add support for multiple files later.
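For reference, reading just the first file inside a zip stored in HDFS boils down to wrapping the HDFS input stream in a java.util.zip.ZipInputStream. A minimal standalone sketch (the archive path is a placeholder, and this is not the zipstream code itself):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FirstZipEntryReader {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open the archive in HDFS and position the stream at its first entry
        ZipInputStream zip = new ZipInputStream(fs.open(new Path("/data/archive.zip")));
        ZipEntry first = zip.getNextEntry(); // null if the archive is empty
        BufferedReader reader = new BufferedReader(new InputStreamReader(zip, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(first.getName() + "\t" + line);
        }
        reader.close();
    }
}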

There are two Hadoop APIs for input formats: mapred.InputFormat and mapreduce.InputFormat.
mapreduce is the newer API and the one you should be using if you can.
I would check to see which InputFormat the ZipFileInputFormat actually implements. If it implements the mapreduce version, you'll need to move your job over to that second API.
For a bit of background: in an earlier Hadoop version 'mapred' was deprecated in favor of 'mapreduce', a newer, faster, and cleaner implementation. Unfortunately the new API didn't include all the features of the old one, so in more recent versions of Hadoop 'mapred' was reinstated, and now there are two APIs that basically do the same thing.
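If you just want to confirm which API a third-party input format targets before rewriting anything, a quick reflection check works; this is only a sketch, with the class name taken from the error message above:

public class CheckInputFormatApi {
    public static void main(String[] args) throws ClassNotFoundException {
        // load the third-party class and see which of the two Hadoop APIs it implements
        Class<?> clazz = Class.forName("com.cotdp.hadoop.ZipFileInputFormat");
        System.out.println("old mapred API:    "
                + org.apache.hadoop.mapred.InputFormat.class.isAssignableFrom(clazz));
        System.out.println("new mapreduce API: "
                + org.apache.hadoop.mapreduce.InputFormat.class.isAssignableFrom(clazz));
    }
}

Note that Hadoop streaming's -inputformat option expects a class implementing the old org.apache.hadoop.mapred.InputFormat, which is exactly what the error above is complaining about.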

Related

How to use AvroParquetReader inside a Flink application?

I am having trouble using AvroParquetReader inside a Flink application (Flink >= 1.15).
Motivation (AKA why I want to use it)
According to the official docs, one can read Parquet files in Flink into a FileSource. However, I only want to write a function that loads a Parquet file into Avro records, without creating a DataStreamSource. In particular, I want to load Parquet files into FileInputFormat, which is a completely separate API (for some weird reason). (And I could not easily see how one could cast BulkFormat or StreamFormat into it, if one digs one level deeper.)
Therefore, it would be much simpler to use org.apache.parquet.avro.AvroParquetReader to read the files directly.
Error description
However, I found this error after running the Flink application locally: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
This is quite unexpected, since the flink-s3-hadoop-fs jar has already been loaded by the plugin system (and its path has been added to HADOOP_CLASSPATH as well). So not only does Flink know where it is, the local Hadoop should as well.
Comments:
Without this AvroParquetReader, the Flink app can write to S3 without problem.
The Hadoop installation is not a Flink-shaded one, but a separate install of version 2.10.
Would love to hear if you have some insights about this.
AvroParquetReader should be able to read the Parquet files without problems.
There is an official Hadoop guide with some potential fixes for the issue, which can be found here. If I recall correctly, this issue was caused by some missing Hadoop AWS dependencies.
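For what it's worth, here is a minimal sketch of reading a Parquet file on S3 into Avro GenericRecords with AvroParquetReader (bucket and key are placeholders). The s3a:// scheme is resolved by Hadoop itself rather than by Flink's plugin loader, so hadoop-aws and the matching AWS SDK jars need to be visible on this classpath, which matches the recollection above about missing Hadoop AWS dependencies:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ReadParquetAsAvro {
    public static void main(String[] args) throws Exception {
        // plain Hadoop configuration; S3A credentials/endpoint come from core-site.xml or the environment
        Configuration conf = new Configuration();
        Path path = new Path("s3a://my-bucket/some/file.parquet");

        ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
                .build();
        GenericRecord record;
        while ((record = reader.read()) != null) {
            System.out.println(record);
        }
        reader.close();
    }
}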

Spark Architecture for processing small binary files saved in HDFS

I don't know how to build the architecture for the following use case:
I have a web application where users can upload files (pdf & pptx) and directories to be processed. After the upload is complete, the web application puts these files and directories in HDFS, then sends a message on Kafka with the paths to these files.
The Spark application reads the messages from Kafka streaming, collects them on the master (driver), and after that processes them. I collect the messages first because I need to move the code to the data, and not move the data to where the message is received. I understood that Spark assigns jobs to executors that already have the file locally.
I have issues with Kafka because I was forced to collect the messages first for the above reason, and when I want to create a checkpoint the app crashes "because you are attempting to reference SparkContext from a broadcast variable", even though the code ran before adding checkpointing. (I use sparkContext there because I need to save data to Elasticsearch and PostgreSQL.) I don't know exactly how I can do code upgrades under these conditions.
I read about the Hadoop small files problem, and I understand what the problems are in this case. I read that HBase is a better solution for saving small files than just saving them in HDFS. Another part of the small files problem is the large number of mappers and reducers created for the computation, but I don't understand whether this problem also exists in Spark.
What is the best architecture for this use case?
How should I do job scheduling? Is Kafka good for that, or do I need to use another service like RabbitMQ or something else?
Is there a way to add jobs to a running Spark application through some REST API?
What is the best way to save the files? Is it better to use HBase because I have small files (<100MB)? Or do I need to use SequenceFile? I think SequenceFile isn't right for my use case because I need to reprocess some files randomly.
What do you think is the best architecture for this use case?
Thanks!
There is no single "best" way to build an architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider the following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As a retrieval component, I would use Flume. It is flexible and supports a lot of sources, channels (including Kafka) and sinks. In your case you can configure a source that monitors the directory and picks up the newly received files.
For data processing/transformation, it depends on what task you are solving. You have probably decided on Spark Streaming. Spark Streaming can be integrated with a Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html); see the sketch below. There are other options available, e.g. Apache Storm; Flume combines very well with Storm. Some transformations can also be applied in Flume.
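To make the Flume sink option concrete, here is a rough Java sketch of the push-based integration from that page (host, port and batch interval are made up; it assumes the spark-streaming-flume artifact is on the classpath and that Flume's Avro sink points at the given host/port):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeToSpark {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("uploaded-file-events");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // receive events pushed by Flume's Avro sink
        JavaReceiverInputDStream<SparkFlumeEvent> events =
                FlumeUtils.createStream(ssc, "spark-receiver-host", 41414);

        // each event body would carry the HDFS path published by the web application
        events.count().print();

        ssc.start();
        ssc.awaitTermination();
    }
}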
For data archival, do not store/archive the files directly in HDFS unless they are bigger than a few hundred megabytes. One solution would be to put them in HBase, as sketched below.
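As an illustration of the HBase option, a minimal sketch that copies one small file from HDFS into a single HBase cell keyed by its path (table name, column family and paths are all made up):

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ArchiveFileToHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("uploaded_files"));

        // read the whole small file into memory and store it as one cell, keyed by its HDFS path
        Path src = new Path("/staging/presentation.pptx");
        byte[] content = IOUtils.toByteArray(fs.open(src));
        Put put = new Put(Bytes.toBytes(src.toString()));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("content"), content);
        table.put(put);

        table.close();
        connection.close();
        fs.close();
    }
}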
Make your architecture more flexible. I would place processed files in a temporary HDFS location and have some job regularly archive them into zip, HBase, Hadoop Archive (there is such an animal) or any other solution.
Consider using Apache NiFi (aka HDF, Hortonworks Data Flow). It internally uses queues and provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is a nice Hortonworks tutorial which, combined with the HDP Sandbox running on a virtual machine/Docker, can bring you up to speed in a very short time (1-2 hours?).

Hadoop 3rd Party Library Dependencies on Local Files

So I'm working on a hadoop project that makes extensive use of some 3rd party libraries that rely on the availability of small local files. A lot of them are config files, although one of them is a 34MB dictionary file. Essentially, I am trying to wrap the library to operate on much larger inputs and outputs. The particular libraries in question are s-match and WordNet JWNL.
What is the correct way to make sure these smaller files are available to the mapper and reducer nodes locally at runtime?
The alternative is to extensively alter the 3rd party libraries, which I'd obviously rather avoid. Surely there's got to be a way to package and propagate these files to the local filesystems, avoiding the need for MR jobs to read exclusively from the HDFS and/or special objects.
The most standard way to do it is to add these files to Hadoop's distributed cache. Here's an article on how the distributed cache works. Basically, if you are using the vanilla Hadoop API, you can add files to the distributed cache via your JobConf:
// the URI must point to a file already in HDFS; needs java.net.URI and org.apache.hadoop.filecache.DistributedCache
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("myfile.txt"), job);
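On the task side, the mapper can then look up the local copies in its configure() method, roughly like this (a sketch against the old mapred API; class and field names are made up):

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DictionaryMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Path dictionary;

    @Override
    public void configure(JobConf job) {
        try {
            // local paths of every file that was added to the distributed cache
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            dictionary = cached[0]; // e.g. the 34MB dictionary file
        } catch (IOException e) {
            throw new RuntimeException("Could not read the distributed cache", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // ...initialize the 3rd-party library from 'dictionary' and process 'value' here
    }
}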
If you are using an uberjar to run your job, you can also just ship them in the classpath of the uberjar, but this is a bit dirtier and will blow up the size of your jar file.

how to get multipleOutput in hadoop

I'm new to Hadoop, and now have to process an input file. I want to process each line, and the output should be one file for each line.
I surfed the internet and found MultipleOutputFormat and generateFileNameForKeyValue.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place. And I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs as well as interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything, seeing as your job will be run in a single JVM. With this in mind I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (neither in 1.1.2 nor 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs; the Javadoc page has some usage examples for MultipleOutputs).
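As a rough sketch of the MultipleOutputs route with the new (Job/mapreduce) API, assuming a release that ships org.apache.hadoop.mapreduce.lib.output.MultipleOutputs, a map-only job can write each line to its own file; all names below are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// map-only job (job.setNumReduceTasks(0)): each input line goes to its own output file
public class OneFilePerLineMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // use the line's byte offset as the file name; any key-derived name works here
        mos.write(NullWritable.get(), line, "line-" + offset.get());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

On the driver side the Job is configured as usual; LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) is a useful companion so that the default empty part files are not created.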

Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources like a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and there is an easy way to code external clients against HDFS.
There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know what you are looking for *g*
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
// FileSystem and Path live in org.apache.hadoop.fs; JobConf extends Configuration
FileSystem.get(new JobConf()).create(new Path("however.file"));
This returns a stream you can handle with regular Java IO.
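A slightly fuller sketch of pushing a local file into the cluster from an external (e.g. Windows) client; the namenode URI and paths are placeholders, and the client needs the Hadoop jars plus either a core-site.xml on the classpath or the explicit property shown here:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // normally picked up from core-site.xml; set explicitly for an external client
        conf.set("fs.default.name", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        InputStream in = new BufferedInputStream(new FileInputStream("C:/data/export.csv"));
        OutputStream out = fs.create(new Path("/ingest/export.csv"));
        IOUtils.copyBytes(in, out, 4096, true); // copies and then closes both streams
        fs.close();
    }
}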
For the problem of loading the data I needed to put into HDFS, I chose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper read the file from the file server (in this case via https), then wrote it directly to HDFS (via the Java API).
The list of files is read from the input. I then have an external script that populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), then starts the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).
About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced in Cloudera's blog and can be downloaded from a github repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1, and it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1, which at the time of writing is still not released, Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can be a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS, which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0, and one of its major features is that it provides data locality: when you're making a read request, you will be redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and what needs you have. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the github repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limitation, etc.)
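To give an idea of what the WebHDFS REST API looks like in practice, here is a rough sketch of the documented two-step file creation (host, port, user and path are placeholders). The NameNode answers the first PUT with a redirect to a DataNode, which is the same redirection mechanism behind the read locality mentioned above:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsCreate {
    public static void main(String[] args) throws Exception {
        // step 1: ask the NameNode where to write; it responds with a redirect to a DataNode
        URL nameNode = new URL(
                "http://namenode:50070/webhdfs/v1/ingest/hello.txt?op=CREATE&user.name=hdfs");
        HttpURLConnection step1 = (HttpURLConnection) nameNode.openConnection();
        step1.setRequestMethod("PUT");
        step1.setInstanceFollowRedirects(false);
        String dataNodeLocation = step1.getHeaderField("Location");
        step1.disconnect();

        // step 2: send the file content to the DataNode indicated by the redirect
        HttpURLConnection step2 = (HttpURLConnection) new URL(dataNodeLocation).openConnection();
        step2.setRequestMethod("PUT");
        step2.setDoOutput(true);
        OutputStream out = step2.getOutputStream();
        out.write("hello hdfs\n".getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + step2.getResponseCode()); // expect 201 Created
    }
}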
Install Cygwin, install Hadoop locally (you just need the binary and configs that point at your NN -- no need to actually run the services), run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.
It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most flavors of Unix) as a standard file system using the mount command. Once mounted, the user can operate on an instance of hdfs using standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard Posix libraries like open, write, read, close from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource
mapR - contains a closed source hdfs compatible file system that supports read/write NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.
I haven't tried any of these, but I will update the answer soon, as I have the same need as the OP.
You can now also try to use Talend, which includes components for Hadoop integration.
You can try mounting HDFS on the machine where you are executing your code (call it machine_X); machine_X should have InfiniBand connectivity with HDFS. Check this out: https://wiki.apache.org/hadoop/MountableHDFS
You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.
