Accessing Hadoop Distributed Cache in UDF - caching

Is it possible to pick up files from the distributed cache in a UDF?
Before diving in further, I spent quite a bit of time trying to find an answer to this particular question (on StackOverflow and otherwise), and was not able to find one.
The main crux of the problem is as follows: I would like to take a file that is already on HDFS, copy it over to the distributed cache in Pig, and then be able to read this file from the cache in a Java UDF. Another hiccup is that due to the design of the program, I am unable to extend from 'EvalFunc', which may solve the issue.
I specified SET mapred.cache.files '$PATH_TO_FILE_ON_HDFS' as well as SET mapped.create.symlink 'yes' in my Pig script, passed the file path as a parameter to the UDF, and attempted to use FileSystem and FileReader classes to access the file, to no avail.
Please let me know if I can further clarify this/provide any more relevant details.

Related

Gathering heterogeneous data with hadoop

We have a system, including some Oracle and Microsoft SQL DBMS, that get data from some different sources and in different formats, stores and process it. "Different formats" means files: dbf, xls and others, including binary formats (images), which are imported to DBMS with different tools, and direct access to the databases. I want to isolate all the incoming data and store it "forever" and want to get them later by source and creation time. After some studies I want to try hadoop ecosystem, but not quite sure, if it's an adequate solution for this goal. And what parts of ecosystem should I use? HDFS alone, Hive, may be something else? Could you give me a piece of advise?
I assume you want to store the files that contain the data -- effectively a searchable file archive.
The files themselves can just be stored in HDFS ... or you may find a system like Amazon's S3 cheaper and more flexible. As you store the files, you could manage the other data about the data, namely: location, source, and creation time by appending to another file -- a simple tab-separated file or several other formats supported by Hadoop make this easy.
You can manage and query the file with Hive or other SQL-on-Hadoop tools. In effect, you're creating a simple file system with special attributes, so the trick would be to make sure that each time you write a file, you also write the metadata. You may have to handle cases like write failures, what happens when you delete, rename, or move files (I know, you say "never").
Your solution might be simpler depending on your needs, you may find that storing the data in subdirectories within HDFS (or AWS S3) is even simpler. Perhaps if you wanted to store DBF files from source "foo", and XLS files from "bar" created on December 1, 2015, you could simply create a directory structure like
/2015/12/01/foo/dbf/myfile.dbf
/2015/12/01/bar/xls/myexcel.xls
This solution has the advantage of being self-maintaining -- the file path stores the metadata which makes it very portable and simple, requiring nothing more than a shell script to implement.
I don't think there's any reason to make the solution more complicated than necessary. Hadoop or S3 are both fine for long-term, high-durability storage and for querying. My company has found that storing the information about the file in Hadoop (which we use for many other purposes) and storing the files themselves on AWS S3 is far simpler, more easily secured and much cheaper.
There are various things that you may want to do, each with their own solution. If more than 1 use case is relevant for you, you probably want to implement multiple solutions in parallel.
1. Store files for use
If you want to store files in a way that they can be picked up efficiently (distributed), the solution is simple: Put the files on hdfs
2. Store the information for use
If you want to use the information, rather than storing the files you should be interested in storing the information in a way that they can be picked up efficiently. The general solution here would be: Parse the files in a lossles way and store their information in a database
You may find that storing information in (partitioned) ORC files can be nice for this. You can do this with Pive, Pig or even UDFs (e.g. python) in Pig.
3. Keep the files for the future
In this case you would mostly care about preserving the files, and not so much about ease of access. Here the recommended solution is: Store compressed files with proper backups
Note that the replication that hdfs does is to deal more efficiently with data (and hardware issues). Just having your data on hdfs does NOT mean that it is backed up.

Can I get around the no-update restriction in HDFS?

Thanks for the answers. I'm still not quite getting the answer I want. It's a particular question involving HDFS and the concat api.
Here it is. When concat talks about files, does it mean only "files created and managed by HDFS?" Or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to
Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
Go directly to the datanodes and make local copies of the blocks using normal shell commands.
Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, I have two files as far as HDFS is concerned. The original and an updated copy. Essentially, I put the data blocks on the datanodes without going through Hadoop. The concat code put all those new blocks into a new HDFS file without having to pass the data through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem. I need to convince them this will not work.
The base philosophy of HDFS is:
write-once, read-many
then, it is not possible to update files with the base implementation of HDFS. You only can append at the end of a current file if you are using a Hadoop branch that allow it. (The original version doesn't allow it)
An alternative could be use a non-standard HDFS like Map-R file system: https://www.mapr.com/blog/get-real-hadoop-read-write-file-system#.VfHYK2wViko
Go for HBase which is built on top of Hadoop to support CRUD operations in big data hadoop world.
If you are not supposed to use No SQL database then there is no chance for updating HDFS files. Only option is to rewrite.

Hadoop Spark (Mapr) - AddFile how does it work

I am trying to understand how does hadoop work. Say I have 10 directory on hdfs, it contains 100s of file which i want to process with spark.
In the book - Fast Data Processing with Spark
This requires the file to be available on all the nodes in the cluster, which isn't much of a
problem for a local mode. When in a distributed mode, you will want to use Spark's
addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this, will spark create copy of file on each node.
What I want is that it should read the file which is present in that directory (if that directory is present on that node)
Sorry, I am bit confused , how to handle the above scenario in spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
#transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Hadoop use folder structure as input

I'm a beginner trying to use Hadoop and I guess although I understand the general map-reduce stuff I seem to miss something in the beginning.
Basically I'm trying to parse a website (local) using hadoop and have as result the link structure (so that later I can calculate some page rank).
Thus the input is a folder structure (with subfolder and files) and the output should be, for now, each file with a list of files that link to it.
What InputFormat should I use? The FileInputFormat doesn't seem to work (I get an exception upon encountering a folder - saying it is a directory). Actually is there such an InputFormat that allows for inputing such folder structures?
If not... should I somehow preprocess the input data? Meaning should I take out every HTML file into a single directory and look from it there?
Or, is there a way to write such an InputFormat that does what I need?
Actually is there such an InputFormat that allows for inputing such folder structures?
All the FileInputFormats take a Path as an input, which can be a directory or file.
The FileInputFormat doesn't seem to work (I get an exception upon encountering a folder - saying it is a directory).
The JIRA has been fixed in some of the releases (0.21, 0.22, 0.23 and trunk). o.a.h.mapred.FileInputFormat should have the addInputPathRecursively method implemented. Also, noticed that it's not implemented in the new API (o.a.h.mapreduce.FileInputFormat). Here is the code for o.a.h.mapred.FileInputFormat class from trunk.
BTW, what release are you using?
Basically I'm trying to parse a website (local) using hadoop and have as result the link structure (so that later I can calculate some page rank).
Because of the media attention/hype Hadoop is being used for every thing. Hadoop as-is works well for some types of problems. Consider using Apache Hama and Giraph for graph processing. Note that both are in incubator and documentation is also sparse.

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...)
Is there also a library out there that would allow writing to Hadoop sequence files from Pig?
It's just a matter of implementing a StoreFunc to do so.
This is possible now, although it will become a fair bit easier once Pig 0.7 comes out, as it includes a complete redesign of the Load/Store interfaces.
The "Hadoop expansion pack" Twitter is about to open source open-sourced at github, includes code for generating Load and Store funcs based on Google Protocol Buffers (building on Input/Output formats for same -- you already have those for sequence files, obviously). Check it out if you need examples of how to do some of the less trivial stuff. It should be fairly straightforward though.
This seemed to work for me. https://github.com/kevinweil/elephant-bird/pull/73

Resources