Hadoop: use a folder structure as input

I'm a beginner trying to use Hadoop, and although I understand the general map-reduce idea I seem to be missing something at the very beginning.
Basically I'm trying to parse a website (stored locally) using Hadoop and produce its link structure as the result (so that later I can calculate some page rank).
Thus the input is a folder structure (with subfolders and files) and the output should be, for now, each file together with a list of the files that link to it.
What InputFormat should I use? FileInputFormat doesn't seem to work (I get an exception upon encountering a folder, saying it is a directory). Is there an InputFormat that accepts such folder structures?
If not, should I somehow preprocess the input data? That is, should I move every HTML file into a single flat directory and work from there?
Or is there a way to write an InputFormat that does what I need?

Is there an InputFormat that accepts such folder structures?
All the FileInputFormats take a Path as input, which can be a directory or a file.
FileInputFormat doesn't seem to work (I get an exception upon encountering a folder, saying it is a directory).
The relevant JIRA has been fixed in some releases (0.21, 0.22, 0.23 and trunk), so o.a.h.mapred.FileInputFormat should have the addInputPathRecursively method implemented. Note, though, that it is not implemented in the new API (o.a.h.mapreduce.FileInputFormat); see the o.a.h.mapred.FileInputFormat class from trunk for the implementation.
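If you are stuck on the new API, a workaround is to walk the directory tree yourself and add every regular file as an input path. A minimal sketch of that idea (not the trunk code; the class and method names below are my own):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RecursiveInput {
    // Recursively add every regular file under 'dir' as a job input path,
    // so the job never tries to read a directory as a split.
    public static void addInputsRecursively(Job job, FileSystem fs, Path dir)
            throws IOException {
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                addInputsRecursively(job, fs, status.getPath());
            } else {
                FileInputFormat.addInputPath(job, status.getPath());
            }
        }
    }
}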
BTW, what release are you using?
Basically I'm trying to parse a website (stored locally) using Hadoop and produce its link structure as the result (so that later I can calculate some page rank).
Because of the media attention and hype, Hadoop gets used for everything, but Hadoop as-is works well only for some types of problems. For graph processing, consider Apache Hama or Apache Giraph instead. Note that both are still in the incubator and their documentation is sparse.

Related

Sync files on HDFS having the same size but different contents

I am trying to sync files from one Hadoop cluster to another using DistCp and Airbnb's ReAir utility, but neither of them works as expected.
If the file size is the same on source and destination, both fail to update it even when the file contents have changed (the checksums also differ), unless the overwrite option is used.
I need to keep around 30 TB of data in sync, so reloading the complete dataset every time is not feasible.
Could anyone please suggest how I can bring the two datasets into sync when the file sizes are the same (the count in the source has changed) but the checksums differ?
The way DistCp handles syncing between files that are the same size but have different contents is by comparing their so-called FileChecksum. The FileChecksum was first introduced in HADOOP-3981, mostly for the purpose of being used in DistCp. Unfortunately, this has the known shortcoming of being incompatible between different storage implementations, and even incompatible between HDFS instances that have different internal block/chunk settings. Specifically, that FileChecksum bakes in the structure of having, for example, 512-bytes-per-chunk and 128MB-per-block.
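For illustration, the comparison boils down to something like this minimal sketch (the paths are hypothetical; FileSystem.getFileChecksum may return null for stores that expose no checksum):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCompare {
    // Returns true only when both file systems report a checksum and the
    // two checksums (algorithm plus bytes) match.
    public static boolean sameChecksum(Path src, Path dst, Configuration conf)
            throws IOException {
        FileChecksum srcSum = src.getFileSystem(conf).getFileChecksum(src);
        FileChecksum dstSum = dst.getFileSystem(conf).getFileChecksum(dst);
        return srcSum != null && srcSum.equals(dstSum);
    }
}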
Since GCS doesn't have the same notions of "chunks" or "blocks", there's no way for it to have any similar definition of a FileChecksum. The same is also true of all other object stores commonly used with Hadoop; the DistCp documentation appendix discusses this fact under "DistCp and Object Stores".
That said, there's a neat trick that can be done to define a nice standardized representation of a composite CRC for HDFS files that is mostly in-place compatible with existing HDFS deployments; I've filed HDFS-13056 with a proof of concept to try to get this added upstream, after which it should be possible to make it work out-of-the-box against GCS, since GCS also supports file-level CRC32C.

Can I get around the no-update restriction in HDFS?

Thanks for the answers. I'm still not quite getting the answer I want. It's a specific question involving HDFS and the concat API.
Here it is. When concat talks about files, does it mean only "files created and managed by HDFS?" Or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to:
1. Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
2. Go directly to the datanodes and make local copies of the blocks using normal shell commands.
3. Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
4. Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, I have two files as far as HDFS is concerned. The original and an updated copy. Essentially, I put the data blocks on the datanodes without going through Hadoop. The concat code put all those new blocks into a new HDFS file without having to pass the data through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem. I need to convince them this will not work.
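For reference, the concat call in the Java API looks roughly like the sketch below (the paths are hypothetical; in older versions the method lives only on DistributedFileSystem). Note that both the target and the sources are ordinary HDFS Paths, i.e. files whose blocks the NameNode already tracks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/data/part-00000");        // existing HDFS file
        Path[] sources = { new Path("/data/part-00001"),   // more existing HDFS files
                           new Path("/data/part-00002") };
        // Moves the blocks of the sources onto the end of the target and
        // deletes the sources; only valid within a single HDFS instance.
        fs.concat(target, sources);
    }
}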
The base philosophy of HDFS is:
write-once, read-many
so it is not possible to update files with the base implementation of HDFS. You can only append to the end of an existing file, and only if you are using a Hadoop branch that allows it (the original version doesn't).
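A minimal sketch of that append path, assuming a build with append enabled (the dfs.support.append setting) and a hypothetical file name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Bytes can only be added at the end; nothing already written changes.
        try (FSDataOutputStream out = fs.append(new Path("/logs/events.log"))) {
            out.writeBytes("one more record\n");
        }
    }
}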
An alternative could be to use a non-standard HDFS-compatible file system such as the MapR file system: https://www.mapr.com/blog/get-real-hadoop-read-write-file-system#.VfHYK2wViko
Go for HBase, which is built on top of Hadoop, to support CRUD operations in the big data Hadoop world.
If you are not supposed to use a NoSQL database, then there is no way to update HDFS files in place; the only option is to rewrite them.
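To make the HBase suggestion concrete, an in-place update there is just a Put on an existing row. A minimal sketch using the HBase 1.x+ client API (the table name, column family and row key are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("pages"))) {
            Put put = new Put(Bytes.toBytes("row-1"));
            // Overwrites the cell value; HBase handles the "update" that
            // HDFS files cannot do in place.
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("content"),
                          Bytes.toBytes("new value"));
            table.put(put);
        }
    }
}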

Accessing Hadoop Distributed Cache in UDF

Is it possible to pick up files from the distributed cache in a UDF?
Before diving in further, I spent quite a bit of time trying to find an answer to this particular question (on StackOverflow and otherwise), and was not able to find one.
The main crux of the problem is as follows: I would like to take a file that is already on HDFS, copy it over to the distributed cache in Pig, and then be able to read this file from the cache in a Java UDF. Another hiccup is that, due to the design of the program, I am unable to extend 'EvalFunc', which might otherwise solve the issue.
I specified SET mapred.cache.files '$PATH_TO_FILE_ON_HDFS' as well as SET mapred.create.symlink 'yes' in my Pig script, passed the file path as a parameter to the UDF, and attempted to use the FileSystem and FileReader classes to access the file, to no avail.
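Roughly, this is the kind of access I attempted once the symlink was in place (a minimal sketch; the './lookup' symlink name is a placeholder, assuming the cache entry is registered with a matching '#lookup' fragment):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CacheReader {
    // Reads the first line of the cached file via its symlink in the
    // task's current working directory.
    public static String firstLine() throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("./lookup"))) {
            return in.readLine();
        }
    }
}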
Please let me know if I can further clarify this/provide any more relevant details.

Hadoop Spark (MapR) - how does addFile work?

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, each containing hundreds of files which I want to process with Spark.
In the book Fast Data Processing with Spark:
This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this; will Spark create a copy of the file on each node?
What I want is for it to read the files that are present in that directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
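For example, here is a minimal sketch in Java (the surrounding snippets are Scala, and the HDFS paths are hypothetical) that reads several directories into a single RDD without copying anything to every node; textFile accepts comma-separated paths and glob patterns, and the resulting partitions are distributed across the executors:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadDirs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ReadDirs");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Each file becomes one or more partitions; partitions are scheduled
        // onto executors, so the whole dataset is never copied to every node.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/dir1/*,hdfs:///data/dir2/*");
        System.out.println("line count: " + lines.count());
        sc.stop();
    }
}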
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...)
Is there also a library out there that would allow writing to Hadoop sequence files from Pig?
It's just a matter of implementing a StoreFunc to do so.
This is possible now, although it will become a fair bit easier once Pig 0.7 comes out, as it includes a complete redesign of the Load/Store interfaces.
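Against those redesigned interfaces, a rough skeleton of a sequence-file StoreFunc might look like the following sketch (the class name and the assumption of two-field (key, value) text tuples are mine; error handling is omitted):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class SequenceFileStorer extends StoreFunc {
    private RecordWriter<Text, Text> writer;

    @Override
    public OutputFormat getOutputFormat() {
        return new SequenceFileOutputFormat<Text, Text>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        // Standard FileOutputFormat-style output location and key/value types.
        FileOutputFormat.setOutputPath(job, new Path(location));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToWrite(RecordWriter writer) {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple t) throws IOException {
        try {
            // Assumes two-field tuples: (key, value).
            writer.write(new Text(t.get(0).toString()), new Text(t.get(1).toString()));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}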
The "Hadoop expansion pack" Twitter is about to open source open-sourced at github, includes code for generating Load and Store funcs based on Google Protocol Buffers (building on Input/Output formats for same -- you already have those for sequence files, obviously). Check it out if you need examples of how to do some of the less trivial stuff. It should be fairly straightforward though.
This seemed to work for me. https://github.com/kevinweil/elephant-bird/pull/73
