Using the Wikipedia dataset for PageRank in Hadoop

I will be doing a project on PageRank and inverted indexing of the Wikipedia dataset using Apache Hadoop. I downloaded the whole wiki dump. It decompresses to a single 42 GB .xml file. I want to process this file into data suitable as input for PageRank and inverted-indexing MapReduce algorithms. Please help! Any leads will be helpful.

You need to write your own InputFormat to process XML. You would also need to implement a RecordReader to make sure each record handed to the mapper is a fully formed XML chunk and not just a single line.
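To make the RecordReader idea concrete, here is a minimal sketch in Python (not Hadoop code) of the core logic such a reader needs: scan a byte stream and hand back only complete <page>...</page> chunks, even when a tag straddles a read boundary. The function name, the tag defaults and the chunk size are assumptions chosen for illustration; a real RecordReader would also have to respect split offsets.

```python
def iter_xml_records(stream, start=b"<page>", end=b"</page>", chunk_size=1 << 16):
    """Yield every complete start...end chunk found in a binary stream."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            s = buf.find(start)
            if s < 0:
                # No start tag yet: keep only a tail that could hold a partial one.
                buf = buf[-(len(start) - 1):] if len(start) > 1 else b""
                break
            e = buf.find(end, s)
            if e < 0:
                # Record is incomplete: keep it and wait for more data.
                buf = buf[s:]
                break
            yield buf[s:e + len(end)]
            buf = buf[e + len(end):]
        if not chunk:
            # End of stream: anything left over is an unterminated record.
            return
```

The sketch assumes the opening tag appears exactly as <page> with no attributes, and it rescans the buffer on each read, which is fine for illustration but not tuned for speed.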

Your question is not very clear to me. What kind of idea do you need?
The very first thing that is going to hit you is how to process this XML file in your MR job. The MR framework doesn't provide any built-in InputFormat for XML files, so you will have to supply one.
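Once a mapper receives one complete <page> element, turning it into PageRank input is mostly a matter of pulling out the title and the outgoing links. The following is a rough Python sketch, assuming the usual MediaWiki export layout (a title element and a revision/text element per page) and the [[Target|label]] link syntax. The function name and the tab-separated output format are illustrative choices, not anything prescribed by Hadoop.

```python
import re
import xml.etree.ElementTree as ET

# Captures the target of [[Target]], [[Target|label]] or [[Target#section]].
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

def page_to_adjacency(page_xml):
    """Turn one <page>...</page> string into 'title<TAB>link<TAB>link...'."""
    page = ET.fromstring(page_xml)
    title = page.findtext("title", default="")
    text = page.findtext("revision/text", default="") or ""
    # De-duplicate and sort the targets so the output is deterministic.
    targets = sorted({t.strip() for t in LINK_RE.findall(text) if t.strip()})
    return "\t".join([title] + targets)
```

Each output line is an adjacency-list record (a page followed by the pages it links to), which is a common starting point for both PageRank iterations and link-based indexing.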


Setting Mappers of desired numbers

I have gone through a lot of posts on Stack Overflow and also the Apache wiki to learn how the number of mappers is set in Hadoop. I also went through the post "hadoop - how total mappers are determined".
Some say it's based on the InputFormat, and some posts say it's based on the number of blocks the input file is split into.
Somehow I am confused by the default setting.
When I run a wordcount example I see the number of mappers is as low as 2. What is really happening in the setting? Also, in this example program they set the number of mappers based on user input. How can one do this setting manually?
I would really appreciate some help in understanding how mappers work.
Thanks in advance.
Use the Hadoop configuration properties mapred.min.split.size and mapred.max.split.size (settable like Java system properties, e.g. with -D) to guide Hadoop toward the split size you want. This won't always work, particularly when your data is in a compression format that is not splittable (e.g. gzip; bzip2, by contrast, is splittable).
So if you want more mappers, use a smaller split size. Simple!
(Updated as requested) This won't work for many small files: each file gets at least one split, so you'll end up with more mappers than you want. For this situation use CombineFileInputFormat. For Scalding, this SO question explains how: "Create Scalding Source like TextLine that combines multiple files into single mappers".
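As a back-of-the-envelope illustration of why a smaller split size means more mappers, here is a small Python sketch. It assumes the split-size rule used by the newer-API FileInputFormat as I understand it, max(minSize, min(maxSize, blockSize)), and one mapper per split of a single splittable file. The function names are made up for this example, and the real implementation has extra details (such as letting the last split run slightly over), so treat the result as an estimate.

```python
def split_size(block_size, min_size, max_size):
    """Assumed rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

def estimated_mappers(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Rough mapper count for one splittable file (ceiling division)."""
    size = split_size(block_size, min_size, max_size)
    return (file_size + size - 1) // size

MB = 1024 * 1024
print(estimated_mappers(1024 * MB, 64 * MB))                     # defaults
print(estimated_mappers(1024 * MB, 64 * MB, max_size=32 * MB))   # smaller splits
print(estimated_mappers(1024 * MB, 64 * MB, min_size=128 * MB))  # larger splits
```

Lowering the maximum split size below the block size increases the estimate, and raising the minimum above the block size decreases it.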

working with big scientific data on Hadoop

I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop".
The data I have is HDF files totalling over a terabyte. As far as I know, Hadoop expects text files as input for further processing (MapReduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.
Or I find a way to use raw HDF files in MapReduce programs.
So far I have not been successful in finding any Java code that reads HDF files and extracts data from them.
If somebody has a better idea of how to work with HDF files, I would really appreciate the help.
Here are some resources:
SciHadoop (uses netCDF but might be already extended to HDF5).
You can either use JHDF5 or the lower level official Java HDF5 interface to read out data from any HDF5 file in the map-reduce task.
For your first option, you could use a conversion tool like HDF dump to dump the HDF files to text format. Otherwise, you can write a program that uses a Java library to read the HDF files and write them out as text files.
For your second option, SciHadoop is a good example of how to read scientific datasets from Hadoop. It uses the NetCDF-Java library to read NetCDF files. Hadoop does not support the POSIX API for file I/O, so SciHadoop uses an extra software layer to translate the POSIX calls of the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you might go down a somewhat harder path and develop a similar solution yourself.
If you do not find any Java code but can do it in another language, then you can use Hadoop Streaming.
SciMATE is a good option. It is developed based on a variant of MapReduce, which has been shown to perform a lot of scientific applications much more efficiently than Hadoop.

how to output to HDFS from mapper directly?

Under certain criteria we want the mapper to do all the work and write its output to HDFS; we don't want the data transmitted to a reducer (that would use extra bandwidth; please correct me if there are cases where this is wrong).
Pseudocode would be:
def mapper(k, v_list):
    for v in v_list:
        if criteria(v):            # criteria: placeholder predicate
            write_to_hdfs(k, v)    # write_to_hdfs: placeholder for the direct HDFS write
I found it hard because the only thing we can play with is OutputCollector.
One thing I can think of is to extend OutputCollector, override OutputCollector.collect and do the work there.
Is there a better way?
You can just set the number of reduce tasks to 0 by using JobConf.setNumReduceTasks(0). This will make the results of the mapper go straight into HDFS.
From the Map-Reduce manual:
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem,
into the output path set by setOutputPath(Path). The framework does not sort
the map-outputs before writing them out to the FileSystem.
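The same zero-reducer idea should carry over to a streaming job: the mapper script only filters and prints, and the framework writes whatever it prints to the output path. Below is a minimal Python sketch of such a mapper. The criteria function is a made-up placeholder, and the tab-separated key/value layout is an assumption about the input.

```python
def criteria(value):
    """Hypothetical filter; replace with the real condition."""
    return value.startswith("keep")

def run_mapper(lines, out):
    """Read 'key<TAB>value' lines and emit only those that pass the filter."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if criteria(value):
            out.write(key + "\t" + value + "\n")
```

In an actual streaming script this would be called as run_mapper(sys.stdin, sys.stdout), with the job configured for zero reduce tasks (Hadoop Streaming accepts -numReduceTasks 0 or -D mapred.reduce.tasks=0 for map-only jobs).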
I'm assuming that you're using streaming, in which case there is no standard way of doing this.
It's certainly possible in a Java Mapper. For streaming you'd need to amend the PipeMapper Java file or, like you say, write your own output collector; but if you're going to that much trouble, you might as well just write a Java mapper.
Not sending something to the Reducer may not actually save bandwidth if you are still going to write it to HDFS. Data in HDFS is still replicated to other nodes, so the replication traffic is going to happen anyway.
There are other good reasons to write output from the mapper though. There is a FAQ about this, but it is a little short on details except to say that you can do it.
I found another question which is potentially a duplicate of yours. That question has answers that are more helpful if you are writing a Mapper in Java. If you are trying to do this in a streaming way, you can just use the hadoop fs commands in scripts to do it.
We can in fact write output to HDFS and pass it on to Reducer also at the same time. I understand that you are using Hadoop Streaming, I've implemented something similar using Java MapReduce.
We can generate named output files from a Mapper or Reducer using MultipleOutputs. So, in your Mapper implementation, after all the business logic for processing the input data, you can write the output to MultipleOutputs using multipleOutputs.write("NamedOutputFileName", OutputKey, OutputValue), and for the data you want to pass on to the reducer you can write to the context using context.write(OutputKey, OutputValue).
I think if you can find something that writes the data from the mapper to a named output file in the language you are using (e.g. Python), this will definitely work.
I hope this helps.

Hadoop use folder structure as input

I'm a beginner trying to use Hadoop and I guess although I understand the general map-reduce stuff I seem to miss something in the beginning.
Basically I'm trying to parse a website (local) using hadoop and have as result the link structure (so that later I can calculate some page rank).
Thus the input is a folder structure (with subfolder and files) and the output should be, for now, each file with a list of files that link to it.
What InputFormat should I use? The FileInputFormat doesn't seem to work (I get an exception upon encountering a folder, saying it is a directory). Actually, is there an InputFormat that allows for inputting such folder structures?
If not... should I somehow preprocess the input data? Meaning should I take out every HTML file into a single directory and look from it there?
Or, is there a way to write such an InputFormat that does what I need?
"Actually is there such an InputFormat that allows for inputting such folder structures?"
All the FileInputFormats take a Path as an input, which can be a directory or file.
"The FileInputFormat doesn't seem to work (I get an exception upon encountering a folder - saying it is a directory)."
The relevant JIRA has been fixed in some of the releases (0.21, 0.22, 0.23 and trunk). o.a.h.mapred.FileInputFormat should have the addInputPathRecursively method implemented. Also, I noticed that it's not implemented in the new API (o.a.h.mapreduce.FileInputFormat). See the code for the o.a.h.mapred.FileInputFormat class in trunk.
BTW, what release are you using?
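If the release in use does not recurse into subdirectories, one workaround is to flatten the tree yourself before submitting the job, for instance by collecting every HTML file path and passing those as explicit input paths. Here is a minimal Python sketch, assuming the site still sits on a local disk. The function name and the suffix list are illustrative choices.

```python
import os

def list_input_files(root, suffixes=(".html", ".htm")):
    """Walk a directory tree and return all matching file paths, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive suffix match so index.HTML is picked up too.
            if name.lower().endswith(suffixes):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

The resulting list could be copied into a single flat directory or handed to the job as multiple input paths, so the InputFormat never encounters a directory.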
"Basically I'm trying to parse a website (local) using hadoop and have as result the link structure (so that later I can calculate some page rank)."
Because of the media attention and hype, Hadoop is being used for everything. Hadoop as-is works well for some types of problems. Consider using Apache Hama and Giraph for graph processing. Note that both are in the Apache Incubator and their documentation is sparse.
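Whichever framework is used, the output the question asks for (each file with the list of files that link to it) is an inverted link graph, which maps naturally onto a map step and a reduce step. The following Python sketch simulates both steps locally. It is an illustration under simplifying assumptions, not a Hadoop job: hrefs are used verbatim (no resolution of relative paths), and all names are made up for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def map_links(path, html):
    """Map step: emit (target, source) for every link in one file."""
    parser = LinkCollector()
    parser.feed(html)
    return [(href, path) for href in parser.hrefs]

def reduce_links(pairs):
    """Reduce step: group sources by target, as shuffle plus reducer would."""
    inbound = defaultdict(set)
    for target, source in pairs:
        inbound[target].add(source)
    return {target: sorted(sources) for target, sources in inbound.items()}
```

A file that nobody links to never appears as a key, so a real job would need to emit something for such files in the map step if they must show up in the output.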

Storing data to SequenceFile from Apache Pig

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...);
Is there also a library out there that would allow writing to Hadoop sequence files from Pig?
It's just a matter of implementing a StoreFunc to do so.
This is possible now, although it will become a fair bit easier once Pig 0.7 comes out, as it includes a complete redesign of the Load/Store interfaces.
The "Hadoop expansion pack" that Twitter open-sourced on GitHub includes code for generating Load and Store funcs based on Google Protocol Buffers (building on Input/Output formats for the same; you already have those for sequence files, obviously). Check it out if you need examples of how to do some of the less trivial stuff. It should be fairly straightforward though.
This seemed to work for me.
