how to output first row as column qualifier names - hadoop

I am able to process two nodes from an XML file, and I am getting the output below:
bin/hadoop fs -text /user/root/t-output1/part-r-00000
name:ST17925 currentgrade 1.02
name:ST17926 currentgrade 3.0
name:ST17927 currentgrade 3.0
but I need to have an output like:
studentid currentgrade
ST17925 1.02
ST17926 3.00
ST17927 3.00
How can I achieve this?
My complete source code: https://github.com/studhadoop/xml/blob/master/XmlParser11.java
EDIT: Solution
protected void setup(Context context) throws IOException, InterruptedException {
    context.write(new Text("studentid"), new Text("currentgrade"));
}

I think it is difficult to do this from within your MapReduce code itself. The reasons are:
The headers may not be of the same data types as the data rows.
Even if the types are the same, you can write the headers from the setup() method of the Reducer, but there is no guarantee that the header will appear as the first row in the output.
At best, what you can do is create a separate HDFS or local file with the headers in your map code on the first encounter of the column qualifiers. You need to use the appropriate file-system API for creating this file. Later, when the job is complete, you can use these headers in other programs or merge them with the output into a single file.
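To make that last suggestion concrete, here is a minimal, hedged sketch of writing the header line to a separate HDFS file the first time a task knows the column qualifiers. The helper class name and the output path are illustrative only, not part of the original code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HeaderFileHelper {

    // Call once, e.g. from setup() or on the first record, with the header text.
    public static void writeHeaderIfAbsent(Configuration conf, String header) throws IOException {
        Path headerPath = new Path("/user/root/t-output1-header/header.txt"); // hypothetical path
        FileSystem fs = FileSystem.get(conf);
        if (!fs.exists(headerPath)) {
            try (FSDataOutputStream out = fs.create(headerPath, false)) {
                // With overwrite=false, only the first task to create the file succeeds.
                out.writeBytes(header + "\n");
            } catch (IOException alreadyCreated) {
                // Another task created the header file first; nothing more to do.
            }
        }
    }
}

Once the job finishes, the header file can be concatenated with the part-r-* files, for example with hadoop fs -cat or hadoop fs -getmerge.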

Related

How does RecordReader send data to mapper in Hadoop

I'm new to Hadoop and I'm currently learning MapReduce design patterns from Donald Miner and Adam Shook's MapReduce Design Patterns book. This book covers the Cartesian Product pattern. My question is:
When does the record reader send data to the mapper?
Where is the code that sends the data to the mapper?
What I see is that the next() function in the CartesianRecordReader class reads both splits without sending the data.
Here is the source code https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java
That's all, thanks in advance :)
When does the record reader send data to the mapper?
Let me answer by giving you an idea of how the mapper and the RecordReader are related. This is the Hadoop code that sends data to the mapper:
RecordReader<K1, V1> input;
K1 key = input.createKey();
V1 value = input.createValue();

while (input.next(key, value)) {
    // map pair to output
    mapper.map(key, value, output, reporter);
    if (incrProcCount) {
        reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
            SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
    }
}
Basically, Hadoop will call next() until it returns false, and at every call key and value are filled with new values. With a text input, the key is normally the byte offset read so far and the value is the next line in the file.
Where is the code that sends the data to the mapper?
That code is in the Hadoop source (probably in the MapContextImpl class), but it resembles what I wrote in the code snippet above.
EDIT: The source code is in MapRunner.
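For comparison, in the newer org.apache.hadoop.mapreduce API the equivalent loop lives in Mapper.run(), with the RecordReader hidden behind context.nextKeyValue(). The sketch below is a rough rendering of that loop, not a verbatim copy, and the exact implementation may differ slightly between Hadoop versions.

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Rough sketch of how the new-API mapper is driven; the real loop is in Mapper.run().
public class RunLoopSketchMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        try {
            // Each call to nextKeyValue() advances the RecordReader; the current
            // key/value pair is then handed to map().
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}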

How do I make the mapper process the entire file from HDFS

This is the code where I read the file that contains HL7 messages and iterate through them using the HAPI iterator (from http://hl7api.sourceforge.net):
File file = new File("/home/training/Documents/msgs.txt");
InputStream is = new FileInputStream(file);
is = new BufferedInputStream(is);
Hl7InputStreamMessageStringIterator iter = new Hl7InputStreamMessageStringIterator(is);
I want to do this inside the map function. Obviously, I need to prevent splitting in the InputFormat so that the entire file is read at once as a single value and converted to a String (the file size is 7 KB), because, as you know, HAPI can only parse a complete message.
I am a newbie to all of this, so please bear with me.
You will need to implement your own FileInputFormat subclass:
It must override the isSplitable() method to return false, which means the number of mappers will equal the number of input files: one input file per mapper.
You also need to implement the getRecordReader() method (createRecordReader() in the new API). This is exactly the class where you need to put your parsing logic from above.
If you do not want your data file to be split, and instead want a single mapper to process your entire file, then extending a MapReduce InputFormat and overriding the isSplitable() method to return false will help you; that way one file is processed by exactly one mapper.
For reference (not based on your code): https://gist.github.com/sritchie/808035
As the input comes from a text file, you can override the isSplitable() method of FileInputFormat. Using this, one mapper will process the whole file.

@Override
protected boolean isSplitable(JobContext context, Path file) {
    return false;
}
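Putting the advice above together, here is a minimal sketch using the newer org.apache.hadoop.mapreduce API (where getRecordReader() is called createRecordReader()). The class names are illustrative, and you would still feed the resulting bytes to the HAPI iterator inside your own map().

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one mapper per input file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }
}

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private TaskAttemptContext context;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.context = context;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        // Read the whole file into a single BytesWritable value.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() { return NullWritable.get(); }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { /* nothing to close here */ }
}

In the mapper you would then wrap the bytes in a ByteArrayInputStream and hand that stream to Hl7InputStreamMessageStringIterator.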

Model evaluation in Stanford NER

I'm doing a project with the NER module from Stanford CoreNLP and I'm currently having some issues with the evaluation of the model.
I'm using the API to call the functionality from inside a java program instead of using the command line arguments and so far I've managed to train the model from several training files (in a tab-separated format; 2 columns with token and annotation/answer) and to serialize it to a file which was pretty easy.
Now I'm trying to evaluate the model I've trained on some test files (precision, recall, f1) and I'm kinda stuck there. First of all, what format should the test files be in? I'm assuming they should be the same as the training files (tab-separated) which would be the logical thing. I've looked through the JavaDoc documentation for information on how to use the classify method and also had a look at the NERDemo.java. I've managed to get the classifyToString method to work but that doesn't really help me with the evaluation. I've found the classifyAndWriteAnswers(String testFile, DocumentReaderAndWriter<IN> readerWriter, boolean outputScores) method that I assume would give me the precision and recall scores if I set outputScores to true.
However, I can't manage to get this to work. Which DocumentReaderAndWriter should I use as the second argument?
This is what I've got right now:
public static void evaluate(CRFClassifier classifier, File testFile) {
    try {
        classifier.classifyAndWriteAnswers(testFile.getPath(), new PlainTextDocumentReaderAndWriter(), true);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
This is what I get:
Unchecked call to 'classifyAndWriteAnswers(String, DocumentReaderAndWriter<IN>, boolean)' as a member of raw type 'edu.stanford.nlp.ie.AbstractSequenceClassifier'
Also, do I pass the path to the test file as the first argument or rather the file itself loaded into a String? Some help would be greatly appreciated.
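For what it's worth, here is a hedged sketch of one way this is commonly resolved: parameterize the classifier to remove the raw-type warning and let it build a reader matching its own training flags via makeReaderAndWriter(). Whether makeReaderAndWriter() reads your tab-separated test files the way you expect depends on your CoreNLP version and flags, so treat this as an assumption to verify rather than a confirmed answer.

import java.io.File;
import java.io.IOException;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerEvaluationSketch {

    public static void evaluate(CRFClassifier<CoreLabel> classifier, File testFile) {
        try {
            // Assumption: makeReaderAndWriter() constructs the DocumentReaderAndWriter from
            // the classifier's own properties, so the test file is read the same way as the
            // training data. The first argument is the path to the file, not its contents.
            classifier.classifyAndWriteAnswers(testFile.getPath(),
                    classifier.makeReaderAndWriter(), true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}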

read csv file format in hadoop using java code

How do I read CSV files in Hadoop using Java code in the Eclipse IDE?
I have a very large file in CSV format, and I want to access the CSV file in HDFS in order to run a MapReduce program on it. Could anyone kindly help me solve this problem?
I want Java code to access the file.
Thanks in advance.
You can pass the file as input to the mapper. The lines of the file will become the values passed to the mapper.

class FileMapper extends Mapper<LongWritable, Text, Text, Text> { // output types are just an example

    @Override
    public void map(LongWritable key, Text value, Context context) {
        // process your CSV records here; each value is one line of the file
    }
}
Alternatively, create your own CSVInputFormat.
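As a concrete starting point, here is a minimal, hedged sketch of a mapper that splits each CSV line into fields. The output types and the choice of the first column as the key are illustrative assumptions, and the naive split(",") does not handle quoted fields.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CsvFieldMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Naive split on commas; does not handle quoted fields or embedded commas.
        String[] fields = line.split(",");
        if (fields.length > 1) {
            // Emit the first column as the key and the rest of the line as the value.
            context.write(new Text(fields[0]),
                    new Text(line.substring(fields[0].length() + 1)));
        }
    }
}

With the default TextInputFormat this is all that is needed; the HDFS input path is supplied with FileInputFormat.addInputPath() in the driver.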

processing multiple files in minimum time

I am new to Hadoop. Basically, I am writing a program which takes two multi-FASTA files (ref.fasta, query.fasta), which are 3+ GB.
ref.fasta:
gi|12345
ATATTATAGGACACCAATAAAATT..
gi|5253623
AATTATCGCAGCATTA...
..and so on..
query.fasta:
query
ATTATTTAAATCTCACACCACATAATCAATACA
AATCCCCACCACAGCACACGTGATATATATACA
CAGACACA...
Now, to each mapper I need to give a single part of the ref file and the whole query file, i.e.
gi|12345
ATATTATAGGACACCAATA....
(a single FASTA sequence from the ref file)
AND the entire query file, because I want to run an exe inside the mapper which takes both of these as input.
So do I process ref.fasta outside and then give it to the mapper? Or something else?
I just need the approach which will take minimum time.
Thanks.
The best approach for your use case may be to have the query file in the distributed cache and get the file object ready in configure()/setup() to be used in map(), and to have the ref file as normal input.
You may do the following:
In your run() add the query file to the distributed cache:
DistributedCache.addCacheFile(new URI(queryFile-HDFS-Or-S3-Path), conf);
Now have the mapper class look something like the following:

public static class MapJob extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private File queryFile;

    @Override
    public void configure(JobConf job) {
        try {
            // The first (and only) cached file is the query file added in run().
            Path queryFilePath = DistributedCache.getLocalCacheFiles(job)[0];
            queryFile = new File(queryFilePath.toString());
        } catch (IOException e) {
            throw new RuntimeException("Could not load the query file from the distributed cache", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Use the queryFile object and [key, value] from your ref file here to run the exe file as desired.
    }
}
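On newer Hadoop versions the old DistributedCache calls are deprecated; below is a rough sketch of the same idea with the newer API. The assumption that the cached file is symlinked into the task's working directory under its original file name should be verified for your setup.

import java.io.File;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In the driver, the query file would be registered with: job.addCacheFile(new URI(queryFileHdfsPath));
public class QueryAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private File queryFile;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // getCacheFiles() returns the URIs registered in the driver; the localized copy
        // is expected as a symlink in the task's working directory.
        URI[] cacheFiles = context.getCacheFiles();
        String fileName = new Path(cacheFiles[0].getPath()).getName();
        queryFile = new File(fileName);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Combine the ref record in 'value' with queryFile here, e.g. when invoking the exe.
    }
}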
I faced a similar problem.
I'd suggest you pre-process your ref file and split it into multiple files (one per sequence).
Then copy those files to a folder on HDFS that you will set as your input path in your main method.
Then implement a custom input format class and a custom record reader class. Your record reader will just pass the name of the local file split path (as a Text value) to either the key or value parameter of your map method.
For the query file that is required by all map functions, again add your query file to HDFS and then add it to the DistributedCache in your main method.
In your map method you'll then have access to both local file paths and can pass them to your exe.
Hope that helps.
I had a similar problem and eventually re-implemented the functionality of the BLAST exe file so that I didn't need to deal with reading files in my map method and could instead deal entirely with Java objects (Genes and Genomes) that are parsed from the input files by my custom record reader and then passed as objects to my map function.
Cheers, Wayne.
