Read CSV file format in Hadoop using Java code

How do I read CSV files in Hadoop using Java code in the Eclipse IDE?
I have a very large file in CSV format, and I want to access it in HDFS in order to run a MapReduce program. Could anyone please help me solve this problem?
I need Java code to access the file.
Thanks in advance.

You can pass the file as input to the mapper. Each line of the file becomes a value passed to the mapper's map() method.
public class FileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Process your CSV record here, e.g. String[] fields = value.toString().split(",");
    }
}
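For completeness, here is a minimal driver sketch (the class name CsvJobDriver, the map-only setup, and the Text output types are assumptions, not taken from the question) showing how the CSV file in HDFS can be wired to this mapper; TextInputFormat hands the mapper one line of the file per map() call.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "csv-read");
        job.setJarByClass(CsvJobDriver.class);
        job.setMapperClass(FileMapper.class);
        job.setInputFormatClass(TextInputFormat.class); // each CSV line becomes one map() value
        job.setNumReduceTasks(0);                       // map-only job for this example
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the CSV file (or folder) in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}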

Alternatively, you can create your own CSVInputFormat if you need custom handling of your CSV records.

Related

Sequence file reading issue using spark Java

I am trying to read a sequence file generated by Hive using Spark. When I try to access the file, I get org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException.
I have tried the usual workarounds for this issue, such as making the class serializable, but I still face it. I am writing the code snippet here; please let me know what I am missing.
Is it the BytesWritable data type or something else that is causing the issue?
JavaPairRDD<BytesWritable, Text> fileRDD =
        javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
List<String> result = fileRDD.map(new Function<Tuple2<BytesWritable, Text>, String>() {
    public String call(Tuple2<BytesWritable, Text> row) {
        return row._2().toString() + "\n";
    }
}).collect();
Here is what was needed to make it work.
Because we use HBase to store our data and this reducer outputs its result to an HBase table, Hadoop is telling us that it does not know how to serialize our data. That is why we need to help it: inside setUp(), set the io.serializations variable.
You can do the same in Spark accordingly:
conf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});
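A hedged sketch of where that setting might live in a Spark driver; the application name, the hbaseConf construction via HBaseConfiguration.create(), and the assumption that the HBase MutationSerialization/ResultSerialization classes are available on the classpath are illustrative, not taken from the question.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.MutationSerialization;
import org.apache.hadoop.hbase.mapreduce.ResultSerialization;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class SequenceFileReader {
    public static void main(String[] args) {
        JavaSparkContext javaCtx = new JavaSparkContext(new SparkConf().setAppName("read-sequence-file"));

        // Register the HBase serializations on the Hadoop configuration used by Spark
        // before the sequence file is read.
        Configuration conf = javaCtx.hadoopConfiguration();
        Configuration hbaseConf = HBaseConfiguration.create();
        conf.setStrings("io.serializations",
                hbaseConf.get("io.serializations"),
                MutationSerialization.class.getName(),
                ResultSerialization.class.getName());

        JavaPairRDD<BytesWritable, Text> fileRDD =
                javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
        List<String> result = fileRDD.map(new Function<Tuple2<BytesWritable, Text>, String>() {
            public String call(Tuple2<BytesWritable, Text> row) {
                return row._2().toString() + "\n";
            }
        }).collect();

        System.out.println(result.size() + " records read");
        javaCtx.stop();
    }
}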

How do I make the mapper process the entire file from HDFS

This is the code where I read a file that contains HL7 messages and iterate through them using the HAPI iterator (from http://hl7api.sourceforge.net):
File file = new File("/home/training/Documents/msgs.txt");
InputStream is = new FileInputStream(file);
is = new BufferedInputStream(is);
Hl7InputStreamMessageStringIterator iter = new Hl7InputStreamMessageStringIterator(is);
I want to do this inside the map function. Obviously I need to prevent splitting in the InputFormat so that the entire file is read at once as a single value and converted to a String (the file size is 7 KB), because, as you know, HAPI can only parse complete messages.
I am a newbie to all of this, so please bear with me.
You will need to implement your own FileInputFormat subclass:
It must override the isSplitable() method to return false, which means that the number of mappers will be equal to the number of input files: one input file per mapper.
You also need to implement the getRecordReader() method (createRecordReader() in the new API). That is exactly the class where you put your parsing logic from above; a sketch of such a subclass follows.
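A minimal sketch of that input format, assuming the new org.apache.hadoop.mapreduce API; the names WholeFileInputFormat and WholeFileRecordReader are illustrative. The record reader hands the mapper the whole file as a single BytesWritable value.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // one mapper per input file, never split
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit fileSplit;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file into a single value.
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}
In the mapper you could then wrap the value in a stream, for example new Hl7InputStreamMessageStringIterator(new ByteArrayInputStream(value.getBytes(), 0, value.getLength())), to iterate over all messages of one file.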
If you do not want your data file to be split, i.e. you want a single mapper to process the entire file so that one file is handled by exactly one mapper, then extending the InputFormat and overriding its isSplitable() method to return false will help you.
For reference (not based on your code):
https://gist.github.com/sritchie/808035
Since the input comes from a text file, you can override the isSplitable() method of FileInputFormat. With this, one mapper will process the whole file.
@Override
protected boolean isSplitable(JobContext context, Path file) {
    return false;
}

how to output first row as column qualifier names

I am able to process two nodes from an XML file, and I am getting the output below:
bin/hadoop fs -text /user/root/t-output1/part-r-00000
name:ST17925 currentgrade 1.02
name:ST17926 currentgrade 3.0
name:ST17927 currentgrade 3.0
but I need to have an output like:
studentid currentgrade
ST17925 1.02
ST17926 3.00
ST17927 3.00
How can I achieve this?
My complete source code: https://github.com/studhadoop/xml/blob/master/XmlParser11.java
EDIT: Solution
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    context.write(new Text("studentid"), new Text("currentgrade"));
}
I think it is difficult to do this within the MapReduce code itself, for two reasons:
The headers may not be of the same data types as the values.
Even if the types are the same, you can write the headers from the setup() method of the Reducer, but there is no guarantee that the headers will appear as the first row in the output.
At best, you can create a separate HDFS or local file with the headers in your map code on the first encounter of the column qualifiers, using the appropriate file operations API. Later, when the job is complete, you can use these headers in other programs or merge them together with the output into a single file (a sketch follows below).
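A hedged sketch of that idea; the class name, the header path /user/root/t-output1-header/header.txt, and the tab-separated header text are assumptions. The header file is created from setup() with overwrite disabled, so only the first task to reach it actually writes it.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class XmlMapperWithHeader extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        Path headerPath = new Path("/user/root/t-output1-header/header.txt"); // assumed location
        if (!fs.exists(headerPath)) {
            try (FSDataOutputStream out = fs.create(headerPath, false)) {     // no overwrite
                out.writeBytes("studentid\tcurrentgrade\n");                  // the header row
            } catch (IOException alreadyCreated) {
                // Another task created the header first; nothing more to do.
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The existing XML parsing and context.write(...) calls go here.
    }
}
After the job finishes, something like hadoop fs -getmerge or a small script can concatenate the header file and the part-r-* files into one output file.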

processing multiple files in minimum time

I am new to Hadoop. Basically, I am writing a program which takes two multi-FASTA files (ref.fasta, query.fasta), which are 3+ GB...
ref.fasta:
gi|12345
ATATTATAGGACACCAATAAAATT..
gi|5253623
AATTATCGCAGCATTA...
..and so on..
query.fasta:
query
ATTATTTAAATCTCACACCACATAATCAATACA
AATCCCCACCACAGCACACGTGATATATATACA
CAGACACA...
Now, to each mapper I need to give a single part of the ref file and the whole query file,
i.e.
gi|12345
ATATTATAGGACACCAATA....
(a single FASTA sequence from the ref file)
AND the entire query file, because I want to run an exe inside the mapper which takes both of these as input.
So, do I process ref.fasta outside and then give it to the mapper, or something else?
I just need the approach that will take the minimum time.
Thanks.
The best approach for your use case may be to put the query file in the distributed cache and get the file object ready in configure()/setup() so it can be used in map(), and to pass the ref file as normal input.
You may do the following:
In your run() add the query file to the distributed cache:
DistributedCache.addCacheFile(new URI(queryFile-HDFS-Or-S3-Path), conf);
Now have the mapper class look something like the following:
public static class MapJob extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private File queryFile;

    @Override
    public void configure(JobConf job) {
        try {
            // The first (and only) cached file is the query file added in run().
            Path queryFilePath = DistributedCache.getLocalCacheFiles(job)[0];
            queryFile = new File(queryFilePath.toString());
        } catch (IOException e) {
            throw new RuntimeException("Could not read the query file from the distributed cache", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Use the queryFile object and [key, value] from your ref file here to run the exe file as desired.
    }
}
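Putting both pieces together, here is a hedged driver sketch using the old mapred API to match the mapper above; the class name FastaDriver, the map-only setup, and the Text output types are assumptions, the cache URI placeholder is kept from the snippet above, and MapJob is assumed to be visible to the driver (for example nested in the same class).
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FastaDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FastaDriver.class);
        conf.setJobName("ref-vs-query");

        // Ship the query file to every node; it shows up as getLocalCacheFiles(job)[0] in configure().
        DistributedCache.addCacheFile(new URI("queryFile-HDFS-Or-S3-Path"), conf);

        conf.setMapperClass(MapJob.class);
        conf.setNumReduceTasks(0);                      // map-only for this example
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // ref file (or per-sequence folder)
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}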
I faced a similar problem.
I'd suggest you pre-process your ref file and split it into multiple files (one per sequence).
Then copy those files to a folder on HDFS that you will set as your input path in your main method.
Then implement a custom input format class and a custom record reader class. Your record reader will just pass the path of the local file split (as a Text value) to either the key or the value parameter of your map method.
For the query file, which is required by all map functions, add it to HDFS and then add it to the DistributedCache in your main method.
In your map method you'll then have access to both local file paths and can pass them to your exe (see the sketch after this answer).
Hope that helps.
I had a similar problem and eventually re-implemented the functionality of the BLAST exe file, so that I didn't need to deal with reading files in my map method and could instead work entirely with Java objects (Genes and Genomes) that are parsed from the input files by my custom record reader and then passed as objects to my map function.
Cheers, Wayne.
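A hedged sketch of the "pass them to your exe" step; the tool name ./align.exe, its argument order, and the helper class are illustrative assumptions. The helper runs the external program on one ref split plus the cached query file and returns its output lines, so that map() can emit them, e.g. with output.collect(key, new Text(line)) for each returned line.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class ExternalAligner {

    // Runs the external tool on one ref sequence file and the cached query file,
    // returning the tool's stdout lines so the caller (e.g. map()) can emit them.
    public static List<String> run(String refSplitPath, String queryFilePath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("./align.exe", refSplitPath, queryFilePath);
        pb.redirectErrorStream(true);               // merge stderr into stdout
        Process p = pb.start();
        List<String> lines = new ArrayList<String>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return lines;
    }
}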

Using Distributed Cache with Pig on Elastic Map Reduce

I am trying to run my Pig script (which uses UDFs) on Amazon's Elastic Map Reduce.
I need to use some static files from within my UDFs.
I do something like this in my UDF:
public class MyUDF extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        ...
        FileReader fr = new FileReader("./myfile.txt");
        ...
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("s3://path/to/myfile.txt#myfile.txt");
        return list;
    }
}
I have stored the file in my S3 bucket at /path/to/myfile.txt.
However, on running my Pig job, I see an exception:
Got an exception java.io.FileNotFoundException: ./myfile.txt (No such file or directory)
So, my question is: how do I use distributed cache files when running a Pig script on Amazon's EMR?
EDIT: I figured out that Pig 0.6, unlike Pig 0.9, does not have a function called getCacheFiles(). Amazon does not support Pig 0.9, so I need to figure out a different way to get the distributed cache working in 0.6.
I think adding this extra argument to the Pig command-line call should work (with s3 or s3n, depending on where your file is stored):
-cacheFile s3n://bucket_name/file_name#cache_file_name
You should be able to add that in the "Extra Args" box when creating the job flow.
