Using Distributed Cache with Pig on Elastic Map Reduce - hadoop

I am trying to run my Pig script (which uses UDFs) on Amazon's Elastic Map Reduce.
I need to use some static files from within my UDFs.
I do something like this in my UDF:
public class MyUDF extends EvalFunc<DataBag> {

    public DataBag exec(Tuple input) throws IOException {
        ...
        FileReader fr = new FileReader("./myfile.txt");
        ...
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("s3://path/to/myfile.txt#myfile.txt");
        return list;
    }
}
I have stored the file in my S3 bucket at /path/to/myfile.txt.
However, on running my Pig job, I see an exception:
Got an exception java.io.FileNotFoundException: ./myfile.txt (No such file or directory)
So, my question is: how do I use distributed cache files when running a Pig script on Amazon's EMR?
EDIT: I figured out that pig-0.6, unlike pig-0.9, does not have a function called getCacheFiles(). Amazon does not support pig-0.9, so I need to figure out a different way to get the distributed cache to work in 0.6.

I think adding this extra arg to the Pig command line call should work (with s3 or s3n, depending on where your file is stored):
-cacheFile s3n://bucket_name/file_name#cache_file_name
You should be able to add that in the "Extra Args" box when creating the Job flow.

Related

Spring batch to upload a CSV file and insert into database

My project has a requirement where a user uploads a CSV file which has to be pushed to a SQL Server database.
I know we can use Spring Batch to process a large number of records. But I'm not able to find any tutorial/sample code for this requirement of mine.
All the tutorials I came across just hardcode the CSV file name and use in-memory databases, like the one below:
https://spring.io/guides/gs/batch-processing/
The user input file is available in a shared drive location at a scheduled time, with a file name prefix like stack_overlfow_dd-MM-yyyy HH:mm. On a daily basis, how can I poll the network shared drive every 5-10 minutes, for at least one hour each day, and upload the file to the database if it matches the regex?
How can I first take the CSV file from the shared location, store it in memory or somewhere, and then configure Spring Batch to read it as input?
Any help here would be appreciated. Thanks in advance.
All the tutorials I came across just hardcode the CSV file name and use in-memory databases
You can find samples in the official repo here. Here is an example where the input file name is not hardcoded but passed as a job parameter.
How can I first take the CSV file from the shared location, store it in memory or somewhere, and then configure Spring Batch to read it as input?
You can proceed in two steps: download the file locally then read/process/write it to the database (See https://stackoverflow.com/a/52110781/5019386).
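To make that concrete, here is a minimal sketch of a reader whose input file path is passed in as a job parameter rather than hardcoded (the Person class, the inputFile parameter name, and the column names are assumptions for illustration):

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

// Hypothetical domain class mapped from each CSV row.
public class Person {
    private String firstName;
    private String lastName;
    public void setFirstName(String firstName) { this.firstName = firstName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
}

// In the batch @Configuration class:
@Bean
@StepScope
public FlatFileItemReader<Person> reader(
        @Value("#{jobParameters['inputFile']}") String inputFile) {
    // Maps each delimited line onto a Person via its setters.
    BeanWrapperFieldSetMapper<Person> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Person.class);
    return new FlatFileItemReaderBuilder<Person>()
            .name("personItemReader")
            .resource(new FileSystemResource(inputFile)) // path supplied at job launch time
            .delimited()
            .names(new String[]{"firstName", "lastName"})
            .fieldSetMapper(fieldSetMapper)
            .build();
}

The same reader works whether the file was downloaded locally in a previous step or copied from the share beforehand.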
How can I poll the network shared drive every 5-10 minutes, for at least one hour each day, and upload the file to the database if it matches the regex?
Once you have defined your job, you can schedule it to run when you want using:
a scheduler like Quartz
or using Spring's task scheduling features,
or using a combination of Spring Integration and Spring Batch: Spring Integration would poll the directory and then launch a Spring Batch job when appropriate. This approach is described here.
More details on job scheduling here.
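As an illustration of the Spring task-scheduling option, here is a minimal sketch that launches the job every 5 minutes during one hour each day (it assumes @EnableScheduling is active and that a Job bean named csvImportJob exists; the cron expression is only an example):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class CsvImportJobScheduler {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job csvImportJob; // hypothetical job bean

    // Every 5 minutes between 09:00 and 09:55; adjust the cron to your window.
    @Scheduled(cron = "0 0/5 9 * * *")
    public void launchJob() throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addLong("run.id", System.currentTimeMillis()) // make each run's parameters unique
                .toJobParameters();
        jobLauncher.run(csvImportJob, params);
    }
}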
You can create a service layer that processes the Excel file, reads the data from it, and constructs Java objects to save into the DB. Here I have used Apache POI to parse the Excel data and read from the Excel sheet.
public class FileUploadService {

    @Autowired
    FileUploadDao fileUploadDao;

    public String uploadFileData(String inputFilePath) {
        Workbook workbook = null;
        Sheet sheet = null;
        try {
            workbook = getWorkBook(new File(inputFilePath));
            sheet = workbook.getSheetAt(0);

            /* Build the header portion of the output file */
            String headerDetails = "EmployeeId,EmployeeName,Address,Country";
            String[] headerNames = headerDetails.split(",");

            /* Read and process each row */
            ArrayList<ExcelTemplateVO> employeeList = new ArrayList<>();
            Iterator<Row> rowIterator = sheet.iterator();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                // Read and process each column in the row
                ExcelTemplateVO excelTemplateVO = new ExcelTemplateVO();
                int count = 0;
                while (count < headerNames.length) {
                    String methodName = "set" + headerNames[count];
                    String inputCellValue = getCellValueBasedOnCellType(row, count++);
                    setValueIntoObject(excelTemplateVO, ExcelTemplateVO.class, methodName,
                            "java.lang.String", inputCellValue);
                }
                employeeList.add(excelTemplateVO);
            }
            fileUploadDao.saveFileDataInDB(employeeList);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        return "Success";
    }
}
I believe your question has already been answered here.
The author of the question has even uploaded a repository of his working result:
https://github.com/PriyankaBolisetty/SpringBatchUploadCSVFileToDatabase/tree/master/src/main/java/springbatch_example
You can retrieve and filter file lists on a shared drive using the JCIFS API method SmbFile.listFiles(String wildcard).
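For example, a small sketch using the classic JCIFS (1.x) API; the share URL and the wildcard pattern are made up for illustration:

import java.net.MalformedURLException;
import jcifs.smb.SmbException;
import jcifs.smb.SmbFile;

public class SharedDriveScanner {
    public static void main(String[] args) throws MalformedURLException, SmbException {
        // Hypothetical share; credentials can also be embedded as smb://user:password@host/share/
        SmbFile dir = new SmbFile("smb://fileserver/reports/");
        // The wildcard supports * and ?; pick a pattern matching your daily file name prefix.
        for (SmbFile file : dir.listFiles("stack_overlfow_*")) {
            System.out.println("Candidate file: " + file.getName());
        }
    }
}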

Sequence file reading issue using spark Java

I am trying to read a sequence file generated by Hive using Spark. When I try to access the file, I get org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:
I have tried the workarounds for this issue, like making the class serializable, but I still face it. I am writing the code snippet here; please let me know what I am missing.
Is it the BytesWritable data type or something else that is causing the issue?
JavaPairRDD<BytesWritable, Text> fileRDD = javaCtx.sequenceFile("hdfs://path_to_the_file", BytesWritable.class, Text.class);
List<String> result = fileRDD.map(new Function<Tuple2<BytesWritable, Text>, String>() {
    public String call(Tuple2<BytesWritable, Text> row) {
        return row._2.toString() + "\n";
    }
}).collect();
Here is what was needed to make it work.
Because we use HBase to store our data and this reducer outputs its result to an HBase table, Hadoop is telling us that it doesn't know how to serialize our data. That is why we need to help it. Inside setUp, set the io.serializations variable.
You can do the same in Spark accordingly:
conf.setStrings("io.serializations", new String[]{hbaseConf.get("io.serializations"), MutationSerialization.class.getName(), ResultSerialization.class.getName()});
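For context, here is a rough sketch of where that conf can come from in a Spark job: it sets the property on the Hadoop Configuration that sequenceFile() uses, referencing the HBase serializer classes by their fully qualified names as in the line above (the app name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SequenceFileJob {
    public static void main(String[] args) {
        JavaSparkContext javaCtx =
                new JavaSparkContext(new SparkConf().setAppName("sequence-file-reader"));

        // Hadoop configuration used by javaCtx.sequenceFile(); keep any serializers
        // already configured and append the HBase ones.
        Configuration conf = javaCtx.hadoopConfiguration();
        conf.setStrings("io.serializations",
                conf.get("io.serializations"),
                "org.apache.hadoop.hbase.mapreduce.MutationSerialization",
                "org.apache.hadoop.hbase.mapreduce.ResultSerialization");

        // ... then build fileRDD and collect the values as in the question.
    }
}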

How to test hadoop mapreduce with hdfs?

I am using MRUnit to write unit tests for my MapReduce jobs.
However, I am having trouble bringing HDFS into the mix. My MR job needs a file from HDFS. How do I mock out the HDFS part in an MRUnit test case?
Edit:
I know that I can specify inputs/expectedOutput for my MR code in the test infrastructure. However, that is not what I want. My MR job needs to read another file, containing domain data, to do its job. This file is in HDFS. How do I mock out this file?
I tried using Mockito but it didn't work. The reason is that FileSystem.open() returns an FSDataInputStream, which implements several other interfaces besides extending java.io.InputStream. It was too painful to mock out all the interfaces, so I hacked around it in my code by doing the following:
if (System.getProperty("junit_running") != null) {
    inputStream = this.getClass().getClassLoader().getResourceAsStream("domain_data.txt");
    br = new BufferedReader(new InputStreamReader(inputStream));
} else {
    Path pathToRegionData = new Path("/domain_data.txt");
    LOG.info("checking for existence of region assignment file at path: " + pathToRegionData.toString());
    if (!fileSystem.exists(pathToRegionData)) {
        LOG.error("domain file does not exist at path: " + pathToRegionData.toString());
        throw new IllegalArgumentException("region assignments file does not exist at path: " + pathToRegionData.toString());
    }
    inputStream = fileSystem.open(pathToRegionData);
    br = new BufferedReader(new InputStreamReader(inputStream));
}
This solution is not ideal because I had to put test specific code in my production code. I am still waiting to see if there is an elegant solution out there.
Please follow this small MRUnit tutorial:
https://github.com/malli3131/HadoopTutorial/blob/master/MRUnit/Tutorial
In an MRUnit test case, we supply the data inside the testMapper() and testReducer() methods, so there is no need for input from HDFS in an MRUnit test. Only real MapReduce jobs require data inputs from HDFS.
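As a small illustration, here is a sketch of an MRUnit test where the input records are handed straight to the driver rather than read from HDFS (the word-count style mapper is made up for the example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class MyMapperTest {

    // Hypothetical mapper under test: emits (token, 1) for each token in the line.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                context.write(new Text(token), new IntWritable(1));
            }
        }
    }

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new MyMapper());
    }

    @Test
    public void testMapper() throws IOException {
        mapDriver.withInput(new LongWritable(0), new Text("foo foo"))
                 .withOutput(new Text("foo"), new IntWritable(1))
                 .withOutput(new Text("foo"), new IntWritable(1))
                 .runTest();
    }
}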

Hive setup()-like functionality similar to Mapper setup()?

I want to replace a Hadoop job with Hive. My challenge is that in Hadoop I'm using setup() to build a kdtree by reading in reference data (points of interest) from the distributed cache. I then use the kdtree in map() to evaluate the distance of the target data against the kdtree.
In Hive, I wanted to use a UDF with an evaluate() method to determine the distance, but I don't know how to set up the kdtree with the reference data. Is this possible?
I probably don't have the entire answer, so I'm just going to throw out some ideas that might be of help.
You can add files to the distributed cache in Hive using ADD FILE ...
Hive 11+ (I think) should let you access the distributed cache in GenericUDF.initialize:
https://issues.apache.org/jira/browse/HIVE-1016 which references...
https://issues.apache.org/jira/browse/HIVE-3628
So when you initialize the UDF, you might be able to build your kdtree by accessing the file you added in the distributed cache.
As climbage says, the ADD FILE command adds the file to the distributed cache.
You can access the distributed cache in your UDF simply by opening a file which is in the current directory,
i.e. open(new File(System.getProperty("user.dir") + "/myfile"));
You can use a ConstantObjectInspector to access the filename in the initialize method of GenericUDF, where you can open the file and read it into your in-memory data structure.
The distributed_map UDF of Brickhouse does something similar ( https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java )
Something like
public ObjectInspector initialize(ObjectInspector[] inspArr) {
    ConstantObjectInspector fileNameInsp = (ConstantObjectInspector) inspArr[0];
    String fileName = fileNameInsp.getWritableConstantValue().toString();
    FileInputStream inFile = new FileInputStream("./" + fileName);
    doStuff(inFile);
    .....
}

processing multiple files in minimum time

I am new to Hadoop. Basically, I am writing a program which takes two multi-FASTA files (ref.fasta and query.fasta) which are 3+ GB.
ref.fasta:
>gi|12345
ATATTATAGGACACCAATAAAATT..
>gi|5253623
AATTATCGCAGCATTA...
..and so on..
query.fasta:
>query
ATTATTTAAATCTCACACCACATAATCAATACA
AATCCCCACCACAGCACACGTGATATATATACA
CAGACACA...
Now, to each mapper I need to give a single part of the ref file and the whole query file,
i.e.
>gi|12345
ATATTATAGGACACCAATA....
(a single FASTA sequence from the ref file)
AND the entire query file, because I want to run an exe inside the mapper which takes both of these as input.
So do I process ref.fasta outside and then give it to the mapper? Or something else?
I just need the approach which will take minimum time.
Thanks.
The best approach for your use case may be to have the query file in the distributed cache and get the file object ready in configure()/setup() so it can be used in map(), and to have the ref file as the normal input.
You may do the following:
In your run() add the query file to the distributed cache:
DistributedCache.addCacheFile(new URI(queryFile-HDFS-Or-S3-Path), conf);
Now make the mapper class look something like the following:
public static class MapJob extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    File queryFile;

    @Override
    public void configure(JobConf job) {
        try {
            Path queryFilePath = DistributedCache.getLocalCacheFiles(job)[0];
            queryFile = new File(queryFilePath.toString());
        } catch (IOException e) {
            throw new RuntimeException("Could not resolve the query file from the distributed cache", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Use the queryFile object and [key, value] from your ref file here to run the exe file as desired.
    }
}
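For completeness, a rough sketch of how the driver might wire this together with the old mapred API; the job name and the use of program arguments for the paths are assumptions:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FastaDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FastaDriver.class);
        conf.setJobName("fasta-compare");

        conf.setMapperClass(MapJob.class);   // the mapper sketched above
        conf.setNumReduceTasks(0);           // map-only job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // ref.fasta (or its splits)
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output directory

        // Ship query.fasta (already on HDFS/S3) to every mapper via the distributed cache.
        DistributedCache.addCacheFile(new URI(args[2]), conf);

        JobClient.runJob(conf);
    }
}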
I faced a similar problem.
I'd suggest you pre-process your ref file and split it into multiple files (one per sequence).
Then copy those files to a folder on HDFS that you will set as your input path in your main method.
Then implement a custom InputFormat class and a custom RecordReader class. Your record reader will just pass the name of the local file split path (as a Text value) to either the key or value parameter of your map method.
For the query file that is required by all map functions, again add your query file to HDFS and then add it to the DistributedCache in your main method.
In your map method you'll then have access to both local file paths and can pass them to your exe.
Hope that helps.
I had a similar problem and eventually re-implemented the functionality of the BLAST exe file so that I didn't need to deal with reading files in my map method, and could instead deal entirely with Java objects (Genes and Genomes) that are parsed from the input files by my custom record reader and then passed as objects to my map function.
Cheers, Wayne.
