If the input files are in XML format, I shouldn't be using TextInputFormat, because TextInputFormat assumes each record sits on its own line of the input file and the Mapper class is called once per line to produce a key/value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
Since I'm new to Hadoop MapReduce, is there any article/link/video that shows the steps to build a custom input format?
thanks
nath
Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So what I mean is that there is no need for a custom input format, since the Mahout library already provides one.
I am not sure whether you want to read or write XML, but both are described in the link above.
Please have a look at the XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat.
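Since it extends TextInputFormat, plugging it into a job is mostly a matter of setting the start/end tag properties and the input format class. Here is a rough sketch of a driver plus a pass-through mapper; the <record> tags and the paths are placeholders, and the exact package of XmlInputFormat depends on your Mahout version, so treat the import (and the xmlinput.start / xmlinput.end keys) as assumptions to verify against the version you use:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Adjust to wherever XmlInputFormat lives in your Mahout version
// (it has moved between packages across releases).
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlJobDriver {

    // Receives one complete <record>...</record> element per call and
    // simply passes it through; real XML parsing would happen here.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text xmlRecord, Context context)
                throws IOException, InterruptedException {
            context.write(key, xmlRecord);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell XmlInputFormat which tags delimit one record.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = Job.getInstance(conf, "xml-example");
        job.setJarByClass(XmlJobDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                 // map-only for this sketch
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}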
As the name suggests, I'm looking for some tool which will convert existing data from a Hadoop sequence file to JSON format.
My initial googling has only turned up results related to jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get the data in the corresponding JSON format.
So in effect, I'm looking for some tool/utility which will take a Hadoop sequence file as input and produce output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Java Mapper program that uses, say, Jackson to serialize each key/value pair it sees? That would be a pretty easy program to write.
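For what it's worth, here is a rough sketch of what I mean, using the newer mapreduce API and Jackson 2's ObjectMapper. It assumes the key and value classes in your sequence file produce something useful from toString(); the class name is just a placeholder:

import java.io.IOException;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SequenceFileToJson {

    // Emits one JSON object per key/value pair in the sequence file.
    public static class JsonMapper
            extends Mapper<Writable, Writable, NullWritable, Text> {
        private final ObjectMapper json = new ObjectMapper();

        @Override
        protected void map(Writable key, Writable value, Context context)
                throws IOException, InterruptedException {
            ObjectNode node = json.createObjectNode();
            node.put("key", key.toString());
            node.put("value", value.toString());
            context.write(NullWritable.get(), new Text(json.writeValueAsString(node)));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "seqfile-to-json");
        job.setJarByClass(SequenceFileToJson.class);
        job.setMapperClass(JsonMapper.class);
        job.setNumReduceTasks(0);                 // map-only job
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}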
I thought there must be some tool which does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but then again, why do so if something already exists that does just that?
Anyway, I figured out how to do it using jaql. Here is a sample query which worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});
I am a complete beginner with Pig. I have installed CDH4 Pig and I am connected to a CDH4 cluster. We need to process web log files that are going to be massive (the files are already being loaded to HDFS). Unfortunately the log syntax is quite involved (not a typical comma-delimited file). A restriction is that I cannot currently pre-process the log files with some other tool, because they are just too huge and I can't afford to store a copy. Here is a raw line from the logs:
"2013-07-02 16:17:12
-0700","?c=Thing.Render&d={%22renderType%22:%22Primary%22,%22renderSource%22:%22Folio%22,%22things%22:[{%22itemId%22:%225442f624492068b7ce7e2dd59339ef35%22,%22userItemId%22:%22873ef2080b337b57896390c9f747db4d%22,%22listId%22:%22bf5bbeaa8eae459a83fb9e2ceb99930d%22,%22ownerId%22:%222a4034e6b2e800c3ff2f128fa4f1b387%22}],%22redirectId%22:%22tgvm%22,%22sourceId%22:%226da6f959-8309-4387-84c6-a5ddc10c22dd%22,%22valid%22:false,%22pageLoadId%22:%224ada55ef-4ea9-4642-ada5-d053c45c00a4%22,%22clientTime%22:%222013-07-02T23:18:07.243Z%22,%22clientTimeZone%22:5,%22process%22:%22ml.mobileweb.fb%22,%22c%22:%22Thing.Render%22}","http://m.someurl.com/listthing/5442f624492068b7ce7e2dd59339ef35?rdrId=tgvm&userItemId=873ef2080b337b57896390c9f747db4d&fmlrdr=t&itemId=5442f624492068b7ce7e2dd59339ef35&subListId=bf5bbeaa8eae459a83fb9e2ceb99930d&puid=2a4034e6b2e800c3ff2f128fa4f1b387&mlrdr=t","Mozilla/5.0
(iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML,
like Gecko) Mobile/10B329
[FBAN/FBIOS;FBAV/6.2;FBBV/228172;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone
OS;FBSV/6.1.3;FBSS/2;
FBCR/Sprint;FBID/phone;FBLC/en_US;FBOP/1]","10.nn.nn.nnn","nn.nn.nn.nn,
nn.nn.0.20"
As you probably noticed, there is some JSON embedded there, but it is URL-encoded. After URL decoding (can Pig do URL decoding?), here is how the JSON looks:
{"renderType":"Primary","renderSource":"Folio","things":[{"itemId":"5442f624492068b7ce7e2dd59339ef35","userItemId":"873ef2080b337b57896390c9f747db4d","listId":"bf5bbeaa8eae459a83fb9e2ceb99930d","ownerId":"2a4034e6b2e800c3ff2f128fa4f1b387"}],"redirectId":"tgvm","sourceId":"6da6f959-8309-4387-84c6-a5ddc10c22dd","valid":false,"pageLoadId":"4ada55ef-4ea9-4642-ada5-d053c45c00a4","clientTime":"2013-07-02T23:18:07.243Z","clientTimeZone":5,"process":"ml.mobileweb.fb","c":"Thing.Render"}
I need to extract the different fields in the JSON, including the "things" field, which is in fact a collection. I also need to extract the other query-string values in the log. Can Pig directly deal with this kind of source data, and if so, could you guide me through how to get Pig to parse and load it?
Thank you!
For such a complicated task, you usually need to write your own load function.
I recommend Chapter 11, Writing Load and Store Functions, in Programming Pig; the Load/Store Functions page in the official documentation is too brief.
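If you go that route, a bare-bones LoadFunc looks roughly like this. The class name is made up, and getNext() just hands back the raw line as a single-field tuple; the URL decoding and field splitting for your log format is the part you would fill in:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class WebLogLoader extends LoadFunc {

    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();   // one record per physical line
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;            // end of this split
            }
            String line = ((Text) reader.getCurrentValue()).toString();
            // Placeholder: a real loader would URL-decode the line and
            // split it into individual fields here.
            return tupleFactory.newTuple(line);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}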
I experimented plenty and learned tons. I tried a couple of JSON libraries, Piggybank, and java.net.URLDecoder. I even tried CSVExcelStorage. I registered the libraries and was able to solve the problem partially. When I ran the tests against a larger data set, it started hitting encoding issues in some lines of the source data, resulting in exceptions and job failure. So I ended up using Pig's built-in regex functionality to extract the desired values:
A = load '/var/log/live/collector_2013-07-02-0145.log' using TextLoader();
-- fix some of the encoding issues
A = foreach A GENERATE REPLACE($0,'\\\\"','"');
-- super basic url-decode
A = foreach A GENERATE REPLACE($0,'%22','"');
-- extract each of the fields from the embedded json
A = foreach A GENERATE
REGEX_EXTRACT($0,'^.*"redirectId":"([^"\\}]+).*$',1) as redirectId,
REGEX_EXTRACT($0,'^.*"fromUserId":"([^"\\}]+).*$',1) as fromUserId,
REGEX_EXTRACT($0,'^.*"userId":"([^"\\}]+).*$',1) as userId,
REGEX_EXTRACT($0,'^.*"listId":"([^"\\}]+).*$',1) as listId,
REGEX_EXTRACT($0,'^.*"c":"([^"\\}]+).*$',1) as eventType,
REGEX_EXTRACT($0,'^.*"renderSource":"([^"\\}]+).*$',1) as renderSource,
REGEX_EXTRACT($0,'^.*"renderType":"([^"\\}]+).*$',1) as renderType,
REGEX_EXTRACT($0,'^.*"engageType":"([^"\\}]+).*$',1) as engageType,
REGEX_EXTRACT($0,'^.*"clientTime":"([^"\\}]+).*$',1) as clientTime,
REGEX_EXTRACT($0,'^.*"clientTimeZone":([^,\\}]+).*$',1) as clientTimeZone;
I decided not to use REGEX_EXTRACT_ALL in case the order of the fields varies.
I would like to know how to retrieve data from a PDF using Perl. I have used PDF::API2, but I am looking for something other than that. I am expecting output like a PDF-to-DOC conversion. I would appreciate it if anyone could help me.
Thanks!
Writing a general PDF converter is not a simple task. There are at least two modules on CPAN which can help:
CAM::PDF
PDF::API2
I am new to Pig and Hadoop. I have written a Pig UDF which operates on a String and returns a String. Inside the UDF I actually use a class from an already existing jar which contains the business logic. The class constructor takes two filenames as input, which it uses to build a dictionary for processing the input. How do I get this working in MapReduce mode? When I pass the filenames in Pig local mode it works fine, but I don't know how to make it work in MapReduce mode. Can the distributed cache solve the problem?
Here is my code
REGISTER tokenParser.jar;
REGISTER sampleudf.jar;
DEFINE TOKENPARSER com.yahoo.sample.ParseToken('conf/input1.txt','conf/input2.xml');
A = LOAD './inputHOP.txt' USING PigStorage() AS (tok:chararray);
B = FOREACH A GENERATE TOKENPARSER(tok);
STORE B into 'newTokout' USING PigStorage();
From what I understand, tokenParser.jar must be using some sort of BufferedInputReader. Is it possible to make it work without changing tokenParser.jar?
Yes, as in this similar question, using the distributed cache is a good way to solve this problem.
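To make that concrete, here is a hedged sketch (the wrapper class and the HDFS paths are hypothetical, and the real ParseToken from tokenParser.jar only appears in comments because its API isn't shown): the UDF overrides EvalFunc's getCacheFiles() so Pig ships the two files through the distributed cache, and each task then sees them as plain local files under the symlink names after the '#':

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical wrapper UDF around the parsing class from tokenParser.jar.
public class TokenParserUdf extends EvalFunc<String> {

    @Override
    public List<String> getCacheFiles() {
        // HDFS paths (placeholders); the part after '#' is the local
        // symlink name each task sees in its working directory.
        return Arrays.asList(
                "/user/me/conf/input1.txt#input1.txt",
                "/user/me/conf/input2.xml#input2.xml");
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        // Because of getCacheFiles(), "input1.txt" and "input2.xml" exist
        // as plain local files here, so the existing constructor that takes
        // two filenames could be used unchanged, e.g.:
        //   ParseToken parser = new ParseToken("input1.txt", "input2.xml");
        //   return parser.parse((String) input.get(0));
        return (String) input.get(0);   // placeholder pass-through
    }
}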