How to load complex web log syntax with Pig? - hadoop

I am a complete beginner with Pig. I have installed CDH4 Pig and I am connected to a CDH4 cluster. We need to process web log files that are going to be massive (the files are already being loaded into HDFS). Unfortunately, the log syntax is quite involved (not a typical comma-delimited file). One restriction: I cannot currently pre-process the log files with another tool, because they are just too huge and we can't afford to store a copy. Here is a raw line from the logs:
"2013-07-02 16:17:12
-0700","?c=Thing.Render&d={%22renderType%22:%22Primary%22,%22renderSource%22:%22Folio%22,%22things%22:[{%22itemId%22:%225442f624492068b7ce7e2dd59339ef35%22,%22userItemId%22:%22873ef2080b337b57896390c9f747db4d%22,%22listId%22:%22bf5bbeaa8eae459a83fb9e2ceb99930d%22,%22ownerId%22:%222a4034e6b2e800c3ff2f128fa4f1b387%22}],%22redirectId%22:%22tgvm%22,%22sourceId%22:%226da6f959-8309-4387-84c6-a5ddc10c22dd%22,%22valid%22:false,%22pageLoadId%22:%224ada55ef-4ea9-4642-ada5-d053c45c00a4%22,%22clientTime%22:%222013-07-02T23:18:07.243Z%22,%22clientTimeZone%22:5,%22process%22:%22ml.mobileweb.fb%22,%22c%22:%22Thing.Render%22}","http://m.someurl.com/listthing/5442f624492068b7ce7e2dd59339ef35?rdrId=tgvm&userItemId=873ef2080b337b57896390c9f747db4d&fmlrdr=t&itemId=5442f624492068b7ce7e2dd59339ef35&subListId=bf5bbeaa8eae459a83fb9e2ceb99930d&puid=2a4034e6b2e800c3ff2f128fa4f1b387&mlrdr=t","Mozilla/5.0
(iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML,
like Gecko) Mobile/10B329
[FBAN/FBIOS;FBAV/6.2;FBBV/228172;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone
OS;FBSV/6.1.3;FBSS/2;
FBCR/Sprint;FBID/phone;FBLC/en_US;FBOP/1]","10.nn.nn.nnn","nn.nn.nn.nn,
nn.nn.0.20"
As you probably noticed, there is some JSON embedded there, but it is URL-encoded. After URL decoding (can Pig do URL decoding?), here is how the JSON looks:
{"renderType":"Primary","renderSource":"Folio","things":[{"itemId":"5442f624492068b7ce7e2dd59339ef35","userItemId":"873ef2080b337b57896390c9f747db4d","listId":"bf5bbeaa8eae459a83fb9e2ceb99930d","ownerId":"2a4034e6b2e800c3ff2f128fa4f1b387"}],"redirectId":"tgvm","sourceId":"6da6f959-8309-4387-84c6-a5ddc10c22dd","valid":false,"pageLoadId":"4ada55ef-4ea9-4642-ada5-d053c45c00a4","clientTime":"2013-07-02T23:18:07.243Z","clientTimeZone":5,"process":"ml.mobileweb.fb","c":"Thing.Render"}
I need to extract the different fields in the JSON, including the "things" field, which is in fact a collection. I also need to extract the other query-string values in the log. Can Pig deal directly with this kind of source data, and if so, could you guide me through getting Pig to parse and load it?
Thank you!

For such a complicated task, you usually need to write your own Load function.
I recommend Chapter 11, "Writing Load and Store Functions", in Programming Pig. The Load/Store Functions section of the official documentation is too brief.

I experimented a lot and learned a great deal. I tried a couple of JSON libraries, piggybank, and java.net.URLDecoder, and even tried CSVExcelStorage. I registered the libraries and was able to solve the problem partially. When I ran the tests against a larger data set, it started hitting encoding issues in some lines of the source data, resulting in exceptions and job failure. So I ended up using Pig's built-in regex functionality to extract the desired values:
A = load '/var/log/live/collector_2013-07-02-0145.log' using TextLoader();
-- fix some of the encoding issues
A = foreach A GENERATE REPLACE($0,'\\\\"','"');
-- super basic url-decode
A = foreach A GENERATE REPLACE($0,'%22','"');
-- extract each of the fields from the embedded json
A = foreach A GENERATE
REGEX_EXTRACT($0,'^.*"redirectId":"([^"\\}]+).*$',1) as redirectId,
REGEX_EXTRACT($0,'^.*"fromUserId":"([^"\\}]+).*$',1) as fromUserId,
REGEX_EXTRACT($0,'^.*"userId":"([^"\\}]+).*$',1) as userId,
REGEX_EXTRACT($0,'^.*"listId":"([^"\\}]+).*$',1) as listId,
REGEX_EXTRACT($0,'^.*"c":"([^"\\}]+).*$',1) as eventType,
REGEX_EXTRACT($0,'^.*"renderSource":"([^"\\}]+).*$',1) as renderSource,
REGEX_EXTRACT($0,'^.*"renderType":"([^"\\}]+).*$',1) as renderType,
REGEX_EXTRACT($0,'^.*"engageType":"([^"\\}]+).*$',1) as engageType,
REGEX_EXTRACT($0,'^.*"clientTime":"([^"\\}]+).*$',1) as clientTime,
REGEX_EXTRACT($0,'^.*"clientTimeZone":([^,\\}]+).*$',1) as clientTimeZone;
I decided not to use REGEX_EXTRACT_ALL in case the order of the fields varies.
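To answer the URL-decoding question outside of Pig: here is a minimal Python sketch (it could run, for example, as a Hadoop streaming script, or be adapted into a UDF) showing a proper decode-then-parse of the embedded payload, rather than the `%22`-only replacement above. The `raw` string is a trimmed excerpt of the payload, not the full log line.

```python
import json
from urllib.parse import unquote

# A trimmed excerpt of the url-encoded payload from the log line above.
raw = ('{%22renderType%22:%22Primary%22,%22redirectId%22:%22tgvm%22,'
       '%22clientTimeZone%22:5,%22c%22:%22Thing.Render%22}')

decoded = unquote(raw)       # real URL decoding, not just replacing %22
event = json.loads(decoded)  # the payload is ordinary JSON once decoded

print(event["redirectId"], event["c"], event["clientTimeZone"])
# tgvm Thing.Render 5
```

With real JSON parsing, the "things" array also comes back as a Python list, which sidesteps the regex fragility entirely.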

Related

Best way to get script output into elasticsearch?

I wrote a Python program to parse an internal website to retrieve a number of metrics. The script spits out something like this:
12 456 785.3 .12 23145
Each value is a distinct performance metric. I'm now looking for ways to get this data into Elasticsearch so I can start making use of it. There are several input plug-ins:
https://www.elastic.co/guide/en/logstash/current/input-plugins.html
I'm curious which one folks are using to get script data into Elasticsearch. Should I funnel it through syslog and grok it from there? Is anyone using meetup? Any other solutions that I may be missing? I've read numerous posts and websites, and the solutions they propose seem extremely complicated for this super-simple job.
If this is a custom script for just this task, I'd change the script to generate JSON and use the official Python Elasticsearch client to insert the document into Elasticsearch. Alternatively, your script could output JSON to stdout, and a simple bash script could curl the document into Elasticsearch.
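For either route, the conversion step is tiny. A sketch (the metric names here are hypothetical placeholders; substitute whatever your script actually measures):

```python
import json

# Hypothetical metric names -- replace with your own.
FIELDS = ["requests", "errors", "latency_ms", "error_rate", "bytes"]

def to_document(line):
    """Turn one line of script output into a JSON document for Elasticsearch."""
    values = [float(v) for v in line.split()]
    return json.dumps(dict(zip(FIELDS, values)))

doc = to_document("12 456 785.3 .12 23145")
print(doc)
# The document could then be posted, e.g.:
#   curl -XPOST localhost:9200/metrics/_doc \
#        -H 'Content-Type: application/json' -d "$doc"
```

Indexing named fields like this (rather than shipping the raw line and grokking it later) keeps the pipeline to one moving part.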

How to convert hadoop sequence file to json format?

As the name suggests, I'm looking for a tool that will convert existing data from a Hadoop sequence file to JSON format.
My initial googling has only turned up results related to Jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get the data in the corresponding JSON format.
So, in effect, I'm looking for a tool/utility that takes a Hadoop sequence file as input and produces output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
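As a rough sketch of that idea (in Python rather than Java with Jackson, and assuming the sequence file's keys and values are text that you have dumped as tab-separated lines, e.g. with `hadoop fs -text`), the per-record serialization really is trivial:

```python
import json

def to_json(line):
    """Serialize one key<TAB>value record as a JSON object."""
    key, _, value = line.rstrip("\n").partition("\t")
    return json.dumps({"key": key, "value": value})

# In a streaming job the mapper body would just be:
#   for line in sys.stdin: print(to_json(line))
print(to_json("1000\t{some raw value}"))
# {"key": "1000", "value": "{some raw value}"}
```

If the values are binary rather than text, you would need a real SequenceFile reader (the Java API, or a library that wraps it) before this step.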
I thought there must be some tool that does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but why do so if something already exists that does just the same?
Anyway, I figured out how to do it using Jaql. Here is a sample query that worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});

Pig - load Word documents (.doc & .docx) with pig

I can't load Microsoft Word documents (.doc or .docx) with Pig. Indeed, when I try to do so using TextLoader(), PigStorage(), or no loader at all, it doesn't work. The output is some weird symbols.
I heard that I could write a custom loader in Java, but it seems really difficult and I don't understand how to program one of these at the moment.
I would like to put all the .doc file content in a single chararray bag so I could later use a filter function to process it.
How could I do it?
Thanks
They are right. Since .doc and .docx are binary formats, simple text loaders won't work. You can either write a UDF to load the files directly into Pig, or do some preprocessing to convert all .doc and .docx files into .txt files so that Pig loads those .txt files instead. This link may help you get started in finding a way to convert the files.
However, I'd still recommend learning to write the UDF. Preprocessing the files is going to add significant overhead that can be avoided.
Update: Here are a couple of resources I've used for writing my java (Load) UDFs in the past. One, Two.
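If the preprocessing route is acceptable after all, .docx files need no special tooling: a .docx is just a zip archive of XML parts, so a short script using only the Python standard library can pull the plain text out. This is a sketch of that preprocessing step, not a Pig loader; legacy binary .doc files would still need a converter such as Apache Tika or antiword.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(path):
    """Extract the visible text from a .docx file (a zip of XML parts)."""
    with zipfile.ZipFile(path) as z:
        xml_bytes = z.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    # Every <w:t> element holds a run of document text.
    return "".join(node.text or "" for node in root.iter(W_NS + "t"))
```

The resulting .txt files load cleanly with TextLoader(), and from there the content can go into a single chararray for the filter function.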

How do I read static files in a PIG UDF

I am new to Pig and Hadoop. I have written a Pig UDF which operates on a string and returns a string. It actually uses a class from an existing jar that contains the business logic. The class constructor takes two filenames as input, which it uses to build a dictionary for processing the input. I tried passing the filenames in Pig local mode and it works fine, but I don't know how to make it work in MapReduce mode. Can the distributed cache solve the problem?
Here is my code
REGISTER tokenParser.jar
REGISTER sampleudf.jar;
DEFINE TOKENPARSER com.yahoo.sample.ParseToken('conf/input1.txt','conf/input2.xml');
A = LOAD './inputHOP.txt' USING PigStorage() AS (tok:chararray);
B = FOREACH A GENERATE TOKENPARSER(tok);
STORE B into 'newTokout' USING PigStorage();
From what I understand, tokenParser.jar must be using some sort of BufferedInputReader. Is it possible to make it work without changing tokenParser.jar?
Yes, as in this similar question, using the distributed cache is a good way to solve this problem.

Dynamically updating RDF File

Is it possible to update an RDF file dynamically from user-generated input through a web form? The exact scenario would be SKOS concept definitions being created and updated through user input to HTML forms.
I was considering XPath, but is there a better / generally accepted / best-practice way of doing this kind of thing?
For this type of thing there are IMO two approaches:
1 - Using Named Graphs in a Triple Store
Rather than editing an actual fixed file, you use a graph which is stored as a named graph in a triple store that supports triple-level updates (i.e. you can change individual triples in a graph). For example, you could use a store like Virtuoso or a Jena-based store (Jena SDB/TDB) to do this; basically any store that supports the SPARUL language or has its own equivalent.
2 - Using a fixed RDF file and altering it
From your mention of XPath, I assume that you intend to store your file as RDF/XML. While XPath would potentially work for this, it is going to be dependent on the exact serialization of your file and may get very complex. If your app is going to allow users to submit and edit their own files, then there will be no guarantees over how the RDF has been serialized into RDF/XML, so your XPath expressions might not work. If you control all the serialization and processing of the RDF/XML, then you can keep it in a format that your XPath will work on.
From my point of view, the simplest way to take this approach is to load the file into memory using an appropriate RDF library, manipulate it in memory, and then persist the whole thing back to disk when the user is done (or at regular intervals, or whatever is appropriate to your application).
