How to convert hadoop sequence file to json format?

How to convert hadoop sequence file to json format? - hadoop

As the name suggests, I'm looking for some tool which will convert the existing data from hadoop sequence file to json format.
My initial googling have only shown up results related to jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I've hadoop sequence file sitting on my local machine and would like to get data in corresponding json format.
So in-effect, I'm looking for some tool/utility which will take hadoop sequence file as input and produce output in json format.
Thanks

Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.

I thought there must be some tool which will do this given that its such common requirement. Yes, it should be pretty easy to code but again why to do so if you already have something which does just the same.
Anyway, I figured out to do it using jaql. Sample working query which worked for me,
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});

Related

read a .fit file on Linux

How could I read Garmin's .fit file on Linux. I'd like to use it for some data analysis but the file is a binary file.
I have visited http://garmin.kiesewetter.nl/ but the website does not seem to work.
Thanks

You can use GPSbabel to do this. It's a command-line tool, so you end up with something like:
gpsbabel -i garmin_fit -f {filename}.fit -o csv -F {output filename}.csv
and you'll get a text file with all the lat/long coordinates.
What's trickier is getting out other data, ie: if you want speed, time, or other information from the .fit file. You can easily get those into a .gpx, where they're in xml and human-readable, but I haven't yet found a single line solution for getting that data into a csv.

The company that created ANT made an SDK package available here:
https://www.thisisant.com/resources/fit
When unzipping this, there is a java/FitCSVTool.jar file. Then:
java -jar java/FitCSVTool.jar -b input.fit output.csv
I tested with a couple of files and it seems to work really well. Then of course the format of the csv can be a little bit complex.
For example, latitude and longitude are stored in semicircles, so it should be multiplied by 180/(2^31) to give GPS coordinates.

You need to convert the file to a .csv, the Garmin repair tool at http://garmin.kiesewetter.nl/ will do this for you. I've just loaded the site fine, try again it may have been temporarily down.
To add a little more detail:
"FIT or Flexible and Interoperable Data Transfer is a file format used for GPS tracks and routes. It is used by newer Garmin fitness GPS devices, including the Edge and Forerunner." From the OpenStreetMap Wiki http://wiki.openstreetmap.org/wiki/FIT
There are many tools to convert these files to other formats for different uses, which one you choose depends on the use. GPSBabel is another converer tool that may help. gpsbabel.org (I can't post two links yet :)

This page parses the file and lets you download it as tables. https://www.fitfileviewer.com/ The fun bit is converting the timestamps from numbers to readable timestamps Garmin .fit file timestamp

Mapreduce XML input format - to build custom format

If the input files in XML format, I shouldn't be using TextInputFormat because TextInputFormat assumes each record is in each line of the input file and the Mapper class is called for each line to get a Key Value pair for that record/line.
So I think we need a custom input format to scan the XML datasets.
Being new to Hadoop mapreduce, is there any article/link/video that shows the steps to build a custom input format?
thanks
nath

Problem
Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?
Solution
MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.
So I mean no need to have custom input format since Mahout library present.
I am not sure, whether you are going to read or write but both were described in above link.
Pls have a look at XmlInputFormat implementation details here.
Furthermore, XmlInputFormat extends TextInputFormat

Hadoop reducer raw output

I have a hadoop task written in Python.
I'm trying to output result as raw binary data, but the data I get are "corrupted". Looks like standard outputter doesn't work with binary data well.
Is there any way to fix it without writing own outputter in Java?

Lost hope to find an answer, I've implemented my own writer in Java.

Reading and writing to hadoop sequence file using scala

I just started using scalding and trying to find examples of reading a text file and writing to a hadoop sequence file.
Any help is appreciated.

You can use com.twitter.scalding.WritableSequenceFile (please note that you have to use the fully quantified name, otherwise it picks up the cascading one). Hope this helps.

How to load complex web logs syntax with Pig?

I am a complete beginner on Pig. I have installed cdh4 pig and I am connected to a cdh4 cluster. We need to process these web log files that are going to be massive (the files are already being loaded to HDFS). Unfortunately the log syntax is quite involved (not a typical comma delimited file). A restriction is I cannot currently pre-process the log files with some other tool because they are just too huge and can't afford storing a copy. Here is a raw line in the logs:
"2013-07-02 16:17:12
-0700","?c=Thing.Render&d={%22renderType%22:%22Primary%22,%22renderSource%22:%22Folio%22,%22things%22:[{%22itemId%22:%225442f624492068b7ce7e2dd59339ef35%22,%22userItemId%22:%22873ef2080b337b57896390c9f747db4d%22,%22listId%22:%22bf5bbeaa8eae459a83fb9e2ceb99930d%22,%22ownerId%22:%222a4034e6b2e800c3ff2f128fa4f1b387%22}],%22redirectId%22:%22tgvm%22,%22sourceId%22:%226da6f959-8309-4387-84c6-a5ddc10c22dd%22,%22valid%22:false,%22pageLoadId%22:%224ada55ef-4ea9-4642-ada5-d053c45c00a4%22,%22clientTime%22:%222013-07-02T23:18:07.243Z%22,%22clientTimeZone%22:5,%22process%22:%22ml.mobileweb.fb%22,%22c%22:%22Thing.Render%22}","http://m.someurl.com/listthing/5442f624492068b7ce7e2dd59339ef35?rdrId=tgvm&userItemId=873ef2080b337b57896390c9f747db4d&fmlrdr=t&itemId=5442f624492068b7ce7e2dd59339ef35&subListId=bf5bbeaa8eae459a83fb9e2ceb99930d&puid=2a4034e6b2e800c3ff2f128fa4f1b387&mlrdr=t","Mozilla/5.0
(iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML,
like Gecko) Mobile/10B329
[FBAN/FBIOS;FBAV/6.2;FBBV/228172;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone
OS;FBSV/6.1.3;FBSS/2;
FBCR/Sprint;FBID/phone;FBLC/en_US;FBOP/1]","10.nn.nn.nnn","nn.nn.nn.nn,
nn.nn.0.20"
As you probably noticed there is some json embedded there but it is url encoded. After url decoding (can Pig do url decoding?) here is how the json looks:
{"renderType":"Primary","renderSource":"Folio","things":[{"itemId":"5442f624492068b7ce7e2dd59339ef35","userItemId":"873ef2080b337b57896390c9f747db4d","listId":"bf5bbeaa8eae459a83fb9e2ceb99930d","ownerId":"2a4034e6b2e800c3ff2f128fa4f1b387"}],"redirectId":"tgvm","sourceId":"6da6f959-8309-4387-84c6-a5ddc10c22dd","valid":false,"pageLoadId":"4ada55ef-4ea9-4642-ada5-d053c45c00a4","clientTime":"2013-07-02T23:18:07.243Z","clientTimeZone":5,"process":"ml.mobileweb.fb","c":"Thing.Render"}
I need to extract the different fields in the json and the "things" field which is in fact a collection. I also need to extract the other query string values in the log. Can Pig directly deal with this kind of source data and if so could you be so kind to guide me through how to have Pig be able to parse and load it?
Thank you!

For such complicated task, you ususally need to write your Load function.
I recommend Chapter 11. Writing Load and Store Functions in Programming Pig. Load/Store Functions in official docuemnt is too simple.

I experimented plenty and learned tons. Tried a couple json libraries, piggybank and the java.net.URLDecoder. Even tried the CSVExcelStorage. I registered the libraries and was able to solve the problem partially. When I ran the tests against a larger data set, it started hitting encoding issues in some lines of the source data resulting in exceptions and job failure. So I ended up using Pig's built-in regex functionality to extract the desired values:
A = load '/var/log/live/collector_2013-07-02-0145.log' using TextLoader();
-- fix some of the encoding issues
A = foreach A GENERATE REPLACE($0,'\\\\"','"');
-- super basic url-decode
A = foreach A GENERATE REPLACE($0,'%22','"');
-- extract each of the fields from the embedded json
A = foreach A GENERATE
REGEX_EXTRACT($0,'^.*"redirectId":"([^"\\}]+).*$',1) as redirectId,
REGEX_EXTRACT($0,'^.*"fromUserId":"([^"\\}]+).*$',1) as fromUserId,
REGEX_EXTRACT($0,'^.*"userId":"([^"\\}]+).*$',1) as userId,
REGEX_EXTRACT($0,'^.*"listId":"([^"\\}]+).*$',1) as listId,
REGEX_EXTRACT($0,'^.*"c":"([^"\\}]+).*$',1) as eventType,
REGEX_EXTRACT($0,'^.*"renderSource":"([^"\\}]+).*$',1) as renderSource,
REGEX_EXTRACT($0,'^.*"renderType":"([^"\\}]+).*$',1) as renderType,
REGEX_EXTRACT($0,'^.*"engageType":"([^"\\}]+).*$',1) as engageType,
REGEX_EXTRACT($0,'^.*"clientTime":"([^"\\}]+).*$',1) as clientTime,
REGEX_EXTRACT($0,'^.*"clientTimeZone":([^,\\}]+).*$',1) as clientTimeZone;
I decided not to use REGEX_EXTRACT_ALL in case the order of the fields varies.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to convert hadoop sequence file to json format? - hadoop

Apache Hadoop might be a good tool for reading sequence files. All kidding aside, though, why not write the simplest possible Mapper java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.

Related

read a .fit file on Linux

Mapreduce XML input format - to build custom format

Hadoop reducer raw output

Reading and writing to hadoop sequence file using scala

How to load complex web logs syntax with Pig?

Categories

Resources