Reading and writing a Hadoop sequence file using Scala

I just started using Scalding and am trying to find examples of reading a text file and writing to a Hadoop sequence file.
Any help is appreciated.

You can use com.twitter.scalding.WritableSequenceFile (please note that you have to use the fully qualified name, otherwise it picks up the Cascading one). Hope this helps.
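Something like this minimal sketch (untested), using the fields-based Scalding API; the --input/--output argument names and the choice of the line offset as the key are my own assumptions:

import com.twitter.scalding._
import org.apache.hadoop.io.{LongWritable, Text}

// Sketch of a job that copies a text file into a Hadoop sequence file.
class TextToSequenceFileJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .read
    // TextLine exposes ('offset, 'line); wrap them in Writables for the sink.
    .mapTo(('offset, 'line) -> ('key, 'value)) { pair: (Long, String) =>
      (new LongWritable(pair._1), new Text(pair._2))
    }
    // Fully qualified so the Cascading class of the same name is not picked up.
    .write(com.twitter.scalding.WritableSequenceFile[LongWritable, Text](args("output"), ('key, 'value)))
}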

Related

How to convert a Hadoop sequence file to JSON format?

As the name suggests, I'm looking for a tool which will convert existing data from a Hadoop sequence file to JSON format.
My initial googling has only shown results related to Jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get the data in the corresponding JSON format.
So, in effect, I'm looking for some tool/utility which will take a Hadoop sequence file as input and produce output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
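As a rough, untested sketch of that suggestion, here is what the mapper of such a map-only job could look like in Scala; the class name and the toString-based field extraction are placeholders:

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.hadoop.io.{NullWritable, Text, Writable}
import org.apache.hadoop.mapreduce.Mapper

// Emits one JSON object per sequence-file record. In the driver you would
// set SequenceFileInputFormat as the input format and use zero reducers.
class SeqToJsonMapper extends Mapper[Writable, Writable, Text, NullWritable] {
  private val jackson = new ObjectMapper()

  override def map(key: Writable, value: Writable,
                   context: Mapper[Writable, Writable, Text, NullWritable]#Context): Unit = {
    val node = jackson.createObjectNode()
    node.put("key", key.toString)     // stand-in for real field extraction
    node.put("value", value.toString)
    context.write(new Text(jackson.writeValueAsString(node)), NullWritable.get())
  }
}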
I thought there must be some tool that does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but then again, why do so if something already exists that does just the same?
Anyway, I figured out how to do it using Jaql. Here is a sample query which worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});

How to delete an entry from a MapFile in Hadoop

Is there any solution for deleting an entry from a MapFile in Hadoop? I am able to read and write entries to a MapFile, but I have no idea how to delete or update an entry. Is there any good solution for this? Any help is appreciated. Thanks in advance.
HDFS basically provides write-once storage for data warehousing. You cannot modify the existing content of any HDFS file; at most, you can append new content at the end of the file.
You can refer to similar questions on this topic.
Suppose the file contains the two lines below:
hi hello world
this is fine
Now, in the mapper, write logic that keeps only the strings containing "hello" and passes them to the reduce phase, as sketched below.
The reducer output will then contain only "hi hello world".
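For the text-file example above, a rough, untested sketch of such a filtering mapper in Scala (for a MapFile you would read its entries the same way and write the survivors to a new file; names here are made up):

import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Keeps only the lines containing "hello"; every other record is
// effectively "deleted" from the rewritten output.
class KeepHelloLinesMapper extends Mapper[LongWritable, Text, Text, NullWritable] {
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, NullWritable]#Context): Unit = {
    if (value.toString.contains("hello"))
      context.write(value, NullWritable.get())
  }
}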
If you want anything other than this, please specify a short use case.

How to generate an ARFF file from ACE

I am using the ACE data mining package https://dtai.cs.kuleuven.be/ACE/doc/ACEuser-1.2.16.pdf which uses inductive logic programming.
I am using WARMR to find frequent queries.
In the manual there is a command 'generate_arff', but this command does not seem to be in my version (Windows). Typing help into ACE lists the command 'generate_arff/2', but the help does not give any indication of what the arguments should be (presumably an input file and an output file). I have not been able to guess how this works. Does anyone know how to do this?
generate_arff('my_program.freq_queries.out','my_program.arff').
This does not work on Windows for some reason, but works on Linux.

Pig shell setup: automatically executing Pig scripts

Is there a way to automatically run a Pig script when invoking Pig from the command line?
The reason I'm wondering about this is that I have several import and define statements that I use over and over to set everything up. Is it possible to define this collection of statements somewhere so that when I start Pig, it will automatically execute those lines? I apologize in advance if this is something trivial that I missed in the documentation.
Yes, you can certainly do so from version 0.11 onwards.
You need to use the .pigbootup file.
Here is a nice blog post on setting up the pigbootup file:
http://hadoopified.wordpress.com/2013/02/06/pig-specify-a-default-script/
If you want to include Pig macros from a file, you can use the IMPORT command.
Take a look at http://pig.apache.org/docs/r0.9.1/cont.html#import-macros for reference.
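For instance, a hypothetical ~/.pigbootup collecting such recurring statements might look like this (all paths and names below are made up):

-- hypothetical ~/.pigbootup; Pig runs these statements at every startup
REGISTER /path/to/my-udfs.jar;
DEFINE MY_UDF com.example.udf.MyUdf();
IMPORT '/path/to/common_macros.pig';
set default_parallel 10;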

How to use the MultipleTextOutputFormat class to give the default output files meaningful names?

After the reduce phase in Hadoop, I wanted the output file names to be something meaningful, depending on the input key value. However, I was not successful in following the example in "Hadoop: The Definitive Guide", which used MultipleTextOutputFormat to do this. Is the reason that it's based on the old API and doesn't work with the new API?
Can anybody hint at the solution or point me to the relevant documentation?
You are probably right: most things that worked in the old API don't always work in the new one.
There is a "new way" of doing this now, called MultipleOutputs; a sketch follows.
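Here is a minimal, untested sketch in the new (mapreduce) API; the Text-only types and the use of the key as the base output path are my own illustrative choices:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

// Reducer that names its output files after the incoming key.
class NamedOutputReducer extends Reducer[Text, Text, Text, Text] {
  private var mos: MultipleOutputs[Text, Text] = _

  override def setup(context: Reducer[Text, Text, Text, Text]#Context): Unit =
    mos = new MultipleOutputs[Text, Text](context)

  override def reduce(key: Text, values: java.lang.Iterable[Text],
                      context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    val it = values.iterator()
    while (it.hasNext)
      // The third argument is the base output path, so each key gets its own file.
      mos.write(key, it.next(), key.toString)
  }

  override def cleanup(context: Reducer[Text, Text, Text, Text]#Context): Unit =
    mos.close()
}

Note that the key must form a valid path fragment, and if you only write through MultipleOutputs, combining this with LazyOutputFormat avoids empty default part files.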
