How do you use ORCFile Input/Output format in MapReduce? - hadoop

I need to implement a custom I/O format based on ORCFile I/O format. How do I go about it?
Specifically I would need a way to include the ORCFile library in my source code (which is a custom Pig implementation) and use the ORCFile Output format to write data, and later use the ORCFile Input format to read back the data.

You need to create your subclass of InputFormat class (or FileInputFormat, depending on the nature of your files).
Just google for Hadoop InputFormat and you will find a plenty of articles and tutorials on how to create your own InputFormat class.

You can use HCatalog library to read write orc files in mapreduce.

Just wrote a sample code here. Hope it helps.
sample mapper code
public static class MyMapper<K extends WritableComparable, V extends Writable>
extends MapReduceBase implements Mapper<K, OrcStruct, Text, IntWritable> {
private StructObjectInspector oip;
private final OrcSerde serde = new OrcSerde();
public void configure(JobConf job) {
Properties table = new Properties();
table.setProperty("columns", "a,b,c");
table.setProperty("columns.types", "int,string,struct<d:int,e:string>");
serde.initialize(job, table);
try {
oip = (StructObjectInspector) serde.getObjectInspector();
} catch (SerDeException e) {
e.printStackTrace();
}
}
public void map(K key, OrcStruct val,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
System.out.println(val);
List<? extends StructField> fields =oip.getAllStructFieldRefs();
StringObjectInspector bInspector =
(StringObjectInspector) fields.get(B_ID).getFieldObjectInspector();
String b = "garbage";
try {
b = bInspector.getPrimitiveJavaObject(oip.getStructFieldData(serde.deserialize(val), fields.get(B_ID)));
} catch (SerDeException e1) {
e1.printStackTrace();
}
OrcStruct struct = null;
try {
struct = (OrcStruct) oip.getStructFieldData(serde.deserialize(val),fields.get(C_ID));
} catch (SerDeException e1) {
e1.printStackTrace();
}
StructObjectInspector cInspector = (StructObjectInspector) fields.get(C_ID).getFieldObjectInspector();
int d = ((IntWritable) cInspector.getStructFieldData(struct, fields.get(D_ID))).get();
String e = cInspector.getStructFieldData(struct, fields.get(E_ID)).toString();
output.collect(new Text(b), new IntWritable(1));
output.collect(new Text(e), new IntWritable(1));
}
}
Launcher code
JobConf job = new JobConf(new Configuration(), OrcReader.class);
// Specify various job-specific parameters
job.setJobName("myjob");
job.set("mapreduce.framework.name","local");
job.set("fs.default.name","file:///");
job.set("log4j.logger.org.apache.hadoop","INFO");
job.set("log4j.logger.org.apache.hadoop","INFO");
//push down projection columns
job.set("hive.io.file.readcolumn.ids","1,2");
job.set("hive.io.file.read.all.columns","false");
job.set("hive.io.file.readcolumn.names","b,c");
FileInputFormat.setInputPaths(job, new Path("./src/main/resources/000000_0.orc"));
FileOutputFormat.setOutputPath(job, new Path("./target/out1"));
job.setMapperClass(OrcReader.MyMapper.class);
job.setCombinerClass(OrcReader.MyReducer.class);
job.setReducerClass(OrcReader.MyReducer.class);
job.setInputFormat(OrcInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
JobClient.runJob(job);

Related

How to set multiple Avro schemas with AvroParquetOutputFormat?

In my MapReduce job, Im using AvroParquetOutputFormat to write to Parquet files using Avro schema.
The application logic requires multiple types of files getting created by Reducer and each file has its own Avro schema.
The class AvroParquetOutputFormat has a static method setSchema() to set Avro schema of output. Looking at the code, AvroParquetOutputFormat uses AvroWriteSupport.setSchema() which again is a static implementation.
Without extending AvroWriteSupport and hacking the logic, is there a simpler way to achieve multiple Avro schema output from AvroParquetOutputFormat in a single MR job?
Any pointers/inputs highly appreciated.
Thanks & Regards
MK
It may be quite late to answer, but I have also faced this issue and came up with a solution.
First, There is no support like 'MultipleAvroParquetOutputFormat' inbuilt in parquet-mr. But to achieve a similar behavior I used MultipleOutputs.
For a map-only kind of job, put your mapper like this:
public class EventMapper extends Mapper<LongWritable, BytesWritable, Void, GenericRecord>{
protected KafkaAvroDecoder deserializer;
protected String outputPath = "";
// Using MultipleOutputs to write custom named files
protected MultipleOutputs<Void, GenericRecord> mos;
public void setup(Context context) throws IOException, InterruptedException {
super.setup(context);
Configuration conf = context.getConfiguration();
outputPath = conf.get(FileOutputFormat.OUTDIR);
mos = new MultipleOutputs<Void, GenericRecord>(context);
}
public void map(LongWritable ln, BytesWritable value, Context context){
try {
GenericRecord record = (GenericRecord) deserializer.fromBytes(value.getBytes());
AvroWriteSupport.setSchema(context.getConfiguration(), record.getSchema());
Schema schema = record.getSchema();
String mergeEventsPath = outputPath + "/" + schema.getName(); // Adding '/' will do no harm
mos.write( (Void) null, record, mergeEventsPath);
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
#Override
public void cleanup(Context context) throws IOException, InterruptedException {
mos.close();
}
}
This will create a new RecordWriter for each schema and creates a new parquet file, appended with the schema name, for example, schema1-r-0000.parquet.
This will also create the default part-r-0000x.parquet files based on schema set in the driver. To avoid this, use LazyOutputFormat like:
LazyOutputFormat.setOutputFormatClass(job, AvroParquetOutputFormat.class);
Hope this helps.

Weird output while reading a .tar.gz file in MapReduce

Please go a little Easy on me, As I am a newbie to hadoop and MapReduce.
I have a .tar.gz file that I am trying to read using mapReduce by writing a custom InputFormatter employing CompressionCodecfactory.
I read some document over Internet that CompressionCodecFactory can be used to read a .tar.gz file. hence I implemented that in my code.
The output that I get After running the code is absolute garbage.
A piece of my input file is provided below:
"MAY 2013 KOTZEBUE, AK"
"RALPH WIEN MEMORIAL AIRPORT (PAOT)"
"Lat:66° 52'N Long: 162° 37'W Elev (Ground) 30 Feet"
"Time Zone : ALASKA WBAN: 26616 ISSN#: 0197-9833"
01,21,0,11,-11,3,11,54,0," ",4, ,0.0,0.00,30.06,30.09,10.2,36,10.0,25,360,22,360,01
02,25,3,14,-9,5,12,51,0," ",4, ,0.0,0.00,30.09,30.11,6.1,34,7.7,16,010,14,360,02
03,21,1,11,-12,7,11,54,0," ",4, ,0.0,0.00,30.14,30.15,5.0,28,6.0,17,270,16,270,03
04,20,8,14,-10,11,13,51,0,"SN BR",4, ,.001,.0001,30.09,30.11,8.6,26,9.2,20,280,15,280,04
05,29,19,24,-1,21,23,41,0,"SN BR",5, ,0.6,0.06,30.11,30.14,8.1,20,8.5,22,240,20,240,05
06,27,19,23,-3,21,23,42,0,"SN BR",4, ,0.1,0.01,30.14,30.15,8.7,19,9.4,18,200,15,200,06
The the output I get is quite weird:
��#(���]�OX}�s���{Fw8OP��#ig#���e�1L'�����sAm�
��#���Q�eW�t�Ruk�#��AAB.2P�V�� \L}��+����.֏9U]N �)(���d��i(��%F�S<�ҫ ���EN��v�7�Y�%U�>��<�p���`]ݹ�#�#����9Dˬ��M�X2�'��\R��\1- ���V\K1�c_P▒W¨P[Ö␤ÍãÏ2¨▒;O
Below is the Custom InputFormat and RecordReader code:
InputFormat
public class SZ_inptfrmtr extends FileInputFormat<Text, Text>
{
#Override
public RecordReader<Text, Text> getRecordReader(InputSplit split,
JobConf job_run, Reporter reporter) throws IOException {
// TODO Auto-generated method stub
return new SZ_recordreader(job_run, (FileSplit)split);
}
}
RecordReader:
public class SZ_recordreader implements RecordReader<Text, Text>
{
FileSplit split;
JobConf job_run;
boolean processed=false;
CompressionCodecFactory compressioncodec=null; // A factory that will find the correct codec(.file) for a given filename.
public SZ_recordreader(JobConf job_run, FileSplit split)
{
this.split=split;
this.job_run=job_run;
}
#Override
public void close() throws IOException {
// TODO Auto-generated method stub
}
#Override
public Text createKey() {
// TODO Auto-generated method stub
return new Text();
}
#Override
public Text createValue() {
// TODO Auto-generated method stub
return new Text();
}
#Override
public long getPos() throws IOException {
// TODO Auto-generated method stub
return processed ? split.getLength() : 0;
}
#Override
public float getProgress() throws IOException {
// TODO Auto-generated method stub
return processed ? 1.0f : 0.0f;
}
#Override
public boolean next(Text key, Text value) throws IOException {
// TODO Auto-generated method stub
FSDataInputStream in=null;
if (!processed)
{
byte [] bytestream= new byte [(int) split.getLength()];
Path path=split.getPath();
compressioncodec=new CompressionCodecFactory(job_run);
CompressionCodec code = compressioncodec.getCodec(path);
// compressioncodec will find the correct codec by visiting the path of the file and store the result in code
System.out.println(code);
FileSystem fs= path.getFileSystem(job_run);
try
{
in =fs.open(path);
IOUtils.readFully(in, bytestream, 0, bytestream.length);
System.out.println("the input is " +in+ in.toString());
key.set(path.getName());
value.set(bytestream, 0, bytestream.length);
}
finally
{
IOUtils.closeStream(in);
}
processed=true;
return true;
}
return false;
}
}
Could anybody please point out the flaw..
There is a codec for .gz, but no codec for .tar.
Your .tar.gz is being decompressed to .tar, but it's still a tarball, and not something understood by the Hadoop system.
Your code may be stuck there in mapper and reducer class communication, To work with compressed files in MapReduce you need to set some configuration options for your job. These classes
are mandatory to set in driver class:
conf.setBoolean("mapred.output.compress", true);//Compress The Reducer Out put
conf.setBoolean("mapred.compress.map.output", true);//Compress The Mapper Output
conf.setClass("mapred.output.compression.codec",
codecClass,
CompressionCodec.class);//Compression codec for Compresing mapper output
The only difference between a MapReduce job that works with uncompressed versus
compressed IO are these three annotated lines.
I read some document over Internet that CompressionCodecFactory can be used to read a .tar.gz file. hence I implemented that in my code.
Even Compression codec do better over, but there are many codec for this purpose most one is LzopCodec and SnappyCodec for possible big data.. you can find a Git for LzopCodec here :https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/compression/lzo/LzopCodec.java

Hadoop - Mysql new API connection

I am trying to set MySQL as input, in a Hadoop Process. How to use DBInputFormat class for Hadoop - MySQL connection in version 1.0.3? The configuration of the job via JobConf from hadoop-1.0.3/docs/api/ doesnt work.
// Create a new JobConf
JobConf job = new JobConf(new Configuration(), MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setCombinerClass(MyJob.MyReducer.class);
job.setReducerClass(MyJob.MyReducer.class);
job.setInputFormat(SequenceFileInputFormat.class);
job.setOutputFormat(SequenceFileOutputFormat.class);
You need to do something like the following (assuming the typical employee table for example):
JobConf conf = new JobConf(getConf(), MyDriver.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, “com.mysql.jdbc.Driver”, “jdbc:mysql://localhost/mydatabase”); String [] fields = { “employee_id”, "name" };
DBInputFormat.setInput(conf, MyRecord.class, “employees”, null /* conditions */, “employee_id”, fields);
...
// other necessary configuration
JobClient.runJob(conf);
The configureDB() and setInput() calls configure the DBInputFormat. The first call specifies the JDBC driver implementation to use and what database to connect to. The second call specifies what data to load from the database. The MyRecord class is the class where data will be read into in Java, and "employees" is the name of the table to read. The "employee_id" parameter specifies the table’s primary key, used for ordering results. The section “Limitations of the InputFormat” below explains why this is necessary. Finally, the fields array lists what columns of the table to read. An overloaded definition of setInput() allows you to specify an arbitrary SQL query to read from, instead.
After calling configureDB() and setInput(), you should configure the rest of your job as usual, setting the Mapper and Reducer classes, specifying any other data sources to read from (e.g., datasets in HDFS) and other job-specific parameters.
You need to create your own implementation of Writable- something like the following (considering id and name as table fields):
class MyRecord implements Writable, DBWritable {
long id;
String name;
public void readFields(DataInput in) throws IOException {
this.id = in.readLong();
this.name = Text.readString(in);
}
public void readFields(ResultSet resultSet) throws SQLException {
this.id = resultSet.getLong(1);
this.name = resultSet.getString(2); }
public void write(DataOutput out) throws IOException {
out.writeLong(this.id);
Text.writeString(out, this.name); }
public void write(PreparedStatement stmt) throws SQLException {
stmt.setLong(1, this.id);
stmt.setString(2, this.name); }
}
The mapper then receives an instance of your DBWritable implementation as its input value. The input key is a row id provided by the database; you’ll most likely discard this value.
public class MyMapper extends MapReduceBase implements Mapper<LongWritable, MyRecord, LongWritable, Text> {
public void map(LongWritable key, MyRecord val, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
// Use val.id, val.name here
output.collect(new LongWritable(val.id), new Text(val.name));
}
}
For more : read the following link (actual source of my answer) : http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
Have a look at this post. It shows how to sink data from Map Reduce to MySQL Database.

Read/Write Hbase without reducer, exceptions

I want to read and write hbase without using any reducer.
I followed the example in "The Apache HBase™ Reference Guide", but there are exceptions.
Here is my code:
public class CreateHbaseIndex {
static final String SRCTABLENAME="sourceTable";
static final String SRCCOLFAMILY="info";
static final String SRCCOL1="name";
static final String SRCCOL2="email";
static final String SRCCOL3="power";
static final String DSTTABLENAME="dstTable";
static final String DSTCOLNAME="index";
static final String DSTCOL1="key";
public static void main(String[] args) {
System.out.println("CreateHbaseIndex Program starts!...");
try {
Configuration config = HBaseConfiguration.create();
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.addColumn(Bytes.toBytes(SRCCOLFAMILY), Bytes.toBytes(SRCCOL1));//info:name
HBaseAdmin admin = new HBaseAdmin(config);
if (admin.tableExists(DSTTABLENAME)) {
System.out.println("table Exists.");
}
else{
HTableDescriptor tableDesc = new HTableDescriptor(DSTTABLENAME);
tableDesc.addFamily(new HColumnDescriptor(DSTCOLNAME));
admin.createTable(tableDesc);
System.out.println("create table ok.");
}
Job job = new Job(config, "CreateHbaseIndex");
job.setJarByClass(CreateHbaseIndex.class);
TableMapReduceUtil.initTableMapperJob(
SRCTABLENAME, // input HBase table name
scan, // Scan instance to control CF and attribute selection
HbaseMapper.class, // mapper
ImmutableBytesWritable.class, // mapper output key
Put.class, // mapper output value
job);
job.waitForCompletion(true);
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
System.out.println("Program ends!...");
}
public static class HbaseMapper extends TableMapper<ImmutableBytesWritable, Put> {
private HTable dstHt;
private Configuration dstConfig;
#Override
public void setup(Context context) throws IOException{
dstConfig=HBaseConfiguration.create();
dstHt = new HTable(dstConfig,SRCTABLENAME);
}
#Override
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
// this is just copying the data from the source table...
context.write(row, resultToPut(row,value));
}
private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
Put put = new Put(key.get());
for (KeyValue kv : result.raw()) {
put.add(kv);
}
return put;
}
#Override
protected void cleanup(Context context) throws IOException, InterruptedException {
dstHt.close();
super.cleanup(context);
}
}
}
By the way, "souceTable" is like this:
key name email
1 peter a#a.com
2 sam b#b.com
"dstTable" will be like this:
key value
peter 1
sam 2
I am a newbie in this field and need you help. Thx~
You are correct that you don't need a reducer to write to HBase, but there are some instances where a reducer might help. If you are creating an index, you might run into situations where two mappers are trying to write the same row. Unless you are careful to ensure that they are writing into different column qualifiers, you could overwrite one update with another due to race conditions. While HBase does do row level locking, it won't help if your application logic is faulty.
Without seeing your exceptions, I would guess that you are failing because you are trying to write key-value pairs from your source table into your index table, where the column family doesn't exist.
In this code you are not specifying the output format. You need to add the following code
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE,
DSTTABLENAME);
Also, we are not supposed to create new configuration in the set up, we need to use the same configuration from context.

Any suggestions for reading two different dataset into Hadoop at the same time?

Dear hadooper:
I'm new for hadoop, and recently try to implement an algorithm.
This algorithm needs to calculate a matrix, which represent the different rating of every two pair of songs. I already did this, and the output is a 600000*600000 sparse matrix which I stored in my HDFS. Let's call this dataset A (size=160G)
Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profile first(which is 5G size), let call this dataset B, and then calculate use the dataset A.
But now I don't know how to read the two dataset from a single hadoop program. Or can I read the dataset B into RAM then do the calculation?( I guess I can't, because the HDFS is a distribute system, and I can't read the dataset B into a single machine's memory).
Any suggestions?
You can use two Map function, Each Map Function Can process one data set if you want to implement different processing. You need to register your map with your job conf. For eg:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollectoroutput, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input outpur file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf();
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(args[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}
Hadoop allows you to use different map input formats for different folders. So you can read from several datasources and then cast to specific type in Map function i.e. in one case you got (String,User) in other (String,SongSongRating) and you Map signature is (String,Object).
The second step is selection recommendation algorithm, join those data in some way so aggregator will have least information enough to calculate recommendation.

Resources