Specifying your own InputFormat for a Hadoop streaming job

I defined my own input format as follows, which prevents file splitting:
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
I compiled this using Eclipse into a class NSTextInputFormat.class. I copied this class to the client from which the job is launched. I used the following command to launch the job and pass the above class as the input format:
hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.job.queue.name=unfunded -input 24222910/framefile -input 24225109/framefile -output Output -inputformat NSTextInputFormat -mapper ExtractHSV -file ExtractHSV -file NSTextInputFormat.class -numReduceTasks 0
This fails saying:
-inputformat : class not found : NSTextInputFormat
Streaming Job Failed!
I set the PATH and CLASSPATH variables to the directory containing NSTextInputFormat.class, but it still does not work. Any pointers would be helpful.

There are a few gotchas here that can get you if you are not familiar with Java.
-inputformat (and the other command-line options that expect class names) expects a fully qualified class name; otherwise it tries to find the class somewhere in the org.apache.hadoop... namespace. So you must include a package declaration in your .java file:
package org.example.hadoop;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
And then specify the fully qualified name on the command line:
-inputformat org.example.hadoop.NSTextInputFormat
When you build the jar file, the .class file must also be in a directory structure that mirrors the package name. I'm sure this is Java Packaging 101, but if you are using Hadoop Streaming then you probably aren't too familiar with Java in the first place. Passing the -d option to javac tells it to compile the input files into .class files in directories that match the package name.
javac -classpath `hadoop classpath` -d ./output NSTextInputFormat.java
The compiled .class file will be written to ./output/org/example/hadoop/NSTextInputFormat.class. You will need to create the output directory yourself, but the other sub-directories will be created for you. The jar file can then be created like so:
jar cvf myjar.jar -C ./output/ .
And you should see some output similar to this:
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/example/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/NSTextInputFormat.class(in = 372) (out= 252)(deflated 32%)

Bundle the input format and mapper class into a jar (myjar.jar) and add the -libjars myjar.jar option to the command line:
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-libjars myjar.jar \
-Dmapred.job.queue.name=unfunded \
-input 24222910/framefile \
-input 24225109/framefile \
-output Output \
-inputformat org.example.hadoop.NSTextInputFormat \
-mapper ExtractHSV \
-numReduceTasks 0
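For completeness, the class above uses the old org.apache.hadoop.mapred API, which is what the streaming -inputformat option has traditionally expected. If you later write jobs against the newer org.apache.hadoop.mapreduce API, a minimal sketch of the same non-splittable input format would look like the following (the class name NSTextInputFormatNewApi is just a placeholder here):

package org.example.hadoop;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// New-API variant: report every input file as non-splittable so each file goes to a single mapper.
public class NSTextInputFormatNewApi extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Package it and reference it by its fully qualified name in exactly the same way as described above.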

Related

Exporting data from Teradata to HDFS using TDCH

I'm trying to export a table from Teradata into a file in HDFS using TDCH.
I'm using the following parameters:
hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorImportTool \
-libjars $LIB_JARS \
-Dmapred.job.queue.name=default \
-Dtez.queue.name=default \
-Dmapred.job.name=TDCH \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://$ipServer/logmech=ldap,database=$database,charset=UTF16 \
-jobtype hdfs \
-fileformat textfile \
-separator ',' \
-enclosedby '"' \
-targettable ${targetTable} \
-username ${userName} \
-password ${password} \
-sourcequery "select * from ${database}.${targetTable}" \
-nummappers 1 \
-sourcefieldnames "" \
-targetpaths ${targetPaths}
It's working, but I need the headers in the file, and when I add the parameter:
-targetfieldnames "ID","JOB","DESC","DT","REG" \
It doesn't work; the file is no longer generated at all.
Can anyone help me?
The -targetfieldnames option is only valid for -jobtype hive.
It does not put headers in the HDFS file; it specifies Hive column names.
(There is no option to prefix the CSV output with a header record.)
Also, the value supplied for -targetfieldnames should be a single string like "ID,JOB,DESC,DT,REG" rather than a list of quoted strings.

protobuf validator command generates file in wrong path

I am trying to add request validation for gRPC. I modified the protoc command like this.
pkg/test/test.proto contains my schema.
If I run the below command:
protoc --go_out=. \
--proto_path=${GOPATH}/src \
--proto_path=${GOPATH}/src/github.com/gogo/protobuf/gogoproto/ \
--proto_path=${GOPATH}/src/github.com/mwitkow/go-proto-validators/ \
--proto_path=. \
--go_opt=paths=source_relative \
--go-grpc_out=. \
--go-grpc_opt=paths=source_relative \
--govalidators_out=. \
pkg/test/test.proto
The generated validator file is not created inside pkg/test; instead it ends up inside a newly created folder: {source relative pkg}/pkg/test/test.proto/validator.go.
How can I generate the validator.go file in pkg/test without this extra folder structure?
Analysis
It looks like the *.validator.pb.go files are generated in the wrong directory.
Using the pkg/test/test.proto file with the following content:
syntax = "proto3";
option go_package = "github.com/example-user/example-repository";
service Greeter {
rpc SayHello (HelloRequest) returns (HelloReply) {}
}
message HelloRequest {
string name = 1;
}
message HelloReply {
string message = 1;
}
Produced the file system contents:
$ find .
.
./github.com
./github.com/example-user
./github.com/example-user/example-repository
./github.com/example-user/example-repository/test.validator.pb.go
./pkg
./pkg/test
./pkg/test/test_grpc.pb.go
./pkg/test/test.proto
./pkg/test/test.pb.go
Solution
Add the --govalidators_opt=paths=source_relative command line argument.
Please note the parameter name:
--govalidators_opt
The complete command line:
protoc --go_out=. \
--proto_path=. \
--go_opt=paths=source_relative \
--go-grpc_out=. \
--go-grpc_opt=paths=source_relative \
--govalidators_out=. \
--govalidators_opt=paths=source_relative \
pkg/test/test.proto
Produced the file system contents:
$ find .
.
./pkg
./pkg/test
./pkg/test/test_grpc.pb.go
./pkg/test/test.proto
./pkg/test/test.pb.go
./pkg/test/test.validator.pb.go
Additional references
A GitHub issue, apparently created by the asker: How to specify path for generated validator.go? · Issue #121 · mwitkow/go-proto-validators.

How to unzip .gz files in a new directory in hadoop?

I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?
I can think of achieving it in 3 different ways.
Using Linux command line
The following command worked for me.
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz
The output gets stored in /tmp/unzipped/Links.txt
Using Java program
In the book Hadoop: The Definitive Guide, there is a section on codecs. In that section, there is a program to decompress the output using CompressionCodecFactory. I am reproducing that code as is:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the gz file path as input.
You can execute this as:
FileDecompressor <gzipped file name>
For e.g. when I executed for my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at location: /tmp/Links.txt
It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
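As a rough sketch of that modification, something like the following should work; the class name FileDecompressorToDir and the argument handling are mine, not from the book, but the codec logic is unchanged:

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressorToDir {
    public static void main(String[] args) throws Exception {
        String inputUri = args[0];   // e.g. /tmp/Links.txt.gz
        String outputDir = args[1];  // e.g. /tmp/unzipped

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(inputUri), conf);
        Path inputPath = new Path(inputUri);

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + inputUri);
            System.exit(1);
        }

        // Strip the codec suffix (e.g. .gz) from the file name and place the result in the output folder.
        String outputName = CompressionCodecFactory.removeSuffix(
            inputPath.getName(), codec.getDefaultExtension());
        Path outputPath = new Path(outputDir, outputName);

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(outputPath);
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}

Invoked as FileDecompressorToDir /tmp/Links.txt.gz /tmp/unzipped, it would write /tmp/unzipped/Links.txt.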
Using Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
Store A into '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
The part-m-00000 contains the unzipped file.
Hence, we need to explicitly rename it using the following command and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
Bash solution
In my case, I did not want to pipe-unzip the files since I was not sure of their content. Instead, I wanted to make sure that all files inside the zip archives would end up extracted on HDFS.
I have created a simple bash script. Comments should give you a clue what is going on. There is a short description below.
#!/bin/bash
workdir=/tmp/unziphdfs/
cd $workdir
# get all zip files in a folder
zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')
for hdfsfile in $zips
do
echo $hdfsfile
# copy to temp folder to unpack
hdfs dfs -copyToLocal $hdfsfile $workdir
hdfsdir=$(dirname "$hdfsfile")
zipname=$(basename "$hdfsfile")
# unpack locally and remove
unzip $zipname
rm -rf $zipname
# copy files back to hdfs
files=$(ls $workdir)
for file in $files; do
hdfs dfs -copyFromLocal $file $hdfsdir
rm -rf $file
done
# optionally remove the zip file from hdfs?
# hadoop fs -rm -skipTrash $hdfsfile
done
Description
1. Get all the *.zip files in an HDFS dir.
2. One by one: copy the zip to a temp dir (on the local filesystem).
3. Unzip it.
4. Copy all the extracted files to the dir of the zip file.
5. Clean up.
I managed to have it working with a sub-dir structure for many zip files in each, using /mypath/*/*.zip.
Good luck :)
If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).
hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a
You can do this using Hive (assuming it is text data):
create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;
Data will be uncompressed into a new set of files.
If you do not want to change the file names, and you have enough storage on the node where you are running, you can do this:
hadoop fs -get <your_source_directory> <directory_name>
It will create a directory in the location where you run the hadoop command. cd into it and gunzip all the files. Then:
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
Providing the Scala code:
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.io.IOUtils

val conf = new org.apache.hadoop.conf.Configuration()

// Decompresses the file at compath into uncompPath, picking the codec from the file extension.
def extractFile(sparkSession: SparkSession, compath: String, uncompPath: String): String = {
  val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
  val inputPath = new Path(compath)
  val factory = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
  val codec = factory.getCodec(inputPath)
  if (codec == null) {
    throw new RuntimeException(s"No valid codec found for $compath")
  }

  var in: CompressionInputStream = null
  var out: FSDataOutputStream = null
  try {
    in = codec.createInputStream(fs.open(inputPath))
    out = fs.create(new Path(uncompPath))
    IOUtils.copyBytes(in, out, conf)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
  uncompPath
}
Hadoop's FileUtil class has unTar() and unZip() methods to achieve this. The unTar() method will work on .tar.gz and .tgz files as well. Unfortunately they only work on files on the local filesystem. You'll have to use one of the same class's copy() methods to copy to and from any distributed file systems you need to use.
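For illustration only, a minimal sketch of that round trip (copy the archive to the local filesystem, extract it, copy the results back) might look like this; the paths and the class name UnzipViaLocal are placeholders, not part of any existing tool:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class UnzipViaLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder paths for the sketch.
        Path hdfsArchive = new Path("/tmp/archive.zip");
        File localArchive = new File("/tmp/work/archive.zip");
        File localExtracted = new File("/tmp/work/extracted");
        localExtracted.mkdirs();

        // 1. Copy the archive from HDFS to the local filesystem.
        FileUtil.copy(fs, hdfsArchive, localArchive, false, conf);

        // 2. Extract locally; FileUtil.unTar() would be used instead for .tar.gz/.tgz archives.
        FileUtil.unZip(localArchive, localExtracted);

        // 3. Copy the extracted files back to HDFS.
        FileUtil.copy(localExtracted, fs, new Path("/tmp/extracted"), false, conf);
    }
}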

AC_CONFIG_FILES not generating Makefiles

I'm writing an app in Vala with support for plugins. The app has the following directory structure:
data/
[data files]
m4/
my_project.m4
plugins/
example/
example.plugin.in
example-plugin.vala
Makefile.am
po/
src/
[source files]
The file "my_project.m4" dinamically adds plugin dirs with a simple defined function called MYPROJ_ADD_PLUGIN, and it works fine as I tested it with some other projects. Basically, it calls:
AC_CONFIG_FILES([plugins/example/Makefile])
[...]
AC_CONFIG_FILES([plugins/example/example.plugin])
The problem is, when I try to configure it, it gives back:
"error: cannot find input file: `plugins/example/Makefile.in'"
The example makefile (plugins/example/Makefile.am) is the following:
include $(top_srcdir)/common.am
plugin_LTLIBRARIES = example-plugin.la
plugin_DATA = example.plugin
example_plugin_la_SOURCES = \
example-plugin.vala
example_plugin_la_VALAFLAGS = \
$(MYPROJ_COMMON_VALAFLAGS) \
--target-glib=2.38
example_plugin_la_CFLAGS = \
$(MYPROJ_COMMON_CFLAGS) \
-I$(srcdir) \
-DG_LOG_DOMAIN='"Example"'
example_plugin_la_LIBADD = \
$(MYPROJ_COMMON_LIBS)
example_plugin_la_LDFLAGS = \
$(MYPROJ_PLUGIN_LINKER_FLAGS) \
-lm
EXTRA_DIST = example.plugin.in
Every var is correctly generated (in common.am and configure.ac).
I appreciate any advice on this issue.
Thanks in advance
Looks like I found the answer to my own question. Apparently, all I had to do was add a "lib" prefix to my plugin output file name. The plugins/example/Makefile.am now looks like this:
include $(top_srcdir)/common.am
plugin_LTLIBRARIES = libexample.la
plugin_DATA = example.plugin
libexample_la_SOURCES = \
example-plugin.vala
libexample_la_VALAFLAGS = \
$(MYPROJ_COMMON_VALAFLAGS) \
--target-glib=2.38
libexample_la_CFLAGS = \
$(MYPROJ_COMMON_CFLAGS) \
-I$(srcdir) \
-DG_LOG_DOMAIN='"Example"'
libexample_la_LIBADD = \
$(MYPROJ_COMMON_LIBS)
libexample_la_LDFLAGS = \
$(MYPROJ_PLUGIN_LINKER_FLAGS) \
-lm
EXTRA_DIST = example.plugin.in
This was the only modification I made, and it works as expected now. It seems autoconf/autotools is very rigid about the naming of plugins and shared libraries, as they must start with the lib prefix.

Hadoop Streaming through a Runnable Jar produces no output

I'm trying to stream data through an existing Java app, and as a test I just created a runnable jar that prints to stdout:
public class Myapp {
    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            System.out.println(i);
        }
    }
}
After creating the jar I can do this:
> java -jar myapp.jar a b < input.txt > myout1.txt
and myout1.txt gets filled with data. When I run this in hadoop using
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.0.1.jar -file 'myapp.jar' -numReduceTasks 0 -input "input.txt" -output "myout.txt" -mapper "java -jar myapp.jar"
The job succeeds, but the myout.txt/part-* files are all empty. Reading data from stdin doesn't help either; the same approach works in Python and Perl, or when using the Java API with a map function. Is there something special about streaming through a jar or printing with System.out.println?
