I'm trying to export a table from Teradata into a file in HDFS using TDCH.
I'm using the parameters below:
hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorImportTool \
-libjars $LIB_JARS \
-Dmapred.job.queue.name=default \
-Dtez.queue.name=default \
-Dmapred.job.name=TDCH \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://$ipServer/logmech=ldap,database=$database,charset=UTF16 \
-jobtype hdfs \
-fileformat textfile \
-separator ',' \
-enclosedby '"' \
-targettable ${targetTable} \
-username ${userName} \
-password ${password} \
-sourcequery "select * from ${database}.${targetTable}" \
-nummappers 1 \
-sourcefieldnames "" \
-targetpaths ${targetPaths}
It's working, but I need the headers in the file, and when I add the parameter:
-targetfieldnames "ID","JOB","DESC","DT","REG" \
It doesn't work; the file isn't even generated anymore.
Can anyone help me?
The -targetfieldnames option is only valid for -jobtype hive.
It does not put headers in the HDFS file; it specifies Hive column names.
(There is no option to prefix CSV with a header record.)
Also the value supplied for -targetfieldnames should be a single string like "ID,JOB,DESC,DT,REG" rather than a list of strings.
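If you really do need a header record in the exported HDFS file, one workaround (not a TDCH feature) is to prepend it after the export finishes. Below is a minimal sketch using the Hadoop FileSystem API; the class name PrependHeader, the argument handling, and the hard-coded column list are my own assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: copy the exported part file into a new file that starts with a header record.
public class PrependHeader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path exported = new Path(args[0]);    // e.g. the part file written by TDCH
        Path withHeader = new Path(args[1]);  // new file that starts with the header record

        try (PrintWriter out = new PrintWriter(fs.create(withHeader));
             BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(exported)))) {
            out.println("\"ID\",\"JOB\",\"DESC\",\"DT\",\"REG\"");  // assumed column list
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);
            }
        }
    }
}

With -nummappers 1 there is only one part file, so a single rewrite like this stays manageable; for larger exports a different approach would be needed.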
I am trying to include request validation for gRPC, so I modified the protoc command.
pkg/test/test.proto contains my schema.
If I run the command below:
protoc --go_out=. \
--proto_path=${GOPATH}/src \
--proto_path=${GOPATH}/src/github.com/gogo/protobuf/gogoproto/ \
--proto_path=${GOPATH}/src/github.com/mwitkow/go-proto-validators/ \
--proto_path=. \
--go_opt=paths=source_relative \
--go-grpc_out=. \
--go-grpc_opt=paths=source_relative \
--govalidators_out=. \
pkg/test/test.proto
The validator file is not generated inside pkg/test; instead it is generated inside a newly created folder: {source relative pkg}/pkg/test/test.proto/validator.go.
How do I generate the validator.go file in pkg/test, without the extra folder structure?
Analysis
It looks like the *.validator.pb.go files are generated in the wrong directory.
Using the pkg/test/test.proto file with the following content:
syntax = "proto3";
option go_package = "github.com/example-user/example-repository";
service Greeter {
    rpc SayHello (HelloRequest) returns (HelloReply) {}
}

message HelloRequest {
    string name = 1;
}

message HelloReply {
    string message = 1;
}
Running the original command produced the following file system contents:
$ find .
.
./github.com
./github.com/example-user
./github.com/example-user/example-repository
./github.com/example-user/example-repository/test.validator.pb.go
./pkg
./pkg/test
./pkg/test/test_grpc.pb.go
./pkg/test/test.proto
./pkg/test/test.pb.go
Solution
Add the --govalidators_opt=paths=source_relative command line argument.
Please note the parameter name:
--govalidators_opt
The complete command line:
protoc --go_out=. \
--proto_path=. \
--go_opt=paths=source_relative \
--go-grpc_out=. \
--go-grpc_opt=paths=source_relative \
--govalidators_out=. \
--govalidators_opt=paths=source_relative \
pkg/test/test.proto
This produced the following file system contents:
$ find .
.
./pkg
./pkg/test
./pkg/test/test_grpc.pb.go
./pkg/test/test.proto
./pkg/test/test.pb.go
./pkg/test/test.validator.pb.go
Additional references
A GitHub issue created by the asker: How to specify path for generated validator.go? · Issue #121 · mwitkow/go-proto-validators.
I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files to a new folder in HDFS. How should I do this?
I can think of 3 different ways to achieve this.
Using Linux command line
The following command worked for me.
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz
The output gets stored in /tmp/unzipped/Links.txt
Using Java program
In the Hadoop: The Definitive Guide book, there is a section on Codecs. In that section, there is a program to decompress the output using CompressionCodecFactory. I am reproducing that code as is:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the gz file path as input.
You can execute this as:
FileDecompressor <gzipped file name>
For example, when I executed it for my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at location: /tmp/Links.txt
It stores the unzipped file in the same folder, so you need to modify this code to take 2 input parameters: <input file path> and <output folder>.
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
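Alternatively, if you prefer to stay in Java, here is a minimal sketch that combines both ideas: it takes 2 parameters (<input folder> and <output folder>), loops over every .gz file in the input folder, and decompresses each one into the output folder. The class name DecompressGzFolder and the argument handling are my own assumptions, not code from the book:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressGzFolder {
    public static void main(String[] args) throws Exception {
        Path inputDir = new Path(args[0]);   // folder containing the .gz files
        Path outputDir = new Path(args[1]);  // folder for the unzipped files

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Find every .gz file in the input folder.
        FileStatus[] matches = fs.globStatus(new Path(inputDir, "*.gz"));
        if (matches == null) {
            return;
        }

        for (FileStatus status : matches) {
            Path inputPath = status.getPath();
            CompressionCodec codec = factory.getCodec(inputPath);
            if (codec == null) {
                System.err.println("No codec found for " + inputPath);
                continue;
            }

            // Strip the .gz suffix and write the result into the output folder.
            String outputName = CompressionCodecFactory.removeSuffix(
                    inputPath.getName(), codec.getDefaultExtension());
            Path outputPath = new Path(outputDir, outputName);

            InputStream in = null;
            OutputStream out = null;
            try {
                in = codec.createInputStream(fs.open(inputPath));
                out = fs.create(outputPath);
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }
        }
    }
}

You can execute it the same way as the program above, for example: DecompressGzFolder /tmp/gz /tmp/unzipped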
Using Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
The part-m-00000 contains the unzipped file.
Hence, we need to explicitly rename it using the following commands and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
Bash solution
In my case, I did not want to pipe-unzip the files since I was not sure of their content. Instead, I wanted to make sure all files inside the zip files would be extracted to HDFS.
I have created a simple bash script. Comments should give you a clue what is going on. There is a short description below.
#!/bin/bash

workdir=/tmp/unziphdfs/
cd $workdir

# get all zip files in a folder
zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')

for hdfsfile in $zips
do
    echo $hdfsfile

    # copy to temp folder to unpack
    hdfs dfs -copyToLocal $hdfsfile $workdir
    hdfsdir=$(dirname "$hdfsfile")
    zipname=$(basename "$hdfsfile")

    # unpack locally and remove
    unzip $zipname
    rm -rf $zipname

    # copy files back to hdfs
    files=$(ls $workdir)
    for file in $files; do
        hdfs dfs -copyFromLocal $file $hdfsdir
        rm -rf $file
    done

    # optionally remove the zip file from hdfs?
    # hadoop fs -rm -skipTrash $hdfsfile
done
Description
Get all the *.zip files in an hdfs dir
One-by-one: copy zip to a temp dir (on filesystem)
Unzip
Copy all the extracted files to the dir of the zip file
Cleanup
I managed to get it working with a sub-directory structure, with many zip files in each sub-directory, using /mypath/*/*.zip.
Good luck :)
If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).
hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a
You can do this using Hive (assuming it is text data).
create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;
Data will be uncompressed into a new set of files.
If you do not want to change the names and if you have enough storage on the node where you are running this, you can do it as follows.
hadoop fs -get <your_source_directory> <directory_name>
This will create a directory under the location where you run the hadoop command. cd into it and gunzip all the files:
cd <directory_name>
gunzip *.gz
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
Providing the Scala code:
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.io.IOUtils
val conf = new org.apache.hadoop.conf.Configuration()
def extractFile(sparkSession: SparkSession, compath: String, uncompPath: String): String = {
  val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
  val inputPath = new Path(compath)

  val factory = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
  val codec = factory.getCodec(inputPath)
  if (codec == null) {
    throw new RuntimeException(s"No codec found for $compath")
  }

  var in: CompressionInputStream = null
  var out: FSDataOutputStream = null
  try {
    in = codec.createInputStream(fs.open(inputPath))
    out = fs.create(new Path(uncompPath))
    IOUtils.copyBytes(in, out, conf)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
  uncompPath
}
Hadoop's FileUtil class has unTar() and unZip() methods to achieve this. The unTar() method will work on .tar.gz and .tgz files as well. Unfortunately they only work on files on the local filesystem. You'll have to use one of the same class's copy() methods to copy to and from any distributed file systems you need to use.
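As a rough illustration of that copy-local / unzip / copy-back flow, here is a minimal sketch; the class name UnzipViaFileUtil, the /tmp/unzip-scratch working directory, and the argument handling are my own placeholders:

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class UnzipViaFileUtil {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path hdfsZip = new Path(args[0]);     // zip archive on HDFS
        Path hdfsTarget = new Path(args[1]);  // target folder on HDFS
        File scratch = new File("/tmp/unzip-scratch");  // local working directory (placeholder)
        scratch.mkdirs();

        // 1. Copy the archive from HDFS to the local filesystem.
        File localZip = new File(scratch, hdfsZip.getName());
        fs.copyToLocalFile(hdfsZip, new Path(localZip.getAbsolutePath()));

        // 2. Unpack it locally; FileUtil.unZip() only works on local files.
        File localOut = new File(scratch, "extracted");
        FileUtil.unZip(localZip, localOut);

        // 3. Copy the extracted files back to HDFS.
        File[] extracted = localOut.listFiles();
        if (extracted != null) {
            for (File f : extracted) {
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()), hdfsTarget);
            }
        }
    }
}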
I'm writing an app in Vala with support for plugins. The app has the following directory structure:
data/
[data files]
m4/
my_project.m4
plugins/
example/
example.plugin.in
example-plugin.vala
Makefile.am
po/
src/
[source files]
The file "my_project.m4" dinamically adds plugin dirs with a simple defined function called MYPROJ_ADD_PLUGIN, and it works fine as I tested it with some other projects. Basically, it calls:
AC_CONFIG_FILES([plugins/example/Makefile])
[...]
AC_CONFIG_FILES([plugins/example/example.plugin])
The problem is, when I try to configure it, it gives back:
"error: cannot find input file: `plugins/example/Makefile.in'"
The example makefile (plugins/example/Makefile.am) is the following:
include $(top_srcdir)/common.am
plugin_LTLIBRARIES = example-plugin.la
plugin_DATA = example.plugin
example_plugin_la_SOURCES = \
example-plugin.vala
example_plugin_la_VALAFLAGS = \
$(MYPROJ_COMMON_VALAFLAGS) \
--target-glib=2.38
example_plugin_la_CFLAGS = \
$(MYPROJ_COMMON_CFLAGS) \
-I$(srcdir) \
-DG_LOG_DOMAIN='"Example"'
example_plugin_la_LIBADD = \
$(MYPROJ_COMMON_LIBS)
example_plugin_la_LDFLAGS = \
$(MYPROJ_PLUGIN_LINKER_FLAGS) \
-lm
EXTRA_DIST = example.plugin.in
Every var is correctly generated (in common.am and configure.ac).
I appreciate any advice on this issue.
Thanks in advance
Looks like I found the answer to my own question. Apparently, all I had to do was add a "lib" prefix to my plugin output file. The plugins/example/Makefile.am now looks like:
include $(top_srcdir)/common.am
plugin_LTLIBRARIES = libexample.la
plugin_DATA = example.plugin
libexample_la_SOURCES = \
example-plugin.vala
libexample_la_VALAFLAGS = \
$(MYPROJ_COMMON_VALAFLAGS) \
--target-glib=2.38
libexample_la_CFLAGS = \
$(MYPROJ_COMMON_CFLAGS) \
-I$(srcdir) \
-DG_LOG_DOMAIN='"Example"'
libexample_la_LIBADD = \
$(MYPROJ_COMMON_LIBS)
libexample_la_LDFLAGS = \
$(MYPROJ_PLUGIN_LINKER_FLAGS) \
-lm
EXTRA_DIST = example.plugin.in
This was the only modification I made, and it works as expected now. It seems autoconf/autotools is very rigid about the naming of plugins and shared libraries, as they MUST start with the lib prefix.
I'm trying to stream data through an existing Java app, and as a test I just created a runnable jar that prints to stdout:
public class Myapp {
    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            System.out.println(i);
        }
    }
}
After creating the jar I can do this:
> java -jar myapp.jar a b < input.txt > myout1.txt
and myout1.txt gets filled with data. When I run this in hadoop using
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.0.1.jar -file 'myapp.jar' -numReduceTasks 0 -input "input.txt" -output "myout.txt" -mapper "java -jar myapp.jar"
The job succeeds, but the myout.txt/part-* files are all empty. Reading data from stdin doesn't help either, and the same thing works in Python and Perl, or when using the Java API with a map function. Is there something special about streaming through a jar or printing with System.out.println?