Hadoop Streaming through a Runnable Jar produces no output

I'm trying to stream data through an existing Java app, and as a test I just created a runnable jar that prints to stdout:
public class Myapp {
    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            System.out.println(i);
        }
    }
}
After creating the jar I can do this:
> java -jar myapp.jar a b < input.txt > myout1.txt
and myout1.txt gets filled with data. When I run this in hadoop using
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.0.1.jar -file 'myapp.jar' -numReduceTasks 0 -input "input.txt" -output "myout.txt" -mapper "java -jar myapp.jar"
The job succeeds, but the myout.txt/part-* files are all empty. Reading data from stdin in the jar doesn't help either, and the same setup works fine in Python and Perl, as well as with the Java API and a map function. Is there something special about streaming through a jar or printing with System.out.println?
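For reference, the streaming contract is that each mapper reads its input split from stdin and writes output records to stdout. A minimal sketch of a jar that simply echoes its input (the class name EchoMapper is made up for illustration) would look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class EchoMapper {
    public static void main(String[] args) throws Exception {
        // Read the map input split from stdin and echo each line to stdout.
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    }
}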

Related

How to automate Veracode scans

Hey, I am looking to use a Jenkins pipeline to automatically run a Veracode application scan. I know how to launch the scan manually using a few sets of commands. I was just going to add these commands to a script and run it, but maybe there is a better way to do this? Something like this is over-engineered for my purposes: https://github.com/OLSPayments/veracode-scripts/blob/master/submitToVeracode.py.
I figured out that it can be done through a Jenkins pipeline. Here is an example:
pipeline {
    agent any  // assumes an agent with JDK 8, Maven, curl and unzip available
    stages {
        stage('Maven Build') {
            steps {
                sh 'mvn clean verify'
            }
        }
        stage('Veracode Pipeline Scan') {
            steps {
                sh 'curl -O https://downloads.veracode.com/securityscan/pipeline-scan-LATEST.zip'
                sh 'unzip pipeline-scan-LATEST.zip pipeline-scan.jar'
                sh """java -jar pipeline-scan.jar \
                    --veracode_api_id "${VERACODE_API_ID}" \
                    --veracode_api_key "${VERACODE_API_SECRET}" \
                    --file "build/libs/sample.jar" \
                    --fail_on_severity="Very High, High" \
                    --fail_on_cwe="80" \
                    --baseline_file "${CI_BASELINE_PATH}" \
                    --timeout "${CI_TIMEOUT}" \
                    --project_name "${env.JOB_NAME}" \
                    --project_url "${env.GIT_URL}" \
                    --project_ref "${env.GIT_COMMIT}"
                """
            }
        }
    }
    post {
        always {
            archiveArtifacts artifacts: 'results.json', fingerprint: true
        }
    }
}
documentation: https://help.veracode.com/reader/tS9CaFwL4_lbIEWWomsJoA/G02kb80l3gTu_ygcuFODaw

Some files are not included during a Tar Gradle task

I just observed a very weird behaviour from a Gradle Tar task.
Let's take a simple example with two files:
/tmp/test$ ls
test1.txt ##test2##
Here is a simple Tar task:
task('testHash', type: Tar) {
    from "/tmp/test"
    extension = 'tar.gz'
    compression = Compression.GZIP
}
The file ##test2## is skipped for some reason after running gradle testHash:
/path/to/gradle/project/foo$ tar tvf build/distributions/foo-1.0.tar.gz
test1.txt
It seems to happen when the filename contains a # character at both the beginning and the end.
A regular tar works fine:
/tmp/test$ tar czvf test.tar.gz *
test1.txt
##test2##
/tmp/test$ tar tf test.tar.gz
test1.txt
##test2##
I am using Gradle 4.1. Any explanation?
Thanks to Opal's comments, I adjusted my searches and found a workaround. There may be a cleaner way, but this one works for me:
task('testHash', type: Tar) {
    doFirst {
        org.apache.tools.ant.DirectoryScanner.defaultExcludes.each {
            org.apache.tools.ant.DirectoryScanner.removeDefaultExclude(it)
        }
    }
    from "/tmp/test"
    extension = 'tar.gz'
    compression = Compression.GZIP
}
FYI, here are the default excludes:
There are a set of definitions that are excluded by default from all
directory-based tasks. As of Ant 1.8.1 they are:
**/*~
**/#*#
**/.#*
**/%*%
**/._*
**/CVS
**/CVS/**
**/.cvsignore
**/SCCS
**/SCCS/**
**/vssver.scc
**/.svn
**/.svn/**
**/.DS_Store
Ant 1.8.2 adds the following default excludes:
**/.git
**/.git/**
**/.gitattributes
**/.gitignore
**/.gitmodules
**/.hg
**/.hg/**
**/.hgignore
**/.hgsub
**/.hgsubstate
**/.hgtags
**/.bzr
**/.bzr/**
**/.bzrignore

How to extract a .gz file in hadoop cluster environment?

How can I extract a .gz file in a Hadoop cluster environment via Java code, without copying the file to local storage (i.e. extracting directly in Hadoop through code)?
"gunzip -k file.gz" is usually used to unpack the .gz file keeping the original .gz as well, is it what you were looking for?
Assuming your .gz file has one file in it, you can do it like this:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);

FSDataInputStream in = fs.open(inFile);
// Pick the right decompression codec from the file extension (.gz -> GzipCodec).
// Needs the org.apache.hadoop.io.compress.{CompressionCodec, CompressionCodecFactory}
// and java.io.{InputStream, OutputStream} imports.
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(inFile);
InputStream gis = codec.createInputStream(in);
FSDataOutputStream out = fs.create(outFile);
doCopy(gis, out);

public static void doCopy(InputStream is, OutputStream os) throws Exception {
    int oneByte;
    while ((oneByte = is.read()) != -1) {
        os.write(oneByte);
    }
    os.close();
    is.close();
}
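Copying one byte at a time works but is slow for large files. As a sketch, assuming the same fs, inFile, outFile and factory variables as above, Hadoop's IOUtils can do the same copy with a buffer:

// Buffered alternative to the byte-by-byte loop above (sketch; the variables
// fs, inFile, outFile and factory are assumed from the snippet above).
InputStream gzIn = factory.getCodec(inFile).createInputStream(fs.open(inFile));
OutputStream rawOut = fs.create(outFile);
try {
    org.apache.hadoop.io.IOUtils.copyBytes(gzIn, rawOut, 4096, false); // 4 KB buffer, keep streams open
} finally {
    org.apache.hadoop.io.IOUtils.closeStream(gzIn);
    org.apache.hadoop.io.IOUtils.closeStream(rawOut);
}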
The other way of doing this is to create a shell script and run that script whenever you need it, or else implement the terminal commands in your code.
If you want to do it from the terminal you can run these commands:
gzip [ -acdfhlLnNrtvV19 ] [-S suffix] [ name ... ]
gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ... ]
source : https://www.techonthenet.com/linux/commands/gzip.php

How to unzip .gz files in a new directory in hadoop?

I have a bunch of .gz files in a folder in hdfs. I want to unzip all of these .gz files to a new folder in hdfs. How should I do this?
I can think of three different ways to achieve it.
Using Linux command line
The following command worked for me.
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
My gzipped file is Links.txt.gz
The output gets stored in /tmp/unzipped/Links.txt
Using Java program
In the Hadoop: The Definitive Guide book, there is a section on Codecs. In that section, there is a program to decompress the output using CompressionCodecFactory. I am reproducing that code as is:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
This code takes the gz file path as input.
You can execute this as:
FileDecompressor <gzipped file name>
For example, when I executed it for my gzipped file:
FileDecompressor /tmp/Links.txt.gz
I got the unzipped file at location: /tmp/Links.txt
It stores the unzipped file in the same folder. So you need to modify this code to take 2 input parameters: <input file path> and <output folder>.
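A minimal sketch of that modification, assuming args[0] is the gzipped input file and args[1] is the output folder (the class name FileDecompressorToDir is made up here), could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressorToDir {
    public static void main(String[] args) throws Exception {
        String inputUri = args[0];   // e.g. the gzipped file path
        String outputDir = args[1];  // e.g. the target folder
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(inputUri), conf);
        Path inputPath = new Path(inputUri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + inputUri);
            System.exit(1);
        }
        // Keep the original file name, minus the codec's extension, inside the target folder.
        String baseName =
            CompressionCodecFactory.removeSuffix(inputPath.getName(), codec.getDefaultExtension());
        Path outputPath = new Path(outputDir, baseName);
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(outputPath);
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}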
Once you get this program working, you can write a Shell/Perl/Python script to call this program for each of the inputs you have.
Using Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:
A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
STORE A INTO '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
When you run this script, the unzipped contents are stored in a temporary folder: /tmp/tmp_unzipped. This folder will contain
/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000
The part-m-00000 contains the unzipped file.
Hence, we need to explicitly rename it using the following command and finally delete the /tmp/tmp_unzipped folder:
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/
So, if you use this Pig script, you just need to take care of parameterizing the file name (Links.txt.gz and Links.txt).
Again, once you get this script working, you can write a Shell/Perl/Python script to call this Pig script for each of the inputs you have.
Bash solution
In my case, I did not want to pipe-unzip the files since I was not sure of their content. Instead, I wanted to make sure all the files inside the zip files would end up extracted on HDFS.
I have created a simple bash script. Comments should give you a clue what is going on. There is a short description below.
#!/bin/bash

workdir=/tmp/unziphdfs/
cd $workdir

# get all zip files in a folder
zips=$(hadoop fs -ls /yourpath/*.zip | awk '{print $8}')
for hdfsfile in $zips
do
    echo $hdfsfile

    # copy to temp folder to unpack
    hdfs dfs -copyToLocal $hdfsfile $workdir
    hdfsdir=$(dirname "$hdfsfile")
    zipname=$(basename "$hdfsfile")

    # unpack locally and remove
    unzip $zipname
    rm -rf $zipname

    # copy files back to hdfs
    files=$(ls $workdir)
    for file in $files; do
        hdfs dfs -copyFromLocal $file $hdfsdir
        rm -rf $file
    done

    # optionally remove the zip file from hdfs?
    # hadoop fs -rm -skipTrash $hdfsfile
done
Description
Get all the *.zip files in an hdfs dir
One-by-one: copy zip to a temp dir (on filesystem)
Unzip
Copy all the extracted files to the dir of the zip file
Cleanup
I managed to get it working with a sub-directory structure containing many zip files in each, using /mypath/*/*.zip.
Good luck :)
If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).
hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a
You can do this using Hive (assuming it is text data):
create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;
Data will be uncompressed into new set of files.
If you do not want to change the names, and if you have enough storage on the node where you are running the command, you can do this:
hadoop fs -get <your_source_directory> <directory_name>
It will create a directory in the location where you run the hadoop command. cd into it and gunzip all the files, then:
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>
Here is the Scala code:
import org.apache.hadoop.fs.{FSDataOutputStream, FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.{CompressionCodecFactory, CompressionInputStream}
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.io.IOUtils

val conf = new org.apache.hadoop.conf.Configuration()

def extractFile(sparkSession: SparkSession, compPath: String, uncompPath: String): String = {
  val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
  val inputPath = new Path(compPath)
  val factory = new CompressionCodecFactory(sparkSession.sparkContext.hadoopConfiguration)
  val codec = factory.getCodec(inputPath)
  if (codec == null) {
    throw new RuntimeException(s"No valid codec found for $compPath")
  }
  var in: CompressionInputStream = null
  var out: FSDataOutputStream = null
  try {
    in = codec.createInputStream(fs.open(inputPath))
    out = fs.create(new Path(uncompPath))
    IOUtils.copyBytes(in, out, conf)
  } finally {
    IOUtils.closeStream(in)
    IOUtils.closeStream(out)
  }
  uncompPath
}
Hadoop's FileUtil class has unTar() and unZip() methods to achieve this. The unTar() method will work on .tar.gz and .tgz files as well. Unfortunately they only work on files on the local filesystem. You'll have to use one of the same class's copy() methods to copy to and from any distributed file systems you need to use.
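A rough sketch of that round trip, with made-up paths and class name, would be: copy the archive from HDFS to the local filesystem, untar it locally, then copy the extracted files back to HDFS.

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class UntarViaLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path hdfsArchive = new Path(args[0]);   // e.g. an archive.tar.gz on HDFS
        Path hdfsTargetDir = new Path(args[1]); // target directory on HDFS
        File localArchive = new File("/tmp/" + hdfsArchive.getName());
        File localDir = new File("/tmp/untarred");

        // 1. Copy the archive from HDFS to the local filesystem.
        FileUtil.copy(fs, hdfsArchive, localArchive, false, conf);
        // 2. unTar() handles .tar, .tar.gz and .tgz; unZip() would handle .zip.
        FileUtil.unTar(localArchive, localDir);
        // 3. Copy every extracted entry back to HDFS.
        for (File extracted : localDir.listFiles()) {
            FileUtil.copy(extracted, fs, new Path(hdfsTargetDir, extracted.getName()), false, conf);
        }
    }
}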

specifying own inputformat for streaming job

I defined my own input format as follows, which prevents file splitting:
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
I compiled this using Eclipse into a class NSTextInputFormat.class. I copied this class to the client from where the job is launched. I used the following command to launch the job, passing the above class as the input format:
hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.job.queue.name=unfunded -input 24222910/framefile -input 24225109/framefile -output Output -inputformat NSTextInputFormat -mapper ExtractHSV -file ExtractHSV -file NSTextInputFormat.class -numReduceTasks 0
This fails saying:
-inputformat : class not found : NSTextInputFormat
Streaming Job Failed!
I set the PATH and CLASSPATH variables to the directory containing NSTextInputFormat.class, but that still does not work. Any pointers would be helpful.
There are a few gotchas here that can get you if you are not familiar with Java.
-inputformat (and the other command-line options that expect class names) expects a fully qualified class name, otherwise it expects to find the class in some org.apache.hadoop... namespace. So you must include a package name in your .java file:
package org.example.hadoop;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;

public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
And then specify the full name on the command line:
-inputformat org.example.hadoop.NSTextInputFormat
When you build the jar file the .class file must also be in a directory structure that mirrors the package name. I'm sure this is Java Packaging 101, but if you are using Hadoop Streaming then you probably aren't too familiar with Java in the first place. Passing the -d option to javac will tell it to compile the input files into .class files in directories that match the package name.
javac -classpath `hadoop classpath` -d ./output NSTextInputFormat.java
The compiled .class file will be written to ./output/org/example/hadoop/NSTextInputFormat.class. You will need to create the output directory but the other sub-directories will be created for you. The jar file can then be created like so:
jar cvf myjar.jar -C ./output/ .
And you should see some output similar to this:
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/example/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/NSTextInputFormat.class(in = 372) (out= 252)(deflated 32%)
Bundle the input format and mapper class into a jar (myjar.jar) and add the -libjars myjar.jar option to the command line:
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-libjars myjar.jar \
-Dmapred.job.queue.name=unfunded \
-input 24222910/framefile \
-input 24225109/framefile \
-output Output \
-inputformat NSTextInputFormat \
-mapper ExtractHSV \
-numReduceTasks 0
