mahout toString empty weight - hadoop

I am using Naive Bayes in Mahout to do classification.
After training the model, I used one document to test it.
I converted the data into vectors with:
./bin/mahout seqdirectory -i /home/d/mahoutTest/ -o /home/d/seqMahout1
./bin/mahout seq2sparse -i /home/d/seqMahout1/ -o /home/d/vecMahout1 -lnorm -nv -ow -wt tfidf
But when I tried to test the data, I got this error:
./bin/mahout testnb -i /home/d/vecMahout1/tfidf-vectors/ -m /tmp/mahout-work-d/model/ -l /tmp/mahout-work-d/labelindex -o /home/d/out10/
SLF4J: Failed toString() invocation on an object of type [org.apache.mahout.classifier.ResultAnalyzer]
org.apache.commons.math3.exception.MathIllegalArgumentException: weigth array must contain at least one non-zero value
at org.apache.commons.math3.stat.descriptive.AbstractUnivariateStatistic.test(AbstractUnivariateStatistic.java:309)
at org.apache.commons.math3.stat.descriptive.AbstractUnivariateStatistic.test(AbstractUnivariateStatistic.java:245)
at org.apache.commons.math3.stat.descriptive.moment.Mean.evaluate(Mean.java:211)
at org.apache.commons.math3.stat.descriptive.moment.Mean.evaluate(Mean.java:254)
at org.apache.mahout.classifier.ConfusionMatrix.getWeightedPrecision(ConfusionMatrix.java:143)
at org.apache.mahout.classifier.ResultAnalyzer.toString(ResultAnalyzer.java:114)
at org.slf4j.helpers.MessageFormatter.safeObjectAppend(MessageFormatter.java:304)
at org.slf4j.helpers.MessageFormatter.deeplyAppendParameter(MessageFormatter.java:276)
at org.slf4j.helpers.MessageFormatter.arrayFormat(MessageFormatter.java:230)
at org.slf4j.helpers.MessageFormatter.format(MessageFormatter.java:152)
at org.slf4j.impl.Log4jLoggerAdapter.info(Log4jLoggerAdapter.java:345)
at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:107)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

This can happen if the "true" label for your test item is not in the training set. Do you see a warning like:
15/07/07 15:20:12 WARN ConfusionMatrix: Label YOUR TEST LABEL HERE did not appear in the training examples
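One way to check is to dump the label index that training produced (path taken from the question) and verify that the label of your test document actually appears in it; a minimal sketch:
./bin/mahout seqdumper -i /tmp/mahout-work-d/labelindex
If the test document's category is missing from that list, the ConfusionMatrix ends up with an all-zero weight row, which is what triggers the MathIllegalArgumentException above.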

Related

Shell variable inside sshagent block of a Jenkinsfile

I have this sshagent code block:
sshagent(['ssh_key.hashed']) {
sh """
ssh -o StrictHostKeyChecking=no -l user example.com <<EOF
today=`date +%Y-%m-%d`
drush -r /var/www/html/example.com sql-dump --gzip > /var/www/html/example.com/backups/example_prodDB-jenkins-${today}.sql.gz
EOF
""".stripIndent()
}
As you can see, the intention is to get a DB dump of a Drupal database.
It works fine as a regular shell script.
I need to write it that way because I will reuse the $today shell variable in another line of code.
Within the Jenkinsfile, however, it seems to be interpreted as a Groovy variable, based on the error:
groovy.lang.MissingPropertyException: No such property: today for class: groovy.lang.Binding
at groovy.lang.Binding.getVariable(Binding.java:63)
at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:270)
at org.kohsuke.groovy.sandbox.impl.Checker$7.call(Checker.java:353)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:357)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:333)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:333)
at org.kohsuke.groovy.sandbox.impl.Checker.checkedGetProperty(Checker.java:333)
at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.getProperty(SandboxInvoker.java:29)
at com.cloudbees.groovy.cps.impl.PropertyAccessBlock.rawGet(PropertyAccessBlock.java:20)
at WorkflowScript.run(WorkflowScript:11)
at ___cps.transform___(Native Method)
at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.get(PropertyishBlock.java:74)
at com.cloudbees.groovy.cps.LValueBlock$GetAdapter.receive(LValueBlock.java:30)
at com.cloudbees.groovy.cps.impl.PropertyishBlock$ContinuationImpl.fixName(PropertyishBlock.java:66)
at jdk.internal.reflect.GeneratedMethodAccessor305.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
at com.cloudbees.groovy.cps.Next.step(Next.java:83)
at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Finished: FAILURE
Any hints are appreciated.
Thanks.
After a lot of searching online and some trial and error, I found this helpful reference: https://code-maven.com/jenkins-pipeline-environment-variables
I don't know if it's the right way to do it, but it works.
In summary, I had to add an environment variable and use it within my sshagent block.
Here is a sample Jenkinsfile that works.
pipeline {
    agent any
    stages {
        stage('Dump Prod DB') {
            environment {
                today = sh(script: 'date +%Y-%m-%d', returnStdout: true).trim()
            }
            steps {
                sshagent(['promet_key.hashed']) {
                    sh """
ssh -o StrictHostKeyChecking=no -l user example.com <<EOF
drush -r /var/www/html/example.com sql-dump --gzip > /var/www/html/example.com/backups/example_prodDB-jenkins-${today}.sql.gz
EOF
""".stripIndent()
                }
            }
        }
    }
}
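An alternative sketch (which I have not verified on a live Jenkins) keeps today entirely on the shell side: quote the here-document delimiter so the agent's shell does not expand anything locally, and escape each dollar sign as \$ inside the Groovy triple-quoted string so Groovy passes it through literally. The script the remote host should end up receiving is then plain shell:
# text the remote host should receive; inside the Jenkinsfile's sh """ ... """
# every $ below has to be written as \$ so Groovy does not try to interpolate it
ssh -o StrictHostKeyChecking=no -l user example.com <<'EOF'
today=$(date +%Y-%m-%d)
drush -r /var/www/html/example.com sql-dump --gzip > /var/www/html/example.com/backups/example_prodDB-jenkins-${today}.sql.gz
EOF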

NiFi 1.10 STATELESS - Caused by: java.nio.file.NoSuchFileException

I run this command:
/bin/nifi.sh stateless RunFromRegistry Once --file ./test/stateless_test1.json
The log:
Note: Use of this command is considered experimental. The commands and approach used may change from time to time.
Java home (JAVA_HOME): /home/deltaman/software/jdk1.8.0_211
Java options (STATELESS_JAVA_OPTS): -Xms1024m -Xmx1024m
13:48:39.835 [main] INFO org.apache.nifi.StatelessNiFi - Unpacking 100 NARs
13:50:51.513 [main] INFO org.apache.nifi.StatelessNiFi - Finished unpacking 100 NARs in 131671 millis
Exception in thread "main" java.lang.reflect.InvocationTargetException
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.nifi.StatelessNiFi.main(StatelessNiFi.java:103)
... 5 more
Caused by: java.nio.file.NoSuchFileException: ./test/stateless_test1.json
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.Files.readAllBytes(Files.java:3152)
at org.apache.nifi.stateless.runtimes.Program.runLocal(Program.java:119)
at org.apache.nifi.stateless.runtimes.Program.launch(Program.java:67)
... 10 more
It says the file does not exist, but I can find the file:
$ cat ./test/stateless_test1.json
{
"registryUrl": "http://10.148.123.12:9991",
"bucketId": "ec1b291e-c3f1-437c-a4e4-c069bd2f6ed1",
"flowId": "b1f73fe8-2874-47a5-970c-6b25eea19497",
"parameters": {
"text" : "xixixixi"
}
}
I don't know what the problem is. Any suggestion is appreciated!
/bin/nifi.sh stateless RunFromRegistry Once --file ./test/stateless_test1.json
This is a relative path; you must use the full path, such as:
/home/NiFi/nifi-1.10.0/bin/nifi.sh stateless RunFromRegistry Once --file /home/NiFi/nifi-1.10.0/test/stateless_test1.json
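If you don't want to hard-code the installation directory, a small sketch (assuming you start the command from the NiFi home directory that contains test/) is to let the shell build the absolute path for you:
./bin/nifi.sh stateless RunFromRegistry Once --file "$(pwd)/test/stateless_test1.json"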

Amazon EMR: How to add Amazon EMR MapReduce/Hive/Spark steps with inline shell script in the arguments?

For example, I have two Hive jobs, where the output of one job is used as an argument/variable in the second job. I can successfully run the following command in a terminal to get my result on the master node of the EMR cluster.
[hadoop@ip-10-6-131-223 ~]$ hive -f s3://MyProjectXYZ/bin/GetNewJobDetails_SelectAndOverwrite.hql --hivevar LatestLastUpdated=$(hive -f s3://MyProjectXYZ/bin/GetNewJobDetails_LatestLastUpdated.hql)
However, it seems I cannot add a Hive step to run GetNewJobDetails_SelectAndOverwrite.hql with the Arguments textbox set as --hivevar LatestLastUpdated=$(hive -f s3://MyProjectXYZ/bin/GetNewJobDetails_LatestLastUpdated.hql).
The error is:
Details : FAILED: ParseException line 7:61 cannot recognize input near '$' '(' 'hive' in expression specification
JAR location : command-runner.jar
Main class : None
Arguments : hive-script --run-hive-script --args -f s3://MyProjectXYZ/bin/GetNewJobDetails_SelectAndOverwrite.hql --hivevar LatestLastUpdated=$(hive -f s3://MyProjectXYZ/bin/GetNewJobDetails_LatestLastUpdated.hql)
Action on failure: Cancel and wait
I also tried it with command-runner.jar to run the first hive command. It still does not work:
NoViableAltException(15@[412:1: atomExpression : ( constant | ( intervalExpression )=> intervalExpression | castExpression | extractExpression | floorExpression | caseExpression | whenExpression | ( subQueryExpression )=> ( subQueryExpression ) -> ^( TOK_SUBQUERY_EXPR TOK_SUBQUERY_OP subQueryExpression ) | ( function )=> function | tableOrColumn | expressionsInParenthesis[true] );])
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser$DFA36.specialStateTransition(HiveParser_IdentifiersParser.java:31808)
at org.antlr.runtime.DFA.predict(DFA.java:80)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.atomExpression(HiveParser_IdentifiersParser.java:6746)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceFieldExpression(HiveParser_IdentifiersParser.java:6988)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnaryPrefixExpression(HiveParser_IdentifiersParser.java:7324)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnarySuffixExpression(HiveParser_IdentifiersParser.java:7380)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceBitwiseXorExpression(HiveParser_IdentifiersParser.java:7542)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceStarExpression(HiveParser_IdentifiersParser.java:7685)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedencePlusExpression(HiveParser_IdentifiersParser.java:7828)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceConcatenateExpression(HiveParser_IdentifiersParser.java:7967)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceAmpersandExpression(HiveParser_IdentifiersParser.java:8177)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceBitwiseOrExpression(HiveParser_IdentifiersParser.java:8314)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceSimilarExpressionPart(HiveParser_IdentifiersParser.java:8943)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceSimilarExpressionMain(HiveParser_IdentifiersParser.java:8816)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceSimilarExpression(HiveParser_IdentifiersParser.java:8697)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:9537)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9703)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceAndExpression(HiveParser_IdentifiersParser.java:9812)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceOrExpression(HiveParser_IdentifiersParser.java:9953)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.expression(HiveParser_IdentifiersParser.java:6686)
at org.apache.hadoop.hive.ql.parse.HiveParser.expression(HiveParser.java:42062)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.searchCondition(HiveParser_FromClauseParser.java:6446)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.whereClause(HiveParser_FromClauseParser.java:6364)
at org.apache.hadoop.hive.ql.parse.HiveParser.whereClause(HiveParser.java:41844)
at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36755)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:36987)
at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36504)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35822)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35710)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2284)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1333)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
FAILED: ParseException line 7:61 cannot recognize input near '$' '(' 'hive' in expression specification
You should execute the two Hive commands as two different steps on the EMR cluster. Also, the arguments should be passed as a list instead of a single string: split your Hive command on spaces (' ') to get a list and pass that list as the arguments of the EMR step.
Reference : https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html
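As a hedged sketch of that advice (the cluster id below is a placeholder, and EMR will not evaluate the $(hive -f ...) command substitution for you, so the value for --hivevar has to be produced some other way, for example written out by the first step), the add-steps call with list-style arguments could look like this:
# two separate Hive steps; Args is a comma-separated list, not one shell string
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=Hive,Name="LatestLastUpdated",ActionOnFailure=CONTINUE,Args=[-f,s3://MyProjectXYZ/bin/GetNewJobDetails_LatestLastUpdated.hql] \
  Type=Hive,Name="SelectAndOverwrite",ActionOnFailure=CONTINUE,Args=[-f,s3://MyProjectXYZ/bin/GetNewJobDetails_SelectAndOverwrite.hql,--hivevar,LatestLastUpdated=<value-from-first-step>]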

giraph/hadoop reading manifest file

I am trying to run the RandomWalkWithRestart example: https://github.com/apache/giraph/blob/release-1.0/giraph-examples/src/main/java/org/apache/giraph/examples/RandomWalkWithRestartVertex.java
My input data is:
12 34 56
34 78
56 34 78
78 34
and I am running
hadoop jar giraph-examples-1.1.0-for-hadoop-2.2.0-jar-with-dependencies.jar GiraphRunner -Dgiraph.zkList=<host>:port -libjars giraph-examples-1.1.0-for-hadoop-2.2.0-jar-with-dependencies.jar
org.apache.giraph.examples.RandomWalkWithRestartComputation
-mc org.apache.giraph.examples.RandomWalkVertexMasterCompute
-wc org.apache.giraph.examples.RandomWalkWorkerContext
-vof org.apache.giraph.examples.VertexWithDoubleValueDoubleEdgeTextOutputFormat
-vif org.apache.giraph.examples.LongDoubleDoubleTextInputFormat
-vip giraph_algorithms/personalized_pr/input/graph.txt
-op giraph_algorithms/personalized_pr/out1 -w 1
But I am getting this error:
Error: java.lang.IllegalStateException: run: Caught an unrecoverable exception For input string: "PK�uE META-INF/��PKPK�uEMETA-INF/MANIFEST.MF�M��LK-.�"
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:101)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.NumberFormatException: For input string: "PK�uE META-INF/��PKPK�uEMETA-INF/MANIFEST.MF�M��LK-.�"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:441)
at java.lang.Long.parseLong(Long.java:483)
at org.apache.giraph.examples.RandomWalkWorkerContext.initializeSources(RandomWalkWorkerContext.java:131)
at org.apache.giraph.examples.RandomWalkWorkerContext.setStaticVars(RandomWalkWorkerContext.java:160)
at org.apache.giraph.examples.RandomWalkWorkerContext.preApplication(RandomWalkWorkerContext.java:146)
at org.apache.giraph.graph.GraphTaskManager.workerContextPreApp(GraphTaskManager.java:815)
at org.apache.giraph.graph.GraphTaskManager.prepareGraphStateAndWorkerContext(GraphTaskManager.java:451)
at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:266)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:91)
... 7 more
Why is it reading the manifest file, when I specifically told it to read a file and not even a directory?
Because you passed the libjar argument as the vertex class file.
Like the other arguments, you need to say: -D libjars=your_jar.jar.
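A quick sanity check (paths taken from the question) is to confirm what the job actually finds under the input path, and that it is the plain-text graph file rather than a jar:
hadoop fs -ls giraph_algorithms/personalized_pr/input/
hadoop fs -cat giraph_algorithms/personalized_pr/input/graph.txt | head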

How do I get the Hadoop input filename from mapper?

Hadoop streaming makes the filename available to every map task through an environment variable.
Python:
os.environ["map.input.file"]
Java:
System.getenv("map.input.file")
How about Ruby?
mapper.rb
#!/usr/bin/env ruby
STDIN.each_line do |line|
line.split.each do |word|
word = word[/([a-zA-Z0-9]+)/]
word = word.gsub(/ /,"")
puts [word, 1].join("\t")
end
end
puts ENV['map.input.file']
How about:
ENV['map.input.file']
Ruby lets you assign to the ENV hash just as easily:
ENV['map.input.file'] = '/path/to/file'
All JobConf variables are put into environment variables by hadoop-streaming. The variable names are made "safe" by converting any character not in 0-9 A-Z a-z to _.
So map.input.file => map_input_file
Try: puts ENV['map_input_file']
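If you are not sure of the exact name on your Hadoop version (some 2.x releases expose the deprecated-property form as mapreduce_map_input_file instead), one hedged way to find out is to run a throwaway streaming job whose mapper just prints its environment:
#!/bin/sh
# throwaway streaming mapper: print every environment variable whose name
# mentions input_file, so you can see the exact sanitized name on your cluster
env | grep -i input_file
cat > /dev/null   # drain stdin so the framework does not hit a broken pipe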
Using the input from the OP, I tried this mapper:
#!/usr/bin/python
import os
file_name = os.getenv('map_input_file')
print file_name
and a standard wordcount reducer using command:
hadoop fs -rmr /user/itsjeevs/wc &&
hadoop jar $STRMJAR -files /home/jejoseph/wc_mapper.py,/home/jejoseph/wc_reducer.py \
-mapper wc_mapper.py \
-reducer wc_reducer.py \
-numReduceTasks 10 \
-input "/data/*" \
-output wc
which failed with the following error:
16/03/10 15:21:32 INFO mapreduce.Job: Task Id : attempt_1455931799889_822384_m_000043_0, Status : FAILED
Error: java.io.IOException: Stream closed
at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
at java.io.OutputStream.write(OutputStream.java:116)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
16/03/10 15:21:32 INFO mapreduce.Job: Task Id : attempt_1455931799889_822384_m_000077_0, Status : FAILED
Error: java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:345)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Not sure what is happening.
