debugging a mahout logistic regression - hadoop

I am new to Mahout, and I am trying out the standard "donut" example listed here:
http://imiloainf.wordpress.com/2011/11/02/mahout-logistic-regression/
This example works like a charm.
But when I try to run it on my own dataset (which is huge), it does not work.
The dataset is a single CSV file; everything is the same except that it has many more features (~100) and the file is 1 TB.
This is the command I run and the error I get:
bin/mahout trainlogistic --input /path/mahout_input/complete/input.csv \
--output mahoutmodel --target default --categories 2 --predictors O1 E1 I1 \
--types numeric --features 30 --passes 100 --rate 50
Running on hadoop, using HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
No HADOOP_CONF_DIR set, using /opt/mapr/hadoop/hadoop-0.20.2/conf
Exception in thread "main" java.lang.NullPointerException
at org.apache.mahout.classifier.sgd.CsvRecordFactory.firstLine(CsvRecordFactory.java:167)
at org.apache.mahout.classifier.sgd.TrainLogistic.main(TrainLogistic.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
What am I doing wrong?
How do I debug this, and what does the error mean?
Thanks

My guess is your input doesn't exist or is empty. I'd check that /path/mahout_input/complete/input.csv is really what you mean.

Either check your input path, or make sure the first line of your input file (the header) has the column names in double quotes only, like "x1","x2","x3","label", and so on.
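A quick way to check both things (just a sketch; whether trainlogistic reads the path from the local filesystem or from HDFS depends on your setup):
# Local filesystem: does the file exist, is it non-empty, and does the header name the columns (default, O1, E1, I1, ...)?
ls -l /path/mahout_input/complete/input.csv
head -1 /path/mahout_input/complete/input.csv
# Same checks if the path lives on HDFS / MapR-FS:
hadoop fs -ls /path/mahout_input/complete/input.csv
hadoop fs -cat /path/mahout_input/complete/input.csv | head -1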

Happened to me as well.
My mistake was passing an incorrect --target parameter that does not exist among the columns. Specifically, my header line was
myColumn1,myColumn2,myColumn3
and my command line was
mahout trainlogistic --input ./input.csv --output ./logistic_model
--target myMisTypedColumn1 --predictors myColumn2 myColumn3 --types w w w --features 2 --passes 100 --rate 50 --categories 2
One more tip: don't use quotes or long column names, so you avoid the headache of wondering "did Mahout not like my column name?" and so on.
And as feedback to Mahout: the error message is terrible. We should never see a bare NullPointerException in such a promising framework.
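For reference, a sketch of the corrected invocation: the only change from the command above is that --target now names a column that really exists in the header (the column names are the hypothetical ones from my example).
# header line of input.csv: myColumn1,myColumn2,myColumn3
mahout trainlogistic --input ./input.csv --output ./logistic_model --target myColumn1 --predictors myColumn2 myColumn3 --types w w w --features 2 --passes 100 --rate 50 --categories 2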

Related

Using weka from a batch file

I want to make predictions from a Weka saved model without opening Weka Explorer or Simple CLI interfaces. So I created a batch file:
@ECHO ON
title Weka caller
set root=C:\Program Files\Weka-3-8\
cd /D %root%
java -classpath weka.jar weka.classifiers.functions.LinearRegression -T Z:\ARFF_FILES\TestSet_regression.arff -l Z:\WEKA_MODELS\Regression_model_03_05_2018.model -p 0
I have this error message:
C:\Program Files\Weka-3-8>java -classpath weka.jar weka.classifiers.functions.LinearRegression -T Z:\ARFF_FILES\TestSet_regression.arff -l Z:\WEKA_MODELS\Regression_model_03_05_2018.model -p 0
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: no/uib/cipr/matrix/Matrix
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Unknown Source)
at java.lang.Class.privateGetMethodRecursive(Unknown Source)
at java.lang.Class.getMethod0(Unknown Source)
at java.lang.Class.getMethod(Unknown Source)
at sun.launcher.LauncherHelper.validateMainClass(Unknown Source)
at sun.launcher.LauncherHelper.checkAndLoadMain(Unknown Source)
Caused by: java.lang.ClassNotFoundException: no.uib.cipr.matrix.Matrix
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 7 more
Has anyone already called Weka from the Windows cmd shell?
I have not used Weka in a Windows shell, but the way you could do it on Linux is as follows:
#!/bin/bash
export CLASSPATH=/home/stalai/Weka/weka-3-9-1/weka.jar:.
echo $CLASSPATH
# Code that loops through various classification routines and saves the results in a corresponding text file
# Default values
CV=103 # Cross-validation folds: change to 10, or keep leave-one-out cross-validation [change with -x]
files=dataset.csv # The .csv file to process
CorAttEvalResults=CorAttEvalResults.txt # Text file the results are appended to (name is arbitrary)
for i in {100..10}
do
java weka.classifiers.meta.AttributeSelectedClassifier -t $files -x $CV -E "weka.attributeSelection.CorrelationAttributeEval " -S "weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N $i" -W weka.classifiers.lazy.IBk -- -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\"" >> $CorAttEvalResults
done
In this example, we are eliminating features from the top 100 down to 10 using a correlation-based feature ranker and saving the results in CorAttEvalResults, following leave-one-out cross-validation. CV=103 is in fact the total number of instances in dataset.csv, which is what makes the cross-validation leave-one-out.
Once you have figured out the desired model, change the corresponding flag values and reload the model. Let me know if you need more help!
Also, I would recommend using CSV instead of ARFF, as it is easier to handle cross-platform if you want to expand your code later.
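One more note on the original error: no.uib.cipr.matrix.Matrix comes from the mtj matrix library, so the NoClassDefFoundError suggests that library is simply not on the classpath. A quick sanity check (just a sketch, reusing the Linux path above) is to see whether the class is packaged inside weka.jar at all:
# If nothing is listed, the jar that provides mtj has to be added to the classpath alongside weka.jar
unzip -l /home/stalai/Weka/weka-3-9-1/weka.jar | grep 'no/uib/cipr/matrix/Matrix'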

Unable to create Dashboard report in JMeter (mismatch between expected number of columns)

I have issues while generating the dashboard report in JMeter (through the command line):
1) Copied the reportgenerator properties to the user properties file
2) Restarted JMeter to pick them up
3) Added the below to the user properties file:
jmeter.save.saveservice.bytes=true
jmeter.save.saveservice.label=true
jmeter.save.saveservice.latency=true
jmeter.save.saveservice.response_code=true
jmeter.save.saveservice.response_message=true
jmeter.save.saveservice.successful=true
jmeter.save.saveservice.thread_counts=true
jmeter.save.saveservice.thread_name=true
jmeter.save.saveservice.time=true
jmeter.save.saveservice.timestamp_format=ms
jmeter.save.saveservice.timestamp_format=yyyy/MM/dd HH:mm:ss
I feel the main problem is a mismatch between the CSV/JTL file I have and the report I am trying to create. Please give me your suggestions.
ERROR | An error occurred:
org.apache.jmeter.report.dashboard.GenerationException: Error while processing samples:Mismatch between expected number of columns:16 and columns in CSV file:6, check your jmeter.save.saveservice.* configuration
at org.apache.jmeter.report.dashboard.ReportGenerator.generate(ReportGenerator.java:246)
at org.apache.jmeter.JMeter.start(JMeter.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.jmeter.NewDriver.main(NewDriver.java:248)
Caused by: org.apache.jmeter.report.core.SampleException: Mismatch between expected number of columns:16 and columns in CSV file:6, check your
jmeter.save.saveservice.* configuration
at org.apache.jmeter.report.core.CsvSampleReader.nextSample(CsvSampleReader.java:183)
at org.apache.jmeter.report.core.CsvSampleReader.readSample(CsvSampleReader.java:201)
at org.apache.jmeter.report.processor.CsvFileSampleSource.produce(CsvFileSampleSource.java:180)
at org.apache.jmeter.report.processor.CsvFileSampleSource.run(CsvFileSampleSource.java:238)
at org.apache.jmeter.report.dashboard.ReportGenerator.generate(ReportGenerator.java:244)
... 6 more
An error occurred: Error while processing samples:Mismatch between expected number of columns:16 and columns in CSV file:6, check your jmeter.save.saveservice.* configuration
errorlevel=1
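One way to see the mismatch directly is to count the columns in the first line of the results file and compare it with the number of fields the report generator expects (just a sketch; test.csv stands in for whatever JTL/CSV you pass to -l or -g):
# The error above says the file has 6 columns while the dashboard expects 16
head -1 test.csv | awk -F',' '{print NF}'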
I made the same mistake. Just forget about those properties and copy only this into the user.properties file:
jmeter.reportgenerator.overall_granularity=60000
jmeter.reportgenerator.apdex_statisfied_threshold=1500
jmeter.reportgenerator.apdex_tolerated_threshold=3000
jmeter.reportgenerator.exporter.html.series_filter=((^s0)|(^s1))(-success|-failure)?
jmeter.reportgenerator.exporter.html.filters_only_sample_series=true
Then from the command line run this:
.\jmeter -n -t sample_jmeter_test.jmx -l test.csv -e -o tmp
Where:
.\jmeter - runs JMeter from the \bin directory
sample_jmeter_test.jmx - the name of the test plan to run, located in the \bin directory
test.csv - also in the \bin directory; this is the file that all gathered statistics will be written to
tmp - a directory I create under \bin where the dashboard files will be saved
The CSV or JTL file may still be being written, so the JMeter report process tries to read the file while new rows are still being added to it. In fact, I resolved the error by manually running the report generation command on the same JTL file:
jmeter -g <file csv or jtl> -o <path report>
It might be possible to configure a delay between the load run and the report generation, but I don't know if such an option exists.
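A minimal sketch of that idea, reusing the commands from the earlier answer: run the load test to completion first, then generate the dashboard from the finished file, so the report pass never races the writer.
jmeter -n -t sample_jmeter_test.jmx -l test.csv
jmeter -g test.csv -o tmp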

Set variables in hive scripts using command line

I have checked the related thread - How to set variables in HIVE scripts
Inside hive the variable is working fine:
hive> set hivevar:cal_month_end='2012-01-01';
hive> select ${cal_month_end};
But when I run this through the command line:
$ hive -e "set hivevar:cal_month_end='2012-01-01';select '${cal_month_end}';"
It keeps giving me the error below:
Error: java.lang.IllegalArgumentException: Can not create a Path from
an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:131)
at org.apache.hadoop.fs.Path.<init>(Path.java:139)
at org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit.getPath(HiveInputFormat.java:110)
at org.apache.hadoop.mapred.MapTask.updateJobWithSplit(MapTask.java:463)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:411)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1469)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
You have to escape a few characters. This works for me:
hive -e "set hivevar:cal_month_end=\'2012-01-01\';select '\${cal_month_end}';"
You have to get the " and ' right. Use this:
hive -e 'set hivevar:cal_month_end="2012-01-01";select ${cal_month_end};'
I finally figured out what went wrong. The problem is that on the command line I can't just select a literal; the select needs to read from some table. The below works fine:
$ hive -e "set hivevar:cal_month_end='2012-01-01';select * from foo where start_time > '${cal_month_end}' limit 10"
You can also set variables as an argument of the hive command:
hive --hivevar cal_month_end='2012-01-01' -e "select '${cal_month_end}';"
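The same flag also works when the query lives in a script file (just a sketch; monthly_report.hql is a hypothetical file name, and inside the script the variable is referenced as ${cal_month_end} or ${hivevar:cal_month_end}):
hive --hivevar cal_month_end='2012-01-01' -f monthly_report.hql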

Pig Simple Dump function

My input file is below. I am trying to dump the data loaded into a relation. I am using Pig 0.12.
a,t1,1000,100
a,t1,2000,200
b,t2,1000,200
b,t2,5000,100
I started Pig in MapReduce (HDFS) mode by typing pig:
myinput = LOAD 'file' AS(a1:chararray,a2:chararray,amt:int,rate:int);
If I do dump myinput then it shows the error below; describe and illustrate work fine. So:
dump myinput;
As soon as I enter the dump command I get the following error message.
ERROR org.apache.hadoop.ipc.RPC - FailoverProxy: Failing this Call: submitJob for error (RemoteException): org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: User 'myid' cannot perform operation SUBMIT_JOB on queue default.
Please run "hadoop queue -showacls" command to find the queues you have access to .
at org.apache.hadoop.mapred.ACLsManager.checkAccess(ACLsManager.java:179)
at org.apache.hadoop.mapred.ACLsManager.checkAccess(ACLsManager.java:136)
at org.apache.hadoop.mapred.ACLsManager.checkAccess(ACLsManager.java:113)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:4541)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:993)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1326)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1322)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1320)
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias myinput
Is this an access issue? Some kind of privilege issue?
Can someone help me?
If you don't specify a load function such as PigStorage('\t'), the data is read with the tab character (\t) as the column separator by default.
In your data, the column separator is a comma (,).
So try this one:
myinput = LOAD 'file' using PigStorage(',') AS(a1:chararray,a2:chararray,amt:int,rate:int);
Hopefully it works.
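If you prefer not to use the grunt shell, roughly the same thing can be run non-interactively (just a sketch, assuming pig -e is available for inline scripts in your installation):
pig -e "myinput = LOAD 'file' USING PigStorage(',') AS (a1:chararray,a2:chararray,amt:int,rate:int); DUMP myinput;"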
You should specify your input data's separator, in your case a comma. Try this code please:
myinput = LOAD 'file' USING PigStorage(',') AS (a1:chararray,a2:chararray,amt:int,rate:int);

Using s3distcp with Amazon EMR to copy a single file

I want to copy just a single file to HDFS using s3distcp. I have tried using the srcPattern argument but it didn't help; it keeps throwing a java.lang.RuntimeException.
It is possible that the regex I am using is the culprit; please help.
My code is as follows:
elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '(filename)'
Exception thrown:
Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/a088f00d-a67e-4239-bb0d-32b3a6ef0105/files
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568)
... 9 more
DistCp is intended to copy many files using many machines. DistCp is not the right tool if you want to only copy one file.
On the hadoop master node, you can copy a single file using
hadoop fs -cp s3://<mybucket>/<path> hdfs:///output
The regex I was using was indeed the culprit.
Say the file names contain dates, for example abcd-2013-06-12.gz; then in order to copy ONLY this file, the following EMR command should do:
elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '.*2013-06-12.gz'
If I remember correctly, my regex initially was *2013-06-12.gz and not .*2013-06-12.gz. So the dot at the beginning was needed.
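A quick way to sanity-check the pattern before submitting the job is to list the source and filter with the same regex (just a sketch; the bucket and path placeholders are the ones from the question):
# Files matched here are the ones s3distcp's --srcPattern should pick up
hadoop fs -ls s3://<mybucket>/<path>/ | grep -E '.*2013-06-12.gz'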
