Error while integrating Mahout into Solr - Hadoop

I am trying to integrate Mahout into Solr (using an UpdateRequestProcessor chain), but whenever I try to initialise the ClassifierContext I get the following error:
INFO: model path :/home/bayes-model
java.lang.IllegalStateException: /home/bayes-model/trainer-weights/Sigma_j/part-*
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable.iterator(SequenceFileDirIterable.java:79)
at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:72)
at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
at solr.mypkg.CategorizeDocumentFactory.init(CategorizeDocumentFactory.java:67)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:449)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1569)
at org.apache.solr.update.processor.UpdateRequestProcessorChain.init(UpdateRequestProcessorChain.java:57)
Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.hadoop.security.UserGroupInformation$HadoopLoginModule
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:808)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:683)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:579)
at org.apache.hadoop.security.UserGroupInformation.getLoginUse
My code is as follows:
params = SolrParams.toSolrParams((NamedList) args);
BayesParameters p = new BayesParameters();
// "model" points at the directory holding the trained Bayes model
String modelPath = params.get("model");
File file = new File(modelPath);
p.set("basePath", file.getAbsolutePath());
LOG.info("model path :" + file.getAbsolutePath());
p.set("classifierType", "bayes");
p.set("dataSource", "hdfs");
// load the model into an in-memory datastore and initialise the classifier
Datastore ds = new InMemoryBayesDatastore(p);
Algorithm alg = new BayesAlgorithm();
ctx = new ClassifierContext(alg, ds);
ctx.initialize();
What could be the reason?

Related

SplitFile gives casting error

I have placed an mp4 file on HDFS and am trying to analyze it directly. I have a class named VideoRecordReader, in which the casting error occurs. Below is the description of the error.
You have loaded library /usr/local/lib/libopencv_core.so.3.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
attempt_201607261400_0011_m_000000_1: It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
16/07/26 17:32:27 INFO mapred.JobClient: Task Id : attempt_201607261400_0011_m_000000_2, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
at com.finalyearproject.VideoRecordReader.initialize(VideoRecordReader.java:65)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:521)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here is the code of SplitFile.
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    start = 0;
    end = 1;
    final Path file = split.getPath();
    FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(split.getPath());
    filename = split.getPath().getName();
    byte[] b = new byte[fileIn.available()];
    fileIn.readFully(b);
    video = new VideoObject(b);
}
Kindly help me. Thank you and best regards.
It's likely you're mixing the mapred and mapreduce APIs together.
It's complaining that you're trying to cast org.apache.hadoop.mapreduce.lib.input.FileSplit to org.apache.hadoop.mapred.FileSplit.
You need to make sure that you don't mix imports between the two APIs.
So check whether org.apache.hadoop.mapred.FileSplit has been imported and change it to org.apache.hadoop.mapreduce.lib.input.FileSplit, as in the sketch below.
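For example, a minimal sketch of how the record reader could look once it uses only the new API (VideoRecordReader and VideoObject are the asker's own classes, so the surrounding details are assumed):
// imports from the new (mapreduce) API only -- no org.apache.hadoop.mapred.* classes
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit; // not org.apache.hadoop.mapred.FileSplit

// inside the asker's VideoRecordReader (which extends org.apache.hadoop.mapreduce.RecordReader)
@Override
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    // the framework hands a mapreduce.lib.input.FileSplit to a new-API RecordReader,
    // so this cast succeeds once the import above matches
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    // ... rest of the method unchanged from the question ...
}
The same rule applies to the job driver: stick to org.apache.hadoop.mapreduce.Job and the mapreduce.lib.* input/output formats rather than their mapred counterparts.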

FileNotFoundException when using DistributedCache to access MapFile

I am using Hadoop CDH 4.7 running in YARN mode. There is a MapFile at hdfs://test1:9100/user/tagdict_builder_output/part-00000 and it has two files, index and data.
I used the following code to add it to the DistributedCache:
Configuration conf = new Configuration();
Path tagDictFilePath = new Path("hdfs://test1:9100/user/tagdict_builder_output/part-00000");
DistributedCache.addCacheFile(tagDictFilePath.toUri(), conf);
Job job = new Job(conf);
And I initialize a MapFile.Reader in the setup of the Mapper:
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (localFiles != null && localFiles.length > 0 && localFiles[0] != null) {
        String mapFileDir = localFiles[0].toString();
        LOG.info("mapFileDir " + mapFileDir);
        FileSystem fs = FileSystem.get(context.getConfiguration());
        reader = new MapFile.Reader(fs, mapFileDir, context.getConfiguration());
    } else {
        throw new IOException("Could not read lexicon file in DistributedCache");
    }
}
But it throws FileNotFoundException:
Error: java.io.FileNotFoundException: File does not exist: /home/mps/cdh/local/usercache/mps/appcache/application_1405497023620_0045/container_1405497023620_0045_01_000012/part-00000/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1704)
at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:452)
at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:426)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:396)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:405)
at aps.Cdh4MD5TaglistPreprocessor$Vectorizer.setup(Cdh4MD5TaglistPreprocessor.java:61)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
I've also tried /user/tagdict_builder_output/part-00000 as the path, and using a symlink, but these do not work either. How can I solve this? Many thanks.
As it says here:
Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks.
So you should try to access your files through the File object:
File f = new File("./part-00000");
EDIT1
My last suggestion:
DistributedCache.addCacheFile(new URI(tagDictFilePath.toString() + "#cache-file"), conf);
DistributedCache.createSymlink(conf);
...
// in mapper
File f = new File("cache-file");
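Putting the two pieces together, a minimal sketch of the mapper-side setup under that suggestion (it assumes the #cache-file symlink is materialised in the task's working directory, as createSymlink is meant to do, and reuses the reader field from the question):
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    // "cache-file" is the symlink created by the "#cache-file" fragment above; it sits in the
    // task's current working directory and points at the localized part-00000 MapFile directory
    FileSystem localFs = FileSystem.getLocal(conf);
    reader = new MapFile.Reader(localFs, "cache-file", conf);
}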

Hive Server 2 thrift Client error: Required field 'operationHandle' is unset

I am trying to run the Hive Thrift code below against HiveServer2 on CDH 4.3 and am getting the error below. I can open a Hive JDBC connection to the same server successfully; it is just Thrift that is not working. Here is my code:
public static void main(String[] args) throws Exception
{
    TSocket transport = new TSocket("my.org.hiveserver2.com", 10000);
    transport.setTimeout(999999999);

    TBinaryProtocol protocol = new TBinaryProtocol(transport);
    TCLIService.Client client = new TCLIService.Client(protocol);
    transport.open();

    TOpenSessionReq openReq = new TOpenSessionReq();
    TOpenSessionResp openResp = client.OpenSession(openReq);
    TSessionHandle sessHandle = openResp.getSessionHandle();

    TExecuteStatementReq execReq = new TExecuteStatementReq(sessHandle, "SELECT * FROM testhivedrivertable");
    TExecuteStatementResp execResp = client.ExecuteStatement(execReq);
    TOperationHandle stmtHandle = execResp.getOperationHandle();

    TFetchResultsReq fetchReq = new TFetchResultsReq(stmtHandle, TFetchOrientation.FETCH_FIRST, 1);
    TFetchResultsResp resultsResp = client.FetchResults(fetchReq);
    TRowSet resultsSet = resultsResp.getResults();
    List<TRow> resultRows = resultsSet.getRows();
    for (TRow resultRow : resultRows) {
        resultRow.toString();
    }

    TCloseOperationReq closeReq = new TCloseOperationReq();
    closeReq.setOperationHandle(stmtHandle);
    client.CloseOperation(closeReq);

    TCloseSessionReq closeConnectionReq = new TCloseSessionReq(sessHandle);
    client.CloseSession(closeConnectionReq);
    transport.close();
}
Here is the error log:
Exception in thread "main" org.apache.thrift.protocol.TProtocolException: Required field 'operationHandle' is unset! Struct:TFetchResultsReq(operationHandle:null, orientation:FETCH_FIRST, maxRows:1)
at org.apache.hive.service.cli.thrift.TFetchResultsReq.validate(TFetchResultsReq.java:465)
at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_args.validate(TCLIService.java:12607)
at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_args$FetchResults_argsStandardScheme.write(TCLIService.java:12664)
at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_args$FetchResults_argsStandardScheme.write(TCLIService.java:12633)
at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_args.write(TCLIService.java:12584)
at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:63)
at org.apache.hive.service.cli.thrift.TCLIService$Client.send_FetchResults(TCLIService.java:487)
at org.apache.hive.service.cli.thrift.TCLIService$Client.FetchResults(TCLIService.java:479)
at HiveJDBCServer1.main(HiveJDBCServer1.java:26)
Are you really sure you set the operationHandle field to a valid value? The Thrift error indicates exactly what it says: the API expects a certain field (operationHandle in your case) to be set, which has not been assigned a value. And your stack trace confirms this:
Struct:TFetchResultsReq(operationHandle:null, orientation:FETCH_FIRST,
maxRows:1)
In case anyone finds this, like I did, by googling that error message: I had a similar problem with a PHP Thrift library for HiveServer2. At least in my case, execResp.getOperationHandle() returned null because there was an error in the executed request that generated execResp. This didn't throw an exception for some reason, and I had to examine execResp in detail, and specifically check its status, before attempting to get an operation handle.
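In Java, the equivalent defensive check might look roughly like this (a sketch only; the status constants are the ones defined in the TCLIService Thrift IDL):
TExecuteStatementResp execResp = client.ExecuteStatement(execReq);
TStatus status = execResp.getStatus();
// if the statement failed, getOperationHandle() returns null and the later
// TFetchResultsReq fails validation exactly as in the question
if (status.getStatusCode() != TStatusCode.SUCCESS_STATUS
        && status.getStatusCode() != TStatusCode.SUCCESS_WITH_INFO_STATUS) {
    throw new RuntimeException("Statement failed: " + status.getErrorMessage());
}
TOperationHandle stmtHandle = execResp.getOperationHandle();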

FileNotFoundException on hadoop

Inside my map function, I am trying to read a file from the DistributedCache and load its contents into a hash map.
The sysout log of the MapReduce job prints the content of the hash map, which shows that it has found the file, loaded it into the data structure, and performed the needed operation. It iterates through the list and prints its contents, thus proving that the operation was successful.
However, I still get the below error after a few minutes of running the MR job:
13/01/27 18:44:21 INFO mapred.JobClient: Task Id : attempt_201301271841_0001_m_000001_2, Status : FAILED
java.io.FileNotFoundException: File does not exist: /app/hadoop/jobs/nw_single_pred_in/predict
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here's the portion which initializes the Path with the location of the file to be placed in the distributed cache:
// inside main, surrounded by try catch block, yet no exception thrown here
Configuration conf = new Configuration();
// rest of the stuff that relates to conf
Path knowledgefilepath = new Path(args[3]); // args[3] = /app/hadoop/jobs/nw_single_pred_in/predict/knowledge.txt
DistributedCache.addCacheFile(knowledgefilepath.toUri(), conf);
job.setJarByClass(NBprediction.class);
// rest of job settings
job.waitForCompletion(true); // kick off load
This one is inside the map function:
try {
    System.out.println("Inside try !!");
    Path files[] = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    Path cfile = new Path(files[0].toString()); // only one file
    System.out.println("File path : " + cfile.toString());
    CSVReader reader = new CSVReader(new FileReader(cfile.toString()), '\t');
    while ((nline = reader.readNext()) != null)
        data.put(nline[0], Double.parseDouble(nline[1])); // load into a hashmap
} catch (Exception e) {
    // handle exception
}
Help appreciated.
Cheers !
I did a fresh installation of Hadoop and ran the job with the same jar, and the problem disappeared. It seems to have been a bug rather than a programming error.

ColumnFamilyInputFormat - Could not get input splits

I am getting a weird exception when I try to access Cassandra from Hadoop using the ColumnFamilyInputFormat class.
In my Hadoop process, this is how I connect to Cassandra, after including cassandra-all.jar version 1.1:
private void setCassandraConfig(Job job) {
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
    ConfigHelper.setInputInitialAddress(job.getConfiguration(), "204.236.1.29");
    ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList(ByteBufferUtil.bytes(COLUMN_NAME)));
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
    // this will cause the predicate to be ignored in favor of scanning
    // everything as a wide row
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
    ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "204.236.1.29");
    ConfigHelper.setOutputPartitioner(job.getConfiguration(), "RandomPartitioner");
}

public int run(String[] args) throws Exception {
    // use a smaller page size that doesn't divide the row count evenly to
    // exercise the paging logic better
    ConfigHelper.setRangeBatchSize(getConf(), 99);
    Job processorJob = new Job(getConf(), "dmp_normalizer");
    processorJob.setJarByClass(DmpProcessorRunner.class);
    processorJob.setMapperClass(NormalizerMapper.class);
    processorJob.setReducerClass(SelectorReducer.class);
    processorJob.setOutputKeyClass(Text.class);
    processorJob.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(processorJob, new Path(TEMP_PATH_PREFIX));
    processorJob.setOutputFormatClass(TextOutputFormat.class);
    setCassandraConfig(processorJob);
    ...
}
But when I run Hadoop (I am running it on Amazon EMR) I get the exception below. Note that the IP is 127.0.0.1 instead of the IP I want...
Any hint? What could I be doing wrong?
2012-11-22 21:37:34,235 ERROR org.apache.hadoop.security.UserGroupInformation (Thread-6): PriviledgedActionException as:hadoop cause:java.io.IOException: Could not get input splits
2012-11-22 21:37:34,235 INFO org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob (Thread-6): dmp_normalizer got an error while submitting
java.io.IOException: Could not get input splits
at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:178)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1017)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1034)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:952)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:233)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: failed connecting to all endpoints 127.0.0.1
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:174)
... 13 more
Caused by: java.io.IOException: failed connecting to all endpoints 127.0.0.1
at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSubSplits(ColumnFamilyInputFormat.java:272)
at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.access$200(ColumnFamilyInputFormat.java:77)
at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:211)
at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:196)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
... 1 more
2012-11-22 21:37:39,319 INFO com.s1mbi0se.dmp.processor.main.DmpProcessorRunner (main): Process ended
I was able to solve the problem by changing the Cassandra configuration: listen_address needed to be a valid external IP for this to work.
The exception didn't seem to have anything to do with it, so it took me a long time to find the answer. In the end, if you specify 0.0.0.0 in the Cassandra config and try to access it from an external IP, you get this error saying that no host was found at 127.0.0.1.
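For illustration, an assumed cassandra.yaml excerpt showing only the setting this answer refers to (the address should be the node's real externally reachable IP, e.g. the 204.236.1.29 used in the question, rather than 0.0.0.0 or localhost):
# cassandra.yaml (assumed excerpt)
listen_address: 204.236.1.29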
In my case it was a wrong keyspace name issue; look carefully at what you pass to the ConfigHelper.setInputColumnFamily method.
