SplitFile gives casting error - hadoop

I have placed an mp4 file on HDFS and am trying to analyze it directly. I have a class named VideoRecordReader, in which the casting error occurs. Below is the description of the error.
You have loaded library /usr/local/lib/libopencv_core.so.3.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
attempt_201607261400_0011_m_000000_1: It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
16/07/26 17:32:27 INFO mapred.JobClient: Task Id : attempt_201607261400_0011_m_000000_2, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
    at com.finalyearproject.VideoRecordReader.initialize(VideoRecordReader.java:65)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:521)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here is the code of SplitFile.
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    start = 0;
    end = 1;
    final Path file = split.getPath();
    FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(split.getPath());
    filename = split.getPath().getName();
    byte[] b = new byte[fileIn.available()];
    fileIn.readFully(b);
    video = new VideoObject(b);
}
Kindly help me. Thank you and best regards.

It's likely that you're mixing the mapred and mapreduce APIs together.
It's complaining that you're trying to cast org.apache.hadoop.mapreduce.lib.input.FileSplit to org.apache.hadoop.mapred.FileSplit.
You need to make sure that you don't mix imports between the two APIs.
So check whether org.apache.hadoop.mapred.FileSplit has been imported, and change it to org.apache.hadoop.mapreduce.lib.input.FileSplit.
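As a minimal sketch (assuming VideoRecordReader already extends the new-API org.apache.hadoop.mapreduce.RecordReader; the class and field names below are illustrative, not the asker's full code), the corrected import and cast would look like this:

// Minimal sketch only: shows the new-API FileSplit import and cast.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit; // new API, NOT org.apache.hadoop.mapred.FileSplit

public abstract class VideoRecordReaderSketch<K, V> extends RecordReader<K, V> {

    private FSDataInputStream fileIn;
    private String filename;

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // The framework passes a mapreduce.lib.input.FileSplit, so this cast now succeeds.
        FileSplit split = (FileSplit) genericSplit;
        Configuration conf = context.getConfiguration();

        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        fileIn = fs.open(file);
        filename = file.getName();
    }
}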

Related

Is it possible to recover a broadcast value from a Spark Streaming checkpoint

I used hbase-spark to record pv/uv in my Spark Streaming project. When I killed the app and restarted it, I got the following exception during checkpoint recovery:
16/03/02 10:17:21 ERROR HBaseContext: Unable to getConfig from broadcast
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable
at com.paitao.xmlife.contrib.hbase.HBaseContext.getConf(HBaseContext.scala:645)
at com.paitao.xmlife.contrib.hbase.HBaseContext.com$paitao$xmlife$contrib$hbase$HBaseContext$$hbaseForeachPartition(HBaseContext.scala:627)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I checked the code of HBaseContext; it uses a broadcast variable to store the HBase configuration.
class HBaseContext(@transient sc: SparkContext,
                   @transient config: Configuration,
                   val tmpHdfsConfgFile: String = null) extends Serializable with Logging {

  @transient var credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
  @transient var tmpHdfsConfiguration: Configuration = config
  @transient var appliedCredentials = false
  @transient val job = Job.getInstance(config)
  TableMapReduceUtil.initCredentials(job)

  // <-- broadcast for HBaseConfiguration here !!!
  var broadcastedConf = sc.broadcast(new SerializableWritable(config))
  var credentialsConf = sc.broadcast(new SerializableWritable(job.getCredentials()))
  ...
During checkpoint recovery, it tries to access this broadcast value in its getConf function:
if (tmpHdfsConfiguration == null) {
  try {
    tmpHdfsConfiguration = configBroadcast.value.value
  } catch {
    case ex: Exception => logError("Unable to getConfig from broadcast", ex)
  }
}
Then the exception is raised. My question is: is it possible to recover a broadcast value from a checkpoint in a Spark application? Or do we have some other solution to re-broadcast the value after recovering?
Thanks for any feedback!
Currently this is a known bug in Spark. Contributors have been investigating the issue but have made no progress.
Here's my workaround: instead of loading the data into a broadcast variable and broadcasting it to all executors, I let each executor load the data itself into a singleton object.
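A minimal sketch of that workaround in Java, assuming the shared data is an HBase Configuration; ConfigHolder is an illustrative name, not part of any library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Hypothetical holder class: each executor builds the configuration lazily on
// first use instead of relying on a broadcast recovered from the checkpoint.
public final class ConfigHolder {

    private static volatile Configuration conf;

    private ConfigHolder() {}

    // Call this from inside foreachPartition / mapPartitions on the executor.
    public static Configuration get() {
        if (conf == null) {
            synchronized (ConfigHolder.class) {
                if (conf == null) {
                    conf = HBaseConfiguration.create();
                }
            }
        }
        return conf;
    }
}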
Btw, follow this issue for changes https://issues.apache.org/jira/browse/SPARK-5206
Follow the approach below (see the sketch after this list):
1. Create the Spark context.
2. Initialize the broadcast variable.
3. Create the streaming context with a checkpoint directory, using the above Spark context and passing in the initialized broadcast variable.
4. When the streaming job starts with no data in the checkpoint directory, it will initialize the broadcast variable.
5. When streaming restarts, it will recover the broadcast variable from the checkpoint directory.
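A hedged sketch of that flow using Spark's Java streaming API; the checkpoint path, batch interval, and the broadcast payload below are placeholder choices, not from the question:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BroadcastOnRestartSketch {

    static Broadcast<Map<String, String>> settingsBroadcast;

    // Steps 1-3: build the contexts and initialize the broadcast variable.
    static JavaStreamingContext createContext(String checkpointDir) {
        SparkConf conf = new SparkConf().setAppName("broadcast-restart-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        Map<String, String> settings = new HashMap<>();
        settings.put("some.key", "some.value"); // illustrative payload
        settingsBroadcast = sc.broadcast(settings);

        JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(10));
        ssc.checkpoint(checkpointDir);
        // ... define DStream operations that reference settingsBroadcast here ...
        return ssc;
    }

    public static void main(String[] args) throws InterruptedException {
        String checkpointDir = "hdfs:///tmp/streaming-checkpoint"; // placeholder path
        // Steps 4-5: a fresh start calls createContext(); a restart recovers the
        // context (and, per the answer above, the broadcast) from the checkpoint.
        Function0<JavaStreamingContext> factory = () -> createContext(checkpointDir);
        JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, factory);
        ssc.start();
        ssc.awaitTermination();
    }
}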

FileNotFoundException when using DistributedCache to access MapFile

I am using Hadoop CDH 4.7 running in YARN mode. There is a MapFile at hdfs://test1:9100/user/tagdict_builder_output/part-00000
and it contains two files, index and data.
I used the following code to add it to the DistributedCache:
Configuration conf = new Configuration();
Path tagDictFilePath = new Path("hdfs://test1:9100/user/tagdict_builder_output/part-00000");
DistributedCache.addCacheFile(tagDictFilePath.toUri(), conf);
Job job = new Job(conf);
And I initialize a MapFile.Reader in the Mapper's setup():
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (localFiles != null && localFiles.length > 0 && localFiles[0] != null) {
        String mapFileDir = localFiles[0].toString();
        LOG.info("mapFileDir " + mapFileDir);
        FileSystem fs = FileSystem.get(context.getConfiguration());
        reader = new MapFile.Reader(fs, mapFileDir, context.getConfiguration());
    } else {
        throw new IOException("Could not read lexicon file in DistributedCache");
    }
}
But it throws FileNotFoundException:
Error: java.io.FileNotFoundException: File does not exist: /home/mps/cdh/local/usercache/mps/appcache/application_1405497023620_0045/container_1405497023620_0045_01_000012/part-00000/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1704)
at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:452)
at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:426)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:396)
at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:405)
at aps.Cdh4MD5TaglistPreprocessor$Vectorizer.setup(Cdh4MD5TaglistPreprocessor.java:61)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:160)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:155)
I've also tried /user/tagdict_builder_output/part-00000 as the path, and using a symlink, but these do not work either. How can I solve this? Many thanks.
As it says here:
Distributed Cache associates the cache files to the current working directory of the mapper and reducer using symlinks.
So you should try to access your files through the File object:
File f = new File("./part-00000");
EDIT1
My last suggestion:
// add the file with a "#cache-file" fragment so it is symlinked into the
// task's working directory under that name
DistributedCache.addCacheFile(new URI(tagDictFilePath.toString() + "#cache-file"), conf);
DistributedCache.createSymlink(conf);
...
// in the mapper, open it through the symlink
File f = new File("cache-file");
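Tying that back to the asker's setup(), here is a hedged sketch that opens the cached MapFile through the local file system rather than HDFS; the mapper's key/value types are assumptions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TagDictMapper extends Mapper<Object, Text, Text, Text> {

    private MapFile.Reader reader;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // The cached copy lives in the task's working directory, so open it
        // with the local file system, not the default (HDFS) one.
        FileSystem localFs = FileSystem.getLocal(conf);
        reader = new MapFile.Reader(localFs, "cache-file", conf); // the symlink created above
    }
}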

Error while integrating Mahout into Solr

I am trying to integrate Mahout into Solr (using an UpdateRequestProcessor chain), but whenever I try to initialize the ClassifierContext I get the following error.
INFO: model path :/home/bayes-model
java.lang.IllegalStateException: /home/bayes-model/trainer-weights/Sigma_j/part-*
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable.iterator(SequenceFileDirIterable.java:79)
at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:72)
at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
at solr.mypkg.CategorizeDocumentFactory.init(CategorizeDocumentFactory.java:67)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:449)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1569)
at org.apache.solr.update.processor.UpdateRequestProcessorChain.init(UpdateRequestProcessorChain.java:57)
Caused by: javax.security.auth.login.LoginException: unable to find LoginModule
class: org.apache.hadoop.security.UserGroupInformation$HadoopLoginModule
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:808)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:683)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:579)
at org.apache.hadoop.security.UserGroupInformation.getLoginUse
My code is as below:
params = SolrParams.toSolrParams((NamedList) args);
BayesParameters p = new BayesParameters();
String modelPath = params.get("model");
File file = new File(modelPath);
p.set("basePath", file.getAbsolutePath());
LOG.info("model path :" + file.getAbsolutePath());
p.set("classifierType", "bayes");
p.set("dataSource", "hdfs");
Datastore ds = new InMemoryBayesDatastore(p);
Algorithm alg = new BayesAlgorithm();
ctx = new ClassifierContext(alg, ds);
ctx.initialize();
What could be the reason?

FileNotFoundException on hadoop

Inside my map function, I am trying to read a file from the DistributedCache and load its contents into a HashMap.
The sysout log of the MapReduce job prints the contents of the HashMap, which shows that it has found the file, loaded it into the data structure, and performed the needed operation. It iterates through the list and prints its contents, thus proving that the operation was successful.
However, I still get the below error after a few minutes of running the MR job:
13/01/27 18:44:21 INFO mapred.JobClient: Task Id : attempt_201301271841_0001_m_000001_2, Status : FAILED
java.io.FileNotFoundException: File does not exist: /app/hadoop/jobs/nw_single_pred_in/predict
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here's the portion that initializes Path with the location of the file to be placed in the distributed cache:
// inside main, surrounded by try catch block, yet no exception thrown here
Configuration conf = new Configuration();
// rest of the stuff that relates to conf
Path knowledgefilepath = new Path(args[3]); // args[3] = /app/hadoop/jobs/nw_single_pred_in/predict/knowledge.txt
DistributedCache.addCacheFile(knowledgefilepath.toUri(), conf);
job.setJarByClass(NBprediction.class);
// rest of job settings
job.waitForCompletion(true); // kick off load
This one is inside the map function:
try {
    System.out.println("Inside try !!");
    Path files[] = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    Path cfile = new Path(files[0].toString()); // only one file
    System.out.println("File path : " + cfile.toString());
    CSVReader reader = new CSVReader(new FileReader(cfile.toString()), '\t');
    while ((nline = reader.readNext()) != null)
        data.put(nline[0], Double.parseDouble(nline[1])); // load into a hashmap
}
catch (Exception e) {
    // handle exception
}
Help appreciated.
Cheers !
I did a fresh installation of Hadoop and ran the job with the same jar, and the problem disappeared. It seems to have been a bug rather than a programming error.

ColumnFamilyInputFormat - Could not get input splits

I am getting a weird exception when I try to access Cassandra from Hadoop using the ColumnFamilyInputFormat class.
In my Hadoop process, this is how I connect to Cassandra, after including cassandra-all.jar version 1.1:
private void setCassandraConfig(Job job) {
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
    ConfigHelper.setInputInitialAddress(job.getConfiguration(), "204.236.1.29");
    ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList(ByteBufferUtil.bytes(COLUMN_NAME)));
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
    // this will cause the predicate to be ignored in favor of scanning
    // everything as a wide row
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
    ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "204.236.1.29");
    ConfigHelper.setOutputPartitioner(job.getConfiguration(), "RandomPartitioner");
}

public int run(String[] args) throws Exception {
    // use a smaller page size that doesn't divide the row count evenly to
    // exercise the paging logic better
    ConfigHelper.setRangeBatchSize(getConf(), 99);
    Job processorJob = new Job(getConf(), "dmp_normalizer");
    processorJob.setJarByClass(DmpProcessorRunner.class);
    processorJob.setMapperClass(NormalizerMapper.class);
    processorJob.setReducerClass(SelectorReducer.class);
    processorJob.setOutputKeyClass(Text.class);
    processorJob.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(processorJob, new Path(TEMP_PATH_PREFIX));
    processorJob.setOutputFormatClass(TextOutputFormat.class);
    setCassandraConfig(processorJob);
    ...
}
But when I run Hadoop (I am running it on Amazon EMR) I get the exception below. Note that the IP is 127.0.0.1 instead of the IP I want...
Any hint? What could I be doing wrong?
2012-11-22 21:37:34,235 ERROR org.apache.hadoop.security.UserGroupInformation (Thread-6): PriviledgedActionException as:hadoop cause:java.io.IOException: Could not get input splits
2012-11-22 21:37:34,235 INFO org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob (Thread-6): dmp_normalizer got an error while submitting java.io.IOException: Could not get input splits
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:178)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1017)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1034)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:952)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:233)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: failed connecting to all endpoints 127.0.0.1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:174)
    ... 13 more
Caused by: java.io.IOException: failed connecting to all endpoints 127.0.0.1
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSubSplits(ColumnFamilyInputFormat.java:272)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.access$200(ColumnFamilyInputFormat.java:77)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:211)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:196)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    ... 1 more
2012-11-22 21:37:39,319 INFO com.s1mbi0se.dmp.processor.main.DmpProcessorRunner (main): Process ended
I was able to solve the problem by changing the Cassandra configuration: listen_address needed to be a valid external IP for this to work.
The exception didn't seem to have anything to do with it, so it took me a long time to find the answer. In the end, if you specify 0.0.0.0 in the Cassandra config and try to access it from an external IP, you get this error saying no host was found at 127.0.0.1.
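For illustration only, a minimal cassandra.yaml excerpt of the kind of change described above; the address shown is the node IP from the question, and the exact value depends on your cluster:

# cassandra.yaml (excerpt) - illustrative, not a full config
# Bind to the node's externally reachable IP instead of 0.0.0.0,
# so Hadoop clients are not pointed back to 127.0.0.1.
listen_address: 204.236.1.29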
In my case it was a wrong keyspace name issue; look carefully at what you pass to the ConfigHelper.setInputColumnFamily method.
