ColumnFamilyInputFormat - Could not get input splits - hadoop

I am getting a weird exception when I try to access Cassandra from Hadoop using the ColumnFamilyInputFormat class.
In my Hadoop process, this is how I connect to Cassandra, after including cassandra-all.jar version 1.1:
private void setCassandraConfig(Job job) {
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
    ConfigHelper.setInputInitialAddress(job.getConfiguration(), "204.236.1.29");
    ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList(ByteBufferUtil.bytes(COLUMN_NAME)));
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
    // this will cause the predicate to be ignored in favor of scanning
    // everything as a wide row
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
    ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "204.236.1.29");
    ConfigHelper.setOutputPartitioner(job.getConfiguration(), "RandomPartitioner");
}
public int run(String[] args) throws Exception {
    // use a smaller page size that doesn't divide the row count evenly to
    // exercise the paging logic better
    ConfigHelper.setRangeBatchSize(getConf(), 99);
    Job processorJob = new Job(getConf(), "dmp_normalizer");
    processorJob.setJarByClass(DmpProcessorRunner.class);
    processorJob.setMapperClass(NormalizerMapper.class);
    processorJob.setReducerClass(SelectorReducer.class);
    processorJob.setOutputKeyClass(Text.class);
    processorJob.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(processorJob, new Path(TEMP_PATH_PREFIX));
    processorJob.setOutputFormatClass(TextOutputFormat.class);
    setCassandraConfig(processorJob);
    ...
}
But when I run Hadoop (I am running it on Amazon EMR) I get the exception below. Note that the IP is 127.0.0.1 instead of the IP I want...
Any hint? What could I be doing wrong?
2012-11-22 21:37:34,235 ERROR org.apache.hadoop.security.UserGroupInformation (Thread-6): PriviledgedActionException as:hadoop cause:java.io.IOException: Could not get input splits
2012-11-22 21:37:34,235 INFO org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob (Thread-6): dmp_normalizer got an error while submitting java.io.IOException: Could not get input splits
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:178)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1017)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1034)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:952)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:905)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:905)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:336)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl.run(JobControl.java:233)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: failed connecting to all endpoints 127.0.0.1
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:174)
    ... 13 more
Caused by: java.io.IOException: failed connecting to all endpoints 127.0.0.1
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSubSplits(ColumnFamilyInputFormat.java:272)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.access$200(ColumnFamilyInputFormat.java:77)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:211)
    at org.apache.cassandra.hadoop.ColumnFamilyInputFormat$SplitCallable.call(ColumnFamilyInputFormat.java:196)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    ... 1 more
2012-11-22 21:37:39,319 INFO com.s1mbi0se.dmp.processor.main.DmpProcessorRunner (main): Process ended

I was able to solve the problem by changing the Cassandra configuration: listen_address needed to be a valid external IP for this to work.
The exception didn't seem to have anything to do with it, so it took me a long time to find the answer. In the end, if you specify 0.0.0.0 in the Cassandra config and try to access it from an external IP, you get this error saying no host was found at 127.0.0.1.
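For illustration, here is a minimal sketch of the relevant cassandra.yaml settings, using the node address from the question; the exact values depend on your deployment:

listen_address: 204.236.1.29   # an address the Hadoop/EMR nodes can actually reach, not 0.0.0.0 or 127.0.0.1
rpc_address: 204.236.1.29      # Thrift address that ColumnFamilyInputFormat connects to (port 9160 above)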

In my case it was a wrong keyspace name; look carefully at what you pass to the ConfigHelper.setInputColumnFamily method.
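As a small illustrative example (the names below are placeholders, not from the question), the keyspace and column family passed here must match the schema exactly as defined in Cassandra:

// keyspace and column family must exist in Cassandra under exactly these names
ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyColumnFamily");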

Related

Issues with derby Database using a GUI

I am asking this question because I was unable to find a similar question. I recently completed this college project where I made a console application that connected to the database and everything seemed perfectly fine. The method I used to connect to the database is this:
private static Connection getConnection()
{
    Connection connection = null;
    try
    {
        String dbDirectory = "c:/murach/java/db";
        System.setProperty("derby.system.home", dbDirectory);
        String dbURL = "jdbc:derby:MurachDB2";
        String username = "";
        String password = "";
        connection = DriverManager.getConnection(dbURL, username, password);
        System.out.println("connect works");
        return connection;
    } //end try connection statement
    catch (SQLException e)
    {
        for (Throwable t : e)
        {
            e.printStackTrace();
            System.out.println("something went wrong on connection method");
        } //end for loop for errors
    } // end catch statement for connection error
    return connection;
}
As I said before, I created the console application, everything seemed fine, and I turned it in. However, I wanted to experiment with something: I wanted to make another version of this application, but as a GUI application using a JFrame form. I used all of the same classes as before, except that instead of a main class, I used a JFrame form. The method and class are exactly the same, because the database didn't change location in my folder; however, when I run the JFrame application I get a runtime error.
What I got was a null pointer error, and I knew it had something to do with connecting to the database because I wrote a System.out.println in the catch block for SQLException to notify me that something went wrong with the method. The connection to the database works fine in the console application, but my question is whether there are any further measures I need to take when working with a JFrame application. Is there anything I am missing, or any extra step I need to do? For good measure I will display the whole class that works with the database, and I will also use the event handlers in the JFrame.
I want to make it clear that I am not required to do this. I am just playing around with Java, and I could easily leave this alone without consequences, but I really want to learn this, which is why I am asking for help. Any advice, or letting me know what I am missing, would be really appreciated.
Edit
adding error information
java.sql.SQLException: No suitable driver found for jdbc:derby:MurachDB2
something went wrong on connection method
java.lang.NullPointerException
something went from with dislay part
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:247)
at CustomerInvoiceDB.getConnection(CustomerInvoiceDB.java:38)
at CustomerInvoiceDB.getCustomers(CustomerInvoiceDB.java:67)
at CutomerInvoice.displayButtonActionPerformed(CutomerInvoice.java:111)
at CutomerInvoice.access$000(CutomerInvoice.java:24)
at CutomerInvoice$1.actionPerformed(CutomerInvoice.java:57)
at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2022)
at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2348)
at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:402)
at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:259)
at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(BasicButtonListener.java:252)
at java.awt.Component.processMouseEvent(Component.java:6533)
at javax.swing.JComponent.processMouseEvent(JComponent.java:3324)
at java.awt.Component.processEvent(Component.java:6298)
at java.awt.Container.processEvent(Container.java:2236)
at java.awt.Component.dispatchEventImpl(Component.java:4889)
at java.awt.Container.dispatchEventImpl(Container.java:2294)
at java.awt.Component.dispatchEvent(Component.java:4711)
at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4888)
at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4525)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4466)
at java.awt.Container.dispatchEventImpl(Container.java:2280)
at java.awt.Window.dispatchEventImpl(Window.java:2746)
at java.awt.Component.dispatchEvent(Component.java:4711)
at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:758)
at java.awt.EventQueue.access$500(EventQueue.java:97)
at java.awt.EventQueue$3.run(EventQueue.java:709)
at java.awt.EventQueue$3.run(EventQueue.java:703)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:76)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:86)
at java.awt.EventQueue$4.run(EventQueue.java:731)
at java.awt.EventQueue$4.run(EventQueue.java:729)
at java.security.AccessController.doPrivileged(Native Method)
at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:76)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:728)
at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:201)
at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
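Since the trace shows "No suitable driver found for jdbc:derby:MurachDB2", here is a sketch of the same connection method with the embedded Derby driver registered explicitly. This is illustrative only, not code from the original project, and it assumes the standard derby.jar is on the GUI application's classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DerbyConnectionSketch
{
    public static Connection getConnection()
    {
        Connection connection = null;
        try
        {
            // Register the embedded driver explicitly; this avoids
            // "No suitable driver found for jdbc:derby:..." when automatic
            // driver discovery does not pick it up in the GUI build.
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            System.setProperty("derby.system.home", "c:/murach/java/db");
            connection = DriverManager.getConnection("jdbc:derby:MurachDB2", "", "");
        }
        catch (ClassNotFoundException | SQLException e)
        {
            e.printStackTrace();
        }
        return connection;
    }
}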

SplitFile gives casting error

I have placed an mp4 file on HDFS and am trying to analyze it directly. I have a class named VideoRecordReader, in which I get the casting error. Below is the description of the error.
You have loaded library /usr/local/lib/libopencv_core.so.3.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
attempt_201607261400_0011_m_000000_1: It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
16/07/26 17:32:27 INFO mapred.JobClient: Task Id : attempt_201607261400_0011_m_000000_2, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
    at com.finalyearproject.VideoRecordReader.initialize(VideoRecordReader.java:65)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:521)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here is the code of SplitFile.
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    start = 0;
    end = 1;
    final Path file = split.getPath();
    FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(split.getPath());
    filename = split.getPath().getName();
    byte[] b = new byte[fileIn.available()];
    fileIn.readFully(b);
    video = new VideoObject(b);
}
Kindly help me. Thank you, best regards.
It's likely you're mixing the mapred and mapreduce APIs together.
It's complaining that you're trying to cast org.apache.hadoop.mapreduce.lib.input.FileSplit to org.apache.hadoop.mapred.FileSplit.
You need to make sure that you don't mix imports between the two APIs.
So check whether org.apache.hadoop.mapred.FileSplit has been imported and change it to org.apache.hadoop.mapreduce.lib.input.FileSplit.
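A minimal sketch of what the fix looks like in VideoRecordReader, assuming the reader extends the new-API org.apache.hadoop.mapreduce.RecordReader; only the import and the cast change, the rest of the method stays as in the question:

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit; // new API, not org.apache.hadoop.mapred.FileSplit

public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    // the cast now matches the FileSplit type the new-API framework actually passes in
    FileSplit split = (FileSplit) genericSplit;
    // ... rest of the method unchanged
}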

Is it possible to recover a broadcast value from a Spark Streaming checkpoint

I used hbase-spark to record pv/uv in my Spark Streaming project. When I killed the app and restarted it, I got the following exception during checkpoint recovery:
16/03/02 10:17:21 ERROR HBaseContext: Unable to getConfig from broadcast
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable
at com.paitao.xmlife.contrib.hbase.HBaseContext.getConf(HBaseContext.scala:645)
at com.paitao.xmlife.contrib.hbase.HBaseContext.com$paitao$xmlife$contrib$hbase$HBaseContext$$hbaseForeachPartition(HBaseContext.scala:627)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I checked the code of HBaseContext; it uses a broadcast to store the HBase configuration.
class HBaseContext(@transient sc: SparkContext,
                   @transient config: Configuration,
                   val tmpHdfsConfgFile: String = null) extends Serializable with Logging {

  @transient var credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
  @transient var tmpHdfsConfiguration: Configuration = config
  @transient var appliedCredentials = false
  @transient val job = Job.getInstance(config)
  TableMapReduceUtil.initCredentials(job)

  // <-- broadcast for HBaseConfiguration here !!!
  var broadcastedConf = sc.broadcast(new SerializableWritable(config))
  var credentialsConf = sc.broadcast(new SerializableWritable(job.getCredentials()))
  ...
During checkpoint recovery, it tries to access this broadcast value in its getConf function:
if (tmpHdfsConfiguration == null) {
  try {
    tmpHdfsConfiguration = configBroadcast.value.value
  } catch {
    case ex: Exception => logError("Unable to getConfig from broadcast", ex)
  }
}
Then the exception is raised. My question is: is it possible to recover a broadcast value from the checkpoint in a Spark application? Or do we have some other solution to re-broadcast the value after recovering?
Thanks for any feedback!
Currently, this is a known bug in Spark. Contributors have been investigating the issue but have made no progress.
Here's my workaround: instead of loading the data into a broadcast variable and broadcasting it to all executors, I let each executor load the data itself into a singleton object.
By the way, follow this issue for changes: https://issues.apache.org/jira/browse/SPARK-5206
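A minimal Java sketch of the singleton-object workaround described above; the class name and the use of HBaseConfiguration.create() as the payload are illustrative assumptions, not code from the original project:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Each executor lazily builds its own copy of the configuration instead of
// relying on a broadcast value that cannot be recovered from the checkpoint.
public final class ExecutorLocalConfig {

    private static volatile Configuration conf;

    private ExecutorLocalConfig() { }

    public static Configuration get() {
        if (conf == null) {
            synchronized (ExecutorLocalConfig.class) {
                if (conf == null) {
                    // loads hbase-site.xml / core-site.xml from the executor's classpath
                    conf = HBaseConfiguration.create();
                }
            }
        }
        return conf;
    }
}

Code running inside foreachPartition on an executor then calls ExecutorLocalConfig.get() instead of reading the broadcast.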
Follow the approach below (see the sketch after these steps):
Create the Spark context.
Initialize the broadcast variable.
Create the streaming context with a checkpoint directory, using the above Spark context and passing in the initialized broadcast variable.
When the streaming job starts with no data in the checkpoint directory, it will initialize the broadcast variable.
When streaming restarts, it will recover the broadcast variable from the checkpoint directory.
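A rough Java sketch of how these steps could be wired with streaming-context checkpointing; the broadcast payload (a plain map of configuration properties) and all names and paths are assumptions for illustration, not the original code:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointedBroadcastSketch {

    public static void main(String[] args) throws Exception {
        final String checkpointDir = "hdfs:///tmp/pvuv-checkpoint"; // illustrative path

        // The factory runs only when no checkpoint data exists yet; on restart,
        // getOrCreate rebuilds the contexts and the DStream graph from the checkpoint.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            // 1. Create the Spark context.
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("pv-uv"));

            // 2. Initialize the broadcast variable with a serializable payload.
            Map<String, String> props = new HashMap<>();
            props.put("hbase.zookeeper.quorum", "zk-host"); // illustrative value
            Broadcast<Map<String, String>> confBroadcast = sc.broadcast(props);

            // 3. Create the streaming context with the checkpoint directory and
            //    build the DStream graph here, closing over confBroadcast.
            JavaStreamingContext created = new JavaStreamingContext(sc, Durations.seconds(10));
            created.checkpoint(checkpointDir);
            return created;
        });

        jssc.start();
        jssc.awaitTermination();
    }
}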

Error while integrating mahout into solr

I am trying to integrate Mahout into Solr (using an UpdateRequestProcessor chain), but whenever I try to initialise the ClassifierContext I get the following error.
INFO: model path :/home/bayes-model
java.lang.IllegalStateException: /home/bayes-model/trainer-weights/Sigma_j/part-*
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable.iterator(SequenceFileDirIterable.java:79)
at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadFeatureWeights(SequenceFileModelReader.java:72)
at org.apache.mahout.classifier.bayes.SequenceFileModelReader.loadModel(SequenceFileModelReader.java:46)
at org.apache.mahout.classifier.bayes.InMemoryBayesDatastore.initialize(InMemoryBayesDatastore.java:72)
at org.apache.mahout.classifier.bayes.ClassifierContext.initialize(ClassifierContext.java:44)
at solr.mypkg.CategorizeDocumentFactory.init(CategorizeDocumentFactory.java:67)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:449)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1569)
at org.apache.solr.update.processor.UpdateRequestProcessorChain.init(UpdateRequestProcessorChain.java:57)
Caused by: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.hadoop.security.UserGroupInformation$HadoopLoginModule
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:808)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:683)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:579)
at org.apache.hadoop.security.UserGroupInformation.getLoginUse
My code is as below:
params = SolrParams.toSolrParams((NamedList) args);
BayesParameters p = new BayesParameters();
String modelPath = params.get("model");
File file = new File(modelPath);
p.set("basePath", file.getAbsolutePath());
LOG.info("model path :" + file.getAbsolutePath());
p.set("classifierType", "bayes");
p.set("dataSource", "hdfs");
Datastore ds = new InMemoryBayesDatastore(p);
Algorithm alg = new BayesAlgorithm();
ctx = new ClassifierContext(alg, ds);
ctx.initialize();
What could be the reason?

FileNotFoundException on hadoop

Inside my map function, I am trying to read a file from the distributed cache and load its contents into a hash map.
The sys output log of the MapReduce job prints the content of the hash map. This shows that it has found the file, has loaded it into the data structure, and performed the needed operation. It iterates through the list and prints its contents, thus proving that the operation was successful.
However, I still get the below error after a few minutes of running the MR job:
13/01/27 18:44:21 INFO mapred.JobClient: Task Id : attempt_201301271841_0001_m_000001_2, Status : FAILED
java.io.FileNotFoundException: File does not exist: /app/hadoop/jobs/nw_single_pred_in/predict
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.&lt;init&gt;(DFSClient.java:1834)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:67)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Here's the portion which initializes Path with the location of the file to be placed in the distributed cache
// inside main, surrounded by try catch block, yet no exception thrown here
Configuration conf = new Configuration();
// rest of the stuff that relates to conf
Path knowledgefilepath = new Path(args[3]); // args[3] = /app/hadoop/jobs/nw_single_pred_in/predict/knowledge.txt
DistributedCache.addCacheFile(knowledgefilepath.toUri(), conf);
job.setJarByClass(NBprediction.class);
// rest of job settings
job.waitForCompletion(true); // kick off load
This one is inside the map function:
try {
    System.out.println("Inside try !!");
    Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    Path cfile = new Path(files[0].toString()); // only one file
    System.out.println("File path : " + cfile.toString());
    CSVReader reader = new CSVReader(new FileReader(cfile.toString()), '\t');
    while ((nline = reader.readNext()) != null)
        data.put(nline[0], Double.parseDouble(nline[1])); // load into a hashmap
}
catch (Exception e) {
    // handle exception
}
Help appreciated.
Cheers!
I did a fresh installation of Hadoop and ran the job with the same jar, and the problem disappeared. It seems to have been a bug rather than a programming error.
