Is it possible to recover a broadcast value from a Spark Streaming checkpoint - spark-streaming

I used hbase-spark to record pv/uv in my Spark Streaming project. Then, when I killed the app and restarted it, I got the following exception during checkpoint recovery:
16/03/02 10:17:21 ERROR HBaseContext: Unable to getConfig from broadcast
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable
at com.paitao.xmlife.contrib.hbase.HBaseContext.getConf(HBaseContext.scala:645)
at com.paitao.xmlife.contrib.hbase.HBaseContext.com$paitao$xmlife$contrib$hbase$HBaseContext$$hbaseForeachPartition(HBaseContext.scala:627)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I checked the code of HBaseContext; it uses a broadcast variable to store the HBase configuration.
class HBaseContext(@transient sc: SparkContext,
                   @transient config: Configuration,
                   val tmpHdfsConfgFile: String = null) extends Serializable with Logging {

  @transient var credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
  @transient var tmpHdfsConfiguration: Configuration = config
  @transient var appliedCredentials = false
  @transient val job = Job.getInstance(config)
  TableMapReduceUtil.initCredentials(job)

  // <-- broadcast for HBaseConfiguration here !!!
  var broadcastedConf = sc.broadcast(new SerializableWritable(config))
  var credentialsConf = sc.broadcast(new SerializableWritable(job.getCredentials()))
  ...
During checkpoint recovery, it tries to access this broadcast value in its getConf function:
if (tmpHdfsConfiguration == null) {
  try {
    tmpHdfsConfiguration = configBroadcast.value.value
  } catch {
    case ex: Exception => logError("Unable to getConfig from broadcast", ex)
  }
}
Then the exception is raised. My question is: is it possible to recover the broadcast value from a checkpoint in a Spark application? Or do we have some other solution to re-broadcast the value after recovery?
Thanks for any feedback!

Currently, this is a known bug in Spark. Contributors have been investigating the issue but have made no progress so far.
Here's my workaround: instead of loading the data into a broadcast variable and broadcasting it to all executors, I let each executor load the data itself into a singleton object.
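A hedged sketch of such a per-executor singleton in Java (the class name, the Properties payload, and loadFromSource() are assumptions, not part of the original project):

import java.util.Properties;

// Hypothetical sketch: each executor lazily loads its own copy of the data instead
// of relying on a broadcast variable that may not survive checkpoint recovery.
public final class ExecutorLocalConfig {

    private static volatile Properties instance;

    private ExecutorLocalConfig() { }

    public static Properties get() {
        if (instance == null) {                  // first check without locking
            synchronized (ExecutorLocalConfig.class) {
                if (instance == null) {          // second check under the lock
                    instance = loadFromSource(); // runs once per executor JVM
                }
            }
        }
        return instance;
    }

    // Placeholder: load whatever you would otherwise have broadcast
    // (e.g. HBase connection settings) from a shared store or classpath resource.
    private static Properties loadFromSource() {
        Properties props = new Properties();
        props.setProperty("hbase.zookeeper.quorum", "zk-host:2181"); // assumption
        return props;
    }
}

Executor-side code (for example inside foreachPartition) then calls ExecutorLocalConfig.get() instead of reading a broadcast value, so nothing needs to survive checkpoint recovery.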
Btw, follow this issue for changes https://issues.apache.org/jira/browse/SPARK-5206

Follow the approach below (a sketch follows these steps):
Create a Spark context.
Initialize the broadcast variable.
Create the streaming context with a checkpoint directory using the above Spark context, passing in the initialized broadcast variable.
When the streaming job starts with no data in the checkpoint directory, it will initialize the broadcast variable.
When streaming restarts, it will recover the broadcast variable from the checkpoint directory.
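A hedged Java sketch of wiring those steps into the context-creation function used with getOrCreate, loosely following the lazily initialized broadcast wrapper pattern from Spark's recoverable-streaming examples (class names, the reference data, and the checkpoint path are assumptions):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Wrapper that (re)creates the broadcast on first use after a restart instead of
// relying on the broadcast object serialized into the checkpoint.
class BroadcastWrapper {
    private static volatile Broadcast<List<String>> instance;

    static Broadcast<List<String>> getInstance(JavaSparkContext jsc) {
        if (instance == null) {
            synchronized (BroadcastWrapper.class) {
                if (instance == null) {
                    // Assumption: the reference data can be reloaded at any time.
                    instance = jsc.broadcast(Arrays.asList("pv", "uv"));
                }
            }
        }
        return instance;
    }
}

public class CheckpointedApp {
    public static void main(String[] args) throws Exception {
        final String checkpointDir = "/tmp/checkpoint"; // assumption
        JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            JavaStreamingContext created =
                    new JavaStreamingContext("local[2]", "broadcast-recovery", Durations.seconds(10));
            created.checkpoint(checkpointDir);
            created.socketTextStream("localhost", 9999) // assumption: any input source
                   .foreachRDD(rdd -> {
                       // Re-obtain the broadcast lazily on every batch; after a restart this
                       // rebuilds it instead of reading a stale value from the checkpoint.
                       Broadcast<List<String>> metrics =
                               BroadcastWrapper.getInstance(new JavaSparkContext(rdd.context()));
                       rdd.filter(line -> metrics.value().contains(line)).count();
                   });
            return created;
        });
        ssc.start();
        ssc.awaitTermination();
    }
}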

Related

Transition from JedisPool to JedisCluster

My application uses ElastiCache on AWS for caching purposes. Our current setup uses a basic Redis cluster with no sharding or failover. We now need to move to a clustered Redis ElastiCache with sharding, failover, etc. enabled. Creating a new cluster on AWS was the easy bit, but we are a bit lost on how to modify our Java code to read from and write to the cluster.
Current Implementation -
Initialize a JedisPool.
JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
jedisPoolConfig.setMaxTotal(100);
jedisPoolConfig.setMaxIdle(10);
jedisPoolConfig.setMaxWaitMillis(50);
jedisPoolConfig.setTestOnBorrow(true);
String host = "mycache.db8e1v.0001.usw2.cache.amazonaws.com";
int port = 6379;
int timeout = 50;
JedisPool jedisPool = new JedisPool(jedisPoolConfig, host, port, timeout);
A Jedis object is borrowed from the pool every time we need to perform an operation:
Jedis jedis = jedisPool.getResource();
The new implementation would be
JedisPoolConfig jedisPoolConfig = ...
HostAndPort hostAndPort = new HostAndPort(host, port);
jedisCluster = new JedisCluster(Collections.singleton(hostAndPort), jedisPoolConfig);
Question:
The documentation says JedisCluster is to be used in place of Jedis (not JedisPool). Does this mean I need to create and destroy a JedisCluster object in each thread, or can I reuse the same object and it will handle the thread safety? When exactly do I close the JedisCluster then? At the end of the application?
The JedisCluster holds internal JedisPools for each node in the cluster.
Does this mean I need to create and destroy a JedisCluster object in each thread, or can I reuse the same object and it will handle the thread safety?
You can reuse the same object.
When exactly do I close the JedisCluster then? At the end of the application?
Yes.
Replacing all Jedis calls with JedisCluster calls is the best way to migrate.
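A minimal sketch of sharing one JedisCluster across the application (the holder class name and the endpoint are assumptions):

import java.util.Collections;

import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.JedisPoolConfig;

// One JedisCluster for the whole application; it is thread-safe and manages
// an internal connection pool per cluster node.
public final class RedisClusterHolder {

    private static final JedisCluster CLUSTER = new JedisCluster(
            Collections.singleton(new HostAndPort(
                    "mycache.db8e1v.clustercfg.usw2.cache.amazonaws.com", 6379)), // assumption
            new JedisPoolConfig());

    private RedisClusterHolder() { }

    public static JedisCluster get() {
        return CLUSTER;
    }

    // Call once on application shutdown, e.g. from a shutdown hook.
    public static void shutdown() {
        CLUSTER.close();
    }
}

Threads then call methods such as RedisClusterHolder.get().set(key, value) directly; there is no per-thread resource to borrow and return.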
But I wanted pipeline support, which JedisCluster currently lacks. So another idea is to extend JedisCluster to return the underlying Jedis for the node that owns a particular key:
protected Jedis getJedis(String key) {
  int slot = JedisClusterCRC16.getSlot(key);
  return connectionHandler.getConnectionFromSlot(slot);
}
The extended class has to be in the package redis.clients.jedis to access getConnectionFromSlot.
Now a pipeline can be executed on the returned Jedis.
Note that you need a different Jedis for each key you want to operate on, which makes sense: in cluster mode, each key can be on a different node.
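For concreteness, a hedged sketch of how the extended class and a pipelined write could look together (the class name, the hash-field example, and the Jedis 2.x package for JedisClusterCRC16 are assumptions):

package redis.clients.jedis; // required to reach the package-private getConnectionFromSlot

import java.util.Map;
import java.util.Set;

import redis.clients.util.JedisClusterCRC16; // redis.clients.jedis.util in Jedis 3.x

// Hypothetical subclass combining the getJedis(String) helper above with a
// pipelined multi-command write against the node that owns the key.
public class PipelineAwareJedisCluster extends JedisCluster {

    public PipelineAwareJedisCluster(Set<HostAndPort> nodes) {
        super(nodes);
    }

    protected Jedis getJedis(String key) {
        int slot = JedisClusterCRC16.getSlot(key);
        return connectionHandler.getConnectionFromSlot(slot);
    }

    public void pipelinedWrite(String key, Map<String, String> fields) {
        Jedis jedis = getJedis(key);
        try {
            Pipeline pipeline = jedis.pipelined();
            fields.forEach((field, value) -> pipeline.hset(key, field, value));
            pipeline.sync();   // one round trip for all queued commands
        } finally {
            jedis.close();     // return the connection to its node's pool
        }
    }
}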

Internal Failures when using Spring-Integration-Kinesis Message Driven Adapter

I have set up this KinesisMessageDrivenChannelAdapter:
@Bean
public KinesisMessageDrivenChannelAdapter kinesisInboundChannel(AmazonKinesis amazonKinesis, MetadataStore store) {
    KinesisMessageDrivenChannelAdapter adapter =
            new KinesisMessageDrivenChannelAdapter(amazonKinesis, config.getStreamName());
    adapter.setCheckpointMode(CheckpointMode.batch);
    adapter.setListenerMode(ListenerMode.batch);
    adapter.setStartTimeout(10000);
    // Set idle to milliseconds. Max value is 596 before getting an overflow exception.
    adapter.setIdleBetweenPolls(config.getPollHours() * 3_600_000);
    adapter.setShouldTrack(true);
    adapter.setDescribeStreamRetries(5);
    adapter.setConcurrency(50);
    adapter.setCheckpointStore(store);
    adapter.setStreamInitialSequence(KinesisShardOffset.trimHorizon());
    adapter.setOutputChannelName("logMessage.input");
    adapter.setErrorChannel(errorChannel());
    return adapter;
}
Most of the time, it works fine. But from time to time, I get this kind of message:
Exception in thread "kinesisInboundChannel-kinesis-consumer-1" com.amazonaws.services.kinesis.model.AmazonKinesisException: null (Service: AmazonKinesis; Status Code: 500; Error Code: InternalFailure; Request ID: c2f66be9-23f4-b211-9165-ed92383ee673)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2276)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2252)
at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetRecords(AmazonKinesisClient.java:1062)
at com.amazonaws.services.kinesis.AmazonKinesisClient.getRecords(AmazonKinesisClient.java:1038)
at org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ShardConsumer.getRecords(KinesisMessageDrivenChannelAdapter.java:853)
at org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ShardConsumer.access$3500(KinesisMessageDrivenChannelAdapter.java:688)
at org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ShardConsumer$2.run(KinesisMessageDrivenChannelAdapter.java:816)
at org.springframework.integration.aws.inbound.kinesis.KinesisMessageDrivenChannelAdapter$ConsumerInvoker.run(KinesisMessageDrivenChannelAdapter.java:1003)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
After that, the adapter stops working entirely, without hanging the app. I specified which error channel to use, and I'd be happy to just restart the application to bring the adapter back online if I need to, but that appears not to be an option.
How do I build error handling into this?
Sounds like this problem has been fixed here: https://github.com/spring-projects/spring-integration-aws/issues/84
You should consider using the latest version (2.0.0.M2), or even better, 2.0.0.BUILD-SNAPSHOT.
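For completeness, a hedged sketch of a handler on the adapter's configured error channel (class and channel names are assumptions; the consumer-thread failure in the stack trace above is what the linked fix and version upgrade address, while the error channel mainly receives downstream processing errors):

import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.messaging.support.ErrorMessage;

// Hedged sketch: a subscriber on the channel passed to adapter.setErrorChannel(...).
@Configuration
public class KinesisErrorHandling {

    @ServiceActivator(inputChannel = "errorChannel")
    public void handleAdapterError(ErrorMessage errorMessage) {
        Throwable cause = errorMessage.getPayload();
        // Real handling (alerting, metrics, triggering a restart) would go here.
        System.err.println("Kinesis adapter error: " + cause);
    }
}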

Is there a way to get the Spark tracking URL other than mining log files for the log output?

I have a Scala application that creates a Spark Session, and I have set up health checks that use the Spark REST API. The Spark application itself runs on Hadoop YARN. The REST API URL is currently retrieved by reading the Spark logging generated when the Spark Session is created. This works most of the time, but there are some edge cases in my application where it doesn't work so well.
Does anyone know of another way to get this tracking URL?
"You can do this by reading the yarn.resourcemanager.webapp.address value from YARN's config and the application ID (which is exposed both in an event sent on the listener bus, and an existing SparkContext method."
Copied the paragraph above as is from the developer's response found at: https://issues.apache.org/jira/browse/SPARK-20458
UPDATE:
I did try the solution and got pretty close. Here's some Scala/Spark code to build that URL:
@transient val ssc: StreamingContext = StreamingContext.getActiveOrCreate(rabbitSettings.checkpointPath, CreateStreamingContext)

// Update yarn logs URL in Elasticsearch
YarnLogsTracker.update(
  ssc.sparkContext.uiWebUrl,
  ssc.sparkContext.applicationId,
  "test2")
And the YarnLogsTracker object goes something like this:
object YarnLogsTracker {
  private def recoverURL(u: Option[String]): String = u match {
    case Some(a) => a.split(":").take(2).mkString(":")
    case None => ""
  }

  def update(rawUrl: Option[String], rawAppId: String, tenant: String): Unit = {
    val logUrl = s"${recoverURL(rawUrl)}:8042/node/containerlogs/container${rawAppId.substring(11)}_01_000002/$tenant/stdout/?start=-4096"
    ...
Which produces something like this: http://10.99.25.146:8042/node/containerlogs/container_1516203096033_91164_01_000002/test2/stdout/?start=-4096
I've discovered a "reasonable" way to obtain this. Obviously, the best way would be for Spark libraries to expose the ApplicationReport that they're already fetching to the launcher application directly, since they go to the trouble of setting delegation tokens, etc. However, this seems unlikely to happen.
This approach is two-pronged. First, it attempts to build a YarnClient itself, in order to fetch the ApplicationReport, which will have the authoritative tracking URL. However, in my experience, this can fail (e.g. if the job was run in cluster mode with a --proxy-user in a Kerberized environment, it will not be able to properly authenticate to YARN).
In my case, I'm calling this helper method from the driver itself, and reporting the result back to my launcher application on the side. However, in principle, any place where you have the Hadoop Configuration available should work (including, possibly, your launcher application). You can obviously use either "prong" of this implementation (or both) depending on your needs and tolerance for complexity, extra processing, etc.
/**
 * Given a Hadoop {@link org.apache.hadoop.conf.Configuration} and appId, use the YARN API (via an
 * {@link YarnClient} instance) to get the application report, which includes the trackingUrl. If this fails,
 * then as a fallback, it attempts to "guess" the URL by looking at various YARN configuration properties,
 * and assumes that the URL will be something like: <pre>[yarnWebUI:port]/proxy/[appId]</pre>.
 *
 * @param hadoopConf the Hadoop {@link org.apache.hadoop.conf.Configuration}
 * @param appId the YARN application ID
 * @return the app trackingUrl, either retrieved using the {@link YarnClient}, or manually constructed using
 *         the fallback approach
 */
public static String getYarnApplicationTrackingUrl(org.apache.hadoop.conf.Configuration hadoopConf, String appId) {
    LOG.debug("Attempting to look up YARN url for applicationId {}", appId);
    YarnClient yarnClient = null;
    try {
        // do not attempt to fail over on authentication error (ex: running with proxy-user and Kerberos)
        hadoopConf.set("yarn.client.failover-max-attempts", "0");
        yarnClient = YarnClient.createYarnClient();
        yarnClient.init(hadoopConf);
        yarnClient.start();
        final ApplicationReport report = yarnClient.getApplicationReport(ConverterUtils.toApplicationId(appId));
        return report.getTrackingUrl();
    } catch (YarnException | IOException e) {
        LOG.warn(
                "{} attempting to get report for YARN appId {}; attempting to use manually constructed fallback",
                e.getClass().getSimpleName(),
                appId,
                e
        );
        String baseYarnWebappUrl;
        String protocol;
        if ("HTTPS_ONLY".equals(hadoopConf.get("yarn.http.policy"))) {
            // YARN is configured to use HTTPS only, hence return the https address
            baseYarnWebappUrl = hadoopConf.get("yarn.resourcemanager.webapp.https.address");
            protocol = "https";
        } else {
            baseYarnWebappUrl = hadoopConf.get("yarn.resourcemanager.webapp.address");
            protocol = "http";
        }
        return String.format("%s://%s/proxy/%s", protocol, baseYarnWebappUrl, appId);
    } finally {
        if (yarnClient != null) {
            yarnClient.stop();
        }
    }
}
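A hedged usage sketch from the driver side, assuming this method is placed in the same class as getYarnApplicationTrackingUrl above and that a JavaSparkContext is already available:

// Report the tracking URL from the driver itself, as described above.
public static void reportTrackingUrl(org.apache.spark.api.java.JavaSparkContext jsc) {
    String appId = jsc.sc().applicationId();
    String trackingUrl = getYarnApplicationTrackingUrl(jsc.hadoopConfiguration(), appId);
    LOG.info("YARN tracking URL for {}: {}", appId, trackingUrl);
}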

Hbase MuleSoft Cloudhub Connectivity

I have to connect CloudHub to HBase. I tried the Community Edition HBase connector but did not succeed. Then I tried with Java code and failed again. The HBase team has given me only the master IP (10.99.X.X), the port (2181), and the username (hadoop).
I have tried the following options:
Through Java code:
public Object transformMessage(MuleMessage message, String outputEncoding) throws TransformerException {
    try {
        Configuration conf = HBaseConfiguration.create();
        //conf.set("hbase.rotdir", "/hbase");
        conf.set("hbase.zookeeper.quorum", "10.99.X.X");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("hbase.client.retries.number", "3");
        logger.info("############# Config Created ##########");
        // Create a get api for consignment table
        logger.info("############# Starting Consignment Test ##########");
        // read from table
        // Creating a HTable instance
        HTable table = new HTable(conf, "consignment");
        logger.info("############# HTable instance Created ##########");
        // Create a Get object
        Get get = new Get(Bytes.toBytes("6910358750"));
        logger.info("############# RowKey Created ##########");
        // Set column family to be queried
        get.addFamily(Bytes.toBytes("consignment_detail"));
        logger.info("############# CF Created ##########");
        // Perform get and capture result in a iterable
        Result result = table.get(get);
        logger.info("############# Result Created ##########");
        // Print consignment data
        logger.info(result);
        logger.info(" #### Ending Consignment Test ###");
        // Begining Consignment Item Scanner api
        logger.info("############# Starting Consignmentitem test ##########");
        HTable table1 = new HTable(conf, "consignmentitem");
        logger.info("############# HTable instance Created ##########");
        // Create a scan object with start rowkey and end rowkey (partial
        // row key scan)
        // actual rowkey design: <consignment_id>-<trn>-<orderline>
        Scan scan = new Scan(Bytes.toBytes("6910358750"), Bytes.toBytes("6910358751"));
        logger.info("############# Partial RowKeys Created ##########");
        // Perform a scan using start and stop rowkeys
        ResultScanner scanner = table1.getScanner(scan);
        // Iterate over result and print them
        for (Result result1 = scanner.next(); result1 != null; result1 = scanner.next()) {
            logger.info("Printing Records\n");
            logger.info(result1);
        }
        return scanner;
    } catch (MasterNotRunningException e) {
        logger.error("HBase connection failed! --> MasterNotRunningException");
        logger.error(e);
    } catch (ZooKeeperConnectionException e) {
        logger.error("Zookeeper connection failed! -->ZooKeeperConnectionException");
        logger.error(e);
    } catch (Exception e) {
        logger.error("Main Exception Found! -- Exception");
        logger.error(e);
    }
    return "Not Connected";
}
The above code gives the error below:
java.net.UnknownHostException: unknown host: ip-10-99-X-X.ap-southeast-2.compute.internal
It seems that CloudHub is not able to resolve the host name because CloudHub is not configured with DNS.
When I tried the Community Edition HBase connector, it gave the following exception:
org.apache.hadoop.hbase.MasterNotRunningException: Retried 3 times
Please suggest a way forward.
It appears that you are configuring your client to try to connect to the zookeeper quorum at a private IP address (10.99.X.X). I'll assume you've already set up a VPC, which is required for your CloudHub worker to connect to your private network.
Your UnknownHostException implies that the HBase server you are connecting to is hosted on AWS, which defines private domain names similar to the one in the error message.
So what might be happening is this:
Mule connects to Zookeeper, asks what HBase nodes there are, and gets back ip-10-99-X-X.ap-southeast-2.compute.internal.
Mule tries to connect to that to find the HTable "consignment", but can't resolve an IP address for that name.
Unfortunately, if this is what's going on, it will take some networking changes to fix it. The FAQ in the VPC discovery form says this about private DNS:
Currently we don't have the ability to relay DNS queries to internal DNS servers. You would need to either use IP addresses or public DNS entries. Beware of connecting to systems which may redirect to a Virtual IP endpoint by using an internal DNS entry.
You could use public DNS and possibly an Elastic IP to get around this problem, but that would require you to expose your HBase cluster to the internet.
I believe the answer to your question is covered in the CloudHub networking guide.
https://developer.mulesoft.com/docs/display/current/CloudHub+Networking+Guide

Check Session with Cassandra Datastax Java Driver

Is there any direct way to check if a Cluster/Session is connected/valid/ok?
I mean, I have a com.datastax.driver.core.Session created in a never-ending thread, and I'd like to ensure the session is OK every time it is needed. I use the following cluster initialization, but I'm not sure it is enough...
Cluster.builder().addContactPoint(url)
    .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
    .withReconnectionPolicy(new ConstantReconnectionPolicy(1000L))
    .build();
In fact, when using the DataStax Java Driver, you have a hidden/magic capability embedded:
The driver is aware of the full network topology (node topology across datacenters and node availability).
Thus, the only thing you have to do is initialise your cluster with a few nodes(1), and then you can be sure at every moment that, if there is at least one available node, your request will be performed correctly. Because the driver is topology aware, if one node (even an initialisation node) becomes unavailable, the driver will automatically route your request to another available node.
In summary, your code is good(1).
(1): You should provide a few nodes in order to be fault tolerant in the cluster initialisation phase. Indeed, if one initialisation node is down, the driver then has the possibility to contact another one to discover the full topology.
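A minimal sketch of such an initialisation with several contact points (class name and host names are assumptions):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.ConstantReconnectionPolicy;

public class CassandraConnectionFactory {

    // Give the driver several contact points so it can still discover the full
    // topology even if one of them is down during initialisation.
    public static Session connect() {
        Cluster cluster = Cluster.builder()
                .addContactPoints("cassandra-1.example.com",   // assumption: replace with
                                  "cassandra-2.example.com",   // real node addresses
                                  "cassandra-3.example.com")
                .withReconnectionPolicy(new ConstantReconnectionPolicy(1000L))
                .build();
        return cluster.connect();
    }
}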
I have a local development environment set up where I start my Java application and the Cassandra (Docker) container at the same time, so Cassandra will normally not be in a ready state when the Java application first attempts to connect.
While this is starting up, the application will throw a NoHostAvailableException when the Cluster instance attempts to create a Session. Subsequent attempts to create a Session from the Cluster will then throw an IllegalStateException, because the cluster instance was closed after the first exception.
What I did to remedy this was to create a check method that attempts to create a Cluster and Session and then immediately closes them. See this:
private void waitForCassandraToBeReady(String keyspace, Cluster.Builder builder) {
    RuntimeException exception = null;
    int retries = 0;
    while (retries++ < 40) {
        Session session = null;
        Cluster cluster = null;
        try {
            cluster = builder.build();
            session = cluster.connect(keyspace);
            log.info("Cassandra is available");
            return;
        } catch (RuntimeException e) {
            log.warn("Cassandra not available, try {}", retries);
            exception = e;
        } finally {
            if (session != null && !session.isClosed()) session.close();
            if (cluster != null && !cluster.isClosed()) cluster.close();
        }
        sleep();
    }
    log.error("Retries exceeded waiting for Cassandra to be available");
    if (exception != null) throw exception;
    else throw new RuntimeException("Cassandra not available");
}
After this method returns, I then create a Cluster and Session independently of this check method, as in the sketch below.
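A hedged sketch of that start-up sequence, assuming it lives in the same class as the check method above (keyspace name and contact point are placeholders):

private Session connectWhenReady() {
    // Keyspace name and contact point are assumptions.
    Cluster.Builder builder = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withReconnectionPolicy(new ConstantReconnectionPolicy(1000L));

    waitForCassandraToBeReady("my_keyspace", builder);  // retry loop shown above

    // Only now build the long-lived Cluster and Session the application will use.
    Cluster cluster = builder.build();
    return cluster.connect("my_keyspace");
}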
