I am using Kafka with Storm. Kafka emits JSON strings to Storm, and in Storm I want to distribute the load across a couple of workers based on a key/field in the JSON. How can I do that? In my case, the key is the groupid field in the JSON string.
For example, I have JSON like this:
{groupid: 1234, userid: 145, comments:"I want to distribute all this group 1234 to one worker", size:50,type:"group json"}
{groupid: 1235, userid: 134, comments:"I want to distribute all this group 1234 to another worker", size:90,type:"group json"}
{groupid: 1234, userid: 158, comments:"I want to be sent to same worker as group 1234", size:50,type:"group json"}
I tried the following code:
1. TopologyBuilder builder = new TopologyBuilder();
2. builder.setSpout(SPOUTNAME, kafkaSpout, 1);
3. builder.setBolt(MYDISTRIBUTEDWORKER, new DistributedBolt()).setFieldsGroup(SPOUTNAME,new Fields("groupid")); <---???
I am wondering which arguments to pass to the setFieldsGroup method in line 3. Could someone give me a hint?
Juhani
==Testing using storm 0.9.4 ============
=============source codes==============
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class KafkaBoltMain {
private static final String SPOUTNAME="TopicSpout";
private static final String ANALYSISBOLT = "AnalysisWorker";
private static final String CLIENTID = "Storm";
private static final String TOPOLOGYNAME = "LocalTopology";
private static class AppAnalysisBolt extends BaseRichBolt {
private static final long serialVersionUID = -6885792881303198646L;
private OutputCollector _collector;
private long groupid=-1L;
private String log="test";
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
_collector = collector;
}
public void execute(Tuple tuple) {
List<Object> objs = tuple.getValues();
int i=0;
for(Object obj:objs){
System.out.println(""+i+"th object's value is:"+obj.toString());
i++;
}
// _collector.emit(new Values(groupid,log));
_collector.ack(tuple);
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("groupid","log"));
}
}
public static void main(String[] args){
String zookeepers = null;
String topicName = null;
if(args.length == 2 ){
zookeepers = args[0];
topicName = args[1];
}else if(args.length == 1 && args[0].equalsIgnoreCase("help")){
System.out.println("xxxx");
System.exit(0);
}
else{
System.out.println("You need to have two arguments: kafka zookeeper:port and topic name");
System.out.println("xxxx");
System.exit(-1);
}
SpoutConfig spoutConfig = new SpoutConfig(new ZkHosts(zookeepers),
topicName,
"",// zookeeper root path for offset storing
CLIENTID);
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(SPOUTNAME, kafkaSpout, 1);
builder.setBolt(ANALYSISBOLT, new AppAnalysisBolt(),2)
.fieldsGrouping(SPOUTNAME,new Fields("groupid"));
//Configuration
Config conf = new Config();
conf.setDebug(false);
//Topology run
conf.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(TOPOLOGYNAME, conf, builder.createTopology());
}
}
==================================================
When I submit the topology (local cluster), it gives the following error:
11658 [SyncThread:0] INFO org.apache.storm.zookeeper.server.ZooKeeperServer - Established session 0x14d097d338c0009 with negotiated timeout 20000 for client /127.0.0.1:34656
11658 [main-SendThread(localhost:2000)] INFO org.apache.storm.zookeeper.ClientCnxn - Session establishment complete on server localhost/127.0.0.1:2000, sessionid = 0x14d097d338c0009, negotiated timeout = 20000
11659 [main-EventThread] INFO org.apache.storm.curator.framework.state.ConnectionStateManager - State change: CONNECTED
12670 [main] INFO backtype.storm.daemon.supervisor - Starting supervisor with id ccc57de0-29ff-4cb4-89de-fea1ea9b6e28 at host storm-VirtualBox
12794 [main] WARN backtype.storm.daemon.nimbus - Topology submission exception. (topology name='LocalTopology') #<InvalidTopologyException InvalidTopologyException(msg:Component: [AnalysisWorker] subscribes from stream: [default] of component [TopicSpout] with non-existent fields: #{"groupid"})>
12800 [main] ERROR org.apache.storm.zookeeper.server.NIOServerCnxnFactory - Thread Thread[main,5,main] died
backtype.storm.generated.InvalidTopologyException: null
at backtype.storm.daemon.common$validate_structure_BANG_.invoke(common.clj:178) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.daemon.common$system_topology_BANG_.invoke(common.clj:307) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.daemon.nimbus$fn__4290$exec_fn__1754__auto__$reify__4303.submitTopologyWithOpts(nimbus.clj:948) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.daemon.nimbus$fn__4290$exec_fn__1754__auto__$reify__4303.submitTopology(nimbus.clj:966) ~[storm-core-0.9.4.jar:0.9.4]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.7.0_80]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) ~[na:1.7.0_80]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.7.0_80]
at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_80]
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93) ~[clojure-1.5.1.jar:na]
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28) ~[clojure-1.5.1.jar:na]
at backtype.storm.testing$submit_local_topology.invoke(testing.clj:264) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.LocalCluster$_submitTopology.invoke(LocalCluster.clj:43) ~[storm-core-0.9.4.jar:0.9.4]
at backtype.storm.LocalCluster.submitTopology(Unknown Source) ~[storm-core-0.9.4.jar:0.9.4]
at com.callstats.stream.analyzer.KafkaBoltMain.main(KafkaBoltMain.java:94) ~[StreamAnalyzer-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
I'm not sure which version of Storm you are using. As of 0.9.4, your requirement can be implemented as follows:
builder.setBolt(MYDISTRIBUTEDWORKER, new DistributedBolt()).fieldsGrouping(SPOUTNAME, new Fields("groupid"));
In the declareOutputFields method of DistributedBolt, declare the fields it emits:
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("groupid", "log"));
}
Somewhere in its execute method, you will call
collector.emit(new Values(groupid, log));
Then tuples that have the same groupid will be delivered to the same instance of the next bolt, provided that bolt subscribes with fieldsGrouping on "groupid".
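As a rough illustration of what fields grouping does, the sketch below maps each message to a task index derived from its groupid, so equal groupids always land on the same task. It is plain Java rather than Storm code: the class name, the regex-based extractGroupId helper, and the hash-modulo rule are assumptions made for the example, and Storm's internal routing may differ in detail. Note also that the grouping field has to be declared by the upstream component. With StringScheme, the Kafka spout declares a single field named "str", which is consistent with the "non-existent fields" exception in the question, so a small parsing bolt that emits groupid would probably need to sit between the spout and the grouped bolt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldsGroupingSketch {
    // Hypothetical helper pattern: finds the numeric groupid in the raw message.
    private static final Pattern GROUP_ID =
            Pattern.compile("\"?groupid\"?\\s*:\\s*(\\d+)");

    // Extracts the groupid; a real parsing bolt would use a JSON library instead.
    static long extractGroupId(String json) {
        Matcher m = GROUP_ID.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no groupid in: " + json);
        }
        return Long.parseLong(m.group(1));
    }

    // Assumed routing rule: hash of the grouping value modulo the number of tasks.
    static int chooseTask(long groupId, int numTasks) {
        return Math.floorMod(Long.hashCode(groupId), numTasks);
    }

    public static void main(String[] args) {
        String[] messages = {
            "{groupid: 1234, userid: 145, size:50}",
            "{groupid: 1235, userid: 134, size:90}",
            "{groupid: 1234, userid: 158, size:50}"
        };
        int numTasks = 2; // matches the parallelism hint of 2 used for the bolt
        for (String message : messages) {
            long groupId = extractGroupId(message);
            System.out.println("groupid " + groupId + " -> task " + chooseTask(groupId, numTasks));
        }
    }
}
```

The first and third messages share groupid 1234 and therefore get the same task index, which is the property the question asks for.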
I am new to Kafka and am currently looking at Kafka Streams, especially joining two streams.
The samples I browsed worked with rather simple text messages.
So I constructed another simple sample that is closer to traditional ETL.
Let's say we have two "datasets": Contract (= Vertrag) and Cashflow, with a cardinality of 1 to n.
In my sample I created a topic for each and sent objects (Vertrag, Cashflow) to each.
I managed a first join of them:
KStream<String, String> joined = srcVertrag.leftJoin(srcCashflow,
(leftValue, rightValue) -> "left=" + leftValue + ", right=" + rightValue, /* ValueJoiner */
JoinWindows.of(5000),
Joined.with(
Serdes.String(), /* key */
Serdes.String(), /* left value */
Serdes.String()) /* right value */
);
The result looks like this:
left={"name":"Vertrag123","vertragId":"123"}, right={"buchungstag":1560715764709,"betrag":12.0,"vertragId":"123"}
Now my questions:
is this the right way to do this?
should I create Objects at all or rather process just Strings?
After your hints and further research, I came up with the following test.
- I created POJOs for "Vertrag" and "Cashflow"
- I created Serdes for each
- I stream them as objects
- Finally I try to join them into a wrapper class (and this is where I am stuck)
I can't find samples that do something like this. Is this so exotic?
package tki.bigdata.kafkaetl;
import java.time.Duration;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.ValueJoiner;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.EnableScheduling;
import tki.bigdata.domain.Cashflow;
import tki.bigdata.domain.Vertrag;
import tki.bigdata.serde.JsonPOJODeserializer;
import tki.bigdata.serde.JsonPOJOSerializer;
@ComponentScan(basePackages = { "tki.bigdata.domain", "tki.bigdata.config", "tki.bigdata.app" }, basePackageClasses = App.class)
@SpringBootApplication
@EnableScheduling
public class App implements CommandLineRunner {
private static String bootstrapServers = "tobi0179.westeurope.cloudapp.azure.com:9092";
@Autowired
private KafkaTemplate<String, Object> template;
// @Autowired
// ExcelReader excelReader;
public static void main(String[] args) {
SpringApplication.run(App.class, args).close();
}
private void populateSampleData() {
Vertrag v = new Vertrag();
v.setVertragId("123");
v.setName("Vertrag123");
template.send("Vertrag", "123", v);
//template.send("Vertrag", "124", "124;Vertrag12");
Cashflow c = new Cashflow();
c.setVertragId("123");
c.setBetrag(12);
c.setBuchungstag(new Date());
template.send("Cashflow", "123", c);
}
//@Override
public void run(String... args) throws Exception {
// Populate the topics with demo data
populateSampleData();
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
// TODO: the following can be removed with a serialization factory
Map<String, Object> serdeProps = new HashMap<>();
// prepare Serde for Vertrag
final Serializer<Vertrag> vertragSerializer = new JsonPOJOSerializer<Vertrag>();
serdeProps.put("JsonPOJOClass", Vertrag.class);
vertragSerializer.configure(serdeProps, false);
final Deserializer<Vertrag> vertragDeserializer = new JsonPOJODeserializer<Vertrag>();
serdeProps.put("JsonPOJOClass", Vertrag.class);
vertragDeserializer.configure(serdeProps, false);
final Serde<Vertrag> vertragSerde = Serdes.serdeFrom(vertragSerializer, vertragDeserializer);
// prepare Serde for Cashflow
final Serializer<Cashflow> cashflowSerializer = new JsonPOJOSerializer<Cashflow>();
serdeProps.put("JsonPOJOClass", Cashflow.class);
cashflowSerializer.configure(serdeProps, false);
final Deserializer<Cashflow> cashflowDeserializer = new JsonPOJODeserializer<Cashflow>();
serdeProps.put("JsonPOJOClass", Cashflow.class);
cashflowDeserializer.configure(serdeProps, false);
final Serde<Cashflow> cashflowSerde = Serdes.serdeFrom(cashflowSerializer, cashflowDeserializer);
// streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG,
// TestUtils.tempDir().getAbsolutePath());
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Vertrag> srcVertrag = builder.stream("Vertrag");
KStream<String, Cashflow> srcCashflow = builder.stream("Cashflow");
// print to sysout
//srcVertrag.print(Printed.toSysOut());
KStream<String, MyValueContainer> joined = srcVertrag.leftJoin(srcCashflow,
(leftValue, rightValue) -> new MyValueContainer(leftValue , rightValue), /* ValueJoiner */
JoinWindows.of(600),
Joined.with(
Serdes.String(), /* key */
vertragSerde, /* left value */
cashflowSerde) /* right value */
);
joined.to("Output");
final Topology topology = builder.build();
System.out.println(topology.describe());
final KafkaStreams streams = new KafkaStreams(topology, streamsConfiguration);
final CountDownLatch latch = new CountDownLatch(1);
// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
@Override
public void run() {
streams.close();
latch.countDown();
}
});
try {
streams.start();
latch.await();
} catch (Throwable e) {
System.exit(1);
}
System.exit(0);
}
}
When executed, it produces the error:
2019-06-17 22:18:31.892 ERROR 1599 --- [-StreamThread-1] o.a.k.s.p.i.AssignedStreamsTasks : stream-thread [streams-pipe-0638d359-94df-43bd-9ef7-eb6769ed8a1c-StreamThread-1] Failed to process stream task 0_0 due to the following error:
java.lang.ClassCastException: java.lang.String cannot be cast to tki.bigdata.domain.Vertrag
at org.apache.kafka.streams.kstream.internals.KStreamKStreamJoin$KStreamKStreamJoinProcessor.process(KStreamKStreamJoin.java:98) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:126) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.kstream.internals.KStreamJoinWindow$KStreamJoinWindowProcessor.process(KStreamJoinWindow.java:63) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:50) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.runAndMeasureLatency(ProcessorNode.java:244) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:133) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:143) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:129) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:90) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:87) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:302) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.AssignedStreamsTasks.process(AssignedStreamsTasks.java:94) ~[kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:409) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:964) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:832) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767) [kafka-streams-2.0.1.jar:na]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736) [kafka-streams-2.0.1.jar:na]
is this the right way to do this?
Yes.
should I create Objects at all or rather process just Strings?
Create objects. Avro is a good example of a data format for serializing/deserializing your POJOs. What you are looking for is an Avro "serde" (serializer/deserializer). Confluent provides such an Avro serde for KStreams, for instance (this serde requires the use of Confluent Schema Registry).
what should I do with the above result?
It's unclear to me what your question is.
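For the wrapper-class join from the question, the following is a minimal sketch of the joiner logic in plain Java, with BiFunction standing in for Kafka Streams' ValueJoiner. The field layout of Vertrag, Cashflow and MyValueContainer is assumed for illustration only. The one point it tries to show is that in a left join the right-hand value can be null when no matching record arrives within the window, so the wrapper should tolerate that.

```java
import java.util.function.BiFunction;

public class JoinWrapperSketch {
    // Minimal stand-ins for the question's domain classes (fields are assumptions).
    static final class Vertrag {
        final String vertragId;
        final String name;
        Vertrag(String vertragId, String name) {
            this.vertragId = vertragId;
            this.name = name;
        }
    }

    static final class Cashflow {
        final String vertragId;
        final double betrag;
        Cashflow(String vertragId, double betrag) {
            this.vertragId = vertragId;
            this.betrag = betrag;
        }
    }

    // Wrapper produced by the join; cashflow stays null when the left join finds no match.
    static final class MyValueContainer {
        final Vertrag vertrag;
        final Cashflow cashflow;
        MyValueContainer(Vertrag vertrag, Cashflow cashflow) {
            this.vertrag = vertrag;
            this.cashflow = cashflow;
        }
        boolean hasCashflow() {
            return cashflow != null;
        }
    }

    // Same shape as the (leftValue, rightValue) -> ... lambda passed to leftJoin.
    static final BiFunction<Vertrag, Cashflow, MyValueContainer> JOINER = MyValueContainer::new;

    public static void main(String[] args) {
        Vertrag v = new Vertrag("123", "Vertrag123");
        MyValueContainer matched = JOINER.apply(v, new Cashflow("123", 12.0));
        MyValueContainer unmatched = JOINER.apply(v, null);
        System.out.println("matched has cashflow: " + matched.hasCashflow());
        System.out.println("unmatched has cashflow: " + unmatched.hasCashflow());
    }
}
```

In the Kafka Streams version, the ClassCastException in the trace is consistent with builder.stream("Vertrag") falling back to the default String value serde. Reading the topics with Consumed.with(Serdes.String(), vertragSerde) (and the Cashflow equivalent) would likely address that, and the wrapper would presumably need its own serde when written out, for example via Produced.with(...).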
Sorry if this question has already been answered; I searched but had no success. There are some similar questions, but none of the ones I found helped. I have the following problem:
603 [main] WARN b.s.StormSubmitter - Topology submission exception:
Component: [escribirFichero] subscribes from non-existent stream:
[default] of component [buscamosEnKlout]
Exception in thread "main" java.lang.RuntimeException:
InvalidTopologyException(msg:Component:
[escribirFichero] subscribes from non-existent stream:
[default] of component [buscamosEnKlout])
I can't understand why I get this exception. I declare the bolt "buscamosEnKlout" before I use "escribirFichero". Below my topology I include the essential lines of the bolts. I know the spout is OK, because I verified it by trial and error.
The code of my topology is:
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.stats.RollingWindow;
import backtype.storm.topology.BoltDeclarer;
import backtype.storm.topology.TopologyBuilder;
import bolt.*;
import spout.TwitterSpout;
import twitter4j.FilterQuery;
public class TwitterTopologia {
private static String consumerKey = "xxx1";
private static String consumerSecret = "xxx2";
private static String accessToken = "yyy1";
private static String accessTokenSecret="yyy2";
public static void main(String[] args) throws Exception {
/**************** SETUP ****************/
String remoteClusterTopologyName = null;
if (args!=null) { ... }
TopologyBuilder builder = new TopologyBuilder();
FilterQuery tweetFilterQuery = new FilterQuery();
tweetFilterQuery.track(new String[]{"Vacaciones","Holy Week", "Semana Santa","Holidays","Vacation"});
tweetFilterQuery.language(new String[]{"en","es"});
TwitterSpout spout = new TwitterSpout(consumerKey, consumerSecret, accessToken, accessTokenSecret, tweetFilterQuery);
KloutBuscador buscamosEnKlout = new KloutBuscador();
FileWriterBolt fileWriterBolt = new FileWriterBolt("idUsuarios.txt");
builder.setSpout("spoutLeerTwitter",spout,1);
builder.setBolt("buscamosEnKlout",buscamosEnKlout,1).shuffleGrouping("spoutLeerTwitter");
builder.setBolt("escribirFichero",fileWriterBolt,1).shuffleGrouping("buscamosEnKlout");
Config conf = new Config();
conf.setDebug(true);
if (args != null && args.length > 0) {
conf.setNumWorkers(3);
StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
}
else {
conf.setMaxTaskParallelism(3);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("twitter-fun", conf, builder.createTopology());
Thread.sleep(460000);
cluster.shutdown();
}
}
}
The bolt "KloutBuscador", alias "buscamosEnKlout", contains the following code:
String text = tuple.getStringByField("id");
String cadenaUrl;
cadenaUrl = "http://api.klout.com/v2/identity.json/twitter?screenName=";
cadenaUrl += text.replaceAll("\\[", "").replaceAll("\\]","");
cadenaUrl += "&key=" + kloutKey;
URL url = new URL(cadenaUrl);
HttpURLConnection c = (HttpURLConnection) url.openConnection();
...
c.setRequestMethod("GET");
c.setRequestProperty("Content-length", "0");
c.setUseCaches(false);
c.setAllowUserInteraction(false);
c.connect();
int status = c.getResponseCode();
StringBuilder sb = new StringBuilder();
switch (status) {
case 200:
case 201:
BufferedReader br = new BufferedReader(new InputStreamReader(c.getInputStream()));
String line;
while ((line = br.readLine()) != null) sb.append(line + "\n");
br.close();
}
JSONObject jsonResponse = new JSONObject(sb.toString());
//getJSONArray("id");
String results = jsonResponse.toString();
_collector.emit(new Values(text,results));
And the second bolt, fileWriterBolt, alias "escribirFichero", is as follows:
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
_collector = outputCollector;
try {
writer = new PrintWriter(filename, "UTF-8");...}...}
public void execute(Tuple tuple) {
writer.println((count++)+":::"+tuple.getValues());
//+"+++"+tweet.getUser().getId()+"__FINAL__"+tweet.getUser().getName()
writer.flush();
// Confirm that this tuple has been treated.
//_collector.ack(tuple);
}
If I skip the Klout bolt and only write the output of the spout, it works. I don't understand why the Klout bolt causes this failure.
Your buscamosEnKlout bolt needs to declare the format of the tuples it will emit, as well as which streams it will emit to. You most likely haven't implemented declareOutputFields correctly in that bolt. It should contain something like declarer.declare(new Fields("your-text-field", "your-results-field"))
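To see why a missing declaration surfaces as "subscribes from non-existent stream", here is a toy model of the kind of check performed at submission time. It is not Storm's actual validation code; the class, its methods, and the field names "id" and "results" are invented for illustration. A component that declares no output fields has no default stream, so any subscriber to it is rejected.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopologyValidationSketch {
    // component id -> (stream id -> declared field names)
    private final Map<String, Map<String, List<String>>> declared = new HashMap<>();

    // Registers a component that has not declared any output stream yet.
    void addComponent(String component) {
        declared.computeIfAbsent(component, k -> new HashMap<>());
    }

    // Models declarer.declare(new Fields(...)) on the given stream.
    void declare(String component, String stream, String... fields) {
        declared.computeIfAbsent(component, k -> new HashMap<>())
                .put(stream, Arrays.asList(fields));
    }

    // Returns "OK" or a message shaped like the exceptions quoted in the questions.
    String validate(String subscriber, String source, String stream, String... groupingFields) {
        Map<String, List<String>> streams =
                declared.getOrDefault(source, Collections.emptyMap());
        List<String> fields = streams.get(stream);
        if (fields == null) {
            return "Component: [" + subscriber + "] subscribes from non-existent stream: ["
                    + stream + "] of component [" + source + "]";
        }
        for (String field : groupingFields) {
            if (!fields.contains(field)) {
                return "Component: [" + subscriber + "] subscribes from stream: [" + stream
                        + "] of component [" + source + "] with non-existent fields: " + field;
            }
        }
        return "OK";
    }

    public static void main(String[] args) {
        TopologyValidationSketch topology = new TopologyValidationSketch();
        topology.addComponent("buscamosEnKlout"); // emits tuples but declares nothing
        System.out.println(topology.validate("escribirFichero", "buscamosEnKlout", "default"));
        topology.declare("buscamosEnKlout", "default", "id", "results");
        System.out.println(topology.validate("escribirFichero", "buscamosEnKlout", "default"));
    }
}
```

The first check fails and the second passes, mirroring the effect of adding the declarer.declare(...) call to the bolt.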
I am using the code below to write to HBase:
jsonDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
@Override
public Void call(JavaRDD<String> rdd) throws Exception {
DataFrame jsonFrame = sqlContext.jsonRDD(rdd);
DataFrame selecteFieldFrame = jsonFrame.select("id_str","created_at","text");
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "d-9543");
config.set("zookeeper.znode.parent","/hbase-unsecure");
config.set("hbase.zookeeper.property.clientPort", "2181");
final JobConf jobConfig=new JobConf(config,SveAsHadoopDataSetExample.class);
jobConfig.setOutputFormat(TableOutputFormat.class);
jobConfig.set(TableOutputFormat.OUTPUT_TABLE,"tableName");
selecteFieldFrame.javaRDD().mapToPair(new PairFunction<Row, ImmutableBytesWritable, Put>() {
@Override
public Tuple2<ImmutableBytesWritable, Put> call(Row row) throws Exception {
// TODO Auto-generated method stub
return convertToPut(row);
}
}).saveAsHadoopDataset(jobConfig);
return null;
}
});
But when I look at zkDump in ZooKeeper, the number of connections keeps increasing.
Any suggestions or pointers would be a great help!
I had the same problem. It is an HBase bug, and I fixed it as follows:
change org.apache.hadoop.hbase.mapred.TableOutputFormat to org.apache.hadoop.hbase.mapreduce.TableOutputFormat,
and use org.apache.hadoop.mapreduce.Job instead of org.apache.hadoop.mapred.JobConf.
Here is a sample:
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", zk_hosts)
conf.set("hbase.zookeeper.property.clientPort", zk_port)
conf.set(TableOutputFormat.OUTPUT_TABLE, "TABLE_NAME")
val job = Job.getInstance(conf)
job.setOutputFormatClass(classOf[TableOutputFormat[String]])
formatedLines.map{
case (a,b, c) => {
val row = Bytes.toBytes(a)
val put = new Put(row)
put.setDurability(Durability.SKIP_WAL)
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("node"), Bytes.toBytes(b))
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("topic"), Bytes.toBytes(c))
(new ImmutableBytesWritable(row), put)
}
}.saveAsNewAPIHadoopDataset(job.getConfiguration)
This may also help you:
https://github.com/hortonworks-spark/shc/pull/20/commits/2074067c42c5a454fa4cdeec18c462b5367f23b9
I want to use Kryo serialization in a Spark job.
public class SerializeTest {
public static class Toto implements Serializable {
private static final long serialVersionUID = 6369241181075151871L;
private String a;
public String getA() {
return a;
}
public void setA(String a) {
this.a = a;
}
}
private static final PairFunction<Toto, Toto, Integer> WRITABLE_CONVERTOR = new PairFunction<Toto, Toto, Integer>() {
private static final long serialVersionUID = -7119334882912691587L;
@Override
public Tuple2<Toto, Integer> call(Toto input) throws Exception {
return new Tuple2<Toto, Integer>(input, 1);
}
};
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("SerializeTest");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(new Class<?>[]{Toto[].class});
JavaSparkContext context = new JavaSparkContext(conf);
List<Toto> list = new ArrayList<Toto>();
list.add(new Toto());
JavaRDD<Toto> cursor = context.parallelize(list, list.size());
JavaPairRDD<Toto, Integer> writable = cursor.mapToPair(WRITABLE_CONVERTOR);
writable.saveAsHadoopFile(args[0], Toto.class, Integer.class, SequenceFileOutputFormat.class);
context.close();
}
}
But I get this error:
java.io.IOException: Could not find a serializer for the Key class: 'com.test.SerializeTest.Toto'. Please ensure that the configuration 'io.serializations' is properly configured, if you're usingcustom serialization.
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1179)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1094)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:273)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:530)
at org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:63)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/09/21 17:49:14 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.io.IOException: Could not find a serializer for the Key class: 'com.test.SerializeTest.Toto'. Please ensure that the configuration 'io.serializations' is properly configured, if you're usingcustom serialization.
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:1179)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1094)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:273)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:530)
at org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:63)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Thanks.
This error is related neither to Spark nor Kryo.
When using Hadoop output formats, you need to make sure your key and value are instances of Writable. Hadoop doesn't use Java serialization by default (and you don't want to use it either, because it's very inefficient).
You can check the io.serializations property in your configuration, and you'll see the list of serializers in use, including org.apache.hadoop.io.serializer.WritableSerialization.
To fix this issue, your Toto class must implement Writable. The same issue applies to Integer; use IntWritable instead.
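A minimal sketch of what that could look like for Toto is shown below. To keep it runnable without Hadoop on the classpath, the class does not actually declare "implements org.apache.hadoop.io.Writable"; it only provides the two methods that interface requires, write(DataOutput) and readFields(DataInput), plus the no-argument constructor Hadoop needs to instantiate the class. The null flag and the roundTrip helper are choices made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TotoWritableSketch {
    // In a real job this class would declare: implements org.apache.hadoop.io.Writable
    public static class Toto {
        private String a;

        public String getA() {
            return a;
        }

        public void setA(String a) {
            this.a = a;
        }

        // Serializes the field; a leading flag records whether "a" is null.
        public void write(DataOutput out) throws IOException {
            out.writeBoolean(a != null);
            if (a != null) {
                out.writeUTF(a);
            }
        }

        // Restores the field in the same order it was written.
        public void readFields(DataInput in) throws IOException {
            a = in.readBoolean() ? in.readUTF() : null;
        }
    }

    // Hypothetical helper: writes a Toto to bytes and reads it back into a new instance.
    static Toto roundTrip(Toto original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            Toto copy = new Toto();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Toto toto = new Toto();
        toto.setA("hello");
        System.out.println(roundTrip(toto).getA());
    }
}
```

With the interface declared, the pair function would emit something like Tuple2<Toto, IntWritable>, and saveAsHadoopFile would be given IntWritable.class as the value class.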
Is there any way (apart from consuming the messages) to purge/delete messages programmatically from a JMS queue? Even a solution using the WLST command-line tool would be a great help.
Here is an example in WLST for a Managed Server running on port 7005:
connect('weblogic', 'weblogic', 't3://localhost:7005')
serverRuntime()
cd('/JMSRuntime/ManagedSrv1.jms/JMSServers/MyAppJMSServer/Destinations/MyAppJMSModule!QueueNameToClear')
cmo.deleteMessages('')
The last command should return the number of messages it deleted.
You can use JMX to purge the queue, either from Java or from WLST (Python). You can find the MBean definitions for WLS 10.0 on http://download.oracle.com/docs/cd/E11035_01/wls100/wlsmbeanref/core/index.html.
Here is a basic Java example (don't forget to put weblogic.jar in the CLASSPATH):
import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import javax.management.ObjectName;
import javax.naming.Context;
import weblogic.management.mbeanservers.runtime.RuntimeServiceMBean;
public class PurgeWLSQueue {
private static final String WLS_USERNAME = "weblogic";
private static final String WLS_PASSWORD = "weblogic";
private static final String WLS_HOST = "localhost";
private static final int WLS_PORT = 7001;
private static final String JMS_SERVER = "wlsbJMSServer";
private static final String JMS_DESTINATION = "test.q";
private static JMXConnector getMBeanServerConnector(String jndiName) throws Exception {
Hashtable<String,String> h = new Hashtable<String,String>();
JMXServiceURL serviceURL = new JMXServiceURL("t3", WLS_HOST, WLS_PORT, jndiName);
h.put(Context.SECURITY_PRINCIPAL, WLS_USERNAME);
h.put(Context.SECURITY_CREDENTIALS, WLS_PASSWORD);
h.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");
JMXConnector connector = JMXConnectorFactory.connect(serviceURL, h);
return connector;
}
public static void main(String[] args) {
try {
JMXConnector connector =
getMBeanServerConnector("/jndi/"+RuntimeServiceMBean.MBEANSERVER_JNDI_NAME);
MBeanServerConnection mbeanServerConnection =
connector.getMBeanServerConnection();
ObjectName service = new ObjectName("com.bea:Name=RuntimeService,Type=weblogic.management.mbeanservers.runtime.RuntimeServiceMBean");
ObjectName serverRuntime = (ObjectName) mbeanServerConnection.getAttribute(service, "ServerRuntime");
ObjectName jmsRuntime = (ObjectName) mbeanServerConnection.getAttribute(serverRuntime, "JMSRuntime");
ObjectName[] jmsServers = (ObjectName[]) mbeanServerConnection.getAttribute(jmsRuntime, "JMSServers");
for (ObjectName jmsServer: jmsServers) {
if (JMS_SERVER.equals(jmsServer.getKeyProperty("Name"))) {
ObjectName[] destinations = (ObjectName[]) mbeanServerConnection.getAttribute(jmsServer, "Destinations");
for (ObjectName destination: destinations) {
if (destination.getKeyProperty("Name").endsWith("!"+JMS_DESTINATION)) {
Object o = mbeanServerConnection.invoke(
destination,
"deleteMessages",
new Object[] {""}, // selector expression
new String[] {"java.lang.String"});
System.out.println("Result: "+o);
break;
}
}
break;
}
}
connector.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
This works great in a single-node environment, but what happens in a clustered environment with ONE migratable JMSServer (currently on node #1) while this code is executing on node #2? Then no JMSServer is available on that node and no messages are deleted.
This is the problem I'm facing right now...
Is there a way to connect to the JMS queue without having the JMSServer available locally?
[edit]
Found a solution: Use the domain runtime service instead:
ObjectName service = new ObjectName("com.bea:Name=DomainRuntimeService,Type=weblogic.management.mbeanservers.domainruntime.DomainRuntimeServiceMBean");
and be sure to access the admin port on the WLS-cluster.
If this is a one-time task, the easiest way would be to do it through the console.
The program at the link below clears only pending messages from a queue, based on the redelivered message parameter:
http://techytalks.blogspot.in/2016/02/deletepurge-pending-messages-from-jms.html