Why is SourceConnectorConfig reported for sink connector? - apache-kafka-connect

I'm trying to create a Kafka sink connector using the spredfast s3 connector. However, for some reason, the log output is reporting a SourceConnectorConfig:
INFO ConnectorConfig values:
connector.class = com.spredfast.kafka.connect.s3.sink.S3SinkConnector
key.converter = null
name = transactions-s3-sink
tasks.max = 1
transforms = null
value.converter = class org.apache.kafka.connect.storage.StringConverter
(org.apache.kafka.connect.runtime.ConnectorConfig:180)
INFO Creating connector transactions-s3-sink of type com.spredfast.kafka.connect.s3.sink.S3SinkConnector (org.apache.kafka.connect.runtime.Worker:178)
INFO Instantiated connector transactions-s3-sink with version 0.0.1 of type class com.spredfast.kafka.connect.s3.sink.S3SinkConnector (org.apache.kafka.connect.runtime.Worker:181)
INFO Finished creating connector transactions-s3-sink (org.apache.kafka.connect.runtime.Worker:194)
INFO SourceConnectorConfig values:
connector.class = com.spredfast.kafka.connect.s3.sink.S3SinkConnector
key.converter = null
name = transactions-s3-sink
tasks.max = 1
transforms = null
value.converter = class org.apache.kafka.connect.storage.StringConverter
(org.apache.kafka.connect.runtime.SourceConnectorConfig:180)
INFO Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:824)
...
INFO Sink task WorkerSinkTask{id=transactions-s3-sink-0} finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:232)
Why is a SourceConnectorConfig reported, yet further on in the log output I can see that a WorkerSinkTask was created?

The reason is that this connector extends the Connector abstract class instead of the SinkConnector abstract class from Connect's API (see the source code here).
Thus, the Connect framework can't tell whether this connector is a source or a sink, and the current logic in the code is: if it's not a sink, assume it's a source. That's why you see this inconsistency.
The solution is for the connector to extend the appropriate abstract class (here, org.apache.kafka.connect.sink.SinkConnector).
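For reference, a minimal sketch (written here in Scala) of what extending the right base class looks like; the names MySinkConnector and MySinkTask are hypothetical, not the spredfast connector's actual code, and a reasonably recent connect-api on the classpath is assumed:
import java.util
import java.util.Collections
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.config.ConfigDef
import org.apache.kafka.connect.connector.Task
import org.apache.kafka.connect.sink.{SinkConnector, SinkRecord, SinkTask}

// A no-op task, only here so the connector skeleton is complete.
class MySinkTask extends SinkTask {
  override def version(): String = "0.0.1"
  override def start(props: util.Map[String, String]): Unit = ()
  override def put(records: util.Collection[SinkRecord]): Unit = ()
  override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit = ()
  override def stop(): Unit = ()
}

// Extending SinkConnector (rather than the bare Connector class) is what lets
// the framework classify the plugin as a sink and log a SinkConnectorConfig.
class MySinkConnector extends SinkConnector {
  private var props: util.Map[String, String] = _

  override def version(): String = "0.0.1"
  override def start(props: util.Map[String, String]): Unit = this.props = props
  override def taskClass(): Class[_ <: Task] = classOf[MySinkTask]
  override def taskConfigs(maxTasks: Int): util.List[util.Map[String, String]] =
    Collections.nCopies(maxTasks, props)
  override def stop(): Unit = ()
  override def config(): ConfigDef = new ConfigDef()
}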

Related

Accessing HDFS configured as High availability from Client program

I am trying to understand why one program works and the other does not when connecting to HDFS via the nameservice (which resolves to the active NameNode in a High Availability setup) from outside the HDFS cluster.
Non-working program:
When I read both config files (core-site.xml and hdfs-site.xml) and access an HDFS file, it throws an error.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HadoopAccess {

  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration(false)
    val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
    val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
    hadoopConf.addResource(new Path("file:///" + coreSiteXML))
    hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
    println("hadoopConf : " + hadoopConf.get("fs.defaultFS"))
    val fs = FileSystem.get(hadoopConf)
    val check = fs.exists(new Path("/apps/hive"))
    //println("Checked : "+ check)
  }
}
Error: we see an UnknownHostException
hadoopConf :
hdfs://mycluster
Configuration: file:/C:/Users/64507/conf/core-site.xml, file:/C:/Users/64507/conf/hdfs-site.xml
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
at HadoopAccess$.main(HadoopAccess.scala:28)
at HadoopAccess.main(HadoopAccess.scala)
Caused by: java.net.UnknownHostException: mycluster
Working program: when I explicitly set the High Availability settings on the hadoopConf object and pass it to the FileSystem object, the program works.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HadoopAccess {

  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration(false)
    val coreSiteXML = "C:\\Users\\507\\conf\\core-site.xml"
    val HDFSSiteXML = "C:\\Users\\507\\conf\\hdfs-site.xml"
    hadoopConf.addResource(new Path("file:///" + coreSiteXML))
    hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
    hadoopConf.set("fs.defaultFS", hadoopConf.get("fs.defaultFS"))
    //hadoopConf.set("fs.defaultFS", "hdfs://mycluster")
    //hadoopConf.set("fs.default.name", hadoopConf.get("fs.defaultFS"))
    hadoopConf.set("dfs.nameservices", hadoopConf.get("dfs.nameservices"))
    hadoopConf.set("dfs.ha.namenodes.mycluster", "nn1,nn2")
    hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020")
    hadoopConf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020")
    hadoopConf.set("dfs.client.failover.proxy.provider.mycluster",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
    println(hadoopConf)
    /* val namenode = hadoopConf.get("fs.defaultFS")
       println("namenode: " + namenode) */
    val fs = FileSystem.get(hadoopConf)
    val check = fs.exists(new Path("hdfs://mycluster/apps/hive"))
    //println("Checked : "+ check)
  }
}
Is there any reason why we need to set values for configs like dfs.nameservices, dfs.client.failover.proxy.provider.mycluster, and dfs.namenode.rpc-address.mycluster.nn1 in the hadoopConf object, when these values are already present in the hdfs-site.xml and core-site.xml files? These configs are the High Availability NameNode settings.
I am running the above program from an edge node or locally in IntelliJ.
Hadoop version : 2.7.3.2
Hortonworks : 2.6.1
My observation in the Spark Scala REPL:
When I do val hadoopConf = new Configuration(false) and val fs = FileSystem.get(hadoopConf), this gives me a LocalFileSystem. But when I then run
hadoopConf.addResource(new Path("file:///" + coreSiteXML))
hadoopConf.addResource(new Path("file:///" + HDFSSiteXML))
the file system changes to DistributedFileSystem. My assumption is that some client library that is present in Spark is not available during the build or in a common place on the edge node.
some client library that is present in Spark is not available during the build or in a common place on the edge node
This common place would be $SPARK_HOME/conf and/or $HADOOP_CONF_DIR. But if you are just running a regular Scala app with java -jar or from IntelliJ, that has nothing to do with Spark.
... these values are already present in the hdfs-site.xml and core-site.xml files
Then they should be read accordingly; however, overriding them in code shouldn't hurt either.
The values are necessary because they dictate where the actual namenodes are running; otherwise, the client thinks mycluster is a real DNS name of a single server, which it isn't.
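If you would rather not hard-code the HA properties, an alternative sketch (the object name HadoopHaAccess is made up) is to put the directory that contains core-site.xml and hdfs-site.xml on the application classpath, e.g. via HADOOP_CONF_DIR, and let Configuration load them as default resources:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// A minimal sketch, assuming the directory holding core-site.xml and
// hdfs-site.xml (with the dfs.nameservices / dfs.ha.namenodes.* /
// dfs.namenode.rpc-address.* / dfs.client.failover.proxy.provider.*
// properties) is on the application classpath.
object HadoopHaAccess {
  def main(args: Array[String]): Unit = {
    // With loadDefaults left at its default (true), Hadoop picks up the
    // *-site.xml files from the classpath, so no per-property set(...) calls
    // should be needed.
    val hadoopConf = new Configuration()
    println("fs.defaultFS: " + hadoopConf.get("fs.defaultFS"))
    val fs = FileSystem.get(hadoopConf)
    println("exists: " + fs.exists(new Path("/apps/hive")))
  }
}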

Lagom Jdbc Read-Side support: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'slick.profile'

I'm trying to set up a JDBC read-side processor in a Lagom service:
class ProjectEventsProcessor(readSide: JdbcReadSide)(implicit ec: ExecutionContext)
  extends ReadSideProcessor[ProjectEvent] {

  def buildHandler = {
    readSide.builder[ProjectEvent]("projectEventOffset")
      .setEventHandler[ProjectCreated]((conn: Connection, e: EventStreamElement[ProjectCreated]) => insertProject(e.event))
      .build
  }

  private def insertProject(e: ProjectCreated) = {
    Logger.info(s"Got event $e")
  }

  override def aggregateTags: Set[AggregateEventTag[ProjectEvent]] = ProjectEvent.Tag.allTags
}
The service connects to the database fine on startup:
15:40:32.575 [info] play.api.db.DefaultDBApi [] - Database [default] connected at jdbc:postgresql://localhost/postgres?user=postgres
But right after this I'm getting an exception:
com.typesafe.config.ConfigException$Missing: No configuration setting
found for key 'slick.profile'
First of all, why is Slick involved here at all?
I'm using JdbcReadSide, not SlickReadSide.
OK, let's say JdbcReadSide internally uses Slick somehow.
I've added slick.profile to the application.conf of my service:
db.default.driver="org.postgresql.Driver"
db.default.url="jdbc:postgresql://localhost/postgres?user=postgres"
// Tried this way
slick.profile="slick.jdbc.PostgresProfile$"
// Also this way (copied from the Play documentation).
slick.dbs.default.profile="slick.jdbc.PostgresProfile$"
slick.dbs.default.db.dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
slick.dbs.default.db.properties.driver = "org.postgresql.Driver"
But I'm still getting this exception.
What is going on? How do I solve this issue?
According to the docs, Lagom uses akka-persistence-jdbc, which under the hood:
uses Slick to map tables and manage asynchronous execution of JDBC calls.
A full configuration to set in the application.conf file, also using the default connection pool (HikariCP), may be the following (mostly copied from the docs):
# Defaults to use for each Akka persistence plugin
jdbc-defaults.slick {
  # The Slick profile to use
  # set to one of: slick.jdbc.PostgresProfile$, slick.jdbc.MySQLProfile$, slick.jdbc.OracleProfile$ or slick.jdbc.H2Profile$
  profile = "slick.jdbc.PostgresProfile$"
  # The JNDI name for the Slick pre-configured DB
  # By default, this value will be used by all akka-persistence-jdbc plugin components (journal, read-journal and snapshot).
  # you may configure each plugin component to use different DB settings
  jndiDbName=DefaultDB
}

db.default {
  driver = "org.postgresql.Driver"
  url = "jdbc:postgresql://localhost/postgres?user=postgres"

  # The JNDI name for this DataSource
  # Play, and therefore Lagom, will automatically register this DataSource as a JNDI resource using this name.
  # This DataSource will be used to build a pre-configured Slick DB
  jndiName=DefaultDS

  # Lagom will configure a Slick Database, using the async-executor settings below
  # and register it as a JNDI resource using this name.
  # By default, all akka-persistence-jdbc plugin components will use this JNDI name
  # to lookup for this pre-configured Slick DB
  jndiDbName=DefaultDB

  async-executor {
    # number of objects that can be queued by the async executor
    queueSize = 10000
    # 5 * number of cores
    numThreads = 20
    # same as number of threads
    minConnections = 20
    # same as number of threads
    maxConnections = 20
    # if true, a Mbean for AsyncExecutor will be registered
    registerMbeans = false
  }

  # Hikari is the default connection pool and it's fine-tuned to use the same
  # values for minimum and maximum connections as defined for the async-executor above
  hikaricp {
    minimumIdle = ${db.default.async-executor.minConnections}
    maximumPoolSize = ${db.default.async-executor.maxConnections}
  }
}

lagom.persistence.jdbc {
  # Configuration for creating tables
  create-tables {
    # Whether tables should be created automatically as needed
    auto = true
    # How long to wait for tables to be created, before failing
    timeout = 20s
    # The cluster role to create tables from
    run-on-role = ""
    # Exponential backoff for failures configuration for creating tables
    failure-exponential-backoff {
      # minimum (initial) duration until processor is started again
      # after failure
      min = 3s
      # the exponential back-off is capped to this duration
      max = 30s
      # additional random delay is based on this factor
      random-factor = 0.2
    }
  }
}
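As a quick sanity check that the file is actually picked up, you can resolve the keys with Typesafe Config directly; this CheckSlickProfile helper is hypothetical and assumes it runs on the same classpath as the service so application.conf is found:
import com.typesafe.config.ConfigFactory

// If these keys resolve, the jdbc-defaults.slick and db.default blocks above
// were loaded; if not, a ConfigException$Missing is thrown here as well.
object CheckSlickProfile extends App {
  private val config = ConfigFactory.load()
  println(config.getString("jdbc-defaults.slick.profile"))
  println(config.getString("db.default.jndiDbName"))
}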

DirectProcessor and UnicastProcessor can subscribe to upstream Publisher when they shouldn't. Why?

According to the Project Reactor documentation regarding processors:
direct (DirectProcessor and UnicastProcessor): These processors can
only push data through direct user action (calling their Sink's
methods directly).
synchronous (EmitterProcessor and ReplayProcessor): These processors
can push data both through user action and by subscribing to an
upstream Publisher and synchronously draining it.
UnicastProcessor shouldn't be able to subscribe to an upstream Publisher. The documentation offers an example of the direct user Sink invocation:
UnicastProcessor<String> hotSource = UnicastProcessor.create();
Flux<String> hotFlux = hotSource.publish()
                                .autoConnect()
                                .map(String::toUpperCase);
hotFlux.subscribe(d -> System.out.println("Subscriber 1 to Hot Source: " + d));
hotSource.onNext("blue");
However, I have tried subscribing a UnicastProcessor directly to a Publisher and it works. This shouldn't be possible, as stated in the documentation. Is the doc wrong, or am I missing something?
In the following example, I'm subscribing the UnicastProcessor to an upstream Flux without any problem:
val latch = CountDownLatch(20)
val numberGenerator: Flux<Long> = counter(1000)
val processor = UnicastProcessor.create<Long>()
val connectableFlux = numberGenerator.subscribeWith(processor)
connectableFlux.subscribe {
    logger.info("Element [{}]", it)
}
latch.await()
Log:
12:50:12.193 [main] INFO reactor.Flux.Map.1 - onSubscribe(FluxMap.MapSubscriber)
12:50:12.196 [main] INFO reactor.Flux.Map.1 - request(unbounded)
12:50:13.203 [parallel-1] INFO reactor.Flux.Map.1 - onNext(0)
12:50:13.203 [parallel-1] INFO com.codependent.Test - Element [0]
Yes, it seems this aspect of the documentation is outdated. Even DirectProcessor can be used as a Subscriber and propagate signals to its own subscribers.
NB: You used an EmitterProcessor in your snippet, but it still behaves the same with UnicastProcessor.
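For illustration, a small sketch (written here in Scala against the Reactor API; the object name and timings are made up) that subscribes a DirectProcessor to an upstream Flux and still delivers the signals to the processor's own subscriber:
import java.time.Duration
import java.util.function.Consumer
import reactor.core.publisher.{DirectProcessor, Flux}

object DirectProcessorAsSubscriber {
  def main(args: Array[String]): Unit = {
    val processor: DirectProcessor[java.lang.Long] = DirectProcessor.create()

    // Attach a downstream subscriber to the processor first, since
    // DirectProcessor does not buffer signals for late subscribers.
    processor.subscribe(new Consumer[java.lang.Long] {
      def accept(v: java.lang.Long): Unit = println(s"Element [$v]")
    })

    // Then subscribe the processor itself to an upstream Publisher:
    // it acts as a plain Subscriber and relays the signals downstream.
    Flux.interval(Duration.ofMillis(100)).take(5).subscribe(processor)

    Thread.sleep(1000)
  }
}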

Join KStream and GlobalKTable with Avro Object on Join key

I have a question regarding key deserialization in Kafka Streams. Specifically, I use Kafka Connect with the Debezium connector to read
data from a Postgres table. The data was imported into a Kafka topic, and two Avro schemas were created in the Kafka Schema Registry: one for the key
and one for the value (the latter contains all the columns of the table).
I read this data into a GlobalKTable like below:
properties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
properties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
GlobalKTable<my.namespace.db.Key, my.namespace.db.Value> tableData = builder.globalTable("topic_name");
My issue is that I have a topology where I need to join this GlobalKTable with a KStream as the one below:
SpecificAvroSerde<EventObj> eventsSpecificAvroSerde = new SpecificAvroSerde<>();
eventsSpecificAvroSerde.configure(
    Collections.singletonMap(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG,
        conf.getString("kafka.schema.registry.url")),
    false);
KStream<Integer, EventObj> events = builder.stream("another_topic_name",
    Consumed.with(Serdes.Integer(), eventsSpecificAvroSerde));
Note that the Avro schema for my.namespace.db.Key is
{
  "type": "record",
  "name": "Key",
  "namespace": "my.namespace.db",
  "fields": [
    {
      "name": "id",
      "type": "int"
    }
  ]
}
Obviously the key of the GlobalKTable and the key of the KStream are different objects, and I do not know how to achieve the join. I initially tried this, but it did not work.
events.join(tableData,
    (key, val) -> { return my.namespace.db.Key.newBuilder().setId(key).build(); },
    /* To convert the Integer key of the KStream into the Avro object key
       of the GlobalKTable, so as to achieve the join */
    (ev, tData) -> ... );
The output I get is the following, where I can see a WARN for one of my joined topics (which seems suspect), but there is nothing else: no output of the joined entities, just as if there were nothing to consume.
INFO [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] (Re-)joining group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:336)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer] Assigned tasks to clients as {0401c29c-30a9-4969-93f9-5a83b3c834b4=[activeTasks: ([0_0]) standbyTasks: ([]) assignedTasks: ([0_0]) prevActiveTasks: ([]) prevAssignedTasks: ([]) capacity: 1]}. (org.apache.kafka.streams.processor.internals.StreamPartitionAssignor:341)
WARN [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] The following subscribed topics are not assigned to any members: [my-topic] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:241)
INFO [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] Successfully joined group with generation 1 (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:341)
INFO [Consumer clientId=kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1-consumer, groupId=kafka-streams] Setting newly assigned partitions [mip-events-2-0] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:341)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED (org.apache.kafka.streams.processor.internals.StreamThread:346)
INFO KafkaAvroSerializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
(io.confluent.kafka.serializers.KafkaAvroSerializerConfig:175)
INFO KafkaAvroDeserializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
specific.avro.reader = true
(io.confluent.kafka.serializers.KafkaAvroDeserializerConfig:175)
INFO KafkaAvroSerializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
(io.confluent.kafka.serializers.KafkaAvroSerializerConfig:175)
INFO KafkaAvroDeserializerConfig values:
schema.registry.url = [http://kafka-schema-registry:8081]
auto.register.schemas = true
max.schemas.per.subject = 1000
specific.avro.reader = true
(io.confluent.kafka.serializers.KafkaAvroDeserializerConfig:175)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1] partition assignment took 10 ms.
current active tasks: [0_0]
current standby tasks: []
previous active tasks: []
(org.apache.kafka.streams.processor.internals.StreamThread:351)
INFO stream-thread [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4-StreamThread-1] State transition from PARTITIONS_ASSIGNED to RUNNING (org.apache.kafka.streams.processor.internals.StreamThread:346)
INFO stream-client [kafka-streams-0401c29c-30a9-4969-93f9-5a83b3c834b4]State transition from REBALANCING to RUNNING (org.apache.kafka.streams.KafkaStreams:346)
Can I make this join work in Kafka Streams?
Note that this works if I use a KTable to read the topic and use selectKey on the
KStream to convert the key, but I want to avoid the repartition.
Or should the right approach be to import my data from the database in another way, so as to avoid creating Avro objects for the key?
And how is that possible using the Debezium connector and Kafka Connect with the AvroConverter enabled?

error spark-shell, falling back to uploading libraries under SPARK_HOME

I'm trying to start spark-shell on Amazon EMR (Hadoop), but it keeps giving the following error all the time and I do not know how to fix it or what configuration is missing.
spark.yarn.jars, spark.yarn.archive
spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/12 07:47:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/08/12 07:47:28 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Thx!!!
Error1
I'm trying to run a SQL query, something totally simple like:
val sqlDF = spark.sql("SELECT col1 FROM tabl1 limit 10")
sqlDF.show()
WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Error2
Then I try to run a Scala script, something simple taken from:
https://blogs.aws.amazon.com/bigdata/post/Tx2D93GZRHU3TES/Using-Spark-SQL-for-ETL
import org.apache.hadoop.io.Text
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
import java.util.HashMap

var ddbConf = new JobConf(sc.hadoopConfiguration)
ddbConf.set("dynamodb.output.tableName", "tableDynamoDB")
ddbConf.set("dynamodb.throughput.write.percent", "0.5")
ddbConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
ddbConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")

var genreRatingsCount = sqlContext.sql("SELECT col1 FROM table1 LIMIT 1")

var ddbInsertFormattedRDD = genreRatingsCount.map(a => {
  var ddbMap = new HashMap[String, AttributeValue]()

  var col1 = new AttributeValue()
  col1.setS(a.get(0).toString)
  ddbMap.put("col1", col1)

  var item = new DynamoDBItemWritable()
  item.setItem(ddbMap)

  (new Text(""), item)
})

ddbInsertFormattedRDD.saveAsHadoopDataset(ddbConf)
scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving object InterfaceAudience
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1502)
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1500)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
Looks like the Spark UI has not started; try starting spark-shell and also check that the Spark UI at localhost:4040 is running correctly.
