Kafka JDBC Source connector adds the default schema

I use the Kafka JDBC Source connector to read from a ClickHouse database (driver: clickhouse-jdbc-0.2.4.jar) in incrementing mode.
Settings:
batch.max.rows = 100
catalog.pattern = null
connection.attempts = 3
connection.backoff.ms = 10000
connection.password = [hidden]
connection.url = jdbc:clickhouse://<ip>:8123/<schema>
connection.user = user
db.timezone =
dialect.name =
incrementing.column.name = id
mode = incrementing
numeric.mapping = null
numeric.precision.mapping = false
poll.interval.ms = 5000
query =
query.suffix =
quote.sql.identifiers = never
schema.pattern = null
table.blacklist = []
table.poll.interval.ms = 60000
table.types = [TABLE]
table.whitelist = [<table_name>]
tables = [default.<schema>.<table_name>]
timestamp.column.name = []
timestamp.delay.interval.ms = 0
timestamp.initial = null
topic.prefix = staging-
validate.non.null = false
Why does the connector additionally prepend the default schema, and how can I avoid it?
Instead of the query
SELECT * FROM <schema>.<table_name> WHERE <schema>.<table_name>.id > ? ORDER BY <schema>.<table_name>.id ASC
I get an error with
SELECT * FROM default.<schema>.<table_name> WHERE default.<schema>.<table_name>.id > ? ORDER BY default.<schema>.<table_name>.id ASC

You can create the ClickHouse data source object as below, where no schema/database name is passed in the JDBC URL:
final ClickHouseDataSource dataSource = new ClickHouseDataSource(
        "jdbc:clickhouse://" + host + "?option1=one%20two&option2=y");
Then, in the SQL query, you can specify the schema name explicitly (schema.table), so the default schema will not be added to your query.
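For example, a client using that data source can qualify the table itself; a minimal sketch against clickhouse-jdbc 0.2.4 (the host, schema, and table names are placeholders, not taken from the original setup):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import ru.yandex.clickhouse.ClickHouseDataSource;

public class ClickHouseExplicitSchema {
    public static void main(String[] args) throws Exception {
        // No database segment in the URL, so nothing is resolved against "default"
        ClickHouseDataSource dataSource =
                new ClickHouseDataSource("jdbc:clickhouse://localhost:8123");

        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             // The table is qualified explicitly with its schema
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM my_schema.my_table WHERE id > 0 ORDER BY id ASC")) {
            while (rs.next()) {
                System.out.println(rs.getLong("id"));
            }
        }
    }
}

If the connector itself has to issue the statement, its query setting (left empty in the configuration above) is the usual place to supply such a fully qualified query instead of relying on table.whitelist.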

Related

Converting an SQL query to a Hibernate 5 CriteriaQuery

My SQL query looks like:
select rtnstatus_id, crs_filedt,
       (select rtnstatus_code from mst_rtnstatus
        where mst_rtnstatus.rtnstatus_id = txn_crs.rtnstatus_id) as rtnstatus_code
from txn_crs
where mclient_id = '44'
  and gstn_cid = '10306'
  and rtnperiod_id =
      (select rtnperiod_id from mst_rtnperiod
       where year_id = 4 and month_id = 10
         and formtype_id =
             (select formtype_id from mst_formtype where formtype_name = 'GSTR1')
         and dlrtype_id = ifnull(
             (select dlrtype_id from txn_clientdlr
              where mclient_id = 44 and month_id = 10 and year_id = 4
              order by clientdlr_id desc limit 1),
             (select dlrtype_id from mst_dlrtype where dlrtype_code = 'R'))
       order by rtnperiod_id desc limit 1)
limit 1;
I want to convert this query to a Hibernate 5 CriteriaQuery. How can I do this? Please help.
CriteriaBuilder builder = _getSession().getCriteriaBuilder();
CriteriaQuery<Object[]> criteriaQuery = builder.createQuery(Object[].class);
Root<TxnCrs> root = criteriaQuery.from(TxnCrs.class);
// For MstRtnstatus to get rtnstatusCode
Subquery<Object> subQueryStatus = criteriaQuery.subquery(Object.class);
Root<MstRtnstatus> subRootStatus = subQueryStatus.from(MstRtnstatus.class);
subQueryStatus.select(subRootStatus.get("rtnstatusCode"));
subQueryStatus.where(builder.equal(subRootStatus.get("rtnstatusId"),root.get("rtnstatusId")));
// For MstFormtype to get formtypeId
Subquery<Object> subQueryFormType = criteriaQuery.subquery(Object.class);
Root<MstFormtype> subRootFormType = subQueryFormType.from(MstFormtype.class);
subQueryFormType.select(subRootFormType.get("formtypeId"));
subQueryFormType.where(builder.equal(subRootFormType.get("formtypeName"),formType));
//For TxnClientdlr to get dlrtypeId
Subquery<Object> subQueryDlr = criteriaQuery.subquery(Object.class);
Root<TxnClientdlr> subRootDlr = subQueryDlr.from(TxnClientdlr.class);
subQueryDlr.select(subRootDlr.get("dlrtypeId"));
subQueryDlr.where(builder.equal(subRootDlr.get("mclientId"),mclientId),builder.equal(subRootDlr.get("monthId"),monthId),
builder.equal(subRootDlr.get("yearId"),yearId));
// For MstDlrtype IF dlrtypeId get Null
Subquery<Object> subQueryDlrType = criteriaQuery.subquery(Object.class);
Root<MstDlrtype> subRootDlrType = subQueryDlrType.from(MstDlrtype.class);
subQueryDlrType.select(subRootDlrType.get("dlrtypeId"));
subQueryDlrType.where(builder.equal(subRootDlrType.get("dlrtypeCode"),"R"));
//For MstRtnperiod to get rtnperiodId
Subquery<Object> subQueryPeriod = criteriaQuery.subquery(Object.class);
Root<MstRtnperiod> subRootPeriod = subQueryPeriod.from(MstRtnperiod.class);
subQueryPeriod.select(subRootPeriod.get("rtnperiodId"));
subQueryPeriod.where(builder.equal(subRootPeriod.get("yearId"),yearId),builder.equal(subRootPeriod.get("monthId"),monthId),
builder.equal(subRootPeriod.get("formtypeId"),subQueryFormType.getSelection()),
How can I write this part correctly? This is the intended IFNULL fallback, and it is where I am stuck:
builder.equal(subRootPeriod.get("dlrtypeId"),subQueryDlr.getSelection()
,subRootPeriod.get("dlrtypeId"),subQueryDlrType.getSelection()));
criteriaQuery.multiselect(root.get("rtnstatusId"),root.get("crsFiledt"),subQueryStatus.getSelection()).
where(builder.equal(root.get("mclientId"),mclientId),builder.equal(root.get("gstnCid"),gstnCid),
builder.equal(root.get("rtnperiodId"),subQueryPeriod.getSelection()));
Query<Object[]> q=_getSession().createQuery(criteriaQuery);
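One way the IFNULL fallback might be expressed with the JPA Criteria API is CriteriaBuilder.coalesce, since a Subquery is itself an Expression and can be passed to it directly. A sketch reusing the subqueries built above (this is an illustration, not the original code, and it leaves out the ORDER BY ... LIMIT 1 parts of the SQL, which have no direct Criteria equivalent):

// Sketch: IFNULL(subQueryDlr, subQueryDlrType) expressed as a coalesce of the two subqueries.
// builder, subRootPeriod, subQueryFormType, subQueryDlr and subQueryDlrType are the objects defined above.
subQueryPeriod.where(
        builder.equal(subRootPeriod.get("yearId"), yearId),
        builder.equal(subRootPeriod.get("monthId"), monthId),
        builder.equal(subRootPeriod.get("formtypeId"), subQueryFormType),
        builder.equal(subRootPeriod.get("dlrtypeId"),
                builder.coalesce(subQueryDlr, subQueryDlrType)));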

janusgraph with OneTimeBulkLoader in hadoop-gremlin raise "Graph does not support adding vertices"

My goal:
Use SparkGraphComputer to bulk-load local data into JanusGraph and then build a mixed index on HBase and ES.
My problem:
Caused by: java.lang.UnsupportedOperationException: Graph does not support adding vertices
at org.apache.tinkerpop.gremlin.structure.Graph$Exceptions.vertexAdditionsNotSupported(Graph.java:1133)
at org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph.addVertex(HadoopGraph.java:187)
at org.apache.tinkerpop.gremlin.process.traversal.step.map.AddVertexStartStep.processNextStart(AddVertexStartStep.java:91)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:128)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:38)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:200)
at org.apache.tinkerpop.gremlin.process.computer.bulkloading.OneTimeBulkLoader.getOrCreateVertex(OneTimeBulkLoader.java:49)
at org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram.executeInternal(BulkLoaderVertexProgram.java:210)
at org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram.execute(BulkLoaderVertexProgram.java:197)
at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$4(SparkExecutor.java:118)
at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
... 3 more
Dependencies:
janusgraph-all-0.3.1
janusgraph-es-0.3.1
hadoop-gremlin-3.3.3
The following is my configuration:
janusgraph-hbase-es.properties
storage.backend=hbase
gremlin.graph=XXX.XXX.XXX.gremlin.hadoop.structure.HadoopGraph
storage.hostname=<ip>
storage.hbase.table=hadoop-test-3
storage.batch-loading=true
schema.default = none
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=elasticsearch
index.search.hostname=<ip>
index.search.index-name=hadoop_test_3
hadoop-graphson.properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.inputLocation=data/tinkerpop-modern.json
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true
giraph.minWorkers=2
giraph.maxWorkers=2
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
giraph.maxMessagesInMemory=100000
spark.master=local[*]
spark.serializer=org.apache.spark.serializer.KryoSerializer
schema.groovy
def defineGratefulDeadSchema(janusGraph) {
    JanusGraphManagement m = janusGraph.openManagement()
    VertexLabel person = m.makeVertexLabel("person").make()
    // When importing with IncrementalBulkLoader, uncomment the line below
    //blid = m.makePropertyKey("bulkLoader.vertex.id").dataType(Long.class).make()
    PropertyKey birth = m.makePropertyKey("birth").dataType(Date.class).make()
    PropertyKey age = m.makePropertyKey("age").dataType(Integer.class).make()
    PropertyKey name = m.makePropertyKey("name").dataType(String.class).make()
    // index
    //JanusGraphIndex index = m.buildIndex("nameCompositeIndex", Vertex.class).addKey(name).unique().buildCompositeIndex()
    JanusGraphIndex index = m.buildIndex("mixedIndex", Vertex.class).addKey(name).buildMixedIndex("search")
    // Uniqueness checks are not supported; "search" refers to the name configured in index.search.backend
    // When importing with IncrementalBulkLoader, uncomment the line below
    //bidIndex = m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).indexOnly(person).buildCompositeIndex()
    m.commit()
}
Relevant code:
JanusGraph janusGraph = JanusGraphFactory.open("config/janusgraph-hbase-es.properties");
JanusgraphSchema janusgraphSchema = new JanusgraphSchema();
janusgraphSchema.defineGratefulDeadSchema(janusGraph);
janusGraph.close();
Graph graph = GraphFactory.open("config/hadoop-graphson.properties");
BulkLoaderVertexProgram blvp = BulkLoaderVertexProgram.build()
        .bulkLoader(OneTimeBulkLoader.class)
        .writeGraph("config/janusgraph-hbase-es.properties")
        .create(graph);
graph.compute(SparkGraphComputer.class).program(blvp).submit().get();
graph.close();
JanusGraph janusGraph1 = JanusGraphFactory.open("config/janusgraph-hbase-es.properties");
List<Map<String, Object>> list = janusGraph1.traversal().V().valueMap().toList();
System.out.println("size: " + list.size());
janusGraph1.close();
Result:
The data is imported into HBase successfully, but building the index in ES fails.
The above error no longer appears after I reset gremlin.graph to its default value, gremlin.graph=org.janusgraph.core.JanusGraphFactory.
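Once the data is in HBase, one option for getting it into Elasticsearch is to reindex the mixed index over the already-loaded data. A sketch using the JanusGraph management API (the index name mixedIndex comes from schema.groovy above; the rest is an assumption, not the original code):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.schema.JanusGraphManagement;
import org.janusgraph.core.schema.SchemaAction;
import org.janusgraph.graphdb.database.management.ManagementSystem;

public class ReindexMixedIndex {
    public static void main(String[] args) throws Exception {
        JanusGraph graph = JanusGraphFactory.open("config/janusgraph-hbase-es.properties");

        // Wait for the index defined in schema.groovy to be registered/enabled
        ManagementSystem.awaitGraphIndexStatus(graph, "mixedIndex").call();

        // Reindex so vertices already bulk-loaded into HBase are written to Elasticsearch
        JanusGraphManagement m = graph.openManagement();
        m.updateIndex(m.getGraphIndex("mixedIndex"), SchemaAction.REINDEX).get();
        m.commit();

        graph.close();
    }
}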

Number of partitions scanned(=32767) exceeds limit

I'm trying to use Eel-sdk to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
.withPartitionStrategy(new DynamicPartitionStrategy)
val hiveOps:HiveOps = ...
val schema = new StructType(Vector(Field("name", StringType), Field("pk", StringType), Field("pk1", StringType)))
hiveOps.createTable(
testDBName,
testTableName,
schema,
partitionKeys = Seq("pk", "pk1"),
dialect = ParquetHiveDialect(),
tableType = TableType.EXTERNAL_TABLE,
overwrite = true
)
val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
I am getting the error: Number of partitions scanned(=32767) exceeds limit(=10000).
The number 32767 is 2^15 - 1 (Short.MaxValue), but I still can't figure out what is wrong. Any ideas?
This looks related to Spark + Hive: Number of partitions scanned exceeds limit (=4000), which suggests:
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"

Flume creating an empty line at the end of output file in HDFS

Currently I am using Flume version 1.5.2.
Flume is creating an empty line at the end of each output file in HDFS, which causes the row counts, file sizes, and checksums of the source and destination files to not match.
I tried overriding the default values of the parameters rollSize, batchSize, and appendNewline, but it still does not work.
Flume also changes the EOL from CRLF (source file) to LF (output file), which also causes the file sizes to differ.
Below are the related Flume agent configuration parameters I am using:
agent1.sources = c1
agent1.sinks = c1s1
agent1.channels = ch1
agent1.sources.c1.type = spooldir
agent1.sources.c1.spoolDir = /home/biadmin/flume-test/sourcedata1
agent1.sources.c1.bufferMaxLineLength = 80000
agent1.sources.c1.channels = ch1
agent1.sources.c1.fileHeader = true
agent1.sources.c1.fileHeaderKey = file
#agent1.sources.c1.basenameHeader = true
#agent1.sources.c1.fileHeaderKey = basenameHeaderKey
#agent1.sources.c1.filePrefix = %{basename}
agent1.sources.c1.inputCharset = UTF-8
agent1.sources.c1.decodeErrorPolicy = IGNORE
agent1.sources.c1.deserializer= LINE
agent1.sources.c1.deserializer.maxLineLength = 50000
agent1.sources.c1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.sources.c1.interceptors = a b
agent1.sources.c1.interceptors.a.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent1.sources.c1.interceptors.b.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent1.sources.c1.interceptors.b.preserveExisting = false
agent1.sources.c1.interceptors.b.hostHeader = host
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 1000
agent1.channels.ch1.batchSize = 1000
agent1.channels.ch1.maxFileSize = 2073741824
agent1.channels.ch1.keep-alive = 5
agent1.sinks.c1s1.type = hdfs
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent1.sinks.c1s1.hdfs.fileType = DataStream
agent1.sinks.c1s1.hdfs.filePrefix = %{file}
agent1.sinks.c1s1.hdfs.fileSuffix =.csv
agent1.sinks.c1s1.hdfs.writeFormat = Text
agent1.sinks.c1s1.hdfs.maxOpenFiles = 10
agent1.sinks.c1s1.hdfs.rollSize = 67000000
agent1.sinks.c1s1.hdfs.rollCount = 0
#agent1.sinks.c1s1.hdfs.rollInterval = 0
agent1.sinks.c1s1.hdfs.batchSize = 1000
agent1.sinks.c1s1.channel = ch1
#agent1.sinks.c1s1.hdfs.codeC = snappyCodec
agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
Setting hdfs.serializer.appendNewline did not fix the issue.
Can anyone please check and suggest a fix?
Replace the line below in your Flume agent configuration:
agent1.sinks.c1s1.serializer.appendNewline = false
with the following line and let me know how it goes:
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
Replace
agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
with
agent1.sinks.c1s1.serializer = text
agent1.sinks.c1s1.serializer.appendNewline = false
The difference is that the serializer settings are not set under the hdfs prefix but directly on the sink name.
The Flume documentation has examples of this; I also ran into issues because I did not spot that the serializer is set at a different level of the property name.
More information about the HDFS sink can be found here:
https://flume.apache.org/FlumeUserGuide.html#hdfs-sink
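Putting that together, the relevant part of the sink configuration would look like this (only the serializer lines change; everything else stays as in the question):
agent1.sinks.c1s1.type = hdfs
agent1.sinks.c1s1.channel = ch1
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent1.sinks.c1s1.hdfs.fileType = DataStream
agent1.sinks.c1s1.hdfs.writeFormat = Text
# Serializer is configured directly on the sink name, not under the hdfs prefix
agent1.sinks.c1s1.serializer = text
agent1.sinks.c1s1.serializer.appendNewline = false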

Pig throwing incompatible type error

I am using the following code to generate a sessionId in Pig using the Sessionize UDF from DataFu.
SET mapred.min.split.size 1073741824
SET mapred.job.queue.name 'marathon'
SET mapred.output.compress true;
--SET avro.output.codec snappy;
--SET pig.maxCombinedSplitSize 536870912;
page_view_pre = LOAD '/data/tracking/PageViewEvent/' USING LiAvroStorage('date.range','start.date=20150226;end.date=20150226;error.on.missing=true'); -----logic is currently for 2015-02-26,will later replace them with date parameters
p_key = LOAD '/projects/dwh/dwh_dim/dim_page_key/#LATEST' USING LiAvroStorage();
page_view_pre = FILTER page_view_pre BY (requestHeader.userAgent != 'CRAWLER' and requestHeader.browserId != 'CRAWLER') and NOT IsTestMemberId(header.memberId);
page_view_pre = FOREACH page_view_pre GENERATE
(int) (header.memberId <0 ? -9 : header.memberId ) as member_sk,
(chararray) requestHeader.browserId as browserId,
--(chararray) requestHeader.sessionId as sessionId,
(chararray) UnixToISO(header.time) as pageViewTime,
header.time as pv_time,
(chararray) requestHeader.path as path,
(chararray) requestHeader.referer as referer,
(chararray) epochToFormat(header.time, 'yyyyMMdd', 'America/Los_Angeles') as tracking_date,
(chararray) requestHeader.pageKey as pageKey,
(chararray) SUBSTRING(requestHeader.trackingCode, 0, 500) as trackingCode,
FLATTEN(botLookup(requestHeader.userAgent, requestHeader.browserId)) as (is_crawler, crawler_type),
(int) totalTime as totalTime,
((int) totalTime < 20 ? 1 :0) as bounce_flag;
page_view_pre = FILTER page_view_pre BY is_crawler == 'N' ;
p_key = FILTER p_key By is_aggregate ==1;
page_view_agg = JOIN page_view_pre by pageKey ,p_key by page_key;
page_view_agg = FOREACH page_view_agg GENERATE
(chararray)page_view_pre::member_sk as member_sk,
(chararray)page_view_pre::browserId as browserId,
--page_view_pre::sessionId as sessionId,
(chararray)page_view_pre::pageViewTime as pageViewTime,
(long)page_view_pre::pv_time as pv_time,
(chararray)page_view_pre::tracking_date as tracking_date,
(chararray)page_view_pre::path as path,
(chararray)page_view_pre::referer as referer,
(chararray)page_view_pre::pageKey as pageKey,
(int)p_key::page_key_sk as page_key_sk,
(chararray)page_view_pre::trackingCode as trackingCode,
(int)page_view_pre::totalTime as totalTime,
(int)page_view_pre::bounce_flag as bounce_flag;
page_view_agg = FILTER page_view_agg By (member_sk is NOT null) OR (browserId IS NOT NULL) ;
pvs_by_member_browser_pair = GROUP page_view_agg BY (member_sk,browserId);
session_groups = FOREACH pvs_by_member_browser_pair {
visits = ORDER page_view_agg BY pv_time;
GENERATE FLATTEN(Sessionize(visits)) AS (
pageViewTime,member_sk, pv_time,tracking_date, pageKey,page_key_sk,browserId,referer ,path, trackingCode,totalTime, sessionId
);
}
The session_groups FOREACH block above is giving me the following error:
ERROR 1031: Incompatable schema: left is "pageViewTime:NULL,member_sk:NULL,pv_time:NULL,tracking_date:NULL,pageKey:NULL,page_key_sk:NULL,browserId:NULL,referer:NULL,path:NULL,trackingCode:NULL,totalTime:NULL,sessionId:NULL", right is "datafu.pig.sessions.sessionize_visits_43::member_sk:chararray,datafu.pig.sessions.sessionize_visits_43::browserId:chararray,datafu.pig.sessions.sessionize_visits_43::pageViewTime:chararray,datafu.pig.sessions.sessionize_visits_43::pv_time:long,datafu.pig.sessions.sessionize_visits_43::tracking_date:chararray,datafu.pig.sessions.sessionize_visits_43::path:chararray,datafu.pig.sessions.sessionize_visits_43::referer:chararray,datafu.pig.sessions.sessionize_visits_43::pageKey:chararray,datafu.pig.sessions.sessionize_visits_43::page_key_sk:int,datafu.pig.sessions.sessionize_visits_43::trackingCode:chararray,datafu.pig.sessions.sessionize_visits_43::totalTime:int,datafu.pig.sessions.sessionize_visits_43::bounce_flag:int,datafu.pig.sessions.sessionize_visits_43::session_id:chararray"
I initially thought this had to do with null member or browser IDs. I filtered those out too, but the error persists. I have been stuck here for hours and would really appreciate some pointers or a solution to this problem.
Thanks
This is a classic case of schema mismatch:
page_view_pre = LOAD '/data/tracking/PageViewEvent/' USING LiAvroStorage('date.range','start.date=20150226;end.date=20150226;error.on.missing=true'); -----logic is currently for 2015-02-26,will later replace them with date parameters
Just ILLUSTRATE page_view_pre after this line to figure out the schema.
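For example (a sketch; DESCRIBE prints the declared schema and ILLUSTRATE runs sample records through the relation):
page_view_pre = LOAD '/data/tracking/PageViewEvent/' USING LiAvroStorage('date.range','start.date=20150226;end.date=20150226;error.on.missing=true');
DESCRIBE page_view_pre;   -- print the schema inferred from the Avro data
ILLUSTRATE page_view_pre; -- show a sample record so field names and types can be compared with the AS clause
Comparing the two sides of the error message is also telling: the AS clause after FLATTEN(Sessionize(visits)) declares 12 untyped fields, while the Sessionize output schema on the right-hand side has 13 typed fields (it also carries bounce_flag), so the AS field list likely needs to match the relation passed into the UDF in both count and order.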

Resources