Parsing Kafka messages using the ClickHouse Kafka engine

I am trying to consume Kafka messages using a ClickHouse Kafka engine table. My message structure is:
{"after": {"created_at": "2023-01-15T14:54:34.981331Z", "col1": "xxxx", "col2": "xxx", "col3": "5812", "col4": "xxxx", "col5": "xxx"}}
As you can see, it is nested JSON ({"after": {...}}) and it always follows this structure.
I tried the following Kafka engine schemas; all of them resulted in empty rows:
A- Schema #1
CREATE TABLE db.table_queue ON CLUSTER clusterx(
col1 String,
col2 String,
col3 String,
col4 String,
col5 String,
created_at String
)
ENGINE = Kafka('xx:xx', 'topicxx', 'groupxx',
'JSONEachRow') settings kafka_thread_per_consumer = 0, kafka_num_consumers = 1;
B- Schema #2
CREATE TABLE db.table_queue ON CLUSTER clusterx(
after String
)
ENGINE = Kafka('xx:xx', 'topicxx', 'groupxx',
'JSONEachRow') settings kafka_thread_per_consumer = 0, kafka_num_consumers = 1;
When I tried simpler messages with this structure:
{"col": "test7"}
{"col": "test8"}
{"col": "test9"}
{"col": "test10"}
{"col": "test11"}
{"col": "test12"}
I managed to consume them successfully with the following table definition:
CREATE TABLE db.table_queue ON CLUSTER clusterx(
col String
)
ENGINE = Kafka('xx:xx', 'topicxx', 'groupxx',
'JSONEachRow') settings kafka_thread_per_consumer = 0, kafka_num_consumers = 1;
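With JSONEachRow the parser expects the listed columns at the top level of each message, while here everything sits under the "after" key, which would be consistent with the flat schemas coming back empty. One pattern that is often suggested for nested payloads like this is to consume each message as a single string with the JSONAsString format and unpack the nested object with JSONExtractString in a materialized view. The sketch below is an untested assumption about this setup: the broker/topic/group placeholders are copied from the question, and db.table_parsed and db.table_mv are hypothetical names.
CREATE TABLE db.table_queue ON CLUSTER clusterx(
raw String
)
ENGINE = Kafka('xx:xx', 'topicxx', 'groupxx',
'JSONAsString') settings kafka_thread_per_consumer = 0, kafka_num_consumers = 1;
-- hypothetical target table for the flattened rows
CREATE TABLE db.table_parsed ON CLUSTER clusterx(
col1 String,
col2 String,
col3 String,
col4 String,
col5 String,
created_at String
)
ENGINE = MergeTree ORDER BY created_at;
-- materialized view that pulls each field out of the nested "after" object
CREATE MATERIALIZED VIEW db.table_mv ON CLUSTER clusterx TO db.table_parsed AS
SELECT
JSONExtractString(raw, 'after', 'col1') AS col1,
JSONExtractString(raw, 'after', 'col2') AS col2,
JSONExtractString(raw, 'after', 'col3') AS col3,
JSONExtractString(raw, 'after', 'col4') AS col4,
JSONExtractString(raw, 'after', 'col5') AS col5,
JSONExtractString(raw, 'after', 'created_at') AS created_at
FROM db.table_queue;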

Related

Azure Event Hub no longer receiving messages: Event Hub has requests but no messages

For some reason my Azure Event Hub is no longer receiving messages. It was working fine last night.
I am using Databricks Data Generator to send data to Azure Event Hubs with the following code:
import dbldatagen as dg
from pyspark.sql.types import IntegerType, StringType, FloatType
import json
from pyspark.sql.types import StructType, StructField, IntegerType, DecimalType, StringType, TimestampType, Row
from pyspark.sql.functions import *
import pyspark.sql.functions as F
num_rows = 1 * 10000 # number of rows to generate
num_partitions = 2 # number of Spark dataframe partitions
delay_reasons = ["Air Carrier", "Extreme Weather", "National Aviation System", "Security", "Late Aircraft"]
# will have implied column `id` for ordinal of row
flightdata_defn = (dg.DataGenerator(spark, name="flight_delay_data", rows=num_rows, partitions=num_partitions)
    # .withColumn("body", StringType(), False)
    .withColumn("flightNumber", "int", minValue=1000, uniqueValues=10000, random=True)
    .withColumn("airline", "string", minValue=1, maxValue=500, prefix="airline", random=True, distribution="normal")
    .withColumn("original_departure", "timestamp", begin="2020-01-01 01:00:00", end="2020-12-31 23:59:00", interval="1 minute", random=True)
    .withColumn("delay_minutes", "int", minValue=20, maxValue=600, distribution=dg.distributions.Gamma(1.0, 2.0))
    .withColumn("delayed_departure", "timestamp", expr="cast(original_departure as bigint) + (delay_minutes * 60) ", baseColumn=["original_departure", "delay_minutes"])
    .withColumn("reason", "string", values=delay_reasons, random=True)
)
df_flight_data = flightdata_defn.build(withStreaming=True, options={'rowsPerSecond': 100})
streamingDelays = (
    df_flight_data
    .groupBy(
        # df_flight_data.body,
        df_flight_data.flightNumber,
        df_flight_data.airline,
        df_flight_data.original_departure,
        df_flight_data.delay_minutes,
        df_flight_data.delayed_departure,
        df_flight_data.reason,
        window(df_flight_data.original_departure, "1 hour")
    )
    .count()
)
writeConnectionString = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
checkpointLocation = "///checkpoint"
ehWriteConf = {
'eventhubs.connectionString' : writeConnectionString
}
# Write body data from a DataFrame to EventHubs. Events are distributed across partitions using round-robin model.
ds = streamingDelays \
.select(F.to_json(F.struct("*")).alias("body")) \
.writeStream.format("eventhubs") \
.options(**ehWriteConf) \
.outputMode("complete") \
.option("checkpointLocation", "...") \
.start()
# display(streamingDelays)
From the Event Hub metrics you will notice that I'm barely receiving any requests, and absolutely no messages. However, just yesterday I was getting both requests and messages.
I created a new Event Hub, but I'm seeing the same behavior.
I'm sure it's something very simple that I'm missing....
I should mention that my Databricks notebook appears to get stuck at 'Stream initializing...'.

Number of partitions scanned(=32767) exceeds limit

I'm trying to use Eel-sdk to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
.withPartitionStrategy(new DynamicPartitionStrategy)
val hiveOps:HiveOps = ...
val schema = new StructType(Vector(Field("name", StringType), Field("pk", StringType), Field("pk1", StringType)))
hiveOps.createTable(
testDBName,
testTableName,
schema,
partitionKeys = Seq("pk", "pk1"),
dialect = ParquetHiveDialect(),
tableType = TableType.EXTERNAL_TABLE,
overwrite = true
)
val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
Getting error: Number of partitions scanned(=32767) exceeds limit(=10000).
The number 32767 is 2^15 - 1 (Short.MaxValue), but I still can't figure out what is wrong. Any idea?
Spark + Hive : Number of partitions scanned exceeds limit (=4000)
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"

Elasticsearch to Spark Streaming

I'm analyzing logs and I have this architecture:
Kafka -> Spark Streaming -> Elasticsearch
My main goal is to create machine learning models in streaming. I think I can do two things:
1) Kafka -> Spark Streaming (ML) -> Elasticsearch
2) Kafka -> Spark Streaming -> Elasticsearch -> Spark Streaming (ML)
- I think the second architecture is the best, since Spark Streaming will use indexed data directly. What do you think? Is that correct?
- Can we easily connect Spark Streaming to Elasticsearch in real time?
- If we create a model in Spark Streaming (after Elasticsearch), must we use this model in this place (after Elasticsearch), or can we use it in Spark Streaming (directly after Kafka)? (By "use" I mean: predict in real time.)
- Does creating models after Elasticsearch make our models static (i.e. not a real-time approach)?
Thank you.
Do you mean this?
Kafka -> Spark Streaming -> Elasticsearch DB
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import play.api.libs.json.Json

// sc is the existing SparkContext and ssc the StreamingContext created elsewhere in the program
val sqlContext = new SQLContext(sc)
// kafka group
val group_id = "receiveScanner"
// kafka topic
val topic = Map("testStreaming" -> 1)
// zk connect
val zkParams = Map(
  "zookeeper.connect" -> "localhost",
  "zookeeper.connection.timeout.ms" -> "10000",
  "group.id" -> group_id)
// Kafka receiver-based stream
val kafkaConsumer = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, zkParams, topic, StorageLevel.MEMORY_ONLY_SER)
val receiveData = kafkaConsumer.map(_._2)
// print kafka data
receiveData.print()
receiveData.foreachRDD { rdd =>
  val transform = rdd.map { line =>
    // play-json parse
    val data = Json.parse(line)
    val id = (data \ "id").asOpt[Int] match { case Some(x) => x; case None => 0 }
    val name = (data \ "name").asOpt[String] match { case Some(x) => x; case None => "" }
    val age = (data \ "age").asOpt[Int] match { case Some(x) => x; case None => 0 }
    val address = (data \ "address").asOpt[String] match { case Some(x) => x; case None => "" }
    Row(id, name, age, address)
  }
  val transformReceive = sqlContext.createDataFrame(transform, schemaType)
  import org.apache.spark.sql.functions._
  import org.elasticsearch.spark.sql._
  // filter age < 20 and write to the ES index
  transformReceive.where(col("age") < 20).orderBy(col("age").asc)
    .saveToEs("member/user", Map("es.mapping.id" -> "id"))
}
/**
 * dataframe schema
 */
def schemaType = StructType(
  StructField("id", IntegerType, false) ::
  StructField("name", StringType, false) ::
  StructField("age", IntegerType, false) ::
  StructField("address", StringType, false) ::
  Nil
)

Unable to correctly load twitter avro data into hive table

Need your help!
I am trying a trivial exercise of getting data from Twitter and then loading it into Hive for analysis. I am able to get the data into HDFS using Flume (using the Twitter 1% firehose source) and also able to load the data into a Hive table.
But I am unable to see all the columns I expected to be in the Twitter data, such as user_location, user_description, user_friends_count, and user_statuses_count. The schema derived from Avro only contains two columns, headers and body.
Below are the steps I have done:
1) Create a Flume agent with the below conf:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type =org.apache.flume.source.twitter.TwitterSource
#a1.sources.r1.type = com.cloudera.flume.source.TwitterSource
a1.sources.r1.consumerKey =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.consumerSecret =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.accessToken =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.accessTokenSecret =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.keywords = bigdata, healthcare, oozie
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.192.128:8020/hdp/apps/2.2.0.0-2041/flume/twitter
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.inUsePrefix = _
a1.sinks.k1.hdfs.fileSuffix = .avro
# added for invalid block size error
a1.sinks.k1.serializer = avro_event
#a1.sinks.k1.deserializer.schemaType = LITERAL
# added for exception java.io.IOException:org.apache.avro.AvroTypeException: Found Event, expecting Doc
#a1.sinks.k1.serializer.compressionCodec = snappy
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.rollSize = 67108864
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 30
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2) Derive the schema from the Avro data file. I have no idea why the derived schema only has two columns, headers and body:
java -jar avro-tools-1.7.7.jar getschema FlumeData.1431598230978.avro
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "headers",
"type" : {
"type" : "map",
"values" : "string"
}
}, {
"name" : "body",
"type" : "bytes"
} ]
}
3) Run the above agent to get the data into HDFS, derive the schema of the Avro data, and create a Hive table:
CREATE EXTERNAL TABLE TwitterData
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "headers",
"type" : {
"type" : "map",
"values" : "string"
}
}, {
"name" : "body",
"type" : "bytes"
} ]
}
')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://192.168.192.128:8020/hdp/apps/2.2.0.0-2041/flume/twitter'
;
4) Describe Hive Table:
hive> describe twitterdata;
OK
headers map<string,string> from deserializer
body binary from deserializer
Time taken: 0.472 seconds, Fetched: 2 row(s)
5) Query the table:
When I query the table I see binary data in the 'body' column and the actual schema info in the 'headers' column.
select * from twitterdata limit 1;
OK
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}�1|$���)]'��G�$598792495703543808�Bあいたぁぁぁぁぁぁぁ!�~�ゆっけ0725Yukken(2015-05-14T10:10:30Z<ん?なんか意味違うわ�Twitter for iPhone�1|$���)]'��
Time taken: 2.24 seconds, Fetched: 1 row(s)
How do I create a Hive table with all the columns in the actual schema, as shown in the 'headers' column? I mean with all the columns like user_location, user_description, user_friends_count, and user_statuses_count.
Shouldn't the schema derived from the avro data file contain more columns?
Is there any issue with the flume-avro source I used in the flume agent (org.apache.flume.source.twitter.TwitterSource)?
Thanks for reading through..
Thanks Farrukh, I have done that. The mistake was the configuration 'a1.sinks.k1.serializer = avro_event'; I changed this to 'a1.sinks.k1.serializer = text' and was able to load the data into Hive. But now the issue is retrieving the data from Hive; I am getting the below error while doing so:
hive> describe twitterdata_09062015;
OK
id string from deserializer
user_friends_count int from deserializer
user_location string from deserializer
user_description string from deserializer
user_statuses_count int from deserializer
user_followers_count int from deserializer
user_name string from deserializer
user_screen_name string from deserializer
created_at string from deserializer
text string from deserializer
retweet_count bigint from deserializer
retweeted boolean from deserializer
in_reply_to_user_id bigint from deserializer
source string from deserializer
in_reply_to_status_id bigint from deserializer
media_url_https string from deserializer
expanded_url string from deserializer
select count(1) as num_rows from TwitterData_09062015;
Query ID = root_20150609130404_10ef21db-705a-4e94-92b7-eaa58226ee2e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1433857038961_0003, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1433857038961_0003/
Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job -kill job_1433857038961_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
13:04:36,856 Stage-1 map = 0%, reduce = 0%
13:05:09,576 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1433857038961_0003 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1433857038961_0003_m_000000 (and more) from job job_1433857038961_0003
Task with the most failures(4):
Task ID:
task_1433857038961_0003_m_000000
URL:
http://sandbox.hortonworks.com:8088/taskdetails.jsp?jobid=job_1433857038961_0003&tipid=task_1433857038961_0003_m_000000
Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
Here is the step-by-step process I used to download tweets and load them into Hive.
Flume agent
##TwitterAgent for collecting Twitter data to Hadoop HDFS #####
TwitterAgent.sources = Twitter
TwitterAgent.channels = FileChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = FileChannel
TwitterAgent.sources.Twitter.consumerKey = *************
TwitterAgent.sources.Twitter.consumerSecret = **********
TwitterAgent.sources.Twitter.accessToken = ************
TwitterAgent.sources.Twitter.accessTokenSecret = ***********
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000
TwitterAgent.sources.Twitter.keywords = Apache, Hadoop, Mapreduce, hadooptutorial, Hive, Hbase, MySql
TwitterAgent.sinks.HDFS.channel = FileChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://nn1.itbeams.com:9000/user/flume/tweets/avrotweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
# you do not need to mention the avro format here, just mention Text
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 200000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 2000000
TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/log/flume/checkpoint/
TwitterAgent.channels.FileChannel.dataDirs = /var/log/flume/data/
I created the Avro schema in an avsc file. Once you have created it, put this file in HDFS under your user folder, e.g. /user/youruser/.
{"type":"record",
"name":"Doc",
"doc":"adoc",
"fields":[{"name":"id","type":"string"},
{"name":"user_friends_count","type":["int","null"]},
{"name":"user_location","type":["string","null"]},
{"name":"user_description","type":["string","null"]},
{"name":"user_statuses_count","type":["int","null"]},
{"name":"user_followers_count","type":["int","null"]},
{"name":"user_name","type":["string","null"]},
{"name":"user_screen_name","type":["string","null"]},
{"name":"created_at","type":["string","null"]},
{"name":"text","type":["string","null"]},
{"name":"retweet_count","type":["long","null"]},
{"name":"retweeted","type":["boolean","null"]},
{"name":"in_reply_to_user_id","type":["long","null"]},
{"name":"source","type":["string","null"]},
{"name":"in_reply_to_status_id","type":["long","null"]},
{"name":"media_url_https","type":["string","null"]},
{"name":"expanded_url","type":["string","null"]}
Load the tweets into a Hive table. If you save this code in an hql file, that would be great.
CREATE TABLE tweetsavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/youruser/examples/schema/twitteravroschema.avsc') ;
LOAD DATA INPATH '/user/flume/tweets/avrotweets/FlumeData.*' OVERWRITE INTO TABLE tweetsavro;
The tweetsavro table in Hive:
hive> describe tweetsavro;
OK
id string from deserializer
user_friends_count int from deserializer
user_location string from deserializer
user_description string from deserializer
user_statuses_count int from deserializer
user_followers_count int from deserializer
user_name string from deserializer
user_screen_name string from deserializer
created_at string from deserializer
text string from deserializer
retweet_count bigint from deserializer
retweeted boolean from deserializer
in_reply_to_user_id bigint from deserializer
source string from deserializer
in_reply_to_status_id bigint from deserializer
media_url_https string from deserializer
expanded_url string from deserializer
Time taken: 0.6 seconds, Fetched: 17 row(s)
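As a quick sanity check (a hypothetical example query, not part of the original answer), the individual Twitter fields should now be directly selectable:
select user_screen_name, created_at, text from tweetsavro limit 5;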

read json key-values with hive/sql and spark

I am trying to read this JSON file into a Hive table; the top-level keys, i.e. 1, 2, ..., are not consistent.
{
"1":"{\"time\":1421169633384,\"reading1\":130.875969,\"reading2\":227.138275}",
"2":"{\"time\":1421169646476,\"reading1\":131.240628,\"reading2\":226.810211}",
"position": 0
}
I only need the time and readings 1 and 2 as columns in my Hive table; ignore position.
I can also do a combination of a Hive query and Spark map-reduce code.
Thank you for the help.
Update: here is what I am trying
val hqlContext = new HiveContext(sc)
val rdd = sc.textFile(data_loc)
val json_rdd = hqlContext.jsonRDD(rdd)
json_rdd.registerTempTable("table123")
println(json_rdd.printSchema())
hqlContext.sql("SELECT json_val from table123 lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val ").foreach(println)
It throws the following error :
Exception in thread "main" org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: SELECT json_val from temp_hum_table lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:239)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
This would work, if you rename "1" and "2" (key names) to "x1" and "x2" (inside the json file or in the rdd):
val resultrdd = sqlContext.sql("SELECT x1.time, x1.reading1, x1.reading2, x2.time, x2.reading1, x2.reading2 FROM table123")
resultrdd.flatMap(row => (Array( (row(0),row(1),row(2)), (row(3),row(4),row(5)) )))
This would give you an RDD of tuples with time, reading1 and reading2. If you need a SchemaRDD, you would map it to a case class inside the flatMap transformation, like this:
case class Record(time: Long, reading1: Double, reading2: Double)
val recordRdd = resultrdd.flatMap(row => Array(
  Record(row.getLong(0), row.getDouble(1), row.getDouble(2)),
  Record(row.getLong(3), row.getDouble(4), row.getDouble(5))))
// build the SchemaRDD from the case-class RDD produced by flatMap
val schrdd = sqlContext.createSchemaRDD(recordRdd)
Update:
In the case of many nested keys, you can parse the row like this:
val allrdd = sqlContext.sql("SELECT * from table123")
allrdd.flatMap { row =>
  var recs = Array[Record]()
  for (col <- 0 to row.length - 1) {
    row(col) match {
      case r: Row => recs = recs :+ Record(r.getLong(2), r.getDouble(0), r.getDouble(1))
      case _ =>
    }
  }
  recs
}
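On the Hive side of the "combo of Hive query and Spark" mentioned in the question, once the inner objects have been isolated into a single string column, json_tuple can pull the three fields out. The staging table below is hypothetical and this is only a sketch of that last step, not part of the answer above:
-- hypothetical staging table holding one inner JSON object per row, e.g.
-- {"time":1421169633384,"reading1":130.875969,"reading2":227.138275}
CREATE TABLE readings_raw (js STRING);
SELECT t.reading_time, t.reading1, t.reading2
FROM readings_raw r
LATERAL VIEW json_tuple(r.js, 'time', 'reading1', 'reading2') t AS reading_time, reading1, reading2;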
