Number of partitions scanned(=32767) exceeds limit - hadoop

I'm trying to use Eel-sdk to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
  .withPartitionStrategy(new DynamicPartitionStrategy)
val hiveOps: HiveOps = ...
val schema = new StructType(Vector(
  Field("name", StringType),
  Field("pk", StringType),
  Field("pk1", StringType)))
hiveOps.createTable(
  testDBName,
  testTableName,
  schema,
  partitionKeys = Seq("pk", "pk1"),
  dialect = ParquetHiveDialect(),
  tableType = TableType.EXTERNAL_TABLE,
  overwrite = true
)
val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
I'm getting this error: Number of partitions scanned (=32767) exceeds limit (=10000).
32767 is 2^15 - 1 (the maximum value of a signed 16-bit integer), which looks suspicious, but I still can't figure out what is wrong. Any idea?

Spark + Hive : Number of partitions scanned exceeds limit (=4000)
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"

Related

KAFKA JDBC Source connector adds the default schema

I use the Kafka JDBC Source connector to read from a ClickHouse database (driver: clickhouse-jdbc-0.2.4.jar) in incrementing mode.
Settings:
batch.max.rows = 100
catalog.pattern = null
connection.attempts = 3
connection.backoff.ms = 10000
connection.password = [hidden]
connection.url = jdbc:clickhouse://<ip>:8123/<schema>
connection.user = user
db.timezone =
dialect.name =
incrementing.column.name = id
mode = incrementing
numeric.mapping = null
numeric.precision.mapping = false
poll.interval.ms = 5000
query =
query.suffix =
quote.sql.identifiers = never
schema.pattern = null
table.blacklist = []
table.poll.interval.ms = 60000
table.types = [TABLE]
table.whitelist = [<table_name>]
tables = [default.<schema>.<table_name>]
timestamp.column.name = []
timestamp.delay.interval.ms = 0
timestamp.initial = null
topic.prefix = staging-
validate.non.null = false
Why does the connector additionally prepend the default schema, and how can I avoid it?
Instead of the query
SELECT * FROM <schema>.<table_name> WHERE <schema>.<table_name>.id > ? ORDER BY <schema>.<table_name>.id ASC
I get an error because the connector issues
SELECT * FROM default.<schema>.<table_name> WHERE default.<schema>.<table_name>.id > ? ORDER BY default.<schema>.<table_name>.id ASC
You can create the ClickHouse data source object as shown below (where no schema name is passed in the URL):
final ClickHouseDataSource dataSource = new ClickHouseDataSource(
"jdbc:clickhouse://"+host+"/"+user+"?option1=one%20two&option2=y");
Then, in the SQL query, specify the schema name explicitly (schema.table), so the default schema will not be added to your query.
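For illustration, a rough Scala sketch of the same idea (assuming the ru.yandex.clickhouse classes shipped with clickhouse-jdbc 0.2.4; host, user and the <schema>.<table_name> placeholders are the ones from the question):

import ru.yandex.clickhouse.ClickHouseDataSource

// Build the data source without a schema in the URL, then qualify the table
// explicitly in the SQL so no default schema is prepended.
val dataSource = new ClickHouseDataSource(s"jdbc:clickhouse://$host:8123/?user=$user")
val connection = dataSource.getConnection
val statement = connection.prepareStatement(
  "SELECT * FROM <schema>.<table_name> WHERE <schema>.<table_name>.id > ? ORDER BY <schema>.<table_name>.id ASC")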

Elasticsearch to Spark Streaming

I'm analyzing logs and I have this architecture:
Kafka -> Spark Streaming -> Elasticsearch
My main goal is to create machine learning models in streaming. I think I can do one of two things:
1) Kafka -> Spark Streaming (ML) -> Elasticsearch
2) Kafka -> Spark Streaming -> Elasticsearch -> Spark Streaming (ML)
- I think the second architecture is the best, since Spark Streaming will use the indexed data directly. What do you think? Is that correct?
- Can we easily connect Spark Streaming to Elasticsearch in real time?
- If we create a model in Spark Streaming (after Elasticsearch), must we use this model in that same place (after Elasticsearch), or can we use it in Spark Streaming directly after Kafka? (use == predict in real time)
- Does creating models after Elasticsearch make our models static (i.e. not a real-time approach)?
Thank you.
Do you mean something like this?
Kafka -> Spark Streaming -> Elasticsearch
import kafka.serializer.StringDecoder
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
import play.api.libs.json.Json

val sqlContext = new SQLContext(sc)
// Kafka consumer group
val group_id = "receiveScanner"
// Kafka topic
val topic = Map("testStreaming" -> 1)
// ZooKeeper connection parameters
val zkParams = Map(
  "zookeeper.connect" -> "localhost",
  "zookeeper.connection.timeout.ms" -> "10000",
  "group.id" -> group_id)
// Kafka receiver stream
val kafkaConsumer = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, zkParams, topic, StorageLevel.MEMORY_ONLY_SER)
val receiveData = kafkaConsumer.map(_._2)
// print incoming Kafka data
receiveData.print()
receiveData.foreachRDD { rdd =>
  // parse each line with Play JSON, falling back to defaults for missing fields
  val transform = rdd.map { line =>
    val data = Json.parse(line)
    val id = (data \ "id").asOpt[Int].getOrElse(0)
    val name = (data \ "name").asOpt[String].getOrElse("")
    val age = (data \ "age").asOpt[Int].getOrElse(0)
    val address = (data \ "address").asOpt[String].getOrElse("")
    Row(id, name, age, address)
  }
  val transfromrecive = sqlContext.createDataFrame(transform, schameType)
  import org.apache.spark.sql.functions._
  import org.elasticsearch.spark.sql._
  // keep only age < 20 and write the result to the ES index
  transfromrecive.where(col("age") < 20).orderBy(col("age").asc)
    .saveToEs("member/user", Map("es.mapping.id" -> "id"))
}

/**
 * DataFrame schema
 */
def schameType = StructType(
  StructField("id", IntegerType, false) ::
  StructField("name", StringType, false) ::
  StructField("age", IntegerType, false) ::
  StructField("address", StringType, false) ::
  Nil
)

Flume creating an empty line at the end of output file in HDFS

Currently I am using Flume version 1.5.2.
Flume is creating an empty line at the end of each output file in HDFS, which causes the row counts, file sizes and checksums of source and destination files not to match.
I tried overriding the default values of the rollSize, batchSize and appendNewline parameters, but it still isn't working.
Flume is also changing the EOL from CRLF (source file) to LF (output file), which also causes the file sizes to differ.
Below are the relevant Flume agent configuration parameters I'm using:
agent1.sources = c1
agent1.sinks = c1s1
agent1.channels = ch1
agent1.sources.c1.type = spooldir
agent1.sources.c1.spoolDir = /home/biadmin/flume-test/sourcedata1
agent1.sources.c1.bufferMaxLineLength = 80000
agent1.sources.c1.channels = ch1
agent1.sources.c1.fileHeader = true
agent1.sources.c1.fileHeaderKey = file
#agent1.sources.c1.basenameHeader = true
#agent1.sources.c1.fileHeaderKey = basenameHeaderKey
#agent1.sources.c1.filePrefix = %{basename}
agent1.sources.c1.inputCharset = UTF-8
agent1.sources.c1.decodeErrorPolicy = IGNORE
agent1.sources.c1.deserializer= LINE
agent1.sources.c1.deserializer.maxLineLength = 50000
agent1.sources.c1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.sources.c1.interceptors = a b
agent1.sources.c1.interceptors.a.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent1.sources.c1.interceptors.b.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent1.sources.c1.interceptors.b.preserveExisting = false
agent1.sources.c1.interceptors.b.hostHeader = host
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 1000
agent1.channels.ch1.batchSize = 1000
agent1.channels.ch1.maxFileSize = 2073741824
agent1.channels.ch1.keep-alive = 5
agent1.sinks.c1s1.type = hdfs
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent1.sinks.c1s1.hdfs.fileType = DataStream
agent1.sinks.c1s1.hdfs.filePrefix = %{file}
agent1.sinks.c1s1.hdfs.fileSuffix =.csv
agent1.sinks.c1s1.hdfs.writeFormat = Text
agent1.sinks.c1s1.hdfs.maxOpenFiles = 10
agent1.sinks.c1s1.hdfs.rollSize = 67000000
agent1.sinks.c1s1.hdfs.rollCount = 0
#agent1.sinks.c1s1.hdfs.rollInterval = 0
agent1.sinks.c1s1.hdfs.batchSize = 1000
agent1.sinks.c1s1.channel = ch1
#agent1.sinks.c1s1.hdfs.codeC = snappyCodec
agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
Setting hdfs.serializer.appendNewline = false did not fix the issue.
Can anyone please check and suggest?
Replace the below line in your Flume agent:
agent1.sinks.c1s1.serializer.appendNewline = false
with the following line and let me know how it goes:
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
Replace
agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
with
agent1.sinks.c1s1.serializer = text
agent1.sinks.c1s1.serializer.appendNewline = false
The difference is that the serializer settings are set not under the hdfs prefix but directly on the sink name.
The Flume documentation could use an example of this; I also ran into issues because I didn't spot that serializer sits at a different level of the property name.
More information about the HDFS sink can be found here:
https://flume.apache.org/FlumeUserGuide.html#hdfs-sink

Read JSON key-values with Hive/SQL and Spark

I am trying to read this JSON file into a Hive table; the top-level keys (i.e. "1", "2", ...) are not consistent.
{
"1":"{\"time\":1421169633384,\"reading1\":130.875969,\"reading2\":227.138275}",
"2":"{\"time\":1421169646476,\"reading1\":131.240628,\"reading2\":226.810211}",
"position": 0
}
I only need the time and readings 1 and 2 as columns in my Hive table; position can be ignored.
I can also do a combination of a Hive query and Spark map-reduce code.
Thank you for the help.
Update: here is what I am trying:
val hqlContext = new HiveContext(sc)
val rdd = sc.textFile(data_loc)
val json_rdd = hqlContext.jsonRDD(rdd)
json_rdd.registerTempTable("table123")
println(json_rdd.printSchema())
hqlContext.sql("SELECT json_val from table123 lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val ").foreach(println)
It throws the following error :
Exception in thread "main" org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: SELECT json_val from temp_hum_table lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:239)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
This would work if you rename the key names "1" and "2" to "x1" and "x2" (inside the JSON file or in the RDD):
val resultrdd = sqlContext.sql("SELECT x1.time, x1.reading1, x1.reading2, x2.time, x2.reading1, x2.reading2 from table123")
resultrdd.flatMap(row => (Array( (row(0),row(1),row(2)), (row(3),row(4),row(5)) )))
This would give you an RDD of tuples with time, reading1 and reading2. If you need a SchemaRDD, you would map it to a case class inside the flatMap transformation, like this:
case class Record(time: Long, reading1: Double, reading2: Double)
val recordsRdd = resultrdd.flatMap(row => Array(Record(row.getLong(0), row.getDouble(1), row.getDouble(2)),
  Record(row.getLong(3), row.getDouble(4), row.getDouble(5))))
val schrdd = sqlContext.createSchemaRDD(recordsRdd)
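As a small usage sketch (the table name here is just illustrative), the resulting SchemaRDD can be registered and queried like the temp table in the question:

// Register the flattened readings and query them with SQL.
schrdd.registerTempTable("readings")
sqlContext.sql("SELECT time, reading1, reading2 FROM readings WHERE reading2 > 227.0").collect().foreach(println)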
Update:
In the case of many nested keys, you can parse the row like this:
val allrdd = sqlContext.sql("SELECT * from table123")
allrdd.flatMap { row =>
  var recs = Array[Record]()
  for (col <- 0 to row.length - 1) {
    row(col) match {
      case r: Row => recs = recs :+ Record(r.getLong(2), r.getDouble(0), r.getDouble(1))
      case _ =>
    }
  }
  recs
}

Cassandra insert failure using map reduce

I am trying to insert records into Cassandra using a MapReduce program and am getting the error below from the reduce job.
13/03/29 07:39:34 INFO mapred.JobClient: Task Id : attempt_201303281807_0009_r_000000_0, Status : FAILED
java.io.IOException: InvalidRequestException(why:TimeUUID should be 16 or 0 bytes (3))
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:309)
Caused by: InvalidRequestException(why:TimeUUID should be 16 or 0 bytes (3))
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20350)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:926)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:912)
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:301
The SlicePredicate definition is:
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(ByteBuffer.wrap(new byte[16]), ByteBuffer.wrap(new byte[16]), false, 150));
ConfigHelper.setInputSlicePredicate(conf, predicate);
I have tried a couple of other APIs to set the SliceRange, without success.
E.g.: https://code.google.com/p/skltpservices/source/browse/Components/log-analyzer/trunk/src/main/java/se/skl/skltpservices/components/analyzer/domain/TimeUUID.java?spec=svn1939&r=1939
The column family definition is:
create column family myColumnFamily
with column_type = 'Standard'
and comparator = 'TimeUUIDType'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
I would appreciate any help with using a TimeUUIDType comparator in a column family and inserting via MapReduce.
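For reference, with comparator = 'TimeUUIDType' every column name sent through batch_mutate must serialize to exactly 16 bytes (or be empty); the (3) in the error indicates a 3-byte value is being used as a column name instead. A minimal sketch of that serialization, assuming the UUID itself comes from a time-based (version 1) generator elsewhere, since the comparator also validates the UUID version:

import java.nio.ByteBuffer
import java.util.UUID

// Pack a time-based UUID into the 16-byte buffer expected for a TimeUUIDType
// column name. The UUID passed in must be version 1 (time-based).
def timeUuidToByteBuffer(uuid: UUID): ByteBuffer = {
  val buf = ByteBuffer.allocate(16)
  buf.putLong(uuid.getMostSignificantBits)
  buf.putLong(uuid.getLeastSignificantBits)
  buf.flip()
  buf
}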
