Flume Agent Unknown Error - hadoop

I'm getting the below error on my Flume agent, where I use AsyncHbaseEventSerializer.
I suspect this is because of my schema. I have two column families.
MyAgent.sinks.MySink.columnFamily=family1,family2
When I specify both column families comma-separated, I get an error saying the table/column family was not found.
MyAgent.sinks.MySink.columnFamily=family1 family2
If I specify both column families separated by a space, it seems only the first column family is considered.
This is the error. Can anyone help me?
Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Could not write events to Hbase. Transaction failed, and rolled back.
at org.apache.flume.sink.hbase.AsyncHBaseSink.process(AsyncHBaseSink.java:317)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
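Since the start-up error shown in Update 2 below ("Table or column family does not exist in Hbase", thrown from initHBaseClient) is a start-up check, one thing worth verifying first is that the target table really contains every family the sink names. This is a diagnostic sketch, not part of the original post; it uses the Tweets table and the campaign/tweet families from the configuration in the update. From the HBase shell:
describe 'Tweets'
# if a family is missing, the table can be (re)created with both families, e.g.:
# create 'Tweets', 'campaign', 'tweet'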
Update:
myAgent.sources = TwitterSource InstagramSource FacebookSource
myAgent.channels = TwitterChannel InstagramChannelMedia InstagramChannelMediaComment FacebookChannelPage FacebookChannelPost FacebookChannelComment
myAgent.sinks = TwitterSink InstagramSinkMedia InstagramSinkMediaComment FacebookSinkPage FacebookSinkPost FacebookSinkComment
myAgent.sources.TwitterSource.type = com.my.socialanalytics.core.twitter.TwitterSource
myAgent.sources.TwitterSource.channels = TwitterChannel
myAgent.sources.TwitterSource.consumerKey = <>
myAgent.sources.TwitterSource.consumerSecret = <>
myAgent.sources.TwitterSource.accessToken = <>
myAgent.sources.TwitterSource.accessTokenSecret = <>
myAgent.sources.TwitterSource.crawlingFrequency = 300000
myAgent.sources.TwitterSource.mongoHost =10.3.0.38
myAgent.sources.TwitterSource.mongoPort =27017
myAgent.sources.TwitterSource.mongoDBName =mySocialAnalytics
myAgent.sources.TwitterSource.mongoCollectionCampaigns =Campaigns
myAgent.sources.TwitterSource.mongoUsername =admin
myAgent.sources.TwitterSource.mongoPassword =qburst
myAgent.sources.TwitterSource.mongoAuthDB =admin
myAgent.sinks.TwitterSink.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.TwitterSink.channel=TwitterChannel
myAgent.sinks.TwitterSink.table=Tweets
myAgent.sinks.TwitterSink.columnFamily=campaign tweet
myAgent.sinks.TwitterSink.serializer=com.my.socialanalytics.core.twitter.TwitterSerializer
myAgent.sinks.TwitterSink.serializer.columns=tweet:contributors tweet:createdAt tweet:inReplyToUserID tweet:text tweet:inReplyToStatusID tweet:source tweet:lang tweet:geo tweet:favorited tweet:withheldCopyright campaign:stateCode tweet:truncated tweet:entities tweet:inReplyToScreenName tweet:withheldInCountries campaign:campaignID tweet:favoriteCount tweet:id tweet:user tweet:retweetedStatus campaign:countryCode tweet:possiblySensitive tweet:currentUserRetweet tweet:retweetCount tweet:withheldScope tweet:retweeted tweet:coordinates tweet:filterLevel tweet:quotedStatusID campaign:sentiment tweet:place
myAgent.sinks.TwitterSink.batchSize = 50
myAgent.channels.TwitterChannel.type = memory
myAgent.channels.TwitterChannel.capacity = 10000000
myAgent.channels.TwitterChannel.transactionCapacity = 100000
myAgent.sources.InstagramSource.type = com.my.socialanalytics.core.instagram.InstagramSource
myAgent.sources.InstagramSource.channels = InstagramChannelMedia InstagramChannelMediaComment
myAgent.sources.InstagramSource.clientID = <>
myAgent.sources.InstagramSource.clientSecret = <>
myAgent.sources.InstagramSource.accessToken = <>
myAgent.sources.InstagramSource.crawlingFrequency = 3600000
myAgent.sources.InstagramSource.mongoHost =10.3.0.38
myAgent.sources.InstagramSource.mongoPort =27017
myAgent.sources.InstagramSource.mongoDBName =mySocialAnalytics
myAgent.sources.InstagramSource.mongoCollectionCampaigns =Campaigns
myAgent.sources.InstagramSource.mongoUsername =admin
myAgent.sources.InstagramSource.mongoPassword =qburst
myAgent.sources.InstagramSource.mongoAuthDB =admin
myAgent.sources.InstagramSource.selector.type = multiplexing
myAgent.sources.InstagramSource.selector.header = feedType
myAgent.sources.InstagramSource.selector.mapping.feedTypeMedia = InstagramChannelMedia
myAgent.sources.InstagramSource.selector.mapping.feedTypeComment = InstagramChannelMediaComment
myAgent.sources.InstagramSource.selector.default = InstagramChannelMedia
myAgent.sinks.InstagramSinkMedia.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.InstagramSinkMedia.channel=InstagramChannelMedia
myAgent.sinks.InstagramSinkMedia.table=InstagramMedia
myAgent.sinks.InstagramSinkMedia.columnFamily=campaign media
myAgent.sinks.InstagramSinkMedia.serializer=com.my.socialanalytics.core.instagram.InstagramMediaSerializer
myAgent.sinks.InstagramSinkMedia.serializer.columns=media:createdTime campaign:campaignID media:link media:videos media:type media:caption campaign:countryCode media:filter media:likes media:attribution media:comments media:user campaign:updatedAt media:tags campaign:stateCode media:images campaign:sentiment media:location media:id
myAgent.sinks.InstagramSinkMedia.batchSize = 50
myAgent.sinks.InstagramSinkMediaComment.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.InstagramSinkMediaComment.channel=InstagramChannelMediaComment
myAgent.sinks.InstagramSinkMediaComment.table=InstagramComments
myAgent.sinks.InstagramSinkMediaComment.columnFamily=campaign comment
myAgent.sinks.InstagramSinkMediaComment.serializer=com.my.socialanalytics.core.instagram.InstagramMediaCommentSerializer
myAgent.sinks.InstagramSinkMediaComment.serializer.columns=comment:mediaId comment:id campaign:campaignID comment:createdTime campaign:updatedAt comment:from comment:text campaign:sentiment
myAgent.sinks.InstagramSinkMediaComment.batchSize = 50
myAgent.channels.InstagramChannelMedia.type = memory
myAgent.channels.InstagramChannelMedia.capacity = 10000000
myAgent.channels.InstagramChannelMedia.transactionCapacity = 100000
myAgent.channels.InstagramChannelMediaComment.type = memory
myAgent.channels.InstagramChannelMediaComment.capacity = 10000000
myAgent.channels.InstagramChannelMediaComment.transactionCapacity = 100000
myAgent.sources.FacebookSource.type = com.my.socialanalytics.core.facebook.FacebookSource
myAgent.sources.FacebookSource.channels = FacebookChannelPage FacebookChannelPost FacebookChannelComment
myAgent.sources.FacebookSource.accessToken = <>
myAgent.sources.FacebookSource.crawlingFrequency = 3600000
myAgent.sources.FacebookSource.mongoHost =10.3.0.38
myAgent.sources.FacebookSource.mongoPort =27017
myAgent.sources.FacebookSource.mongoDBName =mySocialAnalytics
myAgent.sources.FacebookSource.mongoCollectionCampaigns =Campaigns
myAgent.sources.FacebookSource.mongoUsername =admin
myAgent.sources.FacebookSource.mongoPassword =qburst
myAgent.sources.FacebookSource.mongoAuthDB =admin
myAgent.sources.FacebookSource.selector.type = multiplexing
myAgent.sources.FacebookSource.selector.header = feedType
myAgent.sources.FacebookSource.selector.mapping.feedTypePage = FacebookChannelPage
myAgent.sources.FacebookSource.selector.mapping.feedTypePost = FacebookChannelPost
myAgent.sources.FacebookSource.selector.mapping.feedTypeComment = FacebookChannelComment
myAgent.sources.FacebookSource.selector.default = FacebookChannelPage
myAgent.sinks.FacebookSinkPage.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.FacebookSinkPage.channel=FacebookChannelPage
myAgent.sinks.FacebookSinkPage.table=FacebookPages
myAgent.sinks.FacebookSinkPage.columnFamily=campaign page
myAgent.sinks.FacebookSinkPage.serializer=com.my.socialanalytics.core.facebook.FacebookPageSerializer
myAgent.sinks.FacebookSinkPage.serializer.columns=page:talkingAboutCount campaign:campaignID page:name page:wereHereCount page:fanCount page:id page:checkins campaign:day
myAgent.sinks.FacebookSinkPage.batchSize = 50
myAgent.sinks.FacebookSinkPost.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.FacebookSinkPost.channel=FacebookChannelPost
myAgent.sinks.FacebookSinkPost.table=FacebookPosts
myAgent.sinks.FacebookSinkPost.columnFamily=post campaign
myAgent.sinks.FacebookSinkPost.serializer=com.my.socialanalytics.core.facebook.FacebookPostSerializer
myAgent.sinks.FacebookSinkPost.serializer.columns=post:link campaign:campaignID post:place post:sharesCount post:name post:from campaign:updatedAt post:caption post:id post:likesCount post:createdTime post:message campaign:sentiment
myAgent.sinks.FacebookSinkPost.batchSize = 50
myAgent.sinks.FacebookSinkComment.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.FacebookSinkComment.channel=FacebookChannelComment
myAgent.sinks.FacebookSinkComment.table=FacebookComments
myAgent.sinks.FacebookSinkComment.columnFamily=campaign comment
myAgent.sinks.FacebookSinkComment.serializer=com.my.socialanalytics.core.facebook.FacebookCommentSerializer
myAgent.sinks.FacebookSinkComment.serializer.columns=comment:message comment:id campaign:campaignID comment:commentCount comment:parent comment:createdTime campaign:updatedAt comment:from comment:postID campaign:sentiment
myAgent.sinks.FacebookSinkComment.batchSize = 50
myAgent.channels.FacebookChannelPage.type = memory
myAgent.channels.FacebookChannelPage.capacity = 10000000
myAgent.channels.FacebookChannelPage.transactionCapacity = 100000
myAgent.channels.FacebookChannelPost.type = memory
myAgent.channels.FacebookChannelPost.capacity = 10000000
myAgent.channels.FacebookChannelPost.transactionCapacity = 100000
myAgent.channels.FacebookChannelComment.type = memory
myAgent.channels.FacebookChannelComment.capacity = 10000000
myAgent.channels.FacebookChannelComment.transactionCapacity = 100000
Initially I don't get any errors, but after a couple of hours I start getting this error. I'm running some Spark jobs that update a particular field in the same table; I'm not sure if that is the reason. My Flume agent also frequently updates the same records. Could that be the reason for this error?
Update 2:
If I mention both column families, I get the error below.
Oct 14, 4:21:11.040 PM ERROR org.apache.flume.lifecycle.LifecycleSupervisor
Unable to start SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#4c74990e counterGroup:{ name:null counters:{} } } - Exception follows.
org.apache.flume.FlumeException: Could not start sink. Table or column family does not exist in Hbase.
at org.apache.flume.sink.hbase.AsyncHBaseSink.initHBaseClient(AsyncHBaseSink.java:489)
at org.apache.flume.sink.hbase.AsyncHBaseSink.start(AsyncHBaseSink.java:441)
at org.apache.flume.sink.DefaultSinkProcessor.start(DefaultSinkProcessor.java:46)
at org.apache.flume.SinkRunner.start(SinkRunner.java:79)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If I mention only one column family, I get this error:
Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Could not write events to Hbase. Transaction failed, and rolled back.
at org.apache.flume.sink.hbase.AsyncHBaseSink.process(AsyncHBaseSink.java:317)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)

Related

KAFKA JDBC Source connector adds the default schema

I use the Kafka JDBC Source connector to read from a ClickHouse database (driver: clickhouse-jdbc-0.2.4.jar) in incrementing mode.
Settings:
batch.max.rows = 100
catalog.pattern = null
connection.attempts = 3
connection.backoff.ms = 10000
connection.password = [hidden]
connection.url = jdbc:clickhouse://<ip>:8123/<schema>
connection.user = user
db.timezone =
dialect.name =
incrementing.column.name = id
mode = incrementing
numeric.mapping = null
numeric.precision.mapping = false
poll.interval.ms = 5000
query =
query.suffix =
quote.sql.identifiers = never
schema.pattern = null
table.blacklist = []
table.poll.interval.ms = 60000
table.types = [TABLE]
table.whitelist = [<table_name>]
tables = [default.<schema>.<table_name>]
timestamp.column.name = []
timestamp.delay.interval.ms = 0
timestamp.initial = null
topic.prefix = staging-
validate.non.null = false
Why does the connector additionally prepend the default schema, and how can I avoid it?
Instead of the query
SELECT * FROM <schema>.<table_name> WHERE <schema>.<table_name>.id > ? ORDER BY <schema>.<table_name>.id ASC
I get an error with the generated query
SELECT * FROM default.<schema>.<table_name> WHERE default.<schema>.<table_name>.id > ? ORDER BY default.<schema>.<table_name>.id ASC
You can create the ClickHouse data source object as below (where no schema name is passed).
final ClickHouseDataSource dataSource = new ClickHouseDataSource(
"jdbc:clickhouse://"+host+"/"+user+"?option1=one%20two&option2=y");
Then, in the SQL query, you can specify the schema name explicitly (schema.table), so the connector will not add the default schema to your query.
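For illustration, here is a minimal sketch of that approach, assuming the ru.yandex.clickhouse classes from the clickhouse-jdbc 0.2.4 driver mentioned above; the host, schema, table, and column names are placeholders, and ExplicitSchemaQuery is just an illustrative class name:
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import ru.yandex.clickhouse.ClickHouseDataSource;

public class ExplicitSchemaQuery {
    public static void main(String[] args) throws Exception {
        // No schema in the JDBC URL, only host and port.
        ClickHouseDataSource dataSource =
                new ClickHouseDataSource("jdbc:clickhouse://<ip>:8123");
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             // The schema is qualified explicitly in the SQL instead of the URL,
             // so nothing prepends "default." to it.
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM <schema>.<table_name> WHERE id > 0 ORDER BY id ASC")) {
            while (rs.next()) {
                System.out.println(rs.getLong("id"));
            }
        }
    }
}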

janusgraph with OneTimeBulkLoader in hadoop-gremlin raises "Graph does not support adding vertices"

My goal:
Use SparkGraphComputer to bulk-load local data into JanusGraph and then build a mixed index on HBase and ES.
My problem:
Caused by: java.lang.UnsupportedOperationException: Graph does not support adding vertices
at org.apache.tinkerpop.gremlin.structure.Graph$Exceptions.vertexAdditionsNotSupported(Graph.java:1133)
at org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph.addVertex(HadoopGraph.java:187)
at org.apache.tinkerpop.gremlin.process.traversal.step.map.AddVertexStartStep.processNextStart(AddVertexStartStep.java:91)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:128)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:38)
at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:200)
at org.apache.tinkerpop.gremlin.process.computer.bulkloading.OneTimeBulkLoader.getOrCreateVertex(OneTimeBulkLoader.java:49)
at org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram.executeInternal(BulkLoaderVertexProgram.java:210)
at org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram.execute(BulkLoaderVertexProgram.java:197)
at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$4(SparkExecutor.java:118)
at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
... 3 more
Dependencies:
janusgraph-all-0.3.1
janusgraph-es-0.3.1
hadoop-gremlin-3.3.3
The following is the configuration:
janusgraph-hbase-es.properties
storage.backend=hbase
gremlin.graph=XXX.XXX.XXX.gremlin.hadoop.structure.HadoopGraph
storage.hostname=<ip>
storage.hbase.table=hadoop-test-3
storage.batch-loading=true
schema.default = none
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.backend=elasticsearch
index.search.hostname=<ip>
index.search.index-name=hadoop_test_3
hadoop-graphson.properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.inputLocation=data/tinkerpop-modern.json
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true
giraph.minWorkers=2
giraph.maxWorkers=2
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
giraph.maxMessagesInMemory=100000
spark.master=local[*]
spark.serializer=org.apache.spark.serializer.KryoSerializer
schema.groovy
def defineGratefulDeadSchema(janusGraph) {
    JanusGraphManagement m = janusGraph.openManagement()
    VertexLabel person = m.makeVertexLabel("person").make()
    // When importing with IncrementBulkLoader, uncomment the line below:
    // blid = m.makePropertyKey("bulkLoader.vertex.id").dataType(Long.class).make()
    PropertyKey birth = m.makePropertyKey("birth").dataType(Date.class).make()
    PropertyKey age = m.makePropertyKey("age").dataType(Integer.class).make()
    PropertyKey name = m.makePropertyKey("name").dataType(String.class).make()
    // index
    // JanusGraphIndex index = m.buildIndex("nameCompositeIndex", Vertex.class).addKey(name).unique().buildCompositeIndex()
    JanusGraphIndex index = m.buildIndex("mixedIndex", Vertex.class).addKey(name).buildMixedIndex("search")
    // Uniqueness checks are not supported; the "search" argument refers to the "search" in index.search.backend
    // When importing with IncrementBulkLoader, uncomment the line below:
    // bidIndex = m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).indexOnly(person).buildCompositeIndex()
    m.commit()
}
Relevant code:
JanusGraph janusGraph = JanusGraphFactory.open("config/janusgraph-hbase-es.properties");
JanusgraphSchema janusgraphSchema = new JanusgraphSchema();
janusgraphSchema.defineGratefulDeadSchema(janusGraph);
janusGraph.close();

Graph graph = GraphFactory.open("config/hadoop-graphson.properties");
BulkLoaderVertexProgram blvp = BulkLoaderVertexProgram.build()
        .bulkLoader(OneTimeBulkLoader.class)
        .writeGraph("config/janusgraph-hbase-es.properties")
        .create(graph);
graph.compute(SparkGraphComputer.class).program(blvp).submit().get();
graph.close();

JanusGraph janusGraph1 = JanusGraphFactory.open("config/janusgraph-hbase-es.properties");
List<Map<String, Object>> list = janusGraph1.traversal().V().valueMap().toList();
System.out.println("size: " + list.size());
janusGraph1.close();
Result:
The data is successfully imported into HBase, but building the index in ES fails.
The above error no longer appears after I reset gremlin.graph to its default value, gremlin.graph=org.janusgraph.core.JanusGraphFactory.
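For clarity, the head of janusgraph-hbase-es.properties would then look like this (a sketch: only the gremlin.graph line changes, everything else stays as above):
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.hostname=<ip>
storage.hbase.table=hadoop-test-3
storage.batch-loading=true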

Number of partitions scanned(=32767) exceeds limit

I'm trying to use Eel-sdk to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
.withPartitionStrategy(new DynamicPartitionStrategy)
val hiveOps:HiveOps = ...
val schema = new StructType(Vector(Field("name", StringType), Field("pk", StringType), Field("pk1", StringType)))
hiveOps.createTable(
testDBName,
testTableName,
schema,
partitionKeys = Seq("pk", "pk1"),
dialect = ParquetHiveDialect(),
tableType = TableType.EXTERNAL_TABLE,
overwrite = true
)
val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
I'm getting the error: Number of partitions scanned(=32767) exceeds limit(=10000).
The number 32767 is Short.MaxValue (2^15 - 1), but I still can't figure out what is wrong. Any idea?
A related question, Spark + Hive : Number of partitions scanned exceeds limit (=4000), suggests the following Spark settings:
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"

Flume creating an empty line at the end of output file in HDFS

Currently I am using Flume version 1.5.2.
Flume is creating an empty line at the end of each output file in HDFS, which causes row counts, file sizes, and checksums to mismatch between the source and destination files.
I tried overriding the default values of the rollSize, batchSize, and appendNewline parameters, but it still isn't working.
Flume also changes the EOL from CRLF (source file) to LF (output file), which likewise causes the file sizes to differ.
Below are the related Flume agent configuration parameters I'm using:
agent1.sources = c1
agent1.sinks = c1s1
agent1.channels = ch1
agent1.sources.c1.type = spooldir
agent1.sources.c1.spoolDir = /home/biadmin/flume-test/sourcedata1
agent1.sources.c1.bufferMaxLineLength = 80000
agent1.sources.c1.channels = ch1
agent1.sources.c1.fileHeader = true
agent1.sources.c1.fileHeaderKey = file
#agent1.sources.c1.basenameHeader = true
#agent1.sources.c1.fileHeaderKey = basenameHeaderKey
#agent1.sources.c1.filePrefix = %{basename}
agent1.sources.c1.inputCharset = UTF-8
agent1.sources.c1.decodeErrorPolicy = IGNORE
agent1.sources.c1.deserializer= LINE
agent1.sources.c1.deserializer.maxLineLength = 50000
agent1.sources.c1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent1.sources.c1.interceptors = a b
agent1.sources.c1.interceptors.a.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
agent1.sources.c1.interceptors.b.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent1.sources.c1.interceptors.b.preserveExisting = false
agent1.sources.c1.interceptors.b.hostHeader = host
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 1000
agent1.channels.ch1.batchSize = 1000
agent1.channels.ch1.maxFileSize = 2073741824
agent1.channels.ch1.keep-alive = 5
agent1.sinks.c1s1.type = hdfs
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent1.sinks.c1s1.hdfs.fileType = DataStream
agent1.sinks.c1s1.hdfs.filePrefix = %{file}
agent1.sinks.c1s1.hdfs.fileSuffix =.csv
agent1.sinks.c1s1.hdfs.writeFormat = Text
agent1.sinks.c1s1.hdfs.maxOpenFiles = 10
agent1.sinks.c1s1.hdfs.rollSize = 67000000
agent1.sinks.c1s1.hdfs.rollCount = 0
#agent1.sinks.c1s1.hdfs.rollInterval = 0
agent1.sinks.c1s1.hdfs.batchSize = 1000
agent1.sinks.c1s1.channel = ch1
#agent1.sinks.c1s1.hdfs.codeC = snappyCodec
agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
Setting hdfs.serializer.appendNewline did not fix the issue.
Can anyone please check and suggest?
Replace the below line in your Flume agent:
agent1.sinks.c1s1.serializer.appendNewline = false
with the following line and let me know how it goes:
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
Replace
agent1.sinks.c1s1.hdfs.serializer = text
agent1.sinks.c1s1.hdfs.serializer.appendNewline = false
with
agent1.sinks.c1s1.serializer = text
agent1.sinks.c1s1.serializer.appendNewline = false
The difference is that the serializer settings are not set under the hdfs prefix but directly on the sink name.
The Flume documentation should have an example of that; I also ran into issues because I didn't notice that the serializer is set at a different level of the property name.
More information about the HDFS sink can be found here:
https://flume.apache.org/FlumeUserGuide.html#hdfs-sink
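Put together with the original settings from the question, the sink section might then look like this (a sketch assuming the rest of the agent1 configuration is kept unchanged):
agent1.sinks.c1s1.type = hdfs
agent1.sinks.c1s1.channel = ch1
agent1.sinks.c1s1.hdfs.path = hdfs://bivm.ibm.com:9000/user/biadmin/flume/%y-%m-%d/%H%M
agent1.sinks.c1s1.hdfs.fileType = DataStream
agent1.sinks.c1s1.hdfs.writeFormat = Text
# serializer keys sit directly under the sink name, not under the hdfs prefix
agent1.sinks.c1s1.serializer = text
agent1.sinks.c1s1.serializer.appendNewline = false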

Cassandra insert failure using map reduce

I'm trying to insert records into Cassandra using a MapReduce program,
and I'm getting the below error from the reduce job.
13/03/29 07:39:34 INFO mapred.JobClient: Task Id : attempt_201303281807_0009_r_000000_0, Status : FAILED
java.io.IOException: InvalidRequestException(why:TimeUUID should be 16 or 0 bytes (3))
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:309)
Caused by: InvalidRequestException(why:TimeUUID should be 16 or 0 bytes (3))
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20350)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:926)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:912)
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:301
The SlicePredicate definition is:
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(ByteBuffer.wrap(new byte[16]), ByteBuffer.wrap(new byte[16]), false, 150));
ConfigHelper.setInputSlicePredicate(conf, predicate);
I have tried a couple of other APIs to set the SliceRange, without success.
e.g. other apis: https://code.google.com/p/skltpservices/source/browse/Components/log-analyzer/trunk/src/main/java/se/skl/skltpservices/components/analyzer/domain/TimeUUID.java?spec=svn1939&r=1939
The column family definition is:
create column family myColumnFamily
with column_type = 'Standard'
and comparator = 'TimeUUIDType'
and default_validation_class = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and read_repair_chance = 0.1
and dclocal_read_repair_chance = 0.0
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
and caching = 'KEYS_ONLY'
and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
I'd appreciate any help on using a TimeUUIDType comparator in a column family and inserting via MapReduce.
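For context on the error itself: with comparator = 'TimeUUIDType', every column name in the mutation must serialize to exactly 16 bytes (or be empty), and the (3) in the exception means the offending name was only 3 bytes, which suggests a plain short string rather than a serialized UUID was sent. As a rough, hypothetical sketch (not a drop-in fix for the job above), plain JDK code can produce the 16-byte form from a UUID; note that TimeUUIDType additionally expects a version-1, time-based UUID, which would normally come from a time-based UUID generator:
import java.nio.ByteBuffer;
import java.util.UUID;

public final class TimeUuidBytes {

    // Serialize a UUID into the 16-byte representation expected as a column name
    // by a TimeUUIDType comparator.
    public static ByteBuffer toByteBuffer(UUID uuid) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(uuid.getMostSignificantBits());
        buf.putLong(uuid.getLeastSignificantBits());
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        // NOTE: randomUUID() is version 4 and only stands in here so the sketch runs;
        // a real TimeUUID column name needs a version-1 (time-based) UUID.
        ByteBuffer name = toByteBuffer(UUID.randomUUID());
        System.out.println("column name bytes: " + name.remaining()); // prints 16
    }
}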
