I would like to fit a model by group in h2o using some type of distributed apply function.
I tried the following, but it doesn't work, probably because I cannot pipe the sc object through.
df %>%
  spark_apply(
    function(e)
      h2o.coxph(x = predictors,
                event_column = "event",
                stop_column = "time_to_next",
                training_frame = as_h2o_frame(sc, e, strict_version_check = FALSE)),
    group_by = "id"
  )
I receive a pretty generic spark error like this:
error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 :
I'm not sure you can return an entire H2OCoxPHModel from sparklyr::spark_apply(): with the fetch_result_as_sdf argument set to FALSE the error is no method for coercing this S4 class to a vector, and with it set to TRUE the error is cannot coerce class ‘structure("H2OCoxPHModel", package = "h2o")’ to a data.frame.
But if you build your own vector or data frame from the relevant parts of the model, I think you can do it.
Here I'll use the sample Cox Proportional Hazards file from the H2O docs (Cox Proportional Hazards (CoxPH)), and I'll use group_by = "surgery".
heart_hf <- h2o::h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/coxph_test/heart.csv")
##### Convert to Spark DataFrame since I assume that is the use case
heart_sf <- sparklyr::copy_to(sc, heart_hf %>% as.data.frame())
##### Use sparklyr::spark_apply() on Spark DataFrame to "distribute and fit h2o model by group"
sparklyr::spark_apply(
  x = heart_sf,
  f = function(x) {
    h2o::h2o.init()
    heart_coxph <- h2o::h2o.coxph(x = c("age", "year"),
                                  event_column = "event",
                                  start_column = "start",
                                  stop_column = "stop",
                                  ties = "breslow",
                                  training_frame = h2o::as.h2o(x, strict_version_check = FALSE))
    return(data.frame(conc = heart_coxph@model$model_summary$concordance))
  },
  columns = list(surgery = "integer", conc = "numeric"),
  group_by = c("surgery"))
# Source: spark<?> [?? x 2]
surgery conc
<int> <dbl>
1 1 0.588
2 0 0.614
I'm trying to use Eel-sdk to stream data into Hive.
val sink = HiveSink(testDBName, testTableName)
.withPartitionStrategy(new DynamicPartitionStrategy)
val hiveOps:HiveOps = ...
val schema = new StructType(Vector(Field("name", StringType), Field("pk", StringType), Field("pk1", StringType)))
hiveOps.createTable(
testDBName,
testTableName,
schema,
partitionKeys = Seq("pk", "pk1"),
dialect = ParquetHiveDialect(),
tableType = TableType.EXTERNAL_TABLE,
overwrite = true
)
val items = Seq.tabulate(100)(i => TestData(i.toString, "42", "apple"))
val ds = DataStream(items)
ds.to(sink)
Getting error: Number of partitions scanned(=32767) exceeds limit(=10000).
The number 32767 is 2^15 - 1 (Short.MaxValue), but I still can't figure out what is wrong. Any idea?
This looks like the related issue Spark + Hive: Number of partitions scanned exceeds limit (=4000), which was worked around by passing:
--conf "spark.sql.hive.convertMetastoreOrc=false"
--conf "spark.sql.hive.metastorePartitionPruning=false"
I'm getting the below error on my Flume agent, where I use AsyncHbaseEventSerializer.
I suspect this is because of my schema. I have two column families.
MyAgent.sinks.MySink.columnFamily=family1,family2
When I specify both column families comma-separated, I get an error saying the table/column family was not found.
MyAgent.sinks.MySink.columnFamily=family1 family2
If I specify both column families separated by a space, then I think only the first column family is considered.
This is the error. Can anyone help me?
Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Could not write events to Hbase. Transaction failed, and rolled back.
at org.apache.flume.sink.hbase.AsyncHBaseSink.process(AsyncHBaseSink.java:317)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Update:
myAgent.sources = TwitterSource InstagramSource FacebookSource
myAgent.channels = TwitterChannel InstagramChannelMedia InstagramChannelMediaComment FacebookChannelPage FacebookChannelPost FacebookChannelComment
myAgent.sinks = TwitterSink InstagramSinkMedia InstagramSinkMediaComment FacebookSinkPage FacebookSinkPost FacebookSinkComment
myAgent.sources.TwitterSource.type = com.my.socialanalytics.core.twitter.TwitterSource
myAgent.sources.TwitterSource.channels = TwitterChannel
myAgent.sources.TwitterSource.consumerKey = <>
myAgent.sources.TwitterSource.consumerSecret = <>
myAgent.sources.TwitterSource.accessToken = <>
myAgent.sources.TwitterSource.accessTokenSecret = <>
myAgent.sources.TwitterSource.crawlingFrequency = 300000
myAgent.sources.TwitterSource.mongoHost =10.3.0.38
myAgent.sources.TwitterSource.mongoPort =27017
myAgent.sources.TwitterSource.mongoDBName =mySocialAnalytics
myAgent.sources.TwitterSource.mongoCollectionCampaigns =Campaigns
myAgent.sources.TwitterSource.mongoUsername =admin
myAgent.sources.TwitterSource.mongoPassword =qburst
myAgent.sources.TwitterSource.mongoAuthDB =admin
myAgent.sinks.TwitterSink.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.TwitterSink.channel=TwitterChannel
myAgent.sinks.TwitterSink.table=Tweets
myAgent.sinks.TwitterSink.columnFamily=campaign tweet
myAgent.sinks.TwitterSink.serializer=com.my.socialanalytics.core.twitter.TwitterSerializer
myAgent.sinks.TwitterSink.serializer.columns=tweet:contributors tweet:createdAt tweet:inReplyToUserID tweet:text tweet:inReplyToStatusID tweet:source tweet:lang tweet:geo tweet:favorited tweet:withheldCopyright campaign:stateCode tweet:truncated tweet:entities tweet:inReplyToScreenName tweet:withheldInCountries campaign:campaignID tweet:favoriteCount tweet:id tweet:user tweet:retweetedStatus campaign:countryCode tweet:possiblySensitive tweet:currentUserRetweet tweet:retweetCount tweet:withheldScope tweet:retweeted tweet:coordinates tweet:filterLevel tweet:quotedStatusID campaign:sentiment tweet:place
myAgent.sinks.TwitterSink.batchSize = 50
myAgent.channels.TwitterChannel.type = memory
myAgent.channels.TwitterChannel.capacity = 10000000
myAgent.channels.TwitterChannel.transactionCapacity = 100000
myAgent.sources.InstagramSource.type = com.my.socialanalytics.core.instagram.InstagramSource
myAgent.sources.InstagramSource.channels = InstagramChannelMedia InstagramChannelMediaComment
myAgent.sources.InstagramSource.clientID = <>
myAgent.sources.InstagramSource.clientSecret = <>
myAgent.sources.InstagramSource.accessToken = <>
myAgent.sources.InstagramSource.crawlingFrequency = 3600000
myAgent.sources.InstagramSource.mongoHost =10.3.0.38
myAgent.sources.InstagramSource.mongoPort =27017
myAgent.sources.InstagramSource.mongoDBName =mySocialAnalytics
myAgent.sources.InstagramSource.mongoCollectionCampaigns =Campaigns
myAgent.sources.InstagramSource.mongoUsername =admin
myAgent.sources.InstagramSource.mongoPassword =qburst
myAgent.sources.InstagramSource.mongoAuthDB =admin
myAgent.sources.InstagramSource.selector.type = multiplexing
myAgent.sources.InstagramSource.selector.header = feedType
myAgent.sources.InstagramSource.selector.mapping.feedTypeMedia = InstagramChannelMedia
myAgent.sources.InstagramSource.selector.mapping.feedTypeComment = InstagramChannelMediaComment
myAgent.sources.InstagramSource.selector.default = InstagramChannelMedia
myAgent.sinks.InstagramSinkMedia.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.InstagramSinkMedia.channel=InstagramChannelMedia
myAgent.sinks.InstagramSinkMedia.table=InstagramMedia
myAgent.sinks.InstagramSinkMedia.columnFamily=campaign media
myAgent.sinks.InstagramSinkMedia.serializer=com.my.socialanalytics.core.instagram.InstagramMediaSerializer
myAgent.sinks.InstagramSinkMedia.serializer.columns=media:createdTime campaign:campaignID media:link media:videos media:type media:caption campaign:countryCode media:filter media:likes media:attribution media:comments media:user campaign:updatedAt media:tags campaign:stateCode media:images campaign:sentiment media:location media:id
myAgent.sinks.InstagramSinkMedia.batchSize = 50
myAgent.sinks.InstagramSinkMediaComment.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.InstagramSinkMediaComment.channel=InstagramChannelMediaComment
myAgent.sinks.InstagramSinkMediaComment.table=InstagramComments
myAgent.sinks.InstagramSinkMediaComment.columnFamily=campaign comment
myAgent.sinks.InstagramSinkMediaComment.serializer=com.my.socialanalytics.core.instagram.InstagramMediaCommentSerializer
myAgent.sinks.InstagramSinkMediaComment.serializer.columns=comment:mediaId comment:id campaign:campaignID comment:createdTime campaign:updatedAt comment:from comment:text campaign:sentiment
myAgent.sinks.InstagramSinkMediaComment.batchSize = 50
myAgent.channels.InstagramChannelMedia.type = memory
myAgent.channels.InstagramChannelMedia.capacity = 10000000
myAgent.channels.InstagramChannelMedia.transactionCapacity = 100000
myAgent.channels.InstagramChannelMediaComment.type = memory
myAgent.channels.InstagramChannelMediaComment.capacity = 10000000
myAgent.channels.InstagramChannelMediaComment.transactionCapacity = 100000
myAgent.sources.FacebookSource.type = com.my.socialanalytics.core.facebook.FacebookSource
myAgent.sources.FacebookSource.channels = FacebookChannelPage FacebookChannelPost FacebookChannelComment
myAgent.sources.FacebookSource.accessToken = <>
myAgent.sources.FacebookSource.crawlingFrequency = 3600000
myAgent.sources.FacebookSource.mongoHost =10.3.0.38
myAgent.sources.FacebookSource.mongoPort =27017
myAgent.sources.FacebookSource.mongoDBName =mySocialAnalytics
myAgent.sources.FacebookSource.mongoCollectionCampaigns =Campaigns
myAgent.sources.FacebookSource.mongoUsername =admin
myAgent.sources.FacebookSource.mongoPassword =qburst
myAgent.sources.FacebookSource.mongoAuthDB =admin
myAgent.sources.FacebookSource.selector.type = multiplexing
myAgent.sources.FacebookSource.selector.header = feedType
myAgent.sources.FacebookSource.selector.mapping.feedTypePage = FacebookChannelPage
myAgent.sources.FacebookSource.selector.mapping.feedTypePost = FacebookChannelPost
myAgent.sources.FacebookSource.selector.mapping.feedTypeComment = FacebookChannelComment
myAgent.sources.FacebookSource.selector.default = FacebookChannelPage
myAgent.sinks.FacebookSinkPage.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.FacebookSinkPage.channel=FacebookChannelPage
myAgent.sinks.FacebookSinkPage.table=FacebookPages
myAgent.sinks.FacebookSinkPage.columnFamily=campaign page
myAgent.sinks.FacebookSinkPage.serializer=com.my.socialanalytics.core.facebook.FacebookPageSerializer
myAgent.sinks.FacebookSinkPage.serializer.columns=page:talkingAboutCount campaign:campaignID page:name page:wereHereCount page:fanCount page:id page:checkins campaign:day
myAgent.sinks.FacebookSinkPage.batchSize = 50
myAgent.sinks.FacebookSinkPost.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.FacebookSinkPost.channel=FacebookChannelPost
myAgent.sinks.FacebookSinkPost.table=FacebookPosts
myAgent.sinks.FacebookSinkPost.columnFamily=post campaign
myAgent.sinks.FacebookSinkPost.serializer=com.my.socialanalytics.core.facebook.FacebookPostSerializer
myAgent.sinks.FacebookSinkPost.serializer.columns=post:link campaign:campaignID post:place post:sharesCount post:name post:from campaign:updatedAt post:caption post:id post:likesCount post:createdTime post:message campaign:sentiment
myAgent.sinks.FacebookSinkPost.batchSize = 50
myAgent.sinks.FacebookSinkComment.type=org.apache.flume.sink.hbase.AsyncHBaseSink
myAgent.sinks.FacebookSinkComment.channel=FacebookChannelComment
myAgent.sinks.FacebookSinkComment.table=FacebookComments
myAgent.sinks.FacebookSinkComment.columnFamily=campaign comment
myAgent.sinks.FacebookSinkComment.serializer=com.my.socialanalytics.core.facebook.FacebookCommentSerializer
myAgent.sinks.FacebookSinkComment.serializer.columns=comment:message comment:id campaign:campaignID comment:commentCount comment:parent comment:createdTime campaign:updatedAt comment:from comment:postID campaign:sentiment
myAgent.sinks.FacebookSinkComment.batchSize = 50
myAgent.channels.FacebookChannelPage.type = memory
myAgent.channels.FacebookChannelPage.capacity = 10000000
myAgent.channels.FacebookChannelPage.transactionCapacity = 100000
myAgent.channels.FacebookChannelPost.type = memory
myAgent.channels.FacebookChannelPost.capacity = 10000000
myAgent.channels.FacebookChannelPost.transactionCapacity = 100000
myAgent.channels.FacebookChannelComment.type = memory
myAgent.channels.FacebookChannelComment.capacity = 10000000
myAgent.channels.FacebookChannelComment.transactionCapacity = 100000
Initially I don't get any error, but after a couple of hours I get this error. I'm running some Spark jobs that update a particular field in the same table; I'm not sure if that's the reason. My Flume agent frequently updates the same records. Could that be the reason for this error?
Update 2:
If I mention both column families, I get the error below.
Oct 14, 4:21:11.040 PM ERROR org.apache.flume.lifecycle.LifecycleSupervisor
Unable to start SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#4c74990e counterGroup:{ name:null counters:{} } } - Exception follows.
org.apache.flume.FlumeException: Could not start sink. Table or column family does not exist in Hbase.
at org.apache.flume.sink.hbase.AsyncHBaseSink.initHBaseClient(AsyncHBaseSink.java:489)
at org.apache.flume.sink.hbase.AsyncHBaseSink.start(AsyncHBaseSink.java:441)
at org.apache.flume.sink.DefaultSinkProcessor.start(DefaultSinkProcessor.java:46)
at org.apache.flume.SinkRunner.start(SinkRunner.java:79)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If I mention only one column family, I get this error:
Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: Could not write events to Hbase. Transaction failed, and rolled back.
at org.apache.flume.sink.hbase.AsyncHBaseSink.process(AsyncHBaseSink.java:317)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:745)
Need your help!
I am trying a trivial exercise of getting data from Twitter and then loading it into Hive for analysis. I am able to get the data into HDFS using Flume (using the Twitter 1% firehose source) and to load the data into a Hive table.
But I am unable to see all the columns I expected to be in the Twitter data, like user_location, user_description, user_friends_count, and user_statuses_count. The schema derived from the Avro file only contains two columns, headers and body.
Below are the steps I have done:
1) create a flume agent with below conf:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type =org.apache.flume.source.twitter.TwitterSource
#a1.sources.r1.type = com.cloudera.flume.source.TwitterSource
a1.sources.r1.consumerKey =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.consumerSecret =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.accessToken =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.accessTokenSecret =XXXXXXXXXXXXXXXXXXXXXXXXXXXX
a1.sources.r1.keywords = bigdata, healthcare, oozie
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.192.128:8020/hdp/apps/2.2.0.0-2041/flume/twitter
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.inUsePrefix = _
a1.sinks.k1.hdfs.fileSuffix = .avro
# added for invalid block size error
a1.sinks.k1.serializer = avro_event
#a1.sinks.k1.deserializer.schemaType = LITERAL
# added for exception java.io.IOException:org.apache.avro.AvroTypeException: Found Event, expecting Doc
#a1.sinks.k1.serializer.compressionCodec = snappy
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.rollSize = 67108864
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 30
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2) Derive the schema from the Avro data file. I have no idea why the schema derived from the Avro data file only has the two columns headers and body:
java -jar avro-tools-1.7.7.jar getschema FlumeData.1431598230978.avro
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "headers",
"type" : {
"type" : "map",
"values" : "string"
}
}, {
"name" : "body",
"type" : "bytes"
} ]
}
3) Run the above agent to get the data into HDFS, find out the schema of the Avro data, and create a Hive table as:
CREATE EXTERNAL TABLE TwitterData
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "headers",
"type" : {
"type" : "map",
"values" : "string"
}
}, {
"name" : "body",
"type" : "bytes"
} ]
}
')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://192.168.192.128:8020/hdp/apps/2.2.0.0-2041/flume/twitter'
;
4) Describe Hive Table:
hive> describe twitterdata;
OK
headers map<string,string> from deserializer
body binary from deserializer
Time taken: 0.472 seconds, Fetched: 2 row(s)
5) Query the table:
When I query the table, I see the binary data in the 'body' column and the actual schema info in the 'headers' column.
select * from twitterdata limit 1;
OK
{"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}�1|$���)]'��G�$598792495703543808�Bあいたぁぁぁぁぁぁぁ!�~�ゆっけ0725Yukken(2015-05-14T10:10:30Z<ん?なんか意味違うわ�Twitter for iPhone�1|$���)]'��
Time taken: 2.24 seconds, Fetched: 1 row(s)
How do I create a Hive table with all the columns in the actual schema, as shown in the 'headers' column? I mean with all the columns like user_location, user_description, user_friends_count, and user_statuses_count.
Shouldn't the schema derived from the avro data file contain more columns?
Is there any issue with the flume-avro source I used in the flume agent (org.apache.flume.source.twitter.TwitterSource)?
Thanks for reading through.
Thanks Farrukh, I have done that. The mistake was the configuration 'a1.sinks.k1.serializer = avro_event'; I changed this to 'a1.sinks.k1.serializer = text' and was able to load the data into Hive. But now the issue is retrieving the data from Hive; I get the below error while doing so:
hive> describe twitterdata_09062015;
OK
id string from deserializer
user_friends_count int from deserializer
user_location string from deserializer
user_description string from deserializer
user_statuses_count int from deserializer
user_followers_count int from deserializer
user_name string from deserializer
user_screen_name string from deserializer
created_at string from deserializer
text string from deserializer
retweet_count bigint from deserializer
retweeted boolean from deserializer
in_reply_to_user_id bigint from deserializer
source string from deserializer
in_reply_to_status_id bigint from deserializer
media_url_https string from deserializer
expanded_url string from deserializer
select count(1) as num_rows from TwitterData_09062015;
Query ID = root_20150609130404_10ef21db-705a-4e94-92b7-eaa58226ee2e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1433857038961_0003, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1433857038961_0003/
Kill Command = /usr/hdp/2.2.0.0-2041/hadoop/bin/hadoop job -kill job_1433857038961_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
13:04:36,856 Stage-1 map = 0%, reduce = 0%
13:05:09,576 Stage-1 map = 100%, reduce = 100%
Ended Job = job_1433857038961_0003 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1433857038961_0003_m_000000 (and more) from job job_1433857038961_0003
Task with the most failures(4):
Task ID:
task_1433857038961_0003_m_000000
URL:
http://sandbox.hortonworks.com:8088/taskdetails.jsp?jobid=job_1433857038961_0003&tipid=task_1433857038961_0003_m_000000
Diagnostic Messages for this Task:
Error: java.io.IOException: java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
Here is the step-by-step process I used to download tweets and load them into Hive.
Flume agent
##TwitterAgent for collecting Twitter data to Hadoop HDFS #####
TwitterAgent.sources = Twitter
TwitterAgent.channels = FileChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = FileChannel
TwitterAgent.sources.Twitter.consumerKey = *************
TwitterAgent.sources.Twitter.consumerSecret = **********
TwitterAgent.sources.Twitter.accessToken = ************
TwitterAgent.sources.Twitter.accessTokenSecret = ***********
TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000
TwitterAgent.sources.Twitter.keywords = Apache, Hadoop, Mapreduce, hadooptutorial, Hive, Hbase, MySql
TwitterAgent.sinks.HDFS.channel = FileChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://nn1.itbeams.com:9000/user/flume/tweets/avrotweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
# You do not need to mention the Avro format here; just mention Text.
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 200000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 2000000
TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/log/flume/checkpoint/
TwitterAgent.channels.FileChannel.dataDirs = /var/log/flume/data/
I created the Avro schema in an .avsc file. Once you create it, put this file in HDFS under your user folder, e.g. /user/youruser/.
{"type":"record",
"name":"Doc",
"doc":"adoc",
"fields":[{"name":"id","type":"string"},
{"name":"user_friends_count","type":["int","null"]},
{"name":"user_location","type":["string","null"]},
{"name":"user_description","type":["string","null"]},
{"name":"user_statuses_count","type":["int","null"]},
{"name":"user_followers_count","type":["int","null"]},
{"name":"user_name","type":["string","null"]},
{"name":"user_screen_name","type":["string","null"]},
{"name":"created_at","type":["string","null"]},
{"name":"text","type":["string","null"]},
{"name":"retweet_count","type":["long","null"]},
{"name":"retweeted","type":["boolean","null"]},
{"name":"in_reply_to_user_id","type":["long","null"]},
{"name":"source","type":["string","null"]},
{"name":"in_reply_to_status_id","type":["long","null"]},
{"name":"media_url_https","type":["string","null"]},
{"name":"expanded_url","type":["string","null"]}
Load the tweets into a Hive table. It would be good if you save this code in an .hql file.
CREATE TABLE tweetsavro
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/youruser/examples/schema/twitteravroschema.avsc') ;
LOAD DATA INPATH '/user/flume/tweets/avrotweets/FlumeData.*' OVERWRITE INTO TABLE tweetsavro;
The tweetsavro table in Hive:
hive> describe tweetsavro;
OK
id string from deserializer
user_friends_count int from deserializer
user_location string from deserializer
user_description string from deserializer
user_statuses_count int from deserializer
user_followers_count int from deserializer
user_name string from deserializer
user_screen_name string from deserializer
created_at string from deserializer
text string from deserializer
retweet_count bigint from deserializer
retweeted boolean from deserializer
in_reply_to_user_id bigint from deserializer
source string from deserializer
in_reply_to_status_id bigint from deserializer
media_url_https string from deserializer
expanded_url string from deserializer
Time taken: 0.6 seconds, Fetched: 17 row(s)
I am trying to understand the textFile method deeply, but I think my lack of Hadoop knowledge is holding me back here. Let me lay out my understanding, and maybe you can correct anything that is incorrect.
When sc.textFile(path) is called, defaultMinPartitions is used, which is really just math.min(taskScheduler.defaultParallelism, 2). Let's assume we are using the SparkDeploySchedulerBackend, where this is conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)).
So, now let's say the default is 2. Going back to textFile, this is passed in to HadoopRDD. The true size is determined in getPartitions() using inputFormat.getSplits(jobConf, minPartitions). But, from what I can find, minPartitions is merely a hint and is in fact mostly ignored, so you will probably get the total number of blocks.
OK, this fits with expectations. However, what if the default is not used and you provide a partition size that is larger than the block size? If my research is right and the getSplits call simply ignores this parameter, then wouldn't the provided minimum end up being ignored and you would still just get the block size?
Cross-posted to the Spark mailing list.
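For concreteness, here is a small Scala sketch of the call in question (the path and the value 8 are made-up examples; getNumPartitions simply reports what getSplits produced):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("textfile-partitions").setMaster("local[*]"))

// Hypothetical input path and minPartitions value, purely for illustration.
val rdd        = sc.textFile("hdfs:///data/input.txt")     // uses defaultMinPartitions = min(defaultParallelism, 2)
val rddWithMin = sc.textFile("hdfs:///data/input.txt", 8)  // minPartitions is passed down to getSplits as a hint

println(rdd.getNumPartitions)        // typically about one partition per HDFS block
println(rddWithMin.getNumPartitions) // minPartitions is a lower-bound hint, not a guarantee of exactly 8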
Short Version:
Split size is determined by mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize; if it is bigger than HDFS's block size, multiple blocks inside the same file will be combined into a single split.
Detailed Version:
I think you are right in understanding the procedure before inputFormat.getSplits.
Inside inputFormat.getSplits, more specifically inside FileInputFormat's getSplits, it is mapred.min.split.size or mapreduce.input.fileinputformat.split.minsize that ultimately determines the split size. (I'm not sure which one is effective in Spark; I tend to believe the former.)
Let's see the code: FileInputFormat from Hadoop 2.4.0
long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

// generate splits
ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
NetworkTopology clusterMap = new NetworkTopology();
for (FileStatus file: files) {
  Path path = file.getPath();
  long length = file.getLen();
  if (length != 0) {
    FileSystem fs = path.getFileSystem(job);
    BlockLocation[] blkLocations;
    if (file instanceof LocatedFileStatus) {
      blkLocations = ((LocatedFileStatus) file).getBlockLocations();
    } else {
      blkLocations = fs.getFileBlockLocations(file, 0, length);
    }
    if (isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;
      while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
        String[] splitHosts = getSplitHosts(blkLocations,
            length-bytesRemaining, splitSize, clusterMap);
        splits.add(makeSplit(path, length-bytesRemaining, splitSize,
            splitHosts));
        bytesRemaining -= splitSize;
      }

      if (bytesRemaining != 0) {
        String[] splitHosts = getSplitHosts(blkLocations, length
            - bytesRemaining, bytesRemaining, clusterMap);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
            splitHosts));
      }
    } else {
      String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
      splits.add(makeSplit(path, 0, length, splitHosts));
    }
  } else {
    //Create empty hosts array for zero length files
    splits.add(makeSplit(path, 0, length, new String[0]));
  }
}
Inside the for loop, makeSplit() is used to generate each split, and splitSize is the effective split size. The computeSplitSize function generates splitSize:
protected long computeSplitSize(long goalSize, long minSize,
long blockSize) {
return Math.max(minSize, Math.min(goalSize, blockSize));
}
Therefore, if minSplitSize > blockSize, each output split is actually a combination of several blocks in the same HDFS file; on the other hand, if minSplitSize < blockSize, each split corresponds to an HDFS block.
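From the Spark side, one way to see this behaviour (a sketch, assuming the mapreduce.* key is the effective one as discussed above; the path and the 256 MB value are arbitrary) is to raise the minimum split size on the Hadoop configuration that textFile uses:
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-size-demo").setMaster("local[*]"))

// With minSize > blockSize, computeSplitSize = max(minSize, min(goalSize, blockSize))
// returns minSize, so several HDFS blocks of the same file end up in one split.
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024)

val rdd = sc.textFile("hdfs:///data/big-file.txt") // hypothetical path
println(rdd.getNumPartitions)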
I will add more points with examples to Yijie Shen's answer.
Before we go into details, let's understand the following.
Assume that we are working on a Spark standalone local system with 4 cores.
In the application, if the master is configured as below:
new SparkConf().setMaster("local[*]")
then
defaultParallelism : 4 (taskScheduler.defaultParallelism, i.e. the number of cores)
/* Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */
defaultMinPartitions : 2 // Default min number of partitions for Hadoop RDDs when not given by user
* Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
The logic to find defaultMinPartitions is as below:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
The actual partition size is defined by the following formula in the method FileInputFormat.computeSplitSize
package org.apache.hadoop.mapred;
public abstract class FileInputFormat<K, V> implements InputFormat<K, V> {
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
return Math.max(minSize, Math.min(goalSize, blockSize));
}
}
where,
minSize is the hadoop parameter mapreduce.input.fileinputformat.split.minsize (default mapreduce.input.fileinputformat.split.minsize = 1 byte)
blockSize is the value of dfs.block.size in cluster mode (the default value in Hadoop 2.0 is 128 MB) and fs.local.block.size in local mode (default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes)
goalSize = totalInputSize/numPartitions
where,
totalInputSize is the total size in bytes of all the files in the input path
numPartitions is the custom parameter provided to the method sc.textFile(inputPath, numPartitions); if not provided, it will be defaultMinPartitions, i.e. 2 if the master is set to local[*]
blockSize = 33554432 bytes (local mode)
33554432 / 1024 = 32768 KB
32768 / 1024 = 32 MB
Ex1: If our file size is 91 bytes
minSize = 1 (mapreduce.input.fileinputformat.split.minsize = 1 byte)
goalSize = totalInputSize / numPartitions
goalSize = 91 (file size) / 12 (partitions provided as 2nd parameter in sc.textFile) = 7
splitSize = Math.max(minSize, Math.min(goalSize, blockSize)) => Math.max(1, Math.min(7, 33554432)) = 7 // 33554432 is the block size in local mode
Splits = 91 (file size in bytes) / 7 (splitSize) => 13
FileInputFormat: Total # of splits generated by getSplits: 13
=> While calculating splitSize, if the file size is > 32 MB then the split size will be the default fs.local.block.size = 32 MB, i.e. blockSize = 33554432 bytes.
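To double-check the arithmetic in Ex1, here is a tiny Scala sketch that mirrors computeSplitSize and the resulting split count (the SPLIT_SLOP factor of 1.1 is ignored for simplicity, which does not change the result here):
// Mirrors FileInputFormat.computeSplitSize from the code above.
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

val fileSize      = 91L        // bytes, from Ex1
val numPartitions = 12L        // second argument to sc.textFile in Ex1
val minSize       = 1L         // mapreduce.input.fileinputformat.split.minsize default
val blockSize     = 33554432L  // fs.local.block.size default (32 MB) in local mode

val goalSize  = fileSize / numPartitions                       // 91 / 12 = 7
val splitSize = computeSplitSize(goalSize, minSize, blockSize) // max(1, min(7, 33554432)) = 7
val numSplits = math.ceil(fileSize.toDouble / splitSize).toInt // ceil(91 / 7) = 13

println(s"goalSize=$goalSize splitSize=$splitSize splits=$numSplits")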