How to stream Twitter data via Apache Spark Streaming?
Previously I fetched Twitter data via Flume and stored it in HDFS, configuring the Twitter login credentials in conf.txt.
But now I am stuck on fetching Twitter data via Apache Spark Streaming.
The problem is where to place the Twitter login credentials.
Please help me.
Why don't you use sc.setLocalProperty()?
Create a StreamingContext and take its SparkContext:
import org.apache.spark.streaming.{Duration, StreamingContext}
val ssc = new StreamingContext("local[10]", "test", Duration(3000))
val sc = ssc.sparkContext
and then set each property as a key/value pair:
sc.setLocalProperty("key", "value")
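That said, with the spark-streaming-twitter connector the credentials are usually placed as twitter4j OAuth system properties and TwitterUtils builds the authorization from them. A minimal sketch of that approach (the placeholder values are yours to fill in; the keys and calls come from twitter4j and spark-streaming-twitter):
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterCredentialsSketch {
  def main(args: Array[String]): Unit = {
    // twitter4j picks up its OAuth credentials from these system properties.
    System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
    System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
    System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

    val ssc = new StreamingContext("local[10]", "test", Duration(3000))
    // None => build the Twitter authorization from the system properties above.
    val tweets = TwitterUtils.createStream(ssc, None)
    tweets.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}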
I am trying to do a POC in Hadoop for log aggregation. We have multiple IIS servers hosting at least 100 sites. I want to stream the logs continuously to HDFS, parse the data, and store it in Hive for further analytics.
1) Is Apache Kafka or Apache Flume the correct choice?
2) After streaming, is it better to use Apache Storm to ingest the data into Hive?
Please help with any suggestions, and also with any information about this kind of problem statement.
Thanks
You can use either Kafka or Flume, or combine both, to get data into HDFS, but you need to write code for that. There are also open-source data flow management tools available with which you don't need to write code, e.g. NiFi and StreamSets.
You don't need any separate ingestion tools: you can use those data flow tools to put the data directly into a Hive table. Once the table is created in Hive, you can do your analytics by running queries.
Let me know if you need anything else on this.
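If you do take the write-the-code route with Kafka, the producer side is small. A minimal sketch, assuming a broker at broker1:9092, a topic named iis-logs, and a plain-text IIS log file (all of these names are placeholders):
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object IisLogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // placeholder broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Ship each log line as one Kafka record; a real collector would tail the
    // file continuously instead of reading it once.
    for (line <- Source.fromFile("u_ex160101.log").getLines()) {
      producer.send(new ProducerRecord[String, String]("iis-logs", line))
    }
    producer.close()
  }
}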
I have a problem making a real-time dashboard that visualizes my log information from HDFS. Below is a sketch of my design. My log information is generated on the fly, so I use Kafka and Spark Streaming to deal with it. But after exporting from Spark Streaming, I don't know how to visualize it on a local website. Could you give me any ideas on how to do so?
Data input in HDFS --> Kafka --> Spark Streaming --> ? --> Front-end web (d3js, ...)
Thank you
P.S.: Kafka and Spark Streaming are managed by an Ambari server, HDP 2.4.
The Bluemix Big Analytics tutorial mentions importing files, but when I launched BigSheets from the Bluemix Analytics for Apache Hadoop service, I could not see any option to load external files into a BigSheet. Is there any other way to do it? Please help us proceed.
You would upload your data to HDFS for your Analytics for Hadoop service using the WebHDFS REST API first, and then it should be available to you in BigSheets via the DFS Files tab shown in your screenshot.
The data you upload will be under /user/biblumix in HDFS, as this is the username you are provided when you create an Analytics for Hadoop service in Bluemix.
To use the WebHDFS REST API, see these instructions.
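For illustration, here is a rough sketch of the two-step WebHDFS CREATE call in Scala; the host, port, and file names are placeholders, and the authentication your Bluemix service requires (HTTPS plus your service credentials) is omitted:
import java.io.FileInputStream
import java.net.{HttpURLConnection, URL}

object WebHdfsUpload {
  def main(args: Array[String]): Unit = {
    // Placeholder endpoint: take host and port from your service credentials.
    val createUrl = new URL("https://host:8443/webhdfs/v1/user/biblumix/mydata.csv?op=CREATE&overwrite=true")

    // Step 1: the NameNode answers the CREATE request with a 307 redirect
    // pointing at the DataNode that will actually receive the bytes.
    val nn = createUrl.openConnection().asInstanceOf[HttpURLConnection]
    nn.setRequestMethod("PUT")
    nn.setInstanceFollowRedirects(false)
    val location = nn.getHeaderField("Location")
    nn.disconnect()

    // Step 2: PUT the file contents to the redirect location.
    val dn = new URL(location).openConnection().asInstanceOf[HttpURLConnection]
    dn.setRequestMethod("PUT")
    dn.setDoOutput(true)
    val out = dn.getOutputStream
    val in = new FileInputStream("mydata.csv")
    val buf = new Array[Byte](8192)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => out.write(buf, 0, n))
    in.close(); out.close()
    println(s"WebHDFS responded with HTTP ${dn.getResponseCode}")
  }
}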
I need to upload data that is available at a web link, say for example a blog, to HDFS.
Looking through options for accomplishing this, I found the link below:
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
But reading through the Flume docs, I am not clear on how I can set up a Flume source to point to the website where the blog content resides.
As per my understanding of the Flume docs, there needs to be a web server where I deploy an application; web logs are then generated and transferred by Flume to HDFS.
But I do not want web server logs; I am actually looking for the blog content (i.e. all the data plus any comments on the blogs), which is unstructured data, and I am then thinking of processing this data further with Java MapReduce.
But I am not sure I am heading in the correct direction.
I also went through Pentaho, but it is not clear to me whether PDI can get the data from a website and upload it to HDFS.
Any info on the above will be really helpful.
Thanks in advance.
Flume can pull data (as in the case of Twitter), and data can also be pushed to Flume (as in the case of server logs via the FlumeAppender).
To get the blog data into HDFS, either:
a) The blogging application pushes the data towards HDFS, as in the case of the FlumeAppender. This requires changes to the blogging application, which is not an option in most scenarios.
or
b) Flume pulls the blog data using the appropriate API, as in the case of Twitter. Blogger provides an API to pull the content, which can be used in a Flume source. The Cloudera blog has a reference to the Flume code that pulls data from Twitter.
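As a rough sketch of option (b), a custom Flume source might look like the following; the feedUrl property and the fetchPosts helper are hypothetical, while the Flume classes mirror the pattern used in the Cloudera Twitter example:
import java.nio.charset.StandardCharsets
import org.apache.flume.{Context, EventDrivenSource}
import org.apache.flume.conf.Configurable
import org.apache.flume.event.EventBuilder
import org.apache.flume.source.AbstractSource

class BlogSource extends AbstractSource with EventDrivenSource with Configurable {
  private var feedUrl: String = _

  // Read the blog/feed URL from the agent's .conf file (e.g. a1.sources.r1.feedUrl = ...).
  override def configure(context: Context): Unit = {
    feedUrl = context.getString("feedUrl")
  }

  override def start(): Unit = {
    super.start()
    // A real source would poll on its own thread and shut down cleanly in stop();
    // this only shows how fetched posts become Flume events.
    fetchPosts(feedUrl).foreach { post =>
      getChannelProcessor.processEvent(
        EventBuilder.withBody(post.getBytes(StandardCharsets.UTF_8)))
    }
  }

  // Placeholder: call the blog's API (e.g. the Blogger REST API) and return
  // each post body plus its comments as one string.
  private def fetchPosts(url: String): Seq[String] = Seq.empty
}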
How do we get Twitter data (tweets) into HDFS for offline analysis? We have a requirement to analyze tweets.
I would look for a solution in the well-developed area of streaming logs into Hadoop, since the task looks somewhat similar.
There are two existing systems doing so:
Flume: https://github.com/cloudera/flume/wiki
And
Scribe: https://github.com/facebook/scribe
So your task will only be to pull the data from Twitter, which I assume is not part of this question, and feed one of these systems with these logs.
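The pulling part can be a small twitter4j program that appends tweets to a local log file for Flume or Scribe to ship. A sketch, assuming the OAuth keys sit in twitter4j.properties on the classpath and the output path is a placeholder:
import java.io.{FileWriter, PrintWriter}
import twitter4j.{Status, StatusAdapter, TwitterStreamFactory}

object TweetLogger {
  def main(args: Array[String]): Unit = {
    // Append tweets to a local log file that Flume or Scribe can pick up.
    val out = new PrintWriter(new FileWriter("/var/log/tweets/stream.log", true))

    // OAuth credentials are expected in twitter4j.properties on the classpath.
    val stream = new TwitterStreamFactory().getInstance()
    stream.addListener(new StatusAdapter {
      override def onStatus(status: Status): Unit = {
        out.println(s"${status.getUser.getScreenName}\t${status.getText}")
        out.flush()
      }
    })
    stream.sample()  // start consuming the public sample stream
  }
}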
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly.
Fluentd + Hadoop: Instant Big Data Collection
Also, by using fluent-plugin-twitter, you can collect Twitter streams by calling its APIs. Of course, you can also create your own custom collector that posts streams to Fluentd. Here's a Ruby example of posting logs to Fluentd.
Fluentd: Data Import from Ruby Applications
This can be a solution to your problem.
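If your collector happens to run on the JVM, the same idea in Scala (the language used elsewhere in this thread) with the fluent-logger-java library might look like this; the tag, host, port, and record fields are assumptions and must match your Fluentd in_forward setup:
import java.util.{HashMap => JHashMap}
import org.fluentd.logger.FluentLogger

object TweetToFluentd {
  // Connects to a local Fluentd in_forward source (default port 24224).
  private val logger = FluentLogger.getLogger("twitter", "localhost", 24224)

  def main(args: Array[String]): Unit = {
    val record = new JHashMap[String, AnyRef]()
    record.put("user", "example_user")
    record.put("text", "an example tweet body")
    // Emits the record with tag "twitter.stream"; Fluentd routes it onward (e.g. to WebHDFS).
    logger.log("stream", record)
    logger.close()
  }
}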
Tools to capture Twitter tweets
Create PDF, DOC, XML and other docs from Twitter tweets
Tweets to CSV files
Capture it in any format (CSV, TXT, DOC, PDF, etc.).
Put it into HDFS.
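For the last step, assuming the tweets were captured into a local CSV file, a small sketch with the Hadoop FileSystem API (both paths are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PutTweetsIntoHdfs {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    val fs = FileSystem.get(new Configuration())
    // Copy the locally captured file into HDFS for offline analysis.
    fs.copyFromLocalFile(new Path("/tmp/tweets.csv"), new Path("/user/hadoop/tweets/tweets.csv"))
    fs.close()
  }
}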