Loading data from a website into HDFS - Hadoop

I need to upload data that lives at a web link, for example a blog, to HDFS.
While looking for ways to do this, I found the link below:
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
But after reading through the Flume docs, I am not clear on how to set up a Flume source that points to the website where the blog content resides.
As I understand the Flume docs, there needs to be a web server on which I deploy an application; the web logs it generates are then transferred by Flume to HDFS.
But I do not want web server logs. I am looking for the blog content itself (i.e. all posts plus any comments), which is unstructured data, and I then plan to process it further using Java MapReduce.
I am not sure I am heading in the right direction.
I also looked at Pentaho, but it is not clear to me whether PDI can fetch data from a website and upload it to HDFS.
Any information on the above would be really helpful.
Thanks in advance.

Flume can pull data (as in the Twitter example), and data can also be pushed to Flume, as server logs are when using the FlumeAppender.
There are two ways to get the blog data into HDFS:
a) The blogging application pushes the data to Flume, as with the FlumeAppender, and Flume writes it to HDFS. This requires changes to the blogging application, which is not possible in most scenarios.
or
b) Flume pulls the blog data using the appropriate API, as in the Twitter example. Blogger provides an API for pulling the data, which can be used in a custom Flume source. The Cloudera blog post references Flume code that pulls data from Twitter.
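
As a rough illustration of the "pull" route in option b), here is a minimal Python sketch that turns an Atom feed document into one JSON record per post. It assumes the blog exposes its posts as an Atom feed; the function names and the sample feed are hypothetical, and fetching the feed over HTTP and wiring the output into Flume or HDFS are left out.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, used only to demonstrate the parsing step.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <id>post-1</id>
    <title>First post</title>
    <published>2013-01-01T00:00:00Z</published>
    <content type="text">First body</content>
  </entry>
  <entry>
    <id>post-2</id>
    <title>Second post</title>
    <published>2013-01-02T00:00:00Z</published>
    <content type="text">Second body</content>
  </entry>
</feed>"""


def atom_entries_to_records(feed_xml):
    """Return one dict per <entry> in an Atom feed document."""
    root = ET.fromstring(feed_xml)
    records = []
    for entry in root.findall("atom:entry", ATOM_NS):
        records.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "published": entry.findtext("atom:published", default="", namespaces=ATOM_NS),
            "content": entry.findtext("atom:content", default="", namespaces=ATOM_NS),
        })
    return records


def to_json_lines(records):
    """Serialize records as newline-delimited JSON, one post per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)


if __name__ == "__main__":
    print(to_json_lines(atom_entries_to_records(SAMPLE_FEED)), end="")
```

One record per line keeps the file easy to split in a later MapReduce job. The resulting file could then be copied into HDFS, for example with hdfs dfs -put, or dropped into a directory watched by a Flume agent.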

Related

Streaming IIS logs to Hadoop in real time

I am trying to do a POC in Hadoop for log aggregation. We have multiple IIS servers hosting at least 100 sites. I want to stream logs continuously to HDFS, parse the data, and store it in Hive for further analytics.
1) Is Apache Kafka or Apache Flume the right choice?
2) After streaming, is it better to use Apache Storm to ingest the data into Hive?
Please help with any suggestions or information on this kind of problem.
Thanks
You can use either Kafka or Flume, or combine both, to get data into HDFS, but you need to write code for this. There are also open-source data flow management tools, such as NiFi and StreamSets, that do not require you to write code.
You don't need any separate ingestion tool; you can use those data flow tools directly to put data into a Hive table. Once the table is created in Hive, you can run your analytics with queries.
Let me know if you need anything else on this.

Analytics for Apache Hadoop service - Could not load external files

The Bluemix Big Analytics tutorial mentions importing files, but when I launched BigSheets from the Bluemix Analytics for Apache Hadoop service, I could not see any option to load external files into BigSheets. Is there another way to do it? Please help us proceed.
You would first upload your data to HDFS for your Analytics for Hadoop service using the webHDFS REST API; it should then be available to you in BigSheets via the DFS Files tab shown in your screenshot.
The data you upload would be under /user/biblumix in HDFS, as this is the username you are provided when you create an Analytics for Hadoop service in Bluemix.
To use the webHDFS REST API see these instructions.
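
For orientation, a file upload through the WebHDFS REST API is a two-step CREATE call: a PUT to the NameNode that is answered with a redirect, followed by a PUT of the file bytes to the redirect target. The Python sketch below shows that flow under several assumptions: plain HTTP, simple user.name authentication, and hypothetical function names. A hosted service such as the one above may instead require HTTPS and credentials, which this sketch does not cover.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_create_path(hdfs_path, user, overwrite=False):
    """Build the request path and query string for a WebHDFS CREATE call."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + query


def upload(host, port, hdfs_path, data, user):
    """Upload bytes to HDFS via WebHDFS and return the final HTTP status."""
    # Step 1: ask the NameNode where to write. No file data is sent here;
    # the expected answer is a 307 redirect with a Location header.
    conn = http.client.HTTPConnection(host, port)
    conn.request("PUT", webhdfs_create_path(hdfs_path, user))
    resp = conn.getresponse()
    resp.read()
    location = resp.getheader("Location")
    conn.close()
    if resp.status != 307 or not location:
        raise RuntimeError("expected a 307 redirect, got %d" % resp.status)

    # Step 2: send the file bytes to the URL named in the Location header.
    target = urlsplit(location)
    conn = http.client.HTTPConnection(target.hostname, target.port)
    conn.request("PUT", target.path + "?" + target.query, body=data)
    resp = conn.getresponse()
    resp.read()
    conn.close()
    return resp.status  # 201 indicates the file was created
```

A call might look like upload("namenode.example.com", 50070, "/user/biblumix/data.csv", b"a,b\n1,2\n", "biblumix"), where the host name and port are placeholders for whatever your service exposes.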

Flume: Send files to HDFS via APIs

I am new to Apache Flume NG. I want to send files from a client agent to a server agent, which will ultimately write the files to HDFS. I have seen http://cuddletech.com/blog/?p=795 . It is the best I have found so far, but it works via a script, not via APIs. I want to do it via the Flume APIs. Please help me with this, and tell me the steps and how to start and organize the code.
I think you should maybe explain more about what you want to achieve.
The link you posted appears to be just fine for your needs. You need to start a Flume agent on your client that reads the files and sends them using the Avro sink. Then you need a Flume agent on your server that uses an Avro source to receive the events and write them where you want.
If you want to send events directly from an application then have a look at the embedded agent in Flume 1.4 or the Flume appender in log4j2 or (worse) the log4j appender in Flume.
Check this: http://flume.apache.org/FlumeDeveloperGuide.html
You can write a client to send events, or use the embedded agent.
As for the code organization, it is up to you.

Hadoop with Hive

We want to develop a simple Java EE web application that does log file analysis using Hadoop. Below is the approach we are following to develop the application, but we are unable to get through it.
The log file would be uploaded to the Hadoop server from the client machine using sftp/ftp.
Call a Hadoop job to fetch the log file and process it into the HDFS file system.
While processing the log file, the content will be stored in a Hive database.
Search the log content using a Hive JDBC connection from the client web application.
We browsed many samples to fulfil some of the steps, but we did not find any concrete sample.
Please suggest whether the above approach is correct, and share links to sample applications developed in Java.
I would point out a few things:
a) You need to merge log files, or in some other way take care that you do not have too many of them. Consider Flume (http://flume.apache.org/), which is built to accept logs from various sources and put them into HDFS.
b) If you go with FTP, you will need some scripting to take the data from FTP and put it into HDFS.
c) The main problem I see is running a Hive job as the result of a client's web request. A Hive request is not interactive: it will take at least dozens of seconds, and probably much more.
I would also be wary of concurrent requests; you probably cannot run more than a few in parallel.
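
To make points a) and b) concrete, here is a minimal Python sketch that concatenates many small log files into one and then copies the merged file into HDFS with the hdfs dfs -put command. The function names, the *.log pattern, and the directory layout are hypothetical; it assumes the files have already been fetched from the FTP server to a local directory and that the hdfs client is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def merge_logs(src_dir, merged_path, pattern="*.log"):
    """Concatenate matching files, in name order, into merged_path.

    Returns the number of files merged. merged_path should live outside
    src_dir so that a later run does not pick up its own output.
    """
    files = sorted(Path(src_dir).glob(pattern))
    with open(merged_path, "wb") as out:
        for f in files:
            with open(f, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(files)


def hdfs_put_command(local_path, hdfs_dir):
    """Build the command that copies a local file into an HDFS directory."""
    # -f overwrites the destination file if it already exists.
    return ["hdfs", "dfs", "-put", "-f", str(local_path), hdfs_dir]


def upload_merged(src_dir, merged_path, hdfs_dir):
    """Merge the local logs and, if there were any, push the result to HDFS."""
    if merge_logs(src_dir, merged_path) > 0:
        subprocess.run(hdfs_put_command(merged_path, hdfs_dir), check=True)
```

Merging before the upload keeps the number of files in HDFS low, which is the concern raised in a). Flume would do the same job continuously, so a script like this is only a stopgap for batch uploads.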
In my opinion, you can do the following:
1) Instead of accepting logs from various sources and putting them into HDFS, you can put them into a single database, say SQL Server, and from there import the data into Hive (or HDFS) using Sqoop.
2) This will reduce the effort of writing various jobs to bring the data into HDFS.
3) Once the data is in Hive, you can do whatever you want.

Twitter - Hadoop Data Streaming

How do we get tweets from Twitter into HDFS for offline analysis? We have a requirement to analyze tweets.
I would look for a solution in the well-developed area of streaming logs into Hadoop, since the task looks somewhat similar.
There are two existing systems that do this:
Flume: https://github.com/cloudera/flume/wiki
And
Scribe: https://github.com/facebook/scribe
So your only task will be to pull data from Twitter, which I assume is not part of this question, and feed one of these systems with those logs.
The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS instantly.
Fluentd + Hadoop: Instant Big Data Collection
Also, by using fluent-plugin-twitter, you can collect Twitter streams by calling its APIs. Of course, you can also create your own custom collector that posts streams to Fluentd. Here's a Ruby example of posting logs to Fluentd.
Fluentd: Data Import from Ruby Applications
This can be a solution to your problem.
Tools to capture Twitter tweets
Create PDF, DOC, XML and other docs from Twitter tweets
Tweets to CSV files
Capture it in any format (csv, txt, doc, pdf, etc.).
Put it into HDFS.
