How do we get Twitter data (tweets) into HDFS for offline analysis? We have a requirement to analyze tweets.
I would look for a solution in the well-developed area of streaming logs into Hadoop, since the task looks similar.
There are two existing systems doing so:
Flume: https://github.com/cloudera/flume/wiki
And
Scribe: https://github.com/facebook/scribe
So your task is only to pull data from Twitter (which I assume is not part of this question) and feed those logs into one of these systems.
Fluentd log collector just released its WebHDFS plugin, which allows users to instantly stream data into HDFS.
Fluentd + Hadoop: Instant Big Data Collection
Also, by using fluent-plugin-twitter, you can collect Twitter streams by calling the Twitter APIs. Of course, you can also create a custom collector that posts streams to Fluentd. Here's a Ruby example of posting logs to Fluentd.
Fluentd: Data Import from Ruby Applications
This can be a solution to your problem.
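As a concrete illustration of the custom-collector route, here is a minimal Python sketch that builds an HTTP POST for Fluentd's in_http input. The localhost:8888 endpoint and the twitter.stream tag are assumptions; match them to your own Fluentd source configuration.

```python
import json
from urllib import request, parse

# Hypothetical endpoint: a Fluentd in_http source listening on localhost:8888.
# Records posted to /<tag> are routed by that tag inside Fluentd.
FLUENTD_URL = "http://localhost:8888/twitter.stream"

def build_fluentd_request(record):
    """Build (but do not send) an HTTP POST carrying one JSON record,
    using in_http's json=<record> form parameter."""
    body = parse.urlencode({"json": json.dumps(record)}).encode("utf-8")
    return request.Request(FLUENTD_URL, data=body, method="POST")

req = build_fluentd_request({"user": "alice", "text": "hello HDFS"})
print(req.full_url)  # http://localhost:8888/twitter.stream
```

Sending it is then just `urllib.request.urlopen(req)`; Fluentd routes the record by the tag in the URL path and the WebHDFS output plugin takes it from there.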
Tools to capture Twitter tweets
Create PDF, DOC, XML and other docs from Twitter tweets
Tweets to CSV files
Capture them in any format (CSV, TXT, DOC, PDF, etc.).
Put it into HDFS.
Related
We have a requirement to send documents to Hadoop (Hortonworks) from our image capture software, which releases PDF documents with metadata.
I don't have much experience with HDP. Is there any REST service or tool that can add documents to Hadoop along with their metadata?
Please help
Hadoop HDFS has both WebHDFS (a REST API) and an NFS Gateway.
However, it's generally recommended not to store raw data directly onto HDFS if you can control how the data gets there; that way you can audit where and how data is written.
For example, you could use Apache Nifi processors to start a ListenHTTP processor, read the document data, parse it, filter and enrich, then you can optionally write to HDFS or many other destinations.
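For the WebHDFS route, creating a file is a two-step REST call: a PUT to the NameNode that returns a 307 redirect, then a PUT of the file bytes to the DataNode location in that redirect. A small Python sketch of building step one; the host, port, and user are placeholder assumptions (the NameNode's HTTP port is commonly 50070 or 9870, depending on the Hadoop version):

```python
from urllib import parse

def webhdfs_create_url(host, port, hdfs_path, user):
    """URL for step 1 of a WebHDFS file create: expect a 307 redirect
    to a DataNode, then PUT the file bytes to the redirect location."""
    query = parse.urlencode({"op": "CREATE", "user.name": user, "overwrite": "true"})
    return f"http://{host}:{port}/webhdfs/v1{hdfs_path}?{query}"

url = webhdfs_create_url("namenode.example.com", 9870, "/docs/scan-001.pdf", "ingest")
print(url)
```

Metadata has no dedicated slot in WebHDFS, so it typically travels as a sidecar file (e.g. JSON next to the PDF), which is one reason a flow tool like NiFi, which keeps the two together as attributes, is attractive here.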
I need to get data from CSV files (daily extracts from different business databases) into HDFS, then move it to HBase, and finally load aggregations of this data into a data mart (SQL Server).
I would like to know the best way to automate this process (using Java or Hadoop tools).
I'd echo the comment above regarding Kafka Connect, which is part of Apache Kafka. With it you just use configuration files to stream from your sources, you can use KSQL to create derived/enriched/aggregated streams, and you can then stream these to HDFS, Elasticsearch, HBase, JDBC, etc.
There's a list of Kafka Connect connectors here.
This blog series walks through the basics:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
https://www.confluent.io/blog/the-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
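As a sketch of how little code this can involve: a standalone HDFS sink connector is driven by a properties file along these lines. The topic name and hdfs.url are placeholders, and this assumes Confluent's HDFS connector is installed on the Connect worker.

```properties
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=csv_extracts
hdfs.url=hdfs://namenode:8020
flush.size=1000
```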
Little to no coding required? In no particular order:
Talend Open Studio
Streamsets Data Collector
Apache Nifi
Assuming you can set up a Kafka cluster, you can try Kafka Connect
If you want to program something, probably Spark. Otherwise, pick your favorite language. Schedule the job via Oozie
If you don't need the raw HDFS data, you can load directly into HBase
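If you do write the HBase load yourself, the main design decision is the row key. A hedged Python sketch of mapping one CSV extract line to an HBase row key plus cells; the source|date|id key layout and the `d` column family are assumptions for illustration, not a standard:

```python
import csv
import io

def csv_row_to_hbase(source, date, line, fieldnames):
    """Turn one CSV line from a daily extract into an HBase row key
    and a dict of column-family:qualifier -> value cells."""
    row = next(csv.DictReader(io.StringIO(line), fieldnames=fieldnames))
    # Prefixing the key with source and date keeps each day's extract
    # contiguous and makes date-range scans cheap.
    row_key = f"{source}|{date}|{row['id']}"
    cells = {f"d:{k}": v for k, v in row.items() if k != "id"}
    return row_key, cells

key, cells = csv_row_to_hbase("sales_db", "2019-01-15",
                              "42,acme,199.90", ["id", "customer", "amount"])
print(key)    # sales_db|2019-01-15|42
print(cells)  # {'d:customer': 'acme', 'd:amount': '199.90'}
```

Whatever tool does the writing (Spark, NiFi, a Connect sink), the same key-design question applies, so it is worth settling before automating the pipeline.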
I am trying to do a POC in Hadoop for log aggregation. We have multiple IIS servers hosting at least 100 sites. I want to stream the logs continuously to HDFS, parse the data, and store it in Hive for further analytics.
1) Is Apache Kafka the correct choice, or Apache Flume?
2) After streaming, is it better to use Apache Storm to ingest the data into Hive?
Any suggestions, or any information on this kind of problem statement, would be appreciated.
Thanks
You can use either Kafka or Flume, or combine both, to get data into HDFS, but you will need to write code for this. There are also open-source data flow management tools available for which you don't need to write code, e.g. NiFi and StreamSets.
You don't need any separate ingestion tools; you can use those data flow tools to put data directly into Hive tables. Once the table is created in Hive, you can do your analytics by running queries.
Let me know if you need anything else on this.
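Whichever transport you pick, the logs still have to be split into columns before Hive can query them. IIS writes W3C extended format: a `#Fields:` directive names the space-separated columns that follow. A minimal Python sketch of that parse; the field list shown is one common default and should be adjusted to your own logs:

```python
def parse_w3c_line(fields_directive, line):
    """Parse one IIS W3C extended log line using the file's
    '#Fields:' directive to name the space-separated columns."""
    names = fields_directive.split()[1:]  # drop the '#Fields:' token
    return dict(zip(names, line.split()))

directive = "#Fields: date time cs-method cs-uri-stem sc-status"
record = parse_w3c_line(directive, "2019-01-15 10:32:01 GET /index.html 200")
print(record["cs-uri-stem"], record["sc-status"])  # /index.html 200
```

The same column list then becomes the Hive table schema, whether the parsing runs in NiFi, a Storm topology, or a periodic batch job.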
I need your help with one suggestion. In the current scenario, we have one application on the cloud, and via Splunk we have the ability to view its logs. I am thinking of implementing this using our big data tools like Flume/Kafka, where I can take the real-time log data from the cloud (currently captured by Splunk) and make it available in our HDFS. A few concerns here:
Is this feasible, and does it make sense?
For log search (the same capability as Splunk), which tool can we use?
If you just want to move logs into HDFS, you can use Flume with the HDFS sink.
There are also a few other options available, like:
Logstash
You can use other frameworks like Elasticsearch and Kibana to have more functionality available for the logs.
I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes in my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit the temp file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs on that same file at the same time?
Fluentd log collector just released its WebHDFS plugin, which allows users to instantly stream data into HDFS. It's easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course, you can import data directly from your applications. Here's a Java example of posting logs to Fluentd.
Fluentd: Data Import from Java Applications
A Hadoop job can run over multiple input files, so there's really no need to keep all your data as one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database, etc., so they can be queried online.
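That timestamped-folder convention is easy to reproduce; here is a small Python sketch where each periodic run gets its own output directory, so a downstream job can pick up completed runs by name. The base path is an assumption:

```python
from datetime import datetime, timezone

def run_output_dir(base="/data/processed_logs", now=None):
    """Build the HDFS output path for one periodic run, e.g.
    /data/processed_logs/processed_logs_20190115103000."""
    now = now or datetime.now(timezone.utc)
    return f"{base}/processed_logs_{now:%Y%m%d%H%M%S}"

print(run_output_dir(now=datetime(2019, 1, 15, 10, 30, 0)))
# /data/processed_logs/processed_logs_20190115103000
```

Because each run writes to a fresh folder, the file-handle caveat above takes care of itself: a folder is only handed to the next job once its run has finished and closed all its files.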
I'd recommend using Flume to collect the log files from your servers into HDFS.