I'm able to fetch tweets using Flume; however, they are not streamed in the language I want. Below is the flume.conf file:
And the tweets that I'm getting are shown below:
Can anyone suggest the changes I need to make?
The TwitterSource in Apache Flume currently does not implement support for language filtering. This prior question describes a procedure (admittedly complex) by which you could deploy your own patched version of the code with language support:
Flume - TwitterSource language filter
I think it would be a valuable enhancement for Apache Flume to support language filtering. I encourage you to file a request in Apache JIRA in the FLUME project.
If you're interested, please also consider contributing a patch. I think it would just be a matter of pulling the "language" setting out of configuration in the configure method, saving it in a member variable, and then passing it along in the Twitter4J APIs.
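For illustration, here is a rough Java sketch of that change. It is not the actual Apache Flume source: the class name, the "language" property name, and the assumption that the source already builds a Twitter4J FilterQuery are hypothetical, and FilterQuery#language(...) requires a Twitter4J version that provides it.

import org.apache.flume.Context;
import twitter4j.FilterQuery;
import twitter4j.TwitterStream;

// Hypothetical sketch of a language-aware TwitterSource (not the real Flume class).
public class LanguageAwareTwitterSource /* would extend or modify the existing TwitterSource */ {

  private String[] languages;  // e.g. {"en"}, read from flume.conf

  public void configure(Context context) {
    // Hypothetical property, e.g. TwitterAgent.sources.Twitter.language = en,de
    String lang = context.getString("language", "");
    languages = lang.isEmpty() ? new String[0] : lang.split(",");
  }

  // Called wherever the source opens the stream with its existing FilterQuery
  private void openStream(TwitterStream twitterStream, FilterQuery query) {
    if (languages.length > 0) {
      query.language(languages);  // restrict the stream to the configured languages
    }
    twitterStream.filter(query);
  }
}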
Related
I am very new to Druid, a column-oriented, open-source, distributed data store written in Java.
I need to start multiple services (nodes) in order for Druid to work smoothly. Is there a good way to auto-start the services?
You can find the patch for Ambari Druid integration, AMBARI-17981, which will be included as of Ambari v2.5.
The patch file contains all of that information in the form of a diff.
Typically you need to check out the source code, apply the patch, and then build the project.
Alternatively, you could use the Hortonworks Data Platform (HDP) distribution, which will install ZooKeeper/HDFS/Druid/PostgreSQL/Hadoop, and you are good to go.
There is also a video guide available on how to install Druid step-by-step.
Otherwise you can do it yourself by building Druid from source and copying the jars and configs around.
I wasn't able to find out how to crawl a website and index the data into Elasticsearch. I managed to do that with the combination Nutch + Solr, and since Nutch should be able to export data directly to Elasticsearch from version 1.8 onwards (source), I tried to use Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest solution to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.
Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good How To section, covering creation of the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), metadata extraction, and so on.
Might be worth having a look at the elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since the River plugins have been deprecated, it may be worth having a look at ManifoldCF or Norconex Collectors.
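Whichever crawler you pick, the Elasticsearch side of the integration is just index and document creation through its API. Here is a minimal Java sketch using the TransportClient of that era; the address, index/type names, and document are placeholders of mine, not anything the River Web plugin actually requires.

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class IndexCrawledPage {
  public static void main(String[] args) {
    // Connect to a local Elasticsearch node (placeholder address)
    Client client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

    // Create the index the crawler will write into
    client.admin().indices().prepareCreate("webindex").execute().actionGet();

    // Index one crawled page as a JSON document
    String json = "{\"url\":\"http://example.com\",\"title\":\"Example\",\"body\":\"page text\"}";
    client.prepareIndex("webindex", "page").setSource(json).execute().actionGet();

    client.close();
  }
}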
You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
When working with big volumes of data, Hadoop provides all the power to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process involves the use of a Hadoop cluster (EMR in this example) running a Cascading application that indexes the JSON metadata directly into Elasticsearch.
The Cascading source code is also available, so you can see how the data ingestion into Elasticsearch is handled.
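To get a feel for what such a Cascading job looks like, here is a minimal sketch, not the blog post's actual code: it reads JSON metadata lines from HDFS and writes them to Elasticsearch through the elasticsearch-hadoop Cascading connector (EsTap). The paths, the index name, the es.nodes address, and the use of es.input.json are assumptions for illustration.

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import org.elasticsearch.hadoop.cascading.EsTap;

public class MetadataToElasticsearch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("es.nodes", "localhost:9200");  // Elasticsearch endpoint (placeholder)
    props.setProperty("es.input.json", "true");       // the input lines are already JSON documents

    // Read raw JSON metadata lines from HDFS (single "line" field)
    Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:///data/crawl-metadata");
    // Write each line as a document into the crawl/metadata index/type
    Tap sink = new EsTap("crawl/metadata");

    Pipe pipe = new Pipe("index-metadata");
    FlowDef flow = FlowDef.flowDef()
        .addSource(pipe, source)
        .addTailSink(pipe, sink);

    new HadoopFlowConnector(props).connect(flow).complete();
  }
}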
I am new to Apache Flume-ng. I want to send files from a client agent to a server agent, which will ultimately write the files to HDFS. I have seen http://cuddletech.com/blog/?p=795 . This is the best one I have found so far, but it works via a script, not via APIs. I want to do it via the Flume APIs. Please help me in this regard, and tell me the steps for how to start and organize the code.
I think you should maybe explain more about what you want to achieve.
The link you posted appears to be just fine for your needs. You need to start a Flume agent on your client to read the files and send them using the Avro sink. Then you need a Flume agent on your server which uses an Avro source to read the events and write them where you want.
If you want to send events directly from an application then have a look at the embedded agent in Flume 1.4 or the Flume appender in log4j2 or (worse) the log4j appender in Flume.
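For the embedded-agent route, a minimal sketch along the lines of the Flume Developer Guide's example looks like this; the collector hostnames and ports are placeholders for the Avro source of your server-side agent.

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedAgentExample {
  public static void main(String[] args) throws Exception {
    Map<String, String> properties = new HashMap<String, String>();
    properties.put("channel.type", "memory");
    properties.put("channel.capacity", "200");
    properties.put("sinks", "sink1 sink2");
    properties.put("sink1.type", "avro");
    properties.put("sink1.hostname", "collector1.example.com");  // server-side agent's Avro source
    properties.put("sink1.port", "41414");
    properties.put("sink2.type", "avro");
    properties.put("sink2.hostname", "collector2.example.com");
    properties.put("sink2.port", "41414");
    properties.put("processor.type", "load_balance");

    EmbeddedAgent agent = new EmbeddedAgent("myagent");
    agent.configure(properties);
    agent.start();

    // Send one event from inside the application
    agent.put(EventBuilder.withBody("application event", Charset.forName("UTF-8")));

    agent.stop();
  }
}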
Check the Flume Developer Guide: http://flume.apache.org/FlumeDeveloperGuide.html
You can write a client to send events, or use the embedded agent.
As for the code organization, it is up to you.
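For the "write a client" route, a minimal sketch of a Flume RPC client, along the lines of the Developer Guide's example, looks like this; the hostname, port, and payload are placeholders.

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcClientExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Connects to the Avro source of the server-side agent
    RpcClient client = RpcClientFactory.getDefaultInstance("server-agent-host", 41414);
    try {
      Event event = EventBuilder.withBody("hello from the client", Charset.forName("UTF-8"));
      client.append(event);  // send one event; appendBatch(...) is available for batching
    } finally {
      client.close();
    }
  }
}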
How do we get Twitter data (tweets) into HDFS for offline analysis? We have a requirement to analyze tweets.
I would look for a solution in the well-developed area of streaming logs into Hadoop, since the task looks somewhat similar.
There are two existing systems doing so:
Flume: https://github.com/cloudera/flume/wiki
Scribe: https://github.com/facebook/scribe
So your task will only be to pull the data from Twitter, which I assume is not part of this question, and to feed one of these systems with those logs.
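As a rough illustration of the pulling part, here is a minimal Twitter4J sketch (Twitter4J is my assumption, not something this answer prescribes) that writes incoming tweets to a local log file, which Flume or Scribe can then ship into Hadoop. OAuth credentials are expected in twitter4j.properties; the log path is a placeholder.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class TweetsToLogFile {
  public static void main(String[] args) throws IOException {
    final PrintWriter log =
        new PrintWriter(new FileWriter("/var/log/tweets/tweets.log", true), true);

    TwitterStream stream = new TwitterStreamFactory().getInstance();
    stream.addListener(new StatusAdapter() {
      @Override
      public void onStatus(Status status) {
        // One tweet per line, so Flume or Scribe can tail the file
        log.println(status.getUser().getScreenName() + "\t"
            + status.getText().replace('\n', ' '));
      }
    });
    stream.sample();  // random sample stream; use filter(...) for keyword tracking
  }
}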
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly.
Fluentd + Hadoop: Instant Big Data Collection
Also, by using fluent-plugin-twitter, you can collect Twitter streams by calling its APIs. Of course, you can create your own custom collector that posts streams to Fluentd. Here's a Ruby example of posting logs to Fluentd.
Fluentd: Data Import from Ruby Applications
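If your collector is written in Java rather than Ruby, a roughly equivalent sketch using the fluent-logger-java library would look like this; the tag, host, and field names are placeholders.

import java.util.HashMap;
import java.util.Map;

import org.fluentd.logger.FluentLogger;

public class TweetToFluentd {
  // Points at a local Fluentd instance listening on the default forward port
  private static final FluentLogger LOG = FluentLogger.getLogger("twitter", "localhost", 24224);

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<String, Object>();
    record.put("user", "someuser");
    record.put("text", "an example tweet");

    // Emitted as tag "twitter.stream"; a webhdfs <match> in the Fluentd config forwards it to HDFS
    LOG.log("stream", record);
  }
}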
This can be a solution to your problem.
Tools to capture Twitter tweets
Create PDF, DOC, XML and other docs from Twitter tweets
Tweets to CSV files
Capture it in any format (CSV, TXT, DOC, PDF, etc.).
Put it into HDFS.
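As a rough sketch of the final step, copying a captured file into HDFS with the Hadoop FileSystem API might look like this; the namenode URI and paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutTweetsIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // placeholder namenode address

    FileSystem fs = FileSystem.get(conf);
    // Copy the captured file (CSV, TXT, ...) from local disk into HDFS
    fs.copyFromLocalFile(new Path("/tmp/tweets.csv"),
                         new Path("/user/hadoop/tweets/tweets.csv"));
    fs.close();
  }
}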
Since the zohmg project seems to be dead (no new commits since November 2009), I would like to know whether any of you have used it (or still use it) with successful results, or whether you know anything about the future of this project.
If not, is there any alternative to this project? I'm looking for a tool that will help extract data from (Apache) logs, using Hadoop as a batch-processing system, store it in HBase, and help with querying that data.
Cascading is very often used for this. It also provides adapters for HBase.
Examples can be found here
http://github.com/cwensel/cascading.samples
HBase integration
http://www.cascading.org/modules.html
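As a rough illustration, here is a sketch that parses Apache access logs with Cascading's RegexParser (the regex comes from the logparser sample in the repository linked above) and sinks the parsed fields into HBase through the cascading.hbase module's HBaseTap and HBaseScheme. The table name, column family, field layout, and sink mode are assumptions of mine.

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.hbase.HBaseScheme;
import cascading.hbase.HBaseTap;
import cascading.operation.regex.RegexParser;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ApacheLogsToHBase {
  public static void main(String[] args) {
    // Parse each Apache access-log line into named fields
    Fields apacheFields = new Fields("ip", "time", "method", "event", "status", "size");
    String apacheRegex =
        "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\\\"([^ ]*) ([^ ]*) [^ ]*\\\" ([^ ]*) ([^ ]*).*$";
    RegexParser parser = new RegexParser(apacheFields, apacheRegex, new int[]{1, 2, 3, 4, 5, 6});

    Pipe pipe = new Each("logs", new Fields("line"), parser, Fields.RESULTS);

    // Read raw log lines from HDFS
    Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:///logs/access");
    // Row key = "ip"; the remaining fields go into the "log" column family
    Tap sink = new HBaseTap("access_logs",
        new HBaseScheme(new Fields("ip"), "log",
            new Fields("time", "method", "event", "status", "size")),
        SinkMode.UPDATE);

    FlowDef flow = FlowDef.flowDef().addSource(pipe, source).addTailSink(pipe, sink);
    new HadoopFlowConnector().connect(flow).complete();
  }
}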