Unable to download data from Twitter through Flume - Hadoop

bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume-twitter.conf -Dflume.root.logger=DEBUG,console
When I run the above command, it generates the following error:
2016-05-06 13:33:31,357 (Twitter Stream consumer-1[Establishing connection]) [INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 404:The URI requested is invalid or the resource requested, such as a user, does not exist. Unknown URL. See Twitter Streaming API documentation at http://dev.twitter.com/pages/streaming_api
This is my flume-twitter.conf file, located in the flume/conf folder:
TwitterAgent.sources= Twitter
TwitterAgent.channels= MemChannel
TwitterAgent.sinks=HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels=MemChannel
TwitterAgent.sources.Twitter.consumerKey=jtlmThaz307pCCQtlw9lvrrOq
TwitterAgent.sources.Twitter.consumerSecret=oaGCt6OaUas13Ji5NTnPN6TFjdSKtsAUQdq4ZhAq0BFn9jgHPU
TwitterAgent.sources.Twitter.accessToken=921523328-xxY9nrWijDSVC77iK40eRNVmRIopvLXovpoxBnDs
TwitterAgent.sources.Twitter.accessTokenSecret=fbtuDENfBNxTooPD0EEgEo15Pg51cxNQa1CochI56gqSO
TwitterAgent.sources.Twitter.keywords= WT20,hadoop,election,sports, cricket,Big data,IPL2016,Panamaleaks,Pollingday
TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://HadoopMaster:9000/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeformat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=100

Try replacing your flume-sources-1.x-SNAPSHOT.jar with the jar file downloaded from this link.
Twitter retired its old APIs a few days ago, so the old jar file will no longer work. You can download the modified jar from the link given above.
P.S. I am getting results with this method.
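After swapping the jar, it can help to confirm that the file you dropped in really contains the source class named in the config (com.cloudera.flume.source.TwitterSource). A jar is a zip archive, so a minimal Python sketch like the one below can check that. The helper name jar_has_class is made up for illustration, and this only checks that the class is present, not that it talks to the new Twitter API.

```python
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar (a zip archive) contains the compiled class.

    jar_has_class is a hypothetical helper, not part of Flume.
    """
    # A class a.b.C is stored inside the jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

You would call it with the path to the jar in your Flume lib or plugins folder and the class name from the TwitterAgent.sources.Twitter.type line.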


Stanford CoreNLP - Unknown variable WORKDAY

I am processing some documents and I am getting many WORKDAY messages as seen below.
There's a similar issue posted here for WEEKDAY. Does anyone know how to deal with this message? I am running CoreNLP in a Java server on Windows and accessing it from a Jupyter Notebook with Python code.
[pool-2-thread-2] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: WORKDAY
[pool-2-thread-2] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: WORKDAY
[pool-2-thread-2] INFO edu.stanford.nlp.ling.tokensregex.types.Expressions - Unknown variable: WORKDAY
[pool-1-thread-7] WARN CoreNLP - java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error making document
This is an error in the current SUTime rules file (and it's actually been there for quite a few versions). If you want to fix it immediately, you can do the following; otherwise, we'll fix it in the next release. These are Unix commands, but the same thing will work elsewhere except for how you refer to and create folders.
Find this line in sutime/english.sutime.txt and delete it. Save the file.
{ (/workday|work day|business hours/) => WORKDAY }
Then move the file to the right location for replacing in the jar file, and then replace it in the jar file. In the root directory of the CoreNLP distribution do the following (assuming you don't already have an edu file/folder in that directory):
mkdir -p edu/stanford/nlp/models/sutime
cp sutime/english.sutime.txt edu/stanford/nlp/models/sutime
jar -uf stanford-corenlp-4.2.0-models.jar edu/stanford/nlp/models/sutime/english.sutime.txt
rm -rf edu
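Since the question mentions Windows, here is a rough Python equivalent of the steps above, as a sketch only. It assumes the rule line ends with "=> WORKDAY }" exactly as quoted, and that the rules file sits in the models jar at the entry path used in the jar -uf command. The function names are made up for illustration. Instead of updating the jar in place, it writes a patched copy, so the original stays untouched until you swap the files yourself.

```python
import zipfile

# Tail of the rule line quoted above; assumed to match the file exactly.
RULE_MARKER = b"=> WORKDAY }"
# Entry path taken from the jar -uf command above.
ENTRY = "edu/stanford/nlp/models/sutime/english.sutime.txt"


def drop_workday_rule(rules):
    """Return the rules file bytes without any line ending in the WORKDAY rule."""
    kept = [line for line in rules.splitlines(keepends=True)
            if not line.strip().endswith(RULE_MARKER)]
    return b"".join(kept)


def rewrite_models_jar(src_jar, dst_jar, entry=ENTRY):
    """Copy src_jar to dst_jar, patching the SUTime rules entry on the way."""
    with zipfile.ZipFile(src_jar) as src, \
            zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == entry:
                data = drop_workday_rule(data)
            # Reusing the ZipInfo keeps each entry's name and timestamp.
            dst.writestr(info, data)
```

For example, rewrite_models_jar("stanford-corenlp-4.2.0-models.jar", "patched-models.jar") would produce a patched copy that you can then rename over the original.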

Configuring Flume to Collect Data into HDFS from Twitter

I have been getting this log message continuously for a whole day:
2016-10-12 21:32:05,696 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:126)] Checking file:conf/flume.conf for changes
It appears when executing the command:
FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
I am now getting this error after modifying the conf file:
2016-10-12 22:09:19,592 (lifecycleSupervisor-1-0) [DEBUG - com.cloudera.flume.source.TwitterSource.start(TwitterSource.java:124)] Setting up Twitter sample stream using consumer key and access token
2016-10-12 22:09:19,592 (lifecycleSupervisor-1-0) [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start EventDrivenSourceRunner: { source:com.cloudera.flume.source.TwitterSource{name:Twitter,state:IDLE} } - Exception follows.
java.lang.IllegalStateException: consumer key/secret pair already set.
As far as I can tell from what you have provided here, you need to add a1.sources.r1.type = org.apache.flume.source.twitter.TwitterSource to your conf file to define your Twitter source. Also make sure you are using your credentials for accessing the Twitter API.
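If you want to double-check the conf file before starting the agent, a small Python sketch along these lines can list any declared source that has no type line. It is not a Flume tool: the function names are made up, and the parser only handles plain key = value lines with # comments, which is an assumption about how your file is written.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def sources_missing_type(props, agent):
    """List sources declared for the agent that have no '.type' property."""
    # Flume lists source names separated by whitespace, e.g. "r1 r2".
    declared = props.get(f"{agent}.sources", "").split()
    return [s for s in declared if f"{agent}.sources.{s}.type" not in props]
```

For the agent in the question you would pass "TwitterAgent" as the agent name; an empty list means every declared source has a type.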

Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [12]

I am getting the error below while running a Hive action in Oozie:
015-12-20 19:48:40,368 WARN HiveActionExecutor:523 -
SERVER[sandbox.hortonworks.com] USER[root] GROUP[-] TOKEN[] APP[oozie_hive_root]
JOB[0000013-151220142557945-oozie-oozi-W] ACTION[0000013-151220142557945-oozie-oozi-W#oozie_hive_root]
Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain],
exit code [12]
A few details about my config:
I have renamed hive-site.xml to hive-config.xml, as suggested in some posts.
I have also placed my Hive script file (hive.hql), hive-config.xml, and the Hive shared lib JARs in the workflow directory.
I have also set oozie.use.system.libpath=true.
I selected hive-config.xml as both the File and the Job XML in the Oozie workflow.
I am using Oozie 4.2.0 in the Hue 2.6.1 browser.
I have tried all the suggestions from different posts but couldn't get rid of this error. Kindly help with this.
Thanks in advance!
I solved the same problem. In my case, the Atlas jar files were missing. I was using Hortonworks. If you are using Hortonworks too, try copying all the jar files in the /usr/hdp/2.3.2.0-2950/atlas/hook/hive/* directory into the lib folder at the job.properties level.
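As a rough sketch of that copy step in Python (the function name stage_jars is made up, and it only copies files on the local filesystem): if your workflow directory lives on HDFS, the staged lib folder would still have to be uploaded there afterwards, which this does not do.

```python
import shutil
from pathlib import Path


def stage_jars(hook_dir, lib_dir):
    """Copy every .jar from hook_dir into lib_dir; return the copied file names."""
    lib = Path(lib_dir)
    lib.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(Path(hook_dir).glob("*.jar")):
        shutil.copy2(jar, lib / jar.name)
        copied.append(jar.name)
    return copied
```

With the paths from this answer, hook_dir would be /usr/hdp/2.3.2.0-2950/atlas/hook/hive and lib_dir a lib folder next to your job.properties.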

DBPedia Live mirror setup on Mac OS X

I am trying to set up a DBpedia Live Mirror on my personal Mac machine. Here is some technical host information about my setup:
Operating System: OS X 10.9.3
Processor 2.6 GHz Intel Core i7
Memory 16 GB 1600 MHz DDR3
Database server used for hosting data for the DBpedia Live Mirror: OpenLink Virtuoso (Open-source edition)
Here's a summary of the steps I followed so far:
Downloaded the initial data seed from DBPedia Live as: dbpedia_2013_07_18.nt.bz2
Downloaded the synchronization tool from http://sourceforge.net/projects/dbpintegrator/files/.
Executed the virtload.sh script. Had to tweak some commands in here to be compatible with OS X.
Adapted the synchronization tool's configuration files according to the README.txt file as follows:
a) Set the start date in file "lastDownloadDate.dat" to the date of that dump (2013-07-18-00-000000).
b) Set the configuration information in file "dbpedia_updates_downloader.ini", such as login credentials for Virtuoso, and GraphURI.
Executed "java -jar dbpintegrator-1.1.jar" on the command line.
This script repeatedly showed the following error:
INFO - Options file read successfully
INFO - File : http://live.dbpedia.org/changesets/lastPublishedFile.txt has been successfully downloaded
INFO - File : http://live.dbpedia.org/changesets/2014/06/16/13/000001.removed.nt.gz has been successfully downloaded
WARN - File /Users/shruti/virtuoso/dbpedia-live/UpdatesDownloadFolder/000001.removed.nt.gz cannot be decompressed due to Unexpected end of ZLIB input stream
ERROR - Error: (No such file or directory)
INFO - File : http://live.dbpedia.org/changesets/2014/06/16/13/000001.added.nt.gz has been successfully downloaded
WARN - File /Users/shruti/virtuoso/dbpedia-live/UpdatesDownloadFolder/000001.added.nt.gz cannot be decompressed due to Unexpected end of ZLIB input stream
ERROR - Error: (No such file or directory)
INFO - File : http://live.dbpedia.org/changesets/lastPublishedFile.txt has been successfully downloaded
INFO - File : http://live.dbpedia.org/changesets/2014/06/16/13/000002.removed.nt.gz has been successfully downloaded
INFO - File : /Users/shruti/virtuoso/dbpedia-live/UpdatesDownloadFolder/000002.removed.nt.gz decompressed successfully to /Users/shruti/virtuoso/dbpedia-live/UpdatesDownloadFolder/000002.removed.nt
WARN - null Function executeStatement
WARN - null Function executeStatement
WARN - null Function executeStatement
WARN - null Function executeStatement
WARN - null Function executeStatement
...
Questions
Why do I repeatedly see the following error when running the Java program: "dbpintegrator-1.1.jar"? Does this mean that the triples from these files were not updated in my live mirror?
WARN - File /Users/shruti/virtuoso/dbpedia-live/UpdatesDownloadFolder/000001.removed.nt.gz cannot be decompressed due to Unexpected end of ZLIB input stream
ERROR - Error: (No such file or directory)
How can I verify that the data loaded in my mirror is up to date? Is there a SPARQL query I can use to validate this?
I see that the data in my live mirror is missing wikiPageId (http://dbpedia.org/ontology/wikiPageID) and wikiPageRevisionID. Why is that? Is this data missing from the DBpedia live data dumps?
It should be fixed now.
Can you try again from here: https://github.com/dbpedia/dbpedia-live-mirror
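On the "Unexpected end of ZLIB input stream" warning: that is what Java reports when a compressed stream ends before its end marker, which usually points to a truncated download. If you want to check the files in your UpdatesDownloadFolder yourself, a minimal Python sketch like this can tell complete .gz files from truncated ones. The function name is made up for illustration.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole .gz file decompresses without error.

    A missing or unreadable file also returns False, because OSError is caught.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end in chunks so large changesets are not held in memory.
            while fh.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False
```

Any file for which this returns False would need to be downloaded again before its triples can be applied.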

HPCC/HDFS Connector

Does anyone know about the HPCC/HDFS connector? We are using both HPCC and Hadoop. There is a utility (the HPCC/HDFS connector) developed by HPCC which allows an HPCC cluster to access HDFS data.
I have installed the connector, but when I run the program to access data from HDFS, it gives an error saying that libhdfs.so.0 doesn't exist.
I tried to build libhdfs.so using the command
ant compile-libhdfs -Dlibhdfs=1
It gives me the error
target "compile-libhdfs" does not exist in the project "hadoop"
I tried one more command
ant compile-c++-libhdfs -Dlibhdfs=1
It gives the error
ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
[get] To: /home/hadoop/hadoop-0.20.203.0/ivy/ivy-2.1.0.jar
[get] Error getting http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.1.0/ivy-2.1.0.jar
to /home/hadoop/hadoop-0.20.203.0/ivy/ivy-2.1.0.jar
BUILD FAILED java.net.ConnectException: Connection timed out
Any suggestions would be a great help.
Chhaya, you might not need to build libhdfs.so. Depending on how you installed Hadoop, you might already have it.
Check in HADOOP_LOCATION/c++/Linux-<arch>/lib/libhdfs.so, where HADOOP_LOCATION is your Hadoop install location and arch is the machine’s architecture (i386-32 or amd64-64).
Once you locate the lib, make sure the H2H connector is configured correctly (see page 4 here).
It's just a matter of updating the HADOOP_LOCATION var in the config file:
/opt/HPCCSystems/hdfsconnector.conf
Good luck.
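If you are not sure which arch folder your install uses, a small Python sketch like this can search the layout described above. The function name find_libhdfs is made up, and the c++/Linux-<arch>/lib layout is the one from this answer, so it may not match every Hadoop distribution.

```python
from pathlib import Path


def find_libhdfs(hadoop_location):
    """Return libhdfs.so* paths under <hadoop_location>/c++/Linux-<arch>/lib."""
    root = Path(hadoop_location)
    # Linux-* matches either arch folder; libhdfs.so* also matches libhdfs.so.0
    return sorted(root.glob("c++/Linux-*/lib/libhdfs.so*"))
```

For the install in the question, find_libhdfs("/home/hadoop/hadoop-0.20.203.0") would list any copies that already exist; an empty list means the library really does need to be built or installed.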
