Nutch 2.0 and Hadoop. How to prevent caching of conf/regex-urlfilter.txt - hadoop

I have nutch 2.x and hadoop 1.2.1 on a single machine.
I configured seed.txt and conf/regex-urlfilter.txt and ran the command
crawl urls/seed.txt TestCrawl http://localhost:8088/solr/ 2
Then I wanted to change the rules in conf/regex-urlfilter.txt.
I changed it in both copies:
~$ find . -name 'regex-urlfilter.txt'
./webcrawer/apache-nutch-2.2.1/conf/regex-urlfilter.txt
./webcrawer/apache-nutch-2.2.1/runtime/local/conf/regex-urlfilter.txt
Then I run
crawl urls/seed.txt TestCrawl2 http://localhost:8088/solr/ 2
But the changes in regex-urlfilter.txt don't take effect.
Hadoop reports which file it is using:
cat /home/hadoop/data/hadoop-unjar6761544045585295068/regex-urlfilter.txt
and when I look at the contents of that file, I see the old rules.
How do I force Hadoop to use the new config?

These settings are stored in the archive file
/home/hadoop/webcrawer/apache-nutch-2.2.1/build/apache-nutch-2.2.1.job
Run
ant clean
ant runtime
to rebuild it with the new settings, or edit the archive file /home/hadoop/webcrawer/apache-nutch-2.2.1/build/apache-nutch-2.2.1.job directly.
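If you go the route of editing the archive directly, note that the .job file is an ordinary zip/jar archive, so (a rough sketch, using the paths above and assuming regex-urlfilter.txt sits at the archive root, which matches the unjar path Hadoop reported) you can refresh the bundled copy with the jar tool:
cd /home/hadoop/webcrawer/apache-nutch-2.2.1/conf
jar uf ../build/apache-nutch-2.2.1.job regex-urlfilter.txt
Rebuilding with ant clean and ant runtime is still the safer option, since it keeps build/ and runtime/ consistent.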

Related

Install Hive on windows: 'hive' is not recognized as an internal or external command, operable program or batch file

I have installed Hadoop 2.7.3 on Windows and I am able to start the cluster. Now I would like to set up Hive, and I went through the steps below:
1. Downloaded db-derby-10.12.1.1-bin.zip, unpacked it and started it with startNetworkServer -h 0.0.0.0.
2. Downloaded apache-hive-1.1.1-bin.tar.gz from a mirror site and unpacked it. Created hive-site.xml with the below properties:
javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
hive.server2.enable.impersonation
hive.server2.authentication
datanucleus.autoCreateTables
hive.metastore.schema.verification
I have also set up HIVE_HOME and updated PATH. Also set HIVE_LIB and HIVE_BIN_PATH.
When I run hive from bin I get
'hive' is not recognized as an internal or external command,
operable program or batch file.
bin/hive appears with file type "File".
Please suggest. I am not sure if the Hive version is the correct one.
Thank you.
If someone is still running into this problem, here's what I did to get Hive working on Windows.
My configurations are as below (latest as of this writing):
I am using Windows 10
Hadoop 2.9.1
derby 10.14
hive 2.3.4 (my Hive version does not contain bin/hive.cmd, the file needed to run Hive on Windows)
@wheeler mentioned in another answer here that Hive is for Linux. Here's the hack to make it work on Windows.
My Hive installation did not come with Windows executable files. Hence the hack!
STEP 1
There are 3 files which you need to download from https://svn.apache.org/repos/:
https://svn.apache.org/repos/asf/hive/trunk/bin/hive.cmd
save it in your %HIVE_HOME%/bin/ as hive.cmd
https://svn.apache.org/repos/asf/hive/trunk/bin/ext/cli.cmd
save it in your %HIVE_HOME%/bin/ext/ as cli.cmd
https://svn.apache.org/repos/asf/hive/trunk/bin/ext/util/execHiveCmd.cmd
save it in your %HIVE_HOME%/bin/ext/util/ as execHiveCmd.cmd
where %HIVE_HOME% is where Hive is installed.
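For example, you could fetch all three from a Windows command prompt with curl (a sketch: it assumes curl is available, as it is on recent Windows 10 builds, and creates the ext\util subfolder if it is missing):
cd %HIVE_HOME%\bin
if not exist ext\util mkdir ext\util
curl -o hive.cmd https://svn.apache.org/repos/asf/hive/trunk/bin/hive.cmd
curl -o ext\cli.cmd https://svn.apache.org/repos/asf/hive/trunk/bin/ext/cli.cmd
curl -o ext\util\execHiveCmd.cmd https://svn.apache.org/repos/asf/hive/trunk/bin/ext/util/execHiveCmd.cmd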
STEP 2
Create a tmp dir under your HIVE_HOME (on the local machine, not on HDFS).
Give 777 permissions to this tmp dir.
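One way to do that from cmd (a sketch; it assumes %HIVE_HOME% is set and that winutils.exe is already in %HADOOP_HOME%\bin, which the Hadoop-on-Windows setup requires anyway):
mkdir %HIVE_HOME%\tmp
%HADOOP_HOME%\bin\winutils.exe chmod 777 %HIVE_HOME%\tmp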
STEP 3
Open your conf/hive-default.xml.template and save it as conf/hive-site.xml.
Then in this hive-site.xml, paste the below properties at the top, just under the opening <configuration> tag:
<property>
  <name>system:java.io.tmpdir</name>
  <value>{PUT YOUR HIVE HOME DIR PATH HERE}/tmp</value>
  <!-- MY PATH WAS C:/BigData/hive/tmp -->
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>
(check the indents)
STEP 4
- Run Hadoop services
start-dfs
start-yarn
- Run derby
startNetworkServer -h 0.0.0.0
Make sure you have all the above services running.
- Go to cmd in HIVE_HOME/bin and run the hive command:
hive
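Once the hive> prompt appears, a quick sanity check that the metastore connection works is plain HiveQL (smoke_test is just a throwaway example name):
show databases;
create database smoke_test;
drop database smoke_test;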
Version 1.1.1 of Apache Hive does not contain a build that can be executed on Windows (only Linux binaries).
However, version 2.1.1 does have Windows support (its bin directory includes the .cmd scripts).
So even if you had your path correctly set, cmd wouldn't be able to find an executable it could run, since one doesn't exist in 1.1.1.
I also ran into this problem. To get the necessary files to run Hive on Windows I downloaded hive-2.3.9 and hive-3.1.2, but neither of them has these files, so we have two options:
Option 1: install hive-2.1.0 and set it up, as I have tried with:
Hadoop 2.8.0
derby 10.12.1.1
hive 2.1.0
Option 2: download the whole bin directory and replace your Hive bin directory with it. To download bin we need the wget utility for Windows.
After that, run this command:
wget -r -np -nH --cut-dirs=3 -R index.html https://svn.apache.org/repos/asf/hive/trunk/bin/
The downloaded bin directory will include the Windows .cmd scripts.
After replacing it you are ready to go. So now my configurations are as below:
Hadoop 3.3.1
derby 10.13.1.1
hive 2.3.9
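A quick way to confirm that the replaced bin scripts are the ones being picked up (assuming %HIVE_HOME%\bin is on your PATH) is to ask Hive for its version from cmd:
hive --version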

How to change tmp directory in yarn

I have written an MR job and have run it in local mode with the following configuration settings:
mapred.local.dir=<<local directory having good amount of space>>
fs.default.name=file:///
mapred.job.tracker=local
on Hadoop 1.x
Now I am using Hadoop 2.x and running the same job with the same configuration settings, but I am getting the error:
Disk Out of Space
Is it that if I switch from Hadoop 1.x to 2.x (using Hadoop 2.6 jars), the same configuration settings for changing the tmp dir no longer work?
What are the new settings to configure the "tmp" directory of MR1 (mapred API) on Hadoop 2.6?
Kindly advise.
Regards
Cheers :))
Many properties in 1.x have been deprecated and replaced with new properties in 2.x.
mapred.child.tmp has been replaced by mapreduce.task.tmp.dir
mapred.local.dir has been replaced by mapreduce.cluster.local.dir
Have a look at the complete list of deprecated properties and their new equivalents on the Apache website.
It can be done by setting
mapreduce.cluster.local.dir=<<local directory having good amount of space>>
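For example, in mapred-site.xml this would look something like the following, where /data/mapred-local is just a placeholder for a directory with plenty of free space:
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/data/mapred-local</value>
</property>
The same property can also be passed per run as -D mapreduce.cluster.local.dir=/data/mapred-local if the job is launched through ToolRunner/GenericOptionsParser.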

Not able to run Hadoop job remotely

I want to run a Hadoop job remotely from a Windows machine. The cluster is running on Ubuntu.
Basically, I want to do two things:
Execute the hadoop job remotely.
Retrieve the result from hadoop output directory.
I don't have any idea how to achieve this. I am using Hadoop version 1.1.2.
I tried passing the jobtracker/namenode URL in the job configuration, but it fails.
I have tried the following example: Running java hadoop job on local/remote cluster
Result: I consistently get a "cannot load directory" error. It is similar to this post:
Exception while submitting a mapreduce job from remote system
Welcome to a world of pain. I've just implemented this exact use case, but using Hadoop 2.2 (the current stable release) patched and compiled from source.
What I did, in a nutshell, was:
1. Download the Hadoop 2.2 sources tarball to a Linux machine and decompress it to a temp dir.
2. Apply these patches, which solve the problem of connecting from a Windows client to a Linux server.
3. Build it from source, using these instructions. It will also ensure that you have 64-bit native libs if you have a 64-bit Linux server. Make sure you fix the build files as the post instructs or the build will fail. Note that after installing protobuf 2.5, you have to run sudo ldconfig; see this post.
4. Deploy the resulting dist tar from hadoop-2.2.0-src/hadoop-dist/target on the server node(s) and configure it. I can't help you with that since you need to tweak it to your cluster topology.
5. Install Java on your client Windows machine. Make sure that the path to it has no spaces in it, e.g. c:\java\jdk1.7.
6. Deploy the same Hadoop dist tar you built on your Windows client. It will contain the crucial fix for the Windows/Linux connection problem.
7. Compile winutils and the Windows native libraries as described in this Stackoverflow answer. It's simpler than building all of Hadoop on Windows.
8. Set up the JAVA_HOME, HADOOP_HOME and PATH environment variables as described in these instructions.
9. Use a text editor or unix2dos (from Cygwin or standalone) to convert all .cmd files in the bin and etc\hadoop directories, otherwise you'll get weird errors about labels when running them.
10. Configure the connection properties to your cluster in your config XML files, namely fs.default.name, mapreduce.jobtracker.address, yarn.resourcemanager.hostname and the like (see the sketch after this list).
11. Add the rest of the configuration required by the patches from item 2. This is required for the client side only. Otherwise the patch won't work.
If you've managed all of that, you can start your Linux Hadoop cluster and connect to it from your Windows command prompt. Joy!
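For reference, here is a minimal sketch of the client-side connection properties from item 10 (the hostname linux-master and the port are hypothetical; substitute your cluster's values):
In core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://linux-master:9000</value>
</property>
In yarn-site.xml:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>linux-master</value>
</property>
In mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>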

Error in Pig: Cannot locate pig-withouthadoop.jar. do 'ant jar-withouthadoop', and try again

I am trying to start Pig 0.12.0 on a Mac after installing Pig from the Apache website.
Before I start the Pig shell, I copied the below 4 lines into a pig-env.sh file I created in the conf directory.
Export JAVA_HOME=/usr
Export PIG_HOME=/Users/Hadoop_Cluster/pig-0.12.0
Export HADOOP_HOME=Users/Hadoop_Cluster/hadoop-1.2.1
Export PIG_CLASSPATH=$HADOOP_HOME/conf/
Also, Added below text in pig.properties file:
Fs.default.name=hdfs://localhost:9000
Mapred.job.tracker=localhost:9001
I copied core-site.xml, hdfs-site.xml and mapred-site.xml from
Hadoop_home/conf to pig_home/conf
I get the below error when starting Pig from the command line under the bin directory of Pig. The error says:
Cannot locate pig-withouthadoop.jar. do 'ant jar-withouthadoop', and Try again
Check whether the jar is already in your $PIG_HOME. If it is not there, copy pig-0.12.0-withouthadoop.jar (renamed or not, it shouldn't matter) to your $PIG_HOME, so that in the end the file /Users/Hadoop_Cluster/pig-0.12.0/pig-0.12.0-withouthadoop.jar exists.
Also be careful about the lower case/upper case letters. Otherwise it should be fine.
Finally it works.
All I did was rename the file in the conf directory to "pig-withouthadoop.jar" instead of pig-0.12.0-withouthadoop.jar. I also made sure Hadoop was not in safe mode.
I kept the settings below in the pig-env.sh file, and all 3 Hadoop config files are
copied to the pig_home/conf directory.
export JAVA_HOME=/usr
export PIG_HOME=/Users/Hadoop_Cluster/pig-0.12.0
export HADOOP_HOME=/Users/Hadoop_Cluster/hadoop-1.2.1
export PIG_CLASSPATH=$HADOOP_HOME/conf/
I too got the same error. I solved it by removing /bin from the home path in .bashrc, sourcing .bashrc, and starting Pig:
export PIG_HOME=/home/hadoop/pig-0.13.0/bin ==> wrong
export PIG_HOME=/home/hadoop/pig-0.13.0 ==> correct
You need to do what the generated error says:
Cannot locate pig-withouthadoop.jar. do 'ant jar-withouthadoop'
One needs to run the command ant jar-withouthadoop to get pig-withouthadoop.jar
If ant is not installed, Ubuntu users can try apt-get install ant.
The ant jar-withouthadoop command takes roughly 15-20 minutes, so one needs to be patient to get this sorted.
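Putting those steps together (a sketch, assuming $PIG_HOME points at the extracted pig-0.12.x directory and you are on Ubuntu):
sudo apt-get install ant
cd $PIG_HOME
ant jar-withouthadoop
When the build finishes, the withouthadoop jar should appear under $PIG_HOME (possibly with the version number in its name, as other answers here note), and the pig script can then locate it.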
I scratched my head all day and kept looking for solutions on Google; none helped.
On extraction of the Pig tar there is no jar created in the home directory. The above is to be followed to create the jar file and to run Pig successfully.
I don't know exactly why it is done this way, but this is the solution that worked for me with Hadoop 1.2 (out of safe mode) and Pig 0.12.1.
The key is to find
pig-withouthadoop.jar
in your $PIG_HOME.
So use
find / -name "*withouthadoop*"
to locate it. It may be named something like pig-0.12.0-withouthadoop.jar; rename it to pig-withouthadoop.jar and cp it to $PIG_HOME. Worked for me.

Where is the configuration file stored in CDH4

I set up CDH4.
Now I can configure Hadoop on the web page.
I want to know where CDH puts the configuration files on the local file system.
For example, I want to find core-site.xml, but where is it?
By default, the installation of CDH has the conf directory located in
/etc/hadoop/
You could always use the following command to find the file:
$ sudo find / -name "core-site.xml"
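If /etc/hadoop/conf turns out to be a symlink (common on CDH installs, where the active client configuration is selected via the alternatives mechanism), a quick way to see which concrete directory it resolves to:
readlink -f /etc/hadoop/conf
ls -l /etc/hadoop/conf/core-site.xml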
