How to install and run Nutch in Windows 7 x64

How to install and run Nutch in Windows 7 x64 - windows-7

I want to run Nutch on my Windows 7 x64. I have Nutch versions 1.5.1 and 2 from apache.spinellicreations.com/nutch/.
I used the tutorial at wiki.apache.org/nutch/NutchTutorial. But I messed up in the second step and I can't verify the installation. Other steps are hard to understand...
What are the steps to install and use nutch?

Follow steps to install nutch in windows :
1) download and install cygwin from : https://www.cygwin.com/
2) download nutch from : http://nutch.apache.org/downloads.html
3) paste nutch downloaded and extracted folder into C:\cygwin64\home\
4) rename to apache-nutch
5) open cygwin terminal and type given commands
- $ cd C:
- $ cd cygwin64
- $ cd home
- $ cd apache-nutch
- $ cd src/bin
- $ ./nutch
you will get given output :
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
index run the plugin-based indexer on parsed batches
elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
webapp run a local Nutch web application
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

You didn't mess up the second step - you simply don't have (I'm guessing) Cygwin installed so you can't run a bash script. Either install Cygwin (simplest) or you could try porting the bash script to a Windows cmd file. (If you do that, you may find other dependencies down the line.
Hope this helps.

Related

Azure DevOps ThirdParty Tools for build / Deployment

List item
pipelines:
default:
- step:
name: Push changes to Commerce Cloud
script:
- dcu --putAll $OCCS_CODE_LOCATION --node $OCCS_ADMIN_URL --applicationKey $OCCS_APPLICATION_KEY
- step:
name: Publish changes Live Storefront
image: Python 3.5.1
script:
python publishDCUAuthoredChanges.py -u $OCCS_ADMIN_URL -k $OCCS_APPLICATION_KEY
environment variables:
$OCCS_CODE_LOCATION: Path to location of all OCCS code
$OCCS_ADMIN_URL: URL for the administration interface on the target Commerce Cloud instance
$OCCS_APPLICATION_KEY: application key to use to log into the target Commerce Cloud administration interface
So I want to use Azure Dev Repository to CI / CD.
in the above code block if you see I have specified - dcu & python code in two task.
dcu is nodejs third party oracle tool which needed to be used to migrate code to cloud system. I want to know how to use that tool in azure dev ops,
Second python (or) nodejs which I want to invoke to REST api to publish the changes.
So where to place those files and how do we invoke it.
*********** Update **************
I hosted the self pool agent and able to access the system.
Just start executing basic bash code, but end up in two issue -
1) the git extract files from the repository it is going to _work/1/s, not sure how that path is decided. How can I change that location s
2) I did 'pwd' to the correct path but it fails in 'dcu' command. I tried with npm and other few commands it fails. But things like mkdir , rmdir it create & remove folder correctly from the desired path. when I tried the 'dcu' cmd from the terminal manually from the system it works fine as expected.

You can follow below steps to use DCU tool and python in azure pipelines.
1, create a azure git repo to include dcu zip file and your .py files. You can follow the steps in this thread to create a azure git repo and push local files to azure repo.
2, create azure build pipeline. Please check here to create a yaml pipeline. Here is a good tutorial for you to get started.
To create a classic UI pipeline, please choose Use the classic editor in the pipeline setup wizard, and choose start with an Empty job to start with an empty pipeline and add your own steps.(I will use classic UI pipeline in below example.)
3, Click "+" and search for Extract files task to unzip the DCU zip file. Click the 3dots on the Destination folder field to select a destination folder for extracted dcu files. eg. $(agent.builddirectory). Please check my answer in this thread more information about predefined variables
4, click "+" to add a powershell task. Run below script in screenshot to install dcu and run dcu command. For environment variables (like $OCCS_CODE_LOCATION), please click the variables tab in below screenshot to define them
cd $(agent.builddirectory) #the folder where the unzipped dcu files reside. eg. $(agent.builddirectory)
npm install -g
.\dcu.cmd --putAll $(OCCS_CODE_LOCATION) --node $(OCCS_ADMIN_URL) --applicationKey $(OCCS_APPLICATION_KEY)
5, add Use python version task to define a python version to execute your .py file.
6, add Python script task to run your .py file. Click the 3dots on Script path field to locate your publishDCUAuthoredChanges.py file(this py file and the dcu zip file have been pushed to azure git repo in the above step 1).
You should be able to run the script of above question in the azure devops pipeline.
Update:
_work/1/s is the default working folder for the agent. You cannot change it. Though there are ways to change the location where the source code is cloned from git, the tasks' workingdirectory is still from the default folder.
However, You can change the workingdirectory inside the tasks. And there are predefined variables you can use to refer to the places in the agents. For below example:
$(Agent.BuildDirectory) is mapped to c:\agent_work\1
%(Build.ArtifactStagingDirectory) is mapped to c:\agent_work\1\a
$(Build.BinariesDirectory) is mapped to c:\agent_work\1\b
$(Build.SourcesDirectory) is mapped to c:\agent_work\1\s
The .sh scripts in the _temp folder are generated automatically by the agent which contains the scripts in the bash task.
For above dcu command not found error. You can try adding dcu command path to the system variables path for your local machine's environment variables. (path in user variables cannot be found by agent jobs, For the agent use a different user account to connect to local machine)
.
Or you can use the physically path to dcu command in the bash task. For example let's say the dcu.cmd in the c:\dcu\dcu.cmd on local machine. Then in the bash task use below script to run dcu command.
c:/dcu/dcu.cmd --putAll ...

Facing issue regarding installation in hybris6.6

While installing the hybris, my localextension.xml is creating in comment form. I am very new in hybris ecommerce development.
So I have followed below steps for installing the hyrbis -
Installed the zip version of Hybris 6.6
Unzip it
From Platform folder, I opened the terminal and ran ". ./setantenv.sh" And after that I ran "ant clean all" and after the build completed succesfully all folders got created in Hybris folder.
Then I ran "./hybrisserver.sh" and my server got started successfully.
Then I ran "https://localhost:9002/" over that I initialize and it also went successfully.
When I try to access hmc or backoffice it is giving me 404 page not found error.
I checked my localextension.xml file and found all the extensions generated as a comment as shown below.
Could anyone help me out where I am doing the mistake.
Thanks in advance.

If you are using original package you need to install a receipt. Go to install folder.
Run below command for listing existing receipt
./install.sh -l
Prepare b2c with acc:
./install.sh -r b2c_acc
Initialize b2c with acc (Also you can use ant clean all for this step):
./install.sh -r b2c_acc initialize
Start hybris (Also you can use ./hybrisserver.sh start for this step):
./install.sh -r b2c_acc start

When you do "ant all" for the first time and set-up the config folder, it generates a localextensions.xml file which contains extensions that are commented out. If you initialize and start Hybris using this setting, you get nothing, except the HAC.
To enable HMC, you need to at least have "platformhmc" extension enabled (i.e. not commented out) in localextensions. So, stop Hybris, uncomment platformhmc, and do another build (i.e. "ant all"). After that, you can do a Platform Update, or a Platform Initialize (to build from scratch again). When it's done, and you've started Hybris, HMC should be accessible.
Or, if you want more features enabled by default, you can do #mkysoft's suggestion and use recipes.

Unable to use the generate_plugin.js script in Kibana

I tried to run the plugin generator included in Kibana 7.0.0-alpha, but generate_plugin.js script doesn't exist after upgrading Kibana by running the .deb installation file for Kibana 7.0.0-alpha. I know I successfully installed Kibana 7.0.0-alpha because it runs properly through the browser. In addition, I attempted to run the script directly from clone of the git repo but I got a "Error: Cannot find module '#kbn/plugin-generator'" error. Do I have to manually define the generate_plugin script? I wasn't even able to find a scripts folder within my Kibana installation directory.

How to push Ambari use local repository on Hue installation

I have HDP installed with Ambari using public repositories.
I wanted to add Hue to the ecosystem. Since Ambari didn't have Hue as a service to install, I went on with the guide here:
https://github.com/EsharEditor/ambari-hue-service
As far as I understand this guide adds Hue as a service in possible services that Ambari can install.
I think it (this guide) is for local repository installation as I've learned.
My installation failed when it tried to download from public repository. It couldn't find hue server package.
Error log start
2017-01-24 18:53:50,351 - Downloading Hue Service
2017-01-24 18:53:50,351 - Execute['cat /etc/yum.repos.d/HDP.repo | grep "baseurl" | awk -F '=' '{print $2"hue/hue-3.11.0.tgz"}' | xargs wget -O hue.tgz'] {}
Command failed after 1 tries
Error log end
Then I wanted to try installing Hue manually
I followed the guide here:
http://gethue.com/hadoop-hue-3-on-hdp-installation-tutorial
Installation was successfull but my installation was not integrated with Ambari.
I wanted to try the first method again, changing my OS repo files to local repository at first step.
I changed the contents of the files under /etc/yum.repos.d/ to local repository paths to make Ambari use local repository packages but Ambari displayed public-repository. I had tried to install over public repository before. Got the same shell command error this time again as I went on the the next step of ambari add service wizard:
After a short search I found following file and updated also that file with local repository paths:
/var/lib/ambari-server/resources/stacks/HDP/2.5/repos/repoinfo.xml
However, it didn't work either. Ambari was still trying to download from public repository.
Does anyone have a comment?
If I achieve using public repository problem, next step will be finding rpm packages of hue for 3.9.0 or 3.11.0 because my local HDP repository had 2.6 version.
Any help also for this will be appreciated.
OS: Centos 7
HDP: 2.5.3
Ambari: 2.4.2
Hue: 3.9.0

I worked on this with a friend and we were able to overcome this.
I can't say it is the ideal answer but it is a workaround for my case:
The scripts under path
/var/lib/ambari-agent/cache/stacks/HDP/2.5/services/HUE/package/scripts
`$ ls`
common.py hue_server.py params.py setup_hue.py status_params.py
common.pyc hue_server.pyc params.pyc setup_hue.pyc status_params.pyc
were managing the Hue installation over Ambari.
The error message we received was due to a command in common.py
Although we couldn't find out how it overrides our local repository, we searched for pattern "public-repo" and found following files:
/usr/lib/ambari-server/web/data/wizard/stack/HDP_versions.json
/usr/lib/ambari-server/web/data/wizard/stack/HDP_version_definitions.json
/usr/lib/ambari-server/web/data/stacks/HDP-2.1/operating_systems.json
Instead of replacing content of these files, we updated the "download_url" variable inside params.py file.
We hard-coded our local repository URL as value.
We executed the command that we received error from common.py (line 57)
We tried and received another error for the next command.
Then we also applied that command manually
and converted the manually applied command line to comment line
and we retried.
We had to use this apply-manually, comment, retry, receive-error thing for the next command as latest one (3 commands of common.py in total).
On the next retry, installation was successful and hue was up. Rest of is the normal procedure. We updated the hue.ini file.
Currently I am having errors on Hue page as the errors mentioned in this unanswered post :)
https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Hue-cannot-access-database-Failed-to-access-filesystem-root/td-p/40318
Good luck!

How to set up Spark on Windows?

I am trying to setup Apache Spark on Windows.
After searching a bit, I understand that the standalone mode is what I want.
Which binaries do I download in order to run Apache spark in windows? I see distributions with hadoop and cdh at the spark download page.
I don't have references in web to this. A step by step guide to this is highly appreciated.

Steps to install Spark in local mode:
Install Java 7 or later.
To test java installation is complete, open command prompt type java and hit enter.
If you receive a message 'Java' is not recognized as an internal or external command. You need to configure your environment variables, JAVA_HOME and PATH to point to the path of jdk.
Download and install Scala.
Set SCALA_HOME in Control Panel\System and Security\System goto "Adv System settings" and add %SCALA_HOME%\bin in PATH variable in environment variables.
Install Python 2.6 or later from Python Download link.
Download SBT. Install it and set SBT_HOME as an environment variable with value as <<SBT PATH>>.
Download winutils.exe from HortonWorks repo or git repo. Since we don't have a local Hadoop installation on Windows we have to download winutils.exe and place it in a bin directory under a created Hadoop home directory.
Set HADOOP_HOME = <<Hadoop home directory>> in environment variable.
We will be using a pre-built Spark package, so choose a Spark pre-built package for Hadoop Spark download. Download and extract it.
Set SPARK_HOME and add %SPARK_HOME%\bin in PATH variable in environment variables.
Run command: spark-shell
Open http://localhost:4040/ in a browser to see the SparkContext web UI.

I found the easiest solution on Windows is to build from source.
You can pretty much follow this guide: http://spark.apache.org/docs/latest/building-spark.html
Download and install Maven, and set MAVEN_OPTS to the value specified in the guide.
But if you're just playing around with Spark, and don't actually need it to run on Windows for any other reason that your own machine is running Windows, I'd strongly suggest you install Spark on a linux virtual machine. The simplest way to get started probably is to download the ready-made images made by Cloudera or Hortonworks, and either use the bundled version of Spark, or install your own from source or the compiled binaries you can get from the spark website.

You can download spark from here:
http://spark.apache.org/downloads.html
I recommend you this version: Hadoop 2 (HDP2, CDH5)
Since version 1.0.0 there are .cmd scripts to run spark in windows.
Unpack it using 7zip or similar.
To start you can execute /bin/spark-shell.cmd --master local[2]
To configure your instance, you can follow this link: http://spark.apache.org/docs/latest/

You can use following ways to setup Spark:
Building from Source
Using prebuilt release
Though there are various ways to build Spark from Source.
First I tried building Spark source with SBT but that requires hadoop. To avoid those issues, I used pre-built release.
Instead of Source,I downloaded Prebuilt release for hadoop 2.x version and ran it.
For this you need to install Scala as prerequisite.
I have collated all steps here :
How to run Apache Spark on Windows7 in standalone mode
Hope it'll help you..!!!

Trying to work with spark-2.x.x, building Spark source code didn't work for me.
So, although I'm not going to use Hadoop, I downloaded the pre-built Spark with hadoop embeded : spark-2.0.0-bin-hadoop2.7.tar.gz
Point SPARK_HOME on the extracted directory, then add to PATH: ;%SPARK_HOME%\bin;
Download the executable winutils from the Hortonworks repository, or from Amazon AWS platform winutils.
Create a directory where you place the executable winutils.exe. For example, C:\SparkDev\x64. Add the environment variable %HADOOP_HOME% which points to this directory, then add %HADOOP_HOME%\bin to PATH.
Using command line, create the directory:
mkdir C:\tmp\hive
Using the executable that you downloaded, add full permissions to the file directory you created but using the unixian formalism:
%HADOOP_HOME%\bin\winutils.exe chmod 777 /tmp/hive
Type the following command line:
%SPARK_HOME%\bin\spark-shell
Scala command line input should be shown automatically.
Remark : You don't need to configure Scala separately. It's built-in too.

Here's the fixes to get it to run in Windows without rebuilding everything - such as if you do not have a recent version of MS-VS. (You will need a Win32 C++ compiler, but you can install MS VS Community Edition free.)
I've tried this with Spark 1.2.2 and mahout 0.10.2 as well as with the latest versions in November 2015. There are a number of problems including the fact that the Scala code tries to run a bash script (mahout/bin/mahout) which does not work of course, the sbin scripts have not been ported to windows, and the winutils are missing if hadoop is not installed.
(1) Install scala, then unzip spark/hadoop/mahout into the root of C: under their respective product names.
(2) Rename \mahout\bin\mahout to mahout.sh.was (we will not need it)
(3) Compile the following Win32 C++ program and copy the executable to a file named C:\mahout\bin\mahout (that's right - no .exe suffix, like a Linux executable)
#include "stdafx.h"
#define BUFSIZE 4096
#define VARNAME TEXT("MAHOUT_CP")
int _tmain(int argc, _TCHAR* argv[]) {
DWORD dwLength; LPTSTR pszBuffer;
pszBuffer = (LPTSTR)malloc(BUFSIZE*sizeof(TCHAR));
dwLength = GetEnvironmentVariable(VARNAME, pszBuffer, BUFSIZE);
if (dwLength > 0) { _tprintf(TEXT("%s\n"), pszBuffer); return 0; }
return 1;
}
(4) Create the script \mahout\bin\mahout.bat and paste in the content below, although the exact names of the jars in the _CP class paths will depend on the versions of spark and mahout. Update any paths per your installation. Use 8.3 path names without spaces in them. Note that you cannot use wildcards/asterisks in the classpaths here.
set SCALA_HOME=C:\Progra~2\scala
set SPARK_HOME=C:\spark
set HADOOP_HOME=C:\hadoop
set MAHOUT_HOME=C:\mahout
set SPARK_SCALA_VERSION=2.10
set MASTER=local[2]
set MAHOUT_LOCAL=true
set path=%SCALA_HOME%\bin;%SPARK_HOME%\bin;%PATH%
cd /D %SPARK_HOME%
set SPARK_CP=%SPARK_HOME%\conf\;%SPARK_HOME%\lib\xxx.jar;...other jars...
set MAHOUT_CP=%MAHOUT_HOME%\lib\xxx.jar;...other jars...;%MAHOUT_HOME%\xxx.jar;...other jars...;%SPARK_CP%;%MAHOUT_HOME%\lib\spark\xxx.jar;%MAHOUT_HOME%\lib\hadoop\xxx.jar;%MAHOUT_HOME%\src\conf;%JAVA_HOME%\lib\tools.jar
start "master0" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip localhost --port 7077 --webui-port 8082 >>out-master0.log 2>>out-master0.err
start "worker1" "%JAVA_HOME%\bin\java" -cp "%SPARK_CP%" -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker spark://localhost:7077 --webui-port 8083 >>out-worker1.log 2>>out-worker1.err
...you may add more workers here...
cd /D %MAHOUT_HOME%
"%JAVA_HOME%\bin\java" -Xmx4g -classpath "%MAHOUT_CP%" "org.apache.mahout.sparkbindings.shell.Main"
The name of the variable MAHOUT_CP should not be changed, as it is referenced in the C++ code.
Of course you can comment-out the code that launches the Spark master and worker because Mahout will run Spark as-needed; I just put it in the batch job to show you how to launch it if you wanted to use Spark without Mahout.
(5) The following tutorial is a good place to begin:
https://mahout.apache.org/users/sparkbindings/play-with-shell.html
You can bring up the Mahout Spark instance at:
"C:\Program Files (x86)\Google\Chrome\Application\chrome" --disable-web-security http://localhost:4040

The guide by Ani Menon (thx!) almost worked for me on windows 10, i just had to get a newer winutils.exe off that git (currently hadoop-2.8.1): https://github.com/steveloughran/winutils

Here are seven steps to install spark on windows 10 and run it from python:
Step 1: download the spark 2.2.0 tar (tape Archive) gz file to any folder F from this link - https://spark.apache.org/downloads.html. Unzip it and copy the unzipped folder to the desired folder A. Rename the spark-2.2.0-bin-hadoop2.7 folder to spark.
Let path to the spark folder be C:\Users\Desktop\A\spark
Step 2: download the hardoop 2.7.3 tar gz file to the same folder F from this link - https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz. Unzip it and copy the unzipped folder to the same folder A. Rename the folder name from Hadoop-2.7.3.tar to hadoop.
Let path to the hadoop folder be C:\Users\Desktop\A\hadoop
Step 3: Create a new notepad text file. Save this empty notepad file as winutils.exe (with Save as type: All files). Copy this O KB winutils.exe file to your bin folder in spark - C:\Users\Desktop\A\spark\bin
Step 4: Now, we have to add these folders to the System environment.
4a: Create a system variable (not user variable as user variable will inherit all the properties of the system variable) Variable name: SPARK_HOME
Variable value: C:\Users\Desktop\A\spark
Find Path system variable and click edit. You will see multiple paths. Do not delete any of the paths. Add this variable value - ;C:\Users\Desktop\A\spark\bin
4b: Create a system variable
Variable name: HADOOP_HOME
Variable value: C:\Users\Desktop\A\hadoop
Find Path system variable and click edit. Add this variable value - ;C:\Users\Desktop\A\hadoop\bin
4c: Create a system variable Variable name: JAVA_HOME
Search Java in windows. Right click and click open file location. You will have to again right click on any one of the java files and click on open file location. You will be using the path of this folder. OR you can search for C:\Program Files\Java. My Java version installed on the system is jre1.8.0_131.
Variable value: C:\Program Files\Java\jre1.8.0_131\bin
Find Path system variable and click edit. Add this variable value - ;C:\Program Files\Java\jre1.8.0_131\bin
Step 5: Open command prompt and go to your spark bin folder (type cd C:\Users\Desktop\A\spark\bin). Type spark-shell.
C:\Users\Desktop\A\spark\bin>spark-shell
It may take time and give some warnings. Finally, it will show
welcome to spark version 2.2.0
Step 6: Type exit() or restart the command prompt and go the spark bin folder again. Type pyspark:
C:\Users\Desktop\A\spark\bin>pyspark
It will show some warnings and errors but ignore. It works.
Step 7: Your download is complete. If you want to directly run spark from python shell then:
go to Scripts in your python folder and type
pip install findspark
in command prompt.
In python shell
import findspark
findspark.init()
import the necessary modules
from pyspark import SparkContext
from pyspark import SparkConf
If you would like to skip the steps for importing findspark and initializing it, then please follow the procedure given in
importing pyspark in python shell

Here is a simple minimum script to run from any python console.
It assumes that you have extracted the Spark libraries that you have downloaded into C:\Apache\spark-1.6.1.
This works in Windows without building anything and solves problems where Spark would complain about recursive pickling.
import sys
import os
spark_home = 'C:\Apache\spark-1.6.1'
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python\lib\pyspark.zip'))
sys.path.insert(0, os.path.join(spark_home, 'python\lib\py4j-0.9-src.zip'))
# Start a spark context:
sc = pyspark.SparkContext()
#
lines = sc.textFile(os.path.join(spark_home, "README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
pythonLines.first()

Cloudera and Hortonworks are the best tools to start up with the
HDFS in Microsoft Windows. You can also use VMWare or VBox to initiate Virtual Machine to establish build to your HDFS and Spark, Hive, HBase, Pig, Hadoop with Scala, R, Java, Python.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio