I'm new to Nutch and am trying to learn how to crawl with Nutch 1.10, using Elasticsearch as my indexer. I'm not sure why, but I cannot get this crawl command to work:
bin/crawl -i --elastic -D elastic.server.url=http://localhost:9200/elastic/ urls elasticTestCrawl 1
UPDATE: I just used
bin/crawl -i -D elastic.server.url=http://localhost:9200/elastic/ urls/ elasticTestCrawl/ 2
This almost succeeded, but I received the following error when it reached the indexing part of the command:
Error running:
/home/david/apache-nutch-1.10/bin/nutch clean -Delastic.server.url=http://localhost:9200/elastic/ elasticTestCrawl//crawldb
Failed with exit value 255.
What does exit value 255 mean for Nutch 1.x? And why does the space between "-D" and "elastic..." get deleted?
I have these ElasticSearch Properties from here in my nutch-site.xml file:
If someone can point me to the error of my ways, that would be great!
Update
I just posted my own answer below; it's the second one. I had already accepted the first answer months ago when I initially got it working. My answer is simply clearer and more concise, to make it easier (and quicker) to get started with Nutch.
Unfortunately I can't tell you where you're going wrong, as I'm in the same boat, although from what I can see you are running Nutch and Elasticsearch on the same box, whereas I've split them across two.
I've not got it to work, but according to a guide I found on integrating Nutch 1.7 with Elasticsearch it should just be
bin/crawl urls/ TestCrawl -depth 3 -topN 5
It may just be that it isn't working for me because I've added the extra complication of networking.
I also assume you have created an index called elasticTestIndex in your Elasticsearch instance, and started Elasticsearch on the box, before trying to run your crawl?
In case it helps, the guide I got that command from is
https://www.mind-it.info/integrating-nutch-1-7-elasticsearch/
Update:
I'm not sure I'm quite there yet, but using your update I've got further than I had.
You are using port 9200, which is Elasticsearch's HTTP (web) port, but Nutch needs to talk to the service on port 9300 (the transport port), so change the port to 9300.
I'm not sure, but I think the portion after the slash refers to the index, so in your example make sure you have "elastic" set up as an index, or change the URL to
localhost:9300/[index name]/
so that it uses an index you have created. If you haven't created one, you can do so from a terminal (PuTTY in my case) with the following command.
curl -XPUT 'http://localhost:9200/[index name]/'
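For anyone who prefers a script to curl, the same index-creation request can be sketched in Python using only the standard library. This is a minimal sketch and not part of the original answer: the helper name build_create_index_request and the example index name "elastic" are illustrative assumptions, and the request is only built here, not sent.

```python
import urllib.request


def build_create_index_request(index_name, base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index.

    Mirrors: curl -XPUT 'http://localhost:9200/[index name]/'
    The function name and defaults are hypothetical, chosen for illustration.
    """
    url = "{}/{}/".format(base_url.rstrip("/"), index_name)
    return urllib.request.Request(url, method="PUT")


if __name__ == "__main__":
    # "elastic" is just the example index name used in the question.
    request = build_create_index_request("elastic")
    print(request.get_method(), request.full_url)
    # To actually send it (needs a running Elasticsearch instance):
    # urllib.request.urlopen(request)
```

Note that index creation goes over the HTTP port (9200), even though the crawl itself is pointed at 9300.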
Using the command you supplied with the alternative port, it did run, although I've yet to extract the crawl data from Elasticsearch.
Supplemental Update:
It's now successfully dumping data crawled by Nutch into Elasticsearch for me. Having put a different index on the command line, I can tell you it ignores that and uses whatever is in your nutch-site.xml.
To help anyone else get it working
Start off by reading this blog post to help you get Elasticsearch configured to work with Nutch.
After that, read this Nutch doc to get familiar with the new CLI command for running the crawl script. (Works for 1.9+)
Follow the example of the new Nutch crawl script command on that page. You have to change it a bit for Elasticsearch:
solr.server.url=http://localhost:8983/solr/ to something like
elastic.server.url=http://localhost:9300/yourelasticindex/
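To make the substitution concrete, here is a small Python sketch that assembles the crawl command as an argument list. It is only an illustration: the helper name build_crawl_command is hypothetical, and the directory names, round count and URL are the example values quoted in this thread, not recommended settings.

```python
import shlex


def build_crawl_command(seed_dir, crawl_dir, rounds, index_url):
    """Return the argument list for Nutch's bin/crawl with indexing enabled.

    Hypothetical helper for illustration; it only builds the command.
    """
    return [
        "bin/crawl",
        "-i",  # index the crawl results
        "-D", "elastic.server.url={}".format(index_url),
        seed_dir,
        crawl_dir,
        str(rounds),
    ]


if __name__ == "__main__":
    command = build_crawl_command(
        "urls/", "elasticTestCrawl/", 2,
        "http://localhost:9300/yourelasticindex/",
    )
    # Print a shell-ready version of the command.
    print(" ".join(shlex.quote(part) for part in command))
    # To run it from the Nutch install directory:
    # import subprocess; subprocess.run(command, check=True)
```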
So basically there are 2 steps:
Configure Elasticsearch to work with Nutch (see the first link above).
Change the new CLI command for Solr to work with Elasticsearch (its default is Solr).
Hope that helps!
I want to add stopwords to my project, but I think Elasticsearch is not installed on my server; MySQL is selected as the search engine.
Will our stopwords work without Elasticsearch configured?
Also, I want to check whether Elasticsearch is configured. For that I am using the command
curl -XGET 'http://localhost:9200'
and in response I get:
curl: (7) Failed to connect to localhost:9200; Connection refused.
Does this mean that Elasticsearch is not configured?
I found the solution to this question.
a) Install Elasticsearch 6.0
b) Then follow the steps at https://devdocs.magento.com/guides/v2.4/config-guide/elasticsearch/es-config-stopwords.html#to-change-directory-stopwords
But one thing to keep in mind:
Don't override the stopwords.csv file.
Instead, override the file for your locale, e.g. stopwords_en_US.csv.
Your module will then work.
The solution given on the various sites is correct; we just need to override the correct stopwords file for the locale.
I am following the documentation here: https://getcandy.io/docs/master/guides/introduction/01-installation
but when I got to the step that runs this command:
php artisan candy:search:index
I get the following exception:
Elastica\Exception\Connection\HttpException : Couldn't connect to host, Elasticsearch down?
It sounds most likely that Elasticsearch isn't running properly, rather than this being an issue with GetCandy.
If you run the following you should be able to determine if Elasticsearch is up.
curl localhost:9200
If you get a response with the Elasticsearch version etc, it is running. If it's not running, you'll need to check the Elasticsearch logs, normally found somewhere like /var/log/elasticsearch/
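The same check can be scripted. Below is a minimal Python sketch of it, assuming Elasticsearch listens on localhost:9200 and returns its usual JSON document with a version number at the root URL. The function name elasticsearch_version and the injectable opener parameter are hypothetical, added so the logic can be exercised without a live server.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_version(url="http://localhost:9200", opener=urllib.request.urlopen):
    """Return the version string Elasticsearch reports, or None if unreachable.

    `opener` is injectable (an assumption of this sketch) so the function
    can be tested without a running server.
    """
    try:
        with opener(url) as response:
            info = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, ConnectionError):
        # Covers the "connection refused" case, and any HTTP error response.
        return None
    return info.get("version", {}).get("number")


# Usage (needs network access to the host being checked):
# version = elasticsearch_version()
# print("Elasticsearch is up, version " + version if version else "Elasticsearch is not reachable")
```

A None result only says that nothing answered usefully at that address; the logs are still the place to find out why.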
I'm trying to create a new core with Solr 5.3. I had no experience working with Solr until a few days ago. I think I need this broken down Barney style. I've been through the system docs, wikis, YouTube, and random discussion boards. The information I've found is either not current or not what I'm seeing in my UI. I've now wasted five hours trying to get this to work. I'm out of options. I'm about ready to drop this project and start from scratch. I'm completely exasperated and throwing myself on the mercy of my betters. Can anyone just show me how to do it?
I followed these steps to add a core using the Solr Admin UI.
Start the Solr server using ~/solr-5.2.0/bin/solr start. This will start Solr on port 8983.
Now go to the Solr directory: cd ~/solr-5.2.0/server/solr.
Create a new folder, which will contain the Solr core configuration: mkdir newCore.
Now create a conf directory inside newCore and copy your schema.xml and solrconfig.xml into it, along with any other necessary files.
Go to the Solr Admin UI, Core Admin section. Specify the core name as required, and enter newCore (the name of the directory we created) in the instanceDir field. Click the Add Core button.
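The directory preparation in the steps above (everything before the Admin UI part) can be sketched in Python as follows. This is only an illustration: prepare_core_dir is a hypothetical helper, and the paths in the usage comment are the example paths from this answer.

```python
import shutil
from pathlib import Path


def prepare_core_dir(solr_home, core_name, config_files):
    """Create <solr_home>/<core_name>/conf and copy the config files into it.

    Hypothetical helper; it only prepares the directory layout. Registering
    the core is still done in the Admin UI.
    """
    conf_dir = Path(solr_home) / core_name / "conf"
    conf_dir.mkdir(parents=True, exist_ok=True)
    for source in config_files:
        shutil.copy(str(source), str(conf_dir))
    return conf_dir


# Usage, with the example paths from the steps above:
# prepare_core_dir(
#     Path.home() / "solr-5.2.0" / "server" / "solr",
#     "newCore",
#     ["schema.xml", "solrconfig.xml"],
# )
```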
I found a tutorial here: apache-solr-tutorial-beginners
I followed the exact instructions the author gives for creating a new core via the command line from solr-5.3.0/bin:
solr create -c jcg -d basic_configs
jcg then appeared in my Solr UI.
I went back and tried this same thing with my Project specs and it worked! I still have no idea how to do this from the UI but at least I can move forward an inch!
Hi, I am new to Elasticsearch. I installed Elasticsearch on my Windows 7 machine, but I don't know how to run and use Elasticsearch queries on Windows. Where should I type the queries, and where should I run them?
If anyone knows, please help me. Thanks in advance.
There are multiple ways to do that.
Via the HTTP interface, which means that you can run GET queries in your browser (Firefox, Chrome etc.) by accessing the proper URL, like:
http://localhost:9200/_search?q=tag:wow
Elasticsearch's HEAD plugin. You can execute any query with it, and it has a lot of additional functionality.
Install cURL for Windows and then run queries just like every tutorial suggests.
Use any programming language, like PHP, that supports the curl library.
Personally I prefer the HEAD plugin, since it has other functionality that I use anyway.
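As an example of the programming-language route, here is a minimal Python sketch that builds the same kind of search URL as the browser example above. The helper name build_search_url and its parameters are hypothetical, and the host and port are the defaults assumed in this answer.

```python
from urllib.parse import urlencode


def build_search_url(query, base_url="http://localhost:9200", index=None):
    """Build a URI search URL such as http://localhost:9200/_search?q=tag:wow.

    Hypothetical helper; pass `index` to search a single index only.
    """
    path = "/{}/_search".format(index) if index else "/_search"
    # Keep ":" unescaped so the URL matches the field:value form shown above.
    return "{}{}?{}".format(base_url.rstrip("/"), path, urlencode({"q": query}, safe=":"))


if __name__ == "__main__":
    print(build_search_url("tag:wow"))
    # To run the query (needs a running Elasticsearch instance):
    # import urllib.request
    # print(urllib.request.urlopen(build_search_url("tag:wow")).read().decode("utf-8"))
```

This works the same way on Windows, as long as Python is installed.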
You can also check the Sense plugin for Chrome. It will also help you with the syntax for queries.
You can get it from here:
https://github.com/bleskes/sense
I have tried adding an entry to the elasticsearch.yml file to create a custom analyzer, as described in this gist: https://gist.github.com/1403902
but I am getting the following error:
{"error":"RemoteTransportException[[Banner, Robert Bruce][inet[/192.168.1.15:9300]][indices/create]]; nested: MapperParsingException[mapping [type1]]; nested: MapperParsingException[Analyzer [string_lowercase] not found for field [field1]];
I am still not able to figure out how to do this. I have searched Stack Overflow and found similar replies (like the gist mentioned above).
Please provide an example that I can test.
Since the config file is only read at startup, you need to restart the Elasticsearch cluster for the changes to take effect. You can use the Shutdown API for that, or simply kill the Elasticsearch processes using the kill command.
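If restarting is inconvenient, an analyzer can also be defined in the index settings when the index is created, instead of in elasticsearch.yml. The Python sketch below only builds such a request body. It rests on several assumptions: that string_lowercase is meant to be a keyword tokenizer plus a lowercase filter (the gist's contents are not reproduced here), that the type and field are type1 and field1 as in the error message, and that the cluster is an older release that still uses named mapping types and the "string" field type. The function name build_index_body is hypothetical.

```python
import json


def build_index_body():
    """Return an index-creation body that defines the analyzer and uses it.

    Assumption: string_lowercase = keyword tokenizer + lowercase filter.
    The names type1 and field1 are taken from the error message above.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "string_lowercase": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "type1": {
                "properties": {
                    "field1": {"type": "string", "analyzer": "string_lowercase"}
                }
            }
        },
    }


if __name__ == "__main__":
    # Send this JSON as the body of a PUT to http://localhost:9200/<index name>/
    print(json.dumps(build_index_body(), indent=2))
```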