Nutch and HBase for production - hadoop

I am currently using Nutch 2.2.1 and HBase 0.90.4. I am expecting around 300K URLs from about 10 URLs in the seed list; I had already crawled that many with Nutch 1.6. Since I want to manipulate the data, I preferred to go the Nutch 2.2.1 + HBase route. But I get all sorts of weird errors and the crawl doesn't seem to progress.
Various errors such as:
zookeeper.ClientCnxn - Session for server null, unexpected error, closing socket connection and attempting reconnect. This is the one I get most frequently.
bin/crawl: line 164: killed - I get this error during the fetch step, and the crawl gets killed all of a sudden.
RSS parse error
I am using an all-in-one crawl command: bin/crawl urls 1 http://localhost:8983/solr/ 10
(syntax: bin/crawl <seed-dir> <crawl-id> <solr-url> <number-of-rounds>)
Please suggest where I am going wrong. I have Nutch 2.2.1 installed and HBase (standalone) installed as per the quick start guide recommended on the Nutch site. I am not sure whether the HBase 0.90.4 standalone setup from the quick start guide is sufficient to handle 300K crawled URLs.
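For reference, this is the quick sanity check I now do before each crawl round to confirm the standalone HBase (and its embedded ZooKeeper) is actually up and reachable; it is only a sketch and assumes a default standalone install with HBASE_HOME set and HBase started via start-hbase.sh:
jps | grep HMaster                             # a standalone HBase runs everything inside a single HMaster process
echo "status" | $HBASE_HOME/bin/hbase shell    # should report "1 servers, 0 dead" if the instance is healthy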
Edit # 1: RSS Parse Error - log information
Error tika.TikaParser - Error parsing http://www.###.###.##/###/abc.xml
org.apache.tika.exception.TikaException: RSS parse error
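To reproduce the parse failure outside of the crawl, I test the feed on its own with Nutch's parser checker (just a sketch, assuming your bin/nutch build includes the parsechecker command; the URL is the masked feed from the log above):
bin/nutch parsechecker http://www.###.###.##/###/abc.xml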

Related

Akeneo PIM No alive nodes found in your cluster ERROR

I keep getting the same error when starting the Akeneo Community Edition! It seems to be an error caused by Elasticsearch, but I cannot figure out what is wrong.
The Error message:
[OK] Database schema created successfully!
Updating database schema...
37 queries were executed
[OK] Database schema updated successfully!
Reset elasticsearch indexes
In StaticNoPingConnectionPool.php line 50:
No alive nodes found in your cluster
I'm running on an Uberspace server without Docker, and I'm trying to install it as described here:
https://docs.akeneo.com/4.0/install_pim/manual/installation_ee_archive.html but with the Community Edition instead.
Has anyone had the same error and knows how to help me out?
Maybe it's a problem with the .env file for the Elasticsearch entry point. My .env: APP_INDEX_HOSTS=localhost:9200
Can you verify that the Elasticsearch search server is available on localhost:9200 when accessing it via curl/Postman/Sense or something else?
That error usually means the node is either not running, or not running on the configured port.
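For example, a quick check from the shell (just a sketch; the host and port are whatever you configured in APP_INDEX_HOSTS):
curl http://localhost:9200                            # should return a JSON blob with the cluster name and version
curl http://localhost:9200/_cluster/health?pretty     # "status" should be green or yellow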
Also pay attention that your server meets the system requirements: https://docs.akeneo.com/4.0/install_pim/manual/system_requirements/system_requirements.html

Google Cloud Data flow jobs failing with error 'Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5...'

SDK: Apache Beam SDK for Go 0.5.0
We are running Apache Beam Go SDK jobs in Google Cloud Dataflow. They had been working fine until recently, when they intermittently stopped working (no changes were made to code or config). The error that occurs is:
Failed to retrieve staged files: failed to retrieve worker in 3 attempts: bad MD5 for /var/opt/google/staged/worker: ..., want ; bad MD5 for /var/opt/google/staged/worker: ..., want ;
(Note: it seems as if the second hash value is missing from the error message.)
As best I can guess, there's something wrong with the worker: it seems to be comparing MD5 hashes of the staged worker and missing one of the values, but I don't know exactly what it's comparing against.
Does anybody know what could be causing this issue?
The fix for this issue seems to have been to rebuild the worker_harness_container_image with the latest changes. I had tried this, but I didn't have the latest release when I built it locally. After I pulled the latest from the Beam repo, rebuilt the image (as per the notes here: https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md), and reran the job, it seemed to work again.
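In case it is useful, this is roughly how I pointed the job at the rebuilt image. It is only a sketch: my_pipeline.go, my-gcp-project, my-bucket and the image tag are placeholders for your own values, and worker_harness_container_image is the Dataflow runner flag as I remember it at the time:
go run my_pipeline.go \
  --runner=dataflow \
  --project=my-gcp-project \
  --staging_location=gs://my-bucket/staging \
  --worker_harness_container_image=gcr.io/my-gcp-project/beam-go-worker:latest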
I'm seeing the same thing. If I look at the Stackdriver logs, I see this:
Handler for GET /v1.27/images/apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515/json returned error: No such image: apache-docker-beam-snapshots-docker.bintray.io/beam/go:20180515
However, I can pull the image just fine locally. Any idea why Dataflow cannot pull it?

[Error]: Failed to run command with error: Error Domain=Parse Code=428

I get this error sometimes when trying to save things to Parse or to fetch data from it.
This is not constant; it appears once in a while, making the operation fail.
I have contacted Parse for that. Here is their answer:
Starting on 4/28/2016, apps that have not migrated their database may see a "428" error code if the request cannot be handled by the remaining shared pool of resources. If you see this error in your logs, we highly recommend migrating the database for your app without delay.
This means that, starting on that date, all apps are on low priority except those that have started the DB migration. So migrating the database should resolve it.

What is Nutch 1.10 crawl command for elasticsearch

I am new to Nutch 1.10 and am trying to learn how to crawl with it, using Elasticsearch as my indexer. Not sure why, but I cannot get this crawl command to work:
bin/crawl -i --elastic -D elastic.server.url=http://localhost:9200/elastic/ urls elasticTestCrawl 1
UPDATE: just used
bin/crawl -i -D elastic.server.url=http://localhost:9200/elastic/ urls/ elasticTestCrawl/ 2
It ran almost successfully; I received the following error when it came to the indexing part of the command:
Error running:
/home/david/apache-nutch-1.10/bin/nutch clean -Delastic.server.url=http://localhost:9200/elastic/ elasticTestCrawl//crawldb
Failed with exit value 255.
What is exit value 255 for Nutch 1.x? And why does the space get removed between "-D" and "elastic..."?
I have these ElasticSearch Properties from here in my nutch-site.xml file:
If someone can point me to the error of my ways, that would be great!
Update
I just posted my own answer below; it's the second one. I had already accepted the first answer months ago when I initially got it working. My answer is simply clearer and more concise, to make it easier (and quicker) to get started with Nutch.
Unfortunately I can't tell you where you're going wrong, as I'm in the same boat, although from what I can see you are running Nutch and Elasticsearch on the same box, whereas I've split them across two.
I've not got it to work yet, but according to a guide I found on integrating Nutch 1.7 with Elasticsearch, it should just be:
bin/crawl urls/ TestCrawl -depth 3 -topN 5
It may just be that it isn't working for me because I've added the extra complication of networking.
I also assume you have created an index called elasticTestIndex in your Elasticsearch instance and launched it on the box before trying to run your crawl?
Should it be of help, the guide I got that command from is:
https://www.mind-it.info/integrating-nutch-1-7-elasticsearch/
Update:
I'm not sure I'm quite there yet but using your update I've got further than I had.
You are putting in port 9200, which is the web administration port, but you need to use port 9300 to interact with the service, so change the port to 9300.
I'm not sure, but I think the portion after the slash refers to the index, so in your example make sure you have "elastic" set up as an index, or change the URL (low rep score, so I can't put in too many URLs) to localhost:9300/[index name]/ so that it uses an index you have created. If you haven't created one, you can do so from a PuTTY session with the following command.
curl -XPUT 'http://localhost:9200/[index name]/'
Using the command you supplied with the alternative port, it did run, although I've yet to extract the crawl data from Elasticsearch.
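For reference, the command that ran for me was simply yours with the transport port swapped in (a sketch; "elastic" here stands for whatever index name you actually created):
bin/crawl -i -D elastic.server.url=http://localhost:9300/elastic/ urls/ elasticTestCrawl/ 2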
Supplemental Update:
It's successfully dumping data crawled by Nutch into Elasticsearch for me, and having put a different index on the command line, I can tell you it ignores that and uses whatever is in your nutch-site.xml.
To help anyone else get it working:
Start off by reading this blog post to help you get Elasticsearch configured to work with Nutch.
After that read this Nutch doc to get familiar with the NEW cli command for running the crawl script. (Works for 1.9+)
Follow the example of the new Nutch crawl script command on that page. You have to change it a bit for Elasticsearch:
solr.server.url=http://localhost:8983/solr/ to something like
elastic.server.url=http://localhost:9300/yourelasticindex/
So basically there are 2 steps:
Configure Elasticsearch to work with Nutch (click on the first link above).
Change the new CLI command for Solr to work with Elasticsearch (its default is Solr).
Hope that helps!
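Putting the two steps together, the sequence looks roughly like this (a sketch; yourelasticindex is just a placeholder, and as noted above the index Nutch actually writes to is the one configured in nutch-site.xml):
curl -XPUT 'http://localhost:9200/yourelasticindex/'
bin/crawl -i -D elastic.server.url=http://localhost:9300/yourelasticindex/ urls/ elasticTestCrawl/ 2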

How to access WSO2 BAM's hadoop job tracker?

I am quite new to BAM, and one of my Hive queries is broken.
However, I can't find what's wrong, since the only error it gives me is:
ERROR: Error while executing Hive script.Query returned non-zero code: 9, cause: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
I've looked around and found out that BAM is only capable of displaying that much information; for more detail I need to look in Hadoop's job tracker. However, I can't find any info on how to turn it on or access it in the BAM server.
So how do I access it / turn it on?
Please don't be misled by the exception. Most probably this is a problem with the Hive query. To get a proper idea of the problem, you should check the backend console log.
It seems like the problem is most probably with your Hive query and not with the Hadoop job tracker. To make sure, please run one of the samples [1] and check whether its Hive queries execute properly. If those Hive queries execute without a problem and summarized results are displayed in the dashboards, the problem is likely with your own Hive query.
[1] - http://docs.wso2.org/display/BAM240/HTTPD+Logs+Analysis+Sample
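If it does turn out to be your query, the full Hive stack trace usually ends up in the server's backend log rather than in the BAM UI. Something like this is what I would check (a sketch, assuming a default install where <BAM_HOME> is your server directory):
grep -B 5 -A 30 "Execution Error" <BAM_HOME>/repository/logs/wso2carbon.log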
