nutch 1.10 input path does not exist /linkdb/current - hadoop

When I run Nutch 1.10 with the following command (assuming that TestCrawl2 does not yet exist and needs to be created):
sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore2 urls/ TestCrawl2/ 20
I receive an error on indexing that claims:
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/apache-nutch-1.10/TestCrawl2/linkdb/current
The linkdb directory exists, but does not contain the 'current' directory. The directory is owned by root so there should be no permissions issues. Because the process exited from an error, the linkdb directory contains .locked and ..locked.crc files. If I run the command again, these lock files cause it to exit in the same place. Delete TestCrawl2 directory, rinse, repeat.
Note that the Nutch and Solr installations themselves have run previously without problems in a TestCrawl instance. It's only now that I'm trying a new one that I'm having problems. Any suggestions on troubleshooting this issue?

OK, it seems I have run into a version of this problem:
https://issues.apache.org/jira/browse/NUTCH-2041
It is a result of the crawl script not being aware of changes to ignore_external_links (the db.ignore.external.links property) in my nutch-site.xml file.
I am trying to crawl several sites and was hoping to keep my life simple by ignoring external links and leaving regex-urlfilter.txt alone (just using +.).
Now it looks like I'll have to change ignore_external_links back to false and add a regex filter for each of my URLs. Hopefully I can get a Nutch 1.11 release soon; it looks like this is fixed there.
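As a rough illustration of the "one regex filter per site" workaround, here is a small Node.js sketch that generates regex-urlfilter.txt rules. The helper name buildUrlFilter and the hosts example.com and example.org are placeholders, not anything from my setup. It assumes the usual regex-urlfilter.txt convention: a leading + accepts, a leading - rejects, and the first matching rule wins.

```javascript
// Hypothetical helper: build regex-urlfilter.txt rules that accept only the
// listed hosts (and their subdomains) and reject everything else.
function buildUrlFilter(hosts) {
  // Escape regex metacharacters so "example.com" matches literally.
  const escape = (host) => host.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const accept = hosts.map(
    (host) => `+^https?://([a-z0-9-]+\\.)*${escape(host)}/`
  );
  // The final "-." replaces the catch-all "+." and rejects any other URL.
  return [...accept, '-.'].join('\n') + '\n';
}

// Placeholder hosts; substitute the sites being crawled.
console.log(buildUrlFilter(['example.com', 'example.org']));
```

The output would replace the accept-everything +. line at the end of regex-urlfilter.txt, with ignore_external_links set back to false.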

Command line too long on maven when building hadoop from source in windows 10

I am trying to build hadoop from source as explained in this article. When building Apache common, everything fails with this error message: command line too long.
So, here's what I have tried (I will update this when I try more):
As suggested in this Stack Overflow answer, the way to overcome this error is to shorten the path as much as possible. So I moved the repository files from C:\user_name\.m2\repository\ to another directory and created a virtual drive that points to this new directory, like this:
subst M: D:\maven-2.0.8\repository
I then changed the <localRepository>M:</localRepository> tag in settings.xml under C:\apache-maven-2.0.8\conf to point to M:. After doing all this, I restarted my system and tried to build Hadoop again. But, as I can see from the error, Maven still downloads packages to C:\Users\user_name\.m2\repository\, not M:, and the "command line too long" error persists.
To shorten the path as much as possible, I also made a directory C:\mrepo and linked it to C:\user_name\.m2\repository\ with a directory junction, like this:
mklink /J C:\mrepo C:\Users\.m2\repository
After doing all this, I restarted my system and tried to build Hadoop again. But, as I can see from the error, Maven still downloads packages to C:\Users\user_name\.m2\repository\ and the "command line too long" error persists.
EDIT 1:
I have also set an environment variable named M2_HOME with the value M:\ and changed my \conf\settings.xml to <localRepository>${M2_HOME}</localRepository>. The issue still persists.
How do I fix this and build hadoop successfully?
Are you using IntelliJ?
Because in IntelliJ you have some options to shorten your command line.
Go to Run/Debug Configurations and, under "Shorten command line", try the classpath file option.
You can find more information about it in this blog post.
If you want to set the .m2 directory to something explicit, you can do so by overriding the default.
All Users:
Edit the \conf\settings.xml global configuration file. Change the value of the localRepository key to the absolute path of the local repository cache.
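Your user: set the same localRepository key in your per-user settings file (.m2\settings.xml under your user profile). Note that M2_HOME refers to the Maven installation directory rather than the repository, so setting it by itself should not move the local repository.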

All my Jenkins jobs and configs have disappeared after restart of my Mac

After updating macOS to Mojave (10.14.4), my Mac restarted, and upon opening Jenkins (at localhost:8080) it appeared that I had lost all my jobs and the entire system configuration.
There was only one user (admin) defined in my installation, and my usual password was rejected when I tried to log back in. So I tried another password I normally use, and it was accepted. I then found that all my jobs and configs had disappeared. It looked as if I had just started Jenkins for the first time.
Looking through Stack Overflow, I found suggestions to check the JENKINS_HOME variable to find out where the jobs are saved on disk, but when I typed export $JENKINS_HOME I just got an empty response. So it looks like I never configured it during setup.
I then dug through the hard drive and found folders matching the names of the jobs I had created under ~/.jenkins/workspace. However, all of the folders are empty. I was expecting to see the usual files, e.g. build.xml, config.xml, etc.
I then did a global search for build.xml and config.xml in Finder, and it turned up nothing.
Any idea where my jobs went and what could have caused all the contents of the folders of the jobs to be empty?
You can find your Jenkins home directory under "Manage Jenkins" -> "Configure System" -> "Home directory". Find out what the Jenkins home was before you restarted the Mac. It looks like your home directory was either deleted or Jenkins is now pointing to a new folder. Set it back to the earlier folder.
If it helps, I'm having a similar problem.
The curious part is that after the service restart a new ".jenkins" directory appeared inside '/var/root/'.
And now the password that Jenkins asks me for is not the one from '/Users/username/.jenkins/secrets/initialAdminPassword' but the one from the newest directory, with the same path pattern.
Simon

Google Cloud Functions and shared libraries

I'm trying to use wkhtmltopdf on GCF for PDF generation.
When my function tries to spawn the child process I get the following error:
Error: ./services/wkhtmltopdf: error while loading shared libraries: libXrender.so.1: cannot open shared object file: No such file or directory
The problem is clearly due to the fact that wkhtmltopdf binary depends on external shared libraries which are not installed in GCF environment.
Is there a way to solve this issue, or should I give up and use another solution (AWS Lambda or GAE)?
Thank you in advance
I've found a way to solve this issue by copying all the required libraries into the same folder (/bin for me) that contains the wkhtmltopdf binary. To let the binary use the uploaded libraries, I added the following lines to wkhtmltopdf.js:
wkhtmltopdf.command = 'LD_LIBRARY_PATH='+path.resolve(__dirname, 'bin')+' ./bin/wkhtmltopdf';
wkhtmltopdf.shell = '/bin/bash';
module.exports = wkhtmltopdf;
Everything worked fine for a while. Then I suddenly started receiving many connection errors and timeouts from GCF, but I think they are not related to my implementation but rather to Google.
I ended up setting up a dedicated server.
I have managed to get it working. Two things need to be done, as wkhtmltopdf won't work if:
libXrender.so.1 can't be loaded
you are using stdout to collect the resulting PDF (wkhtmltopdf has to write the result into a file)
First you need to obtain the correct version of libXrender.
I found out which Docker image Cloud Functions uses as the base for Node.js functions. I ran it locally, installed libxrender, and copied the library into my function's directory.
docker run -it --rm=true -v /tmp/d:/tmp/d gcr.io/google-appengine/nodejs bash
Then, inside the running container:
apt update
apt install libxrender1
cp /usr/lib/x86_64-linux-gnu/libXrender.so.1 /tmp/d
I put this library into my function's project directory, under a lib subdirectory. In my function's source file, I then set LD_LIBRARY_PATH to include the /user_code/lib directory (/user_code is the directory where Google ends up putting your function):
process.env['LD_LIBRARY_PATH'] = '/user_code/lib'
This is enough for wkhtmltopdf to be able to execute. It will still fail, though, because it won't be able to write to stdout, and the function will eventually time out and be killed (as Matteo experienced). I think this is because Google runs the containers without a tty (just speculation); I can run my code in their container if I run it with the docker run -it flags. To solve this, I invoke wkhtmltopdf so that it writes the output into a file under /tmp (this is an in-memory tmpfs). I then read the file back and send it as my response body. Note that the tmpfs might be reused between function calls, so you need to use a unique file every time.
This seems to do the trick and I am able to run wkhtmltopdf as Google CloudFunction.
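To make the write-to-a-file approach concrete, here is a minimal sketch of how it could look in Node.js. It is not my exact code: the helper names uniquePdfPath and renderPdf are made up for illustration, and it assumes the binary sits at ./bin/wkhtmltopdf and libXrender.so.1 at ./lib relative to the function source.

```javascript
const { execFile } = require('child_process');
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: a fresh file name per call, because the tmpfs under
// /tmp may be reused between function invocations.
function uniquePdfPath(dir = os.tmpdir()) {
  return path.join(dir, `out-${crypto.randomBytes(8).toString('hex')}.pdf`);
}

// Hypothetical helper: render `url` to a temp file, then read the file back.
// Assumed layout: ./bin/wkhtmltopdf and ./lib/libXrender.so.1.
function renderPdf(url, callback) {
  const outFile = uniquePdfPath();
  const env = {
    ...process.env,
    LD_LIBRARY_PATH: path.resolve(__dirname, 'lib'),
  };
  const binary = path.resolve(__dirname, 'bin', 'wkhtmltopdf');
  // wkhtmltopdf takes the input URL followed by the output file name,
  // so nothing is collected from stdout.
  execFile(binary, [url, outFile], { env }, (err) => {
    if (err) return callback(err);
    fs.readFile(outFile, (readErr, pdf) => {
      // Remove the temp file whether or not the read succeeded.
      fs.unlink(outFile, () => callback(readErr, pdf));
    });
  });
}
```

In the function handler, the buffer passed to the callback would then be sent as the response body with a Content-Type of application/pdf.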

how to customize login page for shibboleth idp

I would like to customize the login page and I'm trying to follow the shibboleth wiki, but I'm not sure where to find " src/main/webapp/login.jsp within your IdP distribution package" in order to modify it. My shibboleth resides in /opt/shibboleth-idp, but I don't have a src folder in there. Any help would be appreciated.
For IdP version 3, you can customize by changing the files in the "views" directory. These are Apache Velocity templates, and you can make changes that become active without having to rebuild the war file.
(sorry this is two months late, but...)
the files for login are not stored inside your shibboleth-idp directory. (well, they're sorta in there...rolled into the java war file.)
somewhere, there should be a directory that was used to build your shibboleth-idp instance. many times i've seen it in the same folder as the shibboleth-idp folder, but it doesn't have to be. so since yours is /opt/shibboleth-idp, it might be at /opt/shibboleth-identityprovider-version.number. if not, use the find command as already suggested, but maybe try something like
find / -name 'shibboleth-identityprovider*' -ls 2>/dev/null
unless someone built it off-box, that folder should exist somewhere. inside there is the src directory where login.jsp resides.
the install script the shib doc tells you to run after making your changes is at the top level of that shibboleth-identityprovider-version.number folder too (install.sh for unix). when you run the install script, you tell it where to put the idp files (in your case, /opt/shibboleth-idp).
also, before running the install script, it's a good idea to back up your conf directory. you might accidentally tell the install script to overwrite it. or it might do it even if you told it not to (bug in some versions).
I recommend starting with the Linux find command:
find /opt/shibboleth-idp/ -name login.jsp

WKHTMLTOPDF and "Error: Unable to create temporery file"

I've written a piece of code in PHP to generate PDF using WKHTMLTOPDF binary file. It was working fine till I had to recompile my Apache. Now it fails with error Error: Unable to create temporery file (this is the exact wording).
The situation in which the error is reproducible is a little complicated. I managed to narrow it down, and now I'm pretty sure that the error happens because of the user that Apache runs as. It seems to me that when WKHTMLTOPDF is running as a user with no home folder, it's unable to access a temporary folder within the user's home folder.
Surely I can change Apache's user, but I would rather resolve this problem once and for all. To this end, it would be great if I could somehow set the temp folder for WKHTMLTOPDF, or at least print its current value so I can make it valid. Does anyone know how to do either of these?
BTW, I'm using WKHTMLTOPDF 0.11.0 rc1.
I saw the same error today with Rails 4 + pdfkit gem (0.8.2) + wkhtmltopdf (0.12.2.1) on CentOS 6.7.
This error came from wkhtmltopdf, and the reason was that it couldn't create a temporary file. wkhtmltopdf depends on some temporary-filename creation API (I'm not sure which one), but the following probably gives some hints:
$ man tempfile
$ man tempnam
In my case, my TMPDIR environment variable pointed to a wrong path (I had accidentally deleted the directory!), so wkhtmltopdf couldn't create its work file.
When I unset TMPDIR, it worked! Of course, setting TMPDIR to a directory that actually exists should be OK too.
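For anyone who wants to check this quickly, here is a small Node.js sketch that prints TMPDIR and reports whether the directory exists and is writable. The helper name checkTmpDir is made up for illustration, and the script has to be run as the same user and with the same environment as the process that launches wkhtmltopdf for the result to mean anything.

```javascript
const fs = require('fs');
const os = require('os');

// Hypothetical helper: report whether `dir` exists as a directory and
// whether the current user may write to it.
function checkTmpDir(dir) {
  let exists = false;
  let writable = false;
  try {
    exists = fs.statSync(dir).isDirectory();
  } catch (err) {
    exists = false; // missing path, or not reachable by this user
  }
  if (exists) {
    try {
      fs.accessSync(dir, fs.constants.W_OK);
      writable = true;
    } catch (err) {
      writable = false;
    }
  }
  return { dir, exists, writable };
}

console.log('TMPDIR =', process.env.TMPDIR || '(unset)');
// Fall back to the platform default when TMPDIR is unset.
console.log(checkTmpDir(process.env.TMPDIR || os.tmpdir()));
```

If exists or writable comes back false, unsetting TMPDIR or pointing it at a valid directory is the thing to try.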
