Hi, how can I install dependencies on job clusters? I am unable to find any documentation on this (Maven).

https://mvnrepository.com/artifact/com.crealytics/spark-excel
I am trying to install the above dependency on my job cluster so that I can read from an Excel file in my code. Could you please help me ensure that any dependency for my ETL pipeline is installed/imported when the cluster is initialized?
I am currently using `!pip install -r requirements.txt` to install my pip libraries, but I am unable to do the same for the Maven package.
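For what it's worth, one way to express this (a sketch only; the workspace URL, node type, notebook path, and the exact spark-excel artifact/version below are placeholders) is to declare the Maven coordinate in the job definition's libraries list, so the package is installed whenever the job cluster is created:

# Sketch: create a job whose new cluster installs a Maven library at startup (Jobs API 2.1).
# <databricks-instance>, <node-type>, the notebook path, and the spark-excel version are placeholders.
curl -X POST https://<databricks-instance>/api/2.1/jobs/create \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
        "name": "excel-etl",
        "tasks": [{
          "task_key": "main",
          "notebook_task": {"notebook_path": "/Repos/etl/main"},
          "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "<node-type>",
            "num_workers": 2
          },
          "libraries": [
            {"maven": {"coordinates": "com.crealytics:spark-excel_2.12:<version>"}}
          ]
        }]
      }'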

Related

Maven built Hadoop rpms have no files in them

I am trying to build an HDP 3.1.4 cluster by replacing the Cloudera-provided HDFS and YARN rpms with the Apache Hadoop ones.
For the same, I am building the Hadoop 3.1.1 rpms using the maven method with the below command which completes successfully.
`mvn -B clean install rpm:rpm -DnewVersion=3.1.1 -DskipTests`
However, when I list the files in the rpms using rpm -qlp, it says "contains no files":
rpm -qlp hadoop-main-3.1.1-1.noarch.rpm
(contains no files)
Any help will be appreciated.
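If it helps narrow things down, a quick diagnostic (a sketch; the output paths depend on how the rpm-maven-plugin is configured in your build) is to list every rpm the build produced and see which of them actually package files. The hadoop-main artifact corresponds to the top-level parent POM, so an empty rpm for that one in particular may be expected.

# Sketch: find every rpm the build produced and list the files each one packages.
# Paths assume the plugin writes under each module's target/ directory.
find . -path '*/target/*' -name '*.rpm' -print | while read -r pkg; do
    echo "=== $pkg"
    rpm -qlp "$pkg"
done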

Apache Beam PyPI packages downloading forever

I am running an Apache Beam pipeline on Dataflow with 3 PyPI packages defined in a requirements.txt file. When I run my pipeline with the option "--requirements_file=requirements.txt", it submits the command below to download the PyPI packages.
python -m pip download --dest /tmp/requirements-cache -r requirements.txt --exists-action i --no-binary :all:
This command takes a huge amount of time to download the packages. I tried running it manually as well; it runs forever.
Why is Apache Beam using the --no-binary :all: option? This is the root cause of the long duration. Am I making a mistake, or is there another way to decrease the pip download time?
This is because the packages need to be installed on the workers, and Beam doesn't want to download binaries specific to whatever machine you happen to be launching the pipeline from.
If you have a lot of dependencies, the best solution is to use custom containers.
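As a rough sketch of the custom-container route (the image name, project, and region below are placeholders, and it assumes a recent Beam SDK where the --sdk_container_image option is available), you bake the requirements into an image built on the Beam SDK base and point the pipeline at it:

# Sketch: pre-install the dependencies in a custom SDK container, then reference it at launch.
# The Dockerfile is assumed to start FROM apache/beam_python3.9_sdk:<beam-version>
# and to RUN pip install -r requirements.txt.
docker build -t gcr.io/<project>/beam-deps:latest .
docker push gcr.io/<project>/beam-deps:latest
python my_pipeline.py \
    --runner=DataflowRunner \
    --project=<project> \
    --region=<region> \
    --sdk_container_image=gcr.io/<project>/beam-deps:latest

With the dependencies already in the image, the slow source-only download triggered by --requirements_file can be avoided.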

How can I install dask[complete] manually?

I want to use the package "Dask", but there is one problem.
"Dask dataframe requirements are not installed."
Obviously, we can use pip install "dask[dataframe]" or pip install "dask[complete]".
However, in the secured server where I work, there is no internet connection.
So I transfer the package files and install them manually.
But I cannot find a package for dask[dataframe] to download.
How can I install the rest of packages manually without internet connection?
Thank you
You should look at the setup.py file in the dask repository to see which dependencies each extra requires.
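As a sketch of the usual offline workflow (the dask-pkgs directory name is arbitrary), you can resolve the extra and all of its dependencies on a machine that does have internet access, copy the directory over, and install from it without touching the network:

# On a machine with internet access: download dask plus everything dask[complete] pulls in
pip download "dask[complete]" -d ./dask-pkgs
# After copying ./dask-pkgs to the secured server: install from the local directory only
pip install --no-index --find-links ./dask-pkgs "dask[complete]"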

Unable to install Search Guard plugin for Elasticsearch-5.x

Due to restrictions, I am not allowed to install any packages from the internet, so this command is not useful for me for installing Search Guard.
bin/elasticsearch-plugin install -b com.floragunn:search-guard-ssl:<version>
However, I am able to install Search Guard successfully on a different network by running the above command.
For this reason, I tried installing Search Guard from the tar.gz/zip file with the command below, as per the documentation.
/usr/share/elasticsearch# bin/elasticsearch-plugin install file:///home/xxxx/xxxx/search-guard-5-5.2.0-10-sgadmin-standalone.zip
This one is failing with the below error.
-> Downloading file:///home/xxx/xxxx/search-guard-5-5.2.0-10-sgadmin-standalone.zip
[=================================================] 100%  
ERROR: `elasticsearch` directory is missing in the plugin zip
I downloaded the zip/tar.gz from this Maven repository of Search Guard.
Is anyone else facing the same issue? If not, kindly help me solve this one.
Download this file from maven to /home/xxxx:
https://oss.sonatype.org/content/repositories/releases/com/floragunn/search-guard-5/5.2.0-11/search-guard-5-5.2.0-11.zip
Install it:
bin/elasticsearch-plugin install -b file:///home/xxxx/search-guard-5-5.2.0-11.zip
Other releases are available here: https://oss.sonatype.org/content/repositories/releases/com/floragunn/
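As a small sanity check before installing (assuming unzip is available on the host), you can confirm the downloaded archive actually contains the top-level elasticsearch directory the installer complains about:

# The plugin installer expects an elasticsearch/ directory inside the zip;
# the sgadmin-standalone bundle does not contain one, while the plugin zip does.
unzip -l /home/xxxx/search-guard-5-5.2.0-11.zip | head -20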

Failed dependencies when installing the PXF service

When I install the PXF service rpm in HAWQ, I get some errors:
error: Failed dependencies:
hadoop >= 2.6.0 is needed by pxf-service-0:3.0.0-root.noarch
hadoop-hdfs >= 2.6.0 is needed by pxf-service-0:3.0.0-root.noarch
What's your advice here?
Please make sure the PXF rpm's OS and architecture version matches your system. For example, if the PXF rpm is built for RHEL6 and you are installing on RHEL7, then you may see some dependency issues.
Could you please check which version of Hadoop you are running in the cluster? I guess you might be running a lower version of Hadoop. You have to run at least Hadoop 2.6 to run the current version of PXF.
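A quick way to see what the rpm dependency check is looking at (a sketch, assuming an rpm-based Hadoop install) is:

# Show which hadoop rpms (and versions) are registered with the package manager
rpm -q hadoop hadoop-hdfs
# If Hadoop was installed from a tarball instead of rpms, rpm cannot see it at all,
# which also produces the "Failed dependencies" error regardless of the running version
hadoop version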
The wiki here uses the Bigtop (Hadoop) rpms.
https://cwiki.apache.org/confluence/display/HAWQ/Build+Package+and+Install+with+RPM
This means that if I install with rpm (HAWQ 2.2.0), the other ways (using a binary Hadoop without rpm installs, e.g. a tarball) are not supported.
If I install Hadoop from a tarball, I must build HAWQ from source code for now.
Please refer to:
https://issues.apache.org/jira/browse/HAWQ-1568
