Pyarrow unable to load libhdfs on Windows 10 - hadoop

I am trying to use pyarrow on Windows, but I'm getting the following error with fs.HadoopFileSystem():
OSError Traceback (most recent call last)
Cell In[1], line 2
1 from pyarrow import fs
----> 2 hdfs = fs.HadoopFileSystem(host='localhost', port=9870)
File c:\prj\study\.venv\lib\site-packages\pyarrow\_hdfs.pyx:96, in pyarrow._hdfs.HadoopFileSystem.__init__()
File c:\prj\study\.venv\lib\site-packages\pyarrow\error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()
File c:\prj\study\.venv\lib\site-packages\pyarrow\error.pxi:115, in pyarrow.lib.check_status()
OSError: Unable to load libhdfs: The specified module could not be found.
I followed the steps on this site to install Hadoop using binaries from Apache, and I am able to use it through cmd. However, when I checked libhdfs.so in lib/native, it shows as a 0 KB file. Is this normal, or do I have to compile the Hadoop source myself to get a correct libhdfs.so?
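For context, pyarrow does not bundle libhdfs; at runtime it looks for the library (hdfs.dll on Windows) via HADOOP_HOME or ARROW_LIBHDFS_DIR, and it also needs JAVA_HOME and a CLASSPATH containing the Hadoop jars. The 0 KB libhdfs.so in the Apache binary release is not a usable Windows library, so a Windows build of libhdfs is likely needed. Below is a minimal sketch of the environment setup pyarrow expects; every path is a placeholder, and the port should normally be the NameNode RPC port (often 8020 or 9000) rather than the 9870 web UI port.

import os
import subprocess
from pyarrow import fs

# All paths below are placeholders; adjust them to your installation.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_301"
os.environ["HADOOP_HOME"] = r"C:\hadoop"
# Explicit location of libhdfs (hdfs.dll on Windows), if it is not under
# %HADOOP_HOME%\lib\native.
os.environ["ARROW_LIBHDFS_DIR"] = r"C:\hadoop\lib\native"
# libhdfs needs the Hadoop jars on the CLASSPATH at runtime.
os.environ["CLASSPATH"] = subprocess.check_output(
    "hadoop classpath --glob", shell=True, text=True
).strip()

# Connect to the NameNode RPC endpoint, not the web UI port.
hdfs = fs.HadoopFileSystem(host="localhost", port=9000)
print(hdfs.get_file_info(fs.FileSelector("/")))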

Related

Define setup.py dependencies from a private PyPI

I'd like to install dependencies from my private PyPI by specifying them within a setup.py.
I've already tried to specify where to find dependencies within the dependency_links this way:
setup(
    ...
    install_requires=["foo==1.0"],
    dependency_links=["https://my.private.pypi/"],
    ...
)
I've also tried to define the entire URL within the dependency_links:
setup(
    ...
    install_requires=[],
    dependency_links=["https://my.private.pypi/foo/foo-1.0.tar.gz"],
    ...
)
but when I try to install with python setup.py install, neither of them works for me.
Can anybody help me?
EDITS:
With the first piece of code I got this error:
...
Installed .../test-1.0.0-py3.7.egg
Processing dependencies for test==1.0.0
Searching for foo==1.0
Reading https://my.private.pypi/
Reading https://pypi.org/simple/foo/
Couldn't find index page for 'foo' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.org/simple/
No local packages or working download links found for foo==1.0
error: Could not find suitable distribution for Requirement.parse('foo==1.0')
while in the second case I didn't get any error, just the following:
...
Installed .../test-1.0.0-py3.7.egg
Processing dependencies for test==1.0.0
Finished processing dependencies for test==1.0.0
UPDATE 1:
I've tried to change the setup.py following sinoroc's instructions. Now my setup.py looks like this:
setup(
    ...
    install_requires=["foo==1.0"],
    dependency_links=["https://username:password@my.private.pypi/folder/foo/foo-1.0.tar.gz"],
    ...
)
I built the library test with python setup.py sdist and tried to install it with pip install /tmp/test/dist/test-1.0.0.tar.gz, but I still get this error:
Processing /tmp/test/dist/test-1.0.0.tar.gz
ERROR: Could not find a version that satisfies the requirement foo==1.0 (from test==1.0.0) (from versions: none)
ERROR: No matching distribution found for foo==1.0 (from test==1.0.0)
Regarding the private PyPI, I don't have any additional information because I'm not its administrator. As you can see, I just have the credentials (username and password) for that server.
Additionally, that PyPI is organised in sub-folders, https://my.private.pypi/folder/.., which is where the dependency I want to install lives.
UPDATE 2:
By running pip install --verbose /tmp/test/dist/test-1.0.0.tar.gz, it seems there is only one location searched for the library foo: the public server https://pypi.org/simple/foo/, and not our private server https://my.private.pypi/folder/foo/.
Here is the output:
...
1 location(s) to search for versions of foo:
* https://pypi.org/simple/foo/
Getting page https://pypi.org/simple/foo/
Found index url https://pypi.org/simple
Looking up "https://pypi.org/simple/foo/" in the cache
Request header has "max_age" as 0, cache bypassed
Starting new HTTPS connection (1): pypi.org:443
https://pypi.org:443 "GET /simple/foo/ HTTP/1.1" 404 13
Status code 404 not in (200, 203, 300, 301)
Could not fetch URL https://pypi.org/simple/foo/: 404 Client Error: Not Found for url: https://pypi.org/simple/foo/ - skipping
Given no hashes to check 0 links for project 'foo': discarding no candidates
ERROR: Could not find a version that satisfies the requirement foo==1.0 (from test==1.0.0) (from versions: none)
Cleaning up...
Removing source in /private/var/...
Removed build tracker '/private/var/...'
ERROR: No matching distribution found for foo==1.0 (from test==1.0.0)
Exception information:
Traceback (most recent call last):
...
In your second attempt, I believe you should still have foo==1.0 in the install_requires.
Update
Be aware that pip does not support dependency_links (it used to, but does not anymore).
For pip, the alternative is to use command line options such as --index-url, --extra-index-url, or --find-links. These options cannot be enforced on the users of your project (contrary to the dependency links from setuptools), so they have to be properly documented. To facilitate this, a good idea is to provide an example requirements.txt file to the users of your project. This file can contain such pip options.
For example:
# requirements.txt
# ...
--find-links 'https://my.private.pypi/'
foo==1.0
# ...
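A user of your project would then install against that requirements file, or pass the same option directly on the pip command line; the private index URL below is the same placeholder used above:
pip install --requirement requirements.txt
pip install --find-links 'https://my.private.pypi/' foo==1.0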

Running PySpark on PyCharm

On a Mac (v. 10.14.5), I am trying to run PySpark programs in PyCharm (professional edition, v. 19.2).
I know my simple PySpark program is fine, because when I run it with spark-submit from the terminal, outside PyCharm, using the Spark I installed via brew, it works as expected. I have tried linking PyCharm to this version of Spark, but am getting other issues.
I followed multiple sets of instructions online (for example, this Stack Overflow thread) to install pyspark within PyCharm (Preferences -> Project Interpreter) and set the SPARK_HOME environment variable to the appropriate venv directory (Run -> Edit Configurations -> Environment Variables).
But I get an error message when I run the program:
Failed to find Spark jars directory (/Users/rahul/PycharmProjects/spark-demoII/venv/assembly/target/scala-2.12/jars).
You need to build Spark with the target "package" before running this program.
Traceback (most recent call last):
File "/Users/rahul/PycharmProjects/spark-demoII/run.py", line 6, in <module>
sc = SparkContext("local", "SimpleApp")
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Process finished with exit code 1
Does anyone know how to get PyCharm to run PySpark programs on a similar machine?
In response to @pissal's suggestion:
I had tried that previously, but that version of Spark does not work. I tried it again anyway: after switching to a virtual environment, I did a pip install pyspark. To check whether this version of Spark works, I ran spark-submit run.py (outside of PyCharm), and here is the error message:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/rahul/.virtualenvs/test1/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.11-2.4.4.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:348)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:348)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3720)
at java.base/java.lang.String.substring(String.java:1909)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
... 25 more
So the reason this was happening was that PySpark had not been updated to work with the latest version of Java. After removing Java 13, I made sure my Homebrew installation of Spark uses Java 1.8, then added the following to the Environment Variables in Run -> Edit Configurations in PyCharm:
SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.4/libexec
With these settings, I can run PySpark jobs in PyCharm.
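As an alternative to the Run Configuration, the same environment can be set at the top of the script itself, before the SparkContext is created; here is a minimal sketch assuming the Homebrew Spark path above (the JAVA_HOME path is a placeholder for a local Java 1.8 install):

import os

# The environment must be set before PySpark launches the Java gateway.
# The JAVA_HOME path is a placeholder; adjust it to your machine.
os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/2.4.4/libexec"
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_231.jdk/Contents/Home"

from pyspark import SparkContext

sc = SparkContext("local", "SimpleApp")
print(sc.parallelize(range(100)).sum())  # quick sanity check; prints 4950
sc.stop()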

Neo.ClientError.Statement.ExternalResourceFailed on Windows

I just installed Neo4j 3.2.3 on my Windows notebook and I am trying to load a CSV file.
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
WITH toUpper(line.TEST_NAME) AS TEST_NAME
CREATE(TEST);
I got the following error:
Neo.ClientError.Statement.ExternalResourceFailed: Couldn't load the external resource at: file:/C:/Users/*****/Documents/Neo4j/default.graphdb/import/test.csv
I cannot locate the neo4j.conf or properties files in the default.graphdb directories.
Can someone help?
I'm on Windows 10 with Neo4j 3.2.3.
I can load from a URL, but with the same file format I fail to load from a local file.
LOAD CSV FROM "https://gist.githubusercontent.com/jexp/d788e117129c3730a042/raw/a147631215456d86a77edc7ec08c128b9ef05e3b/people_leading_empty.csv"
AS line
WITH line LIMIT 4
RETURN line
This works successfully.
With the same file saved as .../Neo4j/default.graphdb/import/test1.csv:
LOAD CSV WITH HEADERS FROM "file:///test1.csv" AS line
WITH line LIMIT 4
RETURN line
I get this error:
Neo.ClientError.Statement.ExternalResourceFailed: Couldn't load the external resource at: file:/C:/Users/....../Documents/Neo4j/default.graphdb/import/test.csv
From the error message, it seems Neo4j can locate the file but cannot perform the LOAD from the local path.
I cannot locate the neo4j.conf or properties files in the default.graphdb directories.
According to the Neo4j docs, the neo4j.conf file is located at <neo4j-home>\conf\neo4j.conf for zip packages and at %APPDATA%\Neo4j Community Edition\neo4j.conf for the desktop installation.
You should locate the configuration file and set the following line:
dbms.security.allow_csv_import_from_file_urls=true
The LOAD CSV operation in Neo4j can only read UTF-8 encoded files, so try converting the import file to UTF-8 (save the .csv file as CSV UTF-8).
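If the file was exported in a non-UTF-8 codepage (Excel on Windows often writes cp1252), a small script like the following can re-encode it; both paths and the source encoding are assumptions to adjust for your file:

# Re-encode a CSV to UTF-8 so that LOAD CSV can read it.
# "cp1252" and both paths are assumptions; replace them with your own.
src = r"C:\Users\me\Documents\test.csv"
dst = r"C:\Users\me\Documents\Neo4j\default.graphdb\import\test.csv"

with open(src, "r", encoding="cp1252", newline="") as fin, \
     open(dst, "w", encoding="utf-8", newline="") as fout:
    fout.write(fin.read())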

Connecting to Hue via external host

I've recently installed Hue and am having problems connecting to the interface via an external host; I can connect locally fine. My hue.ini file is configured with http_host=0.0.0.0 and http_port=8888. I've seen some posts about fixing this by setting "Bind Hue Server to Wildcard Address" in Cloudera Manager. I do not have Cloudera Manager; what is the corresponding way to do this in a standalone Hue installation?
error.log shows the following:
[24/Nov/2015 03:02:12 -0800] models ERROR error syncing oozie
Traceback (most recent call last):
File "/usr/local/hue/desktop/core/src/desktop/models.py", line 269, in sync
from oozie.models import Workflow, Coordinator, Bundle
ImportError: No module named oozie.models
[24/Nov/2015 03:02:12 -0800] models ERROR error syncing beeswax
Traceback (most recent call last):
File "/usr/local/hue/desktop/core/src/desktop/models.py", line 296, in sync
from beeswax.models import SavedQuery
ImportError: No module named beeswax.models
[24/Nov/2015 03:02:12 -0800] models ERROR error syncing pig
Traceback (most recent call last):
File "/usr/local/hue/desktop/core/src/desktop/models.py", line 308, in sync
from pig.models import PigScript
ImportError: No module named pig.models
[24/Nov/2015 03:02:12 -0800] models ERROR error syncing search
Traceback (most recent call last):
File "/usr/local/hue/desktop/core/src/desktop/models.py", line 318, in sync
from search.models import Collection
ImportError: No module named search.models

Python (boto) TypeError launching Spark Cluster

The following is an attempt to launch a cluster with ten slaves.
12:13:44/sparkup $ec2/spark-ec2 -k sparkeast -i ~/.ssh/myPem.pem \
-s 10 -z us-east-1a -r us-east-1 launch spark2
Here is the output. Note that the same command had been successful with the February master code. Today I updated to the latest 1.4.0-SNAPSHOT.
Setting up security groups...
Searching for existing cluster spark2 in region us-east-1...
Spark AMI: ami-5bb18832
Launching instances...
Launched 10 slaves in us-east-1a, regid = r-68a0ae82
Launched master in us-east-1a, regid = r-6ea0ae84
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state.........unable to load cexceptions
TypeError
p0
(S''
p1
tp2
Rp3
(dp4
S'child_traceback'
p5
S'Traceback (most recent call last):\n File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1280, in _execute_child\n sys.stderr.write("%s %s (env=%s)\\n" %(executable, \' \'.join(args), \' \'.join(env)))\nTypeError\n'
p6
sb.Traceback (most recent call last):
File "ec2/spark_ec2.py", line 1444, in <module>
main()
File "ec2/spark_ec2.py", line 1436, in main
real_main()
File "ec2/spark_ec2.py", line 1270, in real_main
cluster_state='ssh-ready'
File "ec2/spark_ec2.py", line 869, in wait_for_cluster_state
is_cluster_ssh_available(cluster_instances, opts):
File "ec2/spark_ec2.py", line 833, in is_cluster_ssh_available
if not is_ssh_available(host=dns_name, opts=opts):
File "ec2/spark_ec2.py", line 807, in is_ssh_available
stderr=subprocess.STDOUT # we pipe stderr through stdout to preserve output order
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 709, in __init__
errread, errwrite)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1328, in _execute_child
raise child_exception
TypeError
The AWS console shows that the instances are actually running, so it is unclear what actually failed.
Any hints or workarounds appreciated.
UPDATE: The same error occurs when running the login command. It seems to be a problem with the boto API, but the cluster itself appears to be OK.
ec2/spark-ec2 -i ~/.ssh/sparkeast.pem login spark2
Searching for existing cluster spark2 in region us-east-1...
Found 1 master, 10 slaves.
Logging into master ec2-54-87-46-170.compute-1.amazonaws.com...
unable to load cexceptions
TypeError
p0
(.. same exception stacktrace as above )
The issue is that the Python 2.7.6 installation on my Yosemite MacBook appears to have become corrupted.
I reset PATH and PYTHONPATH to point to a custom Homebrew-installed Python version, and then boto and other Python commands, including building the Spark performance project, work fine.
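If you suspect a similar PATH/PYTHONPATH mix-up, a quick diagnostic sketch like this shows which interpreter and which boto installation are actually being picked up (nothing here is specific to spark-ec2; it only assumes boto is importable):

# Print which Python interpreter and boto module are actually in use,
# to spot a PATH/PYTHONPATH that points at a broken installation.
import sys

print(sys.executable)  # path of the running interpreter
print(sys.version)

try:
    import boto
    print(boto.__file__)     # where boto is imported from
    print(boto.__version__)
except ImportError as exc:
    print("boto not importable:", exc)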
