Running PySpark in PyCharm on macOS

On a Mac (v. 10.14.5), I am trying to run PySpark programs in PyCharm (professional edition, v. 19.2).
I know my simple PySpark program is fine, because when I run it with spark-submit from the terminal (outside PyCharm), using the Spark I installed via brew, it works as expected. I have tried linking PyCharm to this version of Spark, but I am getting other issues.
I followed multiple sets of instructions online (for example, this Stack Overflow thread) to install pyspark within PyCharm (Preferences -> Project Interpreter) and to set the SPARK_HOME environment variable to the appropriate venv directory (Run -> Edit Configurations -> Environment Variables).
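A minimal sketch of what run.py might look like (a hypothetical reconstruction from the traceback below, which shows that line 6 of the real file is the SparkContext(...) call; the small count job is only a stand-in for the actual program):
from pyspark import SparkContext

# Hypothetical stand-in for the real run.py.
sc = SparkContext("local", "SimpleApp")
rdd = sc.parallelize([1, 2, 3, 4])  # any small job is enough to exercise the JVM gateway
print(rdd.count())
sc.stop()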
But I get an error message when I run the program:
Failed to find Spark jars directory (/Users/rahul/PycharmProjects/spark-demoII/venv/assembly/target/scala-2.12/jars).
You need to build Spark with the target "package" before running this program.
Traceback (most recent call last):
File "/Users/rahul/PycharmProjects/spark-demoII/run.py", line 6, in <module>
sc = SparkContext("local", "SimpleApp")
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Process finished with exit code 1
Does anyone know how to get PyCharm to run PySpark programs on a similar machine?
In response to @pissal's suggestion:
I tried that previously, but that version of Spark does not work. I tried it again anyway: after switching to a virtual environment, I did a pip install pyspark. To check whether this version of Spark works, I ran spark-submit run.py (outside of PyCharm), and here is the error message:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/rahul/.virtualenvs/test1/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.11-2.4.4.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:348)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:348)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3720)
at java.base/java.lang.String.substring(String.java:1909)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
... 25 more

So the reason this was happening was that PySpark has not been updated to work with the latest version of Java. After removing Java 13, I made sure my Homebrew installation of Spark uses Java 1.8. Then I added the following to the Environment Variables in Run -> Edit Configurations in PyCharm:
SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.4/libexec
With these settings, I can run PySpark jobs in PyCharm.
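For anyone who prefers to set this up in code rather than in the Run -> Edit Configurations dialog, here is a minimal sketch. The SPARK_HOME path is the one above; the Java 8 location is an assumption and may differ on your machine (you can find yours with /usr/libexec/java_home -v 1.8):
import os
from pyspark import SparkContext

# These must be set before the first SparkContext is created, because that is
# when PySpark launches the JVM gateway via spark-submit.
os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/2.4.4/libexec"
# Assumed Java 8 location -- check yours with: /usr/libexec/java_home -v 1.8
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"

sc = SparkContext("local", "SimpleApp")
print(sc.version)
sc.stop()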

Related

HBaseTestingUtility failing on Windows 10 with UnsatisfiedLinkError

I'm trying to get the HBaseTestingUtility running on Windows 10.
I'm using hbase-client and hbase-testing-util with version 1.4.2.
When running:
HBaseTestingUtility hbaseUtility = new HBaseTestingUtility();
hbaseUtility.startMiniCluster(); //<- error thrown on this line
I get the below error:
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
at org.apache.hadoop.fs.FileUtil.canWrite(FileUtil.java:996)
...
I have downloaded winutils, and have set the following user variables:
hadoop.home.dir=C:\Users\bwatson\apps\hadoop-2.8.3
HADOOP_HOME=C:\Users\bwatson\apps\hadoop-2.8.3
but this does not make a difference.
The official documentation for the HBaseTestingUtility says that Cygwin is needed on Windows, but I cannot install that due to the admin restrictions on my work machine. Is there any other solution?
After some digging, I found a solution in https://stackoverflow.com/a/43484457/729819. I added %HADOOP_HOME%/bin to PATH. Now I get another error, but I will raise another question for that.

sctp_core_destroy(): SCTP API not initialized in kamailio start

Hi, I have installed Kamailio. It starts the first time, but when I stop and start it again it gives sctp_core_destroy(): SCTP API not initialized. I have already installed the SCTP module.
yyerror_at(): parse error in config file /etc/kamailio/kamailio.cfg
load_module(): could not find module <db_mysql> in </usr/lib/kamailio/modules>
[sctp_core.c:53]: sctp_core_destroy(): SCTP API not initialized
From the log it is obvious that you have successfully compiled and installed the SCTP module; however, it could NOT be initialized.
Note that this error is, more often than not, the result of other errors in your cfg file.
A few tips:
Run kamailio -c to make sure there is NO error in your cfg.
Found an error? Use the following to monitor what the exact issue is. In one terminal, run: tail -fn200 /var/log/syslog
In a second terminal, try restarting your Kamailio server: sudo service kamailio restart
Go back to the first terminal and look for the first line with CRITICAL output, like the one below: CRITICAL: <core> [core/cfg.y:3413]: yyerror_at(): parse error in config file /usr/local/etc/kamailio/kamailio.cfg, line 366, column 41: syntax error
Line 366 is most likely the issue, so open the file at that line (366) to fix the problem:
sudo nano +366 /usr/local/etc/kamailio/kamailio.cfg
Let me know if this helps.

Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found (Spark 1.6 Windows)

I am trying to access S3 files from a local Spark context using PySpark.
I keep getting:
File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
I had set os.environ['AWS_ACCESS_KEY_ID'] and
os.environ['AWS_SECRET_ACCESS_KEY'] before I called df = sqc.read.parquet(input_path). I also added these lines:
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoopConf.set("fs.s3.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
I have also tried changing s3 to s3n and s3a. Neither worked.
Any idea how to make it work?
I am on Windows 10, pySpark, Spark 1.6.1 built for Hadoop 2.6.0
I'm running pyspark with the libraries from hadoop-aws appended.
You will need to use s3n in your input path. I'm running this on macOS, so I'm not sure whether it will work on Windows.
$SPARK_HOME/bin/pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
This package declaration works even in spark-shell
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.1
and then set the credentials in the shell:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxxxxxxxxxxxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxxxxxxxxxxxxxxxx")

install_driver(Oracle) failed: Can't load '/usr/local/lib/perl5/auto/DBD/Oracle/Oracle.so' for module DBD::Oracle: libocci.so.11.1

I have been facing this error for two days. I am able to get output via the command line at the Nagios end:
/usr/local/nagios/libexec/check_oracle_health --connect 192.168.2.92:1521/modula --user nagios --password nagios --mode tnsping
Output is
OK - connection established to 192.168.2.92:1521/modula.
But when I go to the GUI it gives me this error:
CRITICAL - cannot connect to 192.168.2.92:1521/modula.
install_driver(Oracle) failed:
Can't load '/usr/local/lib/perl5/auto/DBD/Oracle/Oracle.so' for module DBD::Oracle:
libocci.so.11.1: cannot open shared object file:
No such file or directory at /usr/lib/perl5/DynaLoader.pm line 200.
at (eval 18) line 3
Compilation failed in require at (eval 18) line 3.
Perhaps a required shared library or dll isn't installed where expected
at /usr/local/nagios/libexec/check_oracle_health line 5837
\n
Please help me to resolve the error.
I had this issue on CentOS 6 and here is how I resolved it:
`echo "$ORACLE_HOME/lib" >> /etc/ld.so.conf.d/oracle-x86_64.conf && ldconfig`
The answer by Jordan Neufeld is good and may be enough for you (I've tested it on CentOS 7), but I recommend setting these environment variables:
export ORACLE_HOME=/usr/lib/oracle/11.2/client64
export LD_LIBRARY_PATH=/usr/lib/oracle/11.2/client64/lib:$LD_LIBRARY_PATH
export PATH=/usr/lib/oracle/11.2/client64/bin:$PATH
[examples are for oracle-instantclient11.2-basic-11.2 rpm, change path if needed]

Hadoop and Hive Homes in CDH4

I'm trying to configure RHive in the CDH4 environment.
When loading the 'RHive' package in R, the error below is returned.
I'm guessing that's due to wrong homes.
If so, what would be the correct ones?
Or, if that's not the reason, what is wrong?
Any help would be much appreciated.
Thanks.
> Sys.setenv(HIVE_HOME="/etc/hive")
> Sys.setenv(HADOOP_HOME="/etc/hadoop")
> library(RHive)
Loading required package: rJava
Loading required package: Rserve
This is RHive 0.0-7. For overview type '?RHive'.
HIVE_HOME=/etc/hive
[1] "there is no slaves file of HADOOP. so you should pass hosts argument when you call rhive.connect()."
Error : .onLoad failed in loadNamespace() for 'RHive', details:
call: .jnew("org/apache/hadoop/conf/Configuration")
error: java.lang.ClassNotFoundException
In addition: Warning message:
In file(file, "rt") :
cannot open file '/etc/hadoop/conf/slaves': No such file or directory
Error: package/namespace load failed for 'RHive'
I had the same problem but solved it. The downside is that I have to keep track of a bunch of symlinks.
After struggling to install RHive_0.0-7.tar.gz on CDH 4.7.x and getting:
Warning in file(file, "rt") :
cannot open file '/etc/hadoop/conf/slaves': No such file or directory
[1] "there is no slaves file of HADOOP. so you should pass hosts argument when you call rhive.connect()."
In /etc/hadoop/conf, I added the following symlink:
ln -s /opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/etc/hadoop/conf.empty/slaves slaves
(Why Cloudera CDH 4.7 installs in /opt without creating the proper symlinks from /usr/lib is puzzling.)
I also defined the following in /usr/lib64/R/etc/Renviron:
## set hive paths
HIVE_HOME='/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hive'
HADOOP_HOME='/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop'
LD_LIBRARY_PATH='/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hadoop'
At a shell prompt I ran R CMD INSTALL RHive_0.0-7.tar.gz
Installation Happiness!!
++++++
Inside RStudio (Server)
>
> library(RHive)
Loading required package: rJava
Loading required package: Rserve
This is RHive 0.0-7. For overview type ‘?RHive’.
HIVE_HOME=/opt/cloudera/parcels/CDH-4.4.0-1.cdh4.4.0.p0.39/lib/hive
call rhive.init() because HIVE_HOME is set.
rhive.init()
>
+++++++
You should set the HADOOP_CONF_DIR separately.
Try export HADOOP_CONF_DIR=/etc/hadoop/conf/conf.pseudo
The conf.pseudo directory has the slaves file.
Though I'd be curious to see if you can make RHive work with CDH4.
