I was doing a parallel groupby computation with Dask, using PyArrow to load Parquet files from S3. However, the same piece of code sometimes runs and sometimes fails, seemingly at random and with different error messages. The same issue happened when using fastparquet:
File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2309). Detail: Python exception: ssl.SSLError
or fails with a different error:
File "pyarrow/_parquet.pyx", line 1036, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 80, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2309). Detail: Python exception: ssl.SSLError
The Dask scheduler I was using is "processes". Everything works fine with the "threads" scheduler, but that is extremely slow. Is this behavior expected with Dask?
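(For what it's worth: intermittent DECRYPTION_FAILED_OR_BAD_RECORD_MAC and WRONG_VERSION_NUMBER errors are the classic symptom of an SSL connection being shared across forked worker processes. A minimal sketch of one possible workaround is to have Dask spawn fresh processes so each one opens its own S3 connection; the bucket path and groupby column below are placeholders:)

import dask
import dask.dataframe as dd

# Spawned (not forked) workers each create their own S3/SSL connection
# instead of inheriting the parent's already-open sockets.
dask.config.set({"multiprocessing.context": "spawn"})

df = dd.read_parquet("s3://my-bucket/data/*.parquet")  # placeholder path
result = df.groupby("key").sum().compute(scheduler="processes")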
Could you help me resolve the error below from the PyHive module?
Issue: We have upgraded our Cloudera cluster from CDH to CDP. We use the PyHive Python module to connect to Impala via hive.connect(user, password, host, port, auth='LDAP').
We get the error below for some queries submitted through pandas read_sql, while other queries execute fine and return a DataFrame.
Everything was fine before the upgrade; all queries ran without issues and returned results.
from pyhive import hive
conn = hive.Connection(host=impala_host, port=impala_port, username=user, password=password, auth="LDAP")
Please find the stack trace below.
data = pd.read_sql(sql, conn)
File "/usr/local/lib/python3.7/site-packages/pandas/io/sql.py", line 608, in read_sql
chunksize=chunksize,
File "/usr/local/lib/python3.7/site-packages/pandas/io/sql.py", line 2130, in read_query
data = self._fetchall_as_list(cursor)
File "/usr/local/lib/python3.7/site-packages/pandas/io/sql.py", line 2144, in _fetchall_as_list
result = cur.fetchall()
File "/usr/local/lib/python3.7/site-packages/pyhive/common.py", line 137, in fetchall
return list(iter(self.fetchone, None))
File "/usr/local/lib/python3.7/site-packages/pyhive/common.py", line 106, in fetchone
self._fetch_while(lambda: not self._data and self._state != self._STATE_FINISHED)
File "/usr/local/lib/python3.7/site-packages/pyhive/common.py", line 46, in _fetch_while
self._fetch_more()
File "/usr/local/lib/python3.7/site-packages/pyhive/hive.py", line 477, in _fetch_more
_check_status(response)
File "/usr/local/lib/python3.7/site-packages/pyhive/hive.py", line 585, in _check_status
raise OperationalError(response)
pyhive.exc.OperationalError: TFetchResultsResp(status=TStatus(statusCode=2, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None), hasMoreRows=True, results=None)
We have verified the PyHive source code: cursor.fetchall() does not wait (sleep) and returns immediately, even though the query status (statusCode=2) shows it is still running on the backend.
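(A sketch of one possible workaround, assuming the failure is indeed fetching before the query has finished: submit the query asynchronously with PyHive and poll its state before fetching. The connection parameters and sql variable are the ones from the snippet above:)

import time
import pandas as pd
from pyhive import hive
from TCLIService.ttypes import TOperationState

conn = hive.Connection(host=impala_host, port=impala_port,
                       username=user, password=password, auth="LDAP")
cursor = conn.cursor()
cursor.execute(sql, async_=True)  # submit without blocking

# Wait until the query leaves the INITIALIZED/RUNNING states, so we never
# fetch while the backend still reports statusCode=2 (still executing).
state = cursor.poll().operationState
while state in (TOperationState.INITIALIZED_STATE, TOperationState.RUNNING_STATE):
    time.sleep(1)
    state = cursor.poll().operationState

data = pd.DataFrame(cursor.fetchall(),
                    columns=[col[0] for col in cursor.description])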
I am training a model with LightGBM and getting the following error, which occurs without any pattern (sometimes training finishes without error on the exact same input data):
[LightGBM] [Fatal] Unknown token a-05 in data file
[LightGBM] [Warning] Unknown token a-05 in data file
terminate called after throwing an instance of 'std::runtime_error'
what(): Unknown token a-05 in data file
Aborted (core dumped)
My best guess is that this error has something to do with physical memory.
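(One way to take LightGBM's own text parser out of the picture, sketched here under the assumption that the data is currently handed to LightGBM as a file path, is to parse the file with pandas first and pass an in-memory matrix; the file name and label column are placeholders:)

import lightgbm as lgb
import pandas as pd

# Parsing with pandas first means LightGBM never tokenizes the raw file,
# so an "Unknown token" failure here would point at the file itself.
df = pd.read_csv("train.csv")                    # placeholder file
X, y = df.drop(columns=["label"]), df["label"]   # placeholder label column

dtrain = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "binary"}, dtrain)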
I just got my new ESP32-CAM, and it keeps giving me the error below even though I correctly followed a tutorial video I found on YouTube and watched several others; it still gives the error:
Traceback (most recent call last):
File "esptool.py", line 3682, in <module>
File "esptool.py", line 3675, in _main
File "esptool.py", line 3329, in main
File "esptool.py", line 263, in __init__
File "site-packages\serial\__init__.py", line 88, in serial_for_url
File "site-packages\serial\serialwin32.py", line 78, in open
File "site-packages\serial\serialwin32.py", line 222, in _reconfigure_port
serial.serialutil.SerialException: Cannot configure port, something went wrong. Original message: WindowsError(31, 'A device attached to the system is not functioning.')
Failed to execute script esptool
the selected serial port does not exist or your board is not connected
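(One quick sanity check, a sketch using pyserial, which esptool already depends on; run it in the same Python environment to list the serial ports Windows actually exposes:)

import serial.tools.list_ports

# If the ESP32-CAM's USB-serial adapter does not show up here, the
# driver, cable, or board connection is the problem, not esptool.
for port in serial.tools.list_ports.comports():
    print(port.device, "-", port.description)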
My Python version is 3.7. After running pip3 install happybase, I started the Thrift server with hbase thrift start and tried a brief .py file like the following:
import happybase
connection = happybase.Connection('master')
table = connection.table('jmlr') #'jmlr' is a table in hbase
for i in table.scan():
print(i)
table.put('001', {'title':'dasds'}) #error here
connection.close()
When it gets to table.put(), it reports this error:
thriftpy2.transport.base.TTransportException: TTransportException(type=4, message='TSocket read 0 bytes')
At the same time, the Thrift server reported an error:
ERROR [thrift-worker-1] thrift.TBoundedThreadPoolServer: Error occurred during processing of message. java.lang.IllegalArgumentException: Invalid famAndQf provided.
But when I ran this Python file again just now, Thrift gave me a different error:
thrift.TBoundedThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Bad version in readMessageBegin
I have tried adding parameters like protocol='compact' and transport='framed', but that didn't work; even table.scan() failed.
Everything in the hbase shell is OK, so I can't figure out what went wrong. I'm about to collapse.
I ran into the same issue and found this solution. You need to include a column qualifier, even an empty one, in the put() method (the ':' symbol is the delimiter between column family and column qualifier):
table.put('001', {'title:': 'dasds'})
Also, you get a different error message on the second run of the script because the Thrift server has already failed.
I hope this helps.
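(For completeness, a sketch of the corrected script with the family:qualifier fix applied. This assumes 'title' is the column family of the 'jmlr' table; note that HappyBase generally expects byte strings in Python 3:)

import happybase

connection = happybase.Connection('master')
table = connection.table('jmlr')

# Column names must be b'family:qualifier'; an empty qualifier still
# needs the trailing colon after the family name.
table.put(b'001', {b'title:': b'dasds'})

for row in table.scan():
    print(row)

connection.close()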
On a Mac (v. 10.14.5), I am trying to run PySpark programs in PyCharm (Professional Edition, v. 2019.2).
I know my simple PySpark program is fine, because when I run it with spark-submit outside PyCharm from the terminal, using Spark I installed via brew, it works as expected. I have tried linking PyCharm to this version of Spark, but am getting other issues.
I followed multiple sets of instructions online (for example, a Stack Overflow thread) to install pyspark within PyCharm (Preferences -> Project Interpreter) and set the SPARK_HOME environment variable to the appropriate venv directory (Run -> Edit Configurations -> Environment Variables).
But, I get an error message when I run the program:
Failed to find Spark jars directory (/Users/rahul/PycharmProjects/spark-demoII/venv/assembly/target/scala-2.12/jars).
You need to build Spark with the target "package" before running this program.
Traceback (most recent call last):
File "/Users/rahul/PycharmProjects/spark-demoII/run.py", line 6, in <module>
sc = SparkContext("local", "SimpleApp")
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
return _launch_gateway(conf)
File "/Users/rahul/virtualenvs/pyspark/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Process finished with exit code 1
Anyone know how to get PyCharm to run PySpark programs on a similar machine?
In response to @pissal's suggestion:
I had tried that previously, but that version of Spark doesn't work. I tried it again anyway: after switching to a virtual environment, I did a pip install pyspark. To confirm whether this version of Spark works, I ran spark-submit run.py (outside of PyCharm); here is the error message:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/rahul/.virtualenvs/test1/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.11-2.4.4.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
at org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:348)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:348)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3720)
at java.base/java.lang.String.substring(String.java:1909)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
... 25 more
So the reason this was happening is that PySpark had not been updated to work with the latest version of Java. After removing Java 13, I made sure my Homebrew installation of Spark uses Java 1.8. Then I added the following to the environment variables in Run -> Edit Configurations in PyCharm:
SPARK_HOME=/usr/local/Cellar/apache-spark/2.4.4/libexec
With these settings I can run PySpark jobs in PyCharm.
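(If you would rather not depend on PyCharm's Run Configuration, a hedged alternative is to set the same variables at the top of the script before pyspark is imported; the JAVA_HOME path below is a placeholder for wherever your 1.8 JDK lives:)

import os

# Paths are assumptions; adjust both to your own installation.
os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/2.4.4/libexec"
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.jdk/Contents/Home"  # placeholder

from pyspark import SparkContext

sc = SparkContext("local", "SimpleApp")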