unable to submit spark python script - hadoop

I'm submitting the following Python script:
#!/usr/bin/python
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
from pyspark import SparkContext as sc, SparkConf
data = sc.textFile("hdfs:/dataset/parkinsons.data")
and got this error:
data = sc.textFile("hdfs:/dataset/parkinsons.data")
TypeError: unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)

The import SparkContext as sc only aliases the class, so sc.textFile(...) is called on the class rather than an instance. You must create a SparkContext instance first, for example:
from pyspark import SparkContext
sc = SparkContext(appName="TestApp")
data = sc.textFile("hdfs:/dataset/parkinsons.data")
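For completeness, a minimal sketch of how the rest of the script might continue once the context exists. The layout of parkinsons.data is an assumption here (a header row, a non-numeric identifier in the first field, and the label in the last field), so adjust the parsing to the actual file:
from pyspark import SparkContext, SparkConf
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

conf = SparkConf().setAppName("TestApp")
sc = SparkContext(conf=conf)
data = sc.textFile("hdfs:/dataset/parkinsons.data")

header = data.first()

def parse_point(line):
    # Assumption: first field is a non-numeric identifier, label is in the last field
    fields = line.split(",")
    values = [float(x) for x in fields[1:]]
    return LabeledPoint(values[-1], values[:-1])

points = data.filter(lambda line: line != header).map(parse_point)
model = LogisticRegressionWithSGD.train(points)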

Related

How to get the leader feature importances in h2o automl pysparkling water

I am using a Spark standalone cluster and running H2O PySparkling on it.
I am unable to find the function for getting the leader model's feature importances. Please help.
Code:
import pandas as pd
from pyspark.sql import SparkSession
from pysparkling import *
import h2o
from pyspark import SparkFiles
from pysparkling.ml import H2OAutoML

spark = SparkSession.builder.appName('SparkApplication').getOrCreate()
conf = H2OConf()
hc = H2OContext.getOrCreate(conf)

def xgb_automl_features_importance(data, target_metric):
    # Converting DataFrame in H2OFrame
    hf = h2o.H2OFrame(data)
    sparkDF = hc.asSparkFrame(hf)
    # Identify predictors and response
    y = target_metric
    aml = H2OAutoML(labelCol=y)
    aml.setIncludeAlgos(["XGBoost"])
    aml.setMaxModels(1)
    aml.fit(sparkDF)
    print('-----------****************')
    print(aml.getLeaderboard().show(truncate=False))
The fit method of H2OAutoML returns the leader model. Each model in Sparkling Water has a getFeatureImportances() method that returns a Spark DataFrame with the feature importances.
model=aml.fit(sparkDF)
model.getFeatureImportances().show()
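Putting it together with the function from the question, a minimal sketch (same setup as above) that returns the importances instead of only printing the leaderboard; the data and target name passed in at the end are placeholders:
def xgb_automl_features_importance(data, target_metric):
    # Convert the pandas DataFrame to an H2OFrame, then to a Spark DataFrame
    hf = h2o.H2OFrame(data)
    sparkDF = hc.asSparkFrame(hf)
    aml = H2OAutoML(labelCol=target_metric)
    aml.setIncludeAlgos(["XGBoost"])
    aml.setMaxModels(1)
    # fit() returns the leader model
    model = aml.fit(sparkDF)
    # Feature importances come back as a Spark DataFrame
    return model.getFeatureImportances()

importances = xgb_automl_features_importance(my_pandas_df, "my_target")  # placeholder inputs
importances.show(truncate=False)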

Call a Lambda Function with AWS Glue

I'm trying to use boto3 in an AWS Glue job to call a Lambda function, but without results.
I uploaded a zip with the libraries, like the AWS examples, and also tried without a zip.
With the zip the error is "Unable to load data for: endpoints".
Without the zip the invoke call ends in a timeout exception.
import boto3
client = boto3.client('lambda' , region_name='us-east-1')
r_lambda = client.invoke(FunctionName='S3GlueJson')
Can someone help me?
In Python, use the boto3 Lambda client's invoke(). For example, you can create a Lambda container, then call it from a Glue job:
import json
import logging
import boto3
import pandas as pd

logger = logging.getLogger()
lambda_client = boto3.client('lambda', region_name='us-east-1')

def get_predictions(df):
    # Call the getPredictions Lambda container
    response = lambda_client.invoke(
        FunctionName='getPredictions',
        InvocationType='RequestResponse',
        LogType='Tail',
        Payload=df
    )
    logger.info('Received response from Lambda container.')
    data = response["Payload"].read().decode('utf-8')
    x = json.loads(data)
    df_pred = pd.DataFrame.from_dict(x)
    return df_pred

dfjson = df.to_json()
df_pred = get_predictions(dfjson)
df_pred.head()
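For context, the Lambda side is not shown here; a hypothetical getPredictions handler that would match the call above could look roughly like this (the predict() helper and the payload layout are assumptions):
import pandas as pd

def lambda_handler(event, context):
    # The Glue job sends df.to_json() as the payload, which the Lambda runtime
    # deserializes into a dict of {column: {row_index: value}}
    df = pd.DataFrame.from_dict(event)
    # predict() is a placeholder for whatever model logic runs in this container
    df["prediction"] = predict(df)
    # Return a dict; the Glue side rebuilds a DataFrame with from_dict()
    return df.to_dict()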
If you want to call a Glue job from a Lambda function, you can do it like this:
import boto3
glue = boto3.client(service_name='glue', region_name='us-east-1',
                    endpoint_url='https://glue.us-east-1.amazonaws.com')
#Start Job
myNewJobRun = glue.start_job_run(JobName=JOB_NAME)
#Get current state of Job, to be sure it's running
status = glue.get_job_run(JobName=JOB_NAME, RunId=myNewJobRun['JobRunId'])
logger.info('JOB State {}: {}'.format(
    JOB_NAME, status['JobRun']['JobRunState']))
As job execution can take some time to finish, it's better not to wait for it inside the Lambda function.
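Along those lines, a minimal sketch of a Lambda handler that only starts the Glue job and returns the run id instead of blocking until the job finishes (the job name is a placeholder):
import boto3

glue = boto3.client('glue', region_name='us-east-1')

def lambda_handler(event, context):
    # Start the job and return immediately; poll its state later with get_job_run()
    run = glue.start_job_run(JobName='my-glue-job')  # placeholder job name
    return {'JobRunId': run['JobRunId']}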

How to check if a file exists in Google Storage from Spark Dataproc?

I was assuming that the Google Cloud Storage connector would allow querying GCS directly from Spark on Dataproc as if it were HDFS, but it looks like the following does not work (from the Spark shell):
scala> import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.FileSystem
scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path
scala> FileSystem.get(sc.hadoopConfiguration).exists(new Path("gs://samplebucket/file"))
java.lang.IllegalArgumentException: Wrong FS: gs://samplebucket/file, expected: hdfs://dataprocmaster-m
Is there a way to access Google Storage files using just the Hadoop API?
That's because FileSystem.get(...) returns the default FileSystem, which according to your configuration is HDFS, and it can only work with paths starting with hdfs://. Use the following to get the correct FS:
Path p = new Path("gs://...");
FileSystem fs = p.getFileSystem(...);
fs.exists(p);
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}
val p = "gs://<your dir>"
val path = new Path(p)
val fs = path.getFileSystem(sc.hadoopConfiguration)
fs.exists(path)
fs.isDirectory(path)
I translated @Pradeep Gollakota's answer to PySpark, thanks!
def path_exists(spark, path):
    # path = gs://....  Returns True if the path exists
    p = spark._jvm.org.apache.hadoop.fs.Path(path)
    fs = p.getFileSystem(spark._jsc.hadoopConfiguration())
    return fs.exists(p)
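Usage, assuming an existing SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(path_exists(spark, "gs://samplebucket/file"))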

Stanford CoreNLP use case using Pyspark script runs fine on local node but on yarn cluster mode it runs very slow

I have tried debugging all the possible solutions but am unable to run and scale this on the cluster, and I need to process 100 million records. This script runs very well on a local node as expected but fails to run on the Cloudera cluster on Amazon. Here is the sample data that works on a local node. As far as I can tell, the problem is that the two files I use in the UDF are not getting distributed to the executors/containers or nodes, so the job just keeps running and processing is very slow. I am unable to fix this code so that it executes on the cluster.
##Link to the 2 files which i use in the script###
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
from pyspark import SparkContext, SparkConf, SparkFiles
from pyspark.sql import SQLContext, HiveContext, Row
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
# StanfordNERTagger is NLTK's wrapper around the Stanford NER jar
from nltk.tag import StanfordNERTagger
import os
def stanford(text):
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_131/'
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(text.split())
    organizations = []
    organization = ""
    for t in output:
        # The word
        word = t[0]
        # The current tag
        tag = t[1]
        # print(word, tag)
        # If the tag is ORGANIZATION, append the current word to the running name
        if tag == "ORGANIZATION":
            organization += " " + word
            organizations.append(organization)
    final = "-".join(organizations)
    return final

stanford_classification = udf(stanford, StringType())
###################Pyspark Section###############
#Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
#Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
#Create new dataframe with new column organization
df = df.withColumn("organizations", stanford_classification(df['affiliation_string']))
#Save result
df.select('pmid','affiliation_string','organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")

Read Lzo file in PySpark

I am new to Spark. I have a bunch of LZO indexed files in a folder. The indexing was done as indicated on https://github.com/twitter/hadoop-lzo.
The files are as follows:
1.lzo
1.lzo.index
2.lzo
2.lzo.index
and so on
I want to read these files. I am using newAPIHadoopFile().
As described at https://github.com/twitter/hadoop-lzo, I did the following:
val files = sc.newAPIHadoopFile(path, classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
val lzoRDD = files.map(_._2.toString)
It worked fine in Scala (spark-shell).
But I want to use PySpark (a Python Spark application). I am doing the following:
files = sc.newAPIHadoopFile(path,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text")
lzoRDD = files.map(_._2.toString)
I get the following error: AttributeError: 'RDD' object has no attribute '_2'
The whole code is as follows:
import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Create the SparkContext
    conf = (SparkConf().setMaster("local[2]").setAppName("abc")
            .set("spark.executor.memory", "10g").set("spark.cores.max", 10))
    sc = SparkContext(conf=conf)
    path = '/x/y/z/*.lzo'
    files = sc.newAPIHadoopFile(path, "com.hadoop.mapreduce.LzoTextInputFormat",
                                "org.apache.hadoop.io.LongWritable",
                                "org.apache.hadoop.io.Text")
    lzoRDD = files.map(_._2.toString)
    # Stop the SparkContext
    sc.stop()
And I am submitting using spark-submit.
Any help would be appreciated.
Thank You
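For reference, the Scala placeholder syntax _._2 does not carry over to Python, which is what the AttributeError points at. A minimal sketch under the question's own setup, taking the value out of each (key, value) pair with a lambda instead:
files = sc.newAPIHadoopFile(path,
                            "com.hadoop.mapreduce.LzoTextInputFormat",
                            "org.apache.hadoop.io.LongWritable",
                            "org.apache.hadoop.io.Text")
# Each element is a (LongWritable offset, Text line) pair; keep only the line
lzoRDD = files.map(lambda kv: kv[1])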
