Read LZO files in PySpark - hadoop

I am new to Spark. I have a bunch of LZO-indexed files in a folder. The indexing was done as described at https://github.com/twitter/hadoop-lzo.
The files are as follows:
1.lzo
1.lzo.index
2.lzo
2.lzo.index
and so on
I want to read these files using newAPIHadoopFile(). Following the instructions at https://github.com/twitter/hadoop-lzo, I did the following:
val files = sc.newAPIHadoopFile(path, classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
val lzoRDD = files.map(_._2.toString)
It worked fine in Scala (spark-shell).
But I want to use PySpark (a Python Spark application). I am doing the following:
files = sc.newAPIHadoopFile(path,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text")
lzoRDD = files.map(_._2.toString)
I get the following error: AttributeError: 'RDD' object has no attribute '_2'
The whole code is as follows:
import sys
from pyspark import SparkContext,SparkConf
if __name__ == "__main__":
    # Create the SparkContext
    conf = (SparkConf().setMaster("local[2]").setAppName("abc").set("spark.executor.memory", "10g").set("spark.cores.max", 10))
    sc = SparkContext(conf=conf)
    path = '/x/y/z/*.lzo'
    files = sc.newAPIHadoopFile(path, "com.hadoop.mapreduce.LzoTextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text")
    lzoRDD = files.map(_._2.toString)
    # stop the SparkContext
    sc.stop()
And I am submitting using spark-submit.
Any help would be appreciated.
Thank You
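For reference, _._2.toString is Scala placeholder syntax and is not valid Python; in PySpark each record of the pair RDD is a (key, value) tuple, so the value is selected with a lambda. A minimal sketch of what the PySpark side might look like (it assumes the hadoop-lzo jar and native libraries are available to the executors, and the path is a placeholder):

files = sc.newAPIHadoopFile(
    path,
    "com.hadoop.mapreduce.LzoTextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
# each element is a (LongWritable offset, Text line) pair converted to Python types,
# so the value is already a str and needs no toString
lzoRDD = files.map(lambda kv: kv[1])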

Related

How to change HDFS block size of a DataFrame in pyspark

This seems related to
How to change hdfs block size in pyspark?
I can successfully change the HDFS block size with rdd.saveAsTextFile,
but not with the corresponding DataFrame.write.parquet, so I am unable to get the desired block size when saving in Parquet format.
I am unsure whether this is a bug in the PySpark DataFrame API or whether I did not set the configuration correctly.
The following is my testing code:
##########
# init
##########
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import hdfs
from hdfs import InsecureClient
import os
import numpy as np
import pandas as pd
import logging
os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'
block_size = 512 * 1024
conf = SparkConf().setAppName("myapp").setMaster("spark://spark1:7077") \
    .set('spark.cores.max', 20) \
    .set("spark.executor.cores", 10) \
    .set("spark.executor.memory", "10g") \
    .set("spark.hadoop.dfs.blocksize", str(block_size)) \
    .set("spark.hadoop.dfs.block.size", str(block_size))
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)
##########
# main
##########
# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
# save using DataFrameWriter, resulting 128MB-block-size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
# save using rdd, resulting 512k-block-size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
I found the answer at the following link:
https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html
I can successfully set the Parquet block size with spark.hadoop.parquet.block.size.
The following is the sample code:
# init
block_size = 512 * 1024
conf = SparkConf().setAppName("myapp").setMaster("spark://spark1:7077") \
    .set('spark.cores.max', 20) \
    .set("spark.executor.cores", 10) \
    .set("spark.executor.memory", "10g") \
    .set('spark.hadoop.parquet.block.size', str(block_size)) \
    .set("spark.hadoop.dfs.blocksize", str(block_size)) \
    .set("spark.hadoop.dfs.block.size", str(block_size)) \
    .set("spark.hadoop.dfs.namenode.fs-limits.min-block-size", str(131072))
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])
# save using DataFrameWriter, resulting 512k-block-size
df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')
# save using DataFrameWriter.csv, resulting 512k-block-size
df_txt.write.mode('overwrite').csv('hdfs://spark1/tmp/temp_with_df_csv')
# save using DataFrameWriter.text, resulting 512k-block-size
df_txt.write.mode('overwrite').text('hdfs://spark1/tmp/temp_with_df_text')
# save using rdd, resulting 512k-block-size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
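To double-check the result, a small verification sketch (it assumes the hdfs InsecureClient used above and that WebHDFS is reachable on port 50070; the output directory is the one written above):

# print the HDFS block size reported for each written part file
from hdfs import InsecureClient
client = InsecureClient('http://spark1:50070')
for name in client.list('/tmp/temp_with_df'):
    status = client.status('/tmp/temp_with_df/' + name)
    print(name, status['blockSize'])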
Hadoop and Spark are independent tools, each with its own way of working. Spark and Parquet work with data partitions, and the HDFS block size is not directly meaningful to them. Let Spark write the data the way it wants, and then do what you want with it inside HDFS.
You can change the number of Parquet partitions with:
df_txt.repartition(6).write.format("parquet").save("hdfs://...")

How to check if a file exists in Google Storage from Spark Dataproc?

I was assuming that the Google Storage connector would let me query GS directly from Spark in Dataproc, as if it were HDFS, but it looks like the following does not work (from the Spark shell):
scala> import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.FileSystem
scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path
scala> FileSystem.get(sc.hadoopConfiguration).exists(new Path("gs://samplebucket/file"))
java.lang.IllegalArgumentException: Wrong FS: gs://samplebucket/file, expected: hdfs://dataprocmaster-m
Is there a way to access Google Storage files using just the Hadoop API?
That's because FileSystem.get(...) returns the default FileSystem, which according to your configuration is HDFS, and it can only work with paths starting with hdfs://. Use the following to get the correct FS.
Path p = new Path("gs://...");
FileSystem fs = p.getFileSystem(...);
fs.exists(p);
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.fs.{FileSystem, Path}
val p = "gs://<your dir>"
val path = new Path(p)
val fs = path.getFileSystem(sc.hadoopConfiguration)
fs.exists(path)
fs.isDirectory(path)
I translated @Pradeep Gollakota's answer to PySpark, thanks!!
def path_exists(spark, path):
    # path = "gs://..."; returns True if the path exists
    p = spark._jvm.org.apache.hadoop.fs.Path(path)
    fs = p.getFileSystem(spark._jsc.hadoopConfiguration())
    return fs.exists(p)
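A quick usage sketch (the bucket and object names are placeholders; it assumes a SparkSession, since the helper reaches into its _jvm and _jsc internals):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# prints True if the object (or directory) exists in the bucket
print(path_exists(spark, "gs://samplebucket/file"))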

Stanford CoreNLP use case with a PySpark script runs fine on a local node but runs very slowly in YARN cluster mode

I have tried debugging all the possible solutions, but I am unable to run and scale this on the cluster, where I need to process 100 million records. The script runs very well on a local node, as expected, but fails to run on the Cloudera cluster on Amazon. Here is the sample data that works on the local node. As far as I can tell, the problem is that the two files I use in the UDF are not getting distributed to the executors/containers/nodes, so the job just keeps running and processing is very slow. I am unable to fix this code so that it executes on the cluster.
## Link to the 2 files which I use in the script ##
##https://nlp.stanford.edu/software/stanford-ner-2015-12-09.zip
####Link to the data set########
##https://docs.google.com/spreadsheets/d/17b9NUonmFjp_W0dOe7nzuHr7yMM0ITTDPCBmZ6xM0iQ/edit?usp=drivesdk&lipi=urn%3Ali%3Apage%3Ad_flagship3_messaging%3BQHHZFKYfTPyRb%2FmUg6ahsQ%3D%3D
#spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-cluster --files /home/ec2-user/StanfordParser/stanford-ner-2016-10-31/stanford-ner.jar,/home/ec2-user/StanfordParser/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz stanford_ner.py
import os
from pyspark import SparkContext, SparkConf, SparkFiles
from pyspark.sql import Row, SQLContext, HiveContext
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
# StanfordNERTagger is used below but was missing from the original imports; it comes from NLTK
from nltk.tag import StanfordNERTagger
def stanford(s):
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_131/'
    stanford_classifier = SparkFiles.get("english.all.3class.distsim.crf.ser.gz")
    stanford_ner_path = SparkFiles.get("stanford-ner.jar")
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    output = st.tag(s.split())
    organizations = []
    organization = ""
    for t in output:
        # the word
        word = t[0]
        # the current tag
        tag = t[1]
        # print(word, tag)
        # if the word is tagged ORGANIZATION, append it to the running organization name
        if tag == "ORGANIZATION":
            organization += " " + word
            organizations.append(organization)
    final = "-".join(organizations)
    return final
stanford_classification = udf(stanford, StringType())
###################Pyspark Section###############
#Set context
sc = SparkContext.getOrCreate()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
#Get data
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(r"/Downloads/authors_data.csv")
#Create new dataframe with new column organization
df = df.withColumn("organizations", stanford_classification(df['affiliation_string']))
#Save result
df.select('pmid','affiliation_string','organizations').write.format('com.databricks.spark.csv').save(r"/Downloads/organizations.csv")
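Since the suspicion above is that the two --files artifacts never reach the executors, a minimal diagnostic sketch (assuming the job was submitted with the --files list shown in the spark-submit comment) can confirm whether SparkFiles resolves them on the workers:

# run a tiny job that checks the shipped files on the executors
import os
from pyspark import SparkFiles

def check_files(_):
    names = ["stanford-ner.jar", "english.all.3class.distsim.crf.ser.gz"]
    yield [(n, os.path.exists(SparkFiles.get(n))) for n in names]

print(sc.parallelize(range(4), 4).mapPartitions(check_files).collect())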

Can I create a sequence file using Spark DataFrames?

I have a requirement in which I need to create a sequence file. Right now we have written a custom API on top of the Hadoop API, but since we are moving to Spark we have to achieve the same thing using Spark. Can this be achieved using Spark DataFrames?
AFAIK there is no native API available directly on DataFrame, apart from the approach below.
Please try something like the following (which works on the DataFrame's underlying RDD, inspired by SequenceFileRDDFunctions.scala and its saveAsSequenceFile method). From the Scaladoc:
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile, through an implicit conversion.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.SequenceFileRDDFunctions
import org.apache.hadoop.io.NullWritable

object driver extends App {
  val conf = new SparkConf()
    .setAppName("HDFS writable test")
  val sc = new SparkContext(conf)
  val empty = sc.emptyRDD[Any].repartition(10)
  val data = empty.mapPartitions(Generator.generate).map { (NullWritable.get(), _) }
  val seq = new SequenceFileRDDFunctions(data)
  // seq.saveAsSequenceFile("/tmp/s1", None)
  seq.saveAsSequenceFile(s"hdfs://localdomain/tmp/s1/${new scala.util.Random().nextInt()}", None)
  sc.stop()
}
For further information, please see:
how-to-write-dataframe-obtained-from-hive-table-into-hadoop-sequencefile-and-r
sequence file
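If PySpark is acceptable, pair RDDs can also be written with saveAsSequenceFile, which converts picklable keys and values to Writables. A minimal sketch (the output path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-sketch").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["k", "v"])
# turn each Row into a (key, value) pair, then write it as a Hadoop SequenceFile
df.rdd.map(lambda row: (row["k"], row["v"])).saveAsSequenceFile("/tmp/seqfile_sketch")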

unable to submit spark python script

I'm submitting the following Python script:
#!/usr/bin/python
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from numpy import array
from pyspark import SparkContext as sc, SparkConf
data = sc.textFile("hdfs:/dataset/parkinsons.data")
I got this error:
data = sc.textFile("hdfs:/dataset/parkinsons.data")
TypeError: unbound method textFile() must be called with SparkContext instance as first argument (got str instance instead)
You must create a SparkContext instance first, for example:
from pyspark import SparkContext
sc = SparkContext(appName="TestApp")
data = sc.textFile("hdfs:/dataset/parkinsons.data")
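If you also want to keep the SparkConf import from the original script, a variant sketch (the app name is a placeholder):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("TestApp")
sc = SparkContext(conf=conf)
data = sc.textFile("hdfs:/dataset/parkinsons.data")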
