Spark Streaming: HDFS

I can't get my Spark job to stream "old" files from HDFS.
If my Spark job is down for some reason (e.g. demo, deployment) but files are continuously being written/moved to the HDFS directory, I might skip those files once I bring the Spark Streaming job back up.
val hdfsDStream = ssc.textFileStream("hdfs://sandbox.hortonworks.com/user/root/logs")
hdfsDStream.foreachRDD(
  rdd => logInfo("Number of records in this batch: " + rdd.count())
)
Output --> Number of records in this batch: 0
Is there a way for Spark Streaming to move the files it has already read to a different folder, or do we have to program that ourselves, so that already-read files are not picked up again?
Is Spark Streaming effectively the same as running a batch Spark job (sc.textFile) from cron?

As Dean mentioned, textFileStream uses the default behaviour of processing new files only.
def textFileStream(directory: String): DStream[String] = {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
So all it is doing is calling this variant of fileStream:
def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ](directory: String): InputDStream[(K, V)] = {
  new FileInputDStream[K, V, F](this, directory)
}
And, looking at the FileInputDStream class, we can see that it can indeed look for existing files, but defaults to new files only:
newFilesOnly: Boolean = true,
So, going back into the StreamingContext code, we can see that there is an overload we can use by calling the fileStream method directly:
def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ](directory: String, filter: Path => Boolean, newFilesOnly: Boolean): InputDStream[(K, V)] = {
  new FileInputDStream[K, V, F](this, directory, filter, newFilesOnly)
}
So the TL;DR is:
ssc.fileStream[LongWritable, Text, TextInputFormat](
  directory, FileInputDStream.defaultFilter, false).map(_._2.toString)
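Putting that together with the stream from the original question, a minimal sketch might look like the one below. This is an illustration rather than the answer's exact code: the batch interval and app name are arbitrary, and the filter simply accepts every path (FileInputDStream.defaultFilter may not be accessible from user code, so passing your own predicate is the safer option).
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("hdfs-file-stream")
val ssc = new StreamingContext(conf, Seconds(30))

// newFilesOnly = false: also pick up files that already exist in the directory,
// not just files that arrive after the stream starts
val hdfsDStream = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs://sandbox.hortonworks.com/user/root/logs",
    (path: Path) => true,   // accept every file; customize if needed
    newFilesOnly = false
  ).map(_._2.toString)

hdfsDStream.foreachRDD(rdd => println("Number of records in this batch: " + rdd.count()))

ssc.start()
ssc.awaitTermination()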

Are you expecting Spark to read files already in the directory? If so, this is a common misconception, one that took me by surprise. textFileStream watches a directory for new files to appear, then it reads them. It ignores files already in the directory when you start or files it's already read.
The rationale is that you'll have some process writing files to HDFS, which you then want Spark to read. Note that these files must appear atomically, e.g., they were slowly written somewhere else and then moved into the watched directory. This is because HDFS doesn't properly handle reading a file that is simultaneously being written.
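For example, a writer process would typically stage the file outside the watched directory and only then rename it in, since a rename within HDFS is atomic. A rough sketch of that pattern (the paths here are placeholders, not from the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// 1. Write the file completely into a staging directory that Spark is NOT watching.
val staged = new Path("/user/root/staging/log-2015-01-01-1200.txt")

// 2. Move it into the watched directory with a single atomic rename,
//    so the streaming job never sees a half-written file.
val watched = new Path("/user/root/logs/log-2015-01-01-1200.txt")
fs.rename(staged, watched)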

val filterF = new Function[Path, Boolean] {
  def apply(x: Path): Boolean = {
    println("checking whether " + x + " should be considered or not")
    if (x.toString.split("/").last.split("_").last.toLong < System.currentTimeMillis) {
      println("considered " + x)
      list += x.toString
      true
    } else {
      false
    }
  }
}
This filter function is used to determine whether each path is actually one you want, so the logic inside apply should be customized to your requirements.
val streamed_rdd = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/user/hdpprod/temp/spark_streaming_output", filterF, false).map { case (x, y) => y.toString }
You also have to set the third argument of fileStream to false; this ensures the stream considers not only new files but also old files already present in the streaming directory.

Related

spark job performing poorly while converting text files to parquet format

I have a Spark Streaming application which is responsible for converting text files into parquet format on the fly and then saving the data in an external Hive table. Please refer to the piece of code below, which is one of the classes for processing text files into parquet:
object HistTableLogic {
  val logger = Logger.getLogger("file")

  def schemadef(batchId: String) {
    println("process started!")
    logger.debug("process started")
    val sourcePath = "some path"
    val destPath = "somepath"
    println(s"source path :${sourcePath}")
    println(s"dest path :${destPath}")
    logger.debug(s"source path :${sourcePath}")
    logger.debug(s"dest path :${destPath}")
    // val sc = new SparkContext(new SparkConf().set("spark.driver.allowMultipleContexts", "true"))
    val conf = new Configuration()
    println("Spark Context created!!")
    logger.debug("Spark Context created!!")
    val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    println("Spark session created!")
    logger.debug("Spark session created!")
    val schema = StructType.apply(spark.read.table("hivetable").schema.fields.dropRight(2))
    try {
      val fs = FileSystem.get(conf)
      spark.sql("ALTER table hivetable drop if exists partition (batch_run_dt='" + batchId.substring(1, 9) + "', batchid='" + batchId + "')")
      fs.listStatus(new Path(sourcePath)).foreach(x => {
        val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("delimiter", "\u0001")
          .schema(schema).csv(s"${sourcePath}/" + batchId).na.fill("").repartition(50).write.mode("overwrite").option("compression", "gzip")
          .parquet(s"${destPath}/batch_run_dt=" + batchId.substring(1, 9) + "/batchid=" + batchId)
        spark.sql("ALTER table hivetable add partition (batch_run_dt='" + batchId.substring(1, 9) + "', batchid='" + batchId + "')")
        logger.debug("Partition added")
      })
    } catch {
      case e: Exception => {
        println("---------Exception caught---------!")
        logger.debug("---------Exception caught---------!")
        e.printStackTrace()
        logger.debug(e.printStackTrace)
        logger.debug(e.getMessage)
      }
    }
  }
}
I am calling the schemadef method of the above class from the main method of another Java class, which contains the logic for receiving batchIds 24x7 via a custom receiver.
Functionally the application runs fine, but it is taking around 15 minutes to process even 1 GB of data, whereas if I simply load the data into the Hive table through a LOAD query, it happens within a minute.
I am using the following configuration for the Spark job:
SPARK_MASTER YARN
SPARK_DEPLOY-MODE CLUSTER
SPARK_DRIVER-MEMORY 13g
SPARK_NUM-EXECUTORS 6
SPARK_EXECUTOR-MEMORY 15g
SPARK_EXECUTOR-CORES 2
Please let me know if you find any flaw in this, or any other optimization I can make to speed up this process. Thank you.

Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?

Is it possible to read pdf/audio/video files (unstructured data) using Apache Spark?
For example, I have thousands of pdf invoices and I want to read data from those and perform some analytics on them. What steps do I need to take to process such unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, and then use map to map the values to some other format - for example, parse the binary content with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...
val ready = rawFile.map( /* parse here with the other framework */ )
What is important is that the parsing must be done with another framework, as mentioned above; inside map you get access to an InputStream for each file.
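For instance, a rough Scala sketch of that idea, extracting text from PDFs with the Apache Tika facade (the input path is a placeholder, and the Tika parser jars are assumed to be on the classpath):
import org.apache.spark.sql.SparkSession
import org.apache.tika.Tika

val spark = SparkSession.builder.appName("pdf-text").getOrCreate()
val sc = spark.sparkContext

// One (path, PortableDataStream) pair per file
val rawFiles = sc.binaryFiles("hdfs:///data/invoices/*.pdf")

val texts = rawFiles.map { case (path, stream) =>
  val in = stream.open()                    // InputStream over the file contents
  try {
    (path, new Tika().parseToString(in))    // Tika detects the format and extracts plain text
  } finally {
    in.close()
  }
}

texts.take(5).foreach { case (path, text) => println(path + " -> " + text.take(100)) }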
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
import socket
import subprocess

from pyspark import SparkContext, SparkConf, AccumulatorParam
from pyspark.sql import HiveContext

def decryptUncompressAndParseFile(filePathAndContents):
    '''each line of the file becomes an RDD record'''
    global acc_errCount, acc_errLog
    proc = subprocess.Popen(['custom_decrypt_program', '--decrypt'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (unzippedData, err) = proc.communicate(input=filePathAndContents[1])
    if len(err) > 0:  # problem reading the file
        acc_errCount.add(1)
        acc_errLog.add('Error: ' + str(err) + ' in file: ' + filePathAndContents[0] +
                       ', on host: ' + socket.gethostname() + ' return code: ' + str(proc.returncode))
        return []  # this is okay with flatMap
    records = list()
    iterLines = iter(unzippedData.splitlines())
    for line in iterLines:
        # sys.stderr.write('Line: ' + str(line) + '\n')
        values = [x.strip() for x in line.split('|')]
        ...
        records.append( (... extract data as appropriate from values into this tuple ...) )
    return records
class StringAccumulator(AccumulatorParam):
    ''' custom accumulator to hold strings '''
    def zero(self, initValue=""):
        return initValue
    def addInPlace(self, str1, str2):
        return str1.strip() + '\n' + str2.strip()
def main():
    ...
    global acc_errCount, acc_errLog
    acc_errCount = sc.accumulator(0)
    acc_errLog = sc.accumulator('', StringAccumulator())
    binaryFileTup = sc.binaryFiles(args.inputDir)
    # use flatMap instead of map, to handle corrupt files
    linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
    df = sqlContext.createDataFrame(linesRdd, ourSchema())
    df.registerTempTable("dataTable")
    ...
The custom string accumulator was very useful in identifying corrupt input files.

How to recursively read Hadoop files from directory using Spark?

Inside the given directory I have many different folders and inside each folder I have Hadoop files (part_001, etc.).
directory
  -> folder1
       -> part_001...
       -> part_002...
  -> folder2
       -> part_001...
  ...
Given this directory, how can I recursively read the content of all its folders and load that content into a single RDD in Spark using Scala?
I found this, but it does not recursively enter sub-folders (I am using import org.apache.hadoop.mapreduce.lib.input):
var job: Job = null
try {
  job = Job.getInstance()
  FileInputFormat.setInputPaths(job, new Path("s3n://" + bucketNameData + "/" + directoryS3))
  FileInputFormat.setInputDirRecursive(job, true)
} catch {
  case ioe: IOException => ioe.printStackTrace(); System.exit(1)
}
val sourceData = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).values
I also found this web page that uses SequenceFile, but again I don't understand how to apply it to my case.
If you are using Spark, you can do this using wildcards as follows:
scala> sc.textFile("path/*/*")
sc is the SparkContext, which is initialized by default if you are using spark-shell; if you are writing your own program you will have to instantiate a SparkContext yourself.
Be careful with the following flag:
scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
res6: String = null
You should set this flag to true:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
I have found that the parameters must be set in this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
connector_output=${basepath}/output/connector/*/*/*/*/*
works for me when I have a directory structure like -
${basepath}/output/connector/2019/01/23/23/output*.dat
I didn't have to set any other properties; I just used the following -
sparkSession.read().format("csv").schema(schema)
    .option("delimiter", "|")
    .load("/user/user1/output/connector/*/*/*/*/*");

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
    context = SparkContext(appName="PythonHBaseBulkLoader")
    streamingContext = StreamingContext(context, 5)
    stream = streamingContext.textFileStream("file:///test/input")
    stream.foreachRDD(bulk_load)
    streamingContext.start()
    streamingContext.awaitTermination()
What I need help with is the bulk load function:
def bulk_load(rdd):
    #???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well and successfully bulk loads data (using Puts or HFiles). I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assumes you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
    # Your configuration will likely be different. Insert your own quorum, parent znode and table name
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)  # split into lines, then convert each CSV line to key-value pairs
    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
    cols = row.split(",")  # split on commas
    # Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
    # Works well for n >= 1 columns
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result
The value converter we defined earlier will convert these tuples into HBase Puts
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;

public class GatewayApplication {
    public static void main(String[] args) {
        GatewayApplication app = new GatewayApplication();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
def bulk_load(rdd):
    # The output class changes, everything else stays
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}  # alternative: "org.apache.hadoop.hbase.client.Put"
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)\
                  .sortByKey(True)
    # Don't process empty RDDs
    if not load_rdd.isEmpty():
        # saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
        # The file has now been written, but HBase doesn't know about it
        # Get a link to Py4J
        gateway = JavaGateway()
        # Convert conf to a fully fledged Configuration type
        config = dict_to_conf(conf)
        # Set up our HTable
        htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
        # Set up our path
        path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
        # Get a bulk loader
        loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
        # Load the HFile
        loader.doBulkLoad(path, htable)
    else:
        print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
    gateway = JavaGateway()
    config = gateway.jvm.org.apache.hadoop.conf.Configuration()
    keys = conf.keys()
    vals = conf.values()
    for i in range(len(keys)):
        config.set(keys[i], vals[i])
    return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data load it is probably worth it since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed to be true, especially since "10" < "9". If you have designed your key to be unique, then this can be fixed easily:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # sort in ascending order

Reading Distributed Files in Hadoop

I'm trying to do the following in Hadoop:
I have implemented a map-reduce job that outputs a file to directory "foo".
The foo files have key=IntWritable, value=IntWritable format (I used a SequenceFileOutputFormat).
Now, I want to start another map-reduce job. The mapper is fine, but each reducer is required to read all of the "foo" files at start-up (I'm using HDFS for sharing data between reducers).
I used this code in the public void configure(JobConf conf) method:
String uri = "out/foo";
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FileStatus[] status = fs.listStatus(new Path(uri));
for (int i = 0; i < status.length; ++i) {
    Path currFile = status[i].getPath();
    System.out.println("status: " + i + " " + currFile.toString());
    try {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, currFile, conf);
        IntWritable key = (IntWritable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        IntWritable value = (IntWritable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            // do the code for all the pairs.
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The code runs well on a single machine, but I'm not sure whether it will run on a cluster.
In other words, does this code read files from the current machine, or does it read from the distributed file system?
Is there a better solution for what I'm trying to do?
Thanks in advance,
Arik.
The URI passed to FileSystem.get() has no scheme defined, so the file system used depends on the configuration parameter fs.defaultFS. If none is set, the default, i.e. the local file system, is used.
Your program writes to the local file system under workingDir/out/foo. It should work in the cluster as well, but it will look at the local file system of whichever node the code runs on.
With that said, I'm not sure why you need the entire contents of the foo directory in every reducer; you may want to consider other designs. If it is really needed, these files should be copied to HDFS first and read from the overridden setup method of your reducer, and, needless to say, the opened files should be closed in the overridden cleanup method. While files can be read in reducers this way, map/reduce programs are not designed for this kind of functionality.
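As a small illustration of the scheme point (a sketch with a placeholder namenode address, not the asker's code), putting the hdfs:// scheme in the URI selects HDFS explicitly regardless of what fs.defaultFS says:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// The scheme in the URI picks the file system implementation explicitly.
val fs = FileSystem.get(URI.create("hdfs://namenode:8020/user/foo-output"), conf)
fs.listStatus(new Path("/user/foo-output")).foreach(status => println(status.getPath))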
