Is it possible to read PDF/audio/video files (unstructured data) using Apache Spark?
For example, I have thousands of PDF invoices and I want to read data from them and perform some analytics on it. What steps do I need to take to process unstructured data?

Yes, it is. Use sparkContext.binaryFiles to load the files in binary format, then use map to transform each value into some other format - for example, by parsing the binary content with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...)   // path to your files
val ready = rawFile.map { case (path, content) =>
  // parse here with another framework (e.g. Apache Tika or Apache POI)
}
Importantly, the parsing itself must be done with another framework, as mentioned above. Inside map you get access to the file content as a stream (a PortableDataStream in Scala, whose open() gives you an InputStream).
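For reference, a minimal PySpark sketch of the same approach, assuming an existing SparkContext sc and the tika-python package available on the executors with a reachable Tika server (the path and the parsing step are illustrative, not part of the original answer):

from tika import parser   # assumption: tika-python is installed on the executors

def extract_text(path_and_bytes):
    path, content = path_and_bytes
    # hand the raw bytes to Apache Tika; the result dict carries the extracted text under 'content'
    parsed = parser.from_buffer(content)
    return (path, parsed.get("content") or "")

# binaryFiles yields one (path, bytes) pair per file
texts = sc.binaryFiles("hdfs:///path/to/invoices").map(extract_text)
texts.take(1)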

We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
import socket
import subprocess

from pyspark import SparkContext, SparkConf, AccumulatorParam
from pyspark.sql import HiveContext

def decryptUncompressAndParseFile(filePathAndContents):
    '''each line of the file becomes an RDD record'''
    global acc_errCount, acc_errLog
    proc = subprocess.Popen(['custom_decrypt_program', '--decrypt'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (unzippedData, err) = proc.communicate(input=filePathAndContents[1])
    if len(err) > 0:  # problem reading the file
        acc_errCount.add(1)
        acc_errLog.add('Error: ' + str(err) + ' in file: ' + filePathAndContents[0] +
                       ', on host: ' + socket.gethostname() + ' return code: ' + str(proc.returncode))
        return []  # this is okay with flatMap
    records = list()
    iterLines = iter(unzippedData.splitlines())
    for line in iterLines:
        # sys.stderr.write('Line: ' + str(line) + '\n')
        values = [x.strip() for x in line.split('|')]
        ...
        records.append( (... extract data as appropriate from values into this tuple ...) )
    return records

class StringAccumulator(AccumulatorParam):
    ''' custom accumulator to hold strings '''
    def zero(self, initValue=""):
        return initValue
    def addInPlace(self, str1, str2):
        return str1.strip() + '\n' + str2.strip()

def main():
    ...
    global acc_errCount, acc_errLog
    acc_errCount = sc.accumulator(0)
    acc_errLog = sc.accumulator('', StringAccumulator())
    binaryFileTup = sc.binaryFiles(args.inputDir)
    # use flatMap instead of map, to handle corrupt files
    linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
    df = sqlContext.createDataFrame(linesRdd, ourSchema())
    df.registerTempTable("dataTable")
    ...
The custom string accumulator was very useful in identifying corrupt input files.
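A small sketch of how that works in practice: accumulators are only populated on the driver after an action has forced evaluation, so run an action before reading them (the action shown here is just illustrative):

# any action that materializes linesRdd/df will populate the accumulators
df.count()
if acc_errCount.value > 0:
    print('Corrupt input files: ' + str(acc_errCount.value))
    print(acc_errLog.value)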

Related

Dataloader on top of protobuf file using pytorch's torch.utils.data.IterableDataset

I am building an RNN network using PyTorch.
The data is stored in various protobuf files.
Each record in the protobuf represents one training example with multiple timestamps.
As this is a very large dataset, reading the whole thing into memory, or doing random reads by extending the torch.utils.data.Dataset class, isn't feasible.
Per the docs, using torch.utils.data.IterableDataset is recommended, and a DataLoader on top of an IterableDataset should be able to achieve parallelism.
However, I am not able to find an implementation of this on custom data; the docs only show a simple range iterator.
import math
import stream
from src import record_pb2
import torch

class MyIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, pb_file):
        self.pb_file = pb_file
        self.start = 0
        self.end = 0
        # One-time read of the data to get the total count of records in the dataset
        with stream.open(self.pb_file, 'rb') as data_stream:
            for _ in data_stream:
                self.end += 1

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading, return the full iterator
            iter_start = self.start
            iter_end = self.end
        else:
            # in a worker process, split the workload
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        data_stream = stream.open(self.pb_file, 'rb')
        # Block to skip the streamed records up to iter_start for the current worker process
        i = 0
        for _ in data_stream:
            i += 1
            if i >= iter_start:
                break
        return iter(data_stream)
I am looking for a mechanism by which a parallel data feeder can be designed on top of large streamed data (protobuf).
The __iter__ method of the IterableDataset yields your data samples one at a time. In a parallel setup, you have to choose the samples based on worker_id. With respect to the DataLoader using this dataset, the shuffle and sampler options will not work, because an IterableDataset does not have indices. In other words, have your dataset yield one sample at a time and the DataLoader will take care of loading them. Does this answer your question?
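A minimal sketch of that idea, under the assumptions that the total record count is known up front and that a parse_fn callable turns a raw protobuf record into a tensor-ready sample (both are placeholders, not from the original post; the 'stream' package is the one used in the question):

import math
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ProtobufIterableDataset(IterableDataset):
    def __init__(self, pb_file, num_records, parse_fn):
        self.pb_file = pb_file
        self.num_records = num_records   # total record count, known up front
        self.parse_fn = parse_fn         # e.g. wraps record_pb2 parsing

    def _raw_records(self):
        # assumption: the same 'stream' package as in the question reads the protobuf records
        import stream
        with stream.open(self.pb_file, 'rb') as s:
            for raw in s:
                yield raw

    def __iter__(self):
        info = get_worker_info()
        if info is None:                 # single-process loading: take everything
            start, end = 0, self.num_records
        else:                            # worker process: take a contiguous slice
            per_worker = int(math.ceil(self.num_records / float(info.num_workers)))
            start = info.id * per_worker
            end = min(start + per_worker, self.num_records)
        for i, raw in enumerate(self._raw_records()):
            if i < start:
                continue
            if i >= end:
                break
            yield self.parse_fn(raw)

# usage: loader = DataLoader(ProtobufIterableDataset(path, n, parse_fn), num_workers=4)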

How to use ExecuteScript (with python as a script engine) for an exercise to add numbers? [Novice user trying to learn NiFi]

I am relatively new to NiFi and am not sure how to do the following correctly. I would like to use the ExecuteScript processor (script engine: Python) to do the following (only in Python, please):
1) There is a CSV file containing the following information (the first row is the header):
first,second,third
1,4,9
7,5,2
3,8,7
2) I would like to find the sum of individual rows and generate a final file with a modified header. The final file should look like this:
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18
For the python script, I wrote:
def summation(first, second, third):
    numbers = first + second + third
    return numbers

flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile, summation())
But it does not work and I am not sure how to fix it. Can anyone help me understand how to approach this problem?
The NiFi flow: (screenshot omitted)
Thank you
Your script is not doing what you want it to do. There are a couple of approaches to this problem:
Operate on the whole flowfile at once with a script that iterates over the rows in the CSV content
Treat the rows in the CSV content as a "record" and operate on each record with a script that handles a single line
I will provide changes to your script to handle the entire flowfile content at once; you can read more about the Record* processors here, here, and here.
Here is a script which performs the action you expect. Note the differences to see where I changed things (this script could certainly be made more efficient and concise; it is verbose to demonstrate what is happening, and I am not a Python expert).
import json
from java.io import BufferedReader, InputStreamReader
from org.apache.nifi.processor.io import StreamCallback

# This PyStreamCallback class is what the processor will use to ingest and output the flowfile content
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        try:
            # Get the provided inputStream into a format where you can read lines
            reader = BufferedReader(InputStreamReader(inputStream))
            # Set a marker for the first line to be the header
            isHeader = True
            try:
                # A holding variable for the lines
                lines = []
                # Loop indefinitely
                while True:
                    # Get the next line
                    line = reader.readLine()
                    # If there is no more content, break out of the loop
                    if line is None:
                        break
                    # If this is the first line, add the new column
                    if isHeader:
                        header = line + ",total"
                        # Write the header line and the new column
                        lines.append(header)
                        # Set the header flag to false now that it has been processed
                        isHeader = False
                    else:
                        # Split the line (a string) into individual elements by the ',' delimiter
                        elements = self.extract_elements(line)
                        # Get the sum (this method is unnecessary but shows where your "summation" method would go)
                        total = self.summation(elements)
                        # Write the output of this line
                        newLine = ",".join([line, str(total)])
                        lines.append(newLine)
                # Now out of the loop, write the output to the outputStream
                output = "\n".join([str(l) for l in lines])
                outputStream.write(bytearray(output.encode('utf-8')))
            finally:
                if reader is not None:
                    reader.close()
        except Exception as e:
            log.warn("Exception in Reader")
            log.warn('-' * 60)
            log.warn(str(e))
            log.warn('-' * 60)
            raise e
            session.transfer(flowFile, REL_FAILURE)

    def extract_elements(self, line):
        # This splits the line on the ',' delimiter and converts each element to an integer, and puts them in a list
        return [int(x) for x in line.split(',')]

    # This method replaces your "summation" method and can accept any number of inputs, not just 3
    def summation(self, values):
        # This returns the sum of all items in the list
        return sum(values)

flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile, PyStreamCallback())
    session.transfer(flowFile, REL_SUCCESS)
Result from my flow (using your input in a GenerateFlowFile processor):
2018-07-20 13:54:06,772 INFO [Timer-Driven Process Thread-5] o.a.n.processors.standard.LogAttribute LogAttribute[id=b87f0c01-0164-1000-920e-799647cb9b48] logging for flow file StandardFlowFileRecord[uuid=de888571-2947-4ae1-b646-09e61c85538b,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1532106928567-1, container=default, section=1], offset=2499, length=51],offset=0,name=470063203212609,size=51]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Fri Jul 20 13:54:06 EDT 2018'
Key: 'lineageStartDate'
Value: 'Fri Jul 20 13:54:06 EDT 2018'
Key: 'fileSize'
Value: '51'
FlowFile Attribute Map Content
Key: 'filename'
Value: '470063203212609'
Key: 'path'
Value: './'
Key: 'uuid'
Value: 'de888571-2947-4ae1-b646-09e61c85538b'
--------------------------------------------------
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18
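For reference, here is a minimal, NiFi-independent sketch of the core transformation the script performs, as plain Python that can be run and tested outside ExecuteScript (the function name is illustrative):

def add_total_column(csv_text):
    lines = csv_text.strip().splitlines()
    out = [lines[0] + ",total"]                     # extend the header
    for line in lines[1:]:
        values = [int(x) for x in line.split(",")]
        out.append(line + "," + str(sum(values)))   # append the row sum
    return "\n".join(out)

print(add_total_column("first,second,third\n1,4,9\n7,5,2\n3,8,7"))
# first,second,third,total
# 1,4,9,14
# 7,5,2,14
# 3,8,7,18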

How to use Hadoop's MapFileOutputFormat in Flink?

I've gotten stuck while writing a program using Apache Flink. The problem is that I'm trying to generate a Hadoop MapFile as the result of the computation, but the Scala compiler complains about a type mismatch.
To illustrate the problem, let me show the code snippet below, which tries to generate two kinds of output: one is Hadoop's SequenceFile and the other is a MapFile.
val dataSet: DataSet[(IntWritable, BytesWritable)] =
  env.readSequenceFile(classOf[Text], classOf[BytesWritable], inputSequenceFile.toString)
    .map(mapper(_))
    .partitionCustom(partitioner, 0)
    .sortPartition(0, Order.ASCENDING)

val seqOF = new HadoopOutputFormat(
  new SequenceFileOutputFormat[IntWritable, BytesWritable](), Job.getInstance(hadoopConf)
)

val mapfileOF = new HadoopOutputFormat(
  new MapFileOutputFormat(), Job.getInstance(hadoopConf)
)

val dataSink1 = dataSet.output(seqOF)     // it typechecks!
val dataSink2 = dataSet.output(mapfileOF) // compile error
As commented above, dataSet.output(mapfileOF) causes the Scala compiler to report a type mismatch (the error message is omitted here).
FYI, compared to SequenceFile, MapFile imposes a stronger requirement: the key must be WritableComparable.
Before writing the application in Flink, I implemented it in Spark as shown below, and it worked fine (no compilation error, and it runs without any error).
val rdd = sc
  .sequenceFile(inputSequenceFile.toString, classOf[Text], classOf[BytesWritable])
  .map(mapper(_))
  .repartitionAndSortWithinPartitions(partitioner)

rdd.saveAsNewAPIHadoopFile(
  outputPath.toString,
  classOf[IntWritable],
  classOf[BytesWritable],
  classOf[MapFileOutputFormat]
)
Did you check: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/hadoop_compatibility.html#using-hadoop-outputformats
It contains the following example:
// Obtain your result to emit.
val hadoopResult: DataSet[(Text, IntWritable)] = [...]

val hadoopOF = new HadoopOutputFormat[Text, IntWritable](
  new TextOutputFormat[Text, IntWritable],
  new JobConf)

hadoopOF.getJobConf.set("mapred.textoutputformat.separator", " ")
FileOutputFormat.setOutputPath(hadoopOF.getJobConf, new Path(resultPath))
hadoopResult.output(hadoopOF)

Streaming to HBase with pyspark

There is a fair amount of info online about bulk loading to HBase with Spark streaming using Scala (these two were particularly useful) and some info for Java, but there seems to be a lack of info for doing it with PySpark. So my questions are:
How can data be bulk loaded into HBase using PySpark?
Most examples in any language only show a single column per row being upserted. How can I upsert multiple columns per row?
The code I currently have is as follows:
if __name__ == "__main__":
    context = SparkContext(appName="PythonHBaseBulkLoader")
    streamingContext = StreamingContext(context, 5)

    stream = streamingContext.textFileStream("file:///test/input")
    stream.foreachRDD(bulk_load)

    streamingContext.start()
    streamingContext.awaitTermination()
What I need help with is the bulk load function
def bulk_load(rdd):
    #???
I've made some progress previously, with many and various errors (as documented here and here)
So after much trial and error, I present here the best I have come up with. It works well and successfully bulk loads data (using either Puts or HFiles). I am perfectly willing to believe that it is not the best method, so any comments/other answers are welcome. This assumes you're using a CSV for your data.
Bulk loading with Puts
By far the easiest way to bulk load, this simply creates a Put request for each cell in the CSV and queues them up to HBase.
def bulk_load(rdd):
    # Your configuration will likely be different. Insert your own quorum and parent node and table name
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)  # split the input into lines, then convert each CSV line to key-value pairs

    load_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
The function csv_to_key_value is where the magic happens:
def csv_to_key_value(row):
    cols = row.split(",")  # split on commas
    # Each cell is a tuple of (key, [key, column-family, column-descriptor, value])
    # Works well for n>=1 columns
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result
The value converter we defined earlier will convert these tuples into HBase Puts.
Bulk loading with HFiles
Bulk loading with HFiles is more efficient: rather than a Put request for each cell, an HFile is written directly and the RegionServer is simply told to point to the new HFile. This will use Py4J, so before the Python code we have to write a small Java program:
import py4j.GatewayServer;
import org.apache.hadoop.hbase.*;

public class GatewayApplication {

    public static void main(String[] args)
    {
        GatewayApplication app = new GatewayApplication();
        GatewayServer server = new GatewayServer(app);
        server.start();
    }
}
Compile this, and run it. Leave it running as long as your streaming is happening. Now update bulk_load as follows:
def bulk_load(rdd):
    # The output class changes, everything else stays
    conf = {"hbase.zookeeper.quorum": "localhost:2181",
            "zookeeper.znode.parent": "/hbase-unsecure",
            "hbase.mapred.outputtable": "Test",
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}  # instead of "org.apache.hadoop.hbase.client.Put"

    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)\
                  .sortByKey(True)
    # Don't process empty RDDs
    if not load_rdd.isEmpty():
        # saveAsNewAPIHadoopDataset changes to saveAsNewAPIHadoopFile
        # (startTime is defined elsewhere; it gives each batch a unique output directory)
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
        # The file has now been written, but HBase doesn't know about it

        # Get a link to Py4J (requires: from py4j.java_gateway import JavaGateway)
        gateway = JavaGateway()
        # Convert conf to a fully fledged Configuration type
        config = dict_to_conf(conf)
        # Set up our HTable
        htable = gateway.jvm.org.apache.hadoop.hbase.client.HTable(config, "Test")
        # Set up our path
        path = gateway.jvm.org.apache.hadoop.fs.Path("/tmp/hfiles" + startTime)
        # Get a bulk loader
        loader = gateway.jvm.org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles(config)
        # Load the HFile
        loader.doBulkLoad(path, htable)
    else:
        print("Nothing to process")
Finally, the fairly straightforward dict_to_conf:
def dict_to_conf(conf):
    gateway = JavaGateway()
    config = gateway.jvm.org.apache.hadoop.conf.Configuration()
    keys = list(conf.keys())
    vals = list(conf.values())
    for i in range(len(keys)):
        config.set(keys[i], vals[i])
    return config
As you can see, bulk loading with HFiles is more complex than using Puts, but depending on your data volume it is probably worth it, since once you get it working it's not that difficult.
One last note on something that caught me off guard: HFiles expect the data they receive to be written in lexical order. This is not always guaranteed, especially since lexically "10" < "9". If you have designed your keys to be unique, then this is easily fixed:
load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
              .flatMap(csv_to_key_value)\
              .sortByKey(True)  # sort in ascending order

How to export data from Spark SQL to CSV

This command works with HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me on how to implement export to CSV in Spark SQL.
You can use the statement below to write the contents of a DataFrame in CSV format:
df.write.csv("/data/home/csv")
If you need to write the whole DataFrame into a single CSV file, then use:
df.coalesce(1).write.csv("/data/home/sample.csv")
For Spark 1.x, you can use spark-csv to write the results to CSV files.
The Scala snippet below should help:
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")
Since Spark 2.x, CSV support is integrated as a native data source, so the statement simplifies to (Windows):
df.write
  .option("header", "true")
  .csv("file:///C:/out.csv")
or (Unix):
df.write
  .option("header", "true")
  .csv("/var/out.csv")
Note: as the comments say, this creates a directory with that name containing the partition files, not a standard CSV file. This, however, is most likely what you want, since otherwise you would either crash your driver (out of RAM) or you would have to be working in a non-distributed environment.
The answer above with spark-csv is correct, but there is an issue - the library creates several files based on the DataFrame partitioning, which is usually not what we need. So you can combine all partitions into one:
df.coalesce(1).
write.
format("com.databricks.spark.csv").
option("header", "true").
save("myfile.csv")
and then rename the library's output (named "part-00000") to the desired filename.
This blog post provides more details: https://fullstackml.com/2015/12/21/how-to-export-data-frame-from-apache-spark/
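A minimal PySpark sketch of that write-then-rename step, assuming an active SparkSession named spark and using PySpark's internal JVM gateway to reach the Hadoop FileSystem API (the paths are illustrative):

# write a single part file into a temporary directory
df.coalesce(1).write.option("header", "true").csv("/data/home/csv_tmp")

# rename the part-* file to the desired name via the Hadoop FileSystem API
jvm = spark.sparkContext._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
src_dir = jvm.org.apache.hadoop.fs.Path("/data/home/csv_tmp")
part_file = [f.getPath() for f in fs.listStatus(src_dir)
             if f.getPath().getName().startswith("part-")][0]
fs.rename(part_file, jvm.org.apache.hadoop.fs.Path("/data/home/sample.csv"))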
The simplest way is to map over the DataFrame's RDD and use mkString:
df.rdd.map(x=>x.mkString(","))
As of Spark 1.5 (or even before that),
df.map(r => r.mkString(","))
would do the same. If you want CSV escaping, you can use Apache Commons Lang for that. For example, here's the code we're using:
import org.apache.commons.lang.StringEscapeUtils // or org.apache.commons.lang3.StringEscapeUtils
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

def DfToTextFile(path: String,
                 df: DataFrame,
                 delimiter: String = ",",
                 csvEscape: Boolean = true,
                 partitions: Int = 1,
                 compress: Boolean = true,
                 header: Option[String] = None,
                 maxColumnLength: Option[Int] = None) = {

  def trimColumnLength(c: String) = {
    val col = maxColumnLength match {
      case None => c
      case Some(len: Int) => c.take(len)
    }
    if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
  }

  def rowToString(r: Row) = {
    val st = r.mkString("~-~").replaceAll("[\\p{C}|\\uFFFD]", "") // remove control characters
    st.split("~-~").map(trimColumnLength).mkString(delimiter)
  }

  def addHeader(r: RDD[String]) = {
    val rdd = for (h <- header;
                   if partitions == 1; // headers only supported for single partitions
                   tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
    rdd.getOrElse(r)
  }

  val rdd = df.map(rowToString).repartition(partitions)
  val headerRdd = addHeader(rdd)

  if (compress)
    headerRdd.saveAsTextFile(path, classOf[GzipCodec])
  else
    headerRdd.saveAsTextFile(path)
}
With the help of spark-csv, we can write to a CSV file:
val dfsql = sqlContext.sql("select * from tablename")
dfsql.write.format("com.databricks.spark.csv").option("header", "true").save("output.csv")
The error message suggests this is not a supported feature in the query language. But you can save a DataFrame in any format as usual through the RDD interface (df.rdd.saveAsTextFile). Or you can check out https://github.com/databricks/spark-csv.
To load a CSV file into a DataFrame:
val p = spark.read.format("csv").options(Map("header" -> "true", "delimiter" -> "^")).load("filename.csv")
