Error in datetime module in Python script for NiFi - apache-nifi

So I am trying to implement this script:
from datetime import datetime, timedelta, date

flowfile = session.get()
current_date = datetime.today().date()

if flowfile != None:
    src_table_name = flowfile.getAttribute('src_table_name')
    date_filter = str(current_date.replace(day=11))
    if src_table_name in ['Income_project_BP', 'Plan_final_scenario', 'Plan_primary_scenario', 'ProductionCalendar', 'Subcontracting_BP']:
        flowfile = session.putAttribute(flowfile, 'date_filter', date_filter)
        session.transfer(flowfile, REL_SUCCESS)
    else:
        session.transfer(flowfile, REL_FAILURE)
I've tried a lot of different approaches, but whenever it comes to the "replace" method I keep getting this error. The same thing happens when I try to use the "strftime" method.
What should I do? I really need this method.
Thank you beforehand!
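
For reference, here is a minimal sketch of the same flow that builds the date string with plain string formatting instead of replace()/strftime(). Since the actual error text isn't quoted above, this is only an illustration of an alternative, not a confirmed fix; it assumes the usual ExecuteScript bindings (session, REL_SUCCESS, REL_FAILURE) as in the original script:
from datetime import datetime

flowfile = session.get()
if flowfile is not None:
    src_table_name = flowfile.getAttribute('src_table_name')
    today = datetime.today()
    # Build "YYYY-MM-11" directly, avoiding date.replace() and strftime()
    date_filter = '%04d-%02d-11' % (today.year, today.month)
    if src_table_name in ['Income_project_BP', 'Plan_final_scenario', 'Plan_primary_scenario', 'ProductionCalendar', 'Subcontracting_BP']:
        flowfile = session.putAttribute(flowfile, 'date_filter', date_filter)
        session.transfer(flowfile, REL_SUCCESS)
    else:
        session.transfer(flowfile, REL_FAILURE)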

Related

TypeError: Object of type RowProxy is not JSON serializable - Flask

I am using SQLAlchemy to query the database from my Flask web application through an engine. I run a SELECT query and call fetchall() on the ResultProxy that is returned, which gives me RowProxy objects that I then store in the session.
Here is my code:
import os
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker
from flask import Flask, session

engine = create_engine(os.environ.get('DATABASE_URL'))
db = scoped_session(sessionmaker(bind=engine))
app = Flask(__name__)
app.secret_key = os.environ.get('SECRET_KEY')

@app.route('/')
def index():
    session['list'] = db.execute("SELECT title,author,year FROM books WHERE year = 2011 LIMIT 4").fetchall()
    print(session['list'])
    return "<h1>hello world</h1>"

if __name__ == "__main__":
    app.run(debug=True)
Here is the output:
[('Steve Jobs', 'Walter Isaacson', 2011), ('Legend', 'Marie Lu', 2011), ('Hit List', 'Laurell K. Hamilton', 2011), ('Born at Midnight', 'C.C. Hunter', 2011)]
Traceback (most recent call last):
  File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\site-packages\flask\app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "C:\Users\avise\AppData\Local\Programs\Python\Python38\Lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type RowProxy is not JSON serializable
The session item stores the data, as I can see in the output, but "hello world" is not rendered.
If I replace the session variable with an ordinary variable, say x, then it seems to work.
But I think I need to use sessions so that my application can be used simultaneously by two users to display different things. So, how could I use sessions in this case, or is there any other way?
Any help will be appreciated, as I am new to Flask and web development.
From what I understand, the Flask session object acts like a Python dictionary; however, its values must be JSON serializable. In this case, just as the error suggests, the RowProxy object returned by fetchall() is not JSON serializable.
A solution to this problem is to instead store the result of your query as a dictionary (which is JSON serializable).
It looks like your query returns a list of tuples, so we can do the following:
res = db.execute("SELECT title,author,year FROM books WHERE year = 2011 LIMIT 4").fetchall()
user_books = {}
index = 0
for entry in res:
    user_books[index] = {'title': res[index][0],
                         'author': res[index][1],
                         'year': res[index][2],
                         }
    index += 1
session['list'] = user_books
A word of caution, however: if you were to use the title of the book as the key instead of a numeric index, two books with the same title would overwrite each other, so consider using a unique id as the key.
Also note that the dictionary construction above would only work for the query you already have - if you added another column to the select statement you would have to edit the code to include the extra column information.
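As a more compact sketch of the same idea (assuming the same three-column query), each row can be converted to a plain dict inside a list comprehension; the resulting list is JSON serializable and preserves the row order:
res = db.execute("SELECT title,author,year FROM books WHERE year = 2011 LIMIT 4").fetchall()
# Each RowProxy supports positional access, so build plain dicts that the session can serialize
session['list'] = [{'title': row[0], 'author': row[1], 'year': row[2]} for row in res]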

How to use ExecuteScript (with python as a script engine) for an exercise to add numbers? [Novice user trying to learn NiFi]

I am relatively new to NiFi and am not sure how to do the following correctly. I would like to use the ExecuteScript processor (script engine: Python) to do the following (only in Python, please):
1) There is a CSV file containing the following information (the first row is the header):
first,second,third
1,4,9
7,5,2
3,8,7
2) I would like to find the sum of individual rows and generate a final file with a modified header. The final file should look like this:
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18
For the python script, I wrote:
def summation(first, second, third):
    numbers = first + second + third
    return numbers

flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile, summation())
But it does not work and I am not sure how to fix this. Can anyone provide me an understanding on how to approach this problem?
The NiFi flow: (screenshot not included here)
Thank you
Your script is not doing what you would like it to do. There are a couple approaches to this problem:
Operate on the whole flowfile at once with a script that iterates over the rows in the CSV content
Treat the rows in the CSV content as a "record" and operate on each record with a script that handles a single line
I will provide changes to your script to handle the entire flowfile content at once; you can read more about the Record* processors here, here, and here.
Here is a script which performs the action you expect. Note the differences to see where I changed things (this script could certainly be made more efficient and concise; it is verbose to demonstrate what is happening, and I am not a Python expert).
import json
from java.io import BufferedReader, InputStreamReader
from org.apache.nifi.processor.io import StreamCallback

# This PyStreamCallback class is what the processor will use to ingest and output the flowfile content
class PyStreamCallback(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        try:
            # Get the provided inputStream into a format where you can read lines
            reader = BufferedReader(InputStreamReader(inputStream))
            # Set a marker for the first line to be the header
            isHeader = True
            try:
                # A holding variable for the lines
                lines = []
                # Loop indefinitely
                while True:
                    # Get the next line
                    line = reader.readLine()
                    # If there is no more content, break out of the loop
                    if line is None:
                        break
                    # If this is the first line, add the new column
                    if isHeader:
                        header = line + ",total"
                        # Write the header line and the new column
                        lines.append(header)
                        # Set the header flag to false now that it has been processed
                        isHeader = False
                    else:
                        # Split the line (a string) into individual elements by the ',' delimiter
                        elements = self.extract_elements(line)
                        # Get the sum (this method is unnecessary but shows where your "summation" method would go)
                        sum = self.summation(elements)
                        # Write the output of this line
                        newLine = ",".join([line, str(sum)])
                        lines.append(newLine)
                # Now out of the loop, write the output to the outputStream
                output = "\n".join([str(l) for l in lines])
                outputStream.write(bytearray(output.encode('utf-8')))
            finally:
                if reader is not None:
                    reader.close()
        except Exception as e:
            log.warn("Exception in Reader")
            log.warn('-' * 60)
            log.warn(str(e))
            log.warn('-' * 60)
            raise e
            session.transfer(flowFile, REL_FAILURE)

    def extract_elements(self, line):
        # This splits the line on the ',' delimiter and converts each element to an integer, and puts them in a list
        return [int(x) for x in line.split(',')]

    # This method replaces your "summation" method and can accept any number of inputs, not just 3
    def summation(self, list):
        # This returns the sum of all items in the list
        return sum(list)

flowFile = session.get()
if (flowFile != None):
    flowFile = session.write(flowFile, PyStreamCallback())
    session.transfer(flowFile, REL_SUCCESS)
Result from my flow (using your input in a GenerateFlowFile processor):
2018-07-20 13:54:06,772 INFO [Timer-Driven Process Thread-5] o.a.n.processors.standard.LogAttribute LogAttribute[id=b87f0c01-0164-1000-920e-799647cb9b48] logging for flow file StandardFlowFileRecord[uuid=de888571-2947-4ae1-b646-09e61c85538b,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1532106928567-1, container=default, section=1], offset=2499, length=51],offset=0,name=470063203212609,size=51]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
Value: 'Fri Jul 20 13:54:06 EDT 2018'
Key: 'lineageStartDate'
Value: 'Fri Jul 20 13:54:06 EDT 2018'
Key: 'fileSize'
Value: '51'
FlowFile Attribute Map Content
Key: 'filename'
Value: '470063203212609'
Key: 'path'
Value: './'
Key: 'uuid'
Value: 'de888571-2947-4ae1-b646-09e61c85538b'
--------------------------------------------------
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?

Is it possible to read pdf/audio/video files(unstructured data) using Apache Spark?
For example, I have thousands of PDF invoices and I want to read data from those and perform some analytics on them. What steps do I need to take to process unstructured data?
Yes, it is. Use sparkContext.binaryFiles to load the files in binary format and then use map to convert the value to some other format - for example, parse the binary content with Apache Tika or Apache POI.
Pseudocode:
val rawFile = sparkContext.binaryFiles(...
val ready = rawFile.map ( here parsing with other framework
What is important is that the parsing must be done with another framework, as mentioned above; map will get an InputStream as an argument.
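As a rough PySpark sketch of the same idea: sc.binaryFiles returns an RDD of (path, bytes) pairs, and the actual parsing is delegated to whatever library you choose. parse_pdf and the input path below are hypothetical placeholders, not part of any Spark API:
from pyspark import SparkContext

sc = SparkContext(appName="pdf-analytics")

def parse_pdf(raw_bytes):
    # Hypothetical placeholder: hand the raw bytes to a parser such as Apache Tika
    # and return the extracted text or a structured record.
    raise NotImplementedError

# (path, bytes) pairs, one per file under the directory
raw_files = sc.binaryFiles("hdfs:///invoices")

# Keep the path, replace the raw bytes with the parsed content
parsed = raw_files.mapValues(parse_pdf)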
We had a scenario where we needed to use a custom decryption algorithm on the input files. We didn't want to rewrite that code in Scala or Python. Python-Spark code follows:
import subprocess
import socket

from pyspark import SparkContext, SparkConf, AccumulatorParam
from pyspark.sql import HiveContext

def decryptUncompressAndParseFile(filePathAndContents):
    '''each line of the file becomes an RDD record'''
    global acc_errCount, acc_errLog
    proc = subprocess.Popen(['custom_decrypt_program', '--decrypt'],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    (unzippedData, err) = proc.communicate(input=filePathAndContents[1])
    if len(err) > 0:  # problem reading the file
        acc_errCount.add(1)
        acc_errLog.add('Error: ' + str(err) + ' in file: ' + filePathAndContents[0] +
                       ', on host: ' + socket.gethostname() + ' return code: ' + str(proc.returncode))
        return []  # this is okay with flatMap
    records = list()
    iterLines = iter(unzippedData.splitlines())
    for line in iterLines:
        # sys.stderr.write('Line: ' + str(line) + '\n')
        values = [x.strip() for x in line.split('|')]
        ...
        records.append( (... extract data as appropriate from values into this tuple ...) )
    return records
class StringAccumulator(AccumulatorParam):
    ''' custom accumulator to hold strings '''
    def zero(self, initValue=""):
        return initValue
    def addInPlace(self, str1, str2):
        return str1.strip() + '\n' + str2.strip()

def main():
    ...
    global acc_errCount, acc_errLog
    acc_errCount = sc.accumulator(0)
    acc_errLog = sc.accumulator('', StringAccumulator())
    binaryFileTup = sc.binaryFiles(args.inputDir)
    # use flatMap instead of map, to handle corrupt files
    linesRdd = binaryFileTup.flatMap(decryptUncompressAndParseFile, True)
    df = sqlContext.createDataFrame(linesRdd, ourSchema())
    df.registerTempTable("dataTable")
    ...
The custom string accumulator was very useful in identifying corrupt input files.
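For completeness, here is a small sketch of how those accumulators might be inspected on the driver once an action has forced evaluation (assuming the same acc_errCount / acc_errLog names as above):
# After an action (e.g. df.count()) has materialized the data,
# the accumulator values can be read on the driver:
df.count()
if acc_errCount.value > 0:
    print('Corrupt input files detected: %d' % acc_errCount.value)
    print(acc_errLog.value)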

How to export data from Spark SQL to CSV

This command works with HiveQL:
insert overwrite directory '/data/home.csv' select * from testtable;
But with Spark SQL I'm getting an error with an org.apache.spark.sql.hive.HiveQl stack trace:
java.lang.RuntimeException: Unsupported language features in query:
insert overwrite directory '/data/home.csv' select * from testtable
Please guide me on how to write an export-to-CSV feature in Spark SQL.
You can use the statement below to write the contents of a dataframe in CSV format:
df.write.csv("/data/home/csv")
If you need to write the whole dataframe into a single CSV file, then use
df.coalesce(1).write.csv("/data/home/sample.csv")
For Spark 1.x, you can use spark-csv to write the results into CSV files.
The Scala snippet below would help:
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.write.format("com.databricks.spark.csv").save("/data/home/csv")
To write the contents into a single file
import org.apache.spark.sql.hive.HiveContext
// sc - existing spark context
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM testtable")
df.coalesce(1).write.format("com.databricks.spark.csv").save("/data/home/sample.csv")
Since Spark 2.x, spark-csv is integrated as a native data source. Therefore, the necessary statement simplifies to (Windows):
df.write
.option("header", "true")
.csv("file:///C:/out.csv")
or (Unix):
df.write
.option("header", "true")
.csv("/var/out.csv")
Notice: as the comments say, it creates a directory with that name containing the partition files, not a standard CSV file. This, however, is most likely what you want, since otherwise you are either crashing your driver (out of RAM) or you could be working with a non-distributed environment.
The answer above with spark-csv is correct, but there is an issue: the library creates several files based on the data frame partitioning, and this is not what we usually need. So, you can combine all partitions into one:
df.coalesce(1).
write.
format("com.databricks.spark.csv").
option("header", "true").
save("myfile.csv")
and rename the output of the library (named "part-00000") to the desired filename.
This blog post provides more details: https://fullstackml.com/2015/12/21/how-to-export-data-frame-from-apache-spark/
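If the output directory is on the local filesystem (an assumption here; for HDFS you would use the Hadoop FileSystem API or hdfs dfs -getmerge instead), the rename step can be a few lines of plain Python:
import glob
import shutil

# "myfile.csv" is the directory Spark wrote; copy its single part file out under a real filename
part_file = glob.glob("myfile.csv/part-*")[0]
shutil.copy(part_file, "myfile_final.csv")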
The simplest way is to map over the DataFrame's RDD and use mkString:
df.rdd.map(x=>x.mkString(","))
As of Spark 1.5 (or even before that),
df.map(r => r.mkString(",")) would do the same.
If you want CSV escaping, you can use Apache Commons Lang for that. For example, here's the code we're using:
def DfToTextFile(path: String,
                 df: DataFrame,
                 delimiter: String = ",",
                 csvEscape: Boolean = true,
                 partitions: Int = 1,
                 compress: Boolean = true,
                 header: Option[String] = None,
                 maxColumnLength: Option[Int] = None) = {

  def trimColumnLength(c: String) = {
    val col = maxColumnLength match {
      case None => c
      case Some(len: Int) => c.take(len)
    }
    if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
  }

  def rowToString(r: Row) = {
    val st = r.mkString("~-~").replaceAll("[\\p{C}|\\uFFFD]", "") //remove control characters
    st.split("~-~").map(trimColumnLength).mkString(delimiter)
  }

  def addHeader(r: RDD[String]) = {
    val rdd = for (h <- header;
                   if partitions == 1; //headers only supported for single partitions
                   tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
    rdd.getOrElse(r)
  }

  val rdd = df.map(rowToString).repartition(partitions)
  val headerRdd = addHeader(rdd)

  if (compress)
    headerRdd.saveAsTextFile(path, classOf[GzipCodec])
  else
    headerRdd.saveAsTextFile(path)
}
With the help of spark-csv we can write to a CSV file.
val dfsql = sqlContext.sql("select * from tablename")
dfsql.write.format("com.databricks.spark.csv").option("header","true").save("output.csv")
The error message suggests this is not a supported feature in the query language. But you can save a DataFrame in any format as usual through the RDD interface (df.rdd.saveAsTextFile). Or you can check out https://github.com/databricks/spark-csv.
In a DataFrame:
val p=spark.read.format("csv").options(Map("header"->"true","delimiter"->"^")).load("filename.csv")

How to save a spark rdd to an avro file

I am trying to save an RDD to a file in Avro format. This is what my code looks like:
val output = s"/test/avro/${date.toString(dayFormat)}"
rmr(output) // deleting the path
rdd.coalesce(64).saveAsNewAPIHadoopFile(
  output,
  classOf[org.apache.hadoop.io.NullWritable],
  classOf[PageViewEvent],
  classOf[AvroKeyValueOutputFormat[org.apache.hadoop.io.NullWritable, PageViewEvent]],
  spark.hadoopConfiguration)
When I run this I get an error saying:
Unsupported input type PageViewEvent
The type of the rdd is RDD[(Null,PageViewEvent)].
Can someone explain to me what I am doing wrong?
Thanks in advance
So I managed to find a 'workaround'.
val job = new Job(spark.hadoopConfiguration)
AvroJob.setOutputKeySchema(job, PageViewEvent.SCHEMA$)
val output = s"/avro/${date.toString(dayFormat)}"
rmr(output)
rdd.coalesce(64).map(x => (new AvroKey(x._1), x._2))
  .saveAsNewAPIHadoopFile(
    output,
    classOf[PageViewEvent],
    classOf[org.apache.hadoop.io.NullWritable],
    classOf[AvroKeyOutputFormat[PageViewEvent]],
    job.getConfiguration)
This works fine. I don't try to use AvroKeyValueOutputFormat anymore, but I think now I would be able to. The key changes were to use AvroKey and to set the output key schema.
