How to handle nested double quotes in Nifi? - apache-nifi

We have a csv file with nested double quotes column.
For example : 1,John,26,"how are you "Jim"".
In this example we have 4 columns id, name, age and message.
Here message column is having nested double quotes, which is causing the data parsing issue in convertRecord Nifi processor(could not parse incoming data error). Is there any way we can escape nested double quotes and read the data properly ?
As shown in the below image, we are using the following properties in both CSVReader and CSVRecordSetWritter controller services.

We had the exact same issue and as #daggett highlighted - How could you detect which quote is the end of the field? We even spoke with Cloudera, and everything boils down to that data does not conform to CSV standard rules.
So written a small python script which is called using ExecuteScript processor, and able to escape almost all the special characters except when double quote and dilimiter is part of the data eg. "field_1","field_2 this is very invalid", data","field_3"
Give it a go and please comment if it works so that we can encompass logic into a custom processor!
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
from org.apache.nifi.processors.script import ExecuteScript
from org.python.core.util.FileUtil import wrap
from io import StringIO
import re
# Define a subclass of StreamCallback for use in session.write()
class PyStreamCallback(StreamCallback):
def __init__(self):
pass
def process(self, inputStream, outputStream):
with wrap(inputStream) as f:
lines = f.readlines()
outer_new_value_list = []
is_header_row = True
for row in lines:
if is_header_row:
is_header_row = False
outer_new_value_list.append(row)
continue
char_list = list(row.strip())
for position, char in enumerate(char_list):
#print(position, char)
# if position == 54:
# print()
if (position + 1) == len(char_list):
continue
if position == 0:
continue
else:
if char == '"':
if char_list[position - 1] == ',' or char_list[position + 1] == ',':
# this double quote is Quote Character at start of field or end of field
continue
if char_list[position - 1] != ',' and char_list[position + 1] != ',':
# this double quote is inbetween and is not Quote Character, add escape character to it
replace_char = '\\' + char
char_list[position] = replace_char
if char == ',':
# Int values are not in double quotes, so check previous and next char is of int type
previous_char_type = ''
next_char_type = ''
try:
previous_char = char_list[position - 1]
if isinstance(int(previous_char), int):
previous_char_type = 'Int'
except:
pass
# print('previous_char : ' + str(previous_char))
try:
next_char = char_list[position + 1]
if isinstance(int(next_char), int):
next_char_type = 'Int'
except:
pass
# print(" next_char: " + str(next_char))
if previous_char_type == 'Int' or next_char_type == 'Int':
print('No need to replace this instance of comma')
continue
if char_list[position - 1] == '"' or char_list[position + 1] == '"':
# delimited comma
continue
if char_list[position - 1] != '"' and char_list[position + 1] != '"':
# not delimited comma, inbetween comma, add with escape character to it
replace_char = '\\' + char
char_list[position] = replace_char
if char == '\\':
replace_char = ''
char_list[position] = replace_char
new_data_line = ''.join([str(elem) for elem in char_list])
outer_new_value_list.append(new_data_line + '\r\n')
with wrap(outputStream, 'w') as filehandle:
filehandle.writelines("%s" % line for line in outer_new_value_list)
# end class
flowFile = session.get()
if (flowFile != None):
flowFile = session.write(flowFile, PyStreamCallback())
session.transfer(flowFile, ExecuteScript.REL_SUCCESS)
# implicit return at the end

Related

how to build a parser with antlr4 target golang without visitors and walkers

I am trying to write a small parser with golang target, but not using visitors or walkers, but I am not able to find any sample code to build my parser upon.
For example, the following is the grammar code which I am trying to replicate with golang:
# Expr.g4:
grammar Expr;
#header {
}
#parser::members {
def eval(self, left, op, right):
if ExprParser.MUL == op.type:
return left * right
elif ExprParser.DIV == op.type:
return left / right
elif ExprParser.ADD == op.type:
return left + right
elif ExprParser.SUB == op.type:
return left - right
else:
return 0
}
stat: e NEWLINE {print($e.v);}
| ID '=' e NEWLINE {self.memory[$ID.text] = $e.v}
| NEWLINE
;
e returns [int v]
: a=e op=('*'|'/') b=e {$v = self.eval($a.v, $op, $b.v)}
| a=e op=('+'|'-') b=e {$v = self.eval($a.v, $op, $b.v)}
| INT {$v = $INT.int}
| ID
{
id = $ID.text
$v = self.memory.get(id, 0)
}
| '(' e ')' {$v = $e.v}
;
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
SUB : '-' ;
ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace
And this is the python tester code for it:
# test_expr.py:
import sys
from antlr4 import *
from antlr4.InputStream import InputStream
from ExprLexer import ExprLexer
from ExprParser import ExprParser
if __name__ == '__main__':
parser = ExprParser(None)
parser.buildParseTrees = False
parser.memory = {} # how to add this to generated constructor?
line = sys.stdin.readline()
lineno = 1
while line != '':
line = line.strip()
istream = InputStream(line + "\n")
lexer = ExprLexer(istream)
lexer.line = lineno
lexer.column = 0
token_stream = CommonTokenStream(lexer)
parser.setInputStream(token_stream)
parser.stat()
line = sys.stdin.readline()
lineno += 1
Can anybody please post a sample golang code which is equivalent to the above python and inlined code?

Pyparsing: changing default debug actions for all parser elements

I want to improve the readability of pyparsing's debugging output by adding indentation. For example, instead of this:
Match part at loc 0(1,1)
Match subpart1 at loc 0(1,1)
Match subsubpart1 at loc 0(1,1)
Matched subsubpart1 at loc 10(2,1) -> ...
Matched subpart1 at loc 20(3,1) -> ...
Match subpart2 at loc 20(3,1)
Match subsubpart2 at loc 20(3,1)
Matched subsubpart2 at loc 30(4,1) -> ...
Matched subpart2 at loc 40(5,1) -> ...
Matched part at loc 50(6,1) -> ...
I would like to have it indented like this to better understand what's going on during parsing:
Match part at loc 0(1,1)
Match subpart1 at loc 0(1,1)
Match subsubpart1 at loc 0(1,1)
Matched subsubpart1 at loc 10(2,1) -> ...
Matched subpart1 at loc 20(3,1) -> ...
Match subpart2 at loc 20(3,1)
Match subsubpart2 at loc 20(3,1)
Matched subsubpart2 at loc 30(4,1) -> ...
Matched subpart2 at loc 40(5,1) -> ...
Matched part at loc 50(6,1) -> ...
So in pyparsing.py, I just changed _defaultStartDebugAction, _defaultSuccessDebugAction and _defaultExceptionDebugAction to:
pos = -1
def _defaultStartDebugAction( instring, loc, expr ):
global pos
pos = pos + 1
print ("\t" * pos + ("Match " + _ustr(expr) + " at loc " + _ustr(loc) + "(%d,%d)" % ( lineno(loc,instring), col(loc,instring) )))
def _defaultSuccessDebugAction( instring, startloc, endloc, expr, toks ):
print ("\t" * pos + "Matched " + _ustr(expr) + " -> " + str(toks.asList()))
global pos
pos = pos - 1
def _defaultExceptionDebugAction( instring, loc, expr, exc ):
print ("\t" * pos + "Exception raised:" + _ustr(exc))
global pos
pos = pos - 1
(I just added the pos expressions and "\t" * pos to the output to get my desired result)
However, I don't like tampering directly with the pyparsing library. On the other hand, I don't want to use the .setDebugActions method on every parser element I define, I want them all to use my modified default debug actions.
Is there a way I can achieve this without having to tamper with the pyparsing.py library directly?
Thanks!
Python modules are just like any other Python object, and you can manipulate their symbols using standard Python function decorating methods. Often referred to as "monkeypatching", these can be done entirely from your own code, without modifying the actual library source.
The simplest way to implement this change is to just overwrite the symbols. In your code, write:
import pyparsing
# have to import _ustr explicitly, since it does not get pulled in with '*' import
_ustr = pyparsing._ustr
pos = -1
def defaultStartDebugAction_with_indent( instring, loc, expr ):
global pos
pos = pos + 1
print ("\t" * pos + ("Match " + _ustr(expr) + " at loc " + _ustr(loc) + "(%d,%d)" % ( lineno(loc,instring), col(loc,instring) )))
def defaultSuccessDebugAction_with_indent( instring, startloc, endloc, expr, toks ):
global pos
print ("\t" * pos + "Matched " + _ustr(expr) + " -> " + str(toks.asList()))
pos = pos - 1
def defaultExceptionDebugAction_with_indent( instring, loc, expr, exc ):
global pos
print ("\t" * pos + "Exception raised:" + _ustr(exc))
pos = pos - 1
pyparsing._defaultStartDebugAction = defaultStartDebugAction_with_indent
pyparsing._defaultSuccessDebugAction = defaultSuccessDebugAction_with_indent
pyparsing._defaultExceptionDebugAction = defaultExceptionDebugAction_with_indent
Or a cleaner version is to wrap the original methods with your code as a decorator:
pos = -1
def incr_pos(fn):
def _inner(*args):
global pos
pos += 1
print ("\t" * pos , end="")
return fn(*args)
return _inner
def decr_pos(fn):
def _inner(*args):
global pos
print ("\t" * pos , end="")
pos -= 1
return fn(*args)
return _inner
import pyparsing
pyparsing._defaultStartDebugAction = incr_pos(pyparsing._defaultStartDebugAction)
pyparsing._defaultSuccessDebugAction = decr_pos(pyparsing._defaultSuccessDebugAction)
pyparsing._defaultExceptionDebugAction = decr_pos(pyparsing._defaultExceptionDebugAction)
This way, if you update pyparsing and the original code changes, your monkeypatch will get the updates without your having to modify your copies of the original methods.
To make your intentions even clearer, and to avoid duplicating those function names (DRY), this will replace those last 3 lines:
def monkeypatch_decorate(module, name, deco_fn):
setattr(module, name, deco_fn(getattr(module, name)))
monkeypatch_decorate(pyparsing, "_defaultStartDebugAction", incr_pos)
monkeypatch_decorate(pyparsing, "_defaultSuccessDebugAction", decr_pos)
monkeypatch_decorate(pyparsing, "_defaultExceptionDebugAction", decr_pos)

ZPL print Italic Bold and Underline

I need to print ZPL with dynamic content.
I Need your help. I am dynamic content , please help me.
is this word possible to print. Note the content is dynamic.
ZPL Code please.
If you want to type bold you can use this
^FO340,128^FDbold^FS
^FO339,128^FDbold^FS
Another option (External fonts usage for underline, italic and bold)
http://labelary.com/docs.html
There is no simple way to bold or italicize text withing ZPL. The fonts the printer has are very basic and can't be changed like that.
Complex font settings (italic, bold, serif ) are actually sent as compressed images to ZPL printers (you can check this with ZebraDesigner).
The format is called Z64, which is based on LZ77.
These two pages contain interesting code in Java to write a converter :
http://www.jcgonzalez.com/img-to-zpl-online
https://gist.github.com/trevarj/1255e5cbc08fb3f79c3f255e25989a18
...still I'm not sure whether the CRC part of the conversion will remain the same in the future, as this is probably vendor-dependent.
Here is a Python port of the first script :
import cv2
import base64
import matplotlib.pyplot as plt
import io
import numpy
blacklimit=int(50* 768/100)
compress=False
total=0
width_byte=0
mapCode = dict()
LOCAL_PATH="C://DEV//printer//zebra_debug.txt"
'''
class StringBuilder(object):
def __init__(self):
self._stringio = io.StringIO()
def __str__(self):
return self._stringio.getvalue()
def getvalue(self):
return self._stringio.getvalue()
def append(self, *objects, sep=' ', end=''):
print(*objects, sep=sep, end=end, file=self._stringio)
'''
def init_map_code():
global mapCode
mapCode[1] = "G"
mapCode[2] = "H"
mapCode[3] = "I"
mapCode[4] = "J"
mapCode[5] = "K"
mapCode[6] = "L"
mapCode[7] = "M"
mapCode[8] = "N"
mapCode[9] = "O"
mapCode[10] = "P"
mapCode[11] = "Q"
mapCode[12] = "R"
mapCode[13] = "S"
mapCode[14] = "T"
mapCode[15] = "U"
mapCode[16] = "V"
mapCode[17] = "W"
mapCode[18] = "X"
mapCode[19] = "Y"
mapCode[20] = "g"
mapCode[40] = "h"
mapCode[60] = "i"
mapCode[80] = "j"
mapCode[100] = "k"
mapCode[120] = "l"
mapCode[140] = "m"
mapCode[160] = "n"
mapCode[180] = "o"
mapCode[200] = "p"
mapCode[220] = "q"
mapCode[240] = "r"
mapCode[260] = "s"
mapCode[280] = "t"
mapCode[300] = "u"
mapCode[320] = "v"
mapCode[340] = "w"
mapCode[360] = "x"
mapCode[380] = "y"
mapCode[400] = "z"
def numberToBase(n, b):
if n == 0:
return [0]
digits = []
while n:
digits.append(int(n % b))
n //= b
return digits[::-1]
def four_byte_binary(binary_str):
decimal=int(binary_str, 2)
if decimal>15:
returned=hex(decimal).upper()
returned=returned[2:]
else:
#returned=hex(decimal).upper()+"0"
returned=hex(decimal).upper()
if binary_str!="00000000":
print("cut="+returned)
returned=returned[2:]
returned="0"+returned
if binary_str!="00000000":
print("low10="+returned)
#
if binary_str!="00000000":
print(binary_str+"\t"+str(decimal)+"\t"+returned+"\t")
return returned
def createBody(img):
global blacklimit
global width_byte
global total
height, width, colmap = img.shape
print(height)
print(width)
print(colmap)
rgb = 0
index=0
aux_binary_char=['0', '0', '0', '0', '0', '0', '0', '0']
sb=[]
if(width%8>0):
width_byte=int((width/8)+1)
else:
width_byte=width/8
total=width_byte*height
print(height)
print("\n")
print(width)
print("\n")
i=0
for h in range(0, height):
for w in range(0, width):
color = img[h,w]
#print(color)
#print(w)
blue=color[0]
green=color[1]
red=color[2]
blue=blue & 0xFF
green=green & 0xFF
red=red & 0xFF
"""
blue=np.uint8(blue)
green=np.unit8(green)
red=np.unit8(red)
"""
#print(bin(blue))
auxchar='1'
total_color=red+green+blue
if(total_color> blacklimit):
#print('above_black_limit')
auxchar='0'
aux_binary_char[index]=auxchar
index=index+1
if(index==8 or w==(width-1)):
if "".join(aux_binary_char) !="00000000":
print(i)
sb.append(four_byte_binary("".join(aux_binary_char)))
i=i+1
aux_binary_char=['0', '0', '0', '0', '0', '0', '0', '0']
index=0
#print(h)
sb.append("\n")
#print(sb)
print(blacklimit)
return ''.join(sb)
def encode_hex_ascii(code):
global width_byte
global mapCode
max_linea=width_byte*2
sb_code=[]
sb_linea=[]
previous_line=1
counter=1
aux = code[0]
first_char=False
for i in range(1, len(code)):
if(first_char):
aux=code[i]
first_char=False
continue
if(code[i]=="\n"):
if(counter>= max_linea and aux=='0'):
sb_linea.append(",")
elif(counter>= max_linea and aux=='F'):
sb_linea.append("!")
elif(counter>20):
multi20=int((counter/20))*20
resto20=counter%20
sb_linea.append(mapCode[multi20])
if(resto20!=0):
sb_linea.append(mapCode[resto20] +aux)
else:
sb_linea.append(aux)
else:
sb_linea.append(mapCode[counter] +aux)
counter=1
first_char=True
if(''.join(sb_linea)==previous_line):
sb_code.append(":")
else:
sb_code.append(''.join(sb_linea))
previous_line=''.join(sb_linea)
sb_linea=[]
continue
if aux==code[i]:
counter=counter+1
else:
if counter>20:
multi20=int((counter/20))*20
resto20=counter%20
sb_linea.append(mapCode[multi20])
if resto20!=0:
sb_linea.append(mapCode[resto20] + aux)
else:
sb_linea.append(aux)
else:
sb_linea.append(mapCode[counter] + aux)
counter=1
aux=code[i]
return ''.join(sb_code)
def head_doc():
global total
global width_byte
return "^XA " + "^FO0,0^GFA,"+ str(int(total)) + ","+ str(int(total)) + "," + str(int(width_byte)) +", "
def foot_doc():
return "^FS"+ "^XZ"
def process(img):
global compress
init_map_code()
cuerpo=createBody(img)
print("CUERPO\n")
print(cuerpo)
print("\n")
if compress:
cuerpo=encode_hex_ascii(cuerpo)
print("COMPRESS\n")
print(cuerpo)
print("\n")
return head_doc() + cuerpo + foot_doc()
img = cv2.imread("C:\\Users\\ftheeten\\Pictures\\out.jpg", cv2.IMREAD_COLOR )
compress=True
blacklimit ==int(50* 768/100)
test=process(img)
file=open(LOCAL_PATH, 'w')
file.write(test)
file.close()

python, import csv file, read columns and rows, remove blank spaces, convert strings to real numbers

I'm having trouble with importing a csv file into python and having it separate the information. I want to also remove all the blank spaces and convert the numbers (which are strings right now) into integers. Here is what I have so far. These lines work but do not accomplish the task of removing the blank spaces and converting the strings to integers.
filename = 'myfile.csv'
f = open(filename, 'r')
read = f.readlines()
print(read)
for i in range(len(read)):
read[i] = read[i].split(',')
print(read)
header = read[0]
print(header)
info = {}
cntr = 0
for name in header:
info[name] = [line[cntr] for line in read]
cntr += 1
print(info)
I searched through past examples on this forum and this is what I tried to do to have the blank spaces removed but now I'm lost:
import csv
aList = []
with open('myfile.csv', 'r') as f:
reader = csv.reader(f, skipinitialspace = True, delimiter = ',', quoting = csv.QUOTE_NONE)
for row in reader:
aList.append(row)
print(aList)
info = {}
cntr = 0
for i in aList:
info[aList] = [line[cntr] for line in reader]
cntr += 1
print(info)
#sample input
#1 23,456,789
#11 2,11 3,114
import csv
aList = []
with open('myfile.csv', 'r') as f:
reader = csv.reader(f, skipinitialspace = True, delimiter = ',', quoting = csv.QUOTE_NONE)
for row in reader:
aList.append(row)
print(aList)
info = {}
cntr = 0
print [map(int,[j.replace(" ","") for j in i]) for i in aList]
#[[123,456,789][112,113,114]]
Explanation - making the last line simple, and breaking into parts,
#[i for i in aList] gives [["1 23","456","789"]["11 2","11 3","114"]]
#[j.replace(" ","") for j in i] gives [["123","456","789"]["112","113","114"]]
#[map(int,[j.replace(" ","") for j in i]) for i in aList]
#maps all string in list to int and gives [[123,456,789][112,113,114]]

Bracket finding algorithm lua?

I'm making a JSON parser and I am looking for an algorithm that can find all of the matching brackets ([]) and braces ({}) and put them into a table with the positions of the pair.
Examples of returned values:
table[x][firstPos][secondPos] = type
table[x] = {firstPos, secondPos, bracketType}
EDIT: Let parse() be the function that returns the bracket pairs. Let table be the value returned by the parse() function. Let codeString be the string containing the brackets that I want to detect. Let firstPos be the position of the first bracket in the Nth pair of brackets. Let secondPos be the position of the second bracket in the Nth pair of brackets. Let bracketType be the type of the bracket pair ("bracket" or "brace").
Example:
If you called:
table = parse(codeString)
table[N][firstPos][secondPos] would be equal to type.
Well, In plain Lua, you could do something like this, also taking into account nested brackets:
function bm(s)
local res ={}
if not s:match('%[') then
return s
end
for k in s:gmatch('%b[]') do
res[#res+1] = bm(k:sub(2,-2))
end
return res
end
Of course you can generalize this easy enough to braces, parentheses, whatever (do keep in mind the necessary escaping of [] in patterns , except behind the %b pattern).
If you're not restricted to plain Lua, you could use LPeg for more flexibility
If you are not looking for the contents of the brackets, but the locations, the recursive approach is harder to implement, since you should keep track of where you are. Easier is just walking through the string and match them while going:
function bm(s,i)
local res={}
res.par=res -- Root
local lev = 0
for loc=1,#s do
if s:sub(loc,loc) == '[' then
lev = lev+1
local t={par=res,start=loc,lev=lev} -- keep track of the parent
res[#res+1] = t -- Add to the parent
res = t -- make this the current working table
print('[',lev,loc)
elseif s:sub(loc,loc) == ']' then
lev = lev-1
if lev<0 then error('too many ]') end -- more closing than opening.
print(']',lev,loc)
res.stop=loc -- save bracket closing position
res = res.par -- revert to the parent.
end
end
return res
end
Now that you have all matched brackets, you can loop through the table, extracting all locations.
I figured out my own algorithm.
function string:findAll(query)
local firstSub = 1
local lastSub = #query
local result = {}
while lastSub <= #self do
if self:sub(firstSub, lastSub) == query then
result[#result + 1] = firstSub
end
firstSub = firstSub + 1
lastSub = lastSub + 1
end
return result
end
function string:findPair(openPos, openChar, closeChar)
local counter = 1
local closePos = openPos
while closePos <= #self do
closePos = closePos + 1
if self:sub(closePos, closePos) == openChar then
counter = counter + 1
elseif self:sub(closePos, closePos) == closeChar then
counter = counter - 1
end
if counter == 0 then
return closePos
end
end
return -1
end
function string:findBrackets(bracketType)
local openBracket = ""
local closeBracket = ""
local openBrackets = {}
local result = {}
if bracketType == "[]" then
openBracket = "["
closeBracket = "]"
elseif bracketType == "{}" then
openBracket = "{"
closeBracket = "}"
elseif bracketType == "()" then
openBracket = "("
closeBracket = ")"
elseif bracketType == "<>" then
openBracket = "<"
closeBracket = ">"
else
error("IllegalArgumentException: Invalid or unrecognized bracket type "..bracketType.."\nFunction: findBrackets()")
end
local openBrackets = self:findAll(openBracket)
if not openBrackets[1] then
return {}
end
for i, j in pairs(openBrackets) do
result[#result + 1] = {j, self:findPair(j, openBracket, closeBracket)}
end
return result
end
Will output:
5 14
6 13
7 12
8 11
9 10

Resources