Applying string manipulations / mathematical operations to the contents of a flow file in NiFi - apache-nifi

I have a flow file coming in which has fixed-width data in the following format:
ABC 0F 15343543543454434 gghhhhhg
ABC 01 433534343434 hjvh
I want to have my output data in the following format:
ABC|15|15343543543454434|gghhhhhg
ABC|1|433534343434|hjvh
To get this output I need to convert the second field in each line to a base-10 integer and apply a strip operation to all the other fields to trim the whitespace.
I tried using the ReplaceText processor but I could not find a way to convert the second field to a base-10 integer or apply a strip function to the string fields.

Working with hexadecimal numbers is not something that is easily done in the current release of NiFi. In order to get it to work you'd need to use one of the scripting processors, ExecuteScript or InvokeScriptedProcessor.
That said, numeric evaluation is one of my focus areas for the upcoming release (which is currently being finalized), and I've been able to create a solution involving just the ReplaceText processor. I used the following configuration:
Search Value: ^(\w*)\ *(\w*)\ *(\d*)\ *(\w*)$
Replacement Value: $1|${'$2':prepend('0x'):append('p0'):toNumber()}|$3|$4
Replacement Strategy: Regex Replace
Evaluation Mode: Line-by-line
The rest is up to your use-case (i.e. whichever character set it is in). The search value creates a capture group for each of the sections. Then, in the replacement value, I use the second group (the one holding the hex digits) in an Expression Language function to convert it to base 10. The purpose of the "prepend" and "append" is that on the current master only decimals/doubles accept hex numbers (I need to improve that), so I just make it parse as a double.
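If you do end up going the scripting route mentioned above instead, the per-line logic itself is small; here is a rough plain-Python sketch of just the transformation (not wired into the ExecuteScript API):
def transform_line(line):
    # split the fixed-width line on whitespace and strip each field
    fields = [f.strip() for f in line.split()]
    # convert the second field from hex to a base-10 integer
    fields[1] = str(int(fields[1], 16))
    return "|".join(fields)
print(transform_line("ABC 0F 15343543543454434 gghhhhhg"))
# ABC|15|15343543543454434|gghhhhhg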
So, while it is unfortunate that this use-case isn't currently handled out of the box, it soon will be!
Edit: I've created a Jira to track adding hex -> whole numbers in EL here: https://issues.apache.org/jira/browse/NIFI-2950
Edit2: A commit addressing the issue has been merged to master and will be in versions 1.1+: https://github.com/apache/nifi/commit/c4be800688bf23a3bdea8def75b84c0f4ded243d

Related

NiFi: change text in FlowFile (Python or ...)

I'm very new to NiFi.
I get data (a FlowFile?) from my "ConsumeKafka" processor; it looks like this:
I have to delete any text before '!'. I know a little Python, so with "ExecuteScript" I want to do something like this:
my_string=session.get()
my_string.split('!')[1]
# it returns "ZPLR_CHDN_UPN_ECN....."
But how do I do it right?
P.S. Or maybe use "substringAfterLast", but how?
Thanks.
Update:
I have to remove the text between '"Tagname":' and '!'. How can I do it without regex?
If you simply want to split on a bang (!) and only keep the text after it, then you could achieve this with a SplitContent configured as:
Byte Sequence Format: Text
Byte Sequence: !
Keep Byte Sequence: false
Follow this with a RouteOnAttribute configured as:
Routing Strategy: Route to Property name
Add a new dynamic property called "substring_after" with a value: ${fragment.index:equals(2)}
For your input, this will produce 2 FlowFiles - one with the substring before ! and one with the substring after !. The first FlowFile (substring before) will route out of the RouteOnAttribute to the unmatched relationship, while the second FlowFile (substring after) will route to a substring_after relationship. You can auto-terminate the unmatched relationship to drop the text you don't want.
There are downsides to this approach though.
Are you guaranteed that there is only ever a single ! in the content? How would you handle multiple?
You are doing a substring on some JSON as raw text. Splitting on ! will result in a "} left at the end of the string.
These look like log entries; you may want to consider looking into ConsumeKafkaRecord and utilising NiFi's Record capabilities to interpret and manipulate the data more intelligently.
On scripting, there are some great cookbooks for learning to script in NiFi, start here: https://community.cloudera.com/t5/Community-Articles/ExecuteScript-Cookbook-part-1/ta-p/248922
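For completeness, here is a minimal ExecuteScript sketch in Jython along the lines of that cookbook, keeping only the text after the first '!'. Treat it as a starting point; the relationship handling and character set are assumptions:
from org.apache.nifi.processor.io import StreamCallback
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets

class KeepAfterBang(StreamCallback):
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        # keep only what follows the first '!'
        outputStream.write(text.split('!', 1)[-1].encode('utf-8'))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, KeepAfterBang())
    session.transfer(flowFile, REL_SUCCESS)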
Edit:
Given your update, I would use UpdateRecord with a JSON Reader and Writer, and Replacement Value Strategy set to Record Path Value.
This uses the RecordPath syntax to perform transformations on data within Records. Your JSON Object is a Record. This would allow you to have multiple Records within the same FlowFile (rather than 1 line per FlowFile).
Then, add a dynamic property to the UpdateRecord with:
Name: /Tagname
Value: substringAfter(/Tagname, '!')
What is this doing?
The Name of the property (/Tagname) is a RecordPath to the Tagname key in your JSON. This tells UpdateRecord where to put the result. In your case, we're replacing the value of an existing key (but it could also be a new key if you wanted to add one).
The Value of the property is the expression to evaluate to build the value you want to insert. We are using the substringAfter function, which takes 2 parameters. The first parameter is the RecordPath to the Key in the Record that contains the input String, which is also /Tagname (we're replacing the value of Tagname, with a substring of the original Tagname value). The second parameter is the String to split on, which is !.
If your purpose is to get the string between ! and "}, use ReplaceText with (.*)!(.*)"}, capture the second group, and replace the entire content with it.
Please note that this regular expression may not be the best for your case, but I believe you can find a solution to your problem with a regular expression.

Only allow whole numbers but display with '.' as the thousands separator for easy reading

I want to apply data validation to my column so as to only accept whole numbers.
However, I want these to be displayed with a dot to make them easier to read later on.
e.g. input = 14354, which is valid, and it is then displayed as 14.354
The data validation regular expression I am using is:
=regexmatch(to_text(A2);"^\d+\.*\d+$")
and the custom formatting is:
#,##
For the most part this is working fine: large numbers are displayed with the '.' and things it shouldn't accept are rejected.
However, numbers which are entered with a decimal point are accepted as valid, since the decimals are hidden by the formatting.
It is also changing the format to automatic and reading entries like 15.4 as dates.
I should point out that I am using Sheets in Spanish and therefore the ',' is the marker for decimal places.
What am I missing here?
Select the cell range then go to Data > Data validation...
Add a custom formula rule:
=mod(A1;1)=0
Try this one:
=and(regexmatch(to_text(A2);"^\d+(\.\d{3})*$");mod(A2;1)=0)
This improves your formula to only accept a dot when it is followed by 3 digits (this way, we invalidate the date-like entry, e.g. A2).
Combining your improved formula with Aresvik's modulo answer, we also need to check that the value does not have a decimal part (this way, we invalidate the decimal entry, e.g. A6).
When both return true, this confirms that the number entered is a whole number with no decimals and not a date.
Output:
Invalid inputted values:
A2 - 15.4
A6 - 16412,212
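To see what the tightened regex part accepts and rejects on its own, here is a tiny Python check (an illustration only; the mod() half still runs on the cell's numeric value inside Sheets):
import re

pattern = re.compile(r"^\d+(\.\d{3})*$")   # digits, then optional ".ddd" groups
for text in ["14354", "14.354", "15.4", "1.234.567"]:
    print(text, bool(pattern.match(text)))
# 14354 True
# 14.354 True
# 15.4 False      <- a lone digit after the dot no longer matches
# 1.234.567 True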

Find and replace (increment) ASN.1 BER hex value

I have a long string of hex (converted from BER ASN.1) where I need to find and increment a particular value which is incorrect.
<TAG> <LENGTH> <VALUE to INCREMENT>
The ASN.1 tag is 84 and the length byte will change from 01 to 02 when the value is > 127 decimal, so the value to increment will then become 2 bytes.
The value should start at 00.
e.g.
- Original file: ...840101...840107...84020085...84020097
- New file: ...840100...840101...84020080...84020081
Any ideas how best to do this, preferably using standard bash commands?
Ilya Etingof hinted at this already, but to be explicit about it, BER uses TAG, LENGTH, VALUE (TLV) encoding, where the VALUE can itself be a TLV. If you change the length in a TLV that is nested inside a TLV, you will need to update all of the lengths of the enclosing TLVs as well. It is not a simple search/replace operation.
Assuming you already have the octet stream as text, you may consider searching/replacing pieces of text with awk or sed. If you can only use bash, maybe variable substitution (${parameter/pattern/string} or ${parameter:offset:length}) would work?
Keep in mind, however, that BER is quite flexible in the sense that (sometimes) the same data structure may be encoded differently and still constitute a valid encoding. The rationale behind that is to allow the encoder to optimize for its own situation (e.g. to save on memory, CPU cycles, copying, etc.).
What I am trying to say is that, depending on your situation, there is a chance that your search/replace logic may fail. The bullet-proof solution would be to fully decode your BER octet stream, change the data structure you need, and re-encode it back into BER.
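To make the nesting point concrete, here is a minimal pure-Python sketch that walks TLVs at one level of a hex string (primitive tags and short-form lengths only, so it is an illustration rather than a real BER decoder). If the tag-84 value you change sits inside a constructed TLV, every enclosing length would have to be recomputed as well, which is exactly why a plain search/replace is fragile:
def walk_tlvs(hex_string):
    # walk TLVs assuming one-byte tags and short-form (single-byte) lengths
    data = bytes.fromhex(hex_string)
    i = 0
    while i < len(data):
        tag, length = data[i], data[i + 1]
        value = data[i + 2:i + 2 + length]
        yield tag, length, value
        i += 2 + length

for tag, length, value in walk_tlvs("840101" "84020085"):
    print("tag %02X length %d value %s" % (tag, length, value.hex()))
# tag 84 length 1 value 01
# tag 84 length 2 value 0085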

Can I alpha sort base32/64 encoded MD5 hashes?

I've got a massive file of hex-encoded MD5 values that I'm sorting with the Linux 'sort' utility. The result is that the hashes come out in sequential order (which is what I need for the next stage of processing). E.g.:
000001C35AE83CEFE245D255FFC4CE11
000003E4B110FE637E0B4172B386ACAC
000004AAD0EB3D896B654A960B0111FA
In the interest of speeding up the sort operation (and making the files smaller), I was considering encoding the data as base32 or base64.
The question is, would an alpha-sort of the base32/64 data get me the same result? My quick tests seem to indicate that it would work. For example, the above three hex strings correspond 1:1 to these base64 strings:
AAABw1roPO/iRdJV/8TOEQ==
AAAD5LEQ/mN+C0Fys4asrA==
AAAEqtDrPYlrZUqWCwER+g==
But I'm unsure as to the sort order when it comes to special characters used in Base64 like "/" and "+" and how those would be treated in the context of an alpha sort.
Note: I happen to be using the linux sort utility but the question still applies to other alpha-sorting tools. The tool used is not really part of the question.
I've since discovered that this isn't possible with the standard base32/64 implementations. There is, however, a base32 variant called "base32hex" which preserves sort ordering, but there is no official "base64hex" equivalent.
Looks like that leaves creating a custom encoding like this.
EDIT:
This turned out to be very trivial to solve. Simply encode in base64, then translate character by character with a custom table of characters that respects sort order.
Simply map from the standard Mime 64 characters:
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
To something like this:
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
Then sorting will work.
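A small Python sketch of that remapping (the target alphabet is the sorted one above; since every MD5 digest encodes to the same Base64 length, the trailing '==' padding does not affect the ordering):
import base64

STD_B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
SORTABLE = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz|~"
TO_SORTABLE = str.maketrans(STD_B64, SORTABLE)

def sortable_b64(hex_md5):
    # encode the raw digest bytes, then remap onto the order-preserving alphabet
    return base64.b64encode(bytes.fromhex(hex_md5)).decode("ascii").translate(TO_SORTABLE)

hashes = ["000003E4B110FE637E0B4172B386ACAC",
          "000001C35AE83CEFE245D255FFC4CE11",
          "000004AAD0EB3D896B654A960B0111FA"]
# sorting the remapped strings gives the same order as sorting the hex strings
print(sorted(sortable_b64(h) for h in hashes))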

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
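A quick Python illustration of packing and unpacking fields with the unit separator (this still assumes 0x1F itself never appears inside a field):
US = "\x1f"  # ASCII 31, Unit Separator

fields = ["first field", "second, with a comma", "third | with a pipe"]
packed = US.join(fields)            # one string, safe to store
print(packed.split(US) == fields)   # True: round-trips back to the original items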
Assuming for some embarrassing reason you can't use CSV, I'd say go with the data. Take some sample data and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice, get a bigger data set. It won't take much time to write, and you'll get the answer that is best for you (a quick sketch follows below).
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
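A throwaway Python sketch of that frequency check (the filename is just a placeholder for your sample data):
from collections import Counter

with open("sample.txt", encoding="ascii", errors="ignore") as f:
    counts = Counter(f.read())

# printable ASCII characters that never occur in the sample are delimiter candidates
candidates = [chr(c) for c in range(32, 127) if chr(c) not in counts]
print(candidates)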
When using different languages, this symbol ¬ proved to be the best. However, I'm still testing.
Probably | or ^ or ~. You could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
Say you want to concatenate str1, str2 and str3.
What I do is:
// escape: '#' -> "#a" and '|' -> "#p" in each piece, then join on a bare '|'
delimitedStr = str1.Replace("#","#a").Replace("|","#p") + "|" + str2.Replace("#","#a").Replace("|","#p") + "|" + str3.Replace("#","#a").Replace("|","#p");
Then to retrieve the original, use:
// split on the bare '|' delimiter, then unescape in the reverse order
splitStr = delimitedStr.Split("|".ToCharArray());
str1 = splitStr[0].Replace("#p","|").Replace("#a","#");
str2 = splitStr[1].Replace("#p","|").Replace("#a","#");
str3 = splitStr[2].Replace("#p","|").Replace("#a","#");
Note: the order of the replaces is important.
It's unbreakable and easy to implement.
Pipe for the win! |
We use ASCII 0x7F, which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
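A small Python sketch of that approach: each item is Base64-encoded first, so any character outside the Base64 alphabet (a comma here) is a safe delimiter with no escaping needed:
import base64

items = ["contains, commas", "and | pipes", "and\nnewlines"]
packed = ",".join(base64.b64encode(s.encode()).decode() for s in items)
unpacked = [base64.b64decode(p).decode() for p in packed.split(",")]
print(unpacked == items)  # True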
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable character works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, fixed field widths are used instead: you don't even have to scan the file, you can literally pull a record from the file by offset. This is how databases do some of their storage, though they also manage the space between records and such, and it introduces the problem of a maximum data element width. (In the old days an index or header was attached to define the width and data type of each element; later, compression with character remapping was introduced, which can shrink a text file to roughly 1/8 of its size in transmission. Variable-length character encoding for the win.)
Make it dynamic :)
Announce your control characters in the file header.
For example:
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
I have implemented something similar: a plaintar text container format, to escape and wrap UTF-16 text in ASCII, as an alternative to MIME multipart messages.
see https://github.com/milahu/live-diff-html-editor
