PigStorage multibyte field separator - hadoop

I need a StoreFunc implementation that lets Pig write field delimiters of more than one byte, for example ^^ (\u005E\u005E).
I tried all of these without success:
store B into '/tmp/test/output' using PigStorage('\u005E\u005E');
store B into '/tmp/test/output' using PigStorage('^^');
store B into '/tmp/test/output' using PigStorage('\\^\\^');
Is there an existing StoreFunc implementation, analogous to the LoadFunc org.apache.pig.piggybank.storage.MyRegExLoader, that can take a regular expression as the field separator when writing?

I worked around this by using CONCAT for the first delimiter character and the PigStorage separator for the second occurrence, as sketched below.
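A minimal sketch of that workaround, assuming B has exactly two chararray fields f1 and f2 (hypothetical names):
C = FOREACH B GENERATE CONCAT(f1, '^') AS f1, f2;
STORE C INTO '/tmp/test/output' USING PigStorage('^');
CONCAT appends the first '^' to the field value and PigStorage('^') emits the second as the field separator, so each record is written as f1^^f2.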

Related

Pick/UniBasic Field function that operates with a delimiter of more than one character?

Has there ever been an implementation of the field function (page 311) in the various flavors of Pick/UniBasic etc. that would operate on a delimiter of more than one character?
The documented implementations I can find stipulate a single character as the delimiter argument; if the delimiter is supplied with more than one character, the first character of the delimiter string is used instead of the entire string.
I am asking because there are many instances in the commercial and custom software I maintain where I see attempts to use a multi-character delimiter with the FIELD statement. It seems the programmers expected a different result than what actually happens.
jBASE does allow for this. From the FIELD docs:
This function returns a multi-character delimited field from within a string. It takes the general form:
FIELD(string, delimiter, occurrence{, extractCount})
where:
string specifies the string from which the field(s) are to be extracted.
delimiter specifies the character or characters that delimit the fields within the dynamic array.
occurrence should evaluate to an integer of value 1 or higher. It specifies the delimiter used as the starting point for the extraction.
extractCount is an integer that specifies the number of fields to extract. If omitted, it defaults to one.
Additionally, an example from the docs:
in_Value = "AAAA : BBjBASEBB : CCCCC"
CRT FIELD(in_Value , "jBASE", 1)
Producing output:
AAAA : BB
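The optional extractCount parameter returns several adjacent fields, including their intervening delimiters, in one call. A made-up example based on the signature above (not taken from the docs):
CRT FIELD("AA*BB*CC*DD", "*", 2, 2)
Producing output:
BB*CC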
Update 2020-08-13 (adding context for OpenQM):
As an official comment since we maintain both jBASE and OpenQM, I felt it worth calling out that OpenQM does not allow multi-character delimiters for FIELD().

I want to convert text into JSON format using NiFi

I have been trying to convert my integer and string values to JSON format using the ReplaceText processor in NiFi, but I'm having trouble with the regular expression. Can anyone suggest a regular expression for the Search Value and Replacement Value properties?
Original text format:
{Sensor_id:2.4,locationIP:2.2,Sensor_value:A}
Expected JSON format
{Sensor_id:2.4,locationIP:2.2,Sensor_value:"A"}
You can use the regex ([\w_]+):([a-zA-Z]\w*) with the replacement $1:"$2".
Note, however, that valid JSON requires quotes around the keys too. For example:
{"Sensor_id":2.4,"locationIP":2.2,"Sensor_value":"A"}
In this case, I would recommend:
Add a ReplaceText processor with the regex ([\w_]+): and replacement "$1":
Link the output of the first ReplaceText to another ReplaceText processor with the regex ([\w_"]+):([a-zA-Z]\w*) and replacement $1:"$2"
I hope it helps
EDIT:
If you want to transform {Sensor_id:2.4,locationIP:2.2,Sensor_value:A} into {"Sensor_id":"2.4","locationIP":"2.2","Sensor_value":"A"} you can use only one regex in a single processor:
Regex: ([\w_]+):([.\w]*)
Replacement: "$1":"$2"
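If you want to sanity-check these expressions outside NiFi, here is a small plain-Java sketch (standalone code, not the NiFi API) applying the two-step replacement from the list above:
String input = "{Sensor_id:2.4,locationIP:2.2,Sensor_value:A}";
// Step 1: quote every key.
String step1 = input.replaceAll("([\\w_]+):", "\"$1\":");
// Step 2: quote values that start with a letter.
String step2 = step1.replaceAll("([\\w_\"]+):([a-zA-Z]\\w*)", "$1:\"$2\"");
System.out.println(step2);
// {"Sensor_id":2.4,"locationIP":2.2,"Sensor_value":"A"}
The single regex from the EDIT can be tested the same way with one replaceAll call.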

Is it possible to use regex as textinputformat delimiter with JavaSparkContext?

I have multiple text files to read with JavaSparkContext, and each of the files might be slightly different and contain multiline records, so I want to use a regex delimiter to find the records. Is it possible to configure the textinputformat delimiter with a regex?
..
String regex = "^(?!(^a\\s|^b\\s))";
JavaSparkContext jsc = new JavaSparkContext(conf);
jsc.hadoopConfiguration().set("textinputformat.record.delimiter", regex);
..
Unfortunately, it is not. textinputformat.record.delimiter has to be a fixed string. When working with Spark, you have two alternatives:
Implement your own Hadoop input format - scales better but requires more work.
Use wholeTextFiles (or binaryFiles) and split the strings using a regex - easy to use, but doesn't scale to large files. A sketch of this second approach follows.
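A minimal Java sketch (the input path is hypothetical, conf is the same SparkConf as in the question, and the (?m) lookahead mirrors the regex the question tried):
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
..
JavaSparkContext jsc = new JavaSparkContext(conf);
// Each file becomes a single (path, content) pair, so every file must fit in memory.
JavaRDD<String> records = jsc.wholeTextFiles("/tmp/test/input")
    .flatMap(file -> Arrays.asList(
        // (?m) makes ^ match at each line start; split before every line
        // that does not begin with "a " or "b ".
        file._2().split("(?m)^(?!(a\\s|b\\s))")).iterator());
..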

How to have a Multi Character Delimiter in Informatica Cloud?

I have a problem I need help solving. The business I am working for uses Informatica Cloud to do a lot of its ETL into AWS and other services.
We have been given a flat file by the business where the field delimiter is "~|". Currently, to the best of my knowledge, Informatica only accepts a single-character delimiter.
Does anyone know how to overcome this?
Informatica cannot read composite delimiters.
First, you could feed each line as one single long string into an Expression transformation. In this case the delimiter character should be set to \037; I haven't seen this character (the ASCII Unit Separator) in use since at least 1982. Then use repeated invocations of InStr() within the EXP to find the positions of the "~|" sequences, and split each line into fields using SubStr().
Second (easier in the mapping, more work with the session), you could feed the file into some utility which replaces the "~|" sequences with the ASCII 31 character (the Unit Separator mentioned above); the session has to be set up so that it reads the output of this utility (input file type = Command instead of File). Then the source definition should contain \037 as the field delimiter instead of any pipe character. A sketch of such a command follows.
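For example, a pre-processing command for the second approach could look like this, assuming GNU sed and a hypothetical input file source.txt (configured as the session's Command):
sed 's/~|/\x1f/g' source.txt
\x1f is the hexadecimal form of ASCII 31 (octal \037), so the source definition can then declare the Unit Separator as its single-character field delimiter.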

Splitting clipboard import in ABAP

I'm using the CLPB_IMPORT function module to get the clipboard contents into an internal table, and that part works. I'm copying two-column Excel data, so the table gets filled with what looks like a '#' delimiter, like:
4448#3000
4449#4000
4441#5000
But the problem is splitting these strings. I'm trying:
LOOP AT foytab.
SPLIT foytab-tab AT '#' INTO temp1 temp2.
ENDLOOP.
But it doesn't split; it puts the whole line into temp1. I think the delimiter is not what I thought ('#'), because when I build a string manually with a '#' delimiter, it splits fine.
Do you have any idea how to split this?
You should not use CLPB_IMPORT since it's explicitly marked as obsolete. Use CL_GUI_FRONTEND_SERVICES=>CLIPBOARD_IMPORT instead.
The data is probably not separated by # but by a tab character. You can check this in the hex view of the debugger; # is just a replacement symbol the UI uses for any unprintable character. If the delimiter is the tab character, you can use the constant CL_ABAP_CHAR_UTILITIES=>HORIZONTAL_TAB, as in the sketch below.
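A minimal sketch combining both suggestions (table and variable names are hypothetical; the two-column clipboard layout from the question is assumed):
DATA: lt_clip TYPE STANDARD TABLE OF char255,
      lv_line TYPE char255,
      temp1   TYPE char255,
      temp2   TYPE char255.
" Read the clipboard with the non-obsolete API.
cl_gui_frontend_services=>clipboard_import( IMPORTING data = lt_clip ).
LOOP AT lt_clip INTO lv_line.
  " Split on the real delimiter: the horizontal tab, not a literal '#'.
  SPLIT lv_line AT cl_abap_char_utilities=>horizontal_tab INTO temp1 temp2.
ENDLOOP.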
