NiFi SplitText processor escape character - apache-nifi

I use the SplitText processor to split a multi-statement HQL file into single HQL statements on the semicolon before sending them to the PutHiveQL processor.
My problem is that I need to concatenate some fields and separate them with ;, meaning that I want SplitText to ignore that particular semicolon.
I tried to escape it with a backslash.
Example:
drop table my table if exits;
create external table mytable as
select CONCAT_WS('\;\',field1,field2.field3) as concatfields
from oldtable;
Now this results in the following statements:
flowfile1
drop table my table if exits;
flowfile2
create external table mytable as
select CONCAT_WS('\;
flowfile3
\',field1,field2.field3) as concatfields
from oldtable;
But clearly I want to escape the semicolon in CONCAT_WS('\;\',field1,field2.field3) as concatfields.
Is that possible?

Are you using SplitContent? I thought SplitText only split on line boundaries. If you use SplitContent you should be able to split on ;\n (use Shift+Enter to input a newline character) and choose Keep Byte Sequence. This should split when a semicolon ends a line but keep your escaped semicolon intact.
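The splitting rule suggested above (split only where a semicolon ends a line, and keep the semicolon) can be sketched outside NiFi. This is a minimal Python illustration, not NiFi code: the function name `split_statements` and the cleaned-up sample script are assumptions made for the example, and it only roughly mimics what SplitContent does with Keep Byte Sequence enabled.

```python
def split_statements(script, delimiter=";\n"):
    """Split on a semicolon that ends a line, keeping the semicolon."""
    parts = script.split(delimiter)
    statements = []
    for i, part in enumerate(parts):
        is_last = i == len(parts) - 1
        # Every part except the last one was followed by the delimiter,
        # so put the semicolon back on it.
        text = part if is_last else part + ";"
        if text.strip():
            statements.append(text.strip())
    return statements


# Hypothetical sample: the semicolon inside the quotes is not followed
# by a newline, so it is not treated as a statement boundary.
script = (
    "drop table if exists mytable;\n"
    "create external table mytable as\n"
    "select CONCAT_WS(';', field1, field2, field3) as concatfields\n"
    "from oldtable;\n"
)
statements = split_statements(script)
```

With this rule the semicolon inside CONCAT_WS does not need escaping at all, as long as it never sits at the end of a line.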

Related

Replace comma with double quotes enclosed comma for all character fields?

In my schema, for a specific set of tables, any character field that contains a comma must be enclosed in double quotes. How can I achieve this for all the character fields of that set of tables in one go? I am using Oracle 11g.
Not sure if I'm following the question correctly, but maybe this will give you what you need.
Create our example table:
create table foobar (foo varchar2(30));
Populate it with test values:
insert into foobar values ('There is no comma.');
insert into foobar values ('There is, a comma.');
This query will wrap comma-containing values in double-quotes:
select decode(instr(foo,','),0,foo,'"'||foo||'"') from foobar;
And here's the output:
"There is, a comma."
There is no comma.
Here's how it works. The INSTR tests for the presence of a comma in the column. If there is no comma (INSTR returns 0) the DECODE returns the column as is. In all other circumstances (i.e. a comma exists) we wrap the output in double-quotes.
Repeat for each table/column.
It would be simple to use the syntax in an update statement, if that's what you're trying to achieve. So,
update foobar
set foo=decode(instr(foo,','),0,foo,'"'||foo||'"');
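The DECODE/INSTR rule above is a simple conditional, and the same logic can be sketched in Python to make it easy to test on sample values. The function name `quote_if_comma` is made up for this illustration; it is not part of Oracle.

```python
def quote_if_comma(value):
    """Wrap the value in double quotes only when it contains a comma."""
    # Mirrors decode(instr(foo, ','), 0, foo, '"' || foo || '"').
    return f'"{value}"' if "," in value else value


rows = ["There is no comma.", "There is, a comma."]
quoted = [quote_if_comma(value) for value in rows]
```

Applying the rule twice wraps an already-quoted value a second time, because the value still contains a comma. The same holds for the UPDATE statement, so it should only be run once per column.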

How to load data into Oracle using SQL Loader with skipping and merging columns?

I am trying to load data into an Oracle database using SQL*Loader.
My data looks like the following:
1|2|3|4|5|6|7|8|9|10
I do not want to load the first and last columns into the table.
I want to load 2|3|4|5|6|7|8|9 into one field.
The table I am trying to load into has only one field, named 'field1'.
If anyone has experience with this, could you give some advice?
I tried BOUNDFILLER, FILLER and so on, but I could not make it work.
Help me. :)
Load the entire row from the file into a BOUNDFILLER, then extract the part you need into the column. You have to tell sqlldr that the field is terminated by the carriage return/linefeed (assuming a Windows OS) so it will read the entire line from the file as one field. In the control file below, the whole line from the file is read into "dummy" as a BOUNDFILLER. "dummy" does not match a column name, and it's defined as BOUNDFILLER anyway, so the whole row is "remembered". The next line in the control file starts with a column that DOES match a column name, so sqlldr attempts to execute the expression. It extracts a substring from the saved "dummy" and puts it into the "col_a" column.
The regular expression, in a nutshell, returns the part of the string after but not including the first pipe, and before but not including the last pipe. Note the double backslashes. A backslash takes away the special meaning of the pipe (normally a pipe symbol is a logical OR; the backslash is not needed between the square brackets). In my environment anyway, it gets stripped when passing from sqlldr to the regex engine, so two backslashes are required in the control file for one to get through in the end. If you have trouble here, try one backslash and see what happens. Guess how long THAT took me to figure out!
load data
infile 'x_test.dat'
TRUNCATE
into table x_test
FIELDS TERMINATED BY x'0D0A'
(
dummy BOUNDFILLER,
col_a expression "regexp_substr(:dummy, '[^|]*\\|(.+)\\|.*', 1, 1, NULL, 1)"
)
EDIT: Use this to test the regular expression. For example, if there is an additional pipe at the end:
select regexp_substr('1|2|3|4|5|6|7|8|9|10|', '[^|]*\|(.+)\|.*\|', 1, 1, NULL, 1)
from dual;
2nd edit: For those uncomfortable with regular expressions, this method uses nested SUBSTR and INSTR functions:
SQL> with tbl(str) as (
select '1|2|3|4|5|6|7|8|9|10|' from dual
)
select substr(str, instr(str, '|')+1, (instr(str, '|', -1, 2)-1 - instr(str, '|')) ) after
from tbl;
AFTER
---------------
2|3|4|5|6|7|8|9
Deciding which is easier to maintain is up to you. Think of the developer after you and comment at any rate! :-)
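To compare the two approaches away from the database, here is a minimal Python sketch of both: one with the same regular expression and one with plain index arithmetic. The function names are invented for this example, and it uses the sample row from the question (no trailing pipe), so it illustrates the idea rather than the SQL*Loader behaviour itself.

```python
import re


def middle_by_regex(line):
    """Return the text between the first and the last pipe, via regex."""
    # Same pattern as in the control file, with a single backslash.
    match = re.fullmatch(r"[^|]*\|(.+)\|.*", line)
    return match.group(1) if match else None


def middle_by_index(line):
    """Return the text between the first and the last pipe, via indexes."""
    first = line.find("|")
    last = line.rfind("|")
    if first == -1 or first == last:
        return None  # fewer than two pipes: nothing in between
    return line[first + 1:last]


row = "1|2|3|4|5|6|7|8|9|10"
by_regex = middle_by_regex(row)
by_index = middle_by_index(row)
```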

HIVE: apply delimiter until a specified column

I am trying to move data from a file into a Hive table. The data in the file looks something like this:
StringA StringB StringC StringD StringE
where each string is separated by a space. The problem is that I want separate columns for StringA, StringB and StringC, and one column for StringD onwards, i.e. StringD and StringE should be part of the same column. If I use ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ', Hive would produce separate columns for StringD and StringE. (StringD and StringE contain spaces within themselves, whereas the other strings do not.)
Is there any special syntax in Hive to achieve this, or do I need to pre-process my data file in some way?
Use a regular expression, as in the Apache weblog example (which uses RegexSerDe):
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
With it you can define where a space acts as a delimiter and where it is part of the data.
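The capture-group idea can be tried out in Python before writing the table definition. This is only a sketch of the pattern: the names `ROW_PATTERN` and `parse_row` are hypothetical, and the escaping a Hive DDL would need for the same expression is not shown here.

```python
import re

# Three space-free columns, then everything else as the fourth column.
ROW_PATTERN = re.compile(r"^(\S+) (\S+) (\S+) (.*)$")


def parse_row(line):
    """Split a line into exactly four columns, or return None."""
    match = ROW_PATTERN.match(line)
    return match.groups() if match else None


columns = parse_row("StringA StringB StringC StringD StringE")
```

If pre-processing the file is acceptable, `line.split(" ", 3)` gives the same four columns for this input.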

LINES TERMINATED BY only supports newline '\n' right now

I have files where the column is delimited by char(30) and the lines are delimited by char(31). I'm using these delimiters mainly because the columns may contain newline (\n), so the default line delimiter for hive is not useful for us.
I have tried to change the line delimiter in Hive but get the error below:
LINES TERMINATED BY only supports newline '\n' right now.
Any suggestions?
Would writing a custom SerDe work?
Is there any plan to enhance this functionality in Hive in new releases?
Thanks
Not sure if this helps, or is the best answer, but when faced with this issue, what we ended up doing is setting the 'textinputformat.record.delimiter' Map/Reduce java property to the value being used. In our case it was a string "{EOL}", but could be any unique string for all practical purposes.
We set this in our beeline shell which allowed us to pull back the fields correctly. It should be noted that once we did this, we converted the data to Avro as fast as possible so we didn't need to explain to every user, and the user's baby brother, to set the {EOL} line delimiter.
set textinputformat.record.delimiter={EOL};
Here is the full example.
#example CSV data (fields broken by '^' and end of lines broken by the String '{EOL}')
ID^TEXT
11111^Some THings WIth
New Lines in THem{EOL}11112^Some Other THings..,?{EOL}
111113^Some crazy thin
gs
just crazy{EOL}11114^And Some Normal THings.
#here is the CSV table we laid on top of the data
CREATE EXTERNAL TABLE CRAZY_DATA_CSV
(
ID STRING,
TEXT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\136'
STORED AS TEXTFILE
LOCATION '/archive/CRAZY_DATA_CSV'
TBLPROPERTIES('skip.header.line.count'='1');
#here is the Avro table which we'll migrate into below.
CREATE EXTERNAL TABLE CRAZY_DATA_AVRO
(
ID STRING,
TEXT STRING
)
STORED AS AVRO
LOCATION '/archive/CRAZY_DATA_AVRO'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice/archive/avro_schemas/CRAZY_DATA.avsc');
#And finally, the magic is here. We set the custom delimiter and import into our Avro table.
set textinputformat.record.delimiter={EOL};
INSERT INTO TABLE CRAZY_DATA_AVRO SELECT * from CRAZY_DATA_CSV;
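What the custom record delimiter changes can be sketched in plain Python: records end at '{EOL}' rather than at a newline, so newlines stay inside the TEXT field. The function `parse_records` and the shortened sample string are assumptions for this illustration; this is not how Hive implements it.

```python
def parse_records(raw, record_delim="{EOL}", field_delim="^"):
    """Split raw text into (id, text) pairs using a custom record delimiter."""
    records = []
    for chunk in raw.split(record_delim):
        if not chunk.strip():
            continue  # skip the empty piece after the final delimiter
        # A newline right after the delimiter belongs to no field.
        record_id, text = chunk.lstrip("\n").split(field_delim, 1)
        records.append((record_id, text))
    return records


# Hypothetical sample modelled on the data above, without the header line.
sample = (
    "11111^Some things with\nnew lines in them{EOL}"
    "11112^Some other things{EOL}"
)
rows = parse_records(sample)
```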
I worked it out by using the Sqoop option --hive-delims-replacement ' ' during the extract, so the characters \n, \001 and \r in the columns are replaced with a space.

Hadoop Remove unnecessary \n in the input files

I have a large input file whose values are pipe delimited, with 20 values in a row. A newline character that comes after the 19th pipe marks the end of a record.
But my input file has \n not only after 19 pipes but also inside the other values. A sample line looks like this:
101101|this\nis my sample|12547|sample\nxyz|......(19th pipe)|end of record\n
I am new to Hadoop and I don't know how to divide lines to create key-value pairs based on this condition.
Another related question: input splitting happens on the client side, so if I have to split the input file conditionally on the client side (one machine), will it not be very slow for a large file? Please help.
In Hive, NULL column values are represented as "\N"; that's Hive's default behaviour. This is done to differentiate NULL from "NULL" (the string NULL).
If you don't want \N to appear in your export, you can use the COALESCE UDF.
Roughly, your query may look like this:
SELECT
COALESCE (my_column, '') AS my_column
FROM
my_table
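For the record rule described in the question itself (a newline only ends a record once 19 pipes have been seen), the counting logic can be sketched as below. This is a single-machine Python illustration with a made-up function name, not a Hadoop InputFormat, and it assumes the last value of a record never contains a newline.

```python
def reassemble_records(lines, pipes_per_record=19):
    """Join physical lines until a logical record has enough pipes."""
    records = []
    pending = []      # physical lines of the record being built
    pipe_count = 0    # pipes seen so far in the pending record
    for line in lines:
        pending.append(line)
        pipe_count += line.count("|")
        if pipe_count >= pipes_per_record:
            # Enough pipes: this newline really ends the record.
            records.append("\n".join(pending))
            pending, pipe_count = [], 0
    if pending:
        records.append("\n".join(pending))  # incomplete trailing record
    return records


# Small demonstration with 3 pipes per record instead of 19.
physical_lines = ["a|b", "c|d|e", "f|g|h|i"]
records = reassemble_records(physical_lines, pipes_per_record=3)
```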
