How to replace characters in Hive? - hadoop

I have a string column, description, in a Hive table that may contain tab characters ('\t'). These characters break some views when Hive is connected to an external application.
Is there a simple way to get rid of all tab characters in that column? I could run a simple Python program to do it, but I would like to find a better solution.

The regexp_replace UDF does this. Below are its definition and usage from the Apache wiki.
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT):
This returns the string resulting from replacing all substrings in INITIAL_STRING
that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT,
e.g.: regexp_replace("foobar", "oo|ar", "") returns fb
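For intuition, the documented example can be reproduced outside Hive. This is only a sketch using Python's re module as a stand-in: Java and Python regex syntax differ in places (for example in how the replacement refers to groups), and the function below is a made-up helper, not Hive's implementation.

```python
import re

def regexp_replace(initial_string, pattern, replacement):
    """Stand-in for the UDF: replace every match of pattern in initial_string."""
    # re.sub replaces all non-overlapping matches, like the UDF description says.
    return re.sub(pattern, replacement, initial_string)

if __name__ == "__main__":
    print(regexp_replace("foobar", "oo|ar", ""))   # fb
    print(regexp_replace("a\tb\tc", r"\t", ""))    # abc (tabs removed)
```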

A custom SerDe might be one way to do it. Alternatively, you could use an intermediate step with regexp_replace:
create table tableB as
select
columnA,
regexp_replace(description, '\\t', '') as description
from tableA
;

select translate(description, '\t', '') from myTable;
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well. (Available as of Hive 0.10.0, for string types)
Char/varchar support added as of Hive 0.14.0

You can also use translate(). If the third argument is too short, the corresponding characters from the second argument are deleted. Unlike regexp_replace() you don't need to worry about special characters.
Source code.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
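The character-by-character behaviour described above can be sketched in Python. This is an approximation for illustration only: the helper name is hypothetical, and keeping the first mapping when a character repeats in the from string is an assumption borrowed from PostgreSQL-style translate.

```python
def translate_like(text, from_chars, to_chars):
    """Map from_chars[i] to to_chars[i]; characters with no counterpart are deleted."""
    if text is None or from_chars is None or to_chars is None:
        return None  # any NULL argument gives a NULL result
    mapping = {}
    for i, ch in enumerate(from_chars):
        if ord(ch) in mapping:
            continue  # assumption: the first occurrence of a repeated character wins
        # None tells str.translate to delete the character.
        mapping[ord(ch)] = to_chars[i] if i < len(to_chars) else None
    return text.translate(mapping)

if __name__ == "__main__":
    print(translate_like("a\tb\tc", "\t", ""))   # abc
    print(translate_like("a.b*c", ".*", ""))     # abc, no regex escaping needed
```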

There is no out-of-the-box feature for this at the moment. One way to achieve it would be to write a custom InputFormat and/or SerDe that does this for you. You might find this JIRA useful: https://issues.apache.org/jira/browse/HIVE-3751 (although it is not directly related to your problem).

Related

Validate Oracle Column Names

In one scenario we dynamically generate SQL to create temp tables on the fly. The table_name is not an issue because we choose it ourselves; however, the column names are provided by sources outside our control.
Usually we would check the column names using the query below:
select ..
where NOT REGEXP_LIKE (Column_Name_String,'^([a-zA-Z])[a-zA-Z0-9_]*$')
OR Column_Name_String is NULL
OR Length(Column_Name_String) > 30
However, is there any built-in function that can do a more extensive check? Any input on the above query is welcome as well.
Thanks in advance.
Final query, based on the answers below:
select ..
where NOT REGEXP_LIKE (Column_Name_String,'^([a-zA-Z])[a-zA-Z0-9_]{0,29}$')
OR Column_Name_String is NULL
OR Upper(Column_Name_String) in (select Upper(RESERVED_WORDS.Keyword) from V$RESERVED_WORDS RESERVED_WORDS)
I am not particularly happy with characters like $ in column names either, hence I won't be using
dbms_assert.simple_sql_name('VALID_NAME')
Instead, with a regexp I can decide on my own set of allowed characters.
This answer does not necessarily offer either a performance or logical improvement, but you can actually validate the column names using a single regex:
SELECT ...
WHERE NOT
REGEXP_LIKE (COALESCE(Column_Name_String, ' '), '^([a-zA-Z])[a-zA-Z0-9_]{0,29}$')
This works because:
It uses the same pattern to match columns, i.e. starting with a letter and afterwards using only alphanumeric characters and underscore
NULL column names are mapped to a single space, which fails the regex (an empty string would not help here, because Oracle treats '' as NULL)
We use a length quantifier {0,29} to check the column length directly in the regex
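If the names are checked in a script before any SQL is generated, the same rule can be sketched in Python. The function name is made up for this illustration, and the NULL case is handled explicitly rather than through the regex.

```python
import re

# Same rule as the SQL pattern: a letter first, then up to 29 letters, digits or underscores.
COLUMN_NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,29}")

def is_valid_column_name(name):
    """Return True only for non-null names of 1 to 30 allowed characters."""
    if name is None:
        return False
    # fullmatch anchors at both ends, so no ^ or $ is needed in the pattern.
    return COLUMN_NAME_RE.fullmatch(name) is not None

if __name__ == "__main__":
    for candidate in ["VALID_NAME", "1st_col", "price$", None, "a" * 31]:
        print(repr(candidate), is_valid_column_name(candidate))
```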
"is there any built-in function which can do a more extensive check."
Oracle has the DBMS_ASSERT.SIMPLE_SQL_NAME() function. This returns the passed name if it meets the Oracle naming rules ...
select dbms_assert.simple_sql_name('VALID_NAME') from dual;
... and hurls ORA-44003 if the name is invalid.
Valid names permit any characters if the name is double-quoted (yuck, but then so is creating "temp tables on-fly"). Also the function doesn't check the length of the name, so you will still need to validate that yourself.
Find out more in the docs.
Also here is a SQL Fiddle.
"creating a table with comment column is not possible as its a invalid identifier"
Fair point. DBMS_ASSERT is primarily aimed at preventing SQL injection, so it verifies that a value conforms to Oracle's naming rules, not that the value is a valid Oracle name. To catch things like comment you will also need to check the value against V$RESERVED_WORDS, probably where reserved = 'Y'. As this is a V$ view, select on it is not granted by default; if you don't have access you'll need to ask your friendly DBA to help out.
"For validating column names I believe I should check with the entire list"
Up to you. The distinction is that some keywords can legitimately be used as identifiers. For instance, TYPE only became a keyword in Oracle version 8, when they introduced the object-relational stuff. But there were a lot of tables and views in existing systems which used 'TYPE' as a column name (not least the Oracle data dictionary). If Oracle had made TYPE a properly reserved word it would have broken all those systems. So the list of reserved words which cannot be used as identifiers is a sub-set of all the Oracle keywords.
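A case-insensitive lookup against the reserved subset could look like the sketch below. The word set here is a tiny hard-coded sample for illustration only; a real check would load the words from V$RESERVED_WORDS, and the function name is hypothetical.

```python
# Illustrative sample only, not the full list from V$RESERVED_WORDS.
SAMPLE_RESERVED = frozenset({"SELECT", "TABLE", "COMMENT", "FROM", "WHERE"})

def is_reserved(name, reserved_words=SAMPLE_RESERVED):
    """Return True when name, compared case-insensitively, is in reserved_words."""
    return name is not None and name.upper() in reserved_words

if __name__ == "__main__":
    print(is_reserved("comment"))  # True: in the sample set
    print(is_reserved("type"))     # False: not in the sample set
```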
Opinions on the general task:
"we are getting data from external sources (files) and the job of the program/script is to push that data to oracle tables."
There are two parts to this task.
The first is that you should have agreed a standard format for these files with the third parties. There should be no need for discovery of the files' structure or content. (Or if there is such a need because the files are randomly sourced from a carousel of third parties probably you should not be using a relational database but something else: Endeca? Python Pandas library?)
The second is creating tables on the fly. If you have an agreed file structure then you should be loading into standard tables, using either SQL*Loader or external tables according to your circumstances. If you're on 12c maybe SQL*Loader Express Mode could be of interest.

How to add a filter to an Excel table in UiPath?

I have an Excel file with a table named 'Table1' in it. I need to use the 'Filter Table' activity in UiPath with the condition "column1 begins with '*my column'". But when I specify the value like this, the column is filtered with an 'ends with' operation instead.
Here is the screenshot for my table-
Below is the screenshot for the steps I followed-
This has been answered many times on the UiPath Forum, for example: https://forum.uipath.com/t/filter-table-in-excel-data-tables/559/3
If you use *my value as the search / filter pattern, it means anything at the beginning, followed by my value at the end. So it is being interpreted correctly as Ends With. If you want a Begins With filter, put the wildcard after your filter text, like my value*.
Further, if you want to include the wildcard as a literal in the search pattern, you need to escape it by enclosing it in brackets, like [*]my value*; this searches for text beginning with *my value.
MS Excel / VBA also supports Tilde ~ as an escape character in some cases.
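The three patterns can be tried out with Python's fnmatch module, which happens to use the same * wildcard and [...] bracket conventions. This is only an analogy for the matching rules, not the engine Excel or UiPath uses; the sample values and the helper name are invented.

```python
from fnmatch import fnmatchcase

SAMPLE_VALUES = ["my value 1", "x my value", "*my value 2"]

def filter_values(pattern, values=SAMPLE_VALUES):
    """Return the values that match a wildcard pattern (case-sensitive)."""
    return [v for v in values if fnmatchcase(v, pattern)]

if __name__ == "__main__":
    print(filter_values("*my value"))     # ['x my value']   (ends with)
    print(filter_values("my value*"))     # ['my value 1']   (begins with)
    print(filter_values("[*]my value*"))  # ['*my value 2']  (literal asterisk first)
```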
In Excel filters, '*' represents any series of characters.
The issue in the above case is that the filter value in the condition already contains a '*'. Because of this, the system always reads it as '*My column' => '[any characters]My column', i.e., the value ends with 'My column'.
To resolve this issue, I specified a Contains filter instead of Begins With, with the value 'My column'.
I also tried to escape the '*', but that threw an Excel exception.
In addition, you cannot specify the condition as "Column1 Like '*My column%'". This works fine when you are adding the filter to a 'DataTable' (after performing the 'ReadRange' activity). But in that case you retrieve all the records first and only then filter the columns, which leads to performance issues if the Excel table is huge.
You can use the syntax below to filter data read from an Excel file:
DataTableName.Select("[ColumnName]='Datawithwhichweneedtofilter'").CopyToDataTable()

Explicit null string in flat files that are imported with SQL*Loader

I am using Oracle's SQL*Loader to import flat files into the database. Is there an explicit NULL string in SQL*Loader, like \N in PostgreSQL, that can be used instead of an empty string? Or is there an option in the control file that can be used to set a NULL string, e.g. NULL AS ''?
Like Alex said, empty fields are treated as NULL in Oracle. You may need to set default values or conditions on your table itself if you want to differentiate somehow. You might be able to find a solution to your issue from here using the WHEN, NULLIF, and DEFAULTIF Clauses.
Ex from the docs:
fixed_field CHAR(2) NULLIF fixed_field=BLANKS
But ultimately, I don't know how you would differentiate between empty fields and null fields in a flat fixed-width file. Either the data is present or its position is filled with white space. Unless you have specific rules that you're going to apply to make that determination, I don't see a difference between the two.
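To make the point concrete, here is a small Python sketch of what a "blank means NULL" rule amounts to for a fixed-width record. It is not SQL*Loader itself; the record layout and function name are invented, and trimming of non-blank values is not modelled.

```python
def parse_fixed_field(record, start, length):
    """Slice a fixed-width field; return None when it is empty or all blanks."""
    raw = record[start:start + length]
    # An all-blank field carries no information, so it is indistinguishable from NULL.
    return None if raw.strip(" ") == "" else raw

if __name__ == "__main__":
    record = "AB  CD"  # three 2-character fields, the middle one blank
    print([parse_fixed_field(record, pos, 2) for pos in (0, 2, 4)])
    # ['AB', None, 'CD']
```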

How can I do a double delimiter in Hive?

Let's say I have some sample rows of data:
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with the individual elements separated by pipes, so that I can do group-bys on them to show, for example, how many URLs had a 2nd position of "4" (which would be 2 rows in this case). So the "url" field contains additional data that I'd like to parse out and use in my queries.
The question is, what is the best way to do that in Hive?
thanks!
First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacoll')
This returns '5|4|3|2|1', which gets us halfway there. Now, use split(string str, string pat) to break those out of each sub-delimiter into an array. The pattern is a regular expression, so the pipe has to be escaped:
split(parse_url(url, 'QUERY', 'datacoll'), '\\|')
With the result of this, you should be able to grab the columns that you want.
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.
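Since the answer above was not verified in Hive, the same two steps (pull one query parameter out of the URL, then split it on the pipe) can at least be checked against the sample rows with a short Python sketch. It uses the standard urllib.parse module; the key name datacoll comes from the sample data, and the helper name is made up.

```python
from urllib.parse import urlsplit, parse_qs

def extract_datacoll(url, key="datacoll"):
    """Return the pipe-separated values of key as a list, or [] if the key is absent."""
    query = urlsplit(url).query
    values = parse_qs(query).get(key)
    return values[0].split("|") if values else []

SAMPLE_URLS = [
    "http://article1.com?datacoll=5|4|3|2|1&test=yes",
    "http://article1.com?test=yes",
    "http://article1.com?datacoll=5|4|3|2|1&test=yes",
]

if __name__ == "__main__":
    # Count the URLs whose 2nd element is "4"; slicing avoids an IndexError on empty lists.
    print(sum(1 for u in SAMPLE_URLS if extract_datacoll(u)[1:2] == ["4"]))  # 2
```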
This looks very similar to something I did a couple of weeks ago. I think the best approach in your case would be to apply a pre-processing step (possibly with Hadoop streaming) and change the definition of your table to:
create table clicklogs(sitename string, datacol Array<int>) row format delimited fields terminated by '^' collection items terminated by '|'
Once you have that you can easily manipulate your data in Hive using lateral views and the builtin explode. The following code should help you get the counts of URLs per col.
select col, count(1) from clicklogs lateral view explode(datacol) dataTable as col group by col
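What that query computes can be sketched in plain Python: one output item per array element, then a count per distinct value. The rows below are hypothetical pre-processed versions of the sample data, with an empty array for the URL that had no datacoll parameter.

```python
from collections import Counter

# Hypothetical pre-processed rows: (sitename, datacol).
SAMPLE_ROWS = [
    ("site1", [5, 4, 3, 2, 1]),
    ("site1", []),
    ("site1", [5, 4, 3, 2, 1]),
]

def count_exploded(rows):
    """Flatten every array into single items and count how often each value occurs."""
    # A row with an empty array contributes nothing, as with a plain lateral view explode.
    return Counter(col for _site, datacol in rows for col in datacol)

if __name__ == "__main__":
    for col, n in sorted(count_exploded(SAMPLE_ROWS).items()):
        print(col, n)
```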

Oracle: search for diacritics

In Oracle I am trying to find all the rows that contain certain diacritics in one column. I used something like:
where regexp_like(name,'(Ă|Î|Ș|Ț|Â)','i');
The problem is that it also returns rows that contain the letters without diacritics (A,I,S,T). For example the clause above will return a row that contains "Adrian" as name.
How can I search only for diacritics?
Thank you
The way diacritics are handled in comparisons and when sorting is a property of the session that depends on the value of NLS_SORT. See Linguistic Sorting and String Searching.
I think it may be caused by character conversion.
What do you get when you run this query?
select 'ĂÎȘȚÂ' from dual
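To see why an accent-insensitive comparison (or a lossy character conversion) would make "Adrian" match, here is a small Python sketch. It is not Oracle's behaviour, just an illustration of the difference between comparing exact characters and comparing after the marks have been stripped; the function names are invented.

```python
import unicodedata

# The letters from the question (A-breve, I-circumflex, S-comma, T-comma, A-circumflex),
# written as escapes so the example does not depend on the file encoding.
TARGET = set("\u0102\u00ce\u0218\u021a\u00c2")

def strip_marks(text):
    """Drop combining marks, roughly what an accent-insensitive comparison ignores."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def has_target_diacritic(name):
    """Exact check: True only if one of the accented letters is really present."""
    return any(ch in TARGET for ch in unicodedata.normalize("NFC", name).upper())

if __name__ == "__main__":
    for name in ["Adrian", "\u0218tefan"]:
        print(strip_marks(name), has_target_diacritic(name))
```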
