Hive - Error replacing with $ in regexp_replace

Running this line:
regexp_replace('Hello from zzz','zzz','$15000') gives an error saying:
Wrong arguments ''$15000'': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.UDFRegExpReplace.evaluate(org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text) on object org.apache.hadoop.hive.ql.udf.UDFRegExpReplace@6e85e0dd of class org.apache.hadoop.hive.ql.udf.UDFRegExpReplace with arguments {Hello from zzz:org.apache.hadoop.io.Text, zzz:org.apache.hadoop.io.Text, $15000:org.apache.hadoop.io.Text} of size 3
Is $ not supported? What is the alternative for this?

Try escaping the $ with two backslashes (\\); $ is a special character in the regex replacement string (it references capture groups).
hive> select regexp_replace('Hello from zzz','zzz','\\$15000');
+--------------------+--+
| _c0 |
+--------------------+--+
| Hello from $15000 |
+--------------------+--+
There is also a replace function, introduced in Hive 1.3.0+.
See the related JIRA addressing the replace function.
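With replace, no escaping is needed, since it performs plain string substitution rather than regex matching (a sketch, assuming Hive 1.3.0+):
hive> select replace('Hello from zzz','zzz','$15000');
+--------------------+--+
| _c0 |
+--------------------+--+
| Hello from $15000 |
+--------------------+--+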
If your replacement string comes from a field in the table, use the concat function to concatenate the backslashes (\\) with the field value, and pass the result as the third argument:
hive> select regexp_replace('Hello from zzz','zzz',concat('\\',"$15000"));
+--------------------+--+
| _c0 |
+--------------------+--+
| Hello from $15000 |
+--------------------+--+
(or)
hive> select regexp_replace('Hello from zzz','zzz',concat('\\',field/column-name))

Concat variable with a string in location parameter in a create statement?

In Greenplum, I need to create an external table with a dynamic location parameter. For example:
CREATE READABLE TABLE_A(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION(:location)
FORMAT 'TEXT' (DELIMITER '|');
But in :location parameter I need to concat it with a fixed string. I tried:
LOCATION (:location || '123')
But I get a syntax error, whereas in a select statement it works perfectly. I'm passing the :location value like: "'gphdfs://teste:1010/tmp'"
Can anyone help me?
You are missing a few things in your table definition. You forgot "external" and "table".
CREATE READABLE EXTERNAL TABLE table_a
(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION(:location)
FORMAT 'TEXT' (DELIMITER '|');
Note: gphdfs has been deprecated and you should use PXF or gpfdist instead.
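For reference, a PXF-based location for HDFS text files might look like the sketch below; the pxf:// path and the PROFILE value are assumptions that depend on your PXF server configuration.
CREATE READABLE EXTERNAL TABLE table_a
(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION('pxf://tmp/teste?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|');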
Next, you just need to use double quotes around the location value.
[gpadmin@mdw ~]$ psql -f example.sql -v location="'gpfdist://teste:1010/tmp'"
CREATE EXTERNAL TABLE
[gpadmin@mdw ~]$ psql
psql (9.4.24)
Type "help" for help.
gpadmin=# \d+ table_a
External table "public.table_a"
Column | Type | Modifiers | Storage | Stats target | Description
-----------+-------------------+-----------+----------+--------------+-------------
date_inic | date | | plain | |
func_name | character varying | | extended | |
q_session | bigint | | plain | |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter '|' null '\N' escape '\'
External options: {}
External location: gpfdist://teste:1010/tmp
Execute on: all segments
And from bash, you can just concat the strings together too.
loc="gpfdist://teste"
port="1010"
dir="tmp"
location="'""$loc"":""$port""/""$dir""'"
psql -f example.sql -v location="$location"

Unable to load data from a CSV file into HIVE

I am getting 'None' values while loading data from a CSV file into hive external table.
My CSV file structure is like this:
creation_month,accts_created
7/1/2018,40847
6/1/2018,67216
5/1/2018,76009
4/1/2018,87611
3/1/2018,99687
2/1/2018,92631
1/1/2018,111951
12/1/2017,107717
'creation_month' and 'accts_created' are my column headers.
create external table monthly_creation
(creation_month DATE,
accts_created INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/user/dir4/'
The location is '/user/dir4/' because that's where I put the 'monthly_acct_creation.csv' file.
I have no idea why the external table I created had all 'None' values when the source data have dates and numbers.
Can anybody help?
DATE values describe a particular year/month/day, in the form YYYY-MM-DD. For example, DATE '2013-01-01'.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-date
I suggest using string type for your date column, which you can convert later or parse into timestamps.
Regarding the integer column, you'll need to skip the header row so that all values can be properly converted to int.
By the way, new versions of HUE allow you to build Hive tables directly from CSV
Hive's DATE data type only accepts the yyyy-MM-dd format; since your date field is not in that format, the creation_month values come out as null.
Create the table with the creation_month field as a string datatype instead, and skip the first line by using the skip.header.line.count property in the create table statement.
Try with below ddl:
hive> create external table monthly_creation
(creation_month string,
accts_created INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Location '/user/dir4/'
tblproperties ("skip.header.line.count"="1");
hive> select * from monthly_creation;
+-----------------+----------------+--+
| creation_month | accts_created |
+-----------------+----------------+--+
| 7/1/2018 | 40847 |
| 6/1/2018 | 67216 |
| 5/1/2018 | 76009 |
| 4/1/2018 | 87611 |
| 3/1/2018 | 99687 |
| 2/1/2018 | 92631 |
| 1/1/2018 | 111951 |
| 12/1/2017 | 107717 |
+-----------------+----------------+--+
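If you later need real DATE values, you can parse the M/d/yyyy strings at query time with the built-in unix_timestamp/from_unixtime functions (a sketch):
hive> select cast(from_unixtime(unix_timestamp(creation_month,'M/d/yyyy'),'yyyy-MM-dd') as date) as creation_month,
accts_created
from monthly_creation;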

How to reset textinputformat.record.delimiter to its default value within hive cli / beeline?

Setting textinputformat.record.delimiter to a non-default value is useful for loading multi-row text, as shown in the demo below.
However, I'm failing to set this parameter back to its default value without exiting the cli and reopening it.
None of the following options worked (nor did some other attempts):
set textinputformat.record.delimiter='\n';
set textinputformat.record.delimiter='\r';
set textinputformat.record.delimiter='\r\n';
set textinputformat.record.delimiter='
';
reset;
Any thoughts?
Thanks
Demo
create table mytable (mycol string);
insert into mytable select concat('Hello',unhex('A'),'world');
select concat('>>>',mycol,'<<<') as mycol from mytable;
The newline is interpreted as a record delimiter, causing two records to be inserted:
+-------------+
| mycol |
+-------------+
| >>>Hello<<< |
| >>>world<<< |
+-------------+
set textinputformat.record.delimiter='\0';
truncate table mytable;
insert into mytable select concat('Hello',unhex('A'),'world');
select concat('>>>',mycol,'<<<') as mycol from mytable;
The whole text was inserted as a single record
+----------+
| mycol |
+----------+
| >>>Hello |
| world |
| <<< |
+----------+
Trying to change the delimiter back to newline
set textinputformat.record.delimiter='\n';
truncate table mytable;
insert into mytable select concat('Hello',unhex('A'),'world');
select concat('>>>',mycol,'<<<') as mycol from mytable;
Still get the same results
+----------+
| mycol |
+----------+
| >>>Hello |
| world |
| <<< |
+----------+
Have you checked the state of the "textinputformat.record.delimiter" variable? Was it really changed? You can check by calling set textinputformat.record.delimiter without a value. If it was changed but has no effect, you could file an issue in the issue tracker. As a workaround for setting the delimiter back to its default, you could try the RESET command. It resets ALL properties to their default values, though, so this solution may be unacceptable in your case.
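For example, both checks from the Hive CLI (a sketch):
hive> set textinputformat.record.delimiter;
hive> reset;
The first prints the property's current session value; the second resets all session properties, not just this one.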
Use the Unicode character \u0001 (Ctrl+A) as the delimiter.

vsql/vertica, how to copy text input file's name into destination table

I have to copy an input text file (text_file.txt) to a table (table_a). I also need to include the input file's name in the table.
my code is:
\set t_pwd `pwd`
\set input_file '\'':t_pwd'/text_file.txt\''
copy table_a
( column1
,column2
,column3
,FileName :input_file
)
from :input_file
The last line does not copy the input text file's name into the table.
How can I copy the input text file's name into the table (without manually typing the file name)?
Solution 1
This might not be the perfect solution for your job, but I think it will do the job:
You can get the file name, store it in a TBL variable, and then append that variable to the end of each line in the CSV file you are about to load into Vertica.
Depending on your CSV file size, this can be quite time and CPU consuming.
TBL=$(ls -1 *.txt)
sed -i -e "s/$/,$TBL/" "$TBL"
Example:
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10
[dbadmin@bih001 ~]$ TBL=$(ls -1 load*); sed -i -e "s/$/|$TBL/" "$TBL"
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10|load_data1
Solution 2
You can use a DEFAULT CONSTRAINT, see example:
1. Create your table with a DEFAULT CONSTRAINT
[dbadmin@bih001 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> create table TBL (id int ,CSV_FILE_NAME varchar(200) default 'TBL');
CREATE TABLE
dbadmin=> \dt
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
Note the DEFAULT CONSTRAINT: it has the 'TBL' default value.
dbadmin=> \d TBL
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+---------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | 'TBL' | f | f |
(2 rows)
2. Now set up your COPY variables
Insert some data, then alter the DEFAULT CONSTRAINT value to your current :input_file value.
dbadmin=> \set t_pwd `pwd`
dbadmin=> \set CSV_FILE `ls -1 | grep load*`
dbadmin=> \set input_file '\'':t_pwd'/':CSV_FILE'\''
dbadmin=>
dbadmin=>
dbadmin=> insert into TBL values(1);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+---------------
1 | TBL
(1 row)
dbadmin=> ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
ALTER TABLE
dbadmin=> \dt TBL;
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
dbadmin=> \d TBL;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+----------------------------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | '/home/dbadmin/load_data1' | f | f |
(2 rows)
dbadmin=> insert into TBL values(2);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+--------------------------
1 | TBL
2 | /home/dbadmin/load_data1
(2 rows)
Now you can implement this in your copy script.
Example:
\set t_pwd `pwd`
\set CSV_FILE `ls -1 | grep load*`
\set input_file '\'':t_pwd'/':CSV_FILE'\''
ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
copy TBL from :input_file DELIMITER '|' DIRECT;
Solution 3
Use the LOAD_STREAMS table
Example:
When loading a table give it a stream name - this way you can identify the file name / stream name:
COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';
Here is how you can query your load_streams table:
=> SELECT stream_name, table_name, load_start, accepted_row_count,
rejected_row_count, read_bytes, input_file_size_bytes, parse_complete_percent,
unsorted_row_count, sorted_row_count, sort_complete_percent
FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
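To pull back only the load you tagged, filter on the stream name you supplied (a sketch):
=> SELECT stream_name, table_name, accepted_row_count, rejected_row_count
FROM load_streams
WHERE stream_name = 'My stream name';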
Makes sense? Hope this helped!
If you do not need to do it purely from inside vsql, it might be possible to cheat a bit and move the logic outside Vertica, into bash for example:
FILE=text_file.txt
(
while IFS= read -r LINE; do
echo "$LINE|$FILE"
done < "$FILE"
) | vsql -c 'copy table_a (...) FROM STDIN'
That way you basically COPY FROM STDIN, adding the filename to each line before it even reaches Vertica.
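Putting that together with the columns from the question (a sketch; it assumes '|'-delimited input and that FileName is the last column in the COPY column list):
FILE=text_file.txt
(
while IFS= read -r LINE; do
echo "$LINE|$FILE"
done < "$FILE"
) | vsql -c "copy table_a (column1, column2, column3, FileName) from stdin delimiter '|'"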

Is there a Hive equivalent of SQL "not like"

While Hive supports positive like queries, e.g.:
select * from table_name where column_name like 'root~%';
Hive does not support negative like queries, e.g.:
select * from table_name where column_name not like 'root~%';
Does anyone know an equivalent solution that Hive does support?
Try this:
Where Not (Col_Name like '%whatever%')
also works with rlike:
Where Not (Col_Name rlike '.*whatever.*')
NOT LIKE has been supported since Hive version 0.8.0; check the JIRA:
https://issues.apache.org/jira/browse/HIVE-1740
In SQL:
select * from table_name where column_name not like '%something%';
In Hive:
select * from table_name where not (column_name like '%something%');
Check out https://cwiki.apache.org/confluence/display/Hive/LanguageManual if you haven't. I reference it all the time when I'm writing queries for hive.
I haven't done anything where I'm trying to match part of a word, but you might check out RLIKE (in this section https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#Relational_Operators)
This is probably a bit of a hack job, but you could do a sub query where you check if it matches the positive value and do a CASE (http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF#Conditional_Functions) to have a known value for the main query to check against to see if it matches or not.
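A sketch of that sub-query/CASE idea, reusing the question's table and column names:
select * from (
select t.*,
case when column_name like 'root~%' then 1 else 0 end as is_match
from table_name t
) sub
where sub.is_match = 0;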
Another option is to write a UDF which does the checking.
I'm just brainstorming while sitting at home with no access to Hive, so I may be missing something obvious. :)
Hope that helps in some fashion or another. \^_^/
EDIT: Adding an additional method from my comment below.
For your provided example: colName RLIKE '[^r][^o][^o][^t]~\w'. That may not be the optimal regex, but it is something to look into instead of sub-queries.
Using regexp_extract works as well:
select * from table_name where regexp_extract(my_column, ('myword'), 0) = ''
Actually, you can make it like this:
select * from table_name where not column_name like 'root~%';
In Impala you can use != for not like:
columnname != value
As @Sanjiv answered, Hive does support not like:
0: hive> select * from dwtmp.load_test;
+--------------------+----------------------+
| load_test.item_id | load_test.item_name |
+--------------------+----------------------+
| 18282782 | NW |
| 1929SEGH2 | BSTN |
| 172u8562 | PLA |
| 121232 | JHK |
| 3443453 | AG |
| 198WS238 | AGS |
+--------------------+----------------------+
6 rows selected (0.224 seconds)
0: hive> select * from dwtmp.load_test where item_name like '%ST%';
+--------------------+----------------------+
| load_test.item_id | load_test.item_name |
+--------------------+----------------------+
| 1929SEGH2 | BSTN |
+--------------------+----------------------+
1 row selected (0.271 seconds)
0: hive> select * from dwtmp.load_test where item_name not like '%ST%';
+--------------------+----------------------+
| load_test.item_id | load_test.item_name |
+--------------------+----------------------+
| 18282782 | NW |
| 172u8562 | PLA |
| 121232 | JHK |
| 3443453 | AG |
| 198WS238 | AGS |
+--------------------+----------------------+
5 rows selected (0.247 seconds)
