In Greenplum, I need to create an external table with a dynamic location parameter. For example:
CREATE READABLE TABLE_A(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION(:location)
FORMAT 'TEXT' (DELIMITER '|');
But I need to concatenate the :location parameter with a fixed string. I tried:
LOCATION (:location || '123')
But I get a syntax error, whereas in a SELECT statement it works perfectly. I'm passing the :location value like: " 'gphdfs://teste:1010/tmp' "
Can anyone help me?
You are missing a few things in your table definition. You forgot "external" and "table".
CREATE READABLE EXTERNAL TABLE table_a
(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION(:location)
FORMAT 'TEXT' (DELIMITER '|');
Note: gphdfs has been deprecated and you should use PXF or gpfdist instead.
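For what it's worth, a PXF-based definition of the same table would look roughly like this. This is only a sketch: it assumes a default PXF server already configured to point at that HDFS cluster, and the profile name and path are illustrative:
CREATE READABLE EXTERNAL TABLE table_a
(
date_inic date,
func_name varchar,
q_session bigint
)
LOCATION ('pxf://tmp?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER '|');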
Next, when passing the variable with psql -v, you just need to wrap the single-quoted location value in double quotes.
[gpadmin@mdw ~]$ psql -f example.sql -v location="'gpfdist://teste:1010/tmp'"
CREATE EXTERNAL TABLE
[gpadmin@mdw ~]$ psql
psql (9.4.24)
Type "help" for help.
gpadmin=# \d+ table_a
External table "public.table_a"
Column | Type | Modifiers | Storage | Stats target | Description
-----------+-------------------+-----------+----------+--------------+-------------
date_inic | date | | plain | |
func_name | character varying | | extended | |
q_session | bigint | | plain | |
Type: readable
Encoding: UTF8
Format type: text
Format options: delimiter '|' null '\N' escape '\'
External options: {}
External location: gpfdist://teste:1010/tmp
Execute on: all segments
And from bash, you can simply concatenate the strings together too.
loc="gpfdist://teste"
port="1010"
dir="tmp"
location="'$loc:$port/$dir'"
psql -f example.sql -v location="$location"
My understanding of semi-structured data handling in Vertica is that if the data looks like this (in JSON)
{
"f1":1,
"f2":"hello",
"f3":false,
"f4":2
}
then a flex table is created with two columns, __identity__ and __raw__. __identity__ will have 4 fields (I suppose the integers 1, 2, 3, 4) and __raw__ will be the raw representation of the data (1, hello, false and 2).
I can also load data from a CSV file into the same flex table, e.g. 2, hello2, true, 3. How does Vertica decide which field maps to which column (e.g. both f1 and f4 are int)?
Well, nothing beats having a Vertica SQL prompt ready (and the privilege to create a database object ...) to try and find out.
With JSON, the field names are in the structure: key-value pairs.
With CSV, the first line of the data file needs to have the column names - which I add below ...
-- connecting with VSQL,
$ vsql -h localhost -d sbx -U dbadmin -w pwd
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
sbx=> -- create the flex table
sbx=> CREATE FLEX TABLE flx();
CREATE TABLE
sbx=> -- load the flex table from stdin - data handed in-line - using your input
sbx=> COPY flx FROM stdin PARSER fjsonparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {
>> "f1":1,
>> "f2":"hello",
>> "f3":false,
>> "f4":2
>> }
>> \.
-- test the load ...
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+-------+-------+----
1 | hello | false | 2
sbx=>-- load the CSV file - note that we need the title line,
sbx=>-- which I add, to have same values in the same fields
sbx=> COPY flx FROM stdin PARSER fcsvparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> f1,f2,f3,f4
>> 2, hello2, true, 3
>> \.
sbx=>-- check the contents now
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+--------+-------+----
1 | hello | false | 2
2 | hello2 | true | 3
sbx=>-- resulting table definition in catalog ...
sbx=> \d flx
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
---------+-------+--------------+------------------------+--------+---------+----------+-------------+-------------
dbadmin | flx | __identity__ | int | 8 | | t | f |
dbadmin | flx | __raw__ | long varbinary(130000) | 130000 | | t | f |
(2 rows)
sbx=> -- check the contents of __identity__ and (after visualising) __raw__
sbx=> SELECT __identity__,REPLACE(MAPTOSTRING(__raw__),CHR(10),' ') FROM flx;
__identity__ | REPLACE
--------------+------------------------------------------------------------------------
1 | { "f1": "1", "f2": "hello", "f3": "false", "f4": "2" }
2 | { "f1": "2", "f2": "hello2", "f3": "true", "f4": "3" }
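So the mapping is done purely by key name inside the __raw__ map (the values are all kept as strings and cast on the fly when you select them), not by position or by type. If you want to pull individual keys out explicitly, the flex map function MAPLOOKUP does that; a quick sketch:
SELECT MAPLOOKUP(__raw__, 'f1') AS f1, MAPLOOKUP(__raw__, 'f4') AS f4 FROM flx;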
I am getting 'None' values while loading data from a CSV file into a Hive external table.
My CSV file structure is like this:
creation_month,accts_created
7/1/2018,40847
6/1/2018,67216
5/1/2018,76009
4/1/2018,87611
3/1/2018,99687
2/1/2018,92631
1/1/2018,111951
12/1/2017,107717
'creation_month' and 'accts_created' are my column headers.
create external table monthly_creation
(creation_month DATE,
accts_created INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location '/user/dir4/'
The location is '/user/dir4/' because that's where I put the 'monthly_acct_creation.csv' file.
I have no idea why the external table I created has all 'None' values when the source data has dates and numbers.
Can anybody help?
DATE values describe a particular year/month/day, in the form YYYY-MM-DD. For example, DATE '2013-01-01'.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-date
I suggest using string type for your date column, which you can convert later or parse into timestamps.
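For example, assuming creation_month is loaded as a string in the M/d/yyyy format shown in your sample, something like this should turn it into a standard yyyy-MM-dd string (a sketch, untested):
select from_unixtime(unix_timestamp(creation_month, 'M/d/yyyy'), 'yyyy-MM-dd') as creation_month
from monthly_creation;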
Regarding the integer column, you'll need to skip the header row so that all values can be appropriately converted to int.
By the way, new versions of HUE allow you to build Hive tables directly from CSV files.
The DATE data type in Hive only accepts the yyyy-MM-dd format. Your date field is not in that format, which results in NULL values for creation_month.
Create the table with the creation_month field as a string datatype and skip the first line by using the skip.header.line.count property in the CREATE TABLE statement.
Try the DDL below:
hive> create external table monthly_creation
(creation_month string,
accts_created INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/dir4/'
tblproperties ("skip.header.line.count"="1");
hive> select * from monthly_creation;
+-----------------+----------------+--+
| creation_month | accts_created |
+-----------------+----------------+--+
| 7/1/2018 | 40847 |
| 6/1/2018 | 67216 |
| 5/1/2018 | 76009 |
| 4/1/2018 | 87611 |
| 3/1/2018 | 99687 |
| 2/1/2018 | 92631 |
| 1/1/2018 | 111951 |
| 12/1/2017 | 107717 |
+-----------------+----------------+--+
Running this line:
regexp_replace('Hello from zzz','zzz','$15000') gives an error saying:
Wrong arguments ''$15000'': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.UDFRegExpReplace.evaluate(org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text) on object org.apache.hadoop.hive.ql.udf.UDFRegExpReplace@6e85e0dd of class org.apache.hadoop.hive.ql.udf.UDFRegExpReplace with arguments {Hello from zzz:org.apache.hadoop.io.Text, zzz:org.apache.hadoop.io.Text, $15000:org.apache.hadoop.io.Text} of size 3
Is $ not supported? What is the alternative for this?
Try two backslashes (\\) to escape the $. It is a special character in the replacement string (it introduces a group reference such as $1), which is why the call fails:
hive> select regexp_replace('Hello from zzz','zzz','\\$15000');
+--------------------+--+
| _c0 |
+--------------------+--+
| Hello from $15000 |
+--------------------+--+
There is a replace function introduced in Hive 1.3.0+.
See the related JIRA that added the replace function.
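Since replace does a plain string substitution rather than a regex one, no escaping is needed. If your Hive version has it, the call would simply be (a sketch, untested):
select replace('Hello from zzz','zzz','$15000');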
If your replacement string comes from a field in the table, then use the concat function to prepend the backslashes (\\) to the field value:
hive> select regexp_replace('Hello from zzz','zzz',concat('\\',"$15000"));
+--------------------+--+
| _c0 |
+--------------------+--+
| Hello from $15000 |
+--------------------+--+
(or)
hive> select regexp_replace('Hello from zzz','zzz',concat('\\',field/column-name))
I have to copy an input text file (text_file.txt) to a table (table_a). I also need to include the input file's name in the table.
My code is:
\set t_pwd `pwd`
\set input_file '\'':t_pwd'/text_file.txt\''
copy table_a
( column1
,column2
,column3
,FileName :input_file
)
from :input_file
The last line does not copy the input text file's name into the table.
How to copy the input text file's name into the table? (without manually typing the file name)
Solution 1
This might not be the perfect solution for your job, but I think it will do the job:
You can get the file name, store it in a TBL variable, and then append that variable to the end of each line in the CSV file that you are about to load into Vertica.
Depending on your CSV file size, this can be quite time and CPU consuming.
export TBL=$(ls -1 | grep '\.txt$'); sed -i -e "s/$/,$TBL/" "$TBL"
Example:
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10
[dbadmin@bih001 ~]$ export TBL=$(ls -1 | grep load); sed -i -e "s/$/|$TBL/" "$TBL"
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10|load_data1
Solution 2
You can use a DEFAULT CONSTRAINT, see example:
1. Create your table with a DEFAULT CONSTRAINT
[dbadmin@bih001 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> create table TBL (id int ,CSV_FILE_NAME varchar(200) default 'TBL');
CREATE TABLE
dbadmin=> \dt
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
See that the DEFAULT CONSTRAINT has the 'TBL' default value:
dbadmin=> \d TBL
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+---------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | 'TBL' | f | f |
(2 rows)
2. Now set up your COPY variables
- insert some data and alter the DEFAULT CONSTRAINT value to your current :input_file value.
dbadmin=> \set t_pwd `pwd`
dbadmin=> \set CSV_FILE `ls -1 | grep load*`
dbadmin=> \set input_file '\'':t_pwd'/':CSV_FILE'\''
dbadmin=>
dbadmin=>
dbadmin=> insert into TBL values(1);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+---------------
1 | TBL
(1 row)
dbadmin=> ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
ALTER TABLE
dbadmin=> \dt TBL;
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
dbadmin=> \d TBL;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+----------------------------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | '/home/dbadmin/load_data1' | f | f |
(2 rows)
dbadmin=> insert into TBL values(2);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+--------------------------
1 | TBL
2 | /home/dbadmin/load_data1
(2 rows)
Now you can implement this in your copy script.
Example:
\set t_pwd `pwd`
\set CSV_FILE `ls -1 | grep load*`
\set input_file '\'':t_pwd'/':CSV_FILE'\''
ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
copy TBL from :input_file DELIMITER '|' DIRECT;
Solution 3
Use the LOAD_STREAMS table
Example:
When loading a table give it a stream name - this way you can identify the file name / stream name:
COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';
Here is how you can query your load_streams table:
=> SELECT stream_name, table_name, load_start, accepted_row_count,
rejected_row_count, read_bytes, input_file_size_bytes, parse_complete_percent,
unsorted_row_count, sorted_row_count, sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
Makes sense? Hope this helped!
If you do not need to do it purely from inside vsql, it might be possible to cheat a bit and move the logic outside Vertica, into bash for example:
FILE=text_file.txt
(
while IFS= read -r LINE; do
echo "$LINE|$FILE"
done < "$FILE"
) | vsql -c 'copy table_a (...) FROM STDIN'
That way you basically COPY FROM STDIN, adding the filename to each line before it even reaches Vertica.
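In that case the column list of the COPY must include the filename column. Using the column names from the question (and assuming a '|' delimiter), the statement passed to vsql -c might look like:
COPY table_a (column1, column2, column3, FileName) FROM STDIN DELIMITER '|';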
I have a Data file that looks like this:
1 2 3 4 5 6
FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980
FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980
FirstName3 | LastName3 | 2344327 | Address1 | PhoneNumber1 | 1/1/1980
FirstName4 | LastName4 | 5998943 | Address1 | PhoneNumber1 | 1/1/1980
FirstName5 | LastName5 | 9854531 | Address1 | PhoneNumber1 | 1/1/1980
My DB has 2 tables, one for PERSON and one for ADDRESS, so I need to store columns 1, 2, 3 and 6 in PERSON and columns 4 and 5 in ADDRESS. All examples provided in the SQL*Loader documentation address this case, but only for fixed-size columns, and my data file is pipe-delimited (and splitting it into 2 different data files is not an option).
Does someone know how to do this?
As always help will be deeply appreciated.
Another option may be to set up the file as an external table and then run inserts selecting the columns you want from the external table.
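A rough sketch of that approach (the directory object, staging table name, file name, and column names below are made up; adjust them, and the data types, to your actual schema):
CREATE TABLE data_ext (
  first_name VARCHAR2(100),
  last_name  VARCHAR2(100),
  id_num     NUMBER,
  address    VARCHAR2(200),
  phone      VARCHAR2(50),
  birth_date VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('your_data_file.txt')
);

-- split the columns with plain INSERT ... SELECT (convert types as needed, e.g. TO_DATE for the date)
INSERT INTO person (first_name, last_name, id_num, birth_date)
  SELECT first_name, last_name, id_num, birth_date FROM data_ext;
INSERT INTO address (address, phone)
  SELECT address, phone FROM data_ext;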
If you stay with SQL*Loader itself, one control file can feed both tables from the pipe-delimited file: FILLER skips the columns a table does not need, and POSITION(1) makes the second INTO TABLE clause rescan the record from the start. The field names below are placeholders for your actual column names:
options (skip=1)
load data
infile "csv file path"
insert into table person
fields terminated by '|' optionally enclosed by '"'
trailing nullcols
( first_name
, last_name
, id_num
, address    filler
, phone      filler
, birth_date date "mm/dd/yyyy"
)
into table address
fields terminated by '|' optionally enclosed by '"'
trailing nullcols
( first_name filler position(1)
, last_name  filler
, id_num     filler
, address
, phone
)
Even if SQL*Loader doesn't support this (I'm not sure), nothing stops you from pre-processing the file with, say, awk and then loading the results. For example:
awk -F'|' -v OFS='|' '{print $1, $2, $3, $6}' 1.dat > person.dat
awk -F'|' -v OFS='|' '{print $4, $5}' 1.dat > address.dat