I have some data in HDFS that has no delimiter; the individual fields are identified by their position in the line.
For instance,
CountryXTOWNYCRIMEVALUEZ
So here the country would be positions 0 to 7, the town 8 to 12, and the crime statistic would be 13 to 23.
Is there a way to import data organised like this directly into Hive? I suppose a workable approach would be to write a MapReduce job that delimits the data, but I was wondering whether there is a Hive command that can import the data directly.
You can do this with RegexSerDe, using one capture group per fixed-width field:
create external table mytable
(
country string
,town string
,crime_statistic string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties
(
'input.regex' = '^(.{8})(.{5})(.*)$'
)
location '/...location of the data...'
;
select * from mytable
;
+----------+-------+-----------------+
| country  | town  | crime_statistic |
+----------+-------+-----------------+
| CountryX | TOWNY | CRIMEVALUEZ     |
+----------+-------+-----------------+
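To check the pattern before creating the table, the same regular expression can be tried against the sample line outside Hive. The snippet below is a minimal sketch in Python, not part of Hive; the function name split_fixed_width is made up for illustration, and it assumes the widths 8 and 5 from the question.

```python
import re

# Same pattern as 'input.regex' above: 8 chars, 5 chars, then the rest.
FIXED_WIDTH = re.compile(r"^(.{8})(.{5})(.*)$")

def split_fixed_width(line):
    """Return (country, town, crime_statistic), or None if the line is too short."""
    match = FIXED_WIDTH.match(line)
    return match.groups() if match else None

print(split_fixed_width("CountryXTOWNYCRIMEVALUEZ"))
# ('CountryX', 'TOWNY', 'CRIMEVALUEZ')
```

A line with fewer than 13 characters does not match the pattern, so the helper returns None for it.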
I am trying to follow the post here to set a variable in my Hive query. Assume I have the following file in HDFS:
/home/hduser/test/hr.txt
Berg,12000
Faviet,9000
Chen,8200
Urman,7800
Sciarra,7700
Popp,6900
Paino,8790
I then created my schema on top of the data as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/test/';
I want to create 4 tiles for the table, but instead of hardcoding the number of tiles I want to pass it in as a variable. My code is below:
SET q1=select ceiling(count(*)/2) from employees;
SELECT lname,
salary,
NTILE(${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
However, this throws an error:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Number of tiles must be an int expression
I tried quoting the variable, as in '${hiveconf:q1}', but that didn't help. If I hardcode the number of tiles (which I am trying to avoid), the query looks like this:
SELECT lname,
salary,
NTILE(4) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
which yields
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Thoughts?
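For reference, the way NTILE(4) spreads the seven rows above can be reproduced outside Hive. This is a rough Python sketch of the usual NTILE rule (when the row count does not divide evenly, the earlier tiles take the extra rows); the ntile helper is my own illustration, not a Hive API.

```python
def ntile(rows, n):
    """Assign tile numbers 1..n to already-sorted rows, larger tiles first."""
    base, extra = divmod(len(rows), n)
    tiles = []
    for tile in range(1, n + 1):
        # The first `extra` tiles each hold one row more than the rest.
        size = base + (1 if tile <= extra else 0)
        tiles.extend([tile] * size)
    return list(zip(rows, tiles))

salaries = [12000, 9000, 8790, 8200, 7800, 7700, 6900]  # ORDER BY salary DESC
for salary, quartile in ntile(salaries, 4):
    print(salary, quartile)
```

With 7 rows and 4 tiles, the tile sizes come out as 2, 2, 2 and 1, which matches the hardcoded result above.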
When there isn't a documented way, one can combine documented features into a clean enough hack :)
Here's my attempt, using dfs commands from Hive, shell commands from Hive, the source command and whatnot. I guess it might not work out of the box with queries run through HiveServer2. I would be glad if there were a prettier way.
Let's go
Basic setup
SET EMPLOYEE_TABLE_LOCATION=/home/hduser/test/;
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${hiveconf:EMPLOYEE_TABLE_LOCATION}';
SET PATH_TO_SETTINGS_FILE=hdfs:/tmp/query_to_setting;
SET FILENAME_ON_LOCAL_FS=query_to_setting.sql;
Generate a file in HDFS with content "SET q1=<the-query-result>;"
CREATE TABLE query_to_setting_table
LOCATION '${hiveconf:PATH_TO_SETTINGS_FILE}'
AS
SELECT concat('SET q1=', ceiling(count(*)/2),'\;') from employees;
Source the generated file like any SQL file.
First copy the file to the local file system, since 'source' only operates on local disk.
dfs -get ${hiveconf:PATH_TO_SETTINGS_FILE}/000000_0 ${hiveconf:FILENAME_ON_LOCAL_FS};
source ${hiveconf:FILENAME_ON_LOCAL_FS};
Try the setting
hive> SET q1;
q1=4
Use the setting in a query
hive> SELECT lname,
salary,
NTILE( ${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
OK
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Optional cleanup
!rm ${hiveconf:FILENAME_ON_LOCAL_FS};
DROP TABLE query_to_setting_table;
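As for why the original attempt fails: as far as I understand it, SET stores the text to the right of the equals sign without running it, and ${hiveconf:q1} is replaced textually before the query is parsed. So NTILE receives a whole SELECT statement rather than an integer. The Python sketch below only imitates that substitution step to show the resulting query text; substitute is a made-up helper, not Hive code.

```python
import re

def substitute(query, conf):
    """Replace each ${hiveconf:name} with the stored text, as plain string substitution."""
    return re.sub(r"\$\{hiveconf:(\w+)\}", lambda m: conf[m.group(1)], query)

query = "SELECT lname, salary, NTILE(${hiveconf:q1}) OVER (ORDER BY salary DESC) FROM employees"

# SET q1=select ... keeps the statement text; nothing is executed.
print(substitute(query, {"q1": "select ceiling(count(*)/2) from employees"}))
# After sourcing the generated file, q1 holds the literal result.
print(substitute(query, {"q1": "4"}))
```

The first print shows NTILE wrapped around a SELECT statement, which is what the "Number of tiles must be an int expression" error complains about; the second shows the plain NTILE(4) that the workaround produces.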
I've created an HBase table like this:
create 'student','personal'
and put some data into it:
ROW COLUMN+CELL
1 column=personal:age, timestamp=1456224023454, value=20
1 column=personal:name, timestamp=1456224008188, value=pesronA
2 column=personal:age, timestamp=1456224891317, value=13
2 column=personal:name, timestamp=1456224868967, value=pesronB
3 column=personal:age, timestamp=1456224935178, value=21
3 column=personal:name, timestamp=1456224921246, value=personC
4 column=personal:age, timestamp=1456224951789, value=20
4 column=personal:name, timestamp=1456224961845, value=personD
5 column=personal:age, timestamp=1456224983240, value=20
5 column=personal:name, timestamp=1456224972816, value=personE
I want to import this data into a Hive table, so I wrote this Hive query:
CREATE TABLE hbaseStudent(key INT, name STRING, age INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal:age,personal:name")
TBLPROPERTIES ("hbase.table.name" = "student")
But when I execute the query, I get this error:
Driver returned: 1. Errors: OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/hbase/HBaseConfiguration
What should I do?
I tried this and it worked: replace all the double quotes (") with single quotes ('). Also add the terminator ; at the end of the last line.
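One more thing worth checking once the table is created: as far as I know, the entries in hbase.columns.mapping are matched to the Hive columns by position. A quick Python sketch (the pair_mapping helper is hypothetical, purely for illustration) pairs the two lists from the question:

```python
def pair_mapping(hive_columns, columns_mapping):
    """Pair Hive columns with HBase mapping entries by position."""
    entries = columns_mapping.split(",")
    if len(entries) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    return list(zip(hive_columns, entries))

# Column list and mapping string copied from the CREATE TABLE statement in the question.
pairs = pair_mapping(["key", "name", "age"], ":key,personal:age,personal:name")
for column, entry in pairs:
    print(column, "->", entry)
```

With the order from the question, name is paired with personal:age and age with personal:name, so the last two mapping entries probably need to be swapped.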
I am running the following sqoop query:
sqoop import --connect jdbc:oracle:thin:#ldap://oid:389/ewsop000,cn=OracleContext,dc=****,dc=com \
--table ngprod.ewt_payment_ng --where "d_last_updt_ts >= to_timestamp('11/01/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')" \
AND "d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')" --username ***** --P \
--columns N_PYMNT_ID,D_last_updt_Ts,c_pymnt_meth,c_rcd_del,d_Create_ts \
--hive-import --hive-table payment_sample_table2
The schema for table payment_sample_table2 already exists in Hive. The command runs fine if I leave out
AND "d_last_updt_ts <= to_timestamp('11/10/2013 11:59:59.999999 PM', 'MM/DD/YYYY HH:MI:SS.FF6 AM')"
Can someone tell me why, or if there's any other way to get the range of data?
Please specify the exact error. In any case, put the "AND ..." part inside the same pair of double quotes as the rest of the "where" clause, on the same line as the part that precedes it. As shown above, you have a badly formatted command line; it has nothing to do with the actual query.
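To see why the placement of the quotes matters, here is a small Python sketch that uses shlex to approximate how a POSIX shell splits the arguments. The commands are shortened and the predicates (ts >= 1, ts <= 2) are placeholders, so treat it as an illustration rather than the real invocation.

```python
import shlex

def where_argument(command):
    """Return the single argument that follows --where after shell-style splitting."""
    argv = shlex.split(command)
    return argv[argv.index("--where") + 1]

# AND sits outside the quotes: --where only receives the first condition.
broken = 'sqoop import --table t --where "ts >= 1" AND "ts <= 2" --hive-import'
# Both conditions inside one pair of quotes: --where receives the whole range.
fixed = 'sqoop import --table t --where "ts >= 1 AND ts <= 2" --hive-import'

print(where_argument(broken))  # ts >= 1
print(where_argument(fixed))   # ts >= 1 AND ts <= 2
```

In the broken form, AND and the second quoted string reach the program as two extra, unrelated arguments.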
I'm merging several tables in Oracle 10g into a consolidated table, like this:
table_A (will have all the records)
table_b - part of the data to be merged
table_c - part of the data to be merged
table_d - part of the data to be merged
Now, I run it with error logging like this:
MERGE INTO TABLE_A A USING (SELECT * FROM TABLE_B) B
ON
(
A.NOMBRE=B.NOMBRE AND
A.PRIMER_APELLIDO=B.PRIMER_APELLIDO AND
A.SEGUNDO_APELLIDO=B.SEGUNDO_APELLIDO AND
TO_CHAR(A.FECHA_NACIMIENTO,'DD/MM/YYYY')=TO_CHAR(B.FECHA_NACIMIENTO,'DD/MM/YYYY') AND
A.SEXO=B.SEXO
)
WHEN MATCHED THEN
UPDATE SET DGP2011='1'
WHEN NOT MATCHED THEN
INSERT
(
A.FOLIO_RELACIONADO,
A.CVE_PROGRAMA,
A.FECHA_ALTA,
A.PRIMER_APELLIDO,
A.SEGUNDO_APELLIDO,
A.NOMBRE,
A.FECHA_NACIMIENTO,
A.SEXO,
A.CVE_NACIONALIDAD,
A.CVE_ENTIDAD_NACIMIENTO,
A.CVE_GRADO_ESCOLAR,
A.CVE_GRADO_ESTUDIOS,
A.CURP,
A.CALLE,
A.NUM_EXT,
A.NUM_INT,
A.CODIGO_POSTAL,
A.ENTRE_CALLE,
A.Y_CALLE,
A.OTRA_REFERENCIA,
A.TELEFONO,
A.COLONIA,
A.LOCALIDAD,
A.CVE_MUNICIPIO,
A.CVE_ENTIDAD_FEDERATIVA,
A.CVE_CCT,
A.PRIMER_APELLIDO_C,
A.SEGUNDO_APELLIDO_C,
A.NOMBRE_C,
A.FECHA_NACIMIENTO_C,
A.SEXO_C,
A.CVE_ESTADO_CIVIL_C,
A.CVE_GRADO_ESTUDIOS_C,
A.CVE_PARENTESCO_C,
A.CURP_C,
A.CVE_TIPO_ID_OFCL_C,
A.ID_DOCTO_OFL_C,
A.CVE_NACIONALIDAD_C,
A.CVE_ENTIDAD_NACIMIENTO_C,
A.CALLE_C,
A.NUM_EXT_C,
A.NUM_INT_C,
A.CODIGO_POSTAL_C,
A.ENTRE_CALLE_C,
A.Y_CALLE_C,
A.OTRA_REFERENCIA_C,
A.TELEFONO_C,
A.COLONIA_C,
A.LOCALIDAD_C,
A.CVE_MUNICIPIO_C,
A.CVE_ENTIDAD_FEDERATIVA_C,
A.E_MAIL_C,
A.DGP2011
)
VALUES
(
B.FOLIO_RELACIONADO,
B.CVE_PROGRAMA,
B.FECHA_ALTA,
B.PRIMER_APELLIDO,
B.SEGUNDO_APELLIDO,
B.NOMBRE,
TO_CHAR(B.FECHA_NACIMIENTO,'DD/MM/YYYY'),
B.SEXO,
B.CVE_NACIONALIDAD,
B.CVE_ENTIDAD_NACIMIENTO,
B.CVE_GRADO_ESCOLAR,
B.CVE_GRADO_ESTUDIOS,
B.CURP,
B.CALLE,
B.NUM_EXT,
B.NUM_INT,
B.CODIGO_POSTAL,
B.ENTRE_CALLE,
B.Y_CALLE,
B.OTRA_REFERENCIA,
B.TELEFONO,
B.COLONIA,
B.LOCALIDAD,
B.CVE_MUNICIPIO,
B.CVE_ENTIDAD_FEDERATIVA,
B.CVE_CCT,
B.PRIMER_APELLIDO_C,
B.SEGUNDO_APELLIDO_C,
B.NOMBRE_C,
TO_CHAR(B.FECHA_NACIMIENTO_C,'DD/MM/YYYY'),
B.SEXO_C,
B.CVE_ESTADO_CIVIL_C,
B.CVE_GRADO_ESTUDIOS_C,
B.CVE_PARENTESCO_C,
B.CURP_C,
B.CVE_TIPO_ID_OFCL_C,
B.ID_DOCTO_OFL_C,
B.CVE_NACIONALIDAD_C,
B.CVE_ENTIDAD_NACIMIENTO_C,
B.CALLE_C,
B.NUM_EXT_C,
B.NUM_INT_C,
B.CODIGO_POSTAL_C,
B.ENTRE_CALLE_C,
B.Y_CALLE_C,
B.OTRA_REFERENCIA_C,
B.TELEFONO_C,
B.COLONIA_C,
B.LOCALIDAD_C,
B.CVE_MUNICIPIO_C,
B.CVE_ENTIDAD_FEDERATIVA_C,
B.E_MAIL_C,
'1'
) LOG ERRORS INTO ELOG_SEGURO_ESCOLAR REJECT LIMIT UNLIMITED;
It just raises the error "ORA-01722: invalid number", and Toad highlights the 'A.' part of the query.
About the tables:
table A has all its fields as VARCHAR2(4000)
tables b to d have columns typed according to the data they hold (date, number, etc.)
The thing is, even with the error logging clause it raises the error and doesn't merge anything!
Plus, I have no idea what I should be looking for to find the 'invalid number' field.
Any advice would be deeply appreciated.
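Found it!
It was the TO_CHAR(A.FECHA_NACIMIENTO,'DD/MM/YYYY') line in the ON clause. A.FECHA_NACIMIENTO is already a VARCHAR2, so calling TO_CHAR on it with a date format apparently forces an implicit conversion of the string to a number, hence the ORA-01722. I changed the comparison to
A.FECHA_NACIMIENTO=B.FECHA_NACIMIENTO and it worked. Thanks anyway!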