I am new to Hadoop. I tried to execute the query below, but it didn't go well.
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "SELECT order_items.order_item_product_id, orders.order_status FROM orders INNER JOIN order_items ON orders.order_id = order_items.order_item_order_id WHERE \$CONDITIONS" --target-dir /user/cloudera/order_join1 --split-by order_id --num-mappers 4
When I run the same query in MySQL and through sqoop eval it works fine, but when I use it with sqoop import I get the error below:
[cloudera@quickstart ~]$ sqoop import --connect
"jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba
--password cloudera --query "SELECT order_items.order_item_product_id,
orders.order_status FROM orders INNER JOIN order_items ON orders.order_id =
order_items.order_item_order_id WHERE \$CONDITIONS" --target-dir
/user/cloudera/order_join1 --split-by order_id --num-mappers 4
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/01/15 14:34:59 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.8.0
17/01/15 14:34:59 WARN tool.BaseSqoopTool: Setting your password on the
command-line is insecure. Consider using -P instead.
17/01/15 14:35:00 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
17/01/15 14:35:00 INFO tool.CodeGenTool: Beginning code generation
17/01/15 14:35:02 INFO manager.SqlManager: Executing SQL statement: SELECT
order_items.order_item_product_id, orders.order_status FROM orders INNER
JOIN order_items ON orders.order_id = order_items.order_item_order_id WHERE (1 = 0)
17/01/15 14:35:02 INFO manager.SqlManager: Executing SQL statement: SELECT
order_items.order_item_product_id, orders.order_status FROM orders INNER
JOIN order_items ON orders.order_id = order_items.order_item_order_id WHERE (1 = 0)
17/01/15 14:35:03 INFO manager.SqlManager: Executing SQL statement: SELECT
order_items.order_item_product_id, orders.order_status FROM orders INNER
JOIN order_items ON orders.order_id = order_items.order_item_order_id WHERE (1 = 0)
17/01/15 14:35:03 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-
cloudera/compile/f6cf89b54d33e5676419b1646a648100/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/01/15 14:35:10 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-
cloudera/compile/f6cf89b54d33e5676419b1646a648100/QueryResult.jar
17/01/15 14:35:10 INFO mapreduce.ImportJobBase: Beginning query import.
17/01/15 14:35:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/01/15 14:35:15 INFO Configuration.deprecation: mapred.map.tasks is
deprecated. Instead, use mapreduce.job.maps
17/01/15 14:35:16 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/10.0.2.15:8032
17/01/15 14:35:20 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:862)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeInternal(DFSOutputStream.java:830)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:826)
17/01/15 14:35:23 INFO db.DBInputFormat: Using read commited transaction isolation
17/01/15 14:35:23 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(order_id), MAX(order_id) FROM (SELECT
order_items.order_item_product_id, orders.order_status FROM orders INNER
JOIN order_items ON orders.order_id = order_items.order_item_order_id WHERE (1 = 1) ) AS t1
17/01/15 14:35:24 INFO mapreduce.JobSubmitter: Cleaning up the staging area
/user/cloudera/.staging/job_1484512313628_0005
17/01/15 14:35:24 WARN security.UserGroupInformation:
PriviledgedActionException as:cloudera (auth:SIMPLE) cause:java.io.IOException:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'order_id' in 'field list'
17/01/15 14:35:24 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'order_id' in 'field list'
at
org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:207)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1325)
at org.apache.sqoop.mapreduce.ImportJobBase.doSubmitJob(ImportJobBase.java:203)
at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:176)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:273)
at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:748)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:509)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:615)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'order_id' in 'field list'
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
at com.mysql.jdbc.Util.getInstance(Util.java:360)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:978)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3887)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3823)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2435)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2582)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2526)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1446)
at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:178)
... 22 more
Can anyone please help me? Where did I make a mistake in the command, and why do we have to use WHERE \$CONDITIONS in the query?
You might need to include the column "order_id" in your select list. Because you pass --split-by order_id, Sqoop wraps your query in a bounding query of the form SELECT MIN(order_id), MAX(order_id) FROM (&lt;your query&gt;) AS t1 (you can see it in the log above), and that query fails because order_id is not among the columns your SELECT returns.
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera
--query "SELECT orders.order_id, order_items.order_item_product_id, orders.order_status FROM orders INNER JOIN order_items ON orders.order_id = order_items.order_item_order_id WHERE \$CONDITIONS" --target-dir /user/cloudera/order_join1 --split-by order_id --num-mappers 4
Related
I am running the following sqoop import from Teradata:
sqoop import --driver com.teradata.jdbc.TeraDriver \
--connect jdbc:teradata://telearg7/DATABASE=AR_PROD_HUB_DIM_VW,CHARSET=UTF8,CLIENT_CHARSET=UTF-8,TCP=SEND1500,TCP=RECEIVE1500 \
--verbose \
--username ld_hadoop \
--password xxxx \
--query "SELECT G.suscripcion_id , G.valor_recurso_primario_cd , G.suscripcion_cd , G.fecha_migra_id FROM ( SELECT DISTINCT a.suscripcion_id as suscripcion_id, a.valor_recurso_primario_cd as valor_recurso_primario_cd , f.suscripcion_cd as suscripcion_cd, a.fecha_fin_orden_id AS fecha_migra_id , row_number() over (partition by a.valor_recurso_primario_CD order by a.Fecha_Fin_Orden_ID DESC) as row_num FROM AR_PROD_HUB_DIM_VW.F_TR_CAMBIO_OFERTA_D A INNER JOIN AR_PROD_HUB_DIM_VW.D_ESTADO_OPERACION B ON A.ESTADO_OPERACION_ID = B.ESTADO_OPERACION_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_ESTADO_ORDEN C ON A.ESTADO_ORDEN_ID = C.ESTADO_ORDEN_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_TIPO_OFERTA D ON A.TIPO_OFERTA_ID = D.TIPO_OFERTA_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_TIPO_OFERTA E ON A.TIPO_OFERTA_ANTERIOR_ID = E.TIPO_OFERTA_ID INNER JOIN AR_PROD_HUB_DIM_VW.D_Suscripcion F ON a.Suscripcion_ID = F.Suscripcion_ID WHERE FECHA_FIN_ORDEN_ID BETWEEN CURRENT_DATE-15 and CURRENT_DATE AND B.ESTADO_OPERACION_CD = 'DO' AND C.ESTADO_ORDEN_CD = 'DO' AND D.TIPO_OFERTA_DE IN ('PortePagado', 'PRE', 'Prepaid') AND E.TIPO_OFERTA_DE NOT IN ('PortePagado', 'PRE', 'Prepaid') ) G WHERE \$CONDITIONS AND G.ROW_NUM = 1" \
--hcatalog-database TRAFICO \
--hcatalog-table CRITERIO_TEM_MIGNEG_TMP \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as orcfile tblproperties ('EXTERNAL'='TRUE')" -m 1
And it is giving me the following error:
22/05/11 12:16:15 INFO hcat.SqoopHCatUtilities: Caused by: java.lang.NullPointerException
22/05/11 12:16:15 INFO hcat.SqoopHCatUtilities: at org.apache.hadoop.hive.ql.ddl.DDLSemanticAnalyzerFactory.(DDLSemanticAnalyzerFactory.java:79)
22/05/11 12:16:16 DEBUG manager.SqlManager: Closing a db connection
ERROR tool.ImportTool: Import failed: java.io.IOException: HCat exited
with status 1
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.executeExternalHCatProgram(SqoopHCatUtilities.java:1252)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.launchHCatCli(SqoopHCatUtilities.java:1201)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.createHCatTable(SqoopHCatUtilities.java:735)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.configureHCat(SqoopHCatUtilities.java:394)
at org.apache.sqoop.mapreduce.hcat.SqoopHCatUtilities.configureImportOutputFormat(SqoopHCatUtilities.java:904)
at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:100)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:265)
at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:732)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:549)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:653)
at org.apache.sqoop.Sqoop.run(Sqoop.java:151)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:187)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:241)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:250)
at org.apache.sqoop.Sqoop.main(Sqoop.java:259)
I ran the SQL directly in Teradata and it works; it returns exactly the records that should be imported into Hive.
In Hive, the table TRAFICO.CRITERIO_TEM_MIGNEG_TMP is dropped before the import.
I keep re-running the job and cannot get past the error.
Any suggestions?
This is Hive's version:
Hive 3.1.3000.7.1.7.1000-141
This is Hadoop's version:
Hadoop 3.1.1.7.1.7.1000-141
Source code repository git@github.infra.cloudera.com:CDH/hadoop.git -r 8225796fc6d7984f835c3f63f1feb1efb1e4784a
Compiled by jenkins on 2022-03-24T17:23Z
Compiled with protoc 2.5.0
From source with checksum b591347dc68a5634183cd9aac1974ddd
This command was run using /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p1000.24102687/lib/hadoop/hadoop-common-3.1.1.7.1.7.1000-141.jar
I am joining data from three tables and importing it from Oracle into Hive using a Sqoop import command. The table row counts are below.
select count(*) from table1; -- 40446561
select count(*) from table2; -- 16886690
select count(*) from table3; -- 15142664
Sqoop Query:
sqoop-import -D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/urandom" --connect $CONNECTION --username $DB_USER_NAME --password $DB_PASSWORD --hive-import --hive-overwrite --hive-table ${HIVE_TABLE_NAME} --target-dir $HDFS_TARGET_DIR --mapreduce-job-name $JOB_NAME --query " SELECT t.* FROM (SELECT rownum ID, a.column1, a.column2, a.column3, a.column4, b.column5, b.column6, c.column7 FROM table1 a LEFT OUTER JOIN table2 b on (b.column5 = a.column1) LEFT OUTER JOIN table3 c on (c.column7= a.column1)) t WHERE \$CONDITIONS" --split-by t.ID --null-string '\\N' --null-non-string '\\N' --num-mappers 12 --fetch-size 10000 --delete-target-dir --direct --verbose
I am getting the following exception:
Error: java.io.IOException: SQLException in nextKeyValue
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.sql.SQLRecoverableException: IO Error: Connection reset
at oracle.jdbc.driver.T4CPreparedStatement.fetch(T4CPreparedStatement.java:1080)
at oracle.jdbc.driver.OracleStatement.fetchMoreRows(OracleStatement.java:3716)
at oracle.jdbc.driver.InsensitiveScrollableResultSet.fetchMoreRows(InsensitiveScrollableResultSet.java:1015)
at oracle.jdbc.driver.InsensitiveScrollableResultSet.absoluteInternal(InsensitiveScrollableResultSet.java:979)
at oracle.jdbc.driver.InsensitiveScrollableResultSet.next(InsensitiveScrollableResultSet.java:579)
at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:237)
... 12 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at oracle.net.ns.Packet.receive(Packet.java:311)
at oracle.net.ns.DataPacket.receive(DataPacket.java:105)
at oracle.net.ns.NetInputStream.getNextPacket(NetInputStream.java:305)
at oracle.net.ns.NetInputStream.read(NetInputStream.java:249)
at oracle.jdbc.driver.T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:104)
at oracle.jdbc.driver.T4CMAREngineStream.getNBytes(T4CMAREngineStream.java:646)
at oracle.jdbc.driver.T4CMAREngineStream.unmarshalNBytes(T4CMAREngineStream.java:616)
at oracle.jdbc.driver.DynamicByteArray.unmarshalBuffer(DynamicByteArray.java:338)
at oracle.jdbc.driver.DynamicByteArray.unmarshalCLR(DynamicByteArray.java:226)
at oracle.jdbc.driver.T4CMarshaller$BasicMarshaller.unmarshalBytes(T4CMarshaller.java:124)
at oracle.jdbc.driver.T4CMarshaller$BasicMarshaller.unmarshalOneRow(T4CMarshaller.java:101)
at oracle.jdbc.driver.T4CVarcharAccessor.unmarshalOneRow(T4CVarcharAccessor.java:212)
at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:1474)
at oracle.jdbc.driver.T4CTTIrxd.unmarshal(T4CTTIrxd.java:1282)
at oracle.jdbc.driver.T4C8Oall.readRXD(T4C8Oall.java:851)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:448)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:257)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:587)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:225)
at oracle.jdbc.driver.T4CPreparedStatement.fetch(T4CPreparedStatement.java:1066)
... 17 more
Please let me know how to fix it.
This simply means that something in the backend (DBMS) decided to stop working, usually because resources were unavailable. It has nothing to do with your code or the number of rows you are moving. You can read more about similar problems here:
http://kr.forums.oracle.com/forums/thread.jspa?threadID=941911
http://forums.oracle.com/forums/thread.jspa?messageID=3800354
This may not answer your question, but it will give you an idea of why it might be happening. You could discuss it further with your DBA and see if there is something specific to your case.
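If the resets keep occurring mid-import, one thing worth experimenting with, purely as an assumption on my part and not a guaranteed fix for a server-side resource problem, is to reduce the load each parallel Sqoop session puts on Oracle by lowering the number of mappers and the fetch size. A sketch of the command from the question with only those two values changed (the new values are hypothetical; tune them with your DBA):
sqoop-import -D mapred.child.java.opts="-Djava.security.egd=file:/dev/../dev/urandom" \
  --connect $CONNECTION --username $DB_USER_NAME --password $DB_PASSWORD \
  --hive-import --hive-overwrite --hive-table ${HIVE_TABLE_NAME} \
  --target-dir $HDFS_TARGET_DIR --mapreduce-job-name $JOB_NAME \
  --query " SELECT t.* FROM (SELECT rownum ID, a.column1, a.column2, a.column3, a.column4, b.column5, b.column6, c.column7 FROM table1 a LEFT OUTER JOIN table2 b on (b.column5 = a.column1) LEFT OUTER JOIN table3 c on (c.column7 = a.column1)) t WHERE \$CONDITIONS" \
  --split-by t.ID --null-string '\\N' --null-non-string '\\N' \
  --num-mappers 4 --fetch-size 1000 \
  --delete-target-dir --direct --verbose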
I am trying to use the --where option to get conditional data by joining the orders table with the order_items table, using the command below:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id where \$CONDITIONS " \
--where "order_id between 10840 and 10850" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id
Now I don't know what's wrong with this, because when I run the same query in MySQL I get 41 records, which is correct, but when I run this command in Sqoop it dumps all 172198 records. I don't understand what's happening and what's going wrong.
When you run a parallel import, Sqoop uses the column specified in --split-by to substitute the $CONDITIONS placeholder and generate different queries (which are executed by different mappers). For instance, Sqoop will first find the minimum and maximum values of order_id and then, depending on the number of mappers, execute your query against different subsets of the whole range of possible order_id values.
That way, your query would be translated internally into different parallel queries like these:
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
WHERE (order_id >=0 AND order_id < 10000)
SELECT * FROM orders o join order_items oi on o.order_id = oi.order_item_order_id
WHERE (order_id >=10000 AND order_id < 20000)
...
So in this case, the --where clause you specified separately will not be used and you'll end up with all the records. In your particular case, though, you don't really need the --split-by flag, because you are only interested in a particular (and very limited) range of values. So you could use this instead:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850)" \
--target-dir /user/cloudera/order_join_conditional \
-m 1
Note also the -m 1 at the end which (as pointed out by dev ツ) stands for --num-mappers and allows you to tell Sqoop that you want to use just one mapper for your import process (therefore, no parallelism).
If the range of values were bigger, you could keep --split-by and put your WHERE condition in the free-form query itself, making use of the parallelism:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
--password cloudera \
--query "Select * from orders o join order_items oi on o.order_id = oi.order_item_order_id WHERE (order_id BETWEEN 10840 AND 10850) AND \$CONDITIONS" \
--target-dir /user/cloudera/order_join_conditional \
--split-by order_id
I am using Sqoop to import MySQL tables to HDFS. To do that, I use a free-form query import.
--query "SELECT $query_select FROM $table where \$CONDITIONS"
This query is quite slow because of the MIN(id) and MAX(id) search. To improve performance, I've decided to use --boundary-query and specify the lower and upper bounds manually (see https://www.safaribooksonline.com/library/view/apache-sqoop-cookbook/9781449364618/ch04.html):
--boundary-query "select 176862848, 172862848"
However, Sqoop ignores the specified values and again tries to find the minimum and maximum "id" by itself:
16/06/13 14:24:44 INFO tool.ImportTool: Lower bound value: 170581647
16/06/13 14:24:44 INFO tool.ImportTool: Upper bound value: 172909234
The complete sqoop command:
sqoop-import -fs hdfs://xxxxxxxxx/ -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" -m $nodes_number\
--connect jdbc:mysql://$server:$port/$database --username $username --password $password\
--target-dir $destination_dir --boundary-query "select 176862848, 172862848"\
--incremental append --check-column $id_column_name --last-value $last_value\
--split-by $id_column_name --query "SELECT $query_select FROM $table where \$CONDITIONS"\
--fields-terminated-by , --escaped-by \\ --enclosed-by '\"'
Has anyone already encountered/solved this problem? Thanks.
I've managed to solve this problem by deleting the following arguments:
--incremental append --check-column $id_column_name --last-value $last_value
It seems that there is a conflict between the arguments --boundary-query, --check-column, --split-by, and --incremental append.
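For reference, a sketch of the full command from the question with those incremental-append arguments removed and the manual bounds kept, following the resolution described above (the shell variables, the elided HDFS URI, and the boundary values are exactly the ones from the question):
sqoop-import -fs hdfs://xxxxxxxxx/ -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris" -m $nodes_number \
  --connect jdbc:mysql://$server:$port/$database --username $username --password $password \
  --target-dir $destination_dir \
  --boundary-query "select 176862848, 172862848" \
  --split-by $id_column_name \
  --query "SELECT $query_select FROM $table where \$CONDITIONS" \
  --fields-terminated-by , --escaped-by \\ --enclosed-by '\"'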
You are correct.
We should not use --split-by with the --boundary-query control argument.
Try it like this:
--boundary-query "select 176862848, 172862848 from tablename limit 1" \
I am using AWS EMR + Spark 1.6.1 + Hive 1.0.0
I have this UDAF and have added it to Spark's classpath: https://github.com/scribd/hive-udaf-maxrow/blob/master/src/com/scribd/hive/udaf/GenericUDAFMaxRow.java
I registered it in Spark with sqlContext.sql("CREATE TEMPORARY FUNCTION maxrow AS 'some.cool.package.hive.udf.GenericUDAFMaxRow'")
However, when I call it in Spark in the following query
CREATE VIEW VIEW_1 AS
SELECT
a.A,
a.B,
maxrow ( a.C,
a.D,
a.E,
a.F,
a.G,
a.H,
a.I
) as m
FROM
table_1 a
JOIN
table_2 b
ON
b.Z = a.D
AND b.Y = a.C
JOIN dummy_table
GROUP BY
a.A,
a.B
It gave me this error
16/05/18 19:49:14 WARN RowResolver: Duplicate column info for a.A was overwritten in RowResolver map: _col0: string by _col0: string
16/05/18 19:49:14 WARN RowResolver: Duplicate column info for a.B was overwritten in RowResolver map: _col1: bigint by _col1: bigint
16/05/18 19:49:14 ERROR Driver: FAILED: SemanticException [Error 10002]: Line 16:32 Invalid column reference 'C'
org.apache.hadoop.hive.ql.parse.SemanticException: Line 16:32 Invalid column reference 'C'
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10643)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10591)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3656)
But if I remove the GROUP BY clause and the aggregate function, it works. So I suspect that Spark SQL somehow does not recognize it as an aggregate function.
Any help is appreciated. Thanks.