I have a table in Hive that has the following structure:
> describe volatility2;
Query: describe volatility2
+------------------+---------------+---------+
| name             | type          | comment |
+------------------+---------------+---------+
| version          | int           |         |
| unmappedmkfindex | int           |         |
| mfvol            | array<string> |         |
+------------------+---------------+---------+
It was created by Spark HiveContext code using the DataFrame API, like this:
val volDF = hc.createDataFrame(volRDD)
volDF.saveAsTable(volName)
which carried over the RDD structure that was defined in the schema:
def schemaVolatility: StructType = StructType(
  StructField("Version", IntegerType, false) ::
  StructField("UnMappedMKFIndex", IntegerType, false) ::
  StructField("MFVol", DataTypes.createArrayType(StringType), true) :: Nil)
However, when I try to select from this table using the latest Impala JDBC driver, the last column is not visible to it. My query is very simple; it just prints the result set's column metadata to the console, exactly like the example code provided with the driver download:
String sqlStatement = "select * from default.volatility2";
Class.forName(jdbcDriverName);
con = DriverManager.getConnection(connectionUrl);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sqlStatement);
System.out.println("\n== Begin Query Results ======================");
ResultSetMetaData metadata = rs.getMetaData();
for (int i = 1; i <= metadata.getColumnCount(); i++) {
    System.out.println(metadata.getColumnName(i) + ":" + metadata.getColumnTypeName(i));
}
System.out.println("== End Query Results =======================\n\n");
The console output is this:
== Begin Query Results ======================
version:version
unmappedmkfindex:unmappedmkfindex
== End Query Results =======================
Is it a driver bug, or am I missing something?
I found the answer to my own question. Posting it here so it may help others and save them some searching. Impala recently introduced so-called "complex types" support in its SQL, which includes array among others. The documentation is here:
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_using
According to it, what I had to do was change the query to look like this:
select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol
and I got the expected results back.
I was using UCanAccess 5.0.1 in Databricks 9.1 LTS (Spark 3.1.2, Scala 2.12), and for whatever reason, when I use the following code to read in a single-record Access DB table, it keeps treating the column names as the record itself (I've tried adding more records and got the same results).
The Access db table looks like this (2 records):
ID  Field1  Field2
2   key1    column2
3   key2    column2-1
The Python code looks like this:
connectionProperties = {
    "driver": "net.ucanaccess.jdbc.UcanaccessDriver"
}
url = "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb"
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)
And the result looks like this:
df.printSchema()
df.count()
root
|-- Field1: string (nullable = true)
|-- Field2: string (nullable = true)
|-- ID: string (nullable = true)
Out[21]: 2
df.show()
+------+------+---+
|Field1|Field2| ID|
+------+------+---+
|Field1|Field2| ID|
|Field1|Field2| ID|
+------+------+---+
Any idea/suggestion?
If your data has column names in the first row, you can try header=True to set the first row as the column headers.
Sample code –
df = spark.read.jdbc(url=url, table="Table1", header=True, properties=connectionProperties)
But if your data does not have column headers, you need to explicitly define column names and then assign them as column headers.
Sample code –
columns = ["column_name_1"," column_name_2"," column_name_3"]
df = spark.read.jdbc( url = url, table="Table1”, schema=columns, properties= connectionProperties)
You can also refer to this answer by Alberto Bonsanto.
Reference - https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#read-data-from-jdbc
It turns out that there was a bug in the JDBC code (see https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name).
I added the following code and now the UCanAccess driver works fine:
%scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.jdbc.JdbcDialects

// Custom dialect: quote UCanAccess identifiers with backticks instead of
// double quotes, so the column names are not returned as literal values.
private case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:ucanaccess")

  override def quoteIdentifier(colName: String): String = {
    colName.split('.').map(part => s"`$part`").mkString(".")
  }
}

JdbcDialects.registerDialect(HiveDialect)
Then display(df) would show:
| Field1 | Field2    | ID |
|:-------|:----------|:---|
| key1   | column2   | 2  |
| key2   | column2-1 | 3  |
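For reference, once the dialect above is registered in the %scala cell, the original Python read from the question should return the actual rows rather than repeating the header. A minimal sketch, reusing the url and connectionProperties defined in the question:
# assumes HiveDialect has already been registered in the %scala cell above,
# and that url / connectionProperties are the ones defined in the question
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)
df.show()  # now prints the key1/key2 rows instead of the column names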
I am new to Airflow. I want to do an operation like the one below using Airflow operators.
Briefly, I want to read some data from a database table and, depending on the values of a column in that table, run different tasks.
This is the table which I used to get data.
+-----------+--------+
| task_name | status |
+-----------+--------+
| a         | 1      |
| b         | 2      |
| c         | 4      |
| d         | 3      |
| e         | 4      |
+-----------+--------+
From the above table I want to select the rows where status=4 and, according to their task_name, run the relevant JAR file (for running JAR files I am planning to use the BashOperator). I want to execute this using Airflow. Note that I am using PostgreSQL.
This is the code which I have implemented so far.
from airflow.models import DAG
from airflow.operators.postgres_operator import PostgresOperator
from datetime import datetime, timedelta
from airflow import settings

# set the default attributes
default_args = {
    'owner': 'Airflow',
    'start_date': datetime(2020, 10, 4)
}

status_four_dag = DAG(
    dag_id='status_check',
    default_args=default_args,
    schedule_interval=timedelta(seconds=5)
)

test = PostgresOperator(
    task_id='check_status',
    sql='''select * from table1 where status=4;''',
    postgres_conn_id='test',
    database='status',
    dag=status_four_dag,
)
I am stuck at the point where I need to check the task_name and call the relevant BashOperators.
Your support is appreciated. Thank you.
XComs are used for communicating messages between tasks. Send the JAR filename and the other arguments needed to form the command to XCom, and consume them in the subsequent task.
For example,
check_status >> handle_status
check_status - checks the status in the DB and writes the JAR filename and arguments to XCom
handle_status - pulls the JAR filename and arguments from XCom, forms the command, and executes it
Sample code:
from random import randint

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def check_status(**kwargs):
    if randint(1, 100) % 2 == 0:
        kwargs["ti"].xcom_push("jar_filename", "even.jar")
    else:
        kwargs["ti"].xcom_push("jar_filename", "odd.jar")

with DAG(dag_id='new_example', default_args=default_args) as dag:
    t0 = PythonOperator(
        task_id="check_status",
        provide_context=True,
        python_callable=check_status
    )

    t1 = BashOperator(
        task_id="handle_status",
        bash_command="""
        jar_filename={{ ti.xcom_pull(task_ids='check_status', key='jar_filename') }}
        echo "java -jar ${jar_filename}"
        """
    )

    t0 >> t1
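In this specific case, check_status can pull the rows from the Postgres table instead of using randint. A rough sketch, assuming the postgres_conn_id 'test', database 'status' and table 'table1' from the question, and a purely hypothetical convention that each task_name maps to a JAR named <task_name>.jar:
from airflow.hooks.postgres_hook import PostgresHook

def check_status(**kwargs):
    # connection id, database and table come from the question;
    # the task_name -> JAR mapping is a hypothetical example
    hook = PostgresHook(postgres_conn_id='test', schema='status')
    rows = hook.get_records("select task_name from table1 where status = 4")
    jar_filenames = " ".join("{}.jar".format(row[0]) for row in rows)
    kwargs["ti"].xcom_push("jar_filenames", jar_filenames)
The handle_status BashOperator can then xcom_pull the space-separated list and loop over it, running java -jar for each entry.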
I have a bunch of CSV files in S3 that I am trying to convert to ORC using an ETL job in AWS Glue. I have a crawler that crawls the directory containing the CSVs and generates a table. The table looks like this:
Column name | Data type | Partition key
---------------------------------------
field1      | string    |
field2      | string    |
field3      | string    |
partition_0 | string    | Partition (0)
partition_1 | string    | Partition (1)
Next, I try to convert the CSVs into ORC files. Here is a similar ETL script to what I'm using:
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'database', 'table_name', 'output_dir'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

partition_predicate = '(partition_0 = "val1") AND (partition_1 = "val2")'
datasource0 = glueContext.create_dynamic_frame.from_catalog(database=args['database'], table_name=args['table_name'], push_down_predicate=partition_predicate, transformation_ctx="datasource0")
final = glueContext.write_dynamic_frame.from_options(frame=datasource0, connection_type="s3", connection_options={"path": args['output_dir']}, format="orc")
job.commit()
I have another crawler that crawls my output directory containing the ORC files. When it generates the table, it looks like this:
Column name | Data type | Partition key
---------------------------------------
field1      | string    |
field2      | string    |
field3      | string    |
partition_0 | string    |
partition_1 | string    |
partition_0 | string    | Partition (0)
partition_1 | string    | Partition (1)
It looks like it considers the partitions to be fields in the ORC file (which they should not be). How can I modify my script so that the CSV to ORC conversion won't include the partition keys as schema columns?
If you need to preserve the partitioning, then add the partitionKeys option to the writer:
final = glueContext.write_dynamic_frame.from_options(frame=datasource0, connection_type="s3", connection_options={"path": args['output_dir'], "partitionKeys": ["partition_0", "partition_1"]}, format="orc")
Otherwise, just remove the partitioning columns first:
cleanDyf = datasource0.drop_fields(["partition_0", "partition_1"])
final = glueContext.write_dynamic_frame.from_options(frame=cleanDyf, connection_type="s3", connection_options={"path": args['output_dir']}, format="orc")
I have a Hive table which contains 3 columns: "id" (string), "booklist" (array of string), and "date" (string), with the following data:
---------------------------------------------------
id | booklist            | date
---------------------------------------------------
1  | ["Book1", "Book2"]  | 2017-11-27T01:00:00.000Z
2  | ["Book3", "Book4"]  | 2017-11-27T01:00:00.000Z
When trying to insert into Elasticsearch with this Pig script:
-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = elasticsearch.service.consul',
    'es.port = 9200',
    'es.write.operation = upsert',
    'es.mapping.id = id',
    'es.mapping.pig.tuple.use.field.names=true'
);

hivetable = LOAD 'default.reading' USING HCatLoader();

hivetable_flat = FOREACH hivetable GENERATE
    id AS id,
    booklist AS bookList,
    date AS date;
STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------
When running the above, I got an error saying:
ERROR 2999:Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [bookList]; Bailing out..
Can anyone shed any light on how to parse an ARRAY of STRING into ES and get the above to work?
Thank you!
I am trying to create a MERGE statement for Greenplum DB and I am getting a syntax error, so I am wondering if MERGE is even supported the way I am writing it.
I have tried two approaches.
Approach 1:
MERGE into public.table20 pritab
USING
(
select stgout.key1, stgout.key2, stgout.col1
from public.table20_stage stgout
where stgout.sequence_id < 1000
) as stgtab
ON (pritab.key1 = stgtab.key1
and pritab.key2 = stgtab.key2)
WHEN MATCHED THEN
UPDATE SET pritab.key1 = stgtab.key1
,pritab.key2 = stgtab.key2
,pritab.col1 = stgtab.col1
WHEN NOT MATCHED THEN
INSERT (key1, key2, col1)
values (stgtab.key1, stgtab.key2, stgtab.col1);
Approach 2:
UPDATE public.table20 pritab
SET pritab.key1 = stgtab.key1
,pritab.key2 = stgtab.key2
,pritab.col1 = stgtab.col1
from
(
select stgout.key1, stgout.key2, stgout.col1
from public.table20_stage stgout
where stgout.sequence_id < 1000
) as stgtab
ON (pritab.key1 = stgtab.key1
and pritab.key2 = stgtab.key2)
returning (stgtab.key1, stgtab.key2, stgtab.col1);
Is there any other way, or is something wrong with my syntax itself?
MERGE is not supported in Greenplum, but I wrote a blog post on how to achieve the results of a merge statement in Greenplum:
http://www.pivotalguru.com/?p=104
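For completeness, the usual workaround is an UPDATE of the matching rows followed by an INSERT of the non-matching rows, run in a single transaction. Below is a minimal Python sketch using psycopg2; the connection parameters are placeholders, the table and column names are the ones from the question, and this is not necessarily the exact approach described in the blog post:
import psycopg2

# placeholder connection parameters - replace with your Greenplum master details
conn = psycopg2.connect(host="gpmaster", dbname="mydb", user="gpadmin", password="secret")
with conn, conn.cursor() as cur:
    # 1) update target rows that already exist
    cur.execute("""
        UPDATE public.table20 pritab
        SET col1 = stgout.col1
        FROM public.table20_stage stgout
        WHERE stgout.sequence_id < 1000
          AND pritab.key1 = stgout.key1
          AND pritab.key2 = stgout.key2""")
    # 2) insert staged rows that have no match in the target
    cur.execute("""
        INSERT INTO public.table20 (key1, key2, col1)
        SELECT stgout.key1, stgout.key2, stgout.col1
        FROM public.table20_stage stgout
        WHERE stgout.sequence_id < 1000
          AND NOT EXISTS (
                SELECT 1 FROM public.table20 pritab
                WHERE pritab.key1 = stgout.key1
                  AND pritab.key2 = stgout.key2)""")
conn.close()
Both statements are plain Greenplum/PostgreSQL SQL, so the same pair can also be run directly from psql or any other client.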