UCanAccess keeps repeating column names as data - azure-databricks

I was using UCanAccess 5.0.1 on Databricks 9.1 LTS (Spark 3.1.2, Scala 2.12), and for whatever reason, when I use the following code to read in a single-record Access DB table, it keeps treating the column names as the record itself (I've tried adding more records and got the same results).
The Access db table looks like this (2 records):
ID | Field1 | Field2
2  | key1   | column2
3  | key2   | column2-1
The python code looks like this:
connectionProperties = {
    "driver": "net.ucanaccess.jdbc.UcanaccessDriver"
}
url = "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb"
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)
And the result looks like this:
df.printSchema()
df.count()
root
|-- Field1: string (nullable = true)
|-- Field2: string (nullable = true)
|-- ID: string (nullable = true)
Out[21]: 2
df.show()
+------+------+---+
|Field1|Field2| ID|
+------+------+---+
|Field1|Field2| ID|
|Field1|Field2| ID|
+------+------+---+
Any idea/suggestion?

If your data has column names in the first row, you can try header=True to use the first row as column headers.
Sample code –
df = spark.read.jdbc(url=url, table="Table1", header=True, properties=connectionProperties)
But if your data does not have column headers, you need to explicitly define the column names and then assign them as column headers.
Sample code –
columns = ["column_name_1", "column_name_2", "column_name_3"]
df = spark.read.jdbc(url=url, table="Table1", schema=columns, properties=connectionProperties)
You can also refer to this answer by Alberto Bonsanto.
Reference - https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#read-data-from-jdbc

Turns out that there was a bug in the JDBC dialect handling (see https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name): Spark's default dialect quotes column names with double quotes, which UCanAccess treats as string literals, so every row comes back as the literal column names.
I added the following code and now the UCanAccess driver works fine:
%scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.jdbc.JdbcDialects

// Quote identifiers with backticks instead of the default double quotes,
// which UCanAccess would otherwise treat as string literals.
private case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:ucanaccess")
  override def quoteIdentifier(colName: String): String = {
    colName.split('.').map(part => s"`$part`").mkString(".")
  }
}

JdbcDialects.registerDialect(HiveDialect)
Then display(df) would show
|Field1 |Field2    |ID |
|:------|:---------|:--|
|key1   |column2   |2  |
|key2   |column2-1 |3  |
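For completeness, here is the order of operations in a Databricks notebook as a minimal PySpark sketch (reusing the connection details from the question; it assumes the %scala cell above has already run in the same cluster session, since dialect registration is JVM-wide):
# Run the %scala dialect-registration cell first (once per cluster session),
# then the original PySpark read returns actual rows instead of repeated column names.
connectionProperties = {"driver": "net.ucanaccess.jdbc.UcanaccessDriver"}
url = "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb"
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)
df.show()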

Related

How can I exclude partitions when converting CSV to ORC using AWS Glue?

I have a bunch of CSV files in S3 that I am trying to convert to ORC using an ETL job in AWS Glue. I have a crawler that crawls the directory containing the CSVs and generates a table. The table looks like this:
Column name | Data type | Partition key
---------------------------------------
field1 | string |
field2 | string |
field3 | string |
partition_0 | string | Partition (0)
partition_1 | string | Partition (1)
Next, I try to convert the CSVs into ORC files. Here is a similar ETL script to what I'm using:
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'database', 'table_name', 'output_dir'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
partition_predicate = '(partition_0 = "val1") AND (partition_1 = "val2")'
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = args['database'], table_name = args['table_name'], push_down_predicate = partition_predicate, transformation_ctx = "datasource0")
final = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = { "path": args['output_dir'] }, format = "orc")
job.commit()
I have another crawler that crawls my output directory containing the ORC files. When it generates the table, it looks like this:
Column name | Data type | Partition key
---------------------------------------
field1 | string |
field2 | string |
field3 | string |
partition_0 | string |
partition_1 | string |
partition_0 | string | Partition (0)
partition_1 | string | Partition (1)
It looks like it considers the partitions to be fields in the ORC file (which they should not be). How can I modify my script so that the CSV to ORC conversion won't include the partition keys as schema columns?
If you need to preserve partitioning, then add the partitionKeys option to the writer:
final = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = { "path": args['output_dir'], "partitionKeys": ["partition_0", "partition_1"] }, format = "orc")
Otherwise, just remove the partitioning columns before writing:
cleanDyf = datasource0.drop_fields(["partition_0", "partition_1"])
final = glueContext.write_dynamic_frame.from_options(frame = cleanDyf, connection_type = "s3", connection_options = { "path": args['output_dir'] }, format = "orc")
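A quick sanity check (a small sketch reusing cleanDyf from the second option) is to inspect the schema before writing and confirm the partition columns are gone:
# toDF() converts the Glue DynamicFrame to a Spark DataFrame for inspection.
cleanDyf.toDF().printSchema()  # partition_0 / partition_1 should no longer be listed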

Transform a list of files (JSON) to a dataframe

Spark Version: '2.0.0.2.5.0.0-1245'
So, my original question changed a bit but it's still the same issue.
What I want to do is load a huge number of JSON files and transform them into a DataFrame, and probably also save them as CSV or Parquet files for further processing. Each JSON file represents one row in the final DataFrame.
import os
import glob

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

HDFS_MOUNT = # ...
DATA_SET_BASE = # ...

schema = StructType([
    StructField("documentId", StringType(), True),
    StructField("group", StringType(), True),
    StructField("text", StringType(), True)
])
# Get the file paths
file_paths = glob.glob(os.path.join(HDFS_MOUNT, DATA_SET_BASE, '**/*.json'))
file_paths = [f.replace(HDFS_MOUNT + '/', '') for f in file_paths]
print('Found {:d} files'.format(len(file_paths))) # 676 files
sql = SQLContext(sc)
df = sql.read.json(file_paths, schema=schema)
print('Loaded {:d} rows'.format(df.count())) # 9660 rows (what !?)
Besides the fact that there are 9660 rows instead of 676 (the number of available files), I also have the problem that the content seems to be None:
df.head(2)[0].asDict()
gives
{
'documentId': None,
'group': None,
'text': None,
}
Example Data
This is just fake data of course but it resembles the actual data.
Note: Some fields may be missing, e.g. text is not always present.
a.json
{
"documentId" : "001",
"group" : "A",
"category" : "indexed_document",
"linkIDs": ["adiojer", "asdi555", "1337"]
}
b.json
{
"documentId" : "002",
"group" : "B",
"category" : "indexed_document",
"linkIDs": ["linkId", "1000"],
"text": "This is the text of this document"
}
Assuming that all your files have the same structure and are in the same directory:
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json')
There might be a problem if any of the columns has null values for all rows: Spark will then not be able to infer the schema, so you have the option of telling Spark which schema to use:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, LongType
sc = SparkContext(appName="My app")
sql_cntx = SQLContext(sc)
schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", LongType(), True)
])
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json', schema=schema)
Update:
In case a file contains multi-line (pretty-printed) JSON, you can try this code:
sc = SparkContext(appName='Test')
sql_context = SQLContext(sc)
rdd = sc.wholeTextFiles('/tmp/test/*.json').values()
df = sql_context.read.json(rdd, schema=schema)
df.show()
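Tying that back to the variables in the question, a sketch (it assumes the sc, sql, schema and DATA_SET_BASE names defined above, follows the question's convention of stripping the HDFS_MOUNT prefix, and the glob may need adjusting to your directory layout):
import os

# wholeTextFiles yields one (path, content) pair per file, so each file becomes one row.
rdd = sc.wholeTextFiles(os.path.join(DATA_SET_BASE, '*/*.json')).values()
df = sql.read.json(rdd, schema=schema)
print('Loaded {:d} rows'.format(df.count()))  # should now match the number of files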

Setting textinputformat.record.delimiter in sparksql

In Spark 2.0.1 and Hadoop 2.6.0, I have many files delimited with '!#!\r' instead of the usual newline \n, for example:
=========================================
2001810086 rongq 2001 810!#!
2001810087 hauaa 2001 810!#!
2001820081 hello 2001 820!#!
2001820082 jaccy 2001 820!#!
2002810081 cindy 2002 810!#!
=========================================
I tried to extract the data according to Setting textinputformat.record.delimiter in spark, using set textinputformat.record.delimiter='!#!\r'; or set textinputformat.record.delimiter='!#!\n'; but I still cannot extract the data.
In spark-sql, I do this:
create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';
load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;
The result is 5, but when I try set textinputformat.record.delimiter='!#!\r'; and then select count(*) from ceshi; the result is 1, so the delimiter does not work.
I also checked the source of Hadoop 2.6.0, specifically the RecordReader in TextInputFormat.java, and noticed that the default textinputformat.record.delimiter is null; LineReader.java then uses the readDefaultLine method to read a line terminated by one of CR, LF, or CRLF (CR = '\r', LF = '\n').
You should use the sparkContext's hadoopConfiguration API to set textinputformat.record.delimiter:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\r")
Then, if you read the text file using sparkContext as
sc.textFile("the input file path")
you should be fine.
Updated:
I have noticed that a text file with the \r delimiter is changed to the \n delimiter when saved.
So the following should work for you as it did for me:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")
val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF
A case class called ceshi is needed:
case class ceshi(id: Int, name: String, year: String, major: String)
which should give a dataframe like:
+----------+-----+-----+-----+
|id        |name |year |major|
+----------+-----+-----+-----+
|2001810086|rongq| 2001|810  |
|2001810087|hauaa| 2001|810  |
|2001820081|hello| 2001|820  |
|2001820082|jaccy| 2001|820  |
|2002810081|cindy| 2002|810  |
+----------+-----+-----+-----+
Now you can call the count function as
import org.apache.spark.sql.functions._
df.select(count("*")).show(false)
which would give output as
+--------+
|count(1)|
+--------+
|5       |
+--------+

Getting null output when schema specified to read data in BigQuery select operation

I am facing an issue when selecting data from a BigQuery table with a specified schema.
val tableData: RDD[String] = sqlContext.sparkContext.newAPIHadoopRDD(
  hadoopConf,
  classOf[GsonBigQueryInputFormat],
  classOf[LongWritable],
  classOf[JsonObject]).map(_._2.toString)
val jsonSchema: StructType = (new StructType).add("f1", IntegerType, true).add("f2", FloatType, true)
  .add("f3", StringType, true).add("f4", BooleanType, true).add("f5", DateType, true).add("f6", TimestampType, true)
val df = sqlContext.read.schema(jsonSchema).json(tableData)
When I specify the schema like above, I get null results in the DataFrame, but when no schema is specified I get proper results.
df.printSchema()
root
|-- f1: integer (nullable = true)
|-- f2: float (nullable = true)
|-- f3: string (nullable = true)
|-- f4: boolean (nullable = true)
|-- f5: date (nullable = true)
|-- f6: timestamp (nullable = true)
df.show
+----+----+----+----+----+----+
| f1| f2| f3| f4| f5| f6|
+----+----+----+----+----+----+
|null|null|null|null|null|null|
On analysis, I found that BigQuery exports table data in the following format, e.g.:
{"f1":"3","f2":2.7,"f3":"Anna","f4":true,"f5":"2014-10-15","f6":"2014-10-15 03:15:58 UTC"}
When I read from tableData using the JSON format, it cannot cast the data to the specified schema and returns a null result.
How can I get a proper result with the specified schema? Please suggest if you have any idea/solution.
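For illustration: one likely cause is that numeric fields such as f1 arrive as quoted strings ("3"), so declaring them as IntegerType makes the whole record fail to parse and every column comes back null. A hedged workaround sketch, shown in PySpark for brevity (the field names follow the question; the path and session object are placeholders): read the quoted fields as strings first, then cast them.
from pyspark.sql.types import StructType, StructField, StringType, FloatType, BooleanType
from pyspark.sql.functions import col

# Declare the quoted fields as strings so the exported JSON parses cleanly.
rawSchema = StructType([
    StructField("f1", StringType(), True),   # exported as "3" (quoted), so read as string
    StructField("f2", FloatType(), True),
    StructField("f3", StringType(), True),
    StructField("f4", BooleanType(), True),
    StructField("f5", StringType(), True),   # cast to date afterwards
    StructField("f6", StringType(), True)    # note the trailing " UTC"
])
raw = sqlContext.read.schema(rawSchema).json("path/to/bigquery/export/*.json")

# Then cast to the intended types.
df = raw.withColumn("f1", col("f1").cast("int")) \
        .withColumn("f5", col("f5").cast("date"))
# f6 likely needs an explicit parse (e.g. after stripping " UTC") rather than a plain cast.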

cloudera impala jdbc query doesn't see array<string> Hive column

I have a table in Hive that has the following structure:
> describe volatility2;
Query: describe volatility2
+------------------+---------------+---------+
| name | type | comment |
+------------------+---------------+---------+
| version | int | |
| unmappedmkfindex | int | |
| mfvol | array<string> | |
+------------------+---------------+---------+
It was created by Spark HiveContext code using the DataFrame API, like this:
val volDF = hc.createDataFrame(volRDD)
volDF.saveAsTable(volName)
which carried over the RDD structure that was defined in the schema:
def schemaVolatility: StructType = StructType(
  StructField("Version", IntegerType, false) ::
  StructField("UnMappedMKFIndex", IntegerType, false) ::
  StructField("MFVol", DataTypes.createArrayType(StringType), true) :: Nil)
However, when I try to select from this table using the latest Impala JDBC driver, the last column is not visible to it. My query is very simple, just printing the data to the console, exactly like the example code provided with the driver download:
String sqlStatement = "select * from default.volatility2";
Class.forName(jdbcDriverName);
con = DriverManager.getConnection(connectionUrl);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sqlStatement);
System.out.println("\n== Begin Query Results ======================");
ResultSetMetaData metadata = rs.getMetaData();
for (int i = 1; i <= metadata.getColumnCount(); i++) {
    System.out.println(rs.getMetaData().getColumnName(i) + ":" + rs.getMetaData().getColumnTypeName(i));
}
System.out.println("== End Query Results =======================\n\n");
The console output is this:
== Begin Query Results ======================
version:version
unmappedmkfindex:unmappedmkfindex
== End Query Results =======================
Is it a driver bug, or am I missing something?
I found the answer to my own question. Posting it here so it may help others and save them time searching. Apparently Impala recently introduced so-called "complex types" support in its SQL, which includes array among others. The link to the documentation is this:
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_using
According to this, what I had to do was change the query to look like this:
select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol
Joining the table with its array column (volatility2.mfvol) expands the array into one row per element, with the element value exposed through the ITEM pseudocolumn. With that query I got the expected results back.
