Spark Schema Not Being Used for Parquet Write

I have been dealing with an issue related to writing out a Parquet file in Spark, when the input file is Parquet as well and contains some invalid column names.
Unfortunately, the column naming convention is out of my hands, which is why I am attempting to replace the non-alphanumeric characters before writing the file back out as Parquet.
Examples of column names (these pain me):
"Current qty"
"31-60 Total"
"<=30 qty"
I have tried a handful of different methods to alter the dataframe columns to get them into a valid format:
df.withColumnRenamed("Total qty","total_qty")
df.select(col("30-60 Total").alias("_30_60_total"))
All of which appear to work based on the printSchema results:
# Schema before change
>>> df.printSchema()
root
|-- Current qty: decimal(38,5) (nullable = true)
|-- <=30 qty: decimal(38,5) (nullable = true)
|-- 31-60 Total: decimal(38,5) (nullable = true)
>>> new_df = df.withColumnRenamed('Current qty','current_qty').withColumnRenamed('<=30 qty','__30_qty').withColumnRenamed('31-60 Total','_31_60_total')
# Schema after change
>>> new_df.printSchema()
root
|-- current_qty: decimal(38,5) (nullable = true)
|-- __30_qty: decimal(38,5) (nullable = true)
|-- _31_60_total: decimal(38,5) (nullable = true)
The issue starts when attempting to write this back out with the new column names. For some reason it appears to be referencing the original schema's columns, which are invalid and therefore throw an exception:
>>> new_df.write.mode('overwrite').parquet(write_path)
20/12/17 13:56:45 ERROR FileFormatWriter: Aborting job 37427d91-be5c-43c5-b3ed-8d57217a0733.
org.apache.spark.sql.AnalysisException: Attribute name "Current qty" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:449)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:449)
at scala.collection.immutable.List.foreach(List.scala:392)
I have cross-checked multiple times that this is not a case of using the wrong dataframe for the write, and I have also tried it in Scala to no avail.
Interestingly this is not an issue when dealing with CSV input files, which leads me to believe it is specific to Spark-Parquet internal workings.
Would greatly appreciate any help on this if anyone has seen or encountered a similar error in the past.
Thanks!

You can try simply doing the aliasing; it works for me every time:
df.select([df['Current qty'].alias('current_qty'), df['<=30 qty'].alias('__30_qty'), df['31-60 Total'].alias('_31_60_total')]).write.mode('overwrite').parquet(write_path)
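If there are many columns, a more general approach (a minimal PySpark sketch; the regex and the sanitize helper are my own illustration, not part of the original answer) is to clean every column name in one pass before the write:
import re

# Hypothetical helper: collapse every run of non-alphanumeric characters
# (which covers the " ,;{}()\n\t=" set Parquet rejects) into an underscore.
def sanitize(name):
    return re.sub(r'[^0-9a-zA-Z]+', '_', name)

clean_df = df.toDF(*[sanitize(c) for c in df.columns])
clean_df.write.mode('overwrite').parquet(write_path)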

Related

Spring Data JDBC MappedCollection casing issues with Postgres + Oracle

This is my current repository structure. I'm looking for a solution that works with both Postgres and OracleDB and preferably does not involve changing my DB schema to accommodate the ORM. Whether Postgres or Oracle is used is defined in the spring.datasource.url in the application.properties file.
data class NewsCover(
    @Id val tenantId: TenantId,
    val openOnStart: Boolean,
    val cycleDelay: Int,
    @MappedCollection(idColumn = "tenant_id", keyColumn = "tenant_id")
    val sections: Set<NewsCoverSection>,
)
data class NewsCoverSection(
    @Id val id: NewsCoverSectionId,
    val title: String,
    val pinnedOnly: Boolean,
    val position: Int,
    val tenantId: TenantId,
    ... some other fields ...
)
interface NewsCoverRepo : CrudRepository<NewsCover, TenantId> { ... }
This works just fine with PostgreSQL, but produces errors when used with Oracle:
SELECT "NEWS_COVER_SECTION"."ID" AS "ID", "NEWS_COVER_SECTION"."TITLE" AS "TITLE", "NEWS_COVER_SECTION"."POSITION" AS "POSITION", "NEWS_COVER_SECTION"."TENANT_ID" AS "TENANT_ID", "NEWS_COVER_SECTION"."PINNED_ONLY" AS "PINNED_ONLY"
FROM "NEWS_COVER_SECTION"
WHERE "NEWS_COVER_SECTION"."tenant_id" = ?
See the quoted idColumn/keyColumn names in the @MappedCollection. They are lowercase. That is fine for Postgres, but does not work with Oracle. Changing tenant_id to TENANT_ID fixes the problem for Oracle, but breaks Postgres.
What I tried:
A NamingStrategy override for Oracle, but I can't seem to override those quoted identifiers.
Conditional column names in @MappedCollection, but @MappedCollection only accepts compile-time constants and does not support SpEL, so I can't differentiate based on the spring.datasource.url property.
Any ideas how I can get it to query for "news_cover_section"."tenant_id" when the DB is Postgres and "NEWS_COVER_SECTION"."TENANT_ID" when the DB is Oracle?
As you found out, you can disable the behaviour of quoting all names by setting the forceQuote property of the JdbcMappingContext to false.
Alternatively, you can create the schema in a consistent way on both databases by quoting the names in your schema creation script.
The first option lets you avoid fiddling with the database schema, but it makes the application depend on avoiding database keywords such as ORDER or USER.
The second option is arguably the conceptually cleaner one, because it actually uses the same schema (as far as names are concerned) for both databases, which in itself is certainly valuable. But it comes at the cost of quoting names, because Postgres doesn't adhere to the behaviour prescribed by the SQL standard of treating unquoted names as uppercase.
Note: There is now an issue for supporting SpEL expressions for table and column names.

How to specify the file name when saving the model using h2o package from R

I am trying to save a model built using the function h2o.saveModel(). Based on the function description on page 159 of the H2O user manual for R, the arguments only cover the path. I looked at other similar functions such as h2o.saveModelDetails(), but it takes the same arguments. Please advise if there is any other way to specify the name of the model.
The name of the model file will be determined by the ID of the model. So if you specify model_id when training your model, then you can customize it. Right now there is no way to change the ID of the model after it's been trained.
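For illustration, here is a minimal Python sketch of that approach (the estimator, file paths, and column name are placeholders, not from the question): the model_id chosen at training time becomes the saved file name.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
frame = h2o.import_file("train.csv")            # hypothetical training data
# The model_id set here is what h2o.save_model() uses as the file name.
model = H2OGradientBoostingEstimator(model_id="my_custom_name")
model.train(y="target", training_frame=frame)   # "target" is a placeholder column
saved_path = h2o.save_model(model=model, path="/tmp/models", force=True)
print(saved_path)                                # ends in .../my_custom_name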
The file can be renamed once saved:
h2o.saveModel(object = fit, path = path.value, force = TRUE) # force overwriting
name <- file.path(path.value, fileName) # destination file name at the same folder location
file.rename(file.path(path.value, fit@model_id), name)
I think a better workaround would be to generate a unique folder each time you save the model. When loading the model, there will always be only one model file under that path.
import os
import h2o
saved_model = os.path.join('UNIQUE_MODEL_PATH', os.listdir('UNIQUE_MODEL_PATH')[0])
loaded_model = h2o.load_model(saved_model)
In Python:
import os

model_path = h2o.save_model(model=model, path="mymodel1", force=True)
path = os.path.dirname(os.path.abspath(model_path))
os.rename(model_path, os.path.join(path, 'h2o_new_name'))
Here is a possible way to do it:
output_dir <- getwd()
DRF_MO <- h2o.saveModel(object = aml, path = output_dir, force = TRUE)
DRF_MO <- file.path(output_dir, aml@algorithm)
file.rename(file.path(output_dir, aml@model_id), DRF_MO)

Specify Oracle schema with Slick 3.2

I have an Oracle database that contains multiple users/schemas, and I would like to generate Slick schemas automatically for a specific user. This is what I've tried so far:
import scala.concurrent.ExecutionContext.Implicits.global
val profileInstance: JdbcProfile =
  Class.forName("slick.jdbc.OracleProfile$")
    .getField("MODULE$")
    .get(null).asInstanceOf[JdbcProfile]
val db = profileInstance.api.Database
  .forURL("jdbc:oracle:thin:@//myhost:myport/servicename", "user", "pass")
val modelAction = OracleProfile.createModel(Some(OracleProfile.defaultTables))
val model = Await.result(db.run(modelAction), Duration.Inf)
model.tables.foreach(println)
This doesn't print anything. I guess I have to provide the current schema to use, but I don't know how to do this.
On the other hand, I am able to list all the schemas of the database, using the following code :
val resultSet = db.createSession().metaData.getSchemas.getStatement.getResultSet
while (resultSet.next()) {
  println(resultSet.getString(1))
}
How can I specify which schema I want to use with Slick ?
I've found out how to do it. Instead of using OracleProfile.defaultTables, I manually defined the tables and views I needed like this:
val modelAction = OracleProfile.createModel(
  Some(MTable.getTables(None, Some("MYSCHEMA"), None, Some(Seq("TABLE", "VIEW"))))
)

how to join header row to detail rows in multiple files with apache pig

I have several CSV files in a HDFS folder which I load to a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command
When I dump it, the structure of the source relation is as follows (note that the data is text-qualified, but I will deal with that using the REPLACE function):
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header which provides some information about the data set that follows it such as the provider of the data and the date range it covers.
So now, how can I transform the above structure and create a new relation like the following?
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Here each header tuple is followed by a bag of record tuples belonging to that header.
Unfortunately there is no common key field between the header and the detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop and this is one of the first concept projects that I am engaging in.
I hope my question is clear and I look forward to some guidance here.
This should get you started.
Code:
-- '-tagFile' prepends the source file name as field $0
Source = LOAD '$data' USING PigStorage(',', '-tagFile');
-- SPLIT is a statement, so it is not assigned to an alias
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE;
B = GROUP FileData BY $0;
C = GROUP FileHeaders BY $0;
D = JOIN B BY group, C BY group;
...

SPQuery sorting issue

I have an SPListItem.Folder in SharePoint that contains a property named "Asset ID".
I have this data in my list
Asset ID | Name | Asset Type
1 | GamesFolder | Games
2 | AppsFolder | softwares
3 | MusicFolder | music
In my code I did this:
SPList objList = web.Lists["MyList"];
SPQuery query = new SPQuery();
query.Query = "<OrderBy><FieldRef Name='Asset ID' Ascending='FALSE'/></OrderBy>";
query.ViewAttributes = "Scope=\"Recursive\"";
query.RowLimit = 1;
SPListItemCollection items = objList.GetItems(query);
return objList.Items[0].Folder.Properties["Asset ID"].ToString();
I use .Folder because every entry in the list is a DocumentSet.
The returned value is always "1". I don't know what's wrong or why my sorting doesn't work at all.
Please help me resolve this issue. Thanks.
Hi Carls, I think the issue is with the field name. You included a space in the field name.
If you want to avoid having to seek out what the internal name of a particular field is, when you first name your column, do not include any spaces or special characters. Once the field (column) has been created, go back and rename the field to include the spaces or special characters as desired. SharePoint will still retain the original field name without spaces and you can use that directly in your query without issue.
Or use its internal name:
query.Query = "<OrderBy><FieldRef Name='Asset_x0020_ID' Ascending='FALSE'/></OrderBy>";
A little late but if you are having issues, you may be able to use all or a part of the following gist:
https://gist.github.com/trgraglia/4672176
And as the accepted answer states, the field name is the issue. You need to use the static name of the field. The static name will always stay the same, even if the column is renamed. So you should get the column from the column collection by display name and then get the static name from the properties.
