AWS Glue issue with double quote and commas - hadoop

I have this CSV file:
reference,address
V7T452F4H9,"12410 W 62TH ST, AA D"
The following options are being used in the table definition:
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'quoteChar'='\"',
  'separatorChar'=',')
but it still won't recognize the double quotes in the data, and the comma inside the quoted field is messing up the data. When I run the Athena query, the result looks like this:
reference address
V7T452F4H9 "12410 W 62TH ST
How do I fix this issue?

I did this to solve it:
1 - Create a Crawler that doesn't overwrite the target table properties. I used boto3 for this, but it can also be created in the AWS console. Do this (change the xxx- values):
import boto3

client = boto3.client('glue')

response = client.create_crawler(
    Name='xxx-Crawler-Name',
    Role='xxx-Put-here-your-rol',
    DatabaseName='xxx-databaseName',
    Description='xxx-Crawler description if u need it',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://xxx-Path-to-s3/',
                'Exclusions': []
            },
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG'
    },
    Configuration='''{
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}
        }
    }'''
)

# run the crawler
response = client.start_crawler(
    Name='xxx-Crawler-Name'
)
2 - Edit the serialization lib. I did this in the AWS console as described in the Athena docs (https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-csv-quotes): just change the serialization library to org.apache.hadoop.hive.serde2.OpenCSVSerde and set its SerDe parameters (a boto3 sketch for this step follows the list).
3 - Run the crawler again, as you always do.
4 - That's it; your second run should not change any data in the table, it's just to check that it works ¯\_(ツ)_/¯.
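If you prefer to do step 2 with boto3 instead of the console, a minimal sketch could look like this (the table name below is a placeholder; it assumes the crawler from step 1 has already created the table):

import boto3

glue = boto3.client('glue')

# fetch the table definition that the crawler created
table = glue.get_table(DatabaseName='xxx-databaseName', Name='xxx-table-name')['Table']

# update_table only accepts a subset of the keys that get_table returns,
# so copy over just the allowed ones
table_input = {k: v for k, v in table.items()
               if k in ('Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
                        'PartitionKeys', 'TableType', 'Parameters')}

# switch the SerDe to OpenCSVSerde and set the quote/separator/escape characters
table_input['StorageDescriptor']['SerdeInfo'] = {
    'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde',
    'Parameters': {'separatorChar': ',', 'quoteChar': '"', 'escapeChar': '\\'},
}

glue.update_table(DatabaseName='xxx-databaseName', TableInput=table_input)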

Looks like you also need to add escapeChar. The AWS Athena docs show this example:
CREATE EXTERNAL TABLE myopencsvtable (
  col1 string,
  col2 string,
  col3 string,
  col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '\"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://location/of/csv/';
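Once the table uses OpenCSVSerde with those properties, the quoted comma from the sample row in the question should stay inside the address field. A quick sanity check in Athena (assuming a table with the reference and address columns from the question; the table name here is just a placeholder):

SELECT reference, address
FROM my_csv_table;
-- expected: V7T452F4H9 | 12410 W 62TH ST, AA D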

Related

How to use Spark DataFrame column value to update the elasticsearch

Problem description: I am trying to update documents in Elasticsearch. I have data in a dataframe with esIndex, id and body attributes. I need to use the value in the esIndex attribute to update the document in the right index. The following code does not work. What is a possible fix?
Command 1
val esUrl = "test.elasticsearch.com"
Command 2
import spark.implicits._
val df = Seq(
  ("test_datbricks001", "1221866122238136328", "TEST DATA 001"),
  ("test_datbricks001", "1221866122238136329", "TEST DATA 002"),
  ("test_datbrick", "1221866122238136329", "TEST DATA 002")
).toDF("esIndex", "id", "body")
Command 3
display(df)
Command 4
import org.apache.spark.sql.functions.col
val retval = df.toDF.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes.wan.only", "true")
  .option("es.mapping.id", "id")
  .option("es.mapping.exclude", "id, esIndex")
  .option("es.port", "80")
  .option("es.nodes", esUrl)
  .option("es.write.operation", "update")
  .option("es.index.auto.create", "no")
  .mode("Append")
  .save("{esIndex}") // <<== this is not taking the value from the "esIndex" column of df
I am getting the following error: EsHadoopIllegalArgumentException: Target index [{esIndex}] does not exist and auto-creation is disabled [setting 'es.index.auto.create' is 'false']
Question: How do I get the code in Command 4 to use the value in the esIndex column of df?
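No accepted fix is included in this thread; one possible workaround (a sketch that reuses the connector options from Command 4, not a confirmed solution) is to write each target index separately, so every save() call receives a concrete index name instead of a pattern:

import org.apache.spark.sql.functions.col

// collect the distinct index names, then issue one write per index
val indexes = df.select("esIndex").distinct.collect.map(_.getString(0))

indexes.foreach { idx =>
  df.filter(col("esIndex") === idx)
    .drop("esIndex") // the index name is no longer needed inside the document body
    .write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes.wan.only", "true")
    .option("es.mapping.id", "id")
    .option("es.port", "80")
    .option("es.nodes", esUrl)
    .option("es.write.operation", "update")
    .option("es.index.auto.create", "no")
    .mode("append")
    .save(idx) // concrete index name per group
}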

SQLRPGLE & JSON_OBJECT CTE Statements -101 Error

This program compiles correctly; we are on V7R3. But when running it, we get an SQLCOD of -101 and an SQLSTATE of 54011, which states: Too many columns were specified for a table, view, or table function. This is a very small JSON that is being created, so I do not think that is the issue.
The RPGLE code:
dcl-s OutFile sqltype(dbclob_file);
xfil_tofile = '/ServiceID-REFCODJ.json';
Clear OutFile;
OutFile_Name = %TrimR(XFil_ToFile);
OutFile_NL = %Len(%TrimR(OutFile_Name));
OutFile_FO = IFSFileCreate;
OutFile_FO = IFSFileOverWrite;
exec sql
With elm (erpRef) as (select json_object
('ServiceID' VALUE trim(s.ServiceID),
'ERPReferenceID' VALUE trim(i.RefCod) )
FROM PADIMH I
INNER JOIN PADGUIDS G ON G.REFCOD = I.REFCOD
INNER JOIN PADSERV S ON S.GUID = G.GUID
WHERE G.XMLTYPE = 'Service')
, arr (arrDta) as (values json_array (
select erpRef from elm format json))
, erpReferences (refs) as ( select json_object ('erpReferences' :
arrDta Format json) from arr)
, headerData (hdrData) as (select json_object(
'InstanceName' : trim(Cntry) )
from padxmlhdr
where cntry = 'US')
VALUES (
select json_object('header' : hdrData format json,
'erpReferenceData' value refs format json)
from headerData, erpReferences )
INTO :OutFile;
Any help with this would be very much appreciated; this is our first attempt at creating JSON for sending, and we have not experienced this issue before.
Thanks,
John
I am sorry for the delay in getting back to this issue. It has been corrected; the issue was with the VALUES statement.
This is the correct code needed to make it work:
Select json_object('header' : hdrData format json,
                   'erpReferenceData' value refs format json)
INTO :OutFile
From headerData, erpReferences
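For completeness, the whole corrected statement presumably looks like this once the CTEs from the original post are combined with the fixed final SELECT ... INTO (just a sketch of how the pieces fit together):

exec sql
  With elm (erpRef) as (select json_object(
         'ServiceID' VALUE trim(s.ServiceID),
         'ERPReferenceID' VALUE trim(i.RefCod))
       FROM PADIMH I
       INNER JOIN PADGUIDS G ON G.REFCOD = I.REFCOD
       INNER JOIN PADSERV S ON S.GUID = G.GUID
       WHERE G.XMLTYPE = 'Service')
  , arr (arrDta) as (values json_array(
       select erpRef from elm format json))
  , erpReferences (refs) as (select json_object(
       'erpReferences' : arrDta Format json) from arr)
  , headerData (hdrData) as (select json_object(
       'InstanceName' : trim(Cntry))
       from padxmlhdr
       where cntry = 'US')
  select json_object('header' : hdrData format json,
                     'erpReferenceData' value refs format json)
  INTO :OutFile
  from headerData, erpReferences;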

How do you decode GraphQL::Schema::UniqueWithinType in Postgres?

Say someone evil stored an encoded id in your database and you needed to use it. Example:
Ruby:
GraphQL::Schema::UniqueWithinType.default_id_separator = '|'
relay_id = GraphQL::Schema::UniqueWithinType.encode('User', '123')
# "VXNlcnwxMjM="
How do I get 123 out of VXNlcnwxMjM= in Postgres?
Ruby:
GraphQL::Schema::UniqueWithinType.default_id_separator = '|'
relay_id = GraphQL::Schema::UniqueWithinType.encode('User', '123')
# "VXNlcnwxMjM="
Base64.decode64(relay_id)
# "User|123"
To get "123" out of "VXNlcnwxMjM=" in Postgres you can do this horror show:
select
substring(
(decode('VXNlcnwxMjM=', 'base64')::text)
from (char_length('User|') + 1)
for (char_length(decode('VXNlcnwxMjM=', 'base64')::text) - char_length('User|'))
)::int
Edit: Playing around with this... on Postgres 9.6.5 the above works, but our staging server is 10.5 and I had to do this instead (which also works on 9.6.5):
select
substring(
convert_from(decode('VXNlcnwxMjM=', 'base64'), 'UTF-8')
from (char_length('User|') + 1)
for (char_length(convert_from(decode('VXNlcnwxMjM=', 'base64'), 'UTF-8')) - char_length('User|'))
)::int
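A slightly simpler variant (my own alternative, not part of the original answer) is to let split_part() take everything after the 'User|' prefix instead of computing substring offsets:

select split_part(convert_from(decode('VXNlcnwxMjM=', 'base64'), 'UTF-8'), '|', 2)::int;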

Setting textinputformat.record.delimiter in sparksql

In Spark 2.0.1, Hadoop 2.6.0, I have many files delimited with '!#!\r' rather than the usual newline \n, for example:
=========================================
2001810086 rongq 2001 810!#!
2001810087 hauaa 2001 810!#!
2001820081 hello 2001 820!#!
2001820082 jaccy 2001 820!#!
2002810081 cindy 2002 810!#!
=========================================
I tried to extract the data as described in Setting textinputformat.record.delimiter in spark:
set textinputformat.record.delimiter='!#!\r'; or set textinputformat.record.delimiter='!#!\n'; but I still cannot extract the data.
In spark-sql, I do this:
create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';
load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;
The result is 5, but when I try set textinputformat.record.delimiter='!#!\r'; and then select count(*) from ceshi;, the result is 1; the delimiter does not take effect.
I also checked the Hadoop 2.6.0 source: in the RecordReader of TextInputFormat.java, the default textinputformat.record.delimiter is null, so LineReader.java uses readDefaultLine to read a line terminated by one of CR, LF, or CRLF (CR='\r', LF='\n').
You should use sparkContext's hadoopConfiguration API to set textinputformat.record.delimiter:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\r")
Then if you read the text file using sparkContext as
sc.textFile("the input file path")
you should be fine.
Updated
I have noticed that when a text file with a \r delimiter is saved, the delimiter gets changed to \n.
So the following should work for you, as it did for me:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")
val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF
A case class called ceshi is needed:
case class ceshi(id: Int, name: String, year: String, major :String)
which should give dataframe as
+----------+-----+-----+-----+
|id |name |year |major|
+----------+-----+-----+-----+
|2001810086|rongq| 2001|810 |
|2001810087|hauaa| 2001|810 |
|2001820081|hello| 2001|820 |
|2001820082|jaccy| 2001|820 |
|2002810081|cindy| 2002|810 |
+----------+-----+-----+-----+
Now you can hit the count function as
import org.apache.spark.sql.functions._
df.select(count("*")).show(false)
which would give output as
+--------+
|count(1)|
+--------+
|5 |
+--------+

sphinx and multilanguage search || search by attribute

I'm trying to get results from Sphinx filtered by a string attribute (attr_string). Here is the Sphinx configuration:
source db
{
    type = mysql
    sql_query = \
        SELECT id,language,text,page_url \
        FROM content
    sql_attr_string = language
    sql_attr_string = page_url
}
index content
{
    source = db
    charset_type = utf-8
    min_word_len = 3
}
The results that I'm getting are like this:
[matches] => Array
(
[106] => Array
(
[weight] => 4
[attrs] => Array
(
[page_url] => en/example.gtml
[language] => en
)
)
What I want to do is filter all results by language = 'en'.
$sphinx->SetFilter() works with integers, whereas in this case I only need the string "en".
Any help is appreciated!
I found a solution, if anybody needs it.
Configure the "source" to use crc32, e.g.:
source db
{
    type = mysql
    sql_query = \
        SELECT id,crc32(language) as language,text,page_url \
        FROM content
    sql_attr_uint = language
    sql_attr_string = page_url
}
And in the client, modify the SetFilter call to use crc32(), e.g.:
$s->SetFilter('language',array(crc32('en')));
$result = $s->query('bird is a word','content');
I hope it helps somebody...
more information: http://sphinxsearch.com/docs/current.html#attributes
