Controlling string unescaping in Clickhouse?

I am trying to load JSON data from Kafka into Clickhouse where the JSON contains some escape sequences. E.g:
:) SELECT JSONExtractRaw('{"message": "Hello \"to\" you!"}', 'message')
SELECT JSONExtractRaw('{"message": "Hello "to" you!"}', 'message')
┌─JSONExtractRaw('{"message": "Hello "to" you!"}', 'message')─┐
│ "Hello "                                                    │
└──────────────────────────────────────────────────────────────┘
It appears that prior to calling JSONExtractRaw, the input strings are unescaped, which produces invalid JSON. The unescaping seems to be reproducible with this minimal example:
:) SELECT 'Hello \"there\"'
SELECT 'Hello "there"'
┌─'Hello "there"'─┐
│ Hello "there"   │
└─────────────────┘
I am wondering if it's possible to retain the original (escaped) representation of the input.
Thank you!

ClickHouse unescapes string literals once when parsing the query, so to keep a backslash in the stored value you have to double it:
SELECT '{"message": "Hello \\"to\\" you!"}'
┌─'{"message": "Hello \\"to\\" you!"}'─┐
│ {"message": "Hello \"to\" you!"}     │
└──────────────────────────────────────┘
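With the backslashes doubled, the original extraction works as well; a quick check with the same literal (expected output sketched to match the table result further below):
SELECT JSONExtractRaw('{"message": "Hello \\"to\\" you!"}', 'message')
-- "Hello \"to\" you!"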

Let's test your case without the CLI's literal escaping, to emulate a table filled by data coming from outside. First create the table:
create database test;
create table test.json_001 (json String) Engine=Memory;
Then feed it the raw JSON:
echo '{"message": "Hello \"to\" you!"}' | clickhouse-client \
    --password 12345 \
    --query="INSERT INTO test.json_001 SELECT json FROM input('json String') FORMAT CSV"
Get required data:
clickhouse-client \
    --password 12345 \
    --query="SELECT JSONExtractRaw(json, 'message'), JSONExtractString(json, 'message') FROM test.json_001"
# result:
# "Hello \\"to\\" you!" Hello "to" you!
SELECT
    JSONExtractRaw(json, 'message'),
    JSONExtractString(json, 'message')
FROM test.json_001
/* result
┌─JSONExtractRaw(json, 'message')─┬─JSONExtractString(json, 'message')─┐
│ "Hello \"to\" you!"             │ Hello "to" you!                    │
└─────────────────────────────────┴────────────────────────────────────┘
*/
It works as it should. I don't see any problems.

Related

Update `index` property in a collection of given IDs

Using a POST on my front-end (React.js, with react-beautiful-dnd), I'm trying to update the position of the items a user has stored.
My strategy is to send over an updated index of IDs that are ordered in the specified position.
My Data (example, simplified)
┌────┬────────┬───────┐
│ id │ title  │ index │ <── my "position"
├────┼────────┼───────┤
│ 1  │ Apple  │ 4     │
│ 2  │ Banana │ 1     │
│ 3  │ Mango  │ 3     │
│ 4  │ Kiwi   │ 2     │
│ 5  │ Orange │ 0     │
└────┴────────┴───────┘
New positions array (anatomy)
$value: sorted ids ------> [3, 2, 5, 1, 4]
                            :  :  :  :  :
$key: new index will be…    0, 1, 2, 3, 4
The Code
The controller ingests an array and derives the new positions from its keys. It then loops through the collection of given IDs and reassigns each item its updated index.
public function reorder(Request $request) {
    $newIndex = $request->newIndex;
    $items = Item::whereIn('id', $newIndex)->get();

    foreach ($newIndex as $key => $value) {
        // TODO: check if $item->index needs to be rewritten at all
        $item = $items->find($value);
        $item->index = $key;
        $item->save();
    }

    return response('Success', 200);
}
dd() of $request->input
array:1 [
  "newIndex" => array:5 [
    0 => 4
    1 => 1
    2 => 2
    3 => 5
    4 => 3
  ]
]
The problem
It seems… the code only works once? Did I miss something? The POST fires every time and the sent array is accurate, so the issue is not on the frontend side of things.

ClickHouse Union All exception

I wrote a CREATE VIEW expression with UNION ALL.
create view if not exists V_REGRESULT as select 1 AS jType,
regF.smscid AS SmscID,
regF.eventdate AS EventDate,
regF.smppid AS SmppID,
regF.registereddate AS RegisteredDate,
regF.sendernpid AS SenderNPid,
regF.senderntid AS SenderNTid,
regF.senderdirectionid AS SenderDirectionID,
regF.CgPN AS CgPN,
regF.OriginalCgPn AS OriginalCgPN,
regF.recipientnpid AS RecipientNPid,
regF.recipientntid AS RecipientNTid,
regF.recipientdirectionid AS RecipientDirectionID,
regF.cdpn AS CdPN,
regF.OriginalCdPn AS OriginalCdPN,
regF.ValidityPeriod AS ValidityPeriod,
regF.expireddate AS ExpiredDate,
regF.deliveryreporttypeid AS DeliveryReportTypeID,
regF.umr AS Umr,
regF.smtypeid AS SmTypeID,
regF.uhdiheaderinsm AS UhdiHeaderInsm,
regF.uhdisequenceid AS UhdiSequenceID,
regF.uhdipartstotal AS UhdiPartsTotal,
regF.uhdipartn AS UhdiPartN,
regF.setbackpath AS SetBackPath,
regF.smppencoding AS SmppEncoding,
regF.originalsmppencoding AS OriginalSmppEncoding,
regF.retrycount AS RetryCount,
regF.retrycountsh AS RetryCountsH,
regF.schemeid AS SchemeID,
regF.lasterrortypeid AS LastErrorTypeID,
regF.lastneterror AS LastNetError,
regF.lastsmpperror AS LastSmppError,
regF.lastnativeerror AS LastNativeError,
regF.smstatusid AS SmStatusID,
regF.reportstatus AS ReportStatus,
regF.protocolid AS ProtocolID,
regF.modifiedprotocolid AS ModifiedProtocolID,
regF.dbencoding AS DbEncoding,
regF.messagelen AS MessageLen,
regF.smbody AS SmBody,
regF.senderserviceid AS SenderServiceID,
regF.recipientserviceid AS RecipientServiceID,
regF.imsi AS Imsi,
regF.commutatorgt AS CommutatorGT,
regF.targetimsi AS TargetImsi,
regF.targetcommutatorgt AS TargetCommutatorGT,
regF.eventflag AS EventFlag,
regF.failcode AS FailCode,
regF.ScpGt AS ScpGt,
regF.ResultCode AS ResultCode
from SMCS.V_REGFAIL regF
union all
select 2 AS jType,
reg.SmscId AS SmscID,
reg.EventDate AS EventDate,
reg.SmppID AS SmppID,
reg.RegisteredDate AS RegisteredDate,
reg.SenderNPID AS SenderNPid,
reg.SenderNTID AS SenderNTid,
reg.SenderDirectionID AS SenderDirectionID,
reg.CgPN AS CgPN,
reg.OriginalCgPN AS OriginalCgPN,
reg.RecipientNPid AS RecipientNPid,
reg.RecipientNTid AS RecipientNTid,
reg.RecipientDirectionID AS RecipientDirectionID,
reg.CdPN AS CdPN,
reg.OriginalCdPN AS OriginalCdPN,
reg.ValidityPeriod AS ValidityPeriod,
reg.ExpiredDate AS ExpiredDate,
reg.DeliveryReportTypeID AS DeliveryReportTypeID,
reg.Umr AS Umr,
reg.SmTypeID AS SmTypeID,
reg.UhdiHeaderInsm AS UhdiHeaderInsm,
reg.UhdiSequenceID AS UhdiSequenceID,
reg.UhdiPartsTotal AS UhdiPartsTotal,
reg.UhdiPartN AS UhdiPartN,
reg.SetBackPath AS SetBackPath,
reg.SmppEncoding AS SmppEncoding,
reg.OriginalSmppEncoding AS OriginalSmppEncoding,
reg.RetryCount AS RetryCount,
reg.RetryCountsH AS RetryCountsH,
reg.SchemeID AS SchemeID,
reg.LastErrorTypeID AS LastErrorTypeID,
reg.LastNetError AS LastNetError,
reg.LastSmppError AS LastSmppError,
reg.LastNativeError AS LastNativeError,
reg.SmStatusID AS SmStatusID,
reg.ReportStatus AS ReportStatus,
reg.ProtocolID AS ProtocolID,
reg.ModifiedProtocolID AS ModifiedProtocolID,
reg.DbEncoding AS DbEncoding,
reg.MessageLen AS MessageLen,
reg.SmBody AS SmBody,
reg.SenderServiceID AS SenderServiceID,
reg.RecipientServiceID AS RecipientServiceID,
reg.Imsi AS Imsi,
reg.CommutatorGT AS CommutatorGT,
reg.TargetImsi AS TargetImsi,
reg.TargetCommutatorGT AS TargetCommutatorGT,
'' AS EventFlag,
'' AS FailCode,
reg.ScpGt AS ScpGt,
'' AS ResultCode
from SMCS.V_REGISTRATION reg
;
All columns in the referenced views exist. I have no idea what causes this exception:
Code: 386, e.displayText() = DB::Exception: There is no supertype for types String, DateTime because some of them are String/FixedString and some of them are not (version 19.15.3.6 (official build))
Where could the mistake be? A simple union like select 1 as one union all select 2 as one works fine. The only thing I found was that this error can be caused by ARRAY JOIN during type casting.
Initially, I thought it was caused by replacing NULL with '' for EventFlag, etc., in the union, but changing that made no difference.
Help, please.
P.S. Sorry for my English.
UNION ALL requires each column to have a common type across both branches; the error is easy to reproduce:
select now() union all select ''
Code: 386. DB::Exception: Received from localhost:9000.
DB::Exception: There is no supertype for types DateTime, String
because some of them are String/FixedString and some of them are not.
Cast to String:
select toString(now()) union all select ''
┌─toString(now())─────┐
│ 2019-11-19 14:25:13 │
└─────────────────────┘
┌─toString(now())─┐
│                 │
└─────────────────┘
Or cast to DateTime:
select now() union all select toDateTime(0)
┌───────────────now()─┐
│ 2019-11-19 14:25:48 │
└─────────────────────┘
┌───────────────now()─┐
│ 0000-00-00 00:00:00 │
└─────────────────────┘
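So the fix for your view is to give the filler columns in the second branch the same types as their counterparts in the first branch, instead of bare '' literals. A minimal standalone sketch of the mismatch and its repair (the column name is illustrative, not from your schema):
-- broken: DateTime in one branch, String in the other
-- select now() as EventDate union all select '' as EventDate
-- fixed: give the filler a typed default
select now() as EventDate
union all
select toDateTime(0) as EventDate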

How do you decode GraphQL::Schema::UniqueWithinType in Postgres?

Say someone evil stored an encoded id in your database and you needed to use it. Example:
Ruby:
GraphQL::Schema::UniqueWithinType.default_id_separator = '|'
relay_id = GraphQL::Schema::UniqueWithinType.encode('User', '123')
# "VXNlcnwxMjM="
How do I get 123 out of VXNlcnwxMjM= in Postgres?
Ruby:
GraphQL::Schema::UniqueWithinType.default_id_separator = '|'
relay_id = GraphQL::Schema::UniqueWithinType.encode('User', '123')
# "VXNlcnwxMjM="
Base64.decode64(relay_id)
# "User|123"
To get "123" out of "VXNlcnwxMjM=" in Postgres SQL, you can do this horror show:
select
  substring(
    (decode('VXNlcnwxMjM=', 'base64')::text)
    from (char_length('User|') + 1)
    for (char_length(decode('VXNlcnwxMjM=', 'base64')::text) - char_length('User|'))
  )::int
Edit: Playing around with this ... on Postgres 9.6.5 the above works, but our staging server is 10.5 and I had to do this instead (which also works on 9.6.5):
select
  substring(
    convert_from(decode('VXNlcnwxMjM=', 'base64'), 'UTF-8')
    from (char_length('User|') + 1)
    for (char_length(convert_from(decode('VXNlcnwxMjM=', 'base64'), 'UTF-8')) - char_length('User|'))
  )::int
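If the type-name prefix is always followed by a single '|' separator, a shorter alternative (a sketch, not from the original answer) is split_part:
select split_part(
  convert_from(decode('VXNlcnwxMjM=', 'base64'), 'UTF-8'),
  '|',
  2
)::int
-- => 123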

Scan/Match incorrect input error messages

I am trying to count the correct inputs from the user. An input looks like:
m = "<ex=1>test xxxx <ex=1>test xxxxx test <ex=1>"
The tag ex=1 and the word test have to be connected and in this particular order to count as correct. In case of an invalid input, I want to send the user an error message that explains the error.
I tried to do it as written below:
ex_test_size = m.scan(/<ex=1>test/).size # => 2
test_size = m.scan(/test/).size # => 3
ex_size = m.scan(/<ex=1>/).size # => 3
puts "lack of tags(<ex=1>)" if ex_test_size < ex_size
puts "Lack of the word(test)" if ex_test_size < test_size
I believe it can be written in a better way; my approach, I guess, is prone to errors. How can I make sure that all the errors will be found and shown to the user?
You might use negative lookarounds:
m.scan(/<ex=1>(?!test).{,4}|.{,4}(?<!<ex=1>)test/).map do |msg|
  "<ex=1>test expected, #{msg} got"
end.join(', ')
We scan the string for either <ex=1> not followed by test, or test not preceded by <ex=1>; on your input the scan itself returns ["xxx test", "<ex=1>"]. We also grab up to 4 surrounding characters that violate the rule, for a more descriptive message.
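For the sample input above, the whole expression would therefore yield (derived from the scan result shown, not re-run):
"<ex=1>test expected, xxx test got, <ex=1>test expected, <ex=1> got"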

AWS Glue issue with double quote and commas

I have this CSV file:
reference,address
V7T452F4H9,"12410 W 62TH ST, AA D"
The following options are being used in the table definition:
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'quoteChar'='\"',
    'separatorChar'=',')
but it still won't recognize the double quotes in the data, and the comma inside the quoted field is messing up the data. When I run the Athena query, the result looks like this:
reference    address
V7T452F4H9   "12410 W 62TH ST
How do I fix this issue?
I did this to solve it:
1 - Create a crawler that doesn't overwrite the target table properties. I used boto3 for this, but it can also be created in the AWS Console. Do this (change the xxx- placeholders):
import boto3

client = boto3.client('glue')

response = client.create_crawler(
    Name='xxx-Crawler-Name',
    Role='xxx-Put-here-your-role',
    DatabaseName='xxx-databaseName',
    Description='xxx-Crawler description if u need it',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://xxx-Path-to-s3/',
                'Exclusions': []
            },
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG'
    },
    Configuration='{ \
        "Version": 1.0, \
        "CrawlerOutput": { \
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}, \
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"} \
        } \
    }'
)

# run the crawler
response = client.start_crawler(
    Name='xxx-Crawler-Name'
)
2 - Edit the serialization lib. I did this in the AWS Console, as described in this doc (https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-csv-quotes): change the table's serialization lib to org.apache.hadoop.hive.serde2.OpenCSVSerde and set its quoteChar and separatorChar SerDe parameters.
3 - Run the crawler again, as you always do.
4 - That's it. The second run should not change any data in the table; it's just to verify that it works ¯\_(ツ)_/¯.
Looks like you also need to add escapeChar. The AWS Athena docs show this example:
CREATE EXTERNAL TABLE myopencsvtable (
    col1 string,
    col2 string,
    col3 string,
    col4 string
)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '\"',
    'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://location/of/csv/';
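With quoteChar and escapeChar in place, the quoted field should parse as a single value. A quick sanity check against the sample row (assuming a table created with the question's reference/address columns rather than the col1..col4 of the example above):
SELECT reference, address FROM myopencsvtable
-- expected: V7T452F4H9    12410 W 62TH ST, AA D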
