How to load an Avro array of records into a Vertica ARRAY

My Avro files contain the following column:
{"name":"my_column","type":["null",{"type":"array","items":{"type":"record","name":"my_column","namespace":"v11","fields":[{"name":"my_column","type":["null","int"],"default":null}]}}],"default":null}
I loaded the data into Vertica, where it is stored as VARBINARY. Example:
db=> select MapToString(my_column) from tab limit 1;
MapToString
------------------------------------------------------------------------------------------------------------------------------------------------------
{
"0.__name__": "my_column",
"0.my_column": "5",
"1.__name__": "my_column",
"1.my_column": "9"
}
(1 row)
The data can actually be simplified to ARRAY[INT] (i.e. ARRAY[5,9]).
What is the correct way of performing this transformation?
Extend Vertica via UDTF or UDParser? Perform this transformation via SQL? Something else?
EDIT: I am going to check whether a scalar UDF can be embedded in the COPY command alongside the AVROPARSER, or whether it requires extra ETL.
Thank you!

Try materialising the column. You do need to know, though, what to expect in a flex table ...
I had this table first:
SELECT
__id__
, REGEXP_REPLACE(MAPTOSTRING(__raw__),'\s+',' ') AS rawasstring
FROM a;
-- out __id__ | rawasstring
-- out --------+--------------------------------------------------------------------------------------------------
-- out | { "0.__name__": "my_column", "0.my_column": "5", "1.__name__": "my_column", "1.my_column": "9" }
-- out (1 row)
Then, I just added a materialised column, like so:
ALTER TABLE a ADD int_array ARRAY[int,10]
DEFAULT ARRAY[
MAPLOOKUP(__raw__, '0.my_column')
, MAPLOOKUP(__raw__, '1.my_column')
, MAPLOOKUP(__raw__, '2.my_column')
, MAPLOOKUP(__raw__, '3.my_column')
, MAPLOOKUP(__raw__, '4.my_column')
, MAPLOOKUP(__raw__, '5.my_column')
, MAPLOOKUP(__raw__, '6.my_column')
, MAPLOOKUP(__raw__, '7.my_column')
, MAPLOOKUP(__raw__, '8.my_column')
, MAPLOOKUP(__raw__, '9.my_column')
]::ARRAY[INT,10];
Keys that don't exist in the map come back as NULL from MAPLOOKUP and are not really added to the array. And now I have the array:
SELECT
__id__::VARCHAR
, REGEXP_REPLACE(MAPTOSTRING(__raw__),'\s+',' ') AS rawasstring
, int_array
FROM a ;
-- out __id__ | rawasstring | int_array
-- out --------+--------------------------------------------------------------------------------------------------+-----------
-- out | { "0.__name__": "my_column", "0.my_column": "5", "1.__name__": "my_column", "1.my_column": "9" } | [5,9]
-- out (1 row)
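If you would rather build a typed table in one pass instead of adding a materialised column, the same MAPLOOKUP / ARRAY pattern should also work inside a CREATE TABLE ... AS SELECT. This is only a sketch, assuming the flex table a from above and a hypothetical target table a_typed; only two positions are shown, extend the constructor to as many elements as your maps can contain:
CREATE TABLE a_typed AS
SELECT
__id__
, ARRAY[
MAPLOOKUP(__raw__, '0.my_column')
, MAPLOOKUP(__raw__, '1.my_column')
]::ARRAY[INT,10] AS int_array
FROM a;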

Related

MongoDB $addToSet equivalent in CockroachDB

As the title suggests, is there an operator similar to MongoDB's $addToSet in CockroachDB? I want to append a value to an array if it's not present.
I browsed the docs but didn't find an $addToSet operator.
The array_append function should work for this case, see this doc: https://www.cockroachlabs.com/docs/stable/array.html#using-the-array_append-function
You'll need two functions to do this with arrays. Either of these should work:
array_append(array_remove(myArray, newElement), newElement)
IF(array_position(myArray,newElement) IS NULL, array_append(myArray,newElement), myArray)
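For instance, wired into an UPDATE (the groups table and its vals INT[] column here are hypothetical):
-- Append 5 to vals only if it is not already there
UPDATE groups
SET vals = IF(array_position(vals, 5) IS NULL,
              array_append(vals, 5),
              vals)
WHERE id = 1;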
If what you're representing should never contain duplicates, you might be better off using a JSON object type rather than an array type, since object keys are automatically unique. Here's an example:
create table unique_groups(id int primary key, vals jsonb default '{}');
insert into unique_groups values (1, '{}'), (2, '{"a": true}');
-- Add "b" to each set
update unique_groups set vals = jsonb_set(vals, '{b}', 'true') where true;
select * from unique_groups;
id | vals
-----+-----------------------
1 | {"b": true}
2 | {"a": true, "b": true}
-- Add "a" to each set
update unique_groups set vals = jsonb_set(vals, '{a}', 'true') where true;
select * from unique_groups;
id | vals
-----+-------------------------
1 | {"a": true, "b": true}
2 | {"a": true, "b": true}

Lua: how to retrieve the second value of key1, "aaa2"

tab1 = {key1="aaa1","aaa2",key2="bbb"}
print(tab1.key1(1)) -- fails
print(tab1.key1{1}) -- fails
print(tab1.key1.1)  -- fails (syntax error)
I can't retrieve the second value of key1.
Note that you have not assigned a second value to key1: "aaa2" is not attached to key1 at all; it gets the numeric index 1.
Also, do you know how indices and tables work in Lua? Please read about them in the Lua documentation.
It's just a normal table; you can use the next or pairs functions to print all keys and values.
tab1 = {key1="aaa1","aaa2",key2="bbb"}
for key, value in next, tab1 do
print(key, value)
end
-- result:
--[[
1     aaa2
key2  bbb
key1  aaa1
--]]
If it's a key, just use .key1 or ["key1"];
if it's not a key, then by default it gets a numeric index.
For example:
print(tab1.key1) -- or tab1["key1"]
print(tab1[1])
print(tab1.key2) -- or tab1["key2"]
And if you want to store more than one value under a key, simply make another table inside that key:
tab1 = {
key1 = {'a', 'b', 'c'}
}
print(tab1.key1[1])
print(tab1.key1[2])
print(tab1.key1[3])
Sorry if I did not explain it well enough.

N1QL query grouping on distinct values with different keys

I have ~7 million docs in a bucket and I am struggling to write the correct query/index combo to keep it from running for more than 5 seconds.
Here is a similar scenario to the one I am trying to solve:
I have multiple coffee shops, each making coffee with different container/lid combos. The field keys also differ between doc types. With each sale that is generated I keep track of these combos.
Here are a few example docs:
[{
"shopId": "x001",
"date": "2022-01-01T08:49:00Z",
"cappuccinoContainerId": "a001",
"cappuccinoLidId": "b001"
},
{
"shopId": "x001",
"date": "2022-01-02T08:49:00Z",
"latteContainerId": "a002",
"latteLidId": "b002"
},
{
"shopId": "x001",
"date": "2022-01-02T08:49:00Z",
"espressoContainerId": "a003",
"espressoLidId": "b003"
},
{
"shopId": "x002",
"date": "2022-01-01T08:49:00Z",
"cappuccinoContainerId": "a001",
"cappuccinoLidId": "b001"
},
{
"shopId": "x002",
"date": "2022-01-02T08:49:00Z",
"latteContainerId": "a002",
"latteLidId": "b002"
},
{
"shopId": "x002",
"date": "2022-01-02T08:49:00Z",
"espressoContainerId": "a003",
"espressoLidId": "b003"
}]
What I need to get out of the query is the following:
[{
"shopId": "x001",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 2
},
{
"shopId": "x001",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 4
},
{
"shopId": "x002",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 2
},
{
"shopId": "x002",
"day": "2022-01-01",
"uniqueContainersLidsCombined": 4
}]
That is, I want the total number of unique container/lid combos per shop and day.
I have tried using composite, adaptive and FTS indexes, but I am unable to figure this one out.
Does anybody have a different suggestion? Can someone please help?
CREATE INDEX ix1 ON default(shopId, DATE_FORMAT_STR(date,"1111-11-11"), [cappuccinoContainerId, cappuccinoLidId]);
If using EE and shopId is immutable, add PARTITION BY HASH(shopId) to the above index definition (with a higher number of partitions).
SELECT d.shopId,
DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");
Adjust index key order of shopId, day based on the query predicates.
https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/
Update:
Based on the EXPLAIN, you have a date predicate and all shopIds, so use the following index:
CREATE INDEX ix2 ON default( DATE_FORMAT_STR(date,"1111-11-11"), shopId, [cappuccinoContainerId, cappuccinoLidId]);
As you need DISTINCT over cappuccinoContainerId and cappuccinoLidId, store them as a single index key (an array of 2 elements): [cappuccinoContainerId, cappuccinoLidId]. The advantage is that you can reference it directly in COUNT(DISTINCT ...), which allows index aggregation. (Do not put DISTINCT inside the index definition; that turns it into an array index and things will not work as expected.)
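With ix2 in place, the aggregation would then lead with the date predicate, roughly like this (a sketch only; the date range is invented):
SELECT d.shopId,
DATE_FORMAT_STR(d.date,"1111-11-11") AS day,
COUNT(DISTINCT [d.cappuccinoContainerId, d.cappuccinoLidId]) AS uniqueContainersLidsCombined
FROM default AS d
WHERE DATE_FORMAT_STR(d.date,"1111-11-11") BETWEEN "2022-01-01" AND "2022-01-31"
AND d.shopId IS NOT NULL
GROUP BY d.shopId, DATE_FORMAT_STR(d.date,"1111-11-11");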
I assume:
That the cup types and lid types can be used for any drink type.
That you don't want to add any precomputed stuff to your data.
Perhaps an index like this (my collection keyspace is bulk.sales.amer); note I am not sure if this performs better or worse (or even if it is equivalent) compared to the solution posted by vsr:
CREATE INDEX `adv_shopId_concat_nvls`
ON `bulk`.`sales`.`amer`(
`shopId` MISSING,
(
nvl(`cappuccinoContainerId`, "") ||
nvl(`cappuccinoLidId`, "") ||
nvl(`latteContainerId`, "") ||
nvl(`latteLidId`, "") ||
nvl(`espressoContainerId`, "") ||
nvl(`espressoLidId`, "")),substr0(`date`, 0, 10)
)
And then, using the covering index above, do your query like this:
SELECT
shopId,
CONCAT(
NVL(cappuccinoContainerId,""),
NVL(cappuccinoLidId,""),
NVL(latteContainerId,""),
NVL(latteLidId,""),
NVL(espressoContainerId,""),
NVL(espressoLidId,"")
) AS uniqueContainersLidsCombined,
SUBSTR(date,0,10) AS day,
COUNT(*) AS cnt
FROM `bulk`.`sales`.`amer`
GROUP BY
shopId,
CONCAT(
NVL(cappuccinoContainerId,""),
NVL(cappuccinoLidId,""),
NVL(latteContainerId,""),
NVL(latteLidId,""),
NVL(espressoContainerId,""),
NVL(espressoLidId,"")
),
SUBSTR(date,0,10)
Note I used the following 16 lines of data:
{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-01T08:49:00Z","cappuccinoContainerId":"a001","cappuccinoLidId":"b001","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","latteContainerId":"a002","latteLidId":"b002","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-02T08:49:00Z","espressoContainerId":"a003","espressoLidId":"b003","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","cappuccinoContainerId":"a007","cappuccinoLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T08:49:00Z","latteContainerId":"a007","latteLidId":"b004","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T01:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x001"}
{"amer":"amer","date":"2022-01-03T02:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T03:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T04:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T05:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
{"amer":"amer","date":"2022-01-03T06:49:00Z","espressoContainerId":"a007","espressoLidId":"b005","sales":"sales","shopId":"x002"}
Applying some sorting by wrapping the above query with
SELECT T1.* FROM
(
-- paste above --
) AS T1
ORDER BY T1.day, T1.shopId, T1.uniqueContainersLidsCombined
We get
cnt day shopId uniqueContainersLidsCombined
1 "2022-01-01" "x001" "a001b001"
1 "2022-01-01" "x002" "a001b001"
1 "2022-01-02" "x001" "a002b002"
1 "2022-01-02" "x001" "a003b003"
1 "2022-01-02" "x002" "a002b002"
1 "2022-01-02" "x002" "a003b003"
1 "2022-01-03" "x001" "a007b005"
2 "2022-01-03" "x001" "a007b004"
2 "2022-01-03" "x002" "a007b004"
5 "2022-01-03" "x002" "a007b005"
If you still don't get the performance you need, you could possibly use the Eventing service to do a continuous map/reduce and an occasional update query to make sure things stay perfectly in sync.

Save a file to HDFS from PySpark

I have an empty table in Hive, I mean there are no records in that table.
Using this empty table, I have created a DataFrame in PySpark:
df = sqlContext.table("testing.123_test")
I have registered this DataFrame as a temp table, and also captured the current date:
df.registerTempTable('mytempTable')
from datetime import datetime
date = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
In this table I have a column called id.
Now I want to query the temp table like below:
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I want to save date, min_id and max_id into a file in HDFS.
I have done it like below:
from pyspark.sql import functions as f
(sqlContext.table("myTempTable").select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id"))).coalesce(1).write.format("text").mode("append").save("/tmp/fooo"))
Now when I check the file in HDFS, it shows all NULL values.
The file output in HDFS is below.
NULL,NULL,NULL
What I want is
Date,0,0
Here date is the current timestamp
How can I achieve what I want?
This is in Scala, but you should be able to replicate it in Python easily.
The function you need here is na.fill. You'll have to replace the Scala Map with a Python dictionary in the code below:
This is what your DF looks like:
scala> nullDF.show
+----+----+----+
|date| x| y|
+----+----+----+
|null|null|null|
+----+----+----+
// You have already done this using Python's datetime functions
val format = new java.text.SimpleDateFormat("dd/MM/yyyy HH:mm:ss")
val curr_timestamp = format.format(new java.util.Date())
// Use na.fill to replace null values:
// column names are the keys of the map,
// and the values are what you want to replace NULL with.
val df = nullDF.na.fill(scala.collection.immutable.Map(
  "date" -> curr_timestamp,
  "x" -> "0",
  "y" -> "0"))
This should give you
+-------------------+---+---+
| date| x| y|
+-------------------+---+---+
|10/06/2017 12:10:20| 0| 0|
+-------------------+---+---+

SPARQL multi-language data compression to one row

I'd like to select data property values using SPARQL with some restrictions on their languages:
I have an ordered set of preferred languages ("ru", "en", ... etc.)
If an item has values in more than one language, I'd like to get only one value, chosen according to my set of languages (if ru is available I want the ru value, else if en is available I want en, and so on; if no language-tagged value is available, the untagged value).
Current query is:
select distinct ?dataProperty ?dpropertyValue where {
<http://dbpedia.org/resource/Blackmore's_Night> ?dataProperty ?dpropertyValue.
?dataProperty a owl:DatatypeProperty.
FILTER ( langmatches(lang(?dpropertyValue),"ru") || langmatches(lang(?dpropertyValue),"en") || lang(?dpropertyValue)="" )
}
The problem with it: the results contain two rows for the abstract (ru + en). I want only one row, which should contain the ru value. When ru is not available I'd like to get en, etc.
How?
Suppose you have data like this:
@prefix : <http://stackoverflow.com/q/21531063/1281433/> .
:a a :resource ;
:p "a in english"@en, "a in russian"@ru .
:b a :resource ;
:p "b in english"@en .
Then you're hoping to get results like this:
--------------------------------
| resource | label             |
================================
| :b       | "b in english"@en |
| :a       | "a in russian"@ru |
--------------------------------
Here are two ways of doing this.
Associate language tags with ranks, find the rank of the best label, then find the label with that rank
This way uses SPARQL 1.1 subqueries, aggregates, and data provided with values. The idea is to use values to associate each language tag with a rank. Then you use a subquery to pull out the optimal rank over all the labels that the resource has. Then in the outer query, you have access to the optimal rank, and you just retrieve the label with the language corresponding to that rank.
prefix : <http://stackoverflow.com/q/21531063/1281433/>
select ?resource ?label where {
# for each resource, find the rank of the
# language of the most preferred label.
{
select ?resource (min(?rank) as ?langRank) where {
values (?lang ?rank) { ("ru" 1) ("en" 2) }
?resource :p ?label .
filter(langMatches(lang(?label),?lang))
}
group by ?resource
}
# ?langRank from the subquery is, for each
# resource, the best preference. With the
# values clause, we get just the language
# that we want.
values (?lang ?langRank) { ("ru" 1) ("en" 2) }
?resource a :resource ; :p ?label .
filter(langMatches(lang(?label),?lang))
}
Select the labels separately and coalesce in the order that you want
You can select an optional label for each of the languages you're considering, and then coalesce them (so you get the first one that's bound) in the order of your preference. This is kind of verbose, but if you need to do anything else with the labels in the various languages other than the most preferred, you'll have access to them.
prefix : <http://stackoverflow.com/q/21531063/1281433/>
select ?resource ?label where {
# find resources
?resource a :resource .
# grab a russian label, if available
optional {
?resource :p ?rulabel .
filter( langMatches(lang(?rulabel),"ru") )
}
# grab an english label, if available
optional {
?resource :p ?enlabel .
filter( langMatches(lang(?enlabel),"en") )
}
# take either as the label, but russian over english
bind( coalesce( ?rulabel, ?enlabel ) as ?label )
}
