Is there any performance difference between querying a struct field vs a map field in Trino?

I am trying to design a schema for a Hive table. Our schema needs to be extensible so that it can allow any kind of event to flow through.
Each event can potentially have its own schema.
One option is to create a wide table with a field for each schema_type that may be sent:
create table data (
  source struct<name:string, id:int>,
  pageview struct<url:string>,
  traffic_source struct<utm:string, utm_param:string>
)
Or I can create a table that has generic maps in which any kind of key/value can be put:
create table data (
  source struct<>,
  intvalues map<string, int>,
  stringvalues map<string, string>,
  ...
)
My question is: if I use the latter schema, is there any performance implication when querying with Trino? I have heard from colleagues that querying maps with Trino is very slow.
In particular, would there be a performance degradation with any of the following query patterns?
select intvalues['source.id'], stringvalues['pageview'] from data where ...

select * from data
where
  intvalues['source.id'] = 0 and stringvalues['pageview'] = 'home_page'
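For comparison, the equivalent lookups against the first, struct-based schema would dereference the fields directly (a sketch reusing the columns declared above):

select source.id, pageview.url from data where ...

select * from data
where
  source.id = 0 and pageview.url = 'home_page'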

Related

Fetching the data optimally in GraphQL

How can I write the resolvers such that I can generate a database sub-query in each resolver, effectively combine all of them, and fetch the data at once?
For the following schema:
type Node {
  index: Int!
  color: String!
  neighbors(first: Int = null): [Node!]!
}
type Query {
  nodes(color: String!): [Node!]!
}
schema {
  query: Query
}
To perform the following query:
{
  nodes(color: "red") {
    index
    neighbors(first: 5) {
      index
    }
  }
}
Data store:
In my data store, nodes and neighbors are stored in separate tables. I want to write a resolver so that we can fetch the required data optimally.
If there are any similar examples, please share the details. (It would be helpful to get an answer in reference to graphql-java)
DataFetchingEnvironment provides access to sub-selections via DataFetchingEnvironment#getSelectionSet. This means, in your case, you'd be able to know from the nodes resolver that neighbors will also be required, so you could JOIN appropriately and prepare the result.
One limitation of the current implementation of getSelectionSet is that it doesn't provide info on conditional selections. So if you're dealing with interfaces and unions, you'll have to manually collect the sub-selection starting from DataFetchingEnvironment#getField. This will very likely be improved in the future releases of graphql-java.
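For example, a nodes DataFetcher could branch on the sub-selection like this (a sketch: Node and NodeDao are hypothetical application types; only the DataFetchingEnvironment calls are graphql-java API):

import graphql.schema.DataFetcher;
import graphql.schema.DataFetchingEnvironment;
import graphql.schema.DataFetchingFieldSelectionSet;
import java.util.List;

// minimal stand-ins for the application's own types (hypothetical)
class Node { int index; String color; List<Node> neighbors; }
interface NodeDao {
    List<Node> findByColor(String color);
    List<Node> findByColorWithNeighbors(String color); // JOINs the neighbors table
}

public class NodesDataFetcher implements DataFetcher<List<Node>> {

    private final NodeDao nodeDao;

    public NodesDataFetcher(NodeDao nodeDao) {
        this.nodeDao = nodeDao;
    }

    @Override
    public List<Node> get(DataFetchingEnvironment env) {
        String color = env.getArgument("color");
        DataFetchingFieldSelectionSet selection = env.getSelectionSet();

        if (selection.contains("neighbors")) {
            // neighbors are part of the sub-selection: JOIN and prefetch them
            return nodeDao.findByColorWithNeighbors(color);
        }
        // otherwise a plain query on the nodes table is enough
        return nodeDao.findByColor(color);
    }
}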
The recommended and most common way is to use a data loader.
A data loader collects the info about which fields to load from which table and which where filters to use.
I haven't worked with GraphQL in Java, so I can only give you directions on how you could implement this yourself.
Create an instance of your data loader and pass it to your resolvers as the context argument.
Your resolvers should pass the table name, a list of field names and a list of where conditions to the data loader and return a promise.
Once all the resolvers have executed, your data loader should combine those lists so you only end up with one query per table.
You should remove duplicate field names and combine the where conditions using the or keyword.
After the queries have executed, you can return all of this data to your resolvers and let them filter it (since we combined the conditions using the or keyword).
As an advanced feature, your data loader could apply the where conditions before returning the data to the resolvers so that they don't have to filter it themselves.
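A minimal Java sketch of such a hand-rolled data loader; every class and method name here is illustrative, and in graphql-java you would more likely reach for the org.dataloader library instead:

import java.util.*;
import java.util.concurrent.CompletableFuture;

public class SimpleDataLoader {

    // per table: the field names and where conditions collected so far
    private final Map<String, Set<String>> fieldsByTable = new HashMap<>();
    private final Map<String, List<String>> conditionsByTable = new HashMap<>();
    private final Map<String, List<CompletableFuture<List<Map<String, Object>>>>> promisesByTable = new HashMap<>();

    // resolvers call this instead of querying the database directly
    public CompletableFuture<List<Map<String, Object>>> load(
            String table, Collection<String> fields, String whereCondition) {
        fieldsByTable.computeIfAbsent(table, t -> new LinkedHashSet<>()).addAll(fields);
        conditionsByTable.computeIfAbsent(table, t -> new ArrayList<>()).add(whereCondition);
        CompletableFuture<List<Map<String, Object>>> promise = new CompletableFuture<>();
        promisesByTable.computeIfAbsent(table, t -> new ArrayList<>()).add(promise);
        return promise;
    }

    // called once all resolvers have registered their needs: one query per
    // table, duplicate fields removed, where conditions combined with OR
    public void dispatch() {
        for (String table : fieldsByTable.keySet()) {
            String sql = "SELECT " + String.join(", ", fieldsByTable.get(table))
                       + " FROM " + table
                       + " WHERE (" + String.join(") OR (", conditionsByTable.get(table)) + ")";
            List<Map<String, Object>> rows = runQuery(sql);
            // every resolver gets the combined result set and re-filters it
            // on its own condition, since the conditions were OR-ed together
            promisesByTable.get(table).forEach(p -> p.complete(rows));
        }
    }

    // placeholder; wire up your actual JDBC (or other) data access here
    private List<Map<String, Object>> runQuery(String sql) {
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}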

How do I query an optional column with a secondary index using phantom?

I have a secondary index on an optional column:
class Sessions extends CassandraTable[ConcreteSessions, Session] {
  object matchId extends LongColumn(this) with PartitionKey[Long]
  object userId extends OptionalLongColumn(this) with Index[Option[Long]]
  ...
}
However, the indexedToQueryColumn implicit conversion is not available for optional columns, so this does not compile:
def getByUserId(userId: Long): Future[Seq[Session]] = {
  select.where(_.userId eqs userId).fetch()
}
Neither does this:
select.where(_.userId eqs Some(userId)).fetch()
Or changing the type of the index:
object userId extends OptionalLongColumn(this) with Index[Long]
Is there a way to perform such a query using phantom?
I know that I could denormalize, but it would involve some very messy housekeeping and triple our (substantial) data size. The query usually returns only a handful of results, so I'd be willing to use a secondary index in this case.
Short answer: you cannot query on optional fields in phantom.
Long detailed answer:
However, if you really want a secondary index on an optional column, you should declare your entity field as an Option, but the phantom column itself should not be optional so that it can be queried:
object userId extends LongColumn(this) with Index[Long]
In the fromRow(r: Row) you can create your object like this:
Session(matchId(r), Some(userId(r)))
Then in the service part you could do the following:
.value(_.userId, t.userId.getOrElse(0))
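Putting that together, the insert in the service layer might look like this (a sketch assuming a Session(matchId: Long, userId: Option[Long]) case class and phantom's pre-2.x API, with 0 as the "no user" sentinel):

def store(t: Session): Future[ResultSet] =
  insert
    .value(_.matchId, t.matchId)
    .value(_.userId, t.userId.getOrElse(0L)) // 0 marks a missing user id
    .future()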
There is also a better way to do this. You could duplicate the table, creating a new table such as sessions_by_user_id, where user_id is the partition key and match_id the clustering key.
Since user_id is optional, you would end up with a table that contains only valid user ids, which is easy and fast to look up.
Cassandra relies on query-driven table design, so use that in your favor.
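A sketch of that duplicated table, reusing the entity from the question (names are illustrative, and only sessions that actually have a user id would be written to it):

class SessionsByUserId extends CassandraTable[ConcreteSessionsByUserId, Session] {
  object userId extends LongColumn(this) with PartitionKey[Long]
  object matchId extends LongColumn(this) with ClusteringOrder[Long] with Ascending

  override def fromRow(r: Row): Session = Session(matchId(r), Some(userId(r)))

  // no secondary index needed: user_id is now the partition key
  def getByUserId(userId: Long): Future[Seq[Session]] =
    select.where(_.userId eqs userId).fetch()
}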
Take a look at my GitHub project, which shows how to set up multiple queries over the same data:
https://github.com/iamthiago/cassandra-phantom

Can you do a join using an embedded array in a document with rethinkdb?

Say I have a users table with a property called favoriteUsers, which is an embedded array, i.e.:
users
{
  name: 'bob',
  favoriteUsers: ['jim', 'tim'] // can you have an index on an embedded array?
}
user_presence
{
  name: 'jim', // index on name
  online_since: 14440000
}
Can I do an inner join or eqJoin against a second table using the embedded property, or would I have to pull favoriteUsers out of the users table and into a join table, like in traditional SQL?
r.table('users')
  .getAll('bob', {index: 'name'})
  // inner join user_presence on user_presence.name in users.favoriteUsers
  .eqJoin('name', r.table('user_presence'), {index: 'name'})
Eventually, I'd like to call changes() on the query so that I can get realtime updates of the user's favorite users' presence changes.
eqJoin can work on an embedded document, but it works by comparing a single value, transformed/picked from the embedded document, against a secondary index on the right table.
For a more complicated join like this, I would rather use concatMap together with getAll or filter.
Let's say we want to fetch a user together with the user_presence of their favoriteUsers:
r.table('users')
  .getAll('bob', {index: 'name'})
  .concatMap(function(user) {
    return r.table('user_presence').filter(function(presence) {
      return user('favoriteUsers').contains(presence('name'))
    })
  })
So now you get the data and do the join yourself by querying for the extra data you need. My query may have a syntax error, but I hope it gives you the idea.
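Since the question also asks about changes(), one possible approach (not from the original answer) is to fetch the favorites list once and then open a changefeed on user_presence restricted to those names. This sketch assumes the official JavaScript driver and an open connection conn:

r.table('users')
  .getAll('bob', {index: 'name'})
  .nth(0)('favoriteUsers')
  .run(conn)
  .then(function(favorites) {
    // changefeed over just the favorite users' presence rows
    return r.table('user_presence')
      .getAll(r.args(favorites), {index: 'name'})
      .changes()
      .run(conn);
  })
  .then(function(cursor) {
    cursor.each(function(err, change) {
      if (err) throw err;
      console.log(change); // {old_val: ..., new_val: ...}
    });
  });

Note that the feed tracks only the names captured when it was opened; later edits to favoriteUsers would require reopening it.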

How to create an outputschema which has nested bags in pig

I am trying out Pig UDFs and have been reading about it. While the online content was helpful, I am still not sure if I understand how to create a complex output schema which has nested bags.
Please help. The requirement is as follows. Say, for example, I am analyzing e-commerce orders data. An order can have multiple products ordered in it.
I have the product-level data grouped at the order level. This is the input to my UDF: each group of rows at the order level, containing information about the products in each order.
InputSchema:
(grouped_at_order, {
    (input_column_values_at_product1_level),
    (input_column_values_at_product2_level)
})
I would be computing metrics both at an order level and at a product level in the UDF. For example, sum(products) is an order-level metric, while the color of each product is a product-level metric. So, for each row grouped at the order level that is sent to the UDF, I want to compute the order-level and product-level metrics.
Expected OutputSchema:
{
  { (orders, (computed_values_at_order_level)) },
  { (productlevel,
      {
        (computed_values_at_product1_level),
        (computed_values_at_product2_level),
        (computed_values_at_product3_level)
      }
    )
  }
}
The objective then is to persist the data at order level and product level in two separate output tables from pig.
Is there a better way of doing the same?
As @maxymoo said, before returning nested data from a UDF, I would first check whether I really need it.
Anyway, if you do, the solution is not complicated, just painful. You create a schema, add fields to it, then create a schema for the tuple, add the fields or sub-bags into it, and so on:
@Override
public Schema outputSchema(Schema input) {
    Schema statsOrderLevel = new Schema();
    statsOrderLevel.add(new FieldSchema("value", DataType.CHARARRAY));
    Schema statsOrderLevelTuple = new Schema();
    statsOrderLevelTuple.add(new FieldSchema(null, statsOrderLevel, DataType.TUPLE));
    Schema statsOrderLevelBag = new Schema();
    statsOrderLevelBag.add(new FieldSchema("stats", statsOrderLevelTuple, DataType.BAG));
    [...]
}

Create a linq subquery returns error "Local sequence cannot be used in LINQ to SQL implementations of query operators except the Contains operator"

I have created a LINQ query that returns my required data. I now have a new requirement and need to add an extra field into the returned results. My entity contains an ID field that I am trying to map against another table, without too much luck.
This is what I have so far.
Dictionary<int, string> itemDescriptions = new Dictionary<int, string>();
foreach (var item in ItemDetails)
{
    itemDescriptions.Add(item.ItemID, item.ItemDescription);
}
DB.TestDatabase db = new DB.TestDatabase(Common.GetOSConnectionString());
List<Transaction> transactionDetails = (from t in db.Transactions
    where t.CardID == CardID.ToString()
    select new Transaction
    {
        ItemTypeID = t.ItemTypeID,
        TransactionAmount = t.TransactionAmount,
        ItemDescription = itemDescriptions.Select(r => r.Key == t.ItemTypeID).ToString()
    }).ToList();
What I am trying to do is get the value from the dictionary where the key equals ItemTypeID.
I am getting this error.
Local sequence cannot be used in LINQ to SQL implementations of query operators except the Contains operator.
What do I need to modify?
This is a duplicate of this question. The problem you're having is that you're trying to match an in-memory collection (itemDescriptions) against a DB table. Because of the way LINQ to SQL works, it tries to do this in the DB, which is not possible.
There are essentially three options (unless I'm missing something):
1) Refactor your query so you pass a simple primitive value to the query that can be passed across to the DB (only good if itemDescriptions is a small set).
2) In your query use:
from t in db.Transactions.ToList()
...
3) Get back the objects you need as you're doing, then populate ItemDescription in a second step (see the sketch at the end of this answer).
Bear in mind that the second option will force LINQ to evaluate the query and return all transactions to your code, which will then be operated on in memory. If the transaction table is large, this will not be quick!
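A sketch of the third option: run the LINQ to SQL query without referencing the local dictionary, then fill in ItemDescription in memory (names follow the question):

DB.TestDatabase db = new DB.TestDatabase(Common.GetOSConnectionString());

List<Transaction> transactionDetails = (from t in db.Transactions
    where t.CardID == CardID.ToString()
    select new Transaction
    {
        ItemTypeID = t.ItemTypeID,
        TransactionAmount = t.TransactionAmount
    }).ToList();

// second step, entirely in memory: dictionary lookups are fine here
foreach (var transaction in transactionDetails)
{
    string description;
    if (itemDescriptions.TryGetValue(transaction.ItemTypeID, out description))
    {
        transaction.ItemDescription = description;
    }
}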
