How to create an outputschema which has nested bags in pig - hadoop

I am trying out Pig UDFs and have been reading about them. While the online content was helpful, I am still not sure I understand how to create a complex output schema that has nested bags.
Please help. The requirement is as follows. Say, for example, I am analyzing e-commerce order data. An order can contain multiple products.
I have the product-level data grouped at the order level. This is the input to my UDF: each row is one order together with the information about the products in that order.
InputSchema:
(grouped_at_order, {
(input_column_values_at_product1_level),
(input_column_values_at_product2_level)
})
I would be computing metrics at both the order level and the product level in the UDF. For example, sum(products) is an order-level metric, while the color of each product is a product-level metric. So, for each order-level row sent to the UDF, I want to compute the order-level and item-level metrics.
Expected OutputSchema:
{
{ (orders, (computed_values_at_order_level)) },
{(productlevel,
{
(computed_values_at_product1_level),
(computed_values_at_product2_level),
(computed_values_at_product3_level)
}
)
}
}
The objective then is to persist the data at order level and product level in two separate output tables from pig.
Is there a better way of doing the same?

As @maxymoo said, before returning nested data from a UDF, I would first check whether I really need it.
Anyway, if you do, the solution is not complicated, but it is painful. You create a schema and add the fields, then create a schema for the tuple and add the fields or the sub-bags to it, and so on.
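import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

@Override
public Schema outputSchema(Schema input) {
    // Fields of the innermost tuple
    Schema statsOrderLevel = new Schema();
    statsOrderLevel.add(new FieldSchema("value", DataType.CHARARRAY));
    // Wrap the fields in a tuple (this constructor is declared to throw
    // FrontendException, so the elided part needs to handle it)
    Schema statsOrderLevelTuple = new Schema();
    statsOrderLevelTuple.add(new FieldSchema(null, statsOrderLevel, DataType.TUPLE));
    // Wrap the tuple in a bag named "stats"
    Schema statsOrderLevelBag = new Schema();
    statsOrderLevelBag.add(new FieldSchema("stats", statsOrderLevelTuple, DataType.BAG));
    [...]
}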

Related

Is there any performance difference between querying a struct field vs map field?

I am trying to design a schema for a hive table. Our schema needs to be extensible so that it can allow any kind of event to flow through.
Each event can potentially have its own schema.
One option is that I create a wide table with fields for each schema_type that may be sent.
create table data (
  source struct<name:string, id:int>,
  pageview struct<url:string>,
  traffic_source struct<utm:string,utm_param:string>
)
Or I can create a table that has generic maps in which any kind of key/value can be put:
create table data (
  source struct<>,
  intvalues map<string, int>,
  stringvalues map<string, string>,
  ...
)
My question is: if using the latter schema, is there any perf implication when querying with Trino? I heard from some colleagues that querying Maps with Trino is very slow.
In particular, would there be a performance degradation with any of the following query patterns?
select intvalues['source.id'], stringvalues['pageview'] from data where....
select * from data
where
  intvalues['source.id'] = 0 and stringvalues['pageview'] = 'home_page'

Adding a custom sorting to listing with an aggregate in shopware 6

I am trying to build a custom sorting for the product listings in Shopware 6.
I want to include a foreign table (the entity is leasingPlanEntity), get the minimum of one of its fields (period_price), and then order the search result by that value.
I have already built a subscriber and tried it as follows, which seems to work.
public static function getSubscribedEvents(): array
{
    return [
        //ProductListingCollectFilterEvent::class => 'addFilter'
        ProductListingCriteriaEvent::class => ['addCriteria', 5000]
    ];
}

public function addCriteria(ProductListingCriteriaEvent $event): void
{
    $criteria = $event->getCriteria();
    $criteria->addAssociation('leasingPlan');
    $criteria->addAggregation(new MinAggregation('min_period_price', 'leasingPlan.periodPrice'));

    // Add the sorting.
    $availableSortings = $event->getCriteria()->getExtension('sortings') ?? new ProductSortingCollection();
    $myCustomSorting = new ProductSortingEntity();
    $myCustomSorting->setId(Uuid::randomHex());
    $myCustomSorting->setActive(true);
    $myCustomSorting->setTranslated(['label' => 'My Custom Sorting at runtime']);
    $myCustomSorting->setKey('my-custom-runtime-sort');
    $myCustomSorting->setPriority(5);
    $myCustomSorting->setFields([
        [
            'field' => 'leasingPlan.periodPrice',
            'order' => 'asc',
            'priority' => 1,
            'naturalSorting' => 0,
        ],
    ]);
    $availableSortings->add($myCustomSorting);
    $event->getCriteria()->addExtension('sortings', $availableSortings);
}
Is this already the right way to get the min(periodPrice)? Or does it just take a random value from the leasingPlan table to define the sort order?
I didn't find a way to reference the min_period_price aggregate value in the $myCustomSorting->setFields() call.
Update 1
Some days later, I asked a less complex question in the shopware community on slack:
Is it possible to use the DAL to define a subquery for an association in the product-listing?
It should generate something like:
FROM
JOIN (
SELECT ... FROM ... WHERE ... GROUP BY ... ORDER BY ...
) AS ...
The answer there was:
Don't think so
Update 2
I also did an in-depth analysis of the DAL query builder, and it really does not seem possible to perform a subquery with the current version.
Update 3 - Different approach
A different approach might be to define custom fields on the main entity. Every time a change is made to the main entity, the values of these custom fields would be recalculated.
This takes a lot of extra work to implement, especially when the fields you are adding depend on other data, such as the availability of a product in the store.
So check whether it is worth the extra work. It would be better to have a way to build subqueries.
Unfortunately it seems that in your case there is no easy way to achieve this, if I understand the issue correctly.
Consider the following: for each product you can have multiple leasingPlan entities, and I assume that for a given context (like a specific sales channel or listing) that still holds. This means that you would have to sort the leasingPlan entities by price, then take the one with the lowest price, and then sort the products by their lowest-price leasingPlan's price.
There seems to be no other way to achieve that, and unfortunately for you, sorting is applied at the end, even if it is sort of a subquery.
So, for example, if you have the following snippet
$criteria = $event->getCriteria();
$criteria->addAssociation('leasingPlan');
$criteria->getAssociation('leasingPlan')
->addSorting(new FieldSorting('price', FieldSorting::ASCENDING))
->setLimit(1)
;
The actual price sorting would be applied AFTER the leasingPlan entities are fetched. Essentially, the fetched results would be sorted, meaning that you would not get the cheapest leasing plan per product, but just the first one.
You can only do something like that with filters, but in this case there is nothing to filter by. I assume you don't have one leasingPlan per SalesChannel or per language, which would let you limit that list to a single entry that could be used for sorting.
On top of that, this could not be included in a ProductSortingEntity, but you could always work around that by plugging into the appropriate events and modifying the criteria at runtime.
I see two ways to resolve your issue:
Making another table which would store the cheapest leasingPlan per product and just using that as your association
Storing the information about the cheapest leasingPlans in e.g. cache and using that for filtering (caution: a mistake here would probably break the sorting, for example if you end up with too few or too many leasingPlans per product)
public function applyCustomSorting(ProductListingCriteriaEvent $event): void
{
// One leasingPlan per one product
$cheapestLeasingPlans = $this->myCustomService->getCheapestLeasingPlanIds();
$criteria = $event->getCriteria();
$criteria->addAssociation('leasingPlan');
$criteria->getAssociation('leasingPlan')
->addSorting(new FieldSorting('price', FieldSorting::ASCENDING))
->addFilter(new EqualsAnyFilter('id', $cheapestLeasingPlans))
;
}
And then you could sort by
$criteria->addSorting(new FieldSorting('leasingPlan.periodPrice', FieldSorting::ASCENDING));
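The getCheapestLeasingPlanIds() service used above is not shown in the answer. Whichever of the two options is chosen, its core is a per-product minimum over the leasing plans. Here is a minimal sketch of that logic only, written in Java for illustration rather than as Shopware code; the LeasingPlan class, its fields, and the method name are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CheapestPlanSketch {

    /** Hypothetical stand-in for one row of the leasing plan table. */
    static final class LeasingPlan {
        final String id;
        final String productId;
        final double periodPrice;

        LeasingPlan(String id, String productId, double periodPrice) {
            this.id = id;
            this.productId = productId;
            this.periodPrice = periodPrice;
        }
    }

    /** Returns the id of the cheapest plan for each product, keyed by product id. */
    static Map<String, String> cheapestPlanIdPerProduct(List<LeasingPlan> plans) {
        Map<String, LeasingPlan> best = new LinkedHashMap<>();
        for (LeasingPlan plan : plans) {
            // Keep the plan with the lower price; on a tie the first one seen wins.
            best.merge(plan.productId, plan,
                    (current, candidate) ->
                            candidate.periodPrice < current.periodPrice ? candidate : current);
        }
        Map<String, String> ids = new LinkedHashMap<>();
        for (Map.Entry<String, LeasingPlan> entry : best.entrySet()) {
            ids.put(entry.getKey(), entry.getValue().id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<LeasingPlan> plans = Arrays.asList(
                new LeasingPlan("plan-a", "product-1", 30.0),
                new LeasingPlan("plan-b", "product-1", 20.0),
                new LeasingPlan("plan-c", "product-2", 50.0));
        // Exactly one plan id per product, which is what the filter relies on.
        System.out.println(cheapestPlanIdPerProduct(plans));
    }
}
```

Because the result contains exactly one plan per product, it avoids the "too few or too many leasingPlans per product" problem mentioned above, provided it is recomputed whenever a plan changes.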
There should be no need to add the association manually and to add the aggregation to the criteria, that should happen automatically behind the scenes if your custom sorting is selected in the storefront.
For more information refer to the official docs.

Fetching the data optimally in GraphQL

How can I write the resolvers such that I can generate database sub-query in each resolver and effectively combine all of them and fetch the data at once?
For the following schema :
type Node {
index: Int!
color: String!
neighbors(first: Int = null): [Node!]!
}
type Query {
nodes(color: String!): [Node!]!
}
schema {
query: Query
}
To perform the following query :
{
nodes(color: "red") {
index
neighbors(first: 5) {
index
}
}
}
Data store:
In my data store, nodes and neighbors are stored in separate tables. I want to write a resolver so that we can fetch the required data optimally.
If there are any similar examples, please share the details. (It would be helpful to get an answer in reference to graphql-java)
DataFetchingEnvironment provides access to sub-selections via DataFetchingEnvironment#getSelectionSet. This means, in your case, you'd be able to know from the nodes resolver that neighbors will also be required, so you could JOIN appropriately and prepare the result.
One limitation of the current implementation of getSelectionSet is that it doesn't provide info on conditional selections. So if you're dealing with interfaces and unions, you'll have to manually collect the sub-selection starting from DataFetchingEnvironment#getField. This will very likely be improved in the future releases of graphql-java.
The recommended and most common way is to use a data loader.
A data loader collects the info about which fields to load from which table and which where filters to use.
I haven't worked with GraphQL in Java, so I can only give you directions on how you could implement this yourself.
Create an instance of your data loader and pass it to your resolvers as the context argument.
Your resolvers should pass the table name, a list of field names and a list of where conditions to the data loader and return a promise.
Once all the resolvers have executed your data loader should combine those lists so you only end up with one query per table.
You should remove duplicate field names and combine the where conditions using the or keyword.
After the queries have executed you can return all of this data to your resolvers and let them filter the data (since we combined the conditions using the or keyword)
As an advanced feature your data loader could apply the where conditions before returning the data to the resolvers so that they don't have to filter them.
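The steps above can be sketched in plain Java. This is a hand-rolled, simplified illustration and not the graphql-java API (graphql-java projects often use the separate java-dataloader library instead). The class name, load, dispatch, and the batch function are all hypothetical, and the sketch batches by key only rather than by table name, field list, and where conditions. It is also not thread-safe.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hand-rolled batching loader (hypothetical; not part of graphql-java). */
final class BatchingLoaderSketch<K, V> {
    private final Function<List<K>, Map<K, V>> batchFunction;
    // Pending keys and the future handed out for each; duplicate keys share one future.
    private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    BatchingLoaderSketch(Function<List<K>, Map<K, V>> batchFunction) {
        this.batchFunction = batchFunction;
    }

    /** Called by a resolver: records the key and returns a future for its value. */
    CompletableFuture<V> load(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    /** Called once after the resolvers have run: performs a single batched lookup. */
    void dispatch() {
        if (pending.isEmpty()) {
            return;
        }
        Map<K, CompletableFuture<V>> batch = new LinkedHashMap<>(pending);
        pending.clear();
        Map<K, V> results = batchFunction.apply(new ArrayList<>(batch.keySet()));
        // Hand each resolver only the value for the key it asked for.
        batch.forEach((key, future) -> future.complete(results.get(key)));
    }

    public static void main(String[] args) {
        List<List<Integer>> queries = new ArrayList<>();
        BatchingLoaderSketch<Integer, String> loader = new BatchingLoaderSketch<>(keys -> {
            queries.add(keys); // stands in for one query with an IN (...) condition
            Map<Integer, String> rows = new LinkedHashMap<>();
            for (Integer key : keys) {
                rows.put(key, "neighbors-of-" + key);
            }
            return rows;
        });
        CompletableFuture<String> a = loader.load(1);
        CompletableFuture<String> b = loader.load(2);
        CompletableFuture<String> c = loader.load(1); // same key, same future
        loader.dispatch();
        System.out.println(queries.size() + " batched query: "
                + a.join() + ", " + b.join() + ", " + c.join());
    }
}
```

Completing each future with only its own key's value corresponds to the "advanced feature" in the last step: the resolvers do not have to filter the combined result themselves.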

Can you do a join using an embedded array in a document with rethinkdb?

Say I have a user table with a property called favoriteUsers which is an embedded array. i.e.
users
{
name:'bob',
favoriteUsers:['jim', 'tim'] //can you have an index on an embedded array?
}
user_presence
{
name:'jim', //index on name
online_since:14440000
}
Can I do an inner or eqJoin against say a 2nd table using the embedded property, or would I have to pull favoriteUsers out of the users table and into a join table like in traditional sql?
r.table('users')
.getAll('bob', {index:'name'})
// inner join user_presence on user_presence.name in users.highlights
.eqJoin("name", r.table('user_presence'), {index:'name'})
Eventually, I'd like to call changes() on the query so that I get realtime updates when the presence of a user's favorite users changes.
eqJoin can work with an embedded document, but only by comparing a single value that we pick or derive from the embedded document against a secondary index on the right table.
For any more complicated join, I would rather use concatMap together with getAll.
Let's say we want to fetch a user and the user_presence of their favoriteUsers:
r.table('users')
  .getAll('bob', {index: 'name'})
  .concatMap(function(user) {
    return r.table('user_presence').filter(function(presence) {
      return user("favoriteUsers").contains(presence("name"))
    })
  })
So ideally, you fetch the data and do the join yourself by querying the extra data that you need. My query may contain syntax errors, but I hope it gives you the idea.

Rethinkdb - filtering by value in another table

In our RethinkDB database, we have a table for orders, and a separate table that stores all the order items. Each entry in the OrderItems table has the orderId of the corresponding order.
I want to write a query that gets all SHIPPED order items (just the items from the OrderItems table ... I don't want the whole order). But whether the order is "shipped" is stored in the Order table.
So, is it possible to write a query that filters the OrderItems table based on the "shipped" value for the corresponding order in the Orders table?
If you're wondering, we're using the JS version of Rethinkdb.
UPDATE:
OK, I figured it out on my own! Here is my solution. I'm not positive that it is the best way (and certainly isn't super efficient), so if anyone else has ideas I'd still love to hear them.
I did it by running a .merge() to create a new field based on the Order table, then did a filter based on that value.
A semi-generalized query with filter from another table for my problem looks like this:
r.table('orderItems')
.merge(function(orderItem){
return {
orderShipped: r.table('orders').get(orderItem('orderId')).pluck('shipped') // I am plucking just the "shipped" value, since I don't want the entire order
}
})
.filter(function(orderItem){
return orderItem('orderShipped')('shipped').gt(0) // Filtering based on that new "shipped" value
})
It can be done much more simply:
r.table('orderItems').filter(function(orderItem){
  return r.table('orders').get(orderItem('orderId'))('shipped').default(0).gt(0)
})
To handle a missing (null) result, add '.default(0)'.
It's probably better to create a proper index before doing any lookups. Without an index, you cannot find documents in a table with more than 100,000 elements.
Also, filter cannot take advantage of secondary indexes.
A proper way is to use getAll and concatMap.
First, create index:
r.table("orderItems").indexCreate("orderId")
r.table("orders").indexCreate("shipStatus", r.row("shipped").default(0).gt(0))
With that index, we can find all shipped orders:
r.table("orders").getAll(true, {index: "shipStatus"})
Now, we will use concatMap to turn each order into its corresponding order items:
r.table("orders")
.getAll(true, {index: "shipStatus"})
.concatMap(function(order) {
return r.table("orderItems").getAll(order("id"), {index: "orderId"}).coerceTo("array")
})
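To make the access pattern of this query concrete, here is an in-memory analogy in plain Java. It is not RethinkDB code: the Order and OrderItem classes and both method names are hypothetical, and a HashMap plays the role of the "orderId" secondary index. The point is that each shipped order triggers one index lookup instead of a scan over all order items.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndexJoinSketch {

    /** Hypothetical stand-in for a document in the orders table. */
    static final class Order {
        final String id;
        final int shipped;

        Order(String id, int shipped) {
            this.id = id;
            this.shipped = shipped;
        }
    }

    /** Hypothetical stand-in for a document in the orderItems table. */
    static final class OrderItem {
        final String name;
        final String orderId;

        OrderItem(String name, String orderId) {
            this.name = name;
            this.orderId = orderId;
        }
    }

    /** Plays the role of the "orderId" secondary index: order id to its items. */
    static Map<String, List<OrderItem>> indexByOrderId(List<OrderItem> items) {
        Map<String, List<OrderItem>> index = new HashMap<>();
        for (OrderItem item : items) {
            index.computeIfAbsent(item.orderId, k -> new ArrayList<>()).add(item);
        }
        return index;
    }

    /** Selects shipped orders, then looks up each order's items through the index. */
    static List<OrderItem> shippedItems(List<Order> orders, Map<String, List<OrderItem>> index) {
        List<OrderItem> result = new ArrayList<>();
        for (Order order : orders) {
            if (order.shipped > 0) {
                result.addAll(index.getOrDefault(order.id, Collections.emptyList()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(new Order("o1", 1), new Order("o2", 0));
        List<OrderItem> items = Arrays.asList(
                new OrderItem("book", "o1"),
                new OrderItem("pen", "o1"),
                new OrderItem("lamp", "o2"));
        for (OrderItem item : shippedItems(orders, indexByOrderId(items))) {
            System.out.println(item.name);
        }
    }
}
```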

Resources