Select columns in a Polars LazyFrame based on a condition without collect? - lazy-evaluation

We often want to remove columns from a LazyFrame that don't meet a condition or threshold evaluated over each column (variance, number of missing values, number of unique values). It's possible to evaluate such a condition over a LazyFrame column-wise, collect the result, and pass it as a list back to the same LazyFrame (see this question). Is it possible to do this without evaluating an intermediate result?
A toy example would be to select only the columns that have 10 or more unique values. I can do this following the example from the linked question:
import polars as pl  # assumed import; ldf is an existing LazyFrame

threshold = 10
df = ldf.select(
    ldf.select(pl.all().n_unique())
    .melt()
    .filter(pl.col("value") >= threshold)
    .select("variable")
    .collect()  # this evaluates the condition over the dataframe
    .to_series()
    .to_list()
).collect()
I would like to do this with only one collect() statement at the end.

This is impossible without a collect. With LazyFrames you are building a computation graph, and every node in that graph has a known schema that is defined before the query runs.
The schema cannot be known if the columns you select depend on running the query itself.
In short, you have to collect and then continue lazily from that point.
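For illustration, here is a minimal runnable sketch of that collect-then-continue-lazy pattern; the LazyFrame and threshold are made up for the example, and column "b" is dropped because it has only one unique value:

import polars as pl

ldf = pl.LazyFrame({"a": list(range(100)), "b": [1] * 100})
threshold = 10

# First (unavoidable) evaluation: find the columns that pass the threshold.
keep = (
    ldf.select(pl.all().n_unique())
    .melt()
    .filter(pl.col("value") >= threshold)
    .select("variable")
    .collect()
    .to_series()
    .to_list()
)

# Continue building the graph lazily from that point on.
df = ldf.select(keep).collect()
print(df.columns)  # ['a']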

Related

Sum value flow power automate

I have a table on Azure like the following:
[screenshot of the source table]
Can anyone help me build a Power Automate flow that sums the number of users per day?
[screenshot of the desired result]
Thanks in advance
Ok, this is a monster of an answer, but it works, so follow closely.
Refer to the images for what to do.
The basic concept is: loop through all entities and fill an array with the distinct row keys. The way we determine whether a row key is distinct is by adding it to the array only IF it hasn't been added previously.
From there, we loop through that distinct list and, using an inner loop, sum each NumberOfUsers column IF the inner row key matches the outer row key currently being processed.
At the end of the outer loop, add an object to an array. That object has two fields, "RowKey" and "NumberOfUsers"; the "NumberOfUsers" field contains the sum for that given RowKey.
From here, you have the sum for each distinct RowKey.
If I've misused any fields (i.e. the use of RowKey), change them as need be.
This is just logic; you just need to apply it to the scenario. I think this is best done in an Azure Function because it'll run faster and be a lot less to maintain, but if you want to avoid that and use Power Automate, this works (see the sketch below).
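If you do go the Azure Function route, the same loop logic might look roughly like this Python sketch. It is only an illustration: the field names RowKey and NumberOfUsers come from the flow above, while entities is a made-up stand-in for the rows read from the Azure table.

def sum_per_row_key(entities):
    # Outer pass: collect the distinct row keys.
    distinct_keys = []
    for e in entities:
        if e["RowKey"] not in distinct_keys:
            distinct_keys.append(e["RowKey"])

    # For each distinct key, an inner loop sums the matching NumberOfUsers.
    results = []
    for key in distinct_keys:
        total = sum(e["NumberOfUsers"] for e in entities if e["RowKey"] == key)
        results.append({"RowKey": key, "NumberOfUsers": total})
    return results

entities = [
    {"RowKey": "2021-01-01", "NumberOfUsers": 3},
    {"RowKey": "2021-01-01", "NumberOfUsers": 2},
    {"RowKey": "2021-01-02", "NumberOfUsers": 5},
]
print(sum_per_row_key(entities))
# [{'RowKey': '2021-01-01', 'NumberOfUsers': 5}, {'RowKey': '2021-01-02', 'NumberOfUsers': 5}]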
[Images: Flow, Data / Table, Result]

Repeated uses of TOP in query

Finding TOP items in query
I have an Access query with fields (in simplified form) name, type, value. I need to extract the top x records (by value) FOR EVERY PAIR (name, type), with x depending on the pair. The query already has the "value" column sorted within each pair.
Solution 1: do a separate query per pair, take the top x in each, and build the union of the queries. Wrong! The number of pairs is large, and Access can't handle the resulting query.
Solution 2: add an extra column to the query, call it "Valid", and set it to True in all records. Then use VBA to traverse the query's recordset one record at a time and set Valid to False for the non-top items, then run an additional query dropping the False records. Wrong again: the recordset is not editable in VBA (even though "Valid" has nothing to do with any of the tables used in the query). Yes, I opened the recordset in VBA with dbOpenDynaset; no dice.
Any ideas? Thanks

Salesforce SOQL query length and efficiency

I am trying to solve the problem of deleting only the rows that match two criteria, each being a list of ids. These ids come in pairs: if an item to be deleted has one id of a pair, it must also have the other, so simply using two IN clauses will not work. I have come up with two solutions.
1) Use the two IN clauses, but then loop over the results and check that the two ids in question appear in the correct pairing, e.g.:
for (Object__c obj : [SELECT Id, Relation1__c, Relation2__c
                      FROM Object__c
                      WHERE Relation1__c IN :idlist1 AND Relation2__c IN :idlist2]) {
    // The relation fields are included in the SELECT because the loop body reads them.
    if (preConstructedPairingsAsString.contains('' + obj.Relation1__c + obj.Relation2__c)) {
        listToDelete.add(obj);
    }
}
2) Loop over the ids and build an admittedly long query.
I like the second option because I only get the items I need and can just pass the list to delete, but I know Salesforce has hang-ups with SOQL queries. Is there a penalty to the second option? Is it better to build and run a long query string, or to fetch more objects than necessary and filter?
In general you want to push as much logic as you can into SOQL queries, because that doesn't use any script statements and executes faster than your own code. However, there is a 10k character limit on SOQL queries (it can be raised to 20k), so by my back-of-the-envelope calculation you'd only be able to fit around 250 id pairs before hitting that limit.
I would go with option 1, or, if you really care about efficiency, you can create a formula field on the object that pairs the ids and filter on that:
formula: Relation1__c + '-' + Relation2__c

// Note: the strings in idpairs must match the formula's 'id1-id2' format.
for (List<Object__c> objs : [SELECT Id FROM Object__c WHERE Formula__c IN :idpairs]) {
    delete objs;  // the SOQL for-loop yields batches, deleted batch by batch
}

How do I properly sort and group a query that returns a tuple of an ORM object and a custom column?

I am looking for a way to have a query that returns a tuple, first sorted by one column and then grouped by another (in that order). Simply chaining .order_by().group_by() didn't appear to work. I then tried the following, which broke the return value (I got only the ORM object, not the original tuple), but read on for the details:
Base scenario:
There is a query that fetches test ORM objects linked from the test3 table through foreign keys.
This query also returns a column named linked that contains either true or false. It is originally ungrouped.
my_query = session.query(test_orm_object)
# ... lots of stuff like joining various things ...
# .add_column(<condition that puts 'true' or 'false' into the column>)
So the original return value is a tuple: the ORM object plus the true/false column.
Now this query should be grouped by the test ORM object's id (the test.id column), but sorted by the linked column first, so that entries with true are preferred during the grouping.
Assuming the current unsorted, ungrouped query is stored in my_query, my approach to achieve this was this:
# Get a sorted subquery
tmpquery = my_query.order_by(desc('linked')).subquery()
# Read the column out of the subquery
my_query = session.query(tmpquery).add_columns(getattr(tmpquery.c, 'linked').label('linked'))
# Group the objects
my_query = my_query.group_by(getattr(tmpquery.c, 'id'))
The resulting SQL when running this is shown below. It looks fine to me, by the way: inside the subquery anon_1 the rows are properly sorted; then its id as well as the linked column are fetched (among a few other columns SQLAlchemy apparently wants), and the result is properly grouped:
SELECT anon_1.id AS anon_1_id, anon_1.name AS anon_1_name,
       anon_1.fk_test3 AS anon_1_fk_test3, anon_1.linked AS anon_1_linked,
       anon_1.linked AS linked
FROM (
    SELECT test.id AS id, test.name AS name, test.fk_test3 AS fk_test3,
           CASE WHEN (anon_2.id = 87799534) THEN 'true' ELSE 'false' END AS linked
    FROM test
    LEFT OUTER JOIN (
        SELECT test3.id AS id, test3.fk_testvalue AS fk_testvalue
        FROM test3
    ) AS anon_2 ON anon_2.fk_testvalue = test.id
    ORDER BY linked DESC
) AS anon_1
GROUP BY anon_1.id
I tested it in phpMyAdmin, where it gave me, as expected, the id column (for the ORM object id), then the additional columns SQLAlchemy seems to want there, and the linked column. So far, so good.
Now my expected return values would be, as with the original unsorted, ungrouped query:
A tuple: the 'test' ORM object (from the anon_1.id column) and the 'true'/'false' value (the linked column).
The actual return value of the new sorted/grouped query, however, is (the original query DOES indeed return a tuple before the code above is applied):
the 'test' ORM object only.
Why is that so, and how can I fix it?
Excuse me if the approach turns out to be somewhat flawed.
What I actually want is to have the original query simply sorted, then grouped, without touching the return values. As you can see above, my attempt was to 'restore' the additional return value, but that didn't work. What should I do instead if this approach is fundamentally wrong?
Explanation for the subquery use:
The point of the whole subquery is to force SQLAlchemy to execute this part of the query separately, as a first step.
I want to order the results first and then group the ordered results. That seems hard to do properly in one step (when trying it manually in SQL, I had trouble combining ORDER BY and GROUP BY in a single step the way I wanted).
Therefore I don't simply order and group: I order first, then wrap the result as a subquery to enforce that the ordering step actually completes first, and then I group it.
Judging from manual phpMyAdmin tests with the generated SQL, this seems to work fine. The actual problem is that the original query (now wrapped as the subquery you were confused about) had an added column, and by wrapping it up as a subquery that column is gone from the overall result. My attempt to re-add it to the outer wrapping failed.
It would be much better if you provided examples. I don't know whether these columns are in separate tables or not. Just looking at your first paragraph, I would do something like this:
a = session.query(Table1, Table2.column).\
    join(Table2, Table1.foreign_key == Table2.id).\
    filter(...).\
    group_by(Table2.id).\
    order_by(Table1.property.desc()).\
    all()
I don't know exactly what you're trying to do, since I would need to see your actual model, but it should look something like this, maybe with the tables/objects flipped around or with more filters.
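As a side note on the tuple problem above: SQLAlchemy can map an ORM class onto a subquery with aliased(), which preserves the (object, column) shape of the result rows. The following is only an untested sketch; it assumes the ORM class behind test_orm_object is named Test, and it reuses my_query and session from the question.

from sqlalchemy import desc
from sqlalchemy.orm import aliased

# Order first, then wrap, exactly as in the question.
tmpquery = my_query.order_by(desc('linked')).subquery()

# Map the (assumed) Test ORM class onto the subquery so each result row
# comes back as a (Test instance, linked) tuple instead of bare columns.
test_alias = aliased(Test, tmpquery)
my_query = (
    session.query(test_alias, tmpquery.c.linked)
    .group_by(tmpquery.c.id)
)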

Oracle runtime of comparing numbers versus comparing strings using a LIKE operator

My company's database has 20 different string formats for its primary product label. All 20 of them are stored in a separate look-up table:
type 1: strings starting with 'W'
type 2: strings starting with 'TAIC'
type 3: strings starting with 'D'
...
Next to the label attribute is the 'type' attribute, which stores the number indicating which prefix the label contains.
I'm tasked with updating one of our modules for better runtime. One of the queries I ran across deals with all labels that have 'TAIC' as the prefix. However, instead of checking whether the type number equals 2, it runs a LIKE operation matching each label that begins with TAIC.
Now, my question is this: since my goal is better runtime, would it be wise to switch from the LIKE operator to a plain equality comparison against the type attribute? It seems that running a regular-expression-ish operation against a string would be somewhat more time-consuming, but is it enough to significantly alter the runtime of the system?
In Oracle, both these operations:
SELECT *
FROM mytable
WHERE pk LIKE 'TAIC%'
and
SELECT *
FROM mytable
WHERE type = 2
are sargable, that is, able to use an index on the appropriate field.
The numeric index, however, would be more compact and hence require less time to traverse, so using the numeric comparison could improve query performance.
