I have a table named:
project(pId,cId)
and I am trying to find the project ids that only one company is working on, using relational algebra.
I thought about using a join to find those pIds that occur more than once and then using subtraction, but I am not sure how to write it.
I cannot use count in my relational algebra, and I also cannot use !=.
I'll use the variety of relational algebra described at Wikipedia (explanation below), plus assignment to relation variables for intermediate results.
crosscid := project ⋈ ρ<cid2/cid>(project);
multicid := crosscid \ σ<cid = cid2>(crosscid);
result := π<pid>(project) \ π<pid>(multicid);
Where Wikipedia shows subscripted components of operators, I show them in angle brackets < >.
crosscid is the cross-product of all cids for each pid, obtained by creating a duplicate of the project relation with cid renamed. Note this includes tuples with cid == cid2.
multicid is crosscid filtered to only the pids with multiple cids, obtained by subtracting the tuples in crosscid with cid == cid2. (This is the 'work round' for the limitation that we're not allowed to use !=.)
result is the pids from the original project relation minus the pids with multiple cids.
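The three assignments can be checked with a small set-based sketch in Python (the sample data below is made up for illustration):

```python
# Relations as Python sets of tuples; names mirror the assignments above.
project = {(1, "a"), (1, "b"), (2, "a"), (3, "c")}

# crosscid := project ⋈ ρ<cid2/cid>(project)  -- join on pid, so we get
# every pairing of cids for each pid, including pairs where cid == cid2.
crosscid = {(p1, c1, c2) for (p1, c1) in project
                         for (p2, c2) in project if p1 == p2}

# multicid := crosscid \ σ<cid = cid2>(crosscid)
# Subtracting the cid == cid2 tuples leaves only pids with multiple cids.
multicid = crosscid - {t for t in crosscid if t[1] == t[2]}

# result := π<pid>(project) \ π<pid>(multicid)
result = {p for (p, c) in project} - {p for (p, c1, c2) in multicid}
print(result)  # pids worked on by exactly one company: {2, 3}
```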
We often want to remove columns from a LazyFrame that don't fit a condition or threshold evaluated over that column (variance, number of missing values, number of unique values). It's possible to evaluate a condition over a LazyFrame columnwise, collect that condition, and pass it as a list to the same LazyFrame (see this question). Is it possible to do this without evaluating an intermediate result?
A toy example would be to select only the columns that have 10 or more unique values. I can do this following the example from the linked question:
threshold = 10
df = ldf.select(
    ldf.select(pl.all().n_unique())
    .melt()
    .filter(pl.col("value") >= threshold)
    .select("variable")
    .collect()  # this evaluates the condition over the dataframe
    .to_series()
    .to_list()
).collect()
I would like to do this with only one collect() statement at the end.
This is impossible without a collect. With LazyFrames you are building a computation graph. Every node in that graph has a known schema that is defined before the query runs.
It is impossible to know what the schema is if the columns you select depend on running the query.
In short, you have to collect and then continue lazily from that point.
This is a curly question I have been asked, and I am unsure how to tackle it in Power BI (the user I am helping is specifically using it, so I don't have the option to lean on a more comfortable programming language).
We have a situation where (for example) they are wanting to filter for compound names based upon a selected treatment and a response value from that treatment. So far so good with basic slicers.
HOWEVER they then wish to see ONLY those compounds that have a response in the correct range for the specifically chosen treatment, and if another treatment has a response in that range too, they wish to drop that compound.
For example, the following table is something I synthesized that gives an easy example:

Compound  Treatment  Response
1         A          13.80
1         B          8.25
1         C          9.22
1         D          10.50
2         A          11.66
2         B          8.42
2         C          12.63
2         D          9.63
In this case, I have a checkbox slicer for treatment and a slider slicer for response. If I specify a range of 11 - 14 for the response, the current behaviour is that I have compound 1 with treatment A and compound 2 with treatment A and C. Then if I select Treatment A for the treatment slicer, I just have both compound 1 and 2 with treatment A.
The desired behaviour would be to have ONLY compound 1 with treatment A because compound 2 has treatment C within this range too.
I feel like this is probably doable with an appropriate DAX but I don't know where to start with this.
If I understood correctly, you want only the Compound(s) with the fewest matching treatments.
Create a new measure:
Rank =
RANKX(
    'Table',
    CALCULATE(
        COUNTROWS('Table'),
        REMOVEFILTERS('Table'[Treatment])
    ),
    ,
    TRUE,
    SKIP
)
It ranks each Compound by its number of results (i.e. rows in the Table).
Then add this new measure in the Filters pane, in the "Filters on this visual" section, and set the filter to 1.
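The logic of that measure can be sketched in Python against the sample table (the slicer selections are hard-coded here for illustration):

```python
# Rows: (compound, treatment, response) from the sample table.
rows = [
    (1, "A", 13.80), (1, "B", 8.25), (1, "C", 9.22), (1, "D", 10.50),
    (2, "A", 11.66), (2, "B", 8.42), (2, "C", 12.63), (2, "D", 9.63),
]
selected_treatment = "A"   # treatment slicer
lo, hi = 11, 14            # response slider

# Like REMOVEFILTERS('Table'[Treatment]): count in-range rows per
# compound across *all* treatments, keeping only the response filter.
in_range = [(c, t) for (c, t, r) in rows if lo <= r <= hi]
counts = {c: sum(1 for (c2, _) in in_range if c2 == c) for (c, _) in in_range}

# Rank ascending and keep rank 1: compounds whose only in-range
# treatment is the selected one.
result = [c for (c, t) in in_range
          if t == selected_treatment and counts[c] == min(counts.values())]
print(result)  # [1] -- compound 2 is dropped because C is also in range
```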
Most things here are expressed in Python-ish pseudocode, as that is what I am working in.
Background
We have a data process which
produces entities where there are multiple natural keys for a single entity
chooses one of the natural keys as the canonical natural key for the entity
So each record from the dataset is essentially:
from dataclasses import dataclass
from typing import List

@dataclass
class Entity:
    canonical_natural_key: str
    natural_keys: List[str]  # the canonical natural key is also in this list
This represents real-world data with no UUID to work with until we assign one.
Now each time this data process runs, it does its best to resolve the real state of the world. A few things can happen that make it nontrivial to match an entity from one run to an entity of the next run and be sure they are in fact the same entity, namely:
which of the natural keys is the canonical natural key can change
the canonical natural key could be completely new, as in not previously seen in this entity's list of natural keys
the list of natural keys can change, although we expect overlap (set intersection size > 0) between runs
ElasticSearch Logic
What I need to do is to handle these Entity records each time the process runs and attempt to resolve them to an existing one from previous runs.
To attempt to find an already existing entity, we want to waterfall through these conditions:
canonical_natural_key == existing_entity.canonical_natural_key, OTHERWISE
canonical_natural_key in existing_entity.natural_keys, OTHERWISE
len(natural_keys.intersection(existing_entity.natural_keys)) > 0
If all conditions fail, this is in all likelihood a completely new entity.
The naive version would take 3 ElasticSearch queries, essentially:
existing_entity: Optional[Entity] = find_by_canonical_natural_key_match(canonical_natural_key)
if not existing_entity:
existing_entity: Optional[Entity] = find_by_canonical_natural_key_in_natural_keys(canonical_natural_key)
if not existing_entity:
existing_entity: Optional[Entity] = find_by_natural_keys_intersection(natural_keys)
return existing_entity
For performance reasons I would ideally do this in 1 or 2 queries instead of 3.
Is there a good way to do this in ElasticSearch?
I know you can boost results that match a certain field, but boost appears only to be for match queries, whereas I want to filter so we don't get any fuzzy matches, returning 0 entities if there are truly no matches.
My concern is that a standard OR filter treating the three conditions equally might return incorrect matches.
We don't want a scenario where:
existing entity 1 matches on condition 2
existing entity 2 matches on condition 3
...but existing entity 2 is returned as the top result from the search for some reason.
We would want entity 1 as the top result, because satisfying condition 2 is considered a better indicator than condition 3.
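One way to encode the waterfall in a single query is a bool query whose should clauses are exact (non-fuzzy) term filters, each wrapped in constant_score with a different boost so the conditions rank in the desired order. This is a sketch built from the field names in the question; I have not verified it against a live cluster:

```python
# Sketch: one query body where condition 1 always outscores condition 2,
# which outscores condition 3. minimum_should_match=1 guarantees zero
# hits when no condition matches at all (no fuzzy fallback).
def build_query(canonical_natural_key: str, natural_keys: list) -> dict:
    return {
        "size": 1,  # the top hit is the best-ranked condition
        "query": {
            "bool": {
                "minimum_should_match": 1,
                "should": [
                    {"constant_score": {           # condition 1
                        "boost": 100,
                        "filter": {"term": {
                            "canonical_natural_key": canonical_natural_key}}}},
                    {"constant_score": {           # condition 2
                        "boost": 10,
                        "filter": {"term": {
                            "natural_keys": canonical_natural_key}}}},
                    {"constant_score": {           # condition 3
                        "boost": 1,
                        "filter": {"terms": {"natural_keys": natural_keys}}}},
                ],
            }
        },
    }
```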
Is it possible to create a "List from a range" Data Validation rule in Google Sheets where the range skips columns?
For example:
Cells A6:A11 are limited to the range A1:B3. Cells B6:B11 are limited to the range A1:A3 AND C1:C3 (skipping column B).
Creating a Data Validation rule for cells A6:A11 is trivial as I simply need to create a Criteria of "List from a range = A1:B3".
However, creating the Data Validation rule for cells B6:B11 is not so intuitive since Google Sheets does not allow me to create a Criteria using the syntax "List from a range = A1:A3, C1:C3".
Does the "List from a range" Criteria support a syntax that allows us to skip columns within a range?
Note: I currently have a workaround for this where I define an array formula in D1: =ArrayFormula(if({1,""},A1:A3,C1:C3)) and then use D1:E3 as the Data Validation range. But this is a hacky solution and I'm hoping there is a better way to accomplish my goal.
The solution is to use { } to create a combination of columns or rows that will result in some sort of virtual table on-the-fly.
Example:
Assuming you have a spreadsheet with Name, Age, Gender, Phone and Address in A, B, C, D and E, and you want to skip the Gender (column C) while using the UNIQUE statement, you can use something like this.
Put in G1 the following formula:
=UNIQUE({A1:B, D1:E})
From the cell G1, the spreadsheet will populate the columns G, H, I and J with unique combinations of A, B, D and E, excluding the column C (Gender).
The same application of a combined range can be used in any formula and also you can combine multiple different ranges, including cross Spreadsheets and Files.
It is a very useful trick if you need to combine pieces of multiple spreadsheets for data visualization or reports. However, always remember you cannot manipulate the displayed data. You can still search through it, format it, etc., but you cannot change it. On the other hand, it will always auto-update if the data source gets updated, which is very useful.
Note: Try it with LOOKUP, VLOOKUP or HLOOKUP.
How can we translate the non-aggregate functions of Structured Query Language into relational algebra expressions? I know how to express the aggregate functions, but what about the non-aggregate functions?
e.g. How can we write the Year() function (on a date-formatted column)? Just Year(date)?
select e.name,year(e.dateOfEmployment) from Employees e
Thanks!
(This is a very reasonable question, I don't understand why it should get downvoted.)
The "Relational" in RA means expressing functions as mathematical relations -- using a set-theoretic approach. (It doesn't mean, as often thought, relating one table or datum to another as in "Entity-Relationship" modelling.) I can't grab a very succinct reference for this off the top of my head, but start here http://en.wikipedia.org/wiki/Binary_relation and follow the links.
How does this answer your question in the context of a practical RA? Have a look at this:
http://www.dcs.warwick.ac.uk/~hugh/TTM/APPXA.pdf, especially the section Treating Operators as Relations.
See how the relations PLUS and SQRT can be 'applied' (using COMPOSE, which is shorthand for natural JOIN followed by PROJECT) to behave as functions.
For your specific question, you need a relation with two attributes (type Date and Year).
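A small sketch of the idea in Python, modelling YEAR as a binary relation of (date, year) pairs and "applying" it by joining and projecting (the employee data is made up for illustration):

```python
# Employees as a relation of (name, dateOfEmployment) tuples.
employees = {("alice", "2019-03-01"), ("bob", "2021-07-15")}

# YEAR modelled as a relation: a set of (date, year) pairs.
# (In principle infinite; here restricted to the dates that occur.)
year_rel = {(d, int(d[:4])) for (_, d) in employees}

# COMPOSE = natural join on the shared attribute (the date), then
# project away the join attribute -- this "applies" YEAR like a function,
# mirroring: select e.name, year(e.dateOfEmployment) from Employees e
result = {(name, y) for (name, d) in employees
                    for (d2, y) in year_rel if d == d2}
print(result)  # {('alice', 2019), ('bob', 2021)}
```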