How do I convert to Chomsky Normal Form(CNF) - chomsky-normal-form

How do I convert below grammar to CNF?
S → ASA | aB
A → B | S
B → b | ε

We can split the transformation of context free grammars to Chomsky Normal Form into four steps.
The Bin step ensures that all alternatives in all productions contains no more than two terminals or non-terminals.
The Del step "deletes" all empty string tokens.
The Unit steps "inlines" productions that directly map to a single non-terminal.
The Term step makes sure terminals and non-terminals are not mixed in any alternative.
From your example, describing each step, the transformation to CNF can look like the following.
Bin
Alternatives in production S is split up into smaller productions. New non-terminals are T.
S → AT | aB
A → B | S
B → b | ε
T → SA
Del
From the production of S, nullable non-terminals A and B were factored out.
S → AT | T | aB | a
A → B | S
B → b | ε
T → SA
For the production of A, no action need be taken.
S → AT | T | aB | a
A → B | S
B → b | ε
T → SA
From the production of B, empty string tokens were removed.
S → AT | T | aB | a
A → B | S
B → b
T → SA
From the production of T, nullable non-terminal A were factored out.
S → AT | T | aB | a
A → B | S
B → b
T → SA | S
Unit
"Inlined" the production for B in A.
S → AT | T | aB | a
A → b | S
B → b
T → SA | S
Term
Replaced a terminal "a" in production S with the new non-terminal U.
S → AT | T | UB | a
A → b | S
B → b
T → SA | S
U → a
And you're done.

Related

BigQuery: Sample a varying number of rows per group

I have two tables. One has a list of items, and for each item, a number n.
item | n
--------
a | 1
b | 2
c | 3
The second one has a list of rows containing item, uid, and other rows.
item | uid | data
------------------
a | x | foo
a | x | baz
a | x | bar
a | z | arm
a | z | leg
b | x | eye
b | x | eye
b | x | eye
b | x | eye
b | z | tap
c | y | tip
c | z | top
I would like to sample, for each (item,uid) pair, n rows (arbitrary, it's better if this is uniformly random, but it doesn't have to be). In the example above, I want to keep maximum one row per user for item a, two rows per user for item b, and three rows per user to item c:
item | uid | data
------------------
a | x | baz
a | z | arm
b | x | eye
b | x | eye
b | z | tap
c | y | tip
c | z | top
ARRAY_AGG with LIMIT n doesn't work for two reasons: first, I suspect that given that n can be large (on the order of 100,000), this won't scale. The second, more fundamental problem is that n needs to be a constant.
Table sampling also doesn't seem to solve my problem, since it's per-table, and also only supports sampling a fixed percentage of rows, rather than a fixed number of rows.
Are there any other options?
Consider below solution
select * except(n)
from rows_list
join items_list
using(item)
where true
qualify row_number() over win <= n
window win as (partition by item, uid order by rand())
if applied to sample data in your question - output is

Constraint programming suitable for extracting OneToMany relationships from records

Maybe someone can help me to solve a problem with Prolog or any constraint programming language. Imagine a table of projects (school projects where pupils do something with their mothers). Each project has one or more children participating. For each child we store its name and the name of its mother. But for each project there is only one cell that contains all mothers and one cell that contains all children. Both cells are not necessarily ordered in the same way.
Example:
+-----------+-----------+------------+
| | | |
| Project | Parents | Children |
| | | |
+-----------+-----------+------------+
| | | |
| 1 | Jane; | Brian; |
| | Claire | Stephen |
| | | |
+-----------+-----------+------------+
| | | |
| 2 | Claire; | Emma; |
| | Jane | William |
| | | |
+-----------+-----------+------------+
| | | |
| 3 | Jane; | William; |
| | Claire | James |
| | | |
+-----------+-----------+------------+
| | | |
| 4 | Jane; | Brian; |
| | Sophia; | James; |
| | Claire | Isabella |
| | | |
+-----------+-----------+------------+
| | | |
| 4 | Claire | Brian |
| | | |
+-----------+-----------+------------+
| | | |
| 5 | Jane | Emma |
| | | |
+-----------+-----------+------------+
I hope this example visualizes the problem. As I said both cells only contain the names separated by a delimiter, but are not necessarily ordered in a similar way. So for technical applications you would transform the data into this:
+-------------+-----------+----------+
| Project | Name | Role |
+-------------+-----------+----------+
| 1 | Jane | Mother |
+-------------+-----------+----------+
| 1 | Claire | Mother |
+-------------+-----------+----------+
| 1 | Brian | Child |
+-------------+-----------+----------+
| 1 | Stephen | Child |
+-------------+-----------+----------+
| 2 | Jane | Mother |
+-------------+-----------+----------+
| 2 | Claire | Mother |
+-------------+-----------+----------+
| 2 | Emma | Child |
+-------------+-----------+----------+
| 2 | William | Child |
+-------------+-----------+----------+
| | | |
| |
| And so on |
The number of parents and children is equal for each project. So for each deal we have n mothers and n children and each mother belongs to exactly one child. With these constraints it is possible to assign each mother to all of her children by logical inference starting with the projects that involve only one child (i.e. 4 and 5).
Results:
Jane has Emma, Stephen and James;
Claire has Brian and William;
Sophia has Isabella
I am wondering how this can be solved using constraint programming. Additionally, the data set might be underdetermined and I am wondering if it is possible to isolate records that, when solved manually (i.e. when the mother-child assignments are done manually), would break the underdetermination.
I'm not sure if I understand all the requirements of the problem, but here is a constraint programming model in MiniZinc (http://www.minizinc.org/). The full model is here: http://hakank.org/minizinc/one_to_many.mzn .
LATER NOTE: The first version of the project constraints where not correct. I have removed the incorrect code . See the edit history for the original answer.
enum mothers = {jane,claire,sophia};
enum children = {brian,stephen,emma,william,james,isabella};
% decision variables
% who is the mother of this child?
array[children] of var mothers: x;
solve satisfy;
constraint
% All mothers has at least one child
forall(m in mothers) (
exists(c in children) (
x[c] = m
)
)
;
constraint
% NOTE: This is a more correct version of the project constraints.
% project 1
(
( x[brian] = jane /\ x[stephen] = claire) \/
( x[stephen] = jane /\ x[brian] = claire)
)
/\
% project 2
(
( x[emma] = claire /\ x[william] = jane) \/
( x[william] = claire /\ x[emma] = jane)
)
/\
% project 3
(
( x[william] = claire /\ x[james] = jane) \/
( x[james] = claire /\ x[william] = jane)
)
/\
% project 4
(
( x[brian] = jane /\ x[james] = sophia /\ x[isabella] = claire) \/
( x[james] = jane /\ x[brian] = sophia /\ x[isabella] = claire) \/
( x[james] = jane /\ x[isabella] = sophia /\ x[brian] = claire) \/
( x[brian] = jane /\ x[isabella] = sophia /\ x[james] = claire) \/
( x[isabella] = jane /\ x[brian] = sophia /\ x[james] = claire) \/
( x[isabella] = jane /\ x[james] = sophia /\ x[brian] = claire)
)
/\
% project 4(sic!)
( x[brian] = claire) /\
% project 5
( x[emma] = jane)
;
output [
"\(c): \(x[c])\n"
| c in children
];
The unique solution is
brian: claire
stephen: jane
emma: jane
william: claire
james: jane
isabella: sophia
Edit2: Here is a more general solution. See http://hakank.org/minizinc/one_to_many.mzn for the complete model.
include "globals.mzn";
enum mothers = {jane,claire,sophia};
enum children = {brian,stephen,emma,william,james,isabella};
% decision variables
% who is the mother of this child?
array[children] of var mothers: x;
% combine all the combinations of mothers and children in a project
predicate check(array[int] of mothers: mm, array[int] of children: cc) =
let {
int: n = length(mm);
array[1..n] of var 1..n: y;
} in
all_different(y) /\
forall(i in 1..n) (
x[cc[i]] = mm[y[i]]
)
;
solve satisfy;
constraint
% All mothers has at least one child.
forall(m in mothers) (
exists(c in children) (
x[c] = m
)
)
;
constraint
% project 1
check([jane,claire], [brian,stephen]) /\
% project 2
check([claire,jane],[emma,william]) /\
% project 3
check([claire,jane],[william,james]) /\
% project 4
check([claire,sophia,jane],[brian,james,isabella]) /\
% project 4(sic!)
check([claire],[brian]) /\
% project 5
check([jane],[emma])
;
output [
"\(c): \(x[c])\n"
| c in children
];
This model use the following predicate to ensure that all the combinations of mothers vs children are considered:
predicate check(array[int] of mothers: mm, array[int] of children: cc) =
let {
int: n = length(mm);
array[1..n] of var 1..n: y;
} in
all_different(y) /\
forall(i in 1..n) (
x[cc[i]] = mm[y[i]]
)
;
It use the global constraint all_different(y) to ensure that mm[y[i]] is one of the mothers in mm, and then assign the `i'th child to that specific mother.
A bit off topic, but since from SWI-Prolog manual:
Plain Prolog can be regarded as CLP(H), where H stands for Herbrand
terms. Over this domain, =/2 and dif/2 are the most important
constraints that express, respectively, equality and disequality of
terms.
I feel authorized to suggest a Prolog solution, more general than the algorithm you suggested (progressively reduce relations based on single to single relations):
solve2(Projects,ParentsChildren) :-
foldl([_-Ps-Cs,L,L1]>>try_links(Ps,Cs,L,L1),Projects,[],ChildrenParent),
transpose_pairs(ChildrenParent,ParentsChildrenFlat),
group_pairs_by_key(ParentsChildrenFlat,ParentsChildren).
try_links([],[],Linked,Linked).
try_links(Ps,Cs,Linked,Linked2) :-
select(P,Ps,Ps1),
select(C,Cs,Cs1),
link(C,P,Linked,Linked1),
try_links(Ps1,Cs1,Linked1,Linked2).
link(C,P,Assigned,Assigned1) :-
( memberchk(C-Q,Assigned)
-> P==Q,
Assigned1=Assigned
; Assigned1=[C-P|Assigned]
).
This accepts data in a natural format, like
data(1,
[1-[jane,claire]-[brian,stephen]
,2-[claire,jane]-[emma,william]
,3-[jane,claire]-[william,james]
,4-[jane,sophia,claire]-[brian,james,isabella]
,5-[claire]-[brian]
,6-[jane]-[emma]
]).
data(2,
[1-[jane,claire]-[brian,stephen]
,2-[claire,jane]-[emma,william]
,3-[jane,claire]-[william,james]
,4-[jane,sophia,claire]-[brian,james,isabella]
,5-[claire]-[brian]
,6-[jane]-[emma]
,7-[sally,sandy]-[grace,miriam]
]).
?- data(2,Ps),solve2(Ps,S).
Ps = [1-[jane, claire]-[brian, stephen], 2-[claire, jane]-[emma, william], 3-[jane, claire]-[william, james], 4-[jane, sophia, claire]-[brian, james, isabella], 5-[claire]-[brian], 6-[jane]-[emma], 7-[...|...]-[grace|...]],
S = [claire-[william, brian], jane-[james, emma, stephen], sally-[grace], sandy-[miriam], sophia-[isabella]].
This is my first CHR program, so I hope that someone will come and give me some advice on how to improve it.
My thinking is that you need to expand all the lists into facts. From there, if you know that a project has just one parent and one child, you can establish the parent relationship from that. Also, once you have a parent-child relationship, you can remove that set from the other facts in the other projects and reduce the cardinality of the problem by one. Eventually you will have figured out everything you can. The only difference between a completely determined dataset and an incompletely determined one is in how far that reduction can go. If it doesn't quite get there, it will leave around some facts so you can see which projects/parents/children are still creating ambiguity.
:- use_module(library(chr)).
:- chr_constraint project/3, project_parent/2, project_child/2,
project_parents/2, project_children/2, project_size/2, parent/2.
%% turn a project into a fact about its size plus
%% facts for each parent and child in this project
project(N, Parents, Children) <=>
length(Parents, Len),
project_size(N, Len),
project_parents(N, Parents),
project_children(N, Children).
%% expand the list of parents for this project into a fact per parent per project
project_parents(_, []) <=> true.
project_parents(N, [Parent|Parents]) <=>
project_parent(N, Parent),
project_parents(N, Parents).
%% same for the children
project_children(_, []) <=> true.
project_children(N, [Child|Children]) <=>
project_child(N, Child),
project_children(N, Children).
%% a single parent-child combo on a project is exactly what we need
one_parent # project_size(Project, 1),
project_parent(Project, Parent),
project_child(Project, Child) <=>
parent(Parent, Child).
%% if I have a parent relationship for project of size N,
%% remove this parent and child from the project and decrease
%% the number of parents and children by one
parent_det # parent(Parent, Child) \ project_size(Project, N),
project_parent(Project, Parent),
project_child(Project, Child) <=>
succ(N0, N),
project_size(Project, N0).
I ran this with your example by making a main/0 predicate to do it:
main :-
project(1, [jane, claire], [brian, stephen]),
project(2, [claire, jane], [emma, william]),
project(3, [jane, claire], [william, james]),
project(4, [jane, sophia, claire], [brian, james, isabella]),
project(5, [claire], [brian]),
project(6, [jane], [emma]).
This outputs:
parent(sophia, isabella),
parent(jane, james),
parent(claire, william),
parent(jane, emma),
parent(jane, stephen),
parent(claire, brian).
To demonstrate incomplete determination, I added a seventh project:
project(7, [sally,sandy], [grace,miriam]).
The program then outputs this:
project_parent(7, sandy),
project_parent(7, sally),
project_child(7, miriam),
project_child(7, grace),
project_size(7, 2),
parent(sophia, isabella),
parent(jane, james),
parent(claire, william),
parent(jane, emma),
parent(jane, stephen),
parent(claire, brian).
As you can see, any project_size/2 that remains tells you the cardinality of what remains to be solved (project seven has two parent/children relationships still remaining to be determined) and you get back exactly the parents/children that remain to be handled, as well as all of the parent/2 relations which could be determined.
I'm pretty happy with this outcome but hopefully others can come and improve my code!
Edit: my code has a shortcoming which was identified on the mailing list, that certain inputs will fail to converge even though the solution can be computed, for instance:
project(1,[jane,claire],[brian, stephan]),
project(2,[jane,emma],[stephan, jones]).
For more information, see Ian's solution, which uses set intersection to determine the mapping.

Writing conditions in Conjunctive Normal Form

Conjunctive Normal Form (CNF) is a standardized notation for propositional formulas that dictate that every formula should be written as a conjunction of disjunctions. Every boolean formula can be converted to CNF. So for example:
A | (B & C)
Has a representation in CNF like this:
(A | B) & (A | C)
Is it a best practice in programming to write conditionals in CNF?
No, this is not a good idea. Conjunctive normal form is primarily used in theoretical computer science. There are algorithms to solve formulas in CNF, and proofs about time complexity and NP-hardness.
From a pragmatic point of view, you should write code using Boolean operators that most "naturally" describe the logic. This means taking full advantage of nested expressions, operators like XOR, negation, and so on. As you exemplified, CNF often contradicts this goal of "naturalness" because the expression is longer and often repeats subexpressions.
As a theoretical side note, in the worst case, an unrestricted Boolean formula containing n operators can transform into a CNF formula whose length is exponential in n. So CNF can potentially blow up a formula by very large amount. A sequence of examples illustrating this behavior:
(A & B) | (C & D) ==
(A | C) & (A | D) & (B | C) & (B | D).
(A & B) | (C & D) | (E & F) ==
(A | C | E) & (A | C | F) & (A | D | E) & (A | D | F) & (B | C | E) & (B | C | F) & (B | D | E) & (B | D | F).
(A & B) | (C & D) | (E & F) | (G & H) ==
(A | C | E | G) & (A | C | E | H) & (A | C | F | G) & (A | C | F | H) & (A | D | E | G) & (A | D | E | H) & (A | D | F | G) & (A | D | F | H) & (B | C | E | G) & (B | C | E | H) & (B | C | F | G) & (B | C | F | H) & (B | D | E | G) & (B | D | E | H) & (B | D | F | G) & (B | D | F | H).

EBNF Syntax for imaginary language

In school we have been studying metalanguages, in particular, railroad diagrams and EBNF. I received a question where an imaginary programming language (winston) was described in EBNF. Here it is:
Digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
LCase = a | b | c | d
UCase = A | B | C | D | E | F | G | H | I | J
Operator = + | - | * | /
Logical = < | > | <= | >= | <>
Constant = [-] <Digit>{<Digit>}
Identifier = <UCase>{<LCase> | <Digit>}
Assignment = Set <Identifier> to <Constant> | <Identifier
{<Operator>(<Constant> | <Identifier>)}
Condition = <Identifier> <Logical> (<Identifier> | <Constant>)
{(and | or) <Identifier> <Logical> (<Identifier> | <Constant>)}
When = (<Assignment> | <Condition> {<Assignment> | <Condition>})
Statement = <Input> | <Output> | <Assignment> | <Condition> | <When> | <Pretest> | <Posttest>
Program = Start <Statement> {! <Statement>} Stop
The program written below was made with winston but doesn't execute properly. Use the EBNF descriptions to identify the error.
Start
Input J1
Input J2
When (J1 = J2, Set A3 to 0), (J1 < J2, Set A3 to -1), Set A3 to 1
Output A3
Stop
My working so far: To me, this program seems legitimate. It is a program so if must start with "start" and end with "stop", which it does. The statements in the middle seemingly are allowed to be in there. Can someone point me in the right direction?
Also, can someone tell me what it means in the EBNF description of a program what this means:<statement>
I think it means statements like when and if but Im not too sure. Thanks for the help :)
When is comma-separated and the grammar doesn't specify commas at all.
J1 = J2 -- there is no = comparison op in the grammar (see Logical), so J1 = J2 is neither Assignment, nor Condition and is thus invalid.
<statement> -- the grammar wraps symbols in angle brackets on the right-hand side, e.g. Identifier on the left-hand and, later, <Identifier> in Assignment rule -- that doesn't look like a valid EBNF.

Regex that matches valid Ruby local variable names

Does anyone know the rules for valid Ruby variable names? Can it be matched using a RegEx?
UPDATE: This is what I could come up with so far:
^[_a-z][a-zA-Z0-9_]+$
Does this seem right?
Identifiers are pretty straightforward. They begin with letters or an underscore, and contain letters, underscore and numbers. Local variables can't (or shouldn't?) begin with an uppercase letter, so you could just use a regex like this.
/^[a-z_][a-zA-Z_0-9]*$/
It's possible for variable names to be unicode letters, in which case most of the existing regexes don't match.
varname = "\u2211" # => "∑"
eval(varname + '= "Tony the Pony"') => "Tony the Pony"
puts varname # => ∑
local_variable_identifier = /Insert large regular expression here/
varname =~ local_variable_identifier # => nil
See also "Fun with Unicode" in either the Ruby 1.9 Pickaxe or at Fun with Unicode.
According to http://rubylearning.com/satishtalim/ruby_names.html a Ruby variable consists of:
A name is an uppercase letter,
lowercase letter, or an underscore
("_"), followed by Name characters
(this is any combination of upper- and
lowercase letters, underscore and
digits).
In addition, global variables begin with a dollar sign, instance variables with a single at-sign, and class variables with two at-signs.
A regular expression to match all that would be:
%r{
(\$|#{1,2})? # optional leading punctuation
[A-Za-z_] # at least one upper case, lower case, or underscore
[A-Za-z0-9_]* # optional characters (including digits)
}x
Hope that helps.
I like #aboutruby's answer, but just to complete it, here's the equivalent using POSIX bracket expressions.
/^[_[:lower:]][_[:alnum:]]*$/
Or, since a-z is actually shorter than [:lower:]:
/^[_a-z][_[:alnum:]]*$/
I think /^(\$){0,1}[_a-zA-Z][a-zA-Z0-9_]*([?!]){0,1}$/ is a bit closer to what you will need...
It depends on whether you want to match method names as well.
If you are trying to match a name that might be encountered in an expression, then it might start with $ and it might end with ? or !. If you know for sure that it is just a local variable then the rule will be much simpler.
i was trying to figure one out for a rails patch, and Matthew Draper wrote this one, using the ruby parser as a reference:
/\A(?![A-Z0-9])(?:[[:alnum:]_]|[^\0-\177])+\z/
And here it is, straight from the horse's mouth. (The horse in this case is the Draft ISO Ruby Specification):
local-variable-identifier → ( lowercase-character | _ ) identifier-character *
identifier-character → lowercase-character | uppercase-character | decimal-digit | _
uppercase-character → A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
lowercase-character → a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
decimal-digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
In Ruby 1.9, using named groups, you can translate this literally:
local_variable_identifier = %r{
(?<uppercase_character> A | B | C | D | E | F | G | H | I | J | K | L | M
| N | O | P | Q | R | S | T | U | V | W | X | Y | Z
){0}
(?<lowercase_character> a | b | c | d | e | f | g | h | i | j | k | l | m
| n | o | p | q | r | s | t | u | v | w | x | y | z
){0}
(?<decimal_digit> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9){0}
(?<identifier_character> \g<lowercase_character>
| \g<uppercase_character>
| \g<decimal_digit>
| _
){0}
( \g<lowercase_character> | _ ) \g<identifier_character>*
}x
Of course, this is not how you would really write it.

Resources