I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.
For example:
test1 = [
  { sender => 'client', msg => '123', arg => '900', foo => 'bar', ... },
  { sender => 'server', msg => '456', arg => '800', foo => 'bar', ... },
  { sender => 'client', msg => '789', arg => '900', foo => 'bar', ... },
]
I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programmatically regenerated using a minimal number of parameters. For example, in the above:
foo is always 'bar', so I don't need to mention it
sender and arg are correlated, so I only need to mention one or the other
and msg is different each time
So I would like to be able to regenerate these messages with a program along the lines of
write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )
where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
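As a rough, hypothetical sketch (in Python, since I haven't settled on a language), such a generated function might hard-code the constant field and the correlation, so only the two varying parameters remain:
# Hypothetical sketch: rebuild a full message from the two "important" parameters.
# 'foo' is constant and 'arg' is determined by 'sender', so neither is passed in.
def write_msg(sender, msg):
    arg = '900' if sender == 'client' else '800'
    return {'sender': sender, 'msg': msg, 'arg': arg, 'foo': 'bar'}

assert write_msg('client', '123')['arg'] == '900'
assert write_msg('server', '456')['arg'] == '800'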
Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?
The following papers describe algorithms for discovering functional dependencies:
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999, doi:10.1093/comjnl/42.2.100.
I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In Proc. AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174–185, Washington, DC, USA, 1993.
C. Wyss, C. Giannella, and E. Robertson. FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In Proc. Data Warehousing and Knowledge Discovery, pages 101–110, Munich, Germany, 2001, doi:10.1007/3-540-44801-2.
H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 2008, doi:10.1007/s10618-007-0083-9.
There has also been some work on discovering multivalued dependencies:
I. Savnik and P. A. Flach. Discovery of multivalued dependencies from relations. Intelligent Data Analysis, 4(3):195–211, IOS Press, 2000.
This looks very similar to Database Normalization.
You have a relation (your test data set) and some known functional dependencies ({sender} => arg, {} => foo, and possibly {msg} => sender; if the order of tests is important, add {testNr} => msg), and you want to eliminate redundancies.
Treat your test set as a database table, apply the normalization rules and create equivalent functions (getArgFromSender(sender) etc.) for each join.
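As a hypothetical illustration of "one function per join" (the table contents are just read off the example data, not part of this answer's analysis):
# Lookup tables derived from the functional dependencies above.
ARG_BY_SENDER = {'client': '900', 'server': '800'}  # {sender} => arg

def getArgFromSender(sender):
    return ARG_BY_SENDER[sender]

def getFoo():
    return 'bar'  # {} => foo: a constant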
If the number of fields and records is small:
Brute force it by looping through every combination of fields, and for each combination detect if there are multiple items in the list which map to the same value.
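A minimal brute-force sketch in Python, assuming each test is a list of dicts as in the question; it reports which (small) sets of fields determine each other field:
from itertools import combinations

def find_dependencies(messages, max_lhs=2):
    # Report (lhs_fields, target) pairs where the lhs values determine the target.
    fields = sorted(messages[0])
    deps = []
    for size in range(max_lhs + 1):
        for lhs in combinations(fields, size):
            for target in fields:
                if target in lhs:
                    continue
                seen = {}
                ok = True
                for m in messages:
                    key = tuple(m[f] for f in lhs)
                    if seen.setdefault(key, m[target]) != m[target]:
                        ok = False  # same lhs values map to two different targets
                        break
                if ok:
                    deps.append((lhs, target))
    return deps
On the example data this would report, among others, ((), 'foo') and (('sender',), 'arg').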
If you can live with a fairly good choice of fields:
Start off assuming you need all fields. Then select a field at random and see if it can be eliminated; if it can, cross it off the set of fields. Otherwise, choose another field at random and try again. If you find no fields can be eliminated, you've found a reasonable set of fields. Had you chosen other fields first, you might have found a better solution, so you can repeat the whole procedure a few times and pick the best result. This kind of approach is called hill climbing.
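A hedged sketch of that idea in Python; the helper can_reconstruct is an assumption of this sketch (it checks whether the remaining fields still uniquely identify every message):
import random

def can_reconstruct(messages, fields):
    # True if the chosen fields still distinguish every distinct message.
    seen = {}
    for m in messages:
        key = tuple(m[f] for f in fields)
        if seen.setdefault(key, m) != m:
            return False
    return True

def hill_climb(messages, restarts=10):
    all_fields = sorted(messages[0])
    best = list(all_fields)
    for _ in range(restarts):
        fields = list(all_fields)
        candidates = list(all_fields)
        random.shuffle(candidates)
        for f in candidates:
            trial = [x for x in fields if x != f]
            if can_reconstruct(messages, trial):
                fields = trial  # field f was redundant, drop it
        if len(fields) < len(best):
            best = fields
    return best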
(I suspect that this problem is NP-complete, i.e. we probably don't know of an efficient exact solution, so it is not worth losing sleep over trying to dream up a perfect one.)
I'm trying to retrieve all the task documents that have the string 'first' in their name.
I currently have the following code, but it only works if I pass the exact name:
res, err := db.client.Query(
    f.Map(
        f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
        f.Lambda("ref", f.Get(f.Var("ref"))),
    ),
)
I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.
Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and it messes up the pages.
FaunaDB provides a lot of constructs, this makes it powerful but you have a lot to choose from. With great power comes a small learning curve :).
How to read the code samples
To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:
const faunadb = require('faunadb')
const q = faunadb.query
const {
  Not,
  Abort,
  ...
} = q
You do have to be careful when you import Map like that, since it will conflict with JavaScript's built-in map. In that case, you can just use q.Map.
Option 1: using ContainsStr() & Filter
Basic usage according to the docs
ContainsStr('Fauna', 'a')
Of course, this works on a specific value so in order to make it work you need Filter and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:
q.Map(
  Paginate(Documents(Collection('tasks'))),
  Lambda(['ref'], Get(Var('ref')))
)
But we can do that more efficiently. Since one Get === one read and we don't need the documents themselves (we'll be filtering a lot of them out), it's useful to know that one index page is also just one read. So we can define an index as follows:
{
  name: "tasks_name_and_ref",
  unique: false,
  serialized: true,
  source: "tasks",
  terms: [],
  values: [
    {
      field: ["data", "name"]
    },
    {
      field: ["ref"]
    }
  ]
}
And since we added name and ref to the values, the index will return pages of name/ref pairs which we can then use to filter. For example, we can map over that page, which will return an array of booleans:
Map(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)
Since Filter also works on arrays, we can simply replace Map with Filter. We'll also add a LowerCase to ignore casing, and we have what we need:
Filter(
  Paginate(Match(Index('tasks_name_and_ref'))),
  Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)
In my case, the result is:
{
  "data": [
    [
      "Firstly, we'll have to go and refactor this!",
      Ref(Collection("tasks"), "267120709035098631")
    ],
    [
      "go to a big rock-concert abroad, but let's not dive in headfirst",
      Ref(Collection("tasks"), "267120846106001926")
    ],
    [
      "The first thing to do is dance!",
      Ref(Collection("tasks"), "267120677201379847")
    ]
  ]
}
Filter and reduced page sizes
As you mentioned, this is not exactly what you want, since it also means that if you request pages of size 500, they might be filtered down and you might end up with a page of size 3, then one of 7. You might think: why can't I just get my filtered elements in pages? Well, that's not a good idea for performance reasons, since it would mean checking each value. Imagine you have a massive collection and filter out 99.99 percent of it; you might have to loop over many elements to get to 500, and all of those checks cost reads. We want pricing to be predictable :).
Option 2: indexes!
Each time you want to do something more efficient, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies but you'll have to be a bit creative and I'm here to help you with that :).
Bindings
With index bindings, you can transform the attributes of your document. In our first attempt we will split the string into words (I'll implement multiple variants, since I'm not entirely sure which kind of matching you want).
We do not have a string split function, but since FQL is easily extended, we can write one ourselves, bind it to a variable in our host language (in this case JavaScript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib
function StringSplit(string, delimiter = " ") {
  return If(
    Not(IsString(string)),
    Abort("StringSplit only accepts strings"),
    q.Map(
      FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
      Lambda("res", LowerCase(Select(["data"], Var("res"))))
    )
  )
}
And use it in our binding.
CreateIndex({
  name: 'tasks_by_words',
  source: [
    {
      collection: Collection('tasks'),
      fields: {
        words: Query(Lambda('task', StringSplit(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  terms: [
    {
      binding: 'words'
    }
  ]
})
Hint: if you are not sure whether you have got it right, you can always put the binding in values instead of terms; then you'll see in the Fauna dashboard whether your index actually contains values.
What did we do? We just wrote a binding that transforms the value into an array of values at the time a document is written. When you index the array of a document in FaunaDB, these values are indexed separately yet all point to the same document, which will be very useful for our search implementation.
We can now find tasks that contain the string 'first' as one of their words by using the following query:
q.Map(
  Paginate(Match(Index('tasks_by_words'), 'first')),
  Lambda('ref', Get(Var('ref')))
)
Which will give me the document with name:
"The first thing to do is dance!"
The other two documents didn't contain the exact words, so how do we do that?
Option 3: indexes and Ngram (exact contains matching)
To get exact contains matching efficiently, you need to use a still-undocumented function called 'NGram' (it will be made easier to use in the future). Dividing a string into ngrams is a search technique that is often used under the hood in other search engines. In FaunaDB we can easily apply it thanks to the power of indexes and bindings. The Fwitter example has code in its source that does autocompletion. That example won't work for your use case since it's meant for autocompleting short strings, not for searching a short string within a longer string like a task name, but I reference it for other users.
We'll adapt it for your use case, though. When it comes to searching, it's all a tradeoff between performance and storage, and in FaunaDB users can choose their own tradeoff. Note that in the previous approach we stored each word separately; with ngrams we'll split words even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, which is why they let you define different algorithms).
What NGram essentially does is get substrings of a string of a certain length.
For example:
NGram('lalala', 3, 3)
Will return the substrings of length 3: 'lal', 'ala', and so on.
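If it helps to see the idea outside of FQL, here is a tiny, hypothetical Python equivalent of taking fixed-length ngrams of a string (not FaunaDB code):
def ngrams(text, n):
    # All substrings of length n, in order (duplicates included).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams('lalala', 3))  # ['lal', 'ala', 'lal', 'ala']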
If we know that we won't be searching for strings longer than a certain length, say length 10 (it's a tradeoff: increasing the size will increase storage requirements but allow you to query for longer strings), you can write the following ngram generator.
function GenerateNgrams(Phrase) {
  return Distinct(
    Union(
      Let(
        {
          // Reduce this array if you want fewer ngrams per word.
          indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
          indexesFiltered: Filter(
            Var('indexes'),
            // filter out 0, an ngram of length 0 makes no sense
            Lambda('l', GT(Var('l'), 0))
          ),
          ngramsArray: q.Map(
            Var('indexesFiltered'),
            Lambda('l', NGram(LowerCase(Phrase), Var('l'), Var('l')))
          )
        },
        Var('ngramsArray')
      )
    )
  )
}
You can then write your index as follows:
CreateIndex({
  name: 'tasks_by_ngrams_exact',
  // we actually want to sort to get the shortest word that matches first
  source: [
    {
      // If your collections have the same property that you want to access, you can pass a list of collections
      collection: [Collection('tasks')],
      fields: {
        wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
      }
    }
  ],
  terms: [
    {
      binding: 'wordparts'
    }
  ]
})
And you have an index-backed search where your pages are the size you requested.
q.Map(
  Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
  Lambda('ref', Get(Var('ref')))
)
Option 4: indexes and Ngrams of size 3 or trigrams (Fuzzy matching)
If you want fuzzy searching, trigrams (ngrams of size 3) are often used. In this case our index will be easy, so we're not going to use an external function.
CreateIndex({
  name: 'tasks_by_ngrams',
  source: {
    collection: Collection('tasks'),
    fields: {
      ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
    }
  },
  terms: [
    {
      binding: 'ngrams'
    }
  ]
})
If we were to place the binding in values again to see what comes out, we would see the distinct trigrams of each task name.
In this approach, we use trigrams both on the indexing side and on the querying side. On the querying side, that means that the word 'first' which we search for will also be divided into trigrams: 'fir', 'irs', 'rst'.
For example, we can now do a fuzzy search as follows:
q.Map(
  Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
  Lambda('ref', Get(Var('ref')))
)
In this case, we actually do 3 searches: we search for all of the trigrams and union the results, which returns all sentences that contain 'first'.
But even if we had misspelled it and written 'frst', we would still match all three, since there is a trigram ('rst') that matches.
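Purely to illustrate the principle outside FaunaDB, here is a hypothetical Python sketch of trigram-based fuzzy matching (the data is made up):
def trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

tasks = ["The first thing to do is dance!",
         "go to a big rock-concert abroad, but let's not dive in headfirst"]

query = "frst"  # misspelled on purpose
hits = [t for t in tasks if trigrams(t) & trigrams(query)]
print(hits)  # both tasks match, because they share the 'rst' trigram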
I'm having serious performance issues with a job that runs every day, and I don't think I can improve the algorithm on my own; so I'm going to explain the problem we have to solve and the algorithm we currently use, and maybe you have some other ideas to solve the problem better.
So the problem we have to solve is:
There is a set of Rules, ~120,000 rules.
Every rule has a set of combinations of Codes. Codes are basically strings. We have ~8 combinations per rule. Example of a combination: TTAAT;ZZUHH;GGZZU;WWOOF;SSJJW;FFFOLL
There is a set of Objects, ~800 objects.
Every object has a set of ~200 codes.
We have to check, for every rule, whether there is at least one combination of codes that is fully contained in the object's codes. That is:
loop over Rules
  loop over Combinations of the rule
    loop over Objects
      every code of the combination found in the Object?
        => create relationship rule/object and continue with the next object
    end of loop
  end of loop
end of loop
For example, if we have a rule with this combination of two codes: HHGGT; ZZUUF
And let's say we have an object with these codes: HHGGT; DHZZU; OIJUH; ZHGTF; HHGGT; JUHZT; ZZUUF; TGRFE; UHZGT; FCDXS
Then we create a relationship between the object and the rule, because every code of the rule's combination is contained in the object's codes. This is what the algorithm has to do.
As you can see this is quite expensive: in the worst-case scenario we need 120,000 x 8 x 800 ≈ 770 million iterations.
This is a simplified version of the real problem; what we actually do inside the loops is a bit more complicated, which is why we have to reduce the work somehow.
I've tried to think of a solution but I don't have any ideas!
Do you see something wrong here?
Best regards and thank you for the time :)
Something like this might work better, if I'm understanding correctly (this is in Python):
RULES = [
    ['abc', 'def'],
    ['aaa', 'sfd'],
    ['xyy', 'eff'],
]

OBJECTS = [
    ('rrr', 'abc', 'www', 'def'),
    ('pqs', 'llq', 'aaa', 'sdr'),
    ('xyy', 'hjk', 'fed', 'eff'),
    ('pnn', 'rrr', 'mmm', 'qsq'),
]

# Build a map from each code to the set of objects that contain it.
MapOfCodesToObjects = {}
for obj in OBJECTS:
    for code in obj:
        if code in MapOfCodesToObjects:
            MapOfCodesToObjects[code].add(obj)
        else:
            MapOfCodesToObjects[code] = set({obj})

# For each rule, intersect the object sets of all its codes.
RELATIONS = []
for rule in RULES:
    if len(rule) == 0:
        continue
    if rule[0] in MapOfCodesToObjects:
        ValidObjects = MapOfCodesToObjects[rule[0]]
    else:
        continue
    for i in range(1, len(rule)):
        if rule[i] in MapOfCodesToObjects:
            codeObjects = MapOfCodesToObjects[rule[i]]
        else:
            ValidObjects = set()
            break
        ValidObjects = ValidObjects.intersection(codeObjects)
        if len(ValidObjects) == 0:
            break
    for vo in ValidObjects:
        RELATIONS.append((rule, vo))

for R in RELATIONS:
    print(R)
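For the sample RULES and OBJECTS above, the printed relations would be (a quick sanity check I added, not part of the original answer):
(['abc', 'def'], ('rrr', 'abc', 'www', 'def'))
(['xyy', 'eff'], ('xyy', 'hjk', 'fed', 'eff'))
since no object contains both 'aaa' and 'sfd'.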
First you build a map of codes to objects. If there are nObj objects and nCodePerObj codes on average per object, this takes O(nObj * nCodePerObj) expected time, since each hash insertion is O(1) on average.
Next you iterate through the rules and look up each code of each rule in the map you built. There is a relation if a certain object occurs for every code in the rule, i.e. if it is in the set intersection of the object sets of all the rule's codes. Since hash lookups have O(1) average time complexity, and set intersection has time complexity O(min of the lengths of the two sets), this will take O(nRule * nCodePerRule * nObjectsPerCode). (Note that it is nObjectsPerCode, not nCodePerObj; the performance gets worse when one code is included in many objects.)
I have a system that contains x number of strings. These strings are shown in a UI based on some logic. For example, string number 1 should only show if the current time is past midday, and string 3 only shows if a randomly generated number between 0 and 1 is less than 0.5.
How would be the best way to model this?
Should the logic just be in code and be linked to a string by some sort of ID?
Should the logic somehow be stored with the strings?
NOTE: The above is a theoretical example, before people start questioning my logic.
It's usually better to keep resources (such as strings) separate from logic, so referring to strings by IDs is a good idea.
It seems that you have a bunch of rules which you have to link to the display of strings. I'd keep all three as separate entities: rules, strings, and the linking between them.
An illustration in Python, necessarily simplified:
STRINGS = {
    'morning': 'Good morning',
    'afternoon': 'Good afternoon',
    'luck': 'you must be lucky today',
}

# predicates
import datetime, random

def showMorning():
    return datetime.datetime.now().hour < 12

def showAfternoon():
    return datetime.datetime.now().hour >= 12

def showLuck():
    return random.random() > 0.5

# interconnection
RULES = {
    'morning': showMorning,
    'afternoon': showAfternoon,
    'luck': showLuck,
}

# usage
for string_id, predicate in RULES.items():
    if predicate():
        print STRINGS[string_id]
I'm trying to speed up a search function in a RoR app with a Postgres DB. I won't explain how it works currently... I'll just describe what I want to achieve!
I have x number of records (potentially a substantial number), each of which has an associated array of Facebook ID numbers, potentially up to 5k. I need to search against these with an individual's list of friend IDs to ascertain whether an intersection exists between the search array and any (and which) of the records' arrays.
I don't need to know the result of the intersection, just whether it's true or false.
Any bright ideas?!
Thanks!
Just using pure ruby since you don't mention your datastore:
friend_ids = user.friend_ids
results = records.select { |record| !(record.friend_ids & friend_ids).empty? }
results will contain all records that have at least 1 friend_id in common. This will not be very fast if you have to check a very large number of records.
& is the array intersection operator, which is implemented in C, you can see it here: http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-26
A probably faster version of #ctcherry's answer, especially when user.friend_ids has high cardinality:
require 'set'

user_friend_ids = Set.new(user.friend_ids)
results = records.select { |record|
  record.friend_ids.any? { |friend_id| user_friend_ids.include? friend_id }
}
Since this constructs the test set (hash) for user.friend_ids only once, it's probably also faster than the Array#memory_efficient_intersect linked by @Tass.
This may also be faster performed in the db, but without more info on the models, it's hard to compose an approach.
After fiddling around with dictionaries, I came to the conclusion that I need a data structure that allows an n-to-n lookup. One example: a course can be visited by several students, and each student can visit several courses.
What would be the most Pythonic way to achieve this? It won't be more than 500 students and 100 courses, to stay with the example, so I would like to avoid using real database software.
Thanks!
Since your working set is small, I don't think it is a problem to just store the student IDs as lists in the Course class. Finding students in a class would be as simple as doing
course.studentIDs
To find courses a student is in, just iterate over the courses and find the ID:
studentIDToGet = "johnsmith001"
studentsCourses = list()
for course in courses:
    if studentIDToGet in course.studentIDs:
        studentsCourses.append(course.id)
There are other ways you could do it. You could have a dictionary of studentIDs mapped to courseIDs, or two dictionaries - one mapping studentIDs to courseIDs and another mapping courseIDs to studentIDs - that, when updated, update each other.
The implementation I wrote the code for would probably be the slowest, which is why I mentioned that your working set is small enough for it not to be a problem. The other implementations I mentioned but did not show code for would require more code to make them work, which just isn't worth the effort here.
It depends completely on what operations you want the structure to be able to carry out quickly.
If you want to be able to quickly look up properties related to both a course and a student - for example how many hours a student has spent on a specific course, or what grade the student got if they have finished it - a vector containing n*m elements is probably what you need, where n is the number of students and m is the number of courses.
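A hedged sketch of that n*m layout in Python, added for illustration (the names and sizes are made up):
n_students, n_courses = 500, 100

# One slot per (student, course) pair; None means "not enrolled".
grades = [None] * (n_students * n_courses)

def slot(student_idx, course_idx):
    return student_idx * n_courses + course_idx

grades[slot(42, 7)] = 'B+'   # constant-time update
print(grades[slot(42, 7)])   # constant-time lookup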
If on the other hand the average number of courses a student has taken is much less than the total number of courses (which it probably is in a real scenario), and you want to be able to quickly look up all the courses a student has taken, you probably want an array of n lists - linked lists, resizable vectors or similar - depending on what you want to be able to do with the lists: maybe quickly remove elements in the middle, or quickly access an element at a random location. If you want both of those, then some kind of tree structure is probably the most suitable.
Most tree data structures carry out all basic operations in time logarithmic in the number of elements in the tree. Beware that some tree data structures have an amortized time on these operations that is linear in the number of elements, even though the average time for a randomly constructed tree would be logarithmic. A typical example is building a binary search tree from increasingly large elements. Don't do that; shuffle the elements before building the tree, or use a divide-and-conquer method: split the list into a pivot element and two halves, create the tree root from the pivot, then recursively build trees from the left and right parts (again by divide and conquer) and attach them to the root as the left and right child respectively.
I'm sorry, I don't know Python, so I don't know which data structures are part of the language and which you have to create yourself.
I assume you want to index both the students and courses. Otherwise you can easily make a list of tuples to store all (Student, Course) combinations: [ (St1, Crs1), (St1, Crs2), ..., (St2, Crs1), ..., (Sti, Crsi), ... ] and then do a linear lookup every time you need to. For up to 500 students this ain't bad either.
However, if you'd like a quick lookup either way, there is no builtin data structure. You can simply use two dictionaries:
courses = { crs1: [ st1, st2, st3 ], crs2: [ st_i, st_j, st_k] ... }
students = { st1: [ crs1, crs2, crs3 ], st2: [ crs_i, crs_j, crs_k] ... }
For a given student s, looking up courses is now students[s]; and for a given course c, looking up students is courses[c].
For something simple like what you want to do, you could create a simple class with data members and methods to maintain them and keep them consistent with each other. For this problem two dictionaries would be needed. One keyed by student name (or id) that keeps track of the courses each is taking, and another that keeps track of which students are in each class.
defaultdicts from the 'collections' module could be used instead of plain dicts to make things more convenient. Here's what I mean:
from collections import defaultdict

class Enrollment(object):
    def __init__(self):
        self.students = defaultdict(set)
        self.courses = defaultdict(set)

    def clear(self):
        self.students.clear()
        self.courses.clear()

    def enroll(self, student, course):
        if student not in self.courses[course]:
            self.students[student].add(course)
            self.courses[course].add(student)

    def drop(self, course, student):
        if student in self.courses[course]:
            self.students[student].remove(course)
            self.courses[course].remove(student)
            # remove student if they are not taking any other courses
            if len(self.students[student]) == 0:
                del self.students[student]

    def display_course_enrollments(self):
        print "Class Enrollments:"
        for course in self.courses:
            print '  course:', course,
            print ' ', [student for student in self.courses[course]]

    def display_student_enrollments(self):
        print "Student Enrollments:"
        for student in self.students:
            print '  student', student,
            print ' ', [course for course in self.students[student]]

if __name__ == '__main__':
    school = Enrollment()

    school.enroll('john smith', 'biology 101')
    school.enroll('mary brown', 'biology 101')
    school.enroll('bob jones', 'calculus 202')

    school.display_course_enrollments()
    print
    school.display_student_enrollments()

    school.drop('biology 101', 'mary brown')
    print
    print 'After mary brown drops biology 101:'
    print
    school.display_course_enrollments()
    print
    school.display_student_enrollments()
Which when run produces the following output:
Class Enrollments:
course: calculus 202 ['bob jones']
course: biology 101 ['mary brown', 'john smith']
Student Enrollments:
student bob jones ['calculus 202']
student mary brown ['biology 101']
student john smith ['biology 101']
After mary brown drops biology 101:
Class Enrollments:
course: calculus 202 ['bob jones']
course: biology 101 ['john smith']
Student Enrollments:
student bob jones ['calculus 202']
student john smith ['biology 101']