I have a main data store which has a big set of perfectly ordinary records, which might look like (all examples here are pseudocode):
class Person {
    string FirstName;
    string LastName;
    int Height;
    // and so on...
}
I have a supplementary data structure I'm using for answering statistical questions efficiently. It's computed from the main data store, and it's a dictionary that looks like:
// { (field_name, field_value) => count }
Dictionary<Tuple<string, object>, int>;
For example, one entry of the dictionary might be:
(LastName, "Smith") => 345
which means in 345 of the Person records, the LastName field is "Smith" (or was, at the time this dictionary was last computed).
What is this supplementary dictionary called? I think it'd be easier to talk about if it had a proper name.
I might call it a "histogram", if I was to print the entire thing graphically (but it's just a data structure, not a visual representation). If I stored the locations of these values (instead of just their count) I might call it an "inverted index".
I think you have found the most appropriate name already: frequency table or frequency distribution.
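As a rough illustration, here is a minimal Java sketch of building such a frequency table from the records. The Person class, the key helper and the field names are assumptions made up for this example; they only mirror the pseudocode in the question.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencyTable {
    // Hypothetical record type mirroring the Person pseudocode in the question.
    static class Person {
        final String firstName;
        final String lastName;
        final int height;

        Person(String firstName, String lastName, int height) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.height = height;
        }
    }

    // Builds the (field name, field value) key used by the table.
    static Map.Entry<String, Object> key(String field, Object value) {
        return new SimpleImmutableEntry<String, Object>(field, value);
    }

    // Counts how many records carry each (field name, field value) pair.
    static Map<Map.Entry<String, Object>, Integer> build(List<Person> people) {
        Map<Map.Entry<String, Object>, Integer> counts = new HashMap<>();
        for (Person p : people) {
            counts.merge(key("FirstName", p.firstName), 1, Integer::sum);
            counts.merge(key("LastName", p.lastName), 1, Integer::sum);
            counts.merge(key("Height", p.height), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("Ann", "Smith", 170),
            new Person("Bob", "Smith", 182),
            new Person("Cy", "Jones", 170));
        Map<Map.Entry<String, Object>, Integer> counts = build(people);
        System.out.println(counts.get(key("LastName", "Smith"))); // 2
        System.out.println(counts.get(key("Height", 170)));       // 2
    }
}
```

Like the dictionary in the question, the table is a snapshot: it has to be rebuilt or updated whenever the main data store changes.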
Hullo everyone,
This has been discussed a bit before, but it's one of those things where there is so much scattered discussion resulting in various proposed "hacks" that I'm having a hard time determining what I should do.
I would like to use the result of a query as an argument for another nested query.
query {
  allStudents {
    nodes {
      courseAssessmentInfoByCourse(courseId: "2b0df865-d7c6-4c96-9f10-992cd409dedb") {
        weightedMarkAverage
        # getting result for specific course is easy enough
      }
      coursesByStudentCourseStudentIdAndCourseId {
        nodes {
          name
          # would like to be able to do something like this
          # to get a list of all the courses and their respective
          # assessment infos
          assessmentInfoByStudentId (studentId: student_node.studentId) {
            weightedMarkAverage
          }
        }
      }
    }
  }
}
Is there a way of doing this that is considered to be best practice?
Is there a standard way to do it built into GraphQL now?
Thanks for any help!
The only means to substitute values in a GraphQL document is through variables, and these must be declared in your operation definition and then included alongside your document as part of your request. There is no inherent way to reference previously resolved values within the same document.
If you get to a point where you think you need this functionality, it's generally a symptom of poor schema design in the first place. What follows are some suggestions for improving your schema, assuming you have control over that.
For example, minimally, you could eliminate the studentId argument on assessmentInfoByStudentId altogether. coursesByStudentCourseStudentIdAndCourseId is a field on the student node, so its resolver can already access the student's id. It can pass this information down to each course node, which can then be used by assessmentInfoByStudentId.
That said, you're probably better off totally rethinking how you've got your connections set up. I don't know what your underlying storage layer looks like, or the shape your client needs the data to be in, so it's hard to make any specific recommendations. However, for the sake of example, let's assume we have three types -- Course, Student and AssessmentInfo. A Course has many Students, a Student has many Courses, and an AssessmentInfo has a single Student and a single Course.
We might expose all three entities as root level queries:
query {
  allStudents {
    # fields
  }
  allCourses {
    # fields
  }
  allAssessmentInfos {
    # fields
  }
}
Each node could have a connection to the other two types:
query {
  allStudents {
    courses {
      edges {
        node {
          id
        }
      }
    }
    assessmentInfos {
      edges {
        node {
          id
        }
      }
    }
  }
}
If we want to fetch all students, and for each student know what courses s/he is taking and his/her weighted mark average for that course, we can then write a query like:
query {
  allStudents {
    assessmentInfos {
      edges {
        node {
          id
          weightedMarkAverage
          course {
            id
            name
          }
        }
      }
    }
  }
}
Again, this exact schema might not work for your specific use case but it should give you an idea around how you can approach your problem from a different angle. A couple more tips when designing a schema:
Add filter arguments on connection fields, instead of creating separate fields for each scenario you need to cover. A single courses field on a Student type can have a variety of arguments like semester, campus or isPassing -- this is cleaner and more flexible than creating different fields like coursesBySemester, coursesByCampus, etc.
If you're dealing with aggregate values like average, min, max, etc., it might make sense to expose those values as fields on each connection type, in the same way a count field is sometimes available alongside the nodes field. There's a [proposal](https://github.com/prisma/prisma/issues/1312) for Prisma that illustrates one fairly neat way to handle these aggregate values. Doing something like this would mean that if you already have, for example, an Assessment type, a connection field might be sufficient to expose aggregate data about that type (like grade averages) without needing to expose a separate AssessmentInfo type.
Filtering is relatively straightforward; grouping is a bit tougher. If you do find that you need the nodes of a connection grouped by a particular field, again this may be best done by exposing an additional field on the connection itself, [like Gatsby does it](https://www.gatsbyjs.org/docs/graphql-reference/#group).
I have a list of 50 subjects.
I also have a list of 1000 schools, each of which teaches one or more of these subjects.
Every time I search for a school, I'm thinking of caching that school together with the subjects it teaches. What would be a good way to store this data efficiently?
I suggest using a hash table, with the school as the key and the list of its subjects as the value. The cost of a hash table's insert, delete and search operations depends on how you handle collisions (several keys hashing to the same index), for example with a list per bucket or with open addressing and double hashing. Note that subjects being shared between many schools does not cause hash collisions by itself, because the subjects are values, not keys. If you want to implement it yourself, a good hash function (one that distributes the keys uniformly over the indices) and a simple list for collisions give you insertion of a school in O(1) on average, while searching for or deleting a subject within a school's list takes at most 50 steps, since there are only 50 subjects. I think that is good enough (and simple and fast to implement) for this problem. More on hash tables and how to implement them: https://en.wikipedia.org/wiki/Hash_table
I would store this information within a HashMap. A HashMap stores key-value pairs, so you would be able to map each school to the subjects taught at the school.
Here's an example of what the code might look like in Java:
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class Main {
    public static void main(String[] args) {
        // Creates a HashMap with a String as the key and a String[] as the value.
        Map<String, String[]> schools = new HashMap<String, String[]>();
        // Name of the college
        String college = "College";
        // Subjects offered at said college
        String[] subjects = {"Physics", "Calculus", "Algorithms"};
        // Stores the college and its subjects in the HashMap
        schools.put(college, subjects);
        // Looks up a college stored in the map and prints its subjects.
        System.out.println(Arrays.toString(schools.get("College")));
    }
}
This code prints:
[Physics, Calculus, Algorithms]
I am currently playing around with Mahout. I purchased the book Mahout in Action.
I understand the overall process, and I have already had success with simple test data sets.
Now I have a classification problem that I would like to solve.
The target variable has been identified; for now I'll call it x.
The existing data in our database has already been classified with -1, 0 and +1.
We defined several predictor variables, which we select with an SQL query.
These are the product's attributes: language, country, category (of the shop), title, description.
Now I want to write them directly to a SequenceFile, for which I wrote a little helper class that appends to the sequence file each time a new row of the SQL result set has been processed:
public void appendToFile(String classification, String databaseID, String language, String country, String vertical, String title, String description) {
    int count = 0;
    Text key = new Text();
    Text value = new Text();
    key.set("/" + classification + "/" + databaseID);
    //??value.set(message);
    try {
        this.writer.append(key, value);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
If I only had the title, I could simply store it in the value. But how do I store multiple values, like country, language and so on, under that particular key?
Thanks for any help!
You shouldn't be storing structures in a sequence file; just dump all the text you have, separated by spaces.
It's simply a place to put all your content for term counting and the like when using something like Naive Bayes, which doesn't care about structure.
Then, once you have the classification, look up the structure in your database.
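A minimal sketch of that "dump everything, separated by spaces" step, in plain Java with no Hadoop dependency. The RowText class and its join helper are hypothetical names for this example; the assumption is that the resulting string is what you would pass to value.set(...) in appendToFile.

```java
import java.util.ArrayList;
import java.util.List;

class RowText {
    // Joins the non-empty fields of one row into a single space-separated string.
    static String join(String... fields) {
        List<String> parts = new ArrayList<>();
        for (String field : fields) {
            // Skip NULL columns and blank strings so no stray spaces end up in the text.
            if (field != null && !field.trim().isEmpty()) {
                parts.add(field.trim());
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String text = join("en", "US", "Books", "Some title", null, "A description");
        System.out.println(text); // en US Books Some title A description
    }
}
```

The key can stay as it is in the question, so the database ID is still available for looking up the structured row afterwards.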
I have a list of strings in which I want to find a particular value and return it.
If I only want to check for existence, I can use a HashSet instead of a list:
HashSet<string> data = new HashSet<string>();
bool contains = data.Contains("lokendra");
But for the list I am using Find, because I also want to get the matching value back.
I found that this method is time consuming. The method containing this code is called more than 1000 times, and the list holds approximately 20000 to 25000 items. Is there any other way to make the search faster?
List<Employee> employeeData = new List<Employee>();
var result = employeeData.Find(element => element.name == "lokendra");
Is there a LINQ method or any other approach that makes retrieving data from a search faster?
Please help.
public struct Employee
{
    public string role;
    public string id;
    public int salary;
    public string name;
    public string address;
}
I have a list of this structure, and if the name field matches the value "lokendra", I want to return the whole object. Consider the list to be the employee data.
Just as HashSet gives a faster search, is there a way to search for an item and return it quickly, other than Find?
It sounds like what you actually want is a Dictionary<string, Employee>. Build that once, and you can query it efficiently many times. You can build it from a list of employees easily:
// Assumes names are unique; ToDictionary throws on duplicate keys.
var employeesByName = employees.ToDictionary(e => e.name);
...
Employee employee;
if (employeesByName.TryGetValue(name, out employee))
{
    // Yay, found the employee
}
else
{
    // Nope, no employee with that name
}
EDIT: Now I've seen your edit... please don't create struct types like this. You almost certainly want a class instead, and one with properties rather than public fields...
You can try employeeData.FirstOrDefault(e => e.name == "lokendra"), but it still needs to iterate over the collection, so it will perform about the same as the list's Find method.
If your list content is set only once and then searched again and again, you should consider implementing your own solution:
sort the list before the first search
use binary search (which is O(log n), instead of O(n) for the standard Find and Where)
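The two steps above can be sketched as follows. The sketch is in Java rather than C#, and the trimmed-down Employee class and the method names are assumptions for illustration; the idea carries over directly to a sorted List<Employee> in C#.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class EmployeeSearch {
    // Hypothetical, trimmed-down version of the Employee type from the question.
    static class Employee {
        final String id;
        final String name;

        Employee(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Step 1: sort a copy of the list by name, once, before the first search.
    static List<Employee> sortedByName(List<Employee> employees) {
        List<Employee> sorted = new ArrayList<>(employees);
        sorted.sort(Comparator.comparing((Employee e) -> e.name));
        return sorted;
    }

    // Step 2: binary search on the sorted list; returns null if the name is absent.
    static Employee findByName(List<Employee> sorted, String name) {
        int lo = 0;
        int hi = sorted.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = sorted.get(mid).name.compareTo(name);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return sorted.get(mid);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Employee> sorted = sortedByName(Arrays.asList(
            new Employee("3", "zoe"),
            new Employee("1", "lokendra"),
            new Employee("2", "adam")));
        Employee hit = findByName(sorted, "lokendra");
        System.out.println(hit == null ? "not found" : hit.id); // 1
    }
}
```

If several employees share a name, this returns one of them, not necessarily the first in the original list.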
I'm trying to get a query result ordered by a calculated field, in Play.
This is the query I'm using:
return all().order("points").fetch();
where points is defined as
public Integer points;
and is retrieved through this getter:
public int getPoints() {
    List<EventVote> votesP = votes.filter("isPositive", true).fetch();
    List<EventVote> votesN = votes.filter("isPositive", false).fetch();
    this.points = votesP.size() - votesN.size();
    return this.points;
}
The getter is correctly called when I do
int votes = objectWithPoints.points;
I have the feeling I'm expecting a bit too much of Siena, but I would love this to work (or some similar code). Currently it just skips the order condition. Ordering on any other field works correctly.
I think you're right when you say you're expecting a bit too much :)
The Siena query all().order("points").fetch() performs a request to the DB.
So it orders the values stored in the DB, not the ones in your program.
From what you say, you have a getter, getPoints, which computes a value.
If you don't store this value in the database, Siena can't perform the ordering.
So either you compute the value, set it in your object and save the object to the DB:
objectWithPoints.points = objectWithPoints.getPoints();
objectWithPoints.save();
Or you order the values yourself in your program after computing them.
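For the second option, a minimal sketch of ordering in the program could look like this. The Event class here is a hypothetical stand-in with plain vote counters instead of Siena queries, and ascending order is an assumption about what order("points") was meant to return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PointsOrdering {
    // Hypothetical stand-in for the model; the real getter would query the votes.
    static class Event {
        final String title;
        final int positiveVotes;
        final int negativeVotes;

        Event(String title, int positiveVotes, int negativeVotes) {
            this.title = title;
            this.positiveVotes = positiveVotes;
            this.negativeVotes = negativeVotes;
        }

        // Computed value, never stored in the database.
        int getPoints() {
            return positiveVotes - negativeVotes;
        }
    }

    // Sorts a copy of the fetched list by the computed points, lowest first.
    static List<Event> orderByPoints(List<Event> fetched) {
        List<Event> sorted = new ArrayList<>(fetched);
        sorted.sort(Comparator.comparingInt(Event::getPoints));
        return sorted;
    }

    public static void main(String[] args) {
        List<Event> fetched = Arrays.asList(
            new Event("a", 5, 1),   // 4 points
            new Event("b", 2, 3),   // -1 points
            new Event("c", 4, 2));  // 2 points
        for (Event e : orderByPoints(fetched)) {
            System.out.println(e.title + " " + e.getPoints());
        }
    }
}
```

With the getter from the question, every call to getPoints runs two queries, so for large result sets storing the value in the DB is likely the cheaper of the two options.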