Is there any way to sort a result set in GT.M by a specific value?
Let's say I have a global variable ^People(name,surname)=age and I want to get all the people aged between 20 and 40, ordered by their age.
Edit: Sorry to Answer instead of continuing the comment chain... my permissions level isn't high enough to comment on this question yet.
As long as you won't have multiple people with the same name but different surnames, Yogesh is correct: it would be best to use the index to find the correct node of ^People, then grab what you need from ^People. But if you could have multiple people with the same name but different surnames, then you would want to include the surname in the ^PeopleAgeIndex subscripts.
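In GT.M itself that means maintaining an index global such as ^PeopleAgeIndex(age,name,surname)="" alongside ^People and walking it with $ORDER. I can't test GT.M here, so below is a rough Ruby analogue of the same idea; the sample data and variable names are my own invention, not from the original post:

# Hypothetical in-memory stand-in for ^People(name,surname)=age.
people = {
  ["Smith", "Anna"] => 34,
  ["Jones", "Bob"]  => 25,
  ["Kask",  "Mari"] => 52,
}

# Build the index, analogous to SET ^PeopleAgeIndex(age,name,surname)="".
index = Hash.new { |h, k| h[k] = [] }
people.each { |(name, surname), age| index[age] << [name, surname] }

# Walk ages 20..40 in order, analogous to $ORDER over the index global.
index.keys.sort.each do |age|
  next unless (20..40).cover?(age)
  index[age].sort.each do |name, surname|
    puts "#{name} #{surname}, age #{age}"
  end
end

The point is the same in either language: because age is the first subscript of the index, an ordered traversal of the index is automatically an ordered traversal by age.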
A store has n customers, and any of them can visit at any time throughout the year. The data is stored in a file. Design a data structure to determine whether a given person visited on a given date.
Could anyone suggest a data structure I should use in this case?
I'd suggest this: store each visit on its own line, with the customer name first and then the date. You can separate them with commas or something.
Here are some examples:
Name,Date
Name, Date
Name|Date
Name | Date
Just choose whichever will be easiest for you to use and to retrieve the information from correctly, using string .split or .substring.
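For example, with the comma-separated format, a lookup can be a single pass over the file. Here is a minimal Ruby sketch; the file name visits.txt and the date format are my own choices, not part of the question:

# Assumes "visits.txt" holds one "Name,Date" record per line.
def visited?(file, name, date)
  File.foreach(file).any? do |line|
    n, d = line.chomp.split(",", 2)
    n == name && d == date
  end
end

puts visited?("visits.txt", "Alice", "2012-03-14")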
The problem statement does not say whether or not a customer may visit the store several times during the year, so assuming they can, I would use a Map data structure, where the key is the customer's name and the value is the set of dates on which that customer visited the store. The data can be persisted to the file as XML.
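A minimal Ruby sketch of that structure (the XML persistence is left out, and the names and dates are invented for illustration):

require "set"

# Map each customer name to the set of dates they visited.
visits = Hash.new { |h, k| h[k] = Set.new }
visits["Alice"] << "2012-03-14"
visits["Alice"] << "2012-04-01"
visits["Bob"]   << "2012-03-14"

# Constant-time membership test: did Alice visit on 2012-03-14?
puts visits["Alice"].include?("2012-03-14")  # => true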
I have the feeling this has been asked before, but I have been searching and cannot find a clear description.
I have a Rails app that holds items that occur on a specific date (like birthdays). Now I would like to make a view that creates a table (or something else; divs are all right as well) that states a given date once and then iterates over the related items one by one.
Items have a date field and are, of course, not related to a date in a separate table or something.
I can of course query the database ~30 times (as I want a representation of one month's worth of items), but that looks ugly and would be massively repetitive. I would like the outcome to look like this (consider it a table with two columns for the time being):
Jan/1 | jan1.item1.desc
| jan1.item2.desc
| jan1.item3.desc
Jan/2 | jan2.item1.desc
| etc.
So I think I need to know two things: how to construct a correct query (though it could be as simple as Item.where("date > ? AND date < ?", lower_bound, upper_bound)) and how to translate that into the view.
I have also thought about a hash with a key for each individual day and an array of items as the value, but I'd have to construct that as above (repetition), which I expect is not very elegant.
Using GROUP BY does not seem to give me anything different to work with than other queries (apart from the grouping of the items, of course): just an array of objects. But I might be doing this wrong.
Sorry if it is a basic question. I am relatively new to the field (and programming in general).
If you're making a calendar, you probably want to GROUP BY date:
SELECT COUNT(*) AS instances, DATE(`date`) AS on_date FROM items GROUP BY DATE(`date`)
This presumes your column is literally called date, which, seeing as that's a SQL reserved word, is probably a bad idea. If so, you'll need to escape it wherever it's used, with backticks here in MySQL notation. Postgres and others use a different approach.
For instances in a range, what you want is probably the BETWEEN operator:
@items = Item.where("`date` BETWEEN ? AND ?", lower_bound, upper_bound)
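From there, one way to get the date-once layout is to run that single query and group the records in Ruby for the view. A sketch, assuming an Item model with date and desc columns (names taken from the question's mock-up):

# Controller: one query for the month, then group in memory.
@items_by_date = Item.where("`date` BETWEEN ? AND ?", lower_bound, upper_bound).group_by(&:date)

# View (ERB): print each date once, then its items.
# <% @items_by_date.sort.each do |date, items| %>
#   <%= date.strftime("%b/%-d") %>
#   <% items.each do |item| %>
#     <%= item.desc %>
#   <% end %>
# <% end %>

group_by here is Ruby's Enumerable method, not SQL GROUP BY, which sidesteps the problem you ran into: you keep the full objects, just bucketed by date.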
Let's say I have a list of persons in my datastore. Each person may have the following fields:
last name (*)
first name
middle name
id (*)
driving licence id (*)
another id (*)
date of birth
region
place of birth
At least one of the fields marked with (*) must exist.
Now the user provides me with the same list of fields (and again, at least one of the fields marked with (*) must be provided). I should search for the person the user described. But not all fields have to match, and I should somehow show the user how confident I am in the search results. Something like:
if the person matched by id and last name (and the user provided just these 2 fields for the search), then I am sure the result is correct (100%);
if the person matched by id and last name (and the user provided other fields which were found in the database but did not match), then I am only about 60% sure the result is correct;
etc.
(the numbers are provided just as an example)
How can I organize such a search? Is there any standard algorithm? I would also like to minimize the number of requests to the database.
P.S. I cannot show the user the actual field values from the database.
It sounds like your logic for determining the quality of a match will be too complex to handle at the database layer. I think you'll get the best performance by retrieving all of the records that match at least one of the mandatory keys, calculating the match score for each of them in memory, and returning the best score. For example, if the user provides you with an id, last name, and place of birth, your query would look something like:
SELECT * FROM users WHERE id = 'the_id' OR last_name = 'the_last_name';
This could be a performance problem if you have a VERY large dataset with lots of common last names, but otherwise I would not expect to see too many collisions. You can check this on your own dataset outside of GAE. You could also get better performance, if all mandatory fields MUST match, by changing the OR to an AND.
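A hedged Ruby sketch of the in-memory scoring step; the weights are made up for illustration, and the field names follow the id / last name / place of birth example above:

# Illustrative weights per matched field; tune these for your data.
WEIGHTS = { id: 50, last_name: 30, place_of_birth: 20 }

# Score one candidate record against the user's input:
# sum the weights of the fields that match.
def match_score(candidate, query)
  matched = WEIGHTS.keys.select { |f| query[f] && candidate[f] == query[f] }
  matched.sum { |f| WEIGHTS[f] }
end

candidates = [
  { id: 1, last_name: "Smith", place_of_birth: "Oslo" },
  { id: 2, last_name: "Smith", place_of_birth: "Bergen" },
]
query = { id: 1, last_name: "Smith", place_of_birth: "Bergen" }

best = candidates.max_by { |c| match_score(c, query) }
puts match_score(best, query)  # => 80, i.e. id and last name matched

The percentages you show the user would then come from these scores, without ever exposing the underlying field values.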
I have a dilemma that I've encountered before. What's best, in terms of usability, when displaying personal names in a table? Should there be a single column for the name? If so, is "firstname lastname" or "lastname, firstname" preferable? Or would a column for "firstname" and a column for "lastname" be best? I'm thinking in terms of the user's desire to sort the columns. I like having a column for each name component because I can imagine that in some cases the first name will be more important to the user, whereas in other cases the last name would be more important.
I would assume that many out there have had this dilemma and am looking for pearls of wisdom based on past experience.
Definitely have a column for each part. That gives you much more flexibility: you could sort by surname, but print "firstname surname", for example.
If you don't have the screen real estate for a column for each part, you can combine them into a single string whose format represents the sort order. Each click on the column header cycles to the next sort order. For example:
Default: sort by last, first (ASC)
Bimbleman, Wally P.
Zonkenstein, Arnold Q.
1st click: sort by last, first (DESC)
Zonkenstein, Arnold Q.
Bimbleman, Wally P.
2nd click: sort by first, middle, last (ASC)
Arnold Q. Zonkenstein
Wally P. Bimbleman
3rd click: sort by first, middle, last (DESC)
Wally P. Bimbleman
Arnold Q. Zonkenstein
etc...
It's easier to read an entire name this way (vs. having it span several columns), it takes up less screen real estate, and it frees you from having to decide on a single format and sort order.
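A small Ruby sketch of that click cycle, using the names from the example above; the state table and formats are my own illustration of the idea, not a prescribed implementation:

Person = Struct.new(:first, :middle, :last)

PEOPLE = [
  Person.new("Wally", "P.", "Bimbleman"),
  Person.new("Arnold", "Q.", "Zonkenstein"),
]

# Each header click advances to the next (sort order, direction) state.
STATES = [
  [:last_first, :asc], [:last_first, :desc],
  [:first_last, :asc], [:first_last, :desc],
]

def render(people, state)
  order, dir = state
  key = order == :last_first ? ->(p) { [p.last, p.first] } : ->(p) { [p.first, p.middle, p.last] }
  fmt = order == :last_first ? ->(p) { "#{p.last}, #{p.first} #{p.middle}" } : ->(p) { "#{p.first} #{p.middle} #{p.last}" }
  sorted = people.sort_by(&key)
  sorted.reverse! if dir == :desc
  sorted.map(&fmt)
end

clicks = 0  # each header click would do: clicks += 1
puts render(PEOPLE, STATES[clicks % STATES.size])
# => Bimbleman, Wally P.
#    Zonkenstein, Arnold Q.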
As far as I know, each country has its own rules for sorting names: some sort by first name and some by last name. I believe the right answer here depends on your app: how many users will appear in those columns, and which users (age/nationality/context) are going to use it?
Really, I agree with Skilldrick: a good UI has at least separate columns for first and last names...
But don't forget that CONSISTENCY in a UI is actually more important and makes things usable, by giving the end user an implied expectation of how things are done.
You might consider calling the fields "Given Name" and "Family Name" to account for people who put their family name first. Of course this doesn't cover everyone (some people only have a given name) but it might reduce potential confusion with Chinese and Japanese names, for example.
In most cases you will find that these fields cover most scenarios: Title, Firstname, Middlename, Lastname.
In most systems I have worked with here in Australia, data is sorted by last name in the default display. Also, if the screen provides search, the Lastname field usually comes before Firstname. Sorting by first name is just as common, though, so your system should always allow the view to switch to sorting by first name.
Here is a solution for a single column. I don't think separate columns can be scanned and read as quickly, although I don't have any data to back that up.
The primary focus of a user-oriented solution should be to display names as they would be read aloud, i.e. Title Firstname Middlename Lastname.
For most domains where the names are known to the user, sorting by first name is acceptable. Here is an example where a person's title is ignored in the sorting, and the sort field is clear because it is highlighted:
Arnold Q. Zonkenstein
Mr. David Cliff
Marty P. Bimbleman
For formal business oriented applications, the default sorting could be by surname. You can preserve reading order, while still sorting by last name, again using highlighting:
Marty P. Bimbleman
Mr. David Cliff
Arnold Q. Zonkenstein
If you want the sort field to be configurable, use an explicit control such as a checkbox; the solution of clicking the column heading multiple times to cycle between sort fields will be jarring to the user (toggling the sort direction by clicking on the heading is more acceptable).
IMO this is the simplest solution without any compromises.
I've never built a matching algorithm before and don't really know where to start. So here is my basic setup and why I'm doing it. Feel free to correct me if I'm not asking the right questions.
I have a database of names and unique identifiers for people. Several generated identifiers (internally generated and some third party), last name, first name, and birth date are the primary ones that I would be using.
Several times throughout the year I receive a list from a third party that needs to be imported and tied to the existing people in my database but the data is never as clean as mine. IDs could change, birth dates could have typos, names could have typos, last names could change, etc.
Each import could have 20,000 records, so even if it's 99% accurate, that's still 200 records I'd have to go in and match manually. I think I'm looking for more like 99.9% accuracy when it comes to matching the incoming people to my users.
So, how do I go about making an algorithm that can figure this out?
P.S. Even if you don't have an exact answer, knowing of some materials to reference would also be helpful.
P.P.S. Some examples would be similar to what m3rLinEz wrote:
ID: 9876234  Fname: Jose     LName: Guitierrez        Birthdate: 01/20/84  - Original
ID: 9876234  Fname: Jose     LName: Guitierrez        Birthdate: 10/20/84  - Typo in birth date
ID: 0876234  Fname: Jose     LName: Guitierrez        Birthdate: 01/20/84  - Wrong ID
ID: 9876234  Fname: Jose     LName: Guitierrez-Brown  Birthdate: 01/20/84  - Hyphenated last name
ID: 9876234  Fname: Jose, A. LName: Guitierrez        Birthdate: 01/20/84  - Added middle initial
ID: 3453555  Fname: Joseph   LName: Guitierrez        Birthdate: 01/20/84  - Probably someone else with the same birth date and last name
You might be interested in Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
You could compare each of your fields and compute the total distance, and by trial and error you may discover an appropriate threshold below which records are interpreted as a match. I have not implemented this myself, just thought of the idea :}
For example:
Record A - ID: 4831213321, Name: Jane
Record B - ID: 431213321, Name: Jann
Record C - ID: 4831211021, Name: John
The distance between A and B will be lower than between A and C or between B and C, which indicates a better match.
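To make that concrete, here is a Ruby sketch: a standard dynamic-programming Levenshtein distance, then a naive total over the two fields of the example records above (summing the fields with equal weight is my own simplification):

# Classic dynamic-programming Levenshtein distance.
def levenshtein(a, b)
  d = Array.new(a.length + 1) { |i| [i] + [0] * b.length }
  (0..b.length).each { |j| d[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

# The example records, with the total distance across both fields.
a = { id: "4831213321", name: "Jane" }
b = { id: "431213321",  name: "Jann" }
c = { id: "4831211021", name: "John" }

def total_distance(x, y)
  levenshtein(x[:id], y[:id]) + levenshtein(x[:name], y[:name])
end

puts total_distance(a, b)  # => 2 (one dropped digit, one changed letter)
puts total_distance(a, c)  # => 5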
When it comes to something like this, do not reinvent the wheel. The Levenshtein distance is probably your best bet if you HAVE to do this yourself, but otherwise, do some research on existing solutions that do database queries and fuzzy searches. They've been doing it longer than you; it'll probably be better, too.
Good luck!
If you're dealing with data sets of this size and different resources being imported, you may want to look into an Identity Management solution. I'm mostly familiar with Sun Identity Manager, but it may be overkill for what you're trying to do. Still, it might be worth looking into.
If the data you are getting from third parties is consistent (same format each time), I'd probably create a table for each of the third parties you are getting data from, and import each new set of data into the same table each time. You can then join the two tables on their common columns with a SQL statement; that way you can perform SQL queries that pull data from multiple tables but make it look like it came from one single unified table. Similarly, records that were added but don't have matches in both tables can be found and then manually paired. This way you keep your 'clean' data separate from the junk you get from third parties. If you wanted a true import, you could then use that joined table to create a third table containing all your data.
I would start with the easy, near-100%-certain matches and handle them first, so that you're left with a list of, say, 200 that need fixing.
For the remaining rows you can use a simplified version of Bayes' Theorem.
For each unmatched row, calculate the likelihood that it is a match for each row in your data set, assuming that the data contains certain changes which occur with certain probabilities. For example, a person changes their surname with probability 0.1% (possibly also depending on gender), changes their first name with probability 0.01%, and has a single typo with probability 0.2% (use Levenshtein distance to count the number of typos). Other fields also change with certain probabilities. For each row, calculate the likelihood that the row matches, considering all the fields that have changed. Then pick the one with the highest probability of being a match.
For example, a row with only a small typo in one field but equal on all others would have a 0.2% chance of being a match, whereas a row which differs in many fields might have only a 0.0000001% chance. So you pick the row with the small typo.
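A hedged Ruby sketch of that scoring, using the illustrative probabilities above; the field names and sample data are drawn from the question's examples, and treating the changes as independent events is a simplifying assumption:

# Illustrative change probabilities from the text above.
P_SURNAME_CHANGE   = 0.001   # 0.1%
P_FIRSTNAME_CHANGE = 0.0001  # 0.01%
P_TYPO             = 0.002   # 0.2% per single-character edit

# Likelihood that `row` is the same person as `incoming`, treating
# field changes as independent. `typos` would come from a Levenshtein
# distance over the remaining fields (e.g. the birth date).
def match_likelihood(row, incoming, typos)
  p = 1.0
  p *= P_SURNAME_CHANGE   if row[:surname] != incoming[:surname]
  p *= P_FIRSTNAME_CHANGE if row[:first]   != incoming[:first]
  p *= P_TYPO ** typos
  p
end

existing = [
  { first: "Jose",   surname: "Guitierrez" },
  { first: "Joseph", surname: "Guitierrez" },
]
incoming = { first: "Jose", surname: "Guitierrez" }

best = existing.max_by { |row| match_likelihood(row, incoming, 0) }
puts best[:first]  # => "Jose" (likelihood 1.0 vs 0.0001 for "Joseph")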
Regular expressions are what you need. Why reinvent the wheel?