I have high scores (name, score, time) stored in a single file, and I split them into three separate arrays after reading them in. The only problem is that I can't figure out a way to sort all three by score and time, from least to greatest, while still keeping each name paired with its own score and time.
For example:
Name score time
---------------
nathan 123 01:12
bob 321 55:32
frank 222 44:44
turns to:
bob 123 01:12
frank 222 44:44
nathan 321 55:32
Encapsulate the data into a single object (HighScore) that has three properties: name, time, and score. Then store them in a single array and sort the single array.
Welcome to object-oriented programming.
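Here's a minimal sketch of that idea, assuming Java (the class and field names are just illustrative):

import java.util.*;

class HighScore {
    final String name;
    final int score;
    final String time;   // e.g. "01:12"; a Duration would also work

    HighScore(String name, int score, String time) {
        this.name = name;
        this.score = score;
        this.time = time;
    }

    public static void main(String[] args) {
        // One list of HighScore objects instead of three parallel arrays.
        List<HighScore> scores = new ArrayList<>(List.of(
                new HighScore("nathan", 123, "01:12"),
                new HighScore("bob", 321, "55:32"),
                new HighScore("frank", 222, "44:44")));

        // Sort by score, then by time, least to greatest; each name keeps its own score and time.
        scores.sort(Comparator.comparingInt((HighScore h) -> h.score)
                              .thenComparing(h -> h.time));

        scores.forEach(h -> System.out.println(h.name + " " + h.score + " " + h.time));
    }
}

Because each record travels as one object, there is nothing to keep in sync.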
I have a table with more than 3 million rows of names (name, surname, father's name), and I want to check for similarity above 90%.
I have used many fuzzy algorithms, including the utl_match similarities (jaro_winkler, edit_distance), but their performance is not good (more than 20 seconds).
I also want to check with the fields swapped around, but that takes too long. Like:
Name Surname Fathername,
Name Fathername Surname,
Surname Name Fathername,
.........
I couldn't find any algorithm with good performance, and this has to run on a transactional system.
You don't need to check against all 3 million names every time, because you have duplicates in your database. What you can also do is pre-cluster your entries, picking representatives that are as far apart from each other as possible, and then use those as different entry points.
So in the first step create your entry points:
Miller
Smith
Yang
...
Then check against these entry points with a threshold of, say, > 70%, and only descend into the clusters where you get a reasonably good match. This should prune most of the searches and make your algorithm much faster.
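A rough sketch of that two-level lookup (the similarity() stub below stands in for whatever fuzzy measure you already use, such as jaro_winkler; all names here are made up):

import java.util.*;

class ClusteredFuzzySearch {
    // Placeholder for your existing fuzzy measure (Jaro-Winkler, edit distance, ...).
    static double similarity(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;   // stub only
    }

    // clusters: representative name -> all names assigned to that representative.
    static List<String> findMatches(String query, Map<String, List<String>> clusters) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, List<String>> cluster : clusters.entrySet()) {
            // Cheap first level: compare only against the few representatives.
            if (similarity(query, cluster.getKey()) > 0.70) {
                // Full 90% check only inside clusters with a decent representative match.
                for (String name : cluster.getValue()) {
                    if (similarity(query, name) > 0.90) {
                        candidates.add(name);
                    }
                }
            }
        }
        return candidates;
    }
}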
When comparing two objects of the same size, JaVers compares them 1-to-1. However, if a change is introduced, such as a new row added to one of the objects, the comparison reports changes that are NOT actually changes. Is it possible to have JaVers ignore the addition/deletion so that it only compares like objects?
Basically the indices get out of sync.
Row Name Age Phone(Cell/Work)
1 Jo 20 123
2 Sam 25 133
3 Rick 30 152
4 Rick 30 145
New List
Row Name Age Phone(Cell/Work)
1 Jo 20 123
2 Sam 25 133
3 Bill 30 170
4 Rick 30 152
5 Rick 30 145
Because Bill was added, the new comparison result will say that rows 4 and 5 have changed when they actually didn't.
Thanks.
I'm guessing that your 'rows' are objects representing rows in an Excel table, and that you have mapped them as ValueObjects and put them into some list.
Since ValueObjects don't have their own identity, it's unclear, even for a human, what the actual change was. Take a look at your row 4:
Row Name Age Phone(Cell/Work)
before:
4 Rick 30 145
after:
4 Rick 30 152
Did you change the Phone at row 4 from 145 to 152? Or did you insert new data at row 4? How can we know?
We can't. By default, JaVers chooses the simplest answer, so it reports a value change at index 4.
If you don't care about the indices, you can change the list comparison algorithm from SIMPLE to Levenshtein distance. See https://javers.org/documentation/diff-configuration/#list-algorithms
SIMPLE algorithm generates changes for shifted elements (in case when elements are inserted or removed in the middle of a list). On the contrary, Levenshtein algorithm calculates short and clear change list even in case when elements are shifted. It doesn’t care about index changes for shifted elements.
But I'm not sure if Levenshtein is implemented for ValueObjects; if it isn't implemented yet, it's a feature request for javers-core.
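If it does work for your case, switching the algorithm is just a builder option. A sketch using the JaversBuilder API; the Row class and the two lists are placeholders for your own types:

import org.javers.core.Javers;
import org.javers.core.JaversBuilder;
import org.javers.core.diff.Diff;
import org.javers.core.diff.ListCompareAlgorithm;

Javers javers = JaversBuilder.javers()
        .withListCompareAlgorithm(ListCompareAlgorithm.LEVENSHTEIN_DISTANCE)
        .build();

// Row is your own mapped class; oldRows/newRows are the two lists being compared.
Diff diff = javers.compareCollections(oldRows, newRows, Row.class);
System.out.println(diff);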
Based on GoodData's excellent suggestion for implementing Fact tables, I have been able to design a model that meets our client’s requirements for joining different attributes across different tables. The issue I have now is that the model metrics are highly denormalized, with data repeating itself. I am currently trying to figure out a way to dedupe results.
For example, I have two tables—the first is a NAMES table and the second is my fact table:
NAMES
Val2 Name
35 John
36 Bill
37 Sally
FACT
VAL1 VAL2 SCORE COURSEGRADE
1 35 50 90%
2 35 50 80%
3 35 50 60%
4 36 10 75%
5 37 40 95%
What I am trying to do is write a metric in such a way that we can get an average of SCORE that eliminates the duplicate values. GoodData is excellent in that it can actually give me back the unique results using the COUNT(VARIABLE1,RECORD) metric, but I can't seem to get the average score to stick when eliminating the breakout information. If I keep all fields (including VAL2), it shows me everything:
VAL2 SCORE(AVG)
35 50
36 10
37 40
AVG: 33.33
But when I remove VAL2, I suddenly lose the "uniqueness" of the record.
SCORE(AVG)
40
What I want is the score of 33.33 we got above.
I’ve tried using a BY statement in my SELECT AVG(SCORE), but this doesn’t seem to work. It’s almost like I need some kind of DISTINCT clause. Any thoughts on how to get that rollup value shown in my first example above?
Happy to help here. I would try the following:
Create an intermediate metric (let's call it Score by Employee):
SELECT MIN( SCORE ) BY ID ALL IN ALL OTHER DIMENSIONS
Then, once you have this metric defined you should be able to create a metric for the average score as follows:
SELECT AVG( Score by Employee )
The reason we create the first metric is to force the table to normalize score around the ID attribute which gets rid of duplicates when we use this in the next metric (we could have used MAX or AVG also, it doesn't matter).
Hopefully this solves your issue, let me know if it doesn't work and I'll be happy to help out more. Also feel free to check out GoodData's Developer Portal for more information about reporting:
https://developer.gooddata.com/docs/reporting
Best,
JT
You should definitely check the "How to build a metric in a metric" presentation by Petr Olmer (http://www.slideshare.net/petrolmer/in10-how-to-build-a-metric-in-a-metric).
It can help you to understand it better.
Cheers,
Peter
I have to search a 25 GB corpus of Wikipedia for a single word. I used grep, but it takes a lot of time. Is there an efficient and easy representation that would make this search quick? Also, I want to find exact matches.
Thank you.
You would probably want to build an index mapping each word to a list of locations (byte offsets). The list of words would be sorted alphabetically. You could then have a secondary index of where certain letters start in this large list of words.
Lazy hash | Word index | Corpus
aaa starts at X | aaa | lorem ipsum dolor
aab starts at Y | ... | sit amet .....
aac ... | and 486, 549, 684, ... | ...
... ... | |
zzz ... | |
This is the way advocated by the natural language professor at my department (we did this exercise as a lab in an algorithm course).
Have you tried using an indexing engine... say, Lucene with Nutch? Lucene is an indexing engine; Nutch is a web crawler. Combine the power!
I forgot to mention... CouchDB (http://couchdb.apache.org/)
I've had success with the Boyer-Moore algorithm and its simplified version. There are implementations for various languages floating around the web.
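For reference, a minimal sketch of the simplified variant (Boyer-Moore-Horspool) over raw bytes, just to show the bad-character skip idea:

class Horspool {
    // Return the first index of pattern in text, or -1 if absent.
    static int indexOf(byte[] text, byte[] pattern) {
        int m = pattern.length;
        if (m == 0) return 0;
        int[] shift = new int[256];
        java.util.Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;   // distance from this byte to the pattern's end
        }
        int pos = 0;
        while (pos <= text.length - m) {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;                              // full match
            pos += shift[text[pos + m - 1] & 0xFF];             // bad-character skip
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] corpus = "lorem ipsum dolor sit amet".getBytes();
        System.out.println(indexOf(corpus, "dolor".getBytes()));   // prints 12
    }
}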
#aloobe had the answer of using an index file that mapped words to locations. I just want to expound upon this, though I think the answer the OP is looking for may just be Boyer-Moore.
The index file would look like this (simplified to use human-readable 2-digits):
53 17 89 03
77 79 29 39
88 01 05 15
...
Each entry above is the byte offset of a word or letter that you've deemed important enough to index. In practice, you won't index individual letters, since the index file would then be larger than your corpus!
The trick is, if you were to replace each location with the word found at that location, your index file would read as an alphabetically-sorted version of the corpus:
and and are as
ate bad bat bay
bear best bin binge
This enables you to do Binary Search on the corpus through the index file. If you are searching for the word "best" above, you would grab the middle entry in the index file, 79. Then you would go to position/byte 79 in the corpus and see what word is there. It is bad. We know that alphabetically best > bad, so the position must be in the 2nd half of the index file.
So we grab the middle entry of the remaining half, between the 7th (29) and the 12th (15): the 9th, which is 88. We look at position/byte 88 in the corpus and find bear. best > bear, so we try again; the middle index is now either 01 (10th) or 05 (11th), depending on how you round. Either way, we'll clearly find best in 1 or 2 more searches. If we have 12 words like the example, it takes at most 4 searches in the worst case. For a 25GB file with an average word length of, say, 5 letters plus a space between words, that's ~4 billion words, yet in the worst case you will only search ~32 times. At that point, more of your program's time is spent spinning up the disk and buffering input than actually searching!
This method works with duplicate words as well. If you want to find all of the locations of the word the, you would binary search on the until you found the index. Then you would subtract 1 from the position in the index file repeatedly, using the value each time to look into the corpus. If the word at that location is still the, continue. When you finally stop, you have the earliest index in the index file that maps to the.
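A sketch of that lookup in Java (the file names, the fixed 8-byte offset records, and the readWordAt helper are assumptions for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;

class CorpusSearch {
    // Binary search the corpus through an index file of word-sorted 64-bit byte offsets.
    static long find(String target) throws IOException {
        try (RandomAccessFile index = new RandomAccessFile("index.bin", "r");
             RandomAccessFile corpus = new RandomAccessFile("corpus.txt", "r")) {
            long lo = 0, hi = index.length() / 8 - 1;
            while (lo <= hi) {
                long mid = lo + (hi - lo) / 2;
                index.seek(mid * 8);
                long offset = index.readLong();              // byte offset of some word in the corpus
                int cmp = target.compareTo(readWordAt(corpus, offset));
                if (cmp == 0) return offset;                 // found an occurrence
                if (cmp < 0) hi = mid - 1; else lo = mid + 1;
            }
            return -1;                                       // not in the corpus
        }
    }

    // Read the whitespace-delimited word starting at the given byte offset.
    static String readWordAt(RandomAccessFile corpus, long offset) throws IOException {
        corpus.seek(offset);
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = corpus.read()) != -1 && !Character.isWhitespace(c)) {
            word.append((char) c);
        }
        return word.toString();
    }
}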
The creation of the index file is the only tough part. You need to go through each word in the corpus, building up a data structure of the words and their indices. Along the way, skip words that are too common or short to be listed, like "a", "I", "the", "and", "is", etc. When you are finished, you can take that data structure and turn it into an index file. For a 25GB file, your indices will need to be > 32 bits, unfortunately, so use a long (in Java) or long long (in C) to hold it. There's no reason it should be human readable for you, so write the indices out as 64 bit values, not strings.
The structure I would recommend is a self-balancing binary search tree. Each node is a string value (the word) and index. The tree compares nodes based only on the string, however. If you do this, then in-order traversal (left, node, right) will give you exactly the index file.
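In Java, a TreeMap can stand in for that tree: it keeps keys sorted, so iterating it and writing the offsets out produces the index file directly (a sketch; filling positions during the corpus scan is omitted):

import java.io.*;
import java.util.*;

class IndexWriter {
    // positions: word -> byte offsets where it occurs, collected while scanning the corpus.
    static void writeIndex(TreeMap<String, List<Long>> positions, String path) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            // TreeMap iterates keys alphabetically, which is exactly the in-order traversal above.
            for (List<Long> offsets : positions.values()) {
                for (long offset : offsets) {
                    out.writeLong(offset);   // raw 64-bit offsets, not human-readable strings
                }
            }
        }
    }
}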
Hope this helps! An example I used years ago developing a mobile phone dictionary is Jim Breen's EDICT. It may be difficult to pick up because of the EUC-encoding and Japanese characters, but the intent is the same.
A sort is said to be stable if it maintains the relative order of elements with equal keys. I guess my question is really, what is the benefit of maintaining this relative order? Can someone give an example? Thanks.
It enables your sort to 'chain' through multiple conditions.
Say you have a table with first and last names in random order. If you sort by first name, and then by last name, the stable sorting algorithm will ensure people with the same last name are sorted by first name.
For example:
Smith, Alfred
Smith, Zed
Will be guaranteed to be in the correct order.
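In Java, for instance, Collections.sort and List.sort are stable, so two passes give exactly that behaviour (a small sketch; the Person class is only for illustration):

import java.util.*;

class Person {
    final String first, last;
    Person(String first, String last) { this.first = first; this.last = last; }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<>(List.of(
                new Person("Zed", "Smith"),
                new Person("Alfred", "Smith"),
                new Person("Peter", "Wilson")));

        // Pass 1: sort by the secondary key (first name).
        people.sort(Comparator.comparing(p -> p.first));
        // Pass 2: stable sort by the primary key (last name);
        // the Smiths keep the Alfred-before-Zed order from pass 1.
        people.sort(Comparator.comparing(p -> p.last));

        people.forEach(p -> System.out.println(p.last + ", " + p.first));
    }
}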
A sorting algorithm is stable if it preserves the order of duplicate keys.
OK, fine, but why should this be important? Well, the question of "stability" in a sorting algorithm arises when we wish to sort the same data more than once according to different keys.
Sometimes data items have multiple keys. For example, perhaps a (unique) primary key such as a social insurance number, or a student identification number, and one or more secondary keys, such as city of residence, or lab section. And we may very well want to sort such data according to more than one of the keys. The trouble is, if we sort the same data according to one key, and then according to a second key, the second key may destroy the ordering achieved by the first sort. But this will not happen if our second sort is a stable sort.
From Stable Sorting Algorithms
A priority queue is an example of this. Say you have this:
(1, "bob")
(3, "bill")
(1, "jane")
If you sort this from smallest to largest number, an unstable sort might do this.
(1, "jane")
(1, "bob")
(3, "bill")
...but then "jane" got ahead of "bob" even though it was supposed to be the other way around.
Generally, they are useful for sorting multiple entries in multiple steps.
Not all sorting is based upon the entire value. Consider a list of people. I may only want to sort them by their names, rather than all of their information. With a stable sorting algorithm, I know that if I have two people named "John Smith", then their relative order is going to be preserved.
Last First Phone
-----------------------------
Wilson Peter 555-1212
Smith John 123-4567
Smith John 012-3456
Adams Gabriel 533-5574
Since the two "John Smith"s are already "sorted" (they're in the order I want them), I won't want them to change positions. If I sort these items by last, then first with an unstable sorting algorithm, I could end up either with this:
Last First Phone
-----------------------------
Adams Gabriel 533-5574
Smith John 123-4567
Smith John 012-3456
Wilson Peter 555-1212
Which is what I want, or I could end up with this:
Last First Phone
-----------------------------
Adams Gabriel 533-5574
Smith John 012-3456
Smith John 123-4567
Wilson Peter 555-1212
(You see the two "John Smith"s have switched places). This is NOT what I want.
If I used a stable sorting algorithm, I would be guaranteed to get the first option, which is what I'm after.
An example:
Say you have a data structure that contains pairs of phone numbers and employees who called them. A number/employee record is added after each call. Some phone numbers may be called by several different employees.
Furthermore, say you want to sort the list by phone number and give a bonus to the first 2 people who called any given number.
If you sort with an unstable algorithm, you may not preserve the order of callers of a given number, and the wrong employees could be given the bonus.
A stable algorithm makes sure that the right 2 employees per phone number get the bonus.
It means that if you want to sort by Album AND by Track Number, you can click Track Number first, so tracks are sorted, then click Album Name, and the track numbers remain in the correct order within each album.
One case is when you want to sort by multiple keys. For example, to sort a list of first name / surname pairs, you might sort first by the first name, and then by the surname.
If your sort was not stable, then you would lose the benefit of the first sort.
The advantage of stable sorting for multiple keys is dubious: you can always use a comparison that compares all the keys at once. It's only an advantage if you're sorting one field at a time, as when clicking on a column heading; Joe Koberg gives a good example.
Any sort can be turned into a stable sort if you can afford to add a sequence number to the record, and use it as a tie-breaker when presented with equivalent keys.
The biggest advantage comes when the original order has some meaning in and of itself. I couldn't come up with a good example, but I see JeffH did so while I was thinking about it.
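A quick sketch of that sequence-number trick in Java (List.sort is already stable there, so this only matters when the underlying sort is not; the names are illustrative):

import java.util.*;

class StabilizedSort {
    // Make any comparator effectively stable by breaking ties on original position
    // (assumes the list holds distinct record objects).
    static <T> void sortWithTieBreaker(List<T> items, Comparator<T> byKey) {
        Map<T, Integer> originalIndex = new IdentityHashMap<>();
        for (int i = 0; i < items.size(); i++) {
            originalIndex.put(items.get(i), i);   // remember each record's input position
        }
        items.sort(byKey.thenComparing(originalIndex::get));
    }
}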
Let's say you are sorting an input set that has two fields, and you only sort on the first. The '|' character divides the fields.
In the input set you have many entries, but 3 of them look like this:
.
.
.
AAA|towing
.
.
.
AAA|car rental
.
.
.
AAA|plumbing
.
.
.
Now, when you are done sorting, you expect all the entries with AAA in them to be together.
A stable sort will give you:
.
.
.
AAA|towing
AAA|car rental
AAA|plumbing
.
.
.
That is, the three records that had the same sort key, AAA, appear in the output in the same order they were in the input. Note that they are not sorted on the second field, because you didn't sort on the second field of the record.
An unstable sort will give you:
.
.
.
AAA|plumbing
AAA|car rental
AAA|towing
.
.
.
Note that the records are still sorted only on the first field, and the order of the second field differs from the input order.
An unstable sort has the potential to be faster. A stable sort tends to mimic what non-computer-scientist, non-math folks have in mind when they sort something; for example, if you did an insertion sort by hand with index cards, you would most likely end up with a stable sort.
You can't always compare all the fields at once. A couple of examples: (1) memory limits, where you are sorting a large disk file, and there isn't room for all the fields of all records in main memory; (2) Sorting a list of base class pointers, where some of the objects may be derived subclasses (you only have access to the base class fields).
Also, stable sorts have deterministic output given the same input, which can be important for debugging and testing.