Titan: removing many vertices has poor performance

Improving performance when deleting multiple vertices?
I have a big graph in Titan (using Rexster and Cassandra) from which I would like to be able to remove a potentially large portion.
I want to delete everything a user "owns" that isn't in the provided list of IDs, and return the list of IDs of what they still "own" in the graph. I have tried these two approaches, both of which take very long. I am testing with the worst-case scenario, where list_of_ids_to_keep is empty ([]).
//A naive list-based approach
def delete_multiple(g, user_id, list_of_ids_to_keep) {
    def u = get_or_create_user(g, user_id)
    def user_things = u.outE('owns').inV().collect{it.id}
    def removees = user_things - list_of_ids_to_keep
    g.v(*removees).remove() //This is the slow part of this approach
    g.commit()
    return (user_things - removees)
}
//Approach using a hash to potentially reduce the memory footprint and O-notation, but it is still taking a long time.
def delete_multiple_hash(g, user_id, list_of_ids_to_keep) {
    def u = get_or_create_user(g, user_id)
    def user_things = u.outE('owns').inV().collect{it.id}
    def inUserGraph = [:]
    def keepers = []
    list_of_ids_to_keep.each { k_id ->
        inUserGraph[k_id] = true
    }
    user_things.each { k_id ->
        if (!inUserGraph[k_id]) {
            g.v(k_id).remove() //Again: the remove is the slow part
        } else {
            keepers << k_id
        }
    }
    g.commit()
    return keepers
}
Is there a way to improve the performance of this using some sort of bulk operation, turning off some checks, doing an async query, or applying some other strategy?
I have already enabled "storage.batch-loading", which I thought might improve performance. I measure the query times to be about 150 seconds, but then nothing is committed, so I believe it is timing out.
The layout of the graph is:
User (-owns->) A Thing (-contains->) A Description (can be shared by multiple "A Thing"s)
Numbers:
User -owns-> ~= 200K "A Thing"s. (g.v(user_id).outE('owns').count() ~= 200K)
A Thing -contains-> ~= 100 "A Description"s (g.v(a_thing_id).outE('contains').count() ~= 100)
A Description <-contains- ~= 100 "A Thing"s (g.v(a_description_id).inE('contains').count() ~= 100)

Related

How to fix a 'SystemStackError: stack level too deep' error with a large amount of ActiveRecord conditions

I have an ActiveRecord query which iteratively adds OR conditions to a search:
places.each do |place|
  people = people.or(People.within_radius_of(place.latitude, place.longitude, place.radius).select(:id))
end

scope :within_radius_of, ->(lat = nil, lon = nil, radius = 100_000) {
  where(earth_box(lat, lon, radius)).where(earth_distance_less_than(lat, lon, radius)) if lat.present? && lon.present?
}
earth_box and earth_distance_less_than are just functions that generate the PostgreSQL SQL for earth_box and earth_distance queries.
This works without issue with our original maximum target of 100 places, but when 600 places were added to a search I got the "stack level too deep" error.
What are the implications of increasing the stack size? The default is 1 MB; what issues might occur if I set it to, for example, 8 MB or more?
Can this query be rewritten to avoid the error?

Spark imbalanced partitions after leftOuterJoin

I have a pattern like this... pseudo-code, but I think it makes sense...
type K // key, a function of records in B
class A // compact data structure
val a: RDD[(K, A)] // many records

class B { // massive data structure
  def funcIter // does full O(n) scans of huge data structure
}
val b: RDD[(K, B)] // comparatively few records
val emptyB = new B("", Nil, etc.)

val C: RDD[(A, B)] = {
  a
    .leftOuterJoin(b) // with a ~1.5x increase in partitions
    .map { case (k, (val_a, option_b)) => (val_a, option_b.getOrElse(emptyB)) }
    .map { case (val_a, val_b) => (val_a, val_b.funcIter(val_a.attributes)) }
}
My problem is that records in val b vary enormously in size, some of them quite huge, and since it's a leftOuterJoin, each of those records is replicated thousands or tens of thousands of times to join to val a. So it's not just that there are large values in b to handle, but that the worst-case records in b end up copied many times into one partition after the join. The worst partitions are made up almost exclusively of many copies of only the worst-case values from b, and my last few partitions take ages to work through while most of my enormous cluster sits idle, draining my wallet.
Is there anything I can do to modify this pattern? Try broadcasting b and joining against a in place (it's probably too big)? Or split partitions after the join, separating copies of the worst b values into different partitions without doing another shuffle, like the opposite of a coalesce, so that at least multiple executors on the same core instance (I have 3 executors per core instance) can work on those records in parallel?
Thanks for any advice.
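One standard mitigation for this kind of join skew is key salting: append a random salt to the key on the large side and replicate the small side once per salt value, so copies of a hot b record end up in different partitions. The sketch below is in PySpark rather than the Scala above, and NUM_SALTS, a, b and empty_b are stand-ins for the values in the question, so treat it as an illustration of the idea, not a drop-in fix.

# Sketch of key salting for a skewed leftOuterJoin (PySpark, assumed names).
import random

NUM_SALTS = 16  # how many ways to split each hot key; tune to the observed skew

# Large side: tag every key with a random salt so one hot key spreads over NUM_SALTS keys.
a_salted = a.map(lambda kv: ((kv[0], random.randrange(NUM_SALTS)), kv[1]))

# Small side: emit each record once per salt so every salted key still finds its match.
b_replicated = b.flatMap(lambda kv: [((kv[0], s), kv[1]) for s in range(NUM_SALTS)])

# The join now distributes copies of a hot b record across up to NUM_SALTS partitions.
joined = a_salted.leftOuterJoin(b_replicated)

# Mirrors the getOrElse(emptyB) step above; empty_b is an assumed default value.
result = joined.map(lambda kv: (kv[1][0], kv[1][1] if kv[1][1] is not None else empty_b))

The trade-off is that every b record is shuffled NUM_SALTS times, but since b has comparatively few records that is usually much cheaper than one straggler partition holding all the worst-case copies.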

Faster: Iterate list or JPQL query?

What would be faster to find a match: iterate over a list, or perform a JPQL query? On average of course, because this also depends on where the match is in the list. Would the answer depend on the list size?
Example: find the Person with name "Joe" (unique name)
JPQL:
TypedQuery<Person> q = em.createQuery(
    "SELECT p " +
    "FROM Person p " +
    "WHERE p.name = :name ", Person.class);
q.setParameter("name", "Joe");
return q.getResultList().size() == 1;
Iterate:
for (Person p : persons) {
    if ("Joe".equals(p.name)) {
        return true;
    }
}
return false;
Whether finding the specific person will be faster in Java or in the DB really depends on your specific situation, and if you're having problems with performance here, you should just time both implementations.
There are a bunch of things to consider. A big one is, where does persons come from in your iteration example? Do you already have a list of people on hand, or do you have to SELECT all the people out of the db right before that? If you are going to query the Person table anyway, you're probably better off just throwing that simple where clause on.
Also, the number of results may matter if there are a lot. If you only have 5 people in your db then any difference in implementation speed will probably be so small you won't notice. If you have a couple million then those differences will be amplified.
My gut feeling is that it probably doesn't make a significant difference either way for your case and you should use whichever you are more comfortable with. If this is really a performance bottleneck though, then you need to time them both before you start worrying about fiddling with tables, indexes, threading, whatever.

Best clustering algorithm? (simply explained)

Imagine the following problem:
You have a database containing about 20,000 texts in a table called "articles"
You want to connect the related ones using a clustering algorithm in order to display related articles together
The algorithm should do flat clustering (not hierarchical)
The related articles should be inserted into the table "related"
The clustering algorithm should decide whether two or more articles are related or not based on the texts
I want to code in PHP but examples with pseudo code or other programming languages are ok, too
I've coded a first draft with a function check() which gives "true" if the two input articles are related and "false" if not. The rest of the code (selecting the articles from the database, selecting articles to compare with, inserting the related ones) is complete, too. Maybe you can improve the rest, too. But the main point which is important to me is the function check(). So it would be great if you could post some improvements or completely different approaches.
APPROACH 1
<?php
$zeit = time();

function check($str1, $str2) {
    $minprozent = 60;
    similar_text($str1, $str2, $prozent);
    $prozent = sprintf("%01.2f", $prozent);
    if ($prozent > $minprozent) {
        return TRUE;
    } else {
        return FALSE;
    }
}

$sql1 = "SELECT id, text FROM articles ORDER BY RAND() LIMIT 0, 20";
$sql2 = mysql_query($sql1);
while ($sql3 = mysql_fetch_assoc($sql2)) {
    $rel1 = "SELECT id, text, MATCH (text) AGAINST ('".$sql3['text']."') AS score FROM articles WHERE MATCH (text) AGAINST ('".$sql3['text']."') AND id NOT LIKE ".$sql3['id']." LIMIT 0, 20";
    $rel2 = mysql_query($rel1);
    $rel2a = mysql_num_rows($rel2);
    if ($rel2a > 0) {
        while ($rel3 = mysql_fetch_assoc($rel2)) {
            if (check($sql3['text'], $rel3['text']) == TRUE) {
                $id_a = $sql3['id'];
                $id_b = $rel3['id'];
                $rein1 = "INSERT INTO related (article1, article2) VALUES ('".$id_a."', '".$id_b."')";
                $rein2 = mysql_query($rein1);
                $rein3 = "INSERT INTO related (article1, article2) VALUES ('".$id_b."', '".$id_a."')";
                $rein4 = mysql_query($rein3);
            }
        }
    }
}
?>
APPROACH 2 [only check()]
<?php
function square($number) {
    $square = pow($number, 2);
    return $square;
}

function check($text1, $text2) {
    $words_sub = text_splitter($text2); // splits the text into single words
    $words = text_splitter($text1); // splits the text into single words

    // document 1 start
    $document1 = array();
    foreach ($words as $word) {
        if (in_array($word, $words)) {
            if (isset($document1[$word])) { $document1[$word]++; } else { $document1[$word] = 1; }
        }
    }
    $rating1 = 0;
    foreach ($document1 as $temp) {
        $rating1 = $rating1 + square($temp);
    }
    $rating1 = sqrt($rating1);
    // document 1 end

    // document 2 start
    $document2 = array();
    foreach ($words_sub as $word_sub) {
        if (in_array($word_sub, $words)) {
            if (isset($document2[$word_sub])) { $document2[$word_sub]++; } else { $document2[$word_sub] = 1; }
        }
    }
    $rating2 = 0;
    foreach ($document2 as $temp) {
        $rating2 = $rating2 + square($temp);
    }
    $rating2 = sqrt($rating2);
    // document 2 end

    $skalarprodukt = 0;
    for ($m = 0; $m < count($words) - 1; $m++) {
        $skalarprodukt = $skalarprodukt + (array_shift($document1) * array_shift($document2));
    }
    if (($rating1 * $rating2) == 0) { return FALSE; } // avoid division by zero
    $kosinusmass = $skalarprodukt / ($rating1 * $rating2);
    if ($kosinusmass < 0.7) {
        return FALSE;
    } else {
        return TRUE;
    }
}
?>
I would also like to say that I know there are lots of clustering algorithms, but every site gives only the mathematical description, which is a bit difficult for me to understand. So coding examples in (pseudo) code would be great.
I hope you can help me. Thanks in advance!
The most standard way I know of to do this on text data like yours is to use the 'bag of words' technique.
First, create a 'histogram' of words for each article. Let's say that across all your articles you only have 500 unique words. Then this histogram is going to be a vector (array, list, whatever) of size 500, where the data is the number of times each word appears in the article. So if the first spot in the vector represented the word 'asked', and that word appeared 5 times in the article, vector[0] would be 5:
for word in article.text.split():
    article.histogram[indexLookup[word]] += 1
Now, to compare any two articles, it is pretty straightforward. We simply multiply the two vectors:
def check(articleA, articleB):
    rtn = 0
    for a, b in zip(articleA.histogram, articleB.histogram):
        rtn += a * b
    return rtn > threshold
(Sorry for using python instead of PHP, my PHP is rusty and the use of zip makes that bit easier)
This is the basic idea. Notice the threshold value is semi-arbitrary; you'll probably want to find a good way to normalize the dot product of your histograms (this will almost have to factor in the article length somewhere) and decide what you consider 'related'.
Also, you should not just put every word into your histogram. You'll, in general, want to include the ones that are used semi-frequently: Not in every article nor in only one article. This saves you a bit of overhead on your histogram, and increases the value of your relations.
By the way, this technique is described in more detail here
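As a rough illustration of the normalization mentioned above, here is a minimal Python sketch that turns the raw dot product into a cosine similarity; the whitespace tokenization and the 0.7 cutoff (chosen to match Approach 2 above) are just assumptions.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two bag-of-words histograms: 1.0 = same word mix, 0.0 = no overlap."""
    hist_a, hist_b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(count * hist_b[word] for word, count in hist_a.items())  # only shared words contribute
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def check(text_a, text_b, threshold=0.7):
    # the threshold is as arbitrary as above; tune it against pairs you already know are related
    return cosine_similarity(text_a, text_b) >= threshold

Because the product of the norms factors in the article lengths, long and short articles become directly comparable, which addresses the length issue mentioned above.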
Maybe clustering is the wrong strategy here?
If you want to display similar articles, use similarity search instead.
For text articles, this is well understood. Just insert your articles in a text search database like Lucene, and use your current article as search query. In Lucene, there exists a query called MoreLikeThis that performs exactly this: find similar articles.
Clustering is the wrong tool, because (in particular with your requirements), every article must be put into some cluster; and the related items would be the same for every object in the cluster. If there are outliers in the database - a very likely case - they could ruin your clustering. Furthermore, clusters may be very big. There is no size constraint, the clustering algorithm may decide to put half of your data set into the same cluster. So you have 10000 related articles for each article in your database. With similarity search, you can just get the top-10 similar items for each document!
Last but not least: forget PHP for clustering. It's not designed for this, and not performant enough. But you can probably access a Lucene index from PHP well enough.
I believe you need to make some design decisions about clustering, and continue from there:
Why are you clustering texts? Do you want to display related documents together? Do you want to explore your document corpus via clusters?
As a result, do you want flat or hierarchical clustering?
Now we have the complexity issue, in two dimensions: first, the number and type of features you create from the text - individual words may number in the tens of thousands. You may want to try some feature selection - such as taking the N most informative words, or the N words appearing the most times, after ignoring stop words.
Second, you want to minimize the number of times you measure similarity between documents. As bubaker correctly points out, checking similarity between all pairs of documents may be too much. If clustering into a small number of clusters is enough, you may consider K-means clustering, which is basically: choose an initial K documents as cluster centers, assign every document to the closest cluster, recalculate cluster centers by finding document vector means, and iterate. This only costs K * (number of documents) per iteration. I believe there are heuristics for reducing the number of computations needed for hierarchical clustering as well.
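For readers who want that k-means loop spelled out, here is a minimal Python sketch; it assumes each document has already been converted into a dense word-count vector (a plain list of numbers) and is not tuned for 20,000 articles.

import random

def k_means(vectors, k, iterations=20):
    """Naive k-means over dense word-count vectors (equal-length lists of numbers)."""
    centers = random.sample(vectors, k)  # pick K documents as the initial cluster centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # assign the document to the closest center (squared Euclidean distance)
            distances = [sum((x - c) ** 2 for x, c in zip(v, center)) for center in centers]
            clusters[distances.index(min(distances))].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

Each iteration does K distance computations per document, matching the cost estimate above.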
What does the similar_text function called in Approach #1 look like? I think what you're referring to isn't clustering, but a similarity metric. I can't really improve on the White Walloun's :-) histogram approach - an interesting problem to do some reading on.
However you implement check(), you've got to use it to make at least 200M comparisons (half of 20000^2). The cutoff for "related" articles may limit what you store in the database, but it seems too arbitrary to catch all useful clustering of texts.
My approach would be to modify check() to return the "similarity" metric ($prozent or rtn). Write the 20K x 20K matrix to a file and use an external program to perform the clustering, identifying the nearest neighbors for each article, which you could then load into the related table. I would do the clustering in R - there's a nice tutorial for clustering data in a file, running R from PHP.

How do I break up high-cpu requests on Google App Engine?

To give an example of the kind of request I can't figure out what else to do about:
The application is a bowling score/stat tracker. When someone enters their scores in advanced mode, a number of stats are calculated, as well as their score. The data is modeled as:
Game - members like name, user, reference to the bowling alley, score
Frame - pinfalls for each ball, boolean lists for which pins were knocked down on each ball, information about the path of the ball (stance, target, where it actually went), the score as of that frame, etc
GameStats - stores calculated statistics for the entire game, to be merged with other game stats as needed for statistics display across groups of games.
An example of this information in practice can be found here.
When a game is complete, and a frame is updated, I have to update the game, the frame, every frame after it and possibly some before it (to make sure their scores are correct), and the stats. This operation always flags the CPU monitor. Even if the game isn't complete, and statistics don't need to be calculated, the scores and such need to be updated to show the real-time progress to the user, and so these also get flagged. The average CPU time for this handler is over 7000 mcycles, and it doesn't even display a view. Most people bowl 3 to 4 games per series - if they are entering their scores realtime, at the lanes, that's about 1 request every 2 to 4 minutes, but if they write it all down and enter it later, there are 30-40 of these requests being made in a row.
As requested, the data model for the important classes:
class Stats(db.Model):
    version = db.IntegerProperty(default=1)
    first_balls = db.IntegerProperty(default=0)
    pocket_tracked = db.IntegerProperty(default=0)
    pocket = db.IntegerProperty(default=0)
    strike = db.IntegerProperty(default=0)
    carry = db.IntegerProperty(default=0)
    double = db.IntegerProperty(default=0)
    double_tries = db.IntegerProperty(default=0)
    target_hit = db.IntegerProperty(default=0)
    target_missed_left = db.IntegerProperty(default=0)
    target_missed_right = db.IntegerProperty(default=0)
    target_missed = db.FloatProperty(default=0.0)
    first_count = db.IntegerProperty(default=0)
    first_count_miss = db.IntegerProperty(default=0)
    second_balls = db.IntegerProperty(default=0)
    spare = db.IntegerProperty(default=0)
    single = db.IntegerProperty(default=0)
    single_made = db.IntegerProperty(default=0)
    multi = db.IntegerProperty(default=0)
    multi_made = db.IntegerProperty(default=0)
    split = db.IntegerProperty(default=0)
    split_made = db.IntegerProperty(default=0)

class Game(db.Model):
    version = db.IntegerProperty(default=3)
    user = db.UserProperty(required=True)
    series = db.ReferenceProperty(Series)
    score = db.IntegerProperty()
    game_number = db.IntegerProperty()
    pair = db.StringProperty()
    notes = db.TextProperty()
    simple_entry_mode = db.BooleanProperty(default=False)
    stats = db.ReferenceProperty(Stats)
    complete = db.BooleanProperty(default=False)

class Frame(db.Model):
    version = db.IntegerProperty(default=1)
    user = db.UserProperty()
    game = db.ReferenceProperty(Game, required=True)
    frame_number = db.IntegerProperty(required=True)
    first_count = db.IntegerProperty(required=True)
    second_count = db.IntegerProperty()
    total_count = db.IntegerProperty()
    score = db.IntegerProperty()
    ball = db.ReferenceProperty(Ball)
    stance = db.FloatProperty()
    target = db.FloatProperty()
    actual = db.FloatProperty()
    slide = db.FloatProperty()
    breakpoint = db.FloatProperty()
    pocket = db.BooleanProperty()
    pocket_type = db.StringProperty()
    notes = db.TextProperty()
    first_pinfall = db.ListProperty(bool)
    second_pinfall = db.ListProperty(bool)
    split = db.BooleanProperty(default=False)
A few suggestions:
You could store the frame data as part of the same entity as the game, rather than having a separate entity for each frame, for example by storing a list of bitfields (packed into integers) for the pins standing at the end of each half-frame (see the sketch below). Let me know if you want more details on how this would be implemented.
Failing that, you can calculate some of the more interrelated stats on fetch. For example, calculating the score-so-far ought to be simple if you have the whole game loaded at once, which means you can avoid having to update multiple frames on every request.
We can be of more help if you show us your data model. :)
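To make the first suggestion a little more concrete, here is a hypothetical sketch of packing the pins standing after each half-frame into integers; the helper names and the ListProperty(int) layout are made up for illustration, not part of the question's model.

def pins_to_bits(pins_standing):
    """Pack a list of 10 booleans (True = pin still standing) into a single int bitfield."""
    bits = 0
    for i, standing in enumerate(pins_standing):
        if standing:
            bits |= 1 << i
    return bits

def bits_to_pins(bits):
    """Unpack an int bitfield back into a list of 10 booleans."""
    return [bool(bits & (1 << i)) for i in range(10)]

# A Game entity could then carry something like
#     half_frames = db.ListProperty(int)   # one bitfield per half-frame
# so updating a score touches one entity instead of one Frame entity per frame.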

Resources