Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:
No indexing at all - ~45 sec to post 2000 documents
Indexing included - commit after each add - ~8 minutes (!) to post and index 2000 documents
Indexing included - commitWithin 1ms - ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense; I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add will degrade performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Whereas the code for point 3 is:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // according to the API documentation, I understand there is no need to call an explicit commit after this

According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
the commitWithin is a soft commit by default. Soft commits are very efficient at making the added documents immediately searchable, but they are not on disk yet; the documents are only committed into RAM. In this setup you would use the updateLog to make the Solr instance crash tolerant.
What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and issue a hard commit, or even have your autoCommit set to some reasonable value, like 10 min or 1 hour (depending on your user expectations).
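As a rough illustration of that batching pattern, here is a minimal sketch using the Python pysolr client (pysolr, the core URL, the documents, and the batch size are all assumptions for illustration; the same idea applies to SolrJ): add in batches without committing, then issue a single hard commit at the end.

import pysolr  # assumption: pysolr client used purely for illustration

solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=10)  # placeholder core URL

docs = [{"id": str(i), "title": "doc %d" % i} for i in range(2000)]  # placeholder documents

# Send the documents in batches without committing on each add...
batch_size = 500
for start in range(0, len(docs), batch_size):
    solr.add(docs[start:start + batch_size], commit=False)

# ...then flush everything to disk once with a single hard commit.
solr.commit()

With autoCommit configured in solrconfig.xml you could even drop the explicit commit() and let the server flush on its own schedule.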

Related

ElasticSearch: Result window is too large

My friend stored 65000 documents on the Elastic Search cloud and I would like to retrieve all of them (using Python). However, when I run my current script, I get an error saying:
RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
My script
from elasticsearch import Elasticsearch
es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))
docs = es.search(body={"query": {"match_all": {}}, "_source": ["_id"], "size": 65000})
What would be the easiest way to retrieve all those documents and not limit it to 10000 docs? Thanks.
The limit has been set so that the result set does not overwhelm your nodes. Results will occupy memory in the Elasticsearch node, so the bigger the result set, the bigger the memory footprint and the impact on the nodes.
Depending on what you want to do with the retrieved documents:
try to use the scroll API (as suggested in your error message) if it's a batch job. Be mindful of the lifetime of the scroll context in that case.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-scroll
or, use the Search After parameter (see the sketch below):
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-search-after
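A minimal search_after sketch with the Python client, assuming a field that is sortable and unique per document ("created_at" and "my-index" below are placeholders to replace with your own field and index):

from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))  # credentials as in the question

body = {
    "size": 1000,
    "query": {"match_all": {}},
    "sort": [{"created_at": "asc"}],  # placeholder: any unique, sortable field works
}

all_hits = []
while True:
    page = es.search(index="my-index", body=body)
    hits = page["hits"]["hits"]
    if not hits:
        break
    all_hits.extend(hits)
    # Resume the next page right after the last hit of this one.
    body["search_after"] = hits[-1]["sort"]

Each page's last sort value becomes the search_after of the next request, so no deep from/size offsets are needed.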
You should use the scroll API and get the results in separate calls. The scroll API returns the results in pages of at most 10000 (which remain available for the amount of time you indicate in the call), and you can then paginate through them using the returned scroll_id.
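For reference, a rough sketch of that scroll loop with the Python client (the index name is a placeholder; the scroll context here is kept alive for 2 minutes between calls):

from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))  # credentials as in the question

# The first call opens the scroll context and returns the first page plus a scroll_id.
page = es.search(index="my-index", body={"query": {"match_all": {}}}, size=1000, scroll="2m")
scroll_id = page["_scroll_id"]
all_hits = page["hits"]["hits"]

# Keep fetching the next page via the scroll_id until a page comes back empty.
while True:
    page = es.scroll(scroll_id=scroll_id, scroll="2m")
    hits = page["hits"]["hits"]
    if not hits:
        break
    all_hits.extend(hits)
    scroll_id = page["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)  # free the scroll context when done
doc_ids = [hit["_id"] for hit in all_hits]  # the _id comes from the hit metadata

The elasticsearch.helpers.scan helper wraps this same loop if you would rather not manage the scroll_id yourself.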
The error message itself mentions how you can solve the issue; look carefully at this part of the error message:
This limit can be set by changing the [index.max_result_window] index
level setting.
Please refer to the update indices settings API on how to change that.
So for your setting it would look like:
PUT /<your-index-name>/_settings
{
  "index" : {
    "max_result_window" : 65000
  }
}
Note that 65000 here equals the total number of docs in your index.
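If you would rather apply the setting from Python than with a raw PUT, the client exposes the same call (the index name is a placeholder); keep in mind that a larger window means a larger memory footprint per search:

from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))  # credentials as in the question

# Raise the result window to cover all ~65000 documents in the index.
es.indices.put_settings(index="my-index", body={"index": {"max_result_window": 65000}})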

How to Fix Read timed out in Elasticsearch

I used Elasticsearch-1.1.0 to index tweets.
The indexing process is okay.
Then I upgraded the version. Now I use Elasticsearch-1.3.2, and I get this message randomly:
Exception happened: Error raised when there was an exception while talking to ES.
ConnectionError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)) caused by: ReadTimeoutError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)).
Snapshot of the randomness:
Happened --33s-- Happened --27s-- Happened --22s-- Happened --10s-- Happened --39s-- Happened --25s-- Happened --36s-- Happened --38s-- Happened --19s-- Happened --09s-- Happened --33s-- Happened --16s-- Happened
--XXs-- = after XX seconds
Can someone point out on how to fix the Read timed out problem?
Thank you very much.
It's hard to give a direct answer since the error you're seeing might be associated with the client you are using. However, a solution might be one of the following:
1. Increase the default timeout globally when you create the ES client by passing the timeout parameter. Example in Python:
es = Elasticsearch(timeout=30)
2. Set the timeout per request made by the client. Taken from the Elasticsearch Python docs:
# only wait for 1 second, regardless of the client's default
es.cluster.health(wait_for_status='yellow', request_timeout=1)
Setting request_timeout per call like this lets you give the cluster extra time to respond on specific requests.
Try this:
es = Elasticsearch(timeout=30, max_retries=10, retry_on_timeout=True)
It might not fully avoid ReadTimeoutError, but it minimizes them.
Read timeouts can also happen when the query size is large. For example, in my case of a pretty large ES index (> 3M documents), a search for a query with 30 words took around 2 seconds, while a search for a query with 400 words took over 18 seconds. So for a sufficiently large query even timeout=30 won't save you. An easy solution is to crop the query to a size that can be answered below the timeout.
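As a rough illustration of that cropping (the 50-word cutoff, the "tweets" index, and the "text" field are assumptions to tune for your own data):

from elasticsearch import Elasticsearch

es = Elasticsearch(timeout=30)

MAX_WORDS = 50  # assumed cutoff; tune it so queries finish within your timeout

def cropped_search(query_text, index="tweets", field="text"):
    # Keep only the first MAX_WORDS words so the query stays cheap enough to answer in time.
    short_query = " ".join(query_text.split()[:MAX_WORDS])
    return es.search(index=index, body={"query": {"match": {field: short_query}}})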
For what it's worth, I found that this seems to be related to a broken index state.
It's very difficult to reliably recreate this issue, but I've seen it several times; operations run as normal except certain ones which periodically seem to hang ES (specifically refreshing an index it seems).
Deleting an index (curl -XDELETE http://localhost:9200/foo) and reindexing from scratch fixed this for me.
I recommend periodically clearing and reindexing if you see this behaviour.
Increasing various timeout options may immediately resolve issues, but does not address the root cause.
Provided the ElasticSearch service is available and the indexes are healthy, try increasing the Java minimum and maximum heap sizes: see https://www.elastic.co/guide/en/elasticsearch/reference/current/jvm-options.html .
TL;DR Edit /etc/elasticsearch/jvm.options -Xms1g and -Xmx1g
You should also check that everything is fine with Elasticsearch itself. Some shards can be unavailable; here is a nice doc about possible reasons for unassigned shards: https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/
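A quick way to check that from the Python client (host and port taken from the error message above; the cat columns requested here are just the ones that are usually interesting):

from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:8001"], timeout=30)  # host/port as in the error message

# "green" means all shards are allocated; "yellow"/"red" point at replica or primary problems.
print(es.cluster.health()["status"])

# List any shard that is not STARTED, together with the reason it is unassigned.
for shard in es.cat.shards(format="json", h="index,shard,prirep,state,unassigned.reason"):
    if shard["state"] != "STARTED":
        print(shard)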

can't reproduce/verify the performance claims in graph databases and neo4j in action books

UPDATE: I have put up a follow-up question that contains updated scripts and a clearer setup on neo4j performance compared to mysql (how can it be improved?). Please continue there. /UPDATE
I have some problems verifying the performance claims made in the "graph databases" book (page 20) and in the Neo4j in Action book (chapter 1).
To verify these claims I created a sample dataset of 100000 'person' entries with 50 'friends' each, and tried to query for e.g. friends 4 hops away. I used the very same dataset in mysql. With friends of friends over 4 hops, mysql returns in 0.93 secs, while neo4j needs 65-75 secs (on repeated calls).
How can I improve this miserable outcome, and verify the claims made in the books?
A bit more detail:
I run the whole setup on a i5-3570K with 16GB Ram, using ubuntu12.04 64bit, java version "1.7.0_25" and mysql 5.5.31, neo4j-community-2.0.0-M03 (I get a similar outcome with 1.9)
All code/sample data can be found on https://github.com/jhb/neo4j-experiements/ (to be used with 2.0.0). The resulting sample data in different formats can be found on https://github.com/jhb/neo4j-testdata.
To use the scripts you need a python with mysql-python, requests and simplejson installed.
the dataset is created with friendsdata.py and stored to friends.pickle
friends.pickle gets imported to neo4j using import_friends_neo4j.py
friends.pickle gets imported to mysql using import_friends_mysql.py
I add indexes on t_user_friend.* in mysql
I added "create index on :node(noscenda_name) in neo4j
To make life a bit easier the friends.*.bz2 contain sql and cypher statements to create those datasets in mysql and neo4j 2.0 M3.
Mysql performance
I first warm mysql up by querying:
select count(distinct name) from t_user;
select count(distinct name) from t_user;
Then, for the real measurement I do
python query_friends_mysql.py 4 10
This creates the following sql statement (with changing t_user.names):
select
count(*)
from
t_user,
t_user_friend as uf1,
t_user_friend as uf2,
t_user_friend as uf3,
t_user_friend as uf4
where
t_user.name='person8601' and
t_user.id = uf1.user_1 and
uf1.user_2 = uf2.user_1 and
uf2.user_2 = uf3.user_1 and
uf3.user_2 = uf4.user_1;
and repeats this 4 hop query 10 times. The queries need around 0.95 secs each. Mysql is configured to use a key_buffer of 4G.
neo4j performance testing
I have modified neo4j.properties:
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
and the neo4j-wrapper.conf:
wrapper.java.initmemory=2048
wrapper.java.maxmemory=8192
To warm up neo4j I do
start n=node(*) return count(n.noscenda_name);
start r=relationship(*) return count(r);
Then I start using the transactional http endpoint (but I get the same results using the neo4j-shell).
Still warming up, I run
./bin/python query_friends_neo4j.py 3 10
This creates a query of the form (with varying person ids):
{"statement": "match n:node-[r*3..3]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}
after the 7th call or so each call needs around 0.7-0.8 secs.
Now for the real thing (4 hops) I do
./bin/python query_friends_neo4j.py 4 10
creating
{"statement": "match n:node-[r*4..4]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}
and each call takes between 65 and 75 secs.
Open questions/thoughts
I'd really like to see the claims in the books be reproducible and correct, and neo4j faster than mysql instead of magnitudes slower.
But I don't know what I am doing wrong... :-(
So, my big hopes are:
I didn't do the memory settings for neo4j correctly
The query I use for neo4j is completely wrong
Any suggestions to get neo4j up to speed are highly welcome.
Thanks a lot,
Joerg
2.0 has not been performance optimized at all, so you should use 1.9.2 for comparison.
(if you use 2.0 - did you create an index for n.noscenda_name)
You can check the query plan with profile start ....
With 1.9 please use a manual index or node_auto_index for noscenda_name.
Can you try these queries:
start n=node:node_auto_index(noscenda_name={target})
match n-->()-->()-->m
return count(*);
Fulltext indexes are also more expensive than exact indexes, so keep the exact auto-index for noscenda_name.
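For completeness, enabling the node auto-index should just be two lines in conf/neo4j.properties on 1.9 (assuming the stock auto-index settings); note that auto-indexing only applies to properties written after it is enabled, so the data has to be (re)imported afterwards:

node_auto_indexing=true
node_keys_indexable=noscenda_name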
Can't get your importer to run; it fails at some point. Perhaps you can share the finished neo4j database:
python importer.py
reading rels
reading nodes
delete old
Traceback (most recent call last):
  File "importer.py", line 9, in <module>
    g.query('match n-[r]->m delete r;')
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 99, in query
    return self.call(payload)
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 71, in call
    self.transactionurl = result.headers['location']
  File "/Library/Python/2.7/site-packages/requests-1.2.3-py2.7.egg/requests/structures.py", line 77, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'location'
Just to add to what Michael said, in the book I believe the authors are referring to a comparison that was done in the Neo4j in Action book - it's described in the free first chapter of that book.
At the top of page 7 they explain that they were using the Traversal API rather than Cypher.
I think you'll struggle to get Cypher near that level of performance at the moment so if you want to do those types of queries you'll want to use the Traversal API directly and then perhaps wrap it in an unmanaged extension.

Can Cube (js metrics framework) return more than 1000 events?

The Cube software (https://github.com/square/cube) allows you to retrieve events.
I want to retrieve a lot of events, but it appears that I am capped at 1000. There are well over 9000 events in MongoDB in the collection and time range I am querying.
Example http GET queries I issue:
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&start=2012-02-02&stop=2013-07-03
# 7 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&limit=7
# 1000 results
http://1.2.3.4:1081/1.0/event?expression=my_event_type&limit=9999
It appears that the limit is pinned at line 166 of
https://github.com/square/cube/blob/28dad4af27a6680deb46077b16952590f2c21cad/lib/cube/event.js
based on the 'batchSize=1000'.
Is it possible that you can 'page' through the data in some way? Or is this just a hard limit?
Looks like there is a hard cap on results in three places that need to be updated for large domains:
event.js - line 166
metric.js - line 11
metric.js - line 12
In addition, I was unable to find any query-string APIs for the parameters. Ideally, we can leave the cap at 1000 (to avoid server bloat for people not tuning their queries correctly) and allow the consumer to define override behavior.

Zend Framework Cache

I'm trying to make an AJAX autocomplete search box that of course uses SQL (minimum 3 characters), and I have a SQL view of the relevant fields already set up and indexed in the DB. The CPU still spikes when searching, which I expected as it's running a query for every character. I want to use the Zend shm cache to speed up results and reduce CPU usage. The results are stored in an array which is to be cached like this:
while ($row = db2_fetch_row($stmt)) {
    $fSearch[trim($row[0]).trim($row[1])] = array(/* array built here */);
}
if (zend_shm_cache_store('fSearch', $fSearch, 10 * 60) === false) {
    error_log('Failed to store search cache!');
}
Of course there's actual data inside the array instead of comments; I just shortened the code for simplicity. Rows 0 & 1 form the PK, and this has been tested to work properly. It's the zend_shm_cache_store that fails, because the error log gets flooded with 'Failed to store search cache!'. I read that zend_shm_cache_store can store any array that can be serialized - how can I tell if my data is serialized or can be serialized? Are there any other potential causes? I did make a test page that only stored a string and that was successful, so I know caching is on.
Solved: the cache size was too small for the array - increased the cache size and it worked fine. Sorry for the trouble.

Resources