JanusGraph is doing full table scans for equality queries and not using the index backend to get better performance - janusgraph

I'm running JanusGraph Server backed by AWS Keyspaces and Elasticsearch. The Elasticsearch backend is properly configured, and the data-load process persists data in Elasticsearch as expected.
JanusGraph is doing full scans for equality-based queries; it is not making use of the indexes.
Example:
gremlin> g.E().has("edge_id","axxxxxxxx6a1796de717e9df").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[edge_id.eq(axxxxxxxx6a1796de... 1227.690 100.00
constructGraphCentricQuery 0.087
constructGraphCentricQuery 0.003
GraphCentricQuery 1227.421
\_condition=(edge_id = axxxxxxxx6a1796de717e9df)
\_orders=[]
\_isFitted=false
\_isOrdered=true
\_query=[]
scan 1227.316
\_query=[]
\_fullscan=true
\_condition=EDGE
>TOTAL - - 1227.690 -
When I use textContains it does make use of the indices.
g.E().has("edge_id",textContains("axxxxxxxx6a1796de717e9df")).bothV().profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[edge_id.textContains(axxxx..... 2 2 1934.487 100.00
constructGraphCentricQuery 0.125
GraphCentricQuery 1934.234
\_condition=(edge_id textContains axxxxxxxx6a1796de717e9df)
\_orders=[]
\_isFitted=true
\_isOrdered=true
\_query=[(edge_id textContains axxxxxxxx6a1796de717e9df)]:edge_information
\_index=edge_information
\_index_impl=search
backend-query 2 1934.207
\_query=edge_information:[(edge_id textContains axxxxxxxx6a1796de717e9df)]:edge_information
EdgeVertexStep(BOTH) 4 4 0.043 0.00
>TOTAL - - 1934.530 -
Is there a configuration which controls this behavior?
In my opinion, doing full table scans is very inefficient.
When I run JanusGraph locally, I do see it use the index backend even for equality queries.

Check out https://docs.janusgraph.org/index-backend/text-search/#full-text-search. By default, mixed indexes only support full-text search, while you want equality matches. You need to use String search, or Full text + String search.
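As an illustration, equality lookups become fitted to the mixed index once the key is indexed with a string mapping. The sketch below uses the JanusGraph management API; the index name edge_information_by_string and the graph handle are assumptions, and an existing index's mapping cannot be changed in place, so this only shows how such an index would be declared (Mapping.TEXTSTRING would support both textContains() and eq()).

import org.apache.tinkerpop.gremlin.structure.Edge;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;
import org.janusgraph.core.schema.Mapping;

public class EdgeIdStringIndex {
    // Declares a mixed index on edge_id with a STRING mapping so that eq()
    // predicates can be answered by the Elasticsearch backend.
    // "edge_information_by_string" is a made-up index name; "search" is the
    // name of the configured index backend assumed here.
    public static void build(JanusGraph graph) {
        JanusGraphManagement mgmt = graph.openManagement();
        PropertyKey edgeId = mgmt.getPropertyKey("edge_id");
        mgmt.buildIndex("edge_information_by_string", Edge.class)
            .addKey(edgeId, Mapping.STRING.asParameter())
            .buildMixedIndex("search");
        mgmt.commit();
    }
}

With a STRING (or TEXTSTRING) mapping in place, the eq() profile above should show _isFitted=true and a backend-query step rather than _fullscan=true.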

Related

Elastic Search Version 7.17 Java Rest API returns incorrect totalElements and total pages using queryBuilder

We are currently upgrading our system from Elasticsearch 6.8.8 to Elasticsearch 7.17. When we run pageable queries using the Java REST API, the results are incorrect.
For example, in version 6.8.8, if we query for data and request page 2 with a page size of 10, the query returns the 10 items on page 2 and gives us a totalElements of 10000 records, which is correct. When we run this same exact query on version 7.17, it returns 10 items on page 2 but only gives us a totalElements of 10 instead of the correct number. We need the correct number so that our grid view handles paging correctly. Is there a setting I am missing in Elasticsearch version 7.17?
Elasticsearch introduced the track_total_hits option for all searches in ES 7.x.
Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hit accurately up to 10,000 hits. It is a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.
So to force ES to count all the hit documents, you should set track_total_hits to true. For more information, you can check the official Elasticsearch documentation.
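For example, with the high-level Java REST client (assumed here because the question mentions the Java REST API; the index name and query below are placeholders), the flag can be set on the SearchSourceBuilder:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class PagedSearch {
    // Page 2 with a page size of 10, counting every hit instead of stopping at 10,000.
    public static long totalElementsForPage2(RestHighLevelClient client) throws Exception {
        SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.matchAllQuery()) // placeholder for the real query
            .from(10)
            .size(10)
            .trackTotalHits(true);
        SearchRequest request = new SearchRequest("my-index").source(source); // "my-index" is a placeholder
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        // With track_total_hits=true the relation is EQUAL_TO, so value is the exact total.
        return response.getHits().getTotalHits().value;
    }
}

Setting "track_total_hits": true directly in the JSON search body has the same effect.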

Slow N1QL performance on small dataset using covering index

I am using Couchbase 6 Community Edition and my bucket has about 2 million documents of small size (< 5000 bytes).
Each document has a field named country, and I have a GSI on this field.
There are four unique values for this field; however, a query to get these unique values takes 8 to 10 seconds.
I am not sure why it is this slow.
my query is:
SELECT DISTINCT(country)
FROM test_bucket
USE INDEX(country_index USING GSI)
WHERE country IS NOT MISSING
The memory quota on this bucket is 50 GB, and the machine has 40 cores.
I would like to ask what the bottleneck is here, or what would cause a bottleneck in this situation.
You have the right index. As you have 2 million country documents, the query engine needs to get all 2 million entries from the indexer and eliminate duplicates. Use request profiling, described on page 341 of https://blog.couchbase.com/n1ql-practical-guide-second-edition/
Also check out the technique described at https://dzone.com/articles/count-amp-group-faster-using-n1ql
If you can use the EE version, you can use index aggregation as described at https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/ by changing the query to a GROUP BY like below.
SELECT country
FROM test_bucket
USE INDEX(country_index USING GSI)
WHERE country IS NOT MISSING
GROUP BY country;

elasticsearch bulk import speed when updating data

My current Elasticsearch cluster configuration has one master node and six data nodes, each on an independent AWS EC2 instance.
Master: t2.large [2 vCPU, 8 GB RAM]
Each data node: r4.xlarge [4 vCPU, 30.5 GB]
Number of shards = 12 [is it too low?]
Every day I run a bulk import of 110 GB of data from logs.
It is imported in three steps.
First, create a new index and bulk import 50 GB of data.
That import runs very fast and usually completes in 50-65 minutes.
Then I run the second bulk import task of about 40 GB of data, which is actually an update of the previously imported records.
[Absolutely no new records]
That update task takes about 6 hours on average.
Is there any way to speedup/optimize the whole process to run faster?
Options I am considering
1- Increase data nodes count from current 6 to 10.
OR
2- Increase the memory/CPU on each data node.
OR
3- Ditch the update part altogether and import all the data into separate indices. That will require updating the query logic on the application side as well, in order to query from multiple indices, but other than that, are there any demerits to multiple indices?
OR
4- Any other option which I might have overlooked?
EDIT:
I went ahead with increasing the number of nodes as a test run, and I can confirm the following results. [Posting here just in case it can help someone.]
Note: each node's specs remained the same as above.
Current system: 6 nodes, 12 shards
Main insert (~50G) = 54 min
Update (~40G) = 5h 30min
Test 1 system: 10 nodes, 24 shards
Main insert (~50G) = 51 min
Update (~40G) = 2h 15min
Test 2 system: 12 nodes, 24 shards
Main insert (~50G) = 51 min
Update (~40G) = 1h 36min
There is a huge improvement, but I'm still looking for suggestions, as having that many instances is economically burdensome.
Increasing the data node count and the memory/CPU on each data node won't solve your problem, as there won't be a significant difference in indexing time.
Updates require Elasticsearch to first find the document and then overwrite it by indexing a new version and deleting the old one, which tends to get slower the larger the shards get.
Option 3 that you propose would be close to an ideal solution, but it can impact your query time, as it has to search in two different indices.
You can avoid that by introducing a field called 'type' in the same index which can be used to distinguish the documents; this keeps the queries against the ES index easy to write and the fetch time low.
For example (your index would look something like this; with the type field you can fetch either slice):
{
  "data": "some data",
  "type": "first-inserted"
},
{
  "data": "some data",
  "type": "second-inserted"
}
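A query restricted to one slice could then look roughly like this (a sketch with the Elasticsearch Java high-level REST client; the index name "logs" and the assumption that the type field is mapped as a keyword are mine):

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class TypeFilteredSearch {
    // Fetch only the documents written by the first (insert) pass.
    // "logs" is a placeholder index name; "type" is assumed to be a keyword field.
    public static SearchRequest firstInsertedOnly() {
        return new SearchRequest("logs")
            .source(new SearchSourceBuilder()
                .query(QueryBuilders.termQuery("type", "first-inserted")));
    }
}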

solr performance with large data retrieval

My use case
I have a 20 GB file per day (pipe-delimited text file).
I have indexed 90 days of data (20 * 90 GB).
Record count - 5.5 billion
Total fields - 30
Indexed fields - called_number, calling_number, time_key
All other fields are stored as per schema.xml.
Index size - 300 GB
No of shards = 4
I used the method below to index (org.apache.solr.hadoop.MapReduceIndexerTool):
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file $path/morphlines.conf --output-dir hdfs://MASTERNODE:8020/$path2 \
--go-live --zk-host MASTERNODE:2181/solr \
--collection COLLECTIONNAME \
--mappers 4 \
--reducers 12 hdfs://Masternode/path/asd.txt
In my test bed I have 4 data nodes and 1 name node (test bed on Cloudera 5.4.7).
Each node has 256 GB of RAM. Are there any performance-increasing tips I should follow in Solr?
It took around 120 seconds to get 3000 records out in one search (a range query based on time_key). But after the first query the result gets cached, and if I execute it again I get a response in less than 1 second, even for larger result sets (10000 records also come back within 1 second).
Note that when retrieving 10-20 records, performance was good even on the first attempt.

Solr caching & sorting

Our Solr index (Solr 3.4) has over 100 million documents.
We frequently fire one type of query on this index to get documents, do some processing, and dump them into another index.
Query is of the form -
((keyword1 AND keyword2...) OR (keyword3 AND keyword4...) OR ...) AND date:[date1 TO *]
No. of keywords can be in the range of 100 - 1000.
We are adding sort parameter 'date asc'.
The keyword part of the query changes very rarely, but the date part always changes.
Now there are mainly 2 problems:
1) The query takes too much time.
2) Sometimes when 'numFound' is very large for a query, it gives an OOM error (I guess this is because of the sort).
We are not using any type of caching yet.
Will caching be helpful to solve these problems?
If yes, what type of cache or caching configuration is suitable to start with?
