Neo4J tuning or just more RAM? - performance

I have a Neo4J-enterprise database running on a DigitalOcean VPS with 8 GB of RAM and an 80 GB SSD.
The performance of the Neo4J instance is awful at the moment:
match (n) where n.gram='0gram' AND n.word=~'a.' return n.word LIMIT 5 # 349ms
match (n) where n.gram='0gram' AND n.word=~'a.*' return n.word LIMIT 25 # 1588ms
I understand that regexes are expensive, but on similar queries where I replace the 'a.' or 'a.*' part with any other letter, Neo4j simply crashes. Before that I can see a huge build-up in memory (towards 90%) and the CPU usage sky-rocketing.
My Neo4j is populated as follows:
Number Of Relationship Type Ids In Use: 1,
Number Of Node Ids In Use: 172412046,
Number Of Relationship Ids In Use: 172219328,
Number Of Property Ids In Use: 344453742
The VPS only runs Neo4J (on Debian 7/amd64). I use the NUMA and ParallelGC flags as they're supposed to be faster. I've been tweaking my RAM settings, and although it doesn't crash as often now, I have a feeling there are still gains to be made:
neostore.nodestore.db.mapped_memory=1024M
neostore.relationshipstore.db.mapped_memory=2048M
neostore.propertystore.db.mapped_memory=6144M
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M
# caching
cache_type=hpc
node_cache_array_fraction=7
relationship_cache_array_fraction=5
# node_cache_size=3G
# relationship_cache_size=1G --> these throw a not-enough-heap-mem error
The data is essentially a series of trees. On node0 only a full-text search is needed; the following nodes are searched by a property with floating-point values.
node0 -REL-> node0.1 -REL-> node0.1.1 ... node0.1.1.1.1
     \
      -REL-> node0.2 -REL-> node0.2.1 ... node0.2.1.1
There are approx. 5,000 top nodes like node0.
Should I reconfigure my memory/cache usage, or should I just add more RAM?
--- Edit on Indexes ---
Because all trees of nodes are always 4 levels deep, each level has a label for quick finding. In this case all node0 nodes have a label (called 0gram). The n.gram='0gram' predicate should use the index coupled to the label.
--- Edit on new Config ---
I upgraded the VPS to 16 GB. On the SSD, the nodeStore takes 2.3 GB (11%), the propertyStore 13.8 GB (64%) and the relationshipStore 5.6 GB (26%).
On this basis I created a new config (detailed above).
I'm waiting for the full set of queries and will do some additional testing in the meantime.

Yes, you need to create an index. What's your label called? Let's assume it is called :NGram.
create index on :NGram(gram);
match (n:NGram) where n.gram='0gram' AND n.word=~'a.' return n.word LIMIT 5
match (n:NGram) where n.gram='0gram' AND n.word=~'a.*' return n.word LIMIT 25
What you're doing is not a graph search but just a lookup via full scan + property comparison with a regexp. Not a very efficient operation. What you need is FullTextSearch (which is not supported with the new schema indexes but still with the legacy indexes).
Could you run this query (after you created the index) and say how many nodes it returns?
match (n:NGram) where n.gram='0gram' return count(*)
which is the equivalent to
match (n:NGram {gram:'0gram'}) return count(*)
I wrote a blog post about it a few days ago, please read it and see if it applies to your case.
How big is your Neo4j database on disk?
What is the configured heap size? (in neo4j-wrapper.conf?)
As you can see, you use more RAM than your machine has (not even counting OS or filesystem caches).
So you would have to reduce the mmio sizes, e.g. to 500M for nodes, 2G for rels and 1G for properties.
Look at your store-file sizes and set mmio accordingly.
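To make the over-allocation concrete, here is a small Python sketch that sums the memory-mapped settings from the question and compares them with the machine's 8 GB of RAM. The helper name `total_mapped_mb` and the 8192 MB figure are assumptions for illustration; the JVM heap is left out because the question does not state it.

```python
# Memory-mapped I/O settings copied from the question's config (values in MB).
mapped_memory_mb = {
    "neostore.nodestore.db.mapped_memory": 1024,
    "neostore.relationshipstore.db.mapped_memory": 2048,
    "neostore.propertystore.db.mapped_memory": 6144,
    "neostore.propertystore.db.strings.mapped_memory": 512,
    "neostore.propertystore.db.arrays.mapped_memory": 512,
}


def total_mapped_mb(settings):
    """Sum all memory-mapped regions (hypothetical helper)."""
    return sum(settings.values())


ram_mb = 8 * 1024  # assumption: the 8 GB VPS treated as 8192 MB
total = total_mapped_mb(mapped_memory_mb)
over = total - ram_mb  # positive means the mapped regions alone exceed RAM

print(f"mapped memory: {total} MB, RAM: {ram_mb} MB, over by: {over} MB")
```

Any heap configured in neo4j-wrapper.conf would come on top of this total.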

Depending on the number of nodes having n.gram='0gram', you might benefit a lot from setting a label on them and creating an index on the gram property. With this in place, an index lookup will directly return all 0gram nodes and apply regex matching only on those. Your current statement will load each and every node from the db and inspect its properties.
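The difference between the two access patterns can be sketched with a toy in-memory model in Python. This is not how Neo4j is implemented; the node list, the words in it and the function names are made up for illustration. It assumes Cypher's =~ has to match the whole string, which is why `fullmatch` is used.

```python
import re
from collections import defaultdict

# Toy stand-in for nodes; the words are invented for this example.
nodes = [
    {"gram": "0gram", "word": "ab"},
    {"gram": "0gram", "word": "apple"},
    {"gram": "1gram", "word": "ax"},
    {"gram": "0gram", "word": "by"},
]

pattern = re.compile(r"a.")  # same pattern as in the first query


def full_scan(all_nodes):
    """Inspect every node: compare the property, then run the regex."""
    inspected = 0
    hits = []
    for node in all_nodes:
        inspected += 1
        if node["gram"] == "0gram" and pattern.fullmatch(node["word"]):
            hits.append(node["word"])
    return hits, inspected


def build_index(all_nodes):
    """Group nodes by their gram value, like an index on the gram property."""
    index = defaultdict(list)
    for node in all_nodes:
        index[node["gram"]].append(node)
    return index


def indexed_lookup(index):
    """Fetch only the 0gram candidates, then run the regex on those."""
    candidates = index["0gram"]
    hits = [n["word"] for n in candidates if pattern.fullmatch(n["word"])]
    return hits, len(candidates)


scan_hits, scan_inspected = full_scan(nodes)
index_hits, index_inspected = indexed_lookup(build_index(nodes))
print(scan_hits, scan_inspected)
print(index_hits, index_inspected)
```

Both approaches return the same words; the indexed version only touches the nodes that already carry the right gram value, which is the saving the answer describes.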

Related

Why does Dask's map_partitions function use more memory than looping over partitions?

I have a parquet file of position data for vehicles that is indexed by vehicle ID and sorted by timestamp. I want to read the parquet file, do some calculations on each partition (not aggregations) and then write the output directly to a new parquet file of similar size.
I organized my data and wrote my code (below) to use Dask's map_partitions, as I understood this would perform the operations one partition at a time, saving each result to disk sequentially and thereby minimizing memory usage. I was surprised to find that this was exceeding my available memory and I found that if I instead create a loop that runs my code on a single partition at a time and appends the output to the new parquet file (see second code block below), it easily fits within memory.
Is there something incorrect in the original way I used map_partitions? If not, why does it use so much more memory? What is the proper, most efficient way of achieving what I want?
Thanks in advance for any insight!!
Original (memory hungry) code:
import dask.dataframe as dd

ddf = dd.read_parquet(input_file)
# Reuse the input dtypes as the expected output schema
meta_dict = ddf.dtypes.to_dict()
(
    ddf
    .map_partitions(my_function, meta=meta_dict)
    .to_parquet(
        output_file,
        append=False,
        overwrite=True,
        engine='fastparquet'
    )
)
Awkward looped (but more memory friendly) code:
import dask.dataframe as dd

ddf = dd.read_parquet(input_file)
# Process one partition at a time and append each result to the output
for partition in range(ddf.npartitions):
    partition_df = ddf.partitions[partition]
    (
        my_function(partition_df)
        .to_parquet(
            output_file,
            append=True,
            overwrite=False,
            engine='fastparquet'
        )
    )
More hardware and data details:
The total input parquet file is around 5GB and is split into 11 partitions of up to 900MB. It is indexed by ID with divisions so I can do vehicle grouped operations without working across partitions. The laptop I'm using has 16GB RAM and 19GB swap. The original code uses all of both, while the looped version fits within RAM.
As @MichaelDelgado pointed out, by default Dask will spin up multiple workers/threads according to what is available on the machine. With partitions of this size, that maxes out the available memory when using the map_partitions approach. To avoid this, I limited the number of workers and the number of threads per worker to prevent automatic parallelization using the code below, and the task fit in memory.
from dask.distributed import Client, LocalCluster

# One worker with one thread, so only one partition is processed at a time
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=1,
)
client = Client(cluster)
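A back-of-the-envelope sketch of why the worker and thread limits matter: if several partitions are processed at once, their in-memory footprints add up. The expansion factor (how much larger a partition is in memory than on disk) and the count of 8 concurrent tasks are assumptions for illustration, not measurements from the post.

```python
RAM_MB = 16 * 1024   # laptop RAM from the post (16 GB)
PARTITION_MB = 900   # largest partition size on disk, from the post


def estimated_peak_mb(concurrent_tasks, partition_mb=PARTITION_MB, expansion_factor=3):
    """Rough peak memory if `concurrent_tasks` partitions are in flight at once.

    expansion_factor is an assumed ratio of in-memory size to on-disk size.
    """
    return concurrent_tasks * partition_mb * expansion_factor


for tasks in (1, 8):  # 8 is a hypothetical thread count
    peak = estimated_peak_mb(tasks)
    print(f"{tasks} concurrent task(s): ~{peak} MB, fits in RAM: {peak <= RAM_MB}")
```

Under these assumptions a single task stays well within RAM, while eight concurrent tasks would not, which is consistent with the behaviour described above.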

Spark Performance tuning / optimization

I have a pretty standard use case and need suggestions on how to improve this Spark (2.4) job:
Dataframe1 (df1) = 10M records and
Dataframe2 (df2) = 50M records
then : join df1 & df2
use windowing functions etc
Result Dataframe (df3) = 2B records
further processing, i.e. filter and generate 5 different datasets from the prior df3 (this is where the issue starts).
The issue I face is that the initial few steps work fine in the notebook, but as soon as I reach df3, further processing gets really slow and fails or gets killed.
What would be the best way to optimize this processing? So far I have tried:
an r4.xlarge cluster and also an r5.16xlarge (500 GB memory) cluster (should I try any others, like M4 or C4 clusters, or what would you suggest for this kind of processing?)
spark conf used:
spark.conf.set("spark.executor.memory", "64g")
spark.conf.set("spark.driver.memory", "64g")
spark.conf.set("spark.executor.memoryOverHead", "24g")
spark.conf.set("spark.driver.memoryOverHead", "24g")
spark.conf.set("spark.executor.cores", "8")
spark.conf.set("spark.paralellism", 100)
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.sql.broadcastTimeout", "7200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
using cache on df1, df2 and df3.
Once memory is used up I see disk spill, so I tried tuning the GC with:
spark.conf.set("spark.driver.extraJavaOptions", "XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
spark.conf.set("spark.executor.extraJavaOptions", "XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
The above steps didn't help much. Please suggest what config, memory and cluster settings might help,
or
what other optimization techniques can be used here?

Redis high memory usage for almost no keys

I have a Redis instance hosted by Heroku ( https://elements.heroku.com/addons/heroku-redis ), using the "Premium 1" plan.
This Redis is used only to host a small queue system called Bull ( https://www.npmjs.com/package/bull ).
The memory usage is now at almost 100% (of the 100 MB allowed) even though there are barely any jobs stored in Redis.
I ran an INFO command on this instance and here are the important parts (I can post more if needed):
# Server
redis_version:3.2.4
# Memory
used_memory:98123632
used_memory_human:93.58M
used_memory_rss:470360064
used_memory_rss_human:448.57M
used_memory_peak:105616528
used_memory_peak_human:100.72M
total_system_memory:16040415232
total_system_memory_human:14.94G
used_memory_lua:280863744
used_memory_lua_human:267.85M
maxmemory:104857600
maxmemory_human:100.00M
maxmemory_policy:noeviction
mem_fragmentation_ratio:4.79
mem_allocator:jemalloc-4.0.3
# Keyspace
db0:keys=45,expires=0,avg_ttl=0
# Replication
role:master
connected_slaves:1
master_repl_offset:25687582196
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:25686533621
repl_backlog_histlen:1048576
I have a really hard time figuring out how I can be using 95 MB with barely 50 objects stored. These objects are really small, usually a JSON with 2-3 fields containing small strings and ids.
I've tried https://github.com/gamenet/redis-memory-analyzer but it crashes on me when I try to run it
I can't get a dump because Heroku does not allow it.
I'm a bit lost here, there might be something obvious I've missed but I'm reaching the limit of my understanding of Redis.
Thanks in advance for any tips / pointers.
EDIT
We had to upgrade our Redis instance to keep everything running, but it seems the issue is still there. Currently sitting at 34 keys / 34 MB.
I've tried redis-cli --bigkeys :
Sampled 34 keys in the keyspace!
Total key length in bytes is 743 (avg len 21.85)
9 strings with 43 bytes (26.47% of keys, avg size 4.78)
0 lists with 0 items (00.00% of keys, avg size 0.00)
0 sets with 0 members (00.00% of keys, avg size 0.00)
24 hashs with 227 fields (70.59% of keys, avg size 9.46)
1 zsets with 23 members (02.94% of keys, avg size 23.00)
I'm pretty sure there is some overhead building up somewhere but I can't find what.
EDIT 2
I was actually blind: the INFO output I ran when first creating this post shows used_memory_lua_human:267.85M, and the new instance now shows used_memory_lua_human:89.25M.
This seems super high and might explain the memory usage.
You have just 45 keys in the database, so what you can do is:
List all keys with the KEYS * command.
Run the DEBUG OBJECT <key> command for each or several keys; it returns the serialized length, so you get a better understanding of which keys consume a lot of space.
An alternative is to run redis-cli --bigkeys, which shows the biggest keys. You can see the content of a key with the command specific to its data type: for strings it's GET, for hashes it's HGETALL, and so on.
After a lot of digging, the issue is not coming from Redis or Heroku in any way.
The queue system we use has a somewhat recent bug where Redis ends up caching a Lua script repeatedly eating up memory as time goes on.
More info here : https://github.com/OptimalBits/bull/issues/426
Thanks to those who took the time to reply.
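For anyone hitting the same symptom, a quick way to spot it is to compare used_memory_lua with maxmemory in the INFO output. The sketch below parses a few lines copied from the INFO output in the question; the helper name `parse_info` is hypothetical, and the text is embedded as a string so nothing here talks to a real server.

```python
# A few lines copied from the INFO output in the question.
info_text = """\
# Memory
used_memory:98123632
used_memory_lua:280863744
maxmemory:104857600
# Keyspace
db0:keys=45,expires=0,avg_ttl=0
"""


def parse_info(text):
    """Turn 'key:value' lines into a dict, skipping section headers and blanks."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


info = parse_info(info_text)
lua_bytes = int(info["used_memory_lua"])
max_bytes = int(info["maxmemory"])
lua_mib = lua_bytes / (1024 * 1024)

print(f"Lua engine memory: {lua_mib:.2f} MiB")
print("Lua memory exceeds maxmemory:", lua_bytes > max_bytes)
```

A Lua figure that dwarfs the keyspace, as here, points at cached scripts rather than stored keys.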

Writing small amount of data to large number of files on GlusterFS 3.7

I'm experimenting with 2 Gluster 3.7 servers in 1x2 configuration. Servers are connected over 1 Gbit network. I'm using Debian Jessie.
My use case is as follows: open file -> append 64 bytes -> close file, repeated in a loop for about 5000 different files. Execution time for such a loop is roughly 10 seconds if I access the files through a mounted glusterfs drive. If I use libgfapi directly, execution time is about 5 seconds (2 times faster).
However, the same loop executes in 50ms on a plain ext4 disk.
There is a huge performance difference between Gluster 3.7 and earlier versions which is, I believe, due to the cluster.eager-lock setting.
My target is to execute the loop in less than 1 second.
I've experimented with lots of Gluster settings but without success. dd tests with various bsize values behave as if the TCP no-delay option were not set, although from the Gluster source code it seems that no-delay is the default.
Any idea how to improve the performance?
Edit:
I've found a solution that works in my case so I'd like to share it in case anyone else faces the same issue.
The root cause of the problem is the number of round trips between the client and the Gluster server during execution of the open/write/close sequence. I don't know exactly what is happening behind the scenes, but timing measurements show exactly that pattern. The obvious idea would be to "pack" the open/write/close sequence into a single write function. Roughly, the C prototype of such a function would be:
int write(const char* fname, const void *buf, size_t nbyte, off_t offset)
However, there is already such an API function in libgfapi: glfs_h_anonymous_write (thanks go to Suomya from the Gluster mailing group). The somewhat hidden detail is the file identifier, which is not a plain file name but something of type struct glfs_object. Clients obtain an instance of such an object through the API calls glfs_h_lookupat/glfs_h_creat. The point here is that a glfs_object representing a filename is "stateless" in the sense that the corresponding inode is left intact (not ref counted). One should think of a glfs_object as a plain filename identifier and use it as one would use a filename (actually, glfs_object stores a plain pointer to the corresponding inode without ref counting it).
Finally, we should call glfs_h_lookupat/glfs_h_creat once and then write many times to the file using glfs_h_anonymous_write.
That way I was able to append 64 bytes to 5000 files in 0.5 seconds, which is 20 times faster than using the mounted volume and the open/write/close sequence.
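To put the reported timings side by side, here is a small Python sketch that converts each total loop time into a per-file cost and a speedup factor. All numbers are taken from the post; the helper names `per_file_ms` and `speedup` are made up for this example.

```python
FILES = 5000  # number of files appended to in the loop (from the post)

# Total loop times in seconds, as reported in the post.
timings_s = {
    "mounted glusterfs volume": 10.0,
    "libgfapi open/write/close": 5.0,
    "glfs_h_anonymous_write": 0.5,
    "plain ext4": 0.05,
}


def per_file_ms(total_s, files=FILES):
    """Average time spent per file, in milliseconds."""
    return total_s * 1000.0 / files


def speedup(baseline_s, improved_s):
    """How many times faster the improved variant is than the baseline."""
    return baseline_s / improved_s


baseline = timings_s["mounted glusterfs volume"]
for name, total in timings_s.items():
    print(f"{name}: {per_file_ms(total):.3f} ms/file, "
          f"{speedup(baseline, total):.0f}x vs mounted volume")
```

The per-file figures make the round-trip cost visible: the mounted volume spends about 2 ms per file, while the anonymous write path spends about 0.1 ms.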

Can ETW (Event Tracing for Windows) also be used to gather memory statistics?

Is it possible to use ETW to also get memory statistics for all processes and the system?
By memory statistics I mean e.g. committed bytes, private bytes, paged pool, working set, ...
I cannot find anything about using xperf to get and see memory statistics. It is always about CPU, disk, network.
One could probably use performance counters to get that kind of information, but how can one overlay the statistics graphically in one chart (how to correlate/sync the timestamps)?
Your best bet on Windows 8.1 and higher is the Microsoft-Windows-Kernel-Memory provider, which records per-process memory information every 0.5 s. See https://github.com/google/UIforETW/issues/80 for details. UIforETW enables this by default when it is available.
You could also try the MEMINFO provider. It gives a system-wide overview of memory pressure. It shows the Active List (currently in use memory), the Standby List ('useful' pages not currently in use, such as the disk cache), and the Zero and Free lists (genuinely free memory). This at least lets you tell whether a system is running out of memory.
You could also try MEMINFO_WS and CONTMEMGEN but these are undocumented so I really don't know what they do. They show up in xperf -providers k but when I record with them I can't see any new graphs appearing. Apparently Microsoft ships these providers but no way to view them. Sigh...
If you want more memory details on Windows 7 -- such as per-process working sets -- your best bet is to have a process running which periodically queries this data and emits it in custom ETW events. This is available in a prepackaged form in UIforETW which can query the working set of a specified set of processes once a second. See the announcement post for how to get UIforETW:
https://randomascii.wordpress.com/2015/04/14/uiforetw-windows-performance-made-easier/
UIforETW's Windows 7 working set data shows up in Generic Events under Task Name == WorkingSet. On Windows 8.1 the OS working set data (more detailed, more efficiently recorded) shows up under Memory-> Virtual Memory Snapshots.
You can trace memory usage with the ReferenceSet kernel group. It includes the following trace flags:
PROC_THREAD+LOADER+HARD_FAULTS+MEMORY+FOOTPRINT+VIRT_ALLOC+MEMINFO+VAMAP+SESSION+REFSET+MEMINFO_WS
MEMORY = Memory tracing
FOOTPRINT+REFSET = Support footprint analysis
MEMINFO = Memory List Info (active, standby and others you see from ResMon)
VIRT_ALLOC = Virtual allocation reserve and release
VAMAP = mapped files information
MEMINFO_WS = Working set Info
As you can see, xperf can capture a lot of memory data when you use the right flags.
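As a rough sketch of how these flags could be passed to xperf, the Python snippet below only assembles the command lines and does not run anything. It assumes the usual `xperf -on <flags>` form to start kernel tracing and `xperf -d <file>` to stop and save; the output name trace.etl and the helper names are hypothetical.

```python
# Kernel flags listed above for the ReferenceSet group.
REFERENCE_SET_FLAGS = [
    "PROC_THREAD", "LOADER", "HARD_FAULTS", "MEMORY", "FOOTPRINT",
    "VIRT_ALLOC", "MEMINFO", "VAMAP", "SESSION", "REFSET", "MEMINFO_WS",
]


def start_command(kernel_flags):
    """Build the argument list that starts kernel tracing with the given flags."""
    return ["xperf", "-on", "+".join(kernel_flags)]


def stop_command(output_etl):
    """Build the argument list that stops tracing and writes the trace file."""
    return ["xperf", "-d", output_etl]


start = start_command(REFERENCE_SET_FLAGS)
stop = stop_command("trace.etl")  # hypothetical output file name

print(" ".join(start))
print(" ".join(stop))
```

The printed lines can be pasted into an elevated command prompt; the lists could also be handed to `subprocess.run` on a machine where xperf is installed.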

Resources