Find out reason for slow indexing in elasticsearch - elasticsearch

I have written a script to bulk index a dataset with elasticsearch. It is working as intended, however, if I run the same script on the same dataset on different servers the execution time varies. In the server equipped with SSD, the 2 million documents are done indexing within 10 minutes, however on the one with normal hard disk, it takes up to an hour to complete. Is there a diagnostic tool I can make use of to figure out what causes the slow down?
Some additional information:
The script is written for Python3, and uses elasticsearch-py module for the bulk indexing
Both server runs the same operating system (Ubuntu 14.04 LTS), the one with slower hard drive has 64GB of RAM, but the one with SSD has half the RAM.

You will run into index merges when the large number of records is ingested. That is a process heavily dependent on the speed of the underlying storage. RAM is not really that significant here - it may be more significant when it comes to query performance and stuff you do there. Disk latencies will add up and cause a slow-down compared to the SSD platform.
Therefore, I am not surprised about the SSD speedup. SSD storage is faster than HDD by a factor of 3-8, depending on the manufacturers. If you take into account that HDDs also needs to perform positioning operations for access to different parts of the storage, it is clear that simply using an SDD instead of an HDD can accelerate disk-bound applications by a factor of 10 and more.

Related

SAN Performance

Have a question regarding SAN performance specifically EMC VNX SAN. I have a significant number of processes spread over number of blade servers running concurrently. The number of processes is typically around 200. Each process loads 2 small files from storage, one 3KB one 30KB. There are millions (20) of files to be processed. The processes are running on Windows Server on VMWare. The way this was originally setup was 1TB LUNs on the SAN bundled into a single 15TB drive in VMWare and then shared as a network share from one Windows instance to all the processes. The processes running concurrently and the performance is abysmal. Essentially, 200 simultaneous requests are being serviced by the SAN through Windows share at the same time and the SAN is not handling it too well. I'm looking for suggestions to improve performance.
With all performance questions, there's a degree of 'it depends'.
When you're talking about accessing a SAN, there's a chain of potential bottlenecks to unravel. First though, we need to understand what the actual problem is:
Do we have problems with throughput - e.g. sustained transfer, or latency?
It sounds like we're looking at random read IO - which is one of the hardest workloads to service, because predictive caching doesn't work.
So begin at the beginning:
What sort of underlying storage are you using?
Have you fallen into the trap of buying big SATA, configuring it RAID-6? I've seen plenty of places do this because it looks like cheap terabytes, without really doing the sums on the performance. A SATA drive starts to slow down at about 75 IO operations per second. If you've got big drives - 3TB for example - that's 25 IOPs per terabytes. As a rough rule of thumb, 200 per drive for FC/SAS and 1500 for SSD.
are you tiering?
Storage tiering is a clever trick of making a 'sandwich' out of different speeds of disk. This usually works, because usually only a small fraction of a filesystem is 'hot' - so you can put the hot part on fast disk, and the cold part on slow disk, and average performance looks better. This doesn't work for random IO or cold read accesses. Nor does it work for full disk transfers - as only 10% of it (or whatever proportion) can ever be 'fast' and everything else has to go the slow way.
What's your array level contention?
The point of SAN is that you aggregate your performance, such that each user has a higher peak and a lower average, as this reflects most workloads. (When you're working on a document, you need a burst of performance to fetch it, but then barely any until you save it again).
How are you accessing your array?
Typically SAN is accessed using a Fiber Channel network. There's a whole bunch of technical differences with 'real' networks, but they don't matter to you - but contention and bandwidth still do. With ESX in particular, I find there's a tendency to underestimate storage IO needs. (Multiple VMs using a single pair of HBAs means you get contention on the ESX server).
what sort of workload are we dealing with?
One of the other core advantages of storage arrays is caching mechanisms. They generally have very large caches and some clever algorithms to take advantage of workload patterns such as temporal locality and sequential or semi-sequential IO. Write loads are easier to handle for an array, because despite the horrible write penalty of RAID-6, write operations are under a soft time constraint (they can be queued in cache) but read operations are under a hard time constraint (the read cannot complete until the block is fetched).
This means that for true random read, you're basically not able to cache at all, which means you get worst case performance.
Is the problem definitely your array? Sounds like you've a single VM with 15TB presented, and that VM is handling the IO. That's a bottleneck right there. How many IOPs are the VM generating to the ESX server, and what's the contention like there? What's the networking like? How many other VMs are using the same ESX server and might be sources of contention? Is it a pass through LUN, or VMFS datastore with a VMDK?
So - there's a bunch of potential problems, and as such it's hard to roll it back to a single source. All I can give you is some general recommendations to getting good IO performance.
fast disks (they're expensive, but if you need the IO, you need to spend money on it).
Shortest path to storage (don't put a VM in the middle if you can possibly avoid it. For CIFS shares a NAS head may be the best approach).
Try to make your workload cacheable - I know, easier said than done. But with millions of files, if you've got a predictable fetch pattern your array will start prefetching, and it'll got a LOT faster. You may find if you start archiving the files into large 'chunks' you'll gain performance (because the array/client will fetch the whole chunk, and it'll be available for the next client).
Basically the 'lots of small random IO operations' especially on slow disks is really the worst case for storage, because none of the clever tricks for optimization work.

Virtuoso System Requirements

We would be using Virtuoso for storing RDFs, the triple count will be 100 million to start with. I need to know what should be typical RAM, CPU, Disk etc for this. Querying will be with SPARQL and there will be a bit complex queries.
Kindly provide your inputs.
The average size of a Virtuoso version 6.x triple (quad) is about 30bytes thus for 100 million triples you would need about 3GB RAM , this being the most critical component to enable the database working set to fit in memory , data does not need to be loaded from disk once the database is "warmed up", for best performance. This would be especially the case when running complex queries. In terms of disk, the fast they are the quicker the databaase can be loaded into memory, checkpoints performed etc. thus SSDs or similar devices are recommended where possible, espcially if memory is limited and reading data from disk at times in unavoidable. In terms of processor standard commodity 64bit processor available today would suffice, typically running on a Linux x86_64 system of your choice, as said memory is always the most critical component though.
See the following Virtuoso FAQ and peformance tuning documents for more details:
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/#FAQ

How many cores for SSIS?

I did a proof of concept for a complex transformation in SSIS. I have performance metrics now for this POC that I created in a virtual machine, with 1 gig memory, 1 core assigned. The SSIS transformations are all file based (source and target).
Now I want to use this metric for choosing the right amount of cores and memory in production environment.
What would be the right strategy to determine the right amount of cores and memory for production if I know the amount of files per day and the total amount of file size per day to be transformed ?
(edit) Think about total transfer sizes of 100 gigabyte and 5000 files per day!
You'd want to do two other benchmarks: 2 GB mem, 1 core and 1 GB mem, dual core. Taking a snapshot of a fairly tiny environment is difficult to extrapolate without a couple more datapoints.
Also, with only 1GB RAM you'll also want to make sure the server isn't also running out of memory and paging to disk (which will skew your figures somewhat as everything becomes reliant on disk access - and given you're already reading from disk anyway...). So make sure you know what's happening there as well.
SSIS tries to buffer as much as it can in memory for speed, so more memory is always good :-) The bigger question is what benefit extra cores will give you.
There are a number of areas for performance. One is the number of cores. The more cores you have the more parallel work that can be done. This of course is also dependent upon how you build your package. Certain objects are synchronous others are asynchronous. Memory is also a factor, but it is limited to 100MB/dataflow component.

RAMdisk slower than disk?

A python program I created is IO bounded. The majority of the time (over 90%) is spent in a single loop which repeats ~10,000 times. In this loop, ~100KB data is generated and written to a temporary file; it is then read back out by another program and statistics about that data collected. This is the only way to pass data into the second program.
Due to this being the main bottleneck, I thought that moving the location of the temporary file from my main HDD to a (~40MB) RAMdisk (inside of over 2GB of free RAM) would greatly increase the IO speed for this file and so reduce the run-time. However, I obtained the following results (each averaged over 20 runs):
Test data 1: Without RAMdisk - 72.7s, With RAMdisk - 78.6s
Test data 2: Without RAMdisk - 223.0s, With RAMdisk - 235.1s
It would appear that the RAMdisk is slower that my HDD.
What could be causing this?
Are there any other alternative to using a RAMdisk in order to get faster file IO?
Your operating system is almost certainly buffering/caching disk writes already. It's not surprising the RAM disk is so close in performance.
Without knowing exactly what you're writing or how, we can only offer general suggestions. Some ideas:
If you have 2 GB RAM you probably have a decent processor, so you could write this data to a filesystem that has compression. That would trade I/O operations for CPU time, assuming your data is amenable to that.
If you're doing many small writes, combine them to write larger pieces at once. (Can we see the source code?)
Are you removing the 100 KB file after use? If you don't need it, then delete it. Otherwise the OS may be forced to flush it to disk.
Can you write the data out in batches rather than one item at a time? Are you caching resources like open file handles etc or cleaning those up? Are your disk writes blocking, can you use background threads to saturate IO while not affecting compute performance.
I would look at optimising the disk writes first, and then look at faster disks when that is complete.
I know that Windows is very aggressive about caching disk data in RAM, and 100K would fit easily. The writes are going directly to cache and then perhaps being written to disk via a non-blocking write, which allows the program to continue. The RAM disk probably wouldn't support non-blocking operations because it expects those operations to be quick and not worth the bother.
By reducing the amount of memory available to programs and caching, you're going to increase the amount of disk I/O for paging even if only slightly.
This is all speculation on my part, since I'm not familiar with the kernel or drivers. I also speculate that Linux would operate similarly.
In my tests I've found that not only batch size affects overall performance, but also the nature of data itself. I've managed to get 5 times better write times compared to SSD in only one scenario: writing a 100MB chunk of pre-cooked random byte array to RAM drive. Writing more "predictable" data like letters "aaa" or current datetime yields quite opposite results - SSD is always faster or equal. So my guess is that opertating system (Win 7 in my case) does lots of caching and optimizations.
Looks like the most hindering case for RAM-drive is when you perform lots of small writes instead of a few big ones, and RAM drive shines at writing large amounts of hard-to-compress data.
I had the same mind boggling experience, and after many tries I figured it out.
When ramdisk is formatted as FAT32, then even though benchmarks shows high values, real world use is actually slower than NTFS formatted SSD.
But NTFS formatted ramdisk is faster in real life than SSD.
I join the people having problems with RAM disk speeds (only on Windows).
The SSD i have can write 30 GiB (in one big block, dump a 30GiB RAM ARRAY) with a speed of 550 MiB/s (arround 56 seconds to write 30 GiB) ... this is if the write is asked in one source code sentence.
The RAM Disk (imDisk) i have can write 30 GiB write (in one big block, dump a 30GiB RAM ARRAY) with a speed of a bit less than 100 MiB/s (arround 5 minutes and 13 seconds to write 30 GiB) ... this is if the write is asked in one source code sentence.
I had also done another RAM test: from source code do a sequential direct write (one byte per source code loop pass) to a 30GiB RAM ARRAY (i have 64GiB of RAM) and i get a speed of near 1.3GiB/s (1298 MiB per second).
Why on the hell (on Windows) RAM Disk is so slow for one BIG secuential write?
Of course that low write speed happens on RAM disks on Windows, since i tested the same 'concept' on Linux with Linux native ram disk and Linux ram disk can write at near one gigabyte per second.
Please note that i had also tested SoftPerfect and other RAM disks on Windows, RAM Disk speeds are near the same, can not write at more than one hundred megabytes per second.
Actual Windows tested: 10 & 11 (on both HOME & PRO, on 64 bits), RAM Disk format (exFAT & NTFS); since RAM disk speed was too slow i was trying to find one Windows version where RAM disk speed be normal, but found no one.
Actual Linux Kernel tested: Only 5.15.11, since Linux native RAM disk speed was normal i do not test on any other kernel.
Hope this help other people, since knowledge is the base to solve a problem.

Estimation of commodity hardware for an application

Suppose, I wanted to develop stack overflow website. How do I estimate the amount of commodity hardware required to support this website assuming 1 million requests per day. Are there any case studies that explains the performance improvements possible in this situation?
I know I/O bottleneck is the major bottleneck in most systems. What are the possible options to improve I/O performance? Few of them I know are
caching
replication
You can improve I/O performance in several ways depending upon what you use for your storage setup:
Increase filesystem block size if your app displays good spatial locality in its I/Os or uses large files.
Use RAID 10 (striping + mirroring) for performance + redundancy (disk failure protection).
Use fast disks (Performance Wise: SSD > FC > SATA).
Segregate workloads at different times of day. e.g. Backup during night, normal app I/O during day.
Turn off atime updates in your filesystem.
Cache NFS file handles a.k.a. Haystack (Facebook), if storing data on NFS server.
Combine small files into larger chunks, a.k.a BigTable, HBase.
Avoid very large directories i.e. lots of files in the same directory (instead divide files between different directories in a hierarchy).
Use a clustered storage system (yeah not exactly commodity hardware).
Optimize/design your application for sequential disk accesses whenever possible.
Use memcached. :)
You may want to look at "Lessons Learned" section of StackOverflow Architecture.
check out this handy tool:
http://www.sizinglounge.com/
and another guide from dell:
http://www.dell.com/content/topics/global.aspx/power/en/ps3q01_graham?c=us&l=en&cs=555
if you want your own stackoverflow-like community, you can sign up with StackExchange.
you can read some case studies here:
High Scalability - How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
http://www.gear6.com/gear6-downloads?fid=56&dlt=case-study&ls=Veoh-Case-Study
1 million requests per day is 12/second. Stack overflow is small enough that you could (with interesting normalization and compression tricks) fit it entirely in RAM of a 64 GByte Dell PowerEdge 2970. I'm not sure where caching and replication should play a role.
If you have a problem thinking enough about normalization, a PowerEdge R900 with 256GB is available.
If you don't like a single point of failure, you can connect a few of those and just push updates over a socket (preferably on a separate network card). Even a peak load of 12K/second should not be a problem for a main-memory system.
The best way to avoid the I/O bottleneck is to not do I/O (as much as possible). That means a prevayler-like architecture with batched writes (no problem to lose a few seconds of data), basically a log file, and for replication also write them out to a socket.

Resources