How to rebuild an Elasticsearch index for lots of data without it getting "Killed" after 15 hours or so

I have about 130 million articles in my Postgres database on AWS. I am trying to index them with elasticsearch. In a screen, I entered:
python manage.py search_index --rebuild -f --parallel --model [APP NAME].[MODEL NAME]
Everything began correctly. The output was:
Deleting index '[MODEL NAME]'
Creating index '[MODEL NAME]'
Indexing 129413202 'MODEL NAME' objects (parallel)
But after about 15 hours, the output was just "Killed". I was running this on a t2.xlarge EC2 instance, which has 16 GB of memory. Interestingly, the "Killed" message appeared after I saw that the connection to the AWS server had dropped, but that shouldn't matter since the process was running inside a screen session. Any idea what the issue is? Do I just need an even larger EC2 instance?

A process that unexpectedly exits with the message Killed has usually received a SIGKILL; if so, its exit code will be 137. It's hard to be certain here, since a process can obviously print Killed and exit with code 137 of its own accord, but assuming your code isn't doing that, this is what I'd check next.
An unexpected SIGKILL often comes from the kernel's OOM killer, which takes action when the system runs out of memory and typically kills the process with the largest memory footprint. If so, it will have logged details in the kernel logs, which you can read with dmesg.
If it was the OOM killer, then this sounds like a bug in the indexing code. Indexing a large body of documents into Elasticsearch should require fairly limited working memory, nowhere near 16 GB, but it's easy to accidentally keep too much data in memory for too long, which leads to exactly this kind of excessive memory usage.
python manage.py search_index suggests you're using the Django Elasticsearch DSL, which fixed a performance issue relatively recently. Make sure you're using a version that contains this fix.
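If upgrading isn't immediately possible, the general principle is to stream the queryset and feed Elasticsearch in fixed-size batches rather than materialising millions of objects up front. A minimal sketch of that idea using elasticsearch-py and a plain Django queryset is below; the Article model, its fields, the index name, and the host are hypothetical, and this bypasses the DSL's own rebuild command entirely:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

from myapp.models import Article  # hypothetical app and model

es = Elasticsearch(["http://localhost:9200"])  # adjust to your cluster

def generate_actions():
    # .iterator() streams rows from Postgres with a server-side cursor
    # instead of loading all ~130 million objects into memory at once.
    for article in Article.objects.all().iterator(chunk_size=2000):
        yield {
            "_index": "articles",  # hypothetical index name
            "_id": article.pk,
            "_source": {"title": article.title, "body": article.body},
        }

# streaming_bulk sends fixed-size batches and keeps only one batch in memory,
# so the process's footprint stays flat no matter how large the table is.
for ok, result in streaming_bulk(es, generate_actions(), chunk_size=1000):
    if not ok:
        print("Failed:", result)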

Related

PostgreSQL 9.6 server process terminated by 0xc0000374

I have a pretty large project in which I execute about 500 queries one after the other, export the data, and then, once the process finishes, load it into another table, all through psql and a bash script on Windows.
PostgreSQL is crashing with this in the logfile:
LOG: server process (PID 5776) was terminated by exception 0xc0000374
DETAIL: Failed process was running: COPY ( SELECT ...)
HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
Unfortunately I don't have many details on how frequently this happens. It sometimes occurs and sometimes doesn't, and it's not clear when; it looks random to me, and it's driving me crazy.
Now, 0xc0000374 seems to mean a heap corruption exception, which still isn't very helpful when applied to my context.
How can I understand what's going on better or prevent it? I would be grateful for any feedback!

Elasticsearch bulk update is extremely slow

I am indexing a large amount of daily data, ~160 GB per index, into Elasticsearch. I am facing a case where I need to update almost all the docs in the indices with a small amount of data (~16 GB) in the following format:
id1,data1
id1,data2
id2,data1
id2,data2
id2,data3
.
.
.
My update operations start at 16,000 lines per second, but within about 5 minutes they drop to 1,000 lines per second and never go back up. The update process for this 16 GB of data currently takes longer than indexing the entire 160 GB.
My Logstash config file for the update operation currently looks as follows:
output {
  elasticsearch {
    action => "update"
    doc_as_upsert => true
    hosts => ["host1","host2","host3","host4"]
    index => "logstash-2017-08-1"
    document_id => "%{uniqueid}"
    document_type => "daily"
    retry_on_conflict => 2
    flush_size => 1000
  }
}
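For comparison, the same update-with-upsert action shape can be reproduced outside Logstash with the elasticsearch-py bulk helper. This is only a sketch of the mechanics, not the pipeline above; the host and the field names are placeholders, and rows stands in for however the 16 GB of id/data pairs is actually read:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://host1:9200"])  # placeholder host

def update_actions(rows):
    # rows is any iterable of (uniqueid, partial_doc) pairs
    for uniqueid, partial_doc in rows:
        yield {
            "_op_type": "update",
            "_index": "logstash-2017-08-1",
            "_id": uniqueid,
            "doc": partial_doc,     # partial document merged into the existing doc
            "doc_as_upsert": True,  # same behaviour as doc_as_upsert => true above
        }

rows = [("id1", {"field": "data1"}), ("id2", {"field": "data2"})]  # stand-in data
bulk(es, update_actions(rows), chunk_size=1000)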
The optimizations I have made to speed up indexing in my cluster, based on the suggestions at https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html, are:
Setting "indices.store.throttle.type" : "none"
Index "refresh_interval" : "-1"
I am running my cluster on 4 d2.8xlarge EC2 instances and have allocated 30 GB of heap to each node.
While the update is happening, barely any CPU is used and the load is very low as well.
Despite all this, the update is extremely slow. Is there something obvious that I am missing that is causing this issue? Looking at the threadpool data, I see that the number of threads working on bulk operations is constantly high.
Any help on this issue would be much appreciated.
Thanks in advance.
There are a couple of rule-outs to try here.
Memory Pressure
With 244 GB of RAM, this is not terribly likely, but you can still rule it out. Find the jstat command in the JDK for your platform (there are also visual tools for some platforms). You want to check both your Logstash JVM and the Elasticsearch JVMs.
jstat -gcutil -h7 {PID of JVM} 2s
This will give you a readout of the various memory pools, garbage-collection counts, and GC timings for that JVM as it works. It will update every 2 seconds and print headers every 7 lines. Spending excessive time in full GC (the FGC/FGCT columns) is a sign that your heap is under-allocated.
I/O Pressure
The d2.8xlarge is a dense-storage instance and may not be great for a highly random, small-block workload. If you're on a Unix platform, top will tell you how much time you're spending in the iowait state. If that is high, your storage isn't up to the workload you're sending it.
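If you'd rather watch iowait over time than eyeball top, a few lines of Python with psutil will do it. psutil is my suggestion, not something mentioned above, and the iowait field is only reported on Linux:
import psutil  # third-party: pip install psutil

# Sample CPU time percentages every 2 seconds for about a minute; a
# consistently high iowait column points at storage that can't keep up.
for _ in range(30):
    cpu = psutil.cpu_times_percent(interval=2)
    print(f"user={cpu.user:5.1f}%  system={cpu.system:5.1f}%  iowait={cpu.iowait:5.1f}%")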
If that's the case, you may want to consider provisioned-IOPS EBS volumes rather than the instance-local storage. Or, if your data will fit, consider an instance in the i3 family of high-I/O instances instead.
Logstash version
You don't say which version of Logstash you're using. This being Stack Overflow, you're likely to be on 5.2. If that's the case, this isn't a rule-out.
But if you're using something in the 2.x series, you may want to set the -w flag to 1 at first and work your way up. Yes, that single-threads the pipeline. But the Elasticsearch output has some concurrency issues in the 2.x series that are largely fixed in the 5.x series.
With Elasticsearch 6.0 we had exactly the same issue of slow updates on AWS, and the culprit was slow I/O. The same data upserted completely fine on a local test stack, but once in the cloud on an EC2 stack, everything died after an initial burst of speedy inserts lasting only a few minutes.
The local test stack was a low-spec server in terms of memory and CPU, but it had SSDs.
The EC2 stack used EBS volumes of the default gp2 type with 300 IOPS.
Converting the volumes to type io1 with 3000 IOPS solved the issue and everything got back on track.
I am using the Amazon AWS Elasticsearch Service, version 6.0. I need to do heavy writes/inserts from a series of JSON files into Elasticsearch, about 10 billion items in total. The elasticsearch-py bulk write speed was really slow most of the time, with only occasional bursts of high-speed writes. I tried all kinds of methods, such as splitting the JSON files into smaller pieces, reading the JSON files with multiple processes, and using parallel_bulk to insert into Elasticsearch; nothing worked. Finally, after I upgraded to an io1 EBS volume, everything went smoothly with 10,000 write IOPS.
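For anyone who goes down the same client-side route before fixing the disks, this is roughly what a parallel_bulk loader looks like with elasticsearch-py. The endpoint, file layout (one JSON document per line), index name, and id field are all assumptions:
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(["https://my-domain.es.amazonaws.com:443"])  # placeholder endpoint

def actions_from_file(path):
    # Assumes newline-delimited JSON; adjust parsing to your actual layout.
    with open(path) as handle:
        for line in handle:
            doc = json.loads(line)
            yield {"_index": "items", "_id": doc["id"], "_source": doc}

# parallel_bulk fans the batches out across a thread pool, but it still can't
# outrun the cluster's disks, which is why the EBS upgrade was the real fix.
for ok, result in parallel_bulk(es, actions_from_file("items.ndjson"),
                                thread_count=4, chunk_size=1000):
    if not ok:
        print("Failed:", result)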

Hadoop Error: Java heap space

So, after the job has run for a percent or so, I get an error that says "Error: Java heap space" and then something along the lines of "Application container killed".
I am literally running an empty map and reduce job. However, the job does take an input that is roughly 100 GB. For whatever reason, I run out of heap space, even though the job does nothing.
I am using the default configuration on a single machine, running Hadoop 2.2 on Ubuntu. The machine has 4 GB of RAM.
Thanks!
Note: Got it figured out.
It turns out I had configured a different terminating token/string. The format of the data had changed, so that token/string no longer existed, and the job was trying to read all 100 GB into RAM as a single key.

High Write I/O "Pulsing" with dtrace errors

We are experiencing "pulsing" writes to disk (from 1 write out/sec up to 142+ writes out/sec) around every 10 seconds.
See this example image:
https://discussions.apple.com/servlet/JiveServlet/showImage/2-22394173-269851/Screen+Shot+2013-07-03+at+13.22.28.png
We dug into these "pulsing" writes and found that they happen at exactly the same time as these errors from iotop:
dtrace: error on enabled probe ID 5 (ID 992: io:mach_kernel:buf_strategy:start): illegal operation in action #3 at DIF offset 0
The "pulsing" only happens when the error above presents itself in IOTOP.
Note: we are running Apple RAID software mirroring for two drives.
Any suggestions, help and tips will be greatly appreciated. Thanks in advance.
The pulsing I/O pattern you're seeing is characteristic of applications where many or most filesystem writes are asynchronous. This is because the filesystem batches up the writes so it can do many at the same time and avoid doing one disk seek per write. The most common example I can think of is a database writing data: except for the database's write-ahead log, everything is typically written asynchronously, and other transactional access patterns tend to be similar because they have a write-ahead log to recover from if some async writes are lost in a crash. This is a common access pattern and isn't necessarily a problem, but it can become one when your disk is highly fragmented and the filesystem can't write everything out in batches, causing many of the seeks that batching was meant to avoid.
The DTrace/iotop error you're seeing means there's either a bug in the DTrace implementation itself or in the iotop DTrace script. Looking at iotop's source code (in /usr/bin/iotop on OS X), there are three io:::start callbacks which could be the culprit. It's possible that there's some sort of null pointer access in the script for some types of I/O, but it doesn't look likely based on the script and the arguments io:::start probes take. Perhaps this is best resolved with a bug report to Apple.

iowait for while/sleep bash script on debian/64

I'm running a MySQL database server (Debian Squeeze, 64-bit) with 48 GB of RAM and 8 TB of disk, handling lots of inserts and quite a few CPU-intensive background processes.
Since some of these processes kept dying, I used a simple bash watchdog to restart them, which worked but produced a lot of iowait. I simplified the problem down to:
#!/bin/bash
while true; do sleep 1; done
which still produces iowait of up to 90% for the bash(!) process (as seen in iotop). There's no disk read or write displayed, and the test script really is just this one line.
Note that everything's working fine and the server is still perfectly responsive. I'm just curious to know what's going on.
Does anyone have any idea?
I was not able to reproduce your results; possibly it's a bug.
I tested on Arch (VM) and Ubuntu (physical); running a while loop with a one-second sleep resulted in minimal I/O and essentially no iowait.
