Solr indexing performance

We are experiencing performance issues with Solr batch indexing: our cluster is composed of 4 workers, each equipped with 32 cores and 256GB of RAM. YARN is configured to use 100 vCores and 785.05GB of memory. HDFS storage is provided by an EMC Isilon system connected through a 10Gb interface. The cluster runs CDH 5.8.0 with Solr 4.10.3 and is Kerberized.
With the current setup, in terms of compressed data, we can index about 25GB per day and 500GB per month using MapReduce jobs. Some of these jobs run daily and take almost 12 hours to index 15GB of compressed data. In particular, MorphlineMapper jobs last approximately 5 hours and TreeMergeMapper jobs about 6 hours.
Is this performance normal? Can you suggest any tweaks that could improve our indexing performance?

We are using the MapReduceIndexerTool and there are no network problems. We are reading compressed files from HDFS and decompressing them in our morphline. This is the way we run our script:
cmd_hdp=$(
HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf" hadoop --config /etc/hadoop/conf.cloudera.yarn \
jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D morphlineVariable.ZK_HOST=hostname1:2181/solr \
-D morphlineVariable.COLLECTION=my_collection \
-D mapreduce.map.memory.mb=8192 \
-D mapred.child.java.opts=-Xmx4096m \
-D mapreduce.reduce.java.opts=-Xmx4096m \
-D mapreduce.reduce.memory.mb=8192 \
--output-dir hdfs://isilonhostname:8020/tmp/my_tmp_dir \
--morphline-file morphlines/my_morphline.conf \
--log4j log4j.properties \
--go-live \
--collection my_collection \
--zk-host hostname1:2181/solr \
hdfs://isilonhostname:8020/my_input_dir/
)
The MorphlineMapper phase takes all available resources, while the TreeMergeMapper phase takes only a couple of containers.
We don't need to run queries for the moment; we just need to index historical data. We are wondering whether there is a way to speed up indexing and then optimize the collections for searching once indexing is complete.
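For the record, two knobs on MapReduceIndexerTool itself are relevant to this trade-off: --reducers controls how many reducers build the shard indexes in parallel (if it exceeds the number of Solr shards, the extra indexes are merged by the TreeMergeMapper step), and --max-segments controls how aggressively each reducer compacts its index (the default of 1 fully optimizes every shard, which is the slowest choice for indexing but the best for queries). Below is a hedged sketch of a trimmed version of the command with those flags added; the memory options and log4j config are omitted for brevity, and the values 8 and 20 as well as the Solr hostname are illustrative, not recommendations.
# Hedged sketch: the original invocation plus --reducers and --max-segments.
HADOOP_OPTS="-Djava.security.auth.login.config=jaas.conf" hadoop --config /etc/hadoop/conf.cloudera.yarn \
jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D morphlineVariable.ZK_HOST=hostname1:2181/solr \
-D morphlineVariable.COLLECTION=my_collection \
--reducers 8 \
--max-segments 20 \
--output-dir hdfs://isilonhostname:8020/tmp/my_tmp_dir \
--morphline-file morphlines/my_morphline.conf \
--go-live \
--collection my_collection \
--zk-host hostname1:2181/solr \
hdfs://isilonhostname:8020/my_input_dir/

# Once the historical backfill is done, force-merge (optimize) the collection for
# query performance; with Kerberos, curl needs SPNEGO (--negotiate -u :).
curl --negotiate -u : \
  'http://solrhostname:8983/solr/my_collection/update?optimize=true&maxSegments=1'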

Related

How to run concurrent Active Jobs in Spark Streaming and fair task scheduling among executors

I am using Spark Streaming on YARN and I am facing the issues below.
Issue 1:
I am using Spark Streaming (1.6.1) on YARN and I always see the active job count as 1, which means only 1 job is running at a time. I have used the "--conf spark.streaming.concurrentJobs=3" parameter, but no luck: I always see only 1 active job.
Issue 2:
I have 50 Kafka partitions and Spark Streaming creates 50 RDD partitions, but I can see that 95% of tasks are allocated to only 1 executor; the rest of the executors almost always have zero active tasks.
My Spark Submit command is as follows:
spark-submit \
--verbose \
--master yarn-cluster \
--num-executors 3 \
--executor-memory 7g \
--executor-cores 3 \
--conf spark.driver.memory=1024m \
--conf spark.streaming.backpressure.enabled=false \
--conf spark.streaming.kafka.maxRatePerPartition=3 \
--conf spark.streaming.concurrentJobs=3 \
--conf spark.speculation=true \
--conf spark.hadoop.fs.hdfs.impl.disable.cache=true \
--files kafka_jaas.conf#kafka_jaas.conf,user.headless.keytab#user.headless.keytab \
--driver-java-options "-Djava.security.auth.login.config=./kafka_jaas.conf -Dhttp.proxyHost=PROXY_IP -Dhttp.proxyPort=8080 -Dhttps.proxyHost=PROXY_IP -Dhttps.proxyPort=8080 -Dlog4j.configuration=file:/home/user/spark-log4j/log4j-topic_name-driver.properties" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_jaas.conf -Dlog4j.configuration=file:/home/user/spark-log4j/log4j-topic_name-executor.properties" \
--class com.spark.demo.StreamProcessor /home/user/demo.jar /tmp/data/out 30 KAFKA_BROKER:6667 "groupid" topic_name
--conf spark.streaming.kafka.maxRatePerPartition=3
Also, why do you have the max rate per partition so low? This means it will process only 3 records per second per partition. So if the micro-batch interval is 30 seconds and, say, you have 3 partitions, it will process 30*3*3 = 270 records, which seems pretty low.
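For what it's worth, spark.streaming.concurrentJobs only shows an effect when there are multiple independent jobs to run (for example, batches queuing up or several independent output operations per batch), and skew onto a single executor is frequently a data-locality effect that lowering spark.locality.wait works around. A hedged sketch of the relevant additions to the spark-submit above; the values are illustrative, not tuned:
# Hedged sketch: illustrative scheduling/throughput settings only; other options as before.
spark-submit \
--master yarn-cluster \
--num-executors 5 \
--executor-cores 3 \
--executor-memory 7g \
--conf spark.locality.wait=0s \
--conf spark.streaming.kafka.maxRatePerPartition=1000 \
--conf spark.streaming.concurrentJobs=3 \
--class com.spark.demo.StreamProcessor /home/user/demo.jar /tmp/data/out 30 KAFKA_BROKER:6667 "groupid" topic_name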

Morphline Read one big file

I have a Hive table that I am trying to index into SolrCloud using morphline; however, the data behind the Hive table is ONE big 20GB file that morphline is taking a long time to process.
Instead of running multiple mappers and reducers, there is only 1 mapper running, probably due to the fact that we have only one file.
yarn jar /opt/<path>/search-mr-1.0.0-cdh5.5.1-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphlines.conf \
--output-dir hdfs://<outputdir> \
--zk-host node1.datafireball.com:2181/solr \
--collection <collectionname> \
--input-list <filewherethedatais> \
--mappers 6
And it still kicked off only 1 mapper... and this is taking forever. Can anyone shed some light on this?
Resources you might find helpful:
Cloudera MapReduce Batch Indexing into SolrCloud
Kite SDK, which morphline belongs to.
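One thing worth knowing about the tool: MapReduceIndexerTool hands whole input files to mappers and never splits a file, so a single 20GB file means a single mapper no matter what --mappers is set to. A hedged workaround is to split the file into several HDFS files before indexing; the paths, file name and chunk size below are illustrative:
# Split the single Hive data file into ~1GB pieces at line boundaries (-C keeps
# records intact) and push them back to HDFS so each piece can get its own mapper.
hdfs dfs -cat /user/hive/warehouse/my_table/000000_0 | split -C 1G -d - part_
hdfs dfs -mkdir -p /user/me/solr_input
for f in part_*; do
  hdfs dfs -put "$f" /user/me/solr_input/
done
# Then point the tool (or the file referenced by --input-list) at /user/me/solr_input
# and re-run with --mappers set to at least the number of pieces.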

How do I control which tasks run on which hosts?

I'm running Giraph, which executes on our small CDH4 Hadoop cluster of five hosts (four compute nodes and a head node - call them 0-3 and 'w') - see versions below. All five hosts are running mapreduce tasktracker services, and 'w' is also running the jobtracker. Resources are tight for my particular Giraph application (a kind of path-finding), and I've discovered that some configurations of the automatically-scheduled hosts for tasks work better than others.
More specifically, my Giraph command (see below) specifies four Giraph workers, and when executing, Hadoop (Zookeeper actually, IIUC) creates five tasks that I can see in the jobtracker web UI: one master and four slaves. When it puts three or more of the map tasks on 'w' (e.g., 01www or 1wwww), then that host maxes out ram, cpu, and swap, and the job hangs. However, when the system spreads the work out more evenly so that 'w' has only two or fewer tasks (e.g., 123ww or 0321w), then the job finishes fine.
My question is, 1) what program is deciding the task-to-host assignment, and 2) how do I control that?
Thanks very much!
Versions
CDH: 4.7.3
Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha" (CHANGELOG starts with: Release 1.0.0 - 2013-04-15)
Zookeeper Client environment: zookeeper.version=3.4.5-cdh4.4.0--1, built on 09/04/2013 01:46 GMT
Giraph command
hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=wright.cs.umass.edu:2181 \
-libjars ${LIBJARS} \
relpath.RelPathVertex \
-wc relpath.RelPathWorkerContext \
-mc relpath.RelPathMasterCompute \
-vif relpath.JsonAdjacencyListVertexInputFormat \
-vip $REL_PATH_INPUT \
-of relpath.JsonAdjacencyListTextOutputFormat \
-op $REL_PATH_OUTPUT \
-ca RelPathVertex.path=$REL_PATH_PATH \
-w 4
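For what it's worth, in MRv1 it is the JobTracker's scheduler that hands tasks to TaskTrackers as they heartbeat in, so you cannot pin specific tasks to hosts, but you can cap how many map slots a given TaskTracker advertises. A hedged sketch, assuming the standard MRv1 slot property and CDH4's MRv1 service name; the value 2 mirrors the placements that worked for you:
# On host 'w' only: add this property inside the <configuration> element of
# mapred-site.xml, then restart the TaskTracker so 'w' offers at most 2 map slots.
cat <<'EOF'
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
EOF
sudo service hadoop-0.20-mapreduce-tasktracker restart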

Dump all documents of Elasticsearch

Is there any way to create a dump file that contains all the data of an index, along with its settings and mappings?
Something similar to what MongoDB does with mongodump,
or the way Solr's data folder is copied to a backup location.
Cheers!
Here's a new tool we've been working on for exactly this purpose https://github.com/taskrabbit/elasticsearch-dump. You can export indices into/out of JSON files, or from one cluster to another.
Elasticsearch supports a snapshot function out of the box:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
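A minimal sketch of that built-in API; the repository name, filesystem path and snapshot name are illustrative, and the location must be listed under path.repo in elasticsearch.yml on every node:
# Register a shared-filesystem snapshot repository, then take a snapshot of all indices.
curl -XPUT 'http://localhost:9200/_snapshot/my_backup' \
  -H 'Content-Type: application/json' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups/my_backup" }
}'
# wait_for_completion=true blocks until the snapshot is finished.
curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'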
We can use elasticdump to take a backup and restore it. We can move data from one server/cluster to another server/cluster.
1. Commands to move one index's data from one server/cluster to another using elasticdump.
# Copy an index from production to staging with analyzer and mapping:
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=analyzer
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=mapping
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=data
2. Commands to move all indices data from one server/cluster to another using multielasticdump.
Backup
multielasticdump \
--direction=dump \
--match='^.*$' \
--limit=10000 \
--input=http://production.es.com:9200 \
--output=/tmp
Restore
multielasticdump \
--direction=load \
--match='^.*$' \
--limit=10000 \
--input=/tmp \
--output=http://staging.es.com:9200
Note:
If --direction is dump, which is the default, --input MUST be a URL for the base location of an Elasticsearch server (e.g. http://localhost:9200) and --output MUST be a directory. Each index that matches will have a data, mapping, and analyzer file created.
For loading files that you have dumped with multielasticdump, --direction should be set to load, --input MUST be a directory of a multielasticdump dump, and --output MUST be an Elasticsearch server URL.
The second command will take a backup of the settings, mappings, templates and the data itself as JSON files.
The --limit should not be more than 10000; otherwise, it will throw an exception.
Get more details here.
For your case, Elasticdump is the perfect answer.
First, you need to dump the mapping and then the index:
# Install the elasticdump
npm install elasticdump -g
# Dump the mapping
elasticdump --input=http://<your_es_server_ip>:9200/index --output=es_mapping.json --type=mapping
# Dump the data
elasticdump --input=http://<your_es_server_ip>:9200/index --output=es_index.json --type=data
If you want to dump the data on any server, I advise you to install esdump through Docker. You can get more info from this website: Blog Link
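To load those dumps back into another cluster, elasticdump simply reverses --input and --output (a hedged example; the target host and index names are illustrative):
# Restore the mapping first, then the data, into the target cluster.
elasticdump --input=es_mapping.json --output=http://<target_es_server_ip>:9200/index --type=mapping
elasticdump --input=es_index.json --output=http://<target_es_server_ip>:9200/index --type=data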
Elasticsearch itself provides a way to create a backup of your data and restore it. The simple command to do it is:
curl -XPUT 'localhost:9200/_snapshot/<backup_folder name>/<backupname>' -d '{
"indices": "<index_name>",
"ignore_unavailable": true,
"include_global_state": false
}'
Now, how to create this folder, how to include its path in the Elasticsearch configuration so that it is available to Elasticsearch, and the restoration method are all well explained here. To see a practical demo, surf here.
At the time of writing this answer (2021), the official way of backing up an Elasticsearch cluster is to snapshot it. Refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html
The data itself is one or more Lucene indices, since you can have multiple shards. What you also need to back up is the cluster state, which contains all sorts of information regarding the cluster, the available indices, their mappings, the shards they are composed of, etc.
It's all within the data directory though; you can just copy it. Its structure is pretty intuitive. Right before copying, it's better to disable automatic flush (in order to back up a consistent view of the index and avoid writes on it while copying the files), issue a manual flush, and disable allocation as well. Remember to copy the directory from all nodes.
Also, the next major version of Elasticsearch is going to provide a new snapshot/restore API that will allow you to perform incremental snapshots and restore them too via the API. Here is the related GitHub issue: https://github.com/elasticsearch/elasticsearch/issues/3826.
You can also dump Elasticsearch data in JSON format by HTTP request, using the scroll API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
curl -XPOST 'https://ES/INDEX/_search?scroll=10m'
curl -XPOST 'https://ES/_search/scroll' -d '{"scroll": "10m", "scroll_id": "ID"}'
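A hedged sketch of how those two calls chain together into a full export; jq, the index name and the page size are assumptions, and newer Elasticsearch versions also require a Content-Type header:
# Page through every document of INDEX with the scroll API and append the hits to dump.json.
ES_URL="https://ES"
INDEX="INDEX"
# First page: open a scroll context that stays alive for 10 minutes.
resp=$(curl -s -XPOST "$ES_URL/$INDEX/_search?scroll=10m" \
  -H 'Content-Type: application/json' -d '{"size": 1000}')
scroll_id=$(echo "$resp" | jq -r '._scroll_id')
while true; do
  hits=$(echo "$resp" | jq '.hits.hits')
  [ "$(echo "$hits" | jq 'length')" -eq 0 ] && break   # no more documents left
  echo "$hits" >> dump.json
  # Next page: pass the scroll_id back to keep the context alive.
  resp=$(curl -s -XPOST "$ES_URL/_search/scroll" \
    -H 'Content-Type: application/json' \
    -d "{\"scroll\": \"10m\", \"scroll_id\": \"$scroll_id\"}")
  scroll_id=$(echo "$resp" | jq -r '._scroll_id')
done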
To export all documents from ElasticSearch into JSON, you can use the esbackupexporter tool. It works with index snapshots. It takes the container with snapshots (S3, Azure blob or file directory) as the input and outputs one or several zipped JSON files per index per day. It is quite handy when exporting your historical snapshots. To export your hot index data, you may need to make the snapshot first (see the answers above).
If you want to massage the data on its way out of Elasticsearch, you might want to use Logstash. It has a handy Elasticsearch input plugin.
And then you can export to anything, from a CSV file to reindexing the data on another Elasticsearch cluster. Though for the latter, you also have Elasticsearch's own Reindex API.
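A hedged sketch of what such a pipeline could look like when run with an inline config string; the host, index, field names and output path are illustrative, and it assumes the elasticsearch input and csv output plugins are available:
# Read every document of my_index and write two selected fields to a CSV file.
logstash -e '
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index"
  }
}
output {
  csv {
    fields => ["field1", "field2"]
    path   => "/tmp/my_index.csv"
  }
}'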

Hadoop streaming with single mapper

I am using Hadoop streaming and I start the script as follows:
../hadoop/bin/hadoop jar ../hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar \
-mapper ../tests/mapper.php \
-reducer ../tests/reducer.php \
-input data \
-output out
"data" is 2.5 GB txt file.
however in ps axf I can see only one mapper. i tried with -Dmapred.map.tasks=10, but result is the same - single mapper.
how can I make hadoop split my input file and start several mapper processes?
To elaborate on my comments: if your file isn't in HDFS and you're running with the local job runner, then the file will only be processed by a single mapper.
A large file is typically processed by several mappers because it is stored in HDFS as several blocks.
A 2.5 GB file with a block size of 512MB will be split into ~5 blocks in HDFS. If the file is splittable (plain text, or compressed with a splittable codec such as Snappy, but not gzip), then Hadoop will launch a mapper per block to process the file.
Hope this helps explain what you're seeing.
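A hedged way to check whether the input really lives in HDFS and how many blocks (and therefore map tasks) it spans; the paths are illustrative and the -D generic option / dfs.block.size property are the Hadoop 1.x forms:
# Is "data" actually in HDFS, and how many blocks does it occupy?
../hadoop/bin/hadoop fs -ls data
../hadoop/bin/hadoop fsck /user/$USER/data -files -blocks
# If it only exists on local disk, copy it into HDFS; a smaller block size
# (here 128MB) yields more blocks and therefore more mappers.
../hadoop/bin/hadoop fs -D dfs.block.size=134217728 -put data data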
