I have Elasticsearch set up so that it stores its data at two locations, meaning in elasticsearch.yml I have
path.data: /path_one/es_data,/path_two/elasticsearch
I was hoping Elasticsearch would automatically figure out where more space is available and store new incoming data wherever possible, but instead I found that it starts to crash when it runs short of space at either location. So I would like to move one node from path_one to path_two.
Currently, it looks like this
ls -lha /path_one/es_data/nodes/0/indices/
drwxr-xr-x 3 elasticsearch elasticsearch 4.0K Mar 7 03:13 A4XXnhNdTwKILyeE39UosA
drwxr-xr-x 3 elasticsearch elasticsearch 4.0K Mar 7 03:13 C2BPWKL4T3-jHIfZXNKG6g
drwxr-xr-x 3 elasticsearch elasticsearch 4.0K Mar 7 03:13 c8mFFi56RAyRYNpHOUvG4g
drwxr-xr-x 6 elasticsearch elasticsearch 4.0K Mar 7 03:13 DEk-qwdnSLOHbP_-nAhSdw
drwxr-xr-x 3 elasticsearch elasticsearch 4.0K Mar 7 03:13 kV32aUcET1WrlKXWOunGhg
drwxr-xr-x 3 elasticsearch elasticsearch 4.0K Mar 7 03:13 pGmjsSJHRAiMUC5paYfjag
drwxr-xr-x 3 elasticsearch elasticsearch 4.0K Mar 7 03:13 T1k45bs2SUGHJ6dJniPjZg
ls -lha /path_two/elasticsearch/nodes/0/indices/
drwxr-xr-x 4 elasticsearch elasticsearch 4.0K Mar 7 03:13 A4XXnhNdTwKILyeE39UosA
drwxr-xr-x 4 elasticsearch elasticsearch 4.0K Mar 7 03:13 C2BPWKL4T3-jHIfZXNKG6g
drwxr-xr-x 4 elasticsearch elasticsearch 4.0K Mar 7 03:13 c8mFFi56RAyRYNpHOUvG4g
drwxr-xr-x 5 elasticsearch elasticsearch 4.0K Mar 7 03:13 DEk-qwdnSLOHbP_-nAhSdw
drwxr-xr-x 4 elasticsearch elasticsearch 4.0K Mar 7 03:13 pGmjsSJHRAiMUC5paYfjag
drwxr-xr-x 3 elasticsearch elasticsearch 4.0K Mar 7 03:13 T1k45bs2SUGHJ6dJniPjZg
drwxr-xr-x 4 elasticsearch elasticsearch 4.0K Mar 7 03:13 XpHUz15oTbGG0Bvnf2xZsw
So my first question is, why are some of the nodes present at both locations? And my second question is whether I can just
stop elasticsearch
copy nodes over
restart elasticsearch
or whether I have to do more?
EDIT: I found some messages in the log files which look related:
[2019-03-07T17:08:21,910][WARN ][o.e.c.r.a.DiskThresholdMonitor] [WU6cQ-o] high disk watermark [90%] exceeded on [WU6cQ-oTR2Ssg3LzoI4_yg][WU6cQ-o][/var/lib/elasticsearch/elasticsearch/nodes/0] free: 984.7mb[1.6%], shards will be relocated away from this node
[2019-03-07T17:08:51,944][WARN ][o.e.g.DanglingIndicesState] [WU6cQ-o] [[paper-index/XpHUz15oTbGG0Bvnf2xZsw]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
So it seems like Elasticsearch is trying to move indices but can't, because there are already copies of these indices at the other location? Can I just delete the copies at the location where there is more space?
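For reference, a couple of read-only calls can show how much space each configured data path has left and how the disk-based allocator currently sees the node; this is a minimal sketch that assumes the node is listening on localhost:9200.
curl -s 'localhost:9200/_nodes/stats/fs?pretty'    # per-data-path total, free and available bytes
curl -s 'localhost:9200/_cat/allocation?v'         # disk usage and shard counts per node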
An Elasticsearch instance corresponds to one node. Setting two locations in path.data doesn't mean that you have two nodes running on the same host, but that you're storing the data of your node on two locations (see documentation). Therefore, to answer your first question, it is to be expected that data of the same node is spread across locations.
As for your second question, I don't understand your process, mostly because I'm not sure that you're running multiple nodes.
From the Elastic documentation, all we know about data distribution across locations is that Elasticsearch stores the files related to the same shard in the same location.
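To make that concrete, listing the shards shows that the shard, not the whole index, is the unit that ends up in one location; a minimal sketch, again assuming the node listens on localhost:9200.
# Each row is one shard copy; the index directories you saw under nodes/0/indices/
# hold whichever of these shards landed on that particular data path.
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node'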
Hope it helps
Related
I am trying to store a dataframe into an external Hive table. When I perform the following action:
recordDF.write.option("path", "hdfs://quickstart.cloudera:8020/user/cloudera/hadoop/hive/warehouse/VerizonProduct").saveAsTable("productstoreHTable")
At the HDFS location where the table was supposed to be present, I instead get this:
-rw-r--r-- 3 cloudera cloudera 0 2016-12-25 18:58 hadoop/hive/warehouse/VerizonProduct/_SUCCESS
-rw-r--r-- 3 cloudera cloudera 482 2016-12-25 18:58 hadoop/hive/warehouse/VerizonProduct/part-r-00000-0acdcc6d-893b-4e9d-b1d6-50bf02bea96a.snappy.parquet
-rw-r--r-- 3 cloudera cloudera 482 2016-12-25 18:58 hadoop/hive/warehouse/VerizonProduct/part-r-00001-0acdcc6d-893b-4e9d-b1d6-50bf02bea96a.snappy.parquet
-rw-r--r-- 3 cloudera cloudera 482 2016-12-25 18:58 hadoop/hive/warehouse/VerizonProduct/part-r-00002-0acdcc6d-893b-4e9d-b1d6-50bf02bea96a.snappy.parquet
-rw-r--r-- 3 cloudera cloudera 482 2016-12-25 18:58 hadoop/hive/warehouse/VerizonProduct/part-r-00003-0acdcc6d-893b-4e9d-b1d6-50bf02bea96a.snappy.parquet
How do I store it in uncompressed text format?
Thanks
You can add a format option:
recordDF.write.option("path", "...").format("text").saveAsTable("...")
or
recordDF.write.option("path", "...").format("csv").saveAsTable("...")
The above solution with format csv threw a warning: "Couldn't find corresponding Hive SerDe for data source provider csv." The table is not created in the desired way. One solution could be to create an external table as below:
sqlContext.sql("CREATE EXTERNAL TABLE test(col1 int,col2 string) STORED AS TEXTFILE LOCATION '/path/in/hdfs'")
Then
dataFrame.write.format("com.databricks.spark.csv").option("header", "true").save("/path/in/hdfs")
Try this: .option("fileFormat", "textfile"). Look at "Specifying storage format for Hive tables" in the Spark SQL documentation.
We have a sandbox that has 5 nodes, with all five nodes running a Kafka broker (broker id = 0).
Now I have made copies of the config files on all 5 nodes, with distinct broker ids and log directories, to have multiple brokers running:
-rw-r--r-- 1 root root 5652 Apr 2 23:01 server.properties - (this one being the default)
-rw-r--r-- 1 root root 5675 Apr 2 23:02 server1.properties
-rw-r--r-- 1 root root 5675 Apr 2 23:02 server2.properties
Now, do I start Kafka on all 5 nodes with the new config files by using
./kafka-server-start.sh config/server1.properties &
./kafka-server-start.sh config/server2.properties &
Will every node have 3 brokers running, or is it 3 brokers for the overall cluster?
How does this work? Any help would be appreciated.
Each node in your cluster should only have one configuration file and kafka-server-start should only be run once on each node. For example, server 1 only needs to have a single configuration file that contains, e.g. broker.id = 1.
Each time you run kafka-server-start you are starting a broker (i.e. server). When the broker starts, Kafka will locate the other brokers via ZooKeeper. This allows new brokers to be added and removed from the cluster without any additional configuration file specifying the other nodes in the cluster and without having to do any reconfiguration on the existing nodes.
If you're running kafka-server-start multiple times on the same node then you are indeed starting multiple brokers on the same node, but that's not what you want.
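To make that concrete, here is a hedged sketch of what a single node could look like; the file locations and hostnames below are placeholders, only the unique broker.id and the one-start-per-node pattern matter.
# On node 1: one properties file, one unique broker.id, one broker process (values are illustrative)
grep -E '^(broker\.id|log\.dirs|zookeeper\.connect)' /opt/kafka/config/server.properties
# broker.id=1
# log.dirs=/var/lib/kafka-logs
# zookeeper.connect=zk1:2181,zk2:2181,zk3:2181

# Start exactly one broker on this node
./kafka-server-start.sh /opt/kafka/config/server.properties &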
I'm using Amazon EMR (Hadoop2 / AMI version:3.3.1) and I would like to change the default configuration (for example replication factor). In order for the change to take effect I need to restart the cluster and that's where my problems start.
How to do it? The script I found at ./.versions/2.4.0/sbin/stop-dfs.sh doesn't work. The slaves file ./.versions/2.4.0/etc/hadoop/slaves is empty anyway. There are some scripts in init.d:
$ ls -l /etc/init.d/hadoop-*
-rwxr-xr-x 1 root root 477 Nov 8 02:19 /etc/init.d/hadoop-datanode
-rwxr-xr-x 1 root root 788 Nov 8 02:19 /etc/init.d/hadoop-httpfs
-rwxr-xr-x 1 root root 481 Nov 8 02:19 /etc/init.d/hadoop-jobtracker
-rwxr-xr-x 1 root root 477 Nov 8 02:19 /etc/init.d/hadoop-namenode
-rwxr-xr-x 1 root root 1632 Oct 27 21:12 /etc/init.d/hadoop-state-pusher-control
-rwxr-xr-x 1 root root 484 Nov 8 02:19 /etc/init.d/hadoop-tasktracker
but if I, for example, stop the namenode, something starts it again immediately. I looked for documentation and Amazon provides a 600-page user guide, but it's more about how to use the cluster and not much about maintenance.
On EMR 3.x.x, traditional SysVInit scripts are used for managing services; ls /etc/init.d/ lists them. You can restart a service like so:
sudo service hadoop-namenode restart
But if I, for example, stop the namenode, something will start it again immediately.
However, EMR also has a process called service-nanny that monitors Hadoop-related services and ensures all of them are always running. This is the mystery process that brings it back.
So, to truly restart a service, you need to stop service-nanny for a while and then stop/restart the necessary processes. Once you bring service-nanny back, it will resume its job. You might run commands like:
sudo service service-nanny stop
sudo service hadoop-namenode restart
sudo service service-nanny start
Note that this behavior is different in the 4.x.x and 5.x.x AMIs, where upstart is used to stop/start applications and service-nanny no longer brings them back.
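For comparison, on the upstart-based 4.x.x/5.x.x AMIs the equivalent would look roughly like the sketch below; the exact job names can vary by release, so treat them as assumptions and check /etc/init/ on your cluster.
# EMR 4.x / early 5.x (upstart): no service-nanny involved
sudo stop hadoop-hdfs-namenode
sudo start hadoop-hdfs-namenode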
I am following this tutorial:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_topic_5_2.html
It says the following:
javac -cp classpath -d wordcount_classes WordCount.java
where classpath is:
CDH4 - /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
CDH3 - /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u4-core.jar
I have downloaded the "cloudera-quickstart-demo-vm-4.2.0-vmware".
Running as user cloudera.
[cloudera@localhost wordcount]$ javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/* -d wordcount_classes WordCount.java
incorrect classpath: /usr/lib/hadoop/*
incorrect classpath: /usr/lib/hadoop/client-0.20/*
----------
1. ERROR in WordCount.java (at line 8)
import org.apache.hadoop.fs.Path;
^^^^^^^^^^
When checking the classpath folder:
[cloudera@localhost wordcount]$ ls -l /usr/lib/hadoop
total 3500
drwxr-xr-x. 2 root root 4096 Apr 22 14:37 bin
drwxr-xr-x. 2 root root 4096 Apr 22 14:33 client
drwxr-xr-x. 2 root root 4096 Apr 22 14:33 client-0.20
drwxr-xr-x. 2 root root 4096 Apr 22 14:36 cloudera
drwxr-xr-x. 2 root root 4096 Apr 22 14:30 etc
-rw-r--r--. 1 root root 16536 Feb 15 14:24 hadoop-annotations-2.0.0-cdh4.2.0.jar
lrwxrwxrwx. 1 root root 37 Apr 22 14:30 hadoop-annotations.jar -> hadoop-annotations-2.0.0-cdh4.2.0.jar
-rw-r--r--. 1 root root 46855 Feb 15 14:24 hadoop-auth-2.0.0-cdh4.2.0.jar
lrwxrwxrwx. 1 root root 30 Apr 22 14:30 hadoop-auth.jar -> hadoop-auth-2.0.0-cdh4.2.0.jar
-rw-r--r--. 1 root root 2266171 Feb 15 14:24 hadoop-common-2.0.0-cdh4.2.0.jar
-rw-r--r--. 1 root root 1212163 Feb 15 14:24 hadoop-common-2.0.0-cdh4.2.0-tests.jar
lrwxrwxrwx. 1 root root 32 Apr 22 14:30 hadoop-common.jar -> hadoop-common-2.0.0-cdh4.2.0.jar
drwxr-xr-x. 3 root root 4096 Apr 22 14:36 lib
drwxr-xr-x. 2 root root 4096 Apr 22 14:33 libexec
drwxr-xr-x. 2 root root 4096 Apr 22 14:31 sbin
What am I doing wrong?
This is directly from the Cloudera Quickstart VM with CDH4 installed.
Following the "Hadoop Tutorial".
It even says
**Prerequisites**
Ensure that CDH is installed, configured, and running. The easiest way to get going quickly is to use a CDH4 QuickStart VM
Which is exactly from where I am running this tutorial from - the CDH4 QuickStart VM.
What am I doing wrong?
Update:
Version information:
[cloudera@localhost cloudera]$ cat cdh_version.properties
# Autogenerated build properties
version=2.0.0-cdh4.2.0
git.hash=8bce4bd28a464e0a92950c50ba01a9deb1d85686
cloudera.hash=8bce4bd28a464e0a92950c50ba01a9deb1d85686
cloudera.base-branch=cdh4-base-2.0.0
cloudera.build-branch=cdh4-2.0.0_4.2.0
cloudera.pkg.version=2.0.0+922
cloudera.pkg.release=1.cdh4.2.0.p0.12
cloudera.cdh.release=cdh4.2.0
cloudera.build.time=2013.02.15-18:39:29GMT
cloudera.pkg.name=hadoop
CLASSPATH ENV:
[cloudera@localhost bin]$ echo $CLASSPATH
:/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
EDIT:
So I think I figured it out.
This is possibly a new issue with the Cloudera CDH4 QuickStart VM:
from: This Post dated yesterday
Another person was having the exact same problem.
It seems the javac program does not accept wildcards properly on exported paths.
I had to do the following:
export CLASSPATH=/usr/lib/hadoop/client-0.20/\*:/usr/lib/hadoop/\*
Then
javac -d [Without a -cp override]
javac -d wordcount_classes/ WordCount.java
Only warnings will appear.
I wonder if Cloudera has to fix their quickstart VM.
You need to have a classpath variable set that includes those directories in /usr/lib/hadoop if you want javac to find them. You can set this env var as follows:
$: export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*
javac will now find those libs. If you get any additional complaints regarding classpath entries, you can just append them to the above line using a colon (:) as a delimiter.
You could include this in a bash script, but it is best practice to set the correct env variables at runtime, so you get exactly what you want. In this case it could be the WordCount setup or the CDH4 environment that is setting it, but it is best to just set it yourself.
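With the CLASSPATH exported, the rest of the tutorial's flow should work unchanged; a rough sketch, assuming the class and HDFS paths from the tutorial (adjust them if yours differ):
mkdir -p wordcount_classes
javac -d wordcount_classes WordCount.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output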
I spent some time searching for a response to the same issue (also using VM with CDH4) so I shall leave my solution here in hopes that it might help others.
Unfortunately, neither of the solutions above worked in my case.
However, I was able to successfully compile the example by closing my terminal and opening a new one. My issue was having previously switched to the 'cloudera' user with 'sudo su cloudera' as mentioned in the tutorial.
Reference:
http://community.cloudera.com/t5/Apache-Hadoop-Concepts-and/Classpath-Problem-on-WordCount-Tutorial-QuickStart-VM-4-4-0-1/td-p/3613
I am getting this problem:
Failed with exception java.io.IOException:java.io.IOException: Could not obtain block: blk_364919282277866885_1342 file=/user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
I checked and the file is actually there:
hive>dfs -ls /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
Found 1 items
-rw-r--r-- 2 root supergroup 216 2012-11-16 16:28 /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
What should I do?
Please help.
I ran into this problem on my cluster, but it disappeared once I restarted the task on a cluster with more nodes available. The underlying cause appears to be an out-of-memory error, as this thread indicates. My original cluster on AWS was running 3 c1.xlarge instances (7 GB memory each), while the new one had 10 c3.4xlarge instances (30 GB memory each).
Try hadoop fsck /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt?
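A more verbose variant of the same check lists the file's blocks and the datanodes that are supposed to hold them, which should confirm whether that block is really unavailable:
hadoop fsck /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt -files -blocks -locations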