I am new to hadoop.
I'm trying to install hadoop in my laptop in Pseudo-Distributed mode.
I am running it with root user, but I'm getting the error below.
root#debdutta-Lenovo-G50-80:~# $HADOOP_PREFIX/sbin/start-dfs.sh
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Starting namenodes on [localhost]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined.
Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined.
Aborting operation.
Starting secondary namenodes [debdutta-Lenovo-G50-80]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
WARNING: HADOOP_PREFIX has been replaced by HADOOP_HOME. Using value of HADOOP_PREFIX.
Also, I have to run hadoop in root user as hadoop is not able to access ssh service with other user.
How to fix the same?
just do what it asks you:
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
The root cause of this problem,
hadoop install for different user and you start yarn service for different user.
OR
in hadoop config's hadoop-env.sh specified HDFS_NAMENODE_USER and HDFS_DATANODE_USER user is something else.
Hence we need to correct and make it consistent at every place. So a simple solution of this problem is to edit your hadoop-env.sh file and add the user-name for which you want to start the yarn service. So go ahead and edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh by adding the following lines
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Now save and start yarn, hdfs service and check that it works.
Based on on the first warning, HADOOP_PREFIX, sounds like you've not defined HADOOP_HOME correctly.
This would be done in your /etc/profile.d.
hadoop-env.sh is where the remainder of those variables are are defined.
Please refer to the UNIX Shell Guide
hadoop is not able to access ssh service with other user
This has nothing to do with Hadoop itself. It's basic SSH account management. You need to
Make the hadoop (and other, like yarn) accounts on all machines of a cluster (see adduser command documentation)
Copy a passwordless SSH key using ssh-copy-id hadoop#localhost, for example
If you don't need distributed mode and just want to use Hadoop locally, you can use a Mini Cluster.
The documentation also recommends making a single node installation before continuing to pseudo distributed
Vim ${HADOOP_HOME}sbin/start-dfs.sh & ${HADOOP_HOME}sbin/stop-dfs.sh, then add:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
Check your pdsh default rcmd rsh
pdsh -q -w localhost -- should get something like this
-- DSH-specific options --
Separate stderr/stdout Yes
Path prepended to cmd none
Appended to cmd none
Command: none
Full program pathname /usr/bin/pdsh
Remote program path /usr/bin/pdsh
-- Generic options --
Local username enock
Local uid 1000
Remote username enock
Rcmd type rsh
one ^C will kill pdsh No
Connect timeout (secs) 10
Command timeout (secs) 0
Fanout 32
Display hostname labels Yes
Debugging No
-- Target nodes --
localhost
Modify pdsh default rcmd. Add pdsh to bashrc
nano ~/.bashrc
-- add this line towards the end
export PDSH_RCMD_TYPE=ssh
-- update
source ~/.bashrc
That should solve your problem
C. sbin/start-dfs.sh
After starting a spark-ec2 cluster, I start sparkR from /root with
$ ./spark/bin/sparkR
A few lines of the resulting message include:
16/11/20 10:13:51 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '1').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
So, following that suggestion I added the last line to spark-defaults.conf
$ pwd
/root/spark/conf
$ cat spark-defaults.conf
spark.executor.memory 512m
spark.executor.extraLibraryPath /root/ephemeral-hdfs/lib/native/
spark.executor.extraClassPath /root/ephemeral-hdfs/conf
spark.executor.instances 2
This resulted in the message no longer being printed.
In sparkR, how can I verify the number of worker nodes that will be accessed?
After you start your spark cluster you can check your current workers and executer on spark ui on Master_IP:8080 for example in local its localhost:8080
And you can also check that your configuration will correctly apply in localhost:4040 under environment tab
I wanted to use Spark's History Server to make use of the logging mechanisms of my Web UI, but I find some difficulty in running this code on my Windows machine.
I have done the following:
Set my spark-defaults.conf file to reflect
spark.eventLog.enabled=true
spark.eventLog.dir=file://C:/spark-1.6.2-bin-hadoop2.6/logs
spark.history.fs.logDirectory=file://C:/spark-1.6.2-bin-hadoop2.6/logs
My spark-env.sh to reflect:
SPARK_LOG_DIR "file://C:/spark-1.6.2-bin-hadoop2.6/logs"
SPARK_HISTORY_OPTS "-Dspark.history.fs.logDirectory=file://C:/spark-1.6.2-bin-hadoop2.6/logs"
I am using Git-BASH to run the start-history-server.sh file, like this:
USERA#SYUHUH MINGW64 /c/spark-1.6.2-bin-hadoop2.6/sbin
$ sh start-history-server.sh
And, I get this error:
USERA#SYUHUH MINGW64 /c/spark-1.6.2-bin-hadoop2.6/sbin
$ sh start-history-server.sh
C:\spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh: line 69: SPARK_LOG_DIR: command not found
C:\spark-1.6.2-bin-hadoop2.6/conf/spark-env.sh: line 70: SPARK_HISTORY_OPTS: command not found
ps: unknown option -- o
Try `ps --help' for more information.
starting org.apache.spark.deploy.history.HistoryServer, logging to C:\spark-1.6.2-bin-hadoop2.6/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-SGPF02M9ZB.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch org.apache.spark.deploy.history.HistoryServer:
Spark Command: C:\Program Files (x86)\Java\jdk1.8.0_91\bin\java -cp C:\spark-1.6.2-bin-hadoop2.6/conf\;C:\spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
full log in C:\spark-1.6.2-bin-hadoop2.6/logs/spark--org.apache.spark.deploy.history.HistoryServer-1-SGPF02M9ZB.out
The full log from the output can be found below:
Spark Command: C:\Program Files (x86)\Java\jdk1.8.0_91\bin\java -cp C:\spark-1.6.2-bin-hadoop2.6/conf\;C:\spark-1.6.2-bin-hadoop2.6/lib/spark-assembly-1.6.2-hadoop2.6.0.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-api-jdo-3.2.6.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-core-3.2.10.jar;C:\spark-1.6.2-bin-hadoop2.6\lib\datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
I am running a sparkR script where I initialize my spark context and then call init().
Please advise whether I should be running the history server before I run my spark script?
Pointers & tips to proceed(with respect to logging) would be greatly appreciated.
On Windows you'll need to run the .cmd files of Spark not .sh. According to what I saw, there is no .cmd script for Spark history server. So basically it needs to be run manually.
I have followed the history server Linux script and in order to run it manually on Windows you'll need to take the following steps:
All history server configurations should be set at the spark-defaults.conf file (remove .template suffix) as described below
You should go to spark config directory and add the spark.history.* configurations to %SPARK_HOME%/conf/spark-defaults.conf. As follows:
spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/dir/path
After configuration is finished run the following command from %SPARK_HOME%
bin\spark-class.cmd org.apache.spark.deploy.history.HistoryServer
The output should be something like that:
16/07/22 18:51:23 INFO Utils: Successfully started service on port 18080.
16/07/22 18:51:23 INFO HistoryServer: Started HistoryServer at http://10.0.240.108:18080
16/07/22 18:52:09 INFO ShutdownHookManager: Shutdown hook called
Hope that it helps! :-)
in case any one gets the floowing exception:
17/05/12 20:27:50 ERROR FsHistoryProvider: Exception encountered when attempting
to load application log file:/C:/Spark/Logs/spark--org.apache.spark.deploy.hist
ory.HistoryServer-1-Arsalan-PC.out
java.lang.IllegalArgumentException: Codec [out] is not available. Consider setti
ng spark.io.compression.codec=snappy
at org.apache.spark.io.CompressionCodec$$anonfun$createCodec$1.apply(Com
Just go to SparkHome/config/spark-defaults.conf
and set
spark.eventLog.compress false
I am using Hadoop 2.6.0 (emr-4.2.0 image). I have made some changes in yarn-site.xml and want to restart yarn to bring the changes into effect.
Is there a command using which I can do this?
Edit (10/26/2017): A more detailed Knowledge Center article on how to do this has been published here by AWS officially -
https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/.
You can ssh into the master node of your EMR cluster and run -
"sudo /sbin/stop hadoop-yarn-resourcemanager"
"sudo /sbin/start hadoop-yarn-resourcemanager"
commands to restart the Yarn resource manager. EMR AMI 4.x.x uses upstart - /sbin/{start,stop,restart} are all symlinks to /sbin/initctl, which is part of upstart. See the initctl man page for more information.
Alternatively, you can follow the instructions here to propagate your changes to yarn-site.xml - yarn-change-configuration-on-yarn-site-xml
For those who are gonna come from Google
In order to restart a service in EMR, perform the following actions:
Find the name of the service by running the following command:
initctl list
For example, the YARN Resource Manager service is named hadoop-yarn-resourcemanager.
Stop the service by running the following command:
sudo stop hadoop-yarn-resourcemanager
Wait a few seconds, then start the service by running the following command:
sudo start hadoop-yarn-resourcemanager
Note: Stop/start is required; do not use the restart command.
Verify that the process is running by running the following command:
sudo status hadoop-yarn-resourcemanager
Check for the process using ps, and then check the log file for any errors in the log directory /var/log/.
Source : https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/
If what you want to do is to enable log-aggregation, it is actually easier to create the cluster with log-aggregation already enabled, as described in the documentation:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-debugging.html
(It is actually enabled by default if you are using emr-4.3.0).
Try restarting this service as well:
hadoop-yarn-nodemanager
I have an Apache server with a default configuration of Elasticsearch and everything works perfectly, except that the default configuration has a max size of 1GB.
I don't have such a large number of documents to store in Elasticsearch, so I want to reduce the memory.
I have seen that I have to change the -Xmx parameter in the Java configuration, but I don't know how.
I have seen I can execute this:
bin/ElasticSearch -Xmx=2G -Xms=2G
But when I have to restart Elasticsearch this will be lost.
Is it possible to change max memory usage when Elasticsearch is installed as a service?
In ElasticSearch >= 5 the documentation has changed, which means none of the above answers worked for me.
I tried changing ES_HEAP_SIZE in /etc/default/elasticsearch and in /etc/init.d/elasticsearch, but when I ran ps aux | grep elasticsearch the output still showed:
/usr/bin/java -Xms2g -Xmx2g # aka 2G min and max ram
I had to make these changes in:
/etc/elasticsearch/jvm.options
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms1g
-Xmx1g
# the settings shipped with ES 5 were: -Xms2g
# the settings shipped with ES 5 were: -Xmx2g
Updated on Nov 24, 2016: Elasticsearch 5 apparently has changed the way to configure the JVM. See this answer here. The answer below still applies to versions < 5.
tirdadc, thank you for pointing this out in your comment below.
I have a pastebin page that I share with others when wondering about memory and ES. It's worked OK for me: http://pastebin.com/mNUGQCLY. I'll paste the contents here as well:
References:
https://github.com/grigorescu/Brownian/wiki/ElasticSearch-Configuration
http://www.elasticsearch.org/guide/reference/setup/installation/
Edit the following files to modify memory and file number limits. These instructions assume Ubuntu 10.04, may work on later versions and other distributions/OSes. (Edit: This works for Ubuntu 14.04 as well.)
/etc/security/limits.conf:
elasticsearch - nofile 65535
elasticsearch - memlock unlimited
/etc/default/elasticsearch (on CentOS/RH: /etc/sysconfig/elasticsearch ):
ES_HEAP_SIZE=512m
MAX_OPEN_FILES=65535
MAX_LOCKED_MEMORY=unlimited
/etc/elasticsearch/elasticsearch.yml:
bootstrap.mlockall: true
For anyone looking to do this on Centos 7 or with another system running SystemD, you change it in
/etc/sysconfig/elasticsearch
Uncomment the ES_HEAP_SIZE line, and set a value, eg:
# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=16g
(Ignore the comment about 1g max - that's the default)
Create a new file with the extension .options inside /etc/elasticsearch/jvm.options.d and put the options there. For example:
sudo nano /etc/elasticsearch/jvm.options.d/custom.options
and put the content there:
# JVM Heap Size - see /etc/elasticsearch/jvm.options
-Xms2g
-Xmx2g
It will set the maximum heap size to 2GB. Don't forget to restart elasticsearch:
sudo systemctl restart elasticsearch
Now you can check the logs:
sudo cat /var/log/elasticsearch/elasticsearch.log | grep "heap size"
You'll see something like so:
… heap size [2gb], compressed ordinary object pointers [true]
Doc
Instructions for ubuntu 14.04:
sudo vim /etc/init.d/elasticsearch
Set
ES_HEAP_SIZE=512m
then in:
sudo vim /etc/elasticsearch/elasticsearch.yml
Set:
bootstrap.memory_lock: true
There are comments in the files for more info
Previous answers were insufficient in my case, probably because I'm on Debian 8, while they were referred to some previous distribution.
On Debian 8 modify the service script normally place in /usr/lib/systemd/system/elasticsearch.service, and add Environment=ES_HEAP_SIZE=8G
just below the other "Environment=*" lines.
Now reload the service script with systemctl daemon-reload and restart the service. The job should be done!
If you use the service wrapper provided in Elasticsearch's Github repository, found at https://github.com/elasticsearch/elasticsearch-servicewrapper, then the conf file at elasticsearch-servicewrapper / service / elasticsearch.conf controls memory settings. At the top of elasticsearch.conf is a parameter:
set.default.ES_HEAP_SIZE=1024
Just reduce this parameter, say to "set.default.ES_HEAP_SIZE=512", to reduce Elasticsearch's allotted memory.
Note that if you use the elasticsearch-wrapper, the ES_HEAP_SIZE provided in elasticsearch.conf OVERRIDES ALL OTHER SETTINGS. This took me a bit to figure out, since from the documentation, it seemed that heap memory could be set from elasticsearch.yml.
If your service wrapper settings are set somewhere else, such as at /etc/default/elasticsearch as in James's example, then set the ES_HEAP_SIZE there.
If you installed ES using the RPM/DEB packages as provided (as you seem to have), you can adjust this by editing the init script (/etc/init.d/elasticsearch on RHEL/CentOS). If you have a look in the file you'll see a block with the following:
export ES_HEAP_SIZE
export ES_HEAP_NEWSIZE
export ES_DIRECT_SIZE
export ES_JAVA_OPTS
export JAVA_HOME
To adjust the size, simply change the ES_HEAP_SIZE line to the following:
export ES_HEAP_SIZE=xM/xG
(where x is the number of MB/GB of RAM that you would like to allocate)
Example:
export ES_HEAP_SIZE=1G
Would allocate 1GB.
Once you have edited the script, save and exit, then restart the service. You can check if it has been correctly set by running the following:
ps aux | grep elasticsearch
And checking for the -Xms and -Xmx flags in the java process that returns:
/usr/bin/java -Xms1G -Xmx1G
Hope this helps :)
Elasticsearch will assign the entire heap specified in jvm.options via the Xms (minimum heap size) and Xmx (maximum heap size) settings.
-Xmx12g
-Xmx12g
Set the minimum heap size (Xms) and maximum heap size (Xmx) to be equal to each other.
Don’t set Xmx to above the cutoff that the JVM uses for compressed object pointers (compressed oops), the exact cutoff varies but is near 32 GB.
It is also possible to set the heap size via an environment variable
ES_JAVA_OPTS="-Xms2g -Xmx2g" ./bin/elasticsearch
ES_JAVA_OPTS="-Xms4000m -Xmx4000m" ./bin/elasticsearch
File path to change heap size /etc/elasticsearch/jvm.options
If you are using nano then do sudo nano /etc/elasticsearch/jvm.options and update -Xms and -Xmx accordingly.
(You can use any file editor to edit it)
In elasticsearch path home dir i.e. typically /usr/share/elasticsearch,
There is a config file bin/elasticsearch.in.sh.
Edit parameter ES_MIN_MEM, ES_MAX_MEM in this file to change -Xms2g, -Xmx4g respectively.
And Please make sure you have restarted the node after this config change.
If you are using docker-compose to run a ES cluster:
Open <your docker compose>.yml file
If you have set the volumes property, you won't lose anything. Otherwise, you must first move the indexes.
Look for this value ES_JAVA_OPTS under environment and change the value in all nodes, the result could be somethig like "ES_JAVA_OPTS=-Xms2g -Xmx2g"
rebuild all nodes docker-compose -f <your docker compose>.yml up -d
Oneliner for Centos 7 & Elasticsearch 7 (2g = 2GB)
$ echo $'-Xms2g\n-Xmx2g' > /etc/elasticsearch/jvm.options.d/2gb.options
and then
$ service elasticsearch restart
If you use windows server, you can change Environment Variable, restart server to apply new Environment Value and start Elastic Service. More detail in Install Elastic in Windows Server
In elasticsearch 2.x :
vi /etc/sysconfig/elasticsearch
Go to the block of code
# Heap size defaults to 256m min, 1g max
# Set ES_HEAP_SIZE to 50% of available RAM, but no more than 31g
#ES_HEAP_SIZE=2g
Uncomment last line like
ES_HEAP_SIZE=2g
Update elastic configuration in path /etc/elasticsearch/jvm.options
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## The heap size is automatically configured by Elasticsearch
## based on the available memory in your system and the roles
## each node is configured to fulfill. If specifying heap is
## required, it should be done through a file in jvm.options.d,
## and the min and max should be set to the same value. For
## example, to set the heap to 4 GB, create a new file in the
## jvm.options.d directory containing these lines:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
-Xms1g
-Xmx1g
These configs mean you allocate 1GB RAM for elasticsearch service.
If you use ubuntu 15.04+ or any other distro that uses systemd, you can set the max memory size editing the elasticsearch systemd service and setting the max memory size using the ES_HEAP_SIZE environment variable, I tested it using ubuntu 20.04 and it works fine:
systemctl edit elasticsearch
Add the environement variable ES_HEAP_SIZE with the desired max memory, here 2GB as example:
[Service]
Environment=ES_HEAP_SIZE=2G
Reload systemd daemon
systemd daemon-reload
Then restart elasticsearch
systemd restart elasticsearch
To check if it worked as expected:
systemd status elasticsearch
You should see in the status -Xmx2G:
CGroup: /system.slice/elasticsearch.service
└─2868 /usr/bin/java -Xms2G -Xmx2G
window 7 elasticsearch
elastic search memories problem
elasticsearch-7.14.1\config\jvm.options
add this
-Xms1g
-Xmx1g
elasticsearch-7.14.1\config\elasticsearch.yml
uncomment
bootstrap.memory_lock: true
and pest
https://github.com/elastic/elasticsearch-servicewrapper download service file and pest
lasticsearch-7.14.1\bin
bin\elasticsearch.bat enter
Elastic Search 7.x and above, tested with Ubuntu 20
Create a file in /etc/elasticsearch/jvm.options.d. The file name must ends with .options
For example heap_limit.options
Add these lines to the file
## Initial memory allocation
-Xms1g
## Maximum memory allocation
-Xmx1g
Restart elastic search service
sudo service elasticsearch restart
or
sudo systemctl restart elasticsearch