Copy Solr HDFS Data to another Cluster

I have a SolrCloud (v4.10) installation that sits on top of Cloudera (CDH 5.4.2) HDFS, with 3 Solr instances each hosting a shard of each core.
I am looking for a way to incrementally copy the Solr data from our production cluster to our development cluster. There are 3 cores but I am only interested in copying one of them.
I have tried to use Solr replication backup and restore, but that doesn't seem to load anything into the dev cluster:
http://host:8983/solr/core/replication?command=backup&location=/solr_transfer&name=core-name
http://host:8983/solr/core/replication?command=restore&location=/solr_transfer&name=core-name
I also tried to snapshot the /solr dir in the prod HDFS cluster and use hadoop distcp to copy the files, but the Solr indexer deletes some of the files, so the distcp job fails:
hadoop distcp hftp://prod:50070/solr/* hdfs://dev:8020/solr/
Can anyone help me here?

Please follow the steps below to create a snapshot of the solr_hdfs folder and move it to another cluster.
1. Allow snapshots
sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/solr/SolrCollectionName
2. Create a snapshot with a specific name
sudo -u hdfs hdfs dfs -createSnapshot /user/solr/SolrCollectionName/ snapshotName
3. List the snapshot directory
hdfs dfs -ls /user/solr/SolrCollectionName/.snapshot
4. Copy the snapshot to the other cluster
sudo -u solr hadoop distcp hdfs://NNIP1:8020/user/solr/SolrCollectionName/.snapshot/snapshotName hdfs://NNIP2:8020/user/solr
5. Restore the snapshot
sudo -u solr hadoop fs -cp /user/solr/snapshotName/* /user/solr/SolrCollectionName/
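Optionally, once the restore on the second cluster is complete, you can verify the copy and delete the snapshot on the source so it stops pinning old blocks. A minimal sketch, assuming the same names and NameNode addresses as above:
sudo -u solr hadoop fs -ls hdfs://NNIP2:8020/user/solr/SolrCollectionName
sudo -u hdfs hdfs dfs -deleteSnapshot /user/solr/SolrCollectionName snapshotName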

After a lot of trying, this is the solution we worked out:
- Initialise Solr in the second environment with all the collections in the same way as the primary.
- Take a snapshot of HDFS.
- Use hdfs dfs -cp to copy the data up to that checkpoint.
After the first run the copy job is quick, as you are only copying the increments.
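A minimal sketch of what a run can look like, here using distcp -update as a parallel alternative to hdfs dfs -cp; the snapshot name, core path and NameNode hosts are placeholders:
sudo -u hdfs hdfs dfsadmin -allowSnapshot /solr
sudo -u hdfs hdfs dfs -createSnapshot /solr snap1
hadoop distcp -update hdfs://prod-nn:8020/solr/.snapshot/snap1/core-name hdfs://dev-nn:8020/solr/core-name
Copying from the .snapshot path gives distcp a stable, read-only view, so the indexer cannot delete files mid-copy, and -update only transfers files that have changed since the previous run.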

Related

How to run HDFS Copy commands using Airflow?

How can I execute HDFS copy commands on a Dataproc cluster using Airflow?
After the cluster is created using Airflow, I have to copy a few jar files from Google Cloud Storage to an HDFS folder on the master node.
You can execute hdfs commands on a Dataproc cluster by submitting them as a Pig job, something like this:
gcloud dataproc jobs submit pig --execute 'fs -ls /hdfs/path/' --cluster=my-cluster --region=europe-west1
The easiest way is [1] via
gcloud dataproc jobs submit pig --execute 'fs -ls /'
or otherwise [2] as a catch-all for other shell commands.
For a single small file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because
hdfs://<master node>
is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
For a large file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consider [3] for details.
[1] https://pig.apache.org/docs/latest/cmds.html#fs
[2] https://pig.apache.org/docs/latest/cmds.html#sh
[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
I am not sure about your use case for doing this via Airflow, because if it is a one-time setup then I think we can run the commands directly on the Dataproc cluster. But I found some links which might be of some help. As I understand it, we can use the BashOperator to run such commands (see the sketch after the links below).
https://big-data-demystified.ninja/2019/11/04/how-to-ssh-to-a-remote-gcp-machine-and-run-a-command-via-airflow/
Airflow Dataproc operator to run shell scripts
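A minimal sketch of the BashOperator approach, assuming Airflow 2.x; the DAG id, cluster, region, bucket and target path are placeholders, and the copy is submitted as a Pig fs job so it runs on the cluster rather than on the Airflow worker:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("copy_jars_to_hdfs", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    copy_jars = BashOperator(
        task_id="copy_jars",
        # The gcloud call runs on the Airflow worker, but the file copy itself
        # happens on the Dataproc cluster as a Pig fs job.
        # Assumes the target HDFS directory /tmp/jars already exists.
        bash_command=(
            "gcloud dataproc jobs submit pig "
            "--cluster=my-cluster --region=europe-west1 "
            "--execute 'fs -cp gs://my-bucket/jars/* hdfs:///tmp/jars/'"
        ),
    )
The Google provider package also offers a DataprocSubmitJobOperator that can submit the same Pig job without shelling out to gcloud.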

Uploading file in HDFS cluster

I am learning Hadoop and so far I have configured a 3-node cluster:
127.0.0.1 localhost
10.0.1.1 hadoop-namenode
10.0.1.2 hadoop-datanode-2
10.0.1.3 hadoop-datanode-3
My Hadoop NameNode directory looks like this:
hadoop
bin
data-> ./namenode ./datanode
etc
logs
sbin
--
--
I have learned that when we upload a large file to the cluster, it divides the file into blocks. I want to upload a 1 GB file to my cluster and see how it is stored on the datanodes.
Can anyone help me with the commands to upload the file and to see where these blocks are being stored?
First, you need to check whether you have the Hadoop tools on your path; if not, I recommend adding them to it.
One of the possible ways of uploading a file to HDFS:
hadoop fs -put /path/to/localfile /path/in/hdfs
I would suggest you read the documentation and get familiar with the high-level commands first, as it will save you time:
Hadoop Documentation
Start with the "dfs" command, as it is one of the most often used commands.
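To then see where the blocks of the uploaded file actually live, something like the following should work; the local and HDFS paths are just examples:
hadoop fs -put /home/user/bigfile-1g.bin /user/hadoop/bigfile-1g.bin
hdfs fsck /user/hadoop/bigfile-1g.bin -files -blocks -locations
fsck lists each block (128 MB by default in Hadoop 2.x) together with the datanodes holding its replicas; on the datanodes themselves the blocks show up as blk_* files under the configured data directory (data/datanode in your layout).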

Datanode is in dead state as DFS used is 100 percent

I have a standalone setup of Apache Hadoop with the NameNode and DataNode running on the same machine.
I am currently running Apache Hadoop 2.6 (I cannot upgrade it) on Ubuntu 16.04.
Although my system has more than 400 GB of hard disk left, my Hadoop dashboard is showing DFS used as 100%.
Why is Apache Hadoop not consuming the rest of the disk space available to it? Can anybody help me figure out a solution?
There can be several reasons for it.
You can try the following steps:
Go to $HADOOP_HOME/sbin and restart the datanode:
./hadoop-daemon.sh --config $HADOOP_HOME/etc/hadoop start datanode
Then you can try the following things:
If any directory other than your namenode and datanode directories is taking up too much space, you can start cleaning up.
Also, you can run hadoop fs -du -s -h /user/hadoop (to see the usage of the directories).
Identify all the unnecessary directories and start cleaning up by running hadoop fs -rm -R /user/hadoop/raw_data (-rm is to delete, -R is to delete recursively; be careful while using -R).
Run hadoop fs -expunge (to clean up the trash immediately; sometimes you need to run it multiple times).
Run hadoop fs -du -s -h / (it will give you the HDFS usage of the entire file system), or run hdfs dfsadmin -report as well, to confirm whether storage has been reclaimed.
Many times the report also shows missing blocks (with replication 1).
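One more thing worth checking: the DFS capacity shown on the dashboard is the capacity of the partition(s) holding dfs.datanode.data.dir, not of the whole machine. If that directory sits on a small or full partition while the free 400 GB lives on another one, the dashboard will show 100% used. A quick check (the second command takes whatever path the first one prints):
hdfs getconf -confKey dfs.datanode.data.dir
df -h <directory printed by the previous command>
If that is the case, adding a directory on the larger disk to dfs.datanode.data.dir (it accepts a comma-separated list) in hdfs-site.xml and restarting the datanode should make the extra space available.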

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a Spark program in cluster mode on Amazon EC2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this txt file in cluster mode, even though I'm able to read it in standalone mode. In cluster mode, it looks for the file in HDFS. So I put the file into the HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, all on the node's disk. I believe you misunderstand the functionality of the hadoop commands.
Instead, you should try putting the file explicitly into the persistent HDFS (/root/persistent-hdfs/), and then loading it using the hdfs://... prefix.
Is the persistent HDFS server up?
Please take a look here; it seems Spark only starts the ephemeral HDFS server by default. In order to switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these things; I hope they serve you well.
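For example, once the file is in the persistent HDFS, the read would look roughly like this; the NameNode host and port are placeholders for whatever your persistent HDFS is running on:
JavaRDD<String> logData = sc.textFile("hdfs://<namenode-host>:<port>/wordcount/input/input.txt").cache();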

How do I back up HBase using distcp?

I would like to do a backup of HBase files using distcp, then point HBase to the newly copied files and work with the stored tables.
I realize that there are tools out there which are recommended for this job. However, I'd like to know what I need to do after I've copied the files to get HBase to recognize the copied files.
For example, I'd like to start the HBase shell and scan the stored tables from the newly copied files.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. So if you want to back up your clusterA to clusterB, you'll have to:
do the copy from clusterA to clusterB using distcp (see the example after this list)
start an HBase master and some RegionServers
enjoy the command line interface on clusterB
This means having 2 clusters, each with HDFS and HBase.
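The copy in the first step would look roughly like this; the NameNode hosts are placeholders, and HBase on clusterA should be stopped (or at least quiesced) while the files are copied so the copy is consistent:
hadoop distcp hdfs://nn-clusterA:8020/hbase hdfs://nn-clusterB:8020/hbase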
But if you want to back up your data within the same cluster, this is simpler:
do the intra copy in a different folder: hadoop distcp hdfs://nn:8020/hbase hdfs://nn:8020/backuptest
stop all the HBase processes and change the property hbase.rootdir from "hbase" to "backuptest"
restart all the processes
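For reference, the hbase.rootdir change is made in hbase-site.xml and would look something like this; the NameNode address is a placeholder:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://nn:8020/backuptest</value>
</property>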
