Efficient copy method in Hadoop - hadoop

Is there a faster or more efficient way of copying files across HDFS than distcp? I tried both the regular hadoop fs -cp and distcp, and both give the same transfer rate, around 50 MBPS.
I have 5TB of data split into smaller files of 500GB each, which I have to copy to a new location on HDFS. Any thoughts?
Edit:
The original distcp was spawning only 1 mapper, so I added the -m100 option to increase the number of mappers:
hadoop distcp -D mapred.job.name="Gigafiles distcp" -pb -i -m100 "/user/abc/file1" "/xyz/aaa/file1"
But it still spawns only 1 mapper, not 100. Am I missing something here?

I came up with this for copying a subset of files from one folder to another in HDFS. It may not be as efficient as distcp, but it does the job and gives you more freedom in case you want to do other operations. It also checks whether each file already exists at the destination:
import pandas as pd
from multiprocessing import Process
from subprocess import Popen, PIPE
hdfs_path_1 = '/path/to/the/origin/'
hdfs_path_2 = '/path/to/the/destination/'
# List what is already at the destination
process = Popen(f'hdfs dfs -ls -h {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# Skip the "Found N items" header line and any blank lines;
# the file name is the last component of the last field
already_processed = [fn.split()[-1].split('/')[-1] for fn in std_out.decode().splitlines()[1:] if fn.strip()]
print(f'Total number of ALREADY PROCESSED tar files = {len(already_processed)}')
df = pd.read_csv("list_of_files.csv")  # or any other list that you have
to_do_tar_list = list(df.tar)
to_do_list = set(to_do_tar_list) - set(already_processed)
print(f'To go: {len(to_do_list)}')
def copyy(f):
    # Copy one file and print whatever the command wrote to stdout
    process = Popen(f'hdfs dfs -cp {hdfs_path_1}{f} {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
    std_out, std_err = process.communicate()
    if std_out != b'':
        print(std_out)
if __name__ == '__main__':
    # Start one process per file, then wait for all of them
    ps = []
    for f in to_do_list:
        p = Process(target=copyy, args=(f,))
        p.start()
        ps.append(p)
    for p in ps:
        p.join()
    print('done')
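The script above starts one process per file, so a long to-do list means that many copies running at once. A minimal sketch of the same idea with a cap on concurrency follows. The names build_copy_command, run_command and copy_all, and the max_workers and runner parameters, are made up for this illustration and are not part of the original script; only the hdfs dfs -cp command itself comes from it.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def build_copy_command(src_dir, dst_dir, name):
    """Return the argv list for copying one file with `hdfs dfs -cp`."""
    return ["hdfs", "dfs", "-cp", src_dir.rstrip("/") + "/" + name, dst_dir]


def run_command(argv):
    """Run one command and return (exit_code, stderr_text)."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.returncode, result.stderr.decode()


def copy_all(names, src_dir, dst_dir, max_workers=8, runner=run_command):
    """Copy every name with at most `max_workers` copies in flight.

    Returns a dict mapping each file name to its (exit_code, stderr_text).
    `runner` is a hypothetical hook so the function can be tried without HDFS.
    """
    names = list(names)
    commands = [build_copy_command(src_dir, dst_dir, n) for n in names]
    # Threads are enough here: each one only waits for its child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(runner, commands))
    return dict(zip(names, results))
```

Something like copy_all(to_do_list, hdfs_path_1, hdfs_path_2) would then replace the loop, and a non-zero exit code in the returned dict points at a file that needs a retry.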
Also, if you want a list of all files in a directory, use this:
from subprocess import Popen, PIPE
hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# Skip the "Found N items" header line and any blank lines
lines = [fn for fn in std_out.decode().splitlines()[1:] if fn.strip()]
# The full path is the last whitespace-separated field of each line
list_of_file_names = [fn.split()[-1].split('/')[-1] for fn in lines]
list_of_file_names_with_full_address = [fn.split()[-1] for fn in lines]
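The parsing step above can be pulled out into a small function that is easy to check without a cluster. This is only a sketch: parse_hdfs_ls and base_names are hypothetical names, the sample listing is made up, and it assumes the usual hdfs dfs -ls layout (a "Found N items" header, then one entry per line with the path as the last field) and paths without spaces.

```python
def parse_hdfs_ls(listing):
    """Return the full paths found in the text printed by `hdfs dfs -ls`."""
    paths = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the "Found N items" header.
        if not line or line.startswith("Found "):
            continue
        # Assumption: the path has no spaces, so it is the last field.
        paths.append(line.split()[-1])
    return paths


def base_names(paths):
    """Return only the last component of each path."""
    return [p.rsplit("/", 1)[-1] for p in paths]


# Made-up sample output, used only to show the expected shape.
sample = """Found 2 items
-rw-r--r--   3 abc supergroup  524288000 2017-06-30 02:23 /xyz/aaa/file1.tar
-rw-r--r--   3 abc supergroup  524288000 2017-06-30 02:24 /xyz/aaa/file2.tar
"""

print(base_names(parse_hdfs_ls(sample)))  # ['file1.tar', 'file2.tar']
```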

I was able to solve this by using a pig script to read the data from path A, convert to parquet (which is the desired storage format anyway) and write it in path B. The process took close to 20 mins on average for 500GB files. Thank you for the suggestions.
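To put the two figures from this thread side by side, here is the arithmetic as a short sketch. It assumes "MBPS" in the question means megabytes per second and uses decimal units (1 GB = 1000 MB); the Pig job also rewrites the data as Parquet, so the comparison is rough.

```python
def throughput_mb_per_s(size_gb, seconds):
    """Average throughput in MB/s, using decimal units (1 GB = 1000 MB)."""
    return size_gb * 1000 / seconds


# One 500 GB file in about 20 minutes with the Pig job.
pig_rate = throughput_mb_per_s(500, 20 * 60)

# The same file at the 50 MB/s seen with fs -cp and distcp, in hours.
hours_at_50 = 500 * 1000 / 50 / 3600

print(round(pig_rate))        # 417
print(round(hours_at_50, 1))  # 2.8
```

So the Pig job read each file at roughly eight times the rate of the single-mapper copy.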

Related

OraclePropertyGraphDataLoader loadData from HDFS

I'm using Spark+Hive to build graphs and relations and export flat OPV/OPE files to HDFS, one OPV/OPE CSV per reducer.
Our whole graph database is ready to be loaded on OPG/PGX for analytics, and that worked like a charm.
Now we want to load those vertices/edges on Oracle Property Graph.
I've dumped the filenames from HDFS this way:
$ hadoop fs -find '/user/felipeferreira/dadossinapse/ops/*.opv/*.csv' | xargs -I{} echo 'hdfs://'{} > opvs.lst
$ hadoop fs -find '/user/felipeferreira/dadossinapse/ops/*.ope/*.csv' | xargs -I{} echo 'hdfs://'{} > opes.lst
And I'm experimenting in the Groovy shell, with some issues and doubts:
opvs = new File('opvs.lst') as String[]
opes = new File('opes.lst') as String[]
opgdl.loadData(opg, opvs, opes, 72)
That doesn't work out of the box; I receive errors like
java.lang.IllegalArgumentException: loadData: part-00000-f97f1abf-5f69-479a-baee-ce0a7bcaa86c-c000.csv flat file does not exist
I'll work around that with the InputStream approach available in the loadData interface, and I hope that solves the problem, but I have some questions/suggestions:
Does loadData support VFS, so that I can load 'hdfs://...' files directly?
Wouldn't it be nice to have glob syntax in the filenames, so we could do something like:
opgdl.loadData(opg, 'hdfs:///user/felipeferreira/opvs/**/*.csv' ...
Thanks in advance!
You can use an alternate API of OraclePropertyGraphDataLoader where you specify the InputStream objects for the opv/ope files used for loading. This way, you can use FSDataInputStream objects to read the files from your HDFS environment.
A small sample follows:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// ====== Init HDFS File System Object
// hdfsuri is a String holding the URI of your HDFS file system
Configuration conf = new Configuration();
// Set FileSystem URI
conf.set("fs.defaultFS", hdfsuri);
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
// Set HADOOP user
System.setProperty("HADOOP_USER_NAME", "hdfs");
System.setProperty("hadoop.home.dir", "/");
// Get the filesystem - HDFS
FileSystem fs = FileSystem.get(URI.create(hdfsuri), conf);
// Read files into InputStreams using HDFS FSDataInputStream Java APIs
Path pathOPV = new Path("/path/to/file.opv");
FSDataInputStream inOPV = fs.open(pathOPV);
Path pathOPE = new Path("/path/to/file.ope");
FSDataInputStream inOPE = fs.open(pathOPE);
cfg = GraphConfigBuilder.forPropertyGraphHbase().setName("sinapse").setZkQuorum("bda1node05,bda1node06").build()
opg = OraclePropertyGraph.getInstance(cfg)
opgdl = OraclePropertyGraphDataLoader.getInstance();
opgdl.loadData(opg, inOPV, inOPE, 100);
Let us know if this one works for you.
For the sake of tracking, here is the solution we've adopted:
Mounted HDFS through the NFS gateway on a folder below the Groovy shell.
Exported the filenames to the OPV/OPE list-of-files:
$ find ../hadoop/user/felipeferreira/dadossinapse/ -iname "*.csv" | grep ".ope" > opes.lst
$ find ../hadoop/user/felipeferreira/dadossinapse/ -iname "*.csv" | grep ".opv" > opvs.lst
Then it was as simple as this to load the data on the opg/hbase:
cfg = GraphConfigBuilder.forPropertyGraphHbase().setName("sinapse").setZkQuorum("bda1node05,bda1node06").build()
opg = OraclePropertyGraph.getInstance(cfg)
opgdl = OraclePropertyGraphDataLoader.getInstance()
opvs = new File("opvs.lst") as String[]
opes = new File("opes.lst") as String[]
opgdl.loadData(opg, opvs, opes, 100)
This seems to get bottlenecked by the nfs gateway, but we will evaluate this next week.
Graph data loading is running just fine so far.
If anyone can suggest a better approach, please let me know!

HDFS path does not exist with SparkSession object when spark master is set as LOCAL

I am trying to load a dataset into Hive table using Spark.
But when I try to load the file from HDFS directory to Spark, I get the exception:
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
These are the steps before loading the file.
val wareHouseLocation = "file:${system:user.dir}/spark-warehouse"
val sparkSession = SparkSession.builder.master("local[2]")
  .appName("SparkHive")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation).getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
Exception for the statement ->
val partf = sparkSession.read.textFile("partfile")
org.apache.spark.sql.AnalysisException: Path does not exist: file:/home/cloudera/partfile;
But I have the file in my home directory of HDFS.
hadoop fs -ls
Found 1 items
-rw-r--r-- 1 cloudera cloudera 58 2017-06-30 02:23 partfile
I tried various ways to load the dataset like:
val partfile = sparkSession.read.textFile("/user/cloudera/partfile") and
val partfile = sparkSession.read.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/partfile")
But nothing seems to work.
My Spark version is 2.0.2.
Could anyone tell me how to fix it?
When you submit the job with the master set to local[2], the job is not submitted to a Spark master, so Spark does not know about the underlying HDFS.
Spark then treats the local file system as its default file system, which is why the "Path does not exist" exception occurs in your case.
Try this way:
val sparkSession = SparkSession.builder
  .master("spark://<spark-master-ip>:<spark-port>")
  .appName("SparkHive").enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.warehouse.dir", wareHouseLocation).getOrCreate()
import sparkSession.implicits._
val partf = sparkSession.read.textFile("partfile")
You need to know <spark-master-ip> and <spark-port> for this.
This way, spark will take underlying hdfs file system as its default file system.
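The role of the default file system can be illustrated with a small model of how a bare path such as "partfile" turns into a full URI. This is a simplified sketch in Python, not Hadoop's actual code: qualify and its parameters are invented for the illustration, default_fs stands in for the fs.defaultFS setting, and working_dir stands in for the current directory on the local file system or the user's home directory on HDFS.

```python
import posixpath
from urllib.parse import urlsplit


def qualify(path, default_fs, working_dir):
    """Simplified model of turning a path string into a fully qualified URI.

    A path that already has a scheme is returned unchanged. Otherwise a
    relative path is resolved against `working_dir`, and the default file
    system supplies the scheme and authority.
    """
    if urlsplit(path).scheme:
        return path  # already names its file system explicitly
    if not path.startswith("/"):
        path = posixpath.join(working_dir, path)
    return default_fs.rstrip("/") + path


# Local file system as the default: matches the path in the error message.
print(qualify("partfile", "file:///", "/home/cloudera"))
# file:/home/cloudera/partfile

# HDFS as the default: the same string now points into the HDFS home directory.
print(qualify("partfile", "hdfs://quickstart.cloudera:8020", "/user/cloudera"))
# hdfs://quickstart.cloudera:8020/user/cloudera/partfile
```

The point is that the same string resolves to different files depending on which configuration Spark picked up.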
It's not clear to me what the error would be with an explicit protocol specification, but usually (as already answered) it means that the necessary configuration was not passed into the Spark context.
The first solution:
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
val sc: SparkContext = ??? // your Spark Context
val config = sc.hadoopConfiguration
// you can mutate the config object, it should work
val hadoopHome = sys.env("HADOOP_HOME")
config.addResource(new Path(s"$hadoopHome/conf/core-site.xml"))
// instead of adding a resource you can just specify the hdfs address
// config.set("fs.defaultFS", "hdfs://host:port")
The second:
Explicitly specify HADOOP_CONF_DIR in the $SPARK_HOME/spark-env.sh file. If you plan to use a cluster, be sure that every node of your cluster has HADOOP_CONF_DIR specified.
Also be sure that you have all the necessary Hadoop dependencies on your Spark / app classpath.
Try the following, it should work.
SparkSession session = SparkSession.builder().appName("Appname").master("local[1]").getOrCreate();
DataFrameReader dataFrameReader = session.read();
String path = "path\\file.csv";
Dataset <Row> responses = dataFrameReader.option("header","true").csv(path);

How to recursively read Hadoop files from directory using Spark?

Inside the given directory I have many different folders and inside each folder I have Hadoop files (part_001, etc.).
directory
-> folder1
   -> part_001...
   -> part_002...
-> folder2
   -> part_001...
...
Given the directory, how can I recursively read the content of all folders inside this directory and load this content into a single RDD in Spark using Scala?
I found this, but it does not recursively enter sub-folders (I am using import org.apache.hadoop.mapreduce.lib.input):
var job: Job = null
try {
  job = Job.getInstance()
  FileInputFormat.setInputPaths(job, new Path("s3n://" + bucketNameData + "/" + directoryS3))
  FileInputFormat.setInputDirRecursive(job, true)
} catch {
  case ioe: IOException => ioe.printStackTrace(); System.exit(1);
}
val sourceData = sc.newAPIHadoopRDD(job.getConfiguration(), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).values
I also found this web-page that uses SequenceFile, but again I don't understand how to apply it to my case?
If you are using Spark, you can do this using wildcards as follows:
scala> sc.textFile("path/*/*")
sc is the SparkContext. If you are using spark-shell it is initialized by default; if you are writing your own program you will have to instantiate a SparkContext yourself.
Be careful with the following flag:
scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
> res6: String = null
You should set this flag to true:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
I have found that the parameters must be set in this way:
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
connector_output=${basepath}/output/connector/*/*/*/*/*
works for me when I have a directory structure like:
${basepath}/output/connector/2019/01/23/23/output*.dat
I didn't have to set any other properties, I just used the following:
sparkSession.read().format("csv").schema(schema)
.option("delimiter", "|")
.load("/user/user1/output/connector/*/*/*/*/*");
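In the wildcard approach, the pattern needs one * per level below the base directory, including the level of the files themselves. A tiny sketch of building such a pattern follows. It is written in Python for brevity; wildcard_path and levels_below are hypothetical helper names, and the sample file name output1.dat is made up to match the output*.dat layout shown above.

```python
def wildcard_path(base, levels):
    """Append `levels` single-level wildcards to `base`."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return "/".join([base.rstrip("/")] + ["*"] * levels)


def levels_below(relative_file_path):
    """Count the path components of a file path relative to the base."""
    return len([part for part in relative_file_path.split("/") if part])


# Year/month/day/hour directories plus the file itself: five levels.
depth = levels_below("2019/01/23/23/output1.dat")
print(wildcard_path("/user/user1/output/connector", depth))
# /user/user1/output/connector/*/*/*/*/*
```

The resulting string is what would be passed to load(...) or sc.textFile(...). If the directories are not all equally deep, a fixed pattern like this will miss files, and the recursive flag mentioned earlier is the alternative.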

Read Lzo file in PySpark

I am new to Spark. I have a bunch of LZO indexed files in a folder. The indexing was done as indicated on https://github.com/twitter/hadoop-lzo.
The files are as follows:
1.lzo
1.lzo.index
2.lzo
2.lzo.index
and so on
I want to read these files. I am using newAPIHadoopFile().
As given on, https://github.com/twitter/hadoop-lzo
I did the following:
val files = sc.newAPIHadoopFile(path, classOf[com.hadoop.mapreduce.LzoTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
val lzoRDD = files.map(_._2.toString)
It worked fine in Scala (spark-shell).
But, I want to use pyspark (python-spark application). I am doing the following:
files = sc.newAPIHadoopFile(path,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text")
lzoRDD = files.map(_._2.toString)
I get the following error: AttributeError: 'RDD' object has no attribute '_2'
The whole code is as follows:
import sys
from pyspark import SparkContext,SparkConf
if __name__ == "__main__":
    #Create the SparkContext
    conf = (SparkConf().setMaster("local[2]").setAppName("abc").set("spark.executor.memory", "10g").set("spark.cores.max",10))
    sc = SparkContext(conf=conf)
    path='/x/y/z/*.lzo'
    files = sc.newAPIHadoopFile(path,"com.hadoop.mapreduce.LzoTextInputFormat","org.apache.hadoop.io.LongWritable","org.apache.hadoop.io.Text")
    lzoRDD = files.map(_._2.toString)
    #stop the SparkContext
    sc.stop()
And I am submitting using spark-submit.
Any help would be appreciated.
Thank You
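A note on the error: _._2.toString is Scala's placeholder syntax for "the second element of each pair, as a string", and Python has no equivalent of it. In Python, _ is an ordinary name, which is likely why the message complains about an 'RDD' object rather than about the syntax. The Python way to express the same step is a function that indexes the pair. The sketch below runs on a plain list so it needs no cluster; value_of and the sample records are made up, and it assumes PySpark hands each record over as a (key, value) tuple with the text already converted to a Python string.

```python
def value_of(pair):
    """Return the value from a (key, value) pair."""
    return pair[1]


# Stand-in for the records of the RDD: (byte offset, line text) pairs.
records = [(0, "first line"), (11, "second line")]

lines = list(map(value_of, records))
print(lines)  # ['first line', 'second line']

# On the RDD from the question the same function would be passed to map:
#     lzoRDD = files.map(value_of)
# or, inline:
#     lzoRDD = files.map(lambda kv: kv[1])
```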

Script working in Python2 but not in Python 3 (hashlib)

Today I worked on a simple script to checksum files with all available hashlib algorithms (md5, sha1, ...). I wrote and debugged it with Python 2, but when I decided to port it to Python 3 it just won't work. The funny thing is that it works for small files, but not for big files. I thought there was a problem with the way I was buffering the file, but the error message makes me think it is related to the way I am doing the hexdigest (I think). Here is a copy of my entire script, so feel free to copy it, use it and help me figure out what the problem is. The error I get when checksumming a 250 MB file is
"'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte"
I googled it, but can't find anything that fixes it. Also, if you see better ways to optimize it, please let me know. My main goal is to make it work 100% in Python 3. Thanks
#!/usr/local/bin/python33
import hashlib
import argparse
def hashFile(algorithm = "md5", filepaths=[], blockSize=4096):
    algorithmType = getattr(hashlib, algorithm.lower())() #Default: hashlib.md5()
    #Open file and extract data in chunks
    for path in filepaths:
        try:
            with open(path) as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk.encode())
            yield algorithmType.hexdigest()
        except Exception as e:
            print (e)
def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specified the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5",
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
Here is the code that works in Python 2. I include it in case you want to use it without having to modify the one above.
#!/usr/bin/python
import hashlib
import argparse
def hashFile(algorithm = "md5", filepaths=[], blockSize=4096):
    '''
    Hashes a file. In order to reduce the amount of memory used by the script, it hashes the file in chunks instead of putting
    the whole file in memory
    '''
    algorithmType = hashlib.new(algorithm) #getattr(hashlib, algorithm.lower())() #Default: hashlib.md5()
    #Open file and extract data in chunks
    for path in filepaths:
        try:
            with open(path, mode = 'rb') as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk)
            yield algorithmType.hexdigest()
        except Exception as e:
            print e
def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specified the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5",
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    #Call generator function to yield hash value
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
        for hashValue in hashFile(algo, arguments.filepaths):
            print hashValue
    else:
        print "Algorithm {0} is not available in this script".format(algo)
if __name__ == "__main__":
    main()
I haven't tried it in Python 3, but I get the same error in Python 2.7.5 for binary files (the only difference is that mine is with the ascii codec). Instead of encoding the data chunks, open the file directly in binary mode:
with open(path, 'rb') as f:
    while True:
        dataChunk = f.read(blockSize)
        if not dataChunk:
            break
        algorithmType.update(dataChunk)
yield algorithmType.hexdigest()
Apart from that, I'd use the method hashlib.new instead of getattr, and hashlib.algorithms_available to check if the argument is valid.
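Putting those suggestions together, a Python 3 version of the hashing step might look like the sketch below. The function name hash_file and the block_size default are choices made for this example, not taken from the scripts above. Unlike those scripts, it creates a fresh hash object for each file, so every digest covers exactly one file. Variable-length algorithms such as shake_128 need a length argument for hexdigest and are not handled here.

```python
import hashlib


def hash_file(path, algorithm="md5", block_size=65536):
    """Return the hex digest of one file, read in binary chunks."""
    if algorithm not in hashlib.algorithms_available:
        raise ValueError("unsupported algorithm: " + algorithm)
    digest = hashlib.new(algorithm)  # fresh hash object for this file
    # Binary mode yields bytes, so nothing is decoded or encoded.
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the empty bytes object at EOF.
        for chunk in iter(lambda: f.read(block_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling hash_file(path, "sha1") for each path on the command line would replace the generator, and because the file is never decoded as text, the 'utf-8' codec error cannot come up.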

Resources