Loading orc files from a client node into Vertica - vertica

Is there any way to load ORC files that reside on a client machine into a Vertica table?
I tried COPY LOCAL, but the documentation states that, for some reason:
ORC and Parquet Hadoop files not supported with COPY LOCAL

As you mentioned, Vertica doesn't support COPY LOCAL for ORC/Parquet files. You can either copy the files to the Vertica server or to a shared location, such as HDFS or AWS S3, and then use COPY FROM directly. For example:
-- copy from Vertica server:
copy t1 from '/path/to/orc/files/*' orc;
-- copy from HDFS:
copy t1 from 'hdfs:///path/to/orc/files/*' orc;
-- copy from AWS S3:
copy t1 from 's3://s3_bucket/path/to/orc/files/*' orc;
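If the files exist only on the client machine, one common workaround is to stage them on a Vertica node first and then run a server-side COPY there. A rough sketch, assuming a hypothetical node host vertica-node1 and staging directory /data/orc_stage:
# from the client machine: stage the ORC files on a Vertica node (hypothetical host and paths)
scp /local/orc/files/*.orc dbadmin@vertica-node1:/data/orc_stage/
# then run the load server-side, for example via vsql
vsql -h vertica-node1 -U dbadmin -c "copy t1 from '/data/orc_stage/*.orc' orc;"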

Related

Loading data into Hive Table from HDFS in Cloudera VM

When using the Cloudera VM, how can you access information in HDFS? I know there isn't a direct path to HDFS, but I also don't see how to access it dynamically.
After creating a Hive Table through the Hive CLI I attempted to load some data from a file located in the HDFS:
load data inpath '/test/student.txt' into table student;
But then I just get this error:
FAILED: SemanticException Line 1:17 Invalid path ''/test/student.txt'': No files matching path hdfs://quickstart.cloudera:8020/test/student.txt
I also tried to just load data not in the HDFS into a Hive Table like so:
load data inpath '/home/cloudera/Desktop/student.txt' into table student;
However that just produced this error:
FAILED: SemanticException Line 1:17 Invalid path ''/home/cloudera/Desktop/student.txt'': No files matching path hdfs://quickstart.cloudera:8020/home/cloudera/Desktop/student.txt
Once again I see it trying to access data with the root of hdfs://quickstart.cloudera:8020, and I'm not sure what that is, but it doesn't seem to be the root directory of HDFS.
I'm not sure what I'm doing wrong, but I made sure the file is located in HDFS, so I don't know why this error is coming up or how to fix it.
how can you access information in the HDFS
Well, you certainly don't need to use Hive to do it. hdfs dfs commands are how you interact with HDFS.
I'm not sure what that is, but it doesn't seem to be the root directory for the HDFS
It is the root of HDFS. quickstart.cloudera is the hostname of the VM. Port 8020 is the HDFS port.
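For example, from a shell inside the VM (assuming the default quickstart configuration, where fs.defaultFS is hdfs://quickstart.cloudera:8020), these two commands list the same root directory:
hdfs dfs -ls /
hdfs dfs -ls hdfs://quickstart.cloudera:8020/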
Your exceptions come from the difference between using and omitting the LOCAL keyword.
What you're doing:
LOAD DATA INPATH <hdfs location>
vs. what you seem to want:
LOAD DATA LOCAL INPATH <local file location>
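For example, using the table and local path from the question, the LOCAL form would be:
load data local inpath '/home/cloudera/Desktop/student.txt' into table student;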
If the files really are in HDFS, it's not clear how you put them there, but HDFS definitely doesn't have a /home folder or a Desktop, so the second error at least makes sense.
Anyway, hdfs dfs -put /home/cloudera/Desktop/student.txt /test/ is one way to upload your file, assuming the hdfs:///test folder already exists. Otherwise, hdfs dfs -put /home/cloudera/Desktop/student.txt /test renames your file to /test on HDFS.
Note: you can create an EXTERNAL TABLE over an HDFS directory instead; then you don't need the LOAD DATA command at all. A sketch of that is shown below.
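A minimal sketch of that approach, assuming the file has been uploaded to hdfs:///test and that student.txt is comma-delimited with hypothetical (id, name) columns:
create external table student_ext (
    id int,
    name string
)
row format delimited
fields terminated by ','
location '/test';
Hive then reads whatever files sit under /test in place; nothing gets copied or moved.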

Why DATA is COPIED and not MOVED while loading data from local filesystem Hive hadoop

When we use the following command:
load data local inpath "mypath"
why is the data copied from the local filesystem into HDFS rather than moved?
Since you are moving data between two different file systems (the local filesystem and HDFS), this cannot be a metadata-only operation, as it is in a non-local load.
The data itself has to be copied.
In theory the command could also delete the source file afterwards, but what for?
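A quick way to see the difference, as a sketch with a made-up table and made-up paths:
-- non-local load: a move within HDFS; the source file disappears from /tmp/in afterwards
load data inpath '/tmp/in/data.txt' into table t;
-- local load: a copy from the local filesystem into HDFS; /home/me/data.txt is still there afterwards
load data local inpath '/home/me/data.txt' into table t;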

Basic issue in copying files from hive or hadoop to local directory due to wrong nomenclature

I'm trying to copy a file that is available both through Hive and in HDFS onto my local computer, but I can't seem to figure out the right call/terminology for referring to my local machine. All the online explanations describe it solely as "path to local". I'm trying to copy the file into a folder at C/Users/PC/Desktop/data and have made the following attempts:
In HDFS
hdfs dfs -copyToLocal /user/w205/staging /C/Users/PC/Desktop/data
In Hive
INSERT OVERWRITE LOCAL DIRECTORY 'c/users/pc/desktop/data/' SELECT * FROM lacountyvoters;
How should I be invoking the local repository in this case?
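"Path to local" simply means a path on the filesystem of the machine where the hdfs command runs. A sketch, assuming the command is run directly on the Windows machine that owns C:/Users/PC/Desktop/data; if it is run inside a Linux VM instead, the destination has to be a directory inside the VM, and the files then have to be transferred to the host separately (shared folder, scp, etc.):
# run on the machine where the copy should land (drive-letter form if that machine is Windows)
hdfs dfs -copyToLocal /user/w205/staging C:/Users/PC/Desktop/data
# inside a Linux VM, something like this instead
hdfs dfs -copyToLocal /user/w205/staging /home/w205/data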

how to save data in HDFS with spark?

I want to use Spark Streaming to retrieve data from Kafka. Now I want to save the data to a remote HDFS. I know that I have to use the function saveAsTextFile, but I don't know precisely how to specify the path.
Is it correct if I write this:
myDStream.foreachRDD(frm -> {
    frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the IP address of my remote HDFS server,
/home/hadoop/datanode/ is the DataNode directory created when I installed Hadoop (I don't know whether I have to specify this directory), and
myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir
The path has to be a directory in HDFS.
For example, if you want to save the files inside a folder named myNewFolder under the root / path in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/.
When the Spark job runs, the directory myNewFolder will be created.
The datanode data directory, which is set by dfs.datanode.data.dir in hdfs-site.xml, is where HDFS stores the blocks of the files you write; it should not be referenced as an HDFS directory path.
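Putting that together, the corrected call would look something like the sketch below, assuming the NameNode listens on ip_addr:9000 as in the question. Since saveAsTextFile refuses to write into a directory that already exists, each micro-batch gets its own output directory here (a hypothetical timestamp suffix):
myDStream.foreachRDD(rdd -> {
    // write this batch under /myNewFolder in HDFS; the URI points at the NameNode,
    // not at the datanode data directory on local disk
    rdd.saveAsTextFile("hdfs://ip_addr:9000/myNewFolder/batch-" + System.currentTimeMillis());
});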

Temporary storage usage between distcp and s3distcp

I read the documentation for Amazon's S3DistCp - it says,
"During a copy operation, S3DistCp stages a temporary copy of the
output in HDFS on the cluster. There must be sufficient free space in
HDFS to stage the data, otherwise the copy operation fails. In
addition, if S3DistCp fails, it does not clean the temporary HDFS
directory, therefore you must manually purge the temporary files. For
example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies
the entire 500 GB into a temporary directory in HDFS, then uploads the
data to Amazon S3 from the temporary directory".
This is not insignificant, especially if you have a large HDFS cluster. Does anybody know if the regular Hadoop DistCp has this same behaviour of staging the files to copy in a temporary folder?
DistCp does not use a temporary folder; instead it uses a MapReduce job to copy the files directly, whether the copy is intra-cluster or inter-cluster. The same applies to HDFS-to-S3 copies. AFAIK, DistCp will not fail the whole batch of file copies if a single file fails for some reason.
If a total of 500 GB needs to be copied and DistCp fails after 200 GB have already been copied, you still have that 200 GB of data in S3. When you rerun the DistCp job, it will skip the files that already exist.
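For example, a plain HDFS-to-S3 copy with DistCp might look like the sketch below (placeholder paths and bucket name; the -update flag makes a rerun copy only files that are missing or differ at the destination):
hadoop distcp -update hdfs:///data/source s3a://my-bucket/data/dest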
For more information about the command options, look at the DistCp guide here.
