Loading data into Hive Table from HDFS in Cloudera VM - hadoop

When using the Cloudera VM, how can you access information in HDFS? I know there isn't a direct path to HDFS, but I also don't see how to dynamically access it.
After creating a Hive table through the Hive CLI, I attempted to load some data from a file located in HDFS:
load data inpath '/test/student.txt' into table student;
But then I just get this error:
FAILED: SemanticException Line 1:17 Invalid path ''/test/student.txt'': No files matching path hdfs://quickstart.cloudera:8020/test/student.txt
I also tried to just load data not in the HDFS into a Hive Table like so:
load data inpath '/home/cloudera/Desktop/student.txt' into table student;
However that just produced this error:
FAILED: SemanticException Line 1:17 Invalid path ''/home/cloudera/Desktop/student.txt'': No files matching path hdfs://quickstart.cloudera:8020/home/cloudera/Desktop/student.txt
Once again I see it trying to access data under the root hdfs://quickstart.cloudera:8020, and I'm not sure what that is, but it doesn't seem to be the root directory of HDFS.
I'm not sure what I'm doing wrong, but I made sure the file is located in HDFS, so I don't know why this error comes up or how to fix it.

how can you access information in the HDFS
Well, you certainly don't need to use Hive to do it. hdfs dfs commands are how you interact with HDFS.
I'm not sure what that is, but it doesn't seem to be the root directory for the HDFS
It is the root of HDFS. quickstart.cloudera is the hostname of the VM. Port 8020 is the HDFS port.
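For example, these two commands list the same directory on the quickstart VM:
hdfs dfs -ls /
hdfs dfs -ls hdfs://quickstart.cloudera:8020/
The second form just spells out the fs.defaultFS prefix that the first one fills in for you.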
Your exceptions come from the difference the LOCAL keyword makes.
What you're doing:
LOAD DATA INPATH <hdfs location>
vs. what you seem to want:
LOAD DATA LOCAL INPATH <local file location>
If the file really is in HDFS, it's not clear how you put it there, but HDFS definitely doesn't have a /home folder or a Desktop, so the second error at least makes sense.
Anyways, hdfs dfs -put /home/cloudera/Desktop/student.txt /test/ is one way to upload your file, assuming the hdfs:///test folder already exists. Otherwise, hdfs dfs -put /home/cloudera/Desktop/student.txt /test renames your file to /test on HDFS.
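Put together, a sketch of the whole flow using the paths from your question:
hdfs dfs -mkdir -p /test
hdfs dfs -put /home/cloudera/Desktop/student.txt /test/
load data inpath '/test/student.txt' into table student;
Or, skipping the HDFS staging step entirely:
load data local inpath '/home/cloudera/Desktop/student.txt' into table student;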
Note: You can create an EXTERNAL TABLE over an HDFS directory, you don't need to use the LOAD DATA command.
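For example, a minimal sketch, assuming student.txt is tab-delimited with an id and a name (a hypothetical schema):
CREATE EXTERNAL TABLE student (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/test';
Hive then queries whatever files sit under /test without moving them.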

Related

Why is DATA COPIED and not MOVED while loading data from the local filesystem in Hive - hadoop

When we use following command:
Load data local inpath "mypath"
why is the data copied from the local filesystem into HDFS rather than moved?
Since you are moving data between two different filesystems (the local filesystem and HDFS), this cannot be a pure metadata operation, as a non-local load is.
The data itself has to be copied.
In theory this command could also delete the source file afterwards, but what for?
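You can see the difference for yourself (a sketch, assuming a table t and a data.txt in each location):
LOAD DATA INPATH '/tmp/data.txt' INTO TABLE t;            -- moves the file within HDFS; /tmp/data.txt is gone afterwards
LOAD DATA LOCAL INPATH '/home/me/data.txt' INTO TABLE t;  -- copies the file into HDFS; the local copy stays put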

Basic issue in copying files from hive or hadoop to local directory due to wrong nomenclature

I'm trying to copy a file that lives both in Hive and on HDFS onto my local computer, but I can't seem to figure out the right call/terminology to use to refer to my local machine. All the online explanations describe it solely as "path to local". I'm trying to copy this into a folder at C/Users/PC/Desktop/data and am using the following attempts:
In HDFS
hdfs dfs -copyToLocal /user/w205/staging /C/Users/PC/Desktop/data
In Hive
INSERT OVERWRITE LOCAL DIRECTORY 'c/users/pc/desktop/data/' SELECT * FROM lacountyvoters;
How should I be invoking the local repository in this case?
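For what it's worth, a sketch of both forms, assuming the commands run on the Linux machine hosting Hadoop and that /home/w205/data is an existing, writable local directory (a hypothetical path):
hdfs dfs -copyToLocal /user/w205/staging /home/w205/data
INSERT OVERWRITE LOCAL DIRECTORY '/home/w205/data' SELECT * FROM lacountyvoters;
A Windows path like C:/Users/PC/Desktop/data only makes sense if the Hadoop client itself runs on that Windows machine.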

How to save data in HDFS with Spark?

I want to use Spark Streaming to retrieve data from Kafka. Now, I want to save my data to a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.
Is that correct if I write this:
myDStream.foreachRDD(frm -> {
    frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the IP address of my remote HDFS server,
/home/hadoop/datanode/ is the DataNode directory created when I installed Hadoop (I don't know if I have to specify this directory), and
myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir
The path has to be a directory in HDFS.
For example, if you want to save the files inside a folder named myNewFolder under the root / path in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/.
On execution of the Spark job, this directory myNewFolder will be created.
The datanode data directory given by dfs.datanode.data.dir in hdfs-site.xml is where HDFS stores the blocks of the files you put into it; it should not be referenced as an HDFS directory path.
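As a sketch, the snippet from the question would then look something like this (keeping the asker's port 9000, and appending the batch time because writing every batch to one fixed directory would fail once it exists):
myDStream.foreachRDD((rdd, time) -> {
    // one output directory per micro-batch under /myNewFolder
    rdd.saveAsTextFile("hdfs://namenode_ip:9000/myNewFolder/batch-" + time.milliseconds());
});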

HDFS path to load data to Hive

I am running Hadoop as a single-node distribution.
Following the posts, I moved a file to HDFS using
hadoop fs -put <local path> /usr/tmp/fileNAme.txt
Now I am trying to load the data from the HDFS file into a Hive table using the command below, but I am not able to work out what HDFS path, relative to my local filesystem, I should be providing.
The load command I am using from my Java program is
LOAD DATA IN PATH ('HDFS PATH as it relates to my local File System???'). All my attempts at giving the path, including /usr/tmp/fileNAme.txt, fail.
How do I resolve the full HDFS path?
Your syntax is incorrect. It should be, for example:
load data local inpath '/tmp/categories01.psv' overwrite into table categories;
You have to specify LOCAL INPATH in the command when the file is on the local filesystem.
This command loads data from the local file system:
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.
This command loads data from the HDFS file system:
LOAD DATA INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Have a look into this article for more details.
The syntax for loading a file from HDFS into Hive is
LOAD DATA INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Please clarify how I resolve the full HDFS path.
The full HDFS path in your syntax would be
hdfs://<namenode-hostname>:<port>/your/file/path
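For example, you can read that prefix straight from the client configuration and prepend it (mytable is a hypothetical table name):
hdfs getconf -confKey fs.defaultFS
# prints something like hdfs://localhost:8020
LOAD DATA INPATH 'hdfs://localhost:8020/usr/tmp/fileNAme.txt' INTO TABLE mytable;
A bare absolute path such as '/usr/tmp/fileNAme.txt' also works, since Hive resolves it against fs.defaultFS.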

Apache default Hive Warehouse path in HDFS

I installed Hive on a CentOS 7 3-node cluster for the first time, for POC purposes. Hive is installed inside the user hduser1's home folder, as specified in the .bashrc file:
export HIVE_HOME=/home/hduser1/hive
I also created an HDFS folder for the Hive warehouse with the following commands.
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse
Everything works fine. After I created a table, I saw a file appear in the warehouse folder.
Here is my question - how does Hive know about this warehouse path, considering that I did not add the path /user/hive/warehouse to any configuration file?
I saw another person's installation, which created the Hive warehouse folder at /user/hive234/warehouse, and that installation still worked. Does Hive figure it out by some naming convention?
Well, as you know, the default location is maintained as /user/hive/warehouse, but you can change it by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
Here is an example entry (shown with the stock default value; point it at another directory to relocate the warehouse):
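<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>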
