Hive - How to load full html file content to a single hive row? - hadoop

I have 1,000 *.html files in an HDFS path and I want to create a Hive table from these files.
But the query below gives me '\n'-delimited rows rather than the full content of each html file.
> create external table if not exists mydb.myhtmltable (
> body STRING )
> STORED AS TEXTFILE
> LOCATION '/user/hadoop/dataset/refhtml';
How can I put the full html content into the body field?
I want 1,000 rows from 1,000 html files.
Is it possible?

Add this:
LINES TERMINATED BY '\NNN'
where NNN is the octal code (octal digits run 0-7) of the character you want to use as the line terminator.
so:
create external table if not exists mydb.myhtmltable (
body STRING )
ROW FORMAT DELIMITED
LINES TERMINATED BY '\NNN'
STORED AS TEXTFILE
LOCATION '/user/hadoop/dataset/refhtml';
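As a quick sanity check once the table is created (names as above), the row count should match the file count:

-- expect 1,000 rows for 1,000 html files
select count(*) from mydb.myhtmltable;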

Related

Dynamically Load Multiple CSV files to Oracle External Table

I am trying to load my Oracle external table dynamically with multiple .csv files.
I am able to load one .csv file, but as soon as I alter the table with a new .csv file name, the table gets rewritten.
I have multiple .csv files in a folder which change every day, with the date in the name.
E.g. file names FileName1_20200607.csv, FileName2_20200607.csv.
I don't think there is a way to write 'FileName*20200607.csv' to pick up all the files for that date?
My code:
......
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "DATA_DIR_PATH"
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE
    BADFILE CRRENG_ORA_APPS_OUT_DIR : 'Filebad'
    DISCARDFILE DATA_OUT_PATH : 'Filedesc.dsc'
    LOGFILE DATA_OUT_PATH : 'Filelog.log'
    SKIP 0
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' AND '"'
    MISSING FIELD VALUES ARE NULL
    REJECT ROWS WITH ALL NULL FIELDS )
  LOCATION
  ( 'FileName1_20200607.csv',
    'FileName2_20200607.csv' )
);
But I want to populate these file names dynamically. The table should pick up all the file names from DATA_DIR. There are about 50 other file names.
I can add Unix script if need be.
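One hedged way to do that from PL/SQL (all object names here are hypothetical): have the Unix script dump the day's file names into a small staging table, then rebuild the external table's LOCATION with dynamic SQL:

DECLARE
  l_files VARCHAR2(4000);
BEGIN
  -- ext_file_list is an assumed staging table that the Unix script
  -- fills with the current day's file names from DATA_DIR
  SELECT LISTAGG('''' || file_name || '''', ', ')
         WITHIN GROUP (ORDER BY file_name)
    INTO l_files
    FROM ext_file_list;
  -- repoint the external table at today's files
  EXECUTE IMMEDIATE 'ALTER TABLE my_ext_table LOCATION (' || l_files || ')';
END;
/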

How to add file to Hive

I have a file where all the column delimiters in Notepad++ are shown as EOT, SOH, ETX, ACK, BEL, BS, ENQ.
I know the schema of the table but I am totally new to these technologies and I cannot load the file into the table. Can I do it through a UI like a CSV file, and if yes, with what delimiter?
Thank you in advance for your help.
It is pretty easy, as you have mentioned the file is "," separated.
Let's create a simple table with one column:
CREATE TABLE test1 (col1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Note the clause FIELDS TERMINATED BY ',': we have specified that fields are separated by ",". If the columns are separated by tabs instead, change it to '\t'.
Once the table is created we can load the file using the commands below.
If the file is on the local file system:
LOAD DATA LOCAL INPATH '<complete_local_file_path>' INTO table test1;
If the file is in HDFS:
LOAD DATA INPATH '<complete_HDFS_file_path>' INTO table test1;
Hive is just an abstraction layer over HDFS, so you would add the file to HDFS in some folder, then build an EXTERNAL TABLE on top of it:
CREATE EXTERNAL TABLE name(...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/folder/';
Can I do it through a UI like a CSV file
If you install HUE, then you could do it from its web UI.
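On the control characters the question actually shows: Hive accepts octal escapes for delimiters, and SOH (Ctrl-A, '\001') happens to be Hive's default field delimiter. A sketch, assuming the fields really are SOH-separated (table name, columns, and path are made up):

CREATE EXTERNAL TABLE ctrl_table (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'   -- SOH / Ctrl-A
STORED AS TEXTFILE
LOCATION '/path/to/folder/';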

How to store multiple files under the same directory in hive?

I'm using Hive to process my CSV files. I've stored the CSV files in HDFS and want to create tables from those files.
I use the following command:
create external table if not exists csv_table (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/CsvData/csv_table.csv' OVERWRITE INTO TABLE csv_table;
So the file under /CsvData will be moved into /user/hive. That makes sense.
But what if I want to create another table?
create external table if not exists csv_table2 (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/CsvData/csv_table2.csv' OVERWRITE INTO TABLE csv_table2;
It will raise an exception complaining that the directory is not empty.
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Directory hdfs://localhost:9000/user/hive could not be cleaned up.
So it is hard for me to understand: does it mean I can store only one file under one directory? To store multiple files, do I have to create one directory for every file?
Is it possible to store all the files together?
The CREATE TABLE statement will NOT raise an exception complaining that the directory is not empty, because creating a table on top of an existing directory is a perfectly normal scenario.
You can store as many files in the directory as necessary, and all of them will be accessible to the table built on top of the folder.
The table location is a directory, not a file. If you need to create a new table and keep its files separate from another table's files, create a separate folder.
Read also this answer for clear understanding: https://stackoverflow.com/a/54038932/2700344
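Putting that together, a sketch of the second table with its own subdirectory (the folder name csv_table2 under /user/hive is an assumption):

create external table if not exists csv_table2 (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive/csv_table2'
TBLPROPERTIES ("skip.header.line.count"="1");
LOAD DATA INPATH '/CsvData/csv_table2.csv' OVERWRITE INTO TABLE csv_table2;
-- OVERWRITE now only cleans /user/hive/csv_table2, not other tables' files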

In Hive, how do I load only part of the raw data to a table?

I've got a typical CREATE TABLE statement as follows:
CREATE EXTERNAL TABLE temp_url (
MSISDN STRING,
TIMESTAMP STRING,
URL STRING,
TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/20140811/';
Where /20140811/ is a directory with gigabytes worth of data inside.
Loading the data is not a problem. Querying anything on it, however, chokes Hive up and simply gives me a number of MapReduce errors.
So instead, I'd like to ask if there's a way to load only part of the data in /20140811/. I know I can select a few files from inside the folder, dump them into another folder, and use that, but it seems tedious, especially when I've got 20 or so of these /20140811/ directories.
Is there something like this:
CREATE EXTERNAL TABLE temp_url (
MSISDN STRING,
TIMESTAMP STRING,
URL STRING,
TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/Half_of_20140811/';
I'm also open to non-Hive answers. Perhaps there's a way in s3cmd to quickly grab a certain amount of data inside /20140811/ and dump it into /20140811_halved/ or something.
Thanks.
I would suggest the following as a workaround:
Create a temp table with the same structure (using LIKE), as sketched below.
insert into NEW_TABLE select * from OLD_TABLE limit 1000;
You can add as many filter conditions as needed to filter out data before loading.
Hope this helps you.
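A minimal sketch of that workaround, using the table name from the question (the sample table name is made up):

-- copy of the structure only, no data
CREATE TABLE temp_url_sample LIKE temp_url;
-- pull in just a slice; add WHERE filters here as needed
INSERT INTO TABLE temp_url_sample
SELECT * FROM temp_url LIMIT 1000;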
Since you are saying that you have "20 or so of these /20140811/ directories", why don't you try creating an external table with partitions on those directories and running your queries on a single partition, as sketched below.
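A hedged sketch of that partition approach (the partition column dt is an assumption; TIMESTAMP is backquoted because it is a reserved word in newer Hive versions):

CREATE EXTERNAL TABLE temp_url_part (
MSISDN STRING,
`TIMESTAMP` STRING,
URL STRING,
TIER1 STRING
)
PARTITIONED BY (dt STRING)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/';
-- register each dated directory as its own partition
ALTER TABLE temp_url_part ADD PARTITION (dt='20140811')
LOCATION 's3://mybucket/input/project_blah/20140811/';
-- queries then read only the partitions they name
SELECT count(*) FROM temp_url_part WHERE dt = '20140811';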

Hive table not retrieving rows from external file

I have a text file called sample.txt. The file looks like:
abc,23,M
def,25,F
efg,25,F
I am trying to create a table in hive using:
CREATE EXTERNAL TABLE ppldb(name string, age int, gender string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/sample.txt';
But the data isn't getting into the table. When I run the query:
select count(*) from ppldb
I get 0 in output.
What could be the reason for data not getting loaded into the table?
The LOCATION of an external table in Hive should be an HDFS directory, not the full path of a file.
If that directory does not exist, the location you give will be created automatically. In your case /path/to/sample.txt is being treated as a directory.
So just give /path/to/ in the LOCATION and keep the sample.txt file inside that directory. It will work.
Hope it helps...!!!
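So the corrected DDL from the question would look roughly like this (the directory name is assumed):

CREATE EXTERNAL TABLE ppldb (name string, age int, gender string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/path/to/';   -- sample.txt sits inside this directory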
the LOCATION clause indicates where the table's data will be stored, not where to retrieve data from. After moving the sample.txt file into HDFS with something like
hdfs dfs -copyFromLocal ~/sample.txt /user/tables/
you could load the data into a table in hive with
create table temp(name string, age int, gender string)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/user/tables/sample.txt' into table temp;
That should work.
