EMR Hadoop processing whole S3 file - hadoop

I have a bunch of small (1KB to 1MB) text files stored in Amazon S3 that I would like to process using Amazon EMR's Hadoop.
Each record given to the mapper needs to contain the entire contents of a text file as well as some way to determine the filename, so I cannot use the default TextInputFormat.
What is the best way to accomplish this? Is there anything else I can do (like copying files from S3 to HDFS) to increase performance?

I had the same issue. Please refer to the following questions:
s3fs on Amazon EMR: Will it scale for approx 100million small files?
Write 100 million files to s3
Too many open files in EMR
If you don't have any large files but have a lot of files, it's sufficient to use the s3cmd get --recursive s3://<url> . command. After retrieving the files onto the EMR instance, you can create tables with Hive. For example, you can load whole files with the LOAD DATA statement, using partitions.
Sample
This is sample code:
#!/bin/bash
# fetch the small files from S3 onto the EMR instance
s3cmd get --recursive s3://your.s3.name .
# create a partitioned table (tier1/tier2/tier3 mirror the directory levels)
hive -e "SET mapred.input.dir.recursive=true; DROP TABLE IF EXISTS import_s3_data;"
hive -e "CREATE TABLE import_s3_data( rawdata string )
PARTITIONED BY (tier1 string, tier2 string, tier3 string);"
LOAD_SQL=""
# collect files as an array
FILES=(`find . -name \*.txt -print`)
for FILE in "${FILES[@]}"
do
    # split the relative path into its directory components
    DIR_INFO=(`echo ${FILE##./} | tr -s '/' ' '`)
    T1=${DIR_INFO[0]}
    T2=${DIR_INFO[1]}
    T3=${DIR_INFO[2]}
    LOAD_SQL="${LOAD_SQL} LOAD DATA LOCAL INPATH '${FILE}' INTO TABLE import_s3_data PARTITION (tier1 = '${T1}', tier2 = '${T2}', tier3 = '${T3}');"
done
hive -e "${LOAD_SQL}"
Another option
I think there are some other options for retrieving small S3 data:
S3DistCp ... it will merge the small files into larger ones so that Hadoop can deal with them (see the sketch after this list)
Hive - External Tables ... it creates an external table referring to S3 storage. However, it has almost the same performance as using s3cmd get. It might be more effective when there are many large raw or gzipped files on S3.
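For the S3DistCp option, here is a minimal sketch run on the EMR master node; the bucket, destination path, and target size are placeholders, not taken from the question. It concatenates the small .txt objects into roughly 128 MB files on HDFS:
s3-dist-cp --src s3://your.s3.name/input/ \
           --dest hdfs:///merged-input/ \
           --groupBy '.*(\.txt)' \
           --targetSize 128
Because every file matching the group pattern is merged into the same output files, Hadoop then launches far fewer map tasks than there were original S3 objects.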

In my opinion, the best approach would be to create an external table on the CSV files and load it into another table stored back in an S3 bucket in Parquet format. You will not have to write any script in that case, just a few SQL queries.
CREATE EXTERNAL TABLE databasename.CSV_EXT_Module(
recordType BIGINT,
servedIMSI BIGINT,
ggsnAddress STRING,
chargingID BIGINT,
...
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://module/input/csv_files/'
TBLPROPERTIES ("skip.header.line.count"="1");
The above table will only be an external table mapped to the CSV files.
Create another table on top of it if you want the query to run faster:
CREATE TABLE databasename.RAW_Module
STORED AS PARQUET
LOCATION 's3://module/raw/parquet_files/'
AS SELECT
recordType,
servedIMSI,
ggsnAddress,
chargingID,
...
regexp_extract(INPUT__FILE__NAME,'(.*)/(.*)',2) as filename
FROM databasename.CSV_EXT_Module;
Adjust the regexp_extract call to pull out the part of the input file name you need.
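For illustration, a hedged example of what that regexp_extract call returns; the path below is made up, not taken from the question:
-- INPUT__FILE__NAME might look like 's3://module/input/csv_files/part-0001.csv';
-- group 2 of '(.*)/(.*)' is everything after the last '/', i.e. the bare file name
SELECT regexp_extract('s3://module/input/csv_files/part-0001.csv', '(.*)/(.*)', 2);
-- returns part-0001.csv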

Related

Convert data from gzip to sequenceFile format using Hive on spark

I'm trying to read a large gzip file into Hive through the Spark runtime to convert it into SequenceFile format, and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, the same as it does for text files.
Is there a way to change the number of mappers for a gzip file being read, or should I choose another format like Parquet?
I'm stuck currently.
The problem is that my log file is JSON-like data saved in txt format and then gzipped, so for reading I used org.apache.spark.sql.json.
The examples I have seen for converting data into SequenceFile use simple delimiters, like CSV format.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it into something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As a gzipped text file is not splittable, only one mapper will be launched; you have to choose other data formats if you want to use more than one mapper.
If you have huge JSON files and want to save storage on HDFS, use bzip2 compression to compress your JSON files on HDFS. You can query the bzip2-compressed JSON files from Hive without modifying anything.
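A minimal sketch of that recompression, with hypothetical HDFS paths (not from the question):
# stream the gzipped JSON off HDFS, recompress as splittable bzip2, write it back
hadoop fs -cat /logs/file_name.txt.gz | gunzip | bzip2 | hadoop fs -put - /logs/file_name.txt.bz2
Unlike gzip, bzip2 blocks can be split, so Hive or Spark can then read the file with more than one mapper.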

how to preprocess the data and load into hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got structured data as .tar files whose first 4 lines contain a description. How do I process this type of data? I am a little bit confused.
1.a Can I directly process the data as these are tar files? If yes, how do I remove the data in the first four lines, or do I need to untar the files and remove the first 4 lines?
1.b I want to process this data using Hive.
Please suggest me how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the below solution.
If yes, how to remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties ("skip.header.line.count"="1"), set while creating a table, that tells Hive the number of header rows to ignore. To ignore the first four lines: tblproperties ("skip.header.line.count"="4").
-- the header-skip property belongs on the table that reads the raw text file
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the below steps to achieve your goal:
Copy the data (i.e. the .tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the result locally (a sketch of this step follows after these steps).
Create the metadata (i.e. the table) in Hive based on the description.
E.g.: if the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make note of the field separator used in the data file and use the corresponding field separator in the CREATE TABLE query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive.
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ',';
Since the data is in a structured format, you can load it into the Hive table using the command below.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
Now the local data will be copied into HDFS and loaded into the Hive table.
Finally, you can query the hive table using SELECT * FROM TABLENAME;
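A hedged sketch of the untar-and-trim step above (file names are placeholders, not from the question):
# unpack the archive, then drop the 4 description lines from the extracted file
tar -xf data.tar
tail -n +5 extracted_file.txt > extracted_file_clean.txt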

Where is HIVE metadata stored by default?

I have created an external table in Hive using following:
create external table hpd_txt(
WbanNum INT,
YearMonthDay INT ,
Time INT,
HourlyPrecip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/user/hive/external';
Now this table is created in location */hive/external.
Step-1: I loaded data in this table using:
load data inpath '/input/hpd.txt' into table hpd_txt;
The data is successfully loaded into the specified path (*/external/hpd_txt).
Step-2: I deleted the file from the */hive/external path using the following:
hadoop fs -rmr /user/hive/external/hpd_txt
Questions:
Why is the data deleted from the original path? (*/input/hpd.txt is deleted from HDFS, but the table is created in the */external path.)
After I delete the data from HDFS as in step 2 and again use show tables;, it still lists the table hpd_txt in the external path.
So where is this coming from?
Thanks in advance.
Hive doesn't know that you deleted the files. Hive still expects to find the files in the location you specified. You can do whatever you want in HDFS, but it doesn't get communicated to Hive; you have to tell Hive if things change.
hadoop fs -rmr /user/hive/external/hpd_txt
For instance, the above command doesn't delete the table; it just removes the files. The table still exists in the Hive metastore. If you want to delete the table, then use:
DROP TABLE IF EXISTS tablename;
Since you created the table as an external table, this will drop the table from Hive. The files will remain if you haven't removed them. If you want to delete an external table and the files the table is reading from, you can do one of the following:
Drop the table and then remove the files
Change the table to managed and drop the table (see the sketch below)
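For the second option, a hedged sketch using the table from the question:
-- convert the external table to managed, then drop it so Hive removes the data files too
ALTER TABLE hpd_txt SET TBLPROPERTIES('EXTERNAL'='FALSE');
DROP TABLE hpd_txt;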
Finally, the default warehouse location for Hive is /user/hive/warehouse (the table metadata itself lives in the metastore database).
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for the table. This comes in handy if you already have data generated. Otherwise, you will have to load data (conventionally, or by creating a file in the directory the Hive table points to).
When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
Source: Hive docs
So, in your step 2, removing the file /user/hive/external/hpd_txt removes the data source (the data backing the table), but the table still exists and continues to point to hdfs://localhost:9000/user/hive/external as it was created.
@Anoop: Not sure if this answers your question. Let me know if you have any further questions.
Do not use the LOAD DATA INPATH command. The LOAD operation MOVEs (it does not copy) the data into the corresponding Hive table directory. Use hadoop fs -put or -copyFromLocal to copy a file from the local (non-HDFS) file system into HDFS, and just provide that HDFS location in the CREATE TABLE statement after executing the put command.
Deleting the table does not remove the HDFS file from disk; that is the advantage of an external table. An external table only stores metadata describing how to access the data files. If you drop the table, the data file is untouched in its HDFS location. But in the case of internal (managed) tables, both the metadata and the data will be removed if you drop the table.
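A hedged sketch of that put-then-create flow; the local path and target directory are placeholders:
# copy the source file from the local file system into HDFS (a copy, not a move)
hadoop fs -mkdir -p /user/hive/external/hpd
hadoop fs -put /local/path/hpd.txt /user/hive/external/hpd/
# then point the CREATE EXTERNAL TABLE ... LOCATION clause at that directory, as in the next answer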
After going through your helpful comments and other posts, I have found the answer to my question.
If I use the LOAD DATA INPATH command, it "moves" the source file to the location where the external table is being created. Although that data won't be affected when the table is dropped, moving the source file is not good. So use LOAD DATA LOCAL INPATH when loading data into internal tables.
To load data into an external table from a file already located in HDFS, use the LOCATION clause in the CREATE TABLE query to point at the source directory, for example:
create external table hpd(WbanNum string,
YearMonthDay string ,
Time string,
hourprecip string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/input/hpd/';
This sample location points to the data already present in HDFS at this path, so there is no need to use the LOAD DATA INPATH command here.
It's good practice to store source files in their own dedicated directories, so that there is no ambiguity when external tables are created and the data sits in a properly managed directory structure.
Thanks a lot for helping me understand this concept guys! Cheers!

Is it possible to import data into Hive table without copying the data

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.
Can I avoid having all my text data stored twice?
EDIT: I load it via the following command
LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Then, I can find the exact same file in:
/user/hive/warehouse/sandbox.db/test/day=20130220
I assumed it was copied.
Use an external table:
CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';
If you want to use partitioning with an external table, you will be responsible for managing the partition directories.
The location specified must be an HDFS directory.
If you drop an external table, Hive WILL NOT delete the source data.
If you want to manage your raw files yourself, use external tables. If you want Hive to do it, then let Hive store them inside its warehouse path.
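If the external table above were declared with PARTITIONED BY (day STRING), a hedged sketch of registering one partition directory by hand (the directory layout is assumed; the day value comes from the question):
ALTER TABLE sandbox.test ADD PARTITION (day='20130221')
LOCATION '/user/logs/day=20130221/';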
Instead of copying data directly to HDFS with your Java application, I would keep those files on the local file system and import them into HDFS via Hive using the following command.
LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Notice the LOCAL keyword.
You can use an ALTER TABLE ... PARTITION ... LOCATION statement to avoid data duplication.
create External table if not exists TestTable (testcol string) PARTITIONED BY (year INT,month INT,day INT) row format delimited fields terminated by ',';
ALTER TABLE TestTable ADD PARTITION (year='2014', month='2', day='17') LOCATION 'hdfs://localhost:8020/data/2014/2/17/';
Hive (at least when running in true cluster mode) cannot refer to files on the local file system. Hive can automatically import the files during table creation or a load operation. The reason is that Hive runs MapReduce jobs internally to extract the data. MapReduce reads from HDFS, writes back to HDFS, and runs in distributed mode, so if a file is stored on the local file system it cannot be used by the distributed infrastructure.

how to load data in hive automatically

Recently I have wanted to load log files into Hive tables, and I want a tool which can read data from a certain directory and load it into Hive automatically. This directory may include lots of subdirectories; for example, the certain directory is '/log' and the subdirectories are '/log/20130115', '/log/20130116', '/log/20130117'. Are there ETL tools which can achieve this: once new data is stored in the directory, the tool detects it automatically and loads it into a Hive table? Does such a tool exist, or do I have to write a script myself?
You can easily do this using Hive external tables and partitioning your table by day. For example, create your table as such:
create external table mytable(...)
partitioned by (day string)
location '/user/hive/warehouse/mytable';
This will essentially create an empty table in the metastore and make it point to /user/hive/warehouse/mytable.
Then you can load your data in this directory with the format key=value where key is your partition name (here "day") and value is the value of your partition. For example:
hadoop fs -put /log/20130115 /user/hive/warehouse/mytable/day=20130115
Once your data is loaded there, it is in the HDFS directory, but the Hive metastore doesn't know yet that it belongs to the table, so you can add it this way:
alter table mytable add partition(day='20130115');
And you should be good to go, the metastore will be updated with your new partition, and you can now query your table on this partition.
This should be trivial to script: create a cron job that runs once a day, does these commands in order, and finds the partition to load with the date command, for example by running:
hadoop fs -test -e /log/`date +%Y%m%d`
Checking whether $? is equal to 0 will tell you whether the directory is there; if it is, you can transfer it and add the partition as described above.
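A minimal sketch of that daily job, assuming /log lives on the local file system as the hadoop fs -put example above implies (swap in the hadoop fs -test -e check if /log is already in HDFS):
#!/bin/bash
# daily cron job: if today's log directory exists, push it under the table and register the partition
DAY=`date +%Y%m%d`
if [ -d /log/${DAY} ]; then
    hadoop fs -put /log/${DAY} /user/hive/warehouse/mytable/day=${DAY}
    hive -e "ALTER TABLE mytable ADD PARTITION (day='${DAY}');"
fi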
You can make use of LOAD DATA command provided by Hive. It exactly matches your use case. Specify a directory in your local file system and make Hive tables from it.
Example usage -
LOAD DATA LOCAL INPATH '/home/user/some-directory'
OVERWRITE INTO TABLE table
