How to force header when creating hive .gz output - hadoop

How do I make sure each .gz file is created with a header? I am setting these properties which give me multiple output files named 00000_0.gz, 00001_0.gz, 00002_0.gz, etc. But these have no header. What syntax do I need to force a header for each file?
BTW, my query is of the form
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/target_dir/' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' SELECT ...
Properties now set:
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Related

How to save a filename in a variable in Nifi?

I'm new to NiFi and I'm trying to get a file name and save it in a variable to be used later in the process.
Basically, I have a file (data_yyyyMMdd.tar.gz) that contains two .txt files (1.txt and 2.txt). Before unpacking this file, I want to save its name to a variable and then use that variable to add content to the unpacked files.
Content of the files (originally):
1.txt
id|name
1|apple
2|orange
Content of the files after being updated with the filename:
id|name|filename
1|apple|data_yyyyMMdd.tar.gz
2|orange|data_yyyyMMdd.tar.gz
I managed to unpack the file successfully, but I'm not able to save the .tar.gz filename in a variable and add its value to the content of each file.
Could you help me?
Depending on what processor you used to get the tar.gz file, you likely already have a FlowFile attribute called filename set to the name of the tar.gz file. After unpacking you may find that the filename attribute is overwritten (not sure though), so before unpacking, copy the filename attribute into some other attribute using UpdateAttribute. For example you can add a property in UpdateAttribute named original.filename and set its value to ${filename}.
After unpacking you can use UpdateRecord to add the original filename as a field in each record, I think by setting the Replacement Value Strategy to Literal Value and adding a property /filename set to ${original.filename}. I haven't tried this so I don't know if these are exactly the right settings, but the approach should work.

export data to csv using hive sql

How to export hive table/select query to csv? I have tried the command below. But it creates the output as multiple files. Any better methods?
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT fied1,field2,field3 FROM table1
Hive creates as many output files as there were reducers running, because each reducer writes its own file in parallel.
If you want a single file, add an order by to force the query to run on a single reducer, or try increasing the bytes-per-reducer configuration parameter:
SELECT fied1,field2,field3 FROM table1 order by fied1
OR
set hive.exec.reducers.bytes.per.reducer=67108864; --increase accordingly
Also you can try to merge files:
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=500000000;
set hive.merge.mapredfiles=true;
You can also concatenate the files using cat after getting them from Hadoop.
You can use the command hadoop fs -cat /hdfspath > some.csv to get the output in one file.
If you want a header, you can use sed along with Hive. See this link, which discusses various options for exporting Hive to CSV:
https://medium.com/@gchandra/best-way-to-export-hive-table-to-csv-file-326063f0f229
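The concatenate-then-add-a-header idea can also be done in one pass after the part files have been copied to the local file system. The following is only a minimal sketch: the function name merge_with_header, the directory layout, and the header text are assumptions for illustration, not anything Hive provides.

```python
from pathlib import Path


def merge_with_header(part_dir, out_file, header):
    """Write `header` once, then append every part file in name order.

    Hidden files (names starting with ".", such as checksum files) are
    skipped. `out_file` should live outside `part_dir` so a previous
    result is never picked up as an input.
    """
    parts = sorted(
        p for p in Path(part_dir).iterdir()
        if p.is_file() and not p.name.startswith(".")
    )
    # newline="" keeps the line endings of the part files untouched.
    with open(out_file, "w", encoding="utf-8", newline="") as out:
        out.write(header.rstrip("\n") + "\n")
        for part in parts:
            with open(part, "r", encoding="utf-8", newline="") as src:
                for line in src:
                    out.write(line)
    return len(parts)
```

A call such as merge_with_header("/tmp/output", "/tmp/result.csv", "field1,field2,field3") would then produce one CSV file with a single header line. This assumes the part files are uncompressed text.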

SQLLDR file path argument

I have more than 30 data files to load.
The path in those files changes on every run, so it becomes, for example:
INFILE "/home/dmf/Cycle7Data/ITEM_IMAGE.csv"
INFILE "/home/dmf/Cycle8Data/ITEM_IMAGE.csv"
The file name is different in each control file (for example, SUPPLIER.csv).
Is there any way to pass the file path in a variable, or to set an environment variable, so that the control files do not have to be edited every time?
You can pass the data file name on the command line; from the documentation:
DATA specifies the name of the data file containing the data to be loaded. If you do not specify a file extension or file type, then the default is .dat.
If you specify a data file on the command line and also specify data files in the control file with INFILE, then the data specified on the command line is processed first. The first data file specified in the control file is ignored. All other data files specified in the control file are processed.
So pass the relevant file name with each invocation, e.g.
sqlldr user/passwd control=myfile.ctl data=/home/dmf/Cycle7Data/ITEM_IMAGE.csv
If you have lots of files to load from a directory you could have a shell script that loops over the directory contents and passes each file name in turn to an SQL*Loader session.
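The looping idea could look roughly like the sketch below, written in Python rather than shell. It assumes, purely for illustration, that every data file has a control file with the same base name (ITEM_IMAGE.csv pairs with ITEM_IMAGE.ctl); the function names and directory arguments are hypothetical. Only the userid, control and data parameters of sqlldr are used.

```python
import subprocess
from pathlib import Path


def build_commands(data_dir, control_dir, userid):
    """Build one sqlldr command line per .csv file in data_dir.

    Assumption: the control file for X.csv is control_dir/X.ctl.
    """
    commands = []
    for data_file in sorted(Path(data_dir).glob("*.csv")):
        control_file = Path(control_dir) / (data_file.stem + ".ctl")
        commands.append([
            "sqlldr",
            f"userid={userid}",
            f"control={control_file}",
            f"data={data_file}",
        ])
    return commands


def run_all(commands):
    """Run each command in turn; stop at the first failing load."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For the next cycle, only the directory changes, for example run_all(build_commands("/home/dmf/Cycle8Data", "/home/dmf/ctl", "user/passwd")), where /home/dmf/ctl is a made-up location for the control files.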

.sql file not returning the column headers in csv file

The code below is in a batch script (.bat file) that calls the .sql file.
del C:\SREE\csvfile.csv
sqlplus SERVERNAME/Test123@ldptstb @C:\SREE\sree.sql
set from_email="SENDER_EMAIL_ID"
set to_email="TO_EMAIL_ID"
set cc_email="CC_EMAIL_ID"
set email_message="Csv file from application server"
set body_email=C:\SREE\sree.txt
set sendmail=C:\Interface\sqlldr\common\SENDMAIL.VBS
set interface_log=C:\SREE\csvfile.csv
cscript %sendmail% -F %from_email% -T %to_email% -C %cc_email% -S %email_message% -B %body_email% -A %interface_log% -O "ATTACHFILE" -A %body_email% -O "FILEASTEXT"
exit
Below is the content of the .sql file, which executes the SQL query and stores the data in a csv file:
set pagesize 0
set heading on
set feedback off
set trimspool on
set linesize 32767
set termout off
set verify off
set colsep ","
spool C:\SREE\csvfile.csv
SELECT Name, ID, Email, Role, Status FROM csvfile;
exit
The output is stored in the csv file, and I receive this file by email.
But the problem is that I am not getting the column names in the csv file. I have tried many approaches to get the names as column headings in the csv file.
Can anyone please help me with the code to get the column names in the csv file? Thanks in advance.
When you set pagesize 0 headings are suppressed:
SET PAGES[IZE] {14 | n}
Sets the number of lines on each page of output. You can set PAGESIZE to zero to suppress all headings, page breaks, titles, the initial blank line, and other formatting information.
That's the case even if you then explicitly set headings on.
You can either set pagesize to something very large instead or, possibly more helpfully since you probably don't want the separator line of dashes, generate the headings yourself with:
PROMPT Name,ID,Email,Role,Status
... before your select statement.
Use the GENERATE_HEADER configuration setting and set it to Yes, like this:
SET GENERATE_HEADER = 'Yes'
See this related thread here https://community.oracle.com/thread/2325171?start=0&tstart=0

Use parameters with CTL

I am using a CTL file to load data stored in a file to a specific table in my Oracle database.
Currently, I launch the loader file using the following command line:
sqlldr user/pwd@db data=my_data_file control=my_loader.ctl
I would like to know if it is possible to specify parameters that can be retrieved in the CTL file.
Also, is it possible to retrieve the name of the data file used by the CTL to fill the table? I would also like to insert it for each row. I currently have to call a procedure to update previously inserted records.
Any help would be appreciated !
As far as I know, there is no way to pass a parameter as a variable into the control file. But you can use a constant in the control file and modify the file to change that constant's value before every load.
Edit: more specific.
my_loader.ctl:
--options
load data
infile 'c:\$datfilename$' --this is optional, you can specify here or from command line
into table mytable
fields....
(
datafilename constant '$datfilename$', -- will be replaced by the real data file name on each load
datacol1 char(1),
....
)
dataload.bat (assume that $datfilename$ is the text that will be replaced by the data file's name):
::make a working copy of the control file
copy my_loader.ctl my_loader_temp.ctl
::replace the placeholder with the data file's name (mainly the value loaded into the table's datafilename column)
findandreplace my_loader_temp.ctl "$datfilename$" "%1"
::load
sqlldr user/pwd@db data=%1 control=my_loader_temp.ctl
::or omit data if the data file is specified by infile in the control file
sqlldr user/pwd@db control=my_loader_temp.ctl
Usage: dataload.bat mydatafile_2010_10_10.txt
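The copy-and-replace step does not depend on any particular find-and-replace tool. As a rough sketch only, the same substitution could be done with a few lines of Python; the function name render_control_file is made up for this example, and the placeholder spelling follows the one used in my_loader.ctl above.

```python
from pathlib import Path

# Placeholder text as it appears in the control file template.
PLACEHOLDER = "$datfilename$"


def render_control_file(template_path, output_path, data_file_name):
    """Copy the template, replacing every placeholder with the file name."""
    text = Path(template_path).read_text(encoding="utf-8")
    if PLACEHOLDER not in text:
        # Fail loudly instead of loading rows with the literal placeholder.
        raise ValueError(f"{PLACEHOLDER} not found in {template_path}")
    rendered = text.replace(PLACEHOLDER, data_file_name)
    Path(output_path).write_text(rendered, encoding="utf-8")
    return output_path
```

Calling render_control_file("my_loader.ctl", "my_loader_temp.ctl", "mydatafile_2010_10_10.txt") before each sqlldr run would give the same result as the batch steps, assuming the file name contains no single quotes that would break the constant in the control file.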
