export data to csv using hive sql - hadoop

How do I export a Hive table or the result of a select query to CSV? I have tried the command below, but it creates the output as multiple files. Are there any better methods?
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT field1,field2,field3 FROM table1

Hive creates as many files as there were reducers running; this is fully parallel.
If you want a single file, add ORDER BY to force everything onto a single reducer, or try increasing the bytes-per-reducer configuration parameter:
SELECT field1,field2,field3 FROM table1 ORDER BY field1
OR
set hive.exec.reducers.bytes.per.reducer=67108864; --increase accordingly
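For example, applying the single-reducer approach to the original statement, a sketch (paths and column names are taken from the question) would be:
-- ORDER BY forces a single reducer, so the export lands in one file
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT field1, field2, field3
FROM table1
ORDER BY field1;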
You can also try merging the output files:
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=500000000;
set hive.merge.mapredfiles=true;
You can also concatenate the files using cat after getting them from Hadoop, as sketched below.
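A minimal sketch of that last option (the paths are hypothetical; the -get step is only needed if the output directory is on HDFS rather than local):
hadoop fs -get /user/hive/output/ ./output/   # hypothetical HDFS and local paths
cat ./output/* > table1_export.csv            # concatenate the part files into one CSV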

You can use the
hadoop fs -cat /hdfspath/* > some.csv
command and get the output in one file.
If you want a header, you can use sed along with Hive. See this link, which discusses various options for exporting Hive data to CSV:
https://medium.com/@gchandra/best-way-to-export-hive-table-to-csv-file-326063f0f229
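One common pattern along those lines, sketched here with hypothetical table and column names and assuming GNU sed, prints the header from the Hive CLI and converts the tab-separated output to commas:
# hypothetical table/column names; hive.cli.print.header=true emits the column names as the first line
hive -e 'set hive.cli.print.header=true; select field1, field2, field3 from table1;' | sed 's/[\t]/,/g' > table1_with_header.csv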

Related

Is there a way to skip the header line while Sqoop exports a CSV file to Vertica DB?

I need to use Sqoop to export a CSV file to Vertica, but since the CSV has a header in it, that line gets exported as well.
Is there an efficient way to avoid the header?
I don't know Sqoop completely, but if Sqoop can control the generation of the Vertica COPY command, then make sure that, with a CSV file like this:
id|name
1|Arthur
2|Ford
3|Zaphod
to generate this command:
COPY public.foo FROM LOCAL 'foo.csv' DELIMITER '|' SKIP 1
SKIP 1 is the clause that makes COPY skip the first line.
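If you end up issuing the COPY yourself rather than through Sqoop, the same statement could be run from Vertica's vsql client; a sketch (host and user are placeholders):
# -h host and -U user are placeholder connection details
vsql -h verticahost -U dbadmin -c "COPY public.foo FROM LOCAL 'foo.csv' DELIMITER '|' SKIP 1"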

How to merge CSV files in Hadoop?

I am new to the Hadoop framework and I would like to merge 4 CSV files into a single file.
All 4 CSV files have the same headers and the column order is also the same.
I don't think Pig's STORE offers such a feature.
You could use Spark's coalesce(1) function; however, there is little reason to do this, as almost all Hadoop processing tools prefer to read directories, not files.
You should ideally not be storing raw CSV in Hadoop for very long anyway; rather, convert it to ORC or Parquet as columnar data. Especially if you are already reading CSV to begin with, do not output CSV again.
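As a hedged sketch of that conversion (the table names are hypothetical), a CSV-backed table can be rewritten as a columnar ORC table with a single CTAS statement in Hive:
-- emp_csv / emp_orc are hypothetical table names
CREATE TABLE emp_orc STORED AS ORC AS SELECT * FROM emp_csv;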
If the idea is to produce one CSV to download later, then I would suggest using Hive + Beeline to do that.
This will store the result in a file on the local file system:
beeline -u 'jdbc:hive2://[databaseaddress]' --outputformat=csv2 -f yourSQlFile.sql > theFileWhereToStoreTheData.csv
Try using the getmerge utility to merge the CSV files.
For example, suppose EMP_FILE1.csv, EMP_FILE2.csv and EMP_FILE3.csv are placed at some location on HDFS. You can merge all these files and place the merged file at a new location:
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* /newhdfsfilelocation/MERGED_EMP_FILE.csv
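Note that getmerge writes its destination to the local file system; if the merged file is needed back on HDFS, a sketch of the two-step flow (paths as in the example above, local path hypothetical) would be:
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* ./MERGED_EMP_FILE.csv   # merged copy lands locally
hadoop fs -put ./MERGED_EMP_FILE.csv /newhdfsfilelocation/              # optional: push it back to HDFS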

Informatica Post command task

I am working with multiple source files and a single source instance. I created three flat files and one destination table to experiment with multiple sources. I am using the 'File list' concept; for that I created a text file which contains all the flat file names.
Example:
Filename : File_list.txt
File content : Price1.txt
Price2.txt
Price3.txt
In the above example, Price1.txt, Price2.txt and Price3.txt are the flat file names. I specified File_list.txt as the source file while running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all the values into the destination table.
Now, once the data is inserted into the destination, I need to delete the source files from that directory location.
How can I achieve this?
You'll need to write a custom script that takes File_list.txt as input and performs the delete operations. You can then call it using the Post-Session Success Command session component, or as a separate Command Task in the workflow, linked using a $YourSessionName.Status = SUCCEEDED condition.
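A minimal sketch of such a script (the directory path is a placeholder; File_list.txt is the file list from the question):
#!/bin/sh
# Delete every flat file named in File_list.txt after a successful load
SRC_DIR=/path/to/source/files   # placeholder for the actual source directory
while IFS= read -r fname; do
    rm -f "$SRC_DIR/$fname"
done < "$SRC_DIR/File_list.txt"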

How can I merge two files using PIG script?

I have two files and I want to merge them sequentially. How can I do so using a Pig/Pig Latin script?
f1.csv
1,aa
1,aa
1,ab
1,ac
2,bd
2,bd
2,bd
4,ab
4,bc
f2.csv
1,xxx
1,xxy
1,xyx
1,yxx
1,xyy
1,yyx
2,pqr
2,pq
2,pqrs
2,pqs
3,def
And the output I need is:
1,aa,1,xxy
1,aa,1,xyx
1,ab,1,yxx
1,ac,1,xyy
2,bd,2,pqr
2,bd,2,pq
2,bd,2,pqrs
Can anyone tell me which join should be used and how to get this?
1) LOAD each file.
2) Then UNION them together
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#UNION
3) STORE the new unioned alias.
P.S. You can SET DEFAULT_PARALLEL 1; to make sure you only output one file (see the sketch below).
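Putting those steps together, a minimal Pig Latin sketch (file paths are from the question; the two-column schema is assumed from the sample data):
SET default_parallel 1;  -- single reducer, so STORE produces one output file
a = LOAD 'f1.csv' USING PigStorage(',') AS (id:int, val:chararray);
b = LOAD 'f2.csv' USING PigStorage(',') AS (id:int, val:chararray);
merged = UNION a, b;
STORE merged INTO 'merged_out' USING PigStorage(',');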

Hive: remove stuff from the distributed cache

I can add stuff to the distributed cache via
add file largelookuptable
and then run a bunch of HQL.
Now, when I have a series of commands like the following:
add file largelookuptable1;
select blah from blahness using somehow largelookuptable1;
add file largelookuptable2;
select newblah from otherblah using largelookuptable2;
In this case largelookuptable1 is unnecessarily available for the second query. Is there a way I can get rid of it before the second query runs?
On the Hive CLI, type:
delete file largelookuptable1;
The same applies to JARs added to the distributed cache.
Syntax (from Hive CLI):
Usage: delete [FILE|JAR|ARCHIVE] <value> [<value>]*
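Applied to the sequence from the question, a sketch of the flow would be (the table and file names are the question's own placeholders):
add file largelookuptable1;
select blah from blahness using somehow largelookuptable1;
delete file largelookuptable1;   -- drop it before the next query; 'list file;' shows what is still cached
add file largelookuptable2;
select newblah from otherblah using largelookuptable2;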
