Is there a way to skip the header line when using Sqoop to export a CSV file to Vertica DB - sqoop

I need to Sqoop export a CSV file to Vertica, but since the CSV has a header in it, that line gets exported as well.
Is there an efficient way to skip the header?

I don't know Sqoop in depth. But if Sqoop lets you control the Vertica COPY command it generates, then make sure that, with a CSV file like this:
id|name
1|Arthur
2|Ford
3|Zaphod
to generate this command:
COPY public.foo FROM LOCAL 'foo.csv' DELIMITER '|' SKIP 1
SKIP 1 is the clause that makes COPY skip the first line.
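If you cannot influence the generated COPY command, a workaround (a sketch, not from the original answer; paths, the table name and the connection details are placeholders) is to strip the header before handing the file to Sqoop:
# Remove the first line so only data rows reach Vertica, then export from HDFS.
tail -n +2 foo.csv > foo_data.csv
hadoop fs -put foo_data.csv /user/me/export/foo_data.csv
sqoop export \
  --connect jdbc:vertica://vertica-host:5433/mydb \
  --username dbuser --password-file /user/me/.vertica_pass \
  --table foo \
  --export-dir /user/me/export \
  --input-fields-terminated-by '|'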

Related

Read csv and update csv

I have a CSV file that contains a list of Hadoop file paths, so I read each path from its row and call hadoop fs -get on it. That part works fine. But I would also like to mark the second column of the CSV once a file has been copied to the destination folder,
something like a flag. How do I edit the second column inside the while loop and save it back to the same CSV?
Input.csv
path,flag
file1path,
file2path,
So after copying each file, I want to mark the flag as Y in the same file.
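One way to do this (a sketch, not an answer from the original thread; the destination directory is a placeholder) is to rewrite the file through a temp file as you go, since you cannot safely edit the file you are reading line by line:
# Assumed layout: "path,flag" header, comma-separated fields.
{
  IFS= read -r header
  echo "$header"
  while IFS=, read -r path flag; do
    if hadoop fs -get "$path" /local/dest/; then
      echo "$path,Y"
    else
      echo "$path,$flag"
    fi
  done
} < Input.csv > Input.csv.tmp && mv Input.csv.tmp Input.csv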

How to merge CSV files in Hadoop?

I am new to the Hadoop framework and I would like to merge 4 CSV files into a single file.
All 4 CSV files have the same headers and the same column order.
I don't think Pig STORE offers such a feature.
You could use Spark's coalesce(1) function; however, there is little reason to do this, as almost all Hadoop processing tools prefer to read directories, not single files.
Ideally, you should not be storing raw CSV in Hadoop for long anyway; convert it to a columnar format such as ORC or Parquet instead. In particular, if you are already reading CSV, do not write CSV back out again.
If the idea is to produce one CSV to download later, then I would suggest using Hive + Beeline, which will store the result in a file on the local file system:
beeline -u 'jdbc:hive2://[databaseaddress]' --outputformat=csv2 -f yourSQlFile.sql > theFileWhereToStoreTheData.csv
Try using the getmerge utility to merge the CSV files.
For example, suppose EMP_FILE1.csv, EMP_FILE2.csv and EMP_FILE3.csv are placed at some location on HDFS. You can merge all these files into a single file. Note that getmerge writes the merged file to the local file system:
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* /localfilelocation/MERGED_EMP_FILE.csv
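If the merged file should end up on HDFS again, you can follow up with a put (paths are placeholders):
hadoop fs -put /localfilelocation/MERGED_EMP_FILE.csv /newhdfsfilelocation/MERGED_EMP_FILE.csv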

Hadoop FileUtil copymerge - Ignore header

While writing out from Spark to HDFS, each part file has a header, depending on the header setting. So when calling copyMerge in FileUtil, we get duplicated headers in the merged file. Is there a way to retain the header from the first file and ignore the others?
If you are planning to merge it into a single file and then fetch it onto your local file system, you can use getmerge.
getmerge
Usage: hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.
Now, to remove the headers, you should have an idea of what your header looks like.
Suppose if your header looks like:
HDR20171227
You can use:
sed -i '2,${/^HDR/d}' "${final_filename}"
where final_filename is the name of the file on local FS.
This will delete all lines that start with HDR in your file and occur after the first line.
If you are unsure about the header, you can first store it in a variable using
header=$(head -1 "${final_filename}" )
And then proceed to delete it using sed.
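For completeness, a minimal sketch of that follow-up step (assuming the header contains no characters that are special to sed, such as slashes):
header=$(head -1 "${final_filename}")
# Delete later occurrences of that exact line, keeping the first one.
sed -i "2,\${/^${header}\$/d}" "${final_filename}"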

export data to csv using hive sql

How do I export a Hive table / select query to CSV? I have tried the command below, but it creates the output as multiple files. Are there any better methods?
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT field1,field2,field3 FROM table1
Hive creates as many files as there were reducers running; this is fully parallel.
If you want a single file, then add an ORDER BY to force everything through a single reducer, or try increasing the bytes-per-reducer configuration parameter:
SELECT field1,field2,field3 FROM table1 ORDER BY field1
OR
set hive.exec.reducers.bytes.per.reducer=67108864; --increase accordingly
Also you can try to merge files:
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=500000000;
set hive.merge.mapredfiles=true;
Also, you can concatenate the files into one while fetching them from Hadoop. You can use
hadoop fs -cat /hdfspath/* > some.csv
to get the output in one file.
If you want a header as well, you can use sed along with Hive. See this link, which discusses various options for exporting a Hive table to CSV:
https://medium.com/@gchandra/best-way-to-export-hive-table-to-csv-file-326063f0f229
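A common pattern along those lines (a sketch, not taken from the linked article; the column names come from the question and the output path is a placeholder) is to let the Hive CLI print the header and convert its tab-separated output to commas:
hive --hiveconf hive.cli.print.header=true \
  -e 'SELECT field1,field2,field3 FROM table1' \
  | sed 's/\t/,/g' > /mapr/mapr011/user/output/table1.csv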

filename of the sqoop import

When we import from an RDBMS to HDFS using Sqoop, we give a target directory to store the data in. Once the job completes, we can see files named like part-m-00000 as the mapper output. Is there any way we can pass the filename in which the data will be stored? Does Sqoop have an option like that?
According to this answer, you can specify arguments passed to MapReduce with the -D option, which can accept filename options:
-Dmapreduce.output.basename=myoutputprefix
Although this will change the basename of your file, it will not change the part numbers.
Same answers on other sites: cloudera, hadoopinrealworld.
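For illustration, a sketch of how that option would sit in a full command (connection details and paths are placeholders, not from the original answer); note that generic -D arguments must come right after the tool name, before the Sqoop-specific arguments:
sqoop import \
  -Dmapreduce.output.basename=myoutputprefix \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser --password-file /user/me/.db_pass \
  --table mytable \
  --target-dir /user/me/imports/mytable \
  -m 4
The imported files then come out as myoutputprefix-m-00000, myoutputprefix-m-00001, and so on.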
No, you can't rename it.
You can specify --target-dir <dir> to set the directory where all the data is imported. In this directory, you see many part files (e.g. part-m-00000). These part files are created by the various mappers (remember the -m <number> in your sqoop import command).
Since the data is imported into multiple files, how would you name each part file?
I do not see any additional benefit in renaming them.
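If a fixed filename is really needed, a common workaround (not part of either answer above; paths are placeholders) is to run the import with a single mapper and rename the resulting part file on HDFS afterwards:
# -m 1 yields exactly one part file, which can then be moved to the desired name.
hadoop fs -mv /user/me/imports/mytable/part-m-00000 /user/me/imports/mytable/mytable.csv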
