How to merge CSV files in Hadoop?

I am new to the Hadoop framework and I would like to merge 4 CSV files into a single file.
All 4 CSV files have the same headers, and the column order is the same.

I don't think Pig STORE offers such a feature.
You could use Spark's coalesce(1) function; however, there is little reason to do this, as almost all Hadoop processing tools prefer to read directories, not files.
Ideally, you should not be storing raw CSV in Hadoop for very long anyway; convert it to a columnar format such as ORC or Parquet instead. And if you are already reading CSV to begin with, do not output CSV again.
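For reference, here is a minimal PySpark sketch of the coalesce(1) approach (the HDFS paths are hypothetical, and the Parquet write at the end reflects the advice above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-csv").getOrCreate()

# Spark reads the whole directory of CSVs as one dataset; all files share the same header
df = spark.read.option("header", True).csv("hdfs:///data/emp_csv/")

# coalesce(1) forces a single output part file (written by a single task)
df.coalesce(1).write.option("header", True).csv("hdfs:///data/emp_csv_merged/")

# better long-term option: store the data as Parquet instead of CSV
df.write.parquet("hdfs:///data/emp_parquet/")
Note that the output is still a directory; it just contains a single part-*.csv file.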
If the idea is to produce one CSV to download later, then I would suggest using Hive + Beeline, which will store the result in a file on the local file system:
beeline -u 'jdbc:hive2://[databaseaddress]' --outputformat=csv2 -f yourSQlFile.sql > theFileWhereToStoreTheData.csv

Try using the getmerge utility to merge the CSV files.
For example, suppose EMP_FILE1.csv, EMP_FILE2.csv and EMP_FILE3.csv are placed at some location on HDFS. You can merge all of them into one file; note that getmerge writes the merged file to the local file system, so if you need it back in HDFS you have to upload it again with -put.
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* /localfilelocation/MERGED_EMP_FILE.csv

Related

How to specify number of partitions when writing a Parquet file?

parquet_writer.write_table(table)
This line writes a single file.
The documentation says:
This creates a single Parquet file. In practice, a Parquet dataset may consist of many files in many directories. We can read a single file back with read_table:
Is there a way for PyArrow to create a Parquet file in the form of a directory with multiple part files in it, such as:
ls -lrt permit-inspections-recent.parquet
... 14:53 part-00001-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
... 14:53 part-00000-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
Regards,
Yash
You need to tell Arrow how to partition the data. This is done with the partition_cols argument of write_to_dataset. See: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html
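As a minimal sketch (the table contents and the 'year' partition column are hypothetical):
import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical table with a 'year' column to partition on
table = pa.table({
    "year": [2017, 2017, 2018],
    "permit_id": [1, 2, 3],
})

# writes a directory tree such as permits.parquet/year=2017/<part file>.parquet
pq.write_to_dataset(table, root_path="permits.parquet", partition_cols=["year"])
This splits the data by partition value rather than into a fixed number of part files, so it is not an exact equivalent of Spark's part-0000N output.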

How to insert a header file as the first line of a data file in HDFS without using getmerge (performance issue while copying to local)?

I am trying to insert header.txt as the first line of data.txt without using getmerge. getmerge copies the files to local and merges them into a third file, but I want to do this within HDFS only.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want the output in the Data.txt file only, like below:
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Please suggest whether this can be implemented in HDFS only.
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
FileSystem fs = ...
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
fs.concat(dataWithHeader, new Path[] { header, data }); // concat takes the target path plus an array of source paths
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
Thanks for your reply.
I found another way, like:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
This has the drawback that cat reads the complete data, which impacts performance.

export data to csv using hive sql

How do I export a Hive table / select query to CSV? I have tried the command below, but it creates the output as multiple files. Are there any better methods?
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT field1, field2, field3 FROM table1
Hive creates as many files as there were reducers running; this is fully parallel.
If you want a single file, then add ORDER BY to force everything onto a single reducer, or try increasing the bytes-per-reducer configuration parameter:
SELECT field1, field2, field3 FROM table1 ORDER BY field1
OR
set hive.exec.reducers.bytes.per.reducer=67108864; --increase accordingly
You can also try to merge the output files:
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=500000000;
set hive.merge.mapredfiles=true;
You can also concatenate the files with cat after getting them from Hadoop:
hadoop fs -cat /hdfspath/* > some.csv
This gives you the output in one local file.
If you want a header, then you can use sed along with Hive. See this link, which discusses various options for exporting Hive to CSV:
https://medium.com/@gchandra/best-way-to-export-hive-table-to-csv-file-326063f0f229

Working with zips in pyspark

I have n zips in a directory, and I want to extract each one of them, pull some data out of a file or two inside each zip, and add it to a graph DB. I have made a sequential Python script for this whole thing, but I am stuck at converting it to Spark. All of my zips are in an HDFS directory, and the graph DB is Neo4j. I have yet to learn about connecting Spark with Neo4j, but I am stuck at an earlier step.
I am thinking my code should be along these lines.
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]
# The function extract_and_populate_graphdb() returns 1 after doing all the work.
# This was done so that a closure can be applied to start the spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphdb).reduce(lambda a, b: a + b)
What I can't figure out is how to extract the zips and read the files within. I was able to read a zip with sc.textFile, but running take(1) on it returned hex data.
So, is it possible to read in a zip and extract the data? Or should I extract the data before putting it into HDFS? Or maybe there's some other approach to deal with this?
Updated answer:
If you'd like to use Gzip compressed files, there are parameters you can set when you configure your Spark shell or Spark job that allow you to read and write compressed data.
--conf spark.hadoop.mapred.output.compress=true \
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
--conf spark.hadoop.mapred.output.compression.type=BLOCK
Add those to the bash script you are currently using to create a shell (e.g. pyspark) and you can read and write compressed data.
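For example (a small sketch with hypothetical paths), gzip'd text can then be read transparently and written back compressed:
# .gz text files are decompressed transparently when read
lines = sc.textFile("hdfs:///data/logs/*.gz")

# you can also name the codec directly when writing, instead of using the --conf flags
lines.saveAsTextFile(
    "hdfs:///data/logs_out",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)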
Unfortunately, there is no innate support for Zip files, so you'll need to do a bit more legwork to get there.
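One common workaround (a sketch only; the archive paths and the idea of yielding every member file are assumptions about your data) is to read each zip as a single binary blob with sc.binaryFiles and unpack it on the executors with Python's zipfile module:
import io
import zipfile

# each element is (hdfs_path, raw_bytes) for one whole zip archive
zips = sc.binaryFiles("hdfs:///data/zips/*.zip")

def extract_members(path_and_bytes):
    path, raw = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        # yield (archive path, member name, member contents) for every file in the zip
        for name in zf.namelist():
            yield (path, name, zf.read(name))

members = zips.flatMap(extract_members)
From there you can parse the member contents and feed them to Neo4j, much like your sequential script does.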

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to work fine. Out of curiosity I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
Each of these contains part-0000 files. I would like to read the contents of these files for better understanding, but the cat command doesn't seem to work; it prints garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-0000 files using the Hadoop filesystem's -text option. Just go into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to set the HADOOP_CLASSPATH environment variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the corresponding class to the HADOOP_CLASSPATH variable
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
