How to merge part files and headers in Cloudera / Hadoop

I have a big table that is generated in Hue with the Pig Editor and contains a few hundred thousand records.
Pig returns several part files plus separate .pig_header and .pig_schema files.
I need to have all the part files and the header as one complete table in .txt format.
I can do it with the getmerge command:
-- To delete the schema file from the output folder
fs -rm /OUTPUT_folder/.pig_schema
-- To merge all the part files and the header from the output folder and save the result in a .txt file
fs -getmerge /OUTPUT_folder/* /Another_folder/Result.txt
I would like to ask if there is any way in Cloudera to get this complete table without using the getmerge command.
Maybe there is a tool or command in Cloudera that combines the part files in one step.
And then I just need to open this table with all the columns and headers laid out in a nice, ordered way. What is the best tool in Hue for this?

You could try a final GROUP ... ALL and an ORDER BY, followed by a FOREACH ... FLATTEN(); that way all the records go through a single reducer and end up in only one file.
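If you want to avoid getmerge entirely, another option (not specific to Cloudera) is to do the same concatenation yourself through the Hadoop FileSystem API, writing the .pig_header first and then each part file. A rough sketch in Scala, reusing the paths from the question and assuming the default part-* file naming:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

val conf = new Configuration()
val fs = FileSystem.get(conf)
val out = fs.create(new Path("/Another_folder/Result.txt"))

// Header first, then every part file, skipping Pig's metadata files.
val inputs = new Path("/OUTPUT_folder/.pig_header") +:
  fs.listStatus(new Path("/OUTPUT_folder"))
    .map(_.getPath)
    .filter(_.getName.startsWith("part-"))
    .sortBy(_.getName)

inputs.foreach { p =>
  val in = fs.open(p)
  IOUtils.copyBytes(in, out, conf, false) // false keeps the output stream open
  in.close()
}
out.close()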

Related

How to merge small HDFS files into one large file?

I have a number of small files generated from a Kafka stream, and I would like to merge them into one single file. The merge is based on the date, i.e. the original folder may contain a number of older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode.Append

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode(Append).text(target))
This example assumes text files, but you can just as well read any Spark-supported format, and you can use different formats for source and target as well.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
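A minimal sketch of those two options in Spark; source and target are placeholder paths, and "your_date_value" is a placeholder column name the data is assumed to contain:
val df = spark.read.json(source)   // any format that carries your date column

// All results in a single file under target:
df.repartition(1).write.mode("append").json(target)

// Or split by date, one subfolder per value of the date column:
df.write.partitionBy("your_date_value").mode("append").json(target)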
If you're working with HDFS and S3, this may also be helpful. You might even be able to use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a --groupBy option to aggregate multiple files in HDFS based on a regular expression pattern, so if the date is in the file name, you can group on that pattern.
You can develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame to the big file in append mode.
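For example, a sketch that merges only one day's files, assuming the date appears in the file names so a glob can pick them out (paths and naming are placeholders):
val day = "20130726"                                   // the date you want to merge
val merged = spark.read.text(s"/input/small-files/*$day*")

merged.coalesce(1)                                     // a single output file
  .write.mode("append")
  .text("/output/merged")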

Best way to cut fields from Hadoop files

I have several files in my Hadoop cluster, with about 2000 fields in each file. I need a quick way of cutting specific fields out of each file and creating a new file for SFTPing to a client.
E.g. I have 20 files with fields 1 to 2000.
From each file I need to pull fields 1, 6, 7, 777, 545, 345, 655, 1004, etc., in that order.
I need to do this every day and have several processes selecting different fields to use.
I'm interested in hearing other people's suggestions for the best technology to use to do that:
Use a Hive query to select all the required fields
Use MapReduce
Use Spark to run Hive or MapReduce
Something else completely different
Thanks,
Red
One approach is to use Apache Pig. The source files can be loaded into Pig, and since you know the indexes of the fields to extract, you can use those indexes to pull fields out of the Pig relations (the loaded files). Indexes start from 0 in Pig.
See the following link for more details about loading and extracting fields in Apache Pig:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#LOAD
Alternatively, you can use awk to slice your files (specify the appropriate delimiter) and pipe the commands accordingly.
The syntax may look something like:
hdfs dfs -cat <filename> | awk -F"," '{print insert_columns_here}' > output_file
I'd use Hive's "create external table as select".
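For the Spark option mentioned in the question, a minimal sketch that pulls a fixed set of positional fields and writes one delimited file for SFTP; the field numbers are the ones from the question, while the delimiter and paths are only examples:
import org.apache.spark.sql.functions.col

// Fields wanted, 1-based as in the question; Spark's _cN columns are 0-based.
val wanted = Seq(1, 6, 7, 777, 545, 345, 655, 1004).map(i => s"_c${i - 1}")

val df = spark.read
  .option("delimiter", "|")        // assumption: pipe-delimited input, adjust to yours
  .csv("/data/input/file1.txt")    // placeholder path

df.select(wanted.map(col): _*)     // keeps the requested order
  .coalesce(1)                     // a single file to sftp
  .write.option("delimiter", "|")
  .csv("/data/output/extract")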

How to add data to the same file in Apache Pig?

I am new to Pig.
I have a use case in which I have to store data into the same file again and again at regular intervals. But as I went through some tutorials and links, I didn't see anything related to this.
How should I store the data in the same file?
It's impossible. Pig uses Hadoop, and right now there is no "recommended" solution for appending to files.
The other point is that Pig produces one file only if a single mapper or a single reducer is used at the end of the whole data flow.
You can:
1. Give more info about the problem you are trying to solve.
2. Bad solution:
2.1. Process the data in your Pig script.
2.2. Load the data from the existing file.
2.3. Union the relations, where the first relation keeps the new data and the second keeps the data from the existing file.
2.4. Store the union result to a new output.
2.5. Replace the old file with the new one.
3. Good solution:
3.1. Create a folder /mydata.
3.2. Create partitions inside the folder; they can be /yyyy/MM/dd/HH if you process the data each hour.
3.3. Use globs to read the data:
/mydata/*/*/*/*/*
All files from the hour partitions will be read by Pig/Hive/MR or whatever Hadoop tool you use.
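For example, with Spark (any glob-aware Hadoop tool behaves the same way): each run writes into its own hour partition, and a reader picks everything up at once. The partition path and sample data below are only illustrative:
val newData = spark.createDataFrame(Seq(("rec1", 1), ("rec2", 2))).toDF("id", "value")
newData.write.mode("overwrite").csv("/mydata/2013/07/26/15")   // this run's hour partition

// Read every hour partition at once (one * per partition level):
val everything = spark.read.csv("/mydata/*/*/*/*")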
Make a date folder, like /abc/hadoop/20130726/.
Within it, generate output named by timestamp, like /abc/hadoop/20130726/201307265465.gz.
Then use the getmerge command to merge all the data into a single file:
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Hope it will help you.

How to work on a specific part of a CSV file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume you are referring to column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job for the input path and field positions. Consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
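A minimal sketch of such a mapper, written in Scala against the new MapReduce API; the comma delimiter and the chosen column positions (0 and 3) are only examples:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class ColumnFilterMapper extends Mapper[LongWritable, Text, Text, Text] {
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, Text]#Context): Unit = {
    // The default TextInputFormat hands the mapper one CSV line per call.
    val fields = value.toString.split(",", -1)
    if (fields.length > 3)
      context.write(new Text(fields(0)), new Text(fields(3)))   // keep only the wanted columns
  }
}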
Cascading allows you to get started very quickly with MapReduce. Its framework lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use BigTable, meaning convert your database to one big table.

Hadoop FS delimiter

I'm attempting to copy a table to a file using hadoop fs -copyToLocal. The command works swimmingly, minus the fact that all my fields are merged together. Is there a way to specify a delimiter?
I have seen the exact same issue, where copying Hive tables to the local file system glues all the fields together into one giant line and the '\n' character is not honored at the end of each row of the table.
Your best option is to use a custom SerDe (Serializer/Deserializer) to export the Hive table to CSV, as described here. You can get the source code from GitHub as well.
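If Spark is available on the cluster, a simpler route than a custom SerDe may be to read the table through Spark and write it back out with an explicit delimiter; a rough sketch (the table name, delimiter and export path are placeholders):
spark.table("mydb.mytable")
  .write.option("delimiter", "\t")
  .csv("/tmp/mytable_export")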
Are you copying from the Hive table? And are you copying directly from the warehouse directory? Please provide the full command that you are using.
