File formats that can be read using Pig - hadoop

What kind of file formats can be read using Pig?
How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command it makes a directory and stores the file as part-m-00000; how can I change the name of the file and overwrite the directory?

What kind of file formats can be read using Pig? How can I store them in different formats?
There are a few built-in loading and storing methods, but they are limited:
BinStorage - "binary" storage
PigStorage - loads and stores data that is delimited by something (such as tab or comma)
TextLoader - loads data line by line (i.e., delimited by the newline character)
piggybank is a library of community-contributed user-defined functions, and it has a number of loading and storing methods, including an XML loader but not an XML storer.
Say we have a CSV file and I want to store it as an MXL file; how can this be done?
I assume you mean XML here... Storing data as XML is a bit rough in Hadoop because it splits files on a reducer basis, so how do you know where to put the root tag? This likely calls for some sort of post-processing to produce well-formed XML.
One thing you can do is to write a UDF that converts your columns into an XML string:
B = FOREACH A GENERATE customudfs.DataToXML(col1, col2, col3);
For example, say col1, col2, col3 are "foo", 37, "lemons", respectively. Your UDF can output the string "<item><name>foo</name><num>37</num><fruit>lemons</fruit></item>".
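A rough sketch of what such a UDF could look like in Java (the customudfs.DataToXML name comes from the FOREACH line above; the tag names just mirror this example):
// Hypothetical Pig EvalFunc that wraps three columns in XML elements.
package customudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DataToXML extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 3) {
            return null;
        }
        // e.g. ("foo", 37, "lemons") -> <item><name>foo</name><num>37</num><fruit>lemons</fruit></item>
        return "<item><name>" + input.get(0) + "</name>"
                + "<num>" + input.get(1) + "</num>"
                + "<fruit>" + input.get(2) + "</fruit></item>";
    }
}
You would REGISTER the jar containing this class in the Pig script before the FOREACH.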
Whenever we use the STORE command it makes a directory and stores the file as part-m-00000; how can I change the name of the file and overwrite the directory?
You can't change the name of the output file to something other than part-m-00000. That's just how Hadoop works. If you want to change it, rename it after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the Pig script and then executes this command.
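If you would rather do the rename from Java than from a shell script, the Hadoop FileSystem API can perform the same move (a minimal sketch; the paths simply mirror the example command above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenamePigOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Same effect as: hadoop fs -mv output/part-m-00000 newoutput/myoutputfile
        fs.mkdirs(new Path("newoutput"));
        fs.rename(new Path("output/part-m-00000"), new Path("newoutput/myoutputfile"));
        fs.close();
    }
}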

Related

How to merge HDFS small files into one large file?

I have a number of small files generated from a Kafka stream, so I would like to merge the small files into one single file. The merge is based on the date, i.e. the original folder may contain a number of older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode.Append

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode(Append).text(target))
This example assumes the text file format, but you can just as well read any Spark-supported format, and you can use different formats for the source and target as well.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value").
If you're working between HDFS and S3, this may also be helpful. You might actually even use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a specific option to aggregate multiple files in HDFS using a --groupBy option based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern.
You can develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame out to the big file in append mode.
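As a rough sketch of such an application in Java (the HDFS paths and the date-based glob pattern are assumptions for illustration, not taken from the question):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MergeSmallFilesByDate {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("MergeSmallFilesByDate").getOrCreate();
        // Read only the small files belonging to one date (glob pattern is illustrative).
        Dataset<Row> df = spark.read().text("hdfs:///kafka-output/2021-01-01_*");
        // Collapse to a single output file and append it to the target folder.
        df.coalesce(1).write().mode(SaveMode.Append).text("hdfs:///merged/2021-01-01");
        spark.stop();
    }
}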

Pig: Write record types in a single file to multiple outputs

I have the following data in a single file
"HD",003498,"20160913:17:04:10","D3ZYE",1
"EH","XXX-1985977-1",1,"01","20151215","20151215","20151229","20151215","2304",,,"36-126481000",1340.74,61808.00,1126.62,0.00,214.12,0.00,0.00,0.00,"30","20151229","00653845",,,"PARTS","001","ABI","20151215","Y","Y","N","36-126481000",
I would like to use Pig to read this single file and then segregate it into different files based on the first column.
In the same light, I was looking for a way to first treat each record as the following construct:
recTypCd, recordData
And then later on just treat recordData as a CSV record.
In this regard, after I store the records of each record type in a separate file, I can simply load each file into its own external Hive table using a CSV SerDe.
You can use SPLIT in Pig based on your condition, e.g.:
lines = LOAD 'input_file' USING PigStorage(',');
-- $0 is the first field; with this data it still carries the surrounding double quotes
SPLIT lines INTO hd1 IF $0 == '"HD"', hd2 IF $0 == '"EH"';
STORE hd1 INTO 'op1';
STORE hd2 INTO 'op2';

How to output multiple s3 files in Parquet

Writing Parquet data can be done with something like the following. But if I want to write to more than just one file, and moreover want to output to multiple S3 files so that reading a single column does not read all the S3 data, how can this be done?
AvroParquetWriter<GenericRecord> writer =
new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
.set("name", "myname")
.set("favorite_number", i)
.set("favorite_color", "mystring").build();
writer.write(record);
For example, what if I want to partition by a column value so that all the data with a favorite_color of red goes in one file and those with blue in another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are things that mention Spark, using something like:
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical MapReduce application, the number of output files will be the same as the number of reduce tasks in your job. So if you want multiple output files, set the number of reduce tasks accordingly:
job.setNumReduceTasks(N);
or alternatively via the system property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of Parquet files is first split into row groups, and only these row groups are then split by columns.
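If the goal is only the partitionBy-style layout from the question, one workaround in plain Java (a sketch under assumed paths and schema, not a built-in feature of the Parquet API) is to keep one AvroParquetWriter per distinct favorite_color value and route each record to the matching writer:
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ColorPartitionedWriter {
    private final Map<String, ParquetWriter<GenericRecord>> writers = new HashMap<>();
    private final Schema schema;

    public ColorPartitionedWriter(Schema schema) {
        this.schema = schema;
    }

    public void write(GenericRecord record) throws IOException {
        String color = record.get("favorite_color").toString();
        ParquetWriter<GenericRecord> writer = writers.get(color);
        if (writer == null) {
            // One file per partition value, e.g. s3a://my-bucket/favorite_color=red/data.parquet (path is illustrative).
            Path path = new Path("s3a://my-bucket/favorite_color=" + color + "/data.parquet");
            writer = AvroParquetWriter.<GenericRecord>builder(path).withSchema(schema).build();
            writers.put(color, writer);
        }
        writer.write(record);
    }

    public void close() throws IOException {
        for (ParquetWriter<GenericRecord> w : writers.values()) {
            w.close();
        }
    }
}
With the directory names in the favorite_color=value form, query engines such as Hive or Spark can usually prune partitions, so a query that filters on favorite_color only reads the matching files.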

How to work on a specific part of a CSV file uploaded into HDFS?

How to work on a specific part of a CSV file uploaded into HDFS?
I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files for each table and stored in HDFS. I presume that you are referring to the column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv
Now you can configure the job for the input path and field occurrences. You may consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
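As a minimal sketch (the comma delimiter and the column indices are hypothetical), a mapper built on the default TextInputFormat could split each CSV line and emit only the fields you care about:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CsvColumnMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One CSV line per call; split on the delimiter and pick the wanted columns.
        String[] fields = value.toString().split(",");
        if (fields.length > 3) {
            // Emit column 0 as the key and column 3 as the value (indices are illustrative).
            context.write(new Text(fields[0]), new Text(fields[3]));
        }
    }
}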
Cascading allows you to get started very quickly with MapReduce. It has a framework that allows you to set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use a BigTable approach, meaning convert your database into one big table.

JMeter export graph to CSV exports to multiple columns, write results to file exports to one column

I was annoyed with JMeter writing data results to CSV as one column. So when the CSV file was opened in Excel all values would be added to one single column (which requires annoying manual copy/paste work to get to graphs). I then noticed that if I choose Export to CSV on a Listener graph, it actually exports the CSV file as separate columns in Excel, which is great.
Is it possible to have the "Write results to file" write data into separate columns by default as it does with the graph "Export to CSV"? Thanks!
I suppose you have at least 2 options:
Simple Data Writer, which is the one you are using at the moment, as I understand.
In the jmeter.properties file (JMETER_HOME\bin\jmeter.properties), uncomment and set jmeter.save.saveservice.default_delimiter=; to use ';' instead of ',' (the default) as the separator in the CSV files you create with "Write results to file" - this will separate the values into different columns when opened in Excel.
# For use with Comma-separated value (CSV) files or other formats
# where the fields' values are separated by specified delimiters.
jmeter.save.saveservice.default_delimiter=;
Flexible File Writer from the jmeter-plugins pack implements the same functionality and looks to be more customizable.
The idea is the same as above - use ";" to separate the values written into the file:
Write file header: endTimeMillis;responseTime;latency;sentBytes;receivedBytes;isSuccessful;responseCode
Record each sample as: endTimeMillis|;|responseTime|;|latency|;|sentBytes|;|receivedBytes|;|isSuccessful|;|responseCode|\r\n
Hope this helps.
