Pass parameter from spark to input format - hadoop

We have files with specific format in HDFS. We want to process data extracted from these files within spark. We have started to write an input format in order to create the RDD. This way we hope will be able to create an RDD from the whole file.
But each processing has to process a small subset of data contained in the file and I know how to extract this subset very efficiently, more than filtering a huge RDD.
How can I pass a query filter in the form of a String from my driver to my input format (the same way hive context does)?
Edit:
My file format is NetCDF which stores huge matrix in a efficient way for a multidimentionnal data, for exemple x,y,z and time. A first approach would be to extract all values from the matrix and produce a RDD line for each value. I'd like my inputformat to extract only a few subset of the matrix (maybe 0.01%) and build a small RDD to work with. The subset could be z = 0 and a small time period. I need to pass the time period to the input format which will retrieve only the values I'm interested in.
I guess Hive context does this when you pass an SQL query to the context. Only values matching the SQL query are present in the RDD, not all lines of the files.

Related

Record definition in MapReduce on different types of data-sets in Hadoop?

I want to understand the definition of Record in MapReduce Hadoop, for data types other than Text.
Typically, for Text data a record is full line terminated by new line.
Now, if we want to process an XML data, how does this data get processed , that is , how would a Record definition be on which mapper would work?
I have read that there is concept of InputFormat and RecordReader, but I didn't get it well.
Can anyone help me understand what is the relationship between InputFormat, RecordReader for various types of data-sets (other than text) and how does the data gets converted into Records on which mapper works upon?
Lets start with some basic concept.
From perspective of a file.
1. File -> collection of rows.
2. Row -> Collection of one or more columns , separated by delimiter.
2. File can be of any format=> text file, parquet file, ORC file.
Different file format, store Rows(columns) in different way , and the choice of delimiter is also different.
From Perspective of HDFS,
1. File is sequennce of bytes.
2. It has no idea of the logical structuring of file. ie Rows and columns.
3. HDFS do-sent guarantee, that a row will be contained within oe hdfs block, a row can span across two blocks.
Input Format : The code which knows how to read the file chunks from splits , and at the same time ensure if a row extends to other split, it should be considered part of the first split.
Record Reader : As you read a Split , some code(Record Reader) should be able to understand how to interpret a row from the bytes read from HDFS.
for more info :
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/

How to output multiple s3 files in Parquet

Writing parquet data can be done with something like the following. But if I'm trying to write to more than just one file and moreover wanting to output to multiple s3 files so that reading a single column does not read all s3 data how can this be done?
AvroParquetWriter<GenericRecord> writer =
new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
.set("name", "myname")
.set("favorite_number", i)
.set("favorite_color", "mystring").build();
writer.write(record);
For example what if I want to partition by a column value so that all the data with favorite_color of red goes in one file and those with blue in another file to minimize the cost of certain queries. There should be something similar in a Hadoop context. All I can find are things that mention Spark using something like
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical Map-Reduce application, the number of output files will be the same as the number of reduces in your job. So if you want multiple output files, set the number of reduces accordingly:
job.setNumReduceTasks(N);
or alternatively via the system property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of Parquet files is initially split by row groups, and only these row groups are then split by columns.

How do I store data in multiple, partitioned files on HDFS using Pig

I've got a pig job that analyzes a large number of log files and generates a relationship between a group of attributes and a bag of IDs that have those attributes. I'd like to store that relationship on HDFS, but I'd like to do so in a way that is friendly for other Hive/Pig/MapReduce jobs to operate on the data, or subsets of the data without having to ingest the full output of my pig job, as that is a significant amount of data.
For example, if the schema of my relationship is something like:
relation: {group: (attr1: long,attr2: chararray,attr3: chararray),ids: {(id: chararray)}}
I'd really like to be able to partition this data, storing it in a file structure that looks like:
/results/attr1/attr2/attr3/file(s)
where the attrX values in the path are the values from the group, and the file(s) contain only ids. This would allow me to easily subset my data for subsequent analysis without duplicating data.
Is such a thing possible, even with a custom StoreFunc? Is there a different approach that I should be taking to accomplish this goal?
I'm pretty new to Pig, so any help or general suggestions about my approach would be greatly appreciated.
Thanks in advance.
Multistore wasn't a perfect fit for what I was trying to do, but it proved a good example of how to write a custom StoreFunc that writes multiple, partitioned output files. I downloaded the Pig source code and created my own storage function that parsed the group tuple, using each of the items to build up the HDFS path, and then parsed the bag of ids, writing one ID per line into the result file.

Not able to store the data into hbase using pig when I dont know the number of columns in a file

I have a text file with N number of columns (Not sure, in the future I may have N+1).
Example:
1|A
2|B|C
3|D|E|F
I want to store above data into hbase using pig without writing UDF. How can I store this kind of data without knowing the number of columns in a file?
Put it in a map and then you can use cf1:* where cf1 is your column family

how to perform ETL in map/reduce

how do we design mapper/reducer if I have to transform a text file line-by-line into another text file.
I wrote a simple map/reduce programs which did a small transformation but the requirement is a bit more elaborate below are the details:
the file is usually structured like this - the first row contains a comma separated list of column names. Second and the rest of the rows specify values against the columns
In some rows the trailing column values might be missing ex: if there are 15 columns then values might be specified only for the first 10 columns.
I have about 5 input files which I need to transform and aggregate into one file. the transformations are specific to each of the 5 input files.
How do I pass contextual information like file name to the mapper/reducer program?
Transformations are specific to columns so how do I remember the columns mentioned in the first row and then correlate and transform values in rows?
Split file into lines, transform (map) each line in parallel, join (reduce) the resulting lines into one file?
You can not rely on the column info in the first row. If your file is larger than a HDFS block, your file will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will receive the first row with column info and the rest won't.
I would suggest passing file specific meta data in separate file and distribute it as side data. Your mapper or reducer tasks could read the meta data file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between all these, I think you have all the context information you are referring to and you can do file specific transformation. Even though the transformation logic is different for different files, the mapper output needs to have the same format.
If you using reducer, you could set the number of reducers to one, to force all output to aggregate to one file.

Resources