I have several inhomogeneous structured files stored in a Hadoop cluster. Each file contains a header line, but not all files contain the same columns.
file1.csv:
a,b,c
1,2,1
file2.csv:
a,b,d
2,2,2
What I need to do is look up all the data in column a or column c and process it further (possibly with Spark SQL). So I expect something like:
a,b,c,d
1,2,1,
2,2,,2
Just doing
spark.read.format("csv").option("header", "true").load(CSV_PATH)
will miss all columns not present in the "first" file read.
How can I do this? Is a conversion to Parquet and its dataset feature a better approach?
Read the two files separately and create two dataframes, then do a full outer join between them with a and b as the join keys.
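A minimal PySpark sketch of that suggestion, with hypothetical paths:

df1 = spark.read.format("csv").option("header", "true").load("hdfs:///data/file1.csv")
df2 = spark.read.format("csv").option("header", "true").load("hdfs:///data/file2.csv")
# full outer join on the shared columns keeps the rows from both files
# and leaves null wherever a file did not have the column
combined = df1.join(df2, on=["a", "b"], how="full_outer")
combined.select("a", "b", "c", "d").show()

With more than two files it may be simpler to chain df1.unionByName(df2, allowMissingColumns=True) (Spark 3.1+), and if you do convert to Parquet, spark.read.option("mergeSchema", "true").parquet(path) will merge the per-file schemas for you.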
Related
Using pyarrow.
I have a Parquet dataset composed of multiple Parquet files. If the columns differ between the files, I get a "ValueError: Schema in was different".
Is there a way to avoid this?
Meaning I'd like to have a dataset composed of files that each contain different columns.
I guess pyarrow could handle this by filling in the missing columns with NA values when they are not present in a particular component file of the dataset.
Thanks
Load the files into separate dataframes, say df1 and df2, then merge those dataframes as described in THIS article.
In the article you will find two ways to merge. One is the DataFrame merge method:
df1.merge(df2, how='outer')
and the other uses pandas concat:
pd.concat([df1, df2])
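A hedged sketch of the concat route (the file names are hypothetical); pandas pads the columns missing from either file with NaN:

import pandas as pd

df1 = pd.read_parquet("part1.parquet")  # columns a, b, c
df2 = pd.read_parquet("part2.parquet")  # columns a, b, d
combined = pd.concat([df1, df2], ignore_index=True)  # missing columns become NaN

If you would rather stay at the Arrow level, pyarrow.concat_tables(tables, promote=True) should behave similarly in reasonably recent pyarrow versions, unifying the schemas and padding missing columns with nulls.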
Writing Parquet data can be done with something like the following. But if I want to write to more than one file, and in particular output to multiple S3 files so that reading a single column does not read all the S3 data, how can this be done?
// file (an org.apache.hadoop.fs.Path), schema and i are defined elsewhere
AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("name", "myname")
    .set("favorite_number", i)
    .set("favorite_color", "mystring")
    .build();
writer.write(record);
writer.close();
For example, what if I want to partition by a column value, so that all the data with a favorite_color of red goes in one file and the data with blue goes in another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are things that mention Spark, using something like
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical MapReduce application, the number of output files will be the same as the number of reduce tasks in your job. So if you want multiple output files, set the number of reduce tasks accordingly:
job.setNumReduceTasks(N);
or alternatively from the command line via the configuration property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. Internally, a Parquet file is first split into row groups, and only within each row group is the data then split into column chunks.
How can I analyze two files with different structures in Hadoop (without writing MapReduce)?
For example: File 1 is a CSV with the O2 index in the third column.
File 2 is a CSV with the O2 index in the second column.
I know that I can use MapReduce to analyze them manually, but is there a more automatic way? It is not just two files; there may be more.
Thanks
You could store the two files in separate locations, build two separate Hive tables and then combine the two tables into one view...
This will most likely be fairly inefficient and should probably be done using a custom MapReduce job instead.
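A rough sketch of the table-plus-view idea (the table names, column names and locations are hypothetical), written here as Spark SQL with Hive support; the same DDL works from the Hive CLI:

# assumes a SparkSession built with enableHiveSupport()
spark.sql("""
CREATE EXTERNAL TABLE file1 (id STRING, name STRING, o2 DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/file1'
""")
spark.sql("""
CREATE EXTERNAL TABLE file2 (id STRING, o2 DOUBLE, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/file2'
""")
spark.sql("""
CREATE VIEW o2_combined AS
SELECT id, o2 FROM file1
UNION ALL
SELECT id, o2 FROM file2
""")

Queries against o2_combined then see a single O2 column regardless of where it sits in the underlying files.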
I have a text file with N columns (not sure exactly; in the future it may have N+1).
Example:
1|A
2|B|C
3|D|E|F
I want to store the above data into HBase using Pig without writing a UDF. How can I store this kind of data without knowing the number of columns in the file?
Put the values in a map, and then you can use cf1:* (where cf1 is your column family) when storing with HBaseStorage.
How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names; the second and subsequent rows specify values for those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns then values might be specified only for the first 10.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in the other rows?
Should I split the file into lines, transform (map) each line in parallel, and join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will see the first row with the column info; the rest won't.
I would suggest putting the file-specific metadata in a separate file and distributing it as side data (for example via the distributed cache). Your mapper or reducer tasks could then read that metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between these, I think you have all the context information you are referring to, and you can do file-specific transformations. Even though the transformation logic differs between files, the mapper output needs to have the same format.
If you are using a reducer, you could set the number of reducers to one to force all the output to be aggregated into a single file.
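If you end up going through Hadoop Streaming instead of the Java API, a similar file-name trick is available from Python: Streaming exposes job configuration values to the mapper as environment variables (mapreduce_map_input_file on newer Hadoop versions, map_input_file on older ones). A rough mapper sketch, with the per-file transformations left as hypothetical placeholders:

#!/usr/bin/env python
# Streaming mapper: choose a transformation based on the file this mapper is reading.
import os
import sys

input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", ""))

def transform_file1(fields):
    return fields  # hypothetical file-specific logic

def transform_file2(fields):
    return fields  # hypothetical file-specific logic

transform = transform_file1 if "file1" in input_file else transform_file2

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    print(",".join(transform(fields)))

The column metadata would come from the side-data file suggested above rather than from the first row, and keeping the output format identical across files plus a single reducer gives you the one aggregated output file.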