Using pyarrow.
I have a Parquet Dataset composed of multiple parquet files. If the columns differ between the files, then I get a "ValueError: Schema in was different".
Is there a way to avoid this?
Meaning I'd like to have a Dataset composed of files which each contain different columns.
I guess pyarrow could handle this by filling in the values of the missing columns as NA whenever those columns are not present in a particular component file of the Dataset.
Thanks
Load the files into separate dataframes such as df1 and df2, then merge those dataframes by referencing THIS article.
In the article you will find two ways to merge. One is
df1.merge(df2, how='outer')
and the other uses the top-level pandas concat function:
pd.concat([df1, df2])
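For illustration, a minimal sketch of the concat approach, assuming the component files sit at the hypothetical paths below; pd.concat aligns the frames on column names and fills any column missing from one of them with NaN:
import pandas as pd
# hypothetical component files with partially overlapping columns
df1 = pd.read_parquet('part1.parquet')
df2 = pd.read_parquet('part2.parquet')
# columns absent from one file come back as NaN in the combined frame
combined = pd.concat([df1, df2], ignore_index=True)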
I just read that HDF5 allows you to seek into the data without reading the entire file into memory.
Is this seeking behavior possible in Parquet files without Java (non-pyspark solutions)? I am using Parquet because of the strong dtype support.
import h5py
f = h5py.File('my_file.hdf5', 'w')
dset = f.create_dataset('coords', data=my_ndarray)
f.close()
f = h5py.File('my_file.hdf5', 'r')
dset = f['coords']
my_array = dset[-2:]
https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata
I see here that Parquet metadata has num_row_groups: 1 (or more). But I am not sure how that helps me fetch rows [23, 42, 117, 99293184].
Parquet allows some forms of partial / random access. However, it is limited. Each parquet file is made up of one or more row groups, and each row group is made up of one or more column chunks. You can retrieve any combination of row groups & columns that you want.
The column layout is fixed by the format; however, it is up to the creator of the file how to distribute the rows into row groups. The creator could put every row in its own row group (although this would be terribly inefficient) or use a single row group for the entire file (which is quite common).
This means the ability to do partial reads is going to depend on how the file was created. If you are creating the files and you know ahead of time what sorts of reads will be used to access the data, you can use that knowledge to lay out the row groups. If you don't know the access patterns ahead of time, or you have no control over the creation of the files you are reading, then you will likely have to read the entire file into memory and filter afterwards.
Another common scenario is to store a single large dataset across many files (so that some rows are in each file). This allows for the same sort of partial read behavior that you would have from multiple row groups. However, having multiple files is sometimes easier to manage.
Both pyarrow and fastparquet should give you APIs for filtering row groups. They also expose the parquet file metadata so that you can access the metadata information yourself to implement some custom filtering mechanism.
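As a hedged sketch of what that looks like with pyarrow (the path and column name below are hypothetical, and the file is assumed to have been written with several row groups):
import pyarrow.parquet as pq
pf = pq.ParquetFile('data.parquet')
# how the writer split the rows into row groups
print(pf.metadata.num_row_groups)
# read a single row group, and only the columns you need
table = pf.read_row_group(0, columns=['coords'])
# per-row-group metadata can drive a custom filtering scheme
for i in range(pf.metadata.num_row_groups):
    print(i, pf.metadata.row_group(i).num_rows)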
I have several inhomogeneously structured files stored in a Hadoop cluster. The files contain a header line, but not all files contain the same columns.
file1.csv:
a,b,c
1,2,1
file2.csv:
a,b,d
2,2,2
What I need to do is look up all data in column a or column c and process it further (possibly with Spark SQL). So I expect something like:
a,b,c,d
1,2,1,
2,2,,2
Just doing
spark.read.format("csv").option("header", "true").load(CSV_PATH)
will miss all columns not present in the "first" file read.
How can I do this? Is a conversion to Parquet and its dataset feature a better approach?
Read the two files separately and create two dataframes. Then do an inner join between those two with a and b as the join keys.
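A minimal PySpark sketch of that approach, assuming the two files sit at the hypothetical HDFS paths below. The answer suggests an inner join; note that how='outer' instead keeps every row and fills the columns missing on either side (c or d) with null, which matches the expected output shown in the question:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# read each file with its own header so no columns are dropped
df1 = spark.read.format("csv").option("header", "true").load("hdfs:///data/file1.csv")
df2 = spark.read.format("csv").option("header", "true").load("hdfs:///data/file2.csv")
# join on the shared keys a and b
joined = df1.join(df2, on=["a", "b"], how="outer")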
I am receiving data in formats like CSV, XML, and JSON, and I want to keep all the files in the same Hive table. Is this achievable?
Hive expects all the files for one table to use the same delimiter, the same compression, and so on. So you cannot use a single Hive table on top of files with multiple formats.
The solution you may want to use is:
Create a separate table (JSON/XML/CSV) for each of the file formats.
Create a view for the UNION of the three tables created above.
This way the consumer of the data has to query only one view/object, if that's what you are looking for.
Yes, you can achieve this through a combination of different external tables.
Because different SerDes are needed, with different specifications for how to read the columns in the different files, you will need to create one external table per type of file (and table). The data from each of these external tables can then be combined into a view with UNION, as suggested by Ramesh. The view could then be used for reading from these, and you could e.g. insert the data into a managed table.
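A hedged sketch of the union-view idea, expressed here through Spark SQL purely for illustration (the equivalent Hive DDL looks the same). The table and column names are hypothetical, and the three per-format external tables are assumed to exist already:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# events_csv, events_json and events_xml are the hypothetical per-format external tables
spark.sql("""
    CREATE VIEW IF NOT EXISTS all_events AS
    SELECT id, payload FROM events_csv
    UNION ALL
    SELECT id, payload FROM events_json
    UNION ALL
    SELECT id, payload FROM events_xml
""")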
Writing Parquet data can be done with something like the following. But if I want to write to more than just one file, and moreover to output multiple S3 files so that reading a single column does not read all of the S3 data, how can this be done?
AvroParquetWriter<GenericRecord> writer =
new AvroParquetWriter<GenericRecord>(file, schema);
GenericData.Record record = new GenericRecordBuilder(schema)
.set("name", "myname")
.set("favorite_number", i)
.set("favorite_color", "mystring").build();
writer.write(record);
For example, what if I want to partition by a column value, so that all the data with favorite_color of red goes into one file and the rows with blue go into another, to minimize the cost of certain queries? There should be something similar in a Hadoop context. All I can find are things that mention Spark using something like
df.write.parquet("hdfs:///my_file", partitionBy=["created_year", "created_month"])
But I can find no equivalent to partitionBy in plain Java with Hadoop.
In a typical MapReduce application, the number of output files will be the same as the number of reduce tasks in your job. So if you want multiple output files, set the number of reduce tasks accordingly:
job.setNumReduceTasks(N);
or alternatively via the system property:
-Dmapreduce.job.reduces=N
I don't think it is possible to have one column per file with the Parquet format. The internal structure of a Parquet file is split first into row groups, and only within those row groups is the data split by column.
How could I analyze two files with different structures in Hadoop (without MapReduce)?
Ex: File 1 is a CSV that has the O2 index in the third column
File 2 is a CSV that has the O2 index in the second column
I know that I can use MapReduce to analyze them manually, but is there a more automatic way? It is not just two files; there may be more!
Thanks
You could store the two files in separate locations, build two separate Hive tables and then combine the two tables into one view...
This will most likely be fairly inefficient, and should probably be done using a custom Map/Reduce job instead.