I am kind of new to Spark and I have a requirement to read from different part folders and then merge them all together to create a single DataFrame based on a passed schema. It is something like this:
/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231
Each part folder can have multiple part files. All the files are in parquet format but the schema across two different part folders may vary either in the number of cols or in datatype. So my approach is
1 - create an empty final_df based on the schema passed
2 - Iterate over the list of part folders using the below code
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path(inp_fl_loc)
for f in fs.get(conf).listStatus(path):
    path2 = str(hadoop.fs.Path(str(f.getPath())))
    if f.isDirectory():
        path2 = path2 + "/"
        print("the inp_path is ", str(path2))
        # splitting the individual name to get the corresponding partition col name and value
        temp_path = path2.split("/")[-2]
        part_col, part_val = temp_path.split("=")[0], temp_path.split("=")[1]
    elif '_' in path2.split("/")[-1]:
        # skip marker files such as _SUCCESS
        continue
    # reading the file
    df = spark.read.format(inp_fl_frmt).option("mergeSchema", "true").load(str(path2))
    # other operation follows :-
3 - Once a particular part folder is read, comparing the schema of the read_df with that of the final_df and selecting only the req cols and if required typecasting the req col of the read_df based on the final_df schema. Note in this process i might have to type cast a sub-col within a struct type variable as well. For that i am actually expanding the struct variables into new cols, type casting them and then again converting them back in the original structure.
4 - Unioning the typecasted read_df with the final_df.
5 - Repeat steps 3-4 for all the part folders ultimately giving me the final final_df
The thing is that with large data (in one of my feeds I am reading 340 part folders totalling around 13000 files, close to 7GB in total) the job runs for a very long time (7hrs+ in the above case).
Since I am working on a shared cluster I don't have the exact details of the number of nodes and cores, and I am following the standard configuration used in our team... but it seems like that is not enough. I am trying to get those details, but I am more concerned about whether any tuning is possible from the code perspective.
A few questions that I have in mind:
Since I am using a loop to read each part folder one by one, I think the reading is happening serially rather than in parallel. Is it possible to read the different part folders in parallel? I tried a reduce operation but that isn't working properly.
After the union of the read_df with the empty df I am caching the empty_df so that in the next union operation the empty_df is not recalculated. But that doesn't seem to help performance. Shouldn't I cache the empty_df?
Any help regarding this is much appreciated.
I think there are several considerations that impact the performance of your job:
the simple Python for loop is not distributing the work evenly across nodes - you are losing the benefit of running a distributed engine like Spark by overloading only one of the workers
your folder structure already seems quite nicely partitioned, so reading the data even with varying schemas shouldn't be that big of a problem
selecting and casting the columns would make most sense only after you have read all the required files - before that, you risk building a large if-else spaghetti to handle every possible case
A simple solution: have you tried reading in all of the desired folders at once, by passing the whole directory to Spark?
In general, when you have varying schemas, a sane solution is to have a separate DataFrame for each group of files with a distinct schema, and then use a function like unionByName to combine them. You can set allowMissingColumns to True (available from Spark 3.1), so that when, for example, DataFrame A doesn't have some columns of DataFrame B, they are filled with NULL values after the union instead of an exception being thrown.
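To make the by-name union a bit more concrete, here is a plain-Python model of the behaviour (not Spark code): each "DataFrame" is just a list of dict rows. The function name union_by_name and the sample rows are made up for this sketch; in PySpark 3.1+ the corresponding call would be df_a.unionByName(df_b, allowMissingColumns=True).

```python
def union_by_name(rows_a, rows_b):
    """Model of a by-name union where missing columns become None (NULL)."""
    # Column order: columns of A first, then any extra columns of B.
    columns = []
    for row in rows_a + rows_b:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Re-emit every row with the full column set, filling gaps with None.
    return [{name: row.get(name) for name in columns} for row in rows_a + rows_b]


# Hypothetical rows from two date folders with different schemas.
day_20 = [{"id": 1, "amount": 10.0}]
day_21 = [{"id": 2, "amount": 12.5, "currency": "EUR"}]

merged = union_by_name(day_20, day_21)
for row in merged:
    print(row)
```

Casting to the target schema can then be done once on the combined result rather than inside the loop.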
Try out any of the solutions, and let me know which one worked the best - always interested what works for people :)
I am trying to understand how exactly ALTER TABLE CONCATENATE in Hive works.
I saw this link, How does Hive 'alter table <table name> concatenate' work?, but all I got from it is that for ORC files the merge happens at the stripe level.
I am looking for a detailed explanation of how CONCATENATE works. As an example, I initially had 500 small ORC files in HDFS. I ran the Hive ALTER TABLE CONCATENATE and the files merged into 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16, and finally I ended up with two large files (using Hive 0.12). So I wanted to understand:
How exactly does CONCATENATE work? Does it look at the existing number of files as well as their size? How does it determine the number of output ORC files after concatenation?
Are there any known issues with using CONCATENATE? We are planning to run it once a day in the maintenance window.
Is using CTAS an alternative to CONCATENATE, and which is better? Note that my requirement is to reduce the number of ORC files (ingested through NiFi) without compromising read performance.
Any help is appreciated and thanks in advance
Concatenated file size can be controlled with following two values:
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;
These values should be set based on your HDFS/MapR-FS block size.
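For reference, 268435456 is simply 256 MiB expressed in bytes. A small sketch of deriving the value from a block size given in MiB; the helper name mib_to_bytes is made up, and 256 MiB is only the value used above, not a recommendation for every cluster:

```python
def mib_to_bytes(mib):
    """Convert a size in MiB to bytes."""
    return mib * 1024 * 1024


# 256 MiB matches the settings quoted above; substitute your own block size.
block_size = mib_to_bytes(256)
print(f"set mapreduce.input.fileinputformat.split.minsize={block_size};")
print(f"set hive.exec.orc.default.block.size={block_size};")
```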
As commented by @leftjoin, it is indeed the case that you can get different output files for the same underlying data.
This is discussed more in the linked HCC thread but the key point is:
Concatenation depends on which files are chosen first.
Note that having files of different sizes should not be a problem in normal situations.
If you want to streamline your process, then depending on how big your data is, you may also want to batch it a bit before writing to HDFS. For instance, by setting the batch size in NiFi.
We have 5 years of data in the cluster and we are loading data every day. The data that gets added every day might contain duplicate data, partially modified data, etc.
1. How to handle duplicate data - should that be handled as part of high-level programming interfaces such as Pig or Hive, or are there other alternatives?
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
What is the best way to model the data, and using which Hadoop ecosystem components?
How to handle duplicate data
It's very hard to remove duplicates from HDFS raw data,
so I guess your approach is right: remove using pig or hive while loading those data.
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
For this case, do you mean that two records have the same key?
Then what kind of changes do you want to capture?
When you say that you need to remove duplicates and also find the delta between two records when you know the key, you should have some criteria for which data to remove in the case of partially changed data.
In both scenarios, you can get a handle on the key and write logic to remove duplicates. MapReduce seems to be a good choice, given the parallelism, performance and ability to manage data based on keys. Most of your requirements could be handled in the reducer.
See if Sqoop-merge fits your use case.
From the doc:
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.
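As a rough model of what that "flattening" means (this is not how Sqoop is implemented, just the idea in plain Python), records from the newer dataset overwrite records with the same key in the older one, which also collapses exact duplicates. The function name merge_datasets and the sample records are made up for illustration:

```python
def merge_datasets(older, newer, key):
    """Combine two datasets; for each key the record from `newer` wins."""
    merged = {record[key]: record for record in older}
    # Later assignments overwrite earlier ones, so newer records replace older.
    merged.update({record[key]: record for record in newer})
    return [merged[k] for k in sorted(merged)]


# Hypothetical daily loads: id 2 was modified, id 3 arrived twice.
old_rows = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
new_rows = [
    {"id": 2, "city": "Milan"},
    {"id": 3, "city": "Lima"},
    {"id": 3, "city": "Lima"},
]

result = merge_datasets(old_rows, new_rows, key="id")
print(result)
```

The same keyed logic is what a reducer would apply if you wrote the job by hand.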
I've got a pig job that analyzes a large number of log files and generates a relationship between a group of attributes and a bag of IDs that have those attributes. I'd like to store that relationship on HDFS, but I'd like to do so in a way that is friendly for other Hive/Pig/MapReduce jobs to operate on the data, or subsets of the data without having to ingest the full output of my pig job, as that is a significant amount of data.
For example, if the schema of my relationship is something like:
relation: {group: (attr1: long,attr2: chararray,attr3: chararray),ids: {(id: chararray)}}
I'd really like to be able to partition this data, storing it in a file structure that looks like:
/results/attr1/attr2/attr3/file(s)
where the attrX values in the path are the values from the group, and the file(s) contain only ids. This would allow me to easily subset my data for subsequent analysis without duplicating data.
Is such a thing possible, even with a custom StoreFunc? Is there a different approach that I should be taking to accomplish this goal?
I'm pretty new to Pig, so any help or general suggestions about my approach would be greatly appreciated.
Thanks in advance.
Multistore wasn't a perfect fit for what I was trying to do, but it proved a good example of how to write a custom StoreFunc that writes multiple, partitioned output files. I downloaded the Pig source code and created my own storage function that parsed the group tuple, using each of the items to build up the HDFS path, and then parsed the bag of ids, writing one ID per line into the result file.
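The path-building logic is roughly the following. This is a plain-Python sketch against the local filesystem, not the Pig StoreFunc API and not HDFS; the function name store_partitioned, the part-00000 file name and the sample relation are assumptions made for illustration:

```python
import os
import tempfile


def store_partitioned(relation, root):
    """Write each (group, ids) pair to root/attr1/attr2/attr3/part-00000."""
    written = []
    for group, ids in relation:
        # Each item of the group tuple becomes one directory level.
        directory = os.path.join(root, *[str(item) for item in group])
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, "part-00000")
        with open(path, "w") as handle:
            for record_id in ids:
                handle.write(record_id + "\n")  # one ID per line
        written.append(path)
    return written


# Hypothetical relation: (attr1, attr2, attr3) -> bag of ids.
relation = [((1, "a", "x"), ["id1", "id2"]), ((2, "b", "y"), ["id3"])]
root = tempfile.mkdtemp()
paths = store_partitioned(relation, root)
print(paths)
```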
Do MapReduce and any of the other Hadoop technologies (HBase, Hive, Pig etc.) lend themselves well to situations where you have multiple input files and where data needs to be compared between the different data sources?
In the past I've written a few MapReduce jobs using Hadoop and Pig. However, these tasks were quite simple since they involved manipulating only a single dataset. The requirements we have now dictate that we read data from multiple sources and compare various data elements against another data source. We then report on the differences. The datasets we are working with are in the region of 10 million - 60 million records and so far we haven't managed to make these jobs fast enough.
Is there a case for using MapReduce to solve such issues, or am I going down the wrong route?
Any suggestions are much appreciated.
I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with a single unique value for each row coming from the same dataset). Then move the files into the same directory, load the whole dir and treat it as a single data source in which you compare the properties of rows based on their dataset id.
Yes, you can join multiple datasets in a mapreduce job. I would recommend getting a copy of the book/ebook Hadoop In Action which addresses joining data from multiple sources.
When you have multiple input files you can use the MapReduce API method FileInputFormat.addInputPaths(), which takes the job and a comma-separated list of paths, as below:
FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");
You can also pass multiple inputs into a Mapper in hadoop using Distributed Cache, more info is described here: multiple input into a Mapper in hadoop
If I am not misunderstanding, you are trying to normalize the structured data in records coming in from several inputs and then process it. Based on this, I think you really need to look at this article, which helped me in the past. It describes how to normalize data using Hadoop/MapReduce, as below:
Step 1: Extract the column value pairs from the original data.
Step 2: Extract column-value Pairs Not In Master ID File
Step 3: Calculate the Maximum ID for Each Column in the Master File
Step 4: Calculate a New ID for the Unmatched Values
Step 5: Merge the New Ids with the Existing Master IDs
Step 6: Replace the Values in the Original Data with IDs
Using MultipleInputs we can do this.
MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
//FileOutputFormat.setOutputPath(); set output path here
If both mappers emit a common key, the records can be joined in the reducer, where you can apply the necessary logic.
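The flow of such a reduce-side join can be modelled in a few lines of plain Python. This is only a sketch of the idea, not Hadoop code: the two mapper functions, the "key,value" line formats and the sample data are all assumptions, and a dict stands in for the shuffle phase.

```python
from collections import defaultdict


def mapper1(line):
    # Hypothetical format of the first input: "key,name"
    key, name = line.split(",")
    return key, ("A", name)


def mapper2(line):
    # Hypothetical format of the second input: "key,amount"
    key, amount = line.split(",")
    return key, ("B", amount)


def reducer(key, tagged_values):
    # Pair every value from input A with every value from input B.
    left = [value for tag, value in tagged_values if tag == "A"]
    right = [value for tag, value in tagged_values if tag == "B"]
    return [(key, a, b) for a in left for b in right]


def run_join(input1, input2):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for line in input1:
        key, value = mapper1(line)
        groups[key].append(value)
    for line in input2:
        key, value = mapper2(line)
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reducer(key, groups[key]))
    return joined


joined = run_join(["1,alice", "2,bob"], ["1,30", "3,99"])
print(joined)
```

Tagging each value with its source is what lets the reducer tell the two inputs apart; keys present in only one input produce no output here (an inner join).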
Imagine you have a big file stored in HDFS which contains structured data. Now the goal is to process only a portion of the data in the file, like all the lines where the second column value is within a given range. Is it possible to launch the MR job such that HDFS only streams the relevant portion of the file, instead of streaming everything to the mappers?
The reason is that I want to speed up the job by only working on the portion that I need. One approach is probably to run an MR job to create a new file, but I am wondering if one can avoid that.
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading roughly 3.6% of the total file, or about 36GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
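Filtering in the mapper amounts to something like the following plain-Python sketch (not Hadoop code). The tab delimiter, the numeric second column and the function name filtering_mapper are assumptions for illustration:

```python
def filtering_mapper(lines, low, high, delimiter="\t"):
    """Emit (key, fields) only for rows whose second column is in [low, high]."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 2:
            continue  # malformed row, skip it
        try:
            value = float(fields[1])
        except ValueError:
            continue  # second column is not numeric, skip it
        if low <= value <= high:
            yield fields[0], fields


# Hypothetical input split handed to one mapper.
split = ["a\t5\tfoo", "b\t50\tbar", "c\tn/a\tbaz", "d\t20\tqux"]
kept = [key for key, _ in filtering_mapper(split, 10, 60)]
print(kept)
```

Every row is still read, but rows outside the range are dropped before any further work is done on them.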
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is to look at the way Hive solves this problem. The data is in "tables", which are really just metadata about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition, so if you were partitioning a table by date (a partition column named dt, with Hive's default column=value folder naming) you would have:
/mytable/dt=2011-12-01
/mytable/dt=2011-12-02
Inside each date directory would be your actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt = '2011-12-01'
Only files in /mytable/dt=2011-12-01 would be fed into the job.
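If you roll your own version of this, the pruning step boils down to picking directories by name before the job starts. A minimal plain-Python sketch, assuming the column=value directory naming that Hive uses by default; the function name prune_partitions and the paths are made up:

```python
def prune_partitions(directories, column, value):
    """Keep only the partition directories matching column=value."""
    wanted = f"{column}={value}"
    return [d for d in directories if d.rstrip("/").split("/")[-1] == wanted]


# Hypothetical partition directories of a table partitioned by dt.
dirs = ["/mytable/dt=2011-12-01", "/mytable/dt=2011-12-02"]
selected = prune_partitions(dirs, "dt", "2011-12-01")
print(selected)
```

The selected directories are then the only input paths handed to the job.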
The bottom line is that if you want functionality like this you either want to move to a higher-level language (Hive/Pig) or you need to roll your own solution.
A big part of the processing cost is parsing the data to produce key-value pairs for the Mapper. We (usually) create one Java object per value plus some container. This is costly both in terms of CPU and garbage collector pressure.
I would suggest a solution "in the middle". You can write an input format which reads the input stream and skips non-relevant data at an early stage (for example, by looking at the first few bytes of each string). As a result you will read all the data, but actually parse and pass to the Mapper only a portion of it.
Another approach I would consider is to use the RCFile format (or another columnar format), and take care that relevant and non-relevant data sit in different columns.
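The early-skip idea can be sketched in plain Python as follows (this is not the Hadoop InputFormat API). It assumes, purely for illustration, that relevance can be decided from a fixed prefix of each line; the function name read_records and the sample lines are made up:

```python
def read_records(lines, wanted_prefix):
    """Parse only lines that start with wanted_prefix; skip the rest cheaply."""
    records = []
    parsed = 0
    for line in lines:
        # Cheap check on the first few characters, before any real parsing.
        if not line.startswith(wanted_prefix):
            continue
        records.append(line.split(","))  # the "expensive" parse step
        parsed += 1
    return records, parsed


# Hypothetical input: only ERR lines are relevant.
lines = ["ERR,1,disk", "INF,2,ok", "ERR,3,net"]
records, parsed = read_records(lines, "ERR")
print(parsed, records)
```

All lines are still read, but only the relevant ones pay the cost of being split into fields.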
If the files that you want to process have some unique attribute in their filename (like an extension or a partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and "_xxx" files/dirs, but you can extend this with setInputPathFilter.
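The combined effect of the default hidden-file rule and a custom filter looks roughly like this in plain Python (a model of the logic, not the Hadoop PathFilter interface). The function names, the ".log" suffix and the file list are assumptions for illustration:

```python
import posixpath


def is_hidden(path):
    """Mirror the default rule: names starting with '_' or '.' are ignored."""
    name = posixpath.basename(path)
    return name.startswith("_") or name.startswith(".")


def accept(path, suffix):
    """Custom filter applied on top of the default one."""
    return not is_hidden(path) and path.endswith(suffix)


# Hypothetical directory listing.
files = ["/in/part-0.log", "/in/_SUCCESS", "/in/.part-0.crc", "/in/notes.txt"]
selected = [p for p in files if accept(p, ".log")]
print(selected)
```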
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this, which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. It can sometimes take more time to "do it right", especially if you're dealing with a small amount of data and the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.