When writing a Vertica transform function, is there a way to get an identifier for the partition - vertica

I am attempting to write a user-defined transform function in Vertica. Each execution of processPartition() within my UDTF implementation ends up generating an output file on disk. I would like some way to uniquely identify those files based upon which of the partitions Vertica has divided my operation into (e.g. 1 of 4, 2 of 4, 3 of 4, 4 of 4). But it does not appear that I have access to any sort of identifier beyond the name of the executing node (from ServerInterface).
Does there exist some value that I can use to uniquely identify the partitions?

Can you give more insight into what you plan to do in processPartition()? Maybe there are other ways to get to the same result!
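The thread doesn't surface a partition index, but one possible workaround (not from the thread) is to derive a unique name from what is available: the node name from ServerInterface plus a value that is unique per processPartition() call. A minimal Java sketch; every identifier below is hypothetical and not part of the Vertica SDK:

import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

// Builds output file names that are unique across nodes and across
// processPartition() calls on the same node, without needing a partition
// index from Vertica.
public class PartitionFileNamer {
    // Counter shared by all processPartition() calls in this JVM.
    private static final AtomicInteger CALL_COUNTER = new AtomicInteger();

    // nodeName is assumed to come from ServerInterface, as in the question;
    // the UUID guards against collisions if the library is reloaded.
    public static String nextFileName(String nodeName) {
        return String.format("%s-%03d-%s.out",
                nodeName, CALL_COUNTER.incrementAndGet(), UUID.randomUUID());
    }
}

The trade-off is that such names are unique but not of the "2 of 4" form asked about, since the total number of partitions is not visible from inside a single call.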

Related

Map Reduce Keep input ordering

I tried to implement an application using Hadoop which processes text files. The problem is that I cannot keep the ordering of the input text. Is there any way to choose the hash function? This problem could easily be solved by assigning a partition of the input to each mapper and then sending the partition to the reducers. Is this possible with Hadoop?
The base idea of MapReduce is that the order in which things are done is irrelevant.
So you cannot (and do not need to) control the order in which:
the input records go through the mappers.
the key and related values go through the reducers.
The only thing you can control is the order in which the values are placed in the iterator that is made available in the reducer.
This is done using a construct called "secondary sort".
A quick Google search for this term turns up several places where you can continue reading.
I like this blog post: link
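Not part of the original answer, but a minimal sketch of the composite-key half of a secondary sort, assuming the org.apache.hadoop.mapreduce API; the class and field names are made up for illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: the natural key plus the record's original byte offset, so
// that values can be replayed in input order inside each reduce group.
public class OrderedKey implements WritableComparable<OrderedKey> {
    private final Text naturalKey = new Text();
    private final LongWritable offset = new LongWritable();

    public void set(String key, long byteOffset) {
        naturalKey.set(key);
        offset.set(byteOffset);
    }

    public Text getNaturalKey() {
        return naturalKey;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        offset.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        offset.readFields(in);
    }

    // Sort by natural key first, then by offset. The partitioner and the
    // grouping comparator registered on the job must look at the natural key
    // only, so that all offsets for one key land in the same reduce group.
    @Override
    public int compareTo(OrderedKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : offset.compareTo(other.offset);
    }
}

You would still register a custom Partitioner and a grouping comparator on the natural key (job.setPartitionerClass(...), job.setGroupingComparatorClass(...)); the write-ups that come up for "secondary sort" walk through those pieces.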

Using Hadoop to process data from multiple datasources

Do MapReduce and any of the other Hadoop technologies (HBase, Hive, Pig, etc.) lend themselves well to situations where you have multiple input files and data needs to be compared between the different data sources?
In the past I've written a few MapReduce jobs using Hadoop and Pig. However, those tasks were quite simple since they involved manipulating only a single dataset. The requirements we have now dictate that we read data from multiple sources and compare various data elements against another data source, then report on the differences. The datasets we are working with are in the region of 10 million to 60 million records, and so far we haven't managed to make these jobs fast enough.
Is there a case for using MapReduce to solve such issues, or am I going down the wrong route?
Any suggestions are much appreciated.
I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with a single unique value for each row coming from the same dataset). Then move the files into the same directory, load the whole dir and treat it as a single data source in which you compare the properties of rows based on their dataset id.
Yes, you can join multiple datasets in a mapreduce job. I would recommend getting a copy of the book/ebook Hadoop In Action which addresses joining data from multiple sources.
When you have multiple input files you can use the MapReduce API FileInputFormat.addInputPaths(), which can take a comma-separated list of multiple files, as below:
FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");
You can also pass multiple inputs into a Mapper in Hadoop using the Distributed Cache; more info is described here: multiple input into a Mapper in hadoop
If I am not misunderstanding, you are trying to normalize structured data in records coming in from several inputs and then process it. Based on this, I think you really need to look at this article, which helped me in the past. It describes how to normalize data using Hadoop/MapReduce with the following steps (a small sketch of step 1 follows the list):
Step 1: Extract the column-value pairs from the original data.
Step 2: Extract the column-value pairs not in the master ID file.
Step 3: Calculate the maximum ID for each column in the master file.
Step 4: Calculate a new ID for the unmatched values.
Step 5: Merge the new IDs with the existing master IDs.
Step 6: Replace the values in the original data with IDs.
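As a small illustration of step 1 only (my sketch, not from the article), here is a mapper that explodes each delimited record into column:value pairs; the comma delimiter and the use of the column index as the column name are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Step 1 only: turn each comma-separated record into "columnIndex:value"
// pairs, which the later steps match against the master ID file.
public class ColumnValueMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        for (int i = 0; i < fields.length; i++) {
            pair.set(i + ":" + fields[i].trim());
            context.write(pair, NullWritable.get());
        }
    }
}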
Using MultipleInputs we can do this.
MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
FileOutputFormat.setOutputPath(job, outputPath); // set the output path here
If both mappers emit a common key, the records can be joined in the reducer, where you can apply the necessary logic.
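To make that concrete, here is a minimal reduce-side-join sketch under that setup (mine, not from the answer). The mappers tag each record with its source so the reducer can tell the two datasets apart; the comma-separated layout with the join key in the first field is an assumption.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // Mapper for the first dataset; a second mapper would tag with "B|".
    public static class DatasetAMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",", 2);
            outKey.set(fields[0]);                                  // join key
            outValue.set("A|" + (fields.length > 1 ? fields[1] : ""));
            context.write(outKey, outValue);
        }
    }

    // Separates the tagged values and compares the two sides per key.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder fromA = new StringBuilder();
            StringBuilder fromB = new StringBuilder();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("A|")) {
                    fromA.append(s.substring(2)).append(';');
                } else {
                    fromB.append(s.substring(2)).append(';');
                }
            }
            // The actual comparison/difference logic would go here.
            context.write(key, new Text(fromA + " <-> " + fromB));
        }
    }
}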

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in HDFS which contains structured data. Now the goal is to process only a portion of the data in the file, for example all the lines where the second column value is between so and so. Is it possible to launch the MR job such that HDFS only streams the relevant portion of the file, rather than streaming everything to the mappers?
The reason is that I want to speed up the job by only working on the portion that I need. One approach is probably to run an MR job to create a new file, but I am wondering if that can be avoided.
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
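As a small illustration of the filter-in-the-mapper idea mentioned above (my sketch, not from the original answer); the tab delimiter, the numeric second column, and the bounds are all assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Passes through only the rows whose second column falls inside a range;
// everything else is dropped before any further processing.
public class RangeFilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private static final double LOW = 100.0;   // illustrative bounds
    private static final double HIGH = 200.0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] cols = value.toString().split("\t");
        if (cols.length < 2) {
            return; // malformed line, skip it
        }
        try {
            double v = Double.parseDouble(cols[1]);
            if (v >= LOW && v <= HIGH) {
                context.write(key, value); // only relevant rows go downstream
            }
        } catch (NumberFormatException e) {
            // non-numeric second column, skip it
        }
    }
}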
One answer is looking at the way that hive solves this problem. The data is in "tables" which are really just meta data about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside the date directory would be your actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
The bottom line is that if you want functionality like this, you either want to move to a higher-level language (Hive/Pig) or you need to roll your own solution.
A big part of the processing cost is parsing the data to produce key-value pairs for the Mapper. We usually create one Java object per value, plus some container, which is costly both in CPU and in garbage-collector pressure.
I would suggest a solution "in the middle". You can write an input format which reads the input stream and skips non-relevant data at an early stage (for example, by looking at the first few bytes of each line). As a result you will still read all the data, but only parse and pass a portion of it to the Mapper.
Another approach I would consider is to use the RCFile format (or another columnar format), and take care that the relevant and non-relevant data sit in different columns.
If the files that you want to process have some unique attribute in their filename (like an extension or a partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and "_xxx" files/dirs, but you can extend this with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this, since it breaks the "one block per mapper" paradigm, but sometimes that is acceptable. It can sometimes take more effort to "do it right", especially if you're dealing with a small amount of data and the time to re-architect and/or re-load into HBase would eclipse the extra time required to run your job sub-optimally.
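A short sketch of that setInputPathFilter idea (not from the original answer), assuming the org.apache.hadoop.fs.PathFilter interface and the new-API FileInputFormat; the ".quotes" extension is made up, and the extra extension-less check is there because on some Hadoop versions the filter is also applied to the input directories themselves:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accepts files ending in ".quotes" and anything without an extension
// (so that input directories themselves still pass the filter).
public class QuotesOnlyFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        String name = path.getName();
        return name.endsWith(".quotes") || !name.contains(".");
    }
}

It is registered on the job with FileInputFormat.setInputPathFilter(job, QuotesOnlyFilter.class).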

Generate multiple outputs with Hadoop Pig

I've got this file containing a list of data in Hadoop. I've built a simple Pig script which analyzes the file by the id number, and so on...
The last step I'm looking for is this: I'd like to create (store) a file for each unique id number. So this should depend on a group step... however, I haven't understood whether this is possible (maybe there is a custom store module?).
Any idea?
Thanks
Daniele
While keeping in mind what frail said, MultiStorage in PiggyBank seems to be what you are looking for.
To get an output (a file or anything else) you need to assign data to a variable; that's how it works with STORE. If the ids are limited and finite, you can FILTER them one by one and then STORE them. (I always do that for action types, of which there are about 20-25.)
But if you really need a file per unique id, then make two files: one with the whole data grouped by id, and one with just the unique ids. Then try generating one (or more, if you have too many) Pig scripts that FILTER by that id. But it's a bad solution: assuming you group 10 ids per Pig script, you would have (unique id count / 10) Pig scripts to run.
Beware that HDFS isn't good at handling too many small files.
Edit:
A better solution would be to GROUP and SORT by unique id into one big file. Then, since it's sorted, you can easily split the contents with a third-party script.

Load Multiple files in same map function in Hadoop

I have two data sets: one is historical quote data and the other is historical trade data. The data is split per symbol, per day. My question is how to load two files for the same symbol in the same map function; for example, I want to process the 2011-01-27 IBM quotes and the same-date IBM trade file simultaneously. How do I configure Hadoop to do this? I have read about MultipleFileReader, but it does not give us the freedom to load specific files together.
Thanks
Ankush
Output a <$date-$symbol, $data> pair in your map function, where $date-$symbol is a compound key with the date and symbol concatenated together, and where $data is either quote data or trade data. Hadoop will group together all pairs that share the same key and you can process the data in the reduce() function.
The reducer will need some logic to distinguish between quote data or trade data, depending on how you're serializing that data.
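A small sketch of that compound-key approach (not part of the original answer); the comma-separated layout, the field positions, and the "Q|"/"T|" tags are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <"2011-01-27-IBM", "Q|<line>"> for quote files; an equivalent mapper
// for trade files would use a "T|" prefix so the reducer can tell them apart.
public class QuoteMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length < 2) {
            return; // skip malformed lines
        }
        outKey.set(f[0] + "-" + f[1]);  // assumed layout: date, symbol, ...
        outValue.set("Q|" + value);     // tag the record with its source
        context.write(outKey, outValue);
    }
}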
While you can do it the way described above, you can also create a text file with the names of the files from both datasets and use it as the input to the job. You can build such a file automatically by scanning the HDFS tree. The main drawback of this solution is that you will not enjoy data locality, so most of the data will travel over the network.
