I have data for one column in my CSV file as:
`column1`
row1 : {'name':'Steve Jobs','location':'America','status':'none'}
row2 : {'name':'Mark','location':'America','status':'present'}
row3 : {'name':'Elan','location':'Canada','status':'present'}
I want the output for that column to be:
`name` `location` `status`
Steve Jobs America none
Mark America present
Elan Canada present
But sometimes a row value looks like {'name':'Steve Jobs','location':'America','status':'none'},{'name':'Mark','location':'America','status':'present'}
Please help!
You have to use tMap and tExtractDelimitedFields components.
Flow:
Below is the step-by-step explanation.
1. Original data - row1 : {'name':'Steve Jobs','location':'America','status':'none'}
2. Substring the value inside the braces using the function below (note that substring's end index is exclusive, so indexOf("}") is all you need):
row1.Column0.substring(row1.Column0.indexOf("{")+1, row1.Column0.indexOf("}"))
Now the result is - 'name':'Steve Jobs','location':'America','status':'none'
3. Extract the single column into multiple columns using tExtractDelimitedFields. Since the fields are separated by commas, set the delimiter to a comma. We have 3 fields in the data, so create 3 fields in the component schema. Below is a screenshot of the tExtractDelimitedFields component configuration.
Now the result is,
name location status
'name':'Steve Jobs' 'location':'America' 'status':'none'
'name':'Mark' 'location':'America' 'status':'present'
'name':'Elan' 'location':'Canada' 'status':'present'
Then, using one more tMap, strip the column names and single quotes from the data:
row2.name.replaceAll("'name':", "").replaceAll("'", "")
row2.location.replaceAll("'location':", "").replaceAll("'", "")
row2.status.replaceAll("'status':", "").replaceAll("'", "")
Your final result is below:
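If you want to sanity-check the string handling outside Talend, here is a rough plain-Python sketch of the same steps (illustration only; the file name and column name are assumptions, and it assumes exactly one {...} object per row):
import csv

# Rough plain-Python sketch of the same parse (not Talend).
# Assumes each row's column1 holds exactly one {...} object.
with open("input.csv", newline="") as f:           # hypothetical file name
    for row in csv.DictReader(f):
        raw = row["column1"]
        inner = raw[raw.index("{") + 1 : raw.index("}")]          # step 2: keep what is inside the braces
        fields = inner.split(",")                                  # step 3: split on the comma delimiter
        values = [field.split(":", 1)[1].strip("'") for field in fields]  # final tMap step: drop keys and quotes
        print("\t".join(values))                                   # name, location, status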
I have an input file with one column as ID and another column as a counter value. Based on the counter value, I am filtering data from the input file to the output file. I made a task in DMExpress that checks the counter and the ID. I have 10 rows for each ID in the input file. If the counter value for an ID is 3, then I extract the top 3 rows for that ID and then check the next ID. While running this task in Hadoop, Hadoop takes the first 3 records of several IDs and creates a new file (when the desired size is reached) for the other IDs.
Now, when Hadoop writes the records in file 0, it extracts 3 records for ID X; but when it writes the next part of the output (file 1), it writes one more record for ID X, continuing from where file 0 ended (it is the 4th record for ID X). This in turn increases my record count in the output.
Example: these are the records in the input file.
..more records..
1|XXXX|3|NNNNNNN
2|XXXX|3|MMMMMMM
3|XXXX|3|AAAAAAA
4|XXXX|3|BBBBBBB
5|XXXX|3|NNNDDDD
6|YYYY|3|QQQQQQQ
7|YYYY|3|4444444
8|YYYY|3|1111111
..more records..
The output files that Hadoop creates are as below:
file 0 :
..more records..
1|XXXX|3|NNNNNNN
2|XXXX|3|MMMMMMM
3|XXXX|3|AAAAAAA
file 1:
4|XXXX|3|BBBBBBB
6|YYYY|3|QQQQQQQ
7|YYYY|3|4444444
8|YYYY|3|1111111
..more records..
* Line 4 for ID XXXX should not be there!
Why is Hadoop not filtering on the counter correctly?
To be concrete, say we have a folder with 10k tab-delimited CSV files with the following attribute format (each CSV file is about 10 GB):
id name address city...
1 Matt add1 LA...
2 Will add2 LA...
3 Lucy add3 SF...
...
And we have a lookup table based on "name" above:
name gender
Matt M
Lucy F
...
Now we want to output the top 100,000 rows of each CSV file in the following format:
id name gender
1 Matt M
...
Can we use PySpark to handle this efficiently?
How do we handle these 10k CSV files in parallel?
You can do that in Python to pull the first 1000 lines of your files:
top1000 = sc.textFile("YourFile.csv").map(lambda line: line.split("\t")).take(1000)
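If you also need the lookup join and all 10k files, here is a rough sketch along the same lines (paths, names, and the per-file row limit are placeholders; like the snippet above, it pulls each file's first rows to the driver with take):
from pyspark.sql import SparkSession

# Rough sketch: broadcast the small name->gender lookup, then take the first
# N lines of each file and attach the gender. Paths and N are placeholders.
spark = SparkSession.builder.appName("csv-lookup").getOrCreate()
sc = spark.sparkContext

lookup = dict(
    sc.textFile("lookup.tsv")                 # placeholder path to the lookup table
      .map(lambda line: line.split("\t"))     # [name, gender]
      .collect()
)
lookup_bc = sc.broadcast(lookup)

def top_rows_with_gender(path, n=100000):
    # take(n) fetches only the first n lines of this one file
    rows = sc.textFile(path).map(lambda line: line.split("\t")).take(n)
    return [(r[0], r[1], lookup_bc.value.get(r[1])) for r in rows]

# Loop (or use a thread pool) over the folder's file paths to cover all 10k files.
for rec in top_rows_with_gender("folder/file_0001.csv"):
    print(rec)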
I have a custom Pig loader like this:
A = LOAD 'myfile' USING myudf_loader();
A contains:
((key1, val1), (key2, val2), (key3, val3), ...)
That is A has an outer tuple that contains key-value pairs stored in inner tuples.
I am not using maps because maps require that keys within a relation be unique. The keys I have are not necessarily unique.
The keys are chararrays, while the values can be chararrays, ints, and floats.
I would like to access A's inner tuples, as well as the (key, value) pairs within those tuples.
For example, I want to FILTER the keys of A such that the only fields remaining are key = "city" and value = "New York City".
Example input:
DUMP A;
(("city", "New York City"), ("city", "Boston"),
("city", "Washington, D.C."), ("non-city-key", "non-city-value"),
("city", "New York City"), ("non-city-key", "non-city-value"))
Example output of the filtering, which is stored into B:
DUMP B;
("city", "New York City")
("city", "New York City")
I don't have your full Pig Latin script, but you can achieve this using the idea below:
grouped_records = GROUP records BY key;
filtered_records = FILTER grouped_records BY group == 'city';
DUMP filtered_records;
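As a cross-check of what the question is after, here is the same filtering expressed as a tiny plain-Python sketch over the example data (illustration only, not Pig):
# Plain-Python analogue of the requested filtering (not Pig Latin):
# keep only the inner (key, value) pairs equal to ("city", "New York City").
A = (("city", "New York City"), ("city", "Boston"),
     ("city", "Washington, D.C."), ("non-city-key", "non-city-value"),
     ("city", "New York City"), ("non-city-key", "non-city-value"))

B = [pair for pair in A if pair == ("city", "New York City")]
print(B)  # [('city', 'New York City'), ('city', 'New York City')]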
Cheers
Nag
As far as I understand:
sort by only sorts within each reducer.
order by orders things globally, but shoves everything into one reducer.
cluster by intelligently distributes stuff into reducers by the key hash and then does a sort by.
So my question is: does cluster by guarantee a global order? distribute by puts the same keys into the same reducer, but what about adjacent keys?
The only document I can find on this is here and from the example it seems like it orders them globally. But from the definition I feel like it doesn't always do that.
A shorter answer: yes, CLUSTER BY guarantees global ordering, provided you're willing to join the multiple output files yourself.
The longer version:
ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
Make sense? So CLUSTER BY is basically the more scalable version of ORDER BY.
Let me clarify first: clustered by only distributes your keys into different buckets; clustered by ... sorted by gets the buckets sorted.
With a simple experiment (see below) you can see that you will not get global order by default. The reason is that the default partitioner splits keys using hash codes regardless of actual key ordering.
However you can get your data totally ordered.
Motivation is "Hadoop: The Definitive Guide" by Tom White (3rd edition, Chapter 8, p. 274, Total Sort), where he discusses TotalOrderPartitioner.
I will answer your TotalOrdering question first, and then describe several sort-related Hive experiments that I did.
Keep in mind: what I'm describing here is a proof of concept; I was able to handle a single example using Cloudera's CDH3 distribution.
Originally I hoped that org.apache.hadoop.mapred.lib.TotalOrderPartitioner would do the trick. Unfortunately it did not, because it looks like Hive partitions by value, not key. So I patched it (I should have subclassed it instead, but I did not have time for that):
Replace
public int getPartition(K key, V value, int numPartitions) {
return partitions.findPartition(key);
}
with
public int getPartition(K key, V value, int numPartitions) {
return partitions.findPartition(value);
}
Now you can set (patched) TotalOrderPartitioner as your Hive partitioner:
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/user/yevgen/out_data2;
I also used
hive> set hive.enforce.bucketing = true;
hive> set mapred.reduce.tasks=4;
in my tests.
File out_data2 tells TotalOrderPartitioner how to bucket values.
You generate out_data2 by sampling your data. In my tests I used 4 buckets and keys from 0 to 10. I generated out_data2 using ad-hoc approach:
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.hadoop.fs.FileSystem;
public class TotalPartitioner extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        ToolRunner.run(new TotalPartitioner(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Path partFile = new Path("/home/yevgen/out_data2");
        FileSystem fs = FileSystem.getLocal(getConf());

        HiveKey key = new HiveKey();
        NullWritable value = NullWritable.get();

        SequenceFile.Writer writer = SequenceFile.createWriter(fs, getConf(), partFile, HiveKey.class, NullWritable.class);
        key.set(new byte[]{1, 3}, 0, 2); // partition at 3; the leading 1 came from Hive -- do not know why
        writer.append(key, value);
        key.set(new byte[]{1, 6}, 0, 2); // partition at 6
        writer.append(key, value);
        key.set(new byte[]{1, 9}, 0, 2); // partition at 9
        writer.append(key, value);
        writer.close();
        return 0;
    }
}
Then I copied the resulting out_data2 to HDFS (into /user/yevgen/out_data2).
With these settings I got my data bucketed/sorted (see last item in my experiment list).
Here are my experiments.
Create sample data:
bash> echo -e "1\n3\n2\n4\n5\n7\n6\n8\n9\n0" > data.txt
Create basic test table:
hive> create table test(x int);
hive> load data local inpath 'data.txt' into table test;
Basically this table contains values from 0 to 9 without order.
Demonstrate how table copying works (really, the mapred.reduce.tasks parameter, which sets the MAXIMAL number of reduce tasks to use):
hive> create table test2(x int);
hive> set mapred.reduce.tasks=4;
hive> insert overwrite table test2
select a.x from test a
join test b
on a.x=b.x; -- stupid join to force non-trivial map-reduce
bash> hadoop fs -cat /user/hive/warehouse/test2/000001_0
1
5
9
Demonstrate bucketing. You can see that keys are assigned at random without any sort order:
hive> create table test3(x int)
clustered by (x) into 4 buckets;
hive> set hive.enforce.bucketing = true;
hive> insert overwrite table test3
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test3/000000_0
4
8
0
Bucketing with sorting. Results are partially sorted, not totally sorted
hive> create table test4(x int)
clustered by (x) sorted by (x desc)
into 4 buckets;
hive> insert overwrite table test4
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test4/000001_0
1
5
9
You can see that values are sorted in ascending order. Looks like a Hive bug in CDH3?
Getting partially sorted data without a cluster by statement:
hive> create table test5 as
select x
from test
distribute by x
sort by x desc;
bash> hadoop fs -cat /user/hive/warehouse/test5/000001_0
9
5
1
Use my patched TotalOrderPartitioner:
hive> set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
hive> set total.order.partitioner.natural.order=false;
hive> set total.order.partitioner.path=/user/training/out_data2;
hive> create table test6(x int)
clustered by (x) sorted by (x) into 4 buckets;
hive> insert overwrite table test6
select * from test;
bash> hadoop fs -cat /user/hive/warehouse/test6/000000_0
1
2
0
bash> hadoop fs -cat /user/hive/warehouse/test6/000001_0
3
4
5
bash> hadoop fs -cat /user/hive/warehouse/test6/000002_0
7
6
8
bash> hadoop fs -cat /user/hive/warehouse/test6/000003_0
9
CLUSTER BY does not produce global ordering.
The accepted answer (by Lars Yencken) is misleading when it states that the reducers receive non-overlapping ranges. As Anton Zaviriukhin correctly points out with the BucketedTables documentation, CLUSTER BY is basically DISTRIBUTE BY (which is the same as bucketing) plus SORT BY within each bucket/reducer. And DISTRIBUTE BY simply hashes and mods into buckets, and while the hash function may preserve order (hash of i > hash of j if i > j), the mod of the hash value does not.
Here's a better example showing overlapping ranges
http://myitlearnings.com/bucketing-in-hive/
As I understand it, the short answer is No.
You'll get overlapping ranges.
From the SortBy documentation:
"Cluster By is a short-cut for both Distribute By and Sort By."
"All rows with the same Distribute By columns will go to the same reducer."
But there is no statement that Distribute By guarantees non-overlapping ranges.
Moreover, from the DDL BucketedTables documentation:
"How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets."
I suppose that Cluster By in a Select statement uses the same principle to distribute rows between reducers, because its main use is populating bucketed tables with data.
I created a table with 1 integer column "a", and inserted numbers from 0 to 9 there.
Then I set the number of reducers to 2:
set mapred.reduce.tasks = 2;
And selected data from this table with a Cluster By clause:
select * from my_tab cluster by a;
And received the result that I expected:
0
2
4
6
8
1
3
5
7
9
So, the first reducer (number 0) got the even numbers (because their mod 2 gives 0)
and the second reducer (number 1) got the odd numbers (because their mod 2 gives 1).
So that's how "Distribute By" works.
And then "Sort By" sorts the results inside each reducer.
Use case: when there is a large dataset, one should go for sort by, because with sort by all of the reducers sort their data internally before it is clubbed together, and that enhances performance. With order by, performance for a larger dataset drops, as all the data is passed through a single reducer; this increases the load and hence takes longer to execute the query.
Please see the example below, run on an 11-node cluster.
This one is Order By example output
This one is Sort By example output
This one is Cluster By example
What I observed: the figures for sort by, cluster by, and distribute by are the SAME, but the internal mechanism is different. In DISTRIBUTE BY, rows with the same column value go to one reducer, e.g. DISTRIBUTE BY(City) - Bangalore data goes to one reducer, Delhi data to another:
Cluster by is per-reducer sorting, not global. In many books it is also described incorrectly or confusingly. It has a particular use: say you distribute each department to a specific reducer and then sort by employee name within each department, and you do not care about the order of department numbers; then cluster by is to be used, and it is more performant, as the workload is distributed among the reducers.
SortBy: N or more sorted files with overlapping ranges.
OrderBy: single output, i.e. fully ordered.
Distribute By: each of N reducers gets non-overlapping ranges of the column, but the output of each reducer is not sorted.
For more information: http://commandstech.com/hive-sortby-vs-orderby-vs-distributeby-vs-clusterby/
ClusterBy: referring to the same example as above, if we use Cluster By x, the two reducers will further sort rows on x.
If I understood it correctly:
1. sort by - only sorts the data within each reducer.
2. order by - orders things globally by pushing the entire data set to a single reducer. If we have a lot of (skewed) data, this process will take a lot of time.
3. cluster by - intelligently distributes stuff into reducers by the key hash and does a sort by, but does not guarantee global ordering. One key (k1) can be placed into two reducers: the first reducer may get 10K of k1 data, the second one might get 1K of k1 data.