Hi, I would like to know how to implement lookup logic in Pig. I have a set of records, say for a weblog user, and need to go back and fetch some fields from that user's first visit (not the current one).
This is doable in Java, but is there a way to implement it in Pig?
Example:
Suppose that while traversing the records of one particular user, identified by col1 and col2, I want to output the first value of lookup_col for that user, in this case 1.
col1  col2  lookup_col
----  ----  ----------
326   8979  1
326   8979  4
326   8979  3
326   8979  0
You can implement this as a Pig UDF.
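For the UDF route, here is a minimal sketch of what such a function might look like (the class name is a placeholder; it assumes you hand it the bag of one user's visits, already in visit order, with lookup_col in the third position):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

// Sketch only: returns the lookup_col of the first tuple in a bag of visits.
public class FirstLookup extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        DataBag visits = (DataBag) input.get(0);   // the grouped bag of visits
        for (Tuple visit : visits) {
            return (Integer) visit.get(2);         // position 2 = lookup_col
        }
        return null;                               // empty bag
    }
}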
Alternatively, you can use simple SQL-like logic: aggregate the visits by user (how exactly you define a user and look up visits per user is another matter), take the first visit for each user, and then left-join the users with that aggregated relation.
A 'replicated join' in Pig is essentially a lookup in a set that is distributed to the nodes and loaded into memory. However, you can get more than a single result, because it's a JOIN operation and not a lookup - so if you aggregate the data beforehand, you make sure that you only have a single record per key.
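Putting the aggregate-then-join idea into Pig Latin, a minimal sketch (it assumes a relation visits with the schema (col1, col2, lookup_col) and that the rows arrive in visit order, since the sample has no timestamp to ORDER by):

-- Keep one row per user: group the visits and take the first row of each bag.
-- Without a timestamp, LIMIT 1 picks an arbitrary row; add an ORDER inside
-- the FOREACH if you have a visit time.
grouped      = GROUP visits BY (col1, col2);
first_visits = FOREACH grouped {
                   first = LIMIT visits 1;
                   GENERATE FLATTEN(group) AS (col1, col2),
                            FLATTEN(first.lookup_col) AS first_lookup;
               };

-- Replicated join: first_visits (now one small record per key) is loaded
-- into memory on every node, which is the lookup behaviour described above.
joined = JOIN visits BY (col1, col2), first_visits BY (col1, col2) USING 'replicated';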
I am new to HBase and still not sure which component of the Hadoop ecosystem to use in my case, or how to analyse my data later, so I am just exploring the options.
I have an Excel sheet with a summary of all the customers, like this but with ≈ 400 columns:
CustomerID Country Age E-mail
251648 Russia 27 boo#yahoo.com
487985 USA 30 foo#yahoo.com
478945 England 15 lala#yahoo.com
789456 USA 25 nana#yahoo.com
Also, I have .xls files created separately for each customer with information about that customer (one customer = one .xls file); the number and names of the columns are the same in each file. Each of these files is named with a CustomerID. One looks like this:
'customerID_251648.xls':
feature1 feature2 feature3 feature4
0 33,878 yes 789,598
1 48,457 yes 879,594
1 78,495 yes 487,457
0 94,589 no 787,475
I have converted all these files into .csv format and am now stuck on which component of the Hadoop ecosystem I should use for storing and querying such data.
My eventual goal is to query by some CustomerID and get all the information about that customer from all the files.
I think that HBase fits this perfectly, because I can create a schema like this:
row key   timestamp   Column Family 1         Column Family 2
251648                Country, Age, E-Mail    Feature1, Feature2, Feature3, Feature4
What is the best approach to upload and query such data in HBase? Should I first combine the information about a customer from the different sources and then upload it to HBase? Or can I keep the separate .csv files for each customer and, when uploading to HBase, somehow choose which .csv to use for forming each column family?
For querying the data stored in HBase, I am going to write MapReduce jobs via a Python API.
Any help would be very appreciated!
You are correct with the schema design. Also remember that HBase reads the whole column family during scans, so if you need all the data at once it may be better to place everything in one column family.
A simple way to load the data would be to scan the first file with the customer summaries and fetch the data from the per-customer files on the fly. A bulk CSV load could be faster at execution time, but you'll spend more time writing code.
You also need to think about the row key, because HBase stores rows in lexicographical order. If you have a lot of data, you'd better create the table with explicit split keys rather than let HBase do the splits itself, because otherwise you can end up with unbalanced regions.
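For example, pre-splitting can be done from the HBase shell when the table is created. A minimal sketch, assuming a table called 'customers' with a single column family 'cf' and numeric CustomerID row keys (all names here are placeholders):

# SPLITS pre-creates the region boundaries so the initial load is spread
# across region servers instead of hammering a single region.
create 'customers', 'cf', SPLITS => ['2', '4', '6', '8']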
I have two datasets, named dataset1 and dataset2. dataset1 looks like this:
empid empname
101 john
102 kevin
and dataset2 looks like this:
empid empmarks empaddress
101 75 LA
102 69 NY
dataset2 will be very large, and I need to perform some operations on these two datasets and get results from both of them.
As far as I know, I now have two options to process these datasets:
1. Store dataset1 (which is smaller) as a Hive lookup table and process them through Spark.
2. Use Spark broadcast variables to process these datasets.
Could anyone suggest which is the better option?
This should be a better option than the two options mentioned.
Since you have a common key, you can do an inner join:
dataset2.join(dataset1, Seq("empid"), "inner").show()
You can also use the broadcast function/hint like this, which tells the framework that the small DataFrame, i.e. dataset1, should be broadcast to every executor:
import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
Also look at these for more details:
DataFrame join optimization - Broadcast Hash Join (how broadcast joins work)
What is the maximum size for a broadcast object in Spark?
I receive an input file which has 200 million records. The records are just keys.
For each record from this file (which I'll call SAMPLE_FILE), I need to retrieve all records from a database (which I'll call EVENT_DATABASE) that match that key. The EVENT_DATABASE can have billions of records.
For example:
SAMPLE_FILE
1234
2345
3456
EVENT_DATABASE
2345 - content C - 1
1234 - content A - 3
1234 - content B - 5
4567 - content D - 7
1234 - content K - 7
1234 - content J - 2
So the system will iterate through each record from SAMPLE_FILE and get all events which have the same key. For example, taking 1234 and querying the EVENT_DATABASE will retrieve:
1234 - content A - 3
1234 - content B - 5
1234 - content K - 7
1234 - content J - 2
Then I will execute some calculations over the result set, for example count, sum, and mean:
F1 = 4 (count)
F2 = 17 (sum(3+5+7+2))
My approach would be to store the EVENT_DATABASE in HBase, then run a MapReduce job and, in the map phase, query HBase, get the events, and execute the calculations.
The process can run in batch; it does not need to be real time.
Does anyone suggest another architecture? Do I really need a MapReduce job? Can I use another approach?
I personally solved this kind of problem using MapReduce, HDFS and HBase for batch analysis. Your approach seems good for implementing your use case; I am guessing you are going to store the calculations back into HBase.
Storm could also be used to implement the same use case, but Storm really shines with streaming data and near-real-time processing rather than data at rest.
You don't really need to query HBase for every single key. In my opinion, this would be a better approach:
Create an external table in Hive over your input file.
Create an external table in Hive for your HBase table using the Hive-HBase integration (https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration).
Join the two tables and retrieve the results (see the sketch below).
Your approach would have been fine if you were only querying for a subset of your input file, but since you are querying HBase for all records (all 200 million of them), using a join would be more efficient.
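A rough sketch of those steps (all table names, column names, locations, and the HBase column mapping below are assumed placeholders, not taken from the question):

-- 1. External table over the input file of keys (SAMPLE_FILE).
CREATE EXTERNAL TABLE sample_keys (event_key STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/sample_file';

-- 2. External table over the HBase table (EVENT_DATABASE) via the
--    Hive-HBase storage handler.
CREATE EXTERNAL TABLE events (event_key STRING, content STRING, event_value INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:content,cf:value")
TBLPROPERTIES ("hbase.table.name" = "events");

-- 3. Join once and compute the per-key aggregates (F1 = count, F2 = sum).
SELECT s.event_key, COUNT(*) AS f1, SUM(e.event_value) AS f2
FROM sample_keys s
JOIN events e ON s.event_key = e.event_key
GROUP BY s.event_key;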
My problem:
I have 500,000 distinct IP addresses I need to geocode. The geocode lookup table has ip-from and ip-to ranges that I have to compare against, and it contains 1.8 million rows.
So it's basically:
select *
/*+ MAPJOIN(a) */
from ip_address a
cross join ip_lookup b
where a.AddressInt >= b.ip_from and a.AddressInt <= b.ip_to;
On AWS EMR, I'm running a cluster of 10 m1.large instances, and during the cross join phase it gets stuck at 0% for 20 minutes. But here's the funny thing:
Stage-5: number of mappers: 1; number of reducers: 0
Questions:
1) Does anyone have any better ideas than a cross join? I don't mind firing up a few (dozen) more instances, but I doubt that will help.
2) Am I REALLY doing a cross map join, as in storing the ip_address table in memory?
Thanks in advance.
I had your (kind of) problem last year.
Since my geocode table fit in RAM, here's what I did:
I wrote a Java class (let's call it GeoCoder) that reads the geocode info from disk into RAM and does the geocoding in memory.
I added the file geocode.info to the distributed cache (the Hive ADD FILE command does this).
I wrote a UDF that creates (or reuses, if it was already created) a GeoCoder instance in its evaluate method. A Hive UDF can get the local path of a file in the distributed cache via getClass().getClassLoader().getResource("geocode.info").getFile().
Now I have the local path of geocode.info (at this point it's an ordinary file) and the rest is history.
This method is probably overkill (150 lines of Java), but it worked for me; a trimmed-down sketch follows.
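The trimmed-down sketch (GeoCoder and its lookup method stand for the hypothetical class described above, and error handling is omitted):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Sketch only: GeoCoder is the in-memory lookup class described above;
// it is built once per task and then reused for every row.
public class GeoCodeUDF extends UDF {
    private GeoCoder geoCoder;

    public Text evaluate(Text ip) throws Exception {
        if (ip == null) {
            return null;
        }
        if (geoCoder == null) {
            // geocode.info was shipped via ADD FILE, so it is visible to the
            // task; the answer above uses exactly this call to locate it.
            String path = getClass().getClassLoader()
                                    .getResource("geocode.info").getFile();
            geoCoder = new GeoCoder(path);   // loads the table into RAM
        }
        return new Text(geoCoder.lookup(ip.toString()));
    }
}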
I also assume that you really need to use Hadoop (as I did) for your task. Geocoding 500,000 IPs could probably be done on a laptop pretty fast.
create external table if not exists my_table
(customer_id STRING,ip_id STRING)
location 'ip_b_class';
And then:
hive> set mapred.reduce.tasks=50;
hive> select count(distinct customer_id) from my_table;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
There's 160GB in there, and with 1 reducer it takes a long time...
[ihadanny#lvshdc2en0011 ~]$ hdu
Found 8 items
162808042208 hdfs://horton/ip_b_class
...
Logically, you cannot have more than one reducer here. Unless all the distinct customer IDs from the individual map tasks come to one place, distinctness cannot be established and a single count cannot be produced. In other words, unless you bring all the customer IDs together in one place, you cannot say whether each one is distinct and then count them.
The original answer and explanation provided by @Rags is correct. The attached link gives you a good workaround by rewriting your query (a sketch of that rewrite is at the end of this answer). If you don't want to rewrite your query, I would suggest giving the reducer more memory by using this option:
set mapreduce.reduce.java.opts=-Xmx8000m
That option sets the maximum memory used by a reducer to 8 GB; if you have more, you can specify a higher value here. Hope this helps.
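For reference, the rewrite that such workarounds are based on usually looks roughly like this, using the my_table definition above: the DISTINCT step is spread across many reducers, and only the final count runs on a single one.

-- Stage 1 (many reducers): emit the distinct customer IDs.
-- Stage 2 (single reducer): count the already-distinct rows.
SELECT COUNT(*)
FROM (
    SELECT DISTINCT customer_id
    FROM my_table
) distinct_customers;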