Pig - Store a complex relation schema in a hive table

Pig - Store a complex relation schema in a hive table - hadoop

here is my deal today. Well, I have created a relation as result of a couple of transformations after have read the relation from hive. the thing is that I want to store the final relation after a couple of analysis back in Hive but I can't. Let see that in my code much clear.
The first String is when I LOAD from Hive and transform my result:
july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader ;
july_cl = FOREACH july GENERATE GetDay(ToDate(start_date)) as day:int,start_station,duration; jul_cl_fl = FILTER july_cl BY day==31;
july_gr = GROUP jul_cl_fl BY (day,start_station);
july_result = FOREACH july_gr {
total_dura = SUM(jul_cl_fl.duration);
avg_dura = AVG(jul_cl_fl.duration);
qty_trips = COUNT(jul_cl_fl);
GENERATE FLATTEN(group),total_dura,avg_dura,qty_trips;
};
So, now when I try to store the relation july_result I can't because the schema has changed and I suppose that it's not compatible with Hive:
STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer ();
Even if I have tried to set a special scheme for the final relation I haven't figured it out.
july_result = FOREACH july_gr {
total_dura = SUM(jul_cl_fl.duration);
avg_dura = AVG(jul_cl_fl.duration);
qty_trips = COUNT(jul_cl_fl);
GENERATE FLATTEN(group) as (day:int),total_dura as (total_dura:int),avg_dura as (avg_dura:int),qty_trips as (qty_trips:int);
};

After a research in hortonworks community, I got the solution about how to define an output format for a group relation in pig. My new code looks like:
july_result = FOREACH july_gr {
total_dura = SUM(jul_cl_fl.duration);
avg_dura = AVG(jul_cl_fl.duration);
qty_trips = COUNT(jul_cl_fl);
GENERATE FLATTEN( group) AS (day, code_station),(int)total_dura as (total_dura:int),(float)avg_dura as (avg_dura:float),(int)qty_trips as (qty_trips:int);
};
Thanks guys.

Related

Link table to another table

I'm creating new data model in oracle fusion cloud. I have a different tables, and I want to link all 3 of them. What should I do?
Below are the tables i used:
doo_headers_all dha,
ra_terms rt,
ra_customer_trx_all rcta,
hz_cust_site_uses_all bill_hcsua,
hz_cust_acct_sites_all bill_hcasa,
hz_cust_accounts bill_hca,
hz_party_sites bill_hps,
hz_parties bill_hp,
hz_party_site_uses ship_hpsu,
hz_party_sites ship_hps,
hz_parties ship_hp,
hz_addtnl_party_names bill_hapn,
hz_addtnl_party_names ship_hapn
I link all of them except for DOO_HEADERS_ALL, I tried to search which foreign keys can I use so that I can link it all but I don't see any answer.
rcta.term_id = rt.term_id(+)
and rcta.bill_to_site_use_id = bill_hcsua.site_use_id(+)
and bill_hcsua.cust_acct_site_id = bill_hcasa.cust_acct_site_id(+)
and bill_hcasa.cust_account_id = bill_hca.cust_account_id(+)
and bill_hcasa.party_site_id = bill_hps.party_site_id(+)
and bill_hps.party_id = bill_hp.party_id(+)
and rcta.ship_to_party_site_use_id = ship_hpsu.party_site_use_id(+)
and ship_hpsu.party_site_id = ship_hps.party_site_id(+)
and ship_hps.party_id = ship_hp.party_id(+)
and bill_hp.party_id = bill_hapn.party_id(+) and bill_hapn.party_name_type (+) = 'PHONETIC'
and ship_hp.party_id = ship_hapn.party_id(+) and ship_hapn.party_name_type (+) = 'PHONETIC'
this is I used to link other tables except doo_headers_all

dha.ORDER_NUMBER = rcta.CT_REFERENCE(+)
or
dha.SOLD_TO_CUSTOMER_ID = bill_hca.CUST_ACCOUNT_ID (+)
This is highly dependent on your setup though.
It's also not really clear what you mean by 'different tables, and I want to link all 3 of them', as your list is a lot longer than 3 tables.

Order of Apache Pig Transformations

I am reading through Pig Programming by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?

Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;

If you are looking to sort the grouped values then you will have to use nested foreach. This will sort the years in descending order within a group.
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by GetYear(ToDate(releaseDate, 'dd-MMM-yyyy')) DESC;
GENERATE GROUP, movieID, movieTitle;
};

Pig flatten error

I tried this script for my nested data :
`books = load 'data/book-seded-workings-reduced.json'
using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');`
group_auth = group books by title;
maped = foreach group_auth generate group, books.authors;
fil = foreach maped generate flatten(books);
DUMP fil;
but I got this error : A column needs to be projected from a relation for it to be used as a scalar
Any idea?

books = load 'input.data'
using JsonLoader('user_id:chararray,
type:chararray,
title:chararray,
year:chararray,
publisher:chararray,
authors:{(name:chararray)},source:chararray');
flatten_authors = foreach books generate title, FLATTEN(authors.name);
dump flatten_authors;
Output : (Input referred from Loading JSON file with serde in Cloudera)
(Modern Database Systems: The Object Model, Interoperability, and Beyond.,null)
(Inequalities: Theory of Majorization and Its Application.,Albert W. Marshall)
(Inequalities: Theory of Majorization and Its Application.,Ingram Olkin)

Entity Framework SQL Selecting 600+ Columns

I have a query generated by entity framework running against oracle that's too slow. It runs in about 4 seconds.
This is the main portion of my query
var query = from x in db.BUILDINGs
join pro_co in db.PROFILE_COMMUNITY on x.COMMUNITY_ID equals pro_co.COMMUNITY_ID
join co in db.COMMUNITies on x.COMMUNITY_ID equals co.COMMUNITY_ID
join st in db.STATE_PROFILE on co.STATE_CD equals st.STATE_CD
where pro_co.PROFILE_NM == authorizedUser.ProfileName
select new
{
COMMUNITY_ID = x.COMMUNITY_ID,
COUNTY_ID = x.COUNTY_ID,
REALTOR_GROUP_NM = x.REALTOR_GROUP_NM,
BUILDING_NAME_TX = x.BUILDING_NAME_TX,
ACTIVE_FL = x.ACTIVE_FL,
CONSTR_SQFT_AVAIL_NB = x.CONSTR_SQFT_AVAIL_NB,
TRANS_RAIL_FL = x.TRANS_RAIL_FL,
LAST_UPDATED_DT = x.LAST_UPDATED_DT,
CREATED_DATE = x.CREATED_DATE,
BUILDING_ADDRESS_TX = x.BUILDING_ADDRESS_TX,
BUILDING_ID = x.BUILDING_ID,
COMMUNITY_NM = co.COMMUNITY_NM,
IMAGECOUNT = x.BUILDING_IMAGE2.Count(),
StateCode = st.STATE_NM,
BuildingTypeItems = x.BUILDING_TYPE_ITEM,
BuildingZoningItems = x.BUILDING_ZONING_ITEM,
BuildingSpecFeatures = x.BUILDING_SPEC_FEATURE_ITEM,
buildingHide = x.BUILDING_HIDE,
buildinghideSort = x.BUILDING_HIDE.Count(y => y.PROFILE_NM == ProfileName) > 0 ? 1 : 0,
BUILDING_CITY_TX = x.BUILDING_CITY_TX,
BUILDING_ZIP_TX = x.BUILDING_ZIP_TX,
LPF_GENERAL_DS = x.LPF_GENERAL_DS,
CONSTR_SQFT_TOTAL_NB = x.CONSTR_SQFT_TOTAL_NB,
CONSTR_STORIES_NB = x.CONSTR_STORIES_NB,
CONSTR_CEILING_CENTER_NB = x.CONSTR_CEILING_CENTER_NB,
CONSTR_CEILING_EAVES_NB = x.CONSTR_CEILING_EAVES_NB,
DESCR_EXPANDABLE_FL = x.DESCR_EXPANDABLE_FL,
CONSTR_MATERIAL_TYPE_TX = x.CONSTR_MATERIAL_TYPE_TX,
SITE_ACRES_SALE_NB = x.SITE_ACRES_SALE_NB,
DESCR_PREVIOUS_USE_TX = x.DESCR_PREVIOUS_USE_TX,
CONSTR_YEAR_BUILT_TX = x.CONSTR_YEAR_BUILT_TX,
DESCR_SUBDIVIDE_FL = x.DESCR_SUBDIVIDE_FL,
LOCATION_CITY_LIMITS_FL = x.LOCATION_CITY_LIMITS_FL,
TRANS_INTERSTATE_NEAREST_TX = x.TRANS_INTERSTATE_NEAREST_TX,
TRANS_INTERSTATE_MILES_NB = x.TRANS_INTERSTATE_MILES_NB,
TRANS_HIGHWAY_NAME_TX = x.TRANS_HIGHWAY_NAME_TX,
TRANS_HIGHWAY_MILES_NB = x.TRANS_HIGHWAY_MILES_NB,
TRANS_AIRPORT_COM_NAME_TX = x.TRANS_AIRPORT_COM_NAME_TX,
TRANS_AIRPORT_COM_MILES_NB = x.TRANS_AIRPORT_COM_MILES_NB,
UTIL_ELEC_SUPPLIER_TX = x.UTIL_ELEC_SUPPLIER_TX,
UTIL_GAS_SUPPLIER_TX = x.UTIL_GAS_SUPPLIER_TX,
UTIL_WATER_SUPPLIER_TX = x.UTIL_WATER_SUPPLIER_TX,
UTIL_SEWER_SUPPLIER_TX = x.UTIL_SEWER_SUPPLIER_TX,
UTIL_PHONE_SVC_PVD_TX = x.UTIL_PHONE_SVC_PVD_TX,
CONTACT_ORGANIZATION_TX = x.CONTACT_ORGANIZATION_TX,
CONTACT_PHONE_TX = x.CONTACT_PHONE_TX,
CONTACT_EMAIL_TX = x.CONTACT_EMAIL_TX,
TERMS_SALE_PRICE_TX = x.TERMS_SALE_PRICE_TX,
TERMS_LEASE_SQFT_NB = x.TERMS_LEASE_SQFT_NB
};
There is a section of code that tacks on dynamic where and sort clauses to the query but I've left those out. The query takes about 4 seconds to run no matter what is in the where and sort.
I dropped the generated SQL in Oracle and an explain plan didn't appear to show anything that screamed fix me. Cost is 1554
If this isn't allowed I apologize but I can't seem to find a good way to share this information. I've uploaded the explain plan generated by Sql Developer here: http://www.123server.org/files/explainPlanzip-e1d291efcd.html
Table Layout
Building
--------------------
- BuildingID
- CommunityId
- Lots of other columns
Profile_Community
-----------------------
- CommunityId
- ProfileNM
- lots of other columns
state_profile
---------------------
- StateCD
- ProfileNm
- lots of other columns
Profile
---------------------
- Profile-NM
- a few other columns
All of the tables with allot of columns have 120-150 columns each. It seems like entity is generating a select statement that pulls every column from every table instead of just the ones I want.
The thing that's bugging me and I think might be my issue is that in my LINQ I've selected 50 items, but the generated sql is returning 677 columns. I think returning so many columns is the source of my slowness possibly.
Any ideas why I am getting so many columns returned in SQL or how to speed my query?

I have a suspicion some of the performance is being impacted by your object creation. Try running the query without just a basic "select x" and see if it's the SQL query taking time or the object creation.
Also if the query being generated is too complicated you could try separating it out into smaller sub-queries which gradually enrich your object rather than trying to query everything at once.

I ended up creating a view and having the view only select the columns I wanted and joining on things that needed to be left-joined in linq.
It's pretty annoying that EF selects every column from every table you're trying to join across. But I guess I only noticed this because I am joining a bunch of tables with 150+ columns in them.

Multiple rows update without select

An old question for Linq 2 Entities. I'm just asking it again, in case someone has came up with the solution.
I want to perform query that does this:
UPDATE dbo.Products WHERE Category = 1 SET Category = 5
And I want to do it with Entity Framework 4.3.1.
This is just an example, I have a tons of records I just want 1 column to change value, nothing else. Loading to DbContext with Where(...).Select(...), changing all elements, and then saving with SaveChanges() does not work well for me.
Should I stick with ExecuteCommand and send direct query as it is written above (of course make it reusable) or is there another nice way to do it from Linq 2 Entities / Fluent.
Thanks!

What you are describing isnt actually possible with Entity Framework. You have a few options,
You can write it as a string and execute it via EF with .ExecuteSqlCommand (on the context)
You can use something like Entity Framework Extended (however from what ive seen this doesnt have great performance)

You can update an entity without first fetching it from db like below
using (var context = new DBContext())
{
context.YourEntitySet.Attach(yourExistingEntity);
// Update fields
context.SaveChanges();
}

If you have set-based operations, then SQL is better suited than EF.
So, yes - in this case you should stick with ExecuteCommand.

I don't know if this suits you but you can try creating a stored procedure that will perform the update and then add that procedure to your model as a function import. Then you can perform the update in a single database call:
using(var dc = new YourDataContext())
{
dc.UpdateProductsCategory(1, 5);
}
where UpdateProductsCategory would be the name of the imported stored procedure.

Yes, ExecuteCommand() is definitely the way to do it without fetching all the rows' data and letting ChangeTracker sort it out. Just to provide an example:
Will result in all rows being fetched and an update performed for each row changed:
using (YourDBContext yourDB = new YourDBContext()) {
yourDB.Products.Where(p => p.Category = 1).ToList().ForEach(p => p.Category = 5);
yourDB.SaveChanges();
}
Just a single update:
using (YourDBContext yourDB = new YourDBContext()) {
var sql = "UPDATE dbo.Products WHERE Category = #oldcategory SET Category = #newcategory";
var oldcp = new SqlParameter { ParameterName = "oldcategory", DbType = DbType.Int32, Value = 1 };
var newcp = new SqlParameter { ParameterName = "newcategory", DbType = DbType.Int32, Value = 5 };
yourDB.Database.ExecuteSqlCommand(sql, oldcp, newcp);
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Pig - Store a complex relation schema in a hive table - hadoop

Related

Link table to another table

Order of Apache Pig Transformations

Pig flatten error

Entity Framework SQL Selecting 600+ Columns

Multiple rows update without select

Categories

Resources