Match individual records during Batch predictions with VertexAI pipeline - google-cloud-vertex-ai

I have a custom model in Vertex AI and a table that stores the model's features along with a record_id.
I am building a pipeline component for batch prediction and have run into a critical issue.
When I submit the batch prediction job I have to exclude record_id from the input, but how can I map each prediction back to its record if record_id is not in the result?
from google.cloud import bigquery
from google.cloud import aiplatform

aiplatform.init(project=project_id)
client = bigquery.Client(project=project_id)

# Drop record_id and load the remaining feature columns into another table.
query = '''
SELECT * EXCEPT(record_id) FROM `table`
'''
df = client.query(query).to_dataframe()
job = client.load_table_from_dataframe(df, "table_wo_id")
job.result()  # wait for the load job to finish before predicting

clf = aiplatform.Model(model_name='custom_model')
clf.batch_predict(
    job_display_name='custom model batch prediction',
    bigquery_source='bq://table_wo_id',
    instances_format='bigquery',
    bigquery_destination_prefix='bq://prediction_result_table',
    predictions_format='bigquery',
    machine_type='n1-standard-4',
    max_replica_count=1,
)
As the example above shows, there is no record_id column in prediction_result_table, so there is no way to map the results back to the individual records.
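One possibility worth checking, offered as an unverified sketch rather than a confirmed answer: the Vertex AI batch prediction REST resource appears to accept an instanceConfig with an excludedFields list, which would let record_id stay in the input table (and be carried into the output) while being hidden from the model. The helper below only builds such a request body as a plain dictionary. The function name, the argument values, and every field name are assumptions to confirm against the current API reference before use.

```python
def batch_prediction_request(model_resource, source_table, destination, id_column):
    """Build a hypothetical batch prediction request body.

    All field names are assumptions based on the REST resource layout and
    must be checked against the Vertex AI API reference.
    """
    return {
        "displayName": "custom model batch prediction",
        "model": model_resource,
        "inputConfig": {
            "instancesFormat": "bigquery",
            "bigquerySource": {"inputUri": source_table},
        },
        # Assumed field: columns listed here are not sent to the model.
        "instanceConfig": {"excludedFields": [id_column]},
        "outputConfig": {
            "predictionsFormat": "bigquery",
            "bigqueryDestination": {"outputUri": destination},
        },
    }


request_body = batch_prediction_request(
    model_resource="custom_model",
    source_table="bq://table",  # the original table, record_id included
    destination="bq://prediction_result_table",
    id_column="record_id",
)
```

If the field behaves as assumed, the separate table_wo_id copy would no longer be needed, because the id column never reaches the model.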


Extracting data from multiple tables with Scrapy using xpath

I'm extracting metadata and URLs from 12 tables on a web page. I've got it working, but I'm pretty new to both XPath and Scrapy, so is there a more concise way I could have done this?
I was initially getting loads of duplicates as I tried a variety of XPaths, and realised that each table row was being repeated for each table. My solution was to enumerate the tables and loop through each one, grabbing only the rows for that table. It feels like there is probably a simpler way to do it, but I'm not sure.
import scrapy


class LinkCheckerSpider(scrapy.Spider):
    name = 'foodstandardsagency'
    allowed_domains = ['ratings.food.gov.uk']
    start_urls = ['https://ratings.food.gov.uk/open-data/en-gb/']

    def parse(self, response):
        print(response.url)
        tables = response.xpath('//*[@id="openDataStatic"]//table')
        num_tables = len(tables)
        # XPath positions are 1-based, so count from 1 to num_tables.
        for tabno in range(1, num_tables + 1):
            search_path = '//*[@id="openDataStatic"]/table[%d]/tr' % tabno
            rows = response.xpath(search_path)
            for row in rows:
                local_authority = row.xpath('td[1]//text()').extract()
                last_update = row.xpath('td[2]//text()').extract()
                num_businesses = row.xpath('td[3]//text()').extract()
                xml_file_descr = row.xpath('td[4]//text()').extract()
                xml_file = row.xpath('td[4]/a/@href').extract()
                yield {'local_authority': local_authority[1],
                       'last_update': last_update[1],
                       'num_businesses': num_businesses[1],
                       'xml_file': xml_file[0],
                       'xml_file_descr': xml_file_descr[1]
                       }
And I'm running it with
scrapy runspider fsa_xpath.py
You can iterate through the table selectors returned by your first XPath:
tables = response.xpath('//*[@id="openDataStatic"]//table')
for table in tables:
    for row in table.xpath('./tr'):
        local_authority = row.xpath('td[1]//text()').extract()
This is the same relative-XPath approach you already use on the rows.
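To see why the relative path avoids duplicates, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy selectors (a deliberate substitution so the example runs without Scrapy). The markup, the placeholder values, and the function name are invented for illustration and are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page: two tables, one row each.
PAGE = """
<div id="openDataStatic">
  <table>
    <tr><td>Authority A</td><td>date-1</td><td>100</td><td><a href="a.xml">English</a></td></tr>
  </table>
  <table>
    <tr><td>Authority B</td><td>date-2</td><td>200</td><td><a href="b.xml">English</a></td></tr>
  </table>
</div>
"""


def extract_rows(markup):
    """Collect one record per table row, visiting each table separately."""
    root = ET.fromstring(markup.strip())
    records = []
    for table in root.findall(".//table"):
        # "./tr" is evaluated relative to this table only, so rows that
        # belong to other tables are never matched here.
        for row in table.findall("./tr"):
            cells = row.findall("td")
            link = cells[3].find("a")
            records.append({
                "local_authority": "".join(cells[0].itertext()).strip(),
                "last_update": "".join(cells[1].itertext()).strip(),
                "num_businesses": "".join(cells[2].itertext()).strip(),
                "xml_file": link.get("href"),
                "xml_file_descr": "".join(cells[3].itertext()).strip(),
            })
    return records


records = extract_rows(PAGE)
```

Each row appears exactly once because the inner query starts from the current table element rather than from the document root.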

SparkRDD Operations

Let's assume I have a table with two columns, A and B, in a CSV file. I pick the maximum value from column A [max value = 100] and I need to return the corresponding value of column B [return value = AliExpress] using JavaRDD operations, without using DataFrames.
Input table:
COLUMN A    Column B
56          Walmart
72          Flipkart
96          Amazon
100         AliExpress
Output table:
COLUMN A    Column B
100         AliExpress
This is what I have tried so far.
Source code:
SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");
JavaRDD<String> date = diskfile.flatMap(f -> Arrays.asList(f.split(",")[1]));
With the code above I can fetch the data of only one column. Is there any way to get both columns? Any suggestions are welcome. Thanks in advance.
You can use either the top or the takeOrdered function to achieve this.
rdd.top(1) //gives you top element in your RDD
Data:
COLUMN_A,Column_B
56,Walmart
72,Flipkart
96,Amazon
100,AliExpress
Creating df using Spark 2
val df = sqlContext.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("filelocation")
df.show
import sqlContext.implicits._
import org.apache.spark.sql.functions._
Using Dataframe functions
df.orderBy(desc("COLUMN_A")).take(1).foreach(println)
OUTPUT:
[100,AliExpress]
Using RDD functions
df.rdd
  .map(row => (row(0).toString.toInt, row(1)))
  .sortByKey(false)
  .take(1).foreach(println)
OUTPUT:
(100,AliExpress)
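Since the question asks for plain RDD operations, it may help to see the underlying idea without Spark at all: turn every line into a (number, label) pair, then reduce the pairs by keeping the one with the larger number. The sketch below is ordinary Python, not Spark code, and its function names are made up for illustration. On a JavaRDD the equivalent would be to map each line to a pair and pass a similar function to reduce.

```python
from functools import reduce


def parse_line(line):
    """Split one CSV line into a (number, label) pair."""
    number, label = line.split(",", 1)
    return int(number), label.strip()


def max_by_first_column(lines):
    """Return the pair whose first element is largest."""
    pairs = (parse_line(line) for line in lines)
    # Keeping the larger of two pairs is associative and commutative,
    # which is what a distributed reduce requires.
    return reduce(lambda a, b: a if a[0] >= b[0] else b, pairs)


rows = ["56,Walmart", "72,Flipkart", "96,Amazon", "100,AliExpress"]
best = max_by_first_column(rows)
```

A reduce needs only one pass over the data, whereas sorting the whole dataset to read off a single row does more work than necessary.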

groupby.sum() sparse matrix in pandas or scipy: looking for performance

I have the following dataset df:
import numpy.random
import pandas
# randint's upper bound is exclusive, unlike the deprecated random_integers.
cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids,cat,team],axis=1)
df.columns = ['ids','cat','team']
Note that there are only about 400 distinct categories in the cat column. Consequently, I want to prepare the dataset for machine learning classification, i.e., create one column for each distinct category value from 0 to 400 and, for each row, write 1 if the id has the corresponding category and 0 otherwise. My goal is then to group by ids and sum the 1s for every category column, as follows:
df2 = pandas.get_dummies(df['cat'], sparse=True)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
My problem is that the groupby.sum() is extremely slow (it takes more than 30 minutes), so I need a different strategy for this calculation. Here is a second attempt.
from sklearn import preprocessing
import numpy
text_encoder = preprocessing.OneHotEncoder(dtype=int)  # numpy.int was removed from recent NumPy releases
X = text_encoder.fit_transform(df.drop(['team','ids'],axis=1).values).astype(int)
But then X is a sparse scipy matrix. Here I have two choices: either I find a way to do the groupby.sum() efficiently on this sparse scipy matrix, or I convert it to a dense numpy array with .toarray() as follows:
X = X.toarray()
df2 = pandas.DataFrame(X)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
The problem now is that .toarray() uses a lot of memory, and the groupby.sum() surely takes a lot of memory as well.
So my question is: is there a smart way to solve my problem using a sparse matrix while keeping the groupby.sum() fast?
EDIT: In fact this is a job for pivot_table(), so once your df is created:
df_final = df.pivot_table(index='ids', columns='cat', aggfunc='count')  # older pandas releases called these arguments rows= and cols=
df_final.fillna(0, inplace=True)
For the record, although it is no longer needed, here is the approach from my comments on the question:
import numpy.random
import pandas
from sklearn import preprocessing

# randint's upper bound is exclusive, unlike the deprecated random_integers.
cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids, cat, team], axis=1)
df.columns = ['ids', 'cat', 'team']
# Sort by id so that each id occupies one contiguous block of rows.
df.sort_values('ids', inplace=True)
text_encoder = preprocessing.OneHotEncoder(dtype=int)
X = text_encoder.fit_transform(df.drop(['team', 'ids'], axis=1).values).astype(int)
se_size = df.groupby('ids').size()
ls_rows = []
row_ind = 0
# Sum the sparse one-hot rows of each id block in turn.
for name, nb_lines in se_size.items():
    ls_rows.append(X[row_ind : row_ind + nb_lines, :].sum(0).tolist()[0])
    row_ind += nb_lines
df_final = pandas.DataFrame(ls_rows,
                            index=se_size.index,
                            columns=text_encoder.categories_[0])  # active_features_ in older scikit-learn
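For the sparse route the question asks about, one option is to skip the one-hot step and build the grouped counts directly: treat each (id, category) pair as a coordinate holding the value 1 and let scipy add up the repeated coordinates. The sketch below assumes numpy and scipy are installed. The function name is made up for illustration, and this is one possible approach, not the one the answer above used.

```python
import numpy as np
from scipy import sparse


def grouped_category_counts(ids, cats):
    """Count, per id, how often each category occurs, as a sparse matrix."""
    ids = np.asarray(ids)
    cats = np.asarray(cats)
    # Map raw ids and categories to dense 0..n-1 row and column positions.
    row_labels, row_idx = np.unique(ids, return_inverse=True)
    col_labels, col_idx = np.unique(cats, return_inverse=True)
    ones = np.ones(len(ids), dtype=np.int64)
    coo = sparse.coo_matrix(
        (ones, (row_idx, col_idx)),
        shape=(len(row_labels), len(col_labels)),
    )
    # Converting COO to CSR sums entries that share the same coordinates,
    # which is exactly the per-id sum of the one-hot columns.
    return row_labels, col_labels, coo.tocsr()


row_labels, col_labels, counts = grouped_category_counts(
    ids=[1, 1, 2, 1],
    cats=[0, 0, 3, 3],
)
```

Here row_labels and col_labels give the id and category behind each row and column of counts, so the result can be wrapped in a DataFrame later if a labelled view is needed.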

Entity Framework SQL Selecting 600+ Columns

I have a query generated by entity framework running against oracle that's too slow. It runs in about 4 seconds.
This is the main portion of my query
var query = from x in db.BUILDINGs
join pro_co in db.PROFILE_COMMUNITY on x.COMMUNITY_ID equals pro_co.COMMUNITY_ID
join co in db.COMMUNITies on x.COMMUNITY_ID equals co.COMMUNITY_ID
join st in db.STATE_PROFILE on co.STATE_CD equals st.STATE_CD
where pro_co.PROFILE_NM == authorizedUser.ProfileName
select new
{
COMMUNITY_ID = x.COMMUNITY_ID,
COUNTY_ID = x.COUNTY_ID,
REALTOR_GROUP_NM = x.REALTOR_GROUP_NM,
BUILDING_NAME_TX = x.BUILDING_NAME_TX,
ACTIVE_FL = x.ACTIVE_FL,
CONSTR_SQFT_AVAIL_NB = x.CONSTR_SQFT_AVAIL_NB,
TRANS_RAIL_FL = x.TRANS_RAIL_FL,
LAST_UPDATED_DT = x.LAST_UPDATED_DT,
CREATED_DATE = x.CREATED_DATE,
BUILDING_ADDRESS_TX = x.BUILDING_ADDRESS_TX,
BUILDING_ID = x.BUILDING_ID,
COMMUNITY_NM = co.COMMUNITY_NM,
IMAGECOUNT = x.BUILDING_IMAGE2.Count(),
StateCode = st.STATE_NM,
BuildingTypeItems = x.BUILDING_TYPE_ITEM,
BuildingZoningItems = x.BUILDING_ZONING_ITEM,
BuildingSpecFeatures = x.BUILDING_SPEC_FEATURE_ITEM,
buildingHide = x.BUILDING_HIDE,
buildinghideSort = x.BUILDING_HIDE.Count(y => y.PROFILE_NM == ProfileName) > 0 ? 1 : 0,
BUILDING_CITY_TX = x.BUILDING_CITY_TX,
BUILDING_ZIP_TX = x.BUILDING_ZIP_TX,
LPF_GENERAL_DS = x.LPF_GENERAL_DS,
CONSTR_SQFT_TOTAL_NB = x.CONSTR_SQFT_TOTAL_NB,
CONSTR_STORIES_NB = x.CONSTR_STORIES_NB,
CONSTR_CEILING_CENTER_NB = x.CONSTR_CEILING_CENTER_NB,
CONSTR_CEILING_EAVES_NB = x.CONSTR_CEILING_EAVES_NB,
DESCR_EXPANDABLE_FL = x.DESCR_EXPANDABLE_FL,
CONSTR_MATERIAL_TYPE_TX = x.CONSTR_MATERIAL_TYPE_TX,
SITE_ACRES_SALE_NB = x.SITE_ACRES_SALE_NB,
DESCR_PREVIOUS_USE_TX = x.DESCR_PREVIOUS_USE_TX,
CONSTR_YEAR_BUILT_TX = x.CONSTR_YEAR_BUILT_TX,
DESCR_SUBDIVIDE_FL = x.DESCR_SUBDIVIDE_FL,
LOCATION_CITY_LIMITS_FL = x.LOCATION_CITY_LIMITS_FL,
TRANS_INTERSTATE_NEAREST_TX = x.TRANS_INTERSTATE_NEAREST_TX,
TRANS_INTERSTATE_MILES_NB = x.TRANS_INTERSTATE_MILES_NB,
TRANS_HIGHWAY_NAME_TX = x.TRANS_HIGHWAY_NAME_TX,
TRANS_HIGHWAY_MILES_NB = x.TRANS_HIGHWAY_MILES_NB,
TRANS_AIRPORT_COM_NAME_TX = x.TRANS_AIRPORT_COM_NAME_TX,
TRANS_AIRPORT_COM_MILES_NB = x.TRANS_AIRPORT_COM_MILES_NB,
UTIL_ELEC_SUPPLIER_TX = x.UTIL_ELEC_SUPPLIER_TX,
UTIL_GAS_SUPPLIER_TX = x.UTIL_GAS_SUPPLIER_TX,
UTIL_WATER_SUPPLIER_TX = x.UTIL_WATER_SUPPLIER_TX,
UTIL_SEWER_SUPPLIER_TX = x.UTIL_SEWER_SUPPLIER_TX,
UTIL_PHONE_SVC_PVD_TX = x.UTIL_PHONE_SVC_PVD_TX,
CONTACT_ORGANIZATION_TX = x.CONTACT_ORGANIZATION_TX,
CONTACT_PHONE_TX = x.CONTACT_PHONE_TX,
CONTACT_EMAIL_TX = x.CONTACT_EMAIL_TX,
TERMS_SALE_PRICE_TX = x.TERMS_SALE_PRICE_TX,
TERMS_LEASE_SQFT_NB = x.TERMS_LEASE_SQFT_NB
};
There is a section of code that tacks on dynamic where and sort clauses to the query but I've left those out. The query takes about 4 seconds to run no matter what is in the where and sort.
I ran the generated SQL in Oracle, and the explain plan didn't show anything that screamed "fix me". The cost is 1554.
If this isn't allowed I apologize but I can't seem to find a good way to share this information. I've uploaded the explain plan generated by Sql Developer here: http://www.123server.org/files/explainPlanzip-e1d291efcd.html
Table Layout
Building
--------------------
- BuildingID
- CommunityId
- Lots of other columns
Profile_Community
-----------------------
- CommunityId
- ProfileNM
- lots of other columns
state_profile
---------------------
- StateCD
- ProfileNm
- lots of other columns
Profile
---------------------
- Profile-NM
- a few other columns
All of the tables with a lot of columns have 120-150 columns each. It seems like Entity Framework is generating a select statement that pulls every column from every table instead of just the ones I want.
The thing that's bugging me, and that I think might be my issue, is that my LINQ selects 50 items but the generated SQL returns 677 columns. I suspect that returning so many columns is the source of the slowness.
Any ideas why I am getting so many columns returned in SQL or how to speed my query?
I suspect some of the performance is being lost to object creation. Try running the query with just a basic "select x" and see whether it's the SQL query or the object creation that takes the time.
Also if the query being generated is too complicated you could try separating it out into smaller sub-queries which gradually enrich your object rather than trying to query everything at once.
I ended up creating a view and having the view only select the columns I wanted and joining on things that needed to be left-joined in linq.
It's pretty annoying that EF selects every column from every table you're trying to join across. But I guess I only noticed this because I am joining a bunch of tables with 150+ columns in them.

Unable to fetch data from Hbase based on query parameters

How do I get data from HBase? I have a table with empId, name, startDate, endDate and other columns. I want to get data from this HBase table based on empId, startDate and endDate. In normal SQL I can use:
select * from tableName where empId=val and date>=startDate and date<=endDate
How can I do this in HBase as it stores data as key value pairs? The key is empId.
Getting filtered rows in the HBase shell is tricky. Since the shell is JRuby-based, you can use Ruby commands here as well:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.filter.FilterList
import java.text.SimpleDateFormat
import java.lang.Long
def dateToBytes(val)
  Long.toString(
    SimpleDateFormat.new("yyyy/MM/dd").parse(val).getTime()).to_java_bytes
end
# table properties
colfam='c'.to_java_bytes;
col_name='name';
col_start='startDate';
col_end='endDate';
# query params
q_name='name2';
q_start='2012/08/14';
q_end='2012/08/24';
# filters
f_name=SingleColumnValueFilter.new(
  colfam, col_name.to_java_bytes,
  CompareFilter::CompareOp::EQUAL,
  BinaryComparator.new(q_name.to_java_bytes));
f_start=SingleColumnValueFilter.new(
  colfam, col_start.to_java_bytes,
  CompareFilter::CompareOp::GREATER_OR_EQUAL,
  BinaryComparator.new(dateToBytes(q_start)));
f_end=SingleColumnValueFilter.new(
  colfam, col_end.to_java_bytes,
  CompareFilter::CompareOp::LESS_OR_EQUAL,
  BinaryComparator.new(dateToBytes(q_end)));
filterlist= FilterList.new([f_name, f_start, f_end]);
# get the result
scan 'mytable', {"FILTER"=>filterlist}
Similarly, in Java, construct a FilterList:
// Query params
String nameParam = "name2";
String startDateParam = "2012/08/14";
String endDateParam = "2012/08/24";
Filter nameFilter =
    new SingleColumnValueFilter(colFam, nameQual, CompareOp.EQUAL,
        Bytes.toBytes(nameParam));
//getBytesFromDate(): parses startDateParam and create a byte array out of it
Filter startDateFilter =
    new SingleColumnValueFilter(colFam, startDateQual,
        CompareOp.GREATER_OR_EQUAL, getBytesFromDate(startDateParam));
Filter endDateFilter =
    new SingleColumnValueFilter(colFam, endDateQual,
        CompareOp.LESS_OR_EQUAL, getBytesFromDate(endDateParam));
FilterList filters = new FilterList();
filters.addFilter(nameFilter);
filters.addFilter(startDateFilter);
filters.addFilter(endDateFilter);
HTable htable = new HTable(conf, tableName);
Scan scan = new Scan();
scan.setFilter(filters);
ResultScanner rs = htable.getScanner(scan);
//process your result...
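The date filters in both versions only work if the query value is encoded the same way as the stored cell value, because a binary comparator compares raw bytes. The dateToBytes helper in the shell example encodes a date as the decimal string of its epoch milliseconds. The Python sketch below mirrors that encoding to make the comparison visible. It assumes UTC (the Java SimpleDateFormat would use the JVM's default time zone instead), and in_range is a hypothetical name for illustration.

```python
from datetime import datetime, timezone


def date_to_bytes(value):
    """Encode a yyyy/MM/dd date as the ASCII bytes of its epoch milliseconds."""
    moment = datetime.strptime(value, "%Y/%m/%d").replace(tzinfo=timezone.utc)
    millis = int(moment.timestamp() * 1000)
    return str(millis).encode("ascii")


def in_range(stored, start, end):
    """Compare byte strings lexicographically, as a binary comparator would."""
    return date_to_bytes(start) <= stored <= date_to_bytes(end)


cell = date_to_bytes("2012/08/20")
matches = in_range(cell, "2012/08/14", "2012/08/24")
```

Lexicographic order agrees with numeric order here only because all the encoded values have the same number of digits. If the table stores dates in another form, for example as an 8-byte long, the query value has to be encoded in that form instead.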
