Hive query GROUP BY error; Invalid table alias or column reference - hadoop

I am trying to extend some working Hive queries, but keep falling short. For now I just want to test GROUP BY, which is common to a number of queries that I need to complete. Here is the query that I am trying to execute:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
"cassandra.port" = "9160",
"" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp)) AS hourly
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hourly;
This yields the following error:
ERROR: Error while executing Hive script.Query returned non-zero code:
10, cause: FAILED: Error in semantic analysis: Line 1:320 Invalid
table alias or column reference 'hourly': (possible column names are:
messagerowid, payload_sensor, messagetimestamp, payload_temp,
payload_timestamp, payload_timestampmysql, payload_watt,
The intention is to end up with a time-bound query (say, the last 24 hours) broken down by hour, with SUM() on payload_wattseconds and so on. To get started on the summary tables, I began with a GROUP BY query that derives the hourly anchor for the select.
The problem, though, is the error above. I would greatly appreciate any pointers to what is wrong here. I can't seem to find it myself, but then again I am a newbie on Hive.
Thanks in advance ..
UPDATE: Tried to update the query. Here is the query that I just tried to run:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
"cassandra.port" = "9160",
"" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select messageRowID, payload_sensor, messagetimestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds, hour(from_unixtime(payload_timestamp))
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
WHERE payload_timestamp > unix_timestamp() - 3024*60*60
GROUP BY hour(from_unixtime(payload_timestamp));
That, however, gives another error:
ERROR: Error while executing Hive script.Query returned non-zero code: 10, cause: FAILED: Error in semantic analysis: Line 1:7 Expression not in GROUP BY key 'messageRowID'
UPDATE #2) The following is a quick dump of a few samples stored in the EVENT_KS keyspace in WSO2 BAM. The last column, payload_wattseconds, is calculated in the Perl daemon and will be used in a query to calculate the aggregate sum in kWh, which will then be dumped into MySQL tables for sync to the application that holds the UI/UX layer.
[12:03:00] [jskogsta#enterprise ../Product Centric Opco Modelling]$ ~/local/apache-cassandra-2.0.8/bin/cqlsh localhost 9160 -u admin -p admin --cqlversion="3.0.5"
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 1.2.13 | CQL spec 3.0.5 | Thrift protocol 19.36.2]
Use HELP for help.
cqlsh> use "EVENT_KS";
cqlsh:EVENT_KS> select * from currentcost_stream limit 5;
key | Description | Name | Nick_Name | StreamId | Timestamp | Version | payload_sensor | payload_temp | payload_timestamp | payload_timestampmysql | payload_watt | payload_wattseconds
1403365575174:: | Sample data from CC meter | | Currentcost Realtime | | 1403365575174 | 1.0.18 | 1 | 13.16 | 1403365575 | 2014-06-21 23:46:15 | 6631 | 19893
1403354553932:: | Sample data from CC meter | | Currentcost Realtime | | 1403354553932 | 1.0.18 | 1 | 14.1 | 1403354553 | 2014-06-21 20:42:33 | 28475 | 0
1403374113341:: | Sample data from CC meter | | Currentcost Realtime | | 1403374113341 | 1.0.18 | 1 | 10.18 | 1403374113 | 2014-06-22 02:08:33 | 17188 | 154692
1403354501924:: | Sample data from CC meter | | Currentcost Realtime | | 1403354501924 | 1.0.18 | 1 | 10.17 | 1403354501 | 2014-06-21 20:41:41 | 26266 | 0
1403407054092:: | Sample data from CC meter | | Currentcost Realtime | | 1403407054092 | 1.0.18 | 1 | 17.16 | 1403407054 | 2014-06-22 11:17:34 | 6332 | 6332
(5 rows)
What I am trying to do is issue queries against this table (several, depending on the presentation aggregations required) and present views based on hourly sums, 10-minute sums, daily sums, monthly sums and so on. Depending on the query, the GROUP BY was intended to provide this 'index', so to speak. Right now I am just testing this, so we will see how it ends up. Hope this makes sense.
So I am not trying to remove duplicates.
UPDATE 3) I was going about this all wrong, and thought a bit more about the tip that was given below. Simplifying the whole query gave the right results. The following query gives the total kWh per hour of the day for the WHOLE dataset. With this, I can create the various iterations of kWh spent over various time periods, such as:
Hourly over the last 24 hours
Daily over the last year
Minute over the last hour
.. etc. etc.
Here is the query:
DROP table CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary;
CREATE EXTERNAL TABLE IF NOT EXISTS CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary ( messageRowID STRING, payload_sensor INT, messagetimestamp BIGINT, payload_temp FLOAT, payload_timestamp BIGINT, payload_timestampmysql STRING, payload_watt INT, payload_wattseconds INT )
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
"cassandra.port" = "9160",
"" = "EVENT_KS",
"cassandra.ks.username" = "admin",
"cassandra.ks.password" = "admin",
"" = "currentcost_stream",
"cassandra.columns.mapping" = ":key, payload_sensor, Timestamp, payload_temp, payload_timestamp, payload_timestampmysql, payload_watt, payload_wattseconds" );
select hour(from_unixtime(payload_timestamp)) AS hourly, (sum(payload_wattseconds)/(60*60)/1000)
FROM CurrentCostDataSamples_MySQL_Dump_Last_1_Hour_Summary
GROUP BY hour(from_unixtime(payload_timestamp));
This query yields the following based on the sample data:
hourly _c1
0 16.91570472222222
1 16.363228888888887
2 15.446414166666667
3 11.151388055555556
4 18.10564666666667
5 2.2734924999999997
6 17.370668055555555
7 17.991484444444446
8 38.632728888888884
9 16.001440555555554
10 15.887023888888889
11 12.709341944444445
12 23.052629722222225
13 14.986092222222222
14 16.182284722222224
15 5.881564999999999
18 2.8149172222222223
19 17.484405
20 15.888274166666665
21 15.387210833333333
22 16.088641666666668
23 16.49990916666667
This is the aggregate kWh per hour of the day over the entire dataset.
So, now on to the next problem. ;-)
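The working pattern from UPDATE 3 (select only the grouping expression plus aggregates) can be tried outside Hive. The following is a minimal sketch, not the Hive query itself: it assumes Python's built-in sqlite3 module, a hypothetical in-memory table named samples, and SQLite's strftime in place of Hive's hour(from_unixtime(...)). SQLite computes the hour in UTC, whereas Hive uses the server time zone, so the hour buckets differ from the Hive output above. The five rows are the payload_timestamp and payload_wattseconds values from the cqlsh dump.

```python
import sqlite3

# (payload_timestamp, payload_wattseconds) pairs copied from the cqlsh dump above.
SAMPLES = [
    (1403365575, 19893),
    (1403354553, 0),
    (1403374113, 154692),
    (1403354501, 0),
    (1403407054, 6332),
]


def hourly_wattseconds(samples):
    """Return [(hour_of_day_utc, total_wattseconds)] ordered by hour."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table name; only the two columns needed for the sum.
    conn.execute(
        "CREATE TABLE samples (payload_timestamp INTEGER, payload_wattseconds INTEGER)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?)", samples)
    rows = conn.execute(
        """
        SELECT CAST(strftime('%H', payload_timestamp, 'unixepoch') AS INTEGER) AS hourly,
               SUM(payload_wattseconds) AS total_ws
        FROM samples
        -- Group by the full expression, as in the working Hive query.
        GROUP BY CAST(strftime('%H', payload_timestamp, 'unixepoch') AS INTEGER)
        ORDER BY hourly
        """
    ).fetchall()
    conn.close()
    return rows


def to_kwh(wattseconds):
    """Same conversion as the Hive query: watt-seconds -> kWh."""
    return wattseconds / (60 * 60) / 1000


if __name__ == "__main__":
    for hour, total in hourly_wattseconds(SAMPLES):
        print(hour, to_kwh(total))
```

Every selected column is either the grouping expression or an aggregate, which is the rule the 'Expression not in GROUP BY key' error was enforcing.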


Deleting duplicate database records by date in Laravel

I'm currently working on a Laravel 8 application, backed by a PostgreSQL database, in which I'm generating a Cost model for various different items. My intention was to record a maximum of one Cost->value per day, per item; however, due to some issues with overlapping jobs and the way in which I was using the updateOrCreate() method, I've ended up with multiple Cost records per day for each item.
I've since fixed the logic so that I'm no longer getting multiple records per day, but I'd like to now go back and clean up all of the duplicate records.
Is there an efficient way to delete all of the duplicate records per item, leaving the newest record for each day, i.e.: leaving no more than one record per item, per day? While I'm sure this seems pretty straight-forward, I can't seem to land on the correct logic either directly in SQL, or through Laravel and PHP.
Maybe relevant info: Currently, there's ~50k records in the table.
Example table
// Example database table migration
Schema::create('costs', function (Blueprint $table) {
Rough Example (Before)
510,item1,12,2021-07-02,2021-07-02 16:45:17
500,item1,13,2021-07-02,2021-07-02 16:45:05
490,item1,13,2021-07-02,2021-07-02 16:45:01
480,item2,12,2021-07-02,2021-07-02 16:44:59
470,item2,14,2021-07-02,2021-07-02 16:44:55
460,item2,12,2021-07-02,2021-07-02 16:44:54
450,item2,11,2021-07-02,2021-07-02 16:44:53
Rough Example (Desired End-State)
510,item1,12,2021-07-02,2021-07-02 16:45:17
480,item2,12,2021-07-02,2021-07-02 16:44:59
You could use EXISTS():
select * from meuk;
DELETE FROM meuk d
WHERE EXISTS (
    SELECT * FROM meuk x
    WHERE x.item = d.item -- same item
    AND x.updated_at::date = d.updated_at::date -- same date
    AND x.updated_at > d.updated_at -- but: more recent
);
select * from meuk;
id | item | value | created_at | updated_at
510 | item1 | 12 | 2021-07-02 | 2021-07-02 16:45:17
500 | item1 | 13 | 2021-07-02 | 2021-07-02 16:45:05
490 | item1 | 13 | 2021-07-02 | 2021-07-02 16:45:01
480 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:59
470 | item2 | 14 | 2021-07-02 | 2021-07-02 16:44:55
460 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:54
450 | item2 | 11 | 2021-07-02 | 2021-07-02 16:44:53
(7 rows)
id | item | value | created_at | updated_at
510 | item1 | 12 | 2021-07-02 | 2021-07-02 16:45:17
480 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:59
(2 rows)
A different approach, using window functions. The idea is to number all records on the same {item,day} downward, and preserve only the first:
DELETE FROM meuk d
USING (
    SELECT item, updated_at
    , row_number() OVER (PARTITION BY item, updated_at::date
                         ORDER BY item, updated_at DESC
                        ) rn
    FROM meuk x
) xx
WHERE xx.item = d.item
AND xx.updated_at = d.updated_at
AND xx.rn > 1;
Do note that this procedure always involves a self-join: the fate of a record depends on the existence of other records in the same table.
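To try the EXISTS idea without touching the real table, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation, not the PostgreSQL statement: SQLite has no ::date cast, so date(...) is used instead, and the outer table is referenced by name rather than through an alias. The table name meuk and the seven rows are taken from the example above.

```python
import sqlite3

# (id, item, value, created_at, updated_at) rows from the example above.
ROWS = [
    (510, "item1", 12, "2021-07-02", "2021-07-02 16:45:17"),
    (500, "item1", 13, "2021-07-02", "2021-07-02 16:45:05"),
    (490, "item1", 13, "2021-07-02", "2021-07-02 16:45:01"),
    (480, "item2", 12, "2021-07-02", "2021-07-02 16:44:59"),
    (470, "item2", 14, "2021-07-02", "2021-07-02 16:44:55"),
    (460, "item2", 12, "2021-07-02", "2021-07-02 16:44:54"),
    (450, "item2", 11, "2021-07-02", "2021-07-02 16:44:53"),
]


def dedupe_keep_newest(rows):
    """Delete every row that has a newer row for the same item and day.

    Returns the ids that survive, in ascending order.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE meuk (id INTEGER PRIMARY KEY, item TEXT, value INTEGER,"
        " created_at TEXT, updated_at TEXT)"
    )
    conn.executemany("INSERT INTO meuk VALUES (?, ?, ?, ?, ?)", rows)
    conn.execute(
        """
        DELETE FROM meuk
        WHERE EXISTS (
            SELECT 1 FROM meuk x
            WHERE x.item = meuk.item                          -- same item
              AND date(x.updated_at) = date(meuk.updated_at)  -- same day
              AND x.updated_at > meuk.updated_at              -- but more recent
        )
        """
    )
    remaining = [row[0] for row in conn.execute("SELECT id FROM meuk ORDER BY id")]
    conn.close()
    return remaining


if __name__ == "__main__":
    print(dedupe_keep_newest(ROWS))
```

The newest row per item and day never has a "more recent" sibling, so it is the only one the EXISTS test spares. Rows with identical updated_at values for the same item would all survive, so a tie-breaker on id may be needed on real data.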
There's a hairy SQL query here; the simpler one is based on joining the table with itself on <. The < or > determines whether you keep the oldest or the newest value. Sadly, there is no easy way (you cannot trust an ORDER BY combined with a GROUP BY, if I recall correctly).
Since I cannot explain in detail how this query works, here is a Laravel/PHP solution which is inefficient but comprehensible:
$keepIds = [];
// Loop over the table (without Eloquent, for performance).
foreach (DB::table('costs')->orderBy('id', 'ASC')->get() as $cost) {
    // Key by item and day, so that one record per item, per day survives.
    $key = $cost->item . '|' . substr($cost->updated_at, 0, 10);
    // Keep overwriting the entry so that the last overwrite holds the end result.
    $keepIds[$key] = $cost->id;
}
// Remove the records that you do not want to keep.
DB::table('costs')->whereNotIn('id', array_values($keepIds))->delete();
I'm not sure whether that last query will work properly with a very big array, though; it might throw an SQL error.
Note that you can play with the orderBy to choose whether you want to keep the newest or the oldest records.

Insert value based on min value greater than value in another row

It's difficult to explain the question well in the title.
I am inserting 6 values from (or based on values in) one row.
I also need to insert a value from a second row where:
The values in one column (ID) must be equal
The values in column (CODE) in the main source row must be IN (100,200), whereas the other row must have value of 300 or 400
The value in another column (OBJID) in the secondary row must be the lowest value above that in the primary row.
Source Table looks like:
OBJID | CODE | ENTRY_TIME | INFO | ID | USER
1 | 100 | x timestamp| .... | 10 | X
2 | 100 | y timestamp| .... | 11 | Y
3 | 300 | z timestamp| .... | 10 | F
4 | 100 | h timestamp| .... | 10 | X
5 | 300 | g timestamp| .... | 10 | G
So to provide an example..
In my second table I want to insert OBJID, OBJID2, CODE, ENTRY_TIME, substr(INFO(...)), ID, USER
i.e. from my example a line inserted in the second table would look like:
1 | 3 | 100 | x timestamp| substring | 10 | X
4 | 5 | 100 | h timestamp| substring2| 10 | X
My insert for everything that just comes from one row works fine.
WHERE CODE IN (100,200);
I'm aware that I'll need to use an alias on TABLE1, but I don't know how to get the rest to work, particularly in an efficient way. There are 2 million rows right now, but there will be closer to 20 million once I start using production data.
You could try this:
select primary.* ,
(select min(objid)
from table1 secondary
where primary.objid < secondary.objid
and secondary.code in (300,400)
and primary.id = secondary.id
) objid2
from table1 primary
where primary.code in (100,200);
Ok, I've come up with:
select OBJID,
min(case when code in (300,400) then objid end)
over (partition by id order by objid
range between 1 following and unbounded following
) objid2,
from table1;
So you need an INSERT ... SELECT over the above query, filtered with WHERE objid2 IS NOT NULL AND code IN (100,200).
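The pairing rule (same ID, secondary CODE in 300/400, lowest OBJID above the primary row's) can be checked against the five sample rows. This is a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module rather than the original database, it follows the correlated-subquery answer rather than the analytic one, and the column name usr is a stand-in for USER. ENTRY_TIME and INFO are left out because their values are not shown in the question.

```python
import sqlite3

# (objid, code, id, usr) taken from the sample source table above.
SOURCE = [
    (1, 100, 10, "X"),
    (2, 100, 11, "Y"),
    (3, 300, 10, "F"),
    (4, 100, 10, "X"),
    (5, 300, 10, "G"),
]


def pair_rows(source):
    """Pair each 100/200 row with the next 300/400 row that has the same id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (objid INTEGER, code INTEGER, id INTEGER, usr TEXT)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", source)
    rows = conn.execute(
        """
        SELECT objid, objid2, code, id, usr
        FROM (
            SELECT p.objid, p.code, p.id, p.usr,
                   (SELECT MIN(s.objid)
                      FROM table1 s
                     WHERE s.id = p.id              -- same ID
                       AND s.code IN (300, 400)     -- secondary row codes
                       AND s.objid > p.objid        -- lowest OBJID above
                   ) AS objid2
            FROM table1 p
            WHERE p.code IN (100, 200)
        )
        WHERE objid2 IS NOT NULL   -- drop primaries with no matching secondary
        ORDER BY objid
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in pair_rows(SOURCE):
        print(row)
```

Row 2 drops out because no 300/400 row shares its ID. On millions of rows, an index covering (id, code, objid) would likely matter for the subquery, though that is untested here.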

oracle query to get max hour every day, and corresponding row values

I'm having a hard time creating a query to do the following:
I have this table, called LOG:
1 | 2013-04-29 18:00:00.000 | 160473
2 | 2013-04-29 21:00:00.000 | 154281
3 | 2013-04-30 09:00:00.000 | 186552
4 | 2013-04-30 14:00:00.000 | 173145
5 | 2013-04-30 14:30:00.000 | 102235
6 | 2013-05-01 11:00:00.000 | 201541
7 | 2013-05-01 23:00:00.000 | 195234
What I want to do is build a query that returns, for each day, the last values inserted (using the max value of INSERT_TIME). I'm only interested in the date part of that column, and in the column LOG_VALUE. So, this would be my resultset after running the query:
2013-04-29 154281
2013-04-30 102235
2013-05-01 195234
I guess that I need to use GROUP BY over the INSERT_TIME column, along with MAX() function, but by doing that, I can't seem to get the LOG_VALUE. Can anyone help me on this, please?
(I'm on Oracle 10g)
SELECT trunc(insert_time),
       log_value
  FROM (
    SELECT insert_time,
           log_value,
           rank() over (partition by trunc(insert_time)
                        order by insert_time desc) rnk
      FROM log)
 WHERE rnk = 1
is one option. This uses the analytic function rank to identify the row with the latest insert_time on each day.
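The same ranking idea can be run against the sample rows from the question. This is a minimal sketch, not the Oracle query: it assumes Python's built-in sqlite3 module linked against SQLite 3.25 or later (needed for window functions), uses date(...) in place of trunc(...), and uses a hypothetical table name log_entries with an assumed id column.

```python
import sqlite3

# (id, insert_time, log_value) rows from the LOG table in the question.
LOG_ROWS = [
    (1, "2013-04-29 18:00:00.000", 160473),
    (2, "2013-04-29 21:00:00.000", 154281),
    (3, "2013-04-30 09:00:00.000", 186552),
    (4, "2013-04-30 14:00:00.000", 173145),
    (5, "2013-04-30 14:30:00.000", 102235),
    (6, "2013-05-01 11:00:00.000", 201541),
    (7, "2013-05-01 23:00:00.000", 195234),
]


def last_value_per_day(rows):
    """Return [(day, log_value)] for the latest insert_time of each day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE log_entries (id INTEGER, insert_time TEXT, log_value INTEGER)"
    )
    conn.executemany("INSERT INTO log_entries VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT date(insert_time) AS day, log_value
        FROM (
            SELECT insert_time,
                   log_value,
                   -- Rank rows within each day, newest first.
                   RANK() OVER (PARTITION BY date(insert_time)
                                ORDER BY insert_time DESC) AS rnk
            FROM log_entries
        )
        WHERE rnk = 1
        ORDER BY day
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for day, value in last_value_per_day(LOG_ROWS):
        print(day, value)
```

Because rank() gives tied rows the same rank, two rows sharing the latest insert_time on a day would both be returned; row_number() would pick just one.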

Adding an Index degraded execution time

I have a table like this:
myTable (id, group_id, run_date, table2_id, description)
I also have a index like this:
index myTable_grp_i on myTable (group_id)
I used to run a query like this:
select * from myTable t where t.group_id=3 and t.run_date='20120512';
and it worked fine and everyone was happy.
Until I added another index:
index myTable_tab2_i on myTable (table2_id)
My life became miserable: the query now takes almost 5 times longer to run!
The execution plan looks the same (with or without the new index):
| Id | Operation | Name | Rows | Bytes | Cost
| 0 | SELECT STATEMENT | | 1 | 220 | 17019
|* 1 | TABLE ACCESS BY INDEX ROWID| MYTABLE | 1 | 220 | 17019
|* 2 | INDEX RANGE SCAN | MYTABLE_GRP_I | 17056 | | 61
Predicate Information (identified by operation id):
1 - filter("T"."RUN_DATE"='20120512')
2 - access("T"."GROUP_ID"=3)
I have almost no hair left on my head. Why should another index, which is not used and is on a column that is not in the WHERE clause, make a difference?
Here are the things I checked:
a. I removed the new index and the query ran faster.
b. I added the new index in 2 other environments and the same thing happened.
c. I changed MYTABLE_GRP_I to be on columns run_date and group_id, and this made it run as fast as lightning!
But still, why does this happen?
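Observation (c) can at least be illustrated in isolation. The sketch below uses Python's built-in sqlite3 module, not Oracle, so it says nothing about why the extra index slowed the original query; it only shows how a single-column index on group_id narrows the search by one predicate, while a composite index on (run_date, group_id) lets both predicates drive the index search. The index name myTable_date_grp_i is hypothetical, and the column types are assumptions.

```python
import sqlite3

QUERY = (
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM myTable t WHERE t.group_id = 3 AND t.run_date = '20120512'"
)


def plan_for_query(index_sql):
    """Create the table with one index and return the planner's description."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE myTable (id INTEGER, group_id INTEGER, run_date TEXT,"
        " table2_id INTEGER, description TEXT)"
    )
    conn.execute(index_sql)
    # The human-readable detail is the last column of each plan row.
    detail = " ".join(row[-1] for row in conn.execute(QUERY))
    conn.close()
    return detail


if __name__ == "__main__":
    # Index as in the question: only group_id can be used for the index search.
    print(plan_for_query("CREATE INDEX myTable_grp_i ON myTable (group_id)"))
    # Composite index as in (c): both predicates are used for the index search.
    print(
        plan_for_query(
            "CREATE INDEX myTable_date_grp_i ON myTable (run_date, group_id)"
        )
    )
```

This mirrors the Oracle plan shown above, where GROUP_ID is the access predicate and RUN_DATE is only a filter applied to each of the roughly 17,000 rows fetched through the index.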

How many Include I can use on ObjectSet in EntityFramework to retain performance?

I am using the following LINQ query for my profile page:
var userData = from u in db.Users
where u.UserId == userId
select u;
It has a long object graph and uses many Includes. It is running perfectly right now, but when the site has many users, will it impact performance much?
Should I do it in some other way?
A query with includes returns a single result set, and the number of includes affects how much data is transferred from the database server to the web server. Example:
Suppose we have an entity Customer (Id, Name, Address) and an entity Order (Id, CustomerId, Date). Now we want to query a customer with her orders:
var customer = context.Customers
    .Include("Orders")
    .SingleOrDefault(c => c.Id == 1);
The resulting data set will have the following structure:
Id | Name | Address | OrderId | CustomerId | Date
1 | A | XYZ | 1 | 1 | 1.1.
1 | A | XYZ | 2 | 1 | 2.1.
It means that the Customer data is repeated for each Order. Now let's extend the example with two more entities: OrderLine (Id, OrderId, ProductId, Quantity) and Product (Id, Name). Now we want to query a customer with her orders, order lines and products:
var customer = context.Customers
    .Include("Orders.OrderLines.Product")
    .SingleOrDefault(c => c.Id == 1);
The resulting data set will have the following structure:
Id | Name | Address | OrderId | CustomerId | Date | OrderLineId | LOrderId | LProductId | Quantity | ProductId | ProductName
1 | A | XYZ | 1 | 1 | 1.1. | 1 | 1 | 1 | 5 | 1 | AA
1 | A | XYZ | 1 | 1 | 1.1. | 2 | 1 | 2 | 2 | 2 | BB
1 | A | XYZ | 2 | 1 | 2.1. | 3 | 2 | 1 | 4 | 1 | AA
1 | A | XYZ | 2 | 1 | 2.1. | 4 | 2 | 3 | 6 | 3 | CC
As you can see, the data becomes heavily duplicated. Generally, each include of a reference navigation property (Product in the example) adds new columns, and each include of a collection navigation property (Orders and OrderLines in the example) adds new columns and duplicates the already created rows for each row in the included collection.
It means that your example can easily have hundreds of columns and thousands of rows, which is a lot of data to transfer. The correct approach is to create performance tests, and if the results do not meet your expectations, you can modify your query and load navigation properties separately, either with their own queries or with the LoadProperty method.
Example of separate queries:
var customer = context.Customers
    .Include("Orders")
    .SingleOrDefault(c => c.Id == 1);
var orderLines = context.OrderLines
    .Include("Product")
    .Where(l => l.Order.Customer.Id == 1)
    .ToList();
Example of LoadProperty:
var customer = context.Customers
.SingleOrDefault(c => c.Id == 1);
context.LoadProperty(customer, c => c.Orders);
Also, you should always load only the data you really need.
Edit: I just created a proposal on Data UserVoice to support an additional eager loading strategy, where eager-loaded data would be passed in an additional result set (created by a separate query within the same database roundtrip). If you find this improvement interesting, don't forget to vote for the proposal.
You can improve the performance of many includes by issuing 2 or more small data requests to the database, as below.
In my experience, you should use at most 2 includes per query, as below; more than that gives really bad performance.
var userData = from u in db.Users
userData = from u in db.Users
The above fetches smaller data sets from the database, at the cost of more round trips to the database.
Yes it will. Avoid using Include if it expands multiple detail rows on a master table row.
I believe EF converts the query into one large join instead of several queries. Therefore, you'll end up duplicating your master table data over every row of the details table.
For example: Master -> Details. Say, master has 100 rows, Details has 5000 rows (50 for each master).
If you lazy-load the details, you return 100 rows (size: master) + 5000 rows (size: details).
If you use .Include("Details"), you return 5000 rows (size: master + details). Essentially, the master portion is duplicated 50 times over.
It multiplies upwards if you include multiple tables.
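The row arithmetic above can be reproduced with a plain join. This is a minimal sketch using Python's built-in sqlite3 module and hand-written SQL; it approximates the shape of the result sets, not the SQL that EF actually generates. The table and column names (master, details, name, note) are hypothetical.

```python
import sqlite3


def row_counts(masters=100, details_per_master=50):
    """Compare one joined result set with two separate result sets.

    Returns (joined_rows, master_rows, detail_rows, distinct_masters_in_join).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE details (id INTEGER PRIMARY KEY, master_id INTEGER, note TEXT)"
    )
    conn.executemany(
        "INSERT INTO master VALUES (?, ?)",
        [(m, f"master-{m}") for m in range(masters)],
    )
    conn.executemany(
        "INSERT INTO details (master_id, note) VALUES (?, ?)",
        [(m, f"detail-{d}") for m in range(masters) for d in range(details_per_master)],
    )
    # Eager-loading shape: every detail row carries a copy of its master columns.
    joined = conn.execute(
        "SELECT m.id, m.name, d.id, d.note"
        " FROM master m JOIN details d ON d.master_id = m.id"
    ).fetchall()
    # Separate-query shape: master columns are transferred once per master.
    master_rows = conn.execute("SELECT id, name FROM master").fetchall()
    detail_rows = conn.execute("SELECT id, master_id, note FROM details").fetchall()
    conn.close()
    distinct_masters = len({row[0] for row in joined})
    return len(joined), len(master_rows), len(detail_rows), distinct_masters


if __name__ == "__main__":
    print(row_counts())
```

The joined set has as many rows as the details table, but each row is wider because it repeats the master columns. That repetition is the cost being described, and it grows with the number of master columns.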
Check the SQL generated by EF.
I would recommend you to perform load tests and measure the performance of the site under stress. If you are performing complex queries on each request you may consider caching some results.
The result of Include may change: it depends on the entity on which the Include method is called.
As in the example proposed by Ladislav Mrnka, suppose that we have an entity
Customer (Id, Name, Address)
that maps to this table:
Id | Name | Address
C1 | Paul | XYZ
and an entity Order (Id, CustomerId, Total)
that maps to this table:
Id | CustomerId | Total
O1 | C1 | 10.00
O2 | C1 | 13.00
The relation is one Customer to many Orders.
Example 1: Customer => Orders
var customer = context.Customers
    .Include("Orders")
    .SingleOrDefault(c => c.Id == "C1");
LINQ will be translated into a very complex SQL query.
In this case the query will produce two records, and the information about the customer will be replicated.
Customer.Id | Customer.Name | Order.Id | Order.Total
C1 | Paul | O1 | 10.00
C1 | Paul | O2 | 13.00
Example 2: Order => Customer
var order = context.Orders
    .Include("Customer")
    .SingleOrDefault(c => c.Id == "O1");
LINQ will be translated into a simple SQL join.
In this case the query will produce only one record, with no duplication of information:
Order.Id | Order.Total | Customer.Id | Customer.Name
O1 | 10.00 | C1 | Paul
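The two directions can be compared with the sample tables above. This is a minimal sketch using Python's built-in sqlite3 module; the SQL is hand-written to mimic the shape of each result and is not what EF emits. The table names customers and orders, and the snake_case column names, are assumptions.

```python
import sqlite3


def include_shapes():
    """Return (customer_with_orders_rows, order_with_customer_rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, address TEXT)")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT, total REAL)")
    conn.execute("INSERT INTO customers VALUES ('C1', 'Paul', 'XYZ')")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("O1", "C1", 10.00), ("O2", "C1", 13.00)],
    )
    # Example 1 shape: start from the customer, pull in the collection of orders.
    customer_with_orders = conn.execute(
        """
        SELECT c.id, c.name, o.id, o.total
        FROM customers c
        LEFT JOIN orders o ON o.customer_id = c.id
        WHERE c.id = 'C1'
        ORDER BY o.id
        """
    ).fetchall()
    # Example 2 shape: start from one order, pull in its single customer.
    order_with_customer = conn.execute(
        """
        SELECT o.id, o.total, c.id, c.name
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        WHERE o.id = 'O1'
        """
    ).fetchall()
    conn.close()
    return customer_with_orders, order_with_customer


if __name__ == "__main__":
    with_orders, with_customer = include_shapes()
    print(with_orders)
    print(with_customer)
```

Including a collection from the "one" side yields one row per child, while including a reference from the "many" side keeps the row count at one.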
