My wide data look like this:
What I am trying to accomplish is the long format:
I have many Score_X's and each score has many items. So the less hard-coding (e.g. Convert data from wide format to long format in SQL) the better.
I have thought about a few ways to do this; unfortunately, Hive lacks many features that other SQL implementations have. So first I would appreciate a solution to my problem, and secondly, if anyone knows easy ways to emulate these things in Hive, please do share with me.
The pivot function, which Hive doesn't have.
I tried to apply Joe Stefanelli's answer in Selecting all columns that start with XXX using a wildcard?. Hive does not have INFORMATION_SCHEMA either. I was told (also on Stack Overflow) that I could get table metadata by first installing MySQL and then taking a detour through it; I don't feel like spending that much effort on a task as simple as reshaping a table...
Then I think I can combine the values of Score_A_1, Score_A_2 and Score_A_3 into one Score_A array and then do a LATERAL VIEW EXPLODE, as in myui's answer to How to transpose/pivot data in hive?. But I Googled around and could not find a tutorial for doing that.
Thanks. Your help is greatly appreciated.
Update:
So the array function will create an array column from multiple columns. Now I am doing the LATERAL VIEW EXPLODE; through hard-coding (i.e., a non-dynamic query) I am getting what I want. However, it is difficult to believe that there is no simpler way to perform a data management task as basic as reshaping. Am I missing something fundamental about Hive?
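For completeness, here is the kind of hard-coded query I ended up with; a minimal sketch, assuming a table named wide_scores with columns id, Score_A_1, Score_A_2 and Score_A_3 (placeholder names, not my real schema):

    -- combine the Score_A_* columns into one array, then explode to long format
    SELECT id, score_a
    FROM wide_scores
    LATERAL VIEW explode(array(Score_A_1, Score_A_2, Score_A_3)) t AS score_a;

    -- using a map instead keeps the item name alongside each value
    SELECT id, item, score_a
    FROM wide_scores
    LATERAL VIEW explode(map('Score_A_1', Score_A_1,
                             'Score_A_2', Score_A_2,
                             'Score_A_3', Score_A_3)) t AS item, score_a;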
I have data from multiple sources - a combination of Excel (table and non-table), CSV and, sometimes, even TSV.
I create queries for each data source and then bring them together one step at a time (actually, two steps: merge, then expand to bring in the fields I want from each data source).
This doesn't feel very efficient and I think that maybe I should be just joining everything together in the Data Model. The problem when I did that was that I couldn't then find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query once I'd created all my relationships between my tables.
I feel as though I'm missing something: How can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream in Power Query. If you can, work with database views rather than full tables; use a modular approach (several smaller queries that you then connect in the data model); filter early; remove unneeded columns; and so on.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter having a comprehensive list of best practices (you can also just search for that term; there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.
I use stored procedures on DB instance "A" to store data in a GTT (global temporary table). To get the original data, I have to go over a DB link to DB instance "B". Therefore I put together the whole query and send it to the remote DB instance.
This works fine. But sometimes it seems that Oracle is not using the best plan or the correct indexes for queries. Is there a way to force Oracle to use specific indexes? I tried to use hints, but honestly I didn't understand the difference between all these options.
Thanks for helping me!
There is a huge temptation to optimize a query one way when you want it to work another way. Adding hints is a temporary solution which can backfire on you when the amount or type of data in the table changes or when you upgrade to a newer version with a newer optimizer.
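For what it's worth, this is what hint syntax looks like; a minimal sketch, assuming a hypothetical employees table with an index named emp_name_idx:

    -- ask the optimizer to use a specific index (table and index names are hypothetical)
    SELECT /*+ INDEX(e emp_name_idx) */ e.*
    FROM   employees e
    WHERE  e.last_name = 'Smith';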
First, determine that there is a problem. Are all queries taking too long? Just some? Only the first one?
The easiest thing to do is to make sure the indexes on that table are up to date. Then look at optimizing the query by using the explain plan feature to see what indexes are being used.
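A minimal sketch of that workflow, using the same hypothetical table:

    -- generate and then display the execution plan for a query
    EXPLAIN PLAN FOR
    SELECT e.* FROM employees e WHERE e.last_name = 'Smith';

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);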
It's also prudent to examine your data to see whether the query is selecting different things or different numbers of records, if it is time-based.
I have been reading that cursors are pretty slow and that one should avoid them unless out of options. I am trying to optimize my stored procedures, and one of them uses a cursor. It is frequently called by my application, with a lot of users (20,000) and rows to update. I was thinking maybe I should use something else as an alternative.
All I want is to get a list of records and then operate on them depending on each row's values. So, for example, say we have:
Employee: Id, Name, BenefitId, StartDate, EndDate
Based on BenefitId, I need to do a different calculation using the dates between StartDate and EndDate and update the employee's details. I am just making up this contrived example to give an idea of my situation.
What are your thoughts? Are there better alternatives to cursors, say temp tables or user-defined functions? When should you really opt for them, or should we never use cursors at all? Thanks, everyone, for your help.
I once changed a stored procedure from cursors to set based logic. Running time went from 8 hours to 22 seconds. That's the kind of difference we're talking about.
Instead of taking a different action one record at a time, use several passes over the data: update and set field1 = A where field2 is X, then update and set field1 = B where field2 is Y, and so on.
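A minimal sketch of that multi-pass idea against the Employee table from the question (the SomeDetail column and the calculations are placeholders, and I'm assuming SQL Server's T-SQL):

    -- one set-based pass per BenefitId instead of one cursor iteration per row
    UPDATE Employee
    SET    SomeDetail = DATEDIFF(DAY, StartDate, EndDate) * 1.5
    WHERE  BenefitId = 1;

    UPDATE Employee
    SET    SomeDetail = DATEDIFF(DAY, StartDate, EndDate) * 2.0
    WHERE  BenefitId = 2;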
I've changed out cursors and moved from over 24 hours of processing time to less than a minute.
To help you see how to fix your proc with set-based logic, read this:
http://wiki.lessthandot.com/index.php/Cursors_and_How_to_Avoid_Them
A cursor does row-by-row processing, or "Row By Agonizing Row" if your name is Jeff Moden.
This is just one example of set-based SQL programming as opposed to RBAR, but ultimately it depends on what your cursor is doing.
Also, have a look at this on StackOverflow:
RBAR vs. Set based programming for SQL
First off, it sounds like you are trying to mix some business logic in your stored procs. That's generally something you want to avoid. A better solution would be to have a middle tier layer which encapsulates that business logic. That way your data layer remains purely data.
To answer your original question, it really depends on what you are using the cursors for. In some cases you can use a table variable or a temp table. You have to remember to free up temp tables, though, so I would suggest using table variables whenever possible. Sometimes, though, there is just no way around a cursor: maybe the original DBAs didn't normalize enough (or normalized too much) and you are forced to traverse multiple tables without any foreign key relationships.
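A minimal sketch of the table variable pattern in T-SQL (the SomeDetail column and the filter are made up for illustration):

    -- collect the rows to work on, then update them set-based
    DECLARE @ToProcess TABLE (
        Id        INT PRIMARY KEY,
        BenefitId INT
    );

    INSERT INTO @ToProcess (Id, BenefitId)
    SELECT Id, BenefitId
    FROM   Employee
    WHERE  EndDate >= GETDATE();

    UPDATE e
    SET    e.SomeDetail = 0
    FROM   Employee e
    JOIN   @ToProcess t ON t.Id = e.Id;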
One thing I always wonder while writing queries is whether I am writing the most optimized query or not. I know certain things, like:
1) using SELECT field1, field2 instead of SELECT *
2) Giving proper indexes to the tables
but I am sure there are more things to keep in mind when writing queries, since most databases only grow, and an optimal query helps with execution time. Can you share some tips and tricks on writing queries?
Testing is the best way to measure performance. Monitor your queries on the live database and make use of things like the slow query log.
I would also recommend enabling the query cache, which will give most typical usage situations a massive boost.
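A quick sketch of checking and sizing the cache (note that the query cache was deprecated in MySQL 5.7 and removed in 8.0):

    -- inspect the current query cache settings
    SHOW VARIABLES LIKE 'query_cache%';

    -- allocate 64 MB to the query cache
    SET GLOBAL query_cache_size = 67108864;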
Use proper data types for your fields
Use the back-tick character (`) to quote identifiers that collide with reserved keywords
When dealing with multiple tables, try using joins (see the sketch below)
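For instance, a join plus aggregation often replaces a per-row subquery; a hedged sketch, with made-up customers and orders tables:

    -- one join and aggregation instead of a subquery per customer
    SELECT c.id, c.name, COUNT(o.id) AS order_count
    FROM   customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name;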
Resources:
20 SQL Tips
As well as the Do's and Don'ts, you may find the Hidden Features of MySQL useful.
As a matter of fact, no "tips" can help you.
Database design requires deep knowledge, not tips.
There is always a "weight" to each of these "don'ts". Most such listings manage to list the most unimportant things while failing to mention the important ones. Your list, for example, as if this were a culinary forum:
Always use a knife with a black handle
To prepare a good dish you need to choose proper ingredients.
The first one sounds impressive but never helps in the real world.
The second one is right, but must be backed by deep knowledge to get it right.
So, it must be a book, not tips. The ones by Paul DuBois are among the recommended.
Use the following fields in every table (see the CREATE TABLE sketch after this list):
tablename_id (auto increment, unsigned zerofill)
created_by (timestamp)
tablerow_status (enum('t','f'), default 't')
Always add a comment when you create a field in MySQL (it helps when you search in phpMyAdmin).
Always take care of the normalization forms.
If a field will always be positive, make it UNSIGNED.
Use the DECIMAL data type instead of FLOAT in some cases (e.g. a discount can be at most 99.99%, so use DECIMAL(5,2)).
Use the DATE and TIME data types where needed; don't use TIMESTAMP everywhere.
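A minimal CREATE TABLE sketch following those conventions (the table and the extra columns are made up):

    CREATE TABLE employee (
        employee_id     INT UNSIGNED ZEROFILL NOT NULL AUTO_INCREMENT COMMENT 'surrogate key',
        name            VARCHAR(100)  NOT NULL COMMENT 'employee full name',
        discount        DECIMAL(5,2)  UNSIGNED NOT NULL DEFAULT 0.00 COMMENT 'percent, at most 99.99',
        hired_on        DATE          NOT NULL COMMENT 'hire date',
        created_by      TIMESTAMP     NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'row creation time',
        tablerow_status ENUM('t','f') NOT NULL DEFAULT 't' COMMENT 'soft-delete flag',
        PRIMARY KEY (employee_id)
    );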
Correlated subqueries are very bad for performance, but they are often not well understood and end up in production. They can often be fixed by using a derived table and a join instead.
http://en.wikipedia.org/wiki/Correlated_subquery
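A hedged sketch of such a rewrite, with a made-up employees table:

    -- correlated: the inner query runs once per outer row
    SELECT e.name
    FROM   employees e
    WHERE  e.salary > (SELECT AVG(s.salary)
                       FROM   employees s
                       WHERE  s.dept_id = e.dept_id);

    -- derived table + join: the average is computed once per department
    SELECT e.name
    FROM   employees e
    JOIN  (SELECT dept_id, AVG(salary) AS avg_salary
           FROM   employees
           GROUP BY dept_id) d ON d.dept_id = e.dept_id
    WHERE  e.salary > d.avg_salary;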
One more thing I found today concerns the difference between COUNT(*) and COUNT(col):
Using COUNT(*) is faster than COUNT(col).
MyISAM tables cache the number of rows; InnoDB does not cache the row count, so COUNT(*) may be slower without a WHERE clause.
For both MyISAM and InnoDB, it is better to count a NOT NULL column than a column where NULL is allowed.
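A quick illustration with a hypothetical orders table:

    SELECT COUNT(*)           FROM orders;  -- row count; MyISAM answers from cached metadata
    SELECT COUNT(customer_id) FROM orders;  -- counts only rows where customer_id IS NOT NULL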
More details here
I'm working on a project with a friend that will use HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResults when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?
I think you, like many of us, are making the mistake of treating Bigtable and HBase like just another RDBMS, when HBase is actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. That means, for example, ideally storing many-to-one relationships within a single row. Your queries should return very few rows but contain (potentially) many data points.
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
If you want to access HBase using a query language and a JDBC driver, it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make things a little easier.
I looked at Hadoop and Hbase and as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered JDBC compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem more like what I was after. (Personally, I haven't gotten further with either of these than reading the documentation, so I can't tell which of them is any good, if any.)
I'd recommend taking a look at the Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) and implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like a RDBMS. So often in fact that I've had to re-write code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their db api, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig tutorial and Hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
Regards,
David