I am facing difficulties getting a dump (a text file delimited by ^) of a Hive query for my project: sentiment analysis of the stock market using Twitter.
The query that should produce the output in HDFS or the local filesystem is given below:
hive> select t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
The query produces the correct output, but when I try dumping it to HDFS I get an error.
I went through various other solutions on Stack Overflow for dumping query results, but I was not able to find one that suits my case.
Thanks for your help.
INSERT OVERWRITE DIRECTORY '/path/to/dir'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st
ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
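If you need the dump on the local filesystem rather than HDFS, the same statement should work with the LOCAL keyword (the directory path below is just an example; note that ROW FORMAT DELIMITED on INSERT OVERWRITE DIRECTORY requires Hive 0.11 or later):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tweet_dump'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname, t.datecol, t.tweet, st.diff FROM tweet t LEFT OUTER JOIN stock st
ON (t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));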
I was playing around with a simple dataset that you can find here.
No matter what I do, calling the SUM() aggregate function on the 4th column of the given data set returns the wrong answer.
Here is the exact code that I have used:
create database beep_boop;
use beep_boop;
create table cause (year INT, sex STRING, cause STRING, value INT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
tblproperties("skip.header.line.count" = "1");
load data inpath '/user/verterse/CauseofDeath.csv' into table cause;
select sum(value) from cause;
The answer that I get is 11478567 as shown in the screenshot here.
But using the SUM() in MS Excel gives an answer of 12745563.
I tried deleting the table/database and recreating them from scratch. I tried uploading the CSV file again. I tried different datatypes like INT and BIGINT for the value column. I tried skipping and not skipping the header line. Nothing works. I also know that the file is being read completely, because select count(*) from cause; returns the correct answer of 1016.
P.S.: I am new to Hadoop, Hive and big data in general.
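One thing worth checking (a hedged guess, since I cannot see the file): Hive loads any cell that does not parse as an INT as NULL, and SUM() silently skips NULLs while count(*) counts every row, so malformed value cells (for example, quoted numbers containing commas) would produce exactly this symptom. A quick check:
select count(*) from cause where value is null;
If that returns anything above 0, those rows are being dropped from the sum.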
I am trying to subtract two columns and fetch the result if the difference is greater than 100 in Hive. I have written the following query:
select District.ID,Year,(volume_IN-volume_OUT) as d1 from petrol where d1>100;
but I am getting an error.
The table column names are:
District.ID, Distributer name, volume_IN ,volume_OUT, Year
Please help me: is there an error in the query? I am new to Hive.
One of the limitations of Hive is that you cannot refer to a column alias in the WHERE clause of the same query that defines it.
Try writing a subquery, maybe something like the one below:
select * from (select District_ID,year, (volume_IN-volume_OUT) as d1 from petrol) t1 where d1>100;
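Alternatively, since the alias is the only problem, you could repeat the expression in the WHERE clause (same column spellings as above):
select District_ID, year, (volume_IN-volume_OUT) as d1 from petrol where (volume_IN-volume_OUT) > 100;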
I have a Spark DataFrame that contains timestamps and machine IDs. I wish to remove the lowest timestamp value from each group. I tried the following code:
sqlC <- sparkRHive.init(sc)
ts_df2<- sql(sqlC,"SELECT ts,Machine FROM sdf2 EXCEPT SELECT MIN(ts),Machine FROM sdf2 GROUP BY Machine")
But I get the following error:
16/04/06 06:47:52 ERROR RBackendHandler: sql on 35 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: missing EOF at 'SELECT' near 'EXCEPT'; line 1 pos 35
What is the problem? If HiveContext does not support the EXCEPT keyword, what would be an equivalent way of doing the same thing in HiveContext?
The Spark 1.6.1 programming guide lists the supported and unsupported Hive features:
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
I don't see EXCEPT in either category. I saw elsewhere that Hive QL doesn't support EXCEPT, or at least did not at that time.
Hive QL Except clause
Perhaps try a table of the per-machine minimums and then do a left outer join against it, as in that answer? Something like:
SELECT s.ts, s.Machine FROM sdf2 s
LEFT OUTER JOIN (SELECT Machine, MIN(ts) AS mints FROM sdf2 GROUP BY Machine) mins
ON (s.Machine = mins.Machine AND s.ts = mins.mints) WHERE mins.mints IS NULL;
You can also use the SparkR built-in function except(), though I think you would need to create your mins DataFrame first:
exceptDF <- except(df, df2)
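A minimal sketch of that, assuming df is the SparkR DataFrame registered as the table sdf2 (the name df is illustrative; written against the SparkR 1.6 API and untested):
# per-machine minimum timestamps, aliased to ts so the schemas line up
minsDF <- agg(groupBy(df, df$Machine), ts = min(df$ts))
minsDF <- select(minsDF, "ts", "Machine")
# except() keeps rows of the first DataFrame that are absent from the second
result <- except(select(df, "ts", "Machine"), minsDF)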
I have a requirement to do a nested select within a where clause in a Hive query. A sample code snippet would be as follows:
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible, or am I doing something wrong here? I am getting an error while running the above script.
To further elaborate on what I am trying to do: there is a Cassandra keyspace to which I publish statistics with a timestamp. Periodically (hourly, for example) these stats are summarized using Hive; once summarized, that data is stored separately with the corresponding hour. So when the query runs for the second time (and on consecutive runs), it should only run on the new data (i.e. timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate Hive table, and then using that value to filter out the raw stats.
Can this be achieved using Hive?
Subqueries inside a WHERE clause are not supported in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM B);
can be rewritten to:
SELECT a.KEY, a.val
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY)
Looking at the business requirements underlying your question, it occurs to me that you might get more efficient results by partitioning your Hive table by hour. If the data can be written using this factor as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they reach the scale of millions, but this seems like a case that will not tease that limitation.
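A minimal sketch of that idea (all table and column names here are illustrative, not from the question):
-- raw stats partitioned by hour; each hourly summary run reads a single partition
CREATE TABLE raw_stats (metric STRING, value BIGINT)
PARTITIONED BY (stat_hour STRING);

SELECT metric, SUM(value)
FROM raw_stats
WHERE stat_hour = '2014-01-15-09' -- partition pruning: only the new hour is scanned
GROUP BY metric;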
It will work if you write it as:
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION: The operators >, < and = need exactly one value on the right-hand side, whereas here the subquery can return multiple values, which can only be handled with the IN clause.
My Pig query is given below:
emp = LOAD 'hdfs://master:9000/hrms/DimEmployee' AS (EmployeeID,OrganizationID,EmploymentType);
grouped = group emp by (OrganizationID, EmploymentType);
AggEmploymentType = FOREACH grouped GENERATE group.OrganizationID, group.EmploymentType,COUNT(emp.EmployeeID) as cnt;
DUMP AggEmploymentType;
Given below is a step-by-step description of the above Pig query:
1. LOAD 100097 records from the HDFS file, which is tab delimited.
2. Group the records by Company, EmploymentStatus.
3. Count the records by EmployeeID.
4. Dump the output.
After the above query executes, the Pig shell says it successfully read 100115 records.
I am getting the following three problems after the Pig query executes successfully:
1. Why is Pig reading more records than are available in HDFS? (100115 > 100097)
2. Why is there a warning message "ACCESSING_NON_EXISTENT_FIELD 27 TIMES"?
3. The count differs by 9 when I run the same group-by query in MySQL.
Please solve my problem as soon as possible; my Pig/Hadoop project depends on your prompt response. I have been stuck for the last 5 days due to the above problem.
I don't think it is a coincidence that you are loading extra records and also getting the accessing-non-existent-field warning. That warning shows up during loading when a line doesn't have enough columns: for example, you might get it for a line like hello,world when you are expecting 3 columns. A line that gets split into extra records (say, by a stray newline) would inflate the record count and leave fields missing at the same time.
Suggestion: the other thing to note is that COUNT(x) does not count items that are null. Try swapping COUNT(emp.EmployeeID) for COUNT_STAR(emp.EmployeeID); COUNT_STAR takes nulls into account.
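Applied to the script above, the change is just (a sketch of the modified line):
AggEmploymentType = FOREACH grouped GENERATE
    group.OrganizationID,
    group.EmploymentType,
    COUNT_STAR(emp.EmployeeID) AS cnt; -- COUNT_STAR also counts null EmployeeIDs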
Suggestion: when Pig doesn't have a value for a field, it just puts a null in it. I suggest you add a filter before the GROUP that removes records with nulls (and potentially other "bad" records):
emp = FILTER emp BY EmployeeID IS NOT NULL AND
OrganizationID IS NOT NULL AND
EmploymentType IS NOT NULL;