Count difference in PIG query and MySql query results in GROUP BY - hadoop

My PIG Query is given below
emp = LOAD 'hdfs://master:9000/hrms/DimEmployee' AS (EmployeeID,OrganizationID,EmploymentType);
grouped = group emp by (OrganizationID, EmploymentType);
AggEmploymentType = FOREACH grouped GENERATE group.OrganizationID, group.EmploymentType,COUNT(emp.EmployeeID) as cnt;
DUMP AggEmploymentType;
Below is a step-by-step description of the above Pig query:
Load 100097 records from an HDFS file, which is tab delimited.
Group the records by Company, EmploymentStatus.
Count the records by EmployeeID.
Dump the output.
After executing the above query, the Pig shell says it successfully read 100115 records.
I have three problems after the Pig query executes successfully:
Why is Pig reading more records than are available in HDFS
(100115 > 100097)?
Why is there a warning message "ACCESSING_NON_EXISTENT_FIELD 27 TIMES"?
Why does the result have a count difference of 9 when I run the same GROUP BY query in MySQL?
Please help as soon as possible; my Pig/Hadoop project depends on a prompt response, and I have been stuck on this problem for the last 5 days.

I don't think it is a coincidence that you are loading extra records and also getting the accessing-non-existent-field warning. That warning shows up when you are loading lines that don't have enough columns. For example, you might get it for a line like hello,world when you are expecting 3 columns.
Suggestion: The other thing to note is that COUNT(x) does not count items that are null. Try swapping out COUNT(emp.EmployeeID) with COUNT_STAR(emp.EmployeeID). COUNT_STAR takes nulls into account.
Suggestion: One thing Pig will do when a field is missing is simply put a null in it. I suggest you add a filter before the GROUP that removes records with nulls (and also, potentially, "bad" records).
emp = FILTER emp BY EmployeeID IS NOT NULL AND
OrganizationID IS NOT NULL AND
EmploymentType IS NOT NULL;
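Putting both suggestions together, a sketch of what the revised script could look like (only a sketch: the emp_clean alias is introduced here, and COUNT_STAR is used so that null EmployeeIDs cannot silently change the count):
emp = LOAD 'hdfs://master:9000/hrms/DimEmployee' AS (EmployeeID,OrganizationID,EmploymentType);
-- drop incomplete lines so they cannot distort the group counts
emp_clean = FILTER emp BY EmployeeID IS NOT NULL AND OrganizationID IS NOT NULL AND EmploymentType IS NOT NULL;
grouped = GROUP emp_clean BY (OrganizationID, EmploymentType);
AggEmploymentType = FOREACH grouped GENERATE group.OrganizationID, group.EmploymentType, COUNT_STAR(emp_clean.EmployeeID) AS cnt;
DUMP AggEmploymentType;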

Related

Import massive table from Oracle to PostgreSQL with oracle-fdw return ORA-01406

I am working on a project that transfers data from an Oracle database to a PostgreSQL database to build a data warehouse with bash & SQL scripts. To access the Oracle database, I use the PostgreSQL extension oracle-fdw.
One of my scripts imports data from a massive table (~100 000 000 new rows/day). This table is partitioned and each partition contains 1 day of data. The query I use to import data looks like this:
INSERT INTO postgre_target_table (some_fields)
SELECT some_aggregated_fields -- (~150 fields)
FROM oracle_source_table
WHERE partition_id = :v_partition_id AND some_others_filters
GROUP BY primary_key;
On the DEV server the query works fine (there is much less data on this server), but on PREPROD it returns the error ORA-01406: fetched column value was truncated.
In some posts, people say that the output fields may be too small, but if I send a simple SELECT query without INSERT or GROUP BY, I get the same error.
Another idea I found in another post is to create an Oracle-side view, but my query uses multiple parameters that I cannot use in a view.
The last idea I found is to create an Oracle stored procedure that fills a table with aggregated data and then import the data from this table, but the Oracle database is critical and my customer prefers to avoid adding more data to it.
Now I'm starting to think there's no solution, and that's not good...
PostgreSQL version : 12.4 / Oracle version : 11.2
UPDATE
It seems my problem is more complicated than I thought.
After applying the modification given by Laurenz Albe, the query runs correctly in pgAdmin, but the problem still appears when I use the psql command.
Moreover, another query seems to have the same problem. This other query does not use the same source table as the first query; it uses 4 joined tables without any partitioning. The common point between these queries is their structure.
The detail I omitted in the original post is that the purpose of both queries is to pivot a table. They look like this:
SELECT osr.id,
       MIN(CASE osr.category WHEN 123 THEN 1 END) AS field1,
       MIN(CASE osr.category WHEN 264 THEN 1 END) AS field2,
       MIN(CASE osr.category WHEN 975 THEN 1 END) AS field3,
       ...
FROM oracle_source_table osr
WHERE osr.category IN (123, 264, 975, ...)
GROUP BY osr.id;
Now that I have detailed what the queries look like, here are some results I got with the second one without changing the value of max_long (this query is lighter than the first one):
Sometimes it works (~10%) and sometimes it fails (~90%) in pgAdmin, but it never works with the psql command.
If I delete the WHERE clause, it always works.
I don't understand why deleting the WHERE clause changes anything: the field used in this clause is a NUMBER(6, 0) between 0 and 2500, and it is still used in the SELECT clause. Oh, and in the 4 Oracle tables used by this query there is no LONG datatype, only NUMBER.
Among the 20 queries I have, only these two have a problem; their structure is similar and I don't believe in coincidences.
Don't despair!
Set the max_long option on the foreign table big enough that all your oversized data fit.
The documentation has the details:
max_long (optional, defaults to "32767")
The maximal length of any LONG, LONG RAW and XMLTYPE columns in the Oracle table. Possible values are integers between 1 and 1073741823 (the maximal size of a bytea in PostgreSQL). This amount of memory will be allocated at least twice, so large values will consume a lot of memory.
If max_long is less than the length of the longest value retrieved, you will receive the error message
ORA-01406: fetched column value was truncated
Example:
ALTER FOREIGN TABLE my_tab OPTIONS (ADD max_long '1000000');
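If max_long was already set on the foreign table, ALTER FOREIGN TABLE expects SET instead of ADD. A sketch, reusing the foreign table name from the question and the documented maximum value (keeping in mind the memory warning above, a value just large enough for your data is preferable):
ALTER FOREIGN TABLE oracle_source_table OPTIONS (SET max_long '1073741823');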

Hive + Tez :: A join query stuck at last 2 mappers for a long time

I have a views table joined with a temp table, with the parameters below intentionally enabled.
hive.auto.convert.join=true;
hive.execution.engine=tez;
The Code Snippet is,
CREATE TABLE STG_CONVERSION AS
SELECT CONV.CONVERSION_ID,
CONV.USER_ID,
TP.TIME,
CONV.TIME AS ACTIVITY_TIME,
TP.MULTI_DIM_ID,
CONV.CONV_TYPE_ID,
TP.SV1
FROM VIEWS TP
JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;
In the normal scenario, both tables can have any number of records.
However, in the SCU_TMP table, only 10-50 records are expected per User ID.
But in some cases, a couple of User IDs come with around 10k-20k records in the SCU_TMP table, which creates a cross-product effect.
In such cases, it runs forever with just 1 mapper left to complete.
Is there any way to optimise this and run this gracefully?
I was able to find a solution with the query below.
set hive.exec.reducers.bytes.per.reducer=10000;
CREATE TABLE STG_CONVERSION AS
SELECT CONV.CONVERSION_ID,
CONV.USER_ID,
TP.TIME,
CONV.TIME AS ACTIVITY_TIME,
TP.MULTI_DIM_ID,
CONV.CONV_TYPE_ID,
TP.SV1
FROM (SELECT USER_ID,TIME,MULTI_DIM_ID,SV1 FROM VIEWS SORT BY TIME) TP
JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;
The problem arises because, when a single User ID dominates the table, the join for that user gets processed by a single mapper, which gets stuck.
Two modifications fixed it:
1) Replaced the table name with a subquery, which added a sorting step before the join.
2) Reduced the hive.exec.reducers.bytes.per.reducer parameter to 10KB.
The SORT BY TIME in step (1) added a shuffle phase, which evenly distributed the data that was previously skewed by User ID.
Reducing the bytes-per-reducer parameter spread the data across all available reducers.
With these two enhancements, a 10-12 hour run was reduced to 45 minutes.
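As a side note that is not from the original answer: Hive also ships a skew-join optimization that is sometimes worth trying when one join key dominates; whether it helps depends on the data and on the Hive/Tez version, so treat the settings below as a sketch (the threshold value is illustrative):
set hive.optimize.skewjoin=true;
-- join keys with more rows than this threshold are treated as skewed and processed separately
set hive.skewjoin.key=100000;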

Hive Query dumping issue

I am facing difficulties getting the dump (a text file delimited by ^) for a query in Hive, for my project on sentiment analysis of the stock market using Twitter.
The query whose output I want in HDFS or the local file system is given below:
hive> select t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
The query produces the correct output, but when I try dumping it to HDFS it gives me an error.
I went through various other solutions on Stack Overflow for dumping query results, but I was not able to find one that suits my case.
Thanks for your help.
INSERT OVERWRITE DIRECTORY '/path/to/dir'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st
ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
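If the dump is wanted on the local file system instead of HDFS, the same statement takes a LOCAL keyword; note that the ROW FORMAT clause on INSERT OVERWRITE DIRECTORY needs Hive 0.11 or later, and the path below is only a placeholder:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tweet_dump'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st
ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));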

Insert into ElasticSearch using Hive/Qubole

I am trying to insert data into Elasticsearch from a Hive table.
CREATE EXTERNAL TABLE IF NOT EXISTS es_temp_table (
dt STRING,
text STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='aggr_2014-10-01/metric','es.index.auto.create'='true')
;
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table
However, the data is off. When I do a count(*) on other_table I get 6,000 rows, but when I search the aggr_2014-10-01 index, I see 10,000 records! Somehow the records are being duplicated (rows are being copied over multiple times). Maybe I could remove the duplicate records inside Elasticsearch? I'm not sure how I would do that, though.
I believe it might be a result of Hive/Qubole spawning two task attempts for every mapper. If one attempt succeeds, it tries to kill the other; however, the other attempt has already done its damage (i.e., inserted into Elasticsearch). This is my best guess, but I would prefer to know the exact reason and whether there is a way for me to fix it.
set mapred.map.tasks.speculative.execution=false;
One thing I found was to set speculative execution to false, so that only one task attempt is spawned per mapper (see the setting above). However, now I am seeing undercounting. I believe this may be due to records being skipped, but I am unable to diagnose why those records would be skipped in the first place.
In this version, it also means that if even one task/mapper fails, the entire job fails, and then I need to delete the index (partial data was uploaded) and rerun the entire job (which takes ~4 hours).
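A note that is not from the original post: if any of the work runs in reducers (as in the DISTRIBUTE BY variant further below), the reduce-side counterpart of the setting above may need to be disabled as well, assuming the same old-style mapred parameter names:
set mapred.reduce.tasks.speculative.execution=false;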
[PROGRESS UPDATE]
I attempted to solve this by putting all of the work in the reducers (it's the only way I found to ensure that only one task handles each record, so there are no duplicate insertions).
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table
DISTRIBUTE BY cast(rand()*250 as int);
However, I now see a huge undercount: only 2,000 records! Elasticsearch does estimate some things, but not to this extent; there are simply fewer records in Elasticsearch. This may be due to failed tasks (that are no longer retried), or it may be from Qubole/Hive skipping over malformed entries. But I set:
set mapreduce.map.skip.maxrecords=1000;
Here are some other settings for my query:
set es.nodes=node-names;
set es.port=9200;
set es.bulk.size.bytes=1000mb;
set es.http.timeout=20m;
set mapred.tasktracker.expiry.interval=3600000;
set mapred.task.timeout=3600000;
I determined the problem. As I suspected, the insertion was skipping over some records that were considered "bad." I was never able to find exactly which records were being skipped, but I tried replacing all non-alphanumeric characters with a space, and that solved the problem! The records are no longer being skipped, and all the data is uploaded to Elasticsearch.
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, REGEXP_REPLACE(description, '[^0-9a-zA-Z]+', ' ')
FROM other_table

Getting count from large tables

I was trying to get the count from a table with millions of entries. My query looks somewhat like this:
Select count(*)
from Users
where status = 'A' and office_id = '000111' and user_type = 'C'
Status can be A or C, User Type can be C or R.
Status, Office_id and User_type are Strings
The result has around 10 million rows, and it's taking a lot of time. I just want the total count.
I would appreciate it if anyone could tell me why it's taking this much time, and a workaround if there is one.
Do let me know if any more details are required.
The database engine is Oracle 11g
Edit: I added indexes for all three columns. Still there's no improvement. I also tried the query below, but it always returns the total count of the table without checking the conditions.
SELECT COUNT(office_id_key)
FROM Users
WHERE EXISTS (SELECT * FROM Users WHERE status = 'A' AND office_id = '000111' AND user_type = 'C')
Why not simply create indexes on the table on age and place? This way your search will be faster than scanning the entire table for these values.
CREATE INDEX age_index ON Employee(age);
CREATE INDEX place_index ON Employee(place);
This should speed up the process.
AMENDED BASED ON QUERY CHANGE
CREATE INDEX status_index ON Users(status);
CREATE INDEX office_id_index ON Users(office_id);
CREATE INDEX user_type_index ON Users(user_type);
You'll want to create the following multi-column index on the Users table to improve the query:
(office_id, status, user_type)
The database can use a "covering" index with COUNT(*). Create the index with the columns in that order, due to cardinality.
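For example, something like the following (the index name is just illustrative):
CREATE INDEX users_office_status_type_idx ON Users (office_id, status, user_type);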
After adding the indexes, I think changing where to where exists and a subquery may help as well.
Edit 2: I removed the EXISTS approach, as it was returning everything as valid; usually the subquery has multiple joins, but I guess the single-table case returns true for every row. I read that COUNT is optimized to act similarly to EXISTS when it has only one table and no WHERE clause, so I treat the results as a table. Hopefully this will give the same quick results.
select count(1) from
(select 1 from Employee where age = '25' and place = 'bricksgate')
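Adapted to the table and columns from the question, that would look something like:
select count(1) from
(select 1 from Users where status = 'A' and office_id = '000111' and user_type = 'C')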
Edit: When you use WHERE EXISTS, the DB server doesn't load your data into memory and it also takes advantage of the indexes, because you will be reading values from the indexes rather than doing costly table lookups. You may also want to change count(*) to count(place); that way it limits the count to an indexed field as well.
In your original query, the database was doing table lookups and then loading the rows into memory just to count them.
count(1) is often said to work faster than count(*), although on Oracle the optimizer treats the two the same.
