Upsolver Hive or Athena output has an Upsert Partition Fields property. What does this do? - upsolver

When we create a Hive or Athena output in Upsolver, the properties include an Upsert Partition Fields option. What does this property really do, and should we set it to Yes or No?

Our recommendation is to keep it set to Yes, as it improves overall performance.
This applies when your output is an Upsert Output, and we recommend Upsert Partition Fields = Yes. Processing is more efficient this way, the historical record is preserved in the older partitions, and the view always returns the most recent record; the catalog is automatically updated to point to it. For example, if the upsert key is userId and a new event arrives for the same userId, it is written only to the current partition (say the current date partition, if you have partitioned by date) and the catalog is updated; historical records for the same userId in older date partitions are not touched. The underlying table keeps all records, while the view shows only the latest one.
With Upsert Partition Fields = No, eventually only the most recent copy is maintained (the table and the view end up looking roughly alike), but processing is a little less efficient because older records have to be removed from older partitions.
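As an illustration of the Yes behaviour, here is a minimal Athena/Hive-style sketch of a partitioned table that keeps every historical copy, plus a view that exposes only the latest record per upsert key. The table, view, and column names (user_events, user_events_latest, userId, event_time, event_date) and the S3 location are hypothetical and not Upsolver's generated DDL.

    -- Hypothetical partitioned table written by an upsert output; every
    -- version of a record stays in the partition it was written to.
    CREATE EXTERNAL TABLE user_events (
        userId     STRING,
        status     STRING,
        event_time TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/user_events/';

    -- View that returns only the most recent record per upsert key (userId),
    -- matching the behaviour described above for the query-facing view.
    CREATE VIEW user_events_latest AS
    SELECT userId, status, event_time, event_date
    FROM (
        SELECT userId, status, event_time, event_date,
               ROW_NUMBER() OVER (PARTITION BY userId ORDER BY event_time DESC) AS rn
        FROM user_events
    ) t
    WHERE rn = 1;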

Related

Hive partition table query optimisation

I am new to Hive, and the Hadoop ecosystem in general. From what I have learnt of the basics of Hive, you can create partitions on a Hive table based on certain attributes, and if a query mentions that attribute it should get a performance boost, since Hive scans only that particular partition instead of the whole table. My question is: suppose we have some hierarchical structure in the data. Say I partition a table on unique state values; every time a query filters on state, Hive would scan only that state's partition instead of the whole table. However, every state also has unique district names. If I make a query based only on district values, would Hive scan the whole table?
If so, is there some way to change the query so that I can manually instruct Hive to look only at the particular state partition to which the district belongs, and then perform the other operations only on that partition, instead of scanning the whole table for matching district values?
One of the strengths of Hive is that it has strong support for partitioning. However, it cannot read your mind when you write queries.
If you have a partition on state, then you need state in the where clause for partition pruning. So, if you query only on district, the whole table would be scanned.
If you have a partition on district, then you need the district. A query on state would scan the whole table.
If you have a partition on both . . . well, then it is a little more complicated to declare, but your queries would read a minority of partitions with either state or district.
If you are just learning about partitions, I would advise you to start with date partitions. These are the most common and a good way to get familiar with the concept.
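To make the pruning behaviour concrete, here is a small HiveQL sketch under assumed names (a sales table partitioned by state and district); the point is simply which WHERE clauses let Hive skip partitions.

    -- Hypothetical table partitioned by state and then district.
    CREATE TABLE sales (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (state STRING, district STRING);

    -- Prunes down to a single (state, district) partition directory:
    SELECT SUM(amount) FROM sales WHERE state = 'Karnataka' AND district = 'Mysore';

    -- Prunes to the partitions of one state:
    SELECT SUM(amount) FROM sales WHERE state = 'Karnataka';

    -- With (state, district) partitioning, filtering on district alone can still
    -- prune using the partition metadata; but if the table were partitioned on
    -- state only, this query would have to scan every partition.
    SELECT SUM(amount) FROM sales WHERE district = 'Mysore';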

Importing data incrementally from RDBMS to hive/hadoop using sqoop

I have an Oracle database and need to import data into a Hive table. The daily import data size would be around 1 GB. What would be the better approach?
If I import each day's data as a partition, how can the updated values be handled?
For example, suppose I imported today's data as a partition, and the next day some fields are updated with new values.
Using --lastmodified we can get the updated values, but where should they go: to a new partition or to the old (already existing) partition?
If I send them to a new partition, the data is duplicated.
If I want to send them to the already existing partition, how can that be achieved?
Your only option is to overwrite the entire existing partition with 'INSERT OVERWRITE TABLE...'.
The question is: how far back are you going to be constantly updating the data?
I can think of 3 approaches you can consider:
1. Decide on a threshold for 'fresh' data, for example '14 days back' or '1 month back'. Then, each day you run the job, you overwrite partitions (only the ones which have updated values) going backwards, up to the decided threshold. With ~1 GB a day it should be feasible. Data from before the chosen window is not guaranteed to be 100% correct. This scenario can be relevant if you know the fields can only change within a certain time window after they are initially set (see the sketch below).
2. Make your Hive table compatible with ACID transactions, thus allowing updates on the table.
3. Split your daily job into 2 tasks: the new data written for the run day, and the updated data that you need to process backwards. Sqoop handles the new data; take care of the updated data 'manually' (some script that generates the update statements).
Alternatively, don't use partitions based on time; maybe dynamic partitioning is more suitable for your use case. It depends on the nature of the data being handled.
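Here is a minimal HiveQL sketch of the first approach, assuming Sqoop re-imports the whole fresh-data window (say the last 14 days) from Oracle into a staging table each day; the table and column names (orders, orders_staging, ds) are hypothetical.

    -- Allow INSERT OVERWRITE to write whichever date partitions appear in the SELECT.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Overwrite only the partitions inside the fresh-data window; partitions
    -- older than the threshold are left untouched. ds is assumed to be a
    -- 'yyyy-MM-dd' string partition column.
    INSERT OVERWRITE TABLE orders PARTITION (ds)
    SELECT order_id, customer_id, amount, status, ds
    FROM orders_staging
    WHERE ds >= date_format(date_sub(current_date(), 14), 'yyyy-MM-dd');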

Historical Data Comparison in realtime - faster in SQL or code?

I have a requirement in the project I am currently working on to compare the most recent version of a record with the previous historical record to detect changes.
I am using the Azure Offline data sync framework to transfer data from a client device to the server, which causes records in the synced table to update based on user changes. I then have a trigger copying each update into a history table, and a SQL query, run when building a list of changes, that compares the current record with the most recent historical one by doing column comparisons - mainly string values, but some integer and date values too.
Is this the most efficient way of achieving this? Would it be quicker to load the data into memory and perform a code based comparison with rules?
Also, if I continually store all the historical data in a SQL table, will this affect performance over time, and would I be better off storing this data in something like Azure Table Storage? I am also thinking along the lines of cost, as SQL usage is much more expensive than Table Storage, but obviously I cannot use a trigger there and would need to insert each synced row into Table Storage manually.
You could avoid querying and comparing the historical data altogether, because the most recent version is already in the main table (and if it's not, it will certainly be new/changed data).
Consider a main table with 50,000 records and 1,000,000 records of historical data (and growing every day).
Instead of updating the main table directly and then querying the 1,000,000 records (and extracting the most recent record), you could query the smaller main table for that one record (probably by ID), compare the fields, and only if there is a change (or no data yet) update those fields and add the record to the historical data (or use a trigger / stored procedure for that).
That way you don't even need a database (probably containing multiple indexes) for the historical data; you could even store it in a flat file if you wanted, depending on what you want to do with that data.
The sync framework I am using deals with the actual data changes, so I only get new history records when there is an actual change. Given a batch of updates to a number of records, I need to compare all the changes with their previous state and produce an output list of what's changed.
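One way to express that comparison on the SQL side is sketched below in T-SQL: for each record, pair its most recent history row with the previous one and list the records whose compared columns differ. The table and column names (RecordHistory, RecordId, HistoryDate, Name, Status, DueDate) are hypothetical, and NULL handling is simplified.

    WITH ordered AS (
        SELECT h.RecordId, h.Name, h.Status, h.DueDate,
               ROW_NUMBER() OVER (PARTITION BY h.RecordId
                                  ORDER BY h.HistoryDate DESC) AS rn
        FROM dbo.RecordHistory AS h
    )
    SELECT cur.RecordId
    FROM ordered AS cur
    JOIN ordered AS prev
      ON prev.RecordId = cur.RecordId AND prev.rn = 2
    WHERE cur.rn = 1
      AND (   cur.Name    <> prev.Name
           OR cur.Status  <> prev.Status
           OR cur.DueDate <> prev.DueDate);  -- add ISNULL/COALESCE if columns can be NULL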

Elasticsearch - indexing SQL Server table - what is available today?

This question was probably asked multiple times, but I still can't find a definite answer.
I have a table in a SQL Server database. Data is inserted into the table constantly; nothing gets deleted or updated. Each row has a unique id and a date column. I want to index this table in ES: the initial index plus all the new inserts coming in, and, if possible, a separate index per day.
What is available today? I see that rivers are deprecated. I found the JDBC importer; is that the one to use? Or Logstash?
Thanks in advance.

Does postgresql index update on inserting new row?

Sorry if this is a dumb question, but do I need to reindex my table every time I insert rows, or does the new row get indexed when it is added?
From the manual
Once an index is created, no further intervention is required: the system will update the index when the table is modified
http://postgresguide.com/performance/indexes.html
I think that when you insert rows, the index does get updated: it maintains the sort order of the index as you insert data. Hence there is some performance overhead on a table if you try adding a large number of rows at once.
On top of the other answers: PostgreSQL is a top-notch relational database. I'm not aware of any relational database system where indexes are not updated automatically.
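A small psql-style sketch (hypothetical table and index names) showing that no manual step is needed after inserts:

    CREATE TABLE events (
        id         bigserial PRIMARY KEY,
        user_id    integer NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    CREATE INDEX events_user_id_idx ON events (user_id);

    -- The index is maintained as part of the INSERT itself; no REINDEX needed.
    INSERT INTO events (user_id) VALUES (42);

    -- The new row is immediately findable through the index (on a larger table
    -- the plan would show an Index Scan using events_user_id_idx).
    EXPLAIN SELECT * FROM events WHERE user_id = 42;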
It seems to depend on the type of index. For example, according to https://www.postgresql.org/docs/9.5/brin-intro.html, for BRIN indexes:
When a new page is created that does not fall within the last summarized range, that range does not automatically acquire a summary tuple; those tuples remain unsummarized until a summarization run is invoked later, creating initial summaries. This process can be invoked manually using the brin_summarize_new_values(regclass) function, or automatically when VACUUM processes the table.
Although this seems to have changed in version 10.
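A short sketch of the BRIN case, with hypothetical names; brin_summarize_new_values is the function quoted above, and the autosummarize storage parameter is, I believe, the PostgreSQL 10 change being referred to:

    CREATE TABLE measurements (
        recorded_at timestamptz NOT NULL,
        value       double precision
    );

    -- Plain BRIN index: page ranges added after creation stay unsummarized
    -- until VACUUM runs or summarization is invoked manually.
    CREATE INDEX measurements_brin ON measurements USING brin (recorded_at);

    -- Summarize any not-yet-summarized page ranges right away:
    SELECT brin_summarize_new_values('measurements_brin');

    -- PostgreSQL 10+ can summarize new ranges automatically via autovacuum:
    CREATE INDEX measurements_brin_auto ON measurements USING brin (recorded_at)
        WITH (autosummarize = on);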

Resources