Avoid duplication in data pulled from Cassandra - Elasticsearch

Background: I am getting information from various logfiles and a Cassandra table. The logfiles are fine, but fetching from the table gives me duplicates inside Elasticsearch, as I can't fetch only the rows added since sql_last_run.
How do I avoid duplication of rows?

One way to avoid this is to create your own document IDs by computing the SHA or MD5 hash of the raw log line.
That way the same log line, even if read repeatedly, will always produce the same ID and you won't get any duplicate documents anymore.
Another solution is to add a column to your table containing a unique GUID and use that value as the document ID.
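For illustration, here is a minimal Python sketch of the hash-based ID approach using a recent elasticsearch-py client; the index name, field name, and client settings are assumptions, not part of the original setup.

import hashlib
from elasticsearch import Elasticsearch

# Client address and the "logs" index name are assumptions for this sketch.
es = Elasticsearch("http://localhost:9200")

def index_line(raw_line):
    # The hash of the raw line becomes the document ID, so re-reading the
    # same line overwrites the same document instead of creating a new one.
    doc_id = hashlib.sha256(raw_line.encode("utf-8")).hexdigest()
    es.index(index="logs", id=doc_id, document={"message": raw_line})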

Related

Stop HBase update operation if it has the same value

I have a table in HBase named 'xyz'. When I do an update operation on this table, it updates the table even though it is the same record.
How can I prevent the second record from being added?
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
The put statements above give 2 records with different timestamps if I check all versions. I am expecting that it should not add the 2nd record because it has the same value.
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
As mentioned by Ben Watson, HBase is best known for its write performance, since it doesn't need to check for the existence of a value; multiple versions are maintained by default.
One hack you can use is custom versioning. In the example above, the row key already has two versions. If you insert the same record with the same timestamp, HBase will simply overwrite the existing cell with the new value instead of adding another version.
NOTE: It is left to your application to supply the same timestamp for a particular value.
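As a rough Python sketch of the GET-before-PUT idea (and the custom-timestamp trick), using the happybase Thrift client; the host name is an assumption and the table, row key, and values mirror the shell example above.

import happybase

connection = happybase.Connection('hbase-host')  # hypothetical Thrift server
table = connection.table('ns:xyz')

def put_if_changed(row_key, column, value):
    # GET first: skip the write if the stored value is already identical.
    current = table.row(row_key, columns=[column]).get(column)
    if current != value:
        table.put(row_key, {column: value})

put_if_changed(b'1', b'cf1:name', b'NewYork')
put_if_changed(b'1', b'cf1:name', b'NewYork')  # second call writes nothing

# Custom-versioning alternative: a fixed timestamp makes an identical write
# overwrite the existing cell instead of adding another version.
# table.put(b'1', {b'cf1:name': b'NewYork'}, timestamp=1)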

SSIS Lookup transformation not finding matches

I have a Lookup transformation that does not seem to be finding obvious matches. I have an input file with 43 records that all include the same CustomerID, which is defined as an 8-byte signed integer. I am using the Lookup to see if the CustomerID already exists in my destination table, where CustomerID is defined as BIGINT.
For testing, I truncated the Lookup (destination) table. I have tried all three Cache settings with the same results.
When I run the SSIS package, all 43 records are sent through the No Match Output side. I would think that only the 1st record should go that way and all the others would be considered matches, since they have the same CustomerID. Additionally, if I run the job a second time (without truncating the destination), they are all flagged as Matched.
It seems as if the cache is not being looked at in the Lookup. Ultimately I want the NO Match records to be written to the Destination table and the Matched records to have further processing.
Any ideas?
The Lookup transformation is working as expected. I am not sure what your understanding of the Lookup is, so I'll go point by point.
For testing, I truncated the Lookup (destination) table. I have tried all three Cache settings with the same results.
When I run the SSIS package, all 43 records are sent through the No Match Output side.
The above behavior is expected. After the truncate, the Lookup is essentially trying to find those 43 records within your truncated destination table. Since it can't find any, it flags them all as new records, i.e. sends them to the No Match output.
If I run the job a second time (without truncating the destination) then they are all flagged as Matched.
In this case, all those 43 records from the file are found in the destination table, hence the Lookup treats them as duplicates and they are flagged as Match output.
I am using the Lookup to see if the CustomerID already exists in my destination table.
To achieve this, all you need to do is send the Match output to a staging table that can be periodically truncated (as those rows are duplicates), and send all the No Match output to your destination table.
You can also post a screenshot of your Lookup in case you want further help.
The Lookup can't be used this way. SSIS data flows execute in a transaction, so while the package is running, no rows are written to the destination until the entire data flow completes. Regardless of the Cache setting, the new rows being sent to your destination table are therefore not considered by the Lookup while it is running. When you run the package again, those rows are considered. This is expected behavior.

RethinkDB: Get fields inside indexes instead of just index names

I am trying to make a tool that will live-copy a DB from one RethinkDB host to another, however I am hung up on the fact that I can't seem to find out what is actually in each index. I have tried
r.db('db').table('table').index_list()
and
r.db('db').table('table').info()
I even tried
r.db('db').table('table').index_list().info()
But all three only returned the names of the indexes, not which fields they are built on. This makes it impossible to re-create the table on the destination DB exactly as it is on the source.
What am I missing here? There has to be a way to do this, or is this just something missing from RethinkDB? If so, does anyone know why?
Indexes are computed from the documents in the table. If you read all of the documents from the first table (with e.g. r.db('db').table('table').run()) and insert them all into the second table, then re-create all the indexes, you will have successfully re-created the table.
As usual I only get answers from people who don't read my question or who want to answer questions that weren't asked.
The solution is to parse the data from
r.db('db').table('table').index_status()
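For reference, a small Python sketch of that approach: read index_status() on the source table and feed each index's binary 'function' blob back into index_create() on the destination. The host names and the db/table names are placeholders.

from rethinkdb import RethinkDB  # older drivers: import rethinkdb as r

r = RethinkDB()
src = r.connect(host='source-host')  # placeholder hosts
dst = r.connect(host='dest-host')

for idx in r.db('db').table('table').index_status().run(src):
    # idx['function'] is a binary representation of the index expression;
    # passing it to index_create() rebuilds the same index on the copy.
    r.db('db').table('table').index_create(idx['index'], idx['function']).run(dst)

r.db('db').table('table').index_wait().run(dst)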

Deleting large number of rows of an Oracle table

I have a company data table of about 250 GB with 35 columns. I need to delete around 215 GB of the data, which
is obviously a very large number of rows to delete from the table. The table has no primary key.
What could be the fastest method to delete data from this table? Are there any tools in Oracle for such large deletion processes?
Please suggest the fastest way to do this using Oracle.
As said in another answer here, it's better to move the rows to be retained into a separate table and truncate the original table, because of something called the HIGH WATERMARK. More details can be found here: http://sysdba.wordpress.com/2006/04/28/how-to-adjust-the-high-watermark-in-oracle-10g-alter-table-shrink/ . A plain delete operation would also overwhelm your UNDO tablespace.
The term 'recovery model' rather applies to MSSQL, I believe :).
Hope this clarifies the matter a bit.
Thanks.
Do you know which records need to be retained? How will you identify each record?
A solution might be to move the records to be retained to a temp table, and then truncate the big table. Afterwards, move the retained records back (a rough sketch of this approach follows below).
Beware that the transaction log file might become very big because of this (but that depends on your recovery model).
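A minimal Python sketch of that keep-and-truncate approach, driven with the python-oracledb driver; the table name, the keep predicate, and the connection details are placeholders, and on a real table you would also need to deal with indexes, constraints, and grants.

import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# 1. Copy only the rows to keep (~35 GB here) into a work table.
cur.execute("CREATE TABLE big_table_keep AS SELECT * FROM big_table WHERE keep_flag = 'Y'")

# 2. Truncate the original: resets the high watermark and generates minimal undo/redo.
cur.execute("TRUNCATE TABLE big_table")

# 3. Direct-path insert the retained rows back, then drop the work table.
cur.execute("INSERT /*+ APPEND */ INTO big_table SELECT * FROM big_table_keep")
conn.commit()
cur.execute("DROP TABLE big_table_keep")
conn.close()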
We had a similar problem a long time ago: a table with 1 billion rows in it, from which we had to remove a very large proportion of the data based on certain rules. We solved it by writing a Pro*C job to extract the data that we wanted to keep, apply the rules, and sprintf the rows to be kept to a CSV file.
Then we created a SQL*Loader control file to upload the data using the direct path option, which won't create undo/redo (and if you need to recover the table, you still have the CSV file until your next backup anyway).
The sequence was
Run the Pro*C to create CSV files of data
generate DDL for the indexes
drop the indexes
run the sql*load using the CSV files
recreate indexes using parallel hint
analyse the table using degree(8)
The amount of parallelism depends on the CPUs and memory of the DB server - we had 16 CPUs and a few gigs of RAM to play with, so that was not a problem.
The extract of the correct data was the longest part of this.
After a few trial runs, SQL*Loader was able to load the full 1 billion rows (that's a US billion, or 1000 million rows) in under an hour.

Informatica Data Quality - Match Analysis

In our duplicate analysis requirement, the input data has 1418 records, of which 1380 are duplicates.
On using Match Analysis (Key Generator, Matcher, Associator, Consolidator) in IDQ integrated with PowerCenter, all duplicates except 8 records were eliminated.
On executing the workflow again with these records excluded, duplicates appear among other records for which no duplicate occurred in the previous run.
Can anyone tell why this mismatch occurs?
It looks like your Consolidator transformation is not getting the correct association IDs and is hence inserting multiple records, resulting in duplicates.
Please try the steps below:
1) Try to create a workflow in IDQ itself by deploying the mapping which you developed in IDQ.
2) Also keep a check on the business keys of the records that make up the primary key through which you are identifying the duplicates in the source.

Resources