What is the best way to compare datasets from two different tables? - etl

What is the best way to compare datasets from two different tables? The tables contain an enormous amount of data, and I also need to find the differences between them.

You could use the Except method in LINQ, which returns the rows of dataTable1 that have no match in dataTable2:
var differences =
dataTable1.AsEnumerable().Except(dataTable2.AsEnumerable(),
DataRowComparer.Default);
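Except only reports one direction (rows in the first table that are missing from the second). A minimal sketch of a two-way comparison is shown here in Python rather than C#, purely as an illustration; the function name diff_rows and the sample rows are hypothetical, and each row is assumed to fit in memory as a hashable tuple. Unlike Except, this version keeps duplicate rows from the input.

```python
def diff_rows(left, right):
    """Return (only_in_left, only_in_right) for two iterables of row tuples."""
    left, right = list(left), list(right)
    # Sets give constant-time membership tests, which matters for large inputs.
    left_set, right_set = set(left), set(right)
    only_left = [row for row in left if row not in right_set]
    only_right = [row for row in right if row not in left_set]
    return only_left, only_right


# Hypothetical sample data: each tuple is one row.
table1 = [(1, "a"), (2, "b"), (3, "c")]
table2 = [(2, "b"), (3, "x")]
only_in_1, only_in_2 = diff_rows(table1, table2)
```

For tables too large to hold in memory, the same comparison is usually better pushed down to the database.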

Related

I would like to compare data between tables

I would like to compare data between two tables, say source and destination, and output the differences.
The problem is that there is a mapping table which stores the columns of each source table and the corresponding columns of the destination table.
For example,
Table: T_MAP
SourceTableName  SourceTableColumns  DestinationTable  DestinationTableColumn
s_t1             s_t1_col1           d_t1              d_t1_col1
s_t1             s_t1_col2           d_t               d_t1_col2
s_t2             s_t2_col1           d_t2              d_t2_col1
....
So the question is how to compare the data between the two tables using the mapping table.
My current idea is to use a dynamic cursor to generate dynamic SQL statements, then use MINUS plus UNION ALL to compare the data. But performance may be a big problem.
Are there any thoughts?
Please help.
Thanks in advance.
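The statement generation described in the question could be sketched outside the database as follows. This is only an illustration in Python under stated assumptions: the function name build_diff_queries is hypothetical, the sample rows are modelled on T_MAP (assuming the second row maps to d_t1), and the mapping table is assumed to be trusted, because table and column names are interpolated directly into the SQL text.

```python
from collections import defaultdict


def build_diff_queries(mapping_rows):
    """Build one MINUS + UNION ALL comparison query per (source, destination) pair."""
    pairs = defaultdict(list)
    for src_table, src_col, dst_table, dst_col in mapping_rows:
        pairs[(src_table, dst_table)].append((src_col, dst_col))

    queries = {}
    for (src_table, dst_table), cols in pairs.items():
        src_cols = ", ".join(s for s, _ in cols)
        dst_cols = ", ".join(d for _, d in cols)
        # Rows only in source, followed by rows only in destination.
        queries[(src_table, dst_table)] = (
            f"(SELECT {src_cols} FROM {src_table} MINUS SELECT {dst_cols} FROM {dst_table}) "
            f"UNION ALL "
            f"(SELECT {dst_cols} FROM {dst_table} MINUS SELECT {src_cols} FROM {src_table})"
        )
    return queries


# Hypothetical mapping rows in the shape of T_MAP.
mapping = [
    ("s_t1", "s_t1_col1", "d_t1", "d_t1_col1"),
    ("s_t1", "s_t1_col2", "d_t1", "d_t1_col2"),
    ("s_t2", "s_t2_col1", "d_t2", "d_t2_col1"),
]
queries = build_diff_queries(mapping)
```

Each generated string is a single set-based statement, so the comparison itself still runs inside the database rather than row by row.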

Combining multiple tables with r.union in RethinkDB

I will be dynamically combining a range of tables with the exact same structure in RethinkDB.
I have my dynamically-generated list of tables in an array as follows:
tables = [r.table('table1'), r.table('table2'), ...]
And I am trying to do this:
r.union(r.args(tables))
But that just gives me an error: ReqlLogicError: Expected type DATUM but found TABLE
Overall, I have not been able to find a way to generate a list of tables in JavaScript and use r.union to combine them into a stream. I would appreciate help on this.
Thanks!
You can use reduce to do what you want. It merges the tables one by one, which is equivalent to r.table(t1).union(r.table(t2)).union(r.table(t3)).
Like this:
[r.table('t1'), r.table('t2'), r.table('t3')].reduce(function(p, c) {
return p.union(c)
})
Try it from data explorer.
The answer provided by kureikain works. I still wish the functionality existed in RethinkDB with r.args() (it seems to me that this would be consistent with the documentation of that function).
Moreover, one important tip tangentially related to this question: if you want to combine multiple tables into a stream through r.union() but still be able to tell which table each result came from, use merge(). So my query would look something like this:
[r.db('database').table('table1').merge({source: 'table1'}), r.db('database').table('table2').merge({source: 'table2'})].reduce(function(p, c) { return p.union(c) }).filter( ...)
This allows you to not only combine multiple tables into one stream, but to always distinguish between the source tables in your results (by looking up the value of the key 'source').
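The fold-and-tag pattern itself does not depend on RethinkDB. As a rough illustration only, here is the same shape in plain Python with in-memory lists standing in for tables; the helper name tagged and the sample data are hypothetical, and no RethinkDB driver is involved.

```python
from functools import reduce
from itertools import chain


def tagged(name, rows):
    """Lazily yield each row with an extra 'source' key naming its table."""
    return ({**row, "source": name} for row in rows)


# Hypothetical stand-ins for tables: name -> list of documents.
tables = {"table1": [{"id": 1}], "table2": [{"id": 1}, {"id": 2}]}

streams = [tagged(name, rows) for name, rows in tables.items()]
# Fold the list pairwise, mirroring p.union(c) in the ReQL query.
combined = list(reduce(lambda p, c: chain(p, c), streams))
```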

How to fill a Cassandra Column Family from another one's columns?

I have always read that Cassandra is good if your application changes frequently and features are added frequently.
That makes sense: since you don't have a fixed schema, you can add columns to rows to meet your needs, instead of running an ALTER TABLE query which may freeze your database for hours on very large tables.
However, I have a hypothetical problem which I'm not able to solve.
Let's say I have:
CREATE COLUMN FAMILY Students
with comparator='CompositeType(UTF8Type,UTF8Type)'
and key_validation_class=UUIDType;
Each student has some generic columns (you know, meta:username, meta:password, meta:surname, etc.), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).
On the other side, I may have a Courses CF, where each row contains the UUIDs of all the students following that course.
So I can ask "which courses are followed by XXX" and "which students follow course YYY".
The problem is: what if I didn't create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn't a requirement.
This is a simple example, but I believe it's quite common. "With Cassandra you plan CFs in terms of queries instead of relationships". I need that query now, while at first it wasn't needed.
Given a table of students with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig or Hive? (I have never touched any of those, just guessing.)
Pig (which uses the Hadoop integration) is actually perfect for this type of work, because you can not only read but also write data back into Cassandra using CassandraStorage. It gives you the parallel processing capability to do the job with minimal time and overhead. Otherwise the alternative is to write something to do the extraction yourself, then write the new CF.
Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:
rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;
cass_group = GROUP avgs BY group;
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
If you use the existing Cassandra file, you would have to unwind the data. Since NoSQL files are unidirectional, this could be a very time-consuming operation in Cassandra itself: the data would have to be sorted in the opposite order from the first file. Frankly, I believe that you would have to go back to the original data that was used to populate the first file and populate the new file from that.
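Whichever tool ends up running it, the transformation being asked for is an inversion of the student-to-course columns. A minimal sketch of that inversion in Python, assuming the Students rows have already been read into a dictionary keyed by student ID whose columns are (prefix, name) pairs; the function name build_courses_index, the placeholder IDs and the in-memory layout are all hypothetical and not tied to any Cassandra client API.

```python
from collections import defaultdict


def build_courses_index(students):
    """Invert student -> course columns into course -> set of student IDs."""
    courses = defaultdict(set)
    for student_id, columns in students.items():
        for prefix, name in columns:
            # Only the composite columns under the 'course' prefix matter here.
            if prefix == "course":
                courses[name].add(student_id)
    return dict(courses)


# Hypothetical rows: composite column (prefix, name) -> value.
students = {
    "uuid-1": {("meta", "username"): "anna", ("course", "ID1"): "", ("course", "ID2"): ""},
    "uuid-2": {("meta", "username"): "ben", ("course", "ID2"): ""},
}
courses = build_courses_index(students)
```

Each entry of the result corresponds to one row of the new Courses CF, with one column per student ID.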

How do I compare Record Sets or Record Groups in Oracle?

I have an assignment where I have two tables. Both tables have multiple records that can be grouped by a certain ID, creating record sets within those two tables.
Those record sets can have varying numbers of records. The trick is that I have to compare the two tables by those record sets. If one record set, ordered by update date (one of the record fields), doesn't find an identical record set in the other table, I have to output that record set.
What is the best way to do it? How do I compare two different tables by record groups/record sets/record blocks?
Should I use sub-query factoring? Should I use temporary tables? Should I use something else?
Thank you very much for your generous responses, and please let me know if my question is unclear.
I guess you just need a MINUS query to show the differences.
If you use Toad there is a specific function. Or you can use the minus operator or read this other post link
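A plain MINUS compares individual rows, whereas the question asks about whole record sets. The grouping logic can be sketched as follows, in Python for illustration only. The field names id and updated, the function names and the sample rows are hypothetical, and the sketch assumes that a record set in one table is matched against the record set with the same ID in the other table.

```python
from collections import defaultdict


def group_records(rows, id_key="id", order_key="updated"):
    """Group rows by ID and freeze each group as a tuple ordered by update date."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_key]].append(row)
    frozen = {}
    for gid, members in groups.items():
        # Ties on the update date are broken by the full row content.
        ordered = sorted(members, key=lambda r: (r[order_key], sorted(r.items())))
        frozen[gid] = tuple(tuple(sorted(r.items())) for r in ordered)
    return frozen


def unmatched_sets(table_a, table_b):
    """Return IDs whose record set in table_a has no identical set in table_b."""
    a, b = group_records(table_a), group_records(table_b)
    return [gid for gid, records in a.items() if b.get(gid) != records]


# Hypothetical sample data.
table_a = [
    {"id": 1, "updated": "2020-01-01", "val": "x"},
    {"id": 1, "updated": "2020-01-02", "val": "y"},
    {"id": 2, "updated": "2020-01-01", "val": "z"},
]
table_b = [
    {"id": 1, "updated": "2020-01-02", "val": "y"},
    {"id": 1, "updated": "2020-01-01", "val": "x"},
    {"id": 2, "updated": "2020-01-01", "val": "changed"},
]
unmatched = unmatched_sets(table_a, table_b)
```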

Using LINQ to query flat text files with fixed-length records?

I've got a file filled with records like this:
NCNSCF1124557200811UPPY19871230
The codes are all fixed-length, and some of them link to other flat files (sort of like a relational database). What's the best way of querying this data using LINQ?
This is what I came up with intuitively, but I was wondering if there's a more elegant way:
var records = File.ReadAllLines("data.txt");
var table = from record in records
select new { FirstCode = record.Substring(0, 2),
OtherCode = record.Substring(18, 4) };
For one thing I wouldn't read it all into memory to start with. It's very easy to write a LineReader class which iterates over a file a line at a time. I've got a version in MiscUtil which you can use.
Unless you only want to read the results once, however, you might want to call ToList() at the end to avoid reading the file multiple times. (This is still nicer than reading all the lines and keeping that in memory - you only want to do the splitting once.)
Once you've basically got in-memory collections of all the tables, you can use normal LINQ to Objects to join them together etc. You might want to go to a more sophisticated data model to get indexes though.
I don't think there's a better way out of the box.
One could define a Flat-File Linq Provider which could make the whole thing much simpler, but as far as I know, no one has yet.
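The line-at-a-time approach suggested above can be illustrated with a generator. This sketch is in Python rather than C# and does not use MiscUtil; the names Record and parse_records are hypothetical, and the field offsets simply mirror the Substring(0, 2) and Substring(18, 4) calls from the question.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    first_code: str
    other_code: str


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Lazily parse fixed-length records, one line at a time."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Offsets copied from the question: 2 chars at 0, 4 chars at 18.
        yield Record(first_code=line[0:2], other_code=line[18:22])


# Any iterable of lines works; an open file object yields lines lazily:
#     with open("data.txt") as f:
#         records = list(parse_records(f))
records = list(parse_records(["NCNSCF1124557200811UPPY19871230\n", "\n"]))
```

Materialising the result with list() plays the same role as the ToList() call mentioned above: the file is read only once, and the parsed records can then be joined against other in-memory collections.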
