Python/Pandas - merging one to many csv for denormalization - data-structures

I have a bunch of large csv files that were extracted out of a relational database. So for example I have customers.csv, address.csv and customer-address.csv that maps the key values for the relationships. I found an answer on how to merge the files here:
Python/Panda - merge csv according to join table/csv
So right now my code looks like this:
import pandas as pd

df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)  # the customer-address join table
df = (df3.merge(df1, left_on='CID', right_on='ID')
      .merge(df2, left_on='AID', right_on='ID', suffixes=('', '_'))
      .drop(['CID', 'AID', 'ID_'], axis=1))
print(df)
Now I noticed that I have files with a one-to-many relationship, and with the code above pandas is probably overriding values when there are multiple matches for one key.
Is there a method to join files with a one-to-many (or many-to-many) relationship? I'm thinking of creating a full (redundant) row for each foreign key. So basically denormalization.

The answer to my question is to perform an outer join. With the code below, pandas creates a new row for every occurrence of one of the IDs in the left or right dataframe, thus creating a denormalized table.
df1.merge(df2, left_on='CID', right_on='ID', how='outer')
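As a minimal, self-contained sketch (the customer and address values below are made up for illustration), a one-to-many merge already fans out to one row per match; `how='outer'` additionally keeps keys with no match on the other side:

```python
import pandas as pd

# one customer (CID 1) with two linked addresses -> two output rows
customers = pd.DataFrame({"ID": [1, 2], "name": ["Ann", "Bob"]})
links = pd.DataFrame({"CID": [1, 1], "AID": [10, 11]})       # join table
addresses = pd.DataFrame({"ID": [10, 11], "city": ["Berlin", "Hamburg"]})

df = (links.merge(customers, left_on="CID", right_on="ID")
           .merge(addresses, left_on="AID", right_on="ID", suffixes=("", "_"))
           .drop(["CID", "AID", "ID_"], axis=1))
print(df)
# Ann appears twice (denormalized). With how='outer', Bob, who has no
# address link, would also be kept, with NaN in the address columns.
```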

Related

Sorting after Repartitioning PySpark Dataframe

We have a giant file which we repartitioned according to one column, for example, say it is STATE. Now it seems like after repartitioning, the data cannot be sorted completely. We are trying to save our final file as a text file but instead of the first state listed being Alabama, now California shows up first. OrderBy doesn't seem to have an effect after running the repartition.
df = df.repartition(100, ['STATE_NAME'])\
       .sortWithinPartitions('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID')
I can't find a clear statement in the documentation about this, only this hint for pyspark.sql.DataFrame.repartition:
The resulting DataFrame is hash partitioned.
Obviously, repartition doesn't bring the rows into a specific (namely alphabetic) order (not even if they were ordered previously); it only groups them. That .sortWithinPartitions imposes no global order is no surprise considering the name, which implies that the sorting only occurs within each partition, not across them. You can try .sort instead.
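The effect can be simulated in plain Python (the crc32-mod-n scheme below is an illustrative stand-in for Spark's hash partitioner, not its actual implementation): each partition ends up sorted, but concatenating the partitions gives no global order.

```python
import zlib

states = ["Alabama", "California", "Texas", "Nevada", "Oregon", "Iowa"]

# hash-partition the rows into 3 buckets, like repartition(3, 'STATE_NAME')
n_partitions = 3
partitions = [[] for _ in range(n_partitions)]
for s in states:
    partitions[zlib.crc32(s.encode()) % n_partitions].append(s)

# sortWithinPartitions: each partition sorted independently
for p in partitions:
    p.sort()

# reading the partitions back in order is NOT a global alphabetic sort;
# only an explicit global sort (like df.sort) guarantees that
flat = [s for p in partitions for s in p]
print(flat)
print(sorted(states))
```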

Combine Tables matching a name pattern in Power Query

I am trying to combine many tables whose names match a pattern.
So far, I have extracted the table names from #shared and have them in a list.
What I haven't been able to do is loop over this list and transform it into a list of tables that can be combined.
e.g. Name is the list with the table names:
Source = Table.Combine( { List.Transform(Name, each #shared[_] )} )
The error is:
Expression.Error: We cannot convert a value of type List to type Text.
Details:
Value=[List]
Type=[Type]
I have tried many ways but I am missing some kind of type transformation.
I was able to transform this list of tables names to a list of tables with:
T1 = List.Transform(Name, each Expression.Evaluate(_, #shared))
However, the Expression.Evaluate feels like an ugly hack. Is there a better way for this transformation?
With this list of tables, I tried to combine them with:
Source = Table.Combine(T1)
But I got the error:
Expression.Error: A cyclic reference was encountered during evaluation.
If I extract a table from the list with an index (e.g. T1{2}) it works. So in this line of thinking, I would need some kind of loop to append.
Steps illustrating the problem.
The objective is to append (Tables.Combine) every table named T_\d+_Mov:
After filtering the matching table names in a table:
Converted to a List:
Converted the names in the list to the real tables:
Now I just need to combine them, and this is where I am stuck.
It is important to note that I don't want to use VBA for this.
It would be easier to recreate the query from VBA by scanning ThisWorkbook.Queries(), but it would not be a clean reload when adding or removing tables.
The final solution, as suggested by @Michal Palko, was:
CT1 = Table.FromList(T1, Splitter.SplitByNothing(), {"Name"}, null, ExtraValues.Ignore),
EC1 = Table.ExpandTableColumn(CT1, "Name", Table.ColumnNames(CT1{0}[Name]) )
where T1 was the previous step.
The only caveat is that the first table must have all columns, or the missing ones will be skipped.
I think there might be an easier way, but given your approach, try converting your list to a table (column) and then expanding that column:
Alternatively use Table.Combine(YourList)
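For comparison only (the M code above remains the actual answer), the same "append a list of tables, unioning columns by name" idea looks like this in pandas, with made-up stand-ins for the T_\d+_Mov tables:

```python
import pandas as pd

# three small tables standing in for T_1_Mov, T_2_Mov, T_3_Mov
tables = [
    pd.DataFrame({"a": [1], "b": [2]}),
    pd.DataFrame({"a": [3], "b": [4]}),
    pd.DataFrame({"a": [5], "b": [6], "c": [7]}),  # extra column
]

# like Table.Combine: rows appended, columns matched by name,
# missing values filled with NaN
combined = pd.concat(tables, ignore_index=True)
print(combined)
```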

Cross table with two datasets (one as the row and the other as the column)

I have two datasets in my birt report :
Lesson (date)
Student (name)
and I would like to know how to create a cross table using the date (red) as the column names and name (blue) as the row names as shown below :
The cells will stay empty.
I have tried to use the Cross Tab, but it seems that I can only use one dataset.
For information, I am stuck with version 2.5.2. I say this in case someone writes about a practical feature available in a later version of BIRT... :-)
Where both datasets are coming from the same relational data source, the simplest way to achieve this would normally be:
Replace the existing two datasets with a single dataset, in which the two original datasets are cross-joined to each other;
create a crosstab from the new dataset, with the new dataset columns as the data cube groups.
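The cross-join idea can be sketched outside BIRT as well; here is a hedged pandas illustration (dataset names and values are made up, and `how="cross"` needs pandas 1.2+): every student is paired with every lesson date, then the pairs are pivoted into the empty grid.

```python
import pandas as pd

lessons = pd.DataFrame({"date": ["2024-01-08", "2024-01-15"]})
students = pd.DataFrame({"name": ["Alice", "Bob"]})

# cross join: every student paired with every lesson date
grid = students.merge(lessons, how="cross")
grid["cell"] = ""  # the cells stay empty, as in the question

# pivot into a crosstab: names as rows, dates as columns
crosstab = grid.pivot(index="name", columns="date", values="cell")
print(crosstab)
```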

Conditional array or hash combining

I am working on a project that has 2 separate input files, each with some information that relates to the other file.
I have loaded them each into their own arrays after parsing them, like so:
file_1 << "#{contract_id}|#{region}|#{category}|#{supplier_ID}"
file_2 << "#{contract_id}|#{region}|#{category}|#{manufacturer}|#{model}"
File 1 has 30,000 lines and File 2 has 400,000 lines. My desired output will have somewhere in the neighborhood of 600,000 lines from my estimations.
Now my problem is figuring out a way to combine them, as they have a many-to-many relationship.
Every time the contract_id, region AND category match, I need to have a record that looks like the following:
supplier_ID region category manufacturer model.
My initial thought was to iterate over one of the arrays and put everything into a hash, using #{contract_id}|#{region}|#{category}|#{manufacturer} as the key and #{model} as the value.
But the limitation there is that it only iterates over the array once, and thus the output is limited to the number of elements in that array.
My understanding of your question:
File 1 has the columns contract_id, region, category, supplier_id.
File 2 has the columns contract_id, region, category, manufacturer, model
You want a program that takes file 1 and file 2 as inputs and does the equivalent of an SQL join to produce a new file with the following columns: supplier_id, region, category, manufacturer, model. Your join condition is that the contract_id, region, and category need to match.
Here is how I would tackle this:
Step 1: Read both files into arrays that have the data from each. Don't store the data entries as an ugly pipe-delimited string; store them as an array or a hash.
file_1_entries << [contract_id, region, category, supplier_ID]
Step 2: Iterate over the data from both files and make hashes to index them by the columns you care about (contract_id, region, and category). For example, to index file 1, you would make a hash whose key is some combination of those three columns (either an array or a string) and the value is an array of entries from file 1 that match.
file_1_index = {}
file_1_entries.each do |x|
key = some_function_of(x)
file_1_index[key] ||= []
file_1_index[key] << x
end
Step 3: Iterate over one of your index hashes, and use the index hashes to do the join you want to do.
file_1_index.keys.each do |key|
file_1_matching_entries = file_1_index.fetch(key, [])
file_2_matching_entries = file_2_index.fetch(key, [])
# nested loop to do the join
end
I can't go into very much detail on each of these steps because you asked a pretty broad question and it would take a long time to add all the details. But you should try to do these steps and ask more specific questions if you get stuck.
It's possible your machine might run out of memory while you are doing this, depending on your computer. In that case, you might need to build a temporary database (e.g. with sqlite) and then perform the join using an actual SQL query instead of trying to do it yourself in Ruby.
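The three steps above can be sketched end to end; the answer's code is Ruby, so this is only a Python rendering of the same hash-index join, with made-up sample rows mirroring the two files' columns:

```python
from collections import defaultdict

# Step 1: parsed rows kept as tuples, not pipe-delimited strings
file_1_entries = [
    ("C1", "EU", "hardware", "SUP-9"),         # contract, region, category, supplier
]
file_2_entries = [
    ("C1", "EU", "hardware", "Acme", "X100"),  # ..., manufacturer, model
    ("C1", "EU", "hardware", "Acme", "X200"),
]

# Step 2: index both files by the join key (contract_id, region, category)
file_1_index = defaultdict(list)
for row in file_1_entries:
    file_1_index[row[:3]].append(row)
file_2_index = defaultdict(list)
for row in file_2_entries:
    file_2_index[row[:3]].append(row)

# Step 3: nested loop over matching buckets -> many-to-many join
output = []
for key, ones in file_1_index.items():
    for _cid, region, category, supplier in ones:
        for _, _, _, manufacturer, model in file_2_index.get(key, []):
            output.append((supplier, region, category, manufacturer, model))
print(output)
```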

How to fill a Cassandra Column Family from another one's columns?

I have always read that Cassandra is good if your application changes frequently and features are added frequently.
That makes sense: since you don't have a fixed schema, you can add columns to rows to meet your needs, instead of running an ALTER TABLE query, which may freeze your database for hours on very large tables.
However, I have a hypothetical problem which I'm not able to solve.
Let's say I have:
CREATE COLUMN FAMILY Students
with comparator='CompositeType(UTF8Type,UTF8Type)'
and key_validation_class=UUIDType;
Each student has some generic column (you know, meta:username, meta:password, meta:surname, etc), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).
On the other side, I may have a Courses CF, where each row contains the UUIDs of all students following that course.
So I can ask "which courses are followed by XXX" and "which students follow course YYY".
The problem is: what if I didn't create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn't a requirement.
This is a simple example, but I believe it's quite common. "With Cassandra you plan CFs in terms of queries instead of relationships". I need that query now, while at first it wasn't needed.
Given a table of students with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig or Hive? (I never touched any of those, just guessing.)
Pig (which uses the Hadoop integration) is actually perfect for this type of work, because you can not only read but also write data back into Cassandra using CassandraStorage. It gives you the parallel processing capability to do the job with minimal time and overhead. Otherwise the alternative is to write something to do the extraction yourself, then write the new CF.
Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:
rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;
cass_group = GROUP avgs BY group;
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
If you use the existing Cassandra column family, you would have to unwind the data. Since these NoSQL structures are unidirectional, this could be a very time-consuming operation in Cassandra itself: the data would have to be sorted in the opposite order from the first CF. Frankly, I believe you would have to go back to the original data that was used to populate the first CF and populate the new one from that.
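If you do write the extraction yourself, its core is just inverting the student-to-course mapping. A minimal sketch with plain dicts (the Cassandra read/write itself is omitted, and the UUIDs and course IDs are made up):

```python
from collections import defaultdict

# student UUID -> list of course IDs, as read from the Students CF
students = {
    "uuid-1": ["ID1", "ID2"],
    "uuid-2": ["ID2"],
}

# invert to course ID -> list of student UUIDs, to populate the Courses CF
courses = defaultdict(list)
for student, course_ids in students.items():
    for course_id in course_ids:
        courses[course_id].append(student)
print(dict(courses))
```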