I am working on a project that has 2 separate input files, each with some information that relates to the other file.
I have loaded them each into their own arrays after parsing them like so
file_1 << "#{contract_id}|#{region}|#{category}|#{supplier_ID}"
file_2 << "#{contract_id}|#{region}|#{category}|#{manufacturer}|#{model}"
File 1 has 30,000 lines and File 2 has 400,000 lines. By my estimate, the desired output will have somewhere in the neighborhood of 600,000 lines.
Now my problem is figuring out a way to combine them, as they have a many-to-many relationship.
Every time the contract_id, region, AND category match, I need to have a record that looks like the following:
supplier_ID region category manufacturer model.
My initial thought was to iterate over one of the arrays and put everything into a hash, using "#{contract_id}|#{region}|#{category}|#{manufacturer}" as the KEY and #{model} as the value.
But the limitation there is that it only iterates over the array once and thus the output is limited to the number of elements in the respective array.
My understanding of your question:
File 1 has the columns contract_id, region, category, supplier_id.
File 2 has the columns contract_id, region, category, manufacturer, model
You want a program that will take file 1 and file 2 as inputs and do the equivalent of an SQL join, producing a new file with the following columns: supplier_id, region, category, manufacturer, model. Your join condition is that the contract_id, region, and category need to match.
Here is how I would tackle this:
Step 1: Read both files into arrays that have the data from each. Don't store the data entries as an ugly pipe-delimited string; store them as an array or a hash.
file_1_entries << [contract_id, region, category, supplier_ID]
Step 2: Iterate over the data from both files and make hashes to index them by the columns you care about (contract_id, region, and category). For example, to index file 1, you would make a hash whose key is some combination of those three columns (either an array or a string) and the value is an array of entries from file 1 that match.
file_1_index = {}
file_1_entries.each do |x|
  key = x[0, 3]   # the [contract_id, region, category] slice of the entry
  file_1_index[key] ||= []
  file_1_index[key] << x
end
Step 3: Iterate over one of your index hashes, and use the index hashes to do the join you want to do.
file_1_index.keys.each do |key|
  file_1_matching_entries = file_1_index.fetch(key, [])
  file_2_matching_entries = file_2_index.fetch(key, [])
  # nested loop to do the join (assumes an `output` array collecting the result records)
  file_1_matching_entries.each do |f1|
    file_2_matching_entries.each do |f2|
      output << [f1[3], f1[1], f1[2], f2[3], f2[4]] # supplier_ID, region, category, manufacturer, model
    end
  end
end
I can't go into very much detail on each of these steps because you asked a pretty broad question and it would take a long time to add all the details. But you should try to do these steps and ask more specific questions if you get stuck.
It's possible your machine might run out of memory while you are doing this, depending on your computer. In that case, you might need to build a temporary database (e.g. with sqlite) and then perform the join using an actual SQL query instead of trying to do it yourself in Ruby.
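For what it's worth, here is a rough sketch of that fallback using Python's built-in sqlite3 module, just to show the shape of the temporary-database approach (the file names, table names, and pipe-delimited parsing are assumptions, not part of the original question):

import sqlite3

conn = sqlite3.connect('join_tmp.db')
conn.execute('CREATE TABLE file1 (contract_id TEXT, region TEXT, category TEXT, supplier_id TEXT)')
conn.execute('CREATE TABLE file2 (contract_id TEXT, region TEXT, category TEXT, manufacturer TEXT, model TEXT)')
conn.execute('CREATE INDEX file2_key ON file2 (contract_id, region, category)')

# load both pipe-delimited files into the temporary tables
with open('file1.txt') as f:
    conn.executemany('INSERT INTO file1 VALUES (?, ?, ?, ?)',
                     (line.rstrip('\n').split('|') for line in f))
with open('file2.txt') as f:
    conn.executemany('INSERT INTO file2 VALUES (?, ?, ?, ?, ?)',
                     (line.rstrip('\n').split('|') for line in f))
conn.commit()

join_sql = '''
    SELECT f1.supplier_id, f1.region, f1.category, f2.manufacturer, f2.model
    FROM file1 f1
    JOIN file2 f2
      ON  f1.contract_id = f2.contract_id
      AND f1.region      = f2.region
      AND f1.category    = f2.category
'''
with open('output.txt', 'w') as out:
    for row in conn.execute(join_sql):
        out.write('|'.join(row) + '\n')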
Related
I have 2 columns: Matches (Integer) and Accounts_type (String). I want to create a third column that holds the proportion of matches played by each account type. I am new to Talend and have been stuck on this for the past 2 days; I did a lot of research, but to no avail. Please help.
You can do it like this:
You need to read your source data twice (I used tFixedFlowInput_1 and tFixedFlowInput_2 with the same data). The idea is to calculate the total of your matches in tAggregateRow_1, which simply sums all Matches without a group-by column, and then use that total as a lookup.
The tMap then joins your source data with the calculated total. Since the total will always be one record, you don't need any join column. You then simply divide Matches by Total as required.
This assumes you have unique values in Account_type; if you don't, you need to add another tAggregateRow between your source and tMap_1 in order to get the sum of Matches for each Account_type (group by Account_type).
I have a bunch of large CSV files that were extracted from a relational database. For example, I have customers.csv, address.csv and customer-address.csv, which maps the key values for the relationships. I found an answer on how to merge the files here:
Python/Panda - merge csv according to join table/csv
So right now my code looks like this:
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
df = (df3.merge(df1, left_on='CID', right_on='ID')
         .merge(df2, left_on='AID', right_on='ID', suffixes=('','_'))
         .drop(['CID','AID','ID_'], axis=1))
print (df)
Now I noticed that I have files with a one-to-many relationship, and with the code above pandas is probably overriding values when there are multiple matches for one key.
Is there a method to join files with a one-to-many (or many-to-many) relationship? I'm thinking of creating a full (redundant) row for each foreign key, so basically denormalization.
The answer to my question is to perform an outer join. With the code below, pandas creates a new row for every occurrence of one of the IDs in the left or right dataframe, thus creating a denormalized table.
df1.merge(df2, left_on='CID', right_on='ID', how='outer')
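To make the one-to-many behaviour concrete, here is a tiny self-contained example (the frame and column names are made up): each matching pair becomes its own row, so nothing gets overwritten, and unmatched keys survive because of how='outer'.

import pandas as pd

customers = pd.DataFrame({'ID': [1, 2], 'name': ['Alice', 'Bob']})
links = pd.DataFrame({'CID': [1, 1, 3], 'AID': [10, 11, 12]})   # customer 1 is linked to two addresses

# one output row per match; unmatched keys (CID 3, ID 2) are kept with NaNs because of how='outer'
print(links.merge(customers, left_on='CID', right_on='ID', how='outer'))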
I'm writing a custom search function, and I have to filter through an association.
I have 2 ActiveRecord-backed models, cards and colors, joined with a has_and_belongs_to_many, and colors have an attribute color_name.
As my DB has grown to around 10k cards, my search function has become exceptionally slow because I have a select statement with a query inside it, so essentially I'm making thousands of queries.
I need to convert the Array#select call into an ActiveRecord query that yields the same results, and I'm having trouble coming up with a solution. The current (relevant) code is the following:
colors = [['Black'], ['Blue', 'Black']] # this is a parameter retrieved from a form submission
if colors
  cards = colors.flat_map do |col|
    col.inject(Card.includes(:colors)) do |memo, color|
      temp = cards.joins(:colors).where(colors: { color_name: color })
      memo + temp.select { |card| card.colors.pluck(:color_name).sort == col.sort }
    end
  end
end
The functionality I'm trying to mimic is that only cards with colors exactly matching the incoming array will be selected by the search (comparing two arrays). Because cards can be mono-red, red-blue, or red-blue-green etc., I need to be able to search for only red-blue cards or only mono-red cards.
I initially started along this route, but I'm having trouble comparing arrays with an ActiveRecord query:
color_objects = Color.where(color_name: col)
Card.includes(:colors).where('colors = ?', color_objects)
returns the error
ActiveRecord::StatementInvalid: PG::SyntaxError: ERROR: syntax error
at or near "SELECT" LINE 1: ...id" WHERE "cards"."id" IN (2, 3, 4) AND
(colors = SELECT "co...
It looks to me like it's failing because it doesn't want to compare arrays, only table elements. Is this functionality even possible?
One solution might be to convert the HABTM into a has_many :through relation and make join tables that contain keys for every permutation of colors, in order to access those directly.
I need to be able to search for only green-black cards, and not have mono-green, or green-black-red cards show up.
I've deleted my previous answer, because I did not realize you are looking for an exact match.
I played a little with it and I can't see any solution without using an aggregate function.
For Postgres it will be array_agg.
You need to generate an SQL Query like:
SELECT cards.*, array_to_string(array_agg(colors.color_name ORDER BY colors.color_name), ',') AS color_names
FROM cards
JOIN cards_colors ON cards.id = cards_colors.card_id
JOIN colors ON colors.id = cards_colors.color_id
GROUP BY cards.id
HAVING array_to_string(array_agg(colors.color_name ORDER BY colors.color_name), ',') = 'black,green'
I have never used these aggregate functions, so perhaps array_to_string is the wrong formatter; in any case you have to make sure the colors are aggregated in alphabetical order (hence the ORDER BY inside array_agg). As long as you don't have too many cards it will be fast enough, but it will scan every card in the table.
If you want to use an index on this query, you should denormalize your data structure: keep an array of color_names on each cards record, index that array field, and search on it. You can also keep your normalized structure and define an association callback that appends the color_name to the card's color_names array every time a color is assigned to a card.
Try this:
colors = Color.where(color_name: col).pluck(:id)
Card.includes(:colors).where('colors.id' => colors)
I currently have the following pig script (column list truncated for brevity):
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
SITE_ID_COL :int,-- = Item Site ID
META_ID_COL :int,-- = Top Level (meta) category ID
EXTRACT_DATE_COL :chararray,-- = Date for the data points
...
);
SPLIT inputData INTO site0 IF (SITE_ID_COL == 0), site3 IF (SITE_ID_COL == 3), site15 IF (SITE_ID_COL == 15);
STORE site0 INTO 'pigsplit1/0/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/0/','2', 'bz2', '\\x7F');
STORE site3 INTO 'pigsplit1/3/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/3/','2', 'bz2', '\\x7F');
STORE site15 INTO 'pigsplit1/15/' USING org.apache.pig.piggybank.storage.MultiStorage('pigsplit1/15/','2', 'bz2', '\\x7F');
And it works great for what I wanted it to do, but there's actually at least 22 possible site IDs and I'm not certain there's not more. I'd like to dynamically create the splits and store into paths based on that column. Is the easiest way to do this going to be through a two step usage of the MultiStorage UDF, first splitting by the site ID and then loading all those results and splitting by the date? That seems inefficient. Can I somehow do it through GROUP BYs? It seems like I should be able to GROUP BY the site ID, then flatten each row and run the multi storage on that, but I'm not sure how to concatenate the GROUP into the path.
The MultiStorage UDF is not set up to divide inputs on two different fields, but that's essentially what you're doing -- the use of SPLIT is just to emulate MultiStorage with two parameters. In that case, I'd recommend the following:
REGISTER /usr/lib/pig/piggybank.jar;
inputData = LOAD '/data/$date*.{bz2,bz,gz}' USING PigStorage('\\x7F')
AS (
SITE_ID_COL :int,-- = Item Site ID
META_ID_COL :int,-- = Top Level (meta) category ID
EXTRACT_DATE_COL :chararray,-- = Date for the data points
...
);
dataWithKey = FOREACH inputData GENERATE CONCAT(CONCAT((chararray)SITE_ID_COL, '-'), EXTRACT_DATE_COL), *; -- cast the int column so CONCAT accepts it
STORE dataWithKey INTO 'tmpDir' USING org.apache.pig.piggybank.storage.MultiStorage('tmpDir', '0', 'bz2', '\\x7F');
Then go over your output with a simple script to list all the files in your output directories, extract the site and date IDs, and move them to appropriate locations with whatever structure you like.
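Something along these lines could do that second pass (a rough sketch: it assumes the job output has been pulled onto a local filesystem and that MultiStorage produced one subdirectory per key value, named like 0-2013-05-01; the directory names and target layout here are illustrative, and you would adapt the listing/moving for HDFS):

import os
import shutil

TMP_DIR = 'tmpDir'      # output of the Pig job above
OUT_DIR = 'pigsplit1'   # desired final layout: pigsplit1/<site_id>/<date>/

for name in os.listdir(TMP_DIR):
    src = os.path.join(TMP_DIR, name)
    if not os.path.isdir(src):
        continue
    # key directories are assumed to be named "<site_id>-<date>"; split at the first '-'
    site_id, _, date = name.partition('-')
    dest = os.path.join(OUT_DIR, site_id, date)
    os.makedirs(dest, exist_ok=True)
    for part in os.listdir(src):
        shutil.move(os.path.join(src, part), dest)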
Not the most elegant workaround, but it could work all right for you. One thing to watch out for is that the separator you choose for your key might not be allowed (it may have to be alphanumeric). Also, you'll be stuck with that extra key field in your output data.
I've actually submitted a patch to the MultiStorage module to allow splitting on multiple tuple fields rather than only one, resulting in a dynamic output tree.
https://issues.apache.org/jira/browse/PIG-3258
It hasn't gotten much attention yet, but I'm using it in production with no issues.
We have a huge chunk of data and we want to perform a few operations on them. Removing duplicates is one of the main operations.
Ex.
a,me,123,2631272164
yrw,wq,1237,123712,126128361
yrw,dsfswq,1323237,12xcvcx3712,1sd26128361
These are three entries in a file, and we want to remove duplicates on the basis of the 1st column, so the 3rd row should be deleted. Each row may have a different number of columns, but the column we are interested in will always be present.
An in-memory operation doesn't look feasible.
Another option is to store the data in a database and remove the duplicates there, but again that's not a trivial task.
What design should I follow to dump the data into a database and remove duplicates?
I am assuming that people must have faced such issues and solved them.
How do we usually solve this problem?
PS: Please consider this a real-life problem rather than an interview question ;)
If the number of keys is also infeasible to load into memory, you'll have to do a stable (order-preserving) external merge sort to sort the data, and then a linear scan to do the duplicate removal. Or you could modify the external merge sort to perform duplicate elimination while merging the sorted runs.
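For the second variant (dropping duplicates during the merge), a rough Python sketch could look like the following; external_dedup, the chunk size, and the comma-separated first-column key are all illustrative assumptions, and note that the output comes back sorted by key rather than in the original row order:

import heapq
import os
import tempfile

def external_dedup(in_path, out_path, chunk_lines=1_000_000):
    # Phase 1: sort memory-sized chunks by the first column and spill each sorted run to disk.
    runs = []
    with open(in_path) as f:
        while True:
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort(key=lambda line: line.split(',', 1)[0])   # stable in-memory sort
            tmp = tempfile.NamedTemporaryFile('w', delete=False, suffix='.run')
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)

    # Phase 2: merge the sorted runs and write only the first line seen for each key.
    run_files = [open(r) for r in runs]
    last_key = None
    with open(out_path, 'w') as out:
        for line in heapq.merge(*run_files, key=lambda line: line.split(',', 1)[0]):
            key = line.split(',', 1)[0]
            if key != last_key:
                out.write(line)
                last_key = key

    for rf in run_files:
        rf.close()
        os.remove(rf.name)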
I guess since this isn't an interview question, efficiency/elegance doesn't seem to be an issue(?). Write a hacky Python script that creates one table with the first field as its primary key. Parse the file and just insert the records into the database, wrapping each insert in a try/except statement. Then perform a SELECT * on the table, parse the data, and write it back to a file line by line.
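A rough sketch of that approach using Python's built-in sqlite3 (the file names and the comma-delimited first column are assumptions):

import sqlite3

conn = sqlite3.connect('dedup.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (key TEXT PRIMARY KEY, line TEXT)')

with open('input.txt') as f:
    for line in f:
        key = line.split(',', 1)[0]
        try:
            # the PRIMARY KEY constraint rejects any repeated first-column value
            conn.execute('INSERT INTO rows (key, line) VALUES (?, ?)', (key, line))
        except sqlite3.IntegrityError:
            pass   # duplicate key: drop the row
conn.commit()

with open('output.txt', 'w') as out:
    for (line,) in conn.execute('SELECT line FROM rows'):
        out.write(line)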
If you go down the database route, you can load the csv into a database and use 'duplicate key update'
Using MySQL:
Create a table with columns to match your data (you may be able to get away with just 2 columns: id and data).
Load the data using something along the lines of:
LOAD DATA LOCAL infile "rs.txt" REPLACE INTO TABLE data_table FIELDS TERMINATED BY ',';
You should then be able to dump out the data back into csv format without duplicates.
If the number of unique keys isn't extremely high, you could simply do this
(pseudocode, since you didn't mention a language):
Set keySet;
while (not end_of_input_file)
    read line from input file
    if first column is not in keySet
        add first column to keySet
        write line to output file
end while
If the input is sorted or can be sorted, then one could do the following, which only needs to keep one value in memory:
import sys

# read_row() / write_row() stand in for whatever I/O helpers you use
r = read_row()
if r is None:
    sys.exit()
last = r[0]
write_row(r)
while True:
    r = read_row()
    if r is None:
        sys.exit()
    if r[0] != last:
        write_row(r)
        last = r[0]
Otherwise:
What I'd do is keep a set of the first column values that I have already seen and drop the row if it is in that set.
S = set()
while True:
    r = read_row()
    if r is None:
        break
    if r[0] not in S:
        write_row(r)
        S.add(r[0])
This will stream over the input using only memory proportional to the size of the set of values from the first column.
If you need to preserve order in your original data, it MAY be sensible to create new data that is a tuple of position and data, then sort on the data you want to de-dup. Once you've sorted by data, de-duplication is (essentially) a linear scan. After that, you can re-create the original order by sorting on the position-part of the tuple, then strip it off.
Say you have the following data: a, c, a, b
With a pos/data tuple, sorted by data, we end up with: 0/a, 2/a, 3/b, 1/c
We can then de-duplicate, trivially being able to choose either the first or last entry to keep (we can also, with a bit more memory consumption, keep another) and get: 0/a, 3/b, 1/c.
We then sort by position and strip that: a, c, b
This would involve three linear scans over the data set and two sorting steps.
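As an in-memory illustration of the idea (for a file too big for memory, the two sorts would be external, as in the earlier answer; the function and its names are made up):

def dedup_preserving_order(rows, key=lambda r: r):
    # pair each row with its original position, then sort by the dedup key
    tagged = sorted(enumerate(rows), key=lambda pair: key(pair[1]))
    kept = []
    last_key = object()               # sentinel that never equals a real key
    for pos, row in tagged:           # linear scan over key-sorted data
        k = key(row)
        if k != last_key:
            kept.append((pos, row))   # keep the first entry for each key
            last_key = k
    # restore the original order, then strip the position tags
    kept.sort(key=lambda pair: pair[0])
    return [row for _, row in kept]

print(dedup_preserving_order(['a', 'c', 'a', 'b']))   # ['a', 'c', 'b'], matching the worked example above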