Searching CSV with 1.6 Million lines (150MB) file? - performance

I have a CSV file of about 150MB containing 1.6 million lines of product data. I have another CSV with 2000 lines, which lists products that appear in the big CSV. The two files are related by a unique ID. The idea is to add the product data to the CSV with 2000 lines.
The databank.csv has headers ID, Product Name, Description, Price .
The sm_list.csv has header ID.
The result should be a CSV, 2000 rows long, containing the products in sm_list.csv with their corresponding data from databank.csv.
My original solution reads in all of sm_list and reads databank line by line. For each line read from databank, it searches sm_list for that line's ID. This leads to 2000 x 1.6 million = 3200 million comparisons!
Could you please provide a basic algorithm outline to complete this task in the most efficient way?

Assuming you know how to read/write CSV files in MATLAB (several questions here on SO show how), here is an example:
%# this would be read from "databank.csv"
prodID = (1:10)'; %'
prodName = cellstr( num2str(prodID, 'Product %02d') );
prodDesc = cellstr( num2str(prodID, 'Description %02d') );
prodPrice = rand(10,1)*100;
databank = [num2cell(prodID) prodName prodDesc num2cell(prodPrice)];
%# same for "sm_list.csv"
sm_list = [2;5;7;10];
%# find matching rows
idx = ismember(prodID,sm_list);
result = databank(idx,:)
%# ... export 'result' to CSV file ...
The result of the above example:
result =
[ 2] 'Product 02' 'Description 02' [19.251]
[ 5] 'Product 05' 'Description 05' [14.651]
[ 7] 'Product 07' 'Description 07' [4.2652]
[10] 'Product 10' 'Description 10' [ 53.86]
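The same idea can be sketched outside MATLAB. The following is a minimal Python sketch, not the original poster's method: it assumes both files have an ID header as described in the question, loads the small list into a set, and streams the big file once, so the work is roughly one hash lookup per databank row instead of 2000 comparisons per row. The function name select_products and the in-memory sample rows are made up for illustration.

```python
import csv
import io


def select_products(databank_file, sm_list_file, out_file):
    """Stream databank rows and keep those whose ID is in the small list."""
    # Load the ~2000 wanted IDs into a set: membership tests are O(1) on average.
    wanted = {row["ID"] for row in csv.DictReader(sm_list_file)}

    reader = csv.DictReader(databank_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    matched = 0
    for row in reader:  # single pass over the big file
        if row["ID"] in wanted:
            writer.writerow(row)
            matched += 1
    return matched


# Tiny in-memory stand-ins for databank.csv and sm_list.csv (made-up rows).
databank = io.StringIO(
    "ID,Product Name,Description,Price\n"
    "1,Product 01,Description 01,9.99\n"
    "2,Product 02,Description 02,19.25\n"
    "5,Product 05,Description 05,14.65\n"
)
sm_list = io.StringIO("ID\n2\n5\n")
result = io.StringIO()
count = select_products(databank, sm_list, result)
print(result.getvalue())
```

With real files, the three arguments would be file objects opened with newline="" as the csv module recommends.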

Do you have to use MATLAB? If you just load all that data into a database, it'll be easier. A simple select tableA.ID, tableB.productname... where tableA.id = tableB.id will do it.
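As a rough illustration of the database route, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (databank, sm_list, product_name and so on) and the sample rows are assumptions chosen to mirror the question, not anything from the original data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE databank ("
    "id INTEGER PRIMARY KEY, product_name TEXT, description TEXT, price REAL)"
)
conn.execute("CREATE TABLE sm_list (id INTEGER PRIMARY KEY)")

# Made-up sample rows standing in for the two CSV files.
conn.executemany(
    "INSERT INTO databank VALUES (?, ?, ?, ?)",
    [
        (1, "Product 01", "Description 01", 9.99),
        (2, "Product 02", "Description 02", 19.25),
        (5, "Product 05", "Description 05", 14.65),
    ],
)
conn.executemany("INSERT INTO sm_list VALUES (?)", [(2,), (5,)])

# The primary key on databank.id lets the join look each ID up by index.
rows = conn.execute(
    "SELECT d.id, d.product_name, d.description, d.price "
    "FROM sm_list AS s JOIN databank AS d ON d.id = s.id "
    "ORDER BY d.id"
).fetchall()
print(rows)
```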

Related

Is there a way to copy only unique rows in an Excel worksheet column to another sheet?

I use a CSV file as $AgencyMaster with two columns, AgencyID and AgencyName. I currently manually input these from another file, $Excel_File_Path, but I would like to automatically generate $AgencyMaster if possible.
$Excel_File_Path has three worksheets: Sheet1, Sheet2 and Template. Sheet1 and Sheet2 are full of data, while Template is used as a graphical representation of said data which populates based on the AgencyID. I have a script that opens $Excel_File_Path, inputs AgencyID into a specific cell, saves it, then converts it to a PDF. It does this for each AgencyID in $AgencyMaster, which is currently over 200.
In $Excel_File_Path, columns B and C in Sheet1 and Sheet2 contain all of the AgencyIDs and AgencyNames, but there are a bunch of duplicates. I can't delete any of the rows because while they are duplicates in column B and C, columns D, E, F, etc have different data used in Template. So I need to be able to take each unique AgencyID and AgencyName which may appear in Sheet1 or Sheet2 and export them to a CSV to use as $AgencyMaster.
Example:
(https://i.imgur.com/j8UIZqp.jpg)
Column B contains the AgencyID and Column C contains the AgencyName. I'd like to export unique values of each from Sheet1 and Sheet2 to CSV $AgencyMaster
I've found how to export it to a different worksheet within the same workbook, just not to a separate workbook altogether. I'd also like to save it as a .CSV with leading zeros preserved in column A.
# Checking that $AgencyMaster Exists, and importing the data if it does
If (Test-Path $AgencyMaster) {
$AgencyData = Import-CSV -Path $AgencyMaster
# Taking data from $AgencyMaster and assigning it to each variable
ForEach ($Agency in $AgencyData) {
$AgencyID = $Agency.AgencyID
$AgencyName = $Agency.AgencyName
# Insert agency code into cell D9 on Template worksheet
$ExcelWS.Cells.Item(9,4) = $AgencyID
$ExcelWB.Save()
# Copy-Item Properties
$Destination_File_Path = "$Xlsx_Destination\$AgencyID - $AgencyName - $company $month $year.xlsx"
$CI_Props = @{
'Path' = $Excel_File_Path;
'Destination' = $Destination_File_Path;
'PassThru' = $true;
} # Close $CI_Props
# Copy & Rename file
Copy-Item @CI_Props
} # Close ForEach
} # Close IF
I would recommend using either Sort-Object -Unique or Group-Object.
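To make the deduplication step concrete, here is a minimal Python sketch of the same idea: collect each distinct (AgencyID, AgencyName) pair from both sheets and write them to CSV. It assumes the two columns have already been read out of Sheet1 and Sheet2 as strings, which is what keeps the leading zeros. The function name unique_agencies and the sample rows are hypothetical.

```python
import csv
import io


def unique_agencies(*sheets):
    """Return the distinct (AgencyID, AgencyName) pairs across all sheets."""
    seen = set()
    for rows in sheets:
        for agency_id, agency_name in rows:
            seen.add((agency_id, agency_name))
    return sorted(seen)


# Made-up rows; IDs are kept as strings so leading zeros survive.
sheet1 = [("00123", "Agency A"), ("00123", "Agency A"), ("04500", "Agency B")]
sheet2 = [("04500", "Agency B"), ("00077", "Agency C")]

agencies = unique_agencies(sheet1, sheet2)

out = io.StringIO()  # stand-in for the $AgencyMaster CSV file
writer = csv.writer(out)
writer.writerow(["AgencyID", "AgencyName"])
writer.writerows(agencies)
print(out.getvalue())
```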

reading specific columns from excel file and writing to another excel file

My objective is to read an excel file columns and write it to another new file.
So far I am able to create a new file with the specified columns, and I am able to read an Excel file by row and column index. But my objective is different.
I have to pick specific columns from the Excel file and write all of their data to another file under the same column.
How can I achieve this?
require 'spreadsheet'
#Step 1 : create an excel sheet
book = Spreadsheet::Workbook.new
sheet = book.create_worksheet
sheet.row(0).concat %w[id column_1 column_2 column_3]
book.write 'Data/write_excel.xls'
#step 2 : read the data excel file
book1 = Spreadsheet.open('Data/read_excel.xls')
sheet1 = book1.worksheet('')
val = sheet1[0, 1]
puts val
This is one option, provided you know in advance the index of the source column and the index of the destination column:
# Step 3 copy the columns
col_num = 3 #=> the destination column
row_num = 1 #=> to skip the headers and start from the second row
sheet1.each row_num do |row|
sheet[row_num, col_num] = row[0] # the index in row[0] selects the column to copy from the source file
row_num += 1
end
Then save the file: book.write 'filename'
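If the column positions are not known up front, they can be looked up from the header row. Below is a minimal Python sketch of that lookup, working on plain lists rather than on the spreadsheet gem's objects. The function name copy_columns and the sample data are made up for illustration.

```python
def copy_columns(source_rows, wanted):
    """Keep only the wanted columns, located by header name."""
    header, *body = source_rows
    # Map each wanted header to its index in the source sheet.
    indexes = [header.index(name) for name in wanted]
    return [list(wanted)] + [[row[i] for i in indexes] for row in body]


# Made-up source sheet: first row is the header.
source = [
    ["id", "column_1", "column_2", "column_3"],
    [1, "a", "b", "c"],
    [2, "d", "e", "f"],
]

copied = copy_columns(source, ["id", "column_3"])
print(copied)
```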

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag inside a nested FOREACH and project a boolean that says whether 2 is present in the bag. Then filter the rows on that boolean.
So:
input = LOAD 'data.txt' AS (id:chararray, event_list:bag{});
input_filt = FOREACH input {
bag_filter = FILTER event_list BY (val_0 matches '2');
GENERATE
id,
event_list,
(IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
output = FILTER input_filt BY is_2_present;
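For readers who want to check the expected result outside Pig, the logic of both answers boils down to "keep a row if its bag contains 2". Here is a minimal Python sketch of that check, using the sample rows from the question. The function name rows_containing is hypothetical.

```python
def rows_containing(rows, event):
    """Keep the rows whose event list contains the given event."""
    return [(row_id, events) for row_id, events in rows if event in events]


# Sample rows from the question; values are strings, like chararray in Pig.
rows = [
    ("22678", ["112", "110", "2"]),
    ("656565", ["110", "109"]),
    ("6676", ["2", "112"]),
]

result = rows_containing(rows, "2")
print(result)
```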

Neo4j Cypher complicated query: sort, count, sum before collect

I am a newbie with Neo4j and have just started learning it. I am stuck and looking for some help. Is it possible to get this in one Cypher query, and how?
My graph structure looks like this:
(s:Store)-[r:RELEASED]->(m:Movie)<-[r1:ASSIGNED]-(cat:MovieCategorie)
How could I get this data?
Movie store (got)
Movie (got)
The 5 most common categories of movies in that store (I don't know how to sort them before using collect(cat.name)[0..5])
Could anyone suggest how to get this data? I tried many times and failed. This is what I have so far, and it doesn't work.
match (s:Store)
with s
match (s)-[r:RELEASED]->(m:Movie)
with s,m
match (m)<-[r1:ASSIGNED]-(cat:MovieCategorie)
with s, m, count(r1) as stylesCount, cat
order by stylesCount
return distinct s as store, collect(cat.name)[0..5] as topCategories
order by store.name
Thank you!
OK, so now that I have my query right and am developing it further, I have a problem combining the aggregation functions COUNT and SUM.
My query, which works well for finding the top 5 categories per store:
MATCH (s:Store)
OPTIONAL MATCH (s)-[:RELEASED]->(m:Movie)<-[r:ASSIGNED]-(cat:MovieCategorie)
WITH s, COUNT(r) AS count, cat
ORDER BY count DESC
RETURN s AS Store, COLLECT(distinct cat.name) AS `Top Categories`
ORDER BY Store.name
On top of this query I need to count how many views each store has: sum(m.viewsCount) as Total store views. I tried adding it to the same WITH clause as COUNT, and I tried putting it in RETURN. In both cases it doesn't work the way I would like. Any suggestions or examples? I am still confused about how WITH works with aggregation functions... :(
Create the example database:
CREATE (s1:Store) SET s1.name = 'Store 1'
CREATE (s2:Store) SET s2.name = 'Store 2'
CREATE (s3:Store) SET s3.name = 'Store 3'
CREATE (m1:Movie) SET m1.title = 'Movie 1', m1.viewsCount = 50
CREATE (m2:Movie) SET m2.title = 'Movie 2', m2.viewsCount = 50
CREATE (m3:Movie) SET m3.title = 'Movie 3', m3.viewsCount = 50
CREATE (m4:Movie) SET m4.title = 'Movie 4', m4.viewsCount = 50
CREATE (m5:Movie) SET m5.title = 'Movie 5', m5.viewsCount = 50
CREATE (c1:MovieCategorie) SET c1.name = 'Cat 1'
CREATE (c2:MovieCategorie) SET c2.name = 'Cat 2'
CREATE (c3:MovieCategorie) SET c3.name = 'Cat 3'
CREATE (m1)<-[:ASSIGNED]-(c1)
CREATE (m1)<-[:ASSIGNED]-(c3)
CREATE (m2)<-[:ASSIGNED]-(c2)
CREATE (m3)<-[:ASSIGNED]-(c1)
CREATE (m3)<-[:ASSIGNED]-(c2)
CREATE (m3)<-[:ASSIGNED]-(c3)
CREATE (m4)<-[:ASSIGNED]-(c1)
CREATE (m4)<-[:ASSIGNED]-(c3)
CREATE (m5)<-[:ASSIGNED]-(c3)
CREATE (s1)-[:RELEASED]->(m1)
CREATE (s1)-[:RELEASED]->(m3)
CREATE (s1)-[:RELEASED]->(m4)
CREATE (s1)-[:RELEASED]->(m5)
CREATE (s2)-[:RELEASED]->(m1)
CREATE (s2)-[:RELEASED]->(m2)
CREATE (s2)-[:RELEASED]->(m3)
CREATE (s2)-[:RELEASED]->(m4)
CREATE (s2)-[:RELEASED]->(m5)
CREATE (s3)-[:RELEASED]->(m1)
SOLVED!! FINALLY I DID IT! Trick was use one more match after everything , great - now I can sleep in peace. Thank you.
MATCH (s:Store)-[:RELEASED]->(m:Movie)<-[r:ASSIGNED]-(cat:MovieCategorie)
with s,count(r) as catCount, cat
order by catCount desc
with s, collect( distinct cat.name)[0..5] as TopCategories
match (s)-[:RELEASED]->(m:Movie)
return s as Store, TopCategories, sum(m.viewsCount) as TotalViews
Ok, that was fast :D I finally got it!
match (s:Store)
with s
match (s)-[r:RELEASED]->(m:Movie)
with s, m
match (m)<-[r2:ASSIGNED]-(cat:MovieCategorie)
with s, count(r2) as stylesCount, cat
order by stylesCount desc
return distinct s, collect(distinct cat.name)[0..5] as topCategories
order by s.name
So the trick is to first count() in a WITH, then order by that count in the same WITH, and collect DISTINCT in the RETURN. I am not so sure about these multiple WITH statements, will try to clean it up. ;)
MATCH (s:Store)-[:RELEASED]->(:Movie)<-[:ASSIGNED]-(cat:MovieCategorie)
WITH s, COUNT(cat) AS count, cat
ORDER BY s.name, count DESC
RETURN s.name AS Store, COLLECT(cat.name)[0..5] AS `Top Categories`
And if you want the sum of the viewsCount property from the Movie nodes per store:
MATCH (s:Store)-[:RELEASED]->(m:Movie)<-[:ASSIGNED]-(cat:MovieCategorie)
WITH s, COUNT(cat) AS count, m, cat
ORDER BY s.name, count DESC
RETURN s.name AS Store, COLLECT(cat.name)[0..5] AS `Top Categories`, SUM(m.viewsCount) AS `Total Views`
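To sanity-check what any of these queries should return, the example database above is small enough to recompute by hand. Below is a minimal Python sketch that mirrors the example data in plain dictionaries and computes, per store, the categories ordered by how many of the store's movies carry them, plus the sum of viewsCount over the store's movies (each movie counted once). The dictionary layout and the function name store_summary are illustrative assumptions, not part of Neo4j.

```python
from collections import Counter

# Mirror of the example database from the question.
views = {f"Movie {i}": 50 for i in range(1, 6)}
categories = {
    "Movie 1": ["Cat 1", "Cat 3"],
    "Movie 2": ["Cat 2"],
    "Movie 3": ["Cat 1", "Cat 2", "Cat 3"],
    "Movie 4": ["Cat 1", "Cat 3"],
    "Movie 5": ["Cat 3"],
}
released = {
    "Store 1": ["Movie 1", "Movie 3", "Movie 4", "Movie 5"],
    "Store 2": ["Movie 1", "Movie 2", "Movie 3", "Movie 4", "Movie 5"],
    "Store 3": ["Movie 1"],
}


def store_summary(store, top_n=5):
    """Return (top categories by movie count, total views) for one store."""
    movies = released[store]
    counts = Counter(cat for movie in movies for cat in categories[movie])
    top = [name for name, _ in counts.most_common(top_n)]
    total_views = sum(views[movie] for movie in movies)  # each movie once
    return top, total_views


for store in sorted(released):
    print(store, *store_summary(store))
```

Summing the views in a separate pass over the store's movies corresponds to the extra MATCH in the self-answer above: it keeps a movie with several categories from being counted more than once.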

Pig Latin issue

Please help me out. It's really urgent: the deadline is nearing and I have been stuck on this for 2 weeks with no result. I am a newbie in Pig Latin.
I have a scenario where I have to filter data from a CSV file.
The CSV is on HDFS and has two columns.
grunt>> fl = load '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
grunt>> dump fl;
("first~584544fddf~dssfdf","2001")
("first~4332990~fgdfs4s","2001")
("second~232434334~fgvfd4","1000")
("second~786765~dgbhgdf","1000")
("second~345643~gfdgd43","1000")
What I need to do is extract only the first word before the first '~' sign and concatenate it with the second column value of the CSV file. I also need to group the concatenated results, count the number of identical rows, and create a new CSV file as output, again with 2 columns: the first column is the concatenated value and the second column is the row count.
For example:
("first 2001","2")
("second 1000","3")
and so on.
I have written the code below, but it's just not working. I have used STRSPLIT, and it splits the values of the first column of the input CSV file, but I don't know how to extract the first split value.
The code is given below:
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
fil = FILTER convData BY conv != '"-1"'; --im using this to filter out the rows that has 1st column as "-1".
data = FOREACH fil GENERATE STRSPLIT($0, '~');
X = FOREACH data GENERATE CONCAT(data.$0,' ',convData.clnt);
Y = FOREACH X GROUP BY X;
Z = FOREACH Y GENERATE COUNT(Y);
var = FOREACH Z GENERATE CONCAT(Y,',',Z);
STORE var INTO '/user/hduser/output.csv' USING PigStorage(',');
STRSPLIT returns a tuple, the individual elements of which you can access using the numbered syntax. This is what you need:
data = FOREACH fil GENERATE STRSPLIT($0, '~') AS a, clnt;
X = FOREACH data GENERATE CONCAT(a.$0,' ', clnt);
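The remaining group-and-count step is not shown in the answer. As a reference for the output the question asks for, here is a minimal Python sketch of the whole pipeline on the sample rows (with the surrounding quotes already stripped, which is an assumption about how the file is parsed). The function name count_keys is hypothetical.

```python
from collections import Counter


def count_keys(rows):
    """Count rows per '<first word before ~> <client>' key."""
    counts = Counter()
    for conv, clnt in rows:
        first = conv.split("~", 1)[0]  # text before the first '~'
        counts[f"{first} {clnt}"] += 1
    return counts


# Sample rows from the question, without the surrounding quotes.
rows = [
    ("first~584544fddf~dssfdf", "2001"),
    ("first~4332990~fgdfs4s", "2001"),
    ("second~232434334~fgvfd4", "1000"),
    ("second~786765~dgbhgdf", "1000"),
    ("second~345643~gfdgd43", "1000"),
]

counts = count_keys(rows)
for key, n in counts.items():
    print(f"{key},{n}")
```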

Resources