Searching CSV with 1.6 Million lines (150MB) file?

I have a CSV file of about 150MB containing 1.6 million lines of product data. I have another CSV with 2000 lines, which lists products that appear in the big CSV. The two files are related by a unique ID. The idea is to add the product data to the CSV with 2000 lines.
databank.csv has the headers ID, Product Name, Description, Price.
sm_list.csv has the header ID.
The result should be a CSV, 2000 rows long, with the products in sm_list.csv and their corresponding data from databank.csv.
My original solution reads in all of sm_list, then reads databank line by line and searches sm_list for the ID in each line read from databank. This leads to 2000 x 1.6 million = 3200 million comparisons!
Could you please provide a basic algorithm outline to complete this task in the most efficient way?

Assuming you know how to read/write CSV files in MATLAB (several questions here on SO show how), here is an example:
%# this would be read from "databank.csv"
prodID = (1:10)'; %'
prodName = cellstr( num2str(prodID, 'Product %02d') );
prodDesc = cellstr( num2str(prodID, 'Description %02d') );
prodPrice = rand(10,1)*100;
databank = [num2cell(prodID) prodName prodDesc num2cell(prodPrice)];
%# same for "sm_list.csv"
sm_list = [2;5;7;10];
%# find matching rows
idx = ismember(prodID,sm_list);
result = databank(idx,:)
%# ... export 'result' to CSV file ...
The result of the above example:
result =
[ 2] 'Product 02' 'Description 02' [19.251]
[ 5] 'Product 05' 'Description 05' [14.651]
[ 7] 'Product 07' 'Description 07' [4.2652]
[10] 'Product 10' 'Description 10' [ 53.86]
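The same lookup idea can be sketched outside MATLAB. The following is a minimal Python sketch, not the answerer's code: it assumes both files have an ID header as described in the question, loads the 2000 wanted IDs into a set, and streams the big file once, so each line costs one hash lookup instead of a scan of the whole list. The sample rows and the helper name select_rows are made up for illustration.

```python
import csv
import io

def select_rows(databank_file, wanted_ids, id_column="ID"):
    """Yield rows of the large file whose ID is in wanted_ids (a set)."""
    for row in csv.DictReader(databank_file):
        # Set membership is a hash lookup, so the big file is scanned only once.
        if row[id_column] in wanted_ids:
            yield row

# Tiny in-memory stand-ins for databank.csv and sm_list.csv (made-up data).
databank_csv = io.StringIO(
    "ID,Product Name,Description,Price\n"
    "1,Product 01,Description 01,9.99\n"
    "2,Product 02,Description 02,19.25\n"
    "5,Product 05,Description 05,14.65\n"
)
sm_list_csv = io.StringIO("ID\n2\n5\n")

# Read the small list fully; stream the large file row by row.
wanted = {row["ID"] for row in csv.DictReader(sm_list_csv)}
matches = list(select_rows(databank_csv, wanted))
```

With real files, the io.StringIO objects would be replaced by open(..., newline="") handles, and the matching rows written out with csv.DictWriter.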

Do you have to use MATLAB? If you just load all that data into a database, it will be easier. A simple select tableA.ID, tableB.productname... where tableA.id = tableB.id will do it.
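As a rough illustration of the database route, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (databank, sm_list, product_name and so on) and the sample rows are assumptions chosen to mirror the CSV headers in the question, not names from the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE databank ("
    "id INTEGER PRIMARY KEY, product_name TEXT, description TEXT, price REAL)"
)
conn.execute("CREATE TABLE sm_list (id INTEGER PRIMARY KEY)")

# Made-up sample rows standing in for the two CSV files.
conn.executemany(
    "INSERT INTO databank VALUES (?, ?, ?, ?)",
    [
        (1, "Product 01", "Description 01", 9.5),
        (2, "Product 02", "Description 02", 19.25),
        (3, "Product 03", "Description 03", 4.5),
    ],
)
conn.executemany("INSERT INTO sm_list VALUES (?)", [(2,), (3,)])

# The join keeps only the databank rows whose id appears in sm_list.
joined = conn.execute(
    "SELECT d.id, d.product_name, d.description, d.price "
    "FROM sm_list AS s JOIN databank AS d ON d.id = s.id "
    "ORDER BY d.id"
).fetchall()
conn.close()
```

Because id is the primary key of databank, the database can use its index for the join rather than comparing every pair of rows.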

Related

Is there a way to copy only unique rows in an Excel worksheet column to another sheet?

I use a CSV file as $AgencyMaster with two columns, AgencyID and AgencyName. I currently manually input these from another file, $Excel_File_Path, but I would like to automatically generate $AgencyMaster if possible.
$Excel_File_Path has three worksheets: Sheet1, Sheet2 and Template. Sheet1 and Sheet2 are full of data, while Template is used as a graphical representation of said data which populates based on the AgencyID. I have a script that opens $Excel_File_Path, inputs AgencyID into a specific cell, saves it, then converts it to a PDF. It does this for each AgencyID in $AgencyMaster, which is currently over 200.
In $Excel_File_Path, columns B and C in Sheet1 and Sheet2 contain all of the AgencyIDs and AgencyNames, but there are a bunch of duplicates. I can't delete any of the rows because while they are duplicates in column B and C, columns D, E, F, etc have different data used in Template. So I need to be able to take each unique AgencyID and AgencyName which may appear in Sheet1 or Sheet2 and export them to a CSV to use as $AgencyMaster.
Example:
(https://i.imgur.com/j8UIZqp.jpg)
Column B contains the AgencyID and Column C contains the AgencyName. I'd like to export the unique values of each from Sheet1 and Sheet2 to the CSV $AgencyMaster.
I've found how to export them to a different worksheet within the same workbook, just not to a separate workbook altogether. I'd also like to save the result as a .CSV with leading 0's kept in column A.
# Checking that $AgencyMaster Exists, and importing the data if it does
If (Test-Path $AgencyMaster) {
$AgencyData = Import-CSV -Path $AgencyMaster
# Taking data from $AgencyMaster and assigning it to each variable
ForEach ($Agency in $AgencyData) {
$AgencyID = $Agency.AgencyID
$AgencyName = $Agency.AgencyName
# Insert agency code into cell D9 on Template worksheet
$ExcelWS.Cells.Item(9,4) = $AgencyID
$ExcelWB.Save()
# Copy-Item Properties
$Destination_File_Path = "$Xlsx_Destination\$AgencyID - $AgencyName - $company $month $year.xlsx"
$CI_Props = @{
'Path' = $Excel_File_Path;
'Destination' = $Destination_File_Path;
'PassThru' = $true;
} # Close $CI_Props
# Copy & Rename file
Copy-Item @CI_Props
} # Close ForEach
} # Close IF

I would recommend using either Sort-Object -Unique or Group-Object.
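The deduplication step itself is language independent. Below is a small Python sketch of the logic, under the assumption that the (AgencyID, AgencyName) pairs from columns B and C of both sheets have already been read into lists; the sample values and the helper name unique_agencies are hypothetical. IDs are kept as strings so that leading zeros survive the export.

```python
import csv
import io

def unique_agencies(*sheets):
    """Return sorted unique (AgencyID, AgencyName) pairs across all sheets."""
    seen = set()
    for rows in sheets:
        for agency_id, agency_name in rows:
            seen.add((agency_id, agency_name))
    return sorted(seen)

# Made-up pairs standing in for columns B and C of Sheet1 and Sheet2.
sheet1 = [("00123", "North Agency"), ("00123", "North Agency"), ("04567", "East Agency")]
sheet2 = [("04567", "East Agency"), ("00089", "South Agency")]

# Write to an in-memory buffer; a real script would open the CSV file instead.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["AgencyID", "AgencyName"])
writer.writerows(unique_agencies(sheet1, sheet2))
agency_master_csv = out.getvalue()
```

Note that opening such a CSV in Excel may still strip the leading zeros on display, even though they are present in the file.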

reading specific columns from excel file and writing to another excel file

My objective is to read columns from an Excel file and write them to another, new file.
So far I can create a new file with specified columns, and I can read an Excel file by row and column index. But my objective is different.
I have to pick specific columns from the Excel file and write all their data to another file under the same columns.
How can I achieve this?
require 'spreadsheet'
#Step 1 : create an excel sheet
book = Spreadsheet::Workbook.new
sheet = book.create_worksheet
sheet.row(0).concat %w[id column_1 column_2 column_3]
book.write 'Data/write_excel.xls'
#step 2 : read the data excel file
book1 = Spreadsheet.open('Data/read_excel.xls')
sheet1 = book1.worksheet('')
val = sheet1[0, 1]
puts val

This is one option, if you know in advance the index of the source column and the index of the destination column:
# Step 3 copy the columns
col_num = 3 #=> the destination column
row_num = 1 #=> to skip the headers and start from the second row
sheet1.each row_num do |row|
sheet[row_num, col_num] = row[0] # the index in row[0] is the column to copy from the source file
row_num += 1
end
Then save the file: book.write 'filename'
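If the columns should be picked by header name rather than by a fixed index, the mapping step can be sketched as follows. This is a Python sketch over plain lists of rows, not code for the spreadsheet gem; the sample data and the helper name copy_columns are assumptions for illustration.

```python
def copy_columns(source_rows, wanted_headers):
    """Return rows containing only the wanted columns, looked up by header name."""
    header = source_rows[0]
    # Find where each wanted header sits in the source sheet.
    indexes = [header.index(name) for name in wanted_headers]
    return [[row[i] for i in indexes] for row in source_rows]

# Made-up rows standing in for the source worksheet; the first row is the header.
source = [
    ["id", "column_1", "column_2", "column_3"],
    [1, "a", "b", "c"],
    [2, "d", "e", "f"],
]

copied = copy_columns(source, ["id", "column_3"])
```

The same two steps (find the index of each header, then copy that index from every row) carry over to the Ruby loop above.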

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?

There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})

You can filter the bag inside a nested FOREACH and project a boolean that says whether 2 is present in the bag. Then filter the rows on that boolean.
So:
input = LOAD 'data.txt' AS (id:chararray, event_list:bag{});
input_filt = FOREACH input {
bag_filter = FILTER event_list BY ($0 matches '2');
GENERATE
id,
event_list,
(IsEmpty(bag_filter.$0) ? false : true) AS is_2_present:boolean;
};
output = FILTER input_filt BY is_2_present;
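Independent of Pig, the filtering logic amounts to a membership test on each row's inner collection. Here is a minimal Python sketch using the three rows from the question; the helper name rows_with_event is hypothetical.

```python
def rows_with_event(rows, event):
    """Keep only the rows whose event list contains the given event."""
    return [(row_id, events) for row_id, events in rows if event in events]

# The sample rows from the question, with each bag represented as a list.
rows = [
    ("22678", ["112", "110", "2"]),
    ("656565", ["110", "109"]),
    ("6676", ["2", "112"]),
]

kept = rows_with_event(rows, "2")
```

Both Pig approaches above aim at the same result: the rows with ids 22678 and 6676 are kept, and 656565 is dropped.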

Neo4j cypher complicated query sort ,count, sum before collect

I am a newbie with the Neo4j database and have just started learning it. I am stuck and looking for some help. Is it possible to get this in one Cypher query? How?
My graph structure looks like this:
(s:Store)-[r:RELEASED]->(m:Movie)<-[r1:ASSIGNED]-(cat:MovieCategorie)
How could I get this data?
Movie store (got)
Movie (got)
The 5 most common categories of movies in that store (I don't know how to sort them before using collect(cat.name)[0..5])
Could anyone suggest how to get this data? I have tried many times and failed. This is what I have, and it doesn't work.
match (s:Store)
with s
match (s)-[r:RELEASED]->(m:Movie)
with s,m
match (m)<-[r1:ASSIGNED]-(cat:MovieCategorie)
with s, m, count(r1) as stylesCount, cat
order by stylesCount
return distinct s as store, collect(cat.name)[0..5] as topCategories
order by store.name
Thank you!
OK, so now that I have my query right and am developing it further, I have a problem combining the aggregation functions COUNT and SUM.
My query, which works well for finding the top categories per store:
MATCH (s:Store)
OPTIONAL MATCH (s)-[:RELEASED]->(m:Movie)<-[r:ASSIGNED]-(cat:MovieCategorie)
WITH s, COUNT(r) AS count, cat
ORDER BY count DESC
RETURN s AS Store, COLLECT(distinct cat.name) AS `Top Categories`
ORDER BY Store.name
On top of this query I need to count how many views the store has: sum(m.viewsCount) as the total store views. I tried adding it to the same WITH statement as COUNT, and I tried putting it in the RETURN. In both cases it doesn't work the way I would like. Any suggestions or examples? I am still confused about how WITH works with aggregation functions... :(
Create the example database:
CREATE (s1:Store) SET s1.name = 'Store 1'
CREATE (s2:Store) SET s2.name = 'Store 2'
CREATE (s3:Store) SET s3.name = 'Store 3'
CREATE (m1:Movie) SET m1.title = 'Movie 1', m1.viewsCount = 50
CREATE (m2:Movie) SET m2.title = 'Movie 2', m2.viewsCount = 50
CREATE (m3:Movie) SET m3.title = 'Movie 3', m3.viewsCount = 50
CREATE (m4:Movie) SET m4.title = 'Movie 4', m4.viewsCount = 50
CREATE (m5:Movie) SET m5.title = 'Movie 5', m5.viewsCount = 50
CREATE (c1:MovieCategorie) SET c1.name = 'Cat 1'
CREATE (c2:MovieCategorie) SET c2.name = 'Cat 2'
CREATE (c3:MovieCategorie) SET c3.name = 'Cat 3'
CREATE (m1)<-[:ASSIGNED]-(c1)
CREATE (m1)<-[:ASSIGNED]-(c3)
CREATE (m2)<-[:ASSIGNED]-(c2)
CREATE (m3)<-[:ASSIGNED]-(c1)
CREATE (m3)<-[:ASSIGNED]-(c2)
CREATE (m3)<-[:ASSIGNED]-(c3)
CREATE (m4)<-[:ASSIGNED]-(c1)
CREATE (m4)<-[:ASSIGNED]-(c3)
CREATE (m5)<-[:ASSIGNED]-(c3)
CREATE (s1)-[:RELEASED]->(m1)
CREATE (s1)-[:RELEASED]->(m3)
CREATE (s1)-[:RELEASED]->(m4)
CREATE (s1)-[:RELEASED]->(m5)
CREATE (s2)-[:RELEASED]->(m1)
CREATE (s2)-[:RELEASED]->(m2)
CREATE (s2)-[:RELEASED]->(m3)
CREATE (s2)-[:RELEASED]->(m4)
CREATE (s2)-[:RELEASED]->(m5)
CREATE (s3)-[:RELEASED]->(m1)
SOLVED! I finally did it. The trick was to use one more MATCH after everything else. Great, now I can sleep in peace. Thank you.
MATCH (s:Store)-[:RELEASED]->(m:Movie)<-[r:ASSIGNED]-(cat:MovieCategorie)
with s,count(r) as catCount, cat
order by catCount desc
with s, collect( distinct cat.name)[0..5] as TopCategories
match (s)-[:RELEASED]->(m:Movie)
return s as Store, TopCategories, sum(m.viewsCount) as TotalViews

Ok, that was fast :D I finally got it!
match (s:Store)
with s
match (s)-[r:RELEASED]->(m:Movie)
with s, m
match (m)<-[r2:ASSIGNED]-(cat:MovieCategorie)
with s, count(r2) as stylesCount, cat
order by stylesCount desc
return distinct s, collect(distinct cat.name)[0..5] as topCategories
order by s.name
So the trick is to do the count() in a WITH first, then order by it in that WITH, and collect DISTINCT in the RETURN. I am not so sure about these multiple WITH statements; I will try to clean them up. ;)

MATCH (s:Store)-[:RELEASED]->(:Movie)<-[:ASSIGNED]-(cat:MovieCategorie)
WITH s, COUNT(cat) AS count, cat
ORDER BY s.name, count DESC
RETURN s.name AS Store, COLLECT(cat.name)[0..5] AS `Top Categories`
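And if you want the sum of the viewsCount property from the Movie nodes per store, collect the categories first and then match the movies again, so that a movie with several categories is counted only once:
MATCH (s:Store)-[:RELEASED]->(:Movie)<-[:ASSIGNED]-(cat:MovieCategorie)
WITH s, COUNT(cat) AS count, cat
ORDER BY s.name, count DESC
WITH s, COLLECT(cat.name)[0..5] AS topCategories
MATCH (s)-[:RELEASED]->(m:Movie)
RETURN s.name AS Store, topCategories AS `Top Categories`, SUM(m.viewsCount) AS `Total Views`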
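As a cross-check of what such a query should produce for the example database, the two aggregations can be sketched in Python. The dictionaries below transcribe the CREATE statements from the question; the helper name store_summary and the alphabetical tie-break are assumptions of this sketch. Categories are counted per (store, category) pair, while views are summed once per movie, so a movie with several categories is not counted twice.

```python
from collections import Counter

# Transcribed from the example database in the question.
released = {
    "Store 1": ["Movie 1", "Movie 3", "Movie 4", "Movie 5"],
    "Store 2": ["Movie 1", "Movie 2", "Movie 3", "Movie 4", "Movie 5"],
    "Store 3": ["Movie 1"],
}
categories = {
    "Movie 1": ["Cat 1", "Cat 3"],
    "Movie 2": ["Cat 2"],
    "Movie 3": ["Cat 1", "Cat 2", "Cat 3"],
    "Movie 4": ["Cat 1", "Cat 3"],
    "Movie 5": ["Cat 3"],
}
views = {movie: 50 for movie in categories}

def store_summary(released, categories, views, top_n=5):
    """Map each store to (top categories by frequency, total views)."""
    summary = {}
    for store, movies in released.items():
        counts = Counter(cat for movie in movies for cat in categories[movie])
        # Most frequent first; ties broken alphabetically (an assumption here).
        ranked = sorted(counts, key=lambda cat: (-counts[cat], cat))
        # Sum views per movie, not per (movie, category) row.
        total = sum(views[movie] for movie in movies)
        summary[store] = (ranked[:top_n], total)
    return summary

summary = store_summary(released, categories, views)
```

For Store 1 this gives the categories in the order Cat 3, Cat 1, Cat 2 and a total of 200 views. Summing over the joined movie and category rows instead would give 400, because each movie would be counted once per category.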

Pig Latin issue

Please help me out, it's really urgent. The deadline is nearing and I have been stuck on this for 2 weeks, breaking my head with no result. I am a newbie in Pig Latin.
I have a scenario where I have to filter data from a CSV file.
The CSV is on HDFS and has two columns.
grunt>> f1 = load '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
grunt>> dump f1;
("first~584544fddf~dssfdf","2001")
("first~4332990~fgdfs4s","2001")
("second~232434334~fgvfd4","1000")
("second~786765~dgbhgdf","1000")
("second~345643~gfdgd43","1000")
What I need to do is extract only the first word before the first '~' sign and concatenate it with the second column value of the CSV file. I also need to group the concatenated results, count the number of matching rows, and create a new CSV file as output, again with 2 columns: the first column would be the concatenated value and the second column would be the row count.
i.e.
("first 2001","2")
("second 1000","3")
and so on.
I have written the code below, but it's just not working. I have used STRSPLIT. It splits the values of the first column of the input CSV file, but I don't know how to extract the first split value.
The code is given below:
convData = LOAD '/user/hduser/file.csv' USING PigStorage(',') AS (conv:chararray, clnt:chararray);
fil = FILTER convData BY conv != '"-1"'; --im using this to filter out the rows that has 1st column as "-1".
data = FOREACH fil GENERATE STRSPLIT($0, '~');
X = FOREACH data GENERATE CONCAT(data.$0,' ',convData.clnt);
Y = FOREACH X GROUP BY X;
Z = FOREACH Y GENERATE COUNT(Y);
var = FOREACH Z GENERATE CONCAT(Y,',',Z);
STORE var INTO '/user/hduser/output.csv' USING PigStorage(',');

STRSPLIT returns a tuple, the individual elements of which you can access using the numbered syntax. This is what you need:
data = FOREACH fil GENERATE STRSPLIT($0, '~') AS a, clnt;
X = FOREACH data GENERATE CONCAT(a.$0,' ', clnt);
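To make the whole pipeline concrete (split, concatenate, group, count), here is a Python sketch of the same steps on the five sample rows from the question. It is not Pig code, the helper name count_by_key is hypothetical, and the surrounding double quotes from the dump are assumed to have been stripped already.

```python
from collections import Counter

def count_by_key(rows):
    """Count rows per 'first word + space + client' key."""
    counts = Counter()
    for conv, clnt in rows:
        # Keep only the text before the first '~', like element 0 of the split.
        first_word = conv.split("~", 1)[0]
        counts[first_word + " " + clnt] += 1
    return counts

# The sample rows from the question, without the surrounding quotes.
rows = [
    ("first~584544fddf~dssfdf", "2001"),
    ("first~4332990~fgdfs4s", "2001"),
    ("second~232434334~fgvfd4", "1000"),
    ("second~786765~dgbhgdf", "1000"),
    ("second~345643~gfdgd43", "1000"),
]

counts = count_by_key(rows)
```

In Pig, the grouping and counting part corresponds to a GROUP on the concatenated field followed by a FOREACH that generates the group key and a COUNT.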
