Best way to work with large amounts of CSV data quickly - ruby

I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output—they contain data that amplifies what we know about the millions of lines in the CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.
Imagine that the large CSV file has transactions but the customer information and billing information is recorded in two other files and we want to output a new CSV that has each transaction linked to the customer ID and account ID, etc.
A colleague has a functional program written in Java to do this but it is very slow. The reason is that the CSV file with the millions of lines has to be walked through many, many, many times apparently.
My question is—yes, I am getting to it—how should I approach this in Ruby? The goal is for it to be faster (it currently takes 18+ hours with very little CPU activity).
Can I load this many records into memory? If so, how should I do it?
I know this is a little vague. Just looking for ideas as this is a little new to me.

Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).
https://gist.github.com/1323865
A standard FasterCSV.parse pulling it all into memory was taking over an hour. This got it down to about 10 minutes.
The relevant part is this:
lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  if lines.size >= 1000
    # Try to parse the buffered chunk; if it ends mid-record (e.g. inside a
    # quoted field that spans lines), keep accumulating and retry on the next line.
    lines = FasterCSV.parse(lines.join) rescue next
    store lines
    lines = []
  end
end
# Flush whatever is left over at the end of the file.
store FasterCSV.parse(lines.join) unless lines.empty?
IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. When it gets to 1000 lines, it tries parsing a CSV and inserting just those rows. One tricky part is the "rescue next". If your CSV has some fields that span multiple lines, you may need to grab a few more lines to get a valid, parseable CSV string. Otherwise the line you're on could be in the middle of a field.
In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This allows you to insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new row. You can think of it like a create/update in one query. You'll need to set a unique index on at least one column for this to work.
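For reference, here is a minimal sketch of that bulk upsert idea using the mysql2 gem; the table name, columns, and the unique index on ticket_id are illustrative assumptions, not the gist's actual schema.

require 'mysql2'

CLIENT = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'zendesk')

# rows is an array of [ticket_id, subject] pairs taken from a parsed CSV chunk.
def store(rows)
  return if rows.empty?

  values = rows.map { |id, subject|
    "('#{CLIENT.escape(id.to_s)}', '#{CLIENT.escape(subject.to_s)}')"
  }.join(', ')

  # Requires a UNIQUE index on ticket_id; duplicate keys overwrite the existing
  # row instead of inserting a new one.
  CLIENT.query(<<-SQL)
    INSERT INTO tickets (ticket_id, subject)
    VALUES #{values}
    ON DUPLICATE KEY UPDATE subject = VALUES(subject)
  SQL
end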

How about using a database?
Jam the records into tables, then query them out using joins.
The import might take a while, but the DB engine will be optimized for the join and retrieval part...
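A rough sketch of that idea with the sqlite3 gem (the tables, columns, and file names are made up for illustration):

require 'sqlite3'
require 'csv'

db = SQLite3::Database.new('work.db')
db.execute('CREATE TABLE IF NOT EXISTS transactions (txn_id TEXT, customer_id TEXT, amount TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, account_id TEXT)')

# Bulk-load inside a single transaction; one INSERT per row outside a
# transaction is dramatically slower. (customers.csv is loaded the same way.)
db.transaction do
  CSV.foreach('transactions.csv', headers: true) do |row|
    db.execute('INSERT INTO transactions VALUES (?, ?, ?)',
               [row['txn_id'], row['customer_id'], row['amount']])
  end
end

# Once loaded, the merge is a single join the engine can optimize, ideally with
# an index on customers.customer_id.
CSV.open('merged.csv', 'w') do |out|
  db.execute(<<-SQL) { |row| out << row }
    SELECT t.txn_id, t.amount, c.account_id
    FROM transactions t
    JOIN customers c ON c.customer_id = t.customer_id
  SQL
end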

10M+ rows doesn't really sound like that much. If you can preload the contents of the files and match up the data in memory with decent data structures (you'll want maps at some point), you won't have to keep running through the CSV files over and over. File access is SLOW.

Two reasonably fast options:
Put your data into an SQLite DB. Then it's a simple query with a pair of joins that will perform far faster than anything you could write yourself -- SQL is very good for this kind of task.
Assuming your additional CSV files are small enough to fit into RAM, you can read everything into a hash, using customer ID as the key, then look up that hash while processing the main file with its 10M+ records. Note that only the lookup data needs to fit into RAM; the main list can be processed in small batches.
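A minimal sketch of the second option, assuming illustrative file names and column headers:

require 'csv'

# Load the two small lookup files into hashes keyed by customer ID.
customers = {}
CSV.foreach('customers.csv', headers: true) { |row| customers[row['customer_id']] = row }

billing = {}
CSV.foreach('billing.csv', headers: true) { |row| billing[row['customer_id']] = row }

# Stream the big transactions file once, writing each merged row as it is read,
# so only the lookup hashes stay in memory.
CSV.open('merged.csv', 'w') do |out|
  CSV.foreach('transactions.csv', headers: true) do |txn|
    cust = customers[txn['customer_id']]
    bill = billing[txn['customer_id']]
    out << txn.fields + [cust && cust['account_id'], bill && bill['billing_plan']]
  end
end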

My experience is that with Ruby, you should be prepared for about 10x the memory usage of the actual payload. Of course, with current amounts of RAM, if the process loads only one file at a time, 10 MB is almost negligible even when multiplied by ten :)
If you can read one line at a time (which is easy with File instances), you could use FasterCSV and write one line at a time as well. That would make memory consumption O(1) instead of O(n). But with 10 megabyte files you can probably slurp the whole file into memory and write it out as CSV in one pass, given only a few processes running at any given time.
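A sketch of that line-at-a-time approach with the standard CSV class (FasterCSV became Ruby's built-in CSV in 1.9); the transformation step is just a placeholder:

require 'csv'

CSV.open('output.csv', 'w') do |out|
  CSV.foreach('input.csv', headers: true) do |row|
    # Enrich or transform the row here, then write it straight out;
    # only one row is held in memory at a time.
    out << row.fields
  end
end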

If you have a Java program written, make sure you use the NIO libraries. They are way faster than the default. I have processed text files with 500,000 lines using the NIO libraries before.

Related

Elasticsearch - best way for multiple index updates?

I'm integrating with an external system.
From it I have 3 files:
customer_data.csv
address_data.csv
additional_customer_data.csv
The order in each of them can be random.
There are:
a one-to-many relation (customer_data => addresses), but I am only interested in one address of a specified kind
a one-to-one relation (customer_data => additional_customer_data)
Goal:
Merge the files together and put them into one index in Elasticsearch.
Additional info:
-each file has circa 1 million records
-this operation will be done each night
-data is used only for search purposes
Options:
a) I thought about:
Parse the first file and add it to ES
Do the same with the next files and update the documents created in point one
This looks very inefficient.
b) Another way:
parse the first file and add it to a relational database
do the same with the other files and update the records from point one
propagate the data to ES
Can you see any other options?
I assume you have a normalized relational data structure with 1-to-n relationships in those CSV files, like this:
customer_data.csv
Id;Name;AdressId;AdditionalCustomerDataId;...
0;Mike;2;1;...
address_data.csv
Id;Street;City;...
....
2;Abbey Road;London;...
additional_customer_data.csv
Id;someData;...
...
1;data;...
In that case, I would denormalize those in a preprocessing step into one single CSV and use that to upload them to ES. To avoid downtime, you can then use aliases.
Preprocessing can be done in any language, but converting the CSVs into an SQLite table will probably be the fastest.
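For example, a rough Ruby sketch of that preprocessing step under the assumed structure above (the 'Kind' column used to pick the one interesting address is an assumption):

require 'csv'

# Load the two smaller files into hashes keyed by Id.
addresses = {}
CSV.foreach('address_data.csv', col_sep: ';', headers: true) do |row|
  addresses[row['Id']] = row if row['Kind'] == 'MAIN' # keep only the address kind of interest
end

extra = {}
CSV.foreach('additional_customer_data.csv', col_sep: ';', headers: true) do |row|
  extra[row['Id']] = row
end

# Stream customer_data.csv once and emit one denormalized row per customer.
CSV.open('denormalized.csv', 'w', col_sep: ';') do |out|
  CSV.foreach('customer_data.csv', col_sep: ';', headers: true) do |cust|
    addr = addresses[cust['AdressId']]
    add  = extra[cust['AdditionalCustomerDataId']]
    out << cust.fields + (addr ? addr.fields : []) + (add ? add.fields : [])
  end
end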
I wouldn't choose a strategy that creates just half of the document and adds the additional information later, as you would probably need to reindex afterwards.
However, maybe you can tell us more about the requirements and the external system, because this doesn't seem to be a great solution.

Hadoop grep dump sql

I want to use Apache Hadoop to parse large files (~20 MB each). These files are PostgreSQL dumps (i.e. mostly CREATE TABLE and INSERT). I just need to filter out anything that is not CREATE TABLE or INSERT INTO in the first place.
So I decided to use the grep map reduce with ^(CREATE TABLE|INSERT).*;$ pattern (lines starting with CREATE TABLE or INSERT and ending with a ";").
My problem is that some of these CREATE and INSERT statements span multiple lines (because the schema is really large, I guess), so the pattern isn't able to match them at all, for example:
CREATE TABLE test(
  "id"....
  ..."name"...
);
I guess I could write a MapReduce job to refactor each "insert" and "create" onto one line, but that would be really costly because the files are large. I could also remove all "\n" from the file, but then a single map operation would have to handle multiple create/insert statements, making the balance of the work really bad. I'd really like one map operation per insert or create.
I'm not responsible for the creation of the dump files so I cannot change the layout of the initial dump files.
I actually have no clue what the best solution is, so I could use some help :). I can provide any additional information if needed.
First things first:
20 MB files are NOT large files by Hadoop standards; you will probably have many files (unless you only have a tiny amount of data), so there should be plenty of parallelization possible.
As such, having 1 mapper per file could very well be an excellent solution, and you may even want to concatenate files to reduce overhead.
That being said:
If you don't want to handle all lines at once, and handling a single line at once is insufficient, then the only straightforward solution would be to handle 'a few' lines at once, for example 2 or 3.
An alternate solution would be to chop the file up and use one map per filepart, but then you either need to deal with the edges, or accept that your solution may not remove one of the desired bits.
I realize that this is still quite a conceptual answer, but based on your progress so far, I feel that this may be sufficient to get you there.
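If you do decide to pre-process the dumps before Hadoop sees them (the "refactor each statement onto one line" idea from the question), a small Ruby sketch of that buffering could look like this; it assumes every statement ends with a ';' at the end of a line:

buffer = []

File.open('statements.txt', 'w') do |out|
  File.foreach('dump.sql') do |line|
    buffer << line.chomp
    # A statement is complete once a line closes with ';'.
    if line.rstrip.end_with?(';')
      statement = buffer.join(' ')
      out.puts(statement) if statement =~ /\A(CREATE TABLE|INSERT)/
      buffer = []
    end
  end
end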

Best way to sort 200-300 million records in Pentaho?

I am working on a new task where my input CSV file has about 200 to 300 million records. My requirement is to sort the incoming data, perform lookups, get the key value, and insert it into the target table. One suggestion was to write a Java plugin that will sort and store the data in multiple temp files (say a million records each) and retrieve it from there. I was thinking of using the Sort step in Pentaho and setting the number of copies to start. But I am not sure what the best approach is. Can anyone suggest how to go about this? Thanks.
I have used PDI to sort this many rows. The Sort step works fine, though it can be finicky. I set my "Free memory threshold (in %)" to ~50. The step will generate gobs of temp files in your "Sort-directory"; if the job crashes (usually by running out of memory) you will have to remove the temp files manually.
If I had to do it again I'd probably set the "Compress TMP Files?" option since multiple failures ran me out of disk space. Good luck!
A custom sort in Java may give you better performance, but development time will be significant. If you're going to sort this many rows daily/weekly, whatever, it's probably worth it. If not, just stick with PDI's Sort.
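Not PDI-specific, but for reference, a rough Ruby sketch of the "sorted temp files plus merge" approach from the question; the chunk size and whole-line sort key are assumptions, and header handling is omitted:

require 'tempfile'

CHUNK_SIZE = 1_000_000
chunks = []
buffer = []

flush = lambda do
  next if buffer.empty?
  t = Tempfile.new('sort-chunk')
  t.write(buffer.sort.join) # sort the chunk in memory, spill it to disk
  t.rewind
  chunks << t
  buffer.clear
end

# Phase 1: write sorted chunks to temp files.
File.foreach('input.csv') do |line|
  buffer << line
  flush.call if buffer.size >= CHUNK_SIZE
end
flush.call

# Phase 2: k-way merge of the sorted chunk files into the final output.
File.open('sorted.csv', 'w') do |out|
  heads = chunks.map(&:gets)
  until heads.all?(&:nil?)
    i = heads.each_index.reject { |j| heads[j].nil? }.min_by { |j| heads[j] }
    out.write(heads[i])
    heads[i] = chunks[i].gets
  end
end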

What is the capacity of a BluePrism Internal Work Queue?

I am working in Blue Prism Robotic Process Automation and trying to load an Excel sheet with more than 100k records (it might go upwards of 300k in some cases).
I am trying to load it into Blue Prism's internal work queue, but I get the error quoted below:
'Load Data Into Queue' ERROR: Internal : Exception of type 'System.OutOfMemoryException' was thrown.
Is there a way to avoid this problem, perhaps by freeing up more memory?
I plan to process records one by one from the queue and put them into new Excel sheets by category. Loading all that data into a collection and looping over it may be memory consuming, so I am trying to find a more efficient way.
I welcome any and all help/tips.
Thanks!
Basic Solution:
Break up the number of Excel rows you are pulling into your Collection data item at any one time. The thresholds for this will depend on your resource's system memory and architecture, as well as the structure and size of the data in the Excel Worksheet. I've been able to move 50k 10-column rows from Excel to a Collection and then into the Blue Prism queue very quickly.
You can set this up by specifying the Excel Worksheet range to pull into the Collection data item, and then shift that range each time the Collection has been successfully added to the queue.
After each successful addition to the queue and/or before you shift the range and/or at a predefined count limit you can then run a Clean Up or Garbage Collection action to free up memory.
You can do all of this with the provided Excel VBO and an additional Clean Up object.
Keep in mind:
Even breaking it up, looping over a Collection this large to amend the data will be extremely expensive and slow. The most efficient way to make changes to the data will be at the Excel Workbook level or when it is already in the Blue Prism queue.
Best Bet: esqew's alternative solution is the most elegant and probably your best bet.
Jarrick hit it on the nose in that Work Queue items should provide the bot with information on what they are to be working on and a Control Room feedback space, but not the actual work data to be implemented/manipulated.
In this case you would want to just use the items Worksheet row number and/or some unique identifier from a single Worksheet column as the queue item data so that the bot can provide Control Room feedback on the status of the item. If this information is predictable enough in format there should be no need to move any data from the Excel Worksheet to a Collection and then into a Work Queue, but rather simply build the queue based on that data predictability.
Conversely you can also have the bot build the queue "as it happens", in that once it grabs the single row data from the Excel Worksheet to work it, can as well add a queue item with the row number of the data. This will then enable Control Room feedback and tracking. However, this would, in almost every case, be a bad practice as it would not prevent a row from being worked multiple times unless the bot checked the queue first, at which point you've negated the speed gains you were looking to achieve in cutting out the initial queue building in the first place. It would also be impossible to scale the process for multiple bots to work the Excel Worksheet data efficiently.
This is a common issue for RPA, especially when working with large Excel files. As far as I know, there are no 100% solutions, only methods that reduce the symptoms. I have run into this problem several times and these are the ways I would try to handle it:
Set stage logging to Disabled or Errors Only.
Don't log parameters on action stages (especially ones that work with the Excel files)
Run the Garbage Collection process
See if it is possible to avoid reading Excel files into BP collections and use OLEDB to query the file
See if it is possible to increase the RAM on the machines
If they’re using the 32-bit version of the app, then it doesn’t really matter how much memory you feed it, Blue Prism will cap out at 2 GB.
This may be because of the BP Server, as memory is shared between Processes and the Work Queue. A better option is to use two bots and multiple queues to avoid the memory error.
If you're using Excel documents or CSV files, you can use the OLEDB object to connect and query against it as if it were a database. You can use the SQL syntax to limit the amount of rows that are returned at a time and paginate through them until you've reached the end of the document.
For starters, you are making incorrect use of the Work Queue in Blue Prism. The Work Queue should not be used to store this type and amount of data. (please read the BP documentation on Work Queues thoroughly).
Solving the issue at hand (the misuse) requires 2 changes:
Only store references in your Item Data which point to the Excel file containing the data.
If you're consulting this much data many times, perhaps convert the file into a CSV and write a VBO that queries the data directly in the CSV.
The first change is not just a recommendation; as your project progresses and IT Architecture and InfoSec come into play, it will be mandatory.
As for the CSV VBO, take a look at C#; it will make your life a lot easier than loading all this data into BP (time consuming, unreliable, ...).

Import data from a large CSV (or stream of data) to Neo4j efficiently in Ruby

I am new to background processes, so feel free to point me out if I am making wrong assumptions.
I am trying to write a script that imports data into a Neo4j db from a large CSV file (consider it a stream of data, endless). The CSV file contains only two columns - user_a_id and user_b_id, which map the directed relations. A few things to consider:
data might have duplicates
the same user can map to multiple other users, and there is no guarantee when it will show up again.
My current solution: I am using sidekiq and have one worker to read the file in batches and dispatch workers to create edges in the database.
Problems that I am having:
Since I am receiving a stream of data, I cannot pre-sort the file and assign a job that builds the relations for one user.
Since jobs are performed asynchronously, if two workers are working on relations of the same node, I will get a write lock from Neo4j.
Let's say I get around the write lock: if two workers are working on records that are duplicated, I will build duplicated edges.
Possible solution: Build a synchronous queue and have only one worker perform the writing (it seems neither Sidekiq nor Resque has that option). And this could be pretty slow since only one thread is working on the jobs.
Or, I can write my own implementation, which creates one worker that builds multiple queues of jobs based on user_id (one unique id per queue) and uses Redis to store them. Then assign one worker per queue to write to the database. Set a maximum number of queues so I don't run out of memory, and delete a queue once it exhausts all its jobs (rebuild it if I see the same user_id in the future). - This doesn't sound trivial though, so I would prefer using an existing library before diving into this.
My question is: is there an existing gem that I can use? What is a good practice for handling this?
You have a number of options ;)
If your data really is in a file and not as a stream, I would definitely recommend checking out the neo4j-import command which comes with Neo4j. It allows you to import CSV data at speeds on the order of 1 million rows per second. Two caveats: You may need to modify your file format a bit, and you would need to be generating a fresh database (it doesn't import new data into an existing database)
I would also get familiar with the LOAD CSV command. This takes a CSV in any format and lets you write some Cypher commands to transform and import the data. It's not as fast as neo4j-import, but it's pretty fast and it can stream a CSV file from disk or a URL.
Since you're using Ruby, I would also suggest checking out neo4apis. This is a gem that I wrote to make it easier to batch import data so that you're not making a single request for every entry in your file. It allows you to define a class in a sort of DSL with importers. These importers can take any sort of Ruby object and, given that Ruby object, will define what should be imported using add_node and add_relationship methods. Under the covers this generates Cypher queries which are buffered and executed in batches so that you don't have lots of round trips to Neo4j.
I would investigate all of those things first before thinking about doing things asynchronously. If you really do have a never-ending set of data coming in, however, the MERGE clause should help you with any race conditions or locks. It allows you to create objects and relationships if they don't already exist. It's basically a find_or_create, but at the database level. If you use LOAD CSV you'll probably want MERGE as well, and neo4apis uses MERGE under the covers.
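To make that concrete, here is a minimal sketch of LOAD CSV with MERGE driven from Ruby via the neo4j-core gem; the session API, file location, labels, and relationship type are assumptions for illustration:

require 'neo4j-core'

session = Neo4j::Session.open(:server_db, 'http://localhost:7474')

# MERGE acts as a find-or-create at the database level, so duplicate rows in
# the CSV don't produce duplicate nodes or relationships.
session.query(<<-CYPHER)
  USING PERIODIC COMMIT 1000
  LOAD CSV FROM 'file:///relations.csv' AS row
  MERGE (a:User {user_id: row[0]})
  MERGE (b:User {user_id: row[1]})
  MERGE (a)-[:RELATES_TO]->(b)
CYPHER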
Hope that helps!
