5GB file to read - algorithm

I have a design question. I have a 3-4 GB data file, ordered by time stamp. I am trying to figure out what the best way is to deal with this file.
I was thinking of reading this whole file into memory, then transmitting this data to different machines and then running my analysis on those machines.
Would it be wise to upload this into a database before running my analysis?
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might get too slow.
Any ideas?
Update:
I want to process the records one by one. Basically, I am trying to run a model on timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without decreasing system performance, which is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).

You can also access the file directly on the hard disk and read a small chunk at a time. Java has RandomAccessFile for this, but the same concept is available in other languages as well.
Whether you want to load the data into a database before doing the analysis should be governed purely by the requirements. If you can read the file and keep processing it as you go, there is no need to store it in a database. But if the analysis requires data from different areas of the file, then a database would be a good idea.
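The answer above names Java's RandomAccessFile, but the question is language-agnostic, so here is a minimal sketch of the same read-a-chunk-at-a-time idea in Ruby; the file name, chunk size, and the line count stand-in for "analysis" are placeholders.

```ruby
CHUNK_SIZE  = 64 * 1024 * 1024            # 64 MB per read; tune to your memory budget
total_lines = 0

File.open("timeseries.dat", "rb") do |f|  # placeholder file name
  # f.seek(byte_offset) would let you jump anywhere without reading the rest
  while (chunk = f.read(CHUNK_SIZE))
    total_lines += chunk.count("\n")      # stand-in for the real per-chunk analysis
  end
end
puts total_lines
```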

You do not need the whole file in memory, just the data you need for the analysis. You can read every line and store only the needed parts of the line, plus the offset where the line starts in the file, so you can find it later if you need more data from that line.
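A rough sketch of that idea in Ruby; the comma-separated layout and the leading timestamp field are assumptions about the record format, not something stated in the question.

```ruby
index = []  # [timestamp, byte_offset] pairs -- the only thing kept in memory

File.open("timeseries.dat", "r") do |f|            # placeholder file name
  until f.eof?
    offset = f.pos                                 # where this line starts
    line   = f.readline
    index << [line.split(",", 2).first, offset]    # keep just the timestamp
  end
end

# Later, pull a full record back on demand:
# File.open("timeseries.dat", "r") { |f| f.seek(index[42][1]); f.readline }
```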

Would it be wise to upload this into a database before running my analysis?
Yes.
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might get too slow.
Don't worry about that; it will be fine. Just introduce a marker so that the rows processed by each machine are identified.
I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once), then a db is the way to go. If you just need to process portions of these output files and trust the results, you can do it on the fly without storing any of the contents.
Only store the data you need, not everything in the files.

Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/
Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required but once you have things configured I think it could be the right path for you.
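Hadoop Streaming lets you plug in mappers written in any language, so here is a hedged sketch of the suggested "grep" starter as a map-only streaming job in Ruby; the pattern, the paths, and the streaming jar location are placeholders that vary by installation.

```ruby
#!/usr/bin/env ruby
# grep_mapper.rb -- emits only the input lines that match PATTERN.
# Example map-only invocation (adjust jar path and HDFS paths for your install):
#   hadoop jar hadoop-streaming.jar \
#     -D mapreduce.job.reduces=0 \
#     -input /data/timeseries.dat -output /data/grep-out \
#     -mapper grep_mapper.rb -file grep_mapper.rb

PATTERN = /ERROR/   # placeholder search term

STDIN.each_line do |line|
  print line if line =~ PATTERN
end
```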

I had a similar problem recently and, just as #lalit mentioned, I used the RandomAccessFile reader against the file on the hard disk.
In my case I only needed read access to the file, so I launched a bunch of threads, each starting at a different point in the file. That got the job done and really improved my throughput, since each thread could spend a good amount of time blocked on processing while the other threads were reading the file.
A program like the one I mentioned should be very easy to write, just try it and see if the performance is what you need.
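A rough Ruby equivalent of the threaded reader described above, assuming newline-delimited records; the file name is a placeholder, and note that on MRI the threads mainly overlap while blocked on I/O.

```ruby
path     = "timeseries.dat"                 # placeholder
nthreads = 4
size     = File.size(path)
slice    = size / nthreads

threads = (0...nthreads).map do |i|
  Thread.new do
    File.open(path, "rb") do |f|
      start_pos = i * slice
      end_pos   = (i == nthreads - 1) ? size : (i + 1) * slice
      if i.zero?
        f.seek(start_pos)
      else
        f.seek(start_pos - 1)   # back up one byte, then discard up to the next
        f.readline              # newline so we start on a whole line
      end
      while f.pos < end_pos && !f.eof?
        line = f.readline
        # ... run the model / analysis on `line` here ...
      end
    end
  end
end
threads.each(&:join)
```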

Related

Uploading data to HDFS cluster from custom format

I have several machines with TBs of log data in a custom format which can be read with a C++ library. I want to upload all the data to a Hadoop cluster (HDFS) while converting it to Parquet files.
This is an ongoing process (meaning every day I will get more data), not a one-time effort.
What is the best way to do this performance-wise (i.e., efficiently)?
Is the Parquet C++ library as good as the Java one? (updates, bugs, etc.)
The solution should handle tens of TBs per day or even more in the future.
Log data arrives continuously and should be available immediately on the HDFS cluster.
Performance-wise, your best approach will be to gather the data in batches and then write out a new Parquet file per batch. If your data is received in single lines and you want to persist them immediately on HDFS, you could also write them out to a row-based format that supports single-line appends, e.g. Avro, and regularly run a job that compacts them into a single Parquet file.
Library-wise, parquet-cpp is much more actively developed at the moment than parquet-mr (the Java library). This is mainly because active parquet-cpp development (re-)started about 1.5 years ago (winter/spring 2016). So updates to the C++ library happen very quickly at the moment, while the Java library is very mature, having had a huge user base for quite a few years. Some features, like predicate pushdown, are not yet implemented in parquet-cpp, but these are all on the read path, so they don't matter for writes.
We are now at a point with parquet-cpp where it already runs very stably in different production environments, so in the end your choice between the C++ and Java libraries should mainly depend on your system environment. If all your code currently runs in the JVM, use parquet-mr; otherwise, if you're a C++/Python/Ruby user, use parquet-cpp.
Disclaimer: I'm one of the parquet-cpp developers.

Rails, how to migrate large amount of data?

I have a Rails 3 app running an older version of Spree (an open source shopping cart). I am in the process of updating it to the latest version. This requires me to run numerous migrations on the database to make it compatible with the latest version. However, the app's current database is roughly 300 MB, and running the migrations on my local machine (Mac OS X 10.7, 4 GB RAM, 2.4 GHz Core 2 Duo) takes over three days to complete.
I was able to decrease this time to only 16 hours using an Amazon EC2 instance (High-I/O On-Demand Instances, Quadruple Extra Large). But 16 hours is still too long as I will have to take down the site to perform this update.
Does anyone have any other suggestions to lower this time? Or any tips to increase the performance of the migrations?
FYI: using Ruby 1.9.2, and Ubuntu on the Amazon instance.
Dropping indices beforehand and adding them again afterwards is a good idea.
Also replacing .where(...).each with .find_each and perhaps adding transactions could help, as already mentioned.
Replace .save! with .save(:validate => false), because during the migrations you are not getting random input from users; you should be making known-good updates, and validations account for much of the execution time. Using .update_attribute also skips validations where you're only updating one field.
Where possible, use fewer AR objects in a loop. Instantiating and later garbage collecting them takes CPU time and uses more memory.
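Putting the find_each and save(:validate => false) suggestions together, a hypothetical sketch (Order, legacy_total, and total are made-up names for illustration, not Spree's real schema):

```ruby
# Batch through the table instead of loading it all at once, and skip validations,
# since the migration is making known-good updates.
Order.where("legacy_total IS NOT NULL").find_each(:batch_size => 1000) do |order|
  order.total = order.legacy_total
  order.save(:validate => false)
end
```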
Maybe you have already considered this:
Tell the db not to bother making sure everything is on disk (no WAL, no fsync, etc.). You now effectively have an in-memory db, which should make a very big difference. (Since you have taken the db offline, you can just restore from a backup in the unlikely event of power loss or similar.) Turn fsync/WAL back on when you are done.
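How you do this depends on the database. As one hedged example, assuming PostgreSQL, synchronous_commit can be switched off for the session from the Rails console or a migration; heavier settings such as fsync live in postgresql.conf and need a reload.

```ruby
# Session-level only; flip it back (or just end the session) when the migrations finish.
ActiveRecord::Base.connection.execute("SET synchronous_commit TO OFF")

# ... run the expensive migrations here ...

ActiveRecord::Base.connection.execute("SET synchronous_commit TO ON")
```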
It is likely that you can do some of the migrations before you take the db offline. Test this in a staging environment, of course. That big user migration might very well be possible to do live. Make sure that you don't do it in a transaction; you might need to modify the migrations a bit.
I'm not familiar with your exact situation but I'm sure there are even more things you can do unless this isn't enough.
This answer is more about approach than a specific technical solution. If your main criterion is minimum downtime (and data integrity, of course), then the best strategy is to not use Rails!
Instead you can do all the heavy work up-front and leave just the critical "real time" data migration (I'm using "migration" in the non-Rails sense here) as a step during the switchover.
So you have your current app with its db schema and the production data. You also (presumably) have a development version of the app based on the upgraded Spree gems, with the new db schema but no data. All you have to do is figure out a way of transforming the data between the two. This can be done in a number of ways, for example using pure SQL and temporary tables where necessary, or using SQL and Ruby to generate insert statements. These steps can be split up so that data that is fairly "static" (reference tables, products, etc.) can be loaded into the db ahead of time, and the data that changes more frequently (users, sessions, orders, etc.) can be done during the migration step.
You should be able to script this export-transform-import procedure so that it is repeatable and have tests/checks after it's complete to ensure data integrity. If you can arrange access to the new production database during the switchover then it should be easy to run the script against it. If you're restricted to a release process (eg webistrano) then you might have to shoe-horn it into a rails migration but you can run raw SQL using execute.
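As a sketch of the "pure SQL where possible" step, a migration-shaped wrapper around execute might look like this; the table and column names are invented for illustration.

```ruby
class CopyLegacyOrders < ActiveRecord::Migration
  def up
    # Set-based SQL avoids instantiating an ActiveRecord object per row.
    execute <<-SQL
      INSERT INTO spree_orders (number, total, created_at, updated_at)
      SELECT number, total, created_at, updated_at
      FROM   legacy_orders
    SQL
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end
```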
Take a look at this gem.
https://github.com/zdennis/activerecord-import/
data = []
data << Order.new(:order_info => 'test order')
Order.import data
Unfortunately, the downvoted solution is the only one. What is really slow in Rails is the ActiveRecord models; they are not suited for tasks like this.
If you want a fast migration, you will have to do it in SQL.
There is another approach, but you will always have to rewrite most of the migrations...

How to Throttle DataStage

I work on a project where a number of DataStage sequences can be run in parallel. One in particular performs poorly and takes a lot of resources, impacting the shared environment. A performance tuning initiative is in progress, but it will take time.
In the meantime I was hoping we could throttle DataStage to restrict the resources that can be used by this particular job/sequence; however, I'm not personally experienced with DataStage specifically.
Can anyone comment on whether this facility exists in DataStage (v8.5, I believe), and point me in the direction of some further detail?
Secondly, I know that we can throttle based on the user (I think this ties into AIX 'ulimit', but I'm not sure). Is it easy/possible to run different jobs/sequences as different users?
In this type of situation, resources for a particular job can be restricted by specifying the number of nodes and the resources in a config file. This is possible in 8.5, and you may find something at www.datastagetips.com
Revolution_In_Progress is right.
Datastage PX has the notion of a configuration file. That file can be specified for all the jobs you run or it can be overridden on a job by job basis. The configuration file can be used to limit the physical resources that are associated with a job.
In this case, if you have a 4-node config file for most of your jobs, you may want to write a 2-node config file for the job with performance issue. That way, you'll get the minimum amount of parallelism (without going completely sequential) and use the minimum amount of resources.
http://pic.dhe.ibm.com/infocenter/iisinfsv/v8r1/index.jsp?topic=/com.ibm.swg.im.iis.ds.parjob.tut.doc/module5/lesson5.1exploringtheconfigurationfile.html
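For illustration only, a minimal 2-node configuration file might look roughly like this; the hostname and paths are placeholders, so check the default config pointed to by $APT_CONFIG_FILE in your environment for the exact conventions.

```
{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/ds/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/ds/node2" {pools ""}
  }
}
```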
A sequence is a collection of individual jobs.
In most cases, jobs in a sequence can be rearranged to run serially. Please check the organisation of the sequence and do a critical path analysis to remove the jobs that need not run in parallel with the critical jobs.

Migrating from processing many small data files to a few large files in ruby

What should I keep in mind when migrating from processing many small data files to a few large data files in ruby?
Background: I'm a bioinformatician who is processing next generation sequencing data, which produces about one million sequences per run. I previously saved each one of the million sequences to its own file, and did a few processing steps to each sequence, producing a couple of files for each sequence. Unfortunately, having a couple of million files is making file input and output a major bottleneck (and also makes backup slow). (Having millions of files is also discouraged in answers to this question)
I considered using sqlite to store each file, but I want to avoid this option if possible, to avoid adding dependencies.
I suspect that I should write one and only one module for handling the large files, and let all of the processing scripts (which run as independent processes) use this module whenever they want to do input or output. Providing the processing classes with a file stream created with StringIO may be useful for this, as that way they don't need to know how the large files work.
In order to avoid having to read an entire large file when getting input (I want processing of each sequence to be an independent process, so that an analysis of one sequence can't corrupt the analysis of another sequence), I'll have to keep track of where I'm up to in the large input file. Although more sophisticated inter-process communication techniques exist, I might merely use a temporary file to store the character position for IO#seek.
I'll also have to keep in mind that I won't really be able to run multiple processes at once if they're writing to the same file, and that the large file handler will need to flush its output regularly.
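A minimal sketch of that single-module idea, assuming one sequence record per line; SequenceStore and its method names are invented for illustration.

```ruby
module SequenceStore
  # Stream the big file, yielding each record with the byte offset it starts at,
  # so an independent process can come back for it later with .fetch.
  def self.each_record(path)
    File.open(path, "r") do |f|
      until f.eof?
        offset = f.pos
        yield offset, f.readline.chomp
      end
    end
  end

  # Re-read a single record given an offset recorded earlier (e.g. via IO#seek).
  def self.fetch(path, offset)
    File.open(path, "r") do |f|
      f.seek(offset, IO::SEEK_SET)
      f.readline.chomp
    end
  end
end
```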
I don't know the details of your situation, but the application you are describing -- I want to store a million things and I'd like to access them quickly and flexibly -- sounds like a DB to me. By avoiding tools like sqlite you aren't necessarily avoiding dependencies; you might be trading one kind of dependency for another.
If you do have to roll your own file-based solution, you don't necessarily have to go from one extreme to the other. What about 1000 medium-sized files, dispersed across 10 subdirectories? And those medium-sized files could be .tar archives or something similar (directories in disguise) that, from the point of view of your code, might behave a lot like the 1 million little files you're used to handling. In addition, those .tar files will remain accessible directly from the command-line without any special software.
Maybe those are crazy ideas, but if you're going to avoid a DB and instead whip together something quick and practical, consider options that don't require you to build the moral equivalent of your own DB system.
If this is just a case of storing "a bunch of files", you might just need a simple key/value store like BDB, which could scale up quite easily to any RDBMS, including MySQL or SQLite, or even to another key/value store like Tokyo Cabinet.
Any reason for SQLite being such a problem? A robust data storage mechanism might be a much better approach than the 'pile of files' system.

Trying to write a program / library like LogParser - How does it work internally?

LogParser isn't open source and I need this functionality for an open source project I'm working on.
I'd like to write a library that allows me to query huge (mostly IIS) log files, preferably with LINQ.
Do you have any links that could help me? How does a program like LogParser work so fast? How does it handle memory limitations?
It probably processes the information in the log as it reads it. This means it (the library) doesn't have to allocate a huge amount of memory to store the information: it can read a chunk, process it, and throw it away. This is a common and very effective way to process data.
You could, for example, work line by line and parse each line. For the actual parsing you can write a state machine or, if the requirements allow it, use a regex.
Another approach would be a state machine that both reads and parses the data. If for some reason a log entry spans more than one line this might be needed.
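As a language-agnostic illustration of the read-process-discard idea (the question itself is .NET-oriented), here is a short sketch that streams a W3C-format IIS log and keeps only one record in memory at a time; the file name and the fields filtered on are placeholders.

```ruby
fields = []
File.foreach("u_ex140101.log") do |line|     # streams the file, one line at a time
  line.chomp!
  if line.start_with?("#Fields:")
    fields = line.sub("#Fields:", "").split  # W3C logs declare their columns here
    next
  end
  next if line.empty? || line.start_with?("#")

  record = Hash[fields.zip(line.split)]      # parse, use, and discard
  puts record["cs-uri-stem"] if record["sc-status"] == "500"
end
```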
Some state machine related links:
A very simple state machine written in C: http://snippets.dzone.com/posts/show/3793
A lot of Python-related code, but some sections are universally applicable: http://www.ibm.com/developerworks/library/l-python-state.html
If your aim is to query IIS log data with LINQ, then I suggest you move the raw IIS log data to a database and query the database using LINQ. This blog post might help.
http://getsrirams.blogspot.in/2012/07/migrate-iislog-data-to-sqlce-4-database.html
