How to deal with multiple data files using LightGBM

I am trying to use LightGBM as a classifier. My data are saved in multiple CSV files, but I find there is no way to directly use multiple files as the input.
I have considered combining all the data into one big numpy array, but my computer doesn't have enough memory. How can I use LightGBM with multiple data files when the available memory is limited?

I guess that you are using Python.
What is the size of your data (number of rows x number of columns)?
LightGBM needs to load the data in memory for training.
But if you haven't done so yet, you can wisely choose a suitable datatype for every column of your data.
Using dtypes such as 'uint8' / 'uint16' can considerably reduce the memory footprint and help you load everything in memory.
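For example, a minimal sketch of that dtype trick (the file names and column dtypes below are made up, not from the question):

    import pandas as pd

    # Hypothetical file names and column dtypes; adjust to your own schema.
    files = ["part1.csv", "part2.csv", "part3.csv"]
    dtypes = {"user_age": "uint8", "item_id": "uint32", "price": "float32"}

    # Reading each CSV with explicit dtypes keeps the concatenated frame small.
    df = pd.concat(
        (pd.read_csv(f, dtype=dtypes) for f in files),
        ignore_index=True,
    )
    print(df.memory_usage(deep=True).sum() / 1e6, "MB")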

Sample.
You shouldn't ever (except for certain edge cases) need to use your entire dataset if you sample correctly.
I use a DB that has over 230M records, but I usually select only a random sample of anywhere from 1k to 100k rows to build the model.
Also, you might as well split your data into training, validation and test sets. That will help cut down the size per file.
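A rough sketch of sampling while reading, so only the sampled rows ever hit memory (the file names and the 5% keep fraction are assumptions):

    import random
    import pandas as pd
    from sklearn.model_selection import train_test_split

    files = ["part1.csv", "part2.csv", "part3.csv"]   # hypothetical file names
    keep_fraction = 0.05                              # keep roughly 5% of rows per file

    # skiprows accepts a callable: skip any data row (never the header) at random,
    # so only the sampled rows are ever loaded.
    sample = pd.concat(
        (pd.read_csv(f, skiprows=lambda i: i > 0 and random.random() > keep_fraction)
         for f in files),
        ignore_index=True,
    )

    train, valid = train_test_split(sample, test_size=0.2, random_state=42)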

You might want to categorize your features and then one-hot-encode them. LightGBM works well with sparse features such as one-hot-encoded ones thanks to its EFB (Exclusive Feature Bundling), which significantly improves LightGBM's computational efficiency. Moreover, you get rid of the fractional parts of the numbers.
Think of categorization like this: say the values of one numerical feature vary between 36 and 56; you can digitize it with bin edges [36, 36.5, 37, ..., 55.5, 56] or [40, 45, 50, 55] to make it categorical. It's up to your expertise and imagination. You can refer to scikit-learn for one-hot encoding; it has a built-in function for that.
PS: With a numerical feature, always inspect its statistical properties; pandas' describe() summarizes its mean, max, min, std, etc.
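A short binning-plus-one-hot sketch along those lines, with made-up feature values:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    values = pd.Series([37.2, 41.8, 55.3, 48.9, 36.4])   # made-up feature in the 36-56 range
    print(values.describe())                              # mean, std, min, max, quartiles

    bins = np.arange(36, 56.5, 0.5)                       # edges 36, 36.5, ..., 56
    codes = np.digitize(values, bins)                     # integer bin index per value

    encoder = OneHotEncoder(handle_unknown="ignore")      # sparse output by default
    one_hot = encoder.fit_transform(codes.reshape(-1, 1))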

Related

Most efficient storage format for HDFS data

I have to store a lot of data on dedicated storage servers in HDFS, as a kind of archive for historical data. The data being stored is row-oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
The idea is that the data will be scanned from time to time using MapReduce or Spark job.
Currently I am storing the data as SequenceFiles with NullWritable as keys and a custom WritableComparable class as values. This custom class has all of these fields defined.
I would like to achieve two goals. One is to optimize the size of the data: it is getting really big, I have to add new servers every few weeks, and the costs are constantly growing. The other is to make it easier to add new fields - in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results: it allows adding new fields easily, and the size of the data was reduced by 20% (the reason is that many fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I try to put Lists and Maps into this EnumMap as well. It's fine for data of the same type, but trying to combine all of the fields is a nightmare.
So I thought about other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as the SequenceFiles with the custom class before I tried the EnumMap. So they solve the problem of adding new fields without needing to rewrite old data, but I feel there is more potential to optimize the size of the data.
One more thing I still plan to check is the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or whether I have to go back to gzip for performance), but before I proceed I was wondering whether someone could suggest another solution or a hint.
Thanks in advance for all comments.
Most of your approach seems good. I just decided to add some of my thoughts in this answer.
The data being stored is row-oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
None of the types you have mentioned here is more complex than the datatypes supported by Spark, so I wouldn't bother changing the data types in any way.
achieve two goals. One is to optimize the size of the data: it is getting really big, I have to add new servers every few weeks, and the costs are constantly growing.
By adding servers, are you also adding compute you don't really need? Storage should be relatively cheap; you should only be paying to store and retrieve data. Consider a simple object store like S3, which only charges you for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1,000 requests are free, and storage costs only ~$10 per terabyte per month.
The other is to make it easier to add new fields - in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
If you have a use case where you will write to the files more often than you read them, I'd recommend not storing the files on HDFS; it is better suited to write-once, read-many applications. That said, I'd recommend starting with Parquet, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it too supports schema evolution, but it is a better fit when you have complex structures where you need to specify the schema and want easy serialization/deserialization with Java objects.
One more thing I still plan to check is the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or whether I have to go back to gzip for performance)
Bzip2 has the highest compression ratio but is also the slowest, so I'd recommend it only if the data isn't used or queried frequently. Gzip compresses comparably to bzip2 but is somewhat faster. Also consider Snappy compression: it balances performance and storage, and it supports splittable files for certain file formats (Parquet or Avro), which is useful for MapReduce jobs.
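As a rough illustration of the Parquet-plus-Snappy suggestion, here is a minimal PySpark sketch; the HDFS paths are hypothetical and the read step depends on how the data is currently loaded:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("archive-rewrite").getOrCreate()

    # Hypothetical source: however the existing archive has been loaded into a DataFrame.
    df = spark.read.parquet("hdfs:///archive/current")

    # Columnar Parquet with splittable Snappy compression; optional fields added
    # later do not force a rewrite of the old files.
    df.write.option("compression", "snappy").parquet("hdfs:///archive/snappy")

    # Reading back across files whose schemas have drifted over time:
    merged = spark.read.option("mergeSchema", "true").parquet("hdfs:///archive/snappy")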

Computing percentiles

I'm writing a program that's going to generate a bunch of data. I'd like to find various percentiles over that data.
The obvious way to do this is to store the data in some kind of sorted container. Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
The alternative is to use an unordered container and perform sorting at the end. I don't know if that's going to be any faster. Either way, we're still left with needing a container which offers fast random access. (An array, perhaps...)
Suggestions?
(Another alternative is to build a histogram, rather than keep the entire data set in memory. But since the objective is to compute percentiles extremely accurately, I'm reluctant to go down that route. I also don't know the range of my data until I generate it...)
Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
Yes, it's your good old Data.Map. See elemAt and other functions under the «Indexed» category.
Data.Set doesn't offer these, but you can emulate it with Data.Map YourType ().

Techniques for handling arrays whose storage requirements exceed RAM

I am the author of a scientific application that performs calculations on a gridded basis (think finite-difference grid computation). Each grid cell is represented by a data object that holds the values of state variables and cell-specific constants. Until now, all grid cell objects have been present in RAM at all times during the simulation.
I am running into situations where the people using my code wish to run it with more grid cells than they have available RAM. I am thinking about reworking my code so that information on only a subset of cells is held in RAM at any given time. Unfortunately the grids (or matrices if you prefer) are not sparse, which eliminates a whole class of possible solutions.
Question: I assume that there are libraries out in the wild designed to facilitate this type of data access (i.e. retrieve constants and variables, update variables, store for future reference, wipe memory, move on...). After several hours of searching Google and Stack Overflow, I have found relatively few libraries of this sort.
I am aware of a few options, such as this one from the HSL mathematical library: http://www.hsl.rl.ac.uk/specs/hsl_of01.pdf. I'd prefer to work with something that is open source and is written in Fortran or C. (my code is mostly Fortran 95/2003, with a little C and Python thrown in for good measure!)
I'd appreciate any suggestions regarding available libraries or advice on how to reformulate my problem. Thanks!
Bite the bullet and roll your own?
I deal with too-large data all the time, such as 30,000+ data series of half-hourly data spanning decades. Because of the regularity of the data (though daylight-saving changeovers are a problem), it proved quite straightforward to devise a scheme involving a random-access disc file and procedures ReadDay and WriteDay that take a series number and a day number, with further details because series start and stop at different dates. Thus, a day's data that might once have been Array(Run,DayNum) is now ReturnCode = ReadDay(Run,DayNum,Array) and so forth, the return codes indicating presence/absence of that day's data, etc. The key is that a day's data is a convenient and (almost) regular size, and although my program allocates a buffer of one record per series, it runs in ~100MB of memory rather than gigabytes.
Because your array is non-sparse, it is regular. Provided that a grid cell's data are of fixed size, you could devise a random-access disc file with each record holding one cell, or perhaps a row's worth of cells (or a column's worth), or some other worthwhile blob size. I chose 4,096 bytes/record because that is the disc file allocation size. Let the computer's operating system and disc storage controller do whatever buffering to real memory they feel up to. Typical execution is then limited by the speed of data transfer, unless the local computation is heavy; thus I see CPU use of a few percent until data requests start being satisfied from buffers.
Because Fortran uses the same syntax for function references as for array references (unlike, say, Pascal), instead of declaring DIMENSION ARRAY(Big,Big) you would remove that and devise FUNCTION ARRAY(i,j), and all read references in your source file stay as they are. Alas, in the absence of a "palindromic" function declaration, assignments to your array will have to be done with a different syntax, so you devise a subroutine or similar. Possibly a scratchpad array could be collated, worked on with convenient syntax, and then written back if changed.
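The answer above is Fortran-oriented, but the fixed-record random-access idea is easy to sketch; here it is in Python, with the record layout and the 16-doubles-per-cell size purely assumed:

    import numpy as np

    CELL_DOUBLES = 16                      # assumed state variables + constants per cell
    RECORD_BYTES = CELL_DOUBLES * 8        # one fixed-size float64 record per cell

    def read_cell(f, i, j, ncols):
        """Fetch one cell's record from a random-access binary file."""
        f.seek((i * ncols + j) * RECORD_BYTES)
        return np.frombuffer(f.read(RECORD_BYTES), dtype=np.float64)

    def write_cell(f, i, j, ncols, values):
        """Write one cell's record back in place."""
        f.seek((i * ncols + j) * RECORD_BYTES)
        f.write(np.asarray(values, dtype=np.float64).tobytes())

    # Usage: open the backing file once with open("grid.bin", "r+b") and let the
    # OS page cache do the buffering, as the answer suggests.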

Reverse "jpeg" compression algorithm?

I have to write a tool that manages very large data sets (well, large for an ordinary workstation). I basically need something that works the opposite way to the JPEG format: the dataset must be intact on disk, where it can be arbitrarily large, but it needs to be lossily compressed when read into memory, and only the sub-part in use at any given time needs to be decompressed on the fly. I have started looking at IPP (Intel Integrated Performance Primitives), but it's not really clear yet whether I can use it for what I need to do.
Can anyone point me in the right direction?
Thank you.
Given the nature of your data, it seems you are handling some kind of raw sample.
So the easiest and most generic "lossy" technique is to drop the lower bits, reducing precision down to the level you want.
Note that you really need to "drop the lower bits", which is quite different from "round to the nearest power of 10". Computers work in base 2, and you want all your lower bits to be zero for compression to perform as well as possible. This method assumes that the selected compression algorithm will exploit the predictable pattern of zero bits.
Another method, more complex and more specific, is to convert your values into indexes into a table. The advantage is that you can target precision where you want it. The obvious drawback is that the table is specific to one distribution pattern.
On top of that, you may store not the value itself but its delta from the preceding value, if there is any kind of relation between them. This helps compression too.
For the data to compress well, you will need to group values into packets of an appropriate size, such as 64KB; no compression algorithm gives suitable results on a single field. This, in turn, means that each time you want to access a field you must decompress the whole packet, so tune the packet size to how you intend to access the data. Sequential access is easier to deal with in such circumstances.
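A small sketch of the bit-dropping and delta ideas, written in Python/numpy for brevity even though the question is about C++ (the float32 samples and the drop_bits value are assumptions):

    import numpy as np

    def drop_low_bits(samples, drop_bits=8):
        """Zero the lowest `drop_bits` bits of float32 samples so a generic
        compressor (snappy, LZ4, ...) sees long runs of zero bits."""
        raw = np.asarray(samples, dtype=np.float32).view(np.uint32)
        mask = np.uint32((0xFFFFFFFF >> drop_bits) << drop_bits)
        return (raw & mask).view(np.float32)

    def delta_encode(samples):
        """Store each value as its difference from the preceding one;
        helps when neighbouring samples are correlated."""
        samples = np.asarray(samples)
        return np.concatenate(([samples[0]], np.diff(samples)))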
Regarding compression algorithm, since these data are going to be "live", you need something very fast, so that accessing the data has very small latency impact.
There are several open-source alternatives out there for that purpose. For easier license management, I would recommend a BSD-licensed one. Since you use C++, the following look suitable:
http://code.google.com/p/snappy/
and
http://code.google.com/p/lz4/

Good compression algorithm for small chunks of data? (around 2k in size)

I have a system where one machine generates small chunks of data in the form of objects containing arrays of integers and longs. These chunks get passed to another server, which in turn distributes them elsewhere.
I want to compress these objects so the memory load on the pass-through server is reduced. I understand that compression algorithms like deflate need to build a dictionary, so something like that wouldn't really work on data this small.
Are there any algorithms that could compress data like this efficiently?
If not, another thing I could do is batch these chunks into arrays of objects and compress the array once it gets to be a certain size. But I am reluctant to do this because I would have to change interfaces in an existing system. Compressing them individually would not require any interface changes, the way this is all set up.
Not that I think it matters, but the target system is Java.
Edit: Would Elias gamma coding be the best for this situation?
Thanks
If you think that reducing your data packet to its entropy level is the best you can do, you can try simple Huffman compression.
For an early look at how well this would compress, you can pass a packet through Huff0:
http://fastcompression.blogspot.com/p/huff0-range0-entropy-coders.html
It is a simple order-0 Huffman encoder, so the result will be representative.
For more specific ideas on how to exploit the characteristics of your data, it would help to describe a bit what the packets contain and how they are generated (as you have done in the comments, so they are ints (4 bytes?) and longs (8 bytes?)), and then to provide one or a few samples.
It sounds like you're currently looking at general-purpose compression algorithms. The most effective way to compress small chunks of data is to build a special-purpose compressor that knows the structure of your data.
The important thing is that you need to match the coding you use with the distribution of values you expect from your data: to get a good result from Elias gamma coding, you need to make sure the values you code are smallish positive integers...
If different integers within the same block are not completely independent (e.g., if your arrays represent a time series), you may be able to use this to improve your compression (e.g., the differences between successive values in a time series tend to be smallish signed integers). However, because each block needs to be independently compressed, you will not be able to take this kind of advantage of differences between successive blocks.
If you're worried that your compressor might turn into an "expander", you can add an initial flag to indicate whether the data is compressed or uncompressed. Then, in the worst case where your data doesn't fit your compression model at all, you can always punt and send the uncompressed version; your worst-case overhead is the size of the flag...
Elias Gamma Coding might actually increase the size of your data.
You already have upper bounds on your numbers (whatever fits into a 4- or probably 8-byte int/long). This method encodes the length of your numbers, followed by your number (probably not what you want). If you get many small values, it might make things smaller. If you also get big values, it will probably increase the size (the 8-byte unsigned max value would become almost twice as big).
Look at the entropy of your data packets. If it's close to the maximum, compression will be useless. Otherwise, try different general-purpose compressors, though I'm not sure whether the time spent compressing and decompressing is worth the size reduction.
I would have a close look at the options of your compression library, for instance deflateSetDictionary() and the flag Z_FILTERED in http://www.zlib.net/manual.html. If you can distribute - or hardwire in the source code - an agreed dictionary to both sender and receiver ahead of time, and if that dictionary is representative of real data, you should get decent compression savings. Oops - in Java look at java.util.zip.Deflater.setDictionary() and FILTERED.
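To illustrate the preset-dictionary idea, here is a minimal sketch using Python's zlib; the dictionary bytes are placeholders, and in the actual Java system Deflater.setDictionary()/Inflater.setDictionary() play the same role:

    import zlib

    # Both sender and receiver must agree on this dictionary ahead of time;
    # it should contain byte patterns that really occur in the payloads.
    SHARED_DICT = b"\x00\x00\x00\x00\x00\x00\x00\x01\xff\xff\xff\xff"   # placeholder

    def compress_chunk(data: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=SHARED_DICT)
        return c.compress(data) + c.flush()

    def decompress_chunk(blob: bytes) -> bytes:
        d = zlib.decompressobj(zdict=SHARED_DICT)
        return d.decompress(blob) + d.flush()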
