Programmatic data conversion strategy - etl

I have a product that imports certain data files from clients (e.g. user directories) and exports other types of data (e.g. reports). All imports and exports are currently in CSV format (RFC 4180), and files are passed back and forth through managed file transfers.
Increasingly, I'm seeing requests from clients to transform and reconfigure these data files for use in their legacy systems. For import data files, it's bizarre requests like:
"We're passing you 20 columns, from that apply $business_logic to
columns 4,7,5,18,19 to determine the actual value your system needs in
column 21, then drop those original columns cuz they aren't really useful by
themselves"
or
"The value in column 2 is padded with zeros, please strip that off."
For data exports files, it's requests like:
"You are sending us .csv, but we need it in our special fixed width format."
or
"You are formatting numbers with decimals. Remove those, and prefix with 8 zeros."
Of course, every client we onboard has different requirements. I'm hesitant to dive in and write something from scratch as I imagine there are all sorts of gotchas along the way in building out files of different formats (csv, tsv, fixed width, excel, stone tablets), and dealing with character encoding, etc, etc. What I'm looking for is some sort of a dev framework (or commercial product) that would allow us to quickly satisfy the increasing number of (and variety of) data transformation requests. Something lightweight & simple is much preferred.
Any thoughts or experiences appreciated.

I'm not sure if it's a total fit but you can check out streamsets.com
It's an open-source tool for data movement and lightweight transformations. It allows you to provide minimal input schema (e.g. I have CSV files) so you don't have to deal with a lot of the things you mentioned.
Full disclosure: I'm an engineer at StreamSets.
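For the simpler requests in the question (stripping zero padding, emitting fixed-width output), a small rule table over the standard `csv` module may already go a long way. The following is only a minimal sketch: the function names, the column index, and the field widths are all made up for illustration, and real client formats would need their own rules for truncation, alignment, and encoding.

```python
import csv
import io

def strip_zero_padding(value):
    """Remove leading zeros, keeping a single "0" for an all-zero value."""
    return value.lstrip("0") or "0"

def to_fixed_width(row, widths):
    """Left-justify each field to its width, truncating anything longer."""
    return "".join(field[:w].ljust(w) for field, w in zip(row, widths))

def convert(csv_text, column_rules, widths):
    """Apply per-column rules to each CSV row, then emit fixed-width lines."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        for index, rule in column_rules.items():
            row[index] = rule(row[index])
        lines.append(to_fixed_width(row, widths))
    return lines

# Hypothetical client: column 1 is zero-padded, output widths are 8/6/10.
sample = 'alice,000042,"Smith, A."\nbob,000000,Jones\n'
result = convert(sample, {1: strip_zero_padding}, widths=[8, 6, 10])
```

Keeping the rules in a plain mapping means each client's quirks can live in configuration rather than in a separate code path.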

Related

Python structure for working with variable log messages in a large file

I am trying to get data out of debug log messages created by a certain piece of open source software. It has many lines describing what it is doing during stages. It does not have a specific structure, i.e. some data covers multiple lines with different indents and no separator so does not import nicely into a pandas data frame, which would be my go-to usually.
Is there a good way to structure a Python script that parses this data, can be reused in the future for the same job, and can be extended to extract different data? I have to do a bunch of different steps to extract the data. The other complication is that the file is much too big to store in memory (10^6 lines), so I need to iterate through the lines.
Could anyone give me some tips on how to do this? Is it best to do each step separately and save the result to a new file? My other idea is to create a data object that stores the relevant line numbers as attributes in lists, generated by different methods, so that each subsequent method only loads the lines from that list.
Or alternatively, maybe I am totally using the wrong tool and I need to learn awk or regex commands to do it? I just know python already so have a preference for it. Not looking for a specific answer necessarily, some tips and pointers would also be very useful!
(--details--) I am trying to trace on a freeradius server the difference between log messages of requests, accepts and rejects of a mac address to see if I can find out why it is sometimes accepted and other times rejected, seemingly randomly.
There were a lot of plugins set up on the server before I got to dealing with it, so the debug output is a massive wall of text, labelling each request with a number. I can split it into requests by that number, find the requests that mention the MAC, split those requests into different files, and then I want to filter out all the boilerplate info that comes with each message and get to the things that differ between them. (--details--)
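One way to structure the steps described above is as small functions that each take an iterable of lines, so the file is streamed rather than loaded. This is only a sketch under a loud assumption: it supposes every debug line starts with the request number in parentheses, e.g. `(12) ...`; the actual log layout may differ, and lines without that prefix are simply skipped here. All function names are made up.

```python
import re
from collections import defaultdict

# Assumption: each debug line starts with "(N)", where N is the request number.
REQUEST_LINE = re.compile(r"^\((\d+)\)\s*(.*)$")

def requests_mentioning(lines, needle):
    """Pass 1: ids of requests that have at least one line containing needle."""
    ids = set()
    for line in lines:
        match = REQUEST_LINE.match(line)
        if match and needle in match.group(2):
            ids.add(match.group(1))
    return ids

def collect_requests(lines, wanted_ids):
    """Pass 2: keep only the message text of the wanted requests."""
    grouped = defaultdict(list)
    for line in lines:
        match = REQUEST_LINE.match(line)
        if match and match.group(1) in wanted_ids:
            grouped[match.group(1)].append(match.group(2))
    return dict(grouped)

def drop_boilerplate(grouped):
    """Remove messages that appear in every selected request."""
    if not grouped:
        return {}
    common = set.intersection(*(set(msgs) for msgs in grouped.values()))
    return {rid: [m for m in msgs if m not in common]
            for rid, msgs in grouped.items()}

sample = [
    '(0) Received Access-Request',
    '(0)   User-Name = "aa-bb-cc-dd-ee-ff"',
    '(0) Sent Access-Accept',
    '(1) Received Access-Request',
    '(1)   User-Name = "11-22-33-44-55-66"',
    '(1) Sent Access-Accept',
    '(2) Received Access-Request',
    '(2)   User-Name = "aa-bb-cc-dd-ee-ff"',
    '(2) Sent Access-Reject',
]
ids = requests_mentioning(sample, "aa-bb-cc-dd-ee-ff")
differences = drop_boilerplate(collect_requests(sample, ids))
```

With a real file, each pass would open the file again and pass the file object in place of `sample`, since a file object yields lines lazily. Only the selected requests are ever held in memory.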

What are some file formats used when writing an append only file?

So, I've been looking at a number of resources about performance issues surrounding writing to file. I've come across the notion of Append Only Files and Transaction Logs. What I have not found are typical formats, or efficient formats for these kinds of files.
I may be wrong, but it would seem that one could read and write to the same file at the same time, but I haven't found any simple implementation examples. It seems as though the writer would have to leave behind details about the data found in the file, or perhaps a fully descriptive format that can be parsed.
Are there good references for how to implement a transaction log or append-only file? Perhaps even better: descriptions of the formats used in append-only file implementations?
Your question is very broad and it's hard to recommend a single approach. But since you're looking at an append-only option, you would need a format that doesn't require a footer. E.g. you can't use XML since XML has to have closing tags and you wouldn't simply be appending data.
An obvious option is a delimited file format, be it tab- or comma-delimited text. These formats are practically universal and well-defined. They are also pretty compact, with just one character to delimit fields. However, they are not good for data whose fields change row by row, e.g. one row has values for fields A, B, and C but another row has values for fields A, D, and E. In that case, you might need a format that declares the type of data in each record. An example of such a format is HL7 (https://en.wikipedia.org/wiki/Health_Level_7). It's a delimited format, but each row has a "header" indicating the record type.
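To make the "record type per row" idea concrete, here is a minimal sketch of a pipe-delimited log where the first field names the record type and the type decides which fields follow. It is not HL7; the `ABC`/`ADE` type names, the `SCHEMAS` table, and the helper functions are invented for illustration.

```python
import csv
import io

# Hypothetical record types: the first field on each line selects the schema.
SCHEMAS = {"ABC": ("A", "B", "C"), "ADE": ("A", "D", "E")}

def append_record(stream, record_type, values):
    """Append one line: the record type, then that type's fields in order."""
    writer = csv.writer(stream, delimiter="|", lineterminator="\n")
    writer.writerow([record_type] + [values[name] for name in SCHEMAS[record_type]])

def read_records(stream):
    """Yield (record_type, field dict) pairs, using the type to name the fields."""
    for row in csv.reader(stream, delimiter="|"):
        if row:
            yield row[0], dict(zip(SCHEMAS[row[0]], row[1:]))

log = io.StringIO()  # stands in for a file opened in append mode
append_record(log, "ABC", {"A": "1", "B": "2", "C": "3"})
append_record(log, "ADE", {"A": "4", "D": "5", "E": "6"})
log.seek(0)
records = list(read_records(log))
```

Because every line is complete on its own, a reader can stop at the last full line and pick up again later, with no footer to rewrite.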
If you're looking for a higher performance option, you can come up with your own format depending on your data, and even store it in binary format, and even use compression (See DeflateStream https://msdn.microsoft.com/en-us/library/system.io.compression.deflatestream(v=vs.110).aspx) to reduce file I/O. That will make write operations a bit more CPU intensive but I/O is usually slower so on the whole, especially since text compresses really well, you might end up with performance gains. You'd have to benchmark to be sure for your use case.
Finally, you would want a class that manages the writing (caching/queueing writes, keeping the file handle open, etc.) so that calling code can be simplified and synchronized in one place. You can make it asynchronous, if the caller can move on with its work and trust your writer to get the data in, or synchronous if this is a "transaction log", meaning loss is unacceptable and the caller has to know the write actually happened.
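A writer class along those lines could look roughly like the sketch below, shown in Python rather than .NET. The class name, the 4-byte length prefix, and the `durable` flag are all assumptions made for the example; the length prefix is just one footer-free way to mark record boundaries in a binary file.

```python
import os
import tempfile
import threading

class AppendOnlyLog:
    """Funnels all appends through one lock and one open file handle."""

    def __init__(self, path, durable=True):
        self._file = open(path, "ab")  # append mode: writes go to the end
        self._lock = threading.Lock()
        self._durable = durable

    def append(self, payload):
        # A length prefix marks record boundaries without needing a footer.
        record = len(payload).to_bytes(4, "big") + payload
        with self._lock:
            self._file.write(record)
            if self._durable:
                self._file.flush()
                os.fsync(self._file.fileno())  # ask the OS to push it to disk

    def close(self):
        with self._lock:
            self._file.close()

def read_all(path):
    """Yield payloads in order, stopping at a truncated trailing record."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            size = int.from_bytes(header, "big")
            payload = f.read(size)
            if len(payload) < size:
                break
            yield payload

path = os.path.join(tempfile.mkdtemp(), "events.log")
log = AppendOnlyLog(path)
log.append(b"first")
log.append(b"second")
log.close()
entries = list(read_all(path))
```

Passing `durable=False` trades safety for throughput by leaving flushing to the runtime and the OS, which is closer to the asynchronous option described above.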
Again, this is very high level info since your request is just as vague and high level. If you come up with more details, maybe we can better help you.

Is avoiding the T in ETL possible?

ETL is pretty commonplace. Data is out there somewhere, so you go get it. After you get it, it's probably in a weird format, so you transform it into something and then load it somewhere. The only problem I see with this method is that you have to write the transform rules. Of course, I can't think of anything better. I suppose you could load whatever you get into a blob (SQL) or into an object/document (non-SQL), but then I think you're just delaying the parsing. Eventually you'll have to parse it into something structured (assuming you want to). So is there anything better? Does it have a name? Does this problem have a name?
Example
Ok, let me give you an example. I've got a printer, an ATM and a voicemail system. They're all network enabled or I can give you connectivity. How would you collect the state from all these devices? For example, the printer dumps a text file when you type status over port 9000:
> status
===============
has_paper:true
jobs:0
ink:low
The ATM has a CLI after you connect on port whatever and you can type individual commands to get different values:
maint-mode> GET BILLS_1
[$1 bills]: 7
maint-mode> GET BILLS_5
[$5 bills]: 2
etc ...
The voicemail system requires certain key sequences to get any kind of information over a network port:
telnet> 7,9*
0 new messages
telnet> 7,0*
2 total messages
My thoughts
Printer - So this is pretty straight-forward. You can just capture everything after sending "status", split on lines and then split on colons or something. Pretty easy. It's almost like getting a crap-formatted result from a web service or something. I could avoid parsing and just dump the whole conversation from port 9000. But eventually I'll want to get rid of that equal signs line. It doesn't really mean anything.
ATM - So this is a bit more of a pain because it's interactive. Now I'm approaching expect or a protocol territory. It'd be better if they had a service that I could query these values but that's out of scope for this post. So I write a client that gets all the values. But now if I want to collect all the data, I have to define what all the questions are. For example, I know that the ATM has more bills than $1 and $5 so I'd have a complete list like "BILLS_1 BILLS_5 BILLS_10 BILLS_20". If I ask all the questions then I have an inventory of the ATM machine. Of course, I still have to parse out the results and clean up the text if I wanted to figure out how much money is left in the ATM machine. So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later.
Voicemail - This is similar to the ATM machine where it's interactive. It's just a bit weirder because the key sequences/commands aren't "get key". But essentially it's the same problem and solution.
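As a rough illustration of the printer and ATM cases above, the sketch below parses the transcripts exactly as they are shown in this post. The function names are made up, and real devices may produce output that these patterns do not cover.

```python
import re

def parse_printer_status(text):
    """Keep only "key:value" lines; the prompt and the "====" rule have no colon."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status

# Matches replies such as "[$5 bills]: 2".
ATM_REPLY = re.compile(r"\[\$(\d+) bills\]:\s*(\d+)")

def parse_atm_transcript(text):
    """Map each denomination to its bill count."""
    return {int(d): int(c) for d, c in ATM_REPLY.findall(text)}

printer_text = "> status\n===============\nhas_paper:true\njobs:0\nink:low\n"
atm_text = (
    "maint-mode> GET BILLS_1\n[$1 bills]: 7\n"
    "maint-mode> GET BILLS_5\n[$5 bills]: 2\n"
)
printer = parse_printer_status(printer_text)
atm = parse_atm_transcript(atm_text)
```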
Future Proof
Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster. Or anything? You'd have to write "connectors" ahead of time or write a parser afterwards against some raw field you stored earlier. Maybe in the case of these very limited examples there's no alternative. There's no way to future-proof. You just have to understand the new device and parse it at collection or parse it after the fact (your stored blob/object/document).
I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer that simply requires the device to split out lines. Then you could have a text processing piece that parses based on rules. For the ATM device, you'd have to write something that "speaks ATM" and turns it into lines which the iterator would then take care of. At this point, hopefully you'd be able to say "I can handle anything that has lines of text".
But then what will you call these rules for parsing the text? "Printer rules" might as well be called "printer parser" which is the same to me as "printer transform". Is there a better term for all of this?
I apologize for this question being so open ended. :)
When your sources of information are as disparate as what you illustrate then you have no choice but to implement the Transform in order to bring the items into a common data repository. Usually your data sources won't be this extreme, the data will all be related in some way but you may be retrieving it from different sources (some might come from a nicely structured database, some more might come from an Excel or XML or text file, some more might come from a web service call, etc).
When coding up a custom ETL application, a common pattern is the Provider model, which lets you write a whole bunch of custom providers to load/query and then transform the data. All the providers implement a common interface with some relatively common function definitions (for example QueryData(), TransformData()), but the implementation of those methods will be wildly different depending on the data source being dealt with; the interface just gives a common way to deal with all the different providers. You can then use an XML configuration file to dictate which providers to run and any other initial settings they may require. Tools like SSIS abstract this stuff away for you by giving you a nice visual designer, but you can still get down and dirty and write your own code which it calls.
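A bare-bones version of that pattern might look like the following, sketched in Python rather than .NET. The class names, the `PROVIDERS` registry, and the plain dict standing in for the XML configuration are all assumptions for illustration; the `fetch` callables stand in for the real network conversations.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """Common interface: every provider queries, then transforms."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable returning raw text from the device

    def query_data(self):
        return self._fetch()

    @abstractmethod
    def transform_data(self, raw):
        """Turn raw device output into a dict of values."""

    def run(self):
        return self.transform_data(self.query_data())

class PrinterProvider(Provider):
    def transform_data(self, raw):
        pairs = (line.partition(":") for line in raw.splitlines())
        return {key.strip(): value.strip() for key, sep, value in pairs if sep}

class VoicemailProvider(Provider):
    def transform_data(self, raw):
        result = {}
        for line in raw.splitlines():
            parts = line.split()
            # Matches replies such as "2 total messages".
            if len(parts) == 3 and parts[0].isdigit() and parts[2] == "messages":
                result[parts[1]] = int(parts[0])
        return result

PROVIDERS = {"printer": PrinterProvider, "voicemail": VoicemailProvider}

def run_all(config, sources):
    """Run the providers named in the config (a stand-in for the XML file)."""
    return {name: PROVIDERS[name](sources[name]).run() for name in config["enabled"]}

results = run_all(
    {"enabled": ["printer", "voicemail"]},
    {
        "printer": lambda: "> status\n====\nhas_paper:true\njobs:0\nink:low",
        "voicemail": lambda: "telnet> 7,9*\n0 new messages\ntelnet> 7,0*\n2 total messages",
    },
)
```

Supporting a new device then means adding one subclass and one registry entry, leaving the existing providers untouched.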
"Now what if I was going to give you an unknown device? Like a refrigerator. Or a toaster."
No problem, I would just write a new provider, which can sit in its very own assembly (DLL), so it can be shipped (or modified, upgraded, etc.) in isolation from any other providers I already have. Or if I was using SSIS, I would write a new DTS package.
"I was thinking that all these systems are text driven so maybe you could create a line iterator type abstraction layer ... Then you could have a text processing piece that parses based on rules."
Absolutely - you can have a base class containing common functionality which several different providers can implement, and each provider can use its own set of rules which could be coded into it or they can be contained in an external configuration file.
"So I could parse the results and figure out the total at data collection time or just store it raw and make sense of it later."
Use whichever approach makes sense for the data you are grabbing. It is also quite common for an ETL process to dump its data into a staging area (like some staging tables in a database) while the data is all being aggregated and accumulated, and then further process it to link related data and perform calculations. In the case of your ATM it may not be necessary to calculate a cash balance at ETL time because you can easily calculate it at any time in the future.
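The staging approach can be sketched with an in-memory SQLite table standing in for the staging area: raw replies are stored untouched at collection time, and the cash balance is derived whenever it is needed. The table layout, the device id, and the function names are invented for the example.

```python
import re
import sqlite3

# Matches replies such as "[$5 bills]: 2".
ATM_REPLY = re.compile(r"\[\$(\d+) bills\]:\s*(\d+)")

db = sqlite3.connect(":memory:")  # stand-in for a real staging database
db.execute("CREATE TABLE staging (device TEXT, raw_reply TEXT)")

def stage(device, raw_reply):
    """Collection time: store the reply exactly as received."""
    db.execute("INSERT INTO staging VALUES (?, ?)", (device, raw_reply))

def cash_balance(device):
    """Later: parse the staged replies and total the cash."""
    rows = db.execute("SELECT raw_reply FROM staging WHERE device = ?", (device,))
    total = 0
    for (raw_reply,) in rows:
        for denomination, count in ATM_REPLY.findall(raw_reply):
            total += int(denomination) * int(count)
    return total

stage("atm-1", "[$1 bills]: 7")
stage("atm-1", "[$5 bills]: 2")
balance = cash_balance("atm-1")
```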

Migrating from processing many small data files to a few large files in ruby

What should I keep in mind when migrating from processing many small data files to a few large data files in ruby?
Background: I'm a bioinformatician who is processing next generation sequencing data, which produces about one million sequences per run. I previously saved each one of the million sequences to its own file, and did a few processing steps to each sequence, producing a couple of files for each sequence. Unfortunately, having a couple of million files is making file input and output a major bottleneck (and also makes backup slow). (Having millions of files is also discouraged in answers to this question)
I considered using sqlite to store each file, but I want to avoid this option if possible, to avoid adding dependencies.
I suspect that I should write one and only one module for handling the large files, and let all of the processing scripts (which run as independent processes) use this module whenever they want to do input or output. Providing the processing classes with a file stream created with StringIO may be useful for this, as that way they don't need to know how the large files work.
In order to avoid having to read an entire large file when getting input (I want processing of each sequence to be an independent process, so that an analysis of one sequence can't corrupt the analysis of another sequence), I'll have to keep track of where I'm up to in the large input file. Although more sophisticated inter-process communication techniques exist, I might merely use a temporary file to store the character position for IO#seek.
I'll also have to keep in mind that I won't really be able to run multiple processes at once if they're writing to the same file, and that the large file handler will need to flush its output regularly.
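The seek-based idea can be sketched as an offset index built in a single scan. The example is in Python, but the same approach should carry over to Ruby's IO#seek. It assumes one sequence per line, which real sequencing formats may not satisfy, and the function names are made up.

```python
import io

def build_index(stream):
    """Scan once, recording the byte offset at which each line starts."""
    offsets = []
    while True:
        position = stream.tell()
        line = stream.readline()
        if not line:
            break
        offsets.append(position)
    return offsets

def read_record(stream, offsets, n):
    """Jump straight to record n without reading the records before it."""
    stream.seek(offsets[n])
    return stream.readline().rstrip(b"\n")

# BytesIO stands in for a large file opened in binary mode.
data = io.BytesIO(b"ACGT\nGGCATT\nTTA\n")
offsets = build_index(data)
second = read_record(data, offsets, 1)
```

Saving `offsets` next to the data file would let each independent process fetch only its own sequence.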
I don't know the details of your situation, but the application you are describing -- I want to store a million things and I'd like to access them quickly and flexibly -- sounds like a DB to me. By avoiding tools like sqlite you aren't necessarily avoiding dependencies; you might be trading one kind of dependency for another.
If you do have to roll your own file-based solution, you don't necessarily have to go from one extreme to the other. What about 1000 medium-sized files, dispersed across 10 subdirectories? And those medium-sized files could be .tar archives or something similar (directories in disguise) that, from the point of view of your code, might behave a lot like the 1 million little files you're used to handling. In addition, those .tar files will remain accessible directly from the command-line without any special software.
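To show how a .tar archive can behave like a directory of small files from the code's point of view, here is a minimal sketch using Python's standard `tarfile` module (Ruby has comparable libraries). The member names and helper functions are hypothetical.

```python
import io
import tarfile

def write_archive(fileobj, sequences):
    """Store each sequence as its own member, like a small file in a directory."""
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for name, data in sequences.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

def read_member(fileobj, name):
    """Read one member back by name."""
    fileobj.seek(0)
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        return tar.extractfile(name).read()

# BytesIO stands in for one of the medium-sized archive files on disk.
archive = io.BytesIO()
write_archive(archive, {"seq_000001.txt": b"ACGT", "seq_000002.txt": b"GGCATT"})
second = read_member(archive, "seq_000002.txt")
```

An uncompressed tar has no central index, so looking up a member by name scans the member headers. That is another reason to prefer many medium-sized archives over one huge one.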
Maybe those are crazy ideas, but if you're going to avoid a DB and instead whip together something quick and practical, consider options that don't require you to build the moral equivalent of your own DB system.
If this is just a case of storing "a bunch of files", you might just need a simple key/value store like BDB or Tokyo Cabinet. If you outgrow that, the same data could be moved quite easily to an RDBMS such as SQLite or MySQL.
Any reasons for SQLite being such a problem? A robust data storage mechanism might be a much better approach to the 'pile of files' system.

Does soCaseInsensitive greatly impact performance for a TdxMemIndex on a TdxMemDataset?

I am adding some indexes to my DevExpress TdxMemDataset to improve performance. The TdxMemIndex has SortOptions, which include soCaseInsensitive. My data is usually a GUID string, so it is not case sensitive. I am wondering whether I am better off forcing all the data to the same case, or whether using the soCaseInsensitive flag together with the loCaseInsensitive flag in calls to Locate carries only a minor performance penalty (roughly equal to converting the case of my string every time I need to use the index).
At this point I am leaving soCaseInsensitive off and just converting case.
IMHO, the best approach is to ensure data quality at Post time. Reasons:
You (usually) know the nature of the data. So, for example, you can use UpperCase (knowing that GUIDs are all in the ASCII range) instead of the much slower AnsiUpperCase, which a general component like TdxMemDataSet is forced to use.
You enter the data only once. Searching, sorting, and filtering, which all invoke the internal uppercasing engine of TdxMemDataSet, are repeated actions. Other chained actions can also trigger this engine without you realizing it (e.g. a TcxGrid that is sorted by default with GridMode := True, assuming you use the DevExpress components, together with a class acting as a broker that passes the sort message to the underlying dataset).
Usually data entry is done in steps, one or a few records per batch. The only notable exception is data acquisition applications. But in both cases users tolerate much greater response times, which gives you room to play with. (IOW, how much would an UpperCase call add to a record post which lasts 0.005 ms?) OTOH, users are very demanding about the speed of data retrieval operations (searching, sorting, filtering, etc.). Keep data retrieval as fast as you can.
Having the data in the database ready to expose reduces the risk of processing errors when (if) you write other modules; otherwise you need to remember to AnsiUpperCase the data in every module, in every language, you write. A classic example is using external tools to access the data (e.g. a DB manager executing an SQL SELECT over the data).
hth.
Maybe the DevExpress forums (or even a support email, if you have access to it) would be a better place to seek an authoritative answer on that performance question.
Anyway, it is better to guarantee that the data is in the format you want at the moment you save it, for the reasons plainth already explained. So, in this specific case, make sure the GUID is written in upper case (or lower case; it's a matter of taste). If it is SQL Server or another database server that has a GUID datatype, make the SELECT do the work, if applicable and possible, even the sort.
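The "normalize once, when saving" advice is language-neutral, so here is a small illustration in Python rather than Delphi. The dict stands in for the dataset and its index, and `post` and `locate` are made-up names, not DevExpress APIs. The stored key is canonicalized at write time, so every lookup is a plain case-sensitive match.

```python
import uuid

def normalise_guid(text):
    """Canonical form: lowercase, hyphenated, no braces."""
    return str(uuid.UUID(text))

records = {}  # stand-in for the dataset plus a case-sensitive index

def post(guid_text, payload):
    """Normalise once, at write time."""
    records[normalise_guid(guid_text)] = payload

def locate(guid_text):
    """Only the search key is normalised; stored keys are already canonical."""
    return records.get(normalise_guid(guid_text))

post("{3F2504E0-4F89-11D3-9A0C-0305E82C3301}", "row 1")
found = locate("3f2504e0-4f89-11d3-9a0c-0305e82c3301")
```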
