Hadoop grep dump sql - hadoop

I want to use Apache Hadoop to parse large files (~20 MB each). These files are PostgreSQL dumps (i.e. mostly CREATE TABLE and INSERT statements). As a first step, I just need to filter out anything that is not a CREATE TABLE or INSERT INTO statement.
So I decided to use the grep MapReduce job with the pattern ^(CREATE TABLE|INSERT).*;$ (lines starting with CREATE TABLE or INSERT and ending with a ";").
My problem is that some of these CREATE and INSERT statements span multiple lines (because the schema is really large, I guess), so the pattern can't match them at all, for example:
CREATE TABLE test(
    "id"....
    ..."name"...
);
I guess I could write a MapReduce job to put each INSERT and CREATE on a single line, but that would be really costly because the files are large. I could also remove all "\n" from the files, but then a single map operation would have to handle multiple CREATE/INSERT statements, making the work balance really bad. I'd really like one map operation per INSERT or CREATE.
I'm not responsible for the creation of the dump files so I cannot change the layout of the initial dump files.
I actually have no clue what the best solution is, so I could use some help :). I can provide any additional information if needed.

First things first:
20 MB files are NOT large by Hadoop standards. You will probably have many files (unless you only have a tiny amount of data), so there should be plenty of parallelization possible.
As such, having one mapper per file could very well be an excellent solution, and you may even want to concatenate files to reduce overhead.
That being said:
If you don't want to handle all lines at once, and handling a single line at a time is insufficient, then the only straightforward solution would be to handle 'a few' lines at once, for example 2 or 3.
An alternative solution would be to chop the file up and use one map per file part, but then you either need to deal with the edges, or accept that your solution may mishandle the statements that straddle a boundary.
I realize that this is still quite a conceptual answer, but based on your progress so far, I feel that this may be sufficient to get you there.
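If changing how records are split is an option, one more idea (not part of the answer above) is to tell Hadoop to use ";\n" as the record delimiter, so each CREATE TABLE / INSERT statement reaches its mapper as a single record and the one-map-operation-per-statement goal falls out naturally. A minimal sketch, assuming Hadoop 2.x (where TextInputFormat honours the textinputformat.record.delimiter property) and that every statement in the dump ends with ";" followed by a newline; the class and path names are made up:

    // Sketch: treat ";\n" as the record delimiter so each CREATE TABLE / INSERT
    // statement arrives at the mapper as one record. Assumes Hadoop 2.x.
    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class SqlStatementGrep {

      public static class StatementMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {
        // DOTALL because a statement may still contain embedded newlines.
        private static final Pattern WANTED =
            Pattern.compile("^(CREATE TABLE|INSERT).*", Pattern.DOTALL);

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String statement = value.toString().trim();
          if (WANTED.matcher(statement).matches()) {
            // The delimiter strips the ";", so put it back on output.
            context.write(new Text(statement + ";"), NullWritable.get());
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Each ";\n" ends a record, so multi-line statements stay together.
        conf.set("textinputformat.record.delimiter", ";\n");
        Job job = Job.getInstance(conf, "sql statement grep");
        job.setJarByClass(SqlStatementGrep.class);
        job.setMapperClass(StatementMapper.class);
        job.setNumReduceTasks(0);                 // map-only filter
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note the DOTALL flag: even with the new delimiter, a statement still contains embedded newlines, so the pattern has to match across them. If you cannot rely on every statement ending in ";\n", the "handle a few lines at once" approach above is the safer route.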

Related

Is there any way to handle source flat file with dynamic structure in informatica power center?

I have to load a flat file using Informatica PowerCenter, and its structure is not static. The number of columns may change in future runs.
Here is the source file:
In the sample file I have 4 columns right now, but in the future I may get only 3 columns, or I may get a new set of columns as well. I can't go and change the code in production every time; I have to use the same code and handle this situation.
The expected result set is:
Is there any way to handle this scenario? PL/SQL and Unix solutions will also work here.
I can see two ways to do it. The only requirement is that the source should decide on a future structure and stick to it. If tomorrow someone decides to change the structure, data types, or lengths, the mapping will not work properly.
Solutions -
Create extra columns at the end of the source definition. If you have 5 columns now, the extra columns after the 5th will be pulled in as blank. Create as many as you want, but note that you will need to transform them according to the future structure and load them into the proper place in the target.
This is similar to the above solution, but in this case read the whole line as a single column in the source and source qualifier, as a large string of length 40000.
Then split the columns by the delimiter in an Informatica expression transformation. The splitting can be done by following the thread below (a rough plain-Java illustration of the idea is sketched after the link). This can also get tricky if you have hundreds of columns.
Split Flat File String into multiple columns in Informatica
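Informatica expression syntax can't easily be shown here, but the idea behind solution 2 (read each record as one big string, split it on the delimiter, and pad out to a fixed maximum width so a changing column count still maps onto a stable target layout) looks roughly like this in plain Java. MAX_COLS is an assumption; size it for the widest structure you expect:

    // Illustration only (not Informatica expression syntax): split a record
    // string on its delimiter and pad to a fixed number of columns.
    import java.util.Arrays;
    import java.util.regex.Pattern;

    public class PadColumns {
        private static final int MAX_COLS = 10;   // assumed widest future layout

        static String[] toFixedWidth(String line, String delimiter) {
            // -1 keeps trailing empty fields instead of silently dropping them
            String[] parts = line.split(Pattern.quote(delimiter), -1);
            String[] padded = Arrays.copyOf(parts, MAX_COLS);
            for (int i = parts.length; i < MAX_COLS; i++) {
                padded[i] = "";                    // missing columns come through blank
            }
            return padded;
        }

        public static void main(String[] args) {
            System.out.println(Arrays.toString(toFixedWidth("a,b,c,d", ","))); // 4 values, 6 blanks
            System.out.println(Arrays.toString(toFixedWidth("a,b,c", ",")));   // 3 values, 7 blanks
        }
    }

In PowerCenter itself the same effect comes from defining the extra ports and letting them come through blank, as described in solution 1.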

Elastic search - best way for multiple updates index?

I'm integrating with an external system.
From it I get 3 files:
customer_data.csv
address_data.csv
additional_customer_data.csv
The order of records in each of them can be random.
There is:
a one-to-many relation (customer_data => addresses), but I am only interested in one address of a specified kind
a one-to-one relation (customer_data => additional_customer_data)
Goal:
Merge the files together and put the result into one index in Elasticsearch.
Additional info:
- each file has circa 1 million records
- this operation will be done each night
- data is used only for search purposes
Options:
a) I thought about:
Parse the first file and add it to ES
Do the same with the next files and update the documents created in point one
This looks very inefficient.
b) Another way:
Parse the first file and add it to a relational database
Do the same with the other files and update the records from point one
Propagate the data to ES
Can you see any other options?
I assume you have a normalized relational data structure with 1-to-n relationships in those CSV files, like this:
customer_data.csv
Id;Name;AdressId;AdditionalCustomerDataId;...
0;Mike;2;1;...
address_data.csv
Id;Street;City;...
....
2;Abbey Road;London;...
additional_customer_data.csv
Id;someData;...
...
1;data;...
In that case, I would denormalize those files in a preprocessing step into one single CSV and use that to upload the data to ES. To avoid downtime, you can then use aliases.
Preprocessing can be done in any language, but converting the CSVs into sqlite tables will probably be the fastest.
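For illustration, here is a minimal preprocessing sketch in plain Java that does the denormalization with in-memory maps instead of sqlite (about a million records per lookup file fit comfortably in RAM). The file names match the question, but the column positions are assumptions taken from the sample layout above:

    // Minimal denormalization sketch: load the two lookup files into maps
    // keyed by Id, then stream customer_data.csv and append the matching rows.
    // Column positions (AdressId = index 2, AdditionalCustomerDataId = index 3)
    // are assumptions based on the sample header above.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class DenormalizeCustomers {

        // Load "Id -> whole line" for a lookup file, skipping the header.
        static Map<String, String> loadById(String path) throws IOException {
            Map<String, String> byId = new HashMap<>();
            try (BufferedReader in = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
                in.readLine();                               // skip header
                String line;
                while ((line = in.readLine()) != null) {
                    String id = line.substring(0, line.indexOf(';'));
                    byId.put(id, line);
                }
            }
            return byId;
        }

        public static void main(String[] args) throws IOException {
            Map<String, String> addresses = loadById("address_data.csv");
            Map<String, String> additional = loadById("additional_customer_data.csv");

            try (BufferedReader in = Files.newBufferedReader(Paths.get("customer_data.csv"), StandardCharsets.UTF_8);
                 PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get("denormalized.csv"), StandardCharsets.UTF_8))) {
                in.readLine();                               // skip header
                String line;
                while ((line = in.readLine()) != null) {
                    String[] c = line.split(";", -1);        // Id;Name;AdressId;AdditionalCustomerDataId;...
                    String address = addresses.getOrDefault(c[2], "");
                    String extra = additional.getOrDefault(c[3], "");
                    out.println(line + ";" + address + ";" + extra);
                }
            }
        }
    }

The output then has one complete customer document per line and can be bulk-indexed into ES; swapping the maps for sqlite tables, as suggested above, is mostly a matter of taste at this data volume.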
I wouldn't choose a strategy that creates just half of the document and adds the additional information later, as you would probably need to reindex afterwards.
However, maybe you can tell us more about the requirements and the external system, because this doesn't seem to be a great solution.

Put performance - HBase Java Client

I did some benchmarking of Put performance from a Java client, but the results are not clear to me.
Here's the problem:
What is the best way to do Puts in HBase? A single Put with 1000 columns (across 4 families), or 1000 Puts with a single column each? Maybe 4 Puts with 250 columns each?
In theory, what would be the best strategy?
PS: I can't use batch loading because I need the WALs for Solr.
Thanks.
To get good write performance, you should use one Put per row. Otherwise, performance will be significantly degraded, because HBase takes a lock per row key, so a lot of time will be wasted on synchronization. With a single Put per row, write performance will be comparable to a bulk load.
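A minimal sketch of the one-Put-per-row idea with the HBase 1.x Java client; the table, family and column names are placeholders:

    // One Put per row, carrying all columns for that row, instead of many
    // single-column Puts. Table, family and qualifier names are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SinglePutPerRow {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("my_table"))) {

                byte[] family = Bytes.toBytes("f1");
                Put put = new Put(Bytes.toBytes("row-0001"));   // one Put == one row
                for (int i = 0; i < 1000; i++) {
                    // All 1000 columns go into the same Put.
                    put.addColumn(family, Bytes.toBytes("col" + i), Bytes.toBytes("value" + i));
                }
                table.put(put);   // single client write; the WAL is written as usual by default
            }
        }
    }

Because all columns travel in the same Put, the row lock is taken once, and since this is a normal client write the WAL is still written by default, so the Solr requirement from the question is unaffected.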
First of all, use as few column families as you can (I have provided details in this answer). Second, you must consider not only your write patterns but also your read patterns. HBase works best for "write once, read many" scenarios, so you want to design your table so that it provides the fastest access to the data, and this criterion determines whether you need a "tall" or a "wide" table. Check out the HBase table design chapter of "HBase in Action".

Best way to sort 200-300 million records in Pentaho?

I am working on a new task where my input CSV file has about 200 to 300 million records. My requirement is to sort the incoming data, perform lookups to get the key values, and insert into the target table. One suggestion was to write a Java plugin that sorts and stores the data in multiple temp files (say a million rows each) and retrieves it from there. I was thinking of using the Sort step in Pentaho and setting the number of copies to start. But I am not sure what the best approach is. Can anyone suggest how to go about this? Thanks.
I have used PDI to sort this many rows. The Sort step works fine, though it can be finicky. I set my "Free memory threshold (in %)" to ~50. The step will generate gobs of temp files in your "Sort-directory"; if the job crashes (usually by running out of memory) you will have to remove the temp files manually.
If I had to do it again I'd probably set the "Compress TMP Files?" option since multiple failures ran me out of disk space. Good luck!
A custom sort in Java may give you better performance, but development time will be significant. If you're going to sort this many rows daily/weekly, whatever, it's probably worth it. If not, just stick with PDI's Sort.
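For a sense of what the custom Java route involves, here is a bare-bones external merge sort sketch (the same sort-into-temp-files idea suggested in the question): sort chunks in memory, spill each sorted run to a temp file, then k-way merge the runs with a priority queue. It sorts whole lines lexicographically; a real implementation would extract the sort key and handle CSV headers, quoting and charset explicitly:

    // Bare-bones external merge sort: split into sorted runs on disk,
    // then merge the runs with a priority queue ordered by the current line.
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalSort {
        private static final int CHUNK_LINES = 1_000_000;   // roughly one temp file per million rows

        public static void main(String[] args) throws IOException {
            List<Path> runs = splitIntoSortedRuns(Paths.get(args[0]));
            mergeRuns(runs, Paths.get(args[1]));
        }

        // Phase 1: read the input in chunks, sort each chunk in memory, spill to a temp file.
        static List<Path> splitIntoSortedRuns(Path input) throws IOException {
            List<Path> runs = new ArrayList<>();
            List<String> chunk = new ArrayList<>(CHUNK_LINES);
            try (BufferedReader in = Files.newBufferedReader(input)) {
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() >= CHUNK_LINES) {
                        runs.add(spill(chunk));
                    }
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(spill(chunk));
            }
            return runs;
        }

        static Path spill(List<String> chunk) throws IOException {
            Collections.sort(chunk);                         // in-memory sort of one run
            Path run = Files.createTempFile("sort-run-", ".tmp");
            Files.write(run, chunk);
            chunk.clear();
            return run;
        }

        // Phase 2: k-way merge of the sorted runs, smallest current line first.
        static void mergeRuns(List<Path> runs, Path output) throws IOException {
            PriorityQueue<RunCursor> queue = new PriorityQueue<>();
            for (Path run : runs) {
                RunCursor cursor = new RunCursor(Files.newBufferedReader(run));
                if (cursor.advance()) {
                    queue.add(cursor);
                }
            }
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(output))) {
                while (!queue.isEmpty()) {
                    RunCursor cursor = queue.poll();
                    out.println(cursor.current);
                    if (cursor.advance()) {
                        queue.add(cursor);
                    } else {
                        cursor.close();
                    }
                }
            }
        }

        static class RunCursor implements Comparable<RunCursor> {
            final BufferedReader reader;
            String current;

            RunCursor(BufferedReader reader) { this.reader = reader; }
            boolean advance() throws IOException { return (current = reader.readLine()) != null; }
            void close() throws IOException { reader.close(); }
            @Override public int compareTo(RunCursor other) { return current.compareTo(other.current); }
        }
    }

Whether this beats PDI's Sort step depends mostly on chunk size and disk throughput, so benchmark both before committing to the custom route.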

Best way to work with large amounts of CSV data quickly

I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output—they contain data that amplifies what we know about the millions of lines in the CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.
Imagine that the large CSV file has transactions but the customer information and billing information is recorded in two other files and we want to output a new CSV that has each transaction linked to the customer ID and account ID, etc.
A colleague has a functional program written in Java to do this but it is very slow. The reason is that the CSV file with the millions of lines has to be walked through many, many, many times apparently.
My question is—yes, I am getting to it—how should I approach this in Ruby? The goal is for it to be faster (18+ hours right now with very little CPU activity)
Can I load this many records into memory? If so, how should I do it?
I know this is a little vague. Just looking for ideas as this is a little new to me.
Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).
https://gist.github.com/1323865
A standard FasterCSV.parse pulling it all into memory was taking over an hour. This got it down to about 10 minutes.
The relevant part is this:
lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  if lines.size >= 1000
    lines = FasterCSV.parse(lines.join) rescue next
    store lines
    lines = []
  end
end
store lines
IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. When it gets to 1000 lines, it tries parsing a CSV and inserting just those rows. One tricky part is the "rescue next": if your CSV has some fields that span multiple lines, you may need to grab a few more lines to get a valid, parseable CSV string; otherwise the line you're on could be in the middle of a field.
In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This lets you insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new row. You can think of it as a create/update in one query. You'll need a unique index on at least one column for this to work.
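The gist does this from Ruby; purely for illustration, the same bulk upsert pattern looks like this through JDBC (table and column names are made up, the connection URL is a placeholder, and a UNIQUE index on txn_id is assumed):

    // Illustration of MySQL's INSERT ... ON DUPLICATE KEY UPDATE via JDBC.
    // Assumes a UNIQUE index on transactions.txn_id; names/URL are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class UpsertExample {
        public static void main(String[] args) throws SQLException {
            String sql = "INSERT INTO transactions (txn_id, customer_id, amount) "
                       + "VALUES (?, ?, ?) "
                       + "ON DUPLICATE KEY UPDATE customer_id = VALUES(customer_id), "
                       + "amount = VALUES(amount)";
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                conn.setAutoCommit(false);
                for (String[] row : loadRows()) {        // hypothetical source of parsed CSV rows
                    stmt.setString(1, row[0]);
                    stmt.setString(2, row[1]);
                    stmt.setBigDecimal(3, new java.math.BigDecimal(row[2]));
                    stmt.addBatch();                     // batch the statements for a bulk round trip
                }
                stmt.executeBatch();
                conn.commit();
            }
        }

        // Placeholder: in a real program these rows come from the parsed CSV chunks.
        static java.util.List<String[]> loadRows() {
            return java.util.Collections.singletonList(new String[] {"t1", "c1", "9.99"});
        }
    }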
How about using a database?
Jam the records into tables, and then query them out using joins.
The import might take a while, but the DB engine will be optimized for the join and retrieval part...
10M+ rows doesn't really sound like that much. If you can preload the contents of the files and match up the data in memory with decent data structures (you'll want maps at some point), you won't have to keep running through the CSV files over and over. File access is SLOW.
Two reasonably fast options:
Put your data into a sqlite DB. Then it's a simple query with a pair of joins, which will perform way faster than anything you could write yourself -- SQL is very good for this kind of task.
Assuming your additional CSV files are small enough to fit into RAM, you can read everything into a hash, using customer ID as the key, then look up that hash when processing the main file with 10M+ records. Note that only the lookup data needs to go into RAM; the main list can be processed in small batches.
My experience is that with Ruby, you should prepare for about 10x the memory usage of the actual payload. Of course, with current amounts of RAM, if the process loads only one file at a time, 10 MB is almost negligible even when multiplied by ten :)
If you can read one line at a time (which is easy with File instances), you could use FasterCSV and write one line at a time as well. That would make memory consumption O(1) instead of O(n). But with 10 megabyte files you can probably slurp the whole file into memory and write it to CSV in one pass, given that only a few processes run at any given time.
If you have a Java program, make sure you use the NIO libraries. They are way faster than the default I/O. I have processed text files with 500,000 lines using the NIO libraries before.
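A minimal sketch of what that looks like with java.nio.file (the file name and the per-line handling are placeholders):

    // Stream the file line by line with NIO instead of loading it all at once.
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class NioLineReader {
        public static void main(String[] args) throws IOException {
            try (Stream<String> lines = Files.lines(Paths.get("transactions.csv"), StandardCharsets.UTF_8)) {
                lines.forEach(line -> {
                    // Placeholder: split the line, look up customer/account data, write output.
                });
            }
        }
    }

Files.lines streams lazily, so memory usage stays flat regardless of the file size.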
