I have multiple text files to read with JavaSparkContext, and each of the files might be slightly different and contain multiline records, so I want to use a regex delimiter to find the records. Is it possible to configure the textinputformat delimiter with a regex?
..
String regex = "^(?!(^a\\s|^b\\s))";
JavaSparkContext jsc = new JavaSparkContext(conf);
jsc.hadoopConfiguration().set("textinputformat.record.delimiter", regex);
..
Unfortunately it is not possible. textinputformat.record.delimiter has to be a fixed string. When working with Spark, you have two alternatives:
Implement your own Hadoop input format - scales better but requires more work.
Use wholeTextFiles (or binaryFiles) and split strings using regex - easy to use, but doesn't scale to large files.
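The second option can be sketched in plain Python: read the whole file content as one string, then split it into records with the regex from the question (the sample text below is made up for illustration).

```python
import re

# Sketch of option 2: after reading a file's full text (e.g. via
# wholeTextFiles), split it into records with a regex. The pattern
# mirrors the question: a new record starts at any line that does
# NOT begin with "a " or "b " (those are continuation lines).
text = "x 1\na cont\ny 2\nb cont\nb cont2\n"

# (?m)^ anchors at every line start; the negative lookahead keeps
# continuation lines attached to the preceding record.
records = [r for r in re.split(r"(?m)^(?!a\s|b\s)", text) if r]
```

This scales only as far as one file's content fits in memory on a single executor, which is exactly the trade-off mentioned above.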
Related
I have to process (with Hadoop) variable-length files without delimiters.
The format of these files is:
(LengthRecord1)(Record1)(LengthRecord2)(Record2)...(LengthRecordN)(RecordN)
There is no delimiter between the records (the file is in one line).
There is no delimiter between the LengthRecord and the Record itself (parentheses were added in this text only for clarity).
I think I can use neither the TextInputFormat nor the KeyValueTextInputFormat default classes, because they rely on a linefeed or carriage return to signal the end of a line.
So, I think I have to customize an InputFormat to load these files. But I don't know exactly how to do this.
Do I have to override createRecordReader() in order to read the length of record n and identify the end of record n? If so, how can I handle the fact that a split can cut a record in half?
Thanks in advance.
Regards
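The parsing loop such a RecordReader would implement can be sketched in a few lines of Python. The encoding of the length field is not specified in the question, so this sketch assumes, purely for illustration, a 4-byte big-endian integer; adapt it to your actual format.

```python
import struct

def read_records(data: bytes):
    """Parse a (length)(record)(length)(record)... stream.
    Assumes each length is a 4-byte big-endian integer
    (an illustrative assumption, not part of the question)."""
    pos, records = 0, []
    while pos < len(data):
        (length,) = struct.unpack_from(">I", data, pos)  # read LengthRecordN
        pos += 4
        records.append(data[pos:pos + length])           # read RecordN
        pos += length
    return records

# Build a sample stream containing two records.
stream = struct.pack(">I", 3) + b"abc" + struct.pack(">I", 2) + b"de"
```

Because the stream has no sync markers, a reader cannot start at an arbitrary split offset; a common way around the half-record problem is to make the custom InputFormat non-splittable (override isSplitable() to return false) so each file is read from the beginning by one reader.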
I have a problem I need help solving. The business I am working for is using Informatica Cloud to do a lot of their ETL into AWS and other services.
We have been given a flat file by the business where the field delimiter is "~|". Currently, to the best of my knowledge, Informatica only accepts a single-character delimiter.
Does anyone know how to overcome this?
Informatica cannot read composite delimiters.
First, you could feed each line as one single long string into an Expression transformation. In this case the delimiter character should be set to \037; I haven't seen this character (the ASCII Unit Separator) in use at least since 1982. Then use repeated invocations of InStr() within the EXP to identify the positions of those two-character "~|" delimiters and split up each line into fields using SubStr().
Second (easier in the mapping, more work with the session), you could feed the file into some utility which replaces those two-character delimiters with the character ASCII 31 (the Unit Separator mentioned above); the session has to be set up such that it reads the output from this utility (input file type = Command instead of File). Then the source definition should contain \037 as the field delimiter instead of any pipe character.
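The second approach boils down to a one-line preprocessing step. Here is a minimal Python sketch of such a utility (the sample line is made up); the session then reads its output with \037 configured as the single-character field delimiter:

```python
# Replace the composite delimiter "~|" with ASCII 31 (Unit Separator)
# so the downstream tool can be configured with a single-character
# field delimiter (\037).
UNIT_SEP = "\x1f"  # ASCII 31

def replace_delimiter(line: str) -> str:
    return line.replace("~|", UNIT_SEP)

line = "field1~|field2~|field3"
converted = replace_delimiter(line)
```

This relies on the Unit Separator never appearing in the data itself, which is the same assumption the answer makes.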
I'm currently struggling to clean CSV files that were generated automatically, with fields containing the CSV separator and the field delimiter, using sed or awk or via a script.
The source software has no settings to play with to improve the situation.
Format of the csv:
"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";
Fortunately, the CSV is "well" formatted; the exporting software just doesn't escape or replace "forbidden" chars in the fields.
In the last few days I tried to improve my knowledge of regular expressions and to find an expression to clean the files, but I failed.
What I managed to do so far:
RegEx to find the fields (I wanted to find the fields and perform a replace inside but I didn't find a way to do it)
(?:";"|^")(.*?)(?=";"|";\n)
RegEx that finds a semicolon; it does not work if the semicolon is the last char of the field, and it only finds one per field.
(?:^"|";")(?:.*?)(;)(?:[^"\n].*?)(?=";"|";\n)
RegEx to find the double quotes; it seems to pick the first double quote of the line in online regex testers.
(?:^"|";")(?:.*?)[^;](")(?:[^;].*?)(?=";"|";\n)
I thought of adding a space between each char in the fields, then searching for lone semicolons and double quotes and removing the single spaces afterwards, but I don't know if that is even possible, and it seems like a poor solution anyway.
Any standard library should be able to handle it if there is no explicit error in the CSV itself. This is why we have quote characters and escape characters.
When you create a CSV yourself, you may forget to handle such cases and leave your final output file in this situation. AWK is not a CSV reader but simply a text-processing utility.
This is what your row should rather look like:
"111111";"text";"";"text with \; and \" sometimes \"; or ;\" multiple times";"user";
So if you can still re-fetch the data, find a way to export the CSV either through the database's own functionality or through a CSV library for the language you work with.
In Python, this would look like this:
mywriter = csv.writer(csvfile, delimiter=';', quotechar='"', escapechar="\\")
But if you can't create the CSV again, the only hope is that you can expect some pattern within the fields, as in this question: parse a csv file that contains commas in the fields with awk
But this is rarely true of textual data, especially comments or posts on a webpage. Another idea in such situations would be to use '\t' as the separator.
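As a runnable sketch of the re-export idea, here is the csv.writer configuration from above producing a correctly escaped row. The field values are made up to resemble the problematic row in the question; doublequote=False is added so embedded quotes are written as \" (as in the corrected row shown above) rather than doubled.

```python
import csv
import io

rows = [["111111", "text", "", 'text with ; and " inside', "user"]]

buf = io.StringIO()
mywriter = csv.writer(buf, delimiter=';', quotechar='"',
                      escapechar='\\', doublequote=False,
                      quoting=csv.QUOTE_ALL, lineterminator='\n')
for row in rows:
    mywriter.writerow(row)
output = buf.getvalue()
```

Because every field is quoted, embedded semicolons need no escaping at all; only the quote character itself gets the backslash.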
I needed a StoreFunc implementation that would allow Pig to use a multi-byte field delimiter, for example ^^ (\u005E\u005E).
I tried all of these, but without success:
store B into '/tmp/test/output' using PigStorage('\u005E\u005E');
store B into '/tmp/test/output' using PigStorage('^^');
store B into '/tmp/test/output' using PigStorage('\\^\\^');
Is there an existing StoreFunc implementation, analogous to the LoadFunc implementation org.apache.pig.piggybank.storage.MyRegExLoader, that can take a regular expression as the field separator while writing?
I worked around this by using CONCAT for the first delimiter character and the PigStorage syntax for the second occurrence.
I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. Could someone clarify how I could possibly generate Key, Value pairs in Map-Reduce where each Value is the entire logical line?
I can't find the question where this was asked earlier, but you just have to iterate over your lines in a simple MapReduce job and collect them in a StringBuilder. Flush the StringBuilder to the context whenever you want to begin a new record. The trick is to set up the StringBuilder as a field of your mapper class, not as a local variable.
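The accumulate-and-flush idea is independent of Hadoop and can be sketched in a few lines of Python (the buffer list plays the role of the StringBuilder; the continuation rule here is the trailing-backslash convention from the example above):

```python
def group_logical_lines(lines):
    """Group physical lines into logical records: a line ending in a
    backslash continues onto the next physical line. Mirrors the
    StringBuilder-as-a-field idea: accumulate, then flush when the
    record ends."""
    records, buf = [], []
    for line in lines:
        if line.endswith("\\"):
            buf.append(line[:-1])         # drop the backslash, keep accumulating
        else:
            buf.append(line)
            records.append("".join(buf))  # flush the finished logical line
            buf = []
    if buf:                               # flush a trailing partial record
        records.append("".join(buf))
    return records

lines = ['print "Blah Blah\\', '... blah blah"', 'x = 1']
```

In an actual mapper, each flushed record would be written to the context as one value instead of being appended to a list.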
Here it is:
Processing paragraphs in text files as single records with Hadoop
You should create your own variation on TextInputFormat. In there you make a new RecordReader that skips lines until it sees the start of a logical line.
Preprocess the input file to remove the newlines. What is your goal in creating the SequenceFile?