How to handle Hadoop splits in case of large XML input file - hadoop

I have one really large input file which is an XML data.
When I put it in HDFS, the HDFS blocks will be created and the XML records will end up divided across blocks. The typical TextInputFormat handles this by skipping the first partial line if the split does not start at a line boundary, while the previous mapper reads (over RPC) into this block until the end of the record.
How can I handle this scenario for XML? I don't want to use WholeFileInputFormat, as that would give up all parallelism.
<books>
<book>
<author>Test</author>
<title>Hadoop Recipes</title>
<ISBN>04567GHFR</ISBN>
</book>
<book>
<author>Test</author>
<title>Hadoop Data</title>
<ISBN>04567ABCD</ISBN>
</book>
<book>
<author>Test1</author>
<title>C++</title>
<ISBN>FTYU9876</ISBN>
</book>
<book>
<author>Test1</author>
<title>Baby Tips</title>
<ISBN>ANBMKO09</ISBN>
</book>
</books>
The initialize function of the XMLRecordReader looks like -
public void initialize(InputSplit arg0, TaskAttemptContext arg1)
        throws IOException, InterruptedException {
    Configuration conf = arg1.getConfiguration();
    FileSplit split = (FileSplit) arg0;
    start = split.getStart();
    end = start + split.getLength();
    final Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    fsin = fs.open(file);
    fsin.seek(start);

    DocumentBuilder db = null;
    try {
        db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    }
    Document doc = null;
    try {
        doc = db.parse(fsin);
    } catch (SAXException e) {
        e.printStackTrace();
    }

    NodeList nodes = doc.getElementsByTagName("book");
    for (int i = 0; i < nodes.getLength(); i++) {
        Element element = (Element) nodes.item(i);
        BookWritable book = new BookWritable();
        NodeList author = element.getElementsByTagName("author");
        Element line = (Element) author.item(0);
        book.setBookAuthor(new Text(getCharacterDataFromElement(line)));
        NodeList title = element.getElementsByTagName("title");
        line = (Element) title.item(0);
        book.setBookTitle(new Text(getCharacterDataFromElement(line)));
        NodeList isbn = element.getElementsByTagName("ISBN");
        line = (Element) isbn.item(0);
        book.setBookISBN(new Text(getCharacterDataFromElement(line)));
        mapBooks.put(Long.valueOf(i), book);
    }
    this.startPos = 0;
    endPos = mapBooks.size();
}
I am using a DOM parser for the XML parsing part. I am not sure, but perhaps a pattern match would get around the DOM parser failing on broken XML at a split boundary; would that also let the last mapper complete its record from the next input split?
Please correct me in case there is some fundamental issue and if any solution is there it will be a great help.
Thanks,
AJ

You could very well try out Mahout's XmlInputFormat class. There is more explanation in the book 'Hadoop in Action'.
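A rough sketch of how a job might be wired up with it; the xmlinput.start / xmlinput.end keys are the configuration properties that class is commonly driven by, and BookDriver / BookMapper are hypothetical names, so verify everything against the version you use:
Configuration conf = new Configuration();
// XmlInputFormat scans each split for these markers, so a <book> record
// is never cut in half even when the file spans several HDFS blocks.
conf.set("xmlinput.start", "<book>");
conf.set("xmlinput.end", "</book>");

Job job = Job.getInstance(conf, "book parser");
job.setJarByClass(BookDriver.class);          // hypothetical driver class
job.setInputFormatClass(XmlInputFormat.class);
job.setMapperClass(BookMapper.class);         // hypothetical mapper parsing one <book> element
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);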

I don't think an XML file is splittable by itself, so I don't think there is a generic off-the-shelf solution for you. The problem is that there is no way to understand the tag hierarchy when you start reading in the middle of the XML, unless you know the structure of the XML a priori.
But your XML is very simple, so you can create an ad-hoc splitter. As you explained, TextInputFormat skips the first characters until it reaches the beginning of a new text line. You can do the same thing, looking for the book tag instead of a new line: copy that code, but instead of looking for the "\n" character, look for the opening tag of your items.
Be sure to use a SAX parser in your implementation; DOM is not a good option for dealing with big XMLs. A SAX parser reads each tag one by one and lets you act on each event, instead of loading the whole file into memory as DOM tree generation does.
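A minimal sketch of that idea, assuming a helper that scans the open split for the next <book> ... </book> pair. The class name and fields are illustrative, mirroring the fsin/start/end fields in the question's record reader:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.DataOutputBuffer;

// Illustrative helper: scan the open split for start/end tags the same way
// LineRecordReader scans for '\n', so a record that spills into the next
// HDFS block is still read to completion by the mapper that started it.
public class BookBoundaryScanner {

    private final FSDataInputStream fsin; // stream opened on the file, as in the question
    private final long end;               // end offset of this split
    private final byte[] startTag = "<book>".getBytes(StandardCharsets.UTF_8);
    private final byte[] endTag = "</book>".getBytes(StandardCharsets.UTF_8);

    public BookBoundaryScanner(FSDataInputStream fsin, long start, long end) throws IOException {
        this.fsin = fsin;
        this.end = end;
        fsin.seek(start);
    }

    // Reads until 'match' is seen. If withinRecord is true the bytes read are
    // appended to 'buffer' so the caller ends up with one complete record.
    private boolean readUntilMatch(byte[] match, boolean withinRecord,
                                   DataOutputBuffer buffer) throws IOException {
        int i = 0;
        while (true) {
            int b = fsin.read();
            if (b == -1) return false;                 // end of file
            if (withinRecord) buffer.write(b);
            if (b == match[i]) {
                i++;
                if (i >= match.length) return true;    // full tag matched
            } else {
                i = 0;
            }
            // Only give up at the split boundary while looking for a start tag;
            // once inside a record we keep reading into the next block.
            if (!withinRecord && i == 0 && fsin.getPos() >= end) return false;
        }
    }

    // One record = everything from "<book>" to "</book>", even if the
    // closing tag lives in the next HDFS block.
    public String nextRecord(DataOutputBuffer buffer) throws IOException {
        buffer.reset();
        if (fsin.getPos() < end && readUntilMatch(startTag, false, buffer)) {
            buffer.write(startTag);
            if (readUntilMatch(endTag, true, buffer)) {
                return new String(buffer.getData(), 0, buffer.getLength(), StandardCharsets.UTF_8);
            }
        }
        return null;
    }
}
Each record returned this way is a small, well-formed fragment, so it can be handed to a SAX (or even DOM) parser per record without loading the whole file.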

Maybe split the XML file first. There are open-source XML splitters, and at least two commercial split tools that claim to handle the XML structure automatically so that each split file is well-formed XML. Google "xml split tool" or "xml splitter".

Related

Parsing several csv files using Spring Batch

I need to parse several CSV files from a given folder. As each CSV has different columns, there is a separate table in the DB for each CSV. I need to know:
Does Spring Batch provide any mechanism that scans through the given folder, so that I can pass those files one by one to the reader?
As I am trying to make the reader/writer generic, is it possible to just get the column header for each CSV, and based on that build the tokenizer and also the insert query?
Code sample
public ItemReader<Gdp> reader1() {
    FlatFileItemReader<Gdp> reader1 = new FlatFileItemReader<Gdp>();
    reader1.setResource(new ClassPathResource("datagdp.csv"));
    reader1.setLinesToSkip(1);
    reader1.setLineMapper(new DefaultLineMapper<Gdp>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(new String[] { "region", "gdpExpend", "value" });
                }
            });
            setFieldSetMapper(new BeanWrapperFieldSetMapper<Gdp>() {
                {
                    setTargetType(Gdp.class);
                }
            });
        }
    });
    return reader1;
}
Use a MultiResourceItemReader to scan all files.
I think you need a sort of classifier-aware ItemReader as the MultiResourceItemReader.delegate, but Spring Batch doesn't offer one, so you have to write your own.
For ItemProcessor and ItemWriter, Spring Batch does offer classifier-aware implementations (ClassifierCompositeItemProcessor and ClassifierCompositeItemWriter).
Obviously, the more different input files you have, the more XML configuration you must write, but it should be straightforward to do.
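A minimal sketch of that wiring; the folder pattern is an assumption, and the delegate is essentially the reader1() bean from the question, declared as a FlatFileItemReader so it satisfies setDelegate():
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.MultiResourceItemReader;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

public MultiResourceItemReader<Gdp> multiReader(FlatFileItemReader<Gdp> delegate) throws Exception {
    MultiResourceItemReader<Gdp> multiReader = new MultiResourceItemReader<Gdp>();
    // Collect every CSV in the folder (the pattern here is illustrative).
    Resource[] resources = new PathMatchingResourcePatternResolver()
            .getResources("file:/data/csv/*.csv");
    multiReader.setResources(resources);
    // Each matched file is handed to the delegate FlatFileItemReader in turn.
    multiReader.setDelegate(delegate);
    return multiReader;
}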
I suppose you are expecting this kind of implementation:
In the partitioner of a partition step, read all the file names, the file headers, and the insert query for each writer, and save them in the Execution Context.
In the slave step, for every reader and writer, pass in the Execution Context and from it get the file to read, the file header for the tokenizer, and the insert query for that writer.
A rough sketch of such a partitioner is shown below; this should resolve both of your questions.
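A minimal sketch, assuming an illustrative folder path and context key names (filePath, header) that are not part of the original answer:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class CsvFilePartitioner implements Partitioner {

    private File folder = new File("/data/csv"); // illustrative folder

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<String, ExecutionContext>();
        int i = 0;
        for (File file : folder.listFiles()) {
            ExecutionContext context = new ExecutionContext();
            context.putString("filePath", file.getAbsolutePath());
            context.putString("header", readFirstLine(file)); // column names for the tokenizer
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }

    private String readFirstLine(File file) {
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read header of " + file, e);
        }
    }
}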
Answers for your questions:
I don't know of a specific mechanism in Spring Batch for scanning files.
You can use opencsv as a generic CSV reader; it offers a lot of ways to read files.
About OpenCSV:
If you are using a Maven project, try importing this dependency:
<dependency>
    <groupId>net.sf.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>2.0</version>
</dependency>
You can read your files into an object for a specific format, or handle generic headers, as in the example below:
private static List<PeopleData> extractPeopleData() throws IOException {
    // 'people' here is the CSV file being read (defined elsewhere)
    CSVReader readerPeople = new CSVReader(new FileReader(people));
    List<PeopleData> listPeople = new ArrayList<PeopleData>();
    String[] nextLine;
    while ((nextLine = readerPeople.readNext()) != null) {
        PeopleData entry = new PeopleData();
        entry.setIncludeData(nextLine[0]);
        entry.setPartnerCode(Long.valueOf(nextLine[1]));
        listPeople.add(entry);
    }
    readerPeople.close();
    return listPeople;
}
There are a lot of other ways to read CSV files using opencsv:
If you want to use an Iterator style pattern, you might do something like this:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String [] nextLine;
while ((nextLine = reader.readNext()) != null) {
// nextLine[] is an array of values from the line
System.out.println(nextLine[0] + nextLine[1] + "etc...");
}
Or, if you just want to slurp the whole lot into a List, just call readAll()...
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
List myEntries = reader.readAll();
which will give you a List of String[] that you can iterate over. If all else fails, check out the Javadocs here.
If you want to customize quote characters and separators, you'll find constructors that cater for supplying your own separator and quote characters. Say you're using a tab for your separator, you can do something like this:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t');
And if you single quoted your escaped characters rather than double quote them, you can use the three arg constructor:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t', '\'');
You may also skip the first few lines of the file if you know that the content doesn't start till later in the file. So, for example, you can skip the first two lines by doing:
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"), '\t', '\'', 2);
Can I write csv files with opencsv?
Yes. There is a CSVWriter in the same package that follows the same semantics as the CSVReader. For example, to write a tab separated file:
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"), '\t');
// feed in your array (or convert your data to an array)
String[] entries = "first#second#third".split("#");
writer.writeNext(entries);
writer.close();
If you'd prefer to use your own quote characters, you may use the three arg version of the constructor, which takes a quote character (or feel free to pass in CSVWriter.NO_QUOTE_CHARACTER).
You can also customise the line terminators used in the generated file (which is handy when you're exporting from your Linux web application to Windows clients). There is a constructor argument for this purpose.
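For example, a small sketch assuming the five-argument CSVWriter constructor that takes separator, quote and escape characters plus a line-end string (check the exact constructor against your opencsv version):
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"),
        CSVWriter.DEFAULT_SEPARATOR,
        CSVWriter.DEFAULT_QUOTE_CHARACTER,
        CSVWriter.DEFAULT_ESCAPE_CHARACTER,
        "\r\n"); // Windows-style line terminator for Windows clients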
Can I dump out SQL tables to CSV?
Yes you can. There is a feature on CSVWriter so you can pass writeAll() a ResultSet.
java.sql.ResultSet myResultSet = ....
writer.writeAll(myResultSet, includeHeaders);
Is there a way to bind my CSV file to a list of Javabeans?
Yes there is. There is a set of classes to allow you to bind a CSV file to a list of JavaBeans based on column name, column position, or a custom mapping strategy. You can find the new classes in the com.opencsv.bean package. Here's how you can map to a java bean based on the field positions in your CSV file:
ColumnPositionMappingStrategy strat = new ColumnPositionMappingStrategy();
strat.setType(YourOrderBean.class);
String[] columns = new String[] {"name", "orderNumber", "id"}; // the fields to bind to in your JavaBean
strat.setColumnMapping(columns);
CsvToBean csv = new CsvToBean();
List list = csv.parse(strat, yourReader);

How do I autoscan local documents to add words to a custom dictionary?

I'd like my dictionary to know more of the words I use - and don't want to manually add all possible words as I end up typing them (I'm a biologist/bioinformatician - there's lots of jargon and specific software and species names). Instead I want to:
1. Take a directory of existing documents. These are PDFs or Word/LaTeX documents of scientific articles; I guess they could "easily" be converted to plain text.
2. Pull out all words that are not in the "normal" dictionary.
3. Add these to my local custom dictionary (on my Mac that's ~/Library/Spelling/LocalDictionary), but it would make sense to add them to the LibreOffice/Word/ispell custom dictionaries as well.
1 and 3 are easy. How can I do 2? Thanks!
As far as I understand, you want to avoid adding words that already exist in the system dictionary. You might want to ask first whether this is really necessary, though. I guess duplicates won't cause any problems and won't slow down spell checking noticeably, so there is no real reason for step 2 in my opinion.
I think you'll have a much harder time with step 1. Extracting plain-text from a PDF may sound easy, but it certainly is not. You'll end up with plenty of unknown symbols. You need to fix split-words at the end of a line and you probably want to exclude equations/links/numbers/etc. before adding all these to your dictionary.
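If you do want a starting point for the text extraction, here is a rough sketch using Apache PDFBox; it is not part of the original answer, and the API shown is the PDFBox 2.x one, so adjust for your version:
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToText {
    public static String extract(File pdf) throws IOException {
        // Load the PDF and pull out its plain text; equations, hyphenated
        // line breaks and stray symbols still need cleanup afterwards.
        PDDocument document = PDDocument.load(pdf);
        try {
            return new PDFTextStripper().getText(document);
        } finally {
            document.close();
        }
    }
}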
But if you have some tool to get this done and can create a couple of .txt files really containing only the words/sentences you need, then I would go with something like the following python code to "solve" the merge for your local dictionary only. Of course you can also extend this to load the system dictionary (wherever that is?) and merge it the same way I show below.
Please note that I left out any error handling on purpose.
Save as import_to_dict.py, adjust the paths to your requirements and call with python import_to_dict.py
#!/usr/bin/env python
import os, re

# 1 - load existing dictionaries from files (adjust paths here!)
local_dictionary_file = os.path.expanduser('~/Library/Spelling/LocalDictionary')
global_dictionary_file = '/Library/Spelling/GlobalDictionary'
txt_file_folder = os.path.expanduser('~/Documents/ConvertedPapers')

reg_exp = r'[\s,.|/]+'  # add symbols here

with open(local_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    dictionary = set(re.split(reg_exp, f.read()))

with open(global_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    global_dictionary = set(re.split(reg_exp, f.read()))

# 2 - walk over all sub-dirs in your folder
for root, dirs, files in os.walk(txt_file_folder):
    # open all files (this could easily be limited to only .txt files)
    for file in files:
        with open(os.path.join(root, file), 'r') as txt_f:
            # read the file contents
            words = txt_f.read()
            # split into word-set (set guarantees no duplicates)
            word_set = set(re.split(reg_exp, words))
            # remove any words already present in the existing dictionaries
            missing_words = (word_set - dictionary) - global_dictionary
            # add missing words to dictionary
            dictionary |= missing_words

# 3 - write dictionary file
with open(local_dictionary_file, 'w') as f:
    f.write('\n'.join(dictionary))
Here is a basic Java program that will generate a text file containing all of the unique words in a directory of plain text files, one per line.
You can just replace the input directory and output file path strings with correct values for your system and run it.
import java.io.*;
import java.util.*;

public class MakeDictionary {

    public static void main(String args[]) throws IOException {
        Hashtable<String, Boolean> dictionary = new Hashtable<String, Boolean>();
        String inputDir = "C:\\test";
        String outputFile = "C:\\out\\dictionary.txt";

        File[] files = new File(inputDir).listFiles();
        BufferedWriter out = new BufferedWriter(new FileWriter(outputFile));

        for (File file : files) {
            if (file.isFile()) {
                BufferedReader in = null;
                try {
                    in = new BufferedReader(new FileReader(file.getCanonicalPath()));
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] words = line.split(" ");
                        for (String word : words) {
                            dictionary.put(word, true);
                        }
                    }
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        }

        Set<String> wordset = dictionary.keySet();
        Iterator<String> iter = wordset.iterator();
        while (iter.hasNext()) {
            out.write(iter.next());
            out.newLine();
        }
        out.close();
    }
}

Stripping out headers from CSV file loading

I have a file that takes the following format.
TIMESTAMP=Jan 20 10:22:43 2014
TYPE=text
BEGIN-FILE
value1,value2,value3,value4,value5,value6
value1,value2,value3,value4,value5,value6
value1,value2,value3,value4,value5,value6
END-FILE
I want to load this file into Hadoop; unfortunately, all the files contain lines of meta information that I want to strip out. Is there a way in Pig (or any other method) to ignore all lines that do not contain commas?
In Pig, you can use the FILTER command to remove those lines if you just want to throw them away. You could do this multiple ways; here are a couple of possibilities:
Load entire lines into a single field, filter out the ones which can't be split on comma into 6 fields, and then split them out for use in your script:
a = LOAD 'file' USING PigStorage('\n') AS (line:chararray);
b = FILTER a BY SIZE(STRSPLIT(line, ',', 6)) == 6;
c = FOREACH b GENERATE FLATTEN(STRSPLIT(line, ',', 6)) AS (/*put your schema here*/);
Load as comma-separated file, and then throw away any lines with NULL in the 6th field:
a = LOAD 'file' USING PigStorage(',') AS (/*put your schema here*/);
b = FILTER a BY $5 IS NOT NULL;
In MapReduce, have your mapper first try to parse the line it is reading. You can do this with custom parsing logic, or you can leverage pre-built code (in this case, a CSV library).
protected void map(LongWritable key, Text value, Context context) throws IOException {
    String line = value.toString();
    CSVParser parser = new au.com.bytecode.opencsv.CSVParser();
    try {
        parser.parseLine(line);
        // do your other stuff
    } catch (Exception e) {
        // the line was not comma delimited, do nothing
    }
}

How to parse an XML document in a namespace-neutral way using JDOM

I am trying to parse a document using Dom4J. This document comes from various providers, and sometimes comes with namespaces and sometimes without.
For eg:
<book>
<author>john</author>
<publisher>
<name>John Q</name>
</publisher>
</book>
or
<book xmlns="http://schemas.xml.com/XMLSchemaInstance">
<author>john</author>
<publisher>
<name>John Q</name>
</publisher>
</book>
or
<book xmlns:i="http://schemas.xml.com/XMLSchemaInstance">
<i:author>john</i:author>
<i:publisher>
<i:name>John Q</i:name>
</i:publisher>
</book>
I have a list of XPaths. I parse the document into a Document class, and then search on it using the xpaths.
Document doc = parseDocument(documentFile);
List<String> XmlPaths = new ArrayList<String>();
XmlPaths.add("book/author");
XmlPaths.add("book/publisher/name");

for (int i = 0; i < XmlPaths.size(); i++)
{
    String searchPath = XmlPaths.get(i);
    Node currentNode = doc.selectSingleNode(searchPath);
    assert(currentNode != null);
}
This code does not work on the last document, the one that is using namespace prefixes.
I tried these techniques, but none of them seem to work.
1) changing the last element in the xpath to be namespace neutral:
/book/:author
/book/[local-name()='author']
/[local-name()='book']/[local-name()='author']
All of these throw an exception saying that the XPATH format is not correct.
2) Adding namespace URIs to the XPath, after creating it using DocumentHelper.createXPath();
Any idea what I am doing wrong?
FYI I am using dom4j version 1.5
Your XPath does not contain a tag name. The general syntax in your case would be
/TAGNAME_PARENT[CONDITION_PARENT]/TAGNAME_CHILD[CONDITION_CHILD]
The important aspect is that the tag names are mandatory while the conditions are optional. If you do not want to specify a tag name, you have to use * for "any tag". There may be performance implications for large XML files, since you will always have to iterate over a node set instead of using an index lookup. Maybe @MichaelKay can comment on this.
Try this instead:
/*[local-name()='book']/*[local-name()='author']
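For illustration, a short dom4j sketch using that namespace-neutral XPath; parseDocument and documentFile are the same placeholders the question already uses:
// Match on local-name() so the same path works whether the provider sends
// no namespace, a default namespace, or prefixed elements.
Document doc = parseDocument(documentFile);
Node author = doc.selectSingleNode("/*[local-name()='book']/*[local-name()='author']");
Node publisherName = doc.selectSingleNode(
        "/*[local-name()='book']/*[local-name()='publisher']/*[local-name()='name']");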

Image tag not closing with HTMLAgilityPack

Using the HTMLAgilityPack to write out a new image node, it seems to remove the closing slash of the image tag: the node should come out as <img ... />, but when you check the outer HTML it appears as <img ...> instead.
string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />";
HtmlNode newNode = HtmlNode.Create(strIMG);
This breaks xhtml.
Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this:
doc.OptionWriteEmptyNodes = true;
Edit 1: Here is how to fix an HTML Agility Pack document to correctly display image (img) tags:
if (HtmlNode.ElementsFlags.ContainsKey("img"))
{
    HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;
}
else
{
    HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);
}
Replace "img" with any other tag to fix them as well (input, select, and option come up frequently). Repeat as needed. Keep in mind that this will produce <img></img> rather than <img />, because of the HAP bug preventing the "closed" and "empty" flags from being set simultaneously.
Source: Mike Bridge
Original answer:
Having just labored over solutions to this issue, and not finding any sufficient answers (doctype set properly, using Output as XML, Check Syntax, AutoCloseOnEnd, and Write Empty Node options), I was able to solve this with a dirty hack.
This will certainly not solve the issue outright for everyone, but for anyone returning their generated html/xml as a string (EG via a web service), the simple solution is to use fake tags that the agility pack doesn't know to break.
Once you have finished doing everything you need to do on your document, call the following method once for each tag giving you a headache (notable examples being option, input, and img). Immediately after, render your final string, do a simple replace for each tag prefixed with some string (in this case "Fix_"), and return your string.
This is only marginally better in my opinion than the regex solution proposed in another question I cannot locate at the moment (something along the lines of )
private void fixHAPUnclosedTags(ref HtmlDocument doc, string tagName, bool hasInnerText = false)
{
    HtmlNode tagReplacement = null;
    foreach (var tag in doc.DocumentNode.SelectNodes("//" + tagName))
    {
        tagReplacement = HtmlTextNode.CreateNode("<fix_" + tagName + "></fix_" + tagName + ">");
        foreach (var attr in tag.Attributes)
        {
            tagReplacement.SetAttributeValue(attr.Name, attr.Value);
        }
        if (hasInnerText) // for option tags and other non-empty nodes, the next (text) node will be its inner HTML
        {
            tagReplacement.InnerHtml = tag.InnerHtml + tag.NextSibling.InnerHtml;
            tag.NextSibling.Remove();
        }
        tag.ParentNode.ReplaceChild(tagReplacement, tag);
    }
}
As a note, if I were a betting man I would guess that MikeBridge's answer above inadvertently identifies the source of this bug in the pack - something is causing the closed and empty flags to be mutually exclusive
Additionally, after a bit more digging, I don't appear to be the only one who has taken this approach:
HtmlAgilityPack Drops Option End Tags
Furthermore, in cases where you ONLY need non-empty elements, there is a very simple fix listed in that same question, as well as in the HAP CodePlex discussion: it essentially sets the empty-flag option listed in Mike Bridge's answer above permanently, everywhere.
There is an option to turn on XML output that makes this issue go away.
var htmlDoc = new HtmlDocument();
htmlDoc.OptionOutputAsXml = true;
htmlDoc.LoadHtml(rawHtml);
This seems to be a bug with HtmlAgilityPack. There are many ways to reproduce this, for example:
Debug.WriteLine(HtmlNode.CreateNode("<img id=\"bla\"></img>").OuterHtml);
Outputs malformed HTML. Using the suggested fixes in the other answers does nothing.
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
HtmlNode node = doc.CreateElement("x");
node.InnerHtml = "<img id=\"bla\"></img>";
doc.DocumentNode.AppendChild(node);
Debug.WriteLine(doc.DocumentNode.OuterHtml);
Produces malformed XML / XHTML like <x><img id="bla"></x>
I have created an issue on CodePlex for this.
