Apache Mahout Database to Sequence File - hadoop

I am currently trying to play around with mahout. I purchased the book Mahout in Action.
I understand the overall process, and I have already had success with simple test data sets.
Now I have a classification problem that I would like to solve.
The target variable has been identified; for now I will call it x.
The existing data in our database has already been classified with -1, 0 and +1.
We defined several predictor variables which we select with an SQL query.
These are the product's attributes: language, country, category (of the shop), title, description.
Now I want to write them directly into a SequenceFile, so I wrote a little helper class that appends to the sequence file each time a new row of the SQL result set has been processed:
public void appendToFile(String classification, String databaseID, String language,
                         String country, String vertical, String title, String description) {
    Text key = new Text();
    Text value = new Text();
    key.set("/" + classification + "/" + databaseID);
    // ?? value.set(...); -- how do I get language, country, vertical, title and description into the value?
    try {
        this.writer.append(key, value);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
If I only had the title, I could simply store it in the value - but how do I store multiple values like country, language, and so on, for that particular key?
Thanks for any help!

You shouldn't be storing structures in a sequence file; just dump all the text you have, separated by spaces.
The sequence file is simply a place to put all your content for term counting and the like when using something like Naive Bayes - it doesn't care about structure.
Then, once you have a classification, look up the structure in your database.
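For illustration, here is a minimal sketch of the helper above rewritten along those lines. The SequenceFile.Writer setup and the field names come from the question; joining the fields with spaces is just one straightforward choice, not the only option:

import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileAppender {

    private final SequenceFile.Writer writer;

    public SequenceFileAppender(SequenceFile.Writer writer) {
        this.writer = writer;
    }

    // Key: "/<classification>/<databaseID>", value: all predictor text joined by spaces.
    public void appendToFile(String classification, String databaseID, String language,
                             String country, String vertical, String title, String description)
            throws IOException {
        Text key = new Text("/" + classification + "/" + databaseID);
        Text value = new Text(language + " " + country + " " + vertical + " "
                + title + " " + description);
        writer.append(key, value);
    }
}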

Related

OpenCSV : getting the list of header names in the order it appears in csv

I am using Spring Boot + OpenCSV to parse a CSV with 120 columns (sample 1). I upload the file, process each row, and in case of error return a similar CSV (say errorCSV). This errorCSV will have only the errored-out rows, with the 120 original columns plus 3 additional columns describing what went wrong. Sample error file 2
I have used annotation-based processing and the beans are populating fine. But I also need to get the header names in the order they appear in the CSV, which is proving quite challenging, and to capture the exception and the original data during parsing. The two together can later be used to write the error CSV.
CSVReaderHeaderAware headerReader = new CSVReaderHeaderAware(reader);
try {
    header = headerReader.readMap().keySet();
} catch (CsvValidationException e) {
    e.printStackTrace();
}
However, the header order is jumbled and there is no way to get the header index, because CSVReaderHeaderAware internally uses a HashMap. To solve this I built a custom class. It is a replica of CSVReaderHeaderAware 3, except that I use a LinkedHashMap:
public class CSVReaderHeaderOrderAware extends CSVReader {
    private final Map<String, Integer> headerIndex = new LinkedHashMap<>();
}
....
// This code cannot be done with a stream and Collectors.toMap()
// because Map.merge() does not play well with null values. Some
// implementations throw a NullPointerException, others simply remove
// the key from the map.
Map<String, String> resultMap = new LinkedHashMap<>(headerIndex.size() * 2);
It does the job; however, I wanted to check whether this is the best way out, or whether you can think of a better way to get the header names and the failed values back and write them into a CSV.
I referred to the following links but couldn't get much help:
How to read from particular header in opencsv?
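For comparison, a minimal sketch of the simpler alternative I considered, assuming a plain CSVReader is acceptable: readNext() returns the columns in file order, so the header order can be captured without subclassing. The class and method names here are mine, not OpenCSV's:

import com.opencsv.CSVReader;
import com.opencsv.exceptions.CsvValidationException;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedHeaderReader {

    // Reads all rows as maps whose key order matches the CSV header order.
    public List<Map<String, String>> readRowsPreservingHeaderOrder(Reader source)
            throws IOException, CsvValidationException {
        List<Map<String, String>> rows = new ArrayList<>();
        try (CSVReader csv = new CSVReader(source)) {
            String[] header = csv.readNext();   // header columns, in file order
            String[] line;
            while ((line = csv.readNext()) != null) {
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < header.length; i++) {
                    row.put(header[i], i < line.length ? line[i] : null);
                }
                rows.add(row);
            }
        }
        return rows;
    }
}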

Is there a way to speed up the in-memory full text search indexing speed?

To my surprise I have discovered that indexing documents into H2's full text search engine is comparatively slow, and I would like to speed it up.
I'm using the in-memory version of H2, which makes this case especially surprising.
Some benchmarks using 100k small documents (only title and some tags):
Using org.h2.fulltext.FullTextLucene.init it takes ~15s to index.
Using org.h2.fulltext.FullText.init makes no change.
The SQL inserting alone (i.e. full text indexing disabled) only takes 1s.
With Elasticsearch (using bulk indexing) I would expect this amount of data to be processed and searchable within about 3s, even though there the data is also stored on disk.
Some additional info which might help:
Connection is reused.
No stop words are used (but that wouldn't make much difference given the document size).
EDIT_2: I added a big list of stop words (>100). This made it less than 10% faster (~15s down to ~14s).
As noted, the SQL inserting alone (i.e. with full text indexing disabled) only takes 1s, so the problem should be in the full text search indexing.
The official tutorial and page about performance don't seem to offer a solution.
There doesn't seem to be a possibility for bulk indexing like in Elasticsearch.
EDIT_1: I also tried to create the SQL table and inserts FIRST (which take 1s) and AFTER THAT create the full text search index and run FullTextLucene.reindex(). But that makes the process even a bit slower.
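For reference, a rough sketch of what EDIT_1 looks like in code, assuming the table has already been created and populated on the same connection (FullTextLucene.reindex re-walks all indexed tables in one pass):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import org.h2.fulltext.FullTextLucene;

// EDIT_1 variant: bulk-insert first, build the full text index afterwards.
// PNS_VIDEOS is the table name constant used elsewhere in this class.
private void createIndexAfterBulkInsert(final Connection conn) throws SQLException {
    try (Statement statement = conn.createStatement()) {
        statement.execute("CALL FT_CREATE_INDEX('PUBLIC', '" + PNS_VIDEOS + "', NULL)");
    }
    FullTextLucene.reindex(conn); // rebuilds the Lucene index for all indexed tables
}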
If it's of any help, here's the code showing how the index is created and how the inserts are made:
Create index:
private void createTablesAndLuceneIndex() {
    try {
        final Statement statement = this.createStatement();
        statement.execute("CREATE ALIAS IF NOT EXISTS FT_INIT FOR \"org.h2.fulltext.FullTextLucene.init\"");
        statement.execute("CALL FT_INIT()");
        // FullTextLucene.setIgnoreList(this.conn, "to,this"); // Do we need stop words?
        FullTextLucene.setWhitespaceChars(this.conn, " ,.-");
        // Set up SQL table & Lucene index
        statement.execute("CREATE TABLE " + PNS_VIDEOS + "(ID INT PRIMARY KEY, TITLE VARCHAR, TAGS VARCHAR, ACTORS VARCHAR)");
        statement.execute("CALL FT_CREATE_INDEX('PUBLIC', '" + PNS_VIDEOS + "', NULL)");
        // Close statement
        statement.close();
    } catch (final SQLException e) {
        throw new SqlTableCreationException(e); // todo logging?!
    }
}
Index document:
public void index(final PnsVideo pnsVideo) {
    try (PreparedStatement statement = this.conn.prepareStatement("INSERT INTO " + PNS_VIDEOS + " VALUES(?, ?, ?, ?)")) {
        statement.setInt(1, this.autoKey.getAndIncrement());
        statement.setString(2, pnsVideo.getTitle());
        statement.setString(3, Joiner.on(",").join(pnsVideo.getTags()));
        statement.setString(4, Joiner.on(",").join(pnsVideo.getActors()));
        statement.execute();
    } catch (final SQLException e) {
        throw new FTSearchIndexException(e); // todo logging?!
    }
}
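For completeness, a batched variant of the insert I could try (a sketch: it reduces JDBC round-trips, although it may not change the per-row Lucene indexing cost, which seems to be the real bottleneck):

// Batched variant: one PreparedStatement, many rows, a single executeBatch() call.
public void indexAll(final List<PnsVideo> videos) {
    final String sql = "INSERT INTO " + PNS_VIDEOS + " VALUES(?, ?, ?, ?)";
    try (PreparedStatement statement = this.conn.prepareStatement(sql)) {
        for (final PnsVideo pnsVideo : videos) {
            statement.setInt(1, this.autoKey.getAndIncrement());
            statement.setString(2, pnsVideo.getTitle());
            statement.setString(3, Joiner.on(",").join(pnsVideo.getTags()));
            statement.setString(4, Joiner.on(",").join(pnsVideo.getActors()));
            statement.addBatch();
        }
        statement.executeBatch();
    } catch (final SQLException e) {
        throw new FTSearchIndexException(e); // todo logging?!
    }
}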
Thanks for any suggestion!

SingleColumnValueFilter not returning proper number of rows

In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we only want to process rows from a given crawl at any one time. In order to run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.
I wrote a test mapper to simply count the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job and added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or between these two jobs.
Since adding the scan filter caused the visible rows to change so dramatically, we suspect that we simply built the filter incorrectly.
Our MapReduce job features a single mapper:
public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {

    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}
The filter setup is like this:
String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}

// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        CompareOp.EQUAL,
        Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
Are we using the wrong filter, or have we configured it wrong?
EDIT: we're looking at manually adding all the column families (roughly as sketched below) as per https://issues.apache.org/jira/browse/HBASE-2198, but I'm pretty sure the Scan includes all the families by default.
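A rough sketch of that workaround, using the family constants from our HBaseSchema class; in particular, the family of the filtered column must be included:

// Explicitly add every family the mapper reads, including the family of the
// column the filter tests, so the filter always sees that column.
scan.addFamily(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily());
// scan.addFamily(...) for any other families the job needs
scan.setFilter(filter);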
The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter is using Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the platform's native character encoding in HBaseSchema, or when you write the record, if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter compares like for like in the filtered scan:
Bytes.toBytes(crawlIdentifier)
[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
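For illustration, a sketch of the write path with matching encoding; the row key and table variable are placeholders, and the only point is the Bytes.toBytes call for the cell value:

// Write the crawl identifier with Bytes.toBytes(String) (UTF-8) so the stored
// bytes match what the SingleColumnValueFilter compares against.
Put put = new Put(Bytes.toBytes(rowKey));                   // rowKey: placeholder
put.add(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),    // column family
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(), // column qualifier
        Bytes.toBytes(crawlIdentifier));                    // value, UTF-8 encoded
table.put(put);                                             // table: an HTable instance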

What's the name of a table of values/frequencies?

I have a main data store which has a big set of perfectly ordinary records, which might look like (all examples here are pseudocode):
class Person {
string FirstName;
string LastName;
int Height;
// and so on...
}
I have a supplementary data structure I'm using for answering statistical questions efficiently. It's computed from the main data store, and it's a dictionary that looks like:
// { (field_name, field_value) => count }
Dictionary<Tuple<string, object>, int>;
For example, one entry of the dictionary might be:
(LastName, "Smith") => 345
which means in 345 of the Person records, the LastName field is "Smith" (or was, at the time this dictionary was last computed).
What is this supplementary dictionary called? I think it'd be easier to talk about if it had a proper name.
I might call it a "histogram" if I were to print the entire thing graphically (but it's just a data structure, not a visual representation). If I stored the locations of these values (instead of just their counts) I might call it an "inverted index".
I think you have found the most appropriate name already: frequency table or frequency distribution.
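As an illustration (in Java rather than the pseudocode above), such a frequency table can be built in a single pass over the records with Map.merge; the record type and field accessors here are stand-ins for whatever the main data store holds:

import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// A frequency table: (field name, field value) -> number of records with that value.
public class FrequencyTable<T> {

    private final Map<Map.Entry<String, Object>, Integer> counts = new HashMap<>();

    // fieldAccessors maps each field name to a function extracting that field from a record.
    public FrequencyTable(Iterable<T> records, Map<String, Function<T, Object>> fieldAccessors) {
        for (T record : records) {
            for (Map.Entry<String, Function<T, Object>> field : fieldAccessors.entrySet()) {
                Map.Entry<String, Object> key =
                        new AbstractMap.SimpleEntry<>(field.getKey(), field.getValue().apply(record));
                counts.merge(key, 1, Integer::sum);   // count one occurrence of (name, value)
            }
        }
    }

    // e.g. countOf("LastName", "Smith") might return 345
    public int countOf(String fieldName, Object value) {
        return counts.getOrDefault(new AbstractMap.SimpleEntry<>(fieldName, value), 0);
    }
}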

How to use Crystal Reports without a tightly-linked DB connection?

I'm learning to use Crystal Reports (with VB 2005).
Most of what I've seen so far involves slurping data directly from a database, which is fine if that's all you want to display in the report.
My DB has a lot of foreign keys, so the way I've tried to stay sane with presenting actual information in my app is to add extra members to my objects that contain strings (descriptions) of what the foreign keys represent. Like:
Class AssetIdentifier
    Private ID_AssetIdentifier As Integer
    Private AssetID As Integer
    Private IdentifierTypeID As Integer
    Private IdentifierType As String
    Private IdentifierText As String
    ...
Here, IdentifierTypeID is a foreign key, and I look up the value in a different table and place it in IdentifierType. That way I have the text description right in the object and I can carry it around with the other stuff.
So, on to my Crystal Reports question.
Crystal Reports seems to make it straightforward to hook up to records in a particular table (especially with the Experts), but that's all you get.
Ideally, I'd like to make a list of my classes, like
Dim assetIdentifiers as New List(Of AssetIdentifier)
and pass that to a Crystal Report instead of tightly linking it to a particular DB, with most of the work done for me but leaving me to work around the parts it doesn't handle. The closest thing I can see so far is an ADO.NET DataSet, but even that seems far removed. I'm already handling queries myself fine: I have all kinds of functions that return List(Of Whatever) based on queries.
Is there an easy way to do this?
Thanks in advance!
UPDATE: OK, I found something here:
http://msdn.microsoft.com/en-us/library/ms227595(VS.80).aspx
but it only appears to give this capability for web projects or web applications. Am I out of luck if I want to integrate into a standalone application?
Go ahead and create the stock object as described in the link you posted and create the report (StockObjectsReport) as they specify. In this simplified example I simply add a report viewer (crystalReportViewer1) to a form (Form1) and then use the following code in the Form_Load event.
stock s1 = new stock("AWRK", 1200, 28.47);
stock s2 = new stock("CTSO", 800, 128.69);
stock s3 = new stock("LTWR", 1800, 12.95);
ArrayList stockValues = new ArrayList();
stockValues.Add(s1);
stockValues.Add(s2);
stockValues.Add(s3);
ReportDocument StockObjectsReport = new StockObjectsReport();
StockObjectsReport.SetDataSource(stockValues);
crystalReportViewer1.ReportSource = StockObjectsReport;
This should populate your report with the 3 values from the stock object in a Windows Form.
EDIT: Sorry, I just realized that your question was in VB, but my example is in C#. You should get the general idea. :)
I'm loading the report by filename and it is working perfectly:
//........
ReportDocument StockObjectsReport = new ReportDocument();
string reportPath = Server.MapPath("StockObjectsReport.rpt");
StockObjectsReport.Load(reportPath);
StockObjectsReport.SetDataSource(stockValues);
//Export PDF To Disk
string filePath = Server.MapPath("StockObjectsReport.pdf");
StockObjectsReport.ExportToDisk(ExportFormatType.PortableDocFormat, filePath);
@Dusty had it. However, in my case it turned out I had to wrap the object in a list, even though it was a single item, before I could get it to print. See the full code example:
string filePath = null;
string fileName = null;
ReportDocument newDoc = new ReportDocument();

// Set Path to Report File
fileName = "JShippingParcelReport.rpt";
filePath = func.GetReportsDirectory();

// IF FILE EXISTS... THEN
string fileExists = filePath + @"\" + fileName;
if (System.IO.File.Exists(fileExists))
{
    // Must Convert Object to List for some crazy reason?
    // See: https://stackoverflow.com/a/35055093/1819403
    var labelList = new List<ParcelLabelView> { label };

    newDoc.Load(fileExists);
    newDoc.SetDataSource(labelList);

    try
    {
        // Set User Selected Printer Name
        newDoc.PrintOptions.PrinterName = report.Printer;
        newDoc.PrintToPrinter(1, false, 0, 0); // copies, collated, startpage, endpage

        // Save Printing
        report.Printed = true;
        db.Entry(report).State = System.Data.Entity.EntityState.Modified;
        db.SaveChanges();
    }
    catch (Exception e2)
    {
        string err = e2.Message;
    }
}
