Spark ML convert Map of counts to feature - feature-extraction

I have a Scala Map of seen-counts for specific places, e.g.:
Map(beach -> 31, cafe -> 140, prison -> 2)
How should I convert this kind of data into features for machine learning?
Currently I construct a List[String] of items and use CountVectorizer to convert it to a feature vector; however, I am losing the information of how frequent each particular place is. I would like not to lose this information.
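One way to keep the frequencies is to skip CountVectorizer and build the vector directly from the counts against a fixed vocabulary of places. Below is a minimal Scala sketch, assuming the vocabulary is known up front (here it is just the three places from the example; none of this is from the original post):
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Assumed fixed vocabulary of places; in practice you would collect it from the data.
val vocab = Seq("beach", "cafe", "prison")
val index = vocab.zipWithIndex.toMap

// Each place's count goes into the column assigned to that place.
def countsToVector(counts: Map[String, Int]): Vector = {
  val entries = counts.toSeq
    .flatMap { case (place, n) => index.get(place).map(i => (i, n.toDouble)) }
    .sortBy(_._1)
  Vectors.sparse(vocab.size, entries)
}

countsToVector(Map("beach" -> 31, "cafe" -> 140, "prison" -> 2))
// -> sparse vector of size 3 with values (31.0, 140.0, 2.0)
If the set of places is not fixed in advance, applying HashingTF to a list in which each place is repeated according to its count is another way to preserve the frequencies, at the cost of possible hash collisions.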

Related

Nvidia Digits accuracy and loss plots data

I trained my model in Nvidia Digits 5 and I would now like to extract the accuracy and loss plots that were generated during training for a report. Is this data saved somewhere so that it would be possible to extract it, plot it in Python, and perhaps ultimately modify the plots to compare different models, etc.?
The best solution I have found is to either look at the HTML file or to scan the text file caffe_output.log that is produced by Caffe. The text file is usually stored in /var/digits/jobs/insert_your_job_id/ but on Linux systems you can also just run:
locate caffe_output.log
Go to your DIGITS job folder and locate your job's subfolder. Inside you'll find a file status.pickle, which is a pickled object containing all your job's information.
You can load it in python like so:
import digits
import pickle
data = pickle.load(open('status.pickle','rb'))
This object is somewhat generic and may contain multiple tasks. For a typical classification task it will likely be just one, but you will still need to access it via data.tasks[0]. From there you can grab the plots:
data.tasks[0].combined_graph_data()
which returns a somewhat convoluted dict (unfortunately - since your network can produce many accuracy/loss outputs, as well as even custom ones). It contains everything you need though - I managed to plot accuracy with:
plt.plot( data.tasks[0].combined_graph_data()['columns'][2][1:] )
but it's likely that you'll have to write a bit of custom code. As always, dir() is your friend.
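Putting the pieces above together, a minimal standalone script might look like the following sketch. The job path placeholder and the column index 2 are taken from the snippets above and will likely need adjusting for your own job, and treating the first element of a column as the series name is my assumption:
import pickle
import matplotlib.pyplot as plt
import digits  # must be importable so the pickled job object can be reconstructed

# Adjust the path to your own job's folder.
with open('/var/digits/jobs/insert_your_job_id/status.pickle', 'rb') as f:
    data = pickle.load(f)

graph = data.tasks[0].combined_graph_data()
series = graph['columns'][2]                 # index 2 happened to be accuracy in the snippet above
plt.plot(series[1:], label=str(series[0]))   # assuming the first entry is the series name
plt.xlabel('training progress')
plt.legend()
plt.show()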

Add feature extractor to Stanford NER

From http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/NERFeatureFactory.html, to add a new extractor, the last step is:
Add code to NERFeatureFactory for this feature. First decide which
classes (hidden states) are involved in the feature. If only the
current class, you add the feature extractor to the featuresC code, if
both the current and previous class, then featuresCpC, etc.
Do we only have to add a string to the feature collection, such as featuresCpCnC.add(getWord(c) + "-PNSEQW");, and then Stanford NER will parse the string into a real feature? In that case, how do I specify a specific class/field, e.g. title or author, in the feature string? When I dump the features into a text file (using exportFeatures or printFeatures), I only find features with a generic class, like June-PSEQW|CpC, while I want something like June-DateField-DateField-PSEQW|CpC, which means (class[t-1]==DateField)*(class[t]==DateField)*(word[t-1]=="June").
I believe this is expected behavior -- are there performance issues that indicate that training is not working as expected?
To elaborate: in the most general case a featurizer f(x,y) takes both the input x and the output y, and constructs a feature vector for that particular pair. However, in many NLP applications the features only really depend on the input x, so the featurizer interface we expose is just f(x), and the features are implicitly joined with the output class in the backend (see, e.g., page 10 on "Block Feature Vectors"). In this case, it seems reasonable that we'd only print f(x), and not the full f(x,y).
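To make the "block feature vector" idea concrete (my own toy illustration, not from the Stanford code): if f(x) = (1, 0, 1) and there are three possible classes, the implicit joint vector for class 2 is f(x, y=2) = (0, 0, 0, 1, 0, 1, 0, 0, 0), i.e. f(x) copied into the block reserved for that class and zeros everywhere else. The per-class weights live in the corresponding blocks, which is why printing f(x) alone already tells you everything about the extracted features.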

What data structure is best to represent search filters?

I'm implementing a native iOS application, which provides a feature to search for articles using keywords and several search filters such as author, year, publisher etc. Each of these search filters has several options such as 2014, 2013, and 2012 for the year filter, or Academe Research Journals, Annex Publishers, and Elmer Press for the publisher filter. Each of these options has a BOOL stating whether the option is selected or not. I need an object that wraps the search keywords and search filters so that I can send it to the server, which is responsible for the search operation.
Which data structure should I use to represent these search filters in the wrapper class?
Something like XML comes to mind:
<year>2014</year>
<publisher>Annex Publishers</publisher>
That works, although I find it rather bulky. I'd probably prefer something like:
year=2014|publisher=Annex Publishers
You'll need to escape = and | appearing in the field names or values, but this is easy to do.
This could just be a single string sent across.
In terms of actual data structures, you could have a map of field name to value, only containing entries where the option is selected. Or you could have a class containing pointers / references for each field, set to null if the option is not selected.
Another totally different consideration is to use an enumerated type, i.e. mapping each possible value to an integer, typically resulting in less memory used and faster (and possibly more robust) code, depending on how exactly this is done.
You could map it as follows, for example:
Academe Research Journals 0
Annex Publishers 1
Elmer Press 2
Then, rather than sending "Annex Publishers" as publisher, you could just send 1.
year=2014|publisher=1
The extension for multiple possible values for a field can be done in various ways, but it's fairly easy to do:
<year>2014</year>
<year>2013</year>
<publisher>Annex Publishers</publisher>
Or:
year=2014,2013|publisher=Annex Publishers
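To make the map-based representation and the escaping rule concrete, here is a minimal sketch in Python, purely for illustration; the helper names and the backslash-escaping convention are my own assumptions, not part of any particular API:
def escape(s):
    # Escape the separators so field names and values may themselves contain '=' and '|'.
    return s.replace('\\', '\\\\').replace('=', '\\=').replace('|', '\\|')

def encode_filters(selected):
    # 'selected' maps each field name to the list of options the user ticked;
    # fields with nothing selected are simply absent from the map.
    parts = []
    for field, values in selected.items():
        parts.append(escape(field) + '=' + ','.join(escape(v) for v in values))
    return '|'.join(parts)

encode_filters({'year': ['2014', '2013'], 'publisher': ['Annex Publishers']})
# -> 'year=2014,2013|publisher=Annex Publishers'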

How do I use IOB tags with Stanford NER?

There seem to be a few different settings:
iobtags
iobTags
entitySubclassification (IOB1 or IOB2?)
evaluateIOB
Which setting do I use, and how do I use it correctly?
I tried labelling like this:
1997 B-DATE
volvo B-BRAND
wia64t B-MODEL
highway B-TYPE
tractor I-TYPE
But in the training output, it seemed to treat B-TYPE and I-TYPE as different classes.
I am using the 2013-11-12 release.
How this can be done is currently (2013 releases) a bit of a mess, since there are two different sets of flags for two different DocumentReaderAndWriter implementations. Sorry.
The most flexible support for different IOB styles is found in CoNLLDocumentReaderAndWriter. While reading files, it can map any IOB/IOE/... annotation written with hyphenated prefixes, like your examples (B-BRAND), to any other scheme, via the flag:
-entitySubclassification IOB2
The resulting label set is then used for training and classification. The options are documented in the entitySubclassify() method of CoNLLDocumentReaderAndWriter: IOB1, IOB2, IOE1, IOE2, SBIEO, IO. You can find a discussion of IOB1 vs. IOB2 in Tjong Kim Sang and Veenstra 1999. By default the representation is mapped back to IOB1 on output, since that is the default used in the CoNLL conlleval program, but you can keep it as what you mapped it to with the flag:
-retainEntitySubclassification
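As a quick illustration of the difference (my own example, not from the Stanford documentation), the tokens from your question would be labelled as follows under the two most common schemes:
            IOB1       IOB2
1997        I-DATE     B-DATE
volvo       I-BRAND    B-BRAND
wia64t      I-MODEL    B-MODEL
highway     I-TYPE     B-TYPE
tractor     I-TYPE     I-TYPE
In IOB1 the B- prefix is used only when a chunk immediately follows another chunk of the same type; in IOB2 every chunk starts with B-.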
To use this DocumentReaderAndWriter, you can give a training command like:
java8 -mx6g edu.stanford.nlp.ie.crf.CRFClassifier -prop conll.crf.chris2009.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter -entitySubclassification iob2
Alternatively, ColumnDocumentReaderAndWriter is the default DocumentReaderAndWriter which we use in the distributed models. The options you get with it are different and slightly more limited. You have these two flags:
-mergeTags will take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and map them down to a prefix-less IO label ("BRAND") and use that for training and classifying.
-iobTags can take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and maps them to IOB2.
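To make the effect of the two flags concrete (my own illustration based on the descriptions above), labels like those in the question would come out as:
original           -mergeTags       -iobTags
1997 B-DATE        1997 DATE        1997 B-DATE
highway B-TYPE     highway TYPE     highway B-TYPE
tractor I-TYPE     tractor TYPE     tractor I-TYPE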
In a sequence model, for any of the labeling schemes like IOB2, the labels are different classes. That is how these labeling schemes work. The special interpretation of "I-", "B-", etc. is left to the human observer and entity-level evaluation software. The included evaluation software will work with IOB1, IOB2, or prefixless IO encoding only.

Import data from URL

The St. Louis Federal Reserve Bank has a great set of data available on a variety of their web pages, such as:
http://research.stlouisfed.org/fred2/series/OILPRICE/downloaddata?cid=32217
http://www.federalreserve.gov/releases/h10/summary/default.htm
http://research.stlouisfed.org/fred2/series/DGS20
The data sets get updated, some as often as daily. I tend to be interested in the daily data (see the above settings in the URLs).
I'd like to import these kinds of price or rate data streams (accessible as CSV or Excel files at the above URLs) directly into Mathematica.
I've looked at the documentation on Import[], but I find scant documentation (actually none) on how to go about something like this.
It looks like I need to navigate to the pages, send some data to select specific files and formats, trigger the download, then access the downloaded data from my own machine. Even better if I could access the data directly from the sites.
I had hoped Wolfram Alpha might make this sort of thing easy, but I haven't had any success.
FinancialData[] would seem natural for this sort of thing, but I don't see any way to do it. FinancialData has lots of features, but I don't see a way to get this sort of data.
Does anyone have any experience with this or can someone point me in the right direction?
You can Import directly from a URL. For example, the data from federalreserve.gov can be obtained and visualized as follows.
url = "http://www.federalreserve.gov/datadownload/Output.aspx?";
url = url<>"rel=H10&series=a660e724c705cea4b7bd1d1b85789862&lastObs=&";
url = url<>"from=&to=&filetype=csv&label=include&layout=seriescolumn";
data = Import[url, "CSV"];
DateListPlot[data[[7 ;;]], Joined -> True]
I broke up url for convenience, since it's so long. I had to examine the contents of data before I knew exactly how to plot it - a step that is typically necessary. I'm sure that the data from stlouisfed.org can be obtained in a similar way, but it requires the use of an API key to access it.
As Mark said, you can get the data directly from a URL. Your oil data can be imported from a different URL than you had:
http://research.stlouisfed.org/fred2/data/OILPRICE.txt
With that URL, you can do this:
oil = Import["http://research.stlouisfed.org/fred2/data/OILPRICE.txt",
"Table", "HeaderLines" -> 12, "DateStringFormat" -> {"Year", "Month", "Day"}];
DateListPlot[oil, Joined -> True, PlotRange -> All]
Note that "HeaderLines"->12 option strips off the header text in the first 12 lines (you have to count the header lines to know how many to remove). I've also specified the date format.
To find that URL, do as you did before, but click on a data series and then choose View Data from the menu on the left when you see the chart.
The documentation has a short example on extracting data out of a webpage:
http://reference.wolfram.com/mathematica/howto/CleanUpDataImportedFromAWebsite.html
Of course, what actually needs to be done will vary significantly from page to page.
There is a discussion of how to do this with your API key here:
http://library.wolfram.com/infocenter/MathSource/7583/
The function is based on the API documentation. I haven't looked at the code for a couple of years, and from memory I put it together rather quickly, but I have used it regularly for over two years without problems, for example to pull monthly, non-seasonally-adjusted retail sales from early 1992 to now.
Wolfram Alpha also uses FRED data, so you could use that as an alternative to direct import, but it is trickier to get the query right; I prefer to use FRED directly. Also, from memory, the data is only available on Alpha the day after a release, which is not what you would typically want.
