Under the Stanford Sentiment Treebank directory there are several files, which are explained clearly in the README.txt file.
I have my own dataset, and I am planning to train on it using the Stanford Sentiment model (RNTN). But I am wondering how to generate the "STree.txt" file from my dataset.
The data in "STree.txt" has the following form:
40|39|38|37|36|34|33|32|32|31|30|30|29|28|26|25|24|23|22|22|27|23|24|25|26|27|28|29|41|31|35|33|34|35|36|37|38|39|40|41|0
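For reference, my understanding is that each position in the line holds the 1-based index of that node's parent, with the sentence tokens listed first and the root's parent written as 0. Below is a rough sketch of how I'm thinking of generating such a line from an already-binarized parse tree. It uses nltk, which is my own choice rather than part of the SST tooling, and the internal-node numbering here may not match Stanford's exact ordering.

```python
# Rough sketch, not the official SST tooling: emit one STree.txt-style line
# of parent pointers from an already-binarized nltk.Tree whose preterminals
# each wrap a single token (SST style). Tokens are numbered 1..n in sentence
# order, internal nodes follow bottom-up, and the root's parent is 0.
from nltk import Tree

def stree_line(tree: Tree) -> str:
    # token nodes: the preterminals like "(2 very)", one per word, in order
    token_pos = [p[:-1] for p in tree.treepositions("leaves")]
    phrase_pos = [p for p in tree.treepositions()
                  if p not in token_pos and isinstance(tree[p], Tree)]
    phrase_pos.sort(key=len, reverse=True)        # deepest first, root last
    order = token_pos + phrase_pos
    index = {pos: i + 1 for i, pos in enumerate(order)}
    parents = [index[pos[:-1]] if pos != () else 0 for pos in order]
    return "|".join(map(str, parents))

# SST-style binary tree; the numeric labels are placeholder sentiment scores
t = Tree.fromstring("(3 (2 very) (4 (3 good) (2 movie)))")
print(stree_line(t))   # -> 5|4|4|5|0
```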
In the TensorFlow Federated tutorial on federated learning for text generation, they use a preprocessed Shakespeare dataset.
https://www.tensorflow.org/federated/tutorials/federated_learning_for_text_generation
I want to create a federated dataset from scratch using my own text file.
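What I have so far is plain TensorFlow: I split my file into per-client shards and build one tf.data.Dataset per simulated client, since the tutorial's iterative_process.next(state, federated_data) accepts a Python list of client datasets. This is only a sketch; the file name, number of clients, sequence length, and batch size are all placeholders.

```python
# Sketch: build "federated data" (a list of per-client tf.data.Datasets)
# from one local text file. The equal-slice client split is arbitrary.
import tensorflow as tf

SEQ_LEN = 100        # characters per training example (placeholder)
NUM_CLIENTS = 5      # number of simulated clients (placeholder)

raw_text = open("my_corpus.txt", encoding="utf-8").read()
vocab = sorted(set(raw_text))
char2idx = {c: i for i, c in enumerate(vocab)}

def to_client_dataset(text: str) -> tf.data.Dataset:
    ids = tf.constant([char2idx[c] for c in text], dtype=tf.int64)
    ds = tf.data.Dataset.from_tensor_slices(ids).batch(SEQ_LEN + 1,
                                                       drop_remainder=True)
    # split each chunk into (input, target) pairs, as in the char-RNN setup
    ds = ds.map(lambda chunk: (chunk[:-1], chunk[1:]))
    return ds.shuffle(1000).batch(8, drop_remainder=True)

# naive split: give each simulated client an equal slice of the file
shard = len(raw_text) // NUM_CLIENTS
federated_train_data = [
    to_client_dataset(raw_text[i * shard:(i + 1) * shard])
    for i in range(NUM_CLIENTS)
]
# federated_train_data can then be passed to iterative_process.next(state, ...)
```

Is this a reasonable way to do it, or should I be wrapping this in a proper ClientData object instead?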
Summary
I am trying to design an Elasticsearch index (or indices) that will provide a solid foundation for indexing 1,000,000+ files and full-text searching their contents. New files will be continuously added after the initial digitization process.
Use Case
Various file types (PDF, Outlook email, MP3, TXT, JPEG of handwritten notes, etc.) need to be searchable by their contents and metadata.
Users want to manually tag relationships between documents, e.g. Document A -> contains information about -> Document B.
Users want to be able to see related/similar texts.
Users want named entity recognition (NER) on the text contents.
The physical files are already stored on an external computer, waiting to be processed.
Implementation
File content extraction pipeline using Apache Tika
NER using spaCy
Upload file contents + NER tags to Elasticsearch
Eventually we would run our own search models to gain better search insights and support data science.
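Here is roughly what I have in mind for the pipeline, as a minimal Python sketch. It assumes the tika, spacy, and elasticsearch (8.x client) packages plus the en_core_web_sm model; the index name, field names, and example path are placeholders, and very long documents would still need chunking.

```python
# Sketch of the extraction pipeline: Tika for text + metadata, spaCy for NER,
# Elasticsearch for storage. Index and field names are placeholders.
import spacy
from tika import parser
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
nlp = spacy.load("en_core_web_sm")

def index_file(path: str, index: str = "documents") -> None:
    parsed = parser.from_file(path)                  # Tika: content + metadata
    text = parsed.get("content") or ""
    ents = [{"text": e.text, "label": e.label_} for e in nlp(text).ents]
    doc = {
        "path": path,
        "content": text,
        "metadata": parsed.get("metadata", {}),
        "entities": ents,          # NER tags stored alongside the full text
        "related_docs": [],        # to be filled by the manual tagging feature
    }
    es.index(index=index, document=doc)

index_file("/data/archive/example.pdf")   # placeholder path
```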
How do I best store my extracted contents to fit the users' needs and provide a scalable foundation? Is it better to run our trained named entity recognition during the initial indexing, or after the extracted text has been uploaded to Elasticsearch?
Or does it make more sense to use one of the existing solutions below, rather than reinvent the wheel?
https://github.com/dadoonet/fscrawler
https://github.com/deepset-ai/haystack
https://github.com/RD17/ambar
Instead of reinventing the wheel, I'd recommend using an existing solution such as Jina; there is a working example of PDF search built with Jina. You can also search across different modalities (text, image, PDF, etc.) with it.
I am trying to use my own dataset with the Gapminder-style motion chart reproduced by Mike Bostock at https://bost.ocks.org/mike/nations/
He uses a JSON data file from https://bost.ocks.org/mike/nations/nations.json
I have food-trend data in an Excel file, and I'm wondering what the best approach is to convert it into the appropriate JSON format.
How did Mike originally do this? I presume he started from an Excel file?
It depends on the structure of the data in your CSV, but I use online tools like this one: http://www.convertcsv.com/csv-to-json.htm
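If you'd rather do the conversion locally instead of using an online tool, here is a rough pandas sketch. The column names "item", "year", and "value" are assumptions about your spreadsheet (adjust them to your file); it produces a nested structure similar in spirit to nations.json (a name plus a per-metric time series).

```python
# Sketch: Excel -> nested JSON, one record per item with [year, value] pairs.
import json
import pandas as pd

df = pd.read_excel("food_trends.xlsx")     # .xlsx reading needs openpyxl

records = []
for item, rows in df.groupby("item"):      # assumed column names
    records.append({
        "name": item,
        "trend": [[int(y), float(v)]
                  for y, v in zip(rows["year"], rows["value"])],
    })

with open("food_trends.json", "w") as f:
    json.dump(records, f, indent=2)
```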
I am extracting data from the web, converting it into JSON form, and then storing it in HDFS using Apache Flume. Flume creates its own files when storing the data on Hadoop. Now I want to access this data randomly and do text analysis or statistical analysis on it. What would be the optimal way to do this?
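For example, I was considering loading the Flume output with Spark for ad-hoc queries. A sketch of that idea, assuming PySpark is available and the JSON files land under an HDFS directory such as /flume/events/ (a placeholder path):

```python
# Sketch: read the JSON part files Flume rolled out on HDFS into a DataFrame
# and query them with SQL for ad-hoc text/statistical analysis.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flume-json-analysis").getOrCreate()

df = spark.read.json("hdfs:///flume/events/")   # placeholder directory
df.printSchema()

df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
```

Would that be a sensible approach, or is there a better way to get random access to these files?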
I am trying to write a map-reduce implementation of Document Categorizer using OpenNLP.
During the training phase, I am planning to read a large number of files and create a model file as the result of the map-reduce computation (maybe a chain of jobs). Since I will distribute the files across different mappers, this step would create a number of model files. Now I wish to reduce these model files to a single model file to be used for classification.
I understand that this is not the most intuitive of use cases, but I am ready to get my hands dirty and extend/modify the OpenNLP source code, assuming it is possible to tweak the maxent algorithm to work this way.
In case this seems too far-fetched, I would welcome suggestions for doing this by generating document samples corresponding to the input files as the output of the map step, and then reducing them to model files by feeding them to the document categorizer trainer.
Thanks!
I've done this before, and my approach was to not have each reducer produce the model, but rather only produce the properly formatted data.
Rather than using a category as a key (which separates all the categories), just use a single key and make the value the proper format (category, sample, newline). Then, in the single reducer, you can read that data in as a string via a ByteArrayInputStream and train the model. Of course, this is not the only way, and you wouldn't have to modify OpenNLP at all to do this.
Simply put, my recommendation is to use a single job that behaves like this:
Map: read in your data and create (category label, sample) pairs. Use a key called 'ALL' and context.write each pair with that key.
Reduce: use a StringBuilder to concatenate all of the category/sample pairs into the proper training format. Convert the string into a ByteArrayInputStream, feed it to the training API, and write the model somewhere (see the sketch below).
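To make the single-key idea concrete, here is a sketch using Hadoop Streaming with Python rather than the Java MapReduce API (my substitution; the logic is the same). It assumes each input line is "<category><TAB><document text>". The reducer here just writes the concatenated doccat training format to stdout; in the Java version described above you would instead wrap that string in a ByteArrayInputStream and call the training API directly.

```python
#!/usr/bin/env python3
# mapper.py — emit every (category, sample) pair under the single key 'ALL'
# so that everything reaches one reducer.
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        print(f"ALL\t{line}")
```

```python
#!/usr/bin/env python3
# reducer.py — concatenate all pairs into OpenNLP's doccat training format:
# "<category> <text>", one sample per line. The resulting file can be fed to
# OpenNLP's DoccatTrainer, or trained in-process in Java as described above.
import sys

for line in sys.stdin:
    _, _, sample = line.partition("\t")            # drop the 'ALL' key
    category, _, text = sample.strip().partition("\t")
    if category and text:
        print(f"{category} {' '.join(text.split())}")
```

Run these with the hadoop-streaming jar (-input/-output plus -mapper mapper.py -reducer reducer.py), ideally with a single reducer configured.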
A problem may occur if your sample data is too huge to send to one node. If so, you can write the values to a NoSQL DB and read them in from a beefier training node. Or you can use randomization in your mapper to produce many keys and build many models, then at classification time write a wrapper that tests data across them all and gets the best result from each one... Lots of options.
HTH