FAISS and ElasticSearch capabilities for IterableDataset? - huggingface-transformers

There is a nice tutorial here on adding FAISS and ElasticSearch capabilities to HuggingFace Datasets. The required methods (e.g. "add_faiss_index" and "add_elasticsearch_index") are defined only on Dataset and are not available on IterableDataset. Is there a reason for that? And if there is a fundamental bottleneck in applying FAISS/Elastic to IterableDataset, does that mean the FAISS/Elastic features are unavailable for big datasets?

Related

How to train on very small data set?

We are trying to understand the underlying model of Rasa. The forums there still didn't get us an answer on two main questions:
We understand that the Rasa model is a transformer-based architecture. Was it pre-trained on any data set (e.g. Wikipedia)?
Then, if we understand correctly, intent classification is a fine-tuning task on top of that transformer. How come it works with such small training sets?
Appreciate any insights!
Thanks,
Lior
The transformer model is not pre-trained on any dataset. We use quite a shallow stack of transformer layers, which is not as data hungry as the deeper stacks used in large pre-trained language models.
Having said that, there isn't an exact number of data points that will be sufficient for training your assistant as it varies by the domain and your problem. Usually a good estimate is 30-40 examples per intent.

Machine learning Algorithms used by Elastic x-pack plugin

The Elastic X-Pack plugin predicts a dynamic baseline for our data and, based on that, flags anomalies out of the box.
All of this is done behind the scenes. My question is how X-Pack learns from previous data and dynamically changes the baseline. Does it use a specific algorithm?
Is there any documentation for this?
The algorithms used for Elasticsearch's Machine Learning are a mixture of techniques, including clustering, various types of time series decomposition, Bayesian distribution modelling and correlation analysis.
Here are some resources where you can deep dive into how it works:
2018's Elastic{ON} featured this presentation: "The Math Behind Elastic Machine Learning", a recording is available here: https://www.elastic.co/elasticon/conf/2018/sf/the-math-behind-elastic-machine-learning
The C++ code which implements the core analytics for machine learning is available on github: https://github.com/elastic/ml-cpp
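To make the idea of a baseline that adapts to previous data concrete, here is a minimal sketch. It is not Elastic's algorithm (see the resources above for that); it is a simple exponentially weighted mean and variance with a deviation threshold, and the function name and the parameters smoothing, threshold and warmup are all assumptions chosen for illustration.

```python
def flag_anomalies(values, smoothing=0.2, threshold=3.0, warmup=5):
    """Return indices whose value deviates strongly from a moving baseline.

    Illustrative only: an exponentially weighted mean/variance, not the
    modelling used by Elastic's machine learning features.
    """
    mean = float(values[0])
    variance = 0.0
    flagged = []
    for i, value in enumerate(values[1:], start=1):
        deviation = value - mean
        std = variance ** 0.5
        # Only judge points once the baseline has seen a few observations.
        if i >= warmup and std > 0 and abs(deviation) > threshold * std:
            flagged.append(i)
        # Update the baseline with the new observation (it "learns" over time).
        mean += smoothing * deviation
        variance = (1 - smoothing) * (variance + smoothing * deviation ** 2)
    return flagged


series = [10, 11, 10, 9, 10, 11, 10, 50, 10, 11]
anomalies = flag_anomalies(series)
print(anomalies)
```

Note that this sketch also feeds the anomalous point into the baseline, which temporarily inflates the variance; a more careful model would down-weight such points.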
I found some good answers on this website, which belongs to Prelert, the engine Elastic uses for anomaly detection.

Attribute selection in h2o

I am a complete beginner in h2o and I want to know whether the h2o framework has any attribute selection capabilities that can be applied to H2OFrames.
No, there are currently no feature selection functions in H2O. My advice would be to use Lasso regression (in H2O this means GLM with alpha = 1.0) to do the feature selection, or simply let whatever machine learning algorithm you are planning to use (e.g. GBM) see all the features. Such algorithms tend to ignore the bad ones, but having bad features in the training data can still degrade performance.
If you'd like, you can make a feature request by filling out a ticket on the H2O-3 JIRA. This seems like a nice feature to have.
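To show why Lasso works as a feature selector, here is a small pure-Python sketch of L1-penalised least squares solved by coordinate descent. It is not H2O's GLM implementation; the function names, the toy data and the penalty value are assumptions made for illustration. The point is that the L1 penalty drives weak coefficients to exactly zero, so the surviving features are the "selected" ones.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within +/- threshold becomes 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, penalty, sweeps=100):
    """Minimise (1/2n) * sum((y - x.w)^2) + penalty * sum(|w|)."""
    n, p = len(rows), len(rows[0])
    weights = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for x, y in zip(rows, targets):
                others = sum(weights[k] * x[k] for k in range(p) if k != j)
                rho += x[j] * (y - others)
                norm += x[j] * x[j]
            weights[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return weights


# Toy data: the target depends strongly on feature 0 and weakly on feature 1.
rows = [(1, 1), (-1, 1), (1, -1), (-1, -1)]
targets = [3 * a + 0.2 * b for a, b in rows]

weights = lasso_coordinate_descent(rows, targets, penalty=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(weights, selected)
```

In H2O the same idea is what you get from the GLM with alpha = 1.0: inspect the fitted coefficients and keep the features whose coefficients are non-zero.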
In my opinion, yes.
My approach is to use AutoML to train on your data.
After training, you get a lot of models.
Use the h2o.get_model method or the H2O server page to inspect the models you like.
There you can get the VARIABLE IMPORTANCES frame.
Then pick your features.

Algolia vs Solr search

I'm building a product search platform. I used the Solr search engine before, and I found its performance fine, but it doesn't provide a user interface. Recently I found that Algolia has more features, is easy to set up, and provides a user interface.
So if someone used Algolia before:
Is Algolia performance better than Solr?
Is there any difference between Algolia and Websolr?
I'm using Algolia and Solr in production for an e-commerce website.
You're right about what you say on Algolia. It's fast (really) and has a lot of powerful features.
You have a complete dashboard to manage your search engine.
Solr is OK, but it's also a black box. You can fine-tune your search engine, but it performed poorly for semantic searches (I tested it).
Which one to choose depends on a lot of things.
With Algolia, there are no servers to manage, and configuration and integration are easy. It's fast with 20 million records for me (less than 15 ms per search).
With Solr, you can customise a little bit more, but it's a lot of work. If I had to make a choice, it would be more between Algolia and ElasticSearch. Solr is losing velocity; it's hard to imagine it growing again in the next few years.
To summarize, if you want to be fast and efficient, choose Algolia. If you want to dive deep into a search engine architecture and you have a lot of time (count it in months), you can try ElasticSearch.
I hope my answer was helpful. Ask me if you have more questions.
Speed is a critical part of keeping users happy. Algolia is aggressively designed to reduce latency. In a benchmarking test, Algolia returned results up to 200x faster than Elasticsearch.
Out-of-the-box, Algolia provides prefix matching for as-you-type search, typo-tolerance with intelligent result highlighting, and a flexible, powerful ranking formula. The ranking formula makes it easy to combine textual relevance with business data like prices and popularity metrics. With Lucene-based search tools like Solr and Elasticsearch, the ranking formula must be designed and built from scratch, which can be very difficult for teams without deep search experience to get right.
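As a rough illustration of what prefix matching plus a ranking that combines textual relevance with business data can look like, here is a toy sketch. It is not Algolia's ranking formula or API; the search function, the "popularity" field and the tie-breaking order (exact word match first, then popularity) are assumptions made for this example.

```python
def search(products, query):
    """Return product names whose words start with the query, best first."""
    q = query.lower()
    matches = []
    for product in products:
        words = product["name"].lower().split()
        if any(word.startswith(q) for word in words):
            exact = any(word == q for word in words)
            # Sort key: exact matches first, then higher popularity first.
            matches.append((not exact, -product["popularity"], product["name"]))
    return [name for _, _, name in sorted(matches)]


products = [
    {"name": "Red Shoes", "popularity": 10},
    {"name": "Shoe Rack", "popularity": 50},
    {"name": "Blue Shoes", "popularity": 30},
    {"name": "Hat", "popularity": 99},
]

results = search(products, "shoe")
print(results)  # ['Shoe Rack', 'Blue Shoes', 'Red Shoes']
```

With a Lucene-based engine, the equivalent of this sort key is something you would have to design and tune yourself.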
Algolia’s highly-optimized infrastructure is distributed across the world in 15 regions and 47 datacenters. Algolia provides a 99.99% reliability guarantee and can deliver fast search to users wherever in the world they’re connecting from. Elasticsearch and Solr do not automatically distribute to multiple regions, and doing so can require significant server costs and DevOps resources.

Build an inverted index in distributed environment

What tools/libs/platforms would you use if you had to build a distributed inverted index from scratch? Elasticsearch only partially does what I need (I need partial TF with date constraints), so I am thinking about building an inverted index using HBase, but I wonder if there are saner choices (not everything will fit into memory, and I will initially look into caching).
Your requirements still sound pretty vague to me, so some additional detail would be helpful in providing a better answer.
Solr Cloud may be a good option if you need support for faceting and fuzzy term matching. Solr Cloud is simply the distributed configuration of Solr. It's a bit more tedious to set up than Elasticsearch but still a very powerful and popular tool.
If you're not already using HBase I'm not sure I'd recommend introducing it just for the sole purpose of creating an index.
I could probably give you a better answer if I understood your use case and current environment better.
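To pin down what "term frequency with date constraints" might mean in practice, here is a single-process sketch of an inverted index whose postings carry a document date. It says nothing about distribution, HBase or Solr; the class name, the posting layout (date, doc id, term frequency) and the whitespace tokenizer are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from datetime import date


class InvertedIndex:
    """Toy in-memory index: term -> list of (doc_date, doc_id, term_frequency)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, doc_date, text):
        # Naive tokenizer: lowercase and split on whitespace.
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term].append((doc_date, doc_id, tf))

    def term_frequency(self, term, start, end):
        """Total occurrences of term in documents dated within [start, end]."""
        return sum(
            tf
            for doc_date, _, tf in self.postings.get(term.lower(), [])
            if start <= doc_date <= end
        )


index = InvertedIndex()
index.add(1, date(2020, 1, 5), "hbase index hbase")
index.add(2, date(2020, 2, 1), "solr index")
index.add(3, date(2020, 3, 1), "hbase cache")

january = index.term_frequency("hbase", date(2020, 1, 1), date(2020, 1, 31))
whole_year = index.term_frequency("hbase", date(2020, 1, 1), date(2020, 12, 31))
print(january, whole_year)  # 2 3
```

If this matches the kind of query you need, it would help to say so in the question, since it determines whether postings need to be sorted or partitioned by date.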

Resources