faster search for a substring through large document - spring

I have a CSV file of more than 1M records written in English plus another language. I have to make a UI that takes a keyword, searches through the document, and returns the records where that keyword appears. I look for the keyword in two columns only.
Here is how I implemented it:
First, I made a Postgres database for the data stored in the CSV file. Then I made a classic website where the user can enter a keyword. This is the SQL query that I use (in Spring Boot):
SELECT * FROM table WHERE col1 LIKE '%' || :keyword || '%' OR col2 LIKE '%' || :keyword || '%';
Right now it is working perfectly fine, but I was wondering how to make the search faster. Was using SQL better than a classic document search?

If the document is only searched once and then thrown away, loading it into a database is pure overhead. Instead you can search the file directly using java.nio Files.lines with a parallel stream, which uses multiple threads to search the file concurrently:
List<Record> result;
try (Stream<String> lines = Files.lines(Paths.get("some/path"))) {
    result = lines
        .parallel()
        .unordered()
        .map(l -> lineToRecord(l))
        .filter(r -> r.getCol1().contains(keyword) || r.getCol2().contains(keyword))
        .collect(Collectors.toList());
}
NOTE: you need to provide the lineToRecord() method and the Record class yourself.
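For illustration, a minimal Record and lineToRecord() might look like the sketch below. It assumes each line is a simple comma-separated record with the two searched columns first; real CSV data with quoted commas needs a proper CSV parser.

class Record {
    private final String col1;
    private final String col2;

    Record(String col1, String col2) {
        this.col1 = col1;
        this.col2 = col2;
    }

    String getCol1() { return col1; }
    String getCol2() { return col2; }
}

// naive parsing; assumes no quoted commas inside the values
static Record lineToRecord(String line) {
    String[] parts = line.split(",", -1);
    return new Record(parts[0], parts[1]);
}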
If the document is going to be searched over and over again, then you can think about indexing it. This means pre-processing the document to suit the search requirements - in this case, the keywords of col1 and col2. An index is like a map in Java, e.g.:
Map<String, Record> col1Index
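As a rough sketch of how such an index could be built and looked up (hypothetical names, and using a list per keyword since several records can share a word):

// build once
Map<String, List<Record>> col1Index = new HashMap<>();
for (Record r : records) {
    for (String token : r.getCol1().toLowerCase().split("\\s+")) {
        col1Index.computeIfAbsent(token, k -> new ArrayList<>()).add(r);
    }
}

// each lookup is then a map access instead of a scan over all records
List<Record> hits = col1Index.getOrDefault(keyword.toLowerCase(), Collections.emptyList());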
But since you want LIKE semantics, this is not so easy: you cannot simply split the string on whitespace, because the keyword may match a substring. So in this case it might be best to look for a tool to help; typically this would be something like Solr/Lucene.
Databases can also provide similar functionality, e.g.: https://www.postgresql.org/docs/current/pgtrgm.html

For LIKE queries, you should look at the pg_trgm extension with a GIN index using the gin_trgm_ops operator class. You shouldn't need to change the query at all; just build the index on each column, or maybe one multi-column index.
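A sketch of what that could look like, with the column names from the question and a placeholder table name (the extension has to be enabled once per database):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- one trigram GIN index per searched column
CREATE INDEX idx_my_table_col1_trgm ON my_table USING gin (col1 gin_trgm_ops);
CREATE INDEX idx_my_table_col2_trgm ON my_table USING gin (col2 gin_trgm_ops);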

Related

How to perform a Join query on Elastic Search via springboot/java?

I have a Spring Boot application that interacts with Elasticsearch (or, as it is known now, OpenSearch). It can perform basic operations such as search, index etc. I used this as my base (although I replaced the high-level client since it is deprecated), and to perform queries I am mostly using the @Query annotation (as described in section 2.2 here, although I also used QueryBuilders).
Now, I have an interesting use case - I would like to perform 2 queries at the same time. The first query would find a file in Elasticsearch that contains 3 ids. These 3 ids are the ids of other files in the same Elasticsearch index. The second query would look up these 3 files and finally return them to me. Now, I can easily do it in 2 steps:
Have a query to find the file containing the 3 ids and return it.
Have a second query (a multisearch query can do bulk search, as I understand) to search for the 3 files using info from the first query.
However, I need this to happen within a single query - so within the same query I need to search for the file containing the 3 ids and then perform a search for those 3 files.
So currently my files in Elasticsearch look like this:
{
    "docId": "docId57",
    "relatedDocs": [
        {
            "relatedId": "docId1",
            "type": "apple"
        },
        {
            "relatedId": "docId2",
            "type": "orange"
        },
        {
            "relatedId": "docId3",
            "type": "banana"
        }
    ]
}
and my goal is to have a query that accepts docId57 as an arg (so a method findFilesViaJoin(docId57) or something) and returns a list of 3 files: the file for docId1, the file for docId2 and the file for docId3.
I know it is possible either via nested queries, child/parent queries or good old SQL queries (via JPA/Hibernate).
I attempted to use all of these and was unsuccessful for reasons described below.
Child/parent queries
So for child/parent queries, I attempted to use the DSL with @Query but couldn't quite get it, since I don't have solid documentation to refer to (documentation that actually helps with Java, not curl). After some time I found this and this article - I can maybe figure out how to make it work with parent/child, but neither explains how to do the mapping. If this approach can do what I want, my question is: how do I set up and map parent/child in Spring Boot?
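For reference, the kind of parent/child (join field) mapping described in the Spring Data Elasticsearch documentation looks roughly like the sketch below; the relation names and field names here are made up, and the details vary by Spring Data Elasticsearch version:

@Document(indexName = "docs")
public class MyModel {

    @Id
    private String docId;

    // join field declaring one parent type and one child type (names are hypothetical)
    @JoinTypeRelations(relations =
        @JoinTypeRelation(parent = "mainDoc", children = "relatedDoc"))
    private JoinField<String> relation;

    // a parent would be saved with relation = new JoinField<>("mainDoc");
    // a child with relation = new JoinField<>("relatedDoc", parentId);
    // getters and setters omitted
}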
Using SQL queries
So for this one, I need to change my setup to use Hibernate. I used this as my base. It works; the only problem is that my SQL queries get ignored. Instead, the search is done based on the method's name, not the content of @Query. So it is as if I don't use the annotation at all. So, using the structure mentioned above, the following method in my app:
#Query("select t from MyModel t where t.docId = ?1")
findByRelatedDocsRelatedId(String id)
will return files that have a relatedId matching the id passed via the method argument id (as opposed to reading the query from @Query, which tells the method to search all docs based on docId). Now, I don't mind using the method name as a query to search for something, but then what would I use @Query for? (Not to mention how I would create a method name that does a join.) It might be that my Hibernate is set up wrong (I never used it before this week). So the question here is: does anybody have a nice complete example of Hibernate being used with Elasticsearch that does a join query?
Nested queries
For these queries, I assume that I just need to figure out what to put inside the @Query, but due to the limited documentation on how to compose nested queries I didn't manage to get it even remotely working. Any concrete documentation on how to create a DSL nested query would be appreciated.
Any of the ways I described will work for me. I think child/parent seems the best choice (seeing as it was kind of created for this purpose), but any will do.

Good way to exclude records in SOLR or Elasticsearch

For a matchmaking portal, we have one requirement wherein, if a customer has viewed the complete profile details of a bride or groom, we have to exclude that profile from further search results. Currently, along with other details, we store the ids of members who viewed the profile in a comma-separated field against that bride or groom's details.
E.g., if A viewed B, then in B's record, under the field saw_me, we add A (comma separated).
While searching, let's say the currently searching member's id is 123456; then we fire a query like
Select * from profiledetails where (OTHER CON) AND 123456 not in saw_me;
The problem here is that the saw_me field value keeps growing and growing. Is there any better way to handle this requirement? Please guide.
If this is using Solr:
First, DON'T add the 'AND NOT ...' clauses along with the main query in the q param; add them to fq instead (see the sketch below). This has many benefits (the fq will be cached).
Until the list of values reaches maybe the thousands, this approach is simple and should work fine.
Once the list is huge, it may be time to move to a post filter with a high cost (so it is evaluated last). This would look up the docs to remove in an external source (Redis, a DB, ...).
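As a sketch of the first point, with the field and id from the question, the exclusion moves out of q into its own fq (a purely negative fq is fine in Solr):

q=(your other conditions)
fq=-saw_me:123456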
In my opinion, no matter how much the saw_me field grows, it will not make much difference in search time, because tokens are stored in an inverted index and doc_values are created at index time in a column-major fashion for efficient reads, with caching support from the OS. ES handles these things for you efficiently.
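For the Elasticsearch side, the SQL-style query above would translate to something like the bool query below (a rough sketch; only saw_me and the id come from the question, the other conditions are just a placeholder):

{
    "query": {
        "bool": {
            "filter": [
                { "term": { "other_condition_field": "..." } }
            ],
            "must_not": [
                { "term": { "saw_me": "123456" } }
            ]
        }
    }
}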

elasticsearch: extract number from a field

I'm using elasticsearch and kibana for storing my logs.
Now what I want is to extract a number from a field and store it in a new field.
So for instance, having this:
accountExist execution time: 1046 ms
I would like to extract the number (1046) and see it in a new field in Kibana.
Is it possible? How?
Thanks for the help
You'll need to do this before/during indexing.
Within Elasticsearch, you can get what you need during indexing:
1. Define a new analyzer using the Pattern Analyzer to wrap a regular expression (for your purposes, to capture consecutive digits in the string - good answer on this topic).
2. Create your new numeric field in the mapping to hold the extracted times.
3. Use copy_to to copy the log message from the input field to the new numeric field from (2), where the new analyzer will parse it.
The Analyze API can be helpful for testing purposes.
While not performant, if you must avoid reindexing, you could use scripted fields in kibana.
Introduction here: https://www.elastic.co/blog/using-painless-kibana-scripted-fields
enable Painless regex support by putting the following in your elasticsearch.yml:
script.painless.regex.enabled: true
restart elasticsearch
create a new scripted field in Kibana through Management -> Index Patterns -> Scripted Fields
select painless as the language and number as the type
create the actual script, for example:
def logMsg = params['_source']['log_message'];
if (logMsg == null) {
    return -10000;
}
def m = /.*accountExist execution time: ([0-9]+) ms.*$/.matcher(logMsg);
if (m.matches()) {
    return Integer.parseInt(m.group(1));
} else {
    return -10000;
}
you must reload the page completely for the new fields to be executed; simply re-running a search on an open Discover page will not pick up the new fields. (This almost made me quit trying to get this working -.-)
use the script in discover or visualizations
While I do understand that it is not performant to script fields for millions of log entries, my use case is a very specific log entry that is logged about 10 times a day in total, and I only use the resulting fields to create a visualization or in analysis where I reduce the candidates through regular queries in advance.
It would be interesting to know whether it is possible to have those fields calculated only in situations where you need them (or where they make sense and are computable to begin with, i.e. to make the "return -10000" unnecessary). Currently they are applied and show up for every log entry.
You can generate scripted fields inside of queries like this: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-script-fields.html but that seems a bit too buried under the hood to maintain easily :/

Lucene filter with docIds

I'm trying to do the following: I want to create a set of candidates by querying each field separately and then adding the top k matches to this set. After I'm done with that, I need to run another query on this candidate set.
The way I implemented it right now is using a QueryWrapperFilter with a BooleanQuery that matches the unique id field of each candidate document. However, this means I have to call IndexSearcher.doc().get("docId") for each candidate document before I can add it to my BooleanQuery, which is the major bottleneck. I'm only loading the docId field via MapFieldSelector("docId").
I wanted to create my own Filter class, but I can't use the internal Lucene doc ids directly, because they are specified per segment. Any thoughts on how to approach this?
Instead of reading the stored docId, index the field (it probably already is indexed) and use the FieldCache to retrieve docIds much faster. Then, instead of using the docIds in a BooleanQuery, try using a TermsFilter or FieldCacheTermsFilter. The latter's documentation describes the performance trade-offs.

SQLite - how to return rows containing a text field that contains one or more strings?

I need to query a table in an SQLite database to return all the rows in a table that match a given set of words.
To be more precise: I have a database with ~80,000 records in it. One of the fields is a text field with around 100-200 words per record. What I want to be able to do is take a list of 200 single word keywords {"apple", "orange", "pear", ... } and retrieve a set of all the records in the table that contain at least one of the keyword terms in the description column.
The immediately obvious way to do this is with something like this:
SELECT stuff FROM table
WHERE (description LIKE '% apple %') or (description LIKE '% orange %') or ...
If I have 200 terms, I end up with a big and nasty looking SQL statement that seems to me to be clumsy, smacks of bad practice, and not surprisingly takes a long time to process - more than a second per 1000 records.
This answer, Better performance for SQLite Select Statement, seemed close to what I need, and as a result I created an index, but according to http://www.sqlite.org/optoverview.html SQLite doesn't use any optimisations if the LIKE operator is used with a leading % wildcard.
Not being an SQL expert, I am assuming I'm doing this the dumb way. I was wondering if someone with more experience could suggest a more sensible and perhaps more efficient way of doing this?
Alternatively, is there a better approach I could use to the problem?
Using the SQLite full-text search would be faster than a LIKE '%...%' query. I don't think there's any database that can use an index for a query beginning with %, since if the database doesn't know what the query starts with, it can't use the index to look it up.
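A minimal sketch of the full-text route using FTS5 (column names from the question, my_table as a placeholder table name; older SQLite builds would use FTS3/FTS4 with slightly different syntax):

-- build a full-text index over the description column
CREATE VIRTUAL TABLE description_fts USING fts5(description);
INSERT INTO description_fts (rowid, description)
    SELECT rowid, description FROM my_table;

-- one query can OR together all 200 keywords
SELECT stuff FROM my_table
WHERE rowid IN (
    SELECT rowid FROM description_fts
    WHERE description_fts MATCH 'apple OR orange OR pear'
);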
An alternative approach is putting the keywords in a separate table instead, and making an intermediate table that has the information about which row in your main table has which keywords. If you indexed all the relevant columns that way, it could be queried very quickly.
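A sketch of that keyword/junction-table idea (table names are hypothetical; it only works if you can tokenize the description into words when rows are inserted):

CREATE TABLE keyword (
    id   INTEGER PRIMARY KEY,
    word TEXT UNIQUE NOT NULL
);

CREATE TABLE record_keyword (
    record_id  INTEGER NOT NULL,                      -- rowid of the row in the main table
    keyword_id INTEGER NOT NULL REFERENCES keyword(id),
    PRIMARY KEY (record_id, keyword_id)
);

-- all records containing at least one of the search terms
SELECT DISTINCT t.stuff
FROM my_table t
JOIN record_keyword rk ON rk.record_id = t.rowid
JOIN keyword k ON k.id = rk.keyword_id
WHERE k.word IN ('apple', 'orange', 'pear');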
Sounds like you might want to have a look at Full Text Search. It was contributed to SQLite by someone from Google. The description:
allows the user to efficiently query the database for all rows that contain one or more words (hereafter "tokens"), even if the table contains many large documents.
This is the same problem as full-text search, right? In which case, you need some help from the DB to construct indexes into these fields if you want to do this efficiently. A quick search for SQLite full text search yields this page.
The solution you correctly identify as clumsy is probably going to do up to 200 pattern matches per document in the worst case (i.e. when a document doesn't match), where each match has to traverse the entire field. Using the index approach means that your search speed is independent of the size of each document.
