The Problem
Given a database of 10,000 items, I would like to do the following:
search by any of the columns
match results on a variable number of letters at the beginning of the value (prefix matching)
print out duplicate results
append the rest of the information for each matching entry to the result
Consider the following table in MS Access (omitting the primary key):
Header1 | Header2 | Header3
apple   | rotten  | green
apple   | fresh   | yellow
pear    | fresh   | blue
orange  | rotten  | pink
Given the following search by Header1: apple, pear
I would receive the result:
apple, rotten, green
apple, fresh, yellow
pear, fresh, blue
Similarly, given the search by Header1: pear, orange, pear
I would receive the result:
pear, fresh, blue
orange, rotten, pink
pear, fresh, blue
What I'm doing
My approach is to store the header you are searching on and an array containing the elements that were searched for. I retrieve the WHOLE table (it's large, so this wouldn't be the preferred method) and order it by the chosen header, and I also sort the input that the user gave me, so both lists are in ascending order.
By using simple comparisons (StrComp returning 0, -1, or 1) I increment a counter variable for the respective list. This, however, does not account for the case where the user inputs a duplicate AND the table has a duplicate result; it only accounts for one or the other of those cases.
My solution to that issue would be to "roll" up and down when we find a result, to check for nearby results as well, but that seems horrible, and it doesn't account for fuzzy string matching either.
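To make the loop concrete, here is the shape of what I have, sketched in Python for brevity (my real code uses StrComp); the run-grouping in the match branch is the "rolling" idea, and it still doesn't address fuzzy matching:

def merge_lookup(sorted_rows, sorted_terms, key=lambda row: row[0]):
    results = []
    i = j = 0
    while i < len(sorted_rows) and j < len(sorted_terms):
        k = key(sorted_rows[i])
        if k < sorted_terms[j]:
            i += 1
        elif k > sorted_terms[j]:
            j += 1
        else:
            # Collect the whole run of table rows sharing this key...
            run_start = i
            while i < len(sorted_rows) and key(sorted_rows[i]) == k:
                i += 1
            # ...and emit it once per duplicate search term.
            while j < len(sorted_terms) and sorted_terms[j] == k:
                results.extend(sorted_rows[run_start:i])
                j += 1
    return results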
Any recommendations? The solution should somehow stay O(n) if possible, given that the user input can (and will) be > 100,000 items.
I suggest you construct a dynamic UNION ALL query, with one SELECT statement for each search term.
UNION ALL returns all rows, including duplicates.
e.g.
SELECT * FROM myTable WHERE Header1 LIKE 'apple*'
UNION ALL
SELECT * FROM myTable WHERE Header1 LIKE 'pear*'
UNION ALL
SELECT * FROM myTable WHERE Header1 LIKE 'apple*'
With indexes on the columns that are searched, this should be reasonably fast.
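If it helps, here is a sketch of generating that statement from the search list (Python purely for illustration; in Access you would build the same string in VBA, and the terms are assumed to be validated/escaped before being spliced into the SQL):

def build_union_query(terms, column="Header1", table="myTable"):
    # One SELECT per search term; UNION ALL keeps duplicate rows.
    selects = [f"SELECT * FROM {table} WHERE {column} LIKE '{term}*'"
               for term in terms]
    return "\nUNION ALL\n".join(selects)

print(build_union_query(["apple", "pear", "apple"]))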
My solution:
First: store each row from the database (comma-delimited) in a dictionary, with the value of the searched header as the key. If the key already exists, simply append the new data to the previous data with a bar delimiter.
Second: loop through the list of inputs and match them (with a simple first-N-characters comparison, if necessary) against the keys in the dictionary. If you find a match, get the value and split it by the delimiters accordingly.
I believe this stays an O(n) solution as long as the first-N-characters comparison is not used.
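In rough Python (the names are illustrative; any language with a hash map works the same way):

# Step 1: key each row by the searched header, bar-delimiting rows that share a key.
index = {}
for header_value, rest in rows:          # rows: (header_value, "rest,of,row") pairs
    if header_value in index:
        index[header_value] += "|" + rest
    else:
        index[header_value] = rest

# Step 2: walk the inputs and split matches back out by both delimiters.
results = []
for term in search_terms:
    if term in index:                    # hash lookup keeps this O(n) overall
        for packed in index[term].split("|"):
            results.append([term] + packed.split(","))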
Let's take the Excel pivot data structure (or concept), where we have a hierarchy on the Rows (x-axis) and on the Cols (y-axis).
Would it be possible (or have any attempts been made) to address a location in the pivot table using XPath? I know there is MDX for a cube, which I'm familiar with (of n-dimensionality, or so it says, though in actuality the display is almost always two-dimensional), but what about using XPath to do the same? For example, to address the Cat (subtotal) row, it seems like the following could be used:
Format: (Rows(XPath), Cols(XPath), Vals(List))
(
Rows: '//Animal[#value="Cat"]',
Cols: '//' (or empty --> means everything)
Vals: '', empty for all values, or a list of the specific values
)
A few more examples:
Row for Dog named Sally
('//Animal[#value="Dog"]/Name[#value="Sally"]',,)
Column for F(emale) dogs
(,'//Gender[#value="F"]',)
Value ("cell") for Booker, Male
('//Animal[#value="Cat"]/Name[#value="Booker"]', '//Gender[#value="M"]', )
Rows for Booker, Tood
('//Animal[#value="Cat"]/Name[#value="Booker" or #value="Tood"]',,)
Would this be a valid way to address a two-dimensional pivot? What might be the challenges, if any, of using this approach? Note that the above pivot table probably isn't the best example, because an animal will be either M or F but not both, so that column is in effect irrelevant; even so, hopefully it's a good enough example to communicate my intent.
I have some indexes (sorted sets) containing key names, sorted with a timestamp as the score. These indexes are for searching purposes; for example, one index apple and one index red, where apple contains all key names referencing an apple and red all keys referencing a red thing.
All of this is sorted by the creation timestamp of the main key, and I want to search on that.
For one field it's not a problem: with pagination, I do a ZRANGE on apple, for example, to get all apples within the pagination range, sorted by date. But my problem is when I want to combine 2 fields.
For example, if I want all red apples, I can do it, sure, but I must either use a ZUNIONSTORE plus a ZRANGE (too slow) or fetch both indexes in full and filter by date myself, and I'm searching for the fastest solution to do that.
Thank you for reading :)
The approach you described, a ZUNIONSTORE followed by a ZRANGE, is the most efficient within Redis core. Alternatively, you could use RediSearch for more robust indexing and searching abilities.
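For illustration, a minimal redis-py sketch of that combination (the index and destination key names are made up; ZINTERSTORE is the analogous call if "red apples" means members present in both indexes):

import redis

r = redis.Redis()

# Combine the two indexes into a temporary sorted set; MIN keeps the
# smaller of the two timestamps as the score.
r.zunionstore("tmp:red:apple", ["red", "apple"], aggregate="MIN")

# Paginate by date exactly as with a single index.
page = r.zrange("tmp:red:apple", 0, 19, withscores=True)

r.expire("tmp:red:apple", 60)  # let the temporary index expire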
Let's say I have 3 tables: A, B, C.
In every table I have some insert queries.
I want to use Find (Ctrl+F) to find every insert query with a certain format.
Example: I want to find code that contains "insert [table_name] value", no matter what the table name is (A, B, or C); that is, I want to search for the code while skipping the word in the middle of it.
I have googled with every keyword I could think of, but I haven't found any solution that comes even close to what I want.
Is it possible to do something like this?
You need to use what are known as "wildcard" characters.
In the find window, you'll notice there is a check box called "Use Pattern Matching".
If you check this, then you can use some special characters to expand your search.
? is a wildcard that indicates any single character can take its place.
* is a wildcard that indicates a string of any length can take its place.
e.g. ca? would match cat, car, cam, etc.
ca* would match cat, car, catastrophe, called... etc.
So something along the lines of insert * value should find what you are interested in.
I'm writing a custom search function, and I have to filter through an association.
I have 2 ActiveRecord-backed models, cards and colors, joined with a has_and_belongs_to_many; colors have an attribute color_name.
As my DB has grown to around 10k cards, my search function has become exceptionally slow, because I have a select block with a query inside it, so essentially I'm making thousands of queries.
I need to convert the Array#select into an ActiveRecord query that yields the same results, and I'm having trouble coming up with a solution. The current (relevant) code is the following:
colors = [['Black'], ['Blue', 'Black']] # parameter retrieved from a form submission
if colors
  cards = colors.flat_map do |col|
    col.inject(Card.includes(:colors)) do |memo, color|
      # `cards` here is the relation built earlier in the search function
      temp = cards.joins(:colors).where(colors: { color_name: color })
      memo + temp.select { |card| card.colors.pluck(:color_name).sort == col.sort }
    end
  end
end
The functionality I'm trying to mimic is that only cards whose colors exactly match the incoming array are selected by the search (comparing the two arrays). Because cards can be mono-red, red-blue, or red-blue-green, etc., I need to be able to search for only red-blue cards or only mono-red cards.
I initially started along this route, but I'm having trouble comparing arrays within an ActiveRecord query:
color_objects = Color.where(color_name: col)
Card.includes(:colors).where('colors = ?', color_objects)
returns the error:
ActiveRecord::StatementInvalid: PG::SyntaxError: ERROR: syntax error at or near "SELECT"
LINE 1: ...id" WHERE "cards"."id" IN (2, 3, 4) AND (colors = SELECT "co...
It looks to me like it's failing because it doesn't want to compare arrays, only table elements. Is this functionality even possible?
One solution might be to convert the HABTM into a has_many :through relation and make join tables which contain keys for every permutation of colors, in order to access those directly.
I need to be able to search for only green-black cards, and not have mono-green or green-black-red cards show up.
I've deleted my previous answer because I did not realize you were looking for an exact match.
I played with it a little, and I can't see any solution without using an aggregate function.
For Postgres that will be array_agg.
You need to generate an SQL query like:
SELECT cards.*, array_to_string(array_agg(colors.color_name ORDER BY colors.color_name), ',') AS color_names
FROM cards
JOIN cards_colors ON cards.id = cards_colors.card_id
JOIN colors ON colors.id = cards_colors.color_id
GROUP BY cards.id
HAVING array_to_string(array_agg(colors.color_name ORDER BY colors.color_name), ',') = 'black,green'
I have never used those aggregators, so perhaps array_to_string is the wrong formatter; in any case, you have to watch for aggregating the colors in alphabetical order. As long as you aren't dealing with too many cards it will be fast enough, but it will scan every card in the table.
If you want to use an index on this query, you should denormalize your data structure: use an array of color_names on the cards record, index that array field, and search on it. You can also keep your normalized structure and define an association callback which puts the color name into the card's color_names array every time a color is assigned to a card.
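For illustration, a minimal sketch of the denormalized search (Python/psycopg2 standing in for the Rails stack here; the color_names array column, its sorted contents, and the connection settings are all assumptions):

import psycopg2

conn = psycopg2.connect("dbname=myapp")  # assumed connection settings
cur = conn.cursor()

# One-time setup: a plain btree index is enough for exact equality on the array.
cur.execute("CREATE INDEX IF NOT EXISTS cards_color_names_idx ON cards (color_names)")
conn.commit()

# Exact match: green-black cards only, not mono-green or green-black-red.
wanted = sorted(["green", "black"])      # store the arrays sorted the same way
cur.execute("SELECT id FROM cards WHERE color_names = %s", (wanted,))
card_ids = [row[0] for row in cur.fetchall()]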
Try this:
colors = Color.where(color_name: col).pluck(:id)
Card.includes(:colors).where('colors.id'=> colors)
I have a big load of documents (text files) that I want to search for relevant content. I've seen a search tool, I can't remember where, that implemented a nice method, as I describe in my requirement below.
My requirement is as follows:
I need an optimised search function: I supply this search function with a list (one or more) of partially-complete (or complete) words separated by spaces.
The function then finds all the documents containing words starting with or equal to the first word, then searches those found documents in the same way using the second word, and so on. At the end, it returns a list containing the actual words found, linked with the documents (name & location) containing them, for the complete list of words.
The documents must contain all the words in the list.
I want to use this function for an as-you-type search, so that I can display and update the results in a tree-like structure in real time.
A possible approach to a solution I came up with is as follows:
I create a database (most likely using MySQL) with three tables: 'Documents', 'Words' and 'Word_Docs'.
'Documents' will have (idDoc, Name, Location) of all documents.
'Words' will have (idWord, Word), and be a list of unique words from all the documents (a specific word appears only once).
'Word_Docs' will have (idWord, idDoc), and be a list of unique id-combinations for each word and the documents it appears in.
The function is then called with the content of an editbox on each keystroke (except space):
the string is tokenized
(here my wheels spin a bit): I am sure a single SQL statement can be constructed to return the required dataset: (actual_words, doc_name, doc_location); (I'm not a hot-number with SQL). Alternatively, a sequence of calls for each token, parsing out the non-repeating idDocs (sketched below)?
this dataset (/list/array) is then returned
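Here is roughly what I mean by the sequence-of-calls alternative (Python, with the standard-library sqlite3 module standing in for MySQL; table and column names as in the schema above):

import sqlite3

def as_you_type(conn, text):
    """Return {idDoc: matched_words} for documents matching every token."""
    doc_words = None
    for token in text.split():
        cur = conn.execute(
            "SELECT Word_Docs.idDoc, Words.Word"
            " FROM Words JOIN Word_Docs ON Word_Docs.idWord = Words.idWord"
            " WHERE Words.Word LIKE ?", (token + "%",))
        hits = {}
        for id_doc, word in cur:
            hits.setdefault(id_doc, set()).add(word)
        # Keep only the documents that have matched every token so far.
        if doc_words is None:
            doc_words = hits
        else:
            doc_words = {d: doc_words[d] | ws
                         for d, ws in hits.items() if d in doc_words}
    return doc_words or {}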
The returned list-content is then displayed:
e.g.: called with: "seq sta cod"
displays:
sequence - start - code - Counting Sequences [file://docs/sample/con_seq.txt]
- stop - code - Counting Sequences [file://docs/sample/con_seq.txt]
sequential - statement - code - SQL intro [file://somewhere/sql_intro.doc]
(and-so-on)
Is this an optimal way of doing it? The function needs to be fast; should it perhaps be called only when a space is hit?
Should it offer word completion? (I've got the words in the database.) At least this would prevent useless calls to the function for words that do not exist.
If word completion: how would that be implemented?
(Maybe SO could also use this type of search-solution for browsing the tags? (In top-right of main page))
What you're talking about is known as an inverted index or posting list, and it operates similarly to what you propose and what Mecki proposes. There's a lot of literature about inverted indexes out there; the Wikipedia article is a good place to start.
Better, rather than trying to build it yourself, use an existing inverted index implementation. Both MySQL and recent versions of PostgreSQL have full-text indexing by default. You may also want to check out Lucene for an independent solution. There are a lot of things to consider in writing a good inverted index, including tokenisation, stemming, multi-word queries, etc., and a prebuilt solution will do all this for you.
The fastest way is certainly not using a database at all, since if you do the search manually with optimized data, you can easily beat the database's SELECT performance. The fastest way, assuming the documents don't change very often, is to build index files and use those for finding the keywords. The index file is created like this:
Find all unique words in the text file. That is, split the text file by spaces into words and add every word to a list, unless it's already on that list.
Take all the words you have found and sort them alphabetically; the fastest way to do this is Three-Way Radix QuickSort. This algorithm is hard to beat in performance when sorting strings.
Write the sorted list to disk, one word per line.
When you now want to search the document file, ignore it completely; instead, load the index file into memory and use binary search to find out whether a word is in the index or not. Binary search is hard to beat when searching large, sorted lists.
Alternatively, you can merge step (1) and step (2) into a single step. If you use InsertionSort (which uses binary search to find the right position to insert a new element into an already sorted list), you not only have a fast algorithm to find out whether the word is already on the list, but, if it is not, you immediately get the correct position to insert it; if you always insert new words like that, you will automatically have a sorted list when you get to step (3).
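A compact sketch of the whole cycle (Python's built-in sort and the bisect module standing in for the radix sort and a hand-rolled binary search):

import bisect

def build_index(text_path, index_path):
    # Steps 1-3: collect unique words, sort them, write one per line.
    with open(text_path) as f:
        words = sorted(set(f.read().split()))
    with open(index_path, "w") as f:
        f.write("\n".join(words))

def word_in_index(index_words, word):
    # Binary search in the sorted, in-memory word list.
    i = bisect.bisect_left(index_words, word)
    return i < len(index_words) and index_words[i] == word

# Load once, then search repeatedly without touching the document itself:
# index_words = open("doc.idx").read().split("\n")
# word_in_index(index_words, "keyword")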
The problem is that you need to update the index whenever a document changes... but wouldn't this be true for the database solution as well? On the other hand, the database solution buys you some advantages: you can use it even if the documents contain so many words that the index files would no longer fit into memory (unlikely, as even a list of all English words will fit into the memory of any average user PC); however, if you need to load the index files of a huge number of documents, memory may become a problem. Okay, you can work around that using clever tricks (e.g. searching directly within files that you mapped to memory using mmap, and so on), but these are the same tricks databases already use to perform speedy look-ups, so why re-invent the wheel? Further, you can prevent locking problems between searching words and updating indexes when a document has changed (that is, if the database can perform the locking for you, or can perform the update or updates as an atomic operation). For a web solution with AJAX calls for list updates, using a database is probably the better solution (my first solution is rather suitable for a locally running application written in a low-level language like C).
If you feel like doing it all in a single SELECT call (which might not be optimal, but when you dynamically update web content with AJAX, it usually proves to be the solution causing the fewest headaches), you need to JOIN all three tables together. My SQL is a bit rusty, but I'll give it a try:
SELECT COUNT(Documents.idDoc) AS NumOfHits, Documents.Name AS Name, Documents.Location AS Location
FROM Documents INNER JOIN Word_Docs ON Word_Docs.idDoc = Documents.idDoc
INNER JOIN Words ON Words.idWord = Word_Docs.idWord
WHERE Words.Word IN ('Word1', 'Word2', 'Word3', ..., 'WordX')
GROUP BY Documents.idDoc HAVING NumOfHits = X
Okay, maybe this is not the fastest SELECT... I guess it can be done faster. Anyway, it finds all matching documents that contain at least one of the words, then groups equal documents together by ID, counts how many were grouped together, and finally shows only results where NumOfHits (the number of words from the IN list that were found) equals the number of words within the IN list (if you search for 10 words, X is 10).
Not sure about the syntax (this is SQL Server syntax), but:
-- N is the number of elements in the list
SELECT idDoc, COUNT(1)
FROM Word_Docs wd INNER JOIN Words w on w.idWord = wd.idWord
WHERE w.Word IN ('word1', ..., 'wordN')
GROUP BY wd.idDoc
HAVING COUNT(1) = N
That is, without using LIKE. With LIKE, things are MUCH more complex.
Google Desktop Search or a similar tool might meet your requirements.