Big data solution for LIKE queries - Elasticsearch

I am looking for suggestions on using a distributed system to process this data. I have data from organization-wide computers (laptops, desktops, tablets, etc.). The sample table contains a record for every file present on each computer in the organization. The idea is to find files whose FileName or FilePath contains certain keywords (3000+), i.e. wildcard pattern matching.
+-------------+----------+----------+----------+----------+
| MachineName | FileName | FilePath | FileType | FileSize |
+-------------+----------+----------+----------+----------+
The current solution runs on a beefy SQL Server but still takes hours to churn through 80 million records, because of wildcard SQL queries such as FILENAME LIKE '%abc%' OR FILEPATH LIKE '%abc%', and the list goes on.
We have thought about full-text indexes in SQL Server, but this activity is performed once a month and the data is then discarded, so investing in populating a full-text index does not seem worth the time and resources.
The requirement is to complete this activity in a much shorter time, hence we are exploring other options.
Should it be Elasticsearch, Solr, or some other cloud-based solution? Please advise on a high-level solution.

For this use case, Elasticsearch is a good choice. It provides everything you would need: because every field is indexed, it is commonly used as a near-real-time full-text search engine.
On the other hand, Solr is a good choice too. From your requirements, I think Elasticsearch offers much more than you would need. Solr is a bit older, which shows in its excellent documentation. It specializes in text search only, which is not a problem in your case. It is scalable and optimized for high traffic, so it should be suitable for your problem.
I think both Elasticsearch and Solr will fulfill what you need; the choice is up to you, whichever appeals to you more :) In my opinion, if you can, it is best to try both of them and then choose.
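To make the Elasticsearch route concrete, here is a minimal sketch using the official Python client (8.x API); the index name files, the field names, and the localhost address are assumptions, and for 80M records you would bulk-index and likely use an ngram or wildcard field type instead of leading-wildcard queries, which are slow:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed single local node

# Index one file record (use the bulk helpers for 80M rows).
es.index(index="files", document={
    "MachineName": "LAPTOP-042",
    "FileName": "abc_report.xlsx",
    "FilePath": "C:/Users/jdoe/Documents/abc_report.xlsx",
    "FileType": "xlsx",
    "FileSize": 52341,
})

# Roughly equivalent to FILENAME LIKE '%abc%' OR FILEPATH LIKE '%abc%'.
resp = es.search(index="files", query={
    "bool": {"should": [
        {"wildcard": {"FileName": {"value": "*abc*", "case_insensitive": True}}},
        {"wildcard": {"FilePath": {"value": "*abc*", "case_insensitive": True}}},
    ]}
})
print(resp["hits"]["total"]["value"])  # number of matching documents

With 3000+ keywords you would batch them into a handful of clauses per request (or index ngrams and use plain term queries) rather than sending one giant boolean query.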

Related

DB candidate as CouchDB/Schema replacement

The idea is to redesign the data structure and/or change the DB.
I have just started reviewing this project and plan to start optimization with this part.
Currently I have CouchDB with about 80 GB of document data, around 30M records.
For most documents, properties like id, group_id, location, and type can be considered generic, but unfortunately they are currently stored under different property names across the set. There is also a lot of deeply nested data.
The structure is hardly defined at all, which is why a NoSQL DB was selected long before the full picture was clear.
Data is calculated and loaded into the DB by a separate job on a powerful cluster, and this isn't done very often. From that perspective I conclude that general write/update performance isn't very important. A size reduction would be great too, but it isn't the top priority. There are only about 1-10 active customers at a time.
What matters most is read performance with various filtering/grouping, etc.
No heavy summary calculations need to be done at read time; those are already done during population.
This is a data analytics tool for displaying comparison and other reports to quality engineers and data analysts, so they can browse, group, and filter the results from the web UI.
Right now, tasks like searching a subset of document properties for text are not feasible for performance reasons.
I have done some initial investigation (e.g. http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf), and Cassandra seems to be a good choice among NoSQL options.
It would also be quite interesting to try porting this data into a recent PostgreSQL.
Any ideas would be highly appreciated :-)
Hello, please check the following article:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL's json (and jsonb!) capabilities let you start schema-less and still have transactions, indexes, grouping, and aggregate functions with very good performance right from the start. And when you are ready (and if needed), you can move to a schema with an internal data migration.
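A minimal sketch of that jsonb approach from Python with psycopg2; the table name, column names, sample document, and connection string are all assumptions:

import psycopg2

conn = psycopg2.connect("dbname=analytics")  # assumed local database
cur = conn.cursor()

# Schema-less storage: a single jsonb column, plus a GIN index for containment queries.
cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body jsonb)")
cur.execute("CREATE INDEX IF NOT EXISTS docs_body_idx ON docs USING gin (body)")

cur.execute("INSERT INTO docs (body) VALUES (%s::jsonb)",
            ['{"group_id": 7, "location": "Kyiv", "type": "compare", "score": 0.93}'])

# Filtering and grouping on document properties; @> can use the GIN index.
cur.execute("""
    SELECT body->>'location' AS location, count(*)
    FROM docs
    WHERE body @> %s::jsonb
    GROUP BY 1
""", ['{"type": "compare"}'])
print(cur.fetchall())
conn.commit()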
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck

Text Search Engine for Mailing List Archive Cataloging and Search

I'm working with a mailing list archive and am tasked with setting up basic search, boolean search, and ultimately some sort of more intelligent tag-based searching.
I see both commercial products and some open-source projects (like Lucene.NET).
Has anyone else done any similar kind of work?
I'm working on a Win2k3 server now, so my immediate thought was to use Classic ASP or ASP.NET. However, if there were another platform that was orders of magnitude better for the purpose, then I'd consider that as well. I'm not going to rule something out just because of that ;)
Since you are setting up mail search, you will need two things: a search engine and a database.
There are many search engines that offer what you need:
Sphinx
Solr (the Lucene and Solr projects have merged)
PostgreSQL (built-in full-text search)
They provide advanced search tools like keyword search, field-restricted search, boolean queries, phrase search, and more. Here is another SO post looking into various text search engines: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Sphinx and Solr are both pretty fast at search. Sphinx can search the full database and also supports partial indexing. Solr uses index-based search and is scalable with almost linear performance.
The second most important choice is the database where you store your mails. The mails will be in some format (schema), like fields in a table; it would be plain crazy not to use any format. This is not file search, right? Some search engines require particular DBs to work: Sphinx works with SQL databases only, while Solr can also be integrated with NoSQL databases.
If you are not worried about scaling issues (thousands of users, GBs of data, real-time performance requirements), you are fine with SQL databases. Otherwise you will have to use a NoSQL database with Solr.
SQL databases (like PostgreSQL) are the simplest to work with, do what you need, and require minimal setup/effort. Connectors will let you send a query (a mail search) from the browser to your database.
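As a concrete illustration of the PostgreSQL built-in search mentioned above, here is a minimal sketch in Python with psycopg2 (assuming PostgreSQL 12+ for the generated column); the table layout, sample query terms, and connection settings are assumptions:

import psycopg2

conn = psycopg2.connect("dbname=mailarchive")  # assumed local database
cur = conn.cursor()

# One row per mail, with a precomputed tsvector column and a GIN index on it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS mails (
        id      serial PRIMARY KEY,
        subject text,
        sender  text,
        body    text,
        tsv     tsvector GENERATED ALWAYS AS
                (to_tsvector('english', coalesce(subject, '') || ' ' || coalesce(body, ''))) STORED
    )
""")
cur.execute("CREATE INDEX IF NOT EXISTS mails_tsv_idx ON mails USING gin (tsv)")

# Boolean query: messages matching 'lucene' AND 'index' but NOT 'sphinx'.
cur.execute("""
    SELECT id, subject
    FROM mails
    WHERE tsv @@ to_tsquery('english', 'lucene & index & !sphinx')
""")
print(cur.fetchall())
conn.commit()

Ranking (ts_rank) and highlighting (ts_headline) build on the same tsvector/tsquery machinery.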
Also, you said you use Win2k3; you would have to switch to a Linux distribution to take full advantage of these search engines, since Win2k3 does not offer performance comparable to Linux distros.
First, you should think about what you need.
What do you want to search in your e-mail archive? Just full-text search of the e-mails' plain data? Then you will not get matches in mails that are base64-encoded, for example. Do you need 'fielded' search, e.g. search only in 'subject', 'from', 'to', 'body', 'attachments'?
How do you want to provide access to the search? Via a web page? On a command line? In some Windows program?
If you haven't yet, you should examine what your data looks like. Maybe 'mbox' format (one file with the mails' plain text concatenated), 'maildir' (a directory with many files, each containing one mail), or something else?
Setting up a search engine means thinking about how the data needs to be prepared:
E-mails can contain different kinds of data. You will have to deal with base64-encoded content, character encodings such as UTF-8, and attachments (see the sketch after this list).
Usenet/newsgroup posts may even be split across multiple e-mail messages.
If you want to search different 'fields' ('subject', 'date', 'body'), they need to be extracted.
Data needs to be prepared by linguistic means. You will need to find out which languages the mails are in (if there are several) and process the data accordingly, e.g. so that a search for 'mouse' also matches 'mice' and perhaps 'rats', or 'cursor' and 'pointing device', depending on the topic of your mailing list.
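A minimal sketch of that extraction and decoding step using Python's standard mailbox and email modules; the archive path is hypothetical, and split Usenet posts and linguistic processing (stemming, language detection) are left out:

import mailbox
from email.header import decode_header, make_header

def plain_text(msg):
    # Return the decoded text/plain body, handling base64/quoted-printable and charsets.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""

# "archive.mbox" is a made-up path; mailbox.Maildir covers the maildir layout.
for msg in mailbox.mbox("archive.mbox"):
    doc = {
        "subject": str(make_header(decode_header(msg.get("Subject", "")))),
        "from": msg.get("From", ""),
        "date": msg.get("Date", ""),
        "body": plain_text(msg),
    }
    print(doc["subject"])  # 'doc' is what you would hand to the indexer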
Also think about:
Will there be updates to the data in future?
Are there deletes (including messages being relabeled later)?
Then compare the products (commercial or open source) you favour on how much of this they already provide and how much you will have to write yourself. Be aware that providing a search experience is more than downloading a search engine and dropping in a ton of data.

Microsoft Access equivalent of explain in MySQL

I'm working on a very large query in an inherited application. It is a large insert query that pulls from 4 tables with well over a million records. I know, I would also rather have this in SQL Server, but there is no infrastructure at this customer to do that :-)
This query has worked for over a year. However, the source tables keep growing, and last week it threw the dreaded 'out of system resources' error. Bummer...!
I think it is possible to optimize this query. In MySQL I would use the EXPLAIN command to see where optimization might be possible. Is there an equivalent of this in Access? I cannot seem to find it...
kind regards,
Paul
Jet ShowPlan is probably closest to what you want. You will have to set a registry key; query plan information then gets dumped to a text file named SHOWPLAN.OUT. You can read the details in this TechRepublic article: Use Microsoft Jet's ShowPlan to write more efficient queries
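For reference, a small sketch of that registry change (assuming the Jet 4.0 engine used by .mdb files; the newer ACE engine and 64-bit Windows use different key paths), done here with Python's winreg module, though regedit works just as well:

import winreg

# Creates HKLM\SOFTWARE\Microsoft\Jet\4.0\Engines\Debug and sets JETSHOWPLAN=ON.
# Run elevated; set the value back to "OFF" when done, since ShowPlan slows every query.
KEY_PATH = r"SOFTWARE\Microsoft\Jet\4.0\Engines\Debug"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "JETSHOWPLAN", 0, winreg.REG_SZ, "ON")

# After restarting Access and running the query, SHOWPLAN.OUT appears in the
# current working directory (typically the folder Access was started from).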
Also try the Performance Analyzer wizard. You can ask it to examine your query alone, or ask it to also examine the tables and other queries used by that query.
If you haven't compacted the database recently, see whether that improves performance. Compacting also updates index statistics which allows the engine to make better decisions for the query plan.

badoo.com user search - how can this be done?

Badoo.com has 56,000,000 user profiles. Profiles can be searched by sex, age, hair color, zodiac sign, education and so on, plus distance from my hometown, online status, and date of registration. So far this seems doable, even if it's quite a query on huge tables (56M members...); it can be cached in a general way.
The interesting part is that they also have an individual "exclude list" (for every profile you look at, you can say that you don't want to meet this person). Plus, your friends don't show up either.
The second interesting part is the OR parts of the query. You can search for someone who is a woman, 25-35, blonde OR brunette, a non-smoker, hetero OR bisexual, Virgo OR Gemini OR Cancer, living within a 50 km radius of Paris, who is not your friend, not on your exclude list, and who is online now. Many ORs, a heavy query, sort options, no way of caching or pre-calculating all of this, yet the search returns 11,298 results in milliseconds.
How do they do such a thing with 56 million records and 250K people using it at the same time? Full-text search indexes? Relational databases? Key-value stores?
Does anyone have an idea about the concept or architecture?
It is most likely built using an inverted-index technology like Lucene or Sphinx. If you are looking to build a solution, my recommendation would be Apache Solr (a search server built on Lucene). It is very popular, has an active OSS community, and is used by sites such as Netflix, CNET, etc.
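To illustrate how such a query might map onto Solr, here is a hedged sketch using Python's requests against a hypothetical profiles core; every field name, the geofilt parameters, and the exclude-list IDs are assumptions:

import requests

params = {
    "q": "*:*",
    "fq": [
        "sex:female",
        "age:[25 TO 35]",
        "hair:(blonde OR brunette)",
        "smoker:false",
        "orientation:(hetero OR bisexual)",
        "zodiac:(virgo OR gemini OR cancer)",
        "{!geofilt sfield=location pt=48.8566,2.3522 d=50}",  # within 50 km of Paris
        "online:true",
        "-id:(123 OR 456 OR 789)",  # hypothetical exclude list + friends for this user
    ],
    "rows": 20,
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/profiles/select", params=params)
print(resp.json()["response"]["numFound"])

Each fq clause is cached independently in Solr's filter cache; the per-user exclude/friend list is the only part that cannot be shared across users, which is why it is sent as its own negative filter per request.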
I'd recommend taking a look at the Badoo dev blog. It's in Russian, but Google Translate helps a lot.
In short, they are using sharded MySQL and memcached. Here is a Badoo evolution list.

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now, an OLAP cube the way I used it is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (i.e. pagename, useragent, IP, etc.) and a bunch of values (i.e. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' AND pagename='Homepage' AND browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that querying these cubes usually meant a full table scan (for various reasons), which put a practical limit on how large (in MiB) you could make them.
I'm currently learning the ins and outs of Hadoop and the likes.
Running the above query as a MapReduce job over a BigTable-like store looks easy enough:
simply make 'hour' the key, filter in the map phase, and reduce by summing the values. A sketch follows.
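A minimal sketch of that map/reduce pair, assuming Hadoop Streaming with tab-separated input rows laid out as date, hour, pagename, browser, hits, bytes (the column layout is an assumption):

# mapper.py - emit "hour<TAB>hits<TAB>bytes" for rows that pass the filters
import sys

for line in sys.stdin:
    date, hour, pagename, browser, hits, nbytes = line.rstrip("\n").split("\t")
    if date == "20090914" and pagename == "Homepage" and browser != "googlebot":
        print(f"{hour}\t{hits}\t{nbytes}")

# reducer.py - sum hits and bytes per hour (input arrives grouped/sorted by key)
import sys

current, sum_hits, sum_bytes = None, 0, 0
for line in sys.stdin:
    hour, hits, nbytes = line.rstrip("\n").split("\t")
    if hour != current and current is not None:
        print(f"{current}\t{sum_hits}\t{sum_bytes}")
        sum_hits, sum_bytes = 0, 0
    current = hour
    sum_hits += int(hits)
    sum_bytes += int(nbytes)
if current is not None:
    print(f"{current}\t{sum_hits}\t{sum_bytes}")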
Can you run a query like the one I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface where the user gets their answer ASAP) instead of in batch mode?
If not, what is the appropriate technology to do something like this in the realm of BigTable/Hadoop/HBase/Hive and the like?
It has even been done before, kind of.
Last.fm's aggregation/summary engine: http://github.com/zohmg/zohmg
A Google search turned up a Google Code project, "mroll", but it doesn't contain anything except contact info (no code, nothing). Still, you might want to reach out to that guy and see what's up: http://code.google.com/p/mroll/
We managed to create low-latency OLAP in HBase by pre-aggregating a SQL query and mapping it onto appropriate HBase qualifiers. For more detail, visit the site below.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
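An illustrative sketch of that pre-aggregation idea (not the blog author's actual code) using Python's happybase client against a hypothetical cube table: the row key encodes the dimensions you filter on, and each hour becomes a column qualifier holding an atomic counter, so the whole GROUP BY hour answer is a single-row read.

import happybase

connection = happybase.Connection("localhost")  # assumed HBase Thrift gateway
table = connection.table("cube")                # hypothetical pre-aggregated table

def record_hit(date, pagename, hour, hits=1, nbytes=0):
    # Ingest side: bump the per-hour counters for this dimension combination.
    row = f"{date}|{pagename}".encode()
    table.counter_inc(row, f"hits:{hour:02d}".encode(), hits)
    table.counter_inc(row, f"bytes:{hour:02d}".encode(), nbytes)

def hourly_totals(date, pagename):
    # Query side: one row read returns every hour's counter cells
    # (8-byte big-endian values; table.counter_get decodes a single one).
    return table.row(f"{date}|{pagename}".encode())

record_hit("20090914", "Homepage", hour=13, hits=1, nbytes=5123)
print(hourly_totals("20090914", "Homepage"))

Filters such as browser != 'googlebot' would need that dimension encoded in the row key (answered with a prefix scan) or a separate pre-aggregate.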
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir made an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery automatically scales out on the back end, which gives interactive response times. There is a good session by Jordan Tigani from Google I/O 2012 that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce, but it is geared towards high-speed table scans like the one you described.
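For illustration, a minimal sketch of that same hourly query with the google-cloud-bigquery Python client; the project, dataset, and table names are assumptions, and credentials are taken from the environment:

from google.cloud import bigquery

client = bigquery.Client()  # assumes a default project and credentials are configured

sql = """
    SELECT hour, SUM(hits) AS total_hits, SUM(bytes) AS total_bytes
    FROM `myproject.analytics.mycube`
    WHERE date = '20090914' AND pagename = 'Homepage' AND browser != 'googlebot'
    GROUP BY hour
    ORDER BY hour
"""

# Each query scans the referenced columns, but results come back interactively.
for row in client.query(sql).result():
    print(row.hour, row.total_hits, row.total_bytes)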
