Are numeric/date comparisons in Spark SQL more performant than string comparisons? - performance

I am trying to optimize a Spark query, and as part of that I wanted to know whether representing data as a date type instead of a string type would be more performant for comparisons and other computation.
Similarly for numeric comparisons.
Can anyone shed light on this? Or if someone has done any performance benchmarking, could they share the results?
Searching the forums, I could not find a definitive answer.
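For reference, here is a minimal PySpark sketch of how such a comparison could be benchmarked. All names (the synthetic columns, the row count, the cutoff date) are assumptions, and timing single runs like this only gives a rough indication, not a definitive answer.

```python
# Rough benchmark sketch (PySpark): filter on the same date stored as a string vs. as a date.
# Column names, row count, and cutoff value are hypothetical.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("date-vs-string-compare").getOrCreate()

# Build a synthetic DataFrame holding the same date as both a date and a string column.
df = (spark.range(10_000_000)
      .withColumn("event_date", F.expr("date_add(DATE'2020-01-01', cast(id % 365 as int))"))
      .withColumn("event_date_str", F.date_format("event_date", "yyyy-MM-dd"))
      .cache())
df.count()  # materialize the cache before timing

def time_filter(predicate):
    """Time one filter + count, which forces execution of the comparison."""
    start = time.time()
    df.filter(predicate).count()
    return time.time() - start

print("string compare:", time_filter(F.col("event_date_str") >= "2020-06-01"))
print("date compare:  ", time_filter(F.col("event_date") >= F.lit("2020-06-01").cast("date")))
```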

Related

Strategies to compare performance of two Elasticsearch queries?

Since actual query runtime varies, it's not always useful to just check the runtime of two queries to determine which is generally faster. What are some ways to generally test whether one query is more efficient than another?
As an example of what I'm after, in MongoDB I can run explain on a query to get the number of documents iterated vs. returned. If the documents iterated is several orders of magnitude higher than what it's actually returning, I know I have an inefficient query. I know that since Elasticsearch indexes data much differently than other dbs, this may not translate well, but I'm wondering if there's some rough equivalent.
I'm looking at the Profile API which looks like a good starting place. Are fields like next_doc and next_doc_count what I'm after? Are there any others I should look for? Thanks!!
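As a starting point, the Profile API is enabled by adding `"profile": true` to the search body; the `next_doc` and `next_doc_count` figures then appear in each query's `breakdown` section. A minimal sketch with the official Python `elasticsearch` client is below; the index name and query are made up, and the exact call style (`body=` vs. keyword arguments) depends on the client version.

```python
# Minimal sketch: run a profiled search and print per-query timing breakdowns.
# Index name and query are hypothetical; requires the official `elasticsearch` client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-index",            # hypothetical index
    body={
        "profile": True,         # ask Elasticsearch to profile this query
        "query": {"match": {"title": "big data"}},
    },
)

for shard in resp["profile"]["shards"]:
    for search in shard["searches"]:
        for q in search["query"]:
            breakdown = q["breakdown"]
            print(q["type"], q["description"])
            print("  next_doc:", breakdown["next_doc"],
                  "next_doc_count:", breakdown["next_doc_count"])
```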

Ruby efficient queries

So far in my new workplace I've been dealing with querying databases and finding the most efficient ways of getting the desired data.
I've found out about using pluck to get just the desired attributes instead of loading the whole result into memory, and other tricks such as using inject (reduce), map, reject and so on, which made my life a whole lot easier.
However, I haven't found any theoretical explanation of why inject/map/reject should be used to gain higher efficiency, only empirical conclusions from my own attempts. For example, why should I use map instead of iterating over an array with each?
Could someone please shed some light?
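The pluck point, at least, is language-agnostic: select only the columns you need instead of hydrating full rows into memory. A rough illustration of that idea in Python with sqlite3 (rather than Rails; the table and column names are made up):

```python
# Illustration of the "pluck" idea outside Rails: fetch only the column you need
# instead of loading whole rows into memory. Table and columns are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
conn.executemany("INSERT INTO users (name, bio) VALUES (?, ?)",
                 [(f"user{i}", "a long bio " * 100) for i in range(1000)])

# Equivalent of User.all.map(&:name): pulls every column, then discards most of it.
names_heavy = [row[1] for row in conn.execute("SELECT * FROM users")]

# Equivalent of User.pluck(:name): only the needed column is fetched.
names_light = [row[0] for row in conn.execute("SELECT name FROM users")]

assert names_heavy == names_light
```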

Data matching Algorithm Approach

I don't really know where to start with this project, and so I'm hoping a broad question can at least point me in the right direction.
I have two data sets right now, each about 5 GB with 2 million observations. They are the assessed and historical data gathered for property listings in a given area over a certain period of time. What I need to do is match properties to one another. A property may appear in the historical data two or three times if it gets sold during the period. In the historical data I have the seller info, the loan info, and the sale info. In the assessor data I have all of the characteristics that describe the property sold. So in order to do any pricing model, I need to match the two.
I have variables that are similar in each, however they are going to differ slightly (misspellings, abbreviations, etc). Does anyone have any recommendations on how to go about this? First off, what program would I want to do this in? I have experience in STATA, R and a little bit of SAS and Matlab, but I'd prefer to use the former two.
I read through this:
Data matching algorithm
Where he uses .NET, and one user suggested a Levenshtein approach (where the distance between strings is calculated), so for fields like Address I could use this and weight the approximate accuracy between the two strings. Then it was suggested to use Soundex for the name of the seller/owner.
But I'm really lost in how to implement any of this, and before I approach anyone in my department I really need to have some sort of idea of what I'm doing!
Any help or advice would be immensely helpful.
Yes, there are several good algorithms for the string matching problem you describe, namely:
Jaro-Winkler,
Smith-Waterman,
Dice-Sørensen,
Soundex,
Damerau-Levenshtein, and
Monge-Elkan,
to name a few.
I recommend A Comparison of String Distance Metrics for Name-Matching Tasks, by W. W. Cohen, P. Ravikumar, and S. Fienberg, for an overview of which metric tends to work best for which task.
SoftTFIDF is reported there to perform the best. It is available as a Java package. There are other implementations of string matching and record linkage algorithms available in:
Java (SecondString),
Python (jellyfish),
C# (FuzzyString), and
Scala (StringMetric)
libraries. A rough example with the Python jellyfish package is sketched below.
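Here is a sketch of how a couple of those metrics could be combined on the fields mentioned above (addresses via a normalized Levenshtein distance, names via Soundex), using the Python jellyfish package. The field names, weights, and thresholds are illustrative assumptions, not a definitive matching rule.

```python
# Rough record-matching sketch with the jellyfish package (pip install jellyfish).
# Field names, weights, and the thresholds are illustrative assumptions.
import jellyfish

def address_similarity(a, b):
    """Normalized Levenshtein similarity in [0, 1]."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    dist = jellyfish.levenshtein_distance(a, b)
    return 1.0 - dist / max(len(a), len(b))

def surname_match(a, b):
    """Crude phonetic match on the last name token via Soundex."""
    return jellyfish.soundex(a.split()[-1]) == jellyfish.soundex(b.split()[-1])

def is_same_property(assessed, historical, threshold=0.8):
    addr_score = address_similarity(assessed["address"], historical["address"])
    owner_ok = surname_match(assessed["owner"], historical["seller"])
    # Weight the address heavily; use the owner/seller name as a tie-breaker.
    return addr_score >= threshold or (addr_score >= 0.6 and owner_ok)

print(is_same_property(
    {"address": "123 N Main St", "owner": "John Smith"},
    {"address": "123 North Main Street", "seller": "Jon Smyth"},
))
```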

Big Data Analytics Choice of Technology

I am asked to assess the possible choices of technology we need to use for the problem described below. The possible options are Hadoop, Hive, and Pig. I do not have much experience with any of those, so if you could point out a good source to read, that would help. I have googled and found tons of references, but it is hard to find a step-by-step explanation or comparison.
Here is the task I need to solve.
Users enter sentences into the system. Sentences are broken into words and stored in a Cassandra column family. Each row key is a single word, and the column names are the timestamps at which that record was entered, with no column values.
I need to be able to query the database and extract N words that are taken from the following breakdown:
a_1% must be the top words from period T1 from now into the past
a_2% must be the top words from period T2 from now into the past
a_3% must be the top words from period T3 from now into the past
...
a_n% must be the top words from period T_n from now into the past
a_1 + a_2 + ... + a_n = 100%
and T1, T2, ..., T_n are arbitrary time intervals.
Any suggestion for a choice of technology I should use for this task would be greatly appreciated. We are using Cassandra and are quite familiar with it. Now we need to decide which analytical tool to put on top of it.
Links or specifics would be quite appreciated.
If you have the data partitioned (by time intervals) in Hive, finding such a 'top words' combination could be achieved with one query in Hive. The HiveQL syntax might also help with additional analytics in the future, especially for people who know SQL. The question is how to integrate Cassandra with Hadoop; I hope someone can say something about it. GL!
EDITED: There is a nice chapter about integrating Cassandra and Hive.
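For what it's worth, here is a rough sketch of what one of those per-period "top words" queries could look like. It is written as Spark SQL (HiveQL-compatible) over a hypothetical Hive table words(word STRING, ts TIMESTAMP) that the Cassandra data has been exported into; the window lengths and limits are placeholders for the T_i and a_i% values.

```python
# Hypothetical sketch: top words for one look-back window, assuming the Cassandra
# data has been exposed to Hive/Spark as a table words(word, ts).
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def top_words(days_back, limit):
    """Return the `limit` most frequent words entered in the last `days_back` days."""
    return spark.sql(f"""
        SELECT word, COUNT(*) AS freq
        FROM words
        WHERE ts >= current_timestamp() - INTERVAL {days_back} DAYS
        GROUP BY word
        ORDER BY freq DESC
        LIMIT {limit}
    """)

# e.g. N = 100 words with a_1 = 50% from T1 = 7 days and a_2 = 50% from T2 = 30 days
t1_words = top_words(7, 50)
t2_words = top_words(30, 50)
t1_words.show()
```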
The term Big Data is not unknown to most tech people, although there is some confusion about it in everyone's mind. Explained from a layman's point of view, it means large volumes of structured as well as unstructured data. A natural question after reading that definition is where this large amount of data comes from. The answer is that we produce data whenever we communicate with friends, make digital transactions, or shop online.
What solutions is Big Data providing that seemed impossible even a few years ago?
We already know that information, photographs, text, voice, and video data form the base of big data, and big data is now involved in many projects aimed at helping mankind.

Text Search Algorithm

I have a table with approximately a million rows, each holding a text of 500-600 words, and I'm searching for a word within these texts. But iterating over the rows and searching within each text is not efficient time-wise. Any ideas?
I would suggest Lucene
http://lucene.apache.org/java/docs/index.html
With this scarce information, I suggest you have a look at inverted indexes. They are easy to build and give fast retrieval in your case, as far as I can tell. They are also very easy to implement in any kind of database environment, in case you cannot switch to a database that already supports them.
If you give more information maybe another solution would also work.
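A toy version of such an inverted index in Python might look like the following: build a map from each word to the set of row ids whose text contains it once, then answer word lookups without scanning every row. The sample rows are made up.

```python
# Toy inverted index: map each word to the set of row ids containing it,
# so a word lookup no longer scans every row's text.
from collections import defaultdict

def build_index(rows):
    """rows: iterable of (row_id, text) pairs."""
    index = defaultdict(set)
    for row_id, text in rows:
        for word in set(text.lower().split()):
            index[word].add(row_id)
    return index

rows = [
    (1, "Lucene is a text search library"),
    (2, "Inverted indexes make text search fast"),
    (3, "A million rows of text is no problem"),
]
index = build_index(rows)

print(index["text"])    # {1, 2, 3}
print(index["search"])  # {1, 2}
```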
