Elasticsearch - best way to do multiple updates to an index? - elasticsearch

I'm integrating with an external system.
From it I get 3 files:
customer_data.csv
address_data.csv
additional_customer_data.csv
The order of records in each of them can be random.
There are these relations:
one to many (customer_data => addresses), but I am interested only in one address of a specified kind
one to one (customer_data => additional_customer_data)
Goal:
Merge the files together and put the result in one index in Elasticsearch.
Additional info:
-each file has circa 1 million records
-this operation will be done each night
-data is used only for search purposes
Options:
a) I thought about:
Parse the first file and add it to ES.
Do the same with the next file, updating the documents created in point one.
Looks very inefficient.
b) another way:
Parse the first file and add it to a relational database.
Do the same with the other files, updating the records from point one.
Propagate the data to ES.
Can you see any other options?

I assume you have a normalized relational data structure with 1-to-n relationships in those CSV files, like this:
customer_data.csv
Id;Name;AddressId;AdditionalCustomerDataId;...
0;Mike;2;1;...
address_data.csv
Id;Street;City;...
....
2;Abbey Road;London;...
additional_customer_data.csv
Id;someData;...
...
1;data;...
In that case, I would denormalize those in a preprocessing step into one single CSV and use that to upload them to ES. To avoid downtime, you can then use index aliases.
Preprocessing can be done in any language, but converting the CSVs into a SQLite table will probably be the fastest option.
I wouldn't choose a strategy that creates just half of the document and adds the additional information later, as you would probably need to reindex afterwards.
However, maybe you can tell us more about the requirements and the external system, because this doesn't seem to be a great setup.
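The denormalization step described above can be sketched in Python with the standard-library sqlite3 module. This is a minimal sketch: the column names, delimiter, and the "customers_v2" index name are assumptions based on the sample data in the question, not the actual file layout.

```python
import csv
import io
import sqlite3

# Hypothetical minimal versions of the three files (column names assumed).
customers = "Id;Name;AddressId;AdditionalCustomerDataId\n0;Mike;2;1\n"
addresses = "Id;Street;City\n2;Abbey Road;London\n"
additional = "Id;someData\n1;data\n"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id, name, address_id, additional_id)")
con.execute("CREATE TABLE address (id, street, city)")
con.execute("CREATE TABLE additional (id, some_data)")

# Load each CSV; row order inside the files does not matter for the join.
for table, text, n in [("customer", customers, 4),
                       ("address", addresses, 3),
                       ("additional", additional, 2)]:
    rows = list(csv.reader(io.StringIO(text), delimiter=";"))[1:]
    con.executemany(f"INSERT INTO {table} VALUES ({','.join('?' * n)})", rows)

# One LEFT JOIN produces the denormalized documents in a single pass.
docs = [
    {"id": cid, "name": name, "street": street, "city": city, "some_data": extra}
    for cid, name, street, city, extra in con.execute(
        """SELECT c.id, c.name, a.street, a.city, x.some_data
           FROM customer c
           LEFT JOIN address a ON a.id = c.address_id
           LEFT JOIN additional x ON x.id = c.additional_id"""
    )
]

# Each doc could then go to ES via the bulk helper, e.g. (not run here):
# helpers.bulk(es, ({"_index": "customers_v2", "_id": d["id"], "_source": d}
#                   for d in docs))
```

Indexing into a fresh index such as "customers_v2" each night and then switching the alias gives you the zero-downtime swap mentioned above.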

Related

Is it OK to have multiple merge steps in an Excel Power query?

I have data from multiple sources - a combination of Excel (table and non-table), CSV and, sometimes, even TSV.
I create queries for each data source and then bring them together one step at a time - or, actually, two steps: merge, and then expand to bring in the fields I want from each data source.
This doesn't feel very efficient, and I think maybe I should just be joining everything together in the Data Model. The problem when I did that was that I couldn't find a way to write a single query to access all the different fields spread across the different data sources.
If it were Access, I'd have no trouble creating a single query once I'd created all the relationships between my tables.
I feel as though I'm missing something: how can I build a single query out of the data model?
Hoping my question is clear. It feels like something that should be easy to do, but I can't home in on it with a Google search.
It is never a good idea to push the heavy lifting downstream in Power Query. If you can, work with database views, not full tables, use a modular approach (several smaller queries that you then connect in the data model), filter early, remove unneeded columns etc.
The more work that has to be performed on data you don't really need, the slower the query will be. Please take a look at this article and this one, the latter one having a comprehensive list for Best Practices (you can also just do a search for that term, there are plenty).
In terms of creating a query from the data model, conceptually that makes little sense, as you could conceivably create circular references galore.

Is there any way to handle source flat file with dynamic structure in informatica power center?

I have to load a flat file using Informatica PowerCenter whose structure is not static: the number of columns will change in future runs.
Here is the source file:
In the sample file I have 4 columns right now, but in the future I may get only 3 columns, or I may get a set of new columns as well. I can't go and change the code every time in production; I have to use the same code and handle this situation.
The expected result set is:
Is there any way to handle this scenario? PL/SQL and Unix would also work here.
I can see two ways to do it. The only requirement is that the source should decide on a future structure and stick to it. If tomorrow someone decides to change the structure, data types, or lengths again, the mapping will not work properly.
Solutions -
Create extra columns in the source towards the end. If you have 5 columns now, extra columns after the 5th column will be pulled in as blank. Create as many as you want but, please note, you need to transform them as per the future structure and load them into the proper place in the target.
The second option is similar, but in this case you read each line as a single column in the source and source qualifier, as a large string of length 40000.
Then split the columns by delimiter in an Informatica Expression transformation. The splitting can be done by following the thread below. This can also be tricky if you have hundreds of columns.
Split Flat File String into multiple columns in Informatica
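The "pad to a fixed structure" idea behind both solutions can be sketched outside Informatica as well. Here is a minimal Python sketch; the delimiter and the maximum column count are assumptions you would set to match the agreed future structure:

```python
def split_to_columns(line, delimiter=",", n_columns=10):
    """Split one flat-file line into a fixed number of columns.

    Missing trailing columns come back as empty strings, and unexpected
    extra columns are dropped, so the downstream mapping always sees the
    same structure. n_columns is the assumed agreed maximum.
    """
    parts = line.rstrip("\n").split(delimiter)
    return (parts + [""] * n_columns)[:n_columns]


# A 3-column line and a 6-column line both come out with 5 fields:
short_row = split_to_columns("a,b,c", n_columns=5)
long_row = split_to_columns("a,b,c,d,e,f", n_columns=5)
```

This is the same contract the answer describes: the mapping stays stable as long as new columns are only ever appended at the end.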

Load data from S3 to sort and allow timeline analysis

I'm currently trying to find out the best architecture approach for my use case:
I have two totally separate S3 buckets which contain data stored in JSON format. Data is partitioned by year/month/day prefixes, and inside a particular day I can find e.g. hundreds of files for that date
(example: s3://mybucket/2018/12/31/file1,
s3://mybucket/2018/12/31/file2, s3://mybucket/2018/12/31/file..n)
Unfortunately, within the prefix for a single day, the JSONs in those tens or hundreds of files are not ordered by exact timestamp - so if we follow this example:
s3://mybucket/2018/12/31/
I can find:
file1 - which contains JSON about object "A" with timestamp "2018-12-31 18:00"
file100 - which contains JSON about object "A" with timestamp "2018-12-31 04:00"
Even worse, I have the same scenario with my second bucket.
What I want to do with this data?
Gather my events from both buckets, grouped by the "ID" of the object and sorted by timestamp, to visualize them in a timeline as the last step (which tools, and how, is out of scope).
My doubts are more how to do it:
In cost efficient way
Cloud native (in AWS)
With smallest possible maintenance
What I was thinking of:
Not sure if this is right... but loading every new file which arrives on S3 into DynamoDB (using a Lambda trigger). AFAIK, creating the table the proper way - my ID as hash key and timestamp as range key - should work for me, correct?
As every new row inserted will be partitioned by a particular ID and already ordered in the correct manner - but I'm not an expert.
Use Logstash to load data from S3 into Elasticsearch - again, AFAIK everything in ES can be indexed, so also sorted. Timelion will probably allow me to do the fancy analysis I need. But again... not sure if ES will perform as I want... price... the volume is big, etc.
??? No other ideas
To help you understand my need and show a bit of the data structure, I prepared this: :)
example of workflow
Volume of data?
Around ±200,000 events - each event is a JSON with 4 features (ID, Event_type, Timestamp, Price)
To summarize:
I need to put the data somewhere effectively, minimizing cost, sorted, so that the front end at the next step can present how events change over time - filtered by a particular "ID".
Thanks, and I appreciate any good advice, best practices, or solutions I can rely on! :)
#John Rotenstein - you are right, I absolutely forgot to add those details. Basically I don't need any SQL functionality, as the data will not be updated. The only scenario is that a new event for a particular ID will arrive, so only new incremental data. Based on that, the only operation I will do on this dataset is "select". That's why I would prefer speed and instant answers. People will look at this mostly per "ID" - so using filtering. Data arrives on S3 every 15 minutes (new files).
#Athar Khan - thanks for the good suggestion!
As far as I understand this, I would choose the second option of Elasticsearch, with Logstash loading the data from S3, and Kibana as the tool to investigate, search, sort and visualise.
Having a Lambda push data from S3 to DynamoDB would probably work, but might be less efficient and cost more, as you are running a compute process on each event while pushing to Dynamo in small/single-item batches. Logstash, on the other hand, would read the files one by one and process them all. It also depends on how often you plan to load fresh data to S3, but both solutions should fit.
The fact that the timestamps are not ordered within the files is not an issue in Elasticsearch: you can index them in any order, and you will still be able to visualize and search them in Kibana in time-sorted order.
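To illustrate why ingestion order doesn't matter: the sort happens at query time. A minimal Python sketch, where the "events" index name and the field names (taken from the question's ID/Event_type/Timestamp/Price features) are assumptions:

```python
# A hypothetical ES search body: filter one object ID, sort by timestamp.
# Kibana/Timelion issue essentially the same kind of request under the hood.
query = {
    "query": {"term": {"ID": "A"}},
    "sort": [{"Timestamp": {"order": "asc"}}],
}

# The same logic in plain Python, using the out-of-order events from the
# question ("file1" holds 18:00, "file100" holds 04:00):
events = [
    {"ID": "A", "Timestamp": "2018-12-31 18:00", "Event_type": "x", "Price": 10},
    {"ID": "A", "Timestamp": "2018-12-31 04:00", "Event_type": "y", "Price": 12},
    {"ID": "B", "Timestamp": "2018-12-31 09:00", "Event_type": "z", "Price": 7},
]
# "YYYY-MM-DD HH:MM" strings sort lexicographically == chronologically.
timeline = sorted((e for e in events if e["ID"] == "A"),
                  key=lambda e: e["Timestamp"])
```

Whichever file an event arrived in, the timeline for a given ID comes back in timestamp order.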

HBASE versus HIVE: What is more suitable for data that is uniquely defined by multiple fields?

We are building a DB infrastructure on top of Hadoop systems. We will be paying a vendor to do that, and I do not think we are getting the right answers from the first vendor. So I need help from some experts to validate whether I am right or am missing something.
1. We have about 1600 fields in the data. A unique record is identified by those 1600 fields.
We want to be able to search records in a particular timeframe
(aka, records for a given time frame)
There are some fields that change over time (monthly)
The vendor stated that the best way to go is HBase and that there are two choices: (1) make the search optimized for machine learning, (2) make ad-hoc queries.
Option (1) will require a concatenated key with all the fields of interest. The key length will determine how slow or fast the search will run.
I do not think this is correct.
1. We do not need to use HBASE. We can use HIVE
2. We do not need to concatenate field names. We can translate those to a number and have a key as a number
3. I do not think we need to choose one or the other.
Could you let me know what you think about that?
It all depends on what your use case is. In simpler terms, Hive alone is not good when it comes to interactive queries; however, it is one of the best when it comes to analytics.
HBase, on the other hand, is really good for interactive queries; however, doing analytics would not be as easy as with Hive.
We have about 1600 fields in the data. A unique record is identified by those 1600 fields
HBase
HBase is a NoSQL, columnar database which stores information in a map (dictionary) like format, where each row needs one column which uniquely identifies the row. This is called the key.
You can have the key be a combination of multiple columns as well, if you don't have a single column which can uniquely identify the row. You can then search for records using a partial key. However, this is going to affect performance (compared to having a single-column key).
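A composite key of that kind is usually just the identifying fields joined with a separator, with field order chosen to match the most common query. A minimal sketch (the field values and the "|" separator are hypothetical):

```python
def make_row_key(*fields, sep="|"):
    """Build a composite HBase-style row key from the identifying fields.

    Field order matters: a partial-key (prefix) scan can only filter on
    the leading fields, and longer keys cost more storage and comparison
    time - which is the vendor's point about key length.
    """
    return sep.join(str(f) for f in fields).encode("utf-8")


key = make_row_key("2023-01", "customer-42", "region-eu")

# A partial-key scan would use just the leading fields as the prefix:
prefix = make_row_key("2023-01", "customer-42")
```

Every key starting with the prefix (like the full key above) is matched by such a scan, but a query on only the trailing field ("region-eu") cannot use the key at all.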
Hive:
Hive has a SQL-like language (HQL) to query HDFS, which you can use for analytics. However, it doesn't require any primary key, so you can insert duplicate records if required.
The vendor stated that the best way to go is HBase and that there are two choices: (1) make the search optimized for machine learning, (2) make ad-hoc queries. Option (1) will require a concatenated key with all the fields of interest. The key length will determine how slow or fast the search will run.
In a way your vendor is correct, as I explained earlier.
We do not need to use HBASE. We can use HIVE 2. We do not need to concatenate field names. We can translate those to a number and have a key as a number 3. I do not think we need to choose one or the other.
Whether you use HBase or Hive depends on your use case. However, if you are planning to use Hive, then you don't even need to generate a pseudo key (the row numbers you are talking about).
There is one more option if you have a Hortonworks deployment: consider Hive for analytics and Hive LLAP for interactive queries.

Parsing structured and semi-structured text with hundreds of tags with Ruby

I will be processing batches of 10,000-50,000 records with roughly 200-400 characters in each record. I expect the number of search terms I could have would be no more than 1500 (all related to local businesses).
I want to create a function that compares the structured tags with a list of terms to tag the data.
These terms are based on business descriptions. So, for example, a [Jazz Bar], [Nightclub], [Sports Bar], or [Wine Bar] would all correspond to queries for [Bar].
Usually this data has some kind of existing tag, so I can also create a strict hierarchy for the first pass and then do a second pass if there is no definitive existing tag.
What is the most performant way to implement this? I could have a table with all the keywords and try to match them against each piece of data. This is straightforward in the case where I am matching the existing tag, and less straightforward when processing free text.
I'm using Heroku/Postgresql
It's a pretty safe bet to use the Sphinx search engine and the ThinkingSphinx Ruby gem. Yes, there is some configuration overhead, but I am yet to find a scenario where Sphinx has failed me. :-)
If you have 30-60 minutes to tinker with setting this up, give it a try. I have been using Sphinx to search a DB table with 600,000+ records with complex queries (3 separate search criteria + 2 separate field groupings/sortings) and I was getting results in 0.625 seconds, which is not bad at all and, I am sure, is a lot better than anything you could accomplish yourself with pure Ruby code.
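For comparison, the two-pass scheme described in the question (strict lookup on the existing tag first, then a scan of the free text) is simple enough to sketch directly; shown here in Python for brevity, with a purely hypothetical tag hierarchy built from the [Bar] example:

```python
# Hypothetical mapping from specific business tags to canonical query terms.
TAG_HIERARCHY = {
    "jazz bar": "bar",
    "nightclub": "bar",
    "sports bar": "bar",
    "wine bar": "bar",
}


def tag_record(existing_tag, free_text):
    """First pass: strict lookup on the record's existing tag.

    Second pass: naive substring scan of the free-text description.
    With ~1500 terms and 200-400 character records this stays cheap,
    but a real search engine (Sphinx, as above) handles stemming and
    word boundaries that this sketch does not.
    """
    canonical = TAG_HIERARCHY.get(existing_tag.lower())
    if canonical:
        return canonical
    text = free_text.lower()
    for term, canon in TAG_HIERARCHY.items():
        if term in text:
            return canon
    return None
```

This covers the "strict hierarchy for the first pass" cheaply; the free-text fallback is where Sphinx earns its configuration overhead.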