Update data in Elasticsearch from CSV

I have several indices A, B, C, D, each containing a large amount of data, say 50 million records per index. A record looks like this:
user: amine, age: 22, state: Michigan
I want to create a new index E and bulk-load new data into it from a CSV file. What I hope for is that if a user from E already exists in one of the other indices but with different information, for example:
user: amine, age: 25, state: California
I want the user amine to be updated in all other indices with the new data.
I tried pulling all the data, processing it, and re-ingesting it, but the process took far too long.
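A minimal sketch of a cheaper approach with the official Python client, assuming the user name is also used as the document _id in every index (if it is not, you would first need a lookup, or an update_by_query per user); the index names and CSV path are placeholders:

import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder address
indices = ["a", "b", "c", "d", "e"]          # placeholder index names

def actions(csv_path):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):        # columns: user, age, state
            for index in indices:
                yield {
                    "_op_type": "update",
                    "_index": index,
                    "_id": row["user"],
                    "doc": {"age": int(row["age"]), "state": row["state"]},
                    # create the document in E if missing; leave A-D alone when absent
                    "doc_as_upsert": index == "e",
                }

# Users missing from A-D come back as per-item errors; raise_on_error=False
# lets the rest of the batch go through.
helpers.bulk(es, actions("users.csv"), raise_on_error=False)

This sends partial-document updates in bulk instead of re-ingesting the whole indices, which is usually much faster.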

Related

Performance with pagination

Question
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination, so I need the ORDER BY. How can I improve the performance of this query?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, Cypher can lazily evaluate the tenants and stop at 10 rather than matching every tenant in City1. However, because you need to order the tenants, the only way it can do that is to fetch them all and then sort.
If Tenant is the only label that can live in an apartment, then you could possibly save a Filter step by removing the Tenant label from your query, e.g. MATCH (t)-[:lives_in]->(:Apartment)....
You might also want to check the profile of your query and see if it uses an index-backed ORDER BY.
What sort of numbers are you expecting back from this query? What's the worst case number of tenants in a given city?
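For reference, a minimal sketch with the official Neo4j Python driver (connection details are placeholders; the index syntax is for Neo4j 4.x/5.x) that creates an index on Tenant.id and profiles the query so you can inspect the plan yourself:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # an index on Tenant.id is a prerequisite for an index-backed ORDER BY
    session.run("CREATE INDEX tenant_id IF NOT EXISTS FOR (t:Tenant) ON (t.id)").consume()

    result = session.run(
        "PROFILE "
        "MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'}) "
        "RETURN t ORDER BY t.id LIMIT 10"
    )
    summary = result.consume()
    print(summary.profile)  # look for a Sort operator vs. an ordered index scan

driver.close()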
EDIT
I was hoping a USING JOIN on t would use the index to improve the plan but it does not.
The query performs slightly better if you add a redundant relationship from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
and similarly by embedding the city name on the tenant, but with no major gains. I tested with 150,000 tenants in City1; perhaps the gains become more visible as you approach millions, but I am not sure.

How to group records based on a child record in Amazon QuickSight

Is there any way to categorise an aggregated record in Amazon QuickSight based on the records being aggregated?
I'm trying to build a pivot table in Amazon QuickSight to pull out counts of various records, split by the state each record is in, limited to a date range based on when the record first had activity.
This is all being done in SPICE since the raw data is CSV files in S3. The two CSV files are:
Student
Id: String
Current State: One of pass / fail
Department: String
Test
Id: String
Student Id: String
DateOfTest: DateTime
State: One of pass / fail
The pivot table I'm trying to build would have a row for each department and 3 columns:
fail - Count of students in the department where the current state is fail
pass-never-failed - Count of students in the department where the current state is pass and there are no failed tests for the Student
pass-failed-some - Count of students in the department where the current state is pass and there is at least one failed test
To do this I've created a dataset that contains Student joined to Test. I can then create a Pivot table with:
Row: Department
Column: Current State
Value: Count Distinct(Id[Student])
I thought I'd be able to create a calculated field to give me the three categories and use that in place of Current State using the following calculation:
ifelse(
  {Current State} = 'pass',
  ifelse(
    countIf({Id[Test]}, {State} = 'fail') = 0,
    'pass-never-failed',
    'pass-failed-some'
  ),
  {Current State}
)
But that is flagged as invalid with the following error:
Mismatched aggregation. Custom aggregations can’t contain both aggregate "`COUNT`" and non-aggregated fields “COUNT(CASE WHEN "State" = _UTF16'fail' THEN "Id[Test]" ELSE NULL END)”, in any combination.
Is there any way to categorise the Students based on an aggregation of the Tests in QuickSight, or do I have to pre-calculate this information in the source data?
I've been able to work around this for now by defining three separate calculations for the three columns and adding these as values in the QuickSight pivot, rather than setting a column dimension at all.
Count Passed Never Failed
distinct_countIf({Id[Student]},
  maxOver(
    ifelse({State[Test]} = 'fail' AND {Current State[Student]} = 'pass', 1, 0),
    [{Id[Student]}], PRE_AGG
  ) = 0)
Count Passed Failed Some
distinct_countIf({Id[Student]},
  maxOver(
    ifelse({State[Test]} = 'fail' AND {Current State[Student]} = 'pass', 1, 0),
    [{Id[Student]}], PRE_AGG
  ) > 0)
Count Failed
distinct_countIf({Id[Student]}, {Current State[Student]} = 'fail')
This works, but I'd still like to know whether it's possible to build a dimension I could use for this, as that would be more flexible if new states are added, without the special handling of pass.
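For comparison, here is the intended categorisation expressed outside QuickSight as a pandas sketch over the joined Student/Test rows (the column names mirror the dataset above; the sample rows are made up, and this only illustrates the grouping logic, not a QuickSight feature):

import pandas as pd

# joined Student x Test rows, mirroring the SPICE dataset
df = pd.DataFrame({
    "Id[Student]":            ["s1", "s1", "s2", "s3"],
    "Department":             ["Maths", "Maths", "Maths", "Physics"],
    "Current State[Student]": ["pass", "pass", "fail", "pass"],
    "State[Test]":            ["fail", "pass", "fail", "pass"],
})

# one row per student: did they ever fail a test?
per_student = (
    df.groupby(["Id[Student]", "Department", "Current State[Student]"])["State[Test]"]
      .apply(lambda s: (s == "fail").any())
      .reset_index(name="ever_failed")
)

def category(row):
    if row["Current State[Student]"] != "pass":
        return row["Current State[Student]"]          # e.g. 'fail'
    return "pass-failed-some" if row["ever_failed"] else "pass-never-failed"

per_student["category"] = per_student.apply(category, axis=1)

# pivot: one row per department, one column per category, distinct student counts
print(pd.crosstab(per_student["Department"], per_student["category"]))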

Extracting weekly or monthly data from a daily Daru time series

I am trying to understand how to obtain weekly or monthly data from a daily Daru time series.
Given this sample time series:
require 'daru'
index = Daru::DateTimeIndex.date_range start: '2020-01-15',
periods: 80, freq: 'D'
data = { price: Array.new(index.periods) { rand 100 } }
df = Daru::DataFrame.new data, index: index
I would like to:
Create a new monthly data frame with only the last row in each month.
Create a new weekly data frame with only the last row of each week.
In fact, I am not sure if I need to create a new data frame, or just query the original data frame.
The objective is to be able to plot weekly and monthly percentage change.
I am not even sure if this operation is something I am supposed to do on the index or on the data frame, so I am a bit lost in the documentation.
Any assistance is appreciated.
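For what it's worth, the analogous operation in pandas (a different library, shown only as a point of reference for the "last row per period, then percentage change" idea, not as a Daru answer):

import numpy as np
import pandas as pd

# daily series analogous to the Daru sample above
index = pd.date_range("2020-01-15", periods=80, freq="D")
df = pd.DataFrame({"price": np.random.randint(0, 100, len(index))}, index=index)

monthly = df.resample("ME").last()    # last row of each month ("M" in older pandas)
weekly = df.resample("W").last()      # last row of each week

print(monthly["price"].pct_change())  # monthly percentage change
print(weekly["price"].pct_change())   # weekly percentage change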

Search by values in Redis cache - Secondary Indexing

I am new to Redis. I want to search by one or more values that come from an API.
For example, let's say that I want to store some securities data as below:
Value1
{
  "isin": "isin123",
  "id_bb_global": "BBg12345676",
  "cusip": "cusip123",
  "sedol": "sedol123",
  "cpn": "0.09",
  "cntry": "US",
  "144A": "xyz",
  "issue_cntry": "UK"
}
Value2
{
  "isin": "isin222",
  "id_bb_global": "BBG222",
  "cusip": "cusip222",
  "sedol": "sedol222",
  "cpn": "1.0",
  "cntry": "IN",
  "144A": "Y",
  "issue_cntry": "DE"
}
...
...
I want to search by cusip or cusip and id_bb_global, ISIN plus Exchange, or sedol.
e.g. a search query of {"isin":"isin222", "cusip":"cusip222"} should return the full matching record.
What is the best way to store this kind of data structure in Redis, and what is a fast way to retrieve it?
When you insert data, you can create sets to maintain the index.
{
  "isin": "isin123",
  "id_bb_global": "BBg12345676",
  "cusip": "cusip123",
  "sedol": "sedol123",
  "cpn": "0.09",
  "cntry": "US",
  "144A": "xyz",
  "issue_cntry": "UK"
}
For example, with the above data, if you want to filter by isin and cusip, you can create the respective sets for isin:123 and cusip:123 and add that item's id to both of those sets.
Later on, if you want to find items that are in both isin:123 and cusip:123, you just have to run SINTER on those two sets.
Or, if you want to find items that are in either isin:123 or cusip:123, you can union them with SUNION.
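A minimal sketch of that pattern with redis-py (the key naming convention is just one possible choice):

import json
import redis

r = redis.Redis(decode_responses=True)

def add_item(item_id, doc):
    # store the full document and index the fields we want to search on
    r.set(f"item:{item_id}", json.dumps(doc))
    for field in ("isin", "cusip", "id_bb_global", "sedol"):
        r.sadd(f"idx:{field}:{doc[field]}", item_id)

def search(**criteria):
    # AND semantics: intersect the per-field sets (use r.sunion for OR)
    keys = [f"idx:{field}:{value}" for field, value in criteria.items()]
    return [json.loads(r.get(f"item:{i}")) for i in r.sinter(*keys)]

add_item(1, {"isin": "isin123", "id_bb_global": "BBg12345676",
             "cusip": "cusip123", "sedol": "sedol123"})
add_item(2, {"isin": "isin222", "id_bb_global": "BBG222",
             "cusip": "cusip222", "sedol": "sedol222"})

print(search(isin="isin222", cusip="cusip222"))  # -> the second document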

Cassandra slow get_indexed_slices speed

We are using Cassandra for log collecting.
About 150,000 - 250,000 new records per hour.
Our column family has several columns such as 'host', 'errorlevel', 'message', etc., and a special indexed column 'indexTimestamp'.
This column contains the time rounded to the hour.
So, when we want to get some records, we use get_indexed_slices() with a first IndexExpression on indexTimestamp (with the EQ operator) and then some other IndexExpressions on host, errorlevel, etc.
When getting records just by indexTimestamp, everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra runs for a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first gets all records by the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow here? There are no more than 250,000 records per indexTimestamp value. Shouldn't it be possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine (Windows 7) with 4 CPUs and 4 GB of memory.
You have to bear in mind that Cassandra is very bad at this kind of query. Secondary-index queries are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact, Cassandra is not a database you can query arbitrarily; it is a key-value storage system. To understand that, have a quick look here: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucketed rows and ranged slice queries.
Let's say you have the object
user : {
  name : "XXXXX"
  country : "UK"
  city : "London"
  postal_code : "N1 2AC"
  age : "24"
}
and of course you want to query by city OR by age (an AND of the two would require yet another data model).
Then you would have to save your data like this, assuming the name is a unique id:
write(row = "UK", column_name = "London_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for the age search.
The range query for age = 24 would be:
get_range_slice(row = "bucket_20_to_25", from = "24_", to = "24`")
As a note, the backtick '`' is the character immediately after the underscore '_' in byte order, so this range effectively gives you all the columns whose names start with "24_".
This also allows you to query for age between 21 and 24, for example (from = "21_", to = "24`").
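Since get_indexed_slices suggests you are on the Thrift API, here is a rough sketch of that bucketing pattern with pycassa (keyspace, pool and column family names are placeholders):

import pycassa

pool = pycassa.ConnectionPool("MyKeyspace", ["localhost:9160"])
users_by_age = pycassa.ColumnFamily(pool, "UsersByAgeBucket")

# write: row key = age bucket, column name = "<age>_<name>", value = serialized user
users_by_age.insert("bucket_20_to_25", {"24_XXXXX": '{"city": "London"}'})

# read: all columns whose names start with "24_" ('`' is the character right after '_')
columns = users_by_age.get("bucket_20_to_25", column_start="24_", column_finish="24`")
print(columns)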
hope it was useful
