create partition column from structured streaming JSON data - spark-streaming

I'm new to Structured Streaming and would like to create a partition column based on the date column from the JSON message.
Here is a sample message:
{"date": "2022-03-01", "code": "1000310014", "no": "20191362283860", "update_type_cd": "UP", "EventProcessedUtcTime": null, "PartitionId": null, "EventEnqueuedUtcTime": null}
{"date": "2022-03-01", "code": "2000310014", "no": "300191362283860", "update_type_cd": "UP", "EventProcessedUtcTime": null, "PartitionId": null, "EventEnqueuedUtcTime": null}
{"date": "2022-03-01", "code": "30002220014", "no": "20191333383860", "update_type_cd": "UP", "EventProcessedUtcTime": null, "PartitionId": null, "EventEnqueuedUtcTime": null}
val date = event.select(col("date"))
val stream = flatten_messages
.writeStream
.partitionBy(date)
.format("delta")
.outputMode("append")
.start(output_path)
Is this the right way to partition on the JSON message?

No, in partitionBy you just specify column names, not a DataFrame. So the code would be just:
val stream = flatten_messages
.writeStream
.partitionBy("date")
.format("delta")
.outputMode("append")
.start(output_path)
But the first question would be: do you really need to partition the data? It may not be strictly required with Delta, which has things like data skipping, Z-Ordering, etc.
P.S. You may also want to cast the date column to a date type - that way it is stored more efficiently on disk and allows range searches, etc., although this isn't related to partitioning.
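For reference, a minimal sketch of both suggestions together, assuming flatten_messages already has a string column named "date"; the checkpoint location is an assumption, not something from the question:

import org.apache.spark.sql.functions.{col, to_date}

// Cast the string "date" field to a proper DateType before writing
val withDate = flatten_messages
  .withColumn("date", to_date(col("date"), "yyyy-MM-dd"))

val stream = withDate
  .writeStream
  .partitionBy("date")  // column name, not a DataFrame
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", output_path + "/_checkpoint")  // assumed location
  .start(output_path)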

Related

Timelion Statement: How to filter data from an array in a Timelion visualization query

There is a column in an index in Kibana which holds an array of data.
E.g., below is a sample of the column blocked_by:
"blocked_by": [
{
"error_category_name": "Record is not a new one",
"error_category_owner": "AB",
"created_when": "2022-05-18T09:52:44.000Z",
"name": "ERROR IN RCS: Delete Subscriber",
"resolved_when": "2022-05-18T10:52:55.963+01:00",
"id": "8163578639440138764"
},
{
"error_category_name": "NM-1009 Phone Number is not in appropriate state",
"error_category_owner": "AB",
"created_when": "2022-05-18T09:52:45.000Z",
"name": "ERROR IN NC NM: Change MSISDN status",
"resolved_when": "2022-05-18T10:53:16.230+01:00",
"id": "8163578637640138764"
},
I want to extract only the latest record out of this column in my Timelion expression.
Can someone help me out - is this possible to do in Timelion?
My expression:
.es(index=sales_order,timefield=created_when,q='blocked_by.error_category_owner.keyword:(AB OR Undefined OR null OR "") AND _exists_:blocked_by').divide(.es(index=sales_order,timefield=created_when)).yaxis(2,position=right,units=percent).label(Fallout)

Elasticsearch data size optimization

I was wondering, would it be good practice to optimize data like this in Elasticsearch?
Old data
{
"user_id": 1,
"firstname": "name",
"lastname": "name",
"email": "email"
}
New data
{
"uid": 1,
"f": "name",
"l": "name",
"e": "email"
}
Let's say I have billions of documents with long key names; would it save a lot of space if I used short key names instead?
Or does Elasticsearch compress data by default, so I don't need to worry about this?
I prefer to have the data more readable, but if it could save a lot of space, then it's a whole different thing.
This question was asked five years ago and had only one answer, so it would be nice to have more comments about it.
You can read it here:
Elasticsearch scheme optimization
Any thoughts from experienced elasticsearch developers?

Azure data factory copy activity from Storage to SQL: hangs at 70000 rows

I have a data factory with a pipeline copy activity like this:
{
  "type": "Copy",
  "name": "Copy from storage to SQL",
  "inputs": [
    {
      "name": "storageDatasetName"
    }
  ],
  "outputs": [
    {
      "name": "sqlOutputDatasetName"
    }
  ],
  "typeProperties": {
    "source": {
      "type": "BlobSource"
    },
    "sink": {
      "type": "SqlSink"
    }
  },
  "policy": {
    "concurrency": 1,
    "retry": 3
  },
  "scheduler": {
    "frequency": "Month",
    "interval": 1
  }
}
The input data is approx 90MB in size, about 1.5 million rows, broken into approx. 20 x 4.5MB block blob files in Azure Storage. Here's an example of the data (CSV):
A81001,1,1,1,2,600,3.0,0.47236654,141.70996,0.70854986
A81001,4,11,0,25,588,243.0,5.904582,138.87576,57.392536
A81001,7,4,1,32,1342,278.0,7.5578647,316.95795,65.65895
The sink is an Azure SQL Server of type S2, which is rated at 50 DTUs. I've created a simple table with sensible data types, and no keys, indexes or anything fancy, just columns:
CREATE TABLE [dbo].[Prescriptions](
[Practice] [char](6) NOT NULL,
[BnfChapter] [tinyint] NOT NULL,
[BnfSection] [tinyint] NOT NULL,
[BnfParagraph] [tinyint] NOT NULL,
[TotalItems] [int] NOT NULL,
[TotalQty] [int] NOT NULL,
[TotalActCost] [float] NOT NULL,
[TotalItemsPerThousand] [float] NOT NULL,
[TotalQtyPerThousand] [float] NOT NULL,
[TotalActCostPerThousand] [float] NOT NULL
)
The source, sink and data factory are all in the same region (North Europe).
According to Microsoft's 'Copy activity performance and tuning guide', for an Azure Storage source and Azure SQL S2 sink I should be getting about 0.4 MBps. By my calculation, that means 90 MB should transfer in about half an hour (is that right?).
For some reason it copies 70,000 rows very quickly, then seems to hang. Using SQL Server Management Studio I can see the count of rows in the database table is exactly 70,000 and hasn't increased at all in 7 hours, yet the copy task is still running with no errors.
Any ideas why this is hanging at 70,000 rows? I can't see anything unusual about the 70,001st data row that would cause a problem. I've tried completely trashing the data factory and starting again, and I always get the same behaviour. I have another copy activity with a smaller table (8,000 rows) which completes in 1 minute.
Just to answer my own question in case it helps anyone else:
The issue was with null values. The reason my run was hanging at 70,000 rows was that at row 76,560 of my blob source file there was a null value in one of the columns. The Hive script I had used to generate this blob file had written the null value as '\N'. Also, my sink SQL table specified 'NOT NULL' for that column, and the column was a FLOAT.
So I made two changes: added the following property to my blob dataset definition:
"nullValue": "\\N"
And made my SQL table column nullable. It now runs completely and doesn't hang! :)
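For anyone hitting the same thing, the nullValue property sits in the text format settings of the blob dataset. A rough sketch of a (v1) dataset definition follows; the linked service name, folder path and delimiter here are placeholders, not taken from the question:

{
  "name": "storageDatasetName",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "myStorageLinkedService",
    "typeProperties": {
      "folderPath": "mycontainer/prescriptions/",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ",",
        "nullValue": "\\N"
      }
    },
    "availability": {
      "frequency": "Month",
      "interval": 1
    }
  }
}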
The problem is that Data Factory did not error; it just got stuck. It would be nice if the job had failed with a useful error message and told me which row of the data was the problem. I think it got stuck at 70,000 rather than 76,560 because the write batch size is 10,000 by default.
Here is another workaround: just set the write batch size to a value larger than the default (10,000).
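For example, in the copy activity's sink definition (the exact value below is only an illustration; anything large enough to cover the problematic row works):

"sink": {
  "type": "SqlSink",
  "writeBatchSize": 100000
}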

Hive DDL for Parquet format with complex datatypes

Could someone help me create the Hive DDL for this dataset, which was processed and stored in Parquet format?
properties:
{
  "freq": "8600",
  "id": "23266",
  "array": [
    {
      "ver": "201.0.0.F",
      "key_ver": "201.0.0.F",
      "key": "001I1SS",
      "code": "ACDEE",
      "prod_code": "DSADVVSS",
      "prod_key": "001123"
    }
  ],
  "ipm": null,
  "offline": "1234234209600"
}
CREATE TABLE my_table(
  freq INT,
  id INT,
  `array` ARRAY<STRUCT<ver: FLOAT, key_ver: FLOAT, key: STRING, code: STRING, prod_code: STRING, prod_key: INT>>,
  ipm **UNKNOWN**,
  offline BIGINT
)
Since JSON has far fewer types than Hive, we cannot get all the information we need from just what you posted. For example, we don't know what the type of ipm should be, and we don't know whether id should be an INT or a BIGINT, and so on.
Since you've already converted that JSON file to a Parquet file, you can inspect the Parquet file (which has more types) to get a better idea of what schema to use.
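For example, if you have Spark available, a quick sketch of inspecting what was actually written (the path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("inspect-parquet").getOrCreate()

// Read the Parquet output and print the schema it actually contains;
// toDDL gives a string that can be adapted into the Hive CREATE TABLE.
val df = spark.read.parquet("/path/to/parquet/output")
df.printSchema()
println(df.schema.toDDL)

Alternatively, the parquet-tools utility can dump the schema directly from a file (parquet-tools schema <file>).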

How to find related songs or artists using Freebase MQL?

I have a Freebase mid such as /m/0mgcr, which is The Offspring.
What's the best way to use MQL to find related artists?
Or if I have a song mid such as /m/0l_f7f, which is Original Prankster by The Offspring.
What's the best way to use MQL to find related songs?
So, the revised question is, given a musical artist, find all other musical artists who share all of the same genres assigned to the first artist.
MQL doesn't have any operators which can work across parts of the query tree, so this can't be done in a single query, but given that you're likely doing this from a programming language, it can be done pretty simply in two steps.
First, we'll get all genres for our subject artist, sorted by the number of artists they contain, using this query (although the sorting isn't strictly necessary):
[{
  "id": "/m/0mgcr",
  "name": null,
  "/music/artist/genre": [{
    "name": null,
    "id": null,
    "artists": {
      "return": "count"
    },
    "sort": "artists.count"
  }]
}]
Then, using the genre with the smallest number of artists for maximum selectivity, we'll add in the other genres to make it even more specific. Here's a version of the query with the artists that match on the three most specific genres (the base genre plus two more):
[{
  "id": "/m/0mgcr",
  "name": null,
  "/music/artist/genre": [{
    "name": null,
    "id": null,
    "artists": {
      "return": "count"
    },
    "sort": "artists.count",
    "limit": 1,
    "a:artists": [{
      "name": null,
      "id": null,
      "a:genre": {
        "id": "/en/ska_punk"
      },
      "b:genre": {
        "id": "/en/melodic_hardcore"
      }
    }]
  }]
}]
Which gives us: Authority Zero, Millencolin, Michael John Burkett, NOFX, Bigwig, Huelga de Hambre, Freygolo, The Vandals
The things to note about this query are that this fragment:
"sort": "artists.count",
"limit": 1,
limits our initial genre selection to the single genre with the fewest artists (i.e. Skate Punk), while the prefix notation:
"a:genre": {"id": "/en/ska_punk"},
"b:genre": {"id": "/en/melodic_hardcore"}
is to get around the JSON limitation of not allowing more than one key with the same name. The prefixes are ignored and just need to be unique (this is the same reason for the a:artists elsewhere in the query).
So, having worked through that whole little exercise, I'll close by saying that there are probably better ways of doing this. Instead of an absolute match, you may get better results with a scoring function that looks at % overlap for the most specific genres or some other metric. Things like common band members, collaborations, contemporaneous recording history, etc, etc, could also be factored into your scoring. Of course this is all beyond the capabilities of raw MQL and you'd probably want to load the Freebase data for the music domain (or some subset) into a graph database to run these scoring algorithms.
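As a rough, hypothetical illustration of that kind of scoring (plain Scala, not MQL; the band names and genre sets below are made up for the example, with genre sets assumed to come from the first query above):

// Score candidate artists by how much of the seed artist's genre set they share
// (Jaccard-style overlap), then rank candidates by that score.
def genreOverlap(seedGenres: Set[String], candidateGenres: Set[String]): Double =
  if (seedGenres.isEmpty || candidateGenres.isEmpty) 0.0
  else (seedGenres intersect candidateGenres).size.toDouble / (seedGenres union candidateGenres).size

val seed = Set("/en/skate_punk", "/en/ska_punk", "/en/melodic_hardcore")
val candidates = Map(
  "Band A" -> Set("/en/skate_punk", "/en/ska_punk"),      // hypothetical data
  "Band B" -> Set("/en/ska_punk", "/en/pop_punk")
)
val ranked = candidates.toSeq.sortBy { case (_, genres) => -genreOverlap(seed, genres) }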
In point of fact, both last.fm and Google think a better list would include bands like Sum 41, blink-182, Bad Religion, Green Day, etc.
