In Spark Streaming, getting no output in append mode - spark-streaming

I am reading data from a Kafka topic and then doing a group by on the message received from the topic:
val lines = spark.readStream.format("Kafka")...
val df1 = lines.select($"timestamp", $"value".cast("STRING"))
... // created a schema and extracted the message from value; df3 now has the schema (timestamp, message: string)
Now I am doing a group by on message to find the count and tried writing the result to a file in append output mode. But it gave a watermark error, so I included the timestamp in the group by, as in my code snapshot (screenshot omitted).
Can someone help me with how to write the counts to a file in append output mode?
I expect the message counts to be written to files as new data is received on the Kafka topic. The console sink is not suitable for my use case.
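For reference, a minimal sketch of how append mode to a file sink is usually wired up; the topic name, broker address, window/watermark durations and output paths below are illustrative assumptions, not taken from the question. The key points are that the watermark is declared on the same timestamp column used in the aggregation and that the aggregation groups by an event-time window:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("KafkaMessageCounts").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
  .option("subscribe", "events")                          // assumed topic name
  .load()

// Keep the Kafka timestamp so a watermark can be defined on it.
val messages = lines.select($"timestamp", $"value".cast("string").as("message"))

// Append mode with an aggregation needs a watermark on the event-time column.
val counts = messages
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"), $"message")
  .count()

// The file sink supports only append mode; a window's counts are written
// once the watermark passes the end of that window.
val query = counts.writeStream
  .outputMode("append")
  .format("parquet")                                      // or "json"/"csv"
  .option("path", "/tmp/message_counts")                  // assumed output path
  .option("checkpointLocation", "/tmp/message_counts_chk")
  .start()

query.awaitTermination()

With this setup, output for a given window only appears after the watermark moves past the window's end, so expect a delay of roughly the window length plus the watermark before the first files show up.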

Related

Read a CSV file with one row and without headers from Kafka in NiFi

I am trying to read, in NiFi, a CSV file from a Kafka topic that has only 1 row and no headers, but it doesn't even get to the processor group 'Read from kafka'. If my CSV file contains 2 rows (also without headers) it works as it should. Has anyone encountered the same issue?

Glue Crawler excluding many files from table after running on S3 json GZIP data

I have a Lambda that is ingesting JSON data from a load balancer and then writing each individual JSON record with a PUT to a Kinesis stream. The Kinesis stream is the source for Kinesis Firehose, which deposits GZIP files into an S3 bucket under the prefix 'raw'. Example JSON record:
{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payload":{"checking": "true"}}
I created an X-Ray trace in the producing Lambda so I have an idea of how many PUT requests (i.e., individual JSON records) were made. In the time period I had this ingestion turned on, I sent about 18,000 records to the Kinesis stream. I then ran the crawler on the prefix "raw" with default settings, except that in the "Crawler output" section I checked "Update all new and existing partitions with metadata from the table" to avoid HIVE_PARTITION_SCHEMA_MISMATCH. The crawler runs and successfully detects the schema, which looks like this:
column        data type
level         string
hash          string
source        string
hat_name      string
event_type    string
payload       string   <-- the only nested JSON field; it has many possible internal structures
partition_0   string
partition_1   string
partition_2   string
partition_3   string
Once the table is created, I notice that there are only about 4,000 records, when it should have about four times that many. Later I reran the crawler and noticed this line in the logs:
INFO : Some files do not match the schema detected. Remove or exclude the following files from the crawler
I examined some of the excluded files. The majority of them had valid JSON data; however, one or two files had a truncated JSON record at the end of the file, like so:
{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payload":{"checking":
"true"}}{"level":"INFO","hash":"3c351293-11e3-4e32-baa2-
bf810ed44466","source":"FE","hat_name":"2249444f-c3f4-4e3d-8572-
c38c3dab4848","event_type":"MELT_DOWN","payl
What do I need to do in Glue to have all records loaded into the table? I should have around 18,000, not 4,200. I think one issue is that the schema may not match exactly on some records, but I validate in the Kinesis producer that each record is valid JSON with the appropriate top-level fields. The second issue I see is the file with the truncated JSON record; I am assuming this may be an issue with Firehose batching the files? Any help is appreciated.
Note: I have tried to manually create the JSON table defining all top-level fields, and I still have the same problem; it only finds around 4,200 entries when I query in Athena.
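One frequent cause of Athena/Glue under-counting Firehose output is that records land in S3 concatenated back to back with no delimiter, so a whole object is parsed as a single (or invalid) record. A minimal sketch, assuming the AWS SDK for Java v1 and a hypothetical stream name, of appending a newline to each record in the producing Lambda so the objects Firehose writes become newline-delimited JSON:

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object NewlineDelimitedPut {
  private val streamName = "raw-events"                 // hypothetical stream name
  private val kinesis = AmazonKinesisClientBuilder.defaultClient()

  def putJsonRecord(json: String, partitionKey: String): Unit = {
    // The trailing newline is what makes the delivered S3 objects newline-delimited
    // JSON, which the Glue crawler and Athena can split into individual records.
    val payload = (json + "\n").getBytes(StandardCharsets.UTF_8)
    val request = new PutRecordRequest()
      .withStreamName(streamName)
      .withPartitionKey(partitionKey)
      .withData(ByteBuffer.wrap(payload))
    kinesis.putRecord(request)
  }
}

This addresses the records running together (the "}{" pattern above); the record that is cut off mid-field is a separate issue.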

Kafka Streams Custom processing

There is a requirement for me to process huge files, and multiple files may end up being processed in parallel.
Each row in a given file would be processed against a rule specific to that file.
Once the processing is complete we would generate an output file based on the processed records.
One option I have thought of is that each message pushed to the broker will carry: the row data + the rule to be applied + some correlation ID (an identifier for that particular file).
I plan to use Kafka Streams and create a topology with a processor that gets the rule with the message, processes it, and sinks it.
However (I am new to Kafka Streams, so I may be wrong):
The order in which the messages are processed will not be sequential, as we are processing multiple files in tandem (which is fine, because there isn't a requirement for me to preserve order; moreover, I want to keep it decoupled). But then how would I bring it to logical closure, i.e. in my processor how would I come to know that all the records of a file have been processed?
Do I need to maintain the records (correlation ID, number of records, etc.) in something like Ignite? I am unsure on that though.
I guess you can set aside a key and value record that is sent to the topic at the end of the file, which would signify the closure of the file.
Say that record has a unique key, such as -1, which signifies the EOF.
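A rough sketch of that end-of-file marker idea with the Kafka Streams Processor API, assuming the message key is the correlation ID, the EOF marker uses the key "-1" with the correlation ID as its value, and hypothetical topic names ("file-rows", "processed-rows"); the in-memory map stands in for what would normally be a state store:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig, Topology}
import org.apache.kafka.streams.processor.api.{Processor, ProcessorContext, ProcessorSupplier, Record}

// Applies the per-file rule to each row and, when the EOF marker (key "-1")
// arrives, forwards a "file complete" record with the count of rows seen.
class RowRuleProcessor extends Processor[String, String, String, String] {
  private var ctx: ProcessorContext[String, String] = _
  private val rowsSeen = scala.collection.mutable.Map.empty[String, Long] // correlationId -> rows seen so far

  override def init(context: ProcessorContext[String, String]): Unit = ctx = context

  override def process(record: Record[String, String]): Unit =
    if (record.key == "-1") {                            // EOF marker: value carries the correlation ID
      val correlationId = record.value
      val count = rowsSeen.getOrElse(correlationId, 0L)
      ctx.forward(new Record(correlationId, s"FILE_COMPLETE:$count", record.timestamp))
      rowsSeen.remove(correlationId)
    } else {                                             // ordinary row: key is the correlation ID
      rowsSeen.update(record.key, rowsSeen.getOrElse(record.key, 0L) + 1L)
      ctx.forward(new Record(record.key, applyRule(record.value), record.timestamp))
    }

  private def applyRule(row: String): String = row       // placeholder for the file-specific rule
}

object FileRuleApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "file-rule-processor")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  val topology = new Topology()
    .addSource("rows-source", "file-rows")               // hypothetical input topic
    .addProcessor("rule-processor",
      new ProcessorSupplier[String, String, String, String] {
        override def get(): Processor[String, String, String, String] = new RowRuleProcessor
      },
      "rows-source")
    .addSink("rows-sink", "processed-rows", "rule-processor")  // hypothetical output topic

  new KafkaStreams(topology, props).start()
}

In a real deployment the per-file counts would live in a persistent state store attached to the processor, so they survive restarts and rebalances.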

MapReduce job with only a map phase shows map output records = 0

I have code for a MapReduce job that has only a map phase. Its function is to read a file and publish it to a Kafka topic. When we look at the logs of this job after a successful execution, it does write to the Kafka topic, but it shows map output records = 0, while map input records match the file. Why are the map output records reported as 0 here? The process works as expected, publishing the data to the topic without any problem.
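A likely explanation is that the "Map output records" counter only counts records emitted through context.write, and a map-only job that publishes to Kafka as a side effect never calls it. A minimal sketch of such a mapper (the broker address and topic name are made up) to illustrate why the counter stays at 0:

import java.util.Properties
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Mapper
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Map-only job: each input line is published to Kafka; nothing is emitted
// through context.write, so the "Map output records" counter stays at 0.
class KafkaPublishMapper extends Mapper[LongWritable, Text, NullWritable, NullWritable] {
  private var producer: KafkaProducer[String, String] = _

  override def setup(context: Mapper[LongWritable, Text, NullWritable, NullWritable]#Context): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // assumed broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
  }

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, NullWritable, NullWritable]#Context): Unit =
    // Publishing is a side effect only; there is no context.write(...) call.
    producer.send(new ProducerRecord[String, String]("my-topic", value.toString))

  override def cleanup(context: Mapper[LongWritable, Text, NullWritable, NullWritable]#Context): Unit =
    producer.close()
}

If the mapper looks like this, map output records = 0 is expected, because the job's real output goes through the Kafka producer rather than the MapReduce output path.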

Spark Streaming : Join Dstream batches into single output Folder

I am using Spark Streaming to fetch tweets from Twitter by creating a StreamingContext as: val ssc = new StreamingContext("local[3]", "TwitterFeed", Minutes(1))
and creating twitter stream as :
val tweetStream = TwitterUtils.createStream(ssc, Some(new OAuthAuthorization(Util.config)),filters)
then saving it as text files:
tweetStream.repartition(1).saveAsTextFiles("/tmp/spark_testing/")
and the problem is that the tweets are being saved in folders based on batch time, but I need all the data of each batch in the same folder.
Is there any workaround for it?
Thanks
We can do this using Spark SQL's DataFrame saving API, which allows appending to an existing output. By default, saveAsTextFile won't be able to save to a directory with existing data (see https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes). https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations covers how to set up a Spark SQL context for use with Spark Streaming.
Assuming you copy the part from the guide with the SQLContextSingleton, the resulting code would look something like:
data.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  // Convert your data to a DataFrame; depends on the structure of your data
  val df = ....
  df.save("org.apache.spark.sql.json", SaveMode.Append, Map("path" -> path.toString))
}
(Note the above example uses JSON to save the result, but you can use different output formats too.)
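The df.save(source, mode, options) call above is from the old Spark 1.x DataFrame API; on more recent Spark versions the equivalent goes through DataFrameWriter. A sketch under that assumption, reusing tweetStream from the question and an illustrative output path:

import org.apache.spark.sql.{SaveMode, SparkSession}

tweetStream.foreachRDD { rdd =>
  // One SparkSession shared across batches, built from the streaming context's conf.
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  // Reduce each tweet to its text; adapt to whatever fields you need.
  val df = rdd.map(_.getText).toDF("tweet")

  // Append every batch to the same directory instead of one folder per batch.
  df.write.mode(SaveMode.Append).json("/tmp/spark_testing/tweets")
}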
