Spark: Avro RDD to CSV - hadoop

I am able to read an Avro file into avroRDD and am trying to convert it into a csvRDD containing all the values, comma separated. With the following code I am able to read a specific field into csvRDD.
val csvRDD = avroRDD.map { case (u, _) => u.datum.get("empname") }
How can I read all the values into csvRDD instead of specifying field names? My resulting csvRDD should contain records as follows:
(100,John,25,IN)
(101,Ricky,38,AUS)
(102,Chris,68,US)

Using Spark 1.2+ with the spark-avro integration library by Databricks, one can convert an Avro RDD to a CSV RDD as follows:
val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
val csv = episodes.map(_.mkString(","))
Running csv.collect().foreach(println) using this sample avro file prints
The Eleventh Hour,3 April 2010,11
The Doctor's Wife,14 May 2011,11
Horror of Fang Rock,3 September 1977,4
An Unearthly Child,23 November 1963,1
The Mysterious Planet,6 September 1986,6
Rose,26 March 2005,9
...
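If spark-avro is not an option and you want to stay with the original avroRDD, a minimal sketch (assuming u.datum returns the Avro GenericRecord, as the snippet in the question suggests) is to walk each record's schema so that no field names have to be hard-coded:

import org.apache.avro.generic.GenericRecord
import scala.collection.JavaConverters._

// Emit one comma-separated line per record by iterating over its schema fields.
val csvRDD = avroRDD.map { case (u, _) =>
  val record = u.datum.asInstanceOf[GenericRecord]  // assumes the wrapper exposes a GenericRecord
  record.getSchema.getFields.asScala
    .map(field => Option(record.get(field.name)).map(_.toString).getOrElse(""))
    .mkString(",")
}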

Related

Why is the header automatically skipped in the output file?

I want to store my data without skipping the data header.
This is my Pig script:
CRE_GM05 = LOAD '$input1' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,T32_001:chararray,TEC_013:chararray,TEC_014:chararray,DAT_001_X:chararray,DAT_002_X:chararray,TEC_001:chararray);
CRE_GM11 = LOAD '$input2' USING PigStorage(';') AS (MGM_COMPTEUR:chararray,CIA_CD_CRV_CIA:chararray,CIA_DA_EM_CRV:chararray,CIA_CD_CTRL_BLCE:chararray,CIA_IDC_EXTR_RDJ:chararray,CIA_VLR_IDT_CRV_LOQ:chararray,CIA_VLR_REF_CRV:chararray,CIA_NO_SEQ_CRV:chararray,CIA_VLR_LG_ZON_RTG:chararray,CIA_HEU_CIA:chararray,CIA_TM_STP_CRE:chararray,CIA_CD_SI:chararray,CIA_VLR_1:chararray,CIA_DA_ARR_FIC:chararray,CIA_TY_ENR:chararray,CIA_CD_BTE:chararray,CIA_CD_PER:chararray,CIA_CD_EFS:chararray,CIA_CD_ETA_VAL_CRV:chararray,CIA_CD_EVE_CPR:int,CIA_CD_APLI_TDU:chararray,CIA_CD_STE_RTG:chararray,CIA_DA_TT_RTG:chararray,CIA_NO_ENR_RTG:chararray,CIA_DA_VAL_EVE:chararray,DAT_001_X:chararray,DAT_002_X:chararray,D08_001:chararray,PSE_001:chararray,PSE_002:chararray,PSE_003:chararray,RUB_001:chararray,RUB_002:chararray,RUB_003:chararray,RUB_004:chararray,RUB_005:chararray,RUB_006:chararray,RUB_007:chararray,RUB_008:chararray,RUB_009:chararray,RUB_010:chararray,TEC_001:chararray,TEC_002:chararray,TEC_003:chararray,TX_001_VLR:chararray,TX_001_DCM:chararray,D08_004:chararray,D11_004:chararray,RUB_016:chararray,T03_001:chararray);
-- Join the two tables
JOINED_TABLES = JOIN CRE_GM05 BY TEC_001, CRE_GM11 BY TEC_001;
-- Generate the columns
DATA_GM05 = FOREACH JOINED_TABLES GENERATE
CRE_GM05::MGM_COMPTEUR AS MGM_COMPTEUR,
CRE_GM05::CIA_CD_CRV_CIA AS CIA_CD_CRV_CIA,
CRE_GM05::CIA_DA_EM_CRV AS CIA_DA_EM_CRV,
CRE_GM05::CIA_CD_CTRL_BLCE AS CIA_CD_CTRL_BLCE,
CRE_GM05::CIA_IDC_EXTR_RDJ AS CIA_IDC_EXTR_RDJ,
CRE_GM05::CIA_VLR_IDT_CRV_LOQ AS CIA_VLR_IDT_CRV_LOQ,
CRE_GM05::CIA_VLR_REF_CRV AS CIA_VLR_REF_CRV,
CRE_GM05::CIA_VLR_LG_ZON_RTG AS CIA_VLR_LG_ZON_RTG,
CRE_GM05::CIA_HEU_CIA AS CIA_HEU_CIA,
CRE_GM05::CIA_TM_STP_CRE AS CIA_TM_STP_CRE,
CRE_GM05::CIA_VLR_1 AS CIA_VLR_1,
CRE_GM05::CIA_DA_ARR_FIC AS CIA_DA_ARR_FIC,
CRE_GM05::CIA_TY_ENR AS CIA_TY_ENR,
CRE_GM05::CIA_CD_BTE AS CIA_CD_BTE,
CRE_GM05::CIA_CD_PER AS CIA_CD_PER,
CRE_GM05::CIA_CD_EFS AS CIA_CD_EFS,
CRE_GM05::CIA_CD_ETA_VAL_CRV AS CIA_CD_ETA_VAL_CRV,
CRE_GM05::CIA_CD_EVE_CPR AS CIA_CD_EVE_CPR,
CRE_GM05::CIA_CD_APLI_TDU AS CIA_CD_APLI_TDU,
CRE_GM05::CIA_CD_STE_RTG AS CIA_CD_STE_RTG,
CRE_GM05::CIA_DA_TT_RTG AS CIA_DA_TT_RTG,
CRE_GM05::CIA_NO_ENR_RTG AS CIA_NO_ENR_RTG,
CRE_GM05::CIA_DA_VAL_EVE AS CIA_DA_VAL_EVE,
CRE_GM05::T32_001 AS T32_001,
CRE_GM05::TEC_013 AS TEC_013,
CRE_GM05::TEC_014 AS TEC_014,
CRE_GM05::DAT_001_X AS DAT_001_X,
CRE_GM05::DAT_002_X AS DAT_002_X,
CRE_GM05::TEC_001 AS TEC_001;
STORE DATA_GM05 INTO '$OUTPUT_FILE' USING PigStorage(';');
It returns data, but I lose the first header line!
Note that my $input1 and $input2 variables are CSV files.
I tried using CSVLoader, but it doesn't work either.
I need the output to be stored with headers, please.
By default, Pig does not write a header in its final output. Adding a header to the final output also makes limited sense, because the order of rows in Pig output is not fixed.
If you want a header in the final output, either merge all the part files into a single file on the local file system, where you can prepend the header explicitly (a sketch of this follows), or store the output of this Pig script in a Hive table; the HCatalog storer (HCatStorer) can be used for that.
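A minimal sketch of the merge-and-prepend option, using the Hadoop FileSystem API from Scala (the paths and the header text are placeholders for your $OUTPUT_FILE directory and column list; hadoop fs -getmerge followed by a local concatenation achieves the same result):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Merge the part-* files written by STORE into a single file, writing a header line first.
val conf = new Configuration()
val fs = FileSystem.get(conf)
val outputDir = new Path("/path/to/output_file")                   // the directory STORE wrote into (placeholder)
val merged = fs.create(new Path("/path/to/final_with_header.csv")) // placeholder destination
merged.write("MGM_COMPTEUR;CIA_CD_CRV_CIA;CIA_DA_EM_CRV\n".getBytes("UTF-8"))  // extend with the remaining columns
fs.listStatus(outputDir)
  .filter(_.getPath.getName.startsWith("part-"))
  .sortBy(_.getPath.getName)
  .foreach { status =>
    val in = fs.open(status.getPath)
    IOUtils.copyBytes(in, merged, conf, false)                     // false keeps the output stream open
    in.close()
  }
merged.close()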

Stanford NLP Coref Resolution for Conversational Data

I want to run some experiments with the Stanford dcoref package on our conversational data. Our data contains usernames (speakers) and their utterances. Is it possible to give structured data as input (instead of raw text) to the Stanford dcoref annotator? If yes, what should the format of the conversational input data be?
Thank you,
-berfin
I was able to get this basic example to work:
<doc id="speaker-example-1">
<post author="Joe Smith" datetime="2018-02-28T20:10:00" id="p1">
I am hungry!
</post>
<post author="Jane Smith" datetime="2018-02-28T20:10:05" id="p2">
Joe Smith is hungry.
</post>
</doc>
I used these properties:
annotators = tokenize,cleanxml,ssplit,pos,lemma,ner,parse,coref
coref.conll = true
coref.algorithm = clustering
# Clean XML tags for SGM (move to sgm specific conf file?)
clean.xmltags = headline|dateline|text|post
clean.singlesentencetags = HEADLINE|DATELINE|SPEAKER|POSTER|POSTDATE
clean.sentenceendingtags = P|POST|QUOTE
clean.turntags = TURN|POST|QUOTE
clean.speakertags = SPEAKER|POSTER
clean.docIdtags = DOCID
clean.datetags = DATETIME|DATE|DATELINE
clean.doctypetags = DOCTYPE
clean.docAnnotations = docID=doc[id],doctype=doc[type],docsourcetype=doctype[source]
clean.sectiontags = HEADLINE|DATELINE|POST
clean.sectionAnnotations = sectionID=post[id],sectionDate=post[date|datetime],sectionDate=postdate,author=post[author],author=poster
clean.quotetags = quote
clean.quoteauthorattributes = orig_author
clean.tokenAnnotations = link=a[href],speaker=post[author],speaker=quote[orig_author]
clean.ssplitDiscardTokens = \\n|\\*NL\\*
Also this document has great info on the coref system:
https://stanfordnlp.github.io/CoreNLP/coref.html
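For completeness, a minimal Scala sketch of driving the same pipeline programmatically (CoreNLP is a Java library, so the calls are identical from Java; the property values simply mirror the listing above, and the file name is a placeholder for the example document):

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.coref.CorefCoreAnnotations
import scala.collection.JavaConverters._

val props = new Properties()
props.setProperty("annotators", "tokenize,cleanxml,ssplit,pos,lemma,ner,parse,coref")
props.setProperty("coref.algorithm", "clustering")
props.setProperty("clean.xmltags", "headline|dateline|text|post")
props.setProperty("clean.speakertags", "SPEAKER|POSTER")
props.setProperty("clean.tokenAnnotations", "link=a[href],speaker=post[author],speaker=quote[orig_author]")
// ...set the remaining clean.* properties from the listing above in the same way

val pipeline = new StanfordCoreNLP(props)
val xml = scala.io.Source.fromFile("speaker-example-1.xml").mkString  // the <doc> example above (placeholder file name)
val doc = new Annotation(xml)
pipeline.annotate(doc)

// Coreference chains, keyed by chain id.
val chains = doc.get(classOf[CorefCoreAnnotations.CorefChainAnnotation]).asScala
chains.values.foreach(println)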
I am looking into using the neural option on my example .xml document, but you might have to put your data into the CoNLL format to run our neural coref with the CoNLL settings. The CoNLL data includes conversational data with speaker info, among other document formats.
This document contains info on the CoNLL format you'd have to use for the neural algorithm to work.
CoNLL 2012 format: http://conll.cemantix.org/2012/data.html
You need to create a folder with a similar directory structure (but you can put your files in instead)
example:
/Path/to/conll_2012_dir/v9/data/test/data/english/annotations/wb/eng/00/eng_0009.v9_auto_conll
If you run this command:
java -Xmx20g edu.stanford.nlp.coref.CorefSystem -props speaker.properties
with these properties:
coref.algorithm = clustering
coref.conll = true
coref.conllOutputPath = /Path/to/output_dir
coref.data = /Path/to/conll_2012_dir
it will write CoNLL output files to /Path/to/output_dir.
That command should read in all files ending with _auto_conll.

Unable to load data into parquet file format?

I am trying to parse log data into the Parquet file format in Hive; the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get the result
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to the Parquet file format I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried
create table log (a String ,b String ,c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
"field.delim"="||-||",
"collection.delim"="-",
"mapkey.delim"="#"
);
create table log_par(
a String ,
b String ,
c String
) stored as PARQUET ;
insert into log_par select * from log;
Aman kumar,
To resolve this issue, run the hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following:
1. On the Hive Server host, create a /usr/hdp/<version>/hive/auxlib directory.
2. Copy /usr/hdp/<version>/hive/lib/hive-contrib-<version>.jar to /usr/hdp/<version>/hive/auxlib.
3. Restart the HS2 server.
Please check these references for further details:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html.
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.

ClickHouse Kafka Engine Throwing Exception

I am trying to use the ClickHouse Kafka engine to ingest data. The data is in CSV format. During data ingestion, I sometimes get this exception:
2018.01.08 08:41:47.016826 [ 3499 ] <Debug> StorageKafka (consumer_queue): Started streaming to 1 attached views
2018.01.08 08:41:47.016906 [ 3499 ] <Trace> StorageKafka (consumer_queue): Creating formatted reader
2018.01.08 08:41:49.680816 [ 3499 ] <Error> void DB::StorageKafka::streamThread(): Code: 117, e.displayText() = DB::Exception: Expected end of line, e.what() = DB::Exception, Stack trace:
0. clickhouse-server(StackTrace::StackTrace()+0x16) [0x3221296]
1. clickhouse-server(DB::Exception::Exception(std::string const&, int)+0x1f) [0x144a02f]
2. clickhouse-server() [0x36e6ce1]
3. clickhouse-server(DB::CSVRowInputStream::read(DB::Block&)+0x1a0) [0x36e6f60]
4. clickhouse-server(DB::BlockInputStreamFromRowInputStream::readImpl()+0x64) [0x36e3454]
5. clickhouse-server(DB::IProfilingBlockInputStream::read()+0x16e) [0x2bcae0e]
6. clickhouse-server(DB::KafkaBlockInputStream::readImpl()+0x6c) [0x32f6e7c]
7. clickhouse-server(DB::IProfilingBlockInputStream::read()+0x16e) [0x2bcae0e]
8. clickhouse-server(DB::copyData(DB::IBlockInputStream&, DB::IBlockOutputStream&, std::atomic<bool>*)+0x55) [0x35b3e25]
9. clickhouse-server(DB::StorageKafka::streamToViews()+0x366) [0x32f54f6]
10. clickhouse-server(DB::StorageKafka::streamThread()+0x143) [0x32f58c3]
11. clickhouse-server() [0x40983df]
12. /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f4d115d06ba]
13. /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4d10bf13dd]
Below are the table definitions
CREATE TABLE test.consumer_queue (ID Int32, DAY Date) ENGINE = Kafka('broker-ip:port', 'clickhouse-kyt-test','clickhouse-kyt-test-group', 'CSV')
CREATE TABLE test.consumer_request ( ID Int32, DAY Date) ENGINE = MergeTree PARTITION BY DAY ORDER BY (DAY, ID) SETTINGS index_granularity = 8192
CREATE MATERIALIZED VIEW test.consumer_view TO test.consumer_request (ID Int32, DAY Date) AS SELECT ID, DAY FROM test.consumer_queue
CSV Data
10034,"2018-01-05"
10035,"2018-01-05"
10036,"2018-01-05"
10037,"2018-01-05"
10038,"2018-01-05"
10039,"2018-01-05"
ClickHouse server version is 1.1.54318.
It seems that ClickHouse reads a batch of messages from Kafka and then tries to decode all of these messages as a single CSV document.
The messages in this single CSV must therefore be separated by newline characters,
so every message should have a newline character at the end.
I am not sure whether this is a feature or a bug of ClickHouse.
You can try sending only one message to Kafka and check whether it appears correctly in ClickHouse.
If you send messages to Kafka with the kafka-console-producer.sh script, note that this script (class ConsoleProducer.scala) reads lines from a file and sends each line to the Kafka topic without a trailing newline character, so such messages cannot be processed correctly.
If you send messages with your own script or application, you can modify it to append a newline character to the end of each message; this should solve the problem. A sketch of what that might look like is below.
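With the standard Java Kafka producer client, for instance, appending the newline might look like the following Scala sketch (the broker address is the placeholder from the table definition above, and the topic name is taken from the Kafka engine arguments):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker-ip:port")  // placeholder, as in the CREATE TABLE above
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
val rows = Seq("10034,\"2018-01-05\"", "10035,\"2018-01-05\"")

// Append a newline to every CSV row so ClickHouse can split the batch correctly.
rows.foreach { row =>
  producer.send(new ProducerRecord[String, String]("clickhouse-kyt-test", row + "\n"))
}
producer.close()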
Alternatively, you can use another format for the Kafka engine, for example JSONEachRow.
I agree with @mikhail's answer; I guess you could also try kafka_row_delimiter = '\n' in the Kafka engine SETTINGS.

Using Spark Context To Read Parquet File as RDD (without using Spark-Sql Context) giving Exception

I am trying to read and write Parquet files as RDDs using Spark. I can't use the Spark SQL context in my current application (it needs the Parquet schema as a StructType, and converting it from an Avro schema gives me a cast exception in a few cases).
So I try to implement saving a Parquet file by overloading AvroParquetFormat and sending ParquetOutputFormat to Hadoop to write, in the following way:
def saveAsParquetFile[T <: IndexedRecord](records: RDD[T], path: String)(implicit m: ClassTag[T]) = {
  val keyedRecords: RDD[(Void, T)] = records.map(record => (null, record))
  spark.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
  val job = Job.getInstance(spark.hadoopConfiguration)
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
  AvroParquetOutputFormat.setSchema(job, m.runtimeClass.newInstance().asInstanceOf[IndexedRecord].getSchema())
  keyedRecords.saveAsNewAPIHadoopFile(
    path,
    classOf[Void],
    m.runtimeClass.asInstanceOf[Class[T]],
    classOf[ParquetOutputFormat[T]],
    job.getConfiguration
  )
}
This is throwing the error:
Exception in thread "main" java.lang.InstantiationException: org.apache.avro.generic.GenericRecord
I am calling the function as follows:
val file1: RDD[GenericRecord] = sc.parquetFile[GenericRecord]("/home/abc.parquet")
sc.saveAsParquetFile(file1, "/home/abc/")
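For what it is worth, the InstantiationException is most likely thrown by m.runtimeClass.newInstance() above: when T is GenericRecord, an interface, it cannot be instantiated to obtain a schema. A minimal sketch of one way around this, assuming the org.apache.parquet packages of parquet-avro are on the classpath (older builds use parquet.* instead), is to pass the Avro Schema explicitly when writing and to read with AvroParquetInputFormat so everything stays at the RDD level; saveAsParquet and readAsParquet are hypothetical helper names:

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.{AvroParquetInputFormat, AvroParquetOutputFormat, AvroWriteSupport}
import org.apache.parquet.hadoop.ParquetOutputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Write: take the Avro schema as a parameter instead of instantiating T reflectively.
def saveAsParquet(records: RDD[GenericRecord], schema: Schema, path: String)(implicit sc: SparkContext): Unit = {
  sc.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)
  val job = Job.getInstance(sc.hadoopConfiguration)
  ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])  // mirrors the original write-support setup
  AvroParquetOutputFormat.setSchema(job, schema)
  val keyed: RDD[(Void, GenericRecord)] = records.map(record => (null, record))
  keyed.saveAsNewAPIHadoopFile(
    path,
    classOf[Void],
    classOf[GenericRecord],
    classOf[ParquetOutputFormat[GenericRecord]],
    job.getConfiguration)
}

// Read: AvroParquetInputFormat hands each row back as a GenericRecord, no StructType involved.
def readAsParquet(path: String)(implicit sc: SparkContext): RDD[GenericRecord] = {
  val job = Job.getInstance(sc.hadoopConfiguration)
  sc.newAPIHadoopFile(
      path,
      classOf[AvroParquetInputFormat[GenericRecord]],
      classOf[Void],
      classOf[GenericRecord],
      job.getConfiguration)
    .map { case (_, record) => record }
}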
