Has anybody compared two CSV files in Ansible? Can you provide example code if so?
I want to compare specific columns in CSV files and output the differences to another file. I was able to do this easily using PowerShell; now I'm looking to do it directly in Ansible.
There is a csvfile lookup plugin in Ansible that can read the contents of a comma-separated CSV file. You can store the column contents in two variables, compare the two stored variables to find the differences, and redirect the result to a file.
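For instance, here is a minimal sketch using the community.general.read_csv module rather than the csvfile lookup; the file names, the column name "hostname", and the output path differences.txt are all placeholder assumptions:

- hosts: localhost
  gather_facts: false
  tasks:
    - name: Read both CSV files
      community.general.read_csv:
        path: "{{ item }}"
      loop:
        - old.csv
        - new.csv
      register: csv_data

    - name: Write values present in old.csv but missing from new.csv
      ansible.builtin.copy:
        dest: differences.txt
        content: "{{ old_rows | difference(new_rows) | join('\n') }}"
      vars:
        old_rows: "{{ csv_data.results[0].list | map(attribute='hostname') | list }}"
        new_rows: "{{ csv_data.results[1].list | map(attribute='hostname') | list }}"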
My script uses an API GET request to pull data and dump it to a CSV. Currently my script outputs the CSV using today's date. I would also like it to output a second, identical CSV. The second CSV needs to always overwrite itself and would always represent the latest data pull.
My script generates the first CSV output with:
>> /mnt/d/DGD/"Market Place Scrapes"/$FILEDATE.csv
How do I create the second output?
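One simple approach, sketched under the assumption that a fixed name like latest.csv is acceptable: copy the dated file to the fixed name at the end of each run, so it is overwritten every time:

cp /mnt/d/DGD/"Market Place Scrapes"/$FILEDATE.csv /mnt/d/DGD/"Market Place Scrapes"/latest.csv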
I have a number of small files generated from a Kafka stream, and I would like to merge them into one single file. The merge is based on date: the source folder may contain a number of older files, but I only want to merge the files for a given date into one single file.
Any suggestions?
Use something like the code below to iterate over the smaller files and aggregate them into a big one (assuming that source contains the HDFS path to your smaller files, and target is the path where you want your big result file):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode
// List every small file under `source` and append its contents to `target`
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path(source)).map(_.getPath.toUri.getPath)
  .foreach(name => spark.read.text(name).coalesce(1).write.mode(SaveMode.Append).text(target))
This example assumes text file format, but you can just as well read any Spark-supported format, and you can use different formats for source and target as well.
You should be able to use .repartition(1) to write all results to one file. If you need to split by date, consider partitionBy("your_date_value"), as in the sketch below.
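For example, a minimal sketch of that idea; df and the output path are placeholders, and the date column "your_date_value" is assumed to exist in your data:

df.repartition(1)                  // a single partition, so each date directory gets one file
  .write
  .partitionBy("your_date_value")  // one output subdirectory per distinct date
  .mode("append")
  .csv("hdfs:///path/to/merged")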
If you're working with HDFS and S3, this may also be helpful; you might even be able to use s3-dist-cp and stay within HDFS.
https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5
There's a specific option to aggregate multiple files in HDFS: the --groupBy option, based on a regular expression pattern. So if the date is in the file name, you can group based on that pattern, as shown below.
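For example (the paths and the date pattern are placeholders), s3-dist-cp concatenates files whose names match the same --groupBy capture group into one output file:

s3-dist-cp --src=hdfs:///kafka/landing --dest=hdfs:///kafka/merged --groupBy='.*(2019-11-05).*'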
You can also develop a Spark application that reads the data from the small files into a DataFrame and writes the DataFrame to the big file in append mode.
I am maintaining the data in different sheets of a CSV file, but now I want to read this through JMeter.
I know how to read a single CSV file in JMeter, so I need help reading different sheets of a single CSV file.
Can anyone please help me find a solution for this?
JMeter only reads CSV, so you would need to save each sheet as a CSV file.
Otherwise, you could try a setUp Thread Group with custom code (Beanshell or JSR223) that takes this Excel workbook and extracts each sheet into a CSV file, along the lines of the sketch below.
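For example, a rough JSR223 (Groovy) sketch, assuming Apache POI is available on JMeter's classpath and the "sheets" really live in an Excel workbook; data.xlsx and the output names are placeholders:

import org.apache.poi.ss.usermodel.WorkbookFactory

def workbook = WorkbookFactory.create(new File('data.xlsx'))
(0..<workbook.numberOfSheets).each { i ->
    def sheet = workbook.getSheetAt(i)
    // Write each sheet out as <sheet name>.csv, one comma-joined line per row
    new File("${sheet.sheetName}.csv").withWriter { out ->
        sheet.each { row ->
            out.writeLine(row.collect { it.toString() }.join(','))
        }
    }
}
workbook.close()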
A CSV file is not made for that: a CSV file contains only one sheet.
As #PMD UBIK-INGENIERIE suggests, export every sheet to a separate CSV file.
I was annoyed with JMeter writing data results to CSV as one column. When the CSV file was opened in Excel, all values would end up in one single column (which requires annoying manual copy/paste work to get to graphs). I then noticed that if I choose Export to CSV on a Listener graph, it actually exports the CSV file as separate columns in Excel, which is great.
Is it possible to have the "Write results to file" write data into separate columns by default as it does with the graph "Export to CSV"? Thanks!
I suppose you have at least two options:
Simple Data Writer, which, as I understand it, is the one you are using at the moment.
In the jmeter.properties file (JMETER_HOME\bin\jmeter.properties), uncomment and set jmeter.save.saveservice.default_delimiter=; to use ';' instead of the default ',' as the separator in the CSV files you create with "Write results to file". This will separate the values into different columns when opened in Excel:
# For use with Comma-separated value (CSV) files or other formats
# where the fields' values are separated by specified delimiters.
jmeter.save.saveservice.default_delimiter=;
Flexible File Writer from the jmeter-plugins pack implements the same functionality and looks to be more customizable.
The idea is the same as above: use ';' to separate the values written to the file:
Write file header: endTimeMillis;responseTime;latency;sentBytes;receivedBytes;isSuccessful;responseCode
Record each sample as: endTimeMillis|;|responseTime|;|latency|;|sentBytes|;|receivedBytes|;|isSuccessful|;|responseCode|\r\n
Hope this helps.
What kind of file formats can be read using PIG?
How can I store them in different formats? Say we have a CSV file and I want to store it as an MXL file; how can this be done? Whenever we use the STORE command, it makes a directory and stores the file as part-m-00000; how can I change the name of the file and overwrite the directory?
What kind of file formats can be read using PIG? How can I store them in different formats?
There are a few built-in loading and storing methods, but they are limited:
BinStorage - "binary" storage
PigStorage - loads and stores data that is delimited by something (such as tab or comma); see the sketch after this list
TextLoader - loads data line by line (i.e., delimited by the newline character)
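As a quick sketch of PigStorage (the file names and the schema are made up): load a comma-delimited CSV, then store it back tab-delimited:

A = LOAD 'input.csv' USING PigStorage(',') AS (name:chararray, num:int, fruit:chararray);
STORE A INTO 'output_dir' USING PigStorage('\t');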
Piggybank is a library of community-contributed user-defined functions; it has a number of loading and storing methods, including an XML loader, but not an XML storer.
Say we have a CSV file and I want to store it as an MXL file; how can this be done?
I assume you mean XML here... Storing in XML is a bit rough in Hadoop because it splits files on a per-reducer basis, so how do you know where to put the root tag? This likely calls for some sort of post-processing to produce well-formed XML.
One thing you can do is to write a UDF that converts your columns into an XML string:
B = FOREACH A GENERATE customudfs.DataToXML(col1, col2, col3);
For example, say col1, col2, col3 are "foo", 37, "lemons", respectively. Your UDF can output the string "<item><name>Foo</name><num>37</num><fruit>lemons</fruit></item>".
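A possible sketch of such a UDF in Java (the customudfs package and the tag names just mirror the example above):

package customudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DataToXML extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 3) {
            return null;
        }
        // Wrap the three columns in fixed XML tags
        return "<item><name>" + input.get(0) + "</name><num>" + input.get(1)
                + "</num><fruit>" + input.get(2) + "</fruit></item>";
    }
}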
Whenever we use the STORE command, it makes a directory and stores the file as part-m-00000; how can I change the name of the file and overwrite the directory?
You can't change the name of the output file to be something other than part-m-00000; that's just how Hadoop works. If you want to change the name, you should do it after the fact with something like hadoop fs -mv output/part-m-00000 newoutput/myoutputfile. This could be done with a bash script that runs the Pig script and then executes this command, as in the sketch below.
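For instance, a minimal wrapper along those lines (the script and path names are placeholders):

#!/bin/bash
# Run the Pig script, then rename its part file to something friendlier
pig myscript.pig
hadoop fs -mv output/part-m-00000 newoutput/myoutputfile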