shell script: how to merge 2 files asymmetrically [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
I have 2 files that I need to merge in order to create a report summary, as shown in the example below:
file 1 contains workflow names:
Workflow_name.log
---------------------------------------------
workflow Wf_s_m_DAI_IFDS_Account_Stage
workflow Wf_s_m_DAI_IFDS_Txn_Map
file 2 contains matching workflow-run summaries - that is, the nth summary block corresponds to the nth line in the names file:
*************** Summary ***************
Objects provided for validation: 10
Objects skipped: 7
Objects that were invalid before the validation: 0
Objects successfully validated: 3
Objects that are still invalid after the validation: 0
Validated objects that were Saved/Checked in: 0
Cannot save objects due to lock conflict: 0
*************** Summary ***************
Objects provided for validation: 14
Objects skipped: 11
Objects that were invalid before the validation: 0
Objects successfully validated: 3
Objects that are still invalid after the validation: 0
Validated objects that were Saved/Checked in: 0
Cannot save objects due to lock conflict: 0
validate completed successfully.
Expected output after merging the files:
workflow Wf_s_m_DAI_IFDS_Account_Stage
*************** Summary ***************
Objects provided for validation: 10
Objects skipped: 7
Objects that were invalid before the validation: 0
Objects successfully validated: 3
Objects that are still invalid after the validation: 0
Validated objects that were Saved/Checked in: 0
Cannot save objects due to lock conflict: 0
workflow Wf_s_m_DAI_IFDS_Txn_Map
*************** Summary ***************
Objects provided for validation: 14
Objects skipped: 11
Objects that were invalid before the validation: 0
Objects successfully validated: 3
Objects that are still invalid after the validation: 0
Validated objects that were Saved/Checked in: 0
Cannot save objects due to lock conflict: 0
validate completed successfully.
Please let me know the right approach to getting the desired output.
Thanks in advance.
Ahshan

An awk solution with getline:
Let's assume file f1 contains the workflow names and f2 the summaries:
awk -v namesFile=f1 '
  $0 == "*************** Summary ***************" {
    getline name < namesFile   # read the next name from the names file
    print name                 # print the name before the summary block
  }
  1                            # print each input line
' f2
The opening { must be on the SAME line as the pattern ($0 == "...") for the block to be associated with it.
1 is (effectively) shorthand for { print }; i.e., it simply outputs the line.
Note that there is no error handling: the number of names is expected to match the number of summary blocks.
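The same merge logic can also be sketched in Python. This is a hypothetical illustration (the merge function name and the inline sample data are mine, not from the question); reading the real f1/f2 files and printing works the same way:

```python
# Sketch of the awk logic: before each "Summary" header in the summaries,
# emit the next workflow name. Assumes, as above, that the nth header
# corresponds to the nth line of the names file.

SUMMARY_HEADER = "*************** Summary ***************"

def merge(names, summaries):
    """Interleave workflow names with their summary blocks."""
    names_it = iter(names)
    merged = []
    for line in summaries:
        if line.strip() == SUMMARY_HEADER:
            merged.append(next(names_it))  # nth header -> nth name
        merged.append(line)
    return merged

# Example with inline data standing in for the contents of f1 and f2:
names = ["workflow Wf_s_m_DAI_IFDS_Account_Stage",
         "workflow Wf_s_m_DAI_IFDS_Txn_Map"]
summaries = [SUMMARY_HEADER, "Objects skipped: 7",
             SUMMARY_HEADER, "Objects skipped: 11"]
merged = merge(names, summaries)
```

Like the awk version, this has no error handling: a surplus summary block raises StopIteration.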

Related

Limit(n) vs Show(n) performance disparity in Pyspark

I was trying to get a deeper understanding of how Spark works and was playing around with the PySpark CLI (2.4.0). I was looking for the difference between using limit(n).show() and show(n), and I ended up getting two very different performance times for two very similar queries. Below are the commands I ran. The parquet file referenced in the code has about 50 columns and is over 50 GB in size on remote HDFS.
# Create dataframe
>>> df = sqlContext.read.parquet('hdfs://hdfs.host/path/to.parquet')
# Create test1 dataframe
>>> test1 = df.select('test_col')
>>> test1.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test1.explain()
== Physical Plan ==
*(1) Project [test_col#40]
+- *(1) FileScan parquet [test_col#40]
       Batched: false,
       Format: Parquet,
       Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
       PartitionCount: 25,
       PartitionFilters: [],
       PushedFilters: [],
       ReadSchema: struct<test_col:array<bigint>>
# Create test2 dataframe
>>> test2 = df.select('test_col').limit(5)
>>> test2.schema
StructType(List(StructField(test_col,ArrayType(LongType,true),true)))
>>> test2.explain()
== Physical Plan ==
CollectLimit 5
+- *(1) Project [test_col#40]
   +- *(1) FileScan parquet [test_col#40]
          Batched: false,
          Format: Parquet,
          Location: InMemoryFileIndex[hdfs://hdfs.host/path/to.parquet],
          PartitionCount: 25,
          PartitionFilters: [],
          PushedFilters: [],
          ReadSchema: struct<test_col:array<bigint>>
Notice that the physical plan is almost identical for test1 and test2; the only difference is that test2's plan starts with "CollectLimit 5". After setting this up I ran test1.show(5) and test2.show(5). Test 1 returned the results instantaneously. Test 2 showed a progress bar with 2010 tasks and took about 20 minutes to complete (I only had one executor).
Question
Why did test 2 (with limit) perform so poorly compared to test 1 (without limit)? The data set and result set were identical and the physical plan was nearly identical.
Keep in mind:
show() is an alias for show(20) and relies internally on take(n: Int): Array[T].
limit(n: Int) returns a new Dataset and is an expensive operation that reads the whole source.
limit results in a new DataFrame and takes longer because the limit cannot be pushed down into your input file format here: Spark reads the entire dataset and only then applies the limit.
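As an analogy only (this is plain Python, not the Spark API; all names here are invented for illustration), the difference between an early-stopping take and a limit that materializes its input first can be sketched as:

```python
# How many source rows each strategy actually touches.

def make_source(n, counter):
    """Lazy row source that counts how many rows are produced."""
    for i in range(n):
        counter[0] += 1  # count rows actually read
        yield i

def take_style(rows, n):
    # take(n)-style: pull rows one at a time, stop as soon as n are collected
    out = []
    for r in rows:
        out.append(r)
        if len(out) == n:
            break
    return out

def limit_then_collect(rows, n):
    # eager limit: scan the whole input, then truncate
    return list(rows)[:n]

reads_take = [0]
take_style(make_source(1000, reads_take), 5)            # touches only 5 rows

reads_limit = [0]
limit_then_collect(make_source(1000, reads_limit), 5)   # touches all 1000 rows
```

Both return the same 5 rows, but the eager variant pays for the full scan, which mirrors the 20-minute test2 run above.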

libav gives audio duration as negative

I am trying to make a simple av player, and in some cases I am getting values correctly as below:
checking /media/timecapsule/Music/02 Baawre.mp3
[mp3 @ 0x7f0698005660] Skipping 0 bytes of junk at 2102699.
dur is 4396400640
duration is 311
However, in other places, I am getting negative durations:
checking /media/timecapsule/Music/01 Just Chill.mp3
[mp3 @ 0x7f0694005f20] Skipping 0 bytes of junk at 1318922.
dur is -9223372036854775808
duration is -653583619391
I am not sure what's causing the duration to end up negative only in some audio files. Any ideas to where I might be wrong are welcome!
Source code here https://github.com/heroic/musika/blob/master/player/library.c
(Incidentally, -9223372036854775808 is INT64_MIN, i.e. AV_NOPTS_VALUE, libav's sentinel for an unknown timestamp/duration; it suggests no duration could be determined for those files.)
I would suggest two things:
1. Make sure the failing files are not corrupt; you can use the ffmpeg command-line tool to dump their details.
2. Break the check into two if conditions, to avoid the short-circuit logic and to ensure the open actually succeeded. Your current condition,
if(!(avformat_open_input(&container, name, NULL, NULL) < 0 && avformat_find_stream_info(container, NULL) < 0)) {
only rejects a file when BOTH calls fail; split it so each error is caught on its own:
if (avformat_open_input(&container, name, NULL, NULL) < 0) {
    /* handle open failure */
}
if (avformat_find_stream_info(container, NULL) < 0) {
    /* handle missing stream info */
}
Also you can use av_dump_format to verify that the headers are correct. See the example at https://www.ffmpeg.org/doxygen/2.8/avio_reading_8c-example.html#a24
Ketan

Coffeelint indentation in Web Essentials 2013

I have 4 space indentation in my coffee files and when I am compiling those I am getting errors:
CoffeeLint: YourFile.coffee compilation failed: CoffeeLint: Line contains inconsistent indentation; context: Expected 2 got 4
I found that http://www.coffeelint.org/ actually provides an option to configure indentation, and in the Web Essentials menu there is an option to edit the global CoffeeLint settings. So I changed that option to:
"indentation": {
    "name": "indentation",
    "value": 4,
    "level": "error"
}
(changed value from 2 to 4)
But it makes no difference. I even tried to change the level from error to ignore, still no success. I even tried restarting VS and Windows. What am I doing wrong?
Update 1.
As requested in comments here is code I have:
if 1
    0
And also screenshot of it with View White Space ON:
If you are using coffeelint and you want to change the indentation value to 2 spaces then you must edit the coffeelint/lib/coffeelint.js file and change the value of the "value" to 2 as follows:
module.exports = Indentation = (function() {
  Indentation.prototype.rule = {
    name: 'indentation',
    value: 2,
    level: 'error',
    message: 'Line contains inconsistent indentation',
    description: "This rule imposes a standard number of spaces to be used for\nindentation. Since whitespace is significant in CoffeeScript, it's\ncritical that a project chooses a standard indentation format and\nstays consistent. Other roads lead to darkness. <pre> <code>#\nEnabling this option will prevent this ugly\n# but otherwise valid CoffeeScript.\ntwoSpaces = () ->\n fourSpaces = () ->\n eightSpaces = () ->\n 'this is valid CoffeeScript'\n\n</code>\n</pre>\nTwo space indentation is enabled by default."
  };
The file you edited is probably a generated file that is of no consequence.

Two apparently equal test cases coming back failed. What can cause that?

Below are a few lines from my test case. The first assertion comes back as false, but why? The second does not.
result = Parser.parse_subject(@@lexicon.scan("kill princess"), Pair.new(:noun, "bear"))
assert_equal(Parser.parse_subject(@@lexicon.scan("kill princess"), Pair.new(:noun, "bear")),
             Parser.parse_subject(@@lexicon.scan("kill princess"), Pair.new(:noun, "bear")))
assert_equal(result, result)
Here is the actual error:
Run options:
# Running tests:
.F.
Finished tests in 0.004000s, 750.0000 tests/s, 1750.0000 assertions/s.
1) Failure:
test_parse_subject(ParserTests) [test_fournineqa.rb:30]:
#<Sentence:0x21ad958 @object="princess", @subject="bear", @verb="kill"> expected but was
#<Sentence:0x21acda0 @object="princess", @subject="bear", @verb="kill">.
3 tests, 7 assertions, 1 failures, 0 errors, 0 skips
It looks like you have defined a class Sentence but have provided no way to compare two Sentence instances, leaving assert_equal comparing the identities of two objects to discover that they are not the same instance.
A simple fix would be something like:
class Sentence
  def ==(sentence)
    @subject == sentence.subject and
      @verb == sentence.verb and
      @object == sentence.object
  end
end
The first assertion compares two different objects that happen to have the same content, whereas the second compares an object with itself. "Equal" in this context means "the same instance". (Check the implementation.)
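For comparison, the same identity-vs-value-equality pitfall can be sketched in Python (hypothetical classes, not from the question):

```python
class Sentence:
    """No equality defined, so == falls back to object identity."""
    def __init__(self, subject, verb, obj):
        self.subject, self.verb, self.obj = subject, verb, obj

class ComparableSentence(Sentence):
    """Same data, but with value equality defined."""
    def __eq__(self, other):
        return ((self.subject, self.verb, self.obj)
                == (other.subject, other.verb, other.obj))

a = Sentence("bear", "kill", "princess")
b = Sentence("bear", "kill", "princess")
same_content_default = (a == b)   # False: different instances

c = ComparableSentence("bear", "kill", "princess")
d = ComparableSentence("bear", "kill", "princess")
same_content_value = (c == d)     # True: contents compared
```

Test frameworks in most languages hit exactly this: without an equality override, an assert_equal on two structurally identical objects fails.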

What are the child OIDs in an SNMP trap?

I have inherited a MIB and example documentation, and need to re-implement the code that generates traps. (For various reason the original code is lost and gone forever, but CM is not my question.)
The MIB says:
alertObjects OBJECT IDENTIFIER ::= { corpAlert 1 }
alertEvents  OBJECT IDENTIFIER ::= { corpAlert 2 }

alertDispatchTime OBJECT-TYPE
    SYNTAX      OCTET STRING
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
        "Time Event Dispatched"
    ::= { alertObjects 3 }

testFailure OBJECT IDENTIFIER ::= { alertEvents 4 }

testFailureClearTrap NOTIFICATION-TYPE
    OBJECTS
    {
        alertDispatchTime,
        [omitted]
    }
    STATUS      current
    DESCRIPTION
        "Clear prior failure"
    ::= { testFailure 0 }
Our documentation has the following snippet:
/usr/bin/snmptrap \
-v 1 \
-c public 192.168.0.2:162 [our-base-oid] 127.0.0.1 6 4 '' \
[our-base-oid].2.4.0.4.1.0 s "May 21 2007 10:19PM" \
[etc]
What I can't figure out is the OID used for the alert dispatch time. I would understand it if it were [our-base-oid].1.3.0, or even [our-base-oid].2.4.0.[our-base-oid].1.3. If we were generating a trap at { alertEvents 3 }, what would the suffix be for the individual objects?
It is possible that the MIB was updated after the documentation, so if this looks wrong to an expert then what should the OID be for the alertDispatchTime?
Thanks.
As defined here, alertDispatchTime is a scalar object (only one instance), so its instance subidentifier is always 0 (full OID is [corpAlert].1.3.0). The notification's OID is [corpAlert].2.4.0.
Assuming by "[our-base-oid]" you mean corpAlert, the snmptrap command shown doesn't look to be correct because [our-base-oid].2.4.0.4.1.0 would be testFailureClearTrap.4.1.0, which doesn't make sense: traps don't have instance subidentifiers. But I'm making some assumptions here about the parts of the MIB spec you've not included.
If you have a working system, it may help to generate a trap and inspect its contents.
