Bash Script Loop for Substitution and Traversal - bash

So I'm trying to figure out what a question from an old exam means and I'm slightly confused about one or two parts.
#!/bin/bash
awk '{$0 = tolower($0)
gsub(/[,.?;:#!\(\)]/),"",$0)
for(a=1;a<=NF;a++)
b[$a]++}
END print b[a],a}'
sort -sk2
Here is my interpretation:
target the bash script location
scan file with awk
convert string to lower case
sub all occurrences of symbols with nothing (ie. remove) and overwrite string
(here is my issue) for every field increment a by 1?
(again not sure what this is doing) b takes a's number and increments by 1?
end the for loop and print (b, a)
sort by size of the second field
I think the last four lines are my main issue. Also is it just me or is there an extra } in that question?
Thanks in advance.

The for loop is weirdly formatted. Here it is again with proper indentation:
for(a=1; a<=NF; a++)
    b[$a]++
In other words, we loop over the field positions; for each, the count in the associative array b is incremented. So if the current input line is
foo bar poo bar baz
the script will do
b["foo"]++ # a is 1; $a is $1
b["bar"]++
b["poo"]++
b["bar"]++
b["baz"]++
So now b contains a set of tokens as keys, and the number of times each occurred as their respective values. In other words, this collects word counts for each word in the input.
The case folding and removal of punctuation normalizes the input so that
Word word word, word!
will count as four occurrences of "word", rather than one each for the capitalized version, the undecorated normal form, and the ones with punctuation attached at the end. It slightly distorts e.g. words which should properly be capitalized, and conflates into homographs words which are differentiated only by capitalization (such as china porcelain vs China the country.)
The END block is executed only when all input lines have been consumed, and thus b is fully loaded with all input words from all input lines, with their final counts. (Though here, there is no valid END block actually, because the opening brace after END is missing; this is a fatal syntax error. There isn't one closing brace too many, there's one non-optional opening brace missing.)
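For readers who find the awk terse, here is a minimal sketch of the same counting logic in Ruby, which is not the language of the exam question. The helper name word_counts and the sample text are assumptions made for illustration; the punctuation set is copied from the gsub call in the question, and the final ordering only roughly imitates sort -sk2.

```ruby
# Hypothetical helper: counts words the way the awk script appears to intend.
def word_counts(text)
  counts = Hash.new(0)                                 # missing keys start at 0, like awk arrays
  text.each_line do |line|
    cleaned = line.downcase.gsub(/[,.?;:\#!()]/, "")   # fold case, drop punctuation
    cleaned.split.each { |word| counts[word] += 1 }    # one increment per field
  end
  counts
end

counts = word_counts("Word word word, word!\nfoo bar poo bar baz\n")

# Print "count word" pairs ordered by the word, i.e. by the second column.
counts.sort_by { |word, _| word }.each { |word, n| puts "#{n} #{word}" }
```

The Hash.new(0) default plays the role of awk's auto-initialised array entries, which is why b[$a]++ works in awk without any prior setup.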

Related

AWK - I need to write a one line shell command that will count all lines that

I need to write this solution as an AWK command. I am stuck on the last question:
Write a one line shell command that will count all lines in a file called "file.txt" that begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters, and end with a period.
Example(s):
This is the format of lines we want to print. Lines that do not match this format should be skipped:
(10) This is a sample line from file.txt that your script should count.
(117) And this is another line your script should count.
Lines like this, as well as other non-matching lines, should be skipped:
15 this line should not be printed
and this line should not be printed
Thanks in advance, I'm not really sure how to tackle this in one line.
This is not a homework solution service. But I think I can give a few pointers.
One idea would be to create a counter, and then print the result at the end:
awk '<COND> {c++} END {print c+0}' file.txt
I'm getting a bit confused by the terminology. First you claim that the lines should be counted, but in the examples, it says that those lines should be printed.
Now of course you could do something like this:
awk '<COND>' file.txt | wc -l
The first part will print out all lines that satisfy the condition, and the output will be piped to wc -l, which is a separate program that counts the number of lines.
Now as to what the condition <COND> should be, I leave to you. I strongly suggest that you google regular expressions and awk, it shouldn't be too hard.
I think the requirement is very clear
Write a one line shell command that will count all lines in a file called "file.txt" that begin with a decimal number in parenthesis, containing a mix of both upper and lower case letters, and end with a period.
1. begin with a decimal number in parenthesis
2. containing a mix of both upper and lower case letters
3. end with a period
Check all three conditions. Note that 2. doesn't say "only", so the line can contain other classes of characters, but it must have at least one uppercase and one lowercase letter.
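As an illustration only, the three checks can be sketched in Ruby rather than awk. The predicate name countable? and the reading of each condition (digits only inside the parentheses, ASCII letters, a literal final period) are assumptions, not something the exercise spells out.

```ruby
# Hypothetical predicate for the three conditions, under one reading of the task.
def countable?(line)
  line = line.chomp
  line.match?(/\A\([0-9]+\)/) &&   # 1. begins with a decimal number in parentheses
    line.match?(/[A-Z]/) &&        # 2a. contains at least one uppercase letter
    line.match?(/[a-z]/) &&        # 2b. contains at least one lowercase letter
    line.end_with?(".")            # 3. ends with a period
end

sample = [
  "(10) This is a sample line from file.txt that your script should count.",
  "(117) And this is another line your script should count.",
  "15 this line should not be printed",
  "and this line should not be printed",
]

puts sample.count { |line| countable?(line) }   # prints 2
```

Keeping the conditions as separate tests avoids one large regular expression and makes each requirement easy to adjust if the exercise is read differently.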
The example mixes up printing and counting. If that wording is part of the exercise, it is very poorly worded, or perhaps it assumes that the counting will be done by piping the output of a filtering script to wc; regardless, more attention should have been paid, especially for a student exercise.
Please comment if anything is not clear and I'll add more details...

Nested for each to compare two arrays doesn't match final item

I have two files of user-guids that I need to compare.
FileA contains a list sent from a client and contains duplicates and FileB is a list of user-guids from our system.
My first task is to make sure that our system has all the unique user-guids from the client's system (ie FileB contains all the user-guids that are in FileA). After that I need to determine how many of the user-guids in our system are NOT in the client's list but that's another task and is unrelated.
The files contain one guid per line so I'm reading them into arrays and using a nested for each to compare them.
Here is my code:
# Open each file of users
FileA = File.open("file_a.txt")
FileB = File.open("file_b.txt")
# Turn file_a into an array with only unique values and close the file
file_a_array = IO.readlines(FileA).uniq
FileA.close
# Turn the local file into an array, we already know each line is unique
file_b_array = IO.readlines(FileB)
FileB.close
file_a_array.each do |i|
  file_b_array.each do |j|
    if i == j
      puts i
    end
  end
end
This code again is meant to return all the matches, but in reality I was seeing all the matches except one, incidentally the last one on the list of FileB.
In trying to guess at why I was not seeing the last match I noticed that the FileA had an empty line at the end of the file but FileB did not.
Here's an example:
FileA Contents:
guid_a
guid_b
guid_c
guid_d
[empty line]
FileB Contents:
guid_a
guid_aa
guid_b
guid_bb
guid_c
guid_cc
guid_d
Notice each file contains guid_d but the results of running my code was returning the following as the matches:
guid_a
guid_b
guid_c
When I added an extra line to the end of FileB suddenly I was getting the full set.
So the question is why?
I'm adding my own answer because the two that are here while technically both correct aren't very descriptive and didn't lead me to my solution. Only after I figured it out on my own did I finally understand what they were saying.
When I was loading my files into arrays using IO.readlines the contents of each array item contained a newline character \n.
So going off the example in my original question, the reason guid_d wasn't being matched is because in file_a_array, the value being used for comparison was guid_d\n and the value in file_b_array was guid_d. The line of FileB with guid_d did not contain a newline character until I added it by adding the empty last line.
Use the chomp function.
This function removes trailing line breaks from a string and is meant to be used to sanitize input from files. As you mention in your answer, Ruby reads lines including the line break:
the reason guid_d wasn't being matched is because in file_a_array, the value being used for comparison was guid_d\n and the value in file_b_array was guid_d.
Use chomp to fix this
"guid_d\n".chomp # => "guid_d"
"guid_d".chomp # => "guid_d"
Change your program to use
IO.readlines(...).map(&:chomp)...
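A small self-contained sketch of the effect, using in-memory strings in place of the two files. The helper name common_lines and its strip_newlines keyword are made up for this example; String#lines is used because, like IO.readlines, it keeps the line terminator on each element.

```ruby
# Hypothetical helper: unique lines of text_a that also appear in text_b.
def common_lines(text_a, text_b, strip_newlines:)
  a = text_a.lines   # String#lines keeps the trailing "\n", as IO.readlines does
  b = text_b.lines
  if strip_newlines
    a = a.map(&:chomp)
    b = b.map(&:chomp)
  end
  a.uniq & b         # Array#& keeps the elements present in both arrays
end

file_a = "guid_a\nguid_b\nguid_c\nguid_d\n"                           # ends with a newline
file_b = "guid_a\nguid_aa\nguid_b\nguid_bb\nguid_c\nguid_cc\nguid_d"  # does not

p common_lines(file_a, file_b, strip_newlines: false)  # guid_d is missing
p common_lines(file_a, file_b, strip_newlines: true)   # guid_d is found
```

Using Array#& instead of the nested each loops is a design choice for brevity here; the comparison semantics are the same element-by-element equality.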
You have left out some important information about the file format and how you were reading in the contents, so I am going to make an educated guess that your comparisons are including a newline or return character. Therefore the last item in the list was different until you added the newline character.
So my question is why did I need to have an empty line in order for the last item to show up in my results of matched items?
Because then the last item will also end with a newline like all the others.

What's the difference between /\t+|,/ and /[\t+,]/ when splitting a string using Ruby?

I have a string separated by \t and ,, but the number of \t is not fixed, for example:
a="seg1\tseg2\t\tseg3,seg4"
seg2 and seg3 are separated by two \t.
So I tried to split them with
a.split(/\t+|,/)
which prints the right answer:
["seg1", "seg2", "seg3", "seg4"]
I also tried this
a.split(/[\t+,]/)
but the answer is
["seg1", "seg2", "", "seg3", "seg4"]
Why does Ruby print different results?
Because \t+ inside [] does not mean "one or more tabs", it means "a tab or a plus". Since it finds two consecutive tabs, it splits twice, and the string in the middle becomes empty.
Most special characters, like . + * ? etc, become "regular" characters when placed inside an interval (a character class). There are some exceptions, like ^ (which negates the interval when placed at the beginning), \ (which escapes the next character(s), just as it does outside intervals), ] (which closes the interval; another [ is also disallowed there) and - (which forms a range when placed between two characters). So, [\t+,] actually means '\t' or '+' or ','.
Unfortunately, I don't know of any reference for the full set of characters that do or don't need escaping inside an interval. When in doubt, I tend to escape just to be sure. In any case, an interval always matches a single character only; if you want something different you must put your quantifier outside the interval. (For example: [\t,]+, if you also accept two commas in a row; otherwise, your first regex is really the correct one.)
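The three patterns discussed above can be compared side by side. This is only a sketch; the variable names and the extra "1+2" string are made up to show that the plus sign inside the brackets is treated as a literal separator.

```ruby
text = "seg1\tseg2\t\tseg3,seg4"

alternation = text.split(/\t+|,/)    # one or more tabs, or a single comma
char_class  = text.split(/[\t+,]/)   # exactly one of: tab, plus, comma
quantified  = text.split(/[\t,]+/)   # one or more tabs/commas in a row

p alternation   # ["seg1", "seg2", "seg3", "seg4"]
p char_class    # ["seg1", "seg2", "", "seg3", "seg4"]
p quantified    # ["seg1", "seg2", "seg3", "seg4"]

p "1+2".split(/[\t+,]/)   # ["1", "2"], the plus acts as a separator
```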

Regular expression Unix shell script

I need to filter all lines with words starting with a letter followed by zero or more letters or numbers, but no special characters (basically names which could be used for c++ variable).
egrep '^[a-zA-Z][a-zA-Z0-9]*'
This works fine for words such as "a", "ab10", but it also includes words like "b.b". I understand that * at the end of expression is problem. If I replace * with + (one or more) it skips the words which contain one letter only, so it doesn't help.
EDIT:
I should be more precise. I want to find lines with any number of possible words as described above. Here is an example:
int = 5;
cout << "hello";
//some comments
In that case it should print all of the lines above, as they all include at least one word which fits the described conditions, and a line does not have to begin with a letter.
Your solution will look roughly like this example. In this case, the regex requires that the "word" be preceded by space or start-of-line and then followed by space or end-of-line. You will need to modify the boundary requirements (the parenthesized stuff) as needed.
'(^| )[a-zA-Z][a-zA-Z0-9]*( |$)'
Assuming the line ends after the word:
'^[a-zA-Z][a-zA-Z0-9]*$'
You have to add something to your expression. It might be that the rest of the line can be white space, or you can just append the end-of-line anchor (AFAIR it is $).
Your problem lies in the ^ and $ anchors that match the start and end of the line respectively. You want the line to match if it does contain a word, getting rid of the anchors does what you want:
egrep '[a-zA-Z][a-zA-Z0-9]+'
Note that the + matches only words of length 2 and higher; a * in that place would match single characters too.
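To see the effect of * versus + without the anchors, the two patterns can be tried on a few sample lines. This sketch uses Ruby's regex engine instead of egrep; for these simple patterns the two should behave the same, but that equivalence is an assumption of the example, as are the extra sample lines "a", "12345" and "b.b".

```ruby
lines = ["int = 5;", 'cout << "hello";', "//some comments", "a", "12345", "b.b"]

star_pattern = /[a-zA-Z][a-zA-Z0-9]*/   # a letter, then zero or more letters/digits
plus_pattern = /[a-zA-Z][a-zA-Z0-9]+/   # a letter, then at least one more letter/digit

p lines.grep(star_pattern)   # every line except "12345"
p lines.grep(plus_pattern)   # additionally drops "a" and "b.b"
```

With the unanchored * pattern any line containing at least one letter is selected, which matches the requirement stated in the question's edit.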

String Find/Replace Algorithm

I would like to be able to search a string for various words. When I find one, I want to split the string at that point into 3 parts (left, match, right); the matched text would be excluded, and the process would continue with the new string left+right.
Now, once I have all my matches done, I need to reverse the process by reinserting the matched words (or a replacement for them) at the point they were removed. I have never really found what I wanted in any of my searches, so I thought I would ask for input here on SO.
Please let me know if this question needs further description.
BTW - at the moment, I have a very poor algorithm that replaces matched text with a unique string token, and then replaces the tokens with the replacement text for the appropriate match after all the matches have been done.
This is the goal:
one two three four five six
match "three" replace with foo (remember we found three, and where we found it)
one two four five six
|
three
match "two four" and prevent it from being matched by anything (edited for clarity)
one five six
|
two four
|
three
at this point, you cannot match for example "one two"
all the matches have been found, now put their replacements back in (in reverse order)
one two four five six
|
three
one two foo four five six
What's the point? Preventing one match's replacement text from being matched by another pattern. (all the patterns are run at the same time and in the same order for every string that is processed)
I'm not sure the language matters, but I'm using Lua in this case.
I'll try rephrasing: I have a list of patterns I want to find in a given string. If I find one, I want to remove that part of the string so it isn't matched by anything else, but I want to keep track of where I found it so I can insert the replacement text there once I am done trying to match my list of patterns.
Here's a related question:
Shell script - search and replace text in multiple files using a list of strings
Your algorithm description is unclear. There's no exact rule where the extracted tokens should be re-inserted.
Here's an example:
1. Find 'three' in 'one two three four five six'
2. Choose one of these two to get 'foo bar' as result:
a. replace 'one two' with 'foo' and 'four five six' with 'bar'
b. replace 'one two four five six' with 'foo bar'
3. Insert 'three' back into the string 'foo bar' resulting from step 2
At step 3, does 'three' go before 'bar' or after it?
Once you've come up with clear rules for reinserting, you can easily implement the algorithm as a recursive method or as an iterative method with a replacements stack.
Given the structure of the problem, I'd probably try an algorithm based on a binary tree.
pseudocode:
for( String snippet in snippets )
{
    int location = indexOf(snippet,inputData);
    if( location != -1)
    {
        // store replacement text for a found snippet on a stack along with the
        // location where it was found
        lengthChange = getReplacementFor(snippet).length - snippet.length;
        for each replacement in foundStack
        {
            // IF the location part of the pair is greater than the location just found
            // Increment the location part of the pair by the lengthChange to account
            // for the fact that when you replace a string with a new one the location
            // of all subsequent strings will be shifted
        }
        // remove snippet
        inputData.replace(snippet, "");
    }
}
for( pair in foundStack )
{
    inputData.insert( pair.text, pair.location);
}
This is basically just doing exactly as you said in your problem description. Step through the algorithm, putting everything on a stack with the location it was found at. You use a stack so when you reinsert in the second half, it happens in reverse order so that the stored "location" applies to the current state of the inputString.
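A minimal runnable sketch of the remove-then-reinsert idea, written in Ruby rather than the asker's Lua. The helper name replace_protected and the rule format are assumptions. It deliberately omits the offset adjustment discussed in the pseudocode comments, so the recorded offsets are only exact when every replacement reinserted to the left of an earlier gap has the same length as the text it removed, as is the case in the example from the question.

```ruby
# Hypothetical helper illustrating remove-then-reinsert with a stack.
# rules: array of [pattern, replacement] string pairs, applied in order.
def replace_protected(input, rules)
  working = input.dup
  stack = []                               # [offset, replacement] pairs
  rules.each do |pattern, replacement|
    offset = working.index(pattern)        # first occurrence only
    next if offset.nil?
    stack.push([offset, replacement])      # remember where the match was found
    working[offset, pattern.length] = ""   # remove it so later patterns cannot match it
  end
  # Reinsert in reverse order of removal. No offset correction is done here,
  # so this relies on the length assumption stated above.
  stack.reverse_each { |offset, replacement| working.insert(offset, replacement) }
  working
end

rules = [
  ["three ", "foo "],        # replace "three" with "foo"
  ["two four", "two four"],  # protect "two four" by reinserting it unchanged
  ["one two", "bar"],        # can no longer match once "two four" is removed
]
puts replace_protected("one two three four five six", rules)
# prints "one two foo four five six"
```

A general version would need to shift stored offsets whenever a reinserted replacement changes the length of the text to the left of an earlier gap, which is the part the pseudocode's inner loop is trying to address.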
Edited with a potential fix for commenter's criticism. Does the commented for block within the first one account for your criticisms, or is it still buggy in certain scenarios?
