Getting module and version from a file in bash - bash

I have a text file (text.txt) which contains the following lines:
|**tashtit.liba.version**|2001.01.2012072137|
|**tashtit.gimla.version**|2001.01.2012072156|
|**chaluka.version**|2001.01.2012080754|
|**analytics.version**|2001.01.2012072142|
|**yizumim.version**|2001.01.2012072222|
I would like the following output (text2.txt):
tashtit.liba-2001.01.2012072137
tashtit.gimla-2001.01.2012072156
chaluka-2001.01.2012080754
analytics-2001.01.2012072142
yizumim-2001.01.2012072222
How can I get that using bash and some regex?
The above is an example; the answer should fit the following convention:
|**${module}.version**|$version|

sed 's/|\*\*\([a-z.]*\)\.version\*\*|\([0-9.]*\)|/\1-\2/1'
This matches a literal |** followed by a grouping of lowercase letters and periods terminated by a literal .version**|, then another grouping of digits and periods terminated by |, and replaces the whole match with the first grouping, a hyphen, and the second grouping.

sed -rn 's/(^\|\*\*)(.*)(\.version)(\*\*\|)(.*)(\|$)/\2-\5/p' file
Split each line into six groups based on the regular expression. Replace the line with the 2nd and 5th groups (the module name and the version), separated by "-", and print it.
Awk alternative:
awk -F'|' '{ gsub(/\*/, ""); sub(/\.version$/, "", $2); print $2 "-" $3 }' file
Set the field delimiter to |, strip every asterisk from the line, and remove the trailing .version from the 2nd field. Then print the 2nd and 3rd |-delimited fields joined by "-".
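If you would rather stay in the shell itself, here is a minimal regex-free sketch using read with IFS='|' and parameter expansion. The function name convert_versions is made up for illustration, and it assumes every wanted line follows the |**${module}.version**|$version| convention exactly; anything else is skipped.

```shell
# Hypothetical helper (name is an assumption, not from the question):
# turns "|**module.version**|version|" lines into "module-version".
convert_versions() {
  # Splitting on | gives: empty lead, "**module.version**", version, rest.
  while IFS='|' read -r lead name version rest; do
    case $name in
      \*\**.version\*\*) ;;   # field matches the **<module>.version** shape
      *) continue ;;          # skip lines that do not fit the convention
    esac
    name=${name#\*\*}           # drop the leading **
    name=${name%.version\*\*}   # drop the trailing .version**
    printf '%s-%s\n' "$name" "$version"
  done
}

# Example run on two of the lines from the question.
printf '%s\n' \
  '|**tashtit.liba.version**|2001.01.2012072137|' \
  '|**chaluka.version**|2001.01.2012080754|' | convert_versions
```

To process the real file you would run convert_versions < text.txt > text2.txt.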

Related

Replace pipe with comma except between curly braces in CSV in bash

I need a way to replace pipes with commas in a specific column of a CSV file. That column also contains a key whose value is a set of pipe-separated strings (any number of them, one or more).
Basically, I need to replace every pipe that is not within curly braces, i.e. {subStringX441|subStringX442|subStringX443|subStringX444} should remain untouched.
I can't use a simple sed -i -e 's/|/,/g' filename, as it would replace all pipes.
Input:
column1,column2,column3,column4,column5,column6,column7
stringX1,stringX2,stringX3,stringX41|stringX42|stringX43|stringX44={subStringX441|subStringX442|subStringX443|subStringX444}|stringX45,stringX5,stringX6,stringX7
stringY1,stringY2,stringY3,stringY41|stringY42|stringY43|stringY44={subStringY441|subStringY442|subStringY443}|stringY45,stringY5,stringY6,stringY7
Desired Output:
column1,column2,column3,column4a,column4b,column4c,column4d,column4e,column5,column6,column7
stringX1,stringX2,stringX3,stringX41,stringX42,stringX43,stringX44={subStringX441|subStringX442|subStringX443|subStringX444},stringX45,stringX5,stringX6,stringX7
stringY1,stringY2,stringY3,stringY41,stringY42,stringY43,stringY44={subStringY441|subStringY442|subStringY443},stringY45,stringY5,stringY6,stringY7
Using sed
$ sed 's/\({[^}]*\)\||/,\1/g;s/,{/{/;1s/column4/&a,&b,&c,&d,&e/' input_file
column1,column2,column3,column4a,column4b,column4c,column4d,column4e,column5,column6,column7
stringX1,stringX2,stringX3,stringX41,stringX42,stringX43,stringX44={subStringX441|subStringX442|subStringX443|subStringX444},stringX45,stringX5,stringX6,stringX7
stringY1,stringY2,stringY3,stringY41,stringY42,stringY43,stringY44={subStringY441|subStringY442|subStringY443},stringY45,stringY5,stringY6,stringY7
Regular expressions (in the strict sense) are not enough for dealing with balanced brackets (those require at least a Chomsky Type-2 grammar). I would use GNU AWK for this task in the following way. Let the content of file.txt be
stringY1,stringY2,stringY3,stringY41|stringY42|stringY43|stringY44
{subStringY441|subStringY442|subStringY443}|stringY45,stringY5,stringY6,stringY7
then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=1;i<=NF;i+=1){if($i=="{"){inside=1};if($i=="}"){inside=0};if(!inside && $i=="|"){$i=","}};print}' file.txt
output
stringY1,stringY2,stringY3,stringY41,stringY42,stringY43,stringY44
{subStringY441|subStringY442|subStringY443},stringY45,stringY5,stringY6,stringY7
Explanation: I inform GNU AWK that any single character is to be treated as a field using the FPAT variable, and that the output field separator is the empty string using the OFS variable. For every line I go through the successive fields (i.e. characters) using a for loop: if the character is { then I set the variable inside to 1, if the character is } then I set it to 0, and if we are not (!) inside and (&&) the character is |, I change it to ,. After processing all characters in the line I print it.
DISCLAIMER this solution assumes that curly brackets are never nested and every { has matching } in given line.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed ':a;s/\({[^|}]*\)|\([^}]*}\)/\1\n\2/g;ta;y/\n|/|,/' file
Replace |'s between {...}'s with newlines, then translate newlines to |'s and |'s to ,'s.
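If GNU-specific features such as FPAT are not available, the same character-by-character idea can be sketched in plain awk with substr. This is only an illustration under the same assumption as above (braces are never nested and each { is closed on the same line); the function name pipes_to_commas is made up.

```shell
# Hypothetical helper: replace | with , except between { and }.
pipes_to_commas() {
  awk '{
    out = ""; inside = 0               # reset the brace state on every line
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)             # current character
      if (c == "{") inside = 1
      else if (c == "}") inside = 0
      else if (c == "|" && !inside) c = ","
      out = out c
    }
    print out
  }' "$@"
}

# Example run on a shortened, made-up line.
printf '%s\n' 'a,b|c|d={x|y}|e,f' | pipes_to_commas
```

Like the answers above, this does not rename the header columns; that part would still need something like the 1s/column4/.../ step from the first sed answer.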

Replace spaces between two strings with symbol using sed

I have a string like this:
20.07.2010|Berlin|id 100|bd-22.10.94|Marry Scott Robinson|msc#gmail.com
I need to replace the whitespace only inside "Marry Scott Robinson" with "|", so as to get bd-22.10.94|Marry|Scott|Robinson|
There are many such rows, so the problem is to replace whitespace only between the "bd-" field and the vertical line after the name.
I'll assume that the name is always on the fifth column :
awk 'BEGIN{FS=OFS="|"}{gsub(/ /,OFS,$5)}1' file
If it is not the case, you can do :
awk 'BEGIN{FS=OFS="|"}{for(i=1;i<=NF;i++){if($i ~ /bd-/){break}};gsub(/ /,OFS,$(i+1))}1' file
Returns :
20.07.2010|Berlin|id 100|bd-22.10.94|Marry|Scott|Robinson|msc#gmail.com
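One caveat with the second command: if no field contains bd-, the loop runs off the end and $(i+1) creates new empty fields. A slightly more defensive sketch, assuming the name is always the field right after the one starting with bd-, only substitutes when such a field is found (the function name split_name_after_bd is made up):

```shell
# Hypothetical helper: turn spaces into | in the field that follows "bd-...".
split_name_after_bd() {
  awk 'BEGIN { FS = OFS = "|" }
  {
    # Stop at NF-1 so that $(i + 1) always refers to an existing field.
    for (i = 1; i < NF; i++)
      if ($i ~ /^bd-/) { gsub(/ /, OFS, $(i + 1)); break }
    print
  }' "$@"
}

# Example run on the line from the question.
printf '%s\n' '20.07.2010|Berlin|id 100|bd-22.10.94|Marry Scott Robinson|msc#gmail.com' |
  split_name_after_bd
```

Lines without a bd- field are printed unchanged.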
Perl to the rescue!
perl -lne '($before, $change, $after) = /(.*\|bd-.*?\|)(.*?)(\|.*)/;
print $before, $change =~ s/ /|/gr, $after' -- file
-n reads the input line by line, running the code for each line
-l removes newlines from input and adds them to output
the first line populates three variables with values captured from the line. $before contains everything up to the first | after bd-; $change contains what follows up to the next |, and $after contains the rest.
s/ /|/gr replaces spaces by pipes (/g for "all of them") and returns (/r) the result.
This might work for you (GNU sed):
sed 's/[^|]*/\n&\n/5;:a;s/\(\n[^\n ]*\) /\1\|/;ta;s/\n//g' file
Sometimes to fix a problem we must erect scaffolding, then fix the original problem and finally remove the scaffolding.
Here we need to isolate the field by surrounding it by newlines.
Remove the spaces between the newlines by looping until failure.
Finally, remove the scaffolding i.e. the introduced newlines.
Another perl version:
$ perl -F'\|' -ane '$F[4] =~ tr/ /|/; print join("|", @F)' foo.txt
20.07.2010|Berlin|id 100|bd-22.10.94|Marry|Scott|Robinson|msc#gmail.com
Same basic idea as Corentin's first awk example. Split each line into columns based on |, replace spaces in the 5th one with |'s, print the re-joined lines.

grep strings based on the length

Is it possible to search strings based on the length in a specific file using grep?
I tried using awk, but it did not work:
awk '$0~"^s" && length($0)==31' strings.xml
If it is not possible with grep, can it be done with some other command line tool?
You can use:
grep -E '^s.{30}$' strings.xml
The regexp matches s at the beginning of the line, followed by any 30 characters, then the end of the line. So it will match a line with exactly 31 characters beginning with s.
But the awk command is equivalent, so if it didn't work, neither will this.
By default awk splits fields on whitespace, so if you want to match the first field rather than the whole line (a field starting with s and 31 characters long), you could use:
awk '$1 ~ /^s.{30}$/ {print}' strings.xml
The ^s matches an s at the start of the field, and .{30}$ matches any character (.) exactly 30 times ({30}) up to the end of the field.
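To convince yourself that the grep and awk forms select the same lines, you can run both over a tiny sample. The sample lines and function names below are made up for the check; only the grep and awk patterns come from the answers above.

```shell
# Made-up sample: only the second line starts with s and is 31 characters long.
sample() {
  printf '%s\n' \
    'short' \
    's123456789012345678901234567890' \
    'x123456789012345678901234567890' \
    's1234567890123456789012345678901'
}

by_grep() { grep -E '^s.{30}$'; }               # s followed by exactly 30 characters
by_awk()  { awk '/^s/ && length($0) == 31'; }   # starts with s, total length 31

sample | by_grep
sample | by_awk
```

If both print the same line on the sample but nothing on the real file, the lines there are probably not what they seem, for example because of leading indentation or a trailing carriage return.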

Search for Double Quotes (") in the file and copy the whole line in different file

I have a requirement to read through all the files, look for double quotes ("), and copy the whole line to a different file. The challenge here is to identify the whole line when there is a newline character inside it.
The file format is like this: values are separated by the delimiter |*| and each record ends with |##|.
In the sample below, the second record (highlighted in green in the attached image) should go to the new file. The logic would be: check for " and, if one is found, read the record from the line after the previous |##| up to the next |##|.
10338|*|BVL-O-G-01020-R4|*||*|BVL|*||*|Y|*|Y|*||*|CFC6E82284990A7AE040800AA5644B19|*|jmorlan|*|2011.12.21 15:52:01|##|
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
10342|*|BVL-O-4-01020-R7|*||*|DVL|*||*|Y|*|Y|*||*|RRFC6E82284990A7AE040800AA5644B19|*|sppa|*|2011.12.21 15:52:01|##|
Assuming you mean that the sections between |##| should be treated as lines, the next question is: does your file contain any real newlines? If not, grep is probably not going to be very efficient, as it works on a line-by-line basis. If any real newlines are supposed to be considered part of the text, then grep is definitely going to be unhappy.
If you really want to do it in 1 go in grep:
grep -Eoz '(^|\|##\|)([^|]|\|[^#]|\|#[^#]|\|##[^|])*"([^|]|\|[^#]|\|#[^#]|\|##[^|])*(\|##\||$)'
This is looking for any sequence that starts with |##| (or is the start of the file) is followed by some characters, a quote, and some more characters, then ends with |##| (or end of file). By using -z grep will ignore any newlines in the file.
The complex "any characters" ([^|]|\|[^#]|\|#[^#]|\|##[^|])* expression is because grep is greedy. It basically looks for repeating sequences that are not |##|. Perhaps turning off greed is good, but that will depend on the power of the regexp engine in your version of grep.
But much easier, and probably faster, to use sed to break up the records and inject "NULL" line-breaks:
sed 's/|##|/\x00/g' | grep -z '"'
This is simply replacing your end of line pattern |##| with the null character, then asking grep to find quote while treating null character as end of line.
This answer provides two kinds of solution: a POSIX awk version and two GNU awk versions.
POSIX awk
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
GNU awk 1
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
GNU awk 2
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf "%s%s", $0, RT}' inputfile > outputfile
On the sample data provided in the question, all provided solutions give the following output:
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
note: It is possible that you are suffering from the carriage return problem if the file comes from a Windows machine. Please run dos2unix on the file before using it with these tools.
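If dos2unix is not installed, stripping the carriage returns with tr before the POSIX awk step is one possible workaround. The sketch below assumes the file contains no carriage returns that need to be kept; the function name quoted_records is made up.

```shell
# Hypothetical helper: print every |##|-terminated record that contains a ".
quoted_records() {
  tr -d '\r' |                            # drop Windows carriage returns first
  awk '{ rec = (n++ ? rec "\n" : "") $0 } # append the current line to the record
       /\|##\|$/ {                        # the line ends the record
         if (index(rec, "\"")) print rec  # keep records containing a double quote
         rec = ""; n = 0                  # start a new record
       }'
}

# Example run on made-up CRLF input: one plain record, one quoted two-line record.
printf 'a|*|b|##|\r\nc "d\r\ne|##|\r\n' | quoted_records
```

On the real data this would be used as quoted_records < inputfile > outputfile.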
How does this work? (POSIX)
Using a POSIX version of awk we can do
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
The idea is to build a record r by appending every line to r. If the current line ends with "|##|", then we check if the record r contains a <double quote> ". If this is the case, we print the record r and reset the record r to an empty string. If it does not contain the <double quote>, we just reset it.
How does this work? (GNU)
Using GNU awk you can do this directly using the record separator RS
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
The idea here is that the file contains various records. The OP clearly stated that the information of a record is split in fields separated by |*|, but more importantly, the records themselves are separated by |##|. So in the presented example of the OP, the first record is line1 while the second record is spread over line 2 and line 3.
In awk, you can define a record separator by means of the variable RS. In its default state, RS is the <newline> character \n which makes each line a separate record which can be referenced by $0. In POSIX, the record separator can only be a single character which separates the records, while in Gnu awk, this can be a regular expression (see addendum below).
Since the record separator of the OP is the string "|##|", optionally followed by a <newline> character \n, we need to define RS="\\|##\\|\n?". Why so complicated?
the <pipe> | symbol is the OR operation (alternation operator) in a regular expression, so we need to escape it. But since string literals that are used as regular expressions are parsed twice, we also need to escape it twice. So | becomes \\|
the \n? is there because the actual record separator seems to be the string "|##|\n", but some records, especially the last one, may not be followed by a newline character.
When you print records, using the print statement it automatically appends the output record separator ORS after each line. By default this is again a <newline> character \n. Since the record separator RS is not a part of the record $0 you need to update the value ORS to ORS="|##|\n". This time, not a regex, so you do not need to escape at all.
The statement /"/ is a shorthand for /"/{print $0} which means If the current record $0 contains a <double quote> ", then print the current record $0 followed by the output record separator ORS.
Note: since we are already using GNU awk, we can reduce the whole thing even further to:
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf "%s%s", $0, RT}' inputfile > outputfile
This makes use of the matched record separator RT, which holds the text that RS actually matched. By replacing the print statement with a printf statement, we no longer need ORS and simply append RT to the record $0 ourselves.
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual

Explained shell statement

The following statement will remove line numbers in a txt file:
cat withLineNumbers.txt | sed 's/^.......//' >> withoutLineNumbers.txt
The input file is created with the following statement (this one i understand):
nl -ba input.txt >> withLineNumbers.txt
I know the functionality of cat and i know the output is written to the 'withoutLineNumbers.txt' file. But the part of '| sed 's/^.......//'' is not really clear to me.
Thanks for your time.
That sed regular expression simply removes the first 7 characters from each line. The regular expression ^....... says "Any 7 characters at the beginning of the line." The sed argument s/^.......// substitutes the above regular expression with an empty string.
Refer to the sed(1) man page for more information.
That sed statement says to delete the first 7 characters; a dot "." means any character. There is an even easier way to do this:
awk '{print $2}' withLineNumbers.txt
You just have to print the 2nd column using awk; no regex is needed.
If your data has spaces (note that this also collapses runs of whitespace within the line):
awk '{$1="";print substr($0,2)}' withLineNumbers.txt
sed is doing a search and replace. The 's' means substitute, the next character ('/') is the separator, the search expression is '^.......', and the replace expression is an empty string (i.e. everything between the last two slashes).
The search is a regular expression. The '^' means match start of line. Each '.' means match any character. So the search expression matches the first 7 characters of each line. This is then replaced with an empty string. So what sed is doing is removing the first 7 characters of each line.
A simpler way to achieve the same thing could be:
cut -b8- withLineNumbers.txt > withoutLineNumbers.txt
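A quick round trip shows why removing a fixed 7 characters works: by default nl pads the number to a width of 6 and then adds a tab. The sketch below assumes those nl defaults were used, as in the question; the function names are made up.

```shell
# Hypothetical helpers for a round-trip check.
number_lines()  { nl -ba "$@"; }     # number every line, including blank ones
strip_numbers() { cut -c8- "$@"; }   # drop the 6-character number field and the tab

# The output should be identical to the input, spaces and blank line included.
printf 'foo bar\n\nbaz\n' | number_lines | strip_numbers
```

Unlike the awk '{print $2}' variant, this keeps everything after the prefix, including spaces inside the line.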

Resources