How to extract multiple fields with specific character lengths in Bash?

I have a file (test.csv) with a few fields, and what I want is the Title and Path, with 10 characters for the title and a few levels removed from the path. What I have done is use the awk command to pick two fields:
$ awk -F "," '{print substr($4, 1, 10)","$6}' test.csv [1]
The three levels in the path that need to be removed are not always the same. It can be /article/17/1/ or /open-organization/17/1/, so I can't use substr for field $6.
Here is the result I have:
Title,Path
Be the ope,/article/17/1/be-open-source-supply-chain
Developing,/open-organization/17/1/developing-open-leaders
Wanted result would be:
Title,Path
Be the ope,be-open-source-supply-chain
Developing,developing-open-leaders
The title is ok with 10 characters but I still need to remove 3 levels off the path.
I could use the cut command:
cut -d'/' -f5- to remove the "/.../17/1/"
But I'm not sure how this can be piped into [1].
I tried to use a for loop to get the title and the path one by one, but I have difficulty getting the awk command to run one line at a time.
I have spent hours on this with no luck. Any help would be appreciated.
Dummy Data for testing:
test.csv
Post date,Content type,Author,Title,Comment count,Path,Tags,Word count
31 Jan 2017,Article,Scott Nesbitt,Book review: Ours to Hack and to Own,0,/article/17/1/review-book-ours-to-hack-and-own,Books,660
31 Jan 2017,Article,Jason Baker,5 new guides for working with OpenStack,2,/article/17/1/openstack-tutorials,"OpenStack, How-tos and tutorials",419

You can remove that part of the string with a regex substitution that strips three "/"-separated levels (each level is matched as /[^/]+, so segments containing hyphens, like open-organization, also work):
stringZ="Be the ope,/article/17/1/be-open-source-supply-chain"
sed -E 's|(/[^/]+){3}/||' <<< "$stringZ"
Note that you need to use -i if you are going to give a file as input to sed and want it edited in place.
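To tie this back to the awk command in [1], one possible single-command version (a sketch based only on the sample data above, assuming the wanted part is always the last "/"-separated segment of the Path field) is to split $6 inside awk and print its last piece:
awk -F, -v OFS=, '{ n = split($6, parts, "/"); print substr($4, 1, 10), parts[n] }' test.csv
For the first dummy row this should give Book revie,review-book-ours-to-hack-and-own, and the header line comes out as Title,Path.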

Related

Passing parameter as control number and get table name

I have a scenario where there is a file with a control number and a table name; here is an example:
1145|report_product|N|N|
1156|property_report|N|N
I need to pass the control number 1156 and get the table name as PR; once I have the table name as PR, I need to add some text on that line.
Please help
Assuming the control file is:
# cat controlfile.txt
1145|report_product|N|N
1156|property_report|N|N
To find a specific line you can use:
grep 1156 controlfile.txt
If needed, you can save it to a variable: result=$(grep 1156 controlfile.txt)
Assuming you need to append something to this line, you can use:
sed '/^1156/s/$/ 123/' controlfile.txt
This example will add " 123" at the end of the line that starts with 1156.
If needed, add more details like what output you want or anything else to help us better understand your need.
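If the file itself should be updated rather than just printed, a possible variant (a sketch assuming GNU sed; -i.bak keeps a backup copy) is the in-place option:
sed -i.bak '/^1156/s/$/ 123/' controlfile.txt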
You need to work in two stages:
You need to find the line containing 1156.
You need to get the information from that line.
In order to find the line (as already indicated by Juranir), you can use grep:
Prompt> grep "1156" control.txt
1156|property_report|N|N
In order to get the information from that line, you need to get the second column, based on the vertical line (often referred to as a "pipe" character), for which there are different approaches. I'll give you two:
The cut approach: you can cut a line into different parts and take a character, a byte, a column, .... In this case, this is what you need:
grep "1156" control.txt | cut -d '|' -f 2
-d '|' : use the vertical line as a column separator
-f 2 : show the second field (column)
The awk approach: awk is a general "text modifier" with multiple features (showing parts of text, performing basic calculations, ...). For this case, it can be used as follows:
grep "1156" control.txt | awk -F '|' '{print $2}'
-F '|' : use the vertical line as a column separator
'{print $2}' : the awk script for showing the second field.
Oh, by the way, I've edited your question. You might press the edit button in order to learn how I did this :-)
For getting only the first letter of each underscore-separated word, uppercased to give PR:
grep "1156" control.txt | awk -F '|' '{print $2}' | awk -F '_' '{print toupper(substr($1,1,1) substr($2,1,1))}'
(something like that)
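Putting the pieces together, here is a small helper sketch (the file name, the uppercasing and the appended text " 123" are assumptions for illustration, not requirements from the question):
#!/bin/bash
# Look up the table name for a control number, derive the abbreviation
# (property_report -> PR), then append text to the matching line.
ctrl=1156
file=controlfile.txt
table=$(grep "^${ctrl}|" "$file" | cut -d'|' -f2)
abbr=$(printf '%s\n' "$table" | awk -F'_' '{print toupper(substr($1,1,1) substr($2,1,1))}')
echo "Control $ctrl -> table $table ($abbr)"
sed "/^${ctrl}|/s/$/ 123/" "$file"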

extract words matching a pattern and print character length

I have a test file which looks like this
file.txt
this is a smart boy "abc.smartxyz" is the name
what you in life doesn;t matter
abc.smartabc is here to help you.
where is the joy of life
life is joyous at "https://abc.smart/strings"
grep 'abc.smart' file.txt
this is a smart boy "abc.smartxyz" is the name
abc.smartabc is here to help you.
life is joyous at "https://abc.smart/strings"
Now I want to be able to extract all words that contain the string abc.smart from this grepped output and also print out how many characters long each one is. The output I am after is something like:
"abc.smartxyz" 14
abc.smartabc 12
"https://abc.smart/strings" 27
Please can someone help with this.
With awk
awk '{for (i=1;i<=NF;i++) if ($i~/abc.smart/) print $i,length($i)}' file
You can run it directly on the first file. Output:
"abc.smartxyz" 14
abc.smartabc 12
"https://abc.smart/strings" 27
This might work for you (GNU grep and sed):
grep -o '\S*abc\.smart\S*' file | sed 's/"/\\"/g;s/.*/echo "& $(expr length &)"/e'
Use grep to output the words containing abc.smart, then use sed's e flag to evaluate an echo command that prints each word together with its length computed by expr.
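An alternative sketch (also assuming GNU grep for \S) that skips the sed evaluation step and lets awk print the length:
grep -o '\S*abc\.smart\S*' file | awk '{print $0, length($0)}'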

Is there any way to insert new lines in between two patterns

Is there any way to insert new lines in between 2 specific patterns of characters? I want to insert a new line every time "butterfly" occurs in a text file; however, I want this new line to be inserted between the "butter" and the "fly". For example: butter\nfly
I also want to find the length of each line after splitting.
Eg:
if textfile contains:
fgsccgewvdhbejbecbecboubutterflybvdcvhkebcjl
vdjchvhecbihbutterflyglehblejkbedkbutterflyr
Then, I want a result like the following:
29 fgsccgewvdhbejbecbecboubutter
33 flybvdcvhkebcjlvdjchvhecbihbutter
22 flyglehblejkbedkbutter
4 flyr
I believe one way to tackle it would be to insert a new line using "sed" everywhere "butter" occurs and is followed by "fly", strip out any blank lines using grep with the -v flag, and then get the length of each line. However, even after trying a lot, I am unable to get the correct answer.
The sed 's' command and awk can work together:
sed -e "s/butterfly/butter\\nfly/g" < input.txt | awk '{ print length, $0 }'
This might work for you (GNU sed & bash):
sed -Ez 's/\n//g;s/(butter)(fly)/\1\n\2/g;s/^.*$/l=&;printf "%d %s\n" ${#l} &/meg' file
Slurp the file into memory using sed's -z option. Remove all existing newlines and then insert new ones between butter and fly. Using the m, g and e flags of the substitute command, each resulting line is rewritten as a bash snippet that stores the line in a variable l and prints the required "length line" format via printf, and the e flag then executes those snippets.
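Another rough sketch, this time in awk alone (assuming the file contains no blank lines): join the lines, split on the whole word, and re-attach "butter" and "fly" around each cut:
awk 'BEGIN { RS = "" }
{
  gsub(/\n/, "")                         # join all input lines into one string
  n = split($0, a, "butterfly")          # cut on the full word
  for (i = 1; i <= n; i++) {
    s = (i < n ? a[i] "butter" : a[i])   # "butter" stays before each cut
    if (i > 1) s = "fly" s               # "fly" starts the next piece
    if (length(s)) print length(s), s
  }
}' textfile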

Unix: Removing date from a string in single command

To satisfy some legacy code I had to add a date to a filename as shown below (it's definitely needed and I cannot modify the legacy code :( ). But I need to remove the date within the same command without going to a new line. This command is read from a text file, so I have to do it all within the single command.
$(echo "$file_name".`date +%Y%m%d` | sed 's/^prefix_//')
So here I am removing the prefix from the filename and appending a date to it. I also want to remove the date which I added. For example, prefix_filename.txt or prefix_filename.zip should give me the results below.
Expected output:
filename.txt
filename.zip
Current output:
filename.txt.20161002
filename.zip.20161002
Assuming all the files are formatted as filename.ext.date, you can pipe the output to the cut command and get only the 1st and 2nd fields:
~> X=filename.txt.20161002
~> echo $X | cut -d"." -f1,2
filename.txt
I am not sure that I understand your question correctly, but perhaps this does what you want:
$(echo "$file_name".`date +%Y%m%d` | sed -e 's/^prefix_//' -e 's/\.[^.]*$//')
Sample input:
cat sample
prefix_original.txt.log.tgz.10032016
prefix_original.txt.log.10032016
prefix_original.txt.10032016
prefix_one.txt.10032016
prefix.txt.10032016
prefix.10032016
Grep from the start of the string up to a literal dot "." followed by a digit.
grep -oP '^.*(?=\.\d)' sample
prefix_original.txt.log.tgz
prefix_original.txt.log
prefix_original.txt
prefix_one.txt
prefix.txt
prefix
Perhaps the following should be used instead, so that filenames without a trailing date are still printed:
grep -oP '^.*(?=\.\d)|^.*$' sample
If I understand your question correctly, you want to remove the date part from a variable, AND you already know from the context that the variable DOES contain a date part and that this part comes after the last period in the name.
In this case, the question boils down to removing the last period and what comes after.
This can be done (POSIX shell, bash, zsh, ksh) by
filename_without=${filename_with%.*}
assuming that filename_with contains the filename which has the date part in the end.
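Combined with the original command, a rough sketch (the variable names here are just for illustration):
file_name="prefix_filename.txt"
with_date="$(echo "$file_name" | sed 's/^prefix_//').$(date +%Y%m%d)"   # what the legacy step needs
without_date="${with_date%.*}"                                          # strip the trailing .YYYYMMDD again
echo "$without_date"                                                    # prints filename.txt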
% cat example
filename.txt.20161002
filename.zip.20161002
% cat example | sed "s/\.[0-9]*$//"
filename.txt
filename.zip
%

Use awk to extract value from a line

I have these two lines within a file:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
where I'd like to get the following as output using awk or sed:
3
50000
Using this sed command does not work as I had hoped, and I suspect this is due to the presence of the quotes and delimiters in my line entry.
sed -n '/WORD1/,/WORD2/p' /path/to/file
How can I extract the values I want from the file?
awk -F'[<>]' '{print $3}' input.txt
input.txt:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
Output:
3
50000
sed -e 's/[a-zA-Z."<\/>= \-]//g' file
Using sed:
sed -E 's/.*limit"*>([0-9]+)<.*/\1/' file
Explanation:
.* takes care of everything that comes before the string limit
limit"* takes care of both the lines, one with limit" and the other one with just limit
([0-9]+) takes care of matching numbers and only numbers as stated in your requirement.
\1 is actually a back-reference to the captured pattern. When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. For more details, please refer to https://www.inkling.com/read/introducing-regular-expressions-michael-fitzgerald-1st/chapter-4/capturing-groups-and
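To see the capture group on its own, a tiny illustration (made-up input, same command as above):
echo 'some-limit">42</some>' | sed -E 's/.*limit"*>([0-9]+)<.*/\1/'   # prints 42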
The script solution with parameter expansion:
#!/bin/bash
# Keep only the text between the final ">" and the closing "<" on each line.
while IFS= read -r line || test -n "$line" ; do
    value="${line%<*}"            # drop the closing tag
    printf "%s\n" "${value##*>}"  # drop everything up to the last ">"
done <"$1"
output:
$ ./ltags.sh dat/ltags.txt
3
50000
Looks like XML to me, so assuming it forms part of some valid XML, e.g.
<root>
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
</root>
You can use Perl's XML::Simple and do something like this:
perl -MXML::Simple -E '$xml = XMLin("file"); say $xml->{"first-value"}->{"content"}; say $xml->{"second-value-limit"}'
Output:
3
50000
If the XML structure is more complicated, then you may have to drill down a bit deeper to get to the values you want. If that's the case, you should edit the question to show the bigger picture.
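If xmllint (from libxml2) happens to be installed, a similar sketch against the wrapped file shown above:
xmllint --xpath 'string(/root/first-value)' file
xmllint --xpath 'string(/root/second-value-limit)' file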
Ashkan's awk solution is straightforward, but let me suggest a sed solution that accepts non-integer numbers:
sed -n 's/[^>]*>\([.[:digit:]]*\)<.*/\1/p' input.txt
This extracts the number between the first > character of the line and the following <. In my RE this "number" can be the empty string; if you don't want to accept an empty string, please add the -r option to sed and replace \([.[:digit:]]*\) with ([.[:digit:]]+).
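Spelled out, that non-empty variant would be:
sed -rn 's/[^>]*>([.[:digit:]]+)<.*/\1/p' input.txt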
