Decrease string length in pattern - bash

I have a string like so:
text1;text2;text3;text4;text5;text6
and I need to shorten the field after the 3rd occurrence of ; to, for example, 4 characters, which in this case gives text.
So far I did:
echo "$(cat pp.txt | awk -F ";" '{print $4}' | sed 's/^\(...\).*/\1/;q' )"
But the output is "text". What I need is:
text1;text2;text3;text;text5;text6
Please help

You could do it all with awk like
awk -F\; 'BEGIN {OFS=FS} {$4=substr($4, 1, 4); print}' pp.txt
You can modify the fields in awk, which we do with $4=substr($4, 1, 4). That takes the first four characters of $4 and stores them back into the fourth field. String positions in awk start at 1; a start of 0 is not portable, and some awk implementations return one character fewer for it. Then we just print the line with the updated value.
Also, set the output field separator to be the same as the one we specify on the command line so we don't change that when printing it.
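The same idea can be sketched with the field number and the length passed in as variables. The names col and len are made up for this illustration, and the sample line is the one from the question:

```shell
# Sketch: truncate field `col` to `len` characters, keeping ';' as the separator.
# `col` and `len` are hypothetical variable names chosen for this example.
line='text1;text2;text3;text4;text5;text6'
result=$(printf '%s\n' "$line" |
    awk -F';' -v OFS=';' -v col=4 -v len=4 '{ $col = substr($col, 1, len) } 1')
printf '%s\n' "$result"
```

Assigning to $col makes awk rebuild the line with OFS, which is why OFS has to be set to ; as well.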

Related

I want to use awk to print rearranged fields then print from the 4th field to the end

I have a text file containing filesize, filedate, filetime, and filepath records. The filepath can contain spaces and can be very long (classical music names). I would like to print the file with filedate, filetime, filesize, and filepath. The first part, without the filepath is easy:
awk '{print $2,$3,$1}' filelist.txt
This works, but it prints the record on two lines:
awk '{print $2,$3,$1,$1=$2=$3=""; print $0}' filelist.txt
I've tried using cut -d' ' -f '2 3 1 4-' , but that doesn't allow rearranging fields. I can fix the two line issue using sed to join. There must be a way to only use awk. In summary, I want to print the 2nd, 3rd, 1st, and from the 4th field to the end. Can anyone help?
Since the print statement in awk always prints a newline in the end (technically ORS, which defaults to a newline), your first print will break the output in two lines.
With printf, on the other hand, you completely control the output with your format string. So, you can print the first three fields with printf (without the newline), then set them to "", and just finish off with the print $0 (which is equivalent to print without arguments):
awk '{ printf("%s %s %s",$2,$3,$1); $1=$2=$3=""; print }' file
I avoid awk when I can. If I understand correctly what you have said -
while read -r size date time path
do echo "$date $time $size $path"
done < filelist.txt
You could printf instead of echo for more formatting options.
Embedded spaces in $path won't matter since it's the last field.
I have no awk at hand to test but I suppose you may use printf to format a one-line output. Just locate the third space in $0 and take a substring from that position through the end of the input line.
You may also try to swap fields before a standard print, although I'm not sure it will produce desired results...
It always helps to delimit your fields with something like <tab>, so subsequent operations are easier... (I can see you used cut without -d, so maybe your data is already tab delimited.)
echo 1 2 3 very long name |
sed -e 's/ /\t/' -e 's/ /\t/' -e 's/ /\t/' |
awk -v FS='\t' -v OFS='\t' '{print $2, $3, $1, $4}'
The first line generates data. The sed command substitutes first three spaces in each row with \t. Then the awk works flawlessly, outputting tab delimited data again (you need a reasonably new awk).
With GNU awk for gensub():
$ echo '1 2 3 4 5 6' | awk '{print $3, $2, $1, gensub(/([^ ]+ ){3}/,"",1)}'
3 2 1 4 5 6
With any awk:
$ echo '1 2 3 4 5 6' | awk '{rest=$0; sub(/([^ ]+ ){3}/,"",rest); print $3, $2, $1, rest}'
3 2 1 4 5 6

print 1st string of a line if last 5 strings match input

I have a requirement to print the first string of a line if last 5 strings match specific input.
Example: Specified input is 2
India;1;2;3;4;5;6
Japan;1;2;2;2;2;2
China;2;2;2;2
England;2;2;2;2;2
Expected Output:
Japan
England
As you can see, China is excluded as it doesn't meet the requirement (last 5 digits have to be matched with the input).
grep ';2;2;2;2;2$' file | cut -d';' -f1
$ in a regex stands for "end of line", so grep will print all the lines that end in the given string
-d';' tells cut to delimit columns by semicolons
-f1 outputs the first column
You could use awk:
awk -F';' -v v="2" -v count=5 '
{
c=0;
for(i=2;i<=NF;i++){
if($i == v) c++
if(c>=count){print $1;next}
}
}' file
where
v is the value to match
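count is the minimum number of matching values required to print the wanted string
the for loop goes through all fields delimited by ; (from the second one on) in order to find matches
Note that this script counts matches anywhere after the first field: it doesn't need the 5 values 2 to be consecutive, or to be the last 5 fields.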
With sed:
sed -n 's/^\([^;]*\).*;2;2;2;2;2$/\1/p' file
It captures and outputs the leading non-; characters of lines ending with ;2;2;2;2;2
It can be shortened with GNU sed to:
sed -nE 's/^([^;]*).*(;2){5}$/\1/p' file
awk -F\; '/;2;2;2;2;2$/{print $1}' file
Japan
England
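If the value and the number of trailing fields need to be parameters, a pure awk check of only the last fields might look like the sketch below. The variable names v and n are made up for this illustration, and the data is the sample from the question:

```shell
# Sketch: print field 1 only when the LAST n fields all equal v.
# `v` and `n` are hypothetical variable names chosen for this example.
result=$(printf '%s\n' 'India;1;2;3;4;5;6' 'Japan;1;2;2;2;2;2' 'China;2;2;2;2' 'England;2;2;2;2;2' |
    awk -F';' -v v=2 -v n=5 '
    NF > n {                          # the name plus at least n value fields
        for (i = NF - n + 1; i <= NF; i++)
            if ($i != v) next         # a mismatch among the last n fields skips the line
        print $1
    }')
printf '%s\n' "$result"
```

The NF > n guard is what excludes China, which has only four values after the name.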

How to get word count of a part of a line

The lines of the file look something like this:
<some character> ||| each line. So far i can get the total number of lines and the text for each on its own line ||| <some text>
Now I want to count the number of words in between the ||| delimiters.
What I intended to do is
awk -F '|||' '{print $2}' word_file | wc -l
but the awk part prints blank output, which suggests it is not taking ||| as a delimiter the way I want. Interestingly, if I use $1 instead of $2, it prints the whole text.
However, if I use ' ||| ' (i.e. with a space before and after), it gives me some output but does not treat the sentence between the two delimiters as one field, i.e. it prints each instead of the whole sentence if I use the following
awk -F ' ||| ' '{print $2}' word_file
How do I achieve this using a bash command
FYI
awk version -GNU Awk 4.0.1
Awk's -F option, which sets FS, the input-field separator, expects a regular expression as its value.
Thus, for ||| to be interpreted as a literal, you must \-escape the | chars, which are metacharacters in a regex context.
Given that Awk also accepts \-based escape sequences in string literals, you must double the \ instances:
awk -F '\\|\\|\\|' ...
To properly count the words (defined as whitespace-separated tokens) in field 2, you can try this:
awk -F '\\|\\|\\|' 'BEGIN { orgFS=FS } { FS=" "; $0 = $2; print NF; FS=orgFS }' word_file
This splits each input line into fields by literal |||.
By temporarily setting FS to a single space - which is a magic value that tells Awk to split into fields by any nonempty run of whitespace - we can assign $2, the value of field 2, to $0, the whole input line, which causes the new value of $0 to be split into fields again.
At that point NF reflects the number of fields in what was originally the 2nd field - i.e., the number of words - and we can print that.
Restoring FS to its original value then prepares for parsing the next input line.
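As an alternative sketch of the same word count, split() with " " as the separator splits on runs of whitespace just like the default FS, so FS does not have to be swapped back and forth. The sample line below is made up, and [|] is used instead of backslash escapes to match a literal pipe:

```shell
# Sketch: count the whitespace-separated words in the field between the ||| delimiters.
# The sample line is made up for this example.
line='x ||| so far I can get the total ||| y'
result=$(printf '%s\n' "$line" |
    awk -F '[|][|][|]' '{ print split($2, words, " ") }')
printf '%s\n' "$result"
```

split() returns the number of elements it stored in the array, which here is the word count of field 2.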
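With gawk's multi-character RS support, this might be easier (NR==2 selects the second record, i.e. the text between the first two ||| delimiters of the input):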
$ awk -v RS="\\\|\\\|\\\|" 'NR==2{print NF}' file
or if not sure how to escape the pipe, perhaps cleaner with
$ awk -v RS='[|]{3}' ...

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which tells awk to print if the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
cat x | while read -r y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into variable y, then split into an array. IFS (the internal field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third element of the array is then compared to '8'. If they are equal, the first element of the array is printed. Remember that array indices start at zero.
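A variant of the shell-only approach, sketched under the assumption that every row has exactly three non-empty columns, lets read do the splitting directly. The variable names and the three sample rows are made up for this illustration:

```shell
# Sketch: print column 1 of tab-separated rows whose column 3 equals 8.
# The three sample rows are made up for this example.
tab=$(printf '\t')
result=$(printf 'a\tx\t8\nb\ty\t3\nc\tz\t8\n' |
    while IFS="$tab" read -r first second third; do
        if [ "$third" = 8 ]; then
            printf '%s\n' "$first"
        fi
    done)
printf '%s\n' "$result"
```

Because a tab counts as IFS whitespace, consecutive tabs collapse into one, so rows with empty columns would be misread; awk -F'\t' does not have that problem.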

cut string in a specific column in bash

How can I cut the leading zeros in the third field so it will only be 6 characters?
xxx,aaa,00000000cc
rrr,ttt,0000000yhh
desired output
xxx,aaa,0000cc
rrr,ttt,000yhh
or here's a solution using awk
echo "xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3)}1'
output
xxx,aaa,0000cc
rrr,ttt,000yhh
awk uses -F (or the FS variable) to set the input field separator, and you must set OFS, the output field separator, so the commas are preserved when the modified line is printed.
sub(/srchtarget/, "replacementstring", stringToFix) uses a regular expression to look for four 0s at the front (^) of the third field ($3) and replaces them with an empty string.
The 1 is shorthand for the print statement. A longhand version of the script would be
echo "xxx,aaa,00000000cc
rrr,ttt,0000000yhh"|awk -F, -v OFS=, '{sub(/^0000/, "", $3);print}'
# ---------------------------------------------------------^^^^^^
It's all related to awk's /pattern/{action} idiom: a pattern that is true (here, 1) with no action runs the default action, print.
IHTH
If you can assume there are always three fields and you want to strip off the first four zeros in the third field you could use a monstrosity like this:
$ cat data
xxx,0000aaa,00000000cc
rrr,0000ttt,0000000yhh
$ cat data | sed 's/\([^,]\+\),\([^,]\+\),0000\([^,]\+\)/\1,\2,\3/'
xxx,0000aaa,0000cc
rrr,0000ttt,000yhh
Another more flexible solution if you don't mind piping into Python:
cat data | python -c '
import sys
for line in sys.stdin:
    print(",".join([f[4:] if i == 2 else f for i, f in enumerate(line.strip().split(","))]))
'
This says "remove the first four characters of the third field but leave all other fields unchanged".
Using awk's substr should also work:
awk -F, -v OFS=, '{$3=substr($3,5,6)}1' file
xxx,aaa,0000cc
rrr,ttt,000yhh
It just takes 6 characters starting at position 5 of field 3 and assigns them back to field 3.
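If the third field is not always 10 characters long, one possible sketch is to keep its last 6 characters whatever the length. This assumes the goal is "keep the rightmost 6 characters" and does not check that the removed characters are zeros; the variable name w is made up for this illustration:

```shell
# Sketch: keep only the last w characters of field 3, whatever its length.
# `w` is a hypothetical variable name chosen for this example.
result=$(printf '%s\n' 'xxx,aaa,00000000cc' 'rrr,ttt,0000000yhh' |
    awk -F, -v OFS=, -v w=6 'length($3) > w { $3 = substr($3, length($3) - w + 1) } 1')
printf '%s\n' "$result"
```

Fields that are already w characters or shorter are left untouched by the length($3) > w guard.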
