remove everything but first and last 6 digits - bash

I am trying to remove everything but the first number and last 6 digits from every line in a file. So far I have removed everything but the last 6 digits using sed like so:
sed -r 's/.*(.{6})/\1/' test
Would there be a way for me to modify this so that I keep the first number too? This number can be any length but will always be followed by a space. Basically, I would like to get rid of /home/usr/file and only keep 123456789 123456. Any help would be greatly appreciated!
Input line:
123456789 /home/usr/file123456
Desired Output:
123456789 123456

echo 5 /home/usr/file123456 | awk '{print $1,substr($2,length($2)-5,6)}'

Do the same thing you did for the end at the beginning.
sed -r 's/(.).*(.{6})/\1\2/' test
(I have no idea how efficient this is however. It might need to back-track for the length of the final match.)
To grab the first "field" (space separated) and the last six characters you can use:
sed -r 's/([^[:space:]]*) .*(.{6})/\1 \2/' test
Though I think the awk solution is generally a better idea.

$ echo '123456789 /home/usr123/file123456' | sed -r 's/ .*(.{6})/ \1/'
123456789 123456

Related

Delete everything before a pattern

I'm trying to clean a text file.
I want to delete everything before the first 12-digit number.
1:0:135103079189:0:0:2:0::135103079189:000011:00
A:908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Output desired:
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Here's my command, but it doesn't seem to work:
sed '/:\([0-9]\{12\}\)/d' t.txt
The d command in sed deletes the entire line when the given regex matches; you need the s command to search and replace only part of the line. However, for this problem sed is not well suited, as it doesn't support non-greedy regex.
You can use perl instead:
$ perl -pe's/^.*?(?=\d{12}:)//' ip.txt
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
.*? matches zero or more characters, as few as possible
(?=\d{12}:) only if it is followed by 12 digits ending with :
use perl -i -pe for in-place editing
some possible corner cases
$ # this is matching part of field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe's/^.*?(?=\d{12}:)//'
135103079189:23:603307102606:1
$ # this is not matching 12-digit field at end of line
$ echo 'foo:123:135103079189' | perl -pe's/^.*?(?=\d{12}:)//'
foo:123:135103079189
$ # so, add start/end of line matching cases and restrict 12-digits to whole field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe 's/^(?:.*?:)?(?=\d{12}(:|$))//'
603307102606:1
$ echo 'foo:123:135103079189' | perl -pe's/^(?:.*?:)?(?=\d{12}(:|$))//'
135103079189
Could you please try the following:
awk --re-interval 'match($0,/[0-9]{12}/){print substr($0,RSTART)}' Input_file
Since I have an old version of awk I am using --re-interval; you can remove it if you have a newer version.
This might work for you (GNU sed):
sed -n 's/[0-9]\{12\}/\n&/;s/.*\n//p' file
We only want to print specific lines so use the -n option to turn off automatic printing. If a line contains a 12 digit number, insert a newline before it. Remove any characters before and including a newline and print the result.
If you want to print lines that do not contain a 12 digit number as is, use:
sed 's/[0-9]\{12\}/\n&/;s/.*\n//' file
The crux of the problem is to identify the start of a multi-character string, insert a unique marker and delete all characters before and including the unique marker. As sed uses the newline to delimit lines, only the user can introduce newlines into the pattern space and as a result, newlines will always be unique.
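On the sample data the technique looks like this (GNU sed, which interprets \n in the replacement as a newline):

```shell
# insert a newline before the first 12-digit run, then delete
# everything up to and including that newline marker
printf '1:0:135103079189:0:0:2:0::135103079189:000011:00\n' |
  sed -n 's/[0-9]\{12\}/\n&/;s/.*\n//p'
# -> 135103079189:0:0:2:0::135103079189:000011:00
```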
Taking the nice answer from @Sundeep, in case you would like to use grep or pcregrep (macOS/BSD) you could try:
$ grep -oP '^(?:.*?:)?(?=\d{12})\K.*' file
or
$ pcregrep -o '^(?:.*?:)?(?=\d{12})\K.*' file
The \K resets the start of the reported match, so everything matched before it is discarded and only the text after the pattern is printed.
Alternative thoughts - I almost think your data is too dirty for a quick sed fix, but if it is generally all similar to your sample set then certainly pick one of the answers with sed etc. However, if you wanted to be more particular about it, you could build up a set of commands to validate the values. I like doing this for debugging and when speed isn't urgent.
Take this tiny sample of code. You could do this other ways, but here I get the value for each part of the string, and I know the order because it is contiguous. You could then set up controls on which parts to keep, and build out, say, a new string per line. Overwrought for sure, but sometimes that is the better long-term approach.
#!/bin/bash
while IFS= read -r line; do
    IFS=':' read -r -a array <<< "$line"
    for ((i=0; i<${#array[@]}; i++)); do
        echo "part : ${array[$i]}"
    done
done < "test_data.txt"
You could then build the data back up how you want, and more easily understand what's happening at every step of the way:
part : 1
part : 0
part : 135103079189
part : 0
part : 0
part : 2
part : 0
part :
part : 135103079189
part : 000011
part : 00
part : A
part : 908529896240
part : 0

Sed creating duplicates

I have used the command sed in shell to remove everything except for numbers from my string.
Now, my string contains three 0s among other numbers and after running
sed 's/[^0-9]*//g'
Instead of three 0s, I now have 0, 01 and 02.
How can I prevent sed from doing that so that I can have the three 0s?
sample of the string:
0 cat
42 dog
24 fish
0 bird
0 tiger
5 fly
Now that we know that digits in filenames in the output from the du utility caused the problem (tip of the hat to Lars Fischer), simply use cut to extract only the first column (which contains the data of interest, each file's/subdirectory's size in blocks):
du -a "$var" | cut -f1
du outputs tab-separated data, and a tab is also cut's default separator, so all that is needed is to ask for the 1st field (-f1).
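To see why this works, you can simulate du's tab-separated output (the sizes and paths below are made up for illustration):

```shell
# du prints "size<TAB>path"; cut's default delimiter is also a tab,
# so -f1 keeps only the size column
printf '4\t/home/user/file0\n8\t/home/user/dir42\n' | cut -f1
# -> 4
# -> 8
```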
In hindsight, your problem was unrelated to sed; your sample data simply wasn't representative of your actual data. It's always worth creating an MCVE (Minimal, Complete, and Verifiable Example) when asking a question.
try this:
du -a "$var" | sed -e 's/ .*//' -e 's/[^0-9]*//g'

Use awk to extract value from a line

I have these two lines within a file:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
where I'd like to get the following as output using awk or sed:
3
50000
Using this sed command does not work as I had hoped, and I suspect this is due to the presence of the quotes and delimiters in my line entry.
sed -n '/WORD1/,/WORD2/p' /path/to/file
How can I extract the values I want from the file?
awk -F'[<>]' '{print $3}' input.txt
input.txt:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
Output:
3
50000
sed -e 's/[a-zA-Z.<\/>= \-]//g' file
Using sed:
sed -E 's/.*limit"*>([0-9]+)<.*/\1/' file
Explanation:
.* takes care of everything that comes before the string limit
limit"* takes care of both the lines, one with limit" and the other one with just limit
([0-9]+) takes care of matching numbers and only numbers as stated in your requirement.
\1 is a backreference to the first capturing group. When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. For more details, please refer to https://www.inkling.com/read/introducing-regular-expressions-michael-fitzgerald-1st/chapter-4/capturing-groups-and
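Running the command on the two sample lines:

```shell
# the greedy .*limit"*> backtracks until ([0-9]+)< can match,
# leaving only the captured digits after substitution
printf '%s\n' '<first-value system-property="unique.setting.limit">3</first-value>' \
              '<second-value-limit>50000</second-value-limit>' |
  sed -E 's/.*limit"*>([0-9]+)<.*/\1/'
# -> 3
# -> 50000
```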
The script solution with parameter expansion:
#!/bin/bash
while IFS= read -r line || test -n "$line" ; do
value="${line%<*}"
printf "%s\n" "${value##*\>}"
done <"$1"
output:
$ ./ltags.sh dat/ltags.txt
3
50000
Looks like XML to me, so assuming it forms part of some valid XML, e.g.
<root>
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
</root>
You can use Perl's XML::Simple and do something like this:
perl -MXML::Simple -E '$xml = XMLin("file"); say $xml->{"first-value"}->{"content"}; say $xml->{"second-value-limit"}'
Output:
3
50000
If the XML structure is more complicated, then you may have to drill down a bit deeper to get to the values you want. If that's the case, you should edit the question to show the bigger picture.
Ashkan's awk solution is straightforward, but let me suggest a sed solution that accepts non-integer numbers:
sed -n 's/[^>]*>\([.[:digit:]]*\)<.*/\1/p' input.txt
This extracts the number between the first > character of the line and the following <. In my RE this "number" can be the empty string, if you don't want to accept an empty string please add the -r option to sed and replace \([.[:digit:]]*\) by ([.[:digit:]]+).
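For example, on a hypothetical line with a non-integer value (the ratio-value tag name is made up, not from the question):

```shell
# [.[:digit:]] matches both digits and the decimal point,
# so 3.14 is captured whole
echo '<ratio-value>3.14</ratio-value>' |
  sed -n 's/[^>]*>\([.[:digit:]]*\)<.*/\1/p'
# -> 3.14
```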

In shell, how to process this line, in order to extract the field that I want

I have some lines in a flat file. Take 2 lines, for instance:
1 aa bb 05 may 2014 cc G 14-MAY-2014 hello world
j sd az 20140505 sd G 14-MAY-2014 hello world haha
So maybe you have noticed, I can count neither the number of characters nor the number of spaces, because the lines are not well aligned, and the fourth field is sometimes like 20140505 and sometimes like 05 may 2014. So what I want is to try to match the G, or match the 14-MAY-2014. Then I can easily get the following fields: hello world or hello world haha. So can anyone help me? Thank you!
Assuming your lines are in a file called test.txt:
sed -r 's/^.*-[0-9]{4}\s//' test.txt
This is using GNU sed on a Linux system. There are many other ways. Here I simply remove anything up to and including the date from the beginning of the line.
sed -r 's/^.*-[0-9]{4}\s//'
-r = extended regex, makes things like the quantifier {4} possible
's/ ... //' = s is for substitute;
it matches the first part and replaces it with the second.
Since the second part is empty, it's a remove/delete.
^ = start of line
.* = any character, any number of times
-[0-9]{4} = a dash, followed by four digits ([0-9]), the year part of the date
\s = any white space
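Applied to the two sample lines, this gives:

```shell
# strip everything through the "-YYYY" year and the following space
printf '%s\n' '1 aa bb 05 may 2014 cc G 14-MAY-2014 hello world' \
              'j sd az 20140505 sd G 14-MAY-2014 hello world haha' |
  sed -r 's/^.*-[0-9]{4}\s//'
# -> hello world
# -> hello world haha
```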
You can make use of lookbehind regex of perl:
perl -lne '/(?<=14-MAY-2014)(.*)/ && print $1' file
It will print anything after 14-MAY-2014.
You can also use grep if it supports -P:
grep -Po '(?<=14-MAY-2014)(.*)' file
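For example (note that, unlike the sed answer above, the output keeps the space that follows the date, since the lookbehind stops right after 14-MAY-2014):

```shell
# -P enables PCRE, -o prints only the matched part
echo 'j sd az 20140505 sd G 14-MAY-2014 hello world haha' |
  grep -Po '(?<=14-MAY-2014)(.*)'
# -> " hello world haha" (with a leading space)
```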

sed: replace a character only between two positions

Sorry for this apparently simple question, but I spent too long trying to find the solution everywhere and trying different sed options.
I just need to replace all dots by commas in a text file, but just between two positions.
As an example, from:
1.3.5.7.9
to
1.3,5,7.9
So, replace . by , between positions 3 to 7.
Thanks!
EDITED: sorry, I tried to simplify the problem, but as none of the first 3 answers work due to a lack of detail in my question, let me go a bit deeper. The important point is replacing all dots by commas in an interval of positions, without knowing the rest of the string:
Here some text. I don't want to change. 10.000 usd 234.566 usd Continuation text.
More text. No need to change this part. 345 usd 76.433 usd Text going on. So on.
This is a fixed width text file, in columns, and I need to change the international format of numbers, replacing dots by commas. I just know the initial and final positions where I need to search and eventually replace the dots. Obviously, not all figures have dots (only those over 1000).
Thanks.
Rewriting the answer after the clarification of the question:
This is hard to handle with sed only, but can be simplified with other standard utilities like cut and paste:
$ start=40
$ end=64
$ paste -d' ' <(cut -c -$((start-1)) example.txt) \
> <(cut -c $((start+1))-$((end-1)) example.txt | sed 'y/./,/') \
> <(cut -c $((end+1))- example.txt)
Here some text. I don't want to change. 10,000 usd 234,566 usd Continuation text.
More text. No need to change this part. 345 usd 76,433 usd Text going on. So on.
(The > at the start of the continuation lines is just the shell's secondary prompt; the < are real.) This of course is very inefficient, but conceptually simple.
I used all the +1 and -1 stuff to get rid of extra spaces. Not sure if you need it.
A pure sed solution (brace yourself):
$ sed "s/\(.\{${start}\}\)\(.\{$((end-start))\}\)/\1\n\2\n/;h;s/.*\n\(.*\)\n.*/\1/;y/./,/;G;s/^\(.*\)\n\(.*\)\n\(.*\)\n\(.*\)$/\2\1\4/" example.txt
Here some text. I don't want to change. 10,000 usd 234,566 usd Continuation text.
More text. No need to change this part. 345 usd 76,433 usd Text going on. So on.
GNU sed:
$ sed -r "s/(.{${start}})(.{$((end-start))})/\1\n\2\n/;h;s/.*\n(.*)\n.*/\1/;y/./,/;G;s/^(.*)\n(.*)\n(.*)\n(.*)$/\2\1\4/" example.txt
Here some text. I don't want to change. 10,000 usd 234,566 usd Continuation text.
More text. No need to change this part. 345 usd 76,433 usd Text going on. So on.
I tried to simplify the regex, but it is more permissive.
echo 1.3.5.7.9 | sed -r "s/^(...).(.).(..)/\1,\2,\3/"
1.3,5,7.9
PS: It doesn't work with BSD sed.
$ echo "1.3.5.7.9" |
gawk -v s=3 -v e=7 '{
print substr($0,1,s-1) gensub(/\./,",","g",substr($0,s,e-s+1)) substr($0,e+1)
}'
1.3,5,7.9
This is rather awkward to do in pure sed. If you're not strictly constrained to sed, I suggest using another tool to do this. Ed Morton's gawk-based solution is probably the least-awkward (no pun intended) way to solve this.
Here's an example of using sed to do the grunt work, but wrapped in a bash function for simplicity:
function transform () {
line=$1
start=$2
end=$3
# Save beginning and end of line
front=$(echo "$line" | sed -e "s/\(^.\{$start\}\).*$/\1/")
back=$(echo "$line" | sed -e "s/^.\{$end\}//")
# Translate characters
line=$(echo "$line" | sed -e 'y/./,/')
# Restore unmodified beginning/end
echo "$line" | sed -e "s/^.\{$start\}/$front/" -e "s/\(^.\{$end\}\).*$/\1$back/"
}
Call this function like:
$ transform "1.3.5.7.9" 3 7
1.3,5,7.9
Thank you all.
What I found elsewhere (not my own work) as simple solutions:
For fixed width files:
awk -F "" 'OFS="";{for (j=2;j<= 5;j++) if ($j==".") $j=","}'1
Will change all dots into commas from the 2nd position to the 5th.
For tab delimited fields files:
awk -F'\t' 'OFS="\t" {for (j=2;j<=5;j++) gsub(/\./,",",$j)}'1
Will change all dots into commas from the 2nd field to the 5th.
Hope that can help someone: I couldn't imagine it would be so tough in the beginning.
