Extract only a part of data from a file

My input is test.txt which contains data in this format:
'X'=>'ABCDEF',
'X'=>'XYZ',
'X'=>'GHIJKLMN',
I want to get something like:
'ABCDEF',
'XYZ',
'GHIJKLMN',
How do I go about this in bash?
Thanks!

If the input never contains the character > elsewhere than in the "fat arrow", you can use cut:
cut -f2 -d\> file
-d specifies the delimiter, here > (backslash needed to prevent the shell from interpreting it as the redirection operator)
-f specifies which field to extract

Here's a solution using sed:
curl -sL https://git.io/fjeX4 | sed 's/^.*>//'
Sed is passed a single command: s/pattern/replacement/. The pattern is a regex that matches any characters (.*) from the beginning of the line (^) to the last '>'. The replacement is an empty string, so essentially sed is just deleting all the characters on the line up to the last >. As with the other solutions, this solution assumes that there is only one '>' on the line.
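To see why the single-'>' assumption matters, here is a small sketch comparing the greedy sed pattern with cut. The sample line with a second '>' inside the value is made up for illustration; it is not part of the question's data.

```shell
# Hypothetical input: the value itself contains a '>'.
line="'X'=>'A>B',"

# sed's .* is greedy, so everything up to the LAST '>' is removed.
sed_out=$(printf '%s\n' "$line" | sed 's/^.*>//')

# cut splits on EVERY '>', so field 2 ends at the next '>'.
cut_out=$(printf '%s\n' "$line" | cut -d'>' -f2)

printf 'sed: %s\ncut: %s\n' "$sed_out" "$cut_out"
```

Neither result is the full value 'A>B', which is why both answers only apply when '>' appears once per line.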

If the data is really uniform, then you could just run cut (on example input):
$ curl -sL https://git.io/fjeX4 | cut -d '>' -f 2
'ABCDEF',
'XYZ',
'GHIJKLMN',
You can see flag explanations on explainshell.
With awk, it would look similar:
$ curl -sL https://git.io/fjeX4 | awk -F '>' '{ print $2 }'
'ABCDEF',
'XYZ',
'GHIJKLMN',

Using awk
awk 'BEGIN{FS="=>"}{print $2}' file
'ABCDEF',
'XYZ',
'GHIJKLMN',
FS in awk stands for field separator. The code inside BEGIN is executed only at the beginning, i.e., before processing the first record. print $2 prints the second field.
A terser way of writing the above would be
awk 'BEGIN{FS="=>"}$0=$2' file
'ABCDEF',
'XYZ',
'GHIJKLMN',
The default action in awk is to print the record. Here the pattern assigns $2 to $0, so the record that gets printed is the second field; lines whose second field is empty (or evaluates to 0) are not printed.

Related

Duplicate first column of multiple text files in bash

I have multiple text files each containing two columns and I would like to duplicate the first column in each file in bash to have three columns in the end.
File:
sP100227 1
sP100267 1
sP100291 1
sP100493 1
Output file:
sP100227 sP100227 1
sP100267 sP100267 1
sP100291 sP100291 1
sP100493 sP100493 1
I tried:
txt=path/to/*.txt
echo "$(paste <(cut -f1-2 $txt) > "$txt"

Could you please try the following. It was written and tested with the shown samples in GNU awk. It adds the extra field only to lines that have exactly 2 fields.
awk 'NF==2{$1=$1 OFS $1} 1' Input_file
In case you don't care about the number of fields and simply want the value of the 1st field twice, try the following.
awk '{$1=$1 OFS $1} 1' Input_file
Or, if you only have 2 fields in your Input_file, there is no need to rewrite the complete line; simply print the fields as follows.
awk '{print $1,$1,$2}' Input_file
To save the output back into Input_file itself, append > temp && mv temp Input_file to the above solutions (after testing).

Use a temp file, with cut -f1 and paste, like so:
paste <(cut -f1 in_file) in_file > tmp_file
mv tmp_file in_file
Alternatively, use a Perl one-liner, like so:
perl -i.bak -lane 'print join "\t", $F[0], $_;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

The default delimiter in cut and paste is TAB, but your file looks to be space-separated.
You can't use the same file as input and output redirection, because when the shell opens the file for output it truncates it, so there's nothing for the program to read. Write to a new file and then rename it.
Your paste command is only being given one input file. And there's no need to use echo.
paste -d' ' <(cut -d' ' -f1 "$txt") "$txt" > "$txt.new" && mv "$txt.new" "$txt"
You can do this more easily using awk.
awk '{print $1, $0}' "$txt" > "$txt.new" && mv "$txt.new" "$txt"
GNU awk has an in-place extension, so you can use that if you like. See Save modifications in place with awk
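Since the question is about several files, the write-then-rename step can be wrapped in a loop. This is only a sketch: dup_first_column is a hypothetical helper name, mktemp is assumed to be available, and the scratch directory stands in for path/to/.

```shell
# Hypothetical helper: duplicate the first column of one file in place.
dup_first_column() {
    file=$1
    tmp=$(mktemp) || return 1
    # Write to a temporary file first; redirecting straight to "$file"
    # would truncate it before awk could read it.
    awk '{print $1, $0}' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Demo on a scratch directory standing in for path/to/.
dir=$(mktemp -d)
printf 'sP100227 1\nsP100267 1\n' > "$dir/a.txt"
printf 'sP100291 1\n' > "$dir/b.txt"

# Process each file separately so every file gets its own output.
for f in "$dir"/*.txt; do
    dup_first_column "$f"
done

cat "$dir/a.txt" "$dir/b.txt"
```

Note that the renamed file takes on the permissions of the temporary file, which may differ from those of the original.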

Try sed -Ei 's/\s*(\S+)\s+/\1 \1 /1' $txt if your fields are separated by strings of one or more whitespace characters. This uses the stream editor (sed) to replace (s///1) the first string of non-space characters (\S+) followed by a string of whitespace characters (\s+) with the same thing repeated, with intervening spaces (\1 \1 ). It keeps the rest of the line. The -E option tells sed to use extended pattern matching (+, and ( instead of \(). The -i option means do it in place, replacing the file with the output.
You could use awk and do awk '{ printf "%s %s\n",$1,$0 }'. This takes the first whitespace-delimited field ($1) and follows it with a space and the whole line ($0) followed by a newline. This is a little clearer than sed but it doesn't have the advantage of being in-place.
If you can guarantee they are delimited by only one space, with no leading spaces, you can use paste -d' ' <(cut -d' ' -f1 ${txt}) ${txt} > ${txt}.new; mv ${txt}.new ${txt}. The -d' ' sets the delimiter to space for both cut and paste. You know this but for others -f1 means extract the first -d-delimited field. The mv command replaces the input with the output.

how to grep everything between single quotes?

I am having trouble figuring out how to grep the characters between two single quotes.
I have this in a file
version: '8.x-1.0-alpha1'
and I would like to have the output like this (the version numbers can vary):
8.x-1.0-alpha1
I wrote the following but it does not work:
cat myfile.txt | grep -e 'version' | sed 's/.*\?'\(.*?\)'.*//g'
Thank you for your help.
Addition:
I used the sed command sed -n "s#version:\s*'\(.*\)'#\1#p"
I also like to remove 8.x- which I edited to sed -n "s#version:\s*'8.x-\(.*\)'#\1#p".
This command only works on Linux; it does not work on a Mac. How can I change it so that it also works on a Mac?
sed -n "s#version:\s*'8.x-\(.*\)'#\1#p"

If you just want to have that information from the file, and only that you can quickly do:
awk -F"'" '/version/{print $2}' file
Example:
$ echo "version: '8.x-1.0-alpha1'" | awk -F"'" '/version/{print $2}'
8.x-1.0-alpha1
How does this work?
An awk program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
where condition is typically an expression and action a series of commands.
-F "'": Here we tell awk to define the field separator FS to be a <single quote> '. This means that all lines will be split into fields $1, $2, ... ,$NF, with a ' between each pair of fields. We can now reference these fields by using $1 for the first field, $2 for the second, and so on up to $NF, where NF is the total number of fields per line.
/version/{print $2}: This is the condition-action pair.
condition: /version/:: The condition reads: If a substring in the current record/line matches the regular expression /version/ then do action. Here, this is simply translated as if the current line contains a substring version
action: {print $2}:: If the previous condition is satisfied, then print the second field. In this case, the second field would be what the OP requests.
There are now several things that can be done.
Improve the condition to be /^version:/ && NF==3, which reads: if the current line starts with the substring version: and the current line has 3 fields, then do action.
If you only want the first occurrence, you can tell awk to exit immediately after the find by updating the action to {print $2; exit}
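Both refinements can be combined in one command. A minimal sketch, assuming the line looks like the sample in the question (version immediately followed by a colon); the extra lines in the test file are invented for the demo.

```shell
# Sample file with a decoy line that also contains "version".
file=$(mktemp)
printf '%s\n' "name: 'demo'" "version: '8.x-1.0-alpha1'" "core_version: '8.x'" > "$file"

# Anchor the match at the line start, require exactly 3 quote-separated
# fields, and stop after the first hit.
version=$(awk -F"'" '/^version:/ && NF==3 {print $2; exit}' "$file")
echo "$version"
rm -f "$file"
```

The decoy core_version line is skipped because the pattern is anchored with ^.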

I'd use GNU grep with pcre regexes:
grep -oP "version: '\\K.*(?=')" file
where we are looking for "version: '" and then the \K directive will forget what it just saw, leaving .*(?=') to match up to the last single quote.

Try something like this: sed -n "s#version:\s*'\(.*\)'#\1#p" myfile.txt. This avoids the redundant cat and grep by finding the "version" line and extracting the contents between the single quotes.
Explanation:
the -n flag tells sed not to print lines automatically. We then use the p command at the end of our sed pattern to explicitly print when we've found the version line.
Search for pattern: version:\s*'\(.*\)'
version:\s* Match "version:" followed by any amount of whitespace
'\(.*\)' Match a single ', then capture everything up to the last ' on the line (.* is greedy)
Replace with: \1; This is the first (and only) capture group above, containing contents between single quotes.
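As for the Mac question: the likely culprit is \s, a GNU extension that the BSD sed shipped with macOS may not understand. The sketch below swaps it for the POSIX class [[:space:]] and also escapes the dot in 8.x; treat it as an assumption to test on your own machine rather than a guaranteed fix.

```shell
line="version: '8.x-1.0-alpha1'"

# [[:space:]] is the POSIX spelling of GNU's \s; 8\.x- matches a literal dot.
short=$(printf '%s\n' "$line" | sed -n "s#version:[[:space:]]*'8\.x-\(.*\)'#\1#p")
echo "$short"
```

The same expression should behave identically under GNU sed, so one command can serve both systems.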

When you only want to look at the quotes, you can use cut.
grep -e 'version' myfile.txt | cut -d "'" -f2

grep can almost do this alone:
grep -o "'.*'" file.txt
But this may also print output you don't want: it matches every line with two single quotes (') in it, and the output still has the single quotes (') around it:
'8.x-1.0-alpha1'
But sed alone can do it properly:
sed -rn "s/^version: +'([^']+)'.*/\1/p" file.txt

How do I separate a link to get the end of a URL in shell?

I have some data that looks like this
"thumbnailUrl": "http://placehold.it/150/adf4e1"
I want to know how I can get the trailing part of the URL, I want the output to be
adf4e1
I was trying to grep when starting with / and ending with " but I'm only a beginner in shell scripting and need some help.

I came up with a quick and dirty solution, using grep (with the \w shorthand, a GNU extension) and cut:
$ cat file
"thumbnailUrl": "http://placehold.it/150/adf4e1"
"anotherUrl": "http://stackoverflow.com/questions/3979680"
"thumbnailUrl": "http://facebook.com/12f"
"randortag": "http://google.com/this/is/how/we/roll/3fk19as1"
$ cat file | grep -o '/\w*"$' | cut -d'/' -f2- | cut -d'"' -f1
adf4e1
3979680
12f
3fk19as1

We could kill this with a thousand little cuts, or just one blow from Awk:
awk -F'[/"]' '{ print $(NF-1); }'
Test:
$ echo '"thumbnailUrl": "http://placehold.it/150/adf4e1"' \
| awk -F'[/"]' '{ print $(NF-1); }'
adf4e1
Filter through Awk using double quotes and slashes as field separators. This means that the trailing part ../adf4e1" is separated as {..}</>{adf4e1}<">{} where curly braces denote fields and angle brackets separators. The Awk variable NF gives the number of fields; since fields are numbered from 1, $NF is the last field. That's not the one we want, because it is blank; we want $(NF-1): the second-to-last field.
"Golfed" version:
awk -F[/\"] '$0=$(NF-1)'
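To make the field layout concrete, the following sketch prints every field of the sample line together with its index. It is only an illustration of how the [/"] separator splits the line.

```shell
line='"thumbnailUrl": "http://placehold.it/150/adf4e1"'

# Print every field with its index so the empty last field is visible.
fields=$(printf '%s\n' "$line" |
    awk -F'[/"]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
```

The last two lines of output are 8=[adf4e1] and 9=[], which is why $(NF-1) rather than $NF holds the value.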

If the original string is coming from a larger JSON object, use something like jq to extract the value you want.
For example:
$ jq -n '{thumbnail: "http://placehold.it/150/adf4e1"}' |
> jq -r '.thumbnail|split("/")[-1]'
adf4e1
(The first command just generates a valid JSON object representing the original source of your data; the second command parses it and extracts the desired value. The split function splits the URL into an array, from which you only care about the last element.)

You can also do this purely in bash using string replacement and substring removal if you wrap your string in single quotes and assign it to a variable.
#!/bin/bash
string='"thumbnailUrl": "http://placehold.it/150/adf4e1"'
string="${string//\"}"
echo "${string##*/}"
adf4e1 #output
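The same idea extends to several lines by wrapping the expansions in a function and feeding it from a loop. A sketch under assumptions: url_tail is a hypothetical helper name, and it removes only one trailing double quote (with the POSIX % expansion) instead of all quotes.

```shell
# Hypothetical helper: print the last path component of a quoted URL line.
url_tail() {
    s=${1%\"}                  # drop one trailing double quote, if present
    printf '%s\n' "${s##*/}"   # drop everything up to the last slash
}

tails=$(
    printf '%s\n' \
        '"thumbnailUrl": "http://placehold.it/150/adf4e1"' \
        '"anotherUrl": "http://stackoverflow.com/questions/3979680"' |
    while IFS= read -r line; do
        url_tail "$line"
    done
)
printf '%s\n' "$tails"
```

For a real file, replace the printf with a redirection such as done < file.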

You can do that using the 'cut' command in Linux. Cut the line using '/' as the delimiter and keep the last piece. Try it, it's fun!
Refer http://www.thegeekstuff.com/2013/06/cut-command-examples
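One catch is that cut has no way to say "last field" directly. A possible workaround, sketched here under the assumption that the rev utility is installed, is to reverse the line so that the last field becomes the first:

```shell
line='"thumbnailUrl": "http://placehold.it/150/adf4e1"'

# Strip the double quotes, reverse the line so the last field becomes the
# first, cut field 1 on '/', then reverse the result back.
last=$(printf '%s\n' "$line" | tr -d '"' | rev | cut -d/ -f1 | rev)
echo "$last"
```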

Display all fields except the last

I have a file as shown below
1.2.3.4.ask
sanma.nam.sam
c.d.b.test
I want to remove the last field from each line; the delimiter is . and the number of fields is not constant.
Can anybody help me with an awk or sed to find out the solution. I can't use perl here.

Both these sed and awk solutions work independent of the number of fields.
Using sed:
$ sed -r 's/(.*)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
Note: -r is the flag for extended regexps; on some versions of sed it is -E, so check man sed. If your version of sed has no such flag, just escape the parentheses:
sed 's/\(.*\)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
The sed solution does a greedy match up to the last . and captures everything before it, then replaces the whole line with only the captured part (n-1 fields). Use the -i option if you want the changes to be stored back to the files.
Using awk:
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file
1.2.3.4
sanma.nam
c.d.b
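The awk solution simply decrements NF, which drops the last field, and prints the remaining n-1 fields (this works in GNU awk; some other awk implementations may not rebuild the record when NF changes). To store the changes back to the file, use redirection: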
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file > tmp && mv tmp file
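If relying on NF-- is a concern, the line can instead be rebuilt from fields 1 to NF-1 in a loop. This is a sketch of an alternative, not the answer's method; note that a line without any dot is printed unchanged.

```shell
input=$(printf '%s\n' 1.2.3.4.ask sanma.nam.sam c.d.b.test)

# Rebuild each line from fields 1..NF-1 instead of modifying NF.
result=$(printf '%s\n' "$input" | awk -F. '{
    out = $1
    for (i = 2; i < NF; i++) out = out "." $i
    print out
}')
printf '%s\n' "$result"
```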

Reverse, cut, reverse back.
rev file | cut -d. -f2- | rev >newfile
Or, replace from last dot to end with nothing:
sed 's/\.[^.]*$//' file >newfile
The regex [^.] matches one character which is not dot (or newline). You need to exclude the dot because the repetition operator * is "greedy"; it will select the leftmost, longest possible match.
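The difference is easy to check on one of the sample lines. A small sketch comparing the pattern with and without the dot excluded:

```shell
line='c.d.b.test'

# Without excluding the dot, \..* starts at the FIRST dot and runs to the end.
greedy=$(printf '%s\n' "$line" | sed 's/\..*$//')

# With [^.]* the match cannot cross a dot, so only the last field is removed.
last_only=$(printf '%s\n' "$line" | sed 's/\.[^.]*$//')

printf 'greedy: %s\nlast only: %s\n' "$greedy" "$last_only"
```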

With cut on the reversed string
cat yourFile | rev | cut -d "." -f 2- | rev

If you want to keep the trailing ".", use the following:
awk '{gsub(/[^\.]*$/,"");print}' your_file

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (pls dont ask me why)
ram,doc,doc,
shaym,eng,eng,
I am using cut command
cut -d',' -f1,4,4 a.csv
But the output remains
ram,doc
shaym,eng
That means cut can only print a Field just one time. I need to print the same field twice or n times.
Why do I need this ? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.

You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
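For the "n times" part of the question, the awk approach generalizes with a loop. A sketch, assuming unquoted CSV fields as above; repeat_field is a hypothetical helper name, and it does not add the trailing comma shown in the question's desired output.

```shell
# Hypothetical helper: print field 1, then field FIELD repeated N times.
# Usage: repeat_field N FIELD < file
repeat_field() {
    awk -F, -v OFS=, -v n="$1" -v f="$2" '{
        out = $1
        for (i = 0; i < n; i++) out = out OFS $f
        print out
    }'
}

csv=$(printf '%s\n' 'ram,33,professional,doc' 'shaym,23,salaried,eng')
repeated=$(printf '%s\n' "$csv" | repeat_field 3 4)
printf '%s\n' "$repeated"
```

To get the trailing comma, change print out to print out OFS.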

$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)

using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file

As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,

I had the same problem, but instead of adding all the columns to awk, I just used (to duplicate the 2nd column):
awk -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'
