Get line number (count of newlines) when piping in bash - bash

I am converting a file of json documents to a file of differently shaped json documents using jq. I need the output documents to have a contiguous positive id. Can I access a variable that equals the number of newlines seen?
gzcat input.gz | jq -r '"{\"id\":???, \"foo\":\(.foo)}"' > output
# can anything take the place of ??? to give 0..n?

If your jq has input_line_number, you might be able to use that. Here is a transcript illustrating what it does:
$ jq 'input_line_number'
"a"
1
"b"
2
(In the above, the input line is immediately followed by the output line.)
Similarly, here is how foreach and inputs can be used together:
$ jq -n 'foreach inputs as $line (0; .+1; "line \(.) is \($line)")'
"abc"
"line 1 is abc"
123
"line 2 is 123"
If your jq does not have foreach, then you might find reduce adequate for your needs:
$ jq -s -r 'reduce .[] as $line
( [0,""]; .[0]+=1 | .[1] += "line \(.[0]) is \($line)\n")
| .[1]'
Input:
"abc"
123
Output:
line 1 is abc
line 2 is 123
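Applied to the conversion in the question, foreach and inputs might look like the sketch below. It assumes a jq recent enough to have foreach and inputs (1.5 or later) and that only the foo field is carried over; the function name number_docs is made up for illustration. Starting the counter at -1 makes the first id 0.

```shell
# Hypothetical helper: reads JSON documents on stdin and writes
# {"id": N, "foo": ...} documents, with N counting up from 0.
number_docs() {
  jq -nc 'foreach inputs as $doc (-1; . + 1; {id: ., foo: $doc.foo})'
}

# Usage, following the pipeline in the question:
#   gzcat input.gz | number_docs > output
```

Because the counter lives inside the jq program, the ids stay contiguous even if a document spans several lines.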


How to get from a file only the character with reputed value

I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contain a,b,c and each char appears in the word 2 times
Is it possible to search for 2 letters in the same script?
I managed to search for a single letter, but I only get the words where the letter appears consecutively, and those are not the only words I want. This is what I did:
egrep '([a])\1{N-1}' file
Another problem: I cannot get only the specific words. I get the whole file, with the letter I am looking for ("a") highlighted in red.
I tried using -w, but then nothing is displayed.
::: EDIT :::
I tried to turn what you did into a for loop:
i=$1
fileName=$2
letters=${@: 3}
tr -s '[:space:]' '\n' < $fileName* |
for letter in $letters; do
grep -E "^[^$letter]*($letter[^$letter]*){$i}$"
done | uniq
There are various ways to split input so that grep sees a single word per line. tr is most common. For example:
tr -s '[:space:]' '\n' < file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' < file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' < file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' < file | NofL 2 a | NofL 2 b | NofL 2 c
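The for loop from the question's edit cannot work as written, because every grep in the loop reads the same standard input and the first one consumes all of it. A minimal sketch of a loop that does work, assuming it is acceptable to hold the word list in a variable, is shown below; the name NofL_all is hypothetical and the regex is the one used in NofL.

```shell
#!/bin/bash
# Hypothetical helper: keep only the words (one per line on stdin) in which
# every listed letter occurs exactly $1 times.
NofL_all() {
  local num=$1 letter words
  shift
  words=$(cat)                       # read all candidate words once
  for letter in "$@"; do             # narrow the list one letter at a time
    words=$(printf '%s\n' "$words" |
      grep -E "^[^$letter]*($letter[^$letter]*){$num}\$")
  done
  if [ -n "$words" ]; then
    printf '%s\n' "$words"
  fi
}

printf '%s\n' 'wab 12aaabbb abababx ab ttttt baaabb zabcabc' \
              'baab baaabb cbaab ab ccabab zzz' |
  tr -s '[:space:]' '\n' | NofL_all 2 a b c
```

With the sample file from the question this should leave zabcabc and ccabab.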
Regexes are not really suited to this job, as there are more efficient ways, but it is possible using repeated matching. We first select all words; from those we select the words with exactly n a's, from those the words with exactly n b's, and so on.
Example for n=3 and a, b:
grep -Eo '[[:alnum:]]+' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$@"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n}\\b)" # \b ends the count at the end of the word
done
regex+='\w+'
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
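If the original order matters, one common alternative to sort -u is an awk filter that prints a line only the first time it is seen. The sketch below is an illustration, not part of the answer above; the function name dedupe_keep_order is made up.

```shell
# Hypothetical helper: drop repeated lines but keep first-seen order.
dedupe_keep_order() {
  awk '!seen[$0]++'
}

printf '%s\n' baaabb abababx baaabb | dedupe_keep_order
```

It can be appended to either pipeline above, e.g. ... | exactly_n_times_char 3 a b | dedupe_keep_order.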
A method using sed and bash would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; ++i)); do
c=${chars:i:1}
args+=(-e)
args+=("/^\([^$c]*[$c]\)\{$n\}[^$c]*\$/!d")
done
sed "${args[@]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.
An implementation in pure bash, without calling an external program, could be:
#!/bin/bash
readonly file=$1
readonly n=$2
readonly chars=$3
while read -ra words; do
for word in "${words[@]}"; do
for ((i = 0; i < ${#chars}; ++i)); do
c=${word//[^${chars:i:1}]}
(( ${#c} == n )) || continue 2
done
printf '%s\n' "$word"
done
done < "$file"
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}. To do this for multiple characters you can chain several such matches, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' '\n' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep, grep, and maybe some other utilities, it is usually because aliases in a GNU-ish environment (often set by a profile that was provided automatically and that you never looked at or modified) turn them into e.g. egrep --color=auto, or possibly/rarely --color=always. Using \grep, command grep, or the pathname such as /usr/bin/grep bypasses the alias, or you could just unalias it/them. Another possibility is that you have envvar(s) set, in which case you need to remove or override them, explicitly say --color=never, or (somewhat hackily) pipe the output through ... | cat, which makes [e]grep's stdout a pipe rather than a tty and thus turns off --color=auto.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,"");n=3}
{delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=n)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.

Parsing CSV records when a value is multiline

Source file looks like this:
"google.com", "vuln_example1
vuln_example2
vuln_example3"
"facebook.com", "vuln_example2"
"reddit.com", "stupidly_long_vuln_name1"
"stackoverflow.com", ""
I've been trying to get the output to be something like this but the line breaks seem to cause me no end of problems. I'm using a "while read line" job to do this because I do some processing on the columns (e.g Vulnerability count and url in this example). This is output into a jenkins job (yuk).
The basic summary of the problem is getting the linebreaks in the csv to be output into the third column while retaining the table structure. I've got a sort of weird example of the desired output below.
||hostname ||Vulnerability count|| Vulnerability list || URL ||
|google.com |3 |vuln_example1 |http://cve.com/vuln_example1|
| | |vuln_example2 |http://cve.com/vuln_example2|
| | |vuln_example3 |http://cve.com/vuln_example3|
|facebook.com |1 |vuln_example2 |http://cve.com/vuln_example2|
|reddit.com |1 |stupidly_long_vuln_name1 |http://cve.com/stupidly_long_vuln_name1|
|stackoverflow.com |0 | ||
Looking at this... I've got a feeling it might be easier by showing some code and example output.
Parsing your input with the command line below makes the problem easier (I'm assuming the inputs are correct):
perl -0777 -pe 's/([^"])\s*\n/\1 /g ; s/[",]//g' < sample.txt
This line invokes Perl to perform two regex substitutions:
s/([^"])\s*\n/\1 /g replaces a line break with a space when the line does not end with a quote ", i.e. when a host entry with all its vulnerabilities is not yet complete.
s/[",]//g removes all remaining quotes and commas.
For each host entry like this one:
"google.com", "vuln_example1
vuln_example2
vuln_example3"
You'll get:
google.com vuln_example1 vuln_example2 vuln_example3
After that, you can assume that each line holds a host followed by its set of vulnerabilities.
The example below stores each line in an array and loops through it, formatting and printing each row:
# Replace this with your own function
# to get a URL for a given vulnerability
function get_vuln_url () {
    # This just prints a placeholder URL for a non-empty argument
    [[ -z "$1" ]] || echo "http://host/$1.htm"
}
# Format one table row (see the printf help)
function print_row () {
    printf "%-20s|%5s|%-30s|%s\n" "$@"
}
# Reformat the input with the Perl line, then print the rows
perl -0777 -pe 's/([^"])\s*\n/\1 /g ; s/[",]//g' < sample.txt |
while read -r line ; do
    arr=(${line})
    print_row "${arr[0]}" "$((${#arr[@]} - 1))" "${arr[1]}" "$(get_vuln_url "${arr[1]}")"
    # Remaining vulnerabilities go on rows of their own, each with its own URL
    for v in "${arr[@]:2}" ; do
        print_row " " " " "$v" "$(get_vuln_url "$v")"
    done
done
Output:
google.com          |    3|vuln_example1                 |http://host/vuln_example1.htm
                    |     |vuln_example2                 |http://host/vuln_example2.htm
                    |     |vuln_example3                 |http://host/vuln_example3.htm
facebook.com        |    1|vuln_example2                 |http://host/vuln_example2.htm
reddit.com          |    1|stupidly_long_vuln_name1      |http://host/stupidly_long_vuln_name1.htm
stackoverflow.com   |    0|                              |
Update.
If you don't have Perl, and if your file doesn't contain tabs, you can use this command as a workaround instead:
tr '\n' '\t' < sample.txt | sed -r -e 's/([^"])\s*\t/\1 /g' -e 's/[",]//g' -e 's/\t/\n/g'
tr '\n' '\t' replaces every end of line with a tab.
The sed part acts like the Perl line, except that it deals with tabs instead of ends of line and then turns the remaining tabs back into ends of line.
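Another Perl-free option, swapped in here as an illustration rather than taken from the answer, is to count double quotes with awk: a record is complete once the running count is even. The sketch assumes that quotes inside a field are never backslash-escaped; the name join_records is hypothetical.

```shell
# Hypothetical helper: merge physical lines into one logical record per line.
# A record is complete once the running count of double quotes is even.
join_records() {
  awk '
    { buf = (open ? buf " " : "") $0       # append continuation lines
      quotes += gsub(/"/, "\"")            # count the quotes on this line
      open = (quotes % 2 == 1) }
    !open { print buf; buf = ""; quotes = 0 }
  '
}
```

Piping the result through sed 's/[",]//g', as in join_records < sample.txt | sed 's/[",]//g', should give the same one-host-per-line form that the while loop expects.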

script variable in tr

I want to make a script that is looking for special numbers.
numbers like this 153 = 1^3+5^3+3^3
bash script 153 3
153
In my script I have this kinda thing
echo "$1" | tr -d " " | sed -e 's/\([[:digit:]]\)/\1+/g' | tr '+' '^"$2"+'
That last command doesn't work, it does change something, it changes 1+5+3+ to 1^+5^+3^+
So my question is: how can I use variables in tr?
tr replaces one character with another one. It can't replace one character with a longer string. That's sed's job:
set -- 153 3
echo "$1" | \
tr -d " " | \
sed -e 's/\([[:digit:]]\)/\1^'"$2"'+/g; s/+$//'
The answer by choroba is correct. Here is a python based one-liner:
$ set -- 153 3
$ python3 -c "print('+'.join([x+'^$2' for x in list('$1')]))"
1^3+5^3+3^3
Explanation:
list will convert the string "153" to ['1', '5', '3']
[ x+'^$2' for x in <list> ] is called list comprehension. Effectively it returns another list: ['1^3', '5^3', '3^3']
Then join them with '+'
NOTE: The only reason I added this answer is that it does not require adjusting the completed string after the built-in functions have processed it.
Below are the other common approaches:
$ python3 -c "print('^$2+'.join(list('$1')) + '^$2')" # join returns "1^3+5^3+3", so "^3" is appended afterwards
$ echo "$1" | sed "s/./&^$2+/g; s/+$//" # Remove the last '+' sign from "1^3+5^3+3^3+"
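Since the goal in the question is to find numbers such as 153 = 1^3+5^3+3^3, the check itself can also be done with bash arithmetic instead of building a string. This is a minimal sketch under that assumption; the function name power_digit_sum is made up, and the input is assumed to consist of decimal digits only.

```shell
#!/bin/bash
# Hypothetical helper: print the sum of each digit of $1 raised to the power $2.
power_digit_sum() {
  local num=$1 exp=$2 sum=0 i
  for ((i = 0; i < ${#num}; i++)); do
    sum=$((sum + ${num:i:1} ** exp))   # ** binds tighter than +
  done
  printf '%s\n' "$sum"
}

# Print the number only when it equals its own digit-power sum.
n=153
if [ "$(power_digit_sum "$n" 3)" -eq "$n" ]; then
  printf '%s\n' "$n"
fi
```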

Stopping paste after any input is exhausted

I have two programs that produce data on stdout, and I'd like to paste their output together. I can successfully do this like so:
paste <(./prog1) <(./prog2)
But I find that this method will print all lines from both inputs,
and what I really want is to stop paste after either input program is finished.
So if ./prog1 produces the output:
a
b
c
But ./prog2 produces:
Hello
World
I would expect the output:
a Hello
b World
Also note that one of the input programs may actually produce infinite output, and I want to be able to handle that case as well. For example, if my inputs are yes and ./prog2, I should get:
y Hello
y World
Use join instead, with a variation on the Schwartzian transform:
numbered () {
nl -s $'\t' -ba -nrz
}
join -t $'\t' -j 1 <(prog1 | numbered) <(prog2 | numbered) | cut -f 2-
Piping to nl prefixes each line with a zero-padded number followed by a tab, and join -j 1 joins corresponding lines that have the same number, using the tab as the field separator. The extra lines in the longer input have no join partner and are omitted. Once the join is complete, cut removes the line numbers.
Here's one solution:
while IFS= read -r -u7 a && IFS= read -r -u8 b; do echo "$a $b"; done 7<"$file1" 8<"$file2"
This has the slightly annoying effect of ignoring the last line of an input file if it is not terminated with a newline (but such a file is not a valid text file).
You can wrap this in a function, of course:
paste_short() {
(
while IFS= read -r -u7 a && IFS= read -r -u8 b; do
echo "$a $b"
done
) 7<"$1" 8<"$2"
}
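To use the loop with program output rather than files, and with one endless producer, process substitution can supply the two arguments. The sketch below repeats the read loop so that it runs on its own, and uses yes and printf as stand-ins for the two programs.

```shell
#!/bin/bash
# Same read loop as paste_short above, repeated so the snippet runs on its own.
paste_short() {
  while IFS= read -r -u7 a && IFS= read -r -u8 b; do
    echo "$a $b"
  done 7<"$1" 8<"$2"
}

# yes never terminates by itself; the loop ends when the second input does.
paste_short <(yes) <(printf 'Hello\nWorld\n')
```

Once the loop ends and both descriptors are closed, yes should be stopped by a broken pipe on its next write.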
Consider using awk:
awk 'FNR==NR{a[++i]=$0;next} FNR>i{exit}
{print a[FNR], $0}' <(printf "hello\nworld\n") <(printf "a\nb\nc\n")
hello a
world b
Pass the program with the longer (or endless) output as the second input, because the first input is read completely into memory before any line is printed.

help on sorting a file using sort

I have this file:
100: pattern1
++++++++++++++++++++
1:pattern2
9:pattern2
+++++++++++++++++++
79: pattern1
61: pattern1
+++++++++++++++++++
and I want to sort it like this:
++++++++++++++++++++
1:pattern2
9:pattern2
+++++++++++++++++++
61:pattern1
79:pattern1
100:pattern1
+++++++++++++++++++
Is it possible using Linux sort command only ?
If I had :
4:pat1
3:pat2
2:pat2
1:pat1
O/p should be:
1:pat1
++++++++++++
2:pat2
3:pat2
++++++++++++
4:pat1
So, want to sort on first group, but "group" on the pattern of second group.
Please note, the thing after : is a regex pattern not a literal.
Best you can do is to sort it according to the numerical values. But you cannot do anything with the "+"-string.
$ sort -n input
+++++++++++++++++++
+++++++++++++++++++
++++++++++++++++++++
1:wow
9:wow
61: this is it
79: this is it
100: this is it
I don't believe sort alone can do what you need.
Create a new shell script and put this in its contents (ie mysort.sh):
#!/bin/bash
IFS=$'\n' # Split the for loop below on newlines instead of whitespace ($'...' needs bash).
delim=+++++++++++++++++++
for l in $(grep -v '^+' | sort -g) # Ignore all + lines and sort by number
do
    current=$(echo "$l" | sed 's/^[0-9]*://') # Get what comes after the number
    if [ -n "$prev" ] && [ "$prev" != "$current" ] # If it has changed...
    then # then output a ++++ delimiter line.
        echo "$delim"
    fi
    prev=$current
    echo "$l" # Output this line.
done
To use it, feed it the contents of your file like so:
bash mysort.sh < input
Probably not -- it's not in the sort of format sort(1) expects. And if you did it would be one of those amazing hacks, not easily used. If you have some sort of rule for what goes between the lines of plus signs, you can do it readily enough with an AWK or Perl or Python script.
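As a rough illustration of the AWK route, the sketch below sorts the numbered lines and prints a delimiter whenever the text after the colon changes. It assumes that "same group" means the text after the colon is literally identical (ignoring leading blanks), which may be stricter than the regex matching the question asks for; the name group_sorted and the delimiter length are arbitrary.

```shell
# Hypothetical helper: sort "number:pattern" lines numerically and print a
# delimiter line whenever the text after the colon changes.
group_sorted() {
  grep -v '^+' | sort -n |
    awk -v delim='++++++++++++' '
      { key = $0; sub(/^[0-9]+:[ \t]*/, "", key) }   # text after "number:"
      NR > 1 && key != prev { print delim }
      { print; prev = key }
    '
}

printf '%s\n' 4:pat1 3:pat2 2:pat2 1:pat1 | group_sorted
```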
If your input was space delimited, not ':' delimited:
sort -rk2 | uniq -D -f1
will do the grouping;
I guess you'd need to sort the 'subsections' afterwards (unfortunately my sort(1) doesn't do composite key ordering; I believe there are versions that allow sort -k2,1n, and then you'd be done at once).
Use --all-repeated=separate instead of -D to get blank separators between groups. Look at man uniq for more ideas!
However, since your input is colon delimited, a hack is required:
sed 's/\([0123456789]\+\):/\1 /' t | sort -rk2 | uniq -D -f1
HTH
