Logstash: parsing only lines that contain a value and fetching several items from them - elasticsearch

I have log input lines.
I want my filter to keep only the lines that contain the word "Add"
(the word can appear anywhere in the line)
and extract some values from each of those lines
to get something like this (in JSON format):
Action: Add, val1: 12, val2: 15
Action: Add, val1: 11, val2: 12
from these input lines:
ifoeife, Add, val1:12, val2:15
eife, frfr, 90088, Add, val1:11, val2:12
eife, val1:11, val2:12
[val1, val2, and Action are the indexed fields]

You can use the Grok filter. You can either build a more elaborate pattern or just use regular expressions, as described here: https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html#_regular_expressions
A regexp for your lines would be something like:
[a-zA-Z0-9,]*?(Add|SomeOtherPossibleAction), (val1:\d+), (val2:\d+)
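To see how such a pattern behaves on the sample lines, here is a minimal sketch in plain Ruby (not Logstash configuration). The method name extract_fields and the named groups are assumptions made for illustration. It keeps only lines containing "Add" and returns the three fields as a hash.

```ruby
# Sketch only: plain Ruby, not a Logstash filter.
# extract_fields is a hypothetical helper name.
def extract_fields(line)
  # Unanchored search, so "Add" may appear anywhere in the line.
  m = /(?<action>Add), val1:(?<val1>\d+), val2:(?<val2>\d+)/.match(line)
  return nil unless m # the line has no "Add" entry, so it is filtered out

  { "Action" => m[:action], "val1" => m[:val1].to_i, "val2" => m[:val2].to_i }
end

lines = [
  "ifoeife, Add, val1:12, val2:15",
  "eife, frfr, 90088, Add, val1:11, val2:12",
  "eife, val1:11, val2:12",
]

# compact drops the nil results of the non-matching lines.
events = lines.map { |l| extract_fields(l) }.compact
events.each { |e| p e }
```

The third sample line yields nil and is dropped, which mirrors the "only lines that contain Add" requirement.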

Related

NIFI: Unable to extract two values from a list during each iteration over a loop

I would like to retrieve a large SQL dump between date ranges. To do this, I constructed a loop over a date list that is intended to extract adjacent fields. Unfortunately, in my case it doesn't work as planned.
Following is my flow:
Replace Text: Takes flowfile content date list as all_first_dates
Initialize Count:
While Loop:
Get first and adjacent dates:
However, when I inspect the queue, I get the following for first and second:
Whereas I wanted 2016-01-01 and 2016-01-02 for first and second respectively on my first iteration, and so on.
Check the description of the getDelimitedField function and its parameters:
Description: Parses the Subject as a delimited line of text and returns just a single field from that delimited text.
Arguments:
index: The index of the field to return. A value of 1 will return the first field, a value of 2 will return the second field, and so on.
delimiter: Optional argument that provides the character to use as a field separator. If not specified, a comma will be used. This value must be exactly 1 character.
...
You are not passing the second parameter, so a comma is used to split the subject. Because the subject contains no commas, you get the whole subject back as a single element.
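The effect of the missing delimiter can be imitated outside NiFi. The Ruby sketch below is not NiFi code: get_delimited_field is a hypothetical helper that copies only the documented behaviour quoted above (1-based index, comma as default delimiter). The delimiter of the actual date list is not shown in the question, so a single space is assumed.

```ruby
# Hypothetical imitation of the documented behaviour; not the NiFi implementation.
def get_delimited_field(subject, index, delimiter = ",")
  raise ArgumentError, "delimiter must be exactly 1 character" unless delimiter.length == 1

  # Index 1 is the first field, index 2 the second, and so on.
  subject.split(delimiter, -1)[index - 1]
end

# Assumption: the date list is separated by single spaces.
dates = "2016-01-01 2016-01-02 2016-01-03"

first_default = get_delimited_field(dates, 1)      # default comma: no split happens
first  = get_delimited_field(dates, 1, " ")        # explicit delimiter
second = get_delimited_field(dates, 2, " ")

p first_default
p first
p second
```

With the default comma, the first "field" is the entire list; passing the real delimiter yields the adjacent dates.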

What's an efficient way of checking the format of a file in Ruby?

I have a file like:
Fruit.Store={
#blabla
"customer-id:12345,item:store/apple" = (1,2); #blabla
"customer-id:23456,item:store/banana" = (1,3); #blabla
"customer-id:23456,item:store/watermelon" = (1,4);
#blabla
"customer-id:67890,item:store/watermelon" = (1,6);
#The following two are unique
"customer-id:0000,item:store/" = (100, 100);
#
"" = (0,0)
};
Except for the comments, each line has the same format: customer-id and item:store/ are fixed, and customer-id is a 5-digit number. The last two records are unique. How could I elegantly make sure the file is in the right format? I am thinking about using a flag for the first special line Fruit.Store={ and then, for the following lines, splitting each line by "," and "="; if a split line is not correct, I would match it against the last two records. I want to use Ruby for this. Any advice? Thank you.
I am also thinking about using a regular expression for the format, and wrote:
^"customer:\d{5},item:store\/\D*"=\(\d*,\d*\);
but I want to combine these two situations (with comment and without comment):
^"customer:\d{5},item:store\/\D*"=\(\d*,\d*\);$
^"customer:\d{5},item:store\/\D*"=\(\d*,\d*\);#.*$
How could I do it? Thanks.
Using regular expressions could be a good option since each line has a fixed format; and you almost got it, your regex just needed a few tweaks:
^(?:#.*|"customer-id:\d{5},item:store\/\D*" *= *\(\d*, *\d*\); *(?:#.*)?)$
This is what was added to your current regex:
Option to be a comment line (#.*) or (|) a regular line (everything after |).
Check for possible spaces before and after =, after the comma (,) that separates the digits in parentheses, and at the end of the line.
Option to include another comment at the end of the line ((?:#.*)?).
So just compare each line against this regex to check for the right format.
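The per-line comparison can be sketched in Ruby as follows. This is only one possible arrangement: the method name valid_store_file? is hypothetical, and the patterns for the opening line, the closing line, and the two unique records are assumptions read off the sample file, since the regex above covers only comments and regular records. \A and \z are used so that each whole line has to match.

```ruby
# Sketch under stated assumptions; adjust the special-case patterns to the real rules.
def valid_store_file?(text)
  comment = /\A#.*\z/
  record  = /\A"customer-id:\d{5},item:store\/\D*" *= *\(\d*, *\d*\); *(?:#.*)?\z/
  # Assumed shape of the two unique records at the end of the sample.
  special = /\A"(?:customer-id:0000,item:store\/)?" *= *\(\d+, *\d+\);? *(?:#.*)?\z/

  lines = text.lines.map(&:strip).reject(&:empty?)
  # Assumed fixed first and last lines, as in the sample.
  return false unless lines.first == "Fruit.Store={" && lines.last == "};"

  lines[1...-1].all? do |line|
    line.match?(comment) || line.match?(record) || line.match?(special)
  end
end

sample = <<~'TEXT'
  Fruit.Store={
  #blabla
  "customer-id:12345,item:store/apple" = (1,2); #blabla
  "customer-id:23456,item:store/watermelon" = (1,4);
  #The following two are unique
  "customer-id:0000,item:store/" = (100, 100);
  #
  "" = (0,0)
  };
TEXT

# Same file, but with a 3-digit customer-id in the first record.
broken = sample.sub("customer-id:12345", "customer-id:123")

puts valid_store_file?(sample)
puts valid_store_file?(broken)
```

For a file on disk, the text could be read with File.read before calling the method.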

Multiple sequence alignment. Convert multi-line format to single-line format?

I have a multiple sequence alignment file in which the lines from the different sequences are interspersed, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPY
TGFb3_human_used_for_docking LRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN LRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF LRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA LRSADTTHST-
Each line begins with a sequence identifier, and then a sequence of characters (in this case describing the amino acid sequence of a protein). Each sequence is split into several lines, so you see that the first sequence (with ID TGFb3_human_used_for_docking) has two lines. I want to convert this to a format in which each sequence has a single line, like this:
TGFb3_human_used_for_docking ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|B3KVH9|B3KVH9_HUMAN ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
tr|G3UBH9|G3UBH9_LOXAF ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSTDTTHST-
tr|G3WTJ4|G3WTJ4_SARHA ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHST-
(In this particular example the sequences are almost identical, but in general they aren't!)
How can I convert from multi-line multiple sequence alignment format to single-line?
It looks like you need to write a script of some sort to achieve this. Here's a quick example I wrote in Python. It won't line up the whitespace as neatly as in your example (if you care about that, you'll have to adjust the formatting), but it gets the rest of the job done:
# Create a dictionary to accumulate full sequences
full_sequences = {}
# Loop through the original file (replace test.txt with your file name)
# and append each line's fragment to the appropriate dictionary entry
with open("test.txt") as infile:
    for line in infile:
        fields = line.split()
        # Skip blank lines and anything without both an ID and a sequence
        if len(fields) < 2:
            continue
        full_sequences[fields[0]] = full_sequences.get(fields[0], "") + fields[1]
# Now loop through the dictionary and write each entry as a single line
# (note: this overwrites test.txt; use a different name to keep the input)
with open("test.txt", "w") as outfile:
    for seq in full_sequences:
        outfile.write(seq + "\t\t" + full_sequences[seq] + "\n")

Ruby Regex: How to match pattern that follows another pattern?

I have ID numbers that should come after the text ID: so my file consists of
ID: A1234
ID: A1235
ID: A1236
etc. I want to match /[A-Z]*[0-9]+/ but only if it comes after the characters ID:. How would I add that to the regular expression but not make it return ID: as part of the result? I just want it to match the regex that follows ID:, because at the end of the file I have numbers and it's returning them, but those aren't ID numbers.
/ID:\s*([A-Z]*[0-9]+)/
The parentheses capture whatever matches inside them, so you can refer to the ID alone as capture group 1 and leave ID: out of the result. If you post some code showing how you're using the regex, I can try to add more detail on how to do that.
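As a minimal sketch of how the capture group might be used, assuming the file content has already been read into a string (the trailing numbers in the sample text are made up to stand in for the non-ID numbers at the end of the file):

```ruby
# Sample content; the last line is invented to mimic stray numbers at the end.
text = <<~'TEXT'
  ID: A1234
  ID: A1235
  ID: A1236
  99887 12345
TEXT

# With a capture group, scan returns one array per match holding only the
# captured text, so "ID:" itself is not part of the result.
ids = text.scan(/ID:\s*([A-Z]*[0-9]+)/).flatten
p ids

# For a single string, String#[] with a group index returns just the capture.
first_id = "ID: A1234"[/ID:\s*([A-Z]*[0-9]+)/, 1]
p first_id
```

The numbers on the last line are skipped because they are not preceded by ID:.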

Looking to replace the text in a file after a match is found in Ruby

I have data in the below format in a .txt file:
parameter1=12345 parameter2=23456 parameter3=23456 and so on.. the list is a long one.
I have found a way to match the parameter1 and so on and replace it with some other number.
modified_file=File.read("modified_file.txt",)
modified_file=modified_file.to_s.sub(/#{parameter1}=/, "some text of your choice")
The above regular expression only replaces the text parameter1= itself, but I want to change the value that follows parameter1=.
I want to write a regular expression that matches the data up to = and replaces the data following it.
For example, I want to replace 12345 with abcde and 23456 with xyzab, so the final result would be:
parameter1=abcde parameter2=xyzab and so on..
/(?<=parameter1=)\S+/
What you want is called a "lookbehind".
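A short sketch of the lookbehind in use, assuming the line from the question and a hypothetical replacements hash that maps each parameter name to its new value:

```ruby
line = "parameter1=12345 parameter2=23456 parameter3=23456"

# Hypothetical mapping of parameter name to new value.
replacements = { "parameter1" => "abcde", "parameter2" => "xyzab" }

updated = replacements.reduce(line) do |text, (name, value)|
  # (?<=name=) asserts that "name=" precedes the match without consuming it,
  # so only the value (\S+) is replaced.
  text.sub(/(?<=#{Regexp.escape(name)}=)\S+/, value)
end

puts updated
```

Because the lookbehind is zero-width, the parameter1= prefix stays in place and parameter3 is left untouched. The result could then be written back with File.write.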

Resources