Remove fields if their key is longer than x chars in Logstash?

I have a Logstash pipeline that, for the most part, parses data into JSON well enough to send to Elasticsearch. Sometimes, however, it does not parse fields very well and produces very weird keys and values; most of the time these keys are very long, such as: aaaakgaaaaiaaaaaaaaaabiofy1mb....
I was hoping to remove these fields based on how long the key produced by the kv filter is, say any field whose key is over 30 chars. Although I may improve the Logstash parsing in the future so that these problems won't persist, for now I'd like something like this as a last-ditch sanity check.

Try this filter with two steps:
prefix the kv fields so you can identify them (optional here)
loop over those prefixed fields and apply the condition (here the limit is 30, prefix included)
filter {
  kv {
    # Step 1
    #prefix => "kv_"
  }
  ruby {
    code => "
      hash = event.to_hash
      hash.each do |key, value|
        # Step 2
        #if key.to_s.start_with?('kv_')
        if key.size > 30
          # Remove just this field; event.cancel would drop the whole event.
          event.remove(key)
        end
        #end
      end
    "
  }
}
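If your Logstash version ships the prune filter, the same idea can be expressed declaratively; this is a minimal untested sketch, assuming its blacklist_names option accepts an anchored length regex:
filter {
  prune {
    # remove any field whose name is 31 characters or longer
    blacklist_names => ["^.{31,}$"]
  }
}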

Related

Using Logstash Ruby filter to parse csv file

I have an elasticsearch index which I am using to index a set of documents.
These documents are originally in CSV format and I am looking to parse them using Logstash.
My problem is that I have something along the following lines.
field1,field2,field3,xyz,abc
field3 is something like 123456789 and I want to parse it as 4.56(789) using a ruby code filter.
My try:
I tried with stdin and stdout with the following logstash.conf.
input {
  stdin {
  }
}
filter {
  ruby {
    code => "
      b = event["message"]
      string2=""
      for counter in (3..(num.size-1))
        if counter == 4
          string2+= '_'+ num[counter]
        elsif counter == 6
          string2+= '('+num[counter]
        elsif counter == 8
          string2+= num[counter] +')'
        else
          string2+= num[counter]
        end
      end
      event["randomcheck"] = string2
    "
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
I am getting syntax error using this.
My final aim is to use this with my csv file , but first I was trying this with stdin and stdout.
Any help will be highly appreciated.
The reason you're getting a syntax error is most likely that you have unescaped double quotes inside the double-quoted string (the b = event["message"] line). Either make the string single-quoted, or keep it double-quoted but use single quotes inside. I also don't understand how that code is supposed to work; num is never assigned.
But that aside, why use a ruby filter in the first place? You can use a csv filter for the CSV parsing and a couple of standard filters to transform 123456789 to 4.56(789).
filter {
# Parse the CSV fields and then delete the 'message' field.
csv {
remove_field => ["message"]
}
# Given an input such as 123456789, extract 4, 56, and 789 into
# their own fields.
grok {
match => [
"column3",
"\d{3}(?<intpart>\d)(?<fractionpart>\d{2})(?<parenpart>\d{3})"
]
}
# Put the extracted fields together into a single field again,
# then delete the temporary fields.
mutate {
replace => ["column3", "%{intpart}.%{fractionpart}(%{parenpart})"]
remove_field => ["intpart", "fractionpart", "parenpart"]
}
}
The temporary fields have really bad names in the example above since I don't know what they represent. Also, depending on what the input can look like you may have to adjust the grok expression. As it stands now it assumes nine-digit input.
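To sanity-check the grok logic outside Logstash, the same pattern can be run in plain Ruby (a quick sketch; the sample value is taken from the question):
num = "123456789"
m = /\d{3}(?<intpart>\d)(?<fractionpart>\d{2})(?<parenpart>\d{3})/.match(num)
puts "#{m[:intpart]}.#{m[:fractionpart]}(#{m[:parenpart]})" # prints 4.56(789)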

Extra column when scanning JSON into CSV using .map, sorted order is lost

I am writing a script to convert JSON data to an ordered CSV spreadsheet.
The JSON data itself does not necessarily contain all keys (some fields in the spreadsheet should say "NA").
Typical JSON data looks like this:
json = '{"ReferringUrl":"N","PubEndDate":"2010/05/30","ItmId":"347628959","ParentItemId":"46999"}'
I have a list of the keys found in each column of the spreadsheet:
keys = ["ReferringUrl", "PubEndDate", "ItmId", "ParentItemId", "OtherKey", "Etc"]
My thought was that I could iterate through each line of JSON like this:
parsed = JSON.parse(json)
result = (0..keys.length).map{ |i| parsed[keys[i]] || 'NA'} #add values associated with keys to an array, using NA if no value is present
CSV.open('file.csv', 'wb') do |csv|
csv << keys #create headings on spreadsheet
csv << result #load data associated with headings into the next line
end
Ideally, this would create a CSV file with the proper information in the proper order in a spreadsheet. However, what happens is the result data comes in completely out of order, and contains an extra column that I don't know what to do with.
Looking at the actual data, since there are actually about 100 keys and most of the fields contain NA, it is very difficult to determine what is happening.
Any advice?
The extra column comes from 0..keys.length which includes the end of the range. The last value of result is going to be parsed[keys[keys.length]] i.e. parsed[nil] i.e. nil. You can avoid that entirely by mapping keys directly
result = keys.map { |key| parsed.fetch(key, 'NA') }
As for the random order of the values, I suspect you aren't giving us all of the relevant information, because I tested your code and the result came out in the same order as keys.
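For example, with the sample data from the question, mapping the keys directly keeps both the order and the length in sync with keys:
require 'json'
json = '{"ReferringUrl":"N","PubEndDate":"2010/05/30","ItmId":"347628959","ParentItemId":"46999"}'
parsed = JSON.parse(json)
keys = ["ReferringUrl", "PubEndDate", "ItmId", "ParentItemId", "OtherKey", "Etc"]
keys.map { |key| parsed.fetch(key, 'NA') }
#=> ["N", "2010/05/30", "347628959", "46999", "NA", "NA"]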
Range has two possible notations:
..
and
...
The ... form is exclusive, meaning the range (A...B) does not include B.
Change to
result = (0...keys.length).map{ |i| parsed[keys[i]] || 'NA'} #add values associated with keys to an array, using NA if no value is present
and see if that prevents the last value in that range from evaluating to nil.
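A quick irb session shows the difference between the two notations:
(0..3).to_a  #=> [0, 1, 2, 3] (inclusive of the end)
(0...3).to_a #=> [0, 1, 2] (exclusive of the end)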

How to print elements with same Xpath below same column (same header)

I'm trying to parse an XML file with REXML on Ruby.
What I want is to print all values, with the corresponding element name as a header. The issue I have is that some nodes have repeated child elements with the same XPath, and I want those elements printed in the same column. For the small sample below, the output I'm looking for from the elements of Node_XX is:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
What I have so far is the code below, but I don't know how to make the repeated elements print in the same column.
Thanks in advance for any help.
Code I have so far:
#!/usr/bin/env ruby
require 'rexml/document'
include REXML

xmldoc = Document.new File.new("input.xml")
arr_H_Xpath = []  # Stores each XPath only once (no repeated XPaths)
arr_H_Values = [] # Headers (each child element's name)
arr_Values = []   # Values of each child element

xmldoc.elements.each("//Node_XX") { |element|
  element.each_recursive do |child|
    # Check that the element has text and its XPath is not yet stored in arr_H_Xpath.
    if (child.has_text? && child.text =~ /^[[:alnum:]]/) && !arr_H_Xpath.include?(child.xpath.gsub(/\[.\]/, ""))
      arr_H_Xpath << child.xpath.gsub(/\[.\]/, "")     # Remove the [..] from repeated XPaths
      arr_H_Values << child.xpath.gsub(/\/\w.*\//, "") # Keep only the child element's name, to use as a header
      arr_Values << child.text
    end
    print arr_H_Values.join("|") + "|"
    arr_H_Values.clear
  end
  puts arr_Values.join("|")
}
The input.xml is:
<TopNode>
  <NodeX>
    <Node_XX>
      <RepCD_valm>
        <RepVal>MTSJ</RepVal>
      </RepCD_valm>
      <RepCD_yur>
        <Yur>
          <YurVal>AB01-J</YurVal>
        </Yur>
      </RepCD_yur>
      <CodesDif>
        <CD_Ranges>
          <CD_val>45</CD_val>
          <HJY_val>01</HJY_val>
          <CD_Sub>
            <CD_SubA>87</CD_SubA>
            <CD_SubB>12</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
      <CodesDif>
        <CD_Ranges>
          <CD_val>34</CD_val>
          <HJY_val>11</HJY_val>
          <CD_Sub>
            <CD_SubA>43</CD_SubA>
            <CD_SubB>62</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
    </Node_XX>
    <Node_XY>
      ....
      ....
      ....
    </Node_XY>
  </NodeX>
</TopNode>
Here's one way to solve your problem. It is probably a little unusual, but I was experimenting. :)
First, I chose a data structure that can store the headers as keys and multiple values per key to represent the additional row(s) of data: a Multimap. It is like a hash that can hold multiple values per key.
With the multimap, you can store the elements as key-value pairs:
require 'multimap' # gem install multimap

data = Multimap.new
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] = elem.inner_text
end
The content of data is:
{"RepVal"=>["MTSJ"],
"YurVal"=>["AB01-J"],
"CD_val"=>["45", "34"],
"HJY_val"=>["01", "11"],
"CD_SubA"=>["87", "43"],
"CD_SubB"=>["12", "62"]}
As you can see, this was a simple way to collect all the information you need to create your table. Now it is just a matter of transforming it to your pipe-delimited format. For this, or any delimited format, I recommend using CSV:
out = CSV.generate({col_sep: "|"}) do |csv|
  columns = data.keys.to_a.uniq
  csv << columns
  while !data.values.empty?
    csv << columns.map { |col| data[col].shift }
  end
end
The output is:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
Explanation:
CSV.generate creates a string. If you wanted to create an output file directly, use CSV.open instead. See the CSV class for more information. I added the col_sep option to delimit with a pipe character instead of the default of a comma.
Getting a list of columns would just be the keys if data was a hash. But since it is a Multimap which will repeat key names, I have to call .to_a.uniq on it. Then I add them to the output using csv << columns.
In order to create the second row (and any subsequent rows), we take the first value for each key of data. That's what data[col].shift does: it removes and returns the first value from each key's list of values. The while loop keeps going as long as there are values left (more rows).
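If you'd rather not depend on the multimap gem, a plain Hash with an array default gives the same grouping; a minimal sketch under the same assumptions (doc is the parsed document):
data = Hash.new { |h, k| h[k] = [] }
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] << elem.inner_text
end

puts data.keys.join("|")
row_count = data.values.map(&:length).max
row_count.times do |i|
  puts data.keys.map { |k| data[k][i] }.join("|")
end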

How do I search an array using a partial string, and return the index?

I want to search through an array using a partial string, and then get the index where that string is found. For example:
a = ["This is line 1", "We have line 2 here", "and finally line 3", "potato"]
a.index("potato") # this returns 3
a.index("We have") # this returns nil
Using a.grep will return the full string, and using a.any? will return a correct true/false statement, but neither returns the index where the match was found, or at least I can't figure out how to do it.
I'm working on a piece of code that reads a file, looks for a specific header, and then returns the index of that header so it can use it as an offset for future searches. Without starting my search from a specific index, my other searches will get false positives.
Use a block.
a.index{|s| s.include?("We have")}
or
a.index{|s| s =~ /We have/}
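Both forms return the index of the first element for which the block is true (checked against the array from the question):
a = ["This is line 1", "We have line 2 here", "and finally line 3", "potato"]
a.index { |s| s.include?("We have") } #=> 1
a.index { |s| s =~ /line 3/ }         #=> 2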

Problem with parsing a string from an Excel file

I have Ruby code that parses data in an Excel file using the Parseexcel gem. I need to save two columns of that file into a hash. Here is my code:
worksheet.each { |row|
  if row != nil
    key = row.at(1).to_s.strip
    value = row.at(0).to_s.strip
    if !parts.has_key?(key) and key.length > 0
      parts[key] = value
    end
  end
}
However, it still saves duplicate keys into the hash: "020098-10". I checked the Excel file at the specified rows and found the difference: " 020098-10" and "020098-10". The first one has a leading space while the second doesn't. I don't understand; isn't it true that .strip already removes all leading and trailing whitespace?
Also, when I tried to print out key.length, it gave me these weird numbers:
020098-10 length 18
020098-10 length 17
which should be 9....
If you inspect the strings you receive, you will probably get something like:
" \x000\x002\x000\x000\x009\x008\x00-\x001\x000\x00"
This happens because of the strings' encoding. Excel works with Unicode while Ruby uses ISO-8859-1 by default. The encodings will differ on various platforms.
You need to convert the data you receive from Excel to a printable encoding.
However, you should not convert strings created in Ruby, or you will end up with garbage.
Consider this code:
@enc = Encoding::Converter.new("UTF-16LE", "UTF-8")

def convert(cell)
  if cell.numeric
    cell.value
  else
    @enc.convert(cell.value).strip
  end
end

parts = {}
worksheet.each do |row|
  next unless row
  key = convert row.at(1)
  value = convert row.at(0)
  parts[key] = value unless parts.has_key?(key) or key.empty?
end
You may want to change the encodings to different ones.
The newer Spreadsheet gem handles charset conversion for you automatically (to UTF-8 by default, I think, but you can change it), so I'd recommend using it instead.
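A minimal untested sketch with the Spreadsheet gem, keeping the same two-column logic as the question (the filename 'parts.xls' is a placeholder):
require 'spreadsheet'

Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet.open('parts.xls')
sheet = book.worksheet(0)

parts = {}
sheet.each do |row|
  next unless row
  key = row[1].to_s.strip
  value = row[0].to_s.strip
  parts[key] = value unless parts.key?(key) || key.empty?
end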
