I'm processing data from government sources (FEC, state voter databases, etc). It's inconsistently malformed, which breaks my CSV parser in all sorts of delightful ways.
It's externally sourced and authoritative. I must parse it, and I cannot have it re-input, validated on input, or the like. It is what it is; I don't control the input.
Properties:
Fields contain malformed UTF-8 (e.g. Foo \xAB bar)
The first field of a line specifies the record type from a known set. Once you know the record type, you know how many fields there are and their respective data types, but not before.
Any given line within a file might use quoted strings ("foo",123,"bar") or unquoted (foo,123,bar). I haven't yet encountered any where it's mixed within a given line (i.e. "foo",123,bar) but it's probably in there.
Strings may include internal newline, quote, and/or comma character(s).
Strings may include comma separated numbers.
Data files can be very large (millions of rows), so this needs to still be reasonably fast.
I'm using Ruby FasterCSV (known as just CSV in 1.9), but the question should be language-agnostic.
My guess is that a solution will require preprocessing substitution with unambiguous record separator / quote characters (eg ASCII RS, STX). I've started a bit here but it doesn't work for everything I get.
How can I process this kind of dirty data robustly?
ETA: Here's a simplified example of what may be in a single file:
"this","is",123,"a","normal","line"
"line","with "an" internal","quote"
"short line","with
an
"internal quote", 1 comma and
linebreaks"
un "quot" ed,text,with,1,2,3,numbers
"quoted","number","series","1,2,3"
"invalid \xAB utf-8"
It is possible to subclass Ruby's File to process each line of the CSV file before it is passed to Ruby's CSV parser. For example, here's how I used this trick to replace non-standard backslash-escaped quotes \" with standard doubled quotes "":
class MyFile < File
def gets(*args)
line = super
if line != nil
line.gsub!('\\"','""') # fix the \" that would otherwise cause a parse error
end
line
end
end
infile = MyFile.open(filename)
incsv = CSV.new(infile)
while row = incsv.shift
# process each row here
end
You could in principle do all sorts of additional processing, e.g. UTF-8 cleanups. The nice thing about this approach is you handle the file on a line by line basis, so you don't need to load it all into memory or create an intermediate file.
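Since the question is meant to be language-agnostic, here is a rough Python sketch of the same line-filter idea, extended with a lenient UTF-8 decode. The names cleaned_lines and read_rows are made up for illustration, and replacing invalid bytes with U+FFFD is an assumption about what "cleanup" should mean for your data.

```python
import csv
import io

def cleaned_lines(raw_lines):
    # Decode each raw line leniently and normalise backslash-escaped quotes.
    for raw in raw_lines:
        # errors="replace" turns invalid bytes such as 0xAB into U+FFFD
        line = raw.decode("utf-8", errors="replace")
        # rewrite \" as the doubled quote that the csv module expects
        yield line.replace('\\"', '""')

def read_rows(binary_file):
    # csv.reader accepts any iterable of strings, so lines are pulled
    # through the filter lazily, one at a time
    return csv.reader(cleaned_lines(binary_file))

sample = io.BytesIO(b'"say \\"hi\\"",1\n"bad \xab byte",2\n')
rows = list(read_rows(sample))
```

Because the filter is a generator, the file is still processed one line at a time, as in the Ruby version.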
First, here is a rather naive attempt: http://rubular.com/r/gvh3BJaNTc
/"(.*?)"(?=[\r\n,]|$)|([^,"\s].*?)(?=[\r\n,]|$)/m
The assumptions here are:
A field may start with quotes. In which case, it should end with a quote that is either:
before a comma
before a new line (if it is last field on its line)
before the end of the file (if it is last field on the last line)
Or, its first character is not a quote, so it contains characters until the same condition as before is met.
This almost does what you want, but fails on these fields:
1 comma and
linebreaks"
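For anyone who wants to try the pattern outside Rubular, here is a small Python sketch. Ruby's /m flag corresponds to re.DOTALL, and re.MULTILINE is added so that $ matches at line ends the way it does in Ruby. The helper name split_fields is made up for illustration.

```python
import re

# Same pattern as above, with flags translated from Ruby to Python
FIELD = re.compile(
    r'"(.*?)"(?=[\r\n,]|$)|([^,"\s].*?)(?=[\r\n,]|$)',
    re.DOTALL | re.MULTILINE,
)

def split_fields(text):
    # Each match fills either the quoted group or the bare group
    return [quoted or bare for quoted, bare in FIELD.findall(text)]

sample = '"this","is",123,"a"\n"line","with "an" internal","quote"'
fields = split_fields(sample)
```

Note that this returns a flat list of fields; it does not say where one record ends and the next begins.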
As TC had pointed out in the comments, your text is ambiguous. I'm sure you already know it, but for completeness:
"a" - is that a or "a"? How do you represent a value that you want to be wrapped in quotes?
"1","2" - might be parsed as 1,2, or as 1","2 - both are legal.
,1 \n 2, - End of line, or newline in the value? You cannot tell, especially if this is supposed to be the last value of its line.
1 \n 2 \n 3 - One value with newlines? Two values (1\n2,3 or 1,2\n3)? Three values?
You may be able to get some clues if you examine the first value on each row, which, as you have said, should tell you the number of columns and their types. This can give you the additional information you are missing to parse the file (for example, if you know there should be another field in this line, then all newlines belong in the current value). Even then, though, it looks like there are serious problems here...
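A minimal Python sketch of that idea, under heavy assumptions: the record types and field counts in FIELD_COUNTS are invented, and fields are split on bare commas, so it ignores quoting entirely. It only illustrates "keep joining physical lines until the record has as many fields as its type requires".

```python
# Hypothetical record types and field counts; the real ones come from
# the data provider's format documentation.
FIELD_COUNTS = {"A": 3, "B": 2}

def logical_records(lines):
    pending = None
    for line in lines:
        line = line.rstrip("\r\n")
        # Glue this physical line onto any unfinished record
        pending = line if pending is None else pending + "\n" + line
        fields = pending.split(",")
        expected = FIELD_COUNTS.get(fields[0].strip('"'))
        # Unknown record types are passed through as-is
        if expected is None or len(fields) >= expected:
            yield fields
            pending = None
    if pending is not None:
        # Input ended in the middle of a record
        yield pending.split(",")

records = list(logical_records(["A,foo\n", "bar,3\n", "B,x\n"]))
```

A newline inside the last field of a record is still ambiguous with this approach, which is one of the serious problems mentioned above.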
I made an app to reformat CSV files, doubling any double quotes inside fields and replacing newlines inside fields with the literal string '\n'.
Once the data is inside the database, we can replace the '\n' with newlines again.
I needed to do this because the apps I had for processing CSV do not deal correctly with newlines.
Feel free to use and change.
In python:
import sys

def ProcessCSV(filename):
    file1 = open(filename, 'r')
    filename2 = filename + '.out'
    file2 = open(filename2, 'w')
    print('Reformatting {0} to {1}...'.format(filename, filename2))
    line1 = file1.readline()
    while len(line1) > 0:
        line1 = line1.rstrip('\r\n')
        line2 = ''
        count = 0
        lastField = (len(line1) == 0)
        while not lastField:
            # Peel off the next field, treating "," as the separator
            lastField = (line1.find('","') == -1)
            res = line1.partition('","')
            field = res[0]
            line1 = res[2]
            count = count + 1
            hasStart = False
            hasEnd = False
            if (count == 1) and (field[:1] == '"'):
                field = field[1:]
                hasStart = True
            elif count > 1:
                hasStart = True
            while True:
                if lastField and (field[-1:] == '"'):
                    field = field[:-1]
                    hasEnd = True
                elif not lastField:
                    hasEnd = True
                if lastField and not hasEnd:
                    # The field is still open: pull in the next physical line
                    line1 = file1.readline()
                    if len(line1) == 0:
                        break
                    line1 = line1.rstrip('\r\n')
                    lastField = (line1.find('","') == -1)
                    res = line1.partition('","')
                    field = field + '\\n' + res[0]
                    line1 = res[2]
                else:
                    break
            # Double any quotes left inside the field
            field = field.replace('"', '""')
            line2 = line2 + iif(count > 1, ',', '') + iif(hasStart, '"', '') + field + iif(hasEnd, '"', '')
        if len(line2) > 0:
            file2.write(line2)
            file2.write('\n')
        line1 = file1.readline()
    file1.close()
    file2.close()
    print('Done')

def iif(st, v1, v2):
    if st:
        return v1
    else:
        return v2

if len(sys.argv) < 2:
    print('You must specify the input file')
else:
    ProcessCSV(sys.argv[1])
In VB.net:
Module Module1
Sub Main()
Dim FileName As String
FileName = Command()
If FileName.Length = 0 Then
Console.WriteLine("You must specify the input file")
Else
ProcessCSV(FileName)
End If
End Sub
Sub ProcessCSV(ByVal FileName As String)
Dim File1 As Integer, File2 As Integer
Dim Line1 As String, Line2 As String
Dim Field As String, Count As Long
Dim HasStart As Boolean, HasEnd As Boolean
Dim FileName2 As String, LastField As Boolean
On Error GoTo locError
File1 = FreeFile()
FileOpen(File1, FileName, OpenMode.Input, OpenAccess.Read)
FileName2 = FileName & ".out"
File2 = FreeFile()
FileOpen(File2, FileName2, OpenMode.Output)
Console.WriteLine("Reformatting {0} to {1}...", FileName, FileName2)
Do Until EOF(File1)
Line1 = LineInput(File1)
'
Line2 = ""
Count = 0
LastField = (Len(Line1) = 0)
Do Until LastField
LastField = (InStr(Line1, """,""") = 0)
Field = Strip(Line1, """,""")
Count = Count + 1
HasStart = False
HasEnd = False
'
If (Count = 1) And (Left$(Field, 1) = """") Then
Field = Mid$(Field, 2)
HasStart = True
ElseIf Count > 1 Then
HasStart = True
End If
'
locFinal:
If (LastField) And (Right$(Field, 1) = """") Then
Field = Left$(Field, Len(Field) - 1)
HasEnd = True
ElseIf Not LastField Then
HasEnd = True
End If
'
If LastField And Not HasEnd And Not EOF(File1) Then
Line1 = LineInput(File1)
LastField = (InStr(Line1, """,""") = 0)
Field = Field & "\n" & Strip(Line1, """,""")
GoTo locFinal
End If
'
Field = Replace(Field, """", """""")
'
Line2 = Line2 & IIf(Count > 1, ",", "") & IIf(HasStart, """", "") & Field & IIf(HasEnd, """", "")
Loop
'
If Len(Line2) > 0 Then
PrintLine(File2, Line2)
End If
Loop
FileClose(File1, File2)
Console.WriteLine("Done")
Exit Sub
locError:
Console.WriteLine("Error: " & Err.Description)
End Sub
Function Strip(ByRef Text As String, ByRef Separator As String) As String
Dim nPos As Long
nPos = InStr(Text, Separator)
If nPos > 0 Then
Strip = Left$(Text, nPos - 1)
Text = Mid$(Text, nPos + Len(Separator))
Else
Strip = Text
Text = ""
End If
End Function
End Module
Hi all, I have the following question:
How do I capitalize every word in a VB6 string variable?
' example
' my full name
Dim fullname As String
fullname = "abdirahman abdirisaq ali"
MsgBox capitalize(fullname)
It prints abdirahmanAbdirisaq ali, which means it skips the space before the middle name; adding more spaces makes no difference.
This is my own code and effort. It has taken me at least 2 hours and I'm still stuck.
I'm tired of trying, so please help. Thanks a lot.
Please check my code and tell me what mistakes I made.
This is my code:
Private Function capitalize(txt As String) As String
txt = LTrim(txt)
temp_str = ""
Start_From = 1
spacing = 0
For i = 1 To Len(txt)
If i = 1 Then
temp_str = UCase(Left(txt, i))
Else
Start_From = Start_From + 1
If Mid(txt, i, 1) = " " Then
Start_From = i
spacing = spacing + 1
temp_str = temp_str & UCase(Mid(txt, Start_From + 1, 1))
Start_From = Start_From + 1
Else
temp_str = temp_str & LCase(Mid(txt, Start_From, 1))
End If
End If
Next i
checkName = temp_str
End Function
It's far simpler than that. In VB6 you should use Option Explicit to properly type your variables. That also requires you to declare them.
Option Explicit
Private Function capitalize(txt As String) As String
Dim temp_str as String
Dim Names As Variant
Dim Index As Long
'Remove leading and trailing spaces
temp_str = Trim$(txt)
'Remove any duplicate spaces just to be sure.
Do While InStr(temp_str, Space$(2)) > 0
temp_str = Replace(temp_str, Space$(2), " ")
Loop
'Create an array of the individual names, separating them by the space delimiter
Names = Split(temp_str, " ")
'Now put them, back together with capitalisation
temp_str = vbnullstring
For Index = 0 to Ubound(Names)
temp_str = temp_str + Ucase$(Left$(Names(Index),1)) + Mid$(Names(Index),2) + " "
Next
'Remove trailing space
capitalize = Left$(temp_str, Len(temp_str) - 1)
End Function
That's the fairly easy part. If you are only going to handle people's names it still needs more work to handle names like MacFarland, O'Connor, etc.
Business names get more complicated since they can have a name like "Village on the Lake Apartments" where some words are not capitalized. It's a legal business name, so the capitalization is important.
Professional and business suffixes can also be problematic if everything is in lower case - like phd should be PhD, llc should be LLC, and iii, as in John Smith III, would come out Iii.
There is also a VB6 function that will capitalize the first letter of each word. It is StrConv(string, vbProperCase), but it also sets everything that is not the first letter to lower case, so PhD becomes Phd and III becomes Iii. The code above, by contrast, does not change the trailing portion to lower case, so if it is entered correctly it remains correct.
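To make the suffix problem concrete, here is a rough Python sketch (not VB6) of the same "capitalize the first letter, leave the rest alone" rule with a small override table. The SPECIAL table and the function name capitalize_name are invented for illustration; a real list of exceptions would be much longer.

```python
# Hypothetical lookup of tokens whose casing must be forced
SPECIAL = {"phd": "PhD", "llc": "LLC", "iii": "III"}

def capitalize_name(text):
    # split() with no argument drops leading, trailing and repeated spaces
    fixed = []
    for word in text.split():
        special = SPECIAL.get(word.lower())
        # Upper-case only the first letter; the rest is left untouched
        fixed.append(special if special else word[0].upper() + word[1:])
    return " ".join(fixed)
```

For example, capitalize_name("jane MacFarland phd") keeps the inner capital in MacFarland and forces the suffix to PhD.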
Try this
Option Explicit
Private Sub Form_Load()
MsgBox capitalize("abdirahman abdirisaq ali")
MsgBox capitalize("abdirahman   abdirisaq   ali") ' extra spaces between names are collapsed
End Sub
Private Function capitalize(txt As String) As String
Dim Names() As String
Dim NewNames() As String
Dim i As Integer
Dim j As Integer
Names = Split(txt, " ")
j = 0
For i = 0 To UBound(Names)
If Names(i) <> "" Then
Mid(Names(i), 1, 1) = UCase(Left(Names(i), 1))
ReDim Preserve NewNames(j)
NewNames(j) = Names(i)
j = j + 1
End If
Next
capitalize = Join(NewNames, " ")
End Function
Use the VB6 statement
Names = StrConv(Names, vbProperCase)
it's all you need (use your own variable instead of Names)
I have a web service which returns a json response as follows:
"database" ; True
"cpu usage" ; 30%
"connection response" ; 1
"memory" ; 48%
The requirement is to create a vb script which would read through the results, compare it against a set threshold and set a flag accordingly.
That is, I need the result to say "green" if the value against "database" is "true", cpu usage is less than 80%, connection response is more than 0 and memory usage is less than 80%.
Could someone please help me with the above request. This is actually to be used with SCOM monitoring.
Your JSON will be more like this. Note that I have changed the variable names to remove spaces - you will need to modify the code accordingly if these guesses were wrong. In JSON variable names and any non-numeric values are in quotes. Typically you would use a JSON parser to handle this but if it really is this simple you can use some simple string handling code to proceed.
{
"database": "true",
"cpu_usage": 30,
"connection_response": 1,
"memory": 48
}
Call this function passing it the JSON string that you get from the service. It works on the basis that a JSON string is a string and we can chop it about to get usable values IF it is of simple format. If it becomes a more complex message then you will need to search for a JSON parser for VB, or if the interface can respond in XML you will find it much easier to handle in VB.
This is VB6 code (easier for me to test with) - you will need to remove all of the 'as string', 'as integer' etc from the variable declares for VB Script. I have included a val() function for VBScript sourced from here though not tested with my function. You need val() as the JSON is string formatted and if you try to compare numeric values to strings you will get unexpected results.
'
' Function to return RED or GREEN depending on values in simple JSON
'
Function checkStatus(sJSON As String) As String
Dim aVals() As String, aParams() As String, i As Integer, sName As String, sVal As String
Dim bDatabase As Boolean, bCPU As Boolean, bConnection As Boolean, bMemory As Boolean
aVals = Split(sJSON, ",")
For i = 0 To UBound(aVals)
aVals(i) = Trim(aVals(i)) ' remove any leading & trailing spaces
aVals(i) = Replace(aVals(i), "{", "") ' remove braces open
aVals(i) = Replace(aVals(i), "}", "") ' remove braces close
aVals(i) = Replace(aVals(i), """", "") ' remove quotes > "database: true"
Debug.Print "vals[" & i & "]=" & aVals(i)
If Len(aVals(i)) > 0 Then ' should catch any dodgy JSON formatting but may need refinement
aParams = Split(aVals(i), ":") ' split the line e.g. "database: true" > "database" and " true"
If UBound(aParams) > 0 Then
sName = LCase(Trim(aParams(0))) ' now we have sName = "database"
sVal = LCase(Trim(aParams(1))) ' and sVal = "true"
Select Case sName
Case "database"
bDatabase = False
If sVal = "true" Then
bDatabase = True
End If
Case "cpu_usage"
bCPU = False
If Val(sVal) < 80 Then ' healthy only when CPU usage is below the 80% threshold
bCPU = True
End If
Case "connection_response"
bConnection = False
If Val(sVal) > 0 Then
bConnection = True
End If
Case "memory"
bMemory = False
If Val(sVal) < 80 Then
bMemory = True
End If
End Select
End If
End If
Next i
checkStatus = "RED" ' default return value to indicate an issue
' compare the flags to decide if all is well.
If bDatabase And bCPU And bConnection And bMemory Then
checkStatus = "GREEN"
End If
End Function
Function Val( myString )
' Val Function for VBScript (aka ParseInt Function in VBScript).
' By Denis St-Pierre.
' Natively VBScript has no function to extract numbers from a string.
' Based shamelessly on MS' Helpfile example on RegExp object.
' CAVEAT: Returns only the *last* match found
' (or, with objRE.Global = False, only the *first* match)
Dim colMatches, objMatch, objRE, strPattern
' Default if no numbers are found
Val = 0
strPattern = "[-+0-9]+" ' Numbers positive and negative; use
' "^[-+0-9]+" to emulate Rexx' Value()
' function, which returns 0 unless the
' string starts with a number or sign.
Set objRE = New RegExp ' Create regular expression object.
objRE.Pattern = strPattern ' Set pattern.
objRE.IgnoreCase = True ' Set case insensitivity.
objRE.Global = True ' Set global applicability:
' True => return last match only,
' False => return first match only.
Set colMatches = objRE.Execute( myString ) ' Execute search.
For Each objMatch In colMatches ' Iterate Matches collection.
Val = objMatch.Value
Next
Set objRE= Nothing
End Function
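For comparison, this is roughly what the same check looks like when a real JSON parser is available. It is a Python sketch rather than VBScript, it assumes the guessed key names from the sample JSON above (database, cpu_usage, connection_response, memory), and it treats a missing key as a failure.

```python
import json

def check_status(payload):
    data = json.loads(payload)
    ok = (
        # accepts both the JSON boolean true and the string "true"
        str(data.get("database")).lower() == "true"
        and float(data.get("cpu_usage", 100)) < 80
        and float(data.get("connection_response", 0)) > 0
        and float(data.get("memory", 100)) < 80
    )
    return "GREEN" if ok else "RED"

good = '{"database": "true", "cpu_usage": 30, "connection_response": 1, "memory": 48}'
status = check_status(good)
```

The defaults passed to get() are chosen so that an absent value fails its threshold instead of raising an error.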
I have a text file with multiple lines, one per job, each giving the job name, last run date, last result, etc., in the format below:
Jobname=FC;lastdate=12032015;lastresult=0
I need to write out the jobname and lastresult status with "success" for 0 and "fail" for other cases.
As you iterate through the lines of the file (where readLine is the variable holding the line being read at the time):
jobname= Split(Split(readLine, ";")(0), "=")(1)
If Split(Split(readLine, ";")(2), "=")(1) = "0" Then ' Split returns strings, so compare with "0"
lastresult = "success"
Else
lastresult = "fail"
End If
Something like this should capture your jobname and lastresult. We are just using SPLIT() to split the string by a delimiter ";" and grabbing the token we need (and then splitting that as well).
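The same split-on-delimiter idea, sketched in Python for readers who want to test it quickly. It looks fields up by name instead of by position, which is an assumption that the keys Jobname and lastresult always appear; the function name job_status is made up.

```python
def job_status(line):
    # Turn "k1=v1;k2=v2" into {"k1": "v1", "k2": "v2"}
    pairs = dict(
        part.split("=", 1) for part in line.strip().split(";") if "=" in part
    )
    # Values are strings, so compare against "0", not the number 0
    result = "success" if pairs.get("lastresult") == "0" else "fail"
    return pairs.get("Jobname"), result
```

Looking fields up by key means the line still parses if the order of the fields changes.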
Use a Regexp looking for = followed by a sequence of non-; and an array indexed by the comparison between "=0" and the last part of the input line - as in:
>> Set r = New RegExp
>> r.Global = True
>> r.Pattern = "=[^;]+"
>> a = Split("success fail")
>> s = "Jobname=FC;lastdate=12032015;lastresult=0|Jobname=Other;lastdate=12032015;lastresult=Else"
>> For Each s In Split(s, "|")
>> Set ms = r.Execute(s)
>> WScript.Echo Mid(ms(0).Value,2), a(1 + ("=0" = ms(2)))
>> Next
>>
FC success
Other fail
I tried to extract the required code with a batch file, but that doesn't work, especially for big files. I'm wondering whether this is possible with VBScript.
I need to extract the text between 2 delimiters in a file and copy it to a TXT file. The text looks like XML, but instead of delimiters like <string> text... </string>, I have :::SOURCE text .... ::::SOURCE. Note that the first delimiter starts with 3 ':' characters and the second with 4.
Most importantly, there are multiple lines between these 2 delimiters.
Example of text:
text&compiled unreadable characters
text&compiled unreadable characters
:::SOURCE
just this code
just this code
...
just this code
::::SOURCE text&compiled unreadable characters
text&compiled unreadable characters
Desired output:
just this code
just this code
...
just this code
Maybe you can try something like this:
filePath = "D:\Temp\test.txt"
Set fso = CreateObject("Scripting.FileSystemObject")
Set f = fso.OpenTextFile(filePath)
startTag = ":::SOURCE"
endTag = "::::SOURCE"
startTagFound = false
endTagFound = false
outputStr = ""
Do Until f.AtEndOfStream
lineStr = f.ReadLine
startTagPosition = InStr(lineStr, startTag)
endTagPosition = InStr(lineStr, endTag)
If (startTagFound) Then
If (endTagPosition >= 1) Then
outputStr = outputStr + Mid(lineStr, 1, endTagPosition - 1)
Exit Do
Else
outputStr = outputStr + lineStr + vbCrlf
End If
ElseIf (startTagPosition >= 1) Then
If (endTagPosition >= 1) Then
outputStr = Mid(lineStr, startTagPosition + Len(startTag), endTagPosition - startTagPosition - Len(startTag))
Exit Do
Else
startTagFound = true
outputStr = Mid(lineStr, startTagPosition + Len(startTag)) + vbCrlf
End If
End If
Loop
WScript.Echo outputStr
f.Close
I've made the assumption that start and end tag can be anywhere inside the file, not only at start of lines. Maybe you can simplify the code if you have more information on the "encoding".
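Here is a compact Python sketch of the same line-by-line scan, in case it helps to compare. One detail it handles explicitly: ":::SOURCE" is a substring of "::::SOURCE", so a start-tag match that is directly preceded by another ':' is ignored. The function name extract_between is made up, and only the first candidate match on each line is examined.

```python
def extract_between(lines, start_tag=":::SOURCE", end_tag="::::SOURCE"):
    collected = []
    inside = False
    for line in lines:
        line = line.rstrip("\r\n")
        if not inside:
            pos = line.find(start_tag)
            # No start tag, or the match is really part of the end tag
            if pos == -1 or (pos > 0 and line[pos - 1] == ":"):
                continue
            inside = True
            line = line[pos + len(start_tag):]
            if not line:
                continue  # nothing follows the start tag on this line
        end = line.find(end_tag)
        if end != -1:
            if end > 0:
                collected.append(line[:end])
            break
        collected.append(line)
    return "\n".join(collected)
```

Because it consumes any iterable of lines, it can be fed an open file object directly and never holds more than the extracted block in memory.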
Someone posted a great little function here the other day that separated the full path of a file into several parts that looked like this:
Function BreakDown(Full As String, FName As String, PName As String, Ext As String) As Integer
If Full = "" Then
BreakDown = False
Exit Function
End If
If InStr(Full, "\") Then
FName = Full
PName = ""
Sloc% = InStr(FName, "\")
Do While Sloc% <> 0
PName = PName + Left$(FName, Sloc%)
FName = Mid$(FName, Sloc% + 1)
Sloc% = InStr(FName, "\")
Loop
Else
PName = ""
FName = Full
End If
Dot% = InStr(Full, ".")
If Dot% <> 0 Then
Ext = Mid$(Full, Dot%)
Else
Ext = ""
End If
BreakDown = True
End Function
However, if the line continues past that point, it counts the rest as part of the extension. Is there any way to make this only count up to 3 characters after the last period in the string?
Dot% = InStrRev(Full, ".") ' First . from end of string
If Dot% <> 0 Then
Ext = Mid$(Full, Dot%, 4) ' the dot plus up to 3 characters
Else
Ext = ""
End If
Mid$ syntax: Mid(string, start[, length])
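The same logic as a small Python sketch, for illustration only: find the last dot, ignore it if it belongs to a folder name, and keep the dot plus at most three characters. The function name split_ext and the max_len parameter are made up.

```python
def split_ext(full, max_len=3):
    name = full.strip()
    dot = name.rfind(".")  # last dot, like InStrRev
    # A dot that sits before the last backslash is part of a folder name
    if dot == -1 or dot < name.rfind("\\"):
        return ""
    # Keep the dot itself plus up to max_len characters after it
    return name[dot:dot + 1 + max_len]
```

Starting the slice at the dot is why the length has to be one more than the number of extension characters wanted.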
If you just have blank characters then just add this as the first line
Full = Trim(Full)
If you have other characters then
Change:
Ext = Mid$(Full, Dot%)
to:
Ext = Mid$(Full, Dot%, 4) ' the dot plus up to 3 characters