Read data from txt files - using Linq-Entities? - text-files

in my current ASP.net MVC 3.0 project i am stuck with a situation.
I have four .txt files each has approximatly 100k rows of records
These files will be replaced with new files on weekly bases.
I need to Query data from these four text files, I am not able to choose the best and efficient way to do this.
3 ways I could think
Convert these text files to XML on a weekly basis and query it with Linq-XML
Run a batch import weekly from txt to SQL Server and query using Linq-Entities
avoid all conversions and query directly from text files.
Can any one suggest a best way to deal with this situation.
update:
Url of the Text File
I should connect to this file with credentials.
once i connect successfully, I will have the text file as below with Pipeline as Deliminator
This is the text file
Now i have to look up for the field highlighted in yellow and get the data in that row.
Note: First two lines of the text file are headers of the File.

Well As i Found a way my self. Hope this will be useful for any who are interested to get this done.
string url = "https://myurl.com/folder/file.txt";
WebClient request = new WebClient();
request.Credentials = new NetworkCredential(ConfigurationManager.AppSettings["UserName"], ConfigurationManager.AppSettings["Password"]);
Stream s = request.OpenRead(url);
using (StreamReader strReader = new StreamReader(s))
{
for (int i = 0; i <= 1; i++)
strReader.ReadLine();
while (!strReader.EndOfStream)
{
var CurrentLine = strReader.ReadLine();
var count = CurrentLine.Split('|').Count();
if (count > 3 && CurrentLine.Split('|')[3].Equals("SearchString"))
{
#region Bind Data to Model
//var Line = CurrentLine.Split('|');
//CID.RecordType = Line[0];
//CID.ChangeIdentifier = Line[1];
//CID.CoverageID = Convert.ToInt32(Line[2]);
//CID.NationalDrugCode = Line[3];
//CID.DrugQualifier = Convert.ToInt32(Line[4]);
#endregion
break;
}
}
s.Close();
}
request.Dispose();

Related

Does super CSV support parsing a file header?

I have a CSV that has a few file header lines at top. The rest of the rows are in the normal tablular format. Is it possible to parse the header row or process it differently from the remainder of the normal tabular data?
You can Get the headers separately quite simple.
headers are in line no 1, which makes it simple to fetch them.
here is an example:
listReader = new CsvListReader(new FileReader(CSV_FILENAME), CsvPreference.STANDARD_PREFERENCE);
final CellProcessor[] processors = getProcessors();
List<Object> customerList;
while( (customerList = listReader.read(processors)) != null ) {
System.out.println(String.format("lineNo=%s, rowNo=%s, customerList=%s", listReader.getLineNumber(), listReader.getRowNumber(), customerList));
if(listReader.getRowNumber()==1)
{
// do what ever you need with the headers...
}

Pick the latest file based on timestamp provided in the filename

I have to pick the files in order(first file first) from say a folder (C:\Users) and file name has the timestamp in it.
For example below are my files in C:\Users\ and the time stamp is after the first underscore i.e. 20170126102806 in the first file below. I have to loop through files and pick the first file and so on. so out of 5 files below,20170123-000011_20170126101823_AAA is the first file. How do I do this in SSIS?
1.20170123-000011_20170126102806_AAA
2.20170123-000011_20170126103251_AAA
3.20170123-000011_20170126101823_AAA
4.20170123-000011_20170126103305_AAA
5.20170123-000011_20170126102641_AAA
You can act in two ways:
use the foreach loop container to get the list of files, and then populate a database table.
Then, outside the foreach loop, use an Execute SQL to select from that table using an appropriate ORDER BY. Load an object variable with the result set. Then use a second foreach loop to step through the variable object and collect files.
use a Script Task to retrieve the contents of the folder (the list of files) and sort files then load an object variable with the dataset. Then use a foreach loop to step through the variable object to collect files.
I hope this help.
You could use a script task in a For Each Loop. Use the filename returned as the source to load each time.
using System.IO;
public void Main()
{
string filePath = "D:\\Temp";
DirectoryInfo dir = new DirectoryInfo(filePath);
var files = dir.GetFiles("*_AAA");//Or from a variable
DateTime fileCreateDate1 = File.GetCreationTime(filePath + "\\" + files[0]);
if (files.Length >= 2)
{
for (int i = 1; i < files.Length; i++)
{
DateTime fileCreateDate2 = File.GetCreationTime(filePath+ "\\" + files[i]);
if (fileCreateDate1 < fileCreateDate2)
{
fileCreateDate1 = fileCreateDate2;
}
}
}
Dts.Variables["User::FileToLoad"].Value = fileCreateDate1;
Dts.TaskResult = (int)ScriptResults.Success;
}
You will have to remove the file after it was loaded or else it will be loaded each time as it is the oldest or latest file.
There might be a bug or so, but have similar code that works. Just iron it out if needed.

How to retrieve total view count of large number of pages combined from the GA API

We are interested in the statistics of the different pages combined from the Google Analytics core reporting API. The only way I found to query statistics multiple pages at the same is by creating a filter like so:
ga:pagePath==page?id=a,ga:pagePath==page?id=b,ga:pagePath==page?id=c
And this get escaped inside the filter parameter of the GET query.
However when the GET query gets over 2000 characters I get the following response:
414. That’s an error.
The requested URL /analytics/v3/data/ga... is too large to process. That’s all we know.
Note that just like in the example call the only part that is different per page is a GET parameter in the pagePath, but we have to OR a new filter specifying both the metric (pagePath) as well as the part of the path that is always identical.
Is there any way to specify a large number of different pages to query without hitting this limit in the GET query (I can't find any documentation for doing POST requests)? Or are there alternatives to creating batches of a max of X different pages per query and adding them up on my end?
Instead of using ga:pagePath as part of a filter you should use it as a dimension. You can get up to 10,000 rows per query this way and paginate to get all results. Then parse the results client side to get what you need. Additionally use a filter to scope the results down if possible based on your site structure or page names.
I am sharing a sample code where you can fetch more then 10,000 record data via help of Items PerPage
private void GetDataofPpcInfo(DateTime dtStartDate, DateTime dtEndDate, AnalyticsService gas, List<PpcReportData> lstPpcReportData, string strProfileID)
{
int intStartIndex = 1;
int intIndexCnt = 0;
int intMaxRecords = 10000;
var metrics = "ga:impressions,ga:adClicks,ga:adCost,ga:goalCompletionsAll,ga:CPC,ga:visits";
var r = gas.Data.Ga.Get("ga:" + strProfileID, dtStartDate.ToString("yyyy-MM-dd"), dtEndDate.ToString("yyyy-MM-dd"),
metrics);
r.Dimensions = "ga:campaign,ga:keyword,ga:adGroup,ga:source,ga:isMobile,ga:date";
r.MaxResults = 10000;
r.Filters = "ga:medium==cpc;ga:campaign!=(not set)";
while (true)
{
r.StartIndex = intStartIndex;
var dimensionOneData = r.Fetch();
dimensionOneData.ItemsPerPage = intMaxRecords;
if (dimensionOneData != null && dimensionOneData.Rows != null)
{
var enUS = new CultureInfo("en-US");
intIndexCnt++;
foreach (var lstFirst in dimensionOneData.Rows)
{
var objPPCReportData = new PpcReportData();
objPPCReportData.Campaign = lstFirst[dimensionOneData.ColumnHeaders.IndexOf(dimensionOneData.ColumnHeaders.FirstOrDefault(h => h.Name == "ga:campaign"))];
objPPCReportData.Keywords = lstFirst[dimensionOneData.ColumnHeaders.IndexOf(dimensionOneData.ColumnHeaders.FirstOrDefault(h => h.Name == "ga:keyword"))];
lstPpcReportData.Add(objPPCReportData);
}
intStartIndex = intIndexCnt * intMaxRecords + 1;
}
else break;
}
}
Only one thing is problamatic that your query length shouldn't exceed around 2000 odd characters

LinqToExcel - Need to start at a specific row

I'm using the LinqToExcel library. Working great so far, except that I need to start the query at a specific row. This is because the excel spreadsheet from the client uses some images and "header" information at the top of the excel file before the data actually starts.
The data itself will be simple to read and is fairly generic, I just need to know how to tell the ExcelQueryFactory to start at a specific row.
I am aware of the WorksheetRange<Company>("B3", "G10") option, but I don't want to specify an ending row, just where to start reading the file.
Using the latest v. of LinqToExcel with C#
I just tried this code and it seemed to work just fine:
var book = new LinqToExcel.ExcelQueryFactory(#"E:\Temporary\Book1.xlsx");
var query =
from row in book.WorksheetRange("A4", "B16384")
select new
{
Name = row["Name"].Cast<string>(),
Age = row["Age"].Cast<int>(),
};
I only got back the rows with data.
I suppose that you already solved this, but maybe for others - looks like you can use
var excel = new ExcelQueryFactory(path);
var allRows = excel.WorksheetNoHeader();
//start from 3rd row (zero-based indexing), length = allRows.Count() or computed range of rows you want
for (int i = 2; i < length; i++)
{
RowNoHeader row = allRows.ElementAtOrDefault(i);
//process the row - access columns as you want - also zero-based indexing
}
Not as simple as specifying some Range("B3", ...), but also the way.
Hope this helps at least somebody ;)
I had tried this, works fine for my scenario.
//get the sheets info
var faceWrksheet = excel.Worksheet(facemechSheetName);
// get the total rows count.
int _faceMechRows = faceWrksheet.Count();
// append with End Range.
var faceMechResult = excel.WorksheetRange<ExcelFaceMech>("A5", "AS" + _faceMechRows.ToString(), SheetName).
Where(i => i.WorkOrder != null).Select(x => x).ToList();
Have you tried WorksheetRange<Company>("B3", "G")
Unforunatly, at this moment and iteration in the LinqToExcel framework, there does not appear to be any way to do this.
To get around this we are requiring the client to have the data to be uploaded in it's own "sheet" within the excel document. The header row at the first row and the data under it. If they want any "meta data" they will need to include this in another sheet. Below is an example from the LinqToExcel documentation on how to query off a specific sheet.
var excel = new ExcelQueryFactory("excelFileName");
var oldCompanies = from c in repo.Worksheet<Company>("US Companies") //worksheet name = 'US Companies'
where c.LaunchDate < new DateTime(1900, 0, 0)
select c;

Google calendar query returns at most 25 entries

I'm trying to delete all calendar entries from today forward. I run a query then call getEntries() on the query result. getEntries() always returns 25 entries (or less if there are fewer than 25 entries on the calendar). Why aren't all the entries returned? I'm expecting about 80 entries.
As a test, I tried running the query, deleting the 25 entries returned, running the query again, deleting again, etc. This works, but there must be a better way.
Below is the Java code that only runs the query once.
CalendarQuery myQuery = new CalendarQuery(feedUrl);
DateFormat dfGoogle = new SimpleDateFormat("yyyy-MM-dd'T00:00:00'");
Date dt = Calendar.getInstance().getTime();
myQuery.setMinimumStartTime(DateTime.parseDateTime(dfGoogle.format(dt)));
// Make the end time far into the future so we delete everything
myQuery.setMaximumStartTime(DateTime.parseDateTime("2099-12-31T23:59:59"));
// Execute the query and get the response
CalendarEventFeed resultFeed = service.query(myQuery, CalendarEventFeed.class);
// !!! This returns 25 (or less if there are fewer than 25 entries on the calendar) !!!
int test = resultFeed.getEntries().size();
// Delete all the entries returned by the query
for (int j = 0; j < resultFeed.getEntries().size(); j++) {
CalendarEventEntry entry = resultFeed.getEntries().get(j);
entry.delete();
}
PS: I've looked at the Data API Developer's Guide and the Google Data API Javadoc. These sites are okay, but not great. Does anyone know of additional Google API documentation?
You can increase the number of results with myQuery.setMaxResults(). There will be a maximum maximum though, so you can make multiple queries ('paged' results) by varying myQuery.setStartIndex().
http://code.google.com/apis/gdata/javadoc/com/google/gdata/client/Query.html#setMaxResults(int)
http://code.google.com/apis/gdata/javadoc/com/google/gdata/client/Query.html#setStartIndex(int)
Based on the answers from Jim Blackler and Chris Kaminski, I enhanced my code to read the query results in pages. I also do the delete as a batch, which should be faster than doing individual deletions.
I'm providing the Java code here in case it is useful to anyone.
CalendarQuery myQuery = new CalendarQuery(feedUrl);
DateFormat dfGoogle = new SimpleDateFormat("yyyy-MM-dd'T00:00:00'");
Date dt = Calendar.getInstance().getTime();
myQuery.setMinimumStartTime(DateTime.parseDateTime(dfGoogle.format(dt)));
// Make the end time far into the future so we delete everything
myQuery.setMaximumStartTime(DateTime.parseDateTime("2099-12-31T23:59:59"));
// Set the maximum number of results to return for the query.
// Note: A GData server may choose to provide fewer results, but will never provide
// more than the requested maximum.
myQuery.setMaxResults(5000);
int startIndex = 1;
int entriesReturned;
List<CalendarEventEntry> allCalEntries = new ArrayList<CalendarEventEntry>();
CalendarEventFeed resultFeed;
// Run our query as many times as necessary to get all the
// Google calendar entries we want
while (true) {
myQuery.setStartIndex(startIndex);
// Execute the query and get the response
resultFeed = service.query(myQuery, CalendarEventFeed.class);
entriesReturned = resultFeed.getEntries().size();
if (entriesReturned == 0)
// We've hit the end of the list
break;
// Add the returned entries to our local list
allCalEntries.addAll(resultFeed.getEntries());
startIndex = startIndex + entriesReturned;
}
// Delete all the entries as a batch delete
CalendarEventFeed batchRequest = new CalendarEventFeed();
for (int i = 0; i < allCalEntries.size(); i++) {
CalendarEventEntry entry = allCalEntries.get(i);
BatchUtils.setBatchId(entry, Integer.toString(i));
BatchUtils.setBatchOperationType(entry, BatchOperationType.DELETE);
batchRequest.getEntries().add(entry);
}
// Get the batch link URL and send the batch request
Link batchLink = resultFeed.getLink(Link.Rel.FEED_BATCH, Link.Type.ATOM);
CalendarEventFeed batchResponse = service.batch(new URL(batchLink.getHref()), batchRequest);
// Ensure that all the operations were successful
boolean isSuccess = true;
StringBuffer batchFailureMsg = new StringBuffer("These entries in the batch delete failed:");
for (CalendarEventEntry entry : batchResponse.getEntries()) {
String batchId = BatchUtils.getBatchId(entry);
if (!BatchUtils.isSuccess(entry)) {
isSuccess = false;
BatchStatus status = BatchUtils.getBatchStatus(entry);
batchFailureMsg.append("\nID: " + batchId + " Reason: " + status.getReason());
}
}
if (!isSuccess) {
throw new Exception(batchFailureMsg.toString());
}
There is a small quote on the API page
http://code.google.com/apis/calendar/data/1.0/reference.html#Parameters
Note: The max-results query parameter for Calendar is set to 25 by default,
so that you won't receive an entire
calendar feed by accident. If you want
to receive the entire feed, you can
specify a very large number for
max-results.
So to get all events from a google calendar feed, we do this:
google.calendarurl.com/.../basic?max-results=999999
in the API you can also query with setMaxResults=999999
I got here while searching for a Python solution;
Should anyone be stuck in the same way, the important line is the fourth:
query = gdata.calendar.service.CalendarEventQuery(cal, visibility, projection)
query.start_min = start_date
query.start_max = end_date
query.max_results = 1000
Unfortunately, Google is going to limit the maximum number of queries you can retrieve. This is so as to keep the query governor in their guidelines (HTTP requests not allowed to take more than 30 seconds, for example). They've built their whole architecture around this, so you might as well build the logic as you have.

Resources