This may be a question for SurveyMonkey, but I felt that someone here may have encountered something like this before. Is there a way to work with the SurveyMonkey (SM) API to add the information from a survey straight into a database of my own? I realize that I can export the information to output files, but I was wondering if there was a way to access the information directly from the SM database. I feel like this might raise some privacy concerns for SM. Has anyone attempted this, or would my best option be to create my own surveys without a third-party website?
I had a similar issue and here's my solution.
I was doing health-related surveys which contain HIPAA-protected Personal Health Information. Zapier is NOT HIPAA-safe, so the "zap the results over to Google Drive" solution didn't work.
So I wanted a quick n dirty way to grab SM survey data and begin to design a data structure to analyse and store this data. I figured that I would start with <1000 results, sort it out, then build out a bigger/fancier structure as needed.
I just downloaded CSVs of the SM individual responses, munged the downloaded files to make Python's CSV reader happy, then wrote a Python 3.5 script to grab the survey data and spit it out into a couple of output CSV files designed for different analytic purposes.
It was really quick and easy to alter the Python script to deliver different subsets of data to different output files, and really quick and easy to see if these output (CSV or XLS) files really told me what I wanted to know.
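For anyone who wants to copy the approach, here is a minimal sketch of that kind of script using only the standard library csv module. The file names and column names are placeholders for illustration; a real SurveyMonkey export will have different headers.

import csv

# Read one (already munged) SurveyMonkey export of individual responses.
# Column names below are hypothetical; adjust them to match your export.
with open("sm_responses.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Split the responses into two purpose-specific output files.
demo_fields = ["respondent_id", "age", "gender"]
score_fields = ["respondent_id", "q1_score", "q2_score"]

for out_name, fields in [("demographics.csv", demo_fields), ("scores.csv", score_fields)]:
    with open(out_name, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for row in rows:
            writer.writerow({k: row.get(k, "") for k in fields})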
This is a really quick and easy way to start analysing right away without spending too much time on procedural overhead. You can alter CSV (or XLS ) tables really quickly and easily, so you can mix and match data / derivative data as much as you want. A wise person once told me "don't think, do." So the more you analyse on small runs of data, the better your final Big Buildout In The Sky will look.
Yeah, you can spend a lot of time writing an API client and setting up a database, but if you are not completely sure what you want out of the SM data, start small. Hope this helps.
I have been playing with Graphite as an application monitoring system, but I'm wondering if there's something better out there for what I want.
Here are a few requirements I have in mind:
Dashboards (easy to create/change)
the items on the dashboard should be mostly charts but also colored "number boxes" (a la http://shopify.github.io/dashing/)
when a metric goes below/beyond a certain value, show some warnings on the screen (different frame/background) and potentially send an email
setting up a rule-to-warn (see above) should be simple to do and have many ways to specify a threshold (absolute value, +/- the min/max/avg over the past 30 days, percentile, etc...)
Clicking on one of the charts/boxes would redirect to a larger/more detailed chart or a "sub-dashboard"
I would prefer open-source but I'm open to commercial products especially cloud-based solutions.
Any suggestions?
Many thanks in advance.
I personally use the following combination:
Dashboard: Grafana. It is really good looking and makes it easy to create and edit dashboards. Unfortunately it doesn't have colored "number boxes", but you can also look into using a wider-purpose tool like Geckoboard or Ducksboard for that.
Alerting system: Seyren. It lets you specify alerts when any metric crosses a certain threshold and notifies you via mail and a dozen other readily integrated channels. However, it doesn't make it any easier to deal with historical values, percentiles, etc. You have to do this manually via Graphite functions, as sketched below. Another popular option: Cabot. I use Seyren instead because it looks more active and is lighter to deploy.
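To make "manually via Graphite functions" concrete, here is a rough sketch of the kind of target you would point Seyren (or your own check script) at. The Graphite host and metric names are placeholders, and the 1.5 ratio threshold is just an example.

import requests

GRAPHITE = "http://graphite.example.com"

# Ratio of the current error rate to its value 30 days ago; an alerting rule
# (e.g. in Seyren) would fire when this target climbs above, say, 1.5.
target = "divideSeries(app.requests.error_rate, timeShift(app.requests.error_rate, '30d'))"

resp = requests.get(
    GRAPHITE + "/render",
    params={"target": target, "from": "-5min", "format": "json"},
    timeout=10,
)
datapoints = resp.json()[0]["datapoints"]  # list of [value, timestamp] pairs
print(datapoints)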
Unfortunately there is no final "answer" to your question, only suggestions. You might find more appropriate forums for your question than Stack Overflow, like mailing lists or Reddit.
Hope it helps anyhow! :)
Your question states that you prefer open source, but if you're really open to a commercial option, I think the ZingChart JavaScript charting library meets your requirements.
1. Dashboards (easy to create/change): ZingChart uses a CSS-like syntax that is pretty easy to use and edit.
2. Dashboard items should be charts and number boxes: The library allows you to create widgets to display items like you've described. Here is a demo with number boxes -- http://www.zingchart.com/playground/presentation/51b21c1a3c8ae
3. Warnings on screen: As you can see in the demo above, rules have been set for the number boxes to display in green for increases and red for decreases. Similar rules can be set for a range of values (which addresses number 4).
You could even use multiple rule sets for values that are close to reaching the threshold. http://zingchart.com/playground/run/5460f51991002 This example shows rules set to place a red marker on data points below the value of 200.
4. Rule-to-warn: There are a variety of ways to use rules to replicate your desired warning. You can also combine rules with our API to create warnings as well as fire an event which can be used to trigger an email.
It's not mentioned in your question, but if real-time data is a consideration, ZingChart also provides the ability to transfer data via HTTP or WebSocket protocols. I'm on the team at ZingChart, so if you have any questions about the demo or the features described, please feel free to reach out.
If you have the right budget, the best tool is Splunk. It is not cloud-based, but it is the best when it comes to analyzing data and creating graphs and dashboards out of data generated by scripts and log files.
Splunk comes with a very flexible query language and the ability to create scheduled searches that can be used as a very robust monitoring solution. I still have not found a better product, but the downside is the high price.
I am looking for algorithms that allow text extraction from websites. I do not mean "strip HTML" or any of the hundreds of libraries that do that.
So for example for a news article I would like to identify the heading and all the text, but not the comments section and so on.
Are there any algorithms for that out there? Thank you!
In the computer science literature this problem is usually referred to as the page segmentation or boilerplate detection problem. See the report Boilerplate Detection using Shallow Text Features and its related blog post. I also have a few reports and software sites bookmarked that address the problem. Also see this Stack Overflow question.
There are a few open-source tools available that do similar article extraction tasks.
https://github.com/jiminoc/goose, which was open-sourced by Gravity.com
It has info on the wiki as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.
"Content extraction" is a very difficult topic. There are no common standards to identify the "main-article" content (there are several approaches to make HTML easier readably for crawlers, e.g. schema.org, but none of these is very popularly used).
So it turns out that if you want good results, it's probably best to define your own XPath selectors for each (news) website you want to scrape; a rough sketch of that approach follows the API list below. There are some APIs for HTML content extraction, but as I said, it's very hard to develop an algorithm which works for every site.
Some APIs you could use:
alchemyapi.com
diffbot.com
boilerpipe-web.appspot.com
aylien.com
textracto.com
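As promised above, a rough illustration of the per-site XPath approach. The URL and XPath expressions are placeholders you would tune for each site, assuming the headline lives in an h1 and the body copy sits inside an article element.

import requests
import lxml.html

html = requests.get("https://example.com/some-news-article").text
doc = lxml.html.fromstring(html)

# Site-specific selectors: tweak these for each news site you scrape.
title = doc.xpath("string(//h1)")
paragraphs = doc.xpath("//article//p//text()")
body = "\n".join(p.strip() for p in paragraphs if p.strip())

print(title)
print(body)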
What you're trying to do is called "content extraction". It turns out to be a surprisingly hard problem to solve well, and many naive solutions do quite badly.
Instapaper and Readability both have to solve this, and you may learn something from looking at their solutions. They also both provide services that you may be able to take advantage of - perhaps you can outsource your problem to them and let their API take care of it. :)
Failing that, a search for "html content extraction" returns a great deal of useful results, including a number of papers on the subject.
I compared a few different libraries, and had really great luck with Mozilla's Readability library (Node), or its Python wrapper.
For example, take this CNN article: https://edition.cnn.com/2022/06/01/tech/elon-musk-tesla-ends-work-from-home/index.html
Readability successfully returns only the relevant data:
New York (CNN Business) Elon Musk is demanding that Tesla office workers return to in-person work or leave the company. The policy, disclosed in leaked emails Musk sent to Tesla's executive staff Tuesday, was first reported by electric vehicle news site Electrek. "Anyone who wishes to do remote work must be in the office for a minimum (and I mean *minimum*) of 40 hours per week or depart Tesla. This is less than we ask of factory workers," Musk wrote, adding that the office must be the employee's primary workplace where the other workers they regularly interact with are based — "not a remote branch office unrelated to the job duties." Musk said he would personally review any request for exemption from the policy, but that for the most part, "If you don't show up, we will assume you have resigned."
etc.
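For reference, a minimal sketch of the Python route, assuming the readability-lxml package (if you use a different wrapper, the import and method names will differ):

import requests
from readability import Document  # pip install readability-lxml

url = "https://edition.cnn.com/2022/06/01/tech/elon-musk-tesla-ends-work-from-home/index.html"
html = requests.get(url).text

doc = Document(html)
print(doc.title())    # article headline
print(doc.summary())  # cleaned article HTML with boilerplate stripped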
I think your best shot is to study what information you can get from the metadata and write a good HTML parser; oEmbed could be a good standard =)
https://oembed.com/#section7
We have to convert EVERYTHING to images for archiving purposes: DOC, HTML, email, ZIP, PDF, TXT and any document you can read/view on a computer. In addition, it must recursively convert all embedded attachments and files inside ZIP archives.
I only know ImgMaker. Is it the best, or is there something better?
My boss asked me to find out whether there are any alternatives other than ImgMaker.
Any open-source or commercial suggestions are welcome.
There is a whole industry built around this type of function and numerous service providers that charge a fee per document to do this type of conversion. You are better off buying than building it on your own.
The idea of converting everything is fundamentally a fool's errand, as you would need a single program that could render every file type ever created (in essence recreating every piece of software that ever wrote a data file AND recreating every version of each). Also, not every file format has a direct rendered form. For example, what do you do with a database file, a DLL, an XML file, a WAV file?
If you are looking for something that does a reasonable job for a large number of formats, there are two main players with OEM toolkits, but both are extremely expensive and neither supports the Java platform directly. I use the former, so if you have any additional questions, feel free to ask.
Stellent (now Oracle) OutsideIn: http://www.oracle.com/technologies/embedded/outside-in.html
Autonomy KeyView: http://www.autonomy.com/content/Products/idol-modules-keyview-viewing/index.en.html
Another possible option is an image print driver like Black Ice, but it has several issues, including the need for a copy of every software application on the machine the code is running on, and an operator to dismiss all the inevitable dialogs that come up when you open the files in the native application. Also, for things like Excel spreadsheets, you usually need some manual tweaking of the spreadsheet to make the printout look right (else you get 900 pages added to your TIFF because of that one extra column that wouldn't fit).
I don't know if this will help, because it sounds like you want something totally automated, but there are many pseudo-printer drivers that can create TIFF images as output. For example:
http://sourceforge.net/projects/pdfcreator/
Uh? How do you expect to convert a zip archive to an image? What should the pixels show? Should it be lossless, so you can convert back? If it's for archiving, I would guess that is a requirement, but it sounds weird.
What's going to happen to the TIFF images afterwards? Assuming you want to manage them in some way, it seems to me you'd be better off looking for a complete document management product that can take these doc types as input and manage/archive the (presumably) large number of images that you'll have.
Otherwise you would seem to be re-inventing the wheel.
If you want open source, something like Alfresco would do the job. Note the server-based transformation feature below:
Alfresco offers one integrated repository to manage all formats of content across image management, document management, web content management and email repositories. The repository is a modern platform with:
One Repository for any Digital Asset
The industry’s most scalable, standards-based, JSR-170 content repository
Standards support for JSR-170, Web Services and REST
High-Availability, Fault Tolerance and Scalability – Auto failover and clustering
Secure Distributed Capture over Web Services, HTTP and HTTPS
Reuse of Alfresco Business Policy Rules
Server-based transformation between many formats including TIFF, JPEG, GIF, PNG, MS-Office, PDF and FLASH
Metadata Extraction and Management
Automatic Classification Framework
find to do the recursion, in combination with convert from the ImageMagick toolkit, would get you pretty far (see the sketch below). I guess to support everything you want, you'll need to write a script that calls the right program for each file type.
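A minimal sketch of that idea in Python, assuming ImageMagick's convert is on the PATH; it only handles formats convert itself understands (PDF and common image types here), and everything else would still need its own handler.

import os
import subprocess

SRC = "archive_in"    # directory tree to walk recursively
DST = "archive_out"   # mirrored directory tree of TIFF output
CONVERTIBLE = {".pdf", ".png", ".jpg", ".jpeg", ".gif", ".bmp"}

for root, _dirs, files in os.walk(SRC):
    for name in files:
        ext = os.path.splitext(name)[1].lower()
        if ext not in CONVERTIBLE:
            continue  # hand other formats off to a different tool
        out_dir = os.path.join(DST, os.path.relpath(root, SRC))
        os.makedirs(out_dir, exist_ok=True)
        out_path = os.path.join(out_dir, os.path.splitext(name)[0] + ".tiff")
        subprocess.run(["convert", os.path.join(root, name), out_path], check=True)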
The question as asked cannot be answered sensibly. One obvious solution is to simply rename each file by attaching .tiff. E.g. you could get ringtone.mp3.tiff. Insane as it is, there are not many better ways to convert an .mp3 to a .tiff.
Note that this is not an IT problem. The business is assuming everything is an image, and music is the trivial example of something that isn't.
( To clarify - this was assuming an automated setting, e.g. to archive incoming email for legal reasons. If that's required, you MUST archive incoming MP3's too. If you've got humans in the loop, this question would not belong on a programming forum. )
Is it even possible to perform address (physical, not e-mail) validation? It seems like the sheer number of address formats, even in the US alone, would make this a fairly difficult task. On the other hand it seems like a task that would be necessary for several business requirements.
Here's a free and sort of "outside the box" way to do it. Not 100% perfect, but it should reject blatantly non-existent addresses.
Submit the entire address to Google's geocoding web service. This service attempts to return the exact coordinates of the location you feed it, i.e. latitude and longitude.
In my experience if the address is invalid you will get a result of 602 from the service. There's definitely a possibility of false positives or false negatives, but used in conjunction with other consistency checks it could be useful.
(Yahoo's geocoding web service, on the other hand, will return the coordinates of the center of the town if the town exists but the rest of the address is bogus. Potentially useful as long as you pay close attention to the "precision" field in the result).
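For what it's worth, the 602 code comes from the old v2 geocoder; the current JSON endpoint returns a status string and requires an API key. A rough sketch of the same "does the geocoder find anything at all" check against today's API:

import requests

def looks_like_real_address(address, api_key):
    """Crude plausibility check: does Google's geocoder return any match?"""
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": address, "key": api_key},
        timeout=10,
    )
    data = resp.json()
    # "ZERO_RESULTS" is roughly the modern equivalent of the old 602 result.
    return data.get("status") == "OK" and bool(data.get("results"))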
There are a number of good answers in here but most of them make the assumption that the user wants an "API" solution where they must write code to connect to a 3rd-party service and/or screen scrape the USPS. This is all well and good, but should be factored into the business requirements and costs associated with the implementation and then weighed against the desired benefits.
Depending upon the business requirements and the way that the data is received into the system, a real-time address processing solution may be the best bet. If a real-time solution is required, you will want to consider the license agreement and technical limitations of the Google Maps/Bing/Yahoo APIs. They typically limit the number of calls you can make each day. The USPS Web Tools API is the same; in addition, they restrict how/why you can use their system and how you are allowed to use the data thereafter.
At the same time, there are a handful of great service providers that can easily process a static list of addresses. Essentially, you give the service provider a CSV file or Excel file, they clean it up and get it back to you. It's a one-time deal with no long-term commitment or obligation—usually.
Full disclosure: I'm the founder of SmartyStreets. We do address verification for addresses within the United States. We can easily CASS-certify a list, and we also offer an address verification web service API. We have no hidden fees, contracts, or anything. You use our service until you no longer need it and you can walk away. (Unlike cell phone companies that require a contract.)
USPS has an address cleaner online, which someone has screen-scraped into a poor man's web service. However, if you're doing this often enough, it'd be a better idea to apply for a USPS account and call their web service directly.
I will refer you to my blog post, A lesson in address storage, where I go into some of the techniques and algorithms used in the process of address validation. My key thought is: "Don't be lazy with address storage, it will cause you nothing but headaches in the future!"
Also, there is another Stack Overflow question that asks this, entitled How should international geographic addresses be stored in a relational database.
In the course of developing an in-house address verification service at a German company I used to work for, I came across a number of ways to tackle this issue. I'll do my best to sum up my findings below:
Free, Open Source Software
Clearly, the first approach anyone would take is an open-source one (like openstreetmap.org), which is never a bad idea. But whether or not you can really put this to good and reliable use depends very much on how much you need to rely on the results.
Addresses are an incredibly variable thing. Verifying U.S. addresses is not an easy task, but it is bearable; once you're going for Europe, however, especially the U.K. with its extensive postcode system, the open-source approach will simply lack data.
Web Services / APIs
Enterprise-Class Software
Money gets it done, obviously. But not every business or developer can spend ~$0.15 per address lookup (that's $150 for 1,000 API requests) - a very expensive business model the vast majority of address validation APIs have implemented.
What I ended up integrating: streetlayer API
Since I was not willing to take on the programmatic approach of verifying address data manually I finally came to the conclusion that I was in need of an API with a price tag that would not make my boss want to fire me and still deliver solid and reliable international verification results.
Long story short, I ended up integrating an API built by apilayer, called "streetlayer API". I was easily convinced by a simple JSON integration, surprisingly accurate validation results and their developer-friendly pricing. Also, 100 requests/month are entirely free.
Hope this helps!
I have used the services of http://www.melissadata.com. Their "address object" works very well. It's pricey, yes. But when you consider the costs of writing your own solution, the cost of dirty data in your application, returned mailers, lost sales, and the like, the costs can be justified.
For US-based address data, my company has used GeoStan. It has bindings for C and Java (and we created a Perl binding). Note that it is a commercial product and isn't cheap. It is quite fast though (~300 addresses per second) and offers features like CASS certification (USPS bulk mail discount), DPV (delivery point verification) flagging, and LON/LAT geocoding.
There is a Perl module Geo::PostalAddress, but it uses heuristics and doesn't have the other features mentioned for GeoStan.
Edit: Some have mentioned "doing it yourself". If you do decide to do this, a good source of information to start with is the US Census TIGER data set, which contains a lot of information about the US, including address information.
As seen on reddit:
<?php
// Geocode an address via Yahoo's geocoding web service and dump the
// decoded JSON response.
$address = urlencode('1600 Pennsylvania Avenue, Washington, DC');
$json = json_decode(file_get_contents("http://where.yahooapis.com/geocode?q=$address&flags=J"));
print_r($json);
There is a service at Fixaddress.com that provides the following:
1) Address validation.
2) Address correction.
3) Address spell correction.
4) Correction of phonetic mistakes in addresses.
Fixaddress.com uses USPS and Tiger data as reference data.
For more detail, visit the link below:
http://www.fixaddress.com/
One area where address lookups have to be performed reliably is VoIP E911 services. I know of companies that rely on the following services for this:
Bandwidth.com 9-1-1 Access API MSAG Address Validation
MSAG = Master Street Address Guide
https://www.bandwidth.com/9-1-1/
SmartyStreets US Street Address API
https://smartystreets.com/docs/cloud/us-street-api
There are companies that provide this service. Service bureaus that deal with mass mailing will scrub an entire mailing list so that it's in the proper format, which results in a discount on postage. The USPS sells databases of address information that can be used to develop custom solutions. They also have lists of approved vendors who provide this kind of software and service.
There are some (but not many) packages that have APIs for hooking address validation into your software.
However, you're right that it's a pretty nasty problem.
http://www.usps.com/ncsc/ziplookup/vendorslicensees.htm
As mentioned, there are many services out there. If you are looking to truly validate the entire address, then I highly recommend going with a web-service-based solution to ensure that changes can quickly be picked up by your application.
In addition to the services listed above, webservicex.net has a US Address Validation service: http://www.webservicex.net/WCF/ServiceDetails.aspx?SID=24
We have had success with Perfect Address.
Their database has all the US street names and street number ranges. Also acts as a pretty decent parser for free-form address fields, if you are lucky enough to have that kind of data.
Validating that it is a valid address is one thing.
But if you're trying to validate that a given person lives at a given address, your only almost-guarantee would be a test mail to the address, and even that is not certain if the person is organised or knows somebody at that address.
Otherwise people could just specify an arbitrary random address which they know exists and it would mean nothing to you.
The best you can do for immediate results is to request that the user send a photographed/scanned copy of the head of their bank statement or some other proof of recent residence, because at least then they have to work harder to forge it, and forgeries show up easily under a basic level of image forensic analysis.
There is no global solution. For any given country it is at best rather tricky.
In the UK, the Post Office controls postal addresses and can provide (at a cost) address information for validation purposes.
Government agencies also keep an extensive list of addresses, and these are centrally collated in the NLPG (National Land and Property Gazetteer).
Actually validating against these lists is very difficult. Most people don't even know exactly how their address is held by the Post Office. Some businesses don't even know what number they are on a particular street.
Your best bet is to approach a company that specialises in this kind of thing.
Yahoo also has a Placemaker API. It is good only for locations, but it has a universal ID for all world locations.
It seems there is no ISO standard list.
You could also try SAP's Data Quality solutions, which are available both as a server platform, for processing a large number of requests, and as an embeddable SDK if you want to run it in-process with your application. We use it in our application and it's very robust and scalable.
NAICS.com is coming out with an API that will add all kinds of key business data including street address. This would happen on the fly as your site's forms are processed. https://www.naics.com/business-intelligence-api/
You can try Pitney Bowes' "IdentifyAddress" API, available at https://identify.pitneybowes.com/
The service analyses and compares the input addresses against known address databases around the world to output standardized detail. It corrects addresses, adds missing postal information and formats them using the format preferred by the applicable postal authority. It also uses additional address databases so it can provide enhanced detail, including address quality, type of address, transliteration (such as from Chinese Kanji to Latin characters) and whether an address is validated to the premise/house number, street, or city level of reference information.
You will find a lot of samples and SDKs available on the site, and I found it extremely easy to integrate.
For US addresses you can require a valid state and verify that the ZIP code is valid. You could even check that the ZIP code is in the right state (a sketch of these checks follows below), but beyond that I don't think there are many tests you could run that wouldn't produce a lot of false negatives.
What are you trying to do -- prevent simple mistakes, or enforce some kind of identity check?
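For illustration, a sketch of those simple state/ZIP checks. The state set and ZIP-prefix table are truncated placeholders; a real check would need the full tables.

import re

US_STATES = {"AL", "AK", "AZ", "CA", "NY", "TX", "WA"}  # truncated for brevity

# A few illustrative ZIP prefixes per state; a real table covers every prefix.
ZIP_PREFIXES = {
    "NY": ("10", "11", "12", "13", "14"),
    "CA": ("90", "91", "92", "93", "94", "95", "96"),
}

def basic_us_address_check(state, zip_code):
    if state not in US_STATES:
        return False
    if not re.fullmatch(r"\d{5}(-\d{4})?", zip_code):
        return False
    prefixes = ZIP_PREFIXES.get(state)
    # If we have prefix data for the state, make sure the ZIP starts with one.
    return prefixes is None or zip_code.startswith(prefixes)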