How to capture road traffic data for a specific period and place - traffic

I would like to capture road traffic data for a specific location over a specific period, and then run some analysis on it. I have been looking into how to do that. I found that I can use some public APIs to obtain traffic information, but I feel there could be other ways as well. I need this data for a Big Data project.
Please also suggest how I should store the data; that is, what is the best practice for storing a large volume of traffic data?

Well, your question is very general! As I understand it, and based on my experience,
I suggest the following:
1- First, and most efficient: analyze your data online, extract your statistics, and keep only the results.
An easy example for research is here: http://www.mathworks.com/help/images/examples/detecting-cars-in-a-video-of-traffic.html
2- Tag your data and keep only the sequences that you need, so that at least the large amount of unnecessary data is discarded automatically.
3- If you need to keep the video sequences for any reason, then I suggest using a video compressor, and reducing your videos not only by compression but also in terms of size.
http://video-compressor.en.softonic.com/
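Suggestion 1 (analyze as the data arrives and keep only the statistics) could look roughly like the sketch below. It assumes each detection arrives as a `(location_id, timestamp)` pair, which is a made-up record format, not something a particular traffic API or detector is known to emit, and it assumes the bucket width divides an hour evenly.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts: datetime, minutes: int = 15) -> datetime:
    """Round a timestamp down to the start of its time bucket.

    Assumes `minutes` divides 60 evenly (5, 10, 15, 30, ...).
    """
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


def aggregate(observations, minutes: int = 15) -> Counter:
    """Count detections per (location, time bucket).

    `observations` is any iterable of (location_id, timestamp) pairs;
    this record shape is a hypothetical stand-in for the real source.
    """
    counts = Counter()
    for location_id, ts in observations:
        counts[(location_id, bucket_start(ts, minutes))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("junction-12", datetime(2015, 3, 6, 17, 2, 10)),
        ("junction-12", datetime(2015, 3, 6, 17, 14, 59)),
        ("junction-12", datetime(2015, 3, 6, 17, 15, 0)),
    ]
    for (location, start), n in sorted(aggregate(sample).items()):
        print(location, start.isoformat(), n)
```

Only the small table of counts needs to be stored, so the raw video or raw API responses can be discarded once they have been counted.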

Related

Big Data Analysis Simulation

First post ever, so here we go! (Thanks for taking the time to read!)
I am currently studying in college and working on a research project on how different hardware (specifically a ram-disk vs hard drive) can affect the speed of big data analysis. I know how to set up the various hardware and all of that jazz, however, I have no previous experience with big data analysis, and after looking for a few days I have found no answers (even here). I need any software to be able to simulate big data analysis - I have read of Hadoop, but have no idea where to begin on that - and it seems that even with it there is no simulation. How would I go about getting software along with data to analyze? Specifically, something I could run as a control group and then again with the data stored on a ram-disk in order to see if there is a performance increase.
I really feel in over my head here and don't know where to start, so any help or tips are welcome. Thank you very much!
To clarify, I am hoping to begin on a very small-scale database, but I also have resources with my school to set up a very large drive to be able to test with.
There are many DB solutions on the market.
However, a big data DB must be designed to process this particular kind of data. The characteristics of big data are summarized as the 3Vs: volume, velocity, and variety.
Volume: big data is a large amount of data, in terabytes (TB) or more. This is the most basic feature of big data: there is a large amount of data, and it is still being generated through multiple paths.
Velocity: large amounts of data must be collected and analyzed in real time in accordance with the user’s needs. Variety: big data takes many forms. That is, it includes all types of data: structured, semi-structured and unstructured. In addition to traditional unstructured data such as books, magazines, medical records, video and audio, it also includes data that carries location information.
Machbase is one big data database you can try. Its website also offers a user manual and a getting-started page with instructions that are easy to follow. Good luck!!
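Independent of which database is chosen, the control-group idea from the question (same workload, once on the hard drive and once on the ram-disk) can be tried with a very small script first. This is only a sketch: it measures sequential read time, the directory paths in the usage part are placeholders, and the file size and repeat count are arbitrary.

```python
import os
import tempfile
import time


def write_test_file(directory: str, size_mb: int = 16) -> str:
    """Create a file of random bytes in `directory` and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for _ in range(size_mb):
            f.write(os.urandom(1024 * 1024))
    return path


def time_sequential_read(path: str, block_size: int = 1024 * 1024) -> float:
    """Return the seconds taken to read the whole file block by block."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_size):
            pass
    return time.perf_counter() - start


def benchmark(directory: str, size_mb: int = 16, repeats: int = 3) -> float:
    """Return the best read time over `repeats` runs for a fresh file."""
    path = write_test_file(directory, size_mb)
    try:
        return min(time_sequential_read(path) for _ in range(repeats))
    finally:
        os.remove(path)  # clean up the temporary test file


if __name__ == "__main__":
    # Placeholder paths: replace with the real hard-drive and ram-disk mounts.
    targets = {"hard drive": tempfile.gettempdir(),
               "ram-disk": tempfile.gettempdir()}
    for label, directory in targets.items():
        print(f"{label}: {benchmark(directory):.4f} s")
```

One caveat: the operating system keeps recently written and recently read files in its page cache, so a file that was just written to the hard drive may be served from RAM anyway. For a fair comparison the test file should be much larger than available memory, or the cache should be dropped between runs.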

Tools for Feature Extraction from Binary Data of Images

I am working on a project where I have image files that have been malformed (fuzzed, i.e. their image data has been altered). When rendered on various platforms, these files lead to a warning/crash/pass report from the platform.
I am trying to build a shield using unsupervised machine learning that will help me identify/classify these images as malicious or not. I have the binary data of these files, but I have no clue what feature set/patterns I can identify from it, because visually these images could be anything. (I need to be able to find a feature set from the binary data.)
I need some advice on the tools/methods I could use for automatic feature extraction from this binary data: feature sets that I can use with unsupervised learning algorithms such as Kohonen's SOM.
I am new to this, any help would be great!
I do not think this is feasible.
The problem is that these are old exploits, and training on them will not tell you much about future exploits. This is an extremely unbalanced problem: no exploit uses the same thing as another. So even if you generate multiple files of the same type, in the end you will likely have only a single relevant training case for each exploit.
Nevertheless, what you need to do is extract features from the file metadata. This is where the exploits are, not in the actual image. As such, parsing the files is already very much where the problem lies, and your detection tool may itself become vulnerable to exactly such an exploit.
As the data may be compressed, a naive binary feature approach will not work, either.
You probably don't want to look at the actual pixel data at all, since the corruption most likely (almost certainly) lies in the file header with its different "chunks" (example for PNG; the details differ, but other formats work in a similar way):
http://en.wikipedia.org/wiki/Portable_Network_Graphics#File_header
It should be straightforward to choose features: make a program that reads all the header information from the file, records whether any of it is missing, and uses this information as features. It will still be much smaller than the unnecessary raw image data.
Oh, and always start out with simpler algorithms like PCA together with k-means or something, and bring out the big guns only if they fail.
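As a rough illustration of the header-based features described above, here is a sketch for PNG only. It walks the chunk list without decoding any pixel data and records a few structural facts. The particular feature names are my own choice, not an established feature set, and a real tool would need equivalent parsers for the other formats.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk_features(data: bytes) -> dict:
    """Extract simple structural features from the chunk layout of a PNG."""
    features = {
        "valid_signature": data[:8] == PNG_SIGNATURE,
        "chunk_count": 0,
        "bad_crc_count": 0,
        "truncated": False,
        "has_ihdr": False,
        "has_iend": False,
        "max_chunk_length": 0,
    }
    if not features["valid_signature"]:
        return features  # chunk parsing is meaningless without the signature

    offset = 8
    while offset < len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        if offset + 8 > len(data):
            features["truncated"] = True
            break
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        body_end = offset + 8 + length
        if body_end + 4 > len(data):
            features["truncated"] = True
            break
        body = data[offset + 8:body_end]
        (stored_crc,) = struct.unpack(">I", data[body_end:body_end + 4])
        # The PNG CRC covers the chunk type and the chunk body.
        if zlib.crc32(chunk_type + body) & 0xFFFFFFFF != stored_crc:
            features["bad_crc_count"] += 1
        features["chunk_count"] += 1
        features["max_chunk_length"] = max(features["max_chunk_length"], length)
        if chunk_type == b"IHDR":
            features["has_ihdr"] = True
        if chunk_type == b"IEND":
            features["has_iend"] = True
            break
        offset = body_end + 4
    return features
```

Each file then becomes one short numeric row (booleans as 0/1), which is the kind of input PCA and k-means expect. Note that the parser bounds-checks every length field before slicing, since the input is hostile by design.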

What is the design & architecture behind facebook's status update mechanism?

I'm planning on creating a social network and I don't think I quite understand how the status update module of facebook is designed. Hoping I can find some help here. At algorithmic and datastructure level, what is the most efficient way to create a status update mechanism in a social network?
A full table scan for all friends and then sorting their updates is very naive and costly. Do we use some sort of mechanism based on hashing or something else? Please let me know.
P.S: I'm not talking about their EdgeRank algorithm but the basic status update. How do they find and fetch them from the database?
Thanks in advance for the help!
Here is a great presentation that answers your question. The specific answer comes up at around minute 55:40, but I suggest that you watch the entire presentation to understand how the solution fits into the entire architecture.
In short:
A particular server ("leaf") stores all feed items for a particular user. So data for each of your friends is stored entirely at a specific destination.
When you want to view your news feed, one of the aggregator servers sends a request to all the leaf servers for your friends and ranks the results. The aggregator knows which servers to send requests to based on the userid of each friend.
This is terribly simplified, of course. This only works because all of it is memcached, the system is designed to minimize latency, some ranking is done at the leaf server that contains the friend's feed items, etc.
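To make the leaf/aggregator split concrete, here is a toy, single-process sketch of it. Everything specific in it is an assumption for illustration: the `LeafServer` and `Aggregator` class names, picking a leaf with `user_id % leaf_count`, and ranking purely by timestamp. It says nothing about how Facebook actually shards or ranks.

```python
import heapq
from collections import defaultdict


class LeafServer:
    """Holds every feed item for the users assigned to it."""

    def __init__(self):
        self.items = defaultdict(list)  # user_id -> [(timestamp, text), ...]

    def post(self, user_id, timestamp, text):
        self.items[user_id].append((timestamp, text))

    def recent(self, user_id, limit):
        """Return up to `limit` newest items for one user (leaf-side ranking)."""
        return heapq.nlargest(limit, self.items.get(user_id, []))


class Aggregator:
    """Fans a feed request out to the leaves that own each friend's items."""

    def __init__(self, leaf_count):
        self.leaves = [LeafServer() for _ in range(leaf_count)]

    def leaf_for(self, user_id):
        # Assumed routing rule: the user id alone determines the leaf.
        return self.leaves[user_id % len(self.leaves)]

    def post(self, user_id, timestamp, text):
        self.leaf_for(user_id).post(user_id, timestamp, text)

    def news_feed(self, friend_ids, limit=10):
        """Collect each friend's newest items and keep the overall newest."""
        candidates = []
        for friend_id in friend_ids:
            leaf = self.leaf_for(friend_id)
            for timestamp, text in leaf.recent(friend_id, limit):
                candidates.append((timestamp, friend_id, text))
        return heapq.nlargest(limit, candidates)


if __name__ == "__main__":
    agg = Aggregator(leaf_count=3)
    agg.post(1, 10, "first post")
    agg.post(2, 30, "lunch")
    agg.post(4, 20, "at the movies")
    print(agg.news_feed([1, 2, 4], limit=2))
```

The point of the design is that no table scan or join happens: each friend's items are found by a direct lookup on the leaf that owns them, and only the small per-friend result lists are merged.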
You really don't want to be hitting the database for any of this to work at a reasonable speed. FB uses MySQL mostly as a key-value store; JOINing tables is just impossible at their scale. Then they put memcache servers in front of the databases and application servers.
Having said that, don't worry about scaling problems until you have them (unless, of course, you are worrying about them for the fun of it.) On day one, scaling is the least of your problems.

What are some information data structures?

Most data structures are designed to hold data.
Data is something that means something to a computer.
Information is something that means something to a human.
What data structures are designed more for information rather than data?
Examples might include things like xml, .jpg, and Gray codes which all have an information feel to me.
This looks like too broad a question. Information is stored as data in many different ways, but ultimately the way you interpret it will give it some meaning. For instance, a Word document written in Chinese will be stored as data and interpreted by someone who knows how to read Mandarin.
If you are talking about information retrieval using AI techniques, that's another story, also very broad. So be more specific to help yourself.
Finally, the way you store data is sometimes related to the way it is represented in real life. An image, a matrix, note a tree for example. Some more complex information, like a huge DNA sequence, is stored in a way that is more suitable for computers (to speed up pattern analysis, for instance). So there is also a translation from information (suitable for humans) to data (suitable for computers), back and forth.
That's why there's a job for us!
Books, newspapers, videos.
Media.
Information is data with a context. The context has to be a part of the data structure to be considered information.
XML is a good example. Pretty much any office document format, many of which have at least an XML representation. Charts and graphs. Plain text files.
I'm not sure I would include .jpg, since it's not really a human-readable format. You need a computer to display a .jpg for you or it's just data.
It's worth mentioning that just about any information is just the same... data arranged in a way that a person or machine can understand.

Data structure/Algorithm for Streaming Data and identifying topics

I want to know the effective algorithms/data structures to identify the below information in streaming data.
Consider a real-time streaming data like twitter. I am mainly interested in the below queries rather than storing the actual data.
I need my queries to run on actual data but not any of the duplicates.
As I am not interested in storing the complete data, it will be difficult for me to identify duplicate posts. However, I can hash all the posts and check against the hashes. But I would also like to identify near-duplicate posts. How can I achieve this?
Identify the top k topics being discussed by the users.
I want to identify the top topics being discussed by users. I don't want the top frequency words as shown by twitter. Instead I want to give some high level topic name of the most frequent words.
I would like my system to be real-time. I mean, my system should be able to handle any amount of traffic.
I can think of map reduce approach but I am not sure how to handle synchronization issues. For example, duplicate posts can reach different nodes and both of them could store them in the index.
In a typical news source, one will be removing any stop words in the data. In my system I would like to update my stop words list by identifying top frequent words across a wide range of topics.
What would be an effective algorithm/data structure to achieve this?
I would like to store the topics over a period of time to retrieve interesting patterns in the data. Say, on Friday evening everyone wants to go to a movie. What would be an efficient way to store this data?
I am thinking of storing it in hadoop distributed file system, but over a period of time, these indexes become so large that I/O will be my major bottleneck.
Consider multi-lingual data from tweets around the world. How can I identify similar topics being discussed across a geographical area?
There are two problems here. One is identifying the language being used. It could be identified based on the person tweeting, but this information might affect the privacy of the users. Another idea would be to run it through a training algorithm. What is the best method currently followed for this? The other problem is actually looking up the word in a dictionary and associating it with a common intermediate language such as English. How do I take care of word sense disambiguation, such as the same word being used in different contexts?
Identify the word boundaries
One possibility is to use some kind of training algorithm. But what is the best approach to follow? This is in some ways similar to word sense disambiguation, because you will be able to identify word boundaries based on the actual sentence.
I am thinking of developing a prototype and evaluating the system rather than building a concrete implementation. I think it's not possible to scrape the real-time Twitter data. I am thinking this approach can be tested on some data freely available online. Any ideas where I can get this data?
Your feedback is appreciated.
Thanks for your time.
-- Bala
There are a couple of different questions buried in here. I can't understand all that you're asking, but here's the big one as I understand it: You want to categorize messages by topic. You also want to remove duplicates.
Removing duplicates is (relatively) easy. To remove "near" duplicates, you could first remove uninteresting parts from your data. You could start by removing capitalization and punctuation. You could also remove the most common words. Then you could add the resulting message to a Bloom filter. Storing a hash of every message isn't good enough for Twitter, as the hashed messages wouldn't be much smaller than the full messages. You'd end up with a hash table that doesn't fit in memory. That's why you'd use a Bloom filter instead. It might have to be a very large Bloom filter, but it will still be smaller than the hash table.
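The normalize-then-Bloom-filter idea above can be sketched in a few lines. The stop-word list, the filter size and the number of hash functions are all placeholder values, and deriving the bit positions from two halves of one SHA-256 digest (double hashing) is just one convenient way to get several hash functions, not the only one.

```python
import hashlib
import string

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"a", "an", "the", "is", "to", "of", "and", "in"})


def normalize(message: str) -> str:
    """Lowercase, strip ASCII punctuation and drop very common words."""
    lowered = message.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in stripped.split() if w not in STOP_WORDS)


class BloomFilter:
    """Fixed-size bit array; may give false positives, never false negatives."""

    def __init__(self, size_bits: int = 1 << 20, hash_count: int = 5):
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: position_i = h1 + i * h2 (mod size).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hash_count):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def is_new(message: str, seen: BloomFilter) -> bool:
    """Return True the first time a normalized message is seen."""
    key = normalize(message)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this, "The movie is GREAT!" and "movie great" count as the same post. The trade-off is that a Bloom filter can occasionally report an unseen message as a duplicate, so a small fraction of genuinely new posts would be dropped; how small depends on the size and hash count chosen.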
The other part is a difficult categorization problem. You probably do not want to write this part yourself. There are a number of libraries and programs available for categorization, but it might be hard to find one that fits your needs. An example is the Vowpal Wabbit project, which is a fast online algorithm for categorization. However, it only works on one category at a time. For multiple categories, you would have to run multiple copies and train them separately.
Identifying the language sounds less difficult. Don't try to do something smart like "training"; instead, put the most common words from each language in a dictionary. For each message, use the language whose words appear most frequently.
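A bare-bones version of that common-word lookup might look like this. The three word lists are tiny hand-picked samples for illustration only; a usable version would need a few hundred words per language and some punctuation handling.

```python
# Tiny illustrative word lists; real ones would be far larger.
COMMON_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "spanish": {"el", "la", "de", "que", "y", "en", "los", "es"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ich", "zu"},
}


def guess_language(message: str, common_words=COMMON_WORDS):
    """Return the language whose common words occur most often, or None."""
    tokens = message.lower().split()
    scores = {
        language: sum(1 for token in tokens if token in words)
        for language, words in common_words.items()
    }
    best = max(scores, key=scores.get)
    # If no known word appeared at all, do not guess.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    for text in ["the cat is in the garden",
                 "la casa de los gatos es grande",
                 "ich bin nicht der Mann"]:
        print(text, "->", guess_language(text))
```

Very short messages, or messages that mix languages, will often score zero or tie, so this works best as a cheap first pass.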
If you want the algorithm to come up with categories on its own, good luck.
I'm not really sure if I'm answering your main question, but you could determine the similarity of two messages by calculating the Levenshtein distance between them. You can think of this as the "edit difference" between two strings (i.e., how many edits would need to be made to one to convert it to the other).
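For reference, the Levenshtein distance can be computed with the standard dynamic-programming recurrence, keeping only two rows of the table in memory. This is a plain textbook version, with no library assumed.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions
    needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] = distance between the first i-1 chars of a and first j of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute if different
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Each comparison costs time proportional to the product of the two lengths, so comparing every incoming message against every earlier one does not scale to a full stream; it is more practical as a second check on a small set of candidates.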
Hello, we have created a very similar demo using api.cortical.io functionality.
There you can create semantic fingerprints of each tweet. (You could also extract the top keywords or some similar terms that don't need to actually be part of the tweet.)
We have used the fingerprints to filter the Twitter stream based on content.
On twistiller.com you can see the result. The public 1% Twitter stream is monitored for four different topic areas.
