Twitter, MongoDB and PostgreSQL are fun. Let’s put them all together and see what we can do. Twitter is an evolving platform for many kinds of analyses. Anyone can access the content and be a data scientist for a while. If you’d like to play Big Brother, just go ahead and start playing with it, and you’ll find a lot of interesting things from people all over the world. Some say that NoSQL databases (such as MongoDB) are perfect for storing Big Data due to their scalability and non-relational nature. The good thing about not being a computer scientist is that I can test them as an outsider, without knowing what I am really doing :).
A little background: a few months ago I got access to a database of approx. 200,000 tweets. That is really nothing compared to some other databases, but it is still big enough for data retrieval to be time-consuming. I was not responsible for the data collection; all of the data came from the Twitter Streaming API, and my colleagues stored it both in a MongoDB collection and in a PostgreSQL table. They used the API’s location parameters to request data from an area in the southern part of the UK. Retrieving geographic data from PostgreSQL (with PostGIS) is relatively easy and well known, but what about MongoDB? Can we even do it? I had no idea, but it seemed fun enough to explore. In these posts (maybe there will be 3 or so) I’ll show you how I visualized the tweets. I’ll write about how I tried to extract some weather-related information from them (come on, it’s the UK, so I thought everyone tweets about the weather!) and lastly, I will show you how I tried to compare the two database engines in terms of speed.
What’s Twitter?
Twitter is a social networking and microblogging service that allows you answer the question, “What are you doing?” by sending short text messages 140 characters in length, called “tweets”, to your friends, or “followers.” – tweeternet.com
Yes, that is true. The basics of Twitter are described above, but in our case it is also important that tweets may contain geographic information. Some say this is not even information, just data (which is true, actually), but the main point is that some tweets carry location-based information as coordinates coming from either the user’s GPS-enabled smartphone or their profile’s location settings. Unfortunately, only a small portion of tweets is geocoded/geotagged (let’s say 1-3%). If I understand Twitter right, these coordinates are generated in one of these ways (the user must be geo-enabled):
- GPS coordinates from the built-in GPS receiver
- User’s home location entered in the profile
There are some problems with the second method, which can generate some ‘noise’ in the data. If a user does not have location enabled on their phone, Twitter tries to match the tweet to the location found in the profile. This works well in most cases, but imagine someone from London, UK who travels a lot across continents and does not have location enabled on their smartphone. The tweets of this user may still appear in London. Also, there are always some tweets outside the given boundary of the request. This is the noise that has to be filtered.
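To make the filtering idea concrete, here is a minimal sketch of how tweets outside the requested area could be dropped. It assumes each tweet is a Python dict whose `coordinates` field is a GeoJSON Point in longitude/latitude order, as the Streaming API delivers it. The bounding box values and the function names are made up for illustration; they are not the ones my colleagues used. Note that this only removes tweets outside the box; it cannot detect a profile-based location that happens to fall inside it.

```python
# Hypothetical bounding box (west, south, east, north) in decimal degrees,
# roughly covering southern England. These values are an assumption.
BBOX = (-6.0, 49.9, 1.8, 52.0)


def tweet_lon_lat(tweet):
    """Return (lon, lat) from a tweet's GeoJSON 'coordinates' field, or None."""
    point = tweet.get("coordinates")
    if not point or point.get("type") != "Point":
        return None  # tweet is not geotagged with an exact point
    lon, lat = point["coordinates"]  # GeoJSON order is longitude first
    return lon, lat


def in_bbox(lon, lat, bbox=BBOX):
    """True if the position lies inside (or on the edge of) the bounding box."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def filter_noise(tweets, bbox=BBOX):
    """Keep only tweets that carry a point located inside the bounding box."""
    kept = []
    for tweet in tweets:
        position = tweet_lon_lat(tweet)
        if position is not None and in_bbox(*position, bbox=bbox):
            kept.append(tweet)
    return kept
```

Calling `filter_noise(list_of_tweets)` returns a new list and leaves the input untouched, so the rejected tweets can still be inspected afterwards.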
What is MongoDB?
MongoDB is an open source database that uses a document-oriented data model. MongoDB is one of several database types to arise in the mid-2000s under the NoSQL banner. Instead of using tables and rows as in relational databases, MongoDB is built on an architecture of collections and documents. Documents comprise sets of key-value pairs and are the basic unit of data in MongoDB. Collections contain sets of documents and function as the equivalent of relational database tables. – searchdatamanagement.techtarget.com
So, MongoDB is a NoSQL (non-relational) DBMS which stores JSON-like documents (in a binary form called BSON). MongoDB does not have a fixed schema for collections (a collection is like a table in SQL). Each document has its own schema, which is dynamic, so we can modify it on the go. This gives us some freedom. MongoDB is highly scalable and some people say it can be a good solution for storing Big Data. That may be true, but I advise you to be careful about using NoSQL databases. If it is clear that a NoSQL solution suits your needs, MongoDB can be a perfect choice and you can enjoy the speed and scalability. There are a lot of posts out there that say NoSQL is bad, so be wise and read them as well. See for example this one.
As a side note, PostgreSQL also has some NoSQL capabilities. It can store hstore and JSON data as well, and from PostgreSQL 9.4 (coming out in fall 2014) it can be a great competitor for MongoDB in terms of speed.
MongoDB now has some geo-capabilities included. It can store GeoJSON objects and has geospatial indexing built in. Of course, we cannot compare this functionality right now to that of PostgreSQL’s PostGIS extension, but it’s a good starting point. I’ll play with these capabilities as well.
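As a rough illustration of what these geo-capabilities look like from Python, the sketch below builds a GeoJSON polygon from a bounding box and wraps it in a `$geoWithin` query document. It only constructs plain dictionaries, so it runs without a database. The field name `coordinates` (where the Streaming API puts a tweet's point), the box values and the helper names are assumptions for the example.

```python
import json


def bbox_to_polygon(west, south, east, north):
    """Build a GeoJSON Polygon (lon, lat order) from a bounding box."""
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],  # a GeoJSON ring must end where it started
    ]
    return {"type": "Polygon", "coordinates": [ring]}


def geo_within_query(field, polygon):
    """Build a MongoDB query document selecting geometries inside the polygon."""
    return {field: {"$geoWithin": {"$geometry": polygon}}}


# Hypothetical box roughly covering southern England (an assumption).
query = geo_within_query("coordinates", bbox_to_polygon(-6.0, 49.9, 1.8, 52.0))
print(json.dumps(query, indent=2))
```

With pymongo and a running server, a query like this would typically be passed to `find()` on the collection, for example `tweets.find(query)`, where `tweets` is a hypothetical collection object. Creating a geospatial index first with `tweets.create_index([("coordinates", "2dsphere")])` should help with speed, although `$geoWithin` itself does not require an index.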
For the forthcoming posts, I used:
- Python 2.7 with the following packages and modules: pymongo, psycopg2, urllib2, json (optionally time, datetime)
- R, with plyr, stringr, wordcloud, ggplot2, tm
- and QGIS with the OpenLayers plugin