In my last post I showed how to capture tweets from an area. After collecting data for about a week, plotting the number of tweets per hour shows this:
I knew that this was going to happen, as I logged into my AWS instance every now and then to check how the data collection was coming along. At one point the process had stopped and I restarted it manually.
When setting up a process which collects data over a long time (several weeks or longer) it would be great if:
In this case tweets are being collected from the Twitter firehose. For this a script is required that receives tweets and stores them, and it needs to be run on a machine that is always online.
There are a few options for this:
If a desktop is used, a power outage or a router restart will crash the script. When this happens it has to be restarted manually, or it has to be configured to restart automatically after a failure. This is not covered in this post but is possible.
My internet connection is usually shaky, so I decided to use my AWS EC2 instance to run this process. The great thing about EC2 is that you get a year free, and after that pay around $20 a month. Here is a tutorial about getting started with EC2. Setting this up is a little bit tricky, as there are many terms and concepts specific to AWS that must be understood. The upside is that after setting this up, working with other AWS services becomes easier. Also, this kind of cloud computing is useful as it allows us to store, search and otherwise process large amounts of data when necessary, paying only for the amount of time the service was used.
One word of caution about the availability of AWS: on Sunday, the 25th of August 2013, there was an outage on AWS. This brought down a few online services, such as Airbnb and Flipboard, for about 15 minutes. Nothing is completely fail-proof.
Now that the basics of a long-running process have been covered, let’s consider tools to keep the process running and ways to monitor it.
Supervisord is a powerful Python program which can start, stop and restart processes. This last feature is what interests us.
At its simplest, this tool requires a config file where we specify what command to run and whether the script should be restarted automatically. A sample config file can be found here.
After a few weeks a glance at the program’s output reveals that the script was restarted several times:
2013-08-12 00:20:16,369 INFO spawned: 'TwitterLocationFilter' with pid 24505
2013-08-12 00:20:26,381 INFO success: TwitterLocationFilter entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2013-08-21 21:53:26,562 INFO exited: TwitterLocationFilter (exit status 1; not expected)
2013-08-21 21:53:27,568 INFO spawned: 'TwitterLocationFilter' with pid 13373
2013-08-21 21:53:37,581 INFO success: TwitterLocationFilter entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
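To count the crashes without reading the log by eye, a few lines of Python are enough. This is only a rough sketch: it assumes the log lines look exactly like the ones above, and the function name unexpected_exits is made up for illustration.

```python
import re
from datetime import datetime

# Assumption: lines follow the format shown in the log excerpt, e.g.
# "2013-08-21 21:53:26,562 INFO exited: TwitterLocationFilter (exit status 1; not expected)"
EXITED = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO exited: (\S+) \((.*)\)"
)


def unexpected_exits(log_lines, program):
    """Return the timestamps at which `program` exited unexpectedly."""
    exits = []
    for line in log_lines:
        match = EXITED.match(line)
        if not match:
            continue  # spawned/success lines and anything unrecognised
        timestamp, name, detail = match.groups()
        if name == program and "not expected" in detail:
            exits.append(datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"))
    return exits
```

Passing an open log file as log_lines works, since iterating over a file yields its lines one at a time.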
Many other things are possible with this tool: writing extensions, registering listeners which get notified when certain events happen, and monitoring or controlling processes over a web interface. This presentation goes into more detail.
Because this process only has to run for a few weeks, and because it is possible to see how much data was missed, it is not important to understand why it crashes. If this were in production, or a client depended on this process’s output, just restarting it when it fails would not be the correct solution. An effort should be made to understand why it fails.
Storing large amounts of data could lead to filling up all space on the server. When this happens the machine will crash. The df command on Unix will show how much space is left on the disk. Together with some Python code it is also possible to find how much data we stored, and the last tweet collected.
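The disk numbers can also be read directly from Python instead of parsing the output of df. The following is a minimal sketch using only the standard library; the function name storage_stats and its arguments are placeholders of mine, not taken from the actual monitoring code.

```python
import os
import shutil


def storage_stats(data_file, mount_point="/"):
    """Return free disk space and the size of the data file, in bytes."""
    usage = shutil.disk_usage(mount_point)  # named tuple: total, used, free
    return {
        "disk_total": usage.total,
        "disk_free": usage.free,
        "disk_free_pct": 100.0 * usage.free / usage.total,
        "data_bytes": os.path.getsize(data_file),
    }
```

The mount point should be the partition the tweets are written to, which is not necessarily the root partition.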
Here is the code where a page is generated with these statistics. The most interesting thing here is how we get the timestamp of the last tweet. The file where tweets are written is read from the end (with a 1GB file, reading from the beginning could take a long time), and we search for the first occurrence of the ‘created_at’ field and extract the timestamp using a regex.
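The idea of reading backwards can be sketched roughly as follows. This is not the linked code, just an illustration under a few assumptions: tweets are stored as JSON text, the field looks like "created_at":"...", and the name last_created_at and the chunk size are my own choices. Note that a tweet's JSON also carries a created_at inside nested objects such as the user, so depending on field order the match closest to the end of the file may not be the tweet's own timestamp.

```python
import os
import re

# Assumption: the field appears as "created_at":"<timestamp>" in the raw JSON.
CREATED_AT = re.compile(rb'"created_at"\s*:\s*"([^"]+)"')


def last_created_at(path, chunk_size=64 * 1024):
    """Return the last "created_at" value found in the file, or None."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            # Step backwards one chunk at a time and grow the tail buffer.
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            matches = CREATED_AT.findall(tail)
            if matches:
                # The last match in the buffer is the one closest to the end.
                return matches[-1].decode("utf-8")
    return None
```

Because the buffer keeps everything read so far, a field that straddles a chunk boundary is still found once the next chunk is prepended. For a file with recent tweets at the end, the loop normally stops after the first chunk.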
This webpage is generated with the Bottle framework, which is a great tool for creating small, interactive websites. The AWS server has nginx set up to serve static content. To run the Bottle web app, Gunicorn can be used as a web server with nginx configured as a proxy. In this way, nginx continues to serve all static content but proxies all requests for the monitoring tool to Gunicorn, which runs the Bottle app and returns the page to nginx, which in turn sends it to the browser.
It is also possible to set up the monitor page on the OS X dashboard. Simply navigate to the page and left click anywhere on the page and select: ‘Open in dashboard…’.
On Android it is also possible to use the Widget Maker app to show the monitor as a widget.