It’s been 10 days since I got back from a 5-month stint in Europe, and it’s 10 days until I move on to my next stint (more on that very, very soon 😉 ). Being in Mumbai has its perks, with the alma mater, friends, vada pav and Marine Drive being the high points. However, 20 days of doing nothing can really get to you, and you need to keep doing something just to stay sane.
I had wanted to do a data visualization project for quite some time, and I had always wondered what data I could use. Being back from Europe, having travelled through a few cities, and owning a tablet gave me the spark I was searching for. I had 5 months of GPS data (recorded automatically by my trusty companion, the Nexus 7) logging my travels. That amounted to 73,000 recordings, spread over 5 months, 6 cities and a lot of geography.
Well, Google Location History exported all the data to a KML file readily enough, and there I hit the first hurdle. Too much data. I had data from all over, starting from India to my transit via the Middle East, and those points just seemed like “outliers” (data much different from the rest). So, before I got around to plotting anything, I figured I ought to delete some data, and I set about doing it. Unfortunately, the KML exported by Google Location History isn’t exactly the most beautiful. Each data point is split across two lines, one with the time and one with the co-ordinates, and needs a bit of work to be made into CSV, so one can do some mathemagic with the numbers. So, what do you do?
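For the curious, the two-lines-per-point cleanup is also a few lines of Python. Here is a minimal sketch; the tag names (`<when>`, `<gx:coord>`) and the sample snippet are my assumptions about how the export pairs up times and co-ordinates, so adjust them to whatever your file actually contains:

```python
import csv
import io
import re

# Hypothetical two-line pairs in the style of the Location History export:
# one <when> line with the timestamp, one <gx:coord> line with the position.
SAMPLE_KML = """\
<when>2013-07-01T10:15:00Z</when>
<gx:coord>2.2945 48.8584 35</gx:coord>
<when>2013-07-01T10:20:00Z</when>
<gx:coord>2.2950 48.8590 36</gx:coord>
"""

def kml_to_rows(text):
    """Pair up <when>/<gx:coord> lines into (time, lat, lon) tuples."""
    times = re.findall(r"<when>(.*?)</when>", text)
    coords = re.findall(r"<gx:coord>(.*?)</gx:coord>", text)
    rows = []
    for t, c in zip(times, coords):
        lon, lat, *_ = c.split()  # KML lists longitude first; drop altitude
        rows.append((t, float(lat), float(lon)))
    return rows

# Write the cleaned-up points out as CSV, ready for the mathemagic.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["time", "lat", "lon"])
writer.writerows(kml_to_rows(SAMPLE_KML))
print(out.getvalue())
```

Once the data is in CSV, filtering out the transit “outliers” is just a matter of dropping rows outside a bounding box per city.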
I learnt vim. I went and talked to a friend, and showed him what I was doing. He fired up vi, typed in a few commands, and there was my data for one city, all cleaned up and ready to be analysed. So, I went home that night, fired up vimtutor, and learnt vim. Totally worth the two hours I put into it.
Discussing what I was doing also helped me brush up on statistics and algorithms. Lots of ways to “cluster” the data come to mind. Since a lot of points were pretty much the same location, with all the quirks of GPS measurements, they could be replaced with just one point. Additionally, I could take a city’s data, combine it with my knowledge of my walks in the city, and use k-means clustering to cluster it into the few places that I hung out. That raised the question: did I really want to cluster the places at all, when the aim of keeping logging turned on was to record everywhere I had been, and to later see and show all the streets that I had walked? Should I just show all the data points, or would it be nicer to show the path I followed, step by step, with insights into how much time I spent where along the way, maybe for some other traveller looking for something like I had?
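Both ideas are simple enough to sketch. The toy version below (made-up points, naive first-k initialisation, and plain squared Euclidean distance on degrees, which is a fair approximation within one city but not across the globe) first collapses near-duplicate GPS fixes onto a grid, then runs a bare-bones k-means on what is left:

```python
def dedupe(points, precision=4):
    """Collapse fixes that round to the same ~10 m grid cell into one point."""
    return sorted({(round(lat, precision), round(lon, precision))
                   for lat, lon in points})

def kmeans(points, k, iters=100):
    """Bare-bones k-means; init is naive (first k points, not k-means++)."""
    centroids = list(points[:k])
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                          + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster.
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # assignments stopped changing: converged
            break
        centroids = new
    return centroids

# Two hypothetical "hangout spots" worth of noisy fixes
pts = [(48.8584, 2.2945), (48.8586, 2.2947), (48.8601, 2.2960),
       (48.8738, 2.2950), (48.8740, 2.2952), (48.8755, 2.2970)]
print(sorted(kmeans(dedupe(pts), 2)))
```

With k=2 the centroids settle on the two blobs; picking k in the first place is where the elbow method (from the reading list below) comes in.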
Too many questions, or rather decisions and tests to make, and not too many answers. So, while I figure it out, I went ahead and plotted a few of the cities I visited on a map, using Google Fusion Tables (incredibly easy and handy). Here are a couple of those maps, until I decide to (and/or) finish this project. And if you want to play with this dataset yourself, you can get it here: http://imojo.in/gpsdataset
Also, some terms that you might like to look up, which I came across while working on this: [ clustering, k-means clustering, elbow method, visual block mode in vi, Manhattan distance, curse of dimensionality, forward difference, second forward difference, normalisation, outliers ]