I’d like to recommend two excellent critical papers on user-generated geographic content and the geosocial web. The first one is by Muki Haklay and raises important issues on the democratizing effects of Web 2.0 and neogeography, while the second one by Crampton et al. takes up the issue and suggests possible solutions to improve the study and analysis of geosocial media.

In his study [1], Haklay argues that neogeographic theory and practice assume an instrumentalist view of technology, i.e. that technology is value-free and that there is a clear separation between the means and the ends. Obviously, Haklay does not agree with this view and argues that there is less empowerment and democratization to be found than commonly assumed. In order to realize the full potential of neogeographic tools and practices, anyone implementing them needs to take into account economic and political aspects. There is a substantial body of work supporting Haklay, including the research by Mark Graham [2], which I recommended in my last post. Patrick Meier on iRevolution has an in-depth commentary on Haklay’s paper [3] and provides a somewhat more optimistic interpretation. My own point of view runs along similar lines as Haklay’s, in that the contemporary digital divides are a continuation of old power divides that participatory GIS sought to overcome in the 90s. And while I have no ill will towards companies that add value to user-generated content, I am highly skeptical of such “involuntary crowdsourcing”, in which the crowd freely provides the raw material but in the end has to pay for access to derived products [4]. There is some similarity to the argument for Open Government Data – why should the taxpayers (and tax-paying companies) pay again for the use of the data, when they already paid for the creation of it?

Crampton et al. [5] critically investigate the hype around the “Big Data” geoweb. They remind the reader of (a) the limitations inherent in “big-data”-based analysis and (b) shortcomings of the simple spatial ontology of the geotag. Concerning (a), the data used often has limited explanatory value or informational richness, something our research has shown as well [6]. Furthermore, geocoded social media are still a non-representative sample, no matter how many of them one has collected. Concerning (b), Crampton et al. point out a number of problems with the geotag, e.g. that it is difficult to ascertain whether it refers to the origin of the content or the topic of the content, that its lineage and accuracy are unclear, and that it oversimplifies geography by limiting place geometry to points or lat/lon pairs (see also [7]). As a consequence of their analysis, the authors suggest that studies of the geoweb should try to take into account:

  1. social media that is not explicitly geographic
  2. spatialities beyond the “here and now”
  3. methodologies that are not focused on proximity only
  4. non-human social media
  5. geographic data from non-user generated sources.

I have to admit that I am a little bit proud to say that our research has addressed three of those suggestions: We haven’t limited our sample to geo-coded social media; instead, we have re-geo-coded even those with existing coordinates to ensure that we capture the places the social media was about. We have also gone beyond the “here and now” by spatio-temporally clustering the data. Finally, a core concept of our approach is the enrichment of the social media data with explicitly geographic data from non-user generated (i.e. authoritative) sources (a paper describing the details has just been accepted but not published yet; an overview can be found here [8]).

Crampton et al. conclude their paper with the important reminder that caution is needed regarding the surveillance potential of such research, with intelligence agencies around the world focusing more and more on open source intelligence (OSINT). Indeed it seems that even in Really Big Data, our spatial behaviour is unique enough to allow identification [9].

[1] http://www.envplan.com/abstract.cgi?id=a45184

[2] http://www.zerogeography.net/

[3] http://irevolution.net/2013/03/17/neogeography-and-democratization/

[4] http://phg.sagepub.com/content/36/1/72.abstract

[5] http://www.tandfonline.com/doi/abs/10.1080/15230406.2013.777137?journalCode=tcag20#.UYZa6klic8o

[6] http://www.igi-global.com/article/context-analysis-volunteered-geographic-information/75443

[7] http://www.tandfonline.com/doi/full/10.1080/17538947.2012.712273#.UYZaeUlic8o

[8] https://sites.google.com/site/geoconavi/implementation-details

[9] http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html

or so the saying goes. At least part of it. Anyway, it’s been very quiet here for almost three months now. The main reason is that most of my spare energy at the moment goes into searching for new work – my current project (and with it funding) will end in a couple of months, so I’m spending less time writing and more scouting. And flying a UAV, actually, because we’re following up on last year’s successful “Big Blue Balloon” experience. I’ll be posting about Ed (our “Environmental Drone”) soonish.

In the meantime, let me recommend some great posts by other bloggers.

First, there’s always something worthwhile on iRevolution – I am always awed by the frequency with which Patrick can publish high-quality blog posts. Yesterday’s post caught my eye in particular, because it shows Patrick isn’t only a keen thinker and great communicator, but also could do well as an entrepreneur – check out his ideas for a smartphone application for disaster-struck communities here.

Then, I really love reading Brian Timoney’s MapBrief blog. It’s not only enlightening, it’s also fun – as long as you’re not the target of Brian’s sharp wit. Recently, he has run a series on why map portals don’t work. Most of the reasons should be pretty obvious, but equally obvious is the failure by most portals to do differently. Read up on it here.

On Zero Geography, Mark Graham shares with us the latest results from his research, and there have been several great posts recently on the usage of Twitter in several African cities. Visit it here to learn some surprising things about contemporary digital divides.

I hope you enjoy reading them as much as I did, but I also promise you won’t have to wait for “strange aeons” before some original material is posted here.

Although our system had been running successfully for quite a while behind high walls, it has now finally become available online:

http://forest.jrc.ec.europa.eu/effis/applications/vgi/

Please note that

a) it’s still beta

b) currently, there are few forest fires in Europe, so you will likely see a higher ratio of noise

c) we haven’t put the clustering module online yet, because it still requires frequent human supervision to adjust parameters

If you’d like to know more, don’t hesitate to contact us!

Here’s some more info on the idea and technology behind it:

The rationale for the research is the emergence of new social media platforms that change the way people create and use information during crisis events. Most widespread are platforms for micro blogging (e.g. Twitter), networking (e.g. Facebook), and photo sharing (e.g. Flickr). These social media platforms increasingly offer to include geographic information on the whereabouts of their users or the information posted. Potentially, this rich information can contribute to a more effective response to natural disasters. In fact, social media have been put to good use on various occasions. This increasing amount of bi-directional horizontal peer-to-peer information exchange also affects the traditional uni-directional vertical flow of information. Traditional broadcasting media open up to micro journalism and several official administrative agencies already adapt and use third-party social media accounts for communicating information. However, incorporation of UGC into the established administrative emergency protocols has not advanced significantly. It seems that public officials often view such volunteered information as a threat that could spread misinformation and rumours, as long as there is no reliable quality control.

So far, mainly human volunteers have carried out the tasks of filtering, validating and assessing the quality of UGC, and with great success. However, this approach is neither sustainable nor scalable for a continuous, reliable utilization of UGC in crisis response, because the amount of data is ever increasing, and volunteers might not be available in sufficient numbers. The research community has already begun to investigate how to assess trust, reputation and credibility of UGC, and of volunteered geographic information (VGI) in particular, but several issues pose enormous challenges to automated approaches. Among them are the lack of a unified interface and the heterogeneity of media formats and platforms, which lead to a wide variety of possible data structures; a lack of syntactical control over the data entered by the users; the ingenuity of users and software developers in overcoming device or interface limitations; and an unknown and variable proportion of disruptive or redundant content.

We propose that an integration with existing spatial data infrastructures (SDI) and the geographic contextualization of geo-coded UGC (UGGC) can greatly enhance the options for assessing its quality. We call this approach the GEOgraphic CONtext Analysis of Volunteered Information (GEOCONAVI). This approach emulates one of the basic heuristics which humans use to deal with information that has unknown quality: A comparison with “What do I already know?” By spatio-temporally clustering UGGC, we emulate another heuristic, that of social confirmation (“What do others say?”), and look for confirming or contradicting content. Both these heuristics influence the credibility assessment of the UGGC or VGI. Another criterion is that of relevance, which we assess from the quasi-objective point of view of the potential damage a forest fire can cause, by investigating again the geographic context.

To recapitulate, the GEOCONAVI system requires the following tasks to be carried out semi-automatically or automatically: First, the retrieval and storage of UGC or VGI from various sources. Second, the enrichment of the retrieved UGC with information about source, content, location, and geographic context turning it into UGGC or VGI. Third, the clustering of the UGGC in space and time. Fourth, the detection of new events, or the assignment to known events. Fifth, the dissemination of the results.
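To make the third task a bit more tangible, here is a minimal sketch of what “clustering in space and time” can mean. It is not the GEOCONAVI implementation: the function names, the single-linkage grouping, and the thresholds of 10 km and 6 hours are assumptions chosen purely for illustration.

```python
import math
from datetime import datetime, timedelta


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def cluster(items, max_km=10.0, max_gap=timedelta(hours=6)):
    """Single-linkage grouping of (lat, lon, timestamp) tuples.

    Two items are linked if they are close in space AND in time.
    The thresholds are illustrative assumptions, not tuned values.
    Returns one cluster id per item.
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            lat1, lon1, t1 = items[i]
            lat2, lon2, t2 = items[j]
            if abs(t1 - t2) <= max_gap and haversine_km(lat1, lon1, lat2, lon2) <= max_km:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(items))]


# Made-up example posts: two nearby and close in time, one far away, one much later.
posts = [
    (43.30, 5.40, datetime(2012, 7, 1, 12, 0)),
    (43.32, 5.45, datetime(2012, 7, 1, 13, 0)),
    (38.72, -9.14, datetime(2012, 7, 1, 12, 30)),
    (43.30, 5.40, datetime(2012, 7, 4, 12, 0)),
]
labels = cluster(posts)
```

Items that end up in the same cluster are candidates for the “What do others say?” check; the pairwise loop is quadratic, so a real system would need a spatial index or a proper clustering library.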

The following figure shows an overview of the workflow, plus the current implementation.

Even more info at:

https://sites.google.com/site/geoconavi/implementation-details

 

In this blog post, I’ll try to give a brief overview of my attempts to mine our Tweet database with machine learning methods, in order to find the few needles in the haystack (i.e. Tweets with information on forest fires). Therefore, it might be a bit less flashy than blue balloons or new open access journal articles, but maybe someone finds the following helpful. Or, even better, points out mistakes I have made. Because I am venturing into unknown territory here – discovering patterns in geographic objects through spatial clustering techniques is part of my job description, but “10-fold stratified cross validation” etc. were a bit intimidating even for me.

So the context is my research on using social media for improving the response to natural disasters like forest fires. In other words, to find recent micro-blog posts and images about forest fires and assess their quality and potential to increase situational awareness. At the JRC, we have developed a whole workflow (GeoCONAVI, for GEOgraphic CONtext Analysis of Volunteered Information) that covers different aspects of this endeavor; for more information, see this site and the previous blog post’s links to published journal articles.

Because the concept of “fire” is used metaphorically in many other contexts, we have a lot of noise in our retrieved data. In order to filter for topicality (i.e. whether an entry is about forest fire or not), we manually annotated around 6000 Tweets, then counted the occurrences of relevant keywords (e.g. “fire”, “forest”, “hectares”, “firefighters”, etc.). From this, we developed some simple rules to guide the filtering. For an obvious example, the simultaneous occurrence of “fire” and “hectares” is a very good indicator, even in the absence of any word related to vegetation. The results from our case studies show that our rules have merit. However, it seemed an inexcusable omission to not try any machine learning algorithms on this problem. Now that the project is finished, I finally found the time to do just that…
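For readers who prefer code to prose, the keyword counting and the “fire plus hectares” rule could look roughly like the sketch below. The keyword lists are tiny, hypothetical stand-ins (our real lists are multilingual and much longer), and the function names are made up for this post.

```python
import re

# Hypothetical, heavily shortened keyword groups, for illustration only.
KEYWORD_GROUPS = {
    "fire": ["fire", "fires", "feu", "feux"],
    "incendi": ["incendie", "incendies", "incendio"],
    "forest": ["forest", "forests", "forêt"],
    "hectar": ["hectare", "hectares"],
}


def count_groups(text):
    """Count how often the keywords of each group occur in a Tweet (unigrams)."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        group: sum(tokens.count(word) for word in words)
        for group, words in KEYWORD_GROUPS.items()
    }


def fire_and_hectares(counts):
    """The example rule from the text: "fire" and "hectares" occur together."""
    return counts["fire"] > 0 and counts["hectar"] > 0


on_topic = count_groups("Fire destroyed 300 hectares near the village")
metaphor = count_groups("This new album is fire")
```

The per-group counts are also exactly the kind of attributes a classifier can learn from, which is what the rest of this post is about.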

So, the objectives were to find simple rules that make it possible to determine whether a Tweet is about forest fires, and to compare those rules with the ones manually devised.

The method obviously is a supervised classification, and the concept to classify is Forest Fire Topicality. The instances from which to learn and on which to test are a set of roughly 6000 annotated Tweets, classified into “About Forest Fires” and “Not About Forest Fires”. The attributes used are the number of times each of the keywords shows up in a Tweet. The set of keywords is too large to post here (because of multiple languages used – did I mention that? no? sorry), but we grouped the keywords into categories. The set of keyword groups used is {fire, incendie, shrubs, forest, hectares, fire fighters, helicopters, canadair, alarm, evacuation} (NB: The distinction between fire and incendie groups is a result of languages like French, where we have the distinct words of “incendie(s)” and “feu(x)”).

The tool of choice is the Weka suite, mainly because of its availability and excellent documentation. As classification methods, I chose to focus on Naive Bayes methods and Decision Trees, because of their widespread use and because they fit the data (by the way, my guide through this experiment was mostly the excellent book “Data Mining – Practical Machine Learning Tools and Techniques, Second Edition” by Ian H. Witten and Eibe Frank – any errors I made are entirely my responsibility).

Regarding the data preparation, little was actually needed – no attribute selection or discretization was necessary, and we had already transformed the text (unigrams) into numbers (their occurrences).

So I was ready to load the CSVs into Weka, convert them to ARFF and start (machine) learning! For verification/error estimation, a standard stratified 10-fold cross validation seemed sufficient.
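In case “stratified 10-fold cross validation” sounds as intimidating to you as it did to me: the idea is to split the data into ten parts that each contain roughly the same share of “Y” and “N” instances, then train on nine parts and test on the tenth, ten times over. The sketch below only illustrates the splitting; it is not how Weka does it internally, and the function name and the round-robin dealing are my own assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign instance indices to k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin over the folds.
        for n, index in enumerate(indices):
            folds[(offset + n) % k].append(index)
        offset += len(indices)
    return folds


# Same class sizes as in the first data set of this post.
labels = ["Y"] * 1566 + ["N"] * 4115
folds = stratified_folds(labels)

# Each fold serves once as test set; the other nine form the training set.
test_fold = folds[0]
train = [i for fold in folds[1:] for i in fold]
```

Because every fold mirrors the overall class ratio, the accuracy estimate is less sensitive to an unlucky split, which matters when the “Y” class is the minority.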

The computations all went very quickly, and all showed roughly a 90% accuracy. Below is the output of one run:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: gt2_grpd_no_dup-weka.filters.unsupervised.attribute.Remove-R1,14
Instances:5681
Attributes:13
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
hectar
ff_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

incendi <= 0
|    hectar <= 0
|   |    forest <= 0: N (4232.0/376.0)
|   |    forest > 0
|   |   |    fire <= 0: N (204.0/52.0)
|   |   |    fire > 0: Y (22.0/4.0)
|    hectar > 0
|   |    fire <= 0: N (54.0/11.0)
|   |    fire > 0: Y (34.0)
incendi > 0: Y (1135.0/60.0)

Number of Leaves  :     6

Size of the tree :     11

Time taken to build model: 0.16seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5178               91.1459 %
Incorrectly Classified Instances       503                8.8541 %
Kappa statistic                          0.7605
Mean absolute error                      0.1588
Root mean squared error                  0.2821
Relative absolute error                 39.7507 %
Root relative squared error             63.1279 %
Total Number of Instances             5681

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.72      0.016      0.946     0.72      0.818      0.858    Y
0.984     0.28       0.902     0.984     0.942      0.858    N
Weighted Avg.    0.911     0.207      0.914     0.911     0.907      0.858

=== Confusion Matrix ===

   a    b   <-- classified as
1127  439 |    a = Y
  64 4051 |    b = N
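If you want to check how the summary figures follow from the confusion matrix, the arithmetic is short. The sketch below recomputes accuracy, recall (TP rate), precision, F-measure and the kappa statistic for class Y from the four cells above; the function name is made up, and the formulas are the standard textbook ones.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics for the positive class from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)          # TP rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Cells taken from the confusion matrix above (class Y as the positive class).
metrics = binary_metrics(tp=1127, fn=439, fp=64, tn=4051)
```

The recall of about 0.72 for class Y is the number to watch here: more than a quarter of the Tweets that really are about forest fires get filtered out.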

The large number of false negatives seemed problematic, i.e. Tweets that the machine learning algorithms classified as not being about forest fires, when in fact they were. It seemed we needed to adjust for different costs (“counting the costs”), i.e. a false negative should have a much higher cost than a false positive. In Weka, there are two ways of incorporating the cost: Either using a cost matrix for the evaluation part (won’t change the outcome), or using a cost matrix with the MetaCost classifier (will change the outcome). Surprisingly and unfortunately, the MetaCost classifier did not improve the results significantly. I tried several values with the NaiveBayes classifier. For a cost matrix of

0 (TP)    10 (FN)
1 (FP)      0 (TN)

the result is

   a    b   <-- classified as
1196  370 |    a = Y
 403 3712 |    b = N

As opposed to

   a    b   <-- classified as
1152  414 |    a = Y
 256 3859 |    b = N

for standard costs of

0(TP)    1(FN)
1(FP)    0(TN)

Further increasing the cost for FN does no good. Using

0(TP)    1(FN)
1(FP)    0(TN)

the results are

   a    b   <-- classified as
1552   14 |    a = Y
2779 1336 |    b = N
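Why a cost matrix shifts the balance between FN and FP can be seen from the minimum-expected-cost rule that underlies cost-sensitive classification. The sketch below is not MetaCost itself (as far as I understand it, MetaCost relabels the training data using bagged probability estimates and then retrains); it only shows the decision rule for a single Tweet, given some estimated probability that it is about forest fires. Function names and default costs are illustrative.

```python
def min_cost_label(p_yes, cost_fn=10.0, cost_fp=1.0):
    """Choose the label with the lower expected cost, given P(Y) = p_yes."""
    cost_if_predict_n = p_yes * cost_fn          # risk of missing a real forest fire Tweet
    cost_if_predict_y = (1.0 - p_yes) * cost_fp  # risk of letting noise through
    return "Y" if cost_if_predict_y <= cost_if_predict_n else "N"


def decision_threshold(cost_fn, cost_fp):
    """Probability above which predicting Y is the cheaper choice."""
    return cost_fp / (cost_fp + cost_fn)


# With equal costs, a Tweet with P(Y) = 0.2 is classified as N ...
equal_costs = min_cost_label(0.2, cost_fn=1.0, cost_fp=1.0)
# ... but with FN ten times as costly as FP, the same Tweet becomes Y.
fn_expensive = min_cost_label(0.2, cost_fn=10.0, cost_fp=1.0)
```

With FN = 10 and FP = 1 the threshold drops from 0.5 to 1/11, so fewer fire Tweets are missed at the price of more noise, which is the trade-off visible in the confusion matrices above.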

In summary, out of the various Decision Tree and Naive Bayes classifiers, the J48 works best. The biggest problem is a large number of false negatives introduced by the combination of
incendie <= 0 AND hectar <= 0 AND forest <= 0 (see above).
However, trying to split up that group proved futile: The only usable keyword would be the “fire” group, but adding fire > 0 equaling Y would introduce a large number of FP. Some exploratory filtering showed that there is no other reliable way to reduce this high number of FP without overfitting.
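To make the comparison with our hand-crafted rules easier, the pruned J48 tree from the first run can be written out as plain conditions. The function below is just a transcription of that tree; the function name is made up, and missing keyword groups are treated as zero counts.

```python
def j48_first_run(counts):
    """Transcription of the pruned J48 tree from the first run.

    `counts` maps keyword group names to their number of occurrences in a Tweet.
    """
    if counts.get("incendi", 0) > 0:
        return "Y"
    if counts.get("hectar", 0) > 0:
        return "Y" if counts.get("fire", 0) > 0 else "N"
    if counts.get("forest", 0) > 0:
        return "Y" if counts.get("fire", 0) > 0 else "N"
    # No incendi, hectar or forest keyword: the large leaf that holds
    # most of the false negatives, regardless of the "fire" count.
    return "N"


examples = [
    {"fire": 1, "hectar": 1},   # the "fire and hectares" rule
    {"fire": 1, "forest": 1},
    {"fire": 2},                # "fire" alone is not enough
    {"incendi": 1},
]
predictions = [j48_first_run(c) for c in examples]
```

Note how the second branch reproduces the “fire and hectares” rule mentioned earlier, and how a Tweet with only “fire” keywords ends up in the N leaf.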

Later, I had another go at it with a newly annotated data set from another case study. Again, I tried several classifiers (among them J48, logistic, Bayes, Ada/MultiBoost), and again, J48 works best overall. It also has the advantage that the result (i.e. the tree) is easily understandable. Noticing that “hectares” is such an important attribute (good for us with respect to the case study, but also part of the “what” situational awareness), I tried another run without it. Results are not better, but the decision tree is now relatively complicated and also uses the number of keywords. I removed that as well and the remaining decision tree is interesting for comparison:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: prepared_weka_input_cs_2011_steps_234-weka.filters.unsupervised.attribute.Remove-R1,15-16,18-21-weka.filters.unsupervised.attribute.Remove-R12-weka.filters.unsupervised.attribute.Remove-R12
Instances:1481
Attributes:12
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
on_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

forest <= 0
|    fire <= 0:  N (1316.0/63.0)
|    fire > 0
|   |    incendi <= 0:  Y (29.0/2.0)
|   |    incendi > 0:  N (72.0/17.0)
forest > 0:  Y (64.0/7.0)

Number of Leaves  :     4

Size of the tree :     7

Time taken to build model: 0.02seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1392               93.9905 %
Incorrectly Classified Instances        89                6.0095 %
Kappa statistic                          0.6235
Mean absolute error                      0.1098
Root mean squared error                  0.2348
Relative absolute error                 55.6169 %
Root relative squared error             74.8132 %
Total Number of Instances             1481

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.993     0.488      0.942     0.993     0.967      0.764     N
0.512     0.007      0.903     0.512     0.654      0.764     Y
Weighted Avg.    0.94      0.435      0.938     0.94      0.932      0.764

=== Confusion Matrix ===

   a    b   <-- classified as
1308    9 |    a =  N
  80   84 |    b =  Y

It seems that, given the data and the attributes used, the results from the machine learning support our initial, hand-crafted rule set. That’s fair enough for a result, so I abandoned my forays into machine learning at this point (the learning curve looked quite steep from here on, and resources are scarce, as always). This nice result however can’t hide the fact that we still have a large amount of noise that we can’t get rid of by looking only at the keywords used. So we need more sophisticated text analysis or a novel approach. Not being experts on natural language processing, we chose the second path and came up with something – but you really should read the papers or have a look at the GeoCONAVI site if you would like to learn more. If you have any questions or comments, please post them either here (for the benefit of others), or send me a PM.

We have one free access and one open access paper on our work newly published – check them out, especially the first one if you want to get an overview of what work sparked this blog:

Craglia, M., Ostermann, F., & Spinsanti, L. (2012). Digital Earth from vision to practice: making sense of citizen-generated content. International Journal of Digital Earth, 5(5), 398–416. doi:10.1080/17538947.2012.712273

http://www.tandfonline.com/doi/full/10.1080/17538947.2012.712273

Schade, S., Ostermann, F., Spinsanti, L., & Kuhn, W. (2012). Semantic Observation Integration. Future Internet, 4(3), 807–829

http://www.mdpi.com/1999-5903/4/3/807/htm

As promised last week a while ago, here’s an update on current research:

In case you’re wondering what a Big Blue Balloon has to do with research that’s supposed to be somehow related to geography, spatial information, or social media, let me help you: What’s a Big Blue Balloon? First of all, it’s a vehicle of some sort, because it is not stationary (though in all likelihood cannot move autonomously). Second, it’s in the air (at least as long as it’s inflated!). Thirdly, it may or may not be “manned”. In the latter case, it is an Unmanned Aerial Vehicle! Or UAV for the acronym lovers among us, and “drone” for those who prefer more catchy names (and abhor acronyms). And that’s the new project I am working on: Retrieving data from UAVs, and integrating it with existing spatial data infrastructures and user-generated geographic content (in case you haven’t noticed, that was the link to this blog’s overall theme). Yes, I know, everyone’s into drones right now, but I content myself that we’re looking into them from a distinctive angle, i.e. the data integration issue. We are also in the process of procuring a “real” UAS (that’s Unmanned Aerial System, including the ground control) in the form of a Mikrokopter, but due to legal, institutional and corporate issues, this is delayed (though not canceled).

In the meantime, we (btw: “we” means my colleague Laura and myself, and the credit for discovering and procuring the subject of this post is entirely hers!) have been looking into DIY-MUAVs (that would be Do-It-Yourself-Micro-Unmanned-Aerial-Vehicles) and grassroots aerial photography. There is an astonishing amount of activity (mentally taking a note here on a future geosocialite blog’s topic), but we have decided on this:

Source: Breadpig Shop (http://shop.breadpig.com/collections/publiclaboratory/products/balloon-mapping-kit)

It’s actually from the Public Laboratory for Open Technology and Science (PLOTS), and there is a lot of info on their web pages that I am not going to repeat here – go there and see for yourself. For the lazy among you or those short on time: It’s filled with helium and can easily carry one camera. Ordered, shipped, “assembled” and camera mounted (a Nikon Coolpix P6000 affixed to some polystyrene for protection), and ready-to-go. Erm, wait, we need helium. A lot of it. Do they sell this in a DIY store? Fortunately, the JRC does all kinds of things I still have no idea about. So after asking around, we found a large helium gas cartridge. After some very early tests, we were ready for our first real field trial:

Source: The Author (who is trying to get the &$%&§ interval shooting mode to work)

Lift off! Source: The Author

Up Up and Away! Source: You guessed correctly, the author.

Our Big Blue Balloon performed nicely, climbing up to an altitude of around 100 meters (then a helicopter flew past in what seemed a decidedly close distance, and we opted for less altitude), and taking lots of pictures with this camera on interval shooting:

Source: The Balloon

After a stroll back to the office, we parked our Big Blue Balloon in the basement. The images taken were of mixed quality, some excellent, a lot of them blurred and distorted (we have to work on the fixture to the balloon). Unfortunately, the GPS of the camera is quite weak, and did not obtain any coordinates after the first few images (can’t be a lack of satellites in line of sight, can it?). We tried to ortho-rectify some of them with MapKnitter.org, but with mixed success – the base layer is not very good, and since we mostly took images of some greenery, there are not that many structures on the ground to allow for good rectification. But not bad for a very first trial, I think.

So we are now devising two experiments with it: First, to get images from above to stitch and ortho-rectify them semi-automatically, and second, a panoramic 360° shooting.

PS: My boss has produced a cool short video taken during the first trial. Because of privacy etc. (we attracted some attention with this, so there are a number of people in the video, and we can’t ask them all for permission), it is not public – you can ask me for the link and password if you think I will trust you :-)

For the past two years, I have been working mainly on an exploratory research project investigating the use of social media for fighting forest fires. That is, what information do social media contain that might be helpful for decision makers, fire fighters on the ground, and the public, and how can we utilize this information best. This project ended officially (and according to plan) last May, and my intention was to share a lot of information on this project on Geosocialite. For various reasons, this has not happened (this part not according to plan), but the main reason was that I wanted to share some data and interactive maps with you. Well, both plans have been stuck for a little while now due to institutional and corporate policies beyond my control. Sharing is not always easy…

I still hope that I will be able to showcase some of the work before the forest fire season is over. In the meantime, those interested in the concepts behind our approach and what we implemented can have a look here: https://sites.google.com/site/geoconavi/home There you will also find some presentations and other stuff – we will keep on updating it.

Since I am now employed on another project (see below), I have to do any further work on GeoCONAVI in my spare time. Right now I am looking into machine learning for determining the topicality of some content, in other words, is some micro blog post about a forest fire or not. Some experiments with Weka on a set of annotated Tweets look promising, and I will share my experiences, insights and blunders soon here.

Another upcoming post will deal with some of the more entertaining aspects of my new main project (which is about Unmanned Aerial Vehicle data integration), and will involve a Big Blue Balloon…

Stay tuned :-)
