or so the saying goes. At least part of it. Anyway, it’s been very quiet here for almost three months now. The main reason is that most of my spare energy at the moment goes into searching for new work – my current project (and with it funding) will end in a couple of months, so I’m spending less time writing and more time scouting. And flying a UAV, actually, because we’re following up on last year’s successful “Big Blue Balloon” experience. I’ll be posting about Ed (our “Environmental Drone”) soonish.

In the meantime, let me recommend some great posts by other bloggers.

First, there’s always something worthwhile on iRevolution – I am always awed by the frequency with which Patrick can publish high-quality blog posts. Yesterday’s post caught my eye in particular, because it shows Patrick isn’t only a keen thinker and great communicator, but could also do well as an entrepreneur – check out his ideas for a smartphone application for disaster-struck communities here.

Then, I really love reading Brian Timoney’s MapBrief blog. It’s not only enlightening, it’s also fun – as long as you’re not the target of Brian’s sharp wit. Recently, he has run a series on why map portals don’t work. Most of the reasons should be pretty obvious, but equally obvious is the failure by most portals to do differently. Read up on it here.

On Zero Geography, Mark Graham shares with us the latest results from his research, and there have been several great posts recently on the usage of Twitter in several African cities. Visit it here to learn some surprising things about contemporary digital divides.

I hope you enjoy reading them as much as I did, but I also promise you won’t have to wait for “strange aeons” before some original material is posted here.

Although our system had been running successfully for quite a while behind high walls, it now finally became available online:

http://forest.jrc.ec.europa.eu/effis/applications/vgi/

Please note that

a) it’s still beta

b) currently, there are few forest fires in Europe, so you will likely see a higher ratio of noise

c) we haven’t put the clustering module online yet, because it still requires frequent human supervision to adjust parameters

If you’d like to know more, don’t hesitate to contact us!

Here’s some more info on the idea and technology behind it:

The rationale for the research is the emergence of new social media platforms that change the way people create and use information during crisis events. Most widespread are platforms for micro blogging (e.g. Twitter), networking (e.g. Facebook), and photo sharing (e.g. Flickr). These social media platforms increasingly offer to include geographic information on the whereabouts of their users or the information posted. Potentially, this rich information can contribute to a more effective response to natural disasters. In fact, social media have been put to good use on various occasions. This increasing amount of bi-directional horizontal peer-to-peer information exchange also affects the traditional uni-directional vertical flow of information. Traditional broadcasting media open up to micro journalism, and several official administrative agencies already adapt and use third-party social media accounts for communicating information. However, the incorporation of user-generated content (UGC) into the established administrative emergency protocols has not advanced significantly. It seems that public officials often view such volunteered information as a threat that could spread misinformation and rumours, as long as there is no reliable quality control.

So far, mainly human volunteers have carried out the tasks of filtering, validating and assessing the quality of UGC, and with great success. However, this approach is neither sustainable nor scalable for a continuous, reliable utilization of UGC in crisis response, because the amount of data is ever increasing, and volunteers might not be available in sufficient numbers. The research community has already begun to investigate how to assess the trust, reputation and credibility of UGC, and of volunteered geographic information (VGI) in particular, but several issues pose enormous challenges to automated approaches: the lack of a unified interface, combined with heterogeneous media formats and platforms, leads to a wide variety of possible data structures; there is little syntactical control over the data entered by the users; users and software developers are ingenious at overcoming device or interface limitations; and the proportion of disruptive or redundant content is unknown and variable.

We propose that an integration with existing spatial data infrastructures (SDI) and the geographic contextualization of geo-coded UGC (UGGC) can greatly enhance the options for assessing its quality. We call this approach the GEOgraphic CONtext Analysis of Volunteered Information (GEOCONAVI). This approach emulates one of the basic heuristics which humans use to deal with information that has unknown quality: A comparison with “What do I already know?” By spatio-temporally clustering UGGC, we emulate another heuristic, that of social confirmation (“What do others say?”), and look for confirming or contradicting content. Both these heuristics influence the credibility assessment of the UGGC or VGI. Another criterion is that of relevance, which we assess from the quasi-objective point of view of the potential damage a forest fire can cause, by investigating again the geographic context.

To recapitulate, the GEOCONAVI system requires the following tasks to be carried out semi-automatically or automatically: First, the retrieval and storage of UGC or VGI from various sources. Second, the enrichment of the retrieved UGC with information about source, content, location, and geographic context turning it into UGGC or VGI. Third, the clustering of the UGGC in space and time. Fourth, the detection of new events, or the assignment to known events. Fifth, the dissemination of the results.
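The third task, clustering in space and time, can be illustrated with a small sketch. This is not the GEOCONAVI implementation: the `Item` record, the `max_km` and `max_hours` thresholds and the single-linkage strategy are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Item:
    """A geo-coded piece of content (hypothetical record layout)."""
    item_id: str
    lat: float
    lon: float
    timestamp: float  # seconds since the epoch


def haversine_km(a, b):
    """Great-circle distance between two items in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def cluster(items, max_km=10.0, max_hours=6.0):
    """Single-linkage clustering: two items are linked when they are close
    in both space and time; clusters are the connected groups of links.
    The thresholds are illustrative, not values from the project."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            near = haversine_km(items[i], items[j]) <= max_km
            recent = abs(items[i].timestamp - items[j].timestamp) <= max_hours * 3600
            if near and recent:
                parent[find(i)] = find(j)

    groups = {}
    for i, item in enumerate(items):
        groups.setdefault(find(i), []).append(item.item_id)
    return sorted(sorted(ids) for ids in groups.values())
```

Items that end up in the same cluster are candidates for the “What do others say?” comparison; a cluster with a single member has no social confirmation yet.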

The following figure shows an overview of the workflow, plus the current implementation.

Even more info at:

https://sites.google.com/site/geoconavi/implementation-details

 

In this blog post, I’ll try to give a brief overview of my attempts to mine our Tweet database with machine learning methods, in order to find the few needles in the haystack (i.e. Tweets with information on forest fires). Therefore, it might be a bit less flashy than blue balloons or new open access journal articles, but maybe someone finds the following helpful. Or, even better, points out mistakes I have made. Because I am venturing into unknown territory here – discovering patterns in geographic objects through spatial clustering techniques is part of my job description, but “10-fold stratified cross validation” etc. were a bit intimidating even for me.

So the context is my research on using social media for improving the response to natural disasters like forest fires. In other words, to find recent micro-blog posts and images about forest fires and assess their quality and potential to increase situational awareness. At the JRC, we have developed a whole workflow (GeoCONAVI, for GEOgraphic CONtext Analysis of Volunteered Information) that covers different aspects of this endeavor; for more information, see this site and the previous blog post’s links to published journal articles.

Because the concept of “fire” is used metaphorically in many other contexts, we have a lot of noise in our retrieved data. In order to filter for topicality (i.e. whether an entry is about forest fire or not), we manually annotated around 6000 Tweets, then counted the occurrences of relevant keywords (e.g. “fire”, “forest”, “hectares”, “firefighters”, etc.). From this, we developed some simple rules to guide the filtering. For an obvious example, the simultaneous occurrence of “fire” and “hectares” is a very good indicator, even in the absence of any word related to vegetation. The results from our case studies show that our rules have merit. However, it seemed an inexcusable omission to not try any machine learning algorithms on this problem. Now that the project is finished, I finally found the time to do just that…
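To make the “fire” plus “hectares” example concrete, here is a minimal sketch of such a hand-made rule. The function names are made up for this post and only English keywords are handled; our actual rule set is larger and multilingual.

```python
import re


def count_keyword(text, keyword):
    """Count case-insensitive, whole-word occurrences of a keyword."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def fire_and_hectares_rule(text):
    """True if the text mentions both 'fire' and 'hectares'
    (an illustrative stand-in for one hand-crafted rule)."""
    return count_keyword(text, "fire") > 0 and count_keyword(text, "hectares") > 0
```

A Tweet like “Fire destroys 200 hectares near the village” passes the rule, while “This mixtape is fire” does not. Note that the whole-word match deliberately does not count “firefighters” as an occurrence of “fire”.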

So, the objectives were to find simple rules that allow us to determine whether a Tweet is about forest fires, and to compare those rules with the ones manually devised.

The method obviously is a supervised classification, and the concept to classify is Forest Fire Topicality. The instances from which to learn and on which to test are a set of roughly 6000 annotated Tweets, classified into “About Forest Fires” and “Not About Forest Fires”. The attributes used are the number of times each of the keywords shows up in a Tweet. The set of keywords is too large to post here (because of multiple languages used – did I mention that? no? sorry), but we grouped the keywords into categories. The set of keyword groups used is {fire, incendie, shrubs, forest, hectares, fire fighters, helicopters, canadair, alarm, evacuation} (NB: The distinction between fire and incendie groups is a result of languages like French, where we have the distinct words of “incendie(s)” and “feu(x)”).
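Turning a Tweet into such grouped counts might look like the sketch below. The keyword lists are a tiny hypothetical excerpt (the real lists are much longer and cover more languages), and the tokenizer is a simple assumption, not the one we used.

```python
import re
from collections import Counter

# Hypothetical excerpt of keyword groups; only "incendie(s)" and "feu(x)"
# are taken from the text above, the rest is made up for illustration.
KEYWORD_GROUPS = {
    "fire": ["fire", "fires", "feu", "feux"],
    "incendie": ["incendie", "incendies"],
    "forest": ["forest", "forests", "forêt", "forêts"],
    "hectares": ["hectare", "hectares"],
}


def tokenize(text):
    """Lower-case the text and split it into word tokens (unigrams)."""
    return re.findall(r"\w+", text.lower())


def group_counts(text):
    """Return one attribute per keyword group: the number of times any
    keyword of that group occurs in the text."""
    tokens = Counter(tokenize(text))
    return {
        group: sum(tokens[keyword] for keyword in keywords)
        for group, keywords in KEYWORD_GROUPS.items()
    }
```

Each annotated Tweet then becomes one row of numbers plus its class label, which is the kind of table a tool like Weka expects.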

The tool of choice is the Weka suite, mainly because of its availability and excellent documentation. As classification methods, I chose to focus on Naive Bayes methods and Decision Trees, because of their widespread use and because they fit the data (by the way, my guide through this experiment was mostly the excellent book “Data Mining – Practical Machine Learning Tools and Techniques, Second Edition” by Ian H. Witten and Eibe Frank – any errors I made are entirely my responsibility).

Regarding the data preparation, little was actually needed – no attribute selection or discretization was necessary, and we had already transformed the text (unigrams) into numbers (their occurrences).

So I was ready to load the CSVs into Weka, convert them to ARFF and start (machine) learning! For verification/error estimation, a standard stratified 10-fold cross validation seemed sufficient.
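For those as intimidated by the term as I was: stratified k-fold cross validation just splits the data into k parts so that each part has roughly the same class proportions as the whole set, then tests on each part in turn while training on the rest. The sketch below shows the idea in plain Python; it is not how Weka implements it internally, and the function names and the `seed` parameter are my own.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=42):
    """Assign instance indices to k folds so that every fold keeps
    approximately the class proportions of the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the instances of this class out to the folds round-robin,
        # continuing where the previous class stopped.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=10, seed=42):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [index for j, fold in enumerate(folds) if j != i for index in fold]
        yield train, test
```

The reported accuracy is then the share of correctly classified instances over all k test folds combined.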

The computations all went very quickly, and all showed roughly a 90% accuracy. Below is the output of one run:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: gt2_grpd_no_dup-weka.filters.unsupervised.attribute.Remove-R1,14
Instances:5681
Attributes:13
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
hectar
ff_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
——————

incendi <= 0
|    hectar <= 0
|   |    forest <= 0: N (4232.0/376.0)
|   |    forest > 0
|   |   |    fire <= 0: N (204.0/52.0)
|   |   |    fire > 0: Y (22.0/4.0)
|    hectar > 0
|   |    fire <= 0: N (54.0/11.0)
|   |    fire > 0: Y (34.0)
incendi > 0: Y (1135.0/60.0)

Number of Leaves  :     6

Size of the tree :     11

Time taken to build model: 0.16seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5178               91.1459 %
Incorrectly Classified Instances       503                8.8541 %
Kappa statistic                          0.7605
Mean absolute error                      0.1588
Root mean squared error                  0.2821
Relative absolute error                 39.7507 %
Root relative squared error             63.1279 %
Total Number of Instances             5681

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.72      0.016      0.946     0.72      0.818      0.858    Y
0.984     0.28       0.902     0.984     0.942      0.858    N
Weighted Avg.    0.911     0.207      0.914     0.911     0.907      0.858

=== Confusion Matrix ===

a    b   <– classified as
1127  439 |    a = Y
64 4051 |    b = N
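As a sanity check for myself, the summary figures can be recomputed from the confusion matrix alone. The sketch below uses the standard textbook formulas for a two-class problem (with “Y” as the positive class); the function name is my own.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard two-class metrics from the four confusion matrix cells."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Cells taken from the confusion matrix above (class Y as positive).
metrics = binary_metrics(tp=1127, fn=439, fp=64, tn=4051)
```

This reproduces the values Weka reports for class Y: an accuracy of about 91.1 %, a precision of about 0.946, a recall of about 0.72 and a kappa of about 0.76.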

The large number of false negatives seemed problematic, i.e. Tweets that the machine learning algorithms classified as not being about forest fires when in fact they were. It seemed we needed to adjust for different costs (“counting the costs”), i.e. a false negative should carry a much higher cost than a false positive. In Weka, there are two ways of incorporating the cost: either using a cost matrix for the evaluation part (won’t change the outcome), or using a cost matrix with the MetaCost classifier (will change the outcome). Surprisingly and unfortunately, the MetaCost classifier did not improve the results significantly. I tried several values with the NaiveBayes classifier. For a cost matrix of

0 (TP)    10 (FN)
1 (FP)      0 (TN)

the result is

a             b       <– classified as
1196      370  |    a = Y
403   3712   |    b = N

As opposed to

a             b       <– classified as
1152      414   |    a = Y
256    3859   |    b = N

for standard costs of

0(TP)    1(FN)
1(FP)    0(TN)

Further increasing the cost for FN does no good. Using

0(TP)    1(FN)
1(FP)    0(TN)

the results are

a             b       <– classified as
1552        14  |    a = Y
2779   1336  |    b = N
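To see what a cost matrix does, here is a small sketch of the two ideas involved: weighting the errors of a finished run, and the minimum expected cost rule for a single prediction. It is a simplified illustration with names of my own choosing, not a description of what MetaCost does internally.

```python
def total_cost(fn, fp, cost_fn, cost_fp):
    """Weighted cost of a run, given its false negatives and false positives."""
    return fn * cost_fn + fp * cost_fp


def min_cost_threshold(cost_fn, cost_fp):
    """Probability of class Y above which predicting Y is the cheaper choice."""
    return cost_fp / (cost_fp + cost_fn)


def predict(p_yes, cost_fn, cost_fp):
    """Pick the class with the lower expected cost.
    Predicting N risks a false negative with probability p_yes;
    predicting Y risks a false positive with probability 1 - p_yes."""
    expected_cost_of_n = p_yes * cost_fn
    expected_cost_of_y = (1 - p_yes) * cost_fp
    return "Y" if expected_cost_of_n > expected_cost_of_y else "N"
```

With a false negative costing 10 and a false positive costing 1, the first NaiveBayes run above (370 FN, 403 FP) has a weighted cost of 4103, against 4396 for the standard-cost run (414 FN, 256 FP). So the cost-sensitive run is cheaper under that matrix even though it makes more errors in total (773 versus 670).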

In summary, out of the various Decision Tree and Naive Bayes classifiers, J48 works best. The biggest problem is a large number of false negatives introduced by the combination of
incendi <= 0 AND hectar <= 0 AND forest <= 0 (see above).
However, trying to split up that group proved futile: The only usable keyword would be the “fire” group, but adding fire > 0 equaling Y would introduce a large number of FP. Some exploratory filtering showed that there is no other reliable way to reduce this high number of FP without overfitting.
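Written out as code, the first J48 tree shown above is nothing more than a handful of nested conditions. The function name and the dictionary input are my own choices; the branches follow the printed tree.

```python
def j48_first_tree(counts):
    """Classify a Tweet from its keyword-group counts, following the
    pruned J48 tree printed above. Missing groups count as 0."""
    def n(group):
        return counts.get(group, 0)

    if n("incendi") > 0:
        return "Y"
    if n("hectar") > 0:
        return "Y" if n("fire") > 0 else "N"
    if n("forest") > 0:
        return "Y" if n("fire") > 0 else "N"
    # incendi <= 0, hectar <= 0, forest <= 0: the big leaf that
    # contains 376 wrongly classified Tweets out of 4232.
    return "N"
```

Put differently, the tree says Y if “incendi” occurs, or if “fire” occurs together with either “hectar” or “forest”. That is very close to the kind of rule we had written by hand.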

Later, I had another go at it with a newly annotated data set from another case study. Again, I tried several classifiers (among them J48, logistic, Bayes, Ada/MultiBoost), and again, J48 works best overall. It also has the advantage that the result (i.e. the tree) is easily understandable. Noticing that “hectar” is such an important attribute (good for us with respect to the case study, but also part of the “what” situational awareness), I tried another run without it. Results are not better, but the decision tree is now relatively complicated and also uses the number of keywords. I removed that as well, and the remaining decision tree is interesting for comparison:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: prepared_weka_input_cs_2011_steps_234-weka.filters.unsupervised.attribute.Remove-R1,15-16,18-21-weka.filters.unsupervised.attribute.Remove-R12-weka.filters.unsupervised.attribute.Remove-R12
Instances:1481
Attributes:12
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
on_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
——————

forest <= 0
|    fire <= 0:  N (1316.0/63.0)
|    fire > 0
|   |    incendi <= 0:  Y (29.0/2.0)
|   |    incendi > 0:  N (72.0/17.0)
forest > 0:  Y (64.0/7.0)

Number of Leaves  :     4

Size of the tree :     7

Time taken to build model: 0.02seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1392               93.9905 %
Incorrectly Classified Instances        89                6.0095 %
Kappa statistic                          0.6235
Mean absolute error                      0.1098
Root mean squared error                  0.2348
Relative absolute error                 55.6169 %
Root relative squared error             74.8132 %
Total Number of Instances             1481

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.993     0.488      0.942     0.993     0.967      0.764     N
0.512     0.007      0.903     0.512     0.654      0.764     Y
Weighted Avg.    0.94      0.435      0.938     0.94      0.932      0.764

=== Confusion Matrix ===

a    b   <– classified as
1308    9 |    a =  N
80   84 |    b =  Y

It seems that, given the data and the attributes used, the results from the machine learning support our initial, hand-crafted rule set. That’s fair enough for a result, so I abandoned my forays into machine learning at this point (the learning curve looked quite steep from here on, and resources are scarce, as always). This nice result, however, can’t hide the fact that we still have a large amount of noise that we can’t get rid of by looking only at the keywords used. So we need more sophisticated text analysis or a novel approach. Not being experts on natural language processing, we chose the second path and came up with something – but you really should read the papers or have a look at the GeoCONAVI site if you would like to learn more. If you have any questions or comments, please post them either here (for the benefit of others), or send me a PM.

We have one free access and one open access paper on our work newly published – check them out, especially the first one if you want to get an overview of what work sparked this blog:

Craglia, M., Ostermann, F., & Spinsanti, L. (2012). Digital Earth from vision to practice: making sense of citizen-generated content. International Journal of Digital Earth, 5(5), 398–416. doi:10.1080/17538947.2012.712273

http://www.tandfonline.com/doi/full/10.1080/17538947.2012.712273

Schade, S., Ostermann, F., Spinsanti, L., & Kuhn, W. (2012). Semantic Observation Integration. Future Internet, 4(3), 807–829

http://www.mdpi.com/1999-5903/4/3/807/htm

As promised last week a while ago, here’s an update on current research:

In case you’re wondering what a Big Blue Balloon has to do with research that’s supposed to be somehow related to geography, spatial information, or social media, let me help you: What’s a Big Blue Balloon? First of all, it’s a vehicle of some sort, because it is not stationary (though in all likelihood cannot move autonomously). Second, it’s in the air (at least as long as it’s inflated!). Thirdly, it can or cannot be “manned”. In the latter case, it is an Unmanned Aerial Vehicle! Or UAV for the acronym lovers among us, and “drone” for those who prefer more catchy names (and abhor acronyms). And that’s the new project I am working on: Retrieving data from UAVs, and integrating it with existing spatial data infrastructures and user-generated geographic content (in case you haven’t noticed, that was the link to this blog’s overall theme). Yes, I know, everyone’s into drones right now, but I content myself that we’re looking into them from a distinctive angle, i.e. the data integration issue. We are also in the process of procuring a “real” UAS (that’s Unmanned Aerial System, including the ground control) in the form of a Mikrokopter, but due to legal, institutional and corporate issues, this is delayed (though not canceled).

In the meantime, we (btw: “we” means my colleague Laura and myself, and the credit for discovering and procuring the subject of this post is entirely hers!) have been looking into DIY-MUAVs (that would be Do-It-Yourself-Micro-Unmanned-Aerial-Vehicles) and grassroots aerial photography. There is an astonishing amount of activity (mentally taking a note here on a future geosocialite blog’s topic), but we have decided on this:

Source: Breadpig Shop (http://shop.breadpig.com/collections/publiclaboratory/products/balloon-mapping-kit)

It’s actually from the Public Laboratory for Open Technology and Science (PLOTS), and there is a lot of info on their web pages that I am not going to repeat here – go there and see for yourself. For the lazy among you or those short on time: It’s filled with helium and can easily carry one camera. Ordered, shipped, “assembled” and camera mounted (a Nikon Coolpix P6000 affixed to some polystyrene for protection), and ready-to-go. Erm, wait, we need helium. A lot of it. Do they sell this in a DIY store? Fortunately, the JRC does all kinds of things I still have no idea about. So after asking around, we found a large helium gas cartridge. After some very early tests, we were ready for our first real field trial:

Source: The Author (who is trying to get the &$%&§ interval shooting mode to work)

Lift off! Source: The Author

Up Up and Away! Source: You guessed correctly, the author.

Our Big Blue Balloon performed nicely, climbing up to an altitude of around 100 meters (then a helicopter flew past at what seemed a decidedly close distance, and we opted for less altitude), and taking lots of pictures with this camera on interval shooting:

Source: The Balloon

After a stroll back to the office, we parked our Big Blue Balloon in the basement. The images taken were of mixed quality: some excellent, a lot of them blurred and distorted (we have to work on the fixture to the balloon). Unfortunately, the GPS of the camera is quite weak, and did not obtain any coordinates after the first few images (can’t be a lack of satellites in line of sight, can it?). We tried to ortho-rectify some of them with MapKnitter.org, but with mixed success – the base layer is not very good, and since we mostly took images of some greenery, there are not that many structures on the ground to allow for good rectification. But not bad for a very first trial, I think.

So we are now devising two experiments with it: First, to get images from above to stitch and ortho-rectify them semi-automatically, and second, a panoramic 360° shooting.

PS: My boss has produced a cool short video taken during the first trial. Because of privacy etc. (we attracted some attention with this, so there are a number of people in the video, and we can’t ask them all for permission), it is not public – you can ask me for the link and password if you think I will trust you 🙂

For the past two years, I have been working mainly on an exploratory research project investigating the use of social media for fighting forest fires. That is, what information do social media contain that might be helpful for decision makers, fire fighters on the ground, and the public, and how can we utilize this information best. This project ended officially (and according to plan) last May, and my intention was to share a lot of information on this project on Geosocialite. For various reasons, this has not happened (this part not according to plan), but the main reason was that I wanted to share some data and interactive maps with you. Well, both plans have been stuck for a little while now due to institutional and corporate policies beyond my control. Sharing is not always easy…

I still hope that I will be able to showcase some of the work before the forest fire season is over. In the meantime, those interested in the concepts behind our approach and what we implemented can have a look here: https://sites.google.com/site/geoconavi/home There you will also find some presentations and other stuff – we will keep on updating it.

Since I am now employed on another project (see below), I have to do any further work on GeoCONAVI in my spare time. Right now I am looking into machine learning for determining the topicality of some content, in other words, is some micro blog post about a forest fire or not. Some experiments with Weka on a set of annotated Tweets look promising, and I will share my experiences, insights and blunders soon here.

Another upcoming post will deal with some of the more entertaining aspects of my new main project (which is about Unmanned Aerial Vehicle data integration), and will involve a Big Blue Balloon…

Stay tuned 🙂

I’ve just returned from the World Wide Web 2012 conference at Lyon, where I attended two workshops related to my research: “Making Sense of Microposts” and “Social Web for Disaster Management”. Both workshops had interesting talks and discussions, and I thought I’d share the highlights here. Bear in mind that the selection of papers and any comments are very much biased towards geographic information….

The World Wide Web Conference is one of the biggest events related to The Internet ™, and participants include researchers and developers from academia and industry, many of them with a background in computer science. In this respect, it attracts a slightly different community than I am used to – this fact, coupled with two workshops thematically closely linked to our exploratory research project “Engaging the Citizens in Forest Fire Risk and Impact Assessment”, promised a good opportunity to expand knowledge and discuss the state-of-the-art in related disciplines.

The first full-day workshop, “Making Sense of Microposts”, set out to examine “information extraction and leveraging of semantics from microposts, with a focus on novel methods for handling the particular challenges due to enforced brevity of expression; making use of the collective knowledge encoded in microposts’ semantics in innovative ways; social and enterprise studies that guide the design of appealing and usable new systems based on this type of data, by leveraging Semantic Web technologies.”

Highlights included the keynote presentation by Greg Ver Steeg (Information Sciences Institute, University of Southern California) on “Information Theoretic Tools for Social Media”, in which the presenter showed how concepts like “information transfer” and “information entropy” can be used to explain past and predict future user behaviour, or identify spam accounts. Another paper by David de Castro Reis (Google Engineering, Brazil) et al. looked at the problem from a slightly different angle than usual: Instead of trying to give a micro post some context and thus understand its topicality, they presented an approach to identify unambiguous keywords for search queries, which is applicable to the problem of high noise ratio encountered in our exploratory research. Unfortunately, it’s based on access to Google search query logs… Simon Scerri from the Digital Enterprise Research Institute (National University of Ireland) introduced a “Personal Information Model” based on a concise ontology (DLPO), which would facilitate the integration of various heterogeneous information sources. Finally, Ke Tao (Web Information Systems, Technical University Delft) presented a system that assesses the relevance of Tweets for a specific topic based on syntactical, semantic and contextual features. While there are some similarities to what we have developed, the differences in approach (such as the absence of the geographic context that we propose) encourage further discussion, which I hope to be able to initiate soon.

The second workshop I attended, “Social Web for Disaster Management”, attempted “to bring together researchers and practitioners who are interested in employing data from the social Web for disaster management.” With this objective, it is thematically very close to the exploratory research project, but drew its audience from a different community. The first talk by Cindy Hui (Rutgers University) presented a case study examining the spread of information through an emergency at the university. What’s still missing, IMHO, is to examine the spread of the actual information bits (i.e. what, where, who). Liam McNamara (Uppsala University) mined conversations from users changing their geographic location, with the aim of finding out how users communicated about their changing surroundings. The corpus used included only Tweets already geo-coded, and thus was potentially not a representative sample. An important aspect of using social media during crisis events would be to detect the changing nature of events, an issue the paper “Automatic Sub-Event Detection in Emergency Management Using Social Media” by Daniela Pohl (Klagenfurt University) addressed. However, again the geographic nature and extent of (crisis) events was under-represented. A research group from CSIRO (Australia) presented their system for increasing “Emergency Situation Awareness from Twitter for Crisis Management”, which bears some similarities to the approach developed at the JRC. Another confirmation for the JRC approach came from the work of Julie Dugdale (University of Grenoble 2), who presented results from interviewing the crisis responders and relief workers from the Haiti 2010 earthquake. The main finding was that individual distress messages were highly likely to be (unintentionally) wrong, but an aggregated view of all the incoming information helped the crisis responders to make the right decisions.

All in all, it was a rewarding trip to Lyon (big thanks to the organizers of the workshops!), last but not least because of one of the nicest public parks I have seen so far:

The Parc de la Tête d’Or, which was situated between my hotel and the convention center, and features a large lake, botanical gardens, small zoo, velodrome, etc. etc., allowing for a refreshing walk before and after a day full of ideas.