
Although our system had been running successfully for quite a while behind high walls, it has now finally become available online:

http://forest.jrc.ec.europa.eu/effis/applications/vgi/

Please note that

a) it’s still beta

b) currently, there are few forest fires in Europe, so you will likely see a higher ratio of noise

c) we haven’t put the clustering module online yet, because it still requires frequent human supervision to adjust parameters

If you’d like to know more, don’t hesitate to contact us!

Here’s some more info on the idea and technology behind it:

The rationale for the research is the emergence of new social media platforms that change the way people create and use information during crisis events. Most widespread are platforms for micro-blogging (e.g. Twitter), networking (e.g. Facebook), and photo sharing (e.g. Flickr). These platforms increasingly offer the option to attach geographic information on the whereabouts of their users or on the information posted. Potentially, this rich information can contribute to a more effective response to natural disasters, and social media have in fact been put to good use on various occasions. This increasing amount of bi-directional, horizontal, peer-to-peer information exchange also affects the traditional uni-directional, vertical flow of information. Traditional broadcasting media are opening up to micro-journalism, and several administrative agencies already adopt third-party social media accounts for communicating information. However, the incorporation of user-generated content (UGC) into established administrative emergency protocols has not advanced significantly. Public officials often seem to view such volunteered information as a threat that could spread misinformation and rumours, as long as there is no reliable quality control.

So far, the tasks of filtering, validating and assessing the quality of UGC have been carried out mainly by human volunteers, and with great success. However, this approach is not sustainable or scalable for a continuous, reliable utilization of UGC in crisis response, because the amount of data is ever increasing and volunteers might not be available in sufficient numbers. The research community has begun to investigate how to assess the trust, reputation and credibility of UGC, and of volunteered geographic information (VGI) in particular, but several issues pose enormous challenges to automated approaches: the lack of a unified interface and the heterogeneity of media formats and platforms, which lead to a wide variety of possible data structures; the lack of syntactical control over the data entered by users; the ingenuity of users and software developers in overcoming device or interface limitations; and an unknown and variable proportion of disruptive or redundant content.

We propose that an integration with existing spatial data infrastructures (SDI) and the geographic contextualization of geo-coded UGC (UGGC) can greatly enhance the options for assessing its quality. We call this approach the GEOgraphic CONtext Analysis of Volunteered Information (GEOCONAVI). This approach emulates one of the basic heuristics which humans use to deal with information that has unknown quality: A comparison with “What do I already know?” By spatio-temporally clustering UGGC, we emulate another heuristic, that of social confirmation (“What do others say?”), and look for confirming or contradicting content. Both these heuristics influence the credibility assessment of the UGGC or VGI. Another criterion is that of relevance, which we assess from the quasi-objective point of view of the potential damage a forest fire can cause, by investigating again the geographic context.
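
To give a flavour of what the spatio-temporal clustering step can look like, here is a minimal sketch in Python. It is not the production code of the clustering module, and the distance/time thresholds are made-up placeholders; tuning them is exactly the part that still needs human supervision, as mentioned above:

# Minimal sketch of the spatio-temporal clustering of geo-coded posts (not the
# actual GEOCONAVI module); the thresholds are illustrative placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_posts(posts, space_eps_km=10.0, time_eps_h=6.0, min_samples=3):
    """posts: list of dicts with projected coordinates 'x_km', 'y_km' and a
    timestamp 'hours'. Time is rescaled so that time_eps_h corresponds to the
    same distance as space_eps_km, allowing a single eps threshold."""
    scale = space_eps_km / time_eps_h
    data = np.array([[p["x_km"], p["y_km"], p["hours"] * scale] for p in posts])
    labels = DBSCAN(eps=space_eps_km, min_samples=min_samples).fit_predict(data)
    return labels  # -1 marks noise, other integers are cluster ids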

To recapitulate, the GEOCONAVI system requires the following tasks to be carried out semi-automatically or automatically: First, the retrieval and storage of UGC or VGI from various sources. Second, the enrichment of the retrieved UGC with information about source, content, location, and geographic context turning it into UGGC or VGI. Third, the clustering of the UGGC in space and time. Fourth, the detection of new events, or the assignment to known events. Fifth, the dissemination of the results.

The following figure shows an overview of the workflow, plus the current implementation.

Even more info at:

https://sites.google.com/site/geoconavi/implementation-details

 

In this blog post, I’ll try to give a brief overview of my attempts to mine our Tweet database with machine learning methods, in order to find the few needles in the haystack (i.e. Tweets with information on forest fires). It might therefore be a bit less flashy than blue balloons or new open access journal articles, but maybe someone finds the following helpful. Or, even better, points out mistakes I have made. Because I am venturing into unknown territory here: discovering patterns in geographic objects through spatial clustering techniques is part of my job description, but “10-fold stratified cross-validation” and the like were a bit intimidating even for me.

So the context is my research on using social media to improve the response to natural disasters like forest fires. In other words, to find recent micro-blog posts and images about forest fires and to assess their quality and their potential to increase situational awareness. At the JRC, we have developed a whole workflow (GeoCONAVI, for GEOgraphic CONtext Analysis of Volunteered Information) that covers the different aspects of this endeavor; for more information, see this site and the previous blog post’s links to published journal articles.

Because the concept of “fire” is used metaphorically in many other contexts, we have a lot of noise in our retrieved data. In order to filter for topicality (i.e. whether an entry is about forest fire or not), we manually annotated around 6000 Tweets, then counted the occurrences of relevant keywords (e.g. “fire”, “forest”, “hectares”, “firefighters”, etc.). From this, we developed some simple rules to guide the filtering. As an obvious example, the simultaneous occurrence of “fire” and “hectares” is a very good indicator, even in the absence of any word related to vegetation. The results from our case studies show that our rules have merit. However, it seemed an inexcusable omission not to try any machine learning algorithms on this problem. Now that the project is finished, I finally found the time to do just that…
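
To make this concrete, a hand-crafted rule of this kind boils down to something like the following sketch; the keyword lists are simplified stand-ins for the full multilingual set we actually used:

# Simplified sketch of a hand-crafted topicality rule; the keyword lists are
# illustrative stand-ins for the full multilingual set.
KEYWORDS = {
    "fire": ["fire", "feu", "fuoco"],
    "hectares": ["hectares", "hectare", "ettari"],
}

def count_group(text, group):
    words = text.lower().split()
    return sum(words.count(k) for k in KEYWORDS[group])

def looks_on_topic(tweet_text):
    # "fire" together with "hectares" is a very good indicator, even without
    # any vegetation-related word; further rules follow the same pattern
    return count_group(tweet_text, "fire") > 0 and count_group(tweet_text, "hectares") > 0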

So, the objectives were to find simple rules for determining whether a Tweet is about forest fires, and to compare those rules with the manually devised ones.

The method obviously is a supervised classification, and the concept to classify is Forest Fire Topicality. The instances to learn from and to test on are a set of roughly 6000 annotated Tweets, classified into “About Forest Fires” and “Not About Forest Fires”. The attributes used are the number of times each of the keywords shows up in a Tweet. The set of keywords is too large to post here (because of the multiple languages used; did I mention that? No? Sorry), but we grouped the keywords into categories. The set of keyword groups used is {fire, incendie, shrubs, forest, hectares, fire fighters, helicopters, canadair, alarm, evacuation}. (NB: The distinction between the fire and incendie groups is a result of languages like French, which have the distinct words “incendie(s)” and “feu(x)”.)
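
In other words, each annotated Tweet becomes one row of keyword-group counts plus its class label. A minimal sketch of producing such an input table; the keyword lists per group are again just placeholders, while the group and class names follow the attribute names in the Weka output further down:

# Sketch of turning the annotated Tweets into keyword-group counts plus the
# class label, one row per instance; keyword lists are placeholders.
import csv

GROUPS = {
    "fire": ["fire", "feu", "fuoco"],
    "incendi": ["incendie", "incendies", "incendio"],
    "forest": ["forest", "forêt", "bosco"],
    "hectar": ["hectares", "hectare", "ettari"],
}

def to_row(tweet_text, label):
    words = tweet_text.lower().split()
    row = {g: sum(words.count(k) for k in kws) for g, kws in GROUPS.items()}
    row["ff_topic"] = label  # "Y" or "N", as annotated manually
    return row

def write_instances(annotated_tweets, path="instances.csv"):
    # annotated_tweets: iterable of (text, label) pairs
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(GROUPS) + ["ff_topic"])
        writer.writeheader()
        for text, label in annotated_tweets:
            writer.writerow(to_row(text, label))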

The tool of choice is the Weka suite, mainly because of its availability and excellent documentation. As classification methods, I chose to focus on Naive Bayes methods and Decision Trees, because of their widespread use and because they fit the data (by the way, my guide through this experiment was mostly the excellent book “Data Mining – Practical Machine Learning Tools and Techniques, Second Edition” by Ian H. Witten and Eibe Frank – any errors I made are entirely my responsibility).

Regarding the data preparation, little was actually needed: no attribute selection or discretization was necessary, and we had already transformed the text (unigrams) into numbers (their occurrence counts).

So I was ready to load the CSVs into Weka, convert them to ARFF and start (machine) learning! For verification/error estimation, a standard stratified 10-fold cross-validation seemed sufficient.
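
For readers who prefer scripting this instead of clicking through the Weka GUI, roughly the same estimate can be obtained in Python with scikit-learn. Note that this is only a rough equivalent (scikit-learn’s CART rather than Weka’s J48/C4.5), and the file name is a placeholder for the exported keyword-count table:

# Rough Python equivalent of a stratified 10-fold cross-validation with a
# decision tree; "instances.csv" is a placeholder for the exported table.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("instances.csv")
X = data.drop(columns=["ff_topic"])
y = data["ff_topic"]

# min_samples_leaf=2 loosely mirrors J48's -M 2 setting
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(min_samples_leaf=2), X, y, cv=cv)
print("accuracy per fold:", scores)
print("mean accuracy: %.3f" % scores.mean())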

The computations all ran very quickly, and all showed roughly 90% accuracy. Below is the output of one run:

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: gt2_grpd_no_dup-weka.filters.unsupervised.attribute.Remove-R1,14
Instances: 5681
Attributes: 13
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
hectar
ff_topic
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

incendi <= 0
|    hectar <= 0
|   |    forest <= 0: N (4232.0/376.0)
|   |    forest > 0
|   |   |    fire <= 0: N (204.0/52.0)
|   |   |    fire > 0: Y (22.0/4.0)
|    hectar > 0
|   |    fire <= 0: N (54.0/11.0)
|   |    fire > 0: Y (34.0)
incendi > 0: Y (1135.0/60.0)

Number of Leaves  :     6

Size of the tree :     11

Time taken to build model: 0.16 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5178               91.1459 %
Incorrectly Classified Instances       503                8.8541 %
Kappa statistic                          0.7605
Mean absolute error                      0.1588
Root mean squared error                  0.2821
Relative absolute error                 39.7507 %
Root relative squared error             63.1279 %
Total Number of Instances             5681

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.72      0.016      0.946     0.72      0.818      0.858    Y
0.984     0.28       0.902     0.984     0.942      0.858    N
Weighted Avg.    0.911     0.207      0.914     0.911     0.907      0.858

=== Confusion Matrix ===

a    b   <-- classified as
1127  439 |    a = Y
64 4051 |    b = N
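
Read as plain rules, the pruned tree above is nothing more than the following nested conditions (a direct transcription, with Y meaning “about forest fires”):

# The pruned J48 tree from the run above, transcribed into plain if/else rules;
# the inputs are the per-Tweet keyword-group counts.
def classify(incendi, hectar, forest, fire):
    if incendi > 0:
        return "Y"
    if hectar > 0:
        return "Y" if fire > 0 else "N"
    if forest > 0:
        return "Y" if fire > 0 else "N"
    return "N"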

The large number of false negatives seemed problematic, i.e. Tweets that the machine learning algorithms classified as not being about forest fires when in fact they were. It seemed we needed to adjust for different costs (“counting the costs”), i.e. a false negative should carry a much higher cost than a false positive. In Weka, there are two ways of incorporating the cost: either using a cost matrix for the evaluation only (which won’t change the outcome), or using a cost matrix with the MetaCost classifier (which will change the outcome). Surprisingly and unfortunately, the MetaCost classifier did not improve the results significantly. I tried several values with the NaiveBayes classifier; a small sketch of how such a cost matrix is applied to a confusion matrix follows the results below. For a cost matrix of

0 (TP)    10 (FN)
1 (FP)      0 (TN)

the result is

a             b       <-- classified as
1196      370  |    a = Y
403   3712   |    b = N

As opposed to

a             b       <-- classified as
1152      414   |    a = Y
256    3859   |    b = N

for standard costs of

0 (TP)    1 (FN)
1 (FP)    0 (TN)

Further increasing the cost for FN does no good. With a much higher cost for FN, the results are

a             b       <-- classified as
1552        14  |    a = Y
2779   1336  |    b = N
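
For reference, the “counting the costs” evaluation itself is trivial: weigh each cell of the confusion matrix by the corresponding cost. A small sketch, applied to the two confusion matrices above:

# Total misclassification cost: weigh each cell of the confusion matrix by the
# corresponding cost-matrix entry (correctly classified cells cost 0 here).
def total_cost(tp, fn, fp, tn, cost_fn, cost_fp, cost_tp=0, cost_tn=0):
    return tp * cost_tp + fn * cost_fn + fp * cost_fp + tn * cost_tn

# MetaCost run with FN weighted 10:1 -> 370 FN, 403 FP
print(total_cost(1196, 370, 403, 3712, cost_fn=10, cost_fp=1))  # 4103
# standard costs on the same classifier -> 414 FN, 256 FP
print(total_cost(1152, 414, 256, 3859, cost_fn=10, cost_fp=1))  # 4396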

In summary, out of the various Decision Tree and Naive Bayes classifiers, J48 works best. The biggest problem is the large number of false negatives introduced by the combination of
incendi <= 0 AND hectar <= 0 AND forest <= 0 (see above).
However, trying to split up that group proved futile: the only usable keyword would be the “fire” group, but adding a rule that fire > 0 implies Y would introduce a large number of FP. Some exploratory filtering showed that there is no other reliable way to reduce this high number of FN without overfitting.

Later, I had another go at it with a newly annotated data set from another case study. Again, I tried several classifiers (among them J48, logistic regression, Bayes, Ada/MultiBoost), and again, J48 works best overall. It also has the advantage that the result (i.e. the tree) is easily understandable. Noticing that “hectar” is such an important attribute (good for us with respect to the case study, and also part of the “what” of situational awareness), I tried another run without it. The results are not better, and the decision tree becomes relatively complicated and also uses the number of keywords. I removed that attribute as well, and the remaining decision tree is interesting for comparison:

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: prepared_weka_input_cs_2011_steps_234-weka.filters.unsupervised.attribute.Remove-R1,15-16,18-21-weka.filters.unsupervised.attribute.Remove-R12-weka.filters.unsupervised.attribute.Remove-R12
Instances: 1481
Attributes: 12
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
on_topic
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

forest <= 0
|    fire <= 0:  N (1316.0/63.0)
|    fire > 0
|   |    incendi <= 0:  Y (29.0/2.0)
|   |    incendi > 0:  N (72.0/17.0)
forest > 0:  Y (64.0/7.0)

Number of Leaves  :     4

Size of the tree :     7

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1392               93.9905 %
Incorrectly Classified Instances        89                6.0095 %
Kappa statistic                          0.6235
Mean absolute error                      0.1098
Root mean squared error                  0.2348
Relative absolute error                 55.6169 %
Root relative squared error             74.8132 %
Total Number of Instances             1481

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.993     0.488      0.942     0.993     0.967      0.764     N
0.512     0.007      0.903     0.512     0.654      0.764     Y
Weighted Avg.    0.94      0.435      0.938     0.94      0.932      0.764

=== Confusion Matrix ===

a    b   <-- classified as
1308    9 |    a =  N
80   84 |    b =  Y

It seems that, given the data and the attributes used, the results from the machine learning support our initial, hand-crafted rule set. That’s fair enough for a result, so I abandoned my forays into machine learning at this point (the learning curve looked quite steep from here on, and resources are scarce, as always). This nice result, however, can’t hide the fact that we still have a large amount of noise that we can’t get rid of by looking only at the keywords used. So we need either more sophisticated text analysis or a novel approach. Not being experts in natural language processing, we chose the second path and came up with something, but you really should read the papers or have a look at the GeoCONAVI site if you would like to learn more. If you have any questions or comments, please post them either here (for the benefit of others) or send me a PM.

As promised a while ago, here’s an update on current research:

In case you’re wondering what a Big Blue Balloon has to do with research that’s supposed to be somehow related to geography, spatial information, or social media, let me help you: What’s a Big Blue Balloon? First of all, it’s a vehicle of some sort, because it is not stationary (though in all likelihood it cannot move autonomously). Second, it’s in the air (at least as long as it’s inflated!). Third, it may or may not be “manned”. In the latter case, it is an Unmanned Aerial Vehicle! Or UAV for the acronym lovers among us, and “drone” for those who prefer catchier names (and abhor acronyms). And that’s the new project I am working on: retrieving data from UAVs and integrating it with existing spatial data infrastructures and user-generated geographic content (in case you haven’t noticed, that was the link to this blog’s overall theme). Yes, I know, everyone’s into drones right now, but I content myself with the thought that we’re looking at them from a distinctive angle, i.e. the data integration issue. We are also in the process of procuring a “real” UAS (that’s Unmanned Aerial System, including the ground control) in the form of a Mikrokopter, but due to legal, institutional and corporate issues this is delayed (though not canceled).

In the meantime, we (btw: “we” means my colleague Laura and myself, and the credit for discovering and procuring the subject of this post is entirely hers!) have been looking into DIY-MUAVs (that would be Do-It-Yourself Micro Unmanned Aerial Vehicles) and grassroots aerial photography. There is an astonishing amount of activity (mentally taking a note here for a future geosocialite blog topic), but we have decided on this:

Source: Breadpig Shop (http://shop.breadpig.com/collections/publiclaboratory/products/balloon-mapping-kit)

It’s actually from the Public Laboratory for Open Technology and Science (PLOTS), and there is a lot of info on their web pages that I am not going to repeat here: go there and see for yourself. For the lazy among you, or those short on time: it’s filled with helium and can easily carry one camera. Ordered, shipped, “assembled”, camera mounted (a Nikon Coolpix P6000 affixed to some polystyrene for protection), and ready to go. Erm, wait, we need helium. A lot of it. Do they sell this in a DIY store? Fortunately, the JRC does all kinds of things I still have no idea about, so after asking around, we found a large helium gas cartridge. After some very early tests, we were ready for our first real field trial:

Source: The Author (who is trying to get the &$%&§ interval shooting mode to work)

Lift off! Source: The Author

Up Up and Away! Source: You guessed correctly, the author.

Our Big Blue Balloon performed nicely, climbing to an altitude of around 100 meters (then a helicopter flew past at what seemed a decidedly close distance, and we opted for less altitude) and taking lots of pictures with this camera in interval shooting mode:

Source: The Balloon

After strolling back to the office, we parked our Big Blue Balloon in the basement. The images taken were of mixed quality: some excellent, a lot of them blurred and distorted (we have to work on the fixture to the balloon). Unfortunately, the GPS of the camera is quite weak and did not obtain any coordinates after the first few images (it can’t be a lack of satellites in line of sight, can it?). We tried to ortho-rectify some of them with MapKnitter.org, but with mixed success: the base layer is not very good, and since we mostly took images of some greenery, there are not that many structures on the ground to allow for good rectification. But not bad for a very first trial, I think.

So we are now devising two experiments with it: first, to take images from above and stitch and ortho-rectify them semi-automatically, and second, a panoramic 360° shoot.
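
For the first of those experiments, a first cut at the stitching part could simply use OpenCV’s high-level stitcher. This is just a sketch of the idea: it produces a mosaic, not an ortho-rectified image, and the file names are placeholders:

# Minimal stitching sketch with OpenCV's high-level Stitcher (OpenCV 4; older
# versions use cv2.createStitcher()). This only produces a mosaic, it does not
# ortho-rectify. File names are placeholders.
import glob
import cv2

images = [cv2.imread(p) for p in sorted(glob.glob("balloon_photos/*.jpg"))]
stitcher = cv2.Stitcher_create()
status, panorama = stitcher.stitch(images)
if status == 0:  # 0 == Stitcher::OK
    cv2.imwrite("mosaic.jpg", panorama)
else:
    print("stitching failed with status", status)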

PS: My boss has produced a cool short video taken during the first trial. Because of privacy etc. (we attracted some attention with this, so there are a number of people in the video, and we can’t ask them all for permission), it is not public – you can ask me for the link and password if you think I will trust you 🙂

Dinosaurs?

“Did you see Aldo, Ugo, Luca, Maria or Anna? If so, then send us a text message reporting which dinosaur you saw, and where. If you spot all five of them, you’ll win a small reward.” – That was the teaser line with which we tried to lure people into our experiment. What experiment, you ask? Well, there’s the occasion of the Open Day of the JRC, where roughly 10,000 visitors get an overview of the diverse activities of our center. We were one of the activities, and saw an opportunity not only to show what we do with your data, but also to get the visitors to participate in generating information, analyze it, and learn from the experience. JRC security was not too enthusiastic about our initial proposal to have people report on (fake) wildfires. Neither were they keen on our follow-up idea of using real animals like wolves and bears… So in the end we had the idea to use dinos, because kids love them, and if a stray message goes to someone else, that person might have second thoughts about the sender’s sanity, but at least shouldn’t panic or alert the authorities. And that’s how Aldo, Maria, and company were born.

Meet Aldo – he would like to give you his phone number…

Set-up

While in our project we monitor social media networks for information on forest fires, for the open day we chose to use SMS text messages, because almost everyone who has a mobile phone knows how to use them. That decision, however, led directly to the first problem: how do we access the messages? Although most phones come with some proprietary software to access text messages on a computer, the overall quality of these programs is, well, questionable. Plus, most of them don’t allow easy export for further analysis. Fortunately, there is FrontlineSMS, a great tool that does just about everything we want: receive messages, export them, run external commands or HTTP requests, etc. Unfortunately, it works only with a limited number of phones, mostly older ones. I had hoped that by now everyone has at least three generations of mobile phones in their basement, but it took quite a while and some effort to find one (and the correct data cable!) that worked. We finally managed to get our hands on an old Nokia 6021 and its CA-42 data cable.

Now we had a way to get to the messages. What’s next? Well, the elegant way would have been to use FrontlineSMS’s HTTP feature and send the messages to a RESTful web service of our own, where they would first be analysed and then visualized on a map of the JRC. Unfortunately, there wasn’t much time left, and I am a geographic information analyst by background, not a skilled web designer or programmer (one could say “amateur”, but actually I prefer the old meaning of the word “dilettante”…). Plus, I was impressed by the Crowdmap service of the Ushahidi platform. So I made the fateful decision of “why reinvent the wheel?”… Instead of using a web service, I wrote a little Python script that got triggered by FrontlineSMS when a message arrived. It would analyze the message, looking for a dino name and a placename. The latter was provided by our own gazetteer: after having chosen the locations of the five dinos, we geo-referenced all buildings, streets and points of interest in the vicinity, setting up a gazetteer in English and Italian. After the Python script had categorized and geo-located a message, it created a report that was uploaded to a Crowdmap deployment. While there are many ways to transmit messages to Crowdmap, only the upload of a CSV allows location information to be included. The data was also displayed on a desktop computer running QuantumGIS, because the Crowdmap had only limited querying functionality and we wanted to show people their messages.
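
The core of that script was not much more than a couple of lookups. A stripped-down sketch of the idea; the dino names are real, but the gazetteer entries, coordinates and CSV columns shown here are simplified stand-ins for what we actually used:

# Stripped-down sketch of the message-handling script: find a dino name and a
# gazetteer placename in the SMS text, then append a row to the CSV that gets
# uploaded to Crowdmap. Gazetteer entries, coordinates and columns are
# simplified stand-ins.
import csv

DINOS = ["aldo", "ugo", "luca", "maria", "anna"]
GAZETTEER = {              # placename -> (lat, lon), in English and Italian
    "building 36": (45.810, 8.627),
    "edificio 36": (45.810, 8.627),
    "main gate": (45.812, 8.630),
}

def parse_message(text):
    text = text.lower()
    dino = next((d for d in DINOS if d in text), None)
    place = next((p for p in GAZETTEER if p in text), None)
    return dino, place

def append_report(text, out_path="crowdmap_upload.csv"):
    dino, place = parse_message(text)
    if dino is None or place is None:
        return False           # left aside for manual inspection
    lat, lon = GAZETTEER[place]
    with open(out_path, "a", newline="") as f:
        csv.writer(f).writerow([dino.capitalize(), place, lat, lon, text])
    return True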

Another problem was the performance of the set-up: I had anticipated a few thousand people, of which a fraction would play our game. After I heard that more than 10,000 visitors were expected, all with families, I became a little bit nervous…

So, the day before the open day, we set up everything. I’ll spare you the details of what went wrong; suffice it to say that it was a lot. Last-minute changes to the code, WiFi issues, you name it… However, with the help of our colleagues, we got the dinosaurs in place and the system running. Then the weather forecast predicted some serious thunderstorms for the morning. Would our dinos stand up to them?

The Open Day

Well, they did. Mostly. Maria was a bit wobbly behind her plexiglas shielding, since water had leaked in despite the best efforts of our unit’s park rangers… So the gates opened at 10, and the messages began pouring in a few minutes thereafter. But what was that? Some messages did not get geo-located as they should have. Well, blame me and my lack of Italian skills. Of course there are special characters, and I had tried to anticipate them. What I had not anticipated was the heavy use of the apostrophe before placenames (“all’edificio”). But that was quickly fixed by operating on the open heart of the system, i.e. the Python script. A more serious problem was that people used placenames other than those we had thought of, or made heavy use of abbreviations (maybe I should text more often, then I would have anticipated this). We had tested the system with colleagues, of course, but they had reported in a more structured way. So there’s obviously a difference in how people who work with geographic information report places and how everyone else does… who would have thought. But we fixed that too, by expanding the gazetteer on the fly. So the messages kept pouring in, and the map filled with dots that grew bigger and bigger (because reports were clustered and symbolized proportionally to cluster size). Everything was going well, except… there were no reports of Maria. None. I got nervous. Was there a bug in the code I had not thought of? Some devious placename that messed up the parsing algorithm? But a thorough check confirmed: Maria simply had not been found and reported yet. So we dispatched a team of seasoned troubleshooters and dino hunters to find out what had happened to Maria. They soon reported back: not only had Maria been damaged by the rain. Not only was she in a place that did not have as many visitors. No, she was also partially obscured by a parked car! The only car parked in the whole area (only a few cars were allowed on the premises during the open day) was parked right in front of Maria! Well, that’s why field work is dirty: there are circumstances you just cannot foresee. In the end, Maria got some reports, but only a fraction of those we received about the other dinosaurs.
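
The apostrophe fix itself was a small affair; something along these lines (a simplified reconstruction, not the exact code we ran):

# Simplified reconstruction of the on-the-fly fix: strip short elided Italian
# articles/prepositions such as "all'", "l'", "dell'" so that "all'edificio 36"
# still matches the gazetteer entry "edificio 36" (straight apostrophes only;
# a crude sketch, not a general-purpose tokenizer).
import re

def normalize(text):
    text = text.lower()
    text = re.sub(r"\b\w{1,4}'", " ", text)  # drop the elided word before the apostrophe
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Ho visto Maria all'edificio 36"))  # -> "ho visto maria edificio 36"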

In total, we got 293 reports of dinosaur sightings. That is actually much less than we had expected (and feared!). In fact, it was probably to our advantage that people thought the game was meant to be played only by kids, and that the competition from the activities of other units and departments was so tough (there were more than 80 activities in the whole research center). I doubt the mobile phone would have been able to handle triple the number of messages… And the desktop where people could ask about their messages was busy enough already.

So that’s the end of the story: here you can see all the reports that were successfully geo-coded and categorized (the others are not displayed) on the map. We’ll analyze the messages we got, geocode those that failed only because of gaps in our gazetteer, and publish the results in more detail, both on this blog and in a journal.

Screenshot of JRC Open Day Crowdmap deployment:

It was a very exciting two-week period, and I learned a tremendous amount of things. Most of all, it increased my respect for everyone out there working in the field, who’s setting up and running things in much more adverse conditions than we had to face. My respects also to the community of software developers and volunteers who make the numerous crowdmapping efforts possible. And a big thanks to my colleagues who helped us set up everything and run it.