Archive

Instead of working on my backlog of half-finished drafts, Big Data issues keep on popping up. A while ago, I posted a longer piece on Big Citizen Data, and remarked that a lot of seemingly 20th century issues on data quality and sampling bias are being steadfastly ignored nowadays. Jonas Lerman has published an excellent argument in the Stanford Law Review on the matter of exclusion through digital invisibility. To cite the abstract:

Legal debates over the “big data” revolution currently focus on the risks of inclusion: the privacy and civil liberties consequences of being swept up in big data’s net. This Essay takes a different approach, focusing on the risks of exclusion: the threats big data poses to those whom it overlooks. Billions of people worldwide remain on big data’s periphery. Their information is not regularly collected or analyzed, because they do not routinely engage in activities that big data is designed to capture. Consequently, their preferences and needs risk being routinely ignored when governments and private industry use big data and advanced analytics to shape public policy and the marketplace. Because big data poses a unique threat to equality, not just privacy, this Essay argues that a new “data antisubordination” doctrine may be needed.” (source: Stanford Law Review, 03.09.2013).

The article is well worth reading, even if the second part is unfamiliar territory for those not well-versed in US law (e.g. me).

It made me rethink (though not change) my attitude towards some of the popular means of getting citizen (customer) information: If no precautions and countermeasures are taken, the socially and financially disadvantaged may actually want to share as much of their data on shopping, leisure activities and other preferences as possible in order to avoid being completely marginalized…

NB This post is not about Citizen Science, but about the data trail that each and every one of us generates, willingly or not, volunteered or not. It’s also a bit longer than usual. And yes, of course I focus on geographic data.

Isn’t there already a “Big Citizen Data” research bandwagon?

Yes, indeed, that’s true. There is a large and still rapidly growing body of research on the collection, analysis and utility of information from Citizens. The labels are just as diverse as the research, and include volunteered geographic information, neogeography, user-generated geographic content, or crowdsourced mapping – and that’s the geospatial domain only! The objectives range from improving humanitarian assistance for those in imminent danger and need, to improving your dinner experience by removing spam from peer rating platforms.

What I am missing, though, is research that explicitly aims to help Citizens in protecting their political rights and their ability to determine what information on them is available to whom. Call it critical geographic information science or counter mapping 2.0. (btw, I would be delighted by comments that prove me wrong on this one!).

Who cares? Well, everyone should. We lack a broad and informed public debate on the issue, despite

  • the ongoing disclosures on the various electronic surveillance programs of several prominent intelligence agencies,
  • the increasing demand of businesses to reveal data on yourself if you want to do business with them, and
  • the carelessness of your social network friends when posting pictures in which you are depicted or posts in which they mention you. The confusion around the new Facebook Graph Search shows that few people have sufficient knowledge of the technical issues.

My argument is that research on and use of citizen data is more beneficial than risky, because we need knowledge and tools for citizens to help them manage “their” information in this information age.

So are you Post-Privacy or in Denial?

So it seems that we have almost lost control over who is able to know what about us (although I am not sure we ever had real control). We leak data and information about ourselves in many ways:

  • Involuntarily: When criminals or government agencies break into your accounts or eavesdrop on your communication. For those who observe a few basic precautions, this is probably the least likely cause, but also potentially the most harmful one. More common is the re-packaging and sharing of information about you by companies, although that arguably belongs to the next bullet point, since we all accepted the TOS/EULAs, didn’t we?
  • Unknowingly: Who can say they fully understood Facebook’s privacy controls? The myriad options and constant changes to them, coupled with the unpredictable behaviour of your friends, make it practically impossible to control who can read what. During my research utilizing Twitter, I have been wondering numerous times whether people are actually aware that what they post is public and can be retrieved and read by anyone…
  • Willingly: For some, sharing information about oneself seems to be an addictive habit. However, often it is just fun and elicits interesting comments or even conversations.

So there are basically two responses to the problem: Going Post-Privacy, or Denial. In the first case, people just give up, or never cared in the first place, or value a diffuse feeling of security higher than privacy. It’s the dream of businesses and law enforcement alike: A transparent customer is a customer well-exploited, er, well-served, and a transparent citizen is a citizen well-controlled, er, well-protected. Not very appealing, if you ask me. But the alternative seems just as bleak: Basically, one would have to forego all electronic services altogether. That is still possible, but it will lead to social and political isolation and subsistence farming in the long run, because more and more of our social and business interactions will be based on electronic processes. Even the use of cash will become restricted to ever smaller transactions (as is already the case in Italy), with the justification that this limits money-laundering.

And scientific research?

In a way, science has looked at the issue from the other side of the glass: There is a wealth of information out there, and it is growing. But is it ethical to use this information if it hasn’t been volunteered for this particular purpose [1]? Can we just use someone’s Tweets to find out more about political sentiments at a given location? Can we display a distressed person’s request for help in a crisis situation without their explicit consent? If so, for how long? Do we have to delete the information once the situation might have changed? When conducting small-scale (micro) socio-economic analyses, what level of detail can we go for? Usually it’s restricted to building blocks, but with modern ways to link heterogeneous data together, even supposedly anonymous information can be de-anonymized with little or no ancillary information [2][3]. And with advanced machine learning algorithms, one can even predict where we are going to be [4]. To a certain extent, the response by the scientific community mirrors the public discourse. Either a shoulder-shrugging “That’s the way it is, now let’s crunch those numbers and process that text”, or a refusal to use new electronic data, often accompanied by a defensive “There’s no valuable information to be learned anyway in this ocean of trivia.”

But science should do more than just teach us more about our fellow citizens: First of all, it should also enable them to learn something about the consequences of their behavior, and provide options and alternatives to change these consequences. In a way, this is only fair, since academic research is funded largely by society through taxes, so that is one obvious reason why research should benefit the members of society, a.k.a. citizens. Further, it should also critically investigate the use of Big Data. I have come to think of Big Data as the Positivist’s Revenge, because it seems we are repeating some positivist mistakes with Big Data: The illusive promise of some objective truth to be found amidst all the patterns. Because now that we have so much data, there must be so much information in it, and we can forget 20th century issues like sampling bias, the base rate fallacy, the reproduction of power structures, marginality, etc., right? Right?

Well, of course not. The users and abusers of social media are a highly biased sample of the whole population, and those most likely to be in need of information and empowerment are those who are underrepresented. Examples include the content found on Google Maps vs. the content found on OpenStreetMap [5][6], and the gender bias of neogeographers [7]. The social components of the Web 2.0 and the availability of powerful open source software do not automatically result in a democratization of power [8].

So what now?

So all is lost? Do we really have to give up and become either digital emigrants, or exploited and controlled subjects? Is constant Sousveillance [9] the only way to fight back? Of course not. There are a number of ideas to turn Big Citizen Data into a citizen’s asset that goes beyond improving his restaurant experience or movie theater choice. Big Data can serve well in the context of crisis management [10]. Companies could become data philanthropists [11]. And we researchers can step up to the challenge and become a bit more idealist again. Yes, idealist researchers could approach the Big Data issues on two fronts:

First, research on methodology that allows the retrieval and analysis of large data sets with low hardware specs, so as to empower those with little knowledge and resources, reducing any Digital Divide, and giving them at least the ability to monitor their own information output. This won’t, of course, enable them to find out about the information on them that is in private hands. For that, we are desperately in need of legislation: Anyone who uses de-anonymized information on me should be required to inform me about it.

Second, research on issues that expose the digital divide and approach the digital representation of citizens critically. To be fair, the role and importance of citizens has already been acknowledged by various research funding schemes. But unless researchers step up to the challenges and really care and do something about them, the ubiquitous and frequent mentioning of the term “Citizen” in research proposals will produce nothing but longer and more convoluted proposals.

References

[1] Harvey, F. (2012). To Volunteer or to Contribute Locational Information? Towards Truth in Labeling for Crowdsourced Geographic Information. In D. Sui, S. Elwood, & M. Goodchild (Eds.), Crowdsourcing Geographic Knowledge: Volunteered Geographic Information (VGI) in Theory and Practice (pp. 31–42). Berlin: Springer.

[2] http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html

[3] http://www.technologyreview.com/news/514351/has-big-data-made-anonymity-impossible/

[4] http://www.cs.rochester.edu/~sadilek/publications/Sadilek-Krumm_Far-Out_AAAI-12.pdf

[5] http://www.wired.com/wiredscience/2013/08/power-of-amateur-cartographers/

[6] http://www.zeit.de/digital/internet/2013-05/google-maps-palaestina-israelische-siedlungen

[7] http://www.floatingsheep.org/2012/07/sheepcamp-2012-monica-stephens-on.html

[8] https://povesham.wordpress.com/2012/06/22/neogeography-and-the-delusion-of-democratisation/

[9] https://en.wikipedia.org/wiki/Sousveillance

[10] http://irevolution.net/2013/06/10/wrong-assumptions-big-data/

[11] http://irevolution.net/2012/06/04/big-data-philanthropy-for-humanitarian-response/

#hochwasser 2013 in Germany

I’d like to summarize my perception of the use of social media during the European floods of 2013, with a special emphasis on Germany (NB most of the links are for German sources; for an excellent blog post focused on Dresden, go here). Since I was travelling during the event, I had to gather my information only recently, i.e. after the actual event. The information is therefore certainly incomplete, and I’d be happy about additional information and corrections from the gentle readers…

For those outside of Germany, here’s a brief overview of what happened:

  • The floods were mainly caused by a cold and wet spring that left soils saturated, coupled with an abnormal meteorological situation and heavy rains for several days at the end of May and the beginning of June.
  • The floods affected most countries of central Europe; however, I will focus on Germany here.
  • In Germany, several Länder were affected, with the worst damage occurring in the South and East.
  • The two weeks saw a massive mobilisation of around 75,000 firefighters, plus 19,000 soldiers.
  • Several cities reported record water levels, several dams burst, and large areas were flooded. There were 14 deaths.
  • The situation is now mostly under control, with only some areas still flooded. See the official information here.

For more information, the German and English Wikipedia articles are a good start, with lots and lots of references.

Examples of social media use (Facebook pages, maps) include:

  • A Google map for the city of Magdeburg curated by four collaborators, with over one million hits and a corresponding Facebook page.
  • Another Google map for the city of Dresden, curated by eight collaborators and with almost four million hits.
  • A third Google map for Halle, a bit smaller in scope with two contributors and half a million hits.
  • Additionally, there are many pages on Facebook, usually focusing on a geographic area or place.
  • On Twitter, the most used hashtag seems to be #hochwasser, but many others were also used. On a dedicated channel, requests and offers for help for Dresden could be posted (see also a corresponding website).

As I mentioned, I wasn’t able to collect any data – if someone has data and would like to attempt an analysis, I’d be happy to help out.

For Germany, the use of social media during a disaster was a new experience – fortunately, not that many large-scale disasters occur here, and the last one (the floods of 2002) happened before the advent of social media. In consequence, the use of social media found an echo in more traditional broadcast media (e.g. Handelsblatt, Neue Osnabrücker Zeitung, and Spiegel Online).

Highlights and lowlights

In other words, what worked and what didn’t?

Positive experiences include:

  • Many volunteers can be mobilised in little time.
  • More information (and information channels) was available to everyone (with an internet connection).
  • Self-organized help (who does what) works overall, with volunteers gathering and providing information, helping to deploy sandbags, and supporting other volunteers with infrastructure and consumables.

Some negative experiences were:

  • No weighting or ranking available, making it difficult to estimate the importance and urgency of information and requests. Subjective criteria like proximity and local knowledge can help but may be misleading.
  • A blurring between private and official channels.
  • A lack of feedback and checks led to occasional proliferation of wrong information.
  • Too many helpers and a lack of coordination can have a negative impact (coordination, gawkers, …).

But apparently, a lack of coordination can also affect public authorities (article on Cicero).

Algorithms to the rescue?

It’s obvious that the problems described above are not specifically German or flood-related. They are problems that haunt any undertaking by a large crowd. In my humble opinion, there are two main avenues to overcoming them and thereby increasing the utility of social media: Improved filtering and ranking, and improved platforms.

I have been an advocate for algorithmic filtering and ranking of social media messages for some time now (see my research publications and this blog). Various studies show that even in critical situations like disasters, algorithmic approaches can provide two important advantages: First, they can filter out noise and redundant messages. And second, they can organize and enrich the remaining information to facilitate human curation. Examples of algorithmic approaches include Swiftriver and GeoCONAVI, with ongoing research for example at the QCRI. The Ushahidi platform and the Standby Task Force are examples of successful human (crowdsourced) filtering and curation.
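To make the filtering-and-ranking idea a bit more concrete, here is a minimal Python sketch. It is not the GeoCONAVI pipeline; the message structure, keyword list and scoring function are purely illustrative assumptions:

from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Message:
    text: str
    retweets: int = 0

KEYWORDS = {"hochwasser", "flood", "sandbags", "evacuation", "dam"}  # illustrative only

def near_duplicate(a, b, threshold=0.9):
    # Crude near-duplicate check; a real system would use shingling or minhashing.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def filter_and_rank(messages):
    kept = []
    for msg in messages:
        # 1) Drop obvious noise: no topical keyword at all.
        if not any(k in msg.text.lower() for k in KEYWORDS):
            continue
        # 2) Drop redundant messages that are near-duplicates of something already kept.
        if any(near_duplicate(msg.text, other.text) for other in kept):
            continue
        kept.append(msg)
    # 3) Rank the remainder, e.g. by keyword density plus a simple popularity signal.
    score = lambda m: sum(m.text.lower().count(k) for k in KEYWORDS) + 0.1 * m.retweets
    return sorted(kept, key=score, reverse=True)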

I have also been a long-time skeptic of the utility of information streams, which are one of the dominating characteristics of Web 2.0 (from the proverbial Twitter streams to Facebook’s Timeline to the increasing number of “live tickers” on news sites that replace journalistic and editorial care with unfiltered, raw data). These relentless streams of information don’t stop for important news, and marginal (but nevertheless important) events risk being overlooked. He who shouts the loudest and the longest wins (the battle for attention). In order to organize the flood of information, a more interactive interface is necessary, such as … a map! Putting the textual information from Facebook posts, Tweets and other sources on a combined map and making the information searchable by place, time and content would be a significant improvement. While I wish to express my sincere congratulations and respect to the map makers linked above, it is also obvious that for larger events and more up-to-date information, more resources are needed. Either computing power and algorithms, or volunteers and professionals. Or, even better, both.
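As a toy illustration of what “searchable by place, time and content” could mean in practice, consider a simple query function over geolocated posts (the GeoPost structure and the example coordinates are assumptions, not taken from any of the maps linked above):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class GeoPost:
    text: str
    lat: float
    lon: float
    timestamp: datetime

def query(posts, bbox, start, end, keyword):
    # bbox = (min_lat, min_lon, max_lat, max_lon); keyword match is case-insensitive.
    min_lat, min_lon, max_lat, max_lon = bbox
    return [p for p in posts
            if min_lat <= p.lat <= max_lat
            and min_lon <= p.lon <= max_lon
            and start <= p.timestamp <= end
            and keyword.lower() in p.text.lower()]

# Example: posts mentioning sandbags in central Dresden during the flood peak.
# hits = query(posts, (51.00, 13.65, 51.10, 13.80),
#              datetime(2013, 6, 3), datetime(2013, 6, 7), "sandsäcke")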

Can we do it?

It seems that the current state of affairs in Germany resembles the situation of the Californian wildfires of 2007. I’m not trying to be condescending here – this is not surprising because there are fewer natural disasters in Germany, and the infrastructure for dealing with those is generally good (and it seems there is still room for improvement in the US, too).

However, simply tapping into the gigantic information stream is not the solution per se (as Patrick Meier argues as well), but a first step. There are many examples that show it’s possible, and our GeoCONAVI system used off-the-shelf hardware to monitor four European countries for social media on forest fires. In my opinion, the big problems are not computational, but ethical, legal and organizational. Legal implications include issues of privacy (although if only public messages are being used, this is less of a problem) and liability – what if wrong information leads to property damage, or even worse to the loss of human life? Organizational and political obstacles, at least in Germany, include the many agencies involved in civil protection: the Federal level (strictly for defence issues), the Länder level (strictly for natural disasters and such, with each Land having its own agency), plus the various organizations such as (volunteer) fire departments, the Technisches Hilfswerk, etc. Since disasters don’t stop at geographical or organizational borders, this could be a real problem, although it seems that during the 2013 flood the public authorities coordinated their work rather closely and well (with the exception mentioned above). The EU also has a new Emergency Response Centre that builds on the capabilities and knowledge of the JRC.

I’d like to recommend two excellent critical papers on user-generated geographic content and the geosocial web. The first one is by Muki Haklay and raises important issues concerning the democratizing effects of the Web 2.0 and neogeography, while the second one, by Crampton et al., takes up the issue and suggests possible ways to improve the study and analysis of geosocial media.

In his study [1], Haklay argues that neogeographic theory and practice assume an instrumentalist view of technology, i.e. that technology is value-free and that there is a clear separation between the means and the ends. Obviously, Haklay does not agree with this view and argues that there is less empowerment and democratization to be found than commonly assumed. Anyone implementing neogeographic tools or practices who wants to realize their full potential therefore needs to take economic and political aspects into account. There is a substantial body of work supporting Haklay, including the research by Mark Graham [2], which I recommended in my last post. Patrick Meier on iRevolution has an in-depth commentary on Haklay’s paper [3] and provides a somewhat more optimistic interpretation. My own point of view runs along similar lines to Haklay’s, in that the contemporary digital divides are a continuation of the old power divides that participatory GIS sought to overcome in the 90s. And while I have no ill will towards companies that add value to user-generated content, I am highly skeptical of such “involuntary crowdsourcing”, in which the crowd freely provides the raw material but in the end has to pay for access to the derived products [4]. There is some similarity to the argument for Open Government Data – why should the tax payers (and tax-paying companies) pay again for the use of the data, when they already paid for its creation?

Crampton et al. [5] critically investigate the hype around the “Big Data” geoweb. They remind the reader of (a) the limitations inherent in “big-data”-based analysis and (b) the shortcomings of the simple spatial ontology of the geotag. Concerning (a), the data used often has limited explanatory value or informational richness, something our research has shown as well [6]. Further, geocoded social media remain a non-representative sample, no matter how many items one has collected. Concerning (b), Crampton et al. point out a number of problems with the geotag, e.g. that it is difficult to ascertain whether it refers to the origin or the topic of the content, that its lineage and accuracy are unclear, and that it oversimplifies geography by limiting place geometry to points or lat/lon pairs (see also [7]). As a consequence of their analysis, the authors suggest that studies of the geoweb should try to take into account:

  1. social media that is not explicitly geographic
  2. spatialities beyond the “here and now”
  3. methodologies that are not focused on proximity only
  4. non-human social media
  5. geographic data from non-user generated sources.

I have to admit that I am a little bit proud to say that our research has addressed three of those suggestions: We haven’t limited our sample to geo-coded social media; instead, we have re-geocoded even those entries with existing coordinates to ensure that we capture the places the social media was about. We have also gone beyond the “here and now” by spatio-temporally clustering the data. Finally, a core concept of our approach is the enrichment of the social media data with explicitly geographic data from non-user-generated (i.e. authoritative) sources (a paper describing the details has just been accepted but is not published yet; an overview can be found here [8]).
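For readers wondering what spatio-temporal clustering can look like in its simplest form, here is an illustrative Python sketch using DBSCAN on scaled space-time coordinates. It is a generic stand-in, not the actual GeoCONAVI implementation, and the scaling factors are assumptions:

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_space_time(lats, lons, hours, space_eps_km=10.0, time_eps_h=6.0, min_samples=5):
    # Rough planar approximation: degrees to kilometres (ignores latitude distortion).
    km_per_deg = 111.0
    x = np.asarray(lons) * km_per_deg
    y = np.asarray(lats) * km_per_deg
    # Rescale time so that time_eps_h hours count as much as space_eps_km kilometres.
    t = np.asarray(hours) * (space_eps_km / time_eps_h)
    features = np.column_stack([x, y, t])
    # Returned labels: -1 marks noise, all other integers are cluster ids.
    return DBSCAN(eps=space_eps_km, min_samples=min_samples).fit_predict(features)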

Crampton et al. conclude their paper with the important reminder that caution is needed regarding the surveillance potential of such research, with intelligence agencies around the world focusing more and more on open source intelligence (OSINT). Indeed, it seems that even in Really Big Data, our spatial behaviour is unique enough to allow identification [9].

[1] http://www.envplan.com/abstract.cgi?id=a45184

[2] http://www.zerogeography.net/

[3] http://irevolution.net/2013/03/17/neogeography-and-democratization/

[4] http://phg.sagepub.com/content/36/1/72.abstract

[5] http://www.tandfonline.com/doi/abs/10.1080/15230406.2013.777137?journalCode=tcag20#.UYZa6klic8o

[6] http://www.igi-global.com/article/context-analysis-volunteered-geographic-information/75443

[7] http://www.tandfonline.com/doi/full/10.1080/17538947.2012.712273#.UYZaeUlic8o

[8] https://sites.google.com/site/geoconavi/implementation-details

[9] http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html

or so the saying goes. At least part of it. Anyway, it’s been very quiet here for almost three months now. The main reason is that most of my spare energy at the moment goes into searching for new work – my current project (and with it my funding) will end in a couple of months, so I’m spending less time writing and more time scouting. And flying a UAV, actually, because we’re following up on last year’s successful “Big Blue Balloon” experience. I’ll be posting about Ed (our “Environmental Drone”) soonish.

In the meantime, let me recommend some great posts by other bloggers.

First, there’s always something worthwhile on iRevolution – I am always awed by the frequency with which Patrick can publish high-quality blog posts. Yesterday’s post caught my eye in particular, because it shows Patrick isn’t only a keen thinker and great communicator, but could also do well as an entrepreneur – check out his ideas for a smartphone application for disaster-struck communities here.

Then, I really love reading Brian Timoney’s MapBrief blog. It’s not only enlightening, it’s also fun – as long as you’re not the target of Brian’s sharp wit. Recently, he has run a series on why map portals don’t work. Most of the reasons should be pretty obvious, but equally obvious is the failure of most portals to do things differently. Read up on it here.

On Zero Geography, Mark Graham shares with us the latest results from his research, and there have been several great posts recently on the usage of Twitter in several African cities. Visit it here to learn some surprising things about contemporary digital divides.

I hope you enjoy reading them as much as I did, but I also promise you won’t have to wait for “strange aeons” before some original material is posted here.

In this blog post, I’ll try to give a brief overview of my attempts to mine our Tweet database with machine learning methods, in order to find the few needles in the haystack (i.e. Tweets with information on forest fires). As such, it might be a bit less flashy than blue balloons or new open access journal articles, but maybe someone finds the following helpful. Or, even better, points out mistakes I have made. Because I am venturing into unknown territory here – discovering patterns in geographic objects through spatial clustering techniques is part of my job description, but “10-fold stratified cross validation” etc. was a bit intimidating even for me.

So the context is my research on using social media to improve the response to natural disasters like forest fires. In other words, the aim is to find recent micro-blog posts and images about forest fires and assess their quality and potential to increase situational awareness. At the JRC, we have developed a whole workflow (GeoCONAVI, for GEOgraphic CONtext Analysis of Volunteered Information) that covers different aspects of this endeavor; for more information see this site and the previous blog post’s links to published journal articles.

Because the concept of “fire” is used metaphorically in many other contexts, we have a lot of noise in our retrieved data. In order to filter for topicality (i.e. whether an entry is about forest fires or not), we manually annotated around 6000 Tweets, then counted the occurrences of relevant keywords (e.g. “fire”, “forest”, “hectares”, “firefighters”, etc.). From this, we developed some simple rules to guide the filtering. For an obvious example, the simultaneous occurrence of “fire” and “hectares” is a very good indicator, even in the absence of any word related to vegetation. The results from our case studies show that our rules have merit. However, it seemed an inexcusable omission not to try any machine learning algorithms on this problem. Now that the project is finished, I finally found the time to do just that…
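As an illustration of what such keyword-counting rules can look like, here is a small Python sketch. The keyword lists are short, made-up stand-ins for the multilingual lists we actually used, and the rules are simplified:

FIRE = {"fire", "feu", "feux", "incendie", "incendies"}
HECTARES = {"hectare", "hectares", "ha"}
VEGETATION = {"forest", "forêt", "shrub", "brush"}

def keyword_counts(text):
    tokens = text.lower().split()
    return {"fire": sum(tokens.count(w) for w in FIRE),
            "hectares": sum(tokens.count(w) for w in HECTARES),
            "vegetation": sum(tokens.count(w) for w in VEGETATION)}

def about_forest_fire(text):
    counts = keyword_counts(text)
    # Rule 1: "fire" together with "hectares" is a very good indicator,
    # even without any vegetation-related word.
    if counts["fire"] > 0 and counts["hectares"] > 0:
        return True
    # Rule 2: otherwise require both a fire word and a vegetation word.
    return counts["fire"] > 0 and counts["vegetation"] > 0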

So, the objectives were to find simple rules that allow us to determine whether a Tweet is about forest fires, and to compare those rules with the manually devised ones.

The method obviously is supervised classification, and the concept to classify is Forest Fire Topicality. The instances from which to learn and on which to test are a set of roughly 6000 annotated Tweets, classified into “About Forest Fires” and “Not About Forest Fires”. The attributes used are the number of times each of the keywords shows up in a Tweet. The set of keywords is too large to post here (because of the multiple languages used – did I mention that? no? sorry), but we grouped the keywords into categories. The set of keyword groups used is {fire, incendie, shrubs, forest, hectares, fire fighters, helicopters, canadair, alarm, evacuation} (NB: The distinction between the fire and incendie groups is a result of languages like French, which has the distinct words “incendie(s)” and “feu(x)”).

The tool of choice is the Weka suite, mainly because of its availability and excellent documentation. As classification methods, I chose to focus on Naive Bayes methods and Decision Trees, because of their widespread use and because they fit the data (by the way, my guide through this experiment was mostly the excellent book “Data Mining – Practical Machine Learning Tools and Techniques, Second Edition” by Ian H. Witten and Eibe Frank – any errors I made are entirely my responsibility).

Regarding the data preparation, little was actually needed – no attribute selection or discretization was necessary, and we had already transformed the text (unigrams) into numbers (their occurrences).

So I was ready to load the CSVs into Weka, convert them to ARFF and start (machine) learning! For  verification/error estimation, a standard stratified 10-fold cross validation seemed sufficient.
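For readers more at home in Python than in Weka, roughly the same setup (keyword-group counts as attributes, a decision tree, stratified 10-fold cross-validation) might look as follows with scikit-learn. The file and column names are assumptions, and scikit-learn’s tree is CART rather than C4.5/J48:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("annotated_tweets.csv")       # hypothetical export of the annotated set
X = df.drop(columns=["ff_topic"])              # keyword-group counts per Tweet
y = df["ff_topic"]                             # class labels "Y" / "N"

clf = DecisionTreeClassifier(random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))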

The computations all went very quickly, and all showed roughly 90% accuracy. Below is the output of one run:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: gt2_grpd_no_dup-weka.filters.unsupervised.attribute.Remove-R1,14
Instances:5681
Attributes:13
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
hectar
ff_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
——————

incendi <= 0
|    hectar <= 0
|   |    forest <= 0: N (4232.0/376.0)
|   |    forest > 0
|   |   |    fire <= 0: N (204.0/52.0)
|   |   |    fire > 0: Y (22.0/4.0)
|    hectar > 0
|   |    fire <= 0: N (54.0/11.0)
|   |    fire > 0: Y (34.0)
incendi > 0: Y (1135.0/60.0)

Number of Leaves  :     6

Size of the tree :     11

Time taken to build model: 0.16seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5178               91.1459 %
Incorrectly Classified Instances       503                8.8541 %
Kappa statistic                          0.7605
Mean absolute error                      0.1588
Root mean squared error                  0.2821
Relative absolute error                 39.7507 %
Root relative squared error             63.1279 %
Total Number of Instances             5681

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.72      0.016      0.946     0.72      0.818      0.858    Y
0.984     0.28       0.902     0.984     0.942      0.858    N
Weighted Avg.    0.911     0.207      0.914     0.911     0.907      0.858

=== Confusion Matrix ===

a    b   <– classified as
1127  439 |    a = Y
64 4051 |    b = N

The large number of false negatives seemed problematic, i.e. Tweets that the machine learning algorithms classified as not being about forest fires when in fact they were. It seemed we needed to adjust for different costs (“counting the costs”), i.e. a false negative should carry a much higher cost than a false positive. In Weka, there are two ways of incorporating the costs: Either using a cost matrix for the evaluation part (which won’t change the outcome), or using a cost matrix with the MetaCost classifier (which will change the outcome). Surprisingly and unfortunately, the MetaCost classifier did not improve the results significantly. I tried several values with the NaiveBayes classifier. For a cost matrix of

0 (TP)    10 (FN)
1 (FP)      0 (TN)

the result is

a             b       <– classified as
1196      370  |    a = Y
403   3712   |    b = N

As opposed to

a             b       <– classified as
1152      414   |    a = Y
256    3859   |    b = N

for standard costs of

0(TP)    1(FN)
1(FP)    0(TN)

Further increasing the cost for FN does no good. Using

0(TP)    1(FN)
1(FP)    0(TN)

the results are

a             b       <– classified as
1552        14  |    a = Y
2779   1336  |    b = N

In summary, out of the various Decision Tree and Naive Bayes classifiers, the J48 works best. The biggest problem is a large number of false negatives introduced by the combination of incendi <= 0 AND hectar <= 0 AND forest <= 0 (see above). However, trying to split up that group proved futile: The only usable keyword would be the “fire” group, but adding a rule of fire > 0 implying Y would introduce a large number of false positives. Some exploratory filtering showed that there is no other reliable way to reduce the high number of false negatives without overfitting.
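If one wanted to reproduce the cost-sensitive experiment outside Weka, a common scikit-learn analogue to MetaCost is to weight the positive class more heavily, e.g. 10:1 to mirror the cost matrix above. This is a sketch with assumed file and column names, not the original setup:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

df = pd.read_csv("annotated_tweets.csv")       # hypothetical file, as in the earlier sketch
X, y = df.drop(columns=["ff_topic"]), df["ff_topic"]

# Penalize false negatives roughly ten times more than false positives via class weights.
clf = DecisionTreeClassifier(class_weight={"Y": 10, "N": 1}, random_state=0)
pred = cross_val_predict(clf, X, y, cv=10)
print(confusion_matrix(y, pred, labels=["Y", "N"]))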

Later, I had another go at it with a newly annotated data set from another case study. Again, I tried several classifiers (among them J48, logistic regression, Bayes, Ada/MultiBoost), and again, J48 works best overall. It also has the advantage that the result (i.e. the tree) is easily understandable. Noticing that “hectares” is such an important attribute (good for us with respect to the case study, but also part of the “what” of situational awareness), I tried another run without it. The results are not better, but the decision tree is now relatively complicated and also uses the number of keywords. I removed that attribute as well, and the remaining decision tree is interesting for comparison:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: prepared_weka_input_cs_2011_steps_234-weka.filters.unsupervised.attribute.Remove-R1,15-16,18-21-weka.filters.unsupervised.attribute.Remove-R12-weka.filters.unsupervised.attribute.Remove-R12
Instances:1481
Attributes:12
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
on_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
——————

forest <= 0
|    fire <= 0:  N (1316.0/63.0)
|    fire > 0
|   |    incendi <= 0:  Y (29.0/2.0)
|   |    incendi > 0:  N (72.0/17.0)
forest > 0:  Y (64.0/7.0)

Number of Leaves  :     4

Size of the tree :     7

Time taken to build model: 0.02seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1392               93.9905 %
Incorrectly Classified Instances        89                6.0095 %
Kappa statistic                          0.6235
Mean absolute error                      0.1098
Root mean squared error                  0.2348
Relative absolute error                 55.6169 %
Root relative squared error             74.8132 %
Total Number of Instances             1481

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.993     0.488      0.942     0.993     0.967      0.764     N
0.512     0.007      0.903     0.512     0.654      0.764     Y
Weighted Avg.    0.94      0.435      0.938     0.94      0.932      0.764

=== Confusion Matrix ===

a    b   <– classified as
1308    9 |    a =  N
80   84 |    b =  Y

It seems that, given the data and the attributes used, the results from the machine learning support our initial, hand-crafted rule set. That’s fair enough for a result, so I abandoned my forays into machine learning at this point (the learning curve looked quite steep from here on, and resources are scarce, as always). This nice result, however, can’t hide the fact that we still have a large amount of noise that we can’t get rid of by looking only at the keywords used. So we need either more sophisticated text analysis or a novel approach. Not being experts in natural language processing, we chose the second path and came up with something – but you really should read the papers or have a look at the GeoCONAVI site if you would like to learn more. If you have any questions or comments, please post them either here (for the benefit of others), or send me a PM.