In this blog post, I’ll try to give a brief overview of my attempts to mine our Tweet database with machine learning methods, in order to find the few needles in the haystack (i.e. Tweets with information on forest fires). This post might therefore be a bit less flashy than blue balloons or new open access journal articles, but maybe someone finds the following helpful. Or, even better, points out mistakes I have made. Because I am venturing into unknown territory here: discovering patterns in geographic objects through spatial clustering techniques is part of my job description, but “10-fold stratified cross validation” and the like were a bit intimidating even for me.

So the context is my research on using social media to improve the response to natural disasters like forest fires; in other words, to find recent micro-blog posts and images about forest fires and to assess their quality and potential to increase situational awareness. At the JRC, we have developed a whole workflow (GeoCONAVI, for GEOgraphic CONtext Analysis of Volunteered Information) that covers different aspects of this endeavor; for more information, see this site and the previous blog post’s links to published journal articles.

Because the concept of “fire” is used metaphorically in many other contexts, we have a lot of noise in our retrieved data. In order to filter for topicality (i.e. whether an entry is about forest fires or not), we manually annotated around 6000 Tweets, then counted the occurrences of relevant keywords (e.g. “fire”, “forest”, “hectares”, “firefighters”, etc.). From this, we developed some simple rules to guide the filtering. For an obvious example, the simultaneous occurrence of “fire” and “hectares” is a very good indicator, even in the absence of any word related to vegetation. The results from our case studies show that our rules have merit. However, it seemed an inexcusable omission not to try any machine learning algorithms on this problem. Now that the project is finished, I finally found the time to do just that…
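To make that concrete, here is a minimal sketch (illustrative only, not the actual GeoCONAVI code) of such a hand-crafted rule, assuming the per-Tweet keyword-group counts have already been computed:

import java.util.Map;

public class RuleFilter {
    // Hand-crafted rule from the annotation exercise: "fire" and "hectares"
    // occurring together strongly indicate a real forest fire report,
    // even in the absence of any vegetation-related word.
    static boolean isAboutForestFire(Map<String, Integer> groupCounts) {
        int fire = groupCounts.getOrDefault("fire", 0);
        int hectares = groupCounts.getOrDefault("hectares", 0);
        return fire > 0 && hectares > 0;
    }
}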

So, the objectives were to find simple rules that determine whether a Tweet is about forest fires, and to compare those rules with the ones we had devised manually.

The method obviously is supervised classification, and the concept to classify is forest fire topicality. The instances from which to learn and on which to test are a set of roughly 6000 annotated Tweets, classified into “About Forest Fires” and “Not About Forest Fires”. The attributes used are the number of times each of the keywords shows up in a Tweet. The full set of keywords is too large to post here (because of the multiple languages used – did I mention that? no? sorry), but we grouped the keywords into categories. The set of keyword groups used is {fire, incendie, shrubs, forest, hectares, fire fighters, helicopters, canadair, alarm, evacuation}. (NB: The distinction between the fire and incendie groups is a result of languages like French, which has the distinct words “incendie(s)” and “feu(x)”.)
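In Weka terms, each Tweet thus becomes a vector of numeric keyword-group counts plus a nominal class label. Here is a small sketch of building such a dataset programmatically, using the attribute names from the run output further below (Weka 3.7+ API assumed; older versions use FastVector instead of java.util collections):

import java.util.ArrayList;
import java.util.Arrays;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class BuildDataset {
    public static void main(String[] args) {
        // one numeric attribute per keyword group, plus the nominal class
        ArrayList<Attribute> attrs = new ArrayList<Attribute>();
        for (String group : new String[] {"alarm", "alert", "fireman", "forest",
                "shrub", "bushfire", "canadair", "helicopter", "evacuation",
                "fire", "incendi", "hectar"}) {
            attrs.add(new Attribute(group)); // numeric: occurrences in the Tweet
        }
        attrs.add(new Attribute("ff_topic", Arrays.asList("Y", "N"))); // class

        Instances data = new Instances("tweets", attrs, 0);
        data.setClassIndex(data.numAttributes() - 1);

        // example row: a Tweet mentioning "fire" twice and "hectares" once, labeled Y
        double[] vals = new double[data.numAttributes()];
        vals[9] = 2.0;  // fire
        vals[11] = 1.0; // hectar
        vals[12] = 0.0; // class value at index 0 of {Y, N}, i.e. "Y"
        data.add(new DenseInstance(1.0, vals));
    }
}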

The tool of choice is the Weka suite, mainly because of its availability and excellent documentation. As classification methods, I chose to focus on Naive Bayes methods and Decision Trees, because of their widespread use and because they fit the data. (By the way, my guide through this experiment was mostly the excellent book “Data Mining: Practical Machine Learning Tools and Techniques, Second Edition” by Ian H. Witten and Eibe Frank – any errors I made are entirely my responsibility.)

Regarding the data preparation, little was actually needed – no attribute selection or discretization was necessary, and we had already transformed the text (unigrams) into numbers (their occurrences).

So I was ready to load the CSVs into Weka, convert them to ARFF and start (machine) learning! For verification/error estimation, a standard stratified 10-fold cross-validation seemed sufficient.
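For anyone who wants to reproduce such a run outside the Weka Explorer GUI, here is a rough sketch using the Weka Java API (the file name is hypothetical; DataSource picks the right loader from the extension, so the CSV-to-ARFF conversion is implicit):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TopicalityCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("gt2_grpd_no_dup.csv"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1); // ff_topic is the last attribute

        J48 tree = new J48(); // defaults are -C 0.25 -M 2, as in the output below
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // stratified 10-fold CV
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());

        tree.buildClassifier(data); // fit on the full set for the printed tree
        System.out.println(tree);
    }
}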

The computations all ran very quickly, and all showed roughly 90% accuracy. Below is the output of one run:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: gt2_grpd_no_dup-weka.filters.unsupervised.attribute.Remove-R1,14
Instances:5681
Attributes:13
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
hectar
ff_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

incendi <= 0
|    hectar <= 0
|   |    forest <= 0: N (4232.0/376.0)
|   |    forest > 0
|   |   |    fire <= 0: N (204.0/52.0)
|   |   |    fire > 0: Y (22.0/4.0)
|    hectar > 0
|   |    fire <= 0: N (54.0/11.0)
|   |    fire > 0: Y (34.0)
incendi > 0: Y (1135.0/60.0)

Number of Leaves  :     6

Size of the tree :     11

Time taken to build model: 0.16 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        5178               91.1459 %
Incorrectly Classified Instances       503                8.8541 %
Kappa statistic                          0.7605
Mean absolute error                      0.1588
Root mean squared error                  0.2821
Relative absolute error                 39.7507 %
Root relative squared error             63.1279 %
Total Number of Instances             5681

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.72      0.016      0.946     0.72      0.818      0.858    Y
0.984     0.28       0.902     0.984     0.942      0.858    N
Weighted Avg.    0.911     0.207      0.914     0.911     0.907      0.858

=== Confusion Matrix ===

    a    b   <-- classified as
 1127  439 |    a = Y
   64 4051 |    b = N

Problematic was the large number of false negatives, i.e. Tweets that the machine learning algorithms classified as not being about forest fires when in fact they were. It seemed we needed to adjust for different costs (“counting the costs”), i.e. a false negative should incur a much higher cost than a false positive. In Weka, there are two ways of incorporating the cost: either using a cost matrix for the evaluation part (won’t change the outcome), or using a cost matrix with the MetaCost classifier (will change the outcome); a code sketch of both approaches follows after these results. Surprisingly and unfortunately, the MetaCost classifier did not improve the results significantly. I tried several values with the NaiveBayes classifier. For a cost matrix of

0 (TP)    10 (FN)
1 (FP)      0 (TN)

the result is

    a    b   <-- classified as
 1196  370 |    a = Y
  403 3712 |    b = N

As opposed to

    a    b   <-- classified as
 1152  414 |    a = Y
  256 3859 |    b = N

for standard costs of

0 (TP)    1 (FN)
1 (FP)    0 (TN)

Further increasing the cost for FN does no good. With a substantially higher FN cost, the results are

    a    b   <-- classified as
 1552   14 |    a = Y
 2779 1336 |    b = N
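For reference, here is a sketch of both ways of bringing the cost matrix into play (Weka 3.x API; I assume the class order {Y, N}, so the cell in row Y, column N carries the false negative cost):

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.MetaCost;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveRun {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("gt2_grpd_no_dup.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);

        // rows = actual class, columns = predicted class; class index 0 = Y assumed
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 10.0); // actual Y, predicted N: false negative
        costs.setCell(1, 0, 1.0);  // actual N, predicted Y: false positive

        // Option 1: cost matrix only in the evaluation (model unchanged)
        Evaluation costEval = new Evaluation(data, costs);
        costEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(costEval.toMatrixString());

        // Option 2: MetaCost wraps the base classifier and changes the model
        MetaCost mc = new MetaCost();
        mc.setClassifier(new NaiveBayes());
        mc.setCostMatrix(costs);
        Evaluation mcEval = new Evaluation(data);
        mcEval.crossValidateModel(mc, data, 10, new Random(1));
        System.out.println(mcEval.toMatrixString());
    }
}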

In summary, out of the various Decision Tree and Naive Bayes classifiers, J48 works best. The biggest problem is the large number of false negatives introduced by the combination of
incendi <= 0 AND hectar <= 0 AND forest <= 0 (see above).
However, trying to split up that group proved futile: the only usable keyword group would be “fire”, but classifying fire > 0 as Y would introduce a large number of FP. Some exploratory filtering showed that there is no other reliable way to reduce this high number of FN without overfitting.

Later, I had another go at it with a newly annotated data set from another case study. Again, I tried several classifiers (among them J48, Logistic, Naive Bayes, and AdaBoost/MultiBoost), and again, J48 works best overall. It also has the advantage that the result (i.e. the tree) is easily understandable. Noticing that “hectares” is such an important predictor (good for us with respect to the case study, but it is also part of the “what” of situational awareness), I tried another run without it. The results are not better, and the decision tree becomes relatively complicated and also uses the number-of-keywords attribute. I removed that as well (a sketch of this attribute filtering follows after the output), and the remaining decision tree is interesting for comparison:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: prepared_weka_input_cs_2011_steps_234-weka.filters.unsupervised.attribute.Remove-R1,15-16,18-21-weka.filters.unsupervised.attribute.Remove-R12-weka.filters.unsupervised.attribute.Remove-R12
Instances:1481
Attributes:12
alarm
alert
fireman
forest
shrub
bushfire
canadair
helicopter
evacuation
fire
incendi
on_topic
Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

forest <= 0
|    fire <= 0:  N (1316.0/63.0)
|    fire > 0
|   |    incendi <= 0:  Y (29.0/2.0)
|   |    incendi > 0:  N (72.0/17.0)
forest > 0:  Y (64.0/7.0)

Number of Leaves  :     4

Size of the tree :     7

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances        1392               93.9905 %
Incorrectly Classified Instances        89                6.0095 %
Kappa statistic                          0.6235
Mean absolute error                      0.1098
Root mean squared error                  0.2348
Relative absolute error                 55.6169 %
Root relative squared error             74.8132 %
Total Number of Instances             1481

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
0.993     0.488      0.942     0.993     0.967      0.764     N
0.512     0.007      0.903     0.512     0.654      0.764     Y
Weighted Avg.    0.94      0.435      0.938     0.94      0.932      0.764

=== Confusion Matrix ===

    a    b   <-- classified as
 1308    9 |    a = N
   80   84 |    b = Y
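The attribute removals visible in the Relation line above were done with Weka’s unsupervised Remove filter. Programmatically, dropping a single attribute would look roughly like this (file name and attribute index are placeholders):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("prepared_weka_input_cs_2011.arff"); // placeholder
        Remove remove = new Remove();
        remove.setAttributeIndices("12"); // 1-based index of the attribute to drop
        remove.setInputFormat(data);      // must be set before filtering
        Instances reduced = Filter.useFilter(data, remove);
        System.out.println(reduced.toSummaryString()); // header without the removed attribute
    }
}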

It seems that, given the data and the attributes used, the results from the machine learning support our initial, hand-crafted rule set. That’s fair enough for a result, so I abandoned my forays into machine learning at this point (the learning curve looked quite steep from here on, and resources are scarce, as always). This nice result, however, can’t hide the fact that we still have a large amount of noise that we can’t get rid of by looking only at the keywords used. So we need either more sophisticated text analysis or a novel approach. Not being experts on natural language processing, we chose the second path and came up with something – but you really should read the papers or have a look at the GeoCONAVI site if you would like to learn more. If you have any questions or comments, please post them either here (for the benefit of others) or send me a PM.
