Spam Workflow Documentation

Workflow for setting up the spam identification measures from scratch

Note: all spam identification and computation scripts are in script/spam.

Compute the Tag Spam measure

You'll first need to compute the base tag spam probabilities from the Bibsonomy data. The Bibsonomy data is in /l/cnets/research/givalink/2007-12-31_spammer/tas, or can be downloaded from http://www.kde.cs.uni-kassel.de/bibsonomy/dumps_spammer/2007-12-31_spammer.tgz . With the data in place, run the script like this:
 givalink@smithers ~/jpr-working/givealink$ ruby script/spam/compute_tag_spam_from_bibsonomy.rb \
     ../data/2007-12-31_spammer/tas ../data/tag_spam.marshal
This should run in five minutes or less and produce output like:
 Computing tag spam based on assignments in ../data/2007-12-31_spammer/tas
 Storing resultant hash in ../data/tag_spam.marshal
 Done. 320876 tags processed.
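For reference, the computation boils down to estimating, for each tag, what fraction of its Bibsonomy assignments come from users flagged as spammers, and saving that hash with Marshal. The sketch below only illustrates the idea; the assumed tab-separated column layout of the tas file (user id, tag, spammer flag) is an assumption and may not match the actual dump format, so treat it as pseudocode for the real script.

 # Hypothetical sketch: estimate each tag's spam probability as the fraction
 # of its assignments made by flagged spammers, then Marshal the hash to disk.
 # The column layout (user_id, tag, spam_flag) is an assumption, not the
 # documented format of the Bibsonomy tas dump.
 tas_path, out_path = ARGV

 counts = Hash.new { |h, k| h[k] = { total: 0, spam: 0 } }

 File.foreach(tas_path) do |line|
   _user_id, tag, spam_flag = line.chomp.split("\t")
   next if tag.nil? || tag.empty?
   counts[tag][:total] += 1
   counts[tag][:spam]  += 1 if spam_flag == "1"
 end

 # Spam probability of a tag = spam assignments / total assignments.
 tag_spam = {}
 counts.each { |tag, c| tag_spam[tag] = c[:spam].to_f / c[:total] }

 File.open(out_path, "wb") { |f| f.write(Marshal.dump(tag_spam)) }
 puts "Done. #{tag_spam.size} tags processed."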
This produces the file tag_spam.marshal, which contains a marshaled Ruby hash mapping tags to spam probabilities. Next, compute the actual tag_spam measure for tags in GiveALink:
 givalink@smithers ~/jpr-working/givealink$ ruby script/spam/compute_tag_spam.rb \
     ../data/tag_spam.marshal ../data/givalink_tag_spam.txt
This prints status lines for each user as it is processed. It should be fairly fast (several users per second). When it finishes, it will have created the output file (givalink_tag_spam.txt above) containing (user_id, url_id, tag_spam) triplets for every post (that is, every user-URL pair) in GiveALink.
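The sketch below shows one way this per-post step could work. It assumes a tab-separated dump of posts (user_id, url_id, tag per line) standing in for the GiveALink database, and scores each post with the mean spam probability of its tags; both the input format and the mean aggregation are assumptions, and the real script may differ.

 # Hypothetical sketch of the per-post tag_spam step.
 marshal_path, posts_path, out_path = ARGV
 tag_spam = Marshal.load(File.binread(marshal_path))

 # Group tags by (user_id, url_id) post. The posts file format is assumed.
 post_tags = Hash.new { |h, k| h[k] = [] }
 File.foreach(posts_path) do |line|
   user_id, url_id, tag = line.chomp.split("\t")
   post_tags[[user_id, url_id]] << tag if tag && !tag.empty?
 end

 # Write (user_id, url_id, tag_spam) triplets; unseen tags count as 0.0.
 File.open(out_path, "w") do |out|
   post_tags.each do |(user_id, url_id), tags|
     probs = tags.map { |t| tag_spam.fetch(t, 0.0) }
     out.puts [user_id, url_id, probs.sum / probs.size].join("\t")
   end
 end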

Compute the Tag Blur measure

To do this, you'll first need to compute tag similarities (σ's) for GiveALink tags that are not tainted by spam. For this step, we assume that all tags assigned to resources originally added from Bibsonomy are good.
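As a rough illustration of what a σ computation over these trusted tags might look like, the sketch below scores tag pairs with the Jaccard overlap of the resource sets they annotate. Both the tag_similarities helper and the choice of Jaccard are assumptions for illustration; the similarity measure actually used by the blur scripts may be different.

 require "set"

 # tag_resources: { tag => Set of url_ids it annotates }, built only from
 # tags on resources originally imported from Bibsonomy (the trusted set).
 # Returns a hash of σ values for tag pairs with nonzero overlap.
 def tag_similarities(tag_resources)
   sigmas = {}
   tag_resources.keys.combination(2) do |t1, t2|
     a, b = tag_resources[t1], tag_resources[t2]
     union = (a | b).size
     s = union.zero? ? 0.0 : (a & b).size.to_f / union
     sigmas[[t1, t2]] = s if s > 0.0
   end
   sigmas
 end

 # Example:
 #   tag_similarities("web" => Set["u1", "u2"], "internet" => Set["u2", "u3"])
 #   # => { ["web", "internet"] => 0.333... }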