Spam Workflow Documentation
Workflow for setting up the spam identification measures from scratch
Note: all spam identification and computation scripts are in script/spam (the directory used in the commands below).
Compute the Tag Spam measure
You'll first need to compute the base tag spam probabilities from the Bibsonomy data. The Bibsonomy data is in /l/cnets/research/givalink/2007-12-31_spammer/tas; it can also be downloaded from http://www.kde.cs.uni-kassel.de/bibsonomy/dumps_spammer/2007-12-31_spammer.tgz
Pass it to the script like this:
givalink@smithers ~/jpr-working/givealink$ ruby script/spam/compute_tag_spam_from_bibsonomy.rb ../data/2007-12-31_spammer/tas ../data/tag_spam.marshal
This should run in five minutes or less, and produce output like
Computing tag spam based on assignments in ../data/2007-12-31_spammer/tas
Storing resultant hash in ../data/tag_spam.marshal
Done. 320876 tags processed.
This produces the file tag_spam.marshal, which is a hash associating each tag with a spam probability. Next, compute the actual tag_spam measure for the tags in Givealink:
givalink@smithers ~/jpr-working/givealink$ ruby script/spam/compute_tag_spam.rb ../data/tag_spam.marshal ../data/givalink_tag_spam.txt
This prints status lines for the users as they are processed. It should be fairly fast (several users per second). When it is done, it has created a file containing triples of (user_id, url_id, tag_spam) for every post (that is, every (user, url) pair) in Givealink.
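To sanity-check the two output files, something like the following sketch may help. It rests on two assumptions that this page does not confirm: that tag_spam.marshal was written with Ruby's Marshal.dump, and that givalink_tag_spam.txt holds one whitespace-separated "user_id url_id tag_spam" triple per line. The helper names load_tag_probabilities and parse_triples, and the sample tags and values, are made up for illustration; adjust the parsing if the real file uses another separator.

```ruby
require "tempfile"

# Load a tag => spam-probability hash from disk.
# ASSUMPTION: the file was written with Marshal.dump.
# Marshal.load is unsafe on untrusted input; only load files you produced.
def load_tag_probabilities(path)
  Marshal.load(File.binread(path))
end

# Parse per-post triples from an IO or a String.
# ASSUMPTION: one whitespace-separated "user_id url_id tag_spam" per line.
# Lines that do not have exactly three fields are skipped.
def parse_triples(io)
  io.each_line.map do |line|
    fields = line.split
    next unless fields.size == 3
    [Integer(fields[0]), Integer(fields[1]), Float(fields[2])]
  end.compact
end

# Round-trip a tiny made-up hash through a temporary file to show the calls.
Tempfile.create("tag_spam") do |f|
  f.binmode
  f.write(Marshal.dump({ "casino" => 0.97, "ruby" => 0.02 }))
  f.flush
  probabilities = load_tag_probabilities(f.path)
  puts "casino: #{probabilities["casino"]}"
end

# Parse a made-up sample in the assumed triple format.
sample = "12 345 0.91\n12 346 0.05\n"
parse_triples(sample).each do |user_id, url_id, tag_spam|
  puts "user #{user_id}, url #{url_id}: tag_spam #{tag_spam}"
end
```

For the real files, the calls would be load_tag_probabilities("../data/tag_spam.marshal") and File.open("../data/givalink_tag_spam.txt") { |f| parse_triples(f) }.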