Data Cleaning
This is an overview of the different methods for cleaning text data. Each method needs the following flags:
-d: the database we are using
-t: the table inside the database where our text lives (aka the message table)
-c: the table column we will be grouping the text by (aka group)
We start with a message table called "msgs" (available in the packaged data):
mysql> describe msgs;
+--------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+---------+----------------+
| message_id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(10) unsigned | YES | MUL | NULL | |
| date | varchar(64) | YES | | NULL | |
| created_time | datetime | YES | MUL | NULL | |
| message | text | YES | | NULL | |
+--------------+------------------+------+-----+---------+----------------+
Language Filtering
Uses the langid Python package. Each message is assigned a language and a confidence, and the message is kept if it is classified as the requested language with a confidence over 0.80. By default, messages are lowercased before being passed to langid.
This creates a new table named after the table given by the -t flag, with "_en" appended.
# creates the table msgs_en
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --language_filter en
Adding the --clean_messages flag will remove URLs and @-mentions before applying the language filter, which improves the language classification.
# creates the table msgs_en
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --language_filter en --clean_messages
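The keep/drop rule described above can be sketched in a few lines of Python. This is only an illustration of the thresholding logic, not DLATK's code: language_filter and toy_classify are hypothetical names, and toy_classify is a stand-in for a real classifier such as langid that returns a (language, confidence) pair. Returning the original (not lowercased) text is also an assumption.

```python
def language_filter(messages, classify, lang="en", min_confidence=0.80, lowercase=True):
    """Keep messages classified as `lang` with confidence above the cutoff."""
    kept = []
    for message in messages:
        # Lowercase before classification, mirroring the default described above.
        text = message.lower() if lowercase else message
        predicted, confidence = classify(text)
        if predicted == lang and confidence > min_confidence:
            kept.append(message)  # assumption: the original text is what gets stored
    return kept


def toy_classify(text):
    """Hypothetical stand-in for a real language classifier (NOT langid)."""
    if "the" in text.split():
        return "en", 0.95
    return "es", 0.60


kept_messages = language_filter(["The cat sat on the mat", "El gato duerme"], toy_classify)
print(kept_messages)
```

Passing the classifier in as an argument keeps the sketch runnable without any third-party package installed.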
Deduplicating
Removes messages that share the same first 6 tokens with another message in the same group (-c); each group id should therefore have more than one row in your MySQL table. This creates a new table named after the table given by the -t flag, with "_dedup" appended.
# creates the table msgs_dedup
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --deduplicate
To also replace URLs (with <URL>) and @-mentions (with <USER>) add the --clean_messages flag (this helps the language classification task):
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --deduplicate --clean_messages
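As a rough illustration of the first-6-tokens rule, here is a minimal Python sketch. It is not DLATK's implementation: the deduplicate function is hypothetical, tokens are assumed to be case-sensitive whitespace-separated strings, and keeping the first occurrence within each group is also an assumption.

```python
from collections import defaultdict


def deduplicate(rows, n_tokens=6):
    """Drop messages whose first n_tokens tokens were already seen in the same group.

    rows: iterable of (message_id, group_id, message) tuples.
    """
    seen = defaultdict(set)  # group_id -> token prefixes already kept
    kept = []
    for message_id, group_id, message in rows:
        # Assumption: plain whitespace tokenization; the real tokenizer may differ.
        prefix = tuple(message.split()[:n_tokens])
        if prefix in seen[group_id]:
            continue  # duplicate within this group
        seen[group_id].add(prefix)
        kept.append((message_id, group_id, message))
    return kept


rows = [
    (1, 10, "Check out my new blog post about cats today"),
    (2, 10, "Check out my new blog post about dogs tomorrow"),
    (3, 10, "Something entirely different"),
    (4, 11, "Check out my new blog post about cats today"),
]
kept_ids = [row[0] for row in deduplicate(rows)]
print(kept_ids)  # [1, 3, 4]
```

Message 2 is dropped because it shares its first 6 tokens with message 1 in group 10, while message 4 survives because it belongs to a different group.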
Removing URLs and @-mentions
Using --clean_messages on its own creates a new table named after the table given by the -t flag, with "_an" appended.
# creates the table msgs_an
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --clean_messages
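The kind of substitution involved can be sketched with two regular expressions, using the <URL> and <USER> placeholders mentioned in the deduplication section. The patterns and the clean_message function below are assumptions chosen for illustration; the patterns DLATK actually uses may differ.

```python
import re

# Assumed patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # the lookbehind skips e-mail addresses


def clean_message(text):
    """Replace URLs with <URL> and @-mentions with <USER>."""
    text = URL_RE.sub("<URL>", text)
    return MENTION_RE.sub("<USER>", text)


cleaned = clean_message("@alice look at https://example.com/page now")
print(cleaned)  # <USER> look at <URL> now
```

URLs are replaced first so that an "@" inside a URL is not mistaken for a mention.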
Spam Filtering
A message is marked as spam (is_spam = 1, otherwise is_spam = 0) if it contains any of the spam words. If a user's ratio of spam messages to total messages is greater than THRESHOLD (the value passed to --spam_filter), that user is removed from the new message table.
Spam words = 'share', 'win', 'check', 'enter', 'products', 'awesome', 'prize', 'sweeps', 'bonus', 'gift'
This will create a new table whose name is taken from the -t flag and appends "_nospam".
# creates the table msgs_nospam
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --spam_filter 0.1
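The per-user ratio test can be sketched as follows. This is a simplified illustration rather than DLATK's code: is_spam and filter_spam_users are hypothetical names, and matching spam words against lowercased whitespace tokens is an assumption (the real matching rule may be different, for example substring based).

```python
from collections import defaultdict

SPAM_WORDS = {"share", "win", "check", "enter", "products",
              "awesome", "prize", "sweeps", "bonus", "gift"}


def is_spam(message):
    """Return 1 if any token is a spam word, else 0."""
    # Assumption: whole-token match on lowercased whitespace tokens.
    return int(any(token in SPAM_WORDS for token in message.lower().split()))


def filter_spam_users(rows, threshold):
    """Drop every row of users whose spam ratio is greater than threshold.

    rows: list of (message_id, user_id, message) tuples.
    """
    totals = defaultdict(int)
    spam = defaultdict(int)
    for _, user_id, message in rows:
        totals[user_id] += 1
        spam[user_id] += is_spam(message)
    spammers = {user for user in totals if spam[user] / totals[user] > threshold}
    return [row for row in rows if row[1] not in spammers]


rows = [
    (1, 1, "enter to win a prize"),
    (2, 1, "good morning"),
    (3, 2, "lunch was great"),
    (4, 2, "heading home"),
]
kept_ids = [row[0] for row in filter_spam_users(rows, threshold=0.1)]
print(kept_ids)  # [3, 4]
```

User 1 has one spam message out of two (ratio 0.5, above 0.1), so both of that user's messages are dropped, including the one that is not spam.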
Real World Example
Using both the language filter and the deduplication filter typically yields nice results (both with the --clean_messages flag):
# creates the table msgs_en
./dlatkInterface.py -d dla_tutorial -t msgs -c user_id --language_filter en --clean_messages
# creates the table msgs_en_dedup
./dlatkInterface.py -d dla_tutorial -t msgs_en -c user_id --deduplicate --clean_messages