Advanced Feature Extraction

This is an overview of the different feature extraction methods. Each method needs the following flags:

-d: the database we are using
-t: the table inside the database where our text lives (aka the message table)
-g: the table column we will be grouping the text by (aka group)

We start with a message table called "msgs" (available in the packaged data):

mysql> describe msgs;
+--------------+------------------+------+-----+---------+----------------+
| Field        | Type             | Null | Key | Default | Extra          |
+--------------+------------------+------+-----+---------+----------------+
| message_id   | int(11)          | NO   | PRI | NULL    | auto_increment |
| user_id      | int(10) unsigned | YES  | MUL | NULL    |                |
| date         | varchar(64)      | YES  |     | NULL    |                |
| created_time | datetime         | YES  | MUL | NULL    |                |
| message      | text             | YES  |     | NULL    |                |
+--------------+------------------+------+-----+---------+----------------+

N-grams, Collocations, Tf-idf and more

Unigrams

--add_ngrams
-n

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1

mysql> select * from feat$1gram$msgs$user_id limit 5;
+----+----------+-----------+-------+----------------------+
| id | group_id | feat      | value | group_norm           |
+----+----------+-----------+-------+----------------------+
|  1 |    28451 | wonderful |     1 | 0.000953288846520496 |
|  2 |    28451 | let       |     1 | 0.000953288846520496 |
|  3 |    28451 | promotion |     1 | 0.000953288846520496 |
|  4 |    28451 | assured   |     1 | 0.000953288846520496 |
|  5 |    28451 | lime      |     1 | 0.000953288846520496 |
+----+----------+-----------+-------+----------------------+

This also creates a "meta" table called feat$meta_1gram$msgs$user_id which contains average 1gram length, average 1grams per message and total 1grams:

mysql> select * from feat$meta_1gram$msgs$user_id limit 5;
+----+----------+------------------+-------+------------------+
| id | group_id | feat             | value | group_norm       |
+----+----------+------------------+-------+------------------+
|  1 |    28451 | _avg1gramLength  |     4 | 3.76549094375596 |
|  2 |    28451 | _avg1gramsPerMsg |    81 | 80.6923076923077 |
|  3 |    28451 | _total1grams     |  1049 |             1049 |
|  4 |   174357 | _avg1gramLength  |     4 | 3.73343605546995 |
|  5 |   174357 | _avg1gramsPerMsg |   216 | 216.333333333333 |
+----+----------+------------------+-------+------------------+

N-grams

This command will make separate feature tables for each "n".

--add_ngrams
-n

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 2 3

mysql> select * from feat$2gram$msgs$user_id limit 5;
+----+----------+------------------+-------+----------------------+
| id | group_id | feat             | value | group_norm           |
+----+----------+------------------+-------+----------------------+
|  1 |    28451 | this time        |     2 |  0.00193050193050193 |
|  2 |    28451 | email ,          |     1 | 0.000965250965250965 |
|  3 |    28451 | comfortable than |     1 | 0.000965250965250965 |
|  4 |    28451 | do something     |     1 | 0.000965250965250965 |
|  5 |    28451 | charecter ,      |     1 | 0.000965250965250965 |
+----+----------+------------------+-------+----------------------+

mysql> select * from feat$3gram$msgs$user_id limit 5;
+----+----------+-------------------+-------+----------------------+
| id | group_id | feat              | value | group_norm           |
+----+----------+-------------------+-------+----------------------+
|  1 |    28451 | i did something   |     1 | 0.000977517106549365 |
|  2 |    28451 | to my old         |     1 | 0.000977517106549365 |
|  3 |    28451 | , lots of         |     1 | 0.000977517106549365 |
|  4 |    28451 | out some babies   |     1 | 0.000977517106549365 |
|  5 |    28451 | stumbled across a |     1 | 0.000977517106549365 |
+----+----------+-------------------+-------+----------------------+

N-grams From Other Tokenizers

DLATK uses Happier Fun Tokenizer as its standard tokenizer. It also has the option of using the TweetNLP tokenizer with the --add_tweettok flag. One can go straight to a feature table from a message table, via Happier Fun Tokenizer, with --add_ngrams. Alternatively, one can go from a tokenized table via --add_tweettok or --add_tokenized (or any other tokenizer you wish to use) to a feature table with --add_ngrams_from_tokenized

./dlatkInterface.py -d dla_tutorial -t msgs_tok -g user_id --add_ngrams_from_tokenized -n 1

mysql> select * from feat$1gram$msgs_tok$user_id limit 5;
+----+----------+-------------+-------+----------------------+
| id | group_id | feat        | value | group_norm           |
+----+----------+-------------+-------+----------------------+
|  1 |    28451 | nod         |     1 | 0.000953288846520496 |
|  2 |    28451 | pub         |    11 |   0.0104861773117255 |
|  3 |    28451 | destruction |     1 | 0.000953288846520496 |
|  4 |    28451 | else        |     1 | 0.000953288846520496 |
|  5 |    28451 | ?           |     4 |  0.00381315538608198 |
+----+----------+-------------+-------+----------------------+

Feature Occurrence Filter

This removes rare features. Specifically, it filters features so as to keep only those features which are used by X percentage of groups or more. The missing features are aggregated into a feature called <OOV> which contains the value and group norm data for all the missing features. The percentage X is set with the --set_p_occ flag.

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id -f 'feat$1to3gram$msgs$user_id' --feat_occ_filter --set_p_occ .05 --group_freq_thresh 500

Note the use of --group_freq_thresh. This is one of the only feature extraction methods where this flag is considered.

Character n-grams

--add_char_ngrams
-n

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_char_ngrams -n 1

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_char_ngrams -n 1 2 --combine_feat_tables 1to2Cgram

mysql> select * from feat$1to2Cgram$msgs$user_id limit 5;
+----+----------+------+-------+---------------------+
| id | group_id | feat | value | group_norm          |
+----+----------+------+-------+---------------------+
|  1 |    28451 |      |   898 |   0.184659675097676 |
|  2 |    28451 | v    |    45 | 0.00925354719309068 |
|  3 |    28451 | d    |   125 |  0.0257042977585852 |
|  4 |    28451 | ;    |     9 | 0.00185070943861814 |
|  5 |    28451 | y    |    71 |  0.0146000411268764 |
+----+----------+------+-------+---------------------+

TF-IDF Tables

Creates new feature table where the group_norm is the tf-idf score

--tf_idf

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id -f 'feat$1gram$msgs$user_id' --tf_idf

mysql> select * from feat$tf_idf_1gram$msgs$user_id order limit 5;;
+---------+----------+-----------+-------+--------------------+
| id      | group_id | feat      | value | group_norm         |
+---------+----------+-----------+-------+--------------------+
|  307349 |  2033616 | delivered |     1 | 0.0000878334772103 |
|  278647 |  4144593 | crap      |     6 |  0.000998442620366 |
| 1043863 |  3482840 | story     |     2 |  0.000334689956064 |
| 1150911 |  2876677 | uh        |     2 |  0.000141436336165 |
|  283547 |  3711805 | crosses   |     2 |  0.000827587016091 |
+---------+----------+-----------+-------+--------------------+

Collocations and Pointwise Mutual Information

# creates the table feat$1to3gram$msgs$user_id
./dlatkInterface.py -d dla_tutorial -t msgs -g user_id -f 'feat$1to3gram$msgs$user_id' --feat_colloc_filter --set_pmi_threshold 6.0

Transformed Tables

These switches transform the feature table during feature extraction and therefore need least one feature extraction command: --add_ngrams, --add_lex_table, etc.

# produces the table feat$1gram$msgs$user_id$16to8
./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1 --anscombe

# produces the table feat$1gram$msgs$user_id$16to4
./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1 --sqrt

# produces the table feat$1gram$msgs$user_id$16to3
./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1 --log

# produces the table feat$1gram$msgs$user_id$16to1
./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1 --boolean

mysql> select * from feat$1gram$msgs$user_id$16to8 limit 5;
+-----+----------+------+-------+---------------+
| id  | group_id | feat | value | group_norm    |
+-----+----------+------+-------+---------------+
| 188 |    28451 | !    |     8 | 1.23713590324 |
| 296 |    28451 | $    |     1 | 1.22630059748 |
| 204 |    28451 | '    |     2 | 1.22785435243 |
| 223 |    28451 | *    |     4 | 1.23095597872 |
|  38 |    28451 | ,    |    53 | 1.30464448623 |
+-----+----------+------+-------+---------------+

mysql> select * from feat$1gram$msgs$user_id$16to4 limit 5;
+-----+----------+------+-------+-----------------+
| id  | group_id | feat | value | group_norm      |
+-----+----------+------+-------+-----------------+
| 275 |    28451 | !    |     8 | 0.0873287511199 |
| 245 |    28451 | $    |     1 | 0.0308753760547 |
| 414 |    28451 | '    |     2 |   0.04366437556 |
| 239 |    28451 | *    |     4 | 0.0617507521094 |
|  45 |    28451 | ,    |    53 |  0.224776130551 |
+-----+----------+------+-------+-----------------+

mysql> select * from feat$1gram$msgs$user_id$16to3 limit 5;
+-----+----------+------+-------+-------------------+
| id  | group_id | feat | value | group_norm        |
+-----+----------+------+-------+-------------------+
| 278 |    28451 | !    |     8 |  0.00759737747394 |
| 244 |    28451 | $    |     1 | 0.000952834755272 |
| 265 |    28451 | '    |     2 |  0.00190476248065 |
| 171 |    28451 | *    |     4 |  0.00380590373768 |
| 283 |    28451 | ,    |    53 |   0.0492893813166 |
+-----+----------+------+-------+-------------------+


mysql> select * from feat$1gram$msgs$user_id$16to1 limit 5;
+-----+----------+------+-------+------------+
| id  | group_id | feat | value | group_norm |
+-----+----------+------+-------+------------+
|  51 |    28451 | !    |     8 |          1 |
| 148 |    28451 | $    |     1 |          1 |
| 105 |    28451 | '    |     2 |          1 |
| 277 |    28451 | *    |     4 |          1 |
| 304 |    28451 | ,    |    53 |          1 |
+-----+----------+------+-------+------------+

Word Tables

The word table is used to select groups that meet a certain language useage threshold. This is what we call the "group frequency threshold", as specified by the --group_freq_thresh flag. It says that we will only consider groups who use at least N words (typically 1 when working at the message level, 500 when working at the user level and 40,000 when working with communities). The word table is automatically queried based on the -t and -g flag. For example, given the following base command:

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id

DLATK will query the table "feat$1gram$msgs$user_id". The flag --word_table overrides this. It is especially useful when working with large data when the standard word table will not fit into memory. In this case we often use a feature occurrence filtered table (filtered at a small threshold). For example

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --word_table 'feat$1gram$msgs$user_id$0_01'

Lexica

DLATK supports both unweighted and weighted lexica. Here is an example of an unweighted lexicon. Note that the MySQL table still contains the column "weight" which is set to 1 everywhere. This is unnecessary but sometimes more insightful to be explicit.

# creates the table feat$cat_LIWC2015$msgs$user_id$1gra
./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_lex_table -l LIWC2015

mysql> select * from dlatk_lexica.LIWC2015 limit 5;
+----+------+----------+--------+
| id | term | category | weight |
+----+------+----------+--------+
|  1 | he   | PPRON    |      1 |
|  2 | he'd | PPRON    |      1 |
|  3 | he's | PPRON    |      1 |
|  4 | her  | PPRON    |      1 |
|  5 | hers | PPRON    |      1 |
+----+------+----------+--------+

mysql> select * from feat$cat_met_a30_2000_cp_w$msgs$user_id$1gra  limit 5;
+----+----------+------+-------+--------------------------+
| id | group_id | feat | value | group_norm               |
+----+----------+------+-------+--------------------------+
|  1 |    28451 | 298  |     4 | 0.0000000217525421774642 |
|  2 |    28451 | 278  |     6 |     0.000150407662892745 |
|  3 |    28451 | 295  |    17 |     0.000545379245206831 |
|  4 |    28451 | 1375 |    47 |       0.0010413347897739 |
|  5 |    28451 | 276  |    15 |     0.000299298548129527 |
+----+----------+------+-------+--------------------------+

Here is an example of a weighted lexicon. Note the use of the --weighted_lexicon flag. Here we are using LDA Facebook topics which are available here).

# creates the table feat$cat_met_a30_2000_cp_w$msgs$user_id$1gra
./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_lex_table -l met_a30_2000_cp --weighted_lexicon

mysql> select * from dlatk_lexica.met_a30_2000_cp limit 5;
+----+---------+----------+--------------------+
| id | term    | category | weight             |
+----+---------+----------+--------------------+
|  1 | ce      | 344      |  0.000162284972412 |
|  2 | concept | 344      |  0.000556947925369 |
|  3 | cough   | 344      | 0.0000711541198235 |
|  4 | bring   | 344      |   0.00570741964554 |
|  5 | finest  | 344      |  0.000520020800832 |
+----+---------+----------+--------------------+

mysql> select * from feat$cat_met_a30_2000_cp_w$msgs$user_id$1gra  limit 5;
+----+----------+------+-------+--------------------------+
| id | group_id | feat | value | group_norm               |
+----+----------+------+-------+--------------------------+
|  1 |    28451 | 298  |     4 | 0.0000000217525421774642 |
|  2 |    28451 | 278  |     6 |     0.000150407662892745 |
|  3 |    28451 | 295  |    17 |     0.000545379245206831 |
|  4 |    28451 | 1375 |    47 |       0.0010413347897739 |
|  5 |    28451 | 276  |    15 |     0.000299298548129527 |
+----+----------+------+-------+--------------------------+

Combining Feature Tables

Combine multiple feature tables into a single table.

--combine_feat_tables

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id -f 'feat$1gram$msgs$user_id' 'feat$2gram$msgs$user_id' 'feat$3gram$msgs$user_id' --combine_feat_tables 1to3gram

This also works during ngram extraction:

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_ngrams -n 1 2 3 --combine_feat_tables 1to3gram

Part of Speech

Part of Speech Usage

--add_pos_table

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_pos_table

mysql> select * from feat$pos$msgs$user_id limit 5;
+----+----------+------+-------+----------------------+
| id | group_id | feat | value | group_norm           |
+----+----------+------+-------+----------------------+
|  1 |  2300555 | RP   |    12 |  0.00575539568345324 |
|  2 |  2300555 | ''   |     1 | 0.000479616306954436 |
|  3 |  2300555 | PRP  |   107 |   0.0513189448441247 |
|  4 |  2300555 | CC   |    61 |   0.0292565947242206 |
|  5 |  2300555 | WRB  |    15 |  0.00719424460431655 |
+----+----------+------+-------+----------------------+

Part of Speech N-grams

--add_pos_ngram_table

./dlatkInterface.py -d dla_tutorial -t msgs -g user_id --add_pos_ngram_table

mysql> select * from feat$1gram_pos$msgs$user_id limit 5;
+----+----------+----------------------+-------+----------------------+
| id | group_id | feat                 | value | group_norm           |
+----+----------+----------------------+-------+----------------------+
|  1 |  2300555 | shiiiennntaaaahhh/NN |     1 | 0.000479616306954436 |
|  2 |  2300555 | thx/VBN              |     1 | 0.000479616306954436 |
|  3 |  2300555 | aku/NN               |     1 | 0.000479616306954436 |
|  4 |  2300555 | passgae/NN           |     1 | 0.000479616306954436 |
|  5 |  2300555 | feel/VBP             |     6 |  0.00287769784172662 |
+----+----------+----------------------+-------+----------------------+