Understanding Feature Table Names

This page explains the standard naming convention for feature tables in DLATK. DLATK expects feature tables to follow this convention; deviate at your own risk.

Structure

Every feature table has the same five columns: id, group_id, feat, value and group_norm. Here is an example of a message-level (group_id = message_id) 1gram table:

mysql> describe feat$1gram$msgs$message_id;
+------------+---------------------+------+-----+---------+----------------+
| Field      | Type                | Null | Key | Default | Extra          |
+------------+---------------------+------+-----+---------+----------------+
| id         | bigint(16) unsigned | NO   | PRI | NULL    | auto_increment |
| group_id   | int(11)             | YES  | MUL | NULL    |                |
| feat       | varchar(36)         | YES  | MUL | NULL    |                |
| value      | int(11)             | YES  |     | NULL    |                |
| group_norm | double              | YES  |     | NULL    |                |
+------------+---------------------+------+-----+---------+----------------+

The column names are identical across tables, but the MySQL types may differ. Generally, feat is a varchar, value is an int and group_norm is a double. The columns are defined as follows:

group_id: Identifier for each group as determined from the -g flag. This is typically a message id (e.g. Tweet id), user id (e.g. Twitter user id), community id (e.g. U.S. County FIPS code or state code), etc.
feat: feature name such as an ngram, LDA topic id, etc.
value: The number of times the feature was used by the group_id.
group_norm: The relative frequency of the feature for the group_id. This is usually value divided by the sum of all values for the group_id.
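The usual group_norm computation described above (value divided by the group's total value) can be sketched as follows. This is a minimal illustration that assumes rows are plain (group_id, feat, value) tuples; the helper name add_group_norm is hypothetical and is not part of DLATK.

```python
from collections import defaultdict


def add_group_norm(rows):
    """Append group_norm = value / (sum of values in the same group) to each row.

    `rows` is an iterable of (group_id, feat, value) tuples.
    Hypothetical helper for illustration only; not a DLATK function.
    """
    rows = list(rows)
    # First pass: total value per group_id.
    totals = defaultdict(int)
    for group_id, _feat, value in rows:
        totals[group_id] += value
    # Second pass: relative frequency of each feature within its group.
    return [(g, f, v, v / totals[g]) for g, f, v in rows]


rows = [(1, "happy", 2), (1, "day", 6), (2, "happy", 1)]
normed = add_group_norm(rows)
```

Within each group_id the group_norm values produced this way sum to 1.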

Things to keep in mind when creating your own feature tables:

The id column is technically not necessary but every other column is needed.
Tables are sparse encoded: group_id / feat pairs are assumed to be zero if missing from the table.
NULLs and zeros in the group_norm column will throw an error.
Do not use Decimal types in feature tables.
Keep the group_id and feat columns indexed.
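To illustrate the sparse encoding and the group_norm restriction listed above, here is a rough sketch that works on in-memory rows rather than a MySQL table. The function names check_rows and to_dense are hypothetical, and the advice to omit zero rows is an inference from the sparse-encoding rule, not a documented DLATK behavior.

```python
def check_rows(rows):
    """Return (row_index, message) pairs for rows that break the group_norm rule.

    `rows` holds (group_id, feat, value, group_norm) tuples.
    Hypothetical helper for illustration only; not a DLATK function.
    """
    problems = []
    for i, (_group_id, _feat, _value, group_norm) in enumerate(rows):
        if group_norm is None or group_norm == 0:
            problems.append((i, "group_norm is NULL or 0; consider omitting the row"))
    return problems


def to_dense(rows, feats):
    """Expand sparse rows into {group_id: [group_norm per feat]}.

    Any (group_id, feat) pair missing from `rows` is filled with 0.0,
    mirroring how missing pairs are assumed to be zero.
    """
    index = {feat: j for j, feat in enumerate(feats)}
    dense = {}
    for group_id, feat, _value, group_norm in rows:
        dense.setdefault(group_id, [0.0] * len(feats))[index[feat]] = group_norm
    return dense


rows = [(1, "a", 1, 0.5), (1, "b", 1, 0.5), (2, "a", 3, 1.0)]
dense = to_dense(rows, ["a", "b"])
bad = check_rows(rows + [(2, "b", 0, 0.0), (3, "a", 1, None)])
```

Here group 2 never used feature "b", so the pair is simply absent from the rows and appears as 0.0 only in the dense view.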

Example 1: unigram, bigram, etc features

These tables are generally created with the --add_ngrams flag of dlatkInterface.

feat$1to3gram$statuses_er1$user_id$16to1$0_01$pmi3_0
| f0 |field 1 |  field 2   |field3 |field4| f5 |field 6|

Field 0 Specifies this as a feature table. All feature tables begin with the word "feat".

Field 1 Specifies the kind of features; here these are 1-, 2-, and 3-grams, the result of running --combine_feat_tables after --add_ngrams

Field 2 Gives the message table (-t) that the features were derived from

Field 3 Gives the group ID (-g) that features were grouped by

Field 4 Specifies scaling on features. The default (or unscaled) feature tables do not include this field.

16to8: --anscombe
16to4: --sqrt
16to3: --log
16to1: --boolean

Field 5 Shows the feature occurrence filter (--feat_occ_filter) applied to the feature table, i.e., the percentage of groups in which a feature must occur to be included in the table

Field 6 Gives the PMI threshold set by --feat_colloc_filter, and optionally, --set_pmi_threshold
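The field layout above can be read back out of a table name by splitting on "$". The following is a heuristic sketch, not DLATK code: the function name parse_ngram_table_name is hypothetical, and the way trailing optional fields are told apart (known scaling codes, a "pmi" prefix, anything else treated as the occurrence filter) is an assumption based only on the example name shown here. The filter fields are returned as raw strings rather than decoded.

```python
# Scaling codes as listed under Field 4 above.
SCALING_FLAGS = {
    "16to8": "--anscombe",
    "16to4": "--sqrt",
    "16to3": "--log",
    "16to1": "--boolean",
}


def parse_ngram_table_name(name):
    """Split an ngram feature table name into its "$"-separated fields.

    Hypothetical helper for illustration only; not a DLATK function.
    """
    parts = name.split("$")
    if len(parts) < 4 or parts[0] != "feat":
        raise ValueError("not a feature table name: %r" % name)
    info = {
        "feature_type": parts[1],   # Field 1
        "message_table": parts[2],  # Field 2 (-t)
        "group_field": parts[3],    # Field 3 (-g)
        "scaling": None,            # Field 4, absent on unscaled tables
        "occ_filter": None,         # Field 5, kept as the raw string
        "pmi_filter": None,         # Field 6, kept as the raw string
    }
    # Assumption: optional fields are recognized by their shape.
    for extra in parts[4:]:
        if extra in SCALING_FLAGS:
            info["scaling"] = SCALING_FLAGS[extra]
        elif extra.startswith("pmi"):
            info["pmi_filter"] = extra
        else:
            info["occ_filter"] = extra
    return info


info = parse_ngram_table_name("feat$1to3gram$statuses_er1$user_id$16to1$0_01$pmi3_0")
```

For the example name, info["scaling"] is "--boolean" and info["group_field"] is "user_id"; an unscaled name such as feat$1gram$msgs$message_id leaves the three optional entries as None.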

Example 2: extracted lexicon/topic features

These tables are generally created with the --add_lex_table flag of dlatkInterface.

feat$cat_met_a30_2000_cp_w$messages_en$cty_id$1gra
| f0 |       field 1      |  field 2  |field3|field4|

Field 0 Specifies this as a feature table

Field 1 Specifies the source of features; these are extracted from the topic lexicon met_a30_2000, and the table was created via --add_lex_table. The trailing "_w" indicates a weighted lexicon. "_cp" stands for "conditional probability", one of the two types of topic lexica normally created (see DLATK LDA Interface).

Field 2 Gives the message table (-t) that the features were derived from

Field 3 Gives the group ID (-g) that features were grouped by

Field 4 The first four characters of Field 1 of the word table (--word_table) used to derive the lexicon/topic features. By default this is the 1gram table. In previous versions (earlier than 1.1.5), this field specified the scaling on features.
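As a rough sketch of how Field 4 relates to the word table, the snippet below assembles a lexicon feature table name from its parts. The function name lex_table_name is hypothetical, Field 1 is passed in as-is rather than constructed, and the default word table name (feat$1gram$<message table>$<group field>) is an assumption that follows the naming pattern described on this page.

```python
def lex_table_name(lexicon_field, message_table, group_field, word_table=None):
    """Build a lexicon/topic feature table name from its fields.

    Hypothetical helper for illustration only; not a DLATK function.
    """
    if word_table is None:
        # Assumption: the default word table is the 1gram table for the
        # same message table and group field.
        word_table = "feat$1gram$%s$%s" % (message_table, group_field)
    # Field 1 of the word table, truncated to its first four characters.
    word_field = word_table.split("$")[1]
    return "$".join(["feat", lexicon_field, message_table, group_field, word_field[:4]])


name = lex_table_name("cat_met_a30_2000_cp_w", "messages_en", "cty_id")
```

With the default word table this reproduces the example name above, ending in "1gra"; a word table whose Field 1 is "1to3gram" would give a name ending in "1to3" instead.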