Understanding Feature Table Names
This page explains the standard naming convention for feature tables in DLATK.
DLATK expects feature tables to follow this convention.
Deviate at your own risk.
Structure
Every feature table has the same structure: id, group_id, feat, value, and group_norm. Here is an example of a message-level (group_id = message_id) 1gram table:
mysql> describe feat$1gram$msgs$message_id;
+------------+---------------------+------+-----+---------+----------------+
| Field      | Type                | Null | Key | Default | Extra          |
+------------+---------------------+------+-----+---------+----------------+
| id         | bigint(16) unsigned | NO   | PRI | NULL    | auto_increment |
| group_id   | int(11)             | YES  | MUL | NULL    |                |
| feat       | varchar(36)         | YES  | MUL | NULL    |                |
| value      | int(11)             | YES  |     | NULL    |                |
| group_norm | double              | YES  |     | NULL    |                |
+------------+---------------------+------+-----+---------+----------------+
The column names are identical across feature tables, but the MySQL types can differ. Generally feat is a varchar, value is an int, and group_norm is a double. The columns are defined as follows:
group_id: Identifier for each group as determined from the -g flag. This is typically a message id (e.g. Tweet id), user id (e.g. Twitter user id), community id (e.g. U.S. County FIPS code or state code), etc.
feat: feature name such as an ngram, LDA topic id, etc.
value: The number of times the feature was used by the group_id.
group_norm: The relative frequency of the feature use for the group_id. This is usually value divided by the sum of all values for the group_id.
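As a rough illustration of the usual group_norm definition above, the following sketch computes it from raw counts in plain Python. The function name add_group_norms and the sample rows are hypothetical and are not part of DLATK; DLATK computes this column itself when it builds a feature table.

```python
from collections import defaultdict

def add_group_norms(rows):
    """Take (group_id, feat, value) rows and append group_norm to each.

    group_norm is value divided by the sum of all values for that group_id.
    This helper is illustrative only; it is not a DLATK function.
    """
    totals = defaultdict(int)
    for group_id, _feat, value in rows:
        totals[group_id] += value
    return [(g, f, v, v / totals[g]) for g, f, v in rows]

# Hypothetical counts: group 1 used four tokens in total, group 2 used three.
rows = [(1, "the", 2), (1, "cat", 1), (1, "sat", 1), (2, "the", 3)]
normed = add_group_norms(rows)
```

With these counts, the group_norm values for each group_id sum to 1.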
Things to keep in mind when creating your own feature tables:
The id column is technically not necessary but every other column is needed.
Tables are sparse encoded: group_id / feat pairs are assumed to be zero if missing from the table.
Nulls and 0's in the group_norm column will throw an error.
Do not use Decimal types in feature tables.
Keep the group_id and feat columns indexed.
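Two of the points above, sparse encoding and the ban on NULL or 0 group_norm values, can be sketched in plain Python. The helper names make_lookup and find_invalid_rows and the sample rows are hypothetical, chosen for illustration; they are not DLATK functions.

```python
def make_lookup(rows):
    """Build a group_norm lookup over (group_id, feat, value, group_norm) rows.

    Because tables are sparse encoded, a missing (group_id, feat) pair
    is treated as zero rather than as an error.
    """
    table = {(g, f): gn for g, f, _v, gn in rows}
    return lambda group_id, feat: table.get((group_id, feat), 0.0)

def find_invalid_rows(rows):
    """Return rows whose group_norm is NULL (None) or 0.

    Such rows should be left out of the table entirely.
    """
    return [row for row in rows if row[3] is None or row[3] == 0]

# Hypothetical table contents.
rows = [(1, "the", 2, 0.5), (1, "cat", 1, 0.25),
        (1, "sat", 1, 0.25), (2, "the", 3, 1.0)]
lookup = make_lookup(rows)
bad = find_invalid_rows(rows + [(2, "dog", 0, 0.0), (3, "cat", 1, None)])
```

Here lookup(2, "cat") falls back to zero because group 2 has no "cat" row, and bad holds the two rows that should have been omitted instead of stored.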
Example 1: unigram, bigram, etc features
These tables are generally created with the --add_ngrams flag of dlatkInterface.
 feat$1to3gram$statuses_er1$user_id$16to1$0_01$pmi3_0
| f0 |field 1 |  field 2   |field3 | f4  | f5 |field6|
Field 0 Specifies this as a feature table. All feature tables begin with the word "feat".
Field 1 Specifies kinds of features; these are 1-, 2-, and 3-grams, the result of running --combine_feat_tables after --add_ngrams
Field 2 Gives the message table (-t) that the features were derived from
Field 3 Gives the group ID (-g) that features were grouped by
Field 4 Specifies scaling on features. The default (or unscaled) feature tables do not include this field.
16to8: --anscombe
16to4: --sqrt
16to3: --log
16to1: --boolean
Field 5 Shows the feature occurrence filter (--feat_occ_filter) applied to the feature table (i.e., the percentage of groups in which a feature must occur to be included in the table)
Field 6 Gives the PMI threshold set by --feat_colloc_filter, and optionally, --set_pmi_threshold
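The field layout above can be sketched as a small parser. This is an illustration, not DLATK code: the function name parse_ngram_table_name and the dictionary keys are hypothetical. It also assumes that the optional trailing fields can be told apart by their form, with scaling codes matching the list under Field 4 and the PMI field starting with "pmi". Filter values are returned as raw strings without decoding.

```python
# Scaling codes and the flags that produce them, as listed under Field 4.
SCALING_FLAGS = {
    "16to8": "--anscombe",
    "16to4": "--sqrt",
    "16to3": "--log",
    "16to1": "--boolean",
}

def parse_ngram_table_name(name):
    """Split an ngram feature table name into its fields (illustrative only)."""
    parts = name.split("$")
    if len(parts) < 4 or parts[0] != "feat":
        raise ValueError("not a feature table name: %r" % name)
    info = {
        "feature_type": parts[1],   # Field 1
        "message_table": parts[2],  # Field 2 (-t)
        "group_field": parts[3],    # Field 3 (-g)
        "scaling": None,            # Field 4, absent on unscaled tables
        "occ_filter": None,         # Field 5, kept as the raw string
        "pmi": None,                # Field 6, kept as the raw string
    }
    # Assumption: the optional fields are recognised by their form.
    for extra in parts[4:]:
        if extra in SCALING_FLAGS:
            info["scaling"] = SCALING_FLAGS[extra]
        elif extra.startswith("pmi"):
            info["pmi"] = extra
        else:
            info["occ_filter"] = extra
    return info

info = parse_ngram_table_name(
    "feat$1to3gram$statuses_er1$user_id$16to1$0_01$pmi3_0")
```

For the example table name, info["scaling"] is "--boolean" and info["group_field"] is "user_id". An unscaled, unfiltered name such as feat$1gram$msgs$message_id leaves the three optional entries as None.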
Example 2: extracted lexicon/topic features
These tables are generally created with the --add_lex_table flag of dlatkInterface.
 feat$cat_met_a30_2000_cp_w$messages_en$cty_id$1gra
| f0 |       field 1       |  field 2  |field3| f4 |
Field 0 Specifies this as a feature table
Field 1 Specifies the source of features; these are extracted from the topic lexicon met_a30_2000, and the table was created via --add_lex_table. The trailing "_w" indicates a weighted lexicon. "_cp" stands for "conditional probability", one of the two types of topic lexica normally created (see DLATK LDA Interface).
Field 2 Gives the message table (-t) that the features were derived from
Field 3 Gives the group ID (-g) that features were grouped by
Field 4 The first four characters of Field 1 of the word table (--word_table) used to derive the lexicon/topic features. By default this is the 1gram table. In previous versions (earlier than 1.1.5), this field specified the scaling on features.
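To make the Field 4 rule concrete, the sketch below derives the suffix from a word table name and reassembles the example table name. The helper names word_table_suffix and lex_table_name are hypothetical, and the word table name feat$1gram$messages_en$cty_id is an assumed default for this example. DLATK builds these names itself, so this only mirrors the convention described above.

```python
def word_table_suffix(word_table):
    """Return the first four characters of Field 1 of a word table name."""
    parts = word_table.split("$")
    if len(parts) < 2 or parts[0] != "feat":
        raise ValueError("not a feature table name: %r" % word_table)
    return parts[1][:4]

def lex_table_name(lex_field, message_table, group_field, word_table):
    """Join the five fields of a lexicon/topic feature table name.

    lex_field is passed in whole (e.g. "cat_met_a30_2000_cp_w"); this sketch
    does not try to reproduce how DLATK builds that field.
    """
    return "$".join(["feat", lex_field, message_table, group_field,
                     word_table_suffix(word_table)])

# Assumed default word table for the example above.
suffix = word_table_suffix("feat$1gram$messages_en$cty_id")
name = lex_table_name("cat_met_a30_2000_cp_w", "messages_en", "cty_id",
                      "feat$1gram$messages_en$cty_id")
```

Under these assumptions, suffix is "1gra" and name matches the example table name shown at the top of this section.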