Clustering and Super Topics
Step 1: Clustering
In this step we will:
Cluster features based on their distribution over some corpus
Output the reduced feature space to a lexicon
Note: This step describes a general clustering method. While the following commands cluster a topic-based feature table, the same process works for other feature types as well.
Setup:
-d: the database we are using
-t: the table inside the database where our text lives
-c: the table column we will be grouping the text by
--group_freq_thresh: ignore groups that contain fewer than the given number of words
Clustering flags:
--fit_reducer: flag to initialize the clustering
--reducer_to_lexicon: writes the clustered features to a lexicon format
--n_components: specify the number of clusters
--model: specify the clustering algorithm
Step 1a: Fit Reducer
This flag initializes the clustering process. Using --model one can specify the following clustering algorithms:
nmf - Non-Negative matrix factorization by Projected Gradient (NMF)
pca - (Principal component analysis) Linear dimensionality reduction using Singular Value Decomposition of the data and keeping only the most significant singular vectors to project the data to a lower dimensional space.
sparsepca - (Sparse Principal Components Analysis) Finds the set of sparse components that can optimally reconstruct the data. The amount of sparseness is controllable by the coefficient of the L1 penalty.
lda - (Linear Discriminant Analysis) A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.
kmeans - K-Means clustering
dbscan - (Density-Based Spatial Clustering of Applications with Noise) Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
spectral - Apply clustering to a projection of the normalized Laplacian. In practice, spectral clustering is very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster, for instance when clusters are nested circles on the 2D plane.
gmm - (Gaussian Mixture Model) Models the data as a mixture of a finite number of Gaussian distributions with unknown parameters.
We can also specify the number of components in the clustering with --n_components; a minimal sketch of the fitting step follows.
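As a rough illustration of what --fit_reducer does internally, here is a minimal scikit-learn sketch. The matrix X is a hypothetical stand-in for DLATK's group-by-feature matrix of group norms, and DLATK's actual call (visible in the log below) sets additional solver parameters:

import numpy as np
from sklearn.decomposition import NMF

# Hypothetical stand-in for DLATK's (groups x features) matrix of group norms;
# NMF requires non-negative input.
X = np.random.rand(978, 48)

model = NMF(n_components=30, init='nndsvd', random_state=42)
W = model.fit_transform(X)  # (groups x components) loadings
H = model.components_       # (components x features) weights; these become the lexicon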
Sample Command
dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16' --fit_reducer --model nmf --group_freq_thresh 500
SQL QUERY: select distinct feat from feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16
Adding zeros to group norms (978 groups * 48 feats).
[Applying StandardScaler to X: StandardScaler(copy=True, with_mean=True, with_std=True)]
(N, features): (978, 48)
[Doing clustering using : nmf]
model: NMF(alpha=10, beta=1, eta=0.1, init='nndsvd', l1_ratio=0.95, max_iter=200,
n_components=30, nls_max_iter=2000, random_state=42, shuffle=False,
solver='cd', sparseness=None, tol=0.0001, verbose=0)
--
Interface Runtime: 1.06 seconds
DLATK exits with success! A good day indeed ¯\_(ツ)_/¯.
This command clusters the topics, using only users with 500 or more words. Note, however, that while we ran the clustering, we haven't told DLATK what to do with those clusters.
Step 1b: Reducer to lexicon
This flag takes the output from --fit_reducer and saves the weights in a "lexicon".
Sample Command
dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16' --fit_reducer --model nmf --reducer_to_lexicon msgs_reduced10_nmf --n_components 10
This command will produce the table msgs_reduced10_nmf in the database dlatk_lexica. Here the term column contains topic ids and the category column contains the reduced component number. Weights are entries in the m x n factorization matrix produced by the clustering method, where m is the number of features in your feature table and n is the number of components specified by --n_components.
mysql> select * from dlatk_lexica.msgs_reduced10_nmf limit 10;
+----+------+----------+----------------+
| id | term | category | weight |
+----+------+----------+----------------+
| 1 | 272 | RFEAT7 | 8.73150685503 |
| 2 | 101 | RFEAT7 | 0.141548621844 |
| 3 | 278 | RFEAT1 | 3.56542413757 |
| 4 | 346 | RFEAT1 | 0.28171112431 |
| 5 | 290 | RFEAT1 | 0.731260898487 |
| 6 | 349 | RFEAT1 | 7.81170807542 |
| 7 | 276 | RFEAT1 | 6.14425018597 |
| 8 | 1781 | RFEAT1 | 1.26721023671 |
| 9 | 107 | RFEAT1 | 1.29672280401 |
| 10 | 344 | RFEAT8 | 0.159080830815 |
+----+------+----------+----------------+
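To make the mapping concrete, here is a rough sketch of how the factorization matrix H from the fitting step could be unpacked into (term, category, weight) rows like those above; H and feat_names are hypothetical stand-ins, since DLATK writes this table itself:

# Sketch: unpack the (n_components x n_features) matrix H into
# (term, category, weight) rows like those above. feat_names holds
# the original feature names (here, topic ids).
rows = []
for comp_idx, comp_weights in enumerate(H):
    for feat_idx, weight in enumerate(comp_weights):
        if weight > 0:  # NMF weights are non-negative; skip exact zeros
            rows.append((feat_names[feat_idx], f"RFEAT{comp_idx}", float(weight)))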
Step 2: Super Topics
This allows one to unroll the reduced lexicon table at the word level to create a new topic set. We refer to the reduced lexicon table produced in Step 1b using the --reduced_lexicon flag. We will use two additional flags:
--super_topics: the name of the super topic lexicon to be created
-l: the original topic lexicon that was used to build the clustered feature table
Note: For this step you need to first cluster a topic-based feature table. Super topics do not make sense for other features such as 1to3grams.
Sample Command
dlatkInterface.py -d dla_tutorial -t msgs -c user_id --reduced_lexicon msgs_reduced10_nmf --super_topics msgs_10nmf_fbcp -l met_a30_2000_cp
This produces the table msgs_10nmf_fbcp in dlatk_lexica:
mysql> select * from dlatk_lexica.msgs_10nmf_fbcp limit 10;
+----+--------------+----------+------------------------+
| id | term | category | weight |
+----+--------------+----------+------------------------+
| 1 | 8) | RFEAT7 | 0.00006541063856000905 |
| 2 | 8d | RFEAT7 | 0.8575587089762855 |
| 3 | :d | RFEAT7 | 1.7992666876157764 |
| 4 | :p | RFEAT7 | 0.020210781774380928 |
| 5 | ;d | RFEAT7 | 1.5833280925503392 |
| 6 | <3 | RFEAT7 | 0.010472837603757787 |
| 7 | >: | RFEAT7 | 0.0003415147203434118 |
| 8 | >:d | RFEAT7 | 1.8820983398248488 |
| 9 | accident | RFEAT7 | 0.0026496137794309827 |
| 10 | accidentally | RFEAT7 | 0.009535178394562414 |
+----+--------------+----------+------------------------+
Under the hood
Super topic creation runs the following MySQL command, which computes each word's super topic weight as the sum, over the original topics, of the cluster weight times the topic-word weight:
CREATE TABLE <newtable> AS SELECT super.category, orig.term, SUM(super.weight * orig.weight) AS weight
FROM <supertopic> super, <origtopic> orig
WHERE super.term=orig.category
GROUP BY super.category, orig.term;
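For readers who prefer Python, here is a rough equivalent of that aggregation; the dict-of-dicts inputs are assumptions for illustration, not DLATK data structures:

from collections import defaultdict

# super_lex[topic_id][component] -> cluster weight (the Step 1b lexicon)
# orig_lex[word][topic_id]       -> topic-word weight (the original topic lexicon)
def build_super_topics(super_lex, orig_lex):
    out = defaultdict(float)
    for word, topics in orig_lex.items():
        for topic_id, topic_weight in topics.items():
            for component, cluster_weight in super_lex.get(topic_id, {}).items():
                # SUM(super.weight * orig.weight) grouped by (category, term)
                out[(component, word)] += cluster_weight * topic_weight
    return out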
Everything in a single command:
dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_met_a30_2000_cp_w$msgs$user_id$16to16' \
--fit_reducer --model nmf --reducer_to_lexicon msgs_reduced10_nmf --n_components 10 \
--super_topics msgs_10nmf_fbcp -l met_a30_2000_cp
Step 3: Using your super topics
Now that we have the super topic table, we can extract features over our corpus:
dlatkInterface.py -d dla_tutorial -t msgs -c user_id --add_lex_table -l msgs_10nmf_fbcp --weighted_lexicon
This command produces the following feature table:
mysql> select * from feat$cat_msgs_10nmf_fbcp_w$msgs$user_id$16to16 limit 10;
+----+----------+--------+-------+-------------------------+
| id | group_id | feat | value | group_norm |
+----+----------+--------+-------+-------------------------+
| 1 | 28451 | RFEAT9 | 5 | 0.000000921854579604825 |
| 2 | 28451 | RFEAT1 | 36 | 0.00467374305816601 |
| 3 | 28451 | RFEAT8 | 12 | 0.0000368626647544864 |
| 4 | 28451 | RFEAT4 | 18 | 0.00112652429225695 |
| 5 | 28451 | RFEAT0 | 203 | 0.0366415948096726 |
| 6 | 28451 | RFEAT2 | 17 | 0.000283381640611496 |
| 7 | 28451 | RFEAT7 | 8 | 0.00000232305243540397 |
| 8 | 28451 | RFEAT6 | 7 | 0.000518033494791095 |
| 9 | 28451 | RFEAT3 | 64 | 0.00532112905379904 |
| 10 | 28451 | RFEAT5 | 15 | 0.000161019216334877 |
+----+----------+--------+-------+-------------------------+
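Conceptually, --weighted_lexicon scoring for a single group resembles the following simplified sketch; the function and inputs are assumptions, and DLATK's real extraction additionally handles tokenization and database storage:

# Simplified per-group sketch of weighted lexicon extraction.
# word_counts: {word: count in this group}
# lex:         {word: {category: weight}} built from the super topic lexicon
def weighted_lex_features(word_counts, lex):
    total = sum(word_counts.values())
    value, group_norm = {}, {}
    for word, count in word_counts.items():
        for category, weight in lex.get(word, {}).items():
            value[category] = value.get(category, 0) + count
            # the word's relative frequency, scaled by its lexicon weight
            group_norm[category] = group_norm.get(category, 0.0) + (count / total) * weight
    return value, group_norm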
We can also create a set of super topics whose weights are based on the met_a30_2000_freq_t50ll lexicon and use them for printing wordclouds:
# create log likelihood version of super topics
# creates msgs_10nmf_fbll in dlatk_lexica
dlatkInterface.py -d dla_tutorial -t msgs -c user_id --reduced_lexicon msgs_reduced10_nmf --super_topics msgs_10nmf_fbll -l met_a30_2000_freq_t50ll
# print wordclouds using all of the above
dlatkInterface.py -d dla_tutorial -t msgs -c user_id -f 'feat$cat_msgs_10nmf_fbcp_w$msgs$user_id$16to16' \
--outcome_table blog_outcomes --group_freq_thresh 500 --outcomes age gender --output_name supertopic_output \
--topic_tagcloud --make_topic_wordcloud --topic_lexicon msgs_10nmf_fbll \
--tagcloud_colorscheme bluered