DLATK LDA Interface
Note: These instructions introduce the new streamlined interface for LDA topic estimation. To use the old manual interface, see Mallet LDA Interface.
For a conceptual overview of LDA, see this intro
Step 0: Setup
Mallet:
This tutorial uses Mallet.
Download Mallet from the following page and install it to your home directory: http://mallet.cs.umass.edu/download.php
NOTE: If you plan to run on large datasets (~15M FB messages or similar), you may have to adjust parameters in your Mallet script file. See more info in the "Run LDA with Mallet" step.
PyMallet:
Depending on your DLATK installation, you may also need to install pymallet with the following command:
pip install dlatk-pymallet
Step 1: (If necessary) Create sample tweet table
If necessary, create a message table to run LDA on:
use dla_tutorial;
create table msgs_lda like msgs;
insert into msgs_lda select * from msgs where rand()<(2/6);
Step 2: Generate a feature table
This is a standard unigram feature table generation command.
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_ngrams -n 1
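To make the contents of a unigram feature table more concrete, the following is a minimal sketch of the per-message counting involved. It assumes lowercased whitespace tokenization, which is a simplification (DLATK's own tokenizer is more sophisticated), and the function name unigram_rows is hypothetical. Each row holds the group id, the feature (word), its raw count, and its count divided by the total number of tokens in the message.

```python
from collections import Counter

def unigram_rows(messages):
    """Return (group_id, feat, value, group_norm) rows, one per word per message.

    Assumption: lowercased whitespace tokenization, used only for illustration.
    """
    rows = []
    for message_id, text in messages.items():
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        for feat, value in sorted(counts.items()):
            # group_norm is the word's relative frequency within the message
            rows.append((message_id, feat, value, value / total))
    return rows

# Made-up messages, keyed by message_id
rows = unigram_rows({1: "the cat saw the dog", 2: "hello"})
for row in rows:
    print(row)
```

The relative frequency column is what later steps, such as weighted lexicon extraction, operate on.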
Step 3: Estimate LDA topics
A minimal command for estimating LDA topics is shown below:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id \
-f 'feat$1gram$msgs_lda$message_id$16to16' \
--estimate_lda_topics \
--lda_lexicon_name my_lda_lexicon
Note that the command above estimates LDA topics using PyMallet, which is generally much slower than Mallet. To use Mallet for topic estimation instead, use the following command:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id \
-f 'feat$1gram$msgs_lda$message_id$16to16' \
--estimate_lda_topics \
--lda_lexicon_name my_lda_lexicon \
--mallet_path /path/to/mallet/bin/mallet
Be sure to replace /path/to/mallet/bin/mallet with the path to which you installed Mallet in Step 0.
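If you launch this command from a script, it can help to check the Mallet path before starting a long run. The sketch below assembles the same arguments shown above and verifies that the given path points to an executable file. The function build_lda_command is a hypothetical helper, not part of DLATK, and it assumes --mallet_path expects the path to the mallet executable, as in the command above.

```python
import os

def build_lda_command(mallet_path=None, lexicon_name="my_lda_lexicon"):
    """Assemble the dlatkInterface.py arguments used in this tutorial.

    Hypothetical helper: it only builds the argument list and does not run it.
    """
    args = [
        "dlatkInterface.py", "-d", "dla_tutorial", "-t", "msgs_lda", "-c", "message_id",
        "-f", "feat$1gram$msgs_lda$message_id$16to16",
        "--estimate_lda_topics",
        "--lda_lexicon_name", lexicon_name,
    ]
    if mallet_path is not None:
        # Fail early if the path is not an executable file
        if not (os.path.isfile(mallet_path) and os.access(mallet_path, os.X_OK)):
            raise FileNotFoundError(f"No Mallet executable at {mallet_path}")
        args += ["--mallet_path", mallet_path]
    return args

# Without a Mallet path, the command falls back to PyMallet
print(" ".join(build_lda_command()))
```

The returned list can be passed to subprocess.run(..., check=True); passing a list rather than a single string avoids having to quote the $ characters in the feature table name.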
It is good practice to refrain from storing the topics as a lexicon until after you have reviewed them. While the interim LDA estimation files are typically stored in your /tmp directory, you can specify a different directory to allow you to more easily review the topics you have estimated. The following command will store these files in the lda_files directory and prevent creating a topic lexicon:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id \
-f 'feat$1gram$msgs_lda$message_id$16to16' \
--estimate_lda_topics \
--save_lda_files lda_files \
--no_lda_lexicon \
--mallet_path /path/to/mallet/bin/mallet
You can now review the .keys file in the lda_files directory to view the estimated topics and decide whether you should change any parameters (e.g., --num_stopwords or --lda_alpha).
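For a quick programmatic look at the topics, the sketch below parses lines in the layout that Mallet writes for topic keys: a topic number, a Dirichlet parameter, and a space-separated list of top words, separated by tabs. It assumes the .keys file saved by DLATK follows that layout. The function name parse_topic_keys and the sample lines are made up for illustration.

```python
def parse_topic_keys(lines):
    """Map topic id -> (alpha, [top words]) from tab-separated topic-key lines.

    Assumption: each line is "<topic>\t<alpha>\t<word word ...>".
    """
    topics = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        topic_id, alpha, words = parts[0], parts[1], parts[2]
        topics[int(topic_id)] = (float(alpha), words.split())
    return topics

# Made-up sample lines standing in for the contents of a .keys file
sample = ["0\t0.05\tlove happy day\n", "1\t0.05\twork time today \n"]
topics = parse_topic_keys(sample)
for topic_id, (alpha, words) in sorted(topics.items()):
    print(topic_id, alpha, " ".join(words))
```

In practice, the lines would come from opening the .keys file in the lda_files directory instead of the sample list.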
An important difference between this new interface and the old one is that stop words are no longer derived from a static Mallet stoplist. Instead, DLATK will determine the most common terms in your feature table and remove them (by default, it sets the top 50 most frequent terms as stop words, but this can be controlled with --num_stopwords). To disable stopping entirely, use --no_lda_stopping.
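The idea of frequency-based stopping can be sketched as follows: sum each term's count over the feature table and treat the most frequent terms as stop words. This is only an illustration of the approach. Whether DLATK ranks terms by total count or by some other frequency measure is an assumption here, and top_frequent_terms is a hypothetical name.

```python
from collections import Counter

def top_frequent_terms(feature_rows, num_stopwords=50):
    """Return the num_stopwords terms with the highest total count.

    feature_rows: iterable of (feat, value) pairs, as in a unigram feature table.
    Assumption: terms are ranked by summed count across all groups.
    """
    totals = Counter()
    for feat, value in feature_rows:
        totals[feat] += value
    return {feat for feat, _ in totals.most_common(num_stopwords)}

# Made-up rows; "the" appears in two groups
rows = [("the", 10), ("cat", 2), ("the", 7), ("a", 9), ("dog", 1)]
stop = top_frequent_terms(rows, num_stopwords=2)
print(sorted(stop))
```

Because the stop list is derived from your own data, domain-specific high-frequency terms are removed as well, which a static stoplist would miss.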
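There are several options you may wish to use with --estimate_lda_topics. Those used in this tutorial are --lda_lexicon_name, --mallet_path, --save_lda_files, --no_lda_lexicon, --num_stopwords, --no_lda_stopping, and --lda_alpha.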
Step 4: Extract features from lexicon
You’re now ready to start using the topic distribution lexicon:
dlatkInterface.py -d DATABASE -t MESSAGE_TABLE -c GROUP_ID \
--add_lex_table -l my_lda_lexicon_cp --weighted_lexicon
Always extract features using the _cp lexicon. The _freq_t50ll lexicon is only used when generating topic_tagclouds: --topic_tagcloud --topic_lexicon.
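As a rough picture of what weighted lexicon extraction computes, the sketch below scores each topic for one group by summing, over the group's words, the word's relative frequency times its weight for that topic. It assumes this is how --weighted_lexicon combines the two quantities. The function topic_scores and all values shown are made up for illustration.

```python
def topic_scores(group_norms, lexicon):
    """Score topics for one group.

    group_norms: term -> relative frequency of the term within the group.
    lexicon: term -> {topic: weight}, as in a weighted topic lexicon.
    Assumption: score(topic) = sum over terms of rel_freq * weight.
    """
    scores = {}
    for term, rel_freq in group_norms.items():
        # Terms missing from the lexicon contribute nothing
        for topic, weight in lexicon.get(term, {}).items():
            scores[topic] = scores.get(topic, 0.0) + rel_freq * weight
    return scores

# Made-up lexicon weights and relative frequencies
lexicon = {"happy": {"0": 0.5, "1": 0.25}, "work": {"1": 0.5}}
group_norms = {"happy": 0.5, "work": 0.25, "unknown": 0.25}
scores = topic_scores(group_norms, lexicon)
print(scores)
```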