Mallet LDA Interface
NOTE: DLATK can now produce LDA topics through a simpler interface: DLATK LDA Interface. Although the steps described below are still valid, you are advised to use the new automated interface.
For a conceptual overview of LDA, see this intro.
Step 0: Get Access To Mallet
This tutorial uses Mallet.
Download Mallet from http://mallet.cs.umass.edu/download.php and install it in your home directory.
NOTE: if you plan to run on large datasets (~15M Facebook messages or similar) you may have to adjust the memory setting in your mallet script file. See the note in Step 4 ("Run LDA with Mallet").
Step 1: (If necessary) Create sample tweet table
If necessary, create a message table to run LDA on:
use dla_tutorial;
create table msgs_lda like msgs;
insert into msgs_lda select * from msgs where rand() < (2/6);  -- keep a random ~1/3 sample of the messages
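You can sanity-check the sample size (it should be roughly a third of the rows in msgs):
select count(*) from msgs_lda;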
We will also create an output folder in your home directory:
mkdir ~/lda_tutorial
Step 2, option A: Create a tokenized table in MySQL and export it
Uses the DLATK infrastructure to tokenize the messages and print the tokenized messages to a file:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_tokenized #creates table *msgs_lda_tok* in *dla_tutorial*
dlatkInterface.py -d dla_tutorial -t msgs_lda --print_tokenized_lines ~/lda_tutorial/msgs_lda.txt
This will create the file ~/lda_tutorial/msgs_lda.txt:
NOTE: if your messages contain newlines, this step may cause issues down the line; it is recommended to replace all newlines with a representative character before proceeding, for example as shown below.
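One way to do this, assuming you are willing to modify the msgs_lda copy in place, is a MySQL replace (the space used here as the replacement character is an illustrative choice):
update msgs_lda set message = replace(replace(message, '\r', ' '), '\n', ' ');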
OPTIONAL: If you want to restrict your LDA words to a subset of the words in the corpus (e.g., those said by at least N people, or said in no more than Y percent of messages), you can use the --feat_whitelist flag with a filtered feature table (see the sketch below), e.g.:
dlatkInterface.py -d dla_tutorial -t msgs_lda --print_tokenized_lines ser1_filt.txt --feat_whitelist 'feat$1gram$msgs_lda$user_id$16to16$0_005'
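If you do not already have such a feature table, the following sketch shows one way it might be built; it assumes the 0_005 suffix above came from DLATK's --feat_occ_filter with --set_p_occ 0.005 (check the resulting table name, since the exact suffix depends on the threshold you choose):
dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id --add_ngrams -n 1
dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id \
-f 'feat$1gram$msgs_lda$user_id$16to16' --feat_occ_filter --set_p_occ 0.005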
Step 2, option B: Generate a feature table and convert it to a Mallet-formatted text file
You can use this step in place of step 2, option A.
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id --add_ngrams -n 1
dlatkInterface.py -d dla_tutorial -t msgs_lda -c message_id -f 'feat$1gram$msgs_lda$message_id$16to16' --print_joined_feature_lines msgs_lda.txt
Step 3: Format for Mallet
Prepares the messages and tokenizes them again for Mallet, removing stopwords and non-English characters. For help, run ./bin/mallet import-file --help
./bin/mallet import-file --input ~/lda_tutorial/msgs_lda.txt \
--token-regex "(#|@)?(?!(\W)\2+)([a-zA-Z\_\-\'0-9\(-\@]{2,})" \
--output ~/lda_tutorial/msgs_lda.mallet \
--remove-stopwords --keep-sequence [--extra-stopwords EXTRA_STOPWORDS_FILE]
This will create the file ~/lda_tutorial/msgs_lda.mallet.
Step 4: Run LDA with Mallet
This is the actual LDA step, which might take a while (four and a half days on 20 million tweets). For help, run ./bin/mallet train-topics --help
./bin/mallet train-topics --input ~/lda_tutorial/msgs_lda.mallet \
--alpha 5 --num-topics 2000 --optimize-burn-in 0 --output-model ~/lda_tutorial/msgs_lda.model \
--output-state ~/lda_tutorial/msgs_lda_state.gz \
--output-topic-keys ~/lda_tutorial/msgs_lda.keys
Here alpha is a prior on the number of topics per document. The other hyper-parameter, beta (which we usually do not change), is a prior on the number of words per topic.
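If you do want to set beta, Mallet exposes it directly; this is the same training command with beta given explicitly at its Mallet default of 0.01, shown only to make the knob visible:
./bin/mallet train-topics --input ~/lda_tutorial/msgs_lda.mallet \
--alpha 5 --beta 0.01 --num-topics 2000 --optimize-burn-in 0 \
--output-model ~/lda_tutorial/msgs_lda.model \
--output-state ~/lda_tutorial/msgs_lda_state.gz \
--output-topic-keys ~/lda_tutorial/msgs_lda.keys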
This creates the following files:
msgs_lda.model
msgs_lda_state.gz
msgs_lda.keys
The file msgs_lda.keys contains the topics; at this point it is good to inspect them to see whether you should change any of the above parameters (number of topics, alpha, beta, etc.):
> head ~/lda_tutorial/msgs_lda.keys
0 0.0025 italian pig americans wolf american straw pigs 04 stick film defamation censored pinocchio brick 03 sopranos pig's offensive june wop
1 0.0025 yoga levi yea pages yay morning boy mix lorraine evening bit law she's joe honey exam property study gay spring
2 0.0025 projection sng recap mtt dried 18 sessions planetpoker 5.50 pokerstars faulty limit summary music-lovers end-users 1540-1608 raritan tester dch brigade
3 0.0025 :) urllink misses marion lovely managed melbourne sean pub sort round rock bus :( tent ;) perth we've elephant flat
4 0.0025 jill alcohol aaaannnndd soorry georgie celica precariously timid mirrors cam defined cat's praying expects hitler valley asses
5 0.0025 apc uhm downsides surpression tly pary undies spybot band-aids pow sprinkles trashed bloated celebrated paramaters thas elses country's immortal association
Note: when dealing with giant data sets (for example, when creating Facebook topics) you might encounter the error Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. In that case, edit the line MEMORY=1g in ~/Mallet/bin/mallet, raising the 1g value to 2g, 4g, or higher, depending on how much RAM your machine has.
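For example, assuming Mallet is installed at ~/Mallet, you can raise the limit to 4g with:
sed -i 's/^MEMORY=1g$/MEMORY=4g/' ~/Mallet/bin/mallet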
Step 5: Add message IDs to state file
Adds the message IDs to the topic distributions and stores the result in lda_topics:
dlatkInterface.py --add_message_id ~/lda_tutorial/msgs_lda.txt ~/lda_tutorial/msgs_lda_state.gz --output_name ~/lda_tutorial/lda_topics
This creates the file ~/lda_tutorial/lda_topics:
> head ~/lda_tutorial/lda_topics
#doc source pos typeindex type topic
#alpha : 0.0025 0.0025 0.0025 0.0025 ....
#beta : 0.01
0 2 0 0 miss 1840
0 2 1 1 su 661
0 2 2 2 living 623
0 2 3 3 skills 466
0 2 4 4 teacher 466
0 2 5 5 form 1319
Step 6: Import state file into database
Imports the topic-message probability distributions in a raw, JSON-like format that is not directly readable by DLATK:
dlatkInterface.py -d dla_tutorial -t msgs_lda_tok --add_lda_messages ~/lda_tutorial/lda_topics
This creates the table msgs_lda_tok_lda$lda_topics in the database dla_tutorial:
select message from msgs_lda_tok_lda$lda_topics limit 1;
| [{"topic_id": "296", "doc": "10", "term": "urllink", "index": "0", "term_id": "191", "message_id": "40"}, {"topic_id": "947", "doc": "10", "term": "busy", "index": "1", "term_id": "249", "message_id": "40"}, {"topic_id": "1804", "doc": "10", "term": "roadway", "index": "2", "term_id": "250", "message_id": "40"}, {"topic_id": "296", "doc": "10", "term": "urllink", "index": "3", "term_id": "191", "message_id": "40"}]
NOTE - "Duplicate entry 'xxxx' for key 'PRIMARY'" errors may be indicative of an issues with newlines. See step 2 for a solution.
Step 7: Create topic-word distributions
Creates the readable topic-word distributions from the messages:
dlatkInterface.py -d dla_tutorial -t msgs_lda -m 'msgs_lda_tok_lda$lda_topics' --create_dists
This creates the following files:
msgs_lda_tok_lda.lda_topics.freq.threshed50.loglik.csv
msgs_lda_tok_lda.lda_topics.lik.csv
msgs_lda_tok_lda.lda_topics.loglik.csv
msgs_lda_tok_lda.lda_topics.topicGivenWord.csv
msgs_lda_tok_lda.lda_topics.wordGivenTopic.csv
Step 8: Add topic-lexicon to lexicon database
Generates the lexicons based on the different probability distribution types.
Topic given word:
dlatkInterface.py --lex_interface --topic_csv \
--topicfile=~/lda_tutorial/msgs_lda_tok_lda.lda_topics.topicGivenWord.csv \
-c msgs_lda_cp
This will create the table msgs_lda_cp in the database dlatk_lexica. You can change the database with the flag --lexicondb:
mysql> select * from msgs_lda_cp limit 5;
+----+----------------+----------+--------------------+
| id | term | category | weight |
+----+----------------+----------+--------------------+
| 1 | erm | 274 | 0.0625 |
| 2 | productively | 274 | 0.333333333333333 |
| 3 | jared | 274 | 0.125 |
| 4 | book | 274 | 1 |
| 5 | sketch | 274 | 0.0909090909090909 |
+----+----------------+----------+--------------------+
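To sanity-check a single topic, sort its terms by weight (category 274 is the topic shown above):
mysql> select term, weight from msgs_lda_cp where category = '274' order by weight desc limit 10;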
Frequency, thresholded to loglik >= 50:
dlatkInterface.py --lex_interface --topic_csv \
--topicfile=~/lda_tutorial/msgs_lda_tok_lda.lda_topics.freq.threshed50.loglik.csv \
-c msgs_lda_freq_t50ll
This will create the table msgs_lda_freq_t50ll in the database dlatk_lexica. You can change the database with the flag --lexicondb:
mysql> select * from msgs_lda_freq_t50ll limit 5;
+----+------------------+----------+--------+
| id | term | category | weight |
+----+------------------+----------+--------+
| 1 | wildest | 766 | 1 |
| 2 | kazadoom | 766 | 1 |
| 3 | sentirnos | 766 | 1 |
| 4 | charlie | 766 | 1 |
| 5 | commondreams.org | 766 | 1 |
+----+------------------+----------+--------+
Step 9: Extract features from lexicon
You’re now ready to start using the topic distribution lexicon:
dlatkInterface.py -d DATABASE -t MESSAGE_TABLE -c GROUP_ID --add_lex_table -l msgs_lda_cp --weighted_lexicon
(Always extract features using the _cp lexicon. The freq_t50ll lexicon is only used when generating topic tagclouds, via --topic_tagcloud --topic_lexicon; see the sketch below.)
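For reference, a hypothetical tagcloud invocation might look like the following; the feature table name, outcome table, and outcome field are placeholders (check the table that --add_lex_table actually created), and the exact flags may vary by DLATK version:
dlatkInterface.py -d dla_tutorial -t msgs_lda -c user_id \
-f 'feat$cat_msgs_lda_cp_w$msgs_lda$user_id$1gra' \
--outcome_table OUTCOME_TABLE --outcomes OUTCOME_FIELD \
--topic_tagcloud --topic_lexicon msgs_lda_freq_t50ll \
--output_name ~/lda_tutorial/topic_tagcloud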