DLA (Differential Language Analysis) Rules of Thumb
p = |feats| (the number of features); N = the number of observations (groups)
First rule of thumb: there are no hard-and-fast rules that always apply, but these are the best places to start.
DLA
Feat_occ_filter (--feat_occ_filter):
Rule of thumb: N/2 < p < N
Depends on:
Expected effect sizes
Sparsity of your observations
How many words do you have per person? (a measure of how well word use rates are being estimated)
"True" strength of relationship
Colloc_filter
When to apply?
Good for DLA
Usually not good for prediction (it tends to yield less accurate models)
Generally, a PMI threshold of 3 works for anything from 2-grams to 4-grams.
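For illustration, a minimal PMI computation over 2-grams (DLATK's collocation scoring may differ in its details; the counts here are made up):

```python
import math

def pmi(bigram, unigrams, bigrams, total_tokens, total_bigrams):
    """Pointwise mutual information of a 2-gram (w1, w2):
    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    """
    w1, w2 = bigram
    p_xy = bigrams[bigram] / total_bigrams
    p_x = unigrams[w1] / total_tokens
    p_y = unigrams[w2] / total_tokens
    return math.log2(p_xy / (p_x * p_y))

unigrams = {"happy": 50, "birthday": 40, "the": 500}
bigrams = {("happy", "birthday"): 30, ("the", "birthday"): 2}
total_tokens, total_bigrams = 10_000, 9_000

# Keep only n-grams scoring above the PMI threshold of 3.
kept = {bg for bg in bigrams
        if pmi(bg, unigrams, bigrams, total_tokens, total_bigrams) > 3}
print(kept)  # ("happy", "birthday") passes; ("the", "birthday") does not
```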
Prediction
Feat_occ_filter
Rule of thumb: N < p < 2*N
(when you're doing "magic sauce" feature selection or LASSO (L1) penalization)
Colloc filter: doesn’t usually help (sometimes hurts)
What usually works best
Regression (listed in order of what usually works best)
Feat_occ_filter => Univariate selection => PCA => L2 (ridge) regression (--feature_selection magic sauce --model ridgecv)
LASSO L1 regression (with no separate feature selection)
ElasticNet (L1+L2) regression (presumably worse because there is one more hyperparameter to set)
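The first recipe maps fairly directly onto scikit-learn. A rough sketch (DLATK wraps similar steps internally, see regressionPredictor.py; the parameters here are illustrative, not DLATK's defaults):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFwe, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Univariate selection -> PCA -> ridge with built-in CV over alphas.
pipe = Pipeline([
    ("select", SelectFwe(f_regression, alpha=60.0)),  # family-wise univariate filter
    ("pca", PCA(n_components=0.99)),                  # keep 99% of the variance
    ("ridge", RidgeCV(alphas=[0.01, 0.1, 1, 10, 100, 1000])),
])

X, y = make_regression(n_samples=500, n_features=2000, n_informative=50,
                       noise=10.0, random_state=0)
print(cross_val_score(pipe, X, y, cv=5, scoring="r2").mean())
```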
Choosing dimensions for PCA
If many observations (>> 10k): use 10% of p
If few observations (< 10k): use 50% of N
(see more complicated functions in regressionPredictor.py featureSelectionString)
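For illustration, a hypothetical helper encoding just these two cases (the functions in regressionPredictor.py are more elaborate):

```python
def pca_components(n_observations: int, p_features: int) -> int:
    """Rule-of-thumb PCA dimensionality.

    Many observations (>= 10k): keep 10% of the features (p).
    Few observations (< 10k):   keep 50% of the observations (N).
    """
    if n_observations >= 10_000:
        return max(1, p_features // 10)
    return max(1, n_observations // 2)

print(pca_components(50_000, 30_000))  # -> 3000
print(pca_components(2_000, 30_000))   # -> 1000
```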
Classification
Levels of analysis and group frequency threshold (GFT)
County level:
OK to push the boundaries for p (use many features relative to observations), because the features are well estimated
GFT: 20k to 50k range (if you have really good data, use 50k)
User-level:
The rules for p above apply directly.
GFT: 1k (use 500 if N < 5k; raising it to 2k when N > 100k usually has absolutely no benefit)
Message-level:
Rules above apply, except it is normally best to use a binary encoding of 1-to-3-grams
GFT: 1 (but depends on task)
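For illustration, a minimal sketch of applying a GFT, assuming total word counts per group have already been computed (names here are hypothetical, not DLATK's API):

```python
import pandas as pd

def apply_gft(word_counts: pd.Series, gft: int) -> pd.Index:
    """Return the groups whose total word count meets the GFT.

    word_counts: total tokens per group (index = group id).
    gft: e.g. 50_000 for counties, 1_000 for users, 1 for messages.
    """
    return word_counts[word_counts >= gft].index

counts = pd.Series({"user_a": 2_500, "user_b": 400, "user_c": 1_200})
print(list(apply_gft(counts, gft=1_000)))  # -> ['user_a', 'user_c']
```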