DLA Rules of Thumb

p = |feats| (number of features); N = number of observations (groups)

First rule of thumb: there are no hard and fast rules that always apply, but these are the best places to start.


  • Feat_occ_filter (--feat_occ_filter):

    • Rule of thumb: N/2 < p < N

  • Depends on:

    • Expected effect sizes

    • Sparsity of your observations

    • How many words do you have per person? (a measure of how well word use rates are being estimated)

    • "True" strength of relationship
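
One way to act on the N/2 < p < N rule is to try several occurrence thresholds and keep the first one whose surviving feature count lands in the target range. The sketch below is only an illustration, not DLATK code: the function name choose_p_occ, the dict-of-sets input format, and the candidate thresholds are all assumptions made for the example. It treats the threshold as the minimum share of groups that must use a feature.

```python
def choose_p_occ(group_features, candidates, lower=0.5, upper=1.0):
    """Return (p_occ, p) for the smallest candidate with lower*N < p < upper*N.

    group_features: dict mapping group id -> set of features the group uses
                    (hypothetical input format, chosen for this sketch).
    p_occ: minimum share of groups that must use a feature for it to be kept.
    """
    n_groups = len(group_features)
    # Count the number of groups in which each feature occurs.
    group_counts = {}
    for feats in group_features.values():
        for feat in feats:
            group_counts[feat] = group_counts.get(feat, 0) + 1
    for p_occ in sorted(candidates):
        p = sum(1 for c in group_counts.values() if c / n_groups >= p_occ)
        if lower * n_groups < p < upper * n_groups:
            return p_occ, p
    return None  # no candidate lands in the target range


# Toy data: 10 groups, 20 features; feature k is used by (k % 10) + 1 groups.
toy = {g: {f"f{k}" for k in range(20) if g <= k % 10} for g in range(10)}
print(choose_p_occ(toy, [0.1, 0.3, 0.5, 0.7, 0.9]))  # → (0.7, 8)
```

Passing lower=1.0 and upper=2.0 instead targets the N < p < 2*N range.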


  • See --feat_colloc_filter

  • When to apply?

    • good for DLA

    • usually not good for prediction (less accurate models)

  • Generally, a PMI threshold of 3 works for anything from 2-grams to 4-grams.
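
To make the PMI threshold concrete, here is a minimal sketch of a collocation filter. It assumes the textbook definition PMI = log(p(ngram) / product of p(word_i)) with a natural log and a threshold applied directly to that value; how DLATK computes and scales PMI internally may differ. The names ngram_pmi and filter_collocations are hypothetical.

```python
import math
from collections import Counter


def ngram_pmi(tokens, n):
    """Return {ngram: PMI}, with PMI = log(p(ngram) / prod_i p(word_i))."""
    unigrams = Counter(tokens)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total_unigrams = sum(unigrams.values())
    total_ngrams = sum(ngrams.values())
    pmi = {}
    for ngram, count in ngrams.items():
        p_ngram = count / total_ngrams
        # Probability of the n-gram if its words occurred independently.
        p_independent = math.prod(unigrams[w] / total_unigrams for w in ngram)
        pmi[ngram] = math.log(p_ngram / p_independent)
    return pmi


def filter_collocations(tokens, n, threshold=3.0):
    """Keep only the n-grams whose PMI is at least `threshold`."""
    return {ng for ng, value in ngram_pmi(tokens, n).items() if value >= threshold}


# Toy corpus: one rare but tightly bound pair among many filler tokens.
tokens = ["new", "york"] + ["a"] * 98
print(filter_collocations(tokens, 2))  # → {('new', 'york')}
```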


  • Feat_occ_filter

    • See --feat_occ_filter

    • Rule of thumb: N < p < 2*N

    • (when you’re doing "magic sauce" feature selection or LASSO (L1) penalization)

  • Colloc filter: doesn’t usually help (sometimes hurts)

  • What usually works best

    • Regression (listed in order of what usually works best)

      • Feat_occ_filter => Univariate selection => PCA => L2 (ridge) regression (--feature_selection magic_sauce --model ridgecv)

      • LASSO L1 regression (with no separate feature selection)

      • ElasticNet L1L2 regression (presumably worse because there is one more hyper-parameter to set)

    • Choosing dimensions for PCA

      • If a lot of observations (>>10k): 10% of p

      • If few observations (< 10k): 50% of N

      • (see more complicated functions in regressionPredictor.py featureSelectionString)

    • Classification

      • L1 linear-svm (--model linear-svc)

      • L1 logistic regression (--model lr)

      • Extremely randomized trees (--model etc)
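
The PCA dimension rule given above for regression can be sketched as a small helper. This is an illustration only, not the logic in regressionPredictor.py: the function name suggest_pca_components is hypothetical, and the hard cutoff at exactly 10k observations is an assumption, since the rule only distinguishes "a lot" (>>10k) from "few" (< 10k).

```python
def suggest_pca_components(n_obs, n_feats, many_obs_cutoff=10_000):
    """Suggest a number of PCA components from N (n_obs) and p (n_feats)."""
    if n_obs > many_obs_cutoff:
        k = n_feats // 10  # lots of observations: 10% of p
    else:
        k = n_obs // 2     # few observations: 50% of N
    # PCA cannot produce more components than min(N, p); keep at least one.
    return max(1, min(k, n_obs, n_feats))


print(suggest_pca_components(50_000, 20_000))  # → 2000
print(suggest_pca_components(2_000, 5_000))    # → 1000
```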

Levels of analysis and group frequency threshold

  • See --group_freq_thresh

  • County level:

    • Ok to push the boundaries for p (use lots of features compared to observations), because you have well-estimated features

    • GFT: 20k to 50k range (if really good data, use 50k)

  • User-level:

    • Above rules for p directly apply.

    • GFT: 1k (500 if N < 5k; 2k if N > 100k usually has absolutely no benefit)

  • Message-level:

    • Rules above apply, except that it is normally best to use a binary encoding of 1- to 3-grams

    • GFT: 1 (but depends on task)
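
As a rough illustration of the two message-level ideas, the sketch below drops groups with fewer than GFT words and then encodes each remaining message as the set of 1- to 3-grams it contains (presence or absence, not counts). It assumes the threshold is a minimum total word count per group; the function names and the dict-of-token-lists input are hypothetical and not part of DLATK.

```python
def apply_group_freq_thresh(group_tokens, gft):
    """Keep only groups whose total word count is at least `gft`."""
    return {g: toks for g, toks in group_tokens.items() if len(toks) >= gft}


def binary_1to3grams(tokens):
    """Return the set of 1-, 2- and 3-grams present in `tokens`.

    Membership in the set is the binary encoding: a feature is either
    present (1) or absent (0), however many times it occurs.
    """
    feats = set()
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats


messages = {"m1": ["so", "happy", "so", "happy"], "m2": []}
kept = apply_group_freq_thresh(messages, gft=1)  # "m2" has no words and is dropped
encoded = {g: binary_1to3grams(toks) for g, toks in kept.items()}
print(sorted(encoded["m1"]))
```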