Minimal Version Control Lesson: Use It
There is no excuse for a digital creative person to not use some sort of version control or source control. In the past disk space was too dear, version control systems were too expensive and software...
What does a generalized linear model do?
R supplies a modeling function called glm() that fits generalized linear models (abbreviated as GLMs). A natural question is what does it do and what problem is...
How robust is logistic regression?
Logistic regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust...
Level fit summaries can be tricky in R
Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them. When...
Rudie can’t fail (if majorized)
We have been writing for a while about the convergence of Newton steps applied to a logistic regression (See: What does a generalized linear model do?, How robust is logistic regression? and...
Error Handling in R
It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a...
Win-Vector’s Nina Zumel: “I Write, Therefore I Think”
Check out: I Write, Therefore I Think
More on ROC/AUC
A bit more on the ROC/AUC. The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but we would like...
Revisiting Cleveland’s The Elements of Graphing Data in ggplot2
I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical...
Don’t use correlation to track prediction performance
Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this you feel at least a...
A bit more on sample size
In our article What is a large enough random sample? we pointed out that if you wanted to measure a proportion to an accuracy “a” with a chance of being wrong of “d” then an idea was to guarantee you had...
Worry about correctness and repeatability, not p-values
In data science work you often run into cryptic sentences like the following: Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6...
Bayesian and Frequentist Approaches: Ask the Right Question
It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of...
Estimating rates from a single occurrence of a rare event
Elon Musk’s writing about a Tesla battery fire reminded me of some of the math related to trying to estimate the rate of a rare event from a single occurrence of the event (plus many non-event...
Resolving git “pseudo conflicts”
I strongly advise using version control, and usually recommend using git as your version control system. Usually I feel a bit guilty about this advice as git is so general that it is more of a toolkit...
Sample size and power for rare events
We have written a bit on sample size for common events, we have written about rare events, and we have written about frequentist significance testing. We would like to specialize our sample size...
Unit tests as penance
It recently hit me that I see unit tests as a form of penance (in addition to being a great tool for specification and test-driven development). If you fix a bug and don’t add a unit test, I suspect you...
Generalized linear models for predicting rates
I often need to build a predictive model that estimates rates. The example of our age is: ad click-through rates (how often a viewer clicks on an ad, estimated as a function of the features of the ad...
Unspeakable bets: take small steps
I was watching my cousins play Unspeakable Words over Christmas break and got interested in the end game. The game starts out as a “spell a word from cards and then bet some points” game, but in the end...
The Extra Step: Graphs for Communication versus Exploration
Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are...