Nick Guenther's research while affiliated with University of Waterloo and other places

Publications (4)

Article
Text mining is the process of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the command ngram, which implements the most common approach to text mining, the “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of va...
Article
Support vector machines are statistical- and machine-learning techniques with the primary goal of prediction. They can be applied to continuous, binary, and categorical outcomes analogous to Gaussian, logistic, and multinomial regression. We introduce a new command for this purpose, svmachines. This package is a thin wrapper for the widely deployed...
Article
Text mining is the art of turning free text into numerical variables and then analyzing them with statistical techniques. We introduce the Stata command ngram, which implements the most common approach to text mining, “bag of words”. An n-gram is a contiguous sequence of words in a text. Broadly speaking, ngram creates hundreds or thousands of vari...
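
The abstracts above describe turning free text into count variables, one per n-gram. A minimal Python sketch of that idea follows. It is an illustration only, not the Stata ngram command or its implementation: the function names `ngrams` and `bag_of_ngrams`, the lowercase whitespace tokenizer, and the two sample texts are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-word sequences in tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(texts, max_n=2):
    """Count every n-gram of length 1..max_n in each text.

    Returns (vocab, matrix): vocab is the sorted list of n-grams seen,
    and matrix[d][j] is the count of vocab[j] in document d.
    """
    counts = []
    for text in texts:
        tokens = text.lower().split()  # naive tokenizer (assumption)
        doc_counts = Counter()
        for n in range(1, max_n + 1):
            doc_counts.update(ngrams(tokens, n))
        counts.append(doc_counts)
    vocab = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for term in vocab] for doc_counts in counts]
    return vocab, matrix


# Hypothetical sample documents, chosen only to show the output shape.
texts = ["the cat sat", "the cat ran away"]
vocab, matrix = bag_of_ngrams(texts)
for term, column in zip(vocab, zip(*matrix)):
    print(f"{term!r}: {column}")
```

Each column of `matrix` plays the role of one numerical variable. Even these two short texts yield nine columns, which suggests how real corpora reach hundreds or thousands of variables.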

Citations

... An n-gram is a consecutive sequence of n words in a text [17]. This research used a combination of unigram and bigram tokenization. ...
... Stata tools for text mining exist, yet there is much room for development in this growing field (see, for instance, Provalis Research 2024, William and Williams 2014, and Schonlau et al. 2017) ...
... SVMs in scikit-learn support both dense and sparse sample vectors as input. [9] Bagging classifier: An ensemble meta-estimator called a bagging classifier fits base classifiers one at a time to random subsets of the original dataset, and then averages or votes on each classifier's predictions to produce a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator like a decision tree, by introducing randomisation into its construction procedure and then making an ensemble out of it. ...
... N-grams, in this context, are word phrases consisting of n-number of words in direct proximity to each other (Bharadwaj & Shao, 2019; Gurcan & Cagiltay, 2023; Schonlau et al., 2017). This study was focused exclusively on bigrams (i.e., two-word phrases). ...