Twitter Tweets on Non-Tobacco Blunt Wraps in the USA from January 2017 to November 2021
Principal Investigator(s): Joshua Rhee, University of California, Irvine
Version: V1
| Name | File Type | Size | Last Modified |
|---|---|---|---|
| | text/csv | 13.3 MB | 10/17/2022 12:03 PM |
| | text/csv | 2.6 MB | 10/17/2022 11:52 AM |
| | text/csv | 5.8 KB | 10/17/2022 11:45 AM |
| | text/csv | 1.4 MB | 10/17/2022 01:28 PM |
Project Citation:
Rhee, Joshua. Twitter Tweets on Non-Tobacco Blunt Wraps in the USA from January 2017 to November 2021. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2022-10-18. https://doi.org/10.3886/E182001V1
Project Description
Summary:
The presented data contain Twitter tweet information on non-tobacco blunt wraps, which are marketed as a non-tobacco alternative to traditional blunt wraps (e.g., little cigars, cigarillos). Tweets posted in the USA from January 2017 to November 2021 were collected using Texera (https://texera.ics.uci.edu/), an application developed and maintained by Professor Chen Li's research team at the University of California, Irvine's Department of Computer Science. Texera is built to search, store, edit, and analyze tweets through Twitter's Application Programming Interface (API v2), with access approved through the Academic Research Product Track. These data are openly available to everyone and accompany the manuscript, "Do Blunt Smokers’ Tweets about Non-Tobacco Wraps Reveal a Potential Replacement for the Cigarillo?" by Joshua Rhee, MPH, Yicong Huang, BS, Sadeem Alsudais, MS, Shengquan Ni, BS, Avinash Kumar, MS, Jacob Paredes, BA, BS, Chen Li, PhD, and David S. Timberlake, PhD. Please contact Dr. David Timberlake (dtimberl@uci.edu) for any correspondence regarding the research, or Dr. Chen Li (chenli@ics.uci.edu) regarding Texera. Special thanks to Aurash Jason Soroosh, MSPH, RD, and Paul McMurray, BS, for helping code and train the SVM classifier and for providing input on the search string used to identify relevant tweets. This research was funded by the University of California’s Tobacco-Related Disease Research Program (TRDRP; Grant No. T31IP1678; Recipient: DST) and the National Science Foundation (NSF; Grant No. 2107150; Recipient: CL).
As a brief summary, the research assessed the informal conversation held on Twitter regarding the potential of non-tobacco blunt wraps as an alternative to traditional blunt wraps. This was done by first creating a Boolean search string of key terms related to both non-tobacco blunt wraps and traditional blunt wraps. An initial 149,343 potentially relevant tweets were obtained, which were then further screened for relevancy by training a Support Vector Machine (SVM) classifier. The SVM classifier identified a total of 48,695 relevant tweets. Next, the relevant tweets were coded as either Organic or Commercial: Organic tweets are informative conversations revolving around non-tobacco wraps and blunt wraps that are not sponsored by a commercial entity, and Commercial tweets are those posted by small or large tobacco or non-tobacco wrap retailers, containing a URL that links to a related product, or containing keywords indicative of promoting relevant products.
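For illustration, the sketch below shows one way a text-based SVM relevance classifier could be trained and applied with scikit-learn. The file names, column names, and TF-IDF plus LinearSVC setup are assumptions made for the example only; the manuscript documents the actual training procedure and features used.

```python
# Minimal sketch of a tweet-relevance SVM classifier (illustrative only).
# File names and columns ("text", "relevant") are hypothetical; the actual
# training data and feature choices are described in the manuscript.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

labeled = pd.read_csv("labeled_tweets.csv")  # hypothetical hand-coded sample
X_train, X_test, y_train, y_test = train_test_split(
    labeled["text"], labeled["relevant"], test_size=0.2, random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
    LinearSVC(),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Apply the trained classifier to the full pool of potentially relevant tweets.
candidates = pd.read_csv("potentially_relevant_tweets.csv")  # hypothetical file
candidates["relevant"] = clf.predict(candidates["text"])
candidates[candidates["relevant"] == 1].to_csv("relevant_tweets.csv", index=False)
```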
The archive contains the following four files: (1) Tweets by Month and Category.csv, (2) SNP Data by Unique User.csv, (3) All Relevant Tweets.csv, and (4) Wordcloud Data.csv. The first file aggregates the relevant tweets by the month and year each tweet was created, along with the monthly frequency of relevant tweets. The column named 'Type' refers to the type of relevant tweet for that date and frequency; the 'Overall' type aggregates all 'Commercial' and 'Organic' tweets. The second file, SNP Data by Unique User.csv, contains the Social Networking Potential (SNP) scores, a quantitative measure of influence, for unique users identified within our corpus of relevant tweets. Please note that SNP measures could not be calculated for all unique users: calculating SNP requires the total number of retweets for each user, and some users had their accounts banned or deactivated, which prevented us from retrieving that count. This file also contains each user's Twitter follower count, following count, and total tweet count. The third file, All Relevant Tweets.csv, contains all relevant tweets identified within our time period, including the content of each tweet, the tokenized terms for each tweet, the date each tweet was created, the counts of likes, quotes, replies, and retweets, and whether the tweet was labeled as Commercial or Organic. Lastly, Wordcloud Data.csv contains all tokenized words from the relevant tweets along with their overall frequencies across all relevant tweets. The data were cleaned to remove unreadable special characters (e.g., emojis), but non-informative words (e.g., aaaaaaa, aaaie) were retained. This file also indicates whether each keyword was selected as one of the top 25 most frequent terms identified by the sklearn toolkit for each respective wordcloud. Please refer to the manuscript for details on each identified wordcloud and topic. The sklearn toolkit (https://scikit-learn.org/stable/) was used to identify coherent topics present within all relevant tweets and to provide the most frequent tokenized terms in each topic. However, please note that the data only indicate whether a tokenized term was present in a wordcloud, not its frequency within each topic, which is currently a limitation.
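As an illustration of how top terms per topic could be extracted with scikit-learn, the sketch below uses CountVectorizer with latent Dirichlet allocation. The specific topic model, the number of topics, and the 'Tokenized Terms' column name are assumptions, not necessarily the authors' exact pipeline; consult the manuscript for the actual analysis.

```python
# Illustrative sketch: derive the top 25 terms per topic with scikit-learn.
# The model choice (LDA), topic count, and column name are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = pd.read_csv("All Relevant Tweets.csv")
docs = tweets["Tokenized Terms"].fillna("")  # assumed column name

vectorizer = CountVectorizer(max_features=5000)
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=4, random_state=42)  # assumed topic count
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top25 = [terms[i] for i in weights.argsort()[::-1][:25]]
    print(f"Topic {k + 1}: {', '.join(top25)}")
```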
The published data adhere to all guidelines stated in the Twitter Developer Policy (https://developer.twitter.com/en/developer-terms/policy). Under the section on Content Redistribution, special permissions are granted to academic researchers sharing Tweet IDs and User IDs for non-commercial research purposes. The present research is conducted on behalf of an academic institution for the sole purpose of non-commercial research, and the data are shared to enable peer review and validation of our research.
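Because only Tweet IDs and User IDs may be redistributed under that policy, other researchers would typically re-hydrate the tweets through Twitter's API v2. The sketch below uses the tweepy client as one possible approach; the 'Tweet ID' column name and bearer-token placeholder are assumptions, and an approved developer account is required.

```python
# Hedged example of re-hydrating shared Tweet IDs through Twitter API v2 using
# tweepy (one common client; not necessarily what the authors used).
import pandas as pd
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # requires approved access

ids = pd.read_csv("All Relevant Tweets.csv")["Tweet ID"].astype(str).tolist()  # assumed column

hydrated = []
for start in range(0, len(ids), 100):  # API v2 accepts up to 100 IDs per request
    response = client.get_tweets(
        ids=ids[start:start + 100],
        tweet_fields=["created_at", "public_metrics"],
    )
    if response.data:
        hydrated.extend(response.data)

print(f"Re-hydrated {len(hydrated)} of {len(ids)} tweets")
```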
Funding Sources:
Tobacco-Related Disease Research Program (T31IP1678);
National Science Foundation (2107150)
Scope of Project
Subject Terms:
twitter;
tobacco;
blunts;
non-tobacco blunt wraps;
cannabis
Geographic Coverage:
United States of America
Time Period(s):
1/2017 – 11/2021
Data Type(s):
other
Collection Notes:
Data were collected using Texera (https://texera.ics.uci.edu/), an application developed and maintained by Professor Chen Li's research team at the University of California, Irvine's Department of Computer Science. Texera is built to search, store, edit, and analyze tweets through Twitter's Application Programming Interface (API v2), with access approved through the Academic Research Product Track.
Methodology
Data Source:
Twitter
This material is distributed exactly as it arrived from the data depositor. ICPSR has not checked or processed this material. Users should consult the investigator(s) if further information is desired.