Twitter Tweets on Non-Tobacco Blunt Wraps in the USA from January 2017 to November 2021
Principal Investigator(s): Joshua Rhee, University of California, Irvine
Version: V1
| Name | File Type | Size | Last Modified |
|---|---|---|---|
| | text/csv | 13.3 MB | 10/17/2022 12:03 PM |
| | text/csv | 2.6 MB | 10/17/2022 11:52 AM |
| | text/csv | 5.8 KB | 10/17/2022 11:45 AM |
| | text/csv | 1.4 MB | 10/17/2022 01:28 PM |
Project Citation:
Rhee, Joshua. Twitter Tweets on Non-Tobacco Blunt Wraps in the USA from January 2017 to November 2021. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2022-10-18. https://doi.org/10.3886/E182001V1
Project Description
Summary:
The presented data contain Twitter tweet information on non-tobacco blunt wraps, which are marketed as a non-tobacco alternative to traditional blunt wraps (e.g., little cigars, cigarillos). Tweets posted in the USA from January 2017 to November 2021 were collected using Texera (https://texera.ics.uci.edu/), an application developed and maintained by Professor Chen Li's research team at the University of California, Irvine's Department of Computer Science. Texera is built to search, store, edit, and analyze tweets through Twitter's Application Programming Interface (API v2), with access approved through the Academic Research Product Track. These data are openly available to everyone and accompany the manuscript, "Do Blunt Smokers’ Tweets about Non-Tobacco Wraps Reveal a Potential Replacement for the Cigarillo?" by Joshua Rhee, MPH, Yicong Huang, BS, Sadeem Alsudais, MS, Shengquan Ni, BS, Avinash Kumar, MS, Jacob Paredes, BA, BS, Chen Li, PhD, and David S. Timberlake, PhD. Please contact Dr. David Timberlake (dtimberl@uci.edu) for any correspondence regarding the research, or Dr. Chen Li (chenli@ics.uci.edu) regarding Texera. Special thanks to Aurash Jason Soroosh, MSPH, RD, and Paul McMurray, BS, for helping code and train the SVM classifier and for providing input on the search string used to identify relevant tweets. This research was funded by the University of California’s Tobacco-Related Disease Research Program (TRDRP; Grant No. T31IP1678; Recipient: DST) and the National Science Foundation (NSF; Grant No. 2107150; Recipient: CL).
As a brief summary, the research assessed the informal conversation held on Twitter regarding the potential of non-tobacco blunt wraps as an alternative to traditional blunt wraps. This was done by first creating a Boolean search string of key terms related to both non-tobacco blunt wraps and traditional blunt wraps. An initial 149,343 potentially relevant tweets were obtained, which were then further screened for relevancy by training a Support Vector Machine (SVM) classifier. The SVM classifier identified a total of 48,695 relevant tweets. Next, the relevant tweets were coded as either Organic or Commercial: Organic tweets are informative conversations revolving around non-tobacco wraps and blunt wraps that are not sponsored by a commercial entity, and Commercial tweets are those posted by small or large tobacco or non-tobacco wrap retailers, containing a URL that links to a related product, or containing keywords indicative of promoting relevant products.
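For illustration, the sketch below shows one way a text-based SVM relevance classifier could be trained and applied with scikit-learn. The file names, column names, and TF-IDF plus LinearSVC setup are assumptions made for the example only; the manuscript documents the actual training procedure and features used.

```python
# Minimal sketch of a tweet-relevance SVM classifier (illustrative only).
# File names and columns ("text", "relevant") are hypothetical; the actual
# training data and feature choices are described in the manuscript.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

labeled = pd.read_csv("labeled_tweets.csv")  # hypothetical hand-coded sample
X_train, X_test, y_train, y_test = train_test_split(
    labeled["text"], labeled["relevant"], test_size=0.2, random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1, 2)),
    LinearSVC(),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Apply the trained classifier to the full pool of potentially relevant tweets.
candidates = pd.read_csv("potentially_relevant_tweets.csv")  # hypothetical file
candidates["relevant"] = clf.predict(candidates["text"])
candidates[candidates["relevant"] == 1].to_csv("relevant_tweets.csv", index=False)
```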
The archive contains the following four files: (1) Tweets by Month and Category.csv, (2) SNP Data by Unique User.csv, (3) All Relevant Tweets.csv, and (4) Wordcloud Data.csv. The first file aggregates the relevant tweets by the month and year each tweet was created, along with the monthly frequency of relevant tweets. The column named 'Type' refers to the type of relevant tweet for that date and frequency; the 'Overall' type aggregates all 'Commercial' and 'Organic' tweets. The second file, SNP Data by Unique User.csv, contains the Social Networking Potential (SNP) scores, a quantitative measure of influence, for unique users identified within our corpus of relevant tweets. Please note that SNP measures could not be calculated for all unique users: calculating SNP requires the total number of retweets for each user, and some users had their accounts banned or deactivated, which prevented us from retrieving that count. This file also contains each user's Twitter follower count, following count, and total tweet count. The third file, All Relevant Tweets.csv, contains all relevant tweets identified within our time period, including the content of each tweet, the tokenized terms for each tweet, the date each tweet was created, the counts of likes, quotes, replies, and retweets, and whether the tweet was labeled as Commercial or Organic. Lastly, Wordcloud Data.csv contains all tokenized words from the relevant tweets along with their overall frequencies across all relevant tweets. The data were cleaned to remove unreadable special characters (e.g., emojis), but non-informative words (e.g., aaaaaaa, aaaie) were retained. This file also indicates whether each keyword was selected as one of the top 25 most frequent terms identified by the sklearn toolkit for each respective wordcloud. Please refer to the manuscript for details on each identified wordcloud and topic. The sklearn toolkit (https://scikit-learn.org/stable/) was used to identify coherent topics present within all relevant tweets and to provide the most frequent tokenized terms in each topic. However, please note that the data only indicate whether a tokenized term was present in a wordcloud, not its frequency within each topic, which is currently a limitation.
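As an illustration of how top terms per topic could be extracted with scikit-learn, the sketch below uses CountVectorizer with latent Dirichlet allocation. The specific topic model, the number of topics, and the 'Tokenized Terms' column name are assumptions, not necessarily the authors' exact pipeline; consult the manuscript for the actual analysis.

```python
# Illustrative sketch: derive the top 25 terms per topic with scikit-learn.
# The model choice (LDA), topic count, and column name are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = pd.read_csv("All Relevant Tweets.csv")
docs = tweets["Tokenized Terms"].fillna("")  # assumed column name

vectorizer = CountVectorizer(max_features=5000)
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=4, random_state=42)  # assumed topic count
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top25 = [terms[i] for i in weights.argsort()[::-1][:25]]
    print(f"Topic {k + 1}: {', '.join(top25)}")
```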
The published data adhere to all guidelines stated in the Twitter Developer Policy (https://developer.twitter.com/en/developer-terms/policy). Under the section on Content Redistribution, special permissions are granted to academic researchers sharing Tweet IDs and User IDs for non-commercial research purposes. The present research is conducted on behalf of an academic institution for the sole purpose of non-commercial research, and the data are shared to enable peer review and validation of our research.
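Because only Tweet IDs and User IDs may be redistributed under that policy, other researchers would typically re-hydrate the tweets through Twitter's API v2. The sketch below uses the tweepy client as one possible approach; the 'Tweet ID' column name and bearer-token placeholder are assumptions, and an approved developer account is required.

```python
# Hedged example of re-hydrating shared Tweet IDs through Twitter API v2 using
# tweepy (one common client; not necessarily what the authors used).
import pandas as pd
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # requires approved access

ids = pd.read_csv("All Relevant Tweets.csv")["Tweet ID"].astype(str).tolist()  # assumed column

hydrated = []
for start in range(0, len(ids), 100):  # API v2 accepts up to 100 IDs per request
    response = client.get_tweets(
        ids=ids[start:start + 100],
        tweet_fields=["created_at", "public_metrics"],
    )
    if response.data:
        hydrated.extend(response.data)

print(f"Re-hydrated {len(hydrated)} of {len(ids)} tweets")
```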
Funding Sources:
Tobacco-Related Disease Research Program (T31IP1678);
National Science Foundation (2107150)
Scope of Project
Subject Terms:
twitter;
tobacco;
blunts;
non-tobacco blunt wraps;
cannabis
Geographic Coverage:
United States of America
Time Period(s):
1/2017 – 11/2021
Data Type(s):
other
Collection Notes:
Data were collected using Texera (https://texera.ics.uci.edu/), an application developed and maintained by Professor Chen Li's research team at the University of California, Irvine's Department of Computer Science. Texera is built to search, store, edit, and analyze tweets through Twitter's Application Programming Interface (API v2), with access approved through the Academic Research Product Track.
Methodology
Data Source:
Twitter
This material is distributed exactly as it arrived from the data depositor. ICPSR has not checked or processed this material. Users should consult the investigator(s) if further information is desired.