This work presents an exploratory study that retrieves, processes, and analyses Twitter data to gain insights about the relevance and perceptions of the wine industry in the Douro Portuguese region (including Porto and Douro wines), as well as other regions in the country. The main techniques and algorithms used in our work belong to the families of natural language processing and machine learning, and the practical relevance of the proposed methodology is demonstrated through the analysis of 1.2 million unique messages from more than 764,000 distinct users retrieved from the Twitter platform. The results of this study provide insights that can be used in the context of Business Informatics to promote better and more efficient marketing campaigns, for example, by centering the topic on the most interested people or communicating with the most appropriate words.

1. Introduction

The universe of wine (i.e., consumption, marketing, and tourism) is a social-cultural phenomenon accompanying the development of western societies throughout history [1], being currently considered a sector typically conditioned by social influences [2]. With the growing advancement of new technologies and the emergence of a more digital universe, the wine sector, due to the characteristics that define it, is increasingly dependent on the interaction and sharing of experiences among companies, professionals and consumers in online communities. Social media has changed the way companies and consumers interact, creating a greater exchange of information and a closer, more active, and loyal relationship, essential for adding value to the product or brand [3–5]. In recent years, there has been an exponential increase in the number of tourists using the Internet (especially social media) to book travel and accommodation, buy products and services, and choose holiday destinations.

Social media has become an opportunity not only for users or consumers, but also for brand managers. Social media allows the production and dissemination of experiences among users, while simultaneously serving as a fundamental tool for brand managers to collect information and assess the visibility of their product or service. Furthermore, several studies have already demonstrated the impact that social media can have on the decision to shop online [6], including the purchase and improvement of wines [7, 8]. Indeed, the use of social media is impacting the global discussion of the wine sector, in relation to not only wine brands but also wine tourism [9]. It has always been known that the wine sector is very dependent on sharing experiences and that there is great trust in people’s testimony regarding the assessment of product quality. Taking this argument into account, it becomes increasingly urgent to understand what is being said in a social network about a particular brand or product of wine. In addition to the fact that personal testimonies can impact a user’s decision-making [10], we also know that users trust the experiences of their peers [11] and take feedback provided by users in their online community very seriously [12]. For example, with regard to the wine sector, several studies (e.g., [13]) recognize the importance of Twitter for beverage bloggers to express their opinion and share tips and advice. In fact, these bloggers may be becoming public opinion leaders as they are considered to be trustworthy and credible [13, 14]. Based on these arguments, Wilson and Quinton [15] considered Twitter a fundamental tool to add soft value to wine-focused businesses. In this sense, what is said about wine on a network such as Twitter, from a personal or business perspective, positive or negative, can influence the public’s opinion about a certain brand.

The main objective of this study is to verify what is being said on Twitter about Portuguese wines and wine regions, including Porto and Douro. Despite the various studies that have been published, some questions remain unanswered by either academia or wine professionals. This study does not aim to make a correlation between the content posted and the purchase of wines. Instead, we are interested in collecting information and understanding the public (consumers and tourists), which we consider to be a prerequisite for the success of a product, brand, or tourist destination.

The wine industry plays a fundamental role in the economy of many countries, such as Portugal. In the year 2018, the volume of wine produced in this country reached approximately 6.1 million hectoliters, while approximately 3 million hectoliters were exported to the rest of the world with an invoicing of more than 800 million euros [16, 17]. In Portugal, there are several wine regions and protected designations of origin (DOP), with the Douro region being the most recognized and prolific [18]. Located in the north of Portugal, this region is one of the most important wine-producing regions in the country and is internationally known for being the origin of the famous Port Wine brand [19]. The Douro Valley is also the first demarcated wine-producing region in the world (since 1756), and the Alto Douro Wine Region was classified as a World Heritage Site by the United Nations Educational, Scientific and Cultural Organization (UNESCO) in 2001 [20]. This distinction sparked growth in tourism, wine tourism, and an appreciation of Douro wine production [21, 22]. Like other world-famous wine regions, the Douro Valley has become a place of interest to visit (due to its scenic beauty), combining tourism and wine production as an economic and social driver at a regional and local level [23–25].

The importance of our study is twofold. First, the existing literature on the Douro and other Portuguese wine regions has focused mainly on business management methodologies with little regard for the perception tourists have of the region and their consumer behavior [22]. Studies that focus on the perception of tourists, including business travellers, essentially resort to the analysis of questionnaires and interviews [19, 26]. In contrast, our research offers a vision focused on social media with applied machine learning techniques, which we believe can be a very important contribution to the existing literature.

Second, our study offers an international perspective on what is being said and commented on about the Portuguese wine sector on Twitter. Not only can this network add value to the wine business [15], but its total number of users also continues to increase. Furthermore, it is important to emphasize that our research is the first study that uses Twitter to explore the impact of the Douro and Porto wine regions.

The number of active users on these platforms keeps growing on a daily basis and is estimated to reach 3.1 billion by 2021 [27]. The ease of use, high accessibility, and broad audiences of social media combined with the natural socialization around wines pave the way for the promotion of new discussion forums and a better perception of consumer preferences towards novel marketing strategies designed around wine and tourism [12, 28]. For example, one study demonstrated the association between luxury wine brands and the social media visibility indicators, suggesting that some of these brands should consider these platforms in their marketing strategies [29]. Notably, a study on the impact of social media practices on wine sales in US wineries showed that 87 percent of the wineries reported a perceived increase in sales due to social media [30]. Regarding the partnership between wine and tourism, a previous study about the relevance of social media for wine tourism in Greece suggested that customer relationship management could be improved using these channels of communication [31]. Other studies pointed out the fundamental role that social media may have in the promotion and valuation of brands with consumers; for example, social media can be important in assessing customer satisfaction [32, 33] by quickly and closely verifying the relationship between brand and consumer, with greater involvement between both [34, 35], by capturing consumer experiences or guiding potential buyers, or by understanding the quality of service through, for example, negative or positive comments related to a certain product. Moreover, the presence of a brand on social media can affect the company’s competitiveness and its value or reputation.

The present research contributes to this line of work by creating a methodology to perform an exploratory analysis on Twitter to gain insight about the relevance and the perceptions of the wine industry in the Douro Portuguese Region (including Porto and Douro wines) as well as the other regions in the country.

Several studies have already pointed out that although Twitter is considered a generalist social network (i.e., its users talk about any type of topic), it is becoming increasingly important in carrying out market and business research [36–38]. It is also currently one of the social networks with the largest number of registered and active users who produce a large amount of content (i.e., tweets) daily [39]. Finally, Twitter provides an application programming interface (API) to easily access the information posted by users. For these reasons, Twitter was the media platform selected to carry out this research, although the presented methodology is not bound to this specific social network and can be applied to others. Extracting data based on posted information (tweets and retweets) allows us to reduce the cost of collecting data by other means. In addition, the act of tweeting is inherently associated with the user’s search to engage with other people and to disseminate and share ideas and opinions, which represents an excellent object of study to understand consumer perception. We consider all types of tweets, whether from professionals or consumers, related to the wine sector.

In short, this work seeks to fill the lack of studies focused on this topic, taking into account the problems identified in the literature. As we mentioned earlier, no study to date in Portugal has made such a comprehensive survey of social media posts about the Douro Valley and Douro and Porto wines. Correia et al. [19] carried out a global survey of tourists’ experiences of the Douro Valley in Portugal, but essentially using interviews and surveys. In addition, the authors performed a content analysis of tourists’ comments on the brands’ pages on Facebook and TripAdvisor. Correia et al. [19] also found that many wine companies did not have social media pages or profiles. In a similar study, Vieira et al. [40] applied a sentiment analysis to 814 posts and found that the Port wine brand receives, in general, a very positive appreciation by online stakeholders. However, the authors found that the young generation (Millennials) has lost interest in Port wine.

Allied to these problems, namely (1) the lack of online presence of brands and (2) the lack of interest in the product among younger people, it should also be noted that sales of Port wine declined in the last decade, between 2006 and 2016, in terms of volume and value, a trend that has not been observed in terms of tourist attraction in the Douro Valley [40]. However, and to reinforce the importance of the online market, in 2020, the COVID-19 pandemic served to elucidate the relevance of online presence in the commercialization of Douro and Port wines. With the confinement imposed by the pandemic situation, there was an exponential increase in the online purchase of wine (information available online: https://www.tsf.pt/lifestyle/confinamento-leva-a-aumento-exponencial-dacompra-de-vinho-online-12123858.html (Accessed on April 8, 2022)). Aware of the importance of the online presence of the Douro brand, the Instituto dos Vinhos do Douro e do Porto, the entity responsible for controlling the quality and quantity of Port wines, increased the budget dedicated to promoting the brand to the international market, reinforcing its commitment to the digital market (information available online: https://www.dinheirovivo.pt/economia/vinhos-do-douro-e-porto-reforcampromocao-para-os-25-me-em-2021-13346251.html (Accessed on April 8, 2022)).

Two aspects of the current situation stand out. First, the importance of social media in promoting the territory and wines of the Douro is recognized, although producers and brands still have a reduced presence on social media. Second, the online market has been discovered as an opportunity to counter the falling sales of previous years. Our study therefore emerges as an unprecedented and very relevant approach, which will allow us to identify the most influential users, the perception of buyers about the brand and territory, and the most discussed topics on social media. It is for these reasons that our study is necessary.

The main contribution of this research lies in unveiling the general knowledge extracted from the people who use Twitter to communicate with each other, focusing on the topic of the Portuguese wine industry. With the analysis of this knowledge, it could be possible to develop the capacity to promote marketing campaigns better and more efficiently (e.g., centering the topic on the most interested people or communicating with the most appropriate words), an objective of the field of business informatics.

The methodological design of our study is structured in three complementary phases. First, we collected and filtered data to build three corpora related to wine and wine regions. Only English-language tweets were considered to obtain an international dimension. Then, the corpora were processed according to a set of criteria that allowed filtering the pertinent information. Subsequently, we proceeded to the content analysis of the data, highlighting the main topics, the most relevant words, the most active users, and the associated text segments.

2. Materials and Methods

The methodology developed to retrieve, process, and study wine-related Twitter data consists of three fundamental steps: (i) data collection and filtering; (ii) corpus processing; and (iii) information analysis (Figure 1).

2.1. Data Collection and Filtering

The Twitter4J Java library was used to retrieve the tweets related to the topic of wine [41]. The query used to retrieve the tweets was “wine.” The idea of using such a general query was to obtain as many tweets as possible about the topic in a small window of time. Twitter4J was set to retrieve only English tweets, that is, tweets that the Twitter API considered to be written in the English language, while tweets from suspended user accounts were excluded. The collection of posts dates from 8 January 2020 to 8 May 2020, obtaining a collection of 1,200,428 unique tweets from 764,108 distinct users. From this general collection, three corpora were constructed using more specific keywords to focus on the conversations about Portugal. In particular, the keywords “Douro,” “Porto” (including variants like “port”) and the “Others” PDO and wine regions (e.g., “vinho verde,” “bairrada,” or “minho”) were searched for within the content of the tweets. A tweet is considered to belong to the “Douro” corpus, for example, if the user cited the keyword “douro” in the tweet. Figure 2 shows the number of unique tweets and users for the three generated corpora, as well as the common tweets and users among them (i.e., a tweet may cite both “Porto” and “Douro” and a user may tweet about two or more regions).
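The keyword-based construction of the three corpora can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword lists contain only the examples quoted above, whole-word and case-insensitive matching is an assumption, and the names `CORPUS_KEYWORDS` and `assign_corpora` are hypothetical.

```python
import re

# Illustrative keyword lists: only the examples quoted in the text are included.
CORPUS_KEYWORDS = {
    "Douro": ["douro"],
    "Porto": ["porto", "port"],
    "Others": ["vinho verde", "bairrada", "minho"],
}


def compile_patterns(keywords_by_corpus):
    """Build one case-insensitive, whole-word pattern per corpus (assumed matching rule)."""
    patterns = {}
    for corpus, keywords in keywords_by_corpus.items():
        alternatives = "|".join(re.escape(keyword) for keyword in keywords)
        patterns[corpus] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns


PATTERNS = compile_patterns(CORPUS_KEYWORDS)


def assign_corpora(text):
    """Return the set of corpora whose keywords are cited in the tweet text."""
    return {corpus for corpus, pattern in PATTERNS.items() if pattern.search(text)}
```

Because the function returns a set, a tweet citing both “Porto” and “Douro” is assigned to both corpora, which is the overlap reported in Figure 2.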

2.2. Corpora Processing

Every tweet in the corpora was processed according to the following steps: (i) transformation to lowercase; (ii) removal of HTML tags; (iii) extraction of URL entities; (iv) recognition and extraction of user mentions (e.g., @ivdp_ip); (v) identification and extraction of hashtags (e.g., #winelover); (vi) tokenization (i.e., splitting the text into words); (vii) filtering of tokens by length (i.e., between 3 and 20 characters); (viii) filtering of English stopwords and generic domain words (e.g., “wine”); (ix) stemming (e.g., “tasting” to “tast”); and (x) generation of n-grams (up to bigrams).

Although the elements extracted from the tweets (i.e., URLs, mentions, and hashtags) were removed from the processed text, they were kept for other analyses.
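The ten processing steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the stopword list is a tiny sample, `toy_stem` is a naive suffix stripper standing in for a real stemmer, and tokens are restricted to ASCII letters.

```python
import re

# Tiny illustrative lists; a real run would use a full English stopword list.
STOPWORDS = {"the", "and", "with", "for", "this", "that", "was", "are"}
DOMAIN_WORDS = {"wine"}

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
TOKEN_RE = re.compile(r"[a-z]+")  # assumption: ASCII letters only


def toy_stem(token):
    """Naive suffix stripping, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Apply steps (i)-(x) and keep the extracted entities for later analyses."""
    text = TAG_RE.sub(" ", tweet.lower())            # (i) lowercase, (ii) HTML tags
    urls = URL_RE.findall(text)                      # (iii) URLs
    text = URL_RE.sub(" ", text)
    mentions = MENTION_RE.findall(text)              # (iv) mentions
    text = MENTION_RE.sub(" ", text)
    hashtags = HASHTAG_RE.findall(text)              # (v) hashtags
    text = HASHTAG_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)                  # (vi) tokenization
    tokens = [t for t in tokens if 3 <= len(t) <= 20]                          # (vii)
    tokens = [t for t in tokens if t not in STOPWORDS | DOMAIN_WORDS]          # (viii)
    stems = [toy_stem(t) for t in tokens]            # (ix) stemming
    bigrams = [" ".join(pair) for pair in zip(stems, stems[1:])]               # (x)
    return {"urls": urls, "mentions": mentions, "hashtags": hashtags,
            "ngrams": stems + bigrams}
```

Returning the URLs, mentions, and hashtags alongside the n-grams mirrors the fact that these elements are removed from the text but retained for the content analysis.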

2.3. Topic Discovery

The latent Dirichlet allocation (LDA) algorithm was applied to discover the topics of conversation most discussed in the three corpora [42]. LDA is a probabilistic topic modeling method widely used to identify possible topics in a general set of documents. The key assumption behind LDA is that each given document is a mix of multiple topics [43]. LDA models have two hyperparameters, α and β, used to tune the document-topic distribution and the topic-word distribution, respectively. Therefore, a low value of α places less weight on documents that may be high in number but address secondary topics. Similarly, a low value of β places less weight on topics composed of a high number of only secondary n-grams (i.e., uni-, bi-, or tri-grams).

To improve the model outcomes, a grid search approach was applied to find the optimal hyperparameters of the LDA models. The objective was to minimize the model perplexity while maximizing the topic coherence score [44]. To increase the topic coherence score, the model only considered nouns and proper nouns for the Porto corpus [45].
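A grid search over the two LDA hyperparameters can be sketched as follows. The sketch is generic and rests on assumptions: `train_and_score` is a hypothetical callback that would fit a model and return its perplexity and coherence, `toy_scorer` is an invented stand-in so that the example runs without a topic-modeling library, and ranking by coherence first with lower perplexity as a tie-breaker is only one possible way to combine the two criteria, since the text does not specify how they were combined.

```python
from itertools import product


def grid_search(alphas, betas, train_and_score):
    """Evaluate every (alpha, beta) pair and return the best-ranked setting.

    train_and_score is a hypothetical callback: it fits a model with the given
    hyperparameters and returns a (perplexity, coherence) tuple.
    """
    best_key, best_setting = None, None
    for alpha, beta in product(alphas, betas):
        perplexity, coherence = train_and_score(alpha, beta)
        # Assumed ranking: higher coherence first, lower perplexity breaks ties.
        key = (coherence, -perplexity)
        if best_key is None or key > best_key:
            best_key = key
            best_setting = {"alpha": alpha, "beta": beta,
                            "perplexity": perplexity, "coherence": coherence}
    return best_setting


def toy_scorer(alpha, beta):
    """Invented formulas that replace real model training for illustration."""
    perplexity = 100.0 * alpha + 10.0 * beta
    coherence = 1.0 - abs(alpha - 0.1) - abs(beta - 0.01)
    return perplexity, coherence


best = grid_search([0.01, 0.1, 1.0], [0.01, 0.1], toy_scorer)
```

In a real setting, the callback would train an LDA model on the processed corpus for each pair of values, which makes the cost of the search grow with the product of the two grid sizes.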

2.4. User Characterization

The following analyses were conducted for every user in the three corpora: (i) identify the most active and cited users inside and outside the topic; (ii) identify the influential users present in the corpora; (iii) detect the countries with the most presence, that is, countries that presented the most user activity; (iv) discover the type of user account; and (v) analyze gender distribution.

The most active users were identified by taking into account the number of tweets posted about the topic (i.e., tweets containing the term “wine” and one of the Portuguese keywords) and outside the topic (i.e., tweets containing neither the term “wine” nor the Portuguese keywords). The most cited users were discovered by counting the number of mentions made by other users.
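Both counts can be obtained with simple frequency tables, as in the following sketch. The tweet structure (a dictionary with `user` and `text` keys) is hypothetical, and two choices are assumptions rather than details given in the text: repeated mentions of the same account within one tweet are counted once, and self-mentions are ignored.

```python
import re
from collections import Counter

MENTION_RE = re.compile(r"@(\w+)")


def most_active(tweets, top_n=3):
    """Rank users by the number of tweets they posted."""
    counts = Counter(tweet["user"] for tweet in tweets)
    return counts.most_common(top_n)


def most_cited(tweets, top_n=3):
    """Rank users by the number of tweets from other users that mention them."""
    counts = Counter()
    for tweet in tweets:
        author = tweet["user"].lower()
        # Assumption: count each mentioned account once per tweet.
        for name in {m.lower() for m in MENTION_RE.findall(tweet["text"])}:
            if name != author:  # assumption: self-mentions are ignored
                counts[name] += 1
    return counts.most_common(top_n)


sample_tweets = [
    {"user": "a", "text": "Great Douro wine @ivdp_ip"},
    {"user": "a", "text": "Port wine tonight @ivdp_ip @b"},
    {"user": "b", "text": "Thanks @a, and hello @b"},
]
```

Applying `most_active` separately to the in-topic and out-of-topic subsets of tweets would give the two activity rankings described above.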

The identification of influential people, that is, users whose content spreads easily through a large part of the community, is a topic already discussed in previous works [46, 47]. These works proposed several valid ways to calculate this metric depending on the definition of influence applied. In this study, to identify the influential users in the three analyzed corpora, the users were ranked by their follower (F/f), retweet (Rt/t), and favorite (Fav/t) ratios. The F/f ratio can be used as an indicator of the potential that the user has to spread their content and to estimate the number of people interested in the user’s status updates:

$$\mathrm{F/f} = \frac{\#\mathrm{Followers}}{\#\mathrm{Followings}} \tag{1}$$

where #Followers refers to the number of people who follow the user and #Followings indicates the number of accounts that the user follows.

The Rt/t and Fav/t ratios can be used as indicators of the potential engagement that the user can reach inside the network, and they can be useful when comparing different users in the same subject:

$$\mathrm{Rt/t} = \frac{\#\mathrm{Retweets}}{\#\mathrm{Tweets}} \tag{2}$$

where #Retweets refers to the total number of retweets and #Tweets indicates the total number of tweets.

$$\mathrm{Fav/t} = \frac{\#\mathrm{Favorites}}{\#\mathrm{Tweets}} \tag{3}$$

where #Favorites refers to the total number of favorite tweets, whereas #Tweets indicates the total number of tweets.

Note that equations (2) and (3) can be adapted to only consider the tweets related to a specific topic, for example, tweets related to the Douro region. With this consideration, it was possible to identify users who might have influence in a small specific community rather than the whole network. So, the adapted equations are the following:

$$\mathrm{Rt/t}_{\mathrm{topic}} = \frac{\#\mathrm{Retweets}_{\mathrm{topic}}}{\#\mathrm{Tweets}_{\mathrm{topic}}} \tag{4}$$

where #Retweets_topic refers to the number of retweets reached in the tweets related to the studied topic and #Tweets_topic indicates the number of tweets related to the topic.

$$\mathrm{Fav/t}_{\mathrm{topic}} = \frac{\#\mathrm{Favorites}_{\mathrm{topic}}}{\#\mathrm{Tweets}_{\mathrm{topic}}} \tag{5}$$

where #Favorites_topic refers to the number of favorites reached in the tweets related to the topic under study, whereas #Tweets_topic indicates the number of tweets related to the topic.
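As a worked illustration of these ratios, the following Python sketch computes F/f, Rt/t, and Fav/t from raw counts and ranks users by any of them. The account names and counts are invented, and returning 0.0 when a denominator is zero is an assumed convention that the text does not discuss.

```python
def safe_ratio(numerator, denominator):
    """Divide two counts, returning 0.0 for a zero denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def influence_ratios(followers, followings, retweets, favorites, tweets):
    """Compute the follower, retweet, and favorite ratios from raw counts."""
    return {
        "F/f": safe_ratio(followers, followings),
        "Rt/t": safe_ratio(retweets, tweets),
        "Fav/t": safe_ratio(favorites, tweets),
    }


def rank_users(users, ratio_name):
    """Return user names sorted by descending value of the chosen ratio."""
    scored = {name: influence_ratios(**counts)[ratio_name]
              for name, counts in users.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Invented accounts and counts, for illustration only.
example_users = {
    "org_account": dict(followers=1000, followings=250,
                        retweets=10, favorites=20, tweets=10),
    "viral_user": dict(followers=300, followings=600,
                       retweets=30, favorites=45, tweets=5),
}
```

Passing counts restricted to the tweets of one corpus (for example, only the retweets, favorites, and tweets about the Douro region) yields the topic-specific versions of the retweet and favorite ratios.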

The detection of active countries and the gender of the users enabled a demographic analysis of the accounts. To perform the detection of the user location, the proposed method accounted for the information related to the time zone, the UTC offset, and the location text introduced by the user on Twitter. This information was used in combination with the GeoNames database API [48]. A strategy based on a gender-name dictionary and a convolutional neural network model to identify faces was used to identify the gender of the user. Finally, the classification of user accounts considered three types of users, namely, individuals, organizations, and unknown. To perform this classification, the proposed strategy used the information of the user profile (i.e., name and description) provided by Twitter. The algorithms and techniques employed to infer this knowledge were originally proposed in the works of Pérez-Pérez et al. [49, 50].

2.5. Content Analysis

Taking into consideration all the tweets stored in the corpora, the following steps were performed: (i) identification and analysis of most common hashtags, resources, and mentions; (ii) study of the tweets with larger engagement (i.e., favorites and retweets); (iii) analysis of posting distribution over time; (iv) discovery of relevant words and semantic associations; and (v) sentiment analysis.

The identification of the most common hashtags, shared resources, and mentions led to the idea of performing a descriptive analysis of these data to better understand what people like to share or like to point out about the theme under consideration. The study of the tweets with high values of engagement was meant to discover which content became popular in the network and why it had reached that popularity. The analysis of the posting time distribution revealed the peak posting times, that is, hours when the users usually post more tweets, as well as changes in the habit of posting due to the impact produced by external events (e.g., COVID-19). The discovery of relevant words and semantic associations aimed at discovering keywords used by the users in their conversations (e.g., brands or places) and understanding how the users perceived the main wine regions of Portugal (i.e., which other words they used when citing Douro, Porto, and other regions).
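The posting time distribution can be illustrated with a small Python sketch that counts tweets per hour of the day. It assumes that timestamps are available as ISO 8601 strings, which is a simplification; the function names and sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime


def posts_per_hour(timestamps):
    """Return a 24-element list with the number of tweets posted in each hour."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def peak_hours(histogram):
    """Return every hour whose tweet count equals the maximum of the histogram."""
    highest = max(histogram)
    return [hour for hour, count in enumerate(histogram) if count == highest]


# Invented timestamps, for illustration only.
sample_timestamps = [
    "2020-03-20T18:05:00",
    "2020-03-20T18:40:00",
    "2020-03-21T09:15:00",
]
hourly = posts_per_hour(sample_timestamps)
```

Computing the same histogram for two date ranges (for example, before and after an external event) would allow the change in posting habits to be compared hour by hour.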

Finally, sentiment analysis was performed to identify the social sentiments of the community about each topic. The sentiment score ranged between −1.0 (most negative) and 1.0 (most positive). To adjust this range to the processed corpora, a tweet was considered negative when its score was less than −0.7 and positive when its score was greater than 0.7. Tweets with any other score were considered neutral.
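The thresholding rule can be written as a short function. In this sketch the score is simply a float supplied by the caller; how it is produced by the sentiment tool is outside the example, and the function name and the range check are assumptions.

```python
def label_sentiment(score, threshold=0.7):
    """Map a sentiment score in [-1.0, 1.0] to a negative, neutral, or positive label."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie between -1.0 and 1.0")
    if score < -threshold:
        return "negative"
    if score > threshold:
        return "positive"
    # Scores between -threshold and threshold, inclusive, are treated as neutral.
    return "neutral"
```

With strict inequalities, a score of exactly 0.7 or −0.7 is labeled neutral, so only clearly polarized tweets contribute to the positive and negative classes.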

All the previous analyses were performed using the tools provided by the Stanford CoreNLP pipeline [51] in combination with the Valence Aware Dictionary and sEntiment Reasoner (VADER) [52].

3. Results

3.1. Topic Discovery

Tables 1–3 describe the topics discovered by the LDA model for the “Douro,” “Porto,” and “Others” corpora in terms of their five most representative terms. The tables also illustrate the number of optimal topics, the topic label, and a representative tweet sorted by the number of favorites or retweets. The presented tweets retained their original content, that is, hashtags, symbols, URLs, etc. Additionally, all identified terms were stemmed and every set of topics was manually labeled based on its most weighted terms. For example, in the case of the “Douro” corpus, the topic “Tourism” was so labeled because the terms that most contributed to the topic were “valley,” “travel,” “region,” “tour,” and “visit.”

In general, the distribution of the tweets per topic was not balanced; that is, there was always a more dominant topic per corpus, although the particularities of the LDA algorithm should prevent the outcomes from being affected [53]. Regarding the results obtained for the three corpora, a total of 5 different topics were produced: “Quinta do Crasto,” “Types of wine,” “Consumption,” “Tourism,” and “Gastronomy.” The content of the tweets related to the topic “Quinta do Crasto” was focused on advertising this famous Quinta situated in Sabrosa (Douro), while the tweets related to the topic “Types of wine” were focused on presenting different styles of Portuguese wines to the community, with special emphasis on red and white wines. On the other hand, the topic “Consumption” had tweets related to the habit of consuming an alcoholic product, usually accompanied by some type of food. The topic “Tourism” was focused on extolling the virtues of the different Portuguese regions, and in many cases, tweets were related to people presenting their experiences in these regions. Finally, the tweets related to the topic “Gastronomy” were focused on showing pairings between different wines and foods, and the presentation of food recipes using several Portuguese wines.

According to the “Gastronomy” topic results, people usually preferred to use Porto wines to prepare food recipes. The reasons may be that Port wines are usually fortified and sweet, so they can add sweetness and balance to the dish, and that these wines also present a wide variety of flavors. On the other hand, the “Types of wine” topic showed that people usually talked about fruity red wines of the Touriga grape variety in the “Douro” corpus, while in the “Others” corpus, people usually talked about dry or sweet wines, which are more common in regions like Madeira. Finally, there were two topics in common among the three corpora, that is, “Consumption” and “Tourism.” The “Consumption” topic in the “Douro” corpus was centered on consuming and producing wine products in the quintas, while in the “Porto” and “Others” corpora, the content was more general and the places more diverse (e.g., events or parties). Lastly, the “Tourism” topic in the “Douro” corpus was focused on visiting and making tours around the Douro Valley and the Douro River, while in the “Porto” corpus it was more centered on the city of Oporto and its surroundings, and in the “Others” corpus it was especially focused on the island of Madeira and the region of Alentejo.

3.2. User Characterization

This section describes the analyses performed on the user accounts present in the corpora and is structured as follows: (i) identification of both the most active and cited users inside and outside the topic; (ii) identification of the influential users present in the corpora; and (iii) demographic analysis based on the countries with most presence, the sex of the users, and their type of account.

3.3. Active Users in the Network

Table 4 shows the most active users obtained from the general corpus of tweets related to wines. The column “#Tweets in topic” depicts the users sorted by the number of tweets posted in the topic of wines in descending order, whereas the column “#Tweets in Twitter” shows the users sorted by the number of tweets posted in the network.

Regarding the most active users in the topic, the user accounts of “RealWineGuru” and “alawine” were very similar in terms of number and type of publications. Both users looked like wine professionals and their content was centered on the publication of news related to the world of wines (e.g., reviews of different wines, news about the wine industry in several countries, or wine pairings). These accounts would be of interest to someone looking for the latest news related to the topic. In contrast, the user account of “BeerDescriber” was a Twitter bot whose function was to post tweets describing a diverse variety of alcoholic beverages, including wines. Although the number of followers and followings of this account was relatively low compared to the previously discussed users (1 follower and 12 followings compared to 15 k–35 k and 53 k–107 k, respectively), its activity in the network was very high due to the automatic posting of messages from time to time.

Similar to the previous data, Table 5 shows the most active users sorted by the number of tweets posted in the network (i.e., column “#Tweets in Twitter”), ensuring that they have published at least one tweet related to the topic of wines. The reason for exploring these users was to be able to identify some recognized users in the social network (e.g., famous people) who had also posted content related to the topic at some point. However, the users found were mainly associated with big commercial companies, organizations related to news and media (i.e., “OccurWorld” and “PulpNews”) or social media accounts from different websites (i.e., “urbandictionary”). The content of the tweets posted by these accounts was mainly related to the dissemination of news related to wine around the world, spamming offers or events, and users asking different types of questions about wine such as the meaning of the term or the available deliveries for online purchases.

From a more specific point of view, Figure 3 shows the most active users for each corpus (i.e., “Douro,” “Porto,” and “Others”) instead of the general corpus of tweets related to wines. The X-axis “#Tweets” depicts the number of tweets posted by each user in the corresponding corpus.

A review of the results indicates that the most active users present in the three corpora were mainly accounts related to organizations (i.e., hotels/quintas and wine sellers) or individuals associated with some type of business (e.g., “Lmalopes” offers gastronomic experiences). The main content of the tweets posted by these users was focused on the direct (i.e., offering a product with a description and a direct link to buy it) or indirect (i.e., talking about the virtues of a product and including related media) advertisement of some kind of product. This falls in line with previous studies pointing out the advantages of social networks for this type of marketing [7, 37], even more so for a topic as social and as closely linked to tourism as wine.

3.4. Influential People

Table 6 shows the most influential users in each corpus based on the calculation of their ratios (i.e., F/f, Fav/t, and Rt/t). According to the results, there were 6 users among the three corpora that appeared in both the Fav/t and Rt/t rankings, that is, “Fernanda_FRocha,” “winewankers,” “DemiCassiani,” “InsiderFood,” “PortugueseIn,” and “incorrectstrkid.” Although these accounts did not present high activity in their specific topic (the maximum number of tweets posted by these users was 5) and their F/f ratio was not as high as that of other users, they managed to get high values of engagement compared to the average value of the other accounts. For this reason, these users seem interesting enough to be monitored over time to see whether the behavior is recurring or, on the contrary, the result of an isolated tweet that went viral. Some of these accounts, such as “DemiCassiani,” were users who were highly interested in wines, as stated in their descriptions. Regarding the content of their messages, they were mainly centered on commenting on the beauty of the Portuguese regions (e.g., the vineyards or landscapes), as well as making humorous remarks about wines.

On the other hand, the accounts that were only present in one of the columns related to the engagement ratios (e.g., “dmarie116” or “AnneScottlin”) presented similar characteristics to the users mentioned previously, that is, low activity in their corresponding topic and an F/f ratio below the average. However, the content of their tweets was more diverse, being especially focused on banal conversations rather than on showing something to the readers. This may explain why their tweets did not have as much impact on the community.

Finally, the user accounts with the highest F/f ratios were mainly associated with organizations and companies. These accounts posted slightly more tweets than the users discussed above. Additionally, the content of their tweets was mainly focused on offering various events or products, mostly centered on tourism. This was the expected behavior for organizational accounts, given that people usually follow them on social networks to keep in touch with possible offers or events that may be advertised.
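The ratios used in Table 6 can be illustrated with a minimal sketch. It assumes that F/f denotes followers divided by followed accounts, and that Fav/t and Rt/t denote favorites and retweets received per tweet posted on the topic. The `Account` record, its field names, and the sample figures are hypothetical and serve only to show the arithmetic and the "present in both columns" check.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical record; field names are illustrative, not taken from the study.
    name: str
    followers: int
    following: int
    topic_tweets: int
    favorites: int
    retweets: int


def engagement_ratios(a: Account) -> dict:
    """Return the three ratios, guarding against division by zero."""
    def safe_div(x, y):
        return x / y if y else 0.0

    return {
        "F/f": safe_div(a.followers, a.following),
        "Fav/t": safe_div(a.favorites, a.topic_tweets),
        "Rt/t": safe_div(a.retweets, a.topic_tweets),
    }


def in_both_top(accounts, k):
    """Names that rank in the top k by Fav/t and also in the top k by Rt/t."""
    ratios = {a.name: engagement_ratios(a) for a in accounts}

    def top(key):
        ranked = sorted(ratios, key=lambda n: ratios[n][key], reverse=True)
        return set(ranked[:k])

    return top("Fav/t") & top("Rt/t")


# Made-up sample data, for illustration only.
accounts = [
    Account("user_a", 500, 200, 2, 640, 180),
    Account("user_b", 90000, 30, 16, 800, 320),
    Account("user_c", 120, 400, 5, 10, 1),
]
print(engagement_ratios(accounts[0]))
print(in_both_top(accounts, k=2))
```

Ranking on per-tweet ratios rather than raw totals is what lets a low-activity account with one viral tweet surface next to large organizational accounts, which is the pattern described above.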

3.5. Demographic Analysis

Table 7 depicts the distribution of users by sex and type of account for each corpus. The “number” column indicates the number of tweets posted by each sex or type of user. According to these results, there were no significant differences between males and females when talking about wine. It was a gender-parity topic (with a ratio of 1.18 : 1), even considering the average number of tweets per user (∼1.39). The usual topics of conversation were also similar and were mainly related to user experiences, product advertising or recommendations, and gastronomy.

Most users were recognized as individuals; that is, there was an average of 1,186 individuals and 196 organizations per corpus, with an average ratio of 6 : 1. Nevertheless, the organizations were more active in the network, with an average of 1.8 tweets per account compared with the 1.34 tweets posted by individuals. Most of these tweets were related to the advertisement of different tours, products, and events offered by the different organizations. This behavior denoted the importance of social networks like Twitter as a primary channel of communication and marketing [15, 30].
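The two figures compared in this paragraph (the number of accounts per type and the tweets posted per account) can be derived as in the following sketch. The input format, a list of `(account_type, n_tweets)` pairs, and the sample values are assumptions made for illustration; only the final ratio check uses numbers reported in the text.

```python
from collections import defaultdict


def activity_by_type(users):
    """Map each account type to (number of accounts, average tweets per account)."""
    totals = defaultdict(lambda: [0, 0])  # type -> [accounts, tweets]
    for account_type, n_tweets in users:
        totals[account_type][0] += 1
        totals[account_type][1] += n_tweets
    return {t: (n, tweets / n) for t, (n, tweets) in totals.items()}


# Made-up sample, for illustration only.
users = [
    ("individual", 1), ("individual", 2), ("individual", 1),
    ("organization", 3), ("organization", 1),
]
summary = activity_by_type(users)
print(summary)

# Ratio of individuals to organizations using the averages reported in the text.
print(round(1186 / 196, 2))  # 6.05, i.e., roughly 6 : 1
```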

Figure 4 depicts a world map showing the distribution of the user accounts with a recognized location for the three corpora. The color represents the number of users located in each country (i.e., the redder the color, the higher the number of users). As can be seen, the distribution of users by country was far from uniform, with the top countries being the United States (1,717 users), Portugal (278 users), the United Kingdom (258 users), and Canada (99 users). These results could reflect a bias due to the limitation of obtaining tweets only in the English language.

3.6. Content Analysis

The following section is used to describe the analyses performed on the content of posted tweets and is structured as follows: (i) identification and analysis of most common hashtags, resources, and mentions; (ii) study of the tweets with larger engagement (i.e., favorites and retweets); (iii) analysis of posting distribution over time; (iv) discovery of relevant words and semantic associations; and (v) sentiment analysis.

3.7. Hashtags, Mentions, and Resources

Table 8 shows the top 5 common hashtags, mentions, and shared resources (i.e., URLs by domain) for each corpus. Several generic hashtags related to the domain under study (e.g., wine) or the corpus (e.g., douro) were not considered because they were present in most of the messages without providing any extra knowledge. The “number” column indicates the occurrences of each value in a specific corpus.
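A minimal sketch of how such a top-5 table could be assembled is given below. It assumes plain tweet texts with already expanded URLs, uses simple regular expressions rather than the platform's own entity metadata, and treats the `GENERIC` exclusion set as an illustrative stand-in for the generic hashtags mentioned above. The function name and the sample tweets are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")

# Illustrative exclusion list for generic hashtags (an assumption).
GENERIC = {"wine", "douro"}


def top_entities(tweets, k=5):
    """Return the k most common hashtags, mentions, and URL domains."""
    hashtags, mentions, domains = Counter(), Counter(), Counter()
    for text in tweets:
        for tag in HASHTAG.findall(text):
            if tag.lower() not in GENERIC:
                hashtags[tag.lower()] += 1
        mentions.update(MENTION.findall(text))
        for url in URL.findall(text):
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            if host:
                domains[host] += 1
    return hashtags.most_common(k), mentions.most_common(k), domains.most_common(k)


# Made-up tweets, for illustration only.
tweets = [
    "Sunset over the valley #douro #travel @QuintadoCrasto https://www.instagram.com/p/abc",
    "Tawny tasting tonight #wine #travel https://instagram.com/p/def",
    "New arrivals #winelover https://example.com/shop",
]
top_hashtags, top_mentions, top_domains = top_entities(tweets)
print(top_hashtags, top_mentions, top_domains)
```

Grouping URLs by domain instead of by full address is what allows a single resource such as Instagram to accumulate counts across many different posts.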

According to the results, the hashtags used were broadly similar across the corpora. Considering the content of the tweets, the hashtags were mainly used to focus the conversations on the country of Portugal or one of its wine regions, for example, Alentejo or Madeira. The hashtag “#travel” appeared with notably high frequency; that is, it ranked among the top hashtags in all three corpora, pointing out the tourist attraction of the country for foreigners [54]. Similarly, the hashtag “#wineLover” was at the top for both the “Porto” and “Others” corpora. According to the content of the tweets in which this hashtag was present, it was used to denote the proven quality of the wines inside the country as well as showing a strong tourist attraction [23]. There were also hashtags with commercial connotations like “#Spain” in the “Porto” corpus and “#quintadocrasto” in the “Douro” corpus. The former may have been used to capture the attention of people from other countries, and it was mainly used by the user account “VinsPortugal,” a French wine boutique offering Portuguese wines through Twitter. The latter was mostly used by the user account “QuintadoCrasto” to advertise its products or events, or by people commenting on their experiences in this quinta.

When analyzing the content of the tweets to search for the common user mentions, “@QuintadoCrasto” was found to be the most mentioned user account in the “Douro” corpus. The rest of the mentions were related to accounts and tweets talking about trips to the region of Douro, especially to see the valley, the river, or the vineyards. Conversely, the mentions found in the “Porto” corpus were related to more diverse topics, although most of them were centered on alcoholic products and food. For example, the mentions of the user accounts “@papaspilar” and “@ryeandrivet” talked about a rum that was aged in Port wine casks, whereas the mentions of the accounts “@HoarseWisperer” and “@laurenthehough” talked about consuming a Port wine cheese. Lastly, the mentions encountered in the “Others” corpus were mainly centered on the gastronomy of the country (e.g., “@TurismodeLisboa”), wines (e.g., “@wines_Portugal”), and different events (e.g., “@ivdp_ip”).

Regarding the most shared resources, the social network Instagram was the undisputed leader compared to the other resources, with a total of 572 appearances (i.e., 8% of the total) among all corpora. This was consistent with the conclusions of multiple studies pointing out the social behavior of wine [1]. Other commonly shared resources were webpages whose purpose was to sell Portuguese wines (especially from the regions of Douro and Porto), such as “vinsportugal,” which appeared a total of 133 times (i.e., 3% of the total) in the Porto corpus, and “Winehouseportugal,” which had a total of 49 occurrences (i.e., 4% of the total) in the Douro corpus. With these results, it is possible to say that there may be an interest in the acquisition of Portuguese products by foreigners. Lastly, people also shared different blogs and news related to wine, food recipes, or tourism (e.g., “cookinglisbon,” “cookandcorks,” or “visitevora”).

3.8. Content Engagement

Table 9 shows the most retweeted and favorited tweets in each corpus. The most retweeted and favorited tweet in the “Douro” corpus, which was also the most retweeted in the “Porto” corpus, was originally posted by the account “Fernanda_FRocha.” This tweet also contained a picture of the river along with the text. The F/f ratio of the user had a value of 2.5, whereas her general Fav/t ratio had a value of 0.72, meaning that it was uncommon for a tweet posted by this user to reach this level of engagement. Although the most probable reason for such strong engagement is the content of the tweet rather than the user, that is, the community likes content that promotes the beauty of the country, it would be useful to keep monitoring this account to determine the exact reason for this high-level response.

The most favorited tweet in the “Porto” corpus was originally posted by the user “InsiderFood,” an account dedicated to posting news about worldwide foods. In this case, the tweet contained a short documentary showing the process of producing traditional Port wine. This user account posted several tweets on the topic under study (i.e., 16 original tweets) and had an F/f ratio of 3,010, a Fav/t ratio in the topic of 319, and a Rt/t ratio in the topic of 100, meaning that the account could be very valuable due to its high power of diffusion and reception of information inside the social network. This was consistent with the high number of favorites, reaching a value of 2,794, much higher than the other tweets. It is worth noting that both this tweet and the one discussed previously included media (i.e., a picture or a video) related to the topic of their messages, which can help achieve a high level of engagement [55].

Finally, the most retweeted and favorited tweet in the “Others” corpus was originally posted by the user “NatGeo,” the official account of National Geographic. The content of the tweet was also accompanied by an external link to an article posted on the National Geographic website. A total of 5 different tweets posted by this user and related to the topic under study were found. Its F/f ratio had a value of 297,703, whereas the Fav/t ratio and the Rt/t ratio in the topic had values of 917 and 238, respectively. As expected, given the prominence of this account, most of its tweets usually reach high levels of engagement.

3.9. Posting Time Distribution

The time chart in Figure 5 illustrates how the number of tweets evolved over time by month, hour, and day of the week. In particular, Figure 5(a) depicts tweets related to the “Douro” corpus, Figure 5(b) shows the number of tweets related to the “Porto” corpus, and Figure 5(c) illustrates tweets related to the “Others” corpus. The plots related to the distribution by month have a dark green dashed line showing the estimated number of tweets for all of May (i.e., the current number of tweets from 1 May to 8 May is multiplied by the number of weeks until the end of the month).
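The extrapolation behind the dashed line can be sketched as a simple linear scaling of the partial count. The text describes multiplying the 1 May to 8 May count by a number of weeks; the version below scales by days instead, which is an assumption made here for precision and may differ slightly from the factor used for the figure. The function name and the sample count are hypothetical.

```python
def estimate_month_total(observed_count, observed_days, days_in_month):
    """Linearly extrapolate a partial-month tweet count to the full month."""
    if observed_days <= 0 or days_in_month < observed_days:
        raise ValueError("observed_days must be in 1..days_in_month")
    return observed_count * days_in_month / observed_days


# Made-up count: 100 tweets observed during the first 8 of May's 31 days.
print(estimate_month_total(100, observed_days=8, days_in_month=31))  # 387.5
```

Such an estimate assumes that activity stays constant for the rest of the month, so it is best read as a rough guide rather than a forecast.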

The evolution of tweets by month had a similar distribution for the “Douro” and “Others” corpora, with a peak in January–February (i.e., 235 and 272 tweets, respectively) and a continuous decrease until May. This behavior may be somewhat expected due to the start of the COVID-19 pandemic in mid-March, after which activities related to tourism and events (e.g., travel, celebration of events, or staying at quintas) plummeted, while tweets related to the imposed quarantine greatly increased [56, 57]. Interestingly, this behavior did not seem to occur in the “Porto” corpus, which had its peak of activity in April (i.e., 2,195 tweets). A possible explanation is that the tweets in the “Porto” corpus were more varied in nature, with many tweets from individuals talking about their experiences with Porto wines, in addition to the organizational tweets seen in the other two corpora. It is worth noting that during the quarantine imposed because of COVID-19, there was no noticeable increase in tweeting activity related to online purchases of wine, which may suggest, among other possibilities, that Portuguese wine producers did not take full advantage of this kind of market as other countries did [58, 59].

The distribution of tweets by hour followed a similar pattern for the three corpora. Activity decreased during the early hours (i.e., from 1 h to 8 h) and increased during the mornings and especially the afternoons (e.g., 15 h and 19 h were the peak activity hours). This behavior was expected, as it matched the most common working schedules of European organizations. Similarly, the distribution by day of the week peaked on weekdays, with a decrease in the number of tweets during the weekends for the “Douro” and “Others” corpora, while the “Porto” corpus experienced an increase on Saturdays. Again, this may be explained by the higher activity of individual accounts in this corpus compared to the other two: people tend to have more spare time on weekends, which can translate into more time spent on social networks.
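The three views of Figure 5 (by month, by hour, and by day of the week) amount to counting tweets after bucketing their timestamps. The sketch below assumes ISO 8601 timestamps that have already been converted to a single time zone; the function name and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def time_distributions(timestamps):
    """Count tweets per month, per hour of day, and per day of the week."""
    by_month, by_hour, by_weekday = Counter(), Counter(), Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        by_month[dt.strftime("%Y-%m")] += 1
        by_hour[dt.hour] += 1
        by_weekday[WEEKDAYS[dt.weekday()]] += 1  # weekday(): Monday is 0
    return by_month, by_hour, by_weekday


# Made-up timestamps, for illustration only.
timestamps = [
    "2020-04-04T15:30:00",
    "2020-04-06T19:05:00",
    "2020-02-12T15:10:00",
]
by_month, by_hour, by_weekday = time_distributions(timestamps)
print(by_month.most_common(), by_hour.most_common(), by_weekday.most_common())
```

Because the hour of day depends on the time zone applied, any comparison with working schedules is only as reliable as that conversion.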

3.10. Relevant Words and Semantic Associations

Table 10 depicts the most frequent unigrams and bigrams (i.e., pairs of words that appeared next to each other in a tweet) in the tweets of each studied corpus. The column “Relations” is similar to the previous ones although, in this case, it shows the semantic relations for a term that was manually selected for each corpus (i.e., it shows the words that appeared next to the selected term). Therefore, for the “Douro” corpus, this column illustrates the relationship with the term “Douro”; for the “Porto” corpus, it depicts the relationship for the term “Port”; and for the “Others” corpus, it shows the relationship for its most common regions (i.e., the terms Madeira, Alentejo, and Algarve).
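The counting behind such a table can be sketched as follows. This is a simplified illustration under stated assumptions: tokenization is a lowercase word split, no stop-word removal or other preprocessing is applied, and the "relations" of a term are taken to be the words immediately before or after it. The function names and sample tweets are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return TOKEN.findall(text.lower())


def ngram_counts(tweets, n):
    """Count n-grams (as tuples of adjacent tokens) over all tweets."""
    counts = Counter()
    for text in tweets:
        tokens = tokenize(text)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def relations(tweets, term):
    """Count the words appearing immediately before or after `term`."""
    term = term.lower()
    counts = Counter()
    for text in tweets:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token != term:
                continue
            if i > 0:
                counts[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                counts[tokens[i + 1]] += 1
    return counts


# Made-up tweets, for illustration only.
tweets = [
    "Cruising the Douro river at sunset",
    "The Douro valley is stunning",
    "Douro valley vineyards",
]
print(ngram_counts(tweets, 2).most_common(3))
print(relations(tweets, "Douro").most_common())
```

In practice, function words such as "the" would normally be filtered out before ranking, otherwise they tend to dominate both the n-gram and the relation counts.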

Analyzing the table and matching the results to those presented in the Topic Discovering section shows that the most dominant n-grams in the “Douro” corpus were centered on the topic of “quintas” (i.e., rural properties, a term usually applied to agricultural estates such as wineries) and tourism. Moreover, considering the semantic relationships, the tourist attraction of the Douro region became even more evident. People usually emphasized the Douro Valley and the Douro River as well as the activities derived from them, such as catamaran cruises on the river. The importance of this Portuguese area is so widely acknowledged that UNESCO recognized it as a World Heritage Site in 2001 [20].

Regarding the results of the “Porto” corpus, the dominant n-grams were mainly centered on the consumption of several types of Port wines (e.g., Taylor or Tawny). Similarly, the relationships with the word “Port” also emphasized these kinds of conversations, including even more varieties (e.g., vintage port or double port). Of all the types mentioned, the one that stood out above the rest was “Taylor port,” with 969 appearances among all tweets, indicating that it is an internationally known type of wine.

Finally, regarding the results of the “Others” corpus, the most dominant n-grams were focused on the different Portuguese regions (e.g., the island of Madeira or Alentejo) and their exclusive wines (e.g., green wine). Regarding the semantic relations with these regions, people tended to comment on the wines of Alentejo and the Algarve as well as the beaches of the Algarve, a region that attracts one of the highest numbers of summer tourists in the country.

3.11. Sentiment Analysis

Figure 6 depicts the social sentiment in the corpora. As other studies have pointed out, the predominant sentiment when talking about wines was neutral-positive (with an average score of ∼0.3 across all corpora). This was expected due to the social and enjoyable nature of the product [2, 60].

The negative tweets were primarily related to the sharing of bad news (e.g., “…Harsh Heat and Drought Breaks an Over 150 Year Tradition for Port Producers…”), to discussions among users (e.g., “…DO NOT drink the port in a jug…that stuff is seriously wrong…”), or to recent emergencies like COVID-19 (e.g., “…The Coronavirus emergency is having a devastating economic impact on the Mid Coast’s wine industry…”). However, in the “Porto” corpus, there was a significant number of tweets with specific complaints about Taylor Port wine, which did not appear to be appreciated by foreigners (e.g., “That TAYLOR PORT wine is so damn NASTY ‼,” “Taylor port is not wine it’s colored death” or “Taylor port wine is disgusting”).
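How an average score and a split into negative, neutral, and positive tweets relate to each other can be illustrated with a short sketch. It assumes that each tweet has already been assigned a polarity score in the range [-1, 1] by some sentiment analyzer; the threshold of 0.05 used to separate the classes and the sample scores are illustrative assumptions, not values taken from the study.

```python
from collections import Counter
from statistics import mean


def label(score, threshold=0.05):
    """Bucket a polarity score in [-1, 1]; the threshold is an illustrative choice."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def summarize(scores):
    """Return the mean score and the number of tweets per sentiment label."""
    return mean(scores), Counter(label(s) for s in scores)


# Made-up scores for six tweets, for illustration only.
scores = [0.6, 0.0, 0.4, -0.5, 0.5, 0.8]
average, counts = summarize(scores)
print(round(average, 2), dict(counts))
```

A mean in the neutral-positive range can therefore coexist with a minority of strongly negative tweets, which is why the negative examples are worth inspecting individually.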

4. Discussion

This work presents a preliminary study that uses knowledge extracted from social networks to explore the Portuguese wine industry and its tourism potential. Our study focused on the analysis of data extracted from Twitter. This digital media platform has been little studied in Portugal, particularly with regard to digital marketing and the interaction between brands and tourists interested in the Douro Valley region. In the literature, we found several studies that evaluated and monitored the publication of content related to the Douro Valley on social media [9, 19, 40, 61]. However, none of the studies included a comprehensive reading of what is being said on social media about the Douro Valley and Douro and Port wines. Existing studies use specific case studies, analyzing the presence of a specific brand, producer, or entity on media platforms. Vieira et al. [40] analyze the impact of a hotel group’s campaigns on Instagram, examining the engagement generated by the publications and the importance of content sharing by digital influencers. Almeida et al. [61] focus their study on the importance of using hashtags to promote brands in the hotel sector. From a different perspective, Jorge et al. [9] explore the presence of several international destinations, including the Douro Valley, through the pages or accounts of tourist entities on Facebook, Instagram, YouTube, and Twitter, measuring the engagement achieved per publication. Additionally, Correia et al. [19] use a quinta in the Douro Valley as a case study, applying content analysis on Facebook and TripAdvisor, as well as interviews and surveys. That study makes a qualitative analysis of the main motivations for visiting this farm, as well as surveying various aspects associated with location, landscape, and wine consumption.

Given this bibliographic scenario, we believe that our study may fill a gap in the literature on the importance of social media, namely, Twitter, in relation to the interaction between consumers or tourists and companies or brands in the Douro Valley region.

The analyzed data are based on a four-month dataset comprising 1,200,428 unique tweets from 764,108 distinct users. The methodology developed to retrieve, process, and analyze these data was mainly focused on the use of algorithms and techniques from the machine learning and natural language processing families. This allowed for an analysis of the content posted about the topic by the different users, as well as a study of user accounts within the wine community, making it possible to respond to several questions such as what the main topics of conversation are among people, how users express themselves within the topic, or which accounts participating in these communities are the most interesting.

The identified topics of conversation clearly revealed the close relationship that exists between the universe of Portuguese wines (e.g., consumption or types) and tourism and how they feed back into each other. Gastronomy was also a topic that appeared in many conversations within the user community, especially the sharing of recipes in which one of the ingredients was a Portuguese wine. The presence of “Quinta do Crasto” as a topic of conversation was also notable. This indicates how well this organization is using social networks to reach many users with its marketing strategies.

This study also brought to light a set of user accounts that could be useful for future analysis. Although these users had low activity in terms of tweets related to the topic, some of their tweets managed to go viral. Additionally, while there were fewer organizations than individuals, they were found to be more active and generally achieved greater impact with their messages. This finding demonstrates the importance of planning a communication strategy together with influencers or users with a large following on social media. These can be decisive not only in the process of promoting the tourist destination, but also during consumer decision-making, conveying credible and realistic experiences and messages.

The topic of wines was also found to be a gender-parity topic with no significant differences between the number of males and females. This implies that, for example, when launching marketing campaigns both genders should be taken into account to reach a larger audience. With a geographical distribution concentrated in English-speaking countries, and the United States in particular, a market analysis in these countries would be useful for future advertising campaigns.

Regarding the content of the messages, the conversations related to tourism were mainly focused on different areas of the Douro (e.g., the Valley or the River) as well as specific tours and events in the area. This indicates a strong tourist attraction of the region throughout the world and effective advertisement outside the country. When talking about the different types of wines, the Taylor port stood out from the rest. However, taking into account the sentiment analysis of the tweets that cited this type of wine, it is possible to observe that people did not speak particularly well of it. An attempt should be made to solve this issue by analyzing the root of the problem and, for example, launching an advertising campaign to address the negative points mentioned by the community.

Finally, the distribution of the tweets over time showed a general decrease in activity from March to May. This period coincides with the strong spread of the COVID-19 disease throughout several European countries, which led to a reduction in tourism and events. However, there was no noticeable increase in tweets related to online shopping. This could be an interesting point for the wine brands to address, especially because of a possible long-term resurgence of the disease.

5. Conclusions

In general, our study allowed us to understand the dimension of the universe associated with Port wine and wine tourism in the Douro Valley on Twitter. The main topics of discussion were raised within the three corpora analyzed: “Douro,” “Porto,” and “Others.” An analysis of these topics made it possible to verify the types of wine and the most important characteristics of a region, while at the same time allowing us to follow consumer feedback in relation to their tourist experience or about a certain product or service.

We also concluded that large commercial companies related to the wine and hotel sector are the most active users on this social media platform. This demonstrates the importance that industry organizations place on Twitter to promote the territory and their brands. In addition, we found that there are independent users who can be influential and thus play a very important role in the development of digital marketing strategies.

Our analysis also found very positive evidence regarding consumers’ perception of the Douro Valley and its wines, as the sentiment analysis showed that user comments and experiences with neutral-positive assessments were more frequent than publications with a more negative connotation.

6. Practical Implications

The different analyses performed on the tweets and user accounts allowed the creation of a first straightforward methodology that permits a continuous evaluation of the current impact of the topic under study in social networks. The proposed approach and the application of different analyses (some of them briefly discussed in this document) may pave the way to the creation of an information system centered on the real-time monitoring, analysis, and redirection of information in social media.

Therefore, this hypothetical system would be able to perform more complex analyses in addition to long-term, continuous monitoring of a specific topic. With this kind of information, it would be possible to build a more complex methodology to infer new, interesting, and hidden knowledge from social networks. This would lead to the publication of novel studies including, for example, (i) providing useful first-hand information to offer interested stakeholders the possibility of rebranding their products; (ii) gaining a better understanding of the worldwide impact of relevant concepts; (iii) obtaining knowledge about the strengths and weaknesses of a given product; (iv) generating marketing activity in the network with the ability to measure its acceptance among users; or (v) understanding how people feel and react to a specific campaign.

Finally, this system could be used by any interested stakeholder to gain insight into any other business-related topic and, most notably, it can be used to improve the capacity of stakeholders to promote more focused marketing campaigns and to communicate with the target consumers in a more efficient way, two outcomes derived from the field of business informatics.

Data Availability

All data are available in the article.

Additional Points

The main limitations of this work are the following: (i) the data filtered using specific Portuguese keywords (e.g., Douro) may not fully cover the totality of the tweets related to that topic; (ii) the language of the tweets is restricted to English, so it is assumed that information is being lost, especially when analyzing information related to tourism; and (iii) other social networks should be taken into account, since Twitter may not accurately or entirely reflect the attitudes of the users about the topic.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this article.


Acknowledgments

This work was supported by the Associate Laboratory for Green Chemistry (LAQV), financed by the Portuguese Foundation for Science and Technology (FCT/MCTES), Ref. UID/QUI/50006/2020; the Portuguese Foundation for Science and Technology (FCT/MCTES) under the scope of the strategic funding of the UIDB/04469/2020 unit and the BioTecNorte operation funded by the European Regional Development Fund (ERDF) under the scope of Norte2020 - Programa Operacional Regional do Norte, Ref. NORTE-01-0145-FEDER-000004; the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) under the scope of the strategic funding of the ED431C2018/55-GRC Competitive Reference Group; the “Centro singular de investigación de Galicia” (accreditation 2019-2022) funded by the European Regional Development Fund (ERDF), Ref. ED431G2019/06; and the Portuguese Foundation for Science and Technology for a PhD Grant (SFRH/BD/145497/2019).