Trend-Based Categories Recommendations and Age-Gender Prediction for Pinterest and Twitter Users

Garcia-Guzman, Roberto; Andrade-Ambriz, Yair A.; Ibarra-Manzano, Mario-Alberto; Ledesma, Sergio; Gomez, Juan Carlos; Almanza-Ojeda, Dora-Luz

doi:10.3390/app10175957


by Roberto Garcia-Guzman ¹, Yair A. Andrade-Ambriz ¹, Mario-Alberto Ibarra-Manzano ¹, Sergio Ledesma ¹,², Juan Carlos Gomez ¹ and Dora-Luz Almanza-Ojeda ¹,*

¹ Departamento de Ingeniería Electrónica, DICIS, Universidad de Guanajuato, Carr. Salamanca-Valle de Santiago KM. 3.5 + 1.8 Km., Salamanca 36885, Mexico

² Faculty of Health Sciences, University of Ottawa, Ottawa, ON K1N 6N5, Canada

* Author to whom correspondence should be addressed.

Appl. Sci. 2020, 10(17), 5957; https://doi.org/10.3390/app10175957

Submission received: 9 July 2020 / Revised: 21 August 2020 / Accepted: 23 August 2020 / Published: 28 August 2020

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract:

Category suggestions or recommendations for customers or users have become an essential feature of commerce and leisure websites. This is a growing topic that follows users’ activity in social networks, which generates a huge quantity of information about their interests and contacts, among many other things. These data are usually collected to analyze people’s behavior and trends, and to build a complete user profile. In this sense, we analyze a dataset collected from Pinterest to predict gender and age by processing input images with a Convolutional Neural Network. Our method is based on the meaning of the image rather than its visual content. Additionally, we propose a heuristic-based approach for text analysis to predict users’ age and gender from Twitter. The two classifiers, one based on text and the other on images, are compared with various similar approaches in the state of the art. Suggested categories are based on association rules formed from the activity of thousands of users in order to estimate trends. Computer simulations showed that our approach can recommend interesting categories to a user by analyzing their current interests and comparing them with similar users’ profiles or trends and, therefore, achieve an improved user profile. The proposed method is capable of predicting the user’s age with high accuracy and, at the same time, of predicting gender and category information for the user. The certainty that one or more suggested categories will be of interest is higher for users with a large number of publications.

Keywords:

gender and age prediction; convolutional neural networks; category suggestion; social networks

1. Introduction

Multiple studies have explored how to predict age, gender, authorship, region, and other characteristics of the author of a text. In recent years, along with the rise of social networks, many works have focused on identifying authorship and author attributes in social media posts by analyzing text, images, and posting behavior. However, most work focuses on text analysis because of the amount of text generated per minute in social networks such as Twitter [1]. At present, several statistics, similarities between people, and trends are computed from user activity in social networks, including author profiling, which remains an open multifactor challenge. In this context, the most prominent conference on author profiling (PAN at CLEF) has been studying these topics since 2013; however, it was not until 2018 that images were included in the training sets to explore age and gender identification using both image and text information.

The use of images for author profiling provides additional information and, therefore, more data diversity to create a model. For instance, in 2014 the authors in [2,3] explored the possibility of predicting users’ gender using the Scale-Invariant Feature Transform (SIFT) visual descriptor on images posted by users on Twitter and Pinterest, respectively. The authors in [2] classified the posted images into categories before predicting gender, whereas the authors in [3] predicted gender directly using a Logistic Regression classifier. Because several image descriptors require high computational time and are more complex, Convolutional Neural Networks (CNN) have influenced most of the methods for gender and age prediction owing to their high performance. More recently, during PAN 2018 [4], multiple techniques were explored to predict users’ gender from Twitter posts that include images; the best result was reported by Takahashi et al. [4], who used the CNN model VGG16 to extract features from images and then combined them with textual features. Additionally, in 2018 Álvarez Carmona [5] used a similar method, extracting features with VGG16 to classify age and gender with a Support Vector Machine (SVM) on Twitter images. Later, Bravo-Marmolejo [6] compared different classifiers for age and gender using features extracted from a ResNet-50 on Pinterest images.

This work focuses on the analysis of users’ publications in social networks, specifically Pinterest and Twitter, with the aim of establishing a representative user model based on age and gender. The main idea behind our work is to recommend trend categories to Pinterest users by analyzing the images they have pinned and, with this information, to enrich the typical user profile consisting of the predicted gender and age. To do this, we propose two separate classifiers: (1) one for predicting age and gender and (2) one for classifying the images pinned by users on Pinterest. For the age and gender classifier, two methods are implemented: one is text-based and the other is image-based. We therefore analyze distinct and reliable features in tweets published on Twitter or in images published on Pinterest in order to predict age and gender. We evaluate two different types of datasets (from Twitter and Pinterest) to compare the performance of text and image features separately when predicting an initial profile (age and gender) for each user in the dataset. The second classifier predicts the category of any image pinned on Pinterest among the 32 predefined categories. The proposed image classifier helps to describe the interests of users and, at the same time, to create a set of user preferences. Finally, additional categories can be recommended to the user according to similar published image pins and association rules based on the activity of thousands of users. Thus, the Pinterest categories predicted for a user are combined and evaluated following the proposed association rules.

Based on the results that were obtained using VGG16 as a feature extractor, we performed some tests to find out whether better results can be obtained with a specific machine learning technique called Transfer Learning. This technique avoids training deep networks from scratch; instead, a network pre-trained on a similar and larger dataset is used, and specific layers are then retrained for the new output. It is worth noting that Bravo-Marmolejo [6] and Álvarez Carmona [5] used a type of transfer learning; however, they did not retrain the network. We therefore re-trained some of the layers to improve classification performance. This re-training was performed because the extracted features should be more specialized for age and gender recognition than for general image classification. A dataset collected from Pinterest (labeled as DB1) is used to perform the experimental tests for age and gender prediction.

Concerning text analysis, we want to explore an approach different from techniques that focus only on the processing of plain text. These traditional techniques try to analyze the semantics and relationships between words [4,7] using methods such as word2vec, bag of words, or TF-IDF [8]. In this work, we assume that some information about the user (i.e., age and gender) can be extracted by analyzing the hashtags, links, and tags in a tweet, and also how users employ punctuation marks. We use the database from PAN 2015 based on Twitter users (labeled as DBT) to perform the experimental tests for text analysis.

Moreover, we want to find similarities between users by analyzing their preferred categories on Pinterest, and then build a module capable of recommending new categories to a user based on the general preferences of Pinterest users. A different Pinterest dataset is used for this purpose, because a huge number of images from different users and categories is required to set up the association rules. These rules are generated by the Apriori algorithm and applied to recommend additional categories to Pinterest users. The dataset used to generate the association rules is labeled as DB2.

This document is structured as follows: Section 2 describes the global methodology used to predict age and gender for social network users. Section 3 describes the user profile and category suggestion approach. Section 4 presents the results obtained for the different datasets. Finally, Section 5 gives the conclusions and further perspectives for user profiling.

2. Materials and Methods

2.1. Method Description

The proposed category recommendation method for Pinterest users first predicts an initial profile, based on the age and gender of the user, from tweet or image datasets. Most existing methods combine both types of data from the same social network to improve gender and age classification results. Twitter allows users to publish mainly text, whereas Pinterest allows mostly images. In addition to text messages, Twitter users can also publish images; however, images are posted less frequently than text and, therefore, images on Twitter could be less representative than text for predicting the user profile. In this context, our first goal is to predict age and gender by analyzing user text and images separately, as such data come from social networks with different approaches.

Figure 1 shows the proposed methodology for age and gender prediction using images. Our method requires a set of images and it is based on transfer learning. Thus, given the DB1 database of posted images in Pinterest, the first step is to resize the images. Next, the full image database is split to create the training set and the testing set. The images in the training set go through a VGG16 Convolutional Neural Network (CNN) and, finally, to age and gender classifiers by means of a multi-output layer. With this technique, some layers of the network are trained after initialization.

The age and gender classifier that is based on tweet analysis is described in Section 2.5. After predicting the age and gender of the user, a second classifier is implemented to predict the category of an image pinned on Pinterest. This classifier is trained using the same method that is described in Figure 1; however, instead of a multi-output layer, this model uses a single output layer with 32 outputs, that is, one output for each category in the dataset. Once the image category is obtained, we evaluate a set of rules generated by the Apriori algorithm using a dataset with a large number of proactive users of the social network. Ideally, Apriori rules are used to extract only the most frequent item-sets to find their behavior and, in most cases, to extract important information that can be used to improve the accuracy of the classifier. In our scenario, we used Apriori rules to learn the behavior of all the items in our dataset. The validation of the additional categories recommended to users, according to their pinned images, is performed using two proposed methods, A and B. This validation includes the analysis of a dataset with information from more than 400,000 users. This approach is explained in detail in Section 3.

The method that is proposed in this work extracts additional information from social networks when compared with the classic methods typically used in the state of the art. Our method is therefore capable of predicting the user’s age and gender with high accuracy and, at the same time, of predicting category information for the user without violating the user’s privacy.

2.2. Description of Pinterest Database

For our first experimental tests of age and gender classification, we use the image dataset that was collected by López-Santamaría et al. [8]. This dataset includes 548,761 pins from 256 users extracted directly from the Pinterest website. The age ranges of the users follow those of PAN 2015: 18–24, 25–34, 35–49, and 50+. The Pinterest social network provides 32 predefined categories in which users can classify their posts. Table 1 shows the list of categories found in Pinterest and the number of images per category in our database. Uncategorized pins were removed, so the final dataset DB1 contains 307,225 pins from 241 users.

2.3. VGG16

According to Alvarez-Carmona [5] and Takahashi et al. in [4], the 16-layer VGG CNN performs well for age or gender prediction on social network images. Hence, we use this architecture in our work. This network was originally proposed by Simonyan & Zisserman [9] in 2015 with an architecture consisting of 13 convolutional layers divided into groups of 2 and 3, where each group is followed by a pooling layer and each convolutional layer uses multiple 3 × 3 kernels to perform the convolution. Small modifications were made to the original architecture, which has two fully connected layers and an output layer for 1000 classes. In this study, the output layer is replaced by a multi-output layer: one output for gender and the other for age, totally independent of each other. The two fully connected layers are independent from each other but connected to the same network. This configuration allows training the age and gender predictors using the same network but with independent outputs, as illustrated in Figure 1. We use transfer learning because training this network architecture from scratch requires considerable computational time and resources. That is, a pre-trained network is utilized to obtain similar results faster.

2.4. Transfer Learning

Using the weights of a pre-trained network to initialize a new CNN is a technique that has reported good results for age-gender prediction [4,5]. In this work, we load the weights of a VGG16 trained on the ImageNet image dataset (http://image-net.org/about-overview) and replace the last layer with our own multi-output layer (as mentioned in Section 2.3). These new outputs are independent of each other, as shown in Figure 1. The weights of the 13 convolutional layers remain fixed while the fully connected layers are trained. The original VGG16 network has an input size of 224 × 224; thus, all our input images were resized to that size.
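To make the split between fixed and trained weights concrete, the following minimal sketch traces the VGG16 convolutional groups in plain Python and counts parameters. The group layout (2, 2, 3, 3, 3 convolutions with 64 to 512 channels) and the two 4096-unit fully connected layers are those of the original VGG16; the head sizes (2 gender classes, 4 age ranges) and the choice to attach both heads to the second fully connected layer are assumptions made for illustration, not details confirmed by the text.

```python
# VGG16 convolutional configuration: (number of 3x3 conv layers, output channels) per group.
VGG16_GROUPS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_vgg16(input_size=224, in_channels=3):
    """Return (conv_layers, conv_params, final_size, final_channels)."""
    size, channels = input_size, in_channels
    conv_layers = 0
    conv_params = 0
    for n_convs, out_channels in VGG16_GROUPS:
        for _ in range(n_convs):
            # 3x3 kernel weights plus one bias per output channel;
            # a padding of 1 keeps the spatial size unchanged.
            conv_params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
            conv_layers += 1
        size //= 2  # 2x2 max pooling with stride 2 after each group
    return conv_layers, conv_params, size, channels


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


conv_layers, frozen_params, size, channels = trace_vgg16()
flat_features = size * size * channels

# Assumed trainable part: two shared 4096-unit layers and two independent heads
# (2 gender classes and 4 age ranges are assumptions for this sketch).
trainable_params = (
    dense_params(flat_features, 4096)
    + dense_params(4096, 4096)
    + dense_params(4096, 2)
    + dense_params(4096, 4)
)

print(conv_layers, size, flat_features)  # 13 7 25088
print(frozen_params, trainable_params)   # 14714688 119570438
```

Under these assumptions, most of the parameters sit in the fully connected layers, so freezing the convolutional part mainly saves the cost of back-propagating through the 13 convolutional layers rather than reducing the number of trained weights.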

2.5. Gender and Age Prediction by Text

This section considers a heuristic text-based approach as a second strategy for gender and age prediction without using a CNN. The second set of tests includes the analysis of text collected from Twitter. For our text approach, we used the database DBT from PAN 2015 [10], which gathers 152 users (in the English dataset) with 100 tweets each and contains age and gender labels for each user.

According to [11,12], the use of symbols, linguistic structures, and styles in blogs or social networks is directly related to the users’ age. Recently, new features appearing in social networks, such as links, tags, and hashtags, have been introduced to address the age prediction problem [13,14]. Moreover, these textual features are also analyzed to predict the gender of users, based on correlations between the words and phrases shown in texts [15]. The analysis of the different styles, punctuation, or symbols used in tweets replaces the classical approach of analyzing the whole posted text. In our method, we count the number of links, hashtags, tags, words, and other features instead of directly processing the text. Table 2 describes the features used in this work. Observe that we assume that the frequency of appearance of the features in Table 2 is directly related to the users’ age and gender. Because we want to know the number of words in the user’s vocabulary, we merge all of the tweets per user and obtain a polynomial of the frequency of appearance of every word. In this work, the feature vector length is 17 per user (most of the features are single values, with the exception of the polynomial vocabulary). After the feature vectors have been built, classification is performed by a Bag of Trees classifier.
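The counting idea can be sketched as follows. This is only an illustration: the feature names, the regular expressions, and the punctuation set are assumptions, it does not reproduce the 17 features of Table 2, and it omits the polynomial fit over the word frequencies.

```python
import re
from collections import Counter

# Hypothetical patterns; the source does not specify how each element is detected.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
WORD_RE = re.compile(r"[A-Za-z']+")


def user_features(tweets):
    """Count stylistic elements over all tweets of one user."""
    links = hashtags = mentions = n_words = 0
    punctuation = Counter()
    vocabulary = Counter()
    for tweet in tweets:
        links += len(URL_RE.findall(tweet))
        hashtags += len(HASHTAG_RE.findall(tweet))
        mentions += len(MENTION_RE.findall(tweet))
        # Strip links, hashtags and user tags before counting plain words.
        text = URL_RE.sub(" ", tweet)
        text = HASHTAG_RE.sub(" ", text)
        text = MENTION_RE.sub(" ", text)
        words = [w.lower() for w in WORD_RE.findall(text)]
        n_words += len(words)
        vocabulary.update(words)
        punctuation.update(ch for ch in text if ch in "!?.,")
    return {
        "links": links,
        "hashtags": hashtags,
        "mentions": mentions,
        "words": n_words,
        "vocabulary_size": len(vocabulary),
        "exclamations": punctuation["!"],
        "questions": punctuation["?"],
        "commas": punctuation[","],
        "periods": punctuation["."],
    }


# Hypothetical tweets, for illustration only.
example = [
    "Check this out! http://t.co/abc #fun @bob",
    "Really? I love #fun, and #sun.",
]
features = user_features(example)
print(features)
```

A vector built from such counts, one per user, could then be passed to any tree-ensemble classifier; the `vocabulary` counter is where a word-frequency curve would be derived.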

3. User Profile and Category Suggestion Approach

Researchers have shown that it is possible to cluster users according to their behavior in social networks. The authors in [5,16] showed that users of the same gender tend to prefer similar types of image categories. Furthermore, providing a recommendation that is based on one or several choices is a strategy widely used by browsers and e-commerce pages to assist user navigation. Following the same idea, we should be able to identify the relationship between different pinned categories and determine whether or not a new category could be interesting for a user, based on the categories previously pinned.

A large amount of information is required to extract such rules; for that reason, we used a different database of Pinterest users. The DB2 database was taken from [17] and was collected by repeatedly visiting the pages of the 32 Pinterest categories (similar to the categories reported in Table 1) and retrieving the most popular pins. For our purposes, we used the images, their category labels, and the list of users with the number of pins published in each category. This database consists of 176,092 images belonging to 401,241 users; it is important to note that one image can be repinned by multiple users.

Several algorithms have been studied and proposed for association rule mining on both small and big datasets; because our number of available categories is limited, we used the Apriori algorithm [18] to obtain the rules. The purpose of this algorithm is to find which items are related to each other. In other words, the algorithm provides rules that indicate which categories can appear once another set of categories is present. Each rule consists of an item-set with an appearance percentage, called support. The item-set is divided into a base and an add, where the add is a subset of items whose probability of appearance can be influenced by the presence of the base. After running the Apriori algorithm, around 10,000 rules were found and sorted according to their support.
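As an illustration of how support, base, and add relate, here is a minimal level-wise Apriori sketch in plain Python. The four user category sets and the minimum support of 0.5 are hypothetical toy values, not data from DB2, and the function names are introduced only for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(item-set): support} for all frequent item-sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        current = {}
        for candidate in level:
            s = support(candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent):
    """Split every frequent item-set into (base, add, support, confidence)."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for base in combinations(sorted(itemset), size):
                base = frozenset(base)
                # confidence = support(base and add) / support(base)
                rules.append((base, itemset - base, sup, sup / frequent[base]))
    return rules


# Hypothetical category sets pinned by four users.
users = [
    {"Weddings", "Hair&Beauty", "Home&Decor"},
    {"Weddings", "Hair&Beauty"},
    {"Home&Decor", "Gardening"},
    {"Weddings", "Home&Decor"},
]
frequent = apriori(users, min_support=0.5)
mined = association_rules(frequent)
for base, add, sup, conf in mined:
    print(sorted(base), "->", sorted(add), sup, round(conf, 2))
```

In this toy run, the rule with base {Hair&Beauty} and add {Weddings} has support 0.5 and confidence 1.0, because every user who pinned Hair&Beauty also pinned Weddings.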

Given the category of an image from a user, we used two methods to recommend additional categories that are typically pinned by similar users. In the first method, method A, we inspect the list of rules looking for the item-set with the highest support that includes the category published by the user. Once we identify the item-set, any other element of the set can be recommended to the user as a new category. In the second method, method B, we use the confidence value, defined as the probability of appearance of the add once the base has appeared. To do this, the category of the image is searched for in the base of the rules and the rule with the highest confidence is selected; any category in the add set can then be recommended to the user. Table 3 shows some examples of real rules mined by the algorithm using the DB2 dataset.

Suppose that we have an image from a user with the category ‘Weddings’; applying methods A and B, we obtain the rules that are shown in Table 3. With method A, we look for the rule with the highest support that includes ‘Weddings’. For this example, the best result is {Home&Decor, Weddings, Hair&Beauty} with a support value of 0.05, so we randomly recommend ‘Home&Decor’ or ‘Hair&Beauty’. Note that this method does not take into account the base or confidence parameters provided for each rule. With method B, we look in the base of the item-sets and choose the rule with the highest confidence; the recommendation is therefore the category given by the add set, in this case ‘History’.
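The two look-ups in this example can be sketched as below. Only the item-set {Home&Decor, Weddings, Hair&Beauty} with support 0.05 and the ‘History’ recommendation come from the text; the way that item-set is split into base and add, the third rule, and every confidence value and remaining support value are hypothetical placeholders, since Table 3 is not reproduced here.

```python
import random

# Rule list for illustration. Values marked "hypothetical" are not from the source.
RULES = [
    {"base": {"Home&Decor", "Hair&Beauty"}, "add": {"Weddings"},
     "support": 0.05, "confidence": 0.30},   # confidence and base/add split hypothetical
    {"base": {"Weddings"}, "add": {"History"},
     "support": 0.01, "confidence": 0.60},   # support and confidence hypothetical
    {"base": {"Weddings", "Travel"}, "add": {"Food&Drink"},
     "support": 0.02, "confidence": 0.40},   # entire rule hypothetical
]


def method_a(category, rules):
    """Highest-support item-set containing the category; base/add split is ignored."""
    matching = [r for r in rules if category in (r["base"] | r["add"])]
    if not matching:
        return set()
    best = max(matching, key=lambda r: r["support"])
    return (best["base"] | best["add"]) - {category}


def method_b(category, rules):
    """Highest-confidence rule whose base contains the category; recommend its add."""
    matching = [r for r in rules if category in r["base"]]
    if not matching:
        return set()
    best = max(matching, key=lambda r: r["confidence"])
    return set(best["add"])


candidates_a = method_a("Weddings", RULES)
candidates_b = method_b("Weddings", RULES)
print(sorted(candidates_a))                 # ['Hair&Beauty', 'Home&Decor']
print(sorted(candidates_b))                 # ['History']
print(random.choice(sorted(candidates_a)))  # random pick among method A candidates
```

Note that `method_b` accepts a rule as soon as the seed category appears in its base, without requiring the whole base to be present.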

Notice that the Apriori algorithm states that the entirety of the base must be present to ensure the confidence of the appearance of the add. However, in our Pinterest scenario, it is in many cases nearly impossible to find a rule with exactly the same base and a relevant confidence. For this reason, we only look for one or two categories present in the base. The percentages of successful recommendations are described in the next section.

In some special cases, users do not have a category for all of their images and boards on Pinterest. Therefore, an image classifier is needed to handle these cases. Such a classifier should be able to distinguish images from each one of the 32 Pinterest categories. As mentioned above, this classifier is trained using the same method described in Figure 1, except that instead of a multi-output layer we use a single output layer with 32 outputs.

4. Results Analysis

4.1. Results for Age and Gender Prediction Using Images from Pinterest

In the first part of the experimental tests, we used the DB1 Pinterest dataset described in Section 2.2 for age and gender prediction. It is well known that balanced datasets can improve classification. In our case, images from male users made up only 1/5 of the original dataset (307,225 labeled images). For this reason, only 60,000 pins were used and, consequently, the number of users was reduced to 126, distributed as 61 men and 65 women. In order to have ∼30,000 images for each gender and ∼15,000 in each age range, we took as many images from male users in each age range as possible until we reached 30,000 images. In our dataset, the images from male users were unbalanced with respect to age, as shown in Table 4, so we filled the age categories with images from female users to reach 60,000 images with ∼15,000 in each age range. Notably, we do not have the same number of pins from men and women per age range; however, we sought a balance in the number of pins per category while maintaining the same number of pins provided by male and female users.

As before, all of the images were resized to 224 × 224 pixels and the dataset was split into 85% for training and 15% for testing. The training and validation of the last fully connected layers of the network were performed 10 times and, for each run, the images in the dataset were randomly shuffled. For this test, we obtained a mean accuracy of 72% for gender and 49% for age using the DB1 dataset. Further metrics of the experimental tests and a comparison with related works are shown in Table 5. The best accuracy (81%) is obtained by Takahashi et al. [19]; however, they use a different dataset (images from Twitter) and only provide gender prediction. For age prediction, our method achieves the highest score, 49%, which is 8% higher than [6], which uses a Pinterest dataset, and 10% higher than [5], which uses a Twitter dataset. Furthermore, we include the F1 metric to compare our approach with those methods that provide it, although this metric is more representative than accuracy for unbalanced datasets. We achieve the highest F1 score, giving a reliable percentage of recognition per class.
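The repeated evaluation protocol (shuffle, 85%/15% split, 10 repetitions, mean accuracy) can be sketched as follows. The majority-class classifier is only a stand-in for the CNN, and the toy samples, the seed, and the function names are assumptions for illustration.

```python
import random
from collections import Counter
from statistics import mean


def shuffled_split(samples, train_fraction, rng):
    """Shuffle a copy of the samples and split it into train and test parts."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def repeated_accuracy(samples, fit, predict, repetitions=10,
                      train_fraction=0.85, seed=0):
    """Mean test accuracy over several independently shuffled splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        train, test = shuffled_split(samples, train_fraction, rng)
        model = fit(train)
        hits = sum(predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    return mean(scores), scores


# Placeholder classifier (NOT the CNN): always predicts the most common training label.
def fit_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Hypothetical toy data: 100 (sample_id, gender_label) pairs.
toy = [(i, "female" if i % 2 else "male") for i in range(100)]
mean_acc, runs = repeated_accuracy(toy, fit_majority, predict_majority)
print(len(runs), round(mean_acc, 2))
```

Swapping `fit_majority` and `predict_majority` for functions that train and query a real model keeps the protocol unchanged.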

4.2. Results for Gender and Age Prediction Using Text from Twitter

In this section, we discuss the results of the methodology that is described in Section 2.5 for text analysis with the PAN 2015 database DBT. The experimental tests were performed 10 times and we obtained, on average, an accuracy of 64% for gender prediction and 67% for age. Additional performance metrics are a precision of 64% and a recall of 65% for gender, and a precision of 58% and a recall of 65% for age. In general, the age classification is unbalanced and produces values for the precision and the F1 score that are lower than those for accuracy and recall. As mentioned above, social networks like Twitter and Pinterest are essentially used for different purposes. In addition to pinned images, Pinterest users can also publish short text messages attached to their images. In contrast to Twitter, text is posted less frequently than images and, therefore, text on Pinterest could be less representative than images for predicting the user profile. In this context, a comparison with related works using different datasets is shown in Table 6. It is important to point out that the authors in [5] used the Twitter PAN 2014 dataset with additional images.
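The gap between accuracy and the other metrics on unbalanced classes can be seen in a small worked example. The sketch below computes per-class precision, recall, and F1 and averages them over classes (macro averaging, which is an assumption; the text does not state which averaging was used). The labels and predictions are hypothetical.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one-vs-rest."""
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        result[label] = (precision, recall, f1)
    return result


def macro_average(metrics):
    """Unweighted mean of precision, recall and F1 over the classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Hypothetical unbalanced case: 8 users aged 18-24, 2 users aged 50+,
# and a classifier that always answers "18-24".
y_true = ["18-24"] * 8 + ["50+"] * 2
y_pred = ["18-24"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_p, macro_r, macro_f1 = macro_average(per_class_metrics(y_true, y_pred))
print(accuracy, round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
# 0.8 0.4 0.5 0.444
```

Here the accuracy looks high only because the majority class dominates, while the class-averaged scores expose that the minority class is never recognized.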

While the accuracy of our model (using text and images) is not better than that of Takahashi et al. [19] in the state of the art, the benefit of our model is that we also predict the age of users using the same CNN architecture. The accuracy of our results and the F1 metric for age prediction (67%) outperform the approaches presented in Table 6. In addition, our strategy could easily be extended to combine text and image information to enrich the feature vectors in the CNN, which could improve the accuracy of age and gender prediction. However, we first expect to combine text and images published by the same user on different social networks to perform further tests.

4.3. Trends Analysis and Category Recommendations

Our results for category recommendations are divided into two parts. First, we show the results of the pin classification when the category of the image is missing. Second, we discuss the recommendation results when a category is known. An accuracy of 42% was obtained for the Pinterest database DB1 with 32 categories. Further tests were performed on the DB2 database, from which we obtained the rules described in Section 3, yielding an accuracy of 51%. The different metrics computed for classification performance are shown in Table 7. Note that the metrics for the dataset DB2 are slightly different; we assume that this behavior is due to the unbalanced distribution of some categories in the dataset DB1.

In the second part of the experimental tests, we used the DB1 dataset to test our category recommendation methods. We identified the users who posted images in at least five and at most 25 different categories. The number of users with this specific number of categories was 132 out of the 256 users. Given a specific category from a user, a recommendation is successful when the new category (recommended using method A or B, as illustrated in Table 3) is found in the list of categories for that user. The overall performance is obtained by calculating the percentage of successful recommendations over all of the posted categories per user, and then averaging this percentage over all 132 users. The first test was performed using only one category as a seed to look into the rules. For this test, the overall performance was 29% of successful recommendations for method A and 30% for method B.

The second test was performed using two categories as the seed. Thus, for each user, all of the combinations of two categories were tested and validated against the rest of the categories posted by the user. In this test, we achieved an overall performance of 19% with method A and 17% with method B. However, most systems that provide recommendations give more than one option to the user. To explore this possibility, we tested recommending more than one category to the user. In this case, we considered a recommendation successful when at least one of the suggested categories was among the posted categories. The results for 1, 2, and 3 suggestions are summarized in Table 8. As expected, expanding the number of categories in the seed delivers a lower percentage of successful recommendations, because it is harder to find a particular pair of categories than a single one in the item-set for method A or in the base for method B. Moreover, providing two or three suggested categories improves the probability that one of the suggestions will be chosen by a user with a given initial seed. From these results, we confirm that more suggestions provide higher probabilities that one or more categories could be chosen by a Pinterest user once they have a certain item-set or base of categories.
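The evaluation described above, for single-category seeds and up to k suggestions, can be sketched as follows. The lookup table standing in for methods A and B and the two toy users are hypothetical, and the filter on users with 5 to 25 categories is omitted for brevity.

```python
def success_rate(user_categories, recommend, k=1):
    """Share of seed categories for which at least one of the first k
    suggestions is already among the user's other categories."""
    hits = 0
    for seed in user_categories:
        suggestions = recommend(seed)[:k]
        others = user_categories - {seed}
        if any(s in others for s in suggestions):
            hits += 1
    return hits / len(user_categories)


def overall_performance(users, recommend, k=1):
    """Average of the per-user success rates."""
    return sum(success_rate(c, recommend, k) for c in users.values()) / len(users)


# Hypothetical ranked suggestions per seed category (stand-in for method A or B).
TABLE = {
    "Weddings": ["History", "Hair&Beauty"],
    "Hair&Beauty": ["Weddings"],
    "Travel": ["Food&Drink"],
}


def recommend(seed):
    return TABLE.get(seed, [])


# Hypothetical users and the categories they have posted in.
users = {
    "u1": {"Weddings", "Hair&Beauty"},
    "u2": {"Travel", "Weddings"},
}

print(overall_performance(users, recommend, k=1))  # 0.25
print(overall_performance(users, recommend, k=2))  # 0.5
```

In this toy case, allowing a second suggestion doubles the overall rate, which mirrors the qualitative trend reported for Table 8 without reproducing its values.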

5. Discussion and Conclusions

We propose a method to predict gender, age, and suggested categories in order to construct an enriched profile of Twitter and Pinterest users. To do this, we develop two strategies to predict the gender and age of users by analyzing text and images separately. The innovation in our proposed text-based model lies in the semantic and personality aspects considered to construct the feature vector, which achieves a reliable score without using a CNN as a classifier. Judging from the age and gender results, analyzing text is more convenient than analyzing images, based on the scores achieved. Nevertheless, the image dataset acts as a bridge between the age-gender predictor and the categories that can be recommended to users according to their pinned images. Thus, our image classifier utilizes the same dataset as the age and gender model and is only applied to label uncategorized images on the user’s Pinterest wall. Once all the images belong to one of the 32 Pinterest categories, the association rules analyze the categories posted by the user and suggest one or more categories to them. The experimental results show that a loss of accuracy is present for some rules with very low support or confidence. This, in turn, highlights the diversity of Pinterest’s categories and indicates that such rules are not popular.

Further extensions of this work consist of combining text and images published by the same user on different social networks. The percentage of successful category suggestions could also be improved by increasing the current support and confidence values for the association rules. Future work also includes building a complete user profile based on the users’ activity in social networks and clustering similar user profiles by means of search algorithms. At the same time, we could associate groups of users with the same interests and construct population trends. Our particular interest is to create clusters of users with similar interests, based on representative models, and to provide this information to National Statistic Institutions.

Author Contributions

D.-L.A.-O. proposed and described the methodology; R.G.-G. and Y.A.A.-A. programmed the methodology (classifiers and decision rules); M.-A.I.-M. designed, analyzed and validated experimental tests; S.L. tested and validated the metrics of the data; J.C.G. contributed to dataset collection, clean up and validation; all authors contributed to the writing of the manuscript; D.-L.A.-O., R.G.-G. and S.L. carried out the proofreading of the paper; All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Fondo Sectorial CONACYT-INEGI project number 290910. The APC was funded by Universidad de Guanajuato under POA 2020.

Acknowledgments

Sergio Ledesma acknowledges DAIP, University of Guanajuato and the University of Ottawa for their sponsorship in the realization of this work.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

References

1. Corea, F. Can Twitter Proxy the Investors’ Sentiment? The Case for the Technology Sector. Big Data Res. 2016, 4, 70–74.
2. Ma, X.; Tsuboshita, Y.; Kato, N. Gender estimation for SNS user profiling using automatic image annotation. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu, China, 14–18 July 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1–6.
3. You, Q.; Bhatia, S.; Sun, T.; Luo, J. The eyes of the beholder: Gender prediction using images posted in online social networks. In Proceedings of the 2014 IEEE International Conference on Data Mining Workshop, Shenzhen, China, 14 December 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1026–1030.
4. Rangel, F.; Rosso, P.; Montes-y-Gómez, M.; Potthast, M.; Stein, B. Overview of the 6th author profiling task at PAN 2018: Multimodal gender identification in Twitter. In Working Notes Papers of the CLEF; CLEF Association: Avignon, France, 2018.
5. Alvarez-Carmona, M.A.; Pellegrin, L.; Montes-y-Gómez, M.; Sánchez-Vega, F.; Escalante, H.J.; López-Monroy, A.P.; Villaseñor-Pineda, L.; Villatoro-Tello, E. A visual approach for age and gender identification on Twitter. J. Intell. Fuzzy Syst. 2018, 34, 3133–3145.
6. Bravo-Marmolejo, S.P.; Moreno, J.; Gomez, J.C.; Pérez-Martínez, C.; Ibarra-Manzano, M.A.; Almanza-Ojeda, D.L. Identification of Age and Gender in Pinterest by Combining Textual and Deep Visual Features. In International Conference on Information and Software Technologies; Springer: Berlin/Heidelberg, Germany, 2019; pp. 321–332.
7. Rangel, F.; Rosso, P.; Potthast, M.; Stein, B. Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter. In Working Notes Papers of the CLEF; CLEF Association: Dublin, Ireland, 2017; p. 1613-0073.
8. López-Santamaría, L.M.; Gomez, J.C.; Almanza-Ojeda, D.L.; Ibarra-Manzano, M.A. Age and Gender Identification in Unbalanced Social Media. In Proceedings of the 2019 International Conference on Electronics, Communications and Computers (CONIELECOMP), Cholula, Mexico, 27 February–1 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 74–80.
9. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
10. Rangel Pardo, F.M.; Celli, F.; Rosso, P.; Potthast, M.; Stein, B.; Daelemans, W. Overview of the 3rd Author Profiling Task at PAN 2015. In CLEF 2015 Evaluation Labs and Workshop Working Notes Papers; CLEF Association: Toulouse, France, 2015; pp. 1–8.
11. Rosenthal, S.; McKeown, K. Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Portland, OR, USA, 2011; pp. 763–772.
12. Eckert, P. Age as a Sociolinguistic Variable. In The Handbook of Sociolinguistics; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2017; Chapter 9; pp. 151–167.
13. Pandya, A.; Oussalah, M.; Monachesi, P.; Kostakos, P. On the use of distributed semantics of tweet metadata for user age prediction. Future Gener. Comput. Syst. 2020, 102, 437–452.
14. Pandya, A.; Oussalah, M.; Monachesi, P.; Kostakos, P.; Lovén, L. On the Use of URLs and Hashtags in Age Prediction of Twitter Users. In Proceedings of the 2018 IEEE International Conference on Information Reuse and Integration (IRI), Salt Lake City, UT, USA, 6–9 July 2018; pp. 62–69.
15. Schwartz, H.A.; Eichstaedt, J.C.; Kern, M.L.; Dziurzynski, L.; Ramones, S.M.; Agrawal, M.; Shah, A.; Kosinski, M.; Stillwell, D.; Seligman, M.E.P.; et al. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. PLoS ONE 2013, 8, e73791.
16. Bandari, D.; Xiang, S.; Martin, J.; Leskovec, J. Categorizing user sessions at Pinterest. In Proceedings of the 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), Kyoto, Japan, 27 February–2 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8.
17. Zhong, C.; Karamshuk, D.; Sastry, N. Predicting Pinterest: Automating a distributed human computation. In Proceedings of the 24th International Conference on World Wide Web; International World Wide Web Conferences Steering Committee, Florence, Italy, 18–22 May 2015; pp. 1417–1426.
18. Agrawal, R.; Srikant, R. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB’94, Santiago, Chile, 20–23 August 1994; Volume 1215, pp. 487–499.
19. Takahashi, T.; Tahara, T.; Nagatani, K.; Miura, Y.; Taniguchi, T.; Ohkuma, T. Text and image synergy with feature cross technique for gender identification. In Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), Avignon, France, 10–14 September 2018.
20. Modaresi, P.; Liebeck, M.; Conrad, S. Exploring the Effects of Cross-Genre Machine Learning for Author Profiling in PAN 2016. In Proceedings of the Seventh International Conference of the CLEF Association (CLEF 2016), Évora, Portugal, 5–8 September 2016.

Figure 1. General Methodology using images.

Table 1. Categories and number of images in our DB1 database corresponding to categories in Pinterest.

Category	Num. of Images	Category	Num. of Images
1. Animals	11,103	17. Home & Decoration	14,099
2. Architecture	4358	18. Humor	5934
3. Art	18,139	19. Illustrations/posters	706
4. Cars & Motorcycles	1886	20. Kids	2608
5. Celebrities	9909	21. Men’s Fashion	1422
6. Design	9654	22. Outdoors	7380
7. DIY & Crafts	29,593	23. Photography	12,414
8. Education	2442	24. Products	4959
9. Film/Music/Books	11,075	25. Quotes	4217
10. Food & Drink	45,192	26. Science & Nature	8283
11. Gardening	5841	27. Sports	884
12. Geek	897	28. Tattoos	247
13. Hair & Beauty	10,051	29. Technology	3486
14. Health & Fitness	4810	30. Travel	16,133
15. History	2161	31. Weddings	6108
16. Holiday & Events/Party	9547	32. Women’s Fashion	41,684

Table 2. Textual features extracted from user tweets for DBT dataset.

Feature	Length
Number of links per tweet	1
Number of tags per tweet	1
Number of hashtags per tweet	1
Number of ‘;’ per tweet	1
Number of ‘...’ per tweet	1
Number of ‘,’ per tweet	1
Sum of punctuation signs per tweet	1
Sum of links, hashtags and tags per tweet	1
Length of the tweet	1
Number of words per tweet	1
Number of unique words per tweet	1
Polynomial of vocabulary	6
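
The scalar features in Table 2 can be illustrated with a minimal Python sketch. It rests on assumptions that the table does not state: "tags" are read as @-mentions, URLs are removed before punctuation is counted, and each value is averaged over all tweets of a user. The regular expressions and the names `tweet_features` and `user_features` are hypothetical. The six-value "polynomial of vocabulary" is left out because the table does not define it.

```python
import re
import string

# Hypothetical patterns; the tokenization used for DBT may differ.
LINK_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def tweet_features(tweet: str) -> dict:
    """Count the scalar features of Table 2 for a single tweet."""
    links = LINK_RE.findall(tweet)
    # Assumption: drop URLs so that their ':', '/' and '.' are not counted.
    text = LINK_RE.sub(" ", tweet)
    tags = TAG_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)
    words = text.split()
    return {
        "links": len(links),
        "tags": len(tags),
        "hashtags": len(hashtags),
        "semicolons": text.count(";"),
        "ellipses": text.count("..."),
        "commas": text.count(","),
        "punctuation": sum(ch in string.punctuation for ch in text),
        "links_hashtags_tags": len(links) + len(tags) + len(hashtags),
        "length": len(tweet),
        "words": len(words),
        "unique_words": len({w.lower() for w in words}),
    }


def user_features(tweets: list) -> dict:
    """Average the per-tweet counts over all tweets of one user."""
    if not tweets:
        raise ValueError("at least one tweet is required")
    rows = [tweet_features(t) for t in tweets]
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}
```

Calling `user_features` on the list of a user's tweets would give one 11-value vector per user, to which the vocabulary-based values would still have to be appended.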

Table 3. Example of association rules using DB2 dataset. In the first row, the text in bold indicates those parameters evaluated for method A, and in the second row for method B.

No. Rule	Item-Set	Support	Base	Add (Recommendation)	Confidence
1	Home&Decor, Weddings, Hair&Beauty	0.05	Weddings, Home&Decor	Hair&Beauty	0.48
2	Weddings, Design, Sports, History	0.04	Design, Weddings, Sports	History	0.67
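
The Support and Confidence columns of Table 3 can be read through the standard association-rule definitions: the support of an item-set is the fraction of user profiles that contain it, and the confidence of a rule is the support of the full item-set divided by the support of its base. The sketch below assumes these standard definitions; the toy profiles and the function names are hypothetical and are not taken from DB2. Under the same definitions, rule 1 would imply that its base appears in roughly 0.05/0.48 ≈ 0.10 of the profiles.

```python
def support(itemset, profiles):
    """Fraction of user profiles that contain every category in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= profile for profile in profiles) / len(profiles)


def confidence(base, add, profiles):
    """Confidence of the rule base -> add: support(base + add) / support(base)."""
    base, add = frozenset(base), frozenset(add)
    base_support = support(base, profiles)
    if base_support == 0:
        return 0.0
    return support(base | add, profiles) / base_support


# Hypothetical toy data: each set holds the categories one user pins in.
profiles = [
    {"Weddings", "Home&Decor", "Hair&Beauty"},
    {"Weddings", "Home&Decor"},
    {"Weddings", "Design"},
    {"Home&Decor", "Hair&Beauty"},
    {"Sports"},
]

rule_support = support({"Weddings", "Home&Decor", "Hair&Beauty"}, profiles)
rule_confidence = confidence({"Weddings", "Home&Decor"}, {"Hair&Beauty"}, profiles)
```

In this toy example one of five profiles contains the whole item-set and two contain the base, so the rule has support 0.2 and confidence 0.5.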

Table 4. Number of images per gender and age range in DB1 dataset.

Age Range	Men (61 Users)	Women (65 Users)	Total Images	Distribution per Class
18–24	100	14,688	14,788	24.6%
25–34	1471	13,601	15,072	25.0%
35–49	14,375	801	15,176	25.3%
50+	14,304	801	15,105	25.1%
Images per gender	30,250	29,891	60,141	–
Distribution per class	50.30%	49.70%	–	–

Table 5. Age and Gender prediction comparison with related approaches using Twitter or Pinterest image datasets.

Method	Social Network	Class	Accuracy	Precision	Recall	F1
SVM (Ma X. 2014 [2])	Twitter	Gender	-	66.4%	63.4%	63.7%
Logistic R (You Q. 2014 [3])	Pinterest	Gender	71.3%	71.4%	71.3%	71.2%
VGG16 (Takahashi 2018 [19])	Twitter PAN 2018	Gender	81%	-	-	-
VGG16 + SVM (Alvarez 2018 [5])	Twitter	Gender	70%	-	-	-
		Age	39%	-	-	-
ResNet-50+NCC (Bravo 2019 [6])	Pinterest	Gender	71%	-	-	67%
ResNet-50+RF		Age	41%	-	-	29%
VGG16 (Our approach)	Pinterest	Gender	72%	72%	72%	72%
	(DB1)	Age	49%	49%	50%	49%

Table 6. Age and Gender prediction comparison with related approaches analyzing text from social network users.

Method	Social Network	Class	Accuracy	F1
Logistic Regression (Modaresi 2016 [20])	Twitter PAN 2016	Gender	75%	-
	(blogs & reviews)	Age	51%	-
SVM (Alvarez 2018 [5])	Twitter PAN 2014	Gender	75%	-
		Age	40%	-
ResNet-50+KNN (Bravo 2019 [6])	Pinterest	Gender	74%	62%
ResNet-50+LR		Age	47%	39%
VGG16 (Takahashi 2018 [19])	Twitter PAN 2018	Gender	79%	-
Bag of Trees (Our approach)	Twitter PAN 2015	Gender	64%	64%
	(DBT)	Age	67%	59%

Table 7. Results for category classification on Pinterest images from two datasets.

Output	Accuracy	Precision	Recall	F1
32 Categories (DB1)	42%	42%	43%	42%
32 Categories (DB2)	51%	46%	50%	46%
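
The four metrics in Table 7 can all be derived from a confusion matrix. The sketch below assumes macro averaging, that is, precision, recall and F1 are computed per class and then averaged with equal weight. The averaging scheme is not stated with the table, so this is an assumption, and the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(confusion):
    """Accuracy and macro-averaged precision, recall and F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

With macro averaging, the reported F1 is the mean of the per-class F1 values, so it need not equal the harmonic mean of the reported precision and recall.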

Table 8. Percentage of successful recommendations using method A and B.

No. Categories	Method	1 Suggestion	2 Suggestions	3 Suggestions
1	A	29%	48%	58%
1	B	30%	39%	40%
2	A	19%	30%	37%
2	B	17%	23%	24%

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Garcia-Guzman, R.; Andrade-Ambriz, Y.A.; Ibarra-Manzano, M.-A.; Ledesma, S.; Gomez, J.C.; Almanza-Ojeda, D.-L. Trend-Based Categories Recommendations and Age-Gender Prediction for Pinterest and Twitter Users. Appl. Sci. 2020, 10, 5957. https://doi.org/10.3390/app10175957


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
