Multidimensional Data Structures and Big Data Management

A special issue of Information (ISSN 2078-2489). This special issue belongs to the section "Information Systems".

Deadline for manuscript submissions: 31 January 2025

Special Issue Editor


Guest Editor
Department of Computer Engineering and Informatics, University of Patras, 26504 Rio Achaia, Greece
Interests: multidimensional data structures; decentralized systems for big data management; indexing; query processing and query optimization

Special Issue Information

Dear Colleagues,

The MDPI journal Information invites submissions to a Special Issue on “Multidimensional Data Structures and Big Data Management”.

Big data are commonly characterized by their volume, velocity, and variety, among other properties of modern data-intensive scenarios. This has driven the growth of big data management as a field: data often arise from different, and sometimes heterogeneous, sources, so organizing and handling them efficiently is of particular interest.

This raises several challenges, such as indexing massive datasets, optimizing queries over large databases, and designing distributed methods for data mining and knowledge extraction.

Furthermore, data structures, which are responsible for storing, organizing, retrieving, and processing data, play a major role in addressing these challenges. Distributed processing engines and container orchestration platforms such as Apache Spark and Kubernetes, together with elastic cloud tooling such as the Elastic Stack with Kibana, make the handling of massive datasets far more tractable by splitting the data into chunks and distributing the work across clusters, as illustrated in the sketch below.
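To make the split-and-aggregate pattern concrete, here is a minimal, hypothetical PySpark sketch; it assumes a local Spark installation, and the file name and column names are placeholders rather than part of any specific system mentioned above.

```python
# Minimal PySpark sketch: partition a large dataset and aggregate it in parallel.
# The file path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chunked-aggregation").getOrCreate()

# Spark splits the input into partitions (chunks) that are processed across the cluster.
df = spark.read.csv("measurements.csv", header=True, inferSchema=True)
df = df.repartition(64)  # explicit chunking across executors

# A simple distributed aggregation: mean value per sensor.
result = df.groupBy("sensor_id").agg(F.avg("value").alias("mean_value"))
result.show(10)

spark.stop()
```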

The main scope of this Special Issue is the integration of data structures and indexes with innovative distributed engine tools, accompanied by modern ML and AI methods for large-scale provisioning and prediction.

Ultimately, this Special Issue is concerned with groundbreaking topics at the interface of data structures and indexing, distributed ML, query processing, and optimization, with particular emphasis on multidimensional data structures for big data management.

Topics of call

  • Efficient data structures
  • Big data indexing strategies
  • Distributed machine learning
  • Big data management techniques
  • Random sampling for data mining
  • Automated machine learning
  • Modern database systems
  • Big data management for smart IoT applications
  • Advanced distributed hash tables (DHTs)
  • Innovative schemes for information retrieval and knowledge extraction
  • AI and machine learning approaches for handling massive datasets
  • Query optimization based on machine learning approaches

Prof. Dr. Spyros Sioutas
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Information is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multidimensional data structures
  • big data management
  • big data indexing strategies
  • large scale query processing and query optimization
  • large scale machine learning

Published Papers (6 papers)


Research


20 pages, 1199 KiB  
Article
An Agent-Based Model for Disease Epidemics in Greece
by Vasileios Thomopoulos and Kostas Tsichlas
Information 2024, 15(3), 150; https://doi.org/10.3390/info15030150 - 07 Mar 2024
Abstract
In this research, we present the first steps toward developing a data-driven agent-based model (ABM) specifically designed for simulating infectious disease dynamics in Greece. Amidst the ongoing COVID-19 pandemic caused by SARS-CoV-2, this research holds significant importance as it can offer valuable insights into disease transmission patterns and assist in devising effective intervention strategies. To the best of our knowledge, no similar study has been conducted in Greece. We constructed a prototype ABM that utilizes publicly accessible data to accurately represent the complex interactions and dynamics of disease spread in the Greek population. By incorporating demographic information and behavioral patterns, our model captures the specific characteristics of Greece, enabling accurate and context-specific simulations. By using our proposed ABM, we aim to assist policymakers in making informed decisions regarding disease control and prevention. Through the use of simulations, policymakers have the opportunity to explore different scenarios and predict the possible results of various intervention measures. These may include strategies like testing approaches, contact tracing, vaccination campaigns, and social distancing measures. Through these simulations, policymakers can assess the effectiveness and feasibility of these interventions, leading to the development of well-informed strategies aimed at reducing the impact of infectious diseases on the Greek population. This study is an initial exploration toward understanding disease transmission patterns and a first step towards formulating effective intervention strategies for Greece. Full article
(This article belongs to the Special Issue Multidimensional Data Structures and Big Data Management)
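For readers unfamiliar with agent-based epidemic models, the following minimal SIR-style sketch illustrates the general simulation loop such models rely on; it is not the authors' data-driven model, and all parameter values are hypothetical.

```python
# Minimal agent-based SIR sketch illustrating the kind of simulation the paper describes.
# Population size, contact rate, and probabilities are hypothetical placeholders.
import random

N, DAYS = 1000, 100
CONTACTS_PER_DAY, P_INFECT, P_RECOVER = 8, 0.03, 0.1

# 0 = susceptible, 1 = infected, 2 = recovered
state = [0] * N
for i in random.sample(range(N), 5):          # seed a few initial infections
    state[i] = 1

for day in range(DAYS):
    infected = [i for i in range(N) if state[i] == 1]
    for i in infected:
        # each infected agent contacts a few random agents and may transmit
        for j in random.sample(range(N), CONTACTS_PER_DAY):
            if state[j] == 0 and random.random() < P_INFECT:
                state[j] = 1
        if random.random() < P_RECOVER:
            state[i] = 2
    print(day, state.count(0), state.count(1), state.count(2))  # S, I, R counts
```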

19 pages, 3617 KiB  
Article
Deep Learning Approaches for Big Data-Driven Metadata Extraction in Online Job Postings
by Panagiotis Skondras, Nikos Zotos, Dimitris Lagios, Panagiotis Zervas, Konstantinos C. Giotopoulos and Giannis Tzimas
Information 2023, 14(11), 585; https://doi.org/10.3390/info14110585 - 25 Oct 2023
Abstract
This article presents a study on the multi-class classification of job postings using machine learning algorithms. With the growth of online job platforms, there has been an influx of labor market data. Machine learning, particularly NLP, is increasingly used to analyze and classify job postings. However, the effectiveness of these algorithms largely hinges on the quality and volume of the training data. In our study, we propose a multi-class classification methodology for job postings, drawing on AI models such as text-davinci-003 and the quantized versions of Falcon 7b (Falcon), Wizardlm 7B (Wizardlm), and Vicuna 7B (Vicuna) to generate synthetic datasets. These synthetic data are employed in two use-case scenarios: (a) exclusively as training datasets composed of synthetic job postings (situations where no real data is available) and (b) as an augmentation method to bolster underrepresented job title categories. To evaluate our proposed method, we relied on two well-established approaches: the feedforward neural network (FFNN) and the BERT model. Both the use cases and training methods were assessed against a genuine job posting dataset to gauge classification accuracy. Our experiments substantiated the benefits of using synthetic data to enhance job posting classification. In the first scenario, the models’ performance matched, and occasionally exceeded, that of the real data. In the second scenario, the augmented classes consistently outperformed in most instances. This research confirms that AI-generated datasets can enhance the efficacy of NLP algorithms, especially in the domain of multi-class job posting classification. While data augmentation can boost model generalization, its impact varies. It is especially beneficial for simpler models like the FFNN. BERT, due to its context-aware architecture, also benefits from augmentation but sees limited improvement. Selecting the right type and amount of augmentation is essential. Full article
(This article belongs to the Special Issue Multidimensional Data Structures and Big Data Management)
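The augmentation scenario described above can be sketched, under simplifying assumptions, as follows: a handful of placeholder postings stand in for real and LLM-generated text, and a small scikit-learn feed-forward network stands in for the FFNN baseline. This is illustrative only, not the authors' pipeline.

```python
# Sketch of scenario (b): augmenting an underrepresented job-title class with synthetic
# postings before training a simple classifier. All texts and labels are hypothetical
# placeholders, not LLM-generated data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

real_texts = ["python developer with django experience", "registered nurse for icu",
              "senior java backend engineer", "primary school teacher wanted"]
real_labels = ["IT", "Health", "IT", "Education"]

# Synthetic postings for the underrepresented "Health" class (placeholders for text
# that would be generated by models such as Falcon or Vicuna).
synthetic_texts = ["experienced midwife for maternity ward", "physiotherapist, part time"]
synthetic_labels = ["Health", "Health"]

X = real_texts + synthetic_texts
y = real_labels + synthetic_labels

# TF-IDF features feeding a small feed-forward network, standing in for the FFNN baseline.
model = make_pipeline(TfidfVectorizer(),
                      MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))
model.fit(X, y)
print(model.predict(["looking for a kindergarten teacher"]))
```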

27 pages, 1404 KiB  
Article
EVCA Classifier: A MCMC-Based Classifier for Analyzing High-Dimensional Big Data
by Eleni Vlachou, Christos Karras, Aristeidis Karras, Dimitrios Tsolis and Spyros Sioutas
Information 2023, 14(8), 451; https://doi.org/10.3390/info14080451 - 09 Aug 2023
Cited by 2
Abstract
In this work, we introduce an innovative Markov Chain Monte Carlo (MCMC) classifier, a synergistic combination of Bayesian machine learning and Apache Spark, highlighting the novel use of this methodology in the spectrum of big data management and environmental analysis. By employing a large dataset of air pollutant concentrations in Madrid from 2001 to 2018, we developed a Bayesian Logistic Regression model, capable of accurately classifying the Air Quality Index (AQI) as safe or hazardous. This mathematical formulation adeptly synthesizes prior beliefs and observed data into robust posterior distributions, enabling superior management of overfitting, enhancing the predictive accuracy, and demonstrating a scalable approach for large-scale data processing. Notably, the proposed model achieved a maximum accuracy of 87.91% and an exceptional recall value of 99.58% at a decision threshold of 0.505, reflecting its proficiency in accurately identifying true negatives and mitigating misclassification, even though it slightly underperformed in comparison to the traditional Frequentist Logistic Regression in terms of accuracy and the AUC score. Ultimately, this research underscores the efficacy of Bayesian machine learning for big data management and environmental analysis, while signifying the pivotal role of the first-ever MCMC Classifier and Apache Spark in dealing with the challenges posed by large datasets and high-dimensional data with broader implications not only in sectors such as statistics, mathematics, physics but also in practical, real-world applications. Full article
(This article belongs to the Special Issue Multidimensional Data Structures and Big Data Management)
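The core of such an MCMC classifier can be sketched with a plain Metropolis-Hastings sampler for Bayesian logistic regression, as below. This is a NumPy toy on synthetic data, not the authors' Spark-based EVCA implementation; only the 0.505 decision threshold is taken from the abstract.

```python
# Minimal Metropolis-Hastings sampler for Bayesian logistic regression.
# Features and labels below are synthetic stand-ins for pollutant data and AQI classes.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                        # stand-in pollutant features
true_w = np.array([1.5, -2.0, 0.7])
y = (1 / (1 + np.exp(-X @ true_w)) > rng.random(500)).astype(float)  # 1 = hazardous

def log_posterior(w):
    logits = X @ w
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))  # Bernoulli log-likelihood
    log_prior = -0.5 * np.sum(w ** 2)                         # standard normal prior
    return log_lik + log_prior

w, samples = np.zeros(3), []
for step in range(5000):
    proposal = w + rng.normal(scale=0.1, size=3)
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(w):
        w = proposal                                  # accept the proposed weights
    if step >= 1000:                                  # discard burn-in samples
        samples.append(w)

w_mean = np.mean(samples, axis=0)
p_hazardous = 1 / (1 + np.exp(-X @ w_mean))
pred = (p_hazardous > 0.505).astype(float)            # decision threshold from the abstract
print("train accuracy:", (pred == y).mean())
```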

24 pages, 3063 KiB  
Article
Local Community Detection in Graph Streams with Anchors
by Konstantinos Christopoulos, Georgia Baltsou and Konstantinos Tsichlas
Information 2023, 14(6), 332; https://doi.org/10.3390/info14060332 - 12 Jun 2023
Cited by 2
Abstract
Community detection in dynamic networks is a challenging research problem. One of the main obstacles is the stability issues that arise during the evolution of communities. In dynamic networks, new communities may emerge and existing communities may disappear, grow, or shrink. As a result, a community can evolve into a completely different one, making it difficult to track its evolution (this is known as the drifting/identity problem). In this paper, we focused on the evolution of a single community. Our aim was to identify the community that contains a particularly important node, called the anchor, and to track its evolution over time. In this way, we circumvented the identity problem by allowing the anchor to define the core of the relevant community. We proposed a framework that tracks the evolution of the community defined by the anchor and verified its efficiency and effectiveness through experimental evaluation. Full article
(This article belongs to the Special Issue Multidimensional Data Structures and Big Data Management)
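A simplified, static-snapshot version of anchor-seeded local community detection can be sketched as a greedy conductance-based expansion around the anchor node; the sketch below uses networkx on a toy graph and does not reproduce the authors' streaming framework.

```python
# Greedy anchor-seeded community expansion on a single graph snapshot:
# grow the set around the anchor while adding a neighbor improves conductance.
import networkx as nx

def conductance(G, S):
    cut = nx.cut_size(G, S)                      # edges leaving the community
    vol = sum(dict(G.degree(S)).values())        # total degree inside the community
    return cut / vol if vol else 1.0

def local_community(G, anchor, max_size=50):
    community = {anchor}
    while len(community) < max_size:
        frontier = {n for u in community for n in G.neighbors(u)} - community
        if not frontier:
            break
        best = min(frontier, key=lambda n: conductance(G, community | {n}))
        if conductance(G, community | {best}) >= conductance(G, community):
            break                                # no neighbor improves conductance further
        community.add(best)
    return community

G = nx.karate_club_graph()                       # small example graph
print(sorted(local_community(G, anchor=0)))
```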

34 pages, 1619 KiB  
Article
AutoML with Bayesian Optimizations for Big Data Management
by Aristeidis Karras, Christos Karras, Nikolaos Schizas, Markos Avlonitis and Spyros Sioutas
Information 2023, 14(4), 223; https://doi.org/10.3390/info14040223 - 05 Apr 2023
Cited by 7
Abstract
The field of automated machine learning (AutoML) has gained significant attention in recent years due to its ability to automate the process of building and optimizing machine learning models. However, the increasing amount of big data being generated has presented new challenges for AutoML systems in terms of big data management. In this paper, we introduce Fabolas and learning curve extrapolation as two methods for accelerating hyperparameter optimization. Four methods for quickening training were presented including Bag of Little Bootstraps, k-means clustering for Support Vector Machines, subsample size selection for gradient descent, and subsampling for logistic regression. Additionally, we also discuss the use of Markov Chain Monte Carlo (MCMC) methods and other stochastic optimization techniques to improve the efficiency of AutoML systems in managing big data. These methods enhance various facets of the training process, making it feasible to combine them in diverse ways to gain further speedups. We review several combinations that have potential and provide a comprehensive understanding of the current state of AutoML and its potential for managing big data in various industries. Furthermore, we also mention the importance of parallel computing and distributed systems to improve the scalability of the AutoML systems while working with big data. Full article
(This article belongs to the Special Issue Multidimensional Data Structures and Big Data Management)
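One of the training speed-ups reviewed in the paper, the Bag of Little Bootstraps, can be sketched in a few lines of NumPy; the data, subset sizes, and resample counts below are arbitrary choices for illustration, not the paper's settings.

```python
# Bag of Little Bootstraps (BLB) sketch: estimate a statistic and its uncertainty by
# resampling small subsets up to the full sample size with multinomial weights.
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # synthetic data
n = len(data)

def blb_mean_ci(data, subsets=10, subset_size=2_000, resamples=50):
    intervals = []
    for _ in range(subsets):
        subset = rng.choice(data, size=subset_size, replace=False)
        estimates = []
        for _ in range(resamples):
            # weights sum to n, so each resample behaves like a full-sized bootstrap
            weights = rng.multinomial(n, np.ones(subset_size) / subset_size)
            estimates.append(np.average(subset, weights=weights))
        intervals.append(np.percentile(estimates, [2.5, 97.5]))
    return np.mean(intervals, axis=0)            # average the per-subset bounds

print("mean:", data.mean(), "95% CI:", blb_mean_ci(data))
```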

Review


31 pages, 183324 KiB  
Review
Linked Data Interfaces: A Survey
by Eleonora Bernasconi, Miguel Ceriani, Davide Di Pierro, Stefano Ferilli and Domenico Redavid
Information 2023, 14(9), 483; https://doi.org/10.3390/info14090483 - 30 Aug 2023
Cited by 3
Abstract
In the era of big data, linked data interfaces play a critical role in enabling access to and management of large-scale, heterogeneous datasets. This survey investigates forty-seven interfaces developed by the semantic web community in the context of the Web of Linked Data, displaying information about general topics and digital library contents. The interfaces are classified based on their interaction paradigm, the type of information they display, and the complexity reduction strategies they employ. The main purpose to be addressed is the possibility of categorizing a great number of available tools so that comparison among them becomes feasible and valuable. The analysis reveals that most interfaces use a hybrid interaction paradigm combining browsing, searching, and displaying information in lists or tables. Complexity reduction strategies, such as faceted search and summary visualization, are also identified. Emerging trends in linked data interface focus on user-centric design and advancements in semantic annotation methods, leveraging machine learning techniques for data enrichment and retrieval. Additionally, an interactive platform is provided to explore and compare data on the analyzed tools. Overall, there is no one-size-fits-all solution for developing linked data interfaces and tailoring the interaction paradigm and complexity reduction strategies to specific user needs is essential. Full article
(This article belongs to the Special Issue Multidimensional Data Structures and Big Data Management)
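Most of the surveyed interfaces ultimately sit on top of SPARQL endpoints. As a minimal illustration of the kind of query such interfaces render for users, the following sketch sends a hypothetical query to the public DBpedia endpoint over HTTP.

```python
# Minimal linked data query: fetch a few labels from the public DBpedia SPARQL endpoint.
# The query itself is a hypothetical example.
import requests

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  ?city a dbo:City ;
        dbo:country dbr:Greece ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
"""

resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=30,
)
for binding in resp.json()["results"]["bindings"]:
    print(binding["label"]["value"])
```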
