1. Introduction
Architectural heritage is a type of cultural heritage that is material and immovable. All physical objects created by various construction activities during the evolution and development of human society belong to architectural heritage. From small buildings and structures to large villages and cities, whether or not they are currently intact does not affect their basic material properties of being immovable. As a product of history, architectural heritage is marked by the era, reflecting the level of science and technology of the corresponding era, as well as the social, political, military, and cultural conditions of the time for a local area. Therefore, it is essential to let more people know the story behind architectural heritage, which is of great significance to the promotion and inheritance of local culture. To achieve this, we need a powerful tool to recommend existing architectural heritage based on user preferences.
Recently, digital technology has become widespread and has proven highly significant for cultural heritage [1,2,3,4]. Comprehensive digital archives are established for heritage sites by means of photography, scanning, and 3D printing. The rise of digital information technology has overcome many limitations of traditional conservation means. It provides richer technical options for recognition, protection, and presentation, and offers a breakthrough for solving the problems of heritage conservation. Digital tools also create more opportunities for cultural mediation and enable interactive tools that involve the public in discovering cultures, thus making them actors in their own knowledge acquisition processes. Such new technologies give us more ways to access and promote local architectural heritage. In particular, the deep learning (DL) method is currently showing its power in this area. The recent work by Tejedor et al. [5] introduces the future perspectives of the DL method for the diagnosis of heritage buildings. The methods reviewed in that paper cover several important topics, e.g., object detection, which can be integrated with UAVs and a GPS system [6], 3D models for the reconstruction of lost architectural heritage [7], and neural networks applied to heat flux meter (HFM) methods [8,9]. Some shortcomings of the DL method are also noted in that paper, especially limited dataset size and computation time cost. For the promotion of architectural heritage, we introduce another kind of DL method, namely, content-based image retrieval (CBIR), as a case study. A fine-tuning strategy is adopted to solve the data quantity problem, which is a common issue in real-world applications of DL. We detail our target, method, and contributions in the following parts of the paper.
With the rapid development of convolutional neural networks (CNNs) [10,11], DL [12,13,14] methods have achieved many encouraging results in heritage image analysis [15,16]. Among them, CBIR [17,18], which searches for images in a large image collection, can serve as a recommendation tool for local architectural heritage. Suppose we have an image collection that contains a variety of architectural heritage from a local area. For a query image (e.g., an image of a temple) uploaded by the user, the retrieval method returns the top-ranked images from the collection based on visual similarity. The retrieved images are the recommendations for the user. However, applying image retrieval to local architectural heritage faces an unavoidable problem: data insufficiency. A local area may simply not contain enough architectural heritage, yet adequate data is always an essential factor in training a DL model [19]. It is easy to gather images of some common classes (e.g., residential buildings) in a local area, while, for some landmark buildings (e.g., bridges), there may be only a few entities available. The lack of data and the data imbalance among classes will decrease retrieval accuracy or cause training to fail. Therefore, solving the problem of data insufficiency is the first priority.
As discussed above, it is almost impossible to collect enough data in a local area to train a DL retrieval model. However, can we collect data from a broader area for training purposes? For architectural heritage of the same class, feature distributions naturally differ among areas. For example, palaces in China tend to be majestic and magnificent, whereas palaces in Japan are graceful and delicate. However, because classes are defined by architectural function, buildings of the same class always share many common characteristics. Thus, for better training of the retrieval model, we designed a data fine-tuning strategy similar in spirit to transfer learning [20]. We can collect architectural heritage images from a broader area, or even images not limited to heritage, to form the source data and then transfer the shared knowledge to the local architectural heritage target data. Taking Jiangxi, China, as an example of a local area, our experiments show that this strategy yields better retrieval performance and enhances the feature extraction ability of the model for target data.
In this paper, we explore a case study on the recommendation of local architectural heritage. The main issue we have to address is the data quantity restriction for local architectural heritage. For this purpose, a new dataset named Chinese Architectural Heritage 10 (CAH10) is constructed. CAH10 contains 3080 images from 10 classic classes of Chinese architectural heritage (detailed in Section 3). We also design an image retrieval system using hash coding, which aims at accurate and efficient image retrieval. A CNN backbone is adopted to extract image features and a hashing module is used to perform the retrieval. Considering the real-world demands of an architectural heritage retrieval task for Jiangxi, China, we utilize a data fine-tuning strategy by first training the model on source data and then transferring the learned knowledge to target data. In this way, a small amount of local data suffices to train a high-performance retrieval model. Our experimental results show that this learning strategy can break the data quantity restriction and better specialize the backbone CNNs' feature extraction for target data. Furthermore, by examining the retrieval results and using the UMAP [21] embedding, we observe some interesting feature relations among different classes, which are meaningful for practical use. We also apply the proposed method to a real-world application. Based on the coordinate information of CAH10, an image-to-location application is provided. This application can provide the user with a better recommendation experience and extend the use of retrieval tasks.
We summarize our main contributions as follows:
We construct a new dataset CAH10 for traditional Chinese architectural heritage.
We propose a deep-hashing-based retrieval method that can realize high recommendation accuracy. The analysis of the retrieved results reveals the relations among different architectural heritage categories.
A data fine-tuning strategy is adopted to break the quantity restriction of local architectural heritage data. This strategy can also enable better image feature extraction of the retrieval model.
For a better user experience, an application of image-to-location is provided for building a recommendation system.
The rest of the paper is organized as follows. Section 2 introduces the existing research. We detail the constructed dataset in Section 3. Section 4 presents notations and the model, giving an overview of the proposed retrieval method and data fine-tuning strategy. In Section 5, we present a performance comparison of different retrieval methods and analyze the retrieved results of the proposed method. In Section 6, we further emphasize the advantages of our method. Limitations and future work are also discussed. Finally, conclusions are drawn in Section 7.
2. Related Works
Content-based image retrieval (CBIR) is a research branch in the field of computer vision that focuses on large-scale digital image content retrieval. A typical CBIR system allows users to input an image to find other images with the same or similar content. In contrast, traditional image retrieval is text-based, i.e., the query function is implemented using the name of the image, textual information, and indexing relationships. The core of CBIR is image retrieval using the visual features of images. Essentially, it is an approximate matching technique that integrates the technical achievements of various fields, such as computer vision, image processing, image understanding, and databases, in which feature extraction and indexing can be performed automatically by the computer, avoiding the subjectivity of manual description. In a typical retrieval process, the user provides a sample image (query), and the system then extracts the features of the query image, compares them with the features in the database, and returns the images most similar to the query to the user.
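The retrieval process described above (extract features, compare with the database, return the most similar images) can be illustrated with a minimal sketch. This is not the method of this paper: cosine similarity is chosen here only as a common example of a similarity measure, and the feature vectors and image names are hypothetical, hand-written stand-ins for features that a real system would extract automatically.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float],
             database: Dict[str, Sequence[float]],
             top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k database entries most similar to the query features."""
    scored = [(name, cosine_similarity(query, feat))
              for name, feat in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical feature vectors standing in for automatically extracted features.
database = {
    "temple_01": [0.9, 0.1, 0.0],
    "bridge_01": [0.0, 0.2, 0.9],
    "temple_02": [0.8, 0.3, 0.1],
}
results = retrieve([1.0, 0.2, 0.0], database, top_k=2)
```

In this toy example the two temple entries rank above the bridge entry because their feature vectors point in nearly the same direction as the query.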
Remarkable progress has been made in image feature representations for CBIR [22]. Among different kinds of image retrieval methods, the main advantage of hashing retrieval is its speed. Many hashing methods have been designed for approximate nearest-neighbor (ANN) search in Hamming space for image retrieval. Hashing-based methods map high-dimensional data into compact binary codes with a preset number of bits and generate similar binary codes for similar data items, which can greatly reduce computation and storage costs. In the early stages, hashing research focused on data-independent methods, such as locality-sensitive hashing (LSH) and its variants [23,24]. The major drawback of LSH is that long codes are necessary to achieve satisfactory search accuracy, which limits its application. Recently, deep hashing methods [25,26,27,28,29] have achieved strong performance in image hashing. CNNs are adopted as feature extractors and hash layers are applied to learn hash codes. A pairwise similarity loss or a triplet ranking loss is used for similarity learning. CNNH [30] adopts a two-stage strategy to maximize the Hamming distances between binary codes of dissimilar images and minimize the Hamming distances of similar images. HashNet [26] adopts weighted maximum likelihood (WML) estimation to alleviate severe data imbalance by adding weights to the pairwise loss function. The Hamming distance between hash codes of images is forced to be greater than a certain threshold. Works such as HashNet [26,30] directly treat data pairs as similar if they share at least one category.
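To make the role of Hamming space concrete, the following minimal sketch ranks database items by the Hamming distance between binary codes and shows the "share at least one category" rule for labeling pairs. It is an illustration only, not the implementation of any cited method: the 8-bit codes, the image names, and the helper names (`hamming_distance`, `rank_by_hamming`, `pair_is_similar`) are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two binary codes stored as ints."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code: int,
                    database: Dict[str, int],
                    top_k: int = 3) -> List[Tuple[str, int]]:
    """Return the top_k entries with the smallest Hamming distance to the query."""
    scored = [(name, hamming_distance(query_code, code))
              for name, code in database.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


def pair_is_similar(labels_a: Set[str], labels_b: Set[str]) -> bool:
    """Pairwise label: two images count as similar if they share a category."""
    return len(labels_a & labels_b) > 0


# Hypothetical 8-bit hash codes; real systems typically use longer codes.
database = {
    "pavilion_01": 0b10110010,
    "tower_01": 0b01001101,
    "pavilion_02": 0b10110110,
}
ranked = rank_by_hamming(0b10110011, database, top_k=2)
```

Because the comparison reduces to an XOR followed by a bit count, ranking in Hamming space is cheap in both time and memory, which is the speed advantage noted above.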
Due to its many applications, deep-learning-based image retrieval has been applied to cultural heritage as well. Gupta et al. [17] curated a novel dataset consisting of monument images of historical, cultural, and religious importance from all over India. Their retrieval task is based on the architectural characteristics of each class and holistically infuses semantic-hierarchy-preserving embeddings to learn deep representations for retrieval in the domain of Indian heritage monuments. Sipiran et al. [18] present a benchmark of cultural heritage objects for the evaluation of 3D shape retrieval methods. Their experiments and results show that learning methods on image-based multiview representations are suitable for tackling 3D retrieval in the cultural heritage domain. Belhi et al. [31] explore the difference between CNN features and classical features for a large-scale cultural image retrieval task. Their method can quickly identify the closest clusters and then only matches images from the selected clusters. Liu et al. [32] use modern information technology to efficiently retrieve images of national costumes. This research lays a good foundation for the informatization of national costumes and is conducive to the inheritance and protection of intangible cultural heritage. Unlike previous works, where data are sufficient for training a deep model, in this paper we deal with a real-world task in which data are scarce.
3. Data
CAH10 contains 3080 images of 10 classic classes belonging to Chinese architectural heritage. We show image samples for each class in Figure 1a–j. The CAH10 dataset consists of three subsets: the target set, source set, and query set. Here, we briefly explain how we collect these data and what purpose each subset serves.
Source set: We use the images of the source set to pretrain our retrieval model. It contains a random 90% split of an image collection. The images in the collection are found by searching with each heritage class name as the keyword and are then selected by a specialist based on the class definition. The selected images cover different regions and cultures; synthetic images are also included.
Query set: The query set contains the remaining 10% of the image collection. All these images are used to evaluate the model retrieval accuracy on the source set or target set. This set also serves as the possible user input to demonstrate what will be retrieved by the model.
Target set: This set contains the local architectural heritage images, which are the entities that our retrieval system aims to recommend. We collect 285 images from Jiangxi, China. To support the image-to-location recommendation, each image is annotated with its geographical coordinates. Note that the images from the target set are excluded from both the source set and the query set.
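The relationship among the three subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the file names, the fixed seed, and the helper `split_source_query` are hypothetical, and the split is drawn over the whole collection, whereas the original split may have been performed per class.

```python
import random
from typing import List, Sequence, Tuple


def split_source_query(image_ids: Sequence[str],
                       source_fraction: float = 0.9,
                       seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split a collection into a source set and a query set."""
    shuffled = list(image_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * source_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names; the real collection is much larger.
collection = [f"collected_{i:04d}.jpg" for i in range(100)]
target_set = [f"jiangxi_{i:04d}.jpg" for i in range(10)]

# Target images are kept out of the collection before splitting.
collection = [name for name in collection if name not in set(target_set)]
source_set, query_set = split_source_query(collection)
```

With 100 collected images, this yields 90 source images and 10 query images, and the target set shares no image with either of them.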
The specific data distribution of CAH10 is shown in Table 1. We can observe that the total image number of the source set is almost 10 times that of the target set. The classification of traditional Chinese architecture in this paper follows classification methods for architectural heritage that mainly derive from the international arena, in which architectural heritage is classified according to the functions of buildings. The “Convention Concerning the Protection of the World Cultural and Natural Heritage”, the “Hoi An Protocols for Best Conservation Practice in Asia”, the “Burra Charter”, and the “International Symposium on the Concepts and Practices of Conservation and Restoration of Historic Buildings in East Asia” all require the classification of cultural relics according to the function and value of architectural heritage. For the specific types of architectural heritage, we mainly refer to two laws, the “Law of the People’s Republic of China on Protection of Cultural Relics” and “The Regulation on the Protection of Famous Historical and Cultural Cities, Towns, and Villages”. Combined with the local laws of Jiangxi Province, such as the “Regulations on the Protection of Cultural Relics in Jiangxi Province” and the “Regulations on the Protection of Revolutionary Cultural Relics in Jiangxi Province”, the classification of building types was further supplemented and improved, yielding the ten architectural classes in CAH10. Of these ten categories, seven (bridges, residential buildings, palaces, temples, theatres, towers, and modern historic buildings) are also specifically classified in Western architecture, while ancestral halls, memorial archways, and pavilions are unique architectural forms in China. According to China’s current cultural relics protection regulations, the archway is unique to China in form and similar in function to the column and triumphal arch.
This classification method is based on international and Chinese cultural heritage protection and management regulations and further subdivides building types according to local regulations in Jiangxi Province.