Measuring Patent Similarity Based on Text Mining and Image Recognition

Lin, Wenguang; Yu, Wenqiang; Xiao, Renbin

doi:10.3390/systems11060294

by Wenguang Lin ¹, Wenqiang Yu ¹ and Renbin Xiao ²,*

¹ School of Mechanical and Automotive Engineering, Xiamen University of Technology, Xiamen 361024, China

² School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China

* Author to whom correspondence should be addressed.

Systems 2023, 11(6), 294; https://doi.org/10.3390/systems11060294

Submission received: 23 March 2023 / Revised: 30 May 2023 / Accepted: 6 June 2023 / Published: 8 June 2023

(This article belongs to the Section Systems Practice in Engineering)

Abstract:

Patent application is one of the important ways to protect innovation achievements that have great commercial value for enterprises; it is the initial step for enterprises to set their business development track, as well as a powerful means to protect their core competitiveness. The rapid growth in the volume of patent data makes effective patent detection difficult, and patent infringement cases occur frequently. Manual measurement in patent detection is slow, costly, and subjective, and can only play an auxiliary role in measuring the validity of patents. Protecting the inventive achievements of patent holders and realizing more accurate and effective patent detection are therefore issues that academics continue to explore. There are five main methods to measure patent similarity: the clustering-based method, the vector space model (VSM)-based method, the subject–action–object (SAO) structure-based method, the deep learning-based method, and the patent structure-based method. To address the limitations of these methods, this paper proposes a calculation method that fuses the similarity of patent text and patent images. Firstly, the SAO structure extraction technique is applied to the patent text to obtain its effective content, and the SAO structures are compared for similarity; secondly, the patent image information is extracted and compared; finally, the patent similarity is obtained by fusing the two aspects of information. The feasibility and effectiveness of the scheme are demonstrated by studying a large number of patent similarity cases in the field of mechanical structures.

Keywords: SAO structure; image contour extraction; text mining; patent similarity

1. Introduction

In the current era of economic globalization, every country is constantly innovating and developing. The patent is an effective carrier of a country’s advanced technology and contains rich technical information. It is estimated that 70–90% of patent information is not disclosed elsewhere, so patents have a higher technological content than other technology carriers [1]. At the early stage of national development, domestic attention to patents was insufficient: some private enterprises had little knowledge of patents, some understood patents but paid no attention to them, and only a few recognized their importance. However, with the development of economic globalization and increasing competition in various industries, countries have paid more and more attention to patents and established relevant institutions, the state has increasingly focused on combating piracy and protecting patents, and domestic enterprises have also begun to pay attention to patents [2,3]. In recent years, the number of patent applications has been increasing, and China in particular has become the leading country in patent applications. Since the beginning of the 21st century, the number of patent applications has continued to increase; according to figures released by the World Intellectual Property Organization (WIPO), the number of patent applications in 2022 was as high as 278,100, with Asia accounting for 54.7% of the total. China continued to be the largest source of Patent Cooperation Treaty applications with 70,015 applications, and the United States was in second place with 59,056 applications [4]. WIPO’s Global Innovation Index report shows that Switzerland, the United States, and Sweden are the top three for innovation capacity [5]. International IP filings remained essentially unchanged in 2022. Despite challenging economic conditions and reduced venture funding, companies are investing in innovation [6].

Patent infringement is a very important legal and commercial issue, especially in the field of intellectual property, and it has become increasingly prominent with the continuous advancement of innovation and research and development.

The analysis of patent infringement can be considered from both legal and economic perspectives. From a legal point of view, patent infringement is punishable by law, and patent owners can defend their rights and interests through litigation and other means. In addition, according to the laws of different countries or regions, the compensation standard and the amount of compensation for patent infringement may also vary. Therefore, when analyzing the issue of patent infringement, a thorough study of the relevant laws is required to ensure that the legitimate rights and interests of the patent owner are protected.

From a business perspective, patent infringement can have a negative impact on a company’s business interests, as infringement can lead to problems such as a reduction in the patentee’s market share and damage to its brand reputation. In addition, patent infringement can create financial and commercial risks for the company. Therefore, companies must adequately protect their patents and products to reduce the risk of patent infringement.

The patent application process usually consists of three parts: the patent invention process, the patent application process, and the patent examination process. During the application process, patents of the same type need to be investigated so that the new application does not duplicate existing inventions, while in the examination process, the patent examiner needs to review similar patents to assess the validity of the patent. However, the number of patents is increasing, and manual examination can no longer achieve the expected efficiency, so academics are exploring how to assess the validity of patents more intelligently. Patent similarity detection is of great importance. First, in terms of technology research, companies need to understand the patent situation in the market to avoid unnecessary legal disputes and develop more accurate market strategies and development paths. Second, patent similarity detection supports innovation: researchers can learn about existing patents in related fields to avoid duplication of work and find better research directions. Finally, patent similarity detection helps protect intellectual property rights, reduce the risk of infringement, and protect the legal rights of patent owners. By accurately assessing the similarity between patents, the uniqueness and validity of patents can be ensured, which promotes innovation and provides a reliable IP protection mechanism for enterprises and researchers. Achieving a faster and more accurate measurement of patent similarity is the key to evaluating patent validity [7]. At present, patent infringement detection methods (PIDM) are mainly divided into the following: PIDM based on clustering, PIDM based on VSM, PIDM based on SAO structures, PIDM based on deep learning, and PIDM based on patent structures. Existing patent similarity detection methods have various problems. For example, the clustering-based method has low accuracy, the vector space-based method has a high cost, the SAO structure-based method requires highly accurate structure extraction, and the deep learning-based method is difficult to interpret.

In order to improve the efficiency of patent similarity measurement and reduce time and cost, this paper builds on existing research and proposes a multimodal patent similarity detection algorithm that combines patent text and images. The method extracts the SAO structures and image features of a patent, uses the SAO structures to measure patent text similarity, extracts the feature information within the image contour to measure patent image similarity, and weights the two kinds of information to obtain the overall patent similarity, which measures patent similarity more effectively.

2. Literature Review

2.1. Patent Similarity

The clustering-based PIDM puts all detected patents together to generate one or more clusters, and those clustered with the target patents are more likely to be infringing patents. In particular, Jeong [8] extracted problem solved concept (PSC) terms and constructed a PSC-based map, clustering and evaluating them to explore opportunities for new patent creation. Zhu [9] combined a self-organizing neural network (SOM) with fuzzy C-means (FCM) clustering to obtain a SOM-based FCM algorithm, which improved the quality of clustering, automatically identified patents similar to the patents under investigation and designed a patent infringement detection system. Lee et al. [10] utilized a principal component analysis (PCA) algorithm to cluster and visualize keyword space vectors. Yoon et al. [11] converted each patent document into a vector by extracting keywords, used PCA to reduce dimensionality, and finally performed SOFM with the vector as input to create patent maps for clustering purposes. Lai et al. [12] proposed a method, called the bibliometrics-based patent co-citation approach, by analyzing the co-citations of target patents using clustering methods for cited patents and creating a patent classification system. However, the clustering-based approach can only cluster several different classes, and there will be a large number of patents in the same class as the target patent, which cannot effectively reduce the examination work.

VSM-based PIDM converts text into spatial vectors and measures patent similarity by comparing the similarity of these vectors. Magerman et al. [13] measured patent similarity using VSM and latent semantic analysis. Yoon et al. [14] used the Doc2Vec model [15] to measure patent similarity and predict the future direction of technology development from the constructed patent network. The Doc2Vec model was improved from the Word2Vec model [16] by replacing the original spatial vector for word detection with the spatial vector for paragraph detection. SAO2Vec [17,18] is an improved spatial vector model based on Doc2Vec. The vector space model is relatively easy to construct, but the dimensionality of the vectors is positively related to the size of the corpus, and the vectors constructed from a large-scale corpus are high-dimensional and sparse, which makes the computation more complicated.

SAO-based PIDM analyzes information such as the parts of speech in a sentence and obtains the desired structure using natural language processing (NLP) techniques. Park et al. [19] used a WordNet-based SAO structure to measure patent similarity and used multidimensional scaling to map patent relationships onto a two-dimensional space and group patents that could infringe. Li et al. [20] used the SAO structure to measure patent similarity and extended it with the Sorensen–Dice index [21,22], which has good flexibility and robustness. Yoon [23,24] also used the SAO structure to measure patent similarity and then used the similarity to analyze potential competitors and partners. Park et al. [25] proposed a patent infringement map based on SAO semantic similarity to identify patent infringement. The calculation of patent similarity based on the SAO structure depends heavily on the quality of the extracted SAO structures, and manual annotation is required if higher-quality SAO structures are to be obtained.

Deep learning has developed rapidly in recent years, with significant achievements in text, image, and audio processing, and many researchers have applied deep learning techniques to the field of patents. Lu et al. [26] proposed a deep learning-based patent citation classification model that uses convolutional neural networks (CNN) at the document encoding level and introduces a multilayer perceptron to gradually compress and extract the most relevant features and adjust the nonlinear relationships. Ma et al. [27] constructed a patent model tree, compared the advantages and disadvantages of CNN, RNN, LSTM, and Siamese LSTM, and established that Siamese LSTM [28,29] has obvious advantages among them. Deep learning-based PIDM uses neural network models for vectorized representation. Although its accuracy is high, the model is difficult to interpret, data for specialized fields are difficult to obtain, and a large amount of manual involvement is required at the initial stage.

A patent is composed of several structural components, such as the inventor, application number, filing date, IPC classification number, abstract, and claims. The last category, patent structure-based PIDM, considers these components. Zhang et al. [30] used an IPC classification model and a semantic model to evaluate patent similarity by organizing patent terms into a tree with different layers, each layer having its own weight, and measuring patent similarity by calculating tree similarity. Fujii et al. [31] used punctuation to segment the claims and Okapi BM25 [32] to obtain paragraph similarity, and then accumulated the results to obtain the overall patent text similarity. Among citation methods [33], Lee et al. [34] proposed a stochastic patent citation analysis method, and Rodriguez et al. [35] proposed a similarity measure in citation networks that exploits direct and indirect co-citation links between patents. Klavans and Boyack [36] compared the accuracy of direct citation, bibliographic coupling, and co-citation in representing knowledge classification; in general, the classification based on direct citation was better than the others. Wu et al. [37] also proposed a method for evaluating patent similarity by considering direct and indirect citations. Cheng et al. [38] used USPC and IPC construction techniques and functional class matrices to measure patent similarity. Similarity based on the patent structure is more relevant, but for patent infringement each part has a different weight, and manual weighting is resource-intensive and less feasible.

2.2. SAO Semantic Analysis

An SAO structure is a construction in which the subject (S) and object (O) of a sentence are related by an action (A). An SAO structure concisely reflects the content of a sentence, giving a complete picture of how two things are related or affect each other. For example, in the sentence “The shower sprays water”, “shower” is the subject, “sprays” is the action, and “water” is the object. Similar to the SAO structure, the subject–predicate–object (SPO) structure, which consists of subject elements, object elements, and the relationships between them, can be considered a semantic network and is widely used for knowledge discovery in biomedical literature [39], while SAO is commonly used for text mining in patent documents [40].

SAO structure extraction is an NLP technique that is favored by scholars and has received wide attention, and its capability has grown with the continuous improvement of machine learning algorithms. For example, Kim et al. [41] analyzed the “for” and “to” phrases and verbal forms of object elements to explore the purpose and effect of a technique in depth. Miao et al. [42] used the purpose relationship between the SAO structure and the technology–relationship–technology structure to mine technology solutions and functional information. He et al. [43] proposed a potential technology requirement identification model based on semantic analysis of the SAO structure. They realized the layout and visualization of requirements based on the technology life cycle to guide the direction of technology development and optimize resource allocation. Li et al. [44] used the Unified Medical Language System to evaluate the similarity between SAO structures in the field of medical patents. Yoon [23,24] also used SAO structures to measure patent similarity and then used the similarity to analyze potential competitors and partners. Using NLP techniques, SAO structures can be mined rapidly from text.

SAO triples are extracted from the patent text. The subject and object are usually nouns, representing the performer and the thing acted upon, respectively, and the predicate serves as the action that links the subject and object [45]. A single sentence may contain one SAO structure or several. The SAO structures found in patent text are summarized in Table 1. The similarity between patents can be translated into the similarity of their SAO sets, as shown in Figure 1.

The key to applying the SAO structure in the patent field is the quality of the extracted SAO structures. Manual extraction is the most accurate method, but it requires a great deal of effort and is too inefficient to be practical for a large number of patents. With the development of NLP, however, it has become possible to extract SAO structures using NLP tools.

2.3. Contour Detection

Contour detection refers to the process of extracting the target contour while ignoring texture and noise interference within the image [46]. Traditional contour detection is broadly classified into three types: pixel-based, edge-based, and region-based methods. The pixel-based approach is concerned with the discontinuity of the image boundary: sharp changes in the pixels around a contour indicate a change of region. This approach relies on linear filtering [47,48,49], such as the Prewitt operator, Sobel operator, and Canny operator. Later, many scholars proposed the use of higher-level features such as luminance, color, and texture gradients [50], and the combination of these features improved robustness. The edge-based approach considers the overall image information and divides the contour extraction process into edge detection and edge grouping [51]. Individual edge points are collected and linked into continuous curves, irrelevant data are eliminated, and the remaining data are rearranged, with each group corresponding to a specific object [52]. Early methods used empirical statistics to determine the likelihood that edge elements belong to the same contour, after which Elder [53] added Bayesian inference methods, while Mahamud [54] introduced the concept of contour saliency to identify smooth closed contours. Finally, with regard to region-based approaches, Arbelaez et al. [55,56] proposed the concept of ultrametric contour maps, in which local contrast and regional contribution determine the dissimilarity of adjacent regions, and the key to their method lies in the definition of the ultrametric distance. The region-based method is more robust to noise and can adapt to relatively uneven contours.

3. Data Collection

Showerheads are widely used in daily life. With the continuous development of society, people’s demands for showerhead products have also increased: showerheads no longer have only the single function of spraying water, but have added functions such as disinfection, spraying bath products, and even massage. As a product in the traditional mechanical field, the shower is characterized by a variety of functions, a mature market, and sufficient patent applications. For this reason, the product was chosen as the research object for the experiment. In this paper, the patent database of the United States Patent and Trademark Office was searched with “showerhead” as the keyword, and the handheld shower patents from the past ten years were downloaded for testing; the total number of patents was 131. Some of these patents are listed in Table 2.

4. TF-IDF

The term frequency-inverse document frequency (TF-IDF) model is a statistical method that evaluates how important a word is to a text within a corpus, and it is a common model for calculating text similarity. The calculation process is shown in Figure 2.

1. Calculate the term frequency: The term frequency is the number of times a word appears in an article. To make articles of different lengths comparable, the term frequency is normalized by dividing the number of occurrences by the total number of words in the article.

\text{term frequency} = \frac{\text{number of times a word appears in the article}}{\text{total number of words in the article}}

(1)

2. Calculate the inverse document frequency: A corpus is a collection of articles that models the language environment. The more documents a word appears in, the larger the denominator becomes and the closer the inverse document frequency is to zero. One is added to the denominator to prevent it from being 0 (i.e., when no document contains the word); log denotes the logarithm of the resulting value.

\text{inverse document frequency} = \log \left( \frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word} + 1} \right)

(2)

3. Calculate the TF-IDF: TF-IDF is proportional to the number of occurrences of a word in the document and inversely related to how common the word is across the entire corpus. The algorithm for automatic keyword extraction therefore follows directly: the TF-IDF value is calculated for each word in the document, and the top 100 words are taken in descending order. For visualization, words are sorted by TF-IDF value and the top 50 words are kept. Figure 3 shows a heat map of TF-IDF values for these words in some patents.

\text{TF-IDF} = \text{term frequency} \times \text{inverse document frequency}

(3)

4. Build a word frequency matrix: The number of rows is the number of texts, the number of columns is the number of words, and each row vector represents the frequency of the words contained in the corresponding text.
5. Calculate the cosine similarity: Given two attribute vectors, A and B, the cosine similarity is obtained from their dot product and their lengths, as shown in Equation (4).

\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_{i} \times B_{i}}{\sqrt{\sum_{i=1}^{n} (A_{i})^{2}} \times \sqrt{\sum_{i=1}^{n} (B_{i})^{2}}}

(4)
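
The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: documents are assumed to be already tokenized into lists of words, the natural logarithm is assumed for Equation (2), and the function names (`term_frequency`, `inverse_document_frequency`, `tfidf_vectors`, `cosine_similarity`) are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Equation (1): occurrences of each word divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def inverse_document_frequency(corpus):
    """Equation (2): log(total documents / (documents containing the word + 1))."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / (df + 1)) for word, df in doc_freq.items()}


def tfidf_vectors(corpus):
    """Equation (3) and step 4: one TF-IDF row vector per document."""
    idf = inverse_document_frequency(corpus)
    vocab = sorted(idf)
    vectors = []
    for doc in corpus:
        tf = term_frequency(doc)
        vectors.append([tf.get(word, 0.0) * idf[word] for word in vocab])
    return vocab, vectors


def cosine_similarity(a, b):
    """Equation (4): dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy corpus (invented for illustration, not taken from the patent data set).
corpus = [
    ["shower", "sprays", "water", "nozzle"],
    ["shower", "nozzle", "sprays", "mist"],
    ["valve", "controls", "flow", "pipe"],
    ["handle", "grip", "pipe", "hose"],
]
vocab, vectors = tfidf_vectors(corpus)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

Note that with the "+1" in the denominator of Equation (2), a word that occurs in every document receives a slightly negative weight; the sketch keeps the formula as written.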

5. SAO Structure

5.1. SAO Structure Extraction and Cleaning

To better extract the SAO structure, this paper uses a method based on dependency parsing to extract triples from patents. NLP techniques for SAO structure extraction are improving, but it is still impossible to accurately extract all valid SAO structures, and some noise is inevitable, so cleaning the extracted SAO structures is a necessary step. Figure 4 illustrates the SAO extraction process, and the main steps are as follows:

1. Segment the text into independent sentences.
2. Perform dependency parsing on the sentences.
3. Extract all SAO structures.
4. Clean the SAO structures and remove the meaningless ones.

The whole text of the patent is divided into sentences, and the SAO structures are extracted from each sentence. The spaCy library has advantages in execution speed and accuracy, so spaCy is used to perform part-of-speech tagging and dependency parsing on the text and to extract the subject, predicate, and object, some of which may contain multiple sets of keywords. The text content of patent US20180318860A1 was subjected to SAO structure extraction, and some of the SAO structures are shown in Table 3. The number of SAO structures extracted from each patent is shown in Table 4.
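
The extraction and cleaning steps can be illustrated on an already parsed sentence. The sketch below is a simplified stand-in, not the paper's pipeline: it assumes each token carries a part-of-speech tag, a dependency label, and the index of its head (the kind of information a dependency parser such as spaCy provides), and it assumes the label names `nsubj` and `dobj`. The `Token` class, the `NOISE_WORDS` list, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    pos: str   # part-of-speech tag, e.g. "NOUN" or "VERB"
    dep: str   # dependency label, e.g. "nsubj" or "dobj" (assumed label scheme)
    head: int  # index of the syntactic head within the sentence


SUBJECT_DEPS = {"nsubj"}
OBJECT_DEPS = {"dobj"}
# Hypothetical stop list used for cleaning; a real list would be tuned by hand.
NOISE_WORDS = {"it", "which", "that", "this", "they"}


def extract_sao(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Collect (subject, action, object) triples around every verb."""
    triples = []
    for i, tok in enumerate(sentence):
        if tok.pos != "VERB":
            continue
        subjects = [t.text for t in sentence if t.head == i and t.dep in SUBJECT_DEPS]
        objects = [t.text for t in sentence if t.head == i and t.dep in OBJECT_DEPS]
        for s in subjects:
            for o in objects:
                triples.append((s, tok.text, o))
    return triples


def clean_sao(triples: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Drop triples whose subject or object is a noise word, and drop duplicates."""
    seen = set()
    cleaned = []
    for s, a, o in triples:
        if s.lower() in NOISE_WORDS or o.lower() in NOISE_WORDS:
            continue
        if (s, a, o) not in seen:
            seen.add((s, a, o))
            cleaned.append((s, a, o))
    return cleaned


# "The shower sprays water", parsed by hand for illustration.
sentence = [
    Token("The", "DET", "det", 1),
    Token("shower", "NOUN", "nsubj", 2),
    Token("sprays", "VERB", "ROOT", 2),
    Token("water", "NOUN", "dobj", 2),
]
print(clean_sao(extract_sao(sentence)))
```

Passive constructions, prepositional objects, and conjunctions would need additional rules; the sketch covers only the direct subject-verb-object pattern.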

5.2. SAO Structure Semantic Similarity Calculation

Each patent text is represented as a collection of SAO structures, and each SAO structure consists of a subject, a predicate, and an object. The similarity of SAO structures is obtained from the similarity of internal elements, so the similarity between internal elements, i.e., the similarity between words, is measured first.

In this paper, the Word2Vec model is chosen to compute the semantic similarity between words. The Word2Vec model is a language model proposed by Mikolov et al. [16] based on the NNLM model of Bengio et al. [57] and the log-bilinear model of Hinton et al. [58]. After optimization on a given corpus, words can be quickly and efficiently trained into vector form, which provides an effective tool for the subsequent word similarity calculation. Word2Vec contains two core architectures, the CBOW model and the Skip-gram model, as shown in Figure 5. The CBOW model predicts the probability of occurrence of the current word w(t) from its context, while the Skip-gram model does the opposite, predicting the probability of occurrence of several words before and after the current word w(t) from that word. Skip-gram is less efficient but has relatively high accuracy, so this paper uses the Skip-gram model as the training model in order to prioritize accuracy.

The Skip-gram model uses a three-layer network structure to train word vectors, consisting of an input layer, a hidden layer, and an output layer. The input layer is the one-hot encoding of the input word, while the hidden and output layers correspond to the two weight matrices W_{1} and W_{2}. The central word matrix W_{1} has the dimension V × N and the surrounding word matrix W_{2} has the dimension N × V, where V is the size of the lexicon and N is the dimensionality of the constructed word vector. Multiplying the one-hot encoding of the input layer by the matrix W_{1} yields a reduced-dimension vector, which can be regarded as the central word vector representation of the word. This vector is then multiplied by the surrounding word matrix W_{2}, which captures the influence of the surrounding words on the word; the result is a vector of size 1 × V, which is normalized by softmax to obtain the predicted probability values. The difference between the predicted probabilities and the true values is the loss, and according to this loss, the matrices W_{1} and W_{2} are adjusted using the backpropagation algorithm to make the prediction more accurate. The training objective function for a sequence of words is formulated as:

l = \frac{1}{N} \sum_{t=1}^{N} \sum_{-k \leq c \leq k} \log P(\text{word}_{t+c} \mid \text{word}_{t})

(5)

In this formula, N is the number of words in the training sequence (not the vector dimensionality used above) and k is the window size. The larger the window, the more information is captured and the more accurate the result, but the efficiency decreases. After training, each word has its own vector representation, and the similarity between two words is measured by the cosine similarity of their vectors.
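
To make the forward pass and the objective in Equation (5) concrete, the following sketch computes the softmax prediction for a central word and the average log-probability over a short sequence. It is an illustration under stated assumptions only: the matrices are small random lists rather than trained weights, backpropagation is omitted, and the term c = 0 is skipped, as is usual for Skip-gram, although Equation (5) as printed does not exclude it. All function names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Normalize raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skipgram_forward(center, W1, W2):
    """One-hot(center) x W1 (V x N) selects a row; that row x W2 (N x V) gives scores."""
    hidden = W1[center]
    vocab_size = len(W2[0])
    scores = [sum(h * W2[n][v] for n, h in enumerate(hidden)) for v in range(vocab_size)]
    return softmax(scores)


def average_log_likelihood(word_ids, W1, W2, k):
    """Equation (5), skipping c = 0 and positions outside the sequence (assumptions)."""
    total = 0.0
    for t, center in enumerate(word_ids):
        probs = skipgram_forward(center, W1, W2)
        for c in range(-k, k + 1):
            if c == 0 or not 0 <= t + c < len(word_ids):
                continue
            total += math.log(probs[word_ids[t + c]])
    return total / len(word_ids)


def random_matrix(rows, cols, rng):
    """Small random initial weights, standing in for untrained W1 and W2."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
V, N = 5, 3                     # lexicon size and word-vector dimensionality
W1 = random_matrix(V, N, rng)   # central word matrix
W2 = random_matrix(N, V, rng)   # surrounding word matrix
probs = skipgram_forward(2, W1, W2)
print(len(probs), sum(probs))
print(average_log_likelihood([0, 1, 2, 3, 4], W1, W2, k=2))
```

Training would adjust W1 and W2 to increase this average log-likelihood; in practice a library implementation of Word2Vec would be used instead of a hand-written loop.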

In practice, experiments show that word vectors generated by Word2Vec training are not as accurate as those of the NNLM model; however, given a sufficiently large corpus, the word vectors generated by Word2Vec become increasingly accurate. Therefore, Word2Vec can be trained on English Wikipedia to obtain a highly accurate word vector model.

The S and O elements in the extracted SAO structures may appear in singular or plural form, although in practice both forms refer to the same object, so it is important to unify the word forms and eliminate this distinction. LemmInflect is a Python module for lemmatizing English words. It uses a dictionary to reduce English words to their base forms, and its accuracy is higher than that of NLTK, spaCy, and Stanford CoreNLP. For example, the dictionary maps “pipelines” to “pipeline”, “showers” to “shower”, and “plays, played, playing” to “play”, so a word can be restored to its base form simply by consulting the dictionary.

In an SAO structure, S and O are nouns, so their similarities can be cross-calculated, whereas A is a verb and is calculated separately. The specific computation is shown in Figure 6.

The formula for calculating the similarity between two SAO structures is as follows:

\mathrm{Sim}(SAO_{i}, SAO_{j}) =
\begin{cases}
\frac{1}{3} \cdot \frac{\mathrm{Sim}(S_{(i)}, S_{(j)}) + \mathrm{Sim}(O_{(i)}, O_{(j)})}{2} + \frac{1}{3} \mathrm{Sim}(A_{(i)}, A_{(j)}), & \mathrm{Sim}(S_{(i)}, S_{(j)}) + \mathrm{Sim}(O_{(i)}, O_{(j)}) \geq \mathrm{Sim}(S_{(i)}, O_{(j)}) + \mathrm{Sim}(O_{(i)}, S_{(j)}) \\
\frac{1}{3} \cdot \frac{\mathrm{Sim}(S_{(i)}, O_{(j)}) + \mathrm{Sim}(O_{(i)}, S_{(j)})}{2} + \frac{1}{3} \mathrm{Sim}(A_{(i)}, A_{(j)}), & \mathrm{Sim}(S_{(i)}, S_{(j)}) + \mathrm{Sim}(O_{(i)}, O_{(j)}) < \mathrm{Sim}(S_{(i)}, O_{(j)}) + \mathrm{Sim}(O_{(i)}, S_{(j)})
\end{cases}

(6)
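
Equation (6) can be read as follows: take whichever noun pairing (direct or crossed) gives the larger summed similarity, average it, and combine it with the action similarity. The sketch below encodes that reading. It assumes a word-level similarity function is supplied by the caller (for example, the cosine similarity of Word2Vec vectors); the function name `sao_similarity`, the parameters `w_so` and `w_a`, and the toy `exact_match` function are hypothetical, and the default weights are copied from Equation (6) as printed.

```python
def sao_similarity(sao_i, sao_j, word_sim, w_so=1 / 3, w_a=1 / 3):
    """Similarity of two (subject, action, object) triples following Equation (6).

    word_sim(a, b) is any word-level similarity in [0, 1]; w_so and w_a are the
    weights of the noun part and the action part as printed in Equation (6).
    """
    s_i, a_i, o_i = sao_i
    s_j, a_j, o_j = sao_j
    direct = word_sim(s_i, s_j) + word_sim(o_i, o_j)   # S with S, O with O
    crossed = word_sim(s_i, o_j) + word_sim(o_i, s_j)  # S with O, O with S
    noun_part = max(direct, crossed) / 2               # the two cases of Equation (6)
    return w_so * noun_part + w_a * word_sim(a_i, a_j)


def exact_match(a, b):
    """Toy word similarity for illustration: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


first = ("shower", "spray", "water")
swapped = ("water", "spray", "shower")
print(sao_similarity(first, swapped, exact_match))
```

Taking the maximum of the two pairings is equivalent to the case distinction in Equation (6), because each branch applies when its own pairing has the larger (or equal) sum.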

5.3. Patent Similarity Calculations

After obtaining the similarity of SAO structures, the Hungarian algorithm is used to find the maximum matching between two SAO sets, as shown in Figure 7, where the red lines represent the matching result. The Hungarian algorithm is a combinatorial optimization algorithm that solves the task assignment problem in polynomial time and was later applied to matching problems in graph theory.

In this paper, a threshold P is set. If the similarity of two SAO structures reaches the threshold, the two structures are regarded as a possible match. SAO1(i) (1 ≤ i ≤ n) represents all the SAO structures in patent 1, and SAO2(j) (1 ≤ j ≤ m) represents all the SAO structures in patent 2. However, an SAO structure in one patent may match multiple SAO structures in the other patent, so a matching must satisfy two conditions:

(1) The matching is a set of edges.
(2) No two edges in this set share a common vertex.

Therefore, this paper uses the Hungarian algorithm to achieve maximum matching.

\mathrm{sim}_{\mathrm{text}} = \frac{2 \times \mathrm{hungarian}}{\mathrm{Num\_SAO}(p) + \mathrm{Num\_SAO}(pk)}

(7)
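
The matching step and Equation (7) can be sketched with the augmenting-path form of the Hungarian algorithm for unweighted bipartite graphs. The sketch rests on two assumptions that the text does not state explicitly: "hungarian" in Equation (7) is taken to be the number of matched SAO pairs, and "reaches the threshold" is taken to mean a similarity greater than or equal to P. The function names are hypothetical, and `sim_matrix[i][j]` is assumed to hold the similarity between SAO1(i) and SAO2(j).

```python
def max_matching(sim_matrix, threshold):
    """Size of a maximum matching where an edge exists if similarity >= threshold."""
    n = len(sim_matrix)
    m = len(sim_matrix[0]) if n else 0
    match_of_right = [-1] * m  # match_of_right[j] = index i matched to j, or -1

    def try_assign(i, visited):
        # Look for a free partner for i, re-routing earlier matches if necessary.
        for j in range(m):
            if sim_matrix[i][j] >= threshold and not visited[j]:
                visited[j] = True
                if match_of_right[j] == -1 or try_assign(match_of_right[j], visited):
                    match_of_right[j] = i
                    return True
        return False

    return sum(1 for i in range(n) if try_assign(i, [False] * m))


def text_similarity(sim_matrix, threshold):
    """Equation (7): 2 x matched pairs / (SAO count of patent 1 + SAO count of patent 2)."""
    n = len(sim_matrix)
    m = len(sim_matrix[0]) if n else 0
    if n + m == 0:
        return 0.0
    return 2 * max_matching(sim_matrix, threshold) / (n + m)


# Invented similarities between 3 SAO structures of one patent and 2 of another.
sim_matrix = [
    [0.90, 0.85],
    [0.90, 0.10],
    [0.20, 0.30],
]
print(max_matching(sim_matrix, 0.8), text_similarity(sim_matrix, 0.8))
```

In this toy matrix, a greedy pass would pair the first row with the first column and leave the second row unmatched; the augmenting step re-routes the first row to the second column so that both rows are matched, which is the reason a maximum-matching algorithm is used rather than a greedy one.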

5.4. Determining the Optimal Threshold

To distinguish the similarity between the target patent and the related patents, the experiment aims to keep both the proportion of patents with highly similar (repeated) similarity values and the proportion of patents with zero similarity as small as possible. When many patents share nearly the same similarity value, they are difficult to distinguish; a smaller proportion of repeated similarity values means that subtle differences in similarity between patents are captured. Likewise, the smaller the proportion of patents with zero similarity, the more detailed the analysis of the textual content. Before calculating the initial level of patent similarity, a threshold (P) for the SAO structure must be set. Thresholds from 0.3 to 1 were searched with a step size of 0.01 to determine the optimal setting. Figure 8 shows the proportion of patents with zero similarity and patents with too much similarity at different thresholds.

To reduce the proportion of patents with high similarity and of those with zero similarity, the experiment initially narrowed the threshold range to between 0.6 and 0.8. To cross-check the results, experts were invited to read the patents manually so that the difference between the measured results and human judgment was minimized. After reviewing all combinations, a threshold of 0.8 was selected.

5.5. Patent Similarity between Target Patents and Related Patents

Using the SAO structure-based method, the target patent US20180318860A1 was compared with the other related patents, which were then ranked by similarity. Table 5 shows the top ten patent serial numbers and the degree of patent similarity.

5.6. Weighted SAO Structure

Wang et al. [40] introduced the different weighted SAO (DWSAO) index: the SAO structures of each patent are extracted and weighted, and the weights are used to measure patent similarity in robotics. Let N be the number of patents in the patent set, let the target patent have m SAO structures, and let SAO_{i}^{p} denote the i-th SAO structure of patent p. Equation (8) gives its feature weight, the DWSAO value:

DWSAO_{i}^{p} = 1 - \frac{F}{N + 1}

(8)

where F is the document frequency of SAO_{i}^{p}. F is initialized to 1; the N patents are then traversed, and F is increased by 1 for every patent that contains an SAO structure similar to SAO_{i}^{p}. The formula shows that the more an SAO structure has in common with other patents, the weaker its ability to represent technical features and the smaller its DWSAO value. The calculation process is shown in Figure 9.
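As a worked illustration of Equation (8), the sketch below counts the document frequency F and returns the DWSAO weight. It is not the implementation of Wang et al. [40]: the `is_similar` predicate defaults to exact equality of (subject, action, object) triples as a stand-in for semantic similarity, and the four small SAO sets are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

SAO = Tuple[str, str, str]


def dwsao(
    target_sao: SAO,
    patents: List[Iterable[SAO]],
    is_similar: Callable[[SAO, SAO], bool] = lambda a, b: a == b,
) -> float:
    """Equation (8): DWSAO = 1 - F / (N + 1), with F starting at 1."""
    n = len(patents)
    f = 1  # initial document frequency, as stated in the text
    for sao_set in patents:
        # Add 1 for every patent that contains a similar SAO structure.
        if any(is_similar(target_sao, other) for other in sao_set):
            f += 1
    return 1 - f / (n + 1)


# Hypothetical patent set with N = 4 patents.
patent_set = [
    [("jet", "include", "passage"), ("conduit", "taper", "outlet")],
    [("jet", "include", "passage")],
    [("water", "pass", "showerhead")],
    [("passage", "include", "ducting")],
]

common = dwsao(("jet", "include", "passage"), patent_set)         # F = 3, weight 1 - 3/5
rare = dwsao(("configuration", "require", "tolerance"), patent_set)  # F = 1, weight 1 - 1/5
print(common, rare)
```

The common structure receives the lower weight, which matches the statement that widely shared SAO structures represent technical features less well.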

6. Multimodal Patent Similarity Analysis

This paper compares the target patent with related patents by fusing image information with SAO structures. In the SAO-based approach described above, the similarity between patent texts was obtained only from the similarity of their SAO structures. The abstract of a patent gives a comprehensive overview of the features of the invention, and the claims describe its content in detail. Both are rich in content and provide a large corpus, which makes them suitable for studying patent infringement and patent similarity. In this paper, image information is combined with the SAO structure to measure patent similarity more accurately. The specific implementation process is shown in Figure 10.

6.1. Proof of Patent Similarity

First, the SAO structures in each patent are extracted and preprocessed using standard preprocessing. Second, the contour is extracted from the drawing attached to the patent abstract so that the information inside the contour is preserved. At the end of this step, each patent corresponds to an SAO set and a processed patent image.
Based on semantic information, the similarity between individual SAO structures is calculated, followed by the similarity between the SAO set of each related patent and that of the target patent. The similarity of the SAO sets, obtained by applying the Hungarian algorithm, represents the text similarity of the patents.
The image similarity between each related patent and the target patent is then calculated: the contour of the patent image is detected, the contour map is reconstructed using Fourier descriptors, the image within the contour is retained, and image similarity is computed using the mutual information method. Finally, the image similarity is combined with the patent text similarity by weighting to obtain the overall patent similarity.
The TF-IDF method, the SAO structure method, the DWSAO method, and the Sentence Bidirectional Encoder Representations from Transformers (SBERT) method are also used to calculate the similarity between the target patent and related patents, so that the accuracy of the different methods can be compared.

6.2. Contour Extraction

To extract the patent contour from an image, the image is first blurred using median filtering to reduce noise. Median filtering is a nonlinear smoothing technique that replaces the gray value of each pixel with the median of the gray values in the eight-neighborhood around that pixel. The process is shown in Figure 11a, and the gray values before and after the change are shown in Figure 11b.
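A minimal sketch of the median filtering step is given below. It assumes a 3 x 3 window (the pixel together with its eight neighbors), represents a grayscale image as a list of rows, and leaves border pixels unchanged; these are illustrative choices, since the text does not specify the border handling.

```python
from statistics import median
from typing import List

Image = List[List[int]]


def median_filter_3x3(image: Image) -> Image:
    """Replace each interior pixel with the median of its 3 x 3 window."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # border pixels are copied unchanged
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            # Nine values, so the median is always one of the window's pixel values.
            result[y][x] = median(window)
    return result


# Hypothetical 3 x 3 patch with a single noisy pixel in the centre.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
smoothed = median_filter_3x3(noisy)
print(smoothed[1][1])  # the isolated spike is replaced by the neighborhood median, 10
```

Unlike a mean filter, the median discards an isolated outlier entirely instead of spreading it over the neighborhood, which is why it suits the speckle-like noise of scanned drawings.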

Secondly, binarization is performed to facilitate contour extraction. Adaptive thresholding sets the threshold locally: the value of each pixel is compared with the average pixel value of the region in which the pixel is located to decide whether the pixel is assigned 0 or 1.
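The local-mean rule described above can be sketched as follows. The window radius and the `offset` constant are hypothetical parameters introduced for illustration; the text does not state the region size or any offset used in the experiments.

```python
from typing import List

Image = List[List[int]]


def adaptive_threshold(image: Image, radius: int = 1, offset: float = 0.0) -> Image:
    """Binarize: 1 where a pixel exceeds its local mean minus `offset`, else 0."""
    height, width = len(image), len(image[0])
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Local window, clipped at the image border.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            values = [image[j][i] for j in ys for i in xs]
            local_mean = sum(values) / len(values)
            binary[y][x] = 1 if image[y][x] > local_mean - offset else 0
    return binary


# Hypothetical patch: one bright pixel on a dark background.
patch = [
    [10, 10, 10],
    [10, 200, 10],
    [10, 10, 10],
]
print(adaptive_threshold(patch))  # only the centre pixel exceeds its local mean
```

Because the threshold follows the local mean, unevenly lit regions of a scanned drawing are binarized relative to their own surroundings rather than against one global cut-off.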

Finally, eight-neighborhood detection with the Fourier operator is performed on the pre-processed image to extract the contour point coordinates, and the contour enclosing the largest area is retained. The contour is then reconstructed with Fourier descriptors so that the valid image information inside the contour is kept. The image before and after processing is shown in Figure 12.
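The reconstruction of a contour from Fourier descriptors can be illustrated with a small sketch. It treats each contour point (x, y) as the complex number x + iy, computes the discrete Fourier transform, and rebuilds the contour from the low-frequency coefficients only. The number of retained descriptors (`keep`) and the square test contour are assumptions made for illustration; the paper does not report these details.

```python
import cmath
from typing import List, Tuple

Point = Tuple[float, float]


def fourier_descriptors(contour: List[Point]) -> List[complex]:
    """Discrete Fourier transform of the contour read as complex numbers x + iy."""
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    return [
        sum(z[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n)) / n
        for u in range(n)
    ]


def reconstruct(descriptors: List[complex], keep: int) -> List[Point]:
    """Inverse transform using only the frequencies with |u| <= keep (low-pass)."""
    n = len(descriptors)
    kept = [
        c if min(u, n - u) <= keep else 0j
        for u, c in enumerate(descriptors)
    ]
    points = []
    for k in range(n):
        value = sum(kept[u] * cmath.exp(2j * cmath.pi * u * k / n) for u in range(n))
        points.append((value.real, value.imag))
    return points


# Hypothetical closed contour: eight points on the boundary of a 2 x 2 square.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
descriptors = fourier_descriptors(square)
exact = reconstruct(descriptors, keep=4)   # all frequencies kept: original contour
centre = reconstruct(descriptors, keep=0)  # only the zero frequency: the centroid
print(centre[0])
```

Keeping fewer descriptors yields a smoother outline, so the choice of `keep` trades contour detail against robustness to jagged edges.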

Methods based on mutual information are widely used in image alignment. In this paper, the similarity between two images is characterized by their mutual information. Table 6 shows the patent serial numbers and image similarity values of the ten patents whose images are most similar to that of the target patent.
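Mutual information between two images can be computed from their joint gray-level histogram, as in the sketch below. It assumes two grayscale images of identical size stored as lists of rows and returns the raw mutual information in bits; whether and how the paper normalizes this value is not stated here, so no normalization is applied.

```python
from collections import Counter
from math import log2
from typing import List

Image = List[List[int]]


def mutual_information(image_a: Image, image_b: Image) -> float:
    """Mutual information (in bits) of the gray levels of two equally sized images."""
    pixels_a = [p for row in image_a for p in row]
    pixels_b = [p for row in image_b for p in row]
    if len(pixels_a) != len(pixels_b) or not pixels_a:
        raise ValueError("images must be non-empty and have the same size")
    total = len(pixels_a)
    joint = Counter(zip(pixels_a, pixels_b))  # joint gray-level histogram
    hist_a = Counter(pixels_a)
    hist_b = Counter(pixels_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / total
        # p_ab / (p_a * p_b), written with raw counts.
        mi += p_ab * log2(count * total / (hist_a[a] * hist_b[b]))
    return mi


# Hypothetical binary images.
img = [[0, 0], [1, 1]]
unrelated = [[0, 1], [0, 1]]
print(mutual_information(img, img))        # identical images share one full bit
print(mutual_information(img, unrelated))  # statistically independent gray levels
```

The measure is high when the gray levels of one image predict those of the other and drops to zero when they are statistically independent.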

The patent similarity calculation is based on the fusion of the patent image similarity method and the patent text similarity method, as shown in Equation (9).

sim = \alpha \times sim_{text} + \beta \times sim_{img}

(9)

where the weights satisfy 0 < α < 1, 0 < β < 1, and α + β = 1.
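Equation (9) amounts to a convex combination of the two scores, which the following sketch makes explicit. The default weight of 0.8 for the text score and the input values are illustrative only, and β is derived as 1 − α so that the constraint α + β = 1 always holds.

```python
def fused_similarity(sim_text: float, sim_img: float, alpha: float = 0.8) -> float:
    """Equation (9) with beta = 1 - alpha, so the two weights always sum to 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    beta = 1.0 - alpha
    return alpha * sim_text + beta * sim_img


# Hypothetical (text, image) similarity pairs for three related patents.
candidates = {"A": (0.05, 0.30), "B": (0.07, 0.10), "C": (0.02, 0.40)}
fused = {name: fused_similarity(t, i) for name, (t, i) in candidates.items()}
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)  # patents ordered from most to least similar after fusion
```

With a text weight of 0.8, patent A (0.04 + 0.06) ranks above C (0.016 + 0.08) and B (0.056 + 0.02), even though B has the highest text similarity, which shows how the image term can reorder candidates.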

6.3. Threshold Selection

We compared several combinations of the weights α and β and selected the most suitable one, as shown in Table 7.

Patent images are auxiliary to the patent text: they give the text a visual form and make the patent easier for the reader to understand. The weight of the patent image is therefore set lower than the weight of the patent text. All combinations in Table 7 were evaluated, and combination 4 was the most effective, so α is set to 0.8 and β to 0.2.

6.4. Patent Similarity between Target Patents and Related Patents

US20180318860A1 is selected as the target patent and compared with the other related patents, and similarity is ranked using the method proposed in this paper. Table 8 shows the patent serial numbers and similarity values of the ten most similar patents.

7. Analysis and Validation of Results

To further verify the effectiveness of the proposed method, it was compared with the TF-IDF, SAO, DWSAO, and SBERT methods. The purpose of the experiment was to find patents with a high degree of similarity to the target patent. In cooperation with the patent office, the university invited three experts: a university professor, an industry expert with more than five years of experience, and a patent examiner who had been practicing for five years. They assessed similarity based on the functional features involved in each patent and the technical means used to solve the technical problem. The ten most similar patents were obtained by ranking the patents from most to least similar, and Table 9 shows the results of this analysis. To show the average ranking change of each method more clearly, the results are also presented in Figure 13.

As shown in Table 9, the sum of the absolute rank differences between manual reading and TF-IDF is 62, with a mean difference of 6.2. For the traditional SAO method, the sum is 43 and the mean is 4.3. For DWSAO, the sum is 43 and the mean is 4.3. For SBERT, the sum is 61 and the mean is 6.1. The TF-IDF method is less accurate than the SAO method because the SAO structure reflects the structural relationships between engineering components. Park et al. [18] concluded that patent similarity based on the SAO structure outperforms text-based methods, and our results are consistent with that conclusion. The SBERT method, although advanced, does not extract effective patent features, so its results are not as good. The method used in this paper improves on the traditional SAO method: its average ranking change is 2.9, lower than that of the other methods, which indicates higher accuracy.
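The rank-change figures can be recomputed directly from the rankings listed in Table 9. The short script below copies those rankings and sums the absolute differences from the expert ranking; the variable and function names are illustrative.

```python
from typing import Dict, List, Tuple

# Rankings copied from Table 9 (rows ordered by expert rank 1 to 10).
expert: List[int] = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
method_ranks: Dict[str, List[int]] = {
    "TF-IDF": [4, 19, 5, 6, 14, 8, 1, 12, 23, 7],
    "SBERT": [2, 18, 14, 13, 12, 8, 7, 10, 17, 5],
    "SAO": [5, 11, 7, 3, 12, 6, 1, 4, 14, 13],
    "DWSAO": [5, 13, 7, 8, 9, 11, 3, 4, 12, 10],
    "SAO-img": [1, 5, 6, 2, 7, 12, 11, 9, 14, 13],
}


def rank_change(reference: List[int], ranks: List[int]) -> Tuple[int, float]:
    """Sum and mean of the absolute rank differences against the reference."""
    total = sum(abs(r - e) for r, e in zip(ranks, reference))
    return total, total / len(reference)


summary = {name: rank_change(expert, ranks) for name, ranks in method_ranks.items()}
for name, (total, mean) in summary.items():
    print(f"{name}: sum = {total}, mean = {mean}")
```

The sums come out as 62, 61, 43, 43, and 29, in agreement with the last two rows of Table 9.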

8. Discussion and Conclusions

The patent specification carries detailed patent information, and patent infringement imposes significant time and financial costs on companies. Blocking infringing patents at the point of publication can effectively mitigate subsequent patent disputes. Before a patent is granted, the patent examiner must go through a series of processes to assess the innovation and novelty of the invention, but the number of patent applications is growing so fast that examination is becoming increasingly difficult, and examination techniques need to be improved. Several methods for assisting patent examination already exist and give good results. This paper makes the following contributions: (1) It proposes a new method for computing patent similarity that uses the SAO structure and the Word2Vec model, fusing semantic information with image feature information by weighting. (2) In the calculation of patent text similarity, the Hungarian algorithm is used to convert SAO structure similarity into patent similarity. (3) The method is more accurate than the traditional methods and can improve the review efficiency of patent examiners.

The proposed method improves accuracy to a certain extent, but some shortcomings remain and suggest directions for future work: (1) There is still considerable room for improvement in SAO structure extraction; future work will integrate deep learning-based information extraction techniques to introduce deeper semantic information and improve extraction accuracy. (2) A patent corpus should be created to supplement technical terms and abbreviations. (3) Machine learning with a collective intelligence [59] or swarm intelligence [60] approach could replace expert judgment in setting thresholds. (4) Image information could be extracted by deep learning to introduce deep image features. (5) Only 131 shower patents filed with the USPTO within ten years are used as the case study, which is a small sample, so further research needs to cover cases in other fields, such as the shift from design for mass customization to design for mass personalization [61]. We expect these improvements to raise the accuracy of patent similarity measurement.

Author Contributions

Conceptualization, W.L. and R.X.; methodology, W.L.; software, W.L. and W.Y.; validation, W.Y.; formal analysis, W.L.; investigation, W.L.; resources, W.L.; data curation, W.Y.; writing—original draft preparation, W.L. and W.Y.; writing—review and editing, W.L.; visualization, W.Y.; supervision, W.L.; project administration, W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grant No. 52275249, and the Social Science Foundation of Fujian Province of China (No. FJ2021B128).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Asche, G.E. “80% of technical information found only in patents” – Is there proof of this? World Pat. Inf. 2017, 48, 16–28.
2. Zhai, C.Y.; Du, D.B.; Shi, W.T. Spatiotemporal Evolution and Determinants of the Geography of Chinese Patents Abroad: A Case Study of Strategic Emerging Industries. Systems 2023, 11, 33.
3. Ma, H.K. The Dynamics of China’s Collaborative Innovation Network in Agricultural Biotechnology: A Spatial-Topological Perspective. Systems 2023, 11, 73.
4. International Patent Applications Defy 2022 Challenges, Continue Upward Trend. Available online: https://www.wipo.int/pressroom/en/articles/2023/article_0002.html (accessed on 14 May 2023).
5. Global Innovation Index 2022. Available online: https://www.wipo.int/global_innovation_index/en/2022/ (accessed on 7 March 2023).
6. WIPO: China’s Global Ranking in Innovation Steadily Improves. Available online: https://baijiahao.baidu.com/s?id=1745312150286743555&wfr=spider&for=pc (accessed on 20 March 2023).
7. Arts, S.; Cassiman, B.; Gomez, J.C. Text matching to measure patent similarity. Strat. Manag. J. 2018, 39, 62–84.
8. Jeong, C.; Kim, K. Creating patents on the new technology using analogy-based patent mining. Expert Syst. Appl. 2014, 41, 3605–3614.
9. Zhu, D.M. Bibliometric analysis of patent infringement retrieval model based on self-organizing map neural network algorithm. Libr. Hi Tech 2020, 38, 479–491.
10. Lee, S.; Yoon, B.; Park, Y. An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation 2009, 29, 481–497.
11. Yoon, B.; Yoon, C.; Park, Y. On the development and application of a self-organizing feature map-based patent map. R&D Manag. 2002, 32, 291–300.
12. Kuei-Kuei, L.; Shiao-Jun, W. Using the patent co-citation approach to establish a new patent classification system. Inform. Process. Manag. 2005, 41, 313–330.
13. Magerman, T.; Van Looy, B.; Song, X. Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications. Scientometrics 2010, 82, 289–306.
14. Yoon, B.; Kim, S.; Kim, S.; Seol, H. Doc2vec-based link prediction approach using SAO structures: Application to patent network. Scientometrics 2022, 127, 5385–5414.
15. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (PMLR), Beijing, China, 21 June 2014.
16. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA, 2–4 May 2013.
17. Kim, S.; Yoon, B. Patent infringement analysis using a text mining technique based on SAO structure. Comput. Ind. 2021, 125, 103379.
18. Jang, H.J.; Park, S.J.; Yoon, B. Exploring Technology Opportunities Based on User Needs: Application of Opinion Mining and SAO Analysis. Eng. Manag. J. 2022, 1–14.
19. Park, H.; Yoon, J.; Kim, K. Identifying patent infringement using SAO based semantic technological similarities. Scientometrics 2012, 90, 515–529.
20. Li, X.M.; Wang, C.; Zhang, X.F.; Sun, W. Generic SAO Similarity Measure via Extended Sorensen-Dice Index. IEEE Access 2020, 8, 66538–66552.
21. Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302.
22. Sørensen, T.A. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 1948, 5, 1–34.
23. Yoon, J.; Park, H.; Kim, K. Identifying technological competition trends for R&D planning using dynamic patent maps: SAO-based content analysis. Scientometrics 2013, 94, 313–331.
24. Yoon, J.; Kim, K. Identifying rapidly evolving technological trends for R&D planning using SAO-based semantic patent networks. Scientometrics 2011, 88, 213–228.
25. Park, I.; Yoon, B. A semantic analysis approach for identifying patent infringement based on a product–patent map. Technol. Anal. Strat. Manag. 2014, 26, 855–874.
26. Lu, Y.; Xiong, X.; Zhang, W.; Liu, J.; Zhao, R. Research on classification and similarity of patent citation based on deep learning. Scientometrics 2020, 123, 813–839.
27. Ma, C.; Zhao, T.; Li, H. A Method for Calculating Patent Similarity Using Patent Model Tree Based on Neural Network. In Proceedings of the 9th International Conference on Brain Inspired Cognitive System (BICS), Xi’an, China, 7–8 July 2018.
28. Mueller, J.; Thyagarajan, A. Siamese Recurrent Architectures for Learning Sentence Similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
29. Neculoiu, P.; Versteegh, M.; Rotaru, M. Learning text similarity with siamese recurrent networks. In Proceedings of the 1st Workshop on Representation Learning for NLP (RepL4NLP), Berlin, Germany, 7–12 August 2016.
30. Zhang, Y.; Shang, L.; Huang, L.; Porter, A.L.; Zhang, G.; Lu, J.; Zhu, D. A hybrid similarity measure method for patent portfolio analysis. J. Inf. 2016, 10, 1108–1130.
31. Fujii, A.; Ishikawa, T. Document Structure Analysis for the NTCIR-5 Patent Retrieval Task. In Proceedings of the NTCIR-5 Workshop Meeting (NTCIR), Tokyo, Japan, 6–9 December 2005.
32. Robertson, S.E.; Walker, S. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Dublin, Ireland, 3–6 July 1994.
33. Kim, B.T.; Hyun, E. Mapping the Landscape of Blockchain Technology Knowledge: A Patent Co-Citation and Semantic Similarity Approach. Systems 2023, 11, 111.
34. Lee, C.; Cho, Y.; Seol, H.; Park, Y. A stochastic patent citation analysis approach to assessing future technological impacts. Technol. Forecast. Soc. Chang. 2012, 79, 16–29.
35. Rodriguez, A.; Kim, B.; Turkoz, M.; Lee, J.; Coh, B.; Jeong, M.K. New multi-stage similarity measure for calculation of pairwise patent similarity in a patent citation network. Scientometrics 2015, 103, 565–581.
36. Klavans, R.; Boyack, K.W. Which Type of Citation Analysis Generates the Most Accurate Taxonomy of Scientific and Technical Knowledge? J. Am. Soc. Inf. Sci Technol. 2017, 68, 984–998.
37. Wu, H.; Chen, H.; Lee, K.; Liu, Y. A method for assessing patent similarity using direct and indirect citation links. In Proceedings of the 2010 IEEE International Conference on Industrial Engineering and Engineering Management, Macao, China, 7–10 December 2010.
38. Cheng, T.; Wang, M. The Patent-Classification Technology/Function Matrix – A Systematic Method for Design around. JIPR 2013, 18, 158–167.
39. Keselman, A.; Rosemblat, G.; Kilicoglu, H.; Fiszman, M.; Jin, H.; Shin, D.; Rindflesch, T.C. Adapting semantic natural language processing technology to address information overload in influenza epidemic management. J. Am. Soc. Inf. Sci. Technol. 2010, 61, 2531–2543.
40. Wang, X.; Ren, H.; Chen, Y.; Liu, Y.; Qiao, Y.; Huang, Y. Measuring patent similarity with SAO semantic analysis. Scientometrics 2019, 121, 1–23.
41. Kim, K.; Park, K.; Lee, S. Investigating technology opportunities: The use of SAOx analysis. Scientometrics 2019, 118, 45–70.
42. Miao, H.; Wang, Y.; Li, X.; Wu, F. Integrating Technology-Relationship-Technology Semantic Analysis and Technology Roadmapping Method: A Case of Elderly Smart Wear Technology. IEEE Trans. Eng. Manag. 2022, 69, 262–278.
43. He, X.; Meng, X.; Dong, Y.; Wu, Y. Demand identification model of potential technology based on SAO structure semantic analysis: The case of new energy and energy saving fields. Technol. Soc. 2019, 58, 101–116.
44. Li, R.; Wang, X.; Liu, Y.; Zhang, S. Improved Technology Similarity Measurement in the Medical Field based on Subject-Action-Object Semantic Structure: A Case Study of Alzheimer’s Disease. IEEE Trans. Eng. Manag. 2023, 70, 280–293.
45. Lin, W.; Liu, X.; Xiao, R. Research on Product Core Component Acquisition Based on Patent Semantic Network. Entropy 2022, 24, 549.
46. Gong, X.; Su, H.; Xu, D.; Zhang, Z.; Shen, F.; Yang, H. An Overview of Contour Detection Approaches. Int. J. Autom. Comput. 2018, 15, 656–672.
47. Wang, X. Laplacian operator-based edge detectors. IEEE Trans Pattern Anal. Mach. Intell. 2007, 29, 886–890.
48. Nixon, M.S.; Aguado, A.S. Feature Extraction & Image Processing for Computer Vision, 3rd ed.; Elsevier: Amsterdam, The Netherlands, 2012; pp. 143–154.
49. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698.
50. Martin, D.R.; Fowlkes, C.C.; Malik, J. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 530–549.
51. Cox, I.J.; Rehg, J.M.; Hingorani, S. A Bayesian multiple-hypothesis approach to edge grouping and contour segmentation. Int. J. Comput. Vis. 1993, 11, 5–24.
52. Amir, A.; Lindenbaum, M. A generic grouping algorithm and its quantitative analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 168–185.
53. Elder, J.H.; Zucker, S.W. Computing Contour Closure. In Proceedings of the 4th European Conference on Computer Vision (ECCV), Cambridge, UK, 15–18 April 1996.
54. Mahamud, S.; Williams, L.R.; Thornber, K.K.; Xu, K. Segmentation of multiple salient closed contours from real images. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 433–444.
55. Arbelaez, P. Boundary Extraction in Natural Images Using Ultrametric Contour Maps. In Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), New York, NY, USA, 17–22 June 2006.
56. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. From contours to regions: An empirical evaluation. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
57. Bengio, Y.; Ducharme, R.; Vincent, P. A neural probabilistic language model. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Denver, CO, USA, 1 January 2000.
58. Mnih, A.; Hinton, G. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning (ICML), Corvalis, OR, USA, 20–24 June 2007.
59. Xiao, R.B.; Feng, Z.H.; Wang, J.H. Collective intelligence: Conception, research progress and application analysis. J. Nanchang Inst. Technol. 2022, 41, 1–21.
60. Xiao, R.B.; Chen, Z.Z. From swarm intelligence optimization to swarm intelligence evolution. J. Nanchang Inst. Technol. 2023, 42, 1–10.
61. Xiao, R.B.; Lai, R.S.; Li, R.W. From design for mass customization to design for mass personalization. J. Nanchang Inst. Technol. 2021, 40, 1–12.

Figure 1. Extraction of SAO.

Figure 2. TF-IDF model text similarity calculation process.

Figure 3. TF-IDF heat map.

Figure 4. SAO structure extraction process.

Figure 5. CBOW model and Skip-gram model.

Figure 6. Breakdown diagram of the SAO structure comparison.

Figure 7. Hungarian algorithm.

Figure 8. Percentage of patents with zero similarity and those with too much similarity at different thresholds.

Figure 9. DWSAO calculation process.

Figure 10. General procedure for identifying similar patents.

Figure 11. Median filtering.

Figure 12. Image pre-processing.

Figure 13. Average ranking change.

Table 1. Patent SAO structure.

No.	Example Sentence	Extracted SAO Structure
1	Wherein the conductive mechanism further comprises a control circuit module	conductive mechanism-comprises-control circuit module
2	The first conductive portion and the second conductive portion are electrically coupled with the control circuit module through conductive line	the first conductive portion-coupled-module; the second conductive portion-coupled-module
3	A direction-adjustable showerhead fixing structure includes a showerhead main body and a connecting seat	showerhead-includes-body; showerhead-includes-connecting seat
4	The invention reduces and eliminates the above disadvantages	invention-reduces-disadvantages; invention-eliminates-disadvantages
5	The control circuit module is disposed on the mount and is located in the first chamber	module-disposed-mount; module-located-the first chamber

Table 2. Partial patent display.

No.	Patent Number	No.	Patent Number
1	US20220105526A1	……	……
2	US20210178409A1	……	……
3	US20210027988A1	124	US20180257090A1
4	US20200384486A1	125	US20180250690A1
5	US20190262849A1	126	US20180065131A1
6	US20190184316A1	127	US20170297039A1
7	US20190143348A1	128	US20170252764A1
8	US20180318860A1	129	US20170189918A1
……	……	130	US20170165682A1
……	……	131	US20170165684A1

Table 3. Part of the SAO structure.

No.	Subject (S)	Action (A)	Object (O)
1	Invention	Provide	Showerhead
2	Conduit	Taper	Passage
3	Conduit	Taper	Outlet
4	Invention	Have	Application
5	Water	Passing	Showerhead
6	Configuration	Require	Tolerance
……	……	……	……
128	Jet	Include	Passage
129	Jet	Include	Outlet
130	Passage	Include	Ducting
131	Passage	Include	Apertures

Table 4. Number of SAO structures of some patents.

No.	Patent Number	Number of SAO Structure
1	US20220105526A1	334
2	US20210178409A1	611
3	US20200384486A1	712
4	US20190262849A1	199
……	……	……
127	US20170297039A1	327
128	US20170252764A1	116
129	US20170189918A1	466
130	US20170165682A1	117
131	US20170165684A1	84

Table 5. Top 10 related patents most similar to the target patent under the traditional SAO method.

Rank	Patent No.	Similarity
1	7	0.0727
2	14	0.0712
3	2	0.0671
4	17	0.0648
5	9	0.0643
6	23	0.0598
7	8	0.0515
8	5	0.0488
9	19	0.0484
10	20	0.0465

Table 6. Top 10 related patents whose images are most similar to the target patent image.

Rank	Patent No.	Similarity
1	24	0.3932
2	22	0.3833
3	26	0.3711
4	12	0.3675
5	20	0.3495
6	9	0.3245
7	0	0.3215
8	1	0.3131
9	4	0.2932
10	2	0.2856

Table 7. Sum of rank changes under different weight combinations.

No.	α	β	Sum of Rank Changes
1	0.5	0.5	56
2	0.6	0.4	53
3	0.7	0.3	53
4	0.8	0.2	29
5	0.9	0.1	41

Table 8. Top 10 related patents most similar to the target patent under the proposed method.

Rank	Patent No.	Similarity
1	9	0.11574
2	2	0.1084
3	20	0.1021
4	12	0.0967
5	4	0.09376
6	8	0.0921
7	1	0.09184
8	19	0.09048
9	17	0.08746
10	24	0.08592

Table 9. Comparison of various methods.

Patent No.	Expert Reading	TF-IDF	SBERT	SAO	DWSAO	SAO-img
9	1	4	2	5	5	1
4	2	19	18	11	13	5
8	3	5	14	7	7	6
2	4	6	13	3	8	2
1	5	14	12	12	9	7
23	6	8	8	6	11	12
7	7	1	7	1	3	11
17	8	12	10	4	4	9
3	9	23	17	14	12	14
16	10	7	5	13	10	13
Rank change value sum	-	62	61	43	43	29
Average ranking change	-	6.2	6.1	4.3	4.3	2.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
