Discuss how the Internet has become the world's largest repository.

Abstract

After decades of development, the Internet has become the world’s largest repository, and the vast majority of its information exists in the form of Web text. Web data mining technology emerged to take advantage of these resources: it lets users find exactly the information they need, saves search time, and increases the value of Web data. It uses classification, clustering, association analysis, and trend forecasting to discover and extract useful patterns of interest to the user, as well as hidden information, from Web text.

Web document clustering is an important research branch of Web text mining. As an unsupervised learning method, it requires no training process and no manual labeling of documents in advance. With its flexibility and high degree of automation, it can effectively organize documents and produce summaries and navigation aids, which alleviates information clutter and information overload to a certain extent. As an underlying technology for information retrieval, information filtering, search engines, digital libraries, and other areas, Web document clustering has broad application prospects.

In this work, I investigated the effectiveness of the smallest position value (SPV) technique in mapping the continuous version of the PSO algorithm to the task-matching problem in a heterogeneous computing environment. The experimental evaluation demonstrated that the task matching generated by this technique results in an unbalanced load distribution. I have therefore also designed a load rebalance PSO heuristic (BMO-LR) that minimizes makespan and balances the utilization of the available compute nodes, even in heterogeneous computing environments.

Keywords: BMO algorithm; document clustering; Web document clustering

Contents

A BMO algorithm on document clustering

Abstract

Contents

Introduction

1.1 Research background

1.2 Current research

1.3 Research methods to be adopted

1.4 Anticipated research results and innovations

Literature Review

2.1 Conventional Heuristics for Task-Matching

2.2 Evolutionary Computing Heuristics

Bird mating optimizer

3.1 Introduction to the bird mating optimizer

3.1 Bird mating optimizer

Web document clustering theory and technology

4.1 Document representation model

4.1.1 Vector Space Model

4.1.2 Suffix tree document model

4.2 Feature item weight calculation

4.3 Similarity measure

4.4 Clustering Algorithm

4.4.1 Hierarchical clustering

4.4.2 Divisions

4.4.3 Density-Based Methods

4.4.4 Grid-based method

4.4.5 Model-based approach

4.4.6 Fuzzy clustering method

4.4.7 Incremental Clustering

Weighted suffix tree Web document clustering method

5.1 Introduction

5.2 Suffix tree clustering algorithm

5.2.1 Text analysis and preprocessing

5.2.2 Suffix Tree

5.2.3 The basic class cluster identification

5.3 Weighted suffix tree clustering algorithm

5.3.1 Web Document Analysis

5.3.2 Weighted suffix tree is defined and constructed

5.3.3 Document preprocessing

Experimental Results

6.1 Performance Metrics

6.1.1 Makespan

6.1.2 Average Resource Utilization (ARU)

6.2 Evaluation

6.2.1 Performance in Homogeneous Environment

6.2.2 Performance in Heterogeneous Environment

6.2.3 ARU comparison

6.2.4 Reliability Analysis

References

Acknowledgments

Introduction

1.1 Research background

With the spread of computer networks and database systems, the volume of data held in databases has expanded rapidly. This data hides a great deal of important information that could support more detailed analysis and better decision-making, so finding the relationships within the data quickly, in order to make use of it, has become an urgent problem. Traditional statistical analysis and data processing methods, however, are not effective at deep analysis: the implicit information and internal relationships in the data are often ignored, and future trends cannot be predicted from existing information. A new technology is therefore required that can intelligently analyze massive amounts of raw data, so that the large volume of available resources can be fully used and optimized in a unified way. Research on data mining theory and technology arose to meet this requirement and has developed rapidly.

Data mining extracts knowledge of interest to people from a large database or data warehouse; this knowledge is implicit, previously unknown, and potentially useful. It is one of the most active international research areas in database and decision-support research. It emerged from the urgent need to use information resources more deeply and fully, has developed rapidly, and has attracted attention from both academia and industry.

With the rapid development of information technology and the rapid expansion of the Internet, quickly and efficiently extracting useful, potentially valuable information from a large volume of information has become increasingly important. XML (Extensible Markup Language), a common format for representing and exchanging data on networks, is used in more and more applications. Compared with ordinary text documents, XML documents are self-descriptive, semi-structured, hierarchical, extensible, flexible, and simple [3], and they carry a certain amount of structural and semantic information. These features make XML independent of machine platforms, vendors, and programming languages, so it can act as a bridge between different systems, databases, and languages. XML therefore gives network data mining technology power and flexibility: data is easy to integrate, transport, and exchange, and heterogeneous databases become easier to query and search. Mining XML documents is thus undoubtedly an important part of data mining research. The major categories of data mining algorithms include classification, clustering, association rules, neural networks, decision trees, and sequential patterns. Among them, clustering is an effective unsupervised machine learning technique and a very important research topic in data mining [4]. Because XML is now widely used as a data exchange carrier in many application fields, an efficient and fast XML clustering mechanism would greatly shorten information retrieval time, improve the efficiency of data queries, uncover potentially valuable information, and provide better data to support decision-making. It therefore has great research and application value.

1.2 Current research

In recent years, research on XML document clustering has developed mainly along two lines: the structure and the content of XML documents. XML is semi-structured data; its markup elements and their nesting express the structure of an XML document, so analyzing the structural part is of great significance for understanding XML documents. The following summarizes recent work, at home and abroad, on clustering XML documents based on structural analysis.

(1) Methods based on edit distance: Tai first proposed using edit distance to measure the difference between two trees. Later work built on this and further refined the algorithms for computing the edit distance between two trees. Building on the concept of edit distance, other work describes the structure of XML documents as graphs and uses pattern matching to compute the minimal cost of transforming one graph into another, from which the similarity of XML documents is derived. However, the computational complexity of this method is as high as O(N^3), so it is not well suited to large-scale document collections. A subsequent method obtains a sub-optimal solution in approximately linear time. Both approaches, however, record the structural changes between two documents rather than comparing and computing the structural similarity of the XML documents directly, which is a limitation. The tree path model proposed in the literature is simpler than tree edit distance and has much lower complexity, but it uses exact matching: two paths are considered similar only if they have the same length and the corresponding nodes are exactly the same. This matching method is simple and practical, and it is widely used in many related studies.

(2) Methods based on functions: one approach is based on the Fourier transform. It converts an XML document into a time series in which each tag corresponds to a pulse signal. The entire XML document is thus converted into a mathematical model, and the similarity of different XML documents is obtained by comparing the differences between their sequences. Similarity computed with the Fourier transform matches practical needs more closely and gives better experimental results, but it cannot exclude the influence of repeated tags on the structural similarity calculation; that is, it ignores how the frequency of identical structures affects structural similarity.

(3) Methods based on the vector space model: the vector space model (VSM) represents each XML document as a vector, measures similarity by the distance between feature vectors, and clusters the documents with a partitioning approach. The VSM is currently the most widely used way to quantify XML documents for clustering, but the high dimensionality of the vectors leads to a very large amount of computation, which becomes prohibitive for large documents. One method first quantifies the XML documents, then extracts their path sequences and describes them as a set of vectors in the VSM, and clusters the XML documents with a particle-based clustering method. In the later stage of the algorithm, C-means is used for fast local tuning of points in the space in order to accelerate convergence. This method has advantages when dealing with large-scale document collections, but the extra step is bound to affect the running efficiency of the algorithm, so its results in practice need further validation.
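The idea of representing an XML document as a vector of path features and comparing vectors can be sketched as follows. This is only an illustration under stated assumptions: the helper names `path_vector` and `cosine` are hypothetical, raw path counts are used as feature weights, and cosine similarity stands in for whichever distance measure the surveyed methods actually use.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def path_vector(xml_text: str) -> Counter:
    """Count every root-to-node tag path, e.g. 'book/title' (assumed feature set)."""
    root = ET.fromstring(xml_text)
    counts: Counter = Counter()

    def walk(node: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(root, "")
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse path-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


doc_a = "<book><title>A</title><author>B</author></book>"
doc_b = "<book><title>C</title><author>D</author></book>"
doc_c = "<cd><artist>X</artist></cd>"
print(cosine(path_vector(doc_a), path_vector(doc_b)))  # same structure, different text
print(cosine(path_vector(doc_a), path_vector(doc_c)))  # no shared paths
```

Because each distinct path becomes one dimension, the vector grows with the variety of structures in the collection, which is the dimensionality problem noted above.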

(4) Methods based on structure graphs: one approach uses a structure graph to measure the similarity between XML documents, with the aim of clustering structurally similar XML documents by graph matching. The precision of this similarity measure is not high, because two documents with large differences may have the same structure graph. Another approach computes similarity by mapping XML document structures to directed graphs: it builds a directed graph from the structural information of the XML document, computes the distance from every node to the end node, assigns a weight to each node, computes the sum of the products of weight and semantic level, and uses the ratio of this value to the total weight to determine the structural similarity of the two documents. However, this method has limitations for clustering, since it ignores the sequential (ordering) relationship between edges.

(5) Methods based on the tree structure model: one approach computes XML document similarity from a BFS tree structure. Following the breadth-first search algorithm, it finds the minimum-cost tree edit distance, gives that concept a new definition, and takes edge weights, node weights, and operation costs into account as the basis for comparing the structural similarity of XML documents. However, this method is not very efficient, and it considers only the structure of XML documents, not their content or semantics. Other work characterizes XML documents by maximal frequent paths and maximal frequent subtrees and, on that basis, proposes an agglomerative hierarchical clustering algorithm for XML documents, XML Cluster. However, once a step of this algorithm has completed it cannot be rolled back, which increases the risk for the whole clustering: if one step is wrong, the correctness of the final clustering result is in doubt.

(6) Methods based on the tree path model: [17] compares text clustering with XML document clustering, improves a partitioning clustering method, and applies it to XML documents through a path mining algorithm. However, this method requires the initial cluster centers and the number of clusters to be specified in advance, which makes the clustering results relatively volatile and random. When matching the structural similarity of XML documents, other work uses a frequent path tree, but it matches paths exactly; this loses node information and the similarity of partially matching paths, so the result can differ considerably from the actual similarity. A method called PCXSS improves on exact path matching and computes XML document structural similarity based on paths, but it does not consider that paths ending in different leaf nodes may still be similar to each other, so it too loses some similarity information.

The clustering methods above focus on the structural characteristics of XML documents. There are also some content-based clustering methods for XML documents, but most of them simply treat an XML document as ordinary text and apply traditional clustering methods, ignoring its semi-structured, hierarchical nature, so they cannot take full advantage of the characteristics of XML data. Whether structure-based or content-based, most research on XML document clustering relies on XML document similarity, because similarity can be applied not only to clustering XML documents but also to many other aspects of similarity-based XML data mining. For this reason, this article focuses on the similarity of XML documents.

1.3 Research methods to be adopted

This article uses the following methods in the research process:

(1) Literature study. This paper makes full use of the library, the periodicals room, and Internet resources to collect relevant literature on BMO (bird mating optimizer) based document clustering. The literature was then systematically consolidated and summarized, which laid a good theoretical basis for this work.

(2) A combination of normative and empirical analysis. This article first summarizes and discusses the theory of genetic algorithms developed at home and abroad; then, drawing on these theories and the actual conditions of the system, it analyzes the genetic algorithm for document clustering.

1.4 Anticipated research results and innovations

The innovations of this paper are as follows:

This study takes Open-class parallel programs as its subject and combines parallel loop scheduling on heterogeneous systems with processor dynamic voltage scaling to optimize system power consumption while meeting performance constraints. First, a basic power-aware model of the parallel loop scheduling problem on heterogeneous systems is established, and a lower bound on the energy consumption of parallel loop scheduling on heterogeneous systems is derived analytically; this lower bound can be used to evaluate the efficiency of power optimization methods. Next, the parallel loop scheduling problem on heterogeneous systems is formulated as a general integer programming problem, and a method for solving it is given. Because current processors support only a limited number of discrete operating frequencies, a loop scheduling method within the processor is proposed to reduce power consumption further. Finally, 10 typical scientific computing kernels are evaluated on a CPU-GPU heterogeneous platform; the experimental results show that the method effectively reduces system power consumption and improves system performance.

Because of restrictions imposed by the programming model or the architecture, current accelerator-based heterogeneous parallel programs mostly have the general-purpose microprocessor and the accelerator execute different computing segments alternately. In this article, a computing segment executed by only one type of processor is called a homogeneous computing segment; accordingly, a computing segment executed by several different types of processors is called a heterogeneous computing segment. Following these definitions, a parallel program that contains only homogeneous computing segments is called a homogeneous-segment program, and a parallel program that contains heterogeneous computing segments is called a heterogeneous-segment program. This article addresses the heterogeneous parallel programs described above: it establishes an energy optimization model for heterogeneous parallel systems and develops the corresponding energy consumption optimization.

Literature Review

In this chapter, I discuss both conventional techniques and evolutionary computing based strategies that have been extensively applied to solve the task-matching problem in large-scale distributed systems.

2.1 Conventional Heuristics for Task-Matching

Min-Min Heuristic: The Min-Min concept, which is based on the D-schedule heuristics, was first implemented for the scheduling layer in SmartNet. It was later presented as the Min-Min heuristic for matching independent tasks to heterogeneous computing systems.

The Min-Min heuristic is a batch mode heuristic (Algorithm 1). The set of tasks to be matched to available machines is created, at fixed intervals, from the tasks waiting in the task queue. The completion time of each task in the set on every machine is estimated (Line-3, Algorithm 1). For each task, the machine that can complete the task in the minimum possible time is determined (Line-4, Algorithm 1). The task with the overall minimum expected completion time is selected and assigned to the respective machine. This task is then removed from the set, and the machine available time is updated for the assigned machine (Line-7, Algorithm 1). These steps are repeated until every task in the set has been matched to a processing element for execution.

Algorithm 1 Min-Min Heuristic
1	calculate ETC (Expected Time to Complete) matrix
2	while there are unmatched tasks in T do
3	determine earliest completion time for each task
4	determine task with minimum overall earliest completion time
5	assign task to corresponding machine
6	remove task from T
7	update machine available time of corresponding machine
8	end while

The Min-Min heuristic assigns shorter tasks first so that the assigned machine becomes available at the earliest possible time for executing the next task. This underlying principle is responsible for the performance of the Min-Min heuristic. However, the heuristic does not work well when one of the tasks in the set is significantly longer than the others: Min-Min assigns the longest task last instead of running it in parallel with the shorter tasks.
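The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, assuming the ETC matrix is given as a list of rows with `etc[t][m]` the expected time to complete task `t` on machine `m`; the function name `min_min` and the 0-based machine indices are choices made for this sketch, not part of the original work.

```python
from typing import List, Tuple


def min_min(etc: List[List[float]]) -> Tuple[List[int], float]:
    """Return (assignment, makespan); assignment[t] is the machine chosen for task t."""
    n_tasks, n_machines = len(etc), len(etc[0])
    ready = [0.0] * n_machines            # machine available times
    assignment = [-1] * n_tasks
    unmatched = set(range(n_tasks))

    while unmatched:
        best = None                       # (completion time, task, machine)
        for t in sorted(unmatched):
            # Earliest completion time of task t over all machines.
            m = min(range(n_machines), key=lambda j: ready[j] + etc[t][j])
            ct = ready[m] + etc[t][m]
            # Keep the task with the minimum overall earliest completion time.
            if best is None or ct < best[0]:
                best = (ct, t, m)
        ct, t, m = best
        assignment[t] = m                 # assign task to the corresponding machine
        ready[m] = ct                     # update that machine's available time
        unmatched.remove(t)               # remove task from the set

    return assignment, max(ready)


# Two short tasks and one long task on two machines.
print(min_min([[2, 4], [3, 1], [10, 12]]))  # ([0, 1, 0], 12.0)
```

In this small input the long third task is matched last and queued behind a short task, which illustrates the weakness described above.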

Max-Min Heuristic: The Max-Min heuristic was also first implemented for task assignment in SmartNet. Its steps are similar to those of the Min-Min heuristic, with one significant difference: the task with the maximum completion time is given priority. That is, instead of the minimum (Line-4, Algorithm 1), the task with the maximum overall earliest completion time is assigned first. The Max-Min heuristic performs better than the Min-Min heuristic when the number of short tasks is considerably greater than the number of long tasks. The long tasks are assigned at the start, which makes it possible to execute the shorter tasks in parallel with them.
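A minimal sketch of the Max-Min selection rule is given below, assuming an ETC matrix `etc[t][m]` (expected time of task `t` on machine `m`) and 0-based machine indices; the function name `max_min` and the helper `earliest` are hypothetical names used only for illustration.

```python
from typing import List, Tuple


def max_min(etc: List[List[float]]) -> Tuple[List[int], float]:
    """Return (assignment, makespan) produced by the Max-Min rule."""
    n_machines = len(etc[0])
    ready = [0.0] * n_machines            # time at which each machine becomes free
    assignment = [-1] * len(etc)
    unmatched = list(range(len(etc)))

    def earliest(t: int) -> Tuple[float, int]:
        # (earliest completion time, machine) for task t given the current loads
        return min((ready[m] + etc[t][m], m) for m in range(n_machines))

    while unmatched:
        # Max-Min: pick the task whose earliest completion time is the largest.
        task = max(unmatched, key=lambda t: earliest(t)[0])
        finish, machine = earliest(task)
        assignment[task] = machine
        ready[machine] = finish
        unmatched.remove(task)

    return assignment, max(ready)


# Two short tasks and one long task on two machines.
print(max_min([[2, 4], [3, 1], [10, 12]]))  # ([1, 1, 0], 10.0)
```

Here the long task is placed first on its fastest machine and both short tasks run in parallel on the other machine, giving a makespan of 10; selecting the minimum instead would leave the long task until last and finish at 12 on the same input.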

FCFS Heuristic: In the First-Come-First-Serve heuristic, the arrival time of the submitted task is the sole criterion for task matching. The task with the earliest arrival time in the task queue is selected first and assigned for execution when the resources it requires are available. The primary advantage of this heuristic is its low computational overhead of O(1). It also does not require any prior estimate of the run time of the submitted tasks. However, this heuristic results in a higher average waiting time for submitted tasks, because a task cannot be executed as long as the resources it requires are in use by tasks ahead of it in the queue. The FCFS heuristic is also expected to result in an unbalanced load distribution, as the machine for a task is not selected based on the length of the task. The performance of the FCFS heuristic can be improved by coupling it with backfilling techniques. Backfilling allows tasks further back in the task queue to begin execution if the task at the head of the queue is waiting for resources. This increases the overall efficiency of the system by eliminating idle time to a large extent.
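The waiting-time behavior of FCFS can be illustrated with a small simulation. The sketch below makes several simplifying assumptions that are not in the original text: machines are identical, each task needs exactly one machine, run times are known to the simulator (the heuristic itself never uses them), and no backfilling is applied. The function name `fcfs` is hypothetical.

```python
import heapq
from typing import List, Tuple


def fcfs(arrivals: List[float], runtimes: List[float], n_machines: int) -> Tuple[List[float], float]:
    """Serve tasks strictly in arrival order; return (start_times, average_wait)."""
    free_at = [0.0] * n_machines          # min-heap of times at which machines become free
    heapq.heapify(free_at)
    order = sorted(range(len(arrivals)), key=lambda i: arrivals[i])
    starts = [0.0] * len(arrivals)

    for i in order:
        machine_free = heapq.heappop(free_at)          # next machine to become free
        starts[i] = max(arrivals[i], machine_free)     # wait until that machine is free
        heapq.heappush(free_at, starts[i] + runtimes[i])

    waits = [starts[i] - arrivals[i] for i in range(len(arrivals))]
    return starts, sum(waits) / len(waits)


# One long task at the head of the queue delays three short tasks on a single machine.
print(fcfs([0, 1, 2, 3], [10, 1, 1, 1], n_machines=1))  # ([0, 10, 11, 12], 6.75)
```

With a second machine the three short tasks no longer queue behind the long one and the average wait in this example drops to zero, which shows how strongly the waiting time depends on what happens to be at the head of the queue.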

2.2 Evolutionary Computing Heuristics

To overcome the issues observed in conventional scheduling strategies, researchers were motivated to explore the application of evolutionary computing heuristics. This section describes the evolutionary computing heuristics used in the literature to solve the task-matching problem.

Simulated Annealing: The Simulated Annealing (SA) heuristic is a probabilistic method that emulates the slow decrease in temperature of a substance until it reaches a low-energy state, such as crystallization. The mathematical model of SA can be used to solve optimization problems that involve minimizing a problem-related objective function. The SA heuristic has been applied to the task-matching problem; the SA based heuristic designed by Lin et al. is one such example. It has also been used for task matching in distributed computing environments. The results presented by Kasen et al. demonstrate that an SA based heuristic is able to generate better task-matching solutions than the Min-Min or Max-Min heuristic.

To begin, a starting temperature is initialized along with a random solution vector T, where the value of T(i) represents the machine to which the i-th task is matched. Figure 2.1 shows a solution vector for 7 tasks to be matched to 3 available machines. The vector can be perturbed either by updating the value at a randomly selected index i or by randomly selecting two indices i and j and exchanging their values. The vector generated by the perturbation is known as the neighborhood vector.

If the makespan resulting from the neighborhood vector is less than the best previously achieved makespan, the solution given by the neighborhood vector is stored as the best solution achieved so far (Algorithm 2). The vector is perturbed again and the same steps are repeated for multiple iterations. The total number of iterations is determined by the decrement ratio applied to the temperature in each iteration and by the minimum temperature allowed.

Algorithm 2 Simulated Annealing Heuristic
temp = temp₀
T = T₀
bestcost = makespan(T₀)
bestsol = T₀
while temp > temp_low do
    T’ = perturb(T)
    if makespan(T’) < makespan(T) then
        T = T’
        if makespan(T’) < bestcost then
            bestsol = T’
            bestcost = makespan(T’)
        end if
    else if rand() < exp((makespan(T) - makespan(T’)) / temp) then
        T = T’
    end if
    temp = temp * α
end while

Figure 2.1: Solution vector

T={2,1,3,1,2,2,3}

If the solution does not improve on the previous iteration, it can still be accepted with a certain probability. The control operator given in equation 2.1 controls the probability of accepting inferior solutions, which helps to avoid convergence at a local minimum. As the number of iterations increases and temp decreases, the probability of accepting inferior solutions decreases (i.e., towards the end of the execution of the heuristic).

exp((makespan(T) - makespan(T’)) / temp)    (2.1)

Although the SA heuristic can be easily implemented for the task-matching problem and has a low computational cost, its performance depends significantly on appropriate initialization of the parameters. The perturbation technique chosen to update the solution vector T also has a significant effect on the final solution quality. If the initialization is poor, or the perturbation technique is not suited to the problem, the quality of the solution obtained by SA can be very low.
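The SA procedure described in this section can be sketched in Python as follows. This is an illustrative sketch rather than the implementation evaluated in this work: the ETC matrix layout `etc[t][m]`, the 0-based machine indices, the parameter defaults (`temp`, `temp_low`, `alpha`), and the 50/50 choice between the two perturbation moves are all assumptions.

```python
import math
import random
from typing import List, Tuple


def makespan(T: List[int], etc: List[List[float]], n_machines: int) -> float:
    """Largest total load over all machines for solution vector T."""
    load = [0.0] * n_machines
    for task, machine in enumerate(T):
        load[machine] += etc[task][machine]
    return max(load)


def perturb(T: List[int], n_machines: int, rng: random.Random) -> List[int]:
    """Neighborhood vector: reassign one random task, or swap two entries."""
    N = T[:]
    if len(N) < 2 or rng.random() < 0.5:
        N[rng.randrange(len(N))] = rng.randrange(n_machines)
    else:
        i, j = rng.sample(range(len(N)), 2)
        N[i], N[j] = N[j], N[i]
    return N


def simulated_annealing(etc: List[List[float]], n_machines: int, temp: float = 100.0,
                        temp_low: float = 0.01, alpha: float = 0.95,
                        seed: int = 0) -> Tuple[List[int], float]:
    rng = random.Random(seed)
    T = [rng.randrange(n_machines) for _ in etc]       # random initial solution
    cost = makespan(T, etc, n_machines)
    best_sol, best_cost = T[:], cost
    while temp > temp_low:
        N = perturb(T, n_machines, rng)
        new_cost = makespan(N, etc, n_machines)
        # Accept improvements; accept inferior solutions with the
        # probability of equation 2.1: exp((makespan(T) - makespan(T')) / temp).
        if new_cost < cost or rng.random() < math.exp((cost - new_cost) / temp):
            T, cost = N, new_cost
            if cost < best_cost:
                best_sol, best_cost = T[:], cost
        temp *= alpha                                   # decrement ratio
    return best_sol, best_cost


sol, cost = simulated_annealing([[2, 4], [3, 1], [10, 12]], n_machines=2)
print(sol, cost)
```

The result depends on the seed and on the cooling parameters, which reflects the sensitivity to initialization and perturbation noted above.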

Genetic Algorithm: The underlying mechanism of the Genetic Algorithm (GA) is based on collective information sharing between chromosomes to generate the fittest chromosome during the evolution process. The chromosomes are encoded according to the problem so that they represent multiple candidate solutions. Researchers have frequently used GA to solve a large number of optimization problems, including the task-matching problem in large-scale distributed systems. The GA based approach to task matching has shown better results than the conventional Max-Min heuristic.

To begin, a population of chromosomes is initialized randomly. In the current context, a chromosome is generally a string with length equal to the number of tasks to be matched. The entry at index i in the chromosome string represents the processor to which the i-th task is matched.

After initialization, the solution represented by each chromosome is evaluated to determine its fitness value or goodness. For the task-matching problem, the makespan value is determined and used to calculate the fitness value of each chromosome.

Algorithm 3 Genetic Algorithm
initialize_population()
evaluate_fitness()
selection()
while itr < itr_total do
    crossover()
    mutation()
    evaluate_fitness()
    selection()
    itr = itr + 1
end while

The successive iterations in a GA are known as generations. During each generation, a portion of the fitter chromosomes, which represent better solutions, is selected to move to the next generation. The selection is commonly done using the roulette-wheel mechanism, which assigns a greater probability of selection to solutions with smaller makespan values. In each generation, individual chromosomes are evolved using the crossover and mutation operations described below. To control both operations, a crossover probability P_C and a mutation probability P_M are selected a priori.

From among the chromosomes selected for the next generation, pairs of chromosomes are randomly selected for crossover. A random number R between 0 and 1 is generated for each pair. In a basic GA, if R is less than P_C for a pair, a random crossover index is selected and the parts of the two chromosome strings beyond that index are exchanged (i.e., crossed over). Figure 2.2 demonstrates the crossover operation between two chromosomes. In this example, the chromosome string represents a mapping of seven tasks onto three machines.

After the crossover operation, the strings are evolved with the mutation operation. Again, a random number R’ between 0 and 1 is generated for each chromosome. If R’ is less than P_M, a mutation index is selected randomly and the value at that index is randomly changed. The updated value is chosen between one and the total number of machines available for task completion.

Figure 2.2: Crossover operation

After both operations, chromosomes are selected again for the next generation. The chromosomes are evolved iteratively in each generation to create chromosomes that represent better task-matching solutions. The algorithm is executed until the set number of generations has been completed, and the best solution found so far is returned. The pseudocode for GA is shown in Algorithm 3.

Figure 2.3: Mutation operation

Generally, in large-scale systems a batch with a high number of tasks has to be matched onto machines. The chromosome therefore has to be long to represent a candidate task-matching solution, which increases the computational cost of the evolution operations (selection, crossover, and mutation). As the computational cost increases significantly with the number of tasks, GA does not scale well relative to the size of the task-matching problem. The performance of the search performed by GA is also significantly affected by the two control parameters P_C and P_M. It is difficult to fine-tune these probability based parameters to make the GA suitable for the problem being solved.
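The GA loop of Algorithm 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation evaluated in this work: fitness is taken as 1 / makespan for the roulette wheel (so all ETC entries are assumed positive), crossover is single-point on consecutive pairs of the randomly selected population, mutation is applied with probability `p_m`, machines are indexed from 0, and the parameter defaults are arbitrary.

```python
import random
from typing import List, Tuple

Chromosome = List[int]


def makespan(chrom: Chromosome, etc: List[List[float]], n_machines: int) -> float:
    """Largest total load over all machines for the matching encoded by chrom."""
    load = [0.0] * n_machines
    for task, machine in enumerate(chrom):
        load[machine] += etc[task][machine]
    return max(load)


def select(pop: List[Chromosome], costs: List[float], rng: random.Random) -> List[Chromosome]:
    """Roulette wheel: smaller makespan gives a larger selection probability."""
    weights = [1.0 / c for c in costs]
    return [c[:] for c in rng.choices(pop, weights=weights, k=len(pop))]


def genetic_algorithm(etc: List[List[float]], n_machines: int, pop_size: int = 20,
                      generations: int = 50, p_c: float = 0.8, p_m: float = 0.1,
                      seed: int = 0) -> Tuple[Chromosome, float]:
    rng = random.Random(seed)
    n = len(etc)
    pop = [[rng.randrange(n_machines) for _ in range(n)] for _ in range(pop_size)]
    costs = [makespan(c, etc, n_machines) for c in pop]
    best_cost = min(costs)
    best = pop[costs.index(best_cost)][:]
    pop = select(pop, costs, rng)
    for _ in range(generations):
        # Crossover: exchange the tails of a pair with probability p_c.
        for i in range(0, pop_size - 1, 2):
            if n > 1 and rng.random() < p_c:
                cut = rng.randrange(1, n)
                a, b = pop[i], pop[i + 1]
                pop[i], pop[i + 1] = a[:cut] + b[cut:], b[:cut] + a[cut:]
        # Mutation: reassign one random task with probability p_m (assumed convention).
        for chrom in pop:
            if rng.random() < p_m:
                chrom[rng.randrange(n)] = rng.randrange(n_machines)
        costs = [makespan(c, etc, n_machines) for c in pop]
        if min(costs) < best_cost:                      # remember the best solution so far
            best_cost = min(costs)
            best = pop[costs.index(best_cost)][:]
        pop = select(pop, costs, rng)
    return best, best_cost


best, cost = genetic_algorithm([[2, 4], [3, 1], [10, 12]], n_machines=2)
print(best, cost)
```

Every generation re-evaluates the makespan of all `pop_size` chromosomes, each of length equal to the number of tasks, which is the source of the scaling cost discussed above.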

Bird mating optimizer

3.1 Introduction to the bird mating optimizer

As a fertile source of concepts, principles, and mechanisms, nature can inspire the design of artificial computation systems for solving complex computational problems. Evolutionary algorithms (EAs), inspired by biological evolution, and swarm intelligence (SI) algorithms, inspired by collective animal behavior, are the two main classes of nature-inspired computation, and both have attracted increasing attention in recent years.

Owing to their simplicity and flexibility, EAs have been widely applied to scientific and engineering problems and have been among the most successful artificial computation systems for tackling complex computational problems. An evolutionary algorithm is a generic population-based metaheuristic optimization approach that tries to simulate some mechanisms of biological evolution. There are different variants of evolutionary algorithms, but the underlying idea behind all of these problem-solving techniques is the same [2]. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function specifies the environment within which the solutions live. The population then evolves, and the quality of the individuals improves, through the application of recombination (crossover) and mutation operators. Recombination is applied to the selected solutions (parents) to generate new ones (offspring). The offspring are also mutated by making a small change to the solution. Such operators are employed to discover regions of the search space for which good solutions have already been acquired.

The success of an optimization algorithm depends mostly on its ability to establish a good balance between exploration and exploitation. Exploration refers to the generation of new solutions in as yet untested regions of the search space, and exploitation means concentrating the algorithm’s search in the vicinity of current good solutions. An algorithm that cannot balance exploration and exploitation suffers from premature convergence, becomes trapped in a local optimum, and stagnates. In EAs, selection provides exploitation, while the crossover and mutation operators provide exploration.

The various dialects of EAs follow the same general outline and differ only in technical details. EAs process a whole collection of candidate solutions simultaneously, use recombination to mix information from several candidate solutions into a new one, employ mutation to maintain the diversity of the population, and are stochastic. The four major EA paradigms are the genetic algorithm (GA), genetic programming (GP), evolutionary programming (EP), and evolution strategy (ES). EP and ES borrow ideas from each other and also from GA. Though they work differently from GA, the broad principles are similar in many respects. The most popular EA, and the one most used in optimization problems, is GA. GA is a search technique used in computer science and engineering to obtain approximate solutions to optimization problems. One reason behind the GA’s success is that its advocates are very good at describing the algorithm in an easy to understand and non-mathematical way.

In GA, a population of solutions is initialized subject to certain constraints. Each solution is coded as a vector, termed a chromosome, whose elements are called genes. GA can quickly discover good solutions, even in difficult search spaces, but it has some drawbacks: (1) GA tends to converge towards local optima rather than the global optimum unless the fitness function is well defined; (2) it is difficult to operate on dynamic data sets; and (3) simpler optimization algorithms may find better solutions than GA in the same amount of computation time [9]. Another important concern in GAs is premature convergence. This happens when the population of chromosomes reaches a state in which crossover no longer generates offspring that can outperform their parents, as must be the case in a homogeneous population. Under such conditions, all standard forms of crossover simply regenerate the current parents [10]. In other words, premature convergence is the result of losing population diversity too quickly and getting stuck in a local optimum.

Initially, improvements in GA were sought in the optimal proportion and adaptation of the main parameters (probability of crossover and mutation, population size and crossover operator), but recently attention has shifted to breeding (the process of forming new candidate chromosomes). Consequently, in recent years some researchers have developed versions of GA in which the improvements seek the optimal breeding conditions.

EAs utilize the general rules of natural evolution (selection, crossover, and mutation) and try to develop ways of producing new solutions using these rules. However, although the living organisms of nature all rely on these general rules to evolve, the ways in which they apply them differ in detail. The performance of an optimization algorithm is strongly affected by the mechanism that forms new solutions, so different mechanisms lead to generations of varying quality. From this point of view, more algorithms can be added to the category of EAs.

The way evolution proceeds in particular organisms can inspire new evolutionary optimization techniques, which may yield more accurate results than existing ones. In this paper, inspired by bird mating strategies, we propose a novel evolutionary optimization algorithm, named bird mating optimizer (BMO), for global optimization.

Birds are the most speciose class of tetrapod vertebrates, with around 10,000 living species. The mating process in bird society has many similarities with an optimization process: each bird breeds, or attempts to breed, a brood with high-quality genes (a perfect state), because a bird with better genes has a greater chance of survival. In the same way, an optimization process searches for a global solution (a perfect state), and the quality of each solution is determined by a criterion named the objective (fitness) function. In engineering optimization, decision variables are given values in the search space and a solution vector is formed. If a good solution is acquired, that experience is memorized and the possibility of making a better solution next time increases.

During mating season birds employ a variety of intelligent behaviors such as singing, tail drumming or dancing to attract potential mates. Some courtship rituals are quite elaborate and serve to form a bond between the potential mates. The quality of each bird is specified by its features such as beak, tail, wing, and so on. The related gene of each feature determines the quality of that feature, together making the overall quality of the bird. A gene is a hereditary unit that can be passed on through breeding to next generations. Imagine a bird which has good genes among a species. This bird can fly adeptly and get more food. Hence, it is healthier than the other birds, lives longer and breeds more. The bird passes these genes for better ones onto its broods by selecting a superior mate. They also live longer and have more broods and the gene continues to be inherited generation after generation.

The ultimate success of a bird in raising a brood with superior features depends on the strategy it uses. Different strategies result in broods with diverse features. Studies of bird society reveal that birds employ different strategies in the mating process. In general, there are five strategies: monogamy, polygyny, polyandry, parthenogenesis and promiscuity. According to its species, each bird makes use of one of these strategies to breed. Most birds are monogamous, meaning that a male bird mates with only one female. In monogamous behavior, parental duties are shared between the pair: the male bird defends the territory while the main task of the female is to produce eggs and supply them. In polygynous species, a male tends to mate with several females, while in polyandrous species a female tends to mate with several males. Polygyny is much more common than polyandry in bird society [19]. Parthenogenesis denotes a mating system in which a female is able to raise a brood without the help of males. Promiscuity is another mating strategy, employed by a few bird species, meaning a mating system with no stable relationships in which mating between two birds is a one-time event. This type of mating indicates a rather chaotic social structure in which the male will almost certainly never see his brood or the nest, and most likely will not see the female for another brief visit.

The BMO proposed in this paper is a population-based optimization algorithm that uses the mating process of birds as a framework. Under this framework, concepts and strategies are metaphorically adopted to design optimum-searching techniques. In BMO, each bird is a feasible solution of the problem and is specified by a vector with a predefined number of genes, equal to the problem dimension. Over the generations, the birds use probabilistic methods to improve the quality of their broods by selecting better-quality mate(s).

In order to assess the performance of BMO comprehensively, intensive studies are presented based on a set of 23 benchmark functions. To study the usefulness of the proposed algorithm, its performance is compared with those of classical EP (CEP), classical ES (CES), fast EP (FEP), fast ES (FES), GA, particle swarm optimization (PSO), group search optimizer (GSO) and differential evolution (DE) reported in [1,20-22]. The performance of BMO is also compared with that of the particle swarm pattern search (PSwarm) algorithm. PSwarm uses particle swarm to look for a global optimum in the search phase of the pattern search algorithm, while the pattern search enforces convergence to a local optimum.

3.1 Bird mating optimizer

The population of the BMO algorithm is called the society, and each society member, representing a feasible solution of the problem, is called a bird. The society includes male and female birds. The females are the birds with the most promising genes. The females of the society are categorized into two groups, parthenogenetic and polyandrous, while the males are classified into three groups: monogamous, polygynous and promiscuous. In total, BMO uses five bird species, each with its own updating pattern. The way in which each species produces a candidate solution is explained in detail below.

Monogamy is a mating system in which a male bird tends to mate with one female. Most birds are monogamous. During the mating season, males employ the related intelligent behaviors to attract females. Each male evaluates the quality of the female birds, uses a probabilistic approach to select one of them as his interesting female, and mates with her. Female birds with more promising genes have a greater chance of being selected. Consider a monogamous bird $\vec{x}$ that wants to mate with his interesting female $\vec{x}^{\,i}$. The resultant brood is given as follows.

$$
\begin{aligned}
&\vec{x}_b = \vec{x} + w \times \vec{r} \circ (\vec{x}^{\,i} - \vec{x}) \\
&c = \text{a random number between 1 and } n \\
&\text{if } r_1 > mcf \\
&\qquad x_b(c) = l(c) - r_2 \times \bigl(l(c) - u(c)\bigr) \\
&\text{end}
\end{aligned}
\tag{1}
$$

where $\vec{x}_b$ is the resultant brood, $w$ is a time-varying weight that adjusts the importance of the interesting female, $\vec{r}$ is a $1 \times n$ vector whose elements are distributed randomly in [0, 1] and each scale the corresponding element of $(\vec{x}^{\,i} - \vec{x})$, $n$ denotes the problem dimension, $mcf$ is the mutation control factor, varying between 0 and 1, the $r_i$ are random numbers between 0 and 1, and $u$ and $l$ are the upper and lower bounds of the elements, respectively.

Using the first part of Eq. (1), the male bird attempts to pass on good genes to his brood by combining his genes with the genes of his interesting female. The male bird then uses the second part to mutate one of the brood's genes with probability 1-mcf.
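
As an illustration, the two parts of Eq. (1) described above can be sketched in Python. The function name `monogamous_brood`, the zero-based gene index and the optional `rng` argument are assumptions of this sketch, and the blending line follows the verbal description (the male's genes moved towards those of his interesting female by a weighted random step) rather than a confirmed formula.

```python
import random

def monogamous_brood(x, x_i, w, mcf, lower, upper, rng=random):
    """Hypothetical sketch of the monogamous update for one male x and one female x_i."""
    n = len(x)
    # first part: move each gene of the male towards the female by a random, weighted step
    brood = [x[k] + w * rng.random() * (x_i[k] - x[k]) for k in range(n)]
    c = rng.randrange(n)            # zero-based index of the gene that may be mutated
    if rng.random() > mcf:          # true with probability 1 - mcf
        # second part: re-draw gene c between its lower and upper bound
        brood[c] = lower[c] - rng.random() * (lower[c] - upper[c])
    return brood
```

With mcf set to 1 the mutation branch never fires, which makes it easy to observe the blending step on its own.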

Polygyny denotes a mating system in which a male bird tends to mate with two or more females. Possible benefits of extra-pair copulation include getting better genes for the brood. In nature, a polygynous bird mates with several females, resulting in a number of broods. In BMO this behavior is adopted metaphorically: a polygynous bird mates with multiple females, but only one brood is raised, whose genes are a combination of the females' genes. After attracting the females, each male selects his interesting ones with a probabilistic approach and mates with them. The resultant brood is produced by the following equation.

$$
\begin{aligned}
&\vec{x}_b = \vec{x} + w \times \sum_{j=1}^{n_i} \vec{r}_j \circ (\vec{x}_j^{\,i} - \vec{x}) \\
&c = \text{a random number between 1 and } n \\
&\text{if } r_1 > mcf \\
&\qquad x_b(c) = l(c) - r_2 \times \bigl(l(c) - u(c)\bigr) \\
&\text{end}
\end{aligned}
\tag{2}
$$

where $n_i$ is the number of interesting birds and $\vec{x}_j^{\,i}$ denotes the jth interesting bird.
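
The polygynous update can be sketched in the same way. This is a minimal sketch under stated assumptions: the brood is taken to be the male plus the weighted sum of random steps towards each interesting female, followed by the same single-gene mutation as for monogamous birds; the name `polygynous_brood` and its arguments are hypothetical.

```python
import random

def polygynous_brood(x, females, w, mcf, lower, upper, rng=random):
    """Hypothetical sketch: one brood from a male x and several interesting females."""
    n = len(x)
    brood = []
    for k in range(n):
        # assumed form: sum of random pulls towards every interesting female
        pull = sum(rng.random() * (f[k] - x[k]) for f in females)
        brood.append(x[k] + w * pull)
    c = rng.randrange(n)            # zero-based index of the gene that may be mutated
    if rng.random() > mcf:          # true with probability 1 - mcf
        brood[c] = lower[c] - rng.random() * (lower[c] - upper[c])
    return brood
```

Because each female contributes its own random vector, the brood is not a simple average of the mates, which keeps some diversity even when the same females are chosen repeatedly.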

Promiscuity implies a mating system with no stable relationships in which mating between two birds is a one-time event. This type of mating indicates a rather chaotic social structure in which the male will almost certainly never see his brood or the nest, and most likely will not see the female for another brief visit. In promiscuity, which is used by a few bird species, a male bird tends to mate with one female. In BMO, promiscuous birds are produced using a chaotic sequence through the generations. However, a promiscuous bird breeds in the same way as a monogamous bird.

Parthenogenesis denotes a mating system in which a female is able to raise brood without the help of males [25]. In this system, each female tries to pass on better genes to her brood by making a small change in her genes probabilistically. Each parthenogenetic bird produces a brood by the following process.

where mcf_p is the parthenogenetic mutation control factor and μ denotes the step size.
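
The parthenogenetic update can be illustrated with a short Python sketch. The text only states that each gene is changed slightly and probabilistically, controlled by the mutation control factor mcf_p and a step size; the exact perturbation used here (a random fraction of the gene's own value, scaled by `mu`) is an assumption of this sketch, as is the function name.

```python
import random

def parthenogenetic_brood(x, mcf_p, mu, rng=random):
    """Hypothetical sketch: a female changes each gene slightly with probability 1 - mcf_p."""
    brood = list(x)
    for k in range(len(x)):
        if rng.random() > mcf_p:
            # assumed perturbation: at most half a step size, relative to the gene value
            brood[k] = x[k] + mu * (rng.random() - 0.5) * x[k]
    return brood
```

A small `mu` keeps the brood close to its parent, so this species acts mainly as a local search around the best birds.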

Polyandry denotes a mating system in which a female bird tends to mate with two or more males. After attracting the potential males, each polyandrous bird selects her interesting ones by a probabilistic approach and mates with them. The resultant brood is produced in the same way as in Eq. (2).

The steps of BMO algorithm are as follows:

Step 1: initialization: a society of birds is randomly initialized in the search space. Each bird is a feasible solution of the problem and is specified by a vector, with the length of n.

Step 2: fitness function value: the quality of each bird is computed by putting its elements into the fitness function.

Step 3: ranking: the birds are ranked based on their quality.

Step 4: classification: birds with the most promising genes are selected as females and the others are chosen as males. The females are equally divided into two groups so that the better ones become parthenogenetic birds and the others become polyandrous ones. The males are categorized into three groups. The males of the first group, which have better genes than the others, are selected as monogamous. The males of the second group are chosen as polygynous.

Step 5: generating promiscuous birds: the males of the third group are removed from the society and new ones are generated using a chaotic sequence. The new birds are considered as promiscuous.

Step 6: breeding: each bird breeds a brood using its own pattern.

Step 7: replacement: each bird decides whether to add its brood to the society. The bird evaluates the brood's quality. If the brood lies in the search space and has better quality, the bird leaves the society and the brood joins it; otherwise, the brood is abandoned and the bird stays in the society.

Step 8: Steps 3-7 are repeated until a predefined number of generations is reached.

Step 9: the bird with the best quality is selected from the society as the final solution. Figs. 1 and 2 depict flowchart and pseudo code of BMO algorithm.
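
The ranking, classification and replacement steps above (Steps 3, 4 and 7) can be sketched as follows. The sketch assumes a minimization problem (lower objective value means better quality) and takes the group sizes as arguments; the function names and the dictionary layout are illustrative choices, not part of the original algorithm description.

```python
def classify_society(fitness, n_females, n_monogamous, n_polygynous):
    """Rank birds (lower fitness = better, an assumption) and split them into five groups."""
    order = sorted(range(len(fitness)), key=lambda i: fitness[i])   # best bird first
    females, males = order[:n_females], order[n_females:]
    half = n_females // 2
    return {
        "parthenogenetic": females[:half],       # better half of the females
        "polyandrous": females[half:],
        "monogamous": males[:n_monogamous],      # best males
        "polygynous": males[n_monogamous:n_monogamous + n_polygynous],
        "promiscuous": males[n_monogamous + n_polygynous:],   # worst males, regenerated in Step 5
    }

def replace_if_better(bird, brood, objective, lower, upper):
    """Step 7: the brood joins the society only if it is feasible and better than its parent."""
    in_bounds = all(lo <= g <= up for g, lo, up in zip(brood, lower, upper))
    if in_bounds and objective(brood) < objective(bird):
        return brood
    return bird
```

Returning index lists rather than copies of the birds keeps the classification cheap and lets each breeding loop address the society in place.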

Initialization:
    Determine the society size, the percentages of monogamous and polyandrous birds, the maximum number of generations, and the other parameters
    Compute objective function of the birds
Repeat
    Sort birds based on their objective function
    Partition the society into males and females
    Specify monogamous, polygynous, and polyandrous birds
    Remove the worst birds and generate promiscuous birds based on the chaotic sequence
    Compute objective function of the promiscuous birds
    For i=1 to number of monogamous birds
        Select interesting elite bird
        Produce the brood based on Eq. (1)
    Next i
    For i=1 to number of polygynous birds
        Select interesting elite birds
        Produce the brood based on Eq. (2)
    Next i
    For i=1 to number of polyandrous birds
        Select interesting elite birds
        Produce the brood based on Eq. (2)
    Next i
    For i=1 to number of promiscuous birds
        Select interesting elite bird
        Produce the brood based on Eq. (1)
    Next i
    For i=1 to number of parthenogenetic birds
        Produce the brood based on Eq. (3)
    Next i
    Compute objective function of the broods
    Perform replacement stage
    Update the parameters
Until termination criterion is met

BMO parameter setting

In order to apply the BMO algorithm to a problem, its adjustable parameters have to be tuned, since the performance of any optimization algorithm is affected by its parameters. The optimal tuning of the BMO parameters will be studied in future research. Nevertheless, some experimentally obtained guidelines for tuning the BMO parameters are given below.

(1) It seems that the most important parameter to adjust is the proportion of each species in the society. We propose that the percentages of monogamous, polygynous, promiscuous, polyandrous and parthenogenetic birds be set to 50%, 30%, 10%, 5% and 5% of the society, respectively.

(2) Two or three interesting mates are sufficient for polygynous and polyandrous birds.

(3) It is better to select the 10 monogamous birds of better quality than the other males as interesting candidates for the rituals of polyandrous birds.

(4) The mutation control factors (mcf and mcf_p) lie between 0 and 1. mcf can be set to 0.9 or 0.95; small values of this parameter may harm the performance of the algorithm. It is better to choose mcf_p as a linearly increasing function that changes from a small value near 0 (for example 0.1) to a large one near 1 (for example 0.9). This lets parthenogenetic birds change their genes with high probability at the beginning of the algorithm. The probability decreases over the generations, which helps the parthenogenetic birds converge to the solution.

(5) μ, which determines the mutation size of each gene of parthenogenetic birds, is of the order of 10^-2 to 10^-3.

(6) In order to provide a good balance between local and global search, w decreases linearly from a value near 2 to a small one near 0.
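
The guidelines above can be turned into a small helper sketch. The function names, the rule of giving any rounding remainder to the monogamous group, and the convention that generation 0 uses the start value are assumptions made for illustration.

```python
def linear_schedule(start, end, generation, max_generations):
    """Linearly interpolate from start (generation 0) to end (last generation)."""
    if max_generations <= 1:
        return end
    t = generation / (max_generations - 1)
    return start + t * (end - start)

def group_sizes(society_size, percentages=(50, 30, 10, 5, 5)):
    """Split the society by the proposed percentages of the five species."""
    names = ("monogamous", "polygynous", "promiscuous", "polyandrous", "parthenogenetic")
    sizes = [society_size * p // 100 for p in percentages]
    sizes[0] += society_size - sum(sizes)   # rounding remainder goes to the largest group
    return dict(zip(names, sizes))
```

For example, `linear_schedule(2.0, 0.0, g, G)` gives a decreasing w and `linear_schedule(0.1, 0.9, g, G)` gives an increasing mcf_p over G generations.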

Web document clustering theory and technology

The Web has quickly developed into a massive, widely distributed global information space. Web document clustering has become an important research question in Web information retrieval, electronic commerce, decision support and other research areas. The reason is that clustering can reveal the internal structure of a Web document set: closely related documents are grouped together to form a cluster, so that members of the same cluster are highly similar while members of different clusters are less similar. This addresses the information clutter phenomenon to a large extent and helps users locate the desired information accurately.

This chapter follows the process of document clustering and gives an overview of the relevant theory and techniques of document clustering and Web document clustering, including the document representation model, feature weight calculation, similarity measures, commonly used clustering algorithms, and clustering evaluation indicators. Finally, text clustering is analyzed.

4.1 Document representation model

4.1.1 Vector Space Model

In general, document clustering technology is based on four concepts: ① document data model; ② similarity calculation; ③ clustering model; ④ clustering algorithm. It can be seen that the clustering algorithm is built on the document representation model.

Existing text analysis techniques inherit many concepts from the traditional field of information retrieval (IR), among which a very important one is the vector space model (VSM). The vector space model is a statistical model for document representation, proposed by Gerard Salton et al. in the 1970s. It is mainly used to measure the similarity between sentences, paragraphs or documents. The basic idea is to represent a document or text as a feature vector. There are three concepts in the vector space model:

The VSM is mainly used to model a set of objects and their features mathematically, and the usual mathematical tool is a matrix corresponding to the vector space. In this matrix, each row is the feature vector of one object, and the length of each row is determined by the dimension of the feature space, denoted M below. Thus, if there are N objects, each described by M features, the collection can be represented by an N×M matrix V, in which the element v_ij is the weight of the j-th feature in the i-th object. In general, feature terms are not uniformly distributed over the documents, so V is usually a sparse matrix. The number of columns M is the number of features appearing in the documents and is generally very large. A key obstacle for text clustering with the VSM is therefore that the dimension of the feature space is too large: a directly constructed V is a high-dimensional sparse matrix.
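
To make the structure of V concrete, the following sketch builds a sparse document-term count matrix from whitespace-separated text. Storing each row as a dictionary of non-zero columns, the name `document_term_matrix` and the sample documents are assumptions chosen for illustration.

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a sparse N x M count matrix as a list of {column index: count} rows."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    rows = [dict(Counter(column[w] for w in doc.split())) for doc in docs]
    return vocabulary, rows

docs = ["web text mining", "web document clustering", "text clustering"]
vocabulary, rows = document_term_matrix(docs)
# fraction of non-zero entries; it shrinks quickly as the vocabulary grows
density = sum(len(r) for r in rows) / (len(rows) * len(vocabulary))
```

Only the non-zero weights are stored, which is what makes a matrix with a very large M manageable in practice.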

Given the vector space model, once the feature terms have been identified, weight calculation is another problem of document clustering. Most methods are based on the following two empirical observations:

① The more often a word appears in a document, the more relevant it is to the subject of that document;

② If a word appears in many documents of the collection, its contribution to classification is small.

Here f_ij denotes the frequency with which word j appears in document i, N is the number of documents in the document collection D, M is the number of words, and n_j counts the occurrences of word j in the document collection. Boolean weights, term frequency weights and TF-IDF weights are common methods of calculating weights.
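
The two observations above are what TF-IDF weighting combines. The sketch below uses one common variant, weight = tf × log(N / df), where df is the number of documents containing the word; this particular formula and the function name are assumptions, since several TF-IDF variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each word occurs
    df = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights
```

A word that occurs in every document receives weight 0 regardless of its frequency, which reflects the second observation.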

The huge dimension of the document feature matrix V makes subsequent data processing very difficult, a problem known in the literature as the "curse of dimensionality". Existing dimensionality reduction methods for high-dimensional data include principal component analysis (PCA), latent semantic indexing (LSI) and the self-organizing map (SOM).

Principal component analysis is one of the most widely used linear dimensionality reduction methods. It is conceptually simple and computationally efficient. PCA uses the size of the variance as the measure of information: the larger the variance, the more information is provided, and vice versa. Through a linear transformation, PCA retains the components with large variance, which carry more information, and discards the components carrying less information, thereby reducing the dimensionality of the data. However, PCA is a linear projection and cannot properly handle data lying on a nonlinear manifold; moreover, when using this method, one must decide how many principal components to retain.
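
As a small illustration of the idea, the sketch below finds the direction of largest variance and projects the data onto it. It uses power iteration on the covariance matrix, which is a simple stand-in chosen for this sketch rather than the usual eigendecomposition or SVD, and it recovers only the first principal component; the function name is hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the direction of largest variance and the 1-D coordinates of the rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # covariance matrix of the centred data
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(m)] for a in range(m)]
    v = [1.0] * m   # start vector; fails only if it is orthogonal to the top direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores
```

For data lying on a straight line, the single returned coordinate preserves all of the variance, so two dimensions are reduced to one without loss.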

Latent semantic indexing is a model proposed by Dumais et al. It uses statistical computation to derive the latent semantic links between the words and texts of a document collection, thus avoiding the word mismatch problem. At the same time, by means of the singular value decomposition (SVD), only the first k dimensions are kept as an approximation of the matrix, which greatly reduces the dimension of the matrix to be processed.

SOM uses unsupervised learning to obtain a mapping between the high-dimensional data space and a grid embedded in a space of fixed, lower dimension. SOM can thus be seen as a dimension reduction technique: it learns, without supervision, a mapping from the D-dimensional data space to an L-dimensional grid, and this mapping keeps the topology of the D-dimensional data consistent with the L-dimensional grid. However, the characteristics of the algorithm itself lead to some shortcomings: no cost function can be defined to optimize, the choice of algorithm parameters cannot guarantee convergence, and a general proof of convergence is lacking.

4.1.2 Suffix tree document model

Another model used in this paper is the suffix tree document model (Suffix Tree Model, STM). This model was first proposed for document clustering by Zamir and Etzioni in 1997. In fact, the suffix tree had long been used to build fully indexed data structures for efficient string matching and search. The STM treats a text document as a string of words and constructs a generalized suffix tree (GST) for the multiple documents of the document set. The STM provides a way to identify and extract repeated phrases from the document set.

The suffix tree provides efficient string matching and search algorithms and is widely used in basic research on string problems and in practical applications, such as searching large-scale biological sequence data, approximate string matching, and statistical feature extraction for spam filtering.

The suffix tree of a string is a compact trie composed of all suffixes of the string; each leaf node of the suffix tree represents one suffix. The following example demonstrates how a suffix tree is built from all the suffixes of a string. Take the string S = “Mississippi”, whose length is m = 11. To facilitate construction, a special character “$” is appended to the end of S. The substrings running from the i-th character to the m-th character, for 1 ≤ i ≤ m, are all the suffixes of S. In the suffix tree shown in Figure 4-1, each edge is labeled with a substring of S. The nodes of the suffix tree are divided into three kinds: the root node, intermediate (internal) nodes and leaf nodes. If an edge contains only “$”, the corresponding node is grayed out; this type of leaf node is called a terminal node. Each intermediate node has at least two child nodes, and nodes with no children are leaf nodes. Each leaf node is numbered with the suffix of S to which it corresponds.

Fig.4-1 Suffix tree of string S

String S suffix tree T is defined in the literature as follows:

① the suffix tree is a rooted tree;

② each edge is labeled with a non-empty substring of S, and the label of a node is the concatenation of the edge labels on the path from the root to that node;

③ the paths from the root to the leaf nodes are in one-to-one correspondence with the suffixes of S;

④ each intermediate node has at least two child nodes, and no two edges leaving the same node have labels that begin with the same word.
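
The definition can be explored with a deliberately naive sketch. It builds an uncompressed suffix trie (one character per edge) by inserting every suffix of the string plus the terminator, which takes quadratic time and space; a true suffix tree would additionally merge every chain of single-child nodes into one edge. The nested-dictionary layout, the reserved key `"leaf"` and the function names are assumptions of this sketch.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie (uncompressed)."""
    s = text + terminator
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1          # 1-based start position of the suffix ending here
    return root

def contains(trie, pattern):
    """A pattern occurs in the text exactly when it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

def count_leaves(node):
    """Count suffixes; the multi-character key "leaf" cannot clash with an edge character."""
    return sum(1 if key == "leaf" else count_leaves(child) for key, child in node.items())
```

For `build_suffix_trie("mississippi")` there is one leaf per suffix, including the suffix that consists of the terminator alone, which matches property ③.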

Weiner first proposed a linear-time method for constructing suffix trees in 1973. Three years later, McCreight gave a simpler and more space-saving construction method. McCreight's algorithm requires the suffix tree to be built in reverse order, that is, the characters of the string are added to the suffix tree sequentially starting from the end. This requirement prevents the suffix tree from being updated online, and has hindered the application of suffix trees built in this way.

Ukkonen proposed a left-to-right linear-time suffix tree construction method in 1995, which also allows the suffix tree to be updated dynamically. Nelson gives a demonstration of suffix tree construction using Ukkonen's algorithm. Chim also uses this method in his paper to demonstrate the construction process of Ukkonen's algorithm for the string S = “abababc”.

Starting from an empty tree, Ukkonen's algorithm incrementally adds each prefix of S to the suffix tree. For the string S, the first character a is inserted first, then ab, aba and so on. When abababc is finally inserted, the suffix tree is complete. In this way Ukkonen's algorithm creates a series of implicit suffix trees, the last of which is the true suffix tree we need. An implicit suffix tree is a simplified suffix tree to which no terminator has been added and into which suffixes that are prefixes of other suffixes have not been inserted. Figure 4-2 shows the implicit suffix trees after a, ab, aba and abab are inserted. The suffix tree after abab is inserted is shown separately in Figure 4-3.

Fig. 4-2 Implicit suffix tree of “abab”

Fig. 4-3 Suffix tree of “abab”

A new prefix is inserted by traversing the tree and extending each suffix in it. The algorithm starts from the longest suffix, abab in Figure 4-2, and proceeds to the shortest suffix, usually the empty string. Each suffix ends at one of the following three types of nodes:

① leaf nodes, such as the nodes numbered 3, 4, 5 and 6 in Figure 4-3.

② explicit nodes, that is, non-leaf (intermediate) nodes such as nodes 1 and 2 in Figure 4-3. They mark the points at which edges split.

③ implicit nodes. As shown in Figure 4-3, prefixes such as a and b end in the middle of an edge. These positions are called implicit nodes. They belong to the suffix tree but, because of path compression, do not appear as nodes. During the construction of the suffix tree, some implicit nodes are converted into explicit nodes.

The suffix link is the main feature of Ukkonen's linear-time suffix tree construction algorithm. It is a pointer found in each intermediate node. In a complete suffix tree, each intermediate node represents the string on the path from the root to that node, and its suffix link points to the node representing the first suffix of that string. Thus, if a string consists of input characters 0 to m-1, its suffix link points to the node at which the string consisting of characters 1 to m-1 ends. Figure 4-4 shows the suffix tree of the string S = “abababc”. The first suffix link is found at node 4, which represents the string abab. The first suffix of abab is bab, so the suffix link of node 4 points to node 6. Similarly, node 6 has its own suffix link, pointing to node 1.

Suffix links are built while the suffix tree is updated. The algorithm keeps track of the parent of each newly created leaf node. When a new edge is created, a suffix link is created from the previous parent node to the current parent node. With suffix links as a guide, moving from one suffix to the next is very easy.

Fig.4-4 Suffix tree of “abababc” with suffix links

In order to obtain O(m) time complexity, Ukkonen's algorithm uses the following two techniques.

① once a leaf node is created, it remains a leaf node throughout the rest of the suffix tree construction;

② to limit updates to the tree, Ukkonen defines the active point and the end point, which bound the region of the tree that changes during an update.

With these two techniques, constructing the suffix tree of a string of length m has time complexity O(m). In addition, Ukkonen proved this linear time complexity using the theory of finite state automata.

In information retrieval, a phrase is used as an information-richer feature term, because a phrase consists of one or more words and preserves the order of the words. To identify and extract the phrases shared within a document set, the suffix tree document model (STM) uses the suffix tree data structure as a suitable representation of documents for clustering. In traditional applications, the suffix tree serves full-text indexing, string matching and search, and a string is regarded as a sequence of characters. The STM also treats each document involved in clustering as a string, but one made up of words rather than characters, and the suffix tree constructed for the document set forms the STM. Therefore, in the STM, the suffix tree contains all suffixes of all documents.

Definition 2.1 The suffix tree T of a string S of m words is a rooted, directed compact trie with m leaf nodes. Each intermediate node has at least two child nodes; each edge is labeled with a non-empty substring of S, and no two edges leaving the same node have labels that begin with the same word. The label of a node is the concatenation of the edge labels on the path from the root to that node. The labels of the leaf nodes are in one-to-one correspondence with the suffixes of S.

Usually, when a suffix tree is created, a special terminator “$” is appended to the end of the string S, which prevents a suffix of S from being a prefix of another suffix.

Figure 4-5 shows the suffix tree built from three documents; the nodes of the suffix tree are drawn as circles. There are three kinds of nodes: the root node, intermediate nodes and leaf nodes. A special kind of leaf, called a terminal node, is reached by an edge containing only the special character “$”. Each leaf node represents a suffix of a document, and each intermediate node represents a phrase shared by at least two suffixes. Each intermediate node is associated with a rectangle, in which the upper figures are document IDs and the lower figures give the number of times the corresponding document passes through the node.

Fig.4-5 Suffix tree of three documents
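
The information that the STM extracts, namely which phrases are shared by which documents, can be illustrated without building the tree itself. The brute-force sketch below enumerates every word sequence that starts a word-level suffix, which corresponds to the paths of an uncompressed word-level suffix trie; it is quadratic per document and only meant as an illustration. The function name and the three sample documents are hypothetical.

```python
from collections import defaultdict

def shared_phrases(documents, min_docs=2):
    """Map each phrase occurring in at least min_docs documents to the ids of those documents."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(documents):
        words = text.split()
        for start in range(len(words)):                  # every word-level suffix
            for end in range(start + 1, len(words) + 1):  # every prefix of that suffix
                phrase_docs[tuple(words[start:end])].add(doc_id)
    return {" ".join(p): sorted(ids) for p, ids in phrase_docs.items() if len(ids) >= min_docs}

documents = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
phrases = shared_phrases(documents)
```

Because word order is preserved, the multi-word phrase "ate cheese" is kept as a feature for documents 0 and 1, which a bag-of-words representation would lose.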

4.2 Feature item weight calculation

The feature item weight (weight of a feature term) can be calculated using a variety of methods. Each method has its own purpose, and none is the best in all cases. From a survey of the literature, I found that term frequency (TF) and document frequency (DF) are important parameters for calculating feature item weights. This section gives an overview of feature item weight calculation.

Initially, only term frequency (TF) was used as a parameter for calculating feature item weights; later, document frequency (DF) and inverse document frequency (IDF) were also used. IDF was first proposed in 1972 by Spärck Jones and is the most commonly used measure of the importance of a word. IDF is defined from the quotient of the total number of documents in the collection and the number of documents that contain the specified word. Its role is to measure how well a word distinguishes between document topics. The TF-IDF scheme, which combines TF and IDF, has proven to be a very effective way of calculating feature item weights and is also the most widely used in information retrieval. TF provides a local measure of a feature item within a document, while IDF provides a global measure of the feature item over the document set.

During document clustering, two questions are considered: ① which features best describe the subject of a document, so as to determine its cluster; ② which features distinguish the documents of a cluster from the rest of the document set. The first feature set provides a measure of similarity between the documents within a cluster, and the second provides a measure of dissimilarity between clusters. In the vector space model, TF measures document similarity and IDF measures document dissimilarity.

4.3 Similarity measure

The main role of a similarity measure is to reflect the degree of similarity between two documents or two data objects. The similarity measure has an important influence on the result of a clustering algorithm. Many similarity measures are reported in the literature; this paper reviews only the most commonly used ones.

Typically, the similarity between two documents or data objects lies in the interval [0, 1]; that is, the maximum similarity between two documents is 1 and the minimum is 0. A similarity measure is generally required to satisfy reflexivity and symmetry.

A data object is usually described by a set of features and expressed as a multidimensional vector. Let x be an object in the data set X.

There are many ways to convert a distance measure into a similarity measure. In cluster analysis, the emphasis is on the reflexivity and symmetry of the similarity. In general clustering studies, similarity measures and distance measures can be used interchangeably.
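As an illustration of these properties, the sketch below implements cosine similarity, which is commonly used with the vector space model, and one possible conversion from a distance to a similarity. The conversion 1 / (1 + d) is only one of many options and is an assumption made here for illustration; both function names are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors.

    For non-negative weights (such as TF-IDF) the value lies in [0, 1].
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_x * norm_y)

def distance_to_similarity(d):
    """Map a distance d >= 0 to a similarity in (0, 1] (assumed form)."""
    return 1.0 / (1.0 + d)

x, y = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
print(cosine_similarity(x, y), cosine_similarity(y, x))
```

Both functions are symmetric in their arguments, and an object compared with itself obtains the maximum similarity of 1.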

4.4 Clustering Algorithm

Clustering is a widely used data mining technique: it discovers clusters in the data, where the objects that meet certain criteria form a cluster. Document clustering follows the way humans visually analyze data: assigning data to different clusters makes the data easier to understand. The clustering process is based on a series of metrics or measurements, and it usually assigns documents to different clusters according to the similarity or distance between data objects. The similarity measure and the optimization criterion (Optimization Criteria) are two important factors that affect the clustering result.

According to the data representation, the process, the methods applied and the purpose, clustering algorithms can be divided into the following types.

① Hierarchical method (Hierarchical Method)

② Partitioning method (Partitioned Method)

③ Density-based method (Density-based Method)

④ Grid-based method (Grid-based Method)

⑤ Model-based method (Model-based Method)

⑥ Fuzzy clustering method (Fuzzy Clustering Method)

⑦ Incremental clustering method (Incremental Clustering Method)

This section describes several common document clustering algorithms and gives a brief comparative analysis, drawing in part on the cited literature.

4.4.1 Hierarchical clustering

A hierarchical clustering algorithm organizes the data objects into a hierarchy in which the clusters form the nodes of a tree. If the hierarchy is built bottom-up, the method is called agglomerative hierarchical clustering (Agglomerative Hierarchical Clustering); if it is built top-down, the method is called divisive hierarchical clustering (Divisive Hierarchical Clustering).

Agglomerative hierarchical clustering first treats each data object as a cluster and then gradually merges these clusters into larger ones, until all data objects are in the same cluster or a termination condition is satisfied. Divisive hierarchical clustering does the opposite: it first places all objects in one cluster and then gradually divides it into smaller and smaller clusters, until each object forms a cluster of its own or a termination condition is reached, such as reaching a desired number of clusters or the distance between the two closest clusters exceeding a certain threshold.
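The agglomerative procedure can be sketched as follows. This is a minimal illustration that assumes single-link distance (the distance between the two closest members) and uses a desired number of clusters k as the termination condition; the function name agglomerative and the sample data are hypothetical, and the quadratic search over cluster pairs is kept deliberately simple.

```python
def agglomerative(points, k, dist):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance: closest pair of members (assumption).
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters

points = [1.0, 1.2, 5.0, 5.1, 9.0]
result = agglomerative(points, 3, lambda a, b: abs(a - b))
print(result)
```

Because every merge requires examining all pairs of clusters, the cost grows quickly with the number of objects, which is the scalability problem discussed for hierarchical methods.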

Hierarchical methods allow the data objects to be analyzed at different levels of granularity, and similarity or distance measures are easy to apply. However, in a simple hierarchical clustering algorithm the termination condition is vague, and a merge or split, once performed, cannot be undone, which may reduce the quality of the clustering result. In addition, because a large number of objects or clusters must be examined and evaluated to decide each merge or split, the method lacks scalability. For this reason, hierarchical clustering is usually combined with other methods when solving clustering problems. An effective combination forms a multi-stage clustering and can improve clustering quality. Such methods include the BIRCH, CURE, ROCK and Chameleon algorithms.

①BIRCH algorithm

The BIRCH algorithm (Balanced Iterative Reducing and Clustering using Hierarchies) performs balanced iterative reduction and clustering with a hierarchical approach. It first partitions the objects hierarchically in a tree structure and then uses other clustering algorithms to refine the result. It introduces two concepts, the clustering feature (CF) and the clustering feature tree (CF tree), which summarize the description of clusters and improve the efficiency and scalability of clustering on large databases.

A clustering feature is a triple that summarizes the information about a sub-cluster. Suppose a sub-cluster contains N d-dimensional data points or objects {O_i} (i = 1, 2, …, N); then the clustering feature of the sub-cluster is defined as follows:

CF= (N,LS,SS) (4-13)
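The following sketch illustrates why the CF triple is a convenient summary. It assumes the usual BIRCH reading of the symbols, which is not spelled out in this text: N is the number of points, LS is their linear sum (a d-dimensional vector) and SS is the sum of their squared norms (a scalar). The function names are hypothetical.

```python
def cf_of(points):
    """Clustering feature (N, LS, SS) of a list of d-dimensional points."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[d] for p in points) for d in range(dim)]  # linear sum
    ss = sum(x * x for p in points for x in p)            # sum of squares
    return (n, ls, ss)

def cf_merge(a, b):
    """CF of the union of two disjoint sub-clusters: add component-wise."""
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def cf_centroid(cf):
    """The centroid can be recovered from the CF alone."""
    n, ls, _ = cf
    return [x / n for x in ls]

a = [(1.0, 2.0), (3.0, 4.0)]
b = [(5.0, 6.0)]
merged = cf_merge(cf_of(a), cf_of(b))
print(merged, cf_centroid(merged))
```

Because clustering features are additive, two sub-clusters can be merged without revisiting the original points, which is what allows the tree to be maintained incrementally.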

A CF tree is a height-balanced tree that stores the clustering features of a hierarchical clustering. It has two parameters: the branching factor B and the threshold T. The branching factor B defines the maximum number of children of each non-leaf node, while the threshold T gives the maximum diameter of the sub-clusters stored in the leaf nodes. Together, B and T determine the size of the CF tree.

BIRCH uses a multi-stage clustering technique: a single scan of the data set produces an initial CF tree, and one or more additional scans can then improve the quality of the CF tree. Once the CF tree is built, any clustering algorithm, such as a typical partitioning method, can be used to cluster its leaf nodes. The CF tree is constructed dynamically as new data objects are inserted; rebuilding it resembles node insertion and splitting in the construction of a B+ tree, so BIRCH supports incremental clustering.

Because the size of each node in the CF tree is limited, a node does not always correspond to what would be regarded as a natural cluster. Moreover, BIRCH does not work well if the clusters are not spherical, because it uses the notion of diameter to control the boundary of a cluster.

②CURE algorithm

CURE is a bottom-up hierarchical clustering algorithm that combines hierarchical and partitioning methods and overcomes the tendency of such algorithms to find spherical clusters. When calculating the distance between clusters, it uses neither a single centroid nor all the points, but a fixed set of points. In other words, it represents a cluster by several objects rather than by one. CURE selects a number of well-scattered points from a cluster to represent it; these points capture the shape and size of the cluster. The points are then shrunk toward the cluster centroid by a certain factor. By adjusting the value of the shrinking factor, CURE can identify different types of clusters. The positions of the points after shrinking represent the clusters; the two closest clusters are found and merged. This process is repeated until the desired number of clusters is obtained.

To cluster large data sets, CURE uses random sampling. Although random sampling is in effect a compromise between precision and efficiency, the clustering quality is generally well preserved for a sample of moderate size. To speed up clustering, CURE first partitions the sample data and clusters locally within each partition; after outliers are removed, the local clusters of all partitions are clustered again to produce the final clusters.

CURE overcomes the disadvantages of representing a cluster by a single centroid or by all of its points, and it can find non-spherical clusters and clusters that differ widely in size. Shrinking the scattered representative points also reduces the algorithm's sensitivity to outliers.

③ROCK algorithm

CURE cannot handle categorical (enumerated) data. ROCK is an agglomerative hierarchical clustering algorithm, developed on the basis of CURE, that is suited to categorical data. It measures the similarity of two clusters by comparing their aggregate interconnectivity with a user-defined static interconnectivity model. Here, the interconnectivity of two clusters is the number of cross links between them, and the link between two points is the number of neighbors they have in common. In other words, the similarity between clusters is determined by the number of points in different clusters that have common neighbors.

ROCK first builds a sparse graph from the given similarity matrix, using a similarity threshold and the concept of common neighbors, and then applies a hierarchical clustering algorithm to the sparse graph.

④Chameleon algorithm

Chameleon can be regarded as a hierarchical clustering algorithm that uses a dynamic model. During clustering, two clusters are merged if the interconnectivity and closeness between them are high relative to the internal interconnectivity and closeness of the objects within each cluster. A merging process based on this dynamic model favors the discovery of natural and homogeneous clusters, and it can be applied to all types of data as long as a similarity function is defined.

Chameleon was proposed on the basis of the CURE and ROCK algorithms. CURE and related schemes ignore information about the aggregate interconnectivity of objects in two different clusters, while ROCK and related schemes emphasize the interconnectivity between objects but ignore information about their closeness. Chameleon considers both interconnectivity and closeness, and in particular the internal characteristics of the clusters, to determine the most similar sub-clusters. It therefore does not depend on a static model supplied by the user and can adapt automatically to the internal characteristics of the clusters being merged.

Chameleon first uses a graph partitioning algorithm to cluster the data objects into a number of relatively small sub-clusters and then uses an agglomerative hierarchical clustering algorithm to merge the sub-clusters repeatedly and discover the real clusters. It merges the pairs of clusters for which both RI (relative interconnectivity) and RC (relative closeness) are high, that is, clusters that are both well connected and close to each other. Compared with CURE, Chameleon is better at finding high-quality clusters of arbitrary shape. However, when processing high-dimensional data objects, the algorithm may reach its worst-case complexity.

4.4.2 Partitioning methods

Given a data set of n data objects, a partitioning method (Partitioned Method) divides the data into k groups, each of which is a cluster, by minimizing an objective function. The partition must satisfy two conditions: ① each group contains at least one data object; ② each data object belongs to one and only one cluster. In some cases, such as fuzzy clustering, a data object may belong to several clusters. The most common and best known partitioning algorithms are the k-means algorithm and the k-medoids (k-center point) algorithm; most other partitioning methods are variations of these two.

①k-means algorithm

The k-means algorithm takes a parameter k and divides n objects into k clusters so that similarity within a cluster is high while similarity between clusters is low. Similarity is computed with respect to the mean of the objects in a cluster, that is, the cluster centroid. The k-means algorithm proceeds as follows: first, k objects are selected at random as the initial centroids of the k clusters; then each remaining object is assigned to the nearest cluster according to its distance from each cluster centroid; finally, the centroid of each cluster is recalculated. This process is repeated until the objective function converges. The objective function is usually the squared-error criterion, that is, the sum over all clusters of the squared distances between each object and the centroid of its cluster. The steps are:

Step 1: Randomly choose k objects as the initial cluster centroids;

Step 2: Calculate the distance of each object from the centroid of each cluster, and assign the object to its nearest cluster;

Step 3: Recalculate the mean of each new cluster, that is, its centroid;

Step 4: If the cluster centroids do not change, return the resulting partition; otherwise go to Step 2.
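The four steps above can be sketched as follows. This is a minimal illustration rather than a tuned implementation: it assumes squared Euclidean distance, a seeded random choice of the initial centroids so that runs are repeatable, and an iteration cap max_iter; all names are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # Step 2: assign each object to its nearest centroid
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute centroids (an empty cluster keeps its old one)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # Step 4: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, 2)
print(centroids)
```

On well-separated data such as this sample the two groups are recovered regardless of the initial choice; in general the result depends on the initial centroids.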

The k-means algorithm tries to find the k partitions that minimize the squared-error function. It works well when the clusters are dense and clearly separated from one another. For large data sets the algorithm is relatively scalable and efficient. Its complexity is O(nkt), where n is the number of objects in the data set, k is the desired number of clusters and t is the number of iterations. The algorithm often terminates at a local optimum.

However, the k-means algorithm can be used only when the mean of a cluster is defined. This may not be the case in some applications, such as those involving data with categorical attributes. Second, the algorithm requires the number of clusters k to be given in advance, which is impractical in some applications. In addition, the k-means algorithm is not suited to finding clusters of non-convex shape or clusters that differ greatly in size. Furthermore, it is sensitive to noise and outliers.

The k-means algorithm has many variants. For example, the k-modes algorithm replaces the cluster means with modes, uses a new dissimilarity measure to handle categorical objects, and updates the modes of the clusters with a frequency-based method. Combining the k-means and k-modes algorithms gives the k-prototypes algorithm, which handles data with both numeric and categorical attributes.

There are many other improvements to the k-means clustering algorithm, such as the BK-means (bisecting k-means) algorithm, the incremental k-means algorithm and the CFK-means algorithm.

②k-medoids algorithm

The k-medoids (k-center) algorithm proceeds in much the same way as the k-means algorithm. The only difference is that k-medoids represents a cluster by the object closest to the center of the cluster, whereas k-means represents a cluster by its centroid. The k-means algorithm is very sensitive to noise and outliers, because an extreme value has a large influence on the computed centroid. By using a central object (medoid) instead of the centroid, the k-medoids algorithm effectively reduces this effect.

The k-medoids algorithm proceeds as follows: first, k objects are selected at random as the initial representative points of the k clusters, and each remaining object is assigned to the cluster of its nearest representative point; then a representative point is repeatedly replaced by a non-representative point, and the clustering quality is checked for improvement. If the quality improves, the replacement is kept; otherwise it is abandoned. The process is repeated until no further change occurs. Clustering quality is estimated with a cost function that measures the average dissimilarity between the objects and their representative points.
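The replacement procedure can be sketched as follows. This is a simplified illustration in the spirit of PAM, not a faithful reproduction of any published variant. To keep the example deterministic, the first k objects are taken as the initial representative points instead of a random selection, and the total (rather than average) dissimilarity is used as the cost, which ranks replacements identically; all names are hypothetical.

```python
def total_cost(points, medoids, dist):
    """Sum of each object's dissimilarity to its nearest representative."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, dist):
    medoids = list(points[:k])  # assumption: deterministic initial choice
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:  # repeat until no replacement improves the quality
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing representative i by a non-representative point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial, dist)
                if trial_cost < cost:  # keep the replacement only if better
                    medoids, cost = trial, trial_cost
                    improved = True
    return medoids, cost

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, cost = k_medoids(points, 2, lambda a, b: abs(a - b))
print(sorted(medoids), cost)
```

Each representative point is always an actual object of the data set, which is the property that distinguishes this method from k-means.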

Among the various k-medoids algorithms, the more common ones are PAM (Partitioning Around Medoids), CLARA (Clustering Large Applications) and CLARANS (Clustering Large Applications based upon Randomized Search).

③Bisecting k-means algorithm

The bisecting k-means algorithm, referred to as BK-means, is a direct extension of the k-means algorithm. It is based on a simple idea: to obtain k clusters, the set of all data objects is first split into two clusters; one of these clusters is then selected and split again, and so on, until k clusters are obtained. Unlike k-means, BK-means is strictly speaking a divisive hierarchical clustering algorithm, and it can express the document categories as a nested tree structure. Such a hierarchical structure fits the characteristics of text collections better, and the algorithm is less sensitive to initialization. The algorithm not only has low time complexity but also gives good clustering results. The literature reports that BK-means performs better on text than the traditional k-means method and the hierarchical method UPGMA.

Research on divisive hierarchical clustering algorithms focuses on how to select the cluster to be divided next and how to divide the selected cluster. The main process of the BK-means algorithm is as follows:

Step 1: Initialize the cluster table so that it contains a single cluster composed of all the data objects.

Step 2: Remove one cluster from the cluster table. The selected cluster undergoes several trial bisections; the number of trials is limited to M. Set the trial counter T = 0.

Step 3: If T < M, go to Step 4; otherwise, go to Step 5.

Step 4: Use the k-means algorithm to divide the selected cluster into two clusters, set T = T + 1, and go to Step 3.

Step 5: From the trial bisections, select the pair of clusters with the best partition, and add these two clusters to the cluster table.

Step 6: If the cluster table contains k clusters, stop; otherwise go to Step 2.
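The six steps above can be sketched as follows. The text does not fix which cluster is removed in Step 2 or how the best bisection is judged in Step 5, so this sketch assumes that the largest cluster is split and that the best trial is the one with the lowest total squared error. It also assumes k does not exceed the number of objects. All names are hypothetical, and the parameter trials plays the role of M.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def sse(cluster):
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster)

def two_means(points, rng, max_iter=50):
    """One k-means run with k = 2 (Step 4)."""
    centers = rng.sample(points, 2)
    for _ in range(max_iter):
        halves = ([], [])
        for p in points:
            closer_to_first = sq_dist(p, centers[0]) <= sq_dist(p, centers[1])
            halves[0 if closer_to_first else 1].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split, e.g. duplicate points
        new_centers = [centroid(h) for h in halves]
        if new_centers == centers:
            break
        centers = new_centers
    return halves

def bisecting_k_means(points, k, trials=5, seed=0):
    rng = random.Random(seed)
    table = [list(points)]                 # Step 1: one cluster with all objects
    while len(table) < k:                  # Step 6: stop at k clusters
        table.sort(key=len)
        chosen = table.pop()               # Step 2: largest cluster (assumption)
        best = None
        for _ in range(trials):            # Steps 3-4: M trial bisections
            left, right = two_means(chosen, rng)
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, left, right)
        if best is None:                   # the cluster could not be split
            table.append(chosen)
            break
        table.extend(best[1:])             # Step 5: keep the best pair
    return table

points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,), (51.0,)]
result = bisecting_k_means(points, 3)
print(result)
```

The order in which clusters are split defines a binary tree over the data, which is the nested hierarchical structure mentioned above.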

4.4.3 Density-Based Methods

Many algorithms describe the similarity between data objects by distance. For non-spherical data sets, however, distance alone is not sufficient. In this case density is used in place of similarity, which gives density-based clustering algorithms. These algorithms start from the distribution density of the data objects and connect adjacent regions whose density is high enough, so that clusters of arbitrary shape can be found. Besides finding clusters of arbitrary shape, such algorithms can also effectively remove noise and outliers. Common density-based clustering algorithms include DBSCAN, OPTICS and DENCLUE.

4.4.4 Grid-based method

Grid-based clustering algorithms quantize the vector space into a finite number of cells and then cluster on the quantized space. Such algorithms are fast, but their drawback is that they can find only clusters whose boundaries are horizontal or vertical; oblique boundaries cannot be detected. The time complexity of a grid-based clustering algorithm is usually determined by the number of grid cells and is independent of the size of the data set. In addition, the accuracy of the clustering depends on the size of the grid cells. Such methods are not suited to high-dimensional data, because the number of grid cells grows exponentially with the number of dimensions.

All grid-based clustering algorithms face the following problems: first, how to choose the right cell size and number of cells, since too few cells give low accuracy while too many cells make the algorithm more complex; second, how to summarize the information about the objects in each cell. Typical grid-based clustering algorithms include STING, WaveCluster and CLIQUE.

4.4.5 Model-based approach

Model-based clustering methods are built on the assumption that the data follow an underlying probability distribution. Such methods attempt to optimize the fit between the given data and some mathematical model. Model-based clustering methods are mainly statistical methods and neural network methods.

Among the statistical clustering methods, the best known is Fisher's COBWEB algorithm, a simple incremental conceptual clustering algorithm that creates a hierarchical clustering in the form of a classification tree. Each node of the classification tree corresponds to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node. The algorithm uses a heuristic evaluation measure, the category utility. When an object is added to the classification tree, it is placed at the position that yields the highest category utility: depending on which partition produces the highest category utility, the object is either placed in an existing class or a new class is created for it.

CLASSIT is an extension of the COBWEB algorithm that performs incremental clustering of continuous data. The AutoClass algorithm uses Bayesian statistical analysis to estimate the number of clusters. It searches the model space for all possible classifications and automatically determines the number of classes and the complexity of the model description.

Neural network methods describe each cluster as an exemplar. The exemplar serves as the prototype of the cluster and does not necessarily correspond to a specific data instance or object. According to some distance measure, a new object is assigned to the cluster whose exemplar is most similar to it. The attributes of an object assigned to a cluster can be predicted from the attributes of the cluster's exemplar. Neural network methods include the competitive learning neural network proposed by Rumelhart and others and the self-organizing feature map (Self-Organizing Feature Map, SOM) neural network proposed by Kohonen.

4.4.6 Fuzzy clustering method

When each data object can be assigned to only one cluster, the clustering is commonly known as hard clustering (Hard or Crisp Clustering) or deterministic clustering. When such certainty is not supported, the concept of fuzzy logic can be introduced into clustering: a data object belongs to a cluster to a certain degree, and it can belong to several clusters with different degrees of membership. The most commonly used fuzzy clustering (Fuzzy Clustering) algorithm is fuzzy C-means, FCM (Fuzzy C-Means Clustering Algorithm).

The FCM algorithm has three parameters that must be given manually in advance: the fuzzy weighting exponent m, the initial cluster centers and the number of clusters. The exponent m controls the degree of fuzziness of the membership matrix; it is difficult to analyze quantitatively, and the best value has to be found by experiment for the specific application. From experience, Bezdek gives the range 1.1 < m < 5. The choice of initial cluster centers determines the performance of the algorithm: if the initial cluster centers are not representative, the algorithm easily converges to a local optimum. In addition, the number of clusters has to be specified, and when a large amount of complex data is to be analyzed it is difficult to give this number manually. Once these three parameters are given, the speed of the FCM algorithm and the accuracy of its results depend on the amount of data.
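The role of the three parameters can be seen in the following sketch of the standard FCM iteration. The update formulas are the commonly used ones and are not taken from this text: the membership of a point in cluster i is 1 / Σ_k (d_i / d_k)^(2/(m-1)), and each center is the mean of the points weighted by membership raised to the power m. Euclidean distance, a fixed number of iterations and all names are assumptions made for illustration; m must be greater than 1.

```python
def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """points: list of tuples; centers: initial cluster centers (tuples)."""
    power = 2.0 / (m - 1.0)
    dim = len(points[0])
    for _ in range(iterations):
        # Membership matrix: one row per point, one column per cluster,
        # computed with respect to the current centers.
        memberships = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
                     for c in centers]
            if 0.0 in dists:
                # The point coincides with a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((di / dk) ** power for dk in dists)
                       for di in dists]
            memberships.append(row)
        # New centers: weighted means with weights u^m.
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[t] for w, p in zip(weights, points)) / total
                for t in range(dim)))
        centers = new_centers
    return centers, memberships

points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers, u = fuzzy_c_means(points, [(0.0,), (10.0,)], m=2.0)
print(centers)
```

Each row of the membership matrix sums to 1, so every object is shared among the clusters rather than assigned to exactly one of them.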

4.4.7 Incremental Clustering

A document similarity measure determines the degree of similarity between documents. How it is used, however, also depends on the clustering method. The literature describes many clustering algorithms. Steinbach compared these document clustering algorithms, and Charka proposed an incremental hierarchical clustering algorithm.

Incremental clustering is a strategy adopted by online applications, where time is a key factor for processing. An incremental clustering algorithm processes one data object at a time and incrementally assigns each data object to its corresponding cluster. This approach may seem simple, but several key issues have to be resolved: ① how to decide to which cluster the next object should be assigned; ② how to deal with the effect of the insertion order of objects on the clustering result; ③ whether a data object that has been assigned to a cluster may later be reassigned to another cluster. Heuristics are usually used to deal with these issues. An incremental clustering algorithm is judged mainly by whether it can find an appropriate cluster for each new data object without significantly sacrificing clustering accuracy because of the insertion order.

Common incremental clustering methods are single-pass clustering (Single-Pass Clustering) and K-nearest-neighbor clustering (K-Nearest Neighbor Clustering, KNN). Single-pass clustering processes the document set sequentially and compares each document with all clusters formed so far. If the similarity between the document and any cluster is higher than a given threshold, the document is added to the nearest cluster; otherwise it forms a new cluster by itself. The similarity between a document and a cluster is often calculated as the average similarity between the current document and all documents in the cluster.

The KNN method is mainly used for classification, but it has also been used for clustering. For each new document, the similarity between it and every other document is computed, and the K documents with the greatest similarity values are selected. If one cluster contains the majority of these K documents, the new document is assigned to that cluster.
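Single-pass clustering as described above can be sketched as follows. The similarity between a document and a cluster is taken as the average similarity to the cluster's members. The Jaccard coefficient on word sets is used here only as a stand-in similarity measure; it, the function names, the sample documents and the threshold value are assumptions made for illustration.

```python
def jaccard(a, b):
    """Stand-in document similarity: overlap of the two word sets."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def single_pass(docs, similarity, threshold):
    clusters = []
    for doc in docs:  # documents are processed once, in arrival order
        best_index, best_sim = None, 0.0
        for index, members in enumerate(clusters):
            # Average similarity between the document and the cluster members.
            avg = sum(similarity(doc, other) for other in members) / len(members)
            if avg > best_sim:
                best_index, best_sim = index, avg
        if best_index is not None and best_sim > threshold:
            clusters[best_index].append(doc)  # join the most similar cluster
        else:
            clusters.append([doc])            # otherwise start a new cluster
    return clusters

docs = [
    "cat ate cheese",
    "cat ate mouse",
    "stock market fell",
    "stock market rose",
    "cat ate cheese too",
]
clusters = single_pass(docs, jaccard, threshold=0.4)
print(clusters)
```

Because each document is compared only with the clusters that exist when it arrives, presenting the same documents in a different order can produce a different result, which is the insertion-order issue noted above.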

Weighted suffix tree Web document clustering method

5.1 Introduction

Web document clustering has become an important research issue in the field of Web information retrieval. Clustering can reveal the internal structure of a set of Web documents and group similar documents together into clusters, so that members of the same cluster are highly similar while members of different clusters are less similar. To a large extent this addresses the current clutter of online information and helps users locate the information they need conveniently and accurately. Therefore, if Web document clustering is carried out effectively in a Web information retrieval system, it can greatly improve overall system performance and optimize search results.

Typically, document clustering technology is built on four concepts: the data representation model, the similarity measure, the clustering model and the clustering algorithm. Most existing clustering methods are based on the vector space model (Vector Space Model, VSM). In this data representation model, a document is represented as a feature vector of words (bag of words), the weight of each word is calculated with the TF-IDF formula, and the similarity between documents is obtained by calculating the cosine similarity of their feature vectors. Representing text features by the frequency with which words appear in the text rests on an independence assumption, namely that the words making up a text are independent of one another. This is not true in practice: there are many relationships between words, so the method has some limitations.

Currently, many document clustering algorithms are hard clustering algorithms: each text belongs to only one cluster, and clusters neither overlap nor intersect. This cannot fully reflect the real clustering characteristics of a text set. In practice, a text set often contains texts that cover several topics and therefore belong to more than one cluster.

Zamir first proposed a text clustering algorithm that represents document features with a suffix tree, called suffix tree clustering (Suffix Tree Clustering, STC), which effectively addresses the above problems. Suffix tree clustering treats a text or document as a set of phrases (ordered sequences of one or more words) rather than a collection of individual words, so that the associations between the words of a text are considered an important feature of the text; that is, it takes the correlation between words as its premise. By identifying phrases shared between different documents, text clustering can make fuller use of the information about relationships between words and express text features better, which leads to better clustering results. The time complexity of the STC algorithm is roughly linear in the size of the document set, O (Long). The STC algorithm is based on the suffix tree model of the document set (Suffix Tree Model, STM); compared with the vector space model, it takes the order and proximity of words into account, which leads to better clustering results.

The suffix tree clustering algorithm allows clusters to overlap. It is a soft clustering method similar to fuzzy clustering: a document can be assigned to one or more clusters.

The suffix tree algorithm is a novel, incremental, linear-time method, and the resulting data structure is very compact. Suffix trees are well suited to basic string problems such as finding the longest repeated substring, approximate string matching, string comparison and text compression, and they make document clustering fast.

Given these advantages of the suffix tree clustering algorithm, this paper improves it and applies it to Web document clustering.

A Web document has a special structure and is regarded as a semi-structured document. HTML tags identify the structure and importance of the different parts of the document; the document title, for example, generally carries important information. Existing document clustering algorithms based on suffix trees treat a Web document as an ordinary document: the structural information is removed during document preprocessing, and the Web document is handled as a string without any format.

In order to integrate the structural information of Web documents into the suffix tree clustering algorithm, this paper proposes a new weighted suffix tree model, which gives a weighted suffix tree clustering method. According to the structural characteristics of Web documents, different levels of importance are assigned to the various parts of a Web document, and sentences instead of whole documents are used to build the generalized suffix tree. During construction, each suffix tree node stores not only the document ID, the sentence ID, counts and similar information, but also information about the document structure; that is, the level of importance is stored in the suffix tree node as the structural weight of the sentence, forming a weighted generalized suffix tree. In selecting and merging base clusters, the information stored in the nodes is used together: the number of documents, the number of sentences, the phrase length, the importance level and so on.

This chapter first describes the general suffix tree clustering algorithm and then introduces the weighted suffix tree clustering algorithm.

5.2 Suffix tree clustering algorithm

The suffix tree clustering algorithm STC proposed by Zamir is based on identifying phrases shared between texts. It is a fast clustering algorithm whose time complexity is roughly linear in the size of the document set.

Algorithm 3.1: Suffix tree clustering algorithm STC

Input: N documents to be clustered

Output: clusters in descending order of score

Steps:

① construct the generalized suffix tree of the document set;

② compute the scores of the candidate clusters and identify the base clusters;

③ merge the base clusters to form the final clusters, and sort them in descending order of score.

5.2.1 Text analysis and preprocessing

Text analysis is the basis of the STC algorithm; the main work of this step is to standardize the documents. Words in the text are recognized; for English text this mainly means stemming and removing stop words. Numbers, punctuation marks and HTML tags in Web pages are also identified. After preprocessing is complete, the generalized suffix tree can be constructed.

5.2.2 Suffix Tree

In the vector space model, text features are represented by the frequency with which words appear in the text. This rests on an independence assumption, namely that the words making up a text are independent of one another. In fact, there are many relationships between words, so this document model has some limitations.

A suffix tree is a data structure that was originally designed for efficient string pattern matching and querying. It has been widely used in string processing, for example for finding the longest repeated substring, approximate string matching, string comparison and text compression. A suffix tree is usually constructed over a string of characters; following Zamir, we instead regard a text as a string composed of words rather than of single characters. In this paper, a string refers to such a text composed of a number of words, and a document is likewise a string of words.

Definition 3.1: A text string S is a sequence of words.

Definition 3.2: The length of a string S is the number of words it contains.

Definition 3.3: A suffix of a string S is the substring of S that runs from some word of S to its last word.

A string S[1..m] of length m has m suffixes S[i..m],

where i = 1, 2, ..., m. Each suffix consists of one or more words.
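As a minimal sketch of Definitions 3.1 to 3.3, the word-level suffixes of a string can be enumerated as follows. The function name `word_suffixes` and the whitespace tokenization are assumptions made for illustration only; they are not part of the algorithm description.

```python
def word_suffixes(text):
    """Return all word-level suffixes S[i..m] of a text string, for i = 1..m."""
    words = text.split()  # assumption: words are separated by whitespace
    return [words[i:] for i in range(len(words))]


suffixes = word_suffixes("cat ate mouse ate cheese")
for i, suffix in enumerate(suffixes, start=1):
    print(i, " ".join(suffix))
```

A string of five words yields five suffixes, the shortest being the single word "cheese".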

Definition 3.4: The suffix tree of a string S is a compressed search tree containing all suffixes of S. It has the following properties:

(1) The suffix tree is a rooted directed tree;

(2) Each internal node other than the root has at least two children;

(3) Each edge is labeled with a non-empty substring of S;

(4) No two edges leaving the same node have labels that begin with the same word;

The label of a node is defined as the concatenation of the edge labels on the path from the root to that node;

(5) For any leaf i, the concatenation of the edge labels on the path from the root to the leaf spells out the suffix of S that starts at position i, namely S[i..m].

The suffix tree T of a string S of m words has exactly m leaves, which can be labeled 1 to m. Figure 5.1 shows the suffix tree of the string “cat ate mouse ate cheese”.

A suffix tree normally represents a single text or string, whereas a document set contains many strings. To use suffix trees for clustering, several texts or documents must be built into the same tree. The suffix tree concept is therefore extended to a generalized suffix tree over multiple documents.

Fig. 5.1 The suffix tree of “cat ate mouse ate cheese”

Fig. 5.2 The generalized suffix tree of d0, d1 and d2

In Figure 5.2, a box is attached to each internal node. Since the clustering that follows uses only internal nodes, each box records only the numbers of the strings that pass through that internal node. The generalized suffix tree also has a special kind of node whose incoming edge label contains only the special terminator; it is drawn as a gray circle and is called a terminal node.

The document set contains three document strings: the first string has three suffixes, the second has four and the third has four, so the figure shows 11 leaf nodes. An important feature of the generalized suffix tree is that substrings shared by several strings can be found in it. This is the main property that makes the suffix tree usable for clustering.

To construct the suffix tree of a string S, S[1..m] is generally first added to the tree as a single edge, and then the suffixes S[i..m], i = 2, 3, ..., m, are added to the tree in turn. The general construction algorithm is as follows:

① Let G_i denote the intermediate suffix tree in the construction process, that is, the tree obtained after suffixes 1 to i have been added;

② G_1 consists of a single edge from the root to the leaf labeled 1, and this edge is labeled with the string S[1..m];

③ the tree G_{i+1} is constructed from G_i as follows:

(a) Starting at the root of G_i, the algorithm finds the longest path from the root whose label matches a prefix of the suffix S[i+1..m]. The path is found by successively comparing and matching the words of S[i+1..m] along the unique path from the root, until no further match is possible.

(b) When no further match is possible, the algorithm is either at a node w or in the middle of an edge.

(c) If it is in the middle of an edge (u, v), the edge (u, v) is split into two edges by inserting a new node w just after the last matched word on the edge, so that the path to the new node w matches a prefix of S[i+1..m].

(d) In either case, a new edge (w, i+1) is created from w to a new leaf labeled i+1, and this edge is labeled with the unmatched part of the suffix S[i+1..m].

Constructing the suffix tree of a string S of length m with this method takes O(m²) time. Many improved suffix tree construction algorithms have been proposed in the literature, most notably Ukkonen's linear-time construction method, which is briefly introduced in Section 2.1.2 of this paper and is the method Zamir used. This section uses the same method as Zamir to construct the suffix tree.
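The naive construction in steps ① to ③ can be sketched as follows. This is an illustrative sketch of the O(m²) insertion procedure only, not of Ukkonen's linear-time method. The names `Node`, `insert_suffix`, `build_suffix_tree` and `leaf_ids` are hypothetical, and the sketch assumes that a terminator word "$", which must not occur in the text, is appended so that every suffix ends at its own leaf.

```python
class Node:
    def __init__(self, leaf_id=None):
        self.children = {}      # first word of edge label -> (edge label, child node)
        self.leaf_id = leaf_id  # 1-based start position of the suffix for a leaf, else None


def insert_suffix(root, suffix, leaf_id):
    """Insert one suffix (a tuple of words ending in the terminator) into the tree."""
    node, pos = root, 0
    while True:
        first = suffix[pos]
        if first not in node.children:
            # Steps (b) and (d): matching stopped at a node, so attach a new leaf edge.
            node.children[first] = (suffix[pos:], Node(leaf_id))
            return
        label, child = node.children[first]
        k = 0
        while k < len(label) and suffix[pos + k] == label[k]:
            k += 1
        if k == len(label):
            # Step (a): the whole edge matched, continue from the child node.
            node, pos = child, pos + k
            continue
        # Step (c): mismatch in the middle of an edge, so split it at a new node.
        mid = Node()
        mid.children[label[k]] = (label[k:], child)
        node.children[first] = (label[:k], mid)
        # Step (d): the unmatched rest of the suffix labels the new leaf edge.
        mid.children[suffix[pos + k]] = (suffix[pos + k:], Node(leaf_id))
        return


def build_suffix_tree(words, terminator="$"):
    """Build the word-level suffix tree by adding suffixes 1..m one after another."""
    s = tuple(words) + (terminator,)
    root = Node()
    for i in range(len(words)):
        insert_suffix(root, s[i:], i + 1)
    return root


def leaf_ids(node):
    """Collect the leaf labels below a node."""
    if node.leaf_id is not None:
        return [node.leaf_id]
    ids = []
    for _label, child in node.children.values():
        ids.extend(leaf_ids(child))
    return ids


tree = build_suffix_tree("cat ate mouse ate cheese".split())
print(sorted(leaf_ids(tree)))
```

For the five-word example string, the only internal node other than the root is the one reached by the edge "ate", which is shared by the suffixes starting at positions 2 and 4.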

5.2.3 Base cluster identification

In the STC algorithm, a base cluster is represented by a phrase cluster, so identifying base clusters means identifying phrase clusters. A phrase cluster is a set of documents that share a common phrase. It is obtained, after the generalized suffix tree has been constructed, by identifying the internal nodes that contain several documents. To better explain the STC algorithm, we first define a few concepts.

Definition 3.6: A phrase is an ordered sequence of one or more words. Phrases have no fixed length, but a phrase cannot cross a phrase boundary.

Definition 3.7: A phrase boundary is a marker that the document parser inserts between phrases where it identifies special syntactic tokens such as punctuation or HTML tags. The beginning and end of a document are also treated as phrase boundaries.

Phrase boundaries are imposed for two main reasons: first, they reduce the cost of building the suffix tree; second, a phrase boundary often implies a shift of topic.

Definition 3.8: A phrase cluster is defined by a phrase shared by at least two documents, together with the set of documents that contain that phrase.

Definition 3.9: A maximal phrase cluster is a phrase cluster whose phrase cannot be extended by any word without changing the set of documents it contains.

A phrase cluster is the set of documents sharing a common phrase, so identifying phrase clusters can be seen as building a phrase inverted index of the document set. The suffix tree is exactly such a phrase inverted index: every node of the generalized suffix tree that contains at least two documents can be seen as a phrase cluster. Each phrase cluster is given a weight based on the number of documents it contains and the number of words in its phrase, and base clusters are selected according to this weight. For a phrase cluster B, the weight function s(B) is defined as:

s(B) = |B| · f(|P|)    (3.1)

where |B| is the number of documents in phrase cluster B and |P| is the number of words in phrase P, that is, the effective length of the phrase. The function f adjusts for phrase length: it takes its minimum value for single-word phrases, is linear for phrases 2 to 6 words long, and is constant for longer phrases.

Normally, a threshold can be set and the phrase clusters whose weight exceeds it are taken as base clusters. Alternatively, the phrase clusters can be sorted in descending order of weight and the top 300 to 500 selected as base clusters.
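The weighting and selection of base clusters can be illustrated with a short sketch. For brevity it builds the phrase inverted index by brute-force enumeration of word n-grams instead of a suffix tree, so it shows the scoring logic only. All names are hypothetical, and the concrete values of f (0.5 for a single word, the phrase length for 2 to 6 words, and 6 for longer phrases) are assumptions, since the text fixes only the shape of f.

```python
from collections import defaultdict


def phrase_index(documents):
    """Map every phrase (word n-gram) to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(documents):
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                index[tuple(words[i:j])].add(doc_id)
    return index


def length_factor(n_words, single_word_value=0.5, max_length=6):
    """f(|P|): minimal for one word, linear for 2..6 words, constant beyond that."""
    if n_words <= 1:
        return single_word_value
    return float(min(n_words, max_length))


def score(doc_ids, phrase):
    """s(B) = |B| * f(|P|) for a phrase cluster with documents doc_ids."""
    return len(doc_ids) * length_factor(len(phrase))


def base_clusters(documents, threshold=None, top_k=None):
    """Score phrases shared by at least two documents and keep the best ones."""
    candidates = [
        (score(docs, phrase), phrase, sorted(docs))
        for phrase, docs in phrase_index(documents).items()
        if len(docs) >= 2
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)  # descending by score
    if threshold is not None:
        candidates = [c for c in candidates if c[0] > threshold]
    return candidates[:top_k] if top_k is not None else candidates


docs = [
    "cat ate cheese".split(),
    "mouse ate cheese too".split(),
    "cat ate mouse too".split(),
]
for s, phrase, members in base_clusters(docs, threshold=2.0):
    print(s, " ".join(phrase), members)
```

Either selection rule from the text can be used: `threshold` keeps clusters whose weight exceeds a bound, and `top_k` keeps a fixed number of the highest-weighted clusters.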

5.3 Weighted suffix tree clustering algorithm

Web documents have a special structure and are generally regarded as semi-structured documents. HTML tags identify the structure and the importance of the different parts of a document; the title text, for example, generally carries important information. Existing document clustering algorithms based on suffix trees treat Web documents as ordinary documents: structural information is removed during preprocessing, and the Web document is treated as a plain string without any formatting.

To integrate the structural information of Web documents into suffix tree clustering, this section presents a new weighted suffix tree clustering method, WSTC. According to the structural characteristics of Web documents, the various parts of a document are assigned different levels of importance, and the suffix tree is built from sentences instead of whole documents. During construction, each suffix tree node stores not only the document ID, sentence ID, occurrence count and similar information, but also information about the document structure: the importance level of a sentence is stored in the node as its structural weight, forming a weighted generalized suffix tree. Base cluster selection and merging then make combined use of the information in each node: the number of documents, the number of sentences, the phrase length and the importance level.

The main process of the WSTC clustering algorithm is similar to that of the STC algorithm and is described as follows:

Algorithm 3.2: weighted suffix tree clustering algorithm WSTC

Input: N documents to be clustered

Output: clusters sorted in descending order of score

Steps:

① analyze the structure of the Web documents and divide them into sentences with importance levels;

② construct the weighted generalized suffix tree (WGST) of the document set;

③ compute base cluster scores based on the weighted suffix tree model and select the h base clusters with the largest scores;

④ merge the h base clusters to obtain the final clustering result, sorted in descending order of cluster score.

5.3.1 Web document analysis

HTML tags mark the different parts of a document and thereby indicate how important each part is: the same text has a different significance depending on where it appears. Based on the structure of HTML documents, the different parts of a document are assigned different levels of importance, and these levels take part in the suffix tree clustering algorithm as weights. Importance is divided into three levels: High, Medium and Low.

High: the document title, the Meta keywords, section headings, etc.

Medium: text in bold, italics or a highlight color, hyperlinks, table headings and the like.

Low: plain text.

When the suffix tree is constructed, these importance levels are stored as weights in the suffix tree nodes and are used in the subsequent selection and merging of base clusters.
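One possible way to assign the three importance levels is sketched here with Python's standard `html.parser` module. The mapping from tag names to levels is an assumption that follows the High, Medium and Low lists above; `ImportanceParser` and `importance_segments` are hypothetical names. The non-standard tag `bold` is included only because this paper's example documents use it.

```python
from html.parser import HTMLParser

# Assumed tag sets for the three levels described in the text.
HIGH_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}
MEDIUM_TAGS = {"b", "bold", "strong", "i", "em", "mark", "a", "th", "caption"}
SKIP_TAGS = {"script", "style"}  # content that is not document text


class ImportanceParser(HTMLParser):
    """Collect (text, level) segments, where level is "H", "M" or "L"."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Meta keywords count as High; meta has no closing tag, so it is not stacked.
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name == "keywords" and attributes.get("content"):
                self.segments.append((attributes["content"], "H"))
            return
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag, discarding unclosed tags above it.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t in self.open_tags):
            return
        if any(t in HIGH_TAGS for t in self.open_tags):
            level = "H"
        elif any(t in MEDIUM_TAGS for t in self.open_tags):
            level = "M"
        else:
            level = "L"
        self.segments.append((text, level))


def importance_segments(html):
    parser = ImportanceParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.segments


print(importance_segments("<title>cat ate cheese</title><body>mouse ate cheese too</body>"))
```

When a text run is nested inside several tags, the sketch takes the highest level among the enclosing tags; this tie-breaking rule is also an assumption.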

Most Web documents are relatively long, about 300 to 400 words. If an entire document is used directly as one string for suffix tree construction, the suffix tree becomes relatively deep, and documents of maximal length may be difficult to handle. In the literature, the DIG approach divides each document into separate sentences when constructing its graph. In this paper, documents are likewise divided into sentences when suffix trees are constructed, and each sentence is divided into words (also called terms). A sentence is usually 10 to 20 words long. A suffix tree built with the sentence as the unit has relatively few levels, which makes subsequent processing easier.

A document is represented as a vector of sentences, and each sentence as a vector of words:

d_i is the i-th document in the document set D;

s_ij is the j-th sentence of document d_i;

p_i is the number of sentences in document d_i;

t_ijk is the k-th word of sentence s_ij;

l_ij is the length of sentence s_ij, that is, the number of words the sentence contains;

w_ij is the weight associated with sentence s_ij.

The weight of a sentence is determined by the tags that enclose it in the HTML document. For example, a title sentence is rated High and a plain-text sentence is rated Low. The importance level is stored in the suffix tree node; the specific numerical values are determined in the experiments.

5.3.2 Definition and construction of the weighted suffix tree

First the weighted suffix tree of a sentence is defined, then the weighted generalized suffix tree of a document and of a document set. For a sentence s containing n words, a special non-empty terminator “$” is appended at the end to facilitate construction of the suffix tree. The suffix tree T is defined as follows:

Definition 3.10: The suffix tree T of a sentence s containing n words is a rooted directed tree with exactly n leaves. Each internal node other than the root has at least two children, and each edge is labeled with a non-empty substring of s. No two edges leaving the same node have labels that begin with the same word. The most important property of the suffix tree is that, for any leaf k, the concatenation of the edge labels on the path from the root to the leaf spells out the suffix of s that starts at position k, namely s[k..n].

Definition 3.11: In the weighted suffix tree T of a sentence s, every node v other than the root records the sentence s that passes through it, the number of times r that s passes through the node, and the importance level w of s in the document: v = {(s, r, w)}.

Definition 3.12: Let document d_i contain p_i sentences, the length of sentence s_ij being l_ij. The weighted generalized suffix tree of document d_i is a generalized suffix tree over the sentences of d_i in which each node is labeled with a list of triples (s_ij, r_ij, w_ij), namely v = {(s_ij, r_ij, w_ij) : j}. Each internal node other than the root has at least two children; each edge is labeled with a non-empty substring of a sentence; no two edges leaving the same node have labels that begin with the same word; and for any leaf node, the concatenation of the edge labels on the path from the root to the leaf is a suffix of some sentence s_ij of document d_i.

Definition 3.13: Let D be a document set of N documents in which document d_i contains p_i sentences, so that the total number of sentences in the document set is P = Σ_i p_i. The weighted generalized suffix tree of the document set D is a weighted generalized suffix tree containing these P sentences. Each node is labeled with a list of triples, defined as v = {(s_ij, r_ij, w_ij) : i, j}.

The concatenation of the edge labels on the path from the root to any leaf node is a suffix of some sentence s_ij of some document d_i.

From the analysis of the weighted generalized suffix tree, the following properties can be derived.

Property 3.1: Starting from the root, the nodes are divided into levels. With each additional level, the number of words in the phrase represented by a node increases, so phrases of different lengths can be obtained.

Property 3.2: From any node, the weights in the documents of the phrase corresponding to that node can be obtained.

Property 3.3: By counting the sentences that a node contains, the number of times a document passes through the node can be calculated.

The following example illustrates the difference between the weighted suffix tree and the ordinary suffix tree.

It uses two Web documents containing three sentences in total, with HTML tags identifying the importance level of each sentence.

d₀: <title>cat ate cheese</title><body>mouse ate cheese too</body>

d₁: <bold>cat ate mouse too</bold>

When the Web documents are treated as unstructured documents, the resulting suffix tree is as shown in Figure 5.5. A document is much longer than a sentence, so as the number of documents increases, the tree becomes deeper. In this suffix tree, each node records the documents that pass through it. The rectangle attached to each internal node contains the following information: the upper part gives the numbers of the documents passing through the node, and the lower part gives the number of times each corresponding document passes through the node.

According to the structure and HTML tags of the Web documents, the two documents are divided into three sentences and the importance level of each sentence is determined. The suffix tree built with the sentence as the unit is shown in Figure 5.6. Its maximum depth is the number of words in the longest sentence. Since a sentence generally contains 10 to 20 words, the suffix tree built on sentences is significantly shallower than the one built on whole documents. The rectangle attached to each internal node contains: in the upper part, the documents and sentences passing through the node, where the number to the left of the decimal point is the document number and the number to the right is the sentence number; in the middle, the number of times the corresponding sentence passes through the node; and in the lower part, the importance level of the corresponding sentence, on which the weight calculation is based. The specific values of H, M and L are determined in the experiments.

Comparing Figure 5.6 with Figure 5.5 shows the differences between the weighted suffix tree and the ordinary suffix tree:

① the weighted suffix tree is built from sentences;

② each node additionally records sentence IDs and the number of occurrences of each sentence;

③ each internal node contains an importance level;

④ in the weighted suffix tree, suffixes are relatively short;

⑤ phrase boundaries are confined to the sentence.
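The node contents of the weighted tree can be illustrated with the two example documents. The sketch below is a simplification under stated assumptions: it stores the sentence suffixes in an uncompressed word trie (one word per edge) rather than in a compact suffix tree, and the numeric weights H = 3, M = 2, L = 1 are placeholders, since the actual values are determined experimentally. `WNode`, `build_wgst` and `shared_phrases` are hypothetical names.

```python
class WNode:
    def __init__(self):
        self.children = {}  # word -> child node
        self.entries = {}   # (doc_id, sent_id) -> [r, w]: pass count and weight


def build_wgst(sentences):
    """sentences: list of (doc_id, sent_id, words, weight) tuples."""
    root = WNode()
    for doc_id, sent_id, words, weight in sentences:
        for start in range(len(words)):          # every suffix of the sentence
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, WNode())
                entry = node.entries.setdefault((doc_id, sent_id), [0, weight])
                entry[0] += 1                    # the sentence passes through this node
    return root


def shared_phrases(root, min_docs=2):
    """Return phrases whose node is passed by sentences of at least min_docs documents."""
    found = {}

    def walk(node, phrase):
        for word, child in node.children.items():
            extended = phrase + (word,)
            doc_ids = {doc_id for doc_id, _sent_id in child.entries}
            if len(doc_ids) >= min_docs:
                found[extended] = child.entries
            walk(child, extended)

    walk(root, ())
    return found


H, M, L = 3, 2, 1  # placeholder weights, not the experimentally determined values
example = [
    (0, 0, "cat ate cheese".split(), H),        # title of d0
    (0, 1, "mouse ate cheese too".split(), L),  # body of d0
    (1, 0, "cat ate mouse too".split(), M),     # bold text of d1
]
shared = shared_phrases(build_wgst(example))
for phrase, entries in shared.items():
    print(" ".join(phrase), entries)
```

Each entry corresponds to one triple (s, r, w) of Definition 3.11, keyed by document number and sentence number in the same way as the "document.sentence" labels in the node rectangles.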

Fig. 5.5 The generalized suffix tree of d_0 and d_1

Fig. 5.6 The weighted generalized suffix tree of d_0 and d_1

5.3.3 Document preprocessing

Before clustering, the documents are first cleaned and preprocessed to obtain the Web content and the importance levels of the different parts of each document.

In the literature, when the document structure is represented, the HTML document is converted into an XML description that includes the importance levels of the different parts of the document. Such a conversion is rather complex and additionally raises the problem of XML parsing, which increases processing complexity.

In this paper, document preprocessing proceeds as follows. First, Web documents are divided according to their HTML tags into segments with different importance levels. Second, commas, periods, semicolons, question marks, exclamation marks and the like are used as sentence delimiters to divide each segment into sentences. Third, each sentence is split into words, stop words are removed and the remaining words are stemmed. All sentences and their importance levels are stored in a two-dimensional array, one part holding the content of each sentence and the other its importance level. Finally, the sentences and their importance levels are used to construct the weighted suffix tree.
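The second and third preprocessing steps can be sketched as follows, assuming the segments and their importance levels have already been extracted from the HTML. The stop-word list and the `crude_stem` function are deliberately tiny placeholders, not a real stop list or stemming algorithm, and all names are hypothetical.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}  # illustrative only


def crude_stem(word):
    """Placeholder stemmer that strips a few common suffixes; a real stemmer would replace it."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(segments):
    """Turn (text, level) segments into two parallel lists: word lists and levels."""
    sentences, levels = [], []
    for text, level in segments:
        # Split each segment into sentences at the delimiters named in the text.
        for raw in re.split(r"[,.;?!]", text):
            words = [
                crude_stem(w)
                for w in re.findall(r"[a-z0-9]+", raw.lower())
                if w not in STOP_WORDS
            ]
            if words:  # drop sentences that become empty after stop-word removal
                sentences.append(words)
                levels.append(level)
    return sentences, levels


sentences, levels = preprocess([("The cat ate cheeses, the mouse ate too.", "H")])
print(sentences, levels)
```

The two returned lists are aligned by position, so `sentences[k]` and `levels[k]` together describe one sentence and can be fed to the tree construction.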

Experimental Results

Acknowledgments

Time flies, and my graduate life is about to end as if in a dream. Moments from the past three years come back to me one by one: the happiness and excitement of first enrolling, and the dismay now that my studies are coming to an end; the confusion and anxiety of choosing a topic, the painful struggle of writing, and the gratitude and excitement of finally signing off. Many feelings fill my heart. I thank my supervisor for helping me throughout my studies. Over these years, your profound knowledge and rigorous scholarship have influenced me, deepened my understanding of the profession, and made me realize that in this broad professional field there is still much knowledge, known and unknown, waiting for me to learn and explore. Your encouragement and trust gave me the confidence and the opportunity to be trained in many respects. Your strict requirements and careful guidance meant that every puzzle I met while writing this paper could be solved and a clear direction found. I also thank my business mentor for the help I received, and I give special thanks to my colleagues at work, who helped me carry out a great deal of market research.

I thank all the teachers of the Project Management Institute who gave me guidance during my studies over the past few years; it is you who strengthened our theoretical training.

I thank the classmates who accompanied me through my studies. The time we spent together, whether debating academic questions, discussing our futures or sharing thoughts on life and happiness, makes up precious years of my life and has become a lasting memory.

During the investigation and data collection for this paper, many friends and many enthusiastic people I met by chance gave this study their warm support and assistance. They cannot all be listed here, but I express my heartfelt thanks to them all.

Finally, I want to thank my family, who have always given me selfless support. You will always be my warmest haven, and your expectations and encouragement will accompany me as I overcome the difficulties and setbacks on the road ahead. Once again, to everyone who gave me care, guidance and help during my master's degree course, I extend my heartfelt thanks and best wishes!

CHAPTER 6 Experimental Results
