Duplicate Question Detection Using Random Forest Algorithm applications of natural language processing presents a need for an effective method to compute the similarity between very short texts or sentences. An example of this is a conversational agent/dialogue system with script strategies in which sentence similarity is essential to the implementation. Sentence similarity will have internet-related applications as well. In Web page retrieval, sentence similarity has proven to be one of the best techniques for improving retrieval effectiveness.
In-text mining, sentence similarity is used as a criterion to discover unseen knowledge from textual databases. In a context of customer support via a chat channel, with the help of duplicate question detection, previous interactions between customers and human operators can be explored to provide an increasingly automatic question answering service. Another major application of computing the sentence similarity can be to find the questions that have the same semantic meanings and can be considered a duplicate. This gives rise to our project entitled Duplicate Question Detection (DQD).
Duplicate Question Detection (DQD) is the system, which warns about the duplication of questions. The main task of the DQD project is to predict whether two questions can be considered the same or not. This shall help us identify whether a similar type of questions ever existed in the past, or not. DQD aims to fetch a question as an input, pair it with existing questions, process them to gather their significant features, and finally put feed them to our algorithm to make the final prediction. Some obvious text-based features are a number of words matching between the two questions, length of the two questions, number of words, number of characters, number of stop words. It can be done with Tf-IDF scores, but that is both not very useful and computationally expensive.
Instead, one of the most important features that can be used as a weighted word match share, where each word’s weight is the inverse frequency of the word in the corpus if the user has a rare word in both the questions, they might be discussing similar topics. It also requires features like cosine distance, Jaccard distance, Euclidean distance etc. The proper implementation of the available algorithms and a new technique is a big challenge in such natural language processing. Time complexity, space complexity, resource management add further complexity in the project.
Use Case Diagram
The data modeling part focuses on showing how the data has been involved and maintained in the system. For this, we have constructed a simple class diagram where each class represents the data and its attributes shows the properties of that class. Likewise, the methods show how they can behave. The major data source is the questions asked by the users to our system.
Duplicate Question Detection Using Random Forest Algorithm takes in two questions as an input. These questions are taken in string format, which is then sent for the preprocessing in order to generate the appropriate representation, which could then be fetched to our actual prediction system. That is the trees in the random forest. Finally, through the data sent to the system, the prediction is made on either the given questions were duplicate or not.
We have used Python Django framework for creating a simple-looking web application. The interface consists of a home page where a user can see the previously posted questions and answers and link to the page where the user can input Question.
Input Output Design
Data to our system is given in a string format, which is simply a question. The question is feed to model for the prediction. Observing the predictions, if any duplicate questions are found from the stored set of questions, the system will show those duplicate questions the user. In case there is no duplicate, the user question is added to our existing set of questions.
Description of the Algorithm used
Duplicate Question Detection Using Random Forest Algorithm can be considered as a binary classification problem where we classify our input in either of the two categories or classes. Duplicate or Not Duplicate. For our problem, we have a number of lgorithms to perform classification. However, we chose to go with Random Forest due to its simplicity and ability to solve our problem. Brief discussion about the algorithm is given below. Random Forest is extended form of decision trees.
It is a CART algorithm short for classification and regression trees. Meaning that it can be used for both types of problems. The algorithm uses a tree to make a prediction or generate the result. Each non-leaf node in this tree is basically a decision-maker. These nodes are called decision nodes. Each node has one attribute and a value that is used to make a decision on where to go next. Depending on the test, we either move to the left branch or the right 15branch of that node. Since we can construct a classifier, each leaf node of our decision tree in the random forest represents a class.
Decision Tree Explain
One major issue is how we can construct an optimal tree. What should be considered before selecting an attribute based on which decisions are to be made at each node? There is a number of approaches to do this. There are four major approaches:
- Gini Index
- Information Gain
- Reduction in variance
Among these, our random forest uses the Gini index to split the dataset at each node. It is also called a cost function that evaluates our split. A Gini score gives us the idea of how good our split is by telling how mixed the classes are in the two groups that is created by our split. Less the Gini value, better will our split gets. The split is considered worst if the 16results in the 50/50 classes in each group formed by that split. Gini Score can be calculated using the given formula.
c = Number of classes
i = Number of data in split belonging to ith class
t = total number of data in the split
Computing the weighted average of the Gini index of both attribute, we select the one, which gives less GINI value. At each node, we calculate the Gini index for each attribute and finally construct the decision tree.
Finally, computing multiple decision trees, we create a forest of trees known as random forest. The prefix random comes in the sense that, every time the decision tree is to be trained, instead of taking the whole dataset, only fraction of dataset is taken randomly and used for that particular decision tree to train. Likewise, every tree uses only some of the features at once, most often the square root of total available features. Hence this reduces the chance of overfitting our decision tree.
Working of algorithm
- Calculate Gini Index for each attribute
- Take attribute with the lowest gini index as root node
- For splitting through nodes,
- Take the attribute with the lowest gini index as splitting node
- Repeat the steps until Gini Index=0.
Step 1: Calculate the total positive and negative results on the split on a attribute.
Example: For fuzz_ratio
Positive = 1504
Negative = 2294
Total samples = 3798
Step 2: Calculate the gini index for that attribute
Gini index (fuzz_ratio) = 0.473
Step 3: compare the gini index for each attribute, select the gini index with the lowest value, and split on that node. In this particular case, we have the attribute, fuzz ratio, with lowest gini index of 0.473. So split on that attribute.
Step 4: With respect Repeat to above word process mover with respect distance and to root node corresponding attributes. value 2.395,
In this way, we construct decision trees. Finally, we created a forest of trees.
Unit test was done by feeding the system with several test cases. We performed unit test by giving two different inputs as input and observed our output. The system worked fine. However, the system gave some wrong predictions as well.
The system included two parts: One for the web application and the other for making the prediction. We tested our system for the both. First, we created our prediction model. On the next phase, we developed a simple web application interface and integrated them together to form a complete system.
The major intention of this project(Duplicate Question Detection Using Random Forest Algorithm) is to develop a system where different users can ask numbers of questions. However, the system would detect if any question were repeated. Doing so, the system would decrease the size of its database in the case of large numbers of users and every user asking so many question. In addition, the answerers do not have to keep answering the same question repeatedly.
Resources Person:- Sanjeev Roka Github