Project Description: Automatically Retrieve Valuable Information from Feature Requests (15 points)

Due Date: 11:59pm on 11/28, 2016

1. Introduction

In open-source software repositories (e.g., SourceForge), forums are provided for users to request features that they would like to see developed in the next release of the system. However, not everything in a user-submitted request is relevant to the features that concern the development team. Moreover, a large number of new feature requests are proposed after each new release of a system. To relieve project members of manually reviewing a large volume of user requests and filtering out the irrelevant information, we propose an approach that automatically detects the valuable information in each user feature request. Pre-processing the user requests removes a large amount of irrelevant information, which significantly reduces the manual effort project managers spend reviewing and making decisions, because the shortened feature requests contain only the relevant and useful information.

The objective of this project is to find the patterns of the most valuable content in user-submitted feature requests. We provide you with a set of users’ feature requests from open-source repositories, together with a list of questions indicating the developers’ interests in those feature requests. First, manually locate and retrieve the answer to each question from each feature request. The answers you identify become the valuable information extracted from the feature request. Next, discover and derive patterns from the answers you identified for each question. Finally, you should be able to automatically retrieve the answers to each question from newly proposed feature requests using your derived patterns.

2. Feature Requests and Questions

A feature request contains two fields: an ID “R{number}” and a textual “description”.
Here are five examples of feature requests:

R1. I have been looking for an app that does not store passwords and just generates them using a hash of a master password phrase and a sort of the hostname of the site.
R2. I think Linux port is a good idea. I use both Windows and Linux but cannot use KeePass on my Linux OS. Please port KeePass to Linux.
R3. Users should only have one instance of KeePass running at anytime, or it will cost more system resources. If a user started the program while the program was already running, the program should bring up the already running instance instead of starting a new one.
R4. At the moment, anyone can change the master key of the database, even if he does not know the original one. It’s a big threaten to password safety.
R5. It would be convenient for users to show numbers of entries near the group.

In addition, we provide the following list of questions about the relevant information that may interest the developers. Note that the given list of questions may not be complete.

Q1. Which word(s) or phrase(s) imply that the request proposes a new feature?
Q2. What is the subject of the feature request?
Q3. What is the object of the feature request?
Q4. Which verb or verb phrase is used to describe the feature request?
Q5. Which word or phrase implies the purpose of the feature request?
Q6. What benefit will the requested feature bring to users?
Q7. Which word or phrase tells something that the current system cannot fulfil?
Q8. Which word or phrase mentions or implies an existing feature?
Q9. When will the requested feature be needed?
Q10. Where will the requested feature be applied?

3. Project Tasks

Implementation Language and Platform Requirements:
Development language: Java
JDK version: 1.8 or higher
IDE: Eclipse

The project consists of 3 tasks, which are elaborated in the following subsections.

Task 1: Label Answers to Questions

3.1. Choose Questions and Prepare Feature Requests

First, choose two questions from the 10 questions listed in Section 2. Then prepare two sets of feature requests, one for each of your two selected questions. Each set must contain at least 60 feature requests, and each selected feature request must contain the answer to its corresponding question. The same feature request may appear in both sets.

Feature requests of various software projects can be accessed at https://sourceforge.net/. For example, https://sourceforge.net/projects/pnotes/ is the homepage of the project “PNotes”, and you can reach its feature requests by clicking the “Tickets” menu (see the screenshot below). For each question, you may retrieve the 60 feature requests from a single project or from multiple projects on SourceForge.

Then, for each question, split the set of feature requests into two subsets: one containing 50 requests and the other containing the remaining 10 requests. The subset with 50 requests is the Training Set, and the subset with 10 requests is the Testing Set. The Training Set is used to derive the patterns of answers for automatically retrieving answers to each question later. The Testing Set is used to evaluate the performance of the patterns you derive.

3.2. Structured Representation of Answers

To derive the patterns of answers to the questions, you first need to convert the extracted natural-language answers into a structured representation. The answer extracted for a question from one request must be a span of consecutive words. The patterns of answers will eventually be defined in terms of this structured representation. The ultimate goal of this project is to automatically retrieve the answers to the questions from the feature requests in the Testing Set by leveraging the structured representation.

Below is an example structured representation of extracted answers with 7 attributes:

S1. Index of the sentence containing the answer
S2. Index of the first word of the answer in the sentence
S3. The number of words in the answer
S4. POS tag of the first word in the answer
S5. POS tag of the last word in the answer
S6. POS tag of the word immediately before the answer in the request
S7. POS tag of the word immediately following the answer in the request

What is POS tagging and how can we tag an English sentence? Part-of-Speech (POS) tagging identifies the grammatical roles of the words in natural-language text. CLAWS, an automated POS tagging tool, is available at http://ucrel.lancs.ac.uk/claws/trial.html. You can also label the POS tag of each word in a sentence automatically by writing your own code. If you write your own code, import the Java package developed by the Stanford NLP group, which can be downloaded from http://nlp.stanford.edu/software/tagger.html. First add this package to your Build Path. Your code can then call the API in this package to label POS tags for a feature request as follows:

    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    // Place the statements below inside a method.
    // POS tagging itself only requires the tokenize, ssplit and pos annotators.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String featureRequest = ""; // Your feature request
    Annotation document = new Annotation(featureRequest);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        List<CoreLabel> tokens = sentence.get(TokensAnnotation.class);
        for (CoreLabel token : tokens) {
            String pos = token.get(PartOfSpeechAnnotation.class); // POS tag of each word
        }
    }

S1-S7 are the required attributes that structurally represent the answers. You are also encouraged to add attributes to your representation if they can improve the accuracy of your results in Task 3.
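As an illustration of how S1-S7 might be computed for a manually located answer, here is a minimal sketch in plain Java. It assumes the request has already been split into sentences and POS-tagged (for example with the Stanford code above). The class name `AnswerAttributes`, the method `compute`, and the "NONE" placeholder for a missing neighbouring word are hypothetical choices, not part of the assignment. For simplicity the sketch looks for the neighbouring words only inside the same sentence.

```java
import java.util.Arrays;

class AnswerAttributes {
    /**
     * Computes S1-S7 for the first occurrence of the answer in a tokenized, tagged request.
     * words[s][i] is the i-th word of sentence s and tags[s][i] is its POS tag.
     * Returns null if the answer does not occur as consecutive words within one sentence.
     */
    static String[] compute(String[][] words, String[][] tags, String answer) {
        String[] target = answer.trim().split("\\s+");
        for (int s = 0; s < words.length; s++) {
            for (int i = 0; i + target.length <= words[s].length; i++) {
                String[] window = Arrays.copyOfRange(words[s], i, i + target.length);
                if (!Arrays.equals(window, target)) {
                    continue;
                }
                int last = i + target.length - 1;
                // Assumption: "NONE" marks a missing neighbour at a sentence boundary.
                String before = i > 0 ? tags[s][i - 1] : "NONE";
                String after = last + 1 < words[s].length ? tags[s][last + 1] : "NONE";
                return new String[] {
                    String.valueOf(s + 1),           // S1: sentence index (1-based)
                    String.valueOf(i + 1),           // S2: index of first word (1-based)
                    String.valueOf(target.length),   // S3: number of words
                    tags[s][i],                      // S4: POS tag of first word
                    tags[s][last],                   // S5: POS tag of last word
                    before,                          // S6: POS tag of preceding word
                    after                            // S7: POS tag of following word
                };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Request R5 with the POS tags given in the project description.
        String[][] words = {{"It", "would", "be", "convenient", "for", "users", "to", "show",
                             "numbers", "of", "entries", "near", "the", "group", "."}};
        String[][] tags = {{"PRP", "MD", "VB", "JJ", "IN", "NNS", "TO", "VB",
                            "NNS", "IN", "NNS", "IN", "DT", "NN", "."}};
        System.out.println(String.join("\t", compute(words, tags, "show numbers of entries")));
    }
}
```

For the answer "show numbers of entries" in R5, the sketch yields 1, 8, 4, VB, NNS, TO, IN, which agrees with the Q4 row for R5 in Table 1.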
If the additional attributes you add to the provided structured representation template are tested and shown to be effective in Task 3, your project grade will be improved. Keep in mind that all the attributes in the structured representation (including any you add) should be computed for each word in the feature request. S1-S3 are required to be computed automatically. For S4-S7, you are highly encouraged to implement the automatic computation.

3.3. Label Answers to Questions with the Given Template

For each feature request in the Training Sets, translate your answers to each question into the structured representation based on the following template.

    Request ID   Question ID   Answer Text   Structured Representation of Answer
                                             S1   S2   S3   S4   S5   S6   S7

The following example demonstrates how to translate the answers (manually retrieved from feature request R5) to the questions Q4, Q6 and Q10 into the structured representation with the 7 attributes S1-S7.

R5. It would be convenient for users to show numbers of entries near the group.

(1) Locate the answer to each question in the request: {Q4, show numbers of entries}, {Q6, convenient}, {Q10, near the group}
(2) Calculate the POS tag of each word in R5: It/PRP would/MD be/VB convenient/JJ for/IN users/NNS to/TO show/VB numbers/NNS of/IN entries/NNS near/IN the/DT group/NN ./.
(3) Fill in the template of the structured representation below.

Table 1. The template of structured representation of answers

    Request ID   Question ID   Answer Text               S1   S2   S3   S4   S5    S6    S7
    R5           Q4            show numbers of entries   1    8    4    VB   NNS   TO    IN
    R5           Q6            convenient                1    4    1    JJ   JJ    VB    IN
    R5           Q10           near the group            1    12   3    IN   NN    NNS   .

For each feature request in the Testing Sets, you only need to manually locate and retrieve the answer to each question that you chose. That is, only columns 1-3 are filled in for each feature request in the Testing Sets.

3.4. Task 1 Deliverables

1. Java classes and compiled executables for calculating all the attributes in the structured representations of the answers (manually retrieved from each feature request) to the two questions that you choose.
2. questions.txt containing the two questions you choose. Use “Q1” and “Q2” as the question IDs.
3. frtrsf.txt containing the 50 feature requests in the Training Set for your first question, and lrtrsf.txt containing your answer labeling results in the structured representation template.
4. frtrss.txt containing the 50 feature requests in the Training Set for your second question, and lrtrss.txt containing your answer labeling results in the structured representation template.
5. frtesf.txt containing the 10 feature requests in the Testing Set for your first question, and lrtesf.txt containing your answer labeling results in the structured representation template (columns 1-3 only).
6. frtess.txt containing the 10 feature requests in the Testing Set for your second question, and lrtess.txt containing your answer labeling results in the structured representation template (columns 1-3 only).

Each line in a .txt file containing feature requests consists of the request ID and the request description, separated by a TAB. Here is an example:

    R1	I have been looking for …
    R2	I think Linux port is a good …
    R3	Users should only have one instance…
    R4	At the moment, anyone can change …
    R5	It would be convenient for users to show numbers …
    ……

The .txt files with labeled requests in the Training Sets should have the same columns as shown in Table 1, separated by a TAB. Here is an example of the structured representation for an answer to the question Q4. The .txt files with labeled requests in the Testing Sets should contain only the first three columns of this example.

    R5	Q4	show numbers of entries	1	8	4	VB	NNS	TO	IN
    ……

All the files in the Task 1 deliverables should be packaged in one folder named “Task 1-{Your First Name}{Your Last Name}”. Please submit one zip file containing the entire folder.

Task 2: Compute Pattern of Answers to Each Question

3.5. Input and Output

The input of this task is all the labeled answers (structured representations) to the question in each Training Set. The output is a generalized structured representation (i.e., a pattern) of the answers to each question. The pattern of answers consists of all the attributes in the structured representation. You may take the following approach to derive a pattern from the structured representations of the answers.

For the numerical attributes S1, S2 and S3, take a range of values for each attribute and calculate the percentage of answers in the Training Set that fall into that range. For each attribute, you may try different ranges of values and sort them by percentage of occurrence in descending order. For the categorical attributes S4, S5, S6 and S7, list all the POS tags observed for each attribute in the Training Set and sort them by percentage of occurrence in descending order. Note that the percentages for each attribute should sum to 100%. Below is an example pattern of S1, S2, S4 and S6 of answers to the question Q4.

Table 2. An example pattern of answers to Q4

    Representation Attribute   Pattern   Percentage
    S1                         1-2       70%
                               3-4       30%
    S2                         3-4       60%
                               5-6       30%
                               1-2       10%
    S4                         VB        60%
                               NNS       30%
                               TO        10%
    S6                         TO        80%
                               IN        20%

3.6. Task 2 Deliverables

(1) Patterns of S1-S7 for answers to the first question. Name the file “Task2-Q1.txt”.
(2) Patterns of S1-S7 for answers to the second question. Name the file “Task2-Q2.txt”.
You are required to submit your derived patterns in the following format.
(3) Package both files into one folder named “Task 2-{Your First Name}{Your Last Name}”. Please submit one zip file containing the entire folder.
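The pattern derivation described in Section 3.5 can be sketched as follows. This is only one possible reading of the approach: the class name `PatternBuilder`, the method names, and the choice of fixed-width ranges for the numerical attributes are assumptions made for illustration. The assignment leaves the choice of ranges to you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PatternBuilder {
    /**
     * Maps each observed value of one attribute to its share (0.0 to 1.0) of the
     * training answers, ordered from most to least frequent.
     */
    static Map<String, Double> pattern(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // descending by count
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            result.put(e.getKey(), e.getValue() / (double) values.size());
        }
        return result;
    }

    /**
     * Labels a numerical value (S1, S2 or S3) with a fixed-width range.
     * Assumption: ranges start at 1, so width 2 gives "1-2", "3-4", "5-6", ...
     */
    static String range(int value, int width) {
        int low = ((value - 1) / width) * width + 1;
        return low + "-" + (low + width - 1);
    }

    public static void main(String[] args) {
        // Hypothetical S4 values of ten training answers.
        List<String> s4 = Arrays.asList("VB", "NNS", "VB", "TO", "VB", "NNS", "VB", "VB", "NNS", "VB");
        System.out.println(pattern(s4));

        // Hypothetical S2 values, first converted to range labels of width 2.
        List<String> s2 = new ArrayList<>();
        for (int v : new int[] {3, 4, 5, 1, 3}) {
            s2.add(range(v, 2));
        }
        System.out.println(pattern(s2));
    }
}
```

Because each share is a count divided by the total number of training answers, the shares of one attribute always sum to 1 (100%). Trying a different `width` is one simple way to compare alternative ranges for S1-S3.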
Example of the format required for the Task 2 pattern files (Section 3.6):

    S1   {1,2},70%   {3,4},30%   ……
    S4   VB,60%   NNS,30%   TO,10%   ……

Task 3: Automatically Retrieve Answers with Your Patterns

3.7. Implementation

This task asks you to implement a system that uses the patterns you derived in Task 2 to automatically retrieve the answers to the two selected questions from the feature requests in your Testing Sets. Your implementation must meet the requirements below.

(1) Write your code as a single “.java” class and name the Java class “{Your First Name}{Your Last Name}”.
(2) Declare the package of this Java class with “package retrieve.answers.auto”.
(3) Define the static function “retrieve” as follows:

    public static void retrieve(String testFilePath, String questionID) {
        // For each line in testFile, retrieve the answer to the question whose ID is questionID.
        // Create an 'answer{questionID}.txt' file in the same folder as testFile.
        // Write the request ID and the answer for each line of testFile on one line of answer{questionID}.txt.
    }

Tips for retrieving answers automatically:

(1) Locate the first word of the answer. For each word in the given feature request, compute its probability of being the first word of the answer, denoted by Pf. The first word of the answer can be determined by S1, S2, S4 and S6. For example, suppose there is a word W in the feature request R, and we need to calculate the probability that W is the first word of the answer to question Q4, denoted by Pf(W|R,Q4). Assuming the values of S1, S2, S4 and S6 of W in the Testing Set for Q4 are 1, 5, VB and TO respectively, then based on the pattern you derived in Task 2 (see Table 2) we can compute:

    Pf(W|R,Q4) = (70% + 30% + 60% + 80%) / 4 = 0.60

The above calculation assumes that all attributes in the structured representation are equally important in determining the probability that a word W is the first word of the answer. You may also assign a weight to each attribute to differentiate its importance.
For example, if you believe that S1 is less important than the others in determining the first word of the answer, you could assign a weight of 0.1 to S1 and 0.3 to each of S2, S4 and S6 (the weights must sum to 1). Then:

    Pf(W|R,Q4) = 70% * 0.1 + 30% * 0.3 + 60% * 0.3 + 80% * 0.3 = 0.58

Finally, choose the word with the highest Pf as the first word of the answer. Note that once the computation is formulated, it must be applied consistently to all the feature requests in the Testing Set for a specific question. That is, you may not change the attribute weights applied to the test instances for a specific question once they are determined.

(2) Locate the last word of the answer. For each word occurring after the first word in the same sentence, compute its probability of being the last word of the answer, denoted by Pl. The last word can be determined by the attributes S3, S5 and S7. Pl is calculated in the same way as Pf in (1).

(3) Having located the first and last words of the answer, you can automatically retrieve the entire answer from the feature request.

3.8. Task 3 Deliverables

(1) The “{Your First Name}{Your Last Name}.java” file.
(2) The “answer{questionID}.txt” files.
(3) Package these files into one folder named “Task 3-{Your First Name}{Your Last Name}”. Please submit one zip file containing the entire folder.
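The Pf calculation from tip (1) could be sketched as below, using the Table 2 pattern for Q4. The class `FirstWordScorer`, its helper methods, and the decision to score an attribute value that never occurred in the Training Set as 0 are assumptions for illustration only. Equal weights of 0.25 reproduce the unweighted average.

```java
import java.util.HashMap;
import java.util.Map;

class FirstWordScorer {
    /**
     * Weighted score of one candidate word.
     * candidate: attribute name -> value of that attribute for the word (range label or POS tag).
     * patterns:  attribute name -> (value -> training percentage as a fraction).
     * weights:   attribute name -> weight; the weights are expected to sum to 1.
     */
    static double score(Map<String, String> candidate,
                        Map<String, Map<String, Double>> patterns,
                        Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            Map<String, Double> pattern = patterns.get(w.getKey());
            // Assumption: a value never seen in the Training Set contributes 0.
            double p = pattern.getOrDefault(candidate.get(w.getKey()), 0.0);
            total += w.getValue() * p;
        }
        return total;
    }

    /** Builds a map from alternating key/value arguments, e.g. dist("VB", 0.6, "NNS", 0.3). */
    static Map<String, Double> dist(Object... keyValuePairs) {
        Map<String, Double> m = new HashMap<>();
        for (int i = 0; i < keyValuePairs.length; i += 2) {
            m.put((String) keyValuePairs[i], (Double) keyValuePairs[i + 1]);
        }
        return m;
    }

    /** The example pattern of Table 2 (answers to Q4). */
    static Map<String, Map<String, Double>> table2() {
        Map<String, Map<String, Double>> p = new HashMap<>();
        p.put("S1", dist("1-2", 0.7, "3-4", 0.3));
        p.put("S2", dist("3-4", 0.6, "5-6", 0.3, "1-2", 0.1));
        p.put("S4", dist("VB", 0.6, "NNS", 0.3, "TO", 0.1));
        p.put("S6", dist("TO", 0.8, "IN", 0.2));
        return p;
    }

    public static void main(String[] args) {
        // The word W from the text: S1 = 1, S2 = 5, S4 = VB, S6 = TO.
        Map<String, String> w = new HashMap<>();
        w.put("S1", "1-2");
        w.put("S2", "5-6");
        w.put("S4", "VB");
        w.put("S6", "TO");

        Map<String, Double> equal = dist("S1", 0.25, "S2", 0.25, "S4", 0.25, "S6", 0.25);
        Map<String, Double> custom = dist("S1", 0.1, "S2", 0.3, "S4", 0.3, "S6", 0.3);
        System.out.printf("equal weights:  %.2f%n", score(w, table2(), equal));
        System.out.printf("custom weights: %.2f%n", score(w, table2(), custom));
    }
}
```

Scoring every word of a request with the same weights and keeping the word with the highest score gives the first word of the answer. The same `score` method could be reused for Pl by passing patterns and weights for S3, S5 and S7.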
