REGEX HELPER USER MANUAL
CONTENTS
1. ABOUT REGEX HELPER
2. SYSTEM REQUIREMENTS
3. DEPLOYING REGEX HELPER
4. MAIN USER INTERFACE
5. USAGE AND FUNCTIONALITY
6. SAMPLE USE CASE (With Screenshots)
ABOUT REGEX HELPER
Regex Helper is a web-based tool for retrieving, from a dataset, synonyms for tokens in a regular expression.
The Regex Helper interface lets the user supply a regular expression and a data location as input and retrieve
all synonyms for a token in the regular expression, along with the corresponding matches. The user can then
give the system feedback on the relevance of the synonyms, which improves the results in future iterations.
The user can continue this process until enough synonyms have been retrieved or until no more relevant
synonyms are returned.
SYSTEM REQUIREMENTS
The Regex Helper tool has been developed in Java. It is distributed as a Web Archive (WAR) file that must be
deployed on a web server.
Web Server: Apache Tomcat 7.0 (http://tomcat.apache.org/download-70.cgi)
Java: Version 1.7
Tested on Red Hat Linux Release 6.5
DEPLOYING REGEX HELPER
The package is available at the following location:
http://pages.cs.wisc.edu/~gayatrik/RegexHelper/RegexHelper.war
The application is distributed as a Web Archive file, RegexHelper.war, which must be deployed on a web server.
In Apache Tomcat 7.0, the base directory of the server installation is referred to as $CATALINA_BASE.
To deploy the application, copy the web application archive file into the directory $CATALINA_BASE/webapps/.
When Tomcat is started, it automatically expands the archive into its unpacked form and runs the application
from there.
Once deployed, the web application can be accessed from a browser at:
http://localhost:8080/RegexHelper
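The deployment step described above amounts to a single file copy. The following is a minimal sketch of that copy in Python, not part of the Regex Helper package. The function name deploy_war and its parameters are hypothetical, and it assumes the WAR file has already been downloaded and that the Tomcat base directory contains a webapps folder.

```python
import shutil
from pathlib import Path


def deploy_war(war_file: str, catalina_base: str) -> Path:
    """Copy a WAR file into <catalina_base>/webapps and return the new path.

    Hypothetical helper: it only performs the copy described in the manual.
    Tomcat itself expands the archive the next time it starts.
    """
    webapps = Path(catalina_base) / "webapps"
    if not webapps.is_dir():
        raise FileNotFoundError(f"no webapps directory under {catalina_base}")
    # shutil.copy returns the path of the file it created inside webapps.
    return Path(shutil.copy(war_file, webapps))
```

A call might look like deploy_war("RegexHelper.war", os.environ["CATALINA_BASE"]), after which Tomcat is started or restarted as usual.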
MAIN USER INTERFACE
The main user interface of the tool is as shown below:
The inputs to the Regex Helper are:
1. Regular Expression
The input regular expression should indicate the word for which synonyms are to be found. At least one seed
word must be included in the input regular expression.
Sample Regular Expression:
If a rule is of the form: (athletic|batting|fitness|work[ -]?out) gloves?
the regular expression can be given as “(athletic|\syn) gloves?”.
The token '\syn' marks the position for which synonyms are to be found. Here, “(athletic|\syn)” indicates
that the seed word is 'athletic'. The tool finds all synonyms that match the rule and are relevant to
“athletic”.
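To make the role of the '\syn' token concrete, the sketch below shows one way such a placeholder could be turned into a capturing group so that candidate words can be collected from a dataset. This is an illustration only. It is not the algorithm Regex Helper uses, it does no relevance ranking, and the function names and the choice of a single-word group are assumptions.

```python
import re

SYN_TOKEN = r"\syn"  # the literal placeholder described in the manual


def candidate_pattern(user_regex: str) -> re.Pattern:
    """Replace the \\syn placeholder with a named group matching one word."""
    if SYN_TOKEN not in user_regex:
        raise ValueError("expression must contain the \\syn token")
    # Assumption: a candidate is a single run of word characters.
    expanded = user_regex.replace(SYN_TOKEN, r"(?P<syn>\w+)", 1)
    return re.compile(expanded, re.IGNORECASE)


def find_candidates(user_regex: str, lines: list[str]) -> dict[str, list[str]]:
    """Map each word found in the \\syn slot to the lines where it matched."""
    pattern = candidate_pattern(user_regex)
    found: dict[str, list[str]] = {}
    for line in lines:
        for m in pattern.finditer(line):
            word = m.group("syn")
            if word is not None:  # None means a seed word matched instead
                found.setdefault(word.lower(), []).append(line)
    return found
```

With the sample expression “(athletic|\syn) gloves?” and the lines "batting gloves size L", "athletic gloves" and "leather work gloves", this sketch returns the candidates 'batting' and 'work'. The seed word 'athletic' is skipped because it matches the first alternative rather than the placeholder.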
2. Data Location
The data location is the location of the dataset. It can be a file or a directory on the local drive.
3. Additional Options:
i. Multiline
This option tells the tool whether the surrounding context of a synonym spans multiple lines. For example,
if each line of the dataset file holds a product title, the context is NOT multiline. For a text file
containing, say, an e-mail, the context spans several lines, so Multiline can be enabled. Setting this
option correctly helps in matching the synonyms that are more relevant to the current context. The default
value is false.
ii. Number of context words:
This specifies the number of words near the synonym that are considered as the context of the synonym.
The default value is 5.
iii. Max Number of words in Synonym:
This specifies the maximum number of words that a synonym can contain. The default value is 1. If this
option is set to 2, synonyms consisting of either 1 or 2 words are retrieved.
iv. Minimum Number of characters in Synonym:
This specifies the minimum number of characters that a synonym must contain. The default value is 2.
v. Number of words to match if (.*) is used in expression
For a regular expression of the form (tape|\syn).*dispensers? , the (.*) indicates that any number of
characters may be matched. This option sets a bound on the number of words that can match the (.*). The
default value is 3. If the option is set to None, no bound is applied: because the (.*) is greedy and tries
to match as many words as possible, all the matches are retrieved with their initial words taken as the
synonyms.
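The additional options and their defaults can be summarised as a small configuration object. The sketch below is illustrative and does not show how Regex Helper stores or applies these settings. The class SearchOptions, its field names, and the helpers bound_wildcard and accept_synonym are all hypothetical. The word-bounded rewrite of '.*' is one possible reading of option v, and counting spaces toward the minimum length is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchOptions:
    """Hypothetical container mirroring the defaults listed in the manual."""
    multiline: bool = False
    context_words: int = 5
    max_synonym_words: int = 1
    min_synonym_chars: int = 2
    wildcard_words: Optional[int] = 3  # None means no bound on (.*)


def bound_wildcard(user_regex: str, opts: SearchOptions) -> str:
    """Rewrite '.*' so that it spans at most opts.wildcard_words words."""
    if opts.wildcard_words is None:
        return user_regex  # leave the greedy '.*' untouched
    # Each repetition consumes one whitespace run followed by one word.
    bounded = r"(?:\s+\S+){0,%d}\s*" % opts.wildcard_words
    return user_regex.replace(".*", bounded)


def accept_synonym(candidate: str, opts: SearchOptions) -> bool:
    """Apply the word-count and minimum-length limits to a candidate."""
    words = candidate.split()
    return (1 <= len(words) <= opts.max_synonym_words
            and len(candidate) >= opts.min_synonym_chars)
```

With the defaults, the expression tape.*dispensers? is rewritten so that "tape and label dispensers" still matches, while a title with four words between 'tape' and 'dispensers' does not.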
4. Logging
The process status messages are displayed in the status messages area. These logs can be saved to a file by
checking the “Enable Logging” option (enabling logging for every run might generate a large number of files).
At the end of the process, a report of the current run can be generated by clicking the “Finish and Generate
Report” button. The log and report files are saved under the bin/RegexTemp directory on the server. The
filenames are of the form log_GUID.html and report_GUID.html respectively, where GUID is a unique identifier
for the run. This GUID is displayed in the status message log when the process is submitted.
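Given the GUID shown in the status message log, the log and report files for a run can be located by following the naming scheme described above. The sketch below only builds those paths. The function names and the server_dir parameter are hypothetical, and the format of the identifiers the tool generates is not documented here, so a random UUID is used purely as a stand-in.

```python
import uuid
from pathlib import Path


def run_file_paths(server_dir: str, guid: str) -> tuple[Path, Path]:
    """Return the (log, report) paths for one run under bin/RegexTemp."""
    temp_dir = Path(server_dir) / "bin" / "RegexTemp"
    return temp_dir / f"log_{guid}.html", temp_dir / f"report_{guid}.html"


def new_run_id() -> str:
    """Assumption: any unique string would do; a random UUID is used here."""
    return str(uuid.uuid4())
```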
USAGE AND FUNCTIONALITY
1. Provide the regular expression and the data location as input to RegexHelper along with any necessary
additional options.
2. Click on Submit for the process to start mining the dataset. The status of the process is updated in the
status messages log.
3. If this log needs to be stored in a log file, enable logging before submitting the process.
4. After the results are returned, the user can provide feedback by selecting the synonyms that seem
relevant. The data that matches each synonym is shown so that the user can verify its relevance.
5. After the feedback is submitted, the system provides the user with the next ten most relevant results,
taking the user's feedback into account.
6. This process can be continued until either no more relevant synonyms are retrieved or the required
number of synonyms has been retrieved.
7. To finish the process, click on “Finish and Generate Report”. This completes the feedback process,
generates an HTML report of the process and displays it. The logs and the reports are stored in the
server folder.
8. To start a new process with another regular expression, refresh the page, or close the current RegexHelper
instance and open another instance in the browser, in order to avoid session-related issues.
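The iterative part of the steps above (steps 4 to 6) can be pictured as a simple loop. The sketch below is a schematic model of that interaction, not the tool's implementation. The callables next_batch and ask_user are hypothetical stand-ins for the server's ranking step and for the user's selections in the browser.

```python
from typing import Callable, Iterable

BATCH_SIZE = 10  # the manual states that each iteration shows ten synonyms


def feedback_loop(
    next_batch: Callable[[set[str], set[str]], list[str]],
    ask_user: Callable[[list[str]], Iterable[str]],
    wanted: int,
) -> set[str]:
    """Collect synonyms until `wanted` are accepted or a batch has none relevant."""
    accepted: set[str] = set()
    rejected: set[str] = set()
    while len(accepted) < wanted:
        # The ranking step sees all feedback gathered so far.
        batch = next_batch(accepted, rejected)[:BATCH_SIZE]
        if not batch:
            break  # nothing left to show
        relevant = set(ask_user(batch)) & set(batch)
        if not relevant:
            break  # no relevant synonyms in this iteration
        accepted |= relevant
        rejected |= set(batch) - relevant
    return accepted
```

The two stopping conditions in the loop correspond to the two cases in step 6: the required number of synonyms has been reached, or an iteration returns nothing relevant.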
SAMPLE USE CASE (With Screenshots)
1. Regular expression and data location (folder) provided as input with Multiline option checked.
2. The process runs and the status log is displayed.
3. Once the process is complete, the user can select the relevant synonyms by examining the matches.
4. The synonyms along with a few matches in the dataset are displayed to the user. The user can select
the relevant synonyms by checking them.
5. On examining the results, the user can submit the feedback by clicking on “Submit Feedback”. Each
iteration displays 10 synonyms.
6. After the user decides to stop the process and not continue with any further iterations, the user clicks on
the “Finish and Generate Report” button.
7. This generates an HTML report and displays it to the user. The report is saved under the RegexTemp
directory on the server.