AnswerBot
Is it possible to parse a natural language (plain English) question and gather relevant information for it from the internet using Natural Language Processing techniques? AnswerBot is the answer. Code can be found here.
AnswerBot is, put simply, a search engine. You input a question; it parses it using the NLP techniques of Part-Of-Speech tagging and dependency parsing, and then searches through Wikipedia, making use of semantic similarity calculations, to gather information to answer the input. The NLP functionality just mentioned is provided by the Python library SpaCy.
Background Information
Semantic Similarity
How similar two things are in terms of the meanings of their words.
Part-Of-Speech tagging
Each word in a piece of text is categorised by its part of speech: whether it is a noun, a verb, etc.
Dependency parsing
In terms of linguistics, the words in a sentence can be thought of as being linked to each other as dependencies. Dependency parsing means parsing and categorising how the words in a sentence relate and link to each other.
Above is a visualisation of dependency parsing and POS tagging. You can visualise your own sentences on the DisplaCy website.
Note: there is one word from which all other words are linked either directly or indirectly. This is called the root. In this case, it’s the word “sat”.
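To make this concrete, here is a minimal SpaCy sketch (the sentence is just an example) that prints each token’s part of speech and dependency label and picks out the root:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

# Part-of-speech tag and dependency label for every token, plus its head
for token in doc:
    print(token.text, token.pos_, token.dep_, "<-", token.head.text)

# The root is the token from which all other words hang, directly or indirectly
sentence = list(doc.sents)[0]
print("root:", sentence.root.text)  # "sat"
```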
Question Parsing
What needs to be extracted from the input are terms, categorised as either ‘detail’ or ‘keyword’.
When does a ‘detail’ stop being just additional information and start being a ‘keyword’, though? It isn’t as binary as you might first think - and resolving terms to be strictly either may result in a less complete representation of the question, and so in inaccuracies. For this reason, I decided to take a fuzzy approach and arrange the terms into a linear hierarchical list (aka: a spectrum) of relative ‘detail’-ness (dependency/importance) instead.
Internally, this structure is represented by a list of terms, where ‘keyword’-like terms emerge at the left and ‘detail’-like terms emerge at the right.
| Natural Language | Abstract Hierarchy |
|---|---|
| Obama’s age | `["Obama", "age"]` |
| Obama’s dad’s age | `["Obama", "dad", "age"]` |
| the biggest animal ever seen in Europe | `["Europe", "animal", "biggest"]` |
Note: Google Search has a feature similar to what is being attempted here, called Featured Snippets (image below), which seems to make use of a similar hierarchical structure (see underlined).
Question Fixing
First of all, minor pre-processing is performed upon the input to ensure it looks like a question - making sure it ends with a `?` and starts with a capital letter.
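A minimal sketch of what this pre-processing could look like (the function name is hypothetical, not taken from the AnswerBot code):

```python
def fix_question(text):
    """Minor pre-processing: capitalise the first letter and make sure
    the input ends with a question mark."""
    text = text.strip()
    if not text.endswith("?"):
        text += "?"
    return text[0].upper() + text[1:]

print(fix_question("how old is Obama"))  # "How old is Obama?"
```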
Parsing
The `question` is split into `queries`, and the queries are parsed into `terms`, which are arranged into the hierarchy as described.
| My Terminology | SpaCy Representation | Info |
|---|---|---|
| question | `Document` | This is the question after the question-fixing pre-processing. |
| query | `Span` | Usually the `Span` represents a sentence, but it doesn’t have to be a full sentence. |
| term | `Token` | |
The parsing works using a function in which an input `Token` has all of its dependants (children) traversed, and for each child it is decided (by considering the dependency type + POS) whether to prepend or append the child to the list (relative to the input token) and whether to omit certain terms. The function is recursively called for each of the children as well, the termination point being when the input token has no children.
The recursion is initiated with the root token of a given query (which is a `Span` object).
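Below is a rough sketch of this recursive traversal. The specific dependency types that get prepended and the POS tags that get omitted are illustrative guesses, and `collect_terms` is a hypothetical name; AnswerBot’s actual rules may differ.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Which dependency types get prepended and which POS tags get omitted are
# illustrative guesses here, not the exact rules used in AnswerBot.
PREPEND_DEPS = {"poss", "nsubj", "pobj"}
OMIT_POS = {"DET", "AUX", "PUNCT", "ADP", "PART"}

def collect_terms(token, terms=None):
    """Walk a token's children recursively; each child is prepended or appended
    relative to its parent (or omitted), based on dependency type and POS."""
    if terms is None:
        terms = [token]
    for child in token.children:
        if child.pos_ in OMIT_POS:
            continue
        if child.dep_ in PREPEND_DEPS:
            terms.insert(terms.index(token), child)
        else:
            terms.append(child)
        collect_terms(child, terms)  # recursion stops when a token has no children
    return terms

query = list(nlp("What is Obama's dad's age?").sents)[0]  # a Span
print([t.text for t in collect_terms(query.root)])
```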
Variations Generation
Note on terminology: I call a group of terms a `group`, and a list of these `groups` is a `grouping` (but it may also be called a `variation`, since it can be thought of as a variant of the hierarchical structure).
Grouping Combinations
When searching for pages relevant to the parsed terms, we want to consider the different groupings of the terms for a fuller picture. E.g. both `[["animal"], ["biggest"]]` and `[["animal", "biggest"]]`. (What if there was a page relating to “biggest animals” directly?) Note: the structure is now a list of groups of terms.
First of all, everything is grouped together into a list, then that list is split at every possible position. The spaces in-between items can be thought of as ‘split positions’ or ‘places to put a comma’. Let’s say that \(1\) means to split and \(0\) means not to. The ways of splitting are then simply the binary patterns from \(0\) up to \(2^{n-1}-1\), where \(n\) is the length of the list. This is how where to split is decided. View code.
Example:
list: 0 1 2 3
split positions: 0 1 2
Note: the number of different split positions is always one less than the length of the list, because the only valid split positions are those in-between two elements.
Possible split patterns (the binary pattern up to \(2^{4-1}-1\)):
000
001
010
011
100
101
110
111
Splitting a string `abcd` based on these patterns:
abcd
abc,d
ab,cd
ab,c,d
a,bcd
a,bc,d
a,b,cd
a,b,c,d
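Here is a minimal sketch of generating these grouping combinations. The binary split patterns are enumerated with `itertools.product`, and the function name `groupings` is mine rather than AnswerBot’s:

```python
from itertools import product

def groupings(terms):
    """Generate every way of splitting a list of terms into contiguous groups.
    Each binary pattern over the n-1 gaps between items decides whether to
    'put a comma' (1) or not (0) at that position."""
    n = len(terms)
    results = []
    for mask in product([0, 1], repeat=n - 1):  # binary patterns 0 .. 2^(n-1)-1
        groups, current = [], [terms[0]]
        for split, term in zip(mask, terms[1:]):
            if split:
                groups.append(current)
                current = [term]
            else:
                current.append(term)
        groups.append(current)
        results.append(groups)
    return results

print(groupings(["Europe", "animal", "biggest"]))
# [[['Europe', 'animal', 'biggest']], [['Europe', 'animal'], ['biggest']],
#  [['Europe'], ['animal', 'biggest']], [['Europe'], ['animal'], ['biggest']]]
```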
Grouping Permutations
For every grouping of the terms, we also want to consider all the different ways of ordering the groups. E.g. `[["country"], ["home"]]` should be considered as well as `[["home"], ["country"]]`.
This is done using the standard-library `itertools`’ `permutations` function. View code.
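For example (a small sketch, with a made-up grouping):

```python
from itertools import permutations

grouping = [["country"], ["home"]]

# every ordering of the groups within a grouping
variations = [list(p) for p in permutations(grouping)]
print(variations)  # [[['country'], ['home']], [['home'], ['country']]]
```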
Searching
Candidate Searching
Initially, the first group (which may consist of one or more keywords) of each of the different groupings is run through the Wikipedia search API (I used the Python wrapper library called `wikipedia` for this) to get some initial candidate pages.
The semantic similarity (provided by SpaCy) between each candidate’s title and the group of terms is then calculated to serve as a metric for relevancy, and candidates under a certain threshold are discarded.
Relevancy metric (want to maximise) for a given page \(p\) and grouping \(g\):
\[s(p_t,g_0)\]
Where \(s(..,..)\) is SpaCy’s semantic similarity function, and \(p_t\) corresponds to the page’s title. NB: \(g_0\) corresponds to the 0-th group of the grouping (the first group).
Only the titles are used for the elimination because fetching the entire content of every candidate, given the number of candidates at this stage, would mean a large number of additional API requests to Wikipedia. These should be minimised, because sending and receiving data over the internet takes a relatively long time (and excessive requests are not very kind to Wikipedia’s servers).
After the elimination, the contents of all the remaining pages are then downloaded through the API, and are kept in order of their relevancy levels - their “score”.
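A rough sketch of this candidate search and title-based elimination, assuming the `wikipedia` wrapper library and a SpaCy model with word vectors; the threshold value and function name are illustrative, not AnswerBot’s actual ones:

```python
import spacy
import wikipedia

nlp = spacy.load("en_core_web_md")  # a model with word vectors, for .similarity()

SIMILARITY_THRESHOLD = 0.5  # illustrative value, not AnswerBot's actual threshold

def candidate_pages(first_group):
    """Search Wikipedia with the first group of a grouping, then keep only the
    candidates whose titles are semantically similar enough to the group."""
    group_doc = nlp(" ".join(first_group))
    candidates = []
    for title in wikipedia.search(" ".join(first_group)):
        score = group_doc.similarity(nlp(title))
        if score >= SIMILARITY_THRESHOLD:
            candidates.append((score, title))
    # most relevant candidates first; only these have their contents downloaded
    return sorted(candidates, reverse=True)

print(candidate_pages(["Obama"]))
```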
Page Ranking
Now that we have the contents, each page is ranked in terms of relevancy to each specific grouping. This is, again, done with SpaCy’s semantic similarity functionality; however, this time the whole contents of each page are considered.
Ranking metric (want to maximise) for a given page:
\[\frac{\sum_i^{l} s\left(p_c,g_i\right)}{l}+s\left(p_t,g_0\right)\]
Where \(p_c\) is the page’s content, and \(l\) is the length of the grouping (the number of groups).
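A sketch of how this ranking metric could be computed, assuming pages are available as (title, content) pairs; names and structure are mine, not necessarily AnswerBot’s:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # word vectors assumed for similarity

def rank_pages(pages, grouping):
    """Score each downloaded page against one grouping: the mean similarity of
    the page's content to every group, plus the title's similarity to the
    first group."""
    group_docs = [nlp(" ".join(group)) for group in grouping]
    scored = []
    for title, content in pages:  # pages: iterable of (title, content) pairs
        content_doc = nlp(content)
        content_score = sum(content_doc.similarity(g) for g in group_docs) / len(group_docs)
        title_score = nlp(title).similarity(group_docs[0])
        scored.append((content_score + title_score, title))
    return sorted(scored, reverse=True)  # most relevant page first
```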
Data Searching
Now, given a list of input sentences, we want to find the sentences that maximise relevancy to the groupings.
Remember that a `grouping` is a list of `groups` of terms. The list of `groups` in every `grouping` is iterated over, and all the sentences in the given input are ranked by their relevancy to the `group`. All but the top (certain number of) sentences are then discarded. The remaining sentences form the list that is operated on in the next iteration (the scope). In this fashion, by eliminating less relevant sentences for each group, the most relevant sentences for each `grouping` are selected. (A code sketch of this narrowing loop follows the metric below.)
The relevancy metric for a certain `group` of terms and a sentence is composed of two metrics:
- the semantic similarity (provided by SpaCy) between the group and the sentence
- The sentence has the same keyword parsing and arranging algorithm performed upon it as the one used on input questions (described in the Question Parsing section). The metric is then the average semantic similarity between each keyword extracted from the sentence and the group.
The 2nd metric is weighted (halved).
The relevancy metric of a sentence \(x\) to a given `group` \(a\) can be represented as (want to maximise):
\[s(a,x)+\frac{\sum_i^{k} s\left(a,p(x)_i\right)}{2k}\]
Where \(p(..)\) is the parsing algorithm, which returns a list of keywords of length \(k\).
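Putting the last two pieces together, here is a minimal sketch of the per-group sentence metric and the narrowing loop. The `keywords_of` helper is a simplified stand-in for the full question-parsing algorithm, and the `keep` value is illustrative; none of the names are taken from the AnswerBot code:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # word vectors assumed for similarity

def keywords_of(sentence):
    """Simplified stand-in for the question-parsing algorithm applied to a
    sentence: here it just keeps content words rather than doing the full
    dependency traversal."""
    return [t for t in sentence if t.pos_ in {"NOUN", "PROPN", "ADJ", "VERB"}]

def sentence_relevance(sentence, group_doc):
    """Similarity of the sentence to the group, plus half the average
    similarity between the group and each keyword parsed from the sentence."""
    keywords = keywords_of(sentence) or list(sentence)
    keyword_score = sum(group_doc.similarity(k) for k in keywords) / len(keywords)
    return group_doc.similarity(sentence) + keyword_score / 2

def narrow_sentences(sentences, grouping, keep=5):
    """For each group in the grouping, rank the current pool of sentences by
    relevance to that group and keep only the top few; the survivors become
    the scope for the next group."""
    pool = list(sentences)
    for group in grouping:
        group_doc = nlp(" ".join(group))
        pool.sort(key=lambda s: sentence_relevance(s, group_doc), reverse=True)
        pool = pool[:keep]
    return pool
```

Here `sentences` would be the spaCy sentence `Span`s of the downloaded page contents.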
Demonstration
A demonstration of the final program running. You can find it here.
Analysis
I ran a set of 59 questions through the program and recorded the following things for each question: Execution Time (in seconds), Result Classification and Question Type. Spreadsheet available here.
Result Classification
| Class | Info |
|---|---|
| Top Page + Item | Is the first ‘page’ in the results and is in the preview items (top 3 items) |
| Top Item | Is not the first ‘page’ in the results but is in the preview items (top 3 items) |
| Top Page | Not a top item but is the first page in the results |
| Buried | Neither first page nor top item, but the answer can be found within one of the pages |
| Relevant | Question not answered, but relevant information is given |
| Irrelevant | Question not answered and irrelevant information given |
| No Items | No search results (no-items message shown) |
| Error | Exception occurs |
Question Type
| Type | Info |
|---|---|
| Fact | Answerable with a specific value; theoretically should be the same regardless of source |
| Attribute | Searching for the attribute x of y (e.g. age of y, population of y) |
| Search | General search as opposed to a question |
| Info | Answerable with information, not necessarily expecting a ‘value’ response |
| Other | Invalid/improperly formed/out of scope (asking for the time, etc.) |
Results
A majority (61%) of the questions were successfully answered (the green sections). Also, 86% of the time, relevant information was given in some form.
Results Breakdown
Above is a breakdown of what type of questions produced which results.
- Quite a large proportion of the `Top Page + Item` results are achieved by `Info` type questions.
- `Attribute` questions seem to largely result in only `Relevant` results. This can be explained by the program not putting extra emphasis on attribute keywords such as `when`, and so only giving vague related information.
- The number of `Top Page + Item` results and `Relevant` results are equal. This would imply that it is just as likely for the program to give only `Relevant` information without answering the question as it is to give an immediate answer, right at the top of the results.
- Of the times the question is answered, the majority of the time it is answered with the answer right at the top of the results (`Top Page + Item`).
Above is an alternate stacked bar chart for the same data. It highlights the accuracy of the results for each type of question. Some observations:
- `Attribute` type questions having a high proportion of only `Relevant` results (as I said earlier) is highlighted even more here.
- It’s clear to see that `Fact` and `Info` type questions achieve the best results.
- `Other` questions tend not to be directly answered, but even so, a large proportion of the time, relevant information is still given.
Here I’ve simplified the Result Classes to either `Answered` or `Unanswered`. NB: this chart shows the proportions (percentages). From it, we can note some other observations:
- Large majorities of the `Answered` questions are quite clearly `Info` and `Fact` type questions. This would suggest that there is a greater probability that the program will answer these types of questions correctly in comparison to the other types.
- The `Unanswered` question types are very evenly split, which implies that there is no one type of question that has a greater tendency than the others to be unanswered.
Execution Time Distribution
This program is quite slow to execute, especially in comparison to modern search engines. A large factor for this is the large amount of requests over the internet that it must make.
Above is a graph showing the execution time distribution (in terms of frequency). From the graph, it can be seen that the most common execution time is 0-10s. It can also be seen that, the large majority of the time, the program takes 0-30s to execute.
Time Complexity
Here is a graph of how the execution time increases as the input length increases. It essentially shows how this program scales with an increase in input question length.
An exponential regression curve has been calculated and added to the graph to show the trend.
Reflection
The program has a number of key potential limiting factors:

- NLP functions (handled by SpaCy)
- Parsing and ordering of keywords
- Candidate scope (Wikipedia)
- Candidate selection and elimination
- Sentence selection

One way of improving the program is to improve any of these factors, such as by expanding the scope, making a more complex parsing algorithm, refining the metrics used, etc.
The program could also be improved by taking different approaches entirely:
| Approach | Reason |
|---|---|
| The keyword parsing + ordering could be done directly using Machine Learning approaches. | There would no longer be a need to rely on third-party libraries (such as SpaCy), whose accuracy I have no control over, plus it would be a more direct (and less error-prone) way to do the parsing. |
| Implement an extensive set of known question types, with set parsing and answering schemes for each. | The accuracy of interpreting and answering questions (as long as they are of a known type) would be greatly increased, and the results would be more reliable. |
| Parse an expected answer type (number, date, day, etc.) and weight answers based on this. | Answering accuracy would be increased, as less likely/inapplicable answers would be penalised more. |
In terms of speed, the program is currently pretty slow. This could be improved through indexing and caching, which would avoid every document having to be downloaded and parsed again on each query, as currently has to happen.
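As one small illustration of the caching idea (an assumption about how it might be done, not something the current program does), the per-title content download could simply be memoised:

```python
from functools import lru_cache

import wikipedia

@lru_cache(maxsize=256)
def fetch_content(title):
    """Download a page's content once; later queries needing the same title
    reuse the cached copy instead of making another API request."""
    return wikipedia.page(title).content
```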