20 Sep, 2018 - Articles

Is it possible to parse a natural language (plain English) question and gather relevant information for it from the internet using Natural Language Processing techniques? AnswerBot is the answer. Code can be found here.

AnswerBot is, put simply, a search engine. You input a question; it parses it using the NLP techniques of Part-Of-Speech tagging and dependency parsing, and then searches through Wikipedia, making use of semantic similarity calculations, to gather information to answer the input. The NLP functionality just mentioned is provided by the Python library SpaCy.

Background Information

Semantic Similarity

How similar two things are in terms of the meanings of their words.

Part-Of-Speech tagging

Words in some text are categorised by whether they are a noun, a verb, etc. (their part of speech).

Dependency parsing

In terms of linguistics, the words in a sentence can be thought of as being linked to each other as dependencies. Dependency parsing means parsing and categorising how the words in a sentence relate and link to each other.

[Dependency parse visualisation of “The cat sat on the mat.”: cat (NOUN) ←nsubj— sat (VERB) —prep→ on (ADP) —pobj→ mat (NOUN)]

Above is a visualisation of dependency parsing and POS tagging. You can visualise your own sentences on the DisplaCy website.

Note: there is one word from which all other words are linked either directly or indirectly. This is called the root. In this case, it’s the word “sat”.

Question Parsing

What needs to be extracted from the input is a set of terms, categorised as either ‘detail’ or ‘keyword’.

When does a ‘detail’ stop being just additional information and start being a ‘keyword’, though? It isn’t as binary as you might first think - resolving terms to be strictly one or the other may result in a less complete representation of the question, and so in inaccuracies. For this reason, I decided to take a fuzzy approach and arrange the terms into a linear hierarchical list (aka: a spectrum) of relative ‘detail’-ness (dependency/importance) instead.

Internally, this structure is represented by a list of terms, where ‘keyword’-like terms emerge at the left and ‘detail’-like terms emerge at the right.

| Natural Language | Abstract Hierarchy |
| --- | --- |
| Obama’s age | ["Obama", "age"] |
| Obama’s dad’s age | ["Obama", "dad", "age"] |
| the biggest animal ever seen in Europe | ["Europe", "animal", "biggest"] |

Note: Google Search has a feature similar to what is being attempted here, called Featured Snippets (image below), which seems to make use of a similar hierarchical structure (see underlined).

Google's Featured Snippets

Question Fixing

First of all, minor pre-processing is performed upon the input to ensure it looks like a question - making sure it ends with a ? and starts with a capital letter.
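As a sketch, this pre-processing step could look like the following (a minimal stand-alone version, not the exact code used by AnswerBot):

```python
# Minimal sketch of the question-fixing pre-processing: make sure the input
# starts with a capital letter and ends with a question mark.
def fix_question(text: str) -> str:
    text = text.strip()
    if not text:
        return text
    if not text.endswith("?"):
        text += "?"
    return text[0].upper() + text[1:]

print(fix_question("what is the biggest animal"))  # What is the biggest animal?
```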


The question is split into queries and the queries are parsed into terms, which are arranged into the hierarchy as described.

| My Terminology | SpaCy Representation | Info |
| --- | --- | --- |
| question | Document | This is the question after the question-fixing pre-processing. |
| query | Span | Usually the Span represents a sentence, but it doesn’t have to be a full sentence. |
| term | Token | |

The parsing works using a function in which an input Token has all of its dependants (children) traversed. For each child, it is decided (by considering the dependency type + POS) whether to prepend or append the child to the list (relative to the input token), and whether to omit it entirely. The function is then called recursively on each of the children - terminating when a token has no children.

The recursion is initiated with the root token of a certain query (which is a Span object).

View code
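The recursion can be sketched with a toy stand-in for spaCy tokens. Note: the `Tok` class, the `PREPEND_DEPS` set and the omitted POS tags below are illustrative assumptions, not AnswerBot’s actual rules; real code would use spaCy’s `token.children`, `token.dep_` and `token.pos_`.

```python
# Toy sketch of the recursive term-arranging pass.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tok:
    text: str
    dep: str = ""          # dependency label, e.g. "poss"
    pos: str = "NOUN"      # part-of-speech tag
    children: List["Tok"] = field(default_factory=list)

# Dependency types whose bearers are more 'keyword'-like, so they go left
# (illustrative choices, not the project's exact set).
PREPEND_DEPS = {"poss", "nmod", "prep", "pobj"}
OMIT_POS = {"DET", "PUNCT"}  # articles and punctuation carry no meaning here

def arrange(token: Tok) -> List[str]:
    """Return terms ordered from keyword-like (left) to detail-like (right)."""
    terms = [token.text]
    for child in token.children:
        if child.pos in OMIT_POS:
            continue
        child_terms = arrange(child)       # recurse; stops at childless tokens
        if child.dep in PREPEND_DEPS:
            terms = child_terms + terms    # more keyword-like: to the left
        else:
            terms = terms + child_terms    # more detail-like: to the right
    return terms

# Hand-built tree for "Obama's dad's age" - root is "age",
# with possessives hanging off it.
obama = Tok("Obama", dep="poss")
dad = Tok("dad", dep="poss", children=[obama])
age = Tok("age", children=[dad])
print(arrange(age))  # ['Obama', 'dad', 'age']
```

Running it on the hand-built tree reproduces the hierarchy from the table above.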

Variations Generation

Note on terminology: I call a group of terms a group, and a list of these groups a grouping (but it may also be called a variation, since each can be thought of as a variant of the hierarchical structure).

Grouping Combinations

When searching for pages relevant to the parsed terms, we want to consider the different groupings of the terms for a fuller picture. E.g. both [["animal"], ["biggest"]] and [["animal", "biggest"]]. (What if there was a page relating to “biggest animals” directly?) Note: the structure is now a list of groups of terms.

First of all, everything is grouped together into one list, then that list is split at every possible position. The spaces in-between items can be thought of as ‘split positions’ or ‘places to put a comma’. Let’s say that \(1\) means to split and \(0\) means not to. The ways of splitting are then simply the binary patterns from \(0\) up to \(2^{n-1}-1\), where \(n\) is the length of the list. This is how where to split is decided. View code.


list:            0   1   2   3
split positions:   0   1   2

Note: the number of different split positions is always one less than the length of the list because the only valid split positions are those in-between two elements.
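The enumeration described above can be sketched as follows (a self-contained version; the linked code may differ in detail):

```python
# Every way of splitting a list of terms into groups, enumerated by counting
# in binary over the n-1 split positions.
def groupings(items):
    n = len(items)
    result = []
    for pattern in range(2 ** (n - 1)):   # 0 .. 2^(n-1) - 1
        groups, current = [], [items[0]]
        for pos in range(n - 1):
            if pattern & (1 << pos):      # bit set: split after position `pos`
                groups.append(current)
                current = [items[pos + 1]]
            else:
                current.append(items[pos + 1])
        groups.append(current)
        result.append(groups)
    return result

for g in groupings(list("abcd")):
    print(g)
# 2^(4-1) = 8 groupings, from [['a','b','c','d']] to [['a'],['b'],['c'],['d']]
```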

Possible split positions (Binary pattern up to \(2^{4-1}-1\)):


Splitting a string abcd based on these positions:


Grouping Permutations

For every grouping of the terms, we also want to consider all the different ways of ordering them too. E.g: [["country"], ["home"]] should be considered as well as [["home"], ["country"]].

This is done using the standard library itertools’ permutations function. View code.
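For illustration, using the example grouping from above:

```python
# All orderings of a grouping's groups, via the standard library.
from itertools import permutations

grouping = [["country"], ["home"]]
variants = [list(p) for p in permutations(grouping)]
print(variants)  # [[['country'], ['home']], [['home'], ['country']]]
```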


Candidate Searching

Initially, the first group (which may consist of one or more terms) of each of the different groupings is run through the Wikipedia search API (I used the Python wrapper library called wikipedia for this) to get some initial candidate pages.

The semantic similarity (provided by SpaCy) between each candidate’s title and the first group of terms is then calculated to serve as a metric for relevancy, and candidates under a certain threshold are discarded.

Relevancy metric (want to maximise) for a given page \(p\) and grouping \(g\):

\[s\left(p_t,\:g_0\right)\]

Where \(s(..,..)\) is SpaCy’s semantic similarity function, and \(p_t\) corresponds to the page’s title. NB: \(g_0\) corresponds to the 0-th group of the grouping (the first group).

Only the titles are used for this elimination because fetching the entire content of every candidate, given the number of candidates at this stage, would mean a large number of additional API requests to Wikipedia. These should be minimised: sending and receiving data over the internet takes a relatively long time (and excessive requests are not very kind to Wikipedia’s servers).

After the elimination, the contents of all the remaining pages are then downloaded through the API, and are kept in order of their relevancy levels - their “score”.
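A sketch of the elimination step. Here `similarity` is a crude word-overlap stand-in for SpaCy’s semantic similarity (real code would call `nlp(a).similarity(nlp(b))`), and the threshold value is illustrative, not the one AnswerBot uses:

```python
# Candidate elimination on titles only: score each title against the first
# group of terms and keep those above a threshold, best first.
def similarity(a: str, b: str) -> float:
    # Word-overlap (Jaccard) stand-in for SpaCy's semantic similarity.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def eliminate(candidate_titles, first_group, threshold=0.3):
    """Keep titles whose score s(p_t, g_0) meets the threshold, best first."""
    query = " ".join(first_group)
    scored = [(similarity(title, query), title) for title in candidate_titles]
    kept = [(s, t) for s, t in scored if s >= threshold]
    return sorted(kept, reverse=True)

titles = ["List of largest animals", "Animal", "Biggest Loser (TV series)"]
print(eliminate(titles, ["biggest", "animal"]))  # [(0.5, 'Animal')]
```

With a real semantic similarity model, “List of largest animals” would of course score far better than this word-overlap stand-in gives it.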

Page Ranking

Now that we have the contents, each page is ranked, for each grouping, in terms of relevancy to that specific grouping. This is again done with SpaCy’s semantic similarity calculation functionality; however, this time the whole contents of each page are considered.

Ranking metric (want to maximise) for a given page:

\[\frac{\sum _i^{l\:}\:s\left(p_c,g_i\right)}{l}+s\left(p_t,g_0\right)\]

Where \(p_c\) is the page’s content, and \(l\) is the length of the grouping.
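As a worked example of this metric, with made-up similarity values (the real numbers come from SpaCy):

```python
# Hypothetical similarity scores for one page against a grouping of l = 2 groups.
sim_content = [0.7, 0.4]   # s(p_c, g_i) for each group g_i (assumed values)
sim_title_g0 = 0.6         # s(p_t, g_0) (assumed value)
l = len(sim_content)

# Average content similarity over the groups, plus the title/first-group term.
rank = sum(sim_content) / l + sim_title_g0
print(round(rank, 3))  # (0.7 + 0.4) / 2 + 0.6 = 1.15
```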

Data Searching

Now we want to find sentences, given a list of input sentences, that maximise relevancy to the groupings.

Remember that a grouping is a list of groups of terms. The groups in each grouping are iterated over, and all the sentences in the given input are ranked by their relevancy to the current group. All but the top (certain number of) sentences are then discarded; the remaining sentences form the pool operated on in the next iteration (the scope). In this fashion, by eliminating less relevant sentences at each group, the most relevant sentences for each grouping are selected.

The relevancy metric for a certain group of terms and a sentence is composed of two metrics:

  1. the semantic similarity (provided by SpaCy) between the group and the sentence
  2. The sentence has the same keywords parsing and arranging algorithm performed upon it as the one used on input questions (described in the question parsing section). The metric is then the average semantic similarity between each keyword extracted from the sentence and the group.

The 2nd metric is weighted (halved).

The relevancy metric for a given group \(a\) and sentence \(x\) can be represented as (want to maximise):

\[\sum _i^l\:s\left(x,\:a_i\right)+\frac{1}{2}\left(\sum _i^l\:\left(\frac{\sum _j^k\:s\left(a_i,\:p\left(x\right)_j\right)}{k}\right)\right)\]

Where \(p(..)\) is the parsing algorithm, that returns a list of keywords of length \(k\).
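The group-by-group narrowing can be sketched like this. Here `score` is a simple word-overlap stand-in for the combined metric above, and `keep` is an illustrative cut-off, not AnswerBot’s actual value:

```python
# Iteratively narrow a pool of sentences, one group of terms at a time.
def score(sentence: str, group) -> float:
    # Stand-in relevancy: count how many of the group's terms appear verbatim.
    words = set(sentence.lower().split())
    return sum(term.lower() in words for term in group)

def select_sentences(sentences, grouping, keep=2):
    """Narrow the sentence pool group by group, most relevant first."""
    pool = list(sentences)
    for group in grouping:
        pool.sort(key=lambda s: score(s, group), reverse=True)
        pool = pool[:keep]          # discard all but the top `keep` sentences
    return pool

sentences = [
    "The blue whale is the biggest animal ever recorded.",
    "Many animals live in Europe.",
    "Paris is the capital of France.",
]
grouping = [["animal"], ["biggest"]]
print(select_sentences(sentences, grouping, keep=2))
```

Because the pool shrinks at every group, later (more ‘detail’-like) groups only refine the ordering among sentences already deemed relevant to the earlier ‘keyword’-like groups.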


A demonstration of the final program running. You can find it here.


I ran a set of 59 questions through the program and recorded the following things for each question: Execution Time (in seconds), Result Classification and Question Type. Spreadsheet available here.

Result Classification

| Class | Info |
| --- | --- |
| Top Page + Item | The first ‘page’ in the results and in the preview items (top 3 items) |
| Top Item | Not the first ‘page’ in the results, but in the preview items (top 3 items) |
| Top Page | Not a top item, but the first page in the results |
| Buried | Neither first page nor top item, but the answer can be found within one of the pages |
| Relevant | Question not answered, but relevant information is given |
| Irrelevant | Question not answered and irrelevant information given |
| No Items | No search results (no-items message shown) |
| Error | Exception occurs |

Question Type

| Type | Info |
| --- | --- |
| Fact | Answerable with a specific value; theoretically should be the same independent of source |
| Attribute | Searching for the attribute x of y (e.g. age of y, population of y) |
| Search | General search as opposed to a question |
| Info | Answerable with information; not necessarily expecting a ‘value’ response |
| Other | Invalid/improperly formed/out of scope (asking for the time, etc.) |


[Pie chart: Results - Top Page + Item, Top Item, Top Page, Buried, Relevant, Irrelevant, No Items, Error]

A majority (61%) of the questions were successfully answered (the green sections). Also, 86% of the time, relevant information was given in some form.

Results Breakdown

[Bar chart: Results Breakdown - count (0-16) per result class, split by question type (Fact, Attribute, Search, Info, Other)]

Above is a breakdown of what type of questions produced which results.

  • Quite a large proportion of the Top Page + Item results are achieved by Info type questions.
  • Attribute questions seem to largely result in only Relevant results. This can be explained by the program not putting extra emphasis on attribute keywords such as when, and so only giving vague related information.
  • The number of Top Page + Item results and Relevant results are equal. This would imply the program is just as likely to give only Relevant information without answering the question as it is to give an immediate answer, right at the top of the results.
  • Of the times the question is answered, the majority of the time it is answered with the answer right at the top of the results (Top Page + Item).
[Stacked bar chart: Results Breakdown - count (0-25) per question type (Fact, Attribute, Search, Info, Other), stacked by result class]

Above is an alternate stacked bar chart for the same data. It highlights the accuracy of the results for each type of question. Some observations:

  • Attribute type questions having a high proportion of only Relevant results (like I said earlier) is highlighted even more here.
  • It’s clear to see that Fact and Info type questions achieve the best results.
  • Other questions tend not to be directly answered, but even still, a large proportion of the time, relevant information is still given.
[100% stacked bar chart: Results Breakdown - Answered vs Unanswered, split by question type (Fact, Attribute, Search, Info, Other)]

Here I’ve simplified the Result Classes to either Answered or Unanswered. NB: This chart is showing the proportions (percentages). From it, we can note some other observations:

  • Large majorities of the Answered questions are quite clearly Info and Fact type questions. This would suggest that there is a greater probability that the program will answer these types of questions correctly in comparison to the other types.
  • The Unanswered question types are very evenly split, which implies that there is no one type of question that has a greater tendency than the others to be unanswered.

Execution Time Distribution

This program is quite slow to execute, especially in comparison to modern search engines. A large factor for this is the large amount of requests over the internet that it must make.

[Histogram: Time Distribution - frequency (0-25) per execution-time bucket, upper bounds 10-100 s]

Above is a graph showing the execution time distribution (in terms of frequency). From the graph, it can be seen that the most common execution time is 0-10s, and that the large majority of runs take 0-30s.

Time Complexity

Here is a graph of how the execution time increases as the input length increases. It essentially shows how this program scales with an increase in input question length.

[Scatter plot: Time Complexity - Time Taken (s, 0-120) against Input Length (char count, 0-70)]

An exponential regression curve has been calculated and added to the graph to show the trend.


The program has a number of key potential limiting factors:

  • NLP functions (handled by SpaCy)
  • Parsing and ordering of keywords
  • Candidate scope (Wikipedia)
  • Candidate selection and elimination
  • Sentence selection

One way of improving the program is to improve any of these factors: expanding the scope, making a more sophisticated parsing algorithm, refining the metrics used, etc.

The program could also be improved by taking different approaches entirely:

| Approach | Reason |
| --- | --- |
| The keyword parsing + ordering could be done directly using Machine Learning approaches. | No longer any need to rely on third-party libraries (such as SpaCy), whose accuracy I have no control over, plus a more direct (and less error-prone) way to do the parsing. |
| Implement an extensive set of known question types, with set parsing and answering schemes for each. | The accuracy of interpreting and answering questions (as long as they are known types) would be greatly increased, and be more reliable. |
| Parse an expected answer type (number, date, day, etc.) and weight answers based on this. | Answering accuracy would be increased, as less likely/inapplicable answers would be penalised more. |

In terms of speed, which at the moment is pretty slow, the program could be improved through indexing and caching: this would avoid every document having to be downloaded and parsed on each query, which is what has to happen at the moment.