Invertex index text search

Invertex index text search 

In this assignment, you will implement an inverted index and use it to store term occurrence information. You will need to design and implement appropriate data structures, write your data to disk, and be able to read it back.

Assignment Teams

This assignment is designed to be carried out in groups of two. It is up to you to find apartner and form a team.

Once you have formed a group, you need to register it through Canvas prior to assignment submission.

Teaching Servers

Three CSIT teaching servers are available for your use:

(titan|saturn|jupiter).csit.rmit.edu.au. You can log in to these machinesusing ssh, for example:

% ssh s1234567@jupiter.csit.rmit.edu.au, where s1234567 is your student number. These servers host the document collection and other data for the assignments in this course. You are encouraged to develop your code on these machines. If you choose to develop your code elsewhere, it is your responsibility to ensure that your assignment submission can be compiled and executed on(titan|saturn|jupiter).csit.rmit.edu.au, as this is where your code will be run for marking purposes.

Programming Tasks

Have a look at the file /home/inforet/a1/latimes on(titan|saturn|jupiter).csit.rmit.edu.au. It is part of the TREC TIPSTERtest collection, and consists of various articles from the LA Times newspaper.

Here is an example of the markup format:

<DOC>

<DOCNO> LA010189-0001 </DOCNO>

<DOCID> … </DOCID>

<DATE> … </DATE>

<SECTION> … </SECTION>

<LENGTH> … </LENGTH>

<HEADLINE>

The article headline.

</HEADLINE>

<TEXT>

The text content of the document.

</TEXT>

</DOC>

Individual documents are wrapped in <DOC> … </DOC> tags. These indicate the beginning and end of each individual document in the collection file.

The <DOCNO> … </DOCNO>tags contain a unique identifier for the document. You willneed to keep track of this to refer to documents in the collection.

The <HEADLINE> … </HEADLINE> and <TEXT> … </TEXT> tags contains the textcontent of the document. This is the actual content of the document that you will need toindex.

Your task is to write two programs. The first will index the collection, gather appropriate term distribution statistics, and write the index data to disk. The second program will load the index data, accept text terms as input, and return the appropriate term distribution statistics for each term.

1 Indexing

Your indexing program must be called index and accept a number of optional command-

line arguments. The invocation specification should be:

% ./index [-p] <sourcefile>

(or equivalent invocation in another programming language) where the optional ag –pwill print the terms being indexed, and the mandatory argument <sourcefile> is thecollection to be indexed. These are described in more detail below.

Note that your implementations must be efficient, making use of appropriate algorithms

and data structures. This will be part of the assessment criteria. 

1.1 Parsing Module

Your first task is to parse the content data contained in the HEADLINE and TEXT tags,tokenizing and extracting content terms, and removing any punctuation and excess markuptags. Punctuation consists of any symbols that are not letters or numbers. You will needto carefully consider issues of token normalisation (for example, how to deal with acronymsand hyphenated words).

All content terms should be case-folded to lower-case as they are indexed.

Your program must be called index, and should accept an optional command-line argument-p. This ag indicates that your program should print all content terms, in the same orderas in the original document, to the standard output stream, stdout. If the ag is notspecified, your program should produce no output to standard out. An example of howyour program might be run is as follows:

% ./index -p /home/inforet/a1/latimes(or equivalent invocation).

  • As your parser encounters each new document, you will need to assign an ordinal number as a document identifier. These can be assigned sequentially (i.e. the first document is 0, the second is 1, and so on). This is how the documents will be referred to in the inverted list information.
  • You also need to track the unique document identifier specified in the <DOCNO> tags, and how these map to the integer identifiers that you assign. Your program should therefore always write an output file to disk, called map, which contains this mapping information. Auxiliary files such as this must be written to the current working directory. 

1.2 Stopping Module

Stopping is the process of excluding words that have grammatical functions, but contain nomeaning themselves, from consideration during indexing. Examples of such stopwords are

the, of, and and. A stoplist is available on (titan|saturn|jupiter).csit.rmit.edu.auat /home/inforet/a1/stoplist.

Extend your program index with a module to stop the input terms. Your program should

accept an optional command-line argument, -s <stoplist>, where stoplist is a file ofstopwords that are to be excluded from indexing. An example of how your program shouldbe invoked is:

% ./index [-s <stoplist>] [-p] <sourcefile>

Note that the -p option must still be available; when it is specified, your program should

print all content terms that are not stopwords to the standard output stream.

The content of the <stoplist>file must be stored in an efficient lookup structure, a hashtable. It is up to you to choose a suitable hash function for text strings. You will be askedto explain your implementation in the report.

1.3 Indexing Module 

Extend your program index to build an inverted index for the tokenised text. The invertedindex needs to store term occurrence information for all content terms that occur in the file that is being indexed. For each term t, you need to store:

_ The document frequency, ft, a count of the number of documents in the collection inwhich term t occurs.

_ An inverted list for term t, which consists of postings of the form< d; fd;t >where:

– d is the document number in which t occurs

– fd;t is the within-document frequency, a count of how often t occurs in documentd

For example, if the term insomnia occurs in two documents in the collection, twice indocument 10, and three times in document 23, then the inverted list would be:

insomnia: 2 <10, 2><23, 3>

Important: the punctuation symbols are only used to make the list human-readable. Theactual internal representation of the inverted list would just store a sequence of integers,represented appropriately, for example 2 10 2 23 3. Your stored lists must not includethe extra punctuation between items. See the lecture notes for further details on theinverted index and inverted lists.

You can assume that you will be working with collections that are small enough to fit inmain memory.

When your index program has finished constructing the inverted lists for each unique termthat occurs in the collection, the data should be written to disk. Your program shouldwrite three data files in total. These files must be written to the current working directory.

  1. lexicon: Contains each unique term that occurs in the collection and a “pointer” tothe inverted list for that term.
  2. invlists: Contains the inverted list information, consisting only of numerical data(represented as binary integer data).
  3. map: Contains the mapping information from document id numbers (as used in theinverted lists) to the actual document names.

Since the lexicon and inverted lists are stored separately, your lexicon file will need toinclude information about the file offset position in invlists where the inverted list forthecorresponding term occurs. Your program should not simply read the invlists filesequentially from the start each time it is accessed.

It is up to you to design the particulars of the implementation. However, please read thenext section on Querying carefully first, as this is likely to have implications for how youstore your data. You will be asked to explain your implementation in a report.

2 Querying

Write a program called search that loads your index data, and searches it to retrieve the

inverted lists for particular terms. An example of how your program should be invoked is:

% ./search <lexicon><invlists><map><queryterm 1> [… <queryterm N>](or equivalent invocation) where <lexicon>, <invlists> and <map> are the inverted index files that your index program created. The program will also receive a variable numberof query terms (but at least one query term) as command-line arguments. Each of theseterms should be looked up in the lexicon, and the appropriate inverted list data fetchedfrom the inverted list file. For each queryterm, you need to print the following informationto standard out:

  • The current query term;
  • The document frequency ft;
  • A list of each document in which the term occurs (using the identifier from the <DOCNO> field, which you can look up in your map _le), and the within-document frequency fd; t.

An example of how your program might be run is as follows:

% ./search lexicon invlists map insomnia (or equivalent invocation). If the term insomnia occurs in two documents, twice in document FT911-22 and three times in document FT911-2032 then the output should be:

insomnia

2

FT911-22 2

FT911-2032 3

Your implementation should follow these specifications:

  • The lexicon and the map should be loaded into memory when your program is invoked, and stored using an efficient look-up data structure such as a hash table.
  • The invlists data should not be pre-fetched into memory. Instead, the inverted list data should be read from disk as each query term is processed. That is, your program should read only that section of the invlists file that corresponds to the list data for a particular term.
  • The output of your program must follow the format given in the example above, and must not produce any additional output beyond what is specified in the assignment requirements.

3 Report 

Create a file called report.pdf and answer the questions below (your report must includethe six headings given in subsections 3.1 to 3.6).

Remember to clearly cite any sources (including books, research papers, course notes, etc.)that you referred to while designing aspects of your programs.

The answers in your report should not include pasted extracts from your source code (thesource code will be submitted separately, as part of the assignment). However, you maywish to make use of pseudo-code to explain particular concepts, algorithms, or implementation choices, if needed.

3.1 Index Construction 

Explain your inverted index implementation. As part of your explanation you need toclearly describe:

  1. How you tokenize terms, including how you deal with punctuation and markup tags,and handle acronyms and hyphenated words
  2. How you gather term occurrence statistics while your index program is parsing thecollection
  3. How you merge the information together to construct the final lexicon and invlists files
  4. the format of your lexicon, invlists and map files that are written to disk

This explanation should be around a page in length. 

3.2 Stoplist

Briefly explain how you implemented your stoplist module, including the hash functionthat you used, and why it is an appropriate hash function for this task. This explanationshould be around half a page in length.

3.3 Index Search

Explain your index search implementation. As part of your explanation you need to clearlydescribe:

  • the data structure you used to hold your lexicon in memory
  • how you look up corresponding term occurrence information in your invlists file on disk (including an explanation of how you read only the appropriate chunk of the file for the term that is being looked up, and avoid scanning sequentially from the start)

This explanation should be around a page in length, but no longer than one and a halfpages. 

3.4 Index Size

Report the size of your inverted index, separately giving the size of the lexicon, inverted lists, mapping file, and a grand total. How does this compare with the size of the originalcollection? In a paragraph, discuss why the index is larger/smaller than the originalcollection.

3.5 Limitations

This assignment requires you to design and implement a large indexing system for textdocuments. In this section, you should briefly outline any known limitations or bugs inyour approach or implementation. For example, perhaps you noticed that your code willcrash in certain conditions, and despite your best efforts you just couldn’t figure out why.

Or, perhaps due to time constraints you were forced to make a less than ideal designdecision.

Please be up-front in listing known issues, and demonstrate your understanding of IRsystem requirements by explaining what you think an “ideal” system might do in addition(or instead).

3.6 Contributions

In this section, you should provide details about what each team member contributed tothe submitted assignment materials. Note that both team members will receive the samemark. However, you must include this section. 

4 Optional Extension: Compression

ONLY attempt this section if you have completed all previous sections of the assignment.If you submit a solution to this extension exercise, you will need to submit two versionsof your code, one without compression (i.e. all previous sections of the assignment), andone with compression (i.e. incorporating the following requirements). You MUST clearlyexplain in your README.txt _le which _les correspond to which part of the assignment,as well as how each should be compiled and run.

  1. Extend your index program so that the inverted lists are compressed using variable-byte coding before they are written to disk, and decompressed after being read fromdisk.
  2. Extend your report to include a subsection called Compression, and explain yourimplementation of the chosen compression scheme. What is the size of your invertedindex using compression? How does this compare to the uncompressed invertedindex? Write no more than half a page.

Solution

Introduction

For this project we choose python programming language. It allows concentrating on the program logic and project goals leaving many other program aspects “under the hood” (memory management, passing and storing data, etc.). The main drawback is slow running time due to its interpreting nature. We leave out the discussion of the performance for the Limitations section. On the other hand there is a number of advantages of using python. Python has rich standard library and offers short and simple solutions when compiled languages do not provide such functionality out of the box. The price for such overhead is significant speed reduction especially in for/while loops and large datasets. While developing the code we tried to make it memory-efficient with the use of generators when possible. Such approach allows handling big dataset without storing all the data in memory.

Notes on data structures used. Since implementing hash table data structure will be superfluous in python we will be using python’s built-in dict to hold key-value pairs. Internally they are implemented as resizable hash tables with key’s hash being computed with hash build-it function. Python’s dict uses open addressing to resolve hash collisions. Open addressing is a method of collision resolution where probing is used. It the case of the occupied slot it searches the slots by slot to find an empty one using pseudo random probing. The dictionary is resized when more that 2/3 of the table is occupied.

In our program we will be primarily using strings as hash keys. A pseudo-code for python’s hash built-in is the following:

function string_hash:

set len to string’s length

initialize var p pointing to first char of string object

set x to value pointed by p left shifted by 7 bits

while len >= 0:

set var x to (1000003 * x) xor value pointed by p

increment pointer p

set x to x xor length of string object

return x as the hash

In order to have cleaner and more expressive code we will use two more dictionary-based structures from collections module, namely defaultdict and Counter. Defaultdict allows constructing a list of within-document frequencies (inverted list) without having to check if the particular key (term) is already present in the hash table. Technically the introducing of  defaultdict made the code just two lines shorter but on the other hand such improvement increases code readability and makes the relevant variable more meaningful. The use of Counter data structure is quite natural because we deal with word counts in the text. From the python standard library documentation: “the Counter is a dict subclass for counting hashable objects”.  Basically it is a collection of key-value pairs where values represent counts of keys.

Index Construction

For each <DOC> element we create a dictionary of its child elements as keys and their content as value. In fact we need only 3 child elements: <DOCNO>, <TEXT> and <HEADLIE> but we consider it useful to have all fields just for the reference or ongoing development. Before tokenizing plain text we remove all paragraph tags. We assume no other tags are present inside <TEXT> and <HEADLIE> elements. In most cases a hyphenated word is a composite, so we just threat such cases as two  (or more) separate words. This assumption is reasonable to some extent. As an alternative we were thinking of discarding hyphenated words but found the first approach more sensible. We don’t have any special treatment for abbreviations since one cannot simply tell abbreviated word from the UPPERCASE. Yet, we consider a case of words with apostrophes (like Ben’s, don’t, we’re, etc). In such cases we simply strip the part following the apostrophe up to the end of the word. All punctuation characters are replaced with a space. With such approach one can easily split a string into a list of separate words with pythons str.split method, which when omitting a separator splits on space and treats multiple spaces as a single one. Having received a list of tokens one can easily get their respective counts just by passing the list as initializer to Counter object (discussed earlier). Having applied such an algorithm for each document we get a list of Counter objects corresponding to word occurrences per document.

On the next step we have to convert the list of Counters to the final lexicon and invlists files. First we construct an invlists object represented by the defaultdict of lists with word being a key and a list of (index, count) pairs for value. Then we iterate over the list of dicts and over each word in Counter (nested loop). We discard stop words and append the corresponding (index, count) to the value list of word (key) in the invlists dict. After this operation we have a hash table data structure containing words as keys and list of pair (index, count) as values almost ready for the next step. Before creating data structures which will be written to files we create a list of lexicon words sorted alphabetically. This step is not actually required (it was realized while writing the report) but it adds some consistency to the program. Finally we create two more data structures: a list of binary data containing invlist information for each word and a dictionary mapping a word to its entry offset in invlist file. The invlist info consists of the number of document the word occurs in (it is simply the length of list value in invlists defaultdict) followed by flattened list of (index, count) pairs. Those values are converted into binary format using struct module from python’s standard library. Data offset is computed using struct.calcsize function. The (word, offset) key – value pair is inserted into a dictionary to be used for the lexicon file output.

At this stage we have all data structures which can be simply written to files. Lexicon and map files are just plain text files where each line has key: value format. In the case of lexicon file the key is a word and the value is its offset in the  invlists and for map file the key is document index while value is document’s docno. The invlists file is plain binary file containing integers as described above.

Stoplist

Similarly to the previous section we used python’s built-in data structure to hold stop words. We use set

data structure which holds unordered collection of unique values and has O(1) element lookup complexity which perfectly suits our purpose. Internally set is implemented the same way as dictionary keys but without associated value attached. Since stop words are stored as string the hash function used is the same as described in the previous section. In case  when the stoplist file is not provided we use empty set, so we don’t have to write special case in the code flow.

Index Search

While developing search program we tried to reverse the flow on the index program. So, we used dictionaries both for lexicon and map data structures as described in the final part of Index Construction section. Namely, lexicon holds (word, offset) key-value pairs and map holds (index, docno). The implementation of dict has already been discussed before. Having lexicon dict one can easily look up the start locatation of the corresponding invlist by the offset. Since all values in invlists file are 4-byte integers we and the first value of a separate invlist is the number of its elements, one can easily read all the consecutive values with above-mentioned struct module at the same time constructing (index, count) pairs. Since we store a mapping from index to docno in the map dictionary it is quite straightforward to produce the required output.

Index Size

The sizes of source file and of 3 resulting output files are the following:

$ du -h /home/inforet/a1/latimes

478M   /home/inforet/a1/latimes

$ du -h invlists

181M   invlists

$ du -h lexicon

13M     lexicon

$ du -h map

2.6M    map

The total size of lexicon, invlists and map files is 196.6 Mb which is about 41% of the source file size. The size of map file is the smallest which is obvious: it is proportional to the number of the documents in the source file whilst the size of lexicon file is proportional to the number of unique words in documents.

$ wc -l map

131896 map

$ wc -l lexicon

602377 lexicon

One can see that there are about 5 times more lines in lexicon file compared to map file.

The file contribution the most to the total size is binary file invlists. Its size is also proportional to the number of the words in the lexicon. On the other hand its upper limit is proportional to the number of documents (in the worst-case scenario each word occurs in every document). Since the invlists is stored in binary format it uses only 4 bytes per number which allows to reduce file size compared to plain text firmat which uses 1 byte per character and needs separators between values.

Finally, the total size of the inverted index is more than 2 times less than the size of the source file. The sorce file is larger because it stores multiple occurrences of each word (plus multiple stop words) also it is stored as plain text, while the index file contains all the information in the compressed form. One can consider the inverted index a compressed version of the source file where the space is reduced by the cost of losing the original word ordering.

Limitations

The biggest limitation considered is the use of the programming language for constructing index and performing the search query. Since all the operations for term querying are O(1) it didn’t actually affect th performance of the search program. On the other hand the running time of index is O(m*n) where m is the number of documents and n is the number of unique words. The python implementation loses much in the performance to compiled languages. Still we can take into account the fact that indexing will be performed only once, so in wider picture the implementation is still useful and we expect it to scale linearly with the growth of m*n.

We don’t expect any bugs in the code as long as the source file is properly formatted. We take into account cases when headline or/and text is missing. Actually the index program can be improved by adding exception handling on the badly formatted documents.

Index 

#!/usr/bin/python

import struct

import argparse

from collections import Counter, defaultdict

def read_stoplist(fname):

with open(fname, ‘r’) as fp:

words = fp.read().splitlines()

return set(words)

def read_docs(fname):

current_doc = ”

with open(fname, ‘r’) as fp:

for line in fp:

line = line.lower()

current_doc += line

if ‘</doc>’ in line:

before, after = line.split(‘</doc>’, 1)

current_doc += before

yield current_doc.strip()

current_doc = after

def parse_doc(doc_text):

fields = {}

# remove excess markup

stop_tags = [‘<doc>’, ‘</doc>’, ‘<p>’, ‘</p>’]

for tag in stop_tags:

doc_text = doc_text.replace(tag, ”)

while doc_text:

# assume the line starts with the opening tag

tag_name, doc_text = doc_text.split(‘<‘, 1)[1].split(‘>’, 1)

tag_text, doc_text = doc_text.split(‘</’ + tag_name + ‘>’, 1)

fields[tag_name] = tag_text.strip()

doc_text = doc_text.strip()

return fields

def parse_text(doc_obj):

# might be missing

headline = doc_obj.get(‘headline’, ”)

text = doc_obj.get(‘text’, ”)

word_counts = tokenize_text(headline + ‘\n’ + text)

payload = {‘docno’ : doc_obj[‘docno’],

‘counts’: word_counts}

return payload

def tokenize_text(text):

norm_text = “”

i = 0

while i < len(text):

if text[i].isalnum():

norm_text += text[i]

# skip ‘s, ‘re, ‘ve, etc.

elif text[i] == “‘” and text[i-1].isalnum():

while i < len(text) and not text[i].isspace():

i = i + 1

else:

norm_text += ‘ ‘

i = i + 1

tokens = norm_text.split()

return Counter(tokens)

def pack_invlist(invlist):

frequency = len(invlist)

payload = (frequency, ) + sum(invlist, ())

return payload

def create_arg_parser():

parser = argparse.ArgumentParser()

parser.add_argument(“sourcefile”,

help=”collection to be indexed”)

parser.add_argument(‘-p’, action=’store_true’,

help=’print all content terms being indexed that are not stopwords’)

parser.add_argument(‘-s’, dest=’stoplist’, required=False,

help=’a file of stopwords that are to be excluded from indexing’)

return parser

if __name__ == ‘__main__’:

parser = create_arg_parser()

args = parser.parse_args()

collection = (parse_doc(doc) for doc in read_docs(args.sourcefile))

parsed = [parse_text(item) for item in collection]

stopwords = set() if args.stoplist is None else read_stoplist(args.stoplist)

# construct inverted list for each word

invlists = defaultdict(list)

for index, doc in enumerate(parsed):

for word in filter(lambda x: x not in stopwords, doc[‘counts’]):

invlists[word].append((index, doc[‘counts’][word]))

ordering = sorted(invlists)

# display terms

if args.p:

for term in ordering:

print(term)

# prepare invlists for the output

offssets_data = {}

invlists_data = []

offset = 0

for term in ordering:

offssets_data[term] = offset

payload = pack_invlist(invlists[term])

fmt = “<{}i”.format(len(payload))

data = struct.pack(fmt, *payload)

invlists_data.append(data)

offset = offset + struct.calcsize(fmt)

# unique term in the collection and a “pointer” to the inverted list

with open(‘lexicon’, ‘w’) as fp:

for term in ordering:

fp.write(“{}: {}\n”.format(term, offssets_data[term]))

# inverted list information (binary)

with open(‘invlists’, ‘wb’) as fp:

for data in invlists_data:

fp.write(data)

# map the document identifier specified to the integer identifiers

with open(‘map’, ‘w’) as fp:

for index, doc in enumerate(parsed):

fp.write(“{}: {}\n”.format(index, doc[‘docno’])) 

search 

#!/usr/bin/python

import struct

import argparse

def read_lexicon(fname):

with open(fname, ‘r’) as fp:

lexicon = dict(line.split(‘:’) for line in fp)

return {key: int(val) for key, val in lexicon.items()}

def read_index_map(fname):

with open(fname, ‘r’) as fp:

index_map = dict(line.split(‘:’) for line in fp)

return {int(key): val.strip() for key, val in index_map.items()}

def read_invlists(fname, offset):

with open(fname, ‘rb’) as fp:

fp.seek(offset)

data = fp.read(struct.calcsize(“<i”))

occurences = struct.unpack(“<i”, data)[0]

frequencies = []

for k in range(occurences):

data = fp.read(struct.calcsize(“<2i”))

pair = struct.unpack(“<2i”, data)

frequencies.append(pair)

return occurences, frequencies

def create_arg_parser():

parser = argparse.ArgumentParser()

parser.add_argument(“lexicon”,

help=”a text file with a list of unique terms and pointers to the inverted list”)

parser.add_argument(“invlists”,

help=”a binary file with inverted list information”)

parser.add_argument(“index_map”, metavar=”map”,

help=”a text file with a mapping of the document identifier specified to the integer identifiers”)

parser.add_argument(‘queryterm’, nargs=’+’, type = str.lower,

help=’a query term to look up in the lexicon’)

return parser

if __name__ == ‘__main__’:

parser = create_arg_parser()

args = parser.parse_args()

lexicon = read_lexicon(args.lexicon)

index_map = read_index_map(args.index_map)

for query in args.queryterm:

print(query)

if query in lexicon:

offset = lexicon[query]

occurences, doc_frequencues = read_invlists(args.invlists, offset)

print(occurences)

for index, count in doc_frequencues:

print(“{} {}”.format(index_map[index], count))

else:

print(“0”)