24 August 2016 - Aditya Ambasth

18 2 Term Document Incidence Matrices Stanford NLP Professor Dan Jurafsky & Chris Manning You

hello in this section I'm going to

introduce the important idea of a term document matrix but also I'm going to explain why it isn't actually a practical data structure for an information retrieval system or take as our example doing information retrieval over the works of William Shakespeare so let's suppose we have this concrete question which plays of Shakespeare contained the words Brutus and Caesar but not Calpurnia well if you're starting from a very basic level of text searching commands the first way that you'd think about to solve this problem is by using searching through the text of the documents exhaustively what's known in the UNIX world as grepping and so we could kind of first of all grep for plays that contain Brutus and Caesar and then if you know your grep command well you can give a flag for files that do not match and you could get out the ones that don't contain Calpurnia now these days for works of the size of William Shakespeare for this kind of query grepping is a perfectly satisfactory solution our DES strives and computers are sufficiently fast that you would use this method and it takes no time at all to find the answer but

nevertheless that isn't a good answer for the full information retrieval problem it falls flat in a number of ways once your corpus becomes large and so that means something like everything on your hard disk or even more so the world wide web we can't afford to do a linear scan through all our documents every time we have a query then some parts but like the not part become less trivial to implement the just finding things but even more so than the not part will have more complex queries like finding uses of the word Romans near countryman and we can't do that with a grep command but even more than that the big thing that's happened in information retrieval is the idea of ranking finding the best documents to return for query and that's something that we just can't get out of the linear scan model of finding things that match and we'll talk about all of these issues in the way they're handled in modern information retrieval systems in later lectures but let's first go to this idea of a term document matrix so what we do in the term document matrix is that we have the rows of the matrix our words or often they're also called

in information retrieval the terms and then the columns of the matrix are our documents and what we're doing here is a very simple thing we're simply saying let's fill in every cell in this boolean matrix by whether the word appears in the play or not so Anthony appears in Antony and Cleopatra but Calpurnia does not appear in Antony and Cleopatra so this matrix represents the appearance of words and documents and if we have this matrix it's straightforward to then answer boolean queries such as our example before queries for documents that contain Brutus and Caesar but not Calpurnia let's just go through concretely how we do that so what we're going to do is we're going to take the vectors for the terms and the query and then we're going to put them together with boolean operations so first of all we can take out the row that is referring to Brutus it goes up here then we can take the row for Caesar and end it there and then finally we can take the row for Calpurnia complement it and then stick it down here so Calpurnia only appears in julius caesar and so we've complemented it to a vector where everything is 1 apart from Julius Caesar

and at that point we can just add those three vectors together and our answer is this one of one zero zero one zero zero so we've been able to do information retrieval successfully and can tell that these this query is satisfied by the documents Antony and Cleopatra and Hamlet and indeed we can then go off to the document collection and confirm that that is the case so here we are so in Antony and Cleopatra when Antony found Julius Caesar dead he cried almost roaring and he wept when at Philippi he found Brutus slain and similarly refined both words occurring in Hamlet okay so that suggests that we could do information retrieval simply by working with this term document matrix so an important thing to realize is that that doesn't really work once we go to sensible sized collections and so let's just go through that for a minute let's go through a sensible size but still small collection so suppose that we have 1 million documents and we'll often use in to refer to the number of documents each of which is a on average a thousand words long ok so what does that mean in terms of the size of our document collection and in terms of the size of

our matrix so if we have an average of 6 bytes per word including spaces and punctuation the amount of data we're talking about here 6 gigabytes so that's a teeny fraction of one modern hard disk in your laptop but let us then suppose we try and work out how many distinct terms there are in our document collection and we need to know the number of distinct terms because that corresponds to the number of columns now matrix and let's suppose they're about half a million that would be a typical number for a million documents and so often refer to this number of different terms as M well what does that mean well what it means is that even with that size document collection we can't build this term document matrix because we'll have 500,000 rows and a million columns and that's half a trillion zeroes and ones it's already huge and probably bigger than we have space to store and the document collection gets bigger than a million documents things are just going to get worse but this is really important observation which is although the matrix here had half a trillion zeroes and ones in it that actually almost all of the entries are zero that

the document has at most 1 billion ones and it'd be good for you guys to stop and think for a fraction of a second why is it that there at most 1 billion ones and the answer to that is well if we have 1 million documents and the average document is a thousand were as long as we said last time then the actual number of word tokens is only 1 billion so even if we assume that every word and every document were different we could only have at most 1 billion 1 entries and most likely we have far less than that because we'll have common words like that out of and - occurring many many times in each document and so therefore the key observation is the matrix we're dealing with is very very very sparse and so the central question in the design of information retrieval data structures is taking advantage of that sparsity and coming up with a better data representation and the secret of doing that and having the efficient storage mechanism is we want to only record the positions that hold a one and not the positions that hold a zero ok so I hope that's given you an understanding of the term document matrix it's an important conceptual data structure that

we keep on coming back to again and again when we talk about various kinds of algorithms we think about them in terms of that matrix as you'll see but when we actually come to doing storage and computer systems we can also see that we never actually want to store and their information retrieval representation in that form

XML Transcript: