Bag-o-words: What's that and how do I make one?

2022-02-09 21:18. Keywords: corpus linguistics, python, tutorial, linguistic analysis

Over the last few weeks I have been asked several times to explain what a bag of words is, so in this post I'll present the basic idea behind it and show how to make one. Let's get started.

What's a Bag of Words?

The Bag of Words idea refers to a particular analytical approach to language data, whereby textual data is represented by a list of words and the number of occurrences of each word. Word order and grammar are thrown away; only the counts remain. Metaphorically reaching into this bag, you can already tell a lot about the text and its author. For instance, the themes in the text tend to show up as frequently repeated (sets of) words. Or maybe you want to know how varied the author's or speaker's vocabulary is. It's easy to find out with the bag-o-words technique.

There's a Bag of Words model entry on Wikipedia with a basic example. Consider the text:

John likes to watch movies. Mary likes movies too.

As a bag of words, the example looks like this in JSON format:

{"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}
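To see where those numbers come from, here is a minimal sketch that counts the words in the example sentence. Trimming punctuation with string.punctuation is my assumption here, so that "movies." and "movies" land on the same key; the variable names are only illustrative.

```python
import json
import string

text = "John likes to watch movies. Mary likes movies too."

bag = {}
for token in text.split():
    # Assumption: trim leading/trailing punctuation so "movies." counts as "movies"
    word = token.strip(string.punctuation)
    if word:
        bag[word] = bag.get(word, 0) + 1

# Compact separators give the same one-line JSON layout as the example above
print(json.dumps(bag, separators=(",", ":")))
```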

How to make a Bag of Words?

I suppose there are several ways to compile a corpus, or set of texts, into a bag of words. For anything longer than a few sentences, though, the most efficient way is to let a script do the counting for you. In this example, I'll show you one way to do it in Python.

First, we must establish the basic infrastructure of our project. (Actually, it's a good idea to design the infrastructure for a project at the very beginning, before you start compiling data and working on analyses.) To keep things simple, we will have a bag-o-words/ directory, inside which will live a corpus/ directory and our mk_bag-o-words.py script. All the text files in our corpus live inside the corpus/ directory. I put some random text files with Lorem ipsum in the corpus/ folder for the purpose of this walkthrough. There's a whole working example on my GitHub page.
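If you'd like to reproduce that layout quickly, the following sketch creates a corpus/ directory with two placeholder files. The function name, the file names and their contents are all hypothetical; any plain text files will do.

```python
from pathlib import Path

def make_sample_project(root):
    """Create root/corpus/ containing two small placeholder text files."""
    corpus_dir = Path(root) / "corpus"
    corpus_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names and contents, just to have something to count
    samples = {
        "sample1.txt": "Lorem ipsum dolor sit amet\n",
        "sample2.txt": "Donec nec justo eget felis\n",
    }
    for name, text in samples.items():
        (corpus_dir / name).write_text(text, encoding="utf-8")
    return corpus_dir

# Example call: make_sample_project("bag-o-words")
```

The mk_bag-o-words.py script then sits next to corpus/ inside bag-o-words/ and is run from that directory, because it refers to the corpus by the relative path 'corpus'.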

Here's the code for the script first. Have a look and then we'll go through it line by line.

	
#!/usr/bin/env python3
import os, json

def main():
    ### list all files in the corpus dir
    corpus_files = os.listdir('corpus')
    ### initialize empty dictionary 
    bag = {}

    ### create bag of words
    for corpus_file in corpus_files:
        with open(f'corpus/{corpus_file}', 'r') as txtfile:
            txt_lines = txtfile.readlines()
            for line in txt_lines:
                line = line.strip()
                words = line.split(' ')
                for word in words:
                    if word not in bag:
                        bag[word] = 1
                    else:
                        bag[word] += 1

    ### sort dict by val
    bag = {k: v for k, v in sorted(bag.items(), key=lambda item: item[1], reverse=True)}

    ### dump dict as JSON obj to a file
    with open('bag-o-words.json', 'w+') as outfile:
        json.dump(bag, outfile, indent=4)
        

if __name__ == '__main__':
    main()

So, at this point, if you're used to reading Python code, you can probably make sense of it. But let's go through it section by section.

The script uses two modules, both from the standard library, so there's nothing to install. os lets us work with the file system on your computer: we have to list the contents of the corpus/ directory. We'll also use the json module to dump our ranked list of words, the bag-o-words, to a file.

Inside our main function:

	
def main():
    ### list all files in the corpus dir
    corpus_files = os.listdir('corpus')
    ### initialize empty dictionary 
    bag = {}

...we first need to make a list of all files in the corpus/ directory. Then we initialize an empty dictionary; the dictionary will become our bag-o-words.
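One thing to keep in mind: os.listdir returns every entry in the directory, subdirectories and hidden files included, in arbitrary order. If your corpus folder holds anything besides text files, a filter helps. Here is a minimal sketch, assuming the corpus files end in .txt; list_corpus_files is a hypothetical helper, not part of the script above.

```python
import os

def list_corpus_files(corpus_dir="corpus", extension=".txt"):
    """Return the sorted names of regular files in corpus_dir ending in extension."""
    names = []
    for name in os.listdir(corpus_dir):
        path = os.path.join(corpus_dir, name)
        # Skip subdirectories and anything without the expected extension
        if os.path.isfile(path) and name.endswith(extension):
            names.append(name)
    return sorted(names)
```

Sorting the names doesn't change the counts, but it makes the run order predictable from one machine to the next.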

The next bit is the most involved...

			
    ### create bag of words
    for corpus_file in corpus_files:
        with open(f'corpus/{corpus_file}', 'r') as txtfile:
            txt_lines = txtfile.readlines()
            for line in txt_lines:
                line = line.strip()
                words = line.split(' ')
                for word in words:
                    if word not in bag:
                        bag[word] = 1
                    else:
                        bag[word] += 1

...so, step by step: we go through the list of files in the corpus one by one, and for each, we open the file for reading. Then we read the file's lines into a list and iterate through each line of text. strip() removes leading and trailing whitespace, including the end-of-line return, and split(' ') separates each line into a list of words. Notice the space in (' '): we're using a single space to delimit words. Punctuation is not removed, so it stays attached to the word it follows, which is why keys like "mus." show up in the output. A blank line or a double space also produces an empty string, which gets counted like any other word. Finally, for each word we check whether it's in the bag dictionary. If it's not, we add the word as a key to the dict with 1 as the value. If the word is already in the dict, we increment (+=) its count by 1.
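If you want "cursus." and "cursus" to land on the same key, the tokenizing step can trim punctuation explicitly. Here is a minimal sketch; tokenize and count_words are hypothetical helpers, and lowercasing is an optional assumption of mine, not something the script above does.

```python
import string

def tokenize(line, lowercase=True):
    """Split a line on whitespace and trim punctuation from both ends of each token."""
    words = []
    # split() without an argument splits on any run of whitespace
    # and never returns empty strings
    for token in line.split():
        word = token.strip(string.punctuation)
        if lowercase:
            word = word.lower()
        if word:  # a token such as "--" is empty after trimming
            words.append(word)
    return words

def count_words(lines):
    """Build a bag of words (dict of word -> count) from an iterable of lines."""
    bag = {}
    for line in lines:
        for word in tokenize(line):
            bag[word] = bag.get(word, 0) + 1
    return bag
```

Whether to lowercase or trim punctuation is an analytical decision: merging "Donec" and "donec" makes sense for frequency counts, but it also erases sentence-initial capitals and proper nouns.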

After this code block completes, we technically have our bag-o-words already. However, for easy human readability, we probably want to sort the dictionary by its values: in other words, we want the words that appear most often at the top of the list.

	
    ### sort dict by val
    bag = {k: v for k, v in sorted(bag.items(), key=lambda item: item[1], reverse=True)}

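This takes a little dictionary comprehension magic: sorted() orders the (word, count) pairs from bag.items() by count, highest first, and the comprehension rebuilds a dictionary from them. Since Python 3.7, dictionaries keep their insertion order, so the ranking survives.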
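To make the sorting step concrete, here is a small sketch on a toy bag. It also shows collections.Counter from the standard library, whose most_common method is an alternative way to get a ranking; the original script doesn't use it, so treat it as an option rather than the method of this post.

```python
from collections import Counter

bag = {"to": 1, "likes": 2, "John": 1, "movies": 2}

# sorted() returns a list of (word, count) pairs, highest count first;
# words with equal counts keep the order they had in the dict
pairs = sorted(bag.items(), key=lambda item: item[1], reverse=True)

# dict() keeps the order of the pairs, just like the comprehension does
ranked = dict(pairs)
print(ranked)

# Alternative: Counter can return the n most frequent words directly
top_two = Counter(bag).most_common(2)
print(top_two)
```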

And finally, we want to dump our sorted bag to a file:

	
    ### dump dict as JSON obj to a file
    with open('bag-o-words.json', 'w+') as outfile:
        json.dump(bag, outfile, indent=4)

So first we open a file for writing, then json.dump our bag to the file. The indent parameter pretty-prints the output with line breaks and four-space indentation. The output file bag-o-words.json will look like this:

	
{
    "nec": 18,
    "et": 17,
    "Donec": 16,
    "at": 15,
    "eu": 15,
	...
    "ridiculus": 1,
    "mus.": 1,
    "cursus.": 1,
    "tempor,": 1,
    "venenatis.": 1
}
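A side note if your corpus isn't Latin filler text: by default the json module escapes non-ASCII characters, so a word like "café" is written as "caf\u00e9". The sketch below compares the default with ensure_ascii=False on a made-up two-word bag; both variants load back to the same dictionary.

```python
import json

# Made-up bag with non-ASCII words, for illustration only
bag = {"café": 2, "naïve": 1}

escaped = json.dumps(bag, indent=4)                       # default: \uXXXX escapes
readable = json.dumps(bag, indent=4, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)
```

If you use ensure_ascii=False with json.dump, it is safest to open the output file with encoding='utf-8' so the result doesn't depend on your system's default encoding.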

And there it is. With that you have a basic template for creating a bag-o-words from a corpus.
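Coming back to the idea of reaching into the bag: once you have the counts, questions about themes and vocabulary variety become simple arithmetic. The sketch below pulls out the most frequent words and computes the type/token ratio (distinct words divided by total words), which is one common rough measure of how varied a vocabulary is. summarize is a hypothetical helper, and the bag is the Wikipedia example from earlier.

```python
def summarize(bag, top_n=3):
    """Return the top_n most frequent words and the type/token ratio of a bag."""
    tokens = sum(bag.values())  # total number of word occurrences
    types = len(bag)            # number of distinct words
    top = sorted(bag.items(), key=lambda item: item[1], reverse=True)[:top_n]
    ratio = types / tokens if tokens else 0.0
    return top, ratio

bag = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
top, ratio = summarize(bag, top_n=2)
print(top, round(ratio, 2))
```

To run this on your own corpus, load the output file with json.load and pass the resulting dictionary to the function.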
