MongoDB

Introduction

Data analysis is an important part of many applications. MongoDB provides several tools to run analytics natively using the aggregation framework.

The aggregation framework is based on the concept of pipeline. The concept is that you take input from a MongoDB collection and the documents from this collection go through a series of stages. Each stage performs a operation on its inputs. Each stage’s input is the output of the precedent stage. All inputs and outputs are documents.

If you are familiar with Javascript, the concept is the same when you chain methods. You may have encountered something like this:

const myData = myArray.map( (item, i) => {
    //do something special
}).filter( (item, i) => {
    //We filter in this stage
}).reduce( (item, i) => {
    //I need to reduce in this stage
})

In the example above, we have three stages. A map(), a filter() and a reduce(). The first stage ( map ) takes myArray as input. The output of this method becomes the input of my next method ( filter ). So on and so forth, until I finally don’t need to transform my data and have what I need.

But in MongoDB, every input and output will be documents. At the end, we have access to the output, the same way we would after a query.

So, in short, the aggregation framework works with pipelines. Pipelines work with a MongoDB collection. Pipelines are composed of stages. Each stage executes a data processing task on its input and produces documents as its output. This output is passed to the next stage.

So, now we now what it is, how do we use it?

Getting started with familiar operations

We will use a restaurants collection provided by MongoDB. You can download it here. Save it to a file.

Next, you need to import data into the collection. We will use mongoimport to achieve this. Open a terminal window and enter this command:

mongoimport --db test --collection restaurants --drop --file PATH/TO/JSON/FILE

Make sure you have a mongod instance running already. This command will import our data in the test database, inside the restaurants collection. Feel free to rename those if you want.

We have our data in place. We will start with some basic operations like the match, project, sort, skip and limit stages.

How our documents are organized

In our restaurants collection, documents have the following fields:

  • _id field ( mandatory in all MongoDB documents)
  • name field ( String )
  • cuisine ( String )
  • stars ( Integer )
  • address ( embedded document with the following fields:
    • street ( String )
    • city ( String )
    • state ( String )
    • zipcode ( String )
      )

$match

Let’s start with a filter to look for all restaurants with a Russian cuisine:

>db.restaurants.aggregate([
    { $match: { cuisine: 'Russian' }}
])

{ "_id" : ObjectId("5977aaed3abbae8aef525fdc"), "name" : "Abe Fisher", "cuisine" : "Russian", "stars" : 1.2, "address" : { "street" : "828 Jaduv Boulevard", "city" : "Hiburu", "state" : "TX", "zipcode" : "00004" } }

{ "_id" : ObjectId("5977aaef3abbae8aef53e606"), "name" : "Abe Fisher", "cuisine" : "Russian", "stars" : 2.4, "address" : { "street" : "463 Zujew Street", "city" : "Peamazut", "state" : "NH", "zipcode" : "00304" } }

{ "_id" : ObjectId("5977aadc3abbae8aef4715f1"), "name" : "Abe Fisher", "cuisine" : "Russian", "stars" : 4.5, "address" : { "street" : "105 Buwza Heights", "city" : "Asovikufu", "state" : "MA", "zipcode" : "00488" } }

{ "_id" : ObjectId("5977aada3abbae8aef45bf79"), "name" : "Abe Fisher", "cuisine" : "Russian", "stars" : 3.5, "address" : { "street" : "1774 Emcek Key", "city" : "Mossofnat", "state" : "NY", "zipcode" : "02312" } }

{ "_id" : ObjectId("5977aadd3abbae8aef47bf64"), "name" : "Abe Fisher", "cuisine" : "Russian", "stars" : 0.4, "address" : { "street" : "1172 Kohis Road", "city" : "Jihnahac", "state" : "MS", "zipcode" : "02585" } }

{ "_id" : ObjectId("5977aae33abbae8aef4b8cbf"), "name" : "Abe Fisher", "cuisine" : "Russian", "stars" : 4.2, "address" : { "street" : "1582 Ogujab Pass", "city" : "Vihjuani", "state" : "MN", "zipcode" : "03000" } }

{ "_id" : ObjectId("5977aadc3abbae8aef46e0e3"), "name" : "Abe Fisher", "cuisine" : "Russian", "stars" : 3.6, "address" : { "street" : "1076 Cugvu Glen", "city" : "Ittupa", "state" : "NH", "zipcode" : "03726" } }

This is the equivalent of running this query :

db.restaurants.find({cuisine: 'Russian'})

You can notice that we use the aggregate method. This is the method used to run an aggregation query. In order to aggregate, we need to pass an aggregation pipeline. A pipeline is an array of documents. Each document specify what process we want our data to go through.

Let’s add a second stage in our aggregation query.

$project

The $project stage allows us to control the output. We can tell to MongoDB which fields we would like to see or not.

db.restaurants.aggregate([
    { $match: { cuisine: 'Russian'},
    { $project: {
        _id: 0,
        name: 1,
        stars: 1
    }}
    }
])

{ "name" : "Abe Fisher", "stars" : 1.2 }
{ "name" : "Abe Fisher", "stars" : 2.4 }
{ "name" : "Abe Fisher", "stars" : 4.5 }
{ "name" : "Abe Fisher", "stars" : 3.5 }
{ "name" : "Abe Fisher", "stars" : 0.4 }
{ "name" : "Abe Fisher", "stars" : 4.2 }
{ "name" : "Abe Fisher", "stars" : 3.6 }

...

My second stage tells MongoDB to output only two fields, name and stars. I also explicitly excluded _id. By default, the _id field will be displayed is you don’t set it to 0.

The $match stage filters against the collection and passes the resulting documents to the $project stage one at a time. $project performs its operation by reshaping the fields accordingly and passes the output out of the pipeline and back to me.

Let’s add an additional stage.

$limit

The $limit stage will limit the number of results returned by the pipeline. Like so:

> db.restaurants.aggregate([
    { $match: {cuisine: 'Russian'}},
    { $limit: 3 },
    { $project: {_id: 0, name: 1, stars: 1}} 
    ])

{ "name" : "Abe Fisher", "stars" : 1.2 }
{ "name" : "Abe Fisher", "stars" : 2.4 }
{ "name" : "Abe Fisher", "stars" : 4.5 }

With $limit, I only return the first three documents.

Note: I put the $limit stage before the $project stage. If I put the limit stage last, it would mean that the project stage would have to process data on documents that I would not return anyway. By putting my stages in this order, the project stage only process three documents.

Some more examples

Here are a few random examples of tasks that you can accomplish:

  • In this example, I want to rename the stars field to rating. I also want to have the restaurant’s name in uppercase and I want the documents to be sorted by ratings.
> db.restaurants.aggregate([ 
    {$project: {restaurant_name: {$toUpper: "$name"}, _id: 0, ratings: "$stars"}}, 
    { $sort: { ratings : -1}}, 
    { $limit: 5 } ])

{ "restaurant_name" : "MODERN LOVE", "ratings" : 5 }
{ "restaurant_name" : "LAUBERGE CHEZ FRANCOIS", "ratings" : 5 }
{ "restaurant_name" : "DAI DUE", "ratings" : 5 }
{ "restaurant_name" : "FRANKLINVILLE INN", "ratings" : 5 }
{ "restaurant_name" : "LAUREL", "ratings" : 5 }

Two fields returned: restaurant_name in uppercase and ratings in descending order. I used the $toUpper operator to transform the names. The $project stage created a new field called restaurant_name and gave it the uppercase value of name. Same for ratings with the field stars. Notice that we prefix the value with a $ to tell MongoDB that we are talking about another field’s value.

Finally, let’s use the $group stage. In this example, we will group the restaurants by their cuisine types. Every time we encounter a certain cuisine, we will increase a counter by 1. At the end, we will know the most frequent cuisine types in our collection:

> db.restaurants.aggregate([ 
    { $project: { _id: 0, cuisine: 1}}, 
    { $group: { _id: "$cuisine", count: { $sum: 1}} }, 
    { $sort: { count: -1}}, 
    { $limit: 5} ])

{ "_id" : "American", "count" : 6183 }
{ "_id" : "Chinese", "count" : 2418 }
{ "_id" : "Café/Coffee/Tea", "count" : 1214 }
{ "_id" : "Pizza", "count" : 1163 }
{ "_id" : "Italian", "count" : 1069 }

In the project stage, we only keep what we need for our next stages, the cuisine field. Then, we group these results in documents where the _id field will be the particular cuisine and the count field will calculate the number of times we encounter that cuisine.

Next, we sort those results to have the highest count first and I limit the results to 5. In our collection, we mostly have American cuisine ( whatever that is ) with 6183 documents then Chinese with 2418 documents…

This is a rather quick introduction of what the aggregation framework allows you to do. There is of course a lot more possibilities. If you are curious about it, you can read more on the MongoDB docs

Feel free to share and comment.

Have a nice day!

Read more