TechnologySeptember 24, 2017

# Gremlin Recipes: 8 – Sack() Operator

###### Duy Hai Doan

Part 8 of 10 for the series Gremlin Recipes. The purpose is to explain the internal of Gremlin and give people a deeper insight into the query language to master it.

This blog post is the 8th from the series Gremlin Recipes. It is recommended to read the previous blog posts first:

## I KillrVideo dataset

To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data. See here for more details.

The graph schema of this dataset is :

## II Introduction to `sack()`

The `sack()` step was introduced recently to help people solving some problem more efficiently.

In some case people are writing Gremlin traversals that make use of data aggregation using path information. Typically, people will use `path()` and then perform “reduction” of the data in the path to get the specific result.

Unfortunately, this is inefficient as path calculations are expensive, not mergable. The same ca be now achieved using `sack()`.

A sack is a local datastructure relative to each traverser, unlike global datastructure as when using `aggregate()/store()/group(x)/subgraph(x)....` It mean that each traverser is now equipped with his own sack. A sack can contain any type of data structure:

• primitives: Int, Double, Long, …
• collections: List/Set/Map\

### A. Defining the sack

The definition of this data structure should be placed right after the traversal object:

```gremlin>g.withSack(1.0).            // Define a sack containing an initial double value V(). ... ```

```gremlin>g.withSack([] as Set).      // Define a sack containing an initial empty Set V(). ... ```

```gremlin>g.withSack([1,2,3] as Set). // Define a sack containing a pre-initialized Set V(). ... ```

```gremlin>g.withSack([] as List).     // Define a sack containing an initial empty List V(). ... ```

```gremlin>g.withSack([:]).             // Define a sack containing an initial empty Map V(). ... ```

```gremlin>g.withSack([:].              // Define a sack containing an initial empty Map withDefault{key -> [] as List}). // with default values being an empty list V().```

`... `

The complete signature of the `withSack()` step is `withSack(default_data_structure, split_operator, merge_operator)`

• split operator is used to split the data structure whenever traverser is facing a branch and needs to duplicate itself. Example: `V().has("user","userId", "u861").out("rated")`. From the User vertex now there are as many traversers as there are ratings to different movie. Thus the sack structure at User vertex get duplicated
• merge operator is used to merge different sack data structures together whenever all traversers are converging to a single point. Example: `g.V().hasLabel("movie").out("director").has("name", "Woody Allen")`. The traversers from many Movie vertices are converging to the single Person vertex which is Woody Allen. The sack at vertex Woody Allen is the merge of all sack data structures coming from his movies. If you do not define any merge operator, the sacks cannot be merged

Note: for primitive data type, split operator == value copy. For other data types like collection, if you don’t define any split operator, the same object reference will be copied to all branching traversers. Consequently all those traversers will shared the same data structure. Be very wary of this

Some example of defining collection as data structure for sack:

```// Define a sack with sum() as merge operator // The split operator is implicitly copy-by-value for primitives gremlin>g.withSack(1.0, sum).    V().    ... ```

```// Define a sack with initial empty Set // Split function == Set::clone() // Merge function == Set::appendAll() g.withSack([] as Set){set-> set.clone()}{setA,setB -> setA.appendAll(setB); setA}.    V()    ... ```

```// Define a sack with initial empty Map // Split function == Map::clone() // Merge function == Map::putAll() gremlin>g.withSack([:]){it.clone}{mapA,mapB -> mapA.putAll(mapB); mapA}.    V().    ... ```

### B. Mutating the sack

To mutate the content of the sack, you can use the step `sack(bi-function).by(projection_axis)`.

• bi-function is a function with signature: `DATA_STRUCTURE apply(DATA_STRUCTURE current, X projected_value)`
• DATA_STRUCTURE is the type of the data structure inside the sack (primitives, collections, …) initially defined with `g.withSack(DATA_STRUCTURE)`
• X is the type of the projected value on the current vertex/edge using the modulator `by()`
• projection_axis is the property/inner traversal result applied on a vertex/edge

### C. Accessing the sack

The sack data structure can be accessed at any time by calling the step `sack()`. The data type returned by this step is `Iterator`

## III Usage of `sack()`

```gremlin>g.withSack(0).V().    has("movie", "title", "Blade Runner").    // Iterator    inE("rated").                             // Iterator       sack(sum).by("rating").                // iterator.sideEffect(sack -> sack.get() + values("rating"))    outV().                                   // Iterator    sack()                                    // Iterator ==>6 ==>5 ==>8 ==>8 ==>8 ==>8 ==>9 ... ```

In the above traversal the interesting step is `inE("rated").sack(sum).by("rating").outV()`. On each edge “rated”, we add the value “rating” into the sack value. Please note the usage of `by("rating")` acting as projection for the `sack()`. Sum is the bi-function `Int apply(Int current, Int projected_value)`

When the traversal reaches step `outV()` there are as many traversers as there are users rating Blade Runner and each of those traversers carries a sack containing the value of the rating. When calling `sack()` the content of each sack is extracted and therefor we have an `Iterator`

Computing Blade Runner average rating with `sack()` is really easy:

```gremlin>g.withSack(0).V().    has("movie", "title", "Blade Runner").    // Iterator    inE("rated").                             // Iterator       sack(sum).by("rating").                // iterator.sideEffect(sack -> sack.get() + values("rating"))    outV().                                   // Iterator    sack().                                   // Iterator    mean() ==>8.20353982300885 ```

Now let’s see how we can use `sack()` for useful purposes. First you need to import the schema of London Tube as described in this previous post.

Then switch g alias to London_Tube

`gremlin>:remote config alias g London_Tube.g`

`==>g=London_Tube.g`

One of the goal in this blog post was to find the shortest path between South Kensington and Marble Arch tube stations. Shortest path can mean either least stations count or least line changes.

### Shortest path by least stations count

```gremlin>g.withSack(0).V().    has("station", "name", "South Kensington"). // Iterator    emit().    repeat(timeLimit(400).       bothE("connectedTo").       otherV().       sack(sum).by(constant(1)). // +1       simplePath()).    has("name", "Marble Arch").    order(). // Order each journey       by(sack(),incr). // by the sack value, ASCENDING    limit(1).    path().by(       choose(label).       option("station", values("name")).       option("connectedTo", constant("-->"))) ==>[South Kensington, -->, Knightsbridge, -->, Hyde Park Corner, -->, Green Park, -->, Bond Street, -->, Marble Arch] ```

The interesting pieces are:

• `sack(sum).by(constant(1))`: for each traversed station, we increment the counter by +1
• `order().by(sack(),incr)`: order each journey by the counter in each sack, ASCENDING order

### Shortest path by least line changes

```gremlin>g.withSack([] as Set){it.clone()}{a,b -> a.addAll(b); a}.    V().    has("station", "name", "South Kensington"). // Iterator    emit().    repeat(timeLimit(400).       bothE("connectedTo").       sack{set,val -> set.add(val); set}.by("line").       otherV().       simplePath()).    has("name", "Marble Arch").    order(). // Order each journey by(sack().count(local),incr). // by the sack number of element, ASCENDING    limit(1).    path().by(       choose(label).       option("station", values("name")).       option("connectedTo", constant("-->"))) ==>[South Kensington, -->, Gloucester Road, -->, High Street Kensington, -->, Notting Hill Gate, -->, Queensway, -->, Lancaster Gate, -->, Marble Arch] ```

Let’s dig into the traversal:

• `withSack([] as Set)`: defines a sack with an empty Set
• `{it.clone()}`: defines the split operator for the set, which is just cloning the set itself
• `{a,b -> a.addAll(b); a}`: defines the merge operator for the set, which is just appending all contents of the sets together
• Please notice the Groovy closure syntax to pass lambdas as arguments to the `withSack()` method. More details about this syntax is given here
• `sack{set,val -> set.add(val); set}.by("line")`: on each traversed “connectedTo” edge, we add the “line” property to our set inside the sack
• `order().by(sack().count(local),incr)`: we order the journeys by the number of lines inside each sack. Please notice the usage of `sack().count(local)`. Since the sack() type is now an `Iterator>` we need the local keyword to count the number of element in the nested set

And that’s all folks! Do not miss the other Gremlin recipes in this series.

If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai

Gremlin

## III Usage of sack()

### Shortest path by least line changes

#### One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.