Duy Hai Doan

Part 5 of 10 of the Gremlin Recipes series. The purpose is to explain the internals of <a href="http://tinkerpop.apache.org/docs/3.2.5/reference/#traversal">Gremlin</a> and give readers deeper insight into the query language so they can master it.

This blog post is the 5th from the series Gremlin Recipes. It is recommended to read the previous blog posts first:

<ol>
	<li><a href="https://academy.datastax.com/content/gremlin-traversals">Gremlin as a Stream</a></li>
	<li><a href="https://academy.datastax.com/content/sql-tpo-p">SQL to Gremlin</a></li>
	<li><a href="https://academy.datastax.com/content/gremlin-recipes-3-%E2%80%93-recommendation-engine-traversal">Recommendation Engine traversal</a></li>
	<li><a href="https://academy.datastax.com/content/gremlin-recipes-4-%E2%80%93-recursive-traversals-1">Recursive traversals</a></li>
</ol>

<h2>I London_Tube dataset</h2>

 
To follow this series of recipes, you first need to create the schema for London_Tube and import the data.

The graph schema of this dataset is:

<img alt="London Tube Schema" data-entity-type="file" data-entity-uuid="bff53a0a-d894-436b-9e37-bafa21667184" src="https://www.datastax.com/sites/default/files/inline-images/LondonTube_Schema-1024x327.png" />

LondonTube_Schema

The schema is pretty simple: we have a single vertex label, station, with the following properties:

<ul>
	<li>name: the station name</li>
	<li>lines: a collection of tube lines to which this station belongs</li>
	<li>is_rail: whether the station belongs to the railway system</li>
	<li>zone: the zone to which the station belongs. It is not an integer, since a zone such as 2.5 can exist …</li>
</ul>

INSERTING DATA 
First, open the <a href="http://tinkerpop.apache.org/docs/3.1.1-incubating/tutorials/the-gremlin-console/">Gremlin console</a> or <a href="https://www.datastax.com/dev#DataStax-Studio">DataStax Studio</a> (either works) and execute the following statements:

Open-source Gremlin Console Config

<code>:remote connect tinkerpop.server conf/remote.yaml session-managed 
:remote config timeout max 
:remote console 
system.graph('London_Tube').ifNotExists().create() 
:remote config alias g London_Tube.g</code>

London_Tube schema &amp; data loading&nbsp;

<code>schema.clear(); 
schema.propertyKey("zone").Double().single().create(); 
schema.propertyKey("line").Text().single().create(); 
schema.propertyKey("name").Text().single().create(); 
schema.propertyKey("id").Int().single().create(); 
schema.propertyKey("is_rail").Boolean().single().create(); 
schema.propertyKey("lines").Text().multiple().create(); 
schema.edgeLabel("connectedTo").multiple().properties("line").create(); 
schema.vertexLabel("station").partitionKey("id").properties("name", "zone", "is_rail", "lines").create(); 
schema.vertexLabel("station").index("search").search().by("name").asText().by("zone").by("is_rail").by("lines").asText().add(); 
schema.vertexLabel("station").index("stationByName").materialized().by("name").add(); 
schema.vertexLabel("station").index("toStationByLine").outE("connectedTo").by("line").add(); 
schema.vertexLabel("station").index("fromStationByLine").inE("connectedTo").by("line").add(); 
schema.edgeLabel("connectedTo").connection("station", "station").add(); 
&nbsp; 
schema.config().option("tx_autostart").set(true); 
&nbsp; 
// Load data from file london_tube.gryo 
graph.io(IoCore.gryo()).readGraph("/path/to/london_tube.gryo");</code>

The file london_tube.gryo can be downloaded <a href="https://drive.google.com/open?id=0B3qV2Nx-GibgU0pxQS1SbXF0STA">here</a>

<h2>II Path object</h2>

In this post we will explore the Gremlin path object. Let’s say we want to know the “path” between the station South Kensington and each of its neighbouring stations. First, let’s create a classical traversal:

<code>gremlin&gt;g.V(). 
&nbsp; has("station", "name", "South Kensington").&nbsp; 
&nbsp; union(identity(), both("connectedTo"))</code>

Please notice the usage of <code>both("connectedTo")</code>. The direction of the connection between 2 stations does not matter to us, but since edges in a property graph are always directed, we need <code>both()</code> to follow them in either direction.

The step <code>union(identity(), both("connectedTo"))</code> outputs the original station South Kensington alongside its neighbours.

<img alt="SouthKensington_neighbours" data-entity-type="file" data-entity-uuid="22549373-5e4d-4163-99cc-239458707d99" src="https://www.datastax.com/sites/default/files/inline-images/SouthKensington_neighbours-1024x396.png" />

SouthKensington_neighbours

So what is a “path” in Gremlin? According to the JavaDoc:

A Path denotes a particular walk through a Graph as defined by a Traversal. In abstraction, any Path implementation maintains two lists: a list of sets of labels and a list of objects. The list of labels are the labels of the steps traversed. The list of objects are the objects traversed.

In a nutshell, the path object can be iterated like an <code>Iterator&lt;Object&gt;</code> (the <code>Path</code> interface extends <code>Iterable&lt;Object&gt;</code>), where <code>Object</code> can be any of:

<ul>
	<li>all labels created on the traversal (using the modulator <code>as("label")</code>)</li>
	<li>all vertices visited by the traversal</li>
	<li>all edges visited by the traversal</li>
	<li>all side-effects or data structures created during the traversal</li>
</ul>
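The "two lists" idea from the JavaDoc can be pictured with a small plain-Python model. This is only a sketch to make the data structure concrete: <code>ToyPath</code>, <code>extend</code> and <code>labelled</code> are hypothetical names invented for this illustration and are not part of the TinkerPop API.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, Iterator, List, Tuple


@dataclass(frozen=True)
class ToyPath:
    """Toy model of a path: a list of objects and a parallel list of label sets."""
    objects: Tuple[Any, ...] = ()
    labels: Tuple[FrozenSet[str], ...] = ()

    def extend(self, obj: Any, *step_labels: str) -> "ToyPath":
        # Each traversed object gets its own (possibly empty) set of step labels.
        return ToyPath(self.objects + (obj,),
                       self.labels + (frozenset(step_labels),))

    def __iter__(self) -> Iterator[Any]:
        # Iterating a path walks the traversed objects in order.
        return iter(self.objects)

    def labelled(self, label: str) -> List[Any]:
        # All objects whose step carried the given label.
        return [o for o, ls in zip(self.objects, self.labels) if label in ls]


# South Kensington labelled "origin", then one hop to a neighbour without label.
path = ToyPath().extend("South Kensington", "origin").extend("Knightsbridge")
print(list(path))
print(path.labelled("origin"))
```

Each hop appends one object and one label set, so the two lists always have the same length.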

To display the path that connects South Kensington to its neighbours:

<code>gremlin&gt;g.V(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; has("station", "name", "South Kensington"). // Iterator&lt;Station&gt; 
&nbsp; both("connectedTo"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Iterator&lt;Station&gt;&nbsp; 
&nbsp; dedup(). 
&nbsp; path() &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Iterator&lt;Path&gt; == Iterator&lt;Iterator&lt;Object&gt;&gt;, Object == Vertex/Edge ... 
==&gt;[v[{~label=station, id=236}], v[{~label=station, id=99}]] 
==&gt;[v[{~label=station, id=236}], v[{~label=station, id=146}]] 
==&gt;[v[{~label=station, id=236}], v[{~label=station, id=229}]]</code>

We can see there are 3 paths that connect South Kensington to its neighbours. To make the display nicer, let’s project each object in the path onto its “name” property so that station names are displayed:

<code>gremlin&gt;g.V(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; has("station", "name", "South Kensington"). // Iterator&lt;Station&gt; 
&nbsp; both("connectedTo"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Iterator&lt;Station&gt;&nbsp; 
&nbsp; dedup(). 
&nbsp; path(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Iterator&lt;Path&gt; == Iterator&lt;Iterator&lt;Object&gt;&gt;, Object == Vertex/Edge ... 
&nbsp; &nbsp; by("name") &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Iterator&lt;Iterator&lt;String&gt;&gt; &nbsp;&nbsp; 
==&gt;[South Kensington, Gloucester Road] 
==&gt;[South Kensington, Knightsbridge] 
==&gt;[South Kensington, Sloane Square]</code>

Now suppose we want Gremlin to find the path between South Kensington and Covent Garden using the Piccadilly line:

<code>gremlin&gt;g.V(). 
&nbsp; has("station", "name", "South Kensington"). &nbsp; 
&nbsp; emit(). 
&nbsp; repeat(timeLimit(200).both("connectedTo"). &nbsp; &nbsp; &nbsp; &nbsp;// Expand the graph on edge "connectedTo" 
&nbsp; &nbsp;filter(bothE("connectedTo"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp; 
&nbsp; &nbsp; &nbsp;has("line",Search.tokenPrefix("Piccadilly"))). // Only retain stations connected by "Piccadilly" 
&nbsp; &nbsp;simplePath()). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Simple path to avoid cyclic loops 
&nbsp; has("name", "Covent Garden"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Only retain target station "Covent Garden" 
&nbsp; path().unfold()</code>

<img alt="South Kensington Covent Garden" data-entity-type="file" data-entity-uuid="53349b8f-4ca6-419d-88bd-8764fc3ce6cf" src="https://www.datastax.com/sites/default/files/inline-images/SouthKensington_CoventGarden-1024x249.png" />

SouthKensington_CoventGarden

We purposely put <code>emit()</code> before the <code>repeat()</code> step so that the original South Kensington station is emitted alongside all the other stations on the journey.

For the <code>repeat()</code> step, instead of limiting the number of times we traverse the connectedTo edge with the step <code>times(x)</code>, we set a time limit with <code>timeLimit(200)</code> to avoid graph explosion.

We only collect stations which belong to the Piccadilly line with the prefix search filtering <code>has("line", Search.tokenPrefix("Piccadilly"))</code>. This predicate is DSE-specific and leverages DSE Search. The filtering is performed on the adjacent “connectedTo” edges, thus the step <code>filter(bothE("connectedTo") ...)</code>.

Finally we filter the target station to match Covent Garden with the step <code>has("name", "Covent Garden")</code>. If we were to stop our traversal at this step, a single station would be displayed: <code>Covent Garden</code>.

What we want is different: all the stations of the journey from South Kensington to Covent Garden, and this is where the <code>path()</code> step comes in handy.

Since we only visit vertices in our traversal, the <code>path()</code> step is of type <code>Iterator&lt;Iterator&lt;Station&gt;&gt;</code>; the outer <code>Iterator</code> represents the <code>Traversal</code> object itself. To access the inner <code>Iterator&lt;Station&gt;</code> we need to use the <code>unfold()</code> step.

Consequently, <code>path().unfold()</code> becomes an <code>Iterator&lt;Station&gt;</code> and displays all stations from South Kensington to Covent Garden.
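To make the "nested iterator, then unfold" idea concrete, here is a minimal plain-Python sketch. It is not Gremlin and does not read london_tube.gryo: the <code>PICCADILLY</code> list is a hand-written stretch of the line used as an assumption, and <code>simple_paths</code> is a hypothetical helper that imitates <code>emit().repeat(both(...).simplePath())</code> on that tiny graph.

```python
from itertools import chain

# Hand-written stretch of the Piccadilly line (assumption for illustration only).
PICCADILLY = ["South Kensington", "Knightsbridge", "Hyde Park Corner",
              "Green Park", "Piccadilly Circus", "Leicester Square",
              "Covent Garden"]

# Undirected adjacency, mirroring both("connectedTo").
graph = {}
for a, b in zip(PICCADILLY, PICCADILLY[1:]):
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def simple_paths(graph, start):
    """Yield every simple path from start, including the start alone (emit() first)."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        yield path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in path:          # simplePath(): never revisit a station
                stack.append(path + [nxt])


# has("name", "Covent Garden").path() -> a list of paths (list of lists)
paths = [p for p in simple_paths(graph, "South Kensington")
         if p[-1] == "Covent Garden"]

# unfold() -> one flat sequence of stations
stations = list(chain.from_iterable(paths))
print(stations)
```

Filtering on the last element keeps only the journeys that end at the target, and flattening turns the single remaining path into the list of stations to display.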

We can check the result against the real London tube map:

<img alt="London Tube Map Piccadilly" data-entity-type="file" data-entity-uuid="86c143d5-c478-4e5f-ab8f-7ac6f518fb93" src="https://www.datastax.com/sites/default/files/inline-images/London_Tube_Map_Piccadilly.png" />

London_Tube_Map_Piccadilly

<h2>III Finding shortest path</h2>

Now let’s say we want to go from South Kensington to Marble Arch. If we look at the London tube map, there are 2 possible journeys:

<ol>
	<li>the one that minimizes station count, going through Green Park &amp; Bond Street (5 stations), but with 2 line changes:&nbsp;Piccadilly line -&gt; Jubilee line and then Jubilee line -&gt; Central line</li>
	<li>the one that minimizes the number of line changes, going through Notting Hill Gate, but with more stations (6)</li>
</ol>

<img alt="South Kensington Marble Arch" data-entity-type="file" data-entity-uuid="91edea35-0eea-46c3-b9f9-02f155c972fa" src="https://www.datastax.com/sites/default/files/inline-images/SouthKensington_MarbleArch.png" />

SouthKensington_MarbleArch

So let’s see how we can ask Gremlin to find the path that minimizes station count:

<code>gremlin&gt;g.V(). 
&nbsp; has("station", "name", "South Kensington"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Iterator&lt;Station&gt; 
&nbsp; emit(). 
&nbsp; repeat(both("connectedTo").simplePath().timeLimit(400)). // Iterator&lt;Station&gt; 
&nbsp; has("name", "Marble Arch"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// iterator.filter(station -&gt; station.getName().equals("Marble Arch")) 
&nbsp; order(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // order all the paths leading to Marble Arch&nbsp; 
&nbsp; &nbsp; by(path().count(local), incr). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // by number of elements(station) in each path, ASC 
&nbsp; limit(1). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// take the 1st path == shortest path&nbsp; 
&nbsp; path().unfold() &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Iterator&lt;Station&gt;</code>

<img alt="South Kensington Marble Arch Path1" data-entity-type="file" data-entity-uuid="67c474e9-1b96-4597-abde-4ffad8990249" src="https://www.datastax.com/sites/default/files/inline-images/SouthKensington_MarbleArch_Path1-1024x250.png" />

SouthKensington_MarbleArch_Path1

Let’s analyse the above traversal. First, <code>repeat(both("connectedTo").simplePath().timeLimit(400))</code> represents graph expansion in all directions from South Kensington, time-limited to 400ms. <code>has("name", "Marble Arch")</code> then filters all emitted stations to keep only Marble Arch.

At this stage we have multiple journeys leading to the same destination. Now the fun begins with the path object. <code>order().by(path().count(local), incr)</code> orders each journey found by the number of “objects” in its path history. The trick here is that <code>path()</code> represents a collection of path histories, i.e. <code>Iterator&lt;Iterator&lt;Station&gt;&gt;</code>, and we want to sort them by station count, thus <code>count(local)</code>. Taking the 1st path after ordering gives us the journey with the minimum station count among those found within the time limit.

The last <code>path().unfold()</code> is just for display purposes. We want to show the complete list of stations leading from South Kensington to Marble Arch.
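The "collect all journeys, order by path length, take the first" logic can be sketched in plain Python. This is an illustration under stated assumptions, not the traversal itself: <code>EDGES</code> is a hand-written subset of the tube map containing only the two candidate routes, <code>simple_paths</code> is a hypothetical helper, and a <code>max_hops</code> bound stands in for <code>timeLimit(400)</code>.

```python
# Hand-written toy subset of the tube map (assumption for illustration only).
EDGES = [
    ("South Kensington", "Knightsbridge"),
    ("Knightsbridge", "Hyde Park Corner"),
    ("Hyde Park Corner", "Green Park"),
    ("Green Park", "Bond Street"),
    ("Bond Street", "Marble Arch"),
    ("South Kensington", "Gloucester Road"),
    ("Gloucester Road", "High Street Kensington"),
    ("High Street Kensington", "Notting Hill Gate"),
    ("Notting Hill Gate", "Queensway"),
    ("Queensway", "Lancaster Gate"),
    ("Lancaster Gate", "Marble Arch"),
]


def neighbours(station):
    """Undirected adjacency, mirroring both("connectedTo")."""
    for a, b in EDGES:
        if a == station:
            yield b
        elif b == station:
            yield a


def simple_paths(start, target, max_hops):
    """Yield every simple path from start to target using at most max_hops edges."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == target:
            yield path
            continue
        if len(path) - 1 < max_hops:
            for nxt in neighbours(path[-1]):
                if nxt not in path:      # simplePath()
                    stack.append(path + [nxt])


journeys = list(simple_paths("South Kensington", "Marble Arch", max_hops=8))

# order().by(path().count(local), incr).limit(1)
shortest = sorted(journeys, key=len)[0]
print(shortest)
```

Sorting by <code>len</code> plays the role of <code>count(local)</code>: the length is computed per path, not over the whole collection of paths.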

What if we want to minimize the number of line changes instead? Path again! Minimizing the number of line changes amounts to ordering all the journeys by the number of distinct line values on the “connectedTo” edges in each path and then taking the minimum.

<code>gremlin&gt;g.V(). 
&nbsp; has("station", "name", "South Kensington"). 
&nbsp; emit(). 
&nbsp; repeat(bothE("connectedTo").otherV().simplePath()). &nbsp; &nbsp; &nbsp; &nbsp;// instead of both("connectedTo"), we do bothE("...").otherV() so that edges are recorded in the path 
&nbsp; until(has("name", "Marble Arch").or().loops().is(eq(6))). // limit graph expansion to max 6 hops 
&nbsp; has("name", "Marble Arch"). 
&nbsp; order().by( 
&nbsp; &nbsp; &nbsp; &nbsp; path().unfold(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Iterator&lt;Object&gt; where Object == Station &amp; Edge "connectedTo" 
&nbsp; &nbsp; &nbsp; &nbsp; hasLabel("connectedTo"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// only take edges "connectedTo" 
&nbsp; &nbsp; &nbsp; &nbsp; values("line").dedup().count(), &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // take the line value on the edge, dedup &amp; count 
&nbsp; &nbsp; &nbsp; &nbsp; incr). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// ORDER BY COUNT DISTINCT LINE NUMBER on the path, ASC 
&nbsp; &nbsp; limit(1). 
&nbsp; &nbsp; path().unfold(). 
&nbsp; &nbsp; hasLabel("station")&nbsp; 
&nbsp;</code>

The beginning of the traversal is quite similar, but now instead of doing <code>both("connectedTo")</code> we do <code>bothE("connectedTo").otherV()</code>. Semantically it is equivalent; the subtle difference is that we now also visit all “connectedTo” edges in our traversal, not only station vertices, and this is done on purpose.

Instead of using <code>timeLimit(400)</code> we can also use the <code>until(has("name", "Marble Arch").or().loops().is(eq(6)))</code> step to stop graph expansion as soon as we find Marble Arch or reach 6 hops from South Kensington.

Now the ordering step is important. We unfold the path object, which now becomes an Iterator of Station vertices and “connectedTo” edges. We want to select only “connectedTo” edges thus <code>hasLabel("connectedTo")</code>. Then we extract the “line” property, deduplicate them and count them. The ordering will be done on this distinct lines count, ASCENDING.

The traversal ends with a <code>path().unfold()</code> to display the journey nicely, with an additional filtering <code>hasLabel("station")</code> to remove all the “connectedTo” edges from the path.
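The ordering by distinct line count can be sketched in plain Python as well. Again this is only an illustration: <code>EDGES</code> is a hand-written toy subset with one assumed line per edge (the real dataset allows several "connectedTo" edges between two stations), <code>journeys</code> is a hypothetical helper, and <code>max_hops=6</code> mirrors <code>loops().is(eq(6))</code>.

```python
# Hand-written toy subset: (station, station, line). One line per edge is a
# simplifying assumption for this sketch.
EDGES = [
    ("South Kensington", "Knightsbridge", "Piccadilly"),
    ("Knightsbridge", "Hyde Park Corner", "Piccadilly"),
    ("Hyde Park Corner", "Green Park", "Piccadilly"),
    ("Green Park", "Bond Street", "Jubilee"),
    ("Bond Street", "Marble Arch", "Central"),
    ("South Kensington", "Gloucester Road", "Circle"),
    ("Gloucester Road", "High Street Kensington", "Circle"),
    ("High Street Kensington", "Notting Hill Gate", "Circle"),
    ("Notting Hill Gate", "Queensway", "Central"),
    ("Queensway", "Lancaster Gate", "Central"),
    ("Lancaster Gate", "Marble Arch", "Central"),
]


def incident(station):
    """Yield (line, other station), mirroring bothE("connectedTo").otherV()."""
    for a, b, line in EDGES:
        if a == station:
            yield line, b
        elif b == station:
            yield line, a


def journeys(start, target, max_hops):
    """Yield (stations, lines) for every simple path of at most max_hops edges."""
    stack = [([start], [])]
    while stack:
        stations, lines = stack.pop()
        if stations[-1] == target:       # until(has("name", target) ...)
            yield stations, lines
            continue
        if len(lines) < max_hops:        # ... or().loops().is(eq(max_hops))
            for line, nxt in incident(stations[-1]):
                if nxt not in stations:  # simplePath()
                    stack.append((stations + [nxt], lines + [line]))


found = list(journeys("South Kensington", "Marble Arch", max_hops=6))

# order().by(<distinct line count>, incr).limit(1)
best_stations, best_lines = min(found, key=lambda j: len(set(j[1])))
print(best_stations, sorted(set(best_lines)))
```

Keeping the edge lines in a list parallel to the stations corresponds to recording edges in the path, and <code>len(set(...))</code> is the dedup-and-count step.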

The result is what we expected:

<img alt="South Kensington Marble Arch Path2" data-entity-type="file" data-entity-uuid="ab5a8382-9ba1-457e-9edd-d6d85e0629ed" src="https://www.datastax.com/sites/default/files/inline-images/SouthKensington_MarbleArch_Path2-1024x288.png" />

SouthKensington_MarbleArch_Path2

And that’s all folks! Do not miss the other Gremlin recipes in this series.

If you have any questions about Gremlin, find me on <a href="http://datastaxacademy.slack.com/">datastaxacademy.slack.com</a>, channel dse-graph. My id is @doanduyhai.

Gremlin Recipes: 5 – Path Object
