Duy Hai Doan

Part 4 of 10 in the Gremlin Recipes series. The purpose is to explain the internals of <a href="http://tinkerpop.apache.org/docs/3.2.5/reference/#traversal">Gremlin</a> and give readers a deeper insight into the query language so they can master it.

This blog post is the 4th in the series <a href="https://academy.datastax.com/content/gremlin-recipes">Gremlin Recipes</a>. It is recommended to read the previous blog posts first:

<ol>
	<li><a href="https://academy.datastax.com/content/gremlin-traversals">Gremlin as a Stream</a></li>
	<li><a href="https://academy.datastax.com/content/sql-tpo-p">SQL to Gremlin</a></li>
	<li><a href="https://academy.datastax.com/content/gremlin-recipes-3-%E2%80%93-recommendation-engine-traversal">Recommendation Engine traversal</a></li>
</ol>

<h2>I KillrVideo dataset</h2>

 
To illustrate this series of recipes, you first need to create the schema for KillrVideo and import the data. See <a href="http://www.doanduyhai.com/blog/?p=13224#killrvideo_dataset">here</a> for more details.

The graph schema of this dataset is:

<h3><iframe height="400px" src="https://s3.amazonaws.com/datastax-graph-schema-viewer/index.html#/?schema=killr_video_small.json" width="100%"></iframe></h3>

<h2>II Graph expansion</h2>

In this post we will explore the techniques for recursive traversal with Gremlin. We will focus mainly on this part of the schema:

<img alt="User Knows" data-entity-type="" data-entity-uuid="" data-widget="image" src="https://www.datastax.com/sites/default/files/inline-images/Users-Knows-1024x378.png" />

A user knows other users, who in turn know other users, and so on. This is a typical case for a recursive traversal. First we want to know which user in our dataset knows the most other users:

<code>gremlin&gt;g.V(). // Iterator&lt;Vertex&gt; 
&nbsp; hasLabel("user"). // Iterator&lt;User&gt; 
&nbsp; order(). 
&nbsp; &nbsp; by(outE("knows").count(), decr). // Order the users by their number of outgoing "knows" edges, DESCENDING 
&nbsp; values("userId"). // Only display userId 
&nbsp; limit(1) // Take the 1st matching user 
==&gt;u861</code>

<code>gremlin&gt;g.V(). // Iterator&lt;Vertex&gt; 
&nbsp; has("user", "userId", "u861"). // Iterator&lt;User&gt; 
&nbsp; outE("knows"). // Iterator&lt;Knows_Edge&gt; 
&nbsp; count() 
==&gt;13</code>

So user u861 knows 13 other users as 1st degree connections.

Now let’s list all 13 1st degree friends of user u861:

<code>gremlin&gt;g.V(). // Iterator&lt;Vertex&gt; 
&nbsp; has("user", "userId", "u861"). // Iterator&lt;User&gt; 
&nbsp; out("knows"). // Iterator&lt;User&gt; 
&nbsp; values("userId"). // Iterator&lt;String&gt; 
&nbsp; fold() // Iterator&lt;Collection&lt;String&gt;&gt; 
==&gt;[u751, u868, u778, u887, u713, u733, u752, u548, u892, u657, u841, u150, u154]</code>

<h2>III Repeating traversal</h2>

Now, let’s display all 2nd degree friends of user u861:

<code>gremlin&gt;g.V(). // Iterator&lt;Vertex&gt; 
&nbsp; has("user", "userId", "u861"). // Iterator&lt;User&gt; 
&nbsp; as("u861"). // Label u861 
&nbsp; out("knows"). // 1st degree friends 
&nbsp; out("knows"). // 2nd degree friends 
&nbsp; dedup(). // Remove duplicates 
&nbsp; where(neq("u861")). // Exclude u861 
&nbsp; values("userId"). // Iterator&lt;String&gt; 
&nbsp; fold() // Iterator&lt;Collection&lt;String&gt;&gt; 
==&gt;[u747, u719, u755, u756, u886, u880, u819, u831, u810, u832, u869, u718, u875, u830, u723, u727, u207, u744, u860, u704, u730, u829, u620, u882, u1035, u724, u800, u703, u705, u416, u717, u814, u767, u895, u804, u118, u263, u900, u887, u766, u851, u757, u606, u550, u155, u897, u750, u792, u797, u776, u702, u873, u855, u131, u196, u707, u103, u152]</code>

As we can see, the combinatorial explosion starts right at the 2nd degree connections and will keep growing exponentially. We also notice the repetition of the step <code>out("knows")</code>. If we were looking for Nth degree connections, we would have to repeat the same step N times, which is quite verbose. To avoid this, Gremlin exposes a convenient step: <code>repeat(... inner traversal)</code>. The above traversal can be rewritten as:

<code>gremlin&gt;g.V(). // Iterator&lt;Vertex&gt; 
&nbsp; has("user", "userId", "u861"). // Iterator&lt;User&gt; 
&nbsp; as("u861"). // Label u861 
&nbsp; repeat(out("knows")).times(2). // 2nd degree friends 
&nbsp; dedup(). // Remove duplicates 
&nbsp; where(neq("u861")). // Exclude u861 
&nbsp; values("userId"). // Iterator&lt;String&gt; 
&nbsp; fold() // Iterator&lt;Collection&lt;String&gt;&gt; 
==&gt;[u747, u719, u755, u756, u886, u880, u819, u831, u810, u832, u869, u718, u875, u830, u723, u727, u207, u744, u860, u704, u730, u829, u620, u882, u1035, u724, u800, u703, u705, u416, u717, u814, u767, u895, u804, u118, u263, u900, u887, u766, u851, u757, u606, u550, u155, u897, u750, u792, u797, u776, u702, u873, u855, u131, u196, u707, u103, u152]</code>

So far so good. There are 58 users connected to u861 by 2nd degree connection.
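To build intuition for what <code>repeat(out("knows")).times(2)</code> does, here is a minimal Python sketch that models the traversal on a tiny in-memory graph. The graph, the user names and the helper functions are all hypothetical and only mimic the behaviour described above; this is not how Gremlin is implemented.

```python
# Hypothetical toy graph (not the KillrVideo data): each key "knows" the users in its list.
KNOWS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a", "e"],
    "d": [],
    "e": ["f"],
    "f": [],
}

def out_knows(frontier):
    """One out("knows") step: follow every outgoing edge of every traverser."""
    return [friend for user in frontier for friend in KNOWS[user]]

def repeat_times(start, n):
    """Rough analogue of repeat(out("knows")).times(n): apply the same step n times."""
    frontier = [start]
    for _ in range(n):
        frontier = out_knows(frontier)
    return frontier

def dedup(items):
    """Keep the first occurrence of each item, similar in spirit to dedup()."""
    return list(dict.fromkeys(items))

# 2nd degree friends of "a", duplicates removed and "a" itself excluded
second_degree = [user for user in dedup(repeat_times("a", 2)) if user != "a"]
print(second_degree)
```

Changing the degree only requires changing the argument <code>n</code>, which is the same convenience <code>times(N)</code> brings compared to chaining N <code>out("knows")</code> steps by hand.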

<h2>IV Emitting traversed vertices</h2>

Now, what if we want to display 1st and 2nd degree connections together? For this, Gremlin provides the <code>emit()</code> modulator to “output” all the traversed vertices:

<code>gremlin&gt;g.V(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Iterator&lt;Vertex&gt; 
&nbsp; has("user", "userId", "u861"). // Iterator&lt;User&gt; 
&nbsp; as("u861"). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Label u861 &nbsp; &nbsp; 
&nbsp; repeat(out("knows")).times(2). // 2nd degree friends 
&nbsp; emit(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Output the traversed User vertices &nbsp;&nbsp; 
&nbsp; dedup(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Remove duplicates 
&nbsp; where(neq("u861")) 
==&gt;...</code>

Using <a href="https://www.datastax.com/products/datastax-studio-and-development-tools#DataStax-Studio">DataStax Studio</a> for a better visualisation of the results, here is the web of all connections up to 2nd degree for user u861:

<img alt="Web of Connections" data-entity-type="" data-entity-uuid="" data-widget="image" src="https://www.datastax.com/sites/default/files/inline-images/2nd_degree_connections.png" />

The visualisation reveals the very fast graph expansion when using recursive traversals. We can restrict ourselves to the 1st&nbsp;degree connections:

<code>gremlin&gt;g.V(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Iterator&lt;Vertex&gt; 
&nbsp; has("user", "userId", "u861"). // Iterator&lt;User&gt; 
&nbsp; repeat(out("knows")).times(1). // 1st degree friends 
&nbsp; emit() 
==&gt;...</code>

<img alt="User Edges" data-entity-type="" data-entity-uuid="" data-widget="image" src="https://www.datastax.com/sites/default/files/inline-images/1st_degree_connections.png" />

Not only are the users displayed, but <a href="https://www.datastax.com/dev#DataStax-Studio">DataStax Studio</a> also conveniently shows all the edges that exist between them, and we discover that u892 and u887 know each other!

But there is still a missing piece in our picture: where is u861? There are in fact 2 possible placements for the <code>emit()</code> modulator. If <code>emit()</code> is placed after <code>repeat()</code>, it will “output” all vertices leaving the repeat-traversal. If <code>emit()</code> is placed before <code>repeat()</code>, it will “output” the vertices prior to entering the repeat-traversal.

In our example we just need to put <code>emit()</code> right before <code>repeat(out("knows")).times(1)</code>:

<code>gremlin&gt;g.V(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; // Iterator&lt;Vertex&gt; 
&nbsp; has("user", "userId", "u861"). // Iterator&lt;User&gt; 
&nbsp; emit(). &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;// Output u861 and other users 
&nbsp; repeat(out("knows")).times(1) &nbsp;// 1st degree friends &nbsp;&nbsp; 
==&gt;...</code>

<img alt="Web of Connections" data-entity-type="" data-entity-uuid="" data-widget="image" src="https://www.datastax.com/sites/default/files/inline-images/Web_of_connections.png" />

Now we have a beautiful web of connections going out from u861.
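The effect of the <code>emit()</code> placement can be illustrated with a small Python model. This is only a sketch on a hypothetical toy graph: the function name and the <code>emit_first</code> flag are invented for illustration, and the order of results in a real Gremlin traversal may differ.

```python
# Hypothetical toy graph (not the KillrVideo data): each key "knows" the users in its list.
KNOWS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": [],
}

def repeat_with_emit(start, n, emit_first):
    """Model of repeat(out("knows")).times(n) combined with emit().

    emit_first=True  mimics emit().repeat(...): the vertex entering the loop is output too.
    emit_first=False mimics repeat(...).emit(): only vertices leaving the loop body are output.
    """
    results = [start] if emit_first else []
    frontier = [start]
    for _ in range(n):
        # One out("knows") hop for every traverser in the frontier
        frontier = [friend for user in frontier for friend in KNOWS[user]]
        results.extend(frontier)
    return results

print(repeat_with_emit("a", 1, emit_first=False))  # start vertex is missing
print(repeat_with_emit("a", 1, emit_first=True))   # start vertex is included
```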

<h2>V Controlling the recursion</h2>

Until now, we were just using <code>times(x)</code> to control how deep we go into the recursion. What you should know is that <code>times(x)</code> is just a shorthand for a more general form of recursion control: <code>until(loops().is(eq(x)))</code>.

<code>until()</code> controls when to break out of the recursion based on a condition. As with <code>emit()</code>, you can place <code>until()</code> before or after the <code>repeat()</code> step.

<code>until(...).repeat(...)</code> is equivalent to a <code>while(...) do{ }</code> loop (the condition is checked before the first iteration), whereas <code>repeat(...).until(...)</code> is equivalent to a <code>do{ } while(...)</code> loop (the condition is checked after each iteration).
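The difference between the two placements shows up when the starting vertex already satisfies the condition. The Python sketch below models both forms on a hypothetical acyclic toy graph; the names, ages and helper functions are assumptions made for illustration only.

```python
# Hypothetical acyclic toy graph and ages (not the KillrVideo data).
KNOWS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
AGE = {"a": 32, "b": 32, "c": 20, "d": 32}

def run_loop(frontier, stop):
    """Check the stop condition on each traverser, then hop out("knows") if it fails."""
    done = []
    while frontier:
        next_frontier = []
        for user in frontier:
            if stop(user):
                done.append(user)                   # traverser leaves the loop
            else:
                next_frontier.extend(KNOWS[user])   # traverser loops again
        frontier = next_frontier
    return done

def until_repeat(start, stop):
    """Model of until(stop).repeat(out("knows")): condition checked first (while/do)."""
    return run_loop([start], stop)

def repeat_until(start, stop):
    """Model of repeat(out("knows")).until(stop): the body runs once first (do/while)."""
    return run_loop(list(KNOWS[start]), stop)

def is_32(user):
    return AGE[user] == 32

print(until_repeat("a", is_32))  # "a" itself matches, so the loop body never runs
print(repeat_until("a", is_32))  # "a" is expanded once before the first check
```

The toy graph is deliberately acyclic so that both loops terminate without any depth limit.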

Suppose we want to know all friends of u861 having age 32:

<code>gremlin&gt; :remote config timeout 10000 
==&gt;Set remote timeout to 10000ms 
gremlin&gt; g.V(). 
&nbsp; has("user", "userId", "u861"). 
&nbsp; until(has("age", 32)). 
&nbsp; repeat(out("knows")) 
Request timed out while processing - increase the timeout with the :remote command 
Type ':help' or ':h' for help. 
Display stack trace? [yN]</code>

We just ran into a timeout. Why so? Doesn’t user u861 have any connected friend who is 32 years old? We can run a quick check:

<code>gremlin&gt; g.V(). 
&nbsp; has("user", "userId", "u861"). 
&nbsp; out("knows"). 
&nbsp; has("age", 32). 
&nbsp; valueMap("userId", "age") 
==&gt;{userId=[u887], age=[32]}</code>

User u887 is indeed a direct connection of u861 having the matching age. So why did our traversal time out?

One problem with our traversal is that we may run into cyclic sub-graphs, e.g. user xxx knows user yyy, who in return knows user xxx. To shield ourselves from cycles, we can use <code>repeat(out("knows").simplePath())</code> to consider only acyclic paths.

But the fundamental reason for the timeout is that the Gremlin OLTP query engine is optimized for depth-first search. We will explain this concept in another post, but long story short, in the traversal <code>until(has("age", 32)).repeat(out("knows"))</code>, Gremlin will select one vertex and expand along the knows edges in multiple directions (graph explosion). Some branches stop because the traverser encounters a user with age == 32, but other branches may continue for N more hops. Using <code>repeat(out("knows").simplePath())</code> helps reduce the combinatorial explosion due to cycles, but we still have a rapid graph expansion.

<img alt="Graph Expansion" data-entity-type="file" data-entity-uuid="f723e0e1-b9cc-4388-aeb3-a8f14de4f9c9" src="https://www.datastax.com/sites/default/files/inline-images/Graph_Expansion-1024x650_0.png" />

One way to limit depth-first expansion is to control the depth of each traversed branch using the <code>loops()</code> step: <code>until(has("age", 32).or().loops().is(eq(3)))</code>

<code><code>gremlin&gt;g.V(). 
&nbsp; has("user", "userId", "u861"). 
&nbsp; until(has("age", 32).or().loops().is(eq(3))). 
&nbsp; repeat(out("knows").simplePath()). 
&nbsp; valueMap("userId", "age") 
==&gt;{userId=[u887], age=[32]} 
==&gt;{userId=[u719], age=[32]} 
==&gt;{userId=[u755], age=[32]} 
==&gt;{userId=[u800], age=[32]} 
==&gt;{userId=[u887], age=[32]} 
==&gt;{userId=[u361], age=[23]} 
==&gt;{userId=[u740], age=[21]} 
...</code></code>

Here we are! The condition <code>until(has("age", 32).or().loops().is(eq(3)))</code> means that we stop the loop whenever one of these conditions is met:

<ul>
	<li>either the traverser reaches a user with age == 32</li>
	<li>or the depth of the branch (given by <code>loops()</code>) reaches 3</li>
</ul>

That explains why the results contain some users with age == 32 (u887, u719, …) and others with age != 32 (u361, u740, …). The latter are still returned because they match the 2nd condition of our <code>until()</code>.
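The combination of a stop condition, a depth bound and <code>simplePath()</code> can be modelled in a few lines of Python. This is a minimal sketch on a hypothetical toy graph containing a cycle; the function name, the graph and the ages are invented for illustration and do not come from the KillrVideo dataset.

```python
# Hypothetical toy graph with a cycle (a -> c -> a) and ages.
KNOWS = {"a": ["b", "c"], "b": [], "c": ["a", "d"], "d": ["e"], "e": ["f"], "f": []}
AGE = {"a": 40, "b": 32, "c": 25, "d": 21, "e": 30, "f": 32}

def bounded_search(start, target_age, max_loops):
    """Model of until(has("age", x).or().loops().is(eq(n))).repeat(out("knows").simplePath())."""
    results = []
    stack = [([start], 0)]  # each traverser carries its path and its loop counter
    while stack:
        path, loops = stack.pop()  # depth-first: continue with the most recent branch
        user = path[-1]
        if AGE[user] == target_age or loops == max_loops:
            results.append(user)   # either condition ends this branch
            continue
        for friend in KNOWS[user]:
            if friend not in path:  # simplePath(): never revisit a vertex on the same path
                stack.append((path + [friend], loops + 1))
    return results

# "b" is returned because of its age, "d" only because the depth bound was reached
print(sorted(bounded_search("a", 32, 2)))
```

With a larger bound, the branch through "d" keeps going until it finds a matching user, which is why the choice of bound changes which non-matching users appear in the results.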

<h2>VI Time-limiting the graph exploration</h2>

We have just seen that one technique to control the recursive combinatorial explosion is to put a limit on the depth of each explored branch. Another powerful technique is to simply cap the computation time of the graph expansion. For this we can use <code>timeLimit(time_in_millisecs)</code>. The limit covers all the computation time, including the time taken by the loops and the recursive traversal.

The previous traversal can be rewritten as:

<code><code>gremlin&gt;g.V(). 
&nbsp; has("user", "userId", "u861"). 
&nbsp; until(has("age", 32)). 
&nbsp; repeat(out("knows").simplePath()). 
&nbsp; timeLimit(100). 
&nbsp; valueMap("userId", "age") 
==&gt;{userId=[u887], age=[32]} 
==&gt;{userId=[u719], age=[32]} 
==&gt;{userId=[u755], age=[32]} 
==&gt;{userId=[u800], age=[32]} 
==&gt;{userId=[u887], age=[32]}</code></code>

In the above traversal, we want the computation of the whole block <code>until(has("age", 32)).repeat(out("knows").simplePath())</code> to take at most 100ms. If we move the <code>timeLimit(100)</code> inside the <code>repeat(...)</code> step:

<code><code>gremlin&gt;g.V(). 
&nbsp; has("user", "userId", "u861"). 
&nbsp; until(has("age", 32)). 
&nbsp; repeat(out("knows").simplePath().timeLimit(100)). 
&nbsp; valueMap("userId", "age"). 
&nbsp; dedup() 
==&gt;{userId=[u887], age=[32]} 
==&gt;{userId=[u719], age=[32]} 
==&gt;{userId=[u755], age=[32]} 
==&gt;{userId=[u800], age=[32]} 
==&gt;{userId=[u885], age=[32]} 
==&gt;{userId=[u877], age=[32]} 
==&gt;{userId=[u807], age=[32]} 
==&gt;{userId=[u771], age=[32]} 
==&gt;{userId=[u783], age=[32]} 
==&gt;{userId=[u813], age=[32]} 
==&gt;{userId=[u431], age=[32]} 
==&gt;{userId=[u775], age=[32]} 
==&gt;{userId=[u839], age=[32]} 
==&gt;{userId=[u988], age=[32]} 
==&gt;{userId=[u936], age=[32]} 
==&gt;{userId=[u970], age=[32]} 
==&gt;{userId=[u1088], age=[32]} 
==&gt;{userId=[u1089], age=[32]}</code></code>

Now we have more results. In fact, <code>repeat(out("knows").simplePath().timeLimit(100))</code> means that each repetition (loop) should take at most 100ms.

This time-limiting step is quite powerful: when you don’t know how fast your graph will expand because each vertex may have a very different adjacency degree, using <code>timeLimit(...)</code> keeps your computation resources under control.
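The idea of capping an expansion by elapsed time rather than by depth can be sketched in Python as follows. This is only a rough model under stated assumptions: the toy graph, the function name and the injectable <code>clock</code> parameter are hypothetical, and the sketch does not reproduce how <code>timeLimit()</code> is implemented in Gremlin.

```python
import time

# Hypothetical toy graph with a cycle (a -> c -> a).
KNOWS = {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": []}

def expand_with_budget(start, budget_seconds, clock=time.monotonic):
    """Depth-first expansion over simple paths that stops once the time budget is spent.

    Returns the vertices visited so far, in visiting order. The clock is injectable
    so that the cut-off behaviour can be exercised deterministically.
    """
    deadline = clock() + budget_seconds
    visited = []
    stack = [[start]]  # each entry is the path followed by one traverser
    while stack and clock() < deadline:
        path = stack.pop()
        visited.append(path[-1])
        for friend in KNOWS[path[-1]]:
            if friend not in path:  # simple paths only, to avoid looping on cycles
                stack.append(path + [friend])
    return visited

# With a generous budget the whole (tiny) graph is explored
print(expand_with_budget("a", 5.0))
```

Whatever has been collected when the budget runs out is returned as a partial result, which mirrors why the time-limited traversals above return some matching users instead of timing out.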

And that’s all folks! Do not miss the other Gremlin recipes in this series.

If you have any questions about Gremlin, find me on <a href="http://datastaxacademy.slack.com/">datastaxacademy.slack.com</a>, channel dse-graph. My id is @doanduyhai.

Gremlin Recipes: 4 – Recursive Traversals
