Marko A. Rodriguez

<p><img alt="supernodes graph" data-align="right" data-entity-type="file" data-entity-uuid="5817ec5e-e0d7-4e4a-b07b-5abba170ae4c" src="https://www.datastax.com/sites/default/files/inline-images/supernodes-graph.png" /></p>

<p>In&nbsp;<a href="http://en.wikipedia.org/wiki/Graph_theory">graph theory</a>&nbsp;and&nbsp;<a href="http://en.wikipedia.org/wiki/Network_science">network science</a>, a "supernode" is a vertex with a disproportionately high&nbsp;<a href="http://en.wikipedia.org/wiki/Degree_(graph_theory)">number</a>&nbsp;of incident edges. While supernodes are rare in natural graphs (as statistically demonstrated with&nbsp;<a href="http://en.wikipedia.org/wiki/Power_law">power-law</a>&nbsp;degree distributions), they show up frequently during graph analysis. The reason is that supernodes are connected to so many other vertices that they lie on numerous paths in the graph. Therefore, an arbitrary traversal is likely to touch a supernode. In graph computing, supernodes can lead to system performance problems. Fortunately, for&nbsp;<a href="https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model">property graphs</a>, there is a theoretical and applied solution to this problem.</p>


<h2>Supernodes in the Real World</h2>

<h3>Peer-to-Peer File Sharing</h3>

<p>At the turn of the millennium,&nbsp;<a href="http://en.wikipedia.org/wiki/Peer-to-peer_file_sharing">online file sharing</a>&nbsp;was being supported by services like&nbsp;<a href="http://en.wikipedia.org/wiki/Napster">Napster</a>&nbsp;and&nbsp;<a href="http://en.wikipedia.org/wiki/Gnutella">Gnutella</a>. Unlike Napster, Gnutella is a true peer-to-peer system in that it has no central file index. Instead, a client's search is sent to its adjacent clients. If those clients don't have the file, then the request propagates to their adjacent clients, and so on. As in any natural graph, a supernode is only a few steps away. Therefore, in many peer-to-peer networks, supernode clients are quickly inundated with search requests and, in turn, a&nbsp;<a href="http://en.wikipedia.org/wiki/Denial-of-service_attack">DoS</a>&nbsp;is effected.</p>

<h3>Social Network Celebrities</h3>

<p>President&nbsp;<a href="https://twitter.com/BarackObama">Barack Obama</a>&nbsp;currently has 21,322,866 followers on&nbsp;<a href="http://twitter.com/">Twitter</a>. When Obama tweets, that tweet must register in the activity streams of 21+ million accounts. The Barack Obama vertex is considered a supernode. As an opposing example, when&nbsp;<a href="https://twitter.com/spmallette">Stephen Mallette</a>&nbsp;tweets, only 59 streams need to be updated. Twitter realizes this discrepancy and maintains different mechanisms for handling "the Obamas" (i.e. the celebrities) and "the Stephens" (i.e. the plebeians) of the Twitter-sphere.</p>

<h2>Blueprints and Vertex Queries</h2>

<p><a href="http://blueprints.tinkerpop.com/">Blueprints</a>&nbsp;is a Java interface for graph-based software. Various&nbsp;<a href="http://en.wikipedia.org/wiki/Graph_database">graph databases</a>,&nbsp;<a href="http://en.wikipedia.org/wiki/Social_network_analysis_software">in-memory graph engines</a>, and&nbsp;<a href="http://en.wikipedia.org/wiki/Graph_database#Distributed_Graph_Processing">batch-analytics frameworks</a>&nbsp;make use of Blueprints. In June 2012,&nbsp;<a href="https://github.com/tinkerpop/blueprints/wiki/The-Major-Differences-Between-Blueprints-1.x-and-2.x">Blueprints 2.x</a>&nbsp;was released with support for "<a href="https://github.com/tinkerpop/blueprints/wiki/Vertex-Query">vertex queries</a>." A vertex query is best explained with an example.<br />
<img alt="vertex query" data-align="right" data-entity-type="file" data-entity-uuid="6e2fc01e-86e2-436e-9bcc-2689fb271e82" src="https://www.datastax.com/sites/default/files/inline-images/knows-likes-tweets-schema.png" />Suppose there is a vertex named Dan. Incident to Dan are 1,110 edges. These edges denote the people Dan knows (10 edges), the things he likes (100 edges), and the tweets he has tweeted (1,000 edges). If Dan wants a list of all the people he knows and incident edges are not indexed by label, then Dan would have to iterate through all 1,110 edges to find the 10 people he knows. However, if Dan's edges are indexed by edge label, then a lookup into a hash on&nbsp;<code>knows</code>&nbsp;would immediately yield the 10 people --&nbsp;<code>O(n)</code>&nbsp;vs.&nbsp;<code>O(1)</code>, where&nbsp;<code>n</code>&nbsp;is the number of edges incident to Dan.</p>
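<p>The difference between scanning and a label lookup can be sketched in plain Java. This is only an illustration under stated assumptions: <code>LabelIndexSketch</code>, <code>Edge</code>, and the method names are hypothetical and are not part of the Blueprints API.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy classes for illustration; this is not the Blueprints API.
class LabelIndexSketch {

    // An incident edge: its label and the name of the adjacent vertex.
    static final class Edge {
        final String label;
        final String target;
        Edge(String label, String target) { this.label = label; this.target = target; }
    }

    // Without a label index: every incident edge is inspected, O(n).
    static List<String> scan(List<Edge> incident, String label) {
        List<String> result = new ArrayList<>();
        for (Edge e : incident) {
            if (e.label.equals(label)) result.add(e.target);
        }
        return result;
    }

    // With a label index: edges are grouped by label once, so a later
    // lookup is a single hash access regardless of the total edge count.
    static Map<String, List<String>> indexByLabel(List<Edge> incident) {
        Map<String, List<String>> index = new HashMap<>();
        for (Edge e : incident) {
            index.computeIfAbsent(e.label, k -> new ArrayList<>()).add(e.target);
        }
        return index;
    }

    // Dan's 1,110 incident edges: 10 knows, 100 likes, 1,000 tweets.
    static List<Edge> dansEdges() {
        List<Edge> edges = new ArrayList<>();
        for (int i = 0; i < 10; i++) edges.add(new Edge("knows", "person" + i));
        for (int i = 0; i < 100; i++) edges.add(new Edge("likes", "thing" + i));
        for (int i = 0; i < 1000; i++) edges.add(new Edge("tweets", "tweet" + i));
        return edges;
    }

    public static void main(String[] args) {
        List<Edge> edges = dansEdges();
        Map<String, List<String>> index = indexByLabel(edges);
        // Both approaches find the same 10 people; only the work done differs.
        System.out.println(scan(edges, "knows").size());
        System.out.println(index.getOrDefault("knows", Collections.emptyList()).size());
    }
}
```

<p>Both calls return the same 10 people. The scan touches all 1,110 edges on every request, whereas the index pays that cost once when it is built.</p>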

<p>The idea of partitioning edges by discriminating qualities can be taken a step further in&nbsp;<a href="https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model">property graphs</a>. Property graphs support key/value pairs on vertices and edges. For example, a&nbsp;<code>knows</code>-edge can have a&nbsp;<code>type</code>-property with possible values of "work," "family," and "favorite" and a&nbsp;<code>since</code>&nbsp;property specifying when the relationship began. Similarly,&nbsp;<code>likes</code>-edges can have a 1-to-5&nbsp;<code>rating</code>-property and&nbsp;<code>tweets</code>-edges can have a&nbsp;<code>time</code>stamp denoting when the tweet was tweeted. Blueprints'&nbsp;<code><a href="http://tinkerpop.com/docs/javadocs/blueprints/2.1.0/com/tinkerpop/blueprints/Query.html">Query</a></code>&nbsp;allows the developer to specify constraints on the incident edges to be retrieved. For example, to get all of Dan's highly rated items, the following Blueprints code is evaluated.</p>

<table border="0" cellpadding="0" cellspacing="0">
	<tbody>
		<tr>
			<td>
			<p><code>dan.query().labels(</code><code>"likes"</code><code>).interval(</code><code>"rating"</code><code>,</code><code>4</code><code>,</code><code>6</code><code>).vertices()</code></p>
			</td>
		</tr>
	</tbody>
</table>
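<p>The bounds 4 and 6 in the query above select ratings of 4 and 5, which suggests that the interval includes its start value and excludes its end value. The following plain-Java sketch shows that filtering rule under that assumption. <code>IntervalQuerySketch</code> and <code>LikesEdge</code> are hypothetical names, and the code is not the Blueprints <code>Query</code> implementation.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the Blueprints Query implementation.
class IntervalQuerySketch {

    // A likes-edge reduced to the liked item and its rating property.
    static final class LikesEdge {
        final String item;
        final int rating;
        LikesEdge(String item, int rating) { this.item = item; this.rating = rating; }
    }

    // Keeps items whose rating r satisfies start <= r < end
    // (assumed half-open interval: start inclusive, end exclusive).
    static List<String> interval(List<LikesEdge> edges, int start, int end) {
        List<String> items = new ArrayList<>();
        for (LikesEdge e : edges) {
            if (e.rating >= start && e.rating < end) items.add(e.item);
        }
        return items;
    }

    public static void main(String[] args) {
        List<LikesEdge> likes = new ArrayList<>();
        for (int rating = 1; rating <= 5; rating++) {
            likes.add(new LikesEdge("item-rated-" + rating, rating));
        }
        // prints [item-rated-4, item-rated-5]
        System.out.println(interval(likes, 4, 6));
    }
}
```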

<h2>Titan and Vertex-Centric Indices</h2>

<p>Blueprints only provides the interface for representing vertex queries. It is up to the underlying graph system to use the specified constraints to its advantage. The distributed graph database&nbsp;<a href="http://thinkaurelius.github.com/titan/">Titan</a>&nbsp;makes extensive use of vertex-centric indices for fine-grained retrieval of edge data from both disk and memory. To demonstrate the effectiveness of these indices, a benchmark is provided using Titan/<a href="https://github.com/thinkaurelius/titan/wiki/Using-BerkeleyDB">BerkeleyDB</a>&nbsp;(an&nbsp;<a href="http://en.wikipedia.org/wiki/ACID">ACID</a>&nbsp;variant of Titan -- see Titan's&nbsp;<a href="https://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overview">storage overview</a>).</p>

<p>Ten Titan/BerkeleyDB instances are created, each with a person-vertex named Dan. Five of those instances have vertex-centric indices, and five do not. Within each group of five, every instance has a different number of edges incident to Dan. These numbers are provided below.</p>

<table>
	<tbody>
		<tr>
			<th bgcolor="#EAE8E8">total incident edges</th>
			<th bgcolor="#EAE8E8"><code>knows</code>-edges</th>
			<th bgcolor="#EAE8E8"><code>likes</code>-edges</th>
			<th bgcolor="#EAE8E8"><code>tweets</code>-edges</th>
		</tr>
		<tr>
			<td>111</td>
			<td>1</td>
			<td>10</td>
			<td>100</td>
		</tr>
		<tr>
			<td>1,110</td>
			<td>10</td>
			<td>100</td>
			<td>1,000</td>
		</tr>
		<tr>
			<td>11,100</td>
			<td>100</td>
			<td>1,000</td>
			<td>10,000</td>
		</tr>
		<tr>
			<td>111,000</td>
			<td>1,000</td>
			<td>10,000</td>
			<td>100,000</td>
		</tr>
		<tr>
			<td>1,110,000</td>
			<td>10,000</td>
			<td>100,000</td>
			<td>1,000,000</td>
		</tr>
	</tbody>
</table>

<p>The&nbsp;<a href="http://gremlin.tinkerpop.com/">Gremlin</a>/Groovy script to generate the aforementioned&nbsp;<a href="http://en.wikipedia.org/wiki/Star_(graph_theory)">star-graphs</a>&nbsp;is provided below, where&nbsp;<code>i</code>&nbsp;is the variable defining the size of the resultant graph.</p>

<table border="0" cellpadding="0" cellspacing="0">
	<tbody>
		<tr>
			<td>
			<p><code>g = TitanFactory.open(</code><code>'/tmp/supernode'</code><code>)</code></p>

			<p><code>// index configuration snippet goes here for Titan w/ vertex-centric indices</code></p>

			<p><code>g.createKeyIndex(</code><code>'name'</code><code>,Vertex.</code><code>class</code><code>)</code></p>

			<p><code>g.addVertex([name:</code><code>'dan'</code><code>])</code></p>

			<p><code>&nbsp;&nbsp;</code>&nbsp;</p>

			<p><code>r = </code><code>new</code> <code>Random(</code><code>100</code><code>)</code></p>

			<p><code>types = [</code><code>'work'</code><code>,</code><code>'family'</code><code>,</code><code>'favorite'</code><code>]</code></p>

			<p><code>(</code><code>1</code><code>..i).</code><code>each</code><code>{g.addEdge(g.V(</code><code>'name'</code><code>,</code><code>'dan'</code><code>).next(),g.addVertex(),</code><code>'knows'</code><code>,[type:types.</code><code>get</code><code>(r.nextInt(</code><code>3</code><code>)),since:it]); stopTx(g,it)}</code></p>

			<p><code>(</code><code>1</code><code>..(i*</code><code>10</code><code>)).</code><code>each</code><code>{g.addEdge(g.V(</code><code>'name'</code><code>,</code><code>'dan'</code><code>).next(),g.addVertex(),</code><code>'likes'</code><code>,[rating:r.nextInt(</code><code>5</code><code>)]); stopTx(g,it)}</code></p>

			<p><code>(</code><code>1</code><code>..(i*</code><code>100</code><code>)).</code><code>each</code><code>{g.addEdge(g.V(</code><code>'name'</code><code>,</code><code>'dan'</code><code>).next(),g.addVertex(),</code><code>'tweets'</code><code>,[time:it]); stopTx(g,it)}</code></p>
			</td>
		</tr>
	</tbody>
</table>

<p>For the 5 Titan/BerkeleyDB instances&nbsp;<strong>with</strong>&nbsp;vertex-centric indices, the following code fragment was evaluated. This code defines the indices (see Titan's&nbsp;<a href="https://github.com/thinkaurelius/titan/wiki/Type-Definition-Overview">type configurations</a>).</p>

<table border="0" cellpadding="0" cellspacing="0">
	<tbody>
		<tr>
			<td>
			<p><code>type = g.makeType().name(</code><code>'type'</code><code>).simple().functional(false).dataType(String.</code><code>class</code><code>).makePropertyKey()</code></p>

			<p><code>since = g.makeType().name(</code><code>'since'</code><code>).simple().functional(false).dataType(Integer.</code><code>class</code><code>).makePropertyKey()</code></p>

			<p><code>rating = g.makeType().name(</code><code>'rating'</code><code>).simple().functional(false).dataType(Integer.</code><code>class</code><code>).makePropertyKey()</code></p>

			<p><code>time = g.makeType().name(</code><code>'time'</code><code>).simple().functional(false).dataType(Integer.</code><code>class</code><code>).makePropertyKey()</code></p>

			<p><code>g.makeType().name(</code><code>'knows'</code><code>).primaryKey(type,since).makeEdgeLabel()</code></p>

			<p><code>g.makeType().name(</code><code>'likes'</code><code>).primaryKey(rating).makeEdgeLabel()</code></p>

			<p><code>g.makeType().name(</code><code>'tweets'</code><code>).primaryKey(time).makeEdgeLabel()</code></p>
			</td>
		</tr>
	</tbody>
</table>

<p>Next, three traversals rooted at Dan are presented. The first gets all the people Dan knows of a particular randomly chosen type (e.g. family members). The second returns all of the things that Dan has highly rated (i.e. 4 or 5 star ratings). The third retrieves Dan's 10 most recent tweets. Finally, note that Gremlin compiles each expression to an appropriate vertex query (see Gremlin's&nbsp;<a href="https://github.com/tinkerpop/gremlin/wiki/Traversal-Optimization">traversal optimizations</a>).</p>

<table border="0" cellpadding="0" cellspacing="0">
	<tbody>
		<tr>
			<td>
			<p><code>g.V(</code><code>'name'</code><code>,</code><code>'dan'</code><code>).outE(</code><code>'knows'</code><code>).has(</code><code>'type'</code><code>,types.</code><code>get</code><code>(r.nextInt(</code><code>3</code><code>))).inV</code></p>

			<p><code>g.V(</code><code>'name'</code><code>,</code><code>'dan'</code><code>).outE(</code><code>'likes'</code><code>).interval(</code><code>'rating'</code><code>,</code><code>4</code><code>,</code><code>6</code><code>).inV</code></p>

			<p><code>g.V(</code><code>'name'</code><code>,</code><code>'dan'</code><code>).outE(</code><code>'tweets'</code><code>).has(</code><code>'time'</code><code>,T.gt,(i*</code><code>100</code><code>)-</code><code>10</code><code>).inV</code></p>
			</td>
		</tr>
	</tbody>
</table>

<p>The traversals above were each run 25 times with the database restarted after each query in order to demonstrate response times with cold&nbsp;<a href="http://en.wikipedia.org/wiki/Java_virtual_machine">JVM</a>&nbsp;caches. Note that in-memory, warm-cache response times show a similar pattern (albeit relatively faster). The averaged results are plotted below where the y-axis is on a&nbsp;<a href="http://en.wikipedia.org/wiki/Logarithmic_scale">log scale</a>. The green, red, and blue colors denote the first, second and third queries, respectively. Moreover, there is a light and a dark version of each color. The light version is Titan/BerkeleyDB&nbsp;<strong>without</strong>&nbsp;vertex-centric indices and the dark version is Titan/BerkeleyDB&nbsp;<strong>with</strong>&nbsp;vertex-centric indices.</p>
<img alt="vertex degree" data-align="center" data-entity-type="file" data-entity-uuid="e094a2a2-ee0d-4398-94ff-4f78e4543832" src="https://www.datastax.com/sites/default/files/inline-images/vertex-times-barchart.png" />
<p>Perhaps the most impressive result is the retrieval of Dan's 10 most recent tweets (blue). With vertex-centric indices (dark blue), as the number of Dan's tweets grows to 1 million, the time it takes to get the top 10 stays constant at around 1.5 milliseconds. Without indices, the runtime of this query grows in proportion to the amount of data and ultimately reaches 13 seconds (light blue).&nbsp;<strong>That is a difference of roughly 4 orders of magnitude in response time for the same result set</strong>. This example demonstrates how useful vertex-centric indices are for&nbsp;<a href="http://en.wikipedia.org/wiki/Activity_stream">activity stream</a>-type systems.<br />
<img alt="vertex data barchart" data-align="right" data-entity-type="file" data-entity-uuid="dac33875-196c-4e20-aee9-a5074459c62a" src="https://www.datastax.com/sites/default/files/inline-images/vertex-data-barchart.png" />The plot on the right displays the number of vertices returned by each query over each graph size. As expected, the number of&nbsp;<code>tweets</code>&nbsp;stays constant at 10 while the number of&nbsp;<code>knows</code>&nbsp;and&nbsp;<code>likes</code>&nbsp;vertices retrieved grows in proportion to the size of the graph. While the queries on the same graph (with and without indices) return the same data, getting to that data is faster with vertex-centric indices.</p>
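<p>One way to see why the tweets query can stay flat is to keep the edges sorted by <code>time</code>. The sketch below uses a Java <code>TreeMap</code> as a stand-in for such a sort order. It is an assumption-laden illustration, not Titan's storage layout, and <code>SortedTweetIndexSketch</code> and its methods are hypothetical names.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration of a sort order on edges; not Titan's storage layout.
class SortedTweetIndexSketch {

    // Stores tweets with time stamps 1..count, kept sorted by time.
    static NavigableMap<Integer, String> buildIndex(int count) {
        NavigableMap<Integer, String> byTime = new TreeMap<>();
        for (int time = 1; time <= count; time++) {
            byTime.put(time, "tweet" + time);
        }
        return byTime;
    }

    // Mirrors has('time', T.gt, threshold): seek to the threshold in
    // O(log n), then read only the entries that follow it.
    static List<String> newerThan(NavigableMap<Integer, String> byTime, int threshold) {
        return new ArrayList<>(byTime.tailMap(threshold, false).values());
    }

    public static void main(String[] args) {
        int count = 1000;
        NavigableMap<Integer, String> index = buildIndex(count);
        List<String> recent = newerThan(index, count - 10);
        // The 10 most recent tweets, whatever the total number of tweets.
        System.out.println(recent.size());
    }
}
```

<p>The number of entries read depends on the size of the result (10), not on the number of tweets stored, which is consistent with the flat dark blue bars.</p>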

<p>Finally, Titan also supports composite key indices. The type-definition code fragment above assigns a primary key of both&nbsp;<code>type</code>&nbsp;and&nbsp;<code>since</code>&nbsp;to&nbsp;<code>knows</code>-edges. Therefore, retrieving Dan's 10 most recent coworkers through the index is more efficient than fetching all of Dan's coworkers and then sorting them on&nbsp;<code>since</code>&nbsp;in memory. The interested reader can explore the runtimes of such composite vertex-centric queries by augmenting the provided code snippets.</p>
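<p>A composite key of (<code>type</code>, <code>since</code>) can be pictured as a two-level lookup: first by type, then in <code>since</code> order. The following sketch assumes, as in the generation script, that each <code>since</code> value is unique per edge. <code>CompositeKeySketch</code> and its methods are hypothetical and do not reflect how Titan lays out its indices.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration of a composite (type, since) key; not Titan's implementation.
class CompositeKeySketch {

    // type -> (since -> person), with the inner map kept sorted by since.
    // Assumes since values are unique within a type.
    private final Map<String, NavigableMap<Integer, String>> index = new HashMap<>();

    void addKnows(String type, int since, String person) {
        index.computeIfAbsent(type, k -> new TreeMap<>()).put(since, person);
    }

    // Reads the newest relationships of one type directly from the sorted
    // inner map; no full fetch and no sort are needed.
    List<String> mostRecent(String type, int limit) {
        List<String> result = new ArrayList<>();
        NavigableMap<Integer, String> bySince = index.get(type);
        if (bySince == null) return result;
        for (String person : bySince.descendingMap().values()) {
            if (result.size() == limit) break;
            result.add(person);
        }
        return result;
    }

    public static void main(String[] args) {
        CompositeKeySketch dan = new CompositeKeySketch();
        String[] types = {"work", "family", "favorite"};
        for (int i = 1; i <= 30; i++) {
            dan.addKnows(types[i % 3], i, "person" + i);
        }
        // prints [person30, person27, person24]
        System.out.println(dan.mostRecent("work", 3));
    }
}
```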

<h2>Conclusion</h2>

<p>A supernode is only a problem when the discriminating information between edges is ignored. If all edges are treated equally, then linear&nbsp;<code>O(n)</code>&nbsp;searches through the incident edge set of a vertex are required. However, when indices and sort orders are used,&nbsp;<code>O(log(n))</code>&nbsp;and&nbsp;<code>O(1)</code>&nbsp;lookups can be achieved. The presented results demonstrate 2-5x faster retrievals for the presented&nbsp;<code>knows</code>/<code>likes</code>&nbsp;queries and up to 10,000x faster for the&nbsp;<code>tweets</code>&nbsp;query when vertex-centric indices are employed. Now consider when a traversal is more than a single hop.</p>

<p><br />
<img alt="vertex centric index logo" data-align="left" data-entity-type="file" data-entity-uuid="037d45bf-2e2a-4db1-a83c-4e9a628513a3" src="https://www.datastax.com/sites/default/files/inline-images/vertex-centric-index-logo.png" />The runtimes compound in a&nbsp;<a href="http://en.wikipedia.org/wiki/Combinatorial_explosion">combinatoric</a>&nbsp;manner. Compounding at 1 millisecond vs 10 seconds leads to astronomical differences in overall traversal runtime.</p>
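<p>A back-of-the-envelope model makes the compounding concrete. It assumes, pessimistically, that every vertex reached costs the same per lookup and has the same fan-out. The class <code>CompoundingSketch</code>, the fan-out of 10, and the depth of 3 are illustrative assumptions, not measurements from the benchmark.</p>

```java
// Hypothetical cost model for illustration; the numbers are not benchmark results.
class CompoundingSketch {

    // Number of vertex lookups in a traversal of the given depth when every
    // vertex fans out to `fanout` neighbors: 1 + fanout + ... + fanout^(hops-1).
    static long lookups(long fanout, int hops) {
        long total = 0;
        long frontier = 1;
        for (int hop = 0; hop < hops; hop++) {
            total += frontier;
            frontier *= fanout;
        }
        return total;
    }

    // Total traversal time if each lookup costs the same amount.
    static long totalMillis(long perLookupMillis, long fanout, int hops) {
        return perLookupMillis * lookups(fanout, hops);
    }

    public static void main(String[] args) {
        // 3 hops with a fan-out of 10 means 1 + 10 + 100 = 111 lookups.
        System.out.println(totalMillis(1, 10, 3));      // prints 111
        System.out.println(totalMillis(10_000, 10, 3)); // prints 1110000
    }
}
```

<p>Under these assumptions, a three-hop traversal takes about a tenth of a second at 1 millisecond per lookup and more than 18 minutes at 10 seconds per lookup.</p>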

<p>The graph database&nbsp;<a href="http://thinkaurelius.github.com/titan/">Titan</a>&nbsp;can scale to support hundreds of billions of edges (via Apache&nbsp;<a href="https://github.com/thinkaurelius/titan/wiki/Using-Cassandra">Cassandra</a>&nbsp;and&nbsp;<a href="https://github.com/thinkaurelius/titan/wiki/Using-HBase">HBase</a>). Vertices with a million+ incident edges are frequent in such massive graphs. In the world of Big Graph Data, it is important to store and retrieve data from disk and memory efficiently. With Titan, edge filtering is pushed down to the disk-level so only requisite data is actually fetched and brought into memory. Vertex-centric queries and indices overcome the supernode problem by intelligently leveraging the label and property information of the edges incident to a vertex.</p>

<h2>Related Material</h2>

<p>Rodriguez, M.A., Broecheler, M., "<a href="http://www.slideshare.net/slidarko/titan-the-rise-of-big-graph-data">Titan: The Rise of Big Graph Data</a>," Public Lecture at Jive Software, Palo Alto, 2012.</p>

<p>Broecheler, M., LaRocque, D., Rodriguez, M.A., "<a href="http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/">Titan Provides Real-Time Big Graph Data</a>," Aurelius Blog, August 2012.</p>


A Solution to the Supernode Problem
