3 Cassandra Use Cases for Large-Scale Family History
Do you know who your great-great-great-great grandparents are?
If you’re interested in learning more about your family’s history, you may want to head over to FamilySearch and start tracing your genealogy.
FamilySearch is a nonprofit genealogical society that is affiliated with the Church of Jesus Christ of Latter-day Saints and traces its roots back to 1894. On the organization’s website, you can read up on and add to your family history and connect with other family members (e.g., distant cousins) who are doing the same thing.
At DataStax Accelerate last year, Tom Creighton, CTO and lead architect of FamilySearch, told attendees about how his organization is using DataStax and Apache Cassandra™ to help users learn more about their families.
During his talk, Creighton outlined three different Cassandra use cases that are enabling researchers to put together a more complete picture of their families’ histories, which in turn helps us all understand more about where we came from.
1. Family Tree
The first use case Creighton explored was FamilySearch’s most common one: enabling users to access information about their families and collaborate with other family tree researchers.
Here, at a very basic level, users can trace their family trees back through earlier generations. When records are available, they can click on a card for each person in their family tree, which brings up relevant information, such as who that person’s parents were and when and where they lived.
Initially, FamilySearch was running technology built on Oracle databases, which presented an issue at scale. To accommodate a growing user base and an ever-increasing amount of data, the organization migrated to Cassandra and now runs 21 nodes on 40 to 50 servers in AWS, processing 300,000 reads/second and 9,300 writes/second.
“Cassandra’s only gone offline one time since we put this in production three years ago,” Creighton explained. “And that was [due to] human error.”
Cassandra makes it easy to use command query responsibility segregation (CQRS), which enables FamilySearch to write to four different types of records whenever information is uploaded to the system. For example, if a user adds someone’s spouse to their record, not only will that record be updated, but so will a change history record that a user can reference to see how an individual’s data has changed over time.
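To make the CQRS idea concrete, here is a minimal in-memory sketch in Python. It is not FamilySearch’s implementation: the class and method names (PersonStore, handle_add_spouse) are hypothetical, plain dictionaries stand in for Cassandra tables, and only the two record types named in the talk (the person record and the change history) are modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PersonStore:
    """Hypothetical stand-in for two read-optimized tables."""

    # View 1: the current state of each person, keyed by person id.
    current: dict = field(default_factory=dict)
    # View 2: an append-only change history, keyed by person id.
    history: dict = field(default_factory=dict)

    def handle_add_spouse(self, person_id, spouse_id, changed_by):
        """Command side: a single write fans out to every view."""
        person = self.current.setdefault(person_id, {"spouses": []})
        person["spouses"].append(spouse_id)
        self.history.setdefault(person_id, []).append({
            "at": datetime.now(timezone.utc),
            "by": changed_by,
            "change": f"added spouse {spouse_id}",
        })

    def get_person(self, person_id):
        """Query side: read the current-state view only."""
        return self.current.get(person_id)

    def get_history(self, person_id):
        """Query side: read the change-history view only."""
        return list(self.history.get(person_id, []))
```

Each query reads from a view shaped for that question, so showing a person’s card never has to scan the change log, and showing the change log never has to reconstruct it from the current state.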
2. Hinting App
FamilySearch also uses Cassandra to facilitate a hinting app, which helps guide people to other places they can go to do more research to flesh out their family trees.
To date, FamilySearch has published more than four billion records; more than one million new records come online every day. When users want to conduct new research but aren’t sure where to start, a machine learning-based app helps steer them to additional resources that are likely to relate to data in their own family trees. It works much like Amazon recommending products you might be interested in based on a confluence of factors.
The application, which runs trillions of comparisons, processes 192,000 reads/second and 12,000 writes/second.
“We’re constantly improving our machine learning classifier and changing the model a bit,” Creighton said.
3. Resource Metadata System (RMS)
FamilySearch also uses Cassandra, coupled with DataStax Enterprise Search, to help researchers track down records that may be related to their ancestors based on specific qualifiers.
For example, imagine a researcher’s ancestor died in Belfast, Ireland in 1863. The researcher could search by artifact date (e.g., 1863), artifact type (e.g., death records), or artifact place (e.g., Belfast, Ireland) to see whether any relevant records turn up.
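As a rough illustration of that kind of qualifier-based lookup, the Python sketch below filters a list of image metadata records by whichever qualifiers the researcher supplies. The ImageMetadata fields and the search function are hypothetical names chosen for this example; in the system Creighton described, the matching would be done by a search index over Cassandra data rather than a loop in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageMetadata:
    """Hypothetical metadata attached to one scanned image."""
    image_id: str
    artifact_year: int
    artifact_type: str
    artifact_place: str


def search(records, year=None, artifact_type=None, place=None):
    """Return the records that match every qualifier that was supplied.

    A qualifier left as None is ignored, so a researcher can search by
    date alone, by place alone, or by any combination.
    """
    results = []
    for record in records:
        if year is not None and record.artifact_year != year:
            continue
        if artifact_type is not None and record.artifact_type != artifact_type:
            continue
        if place is not None and record.artifact_place != place:
            continue
        results.append(record)
    return results
```

Calling search(records, year=1863, place="Belfast, Ireland") would return every image tagged with that year and place, whatever its record type, which the researcher could then review by hand.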
Much of the information uploaded into FamilySearch’s system is crowdsourced. Researchers from around the world often scan paper records and upload digital images to FamilySearch to improve data quality.
Currently, it takes an average of six months to go from camera capture to publishing a searchable collection. Creighton said his organization is focused on reducing that time frame to 24 hours, and an easy way to accomplish that is by adding metadata to images and making that information searchable.
“A researcher might say ‘I don’t see anything about my ancestor, but I know what area they’re in or I know what time frame, or I’m looking for a certain kind of record,’” Creighton explained.
Metadata searches powered by Cassandra and DataStax Search can then recommend images that pertain to those parameters, and the researcher can study them to see if any of them are relevant to their own family trees.
The RMS application processes 60,000 reads/second and 40,000 writes/second.
To learn more about how FamilySearch uses DataStax to help people around the world understand their family histories on a deeper level, watch Tom’s Accelerate presentation here.
If you’re interested in learning more about how leading organizations across all industries are using DataStax and Cassandra to accelerate business, join us at DataStax Accelerate 2020, which will be held in San Diego (May 11-13) and London (June 2-3). See you there!