How to choose the right metadata store for applications at scale

Metadata is data about your data: it describes the context, structure, and origin of the information you store. In recent years, data volumes have exploded, driven by data-intensive application workloads, the move to the cloud, IoT apps, and big data analytics. As your data grows, your metadata can grow even faster. It can arrive in various unstructured forms and in huge volumes, and it may contain sensitive information.

These factors make metadata challenging to maintain and standardize. Selecting the right database can reduce the administrative burden of managing your metadata while maximizing the business value you derive from it.

Why metadata is important

Metadata is significant in search optimization, making information more discoverable and accessible. For example, capturing location information from a search query can improve the relevance of the result.

When a user in North America searches "MOBA games," they'll likely see "League of Legends" and "DOTA 2," while a user in Japan may find "Arena of Valor." These results are tailored by search engines using metadata like location and browsing history, enhancing personalized user experiences. Metadata also plays a crucial role in semantic search, improving the relevancy of results by understanding the searcher's intent and the contextual meaning of terms. For example, a search for "books about World War II written by historians" will yield results informed by the author's metadata, ensuring books are from historians rather than novelists or hobbyists.

Whether you're developing a SaaS product, an e-commerce application, or an enterprise solution, metadata is crucial. It helps you understand customer intent and behavior, facilitates personalized experiences, and supports adherence to security and compliance standards. Capturing more attributes about each customer also makes further analysis, such as segmentation or classification, more accurate, improving the overall quality and value of the data over time. In customer management scenarios, a comprehensive view of the customer’s data (often called a 360-degree view) can be a critical aid in optimizing future support interactions and upsell promotions.

Effective metadata management can make a huge difference in how quickly and accurately data can be understood, processed, and reported in any organization.

Challenges maintaining metadata

As your data grows, your metadata grows. Whether SQL-based, key-value, time-series, or document-oriented (for example, MongoDB), database engines need to be optimized and carefully architected to handle this sheer volume of often unstructured metadata.

Often metadata grows faster than your data. Let's consider an e-commerce platform where data primarily consists of products and customers who buy these products. Some of the kinds of metadata in this system could be customers' purchase history, product reviews, and browsing history within the application.

Each time a customer buys a product, a transaction record is created that includes the customer ID, product ID, purchase date, quantity, and total cost. The app also records the customer's product browsing history and bounce rate. Customers can write multiple reviews for different products, and each review can contain details such as customer ID, product ID, rating, review text, and date. All of these (transaction records, browsing history, bounce rate, reviews, and the spending patterns derived from them) are metadata for this application. Because every customer and product generates many such records over time, metadata grows more rapidly than data in this example.
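To make the growth rate concrete, here is a rough back-of-the-envelope sketch for the e-commerce example. Every number in it is a hypothetical assumption chosen for illustration, not a measurement from a real platform.

```python
# Hypothetical figures for one month on an illustrative e-commerce platform.
new_customers = 1_000
new_products = 50

# Assumed monthly activity per active customer (illustrative only).
active_customers = 10_000
purchases_per_customer = 2
reviews_per_customer = 1
browsing_events_per_customer = 40

# "Data" here means the core entities: customers and products.
data_records = new_customers + new_products

# "Metadata" here means the records generated around those entities.
metadata_records = active_customers * (
    purchases_per_customer + reviews_per_customer + browsing_events_per_customer
)

print(f"data records added:     {data_records}")
print(f"metadata records added: {metadata_records}")
print(f"metadata-to-data ratio: {metadata_records / data_records:.0f}x")
```

Under these assumptions, the metadata added in a month outnumbers the new data records by a few hundred to one. The exact ratio depends entirely on the figures you plug in.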

Challenges with maintaining metadata in SQL-based databases

SQL databases present several challenges as metadata stores because of the inherent limitations of relational database management systems (RDBMS).

Scalability and cost: SQL databases typically scale vertically (by adding more resources to a single server) rather than horizontally (by adding more servers). Given the sheer volume of metadata, horizontal scaling is usually the more reasonable choice. However, it is complex and requires database and infrastructure expertise on your team to manage replicas and shards. You can also adopt an autoscaling strategy with a cloud provider, but the cost and engineering time can be significant in either case.
Schema rigidity: SQL databases require a predefined schema, which might not suit metadata that is diverse and dynamic. For instance, suppose you are capturing metadata about your users' devices. In SQL, you would create a table with predefined columns, in this case four: userID, location, deviceId, and IpAddress. Now suppose many users start accessing your application from new kinds of devices, and you need to capture the device type as well. To do so, you have to update your schema. Here it is a simple change:
```
ALTER TABLE UserLoginInfo
ADD COLUMN deviceType VARCHAR(255);
```
However, because unstructured metadata is dynamic and changes all the time, you may have to repeat similar operations over and over. This can be time-consuming, depending on the complexity of your dataset.
Distribution and replication: Replicating data across multiple geographical locations can be a challenge with SQL databases. Traditional SQL databases rarely offer an out-of-the-box solution for this problem, so you must invest significant engineering resources to do it correctly.
Complexity of relationships: While relational databases are excellent at handling complex relationships, they can become difficult to manage when relationships among metadata become too intricate, leading to complicated queries that hurt performance. Unstructured JSON documents are also hard to handle: their relationships are complex to define and do not translate easily into tables with entity relationships.
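The schema-rigidity point above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production pattern: SQLite stands in for any SQL database, the UserLoginInfo table and its columns follow the example above, and the sample row values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE UserLoginInfo ("
    "userID TEXT, location TEXT, deviceId TEXT, IpAddress TEXT)"
)


def column_names(connection):
    """Return the current column names of UserLoginInfo."""
    rows = connection.execute("PRAGMA table_info(UserLoginInfo)").fetchall()
    return [row[1] for row in rows]  # index 1 holds the column name


# A row carrying a new attribute is rejected until the schema is migrated.
new_row = ("u1", "US", "d-42", "203.0.113.7", "tablet")  # made-up values
try:
    conn.execute("INSERT INTO UserLoginInfo VALUES (?, ?, ?, ?, ?)", new_row)
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The migration from the text: each new attribute needs its own ALTER TABLE.
conn.execute("ALTER TABLE UserLoginInfo ADD COLUMN deviceType VARCHAR(255)")
conn.execute("INSERT INTO UserLoginInfo VALUES (?, ?, ?, ?, ?)", new_row)

print("rejected before migration:", rejected)
print("columns after migration:", column_names(conn))
```

The insert succeeds only after the migration runs, so every newly discovered metadata attribute implies another schema change and another deployment step.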

Challenges with maintaining metadata in NoSQL databases

NoSQL databases offer flexibility and scalability and are well-suited to handling unstructured data. However, using them as metadata stores presents its own set of challenges.

Consistency: Many NoSQL databases are not strongly consistent by default and offer only eventual consistency. Under the CAP theorem (Consistency, Availability, Partition tolerance), a distributed database must choose between consistency and availability during a network partition, and many NoSQL systems choose availability. This can be problematic for a metadata store where consistency is critical.
For instance, let's assume you have an e-commerce application collecting metadata such as user location and product bounce rate, and you want to provide users with personalized recommendations based on it. If the data is inconsistent (for example, a replica is still serving stale values), your recommendations will not be accurate.
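The stale-read risk can be illustrated with a toy simulation. This is a deliberately simplified sketch, not how any particular database replicates data: the Replica class, the replicate function, and the sample key and values are all hypothetical.

```python
class Replica:
    """A toy key-value replica (hypothetical, for illustration only)."""

    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store.get(key)


def replicate(source, target):
    """Copy the source replica's current state onto the target."""
    target.store.update(source.store)


primary, secondary = Replica(), Replica()
primary.store["user:1:location"] = "US"
replicate(primary, secondary)

# The user's location changes; the write lands on the primary only.
primary.store["user:1:location"] = "JP"

stale = secondary.read("user:1:location")  # read before replication catches up
replicate(primary, secondary)
fresh = secondary.read("user:1:location")  # read after replication

print("before replication:", stale)
print("after replication: ", fresh)
```

A recommendation computed from the first read would target the wrong region, even though the write itself succeeded.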

Complexity of relationships: Most NoSQL databases are not designed to handle complex relationships between entities. If your metadata has intricate relationships, querying and managing it in a NoSQL database can become complicated. With a diverse dataset, you may also end up creating many indexes to support your queries, which can lead to performance issues.

Let's say we are working with a blogging application where we have two collections: Users and Posts. Each post document in the Posts collection has a user_id field, a reference to a user document in the Users collection. We want to find all posts by a specific user along with the user's information. Here's how you might perform this complex query in a popular NoSQL database (in this case, MongoDB) using $lookup:

db.Posts.aggregate([
    {
        $match: {
            user_id: ObjectId('...')  // replace with the actual user_id
        }
    },
    {
        $lookup: {
            from: 'Users',  // join with the Users collection
            localField: 'user_id',  // field from the Posts collection
            foreignField: '_id',  // field from the Users collection
            as: 'user_info'  // output array with matching user documents
        }
    },
    {
        $unwind: '$user_info'  // flatten the output array
    }
]);

This query is already quite lengthy. Metadata is usually a JSON object with many fields; you need only a few in most cases. Let's say you are only interested in user_location and user_language fields. To do so, you can modify the query as shown below.

db.Posts.aggregate([
    {
        $match: {
            user_id: ObjectId('...')  // replace with the actual user_id
        }
    },
    {
        $lookup: {
            from: 'Users',  // join with the Users collection
            let: { user_id: '$user_id' },  // define variables to use in the pipeline
            pipeline: [
                {
                    $match: {
                        $expr: {
                            $eq: ['$_id', '$$user_id']  // match user_id from Posts with _id from Users
                        }
                    }
                },
                {
                    $project: {
                        _id: 0,  // exclude this field
                        user_location: 1,  // include this field
                        user_language: 1  // include this field
                    }
                }
            ],
            as: 'user_info'  // output array with matching user documents
        }
    },
    {
        $unwind: '$user_info'  // flatten the output array
    }
]);

Notice at this point the query is getting really complicated.

You can achieve the same result with better performance in Fauna with the following FQL query.

Post.where(.user_id == "1231233") {
 user {
   user_location,
   user_language
 }
}

Schema-less design: While NoSQL's schema-less design provides flexibility and is good for unstructured or semi-structured data, it can also lead to problems. Without a rigid schema, there can be a lack of data integrity and standardization, potentially leading to inconsistencies in the metadata.
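Here is a short sketch of how this drift can show up in practice. The documents and field names below are invented for illustration. The point is only that, without an enforced schema, nothing stops equivalent attributes from being written under different names or with different types.

```python
from collections import defaultdict

# Hypothetical metadata documents written by different versions of an app.
documents = [
    {"user_id": "u1", "location": "US", "device_type": "tablet"},
    {"user_id": "u2", "location": "JP", "deviceType": "phone"},  # different key style
    {"user_id": "u3", "location": {"country": "DE"}},  # different value type
]


def field_types(docs):
    """Map each field name to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return dict(seen)


types_by_field = field_types(documents)

# Fields stored with more than one type across documents.
conflicting = sorted(f for f, types in types_by_field.items() if len(types) > 1)

# Fields missing from at least one document.
partial = sorted(
    f for f in types_by_field if any(f not in doc for doc in documents)
)

print("conflicting types:", conflicting)
print("not always present:", partial)
```

A query that filters on device_type silently misses the second document, which is the kind of inconsistency an enforced schema would have caught at write time.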

Why use Fauna as your metadata store?

Fauna combines the best of the SQL and NoSQL worlds. Because Fauna is a relational database with a document model, you can store unstructured data and still define SQL-like relationships.

Fauna is strongly consistent, globally distributed, and ACID compliant. It is also managed, auto-scales, and has a pay-per-usage pricing model.

On top of this, you get maximum flexibility to write complex queries with the Fauna Query Language (FQL).

FQL is a modern query language inspired by JavaScript, TypeScript, Python, and GraphQL. It lets you compute on your data inside the query and reduces over-fetching and round trips. FQL offers SQL's query and data relationship capabilities without rigid schema maintenance. With its ORM-like API, FQL makes querying complex, unstructured data straightforward and efficient.

The following is a sample query in FQL that finds all users in a specific location who last logged in after a given date.

User.where(.lastlogin > '12-07-2022' && .location == 'US')

Note how writing the above query feels like writing code in a modern programming language like JavaScript or TypeScript. FQL is a language that feels familiar to developers yet focuses on allowing clear and concise queries and transactional code.

Learn more about FQL in the Fauna documentation.

Fauna’s database engine is designed to handle heavy data loads at global scale while providing strong consistency, which it achieves with a transaction protocol based on the Calvin model of distributed computing.

Fauna also supports temporal data and user-defined functions (UDFs) in the database layer, which can reduce complexity and optimize query performance.

Who is using Fauna?

Many high-growth companies spanning diverse industries (from SaaS and gaming to IoT) currently use Fauna as their metadata store. The following is a brief overview of how these companies are using Fauna to their advantage.

Hannon Hill is a leading software provider that chose Fauna as both its primary database and its metadata store. Fauna currently powers over 5 million daily transactions for Hannon Hill’s new tool, Clive. Hannon Hill selected Fauna primarily because of its data model flexibility, efficient API delivery model, auto-scaling capabilities, parallel transaction support, and cost-efficiency.
Insights.gg, a platform designed for sharing and reviewing video game performance, chose to leverage Fauna for its global distribution, serverless delivery, flexibility, and low latency. As their user base rapidly expanded beyond North America, the limitations of their previous single-region PostgreSQL database became apparent. They needed a technology stack that would enable global distribution and reduce latency without extensive engineering effort while flexibly scaling up or down based on demand.

Insights.gg currently uses Fauna as both its primary database and its metadata store. The platform has over 1 million users, about 100,000 of whom actively share and review videos and commentary every day.

Choosing the right metadata store for scaling applications is critical. Because both SQL and NoSQL databases have limitations, a solution that combines the strengths of both is ideal. Fauna blends SQL and NoSQL features and provides strong consistency, global distribution, auto-scaling, and ACID compliance, making it a robust option for metadata management in large-scale applications.