Why File-based?
In Key concepts, we explain the differences between file-based and API-based integrations, and provide a comparison of the two approaches. For TechWolf, we prefer to build file-based integrations for large data ingestions to optimize for coverage. API-based integrations require O(n) work, where n is the number of APIs, plus ongoing maintenance for each integration.
Additionally, working with files offers (1) built-in observability when things
go wrong, (2) broad support in source systems for exporting data as files, and
(3) the ability to trade immediacy for consistency, resilience, and completeness
since most input data is not time-critical.
This guide primarily focuses on file-based integrations, with notes that also apply to API-based integrations. For API-based guidance, see Addendum: API-based datasource integrations.
TL;DR - file-based datasource integration
Differential file ingestion
Depending on data size, opt for differential file ingestion (files
containing differential updates).
Minimal updates
Regardless of ingestion strategy, datasources can cause request storms
by requesting and updating all data. Implement a proper minimal-update
strategy to avoid running into global rate limits for your other
applications.
Sync consistently
Datasources are not time-critical. A daily sync with 1–2 concurrent
requests should be fine, depending on the size of the dataset.
Eventual consistency
Keep track of failed records, and retry them slowly (daily or slower)
until successful. Eventual consistency is your friend.
Strict format
Use CSV files and validate them against RFC 4180 (the CSV RFC), or use Parquet or
another strict structured format.
Entity dependencies
Entities with dependencies can tempt you into custom implementations per
entity type. Do not take on the long-term maintenance burden of keeping up
with our API's entity dependencies; count on eventual consistency instead.
An entity that could not load today due to a missing dependency might load
tomorrow.
Prevent large deletes
Protect your integration against large deletes; let a human verify if
more than X% of a dataset is deleted.
Install alerting
Install alerting for integrations that fail to run. Otherwise something
breaks, the integration stops running, and nobody notices.
1. Volume of data and considerations
Depending on the source, files can become very large. Think of files containing millions of records, each containing detailed Course descriptions or Employee feedback. Handling large files is challenging when processing rows individually or when performing complex transformations. To scale effectively, use differential files and perform minimal updates to the destination API.
Differential files
We distinguish between differential files (we call them diffs) and full dumps. A differential file contains only the changes since the last diff, while a full dump contains the entire dataset. Differential files are typically generated by the source system and can carry a flag on each record indicating upsert or delete. Some systems output a register of timestamped updates; such transactional files can be reduced to differential files by applying the updates in chronological order.
Ingesting differential files requires precise control over how you deal with failures: you track what failed and retry at some point in the future until it works. To support this, we maintain a ledger of the desired state, a single file containing everything that should exist. When a new differential file is ingested, we update the ledger rather than sending directly to the API. This allows us to retry failed records and re-ingest files after failures. It also enables point-in-time recovery using file-based integrations, keeping the downstream API untouched, which is a powerful capability for customers and for TechWolf. See Minimal updates for how we update the downstream API.
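To make the ledger idea concrete, here is a minimal sketch of applying a diff to a ledger kept as a CSV file. It assumes a hypothetical layout in which every record has a unique `id` column and every diff row carries an `operation` column with the value `upsert` or `delete`; real diffs depend on the source system.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real layout depends on the source system.
KEY_COLUMN = "id"
OPERATION_COLUMN = "operation"  # "upsert" or "delete"


def load_ledger(path: Path) -> dict[str, dict]:
    """Read the ledger (the full desired state) into memory, keyed by record id."""
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as f:
        return {row[KEY_COLUMN]: row for row in csv.DictReader(f)}


def apply_diff(ledger: dict[str, dict], diff_path: Path) -> dict[str, dict]:
    """Apply the upserts and deletes from one differential file to the ledger."""
    with diff_path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            operation = row.pop(OPERATION_COLUMN)
            if operation == "delete":
                ledger.pop(row[KEY_COLUMN], None)
            else:
                ledger[row[KEY_COLUMN]] = row
    return ledger


def save_ledger(ledger: dict[str, dict], path: Path) -> None:
    """Write the desired state back to disk (assumes all records share the same columns)."""
    rows = list(ledger.values())
    if not rows:
        path.write_text("", encoding="utf-8")
        return
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Because the ledger on disk always reflects the full desired state, a failed run can be retried against the same ledger, and files can be re-ingested after failures without touching the downstream API.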
Minimal updates
Datasource ingestion is the heaviest use of the TechWolf API. To reduce load, we perform minimal updates: we calculate the set of items that must be created, updated, or deleted to reach the desired state (the ledger). These updates are also files, giving clear visibility into what is changing. We track what already exists in the API based on previous successful runs and recorded failures; that is the state. For a given differential file, the minimal set of updates equals the difference between the current API state and the desired integration state (state vs. ledger). An additional benefit: each integration tracks its own state, so multiple integrations of the same type can run in parallel without coordinating. An example of what happens when state is not tracked separately per integration:
- You have one source system for Courses. You correctly fetch all Courses from it and load them into the TechWolf API.
- You fetch all Courses from the TechWolf API daily, compare them to the source system, and delete only what is not in the source system.
- You onboard another datasource for Courses.
- Both integrations run separately, but each wants to delete the set of Courses from the other system. The system is empty again.
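As a sketch of the minimal-update calculation described above, assuming both the state (what is known to exist in the API) and the ledger (the desired state) are dictionaries keyed by record id:

```python
def minimal_updates(
    state: dict[str, dict], ledger: dict[str, dict]
) -> dict[str, list]:
    """Compute the smallest set of create/update/delete operations that turns
    the current API state into the desired state described by the ledger."""
    creates = [record for key, record in ledger.items() if key not in state]
    updates = [
        record
        for key, record in ledger.items()
        if key in state and state[key] != record
    ]
    deletes = [key for key in state if key not in ledger]
    return {"create": creates, "update": updates, "delete": deletes}
```

The three resulting sets can themselves be written to files before anything is sent, which provides the visibility into what is changing mentioned above.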
2. Preparing the data for ingestion
Source data might not match the format required by the TechWolf API. Validation and transformation are recommended.
Validation
The most common issue is CSV formatting errors. We accept CSV files encoded per RFC 4180; see our CSV guidelines. We advise validating against the same standard on your side, which prevents CSV formatting errors and keeps wrongly formatted data from being ingested. Validate input before sending data to the TechWolf API. Although API errors are clear, lightweight input-side business validation reduces time-to-resolution (e.g., preventing a 400 due to an empty field).
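As an illustration, a lightweight input-side check might look like the sketch below. It is not a full RFC 4180 validator, and the required column names are hypothetical; the real set depends on the entity being loaded.

```python
import csv
from pathlib import Path

# Hypothetical required columns; the real set depends on the entity you load.
REQUIRED_COLUMNS = {"external_id", "name"}


def validate_csv(path: Path) -> list[str]:
    """Lightweight input-side checks: the expected header is present, every row
    has a consistent number of fields, and required columns are not empty."""
    errors: list[str] = []
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for record_number, row in enumerate(reader, start=1):
            if None in row or None in row.values():
                errors.append(f"record {record_number}: inconsistent number of fields")
                continue
            for column in REQUIRED_COLUMNS:
                if not (row.get(column) or "").strip():
                    errors.append(f"record {record_number}: empty value for '{column}'")
    return errors
```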
Transformation
Formatting data for the TechWolf API is key. Keep transformations simple, and transform the full dataset first instead of transforming and sending each record individually. This improves observability and measurability of the input: the transformed input can be inspected as a whole.
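A minimal sketch of the transform-all-first pattern: transform the entire input into an intermediate file that can be inspected as a whole before anything is sent. The column names and the mapping in `transform_record` are placeholders, not the required API format.

```python
import csv
import json
from pathlib import Path


def transform_record(row: dict) -> dict:
    """Placeholder mapping from a source row to the structure the destination
    API expects; the column names here are made up."""
    return {"external_id": row["id"], "name": row["title"].strip()}


def transform_all(source: Path, destination: Path) -> None:
    """Transform the full input before anything is sent, so the result can be
    inspected and measured as a whole."""
    with source.open(newline="", encoding="utf-8") as f:
        transformed = [transform_record(row) for row in csv.DictReader(f)]
    destination.write_text(json.dumps(transformed, indent=2), encoding="utf-8")
```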
3. Entity dependencies
Some entities depend on other entities. For example, a learning event, which indicates that an employee has completed a course, depends on both the course and the employee being present in the TechWolf API. It may be appealing to load entities in dependency order (e.g., courses and employees before learning events). However, this increases complexity and can impact scheduling. Worst-case scenario: you want to add a job as an event to an employee. The employee needs an organizational unit; the event needs the job and the employee; the job needs its job family and group, plus job data; some job data might need a vacancy. And the job's skills must be validated before they count in the employee's skill event (see initial validation).
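Rather than modelling that full dependency graph, the TL;DR above recommends counting on eventual consistency: keep the records that fail because of a missing dependency and retry them on a later run. A minimal sketch, where `send_record` is a placeholder for your call to the destination API and `DependencyMissing` stands in for whatever error signals a missing dependency:

```python
import json
from pathlib import Path

# Hypothetical location for records that failed due to missing dependencies.
FAILED_RECORDS = Path("failed_records.json")


class DependencyMissing(Exception):
    """Stand-in for the error your API client raises when a referenced entity
    (e.g. the course behind a learning event) does not exist yet."""


def ingest(records: list[dict], send_record) -> None:
    """Try every record once; keep dependency failures for a later (e.g. daily)
    run instead of blocking the whole ingestion on load order."""
    previous_failures = (
        json.loads(FAILED_RECORDS.read_text()) if FAILED_RECORDS.exists() else []
    )
    still_failing = []
    for record in previous_failures + records:
        try:
            send_record(record)
        except DependencyMissing:
            still_failing.append(record)  # the dependency might exist tomorrow
    FAILED_RECORDS.write_text(json.dumps(still_failing))
```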
4. Initial validation
Jobs have suggested Skill Profiles. These suggestions can be used to govern your Job profiles, essentially validating that the Skills suggested for the Job are correct. Some customers prefer not to validate all Jobs upfront. Instead, they start from an initial state and then validate iteratively, removing what doesn't fit and adding what's missing. Starting from a state where all suggested skills are validated is called initial validation. The challenge with initial validation is knowing when all data for a specific Job is loaded (titles, descriptions, Vacancies, etc.). It is difficult for a system to know which data is still missing and should be awaited. TechWolf implements this case-by-case, depending on which datasources must be loaded before something is "ready for initial validation." Potential approaches:
- Track when a Job was first loaded. Wait for the system to settle (new updates may arrive after creation), then validate Jobs past that threshold (a sketch of this approach follows the list).
- If you know the exact sources a Job requires, wait for all to load before validating. This is complex, but a robust implementation can improve initial validation significantly.
- Opt out of initial validation. This can be useful if you prefer a single kickoff and then review new Jobs based on suggestions rather than an “already validated state.”
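A sketch of the first approach, under the assumption that you record when each Job was first loaded and when it last received data; the settle period is an arbitrary value to tune per customer.

```python
from datetime import datetime, timedelta, timezone

# Arbitrary settle period; tune it to how quickly your sources finish loading.
SETTLE_PERIOD = timedelta(days=7)


def ready_for_initial_validation(
    first_loaded_at: datetime,
    last_updated_at: datetime,
    now: datetime | None = None,
) -> bool:
    """One interpretation of 'settled': the Job was created a while ago and has
    not received new data for the whole settle period."""
    now = now or datetime.now(timezone.utc)
    return now - max(first_loaded_at, last_updated_at) >= SETTLE_PERIOD
```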
5. Security considerations
Authentication
Sending data in bulk to the TechWolf API requires authentication. We use our OAuth 2.0 flow. Cache tokens and refresh them before they expire.
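A minimal sketch of token caching with an expiry margin, assuming a client-credentials style flow; the token URL and payload fields shown here are generic OAuth 2.0 conventions, not TechWolf specifics.

```python
import time
from typing import Optional

import requests

# Placeholder token endpoint; use the endpoint from the API documentation.
TOKEN_URL = "https://example.com/oauth/token"
REFRESH_MARGIN_SECONDS = 60  # refresh a bit before the token actually expires


class TokenCache:
    """Fetch an OAuth 2.0 access token once and reuse it until shortly before expiry."""

    def __init__(self, client_id: str, client_secret: str):
        self._client_id = client_id
        self._client_secret = client_secret
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get_token(self) -> str:
        if self._token is None or time.time() > self._expires_at - REFRESH_MARGIN_SECONDS:
            response = requests.post(
                TOKEN_URL,
                data={
                    "grant_type": "client_credentials",
                    "client_id": self._client_id,
                    "client_secret": self._client_secret,
                },
                timeout=30,
            )
            response.raise_for_status()
            payload = response.json()
            self._token = payload["access_token"]
            self._expires_at = time.time() + payload["expires_in"]
        return self._token
```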
Encryption
When building a custom datasource integration, your data does not leave your company, so requirements may be less strict than for TechWolf-hosted integrations. Noteworthy considerations:
- Encrypt data at rest. File-based processing interacts heavily with files, so this is essential.
- Isolate integrations and files, and apply record-level isolation where possible. Limit operations on input data so one record cannot affect others.
- Beyond input data and run derivatives, do not store, log, or emit datasource data. Persisting data is the responsibility of the destination system, if at all.
- Prevent large deletes. A large delete is often not what you want; requiring a human to confirm any run that would delete more than, say, 10% of the records prevents problems later on (see the sketch after this list).
- Detect anomalies in the system in general: integrations that do not run, unusually large ingestions, mass failures, or failures that keep recurring.
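A sketch of the large-delete guard mentioned above, using the 10% figure from the list as the default threshold:

```python
MAX_DELETE_FRACTION = 0.10  # matches the 10% example above; tune per dataset


def check_delete_safety(total_records: int, records_to_delete: int) -> None:
    """Refuse to delete a suspiciously large share of the dataset without a
    human explicitly approving the run."""
    if total_records == 0:
        return
    fraction = records_to_delete / total_records
    if fraction > MAX_DELETE_FRACTION:
        raise RuntimeError(
            f"Refusing to delete {fraction:.0%} of the dataset "
            f"({records_to_delete}/{total_records}); requires manual approval."
        )
```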
Addendum: API-based datasource integrations
The following is a brief overview of considerations when building custom API-based datasource integrations.
- Consider accepting <100% data capture. If 0.3% of ticket data is missed due to network failures, that may be acceptable for certain scopes, provided the failure rate is uniformly distributed. Failures correlated with load spikes (e.g., specific time zones) are not acceptable, as they impact fairness.
- Define a strategy for catastrophic failures (e.g., missing a month of updates). Plan for drift remediation. Checkpoints or full exports can aid recovery.
- Consider falling back to file-based integrations where appropriate.