Why File-based?
In Key concepts, we explain the differences between file-based and API-based integrations, and provide a comparison of the two approaches. For TechWolf, we prefer to build file-based integrations for large data ingestions to optimize for coverage. API-based integrations require O(n) work, where n is the number of APIs, plus ongoing maintenance for each integration.
Additionally, working with files offers (1) built-in observability when things
go wrong, (2) broad support in source systems for exporting data as files, and
(3) the ability to trade immediacy for consistency, resilience, and completeness
since most input data is not time-critical.
This guide primarily focuses on file-based integrations, with notes that also apply to API-based integrations. For API-based guidance, see Addendum: API-based datasource integrations.
TL;DR - file-based datasource integration
Differential file ingestion
Depending on data size, opt for differential file ingestion (files
containing differential updates).
Minimal updates
Regardless of ingestion strategy, datasources can cause request storms
by requesting and updating all data. Implement a proper minimal-update
strategy to avoid running into global rate limits for your other
applications.
Sync consistently
Datasources are not time-critical. A daily sync with 1–2 concurrent
requests should be fine, depending on the size of the dataset.
Eventual consistency
Keep track of failed records, and retry them slowly (daily or slower)
until successful. Eventual consistency is your friend.
Strict format
Use CSV files and validate them against RFC 4180 (the CSV RFC), or use Parquet or
another strict structured format.
Entity dependencies
Entities with dependencies can tempt you into custom implementations per
entity type. Do not take on the long-term maintenance burden of keeping up
with our API's entity dependencies; count on eventual consistency instead.
An entity that could not load today due to a missing dependency might load
tomorrow.
Prevent large deletes
Protect your integration against large deletes; let a human verify if
more than X% of a dataset is deleted.
Install alerting
Install alerting for integrations that fail to run. Otherwise something
breaks, the integration stops running, and nobody notices.
1. Volume of data and considerations
Depending on the source, files can become very large. Think of files containing millions of records, each containing detailed Course descriptions or Employee feedback. Handling large files is challenging when processing rows individually or when performing complex transformations. To scale effectively, use differential files and perform minimal updates to the destination API.
Differential files
We distinguish between differential files (we call them diffs) and full dumps. A differential file contains only the changes since the last diff, while a full dump contains the entire dataset. Differential files are typically generated by the source system and can carry a flag on each record indicating upsert or delete. Some systems output a register of timestamped updates; such transactional files can be reduced to differential files by applying the updates in chronological order.
Ingesting differential files requires precise control over how you deal with failures: you track what failed and retry at some point in the future until it works. To support this, we maintain a ledger of the desired state, a single file containing everything that should exist. When a new differential file is ingested, we update the ledger rather than sending directly to the API. This allows us to retry failed records and re-ingest files after failures. It also enables point-in-time recovery using file-based integrations, keeping the downstream API untouched, which is a powerful capability for customers and for TechWolf. See Minimal updates for how we update the downstream API.
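To make the ledger idea concrete, here is a minimal sketch of applying a diff to a ledger kept as a CSV file. It assumes a hypothetical layout in which every record has a unique `id` column and every diff row carries an `operation` column with the value `upsert` or `delete`; real diffs depend on the source system.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real layout depends on the source system.
KEY_COLUMN = "id"
OPERATION_COLUMN = "operation"  # "upsert" or "delete"


def load_ledger(path: Path) -> dict[str, dict]:
    """Read the ledger (the full desired state) into memory, keyed by record id."""
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as f:
        return {row[KEY_COLUMN]: row for row in csv.DictReader(f)}


def apply_diff(ledger: dict[str, dict], diff_path: Path) -> dict[str, dict]:
    """Apply the upserts and deletes from one differential file to the ledger."""
    with diff_path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            operation = row.pop(OPERATION_COLUMN)
            if operation == "delete":
                ledger.pop(row[KEY_COLUMN], None)
            else:
                ledger[row[KEY_COLUMN]] = row
    return ledger


def save_ledger(ledger: dict[str, dict], path: Path) -> None:
    """Write the desired state back to disk (assumes all records share the same columns)."""
    rows = list(ledger.values())
    if not rows:
        path.write_text("", encoding="utf-8")
        return
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Because the ledger on disk always reflects the full desired state, a failed run can be retried against the same ledger, and files can be re-ingested after failures without touching the downstream API.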
Minimal updates
Datasource ingestion is the heaviest use of the TechWolf API. To reduce load, we perform minimal updates: we calculate the set of items that must be created, updated, or deleted to reach the desired state (the ledger). These updates are also files, giving clear visibility into what is changing. We track what already exists in the API based on previous successful runs and recorded failures; that is the state. For a given differential file, the minimal set of updates equals the difference between the current API state and the desired integration state (state vs. ledger). An additional benefit: each integration tracks its own state, so multiple integrations of the same type can run in parallel without coordinating. An example of what happens when state is not tracked separately per integration:
- You have one source system for Courses. You correctly fetch all Courses from it and load them into the TechWolf API.
- You fetch all Courses from the TechWolf API daily, compare them to the source system, and delete only what is not in the source system.
- You onboard another datasource for Courses.
- Both integrations run separately, but each wants to delete the set of Courses from the other system. The system is empty again.
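As a sketch of the minimal-update calculation described above, assuming both the state (what is known to exist in the API) and the ledger (the desired state) are dictionaries keyed by record id:

```python
def minimal_updates(
    state: dict[str, dict], ledger: dict[str, dict]
) -> dict[str, list]:
    """Compute the smallest set of create/update/delete operations that turns
    the current API state into the desired state described by the ledger."""
    creates = [record for key, record in ledger.items() if key not in state]
    updates = [
        record
        for key, record in ledger.items()
        if key in state and state[key] != record
    ]
    deletes = [key for key in state if key not in ledger]
    return {"create": creates, "update": updates, "delete": deletes}
```

The three resulting sets can themselves be written to files before anything is sent, which provides the visibility into what is changing mentioned above.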
2. Preparing the data for ingestion
Source data might not match the format required by the TechWolf API. Validation and transformation are recommended.
Validation
The most common issue is CSV formatting errors. We accept CSV files encoded per RFC 4180; see our CSV guidelines. We advise validating against the same standard on your side, which prevents CSV formatting errors and keeps wrongly formatted data from being ingested. Validate input before sending data to the TechWolf API. Although API errors are clear, lightweight input-side business validation reduces time-to-resolution (e.g., preventing a 400 due to an empty field).
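As an illustration, a lightweight input-side check might look like the sketch below. It is not a full RFC 4180 validator, and the required column names are hypothetical; the real set depends on the entity being loaded.

```python
import csv
from pathlib import Path

# Hypothetical required columns; the real set depends on the entity you load.
REQUIRED_COLUMNS = {"external_id", "name"}


def validate_csv(path: Path) -> list[str]:
    """Lightweight input-side checks: the expected header is present, every row
    has a consistent number of fields, and required columns are not empty."""
    errors: list[str] = []
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for record_number, row in enumerate(reader, start=1):
            if None in row or None in row.values():
                errors.append(f"record {record_number}: inconsistent number of fields")
                continue
            for column in REQUIRED_COLUMNS:
                if not (row.get(column) or "").strip():
                    errors.append(f"record {record_number}: empty value for '{column}'")
    return errors
```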
Transformation
Formatting data for the TechWolf API is key. Keep transformations simple, and transform the full dataset first instead of transforming and sending each record individually. This improves observability and measurability of the input: the transformed input can be inspected as a whole.
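A minimal sketch of the transform-all-first pattern: transform the entire input into an intermediate file that can be inspected as a whole before anything is sent. The column names and the mapping in `transform_record` are placeholders, not the required API format.

```python
import csv
import json
from pathlib import Path


def transform_record(row: dict) -> dict:
    """Placeholder mapping from a source row to the structure the destination
    API expects; the column names here are made up."""
    return {"external_id": row["id"], "name": row["title"].strip()}


def transform_all(source: Path, destination: Path) -> None:
    """Transform the full input before anything is sent, so the result can be
    inspected and measured as a whole."""
    with source.open(newline="", encoding="utf-8") as f:
        transformed = [transform_record(row) for row in csv.DictReader(f)]
    destination.write_text(json.dumps(transformed, indent=2), encoding="utf-8")
```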
3. Entity dependencies
Some entities depend on other entities. For example, a learning event, which indicates that an employee has completed a course, depends on both the course and the employee being present in the TechWolf API. It may be appealing to load entities in dependency order (e.g., courses and employees before learning events). However, this increases complexity and can impact scheduling. Worst-case scenario: you want to add a job as an event to an employee. The employee needs an organizational unit; the event needs the job and the employee; the job needs its job family and group, plus job data; some job data might need a vacancy. And the job's skills must be validated before they count in the employee's skill event (see initial validation).
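Rather than modelling that full dependency graph, the TL;DR above recommends counting on eventual consistency: keep the records that fail because of a missing dependency and retry them on a later run. A minimal sketch, where `send_record` is a placeholder for your call to the destination API and `DependencyMissing` stands in for whatever error signals a missing dependency:

```python
import json
from pathlib import Path

# Hypothetical location for records that failed due to missing dependencies.
FAILED_RECORDS = Path("failed_records.json")


class DependencyMissing(Exception):
    """Stand-in for the error your API client raises when a referenced entity
    (e.g. the course behind a learning event) does not exist yet."""


def ingest(records: list[dict], send_record) -> None:
    """Try every record once; keep dependency failures for a later (e.g. daily)
    run instead of blocking the whole ingestion on load order."""
    previous_failures = (
        json.loads(FAILED_RECORDS.read_text()) if FAILED_RECORDS.exists() else []
    )
    still_failing = []
    for record in previous_failures + records:
        try:
            send_record(record)
        except DependencyMissing:
            still_failing.append(record)  # the dependency might exist tomorrow
    FAILED_RECORDS.write_text(json.dumps(still_failing))
```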
4. Initial validation
Jobs have suggested Skill Profiles. These suggestions can be used to govern your Job profiles, essentially validating that the Skills suggested for the Job are correct. Some customers prefer not to validate all Jobs upfront. Instead, they start from an initial state and then validate iteratively, removing what doesn't fit and adding what's missing. Starting from a state where all suggested skills are validated is called initial validation. The challenge with initial validation is knowing when all data for a specific Job is loaded (titles, descriptions, Vacancies, etc.). It is difficult for a system to know which data is still missing and should be awaited. TechWolf implements this case-by-case, depending on which datasources must be loaded before something is "ready for initial validation." Potential approaches:
- Track when a Job was first loaded. Wait for the system to settle (new updates may arrive after creation), then validate Jobs past that threshold (a sketch of this approach follows the list).
- If you know the exact sources a Job requires, wait for all to load before validating. This is complex, but a robust implementation can improve initial validation significantly.
- Opt out of initial validation. This can be useful if you prefer a single kickoff and then review new Jobs based on suggestions rather than an “already validated state.”
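A sketch of the first approach, under the assumption that you record when each Job was first loaded and when it last received data; the settle period is an arbitrary value to tune per customer.

```python
from datetime import datetime, timedelta, timezone

# Arbitrary settle period; tune it to how quickly your sources finish loading.
SETTLE_PERIOD = timedelta(days=7)


def ready_for_initial_validation(
    first_loaded_at: datetime,
    last_updated_at: datetime,
    now: datetime | None = None,
) -> bool:
    """One interpretation of 'settled': the Job was created a while ago and has
    not received new data for the whole settle period."""
    now = now or datetime.now(timezone.utc)
    return now - max(first_loaded_at, last_updated_at) >= SETTLE_PERIOD
```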
5. Security considerations
Authentication
Sending data in bulk to the TechWolf API requires authentication. We use our OAuth 2.0 flow. Cache tokens and refresh them before they expire.
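A minimal sketch of token caching with an expiry margin, assuming a client-credentials style flow; the token URL and payload fields shown here are generic OAuth 2.0 conventions, not TechWolf specifics.

```python
import time
from typing import Optional

import requests

# Placeholder token endpoint; use the endpoint from the API documentation.
TOKEN_URL = "https://example.com/oauth/token"
REFRESH_MARGIN_SECONDS = 60  # refresh a bit before the token actually expires


class TokenCache:
    """Fetch an OAuth 2.0 access token once and reuse it until shortly before expiry."""

    def __init__(self, client_id: str, client_secret: str):
        self._client_id = client_id
        self._client_secret = client_secret
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get_token(self) -> str:
        if self._token is None or time.time() > self._expires_at - REFRESH_MARGIN_SECONDS:
            response = requests.post(
                TOKEN_URL,
                data={
                    "grant_type": "client_credentials",
                    "client_id": self._client_id,
                    "client_secret": self._client_secret,
                },
                timeout=30,
            )
            response.raise_for_status()
            payload = response.json()
            self._token = payload["access_token"]
            self._expires_at = time.time() + payload["expires_in"]
        return self._token
```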
Encryption
When building a custom datasource integration, your data does not leave your company, so requirements may be less strict than for TechWolf-hosted integrations. Noteworthy considerations:
- Encrypt data at rest. File-based processing interacts heavily with files, so this is essential.
- Isolate integrations and files, and apply record-level isolation where possible. Limit operations on input data so one record cannot affect others.
- Beyond input data and run derivatives, do not store, log, or emit datasource data. Persisting data is the responsibility of the destination system, if at all.
- Prevent large deletes. A large delete is often not what you want; requiring a human to confirm any run that would delete more than, say, 10% of the records prevents problems later on (see the sketch after this list).
- Detect anomalies in the system in general: integrations that do not run, unusually large ingestions, mass failures, or failures that keep recurring.
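A sketch of the large-delete guard mentioned above, using the 10% figure from the list as the default threshold:

```python
MAX_DELETE_FRACTION = 0.10  # matches the 10% example above; tune per dataset


def check_delete_safety(total_records: int, records_to_delete: int) -> None:
    """Refuse to delete a suspiciously large share of the dataset without a
    human explicitly approving the run."""
    if total_records == 0:
        return
    fraction = records_to_delete / total_records
    if fraction > MAX_DELETE_FRACTION:
        raise RuntimeError(
            f"Refusing to delete {fraction:.0%} of the dataset "
            f"({records_to_delete}/{total_records}); requires manual approval."
        )
```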
Addendum: API-based datasource integrations
The following is a brief overview of considerations when building custom API-based datasource integrations.
- Consider accepting <100% data capture. If 0.3% of ticket data is missed due to network failures, that may be acceptable for certain scopes, provided the failure rate is uniformly distributed. Failures correlated with load spikes (e.g., specific time zones) are not acceptable, as they impact fairness.
- Define a strategy for catastrophic failures (e.g., missing a month of updates). Plan for drift remediation. Checkpoints or full exports can aid recovery.
- Consider falling back to file-based integrations where appropriate.