Bridging the OT/IT Gap: How to Integrate Legacy Systems and APIs into a Unified Data Platform

If I have to sit through one more slide deck claiming "seamless digital transformation" without a single mention of how they handle OPC-UA tag ingestion or messy ERP tables, I’m going to lose my mind. We’ve all been there: the shop floor is running on PLCs from the 90s, the MES is a siloed black box, and the corporate team wants an Industry 4.0 dashboard that updates in real time. But how do you actually get there?


As a data lead, my priority isn't the vision—it's the plumbing. You need a platform that can handle both the batch-heavy world of SAP/ERP systems and the high-frequency streaming requirements of IoT sensors. If you aren't talking about Kafka for streaming or Airflow for orchestration, you aren't building a data platform; you’re building a graveyard for data.

The Reality of Disconnected Manufacturing Data

The manufacturing stack is naturally fragmented. You have your Operational Technology (OT) generating high-fidelity vibration and temperature data, and your Information Technology (IT) generating transactional business data. Bridging these is the "holy grail" of Industry 4.0.

In my experience, the biggest failure point is assuming you can just dump everything into a data lake and "figure it out later." That’s a recipe for a data swamp. You need an architecture that respects the gravity of data—keeping high-velocity OT data near the edge while unifying it with business context in the cloud.

Selecting Your Architectural Foundation

When evaluating vendors, I look for people who know the difference between a managed service and a black box. Whether you’re leaning toward Azure (especially with the recent maturity of Microsoft Fabric) or AWS (with its robust IoT Greengrass and Kinesis ecosystem), your choice of partners matters.

I’ve seen great work from firms like STX Next, who understand the nuances of Python-based automation and data engineering, and NTT DATA, who can handle the massive-scale integration that global manufacturing firms require. Similarly, Addepto has been making waves in the AI/ML integration space, which is where you eventually want to take your data once it’s unified.

The Comparison Matrix

| Component | Legacy/Batch Strategy | Streaming/API Strategy |
| --- | --- | --- |
| Ingestion | Airflow + Custom Connectors | Kafka / MQTT Brokers |
| Processing | dbt on Snowflake/Databricks | Flink or Spark Structured Streaming |
| Storage | Data Lakehouse (Parquet/Delta) | Time-series DB (Influx/Timescale) |
| Observability | Great Expectations / Monte Carlo | Prometheus / Grafana |

The "Week 2" Test: How Fast Can You Start?

Here is my challenge to any vendor: How fast can you start and what do I get in week 2? If you tell me you need three months for "requirements gathering," we are done. By week two, I expect a functional ingestion pipeline from at least one PLC and one API, landing in a staging table. If you don't have a record of "time-to-first-byte," you don't have a platform.
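To make the "week 2" bar concrete, here is a minimal sketch of the kind of deliverable I mean: raw records from two sources landing in one staging table. The sources, tag names, and SQLite staging store are all stand-ins; in practice the table lives in Snowflake, Delta, or Postgres, and the pollers run under Airflow.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical "week 2" deliverable: land raw readings from a PLC poller
# and an ERP REST API into a single staging table. SQLite stands in for
# whatever staging store you actually use.

def ensure_staging(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS staging_raw (
            source TEXT NOT NULL,        -- e.g. 'plc' or 'erp_api'
            ingested_at TEXT NOT NULL,   -- UTC ISO timestamp
            payload TEXT NOT NULL        -- raw record, untouched JSON
        )
    """)

def land(conn: sqlite3.Connection, source: str, record: dict) -> None:
    # Land records raw; transformation belongs downstream (dbt), not here.
    conn.execute(
        "INSERT INTO staging_raw (source, ingested_at, payload) VALUES (?, ?, ?)",
        (source, datetime.now(timezone.utc).isoformat(), json.dumps(record)),
    )

conn = sqlite3.connect(":memory:")
ensure_staging(conn)
land(conn, "plc", {"tag": "ns=2;s=Line1.Temp", "value": 71.4})
land(conn, "erp_api", {"order_id": "SO-1001", "status": "released"})
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM staging_raw").fetchone()[0])
```

It is deliberately dumb: one table, raw JSON, no cleverness. If a vendor cannot show you this much by week two, the three-month roadmap is not going to save them.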

Proof Points I Demand

- Records per day: Can the ingestion layer handle 10M+ events per day from legacy sensors without choking?
- Downtime %: What is the historical availability of your pipeline orchestration?
- Latency: Show me the delta between sensor trigger and dashboard update.

API Ingestion and Legacy Integration

Legacy integration isn't just about PLCs. It's about coaxing data out of antiquated APIs that were never designed for scale. You need a wrapper strategy. Don't let your ERP direct-connect to your cloud platform. Use an intermediate layer—a staging zone—where you can run your dbt transformations to clean up the garbage data that inevitably comes out of legacy MES systems.
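One way to sketch that wrapper layer, assuming nothing about your actual client library: a throttled client with a fixed request budget and retry-with-backoff, so the legacy endpoint is never hammered directly. `fetch`, the rate limit, and the retry count are all illustrative placeholders.

```python
import time

# Hedged sketch of a polite wrapper around a legacy API: a fixed request
# budget per minute plus retry with exponential backoff. The fetch callable
# is a stand-in for your real HTTP/SOAP client.

class ThrottledClient:
    def __init__(self, fetch, max_per_minute=30, retries=3):
        self.fetch = fetch
        self.min_interval = 60.0 / max_per_minute  # seconds between calls
        self.retries = retries
        self._last_call = 0.0

    def get(self, resource):
        for attempt in range(self.retries):
            # Enforce the request budget before every attempt.
            wait = self.min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
            self._last_call = time.monotonic()
            try:
                return self.fetch(resource)
            except ConnectionError:
                time.sleep(2 ** attempt)  # back off before retrying
        raise RuntimeError(f"gave up on {resource} after {self.retries} tries")
```

In use, every consumer goes through `ThrottledClient` rather than calling the ERP directly, which is the whole point: the intermediate layer owns the pacing, the retries, and eventually the landing into staging.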

When you integrate APIs, you need to account for:

- Rate Limiting: Legacy systems will crash if you pull too hard. Use a queue.
- Schema Drift: ERP schemas change. You need observability.
- Backfilling: Your pipeline must be idempotent. If a job fails, can you re-run it for a specific window without duplicating rows?
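The backfill point is worth a sketch, because it is where most pipelines quietly rot. The pattern below (delete-then-insert for a time window, inside one transaction) is one common way to get idempotency; the table and column names are placeholders for your own staging schema.

```python
import sqlite3

# Idempotent backfill sketch: re-running a window replaces that window's
# rows instead of appending duplicates.

def backfill_window(conn, rows, window_start, window_end):
    with conn:  # one transaction: delete + insert succeed or fail together
        conn.execute(
            "DELETE FROM sensor_events WHERE event_ts >= ? AND event_ts < ?",
            (window_start, window_end),
        )
        conn.executemany(
            "INSERT INTO sensor_events (event_ts, tag, value) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_events (event_ts TEXT, tag TEXT, value REAL)")
rows = [("2024-05-01T00:05:00", "Line1.Temp", 71.2),
        ("2024-05-01T00:10:00", "Line1.Temp", 71.9)]
# Running the same window twice leaves exactly one copy of each row.
backfill_window(conn, rows, "2024-05-01T00:00:00", "2024-05-01T01:00:00")
backfill_window(conn, rows, "2024-05-01T00:00:00", "2024-05-01T01:00:00")
print(conn.execute("SELECT COUNT(*) FROM sensor_events").fetchone()[0])  # 2
```

In a warehouse you would express the same idea as a dbt incremental model or a MERGE keyed on event ID, but the test is identical: run the job twice for the same window and count the rows.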

Batch vs. Streaming: The False Choice

Stop asking if you need batch or streaming. You need both. You need streaming for the "Are we currently burning down the factory?" alerts. You need batch for the "What was our OEE for last month?" analytics. Modern lakehouses like Databricks or Snowflake allow you to treat both with a unified semantic layer. If your vendor says you have to choose one, they are selling you old technology.
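The split is easy to see on the same event stream. A rough sketch, with an invented threshold and toy events: one consumer evaluates per event as it arrives (the streaming path), the other rolls up the full window (the batch path).

```python
from collections import defaultdict

# Same events, two consumers: a per-event alert (streaming path) and a
# whole-window rollup (batch path). Threshold and events are illustrative.

ALERT_THRESHOLD_C = 90.0

def streaming_alerts(events):
    # Evaluated the moment each reading arrives, like a Flink/Spark
    # Structured Streaming job.
    for e in events:
        if e["temp_c"] > ALERT_THRESHOLD_C:
            yield f"ALERT {e['line']} at {e['ts']}: {e['temp_c']}C"

def batch_rollup(events):
    # Evaluated over the closed window, like a nightly dbt/Spark batch job.
    totals = defaultdict(list)
    for e in events:
        totals[e["line"]].append(e["temp_c"])
    return {line: round(sum(v) / len(v), 1) for line, v in totals.items()}

events = [
    {"ts": "08:00", "line": "L1", "temp_c": 71.0},
    {"ts": "08:05", "line": "L1", "temp_c": 93.5},
    {"ts": "08:00", "line": "L2", "temp_c": 68.0},
]
print(list(streaming_alerts(events)))
print(batch_rollup(events))
```

Same data, two access patterns. The unified semantic layer is what lets both consumers agree on what "temperature on Line 1" even means.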

Final Thoughts


To succeed in manufacturing data, you have to move past the buzzwords. I don't care about your "AI-powered roadmap." I care about how you handle authentication for a 15-year-old SOAP API, how you manage partition strategies in your S3 bucket, and how you monitor for stale data.
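Stale-data monitoring, at least, is cheap to sketch. Assuming you can get a high-watermark timestamp per table from your warehouse, a freshness check is just a comparison against a per-table SLA; the table names and SLAs below are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Stale-data check sketch: compare each table's newest event timestamp
# against a per-table freshness SLA. In practice the watermarks come from
# MAX(event_ts) queries in your warehouse, not a hand-built dict.

FRESHNESS_SLA = {
    "sensor_events": timedelta(minutes=5),   # streaming path
    "erp_orders": timedelta(hours=24),       # nightly batch path
}

def stale_tables(watermarks: dict, now: datetime) -> list:
    stale = []
    for table, sla in FRESHNESS_SLA.items():
        last_seen = watermarks.get(table)
        if last_seen is None or now - last_seen > sla:
            stale.append(table)
    return stale

now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
watermarks = {
    "sensor_events": now - timedelta(minutes=2),   # within SLA
    "erp_orders": now - timedelta(hours=30),       # blew its SLA
}
print(stale_tables(watermarks, now))  # ['erp_orders']
```

Wire the output into whatever pages you (Prometheus alert, Slack webhook) and you have the beginnings of the observability layer that the comparison matrix above demands.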

If you're building this out, find a partner that knows the tools. Whether it's STX Next digging into the backend logic, NTT DATA managing the enterprise rollout, or Addepto building the predictive maintenance models, ensure they are talking about the stack—not just the theory.


Let's get to work. What’s the plan for Monday morning?