
Data Architectures Part 1

Data Storage and Ingestion

Data architecture is a term that describes a formal approach to handling data, from its inception to its use as an output, including provisions for its storage. The actions performed over the data are considered processes, and these are expected to vary from business to business. While it is difficult to prescribe a single setup for your business, there are several steps that can be taken to identify an approach tailored to your needs. In this article, we will discuss one of the five dimensions of a data architecture that is able to produce data summaries and descriptive analytics.

 

Developing an approach that suits your business

What does the data architecture currently look like for your business? The answer to this question depends on the state of your data and your processes, two topics that we discussed in State of Your Data. The underlying idea is that before we develop an approach to enhancing your data capabilities, we have to define the existing arrangements that are in place. For example, consider the steps that are taken to gather the information needed for regulatory reporting, tax filings, and performance summaries.

The following will help you to start analyzing your existing setup:

  1. Make a diagram showing the flow of the data from the point where it is entered in your system to when it is captured in a report or summary.

  2. List the existing processes at each step. 

  3. Identify the gaps between your current state and the desired state. For example, what type of information or changes are you seeking?
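As a hypothetical illustration, the mapping produced by the three steps above can be captured in a simple structure. All system names, processes, and gaps below are invented placeholders; substitute the systems your business actually uses.

```python
# A minimal sketch of a data-flow map; every entry here is hypothetical.
data_flow = {
    "sales_pos": {
        "entry_point": "point-of-sale terminal",
        "processes": ["nightly export to CSV", "manual entry into spreadsheet"],
        "outputs": ["monthly performance summary"],
        "gaps": ["no automated reconciliation with accounting system"],
    },
    "accounting": {
        "entry_point": "QuickBooks",
        "processes": ["quarterly export for tax filing"],
        "outputs": ["regulatory reports", "tax filings"],
        "gaps": ["duplicate customer records across files"],
    },
}

# Step 3: collect every gap across systems to prioritize enhancements.
all_gaps = [(system, gap)
            for system, info in data_flow.items()
            for gap in info["gaps"]]
for system, gap in all_gaps:
    print(f"{system}: {gap}")
```

Even a structure this small makes the current state explicit and gives you a checklist to work from when deciding which enhancements matter most.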

 

Having a clear mapping of the systems and processes that your business has in place will allow you to articulate whether the existing state is adequately meeting your needs and how to go about making enhancements. With your current setup in view, we will cover data storage and ingestion as part of an architecture equipped for analytics. 

 

Data architectures capable of producing descriptive analytics 

We deconstruct the concept of data architecture into five dimensions, conceived as phases in a cyclical process for developing analytics, borrowed from software development operations (DevOps). The five dimensions are: data storage and ingestion; data transformation; data exploration; deployment and reporting; and monitoring and continuous development.

 

Data Storage and Ingestion

Where is your data stored? Data storage includes provisions for onsite and offsite hardware; identifying current storage capacity and gauging future needs; the frequency, location, and security of backups; and cloud services. Different solutions are available, such as data lakes and data warehouses.

 

Database Type. At its most basic level, this refers to the drives where your files are stored and their capabilities. For example, Windows and macOS provide a graphical user interface for filing systems organized by folder. Each folder contains data governed by the limitations of its file types and the software that produced them.

 

Excel files contain elements of what is known as relational data, where column names and row indices point to specific values. This file type organizes data in tables, known as tabular data. The primary limitation of tabular data is persistence of form: because information is entered manually and few restrictions are in place, it is difficult to enforce the data types and formulas expected even within a single file. A related limitation is the difficulty of consistently integrating data that resides in different files.
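To see the persistence-of-form problem concretely, here is a sketch that validates a small, invented CSV export against the types each column is expected to hold. Nothing in a spreadsheet prevents a stray text value from landing in a numeric column, so a validation pass like this is often the first step before analysis.

```python
import csv
import io

# A hypothetical export from a spreadsheet; the "n/a" value is a
# manual entry that breaks the expected numeric type of the column.
raw = io.StringIO(
    "order_id,amount\n"
    "1001,250.00\n"
    "1002,n/a\n"
    "1003,75.50\n"
)

rows, errors = [], []
for i, row in enumerate(csv.DictReader(raw), start=1):
    try:
        # Enforce the types the file format itself cannot guarantee.
        rows.append({"order_id": int(row["order_id"]),
                     "amount": float(row["amount"])})
    except ValueError:
        errors.append((i, row))

print(f"{len(rows)} valid rows, {len(errors)} type violations")
```

In a structured database, the same constraint would be declared once on the column and enforced automatically on every insert.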

 

Word documents (.doc) contain text, digits, and images. Text documents (.txt) are similar but can be saved with formatting instructions (rich text) or as plain text. The information in these types of documents is grouped only by file name and is thus unstructured; structure is defined at the time of processing. To analyze this type of data, programmers write a script to read in the text and, where necessary, impose a structure with regular expressions (regex), depending on the type of analysis being performed. For example, if we are working with email files in text format, we can expect fields to indicate sections, and it would be necessary to extract the text that follows the fields needed for analysis. Insights can then be derived from the text with a variety of machine learning approaches for natural language processing, including topic modeling, querying text and generating responses from learned representations, and next-word prediction.
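The email example can be sketched in a few lines. The field names below follow standard email headers, but the message content is invented; the regex simply captures whatever text follows each field of interest.

```python
import re

# A hypothetical plain-text email file.
email_text = """From: customer@example.com
Subject: Order inquiry
Date: Mon, 3 Mar 2025

Hello, I have a question about my recent order.
"""

# Impose structure at processing time: extract the text that
# follows each field needed for the analysis.
fields = {}
for name in ("From", "Subject"):
    match = re.search(rf"^{name}:\s*(.+)$", email_text, re.MULTILINE)
    if match:
        fields[name] = match.group(1)

print(fields)
```

The unstructured file now yields a small structured record that can be stored, aggregated, or fed into a natural language processing pipeline.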

 

What about other software used to capture everyday data points? This includes staff-facing and customer-facing software such as QuickBooks, Excel, point-of-sale systems, online forms, customer relationship management (CRM) systems, and enterprise resource planning (ERP) systems. As you can see, your business will have data originating from many different sources. To leverage that data for analytic inference and strategic decisions, it needs to be restructured and integrated.

 

"Integration of multiple information systems aims at combining selected systems so that they form a unified new whole and give users the illusion of interacting with one single information system. The reason for integration is twofold: First, given a set of existing information systems, an integrated view can be created to facilitate information access and reuse through a single information access point. Second, given a certain information need, data from different complementing information systems is combined to gain a more comprehensive basis to satisfy the need." (Ziegler & Dittrich, pg. 1)

 

Another benefit of having structured and integrated data is that it allows you to define rules around naming conventions, limit data types, consistently apply formulas to aggregate data, and classify groups of data (think of customer information versus product orders). This can be implemented as a database management system (DBMS). A database management system "enables users to define, create, maintain, and control access to the database...It is the software that interacts with the users' application programs and the database" (Connolly & Begg, pg. 15). With a DBMS and restructured data in place, you would have a unified interface for interacting with the whole of your data, along with the ability to use structured query language (SQL) to create table views and perform analyses on demand.

A smaller-scale alternative is to create an executable program that extracts data from your software, transforms it into the necessary formats, and produces sets of predetermined analyses on a routine basis. This aligns with the extract, transform, load (ETL) process used in data engineering. Ultimately, it is helpful to understand that ETL sets up the recurring process of connecting data from one software program to another endpoint; pipelines allow data to be transported, in this case to a structured database with a graphical user interface (DBMS). What varies is the scale of implementation: a defined script that can be executed, or software with a graphical user interface.
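A minimal ETL sketch, assuming an invented CSV export and using Python's built-in SQLite module to stand in for the target DBMS, looks like this. The source data and table schema are hypothetical; a production pipeline would read from your actual software exports and load into your actual database.

```python
import csv
import io
import sqlite3

# Extract: a hypothetical CSV export from a point-of-sale system.
raw = io.StringIO(
    "order_id,customer,amount\n"
    "1001,Acme Co,250.00\n"
    "1002,Beta LLC,75.50\n"
)
extracted = list(csv.DictReader(raw))

# Transform: enforce types and naming conventions before loading.
transformed = [
    (int(r["order_id"]), r["customer"].strip(), float(r["amount"]))
    for r in extracted
]

# Load: an in-memory SQLite database stands in for the target DBMS,
# which declares and enforces the types on every insert.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer TEXT, amount REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)

# With the data structured, SQL analyses are available on demand.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(f"Total order amount: {total}")
```

Scheduled to run on a routine basis, a script of this shape is the small-scale version of the pipeline; the large-scale version performs the same three steps against a managed DBMS.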

 

The primary benefit of setting up a relational, structured database is that it allows you to efficiently query and create tables over your data. The resulting question is: how can you set up your data stores so that they are not isolated by program but are instead accessible for analysis?

In today’s data-driven environment, businesses can no longer afford to let their information remain siloed across disconnected systems. By investing in a structured and integrated data architecture, starting with thoughtful storage and ingestion strategies, you can set the stage for meaningful analysis and smarter decision-making. Whether through a relational database or a streamlined ETL process, the goal is the same: to make data accessible, reliable, and ready for action. With the right foundation, your business can gain the power to transform raw inputs into clear, strategic insights.

 

Connolly, Thomas M., Begg, Carolyn E. (2010). Database Systems: A Practical Approach to Design, Implementation, and Management. Fifth edition. Addison-Wesley.

 

Ziegler, Patrick, Dittrich, Klaus R. (2007). Data Integration: Problems, Approaches, and Perspectives. In: Krogstie, John, Opdahl, Andreas L., Brinkkemper, Sjaak (eds) Conceptual Modelling in Information Systems Engineering. Springer, Berlin, Heidelberg.
