Apache Spark stands as a cornerstone in modern data management, vital for its high-performance processing of big data. Its ability to handle vast data volumes across clusters makes it essential for businesses dealing with large datasets. Spark's in-memory computing dramatically accelerates data processing, outperforming traditional technologies like Hadoop MapReduce, especially for iterative workloads where intermediate results can stay in memory rather than being written to disk between steps.
Spark's versatility extends to advanced analytics, supporting everything from basic data queries to sophisticated machine learning and real-time streaming analytics. This makes it a comprehensive tool for various data processing needs. With APIs in Python, Java, and Scala, Spark is approachable for developers, and its DataFrame and Dataset structures simplify complex tasks.
Spark is at the core of data platforms such as Databricks, and is used in both managed clusters and cloud configurations for large-scale data analysis and transformation. It’s the central component in Sosnoski Software’s approach to data engineering.
Apache Kafka is an open source distributed event streaming platform that plays a pivotal role in modern data management. It excels at processing high volumes of data in real time, offering high throughput, scalability, and fault tolerance. Kafka can handle millions of messages per second, making it essential for systems that depend on immediate data processing and analytics.
Its architecture ensures data durability and system reliability through data replication and persistent storage, safeguarding against potential system failures. Kafka’s scalability is a key feature, allowing businesses to easily expand their data infrastructure in response to increased demands.
Kafka also supports decoupling of producers and consumers, facilitating independent operation and providing flexibility in data processing rates. Kafka's ecosystem, which includes Kafka Connect and Kafka Streams, enables versatile data integration and sophisticated stream processing, respectively.
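To make the decoupling idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Kafka client API: the `TopicLog` class, the group names, and the sample readings are all hypothetical. It only illustrates the principle that producers append to a log while each consumer group tracks its own read position and proceeds at its own rate.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for a single-partition topic (not the Kafka API)."""

    def __init__(self):
        self._records = []                # append-only record log
        self._offsets = defaultdict(int)  # next offset to read, per consumer group

    def produce(self, value):
        """Append a record and return its offset; the producer never waits on consumers."""
        self._records.append(value)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return up to max_records unread records for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for reading in (71, 74, 69, 120, 72):  # hypothetical monitoring events
    log.produce(reading)

fast = log.poll("alerts", max_records=5)         # one group reads everything at once
slow = log.poll("archive", max_records=2)        # another group reads at its own pace
slow_next = log.poll("archive", max_records=2)   # and later resumes where it left off
```

In a real deployment the log is partitioned, replicated, and persisted by the brokers, and offsets are committed by the consumer groups; the sketch leaves all of that out.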
Kafka is an essential data management component for enterprises that deal with high-frequency events, including everything from financial services to travel to medical monitoring. It lets you optimize your data flows and rapidly respond to issues or trends that affect your operations. At Sosnoski Software we’re excited to see how Kafka is revolutionizing event-driven architectures, and eager to help you make use of Kafka for your organization’s data.
Data lakes are foundational to modern data engineering, offering a scalable repository for storing vast amounts of raw data from various sources. Their flexibility allows for the retention of unstructured data, making them ideal for complex analytics and machine learning. These repositories support the dynamic needs of enterprises by allowing for real-time analytics and big data processing at a lower cost.
Data lakes complement cloud-based architectures with their scalable storage and compute resources, making them economical for managing large datasets. They serve as both a repository for unprocessed data and a processing space where data can be prepared for analysis or reporting.
While data lakes provide significant advantages, they require stringent security and governance to ensure data integrity and compliance with privacy regulations. We at Sosnoski Software are happy to assist you with both implementing and managing data lakes as a part of your organization's data strategy.
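The Extract, Transform, Load (ETL) process is a fundamental data integration procedure: data is extracted from its source systems, transformed into a clean and consistent form, and loaded into a target repository.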
The ETL process is often automated using specialized ETL tools or custom-built scripts and workflows, orchestrated with management platforms such as Apache Airflow or services such as Azure Data Factory. It is key to creating a centralized, organized, and clean data repository that fully supports downstream analytics, reporting, and business intelligence applications.
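As a rough illustration of the three ETL stages, the following sketch uses only the Python standard library. The sample CSV data, the `payments` table, and the validation rule are assumptions made up for the example; a production pipeline would typically run steps like these under an orchestration tool rather than as a single script.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a small CSV export with inconsistent formatting.
RAW_CSV = """customer,amount,currency
 Alice ,10.50,usd
Bob,n/a,USD
carol,7,usd
"""


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names and currency codes, drop rows with unparseable amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip records that fail validation
        clean.append((row["customer"].strip().title(), amount, row["currency"].upper()))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database standing in for the target store
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT customer, amount, currency FROM payments ORDER BY customer"
).fetchall()
```

Only validated, normalized rows reach the target table here; the row with the unparseable amount is dropped during the transform step.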
ELT (Extract, Load, Transform) is a variation of the ETL process where the transformation step occurs after loading the data into a staging area or data lake. ELT is often the preferred approach when dealing with large volumes of data or with inconsistent or unknown data formats. Since data is loaded as soon as possible, it's always available for processing. Tools that don't require structured data can interact with it immediately, without waiting for the transform step to complete.
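The difference in ordering can be sketched as follows, again using only the Python standard library. The `staging` and `readings` tables, the sample payloads, and the unit-conversion rule are hypothetical, and an in-memory SQLite database stands in for a staging area or data lake.

```python
import json
import sqlite3

# Hypothetical raw events with inconsistent formats, as they might arrive from sources.
RAW_EVENTS = [
    '{"device": "a1", "temp_c": "21.5"}',
    '{"device": "b2", "temp": 70, "unit": "F"}',
    "not json at all",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE readings (device TEXT, temp_c REAL)")

# Extract + Load: store every payload untouched, so nothing waits on a transform.
conn.executemany("INSERT INTO staging (payload) VALUES (?)", [(e,) for e in RAW_EVENTS])
conn.commit()

# The raw data is already queryable at this point.
staged_count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]


def transform_staged(conn):
    """Transform: read from staging later and write normalized rows to readings."""
    rows = conn.execute("SELECT payload FROM staging ORDER BY id").fetchall()
    for (payload,) in rows:
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # leave unparseable payloads in staging for later inspection
        if "temp_c" in event:
            temp_c = float(event["temp_c"])
        elif event.get("unit") == "F":
            temp_c = round((event["temp"] - 32) * 5 / 9, 1)  # Fahrenheit to Celsius
        else:
            continue
        conn.execute("INSERT INTO readings VALUES (?, ?)", (event["device"], temp_c))
    conn.commit()


transform_staged(conn)
readings = conn.execute("SELECT device, temp_c FROM readings ORDER BY device").fetchall()
```

Because the raw payloads stay in staging, the transform can be revised and rerun later without going back to the original sources.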
Sosnoski Software has extensive experience with ETL, including working with a range of integration platforms and tools. We're accustomed to the complexities of extracting data from diverse sources, including everything from legacy flatfile formats to the latest commercial APIs, and we take pride in optimizing transforms for accuracy, consistency, and performance. Whatever your ETL / ELT requirements, we can help!
Sosnoski Software Associates Limited is the Aotearoa New Zealand company established by Dennis Sosnoski to provide consulting services around all aspects of enterprise data communications and data engineering. Dennis personally oversees all projects, and depending on your requirements may apply his expertise directly to meet your goals, or assemble a team of trusted associates. These professionals are selected for their specific skills and proven excellence, ensuring we have the best expertise for your project's requirements.
With over 20 years of professional experience in data interchange and data engineering, Dennis has developed a deep understanding and expertise in managing and optimizing data systems for diverse needs. His journey in this field has equipped him with the insights and skills necessary to tackle complex data challenges, ensuring efficient, reliable, and secure data solutions for your projects. He is committed to leveraging this extensive background to deliver exceptional value and innovative solutions tailored to your specific objectives.
You can find out more about some of Dennis's past articles, presentations, and open source projects below.
Company founder Dennis Sosnoski was a prolific writer on data exchange and JVM topics until about 2015, when he stepped back from writing to focus on increasing professional commitments and the demands of ongoing projects. Here are a few of his published articles which are still relevant today, with links for translations where appropriate.
The JVM concurrency series was one of the most popular on the IBM developerWorks site before its reorganization to focus on IBM products, and some of the articles in the series are still widely referenced today. Here are links to archived copies of the articles in the series:
Java and Scala concurrency basics | Translations: Chinese | Russian | Japanese
Java 8 concurrency basics | Translations: Chinese | Russian | Japanese
To block, or not to block? | Translations: Chinese | Russian | Japanese
Asynchronous event handling in Scala | Translations: Chinese | Russian | Japanese
Acting asynchronously with Akka | Translations: Chinese | Russian | Japanese | Portuguese
Building actor applications with Akka | Translations: Chinese | Russian | Japanese
Keeping Your Secrets was published on InfoQ in 2013, but the information is still useful for developers using Java or Java library-based languages such as Kotlin and Scala. Besides the original article you can also view the slides from Dennis's related Securing Your Communications presentation to the Seattle Android Developers Meetup group in August 2014.
Company founder Dennis Sosnoski was an active presenter for conferences and user groups until about 2015, when he cut back on other activities to focus on increasing professional commitments and the demands of ongoing projects. Here are slide sets for a couple of his later presentations which are still relevant today.
Securing Your Communications: Why (Android) SSL isn't always secure... and how to make it secure is Dennis's presentation to the Seattle Android Developers meetup group on 13 August 2014. This covers both general threats to SSL security and some Android-specific issues, including ways to improve your security by using the new Play Services ProviderInstaller API to get updated Conscrypt security code or using the open source Bouncy / Spongy Castle library to replace the platform implementation.
Acting Concurrent: The Akka actor model for Java and Scala is Dennis's presentation to the Seattle Java Users Group meeting on 19 August 2014. This starts with a look at Java 8 and Scala concurrency features, then moves on to using Akka actors in both Scala and Java, including discussion of retrofitting existing Java applications to Akka.
Company founder Dennis Sosnoski has been an active participant in the open source community for many years. He has made major contributions in the XML and web services areas, including as an Apache Software Foundation committer and Project Management Committee member for several web services projects, and also participated in related Java Specification Requests (JSRs) which shaped the evolution of the Java platform.
He led the development of the widely used JiBX XML Data Binding tool (most recent version on GitHub), providing a very fast and flexible way of converting Java objects to and from XML representations. He also worked on XML data transfer efficiency, defining the XBIS XML Information Set Encoding format used for very high performance web services based on JiBX.
support@sosnoski.com