[progress News] [progress Openedge Abl] Bulk Data Movement To Sql Databases In Hybrid Cloud

  • Thread starter Saikrishna Teja Bobba
  • Start date
Status
Not open for further replies.
S

Saikrishna Teja Bobba

Guest
Saikrishna recently experimented with bulk exporting huge datasets from Hadoop to SQL Server and measured the performance of various scenarios. Take a look at the results.

With the rise of big data in the cloud and demand for open analytics, we’re seeing bulk data movement as a common data integration pattern.

Big data platforms are migrating to the cloud (such as Amazon EMR, IBM BigInsights on Cloud, Microsoft Azure HDInsight and SAP Altiscale) to better handle elastic workloads that consume expanding cloud-resident data sets and streams. Advanced analytics techniques are then used to crunch these data sets into valuable insights.

But how can these insights be exported to enterprise on-premises business analytics platforms for distribution, such as Microsoft Power BI or Oracle Analytics?

This article looks into the performance of moving big data insights from the cloud to on-premises data warehouses or marts using a cloud-agnostic hybrid data pipeline.

The following are the observations recorded while performing Bulk Export of data to SQL Server using Apache Sqoop in various scenarios using Progress DataDirect Hybrid Connectivity solutions.

Environment Setup


This scenario was set up to check the Bulk Export capabilities to SQL Server using DataDirect Cloud. Here I have a Hadoop Single Node that was installed in Pseudo distributed mode on AWS EC2, where I had Sqoop and Hive installed alongside.

Behind the firewall, we have SQL Server 2016 installed on an on-premises Progress vCloud. A huge dataset was exported from Hadoop to SQL Server using Sqoop and the JDBC client from the DataDirect Cloud service.

bulk-export.png



Test Methodology


Dataset Used:

New York City Taxi Data 2016

Dataset Size:

  • 11 million rows / 1.5914 GB
  • 22 million rows / 3.1827 GB
  • 44 million rows / 6.3655 GB
For each scenario and dataset size, I exported the data from Hadoop to SQL Server using the drivers’ Bulk Load feature being enabled and disabled on separate tests.

The goal of this test is to establish preliminary throughput numbers for hybrid cloud/ground environments with minimal tuning of batch sizes at the JDBC driver level.

To see more details on properties related to tuning, see BCP Performance on Sqoop EXPORT to SQL Server from Hadoop with additional details using our on-premises JDBC connector.

The recovery model for the database on SQL Server is set to BULK_LOGGING, which is the only change on the SQL Server end.

Results of Bulk Export from Hadoop to SQL Server

11 Million Rows

Bulk Load Enabled:

11-million-rows.png



Bulk Load Disabled:
Time taken to complete the load: 3,142.6827 seconds

Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 4,000: 164%
Maximum performance gain observed at batch size 500: 188%

22 Million Rows

Bulk Load Enabled:

22-million.png


Bulk Load Disabled:
Time taken to complete the load: 6,980.4868 seconds

Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 32,000: 233%
Maximum performance gain observed at batch size 6,4000: 236%

44 Million Rows

Bulk Load Enabled:

44-million.png



Bulk Load Disabled:
Time taken to complete the load: 102,208.5821 seconds

Bulk vs. Non-Bulk:
Minimum performance gain observed at batch size 16,000: 1,908%
Maximum performance gain observed at batch size 4,000: 2,366%

Sqoop Scripts Used in Experiment


sqoop export --connect

'jdbc:datadirect:ddcloud:database=SQLNonBulk;user=<user>;password=<password>; TransactionMode=ignore' --driver com.ddtek.jdbc.ddcloud.DDCloudDriver --table 'NYCTAXIDATA' --export-dir /user/hive/warehouse/datadirect.db/nyctaxidata --input-lines-terminated-by "\n" --input-fields-terminated-by ',' --batch -m 10

Note: The batch sizes in hybrid connectivity scenarios are configured in the DataDirect Cloud service, rather than the JDBC client.

Conclusion


The performance gains demonstrated in this experiment, up to 2,300%, prove that cloud-based Hadoop platforms make it possible to integrate data insights with enterprise analytics platforms for inclusion in intra-day reporting and analytics.There were discrepancies in throughput due in part to using non-production infrastructure, but the bulk export performance is consistently improved with DataDirect hybrid bulk export capabilities.

While the experiment is with a SQL Server database, similar bulk movement facilities are available with other popular DataDirect hybrid data sources such as Oracle and Salesforce.

As always with any data connectivity performance testing, Progress recommends testing in a production-like environment.

Try the experiment against your big data platform.

Continue reading...
 
Status
Not open for further replies.
Top