diff --git a/PyTorch/contrib/audio/Jasper/.dockerignore b/PyTorch/contrib/audio/Jasper/.dockerignore
new file mode 100644
index 0000000000000000000000000000000000000000..a620be2e6de100dbe077603daf8bf0ff455c8490
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/.dockerignore
@@ -0,0 +1,7 @@
+*.pt
+results/
+*__pycache__
+checkpoints/
+.git/
+datasets/
+external/tensorrt-inference-server/
diff --git a/PyTorch/contrib/audio/Jasper/.gitignore b/PyTorch/contrib/audio/Jasper/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..bb051c475eb09edc93d6d42aade2079f983711f8
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/.gitignore
@@ -0,0 +1,9 @@
+__pycache__
+*.pt
+results/
+datasets/
+checkpoints/
+
+*.swp
+*.swo
+*.swn
diff --git a/PyTorch/contrib/audio/Jasper/.keep b/PyTorch/contrib/audio/Jasper/.keep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/PyTorch/contrib/audio/Jasper/Dockerfile b/PyTorch/contrib/audio/Jasper/Dockerfile
new file mode 100644
index 0000000000000000000000000000000000000000..8ba48ec3485cfc9e9ffbb5cdea447c266df01c5d
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/Dockerfile
@@ -0,0 +1,30 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.10-py3
+FROM ${FROM_IMAGE_NAME}
+
+RUN apt update && apt install -y libsndfile1 && apt install -y sox && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace/jasper
+
+# Install requirements (do this first for better caching)
+COPY requirements.txt .
+RUN conda install -y pyyaml==5.4.1
+RUN pip install --disable-pip-version-check -U -r requirements.txt
+
+RUN pip install --force-reinstall --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110==1.2.0
+
+# Copy rest of files
+COPY . .
diff --git a/PyTorch/contrib/audio/Jasper/LICENSE b/PyTorch/contrib/audio/Jasper/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..2ae5b8195cb8fbcae84c84d39e3541e91f081755
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/LICENSE
@@ -0,0 +1,203 @@
+ Except where otherwise noted, the following license applies to all files in this repo.
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2019 NVIDIA Corporation
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
\ No newline at end of file
diff --git a/PyTorch/contrib/audio/Jasper/NOTICE b/PyTorch/contrib/audio/Jasper/NOTICE
new file mode 100644
index 0000000000000000000000000000000000000000..7916839bcc44c5bc319f2bbd69087e89720ea9a6
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/NOTICE
@@ -0,0 +1,5 @@
+Jasper in PyTorch
+
+This repository includes source code (in "parts/") from:
+* https://github.com/keithito/tacotron and https://github.com/ryanleary/patter licensed under MIT license.
+
diff --git a/PyTorch/contrib/audio/Jasper/README.md b/PyTorch/contrib/audio/Jasper/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1af2ea85fdc000ef5b060b0b6f84719be44eee37
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/README.md
@@ -0,0 +1,51 @@
+# Jasper
+
+This implements training of Jasper on the LibriSpeech dataset.
+
+- Reference implementation:
+```bash
+url=https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper
+```
+
+
+## Requirements
+
+- Install PyTorch ([pytorch.org](http://pytorch.org))
+- `pip install -r requirements.txt`
+- Download the LibriSpeech dataset from http://www.openslr.org/12
+
+
+## Training
+
+- To run the model, first `cd` into the `test` directory.
+
+```bash
+# 1p train full
+bash ./train_full_1p.sh --data_path=xxx
+
+#1p train perf
+bash ./train_performance_1p.sh --data_path=xxx
+
+# 8p train full
+bash ./train_full_8p.sh --data_path=xxx
+
+# 8p train perf
+bash ./train_performance_8p.sh --data_path=xxx
+
+```
+
+
+## Result
+
+The batch size is 32 for both 1p and 8p training.
+
+| Name | WER | Throughput (fps) | Epochs |
+| :------: | :------: | :------: | :------: |
+| GPU-1p | - | 10 | 1 |
+| GPU-8p | 10.73 | 78 | 30 |
+| NPU-1p | - | 4 | 1 |
+| NPU-8p | 10.89 | 34 | 30 |
+
+
+
+
diff --git a/PyTorch/contrib/audio/Jasper/README_raw.md b/PyTorch/contrib/audio/Jasper/README_raw.md
new file mode 100644
index 0000000000000000000000000000000000000000..ad548b22af923bb52b36e71f50d68636bb5e2c4f
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/README_raw.md
@@ -0,0 +1,843 @@
+# Jasper For PyTorch
+
+This repository provides scripts to train the Jasper model to achieve near state of the art accuracy and perform high-performance inference using NVIDIA TensorRT. This repository is tested and maintained by NVIDIA.
+
+## Table Of Contents
+- [Model overview](#model-overview)
+ * [Model architecture](#model-architecture)
+ * [Default configuration](#default-configuration)
+ * [Feature support matrix](#feature-support-matrix)
+ * [Features](#features)
+ * [Mixed precision training](#mixed-precision-training)
+ * [Enabling mixed precision](#enabling-mixed-precision)
+ * [Enabling TF32](#enabling-tf32)
+ * [Glossary](#glossary)
+- [Setup](#setup)
+ * [Requirements](#requirements)
+- [Quick Start Guide](#quick-start-guide)
+- [Advanced](#advanced)
+ * [Scripts and sample code](#scripts-and-sample-code)
+ * [Parameters](#parameters)
+ * [Command-line options](#command-line-options)
+ * [Getting the data](#getting-the-data)
+ * [Dataset guidelines](#dataset-guidelines)
+ * [Training process](#training-process)
+ * [Inference process](#inference-process)
+ * [Evaluation process](#evaluation-process)
+  * [Deploying Jasper using Triton Inference Server](#deploying-jasper-using-triton-inference-server)
+- [Performance](#performance)
+ * [Benchmarking](#benchmarking)
+ * [Training performance benchmark](#training-performance-benchmark)
+ * [Inference performance benchmark](#inference-performance-benchmark)
+ * [Results](#results)
+ * [Training accuracy results](#training-accuracy-results)
+ * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
+ * [Training accuracy: NVIDIA DGX-1 (8x V100 32GB)](#training-accuracy-nvidia-dgx-1-8x-v100-32gb)
+ * [Training stability test](#training-stability-test)
+ * [Training performance results](#training-performance-results)
+ * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
+ * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
+ * [Training performance: NVIDIA DGX-1 (8x V100 32GB)](#training-performance-nvidia-dgx-1-8x-v100-32gb)
+ * [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
+ * [Inference performance results](#inference-performance-results)
+ * [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-80gb)
+ * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
+ * [Inference performance: NVIDIA DGX-1 (1x V100 32GB)](#inference-performance-nvidia-dgx-1-1x-v100-32gb)
+ * [Inference performance: NVIDIA DGX-2 (1x V100 32GB)](#inference-performance-nvidia-dgx-2-1x-v100-32gb)
+ * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
+- [Release notes](#release-notes)
+ * [Changelog](#changelog)
+ * [Known issues](#known-issues)
+
+## Model overview
+This repository provides an implementation of the Jasper model in PyTorch from the paper `Jasper: An End-to-End Convolutional Neural Acoustic Model` [https://arxiv.org/pdf/1904.03288.pdf](https://arxiv.org/pdf/1904.03288.pdf).
+The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting strict real-time requirements of ASR systems in deployment.
+
+The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences
+corresponding to a given audio segment. This post-processing step is called decoding.
+
+This repository is a PyTorch implementation of Jasper and provides scripts to train the Jasper 10x5 model with dense residuals from scratch on the [Librispeech](http://www.openslr.org/12) dataset to achieve the greedy decoding results of the original paper.
+The original reference code provides Jasper as part of a research toolkit in TensorFlow [openseq2seq](https://github.com/NVIDIA/OpenSeq2Seq).
+This repository provides a simple implementation of Jasper with scripts for training and replicating the Jasper paper results.
+This includes data preparation scripts, training and inference scripts.
+Both training and inference scripts offer the option to use Automatic Mixed Precision (AMP) to benefit from Tensor Cores for better performance.
+
+In addition to providing the hyperparameters for training a model checkpoint, we publish a thorough inference analysis across different NVIDIA GPU platforms, for example, DGX A100, DGX-1, DGX-2 and T4.
+
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+The original paper takes the output of the Jasper acoustic model and shows results for 3 different decoding variations: greedy decoding, beam search with a 6-gram language model and beam search with further rescoring of the best ranked hypotheses with Transformer XL, which is a neural language model. Beam search and the rescoring with the neural language model scores are run on CPU and result in better word error rates compared to greedy decoding.
+This repository provides instructions to reproduce greedy decoding results. To run beam search or rescoring with TransformerXL, use the following scripts from the [openseq2seq](https://github.com/NVIDIA/OpenSeq2Seq) repository:
+https://github.com/NVIDIA/OpenSeq2Seq/blob/master/scripts/decode.py
+https://github.com/NVIDIA/OpenSeq2Seq/tree/master/external_lm_rescore
+
+### Model architecture
+Details on the model architecture can be found in the paper [Jasper: An End-to-End Convolutional Neural Acoustic Model](https://arxiv.org/pdf/1904.03288.pdf).
+
+*Figure 1: Jasper BxR model (B - number of blocks, R - number of sub-blocks). Figure 2: Jasper Dense Residual.*
+
+Jasper is an end-to-end neural acoustic model that is based on convolutions.
+In the audio processing stage, each frame is transformed into mel-scale spectrogram features, which the acoustic model takes as input and outputs a probability distribution over the vocabulary for each frame.
+The acoustic model has a modular block structure and can be parametrized accordingly:
+a Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
+
+Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
+
+Each block input is connected directly to the last subblock of all following blocks via a residual connection, which is referred to as `dense residual` in the paper.
+Every block differs in kernel size and number of filters, which are increasing in size from the bottom to the top layers.
+Irrespective of the exact block configuration parameters B and R, every Jasper model has four additional convolutional blocks:
+one immediately succeeding the input layer (Prologue) and three at the end of the B blocks (Epilogue).
+
+The Prologue decimates the audio signal in time in order to process a shorter time sequence for efficiency. The Epilogue with dilation captures a bigger context around an audio time step, which decreases the model word error rate (WER).
+The paper achieves best results with Jasper 10x5 with dense residual connections, which is also the focus of this repository and is in the following referred to as Jasper Large.
+
+### Default configuration
+The following features were implemented in this model:
+
+* GPU-supported feature extraction with data augmentation options [SpecAugment](https://arxiv.org/abs/1904.08779) and [Cutout](https://arxiv.org/pdf/1708.04552.pdf)
+* offline and online [Speed Perturbation](https://www.danielpovey.com/files/2015_interspeech_augmentation.pdf)
+* data-parallel multi-GPU training and evaluation
+* AMP with dynamic loss scaling for Tensor Core training
+* FP16 inference
+
+Competitive training results and analysis are provided for the following Jasper model configuration:
+
+| **Model** | **Number of Blocks** | **Number of Subblocks** | **Max sequence length** | **Number of Parameters** |
+|--------------|----------------------|-------------------------|-------------------------|--------------------------|
+| Jasper Large | 10 | 5 | 16.7 s | 333 M |
+
+
+### Feature support matrix
+The following features are supported by this model.
+
+| **Feature** | **Jasper** |
+|---------------|---------------|
+|[Apex AMP](https://nvidia.github.io/apex/amp.html) | Yes |
+|[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) | Yes |
+
+#### Features
+[Apex AMP](https://nvidia.github.io/apex/amp.html) - a tool that enables Tensor Core-accelerated training. Refer to the [Enabling mixed precision](#enabling-mixed-precision) section for more details.
+
+[Apex
+DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) -
+a module wrapper that enables easy multiprocess distributed data parallel
+training, similar to
+[torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
+`DistributedDataParallel` is optimized for use with
+[NCCL](https://github.com/NVIDIA/nccl). It achieves high performance by
+overlapping communication with computation during `backward()` and bucketing
+smaller gradient transfers to reduce the total number of transfers required.
+
+
+### Mixed precision training
+*Mixed precision* is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+
+1. Porting the model to use the FP16 data type where appropriate.
+2. Adding loss scaling to preserve small gradient values.
+
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+
+For information about:
+* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
+* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
+* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) blog post.
+
+
+#### Enabling mixed precision
+For training, mixed precision can be enabled by setting the flag: `train.py --amp`. When using the bash helper scripts (`scripts/train.sh`, `scripts/inference.sh`, etc.), mixed precision can be enabled with the environment variable `AMP=true`.
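+
+For example, either of the following enables AMP (all other options omitted for brevity):
+```bash
+# Directly through the training entry point
+python train.py --amp
+# Or through the helper script, via an environment variable
+AMP=true bash scripts/train.sh
+```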
+
+Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
+(AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
+to half-precision upon retrieval, while storing variables in single-precision
+format. Furthermore, to preserve small gradient magnitudes in backpropagation,
+a [loss
+scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
+step must be included when applying gradients. In PyTorch, loss scaling can be
+easily applied by using `scale_loss()` method provided by AMP. The scaling
+value to be used can be
+[dynamic](https://nvidia.github.io/apex/amp.html#apex.amp.initialize) or fixed.
+
+For an in-depth walkthrough of AMP, check out sample usage
+[here](https://nvidia.github.io/apex/amp.html#). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
+utility libraries, such as AMP, which require minimal network code changes to
+leverage Tensor Cores performance.
+
+The following steps were needed to enable mixed precision training in Jasper:
+
+* Import AMP from APEX (file: `train.py`):
+```python
+from apex import amp
+```
+
+* Initialize AMP and wrap the model and the optimizer:
+```python
+model, optimizer = amp.initialize(
+    min_loss_scale=1.0,
+    models=model,
+    optimizers=optimizer,
+    opt_level='O1')
+
+```
+
+* Apply the `scale_loss` context manager:
+```python
+with amp.scale_loss(loss, optimizer) as scaled_loss:
+ scaled_loss.backward()
+```
+
+#### Enabling TF32
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
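+
+If you need to rule TF32 out (for example, to compare against pure FP32 on an A100), NVIDIA's CUDA libraries honor a global override environment variable; a possible invocation, assuming the container's cuBLAS/cuDNN honor the override, is:
+```bash
+# Disable TF32 in cuBLAS/cuDNN for this run only
+NVIDIA_TF32_OVERRIDE=0 bash scripts/train.sh
+```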
+
+
+### Glossary
+**Acoustic model**
+Assigns a probability distribution over a vocabulary of characters given an audio frame.
+
+**Language Model**
+Assigns a probability distribution over a sequence of words. Given a sequence of words, it assigns a probability to the whole sequence.
+
+**Pre-training**
+Training a model on vast amounts of data on the same (or different) task to build a general understanding.
+
+**Automatic Speech Recognition (ASR)**
+Uses both acoustic model and language model to output the transcript of an input audio signal.
+
+
+## Setup
+The following section lists the requirements in order to start training and evaluating the Jasper model.
+
+### Requirements
+This repository contains a `Dockerfile` which extends the PyTorch 20.10-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+* Supported GPUs:
+ - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+ - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
+ - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+
+Further required Python packages are listed in `requirements.txt`, which are automatically installed when the Docker container is built. To manually install them, run
+```bash
+pip install -r requirements.txt
+```
+
+For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
+
+* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
+* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
+
+For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html).
+
+
+## Quick Start Guide
+
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Jasper model on the LibriSpeech dataset. For details concerning training and inference, see the [Advanced](#advanced) section.
+
+1. Clone the repository.
+```bash
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
+```
+2. Build the Jasper PyTorch container.
+
+Running the following scripts will build and launch the container which contains all the required dependencies for data download and processing as well as training and inference of the model.
+
+```bash
+bash scripts/docker/build.sh
+```
+
+3. Start an interactive session in the NGC container to run data download/training/inference
+
+```bash
+bash scripts/docker/launch.sh
+```
+Within the container, the contents of this repository will be copied to the `/workspace/jasper` directory. The `/datasets`, `/checkpoints`, `/results` directories are mounted as volumes
+and mapped to the corresponding directories on the host.
+
+4. Download and preprocess the dataset.
+
+No GPU is required for data download and preprocessing. Therefore, if GPU usage is a limited resource, launch the container for this section on a CPU machine by following Steps 2 and 3.
+
+Note: Downloading and preprocessing the dataset requires 500GB of free disk space and can take several hours to complete.
+
+This repository provides scripts to download and extract the following datasets:
+
+* LibriSpeech [http://www.openslr.org/12](http://www.openslr.org/12)
+
+LibriSpeech contains 1000 hours of 16kHz read English speech derived from public domain audiobooks from the LibriVox project and has been carefully segmented and aligned. For more information, see the [LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS](http://www.danielpovey.com/files/2015_icassp_librispeech.pdf) paper.
+
+Inside the container, download and extract the datasets into the required format for later training and inference:
+```bash
+bash scripts/download_librispeech.sh
+```
+Once the data download is complete, the following folders should exist:
+```bash
+datasets/LibriSpeech/
+├── dev-clean
+├── dev-other
+├── test-clean
+├── test-other
+├── train-clean-100
+├── train-clean-360
+└── train-other-500
+```
+
+Since `/datasets/` is mounted to a directory on the host (see Step 3), once the dataset is downloaded it will also be accessible from outside of the container, under the `LibriSpeech` folder of that host directory.
+
+
+Next, convert the data into WAV files:
+```bash
+bash scripts/preprocess_librispeech.sh
+```
+Once the data is converted, the following additional files and folders should exist:
+```bash
+datasets/LibriSpeech/
+├── dev-clean-wav
+├── dev-other-wav
+├── librispeech-train-clean-100-wav.json
+├── librispeech-train-clean-360-wav.json
+├── librispeech-train-other-500-wav.json
+├── librispeech-dev-clean-wav.json
+├── librispeech-dev-other-wav.json
+├── librispeech-test-clean-wav.json
+├── librispeech-test-other-wav.json
+├── test-clean-wav
+├── test-other-wav
+├── train-clean-100-wav
+├── train-clean-360-wav
+└── train-other-500-wav
+```
+
+The DALI data pre-processing pipeline, which is enabled by default, performs speed perturbation on-line during training.
+Without DALI, on-line speed perturbation might slow down the training.
+If you wish to disable DALI, speed perturbation can be computed off-line with:
+```bash
+SPEEDS="0.9 1.1" bash scripts/preprocess_librispeech.sh
+```
+
+5. Start training.
+
+Inside the container, use the following script to start training.
+Make sure the downloaded and preprocessed dataset is available in the directory mounted from the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
+
+```bash
+[OPTION1=value1 OPTION2=value2 ...] bash scripts/train.sh
+```
+By default, automatic mixed precision is disabled, the batch size is 64 over two gradient accumulation steps, and the recipe is run on a total of 8 GPUs. The hyperparameters are tuned for a GPU with at least 32GB of memory and will require adjustment for 16GB GPUs (e.g., by lowering the batch size and using more gradient accumulation steps).
+
+Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Training process](#training-process).
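+
+For example, on GPUs with 16GB of memory one might keep the default effective batch size but accumulate gradients over more steps (illustrative values):
+```bash
+GRAD_ACCUMULATION_STEPS=4 bash scripts/train.sh
+```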
+
+6. Start validation/evaluation.
+
+Inside the container, use the following script to run evaluation.
+Make sure the downloaded and preprocessed dataset is available in the directory mounted from the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
+```bash
+[OPTION1=value1 OPTION2=value2 ...] bash scripts/evaluation.sh [OPTIONS]
+```
+By default, this will use full precision, a batch size of 64 and run on a single GPU.
+
+Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Evaluation process](#evaluation-process).
+
+
+7. Start inference/predictions.
+
+Inside the container, use the following script to run inference.
+Make sure the downloaded and preprocessed dataset is available in the directory mounted from the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
+A pretrained model checkpoint can be downloaded from [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
+
+```bash
+[OPTION1=value1 OPTION2=value2 ...] bash scripts/inference.sh
+```
+By default this will use single precision, a batch size of 64 and run on a single GPU.
+
+Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Inference process](#inference-process).
+
+
+## Advanced
+
+The following sections provide greater details of the dataset, running training and inference, and getting training and inference results.
+
+
+### Scripts and sample code
+In the `root` directory, the most important files are:
+```
+jasper
+├── common # data pre-processing, logging, etc.
+├── configs # model configurations
+├── Dockerfile # container with the basic set of dependencies to run Jasper
+├── inference.py # entry point for inference
+├── jasper # model-specific code
+├── notebooks # jupyter notebooks and example audio files
+├── scripts # one-click scripts required for running various supported functionalities
+│ ├── docker # contains the scripts for building and launching the container
+│ ├── download_librispeech.sh # downloads LibriSpeech dataset
+│ ├── evaluation.sh # runs evaluation using the `inference.py` script
+│ ├── inference_benchmark.sh # runs the inference benchmark using the `inference_benchmark.py` script
+│ ├── inference.sh # runs inference using the `inference.py` script
+│ ├── preprocess_librispeech.sh # preprocess LibriSpeech raw data files for training and inference
+│ ├── train_benchmark.sh # runs the training performance benchmark using the `train.py` script
+│ └── train.sh # runs training using the `train.py` script
+├── train.py # entry point for training
+├── triton # example of inference using Triton Inference Server
+└── utils # data downloading and common routines
+```
+
+### Parameters
+
+Parameters can be set as environment variables or passed as positional arguments.
+
+The complete list of available parameters for the `scripts/train.sh` script contains:
+```bash
+DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
+MODEL_CONFIG: relative path to model configuration. (default: 'configs/jasper10x5dr_speedp-online_speca.yaml')
+OUTPUT_DIR: directory for results, logs, and created checkpoints. (default: '/results')
+CHECKPOINT: a specific model checkpoint to continue training from. To resume training from the last checkpoint, see the RESUME option.
+RESUME: resume training from the last checkpoint found in OUTPUT_DIR, or from scratch if there are no checkpoints (default: true)
+CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: true)
+NUM_GPUS: number of GPUs to use. (default: 8)
+AMP: if set to `true`, enables automatic mixed precision (default: false)
+BATCH_SIZE: effective data batch size. The real batch size per GPU might be lower, if gradient accumulation is enabled (default: 64)
+GRAD_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 2)
+LEARNING_RATE: initial learning rate. (default: 0.01)
+MIN_LEARNING_RATE: minimum learning rate, despite LR scheduling (default: 1e-5)
+LR_POLICY: how to decay LR (default: exponential)
+LR_EXP_GAMMA: decay factor for the exponential LR schedule (default: 0.981)
+EMA: decay factor for exponential averages of checkpoints (default: 0.999)
+SEED: seed for random number generator and used for ensuring reproducibility. (default: 0)
+EPOCHS: number of training epochs. (default: 440)
+WARMUP_EPOCHS: number of initial epochs of linearly increasing LR. (default: 2)
+HOLD_EPOCHS: number of epochs to hold maximum LR after warmup. (default: 140)
+SAVE_FREQUENCY: number of epochs between saving the model to disk. (default: 10)
+EPOCHS_THIS_JOB: run training for this number of epochs. Does not affect LR schedule like the EPOCHS parameter. (default: 0)
+DALI_DEVICE: device to run the DALI pipeline on for calculation of filterbanks. Valid choices: cpu, gpu, none. (default: gpu)
+PAD_TO_MAX_DURATION: pad all sequences with zeros to maximum length. (default: false)
+EVAL_FREQUENCY: number of steps between evaluations on the validation set. (default: 544)
+PREDICTION_FREQUENCY: the number of steps between writing a sample prediction to stdout. (default: 544)
+TRAIN_MANIFESTS: lists of .json training set files
+VAL_MANIFESTS: lists of .json validation set files
+
+```
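+
+As an illustration, a short single-GPU trial run with mixed precision could be launched like this (the values are examples only):
+```bash
+AMP=true NUM_GPUS=1 GRAD_ACCUMULATION_STEPS=8 EPOCHS_THIS_JOB=2 bash scripts/train.sh
+```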
+
+The complete list of available parameters for the `scripts/inference.sh` script contains:
+```bash
+DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
+MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_speedp-online_speca.yaml')
+OUTPUT_DIR: directory for results and logs. (default: '/results')
+CHECKPOINT: model checkpoint path. (required)
+DATASET: name of the LibriSpeech subset to use. (default: 'dev-clean')
+LOG_FILE: path to the DLLogger .json logfile. (default: '')
+CUDNN_BENCHMARK: enable cudnn benchmark mode for using more optimized kernels. (default: false)
+MAX_DURATION: filter out recordings longer than MAX_DURATION seconds. (default: "")
+PAD_TO_MAX_DURATION: pad all sequences with zeros to maximum length. (default: false)
+PAD_LEADING: pad every batch with leading zeros to counteract conv shifts of the field of view. (default: 16)
+NUM_GPUS: number of GPUs to use. Note that with > 1 GPUs WER results might be inaccurate due to the batching policy. (default: 1)
+NUM_STEPS: number of batches to evaluate, loop the dataset if necessary. (default: 0)
+NUM_WARMUP_STEPS: number of initial steps before measuring performance. (default: 0)
+AMP: enable FP16 inference with AMP. (default: false)
+BATCH_SIZE: data batch size. (default: 64)
+EMA: Attempt to load exponentially averaged weights from a checkpoint. (default: true)
+SEED: seed for random number generator and used for ensuring reproducibility. (default: 0)
+DALI_DEVICE: device to run the DALI pipeline on for calculation of filterbanks. Valid choices: cpu, gpu, none. (default: gpu)
+CPU: run inference on CPU. (default: false)
+LOGITS_FILE: dump logit matrices to a file. (default: "")
+PREDICTION_FILE: save predictions to a file. (default: "${OUTPUT_DIR}/${DATASET}.predictions")
+```
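+
+For instance, a greedy-decoding run on `test-clean` with AMP might look like this (the checkpoint path is a placeholder):
+```bash
+CHECKPOINT=/checkpoints/jasper_fp16.pt DATASET=test-clean AMP=true bash scripts/inference.sh
+```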
+
+The complete list of available parameters for the `scripts/evaluation.sh` script is the same as for `scripts/inference.sh`, except for a few changed defaults:
+```bash
+PREDICTION_FILE: (default: "")
+DATASET: (default: "test-other")
+```
+
+The `scripts/inference_benchmark.sh` script pads all input to a fixed duration and computes the mean, 90th, 95th, and 99th percentiles of latency for the specified number of inference steps. Latency is measured in milliseconds per batch. The `scripts/inference_benchmark.sh` script measures latency for a single GPU and loops over a number of batch sizes and durations. It extends `scripts/inference.sh` and changes the defaults with:
+```bash
+BATCH_SIZE_SEQ: batch sizes to measure on. (default: "1 2 4 8 16")
+MAX_DURATION_SEQ: input durations (in seconds) to measure on (default: "2 7 16.7")
+CUDNN_BENCHMARK: (default: true)
+PAD_TO_MAX_DURATION: (default: true)
+PAD_LEADING: (default: 0)
+NUM_WARMUP_STEPS: (default: 10)
+NUM_STEPS: (default: 500)
+DALI_DEVICE: (default: cpu)
+```
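+
+For example, a smaller sweep than the default one could be requested like this (the checkpoint path is a placeholder):
+```bash
+CHECKPOINT=/checkpoints/jasper_fp16.pt BATCH_SIZE_SEQ="1 8" MAX_DURATION_SEQ="2 16.7" bash scripts/inference_benchmark.sh
+```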
+
+The `scripts/train_benchmark.sh` script pads all input to the same length according to the input argument `MAX_DURATION` and measures average training latency and throughput performance. Latency is measured in seconds per batch, throughput in sequences per second.
+Training performance is measured with on-line speed perturbation and cuDNN benchmark mode enabled.
+The script `scripts/train_benchmark.sh` loops over a number of batch sizes and GPU counts.
+It extends `scripts/train.sh`, and the complete list of available parameters for the `scripts/train_benchmark.sh` script contains:
+```bash
+BATCH_SIZE_SEQ: batch sizes to measure on. (default: "1 2 4 8 16")
+NUM_GPUS_SEQ: number of GPUs to run the training on. (default: "1 4 8")
+MODEL_CONFIG: (default: "configs/jasper10x5dr_speedp-online_train-benchmark.yaml")
+TRAIN_MANIFESTS: (default: "$DATA_DIR/librispeech-train-clean-100-wav.json")
+RESUME: (default: false)
+EPOCHS_THIS_JOB: (default: 2)
+EPOCHS: (default: 100000)
+SAVE_FREQUENCY: (default: 100000)
+EVAL_FREQUENCY: (default: 100000)
+GRAD_ACCUMULATION_STEPS: (default: 1)
+PAD_TO_MAX_DURATION: (default: true)
+EMA: (default: 0)
+```
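+
+For example, to benchmark mixed-precision training throughput at batch size 32 on 1 and 8 GPUs only (illustrative):
+```bash
+AMP=true BATCH_SIZE_SEQ="32" NUM_GPUS_SEQ="1 8" bash scripts/train_benchmark.sh
+```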
+
+### Command-line options
+To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option with the Python file, for example:
+
+```bash
+python train.py --help
+python inference.py --help
+```
+
+### Getting the data
+The Jasper model was trained on the LibriSpeech dataset. We use the concatenation of `train-clean-100`, `train-clean-360` and `train-other-500` for training and `dev-clean` for validation.
+
+This repository contains the `scripts/download_librispeech.sh` and `scripts/preprocess_librispeech.sh` scripts, which will automatically download and preprocess the training, test and development datasets. By default, data is downloaded to the `/datasets/LibriSpeech` directory. A minimum of 250GB of free space is required for download and preprocessing, and the final preprocessed dataset is approximately 100GB. With offline speed perturbation, the dataset will be about 3x larger.
+
+#### Dataset guidelines
+The `scripts/preprocess_librispeech.sh` script converts the input audio files to WAV format with a sample rate of 16kHz; target transcripts are stripped of whitespace characters and then lower-cased. For `train-clean-100`, `train-clean-360` and `train-other-500`, it can optionally create speed-perturbed versions with rates of 0.9 and 1.1 for data augmentation. In the current version, those augmentations are applied on-line with the DALI pipeline without any impact on training time.
+
+After preprocessing, the script creates JSON files with output file paths, sample rate, target transcript and other metadata. These JSON files are used by the training script to identify training and validation datasets.
+
+The Jasper model was tuned on audio signals with a sample rate of 16kHz. If you wish to use a different sampling rate, some hyperparameters might need to be changed - specifically the window size and step size.
+
+
+### Training process
+
+The training is performed using the `train.py` script along with parameters defined in `scripts/train.sh`.
+The `scripts/train.sh` script runs a job on a single node that trains the Jasper model from scratch using LibriSpeech as training data. To make training more efficient, we discard audio samples longer than 16.7 seconds from the training dataset; the total number of these samples is less than 1%. Such filtering does not degrade accuracy, but it allows us to decrease the number of time steps in a batch, which requires less GPU memory and increases training speed.
+Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the training script:
+
+* Runs on 8 GPUs with at least 32GB of memory and training/evaluation batch size 64, split over two gradient accumulation steps
+* Uses TF32 precision (A100 GPU) or FP32 (other GPUs)
+* Trains on the concatenation of all 3 LibriSpeech training datasets and evaluates on the LibriSpeech dev-clean dataset
+* Maintains an exponential moving average of parameters for evaluation
+* Has cudnn benchmark enabled
+* Runs for 440 epochs
+* Uses an initial learning rate of 0.01 and an exponential learning rate decay
+* Saves a checkpoint every 10 epochs
+* Automatically removes old checkpoints and preserves milestone checkpoints
+* Runs evaluation on the development dataset every 544 iterations and at the end of training
+* Maintains a separate checkpoint with the lowest WER on development set
+* Prints out training progress every iteration to stdout
+* Creates a DLLogger logfile and a Tensorboard log
+* Calculates speed perturbation on-line during training
+* Uses SpecAugment in data pre-processing
+* Filters out audio samples longer than 16.7 seconds
+* Pads each batch so that its length is divisible by 16
+* Uses masked convolutions and dense residuals as described in the paper
+* Uses weight decay of 0.001
+* Uses [Novograd](https://arxiv.org/pdf/1905.11286.pdf) as optimizer with betas=(0.95, 0)
+
+Enabling AMP permits batch size 64 with one gradient accumulation step. In the current setup it will improve upon the greedy WER [Results](#results) of the Jasper paper on a DGX-1 with 32GB V100 GPUs.
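+
+Under that setup, such a run could be launched as follows (illustrative):
+```bash
+AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh
+```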
+
+### Inference process
+Inference is performed using the `inference.py` script along with parameters defined in `scripts/inference.sh`.
+The `scripts/inference.sh` script runs the job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
+Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the inference script:
+
+* Evaluates on the LibriSpeech dev-clean dataset
+* Uses a batch size of 64
+* Runs for 1 epoch and prints out the final word error rate
+* Creates a log file with progress and results which will be stored in the results folder
+* Pads each batch so that its length is divisible by 16
+* Does not use data augmentation
+* Does greedy decoding and saves the transcription in the results folder
+* Has the option to save the model output tensors for more complex decoding, for example, beam search
+* Has cudnn benchmark disabled
+
+### Evaluation process
+Evaluation is performed using the `inference.py` script along with parameters defined in `scripts/evaluation.sh`.
+The setup is similar to `scripts/inference.sh`, with two differences:
+
+* Evaluates the LibriSpeech test-other dataset
+* Model outputs are not saved
+
+### Deploying Jasper using Triton Inference Server
+The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
+More information on how to perform inference using Triton Inference Server with different model backends can be found in the subfolder [./triton/README.md](triton/README.md)
+
+
+## Performance
+
+The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
+
+### Benchmarking
+The following section shows how to run benchmarks measuring the model performance in training and inference modes.
+
+#### Training performance benchmark
+To benchmark the training performance in a specific setting on the `train-clean-100` subset of LibriSpeech, run:
+
+```bash
+BATCH_SIZE_SEQ=<batch sizes> NUM_GPUS_SEQ=<numbers of GPUs> bash scripts/train_benchmark.sh
+```
+
+By default, this script runs 2 epochs on the configuration `configs/jasper10x5dr_speedp-online_train-benchmark.yaml`,
+which applies gentle speed perturbation that does not change the length of the output, enabling immediate stabilization of training step times in the cuDNN benchmark mode. The script runs benchmarks on a batch size of 32 on 1, 4, and 8 GPUs, and requires an 8x 32GB GPU machine.
+
+#### Inference performance benchmark
+To benchmark the inference performance on a specific batch size and audio length, run:
+
+```bash
+BATCH_SIZE_SEQ=<batch sizes> MAX_DURATION_SEQ=<max durations in seconds> bash scripts/inference_benchmark.sh
+```
+By default, the script runs on a single GPU and evaluates on the dataset limited to utterances shorter than MAX_DURATION. It uses the model configuration `configs/jasper10x5dr_speedp-online_speca.yaml`.
+
+
+### Results
+The following sections provide details on how we achieved our performance and accuracy in training and inference.
+All results are obtained by training on 960 hours of LibriSpeech with a maximum audio length of 16.7s. Training is evaluated
+on LibriSpeech dev-clean, dev-other, test-clean and test-other. Checkpoints for evaluation are chosen based on their
+word error rate on dev-clean.
+
+#### Training accuracy results
+
+##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
+Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container with NVIDIA DGX A100 with (8x A100 80GB) GPUs.
+The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
+
+| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
+|-----|-----|-------|-------|-------|------|-------|------|
+| 8 | 64 | mixed | 3.20 | 9.78 | 3.41 | 9.71 | 70 h |
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 32GB)
+Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container with NVIDIA DGX-1 with (8x V100 32GB) GPUs.
+The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
+
+| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
+|-----|-----|-------|-------|-------|------|-------|-------|
+| 8 | 64 | mixed | 3.26 | 10.00 | 3.54 | 9.80 | 130 h |
+
+We show the best of 5 runs (mixed precision) and 2 runs (FP32) chosen based on dev-clean WER. For FP32, two gradient accumulation steps have been used.
+
+##### Training stability test
+The following table compares greedy decoding word error rates across 8 different training runs with different seeds for mixed precision training.
+
+| DGX A100 80GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
+|-----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|------:|
+| dev-clean | 3.46 | 3.55 | 3.45 | 3.44 | 3.25 | 3.34 | 3.20 | 3.40 | 3.39 | 0.11 |
+| dev-other | 10.30 | 10.77 | 10.36 | 10.26 | 9.99 | 10.18 | 9.78 | 10.32 | 10.25 | 0.27 |
+| test-clean | 3.84 | 3.81 | 3.66 | 3.64 | 3.58 | 3.55 | 3.41 | 3.73 | 3.65 | 0.13 |
+| test-other | 10.61 | 10.52 | 10.49 | 10.47 | 9.89 | 10.09 | 9.71 | 10.26 | 10.26 | 0.31 |
+
+
+| DGX-1 32GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
+|-----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|------:|
+| dev-clean | 3.31 | 3.31 | 3.26 | 3.44 | 3.40 | 3.35 | 3.36 | 3.28 | 3.34 | 0.06 |
+| dev-other | 10.02 | 10.01 | 10.00 | 10.06 | 10.05 | 10.03 | 10.10 | 10.04 | 10.04 | 0.03 |
+| test-clean | 3.49 | 3.50 | 3.54 | 3.61 | 3.57 | 3.58 | 3.48 | 3.51 | 3.54 | 0.04 |
+| test-other | 10.11 | 10.14 | 9.80 | 10.09 | 10.17 | 9.99 | 9.86 | 10.00 | 10.02 | 0.13 |
+
+#### Training performance results
+Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
+
+##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
+| Batch size / GPU | GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
+|----:|----:|-------:|-------:|-----:|-----:|-----:|
+| 32 | 1 | 42.18 | 64.32 | 1.52 | 1.00 | 1.00 |
+| 32 | 4 | 157.49 | 239.23 | 1.52 | 3.73 | 3.72 |
+| 32 | 8 | 310.10 | 470.09 | 1.52 | 7.35 | 7.31 |
+| 64 | 1 | 49.64 | 75.59 | 1.52 | 1.00 | 1.00 |
+| 64 | 4 | 192.66 | 289.16 | 1.50 | 3.88 | 3.83 |
+| 64 | 8 | 371.41 | 547.91 | 1.48 | 7.48 | 7.25 |
+
+Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
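+
+The throughput speed-up and weak-scaling columns are derived directly from the measured throughputs. A minimal sketch, using the batch-64 rows of the table above:
+
+```python
+# Derived columns of the DGX A100 training-performance table (batch size 64).
+tf32_1gpu, tf32_8gpu = 49.64, 371.41   # sequences/s, TF32
+amp_1gpu, amp_8gpu = 75.59, 547.91     # sequences/s, mixed precision
+
+speedup_8gpu = amp_8gpu / tf32_8gpu        # ~1.48 (TF32 -> mixed precision)
+weak_scaling_tf32 = tf32_8gpu / tf32_1gpu  # ~7.48
+weak_scaling_amp = amp_8gpu / amp_1gpu     # ~7.25
+```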
+
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
+| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
+|----:|----:|------:|-------:|-----:|-----:|-----:|
+| 16 | 1 | 10.71 | 27.87 | 2.60 | 1.00 | 1.00 |
+| 16 | 4 | 40.28 | 99.80 | 2.48 | 3.76 | 3.58 |
+| 16 | 8 | 78.23 | 193.89 | 2.48 | 7.30 | 6.96 |
+
+Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Training performance: NVIDIA DGX-1 (8x V100 32GB)
+| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
+|----:|----:|------:|-------:|-----:|-----:|-----:|
+| 32 | 1 | 12.22 | 34.08 | 2.79 | 1.00 | 1.00 |
+| 32 | 4 | 46.97 | 128.39 | 2.73 | 3.84 | 3.77 |
+| 32 | 8 | 92.44 | 249.00 | 2.69 | 7.57 | 7.31 |
+| 64 | 1 | N/A | 39.30 | N/A | N/A | 1.00 |
+| 64 | 4 | N/A | 150.18 | N/A | N/A | 3.82 |
+| 64 | 8 | N/A | 282.68 | N/A | N/A | 7.19 |
+
+Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Training performance: NVIDIA DGX-2 (16x V100 32GB)
+| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
+|----:|----:|-------:|-------:|-----:|------:|------:|
+| 32 | 1 | 13.46 | 38.94 | 2.89 | 1.00 | 1.00 |
+| 32 | 4 | 51.38 | 143.44 | 2.79 | 3.82 | 3.68 |
+| 32 | 8 | 100.54 | 280.48 | 2.79 | 7.47 | 7.20 |
+| 32 | 16 | 188.14 | 515.90 | 2.74 | 13.98 | 13.25 |
+| 64 | 1 | N/A | 43.86 | N/A | N/A | 1.00 |
+| 64 | 4 | N/A | 165.27 | N/A | N/A | 3.77 |
+| 64 | 8 | N/A | 318.10 | N/A | N/A | 7.25 |
+| 64 | 16 | N/A | 567.47 | N/A | N/A | 12.94 |
+
+Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that can be used without gradient accumulation.
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+
+#### Inference performance results
+Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 20.10-py3 NGC container on a single GPU of NVIDIA DGX A100, DGX-1, DGX-2, and NVIDIA T4. Performance numbers (latency in milliseconds per batch) were averaged over 500 iterations.
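+
+The speed-up column in the tables below corresponds to the ratio of the average FP32 (or TF32) latency to the average FP16 latency for the same batch size and utterance duration. A minimal sketch, using the T4 numbers for batch size 16 and 16.7 s utterances:
+
+```python
+# FP16/FP32 speed-up as reported in the inference tables below.
+avg_fp16_ms, avg_fp32_ms = 358.75, 860.37  # NVIDIA T4, BS=16, 16.7 s
+speedup = avg_fp32_ms / avg_fp16_ms        # ~2.40
+```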
+
+##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
+| | | FP16 Latency (ms) Percentiles | | | | TF32 Latency (ms) Percentiles | | | | FP16/TF32 speed up |
+|-----:|---------------:|------:|------:|------:|------:|------:|------:|-------:|------:|------:|
+| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
+| 1 | 2.0 | 32.40 | 32.50 | 32.82 | 32.30 | 33.30 | 33.64 | 34.65 | 33.25 | 1.03 |
+| 2 | 2.0 | 32.90 | 33.51 | 34.35 | 32.69 | 34.48 | 34.65 | 35.66 | 34.27 | 1.05 |
+| 4 | 2.0 | 32.85 | 33.01 | 33.89 | 32.60 | 34.09 | 34.46 | 35.22 | 34.00 | 1.04 |
+| 8 | 2.0 | 35.51 | 35.89 | 37.10 | 35.33 | 34.86 | 35.36 | 36.08 | 34.45 | 0.98 |
+| 16 | 2.0 | 36.00 | 36.57 | 37.40 | 35.77 | 43.83 | 44.12 | 44.77 | 43.39 | 1.21 |
+| 1 | 7.0 | 33.50 | 33.99 | 34.91 | 33.03 | 33.83 | 34.25 | 34.95 | 33.70 | 1.02 |
+| 2 | 7.0 | 34.43 | 34.89 | 35.72 | 34.22 | 34.41 | 34.73 | 35.69 | 34.28 | 1.00 |
+| 4 | 7.0 | 34.30 | 34.59 | 35.43 | 34.07 | 37.95 | 38.18 | 38.87 | 37.55 | 1.10 |
+| 8 | 7.0 | 35.98 | 36.28 | 37.11 | 35.28 | 44.64 | 44.79 | 45.37 | 44.29 | 1.26 |
+| 16 | 7.0 | 39.86 | 40.08 | 41.16 | 39.33 | 55.17 | 55.46 | 57.24 | 54.56 | 1.39 |
+| 1 | 16.7 | 35.20 | 35.80 | 38.71 | 34.36 | 35.36 | 35.76 | 36.55 | 34.64 | 1.01 |
+| 2 | 16.7 | 35.40 | 35.81 | 36.50 | 34.76 | 36.34 | 36.53 | 37.40 | 35.87 | 1.03 |
+| 4 | 16.7 | 36.01 | 36.38 | 37.37 | 35.57 | 44.69 | 45.09 | 45.88 | 43.92 | 1.23 |
+| 8 | 16.7 | 41.48 | 41.78 | 44.22 | 40.69 | 58.57 | 58.74 | 59.62 | 58.11 | 1.43 |
+| 16 | 16.7 | 61.37 | 61.93 | 66.32 | 60.92 | 97.33 | 97.71 | 100.04 | 96.56 | 1.59 |
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
+| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
+|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
+| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
+| 1 | 2.0 | 45.42 | 45.62 | 49.54 | 45.02 | 48.83 | 48.99 | 51.66 | 48.44 | 1.08 |
+| 2 | 2.0 | 50.31 | 50.53 | 53.66 | 49.10 | 49.87 | 50.04 | 52.99 | 49.41 | 1.01 |
+| 4 | 2.0 | 49.17 | 49.48 | 52.13 | 48.73 | 52.92 | 53.21 | 55.28 | 52.31 | 1.07 |
+| 8 | 2.0 | 51.20 | 51.40 | 52.32 | 49.01 | 73.02 | 73.30 | 75.00 | 71.99 | 1.47 |
+| 16 | 2.0 | 51.75 | 52.24 | 56.36 | 51.27 | 83.99 | 84.57 | 86.69 | 83.24 | 1.62 |
+| 1 | 7.0 | 48.13 | 48.53 | 50.95 | 46.78 | 48.52 | 48.75 | 50.89 | 48.01 | 1.03 |
+| 2 | 7.0 | 49.52 | 50.10 | 52.35 | 48.00 | 65.27 | 65.41 | 66.59 | 64.79 | 1.35 |
+| 4 | 7.0 | 51.75 | 52.01 | 54.39 | 50.38 | 93.75 | 94.77 | 97.04 | 92.27 | 1.83 |
+| 8 | 7.0 | 54.80 | 56.27 | 66.23 | 52.95 | 130.65 | 131.09 | 132.91 | 129.82 | 2.45 |
+| 16 | 7.0 | 73.02 | 73.42 | 75.83 | 71.96 | 157.53 | 158.20 | 160.73 | 155.51 | 2.16 |
+| 1 | 16.7 | 48.10 | 48.52 | 52.71 | 47.20 | 73.34 | 73.56 | 74.19 | 72.69 | 1.54 |
+| 2 | 16.7 | 64.21 | 64.52 | 65.56 | 56.06 | 129.48 | 129.97 | 131.78 | 126.36 | 2.25 |
+| 4 | 16.7 | 60.38 | 61.03 | 63.18 | 58.87 | 183.33 | 183.85 | 185.53 | 181.90 | 3.09 |
+| 8 | 16.7 | 85.88 | 86.34 | 87.70 | 84.46 | 227.42 | 228.21 | 229.63 | 225.71 | 2.67 |
+| 16 | 16.7 | 135.62 | 136.40 | 137.69 | 131.58 | 276.90 | 277.59 | 281.16 | 275.08 | 2.09 |
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Inference performance: NVIDIA DGX-1 (1x V100 32GB)
+| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
+|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
+| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
+| 1 | 2.0 | 52.74 | 53.01 | 54.40 | 51.47 | 55.97 | 56.22 | 57.93 | 54.93 | 1.07 |
+| 2 | 2.0 | 51.77 | 52.15 | 54.69 | 50.98 | 56.58 | 56.87 | 58.88 | 55.35 | 1.09 |
+| 4 | 2.0 | 51.41 | 51.76 | 53.47 | 50.55 | 61.56 | 61.87 | 63.81 | 60.74 | 1.20 |
+| 8 | 2.0 | 51.83 | 52.15 | 54.08 | 50.85 | 80.20 | 80.69 | 81.67 | 77.69 | 1.53 |
+| 16 | 2.0 | 70.48 | 70.96 | 72.11 | 62.98 | 93.00 | 93.44 | 94.17 | 89.05 | 1.41 |
+| 1 | 7.0 | 49.77 | 50.21 | 51.88 | 48.73 | 52.74 | 52.99 | 54.54 | 51.67 | 1.06 |
+| 2 | 7.0 | 51.12 | 51.47 | 52.84 | 49.98 | 65.33 | 65.63 | 67.07 | 64.64 | 1.29 |
+| 4 | 7.0 | 53.13 | 53.56 | 55.68 | 52.15 | 93.54 | 93.85 | 94.72 | 92.76 | 1.78 |
+| 8 | 7.0 | 57.67 | 58.07 | 59.89 | 56.41 | 133.93 | 134.18 | 134.88 | 133.15 | 2.36 |
+| 16 | 7.0 | 76.09 | 76.48 | 79.13 | 75.27 | 162.35 | 162.77 | 164.63 | 161.30 | 2.14 |
+| 1 | 16.7 | 54.78 | 55.29 | 56.83 | 52.51 | 75.37 | 76.27 | 78.05 | 74.32 | 1.42 |
+| 2 | 16.7 | 56.80 | 57.20 | 59.01 | 55.49 | 130.60 | 131.36 | 132.93 | 128.55 | 2.32 |
+| 4 | 16.7 | 64.19 | 64.84 | 66.47 | 62.87 | 188.09 | 188.76 | 190.07 | 185.76 | 2.95 |
+| 8 | 16.7 | 87.46 | 87.86 | 89.99 | 86.47 | 232.33 | 232.89 | 234.43 | 230.44 | 2.67 |
+| 16 | 16.7 | 136.02 | 136.52 | 139.44 | 134.78 | 283.87 | 284.59 | 286.70 | 282.01 | 2.09 |
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Inference performance: NVIDIA DGX-2 (1x V100 32GB)
+| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
+|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
+| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
+| 1 | 2.0 | 35.88 | 36.12 | 39.80 | 35.20 | 42.95 | 43.67 | 46.65 | 42.23 | 1.20 |
+| 2 | 2.0 | 36.36 | 36.57 | 40.97 | 35.60 | 41.83 | 42.21 | 45.60 | 40.97 | 1.15 |
+| 4 | 2.0 | 36.69 | 36.89 | 41.25 | 36.05 | 48.35 | 48.52 | 52.35 | 47.80 | 1.33 |
+| 8 | 2.0 | 37.49 | 37.70 | 41.37 | 36.88 | 65.41 | 65.64 | 66.50 | 64.96 | 1.76 |
+| 16 | 2.0 | 41.35 | 41.79 | 45.58 | 40.91 | 77.22 | 77.51 | 79.48 | 76.54 | 1.87 |
+| 1 | 7.0 | 36.07 | 36.55 | 40.31 | 35.62 | 39.52 | 39.84 | 43.07 | 38.93 | 1.09 |
+| 2 | 7.0 | 37.42 | 37.66 | 41.36 | 36.79 | 55.94 | 56.19 | 58.33 | 55.60 | 1.51 |
+| 4 | 7.0 | 38.51 | 38.95 | 42.55 | 37.98 | 86.62 | 87.08 | 87.50 | 86.20 | 2.27 |
+| 8 | 7.0 | 42.82 | 43.00 | 47.11 | 42.55 | 122.05 | 122.29 | 122.70 | 121.59 | 2.86 |
+| 16 | 7.0 | 67.74 | 67.92 | 69.05 | 65.69 | 149.92 | 150.16 | 151.03 | 149.49 | 2.28 |
+| 1 | 16.7 | 39.28 | 39.78 | 43.34 | 38.35 | 66.73 | 67.16 | 69.80 | 66.01 | 1.72 |
+| 2 | 16.7 | 43.05 | 43.42 | 47.18 | 42.43 | 120.04 | 121.12 | 123.32 | 118.14 | 2.78 |
+| 4 | 16.7 | 52.18 | 52.49 | 56.11 | 51.63 | 176.09 | 176.51 | 178.70 | 174.60 | 3.38 |
+| 8 | 16.7 | 78.55 | 78.79 | 81.66 | 78.04 | 216.19 | 216.68 | 217.63 | 214.48 | 2.75 |
+| 16 | 16.7 | 125.57 | 125.92 | 128.78 | 124.33 | 264.11 | 264.49 | 266.14 | 262.80 | 2.11 |
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Inference performance: NVIDIA T4
+| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
+|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
+| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
+| 1 | 2.0 | 43.62 | 46.95 | 50.46 | 37.23 | 51.31 | 52.37 | 56.21 | 49.77 | 1.34 |
+| 2 | 2.0 | 49.09 | 50.46 | 53.11 | 40.61 | 81.85 | 82.22 | 83.94 | 80.81 | 1.99 |
+| 4 | 2.0 | 47.71 | 51.14 | 55.09 | 41.29 | 112.56 | 115.13 | 118.56 | 111.60 | 2.70 |
+| 8 | 2.0 | 51.37 | 53.11 | 55.48 | 45.94 | 198.95 | 199.48 | 200.28 | 197.22 | 4.29 |
+| 16 | 2.0 | 63.59 | 64.30 | 66.90 | 61.77 | 221.75 | 222.07 | 223.22 | 220.09 | 3.56 |
+| 1 | 7.0 | 47.49 | 48.66 | 53.36 | 40.76 | 73.63 | 74.41 | 77.65 | 72.41 | 1.78 |
+| 2 | 7.0 | 48.63 | 50.01 | 58.35 | 43.44 | 114.66 | 115.28 | 117.63 | 112.41 | 2.59 |
+| 4 | 7.0 | 52.19 | 52.85 | 54.22 | 49.94 | 200.38 | 201.29 | 202.97 | 197.21 | 3.95 |
+| 8 | 7.0 | 84.90 | 85.56 | 87.52 | 83.41 | 404.00 | 404.72 | 405.70 | 400.25 | 4.80 |
+| 16 | 7.0 | 157.12 | 157.58 | 159.19 | 155.01 | 490.93 | 492.09 | 493.44 | 486.45 | 3.14 |
+| 1 | 16.7 | 50.57 | 51.57 | 57.58 | 46.27 | 150.39 | 151.84 | 153.54 | 147.31 | 3.18 |
+| 2 | 16.7 | 63.64 | 64.55 | 66.31 | 61.98 | 256.54 | 258.16 | 262.71 | 250.34 | 4.04 |
+| 4 | 16.7 | 140.44 | 141.06 | 142.00 | 138.14 | 519.59 | 521.41 | 523.86 | 512.74 | 3.71 |
+| 8 | 16.7 | 267.03 | 268.06 | 270.01 | 263.15 | 727.33 | 728.61 | 731.36 | 722.62 | 2.75 |
+| 16 | 16.7 | 362.40 | 364.02 | 367.80 | 358.75 | 867.92 | 869.19 | 871.46 | 860.37 | 2.40 |
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+## Release notes
+We're constantly refining and improving our performance on AI and HPC workloads, even on the same hardware, with frequent updates to our software stack. For our latest performance data, please refer to our AI and HPC benchmark pages.
+
+### Changelog
+February 2021
+* Added DALI data-processing pipeline for on-the-fly data processing and augmentation on CPU or GPU
+* Revised training recipe: ~10% relative improvement in Word Error Rate (WER)
+* Updated Triton scripts for compatibility with Triton V2 API, updated Triton inference results
+* Refactored codebase
+* Updated performance results for the PyTorch 20.10-py3 NGC container
+
+June 2020
+* Updated performance tables to include A100 results
+
+December 2019
+* Inference support for TRT 6 with dynamic shapes
+* Inference support for TensorRT Inference Server with acoustic model backends in ONNX, PyTorch JIT, TensorRT
+* Jupyter notebook for inference with TensorRT Inference Server
+
+November 2019
+* Google Colab notebook for inference with native TensorRT
+
+September 2019
+* Inference support for TensorRT 6 with static shapes
+* Jupyter notebook for inference
+
+August 2019
+* Initial release
+
+### Known issues
+There are no known issues in this release.
diff --git a/PyTorch/contrib/audio/Jasper/common/__init__.py b/PyTorch/contrib/audio/Jasper/common/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/PyTorch/contrib/audio/Jasper/common/audio.py b/PyTorch/contrib/audio/Jasper/common/audio.py
new file mode 100644
index 0000000000000000000000000000000000000000..916394f50d8302027c4db0fc63f09a05b1787fed
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/audio.py
@@ -0,0 +1,247 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import soundfile as sf
+
+import librosa
+import torch
+import numpy as np
+
+import sox
+
+
+def audio_from_file(file_path, offset=0, duration=0, trim=False, target_sr=16000):
+ audio = AudioSegment(file_path, target_sr=target_sr, int_values=False,
+ offset=offset, duration=duration, trim=trim)
+
+ samples = torch.tensor(audio.samples, dtype=torch.float).cuda()
+ num_samples = torch.tensor(samples.shape[0]).int().cuda()
+ return (samples.unsqueeze(0), num_samples.unsqueeze(0))
+
+
+class AudioSegment(object):
+ """Monaural audio segment abstraction.
+
+ :param samples: Audio samples [num_samples x num_channels].
+ :type samples: ndarray.float32
+ :param sample_rate: Audio sample rate.
+ :type sample_rate: int
+ :raises TypeError: If the sample data type is not float or int.
+ """
+
+ def __init__(self, filename, target_sr=None, int_values=False, offset=0,
+ duration=0, trim=False, trim_db=60):
+ """Create audio segment from samples.
+
+ Samples are convert float32 internally, with int scaled to [-1, 1].
+ Load a file supported by librosa and return as an AudioSegment.
+ :param filename: path of file to load
+ :param target_sr: the desired sample rate
+ :param int_values: if true, load samples as 32-bit integers
+ :param offset: offset in seconds when loading audio
+ :param duration: duration in seconds when loading audio
+ :return: numpy array of samples
+ """
+ with sf.SoundFile(filename, 'r') as f:
+ dtype = 'int32' if int_values else 'float32'
+ sample_rate = f.samplerate
+ if offset > 0:
+ f.seek(int(offset * sample_rate))
+ if duration > 0:
+ samples = f.read(int(duration * sample_rate), dtype=dtype)
+ else:
+ samples = f.read(dtype=dtype)
+ samples = samples.transpose()
+
+ samples = self._convert_samples_to_float32(samples)
+ if target_sr is not None and target_sr != sample_rate:
+ samples = librosa.core.resample(samples, sample_rate, target_sr)
+ sample_rate = target_sr
+ if trim:
+ samples, _ = librosa.effects.trim(samples, trim_db)
+ self._samples = samples
+ self._sample_rate = sample_rate
+ if self._samples.ndim >= 2:
+ self._samples = np.mean(self._samples, 1)
+
+ def __eq__(self, other):
+ """Return whether two objects are equal."""
+ if type(other) is not type(self):
+ return False
+ if self._sample_rate != other._sample_rate:
+ return False
+ if self._samples.shape != other._samples.shape:
+ return False
+ if np.any(self.samples != other._samples):
+ return False
+ return True
+
+ def __ne__(self, other):
+ """Return whether two objects are unequal."""
+ return not self.__eq__(other)
+
+ def __str__(self):
+ """Return human-readable representation of segment."""
+ return ("%s: num_samples=%d, sample_rate=%d, duration=%.2fsec, "
+ "rms=%.2fdB" % (type(self), self.num_samples, self.sample_rate,
+ self.duration, self.rms_db))
+
+ @staticmethod
+ def _convert_samples_to_float32(samples):
+ """Convert sample type to float32.
+
+        Audio sample type is usually integer or floating-point.
+ Integers will be scaled to [-1, 1] in float32.
+ """
+ float32_samples = samples.astype('float32')
+ if samples.dtype in np.sctypes['int']:
+ bits = np.iinfo(samples.dtype).bits
+ float32_samples *= (1. / 2 ** (bits - 1))
+ elif samples.dtype in np.sctypes['float']:
+ pass
+ else:
+ raise TypeError("Unsupported sample type: %s." % samples.dtype)
+ return float32_samples
+
+ @property
+ def samples(self):
+ return self._samples.copy()
+
+ @property
+ def sample_rate(self):
+ return self._sample_rate
+
+ @property
+ def num_samples(self):
+ return self._samples.shape[0]
+
+ @property
+ def duration(self):
+ return self._samples.shape[0] / float(self._sample_rate)
+
+ @property
+ def rms_db(self):
+ mean_square = np.mean(self._samples ** 2)
+ return 10 * np.log10(mean_square)
+
+ def gain_db(self, gain):
+ self._samples *= 10. ** (gain / 20.)
+
+ def pad(self, pad_size, symmetric=False):
+ """Add zero padding to the sample.
+
+ The pad size is given in number of samples. If symmetric=True,
+ `pad_size` will be added to both sides. If false, `pad_size` zeros
+ will be added only to the end.
+ """
+ self._samples = np.pad(self._samples,
+ (pad_size if symmetric else 0, pad_size),
+ mode='constant')
+
+ def subsegment(self, start_time=None, end_time=None):
+ """Cut the AudioSegment between given boundaries.
+
+ Note that this is an in-place transformation.
+ :param start_time: Beginning of subsegment in seconds.
+ :type start_time: float
+ :param end_time: End of subsegment in seconds.
+ :type end_time: float
+ :raise ValueError: If start_time or end_time is incorrectly set, e.g. out
+ of bounds in time.
+ """
+ start_time = 0.0 if start_time is None else start_time
+ end_time = self.duration if end_time is None else end_time
+ if start_time < 0.0:
+ start_time = self.duration + start_time
+ if end_time < 0.0:
+ end_time = self.duration + end_time
+ if start_time < 0.0:
+ raise ValueError("The slice start position (%f s) is out of "
+ "bounds." % start_time)
+ if end_time < 0.0:
+ raise ValueError("The slice end position (%f s) is out of bounds." %
+ end_time)
+ if start_time > end_time:
+ raise ValueError("The slice start position (%f s) is later than "
+ "the end position (%f s)." % (start_time, end_time))
+ if end_time > self.duration:
+ raise ValueError("The slice end position (%f s) is out of bounds "
+ "(> %f s)" % (end_time, self.duration))
+ start_sample = int(round(start_time * self._sample_rate))
+ end_sample = int(round(end_time * self._sample_rate))
+ self._samples = self._samples[start_sample:end_sample]
+
+
+class Perturbation:
+ def __init__(self, p=0.1, rng=None):
+ self.p = p
+ self._rng = random.Random() if rng is None else rng
+
+ def maybe_apply(self, segment, sample_rate=None):
+ if self._rng.random() < self.p:
+ self(segment, sample_rate)
+
+
+class SpeedPerturbation(Perturbation):
+ def __init__(self, min_rate=0.85, max_rate=1.15, discrete=False, p=0.1, rng=None):
+ super(SpeedPerturbation, self).__init__(p, rng)
+ assert 0 < min_rate < max_rate
+ self.min_rate = min_rate
+ self.max_rate = max_rate
+ self.discrete = discrete
+
+ def __call__(self, data, sample_rate):
+ if self.discrete:
+ rate = np.random.choice([self.min_rate, None, self.max_rate])
+ else:
+ rate = self._rng.uniform(self.min_rate, self.max_rate)
+
+ if rate is not None:
+ data._samples = sox.Transformer().speed(factor=rate).build_array(
+ input_array=data._samples, sample_rate_in=sample_rate)
+
+
+class GainPerturbation(Perturbation):
+ def __init__(self, min_gain_dbfs=-10, max_gain_dbfs=10, p=0.1, rng=None):
+ super(GainPerturbation, self).__init__(p, rng)
+ self._rng = random.Random() if rng is None else rng
+ self._min_gain_dbfs = min_gain_dbfs
+ self._max_gain_dbfs = max_gain_dbfs
+
+ def __call__(self, data, sample_rate=None):
+ del sample_rate
+ gain = self._rng.uniform(self._min_gain_dbfs, self._max_gain_dbfs)
+ data._samples = data._samples * (10. ** (gain / 20.))
+
+
+class ShiftPerturbation(Perturbation):
+ def __init__(self, min_shift_ms=-5.0, max_shift_ms=5.0, p=0.1, rng=None):
+ super(ShiftPerturbation, self).__init__(p, rng)
+ self._min_shift_ms = min_shift_ms
+ self._max_shift_ms = max_shift_ms
+
+ def __call__(self, data, sample_rate):
+ shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
+ if abs(shift_ms) / 1000 > data.duration:
+ # TODO: do something smarter than just ignore this condition
+ return
+ shift_samples = int(shift_ms * data.sample_rate // 1000)
+ # print("DEBUG: shift:", shift_samples)
+ if shift_samples < 0:
+ data._samples[-shift_samples:] = data._samples[:shift_samples]
+ data._samples[:-shift_samples] = 0
+ elif shift_samples > 0:
+ data._samples[:-shift_samples] = data._samples[shift_samples:]
+ data._samples[-shift_samples:] = 0
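+
+
+if __name__ == "__main__":
+    # Minimal usage sketch (not part of the training pipeline): load a file,
+    # apply speed and gain perturbation, and print a summary of the segment.
+    # "sample.wav" is a placeholder path; substitute any audio file readable
+    # by soundfile.
+    segment = AudioSegment("sample.wav", target_sr=16000, trim=True)
+    for perturbation in (SpeedPerturbation(p=1.0), GainPerturbation(p=1.0)):
+        perturbation.maybe_apply(segment, segment.sample_rate)
+    print(segment)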
diff --git a/PyTorch/contrib/audio/Jasper/common/dali/__init__.py b/PyTorch/contrib/audio/Jasper/common/dali/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..ff800034d6e3c3b67b75cb5aa71f3b7d3340205a
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/dali/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/PyTorch/contrib/audio/Jasper/common/dali/data_loader.py b/PyTorch/contrib/audio/Jasper/common/dali/data_loader.py
new file mode 100644
index 0000000000000000000000000000000000000000..99c06c695a18d965314dab32ba592116df010ffe
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/dali/data_loader.py
@@ -0,0 +1,158 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import math
+import numpy as np
+import torch.distributed as dist
+from .iterator import DaliJasperIterator, SyntheticDataIterator
+from .pipeline import DaliPipeline
+from common.helpers import print_once
+
+
+def _parse_json(json_path: str, start_label=0, predicate=lambda json: True):
+ """
+ Parses json file to the format required by DALI
+ Args:
+ json_path: path to json file
+        start_label: the first label; DALI assigns consecutive integer labels to transcripts starting from this value
+        predicate: function that accepts a sample descriptor (i.e. a json dictionary) as an argument.
+            If the predicate for a given sample returns True, it will be included in the dataset.
+
+    Returns:
+        output_files: dictionary that maps file names to the labels assigned by DALI
+        transcripts: dictionary that maps the labels assigned by DALI to the transcripts
+ """
+ import json
+ with open(json_path) as f:
+ librispeech_json = json.load(f)
+ output_files = {}
+ transcripts = {}
+ curr_label = start_label
+ for original_sample in librispeech_json:
+ if not predicate(original_sample):
+ continue
+ transcripts[curr_label] = original_sample['transcript']
+ output_files[original_sample['files'][-1]['fname']] = curr_label
+ curr_label += 1
+ return output_files, transcripts
+
+
+def _dict_to_file(d: dict, filename: str):
+    with open(filename, "w") as f:
+        for key, value in d.items():
+            f.write("{} {}\n".format(key, value))
+
+
+class DaliDataLoader:
+ """
+ DataLoader is the main entry point to the data preprocessing pipeline.
+ To use, create an object and then just iterate over `data_iterator`.
+ DataLoader will do the rest for you.
+ Example:
+ data_layer = DataLoader(DaliTrainPipeline, path, json, bs, ngpu)
+ data_it = data_layer.data_iterator
+ for data in data_it:
+ print(data) # Here's your preprocessed data
+
+ Args:
+ device_type: Which device to use for preprocessing. Choose: "cpu", "gpu"
+ pipeline_type: Choose: "train", "val", "synth"
+ """
+
+ def __init__(self, gpu_id, dataset_path: str, config_data: dict, config_features: dict, json_names: list,
+ symbols: list, batch_size: int, pipeline_type: str, grad_accumulation_steps: int = 1,
+ synth_iters_per_epoch: int = 544, device_type: str = "gpu"):
+ import torch
+ self.batch_size = batch_size
+ self.grad_accumulation_steps = grad_accumulation_steps
+ self.drop_last = (pipeline_type == 'train')
+ self.device_type = device_type
+ pipeline_type = self._parse_pipeline_type(pipeline_type)
+ if pipeline_type == "synth":
+ self._dali_data_iterator = self._init_synth_iterator(self.batch_size, config_features['nfilt'],
+ iters_per_epoch=synth_iters_per_epoch,
+ ngpus=torch.distributed.get_world_size())
+ else:
+ self._dali_data_iterator = self._init_iterator(gpu_id=gpu_id, dataset_path=dataset_path,
+ config_data=config_data,
+ config_features=config_features,
+ json_names=json_names, symbols=symbols,
+ train_pipeline=pipeline_type == "train")
+
+ def _init_iterator(self, gpu_id, dataset_path, config_data, config_features, json_names: list, symbols: list,
+ train_pipeline: bool):
+ """
+        Returns a data iterator. The data underneath this operator is preprocessed within DALI.
+ """
+
+ def hash_list_of_strings(li):
+ return str(abs(hash(''.join(li))))
+
+ output_files, transcripts = {}, {}
+ max_duration = config_data['max_duration']
+ for jname in json_names:
+ of, tr = _parse_json(jname if jname[0] == '/' else os.path.join(dataset_path, jname), len(output_files),
+ predicate=lambda json: json['original_duration'] <= max_duration)
+ output_files.update(of)
+ transcripts.update(tr)
+ file_list_path = os.path.join("/tmp", "jasper_dali.file_list." + hash_list_of_strings(json_names))
+ _dict_to_file(output_files, file_list_path)
+ self.dataset_size = len(output_files)
+ print_once(f"Dataset read by DALI. Number of samples: {self.dataset_size}")
+
+ pipeline = DaliPipeline.from_config(config_data=config_data, config_features=config_features, device_id=gpu_id,
+ file_root=dataset_path, file_list=file_list_path,
+ device_type=self.device_type, batch_size=self.batch_size,
+ train_pipeline=train_pipeline)
+
+ return DaliJasperIterator([pipeline], transcripts=transcripts, symbols=symbols, batch_size=self.batch_size,
+ reader_name="file_reader", train_iterator=train_pipeline)
+
+ def _init_synth_iterator(self, batch_size, nfeatures, iters_per_epoch, ngpus):
+ self.dataset_size = ngpus * iters_per_epoch * batch_size
+ return SyntheticDataIterator(batch_size, nfeatures, regenerate=True)
+
+ @staticmethod
+ def _parse_pipeline_type(pipeline_type):
+ pipe = pipeline_type.lower()
+ assert pipe in ("train", "val", "synth"), 'Invalid pipeline type (choices: "train", "val", "synth").'
+ return pipe
+
+ def _shard_size(self):
+ """
+ Total number of samples handled by a single GPU in a single epoch.
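+
+        For example, with 1000 samples, 4 GPUs, batch size 8 and no gradient
+        accumulation, drop_last gives 1000 // 32 * 32 // 4 = 248 samples per GPU.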
+ """
+ world_size = dist.get_world_size() if dist.is_initialized() else 1
+ if self.drop_last:
+ divisor = world_size * self.batch_size * self.grad_accumulation_steps
+ return self.dataset_size // divisor * divisor // world_size
+ else:
+ return int(math.ceil(self.dataset_size / world_size))
+
+ def __len__(self):
+ """
+ Number of batches handled by each GPU.
+ """
+ if self.drop_last:
+ assert self._shard_size() % self.batch_size == 0, f'{self._shard_size()} {self.batch_size}'
+
+ return int(math.ceil(self._shard_size() / self.batch_size))
+
+ def data_iterator(self):
+ return self._dali_data_iterator
+
+ def __iter__(self):
+ return self._dali_data_iterator
diff --git a/PyTorch/contrib/audio/Jasper/common/dali/iterator.py b/PyTorch/contrib/audio/Jasper/common/dali/iterator.py
new file mode 100644
index 0000000000000000000000000000000000000000..ea1101350053e7581f7bbfef5d0d4dcfc12b47d5
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/dali/iterator.py
@@ -0,0 +1,161 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+import torch.distributed as dist
+import numpy as np
+from common.helpers import print_once
+from common.text import _clean_text, punctuation_map
+
+
+def normalize_string(s, symbols, punct_map):
+ """
+ Normalizes string.
+ Example:
+ 'call me at 8:00 pm!' -> 'call me at eight zero pm'
+ """
+ labels = set(symbols)
+ try:
+ text = _clean_text(s, ["english_cleaners"], punct_map).strip()
+ return ''.join([tok for tok in text if all(t in labels for t in tok)])
+ except Exception as e:
+ print_once("WARNING: Normalizing failed: {s} {e}")
+
+
+class DaliJasperIterator(object):
+ """
+ Returns batches of data for Jasper training:
+ preprocessed_signal, preprocessed_signal_length, transcript, transcript_length
+
+ This iterator is not meant to be the entry point to Dali processing pipeline.
+ Use DataLoader instead.
+ """
+
+ def __init__(self, dali_pipelines, transcripts, symbols, batch_size, reader_name, train_iterator: bool):
+ self.transcripts = transcripts
+ self.symbols = symbols
+ self.batch_size = batch_size
+ from nvidia.dali.plugin.pytorch import DALIGenericIterator
+ from nvidia.dali.plugin.base_iterator import LastBatchPolicy
+
+        # in the train pipeline, shard_size is set to be divisible by batch_size, so the PARTIAL policy is safe
+ self.dali_it = DALIGenericIterator(
+ dali_pipelines, ["audio", "label", "audio_shape"], reader_name=reader_name,
+ dynamic_shape=True, auto_reset=True, last_batch_policy=LastBatchPolicy.PARTIAL)
+
+ @staticmethod
+ def _str2list(s: str):
+ """
+        Returns a list of floats that represents the given string.
+        '0.' denotes the separator
+        '1.' denotes 'a'
+        '27.' denotes "'"
+        Assumes that the string is lower case.
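+        Example: "don't" -> [4., 15., 14., 27., 20.]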
+ """
+        out = []
+        for c in s:
+            if c == "'":
+                out.append(27.)
+            else:
+                out.append(max(0., ord(c) - 96.))
+        return out
+
+ @staticmethod
+ def _pad_lists(lists: list, pad_val=0):
+ """
+ Pads lists, so that all have the same size.
+ Returns list with actual sizes of corresponding input lists
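+        Example: [[1., 2.], [3.]] -> lists become [[1., 2.], [3., 0]] and [2, 1] is returned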
+ """
+ max_length = 0
+ sizes = []
+ for li in lists:
+ sizes.append(len(li))
+            max_length = max(max_length, len(li))
+ for li in lists:
+ li += [pad_val] * (max_length - len(li))
+ return sizes
+
+ def _gen_transcripts(self, labels, normalize_transcripts: bool = True):
+ """
+ Generate transcripts in format expected by NN
+ """
+ lists = [
+ self._str2list(normalize_string(self.transcripts[lab.item()], self.symbols, punctuation_map(self.symbols)))
+ for lab in labels
+ ] if normalize_transcripts else [self._str2list(self.transcripts[lab.item()]) for lab in labels]
+ sizes = self._pad_lists(lists)
+ return torch.tensor(lists).cuda(), torch.tensor(sizes, dtype=torch.int32).cuda()
+
+ def __next__(self):
+ data = self.dali_it.__next__()
+ transcripts, transcripts_lengths = self._gen_transcripts(data[0]["label"])
+ return data[0]["audio"], data[0]["audio_shape"][:, 1], transcripts, transcripts_lengths
+
+ def next(self):
+ return self.__next__()
+
+ def __iter__(self):
+ return self
+
+
+# TODO: refactor
+class SyntheticDataIterator(object):
+ def __init__(self, batch_size, nfeatures, feat_min=-5., feat_max=0., txt_min=0., txt_max=23., feat_lens_max=1760,
+ txt_lens_max=231, regenerate=False):
+ """
+ Args:
+            batch_size: number of samples per generated batch
+            nfeatures: number of features for mel filterbanks
+            feat_min: minimum value in `feat` tensor, used for randomization
+            feat_max: maximum value in `feat` tensor, used for randomization
+            txt_min: minimum value in `txt` tensor, used for randomization
+            txt_max: maximum value in `txt` tensor, used for randomization
+            feat_lens_max: time dimension of `feat`; random lengths are drawn below this value
+            txt_lens_max: length of `txt`; random lengths are drawn below this value
+            regenerate: If True, regenerate random tensors for every iterator step.
+                If False, generate them only at start.
+ """
+ self.batch_size = batch_size
+ self.nfeatures = nfeatures
+ self.feat_min = feat_min
+ self.feat_max = feat_max
+ self.feat_lens_max = feat_lens_max
+ self.txt_min = txt_min
+ self.txt_max = txt_max
+ self.txt_lens_max = txt_lens_max
+ self.regenerate = regenerate
+
+ if not self.regenerate:
+ self.feat, self.feat_lens, self.txt, self.txt_lens = self._generate_sample()
+
+ def _generate_sample(self):
+ feat = (self.feat_max - self.feat_min) * np.random.random_sample(
+ (self.batch_size, self.nfeatures, self.feat_lens_max)) + self.feat_min
+ feat_lens = np.random.randint(0, int(self.feat_lens_max) - 1, size=self.batch_size)
+ txt = (self.txt_max - self.txt_min) * np.random.random_sample(
+ (self.batch_size, self.txt_lens_max)) + self.txt_min
+ txt_lens = np.random.randint(0, int(self.txt_lens_max) - 1, size=self.batch_size)
+ return torch.Tensor(feat).cuda(), \
+ torch.Tensor(feat_lens).cuda(), \
+ torch.Tensor(txt).cuda(), \
+ torch.Tensor(txt_lens).cuda()
+
+ def __next__(self):
+ if self.regenerate:
+ return self._generate_sample()
+ return self.feat, self.feat_lens, self.txt, self.txt_lens
+
+ def next(self):
+ return self.__next__()
+
+ def __iter__(self):
+ return self
diff --git a/PyTorch/contrib/audio/Jasper/common/dali/pipeline.py b/PyTorch/contrib/audio/Jasper/common/dali/pipeline.py
new file mode 100644
index 0000000000000000000000000000000000000000..bdbb26b40c939aa329c3e0c0670d9fcfb087129d
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/dali/pipeline.py
@@ -0,0 +1,366 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import nvidia.dali as dali
+import nvidia.dali.fn as fn
+import nvidia.dali.types as types
+import multiprocessing
+import numpy as np
+import torch
+import math
+import itertools
+
+
+class DaliPipeline():
+ def __init__(self, *,
+ train_pipeline: bool, # True if train pipeline, False if validation pipeline
+ device_id,
+ num_threads,
+ batch_size,
+ file_root: str,
+ file_list: str,
+ sample_rate,
+ discrete_resample_range: bool,
+ resample_range: list,
+ window_size,
+ window_stride,
+ nfeatures,
+ nfft,
+ frame_splicing_factor,
+ dither_coeff,
+ silence_threshold,
+ preemph_coeff,
+ pad_align,
+ max_duration,
+ mask_time_num_regions,
+ mask_time_min,
+ mask_time_max,
+ mask_freq_num_regions,
+ mask_freq_min,
+ mask_freq_max,
+ mask_both_num_regions,
+ mask_both_min_time,
+ mask_both_max_time,
+ mask_both_min_freq,
+ mask_both_max_freq,
+ preprocessing_device="gpu",
+ is_triton_pipeline=False):
+ self._dali_init_log(locals())
+
+ if torch.distributed.is_initialized():
+ shard_id = torch.distributed.get_rank()
+ n_shards = torch.distributed.get_world_size()
+ else:
+ shard_id = 0
+ n_shards = 1
+
+ self.preprocessing_device = preprocessing_device.lower()
+ assert self.preprocessing_device == "cpu" or self.preprocessing_device == "gpu", \
+ "Incorrect preprocessing device. Please choose either 'cpu' or 'gpu'"
+ self.frame_splicing_factor = frame_splicing_factor
+
+ # TODO(janton): Implement this
+ assert frame_splicing_factor == 1, "Frame splicing is not yet implemented"
+
+ self.resample_range = resample_range
+ self.discrete_resample_range = discrete_resample_range
+
+ self.train = train_pipeline
+ self.sample_rate = sample_rate
+ self.dither_coeff = dither_coeff
+ self.nfeatures = nfeatures
+ self.max_duration = max_duration
+ self.mask_params = {
+ 'time_num_regions': mask_time_num_regions,
+ 'time_min': mask_time_min,
+ 'time_max': mask_time_max,
+ 'freq_num_regions': mask_freq_num_regions,
+ 'freq_min': mask_freq_min,
+ 'freq_max': mask_freq_max,
+ 'both_num_regions': mask_both_num_regions,
+ 'both_min_time': mask_both_min_time,
+ 'both_max_time': mask_both_max_time,
+ 'both_min_freq': mask_both_min_freq,
+ 'both_max_freq': mask_both_max_freq,
+ }
+ self.do_remove_silence = True if silence_threshold is not None else False
+
+ @dali.pipeline_def
+ def dali_jasper_pipe():
+ if is_triton_pipeline:
+ assert not self.train, "Pipeline for Triton shall be a validation pipeline"
+ if torch.distributed.is_initialized():
+ raise RuntimeError(
+ "You're creating Triton pipeline, using multi-process mode. Please use single-process mode.")
+ encoded, label = fn.external_source(device="cpu", name="DALI_INPUT_0", no_copy=True)
+ else:
+ encoded, label = fn.readers.file(device="cpu", name="file_reader",
+ file_root=file_root, file_list=file_list, shard_id=shard_id,
+ num_shards=n_shards, shuffle_after_epoch=train_pipeline)
+
+ speed_perturbation_coeffs = None
+ if resample_range is not None:
+ if discrete_resample_range:
+ values = [self.resample_range[0], 1.0, self.resample_range[1]]
+ speed_perturbation_coeffs = fn.random.uniform(device="cpu", values=values)
+ else:
+ speed_perturbation_coeffs = fn.random.uniform(device="cpu", range=resample_range)
+
+ if self.train and speed_perturbation_coeffs is not None:
+ dec_sample_rate_arg = speed_perturbation_coeffs * self.sample_rate
+ elif resample_range is None:
+ dec_sample_rate_arg = self.sample_rate
+ else:
+ dec_sample_rate_arg = None
+
+ audio, _ = fn.decoders.audio(encoded, sample_rate=dec_sample_rate_arg, dtype=types.FLOAT, downmix=True)
+
+ if self.do_remove_silence:
+ begin, length = fn.nonsilent_region(audio, cutoff_db=silence_threshold)
+ audio = fn.slice(audio, begin, length, axes=[0])
+
+ # Max duration drop is performed at DataLayer stage
+
+ if self.preprocessing_device == "gpu":
+ audio = audio.gpu()
+
+ if self.dither_coeff != 0.:
+ audio = audio + fn.random.normal(audio) * self.dither_coeff
+
+ audio = fn.preemphasis_filter(audio, preemph_coeff=preemph_coeff)
+
+ spec = fn.spectrogram(audio, nfft=nfft,
+ window_length=window_size * sample_rate, window_step=window_stride * sample_rate)
+
+ mel_spec = fn.mel_filter_bank(spec, sample_rate=sample_rate, nfilter=self.nfeatures, normalize=True)
+
+ log_features = fn.to_decibels(mel_spec, multiplier=np.log(10), reference=1.0, cutoff_db=math.log(1e-20))
+
+ log_features_len = fn.shapes(log_features)
+ if self.frame_splicing_factor != 1:
+ log_features_len = self._div_ceil(log_features_len, self.frame_splicing_factor)
+
+ log_features = fn.normalize(log_features, axes=[1])
+ log_features = fn.pad(log_features, axes=[1], fill_value=0, align=pad_align)
+
+ if self.train and self._do_spectrogram_masking():
+ anchors, shapes = fn.external_source(source=self._cutouts_generator, num_outputs=2, cycle=True)
+ log_features = fn.erase(log_features, anchor=anchors, shape=shapes, axes=[0, 1], fill_value=0,
+ normalized_anchor=True)
+
+ # When modifying DALI pipeline returns, make sure you update `output_map` in DALIGenericIterator invocation
+ return log_features.gpu(), label.gpu(), log_features_len.gpu()
+
+ self.pipe_handle = dali_jasper_pipe(batch_size=batch_size, num_threads=num_threads, device_id=device_id)
+
+ def get_pipeline(self):
+ return self.pipe_handle
+
+ @classmethod
+ def from_config(cls, train_pipeline: bool, device_id, batch_size, file_root: str, file_list: str, config_data: dict,
+ config_features: dict, device_type: str = "gpu", do_resampling: bool = True,
+ num_cpu_threads=multiprocessing.cpu_count()):
+
+ max_duration = config_data['max_duration']
+ sample_rate = config_data['sample_rate']
+ silence_threshold = -60 if config_data['trim_silence'] else None
+
+        # TODO Take into account resampling probability
+ # TODO config_features['speed_perturbation']['p']
+
+ if do_resampling and config_data['speed_perturbation'] is not None:
+ resample_range = [config_data['speed_perturbation']['min_rate'],
+ config_data['speed_perturbation']['max_rate']]
+ discrete_resample_range = config_data['speed_perturbation']['discrete']
+ else:
+ resample_range = None
+ discrete_resample_range = False
+
+ window_size = config_features['window_size']
+ window_stride = config_features['window_stride']
+ nfeatures = config_features['n_filt']
+ nfft = config_features['n_fft']
+ frame_splicing_factor = config_features['frame_splicing']
+ dither_coeff = config_features['dither']
+ pad_align = config_features['pad_align']
+ pad_to_max_duration = config_features['pad_to_max_duration']
+ assert not pad_to_max_duration, "Padding to max duration currently not supported in DALI"
+ preemph_coeff = .97
+
+ config_spec = config_features['spec_augment']
+ if config_spec is not None:
+ mask_time_num_regions = config_spec['time_masks']
+ mask_time_min = config_spec['min_time']
+ mask_time_max = config_spec['max_time']
+ mask_freq_num_regions = config_spec['freq_masks']
+ mask_freq_min = config_spec['min_freq']
+ mask_freq_max = config_spec['max_freq']
+ else:
+ mask_time_num_regions = 0
+ mask_time_min = 0
+ mask_time_max = 0
+ mask_freq_num_regions = 0
+ mask_freq_min = 0
+ mask_freq_max = 0
+
+ config_cutout = config_features['cutout_augment']
+ if config_cutout is not None:
+ mask_both_num_regions = config_cutout['masks']
+ mask_both_min_time = config_cutout['min_time']
+ mask_both_max_time = config_cutout['max_time']
+ mask_both_min_freq = config_cutout['min_freq']
+ mask_both_max_freq = config_cutout['max_freq']
+ else:
+ mask_both_num_regions = 0
+ mask_both_min_time = 0
+ mask_both_max_time = 0
+ mask_both_min_freq = 0
+ mask_both_max_freq = 0
+
+ inst = cls(train_pipeline=train_pipeline,
+ device_id=device_id,
+ preprocessing_device=device_type,
+ num_threads=num_cpu_threads,
+ batch_size=batch_size,
+ file_root=file_root,
+ file_list=file_list,
+ sample_rate=sample_rate,
+ discrete_resample_range=discrete_resample_range,
+ resample_range=resample_range,
+ window_size=window_size,
+ window_stride=window_stride,
+ nfeatures=nfeatures,
+ nfft=nfft,
+ frame_splicing_factor=frame_splicing_factor,
+ dither_coeff=dither_coeff,
+ silence_threshold=silence_threshold,
+ preemph_coeff=preemph_coeff,
+ pad_align=pad_align,
+ max_duration=max_duration,
+ mask_time_num_regions=mask_time_num_regions,
+ mask_time_min=mask_time_min,
+ mask_time_max=mask_time_max,
+ mask_freq_num_regions=mask_freq_num_regions,
+ mask_freq_min=mask_freq_min,
+ mask_freq_max=mask_freq_max,
+ mask_both_num_regions=mask_both_num_regions,
+ mask_both_min_time=mask_both_min_time,
+ mask_both_max_time=mask_both_max_time,
+ mask_both_min_freq=mask_both_min_freq,
+ mask_both_max_freq=mask_both_max_freq)
+ return inst.get_pipeline()
+
+ @staticmethod
+ def _dali_init_log(args: dict):
+ if (not torch.distributed.is_initialized() or (
+ torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)): # print once
+ max_len = max([len(ii) for ii in args.keys()])
+ fmt_string = '\t%' + str(max_len) + 's : %s'
+ print('Initializing DALI with parameters:')
+ for keyPair in sorted(args.items()):
+ print(fmt_string % keyPair)
+
+ @staticmethod
+ def _div_ceil(dividend, divisor):
+ return (dividend + (divisor - 1)) // divisor
+
+ def _do_spectrogram_masking(self):
+ return self.mask_params['time_num_regions'] > 0 or self.mask_params['freq_num_regions'] > 0 or \
+ self.mask_params['both_num_regions'] > 0
+
+ @staticmethod
+ def _interleave_lists(*lists):
+ """
+ [*, **, ***], [1, 2, 3], [a, b, c] -> [*, 1, a, **, 2, b, ***, 3, c]
+ Returns:
+ iterator over interleaved list
+ """
+ assert all((len(lists[0]) == len(test_l) for test_l in lists)), "All lists have to have the same length"
+ return itertools.chain(*zip(*lists))
+
+ def _generate_cutouts(self):
+ """
+ Returns:
+ Generates anchors and shapes of the cutout regions.
+ Single call generates one batch of data.
+ The output shall be passed to DALI's Erase operator
+ anchors = [f0 t0 f1 t1 ...]
+ shapes = [f0w t0h f1w t1h ...]
+ """
+ MAX_TIME_DIMENSION = 20 * 16000
+ freq_anchors = np.random.random(self.mask_params['freq_num_regions'])
+ time_anchors = np.random.random(self.mask_params['time_num_regions'])
+ both_anchors_freq = np.random.random(self.mask_params['both_num_regions'])
+ both_anchors_time = np.random.random(self.mask_params['both_num_regions'])
+ anchors = []
+ for anch in freq_anchors:
+ anchors.extend([anch, 0])
+ for anch in time_anchors:
+ anchors.extend([0, anch])
+ for t, f in zip(both_anchors_time, both_anchors_freq):
+ anchors.extend([f, t])
+
+ shapes = []
+ shapes.extend(
+ self._interleave_lists(
+ np.random.randint(self.mask_params['freq_min'], self.mask_params['freq_max'] + 1,
+ self.mask_params['freq_num_regions']),
+                    # XXX: Ideally, the actual time dimension of the spectrogram would be passed here.
+                    #      However, in DALI an ArgumentInput can't come from the GPU,
+                    #      so we leave it to the Erase (masking) operator to handle the oversized region.
+ [int(MAX_TIME_DIMENSION)] * self.mask_params['freq_num_regions']
+ )
+ )
+ shapes.extend(
+ self._interleave_lists(
+ [self.nfeatures] * self.mask_params['time_num_regions'],
+ np.random.randint(self.mask_params['time_min'], self.mask_params['time_max'] + 1,
+ self.mask_params['time_num_regions'])
+ )
+ )
+ shapes.extend(
+ self._interleave_lists(
+ np.random.randint(self.mask_params['both_min_freq'], self.mask_params['both_max_freq'] + 1,
+ self.mask_params['both_num_regions']),
+ np.random.randint(self.mask_params['both_min_time'], self.mask_params['both_max_time'] + 1,
+ self.mask_params['both_num_regions'])
+ )
+ )
+ return anchors, shapes
+
+ def _cutouts_generator(self):
+ """
+        Generator that wraps cutout creation, randomizing the inputs
+        and allowing them to be passed to DALI's ExternalSource operator
+ """
+
+ def tuples2list(tuples: list):
+ """
+ [(a, b), (c, d)] -> [[a, c], [b, d]]
+ """
+ return map(list, zip(*tuples))
+
+ [anchors, shapes] = tuples2list([self._generate_cutouts() for _ in range(self.pipe_handle.max_batch_size)])
+ yield np.array(anchors, dtype=np.float32), np.array(shapes, dtype=np.float32)
+
+class DaliTritonPipeline(DaliPipeline):
+ def __init__(self, **kwargs):
+ kwargs['is_triton_pipeline'] = True
+ super().__init__(**kwargs)
+
+def serialize_dali_triton_pipeline(output_path: str, config_data: dict, config_features: dict):
+ pipe = DaliTritonPipeline.from_config(train_pipeline=False, device_id=-1, batch_size=-1, file_root=None,
+ file_list=None, config_data=config_data, config_features=config_features,
+ do_resampling=False, num_cpu_threads=-1)
+ pipe.serialize(filename=output_path)
diff --git a/PyTorch/contrib/audio/Jasper/common/dataset.py b/PyTorch/contrib/audio/Jasper/common/dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..daf4070d4dc3dcbc0660df84e32cb36426faa970
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/dataset.py
@@ -0,0 +1,237 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+from pathlib import Path
+
+import numpy as np
+
+import torch
+from torch.utils.data import Dataset, DataLoader
+from torch.utils.data.distributed import DistributedSampler
+
+from .audio import (audio_from_file, AudioSegment, GainPerturbation,
+ ShiftPerturbation, SpeedPerturbation)
+from .text import _clean_text, punctuation_map
+
+
+def normalize_string(s, labels, punct_map):
+ """Normalizes string.
+
+ Example:
+ 'call me at 8:00 pm!' -> 'call me at eight zero pm'
+ """
+ labels = set(labels)
+ try:
+ text = _clean_text(s, ["english_cleaners"], punct_map).strip()
+ return ''.join([tok for tok in text if all(t in labels for t in tok)])
+    except Exception:
+ print(f"WARNING: Normalizing failed: {s}")
+ return None
+
+
+class FilelistDataset(Dataset):
+ def __init__(self, filelist_fpath):
+ self.samples = [line.strip() for line in open(filelist_fpath, 'r')]
+
+ def __len__(self):
+ return len(self.samples)
+
+ def __getitem__(self, index):
+ audio, audio_len = audio_from_file(self.samples[index])
+ return (audio.squeeze(0), audio_len, torch.LongTensor([0]),
+ torch.LongTensor([0]))
+
+
+class SingleAudioDataset(FilelistDataset):
+ def __init__(self, audio_fpath):
+ self.samples = [audio_fpath]
+
+
+class AudioDataset(Dataset):
+ def __init__(self, data_dir, manifest_fpaths, labels,
+ sample_rate=16000, min_duration=0.1, max_duration=float("inf"),
+ pad_to_max_duration=False, max_utts=0, normalize_transcripts=True,
+ sort_by_duration=False, trim_silence=False,
+ speed_perturbation=None, gain_perturbation=None,
+ shift_perturbation=None, ignore_offline_speed_perturbation=False):
+ """Loads audio, transcript and durations listed in a .json file.
+
+ Args:
+ data_dir: absolute path to dataset folder
+            manifest_fpaths: list of paths to manifest .json files
+                describing the samples of the dataset
+            labels (str): all possible output symbols
+            min_duration (float): skip audio shorter than this threshold, in seconds
+            max_duration (float): skip audio longer than this threshold, in seconds
+ pad_to_max_duration (bool): pad all sequences to max_duration
+ max_utts (int): limit number of utterances
+ normalize_transcripts (bool): normalize transcript text
+ sort_by_duration (bool): sort sequences by increasing duration
+ trim_silence (bool): trim leading and trailing silence from audio
+ ignore_offline_speed_perturbation (bool): use precomputed speed perturbation
+
+ Returns:
+ tuple of Tensors
+ """
+ self.data_dir = data_dir
+ self.labels = labels
+ self.labels_map = dict([(labels[i], i) for i in range(len(labels))])
+ self.punctuation_map = punctuation_map(labels)
+ self.blank_index = len(labels)
+
+ self.pad_to_max_duration = pad_to_max_duration
+
+ self.sort_by_duration = sort_by_duration
+ self.max_utts = max_utts
+ self.normalize_transcripts = normalize_transcripts
+ self.ignore_offline_speed_perturbation = ignore_offline_speed_perturbation
+
+ self.min_duration = min_duration
+ self.max_duration = max_duration
+ self.trim_silence = trim_silence
+ self.sample_rate = sample_rate
+
+ perturbations = []
+ if speed_perturbation is not None:
+ perturbations.append(SpeedPerturbation(**speed_perturbation))
+ if gain_perturbation is not None:
+ perturbations.append(GainPerturbation(**gain_perturbation))
+ if shift_perturbation is not None:
+ perturbations.append(ShiftPerturbation(**shift_perturbation))
+ self.perturbations = perturbations
+
+ self.max_duration = max_duration
+
+ self.samples = []
+ self.duration = 0.0
+ self.duration_filtered = 0.0
+
+ for fpath in manifest_fpaths:
+ self._load_json_manifest(fpath)
+
+ if sort_by_duration:
+ self.samples = sorted(self.samples, key=lambda s: s['duration'])
+
+ def __getitem__(self, index):
+ s = self.samples[index]
+ rn_indx = np.random.randint(len(s['audio_filepath']))
+ duration = s['audio_duration'][rn_indx] if 'audio_duration' in s else 0
+ offset = s.get('offset', 0)
+
+ segment = AudioSegment(
+ s['audio_filepath'][rn_indx], target_sr=self.sample_rate,
+ offset=offset, duration=duration, trim=self.trim_silence)
+
+ for p in self.perturbations:
+ p.maybe_apply(segment, self.sample_rate)
+
+ segment = torch.FloatTensor(segment.samples)
+
+ return (segment,
+ torch.tensor(segment.shape[0]).int(),
+ torch.tensor(s["transcript"]),
+ torch.tensor(len(s["transcript"])).int())
+
+ def __len__(self):
+ return len(self.samples)
+
+ def _load_json_manifest(self, fpath):
+ for s in json.load(open(fpath, "r", encoding="utf-8")):
+
+ if self.pad_to_max_duration and not self.ignore_offline_speed_perturbation:
+ # require all perturbed samples to be < self.max_duration
+ s_max_duration = max(f['duration'] for f in s['files'])
+ else:
+            # otherwise we allow perturbed samples to be > self.max_duration
+ s_max_duration = s['original_duration']
+
+ s['duration'] = s.pop('original_duration')
+ if not (self.min_duration <= s_max_duration <= self.max_duration):
+ self.duration_filtered += s['duration']
+ continue
+
+ # Prune and normalize according to transcript
+ tr = (s.get('transcript', None) or
+ self.load_transcript(s['text_filepath']))
+
+ if not isinstance(tr, str):
+ print(f'WARNING: Skipped sample (transcript not a str): {tr}.')
+ self.duration_filtered += s['duration']
+ continue
+
+ if self.normalize_transcripts:
+ tr = normalize_string(tr, self.labels, self.punctuation_map)
+
+ s["transcript"] = self.to_vocab_inds(tr)
+
+ files = s.pop('files')
+ if self.ignore_offline_speed_perturbation:
+ files = [f for f in files if f['speed'] == 1.0]
+
+ s['audio_duration'] = [f['duration'] for f in files]
+ s['audio_filepath'] = [str(Path(self.data_dir, f['fname']))
+ for f in files]
+ self.samples.append(s)
+ self.duration += s['duration']
+
+ if self.max_utts > 0 and len(self.samples) >= self.max_utts:
+ print(f'Reached max_utts={self.max_utts}. Finished parsing {fpath}.')
+ break
+
+ def load_transcript(self, transcript_path):
+ with open(transcript_path, 'r', encoding="utf-8") as transcript_file:
+ transcript = transcript_file.read().replace('\n', '')
+ return transcript
+
+ def to_vocab_inds(self, transcript):
+ chars = [self.labels_map.get(x, self.blank_index) for x in list(transcript)]
+ transcript = list(filter(lambda x: x != self.blank_index, chars))
+ return transcript
+
+
+def collate_fn(batch):
+ bs = len(batch)
+ max_len = lambda l, idx: max(el[idx].size(0) for el in l)
+
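+    # Pad every sample to fixed maximum lengths (680000 audio samples, 700
+    # transcript tokens) instead of the per-batch maximum (original lines kept
+    # below, commented out); presumably this keeps tensor shapes static.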
+    # audio = torch.zeros(bs, max_len(batch, 0))
+    audio = torch.zeros(bs, 680000)
+    audio_lens = torch.zeros(bs, dtype=torch.int32)
+    # transcript = torch.zeros(bs, max_len(batch, 2))
+    transcript = torch.zeros(bs, 700)
+ transcript_lens = torch.zeros(bs, dtype=torch.int32)
+
+ for i, sample in enumerate(batch):
+ audio[i].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
+ audio_lens[i] = sample[1]
+ transcript[i].narrow(0, 0, sample[2].size(0)).copy_(sample[2])
+ transcript_lens[i] = sample[3]
+ return audio, audio_lens, transcript, transcript_lens
+
+
+def get_data_loader(dataset, batch_size, multi_gpu=True, shuffle=True,
+ drop_last=True, num_workers=4):
+
+ kw = {'dataset': dataset, 'collate_fn': collate_fn,
+ 'num_workers': num_workers, 'pin_memory': True}
+
+ if multi_gpu:
+ loader_shuffle = False
+ sampler = DistributedSampler(dataset, shuffle=shuffle)
+ else:
+ loader_shuffle = shuffle
+ sampler = None
+
+ return DataLoader(batch_size=batch_size, drop_last=drop_last,
+ sampler=sampler, shuffle=loader_shuffle, **kw)
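+
+# Illustrative usage (single-process case; `train_dataset` is a placeholder for
+# an AudioDataset built elsewhere):
+#
+#   loader = get_data_loader(train_dataset, batch_size=32, multi_gpu=False,
+#                            shuffle=True, num_workers=4)
+#
+# With multi_gpu=True a DistributedSampler handles shuffling, so the loader's
+# own `shuffle` flag is forced to False (DataLoader forbids using both).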
diff --git a/PyTorch/contrib/audio/Jasper/common/features.py b/PyTorch/contrib/audio/Jasper/common/features.py
new file mode 100644
index 0000000000000000000000000000000000000000..731d765c249d3a194f001c1cd4e3ed8e05e8fff2
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/features.py
@@ -0,0 +1,522 @@
+import math
+import random
+
+import librosa
+import torch
+import torch.nn as nn
+
+from apex import amp
+import numpy as np
+import torch.nn.functional as F
+from scipy.signal import get_window
+from librosa.util import pad_center, tiny, utils
+
+
+def window_sumsquare(window, n_frames, hop_length=200, win_length=800,
+ n_fft=800, dtype=np.float32, norm=None):
+ """
+ # from librosa 0.6
+ Compute the sum-square envelope of a window function at a given hop length.
+ This is used to estimate modulation effects induced by windowing
+    observations in short-time Fourier transforms.
+ Parameters
+ ----------
+ window : string, tuple, number, callable, or list-like
+ Window specification, as in `get_window`
+ n_frames : int > 0
+ The number of analysis frames
+ hop_length : int > 0
+ The number of samples to advance between frames
+ win_length : [optional]
+ The length of the window function. By default, this matches `n_fft`.
+ n_fft : int > 0
+ The length of each analysis frame.
+ dtype : np.dtype
+ The data type of the output
+ Returns
+ -------
+ wss : np.ndarray, shape=`(n_fft + hop_length * (n_frames - 1))`
+ The sum-squared envelope of the window function
+ """
+ if win_length is None:
+ win_length = n_fft
+
+ n = n_fft + hop_length * (n_frames - 1)
+ x = np.zeros(n, dtype=dtype)
+
+ # Compute the squared window at the desired length
+ win_sq = get_window(window, win_length, fftbins=True)
+ win_sq = utils.normalize(win_sq, norm=norm) ** 2
+ win_sq = utils.pad_center(win_sq, n_fft)
+
+ # Fill the envelope
+ for i in range(n_frames):
+ sample = i * hop_length
+ x[sample:min(n, sample + n_fft)] += win_sq[:max(0, min(n_fft, n - sample))]
+ return x
+
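+# Illustrative call: per the docstring, the envelope covers
+# n_fft + hop_length * (n_frames - 1) samples, e.g.
+#
+#   env = window_sumsquare('hann', n_frames=10, hop_length=160,
+#                          win_length=320, n_fft=512)
+#   # env.shape == (512 + 160 * 9,) == (1952,)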
+
+class STFT(torch.nn.Module):
+ def __init__(self, filter_length=1024, hop_length=512, win_length=None,
+ window='hann'):
+ """
+ This module implements an STFT using 1D convolution and 1D transpose convolutions.
+        Matching input and output sizes across arbitrary overlap-add configurations is
+        difficult, so some setups may not work. Right now, this code is only expected to
+        work with hop lengths that are half the filter length (50% overlap between frames).
+
+ Keyword Arguments:
+ filter_length {int} -- Length of filters used (default: {1024})
+ hop_length {int} -- Hop length of STFT (restrict to 50% overlap between frames) (default: {512})
+ win_length {[type]} -- Length of the window function applied to each frame (if not specified, it
+ equals the filter length). (default: {None})
+ window {str} -- Type of window to use (options are bartlett, hann, hamming, blackman, blackmanharris)
+ (default: {'hann'})
+ """
+ super(STFT, self).__init__()
+ self.filter_length = filter_length
+ self.hop_length = hop_length
+ self.win_length = win_length if win_length else filter_length
+ self.window = window
+ self.forward_transform = None
+ self.pad_amount = int(self.filter_length / 2)
+ scale = self.filter_length / self.hop_length
+ fourier_basis = np.fft.fft(np.eye(self.filter_length))
+
+ cutoff = int((self.filter_length / 2 + 1))
+ fourier_basis = np.vstack([np.real(fourier_basis[:cutoff, :]),
+ np.imag(fourier_basis[:cutoff, :])])
+ forward_basis = torch.FloatTensor(fourier_basis[:, None, :])
+ inverse_basis = torch.FloatTensor(
+ np.linalg.pinv(scale * fourier_basis).T[:, None, :])
+
+ assert (filter_length >= self.win_length)
+ # get window and zero center pad it to filter_length
+ fft_window = get_window(window, self.win_length, fftbins=True)
+ fft_window = pad_center(fft_window, filter_length)
+ fft_window = torch.from_numpy(fft_window).float()
+
+ # window the bases
+ forward_basis *= fft_window
+ inverse_basis *= fft_window
+
+ self.register_buffer('forward_basis', forward_basis.float())
+ self.register_buffer('inverse_basis', inverse_basis.float())
+
+ def transform(self, input_data):
+ """Take input data (audio) to STFT domain.
+
+ Arguments:
+ input_data {tensor} -- Tensor of floats, with shape (num_batch, num_samples)
+
+ Returns:
+ magnitude {tensor} -- Magnitude of STFT with shape (num_batch,
+ num_frequencies, num_frames)
+ phase {tensor} -- Phase of STFT with shape (num_batch,
+ num_frequencies, num_frames)
+ """
+ num_batches = input_data.shape[0]
+ num_samples = input_data.shape[-1]
+
+ self.num_samples = num_samples
+
+ # similar to librosa, reflect-pad the input
+ input_data = input_data.view(num_batches, 1, num_samples)
+
+ input_data = F.pad(
+ input_data.unsqueeze(1),
+ (self.pad_amount, self.pad_amount, 0, 0),
+ mode='constant')
+ input_data = input_data.squeeze(1)
+
+ forward_transform = F.conv1d(
+ input_data,
+ self.forward_basis,
+ stride=self.hop_length,
+ padding=0)
+
+ cutoff = int((self.filter_length / 2) + 1)
+ real_part = forward_transform[:, :cutoff, :]
+ imag_part = forward_transform[:, cutoff:, :]
+
+ magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)
+ phase = torch.atan2(imag_part.data, real_part.data)
+
+ return magnitude, phase
+
+ def inverse(self, magnitude, phase):
+ """Call the inverse STFT (iSTFT), given magnitude and phase tensors produced
+ by the ```transform``` function.
+
+ Arguments:
+ magnitude {tensor} -- Magnitude of STFT with shape (num_batch,
+ num_frequencies, num_frames)
+ phase {tensor} -- Phase of STFT with shape (num_batch,
+ num_frequencies, num_frames)
+
+ Returns:
+ inverse_transform {tensor} -- Reconstructed audio given magnitude and phase. Of
+ shape (num_batch, num_samples)
+ """
+ recombine_magnitude_phase = torch.cat(
+ [magnitude * torch.cos(phase), magnitude * torch.sin(phase)], dim=1)
+
+ inverse_transform = F.conv_transpose1d(
+ recombine_magnitude_phase,
+ self.inverse_basis,
+ stride=self.hop_length,
+ padding=0)
+
+ if self.window is not None:
+ window_sum = window_sumsquare(
+ self.window, magnitude.size(-1), hop_length=self.hop_length,
+ win_length=self.win_length, n_fft=self.filter_length,
+ dtype=np.float32)
+ # remove modulation effects
+ approx_nonzero_indices = torch.from_numpy(
+ np.where(window_sum > tiny(window_sum))[0])
+ window_sum = torch.from_numpy(window_sum).to(inverse_transform.device)
+ inverse_transform[:, :, approx_nonzero_indices] /= window_sum[approx_nonzero_indices]
+
+ # scale by hop ratio
+ inverse_transform *= float(self.filter_length) / self.hop_length
+
+ inverse_transform = inverse_transform[..., self.pad_amount:]
+ inverse_transform = inverse_transform[..., :self.num_samples]
+ inverse_transform = inverse_transform.squeeze(1)
+
+ return inverse_transform
+
+ def forward(self, input_data):
+ """Take input data (audio) to STFT domain and then back to audio.
+
+ Arguments:
+ input_data {tensor} -- Tensor of floats, with shape (num_batch, num_samples)
+
+ Returns:
+ reconstruction {tensor} -- Reconstructed audio given magnitude and phase. Of
+ shape (num_batch, num_samples)
+ """
+ # print("input_data",input_data)
+ self.magnitude, self.phase = self.transform(input_data)
+ reconstruction = self.inverse(self.magnitude, self.phase)
+ return reconstruction
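+
+# Illustrative round trip (shapes only; the signal below is a made-up example):
+#
+#   stft = STFT(filter_length=512, hop_length=256)
+#   audio = torch.randn(1, 16000)
+#   mag, phase = stft.transform(audio)   # (1, 257, n_frames) each
+#   rec = stft(audio)                    # ~= audio, shape (1, 16000)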
+
+
+# stft = STFT()
+class BaseFeatures(nn.Module):
+ """Base class for GPU accelerated audio preprocessing."""
+ __constants__ = ["pad_align", "pad_to_max_duration", "max_len"]
+
+ def __init__(self, pad_align, pad_to_max_duration, max_duration,
+ sample_rate, window_size, window_stride, spec_augment=None,
+ cutout_augment=None):
+ super(BaseFeatures, self).__init__()
+
+ self.pad_align = pad_align
+ self.pad_to_max_duration = pad_to_max_duration
+ self.win_length = int(sample_rate * window_size) # frame size
+ self.hop_length = int(sample_rate * window_stride)
+
+ # Calculate maximum sequence length (# frames)
+ if pad_to_max_duration:
+ self.max_len = 1 + math.ceil(
+ (max_duration * sample_rate - self.win_length) / self.hop_length
+ )
+
+ if spec_augment is not None:
+ self.spec_augment = SpecAugment(**spec_augment)
+ else:
+ self.spec_augment = None
+
+ if cutout_augment is not None:
+ self.cutout_augment = CutoutAugment(**cutout_augment)
+ else:
+ self.cutout_augment = None
+
+ @torch.no_grad()
+ def calculate_features(self, audio, audio_lens):
+ return audio, audio_lens
+
+ def __call__(self, audio, audio_lens, optim_level=0):
+ dtype = audio.dtype
+ audio = audio.float()
+ if optim_level == 1:
+ with amp.disable_casts():
+ feat, feat_lens = self.calculate_features(audio, audio_lens)
+ else:
+ feat, feat_lens = self.calculate_features(audio, audio_lens)
+
+ feat = self.apply_padding(feat)
+
+ if self.cutout_augment is not None:
+ feat = self.cutout_augment(feat)
+
+ if self.spec_augment is not None:
+ feat = self.spec_augment(feat)
+
+ feat = feat.to(dtype)
+ return feat, feat_lens
+
+ def apply_padding(self, x):
+ if self.pad_to_max_duration:
+ x_size = max(x.size(-1), self.max_len)
+ else:
+ x_size = x.size(-1)
+
+ if self.pad_align > 0:
+ pad_amt = x_size % self.pad_align
+ else:
+ pad_amt = 0
+
+ padded_len = x_size + (self.pad_align - pad_amt if pad_amt > 0 else 0)
+ return nn.functional.pad(x, (0, padded_len - x.size(-1)))
+
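+# Worked example of the padding rule in BaseFeatures.apply_padding above: with
+# pad_to_max_duration=False and pad_align=8, a 173-frame spectrogram is padded
+# to the next multiple of 8:
+#
+#   pad_amt    = 173 % 8          # 5
+#   padded_len = 173 + (8 - 5)    # 176
+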
+class SpecAugment(nn.Module):
+    """SpecAugment (https://arxiv.org/abs/1904.08779).
+ """
+ def __init__(self, freq_masks=0, min_freq=0, max_freq=10, time_masks=0,
+ min_time=0, max_time=10):
+ super(SpecAugment, self).__init__()
+ assert 0 <= min_freq <= max_freq
+ assert 0 <= min_time <= max_time
+
+ self.freq_masks = freq_masks
+ self.min_freq = min_freq
+ self.max_freq = max_freq
+
+ self.time_masks = time_masks
+ self.min_time = min_time
+ self.max_time = max_time
+
+ @torch.no_grad()
+ def forward(self, x):
+ sh = x.shape
+ mask = torch.zeros(x.shape, dtype=torch.bool, device=x.device)
+
+ for idx in range(sh[0]):
+ for _ in range(self.freq_masks):
+ w = torch.randint(self.min_freq, self.max_freq + 1, size=(1,)).item()
+ f0 = torch.randint(0, max(1, sh[1] - w), size=(1,))
+ mask[idx, f0:f0+w] = 1
+
+ for _ in range(self.time_masks):
+ w = torch.randint(self.min_time, self.max_time + 1, size=(1,)).item()
+ t0 = torch.randint(0, max(1, sh[2] - w), size=(1,))
+ mask[idx, :, t0:t0+w] = 1
+
+ return x.masked_fill(mask, 0)
+
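+# Illustrative configuration (values are made-up, not taken from any model
+# config): mask up to 2 frequency bands of <=20 bins and 2 time stripes of
+# <=30 frames per utterance:
+#
+#   augment = SpecAugment(freq_masks=2, max_freq=20, time_masks=2, max_time=30)
+#   feats = augment(torch.randn(4, 64, 400))   # (batch, n_filt, frames)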
+
+class CutoutAugment(nn.Module):
+    """Cutout augmentation (https://arxiv.org/pdf/1708.04552.pdf).
+ """
+ def __init__(self, masks=0, min_freq=20, max_freq=20, min_time=5, max_time=5):
+ super(CutoutAugment, self).__init__()
+ assert 0 <= min_freq <= max_freq
+ assert 0 <= min_time <= max_time
+
+ self.masks = masks
+ self.min_freq = min_freq
+ self.max_freq = max_freq
+ self.min_time = min_time
+ self.max_time = max_time
+
+ @torch.no_grad()
+ def forward(self, x):
+ sh = x.shape
+ mask = torch.zeros(x.shape, dtype=torch.bool, device=x.device)
+
+ for idx in range(sh[0]):
+ for i in range(self.masks):
+
+ w = torch.randint(self.min_freq, self.max_freq + 1, size=(1,)).item()
+ h = torch.randint(self.min_time, self.max_time + 1, size=(1,)).item()
+
+ f0 = int(random.uniform(0, sh[1] - w))
+ t0 = int(random.uniform(0, sh[2] - h))
+
+ mask[idx, f0:f0+w, t0:t0+h] = 1
+
+ return x.masked_fill(mask, 0)
+
+
+@torch.jit.script
+def normalize_batch(x, seq_len, normalize_type: str):
+# print ("normalize_batch: x, seq_len, shapes: ", x.shape, seq_len, seq_len.shape)
+ if normalize_type == "per_feature":
+ x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
+ device=x.device)
+ x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
+ device=x.device)
+ for i in range(x.shape[0]):
+ x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
+ x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
+ # make sure x_std is not zero
+ x_std += 1e-5
+ return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
+
+ elif normalize_type == "all_features":
+ x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
+ x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
+ for i in range(x.shape[0]):
+ x_mean[i] = x[i, :, :int(seq_len[i])].mean()
+ x_std[i] = x[i, :, :int(seq_len[i])].std()
+ # make sure x_std is not zero
+ x_std += 1e-5
+ return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
+ else:
+ return x
+
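+# Illustrative call: with "per_feature" normalization every mel channel of each
+# utterance is normalized to zero mean and unit variance over its valid frames
+# (frames beyond seq_len are ignored when computing the statistics):
+#
+#   feats = torch.randn(4, 64, 400)
+#   lens = torch.tensor([400, 350, 380, 290])
+#   normed = normalize_batch(feats, lens, normalize_type="per_feature")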
+
+@torch.jit.script
+def stack_subsample_frames(x, x_lens, stacking: int = 1, subsampling: int = 1):
+ """ Stacks frames together across feature dim, and then subsamples
+
+ input is batch_size, feature_dim, num_frames
+ output is batch_size, feature_dim * stacking, num_frames / subsampling
+
+ """
+ seq = [x]
+ for n in range(1, stacking):
+ tmp = torch.zeros_like(x)
+ tmp[:, :, :-n] = x[:, :, n:]
+ seq.append(tmp)
+ x = torch.cat(seq, dim=1)[:, :, ::subsampling]
+
+ if subsampling > 1:
+ x_lens = torch.ceil(x_lens.float() / subsampling).int()
+
+ if x.size(2) > x_lens.max().item():
+ assert abs(x.size(2) - x_lens.max().item()) <= 1
+ x = x[:,:,:x_lens.max().item()]
+
+ return x, x_lens
+
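+# Illustrative shapes: stacking concatenates shifted copies along the feature
+# dim and subsampling keeps every k-th frame, e.g.
+#
+#   x = torch.randn(4, 64, 400)
+#   lens = torch.tensor([400, 350, 380, 290], dtype=torch.int32)
+#   y, y_lens = stack_subsample_frames(x, lens, stacking=2, subsampling=2)
+#   # y.shape == (4, 128, 200), y_lens == ceil(lens / 2)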
+
+class FilterbankFeatures(BaseFeatures):
+ # For JIT, https://pytorch.org/docs/stable/jit.html#python-defined-constants
+ __constants__ = ["dither", "preemph", "n_fft", "hop_length", "win_length",
+ "log", "frame_splicing", "normalize"]
+ # torchscript: "center" removed due to a bug
+
+ def __init__(self, spec_augment=None, cutout_augment=None,
+ sample_rate=8000, window_size=0.02, window_stride=0.01,
+ window="hamming", normalize="per_feature", n_fft=None,
+ preemph=0.97, n_filt=64, lowfreq=0, highfreq=None, log=True,
+ dither=1e-5, pad_align=8, pad_to_max_duration=False,
+ max_duration=float('inf'), frame_splicing=1):
+ super(FilterbankFeatures, self).__init__(
+ pad_align=pad_align, pad_to_max_duration=pad_to_max_duration,
+ max_duration=max_duration, sample_rate=sample_rate,
+ window_size=window_size, window_stride=window_stride,
+ spec_augment=spec_augment, cutout_augment=cutout_augment)
+
+ torch_windows = {
+ 'hann': torch.hann_window,
+ 'hamming': torch.hamming_window,
+ 'blackman': torch.blackman_window,
+ 'bartlett': torch.bartlett_window,
+ 'none': None,
+ }
+
+ self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
+
+ self.normalize = normalize
+ self.log = log
+ #TORCHSCRIPT: Check whether or not we need this
+ self.dither = dither
+ self.frame_splicing = frame_splicing
+ self.n_filt = n_filt
+ self.preemph = preemph
+ highfreq = highfreq or sample_rate / 2
+ window_fn = torch_windows.get(window, None)
+ window_tensor = window_fn(self.win_length,
+ periodic=False) if window_fn else None
+ filterbanks = torch.tensor(
+ librosa.filters.mel(sample_rate, self.n_fft, n_mels=n_filt,
+ fmin=lowfreq, fmax=highfreq),
+ dtype=torch.float).unsqueeze(0)
+ # torchscript
+ self.register_buffer("fb", filterbanks)
+ self.register_buffer("window", window_tensor)
+
+ # self.stft = STFT(filter_length=512, hop_length=160,win_length=320)
+ def get_seq_len(self, seq_len):
+ return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
+ dtype=torch.int)
+
+ # do stft
+ # TORCHSCRIPT: center removed due to bug
+ # def stft(self, x):
+ # return torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
+ # win_length=self.win_length,pad_mode = "constant",
+ # window=self.window.to(dtype=torch.float))
+
+ def stft(self, x):
+ result = []
+ for i in range(x.shape[0]):
+ tmp = librosa.stft(x[i].numpy(), n_fft=self.n_fft, hop_length=self.hop_length,
+ win_length=self.win_length, pad_mode="reflect",
+ window='hann')
+ tmp_real = torch.from_numpy(tmp.real).float()
+ tmp_imag = torch.from_numpy(tmp.imag).float()
+ tmp = torch.stack((tmp_real, tmp_imag), dim=2)
+ result.append(tmp)
+
+        return torch.stack(result, dim=0)
+
+    @torch.no_grad()
+ def calculate_features(self, x, seq_len):
+ dtype = x.dtype
+ seq_len = self.get_seq_len(seq_len)
+ # dither
+ # print(seq_len)
+ if self.dither > 0:
+ x += self.dither * torch.randn_like(x)
+
+ # do preemphasis
+ if self.preemph is not None:
+ x = torch.cat(
+ (x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]), dim=1)
+        x = self.stft(x)
+
+        # get power spectrum
+        x = x.pow(2).sum(-1)
+
+        # dot with filterbank energies
+        x = torch.matmul(self.fb.to(x.dtype), x)
+
+ # log features if required
+ if self.log:
+ x = torch.log(x + 1e-20)
+
+ # frame splicing if required
+ if self.frame_splicing > 1:
+ raise ValueError('Frame splicing not supported')
+
+ # normalize if required
+ x = normalize_batch(x, seq_len, normalize_type=self.normalize)
+
+ # mask to zero any values beyond seq_len in batch,
+ # pad to multiple of `pad_align` (for efficiency)
+ max_len = x.size(-1)
+ mask = torch.arange(max_len, dtype=seq_len.dtype, device=x.device)
+ mask = mask.expand(x.size(0), max_len) >= seq_len.unsqueeze(1)
+ x = x.masked_fill(mask.unsqueeze(1), 0)
+
+ # TORCHSCRIPT: Is this del important? It breaks scripting
+ # del mask
+
+ return x.to(dtype), seq_len
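+
+# Illustrative end-to-end use (values are made-up; the actual training configs
+# set their own window and filterbank parameters):
+#
+#   featurizer = FilterbankFeatures(sample_rate=16000, window_size=0.02,
+#                                   window_stride=0.01, n_filt=64)
+#   # win_length = 320, hop_length = 160, n_fft = 2**ceil(log2(320)) = 512
+#   audio = torch.randn(2, 16000)
+#   lens = torch.tensor([16000, 12000])
+#   feats, feat_lens = featurizer(audio, lens)
+#   # feats: (2, 64, T) log-mel features, feat_lens == ceil(lens / 160)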
diff --git a/PyTorch/contrib/audio/Jasper/common/helpers.py b/PyTorch/contrib/audio/Jasper/common/helpers.py
new file mode 100644
index 0000000000000000000000000000000000000000..efdb11a4457f85443a258275a524dd17c077d2f0
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/helpers.py
@@ -0,0 +1,300 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import glob
+import os
+import re
+from collections import OrderedDict
+
+from apex import amp
+
+import torch
+import torch.distributed as dist
+
+from .metrics import word_error_rate
+
+
+def print_once(msg):
+ if not dist.is_initialized() or dist.get_rank() == 0:
+ print(msg)
+
+
+def add_ctc_blank(symbols):
+ return symbols + ['']
+
+
+def ctc_decoder_predictions_tensor(tensor, labels):
+ """
+    Takes the output of the greedy CTC decoder and applies CTC decoding to
+    collapse repeated symbols and drop blanks. Returns the decoded hypotheses.
+    Args:
+        tensor: model output tensor
+        labels: A list of labels
+    Returns:
+        hypotheses: list of decoded strings
+ """
+ blank_id = len(labels) - 1
+ hypotheses = []
+ labels_map = {i: labels[i] for i in range(len(labels))}
+ prediction_cpu_tensor = tensor.long().cpu()
+ # iterate over batch
+ for ind in range(prediction_cpu_tensor.shape[0]):
+ prediction = prediction_cpu_tensor[ind].numpy().tolist()
+ # CTC decoding procedure
+ decoded_prediction = []
+ previous = len(labels) - 1 # id of a blank symbol
+ for p in prediction:
+ if (p != previous or previous == blank_id) and p != blank_id:
+ decoded_prediction.append(p)
+ previous = p
+ hypothesis = ''.join([labels_map[c] for c in decoded_prediction])
+ hypotheses.append(hypothesis)
+ return hypotheses
+
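+# Illustrative decode (labels are a made-up 4-symbol vocabulary whose last
+# entry is the CTC blank, as produced by add_ctc_blank). Repeated symbols
+# collapse unless separated by a blank:
+#
+#   labels = ['a', 'b', ' ', '']
+#   preds = torch.tensor([[0, 0, 3, 1, 1, 3, 0]])
+#   ctc_decoder_predictions_tensor(preds, labels)   # -> ['aba']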
+
+def greedy_wer(preds, tgt, tgt_lens, labels):
+ """
+    Decodes greedy CTC predictions and computes the word error rate against
+    the reference transcripts.
+    Args:
+        preds: greedy CTC decoder output tensor
+        tgt: reference transcript tensor
+        tgt_lens: reference transcript lengths
+        labels: A list of labels
+
+    Returns:
+        word error rate, first hypothesis, first reference
+ """
+ with torch.no_grad():
+ references = gather_transcripts([tgt], [tgt_lens], labels)
+ hypotheses = ctc_decoder_predictions_tensor(preds, labels)
+
+ wer, _, _ = word_error_rate(hypotheses, references)
+ return wer, hypotheses[0], references[0]
+
+
+def gather_losses(losses_list):
+ return [torch.mean(torch.stack(losses_list))]
+
+
+def gather_predictions(predictions_list, labels):
+ results = []
+ for prediction in predictions_list:
+ results += ctc_decoder_predictions_tensor(prediction, labels=labels)
+ return results
+
+
+def gather_transcripts(transcript_list, transcript_len_list, labels):
+ results = []
+ labels_map = {i: labels[i] for i in range(len(labels))}
+ # iterate over workers
+ for txt, lens in zip(transcript_list, transcript_len_list):
+ for t, l in zip(txt.long().cpu(), lens.long().cpu()):
+ t = list(t.numpy())
+ results.append(''.join([labels_map[c] for c in t[:l]]))
+ return results
+
+
+def process_evaluation_batch(tensors, global_vars, labels):
+ """
+    Processes the results of one evaluation iteration and accumulates them in global_vars
+    Args:
+        tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcript, and output
+        global_vars: dictionary where the processed results of the iteration are accumulated
+ labels: A list of labels
+ """
+ for kv, v in tensors.items():
+ if kv.startswith('loss'):
+ global_vars['EvalLoss'] += gather_losses(v)
+ elif kv.startswith('predictions'):
+ global_vars['preds'] += gather_predictions(v, labels)
+ elif kv.startswith('transcript_length'):
+ transcript_len_list = v
+ elif kv.startswith('transcript'):
+ transcript_list = v
+ elif kv.startswith('output'):
+ global_vars['logits'] += v
+
+ global_vars['txts'] += gather_transcripts(
+ transcript_list, transcript_len_list, labels)
+
+
+def process_evaluation_epoch(aggregates, tag=None):
+ """
+    Processes results from each worker at the end of evaluation and combines
+    them into the final result
+    Args:
+        aggregates: dictionary containing results of the entire evaluation
+    Returns:
+ wer: final word error rate
+ loss: final loss
+ """
+ if 'losses' in aggregates:
+ eloss = torch.mean(torch.stack(aggregates['losses'])).item()
+ else:
+ eloss = None
+ hypotheses = aggregates['preds']
+ references = aggregates['txts']
+
+ wer, scores, num_words = word_error_rate(hypotheses, references)
+ multi_gpu = dist.is_initialized()
+ if multi_gpu:
+ if eloss is not None:
+ eloss /= dist.get_world_size()
+ eloss_tensor = torch.tensor(eloss).npu()
+ dist.all_reduce(eloss_tensor)
+ eloss = eloss_tensor.item()
+
+ scores_tensor = torch.tensor(scores).npu().float()
+ dist.all_reduce(scores_tensor)
+ scores = scores_tensor.item()
+ num_words_tensor = torch.tensor(num_words).npu().float()
+ dist.all_reduce(num_words_tensor)
+ num_words = num_words_tensor.item()
+ wer = scores * 1.0 / num_words
+ return wer, eloss
+
+
+def num_weights(module):
+ return sum(p.numel() for p in module.parameters() if p.requires_grad)
+
+
+def convert_v1_state_dict(state_dict):
+ rules = [
+ ('^jasper_encoder.encoder.', 'encoder.layers.'),
+ ('^jasper_decoder.decoder_layers.', 'decoder.layers.'),
+ ]
+ ret = {}
+ for k, v in state_dict.items():
+ if k.startswith('acoustic_model.'):
+ continue
+ if k.startswith('audio_preprocessor.'):
+ continue
+ for pattern, to in rules:
+ k = re.sub(pattern, to, k)
+ ret[k] = v
+
+ return ret
+
+
+class Checkpointer(object):
+
+ def __init__(self, save_dir, model_name, keep_milestones=[100,200,300],
+ use_amp=False):
+ self.save_dir = save_dir
+ self.keep_milestones = keep_milestones
+ self.use_amp = use_amp
+ self.model_name = model_name
+
+ tracked = [
+            (int(re.search(r'epoch(\d+)_', f).group(1)), f)
+ for f in glob.glob(f'{save_dir}/{self.model_name}_epoch*_checkpoint.pt')]
+ tracked = sorted(tracked, key=lambda t: t[0])
+ self.tracked = OrderedDict(tracked)
+
+ def save(self, model, ema_model, optimizer, epoch, step, best_wer,
+ is_best=False):
+ """Saves model checkpoint for inference/resuming training.
+
+ Args:
+ model: the model, optionally wrapped by DistributedDataParallel
+ ema_model: model with averaged weights, can be None
+ optimizer: optimizer
+ epoch (int): epoch during which the model is saved
+ step (int): number of steps since beginning of training
+ best_wer (float): lowest recorded WER on the dev set
+ is_best (bool, optional): set name of checkpoint to 'best'
+ and overwrite the previous one
+ """
+ rank = 0
+ if dist.is_initialized():
+ dist.barrier()
+ rank = dist.get_rank()
+
+ if rank != 0:
+ return
+
+ # Checkpoint already saved
+ if not is_best and epoch in self.tracked:
+ return
+
+ unwrap_ddp = lambda model: getattr(model, 'module', model)
+ state = {
+ 'epoch': epoch,
+ 'step': step,
+ 'best_wer': best_wer,
+ 'state_dict': unwrap_ddp(model).state_dict(),
+ 'ema_state_dict': unwrap_ddp(ema_model).state_dict() if ema_model is not None else None,
+ 'optimizer': optimizer.state_dict(),
+ 'amp': amp.state_dict() if self.use_amp else None,
+ }
+
+ if is_best:
+ fpath = os.path.join(
+ self.save_dir, f"{self.model_name}_best_checkpoint.pt")
+ else:
+ fpath = os.path.join(
+ self.save_dir, f"{self.model_name}_epoch{epoch}_checkpoint.pt")
+
+ print_once(f"Saving {fpath}...")
+ torch.save(state, fpath)
+
+ if not is_best:
+ # Remove old checkpoints; keep milestones and the last two
+ self.tracked[epoch] = fpath
+ for epoch in set(list(self.tracked)[:-2]) - set(self.keep_milestones):
+ try:
+ os.remove(self.tracked[epoch])
+ except:
+ pass
+ del self.tracked[epoch]
+
+ def last_checkpoint(self):
+ tracked = list(self.tracked.values())
+
+ if len(tracked) >= 1:
+ try:
+ torch.load(tracked[-1], map_location='cpu')
+ return tracked[-1]
+ except:
+ print_once(f'Last checkpoint {tracked[-1]} appears corrupted.')
+
+        if len(tracked) >= 2:
+            return tracked[-2]
+
+        return None
+
+ def load(self, fpath, model, ema_model, optimizer, meta):
+
+ print_once(f'Loading model from {fpath}')
+ checkpoint = torch.load(fpath, map_location="cpu")
+
+ unwrap_ddp = lambda model: getattr(model, 'module', model)
+ state_dict = convert_v1_state_dict(checkpoint['state_dict'])
+ unwrap_ddp(model).load_state_dict(state_dict, strict=True)
+
+ if ema_model is not None:
+ if checkpoint.get('ema_state_dict') is not None:
+ key = 'ema_state_dict'
+ else:
+ key = 'state_dict'
+ print_once('WARNING: EMA weights not found in the checkpoint.')
+ print_once('WARNING: Initializing EMA model with regular params.')
+ state_dict = convert_v1_state_dict(checkpoint[key])
+ unwrap_ddp(ema_model).load_state_dict(state_dict, strict=True)
+
+ optimizer.load_state_dict(checkpoint['optimizer'])
+
+ if self.use_amp:
+ amp.load_state_dict(checkpoint['amp'])
+
+ meta['start_epoch'] = checkpoint.get('epoch')
+ meta['best_wer'] = checkpoint.get('best_wer', meta['best_wer'])
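+
+# Illustrative training-loop usage (argument values are placeholders; `meta` is
+# a dict that already contains a 'best_wer' entry):
+#
+#   ckptr = Checkpointer(save_dir='results', model_name='Jasper')
+#   last = ckptr.last_checkpoint()
+#   if last is not None:
+#       ckptr.load(last, model, ema_model, optimizer, meta)
+#   ...
+#   ckptr.save(model, ema_model, optimizer, epoch, step, best_wer)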
diff --git a/PyTorch/contrib/audio/Jasper/common/metrics.py b/PyTorch/contrib/audio/Jasper/common/metrics.py
new file mode 100644
index 0000000000000000000000000000000000000000..4ae47a4c069dfe329a87d82cead0f2b91775139c
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/metrics.py
@@ -0,0 +1,59 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+def __levenshtein(a, b):
+ """Calculates the Levenshtein distance between two sequences."""
+
+ n, m = len(a), len(b)
+ if n > m:
+ # Make sure n <= m, to use O(min(n,m)) space
+ a, b = b, a
+ n, m = m, n
+
+ current = list(range(n + 1))
+ for i in range(1, m + 1):
+ previous, current = current, [i] + [0] * n
+ for j in range(1, n + 1):
+ add, delete = previous[j] + 1, current[j - 1] + 1
+ change = previous[j - 1]
+ if a[j - 1] != b[i - 1]:
+ change = change + 1
+ current[j] = min(add, delete, change)
+
+ return current[n]
+
+
+def word_error_rate(hypotheses, references):
+ """Computes average Word Error Rate (WER) between two text lists."""
+
+ scores = 0
+ words = 0
+ len_diff = len(references) - len(hypotheses)
+ if len_diff > 0:
+        raise ValueError("Unequal number of hypotheses and references: "
+ "{0} and {1}".format(len(hypotheses), len(references)))
+ elif len_diff < 0:
+ hypotheses = hypotheses[:len_diff]
+
+ for h, r in zip(hypotheses, references):
+ h_list = h.split()
+ r_list = r.split()
+ words += len(r_list)
+ scores += __levenshtein(h_list, r_list)
+    if words != 0:
+        wer = 1.0 * scores / words
+ else:
+ wer = float('inf')
+ return wer, scores, words
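+
+# Illustrative call: one deleted word over four reference words gives WER 0.25:
+#
+#   wer, errors, words = word_error_rate(['the cat sat'],
+#                                        ['the cat sat down'])
+#   # wer == 0.25, errors == 1, words == 4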
diff --git a/PyTorch/contrib/audio/Jasper/common/optimizers.py b/PyTorch/contrib/audio/Jasper/common/optimizers.py
new file mode 100644
index 0000000000000000000000000000000000000000..8175919196c47531c5985efce72ce663c2d5d213
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/optimizers.py
@@ -0,0 +1,269 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+from torch.optim import Optimizer
+import math
+
+
+def lr_policy(step, epoch, initial_lr, optimizer, steps_per_epoch, warmup_epochs,
+ hold_epochs, num_epochs=None, policy='linear', min_lr=1e-5,
+ exp_gamma=None):
+ """
+    Warmup/hold/decay learning rate policy. Updates the learning rate of every
+    parameter group of `optimizer` in place.
+    Args:
+        step, epoch: current iteration and epoch
+        initial_lr: base (peak) learning rate
+        steps_per_epoch, warmup_epochs, hold_epochs: shape of the schedule
+        num_epochs: total number of epochs (required by the 'legacy' policy)
+        policy: 'legacy' or 'exponential'
+        min_lr: lower bound on the learning rate
+        exp_gamma: per-epoch decay factor for the 'exponential' policy
+ """
+ warmup_steps = warmup_epochs * steps_per_epoch
+ hold_steps = hold_epochs * steps_per_epoch
+
+ if policy == 'legacy':
+ assert num_epochs is not None
+ tot_steps = num_epochs * steps_per_epoch
+
+ if step < warmup_steps:
+ a = (step + 1) / (warmup_steps + 1)
+ elif step < warmup_steps + hold_steps:
+ a = 1.0
+ else:
+ a = (((tot_steps - step)
+ / (tot_steps - warmup_steps - hold_steps)) ** 2)
+
+ elif policy == 'exponential':
+ assert exp_gamma is not None
+
+ if step < warmup_steps:
+ a = (step + 1) / (warmup_steps + 1)
+ elif step < warmup_steps + hold_steps:
+ a = 1.0
+ else:
+ a = exp_gamma ** (epoch - warmup_epochs - hold_epochs)
+
+ else:
+ raise ValueError
+
+ new_lr = max(a * initial_lr, min_lr)
+ for param_group in optimizer.param_groups:
+ param_group['lr'] = new_lr
+
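+# Illustrative schedule (made-up numbers; `opt` is a placeholder optimizer):
+# with steps_per_epoch=100 and warmup_epochs=2 the rate ramps linearly over the
+# first 200 steps, holds for hold_epochs, then decays by exp_gamma per epoch
+# (never below min_lr):
+#
+#   lr_policy(step=99, epoch=0, initial_lr=1e-2, optimizer=opt,
+#             steps_per_epoch=100, warmup_epochs=2, hold_epochs=0,
+#             policy='exponential', exp_gamma=0.99)
+#   # sets lr to (99 + 1) / (200 + 1) * 1e-2 ~= 5.0e-3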
+
+class AdamW(Optimizer):
+    """Implements the AdamW algorithm (Adam with decoupled weight decay,
+    https://arxiv.org/abs/1711.05101).
+
+    The underlying Adam update was proposed in `Adam: A Method for Stochastic Optimization`_.
+
+ Arguments:
+ params (iterable): iterable of parameters to optimize or dicts defining
+ parameter groups
+ lr (float, optional): learning rate (default: 1e-3)
+ betas (Tuple[float, float], optional): coefficients used for computing
+ running averages of gradient and its square (default: (0.9, 0.999))
+ eps (float, optional): term added to the denominator to improve
+ numerical stability (default: 1e-8)
+ weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
+ amsgrad (boolean, optional): whether to use the AMSGrad variant of this
+ algorithm from the paper `On the Convergence of Adam and Beyond`_
+
+ Adam: A Method for Stochastic Optimization:
+ https://arxiv.org/abs/1412.6980
+ On the Convergence of Adam and Beyond:
+ https://openreview.net/forum?id=ryQu7f-RZ
+ """
+
+ def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
+ weight_decay=0, amsgrad=False):
+ if not 0.0 <= lr:
+ raise ValueError("Invalid learning rate: {}".format(lr))
+ if not 0.0 <= eps:
+ raise ValueError("Invalid epsilon value: {}".format(eps))
+ if not 0.0 <= betas[0] < 1.0:
+ raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
+ if not 0.0 <= betas[1] < 1.0:
+ raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
+ defaults = dict(lr=lr, betas=betas, eps=eps,
+ weight_decay=weight_decay, amsgrad=amsgrad)
+ super(AdamW, self).__init__(params, defaults)
+
+ def __setstate__(self, state):
+ super(AdamW, self).__setstate__(state)
+ for group in self.param_groups:
+ group.setdefault('amsgrad', False)
+
+ def step(self, closure=None):
+ """Performs a single optimization step.
+
+ Arguments:
+ closure (callable, optional): A closure that reevaluates the model
+ and returns the loss.
+ """
+ loss = None
+ if closure is not None:
+ loss = closure()
+
+ for group in self.param_groups:
+ for p in group['params']:
+ if p.grad is None:
+ continue
+ grad = p.grad.data
+ if grad.is_sparse:
+ raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
+ amsgrad = group['amsgrad']
+
+ state = self.state[p]
+
+ # State initialization
+ if len(state) == 0:
+ state['step'] = 0
+ # Exponential moving average of gradient values
+ state['exp_avg'] = torch.zeros_like(p.data)
+ # Exponential moving average of squared gradient values
+ state['exp_avg_sq'] = torch.zeros_like(p.data)
+ if amsgrad:
+ # Maintains max of all exp. moving avg. of sq. grad. values
+ state['max_exp_avg_sq'] = torch.zeros_like(p.data)
+
+ exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
+ if amsgrad:
+ max_exp_avg_sq = state['max_exp_avg_sq']
+ beta1, beta2 = group['betas']
+
+ state['step'] += 1
+ # Decay the first and second moment running average coefficient
+ exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
+ exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
+ if amsgrad:
+ # Maintains the maximum of all 2nd moment running avg. till now
+ torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
+ # Use the max. for normalizing running avg. of gradient
+ denom = max_exp_avg_sq.sqrt().add_(group['eps'])
+ else:
+ denom = exp_avg_sq.sqrt().add_(group['eps'])
+
+ bias_correction1 = 1 - beta1 ** state['step']
+ bias_correction2 = 1 - beta2 ** state['step']
+ step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
+ p.data.add_(torch.mul(p.data, group['weight_decay']).addcdiv_(1, exp_avg, denom), alpha=-step_size)
+
+ return loss
+
+
+class Novograd(Optimizer):
+ """
+ Implements Novograd algorithm.
+
+ Args:
+ params (iterable): iterable of parameters to optimize or dicts defining
+ parameter groups
+ lr (float, optional): learning rate (default: 1e-3)
+ betas (Tuple[float, float], optional): coefficients used for computing
+ running averages of gradient and its square (default: (0.95, 0))
+ eps (float, optional): term added to the denominator to improve
+ numerical stability (default: 1e-8)
+ weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
+ grad_averaging: gradient averaging
+ amsgrad (boolean, optional): whether to use the AMSGrad variant of this
+ algorithm from the paper `On the Convergence of Adam and Beyond`_
+ (default: False)
+ """
+
+ def __init__(self, params, lr=1e-3, betas=(0.95, 0), eps=1e-8,
+ weight_decay=0, grad_averaging=False, amsgrad=False):
+ if not 0.0 <= lr:
+ raise ValueError("Invalid learning rate: {}".format(lr))
+ if not 0.0 <= eps:
+ raise ValueError("Invalid epsilon value: {}".format(eps))
+ if not 0.0 <= betas[0] < 1.0:
+ raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
+ if not 0.0 <= betas[1] < 1.0:
+ raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
+ defaults = dict(lr=lr, betas=betas, eps=eps,
+ weight_decay=weight_decay,
+ grad_averaging=grad_averaging,
+ amsgrad=amsgrad)
+
+ super(Novograd, self).__init__(params, defaults)
+
+ def __setstate__(self, state):
+ super(Novograd, self).__setstate__(state)
+ for group in self.param_groups:
+ group.setdefault('amsgrad', False)
+
+ def step(self, closure=None):
+ """Performs a single optimization step.
+
+ Arguments:
+ closure (callable, optional): A closure that reevaluates the model
+ and returns the loss.
+ """
+ loss = None
+ if closure is not None:
+ loss = closure()
+
+ for group in self.param_groups:
+ for p in group['params']:
+ if p.grad is None:
+ continue
+ grad = p.grad.data
+ if grad.is_sparse:
+ raise RuntimeError('Sparse gradients are not supported.')
+ amsgrad = group['amsgrad']
+
+ state = self.state[p]
+
+ # State initialization
+ if len(state) == 0:
+ state['step'] = 0
+ # Exponential moving average of gradient values
+ state['exp_avg'] = torch.zeros_like(p.data)
+ # Exponential moving average of squared gradient values
+ state['exp_avg_sq'] = torch.zeros([]).to(state['exp_avg'].device)
+ if amsgrad:
+ # Maintains max of all exp. moving avg. of sq. grad. values
+ state['max_exp_avg_sq'] = torch.zeros([]).to(state['exp_avg'].device)
+
+ exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
+ if amsgrad:
+ max_exp_avg_sq = state['max_exp_avg_sq']
+ beta1, beta2 = group['betas']
+
+ state['step'] += 1
+
+ norm = torch.sum(torch.pow(grad, 2))
+
+ if exp_avg_sq == 0:
+ exp_avg_sq.copy_(norm)
+ else:
+ exp_avg_sq.mul_(beta2).add_(norm, alpha=1 - beta2)
+
+ if amsgrad:
+ # Maintains the maximum of all 2nd moment running avg. till now
+ torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
+ # Use the max. for normalizing running avg. of gradient
+ denom = max_exp_avg_sq.sqrt().add_(group['eps'])
+ else:
+ denom = exp_avg_sq.sqrt().add_(group['eps'])
+
+ grad.div_(denom)
+ if group['weight_decay'] != 0:
+ grad.add_(p.data, alpha=group['weight_decay'])
+ if group['grad_averaging']:
+ grad.mul_(1 - beta1)
+ exp_avg.mul_(beta1).add_(grad)
+
+ p.data.add_(exp_avg, alpha=-group['lr'])
+
+ return loss
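+
+# Illustrative usage (hyperparameters are placeholders, not the values used by
+# the training scripts; `model` and `loss` come from the surrounding loop):
+#
+#   optimizer = Novograd(model.parameters(), lr=1e-2, betas=(0.95, 0.5),
+#                        weight_decay=1e-3)
+#   loss.backward()
+#   optimizer.step()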
diff --git a/PyTorch/contrib/audio/Jasper/common/tb_dllogger.py b/PyTorch/contrib/audio/Jasper/common/tb_dllogger.py
new file mode 100644
index 0000000000000000000000000000000000000000..ecc6ec86898eac8b6c2f3b13fc3128e3058f850d
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/tb_dllogger.py
@@ -0,0 +1,159 @@
+import atexit
+import glob
+import os
+import re
+import numpy as np
+
+import torch
+from torch.utils.tensorboard import SummaryWriter
+
+import dllogger
+from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
+
+
+tb_loggers = {}
+
+
+class TBLogger:
+ """
+ xyz_dummies: stretch the screen with empty plots so the legend would
+    dummies: log constant 'aaa'/'zzz' scalars so the TensorBoard legend keeps
+    a constant width across the real plots
+ def __init__(self, enabled, log_dir, name, interval=1, dummies=True):
+ self.enabled = enabled
+ self.interval = interval
+ self.cache = {}
+ if self.enabled:
+ self.summary_writer = SummaryWriter(
+ log_dir=os.path.join(log_dir, name),
+ flush_secs=120, max_queue=200)
+ atexit.register(self.summary_writer.close)
+ if dummies:
+ for key in ('aaa', 'zzz'):
+ self.summary_writer.add_scalar(key, 0.0, 1)
+
+ def log(self, step, data):
+ for k, v in data.items():
+ self.log_value(step, k, v.item() if type(v) is torch.Tensor else v)
+
+ def log_value(self, step, key, val, stat='mean'):
+ if self.enabled:
+ if key not in self.cache:
+ self.cache[key] = []
+ self.cache[key].append(val)
+ if len(self.cache[key]) == self.interval:
+ agg_val = getattr(np, stat)(self.cache[key])
+ self.summary_writer.add_scalar(key, agg_val, step)
+ del self.cache[key]
+
+ def log_grads(self, step, model):
+ if self.enabled:
+ norms = [p.grad.norm().item() for p in model.parameters()
+ if p.grad is not None]
+ for stat in ('max', 'min', 'mean'):
+ self.log_value(step, f'grad_{stat}', getattr(np, stat)(norms),
+ stat=stat)
+
+
+def unique_log_fpath(log_fpath):
+
+ if not os.path.isfile(log_fpath):
+ return log_fpath
+
+ # Avoid overwriting old logs
+    saved = sorted([int(re.search(r'\.(\d+)', f).group(1))
+ for f in glob.glob(f'{log_fpath}.*')])
+
+ log_num = (saved[-1] if saved else 0) + 1
+ return f'{log_fpath}.{log_num}'
+
+
+def stdout_step_format(step):
+ if isinstance(step, str):
+ return step
+ fields = []
+ if len(step) > 0:
+ fields.append("epoch {:>4}".format(step[0]))
+ if len(step) > 1:
+ fields.append("iter {:>4}".format(step[1]))
+ if len(step) > 2:
+ fields[-1] += "/{}".format(step[2])
+ return " | ".join(fields)
+
+
+def stdout_metric_format(metric, metadata, value):
+ name = metadata.get("name", metric + " : ")
+ unit = metadata.get("unit", None)
+ format = f'{{{metadata.get("format", "")}}}'
+ fields = [name, format.format(value) if value is not None else value, unit]
+ fields = [f for f in fields if f is not None]
+ return "| " + " ".join(fields)
+
+
+def init_log(args):
+ enabled = (args.local_rank == 0)
+ if enabled:
+ fpath = args.log_file or os.path.join(args.output_dir, 'nvlog.json')
+ backends = [JSONStreamBackend(Verbosity.DEFAULT,
+ unique_log_fpath(fpath)),
+ StdOutBackend(Verbosity.VERBOSE,
+ step_format=stdout_step_format,
+ metric_format=stdout_metric_format)]
+ else:
+ backends = []
+
+ dllogger.init(backends=backends)
+ dllogger.metadata("train_lrate", {"name": "lrate", "format": ":>3.2e"})
+
+ for id_, pref in [('train', ''), ('train_avg', 'avg train '),
+ ('dev', ' avg dev '), ('dev_ema', ' EMA dev ')]:
+
+ dllogger.metadata(f"{id_}_loss",
+ {"name": f"{pref}loss", "format": ":>7.2f"})
+
+ dllogger.metadata(f"{id_}_wer",
+ {"name": f"{pref}wer", "format": ":>6.2f"})
+
+ dllogger.metadata(f"{id_}_throughput",
+ {"name": f"{pref}utts/s", "format": ":>5.0f"})
+
+ dllogger.metadata(f"{id_}_took",
+ {"name": "took", "unit": "s", "format": ":>5.2f"})
+
+ tb_subsets = ['train', 'dev', 'dev_ema'] if args.ema else ['train', 'dev']
+ global tb_loggers
+ tb_loggers = {s: TBLogger(enabled, args.output_dir, name=s)
+ for s in tb_subsets}
+
+ log_parameters(vars(args), tb_subset='train')
+
+
+def log(step, tb_total_steps=None, subset='train', data={}):
+
+ if tb_total_steps is not None:
+ tb_loggers[subset].log(tb_total_steps, data)
+
+ if subset != '':
+ data = {f'{subset}_{key}': v for key,v in data.items()}
+ dllogger.log(step, data=data)
+
+
+def log_grads_tb(tb_total_steps, grads, tb_subset='train'):
+ tb_loggers[tb_subset].log_grads(tb_total_steps, grads)
+
+
+def log_parameters(data, verbosity=0, tb_subset=None):
+ for k,v in data.items():
+ dllogger.log(step="PARAMETER", data={k:v}, verbosity=verbosity)
+
+ if tb_subset is not None and tb_loggers[tb_subset].enabled:
+ tb_data = {k:v for k,v in data.items()
+ if type(v) in (str, bool, int, float)}
+ tb_loggers[tb_subset].summary_writer.add_hparams(tb_data, {})
+
+
+def flush_log():
+ dllogger.flush()
+ for tbl in tb_loggers.values():
+ if tbl.enabled:
+ tbl.summary_writer.flush()
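+
+# Illustrative logging sequence (`args` is a placeholder namespace carrying the
+# attributes read by init_log: local_rank, log_file, output_dir, ema):
+#
+#   init_log(args)
+#   log((epoch, step, total_steps), tb_total_steps=step,
+#       subset='train', data={'loss': 1.23, 'lrate': 1e-3})
+#   flush_log()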
diff --git a/PyTorch/contrib/audio/Jasper/common/text/LICENSE b/PyTorch/contrib/audio/Jasper/common/text/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..4ad4ed1d5e34d95c8380768ec16405d789cc6de4
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/text/LICENSE
@@ -0,0 +1,19 @@
+Copyright (c) 2017 Keith Ito
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/PyTorch/contrib/audio/Jasper/common/text/__init__.py b/PyTorch/contrib/audio/Jasper/common/text/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..4901823853d9100bcbc58f4913241252815e6f55
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/text/__init__.py
@@ -0,0 +1,32 @@
+# Copyright (c) 2017 Keith Ito
+""" from https://github.com/keithito/tacotron """
+import re
+import string
+from . import cleaners
+
+def _clean_text(text, cleaner_names, *args):
+ for name in cleaner_names:
+ cleaner = getattr(cleaners, name)
+ if not cleaner:
+ raise Exception('Unknown cleaner: %s' % name)
+ text = cleaner(text, *args)
+ return text
+
+
+def punctuation_map(labels):
+ # Punctuation to remove
+ punctuation = string.punctuation
+ punctuation = punctuation.replace("+", "")
+ punctuation = punctuation.replace("&", "")
+ # TODO We might also want to consider:
+ # @ -> at
+ # # -> number, pound, hashtag
+ # ~ -> tilde
+ # _ -> underscore
+ # % -> percent
+ # If a punctuation symbol is inside our vocab, we do not remove from text
+ for l in labels:
+ punctuation = punctuation.replace(l, "")
+ # Turn all punctuation to whitespace
+ table = str.maketrans(punctuation, " " * len(punctuation))
+ return table
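+
+# Illustrative mapping: with an apostrophe in the vocabulary it is preserved,
+# while the remaining punctuation is turned into spaces:
+#
+#   table = punctuation_map(["a", "b", "c", "'"])
+#   "don't stop!".translate(table)   # -> "don't stop "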
diff --git a/PyTorch/contrib/audio/Jasper/common/text/cleaners.py b/PyTorch/contrib/audio/Jasper/common/text/cleaners.py
new file mode 100644
index 0000000000000000000000000000000000000000..a99db1a625f31c7754ff295314bfa73b5d9e8e6f
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/text/cleaners.py
@@ -0,0 +1,107 @@
+# Copyright (c) 2017 Keith Ito
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+""" from https://github.com/keithito/tacotron
+Modified to add punctuation removal
+"""
+
+'''
+Cleaners are transformations that run over the input text at both training and eval time.
+
+Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
+hyperparameter. Some cleaners are English-specific. You'll typically want to use:
+ 1. "english_cleaners" for English text
+ 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
+ the Unidecode library (https://pypi.python.org/pypi/Unidecode)
+ 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
+ the symbols in symbols.py to match your data).
+
+'''
+
+import re
+from unidecode import unidecode
+from .numbers import normalize_numbers
+
+# Regular expression matching whitespace:
+_whitespace_re = re.compile(r'\s+')
+
+# List of (regular expression, replacement) pairs for abbreviations:
+_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
+ ('mrs', 'misess'),
+ ('mr', 'mister'),
+ ('dr', 'doctor'),
+ ('st', 'saint'),
+ ('co', 'company'),
+ ('jr', 'junior'),
+ ('maj', 'major'),
+ ('gen', 'general'),
+ ('drs', 'doctors'),
+ ('rev', 'reverend'),
+ ('lt', 'lieutenant'),
+ ('hon', 'honorable'),
+ ('sgt', 'sergeant'),
+ ('capt', 'captain'),
+ ('esq', 'esquire'),
+ ('ltd', 'limited'),
+ ('col', 'colonel'),
+ ('ft', 'fort'),
+]]
+
+def expand_abbreviations(text):
+ for regex, replacement in _abbreviations:
+ text = re.sub(regex, replacement, text)
+ return text
+
+def expand_numbers(text):
+ return normalize_numbers(text)
+
+def lowercase(text):
+ return text.lower()
+
+def collapse_whitespace(text):
+ return re.sub(_whitespace_re, ' ', text)
+
+def convert_to_ascii(text):
+ return unidecode(text)
+
+def remove_punctuation(text, table):
+ text = text.translate(table)
+ text = re.sub(r'&', " and ", text)
+ text = re.sub(r'\+', " plus ", text)
+ return text
+
+def basic_cleaners(text):
+ '''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
+ text = lowercase(text)
+ text = collapse_whitespace(text)
+ return text
+
+def transliteration_cleaners(text):
+ '''Pipeline for non-English text that transliterates to ASCII.'''
+ text = convert_to_ascii(text)
+ text = lowercase(text)
+ text = collapse_whitespace(text)
+ return text
+
+def english_cleaners(text, table=None):
+ '''Pipeline for English text, including number and abbreviation expansion.'''
+ text = convert_to_ascii(text)
+ text = lowercase(text)
+ text = expand_numbers(text)
+ text = expand_abbreviations(text)
+ if table is not None:
+ text = remove_punctuation(text, table)
+ text = collapse_whitespace(text)
+ return text
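+
+# Illustrative cleanup (output reproduced by hand, so treat it as approximate):
+#
+#   english_cleaners("Dr. Smith paid $5 on Jan 2")
+#   # -> "doctor smith paid five dollars on jan two"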
diff --git a/PyTorch/contrib/audio/Jasper/common/text/numbers.py b/PyTorch/contrib/audio/Jasper/common/text/numbers.py
new file mode 100644
index 0000000000000000000000000000000000000000..46ce110676201ee0ba620a80eb4ba44c1790731f
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/text/numbers.py
@@ -0,0 +1,99 @@
+# Copyright (c) 2017 Keith Ito
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" from https://github.com/keithito/tacotron
+Modified to add support for time and slight tweaks to _expand_number
+"""
+
+import inflect
+import re
+
+
+_inflect = inflect.engine()
+_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
+_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
+_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
+_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
+_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
+_number_re = re.compile(r'[0-9]+')
+_time_re = re.compile(r'([0-9]{1,2}):([0-9]{2})')
+
+
+def _remove_commas(m):
+ return m.group(1).replace(',', '')
+
+
+def _expand_decimal_point(m):
+ return m.group(1).replace('.', ' point ')
+
+
+def _expand_dollars(m):
+ match = m.group(1)
+ parts = match.split('.')
+ if len(parts) > 2:
+ return match + ' dollars' # Unexpected format
+ dollars = int(parts[0]) if parts[0] else 0
+ cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
+ if dollars and cents:
+ dollar_unit = 'dollar' if dollars == 1 else 'dollars'
+ cent_unit = 'cent' if cents == 1 else 'cents'
+ return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
+ elif dollars:
+ dollar_unit = 'dollar' if dollars == 1 else 'dollars'
+ return '%s %s' % (dollars, dollar_unit)
+ elif cents:
+ cent_unit = 'cent' if cents == 1 else 'cents'
+ return '%s %s' % (cents, cent_unit)
+ else:
+ return 'zero dollars'
+
+
+def _expand_ordinal(m):
+ return _inflect.number_to_words(m.group(0))
+
+
+def _expand_number(m):
+ if int(m.group(0)[0]) == 0:
+ return _inflect.number_to_words(m.group(0), andword='', group=1)
+ num = int(m.group(0))
+ if num > 1000 and num < 3000:
+ if num == 2000:
+ return 'two thousand'
+ elif num > 2000 and num < 2010:
+ return 'two thousand ' + _inflect.number_to_words(num % 100)
+ elif num % 100 == 0:
+ return _inflect.number_to_words(num // 100) + ' hundred'
+ else:
+ return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
+ # Add check for number phones and other large numbers
+ elif num > 1000000000 and num % 10000 != 0:
+ return _inflect.number_to_words(num, andword='', group=1)
+ else:
+ return _inflect.number_to_words(num, andword='')
+
+def _expand_time(m):
+ mins = int(m.group(2))
+ if mins == 0:
+ return _inflect.number_to_words(m.group(1))
+ return " ".join([_inflect.number_to_words(m.group(1)), _inflect.number_to_words(m.group(2))])
+
+def normalize_numbers(text):
+ text = re.sub(_comma_number_re, _remove_commas, text)
+ text = re.sub(_pounds_re, r'\1 pounds', text)
+ text = re.sub(_dollars_re, _expand_dollars, text)
+ text = re.sub(_decimal_number_re, _expand_decimal_point, text)
+ text = re.sub(_ordinal_re, _expand_ordinal, text)
+    # expand times before bare numbers, otherwise the digits of e.g. "3:00" are
+    # consumed by _number_re and _time_re can never match
+    text = re.sub(_time_re, _expand_time, text)
+    text = re.sub(_number_re, _expand_number, text)
+ return text
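+
+# Illustrative expansions (outputs reproduced by hand):
+#
+#   normalize_numbers("$2.50 for 3 cats")   # -> "two dollars, fifty cents for three cats"
+#   normalize_numbers("over 1,000 miles")   # -> "over one thousand miles"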
diff --git a/PyTorch/contrib/audio/Jasper/common/text/symbols.py b/PyTorch/contrib/audio/Jasper/common/text/symbols.py
new file mode 100644
index 0000000000000000000000000000000000000000..24efedf8daf042c91a00894aa1aec1eacb69944e
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/text/symbols.py
@@ -0,0 +1,19 @@
+# Copyright (c) 2017 Keith Ito
+""" from https://github.com/keithito/tacotron """
+
+'''
+Defines the set of symbols used in text input to the model.
+
+The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. '''
+from . import cmudict
+
+_pad = '_'
+_punctuation = '!\'(),.:;? '
+_special = '-'
+_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
+
+# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
+_arpabet = ['@' + s for s in cmudict.valid_symbols]
+
+# Export all symbols:
+symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet
diff --git a/PyTorch/contrib/audio/Jasper/common/train.py b/PyTorch/contrib/audio/Jasper/common/train.py
new file mode 100644
index 0000000000000000000000000000000000000000..2fec7f56eda810cb07046038ccda092ea5288d98
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/train.py
@@ -0,0 +1,516 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import copy
+import os
+import random
+import time
+
+try:
+ import nvidia_dlprof_pytorch_nvtx as pyprof
+except ModuleNotFoundError:
+ import pyprof
+
+import torch
+import numpy as np
+import torch.cuda.profiler as profiler
+import torch.distributed as dist
+from apex import amp
+from apex.parallel import DistributedDataParallel
+
+from common import helpers
+# from common.dali.data_loader import DaliDataLoader
+from common.dataset import AudioDataset, get_data_loader
+from common.features import BaseFeatures, FilterbankFeatures
+from common.helpers import (Checkpointer, greedy_wer, num_weights, print_once,
+ process_evaluation_epoch)
+from common.optimizers import AdamW, lr_policy, Novograd
+from common.tb_dllogger import flush_log, init_log, log
+from common.utils import BenchmarkStats
+from jasper import config
+from jasper.model import CTCLossNM, GreedyCTCDecoder, Jasper
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Jasper')
+
+ training = parser.add_argument_group('training setup')
+ training.add_argument('--epochs', default=400, type=int,
+ help='Number of epochs for the entire training; influences the lr schedule')
+ training.add_argument("--warmup_epochs", default=0, type=int,
+ help='Initial epochs of increasing learning rate')
+ training.add_argument("--hold_epochs", default=0, type=int,
+ help='Constant max learning rate epochs after warmup')
+ training.add_argument('--epochs_this_job', default=0, type=int,
+ help=('Run for a number of epochs with no effect on the lr schedule.'
+                                ' Useful for re-starting the training.'))
+ training.add_argument('--cudnn_benchmark', action='store_true', default=True,
+ help='Enable cudnn benchmark')
+ training.add_argument('--amp', '--fp16', action='store_true', default=False,
+ help='Use mixed precision training')
+ training.add_argument('--seed', default=42, type=int, help='Random seed')
+ training.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0),
+ type=int, help='GPU id used for distributed training')
+ training.add_argument('--pre_allocate_range', default=None, type=int, nargs=2,
+ help='Warmup with batches of length [min, max] before training')
+ training.add_argument('--pyprof', action='store_true', help='Enable pyprof profiling')
+
+ optim = parser.add_argument_group('optimization setup')
+ optim.add_argument('--batch_size', default=32, type=int,
+ help='Global batch size')
+ optim.add_argument('--lr', default=1e-3, type=float,
+ help='Peak learning rate')
+ optim.add_argument("--min_lr", default=1e-5, type=float,
+ help='minimum learning rate')
+ optim.add_argument("--lr_policy", default='exponential', type=str,
+ choices=['exponential', 'legacy'], help='lr scheduler')
+ optim.add_argument("--lr_exp_gamma", default=0.99, type=float,
+ help='gamma factor for exponential lr scheduler')
+ optim.add_argument('--weight_decay', default=1e-3, type=float,
+ help='Weight decay for the optimizer')
+ optim.add_argument('--grad_accumulation_steps', default=1, type=int,
+ help='Number of accumulation steps')
+ optim.add_argument('--optimizer', default='novograd', type=str,
+ choices=['novograd', 'adamw'], help='Optimization algorithm')
+ optim.add_argument('--ema', type=float, default=0.0,
+ help='Discount factor for exp averaging of model weights')
+
+ io = parser.add_argument_group('feature and checkpointing setup')
+ io.add_argument('--dali_device', type=str, choices=['none', 'cpu', 'gpu'],
+ default='none', help='Use DALI pipeline for fast data processing')
+ io.add_argument('--resume', action='store_true',
+ help='Try to resume from last saved checkpoint.')
+ io.add_argument('--ckpt', default=None, type=str,
+ help='Path to a checkpoint for resuming training')
+ io.add_argument('--save_frequency', default=10, type=int,
+ help='Checkpoint saving frequency in epochs')
+ io.add_argument('--keep_milestones', default=[100, 200, 300], type=int, nargs='+',
+                    help='Milestone checkpoints that are never removed')
+ io.add_argument('--save_best_from', default=380, type=int,
+ help='Epoch on which to begin tracking best checkpoint (dev WER)')
+ io.add_argument('--eval_frequency', default=200, type=int,
+ help='Number of steps between evaluations on dev set')
+ io.add_argument('--log_frequency', default=25, type=int,
+ help='Number of steps between printing training stats')
+ io.add_argument('--prediction_frequency', default=100, type=int,
+ help='Number of steps between printing sample decodings')
+ io.add_argument('--model_config', type=str, required=True,
+ help='Path of the model configuration file')
+ io.add_argument('--train_manifests', type=str, required=True, nargs='+',
+                    help='Paths of the training dataset manifest files')
+ io.add_argument('--val_manifests', type=str, required=True, nargs='+',
+ help='Paths of the evaluation datasets manifest files')
+ io.add_argument('--dataset_dir', required=True, type=str,
+ help='Root dir of dataset')
+ io.add_argument('--output_dir', type=str, required=True,
+ help='Directory for logs and checkpoints')
+ io.add_argument('--log_file', type=str, default=None,
+ help='Path to save the training logfile.')
+ io.add_argument('--benchmark_epochs_num', type=int, default=1,
+ help='Number of epochs accounted in final average throughput.')
+ io.add_argument('--override_config', type=str, action='append',
+ help='Overrides a value from a config .yaml.'
+ ' Syntax: `--override_config nested.config.key=val`.')
+ return parser.parse_args()
+
+
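+# Average a tensor across all distributed workers; used to report a global
+# loss value when more than one device participates in training.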
+def reduce_tensor(tensor, num_gpus):
+ rt = tensor.clone()
+ dist.all_reduce(rt, op=dist.ReduceOp.SUM)
+ return rt.true_divide(num_gpus)
+
+
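+# Update the exponential moving average copy of the weights in place:
+# ema_param = decay * ema_param + (1 - decay) * param.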
+def apply_ema(model, ema_model, decay):
+ if not decay:
+ return
+
+ sd = getattr(model, 'module', model).state_dict()
+ for k, v in ema_model.state_dict().items():
+ v.copy_(decay * v + (1 - decay) * sd[k])
+
+
+@torch.no_grad()
+def evaluate(epoch, step, val_loader, val_feat_proc, labels, model,
+ ema_model, ctc_loss, greedy_decoder, use_amp, use_dali=False):
+
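+    # Score both the current weights ('dev') and, when EMA is enabled,
+    # the averaged copy ('dev_ema') on the validation set.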
+ for model, subset in [(model, 'dev'), (ema_model, 'dev_ema')]:
+ if model is None:
+ continue
+
+ model.eval()
+ start_time = time.time()
+ agg = {'losses': [], 'preds': [], 'txts': []}
+
+ for batch in val_loader:
+ if use_dali:
+ # with DALI, the data is already on GPU
+ feat, feat_lens, txt, txt_lens = batch
+ if val_feat_proc is not None:
+ feat, feat_lens = val_feat_proc(feat, feat_lens, use_amp)
+ else:
+ batch = [t.npu(non_blocking=True) for t in batch]
+ audio, audio_lens, txt, txt_lens = batch
+ feat, feat_lens = val_feat_proc(audio, audio_lens, use_amp)
+
+ log_probs, enc_lens = model.forward(feat, feat_lens)
+ loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
+ pred = greedy_decoder(log_probs)
+
+ agg['losses'] += helpers.gather_losses([loss])
+ agg['preds'] += helpers.gather_predictions([pred], labels)
+ agg['txts'] += helpers.gather_transcripts([txt], [txt_lens], labels)
+
+ wer, loss = process_evaluation_epoch(agg)
+ log((epoch,), step, subset, {'loss': loss, 'wer': 100.0 * wer,
+ 'took': time.time() - start_time})
+ model.train()
+ return wer
+
+
+def main():
+ args = parse_args()
+
+    assert torch.npu.is_available()
+ assert args.prediction_frequency % args.log_frequency == 0
+
+ torch.backends.cudnn.benchmark = args.cudnn_benchmark
+
+ # set up distributed training
+ multi_gpu = False
+ if multi_gpu:
+ torch.cuda.set_device(args.local_rank)
+ dist.init_process_group(backend='nccl', init_method='env://')
+ world_size = dist.get_world_size()
+ print_once(f'Distributed training with {world_size} GPUs\n')
+ else:
+ world_size = 1
+
+ torch.manual_seed(args.seed + args.local_rank)
+ np.random.seed(args.seed + args.local_rank)
+ random.seed(args.seed + args.local_rank)
+
+ init_log(args)
+
+ cfg = config.load(args.model_config)
+ config.apply_config_overrides(cfg, args)
+
+ symbols = helpers.add_ctc_blank(cfg['labels'])
+
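+    # The global batch is split into micro-batches; gradients are accumulated
+    # over grad_accumulation_steps micro-batches before every optimizer step.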
+ assert args.grad_accumulation_steps >= 1
+ assert args.batch_size % args.grad_accumulation_steps == 0
+ batch_size = args.batch_size // args.grad_accumulation_steps
+
+ print_once('Setting up datasets...')
+ train_dataset_kw, train_features_kw = config.input(cfg, 'train')
+ val_dataset_kw, val_features_kw = config.input(cfg, 'val')
+
+ # use_dali = args.dali_device in ('cpu', 'gpu')
+ use_dali = False
+ if use_dali:
+ assert train_dataset_kw['ignore_offline_speed_perturbation'], \
+ "DALI doesn't support offline speed perturbation"
+
+        # pad_to_max_duration is not supported by DALI - use simple padders instead
+ if train_features_kw['pad_to_max_duration']:
+ train_feat_proc = BaseFeatures(
+ pad_align=train_features_kw['pad_align'],
+ pad_to_max_duration=True,
+ max_duration=train_features_kw['max_duration'],
+ sample_rate=train_features_kw['sample_rate'],
+ window_size=train_features_kw['window_size'],
+ window_stride=train_features_kw['window_stride'])
+ train_features_kw['pad_to_max_duration'] = False
+ else:
+ train_feat_proc = None
+
+ if val_features_kw['pad_to_max_duration']:
+ val_feat_proc = BaseFeatures(
+ pad_align=val_features_kw['pad_align'],
+ pad_to_max_duration=True,
+ max_duration=val_features_kw['max_duration'],
+ sample_rate=val_features_kw['sample_rate'],
+ window_size=val_features_kw['window_size'],
+ window_stride=val_features_kw['window_stride'])
+ val_features_kw['pad_to_max_duration'] = False
+ else:
+ val_feat_proc = None
+
+ train_loader = DaliDataLoader(gpu_id=args.local_rank,
+ dataset_path=args.dataset_dir,
+ config_data=train_dataset_kw,
+ config_features=train_features_kw,
+ json_names=args.train_manifests,
+ batch_size=batch_size,
+ grad_accumulation_steps=args.grad_accumulation_steps,
+ pipeline_type="train",
+ device_type=args.dali_device,
+ symbols=symbols)
+
+ val_loader = DaliDataLoader(gpu_id=args.local_rank,
+ dataset_path=args.dataset_dir,
+ config_data=val_dataset_kw,
+ config_features=val_features_kw,
+ json_names=args.val_manifests,
+ batch_size=batch_size,
+ pipeline_type="val",
+ device_type=args.dali_device,
+ symbols=symbols)
+ else:
+ train_dataset_kw, train_features_kw = config.input(cfg, 'train')
+ train_dataset = AudioDataset(args.dataset_dir,
+ args.train_manifests,
+ symbols,
+ **train_dataset_kw)
+ train_loader = get_data_loader(train_dataset,
+ batch_size,
+ multi_gpu=multi_gpu,
+ shuffle=True,
+ num_workers=4)
+ train_feat_proc = FilterbankFeatures(**train_features_kw)
+
+ val_dataset_kw, val_features_kw = config.input(cfg, 'val')
+ val_dataset = AudioDataset(args.dataset_dir,
+ args.val_manifests,
+ symbols,
+ **val_dataset_kw)
+ val_loader = get_data_loader(val_dataset,
+ batch_size,
+ multi_gpu=multi_gpu,
+ shuffle=False,
+ num_workers=4,
+ drop_last=False)
+ val_feat_proc = FilterbankFeatures(**val_features_kw)
+
+ dur = train_dataset.duration / 3600
+ dur_f = train_dataset.duration_filtered / 3600
+ nsampl = len(train_dataset)
+ print_once(f'Training samples: {nsampl} ({dur:.1f}h, '
+ f'filtered {dur_f:.1f}h)')
+
+ if train_feat_proc is not None:
+ train_feat_proc.cpu()
+ if val_feat_proc is not None:
+ val_feat_proc.cpu()
+
+ steps_per_epoch = len(train_loader) // args.grad_accumulation_steps
+
+ # set up the model
+ model = Jasper(encoder_kw=config.encoder(cfg),
+ decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
+ model.cpu()
+ ctc_loss = CTCLossNM(n_classes=len(symbols))
+ greedy_decoder = GreedyCTCDecoder()
+
+ print_once(f'Model size: {num_weights(model) / 10**6:.1f}M params\n')
+
+ # optimization
+ kw = {'lr': args.lr, 'weight_decay': args.weight_decay}
+ if args.optimizer == "novograd":
+ optimizer = Novograd(model.parameters(), **kw)
+ elif args.optimizer == "adamw":
+ optimizer = AdamW(model.parameters(), **kw)
+ else:
+ raise ValueError(f'Invalid optimizer "{args.optimizer}"')
+
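+    # Learning-rate schedule: warm up for warmup_epochs, hold the peak lr for
+    # hold_epochs, then decay according to the selected policy without going
+    # below min_lr.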
+ adjust_lr = lambda step, epoch, optimizer: lr_policy(
+ step, epoch, args.lr, optimizer, steps_per_epoch=steps_per_epoch,
+ warmup_epochs=args.warmup_epochs, hold_epochs=args.hold_epochs,
+ num_epochs=args.epochs, policy=args.lr_policy, min_lr=args.min_lr,
+ exp_gamma=args.lr_exp_gamma)
+
+ if args.amp:
+ model, optimizer = amp.initialize(
+ min_loss_scale=1.0, models=model, optimizers=optimizer,
+ opt_level='O1', max_loss_scale=512.0)
+
+ if args.ema > 0:
+ ema_model = copy.deepcopy(model)
+ else:
+ ema_model = None
+
+ if multi_gpu:
+ model = DistributedDataParallel(model)
+
+ if args.pyprof:
+ pyprof.init(enable_function_stack=True)
+
+ # load checkpoint
+ meta = {'best_wer': 10**6, 'start_epoch': 0}
+ checkpointer = Checkpointer(args.output_dir, 'Jasper',
+ args.keep_milestones, args.amp)
+ if args.resume:
+ args.ckpt = checkpointer.last_checkpoint() or args.ckpt
+
+ if args.ckpt is not None:
+ checkpointer.load(args.ckpt, model, ema_model, optimizer, meta)
+
+ start_epoch = meta['start_epoch']
+ best_wer = meta['best_wer']
+ epoch = 1
+ step = start_epoch * steps_per_epoch + 1
+
+ if args.pyprof:
+ torch.autograd.profiler.emit_nvtx().__enter__()
+ profiler.start()
+
+ # training loop
+ model.train()
+
+ # pre-allocate
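+    # Run forward and backward passes on synthetic batches of increasing length
+    # so device memory is grown up front and real batches of comparable size do
+    # not trigger re-allocations later on.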
+ if args.pre_allocate_range is not None:
+ n_feats = train_features_kw['n_filt']
+ pad_align = train_features_kw['pad_align']
+ a, b = args.pre_allocate_range
+ for n_frames in range(a, b + pad_align, pad_align):
+ print_once(f'Pre-allocation ({batch_size}x{n_feats}x{n_frames})...')
+
+ feat = torch.randn(batch_size, n_feats, n_frames, device='cpu')
+ feat_lens = torch.ones(batch_size, device='cpu').fill_(n_frames)
+ txt = torch.randint(high=len(symbols)-1, size=(batch_size, 100),
+ device='cpu')
+ txt_lens = torch.ones(batch_size, device='cpu').fill_(100)
+ log_probs, enc_lens = model(feat, feat_lens)
+ del feat
+ loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
+ loss.backward()
+ model.zero_grad()
+
+ bmark_stats = BenchmarkStats()
+
+ for epoch in range(start_epoch + 1, args.epochs + 1):
+ if multi_gpu and not use_dali:
+ train_loader.sampler.set_epoch(epoch)
+
+ epoch_utts = 0
+ epoch_loss = 0
+ accumulated_batches = 0
+ epoch_start_time = time.time()
+
+ for batch in train_loader:
+
+ if accumulated_batches == 0:
+ adjust_lr(step, epoch, optimizer)
+ optimizer.zero_grad()
+ step_loss = 0
+ step_utts = 0
+ step_start_time = time.time()
+
+ if use_dali:
+ # with DALI, the data is already on GPU
+ feat, feat_lens, txt, txt_lens = batch
+ if train_feat_proc is not None:
+ feat, feat_lens = train_feat_proc(feat, feat_lens, args.amp)
+ else:
+ batch = [t.npu(non_blocking=True) for t in batch]
+ audio, audio_lens, txt, txt_lens = batch
+ feat, feat_lens = train_feat_proc(audio, audio_lens, args.amp)
+ print("feat",feat)
+ print("feat_len",feat_lens)
+ log_probs, enc_lens = model(feat, feat_lens)
+
+ loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
+ loss /= args.grad_accumulation_steps
+
+ if torch.isnan(loss).any():
+                print_once('WARNING: loss is NaN; skipping update')
+ else:
+ if multi_gpu:
+ step_loss += reduce_tensor(loss.data, world_size).item()
+ else:
+ step_loss += loss.item()
+
+ if args.amp:
+ with amp.scale_loss(loss, optimizer) as scaled_loss:
+ scaled_loss.backward()
+ else:
+ loss.backward()
+ step_utts += batch[0].size(0) * world_size
+ epoch_utts += batch[0].size(0) * world_size
+ accumulated_batches += 1
+
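+            # Take an optimizer step once enough micro-batches have been
+            # accumulated; `step` therefore counts optimizer steps, not batches.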
+ if accumulated_batches % args.grad_accumulation_steps == 0:
+ epoch_loss += step_loss
+ optimizer.step()
+ apply_ema(model, ema_model, args.ema)
+
+ if step % args.log_frequency == 0:
+ preds = greedy_decoder(log_probs)
+ wer, pred_utt, ref = greedy_wer(preds, txt, txt_lens, symbols)
+
+ if step % args.prediction_frequency == 0:
+ print_once(f' Decoded: {pred_utt[:90]}')
+ print_once(f' Reference: {ref[:90]}')
+
+ step_time = time.time() - step_start_time
+ log((epoch, step % steps_per_epoch or steps_per_epoch, steps_per_epoch),
+ step, 'train',
+ {'loss': step_loss,
+ 'wer': 100.0 * wer,
+ 'throughput': step_utts / step_time,
+ 'took': step_time,
+ 'lrate': optimizer.param_groups[0]['lr']})
+
+ step_start_time = time.time()
+
+ if step % args.eval_frequency == 0:
+ wer = evaluate(epoch, step, val_loader, val_feat_proc,
+ symbols, model, ema_model, ctc_loss,
+ greedy_decoder, args.amp, use_dali)
+
+ if wer < best_wer and epoch >= args.save_best_from:
+ checkpointer.save(model, ema_model, optimizer, epoch,
+ step, best_wer, is_best=True)
+ best_wer = wer
+
+ step += 1
+ accumulated_batches = 0
+ # end of step
+
+            # The DALI iterator needs to be exhausted;
+ # if not using DALI, simulate drop_last=True with grad accumulation
+ if not use_dali and step > steps_per_epoch * epoch:
+ break
+
+ epoch_time = time.time() - epoch_start_time
+ epoch_loss /= steps_per_epoch
+ log((epoch,), None, 'train_avg', {'throughput': epoch_utts / epoch_time,
+ 'took': epoch_time,
+ 'loss': epoch_loss})
+ bmark_stats.update(epoch_utts, epoch_time, epoch_loss)
+
+ if epoch % args.save_frequency == 0 or epoch in args.keep_milestones:
+ checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
+
+ if 0 < args.epochs_this_job <= epoch - start_epoch:
+ print_once(f'Finished after {args.epochs_this_job} epochs.')
+ break
+ # end of epoch
+
+ if args.pyprof:
+ profiler.stop()
+ torch.autograd.profiler.emit_nvtx().__exit__(None, None, None)
+
+ log((), None, 'train_avg', bmark_stats.get(args.benchmark_epochs_num))
+
+ if epoch == args.epochs:
+ evaluate(epoch, step, val_loader, val_feat_proc, symbols, model,
+ ema_model, ctc_loss, greedy_decoder, args.amp, use_dali)
+
+ checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
+ flush_log()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/PyTorch/contrib/audio/Jasper/common/utils.py b/PyTorch/contrib/audio/Jasper/common/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..2bd7986f1924f633091010ef24ec489c38f6e87a
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/common/utils.py
@@ -0,0 +1,20 @@
+import numpy as np
+
+
+class BenchmarkStats:
+ """ Tracks statistics used for benchmarking. """
+ def __init__(self):
+ self.utts = []
+ self.times = []
+ self.losses = []
+
+ def update(self, utts, times, losses):
+ self.utts.append(utts)
+ self.times.append(times)
+ self.losses.append(losses)
+
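+    # Report throughput and mean loss averaged over the last `n_epochs` epochs.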
+ def get(self, n_epochs):
+ throughput = sum(self.utts[-n_epochs:]) / sum(self.times[-n_epochs:])
+
+ return {'throughput': throughput, 'benchmark_epochs_num': n_epochs,
+ 'loss': np.mean(self.losses[-n_epochs:])}
diff --git a/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speca.yaml b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speca.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..b0c0d5b9c42b175d7199bfa96950adf3c25c721d
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speca.yaml
@@ -0,0 +1,139 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: "Jasper"
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+ "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
+
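+# Validation settings are defined first and anchored (&val_dataset, &val_features)
+# so that the training section below can reuse them via YAML merge keys (<<: *...).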
+input_val:
+ audio_dataset: &val_dataset
+ sample_rate: &sample_rate 16000
+ trim_silence: true
+ normalize_transcripts: true
+
+ filterbank_features: &val_features
+ normalize: per_feature
+ sample_rate: *sample_rate
+ window_size: 0.02
+ window_stride: 0.01
+ window: hann
+ n_filt: &n_filt 64
+ n_fft: 512
+ frame_splicing: &frame_splicing 1
+ dither: 0.00001
+ pad_align: 16
+
+# For training we keep samples < 16.7s and apply augmentation
+input_train:
+ audio_dataset:
+ <<: *val_dataset
+ max_duration: 16.7
+ ignore_offline_speed_perturbation: true
+
+ filterbank_features:
+ <<: *val_features
+ max_duration: 16.7
+
+ spec_augment:
+ freq_masks: 2
+ max_freq: 20
+ time_masks: 2
+ max_time: 75
+
+jasper:
+ encoder:
+ init: xavier_uniform
+ in_feats: *n_filt
+ frame_splicing: *frame_splicing
+ activation: relu
+ use_conv_masks: true
+ blocks:
+ - &Conv1
+ filters: 256
+ repeat: 1
+ kernel_size: [11]
+ stride: [2]
+ dilation: [1]
+ dropout: 0.2
+ residual: false
+ - &B1
+ filters: 256
+ repeat: 5
+ kernel_size: [11]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B1
+ - &B2
+ filters: 384
+ repeat: 5
+ kernel_size: [13]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B2
+ - &B3
+ filters: 512
+ repeat: 5
+ kernel_size: [17]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B3
+ - &B4
+ filters: 640
+ repeat: 5
+ kernel_size: [21]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B4
+ - &B5
+ filters: 768
+ repeat: 5
+ kernel_size: [25]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B5
+ - &Conv2
+ filters: 896
+ repeat: 1
+ kernel_size: [29]
+ stride: [1]
+ dilation: [2]
+ dropout: 0.4
+ residual: false
+ - &Conv3
+ filters: &enc_feats 1024
+ repeat: 1
+ kernel_size: [1]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.4
+ residual: false
+
+ decoder:
+ in_feats: *enc_feats
+ init: xavier_uniform
diff --git a/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline.yaml b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..89c135ea719d6a1e224203f301e1c0bea9460d1a
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline.yaml
@@ -0,0 +1,139 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: "Jasper"
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+ "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
+
+input_val:
+ audio_dataset: &val_dataset
+ sample_rate: &sample_rate 16000
+ trim_silence: true
+ normalize_transcripts: true
+
+ filterbank_features: &val_features
+ normalize: per_feature
+ sample_rate: *sample_rate
+ window_size: 0.02
+ window_stride: 0.01
+ window: hann
+ n_filt: &n_filt 64
+ n_fft: 512
+ frame_splicing: &frame_splicing 1
+ dither: 0.00001
+ pad_align: 16
+
+# For training we keep samples < 16.7s and apply augmentation
+input_train:
+ audio_dataset:
+ <<: *val_dataset
+ max_duration: 16.7
+ ignore_offline_speed_perturbation: false
+
+ filterbank_features:
+ <<: *val_features
+ max_duration: 16.7
+
+ spec_augment:
+ freq_masks: 0
+ max_freq: 20
+ time_masks: 0
+ max_time: 75
+
+jasper:
+ encoder:
+ init: xavier_uniform
+ in_feats: *n_filt
+ frame_splicing: *frame_splicing
+ activation: relu
+ use_conv_masks: true
+ blocks:
+ - &Conv1
+ filters: 256
+ repeat: 1
+ kernel_size: [11]
+ stride: [2]
+ dilation: [1]
+ dropout: 0.2
+ residual: false
+ - &B1
+ filters: 256
+ repeat: 5
+ kernel_size: [11]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B1
+ - &B2
+ filters: 384
+ repeat: 5
+ kernel_size: [13]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B2
+ - &B3
+ filters: 512
+ repeat: 5
+ kernel_size: [17]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B3
+ - &B4
+ filters: 640
+ repeat: 5
+ kernel_size: [21]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B4
+ - &B5
+ filters: 768
+ repeat: 5
+ kernel_size: [25]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B5
+ - &Conv2
+ filters: 896
+ repeat: 1
+ kernel_size: [29]
+ stride: [1]
+ dilation: [2]
+ dropout: 0.4
+ residual: false
+ - &Conv3
+ filters: &enc_feats 1024
+ repeat: 1
+ kernel_size: [1]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.4
+ residual: false
+
+ decoder:
+ in_feats: *enc_feats
+ init: xavier_uniform
diff --git a/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline_speca.yaml b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline_speca.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..2c7e45818a5a11b04655de6b7f6f16f688e2a47f
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline_speca.yaml
@@ -0,0 +1,139 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: "Jasper"
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+ "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
+
+input_val:
+ audio_dataset: &val_dataset
+ sample_rate: &sample_rate 16000
+ trim_silence: true
+ normalize_transcripts: true
+
+ filterbank_features: &val_features
+ normalize: per_feature
+ sample_rate: *sample_rate
+ window_size: 0.02
+ window_stride: 0.01
+ window: hann
+ n_filt: &n_filt 64
+ n_fft: 512
+ frame_splicing: &frame_splicing 1
+ dither: 0.00001
+ pad_align: 16
+
+# For training we keep samples < 16.7s and apply augmentation
+input_train:
+ audio_dataset:
+ <<: *val_dataset
+ max_duration: 16.7
+ ignore_offline_speed_perturbation: false
+
+ filterbank_features:
+ <<: *val_features
+ max_duration: 16.7
+
+ spec_augment:
+ freq_masks: 2
+ max_freq: 20
+ time_masks: 2
+ max_time: 75
+
+jasper:
+ encoder:
+ init: xavier_uniform
+ in_feats: *n_filt
+ frame_splicing: *frame_splicing
+ activation: relu
+ use_conv_masks: true
+ blocks:
+ - &Conv1
+ filters: 256
+ repeat: 1
+ kernel_size: [11]
+ stride: [2]
+ dilation: [1]
+ dropout: 0.2
+ residual: false
+ - &B1
+ filters: 256
+ repeat: 5
+ kernel_size: [11]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B1
+ - &B2
+ filters: 384
+ repeat: 5
+ kernel_size: [13]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B2
+ - &B3
+ filters: 512
+ repeat: 5
+ kernel_size: [17]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B3
+ - &B4
+ filters: 640
+ repeat: 5
+ kernel_size: [21]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B4
+ - &B5
+ filters: 768
+ repeat: 5
+ kernel_size: [25]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B5
+ - &Conv2
+ filters: 896
+ repeat: 1
+ kernel_size: [29]
+ stride: [1]
+ dilation: [2]
+ dropout: 0.4
+ residual: false
+ - &Conv3
+ filters: &enc_feats 1024
+ repeat: 1
+ kernel_size: [1]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.4
+ residual: false
+
+ decoder:
+ in_feats: *enc_feats
+ init: xavier_uniform
diff --git a/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline_speca_nomask.yaml b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline_speca_nomask.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..61619428a1129172e8f6656b11672127666b653d
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-offline_speca_nomask.yaml
@@ -0,0 +1,139 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: "Jasper"
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+ "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
+
+input_val:
+ audio_dataset: &val_dataset
+ sample_rate: &sample_rate 16000
+ trim_silence: true
+ normalize_transcripts: true
+
+ filterbank_features: &val_features
+ normalize: per_feature
+ sample_rate: *sample_rate
+ window_size: 0.02
+ window_stride: 0.01
+ window: hann
+ n_filt: &n_filt 64
+ n_fft: 512
+ frame_splicing: &frame_splicing 1
+ dither: 0.00001
+ pad_align: 16
+
+# For training we keep samples < 16.7s and apply augmentation
+input_train:
+ audio_dataset:
+ <<: *val_dataset
+ max_duration: 16.7
+ ignore_offline_speed_perturbation: false
+
+ filterbank_features:
+ <<: *val_features
+ max_duration: 16.7
+
+ spec_augment:
+ freq_masks: 2
+ max_freq: 20
+ time_masks: 2
+ max_time: 75
+
+jasper:
+ encoder:
+ init: xavier_uniform
+ in_feats: *n_filt
+ frame_splicing: *frame_splicing
+ activation: relu
+ use_conv_masks: false
+ blocks:
+ - &Conv1
+ filters: 256
+ repeat: 1
+ kernel_size: [11]
+ stride: [2]
+ dilation: [1]
+ dropout: 0.2
+ residual: false
+ - &B1
+ filters: 256
+ repeat: 5
+ kernel_size: [11]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B1
+ - &B2
+ filters: 384
+ repeat: 5
+ kernel_size: [13]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B2
+ - &B3
+ filters: 512
+ repeat: 5
+ kernel_size: [17]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B3
+ - &B4
+ filters: 640
+ repeat: 5
+ kernel_size: [21]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B4
+ - &B5
+ filters: 768
+ repeat: 5
+ kernel_size: [25]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B5
+ - &Conv2
+ filters: 896
+ repeat: 1
+ kernel_size: [29]
+ stride: [1]
+ dilation: [2]
+ dropout: 0.4
+ residual: false
+ - &Conv3
+ filters: &enc_feats 1024
+ repeat: 1
+ kernel_size: [1]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.4
+ residual: false
+
+ decoder:
+ in_feats: *enc_feats
+ init: xavier_uniform
diff --git a/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online-discrete.yaml b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online-discrete.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..c0c59e196b2e939acb3897f9e21e3d3b753c32eb
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online-discrete.yaml
@@ -0,0 +1,144 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: "Jasper"
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+ "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
+
+input_val:
+ audio_dataset: &val_dataset
+ sample_rate: &sample_rate 16000
+ trim_silence: true
+ normalize_transcripts: true
+
+ filterbank_features: &val_features
+ normalize: per_feature
+ sample_rate: *sample_rate
+ window_size: 0.02
+ window_stride: 0.01
+ window: hann
+ n_filt: &n_filt 64
+ n_fft: 512
+ frame_splicing: &frame_splicing 1
+ dither: 0.00001
+ pad_align: 16
+
+# For training we keep samples < 16.7s and apply augmentation
+input_train:
+ audio_dataset:
+ <<: *val_dataset
+ max_duration: 16.7
+ ignore_offline_speed_perturbation: true
+
+ speed_perturbation:
+ discrete: true
+ min_rate: 0.9
+ max_rate: 1.1
+
+ filterbank_features:
+ <<: *val_features
+ max_duration: 16.7
+
+ spec_augment:
+ freq_masks: 0
+ max_freq: 20
+ time_masks: 0
+ max_time: 75
+
+jasper:
+ encoder:
+ init: xavier_uniform
+ in_feats: *n_filt
+ frame_splicing: *frame_splicing
+ activation: relu
+ use_conv_masks: true
+ blocks:
+ - &Conv1
+ filters: 256
+ repeat: 1
+ kernel_size: [11]
+ stride: [2]
+ dilation: [1]
+ dropout: 0.2
+ residual: false
+ - &B1
+ filters: 256
+ repeat: 5
+ kernel_size: [11]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B1
+ - &B2
+ filters: 384
+ repeat: 5
+ kernel_size: [13]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B2
+ - &B3
+ filters: 512
+ repeat: 5
+ kernel_size: [17]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B3
+ - &B4
+ filters: 640
+ repeat: 5
+ kernel_size: [21]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B4
+ - &B5
+ filters: 768
+ repeat: 5
+ kernel_size: [25]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B5
+ - &Conv2
+ filters: 896
+ repeat: 1
+ kernel_size: [29]
+ stride: [1]
+ dilation: [2]
+ dropout: 0.4
+ residual: false
+ - &Conv3
+ filters: &enc_feats 1024
+ repeat: 1
+ kernel_size: [1]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.4
+ residual: false
+
+ decoder:
+ in_feats: *enc_feats
+ init: xavier_uniform
diff --git a/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online-discrete_speca.yaml b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online-discrete_speca.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..d2491b30b7cda0f3d8a0c649a270b3b492e0a031
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online-discrete_speca.yaml
@@ -0,0 +1,144 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: "Jasper"
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+ "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
+
+input_val:
+ audio_dataset: &val_dataset
+ sample_rate: &sample_rate 16000
+ trim_silence: true
+ normalize_transcripts: true
+
+ filterbank_features: &val_features
+ normalize: per_feature
+ sample_rate: *sample_rate
+ window_size: 0.02
+ window_stride: 0.01
+ window: hann
+ n_filt: &n_filt 64
+ n_fft: 512
+ frame_splicing: &frame_splicing 1
+ dither: 0.00001
+ pad_align: 16
+
+# For training we keep samples < 16.7s and apply augmentation
+input_train:
+ audio_dataset:
+ <<: *val_dataset
+ max_duration: 16.7
+ ignore_offline_speed_perturbation: true
+
+ speed_perturbation:
+ discrete: true
+ min_rate: 0.9
+ max_rate: 1.1
+
+ filterbank_features:
+ <<: *val_features
+ max_duration: 16.7
+
+ spec_augment:
+ freq_masks: 2
+ max_freq: 20
+ time_masks: 2
+ max_time: 75
+
+jasper:
+ encoder:
+ init: xavier_uniform
+ in_feats: *n_filt
+ frame_splicing: *frame_splicing
+ activation: relu
+ use_conv_masks: true
+ blocks:
+ - &Conv1
+ filters: 256
+ repeat: 1
+ kernel_size: [11]
+ stride: [2]
+ dilation: [1]
+ dropout: 0.2
+ residual: false
+ - &B1
+ filters: 256
+ repeat: 5
+ kernel_size: [11]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B1
+ - &B2
+ filters: 384
+ repeat: 5
+ kernel_size: [13]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B2
+ - &B3
+ filters: 512
+ repeat: 5
+ kernel_size: [17]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B3
+ - &B4
+ filters: 640
+ repeat: 5
+ kernel_size: [21]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B4
+ - &B5
+ filters: 768
+ repeat: 5
+ kernel_size: [25]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B5
+ - &Conv2
+ filters: 896
+ repeat: 1
+ kernel_size: [29]
+ stride: [1]
+ dilation: [2]
+ dropout: 0.4
+ residual: false
+ - &Conv3
+ filters: &enc_feats 1024
+ repeat: 1
+ kernel_size: [1]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.4
+ residual: false
+
+ decoder:
+ in_feats: *enc_feats
+ init: xavier_uniform
diff --git a/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online_speca.yaml b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online_speca.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..a165af7f4ac147889639b33ed647e6eaa5e3e224
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/configs/jasper10x5dr_speedp-online_speca.yaml
@@ -0,0 +1,144 @@
+# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: "Jasper"
+labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
+ "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
+
+input_val:
+ audio_dataset: &val_dataset
+ sample_rate: &sample_rate 16000
+ trim_silence: true
+ normalize_transcripts: true
+
+ filterbank_features: &val_features
+ normalize: per_feature
+ sample_rate: *sample_rate
+ window_size: 0.02
+ window_stride: 0.01
+ window: hann
+ n_filt: &n_filt 64
+ n_fft: 512
+ frame_splicing: &frame_splicing 1
+ dither: 0.00001
+ pad_align: 16
+
+# For training we keep samples < 16.7s and apply augmentation
+input_train:
+ audio_dataset:
+ <<: *val_dataset
+ max_duration: 16.7
+ ignore_offline_speed_perturbation: true
+
+ speed_perturbation:
+ discrete: false
+ min_rate: 0.85
+ max_rate: 1.15
+
+ filterbank_features:
+ <<: *val_features
+ max_duration: 16.7
+
+ spec_augment:
+ freq_masks: 2
+ max_freq: 20
+ time_masks: 2
+ max_time: 75
+
+jasper:
+ encoder:
+ init: xavier_uniform
+ in_feats: *n_filt
+ frame_splicing: *frame_splicing
+ activation: relu
+ use_conv_masks: true
+ blocks:
+ - &Conv1
+ filters: 256
+ repeat: 1
+ kernel_size: [11]
+ stride: [2]
+ dilation: [1]
+ dropout: 0.2
+ residual: false
+ - &B1
+ filters: 256
+ repeat: 5
+ kernel_size: [11]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B1
+ - &B2
+ filters: 384
+ repeat: 5
+ kernel_size: [13]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B2
+ - &B3
+ filters: 512
+ repeat: 5
+ kernel_size: [17]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.2
+ residual: true
+ residual_dense: true
+ - *B3
+ - &B4
+ filters: 640
+ repeat: 5
+ kernel_size: [21]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B4
+ - &B5
+ filters: 768
+ repeat: 5
+ kernel_size: [25]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.3
+ residual: true
+ residual_dense: true
+ - *B5
+ - &Conv2
+ filters: 896
+ repeat: 1
+ kernel_size: [29]
+ stride: [1]
+ dilation: [2]
+ dropout: 0.4
+ residual: false
+ - &Conv3
+ filters: &enc_feats 1024
+ repeat: 1
+ kernel_size: [1]
+ stride: [1]
+ dilation: [1]
+ dropout: 0.4
+ residual: false
+
+ decoder:
+ in_feats: *enc_feats
+ init: xavier_uniform
diff --git a/PyTorch/contrib/audio/Jasper/inference.py b/PyTorch/contrib/audio/Jasper/inference.py
new file mode 100644
index 0000000000000000000000000000000000000000..317215e91fb35ad547761be7f5c2c5e7cac2c976
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/inference.py
@@ -0,0 +1,398 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import math
+import os
+import random
+import time
+from heapq import nlargest
+from itertools import chain, repeat
+from pathlib import Path
+from tqdm import tqdm
+
+import dllogger
+import torch
+import numpy as np
+import torch.distributed as distrib
+import torch.nn.functional as F
+from apex import amp
+from apex.parallel import DistributedDataParallel
+from dllogger import JSONStreamBackend, StdOutBackend, Verbosity
+
+from jasper import config
+from common import helpers
+from common.dali.data_loader import DaliDataLoader
+from common.dataset import (AudioDataset, FilelistDataset, get_data_loader,
+ SingleAudioDataset)
+from common.features import BaseFeatures, FilterbankFeatures
+from common.helpers import print_once, process_evaluation_epoch
+from jasper.model import GreedyCTCDecoder, Jasper
+from common.tb_dllogger import stdout_metric_format, unique_log_fpath
+
+
+def get_parser():
+ parser = argparse.ArgumentParser(description='Jasper')
+ parser.add_argument('--batch_size', default=16, type=int,
+ help='Data batch size')
+ parser.add_argument('--steps', default=0, type=int,
+ help='Eval this many steps for every worker')
+ parser.add_argument('--warmup_steps', default=0, type=int,
+ help='Burn-in period before measuring latencies')
+ parser.add_argument('--model_config', type=str, required=True,
+ help='Relative model config path given dataset folder')
+ parser.add_argument('--dataset_dir', type=str,
+ help='Absolute path to dataset folder')
+ parser.add_argument('--val_manifests', type=str, nargs='+',
+ help='Relative path to evaluation dataset manifest files')
+ parser.add_argument('--ckpt', default=None, type=str,
+ help='Path to model checkpoint')
+ parser.add_argument('--pad_leading', type=int, default=16,
+ help='Pads every batch with leading zeros '
+ 'to counteract conv shifts of the field of view')
+ parser.add_argument('--amp', '--fp16', action='store_true',
+ help='Use FP16 precision')
+ parser.add_argument('--cudnn_benchmark', action='store_true',
+ help='Enable cudnn benchmark')
+ parser.add_argument('--cpu', action='store_true',
+ help='Run inference on CPU')
+ parser.add_argument("--seed", default=None, type=int, help='Random seed')
+ parser.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0),
+ type=int, help='GPU id used for distributed training')
+
+ io = parser.add_argument_group('feature and checkpointing setup')
+ io.add_argument('--dali_device', type=str, choices=['none', 'cpu', 'gpu'],
+ default='gpu', help='Use DALI pipeline for fast data processing')
+ io.add_argument('--save_predictions', type=str, default=None,
+ help='Save predictions in text form at this location')
+ io.add_argument('--save_logits', default=None, type=str,
+ help='Save output logits under specified path')
+ io.add_argument('--transcribe_wav', type=str,
+                    help='Path to a single .wav file (16 kHz)')
+ io.add_argument('--transcribe_filelist', type=str,
+ help='Path to a filelist with one .wav path per line')
+ io.add_argument('-o', '--output_dir', default='results/',
+ help='Output folder to save audio (file per phrase)')
+ io.add_argument('--log_file', type=str, default=None,
+ help='Path to a DLLogger log file')
+ io.add_argument('--ema', action='store_true',
+ help='Load averaged model weights')
+ io.add_argument('--torchscript', action='store_true',
+ help='Evaluate with a TorchScripted model')
+ io.add_argument('--torchscript_export', action='store_true',
+ help='Export the model with torch.jit to the output_dir')
+ io.add_argument('--override_config', type=str, action='append',
+ help='Overrides a value from a config .yaml.'
+ ' Syntax: `--override_config nested.config.key=val`.')
+ return parser
+
+
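+# Convert per-batch durations (seconds) into latency percentiles in milliseconds.
+# The first few measurements are dropped as warm-up, and the 0.5 entry holds the
+# mean latency rather than the median.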
+def durs_to_percentiles(durations, ratios):
+ durations = np.asarray(durations) * 1000 # in ms
+ latency = durations
+
+ latency = latency[5:]
+ mean_latency = np.mean(latency)
+
+ latency_worst = nlargest(math.ceil((1 - min(ratios)) * len(latency)), latency)
+ latency_ranges = get_percentile(ratios, latency_worst, len(latency))
+ latency_ranges[0.5] = mean_latency
+ return latency_ranges
+
+
+def get_percentile(ratios, arr, nsamples):
+ res = {}
+ for a in ratios:
+ idx = max(int(nsamples * (1 - a)), 0)
+ res[a] = arr[idx]
+ return res
+
+
+def torchscript_export(data_loader, audio_processor, model, greedy_decoder,
+ output_dir, use_amp, use_conv_masks, model_config, device,
+ save):
+
+ audio_processor.to(device)
+
+ for batch in data_loader:
+ batch = [t.to(device, non_blocking=True) for t in batch]
+ audio, audio_len, _, _ = batch
+ feats, feat_lens = audio_processor(audio, audio_len)
+ break
+
+ print("\nExporting featurizer...")
+ print("\nNOTE: Dithering causes warnings about non-determinism.\n")
+ ts_feat = torch.jit.trace(audio_processor, (audio, audio_len))
+
+ print("\nExporting acoustic model...")
+ model(feats, feat_lens)
+ ts_acoustic = torch.jit.trace(model, (feats, feat_lens))
+
+ print("\nExporting decoder...")
+ log_probs = model(feats, feat_lens)
+ ts_decoder = torch.jit.script(greedy_decoder, log_probs)
+ print("\nJIT export complete.")
+
+ if save:
+ precision = "fp16" if use_amp else "fp32"
+ module_name = f'{os.path.basename(model_config)}_{precision}'
+ ts_feat.save(os.path.join(output_dir, module_name + "_feat.pt"))
+ ts_acoustic.save(os.path.join(output_dir, module_name + "_acoustic.pt"))
+ ts_decoder.save(os.path.join(output_dir, module_name + "_decoder.pt"))
+
+ return ts_feat, ts_acoustic, ts_decoder
+
+
+def main():
+
+ parser = get_parser()
+ args = parser.parse_args()
+
+ log_fpath = args.log_file or str(Path(args.output_dir, 'nvlog_infer.json'))
+ log_fpath = unique_log_fpath(log_fpath)
+ dllogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT, log_fpath),
+ StdOutBackend(Verbosity.VERBOSE,
+ metric_format=stdout_metric_format)])
+
+ [dllogger.log("PARAMETER", {k: v}) for k, v in vars(args).items()]
+
+ for step in ['DNN', 'data+DNN', 'data']:
+ for c in [0.99, 0.95, 0.9, 0.5]:
+ cs = 'avg' if c == 0.5 else f'{int(100*c)}%'
+ dllogger.metadata(f'{step.lower()}_latency_{c}',
+ {'name': f'{step} latency {cs}',
+ 'format': ':>7.2f', 'unit': 'ms'})
+ dllogger.metadata(
+ 'eval_wer', {'name': 'WER', 'format': ':>3.2f', 'unit': '%'})
+
+ if args.cpu:
+ device = torch.device('cpu')
+ else:
+ assert torch.cuda.is_available()
+ device = torch.device('cuda')
+ torch.backends.cudnn.benchmark = args.cudnn_benchmark
+
+ if args.seed is not None:
+ torch.manual_seed(args.seed + args.local_rank)
+ np.random.seed(args.seed + args.local_rank)
+ random.seed(args.seed + args.local_rank)
+
+ # set up distributed training
+ multi_gpu = not args.cpu and int(os.environ.get('WORLD_SIZE', 1)) > 1
+ if multi_gpu:
+ torch.cuda.set_device(args.local_rank)
+ distrib.init_process_group(backend='nccl', init_method='env://')
+ print_once(f'Inference with {distrib.get_world_size()} GPUs')
+
+ cfg = config.load(args.model_config)
+ config.apply_config_overrides(cfg, args)
+
+ symbols = helpers.add_ctc_blank(cfg['labels'])
+
+ use_dali = args.dali_device in ('cpu', 'gpu')
+ dataset_kw, features_kw = config.input(cfg, 'val')
+
+ measure_perf = args.steps > 0
+
+ # dataset
+ if args.transcribe_wav or args.transcribe_filelist:
+
+ if use_dali:
+ print("DALI supported only with input .json files; disabling")
+ use_dali = False
+
+ assert not (args.transcribe_wav and args.transcribe_filelist)
+
+ if args.transcribe_wav:
+ dataset = SingleAudioDataset(args.transcribe_wav)
+ else:
+ dataset = FilelistDataset(args.transcribe_filelist)
+
+ data_loader = get_data_loader(dataset,
+ batch_size=1,
+ multi_gpu=multi_gpu,
+ shuffle=False,
+ num_workers=0,
+ drop_last=(True if measure_perf else False))
+
+ _, features_kw = config.input(cfg, 'val')
+ assert not features_kw['pad_to_max_duration']
+ feat_proc = FilterbankFeatures(**features_kw)
+
+ elif use_dali:
+        # pad_to_max_duration is not supported by DALI - use simple padders instead
+ if features_kw['pad_to_max_duration']:
+ feat_proc = BaseFeatures(
+ pad_align=features_kw['pad_align'],
+ pad_to_max_duration=True,
+ max_duration=features_kw['max_duration'],
+ sample_rate=features_kw['sample_rate'],
+ window_size=features_kw['window_size'],
+ window_stride=features_kw['window_stride'])
+ features_kw['pad_to_max_duration'] = False
+ else:
+ feat_proc = None
+
+ data_loader = DaliDataLoader(
+ gpu_id=args.local_rank or 0,
+ dataset_path=args.dataset_dir,
+ config_data=dataset_kw,
+ config_features=features_kw,
+ json_names=args.val_manifests,
+ batch_size=args.batch_size,
+ pipeline_type=("train" if measure_perf else "val"), # no drop_last
+ device_type=args.dali_device,
+ symbols=symbols)
+
+ else:
+ dataset = AudioDataset(args.dataset_dir,
+ args.val_manifests,
+ symbols,
+ **dataset_kw)
+
+ data_loader = get_data_loader(dataset,
+ args.batch_size,
+ multi_gpu=multi_gpu,
+ shuffle=False,
+ num_workers=4,
+ drop_last=False)
+
+ feat_proc = FilterbankFeatures(**features_kw)
+
+ model = Jasper(encoder_kw=config.encoder(cfg),
+ decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
+
+ if args.ckpt is not None:
+ print(f'Loading the model from {args.ckpt} ...')
+ checkpoint = torch.load(args.ckpt, map_location="cpu")
+ key = 'ema_state_dict' if args.ema else 'state_dict'
+ state_dict = helpers.convert_v1_state_dict(checkpoint[key])
+ model.load_state_dict(state_dict, strict=True)
+
+ model.to(device)
+ model.eval()
+
+ if feat_proc is not None:
+ feat_proc.to(device)
+ feat_proc.eval()
+
+ if args.amp:
+ model = model.half()
+
+ if args.torchscript:
+ greedy_decoder = GreedyCTCDecoder()
+
+ feat_proc, model, greedy_decoder = torchscript_export(
+ data_loader, feat_proc, model, greedy_decoder, args.output_dir,
+            use_amp=args.amp, use_conv_masks=True, model_config=args.model_config,
+ device=device, save=args.torchscript_export)
+
+ if multi_gpu:
+ model = DistributedDataParallel(model)
+
+ agg = {'txts': [], 'preds': [], 'logits': []}
+ dur = {'data': [], 'dnn': [], 'data+dnn': []}
+
+ looped_loader = chain.from_iterable(repeat(data_loader))
+ greedy_decoder = GreedyCTCDecoder()
+
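+    # Synchronize the device before reading timers so that asynchronously
+    # launched kernels are included in the measured latencies.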
+ sync = lambda: torch.cuda.synchronize() if device.type == 'cuda' else None
+
+ steps = args.steps + args.warmup_steps or len(data_loader)
+ with torch.no_grad():
+
+ for it, batch in enumerate(tqdm(looped_loader, initial=1, total=steps)):
+
+ if use_dali:
+ feats, feat_lens, txt, txt_lens = batch
+ if feat_proc is not None:
+ feats, feat_lens = feat_proc(feats, feat_lens)
+ else:
+ batch = [t.to(device, non_blocking=True) for t in batch]
+ audio, audio_lens, txt, txt_lens = batch
+ feats, feat_lens = feat_proc(audio, audio_lens)
+
+ sync()
+ t1 = time.perf_counter()
+
+ if args.amp:
+ feats = feats.half()
+
+ feats = F.pad(feats, (args.pad_leading, 0))
+ feat_lens += args.pad_leading
+
+ if model.encoder.use_conv_masks:
+ log_probs, log_prob_lens = model(feats, feat_lens)
+ else:
+ log_probs = model(feats, feat_lens)
+
+ preds = greedy_decoder(log_probs)
+
+ sync()
+ t2 = time.perf_counter()
+
+ # burn-in period; wait for a new loader due to num_workers
+ if it >= 1 and (args.steps == 0 or it >= args.warmup_steps):
+ dur['data'].append(t1 - t0)
+ dur['dnn'].append(t2 - t1)
+ dur['data+dnn'].append(t2 - t0)
+
+ if txt is not None:
+ agg['txts'] += helpers.gather_transcripts([txt], [txt_lens],
+ symbols)
+ agg['preds'] += helpers.gather_predictions([preds], symbols)
+ agg['logits'].append(log_probs)
+
+ if it + 1 == steps:
+ break
+
+ sync()
+ t0 = time.perf_counter()
+
+ # communicate the results
+ if args.transcribe_wav:
+ for idx, p in enumerate(agg['preds']):
+ print_once(f'Prediction {idx+1: >3}: {p}')
+
+ elif args.transcribe_filelist:
+ pass
+
+ elif not multi_gpu or distrib.get_rank() == 0:
+ wer, _ = process_evaluation_epoch(agg)
+
+ dllogger.log(step=(), data={'eval_wer': 100 * wer})
+
+ if args.save_predictions:
+ with open(args.save_predictions, 'w') as f:
+ f.write('\n'.join(agg['preds']))
+
+ if args.save_logits:
+ logits = torch.cat(agg['logits'], dim=0).cpu()
+ torch.save(logits, args.save_logits)
+
+ # report timings
+ if len(dur['data']) >= 20:
+ ratios = [0.9, 0.95, 0.99]
+ for stage in dur:
+ lat = durs_to_percentiles(dur[stage], ratios)
+ for k in [0.99, 0.95, 0.9, 0.5]:
+ kk = str(k).replace('.', '_')
+ dllogger.log(step=(), data={f'{stage.lower()}_latency_{kk}': lat[k]})
+
+ else:
+ print_once('Not enough samples to measure latencies.')
+
+
+if __name__ == "__main__":
+ main()
diff --git a/PyTorch/contrib/audio/Jasper/jasper/config.py b/PyTorch/contrib/audio/Jasper/jasper/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..60f7a258eb73ae656b18d05deefbf000313162a2
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/jasper/config.py
@@ -0,0 +1,125 @@
+import copy
+import inspect
+import typing
+from ast import literal_eval
+from contextlib import suppress
+from numbers import Number
+
+import yaml
+
+from .model import JasperDecoderForCTC, JasperBlock, JasperEncoder
+from common.audio import GainPerturbation, ShiftPerturbation, SpeedPerturbation
+from common.dataset import AudioDataset
+from common.features import CutoutAugment, FilterbankFeatures, SpecAugment
+from common.helpers import print_once
+
+
+def default_args(klass):
+ sig = inspect.signature(klass.__init__)
+ return {k: v.default for k,v in sig.parameters.items() if k != 'self'}
+
+
+def load(fpath):
+ if fpath.endswith('.toml'):
+ raise ValueError('.toml config format has been changed to .yaml')
+
+ cfg = yaml.safe_load(open(fpath, 'r'))
+
+ # Reload to deep copy shallow copies, which were made with yaml anchors
+ yaml.Dumper.ignore_aliases = lambda *args: True
+ cfg = yaml.dump(cfg)
+ cfg = yaml.safe_load(cfg)
+ return cfg
+
+
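+# Overlay a user-supplied config dict on top of the constructor defaults of
+# `klass`, rejecting unknown keys and checking that every mandatory argument
+# ends up with a value.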
+def validate_and_fill(klass, user_conf, ignore_unk=[], optional=[]):
+ conf = default_args(klass)
+
+ for k,v in user_conf.items():
+ assert k in conf or k in ignore_unk, f'Unknown parameter {k} for {klass}'
+ conf[k] = v
+
+ # Keep only mandatory or optional-nonempty
+ conf = {k:v for k,v in conf.items()
+ if k not in optional or v is not inspect.Parameter.empty}
+
+ # Validate
+ for k,v in conf.items():
+ assert v is not inspect.Parameter.empty, \
+ f'Value for {k} not specified for {klass}'
+ return conf
+
+
+def input(conf_yaml, split='train'):
+ conf = copy.deepcopy(conf_yaml[f'input_{split}'])
+ conf_dataset = conf.pop('audio_dataset')
+ conf_features = conf.pop('filterbank_features')
+
+ # Validate known inner classes
+ inner_classes = [
+ (conf_dataset, 'speed_perturbation', SpeedPerturbation),
+ (conf_dataset, 'gain_perturbation', GainPerturbation),
+ (conf_dataset, 'shift_perturbation', ShiftPerturbation),
+ (conf_features, 'spec_augment', SpecAugment),
+ (conf_features, 'cutout_augment', CutoutAugment),
+ ]
+ for conf_tgt, key, klass in inner_classes:
+ if key in conf_tgt:
+ conf_tgt[key] = validate_and_fill(klass, conf_tgt[key])
+
+ for k in conf:
+ raise ValueError(f'Unknown key {k}')
+
+ # Validate outer classes
+ conf_dataset = validate_and_fill(
+ AudioDataset, conf_dataset,
+ optional=['data_dir', 'labels', 'manifest_fpaths'])
+
+ conf_features = validate_and_fill(
+ FilterbankFeatures, conf_features)
+
+ # Check params shared between classes
+ shared = ['sample_rate', 'max_duration', 'pad_to_max_duration']
+ for sh in shared:
+ assert conf_dataset[sh] == conf_features[sh], (
+ f'{sh} should match in Dataset and FeatureProcessor: '
+ f'{conf_dataset[sh]}, {conf_features[sh]}')
+
+ return conf_dataset, conf_features
+
+
+def encoder(conf):
+ """Validate config for JasperEncoder and subsequent JasperBlocks"""
+
+ # Validate, but don't overwrite with defaults
+ for blk in conf['jasper']['encoder']['blocks']:
+ validate_and_fill(JasperBlock, blk, optional=['infilters'],
+ ignore_unk=['residual_dense'])
+
+ return validate_and_fill(JasperEncoder, conf['jasper']['encoder'])
+
+
+def decoder(conf, n_classes):
+ decoder_kw = {'n_classes': n_classes, **conf['jasper']['decoder']}
+ return validate_and_fill(JasperDecoderForCTC, decoder_kw)
+
+
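+# Apply `--override_config nested.config.key=val` overrides on top of the loaded
+# YAML, e.g. `--override_config input_train.audio_dataset.max_duration=20`
+# (the key path here is only illustrative); values are parsed with literal_eval
+# where possible.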
+def apply_config_overrides(conf, args):
+ if args.override_config is None:
+ return
+ for override_key_val in args.override_config:
+ key, val = override_key_val.split('=')
+ with suppress(TypeError, ValueError):
+ val = literal_eval(val)
+ apply_nested_config_override(conf, key, val)
+
+
+def apply_nested_config_override(conf, key_str, val):
+ fields = key_str.split('.')
+ for f in fields[:-1]:
+ conf = conf[f]
+ f = fields[-1]
+ assert (f not in conf
+ or type(val) is type(conf[f])
+ or (isinstance(val, Number) and isinstance(conf[f], Number)))
+ conf[f] = val
diff --git a/PyTorch/contrib/audio/Jasper/jasper/model.py b/PyTorch/contrib/audio/Jasper/jasper/model.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd38ce4b774af403e85a2b4caf455bd10bde59f5
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/jasper/model.py
@@ -0,0 +1,275 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+activations = {
+ "hardtanh": nn.Hardtanh,
+ "relu": nn.ReLU,
+ "selu": nn.SELU,
+}
+
+
+def init_weights(m, mode='xavier_uniform'):
+ if type(m) == nn.Conv1d or type(m) == MaskedConv1d:
+ if mode == 'xavier_uniform':
+ nn.init.xavier_uniform_(m.weight, gain=1.0)
+ elif mode == 'xavier_normal':
+ nn.init.xavier_normal_(m.weight, gain=1.0)
+ elif mode == 'kaiming_uniform':
+ nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
+ elif mode == 'kaiming_normal':
+ nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
+ else:
+ raise ValueError("Unknown Initialization mode: {0}".format(mode))
+
+ elif type(m) == nn.BatchNorm1d:
+ if m.track_running_stats:
+ m.running_mean.zero_()
+ m.running_var.fill_(1)
+ m.num_batches_tracked.zero_()
+ if m.affine:
+ nn.init.ones_(m.weight)
+ nn.init.zeros_(m.bias)
+
+
+def get_same_padding(kernel_size, stride, dilation):
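+    # 'same' padding for odd kernel sizes: with stride 1, (kernel_size // 2) * dilation
+    # keeps the output length equal to the input length.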
+ if stride > 1 and dilation > 1:
+ raise ValueError("Only stride OR dilation may be greater than 1")
+ return (kernel_size // 2) * dilation
+
+
+class MaskedConv1d(nn.Conv1d):
+ """1D convolution with sequence masking
+ """
+ __constants__ = ["masked"]
+ def __init__(self, in_channels, out_channels, kernel_size, stride=1,
+ padding=0, dilation=1, groups=1, bias=False, masked=True):
+ super(MaskedConv1d, self).__init__(
+ in_channels, out_channels, kernel_size, stride=stride,
+ padding=padding, dilation=dilation, groups=groups, bias=bias)
+
+ self.masked = masked
+
+ def get_seq_len(self, lens):
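+        # Standard Conv1d output-length formula, applied to the per-sample sequence lengths.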
+ return ((lens + 2 * self.padding[0] - self.dilation[0]
+ * (self.kernel_size[0] - 1) - 1) // self.stride[0] + 1)
+
+ def forward(self, x, x_lens=None):
+ if self.masked:
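+            # Zero out time steps beyond each sequence's true length before convolving,
+            # then recompute the lengths for this layer's stride/dilation.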
+ max_len = x.size(2)
+ idxs = torch.arange(max_len, dtype=x_lens.dtype, device=x_lens.device)
+ mask = idxs.expand(x_lens.size(0), max_len) >= x_lens.unsqueeze(1)
+ x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
+ x_lens = self.get_seq_len(x_lens)
+
+ return super(MaskedConv1d, self).forward(x), x_lens
+
+
+class JasperBlock(nn.Module):
+ __constants__ = ["use_conv_masks"]
+
+ """Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
+ """
+ def __init__(self, infilters, filters, repeat=3, kernel_size=11, stride=1,
+ dilation=1, padding='same', dropout=0.2, activation=None,
+ residual=True, residual_panes=[], use_conv_masks=False):
+ super(JasperBlock, self).__init__()
+
+ assert padding == "same", "Only 'same' padding is supported."
+
+ padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
+ self.use_conv_masks = use_conv_masks
+ self.conv = nn.ModuleList()
+ for i in range(repeat):
+ self.conv.extend(self._conv_bn(infilters if i == 0 else filters,
+ filters,
+ kernel_size=kernel_size,
+ stride=stride,
+ dilation=dilation,
+ padding=padding_val))
+ if i < repeat - 1:
+ self.conv.extend(self._act_dropout(dropout, activation))
+
+ self.res = nn.ModuleList() if residual else None
+ res_panes = residual_panes.copy()
+ self.dense_residual = residual
+
+ if residual:
+ if len(residual_panes) == 0:
+ res_panes = [infilters]
+ self.dense_residual = False
+
+ for ip in res_panes:
+ self.res.append(nn.ModuleList(
+ self._conv_bn(ip, filters, kernel_size=1)))
+
+ self.out = nn.Sequential(*self._act_dropout(dropout, activation))
+
+ def _conv_bn(self, in_channels, out_channels, **kw):
+ return [MaskedConv1d(in_channels, out_channels,
+ masked=self.use_conv_masks, **kw),
+ nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)]
+
+ def _act_dropout(self, dropout=0.2, activation=None):
+ return [activation or nn.Hardtanh(min_val=0.0, max_val=20.0),
+ nn.Dropout(p=dropout)]
+
+ def forward(self, xs, xs_lens=None):
+ if not self.use_conv_masks:
+ xs_lens = 0
+
+ # forward convolutions
+ out = xs[-1]
+ lens = xs_lens
+ for i, l in enumerate(self.conv):
+ if isinstance(l, MaskedConv1d):
+ out, lens = l(out, lens)
+ else:
+ out = l(out)
+
+ # residuals
+ if self.res is not None:
+ for i, layer in enumerate(self.res):
+ res_out = xs[i]
+ for j, res_layer in enumerate(layer):
+ if j == 0: # and self.use_conv_mask:
+ res_out, _ = res_layer(res_out, xs_lens)
+ else:
+ res_out = res_layer(res_out)
+ out += res_out
+
+ # output
+ out = self.out(out)
+ if self.res is not None and self.dense_residual:
+ out = xs + [out]
+ else:
+ out = [out]
+
+ if self.use_conv_masks:
+ return out, lens
+ else:
+ return out, None
+
+
+class JasperEncoder(nn.Module):
+ __constants__ = ["use_conv_masks"]
+
+ def __init__(self, in_feats, activation, frame_splicing=1,
+ init='xavier_uniform', use_conv_masks=False, blocks=[]):
+ super(JasperEncoder, self).__init__()
+
+ self.use_conv_masks = use_conv_masks
+ self.layers = nn.ModuleList()
+
+ in_feats *= frame_splicing
+ all_residual_panes = []
+ for i,blk in enumerate(blocks):
+
+ blk['activation'] = activations[activation]()
+
+ has_residual_dense = blk.pop('residual_dense', False)
+ if has_residual_dense:
+ all_residual_panes += [in_feats]
+ blk['residual_panes'] = all_residual_panes
+ else:
+ blk['residual_panes'] = []
+
+ self.layers.append(
+ JasperBlock(in_feats, use_conv_masks=use_conv_masks, **blk))
+
+ in_feats = blk['filters']
+
+ self.apply(lambda x: init_weights(x, mode=init))
+
+ def forward(self, x, x_lens=None):
+ out, out_lens = [x], x_lens
+ for l in self.layers:
+ out, out_lens = l(out, out_lens)
+
+ return out, out_lens
+
+
+class JasperDecoderForCTC(nn.Module):
+ def __init__(self, in_feats, n_classes, init='xavier_uniform'):
+ super(JasperDecoderForCTC, self).__init__()
+
+ self.layers = nn.Sequential(
+ nn.Conv1d(in_feats, n_classes, kernel_size=1, bias=True),)
+ self.apply(lambda x: init_weights(x, mode=init))
+
+ def forward(self, enc_out):
+ out = self.layers(enc_out[-1]).transpose(1, 2)
+ return F.log_softmax(out, dim=2)
+
+
+class GreedyCTCDecoder(nn.Module):
+ @torch.no_grad()
+ def forward(self, log_probs, log_prob_lens=None):
+
+ if log_prob_lens is not None:
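+            # For frames past each sequence's length, force the blank symbol (the last
+            # class) to win the argmax by setting its log-probability to +Inf.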
+ max_len = log_probs.size(1)
+ idxs = torch.arange(max_len, dtype=log_prob_lens.dtype,
+ device=log_prob_lens.device)
+ mask = idxs.unsqueeze(0) >= log_prob_lens.unsqueeze(1)
+ log_probs[:,:,-1] = log_probs[:,:,-1].masked_fill(mask, float("Inf"))
+
+ return log_probs.argmax(dim=-1, keepdim=False).int()
+
+
+class Jasper(nn.Module):
+ def __init__(self, encoder_kw, decoder_kw, transpose_in=False):
+ super(Jasper, self).__init__()
+ self.transpose_in = transpose_in
+ self.encoder = JasperEncoder(**encoder_kw)
+ self.decoder = JasperDecoderForCTC(**decoder_kw)
+
+ def forward(self, x, x_lens=None):
+ if self.encoder.use_conv_masks:
+ assert x_lens is not None
+ enc, enc_lens = self.encoder(x, x_lens)
+ out = self.decoder(enc)
+ return out, enc_lens
+ else:
+ if self.transpose_in:
+ x = x.transpose(1, 2)
+ enc, _ = self.encoder(x)
+ out = self.decoder(enc)
+ return out # torchscript refuses to output None
+
+ # TODO Explicitly add x_lens=None for inference (now x can be a Tensor or tuple)
+ def infer(self, x, x_lens=None):
+ if self.encoder.use_conv_masks:
+ return self.forward(x, x_lens)
+ else:
+ ret = self.forward(x)
+ return ret, len(ret)
+
+
+class CTCLossNM:
+ def __init__(self, n_classes):
+ self._criterion = nn.CTCLoss(blank=n_classes-1, reduction='none')
+
+ def __call__(self, log_probs, targets, input_length, target_length):
+ input_length = input_length.long()
+ target_length = target_length.long()
+ targets = targets.long()
+ loss = self._criterion(log_probs.transpose(1, 0), targets, input_length,
+ target_length)
+ # note that this is different from reduction = 'mean'
+ # because we are not dividing by target lengths
+ return torch.mean(loss)
diff --git a/PyTorch/contrib/audio/Jasper/notebooks/README.md b/PyTorch/contrib/audio/Jasper/notebooks/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..072ed740fafdd11f0b51c349bffd0546db72028e
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/notebooks/README.md
@@ -0,0 +1,203 @@
+# Jasper notebooks
+
+This folder provides different notebooks to run Jasper inference step by step.
+
+## Table Of Contents
+
+- [Jasper Jupyter Notebook for TensorRT](#jasper-jupyter-notebook-for-tensorrt)
+ * [Requirements](#requirements)
+ * [Quick Start Guide](#quick-start-guide)
+- [Jasper Colab Notebook for TensorRT](#jasper-colab-notebook-for-tensorrt)
+ * [Requirements](#requirements)
+ * [Quick Start Guide](#quick-start-guide)
+- [Jasper Jupyter Notebook for TensorRT Inference Server](#jasper-jupyter-notebook-for-tensorrt-inference-server)
+ * [Requirements](#requirements)
+ * [Quick Start Guide](#quick-start-guide)
+
+## Jasper Jupyter Notebook for TensorRT
+### Requirements
+
+`./trt/` contains a Dockerfile which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+
+* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) or [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+* [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA cuda repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
+* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
+
+### Quick Start Guide
+
+Running the following scripts will build and launch the container containing all required dependencies for both TensorRT and native PyTorch. This is required for inference with TensorRT and can also be used for data download, processing and training of the model.
+
+#### 1. Clone the repository.
+
+```
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
+```
+
+#### 2. Build the Jasper PyTorch with TRT 6 container:
+
+```
+bash trt/scripts/docker/build.sh
+```
+
+#### 3. Create directories
+Prepare to start a detached session in the NGC container.
+Create three directories on your local machine for the dataset, checkpoint, and results, named "data", "checkpoint", and "result" respectively:
+
+```
+mkdir data checkpoint result
+```
+
+#### 4. Download the checkpoint
+Download the checkpoint file jasperpyt_fp16 from NGC Model Repository:
+- https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16
+
+to the directory: _checkpoint_
+
+The Jasper PyTorch container will be launched in the Jupyter notebook. Within the container, the contents of the root repository will be copied to the /workspace/jasper directory.
+
+Inside the container, the /datasets, /checkpoints, and /results directories are mounted as volumes and mapped to the corresponding host directories "data", "checkpoint", and "result".
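+
+A minimal sketch of an equivalent manual launch is shown below; it is illustrative only, and the image name, GPU flag, and host paths are assumptions rather than the repository's actual launch command:
+
+```
+# Illustrative sketch -- image name and host paths are placeholders
+docker run -it --rm --gpus all \
+  -v "$PWD/data":/datasets \
+  -v "$PWD/checkpoint":/checkpoints \
+  -v "$PWD/result":/results \
+  <jasper-trt-image> bash
+```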
+
+#### 5. Run the notebook
+
+For running the notebook on your local machine, run:
+
+```
+jupyter notebook -- notebooks/JasperTRT.ipynb
+```
+
+For running the notebook on another machine remotely, run:
+
+```
+jupyter notebook --ip=0.0.0.0 --allow-root
+```
+
+And navigate a web browser to the IP address or hostname of the host machine at port 8888: `http://[host machine]:8888`
+
+Use the token listed in the output from running the jupyter command to log in, for example: `http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b`
+
+
+
+## Jasper Colab Notebook for TensorRT
+### Requirements
+
+`./trt/` contains a Dockerfile which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+
+* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) or [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+* [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA cuda repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
+* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
+
+### Quick Start Guide
+
+Running the following scripts will build and launch the container containing all required dependencies for both TensorRT and native PyTorch. This is required for inference with TensorRT and can also be used for data download, processing and training of the model.
+
+#### 1. Clone the repository.
+
+```
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
+```
+
+#### 2. Build the Jasper PyTorch with TRT 6 container:
+
+```
+bash trt/scripts/docker/build.sh
+```
+
+#### 3. Create directories
+Prepare to start a detached session in the NGC container.
+Create three directories on your local machine for the dataset, checkpoint, and results, named "data", "checkpoint", and "result" respectively:
+
+```
+mkdir data checkpoint result
+```
+
+#### 4. Download the checkpoint
+Download the checkpoint file jasperpyt_fp16 from NGC Model Repository:
+- https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16
+
+to the directory: _checkpoint_
+
+The Jasper PyTorch container will be launched in the Jupyter notebook. Within the container, the contents of the root repository will be copied to the /workspace/jasper directory.
+
+Inside the container, the /datasets, /checkpoints, and /results directories are mounted as volumes and mapped to the corresponding host directories "data", "checkpoint", and "result".
+
+#### 5. Run the notebook
+
+For running the notebook on your local machine, run:
+
+```
+jupyter notebook -- notebooks/Colab_Jasper_TRT_inference_demo.ipynb
+```
+
+For running the notebook on another machine remotely, run:
+
+```
+jupyter notebook --ip=0.0.0.0 --allow-root
+```
+
+And navigate a web browser to the IP address or hostname of the host machine at port 8888: `http://[host machine]:8888`
+
+Use the token listed in the output from running the jupyter command to log in, for example: `http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b`
+
+
+
+## Jasper Jupyter Notebook for TensorRT Inference Server
+### Requirements
+
+`./trtis/` contains a Dockerfile which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+
+* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) or [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+* [TensorRT Inference Server 19.09 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorrtserver)
+* [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA cuda repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
+* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
+
+### Quick Start Guide
+
+
+#### 1. Clone the repository.
+
+```
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
+```
+
+#### 2. Build a container that extends NGC PyTorch 19.09, TensorRT, TensorRT Inference Server, and TensorRT Inference Client.
+
+```
+bash trtis/scripts/docker/build.sh
+```
+
+#### 3. Download the checkpoint
+Download the checkpoint file jasper_fp16.pt from NGC Model Repository:
+- https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16
+
+to a user-specified directory _CHECKPOINT_DIR_
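+
+For example, assuming a placeholder location (any writable directory works):
+
+```
+mkdir -p /path/to/checkpoint_dir    # placeholder for CHECKPOINT_DIR
+```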
+
+#### 4. Run the notebook
+
+For running the notebook on your local machine, run:
+
+```
+jupyter notebook -- notebooks/JasperTRTIS.ipynb
+```
+
+For running the notebook on another machine remotely, run:
+
+```
+jupyter notebook --ip=0.0.0.0 --allow-root
+```
+
+And navigate a web browser to the IP address or hostname of the host machine at port 8888: `http://[host machine]:8888`
+
+Use the token listed in the output from running the jupyter command to log in, for example: `http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b`
diff --git a/PyTorch/contrib/audio/Jasper/npu_fused_adamw.py b/PyTorch/contrib/audio/Jasper/npu_fused_adamw.py
new file mode 100644
index 0000000000000000000000000000000000000000..a5bfb52bc6a7c4fa67d6a5e9efee124b169001c1
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/npu_fused_adamw.py
@@ -0,0 +1,257 @@
+# Copyright (c) 2020, Huawei Technologies.
+# Copyright (c) 2019, Facebook CORPORATION.
+# All rights reserved.
+#
+# Licensed under the BSD 3-Clause License (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://opensource.org/licenses/BSD-3-Clause
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import math
+from collections import defaultdict
+
+import torch
+from torch.optim.optimizer import Optimizer
+
+from apex.contrib.combine_tensors import combine_npu
+
+
+class NpuFusedAdamW(Optimizer):
+ """Implements AdamW algorithm.
+
+ Currently NPU-only. Requires Apex to be installed via
+ ``pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--npu_float_status" ./``.
+
+    This version of NPU fused AdamW implements one fusion:
+
+ * A combine-tensor apply launch that batches the elementwise updates applied to all the model's parameters
+ into one or a few kernel launches.
+
+ :class:`apex.optimizers.NpuFusedAdamW` may be used as a drop-in replacement for ``torch.optim.AdamW``::
+
+ opt = apex.optimizers.NpuFusedAdamW(model.parameters(), lr = ....)
+ ...
+ opt.step()
+
+    :class:`apex.optimizers.NpuFusedAdamW` should be used with Amp. Currently, if you wish to use :class:`NpuFusedAdamW`
+    with Amp, only ``opt_level`` ``O1`` and ``O2`` can be chosen::
+
+ opt = apex.optimizers.NpuFusedAdamW(model.parameters(), lr = ....)
+ model, opt = amp.initialize(model, opt, opt_level="O2")
+ ...
+ opt.step()
+
+
+ The original Adam algorithm was proposed in `Adam: A Method for Stochastic Optimization`_.
+ The AdamW variant was proposed in `Decoupled Weight Decay Regularization`_.
+
+ Arguments:
+ params (iterable): iterable of parameters to optimize or dicts defining
+ parameter groups
+ lr (float, optional, default: 1e-3): learning rate
+ betas (Tuple[float, float], optional, default: (0.9, 0.999)): coefficients used
+ for computing running averages of gradient and its square
+ eps (float, optional, default: 1e-8): term added to the denominator to improve
+ numerical stability
+ weight_decay (float, optional, default: 1e-2): weight decay coefficient
+ amsgrad (boolean, optional, default: False): whether to use the AMSGrad variant of
+ this algorithm from the paper `On the Convergence of Adam and Beyond`_
+
+ .. _Adam\: A Method for Stochastic Optimization:
+ https://arxiv.org/abs/1412.6980
+ .. _Decoupled Weight Decay Regularization:
+ https://arxiv.org/abs/1711.05101
+ .. _On the Convergence of Adam and Beyond:
+ https://openreview.net/forum?id=ryQu7f-RZ
+ """
+
+ def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
+ weight_decay=1e-2, amsgrad=False):
+ if lr < 0.0:
+ raise ValueError("Invalid learning rate: {}".format(lr))
+ if eps < 0.0:
+ raise ValueError("Invalid epsilon value: {}".format(eps))
+ if betas[0] < 0.0 or betas[0] >= 1.0:
+ raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
+ if betas[1] < 0.0 or betas[1] >= 1.0:
+ raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
+ if weight_decay < 0.0:
+ raise ValueError("Invalid weight_decay value: {}".format(weight_decay))
+ defaults = dict(lr=lr, betas=betas, eps=eps,
+ weight_decay=weight_decay, amsgrad=amsgrad)
+ self.is_npu_fused_optimizer = True
+ super(NpuFusedAdamW, self).__init__(params, defaults)
+
+ def __setstate__(self, state):
+ super(NpuFusedAdamW, self).__setstate__(state)
+ for group in self.param_groups:
+ group.setdefault('amsgrad', False)
+
+ def _init_param_state(self, p, amsgrad):
+ state = self.state[p]
+ # State initialization
+ if len(state) == 0:
+ state['step'] = 0
+ # Exponential moving average of gradient values
+ state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
+ # Exponential moving average of squared gradient values
+ state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
+ if amsgrad:
+ # Maintains max of all exp. moving avg. of sq. grad. values
+ state['max_exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
+ else:
+ exp_avg_tmp = torch.zeros_like(p, memory_format=torch.preserve_format)
+ exp_avg_tmp.copy_(state['exp_avg'])
+ state['exp_avg'] = exp_avg_tmp
+
+ exp_avg_sq_tmp = torch.zeros_like(p, memory_format=torch.preserve_format)
+ exp_avg_sq_tmp.copy_(state['exp_avg_sq'])
+ state['exp_avg_sq'] = exp_avg_sq_tmp
+
+ if amsgrad:
+ max_exp_avg_sq_tmp = torch.zeros_like(p, memory_format=torch.preserve_format)
+ max_exp_avg_sq_tmp.copy_(state['max_exp_avg_sq'])
+ state['max_exp_avg_sq'] = max_exp_avg_sq_tmp
+
+ def _combine_group_param_states(self, group_index):
+ group = self.param_groups[group_index]
+ stash = self._amp_stash
+ group_params_list = stash.params_lists_indexed_by_group[group_index]
+
+ amsgrad = group['amsgrad']
+
+ combined_param_states = []
+ for params in group_params_list:
+ step_list = []
+ exp_avg_list = []
+ exp_avg_sq_list = []
+ max_exp_avg_sq_list = []
+
+ for p in params:
+ if p.grad is None:
+ continue
+ grad = p.grad
+ if grad.is_sparse:
+ raise RuntimeError('NpuFusedAdamW does not support sparse gradients, '
+ 'please consider SparseAdam instead')
+
+ self._init_param_state(p, amsgrad)
+ state = self.state[p]
+ step_list.append(state['step'])
+ exp_avg_list.append(state['exp_avg'])
+ exp_avg_sq_list.append(state['exp_avg_sq'])
+ if amsgrad:
+ max_exp_avg_sq_list.append(state['max_exp_avg_sq'])
+
+ combined_step = 0
+ combined_exp_avg = None
+ combined_exp_avg_sq = None
+ combined_max_exp_avg_sq = None
+
+ if len(exp_avg_list) > 0:
+ combined_step = step_list[0]
+ combined_exp_avg = combine_npu(exp_avg_list)
+ combined_exp_avg_sq = combine_npu(exp_avg_sq_list)
+ combined_max_exp_avg_sq = combine_npu(max_exp_avg_sq_list)
+
+ combined_state = defaultdict(dict)
+ combined_state['step'] = combined_step
+ combined_state['exp_avg'] = combined_exp_avg
+ combined_state['exp_avg_sq'] = combined_exp_avg_sq
+ combined_state['max_exp_avg_sq'] = combined_max_exp_avg_sq
+ combined_param_states.append(combined_state)
+ stash.combined_param_states_indexed_by_group[group_index] = combined_param_states
+
+ def _combine_param_states_by_group(self):
+ stash = self._amp_stash
+ if stash.param_states_are_combined_by_group:
+ return
+
+ stash.combined_param_states_indexed_by_group = []
+ for _ in self.param_groups:
+ stash.combined_param_states_indexed_by_group.append([])
+
+ for i, _ in enumerate(self.param_groups):
+ self._combine_group_param_states(i)
+ stash.param_states_are_combined_by_group = True
+
+ def _group_step(self, group_index):
+ group = self.param_groups[group_index]
+ for p in group['params']:
+ if p.grad is None:
+ continue
+
+ grad = p.grad
+ if grad.is_sparse:
+ raise RuntimeError('NpuFusedAdamW does not support sparse gradients, '
+ 'please consider SparseAdam instead')
+ state_p = self.state[p]
+ state_p['step'] += 1
+
+ amsgrad = group['amsgrad']
+ beta1, beta2 = group['betas']
+
+ stash = self._amp_stash
+ combined_group_params = stash.combined_params_indexed_by_group[group_index]
+ combined_group_grads = stash.combined_grads_indexed_by_group[group_index]
+ combined_group_param_states = stash.combined_param_states_indexed_by_group[group_index]
+
+ for combined_param, combined_grad, combined_param_state in zip(combined_group_params,
+ combined_group_grads,
+ combined_group_param_states):
+ if combined_param is None or combined_grad is None:
+ continue
+
+ # Perform stepweight decay. The fused method is used here to speed up the calculation
+ combined_param.mul_(1 - group['lr'] * group['weight_decay'])
+
+ exp_avg, exp_avg_sq = combined_param_state['exp_avg'], combined_param_state['exp_avg_sq']
+ if amsgrad:
+ max_exp_avg_sq = combined_param_state['max_exp_avg_sq']
+
+ combined_param_state['step'] += 1
+ bias_correction1 = 1 - beta1 ** combined_param_state['step']
+ bias_correction2 = 1 - beta2 ** combined_param_state['step']
+
+ # Decay the first and second moment running average coefficient
+ exp_avg.mul_(beta1).add_(combined_grad, alpha=1 - beta1)
+ exp_avg_sq.mul_(beta2).addcmul_(combined_grad, combined_grad, value=1 - beta2)
+ if amsgrad:
+ # Maintains the maximum of all 2nd moment running avg. till now
+ torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
+ # Use the max. for normalizing running avg. of gradient
+ denom = (max_exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
+ else:
+ denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
+
+ step_size = group['lr'] / bias_correction1
+
+ combined_param.addcdiv_(exp_avg, denom, value=-step_size)
+
+ @torch.no_grad()
+ def step(self, closure=None):
+ if not hasattr(self, "_amp_stash"):
+ raise RuntimeError('apex.optimizers.NpuFusedAdamW should be used with AMP.')
+
+ self._check_already_combined_params_and_grads()
+ # combine params and grads first
+ self._combine_params_and_grads_by_group()
+ # then combine param states
+ self._combine_param_states_by_group()
+
+ loss = None
+ if closure is not None:
+ with torch.enable_grad():
+ loss = closure()
+
+ for i, _ in enumerate(self.param_groups):
+ self._group_step(i)
+
+ return loss
\ No newline at end of file
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX1-16GB_Jasper_AMP_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX1-16GB_Jasper_AMP_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..57bcd4c5ecf52c203cabb510a9da39b439b24c21
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX1-16GB_Jasper_AMP_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=4 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX1-16GB_Jasper_FP32_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX1-16GB_Jasper_FP32_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..0317b41a61845c5e349d19ab9f8fb9416fc6bef1
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX1-16GB_Jasper_FP32_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=4 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX1-32GB_Jasper_AMP_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX1-32GB_Jasper_AMP_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..8953566d4afe807e606c4915d12648fa02c7b198
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX1-32GB_Jasper_AMP_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX1-32GB_Jasper_FP32_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX1-32GB_Jasper_FP32_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..eed6a1273fba38397c5733e987153faf2e02683f
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX1-32GB_Jasper_FP32_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=2 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_AMP_16GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_AMP_16GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..a0ccca4a1d4997b60743c283a3528b35a5476031
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_AMP_16GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=16 AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_AMP_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_AMP_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..8953566d4afe807e606c4915d12648fa02c7b198
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_AMP_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_FP32_16GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_FP32_16GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..873fb92f17223584c8bd02fdc0c68b6b5c056550
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_FP32_16GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=16 BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_FP32_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_FP32_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..4ac61b677db6f436cc39c73f8d145304255942cf
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGX2_Jasper_FP32_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=2 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGXA100_Jasper_AMP_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGXA100_Jasper_AMP_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..8953566d4afe807e606c4915d12648fa02c7b198
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGXA100_Jasper_AMP_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/platform/DGXA100_Jasper_TF32_8GPU.sh b/PyTorch/contrib/audio/Jasper/platform/DGXA100_Jasper_TF32_8GPU.sh
new file mode 100644
index 0000000000000000000000000000000000000000..eed6a1273fba38397c5733e987153faf2e02683f
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/platform/DGXA100_Jasper_TF32_8GPU.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+NUM_GPUS=8 BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=2 bash scripts/train.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/requirements.txt b/PyTorch/contrib/audio/Jasper/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..92eb868b1c810ed63c1d567ba5dcf6e5c66cfdc4
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/requirements.txt
@@ -0,0 +1,13 @@
+ascii-graph==1.5.1
+inflect==5.3.0
+ipdb
+librosa==0.8.0
+pandas==1.1.4
+pycuda==2020.1
+pyyaml>=5.4
+soundfile
+sox==1.4.1
+tqdm==4.53.0
+unidecode==1.2.0
+wrapt==1.10.11
+git+https://github.com/NVIDIA/dllogger.git@26a0f8f1958de2c0c460925ff6102a4d2486d6cc#egg=dllogger
\ No newline at end of file
diff --git a/PyTorch/contrib/audio/Jasper/scripts/docker/build.sh b/PyTorch/contrib/audio/Jasper/scripts/docker/build.sh
new file mode 100644
index 0000000000000000000000000000000000000000..cfdc97c010ec55f3ffd1228f027fc9e9432b785a
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/docker/build.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+docker build . --rm -t jasper
\ No newline at end of file
diff --git a/PyTorch/contrib/audio/Jasper/scripts/docker/launch.sh b/PyTorch/contrib/audio/Jasper/scripts/docker/launch.sh
new file mode 100644
index 0000000000000000000000000000000000000000..bc719ffadcd23168b6fd2e9621e4a9d5b37dfbb2
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/docker/launch.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+SCRIPT_DIR=$(cd $(dirname $0); pwd)
+: ${JASPER_REPO:="$SCRIPT_DIR/../.."}
+
+: ${DATA_DIR:=${1:-"$JASPER_REPO/datasets"}}
+: ${CHECKPOINT_DIR:=${2:-"$JASPER_REPO/checkpoints"}}
+: ${OUTPUT_DIR:=${3:-"$JASPER_REPO/results"}}
+: ${SCRIPT:=${4:-}}
+
+mkdir -p $DATA_DIR
+mkdir -p $CHECKPOINT_DIR
+mkdir -p $OUTPUT_DIR
+
+MOUNTS=""
+MOUNTS+=" -v $DATA_DIR:/dataset"
+MOUNTS+=" -v $CHECKPOINT_DIR:/checkpoints"
+MOUNTS+=" -v $OUTPUT_DIR:/results"
+MOUNTS+=" -v $JASPER_REPO:/workspace/jasper"
+MOUNTS+=" -v /usr/local/Ascend:/usr/local/Ascend"
+
+echo $MOUNTS
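+# NOTE: the --device flags below expose a single Ascend NPU (/dev/davinci5) together with
+# the common driver/management devices; adjust the davinci index to match your target NPU.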
+docker run -it --rm --device /dev/davinci5 --device /dev/davinci_manager --device /dev/hisi_hdc --device /dev/devmm_svm \
+ --env PYTHONDONTWRITEBYTECODE=1 \
+ --shm-size=4g \
+ --ulimit memlock=-1 \
+ --ulimit stack=67108864 \
+ $MOUNTS \
+ $EXTRA_JASPER_ENV \
+ -w /workspace/jasper \
+ jasper:latest bash $SCRIPT
diff --git a/PyTorch/contrib/audio/Jasper/scripts/download_librispeech.sh b/PyTorch/contrib/audio/Jasper/scripts/download_librispeech.sh
new file mode 100644
index 0000000000000000000000000000000000000000..b96744eb94ae4327fbf27fcfa91133504592b067
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/download_librispeech.sh
@@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+DATA_SET="LibriSpeech"
+DATA_ROOT_DIR="/home/cyl_dataset"
+DATA_DIR="${DATA_ROOT_DIR}/${DATA_SET}"
+
+if [ ! -d "$DATA_DIR" ]
+then
+ mkdir --mode 755 $DATA_DIR
+
+ python utils/download_librispeech.py \
+ utils/librispeech.csv \
+ $DATA_DIR \
+ -e ${DATA_ROOT_DIR}/
+else
+ echo "Directory $DATA_DIR already exists."
+fi
diff --git a/PyTorch/contrib/audio/Jasper/scripts/evaluation.sh b/PyTorch/contrib/audio/Jasper/scripts/evaluation.sh
new file mode 100644
index 0000000000000000000000000000000000000000..08009e514e13bc49d62c136a885c9c58d6177e7d
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/evaluation.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -a
+
+: ${PREDICTION_FILE:=}
+: ${DATASET:="test-other"}
+
+bash ./scripts/inference.sh "$@"
diff --git a/PyTorch/contrib/audio/Jasper/scripts/inference.sh b/PyTorch/contrib/audio/Jasper/scripts/inference.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f60c3d587536427ac165c03aca177f670e5a0b70
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/inference.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
+: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${CHECKPOINT:=${4:-"/checkpoints/jasper_fp16.pt"}}
+: ${DATASET:="test-other"}
+: ${LOG_FILE:=""}
+: ${CUDNN_BENCHMARK:=false}
+: ${MAX_DURATION:=""}
+: ${PAD_TO_MAX_DURATION:=false}
+: ${PAD_LEADING:=16}
+: ${NUM_GPUS:=1}
+: ${NUM_STEPS:=0}
+: ${NUM_WARMUP_STEPS:=0}
+: ${AMP:=false}
+: ${BATCH_SIZE:=64}
+: ${EMA:=true}
+: ${SEED:=0}
+: ${DALI_DEVICE:="gpu"}
+: ${CPU:=false}
+: ${LOGITS_FILE:=}
+: ${PREDICTION_FILE:="${OUTPUT_DIR}/${DATASET}.predictions"}
+
+mkdir -p "$OUTPUT_DIR"
+
+ARGS="--dataset_dir=$DATA_DIR"
+ARGS+=" --val_manifest=$DATA_DIR/librispeech-${DATASET}-wav.json"
+ARGS+=" --model_config=$MODEL_CONFIG"
+ARGS+=" --output_dir=$OUTPUT_DIR"
+ARGS+=" --batch_size=$BATCH_SIZE"
+ARGS+=" --seed=$SEED"
+ARGS+=" --dali_device=$DALI_DEVICE"
+ARGS+=" --steps $NUM_STEPS"
+ARGS+=" --warmup_steps $NUM_WARMUP_STEPS"
+ARGS+=" --pad_leading $PAD_LEADING"
+
+[ "$AMP" = true ] && ARGS+=" --amp"
+[ "$EMA" = true ] && ARGS+=" --ema"
+[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
+[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=${CHECKPOINT}"
+[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
+[ -n "$PREDICTION_FILE" ] && ARGS+=" --save_prediction $PREDICTION_FILE"
+[ -n "$LOGITS_FILE" ] && ARGS+=" --logits_save_to $LOGITS_FILE"
+[ "$CPU" == "true" ] && ARGS+=" --cpu"
+[ -n "$MAX_DURATION" ] && ARGS+=" --override_config input_val.audio_dataset.max_duration=$MAX_DURATION" \
+ ARGS+=" --override_config input_val.filterbank_features.max_duration=$MAX_DURATION"
+[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --override_config input_val.audio_dataset.pad_to_max_duration=True" \
+ ARGS+=" --override_config input_val.filterbank_features.pad_to_max_duration=True"
+
+python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS inference.py $ARGS
diff --git a/PyTorch/contrib/audio/Jasper/scripts/inference_benchmark.sh b/PyTorch/contrib/audio/Jasper/scripts/inference_benchmark.sh
new file mode 100644
index 0000000000000000000000000000000000000000..66dfe5ab724912ac2b58ce2135a4e4d4d7063300
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/inference_benchmark.sh
@@ -0,0 +1,38 @@
+#!/bin/bash
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -a
+
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${CUDNN_BENCHMARK:=true}
+: ${PAD_TO_MAX_DURATION:=true}
+: ${PAD_LEADING:=0}
+: ${NUM_WARMUP_STEPS:=10}
+: ${NUM_STEPS:=500}
+
+: ${AMP:=false}
+: ${DALI_DEVICE:="cpu"}
+: ${BATCH_SIZE_SEQ:="1 2 4 8 16"}
+: ${MAX_DURATION_SEQ:="2 7 16.7"}
+
+for MAX_DURATION in $MAX_DURATION_SEQ; do
+ for BATCH_SIZE in $BATCH_SIZE_SEQ; do
+
+ LOG_FILE="$OUTPUT_DIR/perf-infer_dali-${DALI_DEVICE}_amp-${AMP}_dur${MAX_DURATION}_bs${BATCH_SIZE}.json"
+ bash ./scripts/inference.sh "$@"
+
+ done
+done
diff --git a/PyTorch/contrib/audio/Jasper/scripts/preprocess_librispeech.sh b/PyTorch/contrib/audio/Jasper/scripts/preprocess_librispeech.sh
new file mode 100644
index 0000000000000000000000000000000000000000..3ae5b5768ffe03825bb512b8cda8b30f411fd9fe
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/preprocess_librispeech.sh
@@ -0,0 +1,54 @@
+#!/usr/bin/env bash
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+SPEEDS=$1
+[ -n "$SPEEDS" ] && SPEED_FLAG="--speed $SPEEDS"
+
+python ./utils/convert_librispeech.py \
+ --input_dir /home/dataset/LibriSpeech/train-clean-100 \
+ --dest_dir /home/dataset/LibriSpeech/train-clean-100-wav \
+ --output_json /home/dataset/LibriSpeech/librispeech-train-clean-100-wav.json \
+ $SPEED_FLAG
+python ./utils/convert_librispeech.py \
+ --input_dir /home/dataset/LibriSpeech/train-clean-360 \
+ --dest_dir /home/dataset/LibriSpeech/train-clean-360-wav \
+ --output_json /home/dataset/LibriSpeech/librispeech-train-clean-360-wav.json \
+ $SPEED_FLAG
+python ./utils/convert_librispeech.py \
+ --input_dir /home/dataset/LibriSpeech/train-other-500 \
+ --dest_dir /home/dataset/LibriSpeech/train-other-500-wav \
+ --output_json /home/dataset/LibriSpeech/librispeech-train-other-500-wav.json \
+ $SPEED_FLAG
+
+
+python ./utils/convert_librispeech.py \
+ --input_dir /home/dataset/LibriSpeech/dev-clean \
+ --dest_dir /home/dataset/LibriSpeech/dev-clean-wav \
+ --output_json /home/dataset/LibriSpeech/librispeech-dev-clean-wav.json
+python ./utils/convert_librispeech.py \
+ --input_dir /home/dataset/LibriSpeech/dev-other \
+ --dest_dir /home/dataset/LibriSpeech/dev-other-wav \
+ --output_json /home/dataset/LibriSpeech/librispeech-dev-other-wav.json
+
+
+python ./utils/convert_librispeech.py \
+ --input_dir /home/dataset/LibriSpeech/test-clean \
+ --dest_dir /home/dataset/LibriSpeech/test-clean-wav \
+ --output_json /home/dataset/LibriSpeech/librispeech-test-clean-wav.json
+python ./utils/convert_librispeech.py \
+ --input_dir /home/dataset/LibriSpeech/test-other \
+ --dest_dir /home/dataset/LibriSpeech/test-other-wav \
+ --output_json /home/dataset/LibriSpeech/librispeech-test-other-wav.json
diff --git a/PyTorch/contrib/audio/Jasper/scripts/train.sh b/PyTorch/contrib/audio/Jasper/scripts/train.sh
new file mode 100644
index 0000000000000000000000000000000000000000..51ad11e924c605f74e6c49c7dad84daee7795e99
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/train.sh
@@ -0,0 +1,90 @@
+#!/bin/bash
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+export OMP_NUM_THREADS=1
+
+: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
+: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${CHECKPOINT:=${4:-}}
+: ${RESUME:=true}
+: ${CUDNN_BENCHMARK:=true}
+: ${NUM_GPUS:=8}
+: ${AMP:=true}
+: ${BATCH_SIZE:=32}
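+# The effective global batch size is NUM_GPUS x BATCH_SIZE x GRAD_ACCUMULATION_STEPS.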
+: ${GRAD_ACCUMULATION_STEPS:=2}
+: ${LEARNING_RATE:=0.01}
+: ${MIN_LEARNING_RATE:=0.00001}
+: ${LR_POLICY:=exponential}
+: ${LR_EXP_GAMMA:=0.981}
+: ${EMA:=0.999}
+: ${SEED:=0}
+: ${EPOCHS:=440}
+: ${WARMUP_EPOCHS:=2}
+: ${HOLD_EPOCHS:=140}
+: ${SAVE_FREQUENCY:=10}
+: ${EPOCHS_THIS_JOB:=0}
+: ${DALI_DEVICE:="cpu"}
+: ${PAD_TO_MAX_DURATION:=false}
+: ${EVAL_FREQUENCY:=544}
+: ${PREDICTION_FREQUENCY:=544}
+: ${TRAIN_MANIFESTS:="$DATA_DIR/librispeech-train-clean-100-wav.json \
+ $DATA_DIR/librispeech-train-clean-360-wav.json \
+ $DATA_DIR/librispeech-train-other-500-wav.json"}
+: ${VAL_MANIFESTS:="$DATA_DIR/librispeech-dev-clean-wav.json"}
+
+mkdir -p "$OUTPUT_DIR"
+
+ARGS="--dataset_dir=$DATA_DIR"
+ARGS+=" --val_manifests $VAL_MANIFESTS"
+ARGS+=" --train_manifests $TRAIN_MANIFESTS"
+ARGS+=" --model_config=$MODEL_CONFIG"
+ARGS+=" --output_dir=$OUTPUT_DIR"
+ARGS+=" --lr=$LEARNING_RATE"
+ARGS+=" --batch_size=$BATCH_SIZE"
+ARGS+=" --min_lr=$MIN_LEARNING_RATE"
+ARGS+=" --lr_policy=$LR_POLICY"
+ARGS+=" --lr_exp_gamma=$LR_EXP_GAMMA"
+ARGS+=" --epochs=$EPOCHS"
+ARGS+=" --warmup_epochs=$WARMUP_EPOCHS"
+ARGS+=" --hold_epochs=$HOLD_EPOCHS"
+ARGS+=" --epochs_this_job=$EPOCHS_THIS_JOB"
+ARGS+=" --ema=$EMA"
+ARGS+=" --seed=$SEED"
+ARGS+=" --optimizer=novograd"
+ARGS+=" --weight_decay=1e-3"
+ARGS+=" --save_frequency=$SAVE_FREQUENCY"
+ARGS+=" --keep_milestones 100 200 300 400"
+ARGS+=" --save_best_from=380"
+ARGS+=" --log_frequency=1"
+ARGS+=" --eval_frequency=$EVAL_FREQUENCY"
+ARGS+=" --prediction_frequency=$PREDICTION_FREQUENCY"
+ARGS+=" --grad_accumulation_steps=$GRAD_ACCUMULATION_STEPS "
+ARGS+=" --dali_device=$DALI_DEVICE"
+
+[ "$AMP" = true ] && ARGS+=" --amp"
+[ "$RESUME" = true ] && ARGS+=" --resume"
+[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
+[ -n "$MAX_DURATION" ] && ARGS+=" --override_config input_train.audio_dataset.max_duration=$MAX_DURATION" \
+ ARGS+=" --override_config input_train.filterbank_features.max_duration=$MAX_DURATION"
+[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --override_config input_train.audio_dataset.pad_to_max_duration=True" \
+ ARGS+=" --override_config input_train.filterbank_features.pad_to_max_duration=True"
+[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=$CHECKPOINT"
+[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
+[ -n "$PRE_ALLOCATE" ] && ARGS+=" --pre_allocate_range $PRE_ALLOCATE"
+
+DISTRIBUTED="-m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
+python $DISTRIBUTED train.py $ARGS
diff --git a/PyTorch/contrib/audio/Jasper/scripts/train_benchmark.sh b/PyTorch/contrib/audio/Jasper/scripts/train_benchmark.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f70760fe588d86b730764e3bfa2c8418b9870cdb
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/scripts/train_benchmark.sh
@@ -0,0 +1,49 @@
+#!/bin/bash
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -a
+
+# Measure on slightly speed-perturbed data, so that the filterbank length remains the same;
+# with pad_to_max_duration, this reduces cuDNN benchmark's burn-in period to a single step
+: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${TRAIN_MANIFESTS:="$DATA_DIR/librispeech-train-clean-100-wav.json"}
+
+# run for a number of epochs, but don't finalize the training
+: ${EPOCHS_THIS_JOB:=2}
+: ${EPOCHS:=100000}
+: ${RESUME:=false}
+: ${SAVE_FREQUENCY:=100000}
+: ${EVAL_FREQUENCY:=100000}
+: ${GRAD_ACCUMULATION_STEPS:=1}
+
+: ${AMP:=false}
+: ${EMA:=0}
+: ${DALI_DEVICE:="gpu"}
+: ${NUM_GPUS_SEQ:="1 4 8"}
+: ${BATCH_SIZE_SEQ:="32"}
+# A probable range of batch lengths for LibriSpeech
+# with BS=64 and continuous speed perturbation (0.85, 1.15)
+: ${PRE_ALLOCATE:="1408 1920"}
+
+for NUM_GPUS in $NUM_GPUS_SEQ; do
+ for BATCH_SIZE in $BATCH_SIZE_SEQ; do
+
+ LOG_FILE="$OUTPUT_DIR/perf-train_dali-${DALI_DEVICE}_amp-${AMP}_ngpus${NUM_GPUS}_bs${BATCH_SIZE}.json"
+ bash ./scripts/train.sh "$@"
+
+ done
+done
diff --git a/PyTorch/contrib/audio/Jasper/test/docker/build.sh b/PyTorch/contrib/audio/Jasper/test/docker/build.sh
new file mode 100644
index 0000000000000000000000000000000000000000..cfdc97c010ec55f3ffd1228f027fc9e9432b785a
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/test/docker/build.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+
+docker build . --rm -t jasper
\ No newline at end of file
diff --git a/PyTorch/contrib/audio/Jasper/test/docker/launch.sh b/PyTorch/contrib/audio/Jasper/test/docker/launch.sh
new file mode 100644
index 0000000000000000000000000000000000000000..bc719ffadcd23168b6fd2e9621e4a9d5b37dfbb2
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/test/docker/launch.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+SCRIPT_DIR=$(cd $(dirname $0); pwd)
+: ${JASPER_REPO:="$SCRIPT_DIR/../.."}
+
+: ${DATA_DIR:=${1:-"$JASPER_REPO/datasets"}}
+: ${CHECKPOINT_DIR:=${2:-"$JASPER_REPO/checkpoints"}}
+: ${OUTPUT_DIR:=${3:-"$JASPER_REPO/results"}}
+: ${SCRIPT:=${4:-}}
+
+mkdir -p $DATA_DIR
+mkdir -p $CHECKPOINT_DIR
+mkdir -p $OUTPUT_DIR
+
+MOUNTS=""
+MOUNTS+=" -v $DATA_DIR:/dataset"
+MOUNTS+=" -v $CHECKPOINT_DIR:/checkpoints"
+MOUNTS+=" -v $OUTPUT_DIR:/results"
+MOUNTS+=" -v $JASPER_REPO:/workspace/jasper"
+MOUNTS+=" -v /usr/local/Ascend:/usr/local/Ascend"
+
+echo $MOUNTS
+docker run -it --rm --device /dev/davinci5 --device /dev/davinci_manager --device /dev/hisi_hdc --device /dev/devmm_svm \
+ --env PYTHONDONTWRITEBYTECODE=1 \
+ --shm-size=4g \
+ --ulimit memlock=-1 \
+ --ulimit stack=67108864 \
+ $MOUNTS \
+ $EXTRA_JASPER_ENV \
+ -w /workspace/jasper \
+ jasper:latest bash $SCRIPT
diff --git a/PyTorch/contrib/audio/Jasper/test/env_npu.sh b/PyTorch/contrib/audio/Jasper/test/env_npu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..baf2d935433fe9ae244516846faecf2a7a27488a
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/test/env_npu.sh
@@ -0,0 +1,80 @@
+#!/bin/bash
+export install_path=/usr/local/Ascend
+
+if [ -d ${install_path}/toolkit ]; then
+ export LD_LIBRARY_PATH=${install_path}/fwkacllib/lib64/:/usr/include/hdf5/lib/:/usr/local/:/usr/local/lib/:/usr/lib/:${install_path}/driver/lib64/common/:${install_path}/driver/lib64/driver/:${install_path}/add-ons:${path_lib}:${LD_LIBRARY_PATH}
+ export PATH=${install_path}/fwkacllib/ccec_compiler/bin:${install_path}/fwkacllib/bin:$PATH
+ export PYTHONPATH=${install_path}/fwkacllib/python/site-packages:${install_path}/tfplugin/python/site-packages:${install_path}/toolkit/python/site-packages:$PYTHONPATH
+ export PYTHONPATH=/usr/local/python3.7.5/lib/python3.7/site-packages:$PYTHONPATH
+ export ASCEND_OPP_PATH=${install_path}/opp
+else
+ if [ -d ${install_path}/nnae/latest ];then
+ export LD_LIBRARY_PATH=${install_path}/nnae/latest/fwkacllib/lib64/:/usr/local/:/usr/local/python3.7.5/lib/:/usr/local/openblas/lib:/usr/local/lib/:/usr/lib64/:/usr/lib/:${install_path}/driver/lib64/common/:${install_path}/driver/lib64/driver/:${install_path}/add-ons/:/usr/lib/aarch64_64-linux-gnu:$LD_LIBRARY_PATH
+ export PATH=$PATH:${install_path}/nnae/latest/fwkacllib/ccec_compiler/bin/:${install_path}/nnae/latest/toolkit/tools/ide_daemon/bin/
+ export ASCEND_OPP_PATH=${install_path}/nnae/latest/opp/
+ export OPTION_EXEC_EXTERN_PLUGIN_PATH=${install_path}/nnae/latest/fwkacllib/lib64/plugin/opskernel/libfe.so:${install_path}/nnae/latest/fwkacllib/lib64/plugin/opskernel/libaicpu_engine.so:${install_path}/nnae/latest/fwkacllib/lib64/plugin/opskernel/libge_local_engine.so
+ export PYTHONPATH=${install_path}/nnae/latest/fwkacllib/python/site-packages/:${install_path}/nnae/latest/fwkacllib/python/site-packages/auto_tune.egg/auto_tune:${install_path}/nnae/latest/fwkacllib/python/site-packages/schedule_search.egg:$PYTHONPATH
+ export ASCEND_AICPU_PATH=${install_path}/nnae/latest
+ else
+ export LD_LIBRARY_PATH=${install_path}/ascend-toolkit/latest/fwkacllib/lib64/:/usr/local/:/usr/local/lib/:/usr/lib64/:/usr/lib/:/usr/local/python3.7.5/lib/:/usr/local/openblas/lib:${install_path}/driver/lib64/common/:${install_path}/driver/lib64/driver/:${install_path}/add-ons/:/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH
+ export PATH=$PATH:${install_path}/ascend-toolkit/latest/fwkacllib/ccec_compiler/bin/:${install_path}/ascend-toolkit/latest/toolkit/tools/ide_daemon/bin/
+ export ASCEND_OPP_PATH=${install_path}/ascend-toolkit/latest/opp/
+ export OPTION_EXEC_EXTERN_PLUGIN_PATH=${install_path}/ascend-toolkit/latest/fwkacllib/lib64/plugin/opskernel/libfe.so:${install_path}/ascend-toolkit/latest/fwkacllib/lib64/plugin/opskernel/libaicpu_engine.so:${install_path}/ascend-toolkit/latest/fwkacllib/lib64/plugin/opskernel/libge_local_engine.so
+ export PYTHONPATH=${install_path}/ascend-toolkit/latest/fwkacllib/python/site-packages/:${install_path}/ascend-toolkit/latest/fwkacllib/python/site-packages/auto_tune.egg/auto_tune:${install_path}/ascend-toolkit/latest/fwkacllib/python/site-packages/schedule_search.egg:$PYTHONPATH
+ export ASCEND_AICPU_PATH=${install_path}/ascend-toolkit/latest
+ fi
+fi
+
+${install_path}/driver/tools/msnpureport -g error -d 0
+${install_path}/driver/tools/msnpureport -g error -d 1
+${install_path}/driver/tools/msnpureport -g error -d 2
+${install_path}/driver/tools/msnpureport -g error -d 3
+${install_path}/driver/tools/msnpureport -g error -d 4
+${install_path}/driver/tools/msnpureport -g error -d 5
+${install_path}/driver/tools/msnpureport -g error -d 6
+${install_path}/driver/tools/msnpureport -g error -d 7
+
+# Print host-side logs to stdout, 0-disable/1-enable
+export ASCEND_SLOG_PRINT_TO_STDOUT=0
+# Default global log level, 0-debug/1-info/2-warning/3-error
+export ASCEND_GLOBAL_LOG_LEVEL=3
+# Enable Event logging, 0-disable/1-enable
+export ASCEND_GLOBAL_EVENT_ENABLE=0
+# Enable the task queue, 0-disable/1-enable
+export TASK_QUEUE_ENABLE=1
+# Enable PTCopy, 0-disable/1-enable
+export PTCOPY_ENABLE=1
+# Enable the 2-tensor combined flag, 0-disable/1-enable
+export COMBINED_ENABLE=1
+# Enable the 3-tensor combined flag, 0-disable/1-enable
+export TRI_COMBINED_ENABLE=1
+# Ops that need to be recompiled in special cases; no need to modify
+export DYNAMIC_OP="ADD#MUL"
+# HCCL whitelist switch, 1-disable/0-enable
+export HCCL_WHITELIST_DISABLE=1
+# The default HCCL connect timeout of 120s is short; set it to 1800s to match the PyTorch default
+export HCCL_CONNECT_TIMEOUT=1800
+
+ulimit -SHn 512000
+
+path_lib=$(python3.7 -c """
+import sys
+import re
+result=''
+for index in range(len(sys.path)):
+ match_sit = re.search('-packages', sys.path[index])
+ if match_sit is not None:
+ match_lib = re.search('lib', sys.path[index])
+
+ if match_lib is not None:
+ end=match_lib.span()[1]
+ result += sys.path[index][0:end] + ':'
+
+ result+=sys.path[index] + '/torch/lib:'
+print(result)"""
+)
+
+echo ${path_lib}
+
+export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib/:${path_lib}:$LD_LIBRARY_PATH
+
diff --git a/PyTorch/contrib/audio/Jasper/test/train_full_1p.sh b/PyTorch/contrib/audio/Jasper/test/train_full_1p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..42933ac655f428c2daf694505be89a5c5debb564
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/test/train_full_1p.sh
@@ -0,0 +1,177 @@
+#!/bin/bash
+
+################ Basic configuration parameters; review and adjust for the model ##################
+# Required fields (parameters that must be defined here): Network batch_size RANK_SIZE
+# Network name, same as the directory name
+Network="Jasper"
+# Training batch size
+batch_size=32
+# Number of NPUs used for training
+export RANK_SIZE=1
+
+# Parameter validation: data_path is required; other parameters may be added or removed as the model demands. Any new parameter must be defined and assigned above.
+for para in $*
+do
+ if [[ $para == --data_path* ]];then
+ data_path=`echo ${para#*=}`
+ fi
+done
+
+# Check that data_path was passed in; no need to modify
+if [[ $data_path == "" ]];then
+    echo "[Error] para \"data_path\" must be configured"
+    exit 1
+fi
+
+############### Set the execution path of the training script ###############
+# cd to the directory at the same level as the test folder before running, for better compatibility; test_path_dir is the path containing the test folder
+cur_path=`pwd`
+cur_path_last_dirname=${cur_path##*/}
+if [ x"${cur_path_last_dirname}" == x"test" ];then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=`pwd`
+else
+    test_path_dir=${cur_path}/test
+fi
+echo ${cur_path}
+
+#################创建日志输出目录,不需要修改#################
+ASCEND_DEVICE_ID=0
+if [ -d ${test_path_dir}/output/${ASCEND_DEVICE_ID} ];then
+ rm -rf ${test_path_dir}/output/${ASCEND_DEVICE_ID}
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+else
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+fi
+
+# Environment variables
+export DETECTRON2_DATASETS=${data_path}
+export PYTHONPATH=./:$PYTHONPATH
+export OMP_NUM_THREADS=1
+
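+# Default hyperparameters; the `: ${VAR:=default}` idiom keeps any value already exported in the environment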
+: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${CHECKPOINT:=${4:-}}
+: ${RESUME:=true}
+: ${CUDNN_BENCHMARK:=true}
+: ${NUM_GPUS:=1}
+: ${AMP:=True}
+: ${GRAD_ACCUMULATION_STEPS:=2}
+: ${LEARNING_RATE:=0.01}
+: ${MIN_LEARNING_RATE:=0.00001}
+: ${LR_POLICY:=exponential}
+: ${LR_EXP_GAMMA:=0.981}
+: ${EMA:=0.999}
+: ${SEED:=0}
+: ${EPOCHS:=33}
+: ${WARMUP_EPOCHS:=2}
+: ${HOLD_EPOCHS:=140}
+: ${SAVE_FREQUENCY:=10}
+: ${EPOCHS_THIS_JOB:=0}
+: ${DALI_DEVICE:="cpu"}
+: ${PAD_TO_MAX_DURATION:=false}
+: ${EVAL_FREQUENCY:=544}
+: ${PREDICTION_FREQUENCY:=544}
+: ${TRAIN_MANIFESTS:="$data_path/librispeech-train-clean-100-wav.json \
+ $data_path/librispeech-train-clean-360-wav.json \
+ $data_path/librispeech-train-other-500-wav.json"}
+: ${VAL_MANIFESTS:="$data_path/librispeech-dev-clean-wav.json"}
+
+mkdir -p "$OUTPUT_DIR"
+
+ARGS="--dataset_dir=$data_path"
+ARGS+=" --val_manifests $VAL_MANIFESTS"
+ARGS+=" --train_manifests $TRAIN_MANIFESTS"
+ARGS+=" --model_config=$MODEL_CONFIG"
+ARGS+=" --output_dir=$OUTPUT_DIR"
+ARGS+=" --lr=$LEARNING_RATE"
+ARGS+=" --batch_size=$batch_size"
+ARGS+=" --min_lr=$MIN_LEARNING_RATE"
+ARGS+=" --lr_policy=$LR_POLICY"
+ARGS+=" --lr_exp_gamma=$LR_EXP_GAMMA"
+ARGS+=" --epochs=$EPOCHS"
+ARGS+=" --warmup_epochs=$WARMUP_EPOCHS"
+ARGS+=" --hold_epochs=$HOLD_EPOCHS"
+ARGS+=" --epochs_this_job=$EPOCHS_THIS_JOB"
+ARGS+=" --ema=$EMA"
+ARGS+=" --seed=$SEED"
+ARGS+=" --optimizer=novograd"
+ARGS+=" --weight_decay=1e-3"
+ARGS+=" --save_frequency=$SAVE_FREQUENCY"
+ARGS+=" --keep_milestones 100 200 300 400"
+ARGS+=" --save_best_from=380"
+ARGS+=" --log_frequency=1"
+ARGS+=" --eval_frequency=$EVAL_FREQUENCY"
+ARGS+=" --prediction_frequency=$PREDICTION_FREQUENCY"
+ARGS+=" --grad_accumulation_steps=$GRAD_ACCUMULATION_STEPS "
+ARGS+=" --dali_device=$DALI_DEVICE"
+
+[ "$AMP" = true ] && ARGS+=" --amp"
+[ "$RESUME" = true ] && ARGS+=" --resume"
+[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
+[ -n "$MAX_DURATION" ] && ARGS+=" --override_config input_train.audio_dataset.max_duration=$MAX_DURATION" \
+ ARGS+=" --override_config input_train.filterbank_features.max_duration=$MAX_DURATION"
+[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --override_config input_train.audio_dataset.pad_to_max_duration=True" \
+ ARGS+=" --override_config input_train.filterbank_features.pad_to_max_duration=True"
+[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=$CHECKPOINT"
+[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
+[ -n "$PRE_ALLOCATE" ] && ARGS+=" --pre_allocate_range $PRE_ALLOCATE"
+
+################# Launch the training script #################
+# Training start time; no modification needed
+start_time=$(date +%s)
+# Source environment variables when not running on the platform
+check_etp_flag=`env | grep etp_running_flag`
+etp_flag=`echo ${check_etp_flag#*=}`
+if [ x"${etp_flag}" != x"true" ];then
+ source ${test_path_dir}/env_npu.sh
+fi
+
+DISTRIBUTED="-m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
+python $DISTRIBUTED train.py $ARGS > ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 &
+
+wait
+
+
+################## Collect training results ################
+# Training end time; no modification needed
+end_time=$(date +%s)
+e2e_time=$(( $end_time - $start_time ))
+
+# Print results; no modification needed
+echo "------------------ Final result ------------------"
+# Output throughput (FPS); review and adjust per model
+FPS=`grep "avg train utts/s" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | awk '{print $9}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Performance images/sec : $FPS"
+
+# Output training accuracy; review and adjust per model
+train_accuracy=`grep "avg" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | grep "wer" | awk '{print $16}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Train Accuracy : ${train_accuracy}"
+echo "E2E Training Duration sec : $e2e_time"
+
+# Performance monitoring result summary
+# Training case information; no modification needed
+BatchSize=${batch_size}
+DeviceType=`uname -m`
+CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc'
+
+## Collect performance data; no modification needed
+# Throughput
+ActualFPS=${FPS}
+# Training time per iteration (ms)
+TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'`
+
+
+# Print key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "CaseName = ${CaseName}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainingTime = ${TrainingTime}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainAccuracy = ${train_accuracy}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "E2ETrainingTime = ${e2e_time}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
diff --git a/PyTorch/contrib/audio/Jasper/test/train_full_8p.sh b/PyTorch/contrib/audio/Jasper/test/train_full_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..177cd2e4cf504e147d6d4fe5600f9896780d713c
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/test/train_full_8p.sh
@@ -0,0 +1,177 @@
+#!/bin/bash
+
+################ Basic configuration parameters; review and adjust per model ##################
+# Required fields (parameters that must be defined here): Network batch_size RANK_SIZE
+# Network name, same as the directory name
+Network="Jasper"
+# Training batch size
+batch_size=32
+# Number of NPUs used for training
+export RANK_SIZE=8
+
+# Argument parsing: data_path is required; other parameters may be added or removed per model. Any new parameter must be defined and assigned above.
+for para in $*
+do
+ if [[ $para == --data_path* ]];then
+ data_path=`echo ${para#*=}`
+ fi
+done
+
+# Check that data_path was passed in; no modification needed
+if [[ $data_path == "" ]];then
+    echo "[Error] para \"data_path\" must be configured"
+ exit 1
+fi
+
+############### Resolve the training script execution path ###############
+# cd to the directory at the same level as the test folder before running the script, for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=`pwd`
+cur_path_last_diename=${cur_path##*/}
+if [ x"${cur_path_last_diename}" == x"test" ];then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=`pwd`
+else
+    test_path_dir=${cur_path}/test
+fi
+echo ${cur_path}
+
+################# Create the log output directory; no modification needed #################
+ASCEND_DEVICE_ID=0
+if [ -d ${test_path_dir}/output/${ASCEND_DEVICE_ID} ];then
+ rm -rf ${test_path_dir}/output/${ASCEND_DEVICE_ID}
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+else
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+fi
+
+# Environment variables
+export DETECTRON2_DATASETS=${data_path}
+export PYTHONPATH=./:$PYTHONPATH
+export OMP_NUM_THREADS=1
+
+: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${CHECKPOINT:=${4:-}}
+: ${RESUME:=true}
+: ${CUDNN_BENCHMARK:=true}
+: ${NUM_GPUS:=8}
+: ${AMP:=True}
+: ${GRAD_ACCUMULATION_STEPS:=2}
+: ${LEARNING_RATE:=0.01}
+: ${MIN_LEARNING_RATE:=0.00001}
+: ${LR_POLICY:=exponential}
+: ${LR_EXP_GAMMA:=0.981}
+: ${EMA:=0.999}
+: ${SEED:=0}
+: ${EPOCHS:=33}
+: ${WARMUP_EPOCHS:=2}
+: ${HOLD_EPOCHS:=140}
+: ${SAVE_FREQUENCY:=10}
+: ${EPOCHS_THIS_JOB:=0}
+: ${DALI_DEVICE:="cpu"}
+: ${PAD_TO_MAX_DURATION:=false}
+: ${EVAL_FREQUENCY:=544}
+: ${PREDICTION_FREQUENCY:=544}
+: ${TRAIN_MANIFESTS:="$data_path/librispeech-train-clean-100-wav.json \
+ $data_path/librispeech-train-clean-360-wav.json \
+ $data_path/librispeech-train-other-500-wav.json"}
+: ${VAL_MANIFESTS:="$data_path/librispeech-dev-clean-wav.json"}
+
+mkdir -p "$OUTPUT_DIR"
+
+ARGS="--dataset_dir=$data_path"
+ARGS+=" --val_manifests $VAL_MANIFESTS"
+ARGS+=" --train_manifests $TRAIN_MANIFESTS"
+ARGS+=" --model_config=$MODEL_CONFIG"
+ARGS+=" --output_dir=$OUTPUT_DIR"
+ARGS+=" --lr=$LEARNING_RATE"
+ARGS+=" --batch_size=$batch_size"
+ARGS+=" --min_lr=$MIN_LEARNING_RATE"
+ARGS+=" --lr_policy=$LR_POLICY"
+ARGS+=" --lr_exp_gamma=$LR_EXP_GAMMA"
+ARGS+=" --epochs=$EPOCHS"
+ARGS+=" --warmup_epochs=$WARMUP_EPOCHS"
+ARGS+=" --hold_epochs=$HOLD_EPOCHS"
+ARGS+=" --epochs_this_job=$EPOCHS_THIS_JOB"
+ARGS+=" --ema=$EMA"
+ARGS+=" --seed=$SEED"
+ARGS+=" --optimizer=novograd"
+ARGS+=" --weight_decay=1e-3"
+ARGS+=" --save_frequency=$SAVE_FREQUENCY"
+ARGS+=" --keep_milestones 100 200 300 400"
+ARGS+=" --save_best_from=380"
+ARGS+=" --log_frequency=1"
+ARGS+=" --eval_frequency=$EVAL_FREQUENCY"
+ARGS+=" --prediction_frequency=$PREDICTION_FREQUENCY"
+ARGS+=" --grad_accumulation_steps=$GRAD_ACCUMULATION_STEPS "
+ARGS+=" --dali_device=$DALI_DEVICE"
+
+[ "$AMP" = true ] && ARGS+=" --amp"
+[ "$RESUME" = true ] && ARGS+=" --resume"
+[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
+[ -n "$MAX_DURATION" ] && ARGS+=" --override_config input_train.audio_dataset.max_duration=$MAX_DURATION" \
+ ARGS+=" --override_config input_train.filterbank_features.max_duration=$MAX_DURATION"
+[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --override_config input_train.audio_dataset.pad_to_max_duration=True" \
+ ARGS+=" --override_config input_train.filterbank_features.pad_to_max_duration=True"
+[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=$CHECKPOINT"
+[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
+[ -n "$PRE_ALLOCATE" ] && ARGS+=" --pre_allocate_range $PRE_ALLOCATE"
+
+################# Launch the training script #################
+# Training start time; no modification needed
+start_time=$(date +%s)
+# Source environment variables when not running on the platform
+check_etp_flag=`env | grep etp_running_flag`
+etp_flag=`echo ${check_etp_flag#*=}`
+if [ x"${etp_flag}" != x"true" ];then
+ source ${test_path_dir}/env_npu.sh
+fi
+
+DISTRIBUTED="-m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
+python $DISTRIBUTED train.py $ARGS > ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 &
+
+wait
+
+
+################## Collect training results ################
+# Training end time; no modification needed
+end_time=$(date +%s)
+e2e_time=$(( $end_time - $start_time ))
+
+# Print results; no modification needed
+echo "------------------ Final result ------------------"
+# Output throughput (FPS); review and adjust per model
+FPS=`grep "avg train utts/s" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | awk '{print $9}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Performance images/sec : $FPS"
+
+# Output training accuracy; review and adjust per model
+train_accuracy=`grep "avg" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | grep "wer" | awk '{print $16}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Train Accuracy : ${train_accuracy}"
+echo "E2E Training Duration sec : $e2e_time"
+
+# Performance monitoring result summary
+# Training case information; no modification needed
+BatchSize=${batch_size}
+DeviceType=`uname -m`
+CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc'
+
+## Collect performance data; no modification needed
+# Throughput
+ActualFPS=${FPS}
+# Training time per iteration (ms)
+TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'`
+
+
+# Print key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "CaseName = ${CaseName}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainingTime = ${TrainingTime}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainAccuracy = ${train_accuracy}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "E2ETrainingTime = ${e2e_time}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
diff --git a/PyTorch/contrib/audio/Jasper/test/train_performance_1p.sh b/PyTorch/contrib/audio/Jasper/test/train_performance_1p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..771b33dc495b7818dcb2c3e53c5466b4de7f799b
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/test/train_performance_1p.sh
@@ -0,0 +1,177 @@
+#!/bin/bash
+
+################ Basic configuration parameters; review and adjust per model ##################
+# Required fields (parameters that must be defined here): Network batch_size RANK_SIZE
+# Network name, same as the directory name
+Network="Jasper"
+# Training batch size
+batch_size=32
+# Number of NPUs used for training
+export RANK_SIZE=1
+
+# Argument parsing: data_path is required; other parameters may be added or removed per model. Any new parameter must be defined and assigned above.
+for para in $*
+do
+ if [[ $para == --data_path* ]];then
+ data_path=`echo ${para#*=}`
+ fi
+done
+
+# Check that data_path was passed in; no modification needed
+if [[ $data_path == "" ]];then
+    echo "[Error] para \"data_path\" must be configured"
+ exit 1
+fi
+
+############### Resolve the training script execution path ###############
+# cd to the directory at the same level as the test folder before running the script, for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=`pwd`
+cur_path_last_diename=${cur_path##*/}
+if [ x"${cur_path_last_diename}" == x"test" ];then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=`pwd`
+else
+    test_path_dir=${cur_path}/test
+fi
+echo ${cur_path}
+
+################# Create the log output directory; no modification needed #################
+ASCEND_DEVICE_ID=0
+if [ -d ${test_path_dir}/output/${ASCEND_DEVICE_ID} ];then
+ rm -rf ${test_path_dir}/output/${ASCEND_DEVICE_ID}
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+else
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+fi
+
+# Environment variables
+export DETECTRON2_DATASETS=${data_path}
+export PYTHONPATH=./:$PYTHONPATH
+export OMP_NUM_THREADS=1
+
+: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${CHECKPOINT:=${4:-}}
+: ${RESUME:=true}
+: ${CUDNN_BENCHMARK:=true}
+: ${NUM_GPUS:=1}
+: ${AMP:=True}
+: ${GRAD_ACCUMULATION_STEPS:=2}
+: ${LEARNING_RATE:=0.01}
+: ${MIN_LEARNING_RATE:=0.00001}
+: ${LR_POLICY:=exponential}
+: ${LR_EXP_GAMMA:=0.981}
+: ${EMA:=0.999}
+: ${SEED:=0}
+: ${EPOCHS:=1}
+: ${WARMUP_EPOCHS:=2}
+: ${HOLD_EPOCHS:=140}
+: ${SAVE_FREQUENCY:=10}
+: ${EPOCHS_THIS_JOB:=0}
+: ${DALI_DEVICE:="cpu"}
+: ${PAD_TO_MAX_DURATION:=false}
+: ${EVAL_FREQUENCY:=544}
+: ${PREDICTION_FREQUENCY:=544}
+: ${TRAIN_MANIFESTS:="$data_path/librispeech-train-clean-100-wav.json \
+ $data_path/librispeech-train-clean-360-wav.json \
+ $data_path/librispeech-train-other-500-wav.json"}
+: ${VAL_MANIFESTS:="$data_path/librispeech-dev-clean-wav.json"}
+
+mkdir -p "$OUTPUT_DIR"
+
+ARGS="--dataset_dir=$data_path"
+ARGS+=" --val_manifests $VAL_MANIFESTS"
+ARGS+=" --train_manifests $TRAIN_MANIFESTS"
+ARGS+=" --model_config=$MODEL_CONFIG"
+ARGS+=" --output_dir=$OUTPUT_DIR"
+ARGS+=" --lr=$LEARNING_RATE"
+ARGS+=" --batch_size=$batch_size"
+ARGS+=" --min_lr=$MIN_LEARNING_RATE"
+ARGS+=" --lr_policy=$LR_POLICY"
+ARGS+=" --lr_exp_gamma=$LR_EXP_GAMMA"
+ARGS+=" --epochs=$EPOCHS"
+ARGS+=" --warmup_epochs=$WARMUP_EPOCHS"
+ARGS+=" --hold_epochs=$HOLD_EPOCHS"
+ARGS+=" --epochs_this_job=$EPOCHS_THIS_JOB"
+ARGS+=" --ema=$EMA"
+ARGS+=" --seed=$SEED"
+ARGS+=" --optimizer=novograd"
+ARGS+=" --weight_decay=1e-3"
+ARGS+=" --save_frequency=$SAVE_FREQUENCY"
+ARGS+=" --keep_milestones 100 200 300 400"
+ARGS+=" --save_best_from=380"
+ARGS+=" --log_frequency=1"
+ARGS+=" --eval_frequency=$EVAL_FREQUENCY"
+ARGS+=" --prediction_frequency=$PREDICTION_FREQUENCY"
+ARGS+=" --grad_accumulation_steps=$GRAD_ACCUMULATION_STEPS "
+ARGS+=" --dali_device=$DALI_DEVICE"
+
+[ "$AMP" = true ] && ARGS+=" --amp"
+[ "$RESUME" = true ] && ARGS+=" --resume"
+[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
+[ -n "$MAX_DURATION" ] && ARGS+=" --override_config input_train.audio_dataset.max_duration=$MAX_DURATION" \
+ ARGS+=" --override_config input_train.filterbank_features.max_duration=$MAX_DURATION"
+[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --override_config input_train.audio_dataset.pad_to_max_duration=True" \
+ ARGS+=" --override_config input_train.filterbank_features.pad_to_max_duration=True"
+[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=$CHECKPOINT"
+[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
+[ -n "$PRE_ALLOCATE" ] && ARGS+=" --pre_allocate_range $PRE_ALLOCATE"
+
+################# Launch the training script #################
+# Training start time; no modification needed
+start_time=$(date +%s)
+# Source environment variables when not running on the platform
+check_etp_flag=`env | grep etp_running_flag`
+etp_flag=`echo ${check_etp_flag#*=}`
+if [ x"${etp_flag}" != x"true" ];then
+ source ${test_path_dir}/env_npu.sh
+fi
+
+DISTRIBUTED="-m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
+python $DISTRIBUTED train.py $ARGS > ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 &
+
+wait
+
+
+################## Collect training results ################
+# Training end time; no modification needed
+end_time=$(date +%s)
+e2e_time=$(( $end_time - $start_time ))
+
+# Print results; no modification needed
+echo "------------------ Final result ------------------"
+# Output throughput (FPS); review and adjust per model
+FPS=`grep "avg train utts/s" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | awk '{print $9}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Performance images/sec : $FPS"
+
+# Output training accuracy; review and adjust per model
+train_accuracy=`grep "avg" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | grep "wer" | awk '{print $16}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Train Accuracy : ${train_accuracy}"
+echo "E2E Training Duration sec : $e2e_time"
+
+# Performance monitoring result summary
+# Training case information; no modification needed
+BatchSize=${batch_size}
+DeviceType=`uname -m`
+CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc'
+
+## Collect performance data; no modification needed
+# Throughput
+ActualFPS=${FPS}
+# Training time per iteration (ms)
+TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'`
+
+
+# Print key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "CaseName = ${CaseName}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainingTime = ${TrainingTime}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainAccuracy = ${train_accuracy}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "E2ETrainingTime = ${e2e_time}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
diff --git a/PyTorch/contrib/audio/Jasper/test/train_performance_8p.sh b/PyTorch/contrib/audio/Jasper/test/train_performance_8p.sh
new file mode 100644
index 0000000000000000000000000000000000000000..8836f670791a2f5d7ecc34fee72bfab42c046f55
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/test/train_performance_8p.sh
@@ -0,0 +1,177 @@
+#!/bin/bash
+
+################ Basic configuration parameters; review and adjust per model ##################
+# Required fields (parameters that must be defined here): Network batch_size RANK_SIZE
+# Network name, same as the directory name
+Network="Jasper"
+# Training batch size
+batch_size=32
+# Number of NPUs used for training
+export RANK_SIZE=8
+
+# Argument parsing: data_path is required; other parameters may be added or removed per model. Any new parameter must be defined and assigned above.
+for para in $*
+do
+ if [[ $para == --data_path* ]];then
+ data_path=`echo ${para#*=}`
+ fi
+done
+
+# Check that data_path was passed in; no modification needed
+if [[ $data_path == "" ]];then
+    echo "[Error] para \"data_path\" must be configured"
+ exit 1
+fi
+
+############### Resolve the training script execution path ###############
+# cd to the directory at the same level as the test folder before running the script, for better compatibility; test_path_dir is the path that contains the test folder
+cur_path=`pwd`
+cur_path_last_diename=${cur_path##*/}
+if [ x"${cur_path_last_diename}" == x"test" ];then
+    test_path_dir=${cur_path}
+    cd ..
+    cur_path=`pwd`
+else
+    test_path_dir=${cur_path}/test
+fi
+echo ${cur_path}
+
+################# Create the log output directory; no modification needed #################
+ASCEND_DEVICE_ID=0
+if [ -d ${test_path_dir}/output/${ASCEND_DEVICE_ID} ];then
+ rm -rf ${test_path_dir}/output/${ASCEND_DEVICE_ID}
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+else
+ mkdir -p ${test_path_dir}/output/$ASCEND_DEVICE_ID
+fi
+
+# Environment variables
+export DETECTRON2_DATASETS=${data_path}
+export PYTHONPATH=./:$PYTHONPATH
+export OMP_NUM_THREADS=1
+
+: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
+: ${OUTPUT_DIR:=${3:-"/results"}}
+: ${CHECKPOINT:=${4:-}}
+: ${RESUME:=true}
+: ${CUDNN_BENCHMARK:=true}
+: ${NUM_GPUS:=8}
+: ${AMP:=True}
+: ${GRAD_ACCUMULATION_STEPS:=2}
+: ${LEARNING_RATE:=0.01}
+: ${MIN_LEARNING_RATE:=0.00001}
+: ${LR_POLICY:=exponential}
+: ${LR_EXP_GAMMA:=0.981}
+: ${EMA:=0.999}
+: ${SEED:=0}
+: ${EPOCHS:=3}
+: ${WARMUP_EPOCHS:=2}
+: ${HOLD_EPOCHS:=140}
+: ${SAVE_FREQUENCY:=10}
+: ${EPOCHS_THIS_JOB:=0}
+: ${DALI_DEVICE:="cpu"}
+: ${PAD_TO_MAX_DURATION:=false}
+: ${EVAL_FREQUENCY:=544}
+: ${PREDICTION_FREQUENCY:=544}
+: ${TRAIN_MANIFESTS:="$data_path/librispeech-train-clean-100-wav.json \
+ $data_path/librispeech-train-clean-360-wav.json \
+ $data_path/librispeech-train-other-500-wav.json"}
+: ${VAL_MANIFESTS:="$data_path/librispeech-dev-clean-wav.json"}
+
+mkdir -p "$OUTPUT_DIR"
+
+ARGS="--dataset_dir=$data_path"
+ARGS+=" --val_manifests $VAL_MANIFESTS"
+ARGS+=" --train_manifests $TRAIN_MANIFESTS"
+ARGS+=" --model_config=$MODEL_CONFIG"
+ARGS+=" --output_dir=$OUTPUT_DIR"
+ARGS+=" --lr=$LEARNING_RATE"
+ARGS+=" --batch_size=$batch_size"
+ARGS+=" --min_lr=$MIN_LEARNING_RATE"
+ARGS+=" --lr_policy=$LR_POLICY"
+ARGS+=" --lr_exp_gamma=$LR_EXP_GAMMA"
+ARGS+=" --epochs=$EPOCHS"
+ARGS+=" --warmup_epochs=$WARMUP_EPOCHS"
+ARGS+=" --hold_epochs=$HOLD_EPOCHS"
+ARGS+=" --epochs_this_job=$EPOCHS_THIS_JOB"
+ARGS+=" --ema=$EMA"
+ARGS+=" --seed=$SEED"
+ARGS+=" --optimizer=novograd"
+ARGS+=" --weight_decay=1e-3"
+ARGS+=" --save_frequency=$SAVE_FREQUENCY"
+ARGS+=" --keep_milestones 100 200 300 400"
+ARGS+=" --save_best_from=380"
+ARGS+=" --log_frequency=1"
+ARGS+=" --eval_frequency=$EVAL_FREQUENCY"
+ARGS+=" --prediction_frequency=$PREDICTION_FREQUENCY"
+ARGS+=" --grad_accumulation_steps=$GRAD_ACCUMULATION_STEPS "
+ARGS+=" --dali_device=$DALI_DEVICE"
+
+[ "$AMP" = true ] && ARGS+=" --amp"
+[ "$RESUME" = true ] && ARGS+=" --resume"
+[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
+[ -n "$MAX_DURATION" ] && ARGS+=" --override_config input_train.audio_dataset.max_duration=$MAX_DURATION" \
+ ARGS+=" --override_config input_train.filterbank_features.max_duration=$MAX_DURATION"
+[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --override_config input_train.audio_dataset.pad_to_max_duration=True" \
+ ARGS+=" --override_config input_train.filterbank_features.pad_to_max_duration=True"
+[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=$CHECKPOINT"
+[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
+[ -n "$PRE_ALLOCATE" ] && ARGS+=" --pre_allocate_range $PRE_ALLOCATE"
+
+################# Launch the training script #################
+# Training start time; no modification needed
+start_time=$(date +%s)
+# Source environment variables when not running on the platform
+check_etp_flag=`env | grep etp_running_flag`
+etp_flag=`echo ${check_etp_flag#*=}`
+if [ x"${etp_flag}" != x"true" ];then
+ source ${test_path_dir}/env_npu.sh
+fi
+
+DISTRIBUTED="-m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
+python $DISTRIBUTED train.py $ARGS > ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log 2>&1 &
+
+wait
+
+
+################## Collect training results ################
+# Training end time; no modification needed
+end_time=$(date +%s)
+e2e_time=$(( $end_time - $start_time ))
+
+# Print results; no modification needed
+echo "------------------ Final result ------------------"
+# Output throughput (FPS); review and adjust per model
+FPS=`grep "avg train utts/s" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | awk '{print $9}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Performance images/sec : $FPS"
+
+# Output training accuracy; review and adjust per model
+train_accuracy=`grep "avg" ${test_path_dir}/output/${ASCEND_DEVICE_ID}/train_${ASCEND_DEVICE_ID}.log | grep "wer" | awk '{print $16}' | awk 'END {print}'`
+# Print; no modification needed
+echo "Final Train Accuracy : ${train_accuracy}"
+echo "E2E Training Duration sec : $e2e_time"
+
+# Performance monitoring result summary
+# Training case information; no modification needed
+BatchSize=${batch_size}
+DeviceType=`uname -m`
+CaseName=${Network}_bs${BatchSize}_${RANK_SIZE}'p'_'acc'
+
+## Collect performance data; no modification needed
+# Throughput
+ActualFPS=${FPS}
+# Training time per iteration (ms)
+TrainingTime=`awk 'BEGIN{printf "%.2f\n", '${batch_size}'*1000/'${FPS}'}'`
+
+
+# Print key information to ${CaseName}.log; no modification needed
+echo "Network = ${Network}" > ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "RankSize = ${RANK_SIZE}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "BatchSize = ${BatchSize}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "DeviceType = ${DeviceType}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "CaseName = ${CaseName}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "ActualFPS = ${ActualFPS}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainingTime = ${TrainingTime}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "TrainAccuracy = ${train_accuracy}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
+echo "E2ETrainingTime = ${e2e_time}" >> ${test_path_dir}/output/$ASCEND_DEVICE_ID/${CaseName}.log
diff --git a/PyTorch/contrib/audio/Jasper/train.py b/PyTorch/contrib/audio/Jasper/train.py
new file mode 100644
index 0000000000000000000000000000000000000000..9c82f7e2a66097073f7249f91040b60dd6f2faac
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/train.py
@@ -0,0 +1,529 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import sklearn
+import argparse
+import copy
+import os
+import random
+import time
+import npu_fused_adamw
+
+try:
+ import nvidia_dlprof_pytorch_nvtx as pyprof
+except ModuleNotFoundError:
+ import pyprof
+
+import torch
+import numpy as np
+import torch.cuda.profiler as profiler
+import torch.distributed as dist
+from apex import amp
+#from apex.parallel import DistributedDataParallel
+from torch.nn.parallel import DistributedDataParallel
+from common import helpers
+# from common.dali.data_loader import DaliDataLoader
+from common.dataset import AudioDataset, get_data_loader
+from common.features import BaseFeatures, FilterbankFeatures
+from common.helpers import (Checkpointer, greedy_wer, num_weights, print_once,
+ process_evaluation_epoch)
+from common.optimizers import AdamW, lr_policy, Novograd
+from common.tb_dllogger import flush_log, init_log, log
+from common.utils import BenchmarkStats
+from jasper import config
+from jasper.model import CTCLossNM, GreedyCTCDecoder, Jasper
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='Jasper')
+
+ training = parser.add_argument_group('training setup')
+ training.add_argument('--epochs', default=400, type=int,
+ help='Number of epochs for the entire training; influences the lr schedule')
+ training.add_argument("--warmup_epochs", default=0, type=int,
+ help='Initial epochs of increasing learning rate')
+ training.add_argument("--hold_epochs", default=0, type=int,
+ help='Constant max learning rate epochs after warmup')
+ training.add_argument('--epochs_this_job', default=0, type=int,
+ help=('Run for a number of epochs with no effect on the lr schedule.'
+ 'Useful for re-starting the training.'))
+ training.add_argument('--cudnn_benchmark', action='store_true', default=True,
+ help='Enable cudnn benchmark')
+ training.add_argument('--amp', '--fp16', action='store_true', default=True,
+ help='Use mixed precision training')
+ training.add_argument('--seed', default=42, type=int, help='Random seed')
+ training.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0),
+ type=int, help='GPU id used for distributed training')
+ training.add_argument('--pre_allocate_range', default=None, type=int, nargs=2,
+ help='Warmup with batches of length [min, max] before training')
+ training.add_argument('--pyprof', action='store_true', help='Enable pyprof profiling')
+
+ optim = parser.add_argument_group('optimization setup')
+ optim.add_argument('--batch_size', default=32, type=int,
+ help='Global batch size')
+ optim.add_argument('--lr', default=1e-3, type=float,
+ help='Peak learning rate')
+ optim.add_argument("--min_lr", default=1e-5, type=float,
+ help='minimum learning rate')
+ optim.add_argument("--lr_policy", default='exponential', type=str,
+ choices=['exponential', 'legacy'], help='lr scheduler')
+ optim.add_argument("--lr_exp_gamma", default=0.99, type=float,
+ help='gamma factor for exponential lr scheduler')
+ optim.add_argument('--weight_decay', default=1e-3, type=float,
+ help='Weight decay for the optimizer')
+ optim.add_argument('--grad_accumulation_steps', default=1, type=int,
+ help='Number of accumulation steps')
+ optim.add_argument('--optimizer', default='novograd', type=str,
+ choices=['novograd', 'adamw'], help='Optimization algorithm')
+ optim.add_argument('--ema', type=float, default=0.0,
+ help='Discount factor for exp averaging of model weights')
+
+ io = parser.add_argument_group('feature and checkpointing setup')
+ io.add_argument('--dali_device', type=str, choices=['none', 'cpu', 'gpu'],
+ default='none', help='Use DALI pipeline for fast data processing')
+ io.add_argument('--resume', action='store_true',
+ help='Try to resume from last saved checkpoint.')
+ io.add_argument('--ckpt', default=None, type=str,
+ help='Path to a checkpoint for resuming training')
+ io.add_argument('--save_frequency', default=10, type=int,
+ help='Checkpoint saving frequency in epochs')
+ io.add_argument('--keep_milestones', default=[100, 200, 300], type=int, nargs='+',
+ help='Milestone checkpoints to keep from removing')
+ io.add_argument('--save_best_from', default=380, type=int,
+ help='Epoch on which to begin tracking best checkpoint (dev WER)')
+ io.add_argument('--eval_frequency', default=200, type=int,
+ help='Number of steps between evaluations on dev set')
+ io.add_argument('--log_frequency', default=25, type=int,
+ help='Number of steps between printing training stats')
+ io.add_argument('--prediction_frequency', default=100, type=int,
+ help='Number of steps between printing sample decodings')
+ io.add_argument('--model_config', type=str, required=True,
+ help='Path of the model configuration file')
+ io.add_argument('--train_manifests', type=str, required=True, nargs='+',
+ help='Paths of the training dataset manifest file')
+ io.add_argument('--val_manifests', type=str, required=True, nargs='+',
+ help='Paths of the evaluation datasets manifest files')
+ io.add_argument('--dataset_dir', required=True, type=str,
+ help='Root dir of dataset')
+ io.add_argument('--output_dir', type=str, required=True,
+ help='Directory for logs and checkpoints')
+ io.add_argument('--log_file', type=str, default=None,
+ help='Path to save the training logfile.')
+ io.add_argument('--benchmark_epochs_num', type=int, default=1,
+ help='Number of epochs accounted in final average throughput.')
+ io.add_argument('--override_config', type=str, action='append',
+ help='Overrides a value from a config .yaml.'
+ ' Syntax: `--override_config nested.config.key=val`.')
+ return parser.parse_args()
+
+
+def reduce_tensor(tensor, num_gpus):
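+    # Average a tensor across all distributed workers (sum via all-reduce, then divide).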
+ rt = tensor.clone()
+ dist.all_reduce(rt, op=dist.ReduceOp.SUM)
+ return rt.true_divide(num_gpus)
+
+
+def apply_ema(model, ema_model, decay):
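+    # In-place exponential moving average of the weights:
+    # ema_param = decay * ema_param + (1 - decay) * train_param; decay == 0 disables EMA.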
+ if not decay:
+ return
+
+ sd = getattr(model, 'module', model).state_dict()
+ for k, v in ema_model.state_dict().items():
+ v.copy_(decay * v + (1 - decay) * sd[k])
+
+
+@torch.no_grad()
+def evaluate(epoch, step, val_loader, val_feat_proc, labels, model,
+ ema_model, ctc_loss, greedy_decoder, use_amp, use_dali=False):
+
+ for model, subset in [(model, 'dev'), (ema_model, 'dev_ema')]:
+ if model is None:
+ continue
+
+ model.eval()
+ start_time = time.time()
+ agg = {'losses': [], 'preds': [], 'txts': []}
+
+ for batch in val_loader:
+ if use_dali:
+ # with DALI, the data is already on GPU
+ feat, feat_lens, txt, txt_lens = batch
+ if val_feat_proc is not None:
+ feat, feat_lens = val_feat_proc(feat, feat_lens, use_amp)
+ else:
+ audio, audio_lens, txt, txt_lens = batch
+ # print("audio",audio)
+ feat, feat_lens = val_feat_proc(audio, audio_lens, use_amp)
+ feat = feat.npu()
+ audio = audio.npu()
+ feat_lens = feat_lens.npu()
+ txt = txt.npu()
+ log_probs, enc_lens = model.forward(feat, feat_lens)
+ loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
+ pred = greedy_decoder(log_probs)
+
+ agg['losses'] += helpers.gather_losses([loss])
+ agg['preds'] += helpers.gather_predictions([pred], labels)
+ agg['txts'] += helpers.gather_transcripts([txt], [txt_lens], labels)
+
+ wer, loss = process_evaluation_epoch(agg)
+ log((epoch,), step, subset, {'loss': loss, 'wer': 100.0 * wer,
+ 'took': time.time() - start_time})
+ model.train()
+ return wer
+
+
+def main():
+ args = parse_args()
+
+ assert(torch.npu.is_available())
+ assert args.prediction_frequency % args.log_frequency == 0
+
+ # torch.backends.cudnn.benchmark = args.cudnn_benchmark
+
+ # set up distributed training
+ multi_gpu = True
+ if multi_gpu:
+ torch.npu.set_device(args.local_rank)
+ dist.init_process_group(backend='hccl', init_method='env://')
+ world_size = dist.get_world_size()
+ print_once(f'Distributed training with {world_size} GPUs\n')
+ else:
+ world_size = 1
+
+ torch.manual_seed(args.seed + args.local_rank)
+ np.random.seed(args.seed + args.local_rank)
+ random.seed(args.seed + args.local_rank)
+
+ init_log(args)
+
+ cfg = config.load(args.model_config)
+ config.apply_config_overrides(cfg, args)
+
+ symbols = helpers.add_ctc_blank(cfg['labels'])
+
+ assert args.grad_accumulation_steps >= 1
+ assert args.batch_size % args.grad_accumulation_steps == 0
+ batch_size = args.batch_size // args.grad_accumulation_steps
+
+ print_once('Setting up datasets...')
+ train_dataset_kw, train_features_kw = config.input(cfg, 'train')
+ val_dataset_kw, val_features_kw = config.input(cfg, 'val')
+
+ # use_dali = args.dali_device in ('cpu', 'gpu')
+ use_dali = False
+ if use_dali:
+ assert train_dataset_kw['ignore_offline_speed_perturbation'], \
+ "DALI doesn't support offline speed perturbation"
+
+ # pad_to_max_duration is not supported by DALI - have simple padders
+ if train_features_kw['pad_to_max_duration']:
+ train_feat_proc = BaseFeatures(
+ pad_align=train_features_kw['pad_align'],
+ pad_to_max_duration=True,
+ max_duration=train_features_kw['max_duration'],
+ sample_rate=train_features_kw['sample_rate'],
+ window_size=train_features_kw['window_size'],
+ window_stride=train_features_kw['window_stride'])
+ train_features_kw['pad_to_max_duration'] = False
+ else:
+ train_feat_proc = None
+
+ if val_features_kw['pad_to_max_duration']:
+ val_feat_proc = BaseFeatures(
+ pad_align=val_features_kw['pad_align'],
+ pad_to_max_duration=True,
+ max_duration=val_features_kw['max_duration'],
+ sample_rate=val_features_kw['sample_rate'],
+ window_size=val_features_kw['window_size'],
+ window_stride=val_features_kw['window_stride'])
+ val_features_kw['pad_to_max_duration'] = False
+ else:
+ val_feat_proc = None
+
+ train_loader = DaliDataLoader(gpu_id=args.local_rank,
+ dataset_path=args.dataset_dir,
+ config_data=train_dataset_kw,
+ config_features=train_features_kw,
+ json_names=args.train_manifests,
+ batch_size=batch_size,
+ grad_accumulation_steps=args.grad_accumulation_steps,
+ pipeline_type="train",
+ device_type=args.dali_device,
+ symbols=symbols)
+
+ val_loader = DaliDataLoader(gpu_id=args.local_rank,
+ dataset_path=args.dataset_dir,
+ config_data=val_dataset_kw,
+ config_features=val_features_kw,
+ json_names=args.val_manifests,
+ batch_size=batch_size,
+ pipeline_type="val",
+ device_type=args.dali_device,
+ symbols=symbols)
+ else:
+ train_dataset_kw, train_features_kw = config.input(cfg, 'train')
+ train_dataset = AudioDataset(args.dataset_dir,
+ args.train_manifests,
+ symbols,
+ **train_dataset_kw)
+ train_loader = get_data_loader(train_dataset,
+ batch_size,
+ multi_gpu=multi_gpu,
+ shuffle=True,
+ num_workers=4)
+ train_feat_proc = FilterbankFeatures(**train_features_kw)
+
+ val_dataset_kw, val_features_kw = config.input(cfg, 'val')
+ val_dataset = AudioDataset(args.dataset_dir,
+ args.val_manifests,
+ symbols,
+ **val_dataset_kw)
+ val_loader = get_data_loader(val_dataset,
+ batch_size,
+ multi_gpu=multi_gpu,
+ shuffle=False,
+ num_workers=4,
+ drop_last=False)
+ val_feat_proc = FilterbankFeatures(**val_features_kw)
+
+ dur = train_dataset.duration / 3600
+ dur_f = train_dataset.duration_filtered / 3600
+ nsampl = len(train_dataset)
+ print_once(f'Training samples: {nsampl} ({dur:.1f}h, '
+ f'filtered {dur_f:.1f}h)')
+
+ # if train_feat_proc is not None:
+ # train_feat_proc.cpu()
+ # if val_feat_proc is not None:
+ # val_feat_proc.cpu()
+ train_feat_proc.cpu()
+ val_feat_proc.cpu()
+ steps_per_epoch = len(train_loader) // args.grad_accumulation_steps
+
+ # set up the model
+ model = Jasper(encoder_kw=config.encoder(cfg),
+ decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
+ model.npu()
+ ctc_loss = CTCLossNM(n_classes=len(symbols))
+ greedy_decoder = GreedyCTCDecoder()
+
+ print_once(f'Model size: {num_weights(model) / 10**6:.1f}M params\n')
+
+ # optimization
+ kw = {'lr': args.lr, 'weight_decay': args.weight_decay}
+ if args.optimizer == "novograd":
+ optimizer = Novograd(model.parameters(), **kw)
+ elif args.optimizer == "adamw":
+ optimizer = npu_fused_adamw(model.parameters(), **kw)
+ else:
+ raise ValueError(f'Invalid optimizer "{args.optimizer}"')
+
+ adjust_lr = lambda step, epoch, optimizer: lr_policy(
+ step, epoch, args.lr, optimizer, steps_per_epoch=steps_per_epoch,
+ warmup_epochs=args.warmup_epochs, hold_epochs=args.hold_epochs,
+ num_epochs=args.epochs, policy=args.lr_policy, min_lr=args.min_lr,
+ exp_gamma=args.lr_exp_gamma)
+
+ if args.amp:
+# model, optimizer = amp.initialize(
+# min_loss_scale=1.0, models=model, optimizers=optimizer,
+# opt_level='O1', max_loss_scale=512.0)
+ model, optimizer = amp.initialize(models=model, optimizers=optimizer, loss_scale=32, combine_grad=True)
+
+ if args.ema > 0:
+ ema_model = copy.deepcopy(model)
+ else:
+ ema_model = None
+
+ if multi_gpu:
+ model = DistributedDataParallel(model, device_ids=[args.local_rank], broadcast_buffers=False)
+
+ if args.pyprof:
+ pyprof.init(enable_function_stack=True)
+
+ # load checkpoint
+ meta = {'best_wer': 10**6, 'start_epoch': 0}
+ checkpointer = Checkpointer(args.output_dir, 'Jasper',
+ args.keep_milestones, args.amp)
+ if args.resume:
+ args.ckpt = checkpointer.last_checkpoint() or args.ckpt
+
+ if args.ckpt is not None:
+ checkpointer.load(args.ckpt, model, ema_model, optimizer, meta)
+
+ start_epoch = meta['start_epoch']
+ best_wer = meta['best_wer']
+ epoch = 1
+ step = start_epoch * steps_per_epoch + 1
+
+ if args.pyprof:
+ torch.autograd.profiler.emit_nvtx().__enter__()
+ profiler.start()
+
+ # training loop
+ model.train()
+
+ # pre-allocate
+ if args.pre_allocate_range is not None:
+ n_feats = train_features_kw['n_filt']
+ pad_align = train_features_kw['pad_align']
+ a, b = args.pre_allocate_range
+ for n_frames in range(a, b + pad_align, pad_align):
+ print_once(f'Pre-allocation ({batch_size}x{n_feats}x{n_frames})...')
+
+ feat = torch.randn(batch_size, n_feats, n_frames, device='cpu')
+ feat_lens = torch.ones(batch_size, device='cpu').fill_(n_frames)
+ txt = torch.randint(high=len(symbols)-1, size=(batch_size, 100),
+ device='cpu')
+ txt_lens = torch.ones(batch_size, device='cpu').fill_(100)
+ log_probs, enc_lens = model(feat, feat_lens)
+ del feat
+ loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
+ loss.backward()
+ model.zero_grad()
+
+ bmark_stats = BenchmarkStats()
+
+ for epoch in range(start_epoch + 1, args.epochs + 1):
+ if multi_gpu and not use_dali:
+ train_loader.sampler.set_epoch(epoch)
+
+ epoch_utts = 0
+ epoch_loss = 0
+ accumulated_batches = 0
+ epoch_start_time = time.time()
+
+ for batch in train_loader:
+
+ if accumulated_batches == 0:
+ adjust_lr(step, epoch, optimizer)
+ optimizer.zero_grad()
+ step_loss = 0
+ step_utts = 0
+ step_start_time = time.time()
+
+ if use_dali:
+ # with DALI, the data is already on GPU
+ feat, feat_lens, txt, txt_lens = batch
+ if train_feat_proc is not None:
+ feat, feat_lens = train_feat_proc(feat, feat_lens, args.amp)
+ else:
+ # print("progress")
+ # batch = [t.cpu(non_blocking=True) for t in batch]
+ audio, audio_lens, txt, txt_lens = batch
+ # print("audio",audio)
+ feat, feat_lens = train_feat_proc(audio, audio_lens, args.amp)
+ # print("feat",feat)
+ # print("feat_len",feat_lens)
+ feat = feat.npu()
+ audio = audio.npu()
+ feat_lens = feat_lens.npu()
+ txt = txt.npu()
+ log_probs, enc_lens = model(feat, feat_lens)
+
+ loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
+ loss /= args.grad_accumulation_steps
+
+ if torch.isnan(loss).any():
+ print_once(f'WARNING: loss is NaN; skipping update')
+ else:
+ if multi_gpu:
+ step_loss += reduce_tensor(loss.data, world_size).item()
+ else:
+ step_loss += loss.item()
+
+ if args.amp:
+ with amp.scale_loss(loss, optimizer) as scaled_loss:
+ scaled_loss.backward()
+ else:
+ loss.backward()
+ step_utts += batch[0].size(0) * world_size
+ epoch_utts += batch[0].size(0) * world_size
+ accumulated_batches += 1
+
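+            # Weights are updated once every grad_accumulation_steps micro-batches;
+            # logging, evaluation and checkpointing are keyed to these optimizer steps.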
+ if accumulated_batches % args.grad_accumulation_steps == 0:
+ epoch_loss += step_loss
+ optimizer.step()
+ apply_ema(model, ema_model, args.ema)
+
+ if step % args.log_frequency == 0:
+ preds = greedy_decoder(log_probs)
+ wer, pred_utt, ref = greedy_wer(preds, txt, txt_lens, symbols)
+
+ if step % args.prediction_frequency == 0:
+ print_once(f' Decoded: {pred_utt[:90]}')
+ print_once(f' Reference: {ref[:90]}')
+
+ step_time = time.time() - step_start_time
+ log((epoch, step % steps_per_epoch or steps_per_epoch, steps_per_epoch),
+ step, 'train',
+ {'loss': step_loss,
+ 'wer': 100.0 * wer,
+ 'throughput': step_utts / step_time,
+ 'took': step_time,
+ 'lrate': optimizer.param_groups[0]['lr']})
+
+ step_start_time = time.time()
+
+ if step % args.eval_frequency == 0:
+ wer = evaluate(epoch, step, val_loader, val_feat_proc,
+ symbols, model, ema_model, ctc_loss,
+ greedy_decoder, args.amp, use_dali)
+
+ if wer < best_wer and epoch >= args.save_best_from:
+ checkpointer.save(model, ema_model, optimizer, epoch,
+ step, best_wer, is_best=True)
+ best_wer = wer
+
+ step += 1
+ accumulated_batches = 0
+ # end of step
+
+ # DALI iterator need to be exhausted;
+ # if not using DALI, simulate drop_last=True with grad accumulation
+ if not use_dali and step > steps_per_epoch * epoch:
+ break
+
+ epoch_time = time.time() - epoch_start_time
+ epoch_loss /= steps_per_epoch
+ log((epoch,), None, 'train_avg', {'throughput': epoch_utts / epoch_time,
+ 'took': epoch_time,
+ 'loss': epoch_loss})
+ bmark_stats.update(epoch_utts, epoch_time, epoch_loss)
+
+ if epoch % args.save_frequency == 0 or epoch in args.keep_milestones:
+ checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
+
+ if 0 < args.epochs_this_job <= epoch - start_epoch:
+ print_once(f'Finished after {args.epochs_this_job} epochs.')
+ break
+ # end of epoch
+
+ if args.pyprof:
+ profiler.stop()
+ torch.autograd.profiler.emit_nvtx().__exit__(None, None, None)
+
+ log((), None, 'train_avg', bmark_stats.get(args.benchmark_epochs_num))
+
+ if epoch == args.epochs:
+ evaluate(epoch, step, val_loader, val_feat_proc, symbols, model,
+ ema_model, ctc_loss, greedy_decoder, args.amp, use_dali)
+
+ checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
+ flush_log()
+
+
+if __name__ == "__main__":
+ main()
diff --git a/PyTorch/contrib/audio/Jasper/triton/Dockerfile b/PyTorch/contrib/audio/Jasper/triton/Dockerfile
new file mode 100644
index 0000000000000000000000000000000000000000..9fda344c5cbae91d61145504e15b75de06d5b52e
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/triton/Dockerfile
@@ -0,0 +1,10 @@
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tritonserver:20.10-py3-clientsdk
+FROM ${FROM_IMAGE_NAME}
+
+RUN apt update && apt install -y python3-pyaudio libsndfile1
+
+RUN pip3 install -U pip
+RUN pip3 install onnxruntime unidecode inflect soundfile
+
+WORKDIR /workspace/jasper
+COPY . .
diff --git a/PyTorch/contrib/audio/Jasper/triton/README.md b/PyTorch/contrib/audio/Jasper/triton/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..38dd13ec2d780a407ffcd33a0b5cbf8a9a734cfe
--- /dev/null
+++ b/PyTorch/contrib/audio/Jasper/triton/README.md
@@ -0,0 +1,388 @@
+# Deploying the Jasper Inference model using Triton Inference Server
+
+This subfolder of the Jasper for PyTorch repository contains scripts for deployment of high-performance inference on NVIDIA Triton Inference Server as well as detailed performance analysis. It offers different options for the inference model pipeline.
+
+
+## Table Of Contents
+- [Solution overview](#solution-overview)
+- [Inference Pipeline in Triton Inference Server](#inference-pipeline-in-triton-inference-server)
+- [Setup](#setup)
+- [Quick Start Guide](#quick-start-guide)
+- [Advanced](#advanced)
+ * [Scripts and sample code](#scripts-and-sample-code)
+- [Performance](#performance)
+ * [Inference Benchmarking in Triton Inference Server](#inference-benchmarking-in-triton-inference-server)
+ * [Results](#results)
+ * [Performance Analysis for Triton Inference Server: NVIDIA T4](#performance-analysis-for-triton-inference-server-nvidia-t4)
+ * [Maximum batch size](#maximum-batch-size)
+ * [Batching techniques: Static versus Dynamic Batching](#batching-techniques-static-versus-dynamic)
+ * [TensorRT, ONNXRT-CUDA, and PyTorch JIT comparisons](#tensorrt-onnxrt-cuda-and-pytorch-jit-comparisons)
+- [Release Notes](#release-notes)
+ * [Changelog](#change-log)
+ * [Known issues](#known-issues)
+
+
+## Solution Overview
+
+The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
+
+This folder contains detailed performance analysis as well as scripts to run Jasper inference using Triton Inference Server.
+
+A typical Triton Inference Server pipeline can be broken down into the following steps:
+
+1. The client serializes the inference request into a message and sends it to the server (Client Send).
+2. The message travels over the network from the client to the server (Network).
+3. The message arrives at the server, and is deserialized (Server Receive).
+4. The request is placed on the queue (Server Queue).
+5. The request is removed from the queue and computed (Server Compute).
+6. The completed request is serialized in a message and sent back to the client (Server Send).
+7. The completed message then travels over the network from the server to the client (Network).
+8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).
+
+Generally, for local clients, steps 1-4 and 6-8 take only a small fraction of the total time compared to step 5. Since backend deep learning systems like Jasper are rarely exposed directly to end users, and instead interface only with local front-end servers, we can treat all clients as local for the purposes of this guide.
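+
+For illustration only, the sketch below shows what such a round trip looks like from Python using the generic `tritonclient` HTTP API. The model and tensor names are placeholders; the actual request logic used in this repository lives in `jasper-client.py` and the configs under `model_repo_configs/`.
+
+```python
+import numpy as np
+import tritonclient.http as httpclient
+
+# Create a client for a locally running Triton server.
+client = httpclient.InferenceServerClient(url="localhost:8000")
+
+audio = np.zeros((1, 16000), dtype=np.float32)                    # 1 second of silence at 16 kHz
+inp = httpclient.InferInput("AUDIO_SIGNAL", audio.shape, "FP32")  # placeholder tensor name
+inp.set_data_from_numpy(audio)
+
+# infer() covers steps 1-7: serialize, send, server queueing/compute, and the response.
+result = client.infer("jasper-pipeline", inputs=[inp])            # placeholder model name
+print(result.as_numpy("TRANSCRIPT"))                              # step 8: deserialized output
+```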
+
+In this section, we will go over how to launch both the Triton Inference Server and the client and get the best performance solution that fits your specific application needs.
+
+More information on how to perform inference using NVIDIA Triton Inference Server can be found in [triton/README.md](https://github.com/triton-inference-server/server/blob/master/README.md).
+
+
+## Inference Pipeline in Triton Inference Server
+
+The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend:
+
+**Data preprocessor**
+
+The data processor transforms an input raw audio file into a spectrogram. By default the pipeline uses mel filter banks as spectrogram features. This part does not have any learnable weights.
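+
+As a rough illustration, log-mel filterbank extraction can be sketched with `torchaudio` as below. The parameter values are placeholders only; the pipeline's actual featurizer is the `FilterbankFeatures` module in `common/features.py`, configured from the model's `.yaml` file.
+
+```python
+import torch
+import torchaudio
+
+# Illustrative settings only -- window, hop and filter counts come from the model config.
+featurizer = torchaudio.transforms.MelSpectrogram(
+    sample_rate=16000, n_fft=512, win_length=320, hop_length=160, n_mels=64)
+
+waveform, sample_rate = torchaudio.load("sample.wav")   # [channels, samples]
+features = torch.log(featurizer(waveform) + 1e-20)      # log-mel filterbank spectrogram
+```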
+
+**Acoustic model**
+
+The acoustic model takes in the spectrogram and outputs a probability distribution over a list of characters. This part is the most compute-intensive, accounting for more than 90% of the end-to-end pipeline's runtime. The acoustic model is the only component with learnable parameters and is what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (more details in the [Jasper PyTorch README](../README.md)). We do not use masking for inference.
+
+**Greedy decoder**
+
+The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability.
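+
+A minimal sketch of greedy CTC decoding for a single utterance is shown below; the repository's actual implementation is the `GreedyCTCDecoder` module used in `train.py`.
+
+```python
+def greedy_ctc_decode(log_probs, labels, blank_id):
+    """log_probs: [time, n_symbols] torch tensor; returns the collapsed transcription."""
+    best = log_probs.argmax(dim=-1).tolist()  # most likely symbol per frame
+    out, prev = [], blank_id
+    for idx in best:
+        # Skip repeated symbols and CTC blanks.
+        if idx != prev and idx != blank_id:
+            out.append(labels[idx])
+        prev = idx
+    return "".join(out)
+```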
+
+To run a model with TensorRT, we first construct the model in PyTorch and export it into an ONNX static graph. Finally, a TensorRT engine is built from the ONNX file and can be launched to do inference. The following table shows which backends are supported for each part of the model pipeline.
+
+|Backend\Pipeline component|Data preprocessor|Acoustic Model|Decoder|
+|---|---|---|---|
+|PyTorch JIT|x|x|x|
+|ONNX|-|x|-|
+|TensorRT|-|x|-|
+
+In order to run inference with TensorRT outside of the inference server, refer to the [Jasper TensorRT README](../tensorrt/README.md).
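+
+As a rough sketch of that export path, the acoustic model can be exported to ONNX with `torch.onnx.export` and the resulting file turned into a TensorRT engine (for example with `trtexec`). Tensor names and shapes below are illustrative; the repository's actual conversion is handled by `converter.py` via `triton/scripts/export_model.sh`.
+
+```python
+import torch
+
+# `model`, `feat` and `feat_lens` are assumed to be a constructed Jasper acoustic
+# model and example filterbank-feature inputs; the names here are illustrative only.
+torch.onnx.export(
+    model, (feat, feat_lens), "jasper.onnx",
+    input_names=["FEATURES", "FEATURE_LENGTHS"],
+    output_names=["LOG_PROBS", "ENCODED_LENGTHS"],
+    dynamic_axes={"FEATURES": {0: "batch", 2: "time"}},
+    opset_version=11)
+
+# Build a TensorRT engine from the ONNX graph, e.g.:
+#   trtexec --onnx=jasper.onnx --fp16 --saveEngine=jasper.plan
+```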
+
+
+## Setup
+
+The repository contains a folder `./triton` with a `Dockerfile` which extends the PyTorch 20.10-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
+
+- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+- [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+- [Triton Inference Server 20.10 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
+- Access to [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA CUDA repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
+- Supported GPUs:
+ - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+ - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
+ - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+- [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp)
+
+Required Python packages are listed in `requirements.txt`. These packages are automatically installed when the Docker container is built.
+
+
+## Quick Start Guide
+
+Running the following scripts will build and launch a container with all required dependencies for native PyTorch as well as Triton. This is necessary for running inference and can also be used for data download, processing, and training of the model. For more information on the scripts and arguments, refer to the [Advanced](#advanced) section.
+
+1. Clone the repository.
+
+ ```bash
+ git clone https://github.com/NVIDIA/DeepLearningExamples
+ cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
+ ```
+
+2. Build the Jasper PyTorch container.
+
+ Running the following scripts will build the container which contains all the required dependencies for data download and processing as well as converting the model.
+
+ ```bash
+ bash scripts/docker/build.sh
+ ```
+
+3. Start an interactive session in the Docker container:
+
+ ```bash
+ bash scripts/docker/launch.sh
+ ```
+
+   Where `<DATA_DIR>`, `<CHECKPOINT_DIR>` and `<RESULT_DIR>` can be either empty or absolute directory paths to the dataset, existing checkpoints, or potential output files. When left empty, they default to `datasets/`, `checkpoints/`, and `results/`, respectively. The `/datasets`, `/checkpoints`, `/results` directories will be mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host.
+
+   Note that `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` directly correspond to the same arguments in `scripts/docker/launch.sh` and `trt/scripts/docker/launch.sh` mentioned in the [Jasper PyTorch README](../README.md) and [Jasper TensorRT README](../tensorrt/README.md).
+
+   Briefly, `<DATA_DIR>` should contain, or be prepared to contain, a `LibriSpeech` sub-directory (created in [Acquiring Dataset](../trt/README.md)), `<CHECKPOINT_DIR>` should contain a PyTorch model checkpoint (`*.pt`) file obtained through training as described in the [Jasper PyTorch README](../README.md), and `<RESULT_DIR>` should be prepared to contain the converted model and logs.
+
+4. Downloading the `test-clean` part of `LibriSpeech` is required for model conversion. But it is not required for inference on Triton Inference Server, which can use a single .wav audio file. To download and preprocess LibriSpeech, run the following inside the container:
+
+ ```bash
+ bash triton/scripts/download_triton_librispeech.sh
+ bash triton/scripts/preprocess_triton_librispeech.sh
+ ```
+
+5. (Option 1) Convert pretrained PyTorch model checkpoint into Triton Inference Server compatible model backends.
+
+ Inside the container, run:
+
+ ```bash
+   export CHECKPOINT_PATH=<CHECKPOINT_PATH>
+   export CONVERT_PRECISIONS=<CONVERT_PRECISIONS>
+   export CONVERTS=<CONVERTS>
+ bash triton/scripts/export_model.sh
+ ```
+
+   Where `<CHECKPOINT_PATH>` (`"/checkpoints/jasper_fp16.pt"`) is the absolute file path of the pretrained checkpoint, `<CONVERT_PRECISIONS>` (`"fp16" "fp32"`) is the list of precisions used for conversion, and `<CONVERTS>` (`"feature-extractor" "decoder" "ts-trace" "onnx" "tensorrt"`) is the list of conversions to be applied. The feature extractor converts only to a TorchScript trace module (`feature-extractor`), the decoder only to a TorchScript script module (`decoder`), and the Jasper model can be converted to a TorchScript trace module (`ts-trace`), ONNX (`onnx`), or TensorRT (`tensorrt`).
+
+ A pretrained PyTorch model checkpoint for model conversion can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp).
+
+ More details can be found in the [Advanced](#advanced) section under [Scripts and sample code](#scripts-and-sample-code).
+
+6. (Option 2) Download pre-exported inference checkpoints from NGC.
+
+ Alternatively, you can skip the manual model export and download already generated model backends for every version of the model pipeline.
+
+ * [Jasper_ONNX](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_onnx_fp16_amp/version),
+ * [Jasper_TorchScript](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_torchscript_fp16_amp/version),
+ * [Jasper_TensorRT_Turing](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_turing/version),
+ * [Jasper_TensorRT_Volta](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_volta/version).
+
+   If you wish to use the TensorRT pipeline, make sure to download the correct version for your hardware. The extracted model folder should contain 3 subfolders: `feature-extractor-ts-trace`, `decoder-ts-script` and `jasper-x`, where `x` can be `ts-trace`, `onnx`, or `tensorrt` depending on the model backend. Copy the 3 model folders to the directory `./triton/model_repo/fp16` in your Jasper project.
+
+7. Build a container that extends Triton Inference Client:
+
+ From outside the container, run:
+
+ ```bash
+ bash triton/scripts/docker/build_triton_client.sh
+ ```
+
+Once the above steps are completed you can either run inference benchmarks or perform inference on real data.
+
+8. (Option 1) Run all inference benchmarks.
+
+ From outside the container, run:
+
+ ```bash
+   export RESULT_DIR=<RESULT_DIR>
+   export PRECISION_TESTS=<PRECISION_TESTS>
+   export BATCH_SIZES=<BATCH_SIZES>
+   export SEQ_LENS=<SEQ_LENS>
+ bash triton/scripts/execute_all_perf_runs.sh
+ ```
+
+   Where `<RESULT_DIR>` is the absolute path to potential output files (`./results`), `<PRECISION_TESTS>` is a list of precisions to be tested (`"fp16" "fp32"`), `<BATCH_SIZES>` is a list of tested batch sizes (`"1" "2" "4" "8"`), and `<SEQ_LENS>` is a list of tested sequence lengths (`"32000" "112000" "267200"`).
+
+ Note: This can take several hours to complete due to the extensiveness of the benchmark. More details about the benchmark are found in the [Advanced](#advanced) section under [Performance](#performance).
+
+9. (Option 2) Run inference on real data using the Client and Triton Inference Server.
+
+   9.1 From outside the container, restart the server:
+
+ ```bash
+ bash triton/scripts/run_server.sh
+ ```
+
+   9.2 From outside the container, submit the client request using:
+ ```bash
+ bash triton/scripts/run_client.sh
+ ```
+
+   Where `<MODEL_TYPE>` can be either "ts-trace", "tensorrt" or "onnx", and `<PRECISION>` is either "fp32" or "fp16". `<DATA_DIR>` is an absolute local path to the directory of files, and `<FILE>` is the path, relative to `<DATA_DIR>`, of either an audio file in .wav format or a manifest file in .json format.
+
+   Note: If `<FILE>` is a .json manifest, `<DATA_DIR>` should be the path to the LibriSpeech dataset. In this case the script will run both inference and evaluation on the corresponding LibriSpeech dataset.
+
+
+## Advanced
+
+The following sections provide greater details about the Triton Inference Server pipeline and inference analysis and benchmarking results.
+
+
+### Scripts and sample code
+
+The `triton/` directory contains the following files:
+* `jasper-client.py`: Python client script that takes an audio file and a specific model pipeline type and submits a client request to the server to run inference with the model on the given audio file.
+* `speech_utils.py`: helper functions for `jasper-client.py`.
+* `converter.py`: Python script for model conversion to different backends.
+* `jasper_module.py`: helper functions for `converter.py`.
+* `model_repo_configs/`: directory with Triton model config files for different backend and precision configurations.
+
+The `triton/scripts/` directory has easy to use scripts to run supported functionalities, such as:
+* `./docker/build_triton_client.sh`: builds container
+* `execute_all_perf_runs.sh`: runs all benchmarks using Triton Inference Server performance client; calls `generate_perf_results.sh`
+* `export_model.sh`: from pretrained PyTorch checkpoint generates backends for every version of the model inference pipeline.
+* `prepare_model_repository.sh`: copies model config files from `./model_repo_configs/` to `./deploy/model_repo` and creates links to generated model backends, setting up the model repository for Triton Inference Server
+* `generate_perf_results.sh`: runs benchmark with `perf-client` for specific configuration and calls `run_perf_client.sh`
+* `run_server.sh`: launches Triton Inference Server
+* `run_client.sh`: launches client by using `jasper-client.py` to submit inference requests to server
+
+
+### Running the Triton Inference Server
+
+Launch the Triton Inference Server in detached mode to run in the background by default:
+
+```bash
+bash triton/scripts/run_server.sh
+```
+
+To run in the foreground interactively, for debugging purposes, run:
+
+```bash
+DAEMON="--detach=false" bash triton/scripts/run_server.sh
+```
+
+The script mounts and loads models at `$PWD/triton/deploy/model_repo` to the server with all visible GPUs. In order to selectively choose the devices, set `NVIDIA_VISIBLE_DEVICES`.
+
+
+### Running the Triton Inference Client
+
+*Real data*
+In order to run the client with real data, run:
+
+```bash
+bash triton/scripts/run_client.sh