-
Release v1.3.0-beta1
This new beta release includes bug fixes and compatibility for Arm and Arm64 platforms (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Changes
- Added support for Arm and Arm64 architectures (#816).
- Added intrinsic mappings for BitOperations functions (#824).
- Clarified documentation for `.IOOperations()` (#818).
- Fixed issue promoting unsigned values on the evaluation stack (#822).
- Fixed critical issue in CPUAccelerator runtime (#836).
- Simplified IL Warp and Group extensions (#838).
Internal changes
- Bump GitHubActionsTestLogger from 1.4.1 to 2.0.0 in /Src (#813).
- Bump GitHubActionsTestLogger from 2.0.0 to 2.0.1 in /Src (#815).
- Bump Microsoft.NET.Test.Sdk from 17.2.0 to 17.3.1 in /Src (#832, #839).
- Bump Microsoft.NETFramework.ReferenceAssemblies in /Samples (#835).
- Bump xunit from 2.4.1 to 2.4.2 in /Src (#830).
- Added script to generate compatibility suppression files (#821).
- Fixed CodeQL scanning ‘Error’ alerts (#817).
- Significantly improved and fixed CI pipeline related to our GPU runners (#823, #827, #829, #833, #837).
Special thanks
Special thanks to @jgiannuzzi, @KosmosisDire, @MoFtZ, @NullandKale, and @pavlovic-ivan for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @kilngod, @NullandKale and @Yey007) for providing feedback, submitting issues and feature requests.
-
Release v1.2.0
This new release includes bug fixes and a significantly improved `O2` optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Changes
- Reviewed ILGPU documentation (#750, #776).
- Added Cuda ISA 7.5, ISA 7.6 and SM 8.7 (#778).
- Added support to fold Shuffle and Broadcast operations (#764).
- Added new sample demonstrating the use of ILGPU in Blazor Apps (#779).
- Improved performance by using uniform branches for NVIDIA GPUs (#765).
- Improved `LoopUnrolling` to cover more cases (#766).
- Improved inline PTX to support multiple output and by-ref parameters (#760).
- Fixed multi-dimensional RNG number generation (#808).
- Fixed issues with `LibDevice` integration (#784).
- Fixed issue with unsigned nested conversions (#772, #774).
- Fixed sample project target frameworks (#771).
Internal changes
- Bump FluentAssertions from 6.5.1 to 6.7.0 in /Src (#785, #807).
- Bump Microsoft.NET.Test.Sdk from 17.1.0 to 17.2.0 in /Src (#805).
- Bump xunit.runner.visualstudio from 2.4.3 to 2.4.5 in /Src (#804).
- Bump System.Memory from 4.5.4 to 4.5.5 in /Src (#785).
- Reset baseline for 1.2.0 (#777).
- Fixed several CI issues (#796, #809, #812).
Special thanks
Special thanks to @hokb, @jgiannuzzi, @kilngod, @MoFtZ, @pavlovic-ivan and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @MPSQUARK, @NullandKale and @Yey007) for providing feedback, submitting issues and feature requests.
-
Release v1.2.0-beta1
This new beta release includes bug fixes and a significantly improved `O2` optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Changes
- Reviewed ILGPU documentation (#750, #776).
- Added Cuda ISA 7.5, ISA 7.6 and SM 8.7 (#778).
- Added support to fold Shuffle and Broadcast operations (#764).
- Improved performance by using uniform branches for NVIDIA GPUs (#765).
- Improved `LoopUnrolling` to cover more cases (#766).
- Improved inline PTX to support multiple output and by-ref parameters (#760).
- Fixed issues with `LibDevice` integration (#784).
- Fixed issue with unsigned nested conversions (#772, #774).
- Fixed sample project target frameworks (#771).
Internal changes
Special thanks
Special thanks to @hokb, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @MPSQUARK, @NullandKale and @Yey007) for providing feedback, submitting issues and feature requests.
-
Release v1.1.0
This new release includes bug fixes, a huge set of new features (e.g. `LibDevice` integration, `CudaFFT` and `NVML` bindings) and a significantly improved `O2` optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Changes
- Bumped `System.Reflection.Metadata` from 6.0.0 to 6.0.1 (#767).
- Added `NVML` bindings (#518).
- Added `CuFFT` and `CuFFTW` bindings (#706).
- Added `NvJpeg` image-decoding bindings (#716, #721).
- Added `LibDevice` bindings to include highly optimized math functions on NVIDIA GPUs (#707).
- Added `FP16` support to `CuBlas` bindings (#658).
- Added new `alignment` methods to views to improve performance (#684).
- Added new global code scheduling transformation to `O2` pipeline (#704, #734).
- Improved debug view implementations of all array views (#647).
- Improved automatic vectorization (#668).
- Improved performance of dead-code elimination (#702).
- Improved loop-invariant code motion transformation (#703).
- Improved on-the-fly optimization of `SetField` operations (#671).
- Improved on-the-fly optimization of `LoadElementAddress` operations (#733).
- Fixed missing binding of accelerator instances during `Cuda` memcopy operations (#705).
- Fixed exception handling in the case of missing assembly binding redirects (#775).
- Fixed code-placement phase and invalid removal of DebugAssert values (#749).
- Fixed race condition in `CPUMultiprocessor` during lazy initialization (#747).
- Fixed inheritance to avoid removal of IOValue instances (#745).
- Fixed issue with the same phi value being reused in a loop (#756).
- Fixed issue with unique algorithm when running multiple iterations per group (#758).
- Prevented unintentional initialization of the current `Accelerator` instance (#714).
Internal changes
- Require .NET6 for building and enable package validation (#729).
- Bumped `T4.Build` from 0.2.3 to 0.2.4 (#767).
- Bumped `FluentAssertions` from 6.5.0 to 6.5.1 (#748).
- Bumped `Microsoft.NET.Test.SDK` from 17.0.0 to 17.1.0 (#752).
- Fixed warnings in NET6 builds (#710).
- Fixed missing struct constraint on `TraversalSuccessorsProvider` (#727).
- Added ILGPU logos to `logo` folder (#717).
Special thanks
Special thanks to @Debiday, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @mikhail-khalizev, @MPSQUARK, @NullandKale, @RER009 and @Yey007) for providing feedback, submitting issues and feature requests.
-
Release v1.1.0-beta1
This new beta release includes bug fixes, a huge set of new features (e.g. `LibDevice` integration, `CudaFFT` and `NVML` bindings) and a significantly improved `O2` optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Changes
- Added `NVML` bindings (#518).
- Added `CuFFT` and `CuFFTW` bindings (#706).
- Added `NvJpeg` image-decoding bindings (#716, #721).
- Added `LibDevice` bindings to include highly optimized math functions on NVIDIA GPUs (#707).
- Added `FP16` support to `CuBlas` bindings (#658).
- Added new `alignment` methods to views to improve performance (#684).
- Added new global code scheduling transformation to `O2` pipeline (#704, #734).
- Improved debug view implementations of all array views (#647).
- Improved automatic vectorization (#668).
- Improved performance of dead-code elimination (#702).
- Improved loop-invariant code motion transformation (#703).
- Improved on-the-fly optimization of `SetField` operations (#671).
- Improved on-the-fly optimization of `LoadElementAddress` operations (#733).
- Fixed missing binding of accelerator instances during `Cuda` memcopy operations (#705).
- Fixed exception handling in the case of missing assembly binding redirects (#730).
- Prevented unintentional initialization of the current `Accelerator` instance (#714).
Internal changes
- Require .NET6 for building and enable package validation (#729).
- Fixed warnings in NET6 builds (#710).
- Fixed missing struct constraint on `TraversalSuccessorsProvider` (#727).
- Added ILGPU logos to `logo` folder (#717).
Special thanks
Special thanks to @Debiday, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @mikhail-khalizev, @MPSQUARK, @NullandKale, @RER009 and @Yey007) for providing feedback, submitting issues and feature requests.
-
Release v1.0.0
This new stable release offers major performance improvements, new APIs to simplify programming, improve productivity and reduce programming errors. It also includes a lot of amazing new features (see below and get the Nuget package).
General notes
- We converted ILGPU into a monorepo project that includes ILGPU.Algorithms, ILGPU.Samples, the Wiki and enhanced documentation.
- This version has some breaking changes compared to previous stable ILGPU versions (see below).
Breaking changes
- The `Memory API`, involving `ArrayView` and `MemoryBuffer` types, has been significantly improved to support explicit `Stride` information (see below).
- All `IndexX` and `LongIndexX` types have been renamed to `IndexXD` and `LongIndexXD` to have a unified programming experience with respect to memory buffers and array views (see below).
- The `Device API` has been redesigned to explicitly enable, filter and configure the available hardware accelerator devices (see below).
- Parts of the Algorithms library have been refined to support the newly introduced stride types.
Changes
- Added new `Memory API` to support explicit stride information (#421, #475, #483).
- Added new `Device API` to enable, filter and configure the available hardware accelerator devices (#428).
- Added support for `OpenCL 3.0` API (#464).
- Added support for inline PTX assembly instructions (#467).
- Added support for multi-dimensional and static constant arrays (#479).
- Added support for convenient profiling using `ProfilingMarker`s (#482).
- Improved CPU runtime to support arbitrary `Warp`/`Group`/`Multiprocessor` configurations (#402, #484).
- Improved error messages (#466).
- Enabled folding of debug assertions in `IRBuilder` (#477).
- Fixed Group helper methods for multi-dimensional kernels (#481).
- Fixed invalid code generation of `OpenCL` kernels in the presence of constant switch conditions (#441).
- Promote `.NET 5` to a default target framework (#529, #536).
- Added new `Array` processing pipeline to have full support for nD-arrays (#513).
- Added convenience overloads for `AsNDView` (#571).
- Added support for zero-length `SubView` operations (#550).
- Added Backend optimizations for CPU backend to re-enable support for enhanced shared memory allocations (see #567) (#574).
- Added support for Cuda ISA 7.3 and 7.4 to support all latest drivers (#566).
- Added `UCE` transformation to the backend optimization passes (#569).
- Added VS integration of check styles to all projects and fixed style checking (#517, #511).
- Added CPU builder method to register custom CPU devices (#507).
- Added support for chaining `EnableAlgorithms` on Context builder instances (#515).
- Improved performance of all tests by enabling aggressive caching (#522).
- Improve hash codes of `IndexND` and `LongIndexND` types (#510).
- Changed `InvalidEntryPointIndexParameterOfWrongType` error message to be more descriptive (#535).
- Changed T4 `DllImportSearchPath` to `LegacyBehavior` (#514).
- Fixed constant folding when converting unsigned integers (#549).
- Fixed critical issue when swapping registers/variables in backends (#541).
- Fixed invalid copies from and to sub views (#523).
- Fixed and enhanced `Stride` and `ArrayView` types (#509).
- Fixed regression in single-pass scan when performing multiple iterations (#525).
- Fixed `RadixSortProvider` and `ScanProvider` test cases (#516).
- Removed obsolete properties and methods (#524).
Repository Changes
- Merged ILGPU.Samples into ILGPU repository (#538, #561, #563, #564, #565, #568).
- Merged ILGPU.Algorithms into ILGPU repository.
- Merged ILGPU Wiki into ILGPU repository (#537).
- Merged external ILGPU v0.10.1 documents (#546).
- Added information about symbols and source link to ReadMe file (#594).
CI Changes
- Add badges for versions and CI (#534).
- Skip publishing nuget packages on forks (#533).
- Selective builds on macOS, master and tags (#530).
- Fix NuGet publishing bug in CI pipeline (#572).
- Restricting the package CI job to run only once (#527).
- Run clean tests on push to master or tag without using caches (#526).
- Added support for releasing preview builds via `feedz.io` (#521, #520).
- Adapted CI for new ILGPU monorepo (#512).
Major internal changes
- Added build support for .Net5.0 (#446).
- Added support for T4.Build to automatically transform T4 text templates during build (#431).
- Restrict net47 unit tests to only run on CI builds (#465).
- Avoid duplicate CI runs for pull requests made from the same repo (#485).
- Updated InlineList implementation to reduce memory consumption (#478).
- Fixed invalid assertion affecting successor blocks in frontend (#445).
- Added missing struct type constraints (#532).
- Applied general cleanup (#531).
- Removed obsolete configurations from solutions (#599).
- Prepared conditional compilation for future .NET frameworks (#592).
- Updated .Net Framework version from `v4.7` to `v4.7.1` (#594).
- Added 1.0.0 pre-release documentation (#602).
- Added sample about inline `PTX` assembly instructions (#588).
- Added sample about monitoring progress on Cuda accelerators (#593).
- Added sample project for printf-like output in kernels (#600).
- Added sample project for debug asserts in kernels (#600).
- Added sample project for removing consecutive duplicate values (#600).
- Added sample project for calculating histograms (#600).
- Added sample project for fixed sized buffers (#600).
- Added support for zero-length subviews of zero-length views (#585).
- Guard against zero-length (`CUDA` and `CL`) allocations to enable allocations of zero bytes (#547, #610).
- Simplified naming of GetAsPageLockedArray and AllocatePageLockedArray (#608).
- Fixed transformation issues regarding many functions in kernel modules (without inlining) (#613).
- Fixed invalid detection and processing of loops consisting of a single entry block (#607).
- Fixed invalid conversion of LFA values in SSAStructureConstruction (affects array optimizations, #605).
Notes
- We updated the versions of the .Net dependencies (#576, #577, #578, #579, #580, #581, #582, #583, #586, #591, #595 and #601).
- We updated the required .Net Framework version (from `v4.7` to `v4.7.1`) to benefit from the most recent dependency updates (#595).
- We updated the ILGPU documentation and all samples to be compatible with this release (#584, #593, #600, #602).
Summary of the changes related to the new Memory API
The new API distinguishes between a coherent, strongly typed `ArrayView<T>` structure and its n-D versions `ArrayViewXD<T, TStride>`, which carry dimension-dependent stride information (the actual logic for computing element addresses is moved from the `IndexXD` types to the newly added `StrideXD` types). This allows developers to explicitly specify a particular stride of a view, `reinterpret` the data layout itself (by changing the stride), and perform compile-time optimizations based on explicitly typed stride information. Consequently, ILGPU’s optimization pipeline is able to remove the overhead of these abstractions in most cases (except in rare use cases where strange-looking strides are used). It also makes all memory transfer-related operations explicit in terms of what memory layout the underlying data will have after an operation is performed.
In addition, it moves all `copy` related methods to the `ArrayView` instances instead of exposing them on the memory buffers. This realizes a “separation of concerns”: On the one hand, a `MemoryBuffer` holds a reference to the native memory area and controls its lifetime. On the other hand, `ArrayView` structures manage the contents of these buffers and make them available to the actual GPU kernels.
Example:

    // Simple 1D allocation of 1024 longs with TStride = Stride1D.Dense
    // (all elements are accessed contiguously in memory)
    var t = accl.Allocate1D<long>(1024);

    // Advanced 1D allocation of 1024 longs with TStride = Stride1D.General(2)
    // (each memory access will skip 2 elements)
    // -> allocates 1024 * 2 longs to be able to access all of them
    var t1 = accl.Allocate1D<long, Stride1D.General>(1024, new Stride1D.General(2));

    // Simple 1D allocation of 1024 longs using the array provided
    var data1 = new long[1024];
    var t2 = accl.Allocate1D(data1);

    // Simple 2D allocation of 1024 * 1024 longs using the array provided with TStride = Stride2D.DenseX
    // (all elements in X dimension are accessed contiguously in memory)
    // -> this will *not* transpose the input buffer as the memory layout will be identical on CPU and GPU
    var data2 = new long[1024, 1024];
    var t3 = accl.Allocate2DDenseX(data2);

    // Simple 2D allocation of 1024 * 1024 longs using the array provided, with TStride = Stride2D.DenseY
    // (all elements in Y dimension are accessed contiguously in memory)
    // -> this *will* transpose the input buffer to match the desired data layout
    var data3 = new long[1024, 1024];
    var t4 = accl.Allocate2DDenseY(data3);

The major changes/features of the new Memory API are:
- `Index1`, `Index2` and `Index3` types have been renamed to `Index1D`, `Index2D` and `Index3D` to match the naming scheme of `ArrayViewXD` and `MemoryBufferXD` types.
- `LongIndex1`, `LongIndex2` and `LongIndex3` types have been renamed to `LongIndex1D`, `LongIndex2D` and `LongIndex3D` to match the naming scheme of the `ArrayViewXD` and `MemoryBufferXD` types.
- Separation of concerns between `MemoryBuffer` and `ArrayView` instances:
  - `ArrayView...` structures represent and manage the contents of buffers (or chunks of buffers).
  - `MemoryBuffer...` classes manage the lifetime of allocated memory chunks on a device.
- The `ILGPU.ArrayView` intrinsic structure implements the newly added `IContiguousArrayView` interface that marks contiguous memory sections.
- The `ILGPU.Runtime.MemoryBuffer...` classes implement the newly added `IContiguousArrayView` interface that marks contiguous memory sections.
- Types implementing the `IContiguousArrayView` interface provide extension methods for initializing, copying from and to the memory region (not supported on accelerators).
- This PR adds the notion of `Stride`s. ILGPU contains built-in common strides for 1D, 2D and 3D views.
  - `Stride1D.Dense` represents contiguous chunks of memory that pack elements side by side.
  - `Stride1D.General` represents strides that skip a certain number of elements.
  - `Stride2D.DenseX` represents 2D strides that pack elements side by side in dimension X (transfers from and to views with this stride involve transpose operations).
  - `Stride2D.DenseY` represents 2D strides that pack elements in the Y dimension side by side.
  - `Stride2D.General` represents strides that skip a certain number of elements in the X and Y dimensions.
  - `Stride3D.DenseXY` represents 3D strides that pack elements in the X,Y dimension side by side (transfers from and to views with this stride involve transposition operations).
  - `Stride3D.DenseZY` represents 3D strides that pack elements in the Z,Y dimension side by side.
  - `Stride3D.General` represents strides that skip a certain number of elements in the X, Y and Z dimensions.
- All `ArrayViewXD` types have been moved to the `ILGPU.Runtime` namespace.
- The `ArrayViewXD` types do not implement `IContiguousArrayView`, as they support arbitrary stride information. Note that the `ArrayView1D<T, Stride1D.Dense>` specialization has an implicit conversion to `ArrayView<T>` (and vice versa) for auxiliary purposes.
- All `CopyFromCPU` and `CopyToCPU` methods are provided with additional hints as to whether they are transposing the input elements or keeping the original layout.
- Note that `GetAsXDArray(...)` always returns elements in .Net standard layout for 1D, 2D and 3D arrays (this may result in transposing the input elements of the buffer on the CPU). Use `view.AsContiguous().GetAsArray()` to get the memory layout of the input buffer.
This also affects the implementation of all `IndexND` types. We moved the index reconstruction functions from the index types to the individual stride implementations. Old way:

    Index2D index = <some_extent>.ReconstructIndex(index);

New way:

    Index2D index = Stride2D.DenseX.ReconstructFromElementIndex(index, <some_extent>);
    // .. or ..
    Index2D index = Stride2D.DenseY.ReconstructFromElementIndex(index, <some_extent>);

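To make the role of the stride types more concrete, the following self-contained sketch models how a dense 2D stride could map a 2D index to a linear element index and back. It deliberately does not use ILGPU: `StrideSketch` and all of its methods are hypothetical names, and the formulas are an assumption derived only from the descriptions above (a dense-X layout packs neighbors in X side by side, a dense-Y layout packs neighbors in Y side by side). The actual ILGPU implementation may differ in detail.

```csharp
using System;

// Hypothetical model of dense 2D strides; NOT part of the ILGPU API.
public static class StrideSketch
{
    // Dense in X: elements that are neighbors in X are adjacent in memory.
    public static long DenseXIndex(int x, int y, int extentX, int extentY) =>
        x + (long)y * extentX;

    // Dense in Y: elements that are neighbors in Y are adjacent in memory.
    public static long DenseYIndex(int x, int y, int extentX, int extentY) =>
        y + (long)x * extentY;

    // Inverse of DenseXIndex: recover (x, y) from a linear element index.
    public static (int X, int Y) ReconstructDenseX(
        long elementIndex, int extentX, int extentY) =>
        ((int)(elementIndex % extentX), (int)(elementIndex / extentX));

    // Inverse of DenseYIndex: recover (x, y) from a linear element index.
    public static (int X, int Y) ReconstructDenseY(
        long elementIndex, int extentX, int extentY) =>
        ((int)(elementIndex / extentY), (int)(elementIndex % extentY));

    public static void Main()
    {
        const int extentX = 4, extentY = 3;

        long denseX = DenseXIndex(2, 1, extentX, extentY); // 2 + 1 * 4 = 6
        long denseY = DenseYIndex(2, 1, extentX, extentY); // 1 + 2 * 3 = 7
        Console.WriteLine($"DenseX: {denseX}, DenseY: {denseY}");

        // Both reconstructions return the original index (2, 1).
        Console.WriteLine(ReconstructDenseX(denseX, extentX, extentY));
        Console.WriteLine(ReconstructDenseY(denseY, extentX, extentY));
    }
}
```

The same 2D index lands at different linear positions depending on the stride, which is why the reconstruction logic has to live with the stride rather than with the index type.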
Summary of the changes related to the new Device API
The new Device API removes the enumeration `ContextFlags` and implements the same functionality in an object oriented way using a `Context.Builder` class. It offers a fluent-API like configuration interface which makes it easy to set up:

    // Enables all supported accelerators (default CPU accelerator only) and puts the context
    // into auto-assertion mode via "AutoAssertions()". In other words, if a debugger is attached,
    // the `Context` instance will turn on all assertion checks. This behavior is identical
    // to the current implementation via new Context();
    using var context = Context.CreateDefault();

    // Turns on O2 and enables all compatible Cuda devices.
    using var context = Context.Create(builder =>
    {
        builder.Optimize(OptimizationLevel.O2).Cuda();
    });

    // Turns on all assertions, enables the IR verifier and enables all compatible OpenCL devices.
    using var context = Context.Create(builder =>
    {
        builder.Assertions().Verify().OpenCL();
    });

    // Turns on kernel source-line annotations, fast math using 32-bit float and enables
    // *all* (even incompatible) OpenCL devices.
    using var context = Context.Create(builder =>
    {
        builder
            .DebugSymbols(DebugSymbolsMode.KernelSourceAnnotations)
            .Math(MathMode.Fast32BitOnly)
            .OpenCL(device => true);
    });

    // Selects an OpenCL device with a warp size of at least 32:
    using var context = Context.Create(builder =>
    {
        builder.OpenCL(device => device.WarpSize >= 32);
    });

    // Turns on all assertions in debug mode (same behavior like calling CreateDefault()):
    using var context = Context.Create(builder =>
    {
        builder.AutoAssertions();
    });

    // Turns on debug optimizations (level O0) and all assertions if a debugger is attached:
    using var context = Context.Create(builder =>
    {
        builder.AutoDebug();
    });

    // Turns on debug mode (optimization level O0, assertions and kernel debug information):
    using var context = Context.Create(builder =>
    {
        builder.Debug();
    });

    // Disable caching, enable conservative inlining and inline mutable static field values:
    using var context = Context.Create(builder =>
    {
        builder
            .Caching(CachingMode.Disabled)
            .Inlining(InliningMode.Conservative)
            .StaticFields(StaticFieldMode.MutableStaticFields);
    });

    // Turn on *all* CPU accelerators that simulate different hardware platforms:
    using var context = Context.Create(builder => builder.CPU());

    // Turn on an AMD-based CPU accelerator:
    using var context = Context.Create(builder => builder.CPU(CPUDeviceKind.AMD));

Note that by default all debug symbols are automatically turned off unless a debugger is attached. If you want to turn on the debug information in all cases, call `builder.DebugSymbols(DebugSymbolsMode.Basic)`.
At the same time, this PR introduces the notion of a `Device`, which replaces the implementation of `AcceleratorId`. This allows us to query detailed device information without explicitly instantiating an accelerator:

    // Print all device information without instantiating a single accelerator
    // (device context) instance.
    using var context = Context.Create(...);
    foreach (var device in context)
    {
        // Print detailed accelerator information
        device.PrintInformation();
        // ...
    }

Note that we removed the ability to call the accelerator constructors (e.g. `new CudaAccelerator(...)`) directly. Either use the `CreateAccelerator` methods defined in the `Device` classes or use one of the extension methods like `CreateCudaAccelerator(...)` of the `Context` class itself:

    using var context = Context.Create(...);
    foreach (var device in context)
    {
        // Instantiate an accelerator instance on this device
        using Accelerator accel = device.CreateAccelerator();
        // ...
    }

    // Instantiate the 2nd Cuda accelerator (NOTE that this is the *2nd* Cuda device
    // and *not* the 2nd device of your machine).
    using CudaAccelerator cudaDevice = context.CreateCudaAccelerator(1);

    // Instantiate the 1st OpenCL accelerator (NOTE that this is the *1st* OpenCL device
    // and *not* the 1st device of your machine).
    using CLAccelerator clDevice = context.CreateOpenCLAccelerator(0);

`Context` properties that expose types from other (ILGPU-internal) namespaces, which cannot or should not be covered by the API/ABI guarantees we want to give, have been made `internal`. To access these properties, use one of the available extension methods located in the corresponding namespaces:

    using var context = ...

    // OLD way
    var internalIRContext = context.IRContext;

    // NEW way:
    // using namespace ILGPU.IR;
    var internalIRContext = context.GetIRContext();

Using the Algorithms Library with the new Memory and Device APIs
To use the new version of the algorithms library with ILGPU v1.0.0, you need to initialize the library with the help of the new builder pattern:

    // Enables all algorithm library features
    using var context = Context.Create(builder =>
    {
        builder.EnableAlgorithms();
    });

Improved CPU runtime to support arbitrary Warp/Group/Multiprocessor configurations
The new CPU runtime significantly improves the existing `CPUAccelerator` runtime by adding support for user-defined `warp`, `group` and `multiprocessor` configurations. It changes the internal functionality to simulate a single warp of at least 2 threads (which ensures that all shuffle-based/reduction-like algorithms can also be run on the CPU by default). At the same time, each virtual multiprocessor can only execute a single thread group at a time. Increasing the number of virtual multiprocessors allows the user to simulate multiple concurrent groups. Most use cases will not require more than a single multiprocessor in practice.
Note that all device-wide static `Grid`/`Group`/`Atomic`/`Warp` classes are fully supported to debug/simulate all ILGPU kernels on the CPU.
Note that a custom warp size must be a multiple of 2.
This PR adds a new set of static creation methods:
- `CreateDefaultSimulator(...)` which creates a `CPUAccelerator` instance with 4 threads per warp, 4 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 16`).
- `CreateNvidiaSimulator(...)` which creates a `CPUAccelerator` instance with 32 threads per warp, 32 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 1024`).
- `CreateAMDSimulator(...)` which creates a `CPUAccelerator` instance with 32 threads per warp, 8 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 256`).
- `CreateLegacyAMDSimulator(...)` which creates a `CPUAccelerator` instance with 64 threads per warp, 4 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 256`).
- `CreateIntelSimulator(...)` which creates a `CPUAccelerator` instance with 16 threads per warp, 8 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 128`).
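The `MaxGroupSize` values listed above are all consistent with the product of threads per warp and warps per multiprocessor. The sketch below simply checks that arithmetic together with the stated warp-size constraint. It is plain C# without any ILGPU dependency; `SimulatorSketch`, `MaxGroupSize(...)` and `IsValidWarpSize(...)` are hypothetical helpers, and the product rule is an assumption inferred from the listed numbers rather than a documented formula.

```csharp
using System;

// Hypothetical helper that models the simulator presets; NOT part of ILGPU.
public static class SimulatorSketch
{
    // Assumed rule: a group can hold at most all threads of all warps
    // of a single (virtual) multiprocessor.
    public static int MaxGroupSize(int threadsPerWarp, int warpsPerMultiprocessor) =>
        threadsPerWarp * warpsPerMultiprocessor;

    // The notes state: at least 2 threads per warp, and a multiple of 2.
    public static bool IsValidWarpSize(int threadsPerWarp) =>
        threadsPerWarp >= 2 && threadsPerWarp % 2 == 0;

    public static void Main()
    {
        // (preset name, threads per warp, warps per multiprocessor)
        var presets = new (string Name, int ThreadsPerWarp, int Warps)[]
        {
            ("Default", 4, 4),
            ("Nvidia", 32, 32),
            ("AMD", 32, 8),
            ("LegacyAMD", 64, 4),
            ("Intel", 16, 8),
        };

        foreach (var (name, threads, warps) in presets)
        {
            Console.WriteLine(
                $"{name}: valid warp size = {IsValidWarpSize(threads)}, " +
                $"MaxGroupSize = {MaxGroupSize(threads, warps)}");
        }
    }
}
```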
Furthermore, this PR adds support for advanced debugging features that enable a “sequential-like” execution mode. In this mode, each thread of a group will run sequentially one after another until it hits a synchronization barrier or exits the kernel function. This allows users to conveniently debug larger thread groups consisting of concurrent threads without switching to single-threaded execution. This behavior can be controlled via the newly added `CPUAcceleratorMode` enum:

    /// <summary>
    /// The accelerator mode to be used with the <see cref="CPUAccelerator"/>.
    /// </summary>
    public enum CPUAcceleratorMode
    {
        /// <summary>
        /// The automatic mode uses <see cref="Sequential"/> if a debugger is attached.
        /// It uses <see cref="Parallel"/> if no debugger is attached to the
        /// application.
        /// </summary>
        /// <remarks>
        /// This is the default mode.
        /// </remarks>
        Auto = 0,

        /// <summary>
        /// If the CPU accelerator uses a simulated sequential execution mechanism. This
        /// is particularly useful to simplify debugging. Note that different threads for
        /// distinct multiprocessors may still run in parallel.
        /// </summary>
        Sequential = 1,

        /// <summary>
        /// A parallel execution mode that runs all execution threads in parallel. This
        /// reduces processing time but makes it harder to use a debugger.
        /// </summary>
        Parallel = 2,
    }

By default, all `CPUAccelerator` instances use the automatic mode (`CPUAcceleratorMode.Auto`) that switches to a sequential execution model as soon as a debugger is attached to the application.
Note that threads in the scope of multiple multiprocessors may still run in parallel.
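The automatic mode described above can be pictured as a small decision function. The following sketch is an illustration only and does not reproduce ILGPU's internal logic: `ModeSketch` mirrors the three enum values, and `ModeResolver.Resolve(...)` is a hypothetical helper. The only real API used is `System.Diagnostics.Debugger.IsAttached`.

```csharp
using System;
using System.Diagnostics;

// Hypothetical mirror of the three modes; NOT the ILGPU enum itself.
public enum ModeSketch
{
    Auto = 0,
    Sequential = 1,
    Parallel = 2,
}

public static class ModeResolver
{
    // Assumed behavior: Auto picks Sequential when a debugger is attached and
    // Parallel otherwise; explicit modes are passed through unchanged.
    public static ModeSketch Resolve(ModeSketch requested, bool debuggerAttached) =>
        requested == ModeSketch.Auto
            ? (debuggerAttached ? ModeSketch.Sequential : ModeSketch.Parallel)
            : requested;

    public static void Main()
    {
        // Debugger.IsAttached reports whether a managed debugger is attached.
        var effective = Resolve(ModeSketch.Auto, Debugger.IsAttached);
        Console.WriteLine($"Effective mode: {effective}");
    }
}
```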
Special thanks
Special thanks to @76creates, @conghuiw, @deng0, @GPSnoopy, @jgiannuzzi, @Joey9801, @ljubon, @MoFtZ, @Nnelg, @nullandkale and @sucrose0413 for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @faruknane, @mikhail-khalizev, @MPSQUARK, @Ruberik, @Yey007, and @yuryGotham) for providing feedback, submitting issues and feature requests.
-
Release v1.0.0-rc3
This final release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes performance improvements and several bug fixes including critical patches for the internal loop optimization phases and cross-device peer accesses (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Breaking Changes
- Refined the API for building custom `Atomic` implementations to overcome performance limitations (#667).
Changes
- Added explicit conversion methods for `ArrayView` and `ArrayView1D` (#666).
- Improved `Atomics` performance (#667).
- Fixed issue with enabling `IO` operations (#694).
- Fixed invalid peer-access functionality (#675).
- Fixed invalid address-space inference in the presence of generic view-based casts (#670).
- Fixed critical issues in `LoopUnrolling` phases (#653, #657, #661).
- Fixed invalid thread configuration in `CPUDevice` and `CPUMultiprocessor` classes (#665).
- Fixed missing `NotInsideKernel` attributes on `MemSet` functions (#651).
- Fixed missing bindings of the current accelerator in the scope of profiling markers (#644).
- Fixed radix sort on floating point data types (#643).
Repository Changes
- Polished readme, build and license information (#650, #655).
- Updated samples to new Atomic function API (#667).
Major internal changes
- Bumped several test dependency packages (#659, #662).
- Bumped SourceLink dependencies to v1.1.1 (#689, #690).
- Bumped T4.Build version to v0.2.3 (#685).
- Added automatic skipping of specific CPU tests on MacOS runners (#669).
Special thanks
Special thanks to @MoFtZ, @jgiannuzzi, @deng0 and @conghuiw for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.0.0-rc2...v1.0.0-rc3
-
Release v1.0.0-rc2
This new release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes bug fixes, new features and refined ILGPU `Index`/`Stride`, `ScanExtensions`, `RadixSortExtensions` and `CuBlas` APIs (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Breaking Changes
- Refined `Index1D|Index2D|Index3D|LongIndex1D|LongIndex2D|LongIndex3D` type API surface: removed multidimensional index reconstruction methods.
- Added new multidimensional index reconstruction methods to `Stride1D|Stride2D|Stride3D` types.
- Moved bounds checking from `ArrayView1D|ArrayView2D|ArrayView3D` types to `Stride1D|Stride2D|Stride3D` types.
- Refined `CuBlas` API to be compatible with stride information.
- Refined `Scan` and `RadixSort` APIs to be compatible with stride information.
Changes
- Updated Docs to include links to samples (#618).
- Updated CuBlas interface to work on views with stride information (#631).
- Updated Algorithms.Scan implementation to work on arbitrary stride types (#632).
- Updated Algorithms.RadixSort implementation to work on arbitrary stride types (#637).
- Fixed generated call to ValueType.GetHashCode (#617).
- Fixed invalid alignment of dynamic shared memory allocations (#630).
- Fixed OutOfRessources when emitting code with debug assertions turned on using the Cuda backend (#628).
- Fixed race condition in WarpReductions.Reduce for CPU accelerators (#627).
- Refined index reconstruction methods and fixed element index assertions (#629).
- Refined bounds checks of CUDA and OpenCL APIs (#619).
- Improved hash code of index types to avoid copyright issues (#622).
- Ensure Cuda accelerator is bound before calling CuBlas methods (#624).
- Improved runtime performance of the CPU accelerator launcher (#626).
Repository Changes
Major internal changes
- Allow building net471 target without Windows (#616).
Special thanks
Special thanks to @MoFtZ, @nullandkale, @jgiannuzzi, @Joey9801, @lostmsu and @kilngod for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.0.0-rc1...v1.0.0-rc2
-
Release v1.0.0-rc1
This new release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes bug fixes, a lot of amazing new features and improved samples and documentation (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Notes
- We updated the versions of the .Net dependencies (#576, #577, #578, #579, #580, #581, #582, #583, #586, #591, #595 and #601).
- We updated the required .Net Framework version (from v4.7 to v4.7.1) to benefit from the most recent dependency updates (#595).
- We updated the ILGPU documentation and all samples to be compatible with the latest preview releases (#584, #593, #600, #602).
Changes
- Updated .Net Framework version from v4.7 to v4.7.1 (#594).
- Added 1.0.0 pre-release documentation (#602).
- Added sample about inline PTX assembly instructions (#588).
- Added sample about monitoring progress on Cuda accelerators (#593).
- Added sample project for printf-like output in kernels (#600).
- Added sample project for debug asserts in kernels (#600).
- Added sample project for removing consecutive duplicate values (#600).
- Added sample project for calculating histograms (#600).
- Added sample project for fixed sized buffers (#600).
- Added support for zero-length subviews of zero-length views (#585).
- Guard against zero-length (CUDA and CL) allocations to enable allocations of zero bytes (#547, #610).
- Simplified naming of GetAsPageLockedArray and AllocatePageLockedArray (#608).
- Fixed transformation issues regarding many functions in kernel modules (without inlining) (#613).
- Fixed invalid detection and processing of loops consisting of a single entry block (#607).
- Fixed invalid conversion of LFA values in SSAStructureConstruction (affects array optimizations, #605).
Repository Changes
- Added information about symbols and source link to ReadMe file (#594).
Major internal changes
- Removed obsolete configurations from solutions (#599).
- Prepared conditional compilation for future .NET frameworks (#592).
Special thanks
Special thanks to @MoFtZ, @nullandkale, @Joey9801, @jgiannuzzi and @sucrose0413 for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v1.0.0-beta3
This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Notes
- We converted ILGPU into a monorepo project including ILGPU.Algorithms, ILGPU.Samples, the Wiki and enhanced documentation.
- This version has some breaking changes compared to previous stable ILGPU versions (see also Release v1.0.0-beta1).
Changes
- Promoted .NET 5 to a default target framework (#529, #536).
- Added new Array processing pipeline to have full support for nD-arrays (#513).
- Added convenience overloads for AsNDView (#571).
- Added support for zero-length SubView operations (#550).
- Added backend optimizations for the CPU backend to re-enable support for enhanced shared memory allocations (see #567) (#574).
- Added support for Cuda ISA 7.3 and 7.4 to support all latest drivers (#566).
- Added UCE transformation to the backend optimization passes (#569).
- Added VS integration of check styles to all projects and fixed style checking (#517, #511).
- Added CPU builder method to register custom CPU devices (#507).
- Added support for chaining EnableAlgorithms on Context builder instances (#515).
- Improved performance of all tests by enabling aggressive caching (#522).
- Improved hash codes of IndexND and LongIndexND types (#510).
- Changed InvalidEntryPointIndexParameterOfWrongType error message to be more descriptive (#535).
- Changed T4 DllImportSearchPath to LegacyBehavior (#514).
- Fixed constant folding when converting unsigned integers (#549).
- Fixed critical issue when swapping registers/variables in backends (#541).
- Fixed invalid copies from and to sub views (#523).
- Fixed and enhanced Stride and ArrayView types (#509).
- Fixed regression in single-pass scan when performing multiple iterations (#525).
- Fixed RadixSortProvider and ScanProvider test cases (#516).
- Removed obsolete properties and methods (#524).
Repository Changes
- Merged ILGPU.Samples into ILGPU repository (#538, #561, #563, #564, #565, #568).
- Merged ILGPU.Algorithms into ILGPU repository.
- Merged ILGPU Wiki into ILGPU repository (#537).
- Merged external ILGPU v0.10.1 documents (#546).
CI Changes
- Add badges for versions and CI (#534).
- Skip publishing nuget packages on forks (#533).
- Selective builds on macOS, master and tags (#530).
- Fix NuGet publishing bug in CI pipeline (#572).
- Restricting the package CI job to run only once (#527).
- Run clean tests on push to master or tag without using caches (#526).
- Added support for releasing preview builds via feedz.io (#521, #520).
Major internal changes
- Adapted CI for new ILGPU monorepo (#512).
- Added missing struct type constraints (#532).
- Applied general cleanup (#531).
Special thanks
Special thanks to @MoFtZ, @Joey9801, @jgiannuzzi, @nullandkale, @76creates, @Nnelg and @ljubon for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v1.0.0-beta2
This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the Nuget package).
Please note that this version has some breaking changes compared to previous ILGPU versions. Refer to the v1.0-beta1 summary for more information.
-
Release v1.0.0-beta1
This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the Nuget package).
Please note that this version has some breaking changes compared to previous ILGPU versions.
Breaking changes
- The Memory API, involving ArrayView and MemoryBuffer types, has been significantly improved to support explicit Stride information (see below).
- All IndexX and LongIndexX types have been renamed to IndexXD and LongIndexXD to have a unified programming experience with respect to memory buffers and array views (see below).
- The Device API has been redesigned to explicitly enable, filter and configure the available hardware accelerator devices (see below).
Changes
- Added new Memory API to support explicit stride information (#421, #475, #483).
- Added new Device API to enable, filter and configure the available hardware accelerator devices (#428).
- Added support for OpenCL 3.0 API (#464).
- Added support for inline PTX assembly instructions (#467).
- Added support for multi-dimensional and static constant arrays (#479).
- Added support for convenient profiling using ProfilingMarkers (#482).
- Improved CPU runtime to support arbitrary Warp/Group/Multiprocessor configurations (#402, #484).
- Improved error messages (#466).
- Enabled folding of debug assertions in IRBuilder (#477).
- Fixed Group helper methods for multi-dimensional kernels (#481).
- Fixed invalid code generation of OpenCL kernels in the presence of constant switch conditions (#441).
Summary of the changes related to the new Memory API
The new API distinguishes between a coherent, strongly typed ArrayView<T> structure and its n-D versions ArrayViewXD<T, TStride>, which carry dimension-dependent stride information (the actual logic for computing element addresses is moved from the IndexXD types to the newly added StrideXD types). This allows developers to explicitly specify a particular stride of a view, reinterpret the data layout itself (by changing the stride), and perform compile-time optimizations based on explicitly typed stride information. Consequently, ILGPU’s optimization pipeline is able to remove the overhead of these abstractions in most cases (except in rare use cases where strange-looking strides are used). It also makes all memory transfer-related operations explicit in terms of what memory layout the underlying data will have after an operation is performed.
In addition, it moves all copy-related methods to the ArrayView instances instead of exposing them on the memory buffers. This realizes a “separation of concerns”: On the one hand, a MemoryBuffer holds a reference to the native memory area and controls its lifetime. On the other hand, ArrayView structures manage the contents of these buffers and make them available to the actual GPU kernels.
Example:
    // Simple 1D allocation of 1024 longs with TStride = Stride1D.Dense
    // (all elements are accessed contiguously in memory)
    var t = accl.Allocate1D<long>(1024);

    // Advanced 1D allocation of 1024 longs with TStride = Stride1D.General(2)
    // (each memory access will skip 2 elements)
    // -> allocates 1024 * 2 longs to be able to access all of them
    var t1 = accl.Allocate1D<long, Stride1D.General>(1024, new Stride1D.General(2));

    // Simple 1D allocation of 1024 longs using the array provided
    var data1 = new long[1024];
    var t2 = accl.Allocate1D(data1);

    // Simple 2D allocation of 1024 * 1024 longs using the array provided with TStride = Stride2D.DenseX
    // (all elements in X dimension are accessed contiguously in memory)
    // -> this will *not* transpose the input buffer as the memory layout will be identical on CPU and GPU
    var data2 = new long[1024, 1024];
    var t3 = accl.Allocate2DDenseX(data2);

    // Simple 2D allocation of 1024 * 1024 longs using the array provided, with TStride = Stride2D.DenseY
    // (all elements in Y dimension are accessed contiguously in memory)
    // -> this *will* transpose the input buffer to match the desired data layout
    var data3 = new long[1024, 1024];
    var t4 = accl.Allocate2DDenseY(data3);
The major changes/features of the new Memory API are:
- Index1, Index2, Index3 types have been renamed to Index1D, Index2D, Index3D to match the naming scheme of ArrayViewXD and MemoryBufferXD types.
- LongIndex1, LongIndex2, LongIndex3 types have been renamed to LongIndex1D, LongIndex2D, LongIndex3D to match the naming scheme of the ArrayViewXD and MemoryBufferXD types.
- Separation of concerns between MemoryBuffer and ArrayView instances:
  - ArrayView... structures represent and manage the contents of buffers (or chunks of buffers).
  - MemoryBuffer... classes manage the lifetime of allocated memory chunks on a device.
- The ILGPU.ArrayView intrinsic structure implements the newly added IContiguousArrayView interface that marks contiguous memory sections.
- The ILGPU.Runtime.MemoryBuffer... classes implement the newly added IContiguousArrayView interface that marks contiguous memory sections.
- Types implementing the IContiguousArrayView interface provide extension methods for initializing, copying from and to the memory region (not supported on accelerators).
- This PR adds the notion of Strides. ILGPU contains built-in common strides for 1D, 2D and 3D views.
  - Stride1D.Dense represents contiguous chunks of memory that pack elements side by side.
  - Stride1D.General represents strides that skip a certain number of elements.
  - Stride2D.DenseX represents 2D strides that pack elements side by side in dimension X (transfers from and to views with this stride involve transpose operations).
  - Stride2D.DenseY represents 2D strides that pack elements in the Y dimension side by side.
  - Stride2D.General represents strides that skip a certain number of elements in the X and Y dimensions.
  - Stride3D.DenseXY represents 3D strides that pack elements in the X,Y dimension side by side (transfers from and to views with this stride involve transposition operations).
  - Stride3D.DenseZY represents 3D strides that pack elements in the Z,Y dimension side by side.
  - Stride3D.General represents strides that omit a certain number of elements in the X, Y and Z dimensions.
- All ArrayViewXD types have been moved to the ILGPU.Runtime namespace.
- None of the ArrayViewXD types implement IContiguousArrayView, as they support arbitrary stride information. Note that the ArrayView1D<T, Stride1D.Dense> specialization has an implicit conversion to ArrayView<T> (and vice versa) for auxiliary purposes.
- All CopyFromCPU and CopyToCPU methods are provided with additional hints as to whether they are transposing the input elements or keeping the original layout.
- Note that GetAsXDArray(...) always returns elements in .Net standard layout for 1D, 2D and 3D arrays (this may result in transposing the input elements of the buffer on the CPU). Use view.AsContiguous().GetAsArray() to get the memory layout of the input buffer.
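To make the stride descriptions above more concrete, the following plain C# sketch models how a stride could turn a logical index into a linear element offset. It is an illustration only and does not use ILGPU: the class StrideSketch and its methods are hypothetical names, and the formulas are an assumed reading of "pack elements side by side" and "skip a certain number of elements", not the library's implementation.

```csharp
using System;

// Illustrative model only: these helpers are NOT the ILGPU Stride types.
static class StrideSketch
{
    // Dense 1D: element i is stored at linear offset i.
    public static long Dense1D(long i) => i;

    // General 1D with a step of `step` elements: element i is stored at i * step.
    // Under this reading, 1024 elements with a step of 2 span 1024 * 2 slots.
    public static long General1D(long i, long step) => i * step;

    // 2D, packed along X: neighbors in X are adjacent in memory.
    public static long DenseX(long x, long y, long width) => y * width + x;

    // 2D, packed along Y: neighbors in Y are adjacent in memory.
    public static long DenseY(long x, long y, long height) => x * height + y;

    static void Main()
    {
        Console.WriteLine(General1D(3, 2)); // 3 * 2 = 6
        Console.WriteLine(DenseX(3, 2, 8)); // 2 * 8 + 3 = 19
        Console.WriteLine(DenseY(3, 2, 8)); // 3 * 8 + 2 = 26
    }
}
```

The same logical index (3, 2) lands at different offsets depending on the stride, which is why copies between differently strided layouts may need to rearrange elements.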
Summary of the changes related to the new Device API
The new Device API removes the enumeration ContextFlags and implements the same functionality in an object-oriented way using a Context.Builder class. It offers a fluent-API-like configuration interface which makes it easy to set up:
    // Enables all supported accelerators (default CPU accelerator only) and puts the context
    // into auto-assertion mode via "AutoAssertions()". In other words, if a debugger is attached,
    // the `Context` instance will turn on all assertion checks. This behavior is identical
    // to the current implementation via new Context();
    using var context = Context.CreateDefault();

    // Turns on O2 and enables all compatible Cuda devices.
    using var context = Context.Create(builder =>
    {
        builder.Optimize(OptimizationLevel.O2).Cuda();
    });

    // Turns on all assertions, enables the IR verifier and enables all compatible OpenCL devices.
    using var context = Context.Create(builder =>
    {
        builder.Assertions().Verify().OpenCL();
    });

    // Turns on kernel source-line annotations, fast math using 32-bit float and enables
    // *all* (even incompatible) OpenCL devices.
    using var context = Context.Create(builder =>
    {
        builder
            .DebugSymbols(DebugSymbolsMode.KernelSourceAnnotations)
            .Math(MathMode.Fast32BitOnly)
            .OpenCL(device => true);
    });

    // Selects an OpenCL device with a warp size of at least 32:
    using var context = Context.Create(builder =>
    {
        builder.OpenCL(device => device.WarpSize >= 32);
    });

    // Turns on all assertions in debug mode (same behavior as calling CreateDefault()):
    using var context = Context.Create(builder =>
    {
        builder.AutoAssertions();
    });

    // Turns on debug optimizations (level O0) and all assertions if a debugger is attached:
    using var context = Context.Create(builder =>
    {
        builder.AutoDebug();
    });

    // Turns on debug mode (optimization level O0, assertions and kernel debug information):
    using var context = Context.Create(builder =>
    {
        builder.Debug();
    });

    // Disable caching, enable conservative inlining and inline mutable static field values:
    using var context = Context.Create(builder =>
    {
        builder
            .Caching(CachingMode.Disabled)
            .Inlining(InliningMode.Conservative)
            .StaticFields(StaticFieldMode.MutableStaticFields);
    });

    // Turn on *all* CPU accelerators that simulate different hardware platforms:
    using var context = Context.Create(builder => builder.CPU());

    // Turn on an AMD-based CPU accelerator:
    using var context = Context.Create(builder => builder.CPU(CPUDeviceKind.AMD));
Note that by default all debug symbols are automatically turned off unless a debugger is attached. If you want to turn on the debug information in all cases, call builder.DebugSymbols(DebugSymbolsMode.Basic). At the same time, this PR introduces the notion of a Device, which replaces the implementation of AcceleratorId. This allows us to query detailed device information without explicitly instantiating an accelerator:
    // Print all device information without instantiating a single accelerator
    // (device context) instance.
    using var context = Context.Create(...);
    foreach (var device in context)
    {
        // Print detailed accelerator information
        device.PrintInformation();

        // ...
    }
Note that we removed the ability to call the accelerator constructors (e.g. new CudaAccelerator(...)) directly. Either use the CreateAccelerator methods defined in the Device classes or use one of the extension methods like CreateCudaAccelerator(...) of the Context class itself:
    using var context = Context.Create(...);
    foreach (var device in context)
    {
        // Instantiate an accelerator instance on this device
        using Accelerator accel = device.CreateAccelerator();

        // ...
    }

    // Instantiate the 2nd Cuda accelerator (NOTE that this is the *2nd* Cuda device
    // and *not* the 2nd device of your machine).
    using CudaAccelerator cudaDevice = context.CreateCudaAccelerator(1);

    // Instantiate the 1st OpenCL accelerator (NOTE that this is the *1st* OpenCL device
    // and *not* the 1st device of your machine).
    using CLAccelerator clDevice = context.CreateOpenCLAccelerator(0);
Context properties that expose types from other (ILGPU internal) namespaces that cannot/should not (?) be covered by the API/ABI guarantees we want to give have been made internal properties. To access these properties, use one of the available extension methods located in the corresponding namespaces:
    using var context = ...

    // OLD way
    var internalIRContext = context.IRContext;

    // NEW way:
    // using namespace ILGPU.IR;
    var internalIRContext = context.GetIRContext();
Improved CPU runtime to support arbitrary Warp/Group/Multiprocessor configurations
The new CPU runtime significantly improves the existing CPUAccelerator runtime by adding support for user-defined warp, group and multiprocessor configurations. It changes the internal functionality to simulate a single warp of at least 2 threads (which ensures that all shuffle-based/reduction-like algorithms can also be run on the CPU by default). At the same time, each virtual multiprocessor can only execute a single thread group at a time. Increasing the number of virtual multiprocessors allows the user to simulate multiple concurrent groups. Most use cases will not require more than a single multiprocessor in practice.
Note that all device-wide static Grid/Group/Atomic/Warp classes are fully supported to debug/simulate all ILGPU kernels on the CPU.
Note that a custom warp size must be a multiple of 2.
This PR adds a new set of static creation methods:
- CreateDefaultSimulator(...), which creates a CPUAccelerator instance with 4 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 16).
- CreateNvidiaSimulator(...), which creates a CPUAccelerator instance with 32 threads per warp, 32 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 1024).
- CreateAMDSimulator(...), which creates a CPUAccelerator instance with 32 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
- CreateLegacyAMDSimulator(...), which creates a CPUAccelerator instance with 64 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
- CreateIntelSimulator(...), which creates a CPUAccelerator instance with 16 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 128).
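The MaxGroupSize values listed above are consistent with multiplying the threads per warp by the warps per multiprocessor. The following plain C# sketch recomputes them from the listed numbers. It does not call ILGPU: SimulatorConfigSketch and its helpers are hypothetical names, and the warp size check is only an assumed reading of the constraints stated in these notes (at least 2 threads, a multiple of 2).

```csharp
using System;

// Illustrative arithmetic only; not part of the ILGPU API.
static class SimulatorConfigSketch
{
    // Group size limit implied by a warp/multiprocessor configuration.
    public static int MaxGroupSize(int threadsPerWarp, int warpsPerMultiprocessor) =>
        threadsPerWarp * warpsPerMultiprocessor;

    // Assumed reading of the notes: at least 2 threads and a multiple of 2.
    public static bool IsValidWarpSize(int threadsPerWarp) =>
        threadsPerWarp >= 2 && threadsPerWarp % 2 == 0;

    static void Main()
    {
        // (name, threads per warp, warps per multiprocessor) as listed in the notes.
        var configs = new (string Name, int ThreadsPerWarp, int Warps)[]
        {
            ("Default", 4, 4),
            ("Nvidia", 32, 32),
            ("AMD", 32, 8),
            ("LegacyAMD", 64, 4),
            ("Intel", 16, 8),
        };

        foreach (var (name, threads, warps) in configs)
        {
            Console.WriteLine(
                $"{name}: MaxGroupSize = {MaxGroupSize(threads, warps)}, " +
                $"valid warp size = {IsValidWarpSize(threads)}");
        }
    }
}
```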
Furthermore, this PR adds support for advanced debugging features that enable a “sequential-like” execution mode. In this mode, each thread of a group will run sequentially one after another until it hits a synchronization barrier or exits the kernel function. This allows users to conveniently debug larger thread groups consisting of concurrent threads without switching to single-threaded execution. This behavior can be controlled via the newly added CPUAcceleratorMode enum:
    /// <summary>
    /// The accelerator mode to be used with the <see cref="CPUAccelerator"/>.
    /// </summary>
    public enum CPUAcceleratorMode
    {
        /// <summary>
        /// The automatic mode uses <see cref="Sequential"/> if a debugger is attached.
        /// It uses <see cref="Parallel"/> if no debugger is attached to the
        /// application.
        /// </summary>
        /// <remarks>
        /// This is the default mode.
        /// </remarks>
        Auto = 0,

        /// <summary>
        /// If the CPU accelerator uses a simulated sequential execution mechanism. This
        /// is particularly useful to simplify debugging. Note that different threads for
        /// distinct multiprocessors may still run in parallel.
        /// </summary>
        Sequential = 1,

        /// <summary>
        /// A parallel execution mode that runs all execution threads in parallel. This
        /// reduces processing time but makes it harder to use a debugger.
        /// </summary>
        Parallel = 2,
    }
By default, all CPUAccelerator instances use the automatic mode (CPUAcceleratorMode.Auto) that switches to a sequential execution model as soon as a debugger is attached to the application.
Note that threads in the scope of multiple multiprocessors may still run in parallel.
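The automatic mode described above can be pictured as a small decision function. The following self-contained C# sketch is an assumption about how such a decision might look, not ILGPU's implementation: it declares a local copy of the enum values so that it compiles on its own, and the helper Resolve is a hypothetical name. Only Debugger.IsAttached is a real .NET API here.

```csharp
using System;
using System.Diagnostics;

// Local copy of the enum values quoted in the notes, so the sketch compiles alone.
enum CPUAcceleratorMode { Auto = 0, Sequential = 1, Parallel = 2 }

static class ModeSketch
{
    // Hypothetical helper: maps Auto to a concrete mode and leaves explicit modes as-is.
    // The debugger state is passed in so the function is easy to test.
    public static CPUAcceleratorMode Resolve(CPUAcceleratorMode mode, bool debuggerAttached) =>
        mode != CPUAcceleratorMode.Auto
            ? mode
            : debuggerAttached
                ? CPUAcceleratorMode.Sequential
                : CPUAcceleratorMode.Parallel;

    static void Main()
    {
        // Prints Sequential when run under a debugger, Parallel otherwise.
        Console.WriteLine(Resolve(CPUAcceleratorMode.Auto, Debugger.IsAttached));
    }
}
```

Passing the debugger state as a parameter keeps the decision deterministic in tests; an explicitly requested Sequential or Parallel mode is never overridden in this sketch.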
Major internal changes:
- Added build support for .Net5.0 (#446).
- Added support for T4.Build to automatically transform T4 text templates during build (#431).
- Restrict net47 unit tests to only run on CI builds (#465).
- Avoid duplicate CI runs for pull requests made from the same repo (#485).
- Updated InlineList implementation to reduce memory consumption (#478).
- Fixed invalid assertion affecting successor blocks in frontend (#445).
Special thanks
Special thanks to @MoFtZ, @Joey9801, @jgiannuzzi and @GPSnoopy for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @MPSQUARK, @Nnelg, @Ruberik, @Yey007, @faruknane, @mikhail-khalizev, @nullandkale and @yuryGotham) for providing feedback, submitting issues and feature requests.
-
Release v0.10.1
The new stable version contains several bug fixes and improves the code quality of the generated kernel programs (get the Nuget package).
It is strongly recommended to upgrade to this version as soon as possible to avoid known bugs and some CPU-buffer deallocation issues.
Changes
- Added CopySign intrinsic (#438).
- Added intrinsic mappings for BitConverter functions (#437).
- Added call stack recording during compilation for error reporting (#436).
- Gracefully fail when loading symbols from in-memory assemblies (#435).
- Fixed invalid detection of loop bodies (#452).
- Fixed incorrect assertion on repeating successors (#447).
- Fixed emitting switch statement with constant condition (#442).
- Fixed invalid disposal of CPU buffers (#440).
- Fixed applications blocking during tear-down by changing Accelerator GC thread to run in the background (#439).
- Fixed bounds check on large views (#433).
- Fixed retrieving field from structure types (#426).
Special thanks
Special thanks to @MoFtZ, @marcin-krystianc and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.10.0
The new stable version offers significant performance improvements of the generated kernel programs and contains critical resource deallocation fixes (get the Nuget package).
It is strongly recommended to upgrade to this version as soon as possible to avoid resource and GC related deallocation issues.
Breaking changes
- The inheritance hierarchy of the ExchangeBuffer class has been changed to avoid exposing internal memory buffers. If you previously relied on the immediate inheritance from ExchangeBufferBase on MemoryBuffer, you have to adapt your program to use the intermediate base class MemoryBuffer<T, TIndex> instead (see diff).
- Properties exposing internal memory buffers of the high-level MemoryBufferXD classes have been removed to avoid ownership related GC-free issues (see diff).
Why are there breaking changes?
We have decided to remove dangerous properties from several memory buffer classes. The use of these properties can lead to program crashes, since buffers could be disposed asynchronously in the background by the GC without further notice.
Changes
- Improved performance of kernel launchers by passing packed argument structures (#358, #372).
- Graduated different optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
- Graduated O2 optimizations in the Cuda backend to O1 pipeline to generate vectorized IO operations in release builds (#350).
- Added support for managed sizeof IL instruction (#380).
- Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
- Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (use flag ContextFlags.EnableAssertions or attach a debugger to your application to enable assertion checks; make sure to use the portable debug information format for detailed source location information) (#375).
- Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
- Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
- Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
- Added enhanced support for local memory via a new LocalMemory class (#316).
- Added support for several PopCount, CLZ and CTZ operations (#324).
- Added new MemSet functions to all memory buffers (#338).
- Added new IfConditionalConversion to fold nested and-also and or-else block chains to O2 pipeline (#328).
- Added new local memory optimizations to simplify array accesses (#317).
- Added simple 64-bit-based LongGlobalIndex helper to simplify correct computations using 64-bit integers (#337).
- Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
- Removed support for .NET Core 2.0 (#353).
- Prevent using SharedMemory in implicitly grouped kernels (#354).
- Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
- Fixed critical GC-related resource deallocation issues (#376, #393).
- Fixed returning correct length of dynamic shared memory buffers (#357).
- Fixed invalid alignment information in the presence of reinterpret casts (#386).
- Fixed invalid address computations of fixed array buffers (#361).
- Fixed invalid PTX calling convention (#362).
- Fixed edge cases in LoopUnrolling (#373).
- Fixed invalid printf formats for int64 and uintX types (#391).
- Fixed invalid DebugArrayView implementations (#345).
- Fixed invalid initializations of local memory arrays (#287).
Major internal changes:
- Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
- Updated default optimizations for ILGPU debug builds (#384).
- Added support for unit tests running on .NET Framework 4.7 (#355).
- Migrated from FxCop analyzers to .NET analyzers (#352).
- Redesigned internal address-space inference passes (#364).
Special thanks
Special thanks to @MoFtZ, @Ruberik and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.10.0-beta2
This new beta version offers important bug fixes and performance improvements of the generated kernel programs and a set of new features (get the Nuget package).
- Improved performance of kernel launchers by passing packed argument structures (#358, #372).
- Added support for managed sizeof IL instruction (#380).
- Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
- Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (use flag ContextFlags.EnableAssertions or attach a debugger to your application to enable assertion checks; make sure to use the portable debug information format for detailed source location information) (#375).
- Removed support for .NET Core 2.0 (#353).
- Prevent using SharedMemory in implicitly grouped kernels (#354).
- Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
- Fixed critical GC-related resource deallocation issues (#376, #393).
- Fixed returning correct length of dynamic shared memory buffers (#357).
- Fixed invalid alignment information in the presence of reinterpret casts (#386).
- Fixed invalid address computations of fixed array buffers (#361).
- Fixed invalid PTX calling convention (#362).
- Fixed edge cases in LoopUnrolling (#373).
- Fixed invalid printf formats for int64 and uintX types (#391).
Major internal changes:
- Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
- Updated default optimizations for ILGPU debug builds (#384).
- Added support for unit tests running on .NET Framework 4.7 (#355).
- Migrated from FxCop analyzers to .NET analyzers (#352).
- Redesigned internal address-space inference passes (#364).
Special thanks to @MoFtZ and @Ruberik for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.10.0-beta1
This new beta version offers significant performance improvements of the generated kernel programs and a set of new features (get the Nuget package).
- Graduated different optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
- Graduated O2 optimizations in the Cuda backend to O1 pipeline to generate vectorized IO operations in release builds (#350).
- Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
- Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
- Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
- Added enhanced support for local memory via a new LocalMemory class (#316).
- Added support for several PopCount, CLZ and CTZ operations (#324).
- Added new MemSet functions to all memory buffers (#338).
- Added new IfConditionalConversion to fold nested and-also and or-else block chains to O2 pipeline (#328).
- Added new local memory optimizations to simplify array accesses (#317).
- Added simple 64-bit-based LongGlobalIndex helper to simplify correct computations using 64-bit integers (#337).
- Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
- Fixed invalid DebugArrayView implementations (#345).
- Fixed invalid initializations of local memory arrays (#287).
Special thanks to @MoFtZ and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.9.2
The new stable version offers significant performance improvements of the generated kernel programs (get the Nuget package).
- Added new convenience Launch methods to Accelerator class to launch kernels without pre-loading/compiling them (#319).
- Changed default inlining behavior to AggressiveInlining to improve performance of (usually) performance-critical GPU programs (#294).
- Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
- Added support for RTX 30xx cards (#302, #305, #311).
- Added support for tuple-types in kernel functions (#266).
- Added support for Span<T> in the scope of MemoryBuffer copy operations (#122, #276).
- Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
- Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
- Added support for unrolling of loop nests to improve performance (#281).
- Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
- Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
- Improved alignment of padding in fixed-size structures (#315).
- Fixed invalid Unix OpenCL library names (#327).
- Fixed calling ambiguous OpenCL 64-bit atomic functions (#321).
- Fixed invalid unrolling of loops in some cases (#292).
- Fixed invalid loading of unsigned fields from structures (#314).
- Fixed invalid handling of FP16 types on unsupported devices (#312).
- Fixed invalid constant folding of LHS constants in compare operations (#326).
Major internal changes:
- Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
- Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
- Added additional debugging capabilities via new dumper methods (#282).
Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.9.2-beta1
This new beta version offers significant performance improvements of the generated kernel programs (get the Nuget package).
- Changed default inlining behavior to AggressiveInlining to improve performance of (usually) performance-critical GPU programs (#294).
- Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
- Added support for RTX 30xx cards (#302, #305).
- Added support for tuple-types in kernel functions (#266).
- Added support for Span<T> in the scope of MemoryBuffer copy operations (#122, #276).
- Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
- Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
- Added support for unrolling of loop nests to improve performance (#281).
- Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
- Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
- Fixed invalid unrolling of loops in some cases (#292).
Major internal changes:
- Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
- Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
- Added additional debugging capabilities via new dumper methods (#282).
Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.9.1
The new stable version offers significant performance improvements of the generated kernel programs (get the Nuget package).
- Added initial loop unrolling capabilities for innermost loops (#259).
- Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
- Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
- Added support for FP16x2 (Half2) types (#273).
- Added support for non-capturing lambda kernels (#186).
- Added additional copy operations to ExchangeBuffer (#255).
- Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
- Fixed invalid accelerator synchronization in OpenCL (#246).
- Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
- Fixed invalid handling of unsafe array buffers in several cases (#262, #263, #285).
Major internal changes:
- Added new enhanced loop-analyses classes to get detailed insights about loops in ILGPU programs (#259).
- Refactored the internal static-program analysis framework (#247).
- Updated native DLL-interop API (#249).
- Fixed code analysis warnings (#248).
Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.9.1-beta1
This new beta version offers significant performance improvements of the generated kernel programs (get the Nuget package).
- Added initial loop unrolling capabilities for innermost loops (#259).
- Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
- Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
- Added support for FP16x2 (Half2) types (#273).
- Added support for non-capturing lambda kernels (#186).
- Added additional copy operations to ExchangeBuffer (#255).
- Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
- Fixed invalid accelerator synchronization in OpenCL (#246).
- Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
- Fixed invalid handling of unsafe array buffers in several cases (#262, #263).
Major internal changes:
- Added new enhanced loop-analyses classes to get detailed insights about loops in ILGPU programs (#259).
- Refactored the internal static-program analysis framework (#247).
- Updated native DLL-interop API (#249).
- Fixed code analysis warnings (#248).
Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
-
Release v0.9.0
This new stable version offers significant performance and code quality improvements of the generated kernel programs.
- Fixed invalid range checks in memory buffer implementations.
- Fixed invalid 32-bit offsets in memory buffer implementations.
- Fixed if-conversion transformation generating invalid programs in some cases (#232, #233).
- Fixed code-analyses issues that could cause invalid analysis results (#220).
- Added support for 64-bit length buffers and views (#196, #210, #215, #216). Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
- Added new if-conversion transformation to improve performance (#183).
- Added support for 16-bit float (Half) types (#180, #208).
- Added initial support for fixed array buffers (#200).
- Added support for non-capturing lambda kernels (#79, #136).
- Added support for multidimensional ExchangeBuffers (#148).
- Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
- Fixed invalid lowering of arrays in divergent control flow (#201).
- Fixed invalid handling of prefixed IL instructions (#204, #211).
Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.
-
Release v0.9.0-beta1
- Added support for 64-bit length buffers and views (#196, #210, #215, #216). Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
- Added new if-conversion transformation to improve performance (#183).
- Added support for 16-bit float (Half) types (#180, #208).
- Added initial support for fixed array buffers (#200).
- Added support for non-capturing lambda kernels (#79, #136).
- Added support for multidimensional ExchangeBuffers (#148).
- Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
- Fixed invalid lowering of arrays in divergent control flow (#201).
- Fixed invalid handling of prefixed IL instructions (#204, #211).
Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.
-
Release v0.8.1.1
The new stable version offers significant performance and code quality improvements of the generated kernel programs.
- Fixed issues related to Trace and Debug asserts (#176).
- Improved compile-time performance by up to 4X (#110).
- Reduced memory footprint by up to 3X (#109, #118).
- Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
- Now ships compiler release builds in the Nuget package to improve runtime performance (#130).
- Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
- Added generation of vectorized instructions to PTX backend (#111).
- Fixed critical code-generation issue on Unix platforms (#116).
- Added dynamic shared memory support for all platforms (#97, #98).
- Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).
-
Release v0.8.1-beta1
The new beta version offers significant performance improvements of the generated kernel programs.
- Improved compile-time performance by up to 4X (#110).
- Reduced memory footprint by up to 3X (#109, #118).
- Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
- Now ships compiler release builds in the Nuget package to improve runtime performance (#130).
- Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
- Added generation of vectorized instructions to PTX backend (#111).
- Fixed critical code-generation issue on Unix platforms (#116).
- Added dynamic shared memory support for all platforms (#97, #98).
- Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).
-
Release v0.8.0
The new stable version offers significant performance and code quality improvements of the generated kernel programs.
- Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
- Added support for dynamic shared memory (CPU & Cuda backends).
- Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
- Added new Index1 structure to avoid name clashes with new System.Index structure.
- Added additional tuple conversion methods to Index2 and Index3 types.
- Added new EntryPointDescription structure to specify an entry point and its index type.
- Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
- Added support for linear arrays in local memory.
- Added support for enum-value interop (#66).
- Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
- Simplified static Grid and Group properties.
- Removed all GroupedIndex types.
- Updated the whole compilation pipeline to enable more aggressive optimizations.
- Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
- Added support for “unmanaged” C# structures in the scope of buffers and views.
- Reworked PTX backend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler (#68).
- Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#67, #72, #73, #74, #78, #85, #88, #91, #92).
- New debug information input module to support the latest PDB format updates.
- Considerably improved error messages using debug information (#86).
- Reduced memory consumption during the compilation process.
- Performance improvements of the internal compilation pipeline.
- Improved performance of kernel launchers.
- Extended CudaAPI to support page-locked host-memory allocation functions.
- Extended ExchangeBuffer to use new page-locked memory allocation (if available).
- Added new IR-rewriter API to perform more advanced IR transformations.
- Adapted all existing transformations to use the new rewriter API.
- Reduced memory consumption of all nodes by compressing information.
- Redesigned several IR nodes to support global program transformations.
- Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
- Fixed several issues in the scope of address-space inference.
- Fixed critical code generation issues that could occur when replacing values.
Special thanks to @MoFtZ for contributing to this release.
-
Release v0.8.0-beta3
- Considerably improved error messages using debug information (#86).
- Reduced memory consumption during the compilation process.
- Performance improvements of the internal compilation pipeline.
- Added support for “unmanaged” C# structures in the scope of buffers and views.
- New debug information input module to support the latest PDB format updates.
- Fixed several OpenCL code generation issues (#85, #88, #91, #92).
Special thanks to @MoFtZ for contributing to this release.
-
Release v0.8.0-beta2
- Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
- Improved performance of kernel launchers.
- Added support for linear arrays in local memory.
- Added support for enum-value interop (#66).
- Reworked PTXBackend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler.
- Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#72, #73, #74, #78).
- Updated the whole compilation pipeline to enable more aggressive optimizations.
- Added new IR-rewriter API to perform more advanced IR transformations.
- Adapted all existing transformations to use the new rewriter API.
- Reduced memory consumption of all nodes by compressing information.
- Redesigned several IR nodes to support global program transformations.
Special thanks to @MoFtZ for contributing to this release.
-
Release v0.8.0-beta1
- Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
- Added support for dynamic shared memory (CPU & Cuda backends).
- Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
- Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
- Simplified static Grid and Group properties.
- Added new Index1 structure to avoid name clashes with new System.Index structure.
- Added additional tuple conversion methods to Index2 and Index3 types.
- Added new EntryPointDescription structure to specify an entry point and its index type.
- Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
- Removed all GroupedIndex types.
- Extended PTXInstructions to support bool-based IOs in PTXBackend (#68).
- Extended ExchangeBuffer to use new page-locked memory allocation (if available).
- Extended CudaAPI to support page-locked host-memory allocation functions.
- Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
- Fixed several issues in the scope of address-space inference.
- Fixed critical code generation issues that could occur when replacing values.
- Fixed invalid pointer types in the scope of AtomicCAS operations on AMD hardware (#67).
-
Release v0.7.1
- Added extension method to load the effective address for Cuda and CPU-based array views.
- Added support for data blocks (value containers) to ease interop with value tuples.
- Added additional primitive data blocks to simplify operations on tuples consisting of primitive values.
- Added new ExchangeBuffer class to simplify memory transfers between CPU and GPU memory.
- Fixed invalid sub-group extension name in CLAccelerator.
- Fixed invalid association of supported and unsupported CL accelerators.
- Removed obsolete dispose functionality from AcceleratorId classes.
- Fixed OpenCL code generator for float values that are assigned integer values.
- Fixed invalid creation of kernel interop types in OpenCL backend.
- Made ABI thread safe to support concurrent queries of size/alignment information.
-
Release v0.7.0
- Added support for .NET Standard 2.1.
- Added support for OpenCL-compatible GPUs (beta).
- Added parallel code generation in backends to improve code-generation speed.
- Added minimum CUDA driver version detection.
- Enabled adaptive shared-memory allocation in CPUAccelerator.
- Added new Utility.Select method that can be used to create highly efficient select instructions instead of if branches.
- Added support to access Grid and Group indices via properties.
- Added support for generic Warp intrinsics that will be automatically generated by the compiler.
- Redesigned intrinsic math functions and moved XMath functions to the ILGPU.Algorithms library. Use the new IntrinsicMath class for math functions that are supported on all platforms.
- Reworked intrinsic functions to allow custom implementations of intrinsics for different backends.
- Ported project to VS2019 including all static-program analysis checks.
- Applied general code cleanup to be compliant with the new analysis checks.
- Redesigned AcceleratorId functionality.
- Updated CudaMemoryBuffer to support MemSetToZero using alternate streams.
- Fixed retrieving version number of ILGPU assembly.
- Fixed non-deterministic generation of Phi mappings.
- Fixed invalid loading of small basic types onto the evaluation stack.
- Added utility property to Accelerator to resolve a launch extent with the maximum number of groups.
- Fixed invalid shared-memory allocation within non-kernel functions in PTXBackend.
Special thanks to @MoFtZ for contributing to this release.