* Start updating to cudarc 0.14.
* Adapt a couple more things.
* And a couple more fixes.
* More tweaks.
* And a couple more fixes.
* Bump the major version number.
* Proper module system for the cuda kernels.
* Proper ptx loading.
* Launch the sort kernel.
* Custom op.
* Start using the builder pattern.
* More builder.
* More builder.
* Get candle-core to compile.
* Get the tests to pass.
* Get candle-nn to work too.
* Support for custom cuda functions.
* cudnn fixes.
* Get flash attn to run.
* Switch the crate versions to be alpha.
* Bump the ug dependency.
* Improve reduce perf and add contiguous impl
* Improve arg reduce and add contiguous impl
* Improve softmax kernel: 33%-39% higher throughput
* fmt
* Fixed all bugs. Improved code quality. Added tests.
* Stash for debugging
* Stash for debugging 2
* Fix argmax bug and improve performance
Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
* Fix test and add is_valid_simgroup_reduce_type trait
* Online softmax. Improved threadgroup reduce. Tidying up a bit.
* Remove redundant threadgroup_barrier from arg reduce
* Mostly tidying up. Some improvements
* Simplify indexed struct
* tidying
* Reuse operation operator instead of passing it in as a parameter
* Fix how operators are applied to indexed<vec<T,N>>
* Vectorized load. Scalar block reduce. Hitting max throughput for f32 reduce.
* Vectorized load for online softmax. Involves a reinterpret_cast of src which may be suboptimal.
* Metal as_type casting vec<bfloat, N> -> vec<float, N/2> for simd and fast math
* Use constant for input instead of const device. Fix strided reduce.
* Use contiguous reduce in tests
* Rename finalize -> to_scalar
* Support integer types max/min (switch with trait-inferred impl later)
* Shuffle the 1D test cases (was worried that work was being skipped)
* Add build.rs to avoid metal kernel jit compile overhead
* Improve build. Extract utils
* Compile metal kernels for both macos and ios
* Fixed over xmas and then forgot about it
* Add calculate_reduce_threads util
* Remove old reduce.metal
* Improve f16/bf16 softmax precision by accumulating in f32
* Remove build.rs (for now)
* Move softmax bench to candle-nn
* Remove redundant thread calc util fn
* Use uint over ushort for indices etc
* Use fast exp in MDReduceOp
* Remove nested metal define for softmax
* Fix some clippy lint.
---------
Co-authored-by: Christopher Fleetwood <45471420+FL33TW00D@users.noreply.github.com>
Co-authored-by: Laurent <laurent.mazare@gmail.com>
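For reference, the online softmax mentioned above computes the running max and the sum of exponentials in a single pass, rescaling the partial sum whenever a larger maximum appears; the f16/bf16 variants accumulate in f32 for precision. A minimal scalar sketch of the idea (illustrative only, not the Metal kernel code):

```rust
// Single-pass "online" softmax: track a running max and a running sum of
// exponentials, rescaling the sum whenever a new max is found. This is the
// scalar analogue of the streaming approach used in the kernel.
fn online_softmax(xs: &[f32]) -> Vec<f32> {
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0f32;
    for &x in xs {
        if x > max {
            // Rescale the partial sum to the new maximum; exp(-inf) = 0
            // makes the first iteration come out right.
            sum = sum * (max - x).exp() + 1.0;
            max = x;
        } else {
            sum += (x - max).exp();
        }
    }
    xs.iter().map(|&x| (x - max).exp() / sum).collect()
}
```

This matches the classic two-pass softmax result while reading the input only once, which is what makes the fused kernel faster.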
* Add some fast Metal MLX SDPA kernels (#32)
* Sketch the sdpa kernel
* Add full sdpa kernel.
* Add test
* Add vectorized kernel for decoding
* Update tests
* Add some docs
* Fix sdpa_vector names
* Add softcapping for vectorized sdpa
* Add softcapping for full sdpa
* Add support for head dim 32, 96, 256
* Update docs
* Add update notice
* Clippy and format
* Conditional compilation for bf16
* Use it in quantized llama
* Some review comments
* Use set_params!
* Remove unused
* Remove feature
* Fix metal sdpa for v stride
* Remove comma
* Add the dim method to layout and shape.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
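The softcapping added for both sdpa variants squashes attention scores with a scaled tanh before the softmax. A scalar sketch of the formula (the real kernels fuse this into the attention pass; the function name here is illustrative):

```rust
// Logit softcapping: map each score s to cap * tanh(s / cap), bounding
// scores to (-cap, cap) while leaving small values nearly unchanged.
fn softcap(scores: &mut [f32], cap: f32) {
    for s in scores.iter_mut() {
        *s = cap * (*s / cap).tanh();
    }
}
```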
* add: direction for lstm layer
* lint: remove unused Error import
* refactor: remove unnecessary int assignment to Direction enum
* refactor: use &'static str type instead of String for direction_str
* Run cargo fmt.
---------
Co-authored-by: Laurent <laurent.mazare@gmail.com>
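The refactors above boil down to giving the direction flag a &'static str name instead of allocating a String per call; roughly (names illustrative, not the candle-nn API):

```rust
// A direction flag whose string form is a &'static str, so formatting
// variable-builder prefixes costs no allocation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Direction {
    Forward,
    Backward,
}

impl Direction {
    fn as_str(&self) -> &'static str {
        match self {
            Direction::Forward => "forward",
            Direction::Backward => "backward",
        }
    }
}
```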
* Add Pixtral.
* More pixtral vision encoder.
* Sketch a pixtral example.
* Better image loading.
* Support loading images embedded in safetensor files.
* Clippy fixes.
* Add the llava multimodal adapter.
* Add more of the llava bits.
* Add the pixtral config.
* More pixtral inference.
* Add the text generation bits.
* Get the example to work.
* Bugfix.
* Run some bits of the model in f32.
* Blessed version :)
* Better rope frequency computations.
* README update.
* Add a RotatingKVCache.
* Add some KvCache tests.
* Test the reset too.
* More kv-cache testing.
* More tests for the rotating kv-cache.
* Improve the API for the rotating cache so that the whole src tensor is returned when it is larger than the cache.
* Handle contiguity + bugfix + use in mimi.
* Add a way to test the mimi streaming mode.
* Mimi streaming fixes.
* More rotating kv-cache.
* Fix the attn mask generation.
* Handle the abs case.
* Add some tests for the generated mask.
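At its core the RotatingKVCache behaves like a ring buffer along the sequence dimension: once full, new positions overwrite the oldest ones. A minimal sketch of that indexing over plain vectors (not the candle API, just the idea):

```rust
// Ring-buffer cache: the first `max_len` appends fill the buffer, after
// which writes wrap around and overwrite the oldest entries.
struct RotatingCache {
    data: Vec<f32>,
    max_len: usize,
    pos: usize, // total number of elements ever appended
}

impl RotatingCache {
    fn new(max_len: usize) -> Self {
        Self { data: Vec::with_capacity(max_len), max_len, pos: 0 }
    }

    fn append(&mut self, src: &[f32]) {
        for &v in src {
            if self.data.len() < self.max_len {
                self.data.push(v);
            } else {
                // Wrap around and overwrite the oldest slot.
                self.data[self.pos % self.max_len] = v;
            }
            self.pos += 1;
        }
    }

    // Current valid length, saturating at max_len.
    fn len(&self) -> usize {
        self.data.len()
    }
}
```

The real cache works on tensors and also has to produce the matching attention mask, which is what the mask-generation fixes above address.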
* Add Llama 3.1 rope
* Clippy
* Format
* Clippy
* Add support for multiple eos tokens.
* Untagged either
* Remove either dep and fix settings.json
* Make the max positional embeddings configurable
* define structs
* construct ResidualConvUnit
* forward() for ResidualConvUnit
* implement FeatureFusionBlock
* implement Scratch
* implement DPTHead
* add identity module
* implement forward for DPTHead
* add get_intermediate_layers to DinoVisionTransformer
* implement DepthAnythingV2
* some minor tweaks
* fix compile errors
* fix var builder prefixes
* setup initial example
* use fixed patch size of 37 (518 / 14)
* debugged until output
* print min and max values
* add some dynamism to the output location
* scale input image
* extract prep function
* extract output path function
* normalize image with magic mean and std
* add spectral coloring
* squeeze in the right place
* make interpolation optional
* use bail instead of panic
* omit unnecessary Shape call
* remove empty curly braces
* use bail instead of assert
* use vb and pp
* remove closures
* extract config object
* Apply rustfmt.
* Fix some clippy lints.
* More lints.
* Use the array methods.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* Add the layernorm cuda kernels.
* Dedicated layer norm op.
* Add the slower variant.
* Plug the cuda implementation.
* Add the metal variant.
* Add a dedicated test.
* Bugfix.
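The dedicated op implements standard layer normalization: normalize across the feature dimension to zero mean and unit variance, then apply a learned scale and shift. A scalar reference sketch (the cuda/metal kernels fuse these steps; names illustrative):

```rust
// Reference layer norm over one feature vector:
// y = (x - mean) / sqrt(var + eps) * weight + bias
fn layer_norm(xs: &[f32], weight: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|&x| (x - mean) * (x - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    xs.iter()
        .zip(weight.iter().zip(bias.iter()))
        .map(|(&x, (&w, &b))| (x - mean) * inv_std * w + b)
        .collect()
}
```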
* Add a slice_set op.
* Add some testing.
* Add the dedicated kv-cache module.
* Derive debug and clone.
* Expose more kv-cache functions.
* Return the current data when appending.
* Use the new cache in the quantized phi3 model.
* When converting a tensor to a variable, clone if the tensor is already a variable.
* Add a test to ensure training a batch norm works with VarMaps
---------
Co-authored-by: Jeffrey Dallatezza <jeffreydallatezza@Jeffreys-Laptop.local>
* add sigmoid op
* small fix
* add as a method on `Tensor`
* implement gradient calculation for sigmoid
* add sigmoid tests
* we should have a specialized op for this
* fix clippy
* fix clippy 2
* Revert all previous commits in favor of a `CustomOp` based solution
* use `CustomOp1` implementation
* fix rustfmt
* experimental add metal impl
* add cuda kernel impl
* fix fmt
* Add a test + reduce some cuda duplication.
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
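The sigmoid gradient relies on the identity d/dx σ(x) = σ(x)·(1 − σ(x)), which lets the backward pass reuse the forward output instead of recomputing the exponential. A scalar sketch (function names illustrative, not the CustomOp1 implementation):

```rust
// Forward: sigmoid(x) = 1 / (1 + exp(-x)).
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

// Backward: given the forward output y = sigmoid(x) and the incoming
// gradient, the input gradient is grad_out * y * (1 - y).
fn sigmoid_grad(forward_out: f32, grad_out: f32) -> f32 {
    grad_out * forward_out * (1.0 - forward_out)
}
```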