[x265] [PATCH] feat: --threaded-me
Shashank Pathipati
shashank.pathipati at multicorewareinc.com
Wed Feb 25 16:50:14 UTC 2026
From e7ab11a765892a18468ddc89b3b239a2998ad4a5 Mon Sep 17 00:00:00 2001
From: Shashank Pathipati <shashank.pathipati at multicorewareinc.com>
Date: Wed, 25 Feb 2026 21:52:31 +0530
Subject: [PATCH] feat: --threaded-me
Threaded ME is a threading feature that unlocks additional parallelism in the encoder by offloading motion estimation to a dedicated thread pool that computes ahead of WPP, giving speedups of up to 1.5x on many-core CPUs.
Co-Authored-By: Somu Vineela <somu at multicorewareinc.com>
---
.gitignore | 1 +
doc/reST/cli.rst | 43 +++-
doc/reST/threading.rst | 39 ++-
source/CMakeLists.txt | 7 +-
source/cmake/FindNasm.cmake | 2 +-
source/cmake/FindNuma.cmake | 2 +-
source/common/common.h | 3 +
source/common/cudata.cpp | 58 +++++
source/common/cudata.h | 2 +
source/common/frame.cpp | 10 +-
source/common/frame.h | 1 +
source/common/framedata.cpp | 21 ++
source/common/param.cpp | 19 ++
source/common/slice.h | 5 +
source/common/threading.cpp | 9 +
source/common/threading.h | 11 +-
source/common/threadpool.cpp | 200 +++++++++++++--
source/common/threadpool.h | 4 +-
source/encoder/CMakeLists.txt | 3 +-
source/encoder/analysis.cpp | 147 +++++++++++
source/encoder/analysis.h | 18 ++
source/encoder/api.cpp | 5 +
source/encoder/dpb.cpp | 7 +
source/encoder/encoder.cpp | 90 ++++++-
source/encoder/encoder.h | 5 +
source/encoder/frameencoder.cpp | 51 +++-
source/encoder/frameencoder.h | 18 ++
source/encoder/motion.cpp | 149 +++++++++++
source/encoder/motion.h | 2 +
source/encoder/search.cpp | 441 +++++++++++++++++++++++++++++++-
source/encoder/search.h | 18 ++
source/encoder/threadedme.cpp | 275 ++++++++++++++++++++
source/encoder/threadedme.h | 259 +++++++++++++++++++
source/x265.cpp | 1 +
source/x265.h | 15 ++
source/x265cli.cpp | 1 +
source/x265cli.h | 2 +
37 files changed, 1895 insertions(+), 49 deletions(-)
create mode 100644 source/encoder/threadedme.cpp
create mode 100644 source/encoder/threadedme.h
diff --git a/.gitignore b/.gitignore
index 22fd3adc8..62a491de0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,3 +34,4 @@
# Build directory
build/
+/.*/
\ No newline at end of file
diff --git a/doc/reST/cli.rst b/doc/reST/cli.rst
index 6bb99f92e..7bdf01061 100755
--- a/doc/reST/cli.rst
+++ b/doc/reST/cli.rst
@@ -201,6 +201,12 @@ Logging/Statistic Options
enough ahead for the necessary reference data to be available. This
is more of a problem for P frames where some blocks are much more
expensive than others.
+
+ **Total ThreadedME Wait Time (ms)** Total time this frame waited for
+ ThreadedME to complete CTUs required for compression.
+
+ **Total ThreadedME Time (ms)** Total time spent by ThreadedME worker
+ threads on this frame.
.. option:: --csv-log-level <integer>
@@ -300,14 +306,25 @@ Performance Options
64-bit machines, or 16 for 32-bit machines. If the total number of threads
in the system doesn't obey this constraint, we may spawn fewer threads
than cores which has been empirically shown to be better for performance.
+ However, when :option:`--threaded-me` is enabled, this behavior is
+ overridden and a single thread pool larger than 64 threads may be
+ created. ThreadedME is a singleton job provider, and multiple frame
+ encoders may push work to it concurrently.
- If the four pool features: :option:`--wpp`, :option:`--pmode(deprecated)`,
- :option:`--pme(deprecated)` and :option:`--lookahead-slices` are all disabled,
- then :option:`--pools` is ignored and no thread pools are created.
+ If the five pool features: :option:`--wpp`, :option:`--pmode(deprecated)`,
+ :option:`--pme(deprecated)`, :option:`--lookahead-slices` and :option:`--threaded-me`
+ are all disabled, then :option:`--pools` is ignored and no thread pools are created.
- If "none" is specified, then all four of the thread pool features are
+ If "none" is specified, then all of the thread pool features are
implicitly disabled.
+ When :option:`--threaded-me` is enabled, x265 estimates the number of
+ threads to assign to ThreadedME based on motion estimation workload and
+ spawns a dedicated threadpool (threadpool 0) for it. This pool may span
+ multiple NUMA nodes when the ThreadedME allocation target requires it. The remaining
+ threads are then used to create the other pools, which are assigned to
+ frame encoders and lookahead.
+
Frame encoders are distributed between the available thread pools,
and the encoder will never generate more thread pools than
:option:`--frame-threads`. The pools are used for WPP and for
@@ -383,6 +400,24 @@ Performance Options
Default disabled
+.. option:: --threaded-me, --no-threaded-me
+
+ Threaded motion estimation. Uses a dedicated thread pool to pre-compute
+ motion estimation and evaluate PU combinations for CTUs in parallel.
+ It relaxes inter-frame CTU dependencies to increase parallelism, which
+ can reduce compression efficiency. Recommended on many-core CPUs when
+ encode speed is prioritized over compression efficiency.
+
+ If VBV options are enabled, Threaded ME is automatically disabled and a
+ warning is emitted.
+
+ This feature is implicitly disabled when no thread pool is present.
+
+ Default disabled.
+
.. option:: --preset, -p <integer|string>
Sets parameters to preselected values, trading off compression efficiency against
diff --git a/doc/reST/threading.rst b/doc/reST/threading.rst
index abb0717cc..20282e55d 100644
--- a/doc/reST/threading.rst
+++ b/doc/reST/threading.rst
@@ -191,7 +191,10 @@ regardless of the amount of frame parallelism.
By default frame parallelism and WPP are enabled together. The number of
frame threads used is auto-detected from the (hyperthreaded) CPU core
-count, but may be manually specified via :option:`--frame-threads`
+count, but may be manually specified via :option:`--frame-threads`. When
+:option:`--threaded-me` is enabled, the auto-detected frame thread count
+is derived from the thread fraction available after the ThreadedME
+pool is allocated.
+-------+--------+
| Cores | Frames |
@@ -246,6 +249,40 @@ The main slicetypeDecide() function itself is also performed by a worker
thread if your encoder has a thread pool, else it runs within the
context of the thread which calls the x265_encoder_encode().
+Threaded Motion Estimation
+==========================
+
+The Threaded Motion Estimation module improves parallelism by offloading
+motion estimation (which often dominates encoding time) to a dedicated
+thread pool for pre-processing ahead of WPP.
+
+In the default flow, MVP derivation requires motion vectors from adjacent
+CTUs. This helps compression efficiency but introduces stalls (as described
+in the WPP section), which limits parallelism. Threaded ME relaxes this
+dependency and uses a different algorithm to unlock additional parallelism.
+
+In this algorithm, CTUs from different rows are processed in parallel as soon
+as external row dependencies are resolved. The steps are:
+
+1. For MVP derivation, motion vectors from neighboring PUs and colocated MVs
+ from the reference frame are used when available.
+2. If these are not valid, a median MV from colocated CTUs in reference
+ frames is evaluated.
+3. If the median MV is also unavailable (for example, when the reference
+ frame is an I-slice), a 109-point diamond search is performed on the full
+ CTU and, when available, on the four CU quadrants at depth=1. The
+ resulting vectors are used as MVPs for their respective regions.
+4. Using these MVP seeds, motion estimation is then run for every PU shape
+ enabled by the active configuration, and the resulting MVs are written to
+ a lookup table.
+5. During inter prediction, motion vectors from this precomputed lookup table
+ are used for prediction.
+
+ThreadedME has a higher threading demand because it must maintain a lead over
+WPP to minimize stalls: it computes MVs for CTUs across the rows of a frame and
+across the frames handled by different frame encoders. It is therefore
+recommended primarily for many-core systems, where threading resources can be
+balanced.
+
SAO
===
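The MVP-seeding fallback chain in steps 1-3 above can be sketched as follows. This is a simplified, hypothetical model for illustration only; the validity flags and MV values are stand-ins for encoder state, not actual x265 APIs:

```cpp
#include <cassert>

struct MV { int x, y; };

enum SeedSource { SEED_NEIGHBOUR, SEED_MEDIAN_COL, SEED_DIAMOND };

// Fallback order from the steps above: neighbouring/colocated MVs first,
// then the median MV from colocated CTUs, then a diamond-search seed
// (e.g. when the reference frame is an I-slice and carries no motion).
SeedSource selectSeed(bool neighbourValid, bool medianValid,
                      MV neighbour, MV median, MV diamond, MV& out)
{
    if (neighbourValid) { out = neighbour; return SEED_NEIGHBOUR; }
    if (medianValid)    { out = median;    return SEED_MEDIAN_COL; }
    out = diamond;
    return SEED_DIAMOND;
}
```

Whichever seed is selected, step 4 then runs full motion estimation for every enabled PU shape starting from it, so a weaker seed costs search quality rather than correctness.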
diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index 9f93b6ec2..8bcf0a11b 100755
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -11,9 +11,14 @@ if(POLICY CMP0042)
cmake_policy(SET CMP0042 NEW) # MACOSX_RPATH
endif()
+cmake_minimum_required (VERSION 2.8.8...3.10) # OBJECT libraries require 2.8.8
+
+if (POLICY CMP0075)
+ cmake_policy(SET CMP0075 NEW) # CMAKE_REQUIRED_LIBRARIES warning
+endif()
+
project (x265)
-cmake_minimum_required (VERSION 2.8.8...3.10) # OBJECT libraries require 2.8.8
include(CheckIncludeFiles)
include(CheckFunctionExists)
include(CheckSymbolExists)
diff --git a/source/cmake/FindNasm.cmake b/source/cmake/FindNasm.cmake
index ff7eac622..9536ad453 100644
--- a/source/cmake/FindNasm.cmake
+++ b/source/cmake/FindNasm.cmake
@@ -20,6 +20,6 @@ if(NASM_EXECUTABLE)
endif()
# Provide standardized success/failure messages
-find_package_handle_standard_args(nasm
+find_package_handle_standard_args(Nasm
REQUIRED_VARS NASM_EXECUTABLE
VERSION_VAR NASM_VERSION_STRING)
diff --git a/source/cmake/FindNuma.cmake b/source/cmake/FindNuma.cmake
index e7cf10a8a..958d9de37 100644
--- a/source/cmake/FindNuma.cmake
+++ b/source/cmake/FindNuma.cmake
@@ -40,4 +40,4 @@ endif()
mark_as_advanced(NUMA_INCLUDE_DIR NUMA_LIBRARY_DIR NUMA_LIBRARY)
-find_package_handle_standard_args(NUMA REQUIRED_VARS NUMA_ROOT_DIR NUMA_INCLUDE_DIR NUMA_LIBRARY)
+find_package_handle_standard_args(Numa REQUIRED_VARS NUMA_ROOT_DIR NUMA_INCLUDE_DIR NUMA_LIBRARY)
diff --git a/source/common/common.h b/source/common/common.h
index 794073577..92af90426 100644
--- a/source/common/common.h
+++ b/source/common/common.h
@@ -352,6 +352,9 @@ typedef int16_t coeff_t; // transform coefficient
#define MAX_MCSTF_TEMPORAL_WINDOW_LENGTH 8
+#define MAX_NUM_PUS_PER_CTU 593 // Maximum number of PUs in a 64x64 CTU
+#define MAX_NUM_PU_SIZES 24 // Number of distinct PU sizes in a 64x64 CTU
+
namespace X265_NS {
enum { SAO_NUM_OFFSET = 4 };
diff --git a/source/common/cudata.cpp b/source/common/cudata.cpp
index 550845867..3deafea1c 100644
--- a/source/common/cudata.cpp
+++ b/source/common/cudata.cpp
@@ -1740,6 +1740,64 @@ uint32_t CUData::getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MV
return count;
}
+bool CUData::getMedianColMV(const CUData* colCU, const Frame* colPic, int list, int ref, MV& outMV) const
+{
+ int mvCount = 0;
+ int mvX[MAX_NUM_PARTITIONS], mvY[MAX_NUM_PARTITIONS];
+
+ for (uint32_t partIdx = 0; partIdx < colCU->m_numPartitions; partIdx++)
+ {
+ uint32_t absPartAddr = partIdx & TMVP_UNIT_MASK;
+ if (colCU->m_predMode[partIdx] == MODE_NONE || colCU->isIntra(absPartAddr))
+ continue;
+
+ int8_t refIdx = colCU->m_refIdx[list][partIdx];
+ if (refIdx < 0)
+ continue;
+
+ MV rawMv = colCU->m_mv[list][partIdx];
+
+ int colPOC = colPic->m_encData->m_slice->m_poc;
+ int colRefPOC = colPic->m_encData->m_slice->m_refPOCList[list][refIdx];
+
+ int curPOC = m_slice->m_poc;
+ int curRefPOC = this->m_slice->m_refPOCList[list][ref];
+
+ MV scaledMv = scaleMvByPOCDist(rawMv, curPOC, curRefPOC, colPOC, colRefPOC);
+
+ if (mvCount >= MAX_NUM_PARTITIONS)
+ break;
+
+ mvX[mvCount] = scaledMv.x;
+ mvY[mvCount] = scaledMv.y;
+ mvCount++;
+ }
+
+ if (mvCount == 0)
+ return false;
+
+ size_t mid = mvCount >> 1;
+
+ std::nth_element(mvX, mvX + mid, mvX + mvCount);
+ std::nth_element(mvY, mvY + mid, mvY + mvCount);
+
+ if (mvCount & 1)
+ {
+ outMV.x = mvX[mid];
+ outMV.y = mvY[mid];
+ }
+ else
+ {
+ int lowerMaxX = *std::max_element(mvX, mvX + mid);
+ int lowerMaxY = *std::max_element(mvY, mvY + mid);
+
+ outMV.x = (lowerMaxX + mvX[mid]) >> 1;
+ outMV.y = (lowerMaxY + mvY[mid]) >> 1;
+ }
+
+ return true;
+}
+
// Create the PMV list. Called for each reference index.
#if (ENABLE_MULTIVIEW || ENABLE_SCC_EXT)
int CUData::getPMV(InterNeighbourMV* neighbours, uint32_t picList, uint32_t refIdx, MV* amvpCand, MV* pmv, uint32_t puIdx, uint32_t absPartIdx) const
diff --git a/source/common/cudata.h b/source/common/cudata.h
index 08dc70611..e31fb28ec 100644
--- a/source/common/cudata.h
+++ b/source/common/cudata.h
@@ -332,6 +332,8 @@ public:
const CUData* getPUAboveRightAdi(uint32_t& arPartUnitIdx, uint32_t curPartUnitIdx, uint32_t partUnitOffset) const;
const CUData* getPUBelowLeftAdi(uint32_t& blPartUnitIdx, uint32_t curPartUnitIdx, uint32_t partUnitOffset) const;
+ bool getMedianColMV(const CUData* colCU, const Frame* colPic, int list, int ref, MV& mv) const;
+
#if ENABLE_SCC_EXT
void initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp, MV lastIntraBCMv[2] = 0);
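The component-wise median used by `getMedianColMV` above can be exercised in isolation. This standalone sketch mirrors the patch's `nth_element` logic, including the even-count case that averages the lower half's maximum with the middle element:

```cpp
#include <algorithm>
#include <vector>

// Median of one MV component, matching getMedianColMV's approach:
// nth_element positions the middle value, and for an even count the
// result averages the lower half's maximum with the element at mid.
int medianComponent(std::vector<int> v)
{
    size_t mid = v.size() >> 1;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    if (v.size() & 1)
        return v[mid];
    int lowerMax = *std::max_element(v.begin(), v.begin() + mid);
    return (lowerMax + v[mid]) >> 1;
}
```

Using `nth_element` twice (once per coordinate) keeps the cost linear in the number of colocated MVs, rather than the O(n log n) a full sort would take.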
diff --git a/source/common/frame.cpp b/source/common/frame.cpp
index 200717425..ac54c3427 100644
--- a/source/common/frame.cpp
+++ b/source/common/frame.cpp
@@ -36,6 +36,7 @@ Frame::Frame()
m_reconRowFlag = NULL;
m_reconColCount = NULL;
m_countRefEncoders = 0;
+ m_ctuMEFlags = NULL;
m_encData = NULL;
for (int i = 0; i < NUM_RECON_VERSION; i++)
m_reconPic[i] = NULL;
@@ -179,9 +180,10 @@ bool Frame::create(x265_param *param, float* quantOffsets)
{
X265_CHECK((m_reconColCount == NULL), "m_reconColCount was initialized");
m_numRows = (m_fencPic->m_picHeight + param->maxCUSize - 1) / param->maxCUSize;
int32_t numCols = (m_fencPic->m_picWidth + param->maxCUSize - 1) / param->maxCUSize;
m_reconRowFlag = new ThreadSafeInteger[m_numRows];
m_reconColCount = new ThreadSafeInteger[m_numRows];
-
+    m_ctuMEFlags = new ThreadSafeInteger[m_numRows * numCols];
if (quantOffsets)
{
int32_t cuCount = (param->rc.qgSize == 8) ? m_lowres.maxBlocksInRowFullRes * m_lowres.maxBlocksInColFullRes :
@@ -358,6 +360,12 @@ void Frame::destroy()
m_reconColCount = NULL;
}
+ if (m_ctuMEFlags)
+ {
+ delete[] m_ctuMEFlags;
+ m_ctuMEFlags = NULL;
+ }
+
if (m_quantOffsets)
{
delete[] m_quantOffsets;
diff --git a/source/common/frame.h b/source/common/frame.h
index 704f19422..5b7b1e6df 100644
--- a/source/common/frame.h
+++ b/source/common/frame.h
@@ -113,6 +113,7 @@ public:
/* Frame Parallelism - notification between FrameEncoders of available motion reference rows */
ThreadSafeInteger* m_reconRowFlag; // flag of CTU rows completely reconstructed and extended for motion reference
ThreadSafeInteger* m_reconColCount; // count of CTU cols completely reconstructed and extended for motion reference
+    ThreadSafeInteger* m_ctuMEFlags;      // flag per CTU indicating whether Threaded ME has completed processing
int32_t m_numRows;
volatile uint32_t m_countRefEncoders; // count of FrameEncoder threads monitoring m_reconRowCount
diff --git a/source/common/framedata.cpp b/source/common/framedata.cpp
index 70af8a248..9c7e007b8 100644
--- a/source/common/framedata.cpp
+++ b/source/common/framedata.cpp
@@ -23,6 +23,8 @@
#include "framedata.h"
#include "picyuv.h"
+#include "search.h"
+#include "threadedme.h"
using namespace X265_NS;
@@ -35,6 +37,16 @@ bool FrameData::create(const x265_param& param, const SPS& sps, int csp)
{
m_param = ¶m;
m_slice = new Slice;
+ if (m_param->bThreadedME)
+ {
+ uint32_t bufferSize = sps.numCuInWidth * sps.numCuInHeight;
+ m_slice->m_ctuMV = (CTUMVInfo*)x265_malloc(sizeof(CTUMVInfo) * bufferSize);
+ for (uint32_t i = 0; i < bufferSize; i++)
+ {
+ m_slice->m_ctuMV[i].m_meData = (MEData*)x265_malloc(sizeof(MEData) * MAX_NUM_PUS_PER_CTU);
+ }
+ }
+
m_picCTU = new CUData[sps.numCUsInFrame];
m_picCsp = csp;
m_spsrpsIdx = -1;
@@ -92,6 +104,15 @@ void FrameData::reinit(const SPS& sps)
void FrameData::destroy()
{
delete [] m_picCTU;
+
+ if (m_slice->m_ctuMV)
+ {
+ uint32_t bufferSize = m_slice->m_sps->numCuInWidth * m_slice->m_sps->numCuInHeight;
+ for (uint32_t i = 0; i < bufferSize; i++)
+ x265_free(m_slice->m_ctuMV[i].m_meData);
+
+ x265_free(m_slice->m_ctuMV);
+ }
delete m_slice;
delete m_saoParam;
diff --git a/source/common/param.cpp b/source/common/param.cpp
index 9d2ee6686..6d97a57ac 100755
--- a/source/common/param.cpp
+++ b/source/common/param.cpp
@@ -423,6 +423,10 @@ void x265_param_default(x265_param* param)
param->searchRangeForLayer1 = 3;
param->searchRangeForLayer2 = 3;
+ /* Threaded ME */
+ param->tmeTaskBlockSize = 1;
+ param->tmeNumBufferRows = 10;
+
/*Alpha Channel Encoding*/
param->bEnableAlpha = 0;
param->numScalableLayers = 1;
@@ -487,6 +491,8 @@ int x265_param_default_preset(x265_param* param, const char* preset, const char*
param->rc.hevcAq = 0;
param->rc.qgSize = 32;
param->bEnableFastIntra = 1;
+ param->tmeTaskBlockSize = 0; // Auto-detect
+ param->tmeNumBufferRows = 20;
}
else if (!strcmp(preset, "superfast"))
{
@@ -508,6 +514,8 @@ int x265_param_default_preset(x265_param* param, const char* preset, const char*
param->rc.qgSize = 32;
param->bEnableSAO = 0;
param->bEnableFastIntra = 1;
+ param->tmeTaskBlockSize = 0; // Auto-detect
+ param->tmeNumBufferRows = 20;
}
else if (!strcmp(preset, "veryfast"))
{
@@ -522,6 +530,8 @@ int x265_param_default_preset(x265_param* param, const char* preset, const char*
param->maxNumReferences = 2;
param->rc.qgSize = 32;
param->bEnableFastIntra = 1;
+ param->tmeTaskBlockSize = 0; // Auto-detect
+ param->tmeNumBufferRows = 20;
}
else if (!strcmp(preset, "faster"))
{
@@ -1523,6 +1533,7 @@ int x265_param_parse(x265_param* p, const char* name, const char* value)
}
#endif
OPT("frame-rc") p->bConfigRCFrame = atobool(value);
+ OPT("threaded-me") p->bThreadedME = atobool(value);
else
return X265_PARAM_BAD_NAME;
}
@@ -1862,6 +1873,11 @@ int x265_check_params(x265_param* param)
"Valid final VBV buffer emptiness must be a fraction 0 - 1, or size in kbits");
CHECK(param->vbvEndFrameAdjust < 0,
"Valid vbv-end-fr-adj must be a fraction 0 - 1");
+ if ((param->rc.vbvBufferSize > 0 || param->rc.vbvMaxBitrate > 0) && param->bThreadedME)
+ {
+ param->bThreadedME = 0;
+ x265_log(param, X265_LOG_WARNING, "VBV and threaded-me both enabled. Disabling threaded-me\n");
+ }
CHECK(param->minVbvFullness < 0 && param->minVbvFullness > 100,
"min-vbv-fullness must be a fraction 0 - 100");
CHECK(param->maxVbvFullness < 0 && param->maxVbvFullness > 100,
@@ -2994,6 +3010,9 @@ void x265_copy_params(x265_param* dst, x265_param* src)
dst->bEnableHRDConcatFlag = src->bEnableHRDConcatFlag;
dst->dolbyProfile = src->dolbyProfile;
dst->bEnableSvtHevc = src->bEnableSvtHevc;
+ dst->bThreadedME = src->bThreadedME;
+ dst->tmeTaskBlockSize = src->tmeTaskBlockSize;
+ dst->tmeNumBufferRows = src->tmeNumBufferRows;
dst->bEnableFades = src->bEnableFades;
dst->bEnableSceneCutAwareQp = src->bEnableSceneCutAwareQp;
dst->fwdMaxScenecutWindow = src->fwdMaxScenecutWindow;
diff --git a/source/common/slice.h b/source/common/slice.h
index 641be210d..8ede39898 100644
--- a/source/common/slice.h
+++ b/source/common/slice.h
@@ -26,6 +26,7 @@
#define X265_SLICE_H
#include "common.h"
+#include "mv.h"
namespace X265_NS {
// private namespace
@@ -35,6 +36,8 @@ class PicList;
class PicYuv;
class MotionReference;
+struct CTUMVInfo;
+
enum SliceType
{
B_SLICE,
@@ -378,6 +381,7 @@ public:
WeightParam m_weightPredTable[2][MAX_NUM_REF][3]; // [list][refIdx][0:Y, 1:U, 2:V]
MotionReference (*m_mref)[MAX_NUM_REF + 1];
RPS m_rps;
+ CTUMVInfo* m_ctuMV;
NalUnitType m_nalUnitType;
SliceType m_sliceType;
@@ -419,6 +423,7 @@ public:
m_lastIDR = 0;
m_sLFaseFlag = true;
m_numRefIdx[0] = m_numRefIdx[1] = 0;
+ m_ctuMV = NULL;
memset(m_refFrameList, 0, sizeof(m_refFrameList));
memset(m_refReconPicList, 0, sizeof(m_refReconPicList));
memset(m_refPOCList, 0, sizeof(m_refPOCList));
diff --git a/source/common/threading.cpp b/source/common/threading.cpp
index 034b7f208..43b6f42c8 100644
--- a/source/common/threading.cpp
+++ b/source/common/threading.cpp
@@ -82,6 +82,15 @@ int no_atomic_add(int* ptr, int val)
pthread_mutex_unlock(&g_mutex);
return ret;
}
+
+int64_t no_atomic_add64(int64_t* ptr, int64_t val)
+{
+ pthread_mutex_lock(&g_mutex);
+ *ptr += val;
+ int64_t ret = *ptr;
+ pthread_mutex_unlock(&g_mutex);
+ return ret;
+}
#endif
/* C shim for forced stack alignment */
diff --git a/source/common/threading.h b/source/common/threading.h
index 2fa62bcc2..b915c78a5 100644
--- a/source/common/threading.h
+++ b/source/common/threading.h
@@ -56,6 +56,7 @@ int no_atomic_and(int* ptr, int mask);
int no_atomic_inc(int* ptr);
int no_atomic_dec(int* ptr);
int no_atomic_add(int* ptr, int val);
+int64_t no_atomic_add64(int64_t* ptr, int64_t val);
}
#define BSR(id, x) (id) = ((unsigned long)__builtin_clz(x) ^ 31)
@@ -66,7 +67,9 @@ int no_atomic_add(int* ptr, int val);
#define ATOMIC_AND(ptr, mask) no_atomic_and((int*)ptr, mask)
#define ATOMIC_INC(ptr) no_atomic_inc((int*)ptr)
#define ATOMIC_DEC(ptr) no_atomic_dec((int*)ptr)
-#define ATOMIC_ADD(ptr, val) no_atomic_add((int*)ptr, val)
+#define ATOMIC_ADD(ptr, val) (sizeof(*(ptr)) == 8 ? \
+ no_atomic_add64((int64_t*)ptr, (int64_t)(val)) : \
+ no_atomic_add((int*)ptr, (int)(val)))
#define GIVE_UP_TIME() usleep(0)
#elif __GNUC__ /* GCCs builtin atomics */
@@ -82,7 +85,7 @@ int no_atomic_add(int* ptr, int val);
#define ATOMIC_AND(ptr, mask) __sync_fetch_and_and(ptr, mask)
#define ATOMIC_INC(ptr) __sync_add_and_fetch((volatile int32_t*)ptr, 1)
#define ATOMIC_DEC(ptr) __sync_add_and_fetch((volatile int32_t*)ptr, -1)
-#define ATOMIC_ADD(ptr, val) __sync_fetch_and_add((volatile int32_t*)ptr, val)
+#define ATOMIC_ADD(ptr, val) __sync_fetch_and_add((volatile __typeof__(*(ptr))*)ptr, (__typeof__(*(ptr) + 0))(val))
#define GIVE_UP_TIME() usleep(0)
#elif defined(_MSC_VER) /* Windows atomic intrinsics */
@@ -95,7 +98,9 @@ int no_atomic_add(int* ptr, int val);
#define BSF64(id, x) _BitScanForward64(&id, x)
#define ATOMIC_INC(ptr) InterlockedIncrement((volatile LONG*)ptr)
#define ATOMIC_DEC(ptr) InterlockedDecrement((volatile LONG*)ptr)
-#define ATOMIC_ADD(ptr, val) InterlockedExchangeAdd((volatile LONG*)ptr, val)
+#define ATOMIC_ADD(ptr, val) (sizeof(*(ptr)) == 8 ? \
+ InterlockedExchangeAdd64((volatile LONGLONG*)ptr, (LONGLONG)(val)) : \
+ InterlockedExchangeAdd((volatile LONG*)ptr, (LONG)(val)))
#define ATOMIC_OR(ptr, mask) _InterlockedOr((volatile LONG*)ptr, (LONG)mask)
#define ATOMIC_AND(ptr, mask) _InterlockedAnd((volatile LONG*)ptr, (LONG)mask)
#define GIVE_UP_TIME() Sleep(0)
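The width dispatch that the `ATOMIC_ADD` changes above implement (one entry point serving both 32- and 64-bit counters, selected by `sizeof` at compile time) can be sketched portably with a small template over `std::atomic`. This is an illustration of the idea, not the patch's macro:

```cpp
#include <atomic>
#include <cstdint>

// Generic fetch-and-add covering both 32- and 64-bit counters, which is
// what the sizeof-based ternary in ATOMIC_ADD selects between.
template <typename T>
T fetchAdd(std::atomic<T>& counter, T val)
{
    return counter.fetch_add(val, std::memory_order_relaxed);
}
```

The macro form is needed here because the existing call sites pass raw integer pointers, but the template shows the same contract: the returned value is the counter's value before the addition.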
diff --git a/source/common/threadpool.cpp b/source/common/threadpool.cpp
index ec96f18b1..a5a02daca 100644
--- a/source/common/threadpool.cpp
+++ b/source/common/threadpool.cpp
@@ -27,6 +27,7 @@
#include "threading.h"
#include <new>
+#include <vector>
#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7
#include <winnt.h>
@@ -373,6 +374,138 @@ ThreadPool* ThreadPool::allocThreadPools(x265_param* p, int& numPools, bool isTh
nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << i);
}
}
+
+    /* If ThreadedME is enabled, split resources: reserve a share of the
+     * cores for threaded-me (see getTmeThreadCount) and give the remainder
+     * to the other job providers. */
+ /* Compute total CPU count from detected per-node counts when available */
+ for (int i = 0; i < numNumaNodes + 1; i++)
+ totalNumThreads += threadsPerPool[i];
+ if (!totalNumThreads)
+ totalNumThreads = ThreadPool::getCpuCount();
+
+ int threadsFrameEnc = 0;
+
+ if (p->bThreadedME)
+ {
+ /**
+ * TODO: The following thread split decision has only been tuned
+ * for ultrafast and medium presets. Tuning for other presets
+ * needs to be completed.
+ */
+ int targetTME = getTmeThreadCount(p, totalNumThreads);
+ threadsFrameEnc = totalNumThreads - targetTME;
+
+ if (targetTME < 1)
+ targetTME = 1;
+
+ int defaultNumFT = getFrameThreadsCount(p, totalNumThreads);
+ if (threadsFrameEnc < defaultNumFT)
+ {
+ threadsFrameEnc = defaultNumFT;
+ targetTME = totalNumThreads - threadsFrameEnc;
+ }
+
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 || HAVE_LIBNUMA
+ if (bNumaSupport && numNumaNodes > 1)
+ {
+ int tmeNumaNodes = 0;
+ int leftover = 0;
+
+ // First thread pool belongs to ThreadedME
+ std::vector<int> threads(1, 0);
+ std::vector<uint64_t> nodeMasks(1, 0);
+ int poolIndex = 0;
+
+ /* Greedily assign whole NUMA nodes to TME until reaching or exceeding the target */
+ for (int i = 0; i < numNumaNodes + 1; i++)
+ {
+ if (!threadsPerPool[i] && !nodeMaskPerPool[i])
+ continue;
+
+ int toTake = X265_MIN(threadsPerPool[i], targetTME - threads[0]);
+ if (toTake > 0)
+ {
+ threads[poolIndex] += toTake;
+ nodeMasks[poolIndex] |= nodeMaskPerPool[i];
+ tmeNumaNodes++;
+
+ if (threads[0] == targetTME)
+ poolIndex++;
+
+ if (toTake < threadsPerPool[i])
+ leftover = threadsPerPool[i] - toTake;
+ }
+ else
+ {
+ threads.push_back(threadsPerPool[i]);
+ nodeMasks.push_back(nodeMaskPerPool[i]);
+ poolIndex++;
+ }
+ }
+
+ // Distribute leftover threads among FrameEncoders
+ if (leftover)
+ {
+            // Case 1: at least one thread pool has already been created for FrameEncoder(s)
+ if (threads.size() > 1)
+ {
+ int split = static_cast<int>(static_cast<double>(leftover) / (numNumaNodes - 1));
+ for (int pool = 1; pool < numNumaNodes; pool++)
+ {
+ int give = X265_MIN(split, leftover);
+ threads[pool] += give;
+ leftover -= give;
+ }
+ }
+
+ // Case 2: FrameEncoder(s) haven't received threads yet
+ if (threads.size() == 1)
+ {
+ threads.push_back(leftover);
+ // Give the same node mask as the last node of ThreadedME
+ uint64_t msb = 1;
+ uint64_t tmeNodeMask = nodeMasks[0];
+ while (tmeNodeMask > 1)
+ {
+ tmeNodeMask >>= 1;
+ msb <<= 1;
+ }
+ nodeMasks.push_back(msb);
+ }
+ }
+
+ // Apply calculated threadpool assignment
+ // TODO: Make sure this doesn't cause a problem later on
+ memset(threadsPerPool, 0, sizeof(threadsPerPool));
+ memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool));
+
+ numPools = numNumaNodes = static_cast<int>(threads.size());
+ for (int pool = 0; pool < numPools; pool++)
+ {
+ threadsPerPool[pool] = threads[pool];
+ nodeMaskPerPool[pool] = nodeMasks[pool];
+ }
+ }
+ else
+#endif
+ {
+ memset(threadsPerPool, 0, sizeof(threadsPerPool));
+ memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool));
+
+ threadsPerPool[0] = targetTME;
+ nodeMaskPerPool[0] = 1;
+
+ threadsPerPool[1] = threadsFrameEnc;
+ nodeMaskPerPool[1] = 1;
+
+ numPools = 2;
+ }
+ }
+ else
+ {
+ threadsFrameEnc = totalNumThreads;
+ }
// If the last pool size is > MAX_POOL_THREADS, clip it to spawn thread pools only of size >= 1/2 max (heuristic)
if ((threadsPerPool[numNumaNodes] > MAX_POOL_THREADS) &&
@@ -383,17 +516,22 @@ ThreadPool* ThreadPool::allocThreadPools(x265_param* p, int& numPools, bool isTh
"Creating only %d worker threads beyond specified numbers with --pools (if specified) to prevent asymmetry in pools; may not use all HW contexts\n", threadsPerPool[numNumaNodes]);
}
- numPools = 0;
- for (int i = 0; i < numNumaNodes + 1; i++)
+ if (!p->bThreadedME)
{
- if (bNumaSupport)
- x265_log(p, X265_LOG_DEBUG, "NUMA node %d may use %d logical cores\n", i, cpusPerNode[i]);
- if (threadsPerPool[i])
+ numPools = 0;
+ for (int i = 0; i < numNumaNodes + 1; i++)
{
- numPools += (threadsPerPool[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS;
- totalNumThreads += threadsPerPool[i];
+ if (bNumaSupport)
+ x265_log(p, X265_LOG_DEBUG, "NUMA node %d may use %d logical cores\n", i, cpusPerNode[i]);
+
+ if (threadsPerPool[i])
+ {
+ numPools += (threadsPerPool[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS;
+ totalNumThreads += threadsPerPool[i];
+ }
}
}
+
if (!isThreadsReserved)
{
if (!numPools)
@@ -403,13 +541,13 @@ ThreadPool* ThreadPool::allocThreadPools(x265_param* p, int& numPools, bool isTh
}
if (!p->frameNumThreads)
- ThreadPool::getFrameThreadsCount(p, totalNumThreads);
+ p->frameNumThreads = ThreadPool::getFrameThreadsCount(p, threadsFrameEnc);
}
if (!numPools)
return NULL;
- if (numPools > p->frameNumThreads)
+ if (numPools > p->frameNumThreads && !p->bThreadedME)
{
x265_log(p, X265_LOG_DEBUG, "Reducing number of thread pools for frame thread count\n");
numPools = X265_MAX(p->frameNumThreads / 2, 1);
@@ -419,27 +557,33 @@ ThreadPool* ThreadPool::allocThreadPools(x265_param* p, int& numPools, bool isTh
ThreadPool *pools = new ThreadPool[numPools];
if (pools)
{
- int maxProviders = (p->frameNumThreads + numPools - 1) / numPools + !isThreadsReserved; /* +1 is Lookahead, always assigned to threadpool 0 */
+ int poolCount = (p->bThreadedME) ? numPools - 1 : numPools;
int node = 0;
for (int i = 0; i < numPools; i++)
{
+ int maxProviders = (p->bThreadedME && i == 0) // threadpool 0 is dedicated to ThreadedME
+ ? 1
+ : (p->frameNumThreads + poolCount - 1) / poolCount + !isThreadsReserved; // +1 is Lookahead, always assigned to threadpool 0
+
while (!threadsPerPool[node])
node++;
- int numThreads = X265_MIN(MAX_POOL_THREADS, threadsPerPool[node]);
+ int numThreads = threadsPerPool[node];
int origNumThreads = numThreads;
+
if (i == 0 && p->lookaheadThreads > numThreads / 2)
{
p->lookaheadThreads = numThreads / 2;
x265_log(p, X265_LOG_DEBUG, "Setting lookahead threads to a maximum of half the total number of threads\n");
}
+
if (isThreadsReserved)
{
numThreads = p->lookaheadThreads;
maxProviders = 1;
}
-
else if (i == 0)
numThreads -= p->lookaheadThreads;
+
if (!pools[i].create(numThreads, maxProviders, nodeMaskPerPool[node]))
{
X265_FREE(pools);
@@ -510,6 +654,8 @@ bool ThreadPool::create(int numThreads, int maxProviders, uint64_t nodeMask)
new (m_workers + i)WorkerThread(*this, i);
m_jpTable = X265_MALLOC(JobProvider*, maxProviders);
+ if (m_jpTable)
+ memset(m_jpTable, 0, sizeof(JobProvider*) * maxProviders);
m_numProviders = 0;
return m_workers && m_jpTable;
@@ -659,25 +805,39 @@ int ThreadPool::getCpuCount()
#endif
}
-void ThreadPool::getFrameThreadsCount(x265_param* p, int cpuCount)
+int ThreadPool::getFrameThreadsCount(x265_param* p, int cpuCount)
{
int rows = (p->sourceHeight + p->maxCUSize - 1) >> g_log2Size[p->maxCUSize];
if (!p->bEnableWavefront)
- p->frameNumThreads = X265_MIN3(cpuCount, (rows + 1) / 2, X265_MAX_FRAME_THREADS);
+ return X265_MIN3(cpuCount, (rows + 1) / 2, X265_MAX_FRAME_THREADS);
else if (cpuCount >= 32)
- p->frameNumThreads = (p->sourceHeight > 2000) ? 6 : 5;
+ return (p->sourceHeight > 2000) ? 6 : 5;
else if (cpuCount >= 16)
- p->frameNumThreads = 4;
+ return 4;
else if (cpuCount >= 8)
#if _WIN32 && X265_ARCH_ARM64
- p->frameNumThreads = cpuCount;
+ return cpuCount;
#else
- p->frameNumThreads = 3;
+ return 3;
#endif
else if (cpuCount >= 4)
- p->frameNumThreads = 2;
+ return 2;
else
- p->frameNumThreads = 1;
+ return 1;
+}
+
+int ThreadPool::getTmeThreadCount(x265_param* param, int cpuCount)
+{
+ bool isHighRes = (param->sourceWidth > 2000);
+
+ // ultrafast preset or similar options
+ if (!param->subpelRefine || param->minCUSize >= 16)
+ {
+ if (isHighRes) return cpuCount / 2;
+ }
+
+ if (isHighRes) return (cpuCount * 7) / 10;
+ else return (cpuCount * 4) / 5;
}
} // end namespace X265_NS
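The split arithmetic in `getTmeThreadCount` above can be checked with a standalone replica. Here `ultrafastLike` stands in for the patch's `!subpelRefine || minCUSize >= 16` test, and `isHighRes` for `sourceWidth > 2000`:

```cpp
// Replica of the getTmeThreadCount heuristic: ultrafast-like settings on
// high-resolution content reserve half the cores for ThreadedME; otherwise
// high-res reserves 7/10 of them and low-res reserves 4/5.
int tmeThreadCount(bool ultrafastLike, bool isHighRes, int cpuCount)
{
    if (ultrafastLike && isHighRes)
        return cpuCount / 2;
    return isHighRes ? (cpuCount * 7) / 10 : (cpuCount * 4) / 5;
}
```

For a 64-core machine this gives 44 ThreadedME threads for high-res content at slower presets, 51 for low-res, and 32 for ultrafast-like high-res encodes, with the remainder going to the frame encoder and lookahead pools.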
diff --git a/source/common/threadpool.h b/source/common/threadpool.h
index 867539f3a..051f4cba4 100644
--- a/source/common/threadpool.h
+++ b/source/common/threadpool.h
@@ -102,10 +102,12 @@ public:
void setThreadNodeAffinity(void *numaMask);
int tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap);
int tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master);
+
static ThreadPool* allocThreadPools(x265_param* p, int& numPools, bool isThreadsReserved);
static int getCpuCount();
static int getNumaNodeCount();
- static void getFrameThreadsCount(x265_param* p,int cpuCount);
+ static int getFrameThreadsCount(x265_param* p, int cpuCount);
+ static int getTmeThreadCount(x265_param* p, int cpuCount);
};
/* Any worker thread may enlist the help of idle worker threads from the same
diff --git a/source/encoder/CMakeLists.txt b/source/encoder/CMakeLists.txt
index 74e36e1b2..8c8090786 100644
--- a/source/encoder/CMakeLists.txt
+++ b/source/encoder/CMakeLists.txt
@@ -43,4 +43,5 @@ add_library(encoder OBJECT ../x265.h
reference.cpp reference.h
encoder.cpp encoder.h
api.cpp
- weightPrediction.cpp svt.h)
+ weightPrediction.cpp svt.h
+ threadedme.h threadedme.cpp)
diff --git a/source/encoder/analysis.cpp b/source/encoder/analysis.cpp
index b219d5da4..cea255cc4 100644
--- a/source/encoder/analysis.cpp
+++ b/source/encoder/analysis.cpp
@@ -135,6 +135,153 @@ void Analysis::destroy()
X265_FREE(cacheCost);
}
+void Analysis::computeMVForPUs(CUData& ctu, const CUGeom& cuGeom, int qp, Frame& frame)
+{
+ int areaId = 0;
+ int finalIdx = 0;
+
+ uint32_t depth = cuGeom.depth;
+ uint32_t nextDepth = depth + 1;
+
+ uint32_t cuSize = 1 << cuGeom.log2CUSize;
+ bool mightSplit = (cuSize > m_param->minCUSize);
+
+ uint32_t cuX = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx];
+ uint32_t cuY = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx];
+
+ if (cuSize != m_param->maxCUSize)
+ {
+ uint32_t subCUSize = m_param->maxCUSize / 2;
+ areaId = (cuX >= subCUSize) + 2 * (cuY >= subCUSize) + 1;
+ }
+
+ if (mightSplit)
+ {
+ int nextQP = qp;
+ for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
+ {
+ const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx);
+ if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+ nextQP = setLambdaFromQP(ctu, calculateQpforCuSize(ctu, childGeom));
+
+ computeMVForPUs(ctu, childGeom, nextQP, frame);
+ }
+ }
+
+ ModeDepth& md = m_modeDepth[cuGeom.depth];
+ CUData& cu = md.pred[PRED_2Nx2N].cu;
+
+ for (int i = 0; i < MAX_NUM_PU_SIZES; i++)
+ {
+ const PUBlock& pu = g_puLookup[i];
+ int startIdx = g_puStartIdx[pu.width + pu.height][static_cast<int>(pu.partsize)];
+
+ if (pu.width > cuSize || pu.height > cuSize || (pu.width != cuSize && pu.height != cuSize))
+ continue;
+
+ if (!m_param->bEnableAMP && pu.isAmp)
+ continue;
+ if (!m_param->bEnableRectInter && pu.width != pu.height && !pu.isAmp)
+ continue;
+
+ int blockWidth = pu.isAmp ? X265_MAX(pu.width, pu.height) : pu.width;
+ int blockHeight = pu.isAmp ? blockWidth : pu.height;
+
+ int numColsCTU = m_param->maxCUSize / blockWidth;
+ int numRowsCTU = m_param->maxCUSize / blockHeight;
+
+ int puOffset = 0;
+ if (pu.isAmp)
+ puOffset = numRowsCTU * numColsCTU;
+ else if (pu.partsize == SIZE_2NxN)
+ puOffset = numColsCTU;
+ else if (pu.partsize == SIZE_Nx2N)
+ puOffset = 1;
+
+ int col = (cuX - ctu.m_cuPelX) / blockWidth;
+ int row = (cuY - ctu.m_cuPelY) / blockHeight;
+
+ finalIdx = startIdx + row * numColsCTU + col;
+
+ int subIdx = finalIdx - startIdx;
+
+ int puRow = subIdx / numColsCTU;
+ int puCol = subIdx % numColsCTU;
+
+ int leftIdx = (puCol > 0) ? startIdx + puRow * numColsCTU + (puCol - 1) : -1;
+ int aboveIdx = (puRow > 0) ? startIdx + (puRow - 1) * numColsCTU + puCol : -1;
+ int aboveLeftIdx = (puRow > 0 && puCol > 0) ? startIdx + (puRow - 1) * numColsCTU + (puCol - 1) : -1;
+ int aboveRightIdx = (puRow > 0 && puCol < numColsCTU - 1) ? startIdx + (puRow - 1) * numColsCTU + (puCol + 1) : -1;
+
+ int neighborIdx[MD_ABOVE_LEFT + 1] = { leftIdx, aboveIdx, aboveRightIdx, -1, aboveLeftIdx};
+
+ cu.initSubCU(ctu, cuGeom, qp);
+ cu.setPartSizeSubParts(pu.partsize);
+ setLambdaFromQP(cu, qp);
+ puMotionEstimation(m_slice, cuGeom, cu, m_frame->m_fencPic, puOffset, pu.partsize, areaId, finalIdx, false, neighborIdx);
+ }
+}
+
+void Analysis::deriveMVsForCTU(CUData& ctu, const CUGeom& cuGeom, Frame& frame)
+{
+ m_slice = ctu.m_slice;
+ m_frame = &frame;
+ m_param = m_frame->m_param;
+
+ ModeDepth& md = m_modeDepth[0];
+
+ int numPredDir = m_slice->isInterP() ? 1 : 2;
+
+ // Full CTU
+ int baseQP = setLambdaFromQP(ctu, ctu.m_slice->m_pps->bUseDQP ? calculateQpforCuSize(ctu, cuGeom) : ctu.m_slice->m_sliceQp);
+
+ md.pred[PRED_2Nx2N].cu.initSubCU(ctu, cuGeom, baseQP);
+ md.pred[PRED_2Nx2N].cu.setPartSizeSubParts(SIZE_2Nx2N);
+
+ puMotionEstimation(m_slice, cuGeom, md.pred[PRED_2Nx2N].cu, frame.m_fencPic, 0, SIZE_2Nx2N, 0, 0, true);
+
+ // Sub-CUs
+ if (m_param->maxCUSize != m_param->minCUSize)
+ {
+ for (int sub = 0; sub < 4; sub++)
+ {
+ ModeDepth& md1 = m_modeDepth[1];
+
+ const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + sub);
+ int qp = setLambdaFromQP(ctu, ctu.m_slice->m_pps->bUseDQP ? calculateQpforCuSize(ctu, childGeom) : ctu.m_slice->m_sliceQp);
+
+ md1.pred[PRED_2Nx2N].cu.initSubCU(ctu, childGeom, qp);
+ md1.pred[PRED_2Nx2N].cu.setPartSizeSubParts(SIZE_2Nx2N);
+
+ puMotionEstimation(m_slice, childGeom, md1.pred[PRED_2Nx2N].cu, frame.m_fencPic, 0, SIZE_2Nx2N, sub + 1, 0, true);
+ }
+ }
+
+ const Frame* colPic = m_slice->m_refFrameList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
+ const CUData* colCU = colPic->m_encData->getPicCTU(ctu.m_cuAddr);
+
+ for (int list = 0; list < numPredDir; list++)
+ {
+ int numRef = ctu.m_slice->m_numRefIdx[list];
+
+ for (int ref = 0; ref < numRef; ref++)
+ {
+ MV medianMv;
+ bool valid = ctu.getMedianColMV(colCU, colPic, list, ref, medianMv);
+ if (!valid)
+ continue;
+
+ for (int areaIdx = 0; areaIdx < 5; areaIdx++)
+ {
+ m_areaBestMV[areaIdx][list][ref] = medianMv;
+ }
+ }
+ }
+
+ computeMVForPUs(ctu, cuGeom, baseQP, frame);
+
+}
+
Mode& Analysis::compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext)
{
m_slice = ctu.m_slice;
diff --git a/source/encoder/analysis.h b/source/encoder/analysis.h
index e5fa57367..9dfb34dcf 100644
--- a/source/encoder/analysis.h
+++ b/source/encoder/analysis.h
@@ -130,6 +130,24 @@ public:
Mode& compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext);
int32_t loadTUDepth(CUGeom cuGeom, CUData parentCTU);
+ /**
+ * @brief Build CTU-level and area-level MVP seeds used by threaded ME.
+ *
+ * Performs an initial 2Nx2N search on the full CTU (and first split depth
+ * when available), propagates temporal/colocated medians, and then drives
+ * per-PU motion estimation.
+ */
+ void deriveMVsForCTU(CUData& ctu, const CUGeom& cuGeom, Frame& frame);
+
+ /**
+ * @brief Recursively walk CU partitions and run ME for each enabled PU shape.
+ *
+ * Computes the PU index mapping (`finalIdx`) used by CTU MV storage and
+ * submits each PU to puMotionEstimation() with neighbor indices for MVP
+ * derivation.
+ */
+ void computeMVForPUs(CUData& ctu, const CUGeom& cuGeom, int qp, Frame& frame);
+
protected:
/* Analysis data for save/load mode, writes/reads data based on absPartIdx */
x265_analysis_inter_data* m_reuseInterDataCTU;
diff --git a/source/encoder/api.cpp b/source/encoder/api.cpp
index 0a06c6eb3..a2725d5e7 100644
--- a/source/encoder/api.cpp
+++ b/source/encoder/api.cpp
@@ -1403,6 +1403,8 @@ FILE* x265_csvlog_open(const x265_param* param)
/* detailed performance statistics */
fprintf(csvfp, ", DecideWait (ms), Row0Wait (ms), Wall time (ms), Ref Wait Wall (ms), Total CTU time (ms),"
"Stall Time (ms), Total frame time (ms), Avg WPP, Row Blocks");
+
+ fprintf(csvfp, ", Total ThreadedME Wait Time (ms), Total ThreadedME Time (ms)");
#if ENABLE_LIBVMAF
fprintf(csvfp, ", VMAF Frame Score");
#endif
@@ -1539,6 +1541,9 @@ void x265_csvlog_frame(const x265_param* param, const x265_picture* pic)
frameStats->totalFrameTime);
fprintf(param->csvfpt, " %.3lf, %d", frameStats->avgWPP, frameStats->countRowBlocks);
+
+ fprintf(param->csvfpt, ", %.1lf, %.1lf", frameStats->tmeWaitTime / 1000.0, frameStats->tmeTime / 1000.0);
+
#if ENABLE_LIBVMAF
fprintf(param->csvfpt, ", %lf", frameStats->vmafFrameScore);
#endif
diff --git a/source/encoder/dpb.cpp b/source/encoder/dpb.cpp
index c5bb10bf2..c364ef39a 100644
--- a/source/encoder/dpb.cpp
+++ b/source/encoder/dpb.cpp
@@ -94,6 +94,13 @@ void DPB::recycleUnreferenced()
{
curFrame->m_reconRowFlag[row].set(0);
curFrame->m_reconColCount[row].set(0);
+
+ uint32_t numCols = (curFrame->m_fencPic->m_picWidth + curFrame->m_param->maxCUSize - 1) / curFrame->m_param->maxCUSize;
+ for (uint32_t col = 0; col < numCols; col++)
+ {
+ uint32_t ctuAddr = row * numCols + col;
+ curFrame->m_ctuMEFlags[ctuAddr].set(0);
+ }
}
// iterator is invalidated by remove, restart scan
diff --git a/source/encoder/encoder.cpp b/source/encoder/encoder.cpp
index c4a6aae75..dd5963ccf 100644
--- a/source/encoder/encoder.cpp
+++ b/source/encoder/encoder.cpp
@@ -39,6 +39,7 @@
#include "ratecontrol.h"
#include "dpb.h"
#include "nal.h"
+#include "threadedme.h"
#include "x265.h"
@@ -132,6 +133,7 @@ Encoder::Encoder()
m_numLumaWPBiFrames = 0;
m_numChromaWPBiFrames = 0;
m_lookahead = NULL;
+ m_threadedME = NULL;
m_rateControl = NULL;
m_dpb = NULL;
m_numDelayedPic = 0;
@@ -263,7 +265,7 @@ void Encoder::create()
bool allowPools = !strlen(p->numaPools) || strcmp(p->numaPools, "none");
// Trim the thread pool if --wpp, --pme, and --pmode are disabled
- if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation && !p->lookaheadSlices)
+ if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation && !p->lookaheadSlices && !p->bThreadedME)
allowPools = false;
m_numPools = 0;
@@ -275,7 +277,7 @@ void Encoder::create()
{
// auto-detect frame threads
int cpuCount = ThreadPool::getCpuCount();
- ThreadPool::getFrameThreadsCount(p, cpuCount);
+ p->frameNumThreads = ThreadPool::getFrameThreadsCount(p, cpuCount);
}
}
@@ -290,9 +292,12 @@ void Encoder::create()
x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pmode disabled\n");
if (p->lookaheadSlices)
x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --lookahead-slices disabled\n");
+ if (p->bThreadedME)
+ x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --threaded-me disabled\n");
// disable all pool features if the thread pool is disabled or unusable.
p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0;
+ p->bThreadedME = 0;
}
x265_log(p, X265_LOG_INFO, "Slices : %d\n", p->maxSlices);
@@ -305,6 +310,8 @@ void Encoder::create()
len += snprintf(buf + len, sizeof(buf) - len, "%spmode", len ? "+" : "");
if (p->bDistributeMotionEstimation)
len += snprintf(buf + len, sizeof(buf) - len, "%spme ", len ? "+" : "");
+ if (p->bThreadedME)
+ len += snprintf(buf + len, sizeof(buf) - len, "%sthreaded-me", len ? "+" : "");
if (!len)
strcpy(buf, "none");
@@ -316,17 +323,37 @@ void Encoder::create()
m_frameEncoder[i]->m_nalList.m_annexB = !!m_param->bAnnexB;
}
+ if (p->bThreadedME)
+ {
+ m_threadedME = new ThreadedME(m_param, *this);
+ }
+
if (m_numPools)
{
+ // First threadpool belongs to ThreadedME, if the feature is enabled
+ if (p->bThreadedME)
+ {
+ m_threadedME->m_pool = &m_threadPool[0];
+ m_threadedME->m_jpId = 0;
+
+ m_threadPool[0].m_numProviders = 1;
+ m_threadPool[0].m_jpTable[m_threadedME->m_jpId] = m_threadedME;
+ }
+
+ int numFrameThreadPools = (!m_param->bThreadedME) ? m_numPools : m_numPools - 1;
+
for (int i = 0; i < m_param->frameNumThreads; i++)
{
- int pool = i % m_numPools;
+ // Since first pool belongs to ThreadedME
+ int pool = static_cast<int>(p->bThreadedME) + i % numFrameThreadPools;
m_frameEncoder[i]->m_pool = &m_threadPool[pool];
m_frameEncoder[i]->m_jpId = m_threadPool[pool].m_numProviders++;
m_threadPool[pool].m_jpTable[m_frameEncoder[i]->m_jpId] = m_frameEncoder[i];
}
- for (int i = 0; i < m_numPools; i++)
- m_threadPool[i].start();
+
+
+ for (int j = 0; j < m_numPools; j++)
+ m_threadPool[j].start();
}
else
{
@@ -354,7 +381,7 @@ void Encoder::create()
lookAheadThreadPool = ThreadPool::allocThreadPools(p, pools, 1);
}
else
- lookAheadThreadPool = m_threadPool;
+ lookAheadThreadPool = (!m_param->bThreadedME) ? m_threadPool : &m_threadPool[1];
m_lookahead = new Lookahead(m_param, lookAheadThreadPool);
if (pools)
{
@@ -367,6 +394,22 @@ void Encoder::create()
m_lookahead->m_numPools = pools;
m_dpb = new DPB(m_param);
+ if (p->bThreadedME)
+ {
+ if (!m_threadedME->create())
+ {
+ m_param->bThreadedME = 0;
+ delete m_threadedME;
+ m_threadedME = NULL;
+
+ x265_log(m_param, X265_LOG_ERROR, "Failed to create threadedME thread pool, --threaded-me disabled\n");
+ }
+ else
+ {
+ m_threadedME->start();
+ }
+ }
+
m_rateControl = new RateControl(*m_param, this);
if (!m_param->bResetZoneConfig)
{
@@ -480,6 +523,7 @@ void Encoder::create()
m_aborted = true;
initRefIdx();
+
if (strlen(m_param->analysisSave) && m_param->bUseAnalysisFile)
{
char* temp = strcatFilename(m_param->analysisSave, ".temp");
@@ -590,7 +634,10 @@ void Encoder::stopJobs()
if (m_lookahead)
m_lookahead->stopJobs();
-
+
+ if (m_threadedME)
+ m_threadedME->stopJobs();
+
for (int i = 0; i < m_param->frameNumThreads; i++)
{
if (m_frameEncoder[i])
@@ -933,6 +980,12 @@ void Encoder::destroy()
delete m_lookahead;
}
+ if (m_threadedME)
+ {
+ m_threadedME->destroy();
+ delete m_threadedME;
+ }
+
delete m_dpb;
if (!m_param->bResetZoneConfig && m_param->rc.zonefileCount)
{
@@ -2841,6 +2894,12 @@ void Encoder::printSummary()
for (int i = 0; i < m_param->frameNumThreads; i++)
cuStats.accumulate(m_frameEncoder[i]->m_cuStats, *m_param);
+ if (m_param->bThreadedME)
+ {
+ m_threadedME->collectStats();
+ cuStats.accumulate(m_threadedME->m_cuStats, *m_param);
+ }
+
if (!cuStats.totalCTUTime)
return;
@@ -2855,7 +2914,7 @@ void Encoder::printSummary()
batchElapsedTime + coopSliceElapsedTime;
int64_t totalWorkerTime = cuStats.totalCTUTime + cuStats.loopFilterElapsedTime + cuStats.pmodeTime +
- cuStats.pmeTime + lookaheadWorkerTime + cuStats.weightAnalyzeTime;
+ cuStats.pmeTime + lookaheadWorkerTime + cuStats.weightAnalyzeTime + cuStats.tmeTime;
int64_t elapsedEncodeTime = x265_mdate() - m_encodeStartTime;
int64_t interRDOTotalTime = 0, intraRDOTotalTime = 0;
@@ -2887,6 +2946,12 @@ void Encoder::printSummary()
(double)cuStats.countPMETasks / cuStats.countPMEMasters,
ELAPSED_MSEC(cuStats.pmeTime) / cuStats.countPMETasks);
}
+ else if (m_param->bThreadedME && cuStats.countTmeTasks)
+ {
+ x265_log(m_param, X265_LOG_INFO, "CU: %%%05.2lf time spent in motion estimation, averaging %.3lf CU inter modes per CTU\n",
+ 100.0 * (cuStats.motionEstimationElapsedTime + cuStats.tmeTime) / totalWorkerTime,
+ (double)cuStats.countMotionEstimate / cuStats.totalCTUs);
+ }
else
{
x265_log(m_param, X265_LOG_INFO, "CU: %%%05.2lf time spent in motion estimation, averaging %.3lf CU inter modes per CTU\n",
@@ -2974,6 +3039,11 @@ void Encoder::printSummary()
ELAPSED_SEC(totalWorkerTime),
cuStats.totalCTUs / ELAPSED_SEC(totalWorkerTime));
+ if (m_param->bThreadedME && cuStats.countTmeBlockedCTUs)
+ x265_log(m_param, X265_LOG_INFO, "CU: " X265_LL " CTUs blocked by ThreadedME, %%%05.2lf of total CTUs\n",
+ cuStats.countTmeBlockedCTUs,
+ 100.0 * cuStats.countTmeBlockedCTUs / cuStats.totalCTUs);
+
if (m_threadPool)
x265_log(m_param, X265_LOG_INFO, "CU: %.3lf average worker utilization, %%%05.2lf of theoretical maximum utilization\n",
(double)totalWorkerTime / elapsedEncodeTime,
@@ -3173,6 +3243,10 @@ void Encoder::finishFrameStats(Frame* curFrame, FrameEncoder *curEncoder, x265_f
frameStats->totalCTUTime = ELAPSED_MSEC(0, curEncoder->m_totalWorkerElapsedTime[layer]);
frameStats->stallTime = ELAPSED_MSEC(0, curEncoder->m_totalNoWorkerTime[layer]);
frameStats->totalFrameTime = ELAPSED_MSEC(curFrame->m_encodeStartTime, x265_mdate());
+
+ frameStats->tmeTime = curEncoder->m_totalThreadedMETime[layer];
+ frameStats->tmeWaitTime = curEncoder->m_totalThreadedMEWait[layer];
+
if (curEncoder->m_totalActiveWorkerCount)
frameStats->avgWPP = (double)curEncoder->m_totalActiveWorkerCount / curEncoder->m_activeWorkerCountSamples;
else
diff --git a/source/encoder/encoder.h b/source/encoder/encoder.h
index 40af6a50d..d532f699f 100644
--- a/source/encoder/encoder.h
+++ b/source/encoder/encoder.h
@@ -33,10 +33,14 @@
#include "framedata.h"
#include "svt.h"
#include "temporalfilter.h"
+#include "threadedme.h"
+
#ifdef ENABLE_HDR10_PLUS
#include "dynamicHDR10/hdr10plus.h"
#endif
+
struct x265_encoder {};
+
namespace X265_NS {
// private namespace
extern const char g_sliceTypeToChar[3];
@@ -212,6 +216,7 @@ public:
x265_param* m_zoneParam;
RateControl* m_rateControl;
Lookahead* m_lookahead;
+ ThreadedME* m_threadedME;
AdaptiveFrameDuplication* m_dupBuffer[DUP_BUFFER]; // picture buffer of size 2
/*Frame duplication: Two pictures used to compute PSNR */
pixel* m_dupPicOne[3];
diff --git a/source/encoder/frameencoder.cpp b/source/encoder/frameencoder.cpp
index ac4f91f59..af73626af 100644
--- a/source/encoder/frameencoder.cpp
+++ b/source/encoder/frameencoder.cpp
@@ -36,6 +36,8 @@
#include "nal.h"
#include "temporalfilter.h"
+#include <iostream>
+
namespace X265_NS {
void weightAnalyse(Slice& slice, Frame& frame, x265_param& param);
@@ -200,6 +202,8 @@ bool FrameEncoder::init(Encoder *top, int numRows, int numCols)
m_sliceAddrBits = (uint16_t)(tmp + 1);
}
+ m_tmeDeps.resize(m_numRows);
+
m_retFrameBuffer = X265_MALLOC(Frame*, m_param->numLayers);
for (int layer = 0; layer < m_param->numLayers; layer++)
m_retFrameBuffer[layer] = NULL;
@@ -447,6 +451,8 @@ void FrameEncoder::compressFrame(int layer)
m_totalActiveWorkerCount = 0;
m_activeWorkerCountSamples = 0;
m_totalWorkerElapsedTime[layer] = 0;
+ m_totalThreadedMETime[layer] = 0;
+ m_totalThreadedMEWait[layer] = 0;
m_totalNoWorkerTime[layer] = 0;
m_countRowBlocks = 0;
m_allRowsAvailableTime[layer] = 0;
@@ -915,7 +921,7 @@ void FrameEncoder::compressFrame(int layer)
* compressed in a wave-front pattern if WPP is enabled. Row based loop
* filters runs behind the CTU compression and reconstruction */
- for (uint32_t sliceId = 0; sliceId < m_param->maxSlices; sliceId++)
+ for (uint32_t sliceId = 0; sliceId < m_param->maxSlices; sliceId++)
m_rows[m_sliceBaseRow[sliceId]].active = true;
if (m_param->bEnableWavefront)
@@ -975,8 +981,16 @@ void FrameEncoder::compressFrame(int layer)
m_mref[l][ref].applyWeight(rowIdx, m_numRows, sliceEndRow, sliceId);
}
}
-
+
enableRowEncoder(m_row_to_idx[row]); /* clear external dependency for this row */
+
+ if (m_top->m_threadedME && !slice->isIntra())
+ {
+ ScopedLock lock(m_tmeDepLock);
+ m_tmeDeps[row].external = true;
+ m_top->m_threadedME->enqueueReadyRows(row, layer, this);
+ }
+
if (!rowInSlice)
{
m_row0WaitTime[layer] = x265_mdate();
@@ -1038,6 +1052,11 @@ void FrameEncoder::compressFrame(int layer)
vmafFrameLevelScore();
#endif
+ m_tmeDepLock.acquire();
+ m_tmeDeps.clear();
+ m_tmeDeps.resize(m_numRows);
+ m_tmeDepLock.release();
+
if (m_param->maxSlices > 1)
{
PicYuv *reconPic = m_frame[layer]->m_reconPic[0];
@@ -1470,7 +1489,9 @@ void FrameEncoder::processRow(int row, int threadId, int layer)
const uint32_t typeNum = m_idx_to_row[row & 1];
if (!typeNum)
+ {
processRowEncoder(realRow, m_tld[threadId], layer);
+ }
else
{
m_frameFilter.processRow(realRow, layer);
@@ -1600,6 +1621,12 @@ void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld, int layer
if (tld.analysis.m_sliceMaxY < tld.analysis.m_sliceMinY)
tld.analysis.m_sliceMaxY = tld.analysis.m_sliceMinY = 0;
+ if (m_top->m_threadedME && !slice->isIntra())
+ {
+ ScopedLock lock(m_tmeDepLock);
+ m_tmeDeps[row].internal = true;
+ m_top->m_threadedME->enqueueReadyRows(row, layer, this);
+ }
while (curRow.completed < numCols)
{
@@ -1665,6 +1692,26 @@ void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld, int layer
if (m_param->dynamicRd && (int32_t)(m_rce.qpaRc - m_rce.qpNoVbv) > 0)
ctu->m_vbvAffected = true;
+ if (m_top->m_threadedME && slice->m_sliceType != I_SLICE)
+ {
+ int64_t waitStart = x265_mdate();
+ bool waited = false;
+
+ // Wait for threadedME to complete ME up to this CTU
+ while (m_frame[layer]->m_ctuMEFlags[cuAddr].get() == 0)
+ {
+#if DETAILED_CU_STATS
+ tld.analysis.m_stats[m_jpId].countTmeBlockedCTUs += !waited;
+#endif
+ m_frame[layer]->m_ctuMEFlags[cuAddr].waitForChange(0);
+ waited = true;
+ }
+
+ int64_t waitEnd = x265_mdate();
+ if (waited)
+ ATOMIC_ADD(&m_totalThreadedMEWait[layer], waitEnd - waitStart);
+ }
+
// Does all the CU analysis, returns best top level mode decision
Mode& best = tld.analysis.compressCTU(*ctu, *m_frame[layer], m_cuGeoms[m_ctuGeomMap[cuAddr]], rowCoder);
diff --git a/source/encoder/frameencoder.h b/source/encoder/frameencoder.h
index c31762402..2d039e031 100644
--- a/source/encoder/frameencoder.h
+++ b/source/encoder/frameencoder.h
@@ -41,6 +41,8 @@
#include "reference.h"
#include "nal.h"
#include "temporalfilter.h"
+#include "threadedme.h"
+#include <queue>
namespace X265_NS {
// private x265 namespace
@@ -241,6 +243,9 @@ public:
int64_t m_slicetypeWaitTime[MAX_LAYERS]; // total elapsed time waiting for decided frame
int64_t m_totalWorkerElapsedTime[MAX_LAYERS]; // total elapsed time spent by worker threads processing CTUs
int64_t m_totalNoWorkerTime[MAX_LAYERS]; // total elapsed time without any active worker threads
+ int64_t m_totalThreadedMEWait[MAX_LAYERS]; // total time spent waiting by CTUs for ThreadedME
+ int64_t m_totalThreadedMETime[MAX_LAYERS]; // total time spent processing by ThreadedME
+
#if DETAILED_CU_STATS
CUStats m_cuStats;
#endif
@@ -267,6 +272,19 @@ public:
int m_sLayerId;
+ std::queue<CTUTask> m_tmeTasks;
+ Lock m_tmeTasksLock;
+
+ struct TMEDependencyState
+ {
+ bool internal;
+ bool external;
+ bool isQueued;
+ };
+
+ std::vector<TMEDependencyState> m_tmeDeps;
+ Lock m_tmeDepLock;
+
class WeightAnalysis : public BondedTaskGroup
{
public:
diff --git a/source/encoder/motion.cpp b/source/encoder/motion.cpp
index 86f413c3d..1a8cf6371 100644
--- a/source/encoder/motion.cpp
+++ b/source/encoder/motion.cpp
@@ -628,6 +628,155 @@ void MotionEstimate::StarPatternSearch(ReferencePlanes *ref,
}
}
+int MotionEstimate::diamondSearch(ReferencePlanes* ref, const MV& mvmin, const MV& mvmax, MV& outMV)
+{
+ int bcost = INT_MAX;
+ MV bmv(0, 0);
+ MV omv = bmv;
+
+ ALIGN_VAR_16(int, costs[16]);
+
+ intptr_t stride = ref->lumaStride;
+ pixel* fenc = fencPUYuv.m_buf[0];
+ pixel* fref = ref->fpelPlane[0] + blockOffset;
+
+ for (int16_t dist = 1; dist <= 4; dist <<= 1)
+ {
+ const int32_t top = omv.y - dist;
+ const int32_t bottom = omv.y + dist;
+ const int32_t left = omv.x - dist;
+ const int32_t right = omv.x + dist;
+ const int32_t top2 = omv.y - (dist >> 1);
+ const int32_t bottom2 = omv.y + (dist >> 1);
+ const int32_t left2 = omv.x - (dist >> 1);
+ const int32_t right2 = omv.x + (dist >> 1);
+
+ if (top >= mvmin.y && left >= mvmin.x && right <= mvmax.x && bottom <= mvmax.y)
+ {
+ COST_MV_X4(omv.x, top, omv.x, bottom, left, omv.y, right, omv.y);
+ COST_MV_X4(left2, top2, right2, top2, left2, bottom2, right2, bottom2);
+ }
+ else // check border for each mv
+ {
+ if (top >= mvmin.y) // check top
+ {
+ COST_MV(omv.x, top);
+ }
+ if (top2 >= mvmin.y) // check half top
+ {
+ if (left2 >= mvmin.x) // check half left
+ {
+ COST_MV(left2, top2);
+ }
+ if (right2 <= mvmax.x) // check half right
+ {
+ COST_MV(right2, top2);
+ }
+ }
+ if (left >= mvmin.x) // check left
+ {
+ COST_MV(left, omv.y);
+ }
+ if (right <= mvmax.x) // check right
+ {
+ COST_MV(right, omv.y);
+ }
+ if (bottom2 <= mvmax.y) // check half bottom
+ {
+ if (left2 >= mvmin.x) // check half left
+ {
+ COST_MV(left2, bottom2);
+ }
+ if (right2 <= mvmax.x) // check half right
+ {
+ COST_MV(right2, bottom2);
+ }
+ }
+ if (bottom <= mvmax.y) // check bottom
+ {
+ COST_MV(omv.x, bottom);
+ }
+ }
+ }
+
+ for (int16_t dist = 8; dist <= 64; dist += 8)
+ {
+ const int32_t top = omv.y - dist;
+ const int32_t bottom = omv.y + dist;
+ const int32_t left = omv.x - dist;
+ const int32_t right = omv.x + dist;
+
+ if (top >= mvmin.y && left >= mvmin.x && right <= mvmax.x && bottom <= mvmax.y)
+ {
+ COST_MV_X4(omv.x, top, left, omv.y, right, omv.y, omv.x, bottom);
+
+ for (int16_t index = 1; index < 4; index++)
+ {
+ int32_t posYT = top + ((dist >> 2) * index);
+ int32_t posYB = bottom - ((dist >> 2) * index);
+ int32_t posXL = omv.x - ((dist >> 2) * index);
+ int32_t posXR = omv.x + ((dist >> 2) * index);
+
+ COST_MV_X4(posXL, posYT,
+ posXR, posYT,
+ posXL, posYB,
+ posXR, posYB);
+ }
+ }
+ else // check border for each mv
+ {
+ if (top >= mvmin.y) // check top
+ {
+ COST_MV(omv.x, top);
+ }
+ if (left >= mvmin.x) // check left
+ {
+ COST_MV(left, omv.y);
+ }
+ if (right <= mvmax.x) // check right
+ {
+ COST_MV(right, omv.y);
+ }
+ if (bottom <= mvmax.y) // check bottom
+ {
+ COST_MV(omv.x, bottom);
+ }
+ for (int16_t index = 1; index < 4; index++)
+ {
+ int32_t posYT = top + ((dist >> 2) * index);
+ int32_t posYB = bottom - ((dist >> 2) * index);
+ int32_t posXL = omv.x - ((dist >> 2) * index);
+ int32_t posXR = omv.x + ((dist >> 2) * index);
+
+ if (posYT >= mvmin.y) // check top
+ {
+ if (posXL >= mvmin.x) // check left
+ {
+ COST_MV(posXL, posYT);
+ }
+ if (posXR <= mvmax.x) // check right
+ {
+ COST_MV(posXR, posYT);
+ }
+ }
+ if (posYB <= mvmax.y) // check bottom
+ {
+ if (posXL >= mvmin.x) // check left
+ {
+ COST_MV(posXL, posYB);
+ }
+ if (posXR <= mvmax.x) // check right
+ {
+ COST_MV(posXR, posYB);
+ }
+ }
+ }
+ }
+ }
+ outMV = bmv;
+ return bcost;
+}
+
void MotionEstimate::refineMV(ReferencePlanes* ref,
const MV& mvmin,
const MV& mvmax,
diff --git a/source/encoder/motion.h b/source/encoder/motion.h
index c9fe86c82..5fc701743 100644
--- a/source/encoder/motion.h
+++ b/source/encoder/motion.h
@@ -99,6 +99,8 @@ public:
int subpelCompare(ReferencePlanes* ref, const MV &qmv, pixelcmp_t);
+ int diamondSearch(ReferencePlanes* ref, const MV& mvmin, const MV& mvmax, MV& outMV);
+
protected:
inline void StarPatternSearch(ReferencePlanes *ref,
diff --git a/source/encoder/search.cpp b/source/encoder/search.cpp
index 0522f52cc..bf47e7a03 100644
--- a/source/encoder/search.cpp
+++ b/source/encoder/search.cpp
@@ -33,6 +33,7 @@
#include "analysis.h" // TLD
#include "framedata.h"
+#include "encoder.h"
using namespace X265_NS;
@@ -222,6 +223,336 @@ int Search::setLambdaFromQP(const CUData& ctu, int qp, int lambdaQp)
return quantQP;
}
+void Search::puMotionEstimation(const Slice* slice, const CUGeom& cuGeom, CUData& cu, PicYuv* fencPic, int puOffset, PartSize part, int areaIdx, int finalIdx, bool isMVP , const int* neighborIdx)
+{
+#if DETAILED_CU_STATS
+ m_stats[cu.m_encData->m_frameEncoderID].countMotionEstimate++;
+#endif
+
+ int satdCost = 0;
+ int numPredDir = slice->isInterP() ? 1 : 2;
+ int searchRange = isMVP ? 32 : m_param->searchRange;
+
+ MV mvp(0,0);
+ MV mvzero(0,0);
+
+ MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 2];
+ MV amvpCand[2][MAX_NUM_REF][AMVP_NUM_CANDS];
+
+ MotionData bestME[2];
+ bestME[0].cost = MAX_UINT;
+ bestME[1].cost = MAX_UINT;
+
+ int numPart = cu.getNumPartInter(0);
+ uint32_t lastMode = 0;
+
+ int row = cu.m_cuAddr / m_slice->m_sps->numCuInWidth;
+ int col = cu.m_cuAddr % m_slice->m_sps->numCuInWidth;
+
+ int numMvc = 0;
+ for (int puIdx = 0; puIdx < numPart; puIdx++)
+ {
+ PredictionUnit pu(cu, cuGeom, puIdx);
+
+ int pos = finalIdx + puIdx * puOffset;
+ int slotIdx = (col % m_slice->m_sps->numCuInWidth) * m_slice->m_sps->numCuInHeight + row;
+
+ InterNeighbourMV neighbours[6];
+ if (!isMVP)
+ cu.getNeighbourMV(puIdx, pu.puAbsPartIdx, neighbours);
+
+ for (int list = 0; list < numPredDir; list++)
+ {
+ int numIdx = slice->m_numRefIdx[list];
+ for (int ref = 0; ref < numIdx; ref++)
+ {
+ getBlkBits(part, slice->isInterP(), puIdx, lastMode, m_listSelBits);
+ uint32_t bits = m_listSelBits[list] + MVP_IDX_BITS;
+ bits += getTUBits(ref, numIdx);
+
+ MV mvmin, mvmax, outmv, mvp_lowres;
+ mvp = !isMVP ? m_areaBestMV[areaIdx][list][ref] : mvp;
+
+ MV zeroMV[2] = {0,0};
+ const MV* amvp = zeroMV;
+ int mvpIdx = 0;
+
+ bool bLowresMVP = false;
+ if (!isMVP)
+ {
+ for (int dir = MD_LEFT; dir <= MD_ABOVE_LEFT; dir++)
+ {
+ int neighIdx = neighborIdx[dir];
+ if (neighIdx >= 0)
+ {
+ MEData& neighborData = slice->m_ctuMV[slotIdx].m_meData[neighIdx];
+ for (int i = 0; i < 2; i++)
+ {
+ neighbours[dir].mv[i] = neighborData.mv[i];
+ neighbours[dir].refIdx[i] = neighborData.ref[i];
+ }
+ neighbours[dir].isAvailable = (neighborData.ref[0] >= 0 || neighborData.ref[1] >= 0);
+ }
+ else
+ {
+ for (int i = 0; i < 2; i++)
+ neighbours[dir].refIdx[i] = -1;
+ neighbours[dir].isAvailable = false;
+ }
+ }
+
+ numMvc = cu.getPMV(neighbours, list, ref, amvpCand[list][ref], mvc);
+ if (numMvc > 0)
+ {
+ amvp = amvpCand[list][ref];
+ mvpIdx = selectMVP(cu, pu, amvp, list, ref);
+ mvp = amvp[mvpIdx];
+ }
+ else if (slice->m_refFrameList[list][ref]->m_encData->m_slice->m_sliceType != I_SLICE)
+ {
+ CTUMVInfo& ctuMV = slice->m_refFrameList[list][ref]->m_encData->m_slice->m_ctuMV[slotIdx];
+ MEData meData = ctuMV.m_meData[pos];
+
+ bool bi = (meData.ref[0] >= 0 && meData.ref[1] >= 0);
+ bool uniL0 = (meData.ref[0] >= 0 && meData.ref[1] == REF_NOT_VALID);
+ bool uniL1 = (meData.ref[1] >= 0 && meData.ref[0] == REF_NOT_VALID);
+
+ if (uniL0)
+ mvp = meData.mv[0];
+ else if (uniL1)
+ mvp = meData.mv[1];
+ else if (bi)
+ mvp = meData.mv[list];
+ }
+ }
+
+ m_me.setMVP(mvp);
+
+ if (!strlen(m_param->analysisSave) && !strlen(m_param->analysisLoad))
+ {
+ uint32_t blockX = cu.m_cuPelX + g_zscanToPelX[pu.puAbsPartIdx] + (pu.width >> 1);
+ uint32_t blockY = cu.m_cuPelY + g_zscanToPelY[pu.puAbsPartIdx] + (pu.height >> 1);
+
+ if (blockX < m_slice->m_sps->picWidthInLumaSamples && blockY < m_slice->m_sps->picHeightInLumaSamples)
+ {
+ MV lmv = getLowresMV(cu, pu, list, ref);
+ int layer = m_param->numViews > 1 ? m_frame->m_viewId : (m_param->numScalableLayers > 1) ? m_frame->m_sLayerId : 0;
+ if (lmv.notZero() && !layer)
+ {
+ mvc[numMvc++] = lmv;
+ bLowresMVP = true;
+ }
+ mvp_lowres = lmv;
+ }
+ }
+
+ PicYuv* recon = slice->m_mref[list][ref].reconPic;
+ int offset = recon->getLumaAddr(cu.m_cuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx) - recon->getLumaAddr(0);
+
+ m_me.setSourcePU(fencPic->m_picOrg[0], fencPic->m_stride, offset, pu.width, pu.height, m_param->searchMethod, m_param->subpelRefine);
+ setSearchRange(cu, mvp, searchRange, mvmin, mvmax);
+
+ if (isMVP)
+ {
+ satdCost = m_me.diamondSearch(&slice->m_mref[list][ref], mvmin, mvmax, outmv);
+ m_areaBestMV[areaIdx][list][ref] = outmv;
+ }
+ else
+ {
+ m_vertRestriction = slice->m_refPOCList[list][ref] == slice->m_poc;
+ satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv, m_param->maxSlices, m_vertRestriction,
+ m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0);
+
+ if (bLowresMVP && mvp_lowres.notZero() && mvp_lowres != mvp)
+ {
+ MV outmv_lowres;
+ bLowresMVP = false;
+ setSearchRange(cu, mvp_lowres, m_param->searchRange, mvmin, mvmax);
+ int lowresMvCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp_lowres, numMvc, mvc, m_param->searchRange, outmv_lowres, m_param->maxSlices,
+ m_vertRestriction, m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0);
+
+ if (lowresMvCost < satdCost)
+ {
+ outmv = outmv_lowres;
+ satdCost = lowresMvCost;
+ bLowresMVP = true;
+ }
+ }
+ }
+
+ bits += m_me.bitcost(outmv);
+ uint32_t mvCost = m_me.mvcost(outmv);
+ uint32_t cost = (satdCost - mvCost) + m_rdCost.getCost(bits);
+
+ if (!isMVP)
+ {
+ if (bLowresMVP)
+ updateMVP(mvp, outmv, bits, cost, mvp_lowres);
+
+ mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost);
+ }
+ if (cost < bestME[list].cost)
+ {
+ bestME[list].mv = outmv;
+ bestME[list].mvp = mvp;
+ bestME[list].mvpIdx = 0;
+ bestME[list].cost = cost;
+ bestME[list].bits = bits;
+ bestME[list].mvCost = mvCost;
+ bestME[list].ref = ref;
+ }
+ }
+ }
+
+ if (isMVP)
+ return;
+
+ //Bi-Direction
+ MotionData bidir[2];
+ uint32_t bidirCost = MAX_UINT;
+ int bidirBits = 0;
+ Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;
+
+ if (slice->isInterB() && !cu.isBipredRestriction() &&
+ cu.m_partSize[pu.puAbsPartIdx] != SIZE_2Nx2N && bestME[0].cost != MAX_UINT && bestME[1].cost != MAX_UINT && !isMVP)
+ {
+ bidir[0] = bestME[0];
+ bidir[1] = bestME[1];
+
+ if (m_me.bChromaSATD)
+ {
+ cu.m_mv[0][pu.puAbsPartIdx] = bidir[0].mv;
+ cu.m_refIdx[0][pu.puAbsPartIdx] = (int8_t)bidir[0].ref;
+ cu.m_mv[1][pu.puAbsPartIdx] = bidir[1].mv;
+ cu.m_refIdx[1][pu.puAbsPartIdx] = (int8_t)bidir[1].ref;
+ motionCompensation(cu, pu, tmpPredYuv, true, true);
+
+ satdCost = m_me.bufSATD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size) +
+ m_me.bufChromaSATD(tmpPredYuv, pu.puAbsPartIdx);
+ }
+ else
+ {
+ PicYuv* refPic0 = slice->m_refReconPicList[0][bestME[0].ref];
+ PicYuv* refPic1 = slice->m_refReconPicList[1][bestME[1].ref];
+ Yuv* bidirYuv = m_rqt[cuGeom.depth].bidirPredYuv;
+
+ predInterLumaPixel(pu, bidirYuv[0], *refPic0, bestME[0].mv);
+ predInterLumaPixel(pu, bidirYuv[1], *refPic1, bestME[1].mv);
+ primitives.pu[m_me.partEnum].pixelavg_pp[(tmpPredYuv.m_size % 64 == 0) && (bidirYuv[0].m_size % 64 == 0) && (bidirYuv[1].m_size % 64 == 0)](tmpPredYuv.m_buf[0], tmpPredYuv.m_size, bidirYuv[0].getLumaAddr(pu.puAbsPartIdx), bidirYuv[0].m_size,
+ bidirYuv[1].getLumaAddr(pu.puAbsPartIdx), bidirYuv[1].m_size, 32);
+ satdCost = m_me.bufSATD(tmpPredYuv.m_buf[0], tmpPredYuv.m_size);
+ }
+
+ bidirBits = bestME[0].bits + bestME[1].bits + m_listSelBits[2] - (m_listSelBits[0] + m_listSelBits[1]);
+ bidirCost = satdCost + m_rdCost.getCost(bidirBits);
+
+ bool bTryZero = bestME[0].mv.notZero() || bestME[1].mv.notZero();
+ if (bTryZero)
+ {
+ MV mvmin, mvmax;
+ int merange = X265_MAX(m_param->sourceWidth, m_param->sourceHeight);
+ setSearchRange(cu, mvzero, merange, mvmin, mvmax);
+ mvmax.y += 2;
+ mvmin <<= 2;
+ mvmax <<= 2;
+
+ bTryZero &= bestME[0].mvp.checkRange(mvmin, mvmax);
+ bTryZero &= bestME[1].mvp.checkRange(mvmin, mvmax);
+ }
+ if (bTryZero)
+ {
+ if (m_me.bChromaSATD)
+ {
+ cu.m_mv[0][pu.puAbsPartIdx] = mvzero;
+ cu.m_refIdx[0][pu.puAbsPartIdx] = (int8_t)bidir[0].ref;
+ cu.m_mv[1][pu.puAbsPartIdx] = mvzero;
+ cu.m_refIdx[1][pu.puAbsPartIdx] = (int8_t)bidir[1].ref;
+ motionCompensation(cu, pu, tmpPredYuv, true, true);
+
+ satdCost = m_me.bufSATD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size) +
+ m_me.bufChromaSATD(tmpPredYuv, pu.puAbsPartIdx);
+ }
+ else
+ {
+ const pixel* ref0 = m_slice->m_mref[0][bestME[0].ref].getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx);
+ const pixel* ref1 = m_slice->m_mref[1][bestME[1].ref].getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx);
+ intptr_t refStride = slice->m_mref[0][0].lumaStride;
+ primitives.pu[m_me.partEnum].pixelavg_pp[(tmpPredYuv.m_size % 64 == 0) && (refStride % 64 == 0)](tmpPredYuv.m_buf[0], tmpPredYuv.m_size, ref0, refStride, ref1, refStride, 32);
+ satdCost = m_me.bufSATD(tmpPredYuv.m_buf[0], tmpPredYuv.m_size);
+ }
+
+ MV mvp0 = bestME[0].mvp;
+ int mvpIdx0 = bestME[0].mvpIdx;
+ uint32_t bits0 = bestME[0].bits - m_me.bitcost(bestME[0].mv, mvp0) + m_me.bitcost(mvzero, mvp0);
+
+ MV mvp1 = bestME[1].mvp;
+ int mvpIdx1 = bestME[1].mvpIdx;
+ uint32_t bits1 = bestME[1].bits - m_me.bitcost(bestME[1].mv, mvp1) + m_me.bitcost(mvzero, mvp1);
+
+ uint32_t cost = satdCost + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1);
+
+ if (cost < bidirCost)
+ {
+ bidir[0].mv = mvzero;
+ bidir[1].mv = mvzero;
+ bidir[0].mvp = mvp0;
+ bidir[1].mvp = mvp1;
+ bidir[0].mvpIdx = mvpIdx0;
+ bidir[1].mvpIdx = mvpIdx1;
+ bidirCost = cost;
+ bidirBits = bits0 + bits1 + m_listSelBits[2] - (m_listSelBits[0] + m_listSelBits[1]);
+ }
+ }
+ }
+ CTUMVInfo & ctuInfo = slice->m_ctuMV[slotIdx];
+ MEData& outME = ctuInfo.m_meData[pos];
+
+ outME.ref[0] = REF_NOT_VALID;
+ outME.ref[1] = REF_NOT_VALID;
+
+ if (bidirCost < bestME[0].cost && bidirCost < bestME[1].cost)
+ {
+ lastMode = 2;
+
+ outME.mv[0] = bidir[0].mv;
+ outME.mv[1] = bidir[1].mv;
+ outME.mvp[0] = bidir[0].mvp;
+ outME.mvp[1] = bidir[1].mvp;
+ outME.mvCost[0] = bestME[0].mvCost;
+ outME.mvCost[1] = bestME[1].mvCost;
+ outME.ref[0] = bestME[0].ref;
+ outME.ref[1] = bestME[1].ref;
+
+ outME.bits = bidirBits;
+ outME.cost = bidirCost;
+ }
+ else if (bestME[0].cost <= bestME[1].cost)
+ {
+ lastMode = 0;
+
+ outME.mv[0] = bestME[0].mv;
+ outME.mvp[0] = bestME[0].mvp;
+ outME.mvCost[0] = bestME[0].mvCost;
+ outME.cost = bestME[0].cost;
+ outME.bits = bestME[0].bits;
+ outME.ref[0] = bestME[0].ref;
+ outME.ref[1] = REF_NOT_VALID;
+ }
+ else
+ {
+ lastMode = 1;
+
+ outME.mv[1] = bestME[1].mv;
+ outME.mvp[1] = bestME[1].mvp;
+ outME.mvCost[1] = bestME[1].mvCost;
+ outME.cost = bestME[1].cost;
+ outME.bits = bestME[1].bits;
+ outME.ref[1] = bestME[1].ref;
+ outME.ref[0] = REF_NOT_VALID;
+ }
+ }
+}
+
#if CHECKED_BUILD || _DEBUG
void Search::invalidateContexts(int fromDepth)
{
@@ -2438,7 +2769,7 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
/* if no peer threads were bonded, fall back to doing unidirectional
* searches ourselves without overhead of singleMotionEstimation() */
}
- if (bDoUnidir)
+ if (bDoUnidir && !m_param->bThreadedME)
{
interMode.bestME[puIdx][0].ref = interMode.bestME[puIdx][1].ref = -1;
uint32_t refMask = refMasks[puIdx] ? refMasks[puIdx] : (uint32_t)-1;
@@ -2550,7 +2881,7 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
if (slice->isInterB() && !cu.isBipredRestriction() && /* biprediction is possible for this PU */
cu.m_partSize[pu.puAbsPartIdx] != SIZE_2Nx2N && /* 2Nx2N biprediction is handled elsewhere */
- bestME[0].cost != MAX_UINT && bestME[1].cost != MAX_UINT)
+ bestME[0].cost != MAX_UINT && bestME[1].cost != MAX_UINT && !m_param->bThreadedME)
{
bidir[0] = bestME[0];
bidir[1] = bestME[1];
@@ -2650,8 +2981,108 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
}
}
+ uint32_t bestCost = MAX_UINT;
+ bool isMerge = false;
+ bool isBidir = false;
+ bool uniL0 = false;
+ bool uniL1 = false;
+
+ if (m_param->bThreadedME)
+ {
+ int cuSize = 1 << cu.m_log2CUSize[0];
+
+ int lookupWidth = pu.width;
+ int lookupHeight = pu.height;
+
+ bool isAmp = cu.m_partSize[0] >= SIZE_2NxnU;
+
+ if (isAmp)
+ {
+ if (cu.m_partSize[0] == SIZE_2NxnU || cu.m_partSize[0] == SIZE_2NxnD)
+ lookupHeight = (puIdx) ? (pu.width - pu.height) : pu.height;
+ else
+ lookupWidth = (puIdx) ? (pu.height - pu.width) : pu.width;
+ }
+
+ int startIdx = g_puStartIdx[lookupWidth + lookupHeight][static_cast<int>(cu.m_partSize[0])];
+
+ int alignWidth = isAmp ? cuSize : pu.width;
+ int alignHeight = isAmp ? cuSize : pu.height;
+
+ int numPUX = m_param->maxCUSize / alignWidth;
+ int numPUY = m_param->maxCUSize / alignHeight;
+
+ int puOffset = isAmp ? (puIdx * numPUX * numPUY) : (cu.m_partSize[0] == SIZE_2NxN ? (puIdx * numPUX) : puIdx);
+
+ int relX = (cu.m_cuPelX / alignWidth) % numPUX;
+ int relY = (cu.m_cuPelY / alignHeight) % numPUY;
+
+ int index = startIdx + (relY * numPUX + relX) + puOffset;
+
+ int row = cu.m_cuAddr / m_slice->m_sps->numCuInWidth;
+ int col = cu.m_cuAddr % m_slice->m_sps->numCuInWidth;
+
+ int slotIdx = (col % m_slice->m_sps->numCuInWidth) * m_slice->m_sps->numCuInHeight + row;
+
+ CTUMVInfo& ctuInfo = slice->m_ctuMV[slotIdx];
+ MEData meData = ctuInfo.m_meData[index];
+
+ bestME[0].ref = meData.ref[0];
+ bestME[1].ref = meData.ref[1];
+
+ isBidir = (bestME[0].ref >= 0 && bestME[1].ref >= 0);
+ uniL0 = (bestME[0].ref >= 0 && bestME[1].ref == REF_NOT_VALID);
+ uniL1 = (bestME[1].ref >= 0 && bestME[0].ref == REF_NOT_VALID);
+
+ if (isBidir)
+ {
+ cu.getPMV(interMode.interNeighbours, 0, bestME[0].ref, interMode.amvpCand[0][bestME[0].ref], mvc);
+ cu.getPMV(interMode.interNeighbours, 1, bestME[1].ref, interMode.amvpCand[1][bestME[1].ref], mvc);
+
+ bidir[0].mv = meData.mv[0];
+ bidir[1].mv = meData.mv[1];
+ bidir[0].mvp = interMode.amvpCand[0][bestME[0].ref][0];
+ bidir[1].mvp = interMode.amvpCand[1][bestME[1].ref][0];
+ bidir[0].mvCost = meData.mvCost[0];
+ bidir[1].mvCost = meData.mvCost[1];
+ bidirCost = meData.cost;
+ bidirBits = meData.bits;
+
+ bestCost = bidirCost;
+ }
+ else if (uniL0)
+ {
+ cu.getPMV(interMode.interNeighbours, 0, bestME[0].ref, interMode.amvpCand[0][bestME[0].ref], mvc);
+
+ bestME[0].mv = meData.mv[0];
+ bestME[0].mvp = interMode.amvpCand[0][bestME[0].ref][0];
+ bestME[0].mvCost = meData.mvCost[0];
+ bestME[0].cost = meData.cost;
+ bestME[0].bits = meData.bits;
+
+ bestCost = bestME[0].cost;
+ }
+ else if (uniL1)
+ {
+ cu.getPMV(interMode.interNeighbours, 1, bestME[1].ref, interMode.amvpCand[1][bestME[1].ref], mvc);
+
+ bestME[1].mv = meData.mv[1];
+ bestME[1].mvp = interMode.amvpCand[1][bestME[1].ref][0];
+ bestME[1].mvCost = meData.mvCost[1];
+ bestME[1].cost = meData.cost;
+ bestME[1].bits = meData.bits;
+
+ bestCost = bestME[1].cost;
+ }
+ else
+ x265_log(NULL, X265_LOG_ERROR, "Invalid ME mode\n");
+
+ if (mrgCost < bestCost)
+ isMerge = true;
+ }
+
/* select best option and store into CU */
- if (mrgCost < bidirCost && mrgCost < bestME[0].cost && mrgCost < bestME[1].cost)
+ if ((mrgCost < bidirCost && mrgCost < bestME[0].cost && mrgCost < bestME[1].cost) || isMerge)
{
cu.m_mergeFlag[pu.puAbsPartIdx] = true;
cu.m_mvpIdx[0][pu.puAbsPartIdx] = merge.index; /* merge candidate ID is stored in L0 MVP idx */
@@ -2663,7 +3094,7 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
totalmebits += merge.bits;
}
- else if (bidirCost < bestME[0].cost && bidirCost < bestME[1].cost)
+ else if ((bidirCost < bestME[0].cost && bidirCost < bestME[1].cost) || isBidir)
{
lastMode = 2;
@@ -2681,7 +3112,7 @@ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChroma
totalmebits += bidirBits;
}
- else if (bestME[0].cost <= bestME[1].cost)
+ else if ((bestME[0].cost <= bestME[1].cost) || uniL0)
{
lastMode = 0;
diff --git a/source/encoder/search.h b/source/encoder/search.h
index df7ad90dd..eb8942c3b 100644
--- a/source/encoder/search.h
+++ b/source/encoder/search.h
@@ -179,6 +179,7 @@ struct CUStats
int64_t pmodeBlockTime; // elapsed worker time blocked for pmode batch completion
int64_t weightAnalyzeTime; // elapsed worker time analyzing reference weights
int64_t totalCTUTime; // elapsed worker time in compressCTU (includes pmode master)
+ int64_t tmeTime; // elapsed worker time in threadedME
uint32_t skippedMotionReferences[NUM_CU_DEPTH];
uint32_t totalMotionReferences[NUM_CU_DEPTH];
@@ -195,6 +196,8 @@ struct CUStats
uint64_t countPModeTasks;
uint64_t countPModeMasters;
uint64_t countWeightAnalyze;
+ uint64_t countTmeTasks;
+ uint64_t countTmeBlockedCTUs;
uint64_t totalCTUs;
CUStats() { clear(); }
@@ -227,6 +230,7 @@ struct CUStats
pmodeBlockTime += other.pmodeBlockTime;
weightAnalyzeTime += other.weightAnalyzeTime;
totalCTUTime += other.totalCTUTime;
+ tmeTime += other.tmeTime;
countIntraAnalysis += other.countIntraAnalysis;
countMotionEstimate += other.countMotionEstimate;
@@ -236,6 +240,8 @@ struct CUStats
countPModeTasks += other.countPModeTasks;
countPModeMasters += other.countPModeMasters;
countWeightAnalyze += other.countWeightAnalyze;
+ countTmeTasks += other.countTmeTasks;
+ countTmeBlockedCTUs += other.countTmeBlockedCTUs;
totalCTUs += other.totalCTUs;
other.clear();
@@ -288,6 +294,8 @@ public:
bool m_vertRestriction;
+ MV m_areaBestMV[5][2][MAX_NUM_REF];
+
#if ENABLE_SCC_EXT
int m_ibcEnabled;
int m_numBVs;
@@ -341,6 +349,16 @@ public:
MV getLowresMV(const CUData& cu, const PredictionUnit& pu, int list, int ref);
+ /**
+ * @brief Run motion estimation for one PU partition shape and persist the best ME result.
+ *
+ * Used by Analysis threaded-ME flow. With isMVP=true this bootstraps area MVPs,
+ * and with isMVP=false it performs full PU ME using spatial/temporal neighbors
+ * and stores results into per-CTU MV slots addressed by finalIdx/puOffset.
+ */
+ void puMotionEstimation(const Slice* slice, const CUGeom& cuGeom, CUData& ctu, PicYuv* fencPic, int puOffset, PartSize part, int areaIdx, int finalIdx,
+ bool isMVP, const int* neighborIdx = NULL);
+
#if ENABLE_SCC_EXT
void predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC, uint32_t masks[2], MV* iMVCandList = NULL);
bool predIntraBCSearch(Mode& intraBCMode, const CUGeom& cuGeom, bool bChromaMC, PartSize ePartSize, bool testOnlyPred, bool bUse1DSearchFor8x8, IBC& ibc);
diff --git a/source/encoder/threadedme.cpp b/source/encoder/threadedme.cpp
new file mode 100644
index 000000000..11fd16e9c
--- /dev/null
+++ b/source/encoder/threadedme.cpp
@@ -0,0 +1,275 @@
+/*****************************************************************************
+ * Copyright (C) 2013-2025 MulticoreWare, Inc
+ *
+ * Authors: Shashank Pathipati <shashank.pathipati at multicorewareinc.com>
+ * Somu Vineela <somu at multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "threadedme.h"
+#include "encoder.h"
+#include "frameencoder.h"
+
+#include <iostream>
+#include <sstream>
+
+namespace X265_NS {
+int g_puStartIdx[128][8] = {0};
+
+bool ThreadedME::create()
+{
+ m_active = true;
+ m_tldCount = m_pool->m_numWorkers;
+ m_tld = new ThreadLocalData[m_tldCount];
+ for (int i = 0; i < m_tldCount; i++)
+ {
+ m_tld[i].analysis.initSearch(*m_param, m_enc.m_scalingList);
+ m_tld[i].analysis.create(m_tld);
+ }
+
+ initPuStartIdx();
+
+ configure();
+
+ /* start sequence at zero */
+ m_enqueueSeq = 0ULL;
+
+ return true;
+}
+
+void ThreadedME::configure()
+{
+ if (!m_param->tmeTaskBlockSize)
+ {
+ m_param->tmeTaskBlockSize = X265_MAX(1, m_param->sourceWidth / 480);
+ }
+
+ if (!m_param->tmeNumBufferRows)
+ {
+ m_param->tmeNumBufferRows = 10;
+ }
+}
+
+void ThreadedME::initPuStartIdx()
+{
+ int startIdx = 0;
+ uint32_t ctuSize = m_param->maxCUSize;
+
+ for (uint32_t puIdx = 0; puIdx < MAX_NUM_PU_SIZES; ++puIdx)
+ {
+ const PUBlock& pu = g_puLookup[puIdx];
+
+ if (pu.width > ctuSize || pu.height > ctuSize)
+ continue;
+
+ int indexWidth = pu.isAmp ? X265_MAX(pu.width, pu.height) : pu.width;
+ int indexHeight = pu.isAmp ? indexWidth : pu.height;
+
+ int numPUs = (ctuSize / indexWidth) * (ctuSize / indexHeight);
+ int partIdx = static_cast<int>(pu.partsize);
+
+ g_puStartIdx[pu.width + pu.height][partIdx] = startIdx;
+
+ startIdx += pu.isAmp ? 2 * numPUs : numPUs;
+ }
+}
+
+void ThreadedME::enqueueCTUBlock(int row, int col, int width, int height, int layer, FrameEncoder* frameEnc)
+{
+ frameEnc->m_tmeTasksLock.acquire();
+
+ Frame* frame = frameEnc->m_frame[layer];
+
+ CTUTask task;
+ task.seq = ATOMIC_ADD(&m_enqueueSeq, 1ULL);
+ task.row = row;
+ task.col = col;
+ task.width = width;
+ task.height = height;
+ task.layer = layer;
+
+ task.frame = frame;
+ task.frameEnc = frameEnc;
+
+ frameEnc->m_tmeTasks.push(task);
+ frameEnc->m_tmeTasksLock.release();
+
+ m_taskEvent.trigger();
+}
+
+void ThreadedME::enqueueReadyRows(int row, int layer, FrameEncoder* frameEnc)
+{
+ int bufRow = X265_MIN(row + m_param->tmeNumBufferRows, static_cast<int>(frameEnc->m_numRows));
+
+ for (int r = 0; r < bufRow; r++)
+ {
+ if (frameEnc->m_tmeDeps[r].isQueued)
+ continue;
+
+ bool isInitialRow = r < m_param->tmeNumBufferRows;
+ bool isExternalDepResolved = frameEnc->m_tmeDeps[r].external;
+
+ int prevRow = X265_MAX(0, r - m_param->tmeNumBufferRows);
+ bool isInternalDepResolved = frameEnc->m_tmeDeps[prevRow].internal;
+
+ if ((isInitialRow && isExternalDepResolved) ||
+ (!isInitialRow && isExternalDepResolved && isInternalDepResolved))
+ {
+ int cols = static_cast<int>(frameEnc->m_numCols);
+ for (int c = 0; c < cols; c += m_param->tmeTaskBlockSize)
+ {
+ int blockWidth = X265_MIN(m_param->tmeTaskBlockSize, cols - c);
+ enqueueCTUBlock(r, c, blockWidth, 1, layer, frameEnc);
+ }
+ frameEnc->m_tmeDeps[r].isQueued = true;
+ }
+ }
+}
+
+void ThreadedME::threadMain()
+{
+ while (m_active)
+ {
+ int newCTUsPushed = 0;
+
+ for (int i = 0; i < m_param->frameNumThreads; i++)
+ {
+ FrameEncoder* frameEnc = m_enc.m_frameEncoder[i];
+ frameEnc->m_tmeTasksLock.acquire();
+
+ while (!frameEnc->m_tmeTasks.empty())
+ {
+ CTUTask task = frameEnc->m_tmeTasks.front();
+ frameEnc->m_tmeTasks.pop();
+
+ m_taskQueueLock.acquire();
+ m_taskQueue.push(task);
+ m_taskQueueLock.release();
+
+ newCTUsPushed++;
+ tryWakeOne();
+ }
+
+ frameEnc->m_tmeTasksLock.release();
+ }
+
+ if (newCTUsPushed == 0)
+ m_taskEvent.wait();
+ }
+}
+
+void ThreadedME::findJob(int workerThreadId)
+{
+ m_taskQueueLock.acquire();
+ if (m_taskQueue.empty())
+ {
+ m_helpWanted = false;
+ m_taskQueueLock.release();
+ return;
+ }
+
+ m_helpWanted = true;
+ int64_t stime = x265_mdate();
+
+#ifdef DETAILED_CU_STATS
+ ScopedElapsedTime tmeTime(m_tld[workerThreadId].analysis.m_stats[m_jpId].tmeTime);
+ m_tld[workerThreadId].analysis.m_stats[m_jpId].countTmeTasks++;
+#endif
+
+ CTUTask task = m_taskQueue.top();
+ m_taskQueue.pop();
+ m_taskQueueLock.release();
+
+ int numCols = (m_param->sourceWidth + m_param->maxCUSize - 1) / m_param->maxCUSize;
+ Frame* frame = task.frame;
+
+ for (int i = 0; i < task.height; i++)
+ {
+ for (int j = 0; j < task.width; j++)
+ {
+
+ int ctuAddr = (task.row + i) * numCols + (task.col + j);
+ CUData* ctu = frame->m_encData->getPicCTU(ctuAddr);
+ ctu->m_slice = frame->m_encData->m_slice;
+
+ task.ctu = ctu;
+ task.geom = &task.frameEnc->m_cuGeoms[task.frameEnc->m_ctuGeomMap[ctuAddr]];
+
+ frame->m_encData->m_cuStat[ctuAddr].baseQp = frame->m_encData->m_avgQpRc;
+ initCTU(*ctu, task.row + i, task.col + j, task);
+
+ task.frame->m_ctuMEFlags[ctuAddr].set(0);
+ m_tld[workerThreadId].analysis.deriveMVsForCTU(*task.ctu, *task.geom, *frame);
+
+ task.frame->m_ctuMEFlags[ctuAddr].set(1);
+ }
+ }
+
+ if (m_param->csvLogLevel >= 2)
+ {
+ int64_t etime = x265_mdate();
+ ATOMIC_ADD(&task.frameEnc->m_totalThreadedMETime[task.layer], etime - stime);
+ }
+
+ m_taskEvent.trigger();
+}
+
+
+void ThreadedME::stopJobs()
+{
+ this->m_active = false;
+ m_taskEvent.trigger();
+}
+
+void ThreadedME::destroy()
+{
+ for (int i = 0; i < m_tldCount; i++)
+ m_tld[i].destroy();
+ delete[] m_tld;
+}
+
+void ThreadedME::collectStats()
+{
+#ifdef DETAILED_CU_STATS
+ for (int i = 0; i < m_tldCount; i++)
+ m_cuStats.accumulate(m_tld[i].analysis.m_stats[m_jpId], *m_param);
+#endif
+}
+
+void initCTU(CUData& ctu, int row, int col, CTUTask& task)
+{
+ Frame& frame = *task.frame;
+ FrameEncoder& frameEnc = *task.frameEnc;
+
+ int numRows = frameEnc.m_numRows;
+ int numCols = frameEnc.m_numCols;
+ Slice *slice = frame.m_encData->m_slice;
+ CTURow& ctuRow = frameEnc.m_rows[row];
+
+ const uint32_t bFirstRowInSlice = ((row == 0) || (frameEnc.m_rows[row - 1].sliceId != ctuRow.sliceId)) ? 1 : 0;
+ const uint32_t bLastRowInSlice = ((row == numRows - 1) || (frameEnc.m_rows[row + 1].sliceId != ctuRow.sliceId)) ? 1 : 0;
+
+ const uint32_t bLastCuInSlice = (bLastRowInSlice & (col == numCols - 1)) ? 1 : 0;
+
+ int ctuAddr = (numCols * row) + col;
+
+ ctu.initCTU(frame, ctuAddr, slice->m_sliceQp, bFirstRowInSlice, bLastRowInSlice, bLastCuInSlice);
+}
+
+}
\ No newline at end of file
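To make the slot layout built by `initPuStartIdx()` easier to follow, here is a simplified, standalone mirror of the idea (struct and function names are illustrative, not the patch's actual symbols): each PU shape is assigned a contiguous range of slots in the per-CTU `MEData` array, so a (shape, position-in-CTU) pair maps to a unique flat index.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for one entry of g_puLookup.
struct Shape { int width, height; bool isAmp; };

// Returns the total number of slots needed per CTU; fills startIdx[i] with
// the first slot of shape i (-1 if the shape does not fit in the CTU).
int computeStartIndices(const Shape* shapes, int count, int ctuSize, int* startIdx)
{
    int next = 0;
    for (int i = 0; i < count; i++)
    {
        const Shape& s = shapes[i];
        if (s.width > ctuSize || s.height > ctuSize)
        {
            startIdx[i] = -1;
            continue;
        }
        // AMP shapes are indexed on the square CU that contains both partitions
        int w = s.isAmp ? std::max(s.width, s.height) : s.width;
        int h = s.isAmp ? w : s.height;
        int numPUs = (ctuSize / w) * (ctuSize / h);
        startIdx[i] = next;
        next += s.isAmp ? 2 * numPUs : numPUs; // two partitions per AMP CU
    }
    return next;
}
```

For a 64x64 CTU, the 8x4 shape occupies (64/8)*(64/4) = 128 slots starting at 0, 4x8 the next 128, 8x8 the next 64, and an AMP shape such as 16x4 reserves two slots per containing 16x16 CU.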
diff --git a/source/encoder/threadedme.h b/source/encoder/threadedme.h
new file mode 100644
index 000000000..ee66116e3
--- /dev/null
+++ b/source/encoder/threadedme.h
@@ -0,0 +1,259 @@
+/*****************************************************************************
+ * Copyright (C) 2013-2025 MulticoreWare, Inc
+ *
+ * Authors: Shashank Pathipati <shashank.pathipati at multicorewareinc.com>
+ * Somu Vineela <somu at multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef THREADED_ME_H
+#define THREADED_ME_H
+
+#include "common.h"
+#include "threading.h"
+#include "threadpool.h"
+#include "cudata.h"
+#include "lowres.h"
+#include "frame.h"
+#include "analysis.h"
+#include "mv.h"
+
+#include <queue>
+#include <vector>
+#include <fstream>
+
+namespace X265_NS {
+
+extern int g_puStartIdx[128][8];
+
+class Encoder;
+class Analysis;
+class FrameEncoder;
+
+struct PUBlock {
+ uint32_t width;
+ uint32_t height;
+ PartSize partsize;
+ bool isAmp;
+};
+
+const PUBlock g_puLookup[MAX_NUM_PU_SIZES] = {
+ { 8, 4, SIZE_2NxN, 0 },
+ { 4, 8, SIZE_Nx2N, 0 },
+ { 8, 8, SIZE_2Nx2N, 0 },
+ { 16, 4, SIZE_2NxnU, 1 },
+ { 16, 12, SIZE_2NxnD, 1 },
+ { 4, 16, SIZE_nLx2N, 1 },
+ { 12, 16, SIZE_nRx2N, 1 },
+ { 16, 8, SIZE_2NxN, 0 },
+ { 8, 16, SIZE_Nx2N, 0 },
+ { 16, 16, SIZE_2Nx2N, 0 },
+ { 32, 8, SIZE_2NxnU, 1 },
+ { 32, 24, SIZE_2NxnD, 1 },
+ { 8, 32, SIZE_nLx2N, 1 },
+ { 24, 32, SIZE_nRx2N, 1 },
+ { 32, 16, SIZE_2NxN, 0 },
+ { 16, 32, SIZE_Nx2N, 0 },
+ { 32, 32, SIZE_2Nx2N, 0 },
+ { 64, 16, SIZE_2NxnU, 1 },
+ { 64, 48, SIZE_2NxnD, 1 },
+ { 16, 64, SIZE_nLx2N, 1 },
+ { 48, 64, SIZE_nRx2N, 1 },
+ { 64, 32, SIZE_2NxN, 0 },
+ { 32, 64, SIZE_Nx2N, 0 },
+ { 64, 64, SIZE_2Nx2N, 0 }
+};
+
+struct CTUTaskData
+{
+ CUData& ctuData;
+ CUGeom& ctuGeom;
+ Frame& frame;
+};
+
+struct CTUBlockTask
+{
+ int row;
+ int col;
+ int width;
+ int height;
+ Frame* frame;
+ class FrameEncoder* frameEnc;
+ unsigned long long seq; /* monotonic sequence to preserve enqueue order */
+};
+
+struct PUData
+{
+ PartSize part;
+ const CUGeom* cuGeom;
+ int puOffset;
+ int areaId;
+ int finalIdx;
+ int qp;
+};
+
+struct MEData
+{
+ MV mv[2];
+ MV mvp[2];
+ uint32_t mvCost[2];
+ int ref[2];
+ int bits;
+ uint32_t cost;
+};
+
+struct CTUMVInfo
+{
+ MEData* m_meData;
+};
+
+struct CTUTask
+{
+ uint64_t seq;
+ int row;
+ int col;
+ int width;
+ int height;
+ int layer;
+
+ CUData* ctu;
+ CUGeom* geom;
+ Frame* frame;
+ FrameEncoder* frameEnc;
+};
+
+
+struct CompareCTUTask {
+ bool operator()(const CTUTask& a, const CTUTask& b) const {
+ if (a.frame->m_poc == b.frame->m_poc)
+ {
+ int a_pos = a.row + a.col;
+ int b_pos = b.row + b.col;
+ if (a_pos != b_pos) return a_pos > b_pos;
+ }
+
+ /* Compare by sequence number to preserve FIFO enqueue order.
+ * priority_queue in C++ is a max-heap, so return true when a.seq > b.seq
+ * to make smaller seq (earlier enqueue) the top() element. */
+ return a.seq > b.seq;
+ }
+};
+
+/**
+ * @brief Threaded motion-estimation module that schedules CTU blocks across worker threads.
+ *
+ * Owns per-worker analysis state (ThreadLocalData), manages the CTU task queues,
+ * and exposes a JobProvider interface for the thread pool to execute MVP
+ * derivation and ME searches in parallel.
+ */
+class ThreadedME: public JobProvider, public Thread
+{
+public:
+ x265_param* m_param;
+ Encoder& m_enc;
+
+ std::priority_queue<CTUTask, std::vector<CTUTask>, CompareCTUTask> m_taskQueue;
+ Lock m_taskQueueLock;
+ Event m_taskEvent;
+
+ volatile bool m_active;
+ unsigned long long m_enqueueSeq;
+
+ ThreadLocalData* m_tld;
+ int m_tldCount;
+
+#ifdef DETAILED_CU_STATS
+ CUStats m_cuStats;
+#endif
+
+ /**
+ * @brief Construct the ThreadedME manager; call create() before use.
+ */
+ ThreadedME(x265_param* param, Encoder& enc): m_param(param), m_enc(enc) {};
+
+ /**
+ * @brief Creates threadpool, thread local data and registers itself as a job provider
+ */
+ bool create();
+
+ /**
+ * @brief Configure ThreadedME parameters to match workload
+ */
+ void configure();
+
+ /**
+ * @brief Initialize lookup table used to index PU offsets for all valid CTU sizes.
+ */
+ void initPuStartIdx();
+
+ /**
+ * @brief Enqueue a block of CTUs for motion estimation.
+ *
+ * Blocks are queued per FrameEncoder and later moved into the global
+ * priority queue consumed by worker threads.
+ */
+ void enqueueCTUBlock(int row, int col, int width, int height, int layer, FrameEncoder* frameEnc);
+
+ /**
+ * @brief Inspect dependency state and enqueue newly-unblocked CTU rows.
+ *
+ * Uses external (row-level) and internal (buffered-row) dependencies to
+ * decide when a row can be split into CTU block tasks.
+ */
+ void enqueueReadyRows(int row, int layer, FrameEncoder* frameEnc);
+
+ /**
+ * @brief Main dispatcher thread that transfers per-frame tasks into the global queue.
+ */
+ void threadMain();
+
+ /**
+ * @brief Dequeue a CTU task, derive MVs, and run ME over all supported PU shapes.
+ *
+ * Called by worker threads via JobProvider; processes an entire CTU block.
+ */
+ void findJob(int workerThreadId);
+
+ /**
+ * @brief Stops worker threads
+ */
+ void stopJobs();
+
+ /**
+ * @brief Cleanup allocated resources
+ */
+ void destroy();
+
+ /**
+ * @brief Accumulate detailed CU statistics from worker thread local data.
+ */
+ void collectStats();
+};
+
+// Utils
+
+/**
+ * @brief A workaround to init CTUs before processRowEncoder does the same,
+ * since the CUData is needed before the FrameEncoder initializes it
+ */
+void initCTU(CUData& ctu, int row, int col, CTUTask& task);
+
+}
+
+#endif
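The FIFO tie-break used by `CompareCTUTask` can be demonstrated in isolation. `std::priority_queue` is a max-heap, so the comparator must return true when `a` should sit *below* `b`; returning `a.seq > b.seq` therefore puts the smallest sequence number (the earliest enqueue) at `top()`. The `Task` fields below are illustrative stand-ins for the patch's `CTUTask`:

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

struct Task { int priority; uint64_t seq; };

struct Compare {
    bool operator()(const Task& a, const Task& b) const {
        if (a.priority != b.priority)
            return a.priority > b.priority; // lower priority value wins
        return a.seq > b.seq;               // on ties, earlier enqueue wins
    }
};

// Pop the top task and return its sequence number.
inline uint64_t popSeq(std::priority_queue<Task, std::vector<Task>, Compare>& q)
{
    uint64_t s = q.top().seq;
    q.pop();
    return s;
}
```

Pushing three tasks in any order pops the lowest-priority-value task first, and among equal priorities the one enqueued earliest, which is exactly the stability guarantee the comment in `CompareCTUTask` describes.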
diff --git a/source/x265.cpp b/source/x265.cpp
index 1617ef414..3190208f7 100644
--- a/source/x265.cpp
+++ b/source/x265.cpp
@@ -269,6 +269,7 @@ static bool setRefContext(CLIOptions cliopt[], uint32_t numEncodes)
int main(int argc, char **argv)
{
+
#if HAVE_VLD
// This uses Microsoft's proprietary WCHAR type, but this only builds on Windows to start with
VLDSetReportOptions(VLD_OPT_REPORT_TO_DEBUGGER | VLD_OPT_REPORT_TO_FILE, L"x265_leaks.txt");
diff --git a/source/x265.h b/source/x265.h
index 221b6bfaf..446b2da3d 100644
--- a/source/x265.h
+++ b/source/x265.h
@@ -276,6 +276,8 @@ typedef struct x265_frame_stats
double decideWaitTime;
double row0WaitTime;
double wallTime;
+ int64_t tmeTime;
+ int64_t tmeWaitTime;
double refWaitWallTime;
double totalCTUTime;
double stallTime;
@@ -1172,6 +1174,15 @@ typedef struct x265_param
* win, particularly in video sequences with low motion. Default disabled */
int bDistributeMotionEstimation;
+ /* Use a dedicated threadpool to pre-process motion estimation, evaluating all
+ * PU combinations for CTUs in parallel. Dependencies between CTUs in inter
+ * frames are broken to allow more parallelism, which may cause a drop in
+ * compression efficiency. Recommended for many-core CPUs when some loss in
+ * compression efficiency is an acceptable trade-off for faster encoding.
+ * Default disabled.
+ */
+ int bThreadedME;
+
/*== Logging Features ==*/
/* Enable analysis and logging distribution of CUs. Now deprecated */
@@ -2329,6 +2340,10 @@ typedef struct x265_param
int searchRangeForLayer1;
int searchRangeForLayer2;
+ /* Threaded ME */
+ int tmeTaskBlockSize;
+ int tmeNumBufferRows;
+
/*SBRC*/
int bEnableSBRC;
int mcstfFrameRange;
diff --git a/source/x265cli.cpp b/source/x265cli.cpp
index c5fd26a68..b5f6bcea8 100755
--- a/source/x265cli.cpp
+++ b/source/x265cli.cpp
@@ -105,6 +105,7 @@ namespace X265_NS {
H0(" --[no-]slices <integer> Enable Multiple Slices feature. Default %d\n", param->maxSlices);
H0(" --[no-]pmode Parallel mode analysis. Deprecated from release 4.1. Default %s\n", OPT(param->bDistributeModeAnalysis));
H0(" --[no-]pme Parallel motion estimation. Deprecated from release 4.1. Default %s\n", OPT(param->bDistributeMotionEstimation));
+ H0(" --[no-]threaded-me Enable a standalone multi-threaded motion estimation module at CTU level. Default %s\n", OPT(param->bThreadedME));
H0(" --[no-]asm <bool|int|string> Override CPU detection. Default: auto\n");
H0("\nPresets:\n");
H0("-p/--preset <string> Trade off performance for compression efficiency. Default medium\n");
diff --git a/source/x265cli.h b/source/x265cli.h
index 108cc8e2d..765d6dd8e 100644
--- a/source/x265cli.h
+++ b/source/x265cli.h
@@ -398,6 +398,8 @@ static const struct option long_options[] =
{ "aom-film-grain", required_argument, NULL, 0 },
{ "frame-rc",no_argument, NULL, 0 },
{ "no-frame-rc",no_argument, NULL, 0 },
+ { "threaded-me", no_argument, NULL, 0 },
+ { "no-threaded-me", no_argument, NULL, 0 },
{ 0, 0, 0, 0 },
{ 0, 0, 0, 0 },
{ 0, 0, 0, 0 },
--
2.52.0.windows.1
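As a footnote on the scheduling logic: the row-gating condition in `ThreadedME::enqueueReadyRows` can be sketched outside the diff as a single predicate (struct and function names below are illustrative): a row is ready once its external (frame-level) dependency is resolved, and, beyond the initial buffered rows, once the row `tmeNumBufferRows` above it has resolved its internal dependency.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the per-row dependency state
// tracked in FrameEncoder::m_tmeDeps.
struct RowDep { bool isQueued, external, internal; };

bool rowReady(const RowDep* deps, int r, int bufferRows)
{
    if (deps[r].isQueued || !deps[r].external)
        return false;
    if (r < bufferRows)                    // initial rows have no internal predecessor
        return true;
    return deps[r - bufferRows].internal;  // wait on row r - bufferRows' ME
}
```

With `bufferRows = 2`, rows 0 and 1 are gated only by their external dependency, while row 2 additionally waits on row 0's internal flag; already-queued rows are never re-enqueued.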