Research & Methodology

How we collect and summarize sentiment signals, with ethics, privacy, and documentation as first principles.

Overview

Instrumetriq observes short-lived market narratives by combining high-resolution market microstructure with aggregated social sentiment.

Each asset is monitored for ~120–130 minutes per cycle, with spot price sampled every 10 seconds.

This is an observational dataset: we publish measurements and derived statistics, not trading advice.

How scoring works

Data pipeline v2.0

Posts are collected from X (Twitter) using controlled, asset-specific queries. Each scoring cycle processes posts through the following stages:

  1. Collection. Public posts are retrieved from X (Twitter) via asset-specific search queries. Queries are designed to capture discussion relevant to each monitored cryptocurrency.
  2. Deduplication. Previously observed posts are identified and removed. Each post is scored at most once, preventing duplicate influence on aggregated metrics.
  3. Crypto relevance filtering. A dedicated classification model evaluates whether each post pertains to cryptocurrency; posts determined to be off-topic are excluded before sentiment scoring. This step was introduced in February 2026 to reduce noise from posts that mention a token name in a non-crypto context. Filtered posts are retained separately for auditing; no data is discarded silently.
     The filter model is a fine-tuned BERTweet binary classifier trained on 17,813 labeled examples. Held-out evaluation (n=2,486): accuracy 93.8%, macro F1 0.929. Crypto recall is 96.1% by design - the model is intentionally conservative, preferring to retain ambiguous posts over dropping genuine crypto content - so approximately 11% of non-crypto posts may pass the filter and reach sentiment scoring.
     In production the observed pass-through rate is approximately 37%: the filter drops roughly 63% of collected tweets as off-topic, making it an actively working noise reduction step rather than merely a safety net. To compensate for the reduced volume, V2 raised the per-query scrape limit from ~40 to ~55+ tweets (up 30-40%), so the posts that survive filtering represent higher-confidence crypto discussion.
  4. Primary sentiment scoring. Each post that passes the relevance filter is scored by a domain-specific sentiment model trained on crypto-related language. The primary model is based on the BERTweet architecture, selected for its pretraining on social media text.
  5. Secondary confidence scoring. A second model, based on the DistilBERT architecture, independently evaluates each post. This secondary model acts as a referee: its confidence score determines whether the primary classification is accepted, overridden, or forced to neutral.
  6. Cycle-level aggregation. Individual post scores are aggregated into cycle-level metrics, including positive/neutral/negative ratios, mean sentiment score, and decision source statistics.
  7. Silence handling. Periods of low or absent posting activity are tracked explicitly. Fields such as recent_posts_count, is_silent, and hours_since_latest_tweet distinguish genuine silence from neutral sentiment.
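The stages above can be sketched end to end as a single pass over one cycle's posts. Everything here - function names, record shape, the collapsed scoring callable - is illustrative, since the production pipeline code is not published:

```python
# Illustrative sketch of one v2 scoring cycle as described above.
# Function and field names are hypothetical; the production pipeline
# is not published.

def run_cycle(raw_posts, seen_ids, is_crypto, score):
    """Process one cycle of posts and return cycle-level aggregates."""
    # 2. Deduplication: each post is scored at most once.
    fresh = []
    for p in raw_posts:
        if p["id"] not in seen_ids:
            seen_ids.add(p["id"])
            fresh.append(p)

    # 3. Crypto relevance filtering: off-topic posts are retained
    #    separately for auditing, never discarded silently.
    relevant = [p for p in fresh if is_crypto(p["text"])]
    off_topic = [p for p in fresh if not is_crypto(p["text"])]

    # 4-5. Per-post sentiment label (primary model + referee, collapsed
    #      here into a single callable for brevity).
    labels = [score(p["text"]) for p in relevant]

    # 6. Cycle-level aggregation.
    total = len(labels)
    return {
        "posts_total": total,
        "pos_ratio": labels.count("positive") / total if total else None,
        "neg_ratio": labels.count("negative") / total if total else None,
        # 7. Silence handling: zero posts is an explicit state, not neutral.
        "is_silent": total == 0,
        "filtered_count": len(off_topic),
    }
```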

Referee adjudication

The secondary model's confidence determines the final label for each post. Three outcomes are possible:

  • primary_default - primary model classification accepted; secondary model confidence supports the primary label
  • referee_override - secondary model overrides the primary classification, with sufficient confidence in a different label
  • referee_neutral_band - forced neutral; secondary model confidence falls within an uncertainty band where neither label is reliable

These decision sources are tracked per cycle in hybrid_decision_stats.decision_sources.
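As a rough sketch, the three outcomes map to a rule like the following. The thresholds are invented placeholders - the actual confidence bands are not disclosed - and only the three decision-source names come from the text:

```python
# Hypothetical adjudication rule matching the three decision sources
# above. Thresholds are placeholders; the real bands are not published.

ACCEPT_THRESHOLD = 0.70       # referee confident enough to override
NEUTRAL_BAND = (0.40, 0.55)   # referee confidence too ambiguous to trust

def adjudicate(primary_label, referee_label, referee_confidence):
    """Return (final_label, decision_source) for one post."""
    lo, hi = NEUTRAL_BAND
    if lo <= referee_confidence <= hi:
        # Neither label is reliable inside the uncertainty band.
        return "neutral", "referee_neutral_band"
    if referee_label != primary_label and referee_confidence >= ACCEPT_THRESHOLD:
        # Referee is confident in a different label: override.
        return referee_label, "referee_override"
    # Default: the primary classification stands.
    return primary_label, "primary_default"
```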

Version identification

Each record includes a methodology_regime field with values "v1" or "v2", enabling programmatic separation of data produced under different pipeline configurations. The sentiment_model_version field reflects the specific model generation ("v1.0" or "v2.0").

The V2 pipeline was deployed in two phases:

  • Phase 1 (February 16, 2026 05:14 UTC) - Updated sentiment models deployed. Records from this point carry methodology_regime: "v2" and sentiment_model_version: "v2.0".
  • Phase 2 (February 17, 2026 06:03 UTC) - Crypto relevance filter activated. Records from this point additionally include crypto_filter_enabled: true.

Data produced before February 16, 2026 carries methodology_regime: "v1". This represents 63 days and approximately 157,000 records. V1 data predates both the updated sentiment models and the crypto relevance filter. Researchers performing analysis across the full history should treat v1 records as a lower-confidence sentiment period and segment by the methodology_regime field before drawing cross-period comparisons. V1 data is structurally valid and fully retained in the archive - it simply reflects different model versions and lacks the relevance filtering step introduced in February 2026.
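In practice, segmentation amounts to partitioning on these fields before any cross-period statistic is computed. A minimal sketch (the record shape is assumed; field names and values are from the schema above):

```python
# Split records by methodology_regime before cross-period comparison.
# Record shape is illustrative; field names follow the schema above.

def split_by_regime(records):
    """Return (v1_records, v2_records); raise on unknown regimes."""
    v1, v2 = [], []
    for r in records:
        regime = r["methodology_regime"]
        if regime == "v1":
            v1.append(r)
        elif regime == "v2":
            v2.append(r)
        else:
            raise ValueError(f"unexpected methodology_regime: {regime!r}")
    return v1, v2

def filtered_v2(records):
    """V2 records produced after the relevance filter went live (Phase 2)."""
    return [r for r in records
            if r["methodology_regime"] == "v2"
            and r.get("crypto_filter_enabled", False)]
```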

Methodology note

Sentiment language in cryptocurrency-related social media discourse is characteristically skewed positive. This property has been documented in prior academic studies (e.g., Chen & Hafner, 2019; Hassan et al., 2022). Measured across all V2 records in this archive, 76.2% of AI-scored posts are classified positive and 72.5% of per-cycle sentiment means are positive-leaning - consistent with the academic literature. The V2 pipeline partially addresses this skew through improved negative sentiment recall and relevance filtering. The dataset continues to reflect observed narrative tone rather than normalized sentiment. Post volume, balance, silence, and decision source distributions provide structural context alongside the sentiment mean.

Selected references
  • Chen, C.Y.H. & Hafner, C.M. (2019). "Sentiment-Induced Bubbles in the Cryptocurrency Market." Journal of Risk and Financial Management, 12(2), 53. doi:10.3390/jrfm12020053
  • Hassan, M.K., Hudaefi, F.A. & Caraka, R.E. (2022). "Mining netizen's opinion on cryptocurrency: sentiment analysis of Twitter data." Studies in Economics and Finance, 39(3), 365-385. doi:10.1108/SEF-06-2021-0237

Interpreting sentiment scores

Because the dataset is structurally positive-leaning, the raw label_3class_mean value for a single cycle is less informative in isolation than how it compares to the asset's own historical baseline. An unusually high or low score relative to an asset's own recent history is a stronger analytical signal than an absolute value compared across all assets or against a neutral midpoint. The hybrid_decision_stats.pos_ratio, neg_ratio, and posts_total fields support this kind of relative, per-asset analysis. This is a suggested interpretation approach, not a transformation applied in the dataset itself.
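A minimal sketch of this per-asset baseline approach: express the current cycle's mean as a z-score against the asset's own trailing history. The window length is an arbitrary choice - the dataset prescribes no particular baseline:

```python
# Hypothetical per-asset baseline comparison, as suggested above.
# The 48-cycle window is an arbitrary illustrative choice.
from statistics import mean, stdev

def baseline_zscore(history, current, window=48):
    """Z-score of the current label_3class_mean against the asset's
    trailing window of cycle means. Returns None when the baseline
    is too short or degenerate to standardize against."""
    recent = history[-window:]
    if len(recent) < 2:
        return None
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return None
    return (current - mu) / sigma
```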

Methodology V1 (prior to February 2026)

The following describes the original scoring methodology active from December 2025 through February 15, 2026. Data produced during this period carries methodology_regime: "v1" and sentiment_model_version: "v1.0".

Data pipeline

  • Posts are collected for each asset via controlled queries.
  • Posts are deduplicated.
  • Each post is scored by two AI models (primary + referee).
  • Referee confidence controls when we accept, override, or label neutral.
  • Cycle-level aggregates are written (pos/neu/neg ratios, mean score, confidence stats).
  • Silence handling is explicit (recent_posts_count, is_silent, hours_since_latest_tweet).

Decisions

  • primary_default - primary model decision accepted
  • referee_override - referee overrides primary
  • referee_neutral_band - forced neutral due to uncertainty band

Methodology note (V1)

As is common in cryptocurrency-related social media discourse, sentiment language is heavily skewed positive. Our dataset preserves this property rather than normalizing it. The sentiment mean reflects narrative tone, while post volume, balance, and silence capture structural changes in discourse.

Dataset characteristics

The following figures are computed across all V2 records (methodology_regime: "v2", February 16, 2026 onwards). The archive has been running continuously since December 15, 2025.

Archive scope

~245,000 Total records (full archive, growing daily)
99 Days of data
274 Assets monitored per day
~2,500 Observation cycles per day

Posts per asset-observation (v2)

Each record in the dataset represents one asset during one monitoring cycle - not a single aggregate over all 274 assets. Post volume is heavily skewed across the universe: major assets (BTC, ETH, SOL, XLM) regularly generate hundreds of posts per observation; the long tail of lower-profile traded pairs generates very few or none. The median of 4 and the 24.6% zero-post figure reflect this long tail, not gaps in pipeline operation. Assets with no posts in a given observation are flagged via is_silent and hours_since_latest_tweet - silence is a first-class signal, not missing data. Analysts focused on sentiment signal should apply a minimum post volume filter such as posts_total >= 10 to restrict to assets with meaningful discussion. The conditioned analysis in the Statistical Properties section uses this threshold.

4 Median posts per asset-observation
15.6 Mean posts per asset-observation
24.6% Asset-observations with 0 posts (long-tail assets)
13.9% Asset-observations with 50+ posts (high-activity assets)
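Applying the suggested volume filter is a short pass over the per-observation records. Field names follow the schema; the threshold of 10 matches the one used in the Statistical Properties section:

```python
# Minimal volume filter described above: restrict sentiment analysis
# to asset-observations with meaningful discussion. Record shape is
# illustrative; posts_total is the schema field.

MIN_POSTS = 10

def with_signal(observations, min_posts=MIN_POSTS):
    """Keep observations with enough posts for a meaningful sentiment mean."""
    return [o for o in observations if o.get("posts_total", 0) >= min_posts]
```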

Sentiment distribution (v2, AI-scored posts)

76.2% Positive posts
20.5% Negative posts
3.2% Neutral posts
72.5% Cycles with positive-leaning mean

Referee adjudication rates (v2)

A low intervention rate indicates the two models largely agree. High override rates would indicate model disagreement and reduced confidence in the final labels.

92.3% Primary model accepted
4.4% Referee override
3.2% Forced neutral (uncertainty band)
7.7% Total referee intervention

Coverage and silence

Each asset is queried using a structured set of crypto-specific terms: brand name, cashtag, and known aliases - for example, Bitcoin is queried as Bitcoin OR $BTC OR "Bitcoin Crypto" OR "Bitcoin Network", and Ethereum as Ethereum OR $ETH OR "Ethereum Crypto" OR "Ethereum Network". Despite these targeted queries, a significant share of collected posts discuss unrelated topics that happen to mention a token name. The crypto relevance filter drops approximately 63% of raw tweets as off-topic, keeping only the ~37% the model classifies as genuine crypto discussion. To offset this reduction, V2 increased the per-query scrape limit from ~40 to ~55+ tweets (up 30-40%), prioritizing relevance confidence over raw volume. A sanity guard bypasses the filter when the drop rate for a single query exceeds 85%, preventing overcorrection on ambiguous tickers.

100% Asset-observations with Twitter pipeline data
1.1% Asset-observations flagged as silent (is_silent=true)
~37% Crypto relevance filter pass-through (63% dropped as off-topic)
93.8% Filter model accuracy (held-out eval, n=2,486)
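The query template and the filter's sanity guard can be sketched as follows. The template strings and the 85% guard threshold come from the description above; the function shapes are hypothetical:

```python
# Illustrative query builder and filter sanity guard. The query template
# and the 85% drop-rate guard follow the text; function shapes are
# hypothetical, not the production API.

def build_query(name, ticker):
    """Compose an asset-specific X search query: brand, cashtag, aliases."""
    return f'{name} OR ${ticker} OR "{name} Crypto" OR "{name} Network"'

def apply_filter(posts, is_crypto, max_drop_rate=0.85):
    """Drop off-topic posts, unless the drop rate for this query exceeds
    the sanity guard - then bypass the filter to avoid overcorrection
    on ambiguous tickers."""
    if not posts:
        return posts
    kept = [p for p in posts if is_crypto(p)]
    drop_rate = 1 - len(kept) / len(posts)
    if drop_rate > max_drop_rate:
        return posts  # guard tripped: keep everything for this query
    return kept
```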

What we store per entry

Market microstructure

Spread, depth at multiple bps, order book imbalance, taker ratios. Futures data (when available) including funding rate, open interest, and futures-spot basis.

Liquidity quality

liq_qv_usd plus global/self percentiles for regime comparisons.

Sentiment windows (aggregated)

last_cycle and last_2_cycles aggregates (not raw tweets), including author_stats and engagement.

Outcomes & derived features

Price-path metrics and derived entry statistics computed from the recorded series.

Activity & silence context

Explicit tracking of posting activity and silence states to distinguish absence of signal from neutral sentiment, including time since latest observed post.

Author & engagement aggregates

Aggregated author statistics and engagement signals (e.g. follower counts, verification flags, likes, replies, reposts). Author identities are not stored or exposed.

Research use cases

Examples of how the dataset can be used in research contexts.

Sentiment time series modeling

  • Use archived, window-level sentiment aggregates to study how social sentiment evolves over short, fixed monitoring periods.
  • Analyze temporal properties such as persistence, volatility, mean reversion, and structural breaks in sentiment signals.
  • Apply standard time-series techniques to social data that is already normalized and aggregated at the entry level.
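For example, persistence in a per-asset series of cycle-level sentiment means can be checked with lag-1 autocorrelation. A minimal sketch (not part of the dataset's own tooling):

```python
# Lag-1 autocorrelation of a per-asset sentiment series - a minimal
# persistence check along the lines of the bullets above.
from statistics import mean

def lag1_autocorr(series):
    """Lag-1 autocorrelation; values near 0 suggest weak persistence,
    positive values suggest sentiment carries over between cycles."""
    n = len(series)
    if n < 3:
        return None
    mu = mean(series)
    num = sum((series[t] - mu) * (series[t - 1] - mu) for t in range(1, n))
    den = sum((x - mu) ** 2 for x in series)
    return num / den if den else None
```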

Microstructure + narrative coupling

  • Examine how aggregated social sentiment aligns with short-term price movement, spreads, and liquidity conditions.
  • Study whether narrative intensity and market microstructure signals co-evolve, diverge, or lag one another.
  • Compare coupling strength across assets, time windows, and market regimes.

Silence & activity research

  • Explicitly separate periods of posting inactivity from neutral or low-confidence sentiment states.
  • Analyze how changes in posting frequency relate to market behavior and volatility.
  • Treat silence as a first-class signal rather than an absence of data.

Model evaluation & drift monitoring

  • Use the historical archive to observe how sentiment model outputs behave over time.
  • Track shifts in score distributions, confidence bands, and decision pathways.
  • Support analysis of model stability, drift, and long-term consistency without retraining.

Event-centric labeling (observational)

  • Identify recurring patterns in sentiment and market behavior around notable events.
  • Analyze outcomes relative to observed entry conditions, rather than assigning forward-looking labels.
  • Support retrospective research into how narratives and markets responded to specific situations.

Cross-sectional comparisons

  • Compare sentiment, liquidity, and volatility metrics across multiple assets within similar time windows.
  • Study relative behavior rather than absolute values to identify outliers and common structures.
  • Analyze how different assets respond to comparable narrative conditions.

Data quality & integrity studies

  • Evaluate completeness, internal consistency, and stability of collected signals.
  • Study the effects of deduplication, aggregation, and sampling on downstream metrics.
  • Use the dataset to validate methodological assumptions in social data collection.

Aggregated author & engagement analysis

  • Analyze engagement dynamics using aggregated author-level statistics.
  • Study how engagement concentration and distribution relate to sentiment outcomes.
  • Preserve privacy by design: identities are neither stored nor exposed, only aggregated signals.

Constraints & data integrity

Observational scope

This dataset is structured for observational research, not for trading decisions or predictive modeling advice. The measurements record what occurred during specific monitoring windows, not what will occur.

Outcomes are presented descriptively. Researchers using this data are expected to form their own interpretations and validate findings independently. No statistical relationship observed here should be assumed to persist or generalize beyond the recorded conditions.

Temporal & sampling limits

Each monitored entry spans approximately 120–130 minutes. Spot price samples are collected every 10 seconds during this window. This is not continuous market coverage, and not all assets are monitored simultaneously.

The dataset captures short-term, high-resolution snapshots rather than long-term trends. Gaps between entries are expected and normal. Sampling frequency and window length are fixed by design, which may limit certain analyses that require different temporal resolutions.

Sentiment aggregation & uncertainty

Sentiment values are aggregated at the cycle level and do not represent individual posts. A neutral sentiment score does not mean no social activity occurred - it may reflect balanced opinions, low confidence, or explicitly neutral content.

Silence and inactivity are tracked separately using explicit flags and timestamps. All sentiment scores carry inherent uncertainty and variance, which are reported alongside the aggregates. Researchers should account for these distributions rather than treating aggregates as point estimates.

Privacy & data handling

Raw posts are not stored in this dataset and are not published or exposed. Author identities are not recorded, stored, or made available at any stage of the pipeline.

Only aggregated author-level statistics (such as follower count distributions, verification ratios, and engagement metrics) are included, preserving privacy by design. The source data is public, but processing follows data minimization principles to ensure no individual-level tracking or re-identification is possible from the published archive.

Statistical Properties

A systematic statistical study was conducted across all Tier 3 features against measured returns at 1-, 3-, 6-, and 12-cycle horizons. The methodology included baseline regressions, conditioned searches, and a full-feature sweep across 4,992 statistical tests with Benjamini-Hochberg FDR correction applied.

Microstructure signals

The strongest statistical associations were observed in derivatives positioning and market microstructure features. Open interest flow (5-minute deltas), funding rates (both extreme positive and negative), futures basis (spot-futures divergence), and order book imbalance (depth-weighted OBI at 5 basis point bands) showed consistent associations with subsequent return observations across multiple horizons. These are well-documented market microstructure phenomena. The dataset captures them cleanly because all features share observation windows, requiring no timestamp alignment or cross-dataset joins.

Conditioned sentiment relationships

When sentiment features were filtered for minimum post volume (≥10 posts per cycle), 48 of 240 tests reached statistical significance - approximately 4× the rate expected under the null hypothesis. The observed direction was consistently contrarian: elevated sentiment preceded lower measured returns. Effect sizes were modest, ranging from 25–37 basis points over 24-hour horizons.

Methodology

The study examined unconditional baseline regressions across six primary sentiment features (hybrid mean score, AI sentiment mean, lexicon score, posts total, positive ratio, negative ratio) as well as conditioned searches with volume filters and interaction terms. Benjamini-Hochberg FDR correction was applied across the full 4,992-test sweep to control for multiple comparisons.
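Benjamini-Hochberg is a standard procedure; for readers reproducing the correction, a minimal reference implementation (not the study's actual code):

```python
# Standard Benjamini-Hochberg FDR step-up procedure, as applied across
# the 4,992-test sweep. Reference sketch only - not the study's code.

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of tests deemed significant at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha;
    # all tests with rank <= k are declared significant.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            cutoff = rank
    return sorted(order[:cutoff])
```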

Scope and limitations

These findings cover the period December 2025 – February 2026 during one market regime. No out-of-sample validation has been conducted. Transaction costs, slippage, and market impact are not modeled. Treat these results as descriptive statistics about the dataset's observed structure during this period.

How this dataset is intended to be used

This dataset is designed for research, analysis, and experimentation. It provides a structured foundation for studying how aggregated social sentiment relates to short-term price behavior, liquidity conditions, and market microstructure. Researchers can use it to test hypotheses, build analytical tools, evaluate models, or create benchmarks for sentiment and activity signals in cryptocurrency markets.

It is not a replacement for real-time market data feeds, execution systems, or trading infrastructure. The archive contains fixed snapshots captured under specific conditions, not a continuous stream of updated signals. Value comes from the structure, consistency, and transparency of the measurements - not from predictive claims or forward-looking guarantees.

If you are conducting observational research into social sentiment dynamics, exploring the relationship between narrative and market behavior, or validating methodologies for aggregated signal collection, this dataset may be relevant. It is intended for those who need reproducible, well-documented data to support rigorous analytical work.