Research & Methodology
How we collect and summarize sentiment signals, with ethics, privacy, and documentation as first principles.
Overview
Instrumetriq observes short-lived market narratives by combining high-resolution market microstructure with aggregated social sentiment.
Each asset is monitored for roughly 120–130 minutes, with the spot price sampled every 10 seconds.
This is an observational dataset: we publish measurements and derived statistics, not trading advice.
How scoring works
Data pipeline
- Posts are collected for each asset via controlled queries.
- Posts are deduplicated.
- Each post is scored by two AI models: a primary model and a referee.
- The referee's confidence determines whether the primary label is accepted, overridden, or replaced with a neutral label.
- Cycle-level aggregates are written (pos/neu/neg ratios, mean score, confidence stats).
- Silence handling is explicit (recent_posts_count, is_silent, hours_since_latest_tweet).
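The last two steps above can be sketched as follows. This is a minimal illustration, not the production pipeline. The function name aggregate_cycle, the label strings, the per-post record layout, and the rule that a cycle is silent when it contains no recent posts are all assumptions. Only the field names recent_posts_count, is_silent, and hours_since_latest_tweet come from the list above.

```python
from statistics import mean
from typing import Optional


def aggregate_cycle(posts: list[dict], hours_since_latest: Optional[float]) -> dict:
    """Summarize one cycle of scored posts (hypothetical record layout).

    Each post is assumed to look like {"label": "pos" | "neu" | "neg", "score": float}.
    """
    n = len(posts)
    out = {
        "recent_posts_count": n,
        "is_silent": n == 0,  # assumption: silence means no recent posts
        "hours_since_latest_tweet": hours_since_latest,
    }
    if n == 0:
        # No posts: leave sentiment fields empty instead of reporting "neutral".
        out.update(pos_ratio=None, neu_ratio=None, neg_ratio=None, mean_score=None)
        return out
    for label in ("pos", "neu", "neg"):
        out[f"{label}_ratio"] = sum(p["label"] == label for p in posts) / n
    out["mean_score"] = mean(p["score"] for p in posts)
    return out
```

Keeping the sentiment fields empty for a silent cycle is one way to keep "no posts" distinguishable from "posts that averaged out to neutral".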
Hybrid decisions
Each post's decision source (primary accepted, referee override, or labeled neutral) is tracked per cycle in hybrid_decision_stats.decision_sources.
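One way a primary-plus-referee rule could look is sketched below. The two thresholds, the label strings, and the decision-source names are hypothetical placeholders chosen for illustration. The actual values and names are not stated here.

```python
from collections import Counter

ACCEPT_MIN_CONF = 0.5    # hypothetical threshold
OVERRIDE_MIN_CONF = 0.8  # hypothetical threshold


def hybrid_decision(primary_label: str, referee_label: str, referee_conf: float) -> tuple[str, str]:
    """Return (final_label, decision_source) for one post."""
    if referee_label == primary_label:
        if referee_conf >= ACCEPT_MIN_CONF:
            return primary_label, "primary_accepted"
        return "neu", "neutral_low_confidence"
    if referee_conf >= OVERRIDE_MIN_CONF:
        # Models disagree and the referee is confident: take the referee's label.
        return referee_label, "referee_override"
    # Models disagree and the referee is not confident: fall back to neutral.
    return "neu", "neutral_low_confidence"


def decision_source_counts(decisions: list[tuple[str, str]]) -> dict[str, int]:
    """Count decision sources across one cycle's posts."""
    return dict(Counter(source for _, source in decisions))
```

Counting sources per cycle, as in decision_source_counts, is the kind of summary a field like hybrid_decision_stats.decision_sources could hold.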
Methodology note
As is common in cryptocurrency-related social media discourse, sentiment language is heavily skewed positive. Our dataset preserves this property rather than normalizing it. The sentiment mean reflects narrative tone, while post volume, balance, and silence capture structural changes in discourse.
This positive skew has been documented in prior academic studies of crypto social media sentiment (e.g., Chen & Hafner, 2019; Hassan et al., 2022).
Selected references
- Chen, C.Y.H. & Hafner, C.M. (2019). "Sentiment-Induced Bubbles in the Cryptocurrency Market." Journal of Risk and Financial Management, 12(2), 53. doi:10.3390/jrfm12020053
- Hassan, M.K., Hudaefi, F.A. & Caraka, R.E. (2022). "Mining netizen's opinion on cryptocurrency: sentiment analysis of Twitter data." Studies in Economics and Finance, 39(3), 365-385. doi:10.1108/SEF-06-2021-0237
What we store per entry
Market microstructure
Spread, depth at multiple bps, order book imbalance, taker ratios. Futures data (when available) including funding rate, open interest, and mark price.
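For readers unfamiliar with these quantities, the sketch below uses common textbook definitions. The function names and the order book layout (lists of price and quantity pairs) are assumptions, and the stored fields may be computed differently.

```python
def spread_bps(best_bid: float, best_ask: float) -> float:
    """Quoted spread relative to the mid price, in basis points."""
    mid = (best_bid + best_ask) / 2
    return (best_ask - best_bid) / mid * 10_000


def depth_within_bps(levels, mid: float, bps: float, side: str) -> float:
    """Quote-currency value resting within `bps` of mid on one side.

    levels: iterable of (price, quantity); side: "bid" or "ask".
    """
    band = mid * bps / 10_000
    if side == "bid":
        return sum(p * q for p, q in levels if p >= mid - band)
    if side == "ask":
        return sum(p * q for p, q in levels if p <= mid + band)
    raise ValueError("side must be 'bid' or 'ask'")


def book_imbalance(bid_depth: float, ask_depth: float) -> float:
    """Imbalance in [-1, 1]; positive means more resting bid depth."""
    total = bid_depth + ask_depth
    return 0.0 if total == 0 else (bid_depth - ask_depth) / total
```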
Liquidity quality
liq_qv_usd plus global/self percentiles for regime comparisons.
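A percentile of this kind can be sketched as a simple rank against a reference set. The reading that "global" compares against all monitored assets and "self" against the same asset's own history is an assumption, as is the function name percentile_rank.

```python
from bisect import bisect_right


def percentile_rank(value: float, reference: list[float]) -> float:
    """Share of reference values less than or equal to `value`, on a 0-100 scale."""
    if not reference:
        raise ValueError("reference must not be empty")
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)
```

Under that assumption, passing liq_qv_usd values from every asset would give a global percentile, and passing one asset's past values would give a self percentile.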
Sentiment windows (aggregated)
last_cycle and last_2_cycles aggregates (not raw tweets), including author_stats and engagement.
Outcomes & derived features
Price-path metrics and derived entry statistics computed from the recorded series.
Activity & silence context
Explicit tracking of posting activity and silence states to distinguish absence of signal from neutral sentiment, including time since latest observed post.
Author & engagement aggregates
Aggregated author statistics and engagement signals (e.g. follower counts, verification flags, likes, replies, reposts). Author identities are not stored or exposed.
Entry Deep Dive (sample)
A rotating example from the public sample archive, illustrating how one monitored entry is represented end-to-end.
Research use cases
Examples of how the dataset can be used in research contexts.
Sentiment time series modeling
- Use archived, window-level sentiment aggregates to study how social sentiment evolves over short, fixed monitoring periods.
- Analyze temporal properties such as persistence, volatility, mean reversion, and structural breaks in sentiment signals.
- Apply standard time-series techniques to social data that is already structured and aggregated at the entry level.
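As one example of a persistence measure, the sketch below computes the sample autocorrelation of a sentiment series at a given lag. The input is assumed to be a plain list of window-level mean scores in time order. How those are extracted from the archive is left out.

```python
from statistics import mean


def autocorr(series: list[float], lag: int = 1) -> float:
    """Sample autocorrelation at `lag`; values near 1 indicate persistence."""
    n = len(series)
    if lag <= 0 or lag >= n:
        raise ValueError("lag must satisfy 0 < lag < len(series)")
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return float("nan")  # constant series: autocorrelation is undefined
    num = sum((series[t] - m) * (series[t - lag] - m) for t in range(lag, n))
    return num / denom
```

A strongly negative value at lag 1 would point toward mean reversion rather than persistence.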
Microstructure + narrative coupling
- Examine how aggregated social sentiment aligns with short-term price movement, spreads, and liquidity conditions.
- Study whether narrative intensity and market microstructure signals co-evolve, diverge, or lag one another.
- Compare coupling strength across assets, time windows, and market regimes.
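A simple way to probe lead and lag relationships is a lagged Pearson correlation, sketched below. It assumes both series have already been aligned to a common index and have equal length. The function names are illustrative, and a correlation found this way is descriptive only.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def lagged_corr(sentiment: list[float], returns: list[float], lag: int) -> float:
    """Correlate sentiment[t] with returns[t + lag].

    lag > 0: sentiment leads returns; lag < 0: returns lead sentiment.
    """
    if len(sentiment) != len(returns):
        raise ValueError("series must have equal length")
    if lag >= 0:
        a, b = sentiment[: len(sentiment) - lag], returns[lag:]
    else:
        a, b = sentiment[-lag:], returns[: len(returns) + lag]
    return pearson(a, b)
```

Scanning a small range of lags and comparing the results across assets or regimes is one way to approach the comparisons listed above.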
Silence & activity research
- Explicitly separate periods of posting inactivity from neutral or low-confidence sentiment states.
- Analyze how changes in posting frequency relate to market behavior and volatility.
- Treat silence as a first-class signal rather than an absence of data.
Model evaluation & drift monitoring
- Use the historical archive to observe how sentiment model outputs behave over time.
- Track shifts in score distributions, confidence bands, and decision pathways.
- Support analysis of model stability, drift, and long-term consistency without retraining.
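One widely used drift measure is the population stability index (PSI), which compares the binned distribution of scores in a baseline period with a later period. The sketch below is a generic implementation. The bin edges are supplied by the caller, and the score range is an assumption to be checked against the archive.

```python
from math import log


def bin_shares(values: list[float], edges: list[float], eps: float = 1e-6) -> list[float]:
    """Fraction of values in each bin [edges[i], edges[i+1]); last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            is_last = i == len(edges) - 2
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    # Floor empty bins at eps so the logarithm below stays finite.
    return [max(c / total, eps) for c in counts]


def psi(baseline: list[float], current: list[float], edges: list[float]) -> float:
    """Population stability index; 0 means identical binned distributions."""
    p = bin_shares(baseline, edges)
    q = bin_shares(current, edges)
    return sum((qi - pi) * log(qi / pi) for pi, qi in zip(p, q))
```

Because it needs only the recorded scores, a measure like this can be tracked over the archive without retraining or re-running any model.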
Event-centric labeling (observational)
- Identify recurring patterns in sentiment and market behavior around notable events.
- Analyze outcomes relative to observed entry conditions, rather than assigning forward-looking labels.
- Support retrospective research into how narratives and markets responded to specific situations.
Cross-sectional comparisons
- Compare sentiment, liquidity, and volatility metrics across multiple assets within similar time windows.
- Study relative behavior rather than absolute values to identify outliers and common structures.
- Analyze how different assets respond to comparable narrative conditions.
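Relative behavior can be studied by standardizing a metric across the assets observed in the same time window. The sketch below computes cross-sectional z-scores. The asset symbols in the usage example are placeholders, not real dataset entries.

```python
from statistics import mean, pstdev


def cross_sectional_z(values: dict[str, float]) -> dict[str, float]:
    """Z-score of each asset's metric relative to the other assets in the same window."""
    data = list(values.values())
    if len(data) < 2:
        raise ValueError("need at least two assets")
    m, s = mean(data), pstdev(data)
    if s == 0:
        return {asset: 0.0 for asset in values}  # all assets identical
    return {asset: (v - m) / s for asset, v in values.items()}


# Usage with placeholder symbols:
z = cross_sectional_z({"AAA": 1.0, "BBB": 2.0, "CCC": 3.0})
```

Assets with a large absolute z-score are candidates for the outlier analysis mentioned above.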
Data quality & integrity studies
- Evaluate completeness, internal consistency, and stability of collected signals.
- Study the effects of deduplication, aggregation, and sampling on downstream metrics.
- Use the dataset to validate methodological assumptions in social data collection.
Aggregated author & engagement analysis
- Analyze engagement dynamics using aggregated author-level statistics.
- Study how engagement concentration and distribution relate to sentiment outcomes.
- Preserve privacy by design: identities are neither stored nor exposed, only aggregated signals.
Constraints & data integrity
Observational scope
This dataset is structured for observational research, not for trading decisions or predictive advice. The measurements record what occurred during specific monitoring windows, not what will occur.
Outcomes are presented descriptively. Researchers using this data are expected to form their own interpretations and validate findings independently. No statistical relationship observed here should be assumed to persist or generalize beyond the recorded conditions.
Temporal & sampling limits
Each monitored entry spans approximately 120–130 minutes. Spot price samples are collected every 10 seconds during this window. This is not continuous market coverage, and not all assets are monitored simultaneously.
The dataset captures short-term, high-resolution snapshots rather than long-term trends. Gaps between entries are expected and normal. Sampling frequency and window length are fixed by design, which may limit certain analyses that require different temporal resolutions.
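The fixed sampling design makes the expected number of price samples per entry easy to estimate, which helps when checking completeness. The sketch below assumes one sample per full interval and does not count a possible extra sample at the window boundary. The function names are illustrative.

```python
def expected_samples(window_minutes: float, interval_seconds: float = 10.0) -> int:
    """Number of full sampling intervals that fit in the window."""
    return int(window_minutes * 60 // interval_seconds)


def completeness(observed: int, window_minutes: float, interval_seconds: float = 10.0) -> float:
    """Observed samples as a fraction of the expected count."""
    return observed / expected_samples(window_minutes, interval_seconds)
```

With 10-second sampling, a 120-minute window gives 720 expected samples and a 130-minute window gives 780.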
Sentiment aggregation & uncertainty
Sentiment values are aggregated at the cycle level and do not represent individual posts. A neutral sentiment score does not mean that no social activity occurred. It may reflect balanced opinions, low confidence, or explicitly neutral content.
Silence and inactivity are tracked separately using explicit flags and timestamps. All sentiment scores carry inherent uncertainty and variance, which are reported alongside the aggregates. Researchers should account for these distributions rather than treating aggregates as point estimates.
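As a rough illustration of treating an aggregate as a distribution rather than a point estimate, the sketch below turns a mean score, a standard deviation, and a post count into an approximate interval. It uses a normal approximation and treats posts as independent, which is a strong assumption for social media data. The function and parameter names are hypothetical.

```python
from math import sqrt


def mean_interval(mean_score: float, std: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate interval for a cycle's mean score (normal approximation)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    half_width = z * std / sqrt(n)
    return mean_score - half_width, mean_score + half_width
```

The interval widens quickly as the post count falls, so low-volume cycles deserve more caution than the mean alone suggests.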
Privacy & data handling
Raw posts are not stored in this dataset and are not published or exposed. Author identities are not recorded, stored, or made available at any stage of the pipeline.
Only aggregated author-level statistics (such as follower count distributions, verification ratios, and engagement metrics) are included, preserving privacy by design. The source data is public, but processing follows data minimization principles to ensure no individual-level tracking or re-identification is possible from the published archive.
How this dataset is intended to be used
This dataset is designed for research, analysis, and experimentation. It provides a structured foundation for studying how aggregated social sentiment relates to short-term price behavior, liquidity conditions, and market microstructure. Researchers can use it to test hypotheses, build analytical tools, evaluate models, or create benchmarks for sentiment and activity signals in cryptocurrency markets.
It is not a replacement for real-time market data feeds, execution systems, or trading infrastructure. The archive contains fixed snapshots captured under specific conditions, not a continuous stream of updated signals. Its value comes from the structure, consistency, and transparency of the measurements, not from predictive claims or forward-looking guarantees.
If you are conducting observational research into social sentiment dynamics, exploring the relationship between narrative and market behavior, or validating methodologies for aggregated signal collection, this dataset may be relevant. It is intended for those who need reproducible, well-documented data to support rigorous analytical work.