Questions About 63 Million

What's that article about common "one in a million" events?

Looking for an article that was published in the last three years that described how common certain "one in a million" events actually are. Prominent coincidences discussed in the article were whales jumping onto the decks of ships and people finding long-lost wedding rings. I believe it may have been on the Guardian's website or another UK paper's. At one point, I had it located in an archive but couldn't figure out how to buy it (UK funds). So if anyone finds it, any tips on how to acquire it would also be welcome. Thanks!
http://ask.metafilter.com/12253/Whats-that-article-about-common-one-in-a-million-events

Is New York really losing 400 million dollars a day?

What does it mean when they give an estimate of a financial loss to a city? Bloomberg has given an estimate that the strike will cost the city $400 million a day. But does that mean money that does not come into the city and money that leaves the city? Or are they being shifty and counting some revenue and not other revenue? That is, money still circulates; is it really lost, or is its path just being altered?
http://ask.metafilter.com/29345/Is-New-York-really-losing-400-million-dollars-a-day

A million tall tales?

I don't know quite how to frame this question, but basically, I just finished reading "A Million Little Pieces" after having it recommended to me by just about everyone, and I absolutely hated it. Which is neither here nor there, but in addition, I didn't believe a word of it - can anyone help with that? (Possible spoilers inside...) This isn't meant to disparage anyone who liked the book, just that I thought a lot of it didn't add up, and I was wondering if anyone knew of any research done on it, or facts surrounding it. All I can find is glowing reviews championing his perseverance, and personal opinions as to the veracity of the account. I find it suspicious that so many of the facts presented in the book cannot be quickly or obviously verified (anonymity, personal experience, and so on). Is there a record of Frey's arrest and jail time? Anything that can corroborate his story? It's sold as a memoir, but many on the internet seem to agree with me in taking exc
http://ask.metafilter.com/30095/A-million-tall-tales

MySQL: Better a few tables with millions of rows, or thousands of tables with hundreds of rows?

I'm building a web application and I want to make it scale from the start. I'm not very experienced with database scalability, and I'm facing these doubts:

Should I use a handful of tables with millions of rows, or should I split things into hundreds, if not thousands, of tables, each with (I expect) only a few hundred rows?

I can go both routes, but I don't know which one will scale better in the long run.

I found some info in this thread, but it's not helping much: MySQL Whats better for speed one table with millions of rows or managing multiple tables?

Basically, I need to know: is it better to scale a database vertically or horizontally?
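As an aside on that trade-off: the usual advice is that a single well-indexed (and, if necessary, partitioned) table scales better than thousands of small tables, partly because of how the application has to query them. Below is a minimal sketch of that difference from the Java/JDBC side; the table and column names (events, events_<userId>, user_id, payload) are my own illustration, not from the question.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class EventDao {

    // Single-table design: millions of rows in one table, filtered by an
    // indexed user_id column. One cached prepared statement serves every user.
    static long countEventsForUser(Connection con, long userId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                 "SELECT COUNT(*) FROM events WHERE user_id = ?")) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }

    // Table-per-user design: the table name has to be spliced into the SQL,
    // which cannot be parameterized, defeats statement caching, and leaves
    // MySQL managing thousands of tables instead of one indexed table.
    static long countEventsForUserSharded(Connection con, long userId) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                 "SELECT COUNT(*) FROM events_" + userId);
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getLong(1);
        }
    }
}

Note also that "vertical vs. horizontal scaling" usually refers to a bigger machine vs. more machines; that is a separate decision from how many tables you use, and sharding across servers can be layered on top of the single-table schema later if needed.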

http://www.stackoverflow.com/questions/14298932/mysql-better-few-tables-with-millions-of-rows-or-thousands-of-tables-with-hundrets-of-rows

Optimizing several million char* to string conversions

I have an application that needs to take in several million char* values as input (typically strings of fewer than 512 Unicode characters) and convert and store them as .NET strings.

It's turning out to be a real bottleneck in my application's performance. I'm wondering if there's some design pattern or idea that would make it more efficient.

There is a key detail that makes me feel it can be improved: there are a LOT of duplicates. If, say, 1 million objects come in, there might only be around 50 unique char* patterns.

For the record, here is the algorithm I'm using to convert char* to String (this code is C++/CLI, but the rest of the project is in C#):

String ^StringTools::MbCharToStr ( const char *Source ) 
{
   String ^str;

   if( (Source == NULL) || (Source[0] == '\0') )
   {
      str = gcnew String("");
   }
   else
   {
      // Find the number of UTF-16 characters needed to hold the
      // converted UTF-8 string, and allocate a buffer of that size.
      // (The original snippet is cut off here; the remaining lines are a
      // typical completion using the Win32 MultiByteToWideChar API.)
      int length = MultiByteToWideChar( CP_UTF8, 0, Source, -1, NULL, 0 );
      wchar_t *buffer = new wchar_t[length];

      // Convert the UTF-8 bytes to UTF-16 and wrap them in a managed string.
      MultiByteToWideChar( CP_UTF8, 0, Source, -1, buffer, length );
      str = gcnew String( buffer );
      delete[] buffer;
   }
   return str;
}
http://www.stackoverflow.com/questions/14325505/optimizing-several-million-char-to-string-conversions

Number formatting in Java to use lakh format instead of million format

I have tried using NumberFormat and DecimalFormat. Even though I am using the locale en-IN, the numbers are being formatted with millions-style grouping. Is there any option to format a number in lakh format instead of million format?

Ex: I want NumberFormatInstance.format(123456) to give 1,23,456.00 instead of 123,456.00.
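A minimal sketch of one workaround, assuming the JDK's built-in en-IN locale data only offers Western (millions) grouping, as was common on older JDKs: format the number without grouping, then insert the Indian-style separators by hand (the class and method names LakhFormat/formatLakh are my own). On newer JDKs that ship CLDR locale data, NumberFormat.getInstance(new Locale("en", "IN")) may already produce lakh grouping, and ICU4J's formatters support it as well.

import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class LakhFormat {

    // Formats a non-negative value with Indian digit grouping, e.g.
    // 123456 -> "1,23,456.00": the last three integer digits form one
    // group and the remaining digits are grouped in pairs.
    static String formatLakh(double value) {
        // Two decimals, no grouping; Locale.ENGLISH pins '.' as the decimal
        // separator so the manual grouping below stays predictable.
        DecimalFormat base = new DecimalFormat("0.00",
                DecimalFormatSymbols.getInstance(Locale.ENGLISH));
        String plain = base.format(value);
        int dot = plain.indexOf('.');
        String intPart = plain.substring(0, dot);
        String fracPart = plain.substring(dot);           // ".00"

        if (intPart.length() <= 3) {
            return intPart + fracPart;                    // nothing to group
        }
        String lastThree = intPart.substring(intPart.length() - 3);
        String rest = intPart.substring(0, intPart.length() - 3);

        StringBuilder grouped = new StringBuilder();
        int i = rest.length();
        while (i > 2) {                                   // pairs, right to left
            grouped.insert(0, "," + rest.substring(i - 2, i));
            i -= 2;
        }
        grouped.insert(0, rest.substring(0, i));
        return grouped + "," + lastThree + fracPart;
    }

    public static void main(String[] args) {
        System.out.println(formatLakh(123456));     // 1,23,456.00
        System.out.println(formatLakh(12345678));   // 1,23,45,678.00
    }
}

The manual approach works on any JDK version because it never relies on locale grouping data; the trade-off is that it only handles the specific pattern shown here.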

http://www.stackoverflow.com/questions/14507658/number-formatting-in-java-to-use-lakh-format-instead-of-million-format

300 million items in a Map

  1. If each of them is guaranteed to have a unique key (generated and enforced by an external keying system), which Map implementation is the correct fit for me? Assume this has to be optimized for concurrent lookup only (the data is initialized once during application startup).
  2. Do these 300 million unique keys have any positive or negative implications for bucketing/collisions?
  3. Any other suggestions?

My map would look something like this:

Map<String, <boolean, boolean, boolean, boolean>>
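One practical concern at 300 million entries is the per-entry overhead of boxed values, so a common trick is to pack the four booleans into a single byte and pre-size the map so it never rehashes during the one-time load. A minimal sketch of that idea follows; the class name, key names, and the (scaled-down) entry count are my own illustration.

import java.util.HashMap;
import java.util.Map;

public class FlagMap {

    // Pack the four boolean flags of one entry into a single byte so the
    // value side of the map stays as small as possible.
    static byte pack(boolean a, boolean b, boolean c, boolean d) {
        int bits = 0;
        if (a) bits |= 1;
        if (b) bits |= 2;
        if (c) bits |= 4;
        if (d) bits |= 8;
        return (byte) bits;
    }

    // Read back flag number `index` (0..3) from a packed byte.
    static boolean flag(byte packed, int index) {
        return (packed & (1 << index)) != 0;
    }

    public static void main(String[] args) {
        // Pre-size the map so it never rehashes while loading
        // (capacity = expected entries / default load factor of 0.75).
        int expectedEntries = 1_000_000;
        Map<String, Byte> flags = new HashMap<>((int) (expectedEntries / 0.75f) + 1);

        flags.put("key-42", pack(true, false, true, false));

        // Built once at startup and then only read: as long as the map is
        // safely published and never mutated afterwards, concurrent lookups
        // on a plain HashMap are fine.
        byte packed = flags.get("key-42");
        System.out.println(flag(packed, 0)); // true
        System.out.println(flag(packed, 1)); // false
    }
}

Since boxed Byte values are interned by the JVM, the value side adds at most 256 distinct objects; ConcurrentHashMap only becomes necessary if the map is also mutated while being read.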
http://www.stackoverflow.com/questions/14538963/300-million-items-in-a-map

Efficient substring search in a large text file containing 100 million strings (no duplicate strings)

I have a large text file (1.5 GB) containing 100 million strings (no duplicate strings), arranged one per line. I want to make a web application in Java so that when a user gives a keyword (substring), they get the count of all the strings in the file that contain that keyword. I already know one technique, Lucene; is there any other way to do this? I want the result within 3-4 seconds. My system has 4 GB RAM and a dual-core configuration, and I need to do this in Java only.
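For scale, here is a minimal brute-force baseline, assuming the whole file is loaded into memory at startup (the file name strings.txt and the sample keyword are placeholders): scan all lines in parallel and count the ones containing the keyword. On 100 million lines and a dual-core machine this is unlikely to hit 3-4 seconds reliably, which is why some index over substrings (an n-gram index, a suffix-based structure, or Lucene itself) is usually the more realistic route.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class KeywordCount {
    public static void main(String[] args) throws IOException {
        // Hypothetical inputs: one string per line, as in the question.
        String keyword = "example";

        // Load all lines once at startup. 100 million String objects plus
        // their headers add substantial heap overhead on a 4 GB machine.
        List<String> lines = Files.readAllLines(Paths.get("strings.txt"));

        // Brute-force baseline: scan every line in parallel and count matches.
        long count = lines.parallelStream()
                          .filter(line -> line.contains(keyword))
                          .count();

        System.out.println(count + " strings contain \"" + keyword + "\"");
    }
}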

http://www.stackoverflow.com/questions/14633286/efficient-substring-search-in-a-large-text-file-containing-100-millions-strings-no-duplicate-string

Updating 4 million records in SQL Server using a list of record ids as input

During a migration project, I'm faced with an update of 4 million records in our SQL Server.

The update is very simple; a boolean field needs to be set to true/1, and the input I have is a list of all the ids for which this field must be set (one id per line).

I'm not exactly an expert when it comes to SQL tasks of this size, so I started out trying one UPDATE statement containing a "WHERE xxx IN ( {list of ids, separated by commas} )" clause. First, I tried this with a million records. On a small dataset on a test server, this worked like a charm, but in the production environment it gave an error. So I shortened the list of ids a couple of times, but to no avail.

The next thing I tried was to turn each id in the list into a separate UPDATE statement ("UPDATE yyy SET booleanfield = 1 WHERE id = '{id}'"). Somewhere, I read that it's good to have a GO every x number of lines, so I inserted a GO every 100 lines (using the excellent
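A minimal sketch of the set-based approach usually suggested for this, assuming the ids live in a file as described (the file name ids.txt, the connection string, and the table/column names yyy, booleanfield, id are placeholders matching the question's pseudonames): bulk-insert the ids into a temporary staging table with JDBC batching, then run a single UPDATE joined against it instead of millions of one-row statements.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public class BulkFlagUpdate {
    public static void main(String[] args) throws Exception {
        // One id per line, as described in the question (the path is a placeholder).
        List<String> ids = Files.readAllLines(Paths.get("ids.txt"));

        // Placeholder connection string and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=mydb", "user", "password")) {
            con.setAutoCommit(false);

            // 1) Create a temporary staging table and load the ids in batches.
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE #ids_to_update (id VARCHAR(50) PRIMARY KEY)");
            }
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO #ids_to_update (id) VALUES (?)")) {
                int batched = 0;
                for (String id : ids) {
                    ps.setString(1, id.trim());
                    ps.addBatch();
                    if (++batched % 10_000 == 0) {
                        ps.executeBatch();          // flush every 10k rows
                    }
                }
                ps.executeBatch();                  // flush the remainder
            }

            // 2) One set-based UPDATE joined against the staging table,
            //    instead of millions of single-row statements.
            try (Statement st = con.createStatement()) {
                int updated = st.executeUpdate(
                        "UPDATE y SET booleanfield = 1 " +
                        "FROM yyy AS y JOIN #ids_to_update AS i ON i.id = y.id");
                System.out.println(updated + " rows updated");
            }
            con.commit();
        }
    }
}

If a single 4-million-row UPDATE bloats the transaction log, the final statement can itself be run in chunks (for example with UPDATE TOP (n) in a loop) while keeping the staging-table join.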

http://www.stackoverflow.com/questions/14790548/updating-4-million-records-in-sql-server-using-list-of-record-ids-as-input

How to efficiently compute the cosine similarity between millions of strings

I need to compute the cosine similarity between strings in a list. For example, I have a list of over 10 million strings, and each string has to be compared for similarity against every other string in the list. What is the best algorithm I can use to do such a task efficiently and quickly? Is a divide-and-conquer algorithm applicable?

EDIT

I want to determine which strings are most similar to a given string, and be able to have a measure/score associated with the similarity. I think what I want to do falls in line with clustering, where the number of clusters is not initially known.
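For reference, here is a minimal sketch of the core computation, assuming each string is turned into a character-bigram frequency vector (word tokens or TF-IDF weights are common alternatives; the class and method names are my own). The all-pairs part is the real problem: 10 million strings give roughly 50 trillion pairs, so in practice candidate pairs are pruned first (an inverted index over shared tokens, locality-sensitive hashing such as MinHash/SimHash, or similar) and cosine is only computed on the survivors.

import java.util.HashMap;
import java.util.Map;

public class Cosine {

    // Represent a string as a frequency vector of its character bigrams.
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> v = new HashMap<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            v.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return v;
    }

    // cosine(a, b) = dot(a, b) / (|a| * |b|); for count vectors this lies in [0, 1].
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;
        }
        double normA = 0, normB = 0;
        for (int x : a.values()) normA += (double) x * x;
        for (int x : b.values()) normB += (double) x * x;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double score = cosine(bigrams("million strings"), bigrams("millions of strings"));
        System.out.println(score);   // close to 1.0 for near-duplicate strings
    }
}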

http://www.stackoverflow.com/questions/15041647/how-to-efficiently-compute-the-cosine-similarity-between-millions-of-strings